In this section, the feature filtering preprocessing steps are described first, and then the temporal-spatial-frequency feature extraction and classification stages are explained as parts of the proposed architecture.
Multiscale filter bank CSP
The classification of EEG signals recorded during hand movement imagination was done by the filter bank CSP technique^{8}. Let \(X^c=\left[x_1^c,x_2^c,\dots ,x_N^c\right]\) be the EEG data matrices of an experiment, where \(c=1,2,\dots ,C\) indexes the classes and \(x_i^c\in \mathbb{R}^{D\times T}\) is a \(D\times T\) matrix holding the \(i\)-th trial of class c; D is the number of channels and T is the number of time samples in each trial. The averaged normalized covariance matrix of class c is:
$$\overline{R}^c = \frac{1}{N}\sum\nolimits_{i = 1}^{N} \frac{x_i^c \left( x_i^c \right)^T}{tr\left( x_i^c \left( x_i^c \right)^T \right)}$$
(1)
The following optimization problem is solved to obtain \(w_{csp}\):
$$w_{csp} = \arg \mathop{\max}\limits_{w} \frac{w^T R_j w}{w^T R_i w}$$
(2)
There are several techniques for solving this maximization problem. A conventional way to find the optimal \(w_{csp}\) is to solve the following generalized eigenvalue problem:
$$R_i W = \left( R_i + R_j \right)WD$$
(3)
where D is the diagonal matrix of generalized eigenvalues of \(R_i\). Filters with higher eigenvalues yield greater variance for one class and vice versa, so the eigenvectors at both ends of the spectrum offer the most discriminative features of one class against the other. Common practice in the classification of motor imagery EEG signals is to choose several eigenvectors from both ends as spatial filters, where the variable M denotes the number of filters chosen from each end. Consequently, the spatially filtered signal Y in the CSP subspace is derived from a single EEG trial \(x_i^c\) in sensor space as:
$$Y = \omega \times x_i^c$$
(4)
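Eqs. (1)–(4) can be illustrated with a minimal NumPy sketch. It assumes two classes and synthetic trials; the helper names `normalized_cov` and `csp_filters`, the toy sizes and the random data are illustrative and not taken from the original work. The generalized eigenvalue problem of Eq. (3) is solved here by whitening \(R_i + R_j\), which is one of several equivalent routes.

```python
import numpy as np

def normalized_cov(trials):
    """Eq. (1): mean of trace-normalized covariances; trials has shape (N, D, T)."""
    covs = [x @ x.T / np.trace(x @ x.T) for x in trials]
    return np.mean(covs, axis=0)

def csp_filters(R_i, R_j, M):
    """Solve R_i W = (R_i + R_j) W D (Eq. 3) and keep M filters from each end."""
    # Whiten the composite covariance so the problem becomes a symmetric one.
    evals, evecs = np.linalg.eigh(R_i + R_j)
    P = evecs / np.sqrt(evals)               # P.T @ (R_i + R_j) @ P = identity
    d, B = np.linalg.eigh(P.T @ R_i @ P)     # eigenvalues in ascending order
    W = P @ B                                # columns are spatial filters
    keep = np.r_[np.arange(M), np.arange(len(d) - M, len(d))]
    return W[:, keep].T, d[keep]             # filters as rows: (2M, D)

# Synthetic data (assumption): class i is strong on channel 0, class j on channel 3.
rng = np.random.default_rng(0)
N, D, T = 20, 4, 200
trials_i = rng.standard_normal((N, D, T)) * np.array([3.0, 1.0, 1.0, 1.0])[:, None]
trials_j = rng.standard_normal((N, D, T)) * np.array([1.0, 1.0, 1.0, 3.0])[:, None]

R_i, R_j = normalized_cov(trials_i), normalized_cov(trials_j)
W, d = csp_filters(R_i, R_j, M=1)
Y = W @ trials_i[0]                          # Eq. (4): one trial in the CSP subspace
```

The two retained eigenvalues lie at opposite ends of the interval [0, 1], which is why filters from both ends are kept: one end emphasizes class i, the other class j.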
where \(\omega\) is the selected filter from W. Filter bank CSP (FBCSP) is an extension of CSP that was first presented by the winner of BCI Competition IV-2a^{9}. The feature extraction process in our method relies primarily on FBCSP, as explained in the following part. Because the response and preparation times of neural activity differ between individuals, it is not necessary to use all the sampled data in a trial to obtain the relevant signal. FBCSP is therefore extended to multiscale FBCSP (MSFBCSP) to exploit this property of neural activity^{30}. Figure 2 shows the multiscale FBCSP block together with the other building blocks of the proposed method, which are described in detail in the next section.
The procedure for MSFBCSP is as follows. The EEG data matrix has shape \(N_{Trial}\times T\times N_{CH}\), where \(N_{Trial}\) is the number of trials recorded from the subjects, and \(T\) and \(N_{CH}\) denote the number of time samples and recorded channels, respectively. Furthermore, \(N_F\) represents the number of band-pass filters, usually referred to as the frequency filter bank. Typically, nine frequency filters with a bandwidth of 4 Hz (4–8 Hz, 8–12 Hz, …, 36–40 Hz) are employed. The matrix is then reshaped into \(N_{Trial}\times N_F\times T\times N_{CH}\). In each frequency band, multiscale signals are composed by dividing these signals into chunks of shorter time steps. Consequently, the shape of the constructed matrix is \(N_{Trial}\times N_F\times N_T\times T\times N_{CH}\), where \(N_T\) is the number of time steps.

1.
The spatial filter weights \(W_{i,j}\) for time step i and frequency band j are computed by the CSP algorithm. The eigenvectors associated with the M most extreme eigenvalues at each end are then selected from each \(W_{i,j}\), for every frequency band and time step. This gives spatial filters of shape \(2\times M\times N_{CH}\).

2.
The calculated spatial filters are applied to the data matrix in each time step and frequency band, producing multiscale filter-banked signals with the new shape \(N_{Trial}\times N_F\times N_T\times T\times M\times 2\).
The variance across time is used to extract the energy of the signals; the data matrix therefore takes the maximum shape of \(N_{Trial}\times N_F\times N_T\times N_{CH}\times 2\). It is used as the classifier input.
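The data flow of the MSFBCSP procedure can be sketched as follows. This is a minimal illustration under several assumptions that are not part of the original work: two classes of synthetic trials, toy sizes, an FFT-mask band-pass filter standing in for whatever filter design was actually used, and time windows borrowed from the intervals listed later in this section. Because the windows differ in length, they are cut on the fly rather than stored in one rectangular array.

```python
import numpy as np

def bandpass_fft(x, fs, lo, hi):
    """Zero all FFT bins outside [lo, hi] Hz along the time axis (axis=1)."""
    freqs = np.fft.rfftfreq(x.shape[1], d=1.0 / fs)
    spec = np.fft.rfft(x, axis=1)
    spec[:, (freqs < lo) | (freqs > hi), :] = 0.0
    return np.fft.irfft(spec, n=x.shape[1], axis=1)

def csp(a, b, M):
    """Two-class CSP on trials of shape (N, T_win, N_CH); returns (2M, N_CH)."""
    def cov(trials):
        return np.mean([x.T @ x / np.trace(x.T @ x) for x in trials], axis=0)
    R_a, R_b = cov(a), cov(b)
    evals, evecs = np.linalg.eigh(R_a + R_b)
    P = evecs / np.sqrt(evals)                 # whitening of R_a + R_b
    _, B = np.linalg.eigh(P.T @ R_a @ P)
    W = P @ B
    return np.hstack([W[:, :M], W[:, -M:]]).T  # M filters from each end

rng = np.random.default_rng(0)
fs, n_trial, T, n_ch, M = 250, 12, 1500, 4, 1      # 6 s trials (toy sizes)
X = rng.standard_normal((2, n_trial, T, n_ch))     # two classes, synthetic
bands = [(lo, lo + 4) for lo in range(4, 40, 4)]   # 4-8, 8-12, ..., 36-40 Hz
windows = [(2.5, 3.5), (3, 4), (4, 5), (5, 6), (2.5, 4.5), (4, 6), (2.5, 6)]

# Feature tensor per class: N_Trial x N_F x N_T x 2M
features = np.empty((2, n_trial, len(bands), len(windows), 2 * M))
for f, (lo, hi) in enumerate(bands):
    Xf = [bandpass_fft(X[c], fs, lo, hi) for c in range(2)]
    for t, (start, stop) in enumerate(windows):
        seg = [xf[:, int(start * fs):int(stop * fs), :] for xf in Xf]
        W = csp(seg[0], seg[1], M)                   # filters for this band/window
        for c in range(2):
            Y = seg[c] @ W.T                         # (n_trial, T_win, 2M)
            features[c, :, f, t, :] = Y.var(axis=1)  # variance across time
```

With M filters taken from each end, the last axis has length 2M, which corresponds to the \(M\times 2\) factor in the shapes given above.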
Extracting temporal, spatial and frequency band information
Once the MSFBCSP transform is performed, the previously calculated spatial filter weights have transferred the EEG signals from sensor space to the CSP subspace. A feature selection procedure is typically used to scale down the search space after projecting the signal to the CSP subspace. Although this reduces the dimensionality of the feature space, information may be lost with the discarded features. CNN architectures, by contrast, are well known for their ability to greatly reduce the number of parameters in a model, so with this kind of architecture it is not necessary to train a large number of parameters. We therefore avoid the information loss caused by feature selection algorithms and expect our network to weigh the features during training.
Applying the Hilbert transform along the time dimension of these signals yields the envelope of each spatially filtered signal^{9}. The Hilbert transform produces the analytic form of the signal, which is complex-valued; this simplifies envelope extraction, because the envelope is the magnitude of that complex-valued signal.
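As a small illustration of envelope extraction, the sketch below builds the analytic signal in the frequency domain with NumPy and takes its magnitude. The test signal (a 10 Hz carrier with a slow 1 Hz amplitude modulation) and the function name `analytic_signal` are assumptions chosen for the example; `scipy.signal.hilbert` returns the same kind of analytic signal and could be used instead.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal along the last axis, built in the frequency domain."""
    n = x.shape[-1]
    spec = np.fft.fft(x, axis=-1)
    h = np.zeros(n)
    h[0] = 1.0                       # keep the DC bin
    if n % 2 == 0:
        h[n // 2] = 1.0              # keep the Nyquist bin
        h[1:n // 2] = 2.0            # double the positive frequencies
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spec * h, axis=-1)   # negative frequencies are zeroed

fs = 250                                              # Hz
t = np.arange(500) / fs                               # 2 s of samples
modulator = 1.0 + 0.5 * np.sin(2 * np.pi * 1.0 * t)   # slow amplitude (1 Hz)
signal = modulator * np.cos(2 * np.pi * 10.0 * t)     # 10 Hz carrier
envelope = np.abs(analytic_signal(signal))            # magnitude = envelope
```

For this periodic test signal the recovered envelope coincides with the slow modulator, which shows why the envelope is low in frequency compared with the filtered signal itself.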
Because the spectral content of the envelopes is low in frequency, the signal can be downsampled without any significant loss of information. This operation has two main advantages:

1.
Downsampling unifies the length of each signal, since the sampling duration varies between time steps.

2.
The input feature dimension is reduced, and therefore the filter sizes and the number of trainable parameters of the proposed model decrease.
The same value of N = 2 and time-step intervals of 2.5–3.5 s, 3–4 s, 4–5 s, 5–6 s, 2.5–4.5 s, 4–6 s and 2.5–6 s after the onset of the imagery task are adopted for this part^{9}. The original sampling rate of the data is 250 Hz, so the number of samples in a time step depends on its interval length, and the intervals have different lengths. In addition, the cutoff frequency of the envelope is 4 Hz, which means that the lowest sampling frequency needed to represent the signals is 8 Hz (the Nyquist rate). Different sampling frequencies are therefore chosen for each time-step interval to obtain a unified input matrix. For example, for a 4 s interval, a 10 Hz sampling rate is chosen to get 40 time points. Other sampling rates are chosen for the other intervals to obtain the same number of time points.
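The arithmetic behind the per-interval sampling rates can be checked with a short sketch. It assumes that every interval is brought to the 40 time points of the 4 s example, and it uses plain linear interpolation as a stand-in for the resampling method, which the text does not specify; the names `target_rate` and `resample_envelope` are illustrative.

```python
import numpy as np

FS_ORIGINAL = 250.0      # Hz, original sampling rate (from the text)
ENVELOPE_CUTOFF = 4.0    # Hz, envelope cutoff (from the text)
TARGET_POINTS = 40       # assumed common length, from the 4 s / 10 Hz example

windows = [(2.5, 3.5), (3, 4), (4, 5), (5, 6), (2.5, 4.5), (4, 6), (2.5, 6)]

def target_rate(start, stop, n_points=TARGET_POINTS):
    """Sampling rate (Hz) that yields n_points samples over the interval."""
    return n_points / (stop - start)

def resample_envelope(env, n_points=TARGET_POINTS):
    """Linearly interpolate an envelope onto n_points evenly spaced samples."""
    old = np.linspace(0.0, 1.0, num=len(env), endpoint=False)
    new = np.linspace(0.0, 1.0, num=n_points, endpoint=False)
    return np.interp(new, old, env)

rates = [target_rate(a, b) for a, b in windows]
# Placeholder envelopes with the original number of samples per interval.
resampled = [
    resample_envelope(np.ones(int(round((b - a) * FS_ORIGINAL))))
    for a, b in windows
]
```

Under this assumption the 1 s intervals are sampled at 40 Hz, the 2 s intervals at 20 Hz and the 3.5 s interval at about 11.4 Hz, all of which stay above the 8 Hz Nyquist rate of the envelope.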
Compact convolution neural network
A CNN is a kind of artificial neural network with a multilayer perceptron structure. It is inspired by the working principle of the visual cortex, and its distinguishing component is the convolutional layer. Weight sharing and sparse connectivity are the advantages of convolutional layers, and these two properties can substantially decrease computational complexity. Unlike images and videos, for which large amounts of data are available to train a CNN, the amount of EEG data is very small. When classifying EEG signals, a CNN with several convolutional layers can therefore easily overfit the training data, so it is important to construct a suitable CNN model. CompactCNN is a special CNN with depthwise and separable convolutions, and it has fewer parameters^{16}. Figure 3 illustrates an example structure of CompactCNN on the BCI Competition IV-2a data set.
As shown in Figure 3, the first part is inspired by Block 1 of EEGNet^{16}. The sequence of two convolutional layers mirrors the steps of MSFBCSP. The first convolutional layer performs a convolution exclusively along the time axis. As an example, the architecture of CompactCNN on the BCI Competition IV-2a data set is described in Table 1. It is a regular convolutional layer with F_{1} = 8 kernels of size (1, 64), with padding that yields the same size as the input. This allows each kernel to act as a temporal filter that extracts the relevant frequencies of the EEG signals. The second convolutional layer then performs a convolution along the electrode (channel, space) axis only. A depthwise convolution is used with kernels of size (C = 22, 1) and no padding, which lets the CNN learn several spatial filters for each feature map of the temporal convolution. In this block, C = 22 is the number of channels, T = 288 is the number of trials, and the depth is D = 2, which means that the number of feature maps expands from F_{1} = 8 to D * F_{1} = 16. The layer thus performs a linear combination of the channels over time, where each channel has its own weight^{16}. In the present study, three values are chosen via coordinate descent and Bayesian optimization: the number of temporal filters, the depth multiplier (number of spatial filters) and the number of pointwise filters. The number of output units of the model is K = 4, equal to the number of classes.
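The sizes quoted for the first block can be verified with simple arithmetic. The sketch below counts kernel weights only (biases are ignored, which is an assumption) and tracks the feature-map shapes; `n_time = 256` is a hypothetical number of time points used only for illustration, and the helper names are not from the original work.

```python
def conv2d_weights(in_maps, out_maps, kernel, groups=1):
    """Number of kernel weights (no bias) of a 2-D convolution."""
    kh, kw = kernel
    return (in_maps // groups) * out_maps * kh * kw

def block1_shapes(n_channels, n_time, F1, D):
    """Feature-map shapes (maps, height, width) after each layer of the block."""
    after_temporal = (F1, n_channels, n_time)   # 'same' padding keeps the size
    after_depthwise = (D * F1, 1, n_time)       # (C, 1) kernel, no padding
    return after_temporal, after_depthwise

F1, D, C = 8, 2, 22
temporal_weights = conv2d_weights(1, F1, (1, 64))                   # 8 * 64
depthwise_weights = conv2d_weights(F1, D * F1, (C, 1), groups=F1)   # 16 * 22
shapes = block1_shapes(C, n_time=256, F1=F1, D=D)
```

The depthwise layer needs only 352 weights because each of its 16 output maps looks at a single temporal feature map, whereas a regular convolution with the same kernel would connect every output map to all 8 input maps.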
Parameter selection and training configuration
Tuning the many hyperparameters of neural networks and deep learning methods requires careful consideration. Unfortunately, this tuning is often a cumbersome task that requires expert experience, rules of thumb, or sometimes brute-force search. Consequently, interest has arisen in automatic approaches that can improve the performance of any given learning algorithm. Multilayer convolutional neural networks are an example of a model for which a comprehensive exploration of hyperparameters and architectures is useful, as shown by Bashashati et al.^{42}. For the presented network architecture, several network hyperparameters, such as the type of pooling layer, the number of filter units, the convolution stride and the kernel size, have to be chosen. It is practically neither possible nor reasonable to search over all parameters to attain the optimal value of each one. To resolve this issue, Bayesian parameter selection and coordinate descent are used as suboptimal methods to search through the parameter space.
Parameter selection based on coordinatedescent
Hyperparameters can be appropriately selected via a cross-validation approach^{9}. It is not possible to search over the whole parameter space owing to time and computation limits. As an alternative, coordinate descent is used as a suboptimal technique to perform cross-validation for the network parameters^{43}. In this method, a set of parameters, \(\theta = \left[ \theta_1 , \theta_2 , \dots ,\theta_N \right]\), is initialized, and then the objective function or score function is optimized for each \(\theta_i \left( i = 1, \dots , N \right)\) independently, updating the values of the initial \(\theta\) with the newly optimized parameters. After N optimizations, the \(\theta\) vector is completely updated and a new iteration of optimization can be started. To improve the results, the algorithm can be repeated for several iterations^{9}. In this work, three values are chosen via coordinate descent: the number of temporal filters, the depth multiplier (number of spatial filters) and the number of pointwise filters. Tenfold cross-validation is performed only once to select the parameters. The convolutional layer parameters are initially chosen via cross-validation, and the selected values are then used in a further cross-validation to select the number of convolutional nodes.
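The coordinate-descent loop described above can be sketched in a few lines. In this illustration `toy_score` is a made-up stand-in for the tenfold cross-validation score, and the candidate grids and parameter names are hypothetical; only the update scheme (optimize one \(\theta_i\) at a time while the others stay fixed, then repeat) follows the text.

```python
def coordinate_descent(score, grid, init, n_iters=2):
    """Maximize `score` one parameter at a time over a discrete grid."""
    theta = dict(init)
    for _ in range(n_iters):                      # outer iterations
        for name, candidates in grid.items():     # one coordinate at a time
            # Optimize theta[name] while all other entries stay fixed.
            theta[name] = max(candidates,
                              key=lambda v: score({**theta, name: v}))
    return theta

def toy_score(theta):
    """Stand-in for a cross-validation score (assumption, higher is better)."""
    return -((theta["temporal_filters"] - 8) ** 2
             + (theta["depth_multiplier"] - 2) ** 2
             + (theta["pointwise_filters"] - 16) ** 2)

grid = {
    "temporal_filters": [4, 8, 16],
    "depth_multiplier": [1, 2, 4],
    "pointwise_filters": [8, 16, 32],
}
init = {"temporal_filters": 4, "depth_multiplier": 1, "pointwise_filters": 8}
best = coordinate_descent(toy_score, grid, init)
```

Each pass costs the sum of the grid sizes in score evaluations (9 here) instead of their product (27 for a full grid search), which is the reason the method is attractive when every evaluation is a full cross-validation run. The price is that the result is only guaranteed to be a coordinate-wise optimum.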
Parameter selection based on Bayesian optimization
In the Bayesian method, unlike random and grid search, the results of previous evaluations are used to move toward optimal values in the parameter space. It uses a probabilistic model that maps hyperparameters to a probability of a score on the objective function^{42}. It is a capable algorithm for optimizing functions that are costly to evaluate computationally and that do not have a known analytical form. This algorithm is widely applied to hyperparameter optimization in machine learning. Its ability to incorporate prior information about the optimization task helps it retain its effectiveness even for a large number of functions. When hyperparameters are optimized, the candidate points in the neighborhood of a given point x have closely similar function values (that is, the optimized function is smooth). A kernel function is used to incorporate this domain knowledge about the system. To obtain a new candidate point, Bayesian optimization uses all the information gained from previous function evaluations: the probability distribution fitted to the available data encodes this global knowledge and is used to propose a new candidate^{42}.
Unlike the original optimization task, finding the maximum of the acquisition function is not difficult, although this function is still not convex. A derivative-free optimization algorithm or a gradient descent algorithm can be used to optimize it. The new candidate is a local maximum of the acquisition function. The whole process is then repeated for T iterations. Here, Bayesian optimization with tenfold cross-validation as the objective function has been performed to select the best hyperparameters on the validation set. In addition, for regularization purposes and to avoid overfitting, batch normalization and dropout with a rate of 0.5 are applied after each convolutional layer.
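The loop of fitting a probabilistic model, maximizing an acquisition function and evaluating the new candidate can be illustrated on a one-dimensional toy problem. Everything specific in this sketch is an assumption made for the example: a Gaussian-process surrogate with a squared-exponential kernel, an upper-confidence-bound acquisition function maximized on a grid, and a cheap quadratic `objective` standing in for the cross-validation score. The text does not state which surrogate or acquisition function was actually used.

```python
import numpy as np

def rbf(a, b, length=0.2):
    """Squared-exponential kernel: nearby points get similar function values."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def gp_posterior(x_obs, y_obs, x_new, jitter=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = rbf(x_obs, x_obs) + jitter * np.eye(len(x_obs))
    K_s = rbf(x_obs, x_new)
    mean = K_s.T @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))

def objective(x):
    """Toy score to maximize (assumption); its maximum is at x = 0.3."""
    return -(x - 0.3) ** 2

grid = np.linspace(0.0, 1.0, 101)          # candidate hyperparameter values
x_obs = np.array([0.0, 1.0])               # two initial evaluations
y_obs = objective(x_obs)
for _ in range(10):                        # T = 10 iterations
    mean, std = gp_posterior(x_obs, y_obs, grid)
    acquisition = mean + 2.0 * std         # upper confidence bound
    acquisition[np.isin(grid, x_obs)] = -np.inf   # do not re-evaluate points
    x_next = grid[np.argmax(acquisition)]
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, objective(x_next))
best_x = x_obs[np.argmax(y_obs)]
```

The acquisition function is cheap to evaluate everywhere, so a simple grid search over it is enough here; the expensive objective is called only once per iteration.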
LSTM-based CNNs
As a kind of recurrent neural network, the LSTM layer can learn long-term dependencies within the input data^{24}. LSTM is a recurrent neural network with the capability to preserve the structure of data for a long time and to recognize the desired pattern. When an LSTM layer is added to a CNN design, the temporal features of the brain signals are extracted efficiently. At the center of the LSTM is the cell state, which can be modified by adding or removing information. The removal or addition of information is regulated by structures called gates. LSTM networks rely on stacked blocks containing three gates: the input gate, the output gate and the forget gate. The gates and the cell state are described by the following equations:
$$i = \sigma \left( W_i x_t + U_i h_{t-1} + b_i \right)$$
(5)
$$f = \sigma \left( W_f x_t + U_f h_{t-1} + b_f \right)$$
(6)
$$o = \sigma \left( W_o x_t + U_o h_{t-1} + b_o \right)$$
(7)
$$\tilde{c} = W_c x_t + U_c h_{t-1} + b_c$$
(8)
$$c_t = f \odot c_{t-1} + i \odot \tilde{c}$$
(9)
$$h_t = o \odot c_t$$
(10)
$$\sigma \left( x \right) = \frac{1}{1 + e^{-x}}$$
(11)
where \(W\), \(U\) and \(b\) represent the sets of learnable parameters that control each gate. \(x, h, i, f, o\) and \(c\) represent the input, output, input gate, forget gate, output gate and memory cell state, respectively. \(\odot\) denotes the element-wise product, and \(t\) indexes the time steps of the series^{24}.
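A single time step of Eqs. (5)–(11) can be written directly in NumPy. The sketch follows the equations as given in this section, so the candidate state and the output use no additional nonlinearity (many LSTM formulations apply a tanh at those two places). The recurrent weight of the candidate state is named `U_c` here by analogy with `W_c` and `b_c`, which is an assumption; the parameter dictionary, sizes and random inputs are illustrative.

```python
import numpy as np

def sigmoid(x):
    """Eq. (11)."""
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following Eqs. (5)-(10) as written in the text."""
    i = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])   # input gate
    f = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])   # forget gate
    o = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])   # output gate
    c_tilde = p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"]      # candidate state
    c_t = f * c_prev + i * c_tilde                               # Eq. (9)
    h_t = o * c_t                                                # Eq. (10)
    return h_t, c_t

def init_params(rng, n_in, n_hid):
    """Small random weights and zero biases for the four parameter sets."""
    p = {}
    for g in "ifoc":
        p[f"W_{g}"] = rng.standard_normal((n_hid, n_in)) * 0.1
        p[f"U_{g}"] = rng.standard_normal((n_hid, n_hid)) * 0.1
        p[f"b_{g}"] = np.zeros(n_hid)
    return p

rng = np.random.default_rng(0)
params = init_params(rng, n_in=3, n_hid=4)
h, c = np.zeros(4), np.zeros(4)
for x_t in rng.standard_normal((5, 3)):     # a sequence of 5 feature vectors
    h, c = lstm_step(x_t, h, c, params)     # h after the loop is the last output
```

The output `h` of the last time step is the quantity that a following layer would consume when the LSTM is used as a sequence summarizer.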
Fuzzy neural block
This section gives a formal description of the proposed FNB. A fuzzy neural block (FNB) is a sequence of processing layers that produces the activation of the antecedents of a fuzzy rule. First, the regularized outputs O of the previous layer are taken and flattened as \(v_i=Vec\left(O_i\right)=[o_{1,1},\dots ,o_{m,1},o_{1,2},\dots ,o_{m,2},\dots ,o_{1,n},\dots ,o_{m,n}]\). Next, the fuzzy clustering method by Kilic et al. is employed to obtain a set of K centroids of the form \(c^k=[c_1^k,c_2^k,\dots ,c_d^k]\) from a collection D of layer outputs \(v_i\) obtained in the preceding epoch^{29}. For the first epoch, the centroids are set to zero. The Gaussian membership value of \(v_i\) is computed as:
$$\mu_i^k \left( v_i ,c^k ,\alpha \right) = \exp\left( - \frac{1}{4}\left( v_i - c^k \right)^2 /\alpha^2 \right)$$
(12)
The scaling vector \(\alpha\) is a parameter that is learned by the network, and the rule activation consists of a t-norm operator and a normalization step:
$$o^k = \mathop{\prod}\limits_{j = 1}^{d} \mu_{i,j}^k \;\text{and}\;\tilde{o}^k = o^k /\mathop{\sum}\limits_{f = 1}^{K} o^f$$
(13)
where d is the dimension of the \(\mu _i^k\) vector, and \(O^{\prime}=[\tilde{o}^1,\dots ,\tilde{o}^K]\) is the output of the FNB that is forwarded to the next layer. Note that the output dimension of the FNB is reduced to \(K\)^{25}.
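The forward pass of Eqs. (12) and (13) can be sketched in NumPy as follows. The centroids are given by hand here instead of coming from fuzzy clustering, and the layer output `O`, the number of rules and the scaling vector are toy values; these, and the function name `fuzzy_neural_block`, are assumptions made for illustration.

```python
import numpy as np

def fuzzy_neural_block(v, centroids, alpha):
    """v: (d,), centroids: (K, d), alpha: (d,) -> normalized activations (K,)."""
    mu = np.exp(-0.25 * (v - centroids) ** 2 / alpha ** 2)   # Eq. (12), (K, d)
    o = mu.prod(axis=1)                # product t-norm over the d memberships
    return o / o.sum()                 # normalization step of Eq. (13)

O = np.arange(6.0).reshape(2, 3)       # toy layer output with m = 2, n = 3
v = O.flatten(order="F")               # [o_11, o_21, o_12, o_22, o_13, o_23]
centroids = np.stack([v, v + 1.0, v + 5.0])   # K = 3 hand-picked centroids
alpha = np.ones_like(v)                # scaling vector (learned in the model)

activations = fuzzy_neural_block(v, centroids, alpha)
```

The rule whose centroid coincides with `v` receives almost all of the normalized activation, and the output has K entries regardless of the dimension d of the flattened input.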
Proposed fuzzy convolution recurrent neural network (EEGCLFCNet model)
The valuable information in EEG signals can be fully exploited after extracting the temporal-spatial-frequency features. The key aim of our study is to improve classification accuracy through complete feature extraction, so how these three kinds of features are combined is crucial. To attain this goal, a series convolutional recurrent neural network framework is designed and compared. The structure is shown in Fig. 4.
The spatial and frequency features of the filtered EEG signals are first extracted by CompactCNN, and the sequences of extracted features are then used as input to the LSTM to extract temporal features. The output of the last time step of the LSTM layers is passed to a fully connected layer and the FNB. Finally, a softmax classifier produces the final prediction. CompactCNN is used as the CNN module that, together with the LSTM, defines the series convolutional recurrent neural network. The proposed method is named EEGCLFCNet.