In this section, the feature-filtering preprocessing steps are described first, and then the temporal-spatial-frequency feature extraction and classification stages are explained in the form of the proposed architecture.
Multi-scale filter bank CSP
EEG signals recorded during hand-movement imagination are classified using the filter bank CSP technique8. Let \(X^c=\left[x_1^c,x_2^c,\dots ,x_n^c\right]\) be the EEG data matrices of an experiment, where \(c = 1, 2, \dots, C\) indexes the classes and \(x_i^c \in \mathbb{R}^{D \times T}\) is a \(D \times T\) matrix containing the data of the \(i\)-th trial of class \(c\); here \(D\) denotes the number of channels and \(T\) the number of time samples in each trial. The averaged normalized covariance matrix belonging to class \(c\) can be written as:
$$\overline{R}^c = \frac{1}{N}\sum_{i=1}^{N} \frac{x_i^c \left(x_i^c\right)^T}{\mathrm{tr}\left( x_i^c \left(x_i^c\right)^T \right)}$$
(1)
The following optimization problem is solved to obtain \(w_{csp}\):
$$w_{csp} = \arg \max_{w} \frac{w^T R_j w}{w^T R_i w}$$
(2)
There are numerous techniques for solving this maximization problem. A conventional way to find the optimal \(w_{csp}\) is to solve the following generalized eigenvalue decomposition:
$$R_i W = \left( R_i + R_j \right)WD$$
(3)
where \(D\) is the diagonal matrix of generalized eigenvalues. Filters associated with the largest eigenvalues yield the greatest variance for one class relative to the other, and vice versa, so the eigenvectors at both ends of the spectrum provide the most discriminative features of one class against another. Common practice in motor imagery EEG classification is to choose several eigenvectors from both ends as spatial filters; the variable \(M\) denotes the number of filters chosen from each end. Consequently, the spatially filtered signal \(Y\) in the CSP subspace, obtained from a single EEG trial \(x_i^c\) in sensor space, can be derived as:
$$Y = \omega \times x_i^c$$
(4)
where \(\omega\) is the selected filter from \(W\). Filter bank CSP (FBCSP) is an extension of CSP that was originally presented by the winner of BCI competition IV-2a9. In our method, the feature extraction process relies primarily on FBCSP, as clarified in the following segment. Since the neural activities of different individuals are not identical in response and preparation time, it is not essential to use all sampled data in a trial to obtain the relevant signal. Therefore, FBCSP is extended to multi-scale FBCSP (MSFBCSP) to exploit this phenomenon30. Figure 2 demonstrates the multi-scale FBCSP block and the other building blocks of our proposed method, described in detail in the next section.

Figure 2. Building blocks of the proposed work, the Deep Frequency Band Convolutional Neural Network.
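For illustration, the CSP computation of Eqs. (1)-(3) can be sketched as follows. This is a minimal sketch assuming NumPy/SciPy; the function and variable names, and the default number of filters per end, are our own choices rather than details from the paper's implementation.

```python
# Minimal sketch of the CSP filter computation in Eqs. (1)-(3), assuming
# NumPy/SciPy; names and the choice of m are illustrative only.
import numpy as np
from scipy.linalg import eigh

def csp_filters(trials_i, trials_j, m=2):
    """trials_i, trials_j: lists of (channels, samples) arrays for two classes.
    Returns 2*m spatial filters (columns), m from each end of the spectrum."""
    def avg_norm_cov(trials):
        # Eq. (1): average of trace-normalized covariance matrices
        covs = [x @ x.T / np.trace(x @ x.T) for x in trials]
        return np.mean(covs, axis=0)

    R_i, R_j = avg_norm_cov(trials_i), avg_norm_cov(trials_j)
    # Eq. (3): generalized eigenvalue problem R_i W = (R_i + R_j) W D
    eigvals, W = eigh(R_i, R_i + R_j)      # eigenvalues returned in ascending order
    order = np.argsort(eigvals)
    picks = np.r_[order[:m], order[-m:]]   # most discriminative filters at both ends
    return W[:, picks]
```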
The procedure for MSFBCSP is as follows. The data matrix of the EEG signals has shape \(N_{Trial}\times T\times N_{CH}\), where \(N_{Trial}\) is the number of trials recorded from the subjects, and \(T\) and \(N_{CH}\) denote the number of time samples and recorded channels, respectively. Furthermore, \(N_F\) represents the number of bandpass filters, usually referred to as the frequency filter bank. Typically, nine frequency filters with a bandwidth of 4 Hz (4–8 Hz, 8–12 Hz, …, 36–40 Hz) are employed. The matrix is then reshaped to \(N_{Trial}\times N_F\times T\times N_{CH}\). Within each frequency band, multi-scaled signals are composed by dividing the signals into chunks with shorter time steps. Consequently, the shape of the constructed matrix is \(N_{Trial}\times N_F\times N_T\times T\times N_{CH}\), where \(N_T\) denotes the number of time steps.
1. The spatial filter weights \(W_{i,j}\) for time step \(i\) and frequency band \(j\) are computed by the CSP algorithm. Then the \(M\) most extreme eigenvalues and their corresponding eigenvectors are selected from each \(W_{i,j}\), for every frequency band and time step. Therefore, there are \(2\times M\times N_{CH}\) spatial filter weights.
2. The calculated spatial filters are applied at each time step and frequency band of the data matrix, and the multi-scale filter-banked signals are formed, giving the matrix the new shape \(N_{Trial}\times N_F\times N_T\times T\times M\times 2\).
The variance across time is used as a conventional measure of signal energy; therefore, the data matrix takes its final shape of \(N_{Trial}\times N_F\times N_T\times N_{CH}\times 2\) and is employed as the classifier input.
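A minimal NumPy sketch of how the computed filters could be applied and the variance features formed is shown below; the array shapes and names are illustrative assumptions, and the resulting feature dimension here equals the number of selected CSP filters per band and time step.

```python
# Hedged sketch of the MSFBCSP projection and variance step; shapes/names are
# illustrative and may differ from the authors' implementation.
import numpy as np

def msfbcsp_features(X, W):
    """X : (n_trials, n_bands, n_steps, n_samples, n_channels) band-passed, windowed EEG
    W : (n_bands, n_steps, n_channels, n_filters) CSP filters per band and time step
    Returns (n_trials, n_bands, n_steps, n_filters) variance features."""
    n_trials, n_bands, n_steps = X.shape[:3]
    feats = np.empty((n_trials, n_bands, n_steps, W.shape[-1]))
    for b in range(n_bands):
        for s in range(n_steps):
            Y = X[:, b, s] @ W[b, s]         # project to the CSP subspace, Eq. (4)
            feats[:, b, s] = Y.var(axis=1)   # variance across time as the energy measure
    return feats
```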
Extracting temporal, spatial and frequency band information
Once the MSFBCSP transformation is performed, the previously calculated spatial filter weights transfer the EEG signals from sensor space to the CSP subspace. A feature selection procedure is typically used to scale down the search space after projecting the signal onto the CSP subspace; although this reduces the dimensionality of the feature space, information may be lost in the discarded features. CNN architectures, in contrast, are well recognized for their ability to massively decrease the number of parameters in a model, so with this kind of architecture it is not necessary to train a large number of parameters. We therefore avoid the information loss produced by feature selection algorithms and expect the network to weight the features itself during training.
Applying the Hilbert transform along the time dimension of the aforementioned signal yields the envelope of each spatially filtered signal9. The Hilbert transform produces the complex-valued analytic representation of the signal, which simplifies envelope extraction: the envelope is simply the amplitude of that complex-valued signal.
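The following short sketch illustrates this envelope extraction with SciPy; the name `y_csp` is a hypothetical placeholder for a spatially filtered signal.

```python
# Illustrative envelope extraction via the Hilbert transform, assuming SciPy;
# 'y_csp' is a hypothetical name for a spatially filtered signal.
import numpy as np
from scipy.signal import hilbert

def envelope(y_csp, axis=-1):
    """Amplitude envelope of a real signal along the time axis."""
    analytic = hilbert(y_csp, axis=axis)   # complex-valued analytic signal
    return np.abs(analytic)                # its magnitude is the envelope
```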
Down-sampling can be applied to the signal without any important loss of information, since the envelopes are spectrally low in frequency. This operation has two main advantages:
1. Down-sampling unifies the length of each signal, since the sampling duration varies between time steps.
2. The input feature dimension is reduced; therefore, the filter shapes and the number of trainable parameters of the proposed model are decreased.
The identical value of N = 2 and time-step intervals of 2.5–3.5 s, 3–4 s, 4–5 s, 5–6 s, 2.5–4.5 s, 4–6 s, and 2.5–6 s after the onset of the imagery task have been adopted for this part9. The original sampling rate of the data is 250 Hz, and the number of samples depends on the interval length, which differs between time steps. In addition, the cut-off frequency of the envelope is 4 Hz, which means the minimum sampling frequency needed to represent the signal is 8 Hz (Nyquist rate). However, different sampling frequencies are chosen for each time-step interval to obtain a unified input matrix. For example, for the 4 s interval, a 10 Hz sampling rate is chosen to obtain 40 time points; different sampling rates are chosen for the other intervals so that they yield the same number of time points.
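As a concrete illustration of how each interval can be cropped and brought to a common number of time points, the sketch below assumes SciPy's `resample`; the interval list and the 40-point target follow the text (4 s at 10 Hz gives 40 points), while the helper function itself is our own.

```python
# Sketch of cropping each time-step interval from the 250 Hz envelope and
# resampling it to a common 40 points; the helper is an illustrative assumption.
from scipy.signal import resample

FS = 250          # original sampling rate (Hz)
N_POINTS = 40     # unified number of time points per interval
INTERVALS = [(2.5, 3.5), (3, 4), (4, 5), (5, 6), (2.5, 4.5), (4, 6), (2.5, 6)]

def unify_interval(env, t_start, t_stop):
    """Crop one interval from a 1-D envelope sampled at 250 Hz and resample it,
    i.e. an effective rate of 40 / (t_stop - t_start) Hz (>= 8 Hz for all intervals)."""
    segment = env[int(t_start * FS):int(t_stop * FS)]
    return resample(segment, N_POINTS)
```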
Compact convolutional neural network
A CNN is a kind of artificial neural network with a multilayer perceptron structure. The method is inspired by the working principle of the visual cortex, which motivates the convolutional layers introduced into the CNN. Weight sharing and sparse connectivity are the advantages of convolutional layers, and these two benefits can meaningfully decrease computational complexity. Unlike images and videos, which provide abundant data for training CNNs, the amount of data for EEG signals is very small. For categorizing EEG signals, a CNN with several convolutional layers can easily result in over-fitting of the training model. Consequently, it is very important to construct a suitable CNN model. Compact-CNN is a special CNN with depthwise and separable convolutions, and it has fewer parameters16. Figure 3 illustrates an example structure of Compact-CNN on the BCI competition IV-2a data set.

Figure 3. Overall visualization of the Compact-CNN architecture.
As shown in Fig. 3, the first part is inspired by Block 1 of EEGNet16. The order of the two convolutional layers mirrors the structure of MSFBCSP. The first convolutional layer performs a convolution exclusively along the time axis. As an example, the architecture of Compact-CNN on the BCI competition IV-2a data set is given in Table 1. It is a regular convolutional layer with F1 = 8 kernels of size (1, 64), with padding that yields an output of the same size as the input. This strategy allows each kernel to act as a temporal filter that extracts the relevant frequencies of the EEG signals. The second convolutional layer then performs a convolution along the electrode (channel, space) axis only: a depthwise convolution with kernels of size (C = 22, 1) and no padding. In effect, this lets the CNN learn several spatial filters for each feature map of the temporal convolution. In this block, C = 22 signifies the number of channels, T = 288 the number of trials, and the depth is D = 2, meaning the number of feature maps expands from F1 = 8 to D * F1 = 16. The layer thus performs a linear combination of the channels over time, where each channel has its own weight16. In the present study, three values are chosen via coordinate descent and Bayesian optimization: the number of temporal filters, the depth multiplier (number of spatial filters), and the number of pointwise filters. The number of units in the output of the model is K = 4, equal to the number of classes.
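The PyTorch sketch below conveys one plausible reading of this block. F1 = 8, D = 2, C = 22 and K = 4 follow the text, while the pooling sizes, separable-kernel length, activation, and dropout placement are assumptions borrowed from the EEGNet family rather than details taken from the paper.

```python
# Hedged PyTorch sketch of a Compact-CNN (EEGNet-style) block; only F1, D, C
# and K are taken from the text, other layer choices are assumptions.
import torch
import torch.nn as nn

class CompactCNN(nn.Module):
    def __init__(self, n_channels=22, n_classes=4, F1=8, D=2, F2=16,
                 temporal_kernel=64, dropout=0.5):
        super().__init__()
        self.block1 = nn.Sequential(
            # temporal convolution: each kernel acts as a learned frequency filter
            nn.Conv2d(1, F1, (1, temporal_kernel), padding="same", bias=False),
            nn.BatchNorm2d(F1),
            # depthwise convolution over the electrode axis: D spatial filters per map
            nn.Conv2d(F1, F1 * D, (n_channels, 1), groups=F1, bias=False),
            nn.BatchNorm2d(F1 * D),
            nn.ELU(),
            nn.AvgPool2d((1, 4)),
            nn.Dropout(dropout),
        )
        self.block2 = nn.Sequential(
            # separable convolution = depthwise temporal conv followed by pointwise mixing
            nn.Conv2d(F1 * D, F1 * D, (1, 16), padding="same", groups=F1 * D, bias=False),
            nn.Conv2d(F1 * D, F2, (1, 1), bias=False),
            nn.BatchNorm2d(F2),
            nn.ELU(),
            nn.AvgPool2d((1, 8)),
            nn.Dropout(dropout),
        )
        self.classify = nn.LazyLinear(n_classes)   # flattened features -> K = 4 classes

    def forward(self, x):                          # x: (batch, 1, channels, time)
        x = self.block2(self.block1(x))
        return self.classify(torch.flatten(x, 1))
```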
Parameter selection and training configuration
Tuning the numerous hyperparameters of neural networks and deep learning methods requires careful consideration. Unfortunately, this tuning is often a cumbersome task requiring expert experience, rules of thumb, or sometimes brute-force search. Consequently, there is a growing inclination toward automatic approaches capable of improving the performance of any given learning algorithm. An example is multi-layer convolutional neural networks, for which a comprehensive exploration of hyperparameters and architectures is useful, as demonstrated by Bashashati et al.42. For the presented network architecture, several network hyperparameters, such as the type of pooling layer, the number of filter units, the convolution stride, and the kernel size, must be chosen. It is practically impossible to search over the entire parameter space to attain optimal values for each parameter. To resolve this issue, Bayesian parameter selection and coordinate descent are used as suboptimal methods to search through the parameter space.
Parameter selection based on coordinate-descent
Hyperparameters can be appropriately selected via a cross-validation approach9. It is not possible to search over the entire parameter space owing to time and computation limits. As an alternative, coordinate descent is used as a suboptimal technique for cross-validating the network parameters43. In this method, a set of parameters \(\theta = \left[ \theta_1 , \theta_2 , \dots ,\theta_N \right]\) is initialized, and then the objective (score) function is optimized for each \(\theta_i\ \left( i = 1, \dots, N \right)\) independently, updating the values of the initial \(\theta\) with the newly optimized parameters. After N optimizations, the \(\theta\) vector is completely updated and a new iteration of optimization can begin. To improve the results, the algorithm can be repeated for several iterations9. In this work, three values are chosen via coordinate descent: the number of temporal filters, the depth multiplier (number of spatial filters), and the number of pointwise filters. Ten-fold cross-validation is performed only once to select the parameters. The convolutional layer parameters are chosen first via cross-validation, and the selected values are then used in a further cross-validation to select the number of convolutional nodes.
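A minimal sketch of such a coordinate-descent sweep over the three tuned hyperparameters is shown below; `cv_score` stands in for the 10-fold cross-validation objective, and the candidate grids in the usage example are illustrative assumptions.

```python
# Minimal coordinate-descent sweep; 'cv_score' is a placeholder for the
# 10-fold cross-validation objective, candidate grids are assumptions.
def coordinate_descent(cv_score, grids, init, n_iters=2):
    """grids: dict of parameter name -> candidate values; init: starting values."""
    theta = dict(init)
    for _ in range(n_iters):                      # optionally repeat the full sweep
        for name, candidates in grids.items():    # optimize one coordinate at a time
            theta[name] = max(candidates,
                              key=lambda v: cv_score({**theta, name: v}))
    return theta

# example usage (hypothetical grids):
# best = coordinate_descent(
#     cv_score,
#     grids={"n_temporal": [4, 8, 16], "depth": [1, 2, 4], "n_pointwise": [8, 16, 32]},
#     init={"n_temporal": 8, "depth": 2, "n_pointwise": 16})
```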
Parameter selection based on Bayesian optimization
In the Bayesian method, unlike random and grid search, information from previous evaluations is used to reach optimal values in the parameter space. It uses a probabilistic model that maps hyperparameters to a probability of a score on the objective function42. It is a capable algorithm suited to optimizing functions that are computationally costly to evaluate and that lack a closed mathematical form. In fact, this algorithm is widely applied for optimizing hyperparameters in machine learning. Its ability to incorporate prior information about the optimization task helps the method remain effective even when many function evaluations are required. When the hyperparameters are optimized, the candidate points in the neighbourhood of a given point x have similar function values (i.e. the optimized function is smooth). In this method, a kernel function is applied to encode this domain knowledge about the system. To obtain a new candidate point, all the information gained from previous function evaluations is used in Bayesian optimization. Indeed, the global knowledge captured in the probability distribution fitted over the available data is used to propose a new candidate42.
Unlike the original optimization task, finding the maximum of the acquisition function is not difficult; however, this function is still not convex. For its optimization, a derivative-free optimization algorithm or a gradient descent algorithm can be used. The new candidate is a local maximum of the acquisition function, and the whole process is then repeated for T iterations. Here, Bayesian optimization, with tenfold cross-validation as the objective, has been performed to select the best hyperparameters on the validation set. In addition, for regularization purposes and to avoid overfitting, batch normalization and dropout with a 0.5 rate are employed after each convolutional layer.
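One possible realization of this search is sketched below with scikit-optimize's `gp_minimize`; the library choice, search ranges, call budget, and the stubbed objective are assumptions rather than details from the paper.

```python
# Hedged Bayesian-optimization sketch using scikit-optimize; ranges, budget and
# the stubbed cross-validation objective are illustrative assumptions.
from skopt import gp_minimize
from skopt.space import Integer

space = [Integer(4, 32, name="n_temporal"),
         Integer(1, 4, name="depth"),
         Integer(8, 64, name="n_pointwise")]

def cross_val_accuracy(n_temporal, depth, n_pointwise):
    # placeholder: in practice this trains the network with the given
    # hyperparameters and returns the mean 10-fold cross-validation accuracy
    return 0.5

def objective(params):
    n_temporal, depth, n_pointwise = params
    return -cross_val_accuracy(n_temporal, depth, n_pointwise)  # minimize the negative

# result = gp_minimize(objective, space, n_calls=30, random_state=0)
# result.x then holds the selected (n_temporal, depth, n_pointwise)
```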
LSTM-based CNNs
As a kind of recurrent neural network, the LSTM layer can learn long-term dependencies within the input data24. An LSTM is a recurrent neural network with the capability to preserve the structure of data over a long time and to classify the desired pattern. When an LSTM layer is added to a CNN design, the temporal features of the brain signals are efficiently extracted. At the centre of the LSTM is the cell state, which can be modified by adding or removing information. The removal or addition of information from the cell state is regulated by structures called gates. LSTM networks rely on stacked blocks containing three gates, called the input gate, output gate, and forget gate. These control cells are described by the following equations:
$$i = \sigma \left( W_i x_t + U_i h_{t-1} + b_i \right)$$
(5)
$$f = \sigma \left( W_f x_t + U_f h_{t-1} + b_f \right)$$
(6)
$$o = \sigma \left( W_o x_t + U_o h_{t-1} + b_o \right)$$
(7)
$$\tilde{c} = W_c x_t + U_c h_{t-1} + b_c$$
(8)
$$c_t = f \odot c_{t-1} + i \odot \tilde{c}$$
(9)
$$h_t = o \odot c_t$$
(10)
$$\sigma \left( x \right) = \frac{1}{1 + e^{-x}}$$
(11)
(11)
where \(W\), \(U\) and \(b\) represent sets of learnable parameters controlling each gate. \(x\), \(h\), \(i\), \(f\), \(o\) and \(c\) represent the input, output, input gate, forget gate, output gate and memory cell state, respectively. \(\odot\) denotes the element-wise product, and \(t\) indexes the time series24.
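For clarity, a direct NumPy transcription of Eqs. (5)-(11) is given below; the parameter shapes and the dictionary-based parameter container are illustrative conventions of ours, not the paper's implementation.

```python
# Direct NumPy transcription of Eqs. (5)-(11); the parameter container and
# shapes are illustrative assumptions.
import numpy as np

def sigmoid(x):                                   # Eq. (11)
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; p holds the weight matrices W_*, U_* and biases b_*."""
    i = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])   # input gate, Eq. (5)
    f = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])   # forget gate, Eq. (6)
    o = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])   # output gate, Eq. (7)
    c_tilde = p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"]      # candidate state, Eq. (8)
    c_t = f * c_prev + i * c_tilde                               # cell state update, Eq. (9)
    h_t = o * c_t                                                # hidden output, Eq. (10)
    return h_t, c_t
```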
Fuzzy neural block
A formal description of the proposed FNB is given in this section. A fuzzy neural block (FNB) is an ordered sequence of processing layers that computes the activation of the antecedents of a fuzzy rule. First, the normalized output \(O\) of the previous layer is taken and flattened as \(v_i = \mathrm{Vec}\left(O_i\right) = [o_{1,1}, \dots, o_{m,1}, o_{1,2}, \dots, o_{m,2}, \dots, o_{1,n}, \dots, o_{m,n}]\). Next, the fuzzy clustering method by Kilic et al. is employed to obtain a set of \(K\) centroids of the form \(c^k = [c_1^k, c_2^k, \dots, c_d^k]\) from a collection \(D\) of the layer outputs \(v_i\) obtained in the preceding epoch29. For the first epoch, the centroids are initialized to zero. The Gaussian membership value of \(v_i\) is computed as:
$$\mu_i^k \left( v_i ,c^k ,\alpha \right) = \exp\left( -\tfrac{1}{4} \left( v_i - c^k \right)^2 / \alpha^2 \right)$$
(12)
The scaling vector \(\alpha\) is a parameter learned by the network, and the rule activation consists of a t-norm operator and a normalization step:
$$o^k = \prod_{j=1}^{d} \mu_{i,j}^k \quad \text{and} \quad \tilde{o}^k = o^k \Big/ \sum_{f=1}^{K} o^f$$
(13)
where \(d\) is the dimension of the \(\mu_i^k\) vector, and \(O^\prime = [\tilde{o}^1, \dots, \tilde{o}^K]\) is the output of the FNB that is forwarded to the next layer. Note that the output dimension of the FNB is reduced to \(K\)25.
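A single FNB forward pass following Eqs. (12)-(13) can be sketched as below; the centroid matrix is assumed to come from the clustering step, and `alpha` is the learnable scaling vector. Names are illustrative.

```python
# Sketch of a single FNB forward pass, Eqs. (12)-(13); centroids are assumed to
# come from the clustering step and alpha is the learnable scale.
import numpy as np

def fnb_forward(O_i, centroids, alpha):
    """O_i: normalized output of the previous layer (any shape);
    centroids: (K, d) array of cluster centres; alpha: scalar or (d,) scale."""
    v = O_i.ravel()                                                  # v_i = Vec(O_i)
    mu = np.exp(-0.25 * (v[None, :] - centroids) ** 2 / alpha ** 2)  # Eq. (12), shape (K, d)
    o = mu.prod(axis=1)                                              # product t-norm
    return o / o.sum()                                               # Eq. (13), normalized activations
```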
Proposed fuzzy convolutional recurrent neural network (EEG-CLFCNet model)
The valuable information in EEG signals can be fully exploited after extracting the temporal-spatial-frequency features. The key aim of our study is to improve classification accuracy through comprehensive feature extraction; thus, how to combine these three kinds of features is crucial. To achieve this goal, a series convolutional recurrent neural network framework is designed and compared. The structure is shown in Fig. 4.

Figure 4. Visualization of the proposed Deep Fuzzy Convolutional Neural Network architecture.
The spatial and frequency features of the filtered EEG signals are first extracted by Compact-CNN, and the sequence of extracted features is then used as input to the LSTM to extract temporal features. The output of the last time step of the LSTM layers is passed to a fully connected layer and the FNB, and a softmax classifier produces the final prediction. Compact-CNN is used as the CNN module to define the series convolutional recurrent neural network with LSTM. The proposed method is named EEG-CLFCNet.
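At a high level, the composition can be sketched as follows. This PyTorch sketch reflects one plausible reading of the description above (in particular the ordering of the FNB and the fully connected layer); the `feature_extractor` and `fnb` arguments are assumed to be modules like those sketched earlier (with the Compact-CNN classifier head removed), and the layer sizes are illustrative.

```python
# High-level sketch of the EEG-CLFCNet forward pass; module order and sizes are
# one plausible reading of the text, and the wrapper modules are assumptions.
import torch
import torch.nn as nn

class EEGCLFCNet(nn.Module):
    def __init__(self, feature_extractor, fnb, feat_dim=16, hidden=64, n_classes=4):
        super().__init__()
        self.cnn = feature_extractor                 # Compact-CNN: spatial/frequency features
        self.project = nn.LazyLinear(feat_dim)       # adapt CNN feature size to the LSTM input
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fnb = fnb                               # fuzzy neural block on the last hidden state
        self.fc = nn.LazyLinear(n_classes)

    def forward(self, x):                            # x: (batch, steps, 1, channels, time)
        b, s = x.shape[:2]
        feats = self.project(self.cnn(x.flatten(0, 1))).view(b, s, -1)  # per-step features
        _, (h_n, _) = self.lstm(feats)                                  # temporal features
        z = self.fnb(h_n[-1])                                           # fuzzy rule activations
        return torch.softmax(self.fc(z), dim=-1)                        # class probabilities
```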