Publicly available EEG-based BCI dataset for inner speech recognition

Multi speech-related BCI dataset consisting of EEG recordings from ten naive BCI users, performing four mental tasks in three different conditions

26 Feb, 2022

EEG-based BCI Open Source Dataset [Image Source: iStock by Getty Images]

In recent years, brain-computer interface (BCI) technology has seen breakthroughs across a broad range of applications, including ones that extend how people communicate and interact with their environment. One application drawing growing interest is inner speech recognition, where automatic detection of brain patterns could allow faster and more reliable translation of neural activity into commands. Neural activity is commonly measured by electroencephalography (EEG) because the technique is non-invasive and the measuring device can be portable. Non-invasive EEG-based BCI applications are geared towards alternative communication for paralyzed patients, letting them interact with wheelchairs, prostheses or virtual interface devices.

The lack of a publicly available EEG dataset has restricted progress in inner speech recognition. An open-source EEG-based BCI dataset for inner speech is needed so that researchers can develop and compare new techniques. According to the researchers, publicly available datasets exist for imagined speech and motor imagery, but none had been released for inner speech, which limits the community's understanding of the underlying brain mechanisms. A team of researchers from different universities in Argentina collaborated to develop a multi-speech BCI dataset consisting of EEG recordings from ten BCI users performing four mental tasks in three conditions: inner speech, pronounced speech and a visualized condition.

The Method

In the paper, “Thinking out loud, an open-access EEG-based BCI dataset for inner speech recognition,” the team explains why an open-source dataset matters for BCI applications and describes the methodology used to build a dataset with more than 9 hours of EEG recordings. For the dataset, the team selected ten right-handed participants with a mean age of 34 and no hearing loss, speech loss, or neurological, psychiatric or movement disorder. All participants were native Spanish speakers with no prior BCI experience.

Procedures for Acquiring EEG Data from the Participants [Image Credit: Research Article]

To record EEG without interference or environmental noise disturbing the sensitive measurement, an electrically shielded room is preferred for the experiment. The participants sat in a comfortable chair in front of a computer screen where the visual cues were presented. The protocol included a break period between sessions to prevent boredom and fatigue, and a fifteen-second baseline during which the participant relaxed. Each session consisted of five stimulation runs corresponding to the different conditions: pronounced speech, inner speech and the visualized condition.

To acquire the electroencephalography (EEG), electrooculography (EOG) and electromyography (EMG) data, the team used the BioSemi ActiveTwo high-resolution biopotential measuring system. “For data acquisition, 128 active EEG channels and 8 external active EOG/EMG channels with a 24-bit resolution and a sampling rate of 1024Hz were used,” the researchers note. “The software used for recording was ActiView, developed also by BioSemi as it provides a way of checking the electrode impedance and the general quality of the incoming data.” The EEG data file holds the processed recordings for each participant and session. Each trial covers the 128 recording channels with 1154 samples per channel, corresponding to 4.5 seconds of signal at a sampling rate of 256Hz after processing.
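The figures quoted above can be cross-checked with a little arithmetic. The Python sketch below is only an illustration: the constant and function names are hypothetical and do not come from the paper or its repository, and the only inputs are the numbers stated in the article. It shows that going from 1024Hz to 256Hz is a reduction by a factor of 4, and that 1154 samples at 256Hz is slightly more than the nominal 4.5 seconds (exactly 4.5 seconds would be 1152 samples). The article does not explain the two extra samples.

```python
# Figures quoted in the article; all names here are illustrative assumptions.
ACQ_RATE_HZ = 1024       # sampling rate during acquisition
FINAL_RATE_HZ = 256      # sampling rate of the processed EEG data
N_CHANNELS = 128         # EEG channels per trial
N_SAMPLES = 1154         # samples per channel in each processed trial


def trial_duration_s(n_samples: int, rate_hz: int) -> float:
    """Length of a trial in seconds for a given sample count and rate."""
    return n_samples / rate_hz


# Ratio between the acquisition rate and the processed rate.
decimation = ACQ_RATE_HZ // FINAL_RATE_HZ

# Duration implied by the stated sample count and final rate.
duration = trial_duration_s(N_SAMPLES, FINAL_RATE_HZ)

# Sample count that an exact 4.5 s window would give at the final rate.
nominal_samples = int(4.5 * FINAL_RATE_HZ)

# Number of values stored for one trial (channels x samples).
values_per_trial = N_CHANNELS * N_SAMPLES

print(f"decimation factor: {decimation}")
print(f"trial duration: {duration:.4f} s")
print(f"samples for exactly 4.5 s: {nominal_samples}")
print(f"values per trial: {values_per_trial}")
```

Keeping the numbers as named constants makes it easy to redo the check if the dataset documentation gives different values for a particular file.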

Trial Workflow with screens presented to the participants at different timestamps [Image Source: Research Paper]

The dataset has limitations. Even though the participants were instructed not to focus on any muscular activity, there is no guarantee that the recorded mental activity is free of other contributing factors. Also, because the participants were naive BCI users, it may have been challenging for them to differentiate between the different components of speech. “There is not enough evidence to support that imagined speech and inner speech generate distinguishable brain processes. And even if they in fact do, it is not clear that distinguishable features could be captured with current Electroencephalography technologies. Nonetheless, inner speech is clearly a more natural way of controlling a BCI,” the team explains in the research article.

To support community contributions, the code used in the paper is publicly available in a GitHub repository, along with the stimulation protocol and MATLAB functions. The research article was published in Nature’s Scientific Data under open-access terms.


[1] Nieto, N., Peterson, V., Rufiner, H.L. et al. Thinking out loud, an open-access EEG-based BCI dataset for inner speech recognition. Sci Data 9, 52 (2022). 

Abhishek Jadhav is an engineering student, RISC-V ambassador and a freelance technology and science writer with bylines at EdgeIR, Electromaker, Embedded Computing Design, Electronics-Lab and Hackster.