Wednesday, May 14, 2025

Top 3 LLM+RL Advances in Equity Trading (2025)

In 2025, a handful of practical studies emerged on combining Large Language Models (LLMs) with reinforcement learning (RL) for systematic equity trading. These papers stand out for their robust methodologies, rigorous backtesting, and focus on real-world applicability. Below we review three top papers — each backed by credible authors (including experienced quant researchers and fintech collaborations) — summarizing their contributions, methodology, results, and practical strengths and limitations. These works help R&D teams and firms focused on market data analysis derive new alpha, manage trading risk, and construct better portfolios for systematic and discretionary trading.

1. FinRL-DeepSeek: LLM-Infused Risk-Sensitive RL (Benhenda, 2025)

  • Methodology & Contribution: Introduces a hybrid trading agent that augments deep RL with LLM-derived signals from financial news for both trade decisions and risk management. It extends a risk-aware RL algorithm (CVaR-Proximal Policy Optimization, CPPO) by feeding in an LLM’s news-based stock recommendations and a risk assessment score each day. This goes beyond simple sentiment analysis by using carefully prompted LLMs (e.g. DeepSeek V3, Qwen-2.5, Llama 3.3) to extract nuanced risk/return insights from news. The result is an RL agent that adapts its actions based on news-informed risk signals, aiming to manage downside risk in volatile markets. (A minimal sketch of the signal-infusion idea appears after this list.)
  • Backtesting & Results: The study backtests the agent on the Nasdaq-100 index over decades of data (1999–2023 news from the FNSPID dataset). Code and data are open-sourced for transparency. In tests, the LLM-enhanced RL (especially the risk-sensitive CPPO variant) showed improved performance over baseline RL and even outperformed the Nasdaq-100 benchmark in certain runs. Notably, the CPPO+DeepSeek agent excelled in bear markets (e.g. post-2021 downturn), where it outperformed a vanilla RL agent that struggled with high volatility. This suggests the news-informed risk signals helped the agent avoid losses during market stress. (By contrast, a standard PPO agent without risk adjustment was too volatile.) The author reports that the RL agent with LLM inputs achieved higher risk-adjusted returns (e.g. better Information Ratio and tail-risk metrics) when calibrated appropriately.
  • Credibility & Practical Strengths: This paper is by Mostapha Benhenda (affiliated with LAGA, a math/CS research lab) – an active contributor to AI-for-Finance projects. The approach is practical: it uses readily available news data and existing large models via API, rather than hypothetical data. The inclusion of risk management (CVaR) makes it appealing for real trading, as it targets drawdowns and not just returns. The fact that code and agents are released adds credibility and allows quants to replicate or build on the work. It also formed the basis of a 2025 FinRL trading contest, underscoring industry interest in the approach.
  • Limitations: While promising, the results indicate that LLM signals must be integrated carefully. In the study, simply injecting strong LLM-based biases into a plain PPO agent actually hurt performance (the agent overreacted to news noise). Only when combined with the CVaR-PPO risk-aware framework did the LLM signals consistently add value. This highlights that real-world use would require careful calibration of LLM influence (the paper shows performance sensitivity to the “infusion strength” of LLM signals). Also, the author notes that results vary between bull and bear regimes, so this approach may need further tuning or regime detection to be reliably profitable across cycles. Overall, FinRL-DeepSeek demonstrates a credible and actionable integration of news-reading LLMs into an RL trading strategy, with an emphasis on risk-adjusted performance.
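As a hedged illustration only, the sketch below shows one plausible way a daily LLM recommendation and risk score could be infused into an RL trading step. The function name, the 1-5 score convention, and the scaling formulas are assumptions for illustration; the paper's exact CPPO formulation differs.

import numpy as np

def infuse_llm_signals(action, reward, llm_reco, llm_risk, strength=0.1):
    """Tilt an RL trading action and reward by daily LLM news scores.

    llm_reco, llm_risk: news-based recommendation and risk scores on a
    1-5 scale (3 = neutral). strength is the "infusion strength" the
    paper reports sensitivity to. All names and scalings here are
    illustrative assumptions, not the paper's exact CPPO formulas.
    """
    tilted = action * (1 + strength * (llm_reco - 3) / 2)    # lean toward the LLM view
    adjusted = reward * (1 - strength * (llm_risk - 3) / 2)  # penalize risky news days
    return np.clip(tilted, -1.0, 1.0), adjusted

Setting strength to 0 recovers the plain RL agent, which makes the infusion strength straightforward to calibrate in backtests.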

2. FLAG-Trader: Fusion LLM Agent with Gradient RL (Xiong et al., 2025)

  • Methodology & Contribution: FLAG-Trader is a large collaborative effort (12 authors from academia and a fintech lab) that proposes a unified architecture fusing an LLM with deep RL for trading decisions. Here, a pre-trained language model is partially fine-tuned on financial data (to imbue it with domain knowledge) and then used as the policy network in an RL agent. In other words, the LLM itself learns to output trading actions (buy/sell/hold) in response to state prompts, and its parameters are further optimized via policy-gradient RL (PPO) using trading rewards. This approach leverages the LLM’s broad knowledge (financial reasoning, context understanding) while training it to achieve specific trading goals, effectively marrying pattern recognition with goal-driven learning. The team uses parameter-efficient fine-tuning so that even a relatively small model can adapt to market specifics without massive compute costs. (A minimal sketch of the LLM-as-policy setup appears after this list.)
  • Backtesting & Results: The authors conducted extensive empirical tests to validate that this LLM+RL fusion improves trading performance. They evaluated multiple model sizes and types (from a tiny 135M-parameter custom model up to 70B Llama and even OpenAI’s GPT-4 as baselines) on historical market data, measuring metrics like cumulative return (CR), Sharpe ratio (SR), volatility, and max drawdown. A key finding is that their RL-fine-tuned small model (135M) actually outperformed much larger off-the-shelf models (like GPT-4) on trading metrics such as total return and Sharpe. For example, after training, the 135M agent achieved higher risk-adjusted returns than even a 175B GPT-3.5 when the latter was used naively for trading signals. This highlights the benefit of domain-specific RL fine-tuning – the LLM agent learned to make coherent multi-step trading decisions, beating models that, while powerful in general knowledge, weren’t specialized to the task. The paper reports that FLAG-Trader’s agent surpassed a buy-and-hold baseline and other benchmarks in their tests, and even showed improved generalization to other financial tasks (like question-answering) as a byproduct of the training. These results were demonstrated on standard market environments (using the FinRL framework) to ensure reproducibility and rigor.
  • Credibility & Team: The research was carried out by a cross-disciplinary team from institutions like Harvard, Columbia, Stevens Institute, University of Manchester, NVIDIA, and TheFinAI (a fintech AI research group). Notably, Dr. Xiao-Yang Liu — known for the FinRL library and prior RL trading research — is a co-author, lending credence to the implementation quality. This blend of academic and industry perspectives helped ensure the approach wasn’t just theoretically sound but also practically oriented. The “extensive empirical evidence” and comparisons to many baseline models demonstrate a high level of rigor.
  • Practical Strengths: FLAG-Trader stands out for showing a viable path to use LLMs directly as trading agents, which could potentially shorten development time for new strategies (since the LLM can incorporate knowledge from financial texts, analyst reports, etc. out of the box). The fact that a smaller, open-source model fine-tuned with their RL method can outperform larger proprietary models is encouraging for real-world use – it implies one can achieve strong performance without relying on black-box models or exorbitant computing resources. This makes the approach attractive for firms that want control over the model. Moreover, by focusing on risk-adjusted metrics (Sharpe) and testing against a buy-hold benchmark, the paper addresses what matters to quant traders (not just raw profits, but consistency).
  • Limitations: On the flip side, integrating an LLM into an RL loop is computationally intensive. The authors acknowledge challenges with training stability and the non-stationarity of markets; the LLM policy may need continual retraining as market regimes change. There’s also the question of latency — large models can be slow, though the success with a 135M model suggests a practical trade-off can be found. Another consideration is interpretability: while the LLM can explain its actions in plain text (a nice bonus), its decision-making after RL fine-tuning might become an opaque mix of learned patterns. The paper notes that future work should explore techniques like continual learning to keep the LLM agent adaptive in live markets. In summary, FLAG-Trader provides a robust, tested framework that a quant team could experiment with, especially if they have access to domain-specific LLMs – it balances cutting-edge AI with awareness of trading metrics, making it one of the most practically credible LLM+RL studies of 2025.
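As a hedged illustration, the sketch below shows the LLM-as-policy idea in minimal form: a small causal LM scores three action tokens for a textual market-state prompt and is updated with a plain policy-gradient (REINFORCE) step. The SmolLM-135M checkpoint, the prompt handling, and the simplified update are assumptions; FLAG-Trader itself applies PPO with parameter-efficient fine-tuning of selected layers.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-135M")
policy = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M")
ACTIONS = ["buy", "hold", "sell"]
action_ids = [tok.encode(" " + a)[0] for a in ACTIONS]  # one token per action (assumed)

def act(state_prompt):
    """Sample a trading action for a textual market-state prompt."""
    inputs = tok(state_prompt, return_tensors="pt")
    next_token_logits = policy(**inputs).logits[0, -1, action_ids]
    dist = torch.distributions.Categorical(logits=next_token_logits)
    a = dist.sample()
    return ACTIONS[a.item()], dist.log_prob(a)

def reinforce_step(log_probs, rewards, optimizer):
    """One policy-gradient update from collected (log-prob, reward) pairs."""
    loss = -(torch.stack(log_probs) * torch.tensor(rewards)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()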

3. LLM-Guided Stock Trading via RL (Stock-Evol-Instruct) (Riahi-Samani et al., 2025)

  • Methodology & Contribution: This work (submitted to ICLR 2025) takes a holistic, pipeline-driven approach to integrating LLMs with reinforcement learning for stock trading. The authors propose an architecture that uses multiple LLMs in various roles to inform an RL trading agent. They introduce an algorithm called “Stock-Evol-Instruct”, which generates dynamic trading instructions or prompts for the RL agent based on market data and news. In practice, they experimented with six different LLMs (both general models and finance-specific ones) and developed a procedure where the LLMs analyze daily stock news and indicators, then produce trading guidance that the RL agent uses to adjust its strategy. The RL component is implemented with Deep Q-Network variants (DQN/DDQN) to make daily trading decisions, while the LLM-generated insights serve as an additional input or modulator for the agent’s state/reward (a form of LLM-informed policy fine-tuning; a minimal sketch appears after this list). Uniquely, Stock-Evol-Instruct continuously updates the instruction prompts using recent market trends and the agent’s performance, creating a feedback loop between the LLM analysis and the RL policy learning. They also fine-tuned two smaller open-source language models (Mistral-7B and LLaMA-3B) on the generated instruction data, effectively turning them into specialized trading agents able to act directly on market observations. This two-stage setup (LLMs guiding an RL agent, and in turn improving the LLMs) is a novel contribution, aiming to make the RL more sample-efficient by leveraging rich textual data that pure price-based models might miss.
  • Backtesting & Performance: The authors evaluated their system on real daily stock data for two instruments – the iShares Silver Trust (SLV, a silver ETF) and JPMorgan Chase & Co. stock – over a multi-year period. The choice of two very different assets (one commodity-based, one banking equity) was to test generality. Results indicate that the LLM-guided approach beat several conventional trading models and baselines. Notably, the RL agent augmented with LLM insights achieved higher prediction accuracy for trading signals and improved profitability compared to an RL agent without LLM guidance. For example, they report higher F1-scores in predicting profitable trades (over 81% in the case of the LLaMA-based agent on JPM) and better cumulative returns than baseline strategies. The LLM-informed agent outperformed a standard DQN trader and even a purely LLM-driven strategy, highlighting the synergy of combining them. The paper’s abstract notes a “significant potential to outperform conventional trading models” – indeed, in backtests the integrated approach had better Sharpe ratios and cumulative returns than a classical price-only RL agent and a buy-and-hold benchmark. These experiments, while limited to two assets, provide a proof-of-concept that blending news understanding (via LLMs) with technical decision-making (via RL) can yield a more powerful trading strategy.
  • Authors & Credibility: The team (Ali Riahi Samani, Fatemeh Darvishvand, and Feng Chen) comes from a data science research background and submitted this work to a top-tier AI conference (ICLR), indicating it underwent rigorous peer review. While not directly from a bank or hedge fund, the authors demonstrate solid understanding of both NLP and trading domains. The paper references real-world entities (like using JPM stock data and citing industry data sources), showing an intent to solve practical problems rather than toy examples. The empirical approach – using actual market data and incorporating fundamental news – reflects a quant mindset. Additionally, by integrating numerous prior studies (86 references, as noted in an SSRN review), the authors built on a broad base of knowledge. This gives the work credibility despite the academic origin, as it clearly targets applicable techniques.
  • Practical Strengths: One strength of this approach is its comprehensiveness: it doesn’t rely on a single model or data stream but rather combines technical signals with fundamental context via natural language. For a quant team, this is appealing because it resembles how a human analyst might operate (reading news and charts together) – here that process is automated. The introduction of a human-in-the-loop style instruction generation (Stock-Evol-Instruct) is also practically interesting: it can be seen as an offline training step that produces high-quality scenarios for the RL agent to learn from, potentially improving stability. The paper also demonstrated that even relatively small models, when fine-tuned properly on financial text, can add significant value to an RL trader. This suggests a path to deployment that doesn’t require heavyweight infrastructure. Moreover, the use of distinct LLMs for different subtasks (they mention experimenting with closed-source vs open-source LLMs, presumably ChatGPT, BloombergGPT, etc.) means the framework is flexible – a firm could plug in whatever language model (or ensemble of models) they trust to get the best insights and then use RL to translate those insights into trades.
  • Limitations: Despite its innovation, this study is somewhat complex and not fully proven at scale. The backtests were on two assets; it remains to be seen if the approach scales to a broad portfolio or intraday trading (the paper focused on daily decisions). The heavy use of multiple LLMs and a custom instruction generator means the strategy could be hard to reproduce without significant NLP expertise. There is also the challenge of timeliness and overfitting: news-based signals can decay quickly, and an RL agent trained on past news might not generalize if the market regime or news tone shifts (a risk the authors acknowledge). Additionally, the need to fine-tune LLMs on finance data and then integrate them adds latency – in live trading, one might prefer the LLM analysis to be faster or even streaming. From an implementation standpoint, maintaining two learning loops (one for the LLM instructions and one for the trading agent) is resource intensive. The authors propose it as a frontier framework, and a quant team considering it would likely simplify or streamline parts of it. In summary, this paper demonstrates a powerful idea – that LLMs can actively coach an RL trader – and provides evidence of enhanced performance on real stocks. Its practical viability will depend on careful engineering, but it offers a roadmap for those looking to combine textual data and reinforcement learning in a systematic strategy.
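A minimal, hedged sketch of the state-augmentation idea referenced above: the LLM's daily guidance is distilled into a small numeric vector and concatenated onto the DQN agent's observation. The function and the two-number guidance format are illustrative assumptions; the paper's evolving Stock-Evol-Instruct prompts are far richer.

import numpy as np

def build_state(prices, indicators, llm_guidance):
    """Concatenate market features with LLM-derived guidance.

    llm_guidance is assumed to be a small numeric vector distilled from
    the LLM's daily reading of news and indicators (e.g. a direction
    score and a confidence score).
    """
    return np.concatenate([prices, indicators, llm_guidance])

# The DQN/DDQN agent then picks buy/sell/hold from the augmented state:
# q_values = dqn(build_state(price_window, indicator_window,
#                            np.array([direction, confidence])))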

Conclusion

In 2025 the fusion of Large Language Models with reinforcement learning moved from theory to credible practice in systematic equity trading. Three clear currents now dominate: first, hybrid frameworks that feed RL agents rich context from financial news and analyst reports, bridging structured price data with unstructured text; second, the success of compact, domain-fine-tuned LLMs, which often outperform huge general models once their policies are optimized for risk-aware objectives; and third, a shift from chasing raw alpha to maximizing risk-adjusted metrics such as Sharpe, CVaR, and drawdown resilience. Early studies show that LLM-infused RL agents can outpace vanilla RL systems and even passive benchmarks, especially in volatile regimes, suggesting the commercial viability of this approach. The beneficiaries span quantitative hedge funds, prop desks, AI research groups, and, critically, vendors that aggregate, cleanse, and annotate market news or alternative data, because high-quality, machine-ready information is the fuel that makes these next-generation trading engines run.


Wednesday, March 13, 2024

Revolutionizing Wall Street with AI

AI Unleashed: Through Systematic Trading and Alpha Discovery



The rapid pace of technological innovation, particularly in the realm of artificial intelligence (AI), is fundamentally transforming industries worldwide. Nowhere is this more evident than on Wall Street, where the fusion of AI with financial systems is redefining the paradigms of investment, trading, and market analysis. This convergence is ushering in a new era for systematic trading, alpha research, and market efficiency, promising unprecedented opportunities and challenges alike.

As we delve into the intricacies of these developments, it's crucial to understand the mechanisms through which AI technologies—spanning machine learning, natural language processing, and generative AI—are being leveraged to enhance decision-making processes, optimize trading strategies, and improve overall market dynamics.

This article aims to explore the recent advancements in AI and their significant impact on Wall Street, offering insights into how these technologies are reshaping the landscape of finance and trading.

Systematic trading, alpha research, and market efficiency are at the forefront of this transformation, driven by the integration of machine learning, natural language processing (NLP), and generative AI technologies.

 

AI's Role in Investment and Systematic Trading

First, what is systematic trading? This domain refers to the use of computer-driven models to make trading decisions in financial markets. This approach relies on quantitative analysis, algorithms, and technological tools to identify trading opportunities based on predefined criteria. Unlike discretionary trading, where decisions are made based on human judgment, systematic trading removes emotional bias, employing strategies that are tested and executed automatically. Systematic traders utilize vast datasets, including market price and volume data, economic indicators, and even sentiment analysis, to inform their trading models. At BlackRock, AI and NLP techniques are employed to parse a wide array of text sources, such as broker analyst reports and corporate earnings calls, to inform return forecasts. This use of transformer-based large language models (LLMs), similar to ChatGPT, allows for more accurate analysis of text by understanding the interactions between words in a sentence. Such precision in text analysis provides an edge in investment predictions, demonstrating the substantial potential AI holds for improving market predictions and trading strategies.
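As a rough, hedged illustration of this general technique (not BlackRock's actual pipeline): a transformer sentiment model can turn text into a numeric signal that feeds a return forecast. The FinBERT checkpoint and the simple averaging below are illustrative assumptions.

from transformers import pipeline

sentiment = pipeline("text-classification", model="ProsusAI/finbert")

def text_signal(sentences):
    """Map a list of sentences to a crude [-1, 1] sentiment score."""
    sign = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}
    results = sentiment(sentences)
    return sum(sign[r["label"]] * r["score"] for r in results) / len(results)

# Example: score a couple of earnings-call snippets.
print(text_signal(["Revenue grew 20% year over year.",
                   "Margins compressed due to rising costs."]))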

 

Generative AI and Market Efficiency

Generative AI refers to a subset of artificial intelligence technologies that can generate new content, from text and images to music and code, based on the patterns it learns from large datasets. Unlike traditional AI, which analyzes and interprets data, generative AI creates original outputs that mimic the style or content of its training data. This technology has groundbreaking applications across various fields, including creative arts, design, and content creation, revolutionizing how machines can augment human creativity and innovation.

Generative AI has experienced a significant breakout, with leading companies leveraging AI to create new business models and revenue sources, beyond just cost reduction. A McKinsey report suggests that these AI high performers are investing heavily in AI and embedding it across multiple business functions, including product development, risk modeling, and optimizing the product-development cycle. This extensive adoption of AI technologies is indicative of the transformative potential AI has across various sectors, including finance.

 

The Impact on Alpha Research and Market Dynamics

Alpha research involves the rigorous analysis and identification of investment strategies that aim to generate returns above the market average, known as "alpha." This process leverages statistical models, financial data, and computational techniques to discover market inefficiencies or predictive signals that can be exploited for profit. Alpha research is crucial for fund managers and investors seeking to outperform market benchmarks and achieve superior investment performance.

In the realm of alpha research, AI technologies like machine learning and NLP are becoming pivotal in extracting valuable insights from vast amounts of data. Companies like Canoe and AlphaSense use AI to organize investment documentation and streamline market research, respectively. This not only enhances the efficiency of the research process but also allows for more precise and data-driven decision-making.  

 

Efficient Market Hypothesis (EMH) and AI

The Efficient Market Hypothesis (EMH) posits that market prices fully reflect all available information, making it challenging to achieve consistent market outperformance. 

 

“The proposition is that prices reflect all available information, which in simple terms means since prices reflect all available information, there’s no way to beat the market.”

– Eugene Fama

 

However, the advent of AI and systematic trading strategies poses interesting questions about EMH. By leveraging AI for predictive analytics and decision-making, there's potential to identify and exploit market inefficiencies more effectively, potentially challenging the traditional notions of EMH.

 

About the author

As a passionate and innovative professional in the domains of AI, ML, and quantitative finance, I have dedicated myself to developing advanced trading algorithms and machine learning models. My work focuses on leveraging my expertise in signal analysis, embedded systems, and quantitative research to contribute to cutting-edge ML solutions and trading strategies. Through projects such as algorithmic trading systems, I aim to harness the power of AI to enhance market efficiency and uncover new opportunities in systematic trading and alpha research.

Saturday, October 27, 2018

Hacking Brain with Neural Network

Detecting brain activity state using Brain Computer Interface

Brain computer interface device (BCI) from OpenBCI
You may have noticed that within the past few years scientists, labs, startups, and companies like Facebook and Google have been working on brain-computer interfaces that will enable humans to interact with machines by the power of thought alone.

You may think that such technology is exotic and, at the very least, too expensive to play with. However, there are devices available on the market today that allow you to experiment, write your own projects, and learn how to create a mind machine yourself!

Within this project, let me guide you through one such exercise - reading the human brain with a handy and geek-friendly device.

Introduction

A Brain Computer Interface (BCI), sometimes called a neural-control interface (NCI), mind-machine interface (MMI), direct neural interface (DNI), or brain–machine interface (BMI), is a direct communication pathway between an enhanced or wired brain and an external device. BCI differs from neuromodulation in that it allows for bidirectional information flow. BCIs are often directed at researching, mapping, assisting, augmenting, or repairing human cognitive or sensory-motor functions.
Brain computer interface technology represents a rapidly growing field of research with many application systems. Its contributions to medical fields range from prevention to neuronal rehabilitation for serious injuries. Mind reading and remote communication have their unique fingerprint in numerous fields such as education, self-regulation, production, marketing, and security, as well as games and entertainment. It creates a mutual understanding between users and the surrounding systems.



The motivation for this project is to research and learn BCI technology by applying machine learning algorithms, and to estimate the complexity of creating systems of communication between human and machine (or between humans, with machines as intermediaries) using brain activity and BCI devices.
The data analysis and model training were performed on pre-collected datasets from an EEG device.
Electroencephalography (EEG) is an electrophysiological monitoring method to record electrical activity of the brain. It is typically noninvasive, with the electrodes placed along the scalp, although invasive electrodes are sometimes used, as in electrocorticography. EEG measures voltage fluctuations resulting from ionic current within the neurons of the brain. The EEG device used for this project is the Ultracortex "Mark IV" EEG Headset (8 channels) from OpenBCI. It has 8 electrodes located on designated areas of the scalp, providing 8 channels of data respectively.
Code
For easy exploration and convenient representation, the analysis and algorithmic model were put in a Jupyter notebook. You may find the code and requirements, along with the notebook, in the GitHub repository, or open the notebook in a separate tab right away to navigate as you read the material here and look up the code.
Data
Data collected and used for this project is not available within the repository. If you are interested in a data sample, please see the Data_extract.txt file.

Case statement

For the purpose of this project, it was decided to train a neural network model to recognize two user states while the user performs two different but mentally similar activities: reading and writing. The reading and writing activities were performed in almost identical environments but within various 24-hour ranges. The user was reading the same book and writing at the same desk, in the same lighting conditions and pose.
Each data collection session consisted of two phases: 1) EEG setup (mounting, connecting to the interface, checking); 2) recording while performing the activity (while the user reads or writes, the EEG is activated and records signals to a file).
Network training requires a significant amount of data. To make the collected data consistent and eliminate excessive preprocessing, data samples should be coherent and noise excluded (noise occurs due to distractions, unwanted muscular activity, etc.). To that end, the multiple recorded datasets need to fit 'identical' environmental and psychological conditions of the user. To make the datasets more coherent with each other, it was decided to keep recordings to around one minute in length.

Data Exploration

During this project it was noted that the environment and the user’s mood and psychological state play a very important role. Conditions like daydreaming or a twilight state (when a person is drifting to sleep or waking up) become significant factors from a data standpoint. Under those conditions, brain waves of a specific band – alpha – may skew sample data significantly.
Therefore, three options for collecting and analyzing data were identified:
Option 1: Collect the necessary amount of data with multiple datasets obtained in multiple sessions over a short period of time - for instance, between 5pm and 8pm within a single 24-hour period.
Option 2: Collect a large number of datasets during multiple sessions completed over a long range of time - one week, a month, or a couple of months.
Option 3: Mixed option - collect data in similar conditions over an unspecified period of time. For instance, each day between 6pm and 8pm over a couple of weeks.
Option 1 is good for excluding mental and psychological states that may impact research results, while option 2 might be good for embracing the states that most often occur over a long period of time, so the model can be trained on them appropriately.
The benefit of the 1st option, where the focus is just on the user’s current state, is getting relatively fast and provable results, training the model on more coherent data, and obviously achieving the goal of the project - distinguishing the two activity states. However, with this option there is no opportunity to use the pre-trained model to predict the user's states at any other time in the future, due to the lack of background data covering the user's mental and psychological conditions. This option is also not ideal because of a person's physical and emotional variance - in lab conditions, the reading and writing states blur and merge into a single 'dreaming/abandoned' passive condition in which the person starts losing focus. This drawback was noticed after numerous attempts and analysis of brainwave patterns.
There is much more benefit in the 2nd option, where the user's mental and psychological states can be accounted for thanks to the long-term data collection period.
Using this option, a unique dataset would be created that is useful beyond the purposes of this research. A dataset collected this way would also help train a model that could be used any time in the future to predict the user's state for the given actions (reading/writing). Such a model would be more reliable. However, this option is very time- and resource-consuming.
Given the purpose of the project, option 3 was selected and performed.
The 8-channel EEG data is represented in µV and collected at a 1000 Hz sample rate. Filters are applied within the equipment firmware to generate data with minimum noise. For this particular dataset, a 7–13 Hz bandpass filter and a 60 Hz notch filter were applied.
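For illustration, here is how equivalent filtering could be reproduced in software with SciPy (in this project the filters ran in the device firmware, so this is a sketch under that assumption, not the firmware code):

from scipy.signal import butter, filtfilt, iirnotch

FS = 1000  # sample rate, Hz

def clean_channel(x):
    """Apply a 7-13 Hz bandpass and a 60 Hz notch to one channel
    (x is a 1-D numpy array of samples)."""
    b, a = butter(4, [7.0, 13.0], btype="bandpass", fs=FS)
    x = filtfilt(b, a, x)
    bn, an = iirnotch(60.0, Q=30.0, fs=FS)
    return filtfilt(bn, an, x)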

Data Visualization

Obtaining data and first look:
It is necessary to get 8 channels of data from the recorded EEG dataset. The raw datafile contains more than that and requires preprocessing and cleanup.
Sample of data

As you can see from the tiny sample of data above, columns 1 - 8 represent the corresponding channels (data obtained from the respective electrodes 1 - 8). Column 12 is required to break the data out by seconds.
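A minimal pandas sketch of that cleanup, assuming the column positions described above; the file name and the '%'-prefixed header lines are assumptions about the raw file layout:

import pandas as pd

raw = pd.read_csv("session_data.txt", comment="%", header=None)
channels = raw.iloc[:, 1:9].astype(float)            # electrode channels 1-8, in uV
df = channels.assign(second=raw.iloc[:, 12].values)  # per-second breakout column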

Below is a plot of one whole recorded session.
Raw dataset

Data Preprocessing

Samples rebalancing

Data is logged and recorded in high resolution at a 1000 Hz sample rate. Ideally, each row of data represents 1/1000 of a second; in other words, 1000 samples/impulses are recorded per second, and 1 second represents one data sample for the network model. However, due to equipment specifics, some data cycles/impulses may be lost, causing the number of impulses per second to differ from 1000. Therefore, the goal is to compile each data input sample from an equal number of impulses per second.
To balance the data and equalize the number of impulses in one data sample, each sample of training data needs to be preprocessed accordingly. It was decided to equalize each data sample to 990 rows (recorded impulses). Samples with fewer than 990 rows are deleted.
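A sketch of this rebalancing step, assuming a dataframe like the one from the previous snippet, with a 'second' column grouping rows by second:

import pandas as pd

def rebalance(df, rows_per_second=990):
    """Trim each one-second group to exactly rows_per_second rows and
    drop groups that recorded fewer impulses than that."""
    kept = [g.iloc[:rows_per_second]
            for _, g in df.groupby("second", sort=False)
            if len(g) >= rows_per_second]
    return pd.concat(kept, ignore_index=True)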
Spikes cleaning

Before even rescaling the dataset, there is another problem to solve. Although the data received from the EEG is filtered, it can still contain noise - spikes coming from muscular activity. These need to be eliminated.
Let's look at a short sample of data before rescaling. There is a noticeable spike between seconds 15 and 16.

Short piece of data before rescaling and spikes removing.
To handle that, it was decided to remove spikes on the non-rescaled dataset first. The main parameter for spike removal is the ‘margin’ variable, which sets up the ‘corridor’ for wave oscillation. The ‘corridor’ is enforced by the custom function 'variance_clean()': any part of the wave outside the ‘corridor’ is trimmed to fit. It is important to note that spike trimming was performed for each second of the wave – this approach helped preserve the wave’s wider dynamic range observed across the whole recording.
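A hedged reconstruction of the corridor idea (the actual variance_clean() in the repository may differ in detail):

import numpy as np

def variance_clean(x, margin, fs=1000):
    """Clip each recorded second of the wave (a 1-D numpy array) into a
    'corridor' of +/- margin around that second's mean, trimming spikes
    while preserving the wave's dynamic range across the recording."""
    x = x.astype(float).copy()
    for start in range(0, len(x), fs):
        seg = x[start:start + fs]
        center = seg.mean()
        np.clip(seg, center - margin, center + margin, out=seg)
    return x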
The image below shows the part of the wave with a spike:

Spike

And here is the same part after the spike has been trimmed:
Trimmed spike

There is still a shift where the spike was removed; however, such a shift has much less impact on data integrity.
It is also noticeable that the length of the dataset (the number of seconds we can work with) decreases as we clean up the data. For this reason, a simple 'seconds()' function was used to monitor the remaining useful data left for training.

Rescaling

One of the most important functions for data preprocessing is the scaling function, 'scaler()'. With its help, the dataset is rescaled in equal seconds-long batches.
Rescaling is a major step before feeding the dataset to the training model. But each data sample can't be rescaled against the whole dataset, because that would skew the integrity and content of the dataset after rescaling. With such holistic rescaling it could still be possible to train the model and even achieve good results, but that model would not be able to make predictions on shorter datasets representing just a couple of seconds.

The image below shows what a long (300 seconds/samples) 'reading' dataset looks like.

'Reading' dataset
It is noticeable where the subsets received from multiple sessions are concatenated, and how scattered the data is.

As a result, it was decided to rescale the dataset in batches of seconds. The dedicated 'scaler()' function iterates through the whole dataset with a short, seconds-long window, rescaling each window/batch consecutively. The size of the window/batch was defined with two goals in mind:
1) The shortest possible interval the model is able to predict on (especially useful for live wave-stream recognition);
2) The highest model accuracy. Given the amount of training data and the specifics of this project, it was challenging to pick the right batch size to train the model effectively.

These rescaling batches should not be confused with the training batches for the model, where one second represents just one data sample.

With regard to the rescaling range, it was decided to rescale all channels into the [0, 1] range.
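A sketch of the batch-wise rescaling under these assumptions (990 rows per rebalanced second; the window length in seconds as a parameter):

import numpy as np

def scaler(x, rows_per_second=990, batch_seconds=5):
    """Min-max rescale the wave into [0, 1] one short window at a time,
    so a trained model can also predict on short live snippets."""
    x = x.astype(float).copy()
    step = rows_per_second * batch_seconds
    for start in range(0, len(x), step):
        seg = x[start:start + step]
        lo, hi = seg.min(), seg.max()
        if hi > lo:
            x[start:start + step] = (seg - lo) / (hi - lo)
    return x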
The image below shows samples of data after seconds rebalancing and spike cleaning:

Spikes cleaned
The image below depicts data samples after rescaling – the last preprocessing step before labeling and training:

Preprocessed set


The noticeable shifts are acceptable as long as the shifts' edges coincide with the rescaled batch boundaries.

Preprocessed results

Let's now have a look at how a larger dataset looks before and after preprocessing.
'Reading' raw dataset
'Reading' dataset after preprocessing
Let's zoom into a couple of rescaled batches:

Batches concatenation

Labeling and training preparation

After the data is preprocessed, it should be properly labelled. For that purpose, two labels were created: 0 and 1, for 'reading' and 'writing' respectively. The dataset containing the recorded ‘reading’ state was measured in length, and a series of labels of the same size was created. The same approach was taken for the ‘writing’ dataset. After that, the two datasets were concatenated: the features data became the ‘x’ variable and the labels data the ‘y’ variable. It is important to note that the output labels were one-hot encoded using the keras.utils.to_categorical() function.
Before model training, the compiled dataset was split into training and testing samples. For this purpose, the train_test_split() function from the Python sklearn library was used. (The dataset size allowed this – beware that with enormously big datasets you will not be able to use train_test_split() from sklearn and will more likely have to write something custom.) A sketch of this step is shown below.
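A minimal sketch of the labeling and split; the file names and the 20% test fraction are assumptions, and the arrays are assumed to hold one-second windows shaped (samples, 990, 8):

import numpy as np
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split

reading = np.load("reading_sessions.npy")   # assumed preprocessed arrays
writing = np.load("writing_sessions.npy")   # shaped (samples, 990, 8)

x = np.concatenate([reading, writing])
y = np.concatenate([np.zeros(len(reading)), np.ones(len(writing))])
y = to_categorical(y, num_classes=2)        # one-hot: 0 = reading, 1 = writing

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, shuffle=True, random_state=42)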

The Model

A Convolutional 1D Neural Network model built with Keras was selected due to the specifics and volume of the data to train on. Since the problem deals with classification, and the model will be used for multi-class classification, the decision was to build a corresponding model architecture.
The 1D convolutional model consists of 2 convolutional layers, 2 MaxPooling layers, 2 BatchNormalization layers, 1 Flatten and 2 Dense layers, where the last Dense layer has softmax activation. This architecture allows the model to train fast and with high accuracy. A hedged sketch of such a model follows the summary below.

Model summary:
Model Summary
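A hedged Keras sketch matching the layer list above; the filter counts, kernel sizes, and pool sizes are illustrative assumptions, since the exact values live in the notebook:

from keras.models import Sequential
from keras.layers import (BatchNormalization, Conv1D, Dense, Flatten,
                          MaxPooling1D)

model = Sequential([
    Conv1D(32, kernel_size=9, activation="relu",
           input_shape=(990, 8)),             # one second of 8-channel EEG
    BatchNormalization(),
    MaxPooling1D(pool_size=4),
    Conv1D(64, kernel_size=9, activation="relu"),
    BatchNormalization(),
    MaxPooling1D(pool_size=4),
    Flatten(),
    Dense(64, activation="relu"),
    Dense(2, activation="softmax"),           # 'reading' vs 'writing'
])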
For model optimization, the ‘Adam’ optimizer was selected as best practice. The learning rate was set to 0.001. This optimizer configuration worked best.
To save the best training results and reload the model when necessary, the ‘ModelCheckpoint()’ callback was used to store the model weights in a separate file.
The model was trained with a batch size of 16 over 25 epochs. This number of epochs proved optimal for training: if the model did not train successfully within 25 epochs, it was a sign to change the model parameters rather than the number of epochs. A compile-and-fit sketch is shown below.
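A sketch of that compile-and-fit setup; the checkpoint file name and the monitored metric are assumptions:

from keras.callbacks import ModelCheckpoint
from keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=0.001),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

checkpoint = ModelCheckpoint("best_weights.h5",      # assumed file name
                             monitor="val_accuracy",
                             save_best_only=True,
                             save_weights_only=True)

model.fit(x_train, y_train,
          validation_data=(x_test, y_test),
          batch_size=16, epochs=25,
          callbacks=[checkpoint])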
The model parameters were adjusted over the course of multiple training attempts. The final model was tested on different datasets and different rescaled batch sizes to prove its efficiency. However, it was noted that if a dataset was significantly skewed by the user's mental/physical state, or its subsets had different lengths, the model parameters required some tweaks to reach the highest accuracy. The following parameters may need to be tuned if dataset features such as length change drastically:
  •      Pool size of MaxPooling layer;
  •      Batch size.

The rest of the parameters proved stable. If the model fails to train, that is evidence of a low-quality dataset that needs to be fixed.
Given the task of this project, the higher the model accuracy, the better the model will recognize the ‘reading’ and ‘writing’ states in a complex, noisy environment. So it is critical to achieve the highest accuracy possible.

Benchmarking and Test

The benchmarking goal is to test the model in real-life conditions, i.e. on an absolutely new dataset. For this purpose, a new dataset was collected separately, at a different time. As mentioned before, the challenge here is the variability of the data given the user's mental and physical states. Also, given that the collected data is relatively small and does not cover the person's mental states, the expected results might not be very impressive. If the total accuracy of the developed model is more than 65%, the approach works, and the project goal may be deemed accomplished for now.
A bunch of new datasets were used for testing. Some of them showed model accuracy just over 50%; some, up to 93%.
Accuracy measurement:
The total model accuracy is calculated as the average accuracy over both the reading test and the writing test. Each test's accuracy is measured by the number of correct state predictions. For example, if the model correctly predicts 50 seconds of reading out of a 100-second reading test set, the reading accuracy is 50%.
The formula for overall accuracy measurement is following: 
Accuracy = (Nr / Tr + Nw / Tw) / S
where:
Nr – correct 'reading' seconds predicted
Tr – total 'reading' seconds
Nw – correct 'writing' seconds predicted
Tw – total 'writing' seconds
S – number of states (in this case 2: reading and writing)
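For instance, with hypothetical counts of 38 correct reading seconds out of 50 and 72 correct writing seconds out of 100:

def total_accuracy(nr, tr, nw, tw, states=2):
    """Overall accuracy per the formula above."""
    return (nr / tr + nw / tw) / states

print(total_accuracy(38, 50, 72, 100))   # (0.76 + 0.72) / 2 = 0.74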

The length of the benchmark dataset should be no less than 30 seconds to ensure clean test results.
In this project's benchmark test, two new datasets were used – reading and writing – of 50 seconds and 100 seconds respectively.
The final model benchmark test proved that the model corresponds to the problem and resolves it. The accuracy achieved was over 70%. Please see the notebook for detailed results.

Conclusion

The idea of recognizing ‘reading’ and ‘writing’ states was picked as a simple task to train and test the model for the purpose of this project.
However, in the course of the research it was identified that these two states may not differ significantly from each other. First of all, when a person performs routine exercises it is very hard to capture expressive signal patterns; secondly, it was noted that a person who is reading or writing is mostly in an idle state.
What might help distinguish these states are the eye-muscle patterns that may be captured by the electrodes located on the forehead and mid-scalp. Beyond that, both writing and reading activities may activate similar areas of the brain. For example, when a person writes not just a piece of text but something that requires thinking ahead, the person is planning the story and the vision-processing regions of the brain become active. The same process may occur while reading thoughtfully and with focus.
It was noticed that unfocused reading, as well as compulsive writing, does not bring any benefit for the purposes of the research. To address that challenge, the user's data was collected within different time ranges. A control was also put in place to keep the user focused on the specific task. At the end of the day, only reliable datasets – those with the most confidence of being appropriate for each ‘reading’ and ‘writing’ state – were picked for training.

Finally, the model was able to distinguish these activities given the patterns learned. The tested accuracy may not be impressive; however, this is a good start for researching this problem further.


Future plans


This project proved the opportunity to create deep learning models trained on brain signals obtained from a BCI device – the research and experiments will continue!

My next project with BCI will deal with recognizing limb movements.

Stay tuned!