Abstract—Large language models (LLMs) have demonstrated remarkable success in the field of natural language processing (NLP). Despite their origins in NLP, these algorithms possess the theoretical capability to process any data type represented in an NLP-like format. In this study, we use stock data to illustrate three methodologies for processing regression data with LLMs, employing tokenization and contextualized embeddings. By leveraging the well-known LLM algorithm Bidirectional Encoder Representations from Transformers (BERT) [1], we apply quantitative stock price prediction methodologies to predict stock prices and stock price movements, showcasing the versatility and potential of LLMs in financial data analysis.

Index Terms—finance, quantitative stock price prediction, natural language processing, stock movement prediction, fintech, machine learning, large language models

I. INTRODUCTION

Few subfields in machine learning (ML) have garnered as much attention in recent years as NLP and the LLMs integral to its advancements. Although these models excel in NLP tasks, they are fundamentally versatile algorithms capable of processing any data type that is appropriately formatted.

At a conceptual level, LLMs process sequentially arranged data points - specifically, word-tokens - that encapsulate their interrelationships and correlations through the positions of their respective embedding vectors in the vector space. Numerous tasks within NLP involve the derivation of subsequent textual outcomes (i.e. prediction of future developments of the input sequence, as for example in next-token prediction) or overarching semantic interpretations, such as sentiment analysis, based on these structured inputs. Upon examining these internal mechanisms, it becomes evident that numerous research fields within ML exhibit analogous processing requirements for their respective input data. Stocks, for example, are highly correlated, dependent on other stocks [2], and sequentially ordered through their temporal price development. Given these shared characteristics with NLP data, LLM algorithms are potentially interesting for processing stock data. Voigt et al. [3] have demonstrated that language data and stock market data share not only conceptual but also structural similarities, suggesting the potential applicability of LLMs to financial datasets. Despite being relatively underexplored in contemporary research, numerous time series problems can be reformulated to align with the operational framework of LLMs.

LLMs typically incorporate three coarse-grained processing stages. Initially, the input text undergoes tokenization, during which each of the l input word-tokens is mapped to an index within a predefined vocabulary V ⊂ N. Subsequently, each token index is assigned a contextualized (pre-trained) embedding tensor (e.g. using the Word2Vec [4], [5] algorithm). In the final stage, these embedding vectors are concatenated to form a two-axis tensor, which is then fed into the model for further processing. A sketch of the basic processing idea can be seen in the first row of Figure 1.
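To make these three stages concrete, the following minimal Python sketch walks through them with a toy vocabulary and randomly initialized embeddings; both are illustrative assumptions on our part, not the actual BERT vocabulary or pre-trained weights.

import numpy as np

# Stage 1 ingredients: a toy vocabulary V ⊂ N and an embedding table
# (pre-trained in practice); both stand in for BERT's real vocabulary/weights.
vocab = {"[CLS]": 0, "[SEP]": 1, "lorem": 2, "ipsum": 3, "dolor": 4}
d_model = 8
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

def tokenize(text):
    """Stage 1: map each of the l input word-tokens to its vocabulary index."""
    return [vocab[w] for w in text.lower().split()]

def embed(indices):
    """Stages 2 and 3: look up one embedding per token index and stack the
    vectors into the two-axis tensor of shape (l, d_model) the model consumes."""
    return np.stack([embedding_table[i] for i in indices])

print(embed(tokenize("lorem ipsum dolor")).shape)  # (3, 8)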
We propose a series of methodologies to adapt each step of the LLM pipeline in order to make it usable for stock data. First, we employ stock data as embedding values, inputting these embeddings over a specified time lag ∆t and for a set of company stocks C. This approach replaces the traditional method of concatenating embedding vectors with stock data. Secondly, we explore the application of scaling contextualized embeddings e_i for various companies c_i ∈ C. This method, originally proposed in [3], serves as an analog to the technique used in Word2Vec for generating word vector embeddings, adapted for financial entities. Lastly, our third approach involves the tokenization of stock regression data, which aims to replicate the entire pipeline of an LLM. A visualization of these concepts is shown in Figure 1. We have decided to adapt the BERT model [1] for this work, as it is one of the first models in the LLM landscape.
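As a schematic view of how the three input variants could be built, consider the sketch below; the array shapes, the quantile binning in the third variant, and all helper names are our own illustrative assumptions rather than the exact implementation.

import numpy as np

rng = np.random.default_rng(0)
n_companies, lag, d_model = 4, 6, 8           # |C|, Δt, embedding width (toy values)
X = rng.normal(size=(n_companies, lag))       # prices x_i^(t) for each c_i ∈ C

# 1) Stock data as embedding values: the |C| x Δt price matrix itself serves
#    as the two-axis input tensor, replacing concatenated word embeddings.
inp_1 = X

# 2) Scaled contextualized embeddings (cf. [3]): one vector ê_i per company,
#    scaled by the price at every timestep and flattened over time.
E_hat = rng.normal(size=(n_companies, d_model))
inp_2 = (X[:, :, None] * E_hat[:, None, :]).reshape(-1, d_model)  # (|C|·Δt, d_model)

# 3) Tokenized stock data: quantize each price into one of |V| bins so that
#    the unchanged LLM pipeline (token ids -> embeddings) can be reused.
vocab_size = 100
bins = np.quantile(X, np.linspace(0.0, 1.0, vocab_size - 1))
inp_3 = np.digitize(X, bins).reshape(-1)      # token ids in [0, vocab_size)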
To summarize our contributions, we delineate three methodologies that leverage LLMs for broader applications to regression and time-series analysis, specifically using stock data as an example. We validate our ideas using the BERT model trained on 60-minute resolution intraday stock data of Standard and Poor's 500 (S&P-500) companies. As tasks, we use the prediction of future prices (stock price prediction) and the prediction of whether a stock price will rise or fall in the future (stock movement prediction). We conceptualize end-to-end models that do not require further feature selection, domain knowledge, or work from (expensive) financial experts.
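For clarity, the two prediction targets can be derived from a price series as in the following toy example; this is our own illustration of the task definitions, with invented numbers.

import numpy as np

prices = np.array([101.2, 100.8, 102.5, 103.1, 102.9])   # toy hourly closes

# Stock price prediction (SPP): regress the next price from a lagged window.
window, spp_target = prices[:-1], prices[-1]              # predict 102.9

# Stock movement prediction (SMP): classify whether the price rises (1)
# or falls (0) relative to the previous value.
smp_target = int(prices[-1] > prices[-2])                 # 0, since 102.9 < 103.1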
The primary objective of this work is to demonstrate the applicability of established NLP algorithms in the domain of financial analytics by utilizing language models and stock data. We aim to present initial findings that support the feasibility of this approach. It is important to note that our goal is not to identify superior stock movement prediction (SMP) / stock price prediction (SPP) methodologies, but rather to highlight the potential and usability of LLMs in this novel context. Given this objective, our initial focus is on quantitative stock price prediction, which exclusively utilizes numerical, historical stock data [6]. This approach is in contrast to fundamental stock analysis [7], which incorporates a broader spectrum of information, including annual reports, social media data, or other relevant sources. For the integration of such fundamental data, additional methodologies are discussed in Section VII.
II. RELATED WORK

Voigt et al. [3] pioneered the concept of adapting speech models to multivariate regression data (i.e. stock data) by reformatting such data into sentence-like structures. This approach involves scaling multidimensional contextualized embedding vectors ê_i for each c_i dependent on the price information x_i^(t) of c_i at timestep t, thereby incorporating regression data. Although [3] introduced these foundational ideas, their work lacked experimental validation of the adapted speech models (ASMs) and did not explore the use of tokenized regression data or two-axis tensor representations as embedding inputs.
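Restated in the notation above (our own formulation of the scaling step, with an invented numerical example), the price-scaled embedding reads

e_i^(t) = x_i^(t) · ê_i,   with ê_i ∈ R^d and x_i^(t) ∈ R,

so that, for instance, ê_i = (0.4, -1.0) and x_i^(t) = 2.5 yield e_i^(t) = (1.0, -2.5).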
Further advancing their concepts, [3] also introduces the Stock2Vec algorithm, designed to train embeddings ê_i specifically tailored for the various c_i. In the financial ML domain, the strategy of representing c_i as contextualized embeddings is commonly employed to uncover correlations among assets, primarily to enhance risk minimization and portfolio optimization strategies. Contextualized embeddings for c_i work similarly to those in NLP, where contextualized word-token embeddings represent relationships and meanings of the word-tokens. Embeddings of c_i typically cluster stocks within the same industry close together or express similar relations through similar distance vectors. In related literature, such as [8], [9], [10], and [11], embedding training strategies for c_i are introduced. None of these models use the embeddings for SMP/SPP downstream tasks.
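Such relations are typically read off via vector similarity; the toy sketch below (made-up three-dimensional embeddings and tickers, not learned values) shows how same-industry stocks obtain a higher cosine similarity than cross-industry pairs.

import numpy as np

# Made-up 3-d company embeddings e_i; real embeddings are learned, not hand-set.
emb = {
    "AAPL": np.array([0.9, 0.1, 0.0]),
    "MSFT": np.array([0.8, 0.2, 0.1]),
    "XOM":  np.array([0.1, 0.9, 0.3]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["AAPL"], emb["MSFT"]))  # ~0.98: same (tech) cluster
print(cosine(emb["AAPL"], emb["XOM"]))   # ~0.21: different industries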
The intercorrelation among stocks stands as a pivotal factor in forecasting future stock prices, a notion widely acknowledged in contemporary research literature, as for example in [12] or [2]. In our proposed approach, the intricate interrelationships are encapsulated through e_i. Traditional methodologies commonly integrate correlation matrices directly into the modeling pipeline, as exemplified by [13], or leverage them within graph-based networks, as showcased in [14]. However, few studies opt for the utilization of c_i-specific vectors e_i, as demonstrated by [15] or [2].

The importance of including relational information in stock representations was demonstrated, for example, by Kim et al. [16]. In their model, there are performance differences depending on the basis from which relationships between two stocks are modeled (e.g. "Industry-Product or material produced" or "Country of origin-Country"). The ablation study in [14] shows decreasing performance if relationship modeling is omitted.

An approach loosely related to the tokenization of stock data, which involves inputting quantitative stock data (along with fundamental data) as prompts into LLMs, is discussed in [17] and in [18]. One of the relatively few instances of employing LLMs for non-linguistic data is exemplified by the ALBEF model developed by Li et al. [19]. This model utilizes parts of the pre-trained BERT architecture to process merged visual and textual input features.

SPP/SMP are generally regarded as difficult problems [20], and the performance of ML models is correspondingly low, especially when only quantitative methods are used, as in the approaches presented here. In order to make our proposed approach comparable with the literature, we look at the performance of some models that were also trained on US-American stocks. In their study, Qin et al. [21] demonstrated the efficacy of quantitative SPP models, achieving a root mean square error (RMSE) of 0.31 on National Association of Securities Dealers Automated Quotations (NASDAQ) data, in contrast to a higher RMSE of 0.96 observed in their Recurrent Neural Network [22] (RNN)-based baseline model. Similarly, Feng et al. [12] reported models (using non-quantitative external relationship information) that exhibited an RMSE of 0.015 for New York Stock Exchange (NYSE) data and 0.019 for NASDAQ data. Their Long Short-Term Memory (LSTM) [23] based baseline models recorded RMSE values of 0.019 on the NASDAQ dataset and 0.015 on the NYSE dataset, respectively. In the context of SMP, Ding et al. [24] proposed a model that attained an accuracy of 57.3% on NASDAQ data, surpassing the LSTM baseline's accuracy of 53.89%. Furthermore, the same model achieved an accuracy of 58.7% on China A-shares data, which also exceeded the LSTM baseline accuracy of 56.7%. The model proposed in [25] exhibited significant performance, attaining an SMP accuracy of 60.7% across diverse interday data for single c_i from the S&P 500 and the Korea Composite Stock Price Index (KOSPI). The authors also introduced baseline models that achieved accuracies within the range of 51.49% to 57.36%.

III. MODEL

As delineated in Section I, we propose three distinct methodologies employing transformer-encoder based language models for the prediction of stock prices. These approaches, along with the specific components of the LLM pipeline they
[Figure: two pipeline rows. Top (NLP speech model): input text → F_T(·) → (v_1, v_2, ..., v_l) → F_E(·) → (e_1, e_2, ..., e_l) → F_S(·) → h_CLS. Bottom (Stock2Text): price matrix with rows [x_i,1 ... x_i,Δt] for each c_i → F_E(·) → scaled sequence [e_1·x_1,1 ... e_1·x_1,Δt] ... [PUNC] ... [e_|C|·x_|C|,1 ... e_|C|·x_|C|,Δt] → F_S(·) → h_CLS.]

Fig. 2. Visualization of a scaled company embedding input after the conversion of a stock data time series input into a Stock2Sentence representation.
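A minimal sketch of the conversion shown in Fig. 2, under our own assumptions (random toy values, and a zero vector standing in for the [PUNC] separator embedding):

import numpy as np

rng = np.random.default_rng(0)
n_companies, lag, d_model = 3, 4, 8                 # |C|, Δt, embedding width (toy)
X = rng.normal(size=(n_companies, lag))             # row i holds x_i,1 ... x_i,Δt
E_hat = rng.normal(size=(n_companies, d_model))     # one ê_i per company c_i
punc = np.zeros((1, d_model))                       # assumed [PUNC] separator vector

blocks = []
for i in range(n_companies):
    blocks.append(X[i][:, None] * E_hat[i][None, :])   # (Δt, d) block of ê_i · x_i,t
    if i < n_companies - 1:
        blocks.append(punc)                            # [PUNC] between companies
sequence = np.concatenate(blocks)                      # input sequence for F_S(·)
print(sequence.shape)                                  # (|C|·Δt + |C|-1, d) = (14, 8)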