
“© 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.”
Adapting Speech Models for Stock Price Prediction

Frederic Voigt
University of the West of Scotland / Hamburg University of Applied Sciences
Department of Computer Science, Hamburg, Germany
b01742821@studentmail.uws.ac.uk

Jose Alcaraz Calero
University of the West of Scotland, School of Computing, Engineering & Physical Sciences, Paisley, Scotland
jose.alcaraz-calero@uws.ac.uk

Keshav Dahal
University of the West of Scotland, School of Computing, Engineering & Physical Sciences, Paisley, Scotland
keshav.dahal@uws.ac.uk

Qi Wang
University of the West of Scotland, School of Computing, Engineering & Physical Sciences, Paisley, Scotland
qi.wang@uws.ac.uk

Kai von Luck
Hamburg University of Applied Sciences, Department of Computer Science, Hamburg, Germany
kai.vonluck@haw-hamburg.de

Peer Stelldinger
Hamburg University of Applied Sciences, Department of Computer Science, Hamburg, Germany
peer.stelldinger@haw-hamburg.de

Abstract—Large language models (LLMs) have demonstrated remarkable success in the field of natural language processing (NLP). Despite their origins in NLP, these algorithms possess the theoretical capability to process any data type represented in an NLP-like format. In this study, we use stock data to illustrate three methodologies for processing regression data with LLMs, employing tokenization and contextualized embeddings. By leveraging the well-known LLM algorithm Bidirectional Encoder Representations from Transformers (BERT) [1], we apply quantitative stock price prediction methodologies to predict stock prices and stock price movements, showcasing the versatility and potential of LLMs in financial data analysis.

Index Terms—finance, quantitative stock price prediction, natural language processing, stock movement prediction, fintech, machine learning, large language models

I. INTRODUCTION

Few subfields in machine learning (ML) have garnered as much attention in recent years as NLP and the LLMs integral to its advancements. Although these models excel in NLP tasks, they are fundamentally versatile algorithms capable of processing any data type that is appropriately formatted.

At a conceptual level, LLMs process sequentially arranged data points - specifically, word-tokens - that encapsulate their interrelationships and correlations through the positions of their respective embedding vectors in the vector space. Numerous tasks within NLP involve the derivation of subsequent textual outcomes (i.e. prediction of future developments of the input sequence, as for example in next-token prediction) or overarching semantic interpretations, such as sentiment analysis, based on these structured inputs. Upon examining these internal mechanisms, it becomes evident that numerous research fields within ML exhibit analogous processing requirements for their respective input data. Stocks, for example, are highly correlated, dependent on other stocks [2], and sequentially ordered through their temporal price development. Based on these shared characteristics with NLP data, it is obvious why LLM algorithms can potentially be interesting for processing stock data. Voigt et al. [3] have demonstrated that language data and stock market data share not only conceptual but also structural similarities, suggesting the potential applicability of LLMs to financial datasets. Despite being relatively underexplored in contemporary research, numerous time series problems can be reformulated to align with the operational framework of LLMs.

LLMs typically incorporate three coarse-grained processing stages. Initially, the input text undergoes tokenization, during which each of the l input word-tokens is mapped to an index within a predefined vocabulary V ⊂ N. Subsequently, each token index is assigned a contextualized (pre-trained) embedding tensor (e.g. using the Word2Vec [4], [5] algorithm). In the final stage, these embedding vectors are concatenated to form a two-axis tensor, which is then fed into the model for further processing. A sketch of the basic processing idea can be seen in the first row of Figure 1.
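To make these three stages concrete, the following minimal PyTorch sketch mirrors the generic pipeline described above: a vocabulary lookup (tokenization), an embedding lookup, and a sequence encoder producing a CLS representation. The toy vocabulary, dimensions and module choices are illustrative assumptions and not the configuration used in this work.

```python
import torch
import torch.nn as nn

# Stage 1: tokenization - map each of the l word-tokens to an index in V (toy vocabulary).
vocab = {"[CLS]": 0, "[PUNC]": 1, "stocks": 2, "rise": 3, "fall": 4}
tokens = ["[CLS]", "stocks", "rise"]
token_ids = torch.tensor([[vocab[t] for t in tokens]])          # shape (1, l)

# Stage 2: embedding lookup - every index receives a (trainable or pre-trained) vector in R^eta.
eta = 8                                                          # model size (kept small for the sketch)
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=eta)
embedded = embedding(token_ids)                                  # shape (1, l, eta): the two-axis tensor

# Stage 3: sequence model - a transformer encoder stands in for F_S(.).
encoder_layer = nn.TransformerEncoderLayer(d_model=eta, nhead=2, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)
hidden = encoder(embedded)                                       # contextualized representations
h_cls = hidden[:, 0, :]                                          # position of the [CLS] token
print(h_cls.shape)                                               # torch.Size([1, 8])
```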
We propose a series of methodologies to adapt each step of the pipeline of LLMs in order to make them usable for stock data. First, we employ stock data as embedding values, inputting these embeddings over a specified time lag ∆t and for a set of company stocks C. This approach replaces the traditional method of concatenating embedding vectors with stock data. Secondly, we explore the application of scaling contextualized embeddings e_i for various companies c_i ∈ C. This method, originally proposed in [3], serves as an analog to the technique used in Word2Vec for generating word vector embeddings, adapted for financial entities. Lastly, our third approach involves the tokenization of stock regression data, which aims to replicate the entire pipeline of an LLM. A visualization of these concepts is shown in Figure 1. We have decided to adapt the BERT model [1] for this work, as it is one of the first models in the LLM landscape.

To summarize our contributions, we delineate three methodologies that leverage LLMs for broader applications to regression and time-series analysis, specifically using stock data as an example. We validate our ideas using the BERT model trained on 60-minute resolution intraday stock data of Standard and Poor's 500 (S&P-500) companies. As tasks we use the prediction of future prices (stock price prediction) and the prediction of whether a stock price will rise or fall in the future (stock movement prediction). We conceptualize end-to-end models that do not require further feature selection, domain knowledge, or work from (expensive) financial experts.

The primary objective of this work is to demonstrate the applicability of established NLP algorithms in the domain of financial analytics by utilizing language models and stock data. We aim to present initial findings that support the feasibility of this approach. It is important to note that our goal is not to identify superior stock movement prediction (SMP) / stock price prediction (SPP) methodologies, but rather to highlight the potential and usability of LLMs in this novel context. Given this objective, our initial focus is on quantitative stock price prediction, which exclusively utilizes numerical, historical stock data [6]. This approach is in contrast to fundamental stock analysis [7], which incorporates a broader spectrum of information, including annual reports, social media data, or other relevant sources. For the integration of such fundamental data, additional methodologies are discussed in Section VII.

II. RELATED WORK

Voigt et al. [3] pioneered the concept of adapting speech models to multivariate regression data (i.e. stock data) by reformatting such data into sentence-like structures. This approach involves scaling multidimensional contextualized embedding vectors ê_i for each c_i dependent on the price information x_i^(t) of c_i at timestep t, thereby incorporating regression data. Although [3] introduced these foundational ideas, their work lacked experimental validation of the adapted speech models (ASMs) and did not explore the use of tokenized regression data or two-axis tensor representations as embedding inputs.

Further advancing these concepts, [3] also introduces the Stock2Vec algorithm, designed to train embeddings ê_i specifically tailored for various c_i. In the financial ML domain, the strategy of representing c_i as contextualized embeddings is commonly employed to uncover correlations among assets, primarily to enhance risk minimization and portfolio optimization strategies. Contextualized embeddings for c_i work similarly to NLP, where contextualized word-token embeddings represent relationships and meanings of the word-tokens. Embeddings of c_i typically cluster stocks within the same industry close together or express similar relations through similar distance vectors. In related literature such as [8], [9], [10], and [11], embedding training strategies for c_i are introduced. None of these models use the embeddings for SMP/SPP downstream tasks.

The intercorrelation among stocks stands as a pivotal factor in forecasting future stock prices, a notion widely acknowledged in contemporary research literature, for example in [12] or [2]. In our proposed approach the intricate interrelationships are encapsulated through e_i. Traditional methodologies commonly integrate correlation matrices directly into the modeling pipeline, as exemplified by [13], or leverage them within graph-based networks, as showcased in [14]. However, few studies opt for the utilization of c_i-specific vectors e_i, as demonstrated by [15] or [2].

The importance of including relational information in stock representations was demonstrated, for example, by Kim et al. [16]. In their model, there are performance differences depending on the basis from which relationships between two stocks are modeled (e.g. "Industry-Product or material produced" or "Country of origin-Country"). The ablation study in [14] shows decreasing performance if relationship modeling is omitted.

An approach loosely related to the tokenization of stock data, which involves inputting quantitative stock data (along with fundamental data) as prompts into LLMs, is discussed in [17] and [18]. One of the relatively few instances of employing LLMs for non-linguistic data is exemplified by the ALBEF model developed by Li et al. [19]. This model utilizes parts of the pre-trained BERT architecture to process merged visual and textual input features.

SPP/SMP are generally regarded as difficult problems [20], and the performance of ML models is correspondingly low, especially when only quantitative methods are used, as in the approaches presented here. In order to make our proposed approach comparable with the literature, we look at the performance of some models that were also trained on US-American stocks. In their study, Qin et al. [21] demonstrated the efficacy of quantitative SPP models, achieving a root mean square error (RMSE) of 0.31 on National Association of Securities Dealers Automated Quotations (NASDAQ) data, in contrast to a higher RMSE of 0.96 observed in their Recurrent Neural Network [22] (RNN)-based baseline model. Similarly, Feng et al. [12] reported models (using non-quantitative external relationship information) that exhibited an RMSE of 0.015 for New York Stock Exchange (NYSE) data and 0.019 for NASDAQ data. Their Long short-term memory (LSTM) [23] based baseline models recorded RMSE values of 0.019 on the NASDAQ dataset and 0.015 on the NYSE dataset, respectively. In the context of SMP, Ding et al. [24] proposed a model that attained an accuracy of 57.3% on NASDAQ data, surpassing the LSTM baseline's accuracy of 53.89%. Furthermore, the same model achieved an accuracy of 58.7% on China A-shares data, which also exceeded the LSTM baseline accuracy of 56.7%. The model proposed in [25] exhibited significant performance, attaining an SMP accuracy of 60.7% across diverse interday data for single c_i from the S&P 500 and Korea Composite Stock Price Index (KOSPI). The authors also introduced baseline models that achieved accuracies within the range of 51.49% to 57.36%.
[Figure 1 contrasts the classic NLP pipeline (input text → F_T(.) → (v_1, ..., v_l) → F_E(.) → (e_1, ..., e_l) → F_S(.) → h_CLS) with the three ASM variants: the Embedding Based Model feeds the stock matrix X directly into F_S(.); Stock2Text passes X through F_E(.) before F_S(.); the Tokenisation Based Model runs X through the full F_T(.) → F_E(.) → F_S(.) pipeline.]

Fig. 1. Visualized comparison of the ASMs with classic NLP models.

[Figure 2 shows the time series [x_i^1 ... x_i^∆t] of each company c_i ∈ C being converted via F_E(.) into scaled company embeddings (e.g. x_1^1 ⋆ ê_1, ..., x_|C|^∆t ⋆ ê_|C|), separated by [PUNC] tokens, before being passed to F_S(.).]

Fig. 2. Visualization of a scaled company embedding input after the conversion of a stock data time series input into a Stock2Sentence representation.

III. MODEL

As delineated in Section I, we propose three distinct methodologies employing transformer-encoder based language models for the prediction of stock prices. These approaches, along with the specific components of the LLM pipeline they adapt, are illustrated in Figure 1. Each approach utilizes the CLS-token h ∈ R^η, where η represents the size of the model. Subsequent to processing by the LLM, the position of this token is input into a linear layer, which is followed by a Sigmoid activation function to perform the model prediction ŷ. Depending on the specific application, either the Mean Squared Error (MSE) for SPP or the Binary Cross Entropy (BCE) H(ŷ, y) for SMP is employed as the loss function. Similar to traditional predictive tasks in NLP, the CLS-token serves as a representative of the entire processed input sequence with respect to the specified tasks.
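A minimal sketch of this prediction head, assuming a single linear layer applied to the CLS position followed by a Sigmoid; the class name, batch size and η value are illustrative:

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Maps the CLS representation h in R^eta to a scalar prediction y_hat in (0, 1)."""
    def __init__(self, eta: int):
        super().__init__()
        self.linear = nn.Linear(eta, 1)

    def forward(self, h_cls: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(h_cls)).squeeze(-1)

head = PredictionHead(eta=768)
h_cls = torch.randn(32, 768)             # batch of CLS vectors produced by the language model
y_hat = head(h_cls)                      # predictions in (0, 1)

y_reg = torch.rand(32)                   # SPP target: min-max normalized next price (toy values)
y_cls = (torch.rand(32) > 0.5).float()   # SMP target: binary movement label (toy values)
loss_spp = nn.functional.mse_loss(y_hat, y_reg)              # MSE for stock price prediction
loss_smp = nn.functional.binary_cross_entropy(y_hat, y_cls)  # BCE H(y_hat, y) for movement prediction
```

The same head output ŷ is reused for both tasks; only the loss function changes.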
As a language model, we define the basic speech model (BERT) F_S((ẽ_i)_{i=1}^l), taking embedded inputs of the form (ẽ_1, ẽ_2, ..., ẽ_l), with ẽ_i ∈ R^η for all i, and transforming them into h. In order to create the embedded word-tokens (ẽ_i)_{i=1}^l, an embedding model F_E((ṽ_i)_{i=1}^l, Ẽ) is used, which transforms tokenized text input of the form (ṽ_1, ṽ_2, ..., ṽ_l) (with ṽ_i ∈ Ṽ for all i and Ṽ ⊂ N). To tokenize the input text X̃ we use the tokenizer F_T(X̃, Ṽ).
A. Notation

In the following (with strong orientation to the notation of [3]), the price information of a specific company c_i at timestep t is denoted as x_i^(t) ∈ R^4. Stock data is typically expressed as a 5-dimensional datapoint for a given time interval, specifying for each time step (e.g. the last 60 minutes) the Opening Price, the Highest Price, the Closing Price, the Lowest Price and the Trading Volume (OHCLV-features). In the approaches presented here, we do not use the Trading Volume information. The price of all c_i ∈ C at t is expressed as the "Market Snapshot" [3] X^(t) ∈ R^{4·|C|}, stacking the price information of all companies. The concatenation of Market Snapshots over the whole observation horizon ∆t is referenced as X ∈ R^{(4·|C|)×∆t}.
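A small NumPy sketch of this notation, assuming per-company OHCL arrays are already available; the array names, the ordering of the four price features and all sizes are illustrative:

```python
import numpy as np

num_companies = 3        # |C|
delta_t = 5              # observation horizon
rng = np.random.default_rng(0)

# x[i, t] = x_i^(t) in R^4: the four OHCL-features of company c_i at timestep t.
x = rng.random((num_companies, delta_t, 4))

# Market Snapshot X^(t) in R^(4*|C|): stack the price information of all companies at one timestep.
def market_snapshot(x: np.ndarray, t: int) -> np.ndarray:
    return x[:, t, :].reshape(-1)          # shape (4*|C|,)

# X in R^((4*|C|) x delta_t): concatenate the snapshots over the whole observation horizon.
X = np.stack([market_snapshot(x, t) for t in range(delta_t)], axis=1)
print(X.shape)                             # (12, 5) for |C| = 3
```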
B. Embedding Based Approach

In alignment with the frameworks previously delineated in [26] or [24], for example, this approach employs a concatenated array of ∆t Market Snapshots X^(t) ∈ R^{4·|C|} as embedding vectors, input directly into F_S(.), thereby circumventing the initial embedding and tokenization procedure. Here we map |C| ≡ η and ∆t ≡ l.

The primary distinction from the application of a standard Transformer model, i.e. the one described in [27], lies in the nuanced specificities of the modified language model F_S(.) (in this case BERT).

Following the methodology introduced by Yoo et al., a linear transformation layer, with parameters uniformly shared across all c_i and incorporating a Rectified Linear Unit (ReLU) activation function, is applied to transform the raw price features X into a latent feature representation X̄. A consequential aspect of this approach is the flexibility it affords in the input dimensionality of F_S(.), thereby enabling the utilization of η as specified by the original language model.
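The following sketch illustrates this idea with generic PyTorch modules: the raw Market Snapshots are projected by a shared linear-ReLU layer into the encoder width and then consumed as if they were word embeddings. The projection size, the prepended CLS vector and the encoder configuration are assumptions made for illustration, not the exact setup of this work.

```python
import torch
import torch.nn as nn

num_companies, delta_t = 309, 8           # |C| and observation horizon (illustrative)
eta = 768                                  # hidden size of the language model

# X: batch of concatenated Market Snapshots, shape (batch, delta_t, 4*|C|).
X = torch.randn(16, delta_t, 4 * num_companies)

# Shared linear + ReLU layer mapping the raw price features to the latent representation X_bar.
to_latent = nn.Sequential(nn.Linear(4 * num_companies, eta), nn.ReLU())
X_bar = to_latent(X)                       # shape (batch, delta_t, eta)

# A CLS vector is prepended so the encoder can pool the whole window into one position.
cls = nn.Parameter(torch.zeros(1, 1, eta)).expand(X_bar.size(0), 1, eta)
inputs = torch.cat([cls, X_bar], dim=1)    # shape (batch, delta_t + 1, eta)

# Stand-in for F_S(.): any transformer encoder that accepts pre-computed embeddings.
encoder_layer = nn.TransformerEncoderLayer(d_model=eta, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
h_cls = encoder(inputs)[:, 0, :]           # representation fed to the SPP/SMP head
```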
C. Stock2Sentence

The idea of mapping X into sentence-like structures is discussed in [3]. For this methodology, the procedure commences by "flattening" X, subsequently transposing the input at each respective timestep dimension, and then concatenating the flattened and transposed inputs sequentially. Each c_i is transformed into an embedding tensor ê_i utilizing F_E(.) alongside the trainable embedding matrix E. The derivation and specifications of E are elaborated upon in Paragraph III-C0a.

To encode price information and construct the final embedding vector e_i, two distinct scaling methodologies are evaluated: e_i = ê_i ⋆ x_i^(t) or e_i = ê_i ⋆ F_Ê(f_norm(x_i^(t)), Ê), with ⋆ being defined as either a multiplicative or an additive operation. Each of the four OHCL-features occupies one quarter of the η dimensions of e_i by stacking the vectors. In the approach utilizing F_Ê(.), Ê represents a learnable parameter matrix, and f_norm(.) is a function designed to normalize the input values of each c_i and each OHCL-feature to an integer range between 0 and θ_norm. As delineated in [26], the technique of shifting embedding tokens within the vector space constitutes a prevalent strategy, employed, for instance, to encode positional information, as exemplified in [28].

In order to generate sentence-like structures we assign

a^(t)[j, i] = e_i^(t)[j]    (1)

and

A = [a^(1) [PUNC] a^(2) [PUNC] ... a^(∆t)]    (2)

with [PUNC] ∈ R^{η×1}, as outlined in [3]. The punctuation token [PUNC] is utilized to aid the model in differentiating among various t. A schematic representation of this method is depicted in Figure 2.

a) Contextualized Stock Embedding Vectors: The embedding matrix E can be optimized through several methodologies, including Stock2Vec and other algorithms detailed in Section II. The aim of using E is to represent relationships between different c_i abstractly as high-dimensional vectors. Access was obtained to the pre-trained weights from the work of Dolphin et al. in [11] and [10], facilitating empirical testing. Alternatively, E can be initialized randomly and trained from scratch. For the Stock2Vec algorithm, we use a slightly adapted (compared with [26]) version in which, for example, time-dependent embeddings are used for the CBOS algorithm and the task is SMP instead of SPP.

One significant advantage of the Stock2Sentence approach is that it improves the information presentation to the model, as each position in the "text" encapsulates a single data point from the second (l-dimensional) input tensor axis. In contrast, the method employing X^(t), as discussed in Section III-B, incorporates |C| information points per position. This approach endeavors to leverage the underlying principles and mechanisms of NLP models by aligning more closely with traditional NLP text structures, where each input position includes one word-token. However, a notable drawback of this method is the resultant expansion of the input length by a factor of |C|, which substantially increases computational demands. This is particularly challenging for transformer architectures, which are characterized by a space complexity of O(n² + n·k) [27], thus rapidly depleting computational resources.
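The sketch below illustrates the multiplicative variant of the scaling ⋆ and the construction of the input sequence with interleaved [PUNC] tokens. How the four OHCL-features are spread over the η dimensions, as well as all sizes, are simplifying assumptions. In the experiments reported below, the additive variant combined with F_Ê(.) worked best (see Table II).

```python
import torch
import torch.nn as nn

num_companies, delta_t, eta = 4, 3, 8       # |C|, observation horizon, model size (toy values)

E = nn.Embedding(num_companies, eta)         # trainable company embedding matrix E
punc = torch.zeros(eta)                      # [PUNC] token vector in R^eta

# x[i, t] holds the four OHCL-features of company c_i at timestep t, min-max normalized.
x = torch.rand(num_companies, delta_t, 4)

sequence = []
for t in range(delta_t):
    for i in range(num_companies):
        e_hat = E(torch.tensor(i))                       # ê_i
        # Each OHCL-feature scales one quarter of the eta dimensions (multiplicative ⋆).
        scale = x[i, t].repeat_interleave(eta // 4)      # shape (eta,)
        sequence.append(e_hat * scale)                   # e_i^(t)
    sequence.append(punc)                                # [PUNC] separates timesteps

A = torch.stack(sequence, dim=0)             # input sequence of length |C|*delta_t + delta_t
print(A.shape)                               # torch.Size([15, 8])
```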
D. Reinterpretation as Text

The final methodology incorporates F_E(.), F_S(.), and F_T(.), and applies tokenization to the regression values. This tokenization parallels a discretization process, bearing resemblance to the embedding-scaling techniques, albeit its application within the domain of equities is notably unconventional. It has precedent in other ML contexts; e.g. discretization can be analogized to the utilization of histograms for transforming continuous variables, as for example in [29].

This strategy fundamentally reinterprets X not as a tensor, but rather as textual data predominantly composed of numerical entries. Consequently, this technique facilitates a comprehensive reorientation of the problem from SPP to processes typical for NLP, marking a significant paradigm shift in the approach to data analysis.

To do this, we define the vocabulary V using the word-tokens from the set

C ∪ {"-", 0, "[PUNC]"} ∪ {n | n ∈ N, n ≤ θ_V}.    (3)

The input is represented as a sentence

b^(j) = [c_1 x̃_1^(j) c_2 x̃_2^(j) ... c_|C| x̃_|C|^(j)]    (4)

and the whole input as

B = [b^(1) [PUNC] b^(2) [PUNC] ... b^(∆t)].    (5)

For each c_i, stock ticker symbols (such as [APPL]) can be employed as single tokens, akin to the methodologies outlined in [8].

We define each transformed x_i^(t) as x̃_i^(t) by choosing only one price feature (i.e. the Closing Price), implementing a digit reversal of the original values, subsequently multiplying by a scaling factor 10^θ_x, and rounding to θ_x decimal places. This reversal process restructures the price display such that each position aligns uniformly across different companies; for instance, the most significant digit represents the fourth decimal place, the next significant digit represents the third decimal place, and so forth. This methodological adjustment has been found to enhance the stability of the training process. An alternative method could involve employing the default F_T(.) and V of the corresponding language model, offering a different avenue for data preprocessing and analysis.
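A small sketch of this reinterpretation for the Closing Price with θ_x = 2; the ticker placeholders, example prices and helper names are illustrative and not taken from the paper:

```python
# Toy tokenization of closing prices into "text", with the digit reversal described above.
theta_x = 2                       # number of decimal places kept after scaling

def transform_price(p: float) -> str:
    """Scale, round and digit-reverse a closing price, e.g. 123.45 -> '54321'."""
    scaled = round(p * 10 ** theta_x)          # e.g. 123.45 -> 12345
    return str(scaled)[::-1]                   # reverse so the least significant digit comes first

tickers = ["[APPL]", "[MSFT]"]                 # company tokens c_i (illustrative)
closing = {"[APPL]": [123.45, 124.10], "[MSFT]": [310.02, 309.88]}  # two timesteps

sentence_parts = []
for t in range(2):                             # iterate over the observation horizon
    for ticker in tickers:                     # b^(t): company token followed by its transformed price
        sentence_parts += [ticker, transform_price(closing[ticker][t])]
    sentence_parts.append("[PUNC]")            # timestep separator

print(" ".join(sentence_parts))
# [APPL] 54321 [MSFT] 20013 [PUNC] [APPL] 01421 [MSFT] 88903 [PUNC]
```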

IV. METHODS

In adherence to classical methodologies within the realm of LLMs, we delineate two principal phases: "pre-training" and "fine-tuning". Pre-training is employed in NLP to instill a foundational understanding of language in LLMs - a "generalized language understanding" [26]. Conversely, fine-tuning is dedicated to adapting the model for specific downstream tasks. As delineated in [26] and [3], the adaptation of popular techniques such as masked language modeling (MLM) [1] and next-sequence prediction (NSP) [1] can be similarly applied to stock prediction models. This approach teaches the models a generalized comprehension of stock data, elucidates intercorrelations among various c_i, and enhances the overall generalized performance. Pre-training stages are deferred to future investigative efforts.

In this research, our focus is primarily on the fine-tuning tasks. These tasks involve the already described SPP, with the target being y = X^(t+o)[D], and SMP, with the target being y = I^(t)(X^(t) > X^(t+o))[D] (adhering to the notation established by Yoo et al. in [30]), with D being the indices of the Closing Price datapoints. For the purposes of this study o = 1 holds, as is conventionally employed in related research, exemplified in studies such as those in [12] or [31].
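As a sketch of how these fine-tuning targets can be constructed (with o = 1 and D selecting the Closing Price rows; the feature ordering and sizes are assumptions for illustration):

```python
import numpy as np

num_companies, horizon = 3, 6
rng = np.random.default_rng(1)

# X[:, t] is the Market Snapshot X^(t); rows are grouped as (Open, High, Close, Low) per company.
X = rng.random((4 * num_companies, horizon))
D = np.arange(2, 4 * num_companies, 4)          # indices of the Closing Price datapoints

t, o = 2, 1
y_spp = X[D, t + o]                             # SPP target: the next Closing Prices
y_smp = (X[D, t] > X[D, t + o]).astype(float)   # SMP target: indicator I(X^(t) > X^(t+o)) on D

print(y_spp.shape, y_smp)                       # (3,) and a vector of 0/1 movement labels
```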

V. DATASET AND EXPERIMENT

The foundational data source for our dataset is the Alpha Vantage API¹, which provides us with extensive research access to stock data. Consistent with the methodology of Voigt et al., our analysis incorporates data from 309 companies listed on the S&P-500 index that had available records dating back to the year 2000. The temporal scope of our dataset extends from 2000 to 2023, with data aggregated at intraday 60-minute intervals. Following the approach adopted, we selected the Closing Price of each timestep interval as the predictive target y. We select the data from the year 2000 to the year 2021 as the training data, used to optimize the model parameters, and the data from 2021 to 2022 as the validation set, used to optimize the hyperparameters. For the final test set evaluation we use the data from 2022 to 2023. This data split is the most established approach in SPP/SMP and is followed, for example, in [32], [33] or [34].

¹ https://www.alphavantage.co/

The phenomenon of absent data points is frequently observed in intraday datasets, particularly exacerbated with finer time granularity. Conversely, interday datasets typically exhibit minimal occurrences of missing x_i^(t) values. To address this issue, in cases where data points are absent, we impute the value by padding from the most recent existing x_i^(t) entry, as done in [3]. All values are min-max normalized for each c_i and OHCL-feature respectively, as done in previous studies such as [35].
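A minimal pandas sketch of this preprocessing, i.e. forward-fill imputation of missing intraday values, per-column min-max normalization and the chronological split; the column names, example values and exact year boundaries are illustrative assumptions:

```python
import pandas as pd

# Hourly closing prices, one column per company; gaps represent missing intraday datapoints.
prices = pd.DataFrame(
    {"AAA": [10.0, None, 11.0, 11.5], "BBB": [20.0, 20.5, None, 21.0]},
    index=pd.date_range("2000-01-03 10:00", periods=4, freq="h"),
)

filled = prices.ffill()                                                # pad from the most recent existing entry
normalized = (filled - filled.min()) / (filled.max() - filled.min())   # min-max per column

# Chronological split, shown here on the index year (train / validation / test as described above).
train = normalized[normalized.index.year < 2021]
val = normalized[normalized.index.year == 2021]
test = normalized[normalized.index.year == 2022]
```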
TABLE I
RESULTS FOR EMBEDDING BASED APPROACH.

Model             | SMP (∆t = 4 / 8 / 16) | SPP (∆t = 4 / 8 / 16)
BERT_η=64         | 50.5 / 50.6 / 50.9    | 3.8 / 3.6 / 3.7
BERT              | 53.2 / 53.4 / 52.9    | 2.7 / 2.8 / 2.8
BERT_base         | 52.0 / 52.2 / 52.1    | 2.6 / 2.8 / 2.9
BERT_large        | 54.1 / 54.3 / 55.1    | 3.0 / 3.2 / 3.1
BERT_pre-trained  | 50.6 / 50.7 / 50.9    | 3.5 / 3.8 / 3.3

TABLE II
RESULTS FOR STOCK2SENTENCE BASED APPROACH. The subscript specifies how E was initialized. Here the "Stock Embeddings" are from [11] and the "Stock Embeddings-CBR" are from [10] (for c_i for which there were no weights, random initialization was used). The S2V approaches were taken from [3] and retrained. For all approaches, it was empirically determined that the use of Ê, θ_norm = 100 and ⋆ as an addition operation works best. To validate this assumption, two more approaches were trained.

Model                      | SMP (∆t = 4 / 5 / 6) | SPP (∆t = 4 / 5 / 6)
BERT                       | 54.1 / 53.7 / 53.9   | 2.6 / 2.2 / 2.5
BERT_Stock Embeddings      | 50.7 / 50.7 / 50.8   | 3.6 / 3.6 / 3.7
BERT_Stock Embeddings-CBR  | 50.9 / 51.2 / 51.3   | 2.9 / 3.1 / 3.1
BERT_S2V-SG                | 53.4 / 53.6 / 53.6   | 1.9 / 2.0 / 2.0
BERT_S2V-CBOS              | 54.3 / 54.1 / 53.9   | 2.5 / 2.4 / 2.5
BERT_⋆=Multiplication      | 50.3 / 50.9 / 51.2   | 3.8 / 3.9 / 3.6
BERT_⋆=Addition            | 51.2 / 51.4 / 52.5   | 3.0 / 3.2 / 2.9

TABLE III
RESULTS FOR TOKENIZATION BASED APPROACH.

Model                                 | SMP (∆t = 3 / 4 / 5) | SPP (∆t = 3 / 4 / 5)
BERT_V                                | 51.2 / 51.1 / 51.4   | 3.7 / 3.4 / 2.9
BERT_Default-BERT-Vocab               | 50.8 / 50.9 / 50.8   | 3.6 / 3.6 / 3.7
BERT_Default-BERT-Vocab, Pre-trained  | 50.3 / 50.8 / 50.9   | 3.5 / 3.5 / 3.8

VI. RESULTS

In Table I, we delineate the outcomes for the embedding-based approach; in Table II, we present the results for the Stock2Sentence-based approach; and in Table III, the outcomes for the tokenization-based method are detailed. Due to the scaling of the input length by |C| for the Stock2Sentence approach and by the factor |C| · v for the tokenization-based approach, ∆t must be selected significantly smaller for these two approaches than for the embedding-based approach. The hope is that the detailed correlation and relationship modeling between the individual stocks will nevertheless enable good performance to be achieved. We report the Accuracy for SMP and the RMSE for SPP, scaled by a factor of 100, on the min-max normalized values. Each experiment was repeated five times.
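The reported quantities can be sketched as follows (a toy computation on assumed values, with predictions and targets on the min-max normalized scale):

```python
import numpy as np

y_true = np.array([0.52, 0.47, 0.61, 0.55])     # normalized next closing prices
y_pred = np.array([0.50, 0.49, 0.58, 0.57])     # model outputs for SPP

rmse_scaled = np.sqrt(np.mean((y_true - y_pred) ** 2)) * 100   # RMSE scaled by a factor of 100
print(round(rmse_scaled, 2))                                    # 2.29

moves_true = np.array([1, 0, 1, 1])             # ground-truth movement labels
moves_pred = np.array([1, 0, 0, 1])             # thresholded SMP outputs
accuracy = np.mean(moves_true == moves_pred) * 100
print(accuracy)                                  # 75.0
```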
It was observed that there is considerable variation in the performance across different companies, which supports the hypothesis that predictability may vary significantly among stocks. This could have been one of the reasons why in [25] (see Section II) only single stocks were chosen for prediction. A real-world application of ASMs could be based on a trading strategy in which investments are only made in stocks with high predictability / high accuracy values.
A. Investigating Time and Stock Relationships

As delineated in Paragraph III-C0a, the Stock2Sentence approach facilitates the representation of each of the |C| · ∆t data points as an individual component of the input sequence. This methodological innovation permits a detailed examination of the internal model attention mechanisms with respect to each data point. Consequently, it becomes feasible to analyze the relative importance of each c_i at each t in relation to every other c_i at each t. A visualization is given in Figure 3.

Fig. 3. Grad-CAM visualisation of attention scores for the Stock2Sentence approach. The darker the color the higher the relevance (red) and the lighter (yellow) the lower. Please refer to the Appendix for a detailed explanation of the used visualization algorithm.

In contrast to prior methodologies, which were restricted to investigating either the attention across different c_i, as in [30], or across different t, as in [36], the Stock2Sentence approach provides a more granular and interconnected analysis of attention dynamics, thereby enhancing our understanding of model behavior across both company and time dimensions.

VII. DISCUSSION AND FUTURE RESEARCH

The embedding-based methodology aligns closely with conventional models, such as the approach detailed by Ding et al. in [24]. Consequently, the results obtained through this method are comparable, and in some instances slightly inferior, as evidenced in the results presented in Table I. Notably, the effectiveness of the model appears to be positively correlated with its size. For instance, when η is set to 64, the model performance merely approaches the baseline threshold. Additionally, initializing the model with weights pre-trained for NLP tasks (denoted as BERT_pre-trained) detrimentally impacts its performance. This reduction in efficacy is anticipated given the original training objectives of the pre-trained model, which differ significantly from the current application. Moreover, variations in ∆t exert a minimal influence on the model's performance.

In the Stock2Sentence methodology, the initialization of E with pre-trained weights from Dolphin et al. from [11] and [10] yields suboptimal results. This inefficacy is likely attributable to the small values of η used during the training of these models (η = 16 or η = 20). Future investigations could reconsider these approaches by re-training the models from [10] and [11] and potentially employing larger dimensions for E. Additionally, strategies that do not incorporate Ê also demonstrate inadequate performance. A plausible explanation for this phenomenon is the scaling of ê, particularly when operations such as ⋆ as multiplication are applied. This scaling may cause price information to excessively overlap with the semantic content encapsulated in ê, to the extent that the model becomes incapable of accurately understanding ê.

Tokenization-based methodologies exhibit generally poor performance. One potential issue is the increased input length that results from tokenizing regression values. This approach is especially ineffective when applied to pre-trained BERT models together with the BERT vocabulary, a deficiency which may be attributed to the significant divergence of the tokenized stock data from the natural language corpus originally used to train the BERT models.

As previously emphasized, the primary goal of this research is to elucidate the intrinsic properties of these models and their capability to obviate the necessity for human analysis or domain-specific expertise. The preliminary results demonstrate the efficacy of the outlined models, which, in many instances, surpass the established baseline and exhibit substantial potential for generating profits. Given the innovative nature of these methodologies, there is considerable scope for enhancing their performance through further investigation, particularly concerning the tokenization-based approach. Prior studies, such as [17] and [18], suggest that incorporating trend and histogram inputs in LLMs may yield more effective results. Additionally, employing more intricate technical indicators for defining e or including fundamental data could further augment model performance.

One of the primary advantages of the Stock2Sentence approach (as well as of the tokenization approach) is its capability to eschew a fixed embedding dimension position for each c_i, unlike the majority of existing state-of-the-art SPP/SMP models. This flexibility enables the model to accommodate new entities, such as emerging companies, by dynamically integrating or excluding specific x_i^(t) values, for instance during periods of zero trading or when data is missing. Additionally, the approach allows for the incorporation of contextual fundamental data, such as data derived from processing current news and social media. Notably, this last methodology warrants further exploration, as incorporating fundamental data often enhances model performance by embedding additional contextual information that may not be readily apparent from mere time series data alone.

Future research should prioritise the integration of further, publicly available LLMs, such as GPT-2 [37], Transformer-XL [38], T5 [39], and LLaMA [40], within the models used for the Stock2Sentence or tokenization methodologies. Moreover, it is essential to adapt pre-training techniques such as MLM and NSP.
These methodologies could facilitate the models' ability to discern inter-correlations among different c_i, which is regarded as a crucial strategy for enhancing the accuracy of SMP/SPP.

Currently, our research prioritizes predictive models, which are common in the domains of SPP and SMP, due to the complexity associated with generative approaches that forecast over extended future periods. However, generative models can still be employed as suggested by [3], where predictions are selectively generated for specific c_i (or t), i.e. ŷ^(t) ∉ R^|C|. This approach enables the model to restrict its predictions to prices or price movements where it possesses a significant degree of confidence and incorporates the idea of the different predictability of individual stocks.

VIII. SUMMARY

In conclusion, we have delineated three methodologies to replace specific components of a conventional NLP model pipeline of LLMs tailored for regression data, exemplified in the context of SPP/SMP. We have expounded upon the benefits of our models and validated our methodologies through empirical testing of the BERT model. Furthermore, we have suggested avenues for future research focusing on adaptations of generative speech models and on the incorporation of textual data.

ACKNOWLEDGMENT

The stock data used for the models presented in this research was collected courtesy of research access kindly provided by Alpha Vantage¹.

We used the chatGPT AI² to improve the text in all sections of this work.

² https://chat.openai.com/

We thank Rian Dolphin, who provided us with his trained weights for the initialization of E from his publications [11] and [10].

APPENDIX

To visualize the Stock2Sentence approach we adapt and modify the Grad-CAM visualization from [41]. The relevance map R ∈ R^{(|C|·∆t)×(|C|·∆t)} visualizes the relevance of each of the |C| · ∆t input points with respect to each other. It is calculated as

R = Ā and ∇A := ∂( (1/|C|) · Σ_{c_i ∈ C} H(ŷ_i, y_i) ) / ∂A    (6)

following the notation of [41].
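A simplified sketch of how such a gradient-weighted relevance map can be computed, assuming access to an attention map A and a differentiable loss; this follows the general Grad-CAM-style idea and is not a reproduction of the exact procedure of [41]:

```python
import torch

num_points = 6                                   # |C| * delta_t input points (toy size)
A = torch.rand(num_points, num_points, requires_grad=True)   # attention map of one head/layer

# Toy per-stock losses H(y_hat_i, y_i) whose mean is differentiated with respect to A.
predictions = A.sum(dim=1)                       # placeholder forward pass that depends on A
targets = torch.rand(num_points)
loss = torch.nn.functional.mse_loss(predictions, targets)

grad_A = torch.autograd.grad(loss, A)[0]         # gradient of the averaged loss w.r.t. A
R = (A * grad_A).clamp(min=0).detach()           # gradient-weighted, non-negative relevance map
print(R.shape)                                   # torch.Size([6, 6]) -> (|C|·∆t) x (|C|·∆t) relevances
```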
REFERENCES

[1] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in North American Chapter of the Association for Computational Linguistics, 2019.
[2] X. Zhang, Y. Zhang, S. Wang, Y. Yao, B. Fang, and P. S. Yu, "Improving stock market prediction via heterogeneous information fusion," Knowledge-Based Systems, vol. 143, pp. 236–247, Mar. 2018.
[3] F. Voigt, K. von Luck, and P. Stelldinger, "Assessment of the applicability of large language models for quantitative stock price prediction," in Proceedings of the 17th International Conference on PErvasive Technologies Related to Assistive Environments, PETRA '24, (New York, NY, USA), pp. 293–302, Association for Computing Machinery, 2024.
[4] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," in 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, 2013.
[5] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," CoRR, vol. abs/1310.4546, 2013.
[6] R. DeFusco, D. McLeavey, J. Pinto, and D. Runkle, Quantitative Investment Analysis. John Wiley & Sons, 2015.
[7] A. S. Wafi, H. Hassan, and A. Mabrouk, "Fundamental analysis models in financial markets – review study," Procedia Economics and Finance, vol. 30, pp. 939–947, 2015. IISES 3rd and 4th Economics and Finance Conference.
[8] X. Gabaix, R. Koijen, and M. Yogo, "Asset embeddings," SSRN Electronic Journal, Jan. 2023.
[9] B. Sarmah, N. Nair, D. Mehta, and S. Pasquali, "Learning embedded representation of the stock correlation matrix using graph machine learning," 2022.
[10] R. Dolphin, B. Smyth, and R. Dong, "Industry classification using a novel financial time-series case representation," 2023.
[11] R. Dolphin, B. Smyth, and R. Dong, "Stock embeddings: Learning distributed representations for financial assets," 2022.
[12] F. Feng, X. He, X. Wang, C. Luo, Y. Liu, and T.-S. Chua, "Temporal relational ranking for stock prediction," ACM Trans. Inf. Syst., vol. 37, Mar. 2019.
[13] W. Li, R. Bao, K. Harimoto, D. Chen, J. Xu, and Q. Su, "Modeling the stock relation with graph network for overnight stock movement prediction," in Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20 (C. Bessiere, ed.), pp. 4541–4547, International Joint Conferences on Artificial Intelligence Organization, July 2020. Special Track on AI in FinTech.
[14] G. Ang and E.-P. Lim, "Guided attention multimodal multitask financial forecasting with inter-company relationships and global and local news," in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (S. Muresan, P. Nakov, and A. Villavicencio, eds.), (Dublin, Ireland), pp. 6313–6326, Association for Computational Linguistics, May 2022.
[15] X. Du and K. Tanaka-Ishii, "Stock embeddings acquired from news articles and price history, and an application to portfolio optimization," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault, eds.), (Online), pp. 3353–3363, Association for Computational Linguistics, July 2020.
[16] R. Kim, C. H. So, M. Jeong, S. Lee, J. Kim, and J. Kang, "HATS: A hierarchical graph attention network for stock movement prediction," 2019.
[17] X. Li, X. Shen, Y. Zeng, X. Xing, and J. Xu, "FinReport: Explainable stock earnings forecasting via news factor analyzing model," 2024.
[18] X. Yu, Z. Chen, and Y. Lu, "Harnessing LLMs for temporal data - a study on explainable financial time series forecasting," in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track (M. Wang and I. Zitouni, eds.), (Singapore), pp. 739–753, Association for Computational Linguistics, Dec. 2023.
[19] J. Li, R. Selvaraju, A. D. Gotmare, S. Joty, C. Xiong, and S. Hoi, "Align before fuse: Vision and language representation learning with momentum distillation," 2021.
[20] A. Pagliaro, "Forecasting significant stock market price changes using machine learning: Extra trees classifier leads," Electronics, vol. 12, no. 21, 2023.
[21] Y. Qin, D. Song, H. Chen, W. Cheng, G. Jiang, and G. W. Cottrell, "A dual-stage attention-based recurrent neural network for time series prediction," CoRR, vol. abs/1704.02971, 2017.
[22] J. L. Elman, "Finding structure in time," Cognitive Science, vol. 14, pp. 179–211, 1990.
[23] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, pp. 1735–1780, Dec. 1997.
[24] Q. Ding, S. Wu, H. Sun, J. Guo, and J. Guo, “Hierarchical multi-scale
gaussian transformer for stock movement prediction,” in Proceedings
of the Twenty-Ninth International Joint Conference on Artificial Intelli-
gence, IJCAI-20 (C. Bessiere, ed.), pp. 4640–4646, International Joint
Conferences on Artificial Intelligence Organization, 7 2020. Special
Track on AI in FinTech.
[25] T.-T. Nguyen and S. Yoon, “A novel approach to short-term stock price
movement prediction using transfer learning,” Applied Sciences, vol. 9,
no. 22, 2019.
[26] F. Voigt, “Adapting natural language processing strategies for stock price
prediction.” DC@KI2023: Proceedings of Doctoral Consortium at KI
2023, 2023.
[27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances
in Neural Information Processing Systems (I. Guyon, U. V. Luxburg,
S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett,
eds.), vol. 30, Curran Associates, Inc., 2017.
[28] P. Shaw, J. Uszkoreit, and A. Vaswani, “Self-attention with relative
position representations,” 2018.
[29] S. Oymak, M. Mahdavi, and J. Chen, “Learning feature nonlinearities
with regularized binned regression,” in 2019 IEEE International Sym-
posium on Information Theory (ISIT), pp. 1452–1456, 2019.
[30] J. Yoo, Y. Soun, Y.-c. Park, and U. Kang, “Accurate multivariate
stock movement prediction via data-axis transformer with multi-level
contexts,” in Proceedings of the 27th ACM SIGKDD Conference on
Knowledge Discovery & Data Mining, KDD ’21, (New York, NY, USA),
p. 2037–2045, Association for Computing Machinery, 2021.
[31] J. Liu, H. Lin, X. Liu, B. Xu, Y. Ren, Y. Diao, and L. Yang,
“Transformer-based capsule network for stock movement prediction,” in
Proceedings of the First Workshop on Financial Technology and Natural
Language Processing, (Macao, China), pp. 66–73, Aug. 2019.
[32] K. Chen, Y. Zhou, and F. Dai, “A lstm-based method for stock returns
prediction: A case study of china stock market,” 2015 IEEE International
Conference on Big Data (Big Data), pp. 2823–2824, 2015.
[33] E. Ramos-Pérez, P. J. Alonso-González, and J. J. Núñez-Velázquez,
“Multi-transformer: A new neural network-based architecture for fore-
casting S&P volatility,” Mathematics, vol. 9, no. 15, 2021.
[34] W. Lu, J. Li, J. Wang, and L. Qin, “A cnn-bilstm-am method for
stock price prediction,” Neural Computing and Applications, vol. 33,
pp. 4741–4753, May 2021.
[35] J. M.-T. Wu, Z. Li, N. Herencsar, B. Vo, and J. C.-W. Lin, “A graph-
based cnn-lstm stock price prediction algorithm with leading indicators,”
Multimedia Systems, vol. 29, pp. 1751–1770, Jun 2023.
[36] J. Chen, T. Chen, M. Shen, Y. Shi, D. Wang, and X. Zhang, “Gated three-
tower transformer for text-driven stock market prediction,” Multimedia
Tools and Applications, vol. 81, pp. 30093–30119, Sep 2022.
[37] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever,
“Language models are unsupervised multitask learners,” 2019.
[38] Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell, Q. V. Le, and R. Salakhut-
dinov, “Transformer-xl: Attentive language models beyond a fixed-
length context,” in Annual Meeting of the Association for Computational
Linguistics, 2019.
[39] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena,
Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning
with a unified text-to-text transformer,” Journal of Machine Learning
Research, vol. 21, no. 140, pp. 1–67, 2020.
[40] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux,
T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez,
A. Joulin, E. Grave, and G. Lample, “Llama: Open and efficient
foundation language models,” 2023.
[41] H. Chefer, S. Gur, and L. Wolf, “Generic attention-model explain-
ability for interpreting bi-modal and encoder-decoder transformers,” in
2021 IEEE/CVF International Conference on Computer Vision (ICCV),
pp. 387–396, 2021.
