Leveraging State-of-the-Art Topic Modeling for News Impact Analysis on Financial Markets: A Comparative Study
Figure 1. Existing news impact analysis process.
Figure 2. Categories of topic models.
Figure 3. NIA Parameters Model (PM).
Figure 4. Extended software architecture for news impact analysis.
Figure 5. The defined generative process of LDA.
Figure 6. The generative process of Top2Vec and BERTopic.
Figure 7. A sample of the news data used in the experiment.
Figure 8. Visual representation of LDA-generated topics (100 topics and 10 maximum iterations).
Figure 9. Word cloud for one of the Top2Vec-generated topics based on “banking”.
Figure 10. Topic words by the BERTopic model.
Figure 11. Visual presentation of BERTopic-generated topics.
Figure 12. Part of hierarchical reduction by the BERTopic model.
Figure 13. Comparison of various topic models used in the experiment by (a) training time, (b) c_v coherence score, and (c) u_mass coherence score.
Figure 14. The news selection process based on the “banking” topic and negative sentiments.
Figure 15. Mean abnormal returns identified by the news impact analysis.
Abstract
1. Introduction
1.1. Background and Motivation
1.2. Research Question and Contributions
- Reviewing and summarizing state-of-the-art TM techniques.
- Integrating TM techniques with the existing financial news impact analysis process.
- Comparing and evaluating TM techniques in the context of financial news impact analysis on a large-scale corpus of long financial news articles.
- Providing a systematic method and guidelines for finance domain experts and NLP practitioners to conduct financial news impact analysis leveraging TM capabilities.
1.3. Structure of the Article
2. Literature Review
3. News Impact Analysis Using Topic Modeling Techniques
3.1. NIA Framework
3.2. Integrated Topic Modeling Techniques
3.2.1. LDA
3.2.2. Top2Vec
3.2.3. BERTopic
4. Experiments and Results
4.1. Experimental Setup
4.1.1. Dataset
4.1.2. Hardware and Software Prototype Implementation
- Operating System: Windows 11 (64-bit)
- CPU: 11th Gen Intel(R) Core(TM) i7-11370H @ 3.30 GHz
- RAM: 16 GB
4.1.3. Scenario
4.2. Results
4.2.1. LDA Results
4.2.2. Top2Vec Results
4.2.3. BERTopic Results
4.2.4. Evaluation of Topic Models
4.2.5. News Impact Analysis Results
4.3. Discussion
4.3.1. Comparison between LDA, Top2Vec and BERTopic
4.3.2. Validating the NIA Framework
5. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Anbaee Farimani, S.; Vafaei Jahan, M.; Milani Fard, A.; Tabbakh, S.R.K. Investigating the informativeness of technical indicators and news sentiment in financial market price prediction. Knowl.-Based Syst. 2022, 247, 108742.
2. Chen, W.; El Majzoub, A.; Al-Qudah, I.; Rabhi, F.A. A CEP-driven framework for real-time news impact prediction on financial markets. Serv. Oriented Comput. Appl. 2023, 17, 129–144.
3. Bonifazi, G.; Cauteruccio, F.; Corradini, E.; Marchetti, M.; Sciarretta, L.; Ursino, D.; Virgili, L. A Space-Time Framework for Sentiment Scope Analysis in Social Media. Big Data Cogn. Comput. 2022, 6, 130.
4. TajMazinani, M.; Hassani, H.; Raei, R. A comprehensive review of stock price prediction using text mining. Adv. Decis. Sci. 2022, 26, 116–152.
5. Lauriola, I.; Lavelli, A.; Aiolli, F. An introduction to Deep Learning in Natural Language Processing: Models, techniques, and tools. Neurocomputing 2022, 470, 443–456.
6. Allen, D.E.; McAleer, M.; Singh, A.K. Daily market news sentiment and stock prices. Appl. Econ. 2019, 51, 3212–3235.
7. Taj, S.; Shaikh, B.B.; Meghji, A.F. Sentiment Analysis of News Articles: A Lexicon based Approach. In Proceedings of the 2019 2nd International Conference on Computing, Mathematics and Engineering Technologies (iCoMET), Sukkur, Pakistan, 30–31 January 2019; pp. 1–5.
8. Shahzad, F.; Yannan, D.; Kamran, H.W.; Suksatan, W.; Nik Hashim, N.A.A.; Razzaq, A. Outbreak of epidemic diseases and stock returns: An event study of emerging economy. Econ. Res.-Ekon. Istraživanja 2022, 35, 2313–2332.
9. Eachempati, P.; Srivastava, P.R.; Kumar, A.; Muñoz de Prat, J.; Delen, D. Can customer sentiment impact firm value? An integrated text mining approach. Technol. Forecast. Soc. Chang. 2022, 174, 121265.
10. Lin, W.-C.; Tsai, C.-F.; Chen, H. Factors affecting text mining based stock prediction: Text feature representations, machine learning models, and news platforms. Appl. Soft Comput. 2022, 130, 109673.
11. Ashtiani, M.N.; Raahemi, B. News-based intelligent prediction of financial markets using text mining and machine learning: A systematic literature review. Expert Syst. Appl. 2023, 217, 119509.
12. Chen, W.; Al-Qudah, I.; Rabhi, F. A Framework for Facilitating Reproducible News Sentiment Impact Analysis. In Proceedings of the 2022 the 5th International Conference on Software Engineering and Information Management (ICSIM), Yokohama, Japan, 21–23 January 2022; pp. 125–131.
13. Bonifazi, G.; Cauteruccio, F.; Corradini, E.; Marchetti, M.; Montella, D.; Scarponi, S.; Ursino, D.; Virgili, L. Performing Wash Trading on NFTs: Is the Game Worth the Candle? Big Data Cogn. Comput. 2023, 7, 38.
14. Churchill, R.; Singh, L. The Evolution of Topic Modeling. ACM Comput. Surv. 2022, 54, 215.
15. Vayansky, I.; Kumar, S.A.P. A review of topic modeling methods. Inf. Syst. 2020, 94, 101582.
16. Abdelrazek, A.; Eid, Y.; Gawish, E.; Medhat, W.; Hassan, A. Topic modeling algorithms and applications: A survey. Inf. Syst. 2023, 112, 102131.
17. Gallagher, R.J.; Reing, K.; Kale, D.; Ver Steeg, G. Anchored Correlation Explanation: Topic Modeling with Minimal Domain Knowledge. Trans. Assoc. Comput. Linguist. 2017, 5, 529–542.
18. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
19. Moody, C.E. Mixing dirichlet topic models and word embeddings to make lda2vec. arXiv 2016, arXiv:1605.02019.
20. Gerlach, M.; Peixoto, T.P.; Altmann, E.G. A network approach to topic models. Sci. Adv. 2018, 4, eaaq1360.
21. Bhat, M.R.; Kundroo, M.A.; Tarray, T.A.; Agarwal, B. Deep LDA: A new way to topic model. J. Inf. Optim. Sci. 2020, 41, 823–834.
22. Angelov, D. Top2vec: Distributed representations of topics. arXiv 2020, arXiv:2008.09470.
23. Grootendorst, M. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv 2022, arXiv:2203.05794.
24. Bonifazi, G.; Corradini, E.; Ursino, D.; Virgili, L. Defining user spectra to classify Ethereum users based on their behavior. J. Big Data 2022, 9, 37.
25. Maier, D.; Waldherr, A.; Miltner, P.; Wiedemann, G.; Niekler, A.; Keinert, A.; Pfetsch, B.; Heyer, G.; Reber, U.; Häussler, T.; et al. Applying LDA Topic Modeling in Communication Research: Toward a Valid and Reliable Methodology. Commun. Methods Meas. 2018, 12, 93–118.
26. Asmussen, C.B.; Møller, C. Smart literature review: A practical topic modelling approach to exploratory literature review. J. Big Data 2019, 6, 93.
27. Hu, N.; Zhang, T.; Gao, B.; Bose, I. What do hotel customers complain about? Text analysis using structural topic model. Tour. Manag. 2019, 72, 417–426.
28. Chen, X.; Zou, D.; Cheng, G.; Xie, H. Detecting latent topics and trends in educational technologies over four decades using structural topic modeling: A retrospective of all volumes of Computers & Education. Comput. Educ. 2020, 151, 103855.
29. Ghasiya, P.; Okamura, K. Investigating COVID-19 News across Four Nations: A Topic Modeling and Sentiment Analysis Approach. IEEE Access 2021, 9, 36645–36656.
30. Poongodi, M.; Nguyen, T.N.; Hamdi, M.; Cengiz, K. Global cryptocurrency trend prediction using social media. Inf. Process. Manag. 2021, 58, 102708.
31. Egger, R.; Yu, J. A Topic Modeling Comparison between LDA, NMF, Top2Vec, and BERTopic to Demystify Twitter Posts. Front. Sociol. 2022, 7, 80–92.
32. Yin, H.; Song, X.; Yang, S.; Li, J. Sentiment analysis and topic modeling for COVID-19 vaccine discussions. World Wide Web 2022, 25, 1067–1083.
33. Egger, R.; Yu, J. Identifying hidden semantic structures in Instagram data: A topic modelling comparison. Tour. Rev. 2022, 77, 1234–1246.
34. García-Méndez, S.; de Arriba-Pérez, F.; Barros-Vila, A.; González-Castaño, F.J.; Costa-Montenegro, E. Automatic detection of relevant information, predictions and forecasts in financial news through topic modelling with Latent Dirichlet Allocation. Appl. Intell. 2023.
35. Alcoforado, A.; Ferraz, T.P.; Gerber, R.; Bustos, E.; Oliveira, A.S.; Veloso, B.M.; Siqueira, F.L.; Costa, A.H.R. ZeroBERTo: Leveraging Zero-Shot Text Classification by Topic Modeling; Springer: Cham, Switzerland, 2022; pp. 125–136.
36. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692.
37. Singh, B.; Dhall, R.; Narang, S.; Rawat, S. The Outbreak of COVID-19 and Stock Market Responses: An Event Study and Panel Data Analysis for G-20 Countries. Glob. Bus. Rev. 2020, 0972150920957274.
38. Birhane, A.; Kasirzadeh, A.; Leslie, D.; Wachter, S. Science in the age of large language models. Nat. Rev. Phys. 2023, 5, 277–280.
Ref. | Year | Corpus and Size | Length | TM Technique | Details |
---|---|---|---|---|---|
[25] | 2018 | Webpages (10,000) | long | LDA | Applying LDA to identify websites related to food safety issues and highlighting the potential of LDA as a valuable tool for communication researchers. |
[26] | 2019 | Research papers (650) | long | LDA | Smart literature review conducted by loading research papers, using LDA to generate topics, and selecting appropriate topics for literature review. |
[27] | 2019 | Hotel reviews (27,864) | long | STM | Analyzing New York City hotel reviews using STM and showing that it improves inferences about consumer dissatisfaction. |
[28] | 2020 | Research papers (3963) | long | STM | Utilizing STM to identify research topics out of the title, keywords, and the abstract of articles published in the journal Computers & Education over 42 years. |
[29] | 2021 | News (100,000) | long | Top2Vec | Identifying the most widely reported topics or issues within COVID-related news in outlets of UK, India, Japan, and South Korea using Top2Vec, followed by news sentiment analysis using the RoBERTa model. |
[30] | 2021 | Forum posts and Tweets (Unknown size) | short | LDA | Analyzing bitcoin-related posts on Twitter, Reddit and Bitcoin Talk using LDA, and the result is then used by an LSTM-based neural system for stock price prediction. |
[31] | 2022 | Tweets (31,800) | short | LDA NMF Top2Vec BERTopic | Adopting and comparing four TM techniques on social media data for the purpose of social science research. NMF and BERTopic performed better than the other two in this scenario. |
[32] | 2022 | Tweets (78,827) | short | LDA | Extracting the topics using LDA and sentiment polarity using a dictionary-based sentiment analysis method out of the tweets related to the COVID-19 vaccine. |
[33] | 2022 | Instagram (33,881) | short | LDA CorEx NMF | Utilizing three topic modeling techniques to identify traveler experiences out of Instagram posts with a certain hashtag in 2020. |
[34] | 2023 | News (2158) | long | LDA | Topic modeling on financial news using LDA, and highlighting predictions and speculative statements within text via a graphical user interface. |
Service Name | Description |
---|---|
News Import Service | Responsible for importing news data from a variety of data sources as per the user-defined PM and feeding it into the Data Layer as the News Dataset. |
Topic Modeling Service | Conducting topic modeling on the News Dataset, filtering the news by topic based on the user-defined PM, and then generating the Selected News Dataset in the Data Layer, or feeding the result into the following Sentiment Analysis Service for further news filtering. |
Sentiment Analysis Service | Generating sentiment scores and identifying extreme news based on the user-defined PM. Results are committed to the Selected News Dataset in the Data Layer. |
Entity Extraction Service | Generating the Entity-News Pairs in the Data Layer and updating the PM accordingly, based on user selections or the content of selected news in the Selected News Dataset. |
Market Data Import Service | Responsible for importing financial market data based on the defined PM and the Entity-News Pairs. The data are saved as the Market Dataset in the Data Layer. |
Data Integration Service | Merging the Entity-News Pairs and the Market Dataset into the Impact Measures Dataset in the Data Layer. |
Impact Analysis Service | Performing impact analysis based on the PM. Results are committed to the Results Dataset in the Data Layer along with some data visualization. |
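
The hand-off between the Topic Modeling Service and the Sentiment Analysis Service can be sketched as a two-stage filter. The snippet below is a minimal illustration, not the authors' implementation: the record fields `topic` and `sentiment`, the function name `select_news`, and the rule "keep on-topic items whose score falls below the mean score of the on-topic items" are all assumptions made for this example.

```python
from statistics import mean


def select_news(news, topic):
    """Two-stage filter: keep items on `topic`, then keep the negative tail.

    `news` is a list of dicts with hypothetical keys "topic" and "sentiment".
    The threshold is the mean sentiment of the on-topic items (an assumption).
    """
    on_topic = [item for item in news if item["topic"] == topic]
    if not on_topic:
        return []
    threshold = mean(item["sentiment"] for item in on_topic)
    return [item for item in on_topic if item["sentiment"] < threshold]


# Toy records standing in for the News Dataset.
sample_news = [
    {"id": 1, "topic": "banking", "sentiment": -0.9},
    {"id": 2, "topic": "banking", "sentiment": 0.1},
    {"id": 3, "topic": "banking", "sentiment": 0.5},
    {"id": 4, "topic": "energy", "sentiment": -0.7},
]
selected = select_news(sample_news, "banking")
```

Computing the threshold over the on-topic subset rather than the whole corpus is a design choice of this sketch; either reading is plausible, and the two can select different items.
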
Item | LDA | Top2Vec | BERTopic |
---|---|---|---|
Python library | scikit-learn | top2vec | bertopic |
version | 1.1.1 | 1.0.29 | 0.14.1 |
No. of topics | 20, 50, 100, 500 | undefined | undefined |
max iterations | 1, 10 | undefined | undefined |
min topic size | undefined | undefined | 10 |
dimensionality reduction | undefined | UMAP | UMAP |
clustering | undefined | HDBSCAN | HDBSCAN |
topic representation | default | centroid proximity | c-TF-IDF, MMR |
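
The LDA settings in the table above form a small grid of topic counts and iteration limits. As a hedged sketch, the grid can be enumerated as follows. The keyword names `n_components` and `max_iter` are the parameter names of scikit-learn's `LatentDirichletAllocation`; the helper name `lda_grid` and the label pattern "LDA <topics>-<iterations>" are assumptions chosen for this example.

```python
from itertools import product

TOPIC_COUNTS = (20, 50, 100, 500)   # "No. of topics" row for LDA
MAX_ITERATIONS = (1, 10)            # "max iterations" row for LDA


def lda_grid(topic_counts=TOPIC_COUNTS, max_iterations=MAX_ITERATIONS):
    """Return a dict mapping a run label to LDA keyword arguments."""
    grid = {}
    for max_iter, n_topics in product(max_iterations, topic_counts):
        label = f"LDA {n_topics}-{max_iter}"
        grid[label] = {"n_components": n_topics, "max_iter": max_iter}
    return grid


configs = lda_grid()
```

Each value could then be passed on as `LatentDirichletAllocation(**kwargs)` from `sklearn.decomposition`, with all other arguments left at their defaults.
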
Parameter Name | Parameter Value |
---|---|
News data source | News with the type of “Companies and Markets”, sourced from AFR |
Financial instruments | Daily close prices of the stocks of the Big Four Australian banks (security codes: CBA, WBC, NAB, and ANZ), sourced from Yahoo Finance |
Benchmark | All Ordinaries index, sourced from Yahoo Finance |
TM setting | LDA (with 20, 50, 100, 500 topics & 1 and 10 max iterations), Top2Vec, and BERTopic |
SA setting | RoBERTa (threshold = “mean sentiment score”) |
Analysis period | (−20 days, +20 days) |
Comparison period | (−100 days, −21 days) |
Impact measure | Mean cumulative abnormal returns (MCAR) |
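
As a rough sketch of how the comparison period, the analysis period, and the benchmark could combine into the impact measure, the code below fits a market model on comparison-period returns, computes abnormal returns over the analysis window, and averages the cumulative abnormal returns across events. The market model is an assumption of this example (the parameter table names only the benchmark and the two windows), and all function and field names are hypothetical.

```python
from statistics import mean


def fit_market_model(stock_returns, market_returns):
    """Ordinary least squares fit of stock = alpha + beta * market."""
    mean_market = mean(market_returns)
    mean_stock = mean(stock_returns)
    sxx = sum((m - mean_market) ** 2 for m in market_returns)
    sxy = sum((m - mean_market) * (s - mean_stock)
              for m, s in zip(market_returns, stock_returns))
    beta = sxy / sxx
    alpha = mean_stock - beta * mean_market
    return alpha, beta


def cumulative_abnormal_return(stock_returns, market_returns, alpha, beta):
    """Sum of (actual return - expected return) over the analysis window."""
    return sum(s - (alpha + beta * m)
               for s, m in zip(stock_returns, market_returns))


def mean_car(events):
    """Average CAR over events.

    Each event is a dict with daily returns for the comparison period
    ("est_stock", "est_market") and the analysis window
    ("win_stock", "win_market").
    """
    cars = []
    for event in events:
        alpha, beta = fit_market_model(event["est_stock"], event["est_market"])
        cars.append(cumulative_abnormal_return(
            event["win_stock"], event["win_market"], alpha, beta))
    return mean(cars)
```

In the setting of the table, the comparison-period lists would hold the 80 daily returns from day −100 to day −21, and the window lists the returns from day −20 to day +20, for one bank stock and the All Ordinaries index around one selected news item.
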
No. | Top-10 Keywords | Annotation |
---|---|---|
Topic 1 | financial, commission, report, court, regulator, claim, review, action, government, case | economy |
Topic 2 | bank, loan, credit, financial, banking, capital, billion, customer, rate, risk | banking |
Topic 3 | rate, economy, global, china, economic, bond, world, central, investor, policy | economy |
Topic 4 | project, group, construction, contract, infrastructure, road, toll, building, billion, contractor | construction |
Topic 5 | share, shareholder, board, deal, group, investor, offer, capital, executive, director | investment |
Topic 6 | crown, network, medium, telstra, mobile, casino, nbn, news, service, content | communication |
Topic 7 | share, price, growth, profit, earnings, month, billion, stock, analyst, result | investment |
Topic 8 | say, people, time, executive, chief, big, like, make, think, way | - |
Topic 9 | energy, power, solar, electricity, vehicle, car, renewable, battery, generation, wind | energy |
Topic 10 | coal, port, rail, union, queensland, worker, thermal, aurizon, terminal, agreement | energy |
No. | Top-10 Keywords | Annotation |
---|---|---|
Topic 1 | deal, billion, merger, agreement, asset, acquisition, transaction, talk, potential, offer | merger |
Topic 2 | davis, healy, cromwell, quicksilver, pierre, deleted, familiarity, inman, lewinns, callaghan | - |
Topic 3 | bain, cochlear, hearing, tasmanian, implant, piper, fish, remark, salmon, private | hearing |
Topic 4 | santos, pipeline, gas, cooper, apa, williams, narrabri, central, coates, mccormack | energy |
Topic 5 | woolworth, food, coles, supermarket, sale, supplier, price, chain, product, retailer | retail (grocery) |
Topic 6 | fuel, caltex, refinery, petrol, arrium, whyalla, refining, viva, conversion, ampol | petrol |
Topic 7 | bhp, bhps, mackenzie, billion, shale, henry, exploration, production, scarborough, asset | mining |
Topic 8 | coal, thermal, tonne, queensland, coking, miner, hunter, mining, export, whitehaven | mining |
Topic 9 | store, retailer, retail, sale, online, brand, myer, chain, customer, jones | retail |
Topic 10 | bank, banking, customer, westpac, anz, nab, cba, commonwealth, royal, banker | banking |
No. | Top-10 Keywords | Annotation |
---|---|---|
Topic 1 | steadyoil, steady, hang, seng, shanghai, pm, changecash, nikkei, commodities, yr | commodities |
Topic 2 | economists, rba, inflation, reserve, economist, recession, dales, hike, dovish, unemployment | economy |
Topic 3 | emissions, climate, carbon, greenhouse, fossil, decarbonisation, warming, emitting, emission, decarbonise | climate |
Topic 4 | fundie, stocks, conviction, caps, managers, stockpicker, quant, investing, you, equities | investment |
Topic 5 | coles, supermarket, grocery, supermarkets, banducci, woolworths, cain, aldi, groceries, durkan | retail (grocery) |
Topic 6 | strategists, stocks, strategist, cassidy, defensives, tevfik, rotation, financials, overweight, valuations | investment |
Topic 7 | mott, cet, nim, wiles, unquestionably, sproules, nab, anz, cba, banks | banking |
Topic 8 | eu, theresa, brexiters, brexit, tory, boris, brussels, referendum, commons, chancellor | politics |
Topic 9 | monetary, ecb, kuroda, boj, qe, draghi, central, easing, bond, quantitative | economy |
Topic 10 | republican, republicans, democrats, democratic, trump, clinton, biden, congress, presidential, voters | politics |
No. | Top-10 Keywords | Annotation |
---|---|---|
Topic 1 | wine, treasury, penfolds, wines, clarke, estates, china, brands, blass, wolf | wine |
Topic 2 | myer, myers, lew, umbers, premier, hounsell, lews, brookes, store, sales | retail |
Topic 3 | solar, energy, renewable, power, wind, grid, renewables, electricity, rooftop, projects | energy |
Topic 4 | afterpay, afterpays, later, merchants, pay, square, buy, eisen, molnar, credit | payment |
Topic 5 | china, chinese, beijing, trade, hong, xi, us, trump, kong, chinas | politics |
Topic 6 | climate, carbon, emissions, change, zero, warming, risks, transition, risk, climaterelated | climate |
Topic 7 | fed, inflation, feds, yellen, central, rates, monetary, powell, rate, us | economy |
Topic 8 | anz, elliott, anzs, bank, banks, banking, shayne, institutional, loans, loan | banking |
Topic 9 | wesfarmers, bunnings, coles, scott, goyder, homebase, kmart, gillam, stores, conglomerate | retail |
Topic 10 | rio, ore, iron, tonnes, rios, pilbara, mine, production, tonne, jacques | mining |
Topic Model | No. of Topics | Training Time (s) | Coherence (c_v) | Coherence (u_mass) |
---|---|---|---|---|
LDA 20-1 | 20 | 69.44 | 0.575 | −16.939 |
LDA 50-1 | 50 | 331.99 | 0.582 | −16.942 |
LDA 100-1 | 100 | 472.75 | 0.590 | −17.197 |
LDA 500-1 | 500 | 777.06 | 0.591 | −17.246 |
LDA 20-10 | 20 | 237.78 | 0.586 | −17.230 |
LDA 50-10 | 50 | 1363.26 | 0.586 | −17.140 |
LDA 100-10 | 100 | 2052.66 | 0.586 | −17.217 |
LDA 500-10 | 500 | 4244.94 | 0.591 | −17.222 |
Top2Vec | 444 | 3321.50 | 0.545 | −6.957 |
BERTopic | 608 | 2632.65 | 0.823 | −1.156 |
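
To make the u_mass column easier to interpret, here is a minimal sketch of a UMass-style coherence score for a single topic, based on document co-occurrence counts with add-one smoothing. It illustrates the general idea only: the scores in the table were not necessarily produced this way, since library implementations may average over word pairs or use a different smoothing constant. The function name and the toy corpus are hypothetical.

```python
import math


def umass_coherence(top_words, documents):
    """UMass-style coherence for one topic.

    `top_words` is ordered from most to least probable; `documents` is a
    list of token lists. For each pair (earlier word l, later word m) the
    score adds log((D(m, l) + 1) / D(l)), where D counts documents.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            freq_l = doc_freq(top_words[l])
            if freq_l == 0:
                continue  # skip words that never occur in the corpus
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / freq_l)
    return score


toy_docs = [["bank", "loan"], ["bank", "rate"], ["loan", "rate"]]
coherent = umass_coherence(["bank", "loan"], toy_docs)
incoherent = umass_coherence(["bank", "coal"], toy_docs)
```

Scores closer to zero indicate that a topic's top words tend to appear in the same documents, which is the direction in which the table's u_mass values should be read.
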
Period (Days) | MCAR |
---|---|
−20 to +20 | +0.165% |
−10 to +10 | +0.340% |
−5 to +5 | +0.194% |
−1 to +1 | +0.048% |
0 | +0.036% |
−20 to −11 | −1.210% |
−10 to 0 | +0.326% |
0 to +10 | +0.050% |
+11 to +20 | +1.034% |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Chen, W.; Rabhi, F.; Liao, W.; Al-Qudah, I. Leveraging State-of-the-Art Topic Modeling for News Impact Analysis on Financial Markets: A Comparative Study. Electronics 2023, 12, 2605. https://doi.org/10.3390/electronics12122605