CN106022522A - Method and system for predicting stocks based on big data published by internet

Info

Publication number: CN106022522A
Application number: CN201610338598.4A
Authority: CN
Inventors: 马健; 俞扬
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2016-05-20
Filing date: 2016-05-20
Publication date: 2016-10-12

Abstract

The invention discloses a method and system for predicting stocks based on big data published on the Internet. First, stock-related information from before the trading day is crawled. Next, features are extracted from the crawled data to construct a training data set and a prediction model is trained; the model is evaluated by the rate of return over a period of time under the operating rule of selling, at each day's market open, the stocks bought on the previous trading day and buying the stocks recommended for the current trading day. Finally, data from the trading day itself is crawled to construct a new test set, and the trained prediction model is applied to it to obtain the final recommended stocks. The invention provides a new, useful, and reliable source of information for quantitative stock selection and stock forecasting. Combined with traditional data, this information reflects the market more fully. A stock prediction model trained on it with machine learning can better capture the market's internal operating mechanism and can effectively improve investors' returns.

Description

A method and system for predicting stocks based on Internet-disclosed big data

Technical field

The present invention relates to a big data stock prediction method, and in particular to a big data stock prediction method and system based on data published on the Internet, including investors' trading operations, analyst forecasts, investor comments, news, announcements, historical stock prices, capital flows, and fundamentals.

Background art

Before the 1970s, stock investment was a qualitative exercise that made no use of data; it was a subjective art. As computers became widespread, many people began to study the regularities that drive stock price changes and replaced traditional fundamental research methods with models. Concepts such as the price-to-earnings ratio and the price-to-book ratio were born, and quantitative investment emerged.

The move from subjective judgment to quantitative investment is a move from art to science. Before the 1970s, a fundamental researcher could follow only 20 to 50 stocks, so coverage was very limited. A quantitative model can cover all stocks, which was a major leap. In addition, as computer processing power has grown, the amount of information used has also increased sharply. In the past, three indicators were considered enough; now more and more indicators are examined, and predictions have become more accurate.

With the arrival of the 21st century, quantitative investment met a new bottleneck: homogeneous competition. The quantitative models of different institutions have increasingly converged, so their investment results rise and fall together. "Can we find patterns in larger data before the financial statements are available?" This is the problem that big data strategy entrepreneurs are trying to solve.

The investment model designed by Robert Shiller, winner of the 2013 Nobel Prize in Economics, is still praised in the industry. His model relies mainly on three variables: the planned cash flow of the investment project, the estimated cost of the company's capital, and the stock market's reaction to the investment (market sentiment). In his view, the market itself contains elements of subjective judgment: investor sentiment affects investment behavior, and investment behavior directly affects asset prices. By analyzing news, research reports, social information, search behavior, and similar sources with natural language processing methods, computers can extract useful information. With machine learning, big data investment can cover thousands of strategies, whereas quantitative investment in the past could cover only a few dozen.

It follows that traditional stock prediction analyzes and forecasts stocks from the historical trend of stock prices, capital flows, and per-stock information such as market capitalization and price-to-earnings ratio. Today the Internet deeply affects many traditional industries. Compared with decades ago, before the Internet was invented or before it became widespread, the Internet now holds a great deal of stock-related data beyond the traditional stock data, including the actual operations of investors who publish their trades, analyst forecasts, investor comments, news, and announcements. To some extent this information reflects the current stock market, and it also reflects expectations about the future market. The present invention attempts to use these new data together with traditional data, applying techniques such as natural language processing and machine learning, to create a big data stock prediction model.

Summary of the invention:

Purpose of the invention: In view of the problems in the prior art, the present invention proposes a big data quantitative stock selection method and system based on the operating behavior of investors and analysts published on the Internet, to serve as an investment reference for individual investors, fund companies, and others.

Technical solution: The present invention proposes a method for predicting stocks based on big data published on the Internet, comprising the following steps:

1) Crawl stock-related information from before the trading day;

The specific crawling method is: first crawl a set of proxy IPs, then use the Scrapy framework to crawl data from the relevant websites, convert the data into JSON format, and store it in a MongoDB database;
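The crawl-and-store step can be pictured with a small standard-library sketch. It is not the patented crawler: the actual Scrapy spider and the MongoDB write are left out, and the names `make_proxy_rotator` and `to_json_document`, the record fields, and the proxy addresses are all hypothetical. It only illustrates cycling through a pool of crawled proxy IPs and turning one crawled record into a JSON document.

```python
import itertools
import json


def make_proxy_rotator(proxies):
    """Return a function that hands out proxies in round-robin order."""
    pool = itertools.cycle(proxies)
    return lambda: next(pool)


def to_json_document(source, stock_code, kind, payload):
    """Serialize one crawled record as a JSON string for a document store."""
    record = {"source": source, "stock": stock_code, "kind": kind, "payload": payload}
    # ensure_ascii=False keeps Chinese text readable in the stored document.
    return json.dumps(record, ensure_ascii=False, sort_keys=True)


# Hypothetical proxy addresses and a hypothetical crawled comment.
next_proxy = make_proxy_rotator(["10.0.0.1:8080", "10.0.0.2:8080"])
doc = to_json_document("xueqiu", "600000", "comment", {"text": "看好", "date": "2016-05-19"})
```

In a real deployment the proxy returned by `next_proxy()` would be attached to each outgoing request and `doc` would be written to the database; both steps depend on library configuration that the source does not describe.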

The information crawled includes investors' stock operations, analyst forecasts, investor comments, news, and announcements about stocks on websites such as Xueqiu (雪球网), Golden Compass (金罗盘), Guba (股吧), Phoenix Finance, and Sina Finance, together with each stock's historical price data, market capitalization, return on equity, return on assets, earnings-per-share growth rate, current liability ratio, enterprise value multiple, year-on-year net profit growth rate, ownership concentration, free-float market capitalization, and the price return and volatility of the most recent month.

2) Perform feature extraction on the data crawled in step 1, construct a training data set, and train a prediction model using Group Lasso;

The training data set is built from the 5 trading days of the week preceding the current trading day. For each of these 5 trading days, each stock is described by a feature vector and a class label. The feature vector is derived from the crawled information, and the class label records whether the stock's price rises on the next trading day: 1 if it rises, 0 otherwise. This yields the initial training matrix. Because the data is redundant, this step first filters out samples that carry too little information. The specific criterion is to discard samples for which investors performed fewer than 10 operations on the stock that day in the crawled data.
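A minimal sketch of the labelling and filtering rule described above is given here. The dictionary keys (`features`, `operations`, `close`, `next_close`) are hypothetical, and comparing consecutive closing prices to decide whether the price "rises" is an assumption; the source does not say which price is compared.

```python
def label_next_day(close_today, close_next):
    """Class label: 1 if the price rises on the next trading day, else 0."""
    return 1 if close_next > close_today else 0


def build_training_rows(samples, min_operations=10):
    """Drop low-information samples and append the class label to each row.

    Each sample is a dict with hypothetical keys: 'features' (list of floats),
    'operations' (investor operations on the stock that day), 'close' and
    'next_close' (assumed closing prices on that day and the next trading day).
    """
    rows = []
    for s in samples:
        if s["operations"] < min_operations:
            continue  # fewer than 10 operations: filtered out
        rows.append(s["features"] + [label_next_day(s["close"], s["next_close"])])
    return rows


samples = [
    {"features": [0.2, 1.0], "operations": 25, "close": 10.0, "next_close": 10.4},
    {"features": [0.1, 0.0], "operations": 3, "close": 8.0, "next_close": 8.5},
    {"features": [0.7, 2.0], "operations": 10, "close": 5.0, "next_close": 4.9},
]
rows = build_training_rows(samples)
```

The second sample is discarded because it has only 3 operations; the third is kept because the threshold excludes only counts strictly below 10.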

The vector that characterizes a stock is extracted as follows. For investor operation data, investors are divided into 10 groups according to their rate of return in the previous month. For each group, and for each of the time stamps covering the previous 1, 3, 7, 15, and 30 days (among others), the extracted features include the group's number of buys and number of sells of the stock, its position size, its change in position, and the group's average rate of return;

For analyst forecast data, the extracted features include the number of analyst buy recommendations and sell recommendations for the stock in each of the time stamps covering the previous 1, 3, 7, 15, and 30 days (among others);

For investor comment data, the extracted features include, for each of the time stamps covering the previous 1, 3, 7, 15, and 30 days (among others), the number of comments on the stock and the mean and variance of the sentiment values of those comments;

For news data, the extracted features include, for each of the time stamps covering the previous 1, 3, 7, 15, and 30 days (among others), the number of news items about the stock and the mean and variance of the sentiment values of those items;

For announcement data, the extracted features include, for each of the time stamps covering the previous 1, 3, 7, 15, and 30 days (among others), the number of announcements about the stock and the total number of occurrences, in those announcements, of words from the announcement keyword lexicon;

For historical price data, the extracted features include, for each of the time stamps covering the previous 1, 3, 7, 15, and 30 days (among others), the stock's opening price, closing price, highest price, lowest price, the ratio to the price 30 days earlier, and the slopes of the 3-day, 7-day, 10-day, 15-day, and 30-day lines;

For capital flow data, the extracted features include, for each of the time stamps covering the previous 1, 3, 7, 15, and 30 days (among others), the ratio of the inflow to the outflow of the stock's main-force capital;

For other data, the extracted features include the stock's current market capitalization, return on equity, return on assets, earnings-per-share growth rate, current liability ratio, enterprise value multiple, year-on-year net profit growth rate, ownership concentration, free-float market capitalization, and the price return and volatility of the most recent month;
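The investor-operation features in the list above can be sketched as follows. This is an illustration under stated assumptions, not the patented extraction: each time stamp is read as a cumulative look-back window of that many days, the 10 groups are formed by rank so that they have near-equal size, and only buy and sell counts are computed. The function names and the tuple layout of `operations` are hypothetical.

```python
from datetime import date, timedelta

WINDOWS = (1, 3, 7, 15, 30)  # look-back lengths in days, as listed in the text


def assign_groups(returns_by_investor, n_groups=10):
    """Rank investors by last month's return and split them into n_groups
    groups of near-equal size (group 0 holds the lowest returns)."""
    ranked = sorted(returns_by_investor, key=returns_by_investor.get)
    return {inv: i * n_groups // len(ranked) for i, inv in enumerate(ranked)}


def window_counts(operations, groups, today, n_groups=10):
    """Count buys and sells per (group, window) for one stock.

    `operations` is a list of (investor, day, side) tuples with side in
    {'buy', 'sell'}; a window of w days covers the w days before `today`.
    """
    counts = {(g, w): {"buy": 0, "sell": 0} for g in range(n_groups) for w in WINDOWS}
    for investor, day, side in operations:
        age = (today - day).days
        for w in WINDOWS:
            if 1 <= age <= w:
                counts[(groups[investor], w)][side] += 1
    return counts


returns = {f"u{i}": i / 100 for i in range(20)}  # 20 hypothetical investors
groups = assign_groups(returns)
today = date(2016, 5, 20)
ops = [
    ("u0", today - timedelta(days=1), "buy"),
    ("u0", today - timedelta(days=5), "sell"),
    ("u19", today - timedelta(days=2), "buy"),
]
counts = window_counts(ops, groups, today)
```

Flattening `counts` in a fixed order gives one contiguous block of features per investor group, which is the layout a grouped penalty such as Group Lasso expects.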

For text data such as investor comments, news, and announcements, the text is first segmented into words with natural language processing techniques, based on two lexicons: a financial sentiment lexicon and an announcement keyword lexicon. The sentiment value of each comment or news item is then computed from the financial sentiment words that appear in it, and for announcements the number of occurrences of the corresponding keywords is counted. The financial sentiment lexicon lists stock-related sentiment keywords together with a sentiment score for each keyword; the announcement keyword lexicon lists keywords related to announcements. Both lexicons were obtained by manual labeling through crowdsourcing.
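A lexicon-based scoring of this kind can be sketched with the standard library alone. The source does not name its segmentation tool, so forward maximum matching against the lexicon is swapped in here as a simple stand-in; summing the scores of matched words as the sentiment value is likewise an assumption, and the lexicon entries and scores are made up for illustration.

```python
def segment(text, vocabulary):
    """Forward maximum matching: at each position take the longest lexicon
    word that matches, otherwise emit a single character."""
    max_len = max(map(len, vocabulary))
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


def sentiment_value(text, sentiment_lexicon):
    """Sum the scores of the sentiment words found in the text."""
    return sum(sentiment_lexicon.get(t, 0) for t in segment(text, sentiment_lexicon))


def keyword_count(text, keywords):
    """Total number of announcement-keyword occurrences in the text."""
    return sum(1 for t in segment(text, keywords) if t in keywords)


sentiment_lexicon = {"利好": 1.0, "大涨": 2.0, "亏损": -1.5}  # hypothetical scores
announcement_keywords = {"重组", "增持", "分红"}              # hypothetical keywords

score = sentiment_value("公司利好明天大涨", sentiment_lexicon)
hits = keyword_count("公告重组并增持", announcement_keywords)
```

The per-comment values produced this way are what the mean and variance features would be computed over within each time stamp.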

In feature extraction, investors are divided into 10 groups according to their rate of return in the previous month, and each of these groups forms one Group. Features within a group are strongly correlated, while features in different groups are less strongly correlated. During model training it is desirable to treat the features within one group as a whole. The Group Lasso algorithm from machine learning takes this grouping into account, so Group Lasso is chosen.

The Group Lasso algorithm is expressed as follows:

$\hat{\beta}_{\lambda} = \underset{\beta}{\arg\min}\left(\|Y - X\beta\|_{2}^{2} + \lambda \sum_{g=1}^{G} \|\beta_{I_{g}}\|_{2}\right)$

where $\hat{\beta}_{\lambda}$ is the model training result, X is the training sample matrix, Y is the class vector of the samples, $I_{g}$ denotes the feature indices belonging to the g-th Group, with g = 1, ..., G, and $\beta_{I_{g}}$ denotes the trained weights corresponding to the feature indices of the g-th Group.
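To make the objective concrete, the following sketch evaluates it on a toy problem using plain Python lists. It also includes the group soft-thresholding operator, the standard building block of proximal solvers for this penalty; the source does not say which solver is used, so that function is illustrative only. All names and numbers are hypothetical.

```python
import math


def group_lasso_objective(X, Y, beta, groups, lam):
    """Evaluate ||Y - X beta||_2^2 + lam * sum_g ||beta_{I_g}||_2."""
    residual = [y - sum(x_j * b_j for x_j, b_j in zip(row, beta)) for row, y in zip(X, Y)]
    loss = sum(r * r for r in residual)
    penalty = sum(math.sqrt(sum(beta[j] ** 2 for j in idx)) for idx in groups)
    return loss + lam * penalty


def group_soft_threshold(v, threshold):
    """Shrink one group's weights toward zero; the whole group becomes
    exactly zero when its Euclidean norm is at most the threshold."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= threshold:
        return [0.0] * len(v)
    return [x * (1 - threshold / norm) for x in v]


X = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
Y = [1.0, 0.0]
beta = [3.0, 4.0, 0.0]
groups = [[0, 1], [2]]  # I_1 = {0, 1}, I_2 = {2}
value = group_lasso_objective(X, Y, beta, groups, lam=0.5)
```

Because the penalty uses an unsquared norm per group, a whole group of weights is either kept or zeroed together, which matches the stated wish to consider the features of one investor group as a unit.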

During model training, cross-validation is used. For each round's test set, the predictions are sorted in descending order of predicted probability and the stocks with the highest probability are selected. The total rate of return over two weeks is then computed under the operating rule of selling, at each day's market open, the stocks bought on the previous trading day and buying the stocks recommended for the current trading day. This rate of return is used to tune the model parameters.
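The return-based evaluation can be sketched as a small backtest. The source fixes only the trading rule (buy the recommended stocks at the open, sell them at the next trading day's open); equal weighting across the recommended stocks, compounding from day to day, and ignoring transaction costs are assumptions made here, and the prices are invented.

```python
def strategy_return(daily_picks, open_prices):
    """Compound return of buying each day's picks at the open and selling
    them at the next trading day's open, with capital split equally.

    daily_picks[d] lists the stocks recommended on day d;
    open_prices[d][stock] is that stock's opening price on day d.
    Picks made on the last day are not settled and do not count.
    """
    wealth = 1.0
    for d in range(len(daily_picks) - 1):
        picks = daily_picks[d]
        if not picks:
            continue  # no recommendation that day: stay in cash
        day_gain = sum(open_prices[d + 1][s] / open_prices[d][s] for s in picks) / len(picks)
        wealth *= day_gain
    return wealth - 1.0


open_prices = [
    {"A": 10.0, "B": 20.0},
    {"A": 11.0, "B": 20.0},
    {"A": 11.0, "B": 19.0},
]
daily_picks = [["A", "B"], ["B"], ["A"]]
total = strategy_return(daily_picks, open_prices)
```

A parameter search would call `strategy_return` once per candidate setting on the held-out days and keep the setting with the highest return, instead of ranking candidates by accuracy.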

3) Crawl data from the trading day itself to construct a new test set, and apply the prediction model trained in step 2 to obtain the final recommended stocks.

The present invention also proposes a system for predicting stocks based on big data published on the Internet, comprising a data crawling and storage module, a prediction model training module, and a stock prediction module. The data crawling and storage module crawls and stores stock-related information. The prediction model training module constructs a training data set from data crawled before the trading day and trains the prediction model with Group Lasso. The stock prediction module constructs a new test set from data crawled on the trading day and uses the trained prediction model to predict the final recommended stocks.

The system further comprises a display module for presenting the stock prediction results to customers.

Beneficial effects: The present invention provides a new, useful, and reliable source of information for quantitative stock selection and stock forecasting. Data such as investors' operations, analyst forecasts, news, announcements, and research reports are new data sources compared with traditional data such as historical stock prices and capital flows. To some extent this information reflects the current stock market, and it also reflects expectations about the future market. Because much of this data is text, crawling and analyzing it in real time is harder than crawling and processing traditional stock data. The invention uses the Scrapy crawling framework and natural language processing to crawl and process these data types in real time, and combining them with traditional data such as historical prices and capital flows reflects the market more fully. Part of the extracted feature set is built by dividing investors into multiple groups according to their rate of return in the previous month. Features within a group are strongly correlated, while features in different groups are less strongly correlated, and during training it is desirable to treat the features within one group as a whole. The Group Lasso algorithm from machine learning takes this grouping into account, so the resulting stock prediction model is better able to capture the market's internal operating mechanism and improves the returns delivered to investors.

Description of drawings

Fig. 1 is the overall architecture diagram of the stock prediction system of the present invention;

Fig. 2 is the architecture diagram of the data crawling and storage module of the present invention;

Fig. 3 is the architecture diagram of the prediction model training module of the present invention;

Fig. 4 is the architecture diagram of the stock prediction module of the present invention.

Detailed description

The present invention is further illustrated below with reference to specific embodiments. These embodiments serve only to illustrate the invention and do not limit its scope. After reading this disclosure, modifications of the invention in various equivalent forms made by those skilled in the art all fall within the scope defined by the claims appended to this application.

Fig. 1 shows the overall framework of the stock prediction system of the present invention, which comprises four modules: a data crawling and storage module, a stock prediction model training module, a stock prediction module, and a display module. The invention is implemented in Python, with MongoDB as the database.

The data crawling and storage module is shown in Figure 2. The crawler uses the Scrapy framework. Scrapy is a fast, high-level web scraping system written in Python, used mainly to visit websites automatically and extract structured data from their pages. Scrapy handles network communication with the efficient Twisted asynchronous networking library. The overall architecture of Scrapy is shown in Figure 3.

To get around the anti-crawling measures of websites such as Xueqiu, the crawler first crawls a set of proxy IPs and then uses the Scrapy framework to crawl data from websites such as Xueqiu (雪球网), Golden Compass (金罗盘), Guba (股吧), Phoenix Finance, Sina Finance, and Juchao Information (巨潮资讯). The data is converted into JSON format and stored in the MongoDB database. Xueqiu provides investors' operation data, investor comments, news, and announcements; Golden Compass provides analyst forecasts; Guba provides investor comments; Phoenix Finance and Sina Finance provide news as well as historical stock prices, capital flows, and fundamentals; Juchao Information provides announcements.

The stock prediction model training module is shown in Figure 4. The machine learning training data set is constructed first. It consists of the data from the 5 trading days of the week preceding the current trading day. For each of these 5 trading days, each of the 2780 A-share stocks is described by a feature vector and a class label. The feature vector has about 700 dimensions, and the class label records whether the stock's price rises on the next trading day: 1 if it rises, 0 otherwise. This gives a matrix of roughly 5 × 2780 rows and 701 columns (about 700 features plus the label), which is the initial training set.

Table 1. Composition of the roughly 700-dimensional feature vector

For some stocks, little data is crawled on a given day, so the 700-dimensional vector may describe them poorly. The training module therefore filters out samples that carry too little information. The filtering criterion can be adjusted according to the evaluation criterion; at present, the invention discards samples for which investors performed fewer than 10 operations on the stock that day in the crawled data. This yields the filtered training set.

The model is then trained with the Group Lasso algorithm from machine learning, with statistics of the same type forming one Group. Unlike a conventional machine learning problem, the model is not evaluated by accuracy, F1, or similar metrics. Instead, the model recommends 8 stocks each day, and the evaluation criterion is the rate of return over the period under the operating rule of selling, at each day's market open, the stocks bought on the previous trading day and buying the stocks recommended for the current trading day. This rate of return is used to tune the model parameters. The Group Lasso algorithm is expressed as follows:

$\hat{\beta}_{\lambda} = \underset{\beta}{\arg\min}\left(\|Y - X\beta\|_{2}^{2} + \lambda \sum_{g=1}^{G} \|\beta_{I_{g}}\|_{2}\right)$

where $\hat{\beta}_{\lambda}$ is the model training result, X is the training sample matrix, Y is the class vector of the samples, $I_{g}$ denotes the feature indices belonging to the g-th Group, with g = 1, ..., G, and $\beta_{I_{g}}$ denotes the trained weights corresponding to the feature indices of the g-th Group.

This yields the big data stock prediction model. The model for each trading day is trained about 10 hours before that day's market open.

The prediction module of the big data stock prediction model is shown in Figure 4. Features are extracted from the data crawled on the current day to form the test data set, which gives 2780 samples, one for each of the 2780 A-share stocks. Samples carrying too little information are then removed with the same filtering method used for the training data, giving the filtered test set. Finally, the trained big data stock prediction model is applied to the filtered test set, and the 8 stocks with the highest output probability are selected as the recommended stocks for the next trading day.
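The final selection step can be sketched as a filter followed by a top-k pick. The probabilities would come from the trained model; here they are invented, and the function name, stock codes, and operation counts are hypothetical.

```python
import heapq


def recommend(probabilities, operations, k=8, min_operations=10):
    """Return the k stock codes with the highest predicted probability of
    rising, after dropping stocks with too few investor operations."""
    eligible = {
        s: p for s, p in probabilities.items() if operations.get(s, 0) >= min_operations
    }
    return heapq.nlargest(k, eligible, key=eligible.get)


probabilities = {"600000": 0.71, "600004": 0.64, "000001": 0.90, "000002": 0.55}
operations = {"600000": 40, "600004": 5, "000001": 12, "000002": 30}
picks = recommend(probabilities, operations, k=2)
```

With the default `k=8` the same call returns the eight recommended stocks described in the text; `k=2` is used above only to keep the example small.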

Claims

1. A method for predicting stocks based on big data published on the Internet, comprising the following steps:

1) crawling stock-related information from before the trading day;

2) performing feature extraction on the data crawled in step 1, constructing a training data set, and training a big data stock prediction model using the Group Lasso algorithm;

3) crawling data from the trading day itself to construct a new test set, and making predictions with the prediction model trained in step 2 to obtain the final recommended stocks.

2. The method for predicting stocks based on big data published on the Internet as claimed in claim 1, wherein the method for crawling stock information in step 1 is: crawling a set of proxy IPs, then crawling data from the relevant websites using the Scrapy framework, converting the data into JSON format, and storing it in a MongoDB database.

3. The method for predicting stocks based on big data published on the Internet as claimed in claim 1, wherein the information crawled in step 1 includes investors' stock operations, analyst forecasts, investor comments, news, and announcements about stocks on websites such as Xueqiu, Golden Compass, Guba, Phoenix Finance, and Sina Finance, as well as each stock's historical price data, market capitalization, return on equity, earnings-per-share growth rate, current liability ratio, enterprise value multiple, year-on-year net profit growth rate, ownership concentration, free-float market capitalization, and the price return and volatility of the most recent month.

4. The method for predicting stocks based on big data published on the Internet as claimed in claim 1, wherein step 2 filters out data carrying too little information, the specific filtering criterion being: discarding samples in the crawled data for which investors performed fewer than 10 operations on the stock that day.

5. The method of claim 1, wherein the training data set constructed in step 2 consists of the data of the 5 trading days of the week preceding the current trading day; for each of the 5 trading days, each stock consists of a feature and a category, wherein the feature is represented by a vector derived from the related information and the category is whether the stock price rises on the next trading day, being 1 if it rises and 0 otherwise, so that an initial training matrix is obtained.

6. The method for predicting stocks based on big data published on the Internet as claimed in claim 5, wherein the vector characterizing a stock is extracted as follows:

for investor operation data, dividing investors into 10 groups according to their rate of return in the previous month, and extracting, for each group and for each of the time stamps covering the previous 1, 3, 7, 15, and 30 days and the like, features such as the group's number of buys, number of sells, position size, and change in position for the stock, and the group's average rate of return in each time stamp;

for analyst forecast data, extracting features such as the number of analyst buy recommendations and sell recommendations for the stock in each of the time stamps covering the previous 1, 3, 7, 15, and 30 days and the like;

for investor comment data, extracting, for each of the time stamps covering the previous 1, 3, 7, 15, and 30 days and the like, features such as the number of comments on the stock and the mean and variance of the sentiment values of the comments;

for news data, extracting, for each of the time stamps covering the previous 1, 3, 7, 15, and 30 days and the like, features such as the number of news items about the stock and the mean and variance of the sentiment values of the news items;

for announcement data, extracting, for each of the time stamps covering the previous 1, 3, 7, 15, and 30 days and the like, features such as the number of announcements about the stock and the total number of occurrences, in each announcement, of words from the announcement keyword lexicon;

for historical price data, extracting, for each of the time stamps covering the previous 1, 3, 7, 15, and 30 days and the like, features such as the stock's opening price, closing price, highest price, lowest price, the ratio to the price 30 days earlier, and the slopes of the 3-day, 7-day, 10-day, 15-day, and 30-day lines;

for capital flow data, extracting, for each of the time stamps covering the previous 1, 3, 7, 15, and 30 days and the like, features such as the ratio of the inflow to the outflow of the stock's main-force capital;

for other data, extracting features such as the stock's current market capitalization, return on equity, earnings-per-share growth rate, current liability ratio, enterprise value multiple, year-on-year net profit growth rate, ownership concentration, free-float market capitalization, and the price return and volatility of the most recent month;

for text data such as investor comments, news, and announcements, first segmenting the text with natural language processing techniques based on two lexicons, a financial sentiment lexicon and an announcement keyword lexicon, and then computing the sentiment value of each comment or news item from the financial sentiment words appearing in the text, and counting the occurrences of the corresponding keywords in each announcement, wherein the financial sentiment lexicon lists stock sentiment keywords and the sentiment score of each keyword, the announcement keyword lexicon lists keywords related to announcements, and both lexicons are obtained by manual labeling through crowdsourcing.

7. The method for predicting stocks based on big data published on the Internet as claimed in claim 6, wherein in the feature extraction, for investor operation data, investors are divided into 10 groups according to their rate of return in the previous month, each such group being one Group; the features within each Group are strongly correlated, while the features in different Groups are less strongly correlated; in order to consider the features within one Group as a whole, the prediction model is trained with the Group Lasso algorithm from machine learning, which is expressed as follows:

$\hat{\beta}_{\lambda} = \underset{\beta}{\arg\min}\left(\|Y - X\beta\|_{2}^{2} + \lambda \sum_{g=1}^{G} \|\beta_{I_{g}}\|_{2}\right)$

wherein $\hat{\beta}_{\lambda}$ is the model training result, X is the training sample matrix, Y is the class vector of the samples, $I_{g}$ denotes the feature indices belonging to the g-th Group, with g = 1, ..., G, and $\beta_{I_{g}}$ denotes the trained weights corresponding to the feature indices of the g-th Group;

during model training, cross-validation is used: for each round's test set, the stocks with the highest predicted probability are selected in descending order of predicted probability, and the model parameters are tuned according to the total rate of return over two weeks under the operating rule of selling, at each day's market open, the stocks bought on the previous trading day and buying the stocks recommended for the current trading day.

8. A system for predicting stocks based on big data published on the Internet, comprising a data crawling and storage module, a prediction model training module, and a stock prediction module; wherein the data crawling and storage module is used for crawling and storing stock-related information; the prediction model training module constructs a training data set from data crawled before the trading day and trains a prediction model using Group Lasso; and the stock prediction module constructs a new test set from data crawled on the trading day and predicts the final recommended stocks using the trained prediction model.

9. The system for predicting stocks based on big data published on the Internet as claimed in claim 8, further comprising a display module for presenting the stock prediction results to customers.