Rule-configurable webpage data analysis method
Technical Field
The invention belongs to the field of webpage data processing, and particularly relates to a rule-configurable webpage data analysis method.
Background
In recent years, as China's big data strategy has become increasingly clear, data capture and information acquisition products have met with enormous development opportunities, and the number of such products has grown rapidly. Web page parsing, in which a program automatically analyzes web page content and extracts information for further processing, is an indispensable and very important part of implementing a web crawler. However, current web page data parsing methods are either cumbersome to operate when parsing and configuring web page data, or slow when acquiring dynamic data in a web page.
Disclosure of Invention
In order to solve the above problems, the present invention provides a rule-configurable web page data parsing method, which includes the following steps:
S1, Web-end task creation: the Web application sends a data request to the server; the required web page start URL, web page collection rules, and web page parsing rules are configured on a task configuration page, the data to be extracted are specified through the HTML (Hypertext Markup Language) tags to which they belong, and the configured information is submitted after the task configuration information has been filled in;
S2, web page collection: the collection information configured through the task configuration on the Web side is obtained, the background begins crawling web pages from the incoming URL, and the crawling mode is determined by the configured web page collection rules. The crawling modes comprise an enhanced mode and a common mode: the enhanced mode accesses the corresponding URL by combining Selenium and ChromeDriver with an access header constructed using the Python UserAgent library, while the common mode accesses the corresponding URL with an access header constructed using the Python aiohttp and UserAgent libraries. After each access completes successfully, the web page information, URL, page number, and page grade are saved into a list; after all web pages have been accessed, the captured web page information is stored on the server as HTML files, and the corresponding information is stored in the database;
S3, web page parsing: the parsing information configured through the task configuration on the Web side is obtained, the list information produced by web page collection is obtained for data parsing, and the web page is parsed with Python's Beautiful Soup library; during parsing, data and related tags are extracted by tag type and value according to the HTML tags configured for the page; after parsing is finished, the data is stored in the database;
S4, data downloading: the task result is viewed through the task list, the collected web page content in the task result can be downloaded, and the parsed data can be viewed and downloaded.
Further, the web page collection rules of step S1 include whether to collect sub-pages, whether to collect the next page, and whether to use the enhanced mode.
Further, the web page parsing rules of step S1 comprise at most three lines; the rule in each line parses the web page independently, the results are finally combined, and the combined result is stored in the database.
Still further, each web page parsing rule comprises four parameters, wherein the first parameter selects the parsing rule, the second and fourth parameters are the configuration information corresponding to the parsing rule, and the third parameter is the relationship between the second-parameter and fourth-parameter configuration information, the relationship being one of contained, not contained, and only contained.
Further, when the enhanced mode is selected for web page collection in step S2, if sub-pages need to be captured, two ChromeDriver instances are opened, one for accessing the first-level page and the other for accessing the sub-pages; after the first-level page is accessed, its sub-page URL links are obtained through the configured tag information, and the sub-pages are then accessed; if the next page needs to be captured, the next-page link is obtained through the configured next-page tag and accessed.
Further, when the common mode is selected for web page collection in step S2, if sub-pages need to be captured, the first-level page is accessed, the sub-page links are then obtained through the configured tag information and stored in a list, and the sub-pages are accessed in coroutine mode; if the next page needs to be captured, the next-page link is obtained through the configured next-page tag and accessed.
The invention has the beneficial effects that:
1) the method adopts a B/S architecture, which avoids downloading a C/S-architecture client and is convenient to use;
2) when the method is used to collect web pages and to parse and configure web page data, configuration requires only knowledge of the HTML structure, and no large amount of operations is needed during configuration;
3) the method can conveniently acquire dynamic data in web pages, and can collect web pages quickly by using coroutines.
Drawings
FIG. 1 is a flow chart of the rule-configurable web page data parsing method.
Detailed Description
In order to more clearly understand the technical features, objects, and effects of the present invention, embodiments of the present invention will now be described with reference to the accompanying drawings.
The invention provides a rule-configurable web page data parsing method, as shown in FIG. 1, which specifically comprises the following steps:
First, the server side is started in a Windows 10 environment, listening on a designated port and waiting for Socket connections.
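The server startup above can be sketched with the standard-library socket module (a minimal sketch only: the patent does not specify the port, so port 0 is used here to let the operating system assign one, whereas a deployment would listen on the designated port):

```python
import socket

# Minimal sketch of the server side: bind a socket and wait for
# connections from the Web application. Port 0 lets the OS pick a
# free port; a deployment would use the fixed, designated port.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen()
port = server.getsockname()[1]  # the port now being monitored
server.close()
```

In a deployment, the loop that follows would call server.accept() for each incoming Socket connection from the Web side.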
Then, a Web-end task is created, and the Web application sends a data request to the server. In this step, the required web page start URL, web page collection rules, and web page parsing rules are configured on the task configuration page, the data to be extracted are specified through the HTML tags to which they belong, and the configured information is submitted after the task configuration information has been filled in. The web page collection rules include whether to collect sub-pages, whether to collect the next page, and whether to use the enhanced mode, specifically:
1) when sub-page collection is selected, a "get sub-page tag" must be configured, in HTML tag form such as <a class="xxx">; the background searches the first-level page for all links matching this tag and accesses them;
2) when next-page collection is selected, a "get next page tag" must be configured, in HTML tag form such as <a class="next">next page</a>; the background searches for the corresponding next-page link according to this tag and accesses it;
3) the enhanced mode is used to accurately acquire dynamic web pages; when it is selected, the web pages are accessed using Selenium in combination with ChromeDriver.
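The link search described in rules 1) and 2) can be sketched with the Beautiful Soup library named in step S3 (a sketch only: the HTML snippet is invented for illustration, and the configured tags are assumed to be anchor tags with a class attribute, as in the examples above):

```python
from bs4 import BeautifulSoup

# Invented first-level page for illustration.
html = """
<html><body>
  <a class="xxx" href="/sub/1">Sub-page 1</a>
  <a class="xxx" href="/sub/2">Sub-page 2</a>
  <a class="next" href="/page/2">next page</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# "Get sub-page tag" configured as <a class="xxx">: collect every
# link in the first-level page that matches the configured tag.
sub_links = [a["href"] for a in soup.find_all("a", class_="xxx")]

# "Get next page tag" configured as <a class="next">: a single link.
next_tag = soup.find("a", class_="next")
next_link = next_tag["href"] if next_tag else None
```

The background would then access each link in sub_links, and follow next_link when next-page collection is enabled.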
In addition, the web page parsing rules comprise at most three lines, and the rule in each line parses the web page independently. Each line comprises four parameters: the first parameter selects the parsing rule, the second and fourth parameters are the configuration information corresponding to the parsing rule, and the third parameter is the relationship between the second-parameter and fourth-parameter configuration information, the relationship being one of contained, not contained, and only contained.
Regular expression rules can also be added in the configuration of the rules.
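The three relationships between the second-parameter and fourth-parameter configuration information can be sketched as follows (the function name is invented, and reading "only contained" as an exact match is an assumption, since the patent does not pin down its semantics):

```python
def check_relation(value, relation, target):
    """Evaluate one parsing-rule line: does `value` (second-parameter
    configuration) satisfy `relation` with respect to `target`
    (fourth-parameter configuration)? Interpreting "only contained"
    as an exact match is an assumption."""
    if relation == "contained":
        return target in value
    if relation == "not contained":
        return target not in value
    if relation == "only contained":
        return value == target
    raise ValueError(f"unknown relation: {relation}")
```

For example, check_relation("price: 42", "contained", "price") returns True, so the corresponding line of the rule would match.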
Then, after the Web-end task is created, web page collection starts. The collection information configured through the task configuration on the Web side is obtained, the background begins crawling web pages from the incoming URL, and the crawling mode is determined by the configured web page collection rules. The crawling modes comprise an enhanced mode and a common mode, specifically:
1) the enhanced mode accesses the corresponding URL by combining Selenium and ChromeDriver with an access header constructed using the Python UserAgent library. If sub-pages need to be captured, two ChromeDriver instances are opened, one for accessing the first-level page and the other for accessing the sub-pages. After the first-level page is accessed, its sub-page URL links are obtained through the configured tag information, and the sub-pages are then accessed; if the next page needs to be captured, the next-page link is obtained through the configured next-page tag and accessed. In particular, the first-level page is set to capture at most 10 pages;
2) the common mode accesses the corresponding URL with an access header constructed using the Python aiohttp and UserAgent libraries. If sub-pages need to be captured, the first-level page is accessed, the sub-page links are then obtained through the configured tag information and stored in a list, and the sub-pages are accessed in coroutine mode; if the next page needs to be captured, the next-page link is obtained through the configured next-page tag and accessed. In particular, the first-level page is set to capture at most 10 pages.
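The coroutine-based sub-page access of the common mode can be sketched as follows (a self-contained sketch: the aiohttp network request is replaced by a stub coroutine, and the small USER_AGENTS list stands in for the UserAgent library's header construction):

```python
import asyncio
import random

# Stand-in for the UserAgent library: pick an access header at random.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

async def fetch(url):
    """Stub for an aiohttp GET; a real implementation would use
    aiohttp.ClientSession and session.get(url, headers=headers)."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    await asyncio.sleep(0)  # yield control, as real network I/O would
    return url, headers["User-Agent"]

async def crawl(sub_links):
    # Access all sub-pages concurrently in coroutine mode.
    return await asyncio.gather(*(fetch(u) for u in sub_links))

results = asyncio.run(crawl(["/sub/1", "/sub/2", "/sub/3"]))
```

Because asyncio.gather preserves input order, each result can still be matched back to its sub-page URL when saving into the list.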
After each access completes successfully, the web page information, URL, page number, and page grade are saved into a list. After all web pages have been accessed, the captured web page information is stored on the server as HTML files, and the corresponding information is stored in the database.
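Saving the per-page list entries into the database can be sketched with the standard-library sqlite3 module (the table name, column names, and use of an in-memory SQLite database are assumptions for illustration; the patent only states that web page information, URL, page number, and page grade are saved):

```python
import sqlite3

# Each list entry: (HTML file name, URL, page number, page grade).
pages = [
    ("page_1.html", "http://example.com/", 1, 1),
    ("page_1_sub.html", "http://example.com/sub", 1, 2),
]

conn = sqlite3.connect(":memory:")  # a deployment would use a real DB
conn.execute(
    "CREATE TABLE pages (html_file TEXT, url TEXT,"
    " page_number INTEGER, page_grade INTEGER)"
)
conn.executemany("INSERT INTO pages VALUES (?, ?, ?, ?)", pages)
count = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
```

The stored rows are what the parsing step later reads back as "list information after collecting a web page".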
Then, the web pages are parsed. In this step, the parsing information configured through the task configuration on the Web side is obtained, the list information produced by web page collection is obtained for data parsing, and the web page is parsed with Python's Beautiful Soup library; during parsing, data and related tags are extracted by tag type and value according to the HTML tags configured for the page; after parsing is finished, the data is stored in the database.
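Extraction by tag type and value, as described above, can be sketched with Beautiful Soup (the HTML snippet and the configured tag type/value pair are invented for illustration):

```python
from bs4 import BeautifulSoup

# Invented captured page for illustration.
html = ('<div><span class="title">Rule-configurable parsing</span>'
        '<span class="date">2023-01-01</span></div>')

soup = BeautifulSoup(html, "html.parser")

# Configured as tag type "span" with attribute value class="title":
# extract the data together with its related tag.
tag = soup.find("span", class_="title")
data = tag.get_text()
```

Here data and tag would be the values stored in the database once parsing finishes.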
Finally, the data is downloaded. The task result is viewed through the task list, the collected web page content in the task result can be downloaded, and the parsed data can be viewed and downloaded.
In the description of the present invention, it should be noted that the terms "first", "second", "third", and the like are used only for distinguishing the description, and are not intended to indicate or imply relative importance.
The above disclosure describes only preferred embodiments of the present invention; it is not intended to limit the invention, the scope of which is defined by the appended claims.