
CN114329137A - Crawler method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN114329137A
Authority
CN
China
Prior art keywords
crawler
page
website
crawler task
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111596503.6A
Other languages
Chinese (zh)
Inventor
陈曾华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Zhenshi Information Technology Co Ltd
Original Assignee
Beijing Jingdong Zhenshi Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Zhenshi Information Technology Co Ltd
Priority to CN202111596503.6A
Publication of CN114329137A

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a crawler method, a crawler device, an electronic device, and a storage medium. The technical scheme is as follows: compile the crawler task code corresponding to a website to be crawled and publish it to the project path of a crawler system; send the information of the website to be crawled to a message queue of the crawler system, where that information comprises an initial url and the fully qualified name of a crawler task class defined in the crawler task code; obtain the information of the website to be crawled from the message queue, crawl the website page corresponding to the initial url according to the fully qualified name of the crawler task class, parse that page, and store the page parsing result. The invention enables fully configurable, customized operation of the crawler code without downtime, with no need to shut down for redeployment.

Description

A crawler method, device, electronic device and storage medium

Technical field

The invention relates to the technical field of network search, and in particular to a crawler method, device, electronic device and storage medium.

Background

A classic crawler system is generally divided into four modules: a task scheduling module, a page crawling module, a page parsing module, and a data storage module. In such a system, the task scheduling module generally uses a message queue to store and schedule crawler url links. The initial url is generally read into the message queue from hard-coded program values, a database, or a configuration file; it then passes to the page crawling module, the crawled page passes to the page parsing module, and the parsed data finally passes to the data storage module for storage. The business logic for crawling web page data and the related attribute configuration are all written when the crawler system is developed.

In a traditional crawler system, once the business logic code for crawling web page data has been written, crawling can only follow that fixed logic. If the page structure of the crawled website changes, or the user's crawling logic or attributes (such as the crawl fields, number of pages to crawl, crawling tool, or data storage system) change, the user has to redevelop the code, then repackage, republish, and redeploy the crawler system. Likewise, crawling a new website requires developing new code in the existing project, shutting down the production system, and redeploying it. For a large crawler system with dozens, hundreds, or even thousands of crawler tasks, frequent shutdowns for deployment are unacceptable.

Summary of the invention

In view of this, the purpose of the present invention is to provide a crawler method, device, electronic device and storage medium that enable fully configurable, customized operation of crawler code without downtime, with no need to shut down for redeployment.

To achieve the above purpose, the present invention provides the following technical solutions:

A crawler method, applied to a crawler system, comprising:

compiling the crawler task code corresponding to a website to be crawled and publishing it to the project path of the crawler system;

sending the information of the website to be crawled to a message queue of the crawler system, the information of the website to be crawled including an initial url and the fully qualified name of a crawler task class defined in the crawler task code;

obtaining the information of the website to be crawled from the message queue of the crawler system; crawling, according to the fully qualified name of the crawler task class defined in the crawler task code, the website page corresponding to the initial url in that information; parsing the website page corresponding to the initial url; and storing the page parsing result for that website page.

A crawler device, comprising: a task scheduling module, a page crawling module, a page parsing module, a data storage module, a message start module, and a code publishing module;

the code publishing module is configured to compile the crawler task code corresponding to a website to be crawled and publish it to the project path of the crawler system;

the message start module is configured to send the information of the website to be crawled to the message queue maintained by the task scheduling module, the information of the website to be crawled including an initial url and the fully qualified name of a crawler task class defined in the crawler task code;

the page crawling module is configured to obtain the information of the website to be crawled from the message queue maintained by the task scheduling module and, according to the fully qualified name of the crawler task class defined in the crawler task code, crawl the website page corresponding to the initial url in that information;

the page parsing module is configured to obtain the information of the website to be crawled from the message queue maintained by the task scheduling module and, according to the fully qualified name of the crawler task class defined in the crawler task code, parse the website page corresponding to the initial url;

the data storage module is configured to obtain the information of the website to be crawled from the message queue maintained by the task scheduling module and, according to the fully qualified name of the crawler task class defined in the crawler task code, store the page parsing result produced by the page parsing module for the website page corresponding to the initial url.

An electronic device, comprising: at least one processor, and a memory connected to the at least one processor through a bus; the memory stores one or more computer programs executable by the at least one processor; the at least one processor implements the steps of the above crawler method when executing the one or more computer programs.

A computer-readable storage medium storing one or more computer programs which, when executed by a processor, implement the steps of the above crawler method.

As the above technical solutions show, in the present invention the crawler task code corresponding to the website to be crawled is compiled and published to the project path of the crawler system; the website page corresponding to the initial url in the information of the website to be crawled is crawled according to the fully qualified name of the crawler task class defined in the crawler task code; that page is parsed; and the page parsing result is stored. By publishing the crawler task code for the website to be crawled and referring to the crawler task class by its fully qualified name, the present invention enables fully configurable, customized operation of the crawler code without downtime, with no need to shut down for redeployment.

Description of drawings

To explain the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of the crawler method of Embodiment 1 of the present invention;

FIG. 2 is a flowchart of the crawler method of Embodiment 2 of the present invention;

FIG. 3 is a flowchart of the crawler method of Embodiment 3 of the present invention;

FIG. 4 is a flowchart of the crawler method of Embodiment 4 of the present invention;

FIG. 5 is a schematic structural diagram of a crawler device provided by an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments in the present application without creative effort fall within the protection scope of the present application.

The terms "first", "second", "third", "fourth" and so on (if present) in the description, the claims, and the above drawings are used to distinguish similar objects and do not necessarily describe a specific order or sequence. Data so described are interchangeable where appropriate, so that the embodiments of the invention described here can, for example, be practiced in orders other than those illustrated or described. Furthermore, the terms "comprising" and "having", and any variations of them, are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.

The technical solutions of the present invention are described in detail below through specific embodiments. The following embodiments may be combined with each other, and identical or similar concepts or processes may not be repeated in some of them.

Referring to FIG. 1, FIG. 1 is a flowchart of the crawler method of Embodiment 1 of the present invention. The method is applied to a crawler system and, as shown in FIG. 1, mainly includes the following steps:

Step 101: compile the crawler task code corresponding to the website to be crawled and publish it to the project path of the crawler system;

Step 102: send the information of the website to be crawled to the message queue of the crawler system; the information of the website to be crawled includes an initial url and the fully qualified name of a crawler task class defined in the crawler task code;

Step 103: obtain the information of the website to be crawled from the message queue of the crawler system; according to the fully qualified name of the crawler task class defined in the crawler task code, crawl the website page corresponding to the initial url in that information; parse the website page corresponding to the initial url; and store the page parsing result for that website page.
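The handoff between steps 102 and 103 can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `SiteMessage`, `CrawlerTask`, `ExampleTask`, the in-memory `LinkedBlockingQueue`, and the list used as a store are all hypothetical stand-ins, and the crawl step returns a stub string instead of fetching a page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class CrawlFlowSketch {
    /** Step 102: enqueue the initial url together with the task class name. */
    static void submit(BlockingQueue<SiteMessage> queue, String initialUrl, String taskClassName) {
        queue.add(new SiteMessage(initialUrl, taskClassName));
    }

    /** Step 103: take one message, then crawl, parse, and store the result. */
    static void processOne(BlockingQueue<SiteMessage> queue, List<String> store) {
        SiteMessage msg = queue.poll();
        if (msg == null) {
            return; // nothing queued
        }
        try {
            // The class is chosen by name at run time, not hard-wired into the worker.
            CrawlerTask task = (CrawlerTask) Class.forName(msg.taskClassName)
                    .getDeclaredConstructor().newInstance();
            String page = task.crawl(msg.initialUrl);
            store.add(task.parse(page));
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot run task " + msg.taskClassName, e);
        }
    }

    /** Runs one message through the whole flow and returns what was stored. */
    static String demo() {
        BlockingQueue<SiteMessage> queue = new LinkedBlockingQueue<>();
        List<String> store = new ArrayList<>();
        submit(queue, "https://example.com/forum", ExampleTask.class.getName());
        processOne(queue, store);
        return store.get(0);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}

/** Hypothetical message body: initial url plus fully qualified task class name. */
class SiteMessage {
    final String initialUrl;
    final String taskClassName;

    SiteMessage(String initialUrl, String taskClassName) {
        this.initialUrl = initialUrl;
        this.taskClassName = taskClassName;
    }
}

/** Hypothetical contract that a published crawler task class could follow. */
interface CrawlerTask {
    String crawl(String url);
    String parse(String page);
}

/** Stub task: no network access, just enough to exercise the flow. */
class ExampleTask implements CrawlerTask {
    public String crawl(String url) {
        return "<title>Stub page for " + url + "</title>";
    }

    public String parse(String page) {
        return page.replaceAll("<[^>]+>", ""); // strip tags
    }
}
```

Because the worker only sees a class name in the message, a new site can in principle be handled by publishing a new task class and sending a message that names it.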

As the method shown in FIG. 1 shows, in this embodiment the crawler task code corresponding to the website to be crawled is compiled and published to the project path of the crawler system; the website page corresponding to the initial url in the information of the website to be crawled is crawled according to the fully qualified name of the crawler task class defined in the crawler task code; that page is parsed; and the page parsing result is stored. By publishing the crawler task code for the website to be crawled and referring to the crawler task class by its fully qualified name, this embodiment enables fully configurable, customized operation of the crawler code without downtime, with no need to shut down for redeployment.

Referring to FIG. 2, FIG. 2 is a flowchart of the crawler method of Embodiment 2 of the present invention. The method is applied to a crawler system and, as shown in FIG. 2, mainly includes the following steps:

Step 201: compile the crawler task code corresponding to the website to be crawled and publish it to the project path of the crawler system;

In practice, the crawler task code corresponding to the website to be crawled can be compiled into a class file, and the class file published to the project path of the crawler system.
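One possible way to compile task code into a class file and make it loadable by a running process is sketched below. The text does not say which compiler or class loader the system uses; this sketch assumes the standard `javax.tools.JavaCompiler` API (which needs a JDK, not just a JRE) and a `URLClassLoader` pointed at the project path. `TaskPublishSketch`, `HotLoadedTask`, and the temporary directory layout are hypothetical.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

class TaskPublishSketch {
    /** Compiles one source file and writes the class file under projectPath. */
    static void publish(Path sourceFile, Path projectPath) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system Java compiler found; a JDK is required");
        }
        int exitCode = compiler.run(null, null, null,
                "-d", projectPath.toString(), sourceFile.toString());
        if (exitCode != 0) {
            throw new IllegalStateException("Compilation failed for " + sourceFile);
        }
    }

    /** Writes, publishes, loads, and calls a tiny task class without restarting the JVM. */
    static String demo() {
        try {
            Path work = Files.createTempDirectory("crawler-sketch");
            Path projectPath = Files.createDirectory(work.resolve("project"));
            Path source = work.resolve("HotLoadedTask.java");
            String code = "public class HotLoadedTask {\n"
                    + "    public String describe() { return \"loaded without restart\"; }\n"
                    + "}\n";
            Files.write(source, code.getBytes(StandardCharsets.UTF_8));

            publish(source, projectPath);

            // Assumption: the running system can see the project path through a class loader.
            URL[] urls = { projectPath.toUri().toURL() };
            try (URLClassLoader loader = new URLClassLoader(urls)) {
                Class<?> taskClass = loader.loadClass("HotLoadedTask");
                Object task = taskClass.getDeclaredConstructor().newInstance();
                return (String) taskClass.getMethod("describe").invoke(task);
            }
        } catch (Exception e) {
            throw new IllegalStateException("Publish-and-load sketch failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

A real deployment would also have to decide how an updated class with the same name replaces an already loaded one, which this sketch does not address.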

Step 202: send the information of the website to be crawled to the message queue of the crawler system; the information of the website to be crawled includes an initial url and the fully qualified name of a crawler task class defined in the crawler task code; the fully qualified name of the crawler task class defined in the crawler task code includes the fully qualified name of a first crawler task class.

In this embodiment, the information of the website to be crawled is sent as a message body to the message queue maintained by the crawler system. Including several kinds of crawl-related information in it provides the basis for the subsequent page crawling, page parsing, and data storage operations.

Step 2031: obtain the information of the website to be crawled from the message queue of the crawler system;

Step 2032: according to the fully qualified name of the first crawler task class defined in the crawler task code, generate an instance object of the first crawler task class through the Java reflection mechanism, and call the page crawling method of that instance object to crawl the website page corresponding to the initial url in the information of the website to be crawled;

In practice, the core of the Java reflection mechanism is to load a class dynamically while the program is running and obtain detailed information about it, so that the attributes and methods of the class or object can be operated on. In essence, after the JVM obtains the class object, it decompiles through the class object to obtain the various information about the object.

In this embodiment, the class file obtained by compiling the crawler task code corresponding to the website to be crawled is published to the project path of the crawler system, so that the class object in the class file (for example, the instance object of the first crawler task class above) can later be obtained through the Java reflection mechanism. Decompiling through the class object yields the various information of the class object, including the page crawling method of the first crawler task class instance object, and that method can then be used to crawl the website page corresponding to the initial url.
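Step 2032 can be illustrated with the standard `Class.forName` and `java.lang.reflect` calls, which look up an already compiled class by name and inspect its methods at run time. The sketch below is an assumption-laden illustration: the method name `crawl`, its single `String` parameter, and the `ExampleForumTask` class are hypothetical, and the crawl itself is a stub that performs no network access.

```java
import java.lang.reflect.Method;

class ReflectiveCrawlSketch {
    /**
     * Creates the task named by taskClassName and calls its crawl(String) method.
     * The worker needs no compile-time reference to the task class.
     */
    static String crawlWith(String taskClassName, String initialUrl) {
        try {
            Class<?> taskClass = Class.forName(taskClassName);             // load by name
            Object task = taskClass.getDeclaredConstructor().newInstance(); // no-arg constructor
            Method crawl = taskClass.getMethod("crawl", String.class);      // look up the method
            return (String) crawl.invoke(task, initialUrl);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot run crawler task " + taskClassName, e);
        }
    }

    public static void main(String[] args) {
        String name = ExampleForumTask.class.getName();
        System.out.println(crawlWith(name, "https://example.com/forum"));
    }
}

/** Hypothetical first crawler task class; a real one would fetch the page. */
class ExampleForumTask {
    public String crawl(String url) {
        return "<html>stub page for " + url + "</html>";
    }
}
```

An unknown class name or a missing method surfaces as an `IllegalStateException` here; how the patented system reports such failures is not stated in the text.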

In one embodiment of the present invention, the information of the website to be crawled may further include a crawling tool. Developers can select a crawling tool suited to the website to be crawled according to actual needs, for example one of the httpClient client, a headless Chrome browser, phantomJs, or another crawling tool.

Calling the page crawling method of the first crawler task class instance object to crawl the website page corresponding to the initial url in the information of the website to be crawled may specifically include:

taking the crawling tool in the information of the website to be crawled as the input parameter of the page crawling method that specifies the crawling tool, and calling the page crawling method of the first crawler task class instance object to crawl the website page corresponding to the initial url in that information.

Here, because the crawling tool in the information of the website to be crawled is passed as the input parameter that specifies the crawling tool, the page crawling method, when called, can use that tool to crawl the website page corresponding to the initial url.
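Passing the crawling tool as an input parameter might look like the sketch below. The enum `FetchTool`, its constant names, and `ToolAwareTask` are hypothetical; the three branches return descriptive stub strings rather than driving httpClient, headless Chrome, or phantomJs.

```java
class ToolSelectionSketch {
    /** Converts the tool name carried in the message and hands it to the crawl method. */
    static String crawl(ToolAwareTask task, String initialUrl, String toolName) {
        // valueOf throws IllegalArgumentException for a name that is not a known tool.
        FetchTool tool = FetchTool.valueOf(toolName);
        return task.crawl(initialUrl, tool);
    }

    public static void main(String[] args) {
        System.out.println(crawl(new ToolAwareTask(), "https://example.com", "HEADLESS_CHROME"));
    }
}

/** Hypothetical identifiers for the tools named in the text. */
enum FetchTool { HTTP_CLIENT, HEADLESS_CHROME, PHANTOM_JS }

/** Stub task whose crawl method takes the tool as a parameter. */
class ToolAwareTask {
    public String crawl(String url, FetchTool tool) {
        switch (tool) {
            case HEADLESS_CHROME:
                return "rendered " + url + " with headless Chrome (stub)";
            case PHANTOM_JS:
                return "rendered " + url + " with phantomJs (stub)";
            default:
                return "fetched " + url + " with httpClient (stub)";
        }
    }
}
```

Keeping the tool in the message rather than in the task code means the same task class can be rerun with a different tool by sending a different message.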

In another embodiment of the present invention, the information of the website to be crawled further includes a number of pages to crawl, n. In practice, some web content spans multiple pages; a popular post, for example, may run to hundreds of pages. In that case one can choose not to crawl every page and crawl only the first several. The number of pages to crawl, n, can therefore be specified in the information of the website to be crawled to reduce the resources consumed by page crawling. The value of n can be set as required, for example to 10.

Calling the page crawling method of the first crawler task class instance object to crawl the website page corresponding to the initial url in the information of the website to be crawled may specifically include:

taking the number of pages to crawl as the input parameter of the page crawling method that specifies the maximum number of pages to crawl, and calling the page crawling method of the first crawler task class instance object to crawl the website page corresponding to the initial url in that information.

Here, because the number of pages to crawl is passed as the input parameter that specifies the maximum number of pages, the page crawling method, when called, crawls only the first n pages if the website page corresponding to the initial url has multiple pages.
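The page limit can be sketched as a loop that stops at whichever comes first: the limit n or the last available page. Everything here is illustrative: `PageLimitSketch`, the `?page=` query parameter, and the stub `fetchPage` (which takes the thread's total page count as an argument so no network access is needed) are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class PageLimitSketch {
    /** Stub fetch: returns null once pageNo runs past the last page of the thread. */
    static String fetchPage(String initialUrl, int pageNo, int totalPages) {
        if (pageNo > totalPages) {
            return null;
        }
        return "stub content of " + initialUrl + "?page=" + pageNo;
    }

    /** Crawls at most maxPages pages, stopping early if the thread is shorter. */
    static List<String> crawl(String initialUrl, int maxPages, int totalPages) {
        List<String> pages = new ArrayList<>();
        for (int pageNo = 1; pageNo <= maxPages; pageNo++) {
            String page = fetchPage(initialUrl, pageNo, totalPages);
            if (page == null) {
                break; // no more pages available
            }
            pages.add(page);
        }
        return pages;
    }

    public static void main(String[] args) {
        // A 120-page thread crawled with n = 10.
        System.out.println(crawl("https://example.com/thread/1", 10, 120).size());
    }
}
```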

Step 2032 above is a specific refinement of "according to the fully qualified name of the crawler task class defined in the crawler task code, crawl the website page corresponding to the initial url in the information of the website to be crawled" in step 103 shown in FIG. 1.

Step 2033: according to the fully qualified name of the crawler task class defined in the crawler task code, parse the website page corresponding to the initial url, and store the page parsing result for that website page.

Steps 2031 to 2033 above are a specific refinement of step 103 shown in FIG. 1.

As the method shown in FIG. 2 shows, in this embodiment the crawler task code corresponding to the website to be crawled is compiled and published to the project path of the crawler system; the information of the website to be crawled is sent to the message queue of the crawler system; an instance object of the first crawler task class is generated through the Java reflection mechanism, and its page crawling method is called to crawl the website page corresponding to the initial url in the information of the website to be crawled; that page is parsed; and the page parsing result is stored. By publishing the crawler task code for the website to be crawled and referring to the crawler task class by its fully qualified name, this embodiment enables fully configurable, customized operation of the crawler code without downtime, with no need to shut down for redeployment. In addition, by including a crawling tool in the information of the website to be crawled, this embodiment instructs the page crawling module to use that tool to crawl the pages of the website, which avoids crawl failures caused by an unsuitable tool. Furthermore, by including the number of pages to crawl in the information of the website to be crawled, this embodiment controls how many pages the page crawling module crawls for the website page corresponding to the initial url, which reduces the resources consumed by crawling too many pages.

Referring to FIG. 3, FIG. 3 is a flowchart of the crawler method of Embodiment 3 of the present invention. The method is applied to a crawler system and, as shown in FIG. 3, mainly includes the following steps:

Step 301: compile the crawler task code corresponding to the website to be crawled and publish it to the project path of the crawler system;

In practice, the crawler task code corresponding to the website to be crawled can be compiled into a class file, and the class file published to the project path of the crawler system.

Step 302: send the information of the website to be crawled to the message queue of the crawler system; the information of the website to be crawled includes an initial url and the fully qualified name of a crawler task class defined in the crawler task code; the fully qualified name of the crawler task class defined in the crawler task code includes the fully qualified name of a second crawler task class;

In this embodiment, including several kinds of crawl-related information in the information of the website to be crawled provides the basis for the subsequent page crawling, page parsing, and data storage operations.

Step 3031: obtain the information of the website to be crawled from the message queue of the crawler system;

Step 3032: according to the fully qualified name of the crawler task class defined in the crawler task code, crawl the website page corresponding to the initial url in the information of the website to be crawled;

Step 3033: according to the fully qualified name of the second crawler task class defined in the crawler task code, generate an instance object of the second crawler task class through the Java reflection mechanism, and call the page parsing method of that instance object to parse the website page corresponding to the initial url;

In practice, the core of the Java reflection mechanism is to load a class dynamically while the program is running and obtain detailed information about it, so that the attributes and methods of the class or object can be operated on. In essence, after the JVM obtains the class object, it decompiles through the class object to obtain the various information about the object.

In this embodiment, the class file obtained by compiling the crawler task code corresponding to the website to be crawled is published to the project path of the crawler system, so that the class object in the class file (for example, the instance object of the second crawler task class above) can later be obtained through the Java reflection mechanism. Decompiling through the class object yields the various information of the class object, including the page parsing method of the second crawler task class instance object, and that method can then be used to parse the website page corresponding to the initial url. Because the page parsing method is parsing code written specifically for the page structure of the website to be crawled, the page can be parsed accurately.

在本发明的一个实施例中,所述待抓取网站信息还包括抓取字段。在实际应用中,页面抓取的目的有二,其一是获取页面内容,其二是获得页面关联的新的url进行进一步的抓取,对此,并不需要对页面中的全部字段进行解析,而只需要针对性的对页面中的部分字段进行解析即可实现上述目的,因此,在该实施例中,可以在待抓取网站信息中规定需要对哪些字段进行解析,例如可以规定对页面中的如下字段进行解析:title、clicknum、reply、pubTime、author、content、img。In an embodiment of the present invention, the information on the website to be crawled further includes a crawling field. In practical applications, there are two purposes of page crawling. One is to obtain the content of the page, and the other is to obtain the new url associated with the page for further crawling. For this, it is not necessary to parse all the fields in the page. , and only need to parse some fields in the page in a targeted manner to achieve the above purpose. Therefore, in this embodiment, it is possible to specify which fields need to be parsed in the information of the website to be crawled. The following fields are parsed: title, clicknum, reply, pubTime, author, content, img.

Calling the page parsing method of the second crawler task class instance object to parse the website page corresponding to the initial url may specifically include:

Using the crawl fields as the input parameter of the page parsing method of the second crawler task class instance object that specifies the target fields to parse, and calling that page parsing method to parse, from the website page corresponding to the initial url, the field content corresponding to each crawl field.

Because the crawl fields are passed as the input parameter that specifies the target fields to parse, the page parsing method parses only the crawl fields when it is called.
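The field-restricted parsing described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the class name FieldLimitedParser, the method signature, and the idea that the page has already been reduced to a map of raw field names to values are all assumptions. A real page parsing method would instead use extraction logic specific to the page structure of the target website.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parser: the class name, the method name, and the
// "raw page as a map" representation are assumptions for illustration only.
class FieldLimitedParser {

    // Returns only the entries whose key is listed in crawlFields,
    // in the order in which the fields were requested.
    // Requested fields that are absent from the page are skipped.
    public static Map<String, String> parse(Map<String, String> rawPage,
                                            List<String> crawlFields) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String field : crawlFields) {
            String value = rawPage.get(field);
            if (value != null) {
                result.put(field, value);
            }
        }
        return result;
    }
}
```

Fields that are present in the page but not requested are never copied into the result, which is the point of passing the crawl fields as a parameter.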

Step 3034: Store the page parsing result of the website page corresponding to the initial url according to the fully qualified name of the crawler task class defined in the crawler task code.

As can be seen from the method shown in FIG. 3, in this embodiment the crawler task code corresponding to the website to be crawled is compiled and published to the project path of the crawler system; the website page corresponding to the initial url in the to-be-crawled website information is crawled according to the fully qualified name of the crawler task class defined in the crawler task code; a second crawler task class instance object is generated through the Java reflection mechanism according to the fully qualified name of the second crawler task class defined in the crawler task code, and its page parsing method is called to parse the website page corresponding to the initial url; and the page parsing result of the website page corresponding to the initial url is stored. By publishing the crawler task code corresponding to the website to be crawled and by using the fully qualified name of the crawler task class defined in that code, this embodiment makes the crawler code fully configurable and customizable while the system keeps running, with no need to stop and redeploy it. In addition, because only some of the fields in the page are parsed, this embodiment reduces the resources consumed by page parsing.

Referring to FIG. 4, FIG. 4 is a flowchart of a crawler method according to Embodiment 4 of the present invention. The method is applied to a crawler system and, as shown in FIG. 4, mainly includes the following steps:

Step 401: Compile the crawler task code corresponding to the website to be crawled and publish it to the project path of the crawler system;

In an actual implementation, the crawler task code corresponding to the website to be crawled can be compiled into a class file, and the class file can be published to the project path of the crawler system.

Step 402: Send the to-be-crawled website information to the message queue of the crawler system; the to-be-crawled website information includes the initial url and the fully qualified name of the crawler task class defined in the crawler task code; the fully qualified name of the crawler task class defined in the crawler task code includes the fully qualified name of a third crawler task class;

In this embodiment, including several kinds of information related to page crawling in the to-be-crawled website information provides the basis for the subsequent page crawling, page parsing, and data storage operations.

Step 4031: Obtain the to-be-crawled website information from the message queue of the crawler system;

Step 4032: According to the fully qualified name of the crawler task class defined in the crawler task code, crawl the website page corresponding to the initial url in the to-be-crawled website information, and parse the website page corresponding to the initial url;

Step 4033: According to the fully qualified name of the third crawler task class defined in the crawler task code, generate a third crawler task class instance object through the Java reflection mechanism, and call the page storage method of the third crawler task class instance object to store the page parsing result of the website page corresponding to the initial url.

In practice, the core of the Java reflection mechanism is to load classes dynamically while the program is running and to obtain detailed information about them, so that the attributes and methods of a class or object can be operated on. In essence, once the JVM has obtained the class object, the class object is inspected to obtain information about the object.

In this embodiment, the crawler task code corresponding to the website to be crawled is compiled into a class file, and the class file is published to the project path of the crawler system. The class object in the class file (for example, that of the third crawler task class instance object described above) can then be obtained through the Java reflection mechanism. Inspecting the class object yields its information, including the page storage method of the third crawler task class instance object, and that method can be used to store the page parsing result of the website page corresponding to the initial url.
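The reflection flow described above (load a class by its fully qualified name, create an instance, look up a method, call it) can be sketched with the standard java.lang.reflect API. The class DemoStoreTask, its store method, and the helper ReflectionInvoker are hypothetical stand-ins; the patent does not specify the method names or signatures of its crawler task classes.

```java
import java.lang.reflect.Method;

// Hypothetical crawler task class standing in for a compiled class that has
// been published to the project path. Its name and method are assumptions.
class DemoStoreTask {
    public String store(String parseResult) {
        return "stored:" + parseResult;
    }
}

class ReflectionInvoker {

    // Loads the class by its fully qualified name, creates an instance through
    // the no-argument constructor, and invokes the named public method that
    // takes a single String argument.
    public static Object invoke(String className, String methodName, String arg) {
        try {
            Class<?> taskClass = Class.forName(className);
            Object instance = taskClass.getDeclaredConstructor().newInstance();
            Method method = taskClass.getMethod(methodName, String.class);
            return method.invoke(instance, arg);
        } catch (ReflectiveOperationException e) {
            // Covers a missing class, constructor, or method, and failures
            // raised while instantiating or invoking.
            throw new IllegalStateException("cannot invoke " + className + "." + methodName, e);
        }
    }
}
```

In this sketch the caller only needs two strings, so the class to run can be chosen by a message at run time instead of being fixed at compile time. For example, `ReflectionInvoker.invoke(DemoStoreTask.class.getName(), "store", "page-1")` runs the stand-in class; in a crawler system the first argument would come from the to-be-crawled website information.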

In one embodiment of the present invention, the to-be-crawled website information further includes a data identifier generation rule and target storage system information, where the data identifier generation rule includes the fields that participate in generating the data identifier and the method for generating the data identifier. The method for generating the data identifier can be determined according to specific requirements. For example, the participating fields can be specified as url, title, and pubTime, and the method can be: concatenate the field contents of the participating fields in a preset order, compute the MD5 value of the concatenated string, and use that MD5 value as the data identifier. The preset order can be determined according to actual requirements; for example, it can be: url field content, title field content, pubTime field content.
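The example rule above (concatenate the participating field contents in a preset order, then take the MD5 value) can be sketched with java.security.MessageDigest. The class name DataIdGenerator, the use of UTF-8, the lowercase hexadecimal output, and the treatment of a missing field as an empty string are assumptions; the text does not specify them.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;

// Hypothetical helper; the name and the encoding choices are assumptions.
class DataIdGenerator {

    // Concatenates the contents of idFields in the given (preset) order and
    // returns the MD5 digest of the result as 32 lowercase hex characters.
    public static String generate(Map<String, String> parsed, List<String> idFields) {
        StringBuilder joined = new StringBuilder();
        for (String field : idFields) {
            // Assumption: a field missing from the parse result contributes nothing.
            joined.append(parsed.getOrDefault(field, ""));
        }
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(joined.toString().getBytes(StandardCharsets.UTF_8));
            // %032x keeps leading zeros so the identifier always has 32 characters.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }
}
```

Because the contents are joined without a separator, two different records can produce the same concatenated string (for example "ab" + "c" and "a" + "bc"). The text describes plain concatenation, so the sketch follows it; whether a separator is needed depends on the fields chosen.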

In practice, data must be stored after a page has been crawled. When storing, a data identifier can first be generated for the data to be stored, and the data identifier and the data can then be stored together, which simplifies later operations such as data queries. For this purpose, the data identifier generation rule can be specified in the to-be-crawled website information, which can also specify the target storage system in which the data is to be stored. The target storage system information may specifically include: the type of the target storage system (for example, mysql), its connection address (identified by a url), and the user name (username) and password (password) used to log in to the target storage system.

In this embodiment, calling the page storage method of the third crawler task class instance object to store the page parsing result of the website page corresponding to the initial url may specifically include:

Determining, in the page parsing result, the field content of each field that participates in generating the data identifier;

Determining the data identifier corresponding to the page parsing result according to the field content of each participating field and the method for generating the data identifier;

Calling the page storage method of the third crawler task class instance object to store the page parsing result and its corresponding data identifier in the target storage system.

As can be seen from the method shown in FIG. 4, in this embodiment the crawler task code corresponding to the website to be crawled is compiled and published to the project path of the crawler system; the website page corresponding to the initial url in the to-be-crawled website information is crawled and parsed according to the fully qualified name of the crawler task class defined in the crawler task code; a third crawler task class instance object is generated through the Java reflection mechanism according to the fully qualified name of the third crawler task class given in the to-be-crawled website information, and its page storage method is called to store the page parsing result of the website page corresponding to the initial url, which completes the page crawling flow for the website to be crawled. By publishing the crawler task code corresponding to the website to be crawled and by using the fully qualified name of the crawler task class defined in that code, this embodiment makes the crawler code fully configurable and customizable while the system keeps running, with no need to stop and redeploy it. In addition, because only some of the fields in the page are parsed, this embodiment reduces the resources consumed by page parsing. Furthermore, a data identifier is generated for the page parsing result and stored in the target storage system together with the result, which simplifies later data operations such as data queries.

The crawler method provided by the embodiments of the present invention has been described in detail above. It should be noted that, in the four embodiments above, the fully qualified names of the first, second, and third crawler task classes may be the same or different. When the three fully qualified names are the same, they denote a single crawler task class that defines the page crawling method, the page parsing method, and the page storage method. In this case only one instance object of the crawler task class needs to be generated, and the page crawling, page parsing, and data storage operations are performed by calling the page crawling method, page parsing method, and page storage method of that instance object. When the three fully qualified names all differ, they denote different crawler task classes; that is, the crawler task code defines three crawler task classes: the first crawler task class (which defines the page crawling method), the second crawler task class (which defines the page parsing method), and the third crawler task class (which defines the page storage method).

The following is a specific example of to-be-crawled website information used as a message body. It includes: the initial url (http://bbs.tianya.cn/post-travel-862656-1.shtml), the fully qualified name of the shared crawler task class (com.jd.crawler.TianyaProcessor), the crawl tool (Chrome), the number of pages to crawl (10), the crawl fields (title, clicknum, reply, pubTime, author, content, img), the data identifier generation rule (participating fields: url, title, pubTime; the generation method is as in the earlier example and is not repeated here), and the target storage system information.
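One way to picture this message body is as a plain Java value object. The type CrawlMessage and all of its field names are hypothetical; the values are the ones listed in the example. The example gives no concrete target storage details, so apart from the type mysql, which the description offers as a possible type, the storage entries below are placeholders rather than real settings.

```java
import java.util.List;
import java.util.Map;

// Hypothetical message-body type. Field names are assumptions; the values in
// tianyaExample() come from the example in the text.
final class CrawlMessage {
    final String initialUrl;
    final String taskClassName;   // fully qualified name of the crawler task class
    final String crawlTool;
    final int maxPages;
    final List<String> crawlFields;
    final List<String> idFields;  // fields that participate in the data identifier
    final Map<String, String> targetStorage;

    CrawlMessage(String initialUrl, String taskClassName, String crawlTool, int maxPages,
                 List<String> crawlFields, List<String> idFields,
                 Map<String, String> targetStorage) {
        this.initialUrl = initialUrl;
        this.taskClassName = taskClassName;
        this.crawlTool = crawlTool;
        this.maxPages = maxPages;
        this.crawlFields = crawlFields;
        this.idFields = idFields;
        this.targetStorage = targetStorage;
    }

    static CrawlMessage tianyaExample() {
        return new CrawlMessage(
                "http://bbs.tianya.cn/post-travel-862656-1.shtml",
                "com.jd.crawler.TianyaProcessor",
                "Chrome",
                10,
                List.of("title", "clicknum", "reply", "pubTime", "author", "content", "img"),
                List.of("url", "title", "pubTime"),
                // Placeholders only: the example does not give these values.
                Map.of("type", "mysql",
                       "url", "<storage-url>",
                       "username", "<username>",
                       "password", "<password>"));
    }
}
```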

For the example above, when the crawler system has no crawler task code suitable for the website to be crawled, the pages of the website can be crawled through the following process:

First, a developer generates the crawler task code corresponding to the website to be crawled, either by modifying existing crawler code or by writing new crawler task code, compiles the crawler task code and publishes it to the project path of the crawler system, and sends the to-be-crawled website information to the message queue of the crawler system;

Next, the to-be-crawled website information is obtained from the message queue of the crawler system. According to the fully qualified name of the crawler task class in that information (com.jd.crawler.TianyaProcessor), a crawler task class instance object TianyaProcessor is generated through the Java reflection mechanism. With the crawl tool Chrome and the number of pages to crawl, 10, as input parameters, the page crawling method of the instance object TianyaProcessor is called to crawl the website page corresponding to the initial url (http://bbs.tianya.cn/post-travel-862656-1.shtml);

Then, with the crawl fields (title, clicknum, reply, pubTime, author, content, img) as input parameters, the page parsing method of the instance object TianyaProcessor is called to parse the website page corresponding to the initial url and obtain the field content of the following fields: title, clicknum, reply, pubTime, author, content, img;

Finally, the field content of each field that participates in generating the data identifier (url, title, pubTime) is determined in the page parsing result of the website page corresponding to the initial url; the data identifier corresponding to the page parsing result is determined according to the field content of each participating field and the method for generating the data identifier; and the page storage method of the instance object TianyaProcessor is called to store the page parsing result and its corresponding data identifier in the target storage system.
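The process above, reduced to its control flow, might look like the following sketch: take a message from the queue, instantiate the named class once by reflection, and call its crawl, parse, and store methods in turn. Everything here is an assumption made for illustration: the CrawlerTask interface, the StubProcessor class (which performs no network access and stands in for a site-specific class such as TianyaProcessor), the array-based message layout, and the in-memory list used in place of a target storage system. Data identifier generation, the crawl tool, and the page count are omitted to keep the sketch short.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;

// Hypothetical contract that a published crawler task class is assumed to implement.
interface CrawlerTask {
    String fetch(String url);
    Map<String, String> parse(String page, List<String> fields);
    void store(Map<String, String> result, List<Map<String, String>> sink);
}

// Stand-in for a site-specific class; it only echoes its inputs.
class StubProcessor implements CrawlerTask {
    public String fetch(String url) {
        return "page-of:" + url;
    }
    public Map<String, String> parse(String page, List<String> fields) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String field : fields) {
            out.put(field, field + "@" + page);
        }
        return out;
    }
    public void store(Map<String, String> result, List<Map<String, String>> sink) {
        sink.add(result);
    }
}

class CrawlWorker {

    // Assumed message layout: [0] = fully qualified class name,
    // [1] = initial url, [2..] = crawl fields.
    // Returns false when the queue is empty, true after one message is handled.
    static boolean processNext(BlockingQueue<String[]> queue,
                               List<Map<String, String>> sink) {
        String[] msg = queue.poll();
        if (msg == null) {
            return false;
        }
        try {
            // One instance serves all three operations, as in the shared-class case.
            CrawlerTask task = (CrawlerTask) Class.forName(msg[0])
                    .getDeclaredConstructor().newInstance();
            String page = task.fetch(msg[1]);
            List<String> fields = Arrays.asList(msg).subList(2, msg.length);
            task.store(task.parse(page, fields), sink);
            return true;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot load crawler task class " + msg[0], e);
        }
    }
}
```

Casting the reflectively created instance to an interface is a design choice of this sketch; it avoids looking up each method by name, at the cost of requiring every published class to implement a shared contract.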

An embodiment of the present invention further provides a crawler device, which is described in detail below with reference to FIG. 5.

Referring to FIG. 5, FIG. 5 is a schematic structural diagram of the crawler device provided by an embodiment of the present invention. The device is applied to a crawler system and, as shown in FIG. 5, includes: a task scheduling module 501, a page crawling module 502, a page parsing module 503, a data storage module 504, a message startup module 505, and a code publishing module 506;

The code publishing module 506 is configured to compile the crawler task code corresponding to the website to be crawled and publish it to the project path of the crawler system;

The message startup module 505 is configured to send the to-be-crawled website information to the message queue maintained by the task scheduling module 501; the to-be-crawled website information includes the initial url and the fully qualified name of the crawler task class defined in the crawler task code;

The page crawling module 502 is configured to obtain the to-be-crawled website information from the message queue maintained by the task scheduling module 501 and, according to the fully qualified name of the crawler task class defined in the crawler task code, crawl the website page corresponding to the initial url in the to-be-crawled website information;

The page parsing module 503 is configured to obtain the to-be-crawled website information from the message queue maintained by the task scheduling module 501 and, according to the fully qualified name of the crawler task class defined in the crawler task code, parse the website page corresponding to the initial url;

The data storage module 504 is configured to obtain the to-be-crawled website information from the message queue maintained by the task scheduling module 501 and, according to the fully qualified name of the crawler task class defined in the crawler task code, store the page parsing result produced by the page parsing module 503 for the website page corresponding to the initial url.

In the device shown in FIG. 5,

The fully qualified name of the crawler task class defined in the crawler task code includes the fully qualified name of a first crawler task class;

The page crawling module 502 crawling, according to the fully qualified name of the crawler task class defined in the crawler task code, the website page corresponding to the initial url in the to-be-crawled website information includes:

According to the fully qualified name of the first crawler task class defined in the crawler task code, generating a first crawler task class instance object through the Java reflection mechanism, and calling the page crawling method of the first crawler task class instance object to crawl the website page corresponding to the initial url in the to-be-crawled website information.

In the device shown in FIG. 5,

The to-be-crawled website information further includes a crawl tool;

The page crawling module 502 calling the page crawling method of the first crawler task class instance object to crawl the website page corresponding to the initial url in the to-be-crawled website information includes:

Using the crawl tool as the input parameter of the page crawling method of the first crawler task class instance object that specifies the crawl tool, and calling that page crawling method to crawl the website page corresponding to the initial url in the to-be-crawled website information.

In the device shown in FIG. 5,

The to-be-crawled website information further includes a number of pages to crawl, n;

The page crawling module 502 calling the page crawling method of the first crawler task class instance object to crawl the website page corresponding to the initial url in the to-be-crawled website information includes:

Using the number of pages to crawl as the input parameter of the page crawling method of the first crawler task class instance object that specifies the maximum number of pages to crawl, and calling that page crawling method to crawl the website page corresponding to the initial url in the to-be-crawled website information.

In the device shown in FIG. 5,

The fully qualified name of the crawler task class defined in the crawler task code includes the fully qualified name of a second crawler task class;

The page parsing module 503 parsing, according to the fully qualified name of the crawler task class defined in the crawler task code, the website page corresponding to the initial url includes:

According to the fully qualified name of the second crawler task class defined in the crawler task code, generating a second crawler task class instance object through the Java reflection mechanism, and calling the page parsing method of the second crawler task class instance object to parse the website page corresponding to the initial url.

In the device shown in FIG. 5,

The to-be-crawled website information includes crawl fields;

The page parsing module 503 calling the page parsing method of the second crawler task class instance object to parse the website page corresponding to the initial url includes:

Using the crawl fields as the input parameter of the page parsing method of the second crawler task class instance object that specifies the target fields to parse, and calling that page parsing method to parse, from the website page corresponding to the initial url, the field content corresponding to each crawl field.

In the device shown in FIG. 5,

The fully qualified name of the crawler task class defined in the crawler task code includes the fully qualified name of a third crawler task class;

The data storage module 504 storing, according to the fully qualified name of the crawler task class defined in the crawler task code, the page parsing result produced by the page parsing module 503 for the website page corresponding to the initial url includes:

According to the fully qualified name of the third crawler task class defined in the crawler task code, generating a third crawler task class instance object through the Java reflection mechanism, and calling the page storage method of the third crawler task class instance object to store the page parsing result of the website page corresponding to the initial url.

In the device shown in FIG. 5,

The to-be-crawled website information includes a data identifier generation rule and target storage system information, where the data identifier generation rule includes the fields that participate in generating the data identifier and the method for generating the data identifier;

The data storage module 504 calling the page storage method of the third crawler task class instance object to store the page parsing result of the website page corresponding to the initial url includes:

Determining, in the page parsing result, the field content of each field that participates in generating the data identifier;

Determining the data identifier corresponding to the page parsing result according to the field content of each participating field and the method for generating the data identifier;

Calling the page storage method of the third crawler task class instance object to store the page parsing result and its corresponding data identifier in the target storage system.

In the device shown in FIG. 5,

The fields that participate in generating the data identifier include: the url field, the title field, and the pubTime field;

The method for generating the data identifier is: concatenate the field contents of the participating fields in a preset order, compute the MD5 value of the concatenated string, and use that MD5 value as the data identifier.

An embodiment of the present invention further provides an electronic device. As shown in FIG. 6, the electronic device includes: at least one processor 601, and a memory 602 connected to the at least one processor 601 through a bus; the memory 602 stores one or more computer programs executable by the at least one processor 601; when executing the one or more computer programs, the at least one processor 601 implements the steps of the crawler method shown in any of the flowcharts of FIGS. 1-4.

An embodiment of the present invention further provides a computer-readable storage medium that stores one or more computer programs; when executed by a processor, the one or more computer programs implement the steps of the crawler method shown in any of the flowcharts of FIGS. 1-4.

The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (12)

1. A crawler method, applied to a crawler system, characterized by comprising the following steps:
compiling crawler task code corresponding to a website to be crawled and publishing it to a project path of the crawler system;
sending to-be-crawled website information to a message queue of the crawler system, wherein the to-be-crawled website information comprises an initial url and a fully qualified name of a crawler task class defined in the crawler task code;
obtaining the to-be-crawled website information from the message queue of the crawler system, crawling a website page corresponding to the initial url in the to-be-crawled website information according to the fully qualified name of the crawler task class defined in the crawler task code, parsing the website page corresponding to the initial url, and storing a page parsing result of the website page corresponding to the initial url.
2. The method of claim 1, wherein
the fully qualified name of the crawler task class defined in the crawler task code comprises a fully qualified name of a first crawler task class;
crawling the website page corresponding to the initial url in the to-be-crawled website information according to the fully qualified name of the crawler task class defined in the crawler task code comprises:
generating a first crawler task class instance object through a Java reflection mechanism according to the fully qualified name of the first crawler task class defined in the crawler task code, and calling a page crawling method of the first crawler task class instance object to crawl the website page corresponding to the initial url in the to-be-crawled website information.
3. The method of claim 2, wherein
the to-be-crawled website information further comprises a crawl tool;
calling the page crawling method of the first crawler task class instance object to crawl the website page corresponding to the initial url in the to-be-crawled website information comprises:
using the crawl tool as the input parameter of the page crawling method of the first crawler task class instance object that specifies the crawl tool, and calling the page crawling method of the first crawler task class instance object to crawl the website page corresponding to the initial url in the to-be-crawled website information.
4. The method of claim 2, wherein
the website information to be captured further comprises a number of pages to capture, n;
calling the page capturing method of the first crawler task class instance object to capture the website page corresponding to the initial url in the website information to be captured comprises:
passing the number of pages to capture as the input parameter of the page capturing method of the first crawler task class instance object that specifies the maximum number of pages to capture, and calling the page capturing method of the first crawler task class instance object to capture the website page corresponding to the initial url in the website information to be captured.
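Claims 3 and 4 pass values from the queued message through as arguments of the page capturing method. The sketch below shows one way this could look. The names `CrawlRequest`, `PagedFetchTask`, `fetchPages`, the tool value `"httpclient"`, and the `?page=` URL scheme are all assumptions made for the example, and no real pages are downloaded.

```java
import java.util.ArrayList;
import java.util.List;

final class FetchParameterDemo {
    // Forwards the message fields as arguments of the page capturing method.
    static List<String> run(CrawlRequest request) {
        return new PagedFetchTask()
                .fetchPages(request.startUrl, request.fetchTool, request.maxPages);
    }

    public static void main(String[] args) {
        System.out.println(run(new CrawlRequest("https://example.com/list", "httpclient", 3)));
    }
}

// Hypothetical message payload; field names are illustrative.
final class CrawlRequest {
    final String startUrl;
    final String fetchTool; // which capture tool to use (assumed value: "httpclient")
    final int maxPages;     // upper bound on the number of pages to capture

    CrawlRequest(String startUrl, String fetchTool, int maxPages) {
        this.startUrl = startUrl;
        this.fetchTool = fetchTool;
        this.maxPages = maxPages;
    }
}

// Hypothetical task whose capturing method accepts the tool and the page limit.
final class PagedFetchTask {
    // Simulates paging through a listing and stops once maxPages pages are collected.
    List<String> fetchPages(String startUrl, String fetchTool, int maxPages) {
        List<String> pages = new ArrayList<>();
        for (int page = 1; page <= maxPages; page++) {
            pages.add(fetchTool + ":" + startUrl + "?page=" + page);
        }
        return pages;
    }
}
```

Keeping the tool and the page limit in the message rather than in the task code means they can be changed per request without republishing the task class.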
5. The method of claim 1, wherein
the fully qualified name of the crawler task class defined in the crawler task code comprises the fully qualified name of a second crawler task class;
performing page analysis on the website page corresponding to the initial url according to the fully qualified name of the crawler task class defined in the crawler task code comprises:
generating a second crawler task class instance object through the Java reflection mechanism according to the fully qualified name of the second crawler task class defined in the crawler task code, and calling a page analysis method of the second crawler task class instance object to perform page analysis on the website page corresponding to the initial url.
6. The method of claim 5, wherein
the website information to be captured comprises capture fields;
calling the page analysis method of the second crawler task class instance object to perform page analysis on the website page corresponding to the initial url comprises:
passing the capture fields as the input parameter of the page analysis method of the second crawler task class instance object that specifies the target analysis fields, and calling the page analysis method of the second crawler task class instance object to parse, from the website page corresponding to the initial url, the field content corresponding to each capture field.
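The field-driven parsing of claim 6 can be sketched as follows. This is an assumption-laden illustration: the toy page layout, the regular expressions, the field names `title` and `pubTime`, and the method `parsePage` are invented for the example, and a real task would more likely use an HTML parser than regular expressions.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FieldParseDemo {
    // Hypothetical per-field extraction rules for a toy page layout.
    private static final Map<String, Pattern> EXTRACTORS = Map.of(
            "title", Pattern.compile("<h1>(.*?)</h1>"),
            "pubTime", Pattern.compile("<time>(.*?)</time>"));

    // Returns the content of each requested field that is found in the page.
    // Fields without an extraction rule or without a match are skipped.
    static Map<String, String> parsePage(String html, List<String> targetFields) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String field : targetFields) {
            Pattern pattern = EXTRACTORS.get(field);
            if (pattern == null) {
                continue;
            }
            Matcher matcher = pattern.matcher(html);
            if (matcher.find()) {
                result.put(field, matcher.group(1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<h1>Hello</h1><time>2021-12-24</time>";
        // Only the fields listed in the message are parsed.
        System.out.println(parsePage(html, List.of("title")));
    }
}
```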
7. The method of claim 1, wherein
the fully qualified name of the crawler task class defined in the crawler task code comprises the fully qualified name of a third crawler task class;
storing the page analysis result of the website page corresponding to the initial url according to the fully qualified name of the crawler task class defined in the crawler task code comprises:
generating a third crawler task class instance object through the Java reflection mechanism according to the fully qualified name of the third crawler task class defined in the crawler task code, and calling a page storage method of the third crawler task class instance object to store the page analysis result of the website page corresponding to the initial url.
8. The method of claim 7, wherein
the website information to be captured comprises a data identifier generation rule and a target storage system, wherein the data identifier generation rule comprises the fields participating in data identifier generation and a method for generating the data identifier;
calling the page storage method of the third crawler task class instance object to store the page analysis result of the website page corresponding to the initial url comprises:
determining, in the page analysis result, the field content of each field participating in data identifier generation;
determining the data identifier corresponding to the page analysis result according to the field content of each field participating in data identifier generation and the method for generating the data identifier;
and calling the page storage method of the third crawler task class instance object to store the page analysis result and its corresponding data identifier in the target storage system.
9. The method of claim 8,
the fields participating in data identifier generation comprise: a url field, a title field, and a pubTime field;
the method for generating the data identifier comprises: concatenating the field contents of the fields participating in data identifier generation in a preset order, computing the MD5 value of the concatenated character string, and taking the MD5 value as the generated data identifier.
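The identifier rule of claims 8 and 9 (concatenate selected field contents in a preset order, then take the MD5 value) can be sketched as below. The helper name `dataId`, the map keys, the UTF-8 encoding, and the lowercase hexadecimal rendering are assumptions for the example; the claims do not specify an encoding or an output format.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;

final class DataIdDemo {
    // Concatenates the contents of idFields in the given order and returns the
    // MD5 digest of the result as a 32-character lowercase hex string.
    // A field missing from the parse result contributes an empty string.
    static String dataId(Map<String, String> parsed, List<String> idFields) {
        StringBuilder joined = new StringBuilder();
        for (String field : idFields) {
            joined.append(parsed.getOrDefault(field, ""));
        }
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(joined.toString().getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> parsed = Map.of(
                "url", "https://example.com/a",
                "title", "Example",
                "pubTime", "2021-12-24");
        System.out.println(dataId(parsed, List.of("url", "title", "pubTime")));
    }
}
```

Since the identifier depends only on the chosen field contents and their order, parsing the same page twice yields the same identifier, which a storage system can use to detect duplicates.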
10. A crawler apparatus, comprising: a task scheduling module, a page capturing module, a page analysis module, a data storage module, a message starting module, and a code publishing module;
the code publishing module is used for compiling the crawler task code corresponding to the website to be captured and then publishing the crawler task code to a project path of the crawler system;
the message starting module is used for sending the website information to be captured to a message queue maintained by the task scheduling module; the website information to be captured comprises an initial url and the fully qualified name of a crawler task class defined in the crawler task code;
the page capturing module is used for acquiring the website information to be captured from the message queue maintained by the task scheduling module, and capturing the website page corresponding to the initial url in the website information to be captured according to the fully qualified name of the crawler task class defined in the crawler task code;
the page analysis module is used for acquiring the website information to be captured from the message queue maintained by the task scheduling module, and performing page analysis on the website page corresponding to the initial url according to the fully qualified name of the crawler task class defined in the crawler task code;
and the data storage module is used for acquiring the website information to be captured from the message queue maintained by the task scheduling module, and storing the page analysis result produced by the page analysis module for the website page corresponding to the initial url according to the fully qualified name of the crawler task class defined in the crawler task code.
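The interplay of the message queue and the capturing, analysis, and storage stages in claim 10 can be sketched in a simplified form. The sketch collapses the three consuming modules into a single worker and uses one task class for all three stages, which is a simplification and not the claimed architecture. The interface `CrawlerTask`, the class `NewsTask`, the message keys `taskClass` and `startUrl`, and the in-memory list standing in for the target storage system are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class QueueDispatchDemo {
    // Hypothetical contract that a published crawler task class implements.
    interface CrawlerTask {
        String fetch(String startUrl);
        Map<String, String> parse(String page);
        void store(Map<String, String> parsed, List<Map<String, String>> sink);
    }

    // Hypothetical task; fetch returns a placeholder instead of downloading.
    static final class NewsTask implements CrawlerTask {
        public String fetch(String startUrl) { return "page of " + startUrl; }
        public Map<String, String> parse(String page) { return Map.of("body", page); }
        public void store(Map<String, String> parsed, List<Map<String, String>> sink) {
            sink.add(parsed);
        }
    }

    // Takes messages until the queue is empty; for each one, instantiates the
    // named task class by reflection and runs fetch, parse, and store in turn.
    static void drain(BlockingQueue<Map<String, String>> queue,
                      List<Map<String, String>> sink) {
        Map<String, String> message;
        while ((message = queue.poll()) != null) {
            try {
                CrawlerTask task = (CrawlerTask) Class.forName(message.get("taskClass"))
                        .getDeclaredConstructor().newInstance();
                String page = task.fetch(message.get("startUrl"));
                task.store(task.parse(page), sink);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("cannot run " + message.get("taskClass"), e);
            }
        }
    }

    public static void main(String[] args) {
        BlockingQueue<Map<String, String>> queue = new LinkedBlockingQueue<>();
        queue.add(Map.of("taskClass", NewsTask.class.getName(),
                         "startUrl", "https://example.com"));
        List<Map<String, String>> sink = new ArrayList<>();
        drain(queue, sink);
        System.out.println(sink);
    }
}
```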
11. An electronic device, comprising: at least one processor and a memory connected with the at least one processor through a bus; the memory stores one or more computer programs executable by the at least one processor; characterized in that the at least one processor, when executing the one or more computer programs, implements the steps of the method of any one of claims 1-9.
12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores one or more computer programs which, when executed by a processor, implement the steps in the method of any one of claims 1-9.
CN202111596503.6A 2021-12-24 2021-12-24 Crawler method and device, electronic equipment and storage medium Pending CN114329137A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111596503.6A CN114329137A (en) 2021-12-24 2021-12-24 Crawler method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111596503.6A CN114329137A (en) 2021-12-24 2021-12-24 Crawler method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114329137A (en) 2022-04-12

Family

ID=81013345

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111596503.6A Pending CN114329137A (en) 2021-12-24 2021-12-24 Crawler method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114329137A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102141919A (en) * 2010-01-28 2011-08-03 北京邮电大学 Modularized java application software online updating system and method
CN109325161A (en) * 2018-09-11 2019-02-12 五八有限公司 Public sentiment data grasping means, device, equipment and storage medium
CN111221744A (en) * 2020-04-23 2020-06-02 杭州海康威视数字技术股份有限公司 Data acquisition method and device and electronic equipment
CN111814024A (en) * 2020-08-14 2020-10-23 北京斗米优聘科技发展有限公司 Distributed data acquisition method, system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination