CN106598991A - Web crawler system capable of realizing website interaction and automatic form extraction by conversational mode - Google Patents
- Publication number
- CN106598991A (application CN201510675362.5A)
- Authority
- CN
- China
- Prior art keywords
- script
- website
- interaction
- engine
- conversational mode
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Stored Programmes (AREA)
- Computer And Data Communications (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a web crawler system that interacts with websites and automatically extracts forms in a conversational mode. A considerable amount of information on the Internet can be obtained only by logging in with an account and performing specific mouse clicks and keyboard input. In traditional web crawler development, a browser developer tool is used to monitor the requests and responses exchanged between the browser and the server during manual operation; the intercepted requests and responses are analyzed by hand, and code is then written. The invention provides a system, verified to be feasible, that carries out this work of constructing the crawler's interaction information in a nearly fully automatic manner. The system consists of three elements: an HTML parser, a script recording engine and a script running simulator. The HTML parser parses tags such as forms (<FORM/>) and links (<A/>) from an HTML web page. The script recording engine gives users access to a target website by acting as a proxy and records the data exchanged between the browser and the server. The script running simulator runs the script recorded in the previous step in interpreted mode; playing the script realizes the interaction between the crawler and the website and the capture of information.
Description
The present invention provides a system, verified to be feasible, that performs the above work of constructing crawler interaction information in a nearly fully automatic manner. The system is made up of three elements:
1. An HTML parser, which parses form (<FORM/>) and link (<A/>) tags from an HTML web page;
2. A script recording engine, which acts as a proxy to give the user access to the target website and records all data exchanged between the user's browser and the server;
3. A script running simulator, which runs the script recorded in the previous step in interpreted mode; playing the script realizes the crawler's interaction with the website and the capture of information.
The system has the following working states:
1. Recording the script;
2. Parameterizing the script. This is a manual process in which a person sets the parameter values that need to be set or obtained under manual control;
3. Playing the script, which realizes the crawler function.
Regarding the second point, when the script is parameterized, some simple modifications can also be made to accomplish work that would otherwise require writing code. For example, an accumulating variable can be used to control a loop, an external function can be called to automatically recognize image verification codes (CAPTCHAs), and so on.
Description of the drawings
Fig. 1 is a structure diagram and workflow schematic of the present invention.
Fig. 2 is a sample script after automatic generation and manual parameterization.
Specific embodiment
The composition and working method of the three elements mentioned above are described below.
The HTML parser parses the content of an HTML web page and generates a DOM (Document Object Model) tree. For the task of analyzing the crawler's interaction with the server, only the form tags <FORM/> and link tags <A/> in the DOM tree are of interest.
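The following is a minimal sketch of such a parser, not the implementation described in the patent. It assumes Python's standard `html.parser` module and skips building a full DOM tree, collecting only the <FORM/> and <A/> tags. The names `FormLinkParser` and `extract_forms_and_links` are hypothetical.

```python
from html.parser import HTMLParser


class FormLinkParser(HTMLParser):
    """Collects only <form> and <a> tags and ignores the rest of the page."""

    def __init__(self):
        super().__init__()
        self.forms = []  # one dict per form: {"action": ..., "method": ...}
        self.links = []  # href values of <a> tags

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "form":
            self.forms.append({
                "action": attrs.get("action") or "",
                # HTML forms default to GET when no method is given.
                "method": (attrs.get("method") or "GET").upper(),
            })
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])


def extract_forms_and_links(html):
    """Return (forms, links) found in the given HTML text."""
    parser = FormLinkParser()
    parser.feed(html)
    return parser.forms, parser.links
```

Calling `extract_forms_and_links(page_text)` returns the form descriptions and link targets that the later recording and playback steps would work with.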
In the system described in the present invention, the HTML parser serves two purposes. First, to support the script recording engine, the HTML parser replaces the targets of the <FORM/> and <A/> tags in the original web page, so that each form submission address or link target becomes a URL on the server address of the script recording engine. Second, the HTML parser analyzes the web page and extracts content, either as variables for the crawler's next interaction or as the final result to be captured. An example of the former is extracting the option names and option values from a drop-down box in the web page; examples of the latter are extracting an article's hit count or like count from the web page.
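As an illustration of the second purpose, the sketch below pulls option names and option values out of a drop-down box. It is an assumption-laden example rather than the patent's own code: it uses Python's standard `html.parser`, and the class name `OptionExtractor` and function `extract_options` are hypothetical.

```python
from html.parser import HTMLParser


class OptionExtractor(HTMLParser):
    """Collects (name, value) pairs from <option> tags."""

    def __init__(self):
        super().__init__()
        self.options = []
        self._value = None
        self._text = None  # None while we are outside an <option>

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self._finish()  # tolerate a previous <option> left unclosed
            self._value = dict(attrs).get("value")
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("option", "select"):
            self._finish()

    def _finish(self):
        if self._text is not None:
            name = "".join(self._text).strip()
            # Without a value attribute, the option text is the submitted value.
            value = self._value if self._value is not None else name
            self.options.append((name, value))
            self._text = None
            self._value = None


def extract_options(html):
    """Return the list of (option name, option value) pairs in the page."""
    extractor = OptionExtractor()
    extractor.feed(html)
    return extractor.options
```

Each extracted value could then be fed into the next request as a variable, which is the role the text assigns to this kind of extraction.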
The script recording engine is the key component for automatically extracting the forms used to interact with the website. It has a web interface similar to that of a search engine, with an input box in which a website address can be entered.
After the user submits a website address, the recording engine connects to the website and fetches the web page. The returned web page is processed as follows:
A) The code <BASE href="${URL}"/> is added to the web page, where ${URL} is the website address. This forces the browser to resolve relative links on the page against the specified address, so that related resources such as images are displayed normally;
B) For the tags <A href="${URL}"/>, <FORM action="${URL}"/>, <FRAME src="${URL}"/> and <IFRAME src="${URL}"/>, the link address ${URL} in the attribute is rewritten: ${URL} is replaced with the address of the engine itself, and the original value of ${URL} is carried on that address.
From then on, any form submission or link click that the user performs through the recording engine causes the browser to send its GET/POST request to the engine rather than directly to the original website. In this way, the whole series of requests and responses between the browser and the original website is recorded by the recording engine.
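The two page-processing steps can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the patent's implementation: the engine address `http://recorder.example/fetch` and the query parameter name `url` are hypothetical, and a regular expression stands in for a real HTML parser, so only quoted attribute values are handled.

```python
import re
from urllib.parse import quote, urljoin

ENGINE = "http://recorder.example/fetch"  # hypothetical engine address

# href/action/src attributes (quoted values only) on A, FORM, FRAME, IFRAME.
_LINK_ATTR = re.compile(
    r'(<(?:a|form|frame|iframe)\b[^>]*?\b(?:href|action|src)\s*=\s*)(["\'])(.*?)\2',
    re.IGNORECASE | re.DOTALL,
)


def rewrite_page(html, page_url, engine=ENGINE):
    """Point links and forms at the engine and add a <base> tag."""

    def redirect(match):
        prefix, quote_char, target = match.groups()
        absolute = urljoin(page_url, target)  # resolve relative targets
        # Carry the original address on the engine URL as a query parameter.
        carried = f"{engine}?url={quote(absolute, safe='')}"
        return f"{prefix}{quote_char}{carried}{quote_char}"

    html = _LINK_ATTR.sub(redirect, html)

    # Step A: let images and other relative resources load from the site.
    base = f'<base href="{page_url}"/>'
    new_html, count = re.subn(
        r"(<head\b[^>]*>)", lambda m: m.group(1) + base, html,
        count=1, flags=re.IGNORECASE,
    )
    return new_html if count else base + html
```

One practical caveat with this layout: browsers replace the query string of a form's action when submitting with GET, so a real engine would need to carry the original address differently for GET forms, for example in a hidden field.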
After the recording engine has recorded the request and response data, it translates them into script source code of the form "GET http://... Param1=Value1|Param2=Value2|...|ParamN=ValueN {ControlBits}". Here {ControlBits} records auxiliary information, for example whether the request is GET or POST and whether cookies are cleared. Param1=Value1 and so on are the form parameters submitted in POST mode.
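A recorded request could be turned into such a script line along these lines. The source does not define the contents of {ControlBits} or how special characters are escaped, so the comma-separated flag names and the percent-encoding of names and values below are assumptions, as is the function name `to_script_line`.

```python
from urllib.parse import quote


def to_script_line(method, url, params, control_bits=()):
    """Render one recorded request as a single script line.

    params is a sequence of (name, value) pairs. Names and values are
    percent-encoded so that "|" and spaces cannot break the line format
    (an assumption; the source does not specify any escaping).
    """
    pairs = "|".join(
        f"{quote(name, safe='')}={quote(value, safe='')}"
        for name, value in params
    )
    bits = ",".join(control_bits)  # hypothetical flag syntax
    return f"{method.upper()} {url} {pairs} {{{bits}}}"
```

For example, a recorded login POST with fields `username=jackey` and `password=123456` and a hypothetical `CLEAR_COOKIE` flag would be rendered as `POST http://site.example/login username=jackey|password=123456 {CLEAR_COOKIE}`.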
The script generated by the recording engine can be further parameterized by hand. Suppose the Param1=Value1 and Param2=Value2 above represent a user name and password, typically of the form username=jackey and password=123456; then jackey and 123456 can be replaced with parameters that are set at run time. Flow control can also be added. For example, adding a loop and a loop counter, combined with parameters that can be set at run time, makes batch operations possible.
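The parameterization step and a simple batch loop might look like the sketch below. It assumes the `${name}` placeholder style that appears in the script examples and uses Python's `string.Template` for substitution. The helper names `parameterize` and `expand` and the `counter` variable are hypothetical.

```python
import re
from string import Template


def parameterize(line, replacements):
    """Replace recorded literal values with ${name} placeholders.

    replacements maps a recorded literal (e.g. "jackey") to a parameter name.
    """
    for literal, name in replacements.items():
        # Match the whole value only: it must end at "|", a space or line end.
        pattern = "=" + re.escape(literal) + r"(?=[| ]|$)"
        line = re.sub(pattern, lambda m, n=name: "=${" + n + "}", line)
    return line


def expand(template_line, runs):
    """Yield one concrete script line per parameter set (the batch loop).

    A 1-based loop counter is available to the template as ${counter}.
    """
    template = Template(template_line)
    for counter, values in enumerate(runs, start=1):
        yield template.substitute(values, counter=counter)
```

With the recorded line `POST http://site.example/login username=jackey|password=123456 {}`, `parameterize(line, {"jackey": "user", "123456": "pass"})` produces the template, and `expand` then generates one login request per account in the batch.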
Once the script has been parameterized, the script running simulator can "play" it, which is the actual run of the crawler. The simulator mainly performs three types of work:
A) Interpreting and executing "GET http://... Param1=${Value1}|Param2=${Value2}|...|ParamN=${ValueN} {ControlBits}" scripts: ${Value1}, ${Value2}, ..., ${ValueN} are replaced with their run-time values, the request is sent to the website and the response is received;
B) Interpreting and executing "FIND ..." scripts: the HTML parser is called to analyze the web page, and the extracted content serves either as variables for the crawler's next interaction or as the final result to be captured;
C) Flow control: processing the flow control commands contained in the script.
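The first two kinds of work can be sketched as a tiny interpreter. Several things here are assumptions rather than anything the source specifies: the FIND syntax (`FIND name regex`, where the regex has one capturing group), the use of a regular expression in place of the HTML parser, and the injected `send` callable that performs the actual HTTP request. Flow control commands are left out of this sketch, and parameter values are assumed not to contain `|` or spaces.

```python
import re
from string import Template


def run_script(lines, params, send):
    """Play a script and return the final variable table.

    send(method, url, form) must perform the request and return the response
    body as text; it is injected so the interpreter itself stays network-free.
    """
    variables = dict(params)
    page = ""
    for raw in lines:
        verb = raw.split(" ", 1)[0]
        if verb in ("GET", "POST"):
            # Work type A: substitute run-time values, then send the request.
            line = Template(raw).substitute(variables)
            _, _, rest = line.partition(" ")
            url, _, tail = rest.partition(" ")
            pairs = "" if tail.startswith("{") else tail.split(" {", 1)[0]
            form = dict(p.split("=", 1) for p in pairs.split("|") if "=" in p)
            page = send(verb, url, form)
        elif verb == "FIND":
            # Work type B (hypothetical syntax): FIND <name> <regex>
            _, _, rest = raw.partition(" ")
            name, _, pattern = rest.partition(" ")
            match = re.search(pattern, page)
            variables[name] = match.group(1) if match else ""
        else:
            raise ValueError(f"unknown script command: {verb!r}")
    return variables
```

Because values found by FIND are stored in the same table as the run-time parameters, a later request line can refer to them as `${name}`, which matches the text's description of extracted content serving as variables for the next interaction.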
Claims (4)
1. A web crawler system that uses a conversational mode to interact with websites and automatically extract forms, characterized in that:
1) the system comprises three main components: an HTML parser; a script recording engine; and a script running simulator;
2) the system has three working states: first, recording the script; second, parameterizing the script, with optional manual intervention; third, playing the script to realize the crawler function.
2. The component "HTML parser" of the web crawler system according to claim 1 that uses a conversational mode to interact with websites and automatically extract forms, characterized in that: the HTML parser parses HTML web page content and generates a DOM tree.
3. The component "script recording engine" of the web crawler system according to claim 1 that uses a conversational mode to interact with websites and automatically extract forms, characterized in that: it has a web interface similar to that of a search engine, with an input box in which a website address can be entered; the recording engine connects to the website and fetches the web page; when the user submits a form or clicks a link through the recording engine, the recording engine records the request and response data and then translates them into script source code.
4. The component "script running simulator" of the web crawler system according to claim 1 that uses a conversational mode to interact with websites and automatically extract forms, characterized in that: it can "play" the script source code described in claim 3 to perform the function of a web crawler.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510675362.5A CN106598991A (en) | 2015-10-19 | 2015-10-19 | Web crawler system capable of realizing website interaction and automatic form extraction by conversational mode |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106598991A true CN106598991A (en) | 2017-04-26 |
Family
ID=58554265
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510675362.5A Pending CN106598991A (en) | 2015-10-19 | 2015-10-19 | Web crawler system capable of realizing website interaction and automatic form extraction by conversational mode |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106598991A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108076067A (en) * | 2017-12-27 | 2018-05-25 | 北京中关村科金技术有限公司 | A kind of method and system that the simulation of reptile configurationization is authorized to log in |
CN108664646A (en) * | 2018-05-16 | 2018-10-16 | 电子科技大学 | A kind of automatic download system of audio and video based on keyword |
CN108664461A (en) * | 2018-05-03 | 2018-10-16 | 北京神州泰岳软件股份有限公司 | A kind of web form Auto-writing method and device |
CN109246069A (en) * | 2018-06-15 | 2019-01-18 | 华为技术有限公司 | Webpage login method, device and readable storage medium storing program for executing |
CN109948025A (en) * | 2019-03-20 | 2019-06-28 | 上海古鳌电子科技股份有限公司 | A kind of data referencing recording method |
CN111159000A (en) * | 2019-12-30 | 2020-05-15 | 北京明朝万达科技股份有限公司 | Server performance test method, device, equipment and storage medium |
CN113946735A (en) * | 2021-10-05 | 2022-01-18 | 广州非凡信息安全技术有限公司 | Method and system for crawling and restoring WEB site by traffic recording |
CN115277451A (en) * | 2022-07-28 | 2022-11-01 | 中译语通科技股份有限公司 | Account login information initialization method and system based on automatic simulator |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101089856A (en) * | 2007-07-20 | 2007-12-19 | 李沫南 | Method for abstracting network data and web reptile system |
US7496636B2 (en) * | 2002-06-19 | 2009-02-24 | International Business Machines Corporation | Method and system for resolving Universal Resource Locators (URLs) from script code |
CN103051692A (en) * | 2012-12-11 | 2013-04-17 | 中国能源建设集团广东省电力设计研究院 | Mobile operation system working platform supporting extreme network environment |
CN104539053A (en) * | 2014-12-31 | 2015-04-22 | 国家电网公司 | Power dispatching automation polling robot and method based on reptile technology |
CN104750463A (en) * | 2013-12-26 | 2015-07-01 | 任子行网络技术股份有限公司 | A plug-in developing method and system |
- 2015-10-19: CN201510675362.5A (CN106598991A), Pending
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108076067A (en) * | 2017-12-27 | 2018-05-25 | 北京中关村科金技术有限公司 | A kind of method and system that the simulation of reptile configurationization is authorized to log in |
CN108076067B (en) * | 2017-12-27 | 2021-05-18 | 北京中关村科金技术有限公司 | Method and system for authorized crawler configuration simulation login |
CN108664461A (en) * | 2018-05-03 | 2018-10-16 | 北京神州泰岳软件股份有限公司 | A kind of web form Auto-writing method and device |
CN108664461B (en) * | 2018-05-03 | 2023-08-22 | 鼎富智能科技有限公司 | Automatic filling method and device for webpage form |
CN108664646A (en) * | 2018-05-16 | 2018-10-16 | 电子科技大学 | A kind of automatic download system of audio and video based on keyword |
CN109246069A (en) * | 2018-06-15 | 2019-01-18 | 华为技术有限公司 | Webpage login method, device and readable storage medium storing program for executing |
CN109246069B (en) * | 2018-06-15 | 2020-10-16 | 华为技术有限公司 | Webpage login method and device and readable storage medium |
CN109948025A (en) * | 2019-03-20 | 2019-06-28 | 上海古鳌电子科技股份有限公司 | A kind of data referencing recording method |
CN109948025B (en) * | 2019-03-20 | 2023-10-20 | 上海古鳌电子科技股份有限公司 | Data reference recording method |
CN111159000A (en) * | 2019-12-30 | 2020-05-15 | 北京明朝万达科技股份有限公司 | Server performance test method, device, equipment and storage medium |
CN113946735A (en) * | 2021-10-05 | 2022-01-18 | 广州非凡信息安全技术有限公司 | Method and system for crawling and restoring WEB site by traffic recording |
CN115277451A (en) * | 2022-07-28 | 2022-11-01 | 中译语通科技股份有限公司 | Account login information initialization method and system based on automatic simulator |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106598991A (en) | Web crawler system capable of realizing website interaction and automatic form extraction by conversational mode | |
CN104200166B (en) | Script-based website vulnerability scanning method and system | |
CN104766014B (en) | Method and system for detecting malicious website | |
US9118549B2 (en) | Systems and methods for context management | |
CN103268361B (en) | Extracting method, the device and system of URL are hidden in webpage | |
US20140137006A1 (en) | Graphical Overlay Related To Data Mining And Analytics | |
CN112068824B (en) | Webpage development preview method and device and electronic equipment | |
CN105468779A (en) | Browser compatibility detection oriented client Web application capture and playback system and method | |
KR20080053293A (en) | Server-Side Initial Content Rendering for Client Script Web Pages | |
CN110851681B (en) | Crawler processing method, crawler processing device, server and computer readable storage medium | |
US20220276882A1 (en) | Artificial intelligence based systems and methods for autonomously generating customer service help guides with integrated graphical components and for autonomously error-checking knowledge base support resources | |
CN106126747A (en) | Data capture method based on reptile and device | |
US20060150111A1 (en) | Methods and apparatus for evaluating aspects of a web page | |
CN113010371A (en) | Method and system for monitoring real user experience of browser end in real time | |
CN101763432A (en) | Method for constructing lightweight webpage dynamic view | |
CN113723980A (en) | Method and device for detecting advertisement landing page, electronic equipment and storage medium | |
CN104025089B (en) | The method and system creeped based on situation | |
CN118740675A (en) | Network supportability testing method, device, equipment, medium and program product | |
CN118760581A (en) | Link detection method, link detection device, equipment and medium | |
CN105912573A (en) | Data updating method and data updating device | |
AU2021106041A4 (en) | Methods and systems for obtaining and storing web pages | |
CN105701175B (en) | A kind of data capture method and device | |
US20220244975A1 (en) | Method and system for generating natural language content from recordings of actions performed to execute workflows in an application | |
Li et al. | Modeling web application for cross-browser compatibility testing | |
CN104063488B (en) | A kind of form feature extracting method of semi-automatic learning type |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | DD01 | Delivery of document by public notice | Addressee: Shanghai Intple Technology Co.,Ltd. Document name: the First Notification of an Office Action |
| | WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20170426 |