CN106598991A - Web crawler system capable of realizing website interaction and automatic form extraction by conversational mode - Google Patents

Web crawler system capable of realizing website interaction and automatic form extraction by conversational mode

Info

Publication number
CN106598991A
Authority
CN
China
Prior art keywords
script
website
interaction
engine
conversational mode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510675362.5A
Other languages
Chinese (zh)
Inventor
Not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI INTPLE TECHNOLOGY CO LTD
Original Assignee
SHANGHAI INTPLE TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANGHAI INTPLE TECHNOLOGY CO LTD
Priority to CN201510675362.5A
Publication of CN106598991A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)
  • Computer And Data Communications (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a web crawler system that realizes website interaction and automatic form extraction in a conversational mode. A considerable amount of information on the Internet can be obtained by logging in with an account and performing specific mouse clicks and keyboard input. In traditional web crawler development, a browser development tool is used to monitor the requests and responses exchanged between the browser and the server during manual operation; the intercepted request and response contents are then analyzed by hand, and code is written accordingly. The invention provides a system, verified to be feasible, that carries out this work of constructing crawler interaction information in a nearly fully automatic manner. The system consists of three elements: an HTML parser, a script recording engine and a script operation simulator. The HTML parser parses tags such as forms (<FORM/>) and links (<A/>) from an HTML web page. The script recording engine acts as a proxy through which users access the target website, and records the data exchanged between the browser and the server. The script operation simulator runs the script recorded in the previous step in interpreted mode; playing the script realizes the crawler's interaction with the website and the information capture.

Description

A web crawler system that uses a conversational mode to interact with websites and automatically extract forms
The present invention provides a system, verified to be feasible, that carries out the work of constructing crawler interaction information in a nearly fully automatic manner. The system consists of three elements:
1. An HTML parser, which parses form (<FORM/>) and link (<A/>) tags from HTML web pages;
2. A script recording engine, which acts as a proxy through which the user accesses the target website, and records all data exchanged between the user's browser and the server;
3. A script operation simulator, which runs the script recorded in the previous step in interpreted mode; playing the script realizes the crawler's interaction with the website and the information capture.
The system has the following working states:
1. Recording the script;
2. Parameterizing the script. This is a manual process in which a person decides which parameters need to be set and how their values are obtained;
3. Playing the script to perform the crawler function.
Regarding the second state: while parameterizing the script, simple modifications can also be made to accomplish work that would otherwise require writing code. For example, an incrementing variable can be used to control a loop, or an external function can be called to recognize image CAPTCHAs automatically.
Description of the drawings
Fig. 1 is a schematic diagram of the structure and workflow of the present invention.
Fig. 2 is a sample script after automatic generation and manual parameterization.
Detailed description of the embodiments
The composition and working method of the three elements mentioned above are described below.
The HTML parser parses HTML web page content and generates a DOM (Document Object Model) tree. For the task of analyzing the crawler's interaction with the server, only the form tags <FORM/> and link tags <A/> in the DOM tree are of interest.
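As a rough illustration of this narrow focus, the following Python sketch uses the standard-library html.parser module to collect only form and link targets. It does not build a full DOM tree, and the class and function names are illustrative assumptions rather than part of the invention.

```python
from html.parser import HTMLParser


class FormLinkParser(HTMLParser):
    """Collect the targets of <form> and <a> tags and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.forms = []  # list of (action, method) tuples
        self.links = []  # list of href strings

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        attributes = dict(attrs)
        if tag == "form":
            action = attributes.get("action") or ""
            method = (attributes.get("method") or "get").lower()
            self.forms.append((action, method))
        elif tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])


def extract_forms_and_links(html):
    """Return (forms, links) found in an HTML string."""
    parser = FormLinkParser()
    parser.feed(html)
    parser.close()
    return parser.forms, parser.links
```

Self-closing tags such as <A href="..."/> are also reported, because HTMLParser routes them through handle_starttag by default.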
In the system described in the present invention, the HTML parser plays two roles. First, to support the script recording engine, the HTML parser must rewrite the targets of the <FORM/> and <A/> tags in the original web page, so that each form submission address or link points to a URL on the server that hosts the script recording engine. Second, the HTML parser analyzes the web page and extracts content, which serves either as a variable for the crawler's next interactive operation or as the final result to be captured. An example of the former is extracting the option names and option values of a drop-down box in the web page; an example of the latter is extracting an article's hit count or number of likes.
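The second role can be illustrated with a minimal Python sketch that pulls option names and values out of a drop-down box. The class name SelectOptionParser and the sample markup are assumptions made for illustration only.

```python
from html.parser import HTMLParser


class SelectOptionParser(HTMLParser):
    """Collect (option name, option value) pairs from <option> tags."""

    def __init__(self):
        super().__init__()
        self.options = []
        self._in_option = False
        self._value = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self._flush()  # a new <option> implicitly closes the previous one
            self._in_option = True
            self._value = dict(attrs).get("value")
            self._text = []

    def handle_endtag(self, tag):
        if tag in ("option", "select"):
            self._flush()

    def handle_data(self, data):
        if self._in_option:
            self._text.append(data)

    def _flush(self):
        if self._in_option:
            name = "".join(self._text).strip()
            # Without a value attribute, the option text is used as the value.
            value = self._value if self._value is not None else name
            self.options.append((name, value))
            self._in_option = False


parser = SelectOptionParser()
parser.feed('<select name="city"><option value="sh">Shanghai</option>'
            '<option value="bj">Beijing</option></select>')
print(parser.options)  # [('Shanghai', 'sh'), ('Beijing', 'bj')]
```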
The script recording engine is the key component for interacting with the website and extracting forms automatically. It has a web interface similar to that of a search engine, with an input box in which a website address can be entered.
After the user submits a website address, the recording engine connects to the website and fetches the web page. The returned web page is processed as follows:
A) A <base href='${URL}'/> tag is added to the web page, where ${URL} is the website address. This forces the browser to resolve relative links on the page against the specified address, so that related resources such as images are displayed normally;
B) For <A href="${URL}"/>, <FORM action="${URL}"/>, <FRAME src="${URL}"/> and <IFRAME src="${URL}"/> tags, the link address ${URL} in the attribute must be rewritten: ${URL} is replaced with the address of the engine itself, and the original value of ${URL} is carried in that address.
From this point on, any form submission or link click that the user performs through the recording engine directs the browser to send its GET/POST request to the engine rather than directly to the original website. In this way, the whole series of requests and responses between the browser and the original website is recorded by the engine.
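The two page-processing steps can be sketched in Python as follows. This is only a sketch under stated assumptions: the engine address, the "target" query parameter used to carry the original URL, and the regular-expression approach are all hypothetical, and a production rewriter would more likely work on the parsed tree.

```python
import re
from urllib.parse import parse_qs, quote, urljoin, urlsplit

ENGINE = "http://localhost:8000/record"  # hypothetical address of the engine

# First href/action/src attribute of each a, form, frame or iframe tag.
_LINK_ATTR = re.compile(
    r'(<(?:a|form|frame|iframe)\b[^>]*?\b(?:href|action|src)\s*=\s*)(["\'])(.*?)\2',
    re.IGNORECASE | re.DOTALL,
)


def rewrite_page(html, page_url, engine=ENGINE):
    """Point links and forms at the engine and add a <base> tag."""

    def redirect(match):
        prefix, quote_char, link = match.groups()
        absolute = urljoin(page_url, link)  # resolve relative links first
        carried = quote(absolute, safe="")  # original URL travels in the query
        return f"{prefix}{quote_char}{engine}?target={carried}{quote_char}"

    rewritten = _LINK_ATTR.sub(redirect, html)
    base = f"<base href='{page_url}'/>"
    # The <base> tag is inserted right after <head>; pages without one are
    # left without a <base> tag in this sketch.
    return re.sub(r"(<head\b[^>]*>)", lambda m: m.group(1) + base,
                  rewritten, count=1, flags=re.IGNORECASE)


def original_target(engine_url):
    """Recover the original URL from a rewritten engine URL."""
    return parse_qs(urlsplit(engine_url).query)["target"][0]
```

When the browser later requests a rewritten address, the engine can call original_target on it, forward the request to the real site, and log both directions of the exchange.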
After the recording engine has recorded the request and response data, it translates them into script source code of a form similar to "GET http:// ... Param1=Value1 | Param2=Value2 | ... | ParamN=ValueN {ControlBits}". Here, {ControlBits} records auxiliary information, for example whether the request is GET or POST, or whether cookies should be cleared. Param1=Value1 and the following pairs are the form parameters submitted in POST mode.
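A minimal Python sketch of this translation step is shown below. The source only says the code is "similar to" the sample line, so the concrete spelling of the control bits (POST, CLEARCOOKIE) and the function name are assumptions; values that themselves contain " | " are not escaped here.

```python
def to_script_line(method, url, params, clear_cookies=False):
    """Render one recorded request as a script line.

    params is a list of (name, value) pairs in submission order.
    """
    pairs = " | ".join(f"{name}={value}" for name, value in params)
    bits = [method.upper()]          # assumed encoding of the GET/POST flag
    if clear_cookies:
        bits.append("CLEARCOOKIE")   # assumed name for the cookie flag
    control = "{" + ",".join(bits) + "}"
    # The leading keyword follows the sample line in the text literally.
    return " ".join(part for part in ("GET", url, pairs, control) if part)


print(to_script_line("post", "http://example.com/login",
                     [("username", "jackey"), ("password", "123456")]))
# GET http://example.com/login username=jackey | password=123456 {POST}
```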
The script generated by the recording engine can be further parameterized by hand. For example, suppose Param1=Value1 and Param2=Value2 above represent a user name and a password, typically of the form username=jackey and password=123456; then jackey and 123456 can be replaced with parameters that are set at run time. Flow control can also be added, for example a loop and a loop counter; combined with parameters that can be set at run time, this makes batch operations possible.
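In the system this step is done by hand, but the substitution itself can be sketched in Python. The helper names, the placeholder names user and pwd, and the example URL are hypothetical; string.Template is used because its ${name} syntax matches the placeholders in the sample script.

```python
from string import Template


def parameterize(line, placeholders):
    """Replace recorded values with ${name} placeholders.

    placeholders maps a form field name to the placeholder name to use.
    """
    tokens = []
    for token in line.split(" "):
        key, sep, _recorded = token.partition("=")
        if sep and key in placeholders:
            token = f"{key}=${{{placeholders[key]}}}"
        tokens.append(token)
    return " ".join(tokens)


def expand_batch(template_line, runs):
    """Fill the placeholders once per run, driven by a loop counter."""
    lines = []
    counter = 0
    while counter < len(runs):
        lines.append(Template(template_line).substitute(runs[counter]))
        counter += 1
    return lines


recorded = "GET http://example.com/login username=jackey | password=123456 {POST}"
template = parameterize(recorded, {"username": "user", "password": "pwd"})
print(template)
# GET http://example.com/login username=${user} | password=${pwd} {POST}
```

Calling expand_batch(template, runs) with a list of dictionaries, one per account, then yields one concrete script line per run.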
Once the script has been parameterized, it can be "played" by the script operation simulator, which actually runs the crawler. The simulator mainly performs three types of work:
A) Interpreting and executing "GET http:// ... Param1=${Value1} | Param2=${Value2} | ... | ParamN=${ValueN} {ControlBits}" scripts: ${Value1}, ${Value2}, ..., ${ValueN} are replaced with their run-time values, the request is sent to the website, and the response is received;
B) Interpreting and executing "FIND ..." scripts: the HTML parser is called to analyze the web page and extract content, either as a variable for the crawler's next interactive operation or as the final result to be captured;
C) Flow control: processing the flow-control commands contained in the script.
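The first type of work can be sketched in Python as below. The sketch only builds the HTTP request and leaves sending it to the caller; the control-bit spelling (POST) is the same assumption as before, cookie handling is omitted, and "FIND ..." lines and flow control are not covered because the text does not specify their syntax.

```python
import re
from string import Template
from urllib.parse import urlencode
from urllib.request import Request

_REQUEST_LINE = re.compile(
    r"^GET\s+(?P<url>\S+)\s*(?P<params>.*?)\s*\{(?P<bits>[^{}]*)\}$"
)


def build_request(line, values):
    """Substitute run-time values into a request line and build the request."""
    resolved = Template(line).substitute(values)
    match = _REQUEST_LINE.match(resolved)
    if match is None:
        raise ValueError(f"not a request line: {line!r}")
    params = []
    for pair in match["params"].split("|"):
        pair = pair.strip()
        if pair:
            name, _, value = pair.partition("=")
            params.append((name, value))
    bits = {bit.strip().upper() for bit in match["bits"].split(",") if bit.strip()}
    url = match["url"]
    if "POST" in bits:
        # Form parameters go into the request body.
        return Request(url, data=urlencode(params).encode(), method="POST")
    if params:
        # For GET, parameters are appended to the query string.
        url += ("&" if "?" in url else "?") + urlencode(params)
    return Request(url, method="GET")


request = build_request(
    "GET http://example.com/login username=${user} | password=${pwd} {POST}",
    {"user": "alice", "pwd": "s3cret"},
)
print(request.get_method(), request.full_url)  # POST http://example.com/login
```

The returned object can be passed to urllib.request.urlopen to actually send it, and the response body can then be handed to the HTML parser for the "FIND ..." step.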

Claims (4)

1. A web crawler system that uses a conversational mode to interact with websites and automatically extract forms, characterized in that:
1) the system comprises three main components: an HTML parser; a script recording engine; and a script operation simulator;
2) the system has three working states: first, recording the script; second, parameterizing the script, with optional manual intervention; third, playing the script to perform the crawler function.
2. The "HTML parser" component of the web crawler system that uses a conversational mode to interact with websites and automatically extract forms according to claim 1, characterized in that: the HTML parser parses HTML web page content and generates a DOM tree.
3. The "script recording engine" component of the web crawler system that uses a conversational mode to interact with websites and automatically extract forms according to claim 1, characterized in that: it has a web interface similar to that of a search engine, with an input box in which a website address can be entered; the recording engine connects to the website and fetches the web page; the user submits forms or clicks links through the recording engine; and after recording the request and response interaction data, the recording engine translates them into script source code.
4. The "script operation simulator" component of the web crawler system that uses a conversational mode to interact with websites and automatically extract forms according to claim 1, characterized in that: it can "play" the script source code described in claim 3 to perform the function of a web crawler.
CN201510675362.5A 2015-10-19 2015-10-19 Web crawler system capable of realizing website interaction and automatic form extraction by conversational mode Pending CN106598991A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510675362.5A CN106598991A (en) 2015-10-19 2015-10-19 Web crawler system capable of realizing website interaction and automatic form extraction by conversational mode

Publications (1)

Publication Number Publication Date
CN106598991A true CN106598991A (en) 2017-04-26

Family

ID=58554265

Country Status (1)

Country Link
CN (1) CN106598991A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101089856A (en) * 2007-07-20 2007-12-19 李沫南 Method for abstracting network data and web crawler system
US7496636B2 (en) * 2002-06-19 2009-02-24 International Business Machines Corporation Method and system for resolving Universal Resource Locators (URLs) from script code
CN103051692A (en) * 2012-12-11 2013-04-17 中国能源建设集团广东省电力设计研究院 Mobile operation system working platform supporting extreme network environment
CN104539053A (en) * 2014-12-31 2015-04-22 国家电网公司 Power dispatching automation polling robot and method based on crawler technology
CN104750463A (en) * 2013-12-26 2015-07-01 任子行网络技术股份有限公司 A plug-in developing method and system

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108076067A (en) * 2017-12-27 2018-05-25 北京中关村科金技术有限公司 Method and system for authorized crawler configuration simulation login
CN108076067B (en) * 2017-12-27 2021-05-18 北京中关村科金技术有限公司 Method and system for authorized crawler configuration simulation login
CN108664461A (en) * 2018-05-03 2018-10-16 北京神州泰岳软件股份有限公司 Automatic filling method and device for webpage form
CN108664461B (en) * 2018-05-03 2023-08-22 鼎富智能科技有限公司 Automatic filling method and device for webpage form
CN108664646A (en) * 2018-05-16 2018-10-16 电子科技大学 Keyword-based automatic audio and video download system
CN109246069A (en) * 2018-06-15 2019-01-18 华为技术有限公司 Webpage login method and device and readable storage medium
CN109246069B (en) * 2018-06-15 2020-10-16 华为技术有限公司 Webpage login method and device and readable storage medium
CN109948025A (en) * 2019-03-20 2019-06-28 上海古鳌电子科技股份有限公司 Data reference recording method
CN109948025B (en) * 2019-03-20 2023-10-20 上海古鳌电子科技股份有限公司 Data reference recording method
CN111159000A (en) * 2019-12-30 2020-05-15 北京明朝万达科技股份有限公司 Server performance test method, device, equipment and storage medium
CN113946735A (en) * 2021-10-05 2022-01-18 广州非凡信息安全技术有限公司 Method and system for crawling and restoring WEB site by traffic recording
CN115277451A (en) * 2022-07-28 2022-11-01 中译语通科技股份有限公司 Account login information initialization method and system based on automatic simulator

Similar Documents

Publication Publication Date Title
CN106598991A (en) Web crawler system capable of realizing website interaction and automatic form extraction by conversational mode
CN104200166B (en) Script-based website vulnerability scanning method and system
CN104766014B (en) Method and system for detecting malicious website
US9118549B2 (en) Systems and methods for context management
CN103268361B Method, device and system for extracting hidden URLs in web pages
US20140137006A1 (en) Graphical Overlay Related To Data Mining And Analytics
CN112068824B (en) Webpage development preview method and device and electronic equipment
CN105468779A (en) Browser compatibility detection oriented client Web application capture and playback system and method
KR20080053293A (en) Server-Side Initial Content Rendering for Client Script Web Pages
CN110851681B (en) Crawler processing method, crawler processing device, server and computer readable storage medium
US20220276882A1 (en) Artificial intelligence based systems and methods for autonomously generating customer service help guides with integrated graphical components and for autonomously error-checking knowledge base support resources
CN106126747A Crawler-based data capture method and device
US20060150111A1 (en) Methods and apparatus for evaluating aspects of a web page
CN113010371A (en) Method and system for monitoring real user experience of browser end in real time
CN101763432A (en) Method for constructing lightweight webpage dynamic view
CN113723980A (en) Method and device for detecting advertisement landing page, electronic equipment and storage medium
CN104025089B (en) The method and system creeped based on situation
CN118740675A (en) Network supportability testing method, device, equipment, medium and program product
CN118760581A (en) Link detection method, link detection device, equipment and medium
CN105912573A (en) Data updating method and data updating device
AU2021106041A4 (en) Methods and systems for obtaining and storing web pages
CN105701175B Data capture method and device
US20220244975A1 (en) Method and system for generating natural language content from recordings of actions performed to execute workflows in an application
Li et al. Modeling web application for cross-browser compatibility testing
CN104063488B Semi-automatic learning type form feature extraction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
DD01 Delivery of document by public notice (Addressee: Shanghai Intple Technology Co.,Ltd.; Document name: the First Notification of an Office Action)
WD01 Invention patent application deemed withdrawn after publication (Application publication date: 20170426)