CN111414525A - Data acquisition method and device for small program, computer equipment and storage medium
- Publication number
- CN111414525A (application number CN202010216892.4A)
- Authority
- CN
- China
- Prior art keywords
- applet
- program
- code
- data
- page
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Stored Programmes (AREA)
Abstract
The application relates to a data acquisition method and apparatus for an applet, a computer device, and a storage medium. The method comprises: acquiring a first applet and running the first applet; hijacking a loading interface of the first applet while the first applet is running; acquiring a crawler program, and injecting the code of the crawler program into the code of the first applet to generate a second applet, the loading interface of the second applet being the same as that of the first applet; calling the loading interface of the second applet to load the code of the second applet and generate a page of the second applet; and collecting data in the page of the second applet through the crawler program. With this method, the data of an applet can be collected.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data acquisition method and apparatus for an applet, a computer device, and a storage medium.
Background
With the rapid growth of networks, the World Wide Web has become a carrier of a large amount of information, and how to extract and use that information efficiently has become a major challenge. Traditional data collection methods generally use a web crawler. A web crawler sends web requests by simulating the behavior of a browser and, after receiving a response, automatically parses and stores the data according to certain rules. With the further development of network technology, applet technology has emerged. An applet is an application that is implemented based on a host program and can be used without being downloaded and installed.
However, a conventional web crawler cannot crawl the data of an applet.
Disclosure of Invention
In view of the above technical problem, it is necessary to provide a data acquisition method and apparatus for an applet, a computer device, and a storage medium that are capable of collecting the data of an applet.
A data acquisition method for an applet, the method comprising:
acquiring a first applet and running the first applet;
hijacking a loading interface of the first applet while the first applet is running;
acquiring a crawler program, and injecting the code of the crawler program into the code of the first applet to generate a second applet, the loading interface of the second applet being the same as the loading interface of the first applet;
calling the loading interface of the second applet to load the code of the second applet and generate a page of the second applet; and
collecting data in the page of the second applet through the crawler program.
A data acquisition apparatus for an applet, the apparatus comprising:
a running module, configured to acquire a first applet and run the first applet;
a hijacking module, configured to hijack a loading interface of the first applet while the first applet is running;
a second applet generating module, configured to acquire a crawler program and inject the code of the crawler program into the code of the first applet to generate a second applet, the loading interface of the second applet being the same as the loading interface of the first applet;
a page generating module, configured to call the loading interface of the second applet to load the code of the second applet and generate a page of the second applet; and
a data acquisition module, configured to collect data in the page of the second applet through the crawler program.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the above method when executing the computer program.
A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method.
With the above data acquisition method and apparatus for an applet, computer device, and storage medium, a first applet is acquired and run; a loading interface of the first applet is hijacked while the first applet is running; a crawler program is acquired, and the code of the crawler program is injected into the code of the first applet to generate a second applet whose loading interface is the same as that of the first applet; and the loading interface of the second applet is called to load the code of the second applet and generate a page of the second applet. Because the first applet runs on the host program and the code of the crawler program is injected into the code of the first applet, the crawler program and the first applet share the same underlying architecture and running logic of the host program. The crawler program contained in the second applet can therefore collect the data in the page of the second applet, which achieves the function of collecting the data of the second applet.
Drawings
FIG. 1 is a diagram of an application environment of a data acquisition method of an applet in one embodiment;
FIG. 2 is a schematic flow chart diagram illustrating a data acquisition method for an applet, in one embodiment;
FIG. 3 is a diagram of a security report generated for collected textual data in one embodiment;
FIG. 4 is a schematic diagram of a security report generated for captured picture data in one embodiment;
FIG. 5 is a diagram illustrating an application of an applet in one embodiment;
FIG. 6 is a diagram illustrating the collection of data in a page of a second applet, in one embodiment;
FIG. 7 is a schematic diagram of collecting data in a page of a second applet in another embodiment;
FIG. 8 is a schematic flow chart diagram illustrating a data acquisition method of an applet, in accordance with another embodiment;
FIG. 9 is a block diagram showing a data acquisition device of an applet in one embodiment;
FIG. 10 is a diagram showing an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The data acquisition method for an applet provided by the present application can be applied to the application environment shown in fig. 1, in which the terminal 102 communicates with the server 104 via a network. The terminal 102 acquires a first applet and sends a data request to the server 104 through the network; when the data returned by the server 104 in response to the data request is received, the first applet can be run. The terminal 102 hijacks a loading interface of the first applet while the first applet is running; acquires a crawler program and injects the code of the crawler program into the code of the first applet to generate a second applet, the loading interface of the second applet being the same as that of the first applet; calls the loading interface of the second applet to load the code of the second applet and generate a page of the second applet; and collects data in the page of the second applet through the crawler program. The terminal 102 may be, but is not limited to, a personal computer, a notebook computer, a smart phone, a tablet computer, or a portable wearable device, and the server 104 may be implemented as an independent server or as a server cluster formed by a plurality of servers.
In one embodiment, as shown in fig. 2, there is provided a data acquisition method of an applet, comprising the steps of:
Step 202, acquiring a first applet and running the first applet.
An applet is an application that is implemented based on a host program and can be used without being downloaded and installed. The host program may be WeChat, Alipay, or another application.
The terminal opens the host program of the first applet, acquires the first applet from the host program, and runs it. In one embodiment, the first applet may be acquired from an applet set of the host program. The applet set may be, but is not limited to, the set of applets the user has used in the past or the set of applets the user has saved as favorites.
In another embodiment, the host program may call the camera of the terminal so that the camera scans the code corresponding to the first applet, thereby acquiring the first applet. The code may be a barcode, a QR code, or the like.
After acquiring the first applet, the terminal receives a run instruction for the first applet, obtains the code of the first applet based on the run instruction, and parses the code so that the first applet runs.
Step 204, hijacking a loading interface of the first applet while the first applet is running.
The execution environment of the first applet includes a rendering layer and a logic layer. The rendering layer presents the data of the first applet, for example by displaying the default page of the first applet on the display interface of the terminal. The logic layer generates and processes the data of the first applet, for example by transferring data, verifying data, and calling interfaces. The loading interface of the first applet is the interface that loads the code of the first applet so that the first applet can be presented on the display interface of the terminal.
Specifically, while the first applet is running, a hook technique is used to hijack the loading interface of the first applet. A hook is a message processing mechanism that can monitor various event messages in a system or process and intercept and process messages sent to a target window. Hooks can be used to monitor specific events in the system and to perform specific functions such as capturing on-screen text, monitoring logs, and intercepting keyboard and mouse input.
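The hijacking step can be illustrated with a minimal Python sketch. The source does not disclose the host program's actual loading interface, so `AppletRuntime`, `load_code`, and `hook_loader` are hypothetical names. The sketch only shows the general hook pattern: the original function is replaced by a wrapper that intercepts the call and then delegates to the original.

```python
from typing import Callable, List


class AppletRuntime:
    """Hypothetical stand-in for the host program's applet runtime."""

    def load_code(self, code: str) -> str:
        # Pretend that loading code produces a rendered page.
        return f"<page>{code}</page>"


def hook_loader(runtime: AppletRuntime,
                transform: Callable[[str], str],
                log: List[str]) -> Callable[[str], str]:
    """Replace runtime.load_code with a wrapper and return the original."""
    original = runtime.load_code

    def hooked(code: str) -> str:
        log.append(code)                  # intercept: record what was about to load
        return original(transform(code))  # delegate with the (possibly modified) code

    runtime.load_code = hooked  # the instance attribute shadows the class method
    return original


runtime = AppletRuntime()
intercepted: List[str] = []
hook_loader(runtime, lambda code: code + ";crawler()", intercepted)
page = runtime.load_code("app()")
```

Callers keep using the same `load_code` name after the hook is installed, which mirrors the requirement that the loading interface of the second applet is the same as that of the first applet.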
Step 206, acquiring a crawler program, and injecting the code of the crawler program into the code of the first applet to generate a second applet; the loading interface of the second applet is the same as the loading interface of the first applet.
The crawler program is a program that automatically captures the data of the first applet according to certain rules. It can be written by a developer. The second applet is the applet from which data is to be collected; it contains both the code of the first applet and the code of the crawler program.
It can be understood that the second applet includes the code of the crawler program and the code of the first applet. The crawler program only collects data in the page of the second applet and does not change the underlying architecture of the first applet, such as its loading interface. The basic functions and basic architecture of the second applet are therefore the same as those of the first applet, and the loading interface of the second applet is the same as the loading interface of the first applet.
Step 208, calling the loading interface of the second applet to load the code of the second applet and generate a page of the second applet.
The page of the second applet may include data such as a picture, a text, a link, and a video, and may also include elements such as a button (wx-button), an input box (wx-input), and a navigation bar (wx-navigator).
The terminal calls the loading interface of the second applet to load the code of the second applet, that is, the code of the first applet and the code of the crawler program, and generates a page of the second applet. The pages of the second applet may include a default page of the first applet, for example the initial page of the first applet, and may also include dynamic pages. A dynamic page is a page reached by clicking a link, a button, or the like.
Step 210, collecting data in the page of the second applet through the crawler program.
The crawler program collects data in the page of the second applet, which may include data such as pictures, texts, links, videos, and the like, and may also include elements such as buttons and input boxes. The crawler may also grab a web request that the second applet sends to the server.
The crawler program is a program that automatically captures the data of the second applet according to certain rules. When the page is the default page of the first applet, the crawler program collects the data in the default page. When the page includes elements such as a link, a button, an input box, or a navigation bar, the crawler program can simulate user behavior on those elements, clicking the link, button, or navigation bar or supplying input data, so as to jump to the next page, and then collect the data in the next page.
In the above data acquisition method for an applet, a first applet is acquired and run; a loading interface of the first applet is hijacked while the first applet is running; a crawler program is acquired, and the code of the crawler program is injected into the code of the first applet to generate a second applet whose loading interface is the same as that of the first applet; and the loading interface of the second applet is called to load the code of the second applet and generate a page of the second applet. Because the first applet runs on the host program and the code of the crawler program is injected into the code of the first applet, the crawler program and the first applet share the same underlying architecture and running logic of the host program. The crawler program contained in the second applet can therefore collect the data in the page of the second applet, which achieves the function of collecting the data of the second applet.
Because the code of the crawler program is injected into the code of the first applet, the crawler program can access the DOM (Document Object Model) data of the first applet and can also access the local interfaces of the host program of the first applet. The crawler program can therefore interact with the host program of the first applet, for example by calling the payment function of the host program, so that the data exchanged between the second applet and the host program can be collected and more complete data of the second applet can be obtained.
Because the code of the crawler program is injected into the code of the first applet, the element types defined by the DOM in the first applet can be obtained directly, and the data of the second applet can be obtained accurately based on those element types. This avoids the problem that arises when a web crawler collects applet data through a browser, where the browser cannot render the elements of the applet normally because their element types differ from those of ordinary web pages.
It can be understood that the event mechanism of an applet depends on the underlying architecture of the applet. In the conventional technology, when a web crawler collects applet data through a browser, the browser cannot correctly trigger and process these events. In this embodiment, the code of the crawler program is injected into the code of the first applet to generate the second applet, and the second applet still runs on the underlying applet architecture defined by the developer, so the crawler program can collect the data of the second applet.
Further, after the data in the page of the second applet is collected through the crawler program, the method includes: scanning the collected data in the page of the second applet to obtain the security attribute of each piece of data; and generating a security report for the second applet based on the security attributes of the respective data.
The security attribute of a piece of data is, for example, safe or unsafe, and may further indicate a category such as violence, pornography, person, or scenery. Scanning the data collected from the page of the second applet to obtain the security attribute of each piece of data allows the security of the second applet to be evaluated more accurately.
When the data is text, reference keywords are acquired and matched against the text, and the security attribute of the text is determined according to the matching result. For example, the reference keyword may be "A" (where "A" represents violence); when text collected from a page of the second applet matches "A", meaning that the text contains "A", the security attribute of the text may be "unsafe-violence".
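The keyword matching described above might look like the following Python sketch. The keyword table and the function name `text_security_attribute` are hypothetical, and plain substring matching is an assumption, since the source does not specify the matching rule. For brevity the sketch returns a bare "safe" for unmatched text and does not assign the finer categories (time, address, evaluation) that appear in the report of FIG. 3.

```python
from typing import Dict

# Hypothetical reference keywords mapped to the category they indicate.
REFERENCE_KEYWORDS: Dict[str, str] = {"A": "violence"}


def text_security_attribute(text: str, keywords: Dict[str, str]) -> str:
    """Return "unsafe-<category>" for the first matching keyword, else "safe"."""
    for keyword, category in keywords.items():
        if keyword in text:  # assumed rule: the text contains the keyword
            return f"unsafe-{category}"
    return "safe"
```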
In one embodiment, as shown in FIG. 3, the second applet is a take-away applet, and text is collected from the second applet to generate a security report for the text in the second applet. The text content numbered 1 is "A". "A" is detected to be a word indicating violence, so the security attribute generated for "A" is "unsafe-violence". The report also shows that the order in which the text content "A" is located is "B1" and that the acquisition time of "A" is 2019-05-31 09:47:21.
The text content numbered 2 is "11 months". "11 months" is detected to be a word representing time, so the security attribute generated for "11 months" is "safe-time". The order in which the text content "11 months" is located is "B2", and its acquisition time is 2019-05-31 09:48:26.
The text content numbered 3 is "High-tech Zone". "High-tech Zone" is detected to be a word indicating an address, so the security attribute generated for "High-tech Zone" is "safe-address". The order in which the text content "High-tech Zone" is located is "B2", and its acquisition time is 2019-05-31 09:48:38.
The text content numbered 4 is "poor service attitude". "Poor service attitude" is detected to be a phrase expressing an evaluation, so the security attribute generated for it is "safe-evaluation". The order in which the text content "poor service attitude" is located is "B3", and its acquisition time is 2019-05-31 09:50:21.
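One way to represent the rows of such a report is sketched below in Python. The `ReportRow` fields follow the columns described for FIG. 3 (number, text content, security attribute, order, acquisition time), but the class and the helper `unsafe_rows` are hypothetical. The first sample row is the entry numbered 1 in FIG. 3; the second is an invented placeholder used only to exercise the filter.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class ReportRow:
    number: int
    content: str
    security_attribute: str
    order: str
    collected_at: datetime


def unsafe_rows(rows: List[ReportRow]) -> List[ReportRow]:
    """Select the rows whose security attribute marks them as unsafe."""
    return [row for row in rows if row.security_attribute.startswith("unsafe")]


rows = [
    # Entry numbered 1 in FIG. 3.
    ReportRow(1, "A", "unsafe-violence", "B1", datetime(2019, 5, 31, 9, 47, 21)),
    # Hypothetical safe entry, not taken from the source.
    ReportRow(2, "ok", "safe", "B2", datetime(2019, 5, 31, 9, 48, 0)),
]
flagged = unsafe_rows(rows)
```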
When the data is a picture or a video, whether the picture or video contains sensitive information is detected. When the picture or video contains sensitive information, its security attribute may be "unsafe"; when it does not, its security attribute may be "safe". The sensitive information may be violent, pornographic, abusive, or similar content.
In one embodiment, pictures are collected from a page of the second applet and a security report for the collected pictures is generated. As shown in fig. 4, the picture numbered 1 has the picture address "A1", its security attribute is detected to be "safe-person", and its acquisition time is 2019-05-31 09:47:21.
The picture numbered 2 has the picture address "A2", its security attribute is detected to be "safe-landscape", and its acquisition time is 2019-05-31 09:48:26.
The picture numbered 3 has the picture address "A3", its security attribute is detected to be "safe-person", and its acquisition time is 2019-05-31 09:48:38.
The picture numbered 4 has the picture address "A4", its security attribute is detected to be "safe-person", and its acquisition time is 2019-05-31 09:50:21.
When the data is a link, a reference link set can be obtained from the server, the link is matched against each reference link in the set, and the security attribute of the link is determined according to the matching result. The reference link set also contains the reference security attribute corresponding to each reference link. When a link collected from the second applet matches a reference link in the set, the reference security attribute of the matched reference link is used as the security attribute of the collected link. For example, if the link B1 collected from the second applet matches the reference link B2, and the security attribute corresponding to the reference link B2 is pornography, then the security attribute of the link B1 is "pornography".
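The link lookup can be sketched in Python as follows. The source does not say what counts as a match, so the normalization rule here (case-insensitive comparison that ignores a trailing slash) is an assumption, and the reference set, the example URL, and the function names are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical reference link set: reference link -> reference security attribute.
REFERENCE_LINKS: Dict[str, str] = {"https://bad.example/path": "pornography"}


def normalize(link: str) -> str:
    """Assumed matching rule: ignore case, surrounding spaces, and a trailing slash."""
    return link.strip().lower().rstrip("/")


def link_security_attribute(link: str, reference: Dict[str, str]) -> Optional[str]:
    """Return the attribute of the matching reference link, or None if none matches."""
    wanted = normalize(link)
    for ref_link, attribute in reference.items():
        if normalize(ref_link) == wanted:
            return attribute
    return None
```

Returning `None` for an unmatched link leaves the decision about unknown links to the caller, since the source describes only the matched case.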
As shown in FIG. 5, when the host program of the first applet is WeChat, 502 denotes the WeChat client, which includes the WeChat application itself and the first applet. The running environment of the applet is divided into a rendering layer and a logic layer: the WXML (WeiXin Markup Language) template, the WXSS (WeiXin Style Sheets) styles, DOM data, rendering logic, and the like work in the rendering layer, and JS (JavaScript) scripts work in the logic layer. WXML is a tag language designed by the framework; combined with base components and the event system, it can be used to build the structure of a page. WXSS is a style language that describes the styles of WXML components. DOM (Document Object Model) is the standard programming interface for processing extensible markup languages recommended by the W3C (World Wide Web Consortium). The rendering logic includes data such as the JS operation interface.
504, the WebView, renders the interface of the rendering layer of the first applet, and 506, the JsCore thread, runs the JS scripts of the logic layer. Because the first applet has multiple interfaces, the rendering layer has multiple WebViews. Communication between the rendering layer and the logic layer is relayed through 508, Native (the WeChat application itself), and the logic layer also sends network requests to the third-party server 510 through Native 508. Data sent by the third-party server 510 is likewise forwarded to the logic layer through Native 508. The third-party server 510 communicates with the WeChat client 502 in the terminal over WebSocket, a protocol for full-duplex communication over a single TCP (Transmission Control Protocol) connection.
In one embodiment, the code of the first applet includes first rendering layer code and first logic layer code, and the code of the crawler program includes second rendering layer code and second logic layer code. Injecting the code of the crawler program into the code of the first applet to generate a second applet includes: injecting the second rendering layer code of the crawler program into the first rendering layer code of the first applet to obtain the rendering layer code of the second applet; injecting the second logic layer code of the crawler program into the first logic layer code of the first applet to obtain the logic layer code of the second applet; and generating the second applet based on the rendering layer code and the logic layer code of the second applet.
The rendering layer is used to render the data, for example, to display the elements in the interface. The logic layer is used for generating data and processing the data, such as transferring the data, checking the data, calling an interface and the like. The first rendering layer code refers to code of a rendering layer of the first applet. The first logical layer code refers to code of a logical layer of the first applet. The second rendering layer code refers to code of a rendering layer of the crawler. The second logical layer code refers to code of a logical layer of the crawler.
Specifically, the second rendering layer code of the crawler program is injected at the tail of the first rendering layer code of the first applet to obtain the rendering layer code of the second applet; and a v8 engine (a JavaScript engine) interface is called to inject the second logic layer code into the first logic layer code of the first applet to obtain the logic layer code of the second applet.
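The two-layer injection can be modeled with a small Python sketch. `AppletCode` and `inject` are hypothetical names and the code strings are placeholders. Appending at the tail of the rendering layer code follows the description above; for the logic layer, the source calls a v8 engine interface that is not detailed, so plain concatenation is used here purely as a stand-in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppletCode:
    rendering_layer: str  # code that runs in the rendering layer
    logic_layer: str      # code that runs in the logic layer


def inject(first: AppletCode, crawler: AppletCode) -> AppletCode:
    """Build the second applet's code from the first applet and the crawler."""
    return AppletCode(
        # The crawler's rendering layer code goes at the tail of the first applet's.
        rendering_layer=first.rendering_layer + "\n" + crawler.rendering_layer,
        # Stand-in for the v8 engine interface call described in the source.
        logic_layer=first.logic_layer + "\n" + crawler.logic_layer,
    )


first = AppletCode("<view>home</view>", "Page({})")
crawler = AppletCode("<script>collect()</script>", "crawl()")
second = inject(first, crawler)
```

The first applet's code is left unmodified and is only extended, which is consistent with the statement that the crawler program does not change the underlying architecture of the first applet.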
In this embodiment, the rendering layer code of the second applet includes a second rendering layer code of the crawler program, and the logic layer code of the second applet includes a second logic layer code of the crawler program, so that the crawler program can interact with the first applet on the rendering layer, such as data collection, data addition, data modification, and the like, and can also verify data on the logic layer, so that more complete and accurate data can be collected.
In one embodiment, calling the loading interface of the second applet to load the code of the second applet and generate a page of the second applet includes: calling the loading interface of the second applet, loading the rendering layer code of the second applet, and verifying the logic of the rendering layer code of the second applet using the logic layer code of the second applet; and, when the logic verification of the rendering layer code of the second applet passes, generating the page corresponding to the rendering layer code of the second applet.
The terminal calls the loading interface of the second applet and loads the rendering layer code of the second applet, from which the page corresponding to that code can be generated. Verifying the logic of the rendering layer code of the second applet may include checking whether an API (Application Programming Interface) called in the rendering layer code is correct, whether the content of an element in the rendering layer code corresponds to the type of the element, and so on. For example, if the type of an element is a mobile phone number and the content of the element is a picture, the content of the element does not correspond to its type.
When the logic verification of the rendering layer code of the second applet passes, a more accurate page corresponding to the rendering layer of the second applet can be generated, so that more accurate data can be collected. This avoids problems such as a disordered page or garbled characters caused by incorrect logic in the rendering layer code of the second applet.
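The type-versus-content check can be sketched in Python as below. The validator table, the element type names, and the rule that a mobile phone number is exactly 11 digits are all assumptions made for illustration; the source only states that the content of an element should correspond to its type.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical validators: element type -> predicate on the element's content.
VALIDATORS: Dict[str, Callable[[str], bool]] = {
    # Assumed rule: a mobile phone number is exactly 11 digits.
    "phone_number": lambda content: re.fullmatch(r"\d{11}", content) is not None,
    "picture": lambda content: content.lower().endswith((".png", ".jpg", ".jpeg")),
}


def check_elements(elements: List[Tuple[str, str]]) -> bool:
    """Return True only if every (type, content) pair passes its validator."""
    for element_type, content in elements:
        validator = VALIDATORS.get(element_type)
        if validator is not None and not validator(content):
            return False  # content does not correspond to the element type
    return True
```

In this sketch, element types without a registered validator are treated as passing, which is a design choice of the example rather than something the source specifies.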
In one embodiment, as shown in FIG. 6, collecting data in a page of a second applet by a crawler includes:
In the current page, various elements may be included, such as pictures, text, links, videos, navigation bars, and so forth. And traversing each element in the current page through a crawler program, and acquiring data of each element. The data of an element may include the type of the element, and may also include the data size of the element, the position of the element, the size of the element, and the like. The type of the element is, for example, input type, picture type, text type, link type, etc.
And step 604, when the type of the element is the input type, jumping to the next page according to the element of the input type.
Specifically, data of each element in a current page is obtained, wherein the data of the element comprises the type of the element; and screening out the elements with the types of the elements as input types from the data of the elements.
When the type of the element is an input type, data needs to be input to the element, so that the next page can be jumped to through the element of the input type. For example, if the type of the element is an address, address information needs to be input to the element, so that a jump is made to the next page. For another example, if the type of the element is a login button, a login instruction needs to be input to the element, and a square can be clicked or slid to a preset position, so as to log in to the next page. For another example, if the type of the element is a link, a click instruction needs to be input to the element, and a preset operation can be clicked or executed, so as to jump to the next page.
Step 606: taking the next page as a new current page, and executing the step of acquiring the data of each element in the current page until each page in the second applet has been traversed and the data of each element in each page has been acquired.
After jumping to the next page, the next page is taken as the new current page and the data of each element in it is acquired, and so on, until each page in the second applet has been traversed and the data of each element in each page has been acquired.
In this embodiment, the crawler program acquires the current page of the second applet and the data of each element in the current page, where the data of an element includes the type of the element. When the type of an element is the input type, the next page is jumped to according to the input-type element. The next page is then taken as the new current page, and the step of acquiring the data of each element in the current page is executed until each page in the second applet has been traversed and the data of each element in each page has been acquired. In this way, every page of the second applet and the data of every element in every page can be acquired, so that more complete data of the second applet is collected.
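The traversal described in this embodiment can be sketched as follows. This is a minimal illustration only, assuming the pages of the second applet are modeled as an in-memory dictionary; the names `Element`, `Page`, `crawl`, and the `target` field (the page reached through an input-type element) are hypothetical and do not come from the original text.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Element:
    kind: str                     # type of the element, e.g. "text", "picture", "input"
    position: tuple = (0, 0)      # position of the element in the page
    target: Optional[str] = None  # hypothetical: page reached through an input-type element


@dataclass
class Page:
    name: str
    elements: list = field(default_factory=list)


def crawl(pages: dict, start: str) -> dict:
    """Traverse every reachable page and collect the data of each element."""
    collected = {}
    queue = deque([start])
    while queue:
        name = queue.popleft()
        if name in collected:
            continue  # this page has already been traversed
        page = pages[name]
        # Acquire the data of each element in the current page.
        collected[name] = [(e.kind, e.position) for e in page.elements]
        # Input-type elements lead to next pages, which become new current pages.
        for e in page.elements:
            if e.kind == "input" and e.target is not None and e.target not in collected:
                queue.append(e.target)
    return collected


pages = {
    "home": Page("home", [Element("text"), Element("input", (0, 10), "detail")]),
    "detail": Page("detail", [Element("picture"), Element("input", (5, 5), "home")]),
}
result = crawl(pages, "home")
```

The set of already collected pages is what stops the loop when two pages link to each other, which corresponds to the "until each page is traversed" condition.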
In one embodiment, when the type of the element is the input type, jumping to a next page according to the element of the input type includes: when the type of the element is an input type, acquiring input data corresponding to the input type through a crawler program; inputting input data into an element to generate a jump instruction; and jumping to the next page according to the jump instruction.
The terminal may acquire the correspondence between the input type and the input data in advance. In one embodiment, the terminal may obtain the correspondence between the input type and the input data from a local memory. In another embodiment, the terminal may also obtain the corresponding relationship between the input type and the input data from the crawler program. In another embodiment, the terminal may further obtain a corresponding relationship between the input type and the input data from the background server. The background server may be a server corresponding to a host program of the first applet. For example, if the host program of the first applet is WeChat, the terminal may obtain the corresponding relationship between the input type and the input data from the WeChat server.
The jump instruction includes the address of the next page. The next page is located and loaded according to this address, and is then displayed on the display interface of the terminal.
The input-type elements may be buttons, input boxes, links, navigation bars, etc. When the input-type element is a button, the corresponding input data is a click instruction, a sliding instruction, or the like; inputting such input data into the button, that is, clicking or sliding the button, generates a jump instruction, and the next page is jumped to.
When the input-type element is an input box, the corresponding input data is text, pictures, videos, links, and the like; inputting such input data into the input box generates a jump instruction, and the next page is jumped to.
When the input-type element is an input box, the input type of the input box may be further distinguished so that the corresponding input data can be input more accurately. For example, input boxes can be distinguished as: a phone number input box, a verification code input box, a password input box, a picture input box, an account number input box, a text input box, an address input box, etc.
When the input-type element is a link, the corresponding input data is a click instruction or another preset instruction; inputting such input data into the link, that is, clicking the link, generates a jump instruction, and the next page is jumped to.
In this embodiment, the crawler program simulates user behaviors, such as clicking a link, inputting corresponding data in an input box, clicking a button, or clicking a navigation bar, so that the next page can be jumped to and the data in the next page can be collected. In this way, the data of the second applet can be collected more completely.
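As a rough illustration of the correspondence between input types and input data, the lookup and the resulting jump instruction might look like the sketch below. The table contents, the dictionary keys, and the function name `make_jump_instruction` are assumptions made for this example; the original text does not specify a data format.

```python
from typing import Optional

# Hypothetical correspondence between input types and input data. In the
# embodiments it may come from local memory, the crawler program, or a server.
INPUT_DATA = {
    "button": {"action": "click"},
    "link": {"action": "click"},
    "navigation_bar": {"action": "click"},
    "text_box": {"action": "type", "value": "example text"},
}


def make_jump_instruction(element: dict) -> Optional[dict]:
    """Look up the input data for an input-type element and build a jump instruction."""
    data = INPUT_DATA.get(element["input_type"])
    if data is None:
        return None  # no known input data, so no jump can be simulated
    # The jump instruction carries the address of the next page.
    return {"next_page": element["next_page"], "input": data}


instruction = make_jump_instruction({"input_type": "button", "next_page": "pages/detail"})
```

Keeping the correspondence in a plain table makes it easy to swap in a different source, such as a file or a server response, without changing the lookup code.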
In one embodiment, as shown in FIG. 7, the code of the crawler includes second rendering layer code and second logic layer code.
Acquiring input data corresponding to the input type through the crawler program includes: acquiring the input data corresponding to the input type through the second rendering layer code of the crawler program.
The terminal acquires the correspondence between input types and input data in advance. When the type of an element acquired through the crawler program is the input type, the input data matching the input type of the element is obtained from this correspondence.
Inputting the input data into the element to generate a jump instruction includes: inputting the input data into the element, and checking the input data through the second logic layer code of the crawler program; when the input data passes the check, generating the jump instruction.
It can be understood that the next page can be reached only when the data input to an element corresponds to that element. For example, the login button jumps to the next page only when a click instruction is input, and the verification code input box jumps to the next page only when the corresponding verification code is input. Conversely, when a picture or text is input to the login button, the input data does not correspond to the element and the next page cannot be reached; likewise, when a click instruction is input to the mobile phone number input box, the input data does not correspond to the element and the next page cannot be reached.
Therefore, the input data is checked through the second logic layer code of the crawler program to ensure that the input data corresponds to the input-type element, so that the next page can be jumped to.
In this embodiment, input data corresponding to an input type is acquired through a second rendering layer code of the crawler program; and inputting the input data into the elements, and verifying the input data through a second logic layer code of the crawler program, so that the input data corresponds to the elements of the input type, a jump instruction is generated, and the next page can be jumped to.
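The check performed before a jump instruction is generated can be sketched as below. This is an assumption-laden illustration: the `ACCEPTED` table and the function names `check_input` and `input_and_jump` are hypothetical, and the element types are taken from the examples in the text.

```python
from typing import Optional

# Hypothetical table of the kinds of input each element type accepts.
ACCEPTED = {
    "login_button": {"click", "slide"},
    "phone_number_box": {"text"},
    "verification_code_box": {"text"},
}


def check_input(element_type: str, input_kind: str) -> bool:
    """Return True when the kind of input corresponds to the element type."""
    return input_kind in ACCEPTED.get(element_type, set())


def input_and_jump(element_type: str, input_kind: str, next_page: str) -> Optional[dict]:
    """Generate a jump instruction only when the input data passes the check."""
    if not check_input(element_type, input_kind):
        return None  # input does not correspond to the element, so no jump
    return {"next_page": next_page}


ok = input_and_jump("login_button", "click", "pages/home")
rejected = input_and_jump("login_button", "picture", "pages/home")
```

The two calls mirror the examples above: a click on the login button yields a jump instruction, while a picture input to the same button is rejected.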
In one embodiment, the method further includes: analyzing the code of the first applet to obtain the loading interface of the first applet. Hijacking the loading interface of the first applet in the running process of the first applet includes: monitoring the interfaces called by the first applet in the running process of the first applet; and when the first applet calls the loading interface, hijacking the loading interface of the first applet.
In the code of the first applet, each interface of the first applet has a corresponding identifier, and the identifier of the loading interface can be found from the code of the first applet, so that the loading interface of the first applet is obtained.
Monitoring an interface called by the first applet by adopting a hook technology in the running process of the first applet; and when the first applet is detected to call the loading interface, hijacking the loading interface of the first applet, namely not allowing the first applet to call the loading interface.
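The general idea of hooking an interface can be illustrated in a few lines. The sketch below is not the hook technology of any particular host program: `Host`, `load`, and `hijack` are hypothetical names, and substituting the second applet's code inside the hook is only one possible reading of how the hijacked interface is subsequently used.

```python
class Host:
    """Hypothetical stand-in for the host program's runtime."""

    def load(self, code: str) -> str:
        return f"page rendered from {code}"


def hijack(host: Host, interface_name: str, second_applet_code: str) -> list:
    """Replace an interface on the host with a hook and return the call log."""
    original = getattr(host, interface_name)
    intercepted = []

    def hook(code: str) -> str:
        intercepted.append(code)             # monitor the call made by the first applet
        return original(second_applet_code)  # load the second applet's code instead

    setattr(host, interface_name, hook)
    return intercepted


host = Host()
log = hijack(host, "load", "second applet code")
page = host.load("first applet code")
```

Because the hook keeps a reference to the original interface, the same loading path is reused, which matches the requirement that the loading interface of the second applet is the same as that of the first applet.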
Further, the installation package of the host program of the first applet is acquired; the installation package of the host program is decompiled to obtain the code of the host program; and the code of the first applet is obtained from the code of the host program.
It will be appreciated that the applet runs based on the host program and is not installed from a separate installation package; instead, the code of the applet is present in the installation package of the host program. Therefore, the installation package of the host program of the first applet is obtained first and then decompiled to obtain the code of the host program. Because the code of the host program includes the code of the first applet, the address of the code of the first applet is looked up in the code of the host program, and the code of the first applet is obtained.
When the system running on the terminal is Android, the installation package of the host program can be decompiled using APK (Android application package) decompilation, for example with a dex code decompilation tool, to obtain pseudo code of the host program. The pseudo code of the host program is analyzed to obtain pseudo code of the first applet, which is then converted into the code of the first applet.
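As a small illustration of the first step of working with an installation package, an APK is a ZIP archive whose compiled code is stored in `.dex` entries. The sketch below only locates those entries with the Python standard library; it does not decompile anything, and the function name `list_dex_entries` and the in-memory archive are made up for the example. An actual decompilation step would need a separate dex decompilation tool.

```python
import io
import zipfile


def list_dex_entries(apk_bytes: bytes) -> list:
    """Return the names of the .dex entries found in an APK (a ZIP archive)."""
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        return sorted(name for name in apk.namelist() if name.endswith(".dex"))


# Build a tiny stand-in archive in memory so that the example is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("AndroidManifest.xml", b"placeholder")
    archive.writestr("classes.dex", b"placeholder")
    archive.writestr("classes2.dex", b"placeholder")

dex_entries = list_dex_entries(buffer.getvalue())
```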
In one embodiment, as shown in fig. 8, after the first applet and the crawler program are obtained, step 802 is executed to inject the code of the crawler program into the code of the first applet, and generate the second applet. Step 804 is executed, and the page of the second applet is obtained through the crawler program. Step 806 is performed to traverse the pages.
Step 808 is executed to determine whether all pages have been traversed. If not, step 810 is executed to jump to the next page, and step 812 is executed to traverse all elements in the page. Step 814 is then executed to determine whether the data of all elements in the page has been collected. If so, the process returns to step 806 and continues traversing the pages. If not, step 816 is executed, in which the crawler program performs a simulated operation on the input-type elements: specifically, input data corresponding to the input type is obtained through the crawler program, and the input data is input into the element to generate a jump instruction. Step 810 is then executed to jump to the next page.
If the determination in step 808 is yes, that is, all pages have been traversed, the data of all elements in all pages of the second applet has been collected by the crawler program, and the process ends.
It should be understood that, although the steps in the flowcharts of FIG. 2 and FIGS. 6 to 8 are shown in sequence as indicated by the arrows, the steps are not necessarily performed in that sequence. Unless explicitly stated otherwise, the order of the steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in FIG. 2 and FIGS. 6 to 8 may include multiple sub-steps or stages, which are not necessarily performed at the same time and may be performed at different times; these sub-steps or stages are not necessarily performed sequentially, and may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 9, there is provided an applet data acquisition apparatus 900, which may be implemented as part of a computer device using software modules, hardware modules, or a combination of the two. The apparatus specifically includes a running module 902, a hijacking module 904, a second applet generating module 906, a page generating module 908 and a data acquisition module 910, wherein:
The running module 902 is configured to acquire a first applet and run the first applet.
The hijacking module 904 is configured to hijack the loading interface of the first applet in the running process of the first applet.
The second applet generating module 906 is configured to acquire a crawler program, and inject code of the crawler program into code of the first applet to generate a second applet; the loading interface of the second applet is the same as the loading interface of the first applet.
The page generating module 908 is configured to call the loading interface of the second applet to load the code of the second applet, and generate a page of the second applet.
The data acquisition module 910 is configured to collect data in the page of the second applet through the crawler program.
The above data acquisition apparatus of the applet acquires a first applet and runs the first applet; hijacks a loading interface of the first applet in the running process of the first applet; acquires a crawler program, and injects code of the crawler program into code of the first applet to generate a second applet, where the loading interface of the second applet is the same as that of the first applet; and calls the loading interface of the second applet to load the code of the second applet and generate a page of the second applet. Because the first applet runs based on the host program and the code of the crawler program is injected into the code of the first applet, the crawler program and the first applet run on the same underlying architecture and the same running logic of the host program. Data in the page of the second applet can therefore be collected through the crawler program contained in the second applet, which achieves the function of collecting the data of the second applet.
In one embodiment, the code of the first applet includes first rendering layer code and first logic layer code, and the code of the crawler program includes second rendering layer code and second logic layer code. The second applet generating module 906 is further configured to inject the second rendering layer code of the crawler program into the first rendering layer code of the first applet to obtain the rendering layer code of the second applet; inject the second logic layer code of the crawler program into the first logic layer code of the first applet to obtain the logic layer code of the second applet; and generate the second applet based on the rendering layer code of the second applet and the logic layer code of the second applet.
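The layer-by-layer injection can be pictured with the following sketch. It is a deliberately simplified model, assuming that each layer is a piece of source text and that injection means appending the crawler's layer to the applet's corresponding layer; the class `AppletCode` and the function `inject` are hypothetical names, and a real injection would depend on the host program's code format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppletCode:
    rendering_layer: str  # rendering layer code
    logic_layer: str      # logic layer code


def inject(first: AppletCode, crawler: AppletCode) -> AppletCode:
    """Combine each layer of the crawler with the matching layer of the first applet."""
    return AppletCode(
        rendering_layer=first.rendering_layer + "\n" + crawler.rendering_layer,
        logic_layer=first.logic_layer + "\n" + crawler.logic_layer,
    )


first_applet = AppletCode("first rendering layer code", "first logic layer code")
crawler_program = AppletCode("second rendering layer code", "second logic layer code")
second_applet = inject(first_applet, crawler_program)
```

Keeping the two layers separate in the result reflects the point that rendering layer code is only ever merged with rendering layer code, and logic layer code with logic layer code.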
In an embodiment, the page generating module 908 is further configured to call a loading interface of the second applet, load the rendering layer code of the second applet, and verify the logic of the rendering layer code of the second applet by using the logic layer code of the second applet; and when the logic verification of the rendering layer code of the second applet passes, generating a page corresponding to the rendering layer code of the second applet.
In an embodiment, the data acquisition module 910 is further configured to obtain, through the crawler program, a current page of the second applet, and obtain data of each element in the current page, where the data of an element includes the type of the element; when the type of an element is the input type, jump to the next page according to the input-type element; and take the next page as a new current page, and execute the step of acquiring the data of each element in the current page until each page in the second applet has been traversed and the data of each element in each page has been acquired.
In an embodiment, the data collection module 910 is further configured to, when the type of the element is an input type, obtain input data corresponding to the input type through a crawler program; inputting input data into an element to generate a jump instruction; and jumping to the next page according to the jump instruction.
In one embodiment, the code of the crawler program includes second rendering layer code and second logic layer code. The data acquisition module 910 is further configured to obtain input data corresponding to the input type through the second rendering layer code of the crawler program; input the input data into the element, and check the input data through the second logic layer code of the crawler program; and generate a jump instruction when the input data passes the check.
In an embodiment, the data acquiring apparatus 900 of the applet further includes an analyzing module, configured to analyze a code of the first applet to obtain a loading interface of the first applet. The hijack module 904 is further configured to monitor an interface called by the first applet in the running process of the first applet; and when the first applet calls the loading interface, hijacking the loading interface of the first applet.
For specific limitations of the data acquisition apparatus of the applet, reference may be made to the above limitations of the data acquisition method of the applet, which are not repeated here. Each module in the above data acquisition apparatus may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 10. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a data acquisition method for an applet. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in FIG. 10 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or fewer components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Claims (10)
1. A method for data acquisition of an applet, the method comprising:
acquiring a first applet and running the first applet;
hijacking a loading interface of the first applet in the running process of the first applet;
acquiring a crawler program, and injecting code of the crawler program into code of the first applet to generate a second applet; the loading interface of the second applet is the same as the loading interface of the first applet;
calling the loading interface of the second applet to load the code of the second applet, and generating a page of the second applet;
and collecting data in the page of the second applet through the crawler program.
2. The method of claim 1, wherein the code of the first applet comprises first rendering layer code and first logic layer code, and wherein the code of the crawler comprises second rendering layer code and second logic layer code;
the step of injecting the code of the crawler program into the code of the first applet to generate a second applet comprises the following steps:
injecting the second rendering layer code of the crawler program into the first rendering layer code of the first applet to obtain a rendering layer code of a second applet;
injecting the second logic layer code of the crawler program into the first logic layer code of the first applet to obtain a logic layer code of a second applet;
generating the second applet based on the rendering layer code of the second applet and the logic layer code of the second applet.
3. The method of claim 2, wherein the invoking the load interface of the second applet loads the code of the second applet and generates the page of the second applet, comprising:
calling the loading interface of the second applet, loading the rendering layer code of the second applet, and verifying the logic of the rendering layer code of the second applet by using the logic layer code of the second applet;
and when the logic check on the rendering layer code of the second applet passes, generating a page corresponding to the rendering layer code of the second applet.
4. The method of claim 1, wherein the collecting, by the crawler program, data in the page of the second applet comprises:
acquiring a current page of the second applet through the crawler program, and acquiring data of each element in the current page; the data of the element comprises a type of the element;
when the type of the element is an input type, jumping to a next page according to the element of the input type;
and taking the next page as a new current page, and executing the step of acquiring the data of each element in the current page until each page in the second applet is traversed and the data of each element in each page is acquired.
5. The method according to claim 4, wherein when the type of the element is an input type, jumping to a next page according to the element of the input type comprises:
when the type of the element is an input type, acquiring input data corresponding to the input type through the crawler program;
inputting the input data into the element to generate a jump instruction;
and jumping to the next page according to the jump instruction.
6. The method of claim 5, wherein the code of the crawler comprises second rendering layer code and second logic layer code;
the obtaining of the input data corresponding to the input type through the crawler program includes:
acquiring input data corresponding to the input type through a second rendering layer code of the crawler program;
inputting the input data into the element, and generating a jump instruction, including:
inputting the input data into the element, and checking the input data through a second logic layer code of the crawler program;
and when the input data passes the verification, generating a jump instruction.
7. The method of claim 1, further comprising:
analyzing the code of the first applet to obtain the loading interface of the first applet;
the hijacking a loading interface of the first applet in the running process of the first applet comprises:
monitoring an interface called by the first applet in the running process of the first applet;
and hijacking the loading interface of the first applet when the first applet calls the loading interface.
8. An apparatus for data acquisition of an applet, the apparatus comprising:
a running module, configured to acquire a first applet and run the first applet;
a hijacking module, configured to hijack a loading interface of the first applet in the running process of the first applet;
a second applet generating module, configured to acquire a crawler program, and inject code of the crawler program into code of the first applet to generate a second applet; the loading interface of the second applet is the same as the loading interface of the first applet;
a page generating module, configured to call the loading interface of the second applet to load the code of the second applet, and generate a page of the second applet;
and a data acquisition module, configured to collect data in the page of the second applet through the crawler program.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010216892.4A CN111414525B (en) | 2020-03-25 | 2020-03-25 | Method, device, computer equipment and storage medium for acquiring data of applet |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111414525A (en) | 2020-07-14 |
CN111414525B (en) | 2024-01-02 |
Family
ID=71491416
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010216892.4A Active CN111414525B (en) | 2020-03-25 | 2020-03-25 | Method, device, computer equipment and storage medium for acquiring data of applet |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111414525B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112000393A (en) * | 2020-08-25 | 2020-11-27 | 上海连尚网络科技有限公司 | Method and device for running small program |
CN112162871A (en) * | 2020-09-25 | 2021-01-01 | 同程网络科技股份有限公司 | Method, device and storage medium for data exchange between applet and webview |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020198931A1 (en) * | 2001-04-30 | 2002-12-26 | Murren Brian T. | Architecture and process for presenting application content to clients |
US20080235671A1 (en) * | 2007-03-20 | 2008-09-25 | David Kellogg | Injecting content into third party documents for document processing |
CN106021257A (en) * | 2015-12-31 | 2016-10-12 | 广州华多网络科技有限公司 | Method, device, and system for crawler to capture data supporting online programming |
US10083108B1 (en) * | 2017-12-18 | 2018-09-25 | Clover Network, Inc. | Automated stack-based computerized application crawler |
CN108833264A (en) * | 2018-06-25 | 2018-11-16 | 厦门理工学院 | Data acquisition management system, method and application based on WeChat applet |
CN109710831A (en) * | 2018-12-28 | 2019-05-03 | 四川新网银行股份有限公司 | A kind of network crawler system based on browser plug-in |
CN110263266A (en) * | 2019-05-20 | 2019-09-20 | 江苏大学 | A kind of method for exhibiting data based on wechat small routine and crawler |
CN110347562A (en) * | 2018-04-08 | 2019-10-18 | 腾讯科技(深圳)有限公司 | Collecting method, device, computer-readable medium and intelligent terminal |
CN110750255A (en) * | 2019-09-25 | 2020-02-04 | 支付宝(杭州)信息技术有限公司 | Applet rendering method and device |
CN110837473A (en) * | 2019-11-07 | 2020-02-25 | 腾讯科技(深圳)有限公司 | Application program debugging method, device, terminal and storage medium |
Non-Patent Citations (1)
Title |
---|
操金金; 何贞铭; 冯梦琪; 张金星; 王丹媛: "Design and Implementation of a Geological Data Display System Based on WeChat Mini Programs" * |
Also Published As
Publication number | Publication date |
---|---|
CN111414525B (en) | 2024-01-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10846402B2 (en) | Security scanning method and apparatus for mini program, and electronic device | |
CN109376078B (en) | Mobile application testing method, terminal equipment and medium | |
US20150012924A1 (en) | Method and Device for Loading a Plug-In | |
CN113489713B (en) | Network attack detection method, device, equipment and storage medium | |
US20150213282A1 (en) | Online Privacy Management System with Enhanced Automatic Information Detection | |
CN105940654A (en) | Privileged static hosted WEB applications | |
US9443077B1 (en) | Flagging binaries that drop malicious browser extensions and web applications | |
CN111177623A (en) | Information processing method and device | |
CN108256322B (en) | Security testing method and device, computer equipment and storage medium | |
CN106528659A (en) | A control method and device for jumping from a browser to an application program | |
CN113469866A (en) | Data processing method and device and server | |
CN111414525B (en) | Method, device, computer equipment and storage medium for acquiring data of applet | |
US9571557B2 (en) | Script caching method and information processing device utilizing the same | |
CA2906517A1 (en) | Online privacy management | |
CN112671605A (en) | Test method and device and electronic equipment | |
CN114157568B (en) | Browser secure access method, device, equipment and storage medium | |
CN113326539B (en) | Method, device and system for private data leakage detection aiming at applet | |
CN113419738A (en) | Interface document generation method and device and interface management equipment | |
CN107526678B (en) | Web application program testing method and device | |
CN109492144B (en) | Association relation analysis method, device and storage medium for software system | |
CN112153059A (en) | Mail verification code acquisition method and device, electronic equipment and storage medium | |
CN112559278B (en) | Method and device for acquiring operation data | |
CN112417324A (en) | Chrome-based URL (Uniform resource locator) interception method and device and computer equipment | |
JP5320007B2 (en) | Information filter device | |
EA038687B1 (en) | Method and system for identifying devices connected to fraudulent phishing activity |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |