
CN104951516A - Method and device for storing webpage file - Google Patents

Info

Publication number
CN104951516A
Authority
CN
China
Prior art keywords
webpage
file
files
web page
written
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510290694.1A
Other languages
Chinese (zh)
Inventor
谭国斌
马哲
沈建荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiaomi Inc
Original Assignee
Xiaomi Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiaomi Inc
Priority to CN201510290694.1A
Publication of CN104951516A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986Document structures and storage, e.g. HTML extensions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The disclosure relates to a method and device for storing a webpage file, and belongs to the technical field of computers. The method comprises: acquiring a webpage source code of the webpage file to be stored; and writing the webpage source code into a text file. This solves the problem in the related art that the processing overhead of a system is too large because a database is used to store the webpage files captured by a web crawler. Compared with data read-write operations on a database, data read-write operations on a text file consume far fewer system processing resources, so the webpage file is stored more reasonably and the processing overhead of the system is reduced.

Description

Method and device for storing webpage file
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for storing a web page file.
Background
The web crawler is an important component of a search engine system. It captures web page files from the internet and stores the captured web page files locally.
In the related art, a database is used to store web page files captured by a web crawler. For example, a relational database such as MySQL or Oracle is employed to store the web page files grabbed by the web crawler.
However, data read-write operations on a database consume considerable system processing resources. When a web crawler captures web pages efficiently, it triggers a large number of database read-write operations in a short time, which in turn results in excessive processing overhead for the system.
Disclosure of Invention
The embodiment of the disclosure provides a method and a device for storing a webpage file. The technical scheme is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a method of storing a web page file, the method including:
acquiring a webpage source code of a webpage file to be stored;
and writing the webpage source code into a text file.
Optionally, the web page file to be stored comprises one web page file;
or, the web page file to be stored comprises two or more web page files.
Optionally, the writing the web page source code into a text file includes:
detecting whether the sum of the number of written files of the text file and the number of files to be written is larger than a preset threshold value or not; the written file number refers to the number of the webpage files corresponding to the written webpage source codes in the text files, and the number of the files to be written refers to the number of the webpage files to be stored;
and if the sum of the number of the written files and the number of the files to be written is less than or equal to the preset threshold value, writing the webpage source code into the text file.
Optionally, the method further comprises:
if the sum of the number of the written files and the number of the files to be written is larger than the preset threshold value, calculating a difference value d between the preset threshold value and the number of the written files, wherein d is not less than 0 and is an integer;
when d is 0, writing the webpage source code into at least one newly created empty text file;
when d is not equal to 0, d webpage files are selected from the webpage files to be stored, and webpage source codes of the selected d webpage files are written into the text file; and writing the remaining webpage source codes of the webpage files to be stored into at least one newly created empty text file.
Optionally, the method further comprises:
respectively creating a corresponding index item for each webpage file, wherein the index item is used for indicating the position of a webpage source code of the webpage file in the text file;
and storing the index entry into an index directory.
Optionally, the method further comprises:
acquiring an index item corresponding to the target webpage file from the index directory;
determining the position of the webpage source code of the target webpage file in the text file according to the index item corresponding to the target webpage file;
and reading the webpage source code of the target webpage file from the text file according to the determined position.
According to a second aspect of the embodiments of the present disclosure, there is provided an apparatus for storing a web page file, the apparatus including:
the code acquisition module is configured to acquire a webpage source code of a webpage file to be stored;
a code writing module configured to write the web page source code into a text file.
Optionally, the web page file to be stored comprises one web page file;
or, the web page file to be stored comprises two or more web page files.
Optionally, the code writing module includes:
the quantity detection submodule is configured to detect whether the sum of the number of written files of the text file and the number of files to be written is larger than a preset threshold value or not; the written file number refers to the number of the webpage files corresponding to the written webpage source codes in the text files, and the number of the files to be written refers to the number of the webpage files to be stored;
and the first writing sub-module is configured to write the webpage source code into the text file when the sum of the number of written files and the number of files to be written is less than or equal to the preset threshold value.
Optionally, the code writing module further includes:
the difference value calculation sub-module is configured to calculate the difference value d between the preset threshold value and the number of the written files when the sum of the number of the written files and the number of the files to be written is larger than the preset threshold value, wherein d is larger than or equal to 0 and is an integer;
a second writing submodule configured to write the web page source code into at least one newly created empty text file when d is 0;
the third writing sub-module is configured to select d webpage files from the webpage files to be stored when d is not equal to 0, and write the webpage source codes of the selected d webpage files into the text file; and writing the remaining webpage source codes of the webpage files to be stored into at least one newly created empty text file.
Optionally, the apparatus further comprises:
an index creating module configured to create a corresponding index entry for each of the web page files, where the index entry is used to indicate a position of a web page source code of the web page file in the text file;
an index storage module configured to store the index entry into an index directory.
Optionally, the apparatus further comprises:
the index acquisition module is configured to acquire an index item corresponding to the target webpage file from the index directory;
the position determining module is configured to determine the position of the webpage source code of the target webpage file in the text file according to the index item corresponding to the target webpage file;
and the code reading module is configured to read the webpage source code of the target webpage file from the text file according to the determined position.
According to a third aspect of the embodiments of the present disclosure, there is provided an apparatus for storing a web page file, the apparatus including:
a processor;
a memory for storing executable instructions of the processor;
wherein the processor is configured to:
acquiring a webpage source code of a webpage file to be stored;
and writing the webpage source code into a text file.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
the method comprises the steps of obtaining a webpage source code of a webpage file to be stored, and writing the webpage source code into a text file for storage; the problem that the processing overhead of the system is overlarge due to the fact that the web page files captured by the web crawler are stored in a database in the related technology is solved; compared with the data reading and writing operation on the database, the system processing resource consumed by the data reading and writing operation on the text file is much smaller, the webpage file is stored more reasonably, and the technical effect of reducing the processing overhead of the system is achieved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flow diagram illustrating a method of storing a web page file in accordance with an exemplary embodiment;
FIG. 2A is a flow chart illustrating a method of storing a web page file in accordance with another exemplary embodiment;
FIG. 2B is a flowchart illustrating step 202 according to another exemplary embodiment;
FIG. 2C is a flowchart illustrating step 204 through step 208 according to another exemplary embodiment;
FIG. 2D is a diagram illustrating a correspondence between index directories, text files, and web page files, according to another exemplary embodiment;
FIG. 2E is a schematic diagram of an application scenario shown in accordance with another illustrative embodiment;
FIG. 3 is a block diagram illustrating an apparatus for storing a web page file in accordance with an exemplary embodiment;
FIG. 4A is a block diagram illustrating an apparatus for storing a web page file in accordance with another exemplary embodiment;
FIG. 4B is a block diagram illustrating a code writing module 320 according to another example embodiment;
FIG. 5 is a block diagram illustrating an apparatus in accordance with an example embodiment.
With the foregoing drawings in mind, certain embodiments of the disclosure have been shown and described in more detail below. These drawings and written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the concepts of the disclosure to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The method for storing the webpage file provided by the embodiments of the disclosure can be applied to one or more devices in a search engine system, and the device is generally a server. For simplicity, the following method embodiments describe each step as being executed by the server, but the disclosure is not limited thereto.
FIG. 1 is a flow chart illustrating a method of storing a web page file according to an exemplary embodiment, which may include the steps of:
in step 101, a web page source code of a web page file to be stored is acquired.
The web page source code refers to the standard language code used to organize and lay out elements such as text, pictures, music, videos, and links on a web page. The standard language code is usually HTML (HyperText Markup Language) code, sometimes mixed with CSS (Cascading Style Sheets) code or JS (JavaScript, a scripting language) code.
In step 102, the web page source code is written to a text file.
In summary, in the method provided in this embodiment, the web page source code of the web page file to be stored is obtained, and the web page source code is written into the text file for storage; the problem that the processing overhead of the system is overlarge due to the fact that the web page files captured by the web crawler are stored in a database in the related technology is solved; compared with the data reading and writing operation on the database, the system processing resource consumed by the data reading and writing operation on the text file is much smaller, the webpage file is stored more reasonably, and the technical effect of reducing the processing overhead of the system is achieved.
Fig. 2A is a flowchart illustrating a method of storing a web page file according to another exemplary embodiment. The method may be applied to one or more devices in a search engine system, typically a server. The method may include the steps of:
in step 201, a web page source code of a web page file to be stored is obtained.
The server acquires a webpage file to be stored and extracts a webpage source code from the webpage file. The web page files to be stored may include one web page file, or may include two or more web page files.
For example, the server may crawl a plurality of web page files from the world wide web through a web crawler, where the crawled web page files are the web page files to be stored. The basic workflow of the web crawler is as follows: 1) obtaining a selected seed URL (Uniform Resource Locator); 2) putting the seed URL into a URL queue to be captured; 3) sequentially taking out URLs to be captured from the URL queue to be captured, analyzing the URLs to be captured, downloading webpage files corresponding to the URLs to be captured from a corresponding host, and storing the downloaded webpage files; 4) putting the URL which is completed to be captured into a captured URL queue; 5) and analyzing the URLs in the captured URL queue to obtain other URLs, and putting the obtained other URLs into the URL queue to be captured, so as to enter the next cycle. In addition, the format of the web page file crawled by the web crawler is usually the HTML format.
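The basic workflow above can be sketched in Python as follows. This is a minimal illustration rather than the crawler of the disclosure: the names `crawl`, `fetch`, `extract_links` and `max_pages` are hypothetical, downloading and link extraction are passed in as functions so that the loop itself stays visible, links are extracted right after each download instead of in a separate pass over the captured queue, and the `seen` set that prevents a URL from being queued twice is an added assumption.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def crawl(seed_urls: Iterable[str],
          fetch: Callable[[str], str],
          extract_links: Callable[[str], List[str]],
          max_pages: int = 100) -> Dict[str, str]:
    """Breadth-first crawl; returns {url: page source} for each fetched page."""
    to_crawl = deque(seed_urls)      # URL queue to be captured (steps 1 and 2)
    seen = set(to_crawl)             # every URL ever queued (assumption, avoids repeats)
    crawled: Dict[str, str] = {}     # captured URLs together with their page source

    while to_crawl and len(crawled) < max_pages:
        url = to_crawl.popleft()             # step 3: take out the next URL
        source = fetch(url)                  # step 3: download the web page file
        crawled[url] = source                # steps 3 and 4: store it, mark URL as captured
        for link in extract_links(source):   # step 5: obtain other URLs
            if link not in seen:
                seen.add(link)
                to_crawl.append(link)
    return crawled


# Usage with a tiny in-memory "web" instead of real HTTP requests.
fake_web = {"a": "b c", "b": "c", "c": ""}
pages = crawl(["a"], fetch=fake_web.__getitem__, extract_links=str.split)
```

In a real crawler, `fetch` would perform an HTTP request and `extract_links` would parse the HTML; both are left abstract here.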
After the server acquires the webpage file to be stored, extracting a webpage source code from the webpage file to be stored. For example, when the web page file is an HTML file, the HTML source code is extracted from the HTML file.
It should be noted that the web crawler generally captures the web page files to be stored one by one, following the order of the URLs in the URL queue to be captured. When the web page files to be stored include a plurality of web page files, the server may extract the web page source codes from the web page files one by one, or may extract the web page source codes from the plurality of web page files at the same time, which is not limited in this embodiment.
In step 202, the web page source code is written to a text file.
Unlike the related art, in which a database is used to store the web page files captured by the web crawler, the present embodiment stores the web page source codes of the web page files in the form of text files. In this embodiment, the file format of the text file is not limited. For example, the file format of the text file may be the HTML format or the TXT format, or another file format such as DOC or PDF.
In addition, one text file may be used to store the web page source code of one web page file, and may also be used to store the web page source codes of multiple web page files. When a text file is used to store only the web page source code of a web page file, frequent file creation, data writing and file closing are caused, and the processing overhead of the system is also relatively large. When one text file is used for storing the webpage source codes of a plurality of webpage files, the processing overhead of the system can be effectively reduced.
In one possible embodiment, as shown in fig. 2B, this step may include the following sub-steps:
in step 202a, it is detected whether the sum of the number of written files and the number of files to be written in the text file is greater than a preset threshold value.
The number of written files refers to the number of web page files whose web page source code has already been written into the text file, and the number of files to be written refers to the number of web page files to be stored. The preset threshold value is the maximum number of web page files whose web page source code can be written into the text file, and can be preset according to actual requirements. For example, when a text file is used to store the web page source code of only one web page file, the preset threshold value is set to 1. For another example, when one text file is used to store the web page source codes of multiple web page files, the preset threshold value is set to an integer greater than 1, such as 6, 9, or 12.
When the detection result is negative, the following step 202b is executed; when the detection result is yes, the following step 202c is performed.
In step 202b, the web page source code is written to a text file.
When the sum of the number of written files of the text file and the number of files to be written is less than or equal to the preset threshold value, the web page source codes of all the web page files to be stored can be written into the text file, and the server writes all the acquired web page source codes into the text file.
In step 202c, the difference d between the preset threshold and the number of written files is calculated, where d is greater than or equal to 0 and is an integer.
The difference d represents the number of additional web page files whose web page source code can still be written into the text file. After step 202c, the server performs step 202d described below.
In step 202d, it is determined whether d is equal to 0.
When the judgment result is yes, the following step 202e is executed; when the judgment result is no, the following step 202f is executed.
In step 202e, the web page source code is written to at least one newly created empty text file.
When d is equal to 0, the number of web page files whose web page source code has been written into the text file has reached the preset threshold value. In this case, the server writes the web page source code into at least one newly created empty text file. Optionally, the process may be as follows:
(1) creating a new empty text file;
(2) judging whether the number of the remaining web page files to be stored is larger than the preset threshold value n corresponding to the newly created empty text file; if not, executing the following step (3); if yes, executing the following step (4);
(3) writing the web page source codes of the remaining web page files to be stored into the newly created empty text file, and ending the process;
(4) selecting n web page files from the remaining web page files to be stored, and writing the web page source codes of the selected n web page files into the newly created empty text file;
(5) returning to step (1) and repeating until the process ends.
In step 202f, d web page files are selected from the web page files to be stored, and the web page source codes of the selected d web page files are written into the text file; and writing the webpage source codes of the rest webpage files to be stored into at least one newly created empty text file.
When d is not equal to 0, the number of web page files whose web page source code has been written into the text file has not yet reached the preset threshold value. In this case, the server selects d web page files from the web page files to be stored, and writes the web page source codes of the selected d web page files into the text file. The selection may be random, may follow the capturing order of the web page files, or may follow another preset rule, which is not limited in this embodiment. The server then writes the web page source codes of the remaining web page files to be stored into at least one newly created empty text file. This process is the same as steps (1) to (4) above and is not described again here.
It should be noted that, in this embodiment, when the server writes web page source code into text files, it creates the next empty text file only after the number of web page files written into the previously created text file has reached the preset threshold value. Empty text files are thus created one by one in sequence rather than several at a time, which uses storage resources more efficiently and keeps the frequency of text file creation as low as possible, thereby reducing system overhead.
It should be further noted that the preset threshold values corresponding to the text files may be the same or different. That is, the number of the web page files corresponding to the web page source code which can be written into each text file at most may be the same or different. This embodiment is not limited to this. Alternatively, the same preset threshold value may be set for each text file for convenience of management.
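Steps 202a to 202f can be illustrated with a short Python sketch. It is only an assumption-laden example, not the implementation of the disclosure: the function name `plan_writes` is hypothetical, every text file is assumed to share the same preset threshold value, and the d web page files are selected in capturing order. The function only decides which web page files go where and does not touch the disk.

```python
from typing import List, Sequence, Tuple


def plan_writes(written: int, pending: Sequence[str], threshold: int
                ) -> Tuple[List[str], List[List[str]]]:
    """Split pending pages between the current text file and new text files.

    written   -- number of pages already stored in the current text file
    pending   -- pages waiting to be stored, in capturing order
    threshold -- maximum number of pages per text file (same for every file)
    Returns (pages for the current file, one list of pages per new file).
    """
    if threshold < 1 or not 0 <= written <= threshold:
        raise ValueError("need threshold >= 1 and 0 <= written <= threshold")
    pending = list(pending)
    if written + len(pending) <= threshold:      # steps 202a and 202b
        return pending, []
    d = threshold - written                      # step 202c
    current, rest = pending[:d], pending[d:]     # step 202e (d == 0) or 202f (d > 0)
    # Fill newly created text files one after another, `threshold` pages each.
    new_files = [rest[i:i + threshold] for i in range(0, len(rest), threshold)]
    return current, new_files


# Example: 4 pages already written, threshold 6, 7 pages to store.
current, new_files = plan_writes(4, ["p1", "p2", "p3", "p4", "p5", "p6", "p7"], 6)
```

Here d is 2, so `p1` and `p2` fill the current text file and the remaining five pages go into one newly created text file.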
In addition, the server may append when writing the web page source code into the text file. That is, the web page source code to be written is added after the end of the web page source code already written in the text file.
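A minimal Python sketch of appending is shown below. The function name `append_source`, the UTF-8 encoding and the file name are assumptions made for illustration. The function returns the byte offset and size of the data it wrote, which is the kind of information an index could later use to locate the source code.

```python
import os
import tempfile
from typing import Tuple


def append_source(path: str, source: str) -> Tuple[int, int]:
    """Append one page's source to a text file; return (offset, size) in bytes."""
    data = source.encode("utf-8")
    with open(path, "ab") as f:      # "ab": binary append, creates the file if missing
        f.seek(0, os.SEEK_END)       # position at the current end of the file
        offset = f.tell()            # where this page's source will start
        f.write(data)
    return offset, len(data)


# Usage: two pages appended to the same text file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pages_0001.txt")   # hypothetical file name
    first = append_source(path, "<html>one</html>")
    second = append_source(path, "<html>two</html>")
```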
In step 203, the text file is saved.
After writing the web page source code to the text file, the server saves the text file. In this embodiment, the storage location of the text file is not limited. For example, the server may save a text file to a hard disk.
In addition, in an actual application scenario, after the server stores the web page source code into the text file, the server may also need to search and acquire the required web page source code from the text file. For example, when the server needs to analyze the web page source code of a certain/some web page files to extract the page keywords, or when the server needs to provide the web page search results to the user, the server needs to search for the web page source code needed by the server from the text file. In order to support the search of the web page source code, as shown in fig. 2C, after the step 201, the following steps 204 and 205 may be further included.
In step 204, a corresponding index entry is created for each web page file, and the index entry is used to indicate the position of the web page source code of the web page file in the text file.
The index entry may include a file identifier of the text file and a data size of the web page source code. The file identifier may be a page number of the text file, and the page number of each text file may be set when the text file is created. The file identifier of the text file and the data size of the web page source code together determine which text file stores the web page source code of the web page file and where in that text file it is stored. In addition, when the number of text files is large and the text files are distributed across different storage devices, the index entry further includes the device address of the storage device corresponding to the text file.
Optionally, the index entry corresponding to one web page file may further include one or more of the following items of information: the URL text corresponding to the webpage file, the data size of the URL text, the time text corresponding to the webpage file, the data size of the time text and the like. The URL text corresponding to the webpage file is recorded with the URL corresponding to the webpage file, and the URL can be conveniently and efficiently acquired by recording the URL text corresponding to the webpage file in the index item. The time text corresponding to the webpage file records the capturing time of the webpage file, and the time text corresponding to the webpage file is recorded in the index item, so that the capturing time of the webpage file can be conveniently and efficiently acquired, and whether the webpage file is effective or not can be further determined. Of course, the index entry may also include other information, such as web page cache time, cookie value, supported language types, corresponding server information, and so forth.
The format of the index entry can be preset according to actual requirements. In one possible implementation, the index entry contains the following items of information in order: the page number of the text file, the data size of the web page source code of the web page file, the data size of the URL text, the data size of the time text, the URL text, and the time text, with the items separated by spaces. In one illustrative example, the index entry is: 713 27368 47 25 http://www.xxxxx.com/2015-03-03.html Fri Mar 3 08:31:34 2015. The index entry indicates that the page number of the text file is 713, the data size of the web page source code of the web page file is 27368 bytes, the data size of the URL text is 47 bytes, the data size of the time text is 25 bytes, the URL corresponding to the web page file is http://bbs.xiaomi.com/2015-03-03.html, and the capture time is Fri Mar 3 08:31:34 2015.
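As a rough illustration of this space-separated format, the Python sketch below formats and parses one index entry. The names `IndexEntry`, `format_entry` and `parse_entry` are hypothetical. The parser relies on the assumption that the first five items contain no spaces, so the time text, which does contain spaces, is taken as the remainder of the line; the recorded size fields are stored as given and are not checked against the text lengths.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    page_number: int   # identifies the text file
    source_size: int   # data size of the web page source code, in bytes
    url_size: int      # data size of the URL text, in bytes
    time_size: int     # data size of the time text, in bytes
    url: str           # URL text
    crawl_time: str    # time text (capture time)


def format_entry(entry: IndexEntry) -> str:
    """Join the six items with single spaces, in the order described above."""
    return " ".join(str(item) for item in entry)


def parse_entry(line: str) -> IndexEntry:
    """Split on the first five spaces; the time text keeps its inner spaces."""
    page, source_size, url_size, time_size, url, crawl_time = line.split(" ", 5)
    return IndexEntry(int(page), int(source_size), int(url_size),
                      int(time_size), url, crawl_time)


line = "713 27368 47 25 http://www.xxxxx.com/2015-03-03.html Fri Mar 3 08:31:34 2015"
entry = parse_entry(line)
```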
In step 205, the index entry is stored in the index directory.
The index directory is used for storing index items corresponding to the webpage files. For example, in the index directory, the file identifier of the web page file may be stored in correspondence with the index entry, and the corresponding index entry may be searched and obtained in the index directory according to the file identifier of the web page file.
In addition, in a possible implementation, one index directory is used for storing index entries of each web page file corresponding to the web page source code stored in one text file. In another possible embodiment, one index directory is used to store index entries of each web page file corresponding to the web page source codes stored in two or more text files.
Referring to fig. 2D in combination, a schematic diagram of correspondence between index directories, text files and web page files is exemplarily shown. As shown in fig. 2D, it is assumed that the text file 21 stores the web page source codes of four web page files, i.e., the web page file 1, the web page file 2, the web page file 3, and the web page file 4, and the text file 22 stores the web page source codes of four web page files, i.e., the web page file 5, the web page file 6, the web page file 7, and the web page file 8. Accordingly, the index directory 23 records index entries corresponding to the eight web page files, i.e., the web page files 1 to 8. The correspondence between the index items and the web page files may be as shown in fig. 2D: index entry 1 corresponds to web page file 1, index entry 2 corresponds to web page file 2, index entry 3 corresponds to web page file 3, and so on.
Optionally, in order to increase the speed of searching the index entries from the index directory and reading the information of each entry from the index entries, the index directory may be stored in the memory.
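One simple way to hold an index directory in memory is a dictionary keyed by the file identifier of the web page file. The Python sketch below mirrors the layout of FIG. 2D; the names `IndexEntry`, `index_directory` and `find_entry` are hypothetical, the entry is reduced to two fields, and the data sizes are made-up values used only for illustration.

```python
from typing import Dict, NamedTuple, Optional


class IndexEntry(NamedTuple):
    text_file_id: int   # identifier of the text file holding the source code
    source_size: int    # data size of the web page source code, in bytes


# In-memory index directory: web page file identifier -> index entry.
index_directory: Dict[int, IndexEntry] = {}

# Layout of FIG. 2D: text file 21 holds web page files 1 to 4 and
# text file 22 holds web page files 5 to 8. The sizes are made up.
for page_id in range(1, 9):
    text_file_id = 21 if page_id <= 4 else 22
    index_directory[page_id] = IndexEntry(text_file_id, 1000 * page_id)


def find_entry(page_id: int) -> Optional[IndexEntry]:
    """Return the index entry for a web page file, or None if it is unknown."""
    return index_directory.get(page_id)
```

A dictionary lookup takes roughly constant time on average, which fits the stated goal of speeding up the search for index entries.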
Optionally, step 205 may be followed by steps 206 to 208 as follows. Steps 206 through 208 describe the process of finding the required web page source code from the text file.
In step 206, the index entry corresponding to the target web page file is obtained from the index directory.
As described above, in some practical application scenarios the server needs to search for the required web page source code in the text file. This embodiment takes as an example the case where the server searches for the web page source code of a target web page file in the text file. The server acquires the index entry corresponding to the target web page file from the index directory. For example, the server may obtain the index entry corresponding to the target web page file from the index directory according to the file identifier of the target web page file.
In step 207, the position of the web page source code of the target web page file in the text file is determined according to the index item corresponding to the target web page file.
The index entry corresponding to the target web page file contains information indicating the storage location of the web page source code of the target web page file, such as the file identifier of the text file and the data size of the web page source code introduced above. The server can therefore determine the position of the web page source code of the target web page file in the text file according to the index entry corresponding to the target web page file.
In one possible implementation, take a Linux system as an example. Linux provides several important file operation functions: open, read, write, and lseek. The open function opens or creates a file, the read function reads data from a file, the write function writes data to a file, and the lseek function sets the position at which the next read or write takes place. On Linux, the lseek function can move the file pointer according to the data size of the web page source code, so that the pointer is efficiently positioned at the start of the web page source code of the target web page file. The read function then reads the web page source code of the target web page file from that position.
In step 208, the web page source code of the target web page file is read from the text file according to the determined position.
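Steps 207 and 208 can be sketched with Python's `os` module, whose `os.open`, `os.lseek`, `os.read` and `os.close` functions wrap the corresponding system calls. This is only a sketch under assumptions: the function name `read_source` is hypothetical, and it presumes the byte offset and size have already been derived from the index entry.

```python
import os


def read_source(path: str, offset: int, size: int) -> bytes:
    """Read `size` bytes starting at byte `offset` of the text file at `path`."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Move the file pointer to the start of the wanted source code.
        os.lseek(fd, offset, os.SEEK_SET)
        chunks = []
        remaining = size
        while remaining > 0:
            chunk = os.read(fd, remaining)  # may return fewer bytes than requested
            if not chunk:                   # end of file reached early
                break
            chunks.append(chunk)
            remaining -= len(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)
```

Seeking directly to the offset means that the source codes stored before the target in the same text file are never read.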
In summary, the method provided in this embodiment acquires the web page source code of the web page file to be stored and writes the web page source code into a text file for storage. This solves the problem in the related art that storing the web page files captured by a web crawler in a database imposes excessive processing overhead on the system. Compared with reading and writing data in a database, reading and writing data in a text file consumes far fewer system processing resources, so the web page files are stored more reasonably and the processing overhead of the system is reduced.
In addition, the method provided in this embodiment stores the web page source codes of two or more web page files in one text file, which avoids creating files, writing data and closing files too frequently, thereby reducing the processing overhead of the system.
In addition, the method provided in this embodiment creates an index entry for each web page file, so that the web page source code can be found in and retrieved from the text file more conveniently.
In addition, the method provided in this embodiment stores the index directory in the memory, which effectively increases the speed of finding an index entry in the index directory and of reading each item of information from that index entry.
It should be added that fig. 2E is a schematic diagram illustrating a possible application scenario according to an embodiment of the present disclosure. The application scenario may include: at least one processing device 31 and at least one storage device 32. The processing device 31 may be a server with computing and processing functions deployed in a local computer room of a service provider of the search engine system, and the storage device 32 may be a storage device deployed in a cloud, such as a cloud disk. The processing device 31 and the storage device 32 are connected via a network. The processing device 31 may crawl the web page file to be stored through the web crawler and send the web page file to be stored to the storage device 32. Accordingly, the storage device 32 extracts the web page source code from the web page file to be stored, and writes the web page source code to the text file. Optionally, the storage device 32 may further create a corresponding index entry for each web page file through the above step 204 and step 205, and maintain an index directory in which the index entries are recorded. When the processing device 31 needs to read the required web page source code, the processing device 31 may send a data reading request to the storage device 32, where the data reading request carries a file identifier of a target web page file to be read. Accordingly, after receiving the data reading request, the storage device 32 reads the web page source code of the target web page file through the steps 206 to 208, and sends the read web page source code of the target web page file to the processing device 31.
The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods. For details not disclosed in the embodiments of the apparatus of the present disclosure, refer to the embodiments of the method of the present disclosure.
Fig. 3 is a block diagram illustrating an apparatus for storing a web page file according to an exemplary embodiment. The apparatus may be applied to one or more devices in a search engine system, typically a server. The apparatus may include: a code acquisition module 310 and a code writing module 320.
The code acquisition module 310 is configured to acquire the web page source code of a web page file to be stored.
The code writing module 320 is configured to write the web page source code acquired by the code acquisition module 310 into a text file.
In summary, the apparatus provided in this embodiment acquires the web page source code of the web page file to be stored and writes the web page source code into a text file for storage. This solves the problem in the related art that storing the web page files captured by a web crawler in a database imposes excessive processing overhead on the system. Compared with reading and writing data in a database, reading and writing data in a text file consumes far fewer system processing resources, so the web page files are stored more reasonably and the processing overhead of the system is reduced.
Fig. 4A is a block diagram illustrating an apparatus for storing a web page file according to another exemplary embodiment. The apparatus may be applied to one or more devices in a search engine system, typically a server. The apparatus may include: a code acquisition module 310 and a code writing module 320.
The code acquisition module 310 is configured to acquire the web page source code of a web page file to be stored.
The code writing module 320 is configured to write the web page source code acquired by the code acquisition module 310 into a text file.
Optionally, the web page files to be stored comprise one web page file;
or, the web page files to be stored comprise two or more web page files.
Optionally, as shown in fig. 4B, the code writing module 320 includes: a number detection submodule 320a and a first writing submodule 320b.
The number detection submodule 320a is configured to detect whether the sum of the number of written files of the text file and the number of files to be written is greater than a preset threshold value; the number of written files refers to the number of web page files whose web page source codes have already been written into the text file, and the number of files to be written refers to the number of the web page files to be stored.
The first writing submodule 320b is configured to write the web page source code acquired by the code acquisition module 310 into the text file when the sum of the number of written files and the number of files to be written is less than or equal to the preset threshold value.
Optionally, as shown in fig. 4B, the code writing module 320 further includes: a difference calculation submodule 320c, a second writing submodule 320d, and a third writing submodule 320e.
The difference calculation submodule 320c is configured to calculate a difference value d between the preset threshold value and the number of written files when the sum of the number of written files and the number of files to be written is greater than the preset threshold value, where d is an integer greater than or equal to 0.
The second writing submodule 320d is configured to write the web page source code acquired by the code acquisition module 310 into at least one newly created empty text file when d = 0.
The third writing submodule 320e is configured to, when d ≠ 0, select d web page files from the web page files to be stored and write the web page source codes of the selected d web page files into the text file, and to write the web page source codes of the remaining web page files to be stored into at least one newly created empty text file.
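The decision made by submodules 320a to 320e can be illustrated with a short Python sketch. It rests on assumptions that the disclosure leaves open: the function name `split_for_writing` is hypothetical, the d web page files are simply taken from the front of the pending list, and each newly created text file is also capped at the preset threshold.

```python
from typing import List, Tuple


def split_for_writing(
    written: int, pending: List[str], threshold: int
) -> Tuple[List[str], List[List[str]]]:
    """Split pending pages between the current text file and new text files.

    Returns (pages for the current text file, batches for new text files).
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    # Number detection: does everything still fit into the current text file?
    if written + len(pending) <= threshold:
        return pending, []
    # Difference calculation: d free slots remain in the current text file.
    d = max(threshold - written, 0)
    into_current = pending[:d]  # empty when d == 0
    rest = pending[d:]
    # Assumption: each new text file also holds at most `threshold` pages.
    new_files = [rest[i:i + threshold] for i in range(0, len(rest), threshold)]
    return into_current, new_files
```

For instance, with a threshold of 4, a text file already holding 3 web page files and 3 web page files to be stored, d is 1: one web page file goes into the current text file and the other two go into a new text file.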
Optionally, as shown in fig. 4A, the apparatus further includes: an index creation module 330 and an index storage module 340.
The index creation module 330 is configured to create a corresponding index entry for each of the web page files, where the index entry indicates the position of the web page source code of that web page file in the text file.
An index storage module 340 configured to store the index entry created by the index creation module 330 into an index directory.
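One way the index creation and index storage modules could cooperate with the code writing module is sketched below in Python. The sketch is an assumption rather than the disclosed implementation: the name `append_page` and the tuple layout `(text file path, offset, size)` are hypothetical, and the index directory is modelled as a plain in-memory dictionary.

```python
import os
from typing import Dict, Tuple

# Hypothetical in-memory index directory: page id -> (text file path, offset, size).
IndexDirectory = Dict[str, Tuple[str, int, int]]


def append_page(index: IndexDirectory, text_file: str, page_id: str, source: bytes) -> None:
    """Append one page's source code to `text_file` and record its index entry."""
    with open(text_file, "ab") as f:
        f.seek(0, os.SEEK_END)  # make sure the reported position is the end of the file
        offset = f.tell()       # the new source code starts here
        f.write(source)
    index[page_id] = (text_file, offset, len(source))
```

Recording the offset at write time means that the position of each web page source code does not have to be recomputed from the sizes of the entries stored before it.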
Optionally, as shown in fig. 4A, the apparatus further includes: an index acquisition module 350, a position determining module 360, and a code reading module 370.
The index acquisition module 350 is configured to acquire the index entry corresponding to the target web page file from the index directory stored by the index storage module 340.
The position determining module 360 is configured to determine, according to the index entry corresponding to the target web page file acquired by the index acquisition module 350, the position of the web page source code of the target web page file in the text file.
The code reading module 370 is configured to read the web page source code of the target web page file from the text file according to the position determined by the position determining module 360.
In summary, the apparatus provided in this embodiment acquires the web page source code of the web page file to be stored and writes the web page source code into a text file for storage. This solves the problem in the related art that storing the web page files captured by a web crawler in a database imposes excessive processing overhead on the system. Compared with reading and writing data in a database, reading and writing data in a text file consumes far fewer system processing resources, so the web page files are stored more reasonably and the processing overhead of the system is reduced.
In addition, the apparatus provided in this embodiment stores the web page source codes of two or more web page files in one text file, which avoids creating files, writing data and closing files too frequently, thereby reducing the processing overhead of the system.
In addition, the apparatus provided in this embodiment creates an index entry for each web page file, so that the web page source code can be found in and retrieved from the text file more conveniently.
In addition, the apparatus provided in this embodiment stores the index directory in the memory, which effectively increases the speed of finding an index entry in the index directory and of reading each item of information from that index entry.
The apparatus provided in the embodiments shown in fig. 3 and fig. 4A can be used to implement the method flows provided in the embodiments shown in fig. 1 and fig. 2A.
It should be noted that, when the apparatus provided in the foregoing embodiments implements the function of storing a web page file, the division into the above functional modules is only an example. In practical applications, the functions may be assigned to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
An exemplary embodiment of the present disclosure further provides an apparatus for storing a web page file, which can implement the method for storing a web page file provided by the present disclosure. The apparatus includes: a processor, and a memory for storing instructions executable by the processor. The processor is configured to:
acquire the web page source code of a web page file to be stored; and
write the web page source code into a text file.
Optionally, the web page files to be stored comprise one web page file;
or, the web page files to be stored comprise two or more web page files.
Optionally, the processor is configured to:
detect whether the sum of the number of written files of the text file and the number of files to be written is greater than a preset threshold value, where the number of written files refers to the number of web page files whose web page source codes have already been written into the text file, and the number of files to be written refers to the number of the web page files to be stored; and
write the web page source code into the text file when the sum of the number of written files and the number of files to be written is less than or equal to the preset threshold value.
Optionally, the processor is further configured to:
calculate a difference value d between the preset threshold value and the number of written files when the sum of the number of written files and the number of files to be written is greater than the preset threshold value, where d is an integer greater than or equal to 0;
write the web page source code into at least one newly created empty text file when d = 0; and
when d ≠ 0, select d web page files from the web page files to be stored, write the web page source codes of the selected d web page files into the text file, and write the web page source codes of the remaining web page files to be stored into at least one newly created empty text file.
Optionally, the processor is further configured to:
create a corresponding index entry for each web page file, where the index entry indicates the position of the web page source code of that web page file in the text file; and
store the index entry into an index directory.
Optionally, the processor is further configured to:
acquire the index entry corresponding to the target web page file from the index directory;
determine the position of the web page source code of the target web page file in the text file according to the index entry corresponding to the target web page file; and
read the web page source code of the target web page file from the text file according to the determined position.
Fig. 5 is a block diagram illustrating an apparatus 500 according to an exemplary embodiment. For example, the apparatus 500 may be provided as a server. Referring to fig. 5, the apparatus 500 includes a processing component 522, which in turn includes one or more processors, and memory resources, represented by a memory 532, for storing instructions executable by the processing component 522, such as application programs. The application programs stored in the memory 532 may include one or more modules, each of which corresponds to a set of instructions. Further, the processing component 522 is configured to execute the instructions to perform the method of storing a web page file provided by the above embodiments.
The apparatus 500 may also include a power component 526 configured to perform power management of the apparatus 500, a wired or wireless network interface 550 configured to connect the apparatus 500 to a network, and an input/output (I/O) interface 558. The apparatus 500 may operate based on an operating system stored in the memory 532, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (13)

1. A method of storing a web page file, the method comprising:
acquiring a webpage source code of a webpage file to be stored;
and writing the webpage source code into a text file.
2. The method of claim 1, wherein
the web page files to be stored comprise one web page file;
or,
the web page files to be stored comprise two or more web page files.
3. The method of claim 1, wherein writing the web page source code to a text file comprises:
detecting whether the sum of the number of written files of the text file and the number of files to be written is larger than a preset threshold value or not; the written file number refers to the number of the webpage files corresponding to the written webpage source codes in the text files, and the number of the files to be written refers to the number of the webpage files to be stored;
and if the sum of the number of the written files and the number of the files to be written is less than or equal to the preset threshold value, writing the webpage source code into the text file.
4. The method of claim 3, further comprising:
if the sum of the number of the written files and the number of the files to be written is larger than the preset threshold value, calculating a difference value d between the preset threshold value and the number of the written files, wherein d is not less than 0 and is an integer;
when d is 0, writing the webpage source code into at least one newly created empty text file;
when d is not equal to 0, d webpage files are selected from the webpage files to be stored, and webpage source codes of the selected d webpage files are written into the text file; and writing the remaining webpage source codes of the webpage files to be stored into at least one newly created empty text file.
5. The method of claim 1, further comprising:
respectively creating a corresponding index item for each webpage file, wherein the index item is used for indicating the position of a webpage source code of the webpage file in the text file;
and storing the index entry into an index directory.
6. The method of claim 5, further comprising:
acquiring an index item corresponding to the target webpage file from the index directory;
determining the position of the webpage source code of the target webpage file in the text file according to the index item corresponding to the target webpage file;
and reading the webpage source code of the target webpage file from the text file according to the determined position.
7. An apparatus for storing a web page file, the apparatus comprising:
a code acquisition module configured to acquire a web page source code of a web page file to be stored;
a code writing module configured to write the web page source code into a text file.
8. The apparatus of claim 7, wherein
the web page files to be stored comprise one web page file;
or,
the web page files to be stored comprise two or more web page files.
9. The apparatus of claim 7, wherein the code writing module comprises:
a number detection submodule configured to detect whether the sum of the number of written files of the text file and the number of files to be written is greater than a preset threshold value; the number of written files refers to the number of web page files whose web page source codes have already been written into the text file, and the number of files to be written refers to the number of the web page files to be stored;
and a first writing submodule configured to write the web page source code into the text file when the sum of the number of written files and the number of files to be written is less than or equal to the preset threshold value.
10. The apparatus of claim 9, wherein the code writing module further comprises:
a difference calculation submodule configured to calculate a difference value d between the preset threshold value and the number of written files when the sum of the number of written files and the number of files to be written is greater than the preset threshold value, wherein d is an integer greater than or equal to 0;
a second writing submodule configured to write the web page source code into at least one newly created empty text file when d is 0;
and a third writing submodule configured to, when d is not equal to 0, select d web page files from the web page files to be stored, write the web page source codes of the selected d web page files into the text file, and write the web page source codes of the remaining web page files to be stored into at least one newly created empty text file.
11. The apparatus of claim 7, further comprising:
an index creating module configured to create a corresponding index entry for each of the web page files, where the index entry is used to indicate a position of a web page source code of the web page file in the text file;
an index storage module configured to store the index entry into an index directory.
12. The apparatus of claim 11, further comprising:
an index acquisition module configured to acquire an index entry corresponding to the target web page file from the index directory;
a position determining module configured to determine the position of the web page source code of the target web page file in the text file according to the index entry corresponding to the target web page file;
and a code reading module configured to read the web page source code of the target web page file from the text file according to the determined position.
13. An apparatus for storing a web page file, the apparatus comprising:
a processor;
a memory for storing executable instructions of the processor;
wherein the processor is configured to:
acquiring a webpage source code of a webpage file to be stored;
and writing the webpage source code into a text file.
CN201510290694.1A 2015-05-29 2015-05-29 Method and device for storing webpage file Pending CN104951516A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510290694.1A CN104951516A (en) 2015-05-29 2015-05-29 Method and device for storing webpage file

Publications (1)

Publication Number Publication Date
CN104951516A true CN104951516A (en) 2015-09-30

Family

ID=54166174

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510290694.1A Pending CN104951516A (en) 2015-05-29 2015-05-29 Method and device for storing webpage file

Country Status (1)

Country Link
CN (1) CN104951516A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150930
