Disclosure of Invention
The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to determine the key or critical elements of the present invention, nor is it intended to limit the scope of the present invention. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is discussed later.
In order to solve the above problems of the prior art, it is an object of the present invention to provide a method and apparatus for generating or displaying a webpage annotation in consideration of the content of an annotated object on a webpage, wherein webpage annotation information can be associated with the annotated object and the content of contextual webpage elements immediately before and after the annotated object on the webpage, so that the change of the annotated object can be dynamically tracked.
Another object of the present invention is to provide a web page annotation method and apparatus, by which a web page desired to be loaded and displayed by a user and existing annotations previously annotated on the web page stored on a remote annotation server can be displayed on a client browser, and new annotations are added and displayed on the web page.
Still another object of the present invention is to provide an information sharing system for implementing information sharing based on webpage annotation by using the above webpage annotation method and apparatus.
In order to achieve the above objects, according to one aspect of the present invention, there is provided a method for generating webpage annotation information, the method comprising: in response to a user selecting a target webpage element as an annotated object on a current webpage loaded in a client Web browser, extracting an XPath path of the annotated object in a Document Object Model (DOM) tree of the current webpage; generating a feature code CF of the annotated object based on the content of the annotated object and of the context webpage elements immediately before and after the annotated object in the current webpage; and generating webpage annotation information based on the XPath path of the annotated object, the feature code CF, and an annotation input by the user, wherein the webpage annotation information is stored in an annotation database of a remote annotation server, the feature code CF of the annotated object is composed of a content-based feature (CBF) of the annotated object and the CBFs of the context webpage elements of the annotated object, and the CBF of a webpage element is composed of a letter projection vector and a letter order vector of the webpage element, wherein the letter projection vector consists of the occurrence counts of all letters in the webpage element over the alphabet Λ = {a, b, c, ..., z}, and the letter order vector consists of the inversion (reverse-order) counts of all letters in the webpage element over the alphabet Λ.
According to another aspect of the present invention, there is also provided an apparatus for generating webpage annotation information, the apparatus comprising: an XPath generator for extracting, in response to a user selecting a target webpage element as an annotated object on a current webpage loaded in a client Web browser, an XPath path of the annotated object in a Document Object Model (DOM) tree of the current webpage; a feature code (CF) generator for generating a feature code CF of the annotated object based on the content of the annotated object and of the context webpage elements immediately before and after the annotated object in the current webpage; and an annotation generator for generating webpage annotation information based on the XPath path of the annotated object, the feature code CF of the annotated object, and an annotation input by the user, wherein the webpage annotation information is stored in an annotation database of a remote annotation server, the feature code CF of the annotated object is composed of a content-based feature (CBF) of the annotated object and the CBFs of the context webpage elements of the annotated object, and the CBF of a webpage element is composed of a letter projection vector and a letter order vector of the webpage element, wherein the letter projection vector consists of the occurrence counts of all letters in the webpage element over the alphabet Λ = {a, b, c, ..., z}, and the letter order vector consists of the inversion (reverse-order) counts of all letters in the webpage element over Λ.
According to another aspect of the present invention, there is also provided a method for displaying a web page and annotations on the web page in a client Web browser, the method comprising: a) in response to a user inputting a Uniform Resource Locator (URL) of a web page to be loaded and displayed in the browser, analyzing the input URL to obtain a valid URL; b) querying a remote annotation server for all annotations related to the valid URL, so as to obtain an annotation candidate set and the webpage annotation information of the annotations; c) for each annotation in the annotation candidate set, determining, according to the webpage annotation information of the annotation, whether the annotation annotates a webpage element in the web page to be loaded, that is, whether the annotation should exist in the web page to be loaded, and if so, further determining the position of the annotated webpage element in the web page to be loaded, that is, the annotation position; and d) synthesizing the annotations determined to be present with the loaded web page according to their webpage annotation information and annotation positions, and displaying the synthesized web page to the user via the browser, wherein the webpage annotation information of an annotation comprises the XPath path of the annotated object corresponding to the annotation, the feature code CF of the annotated object, the content and format of the annotation, the URL of the web page where the annotation is located, and the content feature code of the web page where the annotation is located, the feature code CF of the annotated object is composed of a content-based feature (CBF) of the annotated object and the CBFs of the context webpage elements immediately before and after the annotated object, and the CBF of a webpage element is composed of a letter projection vector and a letter order vector of the webpage element, wherein the letter projection vector consists of the occurrence counts of all letters in the webpage element over the alphabet Λ = {a, b, c, ..., z}, and the letter order vector consists of the inversion (reverse-order) counts of all letters in the webpage element over Λ.
According to another aspect of the present invention, there is also provided an apparatus for displaying a web page and annotations on the web page via a client Web browser, the apparatus comprising: a URL analyzer for analyzing, in response to a user inputting a Uniform Resource Locator (URL) of a web page to be loaded and displayed in the browser, the input URL to obtain a valid URL; an annotation querier for querying a remote annotation server for all annotations related to the valid URL, so as to obtain an annotation candidate set and the webpage annotation information of the annotations; an annotation position determining unit for determining, for each annotation in the annotation candidate set and according to the webpage annotation information of the annotation, whether the annotation annotates a webpage element in the web page to be loaded, that is, whether the annotation should exist in the web page to be loaded, and if so, further determining the position of the annotated webpage element in the web page to be loaded, that is, the annotation position; and a synthesizing unit for synthesizing the annotations with the loaded web page according to their webpage annotation information and annotation positions, the synthesized web page being displayed to the user via the browser, wherein the webpage annotation information of an annotation comprises the XPath path of the annotated object corresponding to the annotation, the feature code CF of the annotated object, the content and format of the annotation, the URL of the web page where the annotation is located, and the content feature code of the web page where the annotation is located, the feature code CF of the annotated object is composed of a content-based feature (CBF) of the annotated object and the CBFs of the context webpage elements immediately before and after the annotated object, and the CBF of a webpage element is composed of a letter projection vector and a letter order vector of the webpage element, wherein the letter projection vector consists of the occurrence counts of all letters in the webpage element over the alphabet Λ = {a, b, c, ..., z}, and the letter order vector consists of the inversion (reverse-order) counts of all letters in the webpage element over Λ.
In addition, according to another aspect of the present invention, there is also provided a webpage annotation method, including: in response to a user inputting the URL of a web page to be loaded and displayed in a client Web browser, displaying in the browser the web page together with the existing annotations previously made on the web page and stored on a remote annotation server, by performing the above-described method for displaying a web page and annotations on the web page; adding a new annotation to the web page by performing the above-described method for generating webpage annotation information, wherein the webpage annotation information of the new annotation is stored on the remote annotation server; and displaying the added new annotation on the web page via the browser.
According to another aspect of the present invention, there is also provided a webpage annotation apparatus, including: the above-described apparatus for generating webpage annotation information; and the above-described apparatus for displaying a web page and annotations on the web page via a client Web browser.
According to another aspect of the present invention, there is also provided an information sharing system based on webpage annotation, including a client and a remote annotation server, wherein the client comprises the above-described webpage annotation apparatus, and the remote annotation server comprises an annotation database for storing webpage annotation information and an annotation information accessor for performing access control on the annotation database.
According to other aspects of the invention, corresponding computer-readable storage media and computer program products are also provided.
The method, apparatus and system of the present invention have the advantage that the XPath path of the annotated object, the content of the annotated object, and the content of its context webpage elements are all taken into account when the webpage annotation information is generated, so that an annotation can dynamically track its annotated object and the related annotation information moves along with the annotated object. Moreover, even if the format of the annotated object changes, the annotation can still be displayed correctly. Even when the content of the annotated object changes, the change can be evaluated to determine whether the corresponding annotation should still be displayed.
These and other advantages of the present invention will become more apparent from the following detailed description of the preferred embodiments of the present invention, taken in conjunction with the accompanying drawings.
Detailed Description
Exemplary embodiments of the present invention will be described hereinafter with reference to the accompanying drawings. In the interest of clarity and conciseness, not all features of an actual implementation are described in the specification. It will of course be appreciated that in the development of any such actual embodiment, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which will vary from one implementation to another. Moreover, it will be appreciated that such a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking for those of ordinary skill in the art having the benefit of this disclosure.
Here, it should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the device structure and/or the processing steps closely related to the scheme according to the present invention are shown in the drawings, and other details not so related to the present invention are omitted.
Fig. 2 is a diagram illustrating a structure of a system for implementing information sharing using webpage annotation according to an embodiment of the present invention. The system can be divided into two parts, a client side and a server side (i.e. an annotation server) which are connected through a network (not shown).
As shown in FIG. 2, the apparatus 200 for webpage annotation mainly comprises a user interface 210, an XPath generator 220, a content-based feature (CBF) generator 230, an annotation generator 240, an annotation analyzer 250 and an XML converter 260 at the client side, while an annotation information accessor 270 and an annotation database 280 are provided at the server side.
In a specific implementation example of the system shown in fig. 2, the web page annotation device 200 of the client can be implemented in the form of a browser plug-in; the annotation server can be implemented by Java server, specifically, the server-side annotation information accessor 270 can be implemented by Java server, and the annotation database 280 can be implemented by an existing database management system. However, it will be understood by those skilled in the art that the principles of the present invention are not limited thereto but may be embodied in other different forms as may be desired.
At the client, the user can utilize the web page annotation device 200 to add and display new web page annotations on the web page loaded by the browser, as well as to correctly display existing web page annotations that have been previously added on the web page. In the web page annotation device 200, the user interface 210 is responsible for receiving input for the entire device, and may receive any one or more of the following: (1) input information relating to configuration parameters of the system; (2) input information relating to an annotated object selected by a user on a web page; (3) input information relating to the content of the annotation; (4) input information relating to the display mode of the annotation; and so on.
The XPath generator 220 is used to extract the XPath path of the annotated object in the DOM (Document Object Model) tree of a web page. XPath is the W3C-recommended way of addressing elements in a document: each element in the web page corresponds to an XPath path, and any element in the web page can be located through its XPath path. Each node in the DOM tree of the web page corresponds to an element contained in the web page. That is, both the annotated object and the web page elements immediately preceding and following it in the web page can be represented as nodes of the DOM tree. For ease of explanation, the web page elements immediately preceding and following the annotated object are referred to as the preceding and following elements; they correspond to the immediately adjacent sibling nodes of the node of the annotated object in the DOM tree, and thus may also be referred to as context nodes or context web page elements.
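The relationship between an annotated node, its XPath path and its context nodes can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's standard `xml.etree.ElementTree` on a small well-formed page rather than a real browser DOM, and the function name `xpath_and_context` and the positional path format (`tag[n]`) are hypothetical choices, not taken from the description.

```python
import xml.etree.ElementTree as ET

def xpath_and_context(root, target):
    """Return (positional XPath of target, left sibling, right sibling)."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    # Context nodes: the immediately adjacent siblings of the target.
    left = right = None
    parent = parent_of.get(target)
    if parent is not None:
        siblings = list(parent)
        i = siblings.index(target)
        left = siblings[i - 1] if i > 0 else None
        right = siblings[i + 1] if i + 1 < len(siblings) else None

    # Walk up to the root, recording tag[position] at each level.
    steps = []
    node = target
    while node is not None:
        up = parent_of.get(node)
        if up is None:
            steps.append(node.tag)
        else:
            same_tag = [c for c in up if c.tag == node.tag]
            steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = up
    return "/" + "/".join(reversed(steps)), left, right

page = ET.fromstring(
    "<html><body><h1>Title</h1><p>first</p><p>second</p><p>third</p></body></html>"
)
target = page.find("body")[2]          # the <p> containing "second"
path, left, right = xpath_and_context(page, target)
print(path)                            # /html/body[1]/p[2]
print(left.text, right.text)           # first third
```

In a browser plug-in the same information would come from the live DOM; the sketch only shows that the path and the two sibling nodes can be derived together from the tree.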
The CBF (content-based feature) generator 230 generates a CBF of the annotated object according to the content of the annotated object. The CBF of the annotated object is composed of the letter projection vector (CPF) and the letter order vector (CSF) of the annotated object, namely: CBF = CPF + CSF.
The letter projection vector (CPF) consists of the occurrence counts of all letters in the annotated object over the alphabet Λ = {a, b, c, ..., z}, and the length of the vector is the length of the alphabet Λ. For example, assuming that the annotated object is a piece of English text on a web page, the numbers num(a), num(b), ..., num(z) of occurrences of each letter a, b, ..., z in the text can be counted, so as to obtain the following letter projection vector: CPF = [num(a), num(b), ..., num(z)]. A change in the CPF reflects, to a certain extent, operations such as deletion, insertion and replacement performed on the content of the annotated object.
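As a minimal sketch of this definition, the counts can be computed in a few lines of Python. The function name `letter_projection_vector` is hypothetical, and folding upper case to lower case is an assumption made here, since the description does not state how case is handled.

```python
import string

ALPHABET = string.ascii_lowercase  # the alphabet Λ = {a, b, ..., z}

def letter_projection_vector(text: str) -> list[int]:
    """Count the occurrences of each letter of ALPHABET in text.

    Characters outside a-z are ignored; case folding is an assumption.
    """
    counts = [0] * len(ALPHABET)
    for ch in text.lower():
        if ch in ALPHABET:
            counts[ord(ch) - ord("a")] += 1
    return counts

cpf = letter_projection_vector("bad")
print(cpf[:4])  # counts for a, b, c, d -> [1, 1, 0, 1]
```

Because only counts are kept, any two strings that are permutations of each other share the same CPF, which is why a second, order-sensitive vector is needed.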
The letter order vector (CSF) consists of the inversion (reverse-order) counts over the alphabet Λ of all letters in the annotated object, the length of the vector being the length of the alphabet. Assume that the alphabet Λ has the ordering a < b < c < ... < z. Then the inversion count on letter a of the annotated object x is the number of letters greater than a (i.e., b, c, ..., z) that precede a in x, the inversion count on letter b is the number of letters greater than b (i.e., c, d, ..., z) that precede b in x, and so on, so that the inversion counts of all letters in the annotated object x over the entire alphabet can be obtained. A change in the CSF reflects, to some extent, transpositions within the content of the annotated object. For example, "bad" and "dab" have the same CPF but different CSFs, which reflects the difference in letter order between them.
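The "bad" versus "dab" example can be checked with a short sketch. It assumes that a larger letter counts toward the inversion count of a smaller letter whenever it occurs anywhere earlier in the string (not only in the adjacent position); this reading is an interpretation, chosen because it is the one under which the two example words actually yield different CSFs. The function name is hypothetical.

```python
import string

ALPHABET = string.ascii_lowercase  # ordering a < b < ... < z

def letter_order_vector(text: str) -> list[int]:
    """For each letter, count larger letters occurring before it in text."""
    inversions = [0] * len(ALPHABET)
    seen = [0] * len(ALPHABET)   # seen[k]: occurrences of k-th letter so far
    for ch in text.lower():
        if ch not in ALPHABET:
            continue             # non-letters are ignored (assumption)
        k = ord(ch) - ord("a")
        inversions[k] += sum(seen[k + 1:])  # larger letters already seen
        seen[k] += 1
    return inversions

print(letter_order_vector("bad")[:4])  # [1, 0, 0, 0]: 'b' precedes 'a'
print(letter_order_vector("dab")[:4])  # [1, 1, 0, 0]: 'd' precedes 'a' and 'b'
```

Both words contain one a, one b and one d, so their projection vectors coincide, while the order vectors above differ in the entry for b.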
In order to effectively track whether the context of the annotated object changes, the CBF generator 230 generates the CBFs of the context nodes of the annotated object in addition to the CBF of the annotated object itself. The context nodes of the annotated object may be determined from the XPath path of the annotated object generated by the XPath generator 220. The CBF of the annotated object (represented by DOM tree node x) and the CBFs of its context nodes (nodes x_left and x_right, respectively) together constitute the feature code CF of the annotated object, i.e., CF(x) = CBF(x_left) + CBF(x) + CBF(x_right).
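A compact sketch of how the feature code could be assembled is given below. It assumes that "+" denotes vector concatenation for CF just as it does for CBF, and that a missing context node contributes an all-zero CBF; both points are assumptions, and the helper names `cbf` and `feature_code` are hypothetical.

```python
import string

ALPHABET = string.ascii_lowercase

def cbf(text: str) -> list[int]:
    """CBF = CPF + CSF (concatenation), 52 entries for a 26-letter alphabet."""
    letters = [c for c in text.lower() if c in ALPHABET]
    cpf = [letters.count(c) for c in ALPHABET]
    csf = [0] * len(ALPHABET)
    seen = [0] * len(ALPHABET)
    for ch in letters:
        k = ord(ch) - ord("a")
        csf[k] += sum(seen[k + 1:])  # larger letters seen earlier
        seen[k] += 1
    return cpf + csf

def feature_code(left: str, target: str, right: str) -> list[int]:
    """CF(x) = CBF(x_left) + CBF(x) + CBF(x_right).

    An empty string stands in for a missing sibling (assumption).
    """
    return cbf(left) + cbf(target) + cbf(right)

cf = feature_code("first paragraph", "bad", "")
print(len(cf))  # 156 = 3 * 52
```

Keeping the three CBFs in fixed positions lets a later comparison tell apart a change in the annotated object from a change in its surroundings.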
The specific structure of the CBF generator 230 and the processing procedure thereof, and how to add a new annotation to a web page by using the web page annotation device, will be described below with reference to fig. 3 and 4.
The annotation generator 240 generates webpage annotation information according to information related to the annotated object (e.g., the feature code of the annotated object), the content and format of the input annotation, and the like. The XML converter 260 converts the generated webpage annotation information into an XML message format suitable for communication with the server side over the network, so that the webpage annotation information can be transmitted to the server side and stored in the annotation database 280 via the annotation information accessor 270. The webpage annotation information includes the URL of the annotation (i.e., the URL of the web page where the annotation is located), the position of the annotation on the web page (i.e., the XPath path information of the corresponding annotated object), relevant features of the corresponding annotated object (e.g., the feature code CF), the content feature code of the web page where the annotation is located, and the content and format of the annotation. Here, the content feature code of a web page is a code identifying the content of the web page: if the content feature codes of two web pages are the same, the content of the two web pages is the same. The content feature code of a web page can be obtained by using a conventional encoding method, such as hash encoding (e.g., MD5).
The annotation analyzer 250 determines, based on the URL of the current web page, those URLs stored in the annotation database 280 that belong to the same website as the current web page and whose web pages are the same as or similar to the current web page, and takes them as valid URLs. It then queries the annotation database for all annotations related to the valid URLs and matches the annotations obtained by the query against the current web page, to determine which annotations annotate elements in the currently loaded web page (i.e., which annotations should exist in the current web page) and at which positions in the current web page they should be displayed. The annotation analyzer 250 can thus support situations in which the content of an annotated object is moved from one page to another. The specific processing procedure and structure of the annotation analyzer 250 will be described below with reference to fig. 5 to 9.
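The fields listed above can be pictured as a simple record, with the content feature code of the page computed by MD5 as the text suggests. The class name `AnnotationInfo`, its field names and the helper `page_content_code` are hypothetical, and hashing the UTF-8 bytes of the page text is an assumption about what exactly is hashed.

```python
import hashlib
from dataclasses import dataclass

def page_content_code(page_text: str) -> str:
    """Content feature code of a page: MD5 of its text (UTF-8 assumed)."""
    return hashlib.md5(page_text.encode("utf-8")).hexdigest()

@dataclass
class AnnotationInfo:
    url: str                 # URL of the page carrying the annotation
    xpath: str               # XPath path of the annotated object
    feature_code: list[int]  # feature code CF of the annotated object
    page_code: str           # content feature code of the page
    content: str             # text of the annotation
    fmt: str                 # display format of the annotation

page = "<html><body><p>bad</p></body></html>"
info = AnnotationInfo(
    url="http://example.com/article",   # placeholder URL
    xpath="/html/body[1]/p[1]",
    feature_code=[1, 1, 0, 1],          # truncated, for illustration only
    page_code=page_content_code(page),
    content="Check this paragraph",
    fmt="note",
)
print(info.page_code == page_content_code(page))  # True: same content, same code
```

Two pages served under different URLs but with identical text yield the same code, which is what the later same-page determination relies on.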
The XML converter 260 is used to convert the information that needs to be communicated between the client and the server into XML message format, so that the web page annotation device 200 of the client can communicate with the server. However, it should be understood by those skilled in the art that XML-formatted messages are used for facilitating communication between the client and the server implemented by Java server, and the principles of the present invention are not limited to only converting the message format into XML format, but may use other different message formats to communicate between the client and the server according to different implementations of the server part as shown in fig. 2.
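One possible shape of such a conversion is sketched below with Python's standard `xml.etree.ElementTree`. The element names (`annotation`, `url`, `xpath`, and so on) and the space-separated encoding of the feature code are invented for illustration; the description does not specify the actual message schema.

```python
import xml.etree.ElementTree as ET

def annotation_to_xml(info: dict) -> str:
    """Serialize annotation fields into a flat XML message (hypothetical schema)."""
    root = ET.Element("annotation")
    for key, value in info.items():
        child = ET.SubElement(root, key)
        if isinstance(value, (list, tuple)):
            child.text = " ".join(str(v) for v in value)  # vectors as text
        else:
            child.text = str(value)
    return ET.tostring(root, encoding="unicode")

def annotation_from_xml(message: str) -> dict:
    """Parse the message back into a dict of strings."""
    return {child.tag: child.text or "" for child in ET.fromstring(message)}

info = {
    "url": "http://example.com/article",   # placeholder URL
    "xpath": "/html/body[1]/p[2]",
    "feature_code": [1, 0, 2],
    "content": "see <b>this</b> & that",
    "format": "note",
}
message = annotation_to_xml(info)
parsed = annotation_from_xml(message)
print(parsed["content"])  # see <b>this</b> & that
```

The serializer escapes markup characters in the annotation content, so user-entered text survives the round trip unchanged.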
As shown in FIG. 2, at the server side, the annotation information accessor 270 accesses the annotation database 280 in response to a request from the client, and the annotation database 280 stores therein the webpage annotation information related to each annotation collected by the information sharing system, which as mentioned above may include the URL of the annotation (i.e., the URL of the webpage where the annotation is located), the location of the annotation on the webpage, the feature code of the corresponding annotated object, the content and format of the annotation, and so on.
The following description is made with reference to fig. 3 and 4. FIG. 3 is an exemplary flow diagram illustrating a process 300 performed when a new annotation is added to a web page using the system shown in FIG. 2 according to an embodiment of the present invention, and FIG. 4 is a schematic diagram illustrating in detail an exemplary structure and process of the CBF generator shown in FIG. 2.
As shown in fig. 3, in step S310, the XPath path of the annotated object in the DOM tree of the current web page is extracted according to the annotated object selected by the user on the current web page. Then, in step S320, based on the content of the annotated object and of its context nodes (which can be determined from the XPath path generated in step S310), the CBFs of the annotated object and of its context nodes are generated as described above, so as to obtain the feature code CF of the annotated object. Next, in step S330, webpage annotation information is generated based on the information about the annotated object, the input annotation content, and the like. In step S340, the webpage annotation information generated in step S330 is converted into a message in XML format suitable for communication with the server side, and then in step S350, the webpage annotation information generated by the client side is stored in the annotation database 280 at the server side via the annotation information accessor 270.
The CBF generator 230 of fig. 2 is shown in detail in fig. 4. As shown in fig. 4, the CBF generator 230 may include an HTML (hypertext markup language) cleaning unit 410, an HTML alphabetizing unit 420, a letter projection vector (CPF) generation unit 430, and a letter order vector (CSF) generation unit 440. The following description takes as an example the case in which the CBF generator 230 generates the CBF of an annotated object.
The HTML cleaning unit 410 removes, from the annotated object selected by the user, HTML tags that have no bearing on the content (e.g., format tags such as <b></b>, <u></u>, etc.) according to pre-stored HTML cleaning rules (which, as shown in fig. 4, can be pre-stored in the HTML dictionary 450), so as to reduce HTML noise and reduce the influence of web page format changes on the annotated object.
The HTML alphabetizing unit 420 alphabetizes the annotated object after HTML cleaning, so that the annotated object is converted, based on its content, into a letter string composed of the letters a to z. For an annotated object that contains Chinese text, the HTML alphabetizing unit 420 first converts the Chinese text in the annotated object into Chinese pinyin with reference to the Chinese dictionary 460 (this step may be omitted when the annotated object contains no Chinese text) and then obtains the letter string. In the case of a polyphonic character, the HTML alphabetizing unit may take the first pinyin reading of the character, but it is understood that the principles of the present invention are not so limited.
The letter projection vector (CPF) generation unit 430 and the letter order vector (CSF) generation unit 440 generate the letter projection vector and the letter order vector of the annotated object, respectively, from the letter string obtained through HTML alphabetizing, according to the definitions of the letter projection vector (CPF) and the letter order vector (CSF) given above. Then, by concatenating the letter projection vector (CPF) and the letter order vector (CSF), the content-based feature CBF of the annotated object is obtained.
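The cleaning and alphabetizing stages can be sketched as two small functions. Everything specific here is an assumption made for illustration: the cleaning rule is reduced to stripping `<b>`, `<u>` and `<i>` tags, and the three-entry `PINYIN` table merely stands in for the Chinese dictionary 460, with the first listed reading taken for a polyphonic character.

```python
import re

# Stand-in for the cleaning rules of the HTML dictionary: strip a few format tags.
FORMAT_TAGS = re.compile(r"</?(?:b|u|i)\s*>", re.IGNORECASE)

# Tiny stand-in for the Chinese dictionary; the last entry is polyphonic.
PINYIN = {"中": ["zhong"], "文": ["wen"], "行": ["xing", "hang"]}

def clean_html(fragment: str) -> str:
    """Remove format tags so that styling changes do not affect the content."""
    return FORMAT_TAGS.sub("", fragment)

def alphabetize(text: str) -> str:
    """Reduce text to a string over a-z; Chinese characters become pinyin."""
    out = []
    for ch in text:
        if ch in PINYIN:
            out.append(PINYIN[ch][0])      # first reading for polyphones
        elif ch.isascii() and ch.isalpha():
            out.append(ch.lower())
        # everything else (digits, spaces, punctuation) is dropped
    return "".join(out)

letters = alphabetize(clean_html("<b>Bad</b> <u>中文</u>"))
print(letters)  # badzhongwen
```

The resulting letter string is what the CPF and CSF generation units would then consume.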
Referring back to fig. 2, when the user inputs the URL of a certain web page in the client browser in order to browse the web page and the annotation information on it, the client browser loads the desired web page and transmits the URL of the web page and its DOM tree structure to the annotation analyzer 250.
FIG. 5 illustrates an exemplary structure of the annotation analyzer 250 according to an embodiment of the invention. As shown in FIG. 5, the annotation analyzer 250 includes a URL analyzer 510, an annotation querier 520, and a web page annotation synthesizer 530.
The URL analyzer 510 analyzes the URL input by the user, extracts all URLs in the same website with the currently loaded web page (i.e., the web page corresponding to the currently input URL, or simply the current web page) from the annotation database 280 (via the XML converter 260 and the annotation information accessor 270), forms an alternative URL set, performs the same page determination and the similar page determination on the web pages corresponding to all URLs in the alternative URL set (hereinafter, referred to as alternative URLs) and the current web page, and determines the alternative URLs corresponding to the web pages the same as or similar to the current web page as valid URLs.
The annotation querier 520 queries (via the XML converter 260 and the annotation information accessor 270) all annotations associated with the valid URL (i.e., all annotations on the web page corresponding to the valid URL) in the annotation database 280 according to the valid URL determined by the URL analyzer 510, that is, queries all annotations possibly associated with the current web page in the annotation database 280, thereby obtaining an annotation candidate set, and obtains all webpage annotation information of the possible annotations from the annotation database 280.
The web page annotation synthesizer 530 matches the current web page against all possible annotations to determine which annotations most likely annotate which elements or objects in the currently loaded web page, i.e., determines whether and where each of the possible annotations exists in the current web page, and synthesizes the annotations with the web page for display to the user via the browser. As shown in fig. 5, the web page annotation synthesizer 530 may further include an annotation position determination unit 532 and a synthesis unit 534.
For each possible annotation in the annotation candidate set, the annotation position determination unit 532 determines, according to the webpage annotation information of the annotation (e.g., information such as the XPath path and the feature code CF of the annotated object corresponding to the annotation), whether the possible annotation annotates a webpage element in the current webpage (i.e., determines whether the possible annotation exists in the current webpage), and, if the possible annotation is determined to exist, further determines the position of the annotated webpage element in the current webpage (i.e., the annotation position).
The synthesis unit 534 synthesizes the annotations with the current web page according to the webpage annotation information of the possible annotations determined to be present in the current web page and the determined annotation positions of those annotations in the current web page, and displays the synthesized web page to the user via the browser.
FIG. 6 is a flow diagram illustrating a process 600 for a user entering a URL of a web page to be loaded in a client browser using the information sharing system described above to display the web page and existing annotations therein, according to an embodiment of the invention.
As shown in fig. 6, in step S610, the URLs input by the user are analyzed to obtain the alternative URL sets, and the web pages corresponding to all the alternative URLs and the web page to be loaded (i.e., the current web page) are subjected to the same or similar page determination, so as to determine a valid URL. A specific processing procedure in step S610 will be described below with reference to fig. 7.
In step S620, according to the determined valid URL, all annotations that may be related to the current web page are queried in the annotation database, so as to obtain an annotation candidate set. Then, in step S630, it is determined which of all possible annotations exist in the current webpage, and the annotation location of the existing annotations in the current webpage is determined. A specific processing procedure related to step S630 will be described below with reference to fig. 8 and 9.
Then, in step S640, the annotations are synthesized with the current web page based on the webpage annotation information of the annotations determined to be present in step S630 and the determined annotation positions of the annotations, and the synthesized web page is displayed to the user via the browser in step S650. Here, the synthesis can be performed by dynamically modifying the DOM code of the current web page: the annotation is first converted into HTML format, and the converted HTML fragment is then inserted into the web page code and displayed in the browser.
Fig. 7 is an exemplary flowchart illustrating a process of obtaining an alternative URL based on a URL input by a user and making the same and similar page determination on a corresponding web page and a web page currently loaded by a browser (i.e., a current web page) in one embodiment according to the present invention (i.e., a specific processing procedure of step S610 illustrated in fig. 6).
As shown in fig. 7, in step S710, as described above, based on the URL input by the user, the set of all alternative URLs in the same website as the input URL, that is, the alternative URL set, is obtained. Then, in step S720, it is determined whether the web page corresponding to a certain alternative URL is the same as the current web page. Here, if the content feature code of the web page corresponding to the alternative URL is the same as the content feature code of the current web page, it may be determined that the two web pages are the same page; otherwise, the two web pages are different. Since this determination relies on the content feature code of the web page, the content feature code can, as described above, be obtained by using an existing encoding method such as MD5. This determination mainly covers the case of web pages whose URL has changed but whose content has not.
If it is determined in step S720 that the two web pages are not the same, it is determined in step S730 whether the two web pages are similar pages. Here, the two web pages may be determined to be similar when both of the following conditions are satisfied; otherwise, the two web pages are not similar:
(1) the titles of the two web pages are the same; and
(2) one of the following holds:
(a) parameters are passed between the two web pages, the numeric parameters in the URL are missing, and the rest of the URLs are the same;
(b) parameters are passed between the two web pages, the numeric parameters in the URLs are different, the numeric parameters of the web page corresponding to the alternative URL are smaller than those of the web page corresponding to the current URL, and the rest of the URLs are the same; or
(c) no parameters are passed between the two web pages, the last address part of the URLs is different, and the rest of the URLs are the same.
It is obvious here that the principle of the present invention is not limited to the above-mentioned similar page determination condition, and those skilled in the art can set other different similar page determination conditions as required.
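One possible reading of the above similar-page conditions is sketched below. The sketch assumes that "parameter passing" means the URL carries a query string, that "numeric parameters" are query values consisting only of digits, and that "the last address part" is the last path segment. These interpretations, and the names `_split_url` and `is_similar_page`, are assumptions made for illustration only.

```python
from urllib.parse import parse_qsl, urlsplit


def _split_url(url):
    """Return ((scheme, host, path), numeric params, other params, has_query)."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    numeric = {k: int(v) for k, v in params.items() if v.isdecimal()}
    other = {k: v for k, v in params.items() if not v.isdecimal()}
    return (parts.scheme, parts.netloc, parts.path), numeric, other, bool(parts.query)


def is_similar_page(alt_url, alt_title, cur_url, cur_title):
    """Apply the title condition, then one of the three URL conditions."""
    if alt_title != cur_title:
        return False
    a_base, a_num, a_other, a_query = _split_url(alt_url)
    c_base, c_num, c_other, c_query = _split_url(cur_url)
    if a_query or c_query:
        # Parameter passing: everything except numeric parameters must match.
        if a_base != c_base or a_other != c_other:
            return False
        # Numeric parameters missing from exactly one of the two URLs.
        if bool(a_num) != bool(c_num):
            return True
        # Numeric parameters differ and the alternative URL's are not larger.
        if a_num and a_num.keys() == c_num.keys() and a_num != c_num:
            return all(a_num[k] <= c_num[k] for k in a_num)
        return False
    # No parameter passing: only the last path segment may differ.
    a_head, _, a_last = a_base[2].rpartition("/")
    c_head, _, c_last = c_base[2].rpartition("/")
    return a_base[:2] == c_base[:2] and a_head == c_head and a_last != c_last


print(is_similar_page("http://example.com/t?id=7&page=1", "Topic",
                      "http://example.com/t?id=7&page=2", "Topic"))
```

Other similar-page conditions can be substituted by replacing the body of `is_similar_page`, in line with the remark that the conditions may be set as required.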
When the determination result in step S720 or step S730 is affirmative, the processing proceeds to step S740, and the current alternative URL is determined as a valid URL.
If the determinations in step S720 and step S730 find that the two web pages are neither identical nor similar, the process proceeds to step S750, where it is determined whether the alternative URL set still contains any URL that has not yet undergone the same-page and similar-page determination. If so, in step S760, the next alternative URL is extracted from the alternative URL set, and the process returns to step S720 to determine whether the web page corresponding to that URL is the same as or similar to the current web page. The processing of steps S720 to S760 is repeated until it is determined in step S750 that every alternative URL in the alternative URL set has undergone the same-page and similar-page determination, thereby determining all the valid URLs in the alternative URL set.
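The loop of steps S720 to S760 can be summarized with the following sketch. The two predicates are passed in as callables because their concrete form depends on the embodiment; the name `collect_valid_urls` and its parameters are hypothetical.

```python
from typing import Callable, Iterable, List


def collect_valid_urls(
    alternative_urls: Iterable[str],
    is_same_page: Callable[[str], bool],
    is_similar_page: Callable[[str], bool],
) -> List[str]:
    """Return every alternative URL whose page is the same as or similar to the current page."""
    valid_urls = []
    for url in alternative_urls:  # S750/S760: take the next undetermined URL
        # S720 (same page) is checked first, then S730 (similar page).
        if is_same_page(url) or is_similar_page(url):
            valid_urls.append(url)  # S740: record as a valid URL
    return valid_urls


# Toy predicates standing in for the real determinations.
urls = ["u1", "u2", "u3"]
print(collect_valid_urls(urls, lambda u: u == "u1", lambda u: u == "u3"))
```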
Fig. 8 is a flowchart showing in detail the processing procedure of step S630 in fig. 6 (i.e., determining whether all possible annotations exist in the current web page and their annotation positions in the current web page), and fig. 9 is a schematic diagram showing the structure of the feature code CF of a certain annotation (as shown in (a) in fig. 9) and its corresponding DOM tree (as shown in (b) in fig. 9) used in the processing procedure shown in fig. 8.
As shown in fig. 8, in step S810, the webpage annotation information of the possible annotation currently to be determined is used, for example the feature code CF and the XPath path of the annotated object corresponding to the annotation. Starting from the node located by the XPath path in the DOM tree of the current webpage, the nodes of the DOM tree are probed sequentially upward and downward. The nodes that are the same as or most similar to the annotated object and its context nodes are thereby determined as the DOM tree nodes corresponding to the annotation in the current webpage (here, similar means that the difference in the content of the node and its context is within an allowable range).
For example, take the feature code CF of a possible annotation to be determined, shown in (a) of fig. 9, in which A, B and C respectively represent the annotated object corresponding to the annotation, its upper node and its lower node. The nodes in the DOM tree are probed sequentially starting from the node located by the XPath path of A, and the nodes corresponding to A, B and C are determined to be A', B' and C', respectively, as shown in (b) of fig. 9. These may be referred to as the DOM tree nodes corresponding to the annotation to be determined.
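The upward and downward probing of step S810 might look like the sketch below. It makes several simplifying assumptions that are not stated in the description: the DOM nodes are flattened into a list of text strings in document order, the XPath path has already been resolved to a start index, only the annotated object itself (not its context) is compared, and similarity is measured by the difference of letter counts. The names `letter_distance`, `find_closest_node` and `max_offset` are hypothetical.

```python
from collections import Counter


def letter_distance(text_a: str, text_b: str) -> int:
    """Sum of absolute differences of letter counts (a stand-in for a CBF distance)."""
    count_a = Counter(ch for ch in text_a.lower() if ch.isalpha())
    count_b = Counter(ch for ch in text_b.lower() if ch.isalpha())
    return sum(abs(count_a[k] - count_b[k]) for k in set(count_a) | set(count_b))


def find_closest_node(node_texts, start, target_text, max_offset=5):
    """Probe outward from `start` (upward first, then downward) and return (index, distance)."""
    best_index, best_distance = None, None
    for offset in range(max_offset + 1):
        candidates = (start,) if offset == 0 else (start - offset, start + offset)
        for index in candidates:
            if 0 <= index < len(node_texts):
                distance = letter_distance(node_texts[index], target_text)
                # Strict comparison keeps the node nearest to `start` on ties.
                if best_distance is None or distance < best_distance:
                    best_index, best_distance = index, distance
    return best_index, best_distance


# A banner was inserted, so the stored position now points one node too early.
nodes = ["Header", "New banner", "Breaking news today", "Footer"]
print(find_closest_node(nodes, 1, "Breaking news today"))
```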
Then, in step S820, based on the determined DOM tree nodes corresponding to the possible annotation to be determined, the distance D(A, A') between the annotation and the DOM tree is calculated as follows:
D(A, A') = d(A, A') + α(d(B, B') + d(C, C')) + β·d_s
wherein
d(A, A') = |CBF(A) - CBF(A')|,
d(B, B') = |CBF(B) - CBF(B')|,
d(C, C') = |CBF(C) - CBF(C')|,
d_s is the tree structure distance, and α and β are constants. α represents the degree to which a difference in the context of the annotated object influences the difference of the annotated object, β represents the degree to which a difference in the DOM tree structure influences the similarity difference of the annotation, and d_s represents the difference between the context node structure in the current DOM tree and the CF structure of the annotation (i.e., the original context node structure).
Suppose that the lowest common node P of the nodes A', B' and C' can be found in the DOM tree, and that l_A', l_B' and l_C' respectively represent the number of nodes passed through from the nodes A', B' and C' to the node P. Then d_s can be calculated as follows:
d_s = l_A' + l_B' + l_C'
In the case shown in (b) of fig. 9, d_s = 1.
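The distance calculation of step S820 can be sketched as follows, under clearly stated assumptions. The CBF is reduced here to the letter-count (projection) vector only, and the order vector is omitted. The norm |CBF(X) - CBF(X')| is taken as the sum of absolute component differences. Each node is identified by its path of child indexes from the root, and "the number of nodes passed through" is read as the number of intermediate nodes strictly between a node and P. The default values of `alpha` and `beta` are arbitrary placeholders, and all function names are hypothetical.

```python
from collections import Counter
from string import ascii_lowercase


def cbf(text):
    """Letter-count vector over a..z (assumed simplification of the CBF)."""
    counts = Counter(ch for ch in text.lower() if ch in ascii_lowercase)
    return [counts[ch] for ch in ascii_lowercase]


def d(text_a, text_b):
    """|CBF(a) - CBF(b)|, taken here as an L1 distance (assumption)."""
    return sum(abs(x - y) for x, y in zip(cbf(text_a), cbf(text_b)))


def structure_distance(paths):
    """d_s: intermediate nodes between each of A', B', C' and their lowest common node P."""
    prefix_len = 0
    for level in zip(*paths):
        if len(set(level)) != 1:
            break
        prefix_len += 1
    return sum(max(len(p) - prefix_len - 1, 0) for p in paths)


def annotation_distance(stored, found, found_paths, alpha=0.5, beta=1.0):
    """D(A, A') = d(A, A') + alpha * (d(B, B') + d(C, C')) + beta * d_s."""
    context = d(stored["B"], found["B"]) + d(stored["C"], found["C"])
    return d(stored["A"], found["A"]) + alpha * context + beta * structure_distance(found_paths)


stored = {"A": "hello", "B": "abc", "C": "xyz"}
found = {"A": "hallo", "B": "abc", "C": "xyy"}
paths = [(0, 1, 0), (0, 0), (0, 2)]  # root-to-node child indexes of A', B', C'
print(annotation_distance(stored, found, paths))
```

With these inputs, d(A, A') = 2, d(B, B') = 0, d(C, C') = 2 and d_s = 1, so the sketch yields 2 + 0.5 × 2 + 1.0 × 1 = 4.0.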
Returning to fig. 8, in step S830, it is determined whether the distance D of the annotation to be determined, calculated in step S820, is smaller than a predetermined threshold. If so, it may be determined in step S840 that the annotation should exist on the current web page, together with its position on the current web page. For example, if the calculated D(A, A') is less than the predetermined threshold, it is determined that the annotation to be determined still marks an element or object in the current web page and thus should be displayed on the current web page, and the position of the node A' in the DOM tree determines the position where the annotation should be displayed on the current web page.
If it is determined in step S830 that the distance D of the to-be-determined annotation is not less than the predetermined threshold, then in step S840, the annotation is discarded, i.e., it is determined that the annotation should not be displayed on the current webpage.
As can be seen from the above definitions of the content-based feature CBF and the feature code CF of the annotated object, the CBF is generally unique to the annotated object (especially when the annotated object is web page content represented by English text) and has a uniform length, which is convenient for data transmission and storage; a change in the CBF truly reflects a change in the content of the annotated object; and the distance between the CFs of annotated objects is a measure of the change in the objects.
In the information sharing system according to the embodiment of the present invention, the annotated object is identified by its XPath path and, at the same time, by its feature code CF, so that the annotated object can be dynamically tracked in a dynamic webpage, which conventional webpage information annotation systems cannot achieve. This is because, in a conventional webpage information annotation system, the feature of the annotated object is generally constructed with a hash function (such as MD5 encoding). Although such a feature is generally unique and uniform in length, which is convenient for data transmission and storage, it cannot reflect the degree of change of the annotated content. With hash encoding, a slight change in the annotated object results in a large change in the feature, so the degree of change in the annotated object cannot be measured by the distance between the features.
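The contrast between a hash-based feature and a content-based feature can be illustrated with a small sketch. As an assumption, the content-based feature is reduced to a letter-count vector over a..z, and the feature distance to the sum of absolute component differences; the function names are hypothetical.

```python
import hashlib
from collections import Counter
from string import ascii_lowercase


def letter_vector(text):
    """Letter-count vector over a..z (assumed simplification of the CBF)."""
    counts = Counter(ch for ch in text.lower() if ch in ascii_lowercase)
    return [counts[ch] for ch in ascii_lowercase]


def vector_distance(text_a, text_b):
    """L1 distance between the letter-count vectors of two texts."""
    return sum(abs(x - y) for x, y in zip(letter_vector(text_a), letter_vector(text_b)))


def md5_hex(text):
    """MD5 hex digest, standing in for a conventional hash-based feature."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def differing_hex_digits(text_a, text_b):
    """Count the positions at which the two MD5 digests differ."""
    return sum(x != y for x, y in zip(md5_hex(text_a), md5_hex(text_b)))


before = "web page annotation"
after = "web page annotations"  # a one-letter edit
print("letter-vector distance:", vector_distance(before, after))
print("differing MD5 hex digits:", differing_hex_digits(before, after), "of 32")
```

The letter-vector distance for the one-letter edit is 1, so it grows with the size of the edit, whereas the two MD5 digests are unrelated and give no indication of how small the edit was.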
In the information sharing method and system based on webpage annotation described above with reference to the drawings, according to the embodiments of the present invention, the feature code of the annotated object can be generated based on the content of the annotated object and its context content. When all possible annotations are matched against the currently loaded webpage, the change of each annotation can therefore be measured, and whether the annotation is displayed can be decided according to the degree of the change, thereby implementing dynamic tracking. In addition, in the annotation matching process, a lightweight DOM tree search method based on the features of the context content is adopted to measure the content change and the context change of the annotated object.
As can be seen from the above description, the method and system according to the embodiment of the present invention use a dynamic tracking technology, so that even if an annotated object in a web page changes to some extent, the corresponding annotation can be correctly displayed at the changed position on the web page, and for content that has disappeared from the web page, the corresponding annotation will not be displayed. In addition, when an annotated object in a web page has been transferred from another web page, the corresponding annotation can be displayed at the correct position of the annotated object on the web page. In addition, in the case where the current web page may have been annotated under different URLs, all of the annotations will be correctly displayed. In addition, when the format of the annotated object is changed, for example by bolding, italicizing, or quoting, the annotation can still be displayed correctly. Changes in format are common in web updates or forum content transfers. Therefore, webpage annotation can be used as a means of sharing information among users.
Further, it is apparent that the respective operational procedures of the above-described method according to the present invention can also be implemented in the form of computer-executable programs stored in various machine-readable storage media.
Moreover, the object of the present invention can also be achieved by: a storage medium storing the above executable program code is directly or indirectly supplied to a system or an apparatus, and a computer or a Central Processing Unit (CPU) in the system or the apparatus reads out and executes the program code.
In this case, as long as the system or the apparatus has a function of executing a program, the embodiments of the present invention are not limited to a particular form of program, and the program may be in any form, for example, an object program, a program executed by an interpreter, a script program provided to an operating system, or the like.
Such machine-readable storage media include, but are not limited to: various memories and storage units, semiconductor devices, disk units such as optical, magnetic, and magneto-optical disks, and other media suitable for storing information.
In addition, the computer can also implement the present invention by connecting to a corresponding website on the internet, and downloading and installing the computer program code according to the present invention into the computer and then executing the program.
It is also to be noted that the steps of executing the above-described series of processes may naturally be executed chronologically in the order described, but need not necessarily be executed chronologically. Some steps may be performed in parallel or independently of each other.
Finally, it should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
Although the embodiments of the present invention have been described in detail with reference to the accompanying drawings, it should be understood that the above-described embodiments are only for illustrating the present invention and are not to be construed as limiting the present invention. Various changes, substitutions, and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims. Moreover, the scope of the present application is not intended to be limited to the particular embodiments of the process, machine, manufacture, composition of matter, means, methods and steps described in the specification. As one of ordinary skill in the art will readily appreciate from the disclosure of the present invention, processes, machines, manufacture, compositions of matter, means, methods, or steps, presently existing or later to be developed that perform substantially the same function or achieve substantially the same result as the corresponding embodiments described herein may be utilized according to the present invention. Accordingly, the appended claims are intended to include within their scope such structures, means, methods, or steps of a process, apparatus, manufacture, composition of matter.