CN108268884B - Document comparison method and device - Google Patents
Document comparison method and device
- Publication number
- CN108268884B (CN201611265983.7A)
- Authority
- CN
- China
- Prior art keywords
- paragraph
- document
- original document
- character string
- revised
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Databases & Information Systems (AREA)
- Document Processing Apparatus (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a document comparison method and a device, wherein the method comprises the following steps: content comparison is carried out on the revised document and the original document, and the longest common character string of the revised document and the original document is determined; respectively positioning paragraphs of the revised document and the original document according to the longest common character string, and determining the paragraphs with corresponding relations in the revised document and the original document; and comparing the content of any non-corresponding paragraph in the revised document with that of the original document, and determining the modification type of the non-corresponding paragraph according to the comparison result so as to accurately identify the difference between different documents.
Description
Technical Field
The present invention relates to the field of data processing, and in particular, to a document comparison method and apparatus.
Background
In the prior art, comparison of short content marks text insertions and deletions accurately and handles paragraph moves largely without problems. For long or complex content, however, an upward or downward move of a paragraph may not be identified accurately and paragraph boundaries may be lost, so that inserted and deleted text is classified inaccurately. In addition, paragraph splitting and combining, partial copying of paragraphs, and large-scale cut-and-paste are not analyzed at all.
In summary, existing document comparison methods are inadequate and their comparison results are inaccurate.
Disclosure of Invention
The embodiment of the invention provides a document comparison method and device, which are used for solving the problem that a document comparison result is not accurate enough in the prior art.
An embodiment of the invention provides a document comparison method, which comprises the following steps: content comparison is carried out on the revised document and the original document, and the longest common character string of the revised document and the original document is determined;
respectively positioning paragraphs of the revised document and the original document according to the longest common character string, and determining the paragraphs with corresponding relations in the revised document and the original document;
and comparing the content of any non-corresponding paragraph in the revised document with that of the original document, and determining the modification type of the non-corresponding paragraph according to a comparison result.
Based on the same inventive concept, an embodiment of the present invention further provides a document comparing apparatus, including:
the determining unit is used for comparing the contents of the revised document and the original document and determining the longest common character string of the revised document and the original document;
the paragraph positioning unit is used for respectively positioning the paragraphs of the revised document and the original document according to the longest common character string and determining the paragraphs with corresponding relations in the revised document and the original document;
and the comparison unit is used for comparing the content of any non-corresponding paragraph in the revised document with that of the original document, and determining the modification type of the non-corresponding paragraph according to a comparison result.
The document comparison method provided by the embodiment of the invention first compares the content of the whole documents. It then positions the paragraphs of the original document and the revised document according to the longest common character string and determines which paragraphs of the revised document correspond to which paragraphs of the original document. After paragraph positioning is complete, it compares each paragraph group with the original document and determines the movement (or cut-and-paste) of matched paragraphs, the differences (insertions or deletions) inside corresponding paragraphs of the two documents, the splitting and combining of paragraphs, the partial copying of paragraphs, and so on. Finally, it marks and displays the comparison result in different colors according to these types.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a document comparison method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram corresponding to a section of an original document and a revised document according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a paragraph splitting combination of an original document and a revised document according to an embodiment of the present invention;
FIG. 4 is a schematic diagram showing the comparison of the contents of an original document and a revised document according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a document comparison result of text insertion and deletion according to an embodiment of the present invention;
FIG. 6 is a schematic diagram showing the comparison of the contents of an original document and a revised document according to an embodiment of the present invention;
FIG. 7 is a diagram showing a comparison result of a paragraph up-shift and down-shift according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a comparison result of cut-and-paste between paragraphs according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a document comparing apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to FIG. 1, an embodiment of the present invention provides a flow chart of a document comparison method. Specifically, the method includes:
Step S101: compare the contents of the revised document and the original document, and determine the longest common character string of the revised document and the original document.
Step S102: position the paragraphs of the revised document and of the original document according to the longest common character string, and determine the paragraphs of the two documents that correspond to each other.
Step S103: for any non-corresponding paragraph in the revised document, compare its content with the original document, and determine the modification type of the non-corresponding paragraph according to the comparison result.
It should be noted that the longest common character string is determined for the documents as a whole. Comparing it with the original document and with the revised document respectively shows which paragraphs of each document it falls in. Those paragraphs are the paragraphs with a correspondence, and they also serve as the reference for the positional relationship of all other paragraphs. An example is given below for illustration.
Original document:
Revised document:
The longest common character string obtained by the comparison should be:
Comparing the longest common character string with the original document shows that it lies in the 1st and 2nd paragraphs of the original document; comparing it with the revised document shows that it lies in the 3rd and 4th paragraphs of the revised document. These paragraphs are thereby established as corresponding paragraphs, and they serve as the position reference on which the later notion of relative position is based. The longest common character string can be determined with an existing fast longest-common-subsequence algorithm.
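The paragraph-locating idea described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the patent's implementation: paragraphs are assumed to be separated by a single `"\n"`, a textbook dynamic-programming longest common subsequence stands in for the fast algorithm the text mentions, and the names `lcs_pairs` and `paragraphs_hit` as well as the sample strings are hypothetical.

```python
def lcs_pairs(a, b):
    """Return index pairs (i, j) of one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # dp[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk forward through the table to recover one optimal alignment.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def paragraphs_hit(text, offsets):
    """Map character offsets to 0-based paragraph indices (paragraphs split on a newline)."""
    return sorted({text.count("\n", 0, off) for off in offsets})


# Hypothetical sample documents, three and four paragraphs long.
original = "AAAA\nBBBB\nCCCC"
revised = "XXXX\nYYYY\nAAAA\nBBBB"

pairs = lcs_pairs(original, revised)
common = "".join(original[i] for i, _ in pairs)
orig_paras = paragraphs_hit(original, [i for i, _ in pairs])
rev_paras = paragraphs_hit(revised, [j for _, j in pairs])
print(repr(common), orig_paras, rev_paras)
```

With these sample strings the common string falls in paragraphs 0 and 1 of the original and paragraphs 2 and 3 of the revised text (0-based), which mirrors the "1st and 2nd" versus "3rd and 4th" situation in the text.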
The embodiment of the invention determines the different types of difference found in document comparison on the premise that paragraphs are positioned accurately. The longest common character string serves as the reference string, and paragraphs are located by means of it. Specifically, as shown in FIG. 2, the original document contains paragraphs A, B, C, D, E, F, G, H from top to bottom, and the revised document contains paragraph C, paragraph X (containing a large part of the content of paragraph H), paragraph A, paragraph B, paragraph D, paragraph F, paragraph G, paragraph E, and paragraph Y (containing part of the content of paragraph G). Comparing the whole content of the original document with the whole content of the revised document determines that the longest common character string consists of all characters of paragraphs A, B, D, F, G (including the carriage returns between paragraphs). Positioning the original document and the revised document of FIG. 2 with this string locates paragraphs A, B, D, F, G: paragraphs A, B, D, F, G of the original document and paragraphs A, B, D, F, G of the revised document are the paragraphs with a correspondence. The non-corresponding paragraphs remaining in the revised document are paragraph C, paragraph X (containing a large part of the content of paragraph H), paragraph E, and paragraph Y (containing part of the content of paragraph G).
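The FIG. 2 example can be reproduced at paragraph granularity with a short sketch. Several simplifications are assumed here: each paragraph is reduced to an opaque label (so X and Y simply fail to match), and Python's `difflib.SequenceMatcher` is swapped in for the longest-common-string step. `SequenceMatcher` finds matching blocks heuristically and does not guarantee a longest common subsequence in general; the function name `locate_paragraphs` is hypothetical.

```python
from difflib import SequenceMatcher

# Paragraph labels taken from the FIG. 2 description.
original = ["A", "B", "C", "D", "E", "F", "G", "H"]
revised = ["C", "X", "A", "B", "D", "F", "G", "E", "Y"]


def locate_paragraphs(original, revised):
    """Split the revised paragraphs into corresponding and non-corresponding ones."""
    matcher = SequenceMatcher(None, original, revised, autojunk=False)
    matched = set()
    for block in matcher.get_matching_blocks():
        # block.b is the start index in `revised`, block.size the run length.
        matched.update(range(block.b, block.b + block.size))
    corresponding = [p for k, p in enumerate(revised) if k in matched]
    non_corresponding = [p for k, p in enumerate(revised) if k not in matched]
    return corresponding, non_corresponding


corresponding, non_corresponding = locate_paragraphs(original, revised)
print(corresponding, non_corresponding)
```

For this input the corresponding paragraphs are A, B, D, F, G and the non-corresponding ones are C, X, E, Y, as in the text.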
Further, consecutive non-corresponding paragraphs may be treated as a paragraph group, and all character strings of the paragraph group are compared with the entire content of the original document. Specifically, in the revised document each paragraph group is compared with the original document as a unit, where a paragraph group consists of consecutive paragraphs of the revised document other than the paragraphs with a correspondence;
when a longest common character string exists between the paragraph group and the original document, the part of the paragraph group corresponding to that string consists of two consecutive paragraphs, and the earlier of the two has one extra carriage return at its end compared with the longest common character string, it is determined that one paragraph of the original document corresponds to those two consecutive paragraphs of the revised document, the two paragraphs having been obtained by moving that paragraph of the original document and then splitting it;
when a longest common character string exists between the paragraph group and the original document, and the paragraph of the paragraph group corresponding to that string has one fewer carriage return in the middle of its content than the longest common character string, it is determined that a paragraph combination occurred at the position of that carriage return.
For example, as shown in FIG. 3, the corresponding paragraphs of the original document and the revised document are paragraph A, paragraph B, paragraph E, paragraph H, paragraph I, and paragraph J, and the non-corresponding paragraphs of the revised document are paragraph C1, paragraph C2, and paragraph G (containing the content of paragraph D and paragraph F of the original document). Paragraphs C1 and C2 form the first paragraph group and paragraph G forms the second paragraph group. The first paragraph group is compared with the whole content of the original document, and a longest common character string is found (namely all characters of paragraph C of the original document). Paragraph C of the original document and paragraphs C1 and C2 of the revised document are located according to this string. The two consecutive paragraphs C1 and C2 contain one more carriage return than the longest common character string, so it is determined that paragraph C of the original document was split at the position of that carriage return. Next, the second paragraph group is compared with the whole content of the original document, and a longest common character string is found (namely all characters of paragraph D and paragraph F of the original document). Paragraph D and paragraph F of the original document and paragraph G of the revised document are located according to this string.
Paragraph G has one fewer carriage return in the middle of its content than the longest common character string, so it is determined that paragraph G is the combination of paragraph D and paragraph F of the original document, joined at the position of that carriage return.
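The carriage-return test for splitting and combining can be illustrated with a deliberately simplified sketch. It assumes the paragraph group and the candidate original paragraphs have already been paired, that their text is identical apart from line breaks, and that `"\n"` stands for the carriage return; the function name `split_or_merge` and the sample strings are hypothetical.

```python
def split_or_merge(original_paras, revised_group):
    """Classify a paragraph group by counting paragraph breaks.

    Returns "split" if the revised group has one more break than the
    original paragraphs, "merge" if it has one fewer, and "other" otherwise.
    """
    orig_text = "\n".join(original_paras)
    rev_text = "\n".join(revised_group)
    # Simplification: require identical text once breaks are ignored.
    if orig_text.replace("\n", "") != rev_text.replace("\n", ""):
        return "other"
    extra_breaks = rev_text.count("\n") - orig_text.count("\n")
    if extra_breaks == 1:
        return "split"
    if extra_breaks == -1:
        return "merge"
    return "other"


# One original paragraph became two consecutive revised paragraphs.
print(split_or_merge(["first half. second half."], ["first half. ", "second half."]))
# Two original paragraphs became one revised paragraph.
print(split_or_merge(["text of D.", "text of F."], ["text of D.text of F."]))
```

A fuller version would tolerate edits inside the paragraphs by comparing against the longest common character string instead of requiring exact equality.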
Further, after executing step S102, the method further includes: for any paragraph of the revised document that has a correspondence, recording the characters that the paragraph lacks relative to the corresponding paragraph of the original document as character deletions, and recording the characters that the paragraph has in addition to the corresponding paragraph of the original document as character insertions.
Because locating the corresponding paragraphs also establishes which characters of each paragraph belong to the longest common character string, the characters outside that string are immediately visible: extra characters in the paragraph of the original document are recorded as character deletions, and extra characters in the corresponding paragraph of the revised document are recorded as character insertions.
For example, comparing the original document and the revised document shown in FIG. 4 by the above method shows that, relative to the original document, the revised document has the words "and a certain society" inserted after "conscious form" and the words "religion (mystery special conscious form)" deleted after "religious art", as shown in FIG. 5.
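Character-level insertions and deletions inside a pair of corresponding paragraphs can be sketched as follows. The sketch assumes Python's `difflib.SequenceMatcher` as a stand-in for the longest-common-string comparison; the function name `char_changes` and the sample sentences are hypothetical.

```python
from difflib import SequenceMatcher


def char_changes(original_para, revised_para):
    """Return (deleted, inserted) character runs between two corresponding paragraphs."""
    matcher = SequenceMatcher(None, original_para, revised_para, autojunk=False)
    deleted, inserted = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # A "replace" counts as a deletion plus an insertion.
        if tag in ("delete", "replace"):
            deleted.append(original_para[i1:i2])
        if tag in ("insert", "replace"):
            inserted.append(revised_para[j1:j2])
    return deleted, inserted


print(char_changes("the quick fox", "the quick brown fox"))
print(char_changes("red green blue", "red blue"))
```

Characters present only in the original paragraph end up in the first list (deletions) and characters present only in the revised paragraph in the second list (insertions).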
Further, for a first non-corresponding paragraph in the revised document, the content of the first non-corresponding paragraph is compared with the original document, the first non-corresponding paragraph being any one of the non-corresponding paragraphs;
when the comparison result shows that the original document contains a second paragraph that shares a longest common character string with the first non-corresponding paragraph, a first matching rate of the first non-corresponding paragraph and a second matching rate of the second paragraph are determined, each matching rate being the ratio of the length of the matched character string to the length of the whole character string of the respective paragraph;
and when the first matching rate and the second matching rate are both not smaller than a matching threshold, it is determined that the first non-corresponding paragraph is the second paragraph moved to a new position.
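The two matching rates can be computed with a small sketch. It assumes that "length of the matched character string" means the length of the longest common subsequence of the two paragraphs, which is one reading of the text; the function names `lcs_length` and `matching_rates` are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, other in enumerate(b, start=1):
            if ch == other:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def matching_rates(revised_para, original_para):
    """Return (first_rate, second_rate) for a revised and an original paragraph."""
    matched = lcs_length(revised_para, original_para)
    # First rate: matched length over the revised (non-corresponding) paragraph.
    first = matched / len(revised_para) if revised_para else 0.0
    # Second rate: matched length over the original (second) paragraph.
    second = matched / len(original_para) if original_para else 0.0
    return first, second


print(matching_rates("abcdefghij", "abcde"))
```

In this sample the revised paragraph contains the whole original paragraph plus five extra characters, so only half of the revised paragraph is matched while the original paragraph is matched completely.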
As shown in FIG. 6, the original document contains paragraphs A, B, C, D, E, F from top to bottom, and the revised document contains paragraphs C, F, E, A, B, D from top to bottom. Applying the above method for determining correspondence to the original document and the revised document of FIG. 6, the paragraphs with a correspondence are preliminarily determined to be paragraph A, paragraph B, and paragraph D. The remaining non-corresponding paragraphs are paragraph C, paragraph F, and paragraph E. Paragraphs C, F, E are then formed into a paragraph group and compared with the original document, which finds the correspondence of paragraphs C and F; finally, paragraph E is compared with the original document.
That is, after the corresponding paragraphs are removed from the revised document, several paragraph groups remain. Each paragraph group is compared with the original document in turn by repeating the comparison process, so that the reference character string and the corresponding paragraphs of each group are found in the second layer of comparison. A "corresponding paragraph" found at this stage represents, in terms of the whole-document comparison of the original document and the revised document, a cut-and-paste or a paragraph move. In addition, a threshold is set for paragraph groups in the multi-layer comparison: a matching rate below the threshold is treated as no match, where the matching rate is the ratio of the length of the matched character string to the length of the whole character string of the paragraph. After the second layer of comparison, paragraph groups are formed from the paragraphs that still remain, and the process is repeated until no paragraph can be matched (or a preset number of comparison layers is reached). Paragraph groups and multi-layer comparison are introduced in order to better identify the modification type of a whole paragraph group and how it was derived from paragraphs of the original document, for example as the result of a large-scale cut-and-paste, or of splitting or combining paragraphs of the original document.
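The layered procedure can be sketched on the FIG. 6 example. The sketch makes strong simplifying assumptions: paragraphs are opaque labels compared by exact equality (no matching-rate threshold), a textbook longest common subsequence is used for every layer, and the names `lcs_pairs`, `group_consecutive`, `multilayer_compare`, and the `max_layers` parameter are hypothetical.

```python
def lcs_pairs(a, b):
    """Index pairs (i, j) of one longest common subsequence of sequences a and b."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def group_consecutive(indices):
    """Group sorted indices into runs of consecutive values (paragraph groups)."""
    groups = []
    for idx in indices:
        if groups and idx == groups[-1][-1] + 1:
            groups[-1].append(idx)
        else:
            groups.append([idx])
    return groups


def multilayer_compare(original, revised, max_layers=5):
    """Return (paragraphs matched per layer, paragraphs never matched)."""
    remaining = list(range(len(revised)))
    layers = []
    for _ in range(max_layers):
        matched = []
        for group in group_consecutive(remaining):
            labels = [revised[k] for k in group]
            matched += [group[j] for _, j in lcs_pairs(original, labels)]
        if not matched:
            break  # nothing more can be matched
        layers.append([revised[k] for k in matched])
        remaining = [k for k in remaining if k not in matched]
    return layers, [revised[k] for k in remaining]


layers, unmatched = multilayer_compare(list("ABCDEF"), list("CFEABD"))
print(layers, unmatched)
```

For the FIG. 6 labels the first layer finds A, B, D, the second layer finds C and F inside the group C, F, E, and the third layer finds E. Paragraphs left unmatched after the last layer would be candidates for inserted content.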
Further, the method also includes: when the first matching rate and the second matching rate are both not larger than the matching threshold, determining that the content of the first non-corresponding paragraph is inserted content;
when the first matching rate is less than the matching threshold and the second matching rate is greater than the matching threshold, determining that the first non-corresponding paragraph contains at least most of the content of the second paragraph;
when the first matching rate is greater than the matching threshold and the second matching rate is less than the matching threshold, determining that the first non-corresponding paragraph is at least part of the content of the second paragraph.
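The four threshold cases can be collected into one decision function. This is a sketch under assumptions: the default threshold of 0.6 follows the 60% figure used in the description's examples, rates exactly equal to the threshold are treated as "not smaller" (the text leaves the tie between "not smaller" and "not larger" open), and the function name `classify` and its return labels are hypothetical.

```python
def classify(first_rate, second_rate, threshold=0.6):
    """Classify a non-corresponding paragraph from its two matching rates.

    first_rate:  matched length / length of the revised (non-corresponding) paragraph
    second_rate: matched length / length of the original (second) paragraph
    """
    if first_rate >= threshold and second_rate >= threshold:
        return "moved"                      # the second paragraph changed position
    if first_rate < threshold and second_rate < threshold:
        return "inserted"                   # treated as newly inserted content
    if first_rate < threshold:
        return "contains_most_of_original"  # revised paragraph holds most of the second paragraph
    return "part_of_original"               # revised paragraph is part of the second paragraph


print(classify(1.0, 1.0))    # both rates high
print(classify(0.3, 0.2))    # both rates low
print(classify(0.5, 0.85))   # low first rate, high second rate
print(classify(0.85, 0.5))   # high first rate, low second rate
```

Checking the "moved" case first is a design choice that resolves the boundary case where both rates equal the threshold.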
The comparison of paragraphs within a paragraph group is described below, taking the original document and the revised document of FIG. 2 as an example. Take paragraph C in a paragraph group: comparing it with the original document finds paragraph C of the original document, with which it shares the longest common character string. The matching rate R1 of paragraph C of the original document and the matching rate R2 of paragraph C of the revised document are then calculated. When R1 and R2 both equal 100%, which is greater than the matching threshold of 60%, paragraph C of the revised document is paragraph C of the original document after a move.
Assuming that paragraph X of the revised document contains only a small portion of the content of paragraph H, so that R1 and R2 are both less than the matching threshold of 60%, paragraph X of the revised document is determined to be newly inserted.
For paragraph Y of the revised document in FIG. 2, the matching rate relative to the original document is 85%, but the longest common character string accounts for only 50% of the characters of the whole paragraph in the revised document, so it is determined that paragraph Y contains most of the content of paragraph G of the original document. Because the paragraph corresponding to paragraph G of the original document was already found in the revised document during paragraph positioning, paragraph Y is a copy-and-paste of part of the content of paragraph G.
Through the above analysis, the modification types of the remaining paragraphs of the revised document are all determined. The remaining paragraphs of the original document are then checked for paragraphs whose type is still undetermined; any such paragraph is regarded as a deleted paragraph.
For example, as shown in FIG. 7, paragraph C has moved up to above paragraph X, so that position is annotated to show that paragraph C moved up to here, and the position above paragraph D is annotated to show that paragraph C has moved up. Paragraph E has moved down to between paragraphs G and Y, so that position is annotated to show that paragraph E moved down to here, and the position before paragraph F is annotated to show that paragraph E has moved down.
For another example, the comparison finds that the content of paragraph X matches a large part of the content of paragraph H, so the annotation states that content of paragraph H was cut and pasted into paragraph X. Paragraph H is marked as removed, with part of its content cut and pasted into paragraph X, and part of the content of paragraph G is marked as copied and pasted into paragraph Y, as shown in FIG. 8.
Preferably, before comparing the revised document with the original document in paragraph groups, the method further comprises:
pre-identifying an ultra-short paragraph and/or a frequently occurring paragraph in the paragraph group, and eliminating the ultra-short paragraph and/or the frequently occurring paragraph;
after comparing the revised document with the original document in paragraph groups, the method further comprises:
and comparing the ultrashort paragraphs and/or frequently occurring paragraphs to generate a comparison result.
That is, to make the comparison result more accurate, the documents may be preprocessed before they are compared: ultra-short paragraphs (such as empty paragraphs) and frequently occurring paragraphs are identified in advance and set aside. After the main comparison process is finished, the difference type of each of these "special" paragraphs is determined one by one, using the positions of the paragraphs that have already been located as the reference.
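The preprocessing step can be sketched as a simple filter. The text does not define how short "ultra-short" is or how often "frequently" is, so the `min_len` and `max_repeats` parameters, the function name `split_special`, and the sample paragraphs are all hypothetical.

```python
from collections import Counter


def split_special(paragraphs, min_len=2, max_repeats=2):
    """Separate ultra-short or frequently repeated paragraphs from the rest.

    Returns (main, special); each entry is an (index, text) pair so that the
    special paragraphs can be judged later relative to their neighbours.
    """
    counts = Counter(p.strip() for p in paragraphs)
    main, special = [], []
    for idx, para in enumerate(paragraphs):
        text = para.strip()
        if len(text) < min_len or counts[text] > max_repeats:
            special.append((idx, para))
        else:
            main.append((idx, para))
    return main, special


paragraphs = ["Intro text", "", "Note:", "Body text", "Note:", "Note:", ""]
main, special = split_special(paragraphs)
print(main)
print([idx for idx, _ in special])
```

Keeping the original indices lets the later pass place each set-aside paragraph relative to the paragraphs that the main comparison has already located.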
Further, the displaying the generated comparison result in the original document and the revised document respectively includes:
annotating the generated comparison result at corresponding paragraphs of the original document and the revised document in an annotating manner, and displaying differences between the original document and the revised document by utilizing different colors.
In other words, the comparison result may be presented as annotations, with a different color assigned to each type of difference. For paragraph splitting (or combining), an annotation is added at the corresponding paragraphs noting the split (or combination). For a paragraph moved up (or down), an annotation is added at the paragraph noting that it moved up (or down) to this position, and an annotation is added at the corresponding position in the other document noting that the paragraph has moved up (or down), together with the moved content. Inserted characters are marked directly with color and need no annotation. For deleted text, an annotation is added at the deletion position noting the deleted content. For large-scale cut-and-paste, an annotation is added noting the cut-and-paste, and an annotation is also added at the original location noting that the content was cut. For a partial copy of a paragraph, the annotation indicates that the content is a partial copy of paragraph content in the original document.
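One possible data structure for these annotations is sketched below. The text only states that different difference types are shown in different colors, so the concrete color names, the type labels, and the names `COLOURS`, `Annotation`, and `render` are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical colour per difference type; the source does not name any colours.
COLOURS = {
    "split": "orange",
    "combine": "brown",
    "moved": "blue",
    "insert": "green",
    "delete": "red",
    "cut_paste": "purple",
    "partial_copy": "teal",
}


@dataclass
class Annotation:
    kind: str        # one of the keys of COLOURS
    paragraph: int   # 0-based paragraph index the annotation is attached to
    note: str = ""   # empty for inserted characters, which are only coloured

    @property
    def colour(self):
        return COLOURS[self.kind]


def render(annotations):
    """Return one display line per annotation, ordered by paragraph."""
    ordered = sorted(annotations, key=lambda a: a.paragraph)
    return [f"[{a.colour}] paragraph {a.paragraph}: {a.note or a.kind}" for a in ordered]


notes = [
    Annotation("moved", 3, "moved up to here"),
    Annotation("insert", 1),
]
print(render(notes))
```

Leaving `note` empty for inserted characters reflects the statement that insertions are marked by color alone.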
Therefore, the embodiment of the invention is based on paragraphs, adds the concept of paragraph groups, accurately positions the corresponding paragraph positions of two documents, and can better identify the conditions of paragraph splitting, paragraph combining, large-scale cut-and-paste and paragraph part copy-and-paste by adopting a multi-layer comparison method.
Based on the same technical conception, the embodiment of the invention also provides a document comparing device which can execute the method embodiment. The device provided by the embodiment of the invention is shown in fig. 9, and comprises: a determining unit 301, a paragraph locating unit 302, a comparing unit 303, wherein:
a determining unit 301, configured to compare contents of a revised document and an original document, and determine a longest common character string of the revised document and the original document;
a paragraph positioning unit 302, configured to perform paragraph positioning on the revised document and the original document according to the longest common character string, and determine that a paragraph in a correspondence exists between the revised document and the original document;
and a comparison unit 303, configured to compare, for any non-corresponding paragraph in the revised document, the content of the non-corresponding paragraph with the original document, and determine a modification type of the non-corresponding paragraph according to a comparison result.
Further, the comparing unit 303 is further configured to: in the revised document, compare each paragraph group with the original document as a unit, where a paragraph group consists of consecutive paragraphs of the revised document other than the paragraphs with a correspondence;
when a longest common character string exists between the paragraph group and the original document, the part of the paragraph group corresponding to that string consists of two consecutive paragraphs, and the earlier of the two has one extra carriage return at its end compared with the longest common character string, determine that one paragraph of the original document corresponds to those two consecutive paragraphs of the revised document, the two paragraphs having been obtained by moving that paragraph of the original document and then splitting it;
when a longest common character string exists between the paragraph group and the original document, and the paragraph of the paragraph group corresponding to that string has one fewer carriage return than the longest common character string, determine that a paragraph combination occurred at the position of that carriage return.
Further, the comparing unit 303 is further configured to: for any paragraph of the revised document that has a correspondence, record the characters that the paragraph lacks relative to the corresponding paragraph of the original document as character deletions, and record the characters that the paragraph has in addition to the corresponding paragraph of the original document as character insertions.
Further, the comparing unit 303 is specifically configured to:
compare the content of a first non-corresponding paragraph in the revised document with the original document, the first non-corresponding paragraph being any one of the non-corresponding paragraphs;
when the comparison result shows that the original document contains a second paragraph that shares a longest common character string with the first non-corresponding paragraph, determine a first matching rate of the first non-corresponding paragraph and a second matching rate of the second paragraph, each matching rate being the ratio of the length of the matched character string to the length of the whole character string of the respective paragraph;
and when the first matching rate and the second matching rate are both not smaller than a matching threshold, determine that the first non-corresponding paragraph is the second paragraph moved to a new position.
Further, the comparing unit 303 is further configured to:
when the first matching rate and the second matching rate are both not larger than the matching threshold, determine that the content of the first non-corresponding paragraph is inserted content;
when the first matching rate is less than the matching threshold and the second matching rate is greater than the matching threshold, determine that the first non-corresponding paragraph contains at least most of the content of the second paragraph;
when the first matching rate is greater than the matching threshold and the second matching rate is less than the matching threshold, determine that the first non-corresponding paragraph is at least part of the content of the second paragraph.
Further, the device further comprises: an annotating unit 304, configured to add annotations, based on the generated comparison result, at the corresponding sections of the original document and the revised document, and to display the content differences between the original document and the revised document using different colors.
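One possible way to picture the color annotation is a lookup from modification type to color, rendered here as HTML spans. The output format, the type names, and the specific colors are all assumptions for illustration; the text only states that different colors are used.

```python
from html import escape

# Hypothetical color assignment; the source does not specify which colors are used.
COLORS = {
    "character_insertion": "green",
    "character_deletion": "red",
    "paragraph_move": "blue",
    "paragraph_split": "orange",
    "paragraph_merge": "purple",
    "partial_copy": "teal",
}


def annotate(text: str, change_type: str) -> str:
    """Wrap text in a colored span whose tooltip names the modification type."""
    color = COLORS[change_type]
    return (
        f'<span style="color:{color}" title="{escape(change_type)}">'
        f"{escape(text)}</span>"
    )
```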
In summary, the document comparison method provided by the embodiment of the invention first compares the content of the whole documents, then locates the paragraphs of the original document and the revised document according to the longest common character string, and determines which paragraphs of the revised document correspond to which paragraphs of the original document. After the paragraphs have been located, the method continues to compare the revised document with the original document in units of paragraph groups, and determines the movement (or cut-and-paste) of matched paragraphs, the differences (insertions or deletions) within corresponding paragraphs of the two documents, the splitting and merging of paragraphs, the partial copying of paragraphs, and the like. Finally, the comparison results are annotated and displayed in different colors according to these types.
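The paragraph groups used in the second pass are the runs of consecutive revised-document paragraphs that have no correspondence. A minimal sketch of forming them is given below, assuming the paragraph-positioning step has already produced the set of indices of corresponding paragraphs; the function name and that input format are hypothetical.

```python
from itertools import groupby


def paragraph_groups(paragraphs, matched_indices):
    """Return runs of consecutive paragraphs whose index is not in matched_indices."""
    groups = []
    indexed = enumerate(paragraphs)
    # groupby splits the sequence into maximal runs of matched / unmatched paragraphs.
    for is_matched, run in groupby(indexed, key=lambda item: item[0] in matched_indices):
        if not is_matched:
            groups.append([text for _, text in run])
    return groups
```

For five paragraphs where only paragraphs 0 and 3 have a correspondence, the groups are paragraphs 1 to 2 and paragraph 4.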
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
Claims (10)
1. A document comparison method, the method comprising:
comparing the content of the revised document with that of the original document, and determining the longest common character string of the revised document and the original document;
locating the paragraphs of the revised document and the original document respectively according to the longest common character string, and determining the paragraphs having a correspondence in the revised document and the original document;
comparing the content of a first non-corresponding paragraph in the revised document with the original document, wherein the first non-corresponding paragraph is any one of the non-corresponding paragraphs;
when the comparison result shows that a second paragraph containing the longest common character string exists in the original document, determining a first matching rate of the first non-corresponding paragraph and a second matching rate of the second paragraph, wherein each matching rate equals the ratio of the length of the matching character string to the length of the whole character string of the respective paragraph;
and when the first matching rate and the second matching rate are both not smaller than a matching threshold, determining that the first non-corresponding paragraph is the second paragraph after a change of paragraph position.
2. The method as recited in claim 1, further comprising:
in the revised document, comparing each paragraph group with the original document in units of paragraph groups, wherein a paragraph group consists of consecutive paragraphs in the revised document other than the paragraphs having a correspondence;
when a longest common character string exists between the paragraph group and the original document, the part of the paragraph group corresponding to the longest common character string spans two consecutive paragraphs, and the earlier of the two paragraphs has one more carriage return character at its end than the longest common character string, determining that one paragraph in the original document corresponds to those two consecutive paragraphs of the revised document, wherein the two consecutive paragraphs of the revised document are obtained by splitting the paragraph of the original document after the paragraph has been moved;
when a longest common character string exists between the paragraph group and the original document, and the paragraph of the paragraph group corresponding to the longest common character string has one fewer carriage return character in the middle of its content than the longest common character string, determining that a paragraph merge has occurred at the position corresponding to the carriage return character.
3. The method of claim 1, wherein determining the paragraphs having a correspondence in the revised document and the original document further comprises:
for any paragraph in the revised document that has a correspondence, recording the characters that the paragraph lacks relative to the corresponding paragraph of the original document as character deletions; and recording the characters that the paragraph adds relative to the corresponding paragraph of the original document as character insertions in the revised document.
4. The method of claim 1, wherein after comparing the content of the first non-corresponding paragraph with the original document, the method further comprises:
when the first matching rate and the second matching rate are both not larger than the matching threshold, determining that the content of the first non-corresponding paragraph is inserted content;
when the first matching rate is less than the matching threshold and the second matching rate is greater than the matching threshold, determining that the first non-corresponding paragraph contains at least most of the content of the second paragraph;
when the first matching rate is greater than the matching threshold and the second matching rate is less than the matching threshold, determining that the first non-corresponding paragraph is at least part of the content of the second paragraph.
5. The method of any one of claims 1 to 4, wherein after comparing the content of the first non-corresponding paragraph with the original document, the method further comprises:
adding annotations, based on the generated comparison result, at the corresponding sections of the original document and the revised document, and displaying the content differences between the original document and the revised document using different colors.
6. A document comparison device, the device comprising:
a determining unit, configured to compare the content of the revised document with that of the original document and determine the longest common character string of the revised document and the original document;
a paragraph positioning unit, configured to locate the paragraphs of the revised document and the original document respectively according to the longest common character string, and determine the paragraphs having a correspondence in the revised document and the original document;
a comparing unit, configured to compare the content of a first non-corresponding paragraph in the revised document with the original document, wherein the first non-corresponding paragraph is any one of the non-corresponding paragraphs; when the comparison result shows that a second paragraph containing the longest common character string exists in the original document, determine a first matching rate of the first non-corresponding paragraph and a second matching rate of the second paragraph, wherein each matching rate equals the ratio of the length of the matching character string to the length of the whole character string of the respective paragraph; and when the first matching rate and the second matching rate are both not smaller than a matching threshold, determine that the first non-corresponding paragraph is the second paragraph after a change of paragraph position.
7. The device of claim 6, wherein the comparing unit is further configured to:
in the revised document, compare each paragraph group with the original document in units of paragraph groups, wherein a paragraph group consists of consecutive paragraphs in the revised document other than the paragraphs having a correspondence;
when a longest common character string exists between the paragraph group and the original document, the part of the paragraph group corresponding to the longest common character string spans two consecutive paragraphs, and the earlier of the two paragraphs has one more carriage return character at its end than the longest common character string, determine that one paragraph in the original document corresponds to those two consecutive paragraphs of the revised document, wherein the two consecutive paragraphs of the revised document are obtained by splitting the paragraph of the original document after the paragraph has been moved;
when a longest common character string exists between the paragraph group and the original document, and the paragraph of the paragraph group corresponding to the longest common character string has one fewer carriage return character in the middle of its content than the longest common character string, determine that a paragraph merge has occurred at the position corresponding to the carriage return character.
8. The device of claim 6, wherein the comparing unit is further configured to:
for any paragraph in the revised document that has a correspondence, record the characters that the paragraph lacks relative to the corresponding paragraph of the original document as character deletions; and record the characters that the paragraph adds relative to the corresponding paragraph of the original document as character insertions in the revised document.
9. The device of claim 6, wherein the comparing unit is further configured to:
when the first matching rate and the second matching rate are both not larger than the matching threshold, determine that the content of the first non-corresponding paragraph is inserted content;
when the first matching rate is less than the matching threshold and the second matching rate is greater than the matching threshold, determine that the first non-corresponding paragraph contains at least most of the content of the second paragraph;
when the first matching rate is greater than the matching threshold and the second matching rate is less than the matching threshold, determine that the first non-corresponding paragraph is at least part of the content of the second paragraph.
10. The device of any one of claims 6 to 9, further comprising:
an annotating unit, configured to add annotations, based on the generated comparison result, at the corresponding sections of the original document and the revised document, and to display the content differences between the original document and the revised document using different colors.
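The paragraph split and merge conditions of claims 2 and 7 can be illustrated with a simplified sketch. It replaces the longest-common-character-string comparison with exact equality: a split is reported when two consecutive paragraphs of a revised paragraph group, joined without the separating carriage return, equal one original paragraph, and a merge is reported when one revised paragraph equals two consecutive original paragraphs joined together. The function names and the index-pair return format are assumptions for this example.

```python
def find_splits(original_pars, group):
    """Return (original_index, group_index) pairs where
    original_pars[i] == group[k] + group[k + 1], i.e. one original paragraph
    was split into two consecutive revised paragraphs."""
    results = []
    for k in range(len(group) - 1):
        joined = group[k] + group[k + 1]  # remove the extra carriage return
        for i, par in enumerate(original_pars):
            if par == joined:
                results.append((i, k))
    return results


def find_merges(original_pars, group):
    """Return (original_index, group_index) pairs where
    original_pars[i] + original_pars[i + 1] == group[k], i.e. two original
    paragraphs were merged into one revised paragraph."""
    results = []
    for k, par in enumerate(group):
        for i in range(len(original_pars) - 1):
            if original_pars[i] + original_pars[i + 1] == par:
                results.append((i, k))
    return results
```

A fuller version would tolerate small edits inside the paragraphs by comparing against the longest common character string, as the claims describe, instead of requiring exact equality.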
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611265983.7A CN108268884B (en) | 2016-12-31 | 2016-12-31 | Document comparison method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108268884A (en) | 2018-07-10 |
CN108268884B (en) | 2023-06-16 |
Family
ID=62770175
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611265983.7A Active CN108268884B (en) | 2016-12-31 | 2016-12-31 | Document comparison method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108268884B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109543163A (en) * | 2018-10-30 | 2019-03-29 | 天津字节跳动科技有限公司 | Documentation revisions record acquisition methods, device, storage medium and electronic equipment |
CN109597913B (en) * | 2018-11-05 | 2021-01-29 | 东软集团股份有限公司 | Method, device, storage medium and electronic equipment for aligning document pictures |
CN109815452B (en) * | 2018-12-25 | 2023-04-07 | 东软集团股份有限公司 | Text comparison method and device, storage medium and electronic equipment |
CN109740124A (en) * | 2018-12-25 | 2019-05-10 | 东软集团股份有限公司 | Difference output method, device, storage medium and the electronic equipment of document comparison |
CN111753505B (en) * | 2019-09-30 | 2024-10-22 | 北京沃东天骏信息技术有限公司 | Document processing method, device, server and storage medium |
CN113918509A (en) * | 2020-07-10 | 2022-01-11 | 珠海格力电器股份有限公司 | Document comparison display method and document comparison display equipment |
CN112149402B (en) * | 2020-09-23 | 2023-05-23 | 创新奇智(青岛)科技有限公司 | Document matching method, device, electronic equipment and computer readable storage medium |
CN112699658B (en) * | 2020-12-31 | 2024-05-28 | 科大讯飞华南人工智能研究院(广州)有限公司 | Text comparison method and related device |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101763343A (en) * | 2008-12-23 | 2010-06-30 | 上海晨鸟信息科技有限公司 | Document editor principle supporting format comparison and plagiarism check and method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9514103B2 (en) * | 2010-02-05 | 2016-12-06 | Palo Alto Research Center Incorporated | Effective system and method for visual document comparison using localized two-dimensional visual fingerprints |
Non-Patent Citations (2)
Title |
---|
"用信息检索和运筹学等技术增强作业反抄袭";龙舜;《2010 Third International Conference on Education Technology and Training (ETT)》;20101128;第377页3.2节 * |
龙舜."用信息检索和运筹学等技术增强作业反抄袭".《2010 Third International Conference on Education Technology and Training (ETT)》.2010, * |
Also Published As
Publication number | Publication date |
---|---|
CN108268884A (en) | 2018-07-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108268884B (en) | Document comparison method and device | |
CN102810097B (en) | Webpage text content extracting method and device | |
CN113076133B (en) | Deep learning-based Java program internal annotation generation method and system | |
CN105654022A (en) | Method and device for extracting structured document information | |
CN103164388B (en) | In a kind of layout files structured message obtain method and device | |
CN111178079B (en) | Triplet extraction method and device | |
CN104866498A (en) | Information processing method and device | |
CN102486769A (en) | Document directory processing method and device | |
US20200210746A1 (en) | Floating form processing based on topological structures of documents | |
US7870503B1 (en) | Technique for analyzing and graphically displaying document order | |
CN108734110A (en) | Text fragment identification control methods based on longest common subsequence and system | |
KR20170004983A (en) | Line segmentation method | |
US10108590B2 (en) | Comparing markup language files | |
CN106919624B (en) | Method and device for improving webpage loading speed | |
CN111061742A (en) | Method and device for marking data and service system thereof | |
CN104598510A (en) | Event trigger word recognition method and device | |
US11520835B2 (en) | Learning system, learning method, and program | |
CN102662953B (en) | With the semantic tagger system and method that input method is integrated | |
CN113139033B (en) | Text processing method, device, equipment and storage medium | |
CN104424214B (en) | A kind of self-defined method and apparatus for extracting directory content | |
CN103136166B (en) | Method and device for font determination | |
CN109710896B (en) | Text attribute difference marking method and device, storage medium and electronic equipment | |
CN106569986A (en) | Character string replacement method and device | |
CN103488616B (en) | A kind of embedded font processing method and device | |
CN113673255B (en) | Text function area splitting method and device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||