CN102495881B - Genetic word-based file processing method and device - Google Patents

Genetic word-based file processing method and device

Info

Publication number
CN102495881B
Authority
CN
China
Prior art keywords
character
source
characters
gene
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201110400253.4A
Other languages
Chinese (zh)
Other versions
CN102495881A (en)
Inventor
郝佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Founder International Co Ltd
Founder International Beijing Co Ltd
Original Assignee
Founder International Co Ltd
Founder International Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Founder International Co Ltd and Founder International Beijing Co Ltd
Priority to CN201110400253.4A
Publication of CN102495881A
Application granted
Publication of CN102495881B
Legal status: Active
Anticipated expiration legal-status Critical

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a genetic word-based file processing method and device. The method comprises the following steps: one or more source characters are extracted from an original file according to a genetic word stock to obtain a source character set, wherein each source character in the source character set has a corresponding genetic word in the genetic word stock; the repetition frequency of each source character in the source character set is calculated, and the source characters are sorted according to the repetition frequency and the character internal code of each source character; the sorted source characters are grouped by a snakelike (serpentine) algorithm according to a preset group number to obtain that number of character groups; and all source characters in one or more of the character groups are replaced by their corresponding genetic words in the genetic word stock to obtain a file with embedded genetic words. With this method, when a file with embedded genetic words is identified, the character information in the file is read more accurately.

Description

Gene word-based document processing method and device
Technical Field
The invention relates to the field of document processing, in particular to a method and a device for processing a document based on gene words.
Background
Electronic document exchange is a technology for transmitting electronic documents among different units through a computer information network. With the development of information technology, particularly Internet technology, organizations and the departments within them can be connected to one another through a local area network or the World Wide Web. At the same time, organizations and departments commonly use text editing software to draft official documents. Electronic document exchange technology builds on this: by standardizing the electronic document format and unifying the transmission workflow and its records, it provides a technology and a system for secure network transmission, so that documents can be sent quickly in electronic form from the issuing unit to the receiving unit over the network without being hand-delivered between units by specially assigned persons, thereby reducing workload and improving working efficiency.
With the continuous development of information technology, the exchange of documents, especially electronic documents, is increasingly frequent. Whether in the management of national affairs by party and administrative institutions or in the daily administration of enterprises and public institutions, documents and official documents are important carriers for transmitting important information and conveying the directives of higher authorities. It is therefore important to strengthen the management of documents, especially electronic ones, and to give them a degree of confidentiality and anti-counterfeiting protection; for the special documents of certain departments, these properties are even more important. In the prior art, most documents have no anti-counterfeiting function, and their origin and authenticity are usually judged from the serial number or official seal they carry. However, a serial number can easily be covered or copied, and current color scanning, copying and printing technologies make an official seal easy to reproduce.
In the prior art, these problems are addressed by encryption and identification, which are generally implemented with text digital watermarking, an important technique in the field of information hiding; image digital watermarking is the more common form. In practice, a large amount of text (such as electronic documents) needs to be kept secret. An electronic document system can restrict the outflow of encrypted electronic text, and it often also restricts conversion to paper, for example by limiting the number of printings. Once a file has been converted into a paper document, however, the system cannot restrict copying of that document and often cannot trace the paper document back to its original source.
Gene words are the set of all characters in a special word stock whose fonts differ slightly from those of the original word stock. These differences are hard to forge and hard to notice, yet gene words can be detected conveniently by a dedicated program. By embedding gene words in a document, a technician can therefore address the problem that the printing or copying of a document converted to paper cannot be limited. However, existing methods of embedding gene words in a document suffer from unbalanced redundancy and low utilization, so the character recognition accuracy is low when the system reads a document embedded with gene words.
At present, no effective solution has been proposed for the problem in the related art that unbalanced redundancy and low utilization in the way gene words are embedded in a document lead to low character recognition accuracy when a system reads the document.
Disclosure of Invention
The main object of the present invention is to provide a method and an apparatus for processing documents based on gene words, so as to solve the problem of low character recognition accuracy when reading a document embedded with gene words.
In order to achieve the above object, according to one aspect of the present invention, there is provided a gene-word-based document processing method including: extracting one or more source characters from an original file according to a gene word stock to obtain a source character set, wherein each source character in the source character set has a corresponding gene word in the gene word stock; calculating the repetition frequency of each source character in the source character set, and sorting the source characters in the source character set according to the repetition frequency and the character internal code of each source character; grouping the source characters in the sorted source character set by a snake-shaped (serpentine) algorithm according to a preset group number to obtain a predetermined number of character groups; and replacing all source characters in one or more of the character groups with their corresponding gene words in the gene word library to obtain the document embedded with the gene words.
Further, calculating the repetition frequency of each source character in the source character set, and sorting the source characters in the source character set according to the repetition frequency and the character internal code of each source character comprises: sorting the source characters in the source character set in order of repetition frequency from high to low to obtain a first sorted set of the source character set; and sorting the source characters with the same repetition frequency in the first sorted set in order of character internal code from large to small or from small to large.
Further, calculating the repetition frequency of each source character in the source character set, and sorting the source characters in the source character set according to the repetition frequency and the character internal code of each source character comprises: sorting the source characters in the source character set in order of repetition frequency from low to high to obtain a first sorted set of the source character set; and sorting the source characters with the same repetition frequency in the first sorted set in order of character internal code from large to small or from small to large.
Further, before grouping the source characters in the sorted source character set by the snake-shaped algorithm according to a preset group number to obtain a predetermined number of character groups, the method further includes: setting embedded information and obtaining its number of bits, wherein the number of bits of the embedded information is the preset group number; and encrypting the embedded information to obtain secure embedded information.
Further, after the source characters in the sorted source character set are grouped according to a preset number of groups according to the snake-shaped algorithm to obtain a predetermined number of character groups, the method further includes: reading character information of all source characters in each group of character groups to obtain corresponding information of each character group, wherein in any group of character groups, when the number of the source characters of which the character information is 0 is larger than the number of the source characters of which the character information is 1, the corresponding information of the character group is 0; when the number of source characters of which the character information is 1 is greater than that of source characters of which the character information is 0, the corresponding information of the character group is 1.
Further, replacing all source characters in one or more of the character groups with their corresponding gene words in the gene word library to obtain a document embedded with gene words includes: when the corresponding information of a character group is 0, replacing all source characters of the character group with their corresponding gene words in the gene word library; when the corresponding information of the character group is 1, performing no replacement on the source characters of the character group.
Further, replacing all source characters in one or more of the character groups with their corresponding gene words in the gene word library to obtain a document embedded with gene words includes: when the corresponding information of a character group is 1, replacing all source characters of the character group with their corresponding gene words in the gene word library; when the corresponding information of the character group is 0, performing no replacement on the source characters of the character group.
In order to achieve the above object, according to another aspect of the present invention, there is provided a gene-word-based document processing apparatus including: an extraction module, configured to extract one or more source characters from an original file according to the gene word stock to obtain a source character set, wherein each source character in the source character set has a corresponding gene word in the gene word stock; a processing module, configured to calculate the repetition frequency of each source character in the source character set and to sort the source characters in the source character set according to the repetition frequency and the character internal code of each source character; a grouping module, configured to group the source characters in the sorted source character set by a snake-shaped (serpentine) algorithm according to a preset group number to obtain a predetermined number of character groups; and a replacing module, configured to replace all source characters in one or more of the character groups with their corresponding gene words in the gene word stock to obtain the document embedded with the gene words.
Further, the processing module includes: a first sorting module, configured to sort the source characters in the source character set in order of repetition frequency from high to low or from low to high to obtain a first sorted set of the source character set; and a second sorting module, configured to sort the source characters with the same repetition frequency in the first sorted set in order of character internal code from large to small or from small to large.
Further, the apparatus further comprises: a setting module, configured to set the embedded information and obtain its number of bits, wherein the number of bits of the embedded information is the preset group number, and to encrypt the embedded information to obtain secure embedded information.
Further, the apparatus further comprises: the reading module is used for reading the character information of all the source characters in each group of character groups to obtain the corresponding information of each character group, wherein in any group of character groups, when the number of the source characters of which the character information is 0 is larger than the number of the source characters of which the character information is 1, the corresponding information of the character group is 0; when the number of source characters of which the character information is 1 is greater than that of source characters of which the character information is 0, the corresponding information of the character group is 1.
Further, the replacement module includes: a first replacement module, configured to replace all source characters of a character group with their corresponding gene words in the gene word library when the corresponding information of the character group is 0, and to perform no replacement on the source characters of the character group when the corresponding information is 1; or a second replacement module, configured to replace all source characters of a character group with their corresponding gene words in the gene word library when the corresponding information of the character group is 1, and to perform no replacement on the source characters of the character group when the corresponding information is 0.
According to the invention, one or more source characters are extracted from an original file according to a gene word stock to obtain a source character set, wherein each source character in the source character set has a corresponding gene word in the gene word stock; the repetition frequency of each source character in the source character set is calculated, and the source characters are sorted according to the repetition frequency and the character internal code of each source character; the source characters in the sorted source character set are grouped by a snake-shaped (serpentine) algorithm according to a preset group number to obtain a predetermined number of character groups; and all source characters in one or more of the character groups are replaced by their corresponding gene words in the gene word library to obtain a document embedded with gene words. Because this way of embedding gene words uses a balanced statistical method, gene words are convenient to reuse. This solves the problem in the related prior art that unbalanced redundancy and low utilization lead to low character recognition accuracy when a system reads a document embedded with gene words, and achieves the effect that character information is read more accurately when such a document is recognized.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a schematic structural diagram of a document processing apparatus based on gene words according to an embodiment of the present invention;
FIG. 2 is a flowchart of a document processing method based on gene words according to an embodiment of the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
FIG. 1 is a schematic structural diagram of a document processing apparatus based on gene words according to an embodiment of the present invention. As shown in FIG. 1, the gene-word-based document processing apparatus includes: an extraction module 10, configured to extract one or more source characters from an original file according to a gene word stock to obtain a source character set, where each source character in the source character set has a corresponding gene word in the gene word stock; a processing module 30, configured to calculate the repetition frequency of each source character in the source character set and to sort the source characters in the source character set according to the repetition frequency and the character internal code of each source character; a grouping module 50, configured to group the source characters in the sorted source character set by a snake-shaped (serpentine) algorithm according to a preset group number, so as to obtain a predetermined number of character groups; and a replacing module 70, configured to replace all source characters in one or more of the character groups with their corresponding gene words in the gene word library, so as to obtain a document in which the gene words are embedded.
According to this embodiment of the application, the processing module and the grouping module sort the extracted source characters that have corresponding gene words and group them in serpentine order; once sorting and grouping are complete, the corresponding source characters in the original file are replaced with gene words from the gene word library. The device uses different combinations and groupings of gene words and their word frequencies to carry a large amount of information, and the balanced statistical technique of serpentine ordering makes gene words convenient to reuse. This solves the problem that unbalanced redundancy and low utilization in the existing way of embedding gene words lead to low character recognition accuracy when a system reads the document, so that character information is read more accurately when a document embedded with gene words is recognized, and it greatly improves robustness when the embedded information is extracted from a document that uses gene words.
With the device, gene words are embedded in the original document according to an established rule. Because the redundancy of each group of gene words embedded in the original document is dynamically balanced, the information is well hidden, and after the original document has been printed or copied many times, a system can still accurately judge whether the encrypted document has exceeded the permitted number of printings or copies.
The processing module in the above embodiment of the present application may include: a first sorting module 301, configured to sort the source characters in the source character set in order of repetition frequency from high to low or from low to high, so as to obtain a first sorted set of the source character set; and a second sorting module 302, configured to sort the source characters with the same repetition frequency in the first sorted set in descending or ascending order of character internal code. The sorting combinations in this embodiment are equivalent in effect; their main purpose is to provide the subsequent grouping process with source characters in a statistically balanced order.
The combination of the processing module and the grouping module in the above embodiment dynamically allocates the number of characters in each group according to the counted word frequencies, so that the redundancy of the characters in each group is dynamically balanced. This benefits information hiding and allows the related gene words to be reused, which greatly improves the utilization of the gene words and the balance of information across groups and helps increase the amount of information that can be embedded.
The apparatus in the above embodiment of the present application may further include: a setting module 80, configured to set the embedded information and obtain its number of bits, where the number of bits of the embedded information is the preset number of groups, and to encrypt the embedded information to obtain secure embedded information. In this embodiment, the number of bits of the embedded information serves as the preset number of groups for grouping characters, and the embedded information may be encrypted for security. For example, when the embedded information is 0110, it may be encrypted so that unauthorized users see 0011 or 1100 or the like instead of 0110, and only legitimate users can recognize the correct embedded information.
Therefore, the above embodiment can dynamically allocate the number of the source characters in each group according to the encoding length of the embedded information and the word frequency obtained by statistics, that is, dynamically balance the redundancy in each group of the source characters, facilitate information hiding, and can reuse the related gene words.
The apparatus in the above embodiment may further include: a reading module 90, configured to read character information of all source characters in each group of character groups to obtain corresponding information of each character group, where in any group of character groups, when the number of source characters whose character information is 0 is greater than the number of source characters whose character information is 1, the corresponding information of the character group is 0; when the number of source characters of which the character information is 1 is greater than that of source characters of which the character information is 0, the corresponding information of the character group is 1.
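The majority rule applied by the reading module can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `group_bit` is hypothetical, and the tie case (equal counts of 0 and 1) is not specified in the text, so the sketch returns `None` there as an assumption.

```python
def group_bit(char_bits):
    """Return the bit carried by one character group by majority vote.

    char_bits: list of 0/1 values read from the source characters of the group.
    Returns 0 or 1, or None on a tie (the tie case is not specified in the
    source text; returning None is an assumption of this sketch).
    """
    zeros = char_bits.count(0)
    ones = char_bits.count(1)
    if zeros > ones:
        return 0
    if ones > zeros:
        return 1
    return None


# One misread character does not change the group's information.
print(group_bit([0, 0, 1]))
print(group_bit([1, 1, 1, 0]))
```

Because each group carries one bit redundantly, a few misrecognized characters in a group can be outvoted, which is why balanced group sizes matter for robustness.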
The replacement module in the above embodiment of the present application may include: a first replacement module, configured to replace all source characters of a character group with their corresponding gene words in the gene word library when the corresponding information of the character group is 0, and to perform no replacement on the source characters of the character group when the corresponding information is 1; or a second replacement module, configured to replace all source characters of a character group with their corresponding gene words in the gene word library when the corresponding information of the character group is 1, and to perform no replacement on the source characters of the character group when the corresponding information is 0.
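A minimal sketch of the two replacement variants follows. Gene words differ from ordinary characters only in glyph shape, which plain text cannot express, so the sketch models a gene word as a marked string; the function `embed`, the `replace_on` parameter, and the sample library are all hypothetical.

```python
def embed(text, groups, bits, gene_library, replace_on=0):
    """Replace source characters whose group carries the bit `replace_on`.

    text:         the original file as a string
    groups:       list of character groups (each a list of source characters)
    bits:         one embedded bit per group
    gene_library: maps a source character to its gene word (modelled as a string)
    replace_on:   0 for the first replacement variant, 1 for the second
    Returns a list of glyph tokens, one per character of `text`.
    """
    to_replace = set()
    for group, bit in zip(groups, bits):
        if bit == replace_on:
            to_replace.update(group)
    return [gene_library[ch] if ch in to_replace else ch for ch in text]


# Hypothetical two-character library; "a*" stands for the gene glyph of "a".
library = {"a": "a*", "b": "b*"}
print(embed("abca", [["a"], ["b"]], [0, 1], library))                # first variant
print(embed("abca", [["a"], ["b"]], [0, 1], library, replace_on=1))  # second variant
```

Every occurrence of a character in a replaced group is swapped, so a high-frequency character contributes several redundant copies of its group's bit.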
FIG. 2 is a flowchart of a document processing method based on gene words according to an embodiment of the present invention, the method including the steps of:
Step S102: the extraction module in FIG. 1 extracts one or more source characters from the original file according to the gene word stock to obtain a source character set, where each source character in the source character set has a corresponding gene word in the gene word stock.
Step S104: the processing module in FIG. 1 calculates the repetition frequency of each source character in the source character set and sorts the source characters in the source character set according to the repetition frequency and the character internal code of each source character.
Step S106: the grouping module in FIG. 1 groups the source characters in the sorted source character set by a serpentine algorithm according to a preset group number, so as to obtain a predetermined number of character groups.
Step S108: the replacement module in FIG. 1 replaces all source characters in one or more of the character groups with their corresponding gene words in the gene word library, so as to obtain the document embedded with the gene words.
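The extraction in step S102 and the frequency count at the start of step S104 can be sketched as below. This is an illustration under stated assumptions: the gene word library is modelled as a dictionary keyed by source character, and the function names, sample library, and sample text are hypothetical.

```python
from collections import Counter


def extract_source_chars(text, gene_library):
    """Return, in document order, every character of `text` that has a gene word."""
    return [ch for ch in text if ch in gene_library]


def repetition_frequency(source_chars):
    """Count how many times each source character occurs."""
    return Counter(source_chars)


# Hypothetical library covering only two characters.
gene_library = {"山": "gene:山", "主": "gene:主"}
text = "山主 and 山主"

chars = extract_source_chars(text, gene_library)
freq = repetition_frequency(chars)
print(chars)
print(dict(freq))
```

Characters without a gene word (here the Latin letters and spaces) are simply skipped and never take part in sorting or grouping.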
In the above embodiment of the present application, after the extracted source characters that have corresponding gene words have been sorted and grouped in serpentine order, the corresponding source characters in the original file are replaced with gene words from the gene word library. The method uses different combinations and groupings of gene words and their word frequencies to carry a large amount of information, and the balanced statistical technique of serpentine ordering makes gene words convenient to reuse. This solves the problem that unbalanced redundancy and low utilization in the conventional way of embedding gene words lead to low character recognition accuracy when a system reads the document, so that character information is read more accurately when a document embedded with gene words is recognized, and it greatly improves robustness when the embedded information is extracted from a document that uses gene words.
In this way, gene words are embedded in the original document according to an established rule. Because the redundancy of each group of gene words embedded in the original document is dynamically balanced, the information is well hidden, and after the original document has been printed or copied many times, a system can still accurately judge whether the encrypted document has exceeded the permitted number of printings or copies.
In step S104 of the foregoing embodiment, calculating the repetition frequency of each source character in the source character set and sorting the source characters according to the repetition frequency and the character internal code of each source character may be implemented as follows: the source characters in the source character set are sorted in order of repetition frequency from high to low to obtain a first sorted set of the source character set; the source characters with the same repetition frequency in the first sorted set are then sorted in order of character internal code from large to small or from small to large. Alternatively, step S104 may be implemented as follows: the source characters in the source character set are sorted in order of repetition frequency from low to high to obtain a first sorted set of the source character set; the source characters with the same repetition frequency in the first sorted set are then sorted in order of character internal code from large to small or from small to large. After counting the number (including repetitions) and the word frequency of the source characters in the original file that have corresponding gene words, this embodiment sorts all source characters in the source character set by word frequency, falling back to the character internal code when word frequencies are equal, which guarantees that the character ordering is unique.
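The two-level sort can be sketched with a single composite key. The sketch assumes the character internal code is the Unicode code point returned by `ord`; the function name and its two flags are hypothetical and simply cover the four orderings the text allows.

```python
from collections import Counter


def sort_source_chars(source_chars, freq_descending=True, code_ascending=True):
    """Sort distinct source characters by repetition frequency, then internal code.

    The internal code is taken to be the Unicode code point (an assumption).
    Ties on frequency are broken by internal code, so the order is unique.
    """
    freq = Counter(source_chars)

    def key(ch):
        f = -freq[ch] if freq_descending else freq[ch]
        c = ord(ch) if code_ascending else -ord(ch)
        return (f, c)

    return sorted(freq, key=key)


# 山 (0x5c71) and 主 (0x4e3b) both occur twice; "a" and "b" occur once.
print(sort_source_chars(list("山主山主ab")))
```

Because no two distinct characters share an internal code, the composite key never ties, so embedding and extraction can reproduce the same ordering independently.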
Specifically, the method is implemented as follows: first, the source characters in the original file that correspond to gene words in the gene word library are extracted, and then the word frequency of each extracted source character is counted. The character information of each source character can be represented by binary bits (0 or 1): when a source character occurs n times as a gene word, it can be characterized as n bits, each 0 or 1.
For example, suppose the original file contains a passage of text (rendered in translation as "the mountain owner has a great skill in development"). Each character of the passage is treated as a candidate source character, and the characters that have corresponding gene words in the gene word library are found by comparison and lookup, giving the source character set (rendered in translation as "the mountain owner develops a long skill"). Each source character in the source character set has a corresponding gene word in the gene word library; that is, the passage contains 7 distinct corresponding gene words (not counting repetitions).
According to statistics, the following results are obtained:
[Figure BDA0000116762460000061: table of the word-frequency statistics for each source character]
After the word frequency of the gene word corresponding to each source character has been obtained by counting, the internal code of each source character is also obtained. Suppose the characters are sorted by repetition frequency from high to low and, within equal frequency, by character internal code from small to large. For example, 山 ("mountain") and 主 ("main") both have a word frequency of 2; the internal code of 山 is 0x5c71 and that of 主 is 0x4e3b, so 主 is placed before 山, and the remaining characters are ordered likewise. The sorted character sequence is then (rendered in translation) "the main mountain develops the technical growth", with word frequencies 2, 2, 1, 1, 1, 1, 1 in turn.
Based on the above embodiment, before step S106 groups the sorted source characters into the preset number of groups according to the serpentine algorithm, the method may further include: setting the embedded information and obtaining its number of bits, the number of bits of the embedded information being the preset number of groups; and encrypting the embedded information to obtain secure embedded information. Specifically, taking the same passage of the original file as an example, the embedded information can be set as required, for example to 0110, which has 4 bits in total; the sorted source characters are therefore distributed into 4 groups. In addition, the embedded information may be encrypted for security. For example, when the embedded information is 0110, it may be encrypted so that unauthorized users see 0011 or 1100, etc., instead of 0110, and only authorized users can recognize the correct embedded information.
Once the number of groups determined by the embedded information is known, step S106 allocates each source character in the sorted source character set to a character group according to the number of bits of the embedded information and the serpentine algorithm. Specifically, source characters are taken from the sorted set in order, one for each group up to the length of the embedded information, and allocated to the character groups following the serpentine principle until all source characters have been allocated. This mitigates the problem of uneven word frequencies during allocation, dynamically balances the amount of redundancy in each group, and allows the relevant characters to be reused; that is, the number of source characters represented by 0 or 1 is roughly equal across the groups. Otherwise, some character groups would contain many or few source characters represented by 0 or 1, or a group might contain no occurrence of the 0 or 1 characterizing its source characters, and the information of that character group would be wrong. As a result, after gene words replace the source characters in the character groups, the utilization of the gene words is increased and the robustness of subsequent identification or detection of the file is greatly improved.
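As an illustration of how the embedded information fixes the number of groups and how it might be masked, consider the sketch below. The text does not specify the encryption method; the XOR mask used here is purely an assumed stand-in (it is not real cryptography), chosen because it happens to map 0110 to 0011 or 1100 with suitable keys. The function names and keys are hypothetical.

```python
def group_count(embedded_bits):
    """The preset number of groups equals the number of bits to embed."""
    if not embedded_bits or set(embedded_bits) - {"0", "1"}:
        raise ValueError("embedded information must be a non-empty bit string")
    return len(embedded_bits)


def xor_mask(bits, key):
    """Toy masking step (an assumption): XOR each bit with a repeating key.
    Applying the same key again restores the original bits."""
    return "".join(
        str(int(b) ^ int(key[i % len(key)])) for i, b in enumerate(bits)
    )


n_groups = group_count("0110")
masked_a = xor_mask("0110", "0101")
masked_b = xor_mask("0110", "1010")
restored = xor_mask(masked_a, "0101")
```

A real deployment would replace `xor_mask` with a vetted cipher; the only property the grouping step relies on is that the bit length, and hence the group count, is known.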
Specifically, taking the same passage of the original file as an example, the characters can be divided into four groups according to the length of the embedded information 0110, as follows:
First group: master and slave
Second group: mountain length
Third group: take place of
Fourth group: exercise machine
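The serpentine allocation can be sketched as a back-and-forth pass over the groups. This is one plausible reading of the serpentine principle, not a confirmed implementation: the function name is hypothetical, and the seven characters are the same assumed reconstruction of the sorted sequence (only "主" and "山" are confirmed by their internal codes).

```python
def serpentine_groups(ordered_chars, n_groups):
    """Deal sorted characters into groups 1..n, then n..1, and so on."""
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    groups = [[] for _ in range(n_groups)]
    for i, ch in enumerate(ordered_chars):
        round_no, pos = divmod(i, n_groups)
        # Even rounds run left to right, odd rounds right to left.
        index = pos if round_no % 2 == 0 else n_groups - 1 - pos
        groups[index].append(ch)
    return groups


# Assumed sorted sequence with word frequencies 2, 2, 1, 1, 1, 1, 1.
ordered = list("主山发展技术长")
frequencies = {"主": 2, "山": 2, "发": 1, "展": 1, "技": 1, "术": 1, "长": 1}
groups = serpentine_groups(ordered, 4)
group_totals = [sum(frequencies[ch] for ch in g) for g in groups]
```

Under these assumptions the second group receives "山" and "长", which is consistent with the "mountain length" entry above, and the per-group frequency totals come out as 2, 3, 2, 2, illustrating the balancing effect of reversing direction on each pass.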
After the source characters in each group have been replaced by the corresponding gene words in the gene word library, when the system identifies the document, that is, detects the gene words, it rarely happens that the 0 or 1 characterizing the source characters of a character group is corrupted because the original file has been damaged (that is, a gene word that should be 0 is read as 1, or a gene word that should be 1 is read as 0). The accuracy of gene word detection is thereby improved.
Based on the above embodiment, after step S106 groups the sorted source characters into the preset number of groups according to the serpentine algorithm, the method may further include: reading the character information of all source characters in each character group to obtain the corresponding information of that character group. In any character group, when the source characters whose character information is 0 outnumber those whose character information is 1, the corresponding information of the character group is 0; when the source characters whose character information is 1 outnumber those whose character information is 0, the corresponding information of the character group is 1. In this embodiment, since each character group consists of several bits, each 0 or 1, whether the group is characterized by 0 or by 1 can be determined from the numbers of 0s and 1s in the group. For example, if more source characters in a character group are characterized by "0" than by "1", the character group is characterized by "0".
In the above embodiment of the present application, step S108 of replacing all source characters in one or more character groups with their corresponding gene words in the gene word library to obtain the document with embedded gene words may be implemented in one of two ways. In the first, when the corresponding information of a character group is 0, all source characters of the character group are replaced with their corresponding gene words in the gene word library; when the corresponding information of the character group is 1, no replacement is performed on the source characters of the character group. In the second, when the corresponding information of a character group is 1, all source characters of the character group are replaced with their corresponding gene words in the gene word library; when the corresponding information of the character group is 0, no replacement is performed on the source characters of the character group. In a specific implementation, the system may define the meaning of 0 and 1 as required: it may define that all source characters in a character group characterized by "0" are replaced by gene words while those in a character group characterized by "1" are not, or conversely that all source characters in a character group characterized by "1" are replaced by gene words.
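The majority rule for a character group can be sketched as below. The function name is hypothetical, and the handling of a tie is an assumption: the text only defines the two strict-majority cases, so the sketch returns `None` when the counts are equal rather than guessing.

```python
def group_bit(char_bits):
    """Return the corresponding information of one character group.

    `char_bits` is the list of 0/1 values read from the source characters
    of the group. A tie is not defined by the text, so None is returned.
    """
    zeros = char_bits.count(0)
    ones = char_bits.count(1)
    if zeros > ones:
        return 0
    if ones > zeros:
        return 1
    return None  # undefined case, left to the caller


majority_zero = group_bit([0, 0, 1])
majority_one = group_bit([1, 0, 1, 1])
tie = group_bit([0, 1])
```

Because a single misread character is outvoted by the others in its group, this rule is what lets a balanced grouping tolerate a limited amount of damage to the file.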
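The replacement step can be sketched as follows, with a `replace_on` parameter selecting between the two conventions (replace the groups whose bit is 0, or those whose bit is 1). Everything here is illustrative: the function name, the Private Use Area placeholders used as gene words, the grouping, and the sample text are assumptions, and the sketch takes the bit of each group directly from the embedded information.

```python
def embed_gene_words(text, groups, embedded_bits, library, replace_on="0"):
    """Replace the source characters of every group whose bit equals
    `replace_on` with their gene words; leave all other characters as-is."""
    if len(groups) != len(embedded_bits):
        raise ValueError("one embedded bit is needed per character group")
    to_replace = set()
    for group, bit in zip(groups, embedded_bits):
        if bit == replace_on:
            to_replace.update(group)
    return "".join(library[ch] if ch in to_replace else ch for ch in text)


# Hypothetical data: placeholder gene words and an assumed grouping.
library = {ch: chr(0xE000 + i) for i, ch in enumerate("主山发展技术长")}
groups = [["主"], ["山", "长"], ["发", "术"], ["展", "技"]]
text = "山主发展技术长山主"

embedded_on_one = embed_gene_words(text, groups, "0110", library, replace_on="1")
embedded_on_zero = embed_gene_words(text, groups, "0110", library, replace_on="0")
```

With the embedded information 0110 and `replace_on="1"`, only the second and third groups are replaced; switching to `replace_on="0"` replaces the first and fourth groups instead, so the two conventions produce complementary documents.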
The key points of the invention are that the embedding and replacing are simple, the speed is high, the realization is easy, the gene word utilization rate is high, the redundancy is relatively balanced, the information hiding performance is good, and the information embedding amount is relatively large.
As can be seen from the above, the method embodiment of the present application embeds the corresponding gene words from the gene word table into an existing original file, with an allocation technique based on the serpentine algorithm as the key to the embedding process. The source character set whose characters are to be replaced by gene words is extracted from the source file, and the source characters in the set are then sorted and grouped, which balances source characters with different word frequencies. To ensure that the validity and integrity of the original file and the gene word set table can be verified, the embedded information and its number of bits need to be set.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order different from that presented here.
From the above description, it can be seen that the present invention achieves the following technical effects: the method for embedding the gene words in the document is simple in embedding and replacing, high in speed, easy to realize, high in gene word utilization rate, relatively balanced in redundancy, good in information hiding performance and relatively large in information embedding amount.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and they may alternatively be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, or fabricated separately as individual integrated circuit modules, or fabricated as a single integrated circuit module from multiple modules or steps. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (12)

1. A method for processing a document based on gene words, comprising:
extracting one or more source characters from an original file according to a gene word library to obtain a source character set, wherein the source characters in the source character set have corresponding gene words in the gene word library;
calculating the repetition frequency of each source character in the source character set, and sorting the source characters in the source character set according to the repetition frequency and the character internal code of each source character;
grouping the sorted source characters in the source character set into a preset number of groups according to a serpentine algorithm to obtain the preset number of character groups;
replacing all source characters in one or more of the character groups with their corresponding gene words in the gene word library to obtain a document embedded with the gene words,
wherein the gene words are the set of all characters in a special word library.
2. The method of claim 1, wherein calculating the repetition frequency of each source character in the source character set and sorting the source characters in the source character set according to the repetition frequency and the character internal code of each source character comprises:
sorting the source characters in the source character set in order of repetition frequency from high to low to obtain a first sorted set of the source character set;
and sorting the source characters with the same repetition frequency in the first sorted set in order of character internal code from large to small or from small to large.
3. The method of claim 1, wherein calculating the repetition frequency of each source character in the source character set and sorting the source characters in the source character set according to the repetition frequency and the character internal code of each source character comprises:
sorting the source characters in the source character set in order of repetition frequency from low to high to obtain a first sorted set of the source character set;
and sorting the source characters with the same repetition frequency in the first sorted set in order of character internal code from large to small or from small to large.
4. The method of any of claims 1-3, wherein before grouping the source characters in the sorted set of source characters by a predetermined number of groups according to a serpentine algorithm to obtain a predetermined number of character groups, the method further comprises:
setting embedded information to obtain the number of digits of the embedded information, wherein the number of digits of the embedded information is the preset number of groups;
and encrypting the embedded information to obtain secure embedded information.
5. The method of any one of claims 1-3, wherein after grouping the source characters in the sorted set of source characters by a predetermined number of groups according to a serpentine algorithm to obtain a predetermined number of character groups, the method further comprises:
reading the character information of all the source characters in each character group to obtain the corresponding information of each character group, wherein,
in any group of character groups, when the number of source characters of which the character information is 0 is greater than that of source characters of which the character information is 1, the corresponding information of the character group is 0;
when the number of source characters of which the character information is 1 is greater than that of source characters of which the character information is 0, the corresponding information of the character group is 1,
wherein the character information of each of the source characters is represented by a binary bit of 0 or 1.
6. The method of claim 5, wherein replacing all source characters in one or more character groups with their corresponding gene words in the gene word library to obtain a document embedded with the gene words comprises:
when the corresponding information of the character group is 0, replacing all source characters of the character group with their corresponding gene words in the gene word library;
and when the corresponding information of the character group is 1, not executing the replacement operation on the source characters of the character group.
7. The method of claim 5, wherein replacing all source characters in one or more character groups with their corresponding gene words in the gene word library to obtain a document embedded with the gene words comprises:
when the corresponding information of the character group is 1, replacing all source characters of the character group with their corresponding gene words in the gene word library;
and when the corresponding information of the character group is 0, not executing the replacement operation on the source characters of the character group.
8. A gene-word based document processing apparatus, comprising:
the extraction module is used for extracting one or more source characters from an original file according to a gene word library to obtain a source character set, wherein the source characters in the source character set have corresponding gene words in the gene word library;
the processing module is used for calculating the repetition frequency of each source character in the source character set and sorting the source characters in the source character set according to the repetition frequency and the character internal code of each source character;
the grouping module is used for grouping the sorted source characters in the source character set into a preset number of groups according to a serpentine algorithm to obtain the preset number of character groups;
and the replacing module is used for replacing all source characters in one or more of the character groups with their corresponding gene words in the gene word library to obtain the document embedded with the gene words, wherein the gene words are the set of all characters in a special word library.
9. The apparatus of claim 8, wherein the processing module comprises:
the first sorting module is used for sorting the source characters in the source character set in order of repetition frequency from high to low or from low to high to obtain a first sorted set of the source character set;
and the second sorting module is used for sorting the source characters with the same repetition frequency in the first sorted set in order of character internal code from large to small or from small to large.
10. The apparatus of claim 8 or 9, further comprising:
and the setting module is used for setting the embedded information to obtain the number of bits of the embedded information, wherein the number of bits of the embedded information is the preset number of groups, and for encrypting the embedded information to obtain secure embedded information.
11. The apparatus of claim 8 or 9, further comprising:
the reading module is used for reading the character information of all the source characters in each group of character groups to obtain the corresponding information of each character group, wherein in any group of character groups, when the number of the source characters of which the character information is 0 is larger than the number of the source characters of which the character information is 1, the corresponding information of the character group is 0; when the number of the source characters of which the character information is 1 is greater than that of the source characters of which the character information is 0, the corresponding information of the character group is 1, wherein the character information of each source character is represented by a binary bit of 0 or 1.
12. The apparatus of claim 11, wherein the replacement module comprises:
the first replacement module is used for replacing all source characters of the character group with their corresponding gene words in the gene word library when the corresponding information of the character group is 0, and for not executing the replacement operation on the source characters of the character group when the corresponding information of the character group is 1; or,
the second replacement module is used for replacing all source characters of the character group with their corresponding gene words in the gene word library when the corresponding information of the character group is 1, and for not executing the replacement operation on the source characters of the character group when the corresponding information of the character group is 0.
CN201110400253.4A 2011-12-06 2011-12-06 Genetic word-based file processing method and device Active CN102495881B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110400253.4A CN102495881B (en) 2011-12-06 2011-12-06 Genetic word-based file processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110400253.4A CN102495881B (en) 2011-12-06 2011-12-06 Genetic word-based file processing method and device

Publications (2)

Publication Number Publication Date
CN102495881A CN102495881A (en) 2012-06-13
CN102495881B true CN102495881B (en) 2014-06-25

Family

ID=46187706

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110400253.4A Active CN102495881B (en) 2011-12-06 2011-12-06 Genetic word-based file processing method and device

Country Status (1)

Country Link
CN (1) CN102495881B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103199548B (en) * 2013-04-02 2015-05-13 国家电网公司 Capacitor grouping balancing system and capacitor grouping balancing method
CN107169722A (en) * 2017-03-23 2017-09-15 高泽 A kind of complete intelligent tracing management system of official document operating and method
CN117891787B (en) * 2024-03-15 2024-05-28 武汉磐电科技股份有限公司 Current transformer quantity value tracing data processing method, system and equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1740943A (en) * 2004-08-27 2006-03-01 北京北大方正电子有限公司 A document encryption method
WO2007062554A1 (en) * 2005-12-01 2007-06-07 Peking University Founder Group Co. Ltd A method and device for embedding digital watermark into a text document and detecting it

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4181577B2 (en) * 2005-12-22 2008-11-19 インターナショナル・ビジネス・マシーンズ・コーポレーション Character string processing method, apparatus, and program

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant