CN102495881B - Genetic word-based file processing method and device

Info

Publication number: CN102495881B
Application number: CN201110400253.4A
Authority: CN
Inventors: 郝佳
Original assignee: Founder International Co Ltd; Founder International Beijing Co Ltd
Current assignee: Founder International Co Ltd; Founder International Beijing Co Ltd
Priority date: 2011-12-06
Filing date: 2011-12-06
Publication date: 2014-06-25
Anticipated expiration: 2031-12-06
Also published as: CN102495881A

Abstract

The invention discloses a genetic word-based file processing method and device. The method comprises the following steps: extracting one or more source characters from an original file according to a genetic word stock to obtain a source character set, wherein the source characters in the source character set have corresponding genetic words in the genetic word stock; calculating the repetition frequency of each source character in the source character set, and sorting the source characters in the source character set according to the repetition frequency and the character internal code of each source character; grouping the sorted source characters by a snakelike algorithm according to a preset number of groups to obtain the preset number of character groups; and replacing all source characters in one or more of the character groups with the corresponding genetic words in the genetic word stock to obtain a file with embedded genetic words. With this method, when a file with embedded genetic words is identified, the character information in the file is read more accurately.

Description

Gene word-based document processing method and device

Technical Field

The invention relates to the field of document processing, in particular to a method and a device for processing a document based on gene words.

Background

The exchange technology of electronic documents or files is a technology for transmitting electronic documents among different units through a computer information network. With the development of information technology, particularly internet technology, units or departments within units can be connected to each other through a local area network or a world wide web. Meanwhile, computer text editing software is also commonly adopted by various units or departments to draft official documents or documents. The electronic document or document exchange technology is based on the above, and provides a technology and a system of a network safety transmission means by standardizing an electronic document format and uniformly transmitting flow and records, so that documents can be quickly transmitted to a receiving unit from a publishing unit through a network in an electronic form without delivering among all units by specially-assigned persons, thereby reducing the workload and improving the working efficiency.

With the continuous development of information technology, the exchange of documents, especially electronic documents, is increasingly frequent. Whether in the management of national affairs by party and administrative institutions or in the daily administration of enterprises and public institutions, documents and official documents are important carriers for transmitting important information and implementing the directives of higher authorities. It is therefore important to strengthen the management of documents and official documents, especially electronic ones, and to give them a degree of confidentiality and anti-counterfeiting protection; for special documents of certain departments this matters even more. In the prior art, most documents and official documents have no anti-counterfeiting function, and their origin and authenticity are usually judged from the serial numbers or official seals they bear. However, a serial number can easily be obscured or copied, and current color scanning, copying and printing technologies make an official seal easy to reproduce.

In the prior art, these problems are addressed by encryption and identification, which is generally realized with text digital watermarking, an important technique in the field of information hiding, although image digital watermarking is more common. In practice, a large volume of text (such as electronic documents) needs to be kept secret. An electronic document system can restrict the outflow of encrypted electronic text, and it often restricts conversion to paper by limiting the number of printings and the like. Once a file has been converted into a paper document, however, the system cannot restrict copying of that document and often cannot trace the original source of the paper copy.

Gene words are the set of all characters in a special word stock. The glyph of a gene word differs only slightly from that of the original word stock, so gene words are difficult to forge and difficult to notice, yet they can be conveniently detected by a dedicated program. A technician can therefore embed gene words in a document to solve the problem that the number of printings or copies of a document converted to paper cannot be limited. However, existing methods of embedding gene words suffer from unbalanced redundancy and low utilization of the gene words, so the accuracy of character recognition is low when a system reads a document in which gene words are embedded.

At present, no effective solution has been proposed for the problem that, in the related art, unbalanced redundancy and low utilization in the way gene words are embedded in a document cause low character recognition accuracy when a system reads the document embedded with gene words.

Disclosure of Invention

The main object of the present invention is to provide a method and an apparatus for processing documents based on gene words that solve the problem of low character recognition accuracy when reading a document in which gene words are embedded.

In order to achieve the above object, according to one aspect of the present invention, there is provided a gene-word based document processing method including: extracting one or more source characters from an original file according to a gene word stock to obtain a source character set, wherein the source characters in the source character set have corresponding gene words in the gene word stock; calculating the repetition frequency of each source character in the source character set, and sorting the source characters in the source character set according to the repetition frequency and the internal code of each source character; grouping the source characters in the sorted source character set by a serpentine algorithm according to a preset number of groups to obtain the preset number of character groups; and replacing all source characters in one or more of the character groups with the corresponding gene words in the gene word stock to obtain a document embedded with gene words.

Further, calculating the repetition frequency of each source character in the source character set, and sorting the source characters in the source character set according to the repetition frequency and the internal code of each source character comprises: sorting the source characters in the source character set in order of repetition frequency from high to low to obtain a first sorted set of the source character set; and sorting the source characters with the same repetition frequency in the first sorted set in order of internal code from large to small or from small to large.

Further, calculating the repetition frequency of each source character in the source character set, and sorting the source characters in the source character set according to the repetition frequency and the internal code of each source character comprises: sorting the source characters in the source character set in order of repetition frequency from low to high to obtain a first sorted set of the source character set; and sorting the source characters with the same repetition frequency in the first sorted set in order of internal code from large to small or from small to large.

Further, before grouping the source characters in the sorted source character set by the serpentine algorithm according to the preset number of groups to obtain the predetermined number of character groups, the method further includes: setting embedded information to obtain the number of bits of the embedded information, wherein the number of bits of the embedded information is the preset number of groups; and encrypting the embedded information to obtain secure embedded information.

Further, after the source characters in the sorted source character set are grouped according to a preset number of groups according to the snake-shaped algorithm to obtain a predetermined number of character groups, the method further includes: reading character information of all source characters in each group of character groups to obtain corresponding information of each character group, wherein in any group of character groups, when the number of the source characters of which the character information is 0 is larger than the number of the source characters of which the character information is 1, the corresponding information of the character group is 0; when the number of source characters of which the character information is 1 is greater than that of source characters of which the character information is 0, the corresponding information of the character group is 1.

Further, replacing all source characters in one or more of the character groups with their corresponding gene words in the gene word stock to obtain a document embedded with gene words includes: when the corresponding information of a character group is 0, replacing all source characters of the character group with the corresponding gene words in the gene word stock; and when the corresponding information of the character group is 1, performing no replacement on the source characters of the character group.

Further, replacing all source characters in one or more of the character groups with their corresponding gene words in the gene word stock to obtain a document embedded with gene words includes: when the corresponding information of a character group is 1, replacing all source characters of the character group with the corresponding gene words in the gene word stock; and when the corresponding information of the character group is 0, performing no replacement on the source characters of the character group.

In order to achieve the above object, according to another aspect of the present invention, there is provided a gene-word based document processing apparatus including: an extraction module, configured to extract one or more source characters from an original file according to a gene word stock to obtain a source character set, wherein the source characters in the source character set have corresponding gene words in the gene word stock; a processing module, configured to calculate the repetition frequency of each source character in the source character set and to sort the source characters in the source character set according to the repetition frequency and the internal code of each source character; a grouping module, configured to group the source characters in the sorted source character set by a serpentine algorithm according to a preset number of groups to obtain the preset number of character groups; and a replacing module, configured to replace all source characters in one or more of the character groups with the corresponding gene words in the gene word stock to obtain a document embedded with gene words.

Further, the processing module includes: a first sorting module, configured to sort the source characters in the source character set in order of repetition frequency from high to low or from low to high to obtain a first sorted set of the source character set; and a second sorting module, configured to sort the source characters with the same repetition frequency in the first sorted set in order of internal code from large to small or from small to large.

Further, the apparatus further comprises: a setting module, configured to set the embedded information to obtain the number of bits of the embedded information, wherein the number of bits of the embedded information is the preset number of groups, and to encrypt the embedded information to obtain secure embedded information.

Further, the apparatus further comprises: the reading module is used for reading the character information of all the source characters in each group of character groups to obtain the corresponding information of each character group, wherein in any group of character groups, when the number of the source characters of which the character information is 0 is larger than the number of the source characters of which the character information is 1, the corresponding information of the character group is 0; when the number of source characters of which the character information is 1 is greater than that of source characters of which the character information is 0, the corresponding information of the character group is 1.

Further, the replacement module includes: a first replacement module, configured to replace all source characters of a character group with the corresponding gene words in the gene word stock when the corresponding information of the character group is 0, and to perform no replacement on the source characters of the character group when the corresponding information is 1; or a second replacement module, configured to replace all source characters of a character group with the corresponding gene words in the gene word stock when the corresponding information of the character group is 1, and to perform no replacement on the source characters of the character group when the corresponding information is 0.

According to the method, one or more source characters are extracted from an original file according to a gene word stock to obtain a source character set, wherein the source characters in the source character set have corresponding gene words in the gene word stock; the repetition frequency of each source character in the source character set is calculated, and the source characters are sorted according to the repetition frequency and the internal code of each source character; the sorted source characters are grouped by a serpentine algorithm according to a preset number of groups to obtain the preset number of character groups; and all source characters in one or more of the character groups are replaced by their corresponding gene words in the gene word stock to obtain a document embedded with gene words. Because this way of embedding gene words uses a balanced statistical method, the gene words can be conveniently reused. This solves the problem in the related art that unbalanced redundancy and low utilization in the embedding method cause low character recognition accuracy when a system reads a document embedded with gene words, and it achieves the effect that character information is read more accurately when such a document is recognized.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a schematic structural diagram of a document processing apparatus based on gene words according to an embodiment of the present invention;

FIG. 2 is a flowchart of a document processing method based on gene words according to an embodiment of the present invention.

Detailed Description

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.

FIG. 1 is a schematic structural diagram of a document processing apparatus based on gene words according to an embodiment of the present invention. As shown in FIG. 1, the gene-word based document processing apparatus includes: an extraction module 10, configured to extract one or more source characters from an original file according to a gene word stock to obtain a source character set, where the source characters in the source character set have corresponding gene words in the gene word stock; a processing module 30, configured to calculate the repetition frequency of each source character in the source character set and to sort the source characters in the source character set according to the repetition frequency and the internal code of each source character; a grouping module 50, configured to group the source characters in the sorted source character set by a serpentine algorithm according to a preset number of groups, so as to obtain the preset number of character groups; and a replacing module 70, configured to replace all source characters in one or more of the character groups with the corresponding gene words in the gene word stock, so as to obtain a document in which the gene words are embedded.

According to the embodiment of the application, the processing module and the grouping module sort the extracted source characters that have corresponding gene words and group them in serpentine order; once sorting and grouping are complete, the corresponding source characters in the original file are replaced by the gene words in the gene word stock. The device uses different combinations and groupings of the gene words and their word frequencies to carry a large amount of information, and the balanced statistical technique of serpentine ordering makes the gene words convenient to reuse. This solves the problem that unbalanced redundancy and low utilization in the existing way of embedding gene words cause low character recognition accuracy when a system reads a document embedded with gene words. As a result, character information is read more accurately when such a document is recognized, and the robustness of extracting the embedded information from a document that uses gene words is greatly improved.

With this device, gene words are embedded in the original document according to an established rule. Because the redundancy of each group of gene words embedded in the original document is dynamically balanced, the information is well hidden, and after the original document has been printed or copied many times a system can still accurately judge whether the encrypted document has exceeded the allowed number of printings or copies.

The processing module in the above embodiment of the present application may include: a first sorting module 301, configured to sort the source characters in the source character set in order of repetition frequency from high to low or from low to high, so as to obtain a first sorted set of the source character set; and a second sorting module 302, configured to sort the source characters with the same repetition frequency in the first sorted set in descending or ascending order of internal code. The sorting combinations in this embodiment are equivalent in effect; their main purpose is to provide the subsequent grouping process with source characters in a statistically balanced order.

The combination of the processing module and the grouping module in the above embodiment dynamically allocates the number of characters in each group according to the counted word frequency, so as to dynamically balance the redundancy of the characters in each group. This benefits information hiding and allows the related gene words to be reused, which greatly improves the utilization of the gene words and the balance of information across the groups and helps increase the amount of information that can be embedded.

The apparatus in the above embodiment of the present application may further include: a setting module 80, configured to set the embedded information to obtain the number of bits of the embedded information, where the number of bits of the embedded information is the preset number of groups, and to encrypt the embedded information to obtain secure embedded information. In this embodiment, the number of bits of the embedded information gives the preset number of groups used for grouping the characters, and the embedded information may be encrypted for security. For example, when the embedded information is 0110, it may be encrypted so that what non-legitimate users see is 0011 or 1100, etc., instead of 0110, and only legitimate users can recognize the correct embedded information.

Therefore, the above embodiment can dynamically allocate the number of the source characters in each group according to the encoding length of the embedded information and the word frequency obtained by statistics, that is, dynamically balance the redundancy in each group of the source characters, facilitate information hiding, and can reuse the related gene words.

The apparatus in the above embodiment may further include: a reading module 90, configured to read character information of all source characters in each group of character groups to obtain corresponding information of each character group, where in any group of character groups, when the number of source characters whose character information is 0 is greater than the number of source characters whose character information is 1, the corresponding information of the character group is 0; when the number of source characters of which the character information is 1 is greater than that of source characters of which the character information is 0, the corresponding information of the character group is 1.
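The majority rule applied by the reading module can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `group_information` is hypothetical, each reading is assumed to be already available as the integer 0 or 1, and the tie case (which the text does not define) is reported as `None`.

```python
def group_information(readings):
    """Derive the information bit of one character group by majority vote.

    `readings` holds the character information (0 or 1) read from every
    source character occurrence in the group.
    """
    zeros = readings.count(0)
    ones = readings.count(1)
    if zeros > ones:
        return 0
    if ones > zeros:
        return 1
    # The source text does not say what happens on a tie; this sketch
    # reports it as undecided rather than guessing.
    return None


print(group_information([1, 1, 0]))     # 1: a single misread 0 is outvoted
print(group_information([0, 0, 0, 1]))  # 0
print(group_information([0, 1]))        # None (tie)
```

Because the vote is taken per group, a few damaged or misread characters do not change the recovered bit as long as they remain a minority in their group.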

The replacement module in the above embodiment of the present application may include: a first replacement module, configured to replace all source characters of a character group with the corresponding gene words in the gene word stock when the corresponding information of the character group is 0, and to perform no replacement on the source characters of the character group when the corresponding information is 1; or a second replacement module, configured to replace all source characters of a character group with the corresponding gene words in the gene word stock when the corresponding information of the character group is 1, and to perform no replacement on the source characters of the character group when the corresponding information is 0.
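The two replacement conventions differ only in which bit value triggers replacement, so both can be sketched with one parameter. The code below is an illustration under stated assumptions: `embed`, `replace_on` and the toy `gene_stock` are hypothetical names, and an uppercase letter stands in for a gene-word glyph, which in a real system would be a slightly modified glyph from a special font.

```python
def embed(text, groups, bits, gene_stock, replace_on=1):
    """Replace every source character of the groups whose bit equals `replace_on`.

    groups     -- list of character groups (each a list of source characters)
    bits       -- embedded information, one '0'/'1' per group
    gene_stock -- mapping from a source character to its gene word
    """
    to_replace = set()
    for group, bit in zip(groups, bits):
        if int(bit) == replace_on:
            to_replace.update(group)
    return "".join(gene_stock[ch] if ch in to_replace else ch for ch in text)


# Toy data: uppercase letters play the role of gene words (assumption).
gene_stock = {"a": "A", "b": "B", "c": "C", "d": "D"}
groups = [["a"], ["b"], ["c"], ["d"]]

embedded = embed("abcd dcba", groups, "0110", gene_stock)
print(embedded)  # aBCd dCBa  (groups 2 and 3 carry bit 1)

inverted = embed("abcd dcba", groups, "0110", gene_stock, replace_on=0)
print(inverted)  # AbcD DcbA  (the other convention)
```

Either convention works as long as the embedding side and the reading side agree on it.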

FIG. 2 is a flowchart of a document processing method based on gene words according to an embodiment of the present invention, the method including the steps of:

Step S102, extracting one or more source characters from the original file according to the gene word stock by using the extraction module in FIG. 1 to obtain a source character set, where the source characters in the source character set have corresponding gene words in the gene word stock.
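As a minimal sketch of step S102, assume the gene word stock can be modeled as a mapping from an ordinary character to its gene-word counterpart. Extraction is then a filter over the original text. The names `extract_source_characters` and `gene_stock`, and the tagged strings used as gene words, are illustrative and do not come from the patent.

```python
def extract_source_characters(text, gene_stock):
    """Return every character of `text` that has a gene word in `gene_stock`.

    Order and repetitions are preserved, so the result can feed the
    frequency statistics of the next step directly.
    """
    return [ch for ch in text if ch in gene_stock]


# Hypothetical stock: only 'a', 'b' and 'c' have gene-word variants.
gene_stock = {"a": "a*", "b": "b*", "c": "c*"}
source_chars = extract_source_characters("abacus cab", gene_stock)
print(source_chars)  # ['a', 'b', 'a', 'c', 'c', 'a', 'b']
```

Characters without an entry in the stock ('u', 's' and the space here) are simply skipped and stay untouched in the document.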

Step S104, calculating the repetition frequency of each source character in the source character set through the processing module in FIG. 1, and sorting the source characters in the source character set according to the repetition frequency and the internal code of each source character.

Step S106, the source characters in the sorted source character set are grouped according to a preset group number by the grouping module in fig. 1 according to a serpentine algorithm, so as to obtain a predetermined number of character groups.

In step S108, all the source characters in one or more groups of character groups are replaced by their corresponding gene words in the gene word library by the replacement module in fig. 1, so as to obtain the document embedded with the gene words.

In the above embodiment of the present application, after the extracted source characters that have corresponding gene words are sorted and grouped in serpentine order, the corresponding source characters in the original file are replaced by the gene words in the gene word stock. The method uses different combinations and groupings of the gene words and their word frequencies to carry a large amount of information, and the balanced statistical technique of serpentine ordering makes the gene words convenient to reuse. This solves the problem that unbalanced redundancy and low utilization in the conventional way of embedding gene words cause low character recognition accuracy when a system reads a document embedded with gene words. As a result, character information is read more accurately when such a document is recognized, and the robustness of extracting the embedded information from a document that uses gene words is greatly improved.

In this way, gene words are embedded in the original document according to an established rule. Because the redundancy of each group of gene words embedded in the original document is dynamically balanced, the information is well hidden, and after the original document has been printed or copied many times a system can still accurately judge whether the encrypted document has exceeded the allowed number of printings or copies.

In step S104 in the foregoing embodiment of the present application, calculating the repetition frequency of each source character in the source character set, and sorting the source characters in the source character set according to the repetition frequency and the internal code of each source character, may specifically be implemented by the following steps: sorting the source characters in the source character set in order of repetition frequency from high to low to obtain a first sorted set of the source character set; and sorting the source characters with the same repetition frequency in the first sorted set in order of internal code from large to small or from small to large. Alternatively, step S104 may be implemented by the following steps: sorting the source characters in the source character set in order of repetition frequency from low to high to obtain a first sorted set of the source character set; and sorting the source characters with the same repetition frequency in the first sorted set in order of internal code from large to small or from small to large. After counting the number (including repetitions) and the word frequency of the source characters in the original file that have corresponding gene words, the above embodiment sorts all source characters in the source character set by word frequency, and by internal code when the word frequency is the same. Because no two distinct characters share an internal code, this guarantees a unique ordering of the characters.

Specifically, the method may be implemented as follows. First, the source characters in the original file that correspond to gene words in the gene word stock are extracted, and the word frequency of each extracted source character is counted. The character information of each source character can be represented by binary bits (0 or 1): when a source character that has a gene word occurs n times, it can be characterized as n bits of 0 or 1.
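The relationship between word frequency and the n-bit characterization can be sketched in a few lines. This is an illustration only: `character_bits` is a hypothetical name, and the convention that a replaced character reads as 1 and an unreplaced one as 0 is an assumption (the text allows either convention).

```python
from collections import Counter


def character_bits(source_chars, replaced):
    """Map each distinct source character to its n-bit characterization.

    A character that occurs n times contributes n identical bits:
    '1' per occurrence if it was replaced by its gene word, else '0'
    (assumed convention).
    """
    freq = Counter(source_chars)
    return {ch: ("1" if ch in replaced else "0") * n for ch, n in freq.items()}


# 'a' occurs 3 times, 'b' and 'c' twice each; only 'a' is replaced.
bits = character_bits(list("abaccab"), replaced={"a"})
print(bits)  # {'a': '111', 'b': '00', 'c': '00'}
```

The repeated bits are the redundancy that later makes a majority vote over a group possible.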

The following passage is taken as an example. The original file contains the passage "the mountain owner has a great skill in development". Each character of the passage is regarded as a source character, and the gene words corresponding to the characters of the passage are found in the gene word stock by comparison and lookup, giving the source character set "the mountain owner develops a long skill". Each source character in the source character set has a corresponding gene word in the gene word stock; that is, the passage contains 7 characters with corresponding gene words (duplicates not counted).

According to statistics, the following results are obtained:

After the word frequency of the gene word corresponding to each source character is obtained through statistics, the internal code of each source character is obtained as well. Suppose the characters are sorted by repetition frequency from high to low and, for equal frequencies, by internal code from small to large. For example, the characters "mountain" (山) and "main" (主) both have word frequency 2; the internal code of "mountain" is 0x5c71 and that of "main" is 0x4e3b, so "main" is placed before "mountain", and the rest are handled similarly. The sorted character sequence is then "the main mountain develops the technical growth", with word frequencies 2, 2, 1, 1, 1, 1, 1 in order.
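The two-level ordering can be sketched with a composite sort key. Several things here are assumptions rather than facts from the patent: the function name `sort_source_characters`, the use of Python's `ord` (the Unicode code point) as the internal code, and the sample string. Only 山 (0x5c71) and 主 (0x4e3b) are identified by the codes quoted in the text; the five single-occurrence characters are stand-ins chosen so that the frequencies come out as 2, 2, 1, 1, 1, 1, 1.

```python
from collections import Counter


def sort_source_characters(source_chars, descending_frequency=True,
                           ascending_code=True):
    """Order distinct source characters by frequency, then by internal code."""
    freq = Counter(source_chars)
    f_sign = -1 if descending_frequency else 1
    c_sign = 1 if ascending_code else -1
    # ord() is used as the internal code (assumption: Unicode code point).
    return sorted(freq, key=lambda ch: (f_sign * freq[ch], c_sign * ord(ch)))


# Assumed sample: 山 and 主 occur twice, five stand-in characters once each.
sample = ["山", "主", "发", "展", "技", "术", "长", "山", "主"]
ordered = sort_source_characters(sample)
print(ordered)                               # ['主', '山', '发', '展', '技', '术', '长']
print([Counter(sample)[ch] for ch in ordered])  # [2, 2, 1, 1, 1, 1, 1]
print(hex(ord("主")), hex(ord("山")))         # 0x4e3b 0x5c71
```

Since every character has a distinct code, the key never ties, which is what makes the ordering reproducible on the reading side.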

Based on the above embodiment, before step S106 groups the source characters in the sorted source character set by the serpentine algorithm according to the preset number of groups to obtain the predetermined number of character groups, the method may further include: setting embedded information to obtain the number of bits of the embedded information, wherein the number of bits of the embedded information is the preset number of groups; and encrypting the embedded information to obtain secure embedded information. Specifically, still taking the same passage of the original file as an example, the embedded information may be set as required. For example, if the embedded information is set to 0110, 4 bits in total, the source characters in the source character set, after being sorted by the above method, are divided into 4 groups for allocation. In addition, the embedded information may be encrypted for security: for example, when the embedded information is 0110, it may be encrypted so that what unauthorized users see is 0011 or 1100, etc., instead of 0110, and only authorized users can recognize the correct embedded information.
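The text does not name an encryption scheme, so the sketch below uses a bitwise XOR with a shared key purely as a stand-in; it is not a secure cipher and is not claimed to be the patent's method. The names `group_count`, `xor_mask` and the key value are hypothetical. With the key 0101 the masked form of 0110 happens to be 0011, one of the values mentioned in the example.

```python
def group_count(embedded_bits):
    """The preset number of groups equals the bit length of the embedded information."""
    return len(embedded_bits)


def xor_mask(bits, key):
    """XOR two equal-length bit strings (illustrative stand-in for encryption)."""
    if len(bits) != len(key):
        raise ValueError("bits and key must have the same length")
    return "".join("1" if b != k else "0" for b, k in zip(bits, key))


embedded = "0110"
key = "0101"                          # hypothetical shared secret
protected = xor_mask(embedded, key)   # what an unauthorized reader would see
recovered = xor_mask(protected, key)  # applying the key again restores it

print(group_count(embedded))  # 4
print(protected)              # 0011
print(recovered)              # 0110
```

Masking does not change the bit length, so the number of groups is the same whether it is derived from the plain or the protected information.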

After the number of groups determined by the embedded information is obtained in step S106, each source character in the sorted source character set can be allocated to a character group according to the number of bits of the embedded information and the serpentine algorithm. Specifically, the source characters are taken one at a time from the sorted set and assigned to the character groups in serpentine order, the number of groups being the length of the embedded information, until all characters have been allocated. This better handles uneven word frequencies during allocation, dynamically balances the amount of redundancy in each group, and allows the relevant characters to be used repeatedly; that is, the number of source characters representing 0 or 1 is roughly even across the groups. It thereby mitigates the problem that some character groups would otherwise contain too many or too few source characters representing 0 or 1, or that a group would contain no source character representing 0 or 1 at all, which would make the information of that character group wrong. As a result, once the source characters in the character groups are replaced with gene words, the utilization of the gene words is improved and the robustness of subsequent document identification or detection is greatly increased.

Specifically, still taking the same passage of the original file as an example, the length of the embedded information 0110 means that the characters are divided into four groups, as follows:

First group: master and slave

Second group: mountain length

Third group: take place of

Fourth group: exercise machine
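One common reading of a serpentine (snake-order) distribution is to deal the sorted items to groups 1..n, then n..1, then 1..n again, and so on. The sketch below assumes that reading; the function name `serpentine_groups` and the labels `c1` to `c7` are hypothetical, and the result is not claimed to reproduce the four groups listed above. The frequencies follow the 2, 2, 1, 1, 1, 1, 1 example.

```python
def serpentine_groups(sorted_items, n_groups):
    """Deal items into n_groups in snake order: left to right on even
    rounds, right to left on odd rounds."""
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    groups = [[] for _ in range(n_groups)]
    for i, item in enumerate(sorted_items):
        round_no, pos = divmod(i, n_groups)
        index = pos if round_no % 2 == 0 else n_groups - 1 - pos
        groups[index].append(item)
    return groups


# Placeholder characters with their word frequencies, already sorted.
items = [("c1", 2), ("c2", 2), ("c3", 1), ("c4", 1),
         ("c5", 1), ("c6", 1), ("c7", 1)]
groups = serpentine_groups(items, 4)
labels = [[name for name, _ in g] for g in groups]
totals = [sum(freq for _, freq in g) for g in groups]
```

In this run `labels` is `[["c1"], ["c2", "c7"], ["c3", "c6"], ["c4", "c5"]]` and `totals` is `[2, 3, 2, 2]`, which shows how the reversal on every second round keeps the per-group frequency totals close to each other.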

After the source characters in each group are replaced with the corresponding gene words in the gene word library, when the system identifies the document, that is, detects the gene words, it rarely happens that the 0 or 1 characterizing a character group is read incorrectly because the original file has been damaged, i.e. that a group which should be read as 0 is read as 1, or a group which should be read as 1 is read as 0. The accuracy of gene word detection is thereby improved.

Based on the above embodiment, after step S106 groups the sorted source characters into the preset number of groups according to the serpentine algorithm to obtain the predetermined number of character groups, the method may further include: reading the character information of all source characters in each character group to obtain the corresponding information of that character group. In any character group, when the number of source characters whose character information is 0 is greater than the number whose character information is 1, the corresponding information of the character group is 0; when the number of source characters whose character information is 1 is greater than the number whose character information is 0, the corresponding information of the character group is 1. In this embodiment, since each source character in a group carries a bit of 0 or 1, whether the character group is characterized by 0 or by 1 can be determined from the counts of 0s and 1s in that group. For instance, if more source characters in a character group are characterized by "0" than by "1", the character group is characterized by "0".
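The majority rule in this paragraph can be sketched as a small function. The name `group_information` is hypothetical, and the text does not say what happens when a group holds equally many 0s and 1s, so the sketch returns `None` in that case rather than guessing.

```python
def group_information(char_bits):
    """Return the bit that characterizes a character group by majority.

    char_bits is a list of 0/1 values, one per source character read
    from the group. A tie is not covered by the text, so None is returned.
    """
    if any(bit not in (0, 1) for bit in char_bits):
        raise ValueError("character information must be 0 or 1")
    zeros = char_bits.count(0)
    ones = char_bits.count(1)
    if zeros > ones:
        return 0
    if ones > zeros:
        return 1
    return None  # tie or empty group: behaviour not specified


# A group where one character was misread still resolves to 0.
info = group_information([0, 0, 1])
```

Because the decision is a majority vote, a single misread character in a group of three does not change the group's information, which is the robustness benefit the description attributes to balanced groups.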

In the above embodiment of the present application, step S108, which replaces all source characters in one or more character groups with their corresponding gene words in the gene word library to obtain the document with embedded gene words, may be implemented as follows: when the corresponding information of a character group is 0, all source characters of the character group are replaced with the corresponding gene words in the gene word library; when the corresponding information of the character group is 1, no replacement is performed on the source characters of the character group. Alternatively, the step may be implemented the other way round: when the corresponding information of a character group is 1, all source characters of the character group are replaced with the corresponding gene words in the gene word library; when the corresponding information of the character group is 0, no replacement is performed on the source characters of the character group. In a specific implementation, the system may define the meaning of 0 and 1 as required. It may define that all source characters in character groups characterized by "0" are replaced with gene words while those in groups characterized by "1" are left unchanged, or conversely that all source characters in character groups characterized by "1" are replaced with gene words.
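The two replacement conventions can be sketched with a single parameter that selects which bit triggers replacement. Everything named here is hypothetical: `embed_gene_words`, `replace_when`, and the toy `gene_library` in which an uppercase letter stands in for the gene word of a lowercase source character (real gene words would be variant glyphs from a special font library).

```python
def embed_gene_words(text, groups, group_info, gene_library, replace_when=0):
    """Replace every source character belonging to a group whose
    information bit equals replace_when; leave other groups untouched."""
    to_replace = set()
    for chars, info in zip(groups, group_info):
        if info == replace_when:
            to_replace.update(chars)
    return "".join(
        gene_library.get(ch, ch) if ch in to_replace else ch for ch in text
    )


# Toy data: uppercase letters play the role of gene words.
gene_library = {"a": "A", "b": "B"}
groups = [["a"], ["b"]]
group_info = [0, 1]

replaced_on_zero = embed_gene_words("abcab", groups, group_info, gene_library, 0)
replaced_on_one = embed_gene_words("abcab", groups, group_info, gene_library, 1)
```

With `replace_when=0` the result is `"AbcAb"` (only the group marked 0 is replaced); with `replace_when=1` it is `"aBcaB"`. Characters outside every group, such as `c`, are never changed.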

The key points of the invention are that embedding and replacement are simple, fast, and easy to implement, the gene word utilization rate is high, the redundancy is relatively balanced, the information hiding performance is good, and the information embedding capacity is relatively large.

As can be seen from the above, in the method embodiment of the present application, the corresponding gene words in the gene word table are embedded into an existing original file, with the serpentine allocation technique as the key to the embedding process. The set of source characters to be replaced by gene words is extracted from the source file, and the source characters in that set are then sorted and grouped, which balances source characters of different word frequencies across the groups. In addition, so that the validity and integrity of the original file and of the gene word table can be verified, the embedded information and its number of bits need to be set.

It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order different from that presented here.

From the above description, it can be seen that the present invention achieves the following technical effects: the method for embedding gene words in a document is simple and fast in embedding and replacement, easy to implement, high in gene word utilization, relatively balanced in redundancy, good in information hiding performance, and relatively large in information embedding capacity.

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general-purpose computing device. They may be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; or they may each be fabricated as an individual integrated circuit module, or several of the modules or steps may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for processing a document based on gene words, comprising:

extracting one or more source characters from an original file according to a gene word stock to obtain a source character set, wherein the source characters in the source character set have corresponding gene words in the gene word stock;

calculating the repetition frequency of each source character in the source character set, and sorting the source characters in the source character set according to the repetition frequency and the internal character code of each source character;

grouping the sorted source characters in the source character set into a preset number of groups according to a serpentine algorithm to obtain the preset number of character groups;

replacing all source characters in one or more groups of character groups with corresponding gene words in the gene word stock to obtain a document embedded with the gene words,

wherein the gene word is a set of all characters in a special word stock.

2. The method of claim 1, wherein calculating the repetition frequency of each source character in the source character set and sorting the source characters in the source character set according to the repetition frequency and the internal character code of each source character comprises:

sorting the source characters in the source character set in order of repetition frequency from high to low to obtain a first sorted set of the source character set;

and sorting the source characters having the same repetition frequency in the first sorted set in order of internal character code from large to small or from small to large.

3. The method of claim 1, wherein calculating the repetition frequency of each source character in the source character set and sorting the source characters in the source character set according to the repetition frequency and the internal character code of each source character comprises:

sorting the source characters in the source character set in order of repetition frequency from low to high to obtain a first sorted set of the source character set;

4. The method of any of claims 1-3, wherein before grouping the source characters in the sorted set of source characters by a predetermined number of groups according to a serpentine algorithm to obtain a predetermined number of character groups, the method further comprises:

setting embedded information to obtain the number of digits of the embedded information, wherein the number of digits of the embedded information is the preset number of groups;

and encrypting the embedded information to obtain secure embedded information.

5. The method of any one of claims 1-3, wherein after grouping the source characters in the sorted set of source characters by a predetermined number of groups according to a serpentine algorithm to obtain a predetermined number of character groups, the method further comprises:

reading the character information of all the source characters in each character group to obtain the corresponding information of each character group, wherein,

in any group of character groups, when the number of source characters of which the character information is 0 is greater than that of source characters of which the character information is 1, the corresponding information of the character group is 0;

when the number of source characters of which the character information is 1 is greater than that of source characters of which the character information is 0, the corresponding information of the character group is 1,

wherein the character information of each of the source characters is represented by a binary bit of 0 or 1.

6. The method of claim 5, wherein replacing all source characters in one or more character groups with their corresponding gene words in the gene word stock to obtain the document embedded with the gene words comprises:

when the corresponding information of a character group is 0, replacing all source characters of the character group with the corresponding gene words in the gene word stock;

and when the corresponding information of the character group is 1, not executing the replacement operation on all the source characters of the character group.

7. The method of claim 5, wherein replacing all source characters in one or more character groups with their corresponding gene words in the gene word stock to obtain the document embedded with the gene words comprises:

when the corresponding information of a character group is 1, replacing all source characters of the character group with the corresponding gene words in the gene word stock;

and when the corresponding information of the character group is 0, not executing the replacement operation on all the source characters of the character group.

8. A gene-word based document processing apparatus, comprising:

the extraction module is used for extracting one or more source characters from an original file according to a gene word stock to obtain a source character set, wherein the source characters in the source character set have corresponding gene words in the gene word stock;

the processing module is used for calculating the repetition frequency of each source character in the source character set and sorting the source characters in the source character set according to the repetition frequency and the internal character code of each source character;

the grouping module is used for grouping the sorted source characters in the source character set into a preset number of groups according to a serpentine algorithm so as to obtain the preset number of character groups;

and the replacing module is used for replacing all source characters in one or more groups of character groups with corresponding gene words in the gene word stock so as to obtain the document embedded with the gene words, wherein the gene words are a set of all characters in a special word stock.

9. The apparatus of claim 8, wherein the processing module comprises:

the first sorting module is used for sorting the source characters in the source character set in order of repetition frequency from high to low or from low to high so as to obtain a first sorted set of the source character set;

and the second sorting module is used for sorting the source characters having the same repetition frequency in the first sorted set in order of internal character code from large to small or from small to large.

10. The apparatus of claim 8 or 9, further comprising:

and the setting module is used for setting the embedded information to obtain the number of bits of the embedded information, wherein the number of bits of the embedded information is the preset number of groups, and for encrypting the embedded information to obtain secure embedded information.

11. The apparatus of claim 8 or 9, further comprising:

the reading module is used for reading the character information of all the source characters in each group of character groups to obtain the corresponding information of each character group, wherein in any group of character groups, when the number of the source characters of which the character information is 0 is larger than the number of the source characters of which the character information is 1, the corresponding information of the character group is 0; when the number of the source characters of which the character information is 1 is greater than that of the source characters of which the character information is 0, the corresponding information of the character group is 1, wherein the character information of each source character is represented by a binary bit of 0 or 1.

12. The apparatus of claim 11, wherein the replacement module comprises:

the first replacement module is used for replacing all source characters of the character group with the corresponding gene words in the gene word stock when the corresponding information of the character group is 0, and for not executing the replacement operation on all the source characters of the character group when the corresponding information of the character group is 1; or,

the second replacement module is used for replacing all source characters of the character group with the corresponding gene words in the gene word stock when the corresponding information of the character group is 1, and for not executing the replacement operation on all the source characters of the character group when the corresponding information of the character group is 0.