CN112269904A - Data processing method and device - Google Patents

Info

Publication number
CN112269904A
Authority
CN
China
Prior art keywords
character string
ciphertext
vector
array
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011044074.7A
Other languages
Chinese (zh)
Other versions
CN112269904B (en
Inventor
何旭
王国赛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huakong Tsingjiao Information Technology Beijing Co Ltd
Original Assignee
Huakong Tsingjiao Information Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huakong Tsingjiao Information Technology Beijing Co Ltd filed Critical Huakong Tsingjiao Information Technology Beijing Co Ltd
Priority to CN202011044074.7A priority Critical patent/CN112269904B/en
Publication of CN112269904A publication Critical patent/CN112269904A/en
Application granted granted Critical
Publication of CN112269904B publication Critical patent/CN112269904B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6227Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database where protection concerns the structure of data, e.g. records, types, queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90344Query processing by using string matching techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/602Providing cryptographic facilities or services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioethics (AREA)
  • Software Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data processing method and device. A target character string and a character string array, both in ciphertext form, are obtained; a similarity value between the target character string and each character string to be matched in the character string array is calculated on the ciphertext to obtain a similarity array; the indexes of the character strings whose similarity values in the similarity array are greater than a preset ciphertext similarity threshold are determined through a ciphertext comparison operation; and the character strings extracted from the character string array according to the indexes are taken as the character strings matching the target character string. The entire process of determining, in the ciphertext character string array, the character strings that match the ciphertext target character string is carried out in ciphertext form, so neither the indexes nor the true values of the matched character strings are exposed, which improves data security during matching.

Description

Data processing method and device
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a data processing method and device.
Background
Fuzzy matching is involved in many everyday scenarios. For example, a search engine finds terms on the internet that are the same as or similar to a search term and provides them to the user.
At present, the search term and the terms to be matched are fuzzy-matched in plaintext form, yielding a plaintext matching result. In the process of implementing the invention, the inventors found that current fuzzy matching has at least the following problem: because fuzzy matching is performed in plaintext, data security during matching cannot be guaranteed, which creates data security risks.
Disclosure of Invention
The invention provides a data processing method and a data processing device to address the data security risks present in the prior art.
In order to solve the technical problem, the invention is realized as follows:
In a first aspect, an embodiment of the present invention provides a data processing method for determining, among all character strings included in a ciphertext character string array, the character strings that match a ciphertext target character string, where the method includes:
acquiring a target character string and a character string array, both in ciphertext form;
calculating, based on the ciphertext, the similarity value between the target character string and each character string to be matched in the character string array to obtain a similarity array;
determining, through a ciphertext comparison operation, the indexes of the character strings whose similarity values in the similarity array are greater than a preset ciphertext similarity threshold;
and taking the character strings extracted from the character string array according to the indexes as the character strings matching the target character string.
In a second aspect, an embodiment of the present invention provides a data processing method, configured to determine a character string that matches a target character string from among all character strings included in a character string array, where the method includes:
acquiring a target vector and a vector array of a ciphertext, wherein the length of a vector to be matched in the vector array is the same as that of the target vector; the target vector of the ciphertext is obtained by performing ciphertext processing on the vector obtained after the target character string is coded according to the preset coding operation, and the vector array is obtained by performing ciphertext processing on the vector obtained after the character string array is coded according to the preset coding operation;
calculating the ciphertext similarity value of the target vector and the vector to be matched in the vector array to obtain a similarity array;
determining, through a ciphertext comparison operation, the indexes of the vectors whose similarity values in the similarity array are greater than a preset ciphertext similarity threshold;
and taking the character string corresponding to the vector extracted from the vector array according to the index as the character string matched with the target character string corresponding to the target vector.
In a third aspect, an embodiment of the present invention provides a data processing apparatus, configured to determine a character string that matches a target character string of a ciphertext, from all character strings included in a character string array of the ciphertext, where the apparatus includes:
the first acquisition module is used for acquiring a target character string and a character string array of the ciphertext;
the first calculation module is used for calculating the similarity value of the target character string and each character string to be matched in the character string array based on the ciphertext to obtain a similarity array;
the first comparison module is used for determining, through a ciphertext comparison operation, the indexes of the character strings whose similarity values in the similarity array are greater than a preset ciphertext similarity threshold;
and the first matching module is used for taking the character string extracted from the character string array according to the index as the character string matched with the target character string.
In a fourth aspect, an embodiment of the present invention provides a data processing apparatus, configured to determine a character string that matches a target character string among all character strings included in a character string array, where the apparatus includes:
the second acquisition module is used for acquiring a target vector and a vector array of the ciphertext, and the length of a vector to be matched in the vector array is the same as that of the target vector; the target vector of the ciphertext is obtained by performing ciphertext processing on the vector obtained after the target character string is coded according to the preset coding operation, and the vector array is obtained by performing ciphertext processing on the vector obtained after the character string array is coded according to the preset coding operation;
the second calculation module is used for calculating the ciphertext similarity value of the target vector and the vector to be matched in the vector array to obtain a similarity array;
the second comparison module is used for determining, through a ciphertext comparison operation, the indexes of the vectors whose similarity values in the similarity array are greater than a preset ciphertext similarity threshold;
and the second matching module is used for taking the character string corresponding to the vector extracted from the vector array according to the index as the character string matched with the target character string corresponding to the target vector.
In a fifth aspect of the embodiments of the present invention, a computer-readable storage medium is provided, on which a computer program is stored, and the computer program, when being executed by a processor, implements the steps of the data processing method described above.
In a sixth aspect of the embodiments of the present invention, an electronic device is provided, which includes a processor, a memory, and a computer program stored on the memory and executable on the processor, and when the computer program is executed by the processor, the steps of the data processing method described above are implemented.
In the embodiment of the invention, a target character string and a character string array, both in ciphertext form, are obtained; a similarity value between the target character string and each character string to be matched in the character string array is calculated on the ciphertext to obtain a similarity array; the indexes of the character strings whose similarity values in the similarity array are greater than a preset ciphertext similarity threshold are determined through a ciphertext comparison operation; and the character strings extracted from the character string array according to the indexes are taken as the character strings matching the target character string. The entire process of determining, in the ciphertext character string array, the character strings that match the ciphertext target character string is carried out in ciphertext form, so neither the indexes nor the true values of the matched character strings are exposed, which improves data security during matching.
Drawings
FIG. 1 is a flow chart of steps of a data processing method according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating steps of another data processing method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an initial matrix provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of another initial matrix provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of another initial matrix provided by an embodiment of the present invention;
FIG. 6 is a flow chart illustrating steps of another data processing method according to an embodiment of the present invention;
FIG. 7 is a block diagram of a data processing apparatus according to an embodiment of the present invention;
FIG. 8 is a block diagram of another data processing apparatus provided by an embodiment of the present invention;
FIG. 9 is a block diagram of an apparatus for data processing of the present application;
FIG. 10 is a schematic diagram of a server in some embodiments of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In practical applications, scenarios such as mobile applications, databases, and search engines all involve fuzzy matching of character strings. The data processing method provided by the embodiment of the invention is used to determine, among all character strings included in a ciphertext character string array, the character strings that match a ciphertext target character string. Under the concept of fuzzy matching, a character string that matches the target character string is one that is similar to or the same as the target character string, so it does not need to coincide exactly with the target character string. For example, in the fuzzy matching of a search engine, if the search term is "potato", the retrieved entries include not only the entry "potato", which is identical to the search term, but also similar entries such as "mashed potato" and "potato cake"; if the search term is "app", the retrieved entries may include entries that are the same as or similar to the search term, such as "app" and "applet".
In addition, to avoid leaking data during character string matching, Secure Multi-Party Computation (SMC) is adopted in many scenarios. Secure multi-party computation performs data computation or fusion across multiple databases that do not trust one another, on the premise that each party's data remains confidential from the others. The technical solution provided by the invention aims to implement fuzzy matching of ciphertext character strings efficiently and accurately, without exposing the true values of the character strings, in ciphertext computation scenarios such as secure multi-party computation.
Fig. 1 is a flowchart of steps of a data processing method according to an embodiment of the present invention, and as shown in fig. 1, the method may include:
Step 101, obtaining a target character string and a character string array of a ciphertext.
The embodiment of the invention can be applied to a secure multi-party computation scenario, in which data participants jointly compute an agreed function without a trusted third party and without exposing their original data to one another. In this scenario, the target character string and the character string array are obtained in ciphertext form, and the specific values of the character strings are not exposed during computation.
Specifically, the character string array may include a plurality of character strings to be matched, arranged in a certain order. For example, the character string array ["app", "applet", "banana"] includes three character strings in a fixed order.
Step 102, calculating, based on the ciphertext, the similarity value between the target character string and each character string to be matched in the character string array to obtain a similarity array.
In the embodiment of the invention, the similarity value of the target character string and each character string to be matched in the character string array can be calculated based on the similarity calculation algorithm of the ciphertext, and the calculation result of each ciphertext is established as the similarity array. The arrangement sequence of the calculation results of the ciphertext is the same as the ordering sequence of the character strings to be matched.
For example, for the ciphertext target character string "ap" and the ciphertext character string array ["app", "applet", "banana"], a ciphertext similarity calculation algorithm yields the ciphertext similarity value x between "ap" and the character string to be matched "app", the ciphertext similarity value y between "ap" and "applet", and the ciphertext similarity value z between "ap" and "banana". From these similarity values, the ciphertext similarity array [x, y, z] is established.
Specifically, the embodiment of the present invention may use the text distance between the target character string and the character string to be matched to reflect their similarity value: the larger the text distance between the two, the smaller their similarity value; the smaller the text distance, the larger their similarity value. Because the target character string and the character string to be matched are both character strings, the text distance includes, but is not limited to, the Dice distance, the Jaccard distance, the edit distance, and the Hamming distance.
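The text distances named above can be illustrated on plaintext as follows. This is a minimal sketch only: the embodiment operates on ciphertext, and the function names and the choice to compare the sets of characters of each string (rather than, say, n-grams) are assumptions made for illustration, not details taken from the source.

```python
def jaccard_distance(a: str, b: str) -> float:
    """1 - |A ∩ B| / |A ∪ B| over character sets (assumed tokenization)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0  # two empty strings are treated as identical
    return 1.0 - len(sa & sb) / len(sa | sb)


def dice_distance(a: str, b: str) -> float:
    """1 - 2|A ∩ B| / (|A| + |B|) over character sets (assumed tokenization)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return 1.0 - 2 * len(sa & sb) / (len(sa) + len(sb))


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))
```

In each case a larger distance corresponds to a smaller similarity value, consistent with the relationship described above.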
Step 103, determining, through a ciphertext comparison operation, the indexes of the character strings whose similarity values in the similarity array are greater than a preset ciphertext similarity threshold.
In the embodiment of the invention, the ciphertext similarity array includes a plurality of ciphertext similarity values. Fuzzy matching here means determining, against a preset ciphertext similarity threshold, the similarity values in the similarity array that are greater than the threshold, and determining the indexes of the character strings corresponding to those similarity values. The ciphertext similarity threshold may be a fixed value set according to actual requirements, which is not limited in the embodiment of the present invention.
Specifically, in a secure multi-party computation scenario, the similarity array is in ciphertext form. To ensure that no similarity value is exposed during computation, the similarity values greater than the preset ciphertext similarity threshold can be obtained by taking the maximum of each similarity value and the preset ciphertext similarity threshold:
For example, for a similarity value a_arr (assume its plaintext value is 60) and the preset ciphertext similarity threshold b_arr (assume its plaintext value is 80), the pairwise ciphertext comparison operation tmp_arr = relu(b_arr - a_arr) + a_arr is performed. The relu function replaces an input less than 0 with 0 and keeps an input greater than 0 unchanged, so the result of this step is b_arr, that is, the larger of the two ciphertext values. Through this ciphertext comparison, the similarity values in the similarity array that are greater than the preset ciphertext similarity threshold can be determined, and the index of each such similarity value in the similarity array is taken as the index of the corresponding character string.
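The relu-based comparison can be checked on plaintext numbers. The following is a minimal sketch under the assumption that ordinary Python floats stand in for the ciphertext values; the helper name larger_of is hypothetical, and the embodiment itself evaluates the same expression on ciphertext.

```python
def relu(x: float) -> float:
    """Replace a negative input with 0 and keep a positive input unchanged."""
    return x if x > 0 else 0.0


def larger_of(a_arr: float, b_arr: float) -> float:
    """Plaintext stand-in for tmp_arr = relu(b_arr - a_arr) + a_arr.

    If b_arr > a_arr, relu yields b_arr - a_arr and the sum is b_arr;
    otherwise relu yields 0 and the sum is a_arr.
    """
    return relu(b_arr - a_arr) + a_arr


# The values from the example: similarity 60 against threshold 80.
tmp_arr = larger_of(60, 80)
```

With a similarity of 60 and a threshold of 80 the expression evaluates to 80, the threshold itself; with a similarity of 90 it evaluates to 90.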
Step 104, taking the character strings extracted from the character string array according to the indexes as the character strings matching the target character string.
In the embodiment of the present invention, the character strings are extracted from the character string array according to the indexes, obtained in step 103, of the similarity values greater than the preset ciphertext similarity threshold. The extracted character strings are those whose similarity value with the target character string is greater than the preset ciphertext similarity threshold, that is, the character strings matching the target character string.
For example, for the ciphertext target character string "ap" and the ciphertext character string array ["app", "applet", "banana"], the ciphertext similarity array [x, y, z] is established as in step 102, where x, y, and z are the ciphertext similarity values between "ap" and "app", "applet", and "banana" respectively.
If the similarity values in the similarity array that are greater than the preset ciphertext similarity threshold are x and y, the obtained indexes are (0) and (1). The character strings "app" and "applet" can then be extracted from the character string array according to the indexes (0) and (1), and these two character strings are the character strings matching the ciphertext target character string "ap".
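The index selection and extraction in steps 103 and 104 can be sketched on plaintext as follows. The numeric values standing in for x, y, and z and the threshold 0.3 are hypothetical, chosen only so that x and y exceed the threshold as in the example; the function names are likewise assumptions, and the embodiment performs the comparison on ciphertext rather than with a plaintext `>`.

```python
from typing import List, Sequence


def indexes_above_threshold(similarities: Sequence[float], threshold: float) -> List[int]:
    """Return the positions whose similarity value is strictly greater than the threshold."""
    return [i for i, s in enumerate(similarities) if s > threshold]


def extract_matches(strings: Sequence[str], indexes: Sequence[int]) -> List[str]:
    """Pick the strings at the given indexes, preserving array order."""
    return [strings[i] for i in indexes]


strings = ["app", "applet", "banana"]
similarities = [0.67, 0.33, 0.17]  # hypothetical plaintext values for x, y, z
threshold = 0.3                    # hypothetical similarity threshold

indexes = indexes_above_threshold(similarities, threshold)
matches = extract_matches(strings, indexes)
```

Because the similarity array keeps the same order as the character string array, an index found in one can be applied directly to the other.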
After determining the character string matched with the target character string of the ciphertext, the matched character string can be sent to a client side providing the target character string as a matching result, and corresponding data calculation operation can be carried out according to the matching result.
To sum up, in the data processing method provided by the embodiment of the present invention, a target character string and a character string array, both in ciphertext form, are obtained; a similarity value between the target character string and each character string to be matched in the character string array is calculated on the ciphertext to obtain a similarity array; the indexes of the character strings whose similarity values in the similarity array are greater than a preset ciphertext similarity threshold are determined through a ciphertext comparison operation; and the character strings extracted from the character string array according to the indexes are taken as the character strings matching the target character string. The entire process of determining, in the ciphertext character string array, the character strings that match the ciphertext target character string is carried out in ciphertext form, so neither the indexes nor the true values of the matched character strings are exposed, which improves data security during matching.
Fig. 2 is a flowchart of steps of a data processing method according to an embodiment of the present invention, and as shown in fig. 2, the method may include:
step 201, obtaining a target character string and a character string array of the ciphertext.
For this step, refer to step 101 above; details are not repeated here.
Step 202, uniformly converting all characters in the target character string and the character string array into upper case or uniformly converting all characters in the target character string and the character string array into lower case.
In the embodiment of the invention, to standardize the format of the target character string and the character string array and to reduce the overhead of case conversion in subsequent calculation, all characters in the target character string and the character string array can be uniformly converted into upper case or uniformly converted into lower case.
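On plaintext, the uniform case conversion of step 202 amounts to the following minimal sketch. The function name and the upper flag are hypothetical, and how the conversion is carried out on ciphertext is not shown here.

```python
from typing import List, Sequence, Tuple


def normalize_case(target: str, strings: Sequence[str], upper: bool = False) -> Tuple[str, List[str]]:
    """Convert the target and every string in the array to one case."""
    convert = str.upper if upper else str.lower
    return convert(target), [convert(s) for s in strings]


target, array = normalize_case("Ap", ["App", "APPLET", "Banana"])
```

After this step, "App" and "app" compare as equal, so later distance calculations need no per-character case handling.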
Step 203, obtaining a text distance between the target character string and the character string to be matched.
Specifically, the embodiment of the present invention may reflect the similarity value between the target character string and the character string to be matched by using the text distance between the target character string and the character string to be matched, and the larger the text distance between the target character string and the character string to be matched is, the smaller the similarity value between the target character string and the character string to be matched is; the smaller the text distance between the two is, the larger the similarity value between the two is. The text distance covers the similarity of the respective characters of the two character strings and the semantic similarity of the two character strings, so that the similarity value of the target character string and the character string to be matched can be further obtained through the text distance.
Optionally, the text distance includes the Dice distance, the Jaccard distance, the edit distance, and the Hamming distance. The text distance between the target character string and the character string to be matched can be converted into a similarity value in the range 0 to 1, with a larger distance giving a smaller similarity value.
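The source does not state how a distance is converted into a similarity value between 0 and 1. One possible conversion, shown purely as an assumption, divides the distance by its largest possible value (for the edit distance, the length of the longer string) and subtracts the result from 1; the function name is hypothetical.

```python
def distance_to_similarity(distance: float, max_distance: float) -> float:
    """Map a distance in [0, max_distance] to a similarity in [0, 1].

    Assumed normalization: similarity = 1 - distance / max_distance.
    A larger distance therefore gives a smaller similarity value.
    """
    if max_distance <= 0:
        return 1.0  # nothing to compare, treat as identical
    ratio = min(max(distance / max_distance, 0.0), 1.0)  # clamp to [0, 1]
    return 1.0 - ratio
```

For instance, an edit distance of 1 between "ap" and "app" with a maximum of 3 would give a similarity of about 0.67 under this assumed normalization.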
In one implementation, the edit distance is used to measure the similarity of two character strings. It is the minimum number of character operations required to convert character string X into character string Y, where the character operations are deleting a character, inserting a character, and modifying a character. The smaller the edit distance between two character strings, the more similar they are.
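For reference, the edit distance as defined above can be computed on plaintext with the textbook Levenshtein dynamic program. This sketch is the standard plaintext algorithm, not the ciphertext procedure of the embodiment; the function name is an assumption, and it keeps only two rows of the table at a time.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of deletions, insertions, and modifications turning x into y."""
    n, m = len(x), len(y)
    prev = list(range(m + 1))  # row 0: distance from the empty prefix of x
    for i in range(1, n + 1):
        curr = [i] + [0] * m   # column 0: delete the first i characters of x
        for j in range(1, m + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1  # modification needed?
            curr[j] = min(
                prev[j] + 1,         # delete x[i-1]
                curr[j - 1] + 1,     # insert y[j-1]
                prev[j - 1] + cost,  # keep or modify
            )
        prev = curr
    return prev[m]
```

Here edit_distance("ap", "app") is 1 (one insertion), while edit_distance("ap", "banana") is 5, which is why "app" counts as the closer match.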
Optionally, in a case that the text distance is an edit distance, step 203 may include:
substep 2031, obtaining the character length n of the target character string and the character length m of the character string to be matched.
Substep 2032, for the character string to be matched, creating an initial matrix of (n +1) × (m +1) dimensions, where each element of the initial matrix has an initial value of 0.
Sub-step 2033, determining the value of each element in the initial matrix.
Substep 2034, determining the value of the element at the corner position at the lower right of the initial matrix as the edit distance between the target character string and the character string to be matched.
Specifically, for sub-steps 2031 to 2034 above, an example of calculating the edit distance of two character strings X and Y in a secure multi-party computation scenario is as follows:
1. First, the length n of the character string X and the length m of the character string Y are obtained. If either character string is empty, None is returned as a warning. The specific steps are as follows:
n = len(X)  # obtain the lengths of the character strings X and Y
m = len(Y)
if n == 0 or m == 0:  # if an empty string is included, None is returned
    return None
2. An initial matrix of (n +1) × (m +1) dimensions is created with an initial value of 0 for each element of the initial matrix, and the value of each element in the initial matrix is determined by a loop traversal.
Optionally, the process of determining the value of each element in the initial matrix may specifically be implemented by the following sub-steps:
Substep a1, setting the elements in the first row and the first column of the initial matrix to an arithmetic sequence starting from 0 with a common difference of 1.
Specifically, after the initial matrix is established, its first row and first column may each be set to an arithmetic sequence starting from 0 with a common difference of 1. For example, referring to fig. 3, which shows a schematic diagram of an initial matrix provided by an embodiment of the present invention, if n is 3 and m is 4, the first row of the initial matrix is set to the arithmetic sequence from 0 to 4 with a common difference of 1, and the first column is set to the arithmetic sequence from 0 to 3 with a common difference of 1. The specific steps are as follows:
D = zeros((n + 1, m + 1))  # create the matrix
for i in range(n + 1):  # initialize column 0
    D[i][0] = pp.sint(i)
for j in range(m + 1):  # initialize row 0
    D[0][j] = pp.sint(j)
Substep a2, starting from the second row and the second column of the initial matrix, sequentially calculating the values of the elements other than those in the first row and the first column.
Wherein the value of each of the other elements is calculated from a first element adjacent to the left side of the other element, a second element adjacent to the upper side of the other element, and a third element adjacent to the upper left side of the other element.
In a specific implementation, the values of the elements other than those in the first row and the first column may be calculated by a loop traversal starting from row 1, column 1 of the initial matrix (counting from 0). Referring to fig. 3, the value of element a, with index (1,1), is calculated first, then the value of element b with index (1,2), then element c with index (1,3), then element d with index (1,4), then the element with index (2,1), and so on until the value of element f is calculated.
In the embodiment of the present invention, the value of each other element may be calculated from a left adjacent first element, an upper adjacent second element, and an upper left adjacent third element of the other element.
For example, referring to fig. 3, the value of the element a may be obtained from a first element 1 having an index of (1,0), a second element 1 having an index of (0,1), and a third element 0 having an index of (0, 0).
Optionally, in a case that it is determined that the first element is equal to the second element, a value of each of the other elements is a minimum value of: a sum of a value of the first element and 1, a sum of a value of the second element and 1, and a value of the third element.
Referring to fig. 3, the value of the element a may be obtained from a first element 1 having an index of (1,0), a second element 1 having an index of (0,1), and a third element 0 having an index of (0,0), and since the values of the first element and the second element are the same, the value of the element a may be the minimum value of the sum of the value of the first element and 1, the sum of the value of the second element and 1, and the value of the third element, that is, 0.
In the case where it is determined that the first element is not equal to the second element, the value of each of the other elements is the minimum of: a sum result of a value of the first element and 1, a sum result of a value of the second element and 1, and a sum result of a value of the third element and 1.
Referring to fig. 4, another schematic diagram of the initial matrix provided in the embodiment of the present invention is shown. After the value of the element a is calculated to be 0, the value of the other element b with index (1,2) is calculated next. It is obtained from the first element 0 with index (1,1), the second element 2 with index (0,2), and the third element 1 with index (0,1). Since the values of the first element and the second element are different, the value of the other element b with index (1,2) is the minimum of the sum of the value of the first element and 1, the sum of the value of the second element and 1, and the sum of the value of the third element and 1, that is, min(0+1, 2+1, 1+1) = 1. The remaining elements are calculated in the same way until the values of all other elements are obtained, giving the initial matrix shown in fig. 5, in which the values of all elements have been solved.
The process of determining the value of each element in the initial matrix is specifically as follows:
Figure BDA0002707465810000101
Figure BDA0002707465810000111
3. After the value of each element in the initial matrix is determined, the value of the element in the lower right corner of the initial matrix is returned as the edit distance between the target character string X and the character string Y to be matched, i.e., the value 1 of the element with index (3,4) in fig. 5. Further, according to the edit distance between the target character string X and the character string Y to be matched, the similarity value between them is calculated as follows:
Edit distance: distance = D[n][m]
Similarity score: res = 1 - distance / max(m, n)
In the case where the edit distance is D[3][4] = 1, the similarity score is res = 1 - 1/max(4, 3) = 0.75; whether the target character string X is regarded as similar to the character string Y to be matched then depends on the preset similarity threshold.
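The matrix filling and the similarity score described above can be sketched in plaintext as follows. This is only an illustrative sketch under stated assumptions: it uses the standard edit distance recurrence with the character comparison flag = 1 - (a[i-1] == b[j-1]), it works on plaintext Python strings rather than ciphertext, and the function names edit_distance and similarity are hypothetical. In the multi-party secure computation scenario the same arithmetic would be carried out on ciphertext values.

```python
def edit_distance(x: str, y: str) -> int:
    """Plaintext edit distance between x (length n) and y (length m)."""
    n, m = len(x), len(y)
    # (n+1) x (m+1) matrix; first row and first column count up from 0.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            flag = 1 - (x[i - 1] == y[j - 1])  # 0 if the characters match, else 1
            D[i][j] = min(
                D[i][j - 1] + 1,         # left neighbour plus 1
                D[i - 1][j] + 1,         # upper neighbour plus 1
                D[i - 1][j - 1] + flag,  # upper-left neighbour plus flag
            )
    return D[n][m]


def similarity(x: str, y: str) -> float:
    """Similarity score res = 1 - distance / max(m, n)."""
    if not x and not y:
        return 1.0  # assumption: two empty strings are treated as identical
    return 1 - edit_distance(x, y) / max(len(x), len(y))
```

For two strings of lengths 3 and 4 at edit distance 1, for instance similarity("app", "appl"), the score is 1 - 1/4.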
Further, in the multi-party secure computation scenario, the computation processes of sub-steps 2031 and 2034 are all performed in ciphertext form. When substep A2 of step 2 is performed, the value of each element of the initial matrix is determined by loop traversal, which takes m × n iterations. Each iteration must determine whether the first element and the second element are equal, so the whole process requires m × n flag comparison operations (flag = 1 - (a[i-1] == b[j-1]), where a[i-1] == b[j-1] indicates whether the first element a and the second element b are equal). In a multi-party secure computation scenario, performing m × n flag comparison operations incurs a high computation cost and seriously reduces processing performance.
To solve the above problem, alternatively, the determination of whether the first element and the second element are the same may be implemented by the following steps B1-B4.
Step B1, acquiring a first character length j of the first element and a second character length k of the second element.
Step B2, converting the first element into a first matrix of j × k dimensions, and converting the second element into a second matrix of j × k dimensions.
Step B3, performing a matrix comparison operation on the first matrix and the second matrix to obtain a comparison result matrix of j × k dimensions.
Step B4, determining, according to the comparison result matrix, whether the first element and the second element are equal.
With respect to the above steps B1-B4, in a multi-party security computation scenario, the purpose of reducing the computation amount and improving the processing performance can be achieved by optimizing the flag comparison operation, and specifically, the following example is given to determine whether the first element and the second element are the same:
1. first, a first character length j of a first element a and a second character length k of a second element b are obtained, the first element is converted into a first matrix with dimensions of j multiplied by k, and the second element is converted into a second matrix with dimensions of j multiplied by k.
For example, assume that a first element a is "hi", a second element b is "jkl", a first character length j of the first element a is 2, and a second character length k of the second element b is 3.
The first element a may be converted into a first matrix
a1 = [["h", "h", "h"], ["i", "i", "i"]]
The second element b may be converted into a second matrix
b1 = [["j", "k", "l"], ["j", "k", "l"]]
2. A ciphertext matrix comparison operation is performed on the first matrix a1 and the second matrix b1 to obtain a comparison result matrix of j × k dimensions.
Specifically, the comparison result matrix is
flag_all = (a1 == b1) = [[a[0] == b[0], a[0] == b[1], a[0] == b[2]], [a[1] == b[0], a[1] == b[1], a[1] == b[2]]]
The element in the ith row and the jth column of the comparison result matrix flag_all is the comparison result of a[i] == b[j].
3. And extracting a comparison result from the comparison result matrix, wherein the comparison result comprises that the first element and the second element are the same or different.
Therefore, for the comparison between the first element "hi" and the second element "jkl", 2 × 3 = 6 comparisons are required before the flag comparison operation is optimized; that is, a[0] == b[0], a[0] == b[1], a[0] == b[2], … must be judged one after another. After the flag comparison operation is optimized, the comparison between the first element "hi" and the second element "jkl" needs only one matrix comparison operation, and the subsequent comparison results a[0] == b[0], a[0] == b[1], a[0] == b[2], … can be extracted from the comparison result matrix through an extraction operation. Because the extraction operation has an extremely low computation cost, the number of comparisons and the overall amount of computation are reduced, and the calculation performance is improved.
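The construction of the two j × k matrices and of the comparison result matrix can be illustrated as follows. This is a plaintext sketch only: the function name comparison_matrix is hypothetical, and the element-wise comparison written here as a nested list comprehension stands in for what a secure computation framework would execute as a single batched ciphertext comparison.

```python
def comparison_matrix(a: str, b: str):
    """Return a len(a) x len(b) matrix whose (i, j) entry is a[i] == b[j]."""
    j, k = len(a), len(b)
    a1 = [[a[r]] * k for r in range(j)]  # each row repeats one character of a
    b1 = [list(b) for _ in range(j)]     # each row is a copy of b
    # Element-wise comparison of the two equally shaped matrices.
    return [[a1[r][c] == b1[r][c] for c in range(k)] for r in range(j)]


flag_all = comparison_matrix("hi", "jkl")
# Any single comparison result is then a cheap lookup, e.g. flag_all[0][2].
```

Since "hi" and "jkl" share no characters, every entry of flag_all is False in this example.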
Further, in the multi-party secure computation scenario, the computation processes of sub-steps 2031 and 2034 are all performed in ciphertext form. When substep A2 of step 2 is performed, the value of each other element of the initial matrix is determined through loop traversal, and the value of each other element depends on the values of its first element, second element and third element. When the values are calculated one by one through the loop traversal of substep A2 (fig. 3 to fig. 5), each other element requires two comparison operations; since the values of n × m other elements are to be calculated, 2 × n × m comparison operations are required in total, so the amount of computation is high and processing performance is seriously degraded.
To solve the above problem, calculating the values of other elements in the initial matrix may be optionally implemented by the following step C1.
Step C1, starting from the upper left end of the diagonal that runs from the upper left corner to the lower right corner of the initial matrix, sequentially calculating the values of the other elements on each target oblique line in order of increasing distance between the target oblique line and the upper left corner, wherein a target oblique line is an oblique line that intersects the diagonal.
In a multi-party secure computation scenario, the purposes of reducing the amount of computation and improving processing performance can be achieved by optimizing the calculation order of the other elements.
Specifically, referring to fig. 3, it is assumed that the character length of the target character string is n and the character length of the character string to be matched is m; an initial matrix of (n +1) × (m +1) dimensions is established as shown in fig. 3, where the elements in the first row and the first column of the initial matrix form an arithmetic sequence starting from 0 with a common difference of 1.
In the initial matrix of fig. 3, different target oblique lines are indicated by different element letters: the one other element a forms target oblique line 1, the two other elements b form target oblique line 2, the three other elements c form target oblique line 3, the three other elements d form target oblique line 4, the two other elements e form target oblique line 5, and the one other element f forms the last target oblique line 6, so there are 6 target oblique lines in total in the initial matrix.
In the embodiment of the present invention, the values of the other elements on each target oblique line may be calculated in sequence from the upper left corner to the lower right corner of the initial matrix shown in fig. 3, that is, from target oblique line 1 to target oblique line 6. When the values of the other elements on a target oblique line are calculated, the values of the first element, the second element, and the third element corresponding to each of those other elements have already been calculated, so these values can be extracted at one time and the values of all the other elements on the target oblique line can be calculated at one time. Thus, only two comparison operations are needed per target oblique line, for a total of only 2 × (n + m - 1) comparison operations. The number of comparison operations is therefore reduced, and the calculation performance is improved.
For example, assuming that there are two elements a and b on a target oblique line, that the three values of the first element, the second element and the third element corresponding to the element a are [1,2,3], and that the three values corresponding to the element b are [3,2,7], they can be spliced into a matrix [[1,2,3],[3,2,7]], and the minimum value of each row is then calculated using pnp.
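The oblique-line order of calculation can be sketched in plaintext as follows. This is an illustrative sketch, not the embodiment's implementation: the function name edit_distance_by_diagonals is hypothetical, plain Python lists replace ciphertext arrays, and the per-line row minimum written as a loop stands in for one batched minimum operation. Cells with the same index sum i + j lie on the same target oblique line, and all of their left, upper and upper-left neighbours lie on the two preceding lines.

```python
def edit_distance_by_diagonals(x: str, y: str) -> int:
    """Edit distance computed one target oblique line (anti-diagonal) at a time."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    # Lines s = 2 .. n + m give the n + m - 1 target oblique lines.
    for s in range(2, n + m + 1):
        cells = [(i, s - i) for i in range(max(1, s - m), min(n, s - 1) + 1)]
        # Splice the three candidate values of every cell on the line into rows.
        rows = [
            [D[i][j - 1] + 1,
             D[i - 1][j] + 1,
             D[i - 1][j - 1] + (x[i - 1] != y[j - 1])]
            for i, j in cells
        ]
        # Row-wise minimum; all cells on the line are independent of each other.
        for (i, j), row in zip(cells, rows):
            D[i][j] = min(row)
    return D[n][m]
```

With n = 3 and m = 4 the outer loop runs 6 times, matching the 6 target oblique lines of fig. 3.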
Step 204, determining the similarity value of the target character string and the character string to be matched according to the text distance between the target character string and the character string to be matched.
In another implementation, the Dice distance is used to measure the similarity between two sets, and a string may be understood as a set, so the Dice distance may also be used to measure the similarity between strings. If the text distance is a Dice distance, an example of determining the similarity value between the target character string and the character string to be matched according to the text distance between the target character string and the character string to be matched is as follows:
by equation 1:
$$\mathrm{Dice}(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|}$$
calculating the Dice distance (Dice) of the target character string and the character string to be matched, wherein X represents the set corresponding to the target character string, Y represents the set corresponding to the character string to be matched, the denominator represents the sum of the lengths of the set X and the set Y, and the numerator represents twice the length of the intersection of the set X and the set Y.
For example, if X corresponds to the target character string "apple" and Y corresponds to the character string "app" to be matched, the Dice distance between the target character string and the character string to be matched is equal to
Figure BDA0002707465810000151
Since the range of the Dice distance is between 0 and 1, the Dice distance between the target character string and the character string to be matched can be directly used as the similarity value of the target character string and the character string to be matched.
In a multi-party secure computation scenario, the intersection c of X and Y may be computed, and the length l of the intersection c and the lengths of X and Y are obtained, where c = X ∩ Y, l = len(c), la = len(X), and lb = len(Y).
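The quantities c, l, la and lb can be illustrated in plaintext as follows. This is a minimal sketch under two assumptions: each character string is treated as the set of its distinct characters, and the standard Dice coefficient 2·l / (la + lb) is used. The function name dice_similarity is hypothetical, and in the multi-party secure computation scenario the set operations would be performed by the secure computation framework.

```python
def dice_similarity(x: str, y: str) -> float:
    """Dice coefficient of the character sets of x and y."""
    X, Y = set(x), set(y)
    c = X & Y                      # intersection of the two sets
    l, la, lb = len(c), len(X), len(Y)
    if la + lb == 0:
        return 1.0                 # assumption: two empty sets count as identical
    return 2 * l / (la + lb)
```

Under these assumptions "apple" has the character set {a, p, l, e} and "app" has {a, p}, so the sketch evaluates 2·2 / (4 + 2).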
In another implementation, the Jaccard distance can also be used to measure the similarity of two sets. If the text distance is the Jaccard distance, an example of determining the similarity value between the target character string and the character string to be matched according to the text distance between the target character string and the character string to be matched is as follows:
by equation 2:
$$J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$$
calculating the Jaccard distance J between the target character string and the character string to be matched, wherein X represents the set corresponding to the target character string, Y represents the set corresponding to the character string to be matched, the denominator represents the length of the union of the set X and the set Y, and the numerator represents the length of the intersection of the set X and the set Y. Since the Jaccard distance ranges from 0 to 1, the Jaccard distance between the target character string and the character string to be matched can be directly used as the similarity value between the target character string and the character string to be matched.
In a multi-party secure computation scenario, the intersection c of X and Y may be computed, and the length l of the intersection c and the lengths of X and Y are obtained, where c = X ∩ Y, l = len(c), la = len(X), and lb = len(Y).
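A plaintext illustration of the Jaccard calculation from these quantities is given below. It is a sketch only, assuming that each character string is treated as the set of its distinct characters and that the union length is derived as la + lb - l; the function name jaccard_similarity is hypothetical.

```python
def jaccard_similarity(x: str, y: str) -> float:
    """Jaccard index of the character sets of x and y."""
    X, Y = set(x), set(y)
    l = len(X & Y)                 # length of the intersection
    la, lb = len(X), len(Y)
    union = la + lb - l            # |X ∪ Y| = |X| + |Y| - |X ∩ Y|
    if union == 0:
        return 1.0                 # assumption: two empty sets count as identical
    return l / union
```

Deriving the union length from the intersection means only one set operation is needed besides the lengths.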
In another implementation, the Hamming distance may also be used to measure the similarity between two character strings, where the Hamming distance is the number of positions at which the corresponding characters of two equal-length character strings differ, i.e., the number of replacement operations required to change one character string into the other. If the text distance is a Hamming distance, an example of determining the similarity value between the target character string and the character string to be matched according to the text distance between the target character string and the character string to be matched is as follows:
in a multi-party security computation scenario, the computation of the similarity scores of two equal-length strings X, Y using hamming distance follows the following steps:
1. Loop over the character strings X and Y, judging whether the characters at each position are the same, and count the number of equal characters, count (ciphertext). For example, for the character string "apple" and the character string "abcde", the first character and the last character of the two character strings are the same, so count is 2.
Specifically, the following process illustrates the loop judgment:
count = 0
for i in range(len(X)):      # loop over each element of X and Y
    flag = (X[i] == Y[i])    # judge in turn whether the elements are equal (ciphertext)
    count += flag            # update the count value
2. The respective character lengths of the character strings X, Y are obtained.
3. The ratio of equal characters to the length of the string is calculated, which is the similarity score of two equal length strings X, Y.
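Steps 1 to 3 can be put together in plaintext as follows. This is an illustrative sketch: the function name hamming_similarity is hypothetical, the comparison is done on plaintext characters rather than ciphertext, and rejecting strings of unequal length is an assumption about how invalid input would be handled.

```python
def hamming_similarity(x: str, y: str) -> float:
    """Ratio of positions at which two equal-length strings agree."""
    if len(x) != len(y):
        raise ValueError("strings must have equal length")
    if not x:
        return 1.0  # assumption: two empty strings count as identical
    count = 0
    for i in range(len(x)):
        flag = int(x[i] == y[i])  # 1 if the characters at position i are equal
        count += flag
    return count / len(x)
```

For "apple" and "abcde" the count is 2 and the string length is 5, so the ratio is 2/5.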
When the similarity values are calculated using the hamming distance, the Dice distance, and the jaccard distance, the calculation operations for the sets are used, and therefore the obtained similarity values are all in the clear. Only when the similarity value is calculated using the edit distance, the calculated similarity value is the ciphertext, and thus the data security can be further improved by calculating the similarity value using the edit distance.
Step 205, determining, through a ciphertext comparison operation, the indexes of the character strings whose similarity values in the similarity array are greater than a preset ciphertext similarity threshold.
This step may refer to substep 103 described above, and will not be described herein.
Optionally, step 205 may include:
substep 2051, comparing each element in the similarity array with a similarity threshold of the ciphertext to obtain a ciphertext comparison matrix; and the result which is greater than the similarity threshold value in the ciphertext comparison matrix is represented by a ciphertext of 1, and the result which is not greater than the similarity threshold value is represented by a ciphertext of 0.
In the embodiment of the invention, the similarity array of the ciphertexts comprises similarity values of a plurality of ciphertexts, the concept of fuzzy matching is to determine the similarity value which is larger than the preset similarity threshold of the ciphertexts in the similarity array based on the preset similarity threshold of the ciphertexts and determine the index of the character string corresponding to the similarity value.
Therefore, after each element in the similarity array is compared with the similarity threshold value of the ciphertext, the ciphertext comparison matrix can be obtained, the result which is greater than the similarity threshold value in the ciphertext comparison matrix is represented by the ciphertext of 1 (or 0), and the result which is not greater than the similarity threshold value is represented by the ciphertext of 0 (or 1), so that the comparison result is not exposed, the index of the similarity value in the similarity array is not exposed, and the data security is improved.
substep 2052, recovering the values of the ciphertext comparison matrix to plaintext, and taking the indexes of the non-zero elements as the indexes of the character strings whose similarity values in the similarity array are greater than the preset ciphertext similarity threshold.
In this step, after the values of the ciphertext comparison matrix are restored to plaintext through an internal plaintext restoration operation, the indexes of the non-zero elements are taken as the indexes of the character strings whose similarity values in the similarity array are greater than the preset ciphertext similarity threshold, thereby extracting the indexes of those character strings.
It should be noted that, in order to avoid the exposure of the index, the character string array and the ciphertext comparison matrix may be first scrambled according to the same method, and then the subsequent operation of extracting the index of the character string with the similarity value greater than the preset ciphertext similarity threshold is performed.
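The threshold comparison, the identical scrambling of both arrays, and the index extraction can be sketched in plaintext as follows. This is only an illustration under stated assumptions: the function names compare_with_threshold and extract_matches and the seed parameter are hypothetical, Python's random module stands in for whatever scrambling method the parties agree on, and all values are plaintext here whereas the embodiment keeps them in ciphertext until the comparison matrix is restored.

```python
import random


def compare_with_threshold(similarities, threshold):
    """Return a 0/1 list: 1 where the similarity exceeds the threshold."""
    return [1 if s > threshold else 0 for s in similarities]


def extract_matches(strings, similarities, threshold, seed=None):
    """Scramble strings and comparison results identically, then pick the matches."""
    mask = compare_with_threshold(similarities, threshold)
    order = list(range(len(strings)))
    random.Random(seed).shuffle(order)      # one permutation applied to both lists
    shuffled_strings = [strings[k] for k in order]
    shuffled_mask = [mask[k] for k in order]
    # Indexes of the non-zero entries refer to the scrambled order only.
    indexes = [k for k, bit in enumerate(shuffled_mask) if bit != 0]
    return [shuffled_strings[k] for k in indexes]
```

Because the same permutation is applied to both lists, the extracted strings are the same as without scrambling, but the revealed indexes no longer correspond to positions in the original character string array.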
Step 206, taking the character string extracted from the character string array according to the index as the character string matched with the target character string.
This step may refer to the sub-step 104, which is not described herein.
In summary, the data processing method provided by the embodiment of the present invention obtains the target character string and the character string array of the ciphertext; calculates, based on the ciphertext, the similarity value of the target character string and each character string to be matched in the character string array to obtain a similarity array; determines, through a ciphertext comparison operation, the indexes of the character strings whose similarity values in the similarity array are greater than a preset ciphertext similarity threshold; and takes the character string extracted from the character string array according to the index as the character string matched with the target character string. According to the invention, the whole process of determining, in the character string array of the ciphertext, the character string matched with the target character string of the ciphertext is carried out in ciphertext form, so the index and the true value of the character string in the character string array that matches the target character string are not exposed, and data security in the matching process is improved.
Fig. 6 is a flowchart of steps of another data processing method according to an embodiment of the present invention, and as shown in fig. 6, the method may include:
Step 301, acquiring a target vector and a vector array of the ciphertext.
The length of the vector to be matched in the vector array is the same as that of the target vector; the target vector of the ciphertext is obtained by performing ciphertext processing on the vector obtained after the target character string is encoded according to the preset encoding operation, and the vector array is obtained by performing ciphertext processing on the vector obtained after the character string array is encoded according to the preset encoding operation.
The embodiment of the invention can be applied to a multi-party secure computation scenario, whose aim is to allow the data participants to jointly compute an agreed function without a trusted third party and without exposing their original data to one another. In this scenario, the target vector and the vector array of the ciphertext, obtained by converting and encrypting the plaintext target character string and character string array, may be obtained first and then processed, so that the true values are not exposed during the calculation.
For example, encoding of the target character string and the character string array according to a preset encoding operation can be realized by a bag-of-words model:
assuming that the character string a is "app", the character string array b is [ "apple", "link" ]; the bag-of-words of the bag-of-words model includes [ "a", "p", "l", "e", "i", "n", "k" ].
After encoding, the character string a is converted into the vector a = [1,1,0,0,0,0,0], and the character string array b is converted into the vector array [[1,1,1,1,0,0,0],[0,0,1,0,1,1,1]]; the vector and the vector array obtained after conversion are then encrypted respectively. Here 1 indicates that the character at the corresponding position of the bag of words appears in the character string, and 0 indicates that it does not. Character strings mainly contain the 26 lowercase English letters, the 26 uppercase English letters, and the 10 digits, so each character string can be encoded with a bag of words of length 62.
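The bag-of-words encoding of this example can be sketched as follows. The function name encode is hypothetical, and the sketch covers only the plaintext encoding step; the subsequent ciphertext processing of the vectors is outside its scope.

```python
def encode(s: str, bag):
    """Return a 0-1 vector with 1 where the bag character occurs in s."""
    chars = set(s)
    return [1 if ch in chars else 0 for ch in bag]


bag = ["a", "p", "l", "e", "i", "n", "k"]
a = encode("app", bag)                            # target character string
b = [encode(s, bag) for s in ["apple", "link"]]   # character string array
```

Every encoded vector has the same length as the bag of words, which is what allows the target vector and the vectors to be matched to be compared position by position.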
Step 302, calculating a ciphertext similarity value of the target vector and a vector to be matched in the vector array to obtain a similarity array.
This step may refer to the sub-step 102, which is not described herein.
Specifically, the embodiment of the present invention may reflect the similarity value between the target vector and the vector to be matched by using the text distance between the target vector and the vector to be matched, where the larger the text distance between the target vector and the vector to be matched, the smaller the similarity value between the target vector and the vector to be matched; it should be noted that, because the target vector and the vector to be matched are in the form of vectors, the text distance between the target vector and the vector to be matched may be a vector distance between two vectors, and the vector distance includes, but is not limited to, a cosine distance and an euclidean distance.
Optionally, step 302 may include:
substep 3021, obtaining a text distance between the target vector and the vector to be matched.
substep 3022, determining a similarity value between the target vector and the vector to be matched according to the text distance between the target vector and the vector to be matched.
Specifically, the embodiment of the present invention may reflect the similarity value between the target vector and the vector to be matched by using the text distance between the target vector and the vector to be matched, where the larger the text distance between the target vector and the vector to be matched, the smaller the similarity value between the target vector and the vector to be matched; the smaller the text distance between the two is, the larger the similarity value between the two is. The text distance covers the similarity of the elements of the two vectors and the semantic similarity of the two vectors, so that the similarity value of the target vector and the vector to be matched can be further obtained through the text distance.
Optionally, the text distance includes: cosine distance or Euclidean distance.
In one implementation, the cosine distance is used to measure the similarity between two vectors, and if the text distance is the cosine distance, an example of determining the similarity value between the target vector and the vector to be matched according to the text distance between the target vector and the vector to be matched is as follows:
by equation 3:
$$\cos(A, B) = \frac{\mathrm{inner}(A, B)}{\|A\| \, \|B\|}$$
calculating the cosine distance between the target vector and the vector to be matched, wherein A represents the target vector, B represents the vector to be matched, the denominator represents the product of the norms of the vector A and the vector B, and the numerator inner(A, B) represents the inner product of the vectors A and B.
Referring to the example in step 301, when the vector B to be matched is the first vector in the vector array b (corresponding to the character string "apple"), the cosine distance between the vector A and the vector B is
$$\cos(A, B) = \frac{2}{\sqrt{2} \cdot \sqrt{4}} = \frac{\sqrt{2}}{2} \approx 0.707$$
When the vector B to be matched is the second vector in the vector array b (corresponding to the character string "link"), the cosine distance between the vector A and the vector B is
$$\cos(A, B) = \frac{0}{\sqrt{2} \cdot \sqrt{4}} = 0$$
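The cosine calculation for the example vectors of step 301 can be checked with the following plaintext sketch. The function name cosine_similarity is hypothetical, and returning 0 for an all-zero vector is an assumption made here to avoid division by zero.

```python
import math


def cosine_similarity(a, b):
    """inner(a, b) divided by the product of the norms of a and b."""
    inner = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return inner / (norm_a * norm_b)


target = [1, 1, 0, 0, 0, 0, 0]                 # "app"
apple = [1, 1, 1, 1, 0, 0, 0]                  # "apple"
link = [0, 0, 1, 0, 1, 1, 1]                   # "link"
scores = [cosine_similarity(target, v) for v in (apple, link)]
```

The first score corresponds to an inner product of 2 over norms √2 and 2, and the second is 0 because "app" and "link" share no bag-of-words characters.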
Specifically, based on the original target character string and the character string to be matched in the character string array, the embodiment of the present invention may uniformly convert the original target character string and the character string to be matched into 0-1 vectors (the vectors only contain 0 or 1) with the same length, and based on the vectors, the embodiment of the present invention may also implement set operations, such as intersection, union, and the like. The text distance measurement (e.g. dice distance, jaccard distance) based on collective operation can be implemented by a vector-based method.
The intersection can be obtained by an inner product operation on the two vectors, the union can be obtained from the intersection result, for example |A ∪ B| = |A| + |B| - |A ∩ B|, and the length of a set can be obtained by summing its vector.
Assuming that the original character strings are A and B and the encoded vectors are a and b, then:
|A| = sum(a) (add up all the elements in a)
|A ∩ B| = inner(a, b) (inner product of the vectors a and b)
|A∪B|=|A|+|B|-|A∩B|=sum(a)+sum(b)-inner(a,b)
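These identities can be illustrated with the following plaintext sketch, which derives the set lengths from 0-1 vectors and then forms the set-based measures from them. The function names are hypothetical, and the Dice expression assumes the standard coefficient with a factor of 2 in the numerator.

```python
def set_lengths(a, b):
    """Return (|A|, |B|, |A ∩ B|, |A ∪ B|) from two equal-length 0-1 vectors."""
    size_a = sum(a)
    size_b = sum(b)
    inter = sum(x * y for x, y in zip(a, b))   # inner product of a and b
    union = size_a + size_b - inter
    return size_a, size_b, inter, union


def dice_from_vectors(a, b):
    size_a, size_b, inter, _ = set_lengths(a, b)
    return 2 * inter / (size_a + size_b) if size_a + size_b else 1.0


def jaccard_from_vectors(a, b):
    _, _, inter, union = set_lengths(a, b)
    return inter / union if union else 1.0
```

Only sums and one inner product are needed, which are operations that map directly onto vector arithmetic.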
Step 303, determining, through a ciphertext comparison operation, the indexes of the vectors whose similarity values in the similarity array are greater than a preset ciphertext similarity threshold.
This step may refer to substep 103 described above, and will not be described herein.
Optionally, step 303 may include:
substep 3031, comparing each element in the similarity array with a similarity threshold value of a ciphertext to obtain a ciphertext comparison matrix; and the result which is greater than the similarity threshold value in the ciphertext comparison matrix is represented by a ciphertext of 1, and the result which is not greater than the similarity threshold value is represented by a ciphertext of 0.
substep 3032, recovering the values of the ciphertext comparison matrix to plaintext, and taking the indexes of the non-zero elements as the indexes of the vectors whose similarity values in the similarity array are greater than the preset ciphertext similarity threshold.
The sub-steps 3031-3032 can refer to the sub-steps 2051-2052, which are not described herein again.
Step 304, taking the character string corresponding to the vector extracted from the vector array according to the index as the character string matched with the target character string corresponding to the target vector.
This step may refer to the sub-step 104, which is not described herein.
To sum up, in the data processing method provided by the embodiment of the present invention, a target vector and a vector array of a ciphertext are obtained, where the vectors to be matched in the vector array have the same length as the target vector; the target vector of the ciphertext is obtained by performing ciphertext processing on the vector obtained after the target character string is encoded according to a preset encoding operation, and the vector array is obtained by performing ciphertext processing on the vectors obtained after the character string array is encoded according to the preset encoding operation; the ciphertext similarity value of the target vector and each vector to be matched in the vector array is calculated to obtain a similarity array; the indexes of the vectors whose similarity values in the similarity array are greater than a preset ciphertext similarity threshold are determined through a ciphertext comparison operation; and the character string corresponding to the vector extracted from the vector array according to the index is taken as the character string matched with the target character string corresponding to the target vector.
Fig. 7 is a block diagram of a data processing apparatus according to an embodiment of the present invention, and as shown in fig. 7, the apparatus may include:
a first obtaining module 401, configured to obtain a target character string and a character string array of a ciphertext;
a first calculating module 402, configured to calculate, based on a ciphertext, a similarity value of the target character string and each character string to be matched in the character string array, to obtain a similarity array;
optionally, the first calculating module 402 includes:
the first obtaining sub-module is used for obtaining the text distance between the target character string and the character string to be matched;
and the first distance calculation sub-module is used for determining the similarity value of the target character string and the character string to be matched according to the text distance between the target character string and the character string to be matched.
Optionally, in a case where the text distance is an edit distance, the first distance calculation sub-module includes:
an acquisition unit, configured to obtain the character length n of the target character string and the character length m of the character string to be matched;
a creating unit, configured to create, for the character string to be matched, an initial matrix of dimension (n+1) × (m+1), where the initial value of each element of the initial matrix is 0;
an evaluation unit, configured to determine the value of each element in the initial matrix.
Optionally, the evaluation unit includes:
a setting subunit, configured to set the elements in the first row and the first column of the initial matrix to an arithmetic sequence starting from 0 with a common difference of 1;
a traversal calculation subunit, configured to calculate in sequence, starting from the second row and the second column of the initial matrix, the values of the other elements of the initial matrix, that is, the elements not in the first row or the first column.
Optionally, the traversal calculation subunit includes:
a sequencing calculation subunit, configured to calculate the values of the other elements one target oblique line at a time, moving along the diagonal that runs from the upper-left corner to the lower-right corner of the initial matrix and taking the target oblique lines in order of increasing distance from the upper-left corner, where a target oblique line is an oblique line (anti-diagonal) that intersects the diagonal.
Here, the value of each of the other elements is calculated from a first element adjacent to it on the left, a second element adjacent to it above, and a third element adjacent to it on the upper left.
Optionally, in a case where the first element is determined to be equal to the second element, the value of each of the other elements is the minimum of: the value of the first element plus 1, the value of the second element plus 1, and the value of the third element;
in a case where the first element is determined not to be equal to the second element, the value of each of the other elements is the minimum of: the value of the first element plus 1, the value of the second element plus 1, and the value of the third element plus 1.
Optionally, the evaluation unit further includes:
a length obtaining subunit, configured to obtain a first character length j of the first element and a second character length k of the second element;
a construction subunit, configured to convert the first element into a first matrix of dimension j × k, and convert the second element into a second matrix of dimension j × k;
a comparison subunit, configured to perform a matrix comparison operation on the first matrix and the second matrix to obtain a comparison result matrix of dimension j × k;
and a judging subunit, configured to determine, according to the comparison result matrix, whether the first element is equal to the second element.
The first distance calculation sub-module further includes a determining unit, configured to determine the value of the element at the lower-right corner of the initial matrix as the edit distance between the target character string and the character string to be matched.
Optionally, the text distance includes: the Dice distance, the Jaccard distance, the edit distance and the Hamming distance.
A first comparing module 403, configured to determine, through a ciphertext comparison operation, the indexes of the character strings in the similarity array whose similarity values are greater than a preset ciphertext similarity threshold.
Optionally, the first comparing module includes:
a first ciphertext comparison sub-module, configured to compare each element in the similarity array with the ciphertext similarity threshold to obtain a ciphertext comparison matrix, where a result greater than the similarity threshold is represented in the ciphertext comparison matrix by a ciphertext of 1, and a result not greater than the similarity threshold is represented by a ciphertext of 0;
and a first conversion sub-module, configured to recover the values of the ciphertext comparison matrix into plaintext, and take the indexes of the non-zero elements as the indexes of the character strings in the similarity array whose similarity values are greater than the preset ciphertext similarity threshold.
A first matching module 404, configured to use the character string extracted from the character string array according to the index as the character string matched by the target character string.
Optionally, the apparatus further comprises:
a conversion module, configured to uniformly convert all characters in the target character string and the character string array to upper case, or uniformly convert them to lower case.
To sum up, the data processing apparatus provided in this embodiment of the present invention obtains a target character string of a ciphertext and a character string array of the ciphertext; calculates, based on the ciphertext, a similarity value between the target character string and each character string to be matched in the character string array, to obtain a similarity array; determines, through a ciphertext comparison operation, the indexes of the character strings in the similarity array whose similarity values are greater than a preset ciphertext similarity threshold; and takes the character strings extracted from the character string array according to those indexes as the character strings matching the target character string. In the present invention, the whole process of determining, in the ciphertext character string array, the character string that matches the ciphertext target character string is carried out in ciphertext form, so the index and the true value of the matching character string in the character string array are not exposed during processing, which improves data security in the matching process.
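The edit-distance variant of this apparatus can be illustrated with the following plaintext sketch. It is a stand-in only: all values are plain integers rather than ciphertext, the function names are hypothetical, the conversion `1 - distance / max(n, m)` from distance to similarity is an assumption (the text does not give a formula), and the equality test uses the standard Levenshtein rule of comparing the two characters at the current position, which is one reading of the j × k comparison result matrix.

```python
def equality_matrix(a: str, b: str) -> list[list[int]]:
    """All pairwise character comparisons at once: 1 where a[i] == b[j], else 0."""
    return [[1 if ca == cb else 0 for cb in b] for ca in a]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance filled anti-diagonal by anti-diagonal."""
    n, m = len(a), len(b)
    eq = equality_matrix(a, b)
    # (n+1) x (m+1) matrix initialised to 0.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    # First row and first column: arithmetic sequence 0, 1, 2, ...
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    # Cells with the same i + j lie on one anti-diagonal and depend only on
    # earlier anti-diagonals, so each anti-diagonal could be evaluated as a batch.
    for s in range(2, n + m + 1):
        for i in range(max(1, s - m), min(n, s - 1) + 1):
            j = s - i
            cost = 1 - eq[i - 1][j - 1]
            d[i][j] = min(d[i][j - 1] + 1, d[i - 1][j] + 1, d[i - 1][j - 1] + cost)
    # The lower-right corner holds the edit distance.
    return d[n][m]


def match_by_edit_distance(target: str, strings: list[str], threshold: float) -> list[str]:
    """Similarity array -> 0/1 comparison -> non-zero indexes -> extracted strings."""
    t = target.lower()  # case normalisation, as done by the conversion module
    similarity_array = []
    for s in strings:
        longest = max(len(t), len(s), 1)
        similarity_array.append(1.0 - edit_distance(t, s.lower()) / longest)
    comparison = [1 if sim > threshold else 0 for sim in similarity_array]
    indexes = [i for i, flag in enumerate(comparison) if flag != 0]
    return [strings[i] for i in indexes]


print(match_by_edit_distance("Kitten", ["kitten", "sitting", "mitten", "dog"], 0.6))
```

Filling the matrix by anti-diagonals gives the same result as the usual row-by-row order; the difference is that the cells of one anti-diagonal are mutually independent, which reduces the number of sequential rounds when each minimum is an expensive ciphertext operation.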
Fig. 8 is a block diagram of a data processing apparatus according to an embodiment of the present invention. As shown in Fig. 8, the apparatus may include:
a second obtaining module 501, configured to obtain a ciphertext target vector and a ciphertext vector array, where each vector to be matched in the vector array has the same length as the target vector; the ciphertext target vector is obtained by performing ciphertext processing on the vector obtained by encoding the target character string according to a preset encoding operation, and the vector array is obtained by performing ciphertext processing on the vectors obtained by encoding the character string array according to the preset encoding operation;
a second calculating module 502, configured to calculate a ciphertext similarity value between the target vector and each vector to be matched in the vector array, to obtain a similarity array.
Optionally, the second calculating module 502 includes:
a second obtaining sub-module, configured to obtain a text distance between the target vector and the vector to be matched;
and a second distance calculation sub-module, configured to determine the similarity value between the target vector and the vector to be matched according to the text distance between them.
Optionally, the text distance includes: cosine distance or Euclidean distance.
A second comparing module 503, configured to determine, through a ciphertext comparison operation, the indexes of the vectors in the similarity array whose similarity values are greater than a preset ciphertext similarity threshold.
Optionally, the second comparing module includes:
a second ciphertext comparison sub-module, configured to compare each element in the similarity array with the ciphertext similarity threshold to obtain a ciphertext comparison matrix, where a result greater than the similarity threshold is represented in the ciphertext comparison matrix by a ciphertext of 1, and a result not greater than the similarity threshold is represented by a ciphertext of 0;
and a second conversion sub-module, configured to recover the values of the ciphertext comparison matrix into plaintext, and take the indexes of the non-zero elements as the indexes of the vectors in the similarity array whose similarity values are greater than the preset ciphertext similarity threshold.
A second matching module 504, configured to use a character string corresponding to a vector extracted from the vector array according to the index as a character string matched with a target character string corresponding to the target vector.
To sum up, the data processing apparatus provided in this embodiment of the present invention obtains a ciphertext target vector and a ciphertext vector array, where each vector to be matched in the vector array has the same length as the target vector; the ciphertext target vector is obtained by performing ciphertext processing on the vector obtained by encoding the target character string according to a preset encoding operation, and the vector array is obtained by performing ciphertext processing on the vectors obtained by encoding the character string array according to the same preset encoding operation. The apparatus calculates a ciphertext similarity value between the target vector and each vector to be matched in the vector array to obtain a similarity array; determines, through a ciphertext comparison operation, the indexes of the vectors in the similarity array whose similarity values are greater than a preset ciphertext similarity threshold; and takes the character string corresponding to each vector extracted from the vector array according to those indexes as a character string matching the target character string corresponding to the target vector.
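The comparison and recovery steps of the second comparing module can be sketched as follows. This is a toy illustration, not the ciphertext scheme of the invention, which this part of the text does not specify: two-party additive secret sharing over a ring of size `MODULUS` is an assumption chosen only to show what "recovering the comparison matrix into plaintext" might look like, the similarity `1 / (1 + Euclidean distance)` is likewise assumed, and in this sketch the comparison itself is done on plain values before sharing, whereas the apparatus performs it on ciphertext.

```python
import math
import secrets

MODULUS = 2 ** 32  # hypothetical ring size for the toy additive sharing


def euclidean_similarity(u: list[float], v: list[float]) -> float:
    """Assumed mapping from Euclidean distance to a similarity in (0, 1]."""
    return 1.0 / (1.0 + math.dist(u, v))


def share(value: int) -> tuple[int, int]:
    """Split an integer into two additive shares; either share alone looks random."""
    r = secrets.randbelow(MODULUS)
    return r, (value - r) % MODULUS


def reveal(shares: tuple[int, int]) -> int:
    """Recombine the two shares into the plaintext value."""
    return (shares[0] + shares[1]) % MODULUS


def matching_indexes(target_vec, vector_array, threshold: float) -> list[int]:
    """Build a shared 0/1 comparison array, reveal it, and return non-zero indexes."""
    similarity_array = [euclidean_similarity(target_vec, v) for v in vector_array]
    # Stand-in for the ciphertext comparison: 1 if above the threshold, else 0.
    shared_comparison = [share(1 if s > threshold else 0) for s in similarity_array]
    comparison = [reveal(pair) for pair in shared_comparison]
    return [i for i, flag in enumerate(comparison) if flag != 0]


strings = ["alpha", "alpha2", "omega"]
vectors = [[1.0, 0.0, 2.0], [1.0, 0.0, 2.5], [5.0, 5.0, 5.0]]
matched = [strings[i] for i in matching_indexes([1.0, 0.0, 2.0], vectors, 0.6)]
print(matched)
```

Only the 0/1 comparison array is revealed in this sketch; the similarity values themselves would stay hidden in a real ciphertext computation, which matches the description that only the indexes of the non-zero elements are taken from the recovered plaintext.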
Fig. 9 is a block diagram illustrating an apparatus 800 for data processing in accordance with an example embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 9, the apparatus 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power components 806 provide power to the various components of device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front-facing camera and/or a rear-facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each of the front-facing camera and the rear-facing camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice information processing mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed state of the device 800 and the relative positioning of components, such as the display and keypad of the apparatus 800. The sensor assembly 814 may also detect a change in position of the apparatus 800 or a component of the apparatus 800, the presence or absence of user contact with the apparatus 800, the orientation or acceleration/deceleration of the apparatus 800, and a change in temperature of the apparatus 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions is also provided, such as the memory 804 comprising instructions that are executable by the processor 820 of the device 800 to perform the above-described method. For example, the non-transitory computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 10 is a schematic diagram of a server in some embodiments of the present application. The server 1900 may vary considerably depending on configuration or performance, and may include one or more Central Processing Units (CPUs) 1922 (e.g., one or more processors), memory 1932, and one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. The memory 1932 and the storage medium 1930 may be transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), and each module may include a series of instruction operations for the server. Still further, the central processor 1922 may be configured to communicate with the storage medium 1930 and to execute, on the server 1900, the series of instruction operations in the storage medium 1930.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input-output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
A non-transitory computer-readable storage medium, in which instructions, when executed by a processor of an apparatus (server or terminal), enable the apparatus to perform the data processing method provided by the above-described embodiment.
A non-transitory computer readable storage medium in which instructions, when executed by a processor of an apparatus (server or terminal), enable the apparatus to perform a data processing method.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice in the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.
The data processing method, the data processing apparatus and the apparatus for data processing provided by the present application are introduced in detail above, and specific examples are applied herein to illustrate the principles and embodiments of the present application, and the above descriptions of the embodiments are only used to help understand the method and the core ideas of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. A data processing method for determining a string that matches a target string of a ciphertext, from among all strings included in a string array of the ciphertext, the method comprising:
acquiring a target character string and a character string array of a ciphertext;
calculating the similarity value of the target character string and each character string to be matched in the character string array based on the ciphertext to obtain a similarity array;
determining, through a ciphertext comparison operation, indexes of the character strings in the similarity array whose similarity values are greater than a preset ciphertext similarity threshold;
and taking the character string extracted from the character string array according to the index as the character string matched with the target character string.
2. The method according to claim 1, wherein the calculating the ciphertext similarity value of the target character string and each character string to be matched in the character string array comprises:
acquiring a text distance between the target character string and the character string to be matched;
and determining the similarity value of the target character string and the character string to be matched according to the text distance between the target character string and the character string to be matched.
3. The method of claim 2, wherein the text distance comprises: the Dice distance, the Jaccard distance, the edit distance and the Hamming distance.
4. The method according to claim 3, wherein in a case that the text distance is an edit distance, the obtaining of the text distance between the target character string and the character string to be matched comprises:
acquiring the character length n of the target character string and the character length m of the character string to be matched;
creating, for the character string to be matched, an initial matrix of dimension (n+1) × (m+1), wherein the initial value of each element of the initial matrix is 0;
determining a value for each element in the initial matrix;
and determining the value of the element at the lower-right corner of the initial matrix as the edit distance between the target character string and the character string to be matched.
5. The method of claim 4, wherein determining the value of each element in the initial matrix comprises:
setting the elements in the first row and the first column of the initial matrix to an arithmetic sequence starting from 0, wherein the common difference of the arithmetic sequence is 1;
calculating values of other elements except elements on a first row and a first column in the initial matrix in sequence from a second row and a second column of the initial matrix;
wherein the value of each of the other elements is calculated from a first element adjacent to the left side of the other element, a second element adjacent to the upper side of the other element, and a third element adjacent to the upper left side of the other element.
6. A data processing method for determining a character string matching a target character string among all character strings included in a character string array, the method comprising:
acquiring a target vector and a vector array of a ciphertext, wherein the length of a vector to be matched in the vector array is the same as that of the target vector; the target vector of the ciphertext is obtained by performing ciphertext processing on the vector obtained after the target character string is coded according to the preset coding operation, and the vector array is obtained by performing ciphertext processing on the vector obtained after the character string array is coded according to the preset coding operation;
calculating the ciphertext similarity value of the target vector and the vector to be matched in the vector array to obtain a similarity array;
determining, through a ciphertext comparison operation, indexes of the vectors in the similarity array whose similarity values are greater than a preset ciphertext similarity threshold;
and taking the character string corresponding to the vector extracted from the vector array according to the index as the character string matched with the target character string corresponding to the target vector.
7. A data processing apparatus for determining a character string that matches a target character string of a ciphertext, from among all character strings included in a character string array of the ciphertext, the apparatus comprising:
the first acquisition module is used for acquiring a target character string and a character string array of the ciphertext;
the first calculation module is used for calculating the similarity value of the target character string and each character string to be matched in the character string array based on the ciphertext to obtain a similarity array;
the first comparison module is used for determining, through a ciphertext comparison operation, indexes of the character strings in the similarity array whose similarity values are greater than a preset ciphertext similarity threshold;
and the first matching module is used for taking the character string extracted from the character string array according to the index as the character string matched with the target character string.
8. A data processing apparatus for determining a character string that matches a target character string among all character strings included in a character string array, the apparatus comprising:
the second acquisition module is used for acquiring a target vector and a vector array of the ciphertext, and the length of a vector to be matched in the vector array is the same as that of the target vector; the target vector of the ciphertext is obtained by performing ciphertext processing on the vector obtained after the target character string is coded according to the preset coding operation, and the vector array is obtained by performing ciphertext processing on the vector obtained after the character string array is coded according to the preset coding operation;
the second calculation module is used for calculating the ciphertext similarity value of the target vector and the vector to be matched in the vector array to obtain a similarity array;
the second comparison module is used for determining, through a ciphertext comparison operation, indexes of the vectors in the similarity array whose similarity values are greater than a preset ciphertext similarity threshold;
and the second matching module is used for taking the character string corresponding to the vector extracted from the vector array according to the index as the character string matched with the target character string corresponding to the target vector.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the data processing method according to any one of claims 1 to 6.
10. An electronic device, comprising a processor, a memory and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the data processing method according to any one of claims 1 to 6.
CN202011044074.7A 2020-09-28 2020-09-28 Data processing method and device Active CN112269904B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011044074.7A CN112269904B (en) 2020-09-28 2020-09-28 Data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011044074.7A CN112269904B (en) 2020-09-28 2020-09-28 Data processing method and device

Publications (2)

Publication Number Publication Date
CN112269904A true CN112269904A (en) 2021-01-26
CN112269904B CN112269904B (en) 2023-07-25

Family

ID=74348681

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011044074.7A Active CN112269904B (en) 2020-09-28 2020-09-28 Data processing method and device

Country Status (1)

Country Link
CN (1) CN112269904B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113032839A (en) * 2021-05-25 2021-06-25 华控清交信息科技(北京)有限公司 Data processing method and device and data processing device
CN113051610A (en) * 2021-03-12 2021-06-29 华控清交信息科技(北京)有限公司 Data processing method and device and data processing device
CN115758417A (en) * 2022-11-22 2023-03-07 中金金融认证中心有限公司 Data processing method, electronic device and storage medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101930458A (en) * 2010-08-18 2010-12-29 杭州东信北邮信息技术有限公司 Short message matching method based on characteristic value
CN102314580A (en) * 2011-09-20 2012-01-11 西安交通大学 Vector and matrix operation-based calculation-supported encryption method
CN103399907A (en) * 2013-07-31 2013-11-20 深圳市华傲数据技术有限公司 Method and device for calculating similarity of Chinese character strings on the basis of edit distance
US20140233727A1 (en) * 2012-11-16 2014-08-21 Raytheon Bbn Technologies Corp. Method for secure substring search
CN104881439A (en) * 2015-05-11 2015-09-02 中国科学院信息工程研究所 Method and system for space-efficient multi-pattern matching
CN105243327A (en) * 2015-11-17 2016-01-13 四川神琥科技有限公司 Security processing method for files
CN106874401A (en) * 2016-12-30 2017-06-20 中安威士(北京)科技有限公司 A kind of ciphertext index method of data base-oriented encrypted fields fuzzy search
CN107153652A (en) * 2016-03-03 2017-09-12 阿里巴巴集团控股有限公司 Target string is converted into the method and device of standardization character string
CN110347723A (en) * 2019-07-12 2019-10-18 税友软件集团股份有限公司 A kind of data query method, system and electronic equipment and storage medium
US20200150885A1 (en) * 2018-11-14 2020-05-14 Exate Technology Limited Distributed Data Storage System and Method
CN111488497A (en) * 2019-01-25 2020-08-04 北京沃东天骏信息技术有限公司 Similarity determination method and device for character string set, terminal and readable medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
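Zhang Shunan et al.: "A Chinese data encryption scheme for efficient ciphertext retrieval in cloud storage", Computer Science (《计算机科学》), no. 201806, pages 130 - 135 *
Han Chengcheng et al.: "Semantic text similarity calculation methods", Journal of East China Normal University (Natural Science) (《华东师范大学学报(自然科学版)》), no. 5, pages 95 - 112 *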

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113051610A (en) * 2021-03-12 2021-06-29 华控清交信息科技(北京)有限公司 Data processing method and device and data processing device
CN113032839A (en) * 2021-05-25 2021-06-25 华控清交信息科技(北京)有限公司 Data processing method and device and data processing device
CN115758417A (en) * 2022-11-22 2023-03-07 中金金融认证中心有限公司 Data processing method, electronic device and storage medium

Also Published As

Publication number Publication date
CN112269904B (en) 2023-07-25

Similar Documents

Publication Publication Date Title
CN111612070B (en) Image description generation method and device based on scene graph
CN110909815B (en) Neural network training method, neural network training device, neural network processing device, neural network training device, image processing device and electronic equipment
CN112861175B (en) Data processing method and device for data processing
CN111832067B (en) Data processing method and device and data processing device
CN109558599B (en) Conversion method and device and electronic equipment
CN112269904B (en) Data processing method and device
CN110569777A (en) Image processing method and device, electronic equipment and storage medium
CN111859035B (en) Data processing method and device
EP3734472A1 (en) Method and device for text processing
CN112487415B (en) Method and device for detecting security of computing task
CN112241250B (en) Data processing method and device and data processing device
CN115085912B (en) Ciphertext calculation method and device and ciphertext calculation device
US20230386449A1 (en) Method and apparatus for training neural network, and method and apparatus for audio processing
CN114168798A (en) Text storage management and retrieval method and device
CN114154485A (en) Text error correction method and device
CN111860552B (en) Model training method and device based on nuclear self-encoder and storage medium
CN111538998A (en) Text encryption method and device, electronic equipment and computer readable storage medium
CN113589954A (en) Data processing method and device and electronic equipment
CN112906904B (en) Data processing method and device for data processing
CN114168809B (en) Document character string coding matching method and device based on similarity
CN115357626A (en) Data processing method, device, electronic equipment, medium and product
CN110084065B (en) Data desensitization method and device
CN113807540A (en) A data processing method and device
CN112016637B (en) Hierarchical sampling method and device for hierarchical sampling
CN113098524B (en) Information encoding method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant