
CN114996280A - Method, device, equipment and medium for correcting field information of data table - Google Patents

Info

Publication number
CN114996280A
Authority
CN
China
Prior art keywords
list
field
data
key
data list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210916472.6A
Other languages
Chinese (zh)
Other versions
CN114996280B (en)
Inventor
袁凯
郑书磊
叶新江
吕观祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Merit Interactive Co Ltd
Original Assignee
Merit Interactive Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Merit Interactive Co Ltd
Priority to CN202210916472.6A
Publication of CN114996280A
Application granted
Publication of CN114996280B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/177 Editing, e.g. inserting or deleting of tables; using ruled lines
    • G06F40/18 Editing, e.g. inserting or deleting of tables; using ruled lines of spreadsheets
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of databases, and in particular to a method, a device, equipment and a medium for correcting field information of a data table. The method comprises the following steps: acquiring a sample data set and a key data list; acquiring a preset field list according to the sample data set; acquiring a key field list according to the key data list; and acquiring target field information of the key data list based on the preset field list and the key field list. The field names of the initial data list can thereby be converted into industry-standard field names, which avoids the problem that data tables uploaded by different data sources cannot be analyzed centrally because their field names are inconsistent.

Description

Method, device, equipment and medium for correcting field information of data table
Technical Field
The invention relates to the technical field of databases, in particular to a method, a device, equipment and a medium for correcting field information of a data table.
Background
With the development of the digital era, data needs to be uploaded to a designated information system for uniform analysis and processing, however, in the process, field names in data tables uploaded by a plurality of users are usually not uniform, so that the data tables uploaded by different users cannot be analyzed in a centralized manner. Currently, it is common to determine the relationships between uploaded data tables through manual operations, which causes inefficiency in the identification of data table relationships and brings considerable labor cost. Therefore, how to effectively unify field names in data tables uploaded by different users is a technical problem that needs to be solved urgently by those skilled in the art at present.
Disclosure of Invention
To address the above technical problem, the invention provides a method for correcting field information of a data table, which comprises the following steps:
acquiring a sample data set and a key data list;
acquiring a preset field list according to the sample data set;
acquiring a key field list according to the key data list;
and acquiring target field information corresponding to the key data list based on the preset field list and the key field list, wherein the target field information comprises a target field name and a target character string corresponding to the target field name.
The invention also provides a device for correcting field information of a data table, which comprises:
a data acquisition module, used for acquiring a sample data set and a key data list;
a sample field list acquisition module, used for acquiring a preset field list according to the sample data set;
a key field list acquisition module, used for acquiring a key field list according to the key data list;
and a target field information acquisition module, used for acquiring target field information corresponding to the key data list based on the preset field list and the key field list, wherein the target field information comprises a target field name and a target character string corresponding to the target field name.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor implements the above method for correcting field information of a data table when executing the computer program.
The invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above method for correcting field information of a data table.
Compared with the prior art, the invention has obvious advantages and beneficial effects. By the technical scheme, the field information correction method, the field information correction device, the electronic equipment and the storage medium of the data table provided by the invention can achieve considerable technical progress and practicability, have industrial wide utilization value and at least have the following advantages:
The invention discloses a method, a device, equipment and a medium for correcting field information of a data table. The method comprises the following steps: acquiring a sample data set and a key data list; acquiring a preset field list according to the sample data set; acquiring a key field list according to the key data list; and acquiring target field information of the key data list based on the preset field list and the key field list. The field names of the initial data list can thereby be converted into industry-standard field names, which avoids the problem that data tables uploaded by different data sources cannot be analyzed centrally because their field names are inconsistent.
The foregoing description is only an overview of the technical solutions of the present invention, and in order to make the technical means of the present invention more clearly understood, the present invention may be implemented in accordance with the content of the description, and in order to make the above and other objects, features, and advantages of the present invention more clearly understood, the following preferred embodiments are described in detail with reference to the accompanying drawings.
Drawings
Fig. 1 is a flowchart of a method for correcting field information of a data table according to an embodiment of the present invention;
Fig. 2 is a flowchart of steps performed before step S100 according to an embodiment of the present invention;
Fig. 3 is a flowchart of step S17 according to an embodiment of the present invention;
Fig. 4 is a flowchart of further steps performed before step S100 according to an embodiment of the present invention;
Fig. 5 is a flowchart of step S25 according to an embodiment of the present invention;
Fig. 6 is a flowchart of step S400 according to an embodiment of the present invention;
Fig. 7 is a flowchart of step S401 according to an embodiment of the present invention;
Fig. 8 is a flowchart of step S403 according to an embodiment of the present invention;
Fig. 9 is a flowchart of another field information correction method of the data table according to an embodiment of the present invention;
Fig. 10 is a schematic structural diagram of a field information correction apparatus of a data table according to a second embodiment of the present invention;
Fig. 11 is a schematic structural diagram of other modules in the apparatus according to the second embodiment of the present invention;
Fig. 12 is a schematic structural diagram of module 17 according to the second embodiment of the present invention;
Fig. 13 is another schematic structural diagram of other modules in the apparatus according to the second embodiment of the present invention;
Fig. 14 is a schematic structural diagram of module 25 according to the second embodiment of the present invention;
Fig. 15 is a schematic structural diagram of module 40 according to the second embodiment of the present invention;
Fig. 16 is a schematic structural diagram of module 401 according to the second embodiment of the present invention;
Fig. 17 is a schematic structural diagram of module 403 according to the second embodiment of the present invention;
Fig. 18 is a schematic structural diagram of another field information correction apparatus of the data table according to the second embodiment of the present invention.
Detailed Description
To further illustrate the technical means adopted by the present invention to achieve its intended objects, and their effects, the method, device, equipment and medium for correcting field information of a data table are described in detail below with reference to the accompanying drawings and preferred embodiments.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example one
As shown in fig. 1, the first embodiment provides a method for correcting field information of a data table, where the method includes the following steps:
S100, acquiring a sample data set and a key data list.
Specifically, the following steps are performed before step S100, as shown in fig. 2:
S11, acquiring any first original data list from the first original data set as a first data list.
Further, the first original data set includes a plurality of first original data lists, wherein all the first original data lists belong to the same industry.
S13, obtaining any associated data list corresponding to the first data list from the associated data set corresponding to the first data list as a second data list, where any associated data list corresponding to the first data list is any one of the first original data lists except the first data list in the first original data set.
S15, acquiring a first common field name between the first data list and the second data list based on the first data list and the second data list.
Further, the first common field name is a field name that appears in both the first field list and the second field list with the same name and the same corresponding structure type.
Further, the first field list is a field list constructed based on all field names in the first data list.
Further, the second field list is a field list constructed based on all field names in the second data list.
S17, acquiring a sample data set based on the first common field name.
Further, the step S17 includes the following steps, as shown in fig. 3:
S171, obtaining a field ratio K_1 corresponding to the first data list, wherein K_1 satisfies the following condition:
[formula image DEST_PATH_IMAGE002, not reproduced in the text]
where N is the number of first original data lists in the first original data set, M_i is the number of first common field names between the first data list and the i-th second data list, and M_0 is the number of field names in the first field list.
S173, when K_1 ≥ K_0, determining that the first data list is a sample data list, where K_0 is a preset field ratio threshold that those skilled in the art may set according to actual requirements; details are not described herein.
S175, when K_1 < K_0, determining that the first data list is a non-sample data list.
By the method, the field name similarity of each sample data list in the sample data set is high, the unified field names corresponding to the sample data lists can be determined, and further the problem that the data tables uploaded by different data sources cannot be analyzed in a concentrated mode due to the fact that the field names of the data tables uploaded by different users are inconsistent is avoided.
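Steps S11 to S175 can be illustrated with a minimal sketch. The formula image for K_1 is not reproduced in this text, so the averaging form used here, K_1 = (1/(N-1)) · Σ M_i / M_0, is only an assumption that agrees with the symbol definitions, not the formula of the invention. The function names and the `{field name: structure type}` representation of a data list are likewise hypothetical.

```python
from typing import Dict, List

# A data list is modelled only by its schema: {field name: structure type}.
Schema = Dict[str, str]


def common_field_names(first: Schema, second: Schema) -> List[str]:
    """Field names present in both schemas with the same structure type (S15)."""
    return [name for name, kind in first.items() if second.get(name) == kind]


def field_ratio(first: Schema, others: List[Schema]) -> float:
    """Assumed form of K_1: mean over the other lists of M_i / M_0."""
    m0 = len(first)
    if m0 == 0 or not others:
        return 0.0
    shared = sum(len(common_field_names(first, other)) for other in others)
    return shared / (len(others) * m0)


def select_sample_lists(data_set: List[Schema], k0: float) -> List[Schema]:
    """Keep every list whose field ratio reaches the threshold K_0 (S173/S175)."""
    samples = []
    for idx, first in enumerate(data_set):
        others = data_set[:idx] + data_set[idx + 1:]
        if field_ratio(first, others) >= k0:
            samples.append(first)
    return samples


orders_a = {"order_id": "int", "amount": "float", "city": "str"}
orders_b = {"order_id": "int", "amount": "float", "region": "str"}
orders_c = {"oid": "str", "total": "float", "memo": "str"}
samples = select_sample_lists([orders_a, orders_b, orders_c], k0=0.3)
```

With these toy schemas, `orders_a` and `orders_b` share two of three fields with each other and none with `orders_c`, so only the first two pass a threshold of 0.3. Under the same assumption, the ratio K_2 of step S251 could be computed by passing the sample data lists as `others`.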
Specifically, the following steps are also performed before step S100, as shown in fig. 4:
S21, acquiring any second original data list from the second original data set as a third data list.
Further, the second original data set includes a plurality of second original data lists, wherein each second original data list belongs to the same industry as the first original data lists, and the data source of each second original data list differs from the data source of every first original data list.
S23, based on any sample data list in the sample data set and the third data list, obtaining a second common field name between the sample data list and the third data list.
Further, the second common field name is a field name that appears in both the preset field list and the third field list with the same name and the same corresponding structure type.
Further, the third field list is a field list constructed based on all field names in the third data list.
Further, the preset field list is a field list constructed based on all field names in the sample data list.
S25, acquiring a key data list based on the second common field name.
Further, the step S25 includes the following steps, as shown in fig. 5:
S251, obtaining a field ratio K_2 corresponding to the third data list, wherein K_2 satisfies the following condition:
[formula image DEST_PATH_IMAGE004, not reproduced in the text]
where Z is the number of sample data lists in the sample data set, G_j is the number of second common field names between the third data list and the j-th sample data list, and G_0 is the number of field names in the third field list.
S253, when K_2 ≥ K_0, determining that the third data list is a key data list.
S255, when K_2 < K_0, determining that the third data list is a non-key data list.
By the method, the key data list and the field names of each sample data list in the sample data set are highly similar, the unified field names corresponding to the key data list can be determined, and further the problem that the data tables uploaded by different data sources cannot be analyzed in a centralized mode due to the fact that the field names of the data tables uploaded by different users are inconsistent is avoided.
S200, acquiring a preset field list according to the sample data set, wherein the preset field list comprises a plurality of preset field names and preset character strings corresponding to the preset field names.
Specifically, the step S200 further includes the steps of:
S201, acquiring all the first common field names and de-duplicating them to obtain an initial field name list;
S203, acquiring, according to the initial field name list, the number of original data tables corresponding to each initial field name in the initial field name list;
S205, when the number of original data tables corresponding to an initial field name is not less than a preset data table number threshold, determining the initial field name to be a preset field name;
S207, acquiring, according to the preset field name, all character strings corresponding to the preset field name from the original data lists, and constructing an initial character string list corresponding to the preset field name, wherein an initial character string is any character string in the initial character string list corresponding to the preset field name;
S209, traversing the initial character string list corresponding to the preset field name, and acquiring a character string similarity list corresponding to the preset field name, wherein each character string similarity is the sum of the similarities between one initial character string corresponding to the preset field name and all the other initial character strings;
S2011, traversing the character string similarity list corresponding to the preset field name, and taking the initial character string with the maximum character string similarity as the preset character string corresponding to the preset field name.
In this way, the field names and character strings used within the same industry can be described accurately and can represent the field characteristics of that industry, which avoids the situation in which field names and character strings cannot be unified because the same field is described differently within the industry.
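Steps S201 to S2011 can be sketched as follows. This is a simplified illustration under stated assumptions: each original data list is reduced to a hypothetical `{field name: descriptive string}` mapping, all field names are counted rather than only the first common field names, and `difflib.SequenceMatcher` stands in for the string similarity, which the text does not specify.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from typing import Dict, List


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1]; the source names no specific measure."""
    return SequenceMatcher(None, a, b).ratio()


def preset_field_list(tables: List[Dict[str, str]], min_tables: int) -> Dict[str, str]:
    """tables: one {field name: descriptive string} mapping per original data list."""
    # S201/S203: gather the descriptive strings seen for each distinct field name.
    strings_by_name: Dict[str, List[str]] = defaultdict(list)
    for table in tables:
        for name, text in table.items():
            strings_by_name[name].append(text)

    preset: Dict[str, str] = {}
    for name, strings in strings_by_name.items():
        # S205: keep names that occur in at least `min_tables` tables.
        if len(strings) < min_tables:
            continue

        # S209: summed similarity of string i to every other string for this name.
        def total(i: int) -> float:
            return sum(similarity(strings[i], s) for j, s in enumerate(strings) if j != i)

        # S2011: the string with the largest summed similarity becomes the preset string.
        best = max(range(len(strings)), key=total)
        preset[name] = strings[best]
    return preset


tables = [
    {"amount": "order amount in yuan", "city": "delivery city"},
    {"amount": "order amount", "memo": "free text"},
    {"amount": "order amount yuan"},
]
preset = preset_field_list(tables, min_tables=2)
```

Only `amount` appears in at least two of the toy tables, so it is the only preset field name returned; its preset string is whichever description is most similar to the others in aggregate.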
S300, acquiring a key field list according to the key data list, wherein the key field list comprises a plurality of key field names and key character strings corresponding to each key field name.
Specifically, the key field name is any field name in the key data list.
Specifically, the key character string is a character string describing the key field name.
S400, acquiring target field information corresponding to the key data list based on the preset field list and the key field list, wherein the target field information comprises a target field name and a target character string corresponding to the target field name.
Specifically, the step S400 further includes the following steps, as shown in fig. 6:
S401, acquiring an intermediate field list corresponding to the key data list according to the key field names and the preset field names, wherein the intermediate field list comprises a plurality of intermediate field names and intermediate character strings corresponding to each intermediate field name.
Further, the step S401 further includes the following steps, as shown in fig. 7:
S4011, acquiring a first similarity list corresponding to the key field name according to the key field name and the preset field names.
Further, the first similarity list corresponding to the key field names includes a plurality of first similarities corresponding to the key field names, where the first similarities refer to similarities between the first field vectors corresponding to the key field names and the second field vectors corresponding to any one of the preset field names.
Preferably, the method further comprises the following step of determining an acquisition method of the first similarity:
S31, acquiring a first intermediate similarity set H = {H_1, ……, H_r, ……, H_g} according to the key field name list and the preset field name list, where H_r = {H_r1, ……, H_re, ……, H_rf} and H_re is the first intermediate similarity obtained for the r-th key field name by the e-th similarity acquisition method; r = 1, ……, g, where g is the number of key field names; e = 1, ……, f, where f is the number of similarity acquisition methods; each first intermediate similarity in H_r is obtained by a different similarity acquisition method; preferably, g = G_0.
S33, when H_re > H_0, acquiring a target number list U = {U_1, ……, U_e, ……, U_f} corresponding to H, where U_e is the target number corresponding to the e-th similarity acquisition method and H_0 is a preset intermediate similarity threshold; it can be understood that the target number of a similarity acquisition method is the number of key field names for which that method yields H_re greater than H_0.
S35, acquiring from U the similarity acquisition method corresponding to the maximum target number as the method for acquiring the first similarity.
By this method, the most effective similarity acquisition method can be determined, which prevents inaccurate similarities from degrading the determination of the target field name and the target character string and thereby preventing accurate analysis of the data tables uploaded by different data sources.
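The selection in steps S31 to S35 can be sketched as a simple count per candidate method. The text does not say how a single H_re is derived from one key field name and the whole preset field name list, so the sketch assumes it is the best score against any preset field name. The two candidate measures and all function names are hypothetical stand-ins.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List

SimilarityFn = Callable[[str, str], float]


def ratio_similarity(a: str, b: str) -> float:
    """Edit-based similarity from the standard library."""
    return SequenceMatcher(None, a, b).ratio()


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the character sets of two names."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def pick_similarity_method(key_names: List[str], preset_names: List[str],
                           methods: Dict[str, SimilarityFn], h0: float) -> str:
    """Return the label of the method whose target number U_e is largest."""
    counts: Dict[str, int] = {}
    for label, fn in methods.items():
        # Assumption: H_re is the best score of key name r against any preset name.
        counts[label] = sum(
            1 for key in key_names
            if max((fn(key, p) for p in preset_names), default=0.0) > h0
        )
    # S35: largest target number wins; ties go to the method listed first.
    return max(counts, key=counts.get)


chosen = pick_similarity_method(
    ["order_id", "ordernum", "city"],
    ["order_id", "city_name"],
    {"ratio": ratio_similarity, "jaccard": jaccard_similarity},
    h0=0.6,
)
```

The returned label identifies which measure to reuse when computing the first similarity in S4011.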
S4013, traversing the first similarity list corresponding to the key field name, and when a first similarity is not less than a preset first similarity threshold, taking the preset field name corresponding to that first similarity as an intermediate field name.
S4015, acquiring, according to the intermediate field names, the intermediate character strings corresponding to the intermediate field names from the preset character strings corresponding to all the preset field names.
S403, acquiring a target field name and a target character string corresponding to the key data list according to the intermediate character string and the key character string.
Further, the step S403 includes the following steps, as shown in fig. 8:
S4031, acquiring a second similarity list corresponding to the key character string according to the intermediate character string and the key character string.
Further, the second similarity list corresponding to the key character string includes a plurality of second similarities corresponding to the key character string, where the second similarities refer to similarities between the first word vector corresponding to the key character string and the second word vector corresponding to any one of the intermediate character strings.
Preferably, the method further comprises the steps of obtaining the first word vector:
S41, performing word segmentation processing on the key character string to obtain a keyword list corresponding to the key character string, wherein the keyword list comprises a plurality of keywords; those skilled in the art will appreciate that any word segmentation method in the prior art falls within the protection scope of the invention; preferably, the word segmentation method is the IK Analyzer word segmentation method.
S43, obtaining a keyword vector corresponding to each keyword, wherein the method of obtaining the keyword vector corresponding to a keyword is the same as the method of obtaining the first field vector, and is not described herein again.
S45, obtaining the first word vector A = (A_1, ……, A_x, ……, A_y) based on the keyword vector corresponding to each keyword, where A_x is the bit value of the x-th bit in the first word vector, x = 1, ……, y, and y is the dimension of the first word vector, wherein A_x satisfies the following condition:
[formula image DEST_PATH_IMAGE006, not reproduced in the text]
where C_xq is the bit value of the x-th bit of the keyword vector corresponding to the q-th keyword in the keyword list, q = 1, ……, p, and p is the number of keywords in the keyword list.
By the method, the word vectors corresponding to the key data list can be accurately acquired, so that the similarity between the word vectors and the preset field names can be conveniently acquired, and the data tables uploaded by different data sources can be accurately analyzed.
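Steps S41 to S45 can be sketched as follows. The formula image for A_x is not reproduced in this text, so the element-wise mean A_x = (1/p) · Σ_q C_xq used below is an assumption consistent with the symbol definitions. Whitespace splitting stands in for IK Analyzer, the `toy` embedding table is invented for illustration, and cosine similarity is only one possible choice for the second similarity.

```python
from typing import Dict, List

Vector = List[float]


def word_vector(text: str, embeddings: Dict[str, Vector], dim: int) -> Vector:
    """Combine the keyword vectors of a character string into one word vector."""
    # S41: whitespace split stands in for a real word segmenter such as IK Analyzer.
    tokens = [t for t in text.lower().split() if t in embeddings]
    if not tokens:
        return [0.0] * dim
    # S43/S45 (assumed form): A_x is the mean of the x-th components C_xq.
    return [sum(embeddings[t][x] for t in tokens) / len(tokens) for x in range(dim)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, used here as an example similarity between word vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0


# Hypothetical two-dimensional keyword vectors.
toy = {"order": [1.0, 0.0], "amount": [0.0, 1.0], "total": [0.0, 1.0]}
a = word_vector("order amount", toy, dim=2)  # [0.5, 0.5]
b = word_vector("total", toy, dim=2)         # [0.0, 1.0]
score = cosine(a, b)
```

The same `word_vector` helper would apply unchanged to an intermediate character string, since the two word vectors are built by parallel steps.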
Preferably, the method further comprises the step of obtaining the second word vector:
S51, performing word segmentation processing on the intermediate character string to obtain an intermediate word list corresponding to the intermediate character string, wherein the intermediate word list comprises a plurality of intermediate words; those skilled in the art will appreciate that any word segmentation method in the prior art falls within the protection scope of the invention; preferably, the word segmentation method is the IK Analyzer word segmentation method.
S53, obtaining an intermediate word vector corresponding to each intermediate word, wherein the method of obtaining the intermediate word vector corresponding to an intermediate word is the same as the method of obtaining the second field vector, and is not described herein again.
S55, obtaining the second word vector B = (B_1, ……, B_x, ……, B_y) based on the intermediate word vector corresponding to each intermediate word, where B_x is the bit value of the x-th bit in the second word vector, wherein B_x satisfies the following condition:
[formula image DEST_PATH_IMAGE008, not reproduced in the text]
where D_xt is the bit value of the x-th bit of the intermediate word vector corresponding to the t-th intermediate word in the intermediate word list, t = 1, ……, s, and s is the number of intermediate words in the intermediate word list.
Specifically, the method for acquiring the second similarity is the same as the method for acquiring the first similarity, and is not repeated here.
S4033, traversing the second similarity list corresponding to the key character string, and when a second similarity is not smaller than a preset second similarity threshold, taking the intermediate character string corresponding to that second similarity as a designated character string.
S4035, the number of the designated character strings is obtained, and when the number of the designated character strings is equal to 1, the preset field names corresponding to the designated character strings are used as target field names, and the designated character strings are used as target character strings corresponding to the target field names.
S4037, when the number of the designated character strings is not equal to 1, screening out target field names and target character strings corresponding to the target field names from all the designated character strings; those skilled in the art will appreciate that the target character string and the target field name can be manually filtered, and will not be described in detail herein.
By means of the method and the device, the field names of the initial data list can be converted into the industry standard field names of the initial data list, and further the situation that the data lists uploaded by different users cannot be analyzed in a centralized mode due to the fact that the field names of the data lists uploaded by different data sources are inconsistent is avoided.
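The decision logic of steps S4033 to S4037 can be sketched as follows. The function name, the dictionary inputs and the threshold values are hypothetical; the manual screening of S4037 is represented only by returning the candidates for review.

```python
from typing import Dict, List, Optional, Tuple


def pick_target(second_similarities: Dict[str, float],
                field_of_string: Dict[str, str],
                threshold: float) -> Tuple[Optional[Tuple[str, str]], List[str]]:
    """Return ((target field name, target string) or None, designated strings)."""
    # S4033: keep intermediate strings whose second similarity reaches the threshold.
    designated = [s for s, score in second_similarities.items() if score >= threshold]
    # S4035: exactly one designated string resolves automatically.
    if len(designated) == 1:
        text = designated[0]
        return (field_of_string[text], text), designated
    # S4037: zero or several candidates are left for manual screening.
    return None, designated


scores = {"order amount": 0.91, "order id": 0.40}
fields = {"order amount": "amount", "order id": "order_id"}
resolved, candidates = pick_target(scores, fields, threshold=0.8)
```

Here `scores` maps each intermediate character string to its second similarity with one key character string, and `fields` maps each intermediate character string back to its preset field name.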
Specifically, the method further includes the following steps, as shown in fig. 9:
S500, replacing the key character string corresponding to the target character string with the target character string.
S600, replacing the key field name corresponding to the target field name with the target field name, so that the fields and character strings in the key data list are replaced with standardized fields and character strings and the data tables of different data sources in the same industry are unified.
This embodiment provides a method for correcting field information of a data table, which can convert the field names of the initial data list into industry-standard field names and thereby avoids the situation in which data tables uploaded by different data sources cannot be analyzed centrally because their field names are inconsistent; in addition, the fields and character strings in the key data list can be replaced by standardized fields and character strings, so that the data tables of different data sources in the same industry are unified.
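The replacement in step S600 amounts to renaming columns once the mapping from key field names to target field names is known. The sketch below assumes rows are held as plain dictionaries and that the mapping has already been produced by the earlier steps; the names `standardize_rows`, `uploaded` and `mapping` are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, object]


def standardize_rows(rows: List[Row], rename: Dict[str, str]) -> List[Row]:
    """Return copies of the rows with key field names replaced by target field names."""
    return [{rename.get(name, name): value for name, value in row.items()} for row in rows]


uploaded = [{"amt": 12.5, "cty": "Hangzhou"}, {"amt": 3.0, "cty": "Ningbo"}]
mapping = {"amt": "amount", "cty": "city"}  # key field name -> target field name
unified = standardize_rows(uploaded, mapping)
```

Field names absent from the mapping are kept unchanged, and the uploaded rows are not modified in place, so the original table stays available for audit.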
As shown in fig. 10, the second embodiment provides a field information correcting apparatus for a data table, the apparatus includes:
a data obtaining module 10, where the data obtaining module 10 is configured to obtain a sample data set and a key data list.
As shown in fig. 11, the apparatus further includes:
a first data list obtaining module 11, where the first data list obtaining module 11 is configured to obtain any first original data list from a first original data set as a first data list, where the first original data set includes: a plurality of first original data lists, wherein each first original data list belongs to data lists of the same industry.
A second data list obtaining module 13, where the second data list obtaining module 13 is configured to obtain any associated data list corresponding to the first data list from the associated data set corresponding to the first data list as a second data list, where any associated data list corresponding to the first data list is any one of the first original data lists except the first data list in the first original data set.
A first common field name obtaining module 15, where the first common field name obtaining module 15 is configured to obtain a first common field name between the first data list and the second data list based on the first data list and the second data list, where the first common field name is a field name having a same field name and a structure type corresponding to the same field name in the first field list and the second field list.
Further, the first field list is a field list constructed based on all field names in the first data list.
Further, the second field list is a field list constructed based on all field names in the second data list.
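As a minimal illustrative sketch (not part of the described apparatus), the common-field-name comparison above might look as follows in Python. The representation of a field list as a mapping from field name to structure type, the function name `common_field_names`, and the sample field names are all assumptions introduced for illustration.

```python
from typing import Dict, List

# Assumption: a field list is modelled as {field_name: structure_type}.
FieldList = Dict[str, str]


def common_field_names(first_fields: FieldList, second_fields: FieldList) -> List[str]:
    """Return field names found in both lists with the same structure type."""
    return [
        name
        for name, structure_type in first_fields.items()
        if second_fields.get(name) == structure_type
    ]


# Hypothetical field lists built from two data lists of the same industry.
first_field_list = {"user_id": "string", "age": "int", "city": "string"}
second_field_list = {"user_id": "string", "age": "string", "region": "string"}

common = common_field_names(first_field_list, second_field_list)
print(common)
```

In this example `age` is excluded because its structure type differs between the two lists, and `city` is excluded because it is absent from the second list.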
A sample data set obtaining module 17, where the sample data set obtaining module 17 is configured to obtain a sample data set based on the first common field name.
As shown in fig. 12, the sample data set obtaining module 17 further includes:
a first field ratio obtaining module 171, where the first field ratio obtaining module 171 is configured to obtain a field ratio K_1 corresponding to the first data list, where K_1 meets the following condition:
[Formula for K_1: image DEST_PATH_IMAGE010, not reproduced in this text]
where N is the number of first original data lists in the first original data set, M_i is the number of first common field names between the first data list and the i-th second data list, and M_0 is the number of field names in the first field list.
A sample data list obtaining module 173, where the sample data list obtaining module 173 is configured to determine that the first data list is a sample data list when K_1 ≥ K_0, where K_0 is a preset field ratio threshold that those skilled in the art set according to actual requirements; details are not described herein.
A non-sample data list obtaining module 175, where the non-sample data list obtaining module 175 is configured to determine that the first data list is a non-sample data list when K_1 < K_0.
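The threshold decision above can be sketched as follows. Because the formula for the field ratio is only available as an image, this sketch assumes that K_1 is the average, over the other lists in the set, of M_i / M_0; that form is an assumption, not the formula of the source. The names `field_ratio` and `split_sample_lists` and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

FieldList = Dict[str, str]  # assumption: {field_name: structure_type}


def count_common(a: FieldList, b: FieldList) -> int:
    """M_i: number of field names shared by a and b with the same structure type."""
    return sum(1 for name, kind in a.items() if b.get(name) == kind)


def field_ratio(first: FieldList, others: List[FieldList]) -> float:
    """Assumed form of K_1: mean of M_i / M_0 over all other lists in the set."""
    m0 = len(first)
    if m0 == 0 or not others:
        return 0.0
    return sum(count_common(first, other) for other in others) / (len(others) * m0)


def split_sample_lists(data_set: List[FieldList], k0: float) -> Tuple[List[int], List[int]]:
    """Classify each list index as sample (K_1 >= K_0) or non-sample (K_1 < K_0)."""
    sample, non_sample = [], []
    for idx, first in enumerate(data_set):
        others = data_set[:idx] + data_set[idx + 1:]
        (sample if field_ratio(first, others) >= k0 else non_sample).append(idx)
    return sample, non_sample


data_set = [
    {"id": "string", "name": "string"},
    {"id": "string", "name": "string"},
    {"x": "int", "y": "int"},
]
sample_idx, non_sample_idx = split_sample_lists(data_set, k0=0.5)
print(sample_idx, non_sample_idx)
```

The same structure would apply to the field ratio K_2 of a third data list, with the sample data lists taking the place of the other first original data lists.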
With this method, the field names of the sample data lists in the sample data set are highly similar to one another, so the unified field names corresponding to the sample data lists can be determined. This further avoids the situation in which data tables uploaded by different data sources cannot be analyzed centrally because their field names are inconsistent.
As shown in fig. 13, the apparatus further includes:
a third data list obtaining module 21, where the third data list obtaining module 21 is configured to obtain any one of the second original data lists from the second original data set as a third data list.
Further, the second original data set includes a plurality of second original data lists, where each second original data list belongs to the same industry as the first original data lists, and the data source corresponding to each second original data list differs from the data sources corresponding to the first original data lists.
A second common field name obtaining module 23, where the second common field name obtaining module 23 is configured to obtain a second common field name between any sample data list in the sample data set and the third data list based on the sample data list and the third data list.
Further, the second common field name is a field name that appears in both the preset field list and the third field list and has the same structure type in both lists.
Further, the third field list is a field list constructed based on all field names in the third data list.
Further, the preset field list is a field list constructed based on all field names in the sample data list.
A key data list obtaining module 25, where the key data list obtaining module 25 is configured to obtain a key data list based on the second common field name.
As shown in fig. 14, the key data list obtaining module 25 further includes:
a second field ratio obtaining module 251, where the second field ratio obtaining module 251 is configured to obtain a field ratio K_2 corresponding to the third data list, where K_2 meets the following condition:
[Formula for K_2: image DEST_PATH_IMAGE012, not reproduced in this text]
where Z is the number of sample data lists in the sample data set, G_j is the number of second common field names between the third data list and the j-th sample data list, and G_0 is the number of field names in the third field list.
A key data list obtaining module 253, where the key data list obtaining module 253 is configured to determine that the third data list is a key data list when K_2 ≥ K_0.
A non-key data list obtaining module 255, where the non-key data list obtaining module 255 is configured to determine that the third data list is a non-key data list when K_2 < K_0.
A sample field list obtaining module 20, configured to obtain a preset field list according to the sample data set, where the preset field list includes a plurality of preset field names and a preset character string corresponding to each preset field name.
Specifically, the sample field list obtaining module 20 further includes:
an initial field name list obtaining module 201, where the initial field name list obtaining module 201 is configured to obtain all the first common field names and perform deduplication processing on all the first common field names to obtain an initial field name list.
An original data table quantity obtaining module 203, where the original data table quantity obtaining module 203 is configured to obtain, according to the initial field name list, the number of original data tables corresponding to each initial field name in the initial field name list;
a preset field name determining module 205, where the preset field name determining module 205 is configured to determine that an initial field name is a preset field name when the number of original data tables corresponding to the initial field name is not less than a preset data table number threshold;
an initial string list building module 207, where the initial string list building module 207 is configured to obtain, according to the preset field name, all character strings corresponding to the preset field name from the original data lists, and build an initial string list corresponding to the preset field name, where an initial string is any one of the character strings in the initial string list corresponding to the preset field name;
a character string similarity list obtaining module 209, where the character string similarity list obtaining module 209 is configured to traverse the initial string list corresponding to the preset field name and obtain a character string similarity list corresponding to the preset field name, where each character string similarity is the sum of the similarities between one initial string corresponding to the preset field name and all the other initial strings in the same list;
a preset character string determination module 2011, where the preset character string determination module 2011 is configured to traverse the character string similarity list corresponding to the preset field name, obtain the maximum character string similarity from the list, and use the initial string corresponding to the maximum character string similarity as the preset character string corresponding to the preset field name.
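The selection of preset field names and preset character strings described above can be sketched as follows. This is an illustrative sketch under stated assumptions: each original data table is modelled as a mapping from field name to its descriptive character string, and the string similarity is taken from Python's standard `difflib.SequenceMatcher`, which the source does not prescribe. The function name `select_preset_fields` and the sample tables are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List

# Assumption: an original data table is modelled as {field_name: descriptive string}.
Table = Dict[str, str]


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1]; the source does not fix a method."""
    return SequenceMatcher(None, a, b).ratio()


def select_preset_fields(initial_names: List[str], tables: List[Table],
                         min_tables: int) -> Dict[str, str]:
    """Map each preset field name to its preset character string."""
    preset: Dict[str, str] = {}
    for name in dict.fromkeys(initial_names):  # de-duplicate, keeping order
        strings = [table[name] for table in tables if name in table]
        if len(strings) < min_tables:  # too few tables use this field name
            continue
        # Summed similarity of each initial string to all the other strings.
        scores = [
            sum(similarity(s, other) for j, other in enumerate(strings) if j != i)
            for i, s in enumerate(strings)
        ]
        preset[name] = strings[scores.index(max(scores))]
    return preset


tables = [
    {"uid": "user identifier", "ts": "timestamp"},
    {"uid": "user identifier"},
    {"uid": "id of the user", "ts": "event time"},
]
preset_fields = select_preset_fields(["uid", "ts", "uid"], tables, min_tables=3)
print(preset_fields)
```

Here `ts` is dropped because it appears in only two tables, and the string chosen for `uid` is the one closest to all the others, which acts as a representative description.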
A key field list obtaining module 30, configured to obtain a key field list according to the key data list, where the key field list includes a plurality of key field names and a key character string corresponding to each key field name.
Specifically, the key field name is any field name in the key data list.
Specifically, the key character string is a character string describing the key field name.
And a target field information obtaining module 40, configured to obtain target field information corresponding to the key data list based on the preset field list and the key field list, where the target field information includes a target field name and a target character string corresponding to the target field name.
As shown in fig. 15, the apparatus further includes:
an intermediate field list obtaining module 401, where the intermediate field list obtaining module 401 is configured to obtain an intermediate field list corresponding to the key data list according to the key field names and the preset field names, and the intermediate field list includes a plurality of intermediate field names and an intermediate character string corresponding to each intermediate field name.
As shown in fig. 16, the intermediate field list obtaining module 401 further includes:
a first similarity list obtaining module 4011, where the first similarity list obtaining module 4011 is configured to obtain, according to the key field name and the preset field name, a first similarity list corresponding to the key field name.
Further, the first similarity list corresponding to the key field names includes a plurality of first similarities corresponding to the key field names, where the first similarities refer to similarities between the first field vectors corresponding to the key field names and the second field vectors corresponding to any one of the preset field names.
Preferably, the apparatus further comprises:
a first intermediate similarity set obtaining module 31, where the first intermediate similarity set obtaining module 31 is configured to obtain, according to the key field name list and the preset field name list, a first intermediate similarity set H = {H_1, …, H_r, …, H_g}, H_r = {H_r1, …, H_re, …, H_rf}, where H_re is the first intermediate similarity obtained for the r-th key field name by the e-th similarity obtaining method, r = 1, …, g, g is the number of key field names, e = 1, …, f, and f is the number of similarity obtaining methods, each first intermediate similarity being obtained by a different similarity obtaining method; preferably, g = G_0.
A target number list obtaining module 33, where the target number list obtaining module 33 is configured to obtain, when H_re > H_0, a target number list U = {U_1, …, U_e, …, U_f} corresponding to H, where U_e is the target number corresponding to the e-th similarity obtaining method and H_0 is a preset intermediate similarity threshold; it can be understood that the target number is the number of key field names that satisfy H_re > H_0 when a given similarity obtaining method is used.
A similarity obtaining method determining module 35, where the similarity obtaining method determining module 35 is configured to obtain, from U, the similarity obtaining method corresponding to the maximum target number as the first similarity obtaining method.
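A minimal sketch of this method-selection step is given below. It assumes that H_re is the best similarity between the r-th key field name and any preset field name under method e, which the text does not state explicitly. The two candidate methods (`exact_match` and a `difflib`-based ratio), the function name `choose_method`, and the sample names are illustrative assumptions.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple

SimilarityMethod = Callable[[str, str], float]


def exact_match(a: str, b: str) -> float:
    """Hypothetical method 1: similarity is 1.0 only for identical names."""
    return 1.0 if a == b else 0.0


def sequence_ratio(a: str, b: str) -> float:
    """Hypothetical method 2: difflib matching-block ratio."""
    return SequenceMatcher(None, a, b).ratio()


def choose_method(key_names: List[str], preset_names: List[str],
                  methods: Dict[str, SimilarityMethod],
                  h0: float) -> Tuple[str, Dict[str, int]]:
    """Return the method with the largest target number U_e, plus all U_e."""
    target_counts: Dict[str, int] = {}
    for label, method in methods.items():
        # U_e: key field names whose best similarity (assumed H_re) exceeds H_0.
        target_counts[label] = sum(
            1 for key in key_names
            if max(method(key, preset) for preset in preset_names) > h0
        )
    best = max(target_counts, key=target_counts.get)
    return best, target_counts


methods = {"exact_match": exact_match, "sequence_ratio": sequence_ratio}
best_method, counts = choose_method(
    ["user_id", "phone_no"], ["userid", "phone_number"], methods, h0=0.8
)
print(best_method, counts)
```

Selecting the method that clears the threshold for the most key field names favors the similarity measure that best fits the naming habits of the data source at hand.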
An intermediate field name determining module 4013, where the intermediate field name determining module 4013 is configured to traverse the first similarity list corresponding to the key field name and, when a first similarity is not less than a preset first similarity threshold, use the preset field name corresponding to that first similarity as an intermediate field name.
An intermediate character string determining module 4015, where the intermediate character string determining module 4015 is configured to obtain, according to the intermediate field names, the intermediate character string corresponding to each intermediate field name from the preset character strings corresponding to all the preset field names.
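The construction of the intermediate field list can be sketched as below. The source compares field vectors; this sketch substitutes a string-based similarity from `difflib` purely so that it runs on its own, and that substitution is an assumption. The function name `intermediate_field_list` and the sample preset fields are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict


def first_similarity(key_name: str, preset_name: str) -> float:
    """Stand-in for the similarity between the two field vectors."""
    return SequenceMatcher(None, key_name, preset_name).ratio()


def intermediate_field_list(key_name: str, preset_fields: Dict[str, str],
                            threshold: float) -> Dict[str, str]:
    """Keep preset fields whose first similarity to the key field name is
    not less than the threshold, together with their preset strings."""
    return {
        preset_name: preset_string
        for preset_name, preset_string in preset_fields.items()
        if first_similarity(key_name, preset_name) >= threshold
    }


# Hypothetical preset field list: {preset field name: preset character string}.
preset_fields = {
    "userid": "unique identifier of a user",
    "phone_number": "mobile phone number",
}
intermediate = intermediate_field_list("user_id", preset_fields, threshold=0.8)
print(intermediate)
```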
A target information determining module 403, where the target information determining module 403 is configured to obtain, according to the intermediate character string and the key character string, a target field name and a target character string corresponding to the key data list.
As shown in fig. 17, the target information determination module 403 further includes:
a second similarity list obtaining module 4031, where the second similarity list obtaining module 4031 is configured to obtain, according to the intermediate character string and the key character string, a second similarity list corresponding to the key character string.
Further, the second similarity list corresponding to the key character string includes a plurality of second similarities corresponding to the key character string, where the second similarities refer to similarities between the first word vector corresponding to the key character string and the second word vector corresponding to any one of the intermediate character strings.
Preferably, the apparatus further comprises:
a keyword list obtaining module 41, where the keyword list obtaining module 41 is configured to perform word segmentation on the key character string to obtain a keyword list corresponding to the key character string, where the keyword list includes a plurality of keywords; those skilled in the art will appreciate that any word segmentation processing method in the prior art falls within the protection scope of the present invention; preferably, the word segmentation processing method is the IK Analyzer word segmentation processing method.
A keyword vector obtaining module 43, where the keyword vector obtaining module 43 is configured to obtain a keyword vector corresponding to each keyword, where a method for obtaining the keyword vector corresponding to the keyword is consistent with a method for obtaining the first field vector, and is not described herein again.
A first word vector obtaining module 45, where the first word vector obtaining module 45 is configured to obtain the first word vector A = (A_1, …, A_x, …, A_y) based on the keyword vector corresponding to each keyword, where A_x is the value of the x-th bit of the first word vector, x = 1, …, y, and y is the dimension of the first word vector, where A_x meets the following condition:
[Formula for A_x: image DEST_PATH_IMAGE014, not reproduced in this text]
where C_xq is the value of the x-th bit corresponding to the q-th keyword in the keyword list, q = 1, …, p, and p is the number of keywords in the keyword list.
Preferably, the apparatus further comprises:
an intermediate word list obtaining module 51, where the intermediate word list obtaining module 51 is configured to perform word segmentation on the intermediate character string to obtain an intermediate word list corresponding to the intermediate character string, where the intermediate word list includes a plurality of intermediate words, and a person skilled in the art knows that any word segmentation processing method in the prior art falls within the protection scope of the present invention; preferably, the word segmentation processing method is an IK Analyzer word segmentation processing method.
An intermediate word vector obtaining module 53, where the intermediate word vector obtaining module 53 is configured to obtain an intermediate word vector corresponding to each intermediate word, where the method for obtaining the intermediate word vector corresponding to an intermediate word is consistent with the method for obtaining the second field vector, and is not described herein again.
An intermediate word vector obtaining module 55, where the intermediate word vector obtaining module 55 is configured to obtain the second word vector B = (B_1, …, B_x, …, B_y) based on the intermediate word vector corresponding to each intermediate word, where B_x is the value of the x-th bit of the second word vector, where B_x meets the following condition:
[Formula for B_x: image 224087DEST_PATH_IMAGE008, not reproduced in this text]
where D_xt is the value of the x-th bit corresponding to the t-th intermediate word in the intermediate word list, t = 1, …, s, and s is the number of intermediate words in the intermediate word list.
Specifically, the method for acquiring the second similarity is the same as the method for acquiring the first similarity, and is not repeated here.
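The word-vector aggregation and the second similarity can be sketched as follows. Two points are assumptions, since the formulas are only available as images and the similarity method is not named in this passage: each bit of the aggregated vector is taken as the arithmetic mean of the corresponding bits of the per-word vectors, and the similarity between the two aggregated vectors is taken as cosine similarity. The function names and the toy two-dimensional vectors are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def mean_vector(word_vectors: List[Vector]) -> Vector:
    """Assumed aggregation: bit x is the mean of bit x over all word vectors."""
    if not word_vectors:
        raise ValueError("at least one word vector is required")
    count = len(word_vectors)
    return [sum(column) / count for column in zip(*word_vectors)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Assumed similarity measure between the first and second word vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy keyword vectors for a key character string (p = 2, y = 2).
keyword_vectors = [[1.0, 0.0], [0.0, 1.0]]
# Toy intermediate word vectors for an intermediate character string (s = 1).
intermediate_word_vectors = [[1.0, 1.0]]

first_word_vector = mean_vector(keyword_vectors)
second_word_vector = mean_vector(intermediate_word_vectors)
second_similarity = cosine_similarity(first_word_vector, second_word_vector)
print(first_word_vector, second_word_vector, second_similarity)
```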
A designated character string determination module 4033, where the designated character string determination module 4033 is configured to traverse the second similarity list corresponding to the key character string and, when a second similarity is not less than a preset second similarity threshold, use the intermediate character string corresponding to that second similarity as a designated character string.
A first execution module 4035, where the first execution module 4035 is configured to obtain the number of the specified character strings, and when the number of the specified character strings is equal to 1, use a preset field name corresponding to the specified character string as a target field name and use the specified character string as a target character string corresponding to the target field name.
A second execution module 4037, where the second execution module 4037 is configured to select, when the number of the designated character strings is not equal to 1, a target field name and a target character string corresponding to the target field name from all the designated character strings; those skilled in the art will appreciate that the target character string and the target field name can be selected manually, and details are not described herein.
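The decision made by the designated character string determination module and the two execution modules can be sketched as below. The data shapes are assumptions: the intermediate field list is a mapping from intermediate field name to intermediate character string, and the second similarities are keyed by the same field names. The function name `resolve_target` and the sample values are hypothetical, and the manual selection step is represented only by returning the candidates.

```python
from typing import Dict, List, Optional, Tuple

FieldInfo = Tuple[str, str]  # (field name, character string)


def resolve_target(intermediate_fields: Dict[str, str],
                   second_similarities: Dict[str, float],
                   threshold: float) -> Tuple[Optional[FieldInfo], List[FieldInfo]]:
    """Return (target, candidates). The target is set only when exactly one
    designated character string exists; otherwise the candidates are handed
    over for selection, which may be manual."""
    candidates = [
        (name, string)
        for name, string in intermediate_fields.items()
        if second_similarities.get(name, 0.0) >= threshold
    ]
    target = candidates[0] if len(candidates) == 1 else None
    return target, candidates


intermediate_fields = {
    "userid": "unique identifier of a user",
    "uid": "user id",
}
second_similarities = {"userid": 0.91, "uid": 0.42}

target, candidates = resolve_target(intermediate_fields, second_similarities, 0.8)
ambiguous_target, ambiguous = resolve_target(intermediate_fields, second_similarities, 0.4)
print(target, ambiguous_target, len(ambiguous))
```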
As shown in fig. 18, the apparatus includes:
a first replacing module 500, where the first replacing module 500 is configured to replace the key character string corresponding to the target character string with the target character string.
A second replacing module 600, where the second replacing module 600 is configured to replace the key field name corresponding to the target field name with the target field name.
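The replacement step can be sketched as follows, reading it as writing the target field name and target character string into the key data list in place of the original key field name and key character string. The data shapes (rows as dictionaries, a separate mapping of field descriptions), the function name `correct_key_data_list`, and the sample values are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def correct_key_data_list(rows: List[Row], key_strings: Dict[str, str],
                          targets: Dict[str, Tuple[str, str]]
                          ) -> Tuple[List[Row], Dict[str, str]]:
    """Rename key fields to their target field names and swap in target strings.

    targets maps a key field name to (target field name, target character string);
    fields without a target are left unchanged.
    """
    rename = {key: target_name for key, (target_name, _) in targets.items()}
    new_rows = [{rename.get(k, k): v for k, v in row.items()} for row in rows]
    new_strings = {
        rename.get(k, k): (targets[k][1] if k in targets else s)
        for k, s in key_strings.items()
    }
    return new_rows, new_strings


rows = [{"user_id": 7, "city": "Hangzhou"}]
key_strings = {"user_id": "id of user", "city": "city name"}
targets = {"user_id": ("userid", "unique identifier of a user")}

corrected_rows, corrected_strings = correct_key_data_list(rows, key_strings, targets)
print(corrected_rows, corrected_strings)
```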
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
acquiring a sample data set and a key data list;
acquiring a preset field list according to the sample data set;
acquiring a key field list according to the key data list;
and acquiring target field information corresponding to the key data list based on the preset field list and the key field list, wherein the target field information comprises a target field name and a target character string corresponding to the target field name.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring a sample data set and a key data list;
acquiring a preset field list according to the sample data set;
acquiring a key field list according to the key data list;
and acquiring target field information corresponding to the key data list based on the preset field list and the key field list, wherein the target field information comprises a target field name and a target character string corresponding to the target field name.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program instructing relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
Although the present invention has been described with reference to a preferred embodiment, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (14)

1. A method for correcting field information of a data table, the method comprising the steps of:
acquiring a sample data set and a key data list;
acquiring a preset field list according to the sample data set;
acquiring a key field list according to the key data list;
and acquiring target field information corresponding to the key data list based on the preset field list and the key field list, wherein the target field information comprises a target field name and a target character string corresponding to the target field name.
2. The method for correcting field information of a data table according to claim 1, further comprising the following steps before acquiring the sample data set and the key data list:
acquiring any first original data list from the first original data set as a first data list;
acquiring any associated data list corresponding to the first data list from the associated data set corresponding to the first data list as a second data list, wherein any associated data list corresponding to the first data list is any one of the first original data lists except the first data list in the first original data set;
acquiring a first common field name between the first data list and the second data list based on the first data list and the second data list;
and acquiring a sample data set based on the first common field name.
3. The method for correcting field information of a data table according to claim 2, further comprising the following steps before acquiring the sample data set and the key data list:
acquiring any one second original data list from the second original data set as a third data list;
acquiring a second common field name between the sample data list and the third data list based on any sample data list in the sample data set and the third data list;
and acquiring a key data list based on the second common field name.
4. The method for correcting field information of a data table according to claim 1, wherein the preset field list includes a plurality of preset field names and a preset character string corresponding to each of the preset field names.
5. The method for correcting field information of a data table according to claim 4, wherein the key field list includes a plurality of key field names and a key string corresponding to each key field name.
6. The method for correcting field information of a data table according to claim 5, wherein the step of obtaining the target field information corresponding to the key data list based on the preset field list and the key field list comprises the steps of:
acquiring an intermediate field list corresponding to the key data list according to the key field names and the preset field names, wherein the intermediate field list comprises a plurality of intermediate field names and intermediate character strings corresponding to each intermediate field name;
and acquiring a target field name corresponding to the key data list and a target character string corresponding to the target field name according to the intermediate character string and the key character string.
7. An apparatus for correcting field information of a data table, the apparatus comprising:
the data acquisition module is used for acquiring a sample data set and a key data list;
the sample field list acquisition module is used for acquiring a preset field list according to the sample data set;
a key field list obtaining module, configured to obtain a key field list according to the key data list;
and the target field information acquisition module is used for acquiring target field information corresponding to the key data list based on the preset field list and the key field list, wherein the target field information comprises a target field name and a target character string corresponding to the target field name.
8. The apparatus for correcting field information of a data table according to claim 7, wherein said apparatus further comprises:
a first data list obtaining module, configured to obtain any one first original data list from a first original data set as a first data list;
a second data list obtaining module, configured to obtain any associated data list corresponding to the first data list from the associated data set corresponding to the first data list as a second data list, where any associated data list corresponding to the first data list is any one of the first original data lists in the first original data set except the first data list;
a first common field name obtaining module, configured to obtain a first common field name between the first data list and the second data list based on the first data list and the second data list;
and the sample data set acquisition module is used for acquiring a sample data set based on the first common field name.
9. The apparatus for correcting field information of a data table according to claim 8, wherein said apparatus further comprises:
a third data list obtaining module, configured to obtain any one of the second original data lists from the second original data set as a third data list;
a second common field name obtaining module, configured to obtain a second common field name between the first data list and the third data list based on the first data list and the third data list;
and the key data list acquisition module is used for acquiring a key data list based on the second common field name.
10. The apparatus for correcting field information of a data table according to claim 7, wherein the preset field list includes a plurality of preset field names and a preset character string corresponding to each of the preset field names.
11. The apparatus for correcting field information of a data table according to claim 10, wherein the key field list includes a plurality of key field names and a key string corresponding to each key field name.
12. The apparatus for correcting field information of a data table according to claim 11, wherein the target field information acquiring module further comprises:
the intermediate field list acquisition module is used for acquiring an intermediate field list corresponding to the key data list according to the key field names and the preset field names, wherein the intermediate field list comprises a plurality of intermediate field names and intermediate character strings corresponding to each intermediate field name;
and the target information determining module is used for acquiring a target field name corresponding to the key data list and a target character string corresponding to the target field name according to the intermediate character string and the key character string.
13. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements a field information correcting method of a data table according to any one of claims 1 to 6 when executing the computer program.
14. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements a field information correcting method of a data table according to any one of claims 1 to 6.
CN202210916472.6A 2022-08-01 2022-08-01 Method, device, equipment and medium for correcting field information of data table Active CN114996280B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210916472.6A CN114996280B (en) 2022-08-01 2022-08-01 Method, device, equipment and medium for correcting field information of data table

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210916472.6A CN114996280B (en) 2022-08-01 2022-08-01 Method, device, equipment and medium for correcting field information of data table

Publications (2)

Publication Number Publication Date
CN114996280A true CN114996280A (en) 2022-09-02
CN114996280B CN114996280B (en) 2022-10-25

Family

ID=83020968

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210916472.6A Active CN114996280B (en) 2022-08-01 2022-08-01 Method, device, equipment and medium for correcting field information of data table

Country Status (1)

Country Link
CN (1) CN114996280B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115840742A (en) * 2023-02-13 2023-03-24 每日互动股份有限公司 Data cleaning method, device, equipment and medium
CN117312624A (en) * 2023-11-30 2023-12-29 北京睿企信息科技有限公司 Data processing system for acquiring target data list
CN118708602A (en) * 2024-08-30 2024-09-27 浙江有数数智科技有限公司 A data synchronization method, device, medium and equipment

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111061742A (en) * 2019-12-25 2020-04-24 北京数起科技有限公司 Method and device for marking data and service system thereof
CN111400392A (en) * 2020-06-03 2020-07-10 上海冰鉴信息科技有限公司 Multi-source heterogeneous data processing method and device
CN112241421A (en) * 2019-07-18 2021-01-19 天云融创数据科技(北京)有限公司 Data blood margin determination method and device
CN112364051A (en) * 2020-11-25 2021-02-12 腾讯科技(深圳)有限公司 Data query method and device
US11042668B1 (en) * 2018-04-12 2021-06-22 Datavant, Inc. System for preparing data for expert certification and monitoring data over time to ensure compliance with certified boundary conditions
CN113127460A (en) * 2019-12-31 2021-07-16 北京懿医云科技有限公司 Evaluation method of data cleaning frame, device, equipment and storage medium thereof
CN113505128A (en) * 2021-06-30 2021-10-15 平安科技(深圳)有限公司 Method, device and equipment for creating data table and storage medium
CN113836126A (en) * 2021-09-22 2021-12-24 上海妙一生物科技有限公司 Data cleaning method, device, equipment and storage medium
WO2022037624A1 (en) * 2020-08-19 2022-02-24 第四范式(北京)技术有限公司 Method and apparatus for determining association relationship between data tables, and device
CN114168608A (en) * 2021-12-16 2022-03-11 中科雨辰科技有限公司 Data processing system for updating knowledge graph
CN114580392A (en) * 2022-04-29 2022-06-03 中科雨辰科技有限公司 Data processing system for identifying entity

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11042668B1 (en) * 2018-04-12 2021-06-22 Datavant, Inc. System for preparing data for expert certification and monitoring data over time to ensure compliance with certified boundary conditions
CN112241421A (en) * 2019-07-18 2021-01-19 天云融创数据科技(北京)有限公司 Data blood margin determination method and device
CN111061742A (en) * 2019-12-25 2020-04-24 北京数起科技有限公司 Method and device for marking data and service system thereof
CN113127460A (en) * 2019-12-31 2021-07-16 北京懿医云科技有限公司 Evaluation method of data cleaning frame, device, equipment and storage medium thereof
CN111400392A (en) * 2020-06-03 2020-07-10 上海冰鉴信息科技有限公司 Multi-source heterogeneous data processing method and device
WO2022037624A1 (en) * 2020-08-19 2022-02-24 第四范式(北京)技术有限公司 Method and apparatus for determining association relationship between data tables, and device
CN112364051A (en) * 2020-11-25 2021-02-12 腾讯科技(深圳)有限公司 Data query method and device
CN113505128A (en) * 2021-06-30 2021-10-15 平安科技(深圳)有限公司 Method, device and equipment for creating data table and storage medium
CN113836126A (en) * 2021-09-22 2021-12-24 上海妙一生物科技有限公司 Data cleaning method, device, equipment and storage medium
CN114168608A (en) * 2021-12-16 2022-03-11 中科雨辰科技有限公司 Data processing system for updating knowledge graph
CN114580392A (en) * 2022-04-29 2022-06-03 中科雨辰科技有限公司 Data processing system for identifying entity

Non-Patent Citations (1)

Title
Zhang Liming: "Research and Application of Optimized Data Extraction Methods for Bulk Data", China Excellent Master's Theses Full-text Database *

Cited By (4)

Publication number Priority date Publication date Assignee Title
CN115840742A (en) * 2023-02-13 2023-03-24 每日互动股份有限公司 Data cleaning method, device, equipment and medium
CN117312624A (en) * 2023-11-30 2023-12-29 北京睿企信息科技有限公司 Data processing system for acquiring target data list
CN117312624B (en) * 2023-11-30 2024-02-20 北京睿企信息科技有限公司 Data processing system for acquiring target data list
CN118708602A (en) * 2024-08-30 2024-09-27 浙江有数数智科技有限公司 A data synchronization method, device, medium and equipment

Also Published As

Publication number Publication date
CN114996280B (en) 2022-10-25

Similar Documents

Publication Publication Date Title
CN114996280B (en) Method, device, equipment and medium for correcting field information of data table
Thor et al. Introducing CitedReferencesExplorer (CRExplorer): A program for reference publication year spectroscopy with cited references standardization
KR20190019892A (en) Method and apparatus for constructing a decision model, computer device and storage medium
CN109783604B (en) Information extraction method and device based on small amount of samples and computer equipment
CN109325042B (en) Processing template acquisition method, form processing method, device, equipment and medium
CN112560444A (en) Text processing method and device, computer equipment and storage medium
KR20160100226A (en) Method and device for constructing on-line real-time updating of massive audio fingerprint database
CN111177217A (en) Data preprocessing method, device, computer equipment and storage medium
CN112286934A (en) Database table import method, device, equipment and medium
CN116226154B (en) Upgrading system of cluster database
CN111913945A (en) Data management method and device and storage medium
CN111460268B (en) Method and device for determining database query request and computer equipment
CN116561607A (en) Method and device for detecting abnormality of resource interaction data and computer equipment
Martins et al. Efficient dynamic time warping for big data streams
CN111078671A (en) Method, device, equipment and medium for modifying data table field
CN110647452B (en) Test method, test device, computer equipment and storage medium
CN110399396B (en) Efficient data processing
CN109639283A (en) Workpiece coding method based on decision tree
CN116401212B (en) Personnel file quick searching system based on data analysis
Plevris et al. Literature review of historical masonry structures with machine learning
CN111861100A (en) Work order processing method and device based on process scoring
CN111177132A (en) Label cleaning method, device, equipment and storage medium for relational data
CN112015723A (en) Data classification method, apparatus, computer equipment and storage medium
CN112067514B (en) Soil particle size detection method, system and medium based on geotechnical screening test
CN112148721B (en) Data checking method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant