[go: up one dir, main page]

CN111930819A - ETL script generation method and device - Google Patents

ETL script generation method and device Download PDF

Info

Publication number
CN111930819A
CN111930819A CN202010817816.9A CN202010817816A CN111930819A CN 111930819 A CN111930819 A CN 111930819A CN 202010817816 A CN202010817816 A CN 202010817816A CN 111930819 A CN111930819 A CN 111930819A
Authority
CN
China
Prior art keywords
fields
field
target
pasting
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010817816.9A
Other languages
Chinese (zh)
Other versions
CN111930819B (en
Inventor
刘伟
杜利仲
尹宏春
王晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
ICBC Technology Co Ltd
Original Assignee
ICBC Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ICBC Technology Co Ltd filed Critical ICBC Technology Co Ltd
Priority to CN202010817816.9A priority Critical patent/CN111930819B/en
Publication of CN111930819A publication Critical patent/CN111930819A/en
Application granted granted Critical
Publication of CN111930819B publication Critical patent/CN111930819B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/254Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/288Entity relationship models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/247Thesauruses; Synonyms
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of artificial intelligence, and provides an ETL script generation method and device, wherein the method comprises the following steps: acquiring a table structure of a target table and a table structure of a source table, wherein the table structures comprise: a field; matching fields of the target table with fields of the pasting source table; judging whether all fields in the target table have matched fields in the pasting source table; and if so, generating an ETL script according to the table structure of the target table, the table structure of the pasting source table and the matching result of the target table and the pasting source table. By adopting the technical scheme, the ETL script can be automatically generated, the development workload is reduced, the accuracy is high, the development time is shortened, and the method and the device can adapt to the rapid development of the technology.

Description

ETL script generation method and device
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to an ETL script generation method and device.
Background
The data warehouse is the core of the current big data system, is mainly used for performing offline data analysis and multi-dimensional statistical functions, and has the characteristics of multiple statistical targets, large data quantity, long operation time and the like. In order not to affect the operation of the service system, data is generally captured from the service system periodically during the service valley of the service system, so that the online service system and the offline data analysis system are isolated from each other and can operate independently from each other.
In order to improve the service quality of the industries such as finance and medical treatment and reduce risks, various national relevant regulatory agencies require that various operating agencies upload detailed and statistical operating service data, and as the service systems of various operating agencies are different and have no uniform standard, the data of various operators need to be standardized into a uniform structure meeting the requirements of the regulatory agencies and then reported to the regulatory agencies. Each operation mechanism integrates data in various source pasting tables into a target table by developing an ETL script, and data standardization is achieved.
In the first stage of the ETL development process, the processing logic of the business system needs to be understood, and the structure of the data model, the meaning of the data table and the field storage data, the relationship between tables, the meaning of the field enumeration value (dictionary table), and the like need to be investigated clearly. The second stage of ETL development has several ways:
1. manually writing SQL statements, and inserting data in the pasting source table structure into the target table through SQL statements such as select, insert and the like.
2. And marking the corresponding relation between each field of the source table and each field of the target table in a graphical mode, and automatically generating a select statement and an insert statement through a program.
The two methods need to have a large amount of communication and recording work with customer service personnel, and form a pseudo code and field mapping list file for the next development work. In the ETL development process, a large number of manual operations are carried out, and database languages such as SQL sentences or storage processes are formed according to the business logic and the mapping relation recorded in the last step. Especially, the first method also needs strong SQL development capability, has large workload, low accuracy, long development time, and cannot adapt to the rapid development of the technology.
Disclosure of Invention
In view of the problems in the prior art, the present invention provides an ETL script generating method and apparatus, an electronic device, and a computer-readable storage medium, which can at least partially solve the problems in the prior art.
In order to achieve the purpose, the invention adopts the following technical scheme:
in a first aspect, an ETL script generating method is provided, including:
acquiring a table structure of a target table and a table structure of a source table, wherein the table structures comprise: a field;
matching fields of the target table with fields of the pasting source table;
judging whether all fields in the target table have matched fields in the pasting source table;
and if so, generating an ETL script according to the table structure of the target table, the table structure of the pasting source table and the matching result of the target table and the pasting source table.
Further, the matching the fields of the target table with the fields of the pasting source table includes:
expanding the fields of the source pasting table according to a pre-acquired knowledge base to obtain corresponding near synonyms and/or synonyms, wherein the knowledge base comprises: a physical dictionary, the physical dictionary comprising: vocabulary and its synonyms;
and matching the fields of the target table with the fields of the pasting source table, the corresponding similar meaning words and/or the synonyms.
Further, the knowledge base further comprises: before matching the fields of the target table with the fields of the pasting source table, the relationship between entities further includes:
identifying ambiguous fields in the target table;
and judging the final meaning of the ambiguous field according to the relation between the entities.
Further, the ETL script generating method further includes:
if all the fields in the target table cannot have matched fields in the pasting source table, judging whether the unmatched fields have corresponding business logic or not;
if yes, obtaining the mapping relation between the unmatched field and the field in the source pasting table according to the service logic;
and generating an ETL script according to the table structure of the target table, the table structure of the pasting source table, the matching result of the target table and the pasting source table and the mapping relation.
Further, the ETL script generating method further includes:
identifying whether the field in the target table is a dictionary item or not according to a pre-acquired knowledge base;
and if so, setting the mapping relation between the field corresponding to the dictionary entry and the corresponding pasting source table field.
Further, the number of the source pasting tables is more than 1;
the matching the fields of the target table with the fields of the pasting source table comprises:
matching the fields of the target table with the fields of the source tables respectively;
obtaining the table structure similarity of the target table and each pasting primary side according to the matching result;
selecting the source table with the highest table structure similarity with the target table as a main table;
the determining whether all the fields in the target table have matching fields in the pasting source table includes:
judging whether all fields in the target table have matched fields in the main table;
if not, judging whether the pasting source tables except the main table have fields matched with the rest fields of the target table.
Further, the ETL script generating method further includes:
and processing the input corpus by using NLP technology to obtain the knowledge base.
In a second aspect, an ETL script generating apparatus is provided, including:
the table structure acquisition module acquires a table structure of a target table and a table structure of a source table, wherein the table structures comprise: a field;
the field matching module is used for matching the field of the target table with the field of the pasting source table;
the field judgment module is used for judging whether all fields in the target table have matched fields in the pasting source table;
and the first ETL script generation module generates an ETL script according to the table structure of the target table, the table structure of the pasting source table and the matching result of the target table and the pasting source table if all the fields in the target table have matched fields in the pasting source table.
Further, the field matching module includes:
a word sense expansion unit, which expands the field of the source table according to a pre-acquired knowledge base to obtain corresponding near-synonyms and/or synonyms, wherein the knowledge base comprises: a physical dictionary, the physical dictionary comprising: vocabulary and its synonyms;
and the field matching unit is used for matching the field of the target table with the field of the pasting source table, the corresponding similar meaning word and/or the synonym.
Further, the knowledge base further comprises: the ETL script generating device further comprises:
an ambiguous field judgment module for identifying ambiguous fields in the target table;
and the field meaning determining module is used for judging the final meaning of the ambiguous field according to the relation between the entities.
Further, the ETL script generating apparatus further comprises:
the business logic judgment module is used for judging whether the unmatched fields have corresponding business logic or not if all the fields in the target table cannot have matched fields in the pasting source table;
the service logic mapping relation setting module is used for obtaining the mapping relation between the unmatched field and the field in the source pasting table according to the service logic if the unmatched field has the corresponding service logic;
and the second ETL script generation module is used for generating an ETL script according to the table structure of the target table, the table structure of the pasting source table, the matching result of the target table and the pasting source table and the mapping relation.
Further, the ETL script generating apparatus further comprises:
the dictionary item judging module is used for identifying whether the field in the target table is a dictionary item according to a pre-acquired knowledge base;
and if the fields in the target table are dictionary items, setting the mapping relation between the fields corresponding to the dictionary items and the fields corresponding to the pasting source table.
Further, the number of the source pasting tables is more than 1;
the field matching module comprises:
the field matching unit is used for matching the fields of the target table with the fields of the pasting source tables respectively;
the similarity calculation unit is used for obtaining the table structure similarity of the target table and each paste primary side according to the matching result;
the master table selecting unit selects the source table with the highest table structure similarity with the target table as a master table;
the field judging module comprises:
the first judging unit is used for judging whether all the fields in the target table have matched fields in the main table;
and a second judging unit, configured to judge whether fields matching the remaining fields of the target table exist in the source table other than the main table if all the fields in the target table do not have matching fields in the main table.
Further, the ETL script generating apparatus further comprises:
and the knowledge base construction module is used for processing the input corpus by utilizing an NLP technology to obtain the knowledge base.
In a third aspect, an electronic device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor executes the computer program to implement the steps of the ETL script generation method.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, realizes the steps of the ETL script generation method described above.
The invention provides an ETL script generation method and device, wherein the method comprises the following steps: acquiring a table structure of a target table and a table structure of a source table, wherein the table structures comprise: a field; matching fields of the target table with fields of the pasting source table; judging whether all fields in the target table have matched fields in the pasting source table; and if so, generating an ETL script according to the table structure of the target table, the table structure of the pasting source table and the matching result of the target table and the pasting source table. By adopting the technical scheme, the ETL script can be automatically generated, the development workload is reduced, the accuracy is high, the development time is shortened, and the method and the device can adapt to the rapid development of the technology.
In addition, the knowledge base is constructed by the NLP technology in the embodiment of the invention, and the word sense expansion and the word sense determination are carried out by utilizing the entity dictionary and the relation between the entities in the knowledge base, so that the accuracy of the generated ETL script is further improved.
In order to make the aforementioned and other objects, features and advantages of the invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts. In the drawings:
FIG. 1 is a schematic diagram of an architecture between a server S1 and a client device B1 according to an embodiment of the present invention;
FIG. 2 is a block diagram of the server S1, the client device B1 and the database server S2 according to an embodiment of the present invention;
FIG. 3 is a first flowchart illustrating an ETL script generation method according to an embodiment of the present invention;
fig. 4 shows the specific steps of step S200;
FIG. 5 is a flowchart illustrating a second ETL script generation method in an embodiment of the present invention;
FIG. 6 is a third flowchart illustrating an ETL script generation method in an embodiment of the present invention;
FIG. 7 illustrates an ETL script generation architecture;
FIG. 8 illustrates a system architecture for building a knowledge base in an embodiment of the invention;
FIG. 9 illustrates a system architecture for generating ETL scripts in an embodiment of the present invention;
FIG. 10 is a block diagram showing the structure of an ETL script generating apparatus in the embodiment of the present invention;
fig. 11 is a block diagram of an electronic device according to an embodiment of the invention.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
It should be noted that the terms "comprises" and "comprising," and any variations thereof, in the description and claims of this application and the above-described drawings, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Interpretation of terms:
ETL, acronym of Extract-Transform-Load in English, which is used to describe the process of extracting (Extract), converting (Transform), and loading (Load) data from a source to a destination. The term ETL is more commonly used in data warehouses, but its objects are not limited to data warehouses.
NLP is an abbreviation of Natural Language Processing, which is a study on how computers can understand human languages, understand the meaning of Natural Language texts through computer mechanisms, and express the intentions and ideas of given deep layers in Natural Language texts.
The embodiment of the invention provides an ETL script generation method, which is characterized in that a Natural Language Processing (NLP) system collects business logic (one of relationships between entities) and term naming (namely a knowledge base) through modeling, the mapping relationship between a source business system (namely a source pasting table) and a target system (namely a target table) is intelligently matched, an ETL rule is automatically generated through an inference engine supported by the knowledge base, and finally an ETL script is generated through manual auditing and correcting, so that the workload can be reduced, the accuracy rate can be improved, and the development time can be shortened.
In view of the above, the present application provides an ETL script generating apparatus, which may be a server S1, and referring to fig. 1, the server S1 may be communicatively connected to at least one client device B1, the client device B1 may transmit a table structure of a target table and a table structure of a provenance table to the server S1, and the server S1 may receive the table structure of the target table and the table structure of the provenance table online. The server S1 may perform online or offline preprocessing on the table structure of the acquired target table and the table structure of the posting source table, and match the fields of the target table with the fields of the posting source table; judging whether all fields in the target table have matched fields in the pasting source table; and if so, generating an ETL script according to the table structure of the target table, the table structure of the pasting source table and the matching result of the target table and the pasting source table. The server S1 may then send ETL scripts online to the client device B1. The client device B1 may receive the ETL script online.
In addition, referring to fig. 2, the server S1 may also be communicatively connected to at least one database server S2, the database server S2 being used to store the knowledge base or corpora used to construct the knowledge base. The database server S2 sends the knowledge base or the corpus for constructing the knowledge base to the server S1 on line, and the server S1 may receive the knowledge base or the corpus for constructing the knowledge base on line, and then generates an ETL script according to the knowledge base or the corpus for constructing the knowledge base and then the ETL script according to the knowledge base.
Based on the above, the client device B1 may have a display interface so that a user can view the ETL script sent by the server S1 according to the interface.
It is understood that the client device B1 may include a smart phone, a tablet electronic device, a portable computer, a desktop computer, etc.
In practical applications, part of the ETL script generation method may be executed on the server S1 side as described above, that is, the architecture shown in fig. 1, all operations may be completed in the client device B1, and the client device B1 may be directly connected to the database server S2 in a communication manner. Specifically, the selection may be performed according to the processing capability of the client device B1, the limitation of the user usage scenario, and the like. This is not a limitation of the present application. If all the operations are completed in the client device B1, the client device B1 may further include a processor for performing specific processing of the ETL script generation method.
The server and the client device may communicate using any suitable network protocol, including network protocols not yet developed at the filing date of this application. The network protocol may include, for example, a TCP/IP protocol, a UDP/IP protocol, an HTTP protocol, an HTTPS protocol, or the like. Of course, the network Protocol may also include, for example, an RPC Protocol (Remote Procedure Call Protocol), a REST Protocol (Representational State Transfer Protocol), and the like used above the above Protocol.
FIG. 3 is a first flowchart illustrating an ETL script generation method according to an embodiment of the present invention; as shown in fig. 3, the ETL script generation method may include the following:
step S100: acquiring a table structure of a target table and a table structure of a source table, wherein the table structures comprise: a field;
when the target tables are multiple, the target tables are processed one by one to obtain the ETL scripts corresponding to each target table. However, in the case of a currently common database, there are many cases where there are a plurality of posting source tables, and in this case, for a certain target table, processing needs to be performed based on the table structure of the target table and the plurality of posting source tables.
And for each target table, by generating an ETL script, in the execution of the ETL script, the data required by the target table can be acquired from the data of the pasting source database, and the pasting source tables with different formats can be standardized into the format of the target table according to the service requirements.
It is worth noting that the pasting source table may be stored in a pasting source database. The table structure includes: table name, field type, and all annotation content.
Step S200: matching fields of the target table with fields of the pasting source table;
specifically, each field in the target table and each field in the pasting source table are respectively matched, so as to find a field matching relationship between the target table and the pasting source table, for example: one field in the target table is a deposit account, and one field in the transaction flow meter is a deposit account, so that the two fields are matched with each other to obtain a mapping relation: target table-savings account → transaction flow meter-savings account, when data is loaded, the contents of the savings account field of the transaction flow meter of one of the posting source tables can be loaded into the savings account field of the target table.
It should be noted that because there are many synonyms and near synonyms in natural language, people have no uniform specification when building tables using natural language, so field names may not be as standard, and synonyms may be used to mark the same field between two tables, such as: the account number and the account are matched, so that when field matching is carried out, the same words are considered for matching, and comprehensive matching is carried out after expansion according to synonyms and similar words, so that the matching rate is improved.
Step S300: judging whether all fields in the target table have matched fields in the pasting source table;
if yes, go to step S400.
Specifically, it is desirable that all fields in the target table can find corresponding matching fields in the pasting source table, and at this time, the ETL script is automatically generated only according to the field matching relationship, the table structure of the target table, and the table structure of the pasting source table based on the preset script generation program.
Step S400: and generating an ETL script according to the table structure of the target table, the table structure of the pasting source table and the matching result of the target table and the pasting source table.
By adopting the technical scheme, the ETL script can be automatically generated, the development workload is reduced, the accuracy is high, the development time is shortened, and the method and the device can adapt to the rapid development of the technology.
In an optional embodiment, the number of the pasting source table is one, when all the fields in the target table do not have matching fields in the pasting source table, the fields in the target table that do not have matching fields in the pasting source table are sent to customer service staff or developers, a processing scheme of the customer service staff or the developers is obtained, and then an ETL script is generated according to the processing scheme, the table structure of the target table, the table structure of the pasting source table, and the matching result of the target table and the pasting source table.
In an optional embodiment, the number of the pasting source tables is multiple, the target table is matched with each pasting source table one by one, and in order to improve efficiency, when all fields in the target table find corresponding matched fields in the first N pasting source tables, the matching is stopped; on the other hand, in order to improve the precision, the field primitive word matching can be carried out on the target table and each pasting source table one by one, and the matching is stopped if all the fields in the target table find the corresponding matched fields in the first N pasting source tables; and if partial fields in the target table cannot find matched fields in all the label source tables based on the original word matching, performing synonym and near synonym expansion on the fields, and then performing matching.
In an alternative embodiment, referring to fig. 4, this step S200 may include the following:
step S210: expanding the fields of the source pasting table according to a pre-acquired knowledge base to obtain corresponding near synonyms and/or synonyms, wherein the knowledge base comprises: a physical dictionary, the physical dictionary comprising: vocabulary (or entities, terms) and their synonyms and synonyms;
specifically, the entity dictionary includes different types of professional terms, synonyms, etc. of a certain term. Including all terms of chinese, pinyin, various acronyms variants of english, etc. For example, accounts, zhanghu, zh, account, etc. are synonyms.
Step S220: and matching the fields of the target table with the fields of the pasting source table, the corresponding similar meaning words and/or the synonyms.
By adopting the technical scheme, the field matching rate can be improved based on the expansion of the synonyms and the synonyms, so that the success rate of the generation of the ETL script is improved, and the workload of developers is reduced.
In an optional embodiment, if the determination result in step S300 shows that all the fields in the target table do not have matching fields in the pasting source table, the fields in the target table that do not have matching fields in the pasting source table are sent to customer service staff or developers, a processing scheme of the customer service staff or developers is obtained, and then an ETL script is generated according to the processing scheme, the table structure of the target table, the table structure of the pasting source table, and the matching result between the target table and the pasting source table.
In an optional embodiment, the knowledge base further comprises: the relationship between entities refers to the relationship between every two entities, of course, all terms have the relationship between the entities, and the relationship definition includes three parts, namely, term 1, term 2 and directional relationship. For example, the term 1 is 'customer', the term 2 is 'bank', the relationship is 'current account', each entity may have different attributes. Such as the customer may have the attribute of a certificate number. The relationship is directional and may be that the customer has a current account with the bank, rather than vice versa.
Referring to fig. 5, the ETL script generation method may further include:
step S500: identifying ambiguous fields in the target table;
specifically, when a field is expanded by using a knowledge base, if two or more corresponding words are retrieved for a word, the corresponding field is considered to be an ambiguous field.
Step S600: and judging the final meaning of the ambiguous field according to the relation between the entities.
Specifically, the final meaning of the ambiguous field is determined using the relationships between the entities and the directionality of the relationships.
For example: "zh" can be "account number" or "certificate number", and is judged to be "account number" if other fields in the table have a character string "hqck" (deposit due).
By adopting the character strings, ambiguous character strings can be effectively identified and processed, and the accuracy of the ETL script is further improved.
In an optional embodiment, referring to fig. 6, if all fields in the target table fail to have matching fields in the posting source table, the ETL script generation method may further include:
step S700: judging whether the unmatched fields have corresponding business logic or not;
if yes, go to step S800; if not, executing the step S1000;
specifically, the knowledge base stores business logic, which is derived from processing corpora and discovering business association between entities during processing, and is derived from research and development personnel or customer service personnel and is set according to business, such as: the 'individual deposit account detail' is divided into two types of current deposit and periodic deposit, and some banks can put the two types of data in a database table and others can store the two types of data respectively.
Step S800: obtaining the mapping relation between the unmatched field and the field in the source pasting table according to the service logic;
specifically, it may be set to "target table-field a → [ posts table 1-field B) + (posts table 2-field C) ].
Step S900: and generating an ETL script according to the table structure of the target table, the table structure of the pasting source table, the matching result of the target table and the pasting source table and the mapping relation.
It should be noted that, when there still exists a field or a mapping relationship where some fields are not matched after the above processing, step S1000 is also executed to supplement the processing scheme of the remaining fields in a manual manner.
Step S1000: sending an alarm to the background customer service;
step S1100: acquiring a field processing scheme of background customer service;
the processing scheme is manually compiled according to the service scene.
Step S1200: and generating an ETL script according to the table structure of the target table, the table structure of the pasting source table, the matching result of the target table and the pasting source table and the processing scheme.
By adopting the technical scheme, the automation level can be further improved.
In an optional embodiment, the ETL script generating method may further include:
identifying whether the field in the target table is a dictionary item or not according to a pre-acquired knowledge base;
and if so, setting the mapping relation between the field corresponding to the dictionary entry and the corresponding pasting source table field.
Specifically, identifying dictionary entries refers to the mapping relationship of dictionary entries for aligning the pasting source table to the standard table, for example, some fields f and m are respectively represented by male and female. Recognizing the dictionary item intelligently according to the table annotation field, analyzing the semantics of the dictionary item and mapping the semantic meaning with a standard table; and recommending according to the actual value, and deducing the ratio of f to m representing male and female according to the corresponding name characteristics in the actual record.
For example, the target table is replaced by 0 and 1 for men and women, and the label source table is represented by m and f for men and women. Then the corresponding m needs to be replaced by 0 and f by 1 when the pasting source table is converted to the target table. But the customer does not tell us that we do not know if m represents a male or a female. However, if we find that the names corresponding to the field m in the table are mostly "steel" and "sharp", and the names corresponding to the field f are many "Chary" and "beautiful", it can be inferred that m and f represent male and female, respectively. This is also an off-the-shelf knowledge base.
By adopting the technical scheme, the standardization degree of the source pasting data can be further improved.
In an alternative embodiment, the number of the pasting source tables is greater than 1, and the step S200 may include:
step I: matching the fields of the target table with the fields of the source tables respectively;
step II: obtaining the table structure similarity of the target table and each pasting primary side according to the matching result;
specifically, the table structure similarity is obtained according to the field matching result, and the calculation rule is preset by a developer, for example, a score is given according to the field matching number, or on the basis of giving a score according to the field matching number, a weight is added to each field according to the importance degree to obtain a weighted score, or a weight is added according to the field matching degree (for example, the field original word direct matching is a matching degree, the synonym matching is a matching degree, and the synonym matching is a matching degree), to obtain a weighted score.
Step III: selecting the source table with the highest table structure similarity with the target table as a main table;
and selecting a main table, giving priority to the main table, and supplementing other source tables when the data of the main table is insufficient.
Step S300 may include:
step A: judging whether all fields in the target table have matched fields in the main table;
if not, executing the step B; if yes, executing the step of generating the ETL script.
And B: and judging whether the pasting source tables except the main table have fields matched with the rest fields of the target table.
By adopting the technical scheme, the matching efficiency can be improved, and the research and development time is further shortened.
In an optional embodiment, the ETL script generating method may further include: and processing the input corpus by using NLP technology to obtain the knowledge base.
The knowledge base content includes: entity dictionaries and relationships between entities (terms).
The construction of the entity dictionary comprises the following steps: articles such as national standards, specifications and academic papers are used as input learning materials (terms are called as linguistic materials), the TF-IDF (Term Frequency-Inverse Document Frequency), Chinese word segmentation and other modes are used for word segmentation, natural language mature technologies such as synonym forest and word vectors are used for extracting the inter-word relation, and then a large number of entity dictionaries and synonym near-synonyms are extracted through a manual review method.
The relationship between entities (terms) can be processed by the current relatively mature natural language processing technology on the basis of the previous word segmentation result. At present, there are various extraction methods, such as a pattern matching combined bootstrap method (bootstrap), and an artificial intelligence method based on a neural network, such as a method of performing Part-of-Speech tagging, local syntax analysis, dependency relationship, and the like by using Bi-LSTM to perform relationship extraction. The final result can also be obtained by combining the two methods at the same time according to the results of the two methods.
The knowledge base obtained by natural language processing is continuously and naturally increased through iteration, and the newly increased knowledge base needs to be confirmed through manual work.
By adopting the technical scheme, a knowledge base which is more pertinent to the service data or the field can be constructed, and the accuracy of the ETL script is further improved.
By adopting the technical scheme, the workload in the development process of the ETL under the scene of supervision and delivery service can be greatly reduced by combining the natural processing technology with the rule engine, the accuracy of the development result is improved, the development time is greatly reduced, and the investment and time cost are reduced.
It is worth to be noted that there are many methods for natural language processing, which can be based on handwriting rules or by using a pattern matching automated processing method, can also use an artificial intelligence neural network, and can also combine the three methods at will.
In order to make the present application better understood by those skilled in the art, the following describes the implementation process of the present invention in addition to fig. 7 to 9:
FIG. 7 illustrates an ETL script generation architecture; as shown in fig. 7, includes: the system comprises a model building module 1, a rule recommendation engine 2 and a manual optimization interface 3.
The model building module builds a knowledge base required by a 'rule recommendation engine' by using a Natural Language Processing (NLP) technology, wherein the contents of the knowledge base comprise: entity dictionaries and relationships between entities (terms).
A solid dictionary: including different types of terms in the art, synonyms, etc. for a term. Including Chinese, pinyin, English various abbreviation variants with terms in the relationship between the entities (terms), etc. For example, accounts, zhanghu, zh, account, etc. are synonyms.
The construction method comprises the following steps: articles such as national standards, specifications and academic papers can be used as input learning materials (terms are called as linguistic materials), the TF-IDF (Term Frequency-Inverse Document Frequency), Chinese word segmentation and other modes are used for word segmentation, natural language mature technologies such as synonym forest and word vectors are used for extracting the inter-word relation, and then a large number of entity dictionaries and synonym near-synonyms are extracted through a manual review method.
Specifically, an open source implementation in jieba or corenlp may be called.
Relationships between entities (terms): the relationship between every two terms includes three parts: term 1, term 2, directional relationship. For example, the term 1 is 'customer', the term 2 is 'bank', the relationship is 'current account', each entity may have different attributes. Such as the customer may have the attribute of a certificate number. The relationship is directional and may be that the customer has a current account with the bank, rather than vice versa.
The relationship between entities can be processed on the basis of the previous word segmentation result through the current relatively mature natural language processing technology. At present, there are various extraction methods, such as a pattern matching combined bootstrap method (bootstrap), and an artificial intelligence method based on a neural network, such as a method of performing Part-of-Speech tagging, local syntax analysis, dependency relationship, and the like by using Bi-LSTM to perform relationship extraction. The final result can also be obtained by combining the two methods at the same time according to the results of the two methods.
The knowledge base obtained through natural language processing is continuously and naturally increased through iteration, the newly increased knowledge base needs to be adjusted into a part conforming to professional knowledge through the 'artificial optimization interface' of the third part, and then a new round of model training process can be started, so that subsequent processing is more accurate and faster.
Because the initial knowledge base is small in scale, the accuracy of the extracted entities and relationships is not very high, and therefore the corrected results need to be manually reviewed. And the corrected result is gradually supplemented into the knowledge base, the knowledge base is gradually enriched, and the extraction accuracy is gradually improved.
For example, referring to fig. 8, the model building module 1 may include: the term analysis module 1.1 and the relation analysis module 1.2. After the training data is input, the training data is processed by a term analysis module 1.1 and a relation analysis module 1.2 to obtain a term dictionary library and a term relation library.
The rule reasoning engine depends on the entity, the synonym and the relation among the entities extracted by the model building module, and if necessary, some rules are manually compiled according to the processing logic of the service to be used as supplement, and the rule reasoning engine is combined with the knowledge base and defined by input and output.
Referring to fig. 9, the rule inference engine outputs an ETL script with a source table structure and a target table structure as input, and specifically, the rule inference engine 2 includes: the system comprises a table structure analysis module 2.1, a table matching module 2.2, an inter-table relation identification module 2.3, a dictionary item identification module 2.4, a business logic identification module 2.5 and a rule recommendation engine 2.6.
The table structure parsing module 2.1 imports the table structure of the source database of posts (including table name, field type and all annotation content) into the rule reasoning engine.
The table matching module 2.2 matches all tables in the pasting source table for each table in the target table, searches the most matched table according to the table structure similarity of the two tables (for example, matching each field in the pasting source table with the field in the target table at first, matching synonyms and synonyms of the field in the target table if the matching is not successful, comprehensively evaluating the similarity of the whole table according to the similarity), and the synonyms in the entity dictionary collected by the model are used here. Meanwhile, the knowledge base is utilized to intelligently judge some ambiguous character strings. See if the combination of contextual strings is in the knowledge base entity dictionary. The demand deposit account number is in the dictionary, so hqckzh does not represent the demand deposit account number. For example, "zh" may be "account" or "license number" and thus is distinguished as "account" if there is a string "hqck" (deposit due) in other fields in the table.
The inter-table relationship identification module 2.3 is used for intelligently matching the inter-table relationships: according to the matching result, each field of each table can be basically identified, and then the relation between tables needing to be associated is searched according to the content of the knowledge base, for example, a certain foreign key of the table corresponds to a main key of a certain target table.
The dictionary entry recognition module 2.4 is used for intelligently recognizing dictionary entries: identifying dictionary entries refers to aligning the pasting source table to the dictionary entry mapping relationship of the standard table, for example, some fields f and m are respectively representative of male and female. Recognizing the dictionary item intelligently according to the table annotation field, analyzing the semantics of the dictionary item and mapping the semantic meaning with a standard table; and recommending according to the actual value, and deducing the ratio of f to m representing male and female according to the corresponding name characteristics in the actual record.
The service logic identification module 2.5 is used for performing service logic identification: business logic identification relies heavily on knowledge bases, and generally, knowledge bases of the type are rules manually written according to business scenes. For example, the "individual deposit account detail" is divided into two types of current deposit and periodic deposit, and some banks put these two types of data in a database table and others store them separately. The recommendation engine needs to identify which scheme is currently being used by the system and record it for subsequent processing.
For example, two tables exist when the source table is stored separately, and only one table exists when the source table is stored in combination. When only one table is retrieved based on the table name and table structure, it is suspected that only one table stores data of two tables. In addition, there is typically a field in the periodic table for the deposit cycle, and the current deposit record has a value that is typically meaningless (e.g., null or 0).
The rule recommendation engine 2.6 is used for rule recommendation: and the recommendation engine automatically generates an ETL extracted SQL script according to the previously obtained information, including the previously obtained knowledge such as the field mapping relation between the subsidence source table and the standard table, the main foreign key relation between the subsidence source tables, the value of the dictionary item, the business logic and the like, and according to different dialects (Oracle, Spark and the like) of the SQL language of the processing system.
Specifically, the insert, select, where and the field name are combined into an sql statement that can be executed using string concatenation.
The manual adjustment interface is distributed in each stage, the results of each stage are displayed to the user, and the user reviews the results, manually adjusts and triggers the system to enter the next stage. When the accuracy and stability of the system are high, the full-automatic operation of the whole process is supported, and the final result is displayed to a user for checking and adjusting.
Based on the same inventive concept, the embodiment of the present application further provides an ETL script generating apparatus, which can be used to implement the methods described in the foregoing embodiments, as described in the following embodiments. Because the principle of solving the problem of the ETL script generating device is similar to that of the method, the implementation of the ETL script generating device can refer to the implementation of the method, and repeated details are not repeated. As used hereinafter, the term "unit" or "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 10 is a block diagram showing the structure of an ETL script generating apparatus in the embodiment of the present invention. As shown in fig. 10, the ETL script generating apparatus specifically includes: the system comprises a table structure acquisition module 10, a field matching module 20, a field judgment module 30 and a first ETL script generation module 40.
The table structure obtaining module 10 obtains a table structure of a target table and a table structure of a source table, where the table structures include: a field;
the field matching module 20 matches the fields of the target table with the fields of the pasting source table;
the field judging module 30 judges whether all the fields in the target table have matching fields in the pasting source table;
if all the fields in the target table have matching fields in the pasting source table, the first ETL script generating module 40 generates an ETL script according to the table structure of the target table, the table structure of the pasting source table and the matching result of the target table and the pasting source table.
By adopting the technical scheme, the ETL script can be automatically generated, the development workload is reduced, the accuracy is high, the development time is shortened, and the method and the device can adapt to the rapid development of the technology.
In an optional embodiment, the field matching module comprises: a word sense expansion unit and a field matching unit.
The word meaning expansion unit expands the fields of the source table according to a pre-acquired knowledge base to obtain corresponding similar words and/or synonyms, wherein the knowledge base comprises: a physical dictionary, the physical dictionary comprising: vocabulary and its synonyms;
and the field matching unit matches the fields of the target table with the fields of the pasting source table, the corresponding similar meaning words and/or the synonyms.
In an optional embodiment, the knowledge base further comprises: the ETL script generating device further comprises: an ambiguity field judging module and a field meaning determining module.
An ambiguous field judgment module for identifying ambiguous fields in the target table;
and the field meaning determining module is used for judging the final meaning of the ambiguous field according to the relation between the entities.
In an optional embodiment, the ETL script generating apparatus further comprises: the system comprises a business logic judgment module, a business logic mapping relation setting module and a second ETL script generation module.
The business logic judgment module is used for judging whether the unmatched fields have corresponding business logic or not if all the fields in the target table cannot have matched fields in the pasting source table;
the service logic mapping relation setting module is used for obtaining the mapping relation between the unmatched field and the field in the source pasting table according to the service logic if the unmatched field has the corresponding service logic;
and the second ETL script generation module is used for generating an ETL script according to the table structure of the target table, the table structure of the pasting source table, the matching result of the target table and the pasting source table and the mapping relation.
In an optional embodiment, the ETL script generating apparatus further comprises: the device comprises a dictionary item judging module and a field mapping relation setting module.
The dictionary item judging module is used for identifying whether the field in the target table is a dictionary item according to a pre-acquired knowledge base;
and if the fields in the target table are dictionary items, setting the mapping relation between the fields corresponding to the dictionary items and the fields corresponding to the pasting source table.
In an alternative embodiment, the number of the pasting source tables is greater than 1; the field matching module comprises: the device comprises a field matching unit, a similarity calculation unit and a main table selection unit. The field judging module comprises: a first judgment unit and a second judgment unit.
The field matching unit is used for matching the fields of the target table with the fields of the pasting source tables respectively;
the similarity calculation unit is used for obtaining the table structure similarity of the target table and each paste primary side according to the matching result;
the master table selecting unit selects the source table with the highest table structure similarity with the target table as a master table;
the first judging unit is used for judging whether all the fields in the target table have matched fields in the main table;
and a second judging unit, configured to judge whether fields matching the remaining fields of the target table exist in the source table other than the main table if all the fields in the target table do not have matching fields in the main table.
The apparatuses, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or implemented by a product with certain functions. A typical implementation device is an electronic device, which may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
In a typical example, the electronic device specifically includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor implements the steps of the ETL script generation method when executing the program.
Referring now to FIG. 11, shown is a schematic diagram of an electronic device 600 suitable for use in implementing embodiments of the present application.
As shown in fig. 11, the electronic apparatus 600 includes a Central Processing Unit (CPU)601 that can perform various appropriate works and processes according to a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage section 608 into a Random Access Memory (RAM)) 603. In the RAM603, various programs and data necessary for the operation of the system 600 are also stored. The CPU601, ROM602, and RAM603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. The driver 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that a computer program read out therefrom is mounted as necessary on the storage section 608.
In particular, according to an embodiment of the present invention, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, an embodiment of the present invention includes a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the ETL script generation method described above.
In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611.
Computer-readable media, including both non-transitory and non-transitory, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing the present application.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (16)

1. An ETL script generation method, comprising:
acquiring a table structure of a target table and a table structure of a source table, wherein the table structures comprise: a field;
matching fields of the target table with fields of the pasting source table;
judging whether all fields in the target table have matched fields in the pasting source table;
and if so, generating an ETL script according to the table structure of the target table, the table structure of the pasting source table and the matching result of the target table and the pasting source table.
2. The ETL script generating method of claim 1, wherein said matching fields of said target table with fields of said pasting source table comprises:
expanding the fields of the source pasting table according to a pre-acquired knowledge base to obtain corresponding near synonyms and/or synonyms, wherein the knowledge base comprises: a physical dictionary, the physical dictionary comprising: vocabulary and its synonyms;
and matching the fields of the target table with the fields of the pasting source table, the corresponding similar meaning words and/or the synonyms.
3. The ETL script generating method of claim 2, wherein said knowledge base further comprises: before matching the fields of the target table with the fields of the pasting source table, the relationship between entities further includes:
identifying ambiguous fields in the target table;
and judging the final meaning of the ambiguous field according to the relation between the entities.
4. The ETL script generating method according to claim 1, further comprising:
if all the fields in the target table cannot have matched fields in the pasting source table, judging whether the unmatched fields have corresponding business logic or not;
if yes, obtaining the mapping relation between the unmatched field and the field in the source pasting table according to the service logic;
and generating an ETL script according to the table structure of the target table, the table structure of the pasting source table, the matching result of the target table and the pasting source table and the mapping relation.
5. The ETL script generation method of any of claims 1-4, further comprising:
identifying whether the field in the target table is a dictionary item or not according to a pre-acquired knowledge base;
and if so, setting the mapping relation between the field corresponding to the dictionary entry and the corresponding pasting source table field.
6. The ETL script generating method according to claim 1, wherein the number of the source-attached tables is greater than 1;
the matching the fields of the target table with the fields of the pasting source table comprises:
matching the fields of the target table with the fields of the source tables respectively;
obtaining the table structure similarity of the target table and each pasting primary side according to the matching result;
selecting the source table with the highest table structure similarity with the target table as a main table;
the determining whether all the fields in the target table have matching fields in the pasting source table includes:
judging whether all fields in the target table have matched fields in the main table;
if not, judging whether the pasting source tables except the main table have fields matched with the rest fields of the target table.
7. The ETL script generating method according to claim 2, further comprising:
and processing the input corpus by using NLP technology to obtain the knowledge base.
8. An ETL script generation apparatus, comprising:
the table structure acquisition module acquires a table structure of a target table and a table structure of a source table, wherein the table structures comprise: a field;
the field matching module is used for matching the field of the target table with the field of the pasting source table;
the field judgment module is used for judging whether all fields in the target table have matched fields in the pasting source table;
and the first ETL script generation module generates an ETL script according to the table structure of the target table, the table structure of the pasting source table and the matching result of the target table and the pasting source table if all the fields in the target table have matched fields in the pasting source table.
9. The ETL script generating apparatus of claim 8, wherein said field matching module comprises:
a word sense expansion unit, which expands the field of the source table according to a pre-acquired knowledge base to obtain corresponding near-synonyms and/or synonyms, wherein the knowledge base comprises: a physical dictionary, the physical dictionary comprising: vocabulary and its synonyms;
and the field matching unit is used for matching the field of the target table with the field of the pasting source table, the corresponding similar meaning word and/or the synonym.
10. The ETL script generating apparatus of claim 9, wherein said knowledge base further comprises: the ETL script generating device further comprises:
an ambiguous field judgment module for identifying ambiguous fields in the target table;
and the field meaning determining module is used for judging the final meaning of the ambiguous field according to the relation between the entities.
11. The ETL script generating apparatus according to claim 8, further comprising:
the business logic judgment module is used for judging whether the unmatched fields have corresponding business logic or not if all the fields in the target table cannot have matched fields in the pasting source table;
the service logic mapping relation setting module is used for obtaining the mapping relation between the unmatched field and the field in the source pasting table according to the service logic if the unmatched field has the corresponding service logic;
and the second ETL script generation module is used for generating an ETL script according to the table structure of the target table, the table structure of the pasting source table, the matching result of the target table and the pasting source table and the mapping relation.
12. The ETL script generating apparatus according to any one of claims 8 to 11, further comprising:
the dictionary item judging module is used for identifying whether the field in the target table is a dictionary item according to a pre-acquired knowledge base;
and if the fields in the target table are dictionary items, setting the mapping relation between the fields corresponding to the dictionary items and the fields corresponding to the pasting source table.
13. The ETL script generating apparatus of claim 8, wherein the number of the source tables is greater than 1;
the field matching module comprises:
the field matching unit is used for matching the fields of the target table with the fields of the pasting source tables respectively;
the similarity calculation unit is used for obtaining the table structure similarity of the target table and each paste primary side according to the matching result;
the master table selecting unit selects the source table with the highest table structure similarity with the target table as a master table;
the field judging module comprises:
the first judging unit is used for judging whether all the fields in the target table have matched fields in the main table;
and a second judging unit, configured to judge whether fields matching the remaining fields of the target table exist in the source table other than the main table if all the fields in the target table do not have matching fields in the main table.
14. The ETL script generating apparatus according to claim 9, further comprising:
and the knowledge base construction module is used for processing the input corpus by utilizing an NLP technology to obtain the knowledge base.
15. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the ETL script generation method of any of claims 1 to 7 when executing the program.
16. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the ETL script generation method according to any one of claims 1 to 7.
CN202010817816.9A 2020-08-14 2020-08-14 ETL script generation method and device Active CN111930819B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010817816.9A CN111930819B (en) 2020-08-14 2020-08-14 ETL script generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010817816.9A CN111930819B (en) 2020-08-14 2020-08-14 ETL script generation method and device

Publications (2)

Publication Number Publication Date
CN111930819A true CN111930819A (en) 2020-11-13
CN111930819B CN111930819B (en) 2024-03-15

Family

ID=73310913

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010817816.9A Active CN111930819B (en) 2020-08-14 2020-08-14 ETL script generation method and device

Country Status (1)

Country Link
CN (1) CN111930819B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113760891A (en) * 2020-12-02 2021-12-07 北京沃东天骏信息技术有限公司 Data table generation method, device, equipment and storage medium
CN113934786A (en) * 2021-09-29 2022-01-14 浪潮卓数大数据产业发展有限公司 Implementation method for constructing unified ETL
CN114817388A (en) * 2022-04-12 2022-07-29 浪潮软件股份有限公司 Method, device and medium for intelligently generating extraction script
CN118426753A (en) * 2024-07-04 2024-08-02 深圳达实旗云健康科技有限公司 Data conversion code generation method, device, equipment, medium and program product
US12254311B2 (en) 2022-01-28 2025-03-18 International Business Machines Corporation Automatic extract, transform and load accelerator for data platform in distributed computing environment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105069033A (en) * 2015-07-22 2015-11-18 北京京东尚科信息技术有限公司 Method and device for creating database table model
CN109634587A (en) * 2018-12-04 2019-04-16 上海碳蓝网络科技有限公司 A kind of method and apparatus generating storage script and data loading
CN111090640A (en) * 2019-11-13 2020-05-01 山东中磁视讯股份有限公司 ETL data cleaning method and system
CN111291049A (en) * 2020-01-19 2020-06-16 软通动力信息技术有限公司 Method, device, equipment and storage medium for creating table
CN111324647A (en) * 2020-01-21 2020-06-23 北京东方金信科技有限公司 Method and device for generating ETL code
CN111367895A (en) * 2020-03-31 2020-07-03 中国建设银行股份有限公司 Data migration method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105069033A (en) * 2015-07-22 2015-11-18 北京京东尚科信息技术有限公司 Method and device for creating database table model
CN109634587A (en) * 2018-12-04 2019-04-16 上海碳蓝网络科技有限公司 A kind of method and apparatus generating storage script and data loading
CN111090640A (en) * 2019-11-13 2020-05-01 山东中磁视讯股份有限公司 ETL data cleaning method and system
CN111291049A (en) * 2020-01-19 2020-06-16 软通动力信息技术有限公司 Method, device, equipment and storage medium for creating table
CN111324647A (en) * 2020-01-21 2020-06-23 北京东方金信科技有限公司 Method and device for generating ETL code
CN111367895A (en) * 2020-03-31 2020-07-03 中国建设银行股份有限公司 Data migration method and device

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113760891A (en) * 2020-12-02 2021-12-07 北京沃东天骏信息技术有限公司 Data table generation method, device, equipment and storage medium
CN113760891B (en) * 2020-12-02 2024-07-16 北京沃东天骏信息技术有限公司 Data table generation method, device, equipment and storage medium
CN113934786A (en) * 2021-09-29 2022-01-14 浪潮卓数大数据产业发展有限公司 Implementation method for constructing unified ETL
CN113934786B (en) * 2021-09-29 2023-09-08 浪潮卓数大数据产业发展有限公司 Implementation method for constructing unified ETL
US12254311B2 (en) 2022-01-28 2025-03-18 International Business Machines Corporation Automatic extract, transform and load accelerator for data platform in distributed computing environment
CN114817388A (en) * 2022-04-12 2022-07-29 浪潮软件股份有限公司 Method, device and medium for intelligently generating extraction script
CN118426753A (en) * 2024-07-04 2024-08-02 深圳达实旗云健康科技有限公司 Data conversion code generation method, device, equipment, medium and program product

Also Published As

Publication number Publication date
CN111930819B (en) 2024-03-15

Similar Documents

Publication Publication Date Title
CN111930819B (en) ETL script generation method and device
US10521464B2 (en) Method and system for extracting, verifying and cataloging technical information from unstructured documents
US10706045B1 (en) Natural language querying of a data lake using contextualized knowledge bases
CN101814067B (en) System and methods for quantitative assessment of information in natural language contents
CN111581990A (en) Cross-border transaction matching method and device
US20230205996A1 (en) Automatic Synonyms Using Word Embedding and Word Similarity Models
CN113377963B (en) Knowledge-graph-based well site test data processing method and device
CN117312372A (en) SQL generating method, device, equipment and medium based on background knowledge enhancement
US10706030B2 (en) Utilizing artificial intelligence to integrate data from multiple diverse sources into a data structure
US20240386060A1 (en) Providing an object-based response to a natural language query
Kashmira et al. Generating entity relationship diagram from requirement specification based on nlp
CN113849661A (en) Entity embedded data extraction method and device, electronic equipment and storage medium
US20250005018A1 (en) Information processing method, device, equipment and storage medium based on large language model
CN112906368A (en) Industry text increment method, related device and computer program product
CN117251455A (en) Intelligent report generation method and system based on large model
Sangeetha et al. An intelligent automatic query generation interface for relational databases using deep learning technique
CN119576975A (en) SQL statement generation method, device, computer equipment, readable storage medium and program product
US20240378381A1 (en) Automated calculation predictions with explanations
US20180349351A1 (en) Systems And Apparatuses For Rich Phrase Extraction
CN114328894A (en) Document processing method, document processing device, electronic equipment and medium
CN118114681A (en) Semantic analysis method, device and computer readable storage medium
CN113095078B (en) Method, device and electronic device for determining associated assets
CN112905790B (en) Method, device and system for extracting qualitative indicators of regulatory events
Kunanets et al. Enhanced LSA Method with Ukraine Language Support.
US20240370665A1 (en) Intelligent document processing and information extraction using artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210125

Address after: 100140, 55, Fuxing Avenue, Xicheng District, Beijing

Applicant after: INDUSTRIAL AND COMMERCIAL BANK OF CHINA

Applicant after: ICBC Technology Co.,Ltd.

Address before: 100029 Tianyuan Xiangtai building, No.5 Anding Road, Chaoyang District, Beijing

Applicant before: ICBC Technology Co.,Ltd.

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant