US10810224B2 - Computerized methods and programs for ingesting data from a relational database into a data lake - Google Patents
Computerized methods and programs for ingesting data from a relational database into a data lake
- Publication number
- US10810224B2 (application US16/020,829)
- Authority
- US
- United States
- Prior art keywords
- data
- ingestion
- relational database
- lake
- udf
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related, expires
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/214—Database migration support
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2291—User-Defined Types; Storage management thereof
Definitions
- the invention relates in general to the field of computerized techniques for ingesting data from a relational database into a data lake.
- it is directed to methods involving an extract, transform, load (ETL) pipeline.
- a data lake is a storage repository that holds a huge amount of raw or refined data in native format until it is accessed.
- a data lake typically relies on Hadoop-compatible object storage, according to which an organization's data is loaded into a Hadoop platform. Then, business analytics and data-mining tools can possibly be applied to the data where it resides on the Hadoop cluster.
- data lakes can also be used effectively without incorporating Hadoop, depending on the needs and goals of the organization. More generally, a data lake is a large data pool in which the schema and data requirements are not defined until the data is queried.
- data lakes often build on new technologies such as scalable file systems (e.g., the Hadoop distributed file system, or HDFS), NoSQL databases (e.g., Cassandra), object stores (e.g., S3) and processing frameworks (e.g., Spark).
- One of the key enabling technologies for a data lake is allowing so-called “siloed” data residing in existing data sources (e.g., warehouses) to be ingested into the lake.
- Specific technologies such as Sqoop have been developed for exactly this purpose, but they require skill sets beyond those needed for the standard ETL processing common within data warehouses. Such technologies are furthermore perceived as difficult to integrate into ETL pipelines.
- the invention is embodied as a computerized method for ingesting data from a relational database into a data lake.
- a relational database is provided, wherein a user-defined function, or UDF, is associated with a standard operation of extract, transform, load, or ETL, of an ETL pipeline.
- This UDF is designed so as to be triggered upon performing said standard operation and thereby allow a code associated with said UDF to be executed.
- said standard operation is executed, which triggers said UDF and, in turn, an execution of said code.
- the invention is embodied as a computer program product for ingesting data from a relational database into a data lake.
- the computer program product comprises a computer readable storage medium having program instructions embodied therewith.
- the program instructions are executable by one or more processors of a computerized system, so as to run a relational database such as described above, i.e., one that includes a specifically designed UDF, which is nevertheless associated with a standard ETL operation of an ETL pipeline.
- FIG. 1 is a block diagram schematically illustrating a data lake and a relational database system comprising multiple databases for migrating data from multiple data sources such as data warehouses, as involved in embodiments;
- FIG. 2 is a flowchart illustrating steps of a method for ingesting data, as in embodiments.
- FIG. 3 schematically represents a general purpose computerized system, suited for implementing one or more method steps as involved in embodiments of the invention.
- One of the key enabling technologies for a data lake is allowing so-called “siloed” data residing in existing data sources (e.g., warehouses) to be ingested into the lake.
- Specific technologies have been developed to allow data available within existing data sources to be ingested into the lake, but such technologies require skill sets beyond those needed for the standard ETL processing common within data warehouses. There is therefore an impedance mismatch between the world of data sources such as data warehouses and that of big data lake systems.
- FIGS. 1, 2 an aspect of the invention is first described, which concerns a computerized method for ingesting data from a relational database 21 (e.g., Db2) into a data lake 30 (e.g., a big data system).
- this method relies on providing (step S 30 , FIG. 2 ) a relational database 21 (e.g., to a data ingestor 2 ), wherein a specifically designed user-defined function (UDF) is associated with a standard operation of extract, transform, load (ETL) of an ETL pipeline.
- an ETL pipeline refers to a set of processes extracting data from one system, transforming it, and loading it into a database.
- the UDF is designed so as to be triggered upon performing said standard operation.
- triggering the UDF allows a code associated therewith to be executed.
- a UDF corresponds to a piece of code that an end user may insert into a system to extend its capabilities.
- the UDF used is devised for a relational database 21 - 23 , so as to allow a code (e.g., an application-specific code) to be executed on a standard ETL type operation.
- data 15 can be migrated (step S 40 ) from one or more data sources 11 - 13 into the relational database 21 , based on the ETL pipeline, such that said standard operation will be executed S 44 .
- This triggers S 46 the UDF and, thus, the execution S 48 of the associated code.
- the execution S 48 of the code causes an entity 35 running on the data lake 30 to be notified S 49 that a set 22 of data migrated to the relational database 21 is to be ingested, according to given ingestion modalities.
- Such modalities are specified by the code as it executes. That is, the code is adapted, upon execution, to notify said entity and inform it of the ingestion modalities to be observed.
- Such modalities may for instance describe which data (e.g., data tables) and how such data should be ingested into the lake.
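The ingestion modalities described above could, for illustration, be carried as a small structured message. The following Python sketch shows one possible shape; all field, table, and class names here are hypothetical and not taken from the patent:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class IngestionModalities:
    """Hypothetical shape of the modalities message the UDF code
    sends to the data-lake entity (field names are illustrative)."""
    tables: List[str]                 # which tables to ingest
    mode: str = "append"              # "append" or "overwrite"
    parallel: bool = False            # whether tables may be ingested in parallel
    row_filter: Optional[str] = None  # optional WHERE-style predicate
    update_catalogue: bool = True     # whether to register tables in the catalogue

# Example: overwrite two tables, allow parallel ingestion, filter by date.
modalities = IngestionModalities(
    tables=["sales_2018", "customers"],
    mode="overwrite",
    parallel=True,
    row_filter="DATE > '2018-02-19'",
)
```

Encoding the modalities as one typed record makes them easy to serialize into the work-queue message discussed later.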
- the entity 35 at stake can typically be regarded as a workflow, e.g., implemented as part of a workflow scheduler, such as the Apache Oozie Workflow Scheduler for Hadoop.
- the set 22 of migrated data can then be ingested S 50 into the data lake 30 , according to said modalities, e.g., via a workflow initiated (or even implemented) by the notified entity 35 .
- said ingestion modalities notably specify whether said set 22 of data may be ingested in parallel.
- such modalities may specify whether said set 22 should be appended to data already existing in the data lake 30 (“append” mode), upon ingestion thereof.
- the append mode is used to add to and extend data that may already be present in the data lake.
- the modalities may specify whether data should overwrite data already existing in the data lake 30 (“overwrite” mode), upon ingestion. That is, the overwrite mode typically copies one or more tables from the relational database 21 onto an area of the data lake where it overwrites existing data, as specified by said modalities.
- data may be appended to already existing data while, at the same time, other distinct, already existing data is overwritten.
- multiple modalities are preferably specified together.
- the ingestion may possibly come in the two modes described above.
- ingestion modalities may further specify whether data 22 should be appended to already existing data and/or overwrite data in the data lake.
- filters may be available, which allow a user 2 to select a subset of the data from the relational database 21 .
- the user may possibly be able to indicate whether a catalogue should be updated or not, as discussed later in more detail.
- Enabling parallel ingestions is particularly advantageous where said set 22 of data comprises a plurality of data tables, as subsets of such tables may be ingested in parallel.
- the number of such tables is typically limited by the resources allocated to ingestion as, e.g., on a Hadoop cluster. By default, the ingestion of a single table is not distributed across a cluster. If a table is large this may become the limiting factor as the ingestor must wait until the largest table is ingested before the updating of a database is complete. For this reason, a user may request that the ingestion of a specific table be performed in parallel to the ingestion of other tables.
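The bounded parallel ingestion of tables described above can be sketched with a thread pool, where the pool size stands in for the cluster resources allocated to ingestion. The `ingest_table` helper is a placeholder for the real per-table copy, not the patent's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def ingest_table(table: str) -> str:
    # Placeholder for the real per-table ingestion (e.g., copying
    # the table from the drop zone into HDFS).
    return f"ingested {table}"

tables = ["orders", "customers", "line_items"]

# max_workers bounds the parallelism, standing in for the resource
# limit imposed by the cluster; map() preserves the input order.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(ingest_table, tables))
```

With such a bound, a single very large table still dominates total latency, which is why a user may request that one table's ingestion itself be parallelized, as noted above.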
- the standard ETL operations are preferably executed S 44 on a dummy database table, as enabled by the relational database 21 provided.
- the database may include a dummy table, whose only purpose is to trigger the ingestion.
- This dummy database table may possibly be subject to access rights governing the one or more users or applications allowed to perform the ingestion.
- the subsequent ingestion S 50 will be performed to the extent permitted by such access rights, in addition to being performed according to said modalities.
- access rights are associated with the UDF itself, as in IBM Db2 databases.
- the UDF may notably be associated with a so-called “SELECT” operation, which forms part of the ETL pipeline.
- the ingestion modalities can be specified as parameters to the SELECT operation.
- the user can further specify a filter in the SELECT statement that triggers the UDF, such that only part of the table is ingested, for example by requiring a date column to be later than a given date (e.g., WHERE DATE >‘2018-02-19’).
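For illustration, the triggering SELECT statement could be assembled as follows. This Python sketch assumes a hypothetical UDF name `ingest_udf` (the patent does not name the UDF), and only builds the SQL text rather than executing it against a database:

```python
def build_trigger_statement(udf_name: str, params: str, where: str = "") -> str:
    """Assemble a Db2-style SELECT that would fire the ingestion UDF.
    The UDF name and parameters are caller-supplied and illustrative."""
    stmt = f"SELECT {udf_name}({params}) FROM sysibm.sysdummy1"
    if where:
        # An optional filter, so only part of the table is ingested.
        stmt += f" WHERE {where}"
    return stmt

stmt = build_trigger_statement(
    "ingest_udf",
    "'sales_2018', 'overwrite'",
    where="DATE > '2018-02-19'",
)
```

In a real pipeline the resulting string would be issued by an ETL tool as an ordinary SELECT, which is what makes the mechanism transparent to standard ETL tooling.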
- the UDF may otherwise be associated with any other convenient ETL operation, such as INSERT, UPDATE, or DELETE.
- the UDF is executed in the database specific environment using the database specific language (here SQL).
- different databases associate UDFs with different constructs, such that the UDF may, in general, be associated with other constructs.
- the notification S 49 is, as per the execution of the UDF code, performed by writing a message describing said ingestion modalities into a work queue, e.g., a queue used by a workflow scheduler on the data lake side.
- the subsequent ingestion S 50 may be initiated by reading S 52 the work queue, e.g., using a daemon process, so as to initiate a work flow S 54 -S 58 to ingest said set 22 of data into the data lake 30 .
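The queue-and-daemon pattern just described can be sketched in Python, with `queue.Queue` standing in for the real notification system and a JSON message whose shape is assumed purely for illustration:

```python
import json
import queue

work_queue: "queue.Queue[str]" = queue.Queue()

# The UDF code would enqueue a message describing the ingestion;
# this JSON shape is an assumption, not taken from the patent text.
work_queue.put(json.dumps({"tables": ["sales_2018"], "mode": "append"}))

def daemon_step(q: "queue.Queue[str]") -> dict:
    """One iteration of the daemon: read a message off the work queue
    and hand it to a workflow (here, simply return the parsed request)."""
    request = json.loads(q.get())
    # A real implementation would launch an Oozie-style workflow here.
    return request

request = daemon_step(work_queue)
```

Decoupling the trigger from the ingestion through a queue is what allows the entity to defer execution, as in step S 51.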
- the entity may, upon receiving the notification S 49 , schedule S 51 a deferred execution of this ingestion, as illustrated in FIG. 2 .
- this work flow may notably cause to recreate S 55 a data structure of the data 15 migrated into the relational database 21 within a database table of the data lake 30 .
- the data lake 30 may be a Hadoop-enabled data lake 30 and said data structure may be recreated S 55 onto the HDFS file system of the data lake.
- the data structure may be recreated S 55 a posteriori (i.e., after loading S 54 the data, as in FIG. 2 ) or, in variants, a priori, i.e., prior to loading S 54 the data, at variance with FIG. 2 .
- the work flow may cause to index S 56 the ingested set 22 of data.
- the work flow may possibly cause such data 22 to be catalogued S 57 , e.g., as per modalities specified by the user. E.g., once data has been moved into an area of the data lake where data is served to authorized users only, such data are registered in a metadata repository; this repository can be implemented as part of a larger catalogue, which can be used to browse and understand the available data assets.
- the ingestion S 50 may further comprise updating S 58 access rules for the ingested set 22 of data.
- the data lake entity 35 is implemented as part of a workflow scheduler running on the data lake.
- the workflow may notably cause to load S 54 the data 22 into the data lake (e.g., into an HDFS), recreate the relational table S 55 , update S 57 the catalogue that contains the ingested table names and their metadata, and set S 58 the access rights on the table.
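The four workflow steps above can be outlined as a simple sequence. In this sketch each line is a stand-in for the real system involved (the HDFS copy, a SQL engine such as BigSQL, the catalogue, and the access-control layer); the function and step labels are illustrative:

```python
from typing import List

def run_ingestion_workflow(tables: List[str]) -> List[str]:
    """Sketch of workflow steps S54-S58; each entry stands in for a
    call into the corresponding real subsystem."""
    log = []
    log.append(f"load {','.join(tables)} into HDFS")          # S54: load data
    log.append("recreate relational tables via SQL engine")   # S55: recreate structure
    log.append("update catalogue with table names/metadata")  # S57: catalogue
    log.append("set access rights on ingested tables")        # S58: access rules
    return log

steps = run_ingestion_workflow(["sales_2018"])
```

Keeping the steps as an ordered list mirrors how a workflow scheduler such as Oozie chains actions, with each action able to fail or retry independently.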
- the relational table may for instance be recreated using a SQL query engine, such as BigSQL, Impala, etc., based on data copied from the relational database.
- the user 2 may possibly want to inquire about the ingestion status and, to that aim, be able to query S 61 contents in an ingestion log database to track S 62 progress of the ingestion.
- the relational database 21 is typically provided S 30 upon a data owner requesting S 10 an ingestion of data into the data lake 30 . That is, upon receiving the owner's request S 10 , an authorized entity proceeds to create S 20 the needed relational database 21 and insert S 20 an apposite UDF in the created database(s).
- the relational database 21 is preferably provided S 30 as part of a relational database system 20 (also referred to as a “drop zone” in this document).
- This system comprises multiple, different databases 21 - 23 .
- Each database 21 - 23 may for instance be customized with respect to a respective data source 11 - 13 .
- Such data sources 11 - 13 may for example be data warehouses 11 - 13 , which may require specific databases 21 - 23 , as illustrated in FIG. 1 .
- data can be moved from the data warehouses 11 - 13 into respective relational databases 21 - 23 using any suitable technology, e.g., Db2-to-Db2, DataStage, etc., which are known per se.
- Other ETL tools can be used as well.
- the invention can be embodied as a computer program product, designed to enable and ease ingestion of data, according to methods as described herein.
- the computer program product comprises a computer readable storage medium having program instructions embodied therewith.
- such instructions are executable by processing means of a computerized system, which may include one or more computerized units 101 such as depicted in FIG. 3 .
- Upon execution, such instructions make it possible to run a relational database 21 , wherein a UDF is associated with a standard ETL operation of an ETL pipeline, as described earlier. I.e., the UDF can be triggered upon performing said standard operation and thereby allow a UDF code to be executed.
- This sub-section describes detailed mechanisms to extend the concept of user defined functions (UDFs) within a relational database such as Db2 to allow an ETL developer to trigger the ingestion of data into a data lake.
- the UDF is an application-specific piece of logic that can be associated with a specific action within the database.
- the UDF is triggered by an action taken within the relational database, which then notifies entities 35 running within the data lake about how and what data to ingest.
- this action is designed to be a standard ETL operation such as the “SELECT” operation, which can conveniently be made part of an ETL pipeline defined using tools such as DataStage.
- the description of which and how the data are to be ingested is defined as parameters to this “SELECT” statement.
- the UDF notifies the data lake, via a convenient notification system, that this data should be ingested.
- the ingestion request is scheduled and performed at a later time.
- the ETL operation is executed on a dummy database table whose only purpose is to trigger the ingestion. Access rights over this table may govern who is allowed to perform an ingestion from that database into the lake.
- each UDF may have associated access rights that govern who is allowed to trigger the ingestion, as noted earlier.
- the progress of the ingestion can be followed within the ETL by querying the contents of an ingestion log database.
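Tracking progress through an ingestion log can be illustrated with SQLite standing in for the real log database; the table and column names below are assumptions, not taken from the patent:

```python
import sqlite3

# sqlite3 stands in for the real ingestion-log database here.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ingestion_log (table_name TEXT, status TEXT)")

# The ingestion workflow records progress as it runs...
con.execute("INSERT INTO ingestion_log VALUES ('sales_2018', 'RUNNING')")
con.execute(
    "UPDATE ingestion_log SET status = 'DONE' WHERE table_name = 'sales_2018'"
)

# ...and the ETL side polls the log to follow the ingestion status.
status = con.execute(
    "SELECT status FROM ingestion_log WHERE table_name = 'sales_2018'"
).fetchone()[0]
```

Because the log is itself a relational table, the ETL developer can track progress with ordinary SQL queries, staying within the familiar toolset.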
- the relational database instance from which data is to be ingested is here termed the drop zone.
- the drop zone is supported by standard relational technology such as Db2.
- the drop zone consists of multiple different databases 21 - 23 , each corresponding to a specific data warehouse 11 - 13 , as depicted in FIG. 1 .
- Data is moved from the data warehouses 11 - 13 into the data lake 30 using any convenient technology, e.g., Db2-to-Db2, DataStage, etc.
- Before migration of data into the drop zone, a data warehouse owner 2 requests the creation of a drop zone database 21 .
- the creation of this database 21 inserts the trigger mechanism into the database, and establishes correct access rights for a functional user 2 to read the data.
- ingestion of the tables is triggered by executing the trigger on the control database, by the means described previously.
- the UDF that implements the trigger writes into a work queue a message describing which tables and how such tables should now be ingested into the lake.
- the work queue is read by a daemon process and this initiates a work flow that actually performs the ingestion.
- This typically involves reading the data from the drop zone database and recreating the relational table within the database table on HDFS (e.g., using BigSQL, Impala), indexing the data (e.g., using Elasticsearch, Solr), cataloguing the data (e.g., using IBM IGC, Apache Atlas), updating access rules (e.g., using Apache Ranger), and/or any other suitable action.
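Putting the pieces together, the trigger, queue, and daemon can be sketched end to end in a few lines of Python. The message format and function names are illustrative and not prescribed by the patent:

```python
import json
from collections import deque

work_queue: deque = deque()

def udf_trigger(tables, mode="append"):
    """Body of the UDF: write an ingestion request to the work queue.
    (The JSON shape is an assumed, illustrative format.)"""
    work_queue.append(json.dumps({"tables": tables, "mode": mode}))

def daemon():
    """Drain the queue and run one (simulated) workflow per request."""
    done = []
    while work_queue:
        req = json.loads(work_queue.popleft())
        done.extend(f"{req['mode']}:{t}" for t in req["tables"])
    return done

# Executing the triggering SELECT on the dummy table would call this:
udf_trigger(["orders", "customers"], mode="overwrite")
ingested = daemon()
```

The point of the pattern is that the database side only ever enqueues a small request, while all heavy lifting happens asynchronously on the data-lake side.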
- Computerized devices can be suitably designed for implementing embodiments of the present invention as described herein.
- the methods described herein are largely non-interactive and automated.
- the methods described herein can be implemented either in an interactive, partly-interactive or non-interactive system.
- the methods described herein can be implemented in software (e.g., firmware), hardware, or a combination thereof.
- the methods described herein are implemented in software, as an executable program, the latter executed by suitable digital processing devices. More generally, embodiments of the present invention can be implemented wherein general-purpose digital computers, such as personal computers, workstations, etc., are used.
- FIG. 3 schematically represents a computerized unit 101 , e.g., a general-purpose computer.
- Several computerized units 101 may be involved along the work flow path (e.g., in the data warehouses, on the data ingestors' side, on the data lake side).
- the unit 101 includes a processor 105 , memory 110 coupled to a memory controller 115 , and one or more input and/or output (I/O) devices 145 , 150 , 155 (or peripherals) that are communicatively coupled via a local input/output controller 135 .
- the input/output controller 135 can be, but is not limited to, one or more buses or other wired or wireless connections, as is known in the art.
- the input/output controller 135 may have additional elements, which are omitted for simplicity, such as controllers, buffers (caches), drivers, repeaters, and receivers, to enable communications.
- the local interface may include address, control, and/or data connections to enable appropriate communications among the aforementioned components.
- the processor 105 is a hardware device for executing software, particularly that stored in memory 110 .
- the processor 105 can be any custom made or commercially available processor, a central processing unit (CPU), an auxiliary processor among several processors associated with the computer 101 , a semiconductor based microprocessor (in the form of a microchip or chip set), or generally any device for executing software instructions.
- the memory 110 can include any one or combination of volatile memory elements (e.g., random access memory) and nonvolatile memory elements. Moreover, the memory 110 may incorporate electronic, magnetic, optical, and/or other types of storage media. Note that the memory 110 can have a distributed architecture, where various components are situated remote from one another, but can be accessed by the processor 105 .
- the software in memory 110 may include one or more separate programs, each of which comprises an ordered listing of executable instructions for implementing logical functions.
- the software in the memory 110 includes methods (or parts thereof) described herein in accordance with exemplary embodiments and a suitable operating system (OS) 111 .
- the OS 111 essentially controls the execution of other computer programs and provides scheduling, input-output control, file and data management, memory management, and communication control and related services.
- the methods described herein may be in the form of a source program, executable program (object code), script, or any other entity comprising a set of instructions to be performed.
- the program needs to be translated via a compiler, assembler, interpreter, or the like, as known per se, which may or may not be included within the memory 110 , so as to operate properly in connection with the OS 111 .
- the methods can be written in an object oriented programming language, which has classes of data and methods, or in a procedural programming language, which has routines, subroutines, and/or functions.
- a conventional keyboard 150 and mouse 155 can be coupled to the input/output controller 135 .
- Other I/O devices 145 - 155 may include other hardware devices.
- the I/O devices 145 - 155 may further include devices that communicate both inputs and outputs.
- the system 100 can further include a display controller 125 coupled to a display 130 .
- the system 100 can further include a network interface or transceiver 160 for coupling to a network (not shown, e.g., to set several units 101 in data communication along the work flow path S 10 -S 62 ).
- the network transmits and receives data between the unit 101 and external systems.
- the network is possibly implemented in a wireless fashion, e.g., using wireless protocols and technologies, such as WiFi, WiMax, etc.
- the network may be a fixed wireless network, a wireless local area network (LAN), a wireless wide area network (WAN), a personal area network (PAN), a virtual private network (VPN), an intranet or other suitable network system and includes equipment for receiving and transmitting signals.
- the network can also be an IP-based network for communication between the unit 101 and any external server, client and the like via a broadband connection.
- the network can be a managed IP network administered by a service provider.
- the network can be a packet-switched network such as a LAN, WAN, Internet network, etc.
- the software in the memory 110 may further include a basic input output system (BIOS).
- the BIOS is stored in ROM so that it can be executed when the computer 101 is activated.
- When the unit 101 is in operation, the processor 105 is configured to execute software stored within the memory 110 , to communicate data to and from the memory 110 , and to generally control operations of the computer 101 pursuant to the software.
- the methods described herein and the OS 111 in whole or in part are read by the processor 105 , typically buffered within the processor 105 , and then executed.
- when the methods described herein are implemented in software, they can be stored on any computer readable medium, such as storage 120 , for use by or in connection with any computer related system or method.
- the present invention may be a system, a method, and/or a computer program product at any possible technical detail level of integration
- the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention
- the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
- the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
- a non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
- a computer readable storage medium is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
- Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
- the network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers.
- a network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
- Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, Java, Scala or the like, and procedural programming languages, such as the C programming language or similar programming languages.
- the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
- the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
- electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
- These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
- These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
- the computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
- each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
- the functions noted in the blocks may occur out of the order noted in the Figures.
- two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
Abstract
Description
- select udf-name(udf-parameters) from sysibm.sysdummy1,
- whereas with Oracle Database the same can be achieved by executing:
- call udf-name(udf-parameters) into :result
- print result
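The Db2 example above invokes a user-defined function (UDF) through a plain SELECT against the one-row dummy table sysibm.sysdummy1. As an illustrative sketch only (not the patent's implementation, and using a hypothetical UDF name `udf_ingest`), the same invoke-a-UDF-via-SELECT pattern can be demonstrated with Python's sqlite3 module, which lets an application register a host-language function as a SQL UDF:

```python
import sqlite3

def ingest_batch_id(table_name: str) -> str:
    """Toy UDF standing in for an ingestion-trigger function."""
    return f"ingest:{table_name}"

conn = sqlite3.connect(":memory:")

# Register the Python function as a one-argument SQL UDF named udf_ingest.
conn.create_function("udf_ingest", 1, ingest_batch_id)

# Invoke the UDF via SELECT, mirroring
# `select udf-name(udf-parameters) from sysibm.sysdummy1` on Db2
# (SQLite permits a SELECT with no FROM clause, so no dummy table is needed).
result = conn.execute("SELECT udf_ingest('customers')").fetchone()[0]
print(result)  # ingest:customers
```

The vendor differences the description points out (SELECT on Db2 versus CALL ... INTO on Oracle) are exactly why an ingestion tool must parameterize how it triggers a UDF per database product.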
Claims (19)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US16/020,829 US10810224B2 (en) | 2018-06-27 | 2018-06-27 | Computerized methods and programs for ingesting data from a relational database into a data lake |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| US20200004862A1 US20200004862A1 (en) | 2020-01-02 |
| US10810224B2 true US10810224B2 (en) | 2020-10-20 |
Family
ID=69054700
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US16/020,829 Expired - Fee Related US10810224B2 (en) | 2018-06-27 | 2018-06-27 | Computerized methods and programs for ingesting data from a relational database into a data lake |
Country Status (1)
| Country | Link |
|---|---|
| US (1) | US10810224B2 (en) |
Families Citing this family (14)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11194813B2 (en) * | 2018-07-06 | 2021-12-07 | Open Text Sa Ulc | Adaptive big data service |
| US11841882B2 (en) * | 2018-09-23 | 2023-12-12 | Microsoft Technology Licensing, Llc | Individualized telemetry processing leveraging digital twins property(ies) and topological metadata |
| CN111367984B (en) * | 2020-03-11 | 2023-03-21 | 中国工商银行股份有限公司 | Method and system for loading high-timeliness data into data lake |
| CN112163031B (en) * | 2020-11-11 | 2023-06-16 | 西安四叶草信息技术有限公司 | Graph data extraction method based on thought guide graph |
| CN112905564B (en) * | 2021-02-26 | 2023-02-21 | 浪潮云信息技术股份公司 | Atlas-based method and device for managing metadata of Oracle database |
| CN113791742B (en) * | 2021-11-18 | 2022-03-25 | 南湖实验室 | High-performance data lake system and data storage method |
| CN114254019A (en) * | 2021-12-23 | 2022-03-29 | 中国工商银行股份有限公司 | Index data statistical method and device |
| US11954531B2 (en) * | 2021-12-28 | 2024-04-09 | Optum, Inc. | Use of relational databases in ephemeral computing nodes |
| CN114048260B (en) * | 2022-01-12 | 2022-09-09 | 南湖实验室 | Method for interconnecting data lake and relational database |
| CN114911809B (en) * | 2022-05-12 | 2024-10-08 | 北京火山引擎科技有限公司 | Data processing method and device |
| CN116166757B (en) * | 2022-12-06 | 2024-11-15 | 浪潮通用软件有限公司 | Multi-source heterogeneous lake and warehouse integrated data processing method, equipment and medium |
| WO2024182891A1 (en) * | 2023-03-07 | 2024-09-12 | Mastercard Technologies Canada ULC | Extensible data enclave platform |
| US12174849B2 (en) * | 2023-05-04 | 2024-12-24 | Microsoft Technology Licensing, Llc | Validation of ETL pipeline using assert data |
| CN116881364B (en) * | 2023-07-11 | 2026-02-03 | 中国工商银行股份有限公司 | Transaction log processing method and device |
Citations (13)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20100211539A1 (en) * | 2008-06-05 | 2010-08-19 | Ho Luy | System and method for building a data warehouse |
| US20120102007A1 (en) * | 2010-10-22 | 2012-04-26 | Alpine Consulting, Inc. | Managing etl jobs |
| US20140358844A1 (en) * | 2013-06-03 | 2014-12-04 | Bank Of America Corporation | Workflow controller compatibility |
| US20150347540A1 (en) * | 2014-06-02 | 2015-12-03 | Accenture Global Services Limited | Data construction for extract, transform and load operations for a database |
| US20160253340A1 (en) | 2015-02-27 | 2016-09-01 | Podium Data, Inc. | Data management platform using metadata repository |
| US20160314202A1 (en) | 2015-02-26 | 2016-10-27 | Accenture Global Services Limited | System for linking diverse data systems |
| US9679041B2 (en) | 2014-12-22 | 2017-06-13 | Franz, Inc. | Semantic indexing engine |
| US20170177309A1 (en) | 2015-12-22 | 2017-06-22 | Opera Solutions U.S.A., Llc | System and Method for Rapid Development and Deployment of Reusable Analytic Code for Use in Computerized Data Modeling and Analysis |
| WO2017106851A1 (en) | 2015-12-18 | 2017-06-22 | Inovalon, Inc. | System and method for providing an on-demand real-time patient-specific data analysis computing platform |
| US20170286526A1 (en) | 2015-12-22 | 2017-10-05 | Opera Solutions Usa, Llc | System and Method for Optimized Query Execution in Computerized Data Modeling and Analysis |
| US20180101583A1 (en) * | 2016-10-11 | 2018-04-12 | International Business Machines Corporation | Technology for extensible in-memory computing |
| US20180150529A1 (en) * | 2016-11-27 | 2018-05-31 | Amazon Technologies, Inc. | Event driven extract, transform, load (etl) processing |
| US20180196858A1 (en) * | 2017-01-11 | 2018-07-12 | The Bank Of New York Mellon | Api driven etl for complex data lakes |
2018
- 2018-06-27 US US16/020,829 patent/US10810224B2/en not_active Expired - Fee Related
Non-Patent Citations (5)
| Title |
|---|
| Bringing Relational Data Into Data Lakes, p. 2, found at URL: http://blog.cask.co/2016/06/bringing-relational-data-into-data-lakes/. |
| Liu et al., "An ETL optimization framework using partitioning and parallelization," Proceedings of the 30th Annual ACM Symposium on Applied Computing, 2015. (Year: 2015). * |
| Simitsis et al. "Optimizing analytic data flows for multiple execution engines," Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, 2012. (Year: 2012). * |
| The Best Data Ingestion Tools for Migrating to a Hadoop Data Lake, p. 5, Home-Grown Ingestion Patterns, URL: http://rcgglobalservices.com/blog/the-best-data-ingestion-tools-for-migrating-to-a-hadoop-data-lake/. |
| Using Azure Data Lake Store for Big Data Requirements, p. 1, URL: https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-data-scenarios#ingest-data-into-data-lake-store. |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US10810224B2 (en) | Computerized methods and programs for ingesting data from a relational database into a data lake | |
| US11347723B2 (en) | Automated suspension and rebuilding of database indices | |
| US9852197B2 (en) | ETL tool interface for remote mainframes | |
| US9244951B2 (en) | Managing tenant-specific data sets in a multi-tenant environment | |
| US10726039B2 (en) | Systems and methods for updating database indexes | |
| Ciaburro et al. | Hands-on machine learning on google cloud platform: Implementing smart and efficient analytics using cloud ml engine | |
| US9984082B2 (en) | Index suspension prior to database update | |
| US10909103B2 (en) | Techniques and architectures for data field lifecycle management | |
| US20200218702A1 (en) | Archiving Objects in a Database Environment | |
| CN112307122A (en) | A data lake-based data management system and method | |
| US10783125B2 (en) | Automatic data purging in a database management system | |
| L’Esteve | Databricks | |
| US20180293277A1 (en) | Explicit declaration of associations to optimize grouping of elements by large data objects | |
| US10558640B2 (en) | Dynamically adding custom data definition language syntax to a database management system | |
| US9405788B2 (en) | Mass delete restriction in a database | |
| Akhtar et al. | Using phoenix | |
| US11409729B2 (en) | Managing database object schema virtual changes | |
| US12147412B2 (en) | Concurrent optimistic transactions for tables with deletion vectors | |
| Shaw et al. | Loading data into hive | |
| Monroe | Fasta Organism Filter | |
| Powers | Getting Started with a Database | |
| Lake et al. | Technical Insights |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| AS | Assignment |
Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW YORK

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:BAUER, DANIEL N.;GARCES ERICE, LUIS;ROONEY, JOHN G.;AND OTHERS;REEL/FRAME:046221/0522

Effective date: 20180626
|
| FEPP | Fee payment procedure |
Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
|
| STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
| FEPP | Fee payment procedure |
Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
| LAPS | Lapse for failure to pay maintenance fees |
Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
| STCH | Information on status: patent discontinuation |
Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
|
| FP | Lapsed due to failure to pay maintenance fee |
Effective date: 20241020 |