CN116028192A - A multi-source heterogeneous data acquisition method, device and storage medium - Google Patents
A multi-source heterogeneous data acquisition method, device and storage medium
- Publication number
- CN116028192A (Application number CN202310315993.0A)
- Authority
- CN
- China
- Prior art keywords
- data
- source
- task
- data source
- data acquisition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 79
- 238000012545 processing Methods 0.000 claims abstract description 17
- 238000013480 data collection Methods 0.000 claims description 85
- 238000013507 mapping Methods 0.000 claims description 22
- 230000008569 process Effects 0.000 claims description 14
- 230000000903 blocking effect Effects 0.000 claims description 5
- 239000000725 suspension Substances 0.000 claims description 4
- 230000009193 crawling Effects 0.000 description 14
- 230000006870 function Effects 0.000 description 10
- 238000005516 engineering process Methods 0.000 description 9
- 238000013500 data storage Methods 0.000 description 5
- 238000004891 communication Methods 0.000 description 4
- 238000007726 management method Methods 0.000 description 4
- 238000004458 analytical method Methods 0.000 description 3
- 230000008878 coupling Effects 0.000 description 3
- 238000010168 coupling process Methods 0.000 description 3
- 238000005859 coupling reaction Methods 0.000 description 3
- 238000011161 development Methods 0.000 description 3
- 230000009466 transformation Effects 0.000 description 3
- 230000003442 weekly effect Effects 0.000 description 3
- 238000013075 data extraction Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 238000000605 extraction Methods 0.000 description 2
- 230000008676 import Effects 0.000 description 2
- 230000000737 periodic effect Effects 0.000 description 2
- 238000004886 process control Methods 0.000 description 2
- 238000012546 transfer Methods 0.000 description 2
- 230000006399 behavior Effects 0.000 description 1
- 230000005540 biological transmission Effects 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 238000006243 chemical reaction Methods 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 238000007405 data analysis Methods 0.000 description 1
- 238000013481 data capture Methods 0.000 description 1
- 230000001419 dependent effect Effects 0.000 description 1
- 238000007499 fusion processing Methods 0.000 description 1
- 238000012423 maintenance Methods 0.000 description 1
- 238000012544 monitoring process Methods 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 230000008447 perception Effects 0.000 description 1
- 238000003672 processing method Methods 0.000 description 1
- 230000003252 repetitive effect Effects 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 238000012216 screening Methods 0.000 description 1
- 238000011144 upstream manufacturing Methods 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
Images
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Technical Field
The present application relates to the field of data processing, and in particular to a multi-source heterogeneous data acquisition method, device and storage medium.
Background Art
The 21st century is an era of big data: data is everywhere and exists in every aspect of life. Whether for data analysis or product requirements, valuable content of interest has to be extracted from certain websites. Because collecting data is repetitive work and manual effort does not scale, web crawlers emerged and have developed rapidly.
As the underlying architecture of data applications and digitalization, data acquisition and access plays an irreplaceable role as an indispensable step in data governance and in the aggregation and unification of data resources across projects.
From a big data perspective, it is necessary to deepen research on data acquisition methodology and to integrate multiple data sources; the associated fusion processing of multi-source heterogeneous data sources must be considered in order to consolidate the underlying infrastructure.
A reasonable data acquisition method is a key step in making data usable. Diversity of data sources has become a basic feature of the big data environment, whereas traditional data acquisition relies on a single source, stores, manages and analyzes relatively small data volumes, and mostly uses relational databases and parallel data warehouses. Therefore, with diverse information sources and massive data volumes, the challenges facing data acquisition technology have become particularly prominent. By data type, data acquisition methods can be divided into offline acquisition, real-time acquisition and Internet acquisition.
Summary of the Invention
In order to solve the above technical problems, the present application provides a multi-source heterogeneous data acquisition method, device and storage medium.
A first aspect of the present application provides a multi-source heterogeneous data acquisition method, the method comprising:
determining the type of a data source, and configuring data source information of the data source;
configuring a task scheduler, the task scheduler being used to execute tasks at scheduled times, execute tasks periodically, determine service nodes, and determine execution strategies;
creating a data collection task, the data collection task including a data source, a data target source and a data collection strategy;
executing the data collection task according to the data collection strategy through the configured task scheduler;
outputting a data collection result.
Optionally, executing the data collection task according to the data collection strategy through the configured task scheduler includes:
the data source information contains a data source table, and the data source table lists the fields of the data to be collected;
selecting the data source and downloading all the data in the data source;
determining, among all the data, the target source table to be collected;
constructing a mapping relationship between the data source table and the target source table according to the correlation between the fields in the data source table and the fields in the target source table;
performing data collection according to the mapping relationship.
Optionally, when the data source is a website, configuring the data source information of the data source includes:
obtaining pre-configured collection script information;
if the script is a custom script, obtaining the custom script file;
if the script is a Java script, configuring the package name, class name and method name of the script after obtaining the script file.
Optionally, when the data source is a relational database, configuring the data source information of the data source includes:
configuring the IP address, port, user name and password information of the data source;
the relational database includes: MySQL, Oracle, SQLServer, PostgreSQL, Hive, HDFS, MongoDB, Gbase and Kingbase.
Optionally, configuring the task scheduler includes:
configuring a blocking handling strategy, configuring subtasks, configuring a task retry strategy, configuring a task execution trigger strategy, and configuring an alarm strategy.
Optionally, the method further includes: generating task logs and statistical reports in the process of collecting data;
the task log records detailed error logs of tasks during execution;
the statistical report includes data collection information and node information; the data collection information includes: total amount, total capacity, the amount newly added on the day, and 30-day trend statistics; the node information includes: CPU usage, memory usage, total memory and remaining memory; task information includes: number of successes, number of failures, number of aborted executions, and a weekly statistical chart.
A second aspect of the present application provides a multi-source heterogeneous data acquisition device, including:
a determining unit, configured to determine the type of a data source and configure data source information of the data source;
a configuration unit, configured to configure a task scheduler, the task scheduler being used to execute tasks at scheduled times, execute tasks periodically, determine service nodes and determine execution strategies;
a creating unit, configured to create a data collection task, the data collection task including a data source, a data target source and a data collection strategy;
a collection unit, configured to execute the data collection task according to the data collection strategy through the configured task scheduler;
an output unit, configured to output a data collection result.
Optionally, the collection unit is specifically configured to:
the data source information contains a data source table, and the data source table lists the fields of the data to be collected;
select the data source and download all the data in the data source;
determine, among all the data, the target source table to be collected;
construct a mapping relationship between the data source table and the target source table according to the correlation between the fields in the data source table and the fields in the target source table;
perform data collection according to the mapping relationship.
A third aspect of the present application provides a multi-source heterogeneous data acquisition device, the device including:
a processor, a memory, an input/output unit and a bus;
the processor is connected to the memory, the input/output unit and the bus;
the memory stores a program, and the processor invokes the program to execute the method of the first aspect or any optional implementation of the first aspect.
A fourth aspect of the present application provides a computer-readable storage medium storing a program which, when executed on a computer, performs the method of the first aspect or any optional implementation of the first aspect.
It can be seen from the above technical solutions that the present application has the following advantages:
The present application provides a multi-source heterogeneous data acquisition method. Before data collection, configuring the task scheduler and the data sources improves the applicability and ease of use of the method, reduces dependence on user expertise, and allows horizontal scaling as needed to achieve distributed data collection. Once configuration is complete, the task scheduler can drive automated collection of many types of data, integrating web page data, offline data, real-time data and other data collection functions, so that crawling data from massive numbers of data sources is not affected by the heterogeneity of the data sources and data storage systems, greatly improving the efficiency of data collection.
Brief Description of the Drawings
In order to explain the technical solutions in this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the application, and those of ordinary skill in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a schematic flow chart of an embodiment of the multi-source heterogeneous data acquisition method provided in the present application;
Fig. 2 is a schematic structural diagram of an embodiment of the multi-source heterogeneous data acquisition device provided in the present application;
Fig. 3 is a schematic structural diagram of another embodiment of the multi-source heterogeneous data acquisition device provided in the present application.
Detailed Description of the Embodiments
As the underlying architecture of data applications and digitalization, data acquisition and access plays an irreplaceable role as an indispensable step in data governance and in the aggregation and unification of data resources across projects.
A reasonable data acquisition method is a key step in making data usable. Diversity of data sources has become a basic feature of the big data environment, whereas traditional data acquisition relies on a single source, stores, manages and analyzes relatively small data volumes, and mostly uses relational databases and parallel data warehouses. Therefore, with diverse information sources and massive data volumes, the challenges facing data acquisition technology have become particularly prominent. By data type, data acquisition methods can be divided into offline acquisition, real-time acquisition and Internet acquisition.
The existing data acquisition approaches are as follows:
(1) Offline acquisition
The core of offline data acquisition is the ETL tool. In the data warehouse context, the ETL acquisition process includes data extraction (Extract), transformation (Transform) and loading (Load). Common mainstream ETL tools include Apache Sqoop, Kettle, Talend, etc.
Apache Sqoop (SQL-to-Hadoop) is mainly used to transfer data between HDFS/Hive and relational databases (MySQL, Oracle, SQL Server, etc.) and serves as the bridge between Hadoop and SQL. Sqoop can import data from relational databases into HDFS/Hive/HBase and can also export data from HDFS into relational databases; it uses MapReduce as the underlying engine and synchronizes data in parallel with high efficiency. Kettle is an open-source ETL tool written in Java and released by Pentaho; it runs on Windows, Linux and Unix, extracts data efficiently and stably, is powerful, provides APIs, supports secondary development and is easy to integrate. Kettle has two kinds of script files, transformation and job: a transformation performs the basic conversion of the data, while a job controls the entire workflow. Talend is an ETL tool that supports AWS Redshift and S3; through a graphical, drag-and-drop interface it can use Hadoop, Spark, Spark Streaming and NoSQL databases without writing code.
ETL tools are usually configured graphically, which is simple and flexible: users need not concern themselves with the internal details of the database and can focus on functionality, which simplifies operation. They support all common data sources, such as Oracle, SQL Server, DB2, MySQL and Access, and also support flat data sources such as txt, excel, csv and xml. They are highly compatible, supporting various software platforms (operating systems such as Windows, Linux and domestic operating systems) and hardware platforms (such as x86 and Loongson). They are powerful, with a rich, highly generic and easily reusable set of data processing components; they provide flexible customization rules for better control of data quality, and strong management functions such as permission management and log management. However, most ETL tools are not aware of real-time database changes and cannot satisfy scenarios requiring real-time big data processing. Real-time database synchronization tools, such as Alibaba's Canal, can synchronize data in real time by reading the MySQL binlog.
(2) Real-time acquisition
Common collection tools for log files include Flume, Logstash, FileBeat and so on. Real-time acquisition is mainly used in business scenarios that involve stream processing, for example recording the various operational activities performed on data sources, such as traffic management for network monitoring, stock accounting in financial applications, and user access behavior recorded by web servers. In a stream processing scenario, the data acquisition component becomes a Kafka consumer: it intercepts the continuous stream of upstream data, performs the processing required by the business scenario (such as deduplication, denoising and intermediate computation), and then writes the result into the corresponding data store. This process resembles traditional ETL, but it is a streaming process rather than a scheduled batch job.
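As a minimal sketch of the Kafka-consumer pattern described above, the following Java example polls an upstream topic and hands each record to a processing step before writing it onward; the broker address, topic name, group ID and the writeToStore method are illustrative assumptions, not details taken from the application.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class StreamCollector {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-host:9092");          // service address (assumed)
        props.put("group.id", "collector-group");                   // consumer group ID (assumed)
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("upstream-topic"));           // topic name is illustrative
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    String cleaned = record.value().trim();          // stand-in for dedup/denoise logic
                    writeToStore(cleaned);                           // hypothetical sink into the data store
                }
            }
        }
    }

    private static void writeToStore(String value) {
        System.out.println("persist: " + value);                     // placeholder for the real write path
    }
}
```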
(3) Internet data acquisition
Internet data acquisition relies on web crawler technology: Internet search-engine techniques are used to perform targeted, industry-specific and precise data capture, and the data is classified according to certain rules and screening criteria to form database files. With the development of Internet technology and the growth of massive amounts of information, acquiring and sorting information has become an ever-greater demand. At present there are many tools for Internet data acquisition in China; mainstream tools on the market include Locomotive (火车头), Octopus (八爪鱼), GooSeeker (集搜客), ShenJianShou cloud crawler, Kuangren collector and others. A crawler obtains the data of interest from a target website by means of a written piece of code, reducing manual workload. The main workflow of a crawler is: obtain the URL of an initial web page, obtain all the information at that URL, extract the content of interest from the obtained information, parse and match that content, and finally save it to a database. A crawler can be divided into three modules overall: a network request module, a crawling process control module, and a content analysis and extraction module.
The essence of the network request module is sending and receiving HTTP(S) requests. An HTTP request consists of three parts: the request line, the request headers and the request body. The request line includes the request method and the request URL; the common request methods are GET and POST, and there are also OPTIONS, DELETE, PUT and other methods. The request headers generally include Accept, Referer, Accept-Language, User-Agent, Host and other information. The request body contains key-value pairs of parameter names and values.
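To make the request-line/headers/body structure concrete, here is a minimal sketch of a network request module using the JDK's built-in java.net.http client; the URL and header values are illustrative assumptions rather than details from the application.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RequestModule {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Request line: GET + URL; the headers carry Accept, User-Agent, Referer, etc.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/list?page=1"))   // target URL is illustrative
                .header("Accept", "text/html")
                .header("User-Agent", "Mozilla/5.0 (data-collector)")
                .header("Referer", "https://example.com/")
                .GET()
                .build();

        // The response body (HTML/JSON) is then handed to the content analysis module.
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("status = " + response.statusCode());
        System.out.println(response.body().substring(0, Math.min(200, response.body().length())));
    }
}
```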
The crawling process control module mainly sets the crawling rate, whether a proxy is needed, and which crawling approach to use. If the crawling rate is too high, the IP may be blocked or the target server may crash, so the crawling rate must be limited. In addition, crawling the same website many times may get the account IP blocked, in which case a proxy must be added to solve the problem. Because different websites are built with different front-end techniques, the crawling approach may also differ, for example WebClient, WebDriver, etc.
The content analysis and extraction module re-extracts the parts of interest after the full content of the corresponding URL has been obtained. A common method is to derive an XPath path by setting the corresponding node and attributes, and then obtain the information of the corresponding node through that XPath path. Web page information is mostly in HTML, JSON or JavaScript format, which often differs from the format in which the information is stored, so after obtaining the information its format still needs to be analyzed and converted before it is stored in the database.
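The XPath-based extraction described above can be sketched with the JDK's javax.xml.xpath API. The sample document and the //item/title path below are assumptions for illustration; real HTML usually has to be cleaned into well-formed XML first.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ContentExtractor {
    public static void main(String[] args) throws Exception {
        // Stand-in for a downloaded page already normalized to well-formed XML.
        String xml = "<channel><item><title>first</title></item>"
                   + "<item><title>second</title></item></channel>";

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // XPath locates the nodes of interest; the path below is illustrative.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList titles = (NodeList) xpath.evaluate("//item/title", doc, XPathConstants.NODESET);

        for (int i = 0; i < titles.getLength(); i++) {
            // Extracted values would next be converted to the storage format and saved.
            System.out.println(titles.item(i).getTextContent());
        }
    }
}
```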
The crawlers currently on the market are generally divided into three parts: data collection (downloading the relevant web pages), data processing (analyzing the data of the relevant web pages) and data storage (saving the crawled content), while advanced crawlers use distributed technology and concurrent programming for data crawling and analysis. In the past, the pages crawled were mainly HTML documents themselves, so most of the captured content was contained in the HTML; but with the development of front-end technology, dynamic web pages have become more and more common. These dynamic websites use Ajax, and compared with traditional HTML documents, much of the information on today's web pages is generated dynamically by JavaScript.
The diversity of web pages means that a common crawler cannot handle pages in different formats; when crawling websites in different formats, back-end programmers have to modify the original code, which greatly reduces work efficiency and code maintainability.
Based on this, the present application improves and optimizes the prior art and provides a multi-source heterogeneous data acquisition method, aiming to realize general-purpose multi-source heterogeneous data acquisition through a distributed system and to improve data processing efficiency.
It should be noted that the multi-source heterogeneous data acquisition method provided by this application can be applied to a terminal, a system, or a server. For example, the terminal may be a smartphone, a computer, a tablet, a smart TV, a smart watch or a portable computer terminal, and may also be a fixed terminal such as a desktop computer. For convenience of description, the terminal is used as the execution subject in the examples of this application.
The embodiments of the present application are described in detail below.
Please refer to Fig. 1, which is a schematic flow chart of an embodiment of the multi-source heterogeneous data acquisition method provided by the present application. The method includes:
101. Determine the type of the data source, and configure the data source information of the data source.
The method provided by this application can automatically collect from different types of data sources. First, the type of the data source is determined and its data source information is configured in the system; the configuration process can be guided by visual prompts, reducing the difficulty of operation for operators. This application is built on a Framework + plugin architecture. The applicable data sources may be relational or non-relational databases, and structured, unstructured or semi-structured data. Relational and non-relational data sources mainly include MySQL, Oracle, SQLServer, PostgreSQL, Hive, HDFS, MongoDB, Gbase, Kingbase, etc. For a relational database, the main data source information to configure includes the IP address, port, user name and password. Semi-structured data sources include txt, csv, excel and access files; for these, the main data source information to configure includes the communication protocol type (ftp/sftp), the communication protocol information (IP, port, user, password, file path) and the data file information (such as start row, start column and file content delimiter). Custom script collection requires uploading the script file; a Java script additionally requires configuring the package name, class name and method name, while JS and Python scripts need to define the startup method in the script. Real-time data sources mainly include Kafka; this type of data source needs to be configured with the service address, the group ID of the consumed topic, the key and value decoders, and so on.
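As a hedged sketch only (the application does not prescribe a concrete data model), the per-type configuration items listed above could be captured in simple Java records such as the following; all field and class names are illustrative assumptions.

```java
// Illustrative configuration records for the data-source types described above.
public class DataSourceConfigs {

    // Relational source: IP address, port, user name, password (plus database name).
    public record RelationalSource(String type, String ip, int port,
                                   String user, String password, String database) { }

    // Semi-structured file source reached over ftp/sftp, with file-layout hints.
    public record FileSource(String protocol, String ip, int port,
                             String user, String password, String path,
                             int startRow, int startColumn, String delimiter) { }

    // Real-time source (Kafka): service address, consumer group, key/value decoders.
    public record KafkaSource(String bootstrapServers, String groupId,
                              String keyDeserializer, String valueDeserializer) { }

    public static void main(String[] args) {
        RelationalSource mysql =
                new RelationalSource("MySQL", "192.168.1.10", 3306, "user", "secret", "demo");
        FileSource csv =
                new FileSource("sftp", "192.168.1.20", 22, "user", "secret", "/data/in.csv", 1, 0, ",");
        KafkaSource kafka =
                new KafkaSource("kafka-host:9092", "collector-group",
                                "StringDeserializer", "StringDeserializer");
        System.out.println(mysql + "\n" + csv + "\n" + kafka);
    }
}
```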
102. Configure a task scheduler, the task scheduler being used to execute tasks at scheduled times, execute tasks periodically, determine service nodes and determine execution strategies.
In actual data collection, the collection tasks are executed by the task scheduler. Before collection, the task scheduler must be configured for the data sources actually used. Configuring the task scheduler may specifically mean determining the service node, determining the execution strategy, and determining the execution period and time; it may also include configuring a blocking handling strategy, configuring subtasks, configuring a task retry strategy, configuring a task execution trigger strategy and configuring an alarm strategy. For example, the blocking handling strategy may include: single-machine serial (put the run into the task execution queue and execute the runs in chronological order); discard subsequent scheduling (if a run is already in the execution queue, the new run is discarded and marked as failed); or override the previous scheduling (the run already in the queue is terminated and the current run is executed).
Configuring subtasks may include the following:
When building a task, if the next task needs to be triggered when this task finishes executing successfully, the tasks are interdependent: task A -> task B -> task C, that is, the tasks are scheduled serially. In that case the other task can be run as a subtask of this task. Therefore, the strategy for configuring subtasks may include the trigger conditions of the subtask, for example that the current task has finished executing, or that a specified time has been reached.
Configuring the task retry strategy may include the following:
A user-defined number of retries on task failure: when a task fails, it is retried automatically according to the preset number of failure retries. Sharded tasks support failure retries at shard granularity; other retry strategies are possible in practice and are not limited here.
Configuring the trigger strategy may specifically include the following:
Configuring the trigger mode: in addition to triggering task execution by a Cron expression or by task dependency, tasks can also be triggered by events.
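The scheduling options above (Cron period, blocking strategy, retries, subtasks, trigger mode) can be pulled together in a minimal configuration sketch; the enum values, class names and concrete values below are assumptions, since the application does not name a specific scheduler implementation.

```java
// Illustrative task-scheduler configuration covering the options described above.
public class SchedulerConfigDemo {

    enum BlockingStrategy { SERIAL_QUEUE, DISCARD_LATER, COVER_EARLIER } // serial / discard subsequent / override previous
    enum TriggerMode { CRON, TASK_DEPENDENCY, EVENT }

    record ScheduleConfig(String serviceNode, String cronExpression,
                          BlockingStrategy blockingStrategy, int failRetryCount,
                          TriggerMode triggerMode, String childTaskId) { }

    public static void main(String[] args) {
        // Run on node-1 every day at 02:00, retry twice on failure, then trigger a child task.
        ScheduleConfig config = new ScheduleConfig(
                "node-1",                      // service node (assumed name)
                "0 0 2 * * ?",                 // Cron expression for the scheduling period
                BlockingStrategy.SERIAL_QUEUE, // queue the runs and execute them in time order
                2,                             // task retry strategy: retry count on failure
                TriggerMode.CRON,              // trigger strategy
                "task-B");                     // subtask executed after this task succeeds
        System.out.println(config);
    }
}
```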
103. Create a data collection task, the data collection task including a data source, a data target source and a data collection strategy.
104. Execute the data collection task according to the data collection strategy through the configured task scheduler.
105. Output the data collection result.
After the data source and the task scheduler have been configured in step 101 and step 102 respectively, a specific data collection task is created, which includes the specific data source, the data target source and the data collection strategy.
The specific way of executing the data collection task according to the data collection strategy through the configured task scheduler is as follows:
the data source information contains a data source table, and the data source table lists the fields of the data to be collected;
select the data source and download all the data in the data source;
determine, among all the data, the target source table to be collected;
construct a mapping relationship between the data source table and the target source table according to the correlation between the fields in the data source table and the fields in the target source table;
perform data collection according to the mapping relationship.
In this application, the data source information includes a data source table, which lists the fields of the data to be collected. During collection, all the data in the data source is downloaded, the target source table to be collected is determined from it, and a mapping relationship between the two tables is then constructed according to the correlation of the fields between them; this mapping relationship is used for the subsequent data collection.
The approach provided by this application solves the problem that data collection cannot proceed when field names are inconsistent or not completely consistent, and it provides flexibility when collecting from multiple data sources. For example, the fields in the data source table (A) are name, age and address, and the fields in the target source table (B) are user_name, user_age and addr. Since the field names may or may not match, they need to be bound through the mapping relationship so that the data in A.name is collected into B.user_name; the collection task is then completed and the data collection result is output. As another example, in many-to-many data collection, the mapping relationship can be adjusted according to the correlation between the fields, where the correlation can be expressed, for example, as the order of the fields or in some other way.
This is illustrated below with an example:
Select a data source that has already been built, and select the database name -> table name and the query conditions;
Select a target data source that has already been built, and select the database name -> table name -> collection mode (truncate the table then write / append write) -> write mode (insert, update);
Configure the field mapping method, for example using subscripts (column positions) to map field relationships, as shown in the sketch after this list;
Configure the task information and the task scheduling, for example selecting a task scheduler, then configuration preview and task generation: by selecting rules, a configuration preview is generated and displayed so that the operator can view and confirm it, and the task is finally created successfully.
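The subscript-based field mapping mentioned above can be illustrated as follows: source and target columns are paired purely by position. The column arrays reuse the name/age/address example from the text; everything else is an assumption for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SubscriptFieldMapping {
    public static void main(String[] args) {
        // Columns are paired by subscript (position), as described above.
        String[] sourceFields = { "name", "age", "address" };        // fields of source table A (from the example)
        String[] targetFields = { "user_name", "user_age", "addr" }; // fields of target table B

        Map<String, String> mapping = new LinkedHashMap<>();
        for (int i = 0; i < sourceFields.length && i < targetFields.length; i++) {
            mapping.put(sourceFields[i], targetFields[i]);
        }
        mapping.forEach((src, dst) -> System.out.println(src + " -> " + dst));
    }
}
```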
The process of creating a data collection task is described below with two examples:
Example one: collecting from a relational data source, from MySQL to MySQL:
a. Create the data source (the data origin)
Enter the data source name, MySQL address, port, user name, password and database name.
b. Create the data target source, that is, the storage location of the collected data
Enter the data source name, MySQL address, port, user name, password and database name.
c. Create the execution strategy
Select the node on which the task should execute -> configure the scheduling period (Cron expression) -> task execution duration (either a fixed time or determined by the actual running time of the task) -> select the task execution strategy (fixed first, fixed last, failover, busy transfer, etc.) -> configure subtasks (tasks executed after the current task finishes; may be empty)
d. Create the task
After the task is created, select the data source; the system downloads all the data sources. Select the data source from step a, and the system automatically lists all the tables in that database; then determine the data table to be collected, and the system lists all the fields of the selected table;
Select the data target; the system downloads all the data target sources. Select the data source from step b, and the system automatically lists all the tables in that database; select the data table to be collected, and the system lists all the fields of the selected table;
After the data has been downloaded, the system maps the fields of the two data tables, the data source table (source) and the target source table (target). For example, the fields in the data source table (A) are name, age and address, and the fields in the target source table (B) are user_name, user_age and addr. Since the field names may or may not match, they need to be bound through the mapping relationship so that the data in A.name is collected into B.user_name; the collection task is completed, and finally the data collection result is output.
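A minimal sketch of the MySQL-to-MySQL transfer in example one, using plain JDBC, is given below. The table names and the name→user_name mapping follow the example in the text; the connection URLs, credentials and write mode are assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class MysqlToMysqlCollector {
    public static void main(String[] args) throws Exception {
        // Source (A) and target (B) connections; URLs and credentials are illustrative.
        try (Connection src = DriverManager.getConnection(
                     "jdbc:mysql://source-host:3306/src_db", "user", "secret");
             Connection dst = DriverManager.getConnection(
                     "jdbc:mysql://target-host:3306/dst_db", "user", "secret")) {

            // Field mapping from the example: name -> user_name, age -> user_age, address -> addr.
            String select = "SELECT name, age, address FROM A";
            String insert = "INSERT INTO B (user_name, user_age, addr) VALUES (?, ?, ?)";

            try (Statement read = src.createStatement();
                 ResultSet rows = read.executeQuery(select);
                 PreparedStatement write = dst.prepareStatement(insert)) {
                while (rows.next()) {
                    write.setString(1, rows.getString("name"));
                    write.setInt(2, rows.getInt("age"));
                    write.setString(3, rows.getString("address"));
                    write.addBatch();
                }
                write.executeBatch();   // append-write mode; truncate B first for "clear then write"
            }
        }
    }
}
```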
Example two: custom script collection, from a web page to MySQL:
A. Create the data source (the data origin): enter the data source name, the package name, class name and method name of the Java project, and upload the executable file compiled from the script (*.jar).
B. Create the data target source, that is, the storage location of the collected data
Enter the data source name, MySQL address, port, user name, password and database name.
C. Create the execution strategy
Select the node on which the task should execute -> configure the scheduling period (Cron expression) -> task execution duration (either a fixed time or determined by the actual running time of the task) -> select the task execution strategy (fixed first, fixed last, failover, busy transfer, etc.) -> configure subtasks (tasks executed after the current task finishes; may be empty).
d. Create the task
Select the data source; the system lists all the fields of the selected table according to the table (written in the script according to the rules). Select the target data source; the system downloads all the target data sources. Select the data source from step B, and the system automatically lists all the tables in that database; select the data table to be collected, and the system lists all the fields of the selected table. After the data has been downloaded, the system maps the fields of the two data tables, the data source table and the target source table. For example, the fields in the data source table (A) are name, age and address, and the fields in table (B) of the data target are user_name, user_age and addr. Since the field names may or may not match, they need to be bound through the mapping relationship so that the data in A.name is collected into B.user_name.
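For example two, the framework has to locate the uploaded jar's entry point from the configured package name, class name and method name. The following sketch shows one plausible way to do that via reflection; the jar path, class name, method name and String-typed result are assumptions for illustration, not the application's actual implementation.

```java
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Paths;

public class CustomScriptRunner {
    public static void main(String[] args) throws Exception {
        // Values that would come from the data-source configuration (all illustrative).
        String jarPath = "/opt/collector/plugins/web-collector.jar";
        String className = "com.example.collect.WebPageCollector"; // package name + class name
        String methodName = "collect";                              // configured method name

        // Load the uploaded jar and invoke the configured entry point by reflection.
        URL[] urls = { Paths.get(jarPath).toUri().toURL() };
        try (URLClassLoader loader = new URLClassLoader(urls,
                CustomScriptRunner.class.getClassLoader())) {
            Class<?> scriptClass = Class.forName(className, true, loader);
            Object script = scriptClass.getDeclaredConstructor().newInstance();
            Method entry = scriptClass.getMethod(methodName);

            // The returned payload would then be written to the MySQL target via the field mapping.
            Object result = entry.invoke(script);
            System.out.println("collected: " + result);
        }
    }
}
```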
The present application provides a multi-source heterogeneous data acquisition method. Before data collection, configuring the task scheduler and the data sources improves the applicability and ease of use of the method, reduces dependence on user expertise, and allows horizontal scaling as needed to achieve distributed data collection. Once configuration is complete, the task scheduler can drive automated collection of many types of data, integrating web page data, offline data, real-time data and other data collection functions, so that crawling data from massive numbers of data sources is not affected by the heterogeneity of the data sources and data storage systems, greatly improving the efficiency of data collection.
The method provided by the above embodiments has the following advantages:
Ease of use: guided, configuration-driven task setup makes the method easier to use, lowers the operational threshold and broadens the range of users.
Scalability: the system can be scaled horizontally as needed, making the method suitable for all kinds of application scenarios and enabling distributed data collection.
Universality: the method integrates web page data, offline data, real-time data and other types of data collection functions.
Data heterogeneity: when crawling document data from massive numbers of data sources, the heterogeneity of the data sources and data storage systems is shielded, and the scheduler system completes automatic periodic collection.
In another optional embodiment, task logs and statistical reports can also be generated during data collection, and the task log records detailed error logs of tasks during execution. Because the whole acquisition system performs list crawling and detail crawling while running, tasks may fail for network or other reasons; in this case effective feedback on task results is needed so that failures among a large number of tasks can be found and corrected in time. The task log module mainly records the detailed error logs of tasks during execution, which helps operation and maintenance personnel trace the cause. Task logs can also be deleted within a specified date range to relieve storage pressure. The statistical report may include data collection information and node information; the data collection information includes: total amount, total capacity, the amount newly added on the day and 30-day trend statistics; the node information includes: CPU usage, memory usage, total memory and remaining memory; task information includes: number of successes, number of failures, number of aborted executions, and a weekly statistical chart.
The above embodiments describe the method of this application; the device and the storage medium involved in this application are described below.
Referring to Fig. 2, this application provides a multi-source heterogeneous data acquisition device, including:
a determining unit 201, configured to determine the type of a data source and configure data source information of the data source;
a configuration unit 202, configured to configure a task scheduler, the task scheduler being used to execute tasks at scheduled times, execute tasks periodically, determine service nodes and determine execution strategies;
a creating unit 203, configured to create a data collection task, the data collection task including a data source, a data target source and a data collection strategy;
a collection unit 204, configured to execute the data collection task according to the data collection strategy through the configured task scheduler;
an output unit 205, configured to output a data collection result;
the collection unit 204 is specifically configured to:
the data source information contains a data source table, and the data source table lists the fields of the data to be collected;
select the data source and download all the data in the data source;
determine, among all the data, the target source table to be collected;
construct a mapping relationship between the data source table and the target source table according to the correlation between the fields in the data source table and the fields in the target source table;
perform data collection according to the mapping relationship.
Optionally, when the data source is a website, the configuration unit 202 is specifically configured to:
obtain pre-configured collection script information;
if the script is a custom script, obtain the custom script file;
if the script is a Java script, configure the package name, class name and method name of the script after obtaining the script file.
Optionally, when the data source is a relational database, the determining unit 201 is specifically configured to:
configure the IP address, port, user name and password information of the data source;
the relational database includes: MySQL, Oracle, SQLServer, PostgreSQL, Hive, HDFS, MongoDB, Gbase and Kingbase.
Optionally, the configuration unit 202 is specifically configured to:
configure a blocking handling strategy, configure subtasks, configure a task retry strategy, configure a task execution trigger strategy and configure an alarm strategy.
Optionally, the device further includes a generating unit 206, specifically configured to:
generate task logs and statistical reports during data collection;
the task log records detailed error logs of tasks during execution;
the statistical report includes data collection information and node information; the data collection information includes: total amount, total capacity, the amount newly added on the day and 30-day trend statistics; the node information includes: CPU usage, memory usage, total memory and remaining memory; task information includes: number of successes, number of failures, number of aborted executions, and a weekly statistical chart.
The device provided in this embodiment corresponds to the method provided in the foregoing embodiments and has the corresponding specific technical features, so it has the same technical effects as the method embodiments, which are not repeated here.
Referring to Fig. 3, the present application also provides a multi-source heterogeneous data acquisition device, including:
a processor 301, a memory 302, an input/output unit 303 and a bus 304;
the processor 301 is connected to the memory 302, the input/output unit 303 and the bus 304;
the memory 302 stores a program, and the processor 301 invokes the program to execute any of the multi-source heterogeneous data acquisition methods above.
The present application also relates to a computer-readable storage medium storing a program which, when run on a computer, causes the computer to execute any of the multi-source heterogeneous data acquisition methods above.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative; the division into units is only a division by logical function, and there may be other divisions in actual implementation, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
Claims (8)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310315993.0A CN116028192A (en) | 2023-03-29 | 2023-03-29 | A multi-source heterogeneous data acquisition method, device and storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN116028192A true CN116028192A (en) | 2023-04-28 |
Family
ID=86077883
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310315993.0A Pending CN116028192A (en) | 2023-03-29 | 2023-03-29 | A multi-source heterogeneous data acquisition method, device and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN116028192A (en) |
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111460019A (en) * | 2020-04-02 | 2020-07-28 | 中电工业互联网有限公司 | Data conversion method and middleware of heterogeneous data source |
| CN111898009A (en) * | 2020-06-16 | 2020-11-06 | 华北电力大学 | A distributed acquisition system and method for multi-source power data fusion |
| CN112433998A (en) * | 2020-11-20 | 2021-03-02 | 广东电网有限责任公司佛山供电局 | Multisource heterogeneous data acquisition and convergence system and method based on power system |
| CN113590626A (en) * | 2021-08-03 | 2021-11-02 | 中铁工程装备集团有限公司 | Multi-source heterogeneous data acquisition system and method for tunneling equipment |
Non-Patent Citations (1)
| Title |
|---|
| Liu Hai; Zhang Zhuxi; Ren Wen; Xiao Yanping: "Research and design of a distributed integration tool for heterogeneous data sources", Application Research of Computers, no. 1, pages 204-206 * |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117112697A (en) * | 2023-08-29 | 2023-11-24 | 港华数智能源科技(深圳)有限公司 | Data management method and related device |
| CN117707886A (en) * | 2023-12-04 | 2024-03-15 | 中电金信软件(上海)有限公司 | Metadata acquisition method and device |
| CN118093707A (en) * | 2024-04-28 | 2024-05-28 | 北方健康医疗大数据科技有限公司 | A multi-modal data collection method, system, terminal and storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| RJ01 | Rejection of invention patent application after publication | ||
| RJ01 | Rejection of invention patent application after publication |
Application publication date: 20230428 |