Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a schematic flow chart of a data writing method based on an Hbase database according to a first embodiment of the present invention, which can be applied to a terminal with a data processing function, such as a computer, and the data writing method based on the Hbase database shown in fig. 1 mainly includes the following steps:
S101, acquiring, from a thread, a data record corresponding to a file to be written, a row primary key value corresponding to the data record and an identification code of the thread, generating a row primary key list containing the correspondence between the data record and the row primary key value, and taking the acquired identification code of the thread and the generated row primary key list as reference data.
The HBase database includes a plurality of threads that are used for allocation and scheduling. Each thread has its own identification code, i.e., the identification (ID) of the thread. In practical applications, a file to be written may be divided into a plurality of data records; the data records of a file to be written are mostly allocated to one thread, but may also be allocated to a plurality of threads. One data record corresponds to one row primary key value (rowkey). The row primary key list includes a plurality of correspondences between data records and row primary key values, and the row primary key list corresponds to the acquired thread ID.
Optionally, the row primary key list may also be associated with the file to be written: the correspondence between the row primary key list and the file to be written is stored, and the thread is notified to store the same correspondence.
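As a rough illustration of S101, the reference data can be pictured as a thread ID paired with a mapping from row primary key values to data records. The function name `build_reference_data`, the dictionary layout and the sample values are assumptions made for this sketch and are not taken from the embodiment.

```python
def build_reference_data(thread_id, keyed_records):
    """Build the S101 reference data from what the thread hands back.

    keyed_records is an iterable of (rowkey, record) pairs (hypothetical layout).
    """
    # Row primary key list: one rowkey -> data record entry per record.
    rowkey_list = {rowkey: record for rowkey, record in keyed_records}
    return {"thread_id": thread_id, "rowkey_list": rowkey_list}


# Illustrative values only.
reference = build_reference_data(
    "thread-7", [("rk-001", "alice"), ("rk-002", "bob")]
)
```

Keeping the thread ID next to the list mirrors the statement that the row primary key list corresponds to the acquired thread ID.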
S102, writing the acquired data record, the row primary key value corresponding to the acquired data record, the identification code of the acquired thread and the generated row primary key list into a cache memory in a database.
In practical applications, the cache memory is the MemStore: the acquired data record, the row primary key value corresponding to the acquired data record, the identification code of the acquired thread and the generated row primary key list are written into the MemStore.
S103, writing the data records stored in the cache memory, the row primary key values corresponding to the stored data records, the stored identification code of the thread and the stored row primary key list into a distributed file system in the database, and, after the writing is finished, taking the identification code of the thread and the row primary key list stored in the cache memory as data to be compared.
Here, the data stored in the cache memory in S102 is written into an HFile in the distributed file system of the HBase database.
S104, comparing the reference data with the data to be compared, and if the comparison result shows that a data record has not been written into the distributed file system of the database, writing the file to be written into the database again.
The purpose of the comparison is to ensure the integrity of the data written to the distributed file system. If the comparison result indicates that there is a data record that has not been written into the distributed file system of the database, the file to be written is written into the database again, that is, steps S101 to S104 are executed again.
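Steps S101 to S104 can be sketched as a verify-and-retry loop. This is a minimal sketch under stated assumptions: `flush` is a hypothetical callback standing in for the write to the distributed file system, the in-memory dictionary stands in for the cache memory, and `max_attempts` is an added safety bound that the embodiment does not mention. No real HBase API is used.

```python
def write_with_verification(thread_id, keyed_records, flush, max_attempts=3):
    """Repeat S101-S104 until the comparison finds no unwritten record.

    flush(cache) is a hypothetical callback; it returns the thread ID and the
    rowkeys it stored. The function returns the number of attempts needed.
    """
    keyed_records = list(keyed_records)
    for attempt in range(1, max_attempts + 1):
        # S101: reference data = thread ID + row primary key list.
        reference = (thread_id, sorted(rowkey for rowkey, _ in keyed_records))
        # S102: stage everything in an in-memory cache (MemStore stand-in).
        cache = {"thread_id": thread_id, "rows": dict(keyed_records)}
        # S103: persist the cache, then take the data to be compared.
        stored_thread_id, stored_rowkeys = flush(cache)
        to_compare = (stored_thread_id, sorted(stored_rowkeys))
        # S104: equal snapshots mean every record was written.
        if reference == to_compare:
            return attempt
    raise RuntimeError("records still missing after %d attempts" % max_attempts)


def make_flaky_flush():
    """Return a flush stand-in that loses one row on its first call only."""
    state = {"calls": 0}

    def flush(cache):
        state["calls"] += 1
        rowkeys = list(cache["rows"])
        if state["calls"] == 1:
            rowkeys = rowkeys[:-1]  # simulate a record that was not written
        return cache["thread_id"], rowkeys

    return flush


attempts = write_with_verification(
    "thread-7", [("rk-001", "alice"), ("rk-002", "bob")], make_flaky_flush()
)
```

With the simulated loss on the first flush, the loop needs a second pass before the two snapshots agree.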
In the embodiment of the present invention, a data record corresponding to a file to be written, a row primary key value corresponding to the data record and an identification code of the thread are acquired from the thread, a row primary key list containing the correspondence between the data record and the row primary key value is generated, and the acquired identification code of the thread and the generated row primary key list are used as reference data. The acquired data record, the row primary key value corresponding to the acquired data record, the identification code of the acquired thread and the generated row primary key list are written into a cache memory in the database, and the data stored in the cache memory are then written into a distributed file system in the database. After the writing is finished, the identification code of the thread and the row primary key list stored in the cache memory are used as data to be compared, and the reference data is compared with the data to be compared. If the comparison result shows that a data record has not been written into the distributed file system of the database, the file to be written is written into the database again. In this way, a comparison after each write determines whether all the data have been written into the database, which ensures the integrity of data storage.
Referring to fig. 2, fig. 2 is a schematic flow chart of a data writing method based on an Hbase database according to a second embodiment of the present invention, which can be applied to a terminal with a data processing function, such as a computer, and the data writing method based on the Hbase database shown in fig. 2 mainly includes the following steps:
S201, sending the data records in the file to be written to a thread, so that the thread generates a corresponding row primary key value for each data record it receives.
One data record corresponds to one row primary key value. In practical applications, a file to be written may be divided into a plurality of data records; the data records of a file to be written are mostly allocated to one thread, but may also be allocated to a plurality of threads. In practical applications, the thread generates the rowkey value by means of a hash algorithm.
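The text does not say which hash algorithm is used or exactly what is hashed. The following sketch assumes, purely for illustration, that the rowkey is the SHA-256 digest of the record content; `make_rowkey` and the sample records are hypothetical.

```python
import hashlib


def make_rowkey(record: str) -> str:
    """Derive a rowkey from the record content (SHA-256 is an assumed choice)."""
    return hashlib.sha256(record.encode("utf-8")).hexdigest()


# Illustrative records only.
rowkeys = [make_rowkey(r) for r in ["alice,30", "bob,41"]]
```

Because the digest depends only on the input, hashing the same record again yields the same rowkey. Under this assumption two records with identical content would also share a rowkey, so a practical scheme would probably hash a field that distinguishes the records.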
S202, acquiring a data record corresponding to a file to be written, a row primary key value corresponding to the data record and an identification code of the thread from the thread, generating a row primary key list containing the corresponding relation between the data record and the row primary key value, and taking the acquired identification code of the thread and the generated row primary key list as reference data.
The row primary key list includes a plurality of corresponding relations between the data records and the row primary key values, wherein the row primary key list corresponds to the obtained thread ID.
Optionally, the thread may be further notified to store the corresponding relationship between the row primary key list and the file to be written.
S203, writing the obtained data record, the row primary key value corresponding to the obtained data record, the identification code of the obtained thread, and the generated row primary key list into a cache memory in the database.
In practical applications, the cache memory is the MemStore: the acquired data record, the row primary key value corresponding to the acquired data record, the identification code of the acquired thread and the generated row primary key list are written into the MemStore.
S204, writing the data records stored in the cache memory, the row primary key values corresponding to the stored data records, the stored identification code of the thread and the stored row primary key list into a distributed file system in the database, and, after the writing is finished, taking the identification code of the thread and the row primary key list stored in the cache memory as data to be compared.
Here, the data stored in the cache memory in S203 is written into an HFile in the distributed file system of the HBase database. In practical applications, the data records stored in the cache memory, the row primary key values corresponding to the stored data records, the stored identification code of the thread and the stored row primary key list are first forwarded to the HStore of the HBase database and are then written into the HFile by means of a commit. After the HFile is written, the identification code of the thread and the row primary key list in the cache memory are saved, as the data to be compared, on the server where the cache memory is located.
S205, comparing the reference data with the data to be compared, and if the comparison result shows that a data record has not been written into the distributed file system of the database, writing the file to be written into the database again.
The purpose of the comparison is to ensure the integrity of the data written to the distributed file system. If the comparison result indicates that there is a data record that has not been written into the distributed file system of the database, the file to be written is written into the database again, that is, steps S201 to S205 are executed again.
Optionally, comparing the reference data with the data to be compared specifically includes:
judging whether the identification code of the thread in the data to be compared is consistent with the identification code of the thread in the reference data;
if the identification codes are consistent, comparing the row primary key list in the data to be compared with the row primary key list in the reference data;
if the row primary key list in the data to be compared is completely consistent with the row primary key list in the reference data, determining that the comparison result is that every data record has been written into the distributed file system of the database;
and if the row primary key list in the data to be compared is inconsistent with the row primary key list in the reference data, determining that the comparison result is that a data record has not been written into the distributed file system of the database.
That is, the thread IDs in the reference data and in the data to be compared are first checked for consistency, and the row primary key lists in the reference data and in the data to be compared are compared only if the thread IDs are consistent.
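The two-stage comparison described above can be sketched as follows. The result labels, the dictionary layout and the treatment of a thread ID mismatch as a separate outcome are assumptions of this sketch, since the text does not say what happens when the thread IDs differ. Treating the lists as sets, so that order is ignored, is also an assumption.

```python
def compare(reference, to_compare):
    """Return 'complete', 'incomplete' or 'thread_mismatch'."""
    # Stage 1: the thread IDs must agree before the lists are examined.
    if reference["thread_id"] != to_compare["thread_id"]:
        return "thread_mismatch"  # assumed outcome, not specified in the text
    # Stage 2: the lists are completely consistent when they hold the same rowkeys.
    if set(reference["rowkeys"]) == set(to_compare["rowkeys"]):
        return "complete"
    return "incomplete"


# Illustrative snapshots only.
ref = {"thread_id": "thread-7", "rowkeys": ["rk-001", "rk-002"]}
ok = {"thread_id": "thread-7", "rowkeys": ["rk-002", "rk-001"]}
short = {"thread_id": "thread-7", "rowkeys": ["rk-001"]}
other = {"thread_id": "thread-9", "rowkeys": ["rk-001", "rk-002"]}
```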
Optionally, if the comparison result indicates that a data record has not been written into the distributed file system of the database, writing the file to be written into the database again specifically includes:
if the comparison result is that a data record has not been written into the distributed file system of the database, searching the row primary key list of the reference data for the row primary key values that are missing from the row primary key list of the data to be compared;
acquiring, from the thread and according to the identification code of the thread in the reference data or in the data to be compared, the row primary key list in which the missing row primary key values are located and the file to be written corresponding to the acquired row primary key list, wherein the thread stores the correspondence between the row primary key list and the file to be written;
and writing the data records corresponding to the file to be written into the database again, and comparing the reference data with the data to be compared again, until the comparison result is that every data record has been written into the distributed file system of the database.
That is, the row primary key list containing the missing row primary key values is determined first; the ID of the thread is then found through that row primary key list; finally, the file to be written that needs to be rewritten is found through the correspondence, stored in the thread, between the row primary key list and the file to be written. Steps S201 to S205 are then executed again until the comparison result is that every data record has been written into the distributed file system of the database, so that data loss is avoided and the data is stored completely in the database.
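The lookup that precedes a rewrite can be sketched as a difference between the two rowkey lists followed by a lookup in the correspondence kept by the thread. Here `thread_mappings` is a hypothetical stand-in for that stored correspondence, keyed by thread ID and then by the tuple of rowkeys; the file name is made up for the example.

```python
def find_file_to_rewrite(reference, to_compare, thread_mappings):
    """Return (missing rowkeys, file to rewrite) for an incomplete write."""
    written = set(to_compare["rowkeys"])
    # Rowkeys present in the reference list but absent after the write.
    missing = [k for k in reference["rowkeys"] if k not in written]
    if not missing:
        return [], None
    # The thread ID selects the thread; its mapping leads to the file.
    mapping = thread_mappings[reference["thread_id"]]
    file_name = mapping[tuple(reference["rowkeys"])]
    return missing, file_name


# Illustrative values only.
ref_data = {"thread_id": "t1", "rowkeys": ["rk1", "rk2", "rk3"]}
cmp_data = {"thread_id": "t1", "rowkeys": ["rk1", "rk3"]}
mappings = {"t1": {("rk1", "rk2", "rk3"): "orders.csv"}}
result = find_file_to_rewrite(ref_data, cmp_data, mappings)
```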
It should be noted that, each time the same file to be written is rewritten, the rowkey values generated for it are the same, and the row primary key list is therefore also the same, so that the same data record is prevented from being stored repeatedly.
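Why identical rowkeys prevent duplicates can be seen with a plain dictionary standing in for the table, where writing to an existing key replaces the entry instead of adding a second one. The dictionary model and the names used here are assumptions for illustration only.

```python
def write_records(table, keyed_records):
    """Write (rowkey, record) pairs into a dict standing in for the table."""
    for rowkey, record in keyed_records:
        table[rowkey] = record  # an existing rowkey is overwritten, not duplicated


table = {}
batch = [("rk-001", "alice"), ("rk-002", "bob")]
write_records(table, batch[:1])  # first attempt loses the second record
write_records(table, batch)      # the whole file is written again
```

After the second pass the table holds one entry per rowkey, although the first record was written twice.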
Optionally, after comparing the reference data with the data to be compared, the method further includes:
if the comparison result is that every data record has been written into the distributed file system of the database, deleting the reference data and the data to be compared;
and sending deletion prompt information to the thread according to the identification code of the thread in the reference data or in the data to be compared, wherein the deletion prompt information is used for prompting the thread to delete the stored correspondence between the row primary key list and the file to be written.
That is, if the comparison result is that every data record has been written into the distributed file system of the database, the reference data and the data to be compared are deleted, and the thread is notified to delete the stored correspondence between the row primary key list and the file to be written, so that part of the storage space is released and system resources are optimized.
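The clean-up step can be sketched as below. The `pending` dictionary, the per-thread inbox lists and the `"delete_mapping"` message are hypothetical stand-ins for the stored comparison data and the deletion prompt information; the embodiment does not specify how the prompt is delivered.

```python
def finish_if_complete(pending, thread_inboxes, complete):
    """Release bookkeeping once the comparison reports a complete write."""
    if not complete:
        return False
    thread_id = pending["reference"]["thread_id"]
    # Delete the reference data and the data to be compared.
    pending.pop("reference")
    pending.pop("to_compare")
    # Deletion prompt: ask the thread to drop its list-to-file correspondence.
    thread_inboxes[thread_id].append("delete_mapping")
    return True


# Illustrative values only.
pending = {
    "reference": {"thread_id": "thread-7", "rowkeys": ["rk-001"]},
    "to_compare": {"thread_id": "thread-7", "rowkeys": ["rk-001"]},
}
inboxes = {"thread-7": []}
done = finish_if_complete(pending, inboxes, complete=True)
```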
In the embodiment of the present invention, the data records in the file to be written are sent to the thread, so that the thread generates a corresponding row primary key value for each data record it receives. The data record corresponding to the file to be written, the row primary key value corresponding to the data record and the identification code of the thread are acquired from the thread, a row primary key list containing the correspondence between the data record and the row primary key value is generated, and the acquired identification code of the thread and the generated row primary key list are used as reference data. The acquired data record, the row primary key value corresponding to the acquired data record, the identification code of the acquired thread and the generated row primary key list are written into a cache memory in the database, and the data stored in the cache memory are then written into a distributed file system in the database. After the writing is finished, the identification code of the thread and the row primary key list stored in the cache memory are used as data to be compared, and the reference data is compared with the data to be compared. If the comparison result shows that a data record has not been written into the distributed file system of the database, the file to be written is written into the database again. In this way, a comparison after each write determines whether all the data have been written into the database, which ensures the integrity of data storage.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a data writing apparatus based on an HBase database according to a third embodiment of the present invention; for convenience of description, only the parts related to the embodiment of the present invention are shown. The HBase database-based data writing apparatus illustrated in fig. 3 may be the execution subject of the HBase database-based data writing method provided in the embodiments illustrated in fig. 1 and fig. 2. The apparatus illustrated in fig. 3 mainly includes: an obtaining module 301 and a processing module 302. The above functional modules are described in detail as follows:
an obtaining module 301, configured to obtain, from a thread, a data record corresponding to a file to be written, a row primary key value corresponding to the data record, and an identification code of the thread, generate a row primary key list including a correspondence between the data record and the row primary key value, and use the obtained identification code of the thread and the generated row primary key list as reference data;
a processing module 302, configured to write the obtained data record, the row primary key value corresponding to the obtained data record, the identification code of the obtained thread, and the generated row primary key list into a cache memory in a database;
the processing module 302 is further configured to write the data records stored in the cache memory, the row primary key values corresponding to the stored data records, the identification codes of the stored threads, and the stored row primary key list into the distributed file system in the database, and after the writing is completed, take the identification codes of the threads and the row primary key list stored in the cache memory as data to be compared;
the processing module 302 is further configured to compare the reference data with the data to be compared, and if the comparison result indicates that a data record has not been written into the distributed file system of the database, write the file to be written into the database again.
Each thread has its own identification code, i.e., the ID of the thread. A file to be written may be divided into a plurality of data records; the data records of a file to be written are mostly allocated to one thread, but may also be allocated to a plurality of threads. The row primary key list includes a plurality of correspondences between data records and row primary key values, and the row primary key list corresponds to the obtained thread ID. Optionally, the row primary key list may also be associated with the file to be written: the correspondence between the row primary key list and the file to be written is stored, and the thread is notified to store the same correspondence. The cache memory of the HBase database is the MemStore.
The purpose of the comparison is to ensure the integrity of the data written to the distributed file system. If the comparison result indicates that there is a data record that has not been written into the distributed file system of the database, the processing module 302 writes the file to be written into the database again.
For details that are not described in the present embodiment, please refer to the description of the embodiment shown in fig. 1, which is not described herein again.
It should be noted that, in the embodiment of the data writing device based on the Hbase database illustrated in fig. 3, the division of the functional modules is only an example, and in practical applications, the above functions may be allocated by different functional modules according to needs, for example, configuration requirements of corresponding hardware or convenience of implementation of software, that is, the internal structure of the data writing device based on the Hbase database is divided into different functional modules to complete all or part of the above described functions. In addition, in practical applications, the corresponding functional modules in this embodiment may be implemented by corresponding hardware, or may be implemented by corresponding hardware executing corresponding software. The above description principles can be applied to various embodiments provided in the present specification, and are not described in detail below.
In the embodiment of the present invention, the obtaining module 301 obtains, from a thread, a data record corresponding to a file to be written, a row primary key value corresponding to the data record and an identification code of the thread, generates a row primary key list including the correspondence between the data record and the row primary key value, and uses the obtained identification code of the thread and the generated row primary key list as reference data. The processing module 302 writes the obtained data record, the row primary key value corresponding to the obtained data record, the identification code of the obtained thread and the generated row primary key list into a cache memory in the database, and then writes the data stored in the cache memory into a distributed file system in the database. After the writing is completed, the identification code of the thread and the row primary key list stored in the cache memory are used as data to be compared, and the processing module 302 compares the reference data with the data to be compared. If the comparison result shows that a data record has not been written into the distributed file system of the database, the file to be written is written into the database again. In this way, a comparison after each write determines whether all the data have been written into the database, which ensures the integrity of data storage.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a data writing apparatus based on an HBase database according to a fourth embodiment of the present invention; for convenience of description, only the parts related to the embodiment of the present invention are shown. The HBase database-based data writing apparatus illustrated in fig. 4 may be the execution subject of the HBase database-based data writing method provided in the embodiments illustrated in fig. 1 and fig. 2. The apparatus illustrated in fig. 4 mainly includes: a sending module 401, an obtaining module 402 and a processing module 403, where the processing module 403 includes a comparison sub-module 4031; the processing module 403 further includes a search sub-module 4032, an acquisition sub-module 4033 and a reset sub-module 4034. The above functional modules are described in detail as follows:
the sending module 401 is configured to send the data records in the file to be written to a thread, so that the thread generates a corresponding row primary key value for each data record it receives.
An obtaining module 402, configured to obtain, from a thread, a data record corresponding to a file to be written, a row primary key value corresponding to the data record, and an identification code of the thread, generate a row primary key list including a correspondence between the data record and the row primary key value, and use the obtained identification code of the thread and the generated row primary key list as reference data;
a processing module 403, configured to write the obtained data record, the row primary key value corresponding to the obtained data record, the identification code of the obtained thread, and the generated row primary key list into a cache memory in a database;
the processing module 403 is further configured to write the data records stored in the cache memory, the row primary key values corresponding to the stored data records, the identification codes of the stored threads, and the stored row primary key list into the distributed file system in the database, and after the writing is completed, take the identification codes of the threads and the row primary key list stored in the cache memory as data to be compared;
the processing module 403 is further configured to compare the reference data with the data to be compared, and if the comparison result indicates that a data record has not been written into the distributed file system of the database, write the file to be written into the database again.
One data record corresponds to one row primary key value. In practical applications, a file to be written may be divided into a plurality of data records; the data records of a file to be written are mostly allocated to one thread, but may also be allocated to a plurality of threads. In practical applications, the thread generates the rowkey value by means of a hash algorithm. The row primary key list includes a plurality of correspondences between data records and row primary key values, and the row primary key list corresponds to the obtained thread ID.
Optionally, the sending module 401 is further configured to notify the thread to store the corresponding relationship between the row primary key list and the file to be written.
Optionally, the processing module 403 includes: a comparison sub-module 4031;
the comparison sub-module 4031 is configured to determine whether the identification code of the thread in the data to be compared is consistent with the identification code of the thread in the reference data;
the comparison sub-module 4031 is further configured to compare the row primary key list in the data to be compared with the row primary key list in the reference data if the identification codes are consistent;
the comparison sub-module 4031 is further configured to determine, if the row primary key list in the data to be compared is completely consistent with the row primary key list in the reference data, that the comparison result is that every data record has been written into the distributed file system of the database;
the comparison sub-module 4031 is further configured to determine, if the row primary key list in the data to be compared is inconsistent with the row primary key list in the reference data, that the comparison result is that a data record has not been written into the distributed file system of the database.
Optionally, the processing module 403 further includes: the search sub-module 4032, the acquisition sub-module 4033, the reset sub-module 4034, the deletion sub-module 4035 and the prompt sub-module 4036;
the search sub-module 4032 is configured to search, if the comparison result indicates that a data record has not been written into the distributed file system of the database, the row primary key list of the reference data for the row primary key values that are missing from the row primary key list of the data to be compared;
the acquisition sub-module 4033 is configured to acquire, from the thread and according to the identification code of the thread in the reference data or in the data to be compared, the row primary key list in which the missing row primary key values are located and the file to be written corresponding to the acquired row primary key list, where the thread stores the correspondence between the row primary key list and the file to be written;
the reset sub-module 4034 is configured to write the data records corresponding to the file to be written into the database again, and to compare the reference data with the data to be compared again, until the comparison result indicates that every data record has been written into the distributed file system of the database;
the deletion sub-module 4035 is configured to delete the reference data and the data to be compared if the comparison result is that every data record has been written into the distributed file system of the database;
the prompt sub-module 4036 is configured to send deletion prompt information to the thread according to the identification code of the thread in the reference data or in the data to be compared, where the deletion prompt information is used to prompt the thread to delete the stored correspondence between the row primary key list and the file to be written.
It should be noted that, each time the same file to be written is rewritten, the rowkey values generated for it are the same, and the row primary key list is therefore also the same, so that the same data record is prevented from being stored repeatedly.
For details of the embodiment, please refer to the description of the embodiment shown in fig. 1 and fig. 2, which is not repeated herein.
In this embodiment of the present invention, the sending module 401 sends the data records in the file to be written to the thread, so that the thread generates a corresponding row primary key value for each data record it receives. The obtaining module 402 obtains, from the thread, the data record corresponding to the file to be written, the row primary key value corresponding to the data record and the identification code of the thread, generates a row primary key list including the correspondence between the data record and the row primary key value, and uses the obtained identification code of the thread and the generated row primary key list as reference data. The processing module 403 writes the obtained data record, the row primary key value corresponding to the obtained data record, the identification code of the obtained thread and the generated row primary key list into a cache memory in the database, and then writes the data records stored in the cache memory, the row primary key values corresponding to the stored data records, the stored identification code of the thread and the stored row primary key list into a distributed file system in the database. After the writing is finished, the identification code of the thread and the row primary key list stored in the cache memory are used as data to be compared, and the reference data is compared with the data to be compared. If the comparison result shows that a data record has not been written into the distributed file system of the database, the file to be written is written into the database again. In this way, a comparison after each write determines whether all the data have been written into the database, which ensures the integrity of data storage.
In the embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division of the modules is merely a logical division, and other divisions are possible in actual implementation; for instance, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling or communication link shown or discussed may be an indirect coupling or communication link through some interfaces, devices or modules, and may be electrical, mechanical or in another form.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
It should be noted that, for the sake of simplicity, the above method embodiments are described as a series of combined acts, but those skilled in the art should understand that the present invention is not limited by the described order of acts, since some steps may be performed in other orders or simultaneously according to the present invention. Further, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the acts and modules involved are not necessarily required by the present invention.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In summary, for those skilled in the art, changes may be made to the specific implementation and the scope of application of the HBase database-based data writing method and apparatus provided above in accordance with the concepts of the embodiments of the present invention; the content of this specification should therefore not be construed as limiting the present invention.