CN115993932B

CN115993932B - Data processing method, device, storage medium and electronic device

Info

Publication number: CN115993932B
Application number: CN202211475701.1A
Authority: CN
Inventors: 周兆星
Original assignee: China Telecom Corp Ltd
Current assignee: China Telecom Corp Ltd
Priority date: 2022-11-23
Filing date: 2022-11-23
Publication date: 2025-10-03
Anticipated expiration: 2042-11-23
Also published as: CN115993932A

Abstract

The present application discloses a data processing method, device, storage medium and electronic device. The method includes: obtaining each data transmitted by each data node; mapping and storing each data into the corresponding data fragment according to a mapping relationship; determining the priority of the storage area, wherein the storage area includes a solid-state drive and a hard disk drive; preferentially inputting the data stored in the data fragment into multiple buffer areas in the solid-state drive; and, after the occupancy rate of the buffer areas of the solid-state drive reaches a preset threshold, writing the remaining data to the hard disk drive. The present application solves the technical problem that, in cloud computing over big data, the access skew produced by multi-user, multi-task and multi-priority access flows severely skews the stored data, causing contention for hot data and waste of cold data storage resources.

Description

Data processing method, device, storage medium and electronic equipment

Technical Field

The present application relates to the field of big data, and in particular, to a data processing method, apparatus, storage medium, and electronic device.

Background

In the related technology, a computing node is produced during computation and is then encrypted and transmitted to a cloud platform for storage. A problem nevertheless remains: in cloud computing, the multi-user, multi-task and multi-priority access flows of big data skew data access, so that stored data under unified management becomes severely skewed, which causes contention for hot data and waste of cold data storage resources.

In view of the above problems, no effective solution has been proposed at present.

Disclosure of Invention

The embodiments of the application provide a data processing method, a device, a storage medium and electronic equipment, which at least solve the technical problem that, in cloud computing over big data, the access skew produced by multi-user, multi-task and multi-priority access flows severely skews the stored data, causing contention for hot data and waste of cold data storage resources.

According to one aspect of the embodiment of the application, a data processing method is provided, which comprises the steps of obtaining each data transmitted by each data node, mapping each data into a corresponding data fragment according to a mapping relation, determining the priority of a storage area, wherein the storage area comprises a solid-state drive and a hard disk drive, preferentially inputting the data stored in the data fragment into a plurality of buffer areas in the solid-state drive, and writing the rest data into the hard disk drive after the occupancy rate of the buffer areas of the solid-state drive reaches a preset threshold.

Optionally, mapping each data into a corresponding data fragment according to the mapping relation comprises determining an initial key-value pair corresponding to each data, mapping the initial key-value pair into a target two-tuple, and determining the data fragment to which each data belongs according to the key in the target two-tuple.

Optionally, the method further comprises determining access conditions of the respective data, dividing the respective data according to the access conditions, and classifying the respective data as hot data or cold data.

Optionally, after classifying each data as hot data or cold data, the method further comprises obtaining a global data copy load value, determining a data block corresponding to the hot data in the case that the task executed in the current period is a non-local task, and automatically copying the data block from other nodes.

Optionally, after classifying each data into hot data or cold data, the method further comprises detecting data block loads stored on the data nodes at intervals of a preset period, acquiring total number of data copies corresponding to the data nodes when the difference between the data block loads and normal loads is smaller than a preset threshold value, issuing erasure codes to the data nodes when the total number of data copies is the preset number, receiving data information returned by the data blocks, and independently storing the data information to the cold data independent disk array.

Optionally, deleting the files in the data blocks on the data nodes and reporting the deleting information under the condition that the total number of the data copies is not the preset number, wherein the deleting information comprises file names and positions corresponding to the files.

Optionally, determining the access condition of each data and classifying each data as hot data or cold data according to the access condition includes at least: obtaining the file name and the access times corresponding to each data; determining the access count corresponding to each file name; determining that the data is hot data when the access count is greater than a preset access count, or when the access time falls within a target period (for example, a peak period); and determining that the data is cold data when the access count is less than the preset access count, or when the access time falls within a target period (for example, an off-peak period).

According to another aspect of the embodiment of the application, a data processing device is provided, which comprises an acquisition module, a mapping module and a determining module, wherein the acquisition module is used for acquiring each data transmitted by each data node, the mapping module is used for mapping each data into a corresponding data fragment according to a mapping relation, the determining module is used for determining the priority of a storage area, the storage area comprises a solid-state drive and a hard disk drive, the data stored in the data fragment is preferentially input into a plurality of buffer areas in the solid-state drive, and after the occupancy rate of the buffer area of the solid-state drive reaches a preset threshold, the rest data is written into the hard disk drive.

According to another aspect of the embodiment of the application, a nonvolatile storage medium is provided, which comprises a stored program, wherein the program controls a device where the storage medium is located to execute any data processing method when running.

According to another aspect of the embodiment of the application, there is also provided an electronic device, including a processor, and a memory for storing instructions executable by the processor, wherein the processor is configured to execute the instructions to implement any one of the data processing methods.

In the embodiments of the application, data is divided by access heat. Each data transmitted by each data node is obtained; each data is mapped and stored into a corresponding data fragment according to a mapping relation; and the priority of a storage area is determined, wherein the storage area comprises a solid-state drive and a hard disk drive. The data stored in the data fragment is preferentially input into a plurality of buffer areas in the solid-state drive, and after the occupancy rate of the buffer areas of the solid-state drive reaches a preset threshold, the remaining data is written into the hard disk drive. This reduces data redundancy in big data cloud computing storage, achieves the technical effects of reducing hot data contention and avoiding waste of cold data storage resources, and thereby solves the technical problem that, in cloud computing over big data, the access skew produced by multi-user, multi-task and multi-priority access flows severely skews the stored data, causing contention for hot data and waste of cold data storage resources.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:

FIG. 1 is a flow chart of a data processing method according to an embodiment of the application;

FIG. 2 is a flow chart of an alternative data processing method according to an embodiment of the application;

FIG. 3 is a flow chart of data execution of a data processing method according to an embodiment of the present application;

FIG. 4 is a data flow diagram of a data processing method according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;

Fig. 6 is a schematic block diagram of an example electronic device 600 in accordance with an embodiment of the application.

Detailed Description

In order that those skilled in the art will better understand the present application, a technical solution in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, shall fall within the scope of the present application.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

In accordance with an embodiment of the present application, there is provided a method embodiment of data processing, it being noted that the steps shown in the flowchart of the figures may be performed in a computer system, such as a set of computer executable instructions, and, although a logical order is shown in the flowchart, in some cases, the steps shown or described may be performed in an order other than that shown or described herein.

FIG. 1 is a flow chart of a data processing method according to an embodiment of the present application. As shown in FIG. 1, the method includes the following steps:

Step S102, acquiring each data transmitted by each data node;

Step S104, mapping each data into a corresponding data fragment according to the mapping relation;

Step S106, determining the priority of a storage area, wherein the storage area comprises a solid-state drive and a hard disk drive;

It should be noted that the priority of the solid-state drive is higher than the priority of the hard disk drive.

Step S108, the data stored in the data fragments are input to a plurality of buffer areas in the solid state drive preferentially, and after the occupancy rate of the buffer areas of the solid state drive reaches a preset threshold, the rest data are written into the hard disk drive.

It should be noted that, the preset threshold may be 80%, that is, when the occupancy rate of the buffer area reaches 80%, the remaining data is written into the hard disk drive.
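Steps S106 and S108 can be illustrated with a minimal sketch. It assumes the buffer capacity is counted in records and that occupancy is checked before each write; the class name `TieredStore` and its fields are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass, field


@dataclass
class TieredStore:
    """SSD buffer with a fixed capacity; overflow goes to the HDD (illustrative)."""
    ssd_capacity: int            # records the SSD buffer can hold (assumed unit)
    threshold: float = 0.8       # preset occupancy threshold, 80% as in the text
    ssd: list = field(default_factory=list)
    hdd: list = field(default_factory=list)

    def occupancy(self) -> float:
        return len(self.ssd) / self.ssd_capacity

    def write(self, record) -> str:
        # The SSD has the higher priority, so it is tried first.
        if self.occupancy() < self.threshold:
            self.ssd.append(record)
            return "ssd"
        # Once the threshold is reached, remaining data is written to the HDD.
        self.hdd.append(record)
        return "hdd"


store = TieredStore(ssd_capacity=10)
placements = [store.write(i) for i in range(12)]
print(placements.count("ssd"), placements.count("hdd"))
```

With a capacity of 10 records and an 80% threshold, the first eight records land in the SSD buffer and the remaining four are written to the HDD.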

In an exemplary embodiment of the application, mapping each data into a corresponding data fragment according to a mapping relation comprises determining an initial key-value pair corresponding to each data, mapping the initial key-value pair into a target two-tuple, and determining the data fragment to which each data belongs according to the key in the target two-tuple.
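As a hedged illustration of this mapping, the sketch below remaps an initial pair to a two-tuple and picks the fragment from the key. The tab-separated input format, the fragment count and the use of CRC32 as a stable hash are all assumptions made for the example; the application does not specify them.

```python
import zlib

NUM_FRAGMENTS = 4  # hypothetical number of data fragments


def to_tuple(raw_line: str, line_number: int) -> tuple[str, str]:
    """Remap an initial <raw data, line number> pair to a <key, value> two-tuple.

    Assumes each line looks like "key<TAB>value"; the line number is not
    needed once the key has been extracted.
    """
    key, _, value = raw_line.partition("\t")
    return key, value


def fragment_of(key: str, num_fragments: int = NUM_FRAGMENTS) -> int:
    """Choose the fragment from the key alone, using a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_fragments


lines = ["user42\tlogin", "user7\tupload", "user42\tlogout"]
fragments: dict[int, list[tuple[str, str]]] = {}
for number, line in enumerate(lines):
    key, value = to_tuple(line, number)
    fragments.setdefault(fragment_of(key), []).append((key, value))
```

Because the fragment depends only on the key, both records for `user42` end up in the same fragment.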

It will be appreciated that data with a higher access count is hot data, and data with a lower access count is cold data.

In some optional embodiments of the present application, after classifying each data as hot data or cold data, the method further includes obtaining a global data copy load value, determining a data block corresponding to the hot data if the task executed in the current period is a non-local task, and automatically copying the data block from other nodes.
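The copy step for non-local tasks might look like the following sketch. The in-memory table `replica_map`, the node names and the `transfers` log are hypothetical stand-ins for the master-side bookkeeping, and the actual block transfer is omitted.

```python
# block id -> nodes holding a visible copy (hypothetical master-side table)
replica_map: dict[str, set[str]] = {"blk-1": {"node-a", "node-b", "node-c"}}
transfers: list[tuple[str, str]] = []  # (source node, destination node)


def run_map_task(node: str, block_id: str) -> bool:
    """Run a task on `node`; return True if the task was non-local.

    A non-local task has to fetch the block from another node. The fetched
    block is kept and reported instead of being discarded, so the new copy
    becomes visible.
    """
    holders = replica_map[block_id]
    if node in holders:
        return False                    # local task: no copy is added
    source = sorted(holders)[0]         # arbitrary choice of an existing holder
    transfers.append((source, node))    # stands in for the real block transfer
    holders.add(node)                   # report the new copy
    return True


first = run_map_task("node-d", "blk-1")    # non-local: a fourth copy appears
second = run_map_task("node-d", "blk-1")   # the same task is now local
```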

In an exemplary embodiment of the present application, after classifying each data into hot data or cold data, the method further includes detecting a data block load stored on a data node at intervals of a predetermined period, acquiring a total number of data copies corresponding to the data node if a difference between the data block load and a normal load is smaller than a preset threshold, issuing erasure codes to the data node if the total number of data copies is the predetermined number, receiving data information returned by the data block, and individually storing the data information to the cold data independent disk array.

In an optional embodiment, when the total number of the data copies is not the predetermined number, deleting the file in the data block on the data node and reporting the deleting information, where the deleting information includes a file name and a position corresponding to the file.

It should be noted that the predetermined number of data copies is three, that a copy operation cannot be performed on a single file, and that a random allocation policy is adopted for the storage positions of the copies.
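The decision made for a lightly loaded data block can be sketched as below. The normal load value and the ratio used to decide that a load is low are hypothetical; only the copy count of three and the contents of the deletion information come from the description.

```python
PREDETERMINED_COPIES = 3   # predetermined number of copies, as stated above
NORMAL_LOAD = 100.0        # hypothetical normal load of a data block
LOW_LOAD_RATIO = 0.2       # hypothetical: below 20% of normal counts as low


def decay_action(block_load: float, copy_count: int) -> str:
    """Decide what the periodic check does with one data block."""
    if block_load >= NORMAL_LOAD * LOW_LOAD_RATIO:
        return "keep"                       # load is not low: nothing to do
    if copy_count == PREDETERMINED_COPIES:
        return "erasure_code_and_archive"   # hand over to the cold data disk array
    return "delete_and_report"              # delete the file and report it


def deletion_report(file_name: str, position: str) -> dict:
    """Deletion information: the file name and the position of the file."""
    return {"file_name": file_name, "position": position}


actions = [decay_action(90.0, 5), decay_action(5.0, 3), decay_action(5.0, 4)]
report = deletion_report("part-0001", "node-b:/data/blk_17")
```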

In some optional embodiments of the present application, determining the access condition of each data and classifying each data as hot data or cold data according to the access condition includes at least: obtaining the file name and the access times corresponding to each data; determining the access count corresponding to each file name; determining that the data is hot data when the access count is greater than a preset access count, or when the access time falls within a target period; and determining that the data is cold data when the access count is less than the preset access count, or when the access time falls within a target period.

It will be appreciated that when the access time belongs to a peak period, the data may be determined to be hot data, and when the access time belongs to an off-peak period, the data may be determined to be cold data.
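One possible reading of these rules is sketched below: a file is hot if its access count exceeds the preset count or if any access falls in the peak period, and cold otherwise. The threshold of 3, the 09:00 to 18:00 peak window and the way the two criteria are combined are assumptions made for illustration.

```python
from datetime import datetime, time

PRESET_ACCESS_COUNT = 3                           # hypothetical threshold
PEAK_START, PEAK_END = time(9, 0), time(18, 0)    # hypothetical peak period


def in_peak(ts: datetime) -> bool:
    return PEAK_START <= ts.time() < PEAK_END


def classify(access_times: list[datetime]) -> str:
    """Return "hot" or "cold" for one file, given all of its access times."""
    if len(access_times) > PRESET_ACCESS_COUNT:
        return "hot"
    if any(in_peak(ts) for ts in access_times):
        return "hot"
    return "cold"


access_log = {
    "a.log": [datetime(2022, 11, 23, h) for h in (1, 2, 3, 4)],  # 4 off-peak accesses
    "b.log": [datetime(2022, 11, 23, 10)],                       # 1 peak access
    "c.log": [datetime(2022, 11, 23, 23)],                       # 1 off-peak access
}
labels = {name: classify(times) for name, times in access_log.items()}
```

Here `a.log` is hot by access count, `b.log` is hot by access time, and `c.log` is cold.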

In order to facilitate a better understanding of the technical solution of the present application by a person skilled in the art, a description will now be given with reference to a specific embodiment.

FIG. 2 is a schematic flow chart of an alternative data processing method according to an embodiment of the present application, as shown in FIG. 2, the flow mainly includes the following steps:

(1) Partitioning the acquired data content according to the data requirement, and establishing a plurality of databases through the partitioned content;

(2) One or more data nodes are stored in the data fragments, the data fragments are divided by a mapping technology in a processing partition mode, then the data are sent to a plurality of buffer areas in the solid-state drive, and when the buffer areas are fully written, the background writes the data in the buffer areas into the hard disk drive;

(3) The information acquisition module acquires upper node data access logs in the system and provides information for dynamic data division, a dynamic cold and hot copy distinguishing module is adopted to dynamically divide data access heat, and a dynamic copy storage module is used for managing and maintaining the number of copies;

It should be noted that, the default copy number of all files in the dynamic cold and hot copy distinguishing module is three copies, copy operation cannot be performed on a single file, a completely random allocation strategy is adopted for storage positions of the copies, and the dynamic cold and hot copy distinguishing module performs unified management on data storage and data access.

It should be noted that the dynamic copy storage module can mark data as hot or cold according to how the data is accessed. The data copies of the dynamic copy storage module are fully dynamic, and the module adopts a feedback adjustment mechanism, which mainly includes a copy-increasing mechanism of zero data copy and an automatic data copy decay mechanism, to change the number of data blocks.

It can be understood that the copy-increasing mechanism of zero data copy can acquire the global data copy load value from the log record module. If a non-local task is executed, the corresponding data block is hot data, and when the data mapping is completed, the mapping task can automatically copy the data block from other nodes. The copy-increasing mechanism takes effect after the data block copy is completed: whereas a conventional mapping task discards the mapping input data as a temporary file, the copy-increasing mechanism persists the data locally and reports it to the server, thereby making the data block visible.

The automatic decay mechanism of the data copy is based on the calculated load of the data blocks. The load of the data blocks stored on the nodes is scanned periodically. When the load of a data block is found to be significantly lower than the normal load value, the dynamic copy storage module accesses it preferentially and obtains the total number of copies. If the number of copies is not three, the file corresponding to the data block is deleted directly and the deletion is reported, which makes the deletion globally visible; if the number of copies is equal to three, the data is handed to the cold data independent disk array module for processing.

(4) The cold data independent disk array module provides additional reliable data block storage for decayed data that is rarely accessed;

It should be noted that the data block storage of the cold data independent disk array module adopts a delayed loading mode. When the current number of copies in the dynamic cold and hot copy distinguishing module is three, the master node issues an erasure code calculation operation to the data node; on receiving this information, the data node submits the information of the data block, and the cold data independent disk array module stores the data reliably.

It can be understood that the RAID storage of the data blocks of the cold data independent disk array module adopts a delayed loading mode. For the data blocks of a file, the master node periodically gathers all copy positions of the data blocks, and if the number of available copies is lower than three, an automatic copy-adding operation is performed to ensure the reliability of the data. In the dynamic cold and hot copy distinguishing module, after the master node receives a message that the life cycle of a copy of a data block has ended, it checks the number of copies of that data block. If the current number of copies is found to be three, an erasure code calculation operation is issued to the data node. On receiving this information, the data node does not delete the corresponding copy immediately; instead, it submits the information of the data block, including the file name of the data block, the data block split ID number and the original data block data, to the cold data independent disk array module for reliable storage. After the data block has been stored in the cold data independent disk array module, the data node deletes the data block and reports to the master node that the life cycle of the data block has ended, which completes the whole flow of RAID storage of the data block.

(5) The data is divided into a plurality of segments, the data is written in the segments, the mapping input is completed, a plurality of temporary buffer area files exist in the solid-state drive, the data are strictly sequenced and integrated according to key values through a reorganization end, and an intermediate data file containing a plurality of partitions is formed and stored in the hard disk drive;

(6) The reorganization end globally merges the files during file transmission, aggregates the key-value pairs with the same key into a key group indexed by that key, and transmits the key group content to the reduction end for use.
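The aggregation in step (6) can be sketched in a few lines. The sample pairs are invented for illustration, and sorting followed by `itertools.groupby` is one common way to build key groups, not necessarily the one used by the application.

```python
from itertools import groupby
from operator import itemgetter

# Key-value pairs as they might arrive from several mapping outputs (sample data).
pairs = [("b", 2), ("a", 1), ("b", 3), ("a", 4), ("c", 5)]

# Sort by key first; sorted() is stable, so values keep their arrival order per key.
ordered = sorted(pairs, key=itemgetter(0))

# Aggregate pairs that share a key into one group indexed by that key.
key_groups = {
    key: [value for _, value in group]
    for key, group in groupby(ordered, key=itemgetter(0))
}
print(key_groups)  # {'a': [1, 4], 'b': [2, 3], 'c': [5]}
```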

It is easy to notice that the method has the following beneficial effects by dividing the data access heat:

(1) The scheme of the application adopts a dynamic storage mode and stores data through a strategy that combines fully dynamic copies with a redundant array of independent disks. Compared with a static scheme, dynamic copies adapt efficiently to changes in upper-layer file access and thus provide adaptive data storage. For hot data, dynamically increasing the number of copies improves data availability under concurrency, reduces the generation of non-local tasks, reduces network transmission overhead, and relieves load imbalance among nodes, thereby improving the overall performance of the system.

(2) The scheme of the application adopts a dynamic cold and hot copy distinguishing module, which works with dynamic copies. The load of a data block depends essentially on the current number of backups of that block, so the access load of a file in a given access state changes as the number of copies changes: with more backups, more data blocks share the pressure of upper-layer access and the load is lower; with fewer backups, the opposite holds. For a given file, upper-layer access depends on the user and cannot be influenced by the dynamic cold and hot copy distinguishing module. The module therefore uses the concept of data block load and adjusts the number of copies to adapt to upper-layer access.

(3) The scheme adopts a cold data independent disk array module. For cloud computing over big data, the core mechanism depends on the abstraction and calculation of data block load. The load of a data block directly determines the number of copies of the data block, the load of the disks and the load of the nodes. Disk load is a core parameter of the multi-disk scheduler, and node load affects the priority of task scheduling, which in turn affects the positions of the copies and the task load of the nodes. Adopting the cold data independent disk array module solves the redundancy problem caused by data skew in big data cloud computing, reduces redundancy to a minimum, reduces the load on the big data cloud computing server, and improves the speed of big data cloud computing.
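The relationship between copy count and data block load described in effect (2) can be made concrete with a small sketch. It assumes, purely for illustration, that the access load of a file is shared evenly among its copies; the description does not give a load formula, and `TARGET_LOAD` is a hypothetical tuning value.

```python
import math

TARGET_LOAD = 50.0   # hypothetical per-copy load the system tries not to exceed
DEFAULT_COPIES = 3   # default copy count given in the description


def per_copy_load(access_rate: float, copies: int) -> float:
    """Assumed model: upper-layer access is shared evenly by all copies."""
    return access_rate / copies


def copies_needed(access_rate: float) -> int:
    """Smallest copy count, never below the default, that meets the target load."""
    return max(DEFAULT_COPIES, math.ceil(access_rate / TARGET_LOAD))


hot_rate = 300.0
before = per_copy_load(hot_rate, DEFAULT_COPIES)           # 100.0 per copy
after = per_copy_load(hot_rate, copies_needed(hot_rate))   # 50.0 per copy with 6 copies
```

Under this model the upper-layer access rate is fixed by the users, and only the copy count can be adjusted to bring the per-copy load back to the target.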

Fig. 3 is a schematic diagram of a data execution flow of a data processing method according to an embodiment of the present application, as shown in fig. 3, the flow mainly includes the following steps:

(1) The plurality of databases are distributed to a plurality of servers for network interconnection, and each data fragment is analyzed one by one in the process of data analysis by a mapping technology;

(2) The information acquisition module records the following data: the file name of the current access, the node positions of the data blocks into which the file is split, and the time of the current access;

It should be noted that the collected file access information has the format <file name, list<access time>> and is used to divide files by heat. The mapping between a file name and the node positions of the file's split data blocks is used for heat calculation: node heat is calculated from the distribution of the file's data blocks and the file heat, which supports the subsequent task scheduler in balancing node load.

(3) The input of the mapping task is usually text data, and the initial key-value pair is <RAWdata, line number>. The mapping end remaps one or more <RAWdata, line number> pairs into meaningful <Key, value> two-tuples;

It should be noted that the output of the mapping is partitioned and then transferred to a buffer area in the solid-state drive, and whenever the occupancy of the buffer area reaches 80%, a background process writes the data in the current buffer area to the hard disk drive. When all mapping inputs have been processed, there may be multiple temporary buffer files in the hard disk drive that need to be merged; the merging process ensures that the data within each partition of the final merged file is strictly ordered by key.

It can be understood that, to ensure the speed of processing massive data, all key-value pairs output by the mapping are arranged strictly in ascending order of key. The advantage of strict ascending order is that the reduction end can quickly find a given key-value pair, which speeds up user queries on the result key-value pairs.

(4) When the mapping end processes data there are multiple reduction tasks, so each piece of mapping data needs to be sent to the partition of the corresponding reduction task. Each partition ensures that the data in the mapping end is mapped to a unique reduction task, and each key-value pair output by the mapping task is assigned to a unique partition according to its key;

(5) The reorganization end copies data that exceeds the available storage space of the solid-state drive to the hard disk drive as temporary files, and the result of the reduction is organized as key-value pairs and written to the server side.

It will be appreciated that, in the copy stage, the reduction end receives a large number of mapping results. Because these results are split across a plurality of different files, the files must be globally merged after all copying has been completed to generate the final reduction input data, in which key-value pairs having the same key are aggregated into a set of values indexed by that key. At this point, the reorganization end has completed the transfer of the mapping output to the reduction end.
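The heat calculation mentioned in the note to step (2) above can be sketched as follows. File heat is taken here to be the access count, and a file's heat is assumed to be split evenly across the nodes holding its blocks; both choices, along with the sample records, are illustrative assumptions rather than the application's formula.

```python
from collections import defaultdict

# <file name, list<access time>> records (sample data, times as plain numbers)
access_records = {"f1": [1, 2, 3, 4], "f2": [5]}

# file name -> node position of each split data block (one entry per block)
block_locations = {"f1": ["node-a", "node-b"], "f2": ["node-b", "node-c"]}

# Assumed file heat: the number of recorded accesses.
file_heat = {name: len(times) for name, times in access_records.items()}

# Assumed node heat: each file's heat is shared evenly by its blocks.
node_heat: dict[str, float] = defaultdict(float)
for name, nodes in block_locations.items():
    share = file_heat[name] / len(nodes)
    for node in nodes:
        node_heat[node] += share

hottest = max(node_heat, key=node_heat.get)
```

A task scheduler could use values such as `node_heat` to steer new tasks away from the hottest node, which is `node-b` in this sample.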

Fig. 4 is a data flow diagram of a data processing method according to an embodiment of the present application, as shown in fig. 4, the flow mainly includes the following steps:

(1) Because there are multiple reductions when the mapping end processes data, the mapping data needs to be sent to the partitions of the corresponding reduction tasks, such as partition a, partition B, partition C and partition D in the solid-state drive shown in fig. 4, each partition can ensure that the data in the mapping end is mapped to a unique reduction task, the key value pair output by the mapping task can be attributed to a unique partition according to the key value, and the reorganizing end copies the data exceeding the storage space of the available solid-state drive into the hard disk drive as temporary files, for example, temporary file a, temporary file B, temporary file C and temporary file D in fig. 4;

(2) The reduction end receives a large number of mapping results in the copying stage, and because these results are split across a plurality of different files, the files need to be globally merged after all copying is completed, so that the reduction input data is generated;

(3) The reduction input data is input to a reduction task.
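The global merge of the temporary files in FIG. 4 can be sketched with the standard library's `heapq.merge`, which combines inputs that are already sorted without loading and re-sorting everything. The file contents below are invented sample data, and the in-memory lists stand in for temporary files A to D.

```python
import heapq
from operator import itemgetter

# Each temporary file is already sorted by key (sample data).
temp_a = [("apple", 1), ("pear", 1)]
temp_b = [("apple", 2), ("fig", 1)]
temp_c = [("kiwi", 1)]
temp_d = [("fig", 3), ("pear", 2)]

# Merge the sorted inputs into one key-ordered stream: the reduction input data.
reduce_input = list(
    heapq.merge(temp_a, temp_b, temp_c, temp_d, key=itemgetter(0))
)
keys = [key for key, _ in reduce_input]
```

The merged stream stays in ascending key order, so pairs with the same key are adjacent and can be grouped in a single pass.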

Fig. 5 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. As shown in fig. 5, the apparatus includes:

An acquiring module 50, configured to acquire each data transmitted by each data node;

The mapping module 52 is configured to map each data into a corresponding data fragment according to the mapping relationship;

The determining module 54 is configured to determine a priority of a storage area, where the storage area includes a solid-state drive and a hard disk drive, to preferentially input the data stored in the data fragment into a plurality of buffer areas in the solid-state drive, and to write the remaining data into the hard disk drive after the occupancy rate of the buffer areas of the solid-state drive reaches a preset threshold.

The device comprises an acquisition module 50 for acquiring each data transmitted by each data node; a mapping module 52 for mapping each data into a corresponding data fragment according to a mapping relation; and a determining module 54 for determining the priority of a storage area, wherein the storage area comprises a solid-state drive and a hard disk drive, preferentially inputting the data stored in the data fragment into a plurality of buffer areas in the solid-state drive, and writing the remaining data into the hard disk drive after the occupancy rate of the buffer areas of the solid-state drive reaches a preset threshold. The device thereby reduces data redundancy in big data cloud computing storage, achieves the technical effects of reducing hot data contention and avoiding waste of cold data storage resources, and solves the technical problem that, in cloud computing over big data, the access skew produced by multi-user, multi-task and multi-priority access flows severely skews the stored data, causing contention for hot data and waste of cold data storage resources.

According to another aspect of the embodiments of the present application, there is also provided a nonvolatile storage medium including a stored program, wherein, when the program runs, the device in which the nonvolatile storage medium is located is controlled to execute any one of the data processing methods.

Specifically, the storage medium is configured to store program instructions that implement the following functions:

Obtaining each data transmitted by each data node; mapping and storing each data into a corresponding data fragment according to a mapping relation; determining the priority of a storage area, wherein the storage area comprises a solid-state drive and a hard disk drive; preferentially inputting the data stored in the data fragment into a plurality of buffer areas in the solid-state drive; and writing the remaining data into the hard disk drive after the occupancy rate of the buffer areas of the solid-state drive reaches a preset threshold.

Optionally, in the present embodiment, the storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

In an exemplary embodiment of the application, a computer program product is also provided, comprising a computer program which, when executed by a processor, implements any of the above-mentioned data processing methods.

Optionally, the computer program may, when executed by a processor, implement the steps of:

According to an embodiment of the present application, there is provided an electronic device including at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any one of the data processing methods described above.

Optionally, the electronic device may further include a transmission device and an input/output device, where the transmission device is connected to the processor, and the input/output device is connected to the processor.

Fig. 6 is a schematic block diagram of an example electronic device 600 in accordance with an embodiment of the application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.

As shown in Fig. 6, the device 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 602 or a computer program loaded from a storage unit 608 into a random access memory (RAM) 603. The RAM 603 may also store various programs and data required for the operation of the device 600. The computing unit 601, ROM 602, and RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

Various components in the device 600 are connected to the I/O interface 605, including an input unit 606, e.g., keyboard, mouse, etc., an output unit 607, e.g., various types of displays, speakers, etc., a storage unit 608, e.g., magnetic disk, optical disk, etc., and a communication unit 609, e.g., network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.

The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the respective methods and processes described above, such as a data processing method. For example, in some embodiments, the data processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When a computer program is loaded into RAM 603 and executed by computing unit 601, one or more steps of the data processing method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the data processing method by any other suitable means (e.g. by means of firmware).

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, operable to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for carrying out the methods of the present application may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of the present application, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user, for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.

The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.

The serial numbers of the foregoing embodiments of the present application are for description only and do not indicate the relative merits of the embodiments.

In the foregoing embodiments of the present application, each embodiment is described with its own emphasis; for a portion that is not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.

In the several embodiments provided in the present application, it should be understood that the disclosed technology may be implemented in other manners. The above-described embodiments of the apparatus are merely exemplary, and the division of the units, for example, may be a logic function division, and may be implemented in another manner, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interfaces, units or modules, or may be in electrical or other forms.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in whole or in part, may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods according to the embodiments of the present application. The storage medium includes a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium that can store program code.

The foregoing is merely a preferred embodiment of the present application. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present application, and such modifications and adaptations are intended to fall within the scope of the present application.

Claims

1. A data processing method, comprising:

obtaining each piece of data passed in by each data node;

mapping and storing each piece of data into a corresponding data shard according to a mapping relationship;

determining a priority of a storage area, wherein the storage area includes: a solid-state drive and a hard disk drive;

preferentially inputting the data stored in the data shards into a plurality of buffer areas in the solid-state drive, and writing the remaining data into the hard disk drive after the occupancy rate of the buffer areas in the solid-state drive reaches a preset threshold;

wherein the method further includes: determining the access status of each piece of data, and classifying each piece of data as hot data or cold data according to the access status; and, after classifying each piece of data as hot data or cold data, the method further includes: detecting, at predetermined intervals, the load of the data blocks stored on the data node, and when the difference between the data block load and the normal load is less than a preset threshold, obtaining the total number of data copies corresponding to the data node; when the total number of data copies is a predetermined number, issuing an erasure code to the data node; and receiving data information returned by the data block, and storing the data information separately in a cold data independent disk array.

2. The method according to claim 1, wherein storing each data mapping into a corresponding data shard according to a mapping relationship comprises:

determining the initial key-value pairs corresponding to the respective data, and mapping the initial key-value pairs into target two-tuples;

determining the data shard to which each piece of data belongs according to the key value in the target two-tuple.

3. The method according to claim 1, wherein, after classifying the data as hot data or cold data, the method further comprises:

obtaining a global data replica load value, and, when the task executed in the current period is a non-local task, determining the data block corresponding to the hot data and automatically copying the data block from other nodes.

4. The method according to claim 1, wherein, when the total number of data copies is not the predetermined number, the file in the data block on the data node is deleted and deletion information is reported, wherein the deletion information includes: the file name and location corresponding to the file.

5. The method according to claim 1, wherein determining the access status of each data, dividing the data according to the access status, and classifying the data as hot data or cold data comprises:

obtaining at least the file names and access times corresponding to each piece of data;

determining the number of accesses corresponding to each file name, and determining that the data is hot data if the number of accesses is greater than a preset number of accesses; or determining that the data is hot data if the access time falls within a target time period;

determining that the data is cold data if the number of accesses is less than the preset number of accesses; or determining that the data is cold data if the access time falls within a target period.

6. A data processing device, comprising:

an acquisition module, configured to obtain each piece of data passed in by each data node;

a mapping module, configured to map and store each piece of data into a corresponding data shard according to a mapping relationship;

a determination module, configured to determine a priority of a storage area, wherein the storage area includes a solid-state drive and a hard disk drive, to preferentially input the data stored in the data shards into a plurality of buffer areas in the solid-state drive, and to write the remaining data to the hard disk drive after an occupancy rate of the buffer areas of the solid-state drive reaches a preset threshold;

wherein the data processing device is further configured to: determine the access status of each piece of data, and classify each piece of data as hot data or cold data according to the access status; and, after classifying each piece of data as hot data or cold data, the data processing device is further configured to: detect, at predetermined intervals, the load of the data blocks stored on the data node, and when the difference between the data block load and the normal load is less than a preset threshold, obtain the total number of data copies corresponding to the data node; when the total number of data copies is a predetermined number, send an erasure code to the data node; and receive data information returned by the data block, and store the data information separately in a cold data independent disk array.

7. A non-volatile storage medium, characterized in that the storage medium includes a stored program, wherein when the program is run, the device where the storage medium is located is controlled to execute the data processing method according to any one of claims 1 to 5.

8. An electronic device, comprising:

a processor;

a memory for storing instructions executable by the processor;

wherein the processor is configured to execute the instructions to implement the data processing method according to any one of claims 1 to 5.