CN113590536B - Data storage method, system, electronic equipment and storage medium - Google Patents

Data storage method, system, electronic equipment and storage medium

Info

Publication number
CN113590536B
Authority
CN
China
Prior art keywords
file
persistent memory
spark
new
memory device
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110551280.5A
Other languages
Chinese (zh)
Other versions
CN113590536A (en
Inventor
秦朝阳
付海明
袁博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Jinan data Technology Co ltd
Original Assignee
Inspur Jinan data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Jinan data Technology Co ltd filed Critical Inspur Jinan data Technology Co ltd
Priority to CN202110551280.5A priority Critical patent/CN113590536B/en
Publication of CN113590536A publication Critical patent/CN113590536A/en
Application granted granted Critical
Publication of CN113590536B publication Critical patent/CN113590536B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/06 Addressing a physical block of locations, e.g. base addressing, module addressing, memory dedication
    • G06F12/0646 Configuration or reconfiguration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data storage method applied to a Spark computing framework, comprising the following steps: determining the device name and the preset file size of a persistent memory device, wherein the persistent memory device is a device in fsdax mode; creating file paths corresponding to a plurality of Spark tasks in the persistent memory device according to the device name, and generating a persistent memory file with the preset file size under each file path; and storing the intermediate data generated by each Spark task to the corresponding persistent memory file. The method allows the number of executors to be allocated flexibly, so that the Spark computing framework can execute a plurality of Spark tasks simultaneously, improving both its task processing efficiency and its stability in continuous operation. The application also discloses a data storage system, an electronic device and a storage medium, which have the same beneficial effects.

Description

Data storage method, system, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of in-memory computing technologies, and in particular, to a data storage method, a data storage system, an electronic device, and a storage medium.
Background
Persistent memory (PMem) is a new generation of storage medium. Like DRAM (Dynamic Random Access Memory), it is byte-addressable and offers high-speed read/write performance; in addition, it is non-volatile on power loss and has high storage density and low static power consumption. These characteristics make it a possible future substitute for DRAM and bring new opportunities for constructing more efficient in-memory computing systems.
Spark is a big data in-memory computing framework that can be optimized based on persistent memory. The community already has schemes that use the devdax mode of persistent memory to store the intermediate data of a Spark computation on persistent memory, which reduces I/O (data read/write) time and accelerates the execution of computing tasks. In such schemes, however, each persistent memory device in devdax mode must correspond one-to-one with a Spark executor. As a result, Spark cannot execute multiple tasks at the same time, the task execution efficiency of the Spark computing framework is low, and stability in continuous operation is poor.
Therefore, how to improve the task execution efficiency of Spark computing framework is a technical problem that those skilled in the art need to solve at present.
Disclosure of Invention
The invention aims to provide a data storage method, a data storage system, an electronic device and a storage medium that can increase the amount of intermediate data that can be stored in a persistent memory device.
In order to solve the above technical problem, the present application provides a data storage method applied to a Spark computing framework, the data storage method comprising:
determining the device name and the preset file size of the persistent memory device; wherein the persistent memory device is a device in fsdax mode;
creating file paths corresponding to a plurality of Spark tasks in the persistent memory device according to the device names, and generating a persistent memory file with the preset file size under each file path;
and storing the intermediate data generated by each Spark task to the corresponding persistent memory file.
Optionally, creating file paths corresponding to a plurality of Spark tasks in the persistent memory device according to the device name includes:
and randomly generating a plurality of intermediate file names, and creating the file paths corresponding to a plurality of Spark tasks in the persistent memory device according to the device names and the intermediate file names.
Optionally, the method further comprises:
when the Spark computing framework is detected to execute a new Spark task, a new intermediate file name is randomly generated, and a new file path corresponding to the new Spark task is created in the persistent memory device according to the device name and the new intermediate file name;
and generating a new persistent memory file under the new file path, and storing new intermediate data generated by the new Spark task into the new persistent memory file.
Optionally, before storing the intermediate data generated by each Spark task in the corresponding persistent memory file, the method further includes:
determining a target CPU core corresponding to the persistent memory device;
acquiring the current process number of the Spark task;
configuring the binding relation between the current process number and the target CPU core;
correspondingly, storing the intermediate data generated by each Spark task to the corresponding persistent memory file, including:
and controlling the target CPU core to store the intermediate data generated by each Spark task to the corresponding persistent memory file according to the binding relation.
Optionally, configuring the binding relationship between the current process number and the target CPU core includes:
and configuring the binding relation between the current process number and the target CPU core through the taskset command or the sched_setaffinity method.
Optionally, before determining the device name and the preset file size of the persistent memory device, the method further includes:
judging whether the device attribute of the persistent memory device is a directory;
if yes, the persistent memory device is judged to be a device in the fsdax mode.
Optionally, after storing the intermediate data generated by each Spark task in the corresponding persistent memory file, the method further includes:
if it is detected that a Spark task has finished executing, deleting the persistent memory file corresponding to that Spark task by calling a POSIX file deletion method.
The application also provides a data storage system applied to Spark computing framework, comprising:
the parameter determining module is used for determining the device name and the preset file size of the persistent memory device; wherein the persistent memory device is a device in fsdax mode;
the file creating module is used for creating file paths corresponding to a plurality of Spark tasks in the persistent memory device according to the device names, and generating a persistent memory file with the preset file size under each file path;
and the data storage module is used for storing the intermediate data generated by each Spark task to the corresponding persistent memory file.
The present application also provides a storage medium having stored thereon a computer program which, when executed, performs the steps of the data storage method described above.
The application also provides electronic equipment, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps executed by the data storage method when calling the computer program in the memory.
The application provides a data storage method applied to a Spark computing framework, comprising the following steps: determining the device name and the preset file size of the persistent memory device; wherein the persistent memory device is a device in fsdax mode; creating file paths corresponding to a plurality of Spark tasks in the persistent memory device according to the device names, and generating a persistent memory file with the preset file size under each file path; and storing the intermediate data generated by each Spark task to the corresponding persistent memory file.
The method first determines the device name and the preset file size of the persistent memory device in fsdax mode, creates a file path in the persistent memory device according to the device name, and generates a persistent memory file with the preset file size under the file path. Intermediate data is generated while the Spark computing framework executes a Spark task, and this intermediate data is stored under the file path corresponding to the Spark task; that is, it is stored in a persistent memory file of the preset file size. Because only a region of the preset file size is allocated in the persistent memory device for the intermediate data of each Spark task, the intermediate data of multiple Spark tasks can be stored in the same persistent memory device. Compared with the related art, in which a persistent memory device in devdax mode is used exclusively by a single Spark task, the present application allows the number of executors to be allocated flexibly, increases the number of independent resource pools for storing intermediate data in the persistent memory device, enables the Spark computing framework to execute a plurality of Spark tasks simultaneously, and improves both the task processing efficiency of the Spark computing framework and its stability in continuous operation. The application also provides a data storage system, an electronic device and a storage medium, which have the same beneficial effects and are not described again here.
Drawings
For a clearer description of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described, it being apparent that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a data storage method according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of a memory computing method based on a persistent memory fsdax mode according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a memory computing method based on a persistent memory fsdax mode according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a data storage system according to an embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
Refer to FIG. 1, which is a flowchart of a data storage method according to an embodiment of the present application.
The specific steps may include:
S101: determining the device name and the preset file size of the persistent memory device;
The Spark computing framework can be deployed on a server. The persistent memory device, that is, the persistent memory hardware device, is a device in fsdax mode. fsdax is the block device implementation of the persistent memory AD mode (AppDirect mode, the application direct access mode of persistent memory); a device in fsdax mode exposes standard file system APIs to users and is mounted as a directory on the file system when in use.
The device name of the persistent memory device may be determined by reading the configuration file, and the preset file size may also be determined by reading the configuration file. In addition, the embodiment can also determine the preset file size according to the task type of the Spark task currently executed.
As a possible implementation, there may be an operation of determining the mode in which the persistent memory device is located before this step, which is specifically as follows: judging whether the device attribute of the persistent memory device is a directory; if yes, the persistent memory device is judged to be a device in the fsdax mode.
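The mode check described above can be sketched as follows. This is a minimal illustration in Python, not the patent's implementation: the function name `detect_pmem_mode` and its return labels are hypothetical, and the character-device branch relies only on the statement elsewhere in this description that devdax is a character device implementation.

```python
import os
import stat


def detect_pmem_mode(device_path: str) -> str:
    """Classify a configured persistent memory path by its file type.

    Hypothetical helper: a directory is treated as an fsdax mount point,
    a character device as a devdax device.
    """
    try:
        mode = os.stat(device_path).st_mode
    except FileNotFoundError:
        return "missing"
    if stat.S_ISDIR(mode):
        return "fsdax"    # device attribute is a directory
    if stat.S_ISCHR(mode):
        return "devdax"   # character device
    return "unknown"
```

A caller would continue with S101 only when the result is `"fsdax"`.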
S102: creating file paths corresponding to a plurality of Spark tasks in the persistent memory device according to the device names, and generating a persistent memory file with the preset file size under each file path;
This step creates, in the persistent memory device, persistent memory files for storing the intermediate data of Spark tasks. Before this step, the execution status of Spark tasks may be monitored; if it is detected that the Spark computing framework has started to execute a Spark task, the operations of this step are performed so that the intermediate data generated while executing each Spark task can be stored in a corresponding persistent memory file. In this embodiment, the file paths corresponding to multiple Spark tasks may be created one after another: each time a Spark task is detected to start executing, a corresponding file path and persistent memory file are created, so that the persistent memory device can store the intermediate data generated by a plurality of Spark tasks. The file paths corresponding to any two Spark tasks are different; that is, Spark tasks and persistent memory files correspond one to one.
Specifically, in this embodiment, a file path may be created in the persistent memory device according to the device name of the persistent memory device, and the Spark task may be bound to that file path. A persistent memory file of the preset file size is then generated under the file path; its size should be smaller than the total capacity of the persistent memory device. As a possible implementation, the size of the persistent memory file may be less than or equal to one half of the total capacity of the persistent memory device, so that the device can hold the intermediate data of at least two Spark tasks and the Spark computing framework can execute at least two Spark tasks simultaneously.
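The sizing rule above amounts to a simple integer division. The sketch below, with the hypothetical helper `max_concurrent_tasks`, shows how many preset-size files fit on one device; it ignores file system metadata overhead, which a real deployment would need to account for.

```python
def max_concurrent_tasks(total_bytes: int, preset_file_bytes: int) -> int:
    """Number of preset-size persistent memory files that fit on one device.

    Hypothetical helper; file system overhead is ignored for simplicity.
    """
    if preset_file_bytes <= 0:
        raise ValueError("preset file size must be positive")
    return total_bytes // preset_file_bytes


GIB = 1024 ** 3
# A preset size of at most half the capacity leaves room for at least two tasks.
print(max_concurrent_tasks(128 * GIB, 64 * GIB))
print(max_concurrent_tasks(128 * GIB, 32 * GIB))
```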
As a possible implementation manner, in the process of creating a file path, in order to distinguish persistent memory files corresponding to different Spark tasks, the embodiment may randomly generate a plurality of intermediate file names, and create the file paths corresponding to the Spark tasks in the persistent memory device according to the device names and the intermediate file names.
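As an illustration of the path creation described above, the following Python sketch joins the device's mount directory with a randomly generated intermediate file name and sizes the new file to the preset value. The function names and the use of `uuid4` for the random name are assumptions; the patent does not specify how the name is generated. `truncate` only sets the file length, and whether blocks are reserved up front depends on the file system.

```python
import os
import uuid


def create_task_file(mount_dir: str, preset_size: int) -> str:
    """Create one persistent memory file with a random intermediate name."""
    name = uuid.uuid4().hex                  # random intermediate file name (assumed scheme)
    path = os.path.join(mount_dir, name)     # device name + intermediate file name
    with open(path, "xb") as f:              # "x" fails if the path already exists
        f.truncate(preset_size)              # set the file to the preset size
    return path


def create_task_files(mount_dir: str, task_ids, preset_size: int) -> dict:
    """Bind each Spark task id to its own, distinct file path."""
    return {task_id: create_task_file(mount_dir, preset_size) for task_id in task_ids}
```

The returned dictionary plays the role of the binding between Spark tasks and file paths; on a real system `mount_dir` would be the fsdax mount directory.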
S103: and storing the intermediate data generated by each Spark task into a corresponding persistent memory file.
After the persistent memory file corresponding to a Spark task has been generated, whenever intermediate data generated by that Spark task is detected, the intermediate data may be stored to the corresponding persistent memory file according to the binding relationship between the Spark task and the persistent memory file. Intermediate data refers to all data, other than the output result, generated in the process of executing the Spark task. The persistent memory file corresponding to a Spark task is the persistent memory file generated under the file path corresponding to that Spark task.
Further, after the intermediate data has been stored in the persistent memory file, if it is detected that the Spark task has finished executing, the persistent memory file corresponding to that Spark task is deleted by calling a file deletion method of POSIX (Portable Operating System Interface). POSIX is a family of related operating system API standards.
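One way to picture S103 is to memory-map the task's file and copy the intermediate data into it. The sketch below is an assumption about how this could be done, not the patent's code: `write_intermediate` and `read_intermediate` are hypothetical names, and an ordinary file stands in for a file on an fsdax mount.

```python
import mmap
import os


def write_intermediate(path: str, payload: bytes, offset: int = 0) -> int:
    """Copy payload into the task's preallocated file; return the next free offset."""
    with open(path, "r+b") as f:
        size = os.fstat(f.fileno()).st_size
        if offset + len(payload) > size:
            raise ValueError("payload does not fit in the preset file size")
        with mmap.mmap(f.fileno(), size) as m:
            m[offset:offset + len(payload)] = payload
            m.flush()                          # push the mapped changes to the file
    return offset + len(payload)


def read_intermediate(path: str, offset: int, length: int) -> bytes:
    """Read previously stored intermediate data back from the file."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

Because the file has a fixed preset size, the bounds check is what keeps one task from growing beyond its allocation.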
In this embodiment, the device name and the preset file size of the persistent memory device in the fsdax mode are first determined, a file path is created in the persistent memory device according to the device name, and a persistent memory file with the preset file size is generated in the file path. Intermediate data will be generated during the Spark computing framework executing the Spark task, and in this embodiment, the intermediate data is stored under a file path corresponding to the Spark task, that is, the intermediate data is stored by using a persistent memory file with a preset file size. In the above process, a memory with a preset file size is allocated in the persistent memory device to store intermediate data of Spark tasks, so that a scheme of storing intermediate data of multiple Spark tasks in the same persistent memory device can be realized. The devdax mode is a character device implementation of the persistent memory AD mode, which may allocate persistent memory to a virtual machine or register persistent memory for RDMA (remote direct memory access). Compared with an intermediate data storage mode in which the persistent memory device is exclusively used by a single Spark task in the devdax mode in the related art, the method and the device can improve the quantity of stored intermediate data in the persistent memory device, enable the Spark computing framework to simultaneously execute a plurality of Spark tasks, and improve task processing efficiency of the Spark computing framework.
As a possible implementation manner, when detecting that the Spark computing framework executes a new Spark task, the application may also randomly generate a new intermediate file name, and create a new file path corresponding to the new Spark task in the persistent memory device according to the device name and the new intermediate file name; and generating a new persistent memory file under the new file path, and storing new intermediate data generated by the new Spark task into the new persistent memory file.
By the method, after the new Spark task is detected, a new persistent memory file can be created so as to store new intermediate data. That is, the above procedure enables each Spark task being executed to have its corresponding persistent memory file to store intermediate data.
As a further description of the embodiment corresponding to FIG. 1, before storing the intermediate data generated by a Spark task in the persistent memory file, this embodiment may also bind the process number of the Spark task to a CPU core, so as to maximize the performance of the persistent memory device. The specific process is as follows: determining a target CPU core corresponding to the persistent memory device, acquiring the current process number of the Spark task, and configuring a binding relationship between the current process number and the target CPU core. As a possible implementation, the binding relationship between the current process number and the target CPU core may be configured through the taskset command or the sched_setaffinity method.
After determining the binding relationship between the current process number and the target CPU core, the present embodiment may control the target CPU core to store the intermediate data generated by each Spark task to the corresponding persistent memory file according to the binding relationship.
The above embodiments may be implemented based on an in-memory computing system that includes an initialization module, a NUMA (Non-Uniform Memory Access) binding module, and a resource reclamation module.
The initialization module mainly reads and processes the user configuration parameters and creates a corresponding persistent memory file on the persistent memory device. The NUMA binding module creates the correspondence between the CPU cores used by an executor and the persistent memory device. The resource reclamation module calls a system standard API to clear the intermediate data of the computation from the persistent memory at file granularity when the computing task is about to end, so that subsequent tasks can execute smoothly. The three modules are packaged together as a plug-in of the Spark computing framework and are called by the Spark computing framework while a task runs to achieve the optimization. An executor is the component of the Spark computing framework that is responsible for actually executing a task; a Spark task typically has multiple executors.
Unlike devdax mode, fsdax mode requires telling the program not only which persistent memory device to use but also how much space to use (which can be read from the user's configuration). Since fsdax mode needs both the device name and the target file name for the stored data, this embodiment uses the initialization module to determine the device name and the preset file size by reading the user configuration file.
In fsdax mode the target file name is uncertain and temporary, whereas the NUMA characteristics of persistent memory are determined by the persistent memory device and are independent of the file name. Therefore, the NUMA binding module acquires the device name used by the current executor process, looks up the pre-configured CPU core numbers according to the device name, acquires the process number from the runtime environment, and binds the process number to the CPU cores by using a system command, so as to maximize the performance of the persistent memory device.
The main difference in resource reclamation between the two modes is as follows. In devdax mode, the program must clear the data at the memory location pointed to by each object pointer one by one, update the recorded size after each clear, and loop until the current data size is found to be 0 before it can exit normally. This process is complicated and error-prone when the number of operations is large. In fsdax mode, all intermediate data can be treated as a single file and handled through the POSIX standard file operation API, so a single delete operation is enough and cleanup becomes simple and stable.
Referring to fig. 2 and fig. 3, fig. 2 is a flowchart of a memory computing method based on a persistent memory fsdax mode according to an embodiment of the present application, and fig. 3 is a schematic diagram of a memory computing method based on a persistent memory fsdax mode according to an embodiment of the present application.
Using persistent memory in devdax mode, as in the related art, strictly limits the number of executors: for a given server environment, the number of executors cannot be adjusted directly (the old devices must first be deleted manually, and a corresponding number of devdax namespaces must then be re-created), so a task may be unable to reach its optimal configuration. In addition, when devdax mode releases the space occupied by temporary data, it uses a native method that loops over addresses to delete data objects; this is complex and error-prone, a persistent memory leak can crash the program, and the usability of the computing system is greatly reduced. This embodiment provides an in-memory computing optimization scheme based on the persistent memory fsdax mode to solve these problems. The initialization module creates a corresponding fsdax file (persistent memory file) on the persistent memory device according to the device and size specified by the user. The NUMA binding module is responsible for creating the correspondence between the CPU cores used by an executor process and the persistent memory device. The resource reclamation module is responsible for calling the POSIX standard API to remove the intermediate data of the computation from the persistent memory at file granularity when the computing task is about to end.
As shown in FIG. 2, after the program starts, whether the persistent memory device is in fsdax mode may be determined according to whether the device attribute of the persistent memory device is a directory: if the attribute is a directory, the device is determined to be in fsdax mode. The device name of the persistent memory device is specified by the user in the user configuration file, which may also specify the preset file size and the list of CPU cores corresponding to the persistent memory device.
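The patent does not give the format of the user configuration file. Purely as an illustration, the sketch below assumes an INI-style file with hypothetical keys `device`, `file_size` and `cpu_cores`, and parses the three values that the text says the user specifies.

```python
import configparser

# Hypothetical configuration; section and key names are assumptions.
SAMPLE = """
[pmem]
device = /mnt/pmem0
file_size = 64g
cpu_cores = 0-3,8
"""

_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text: str) -> int:
    """Turn '64g' or '4096' into a byte count."""
    text = text.strip().lower()
    if text and text[-1] in _UNITS:
        return int(text[:-1]) * _UNITS[text[-1]]
    return int(text)


def parse_cores(text: str) -> list:
    """Turn '0-3,8' into [0, 1, 2, 3, 8]."""
    cores = []
    for part in text.split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = part.split("-")
            cores.extend(range(int(lo), int(hi) + 1))
        else:
            cores.append(int(part))
    return cores


def load_pmem_config(text: str):
    """Return (device name, preset file size in bytes, CPU core list)."""
    cfg = configparser.ConfigParser()
    cfg.read_string(text)
    section = cfg["pmem"]
    return (section["device"],
            parse_size(section["file_size"]),
            parse_cores(section["cpu_cores"]))
```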
In the working process of the initialization module, if the persistent memory device is in fsdax mode, the module sets a specific flag to true, randomly generates an intermediate file name, combines the device name and the intermediate file name into a file path, and creates a persistent memory file of the preset file size, which is then used to store the intermediate data generated while the in-memory computing program runs. If the persistent memory device is in devdax mode, no size needs to be specified and only the device name is required, because a devdax device is used exclusively at run time; as a result, the Spark computing framework cannot execute multiple tasks simultaneously in the original implementation. In fsdax mode, a user can run a plurality of Spark tasks simultaneously simply by allocating a reasonable preset file size in the configuration information, which improves the working efficiency and space utilization of the Spark computing framework. As shown in FIG. 3, the fsdax namespace of the persistent memory device is mounted to a mount directory of the file system. The Spark computing framework creates a persistent memory file in the persistent memory device through the initialization module, then binds the process number to a CPU core through the NUMA binding module, writes intermediate data into the persistent memory file during the data exchange process of the in-memory computing (Spark) task, and can also read the intermediate data back from the persistent memory file. After the Spark task has finished executing, the intermediate data in the persistent memory file can be cleaned up.
In the working process of the NUMA binding module, since the read/write performance of the persistent memory device is affected by the NUMA node, the NUMA binding module binds the device to the correct CPU cores. This binding relationship can be obtained from the user configuration file. Once the user has installed the persistent memory device, the binding relationship is fixed and does not change while the program runs.
In fsdax mode, the NUMA binding module reads the corresponding list of CPU core numbers according to the device name. It then obtains the current process number (PID) from the runtime environment (such as the JVM) and binds the current process number to the CPU cores through a system command. Specifically, the binding operation may be implemented with the "taskset" command or with the "sched_setaffinity" method.
After the Spark task has finished executing, control passes to the resource reclamation module. The resource reclamation module first checks whether the specific flag is true and, if so, cleans up the fsdax file. In fsdax mode, the POSIX file deletion method can be called directly, and there is no need to write complex code logic for acquiring and releasing memory addresses, so the method is simple and stability is greatly improved.
The above embodiment provides an in-memory computing optimization scheme based on the persistent memory fsdax mode. It allows an in-memory computing program to allocate executors flexibly so as to reach the optimal parameter configuration required by a task, and it allows multiple computing tasks to run in parallel, improving the operating efficiency and space utilization of the persistent memory device. Program performance improves after NUMA binding; after the resource reclamation process is modified, the program code is easier to maintain and errors are less likely when releasing the space occupied by temporary data, which improves program robustness. Stability when tasks run continuously is improved several times over compared with before the modification.
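The binding step can be illustrated with Python's wrapper around the Linux `sched_setaffinity` call. This is a sketch under assumptions: the mapping `core_map` from device name to CPU cores stands in for the user configuration, the function name is hypothetical, and the patent's own implementation runs inside a JVM-based executor rather than Python.

```python
import os


def bind_process_to_cores(device_name: str, core_map: dict, pid: int) -> set:
    """Pin the process `pid` to the CPU cores configured for a device.

    core_map is a hypothetical stand-in for the configured
    device-name -> CPU core list relationship. Linux only.
    """
    cores = set(core_map[device_name])   # look up cores by device name
    os.sched_setaffinity(pid, cores)     # configure the binding
    return os.sched_getaffinity(pid)     # report the effective affinity


if __name__ == "__main__":
    allowed = os.sched_getaffinity(0)
    target = {min(allowed)}              # pick one core this process may use
    print(bind_process_to_cores("/mnt/pmem0", {"/mnt/pmem0": target}, os.getpid()))
    os.sched_setaffinity(0, allowed)     # restore the original affinity
```

From a shell, the same effect is commonly obtained with `taskset -cp <core-list> <pid>`.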
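The reclamation step reduces to a flag check followed by a single POSIX unlink. A minimal sketch, with the hypothetical function name `reclaim_task_file` and the flag passed in as a plain boolean:

```python
import os


def reclaim_task_file(path: str, fsdax_flag: bool) -> bool:
    """Delete a finished task's persistent memory file in one call.

    Returns True if a file was removed. The flag mirrors the marker the
    initialization module sets when the device is in fsdax mode.
    """
    if not fsdax_flag:
        return False          # not fsdax: file-granularity cleanup does not apply
    try:
        os.unlink(path)       # POSIX unlink removes the whole file at once
    except FileNotFoundError:
        return False          # already cleaned up
    return True
```

Treating an already-missing file as a non-error keeps the cleanup idempotent, which is a design choice of this sketch rather than something the patent states.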
Referring to fig. 4, fig. 4 is a schematic structural diagram of a data storage system according to an embodiment of the present application, where the system is applied to a Spark computing framework, and may specifically include:
a parameter determining module 401, configured to determine a device name and a preset file size of the persistent memory device; wherein the persistent memory device is a device in fsdax mode;
a file creating module 402, configured to create file paths corresponding to a plurality of Spark tasks in the persistent memory device according to the device name, and generate a persistent memory file with the preset file size under each file path;
a data storage module 403, configured to store the intermediate data generated by each Spark task to the corresponding persistent memory file.
In this embodiment, the device name and the preset file size of the persistent memory device in fsdax mode are determined first; a file path is then created in the persistent memory device according to the device name, and a persistent memory file with the preset file size is generated under that file path. The Spark computing framework generates intermediate data while executing a Spark task, and in this embodiment the intermediate data is stored under the file path corresponding to the Spark task, that is, in a persistent memory file with the preset file size. Because only space of the preset file size is allocated in the persistent memory device for the intermediate data of each Spark task, the intermediate data of multiple Spark tasks can be stored in the same persistent memory device. Compared with the related-art storage mode in which a single Spark task occupies the persistent memory device exclusively in devdax mode, this embodiment increases the amount of intermediate data stored in the persistent memory device, enables the Spark computing framework to execute multiple Spark tasks at the same time, and improves the task processing efficiency of the Spark computing framework.
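The idea of a fixed-size file per task can be sketched as follows. This is a hedged illustration rather than the embodiment's code: the function names are hypothetical, the file is sized with truncate, and data is written through mmap. On a real fsdax mount the mapping would go to persistent memory; any ordinary directory can stand in for it when trying the sketch.

```python
import mmap
import os


def create_pmem_file(path, file_size):
    """Create a file at path whose length is the preset file_size in bytes."""
    with open(path, "wb") as f:
        f.truncate(file_size)  # sets the file length to the preset size
    return path


def store_intermediate(path, data, offset=0):
    """Copy data into the memory-mapped file at the given offset."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as m:  # length 0 maps the whole file
            if offset + len(data) > len(m):
                raise ValueError("data does not fit in the preset file size")
            m[offset:offset + len(data)] = data
            m.flush()
```

Each task writes only inside its own file, so several tasks can share one device as long as the sum of the preset sizes fits on it.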
Further, the file creation module 402 includes:
a path creation module, configured to randomly generate a plurality of intermediate file names, and to create the file paths corresponding to the plurality of Spark tasks in the persistent memory device according to the device name and the intermediate file names.
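One way to picture the path creation is the short sketch below. It assumes that the device name resolves to a mount directory (the example value "/mnt/pmem0" is hypothetical) and uses UUIDs as the random intermediate file names; the text does not say how the random names are produced.

```python
import os
import uuid


def make_task_paths(device_dir, task_ids):
    """Map each task id to a distinct, randomly named path under device_dir."""
    return {task_id: os.path.join(device_dir, uuid.uuid4().hex)
            for task_id in task_ids}
```

For example, make_task_paths("/mnt/pmem0", ["task-1", "task-2"]) returns two different paths under the same device directory, one per task.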
Further, the system further includes:
a new intermediate data storage module, configured to randomly generate a new intermediate file name when it is detected that the Spark computing framework executes a new Spark task, and to create a new file path corresponding to the new Spark task in the persistent memory device according to the device name and the new intermediate file name; and further configured to generate a new persistent memory file under the new file path and to store new intermediate data generated by the new Spark task to the new persistent memory file.
Further, the system further includes:
a binding module, configured to determine a target CPU core corresponding to the persistent memory device before the intermediate data generated by the Spark task is stored to the persistent memory file; further configured to acquire the current process number of the Spark task; and further configured to configure the binding relationship between the current process number and the target CPU core;
correspondingly, the data storage module 403 is specifically configured to control the target CPU core to store the intermediate data generated by each Spark task to the corresponding persistent memory file according to the binding relationship.
Further, the process by which the binding module configures the binding relationship between the current process number and the target CPU core includes: configuring the binding relationship between the current process number and the target CPU core through the taskset method or the sched_setaffinity method.
Further, the system further includes:
a mode judging module, configured to judge, before the device name and the preset file size of the persistent memory device are determined, whether the device attribute of the persistent memory device is a directory; and if so, to determine that the persistent memory device is a device in fsdax mode.
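The mode check can be sketched as below. The directory test follows the text; treating a character device as devdax is an added assumption based on how devdax devices are usually exposed on Linux, and the function name is hypothetical.

```python
import os
import stat


def detect_dax_mode(device_path):
    """Return "fsdax" for a directory, "devdax" for a character device, else None."""
    mode = os.stat(device_path).st_mode
    if stat.S_ISDIR(mode):
        return "fsdax"  # a mounted file system appears as a directory
    if stat.S_ISCHR(mode):
        return "devdax"  # assumption: devdax is exposed as a character device
    return None
```

A caller would proceed with the file-based storage path only when the function returns "fsdax".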
Further, the system further includes:
a file deleting module, configured to delete, by calling the POSIX file deletion method, the persistent memory file corresponding to a Spark task once it is detected that the Spark task has finished executing.
Since the embodiments of the system portion correspond to those of the method portion, refer to the description of the method embodiments for the system portion; the details are not repeated here.
The present application also provides a storage medium having stored thereon a computer program which, when executed, performs the steps provided by the above embodiments. The storage medium may include a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
The application also provides an electronic device, which may include a memory and a processor, where the memory stores a computer program and the processor implements the steps provided in the foregoing embodiments when calling the computer program in the memory. Of course, the electronic device may also include various network interfaces, a power supply, and the like.
In this description, the embodiments are described in a progressive manner: each embodiment focuses on its differences from the others, and identical or similar parts can be cross-referenced among the embodiments. Because the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant points can be found in the description of the method. It should be noted that those skilled in the art can make various improvements and modifications to the present application without departing from its principles, and such improvements and modifications also fall within the scope of the claims of the present application.
It should also be noted that in this specification, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

Claims (8)

1. A data storage method applied to a Spark computing framework, the data storage method comprising:
determining a device name and a preset file size of a persistent memory device; wherein the persistent memory device is a device in fsdax mode;
creating file paths corresponding to a plurality of Spark tasks in the persistent memory device according to the device name, and generating a persistent memory file with the preset file size under each file path; wherein creating the file paths corresponding to the plurality of Spark tasks in the persistent memory device according to the device name specifically comprises: randomly generating a plurality of intermediate file names, and creating the file paths corresponding to the plurality of Spark tasks in the persistent memory device according to the device name and the intermediate file names;
storing the intermediate data generated by each Spark task to the corresponding persistent memory file;
when it is detected that the Spark computing framework executes a new Spark task, randomly generating a new intermediate file name, and creating a new file path corresponding to the new Spark task in the persistent memory device according to the device name and the new intermediate file name;
generating a new persistent memory file under the new file path, and storing new intermediate data generated by the new Spark task to the new persistent memory file.
2. The data storage method according to claim 1, further comprising, before storing the intermediate data generated by each Spark task to the corresponding persistent memory file:
determining a target CPU core corresponding to the persistent memory device;
acquiring the current process number of the Spark task;
configuring the binding relationship between the current process number and the target CPU core;
correspondingly, storing the intermediate data generated by each Spark task to the corresponding persistent memory file comprises:
controlling the target CPU core to store the intermediate data generated by each Spark task to the corresponding persistent memory file according to the binding relationship.
3. The data storage method according to claim 2, wherein configuring the binding relationship between the current process number and the target CPU core comprises:
configuring the binding relationship between the current process number and the target CPU core through a taskset method or a sched_setaffinity method.
4. The data storage method according to claim 1, further comprising, before determining the device name and the preset file size of the persistent memory device:
judging whether the device attribute of the persistent memory device is a directory;
if so, determining that the persistent memory device is a device in fsdax mode.
5. The data storage method according to any one of claims 1 to 4, further comprising, after storing the intermediate data generated by each Spark task to the corresponding persistent memory file:
if it is detected that a Spark task has finished executing, deleting the persistent memory file corresponding to the finished Spark task by calling a POSIX file deletion method.
6. A data storage system applied to a Spark computing framework, the data storage system comprising:
a parameter determining module, configured to determine a device name and a preset file size of a persistent memory device; wherein the persistent memory device is a device in fsdax mode;
a file creating module, configured to create file paths corresponding to a plurality of Spark tasks in the persistent memory device according to the device name, and to generate a persistent memory file with the preset file size under each file path; wherein the file creating module comprises a path creating module, specifically configured to randomly generate a plurality of intermediate file names and to create the file paths corresponding to the plurality of Spark tasks in the persistent memory device according to the device name and the intermediate file names;
a data storage module, configured to store the intermediate data generated by each Spark task to the corresponding persistent memory file;
a new intermediate data storage module, configured to randomly generate a new intermediate file name when it is detected that the Spark computing framework executes a new Spark task, and to create a new file path corresponding to the new Spark task in the persistent memory device according to the device name and the new intermediate file name; and further configured to generate a new persistent memory file under the new file path and to store new intermediate data generated by the new Spark task to the new persistent memory file.
7. An electronic device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the data storage method of any of claims 1 to 5 when the computer program in the memory is invoked by the processor.
8. A storage medium having stored therein computer executable instructions which when loaded and executed by a processor perform the steps of the data storage method of any one of claims 1 to 5.
CN202110551280.5A 2021-05-20 2021-05-20 Data storage method, system, electronic equipment and storage medium Active CN113590536B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110551280.5A CN113590536B (en) 2021-05-20 2021-05-20 Data storage method, system, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110551280.5A CN113590536B (en) 2021-05-20 2021-05-20 Data storage method, system, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113590536A CN113590536A (en) 2021-11-02
CN113590536B true CN113590536B (en) 2023-12-29

Family

ID=78243162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110551280.5A Active CN113590536B (en) 2021-05-20 2021-05-20 Data storage method, system, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113590536B (en)

Also Published As

Publication number Publication date
CN113590536A (en) 2021-11-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant