CN119402200B - CPU-GPU Data Processing and Transmission Optimization System and Method Based on Zero-Knowledge Proof - Google Patents
- Publication number
- CN119402200B · CN202411314060.0A
- Authority
- CN
- China
- Prior art keywords
- gpu
- cpu
- calculation
- scalar
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/32—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials
- H04L9/3218—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials using proof of knowledge, e.g. Fiat-Shamir, GQ, Schnorr, or non-interactive zero-knowledge proofs
- H04L9/3221—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials using proof of knowledge, e.g. Fiat-Shamir, GQ, Schnorr, or non-interactive zero-knowledge proofs; interactive zero-knowledge proofs
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/30—Public key, i.e. encryption algorithm being computationally infeasible to invert or user's encryption keys not requiring secrecy
- H04L9/3066—Public key, i.e. encryption algorithm being computationally infeasible to invert or user's encryption keys not requiring secrecy involving algebraic varieties, e.g. elliptic or hyper-elliptic curves
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Theoretical Computer Science (AREA)
- Signal Processing (AREA)
- Computer Networks & Wireless Communication (AREA)
- Mathematical Physics (AREA)
- Physics & Mathematics (AREA)
- Pure & Applied Mathematics (AREA)
- Mathematical Optimization (AREA)
- Computing Systems (AREA)
- Mathematical Analysis (AREA)
- General Physics & Mathematics (AREA)
- Algebra (AREA)
- Complex Calculations (AREA)
Abstract
The invention discloses a zero-knowledge-proof-oriented CPU-GPU data processing and transmission optimization system, which performs multi-scalar multiplication operations based on elliptic curve cryptography to generate zero-knowledge proofs. The system comprises a CPU and a GPU with a data transmission channel between them. The CPU acquires and stores the point sets and scalar sets involved in the multi-scalar multiplication and communicates with the GPU using a configured parallelized data transmission mechanism; the GPU performs the multi-scalar multiplication based on the point sets and scalar sets and transmits the calculation results back to the CPU. Through the synergy of several technical measures, the invention significantly improves the speed and efficiency of zero-knowledge proof generation and optimizes resource utilization.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a zero knowledge proof oriented CPU-GPU data processing and transmission optimization system and method.
Background
Zero-Knowledge Proof (ZKP) is a cryptographic protocol between two or more parties whose purpose is to enable a Prover to convince a Verifier of the authenticity of a statement without revealing any useful information. The key property of the protocol is that it can secure the verifier's trust while protecting the prover's information privacy. The core value of zero-knowledge proof is the ability to prove the authenticity of a statement without revealing any specific information. However, implementing this technique faces significant computational challenges, mainly due to the computationally intensive nature of its algorithms. Traditional Central Processing Unit (CPU) architectures struggle with such high-concurrency, high-computational-density tasks because of their limited single-threaded performance and memory access latency; particularly when handling large-scale data sets and complex operations, it is often difficult for CPUs to achieve satisfactory efficiency.
To overcome this limitation, heterogeneous computing architectures, particularly Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), and Graphics Processing Units (GPUs), have become critical to efficient zero-knowledge proof computation due to their unique parallel processing capabilities and custom computing advantages. FPGAs can greatly accelerate the complex mathematical operations in ZKP through parallel processing and pipeline optimization, shortening proof generation and verification time, which is particularly important for application scenarios with strict real-time requirements. In addition, the programmability and flexibility of FPGAs allow customized designs for specific ZKP algorithms, enabling more efficient computation paths and reducing unnecessary data movement and storage, thereby improving overall performance and enabling large-scale deployment. An ASIC is an integrated circuit specifically tailored for a certain class of applications; for zero-knowledge proof, an ASIC can be designed to perform certain critical mathematical operations, such as modular multiplication and elliptic curve point addition, thereby providing higher-performance custom calculation, lower power consumption, a smaller physical footprint, and long-term economy. Especially in stable large-scale applications, the fixed circuit design of the ASIC ensures computational efficiency and data processing security. For example, PipeZK implements a set of pipelined architectures using ASIC chips, achieving efficient zero-knowledge proof generation by providing dedicated circuitry to accelerate both of the computation-intensive stages: the number theoretic transform (NTT) and multi-scalar multiplication.
The GPU has a large number of simple computing units specialized for parallel computing and can execute a large number of similar computing tasks simultaneously, which is very effective for the matrix operations, vector operations, and parallelized big-number arithmetic common in zero-knowledge proof. The GPU also has a good programming environment, providing developers with relatively friendly development tools and libraries that allow programming in a high-level language close to C++, which significantly flattens the learning curve of zero-knowledge proof algorithm development and enables developers from more backgrounds to participate. Moreover, the GPU market is relatively mature, with rich driver support and development tools.
However, existing heterogeneous computing architectures for generating zero-knowledge proof suffer from the following drawbacks:
1) The design and implementation of FPGAs typically requires extensive knowledge of a hardware description language (e.g., Verilog or VHDL) and a deep understanding of hardware logic design principles. This results in a high technical threshold and learning cost, and developers with software backgrounds often need to invest considerable time and effort to master FPGA development skills. Furthermore, circuit designs on an FPGA are typically highly customized and optimized for a particular algorithm or application. This specificity means that migrating designs between different application scenarios can be difficult, requiring re-evaluation and adjustment of the hardware configuration, which increases the development cycle and cost of a project.
2) Applying ASIC chips to the generation and verification of zero-knowledge proofs, while potentially improving the operating efficiency of a particular algorithm significantly and reducing power consumption, has several limitations. The circuit layout and functionality of an ASIC chip are inflexible once designed, meaning that new ZKP algorithms or parameter variations cannot be easily accommodated. Development costs are high: the early investment for an ASIC includes design, mask, test, and production costs, which is uneconomical for smaller-scale applications or experimental ZKP solutions. Update cycles are long: ASIC design and manufacturing cycles are lengthy, and if a significant breakthrough or algorithm improvement occurs in the ZKP field, an ASIC cannot follow up in time and becomes outdated.
3) Although GPUs possess high bandwidth video memory, ZKP computation may involve large amounts of data transfer, especially when data needs to be exchanged frequently between the CPU and the GPU, which adds additional delay and overhead. The transfer of data between a host (CPU) and a device (GPU) requires the transfer of data over a PCIe bus, which is a bottleneck. For ZKP calculations, this communication delay can significantly impact performance if the algorithm requires a large amount of data exchange. This problem is exacerbated by frequent data transmissions, especially when ZKP involves complex interactive protocols. In addition, not only is there a delay in communication between the GPU and the outside, but data sharing and synchronization between different threads within the GPU also causes a delay. If the ZKP algorithm is improperly designed, resulting in unnecessary memory access patterns or excessive synchronization operations, this also increases computation delay.
In existing zero-knowledge proof generation libraries, such as the Bellman library, the architecture is often limited by the data transmission from the CPU to the GPU, which causes a performance bottleneck. As shown in fig. 1, a multi-scalar multiplication computation typically involves the transfer of large numbers of point sets and scalar sets between the heterogeneous architectures. However, since the GPU must wait for the CPU to complete the data transmission before starting the computation, the GPU sits idle while waiting for the transmission to finish, which hurts overall computation efficiency.
In summary, the popularization and application of zero-knowledge proof technology still face problems such as high computational requirements, long generation times, and high operating costs, and these factors greatly hinder its wide application in blockchain architectures. Especially in environments with stringent real-time requirements, such as financial markets, delays in generating and verifying zero-knowledge proofs directly threaten the timeliness and robustness of transactions, causing a series of negative effects such as transaction delays, reduced trust, and potential economic losses. It should be noted that frequent data exchange between the host and the Graphics Processing Unit (GPU) often leaves computing resources idle, forming a bottleneck for performance improvement. Therefore, optimizing the communication mechanism between the CPU and the GPU and improving the memory access policy are key strategies for improving zero-knowledge proof performance, with the aim of greatly reducing proof construction time.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a zero-knowledge-proof-oriented CPU-GPU data processing and transmission optimization system and method.
According to a first aspect of the present invention, there is provided a zero-knowledge-proof-oriented CPU-GPU data processing and transmission optimization system for performing multi-scalar multiplication operations based on elliptic curve cryptography to generate zero-knowledge proofs. The system comprises a CPU and a GPU, with a data transmission channel provided between them. The CPU is configured to acquire and store the point set and scalar set involved in the multi-scalar multiplication operations and to communicate with the GPU using a configured parallelized data transmission mechanism; the GPU performs the multi-scalar multiplication computation based on the point set and the scalar set and returns the calculation result to the CPU.
According to a second aspect of the present invention, a CPU-GPU data processing and transmission optimization method is provided that is zero knowledge proof oriented. The method comprises the following steps:
Storing, in a CPU, a point set and a scalar set involved in a multi-scalar multiplication operation, the multi-scalar multiplication operation being a multi-scalar multiplication operation based on elliptic curve cryptography for enabling generation of a zero knowledge proof;
And transferring data between the CPU and the GPU by using a set parallelization data transmission mechanism to control the GPU to execute calculation of multi-scalar multiplication based on the point set and the scalar set, and transmitting calculation results back to the CPU.
Compared with the prior art, the invention provides an innovative parallelized data transmission mechanism for the CPU-GPU heterogeneous architecture. The core of this mechanism is a redesigned data flow: the GPU is allowed to begin processing the partial data already transmitted while the CPU is still preparing data, so the GPU need not wait for all data to be transmitted in full at once. This design improves the throughput of zero-knowledge proof generation, reduces GPU idle periods, and shortens proof generation time through more efficient task scheduling, thereby effectively improving system performance and reducing computation cost without increasing hardware cost. The invention addresses problems such as communication delay between the CPU and the GPU and data sharing and synchronization delays.
Other features of the present invention and its advantages will become apparent from the following detailed description of exemplary embodiments of the invention, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a prior art Bellman execution diagram;
FIG. 2 is a diagram of a zero knowledge proof oriented CPU-GPU heterogeneous architecture in accordance with an embodiment of the invention;
FIG. 3 is a schematic diagram of an update process of a shared pre-calculation table according to one embodiment of the invention;
FIG. 4 is a schematic illustration of the calculation of an elliptic curve according to one embodiment of the present invention;
FIG. 5 is a comparison diagram of CPU-GPU transmission performance for zero knowledge proof in accordance with an embodiment of the present invention;
FIG. 6 is a schematic diagram of a zero-copy technique according to one embodiment of the invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless it is specifically stated otherwise.
The following description of at least one exemplary embodiment is merely exemplary in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of exemplary embodiments may have different values.
It should be noted that like reference numerals and letters refer to like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
In the computation of zero knowledge proof, efficient data transfer between the CPU and the GPU is crucial. The data transmission efficiency is improved, the delay of data exchange between the CPU and the GPU can be greatly shortened, the waiting time in the transmission process is reduced, the generation and verification speed of zero knowledge proof is accelerated, and smoother and efficient service experience can be provided for users on the premise of ensuring high safety and privacy, so that the increasingly-growing big data processing and complex computing demands are met.
The zero-knowledge-proof-oriented CPU-GPU data processing and transmission optimization system provided by the invention is a CPU-GPU heterogeneous architecture, as shown in fig. 2. A CPU (central processing unit) is the main computing unit of a computer and includes, for example, a controller, an ALU (arithmetic logic unit), caches, and memory. The CPU is responsible for controlling and coordinating the operation of the system, including, but not limited to, running the operating system and managing memory and I/O devices. In the context of ZKP, the CPU is typically responsible for handling non-parallel logic flows such as initialization, data preparation, and result summarization, and for performing the necessary data preprocessing and post-processing tasks before and after GPU (graphics processing unit) computation. GPUs were originally designed for graphics rendering but, due to their powerful parallel computing capabilities, have been widely used in scientific computing, machine learning, and cryptography. A GPU is composed of multiple streaming multiprocessors (SMs), each of which adopts a single-instruction multi-thread parallel processing mode, supporting thousands of threads resident on the GPU concurrently. This gives the GPU efficient parallelism, particularly suited to large-scale computational tasks without data dependencies, such as the computational operations required in zero-knowledge proof. In the context of zero-knowledge proof, GPUs are mainly used to perform large numbers of parallel mathematical operations, such as matrix multiplication, modular arithmetic, and polynomial operations, which are the most time-consuming parts of zero-knowledge proof generation and verification.
In general, the system architecture of the present invention emphasizes optimization of the communication structure and the memory structure, both of which are critical to efficient data transfer between the CPU and the GPU for Zero-Knowledge Proof (ZKP). For the communication structure, a high-speed data transmission channel, such as PCIe (Peripheral Component Interconnect Express), is required between the CPU and the GPU. The optimized system can use more efficient transmission protocols and hardware to achieve intelligent scheduling and optimization of data flow, ensure that data flows efficiently between the CPU and the GPU, reduce unnecessary data copying and waiting time, and ensure that data transmission does not become a bottleneck for overall system performance. For the memory structure, the system typically includes two types of memory: CPU memory and GPU memory. To increase efficiency, data needs to be intelligently managed and scheduled between these memory spaces in order to minimize data duplication and transmission during computation.
Aiming at the performance bottleneck in the zero-knowledge proof generation process, the invention provides an innovative high-performance computing framework focused on optimizing the data transmission and preprocessing mechanisms. The framework aims to significantly improve the execution efficiency of the zero-knowledge proof algorithm, ensure reasonable utilization of resources, and enhance the overall performance and security of the system. Hereinafter, embodiments of the optimized preprocessing mechanism and data transmission mechanism are described in detail.
First, the optimized preprocessing mechanism
1. Shared pre-computation table
The pre-calculation table (Precomputation Table) is a data structure that pre-computes and stores the results of specific functions or operations that are used frequently in subsequent calculations. By creating such a table before a computation starts, or at an early stage of a computation-intensive task, part of the computation-intensive workload can be shifted to the preprocessing stage. This method can significantly reduce the cost of online or real-time calculation and improve computational efficiency.
Sharing pre-calculation tables in the context of multi-scalar multiplication (MSM) in Elliptic Curve Cryptography (ECC) is an optimization technique for reducing calculation time and storage requirements. The shared pre-computation table is an efficient way to accelerate subsequent multi-scalar multiplication computations by computing and storing multiples of points on the elliptic curve in advance. This approach is particularly useful when dealing with multiple related points, as it allows the points to share the same optimization table, thereby reducing redundant computation and storage burden.
The basic idea of the shared pre-calculation table is as follows: assume there is a set of points P1, P2, ..., Pn that all lie in the same subgroup on the elliptic curve. This means that for any points Pi and Pj, there exists some integer k such that Pi = k·Pj or Pj = k·Pi. If a base point G can be found that is a generator of all the points, i.e., Pi = ki·G for all i, then only multiples of the base point G need be pre-calculated, and these can be used to accelerate scalar multiplication for any point Pi.
In one embodiment, the step of implementing a multi-scalar multiplication using a shared pre-calculation table includes:
S1: select a base point, i.e., find a suitable base point G that generates a subgroup containing all Pi.
Typically, G is a generator of a large prime order on the elliptic curve.
S2, constructing a pre-calculation table, namely pre-calculating the multiple sequence of G.
For example, the pre-calculated multiple sequence of G is G, 2G, 3G, ..., lG, where l is the maximum length of the pre-calculation table, typically 2^w, with w being the window size of the window-splitting phase in the Pippenger (bucket) algorithm.
S3, optimizing scalar multiplication.
For example, for each point Pi = ki·G, scalar multiplication can be accelerated by looking up the appropriate values in the pre-calculation table. Specifically, if ki can be decomposed in binary as ki = b0 + b1·2^w + b2·2^(2w) + ..., where each bj is a w-bit binary number, then the points G, 2^w·G, 2^(2w)·G, ... in the pre-calculation table can be used, and Pi is obtained by point addition and point doubling operations.
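As an illustration of the table-lookup approach described above, the following sketch decomposes a scalar into base-2^w digits and accumulates pre-computed multiples. It is a minimal model in which integer arithmetic modulo a prime stands in for elliptic curve point addition; the window size, table depth, and all names are illustrative assumptions rather than the patent's implementation.

```python
# Sketch of windowed scalar multiplication using a shared pre-computation
# table. Integer addition modulo a prime stands in for elliptic-curve
# point addition; the table layout and the base-2^w digit decomposition
# follow the description above. All constants are illustrative.

W = 4                      # window size w (assumed value)
ORDER = 1_000_003          # stand-in for the group order (hypothetical)
G = 5                      # stand-in for the base point G
NUM_WINDOWS = 8            # supports scalars up to 2^(8*4) = 2^32

# Pre-computation table: d * 2^(j*w) * G for every window index j and
# digit d in [0, 2^w). Built once, shared by all points P_i = k_i * G.
TABLE = [[(d * pow(2, j * W, ORDER) * G) % ORDER for d in range(2 ** W)]
         for j in range(NUM_WINDOWS)]

def scalar_mul(k: int) -> int:
    """Compute k*G using only table lookups and 'point additions'."""
    acc = 0                                      # identity element
    for j in range(NUM_WINDOWS):
        digit = (k >> (j * W)) & (2 ** W - 1)    # digit b_j
        acc = (acc + TABLE[j][digit]) % ORDER    # point addition stand-in
    return acc

k = 23_456
assert scalar_mul(k) == (k * G) % ORDER
```

The online cost is one table lookup and one addition per window, regardless of how many distinct points share the table, which is the source of the savings described above.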
The storage core of the pre-calculation table is a carefully selected set of multiples of the base point G, together with the scalar relationships between the points in each point set and G. These multiples form the basic components for quickly constructing arbitrary scalar multiplication results. By pre-computing and storing a series of multiples of the base point G, subsequent large-scale scalar multiplication operations can be significantly accelerated, since many common multiplications can be decomposed into simple combinations of pre-computed points, reducing the need for real-time computation. In addition, to reduce the storage overhead of the pre-calculation table, a coordinate compression technique can be adopted: only the x-coordinate of each point is stored, and the y-coordinate is recomputed from the elliptic curve equation when needed, making effective use of the storage space.
To address different situations and performance requirements, in one embodiment the pre-calculation table is also updated periodically, including but not limited to adding higher-order pre-calculated points to accommodate a Pippenger algorithm with a larger window size, or updating the algorithm to increase efficiency.
Fig. 3 shows the update process of the pre-calculation table. At the beginning of the calculation, a certain amount of storage space is reserved in the pre-calculation table for storing pre-calculated point sets. These point sets are typically carefully chosen according to the expected computational requirements, so as to cover as many cases as possible in the subsequent computation. When a point set to be solved is encountered during a calculation, the system first tries to find a corresponding pre-computed result in the pre-calculation table. If the result exists, it can be obtained directly from the table without complex calculation, saving a great deal of computation time and resources; if it does not exist, the point set is computed afresh and the result is returned to the GPU. As the computation advances, the contents of the pre-calculation table change. When a new calculation result is generated, it must be decided whether to add it to the pre-calculation table. If there is still enough room in the table, the newly generated point set result is inserted directly into the table for use by subsequent queries. This design further reduces future computation load and improves computational efficiency. However, when the space of the pre-calculation table is insufficient, a configured eviction policy is adopted to manage the contents of the table. Considering that point set computation has significant temporal locality, i.e., a point set that has been accessed recently is likely to be accessed again in the near future, in one embodiment the contents of the pre-calculation table are updated using a Least Recently Used (LRU) policy.
The LRU algorithm is a replacement algorithm whose basic idea is to preferentially eliminate those items that are not used for the longest time, thereby keeping the stored contents of the table the most representative and most frequently used point set. In this way, the value of the pre-computation table can be maximized, always in an optimal state, to meet the current computing needs.
Meanwhile, as the pre-calculation table is updated, the point set relation database designed for the table is checked and updated synchronously, so that all point sets depending on the pre-calculation table are correctly mapped to the latest point sets, maintaining the accuracy and efficiency of the calculation. Finally, regarding the choice of data structure: to efficiently manage the points in the pre-calculation table and their scalar relationships, a hash table is used, for example, to store the scalar values of the base point multiples, ensuring an average lookup time of O(1), i.e., the time required to look up a point's scalar value is constant regardless of the size of the data set. Through these strategies, the pre-calculation table can not only significantly improve the computational efficiency of multi-scalar multiplication in elliptic curve cryptography, but also effectively manage storage resources and ensure the security and reliability of the system.
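The lookup-or-insert flow with LRU eviction described above can be sketched as follows. This is a minimal host-side model: the class name, key format, and capacity are illustrative assumptions, and Python's OrderedDict supplies both the O(1) hash-table lookup and the recency ordering needed for eviction.

```python
from collections import OrderedDict

# Sketch of an LRU-managed pre-calculation table. Keys identify a point
# set; values hold its (pre)computed result. All names are illustrative.

class PrecomputationTable:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._entries = OrderedDict()

    def lookup(self, key):
        """Return the stored result, or None on a miss; refresh recency."""
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)      # mark as most recently used
        return self._entries[key]

    def insert(self, key, result):
        """Insert a new result, evicting the least recently used entry."""
        if key in self._entries:
            self._entries.move_to_end(key)
        self._entries[key] = result
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)   # evict the LRU entry

table = PrecomputationTable(capacity=2)
table.insert("P1", 10)
table.insert("P2", 20)
table.lookup("P1")          # touch P1 so P2 becomes least recently used
table.insert("P3", 30)      # exceeds capacity: evicts P2
assert table.lookup("P2") is None
assert table.lookup("P1") == 10
```

On a miss the caller would compute the point set afresh and call `insert`, matching the fig. 3 flow of returning the fresh result to the GPU while caching it for later queries.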
2. Signed window method
In elliptic curve cryptography, as shown in fig. 4, the negation of a point exhibits a significant computational efficiency advantage over point addition or point doubling, mainly because of its simpler mathematical operation requirements. Specifically, in elliptic curve cryptography, the negative point -P of a point P generally refers to the point on the elliptic curve that is symmetric to P about the x-axis. In most mathematical representations of elliptic curves, calculating the negative of a point is a very simple operation, since it involves only changing the sign of the y-coordinate; no complex modular operations, multiplicative inverse computations, or square root solutions are required. Thus, in one embodiment, signed representations of elliptic curve points are used to reduce the computational effort of multi-scalar multiplication in a simple way, improving the efficiency of zero-knowledge proof generation.
In contrast, point addition or point doubling involves not only polynomial computation but often also the solving of multiplicative inverses, which is a relatively time-consuming task in a finite field. Therefore, given the high efficiency of the negation operation, introducing it into the multi-scalar multiplication and point multiplication calculations can significantly reduce the consumption of computational resources and raise the overall operation speed, providing an important performance optimization for applications of elliptic curve cryptography. In the conventional unsigned Pippenger window approach, the window value range is [0, 2^c - 1], so each window may need to handle up to 2^c different values. After introducing the signed window, the window value range becomes [-2^(c-1), 2^(c-1) - 1]. This not only narrows the absolute range of window values, but also allows some point operations to be replaced with the negative representation of a point, i.e., the point's inverse, thereby reducing the number of window values that must be processed separately.
In one embodiment, the step of converting the scalar k into a signed sliding-window encoding comprises:
Step S21, determining the window size.
For example, a window size c is selected. The window size determines how many bits are in each window.
Step S22, representing the scalar k in binary form to obtain a binary string.
For example, if k = 23, its binary form is 10111.
Step S23, dividing the binary string into windows.
The binary string is split into windows of c bits each. For example, with c = 3, the string 10111 is divided into 010 and 111. It will be appreciated that, in order to form a complete window, the most significant bits must in some cases be padded with 0.
Step S24, converting each window value into a decimal number.
For example, 010 converts to 2 and 111 converts to 7.
Step S25, performing signed conversion.
For each window value v, if v > 2^(c-1) - 1, subtract 2^c from v to make it a signed number. For example, if c = 3, the signed window values range over [-4, 3]; any window value greater than 3 is converted to an equivalent negative number. For instance, 5 is converted to -3 because 5 - 8 = -3.
Step S26, adjusting the subsequent window.
If the value of the current window changes from positive to negative (i.e., 2^c was subtracted), the value of the next window is increased by 1 to maintain the equivalence of the scalar multiplication.
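The steps above can be sketched as follows (illustrative only: the function name and the least-significant-window-first order are choices made here, not taken from the patent; a window size c >= 2 is assumed):

```python
def signed_windows(k, c):
    """Encode scalar k as signed c-bit windows, least-significant window first.

    Follows the scheme of steps S21-S26: split k into c-bit windows, then map
    any window value v > 2^(c-1) - 1 to v - 2^c and carry 1 into the next
    (more significant) window. Assumes c >= 2.
    """
    half, full = 1 << (c - 1), 1 << c
    windows = []
    while k > 0:
        v = k & (full - 1)      # extract the current c-bit window
        k >>= c
        if v > half - 1:        # signed conversion (step S25)
            v -= full
            k += 1              # carry into the next window (step S26)
        windows.append(v)
    return windows

# k = 23 = 0b10111, c = 3: raw windows are 7 and 2; 7 > 3 becomes -1 with a
# carry, so the encoding is [-1, 3] and -1 + 3*8 = 23 as required.
w = signed_windows(23, 3)
print(w)                                            # [-1, 3]
print(sum(v << (3 * i) for i, v in enumerate(w)))   # 23: equivalence holds
```

The final check is the equivalence property of step S26: reassembling the signed windows with their positional weights must reproduce the original scalar.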
In elliptic curve arithmetic, point addition and point doubling are the most basic and most costly operations. The signed window method improves computational efficiency by using negative points to reduce the absolute size of window values, thereby making more frequent use of point inverses in the multi-scalar multiplication (MSM) process. By exploiting this property, the number of complex point operations is significantly reduced, the consumption of computing resources is lowered, and computational efficiency is improved, while both efficiency and accuracy are maintained even with a smaller window size.
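How the signed windows shrink the bucket count can be shown with a small stand-in (an assumption-laden illustration, not the patent's implementation: real MSM operates on curve points, whereas plain integers are used here so the bucket logic itself is easy to check; the function name is hypothetical):

```python
# Integer stand-in for the bucket (Pippenger) method with signed windows.
# With signed windows only 2^(c-1) buckets are needed per window position,
# because a negative window value is handled by "subtracting" the point --
# the cheap negation -- instead of occupying its own bucket.
def msm_signed_buckets(scalars, points, c=3):
    def windows(k):  # signed c-bit decomposition, LSB window first
        half, full, out = 1 << (c - 1), 1 << c, []
        while k:
            v = k & (full - 1)
            k >>= c
            if v >= half:        # same test as v > 2^(c-1) - 1
                v -= full
                k += 1           # carry into the next window
            out.append(v)
        return out

    decomp = [windows(k) for k in scalars]
    n_win = max(len(d) for d in decomp)
    total = 0
    for w in range(n_win):
        buckets = [0] * ((1 << (c - 1)) + 1)  # indices 0 .. 2^(c-1)
        for d, p in zip(decomp, points):
            v = d[w] if w < len(d) else 0
            if v > 0:
                buckets[v] += p
            elif v < 0:
                buckets[-v] -= p   # negation stands in for the point inverse
        total += sum(i * b for i, b in enumerate(buckets)) << (c * w)
    return total

print(msm_signed_buckets([23, 5, 12], [100, 7, 3]))  # 2371 = 23*100 + 5*7 + 12*3
```

In the elliptic curve setting, `+= p` and `-= p` become point addition of P and of -P respectively, which is exactly where the cheap negation pays off.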
Second, optimized data transmission mechanism
1. Asynchronous data transmission mechanism
The core value of asynchronous data transmission is that it removes the inherent serialization between data transmission and computation, making parallel execution of the two possible. Specifically, while data is being copied from the CPU to the GPU, the GPU does not stall but continues to process existing data sets and perform computing tasks. This mechanism not only significantly shortens the idle periods of the GPU, but also greatly improves the utilization of hardware resources through the overlap of data transmission and computation.
To achieve this optimization in the data transfer part of zero-knowledge proof generation, in one embodiment a batch transfer strategy based on point sets is proposed. The key to this strategy is fine-grained performance assessment and data management. First, the large data set is divided into several small batches, and the size of each batch is set according to the performance of the GPU, ensuring that the GPU has enough work after receiving data and avoiding waits during the data transmission process. The purpose of this strategy is to keep the GPU active at all times, fully utilizing its computing power and avoiding the performance waste caused by data transmission.
In addition, in the data processing flow, after the computation of each batch is completed, the GPU immediately returns the result to the CPU instead of waiting for all data processing to finish before summarizing. This instant feedback mechanism enables the CPU to perform subtask reduction and integration in a timely manner and to prepare for the next computation in advance, further accelerating the whole calculation flow. Moreover, this strategy also emphasizes efficient memory management, in order to meet the storage requirements of large-scale data sets while reducing storage costs: once the data set, or some of its elements, no longer participates in calculation, the occupied memory space is released immediately, ensuring efficient utilization of resources.
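A minimal host-side sketch of this batching-with-instant-feedback pattern (an illustration under stated assumptions, not the patent's CUDA implementation: `transfer` and `compute` are hypothetical stand-ins for an asynchronous host-to-device copy and a per-batch GPU kernel):

```python
# Batches are submitted so that the "transfer" of one batch can overlap the
# "compute" of others, and every per-batch result is handed back to the CPU
# for reduction as soon as it is ready, rather than after all batches finish.
from concurrent.futures import ThreadPoolExecutor, as_completed

def transfer(batch):
    # Stand-in for an asynchronous host-to-device copy (e.g. on a CUDA stream).
    return list(batch)

def compute(batch):
    # Stand-in for the GPU kernel; a per-batch sum plays the role of one
    # batch's multi-scalar-multiplication partial result.
    return sum(batch)

def pipelined_reduce(points, batch_size):
    batches = [points[i:i + batch_size]
               for i in range(0, len(points), batch_size)]
    total = 0
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(lambda b=b: compute(transfer(b))) for b in batches]
        for done in as_completed(futures):  # instant per-batch feedback
            total += done.result()          # CPU-side reduction/integration
    return total

print(pipelined_reduce(list(range(10)), 4))  # 45
```

The `as_completed` loop is the CPU-side analogue of the "immediate return" described above: reduction starts on the first finished batch while later batches are still in flight.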
In summary, by implementing an asynchronous data transmission policy that combines point-set batching, an instant feedback mechanism, and a dynamic memory management policy, the efficiency of cooperation between the CPU and the GPU can be significantly improved. This overlapping of calculation and data transmission not only reduces the waiting time of the GPU but also optimizes memory use, providing strong technical support for processing large-scale data sets, with particular potential and advantage in computationally intensive tasks such as the generation and verification of zero-knowledge proofs (ZKP). Compared with the original zero-knowledge generation system of fig. 1, fig. 5 shows the CPU advanced by four time blocks and the GPU by two time blocks, and the advantage of the proposed optimization method in improving the working efficiency of the CPU and GPU becomes more pronounced as the data volume increases.
2. Zero copy technique
Zero-copy technology refers to a family of methods and strategies that minimize or completely eliminate data copying operations when data is transferred between different software layers or hardware components. By reducing the number of data copies, the zero-copy technique can significantly improve system performance, reduce latency, and lower CPU and memory bandwidth consumption, thereby improving overall system efficiency. In zero-knowledge proof generation, efficient transfer of data between the CPU and GPU during the computation of multi-scalar multiplication is a key factor in overall system performance. Traditionally, transferring data from CPU memory to GPU device memory requires explicit copy operations, which not only increase the burden on the CPU but also cause significant delay and bandwidth consumption on large data sets. To solve these problems, the invention also introduces a zero-copy technique on top of the multi-batch asynchronous transmission mechanism.
The zero-copy technique used here is an asynchronous memory-mapping method that maps pinned (non-pageable) host memory directly into the GPU address space and transfers it to the GPU implicitly. This technique allows GPU threads to access host memory directly: when a GPU thread reads a variable mapped from the host, it issues a PCIe read transaction and the host returns the data over the PCIe bus. The core principle of the zero-copy technique is to reduce or eliminate the number of copies of data between different memory regions, especially during data transfer between the CPU and the GPU. In a conventional data transfer process, data typically needs to be copied from host memory to a kernel buffer and then from the kernel buffer to the GPU's device memory, which consumes a significant number of CPU cycles and much memory bandwidth. In contrast, the zero-copy technique allows data to be transferred directly from host memory to GPU device memory without intermediate copying steps, thereby significantly reducing CPU intervention and data transfer delay.
In conventional CUDA programming, data must be explicitly copied from host memory to the GPU's device memory using an API such as cudaMemcpy. This not only adds extra delay but also occupies CPU resources. In contrast, as shown in FIG. 6, the present invention uses the CUDA Unified Virtual Addressing (UVA) functionality to achieve a zero-copy-like effect. With UVA enabled, the CPU and GPU share a unified virtual address space. This means that the same virtual address can be used to access data, whether from the CPU or the GPU, without knowing in which physical memory region (CPU memory or GPU memory) the data is actually stored. UVA allows the GPU to access host memory directly without explicitly copying data into GPU device memory, which simplifies data management and speeds up data transmission.
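The CUDA calls themselves (for instance pinned memory mapped into the device's address space) cannot be exercised outside a GPU environment, so the zero-copy idea is illustrated here in miniature with a Python `memoryview` (purely a conceptual sketch, not CUDA code): two consumers share one buffer through a view instead of each holding a copy.

```python
# Zero-copy in miniature: a view over a shared buffer vs. an explicit copy.
# In CUDA the analogue of the view is pinned host memory mapped into the
# GPU's unified virtual address space, which the GPU reads over PCIe.
host_buffer = bytearray(b"scalars....")   # stands in for pinned host memory

copied = bytes(host_buffer)               # traditional path: an extra copy
view = memoryview(host_buffer)            # zero-copy path: a shared view

host_buffer[0:7] = b"updated"             # the "CPU" writes into the buffer

print(bytes(copied[:7]))  # b'scalars' -- the copy is stale
print(bytes(view[:7]))    # b'updated' -- the view sees the write, no copy made
```

The stale copy is precisely the cost the patent's scheme avoids: with a shared mapping, the consumer always observes the producer's latest data without an intermediate transfer.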
In conclusion, the present invention significantly improves the speed and efficiency of ZKP generation through the synergy of several techniques, optimizes resource use, and provides strong support for zero-knowledge proof technology in high-performance computing environments. The advantages of the present invention over the prior art are mainly the following:
1) The invention introduces an innovative preprocessing mechanism that uses the CPU to execute key calculation tasks in advance, building an efficient computing framework and greatly shortening the time for generating a ZKP, which is especially prominent when processing large-scale data sets. For example, the design of a shared pre-computation table realizes effective utilization of computing resources: by caching the intermediate data of complex operations, repeated calculation is avoided, and calculation time and energy consumption are reduced when processing a large-scale data set or a high-complexity algorithm, with a particularly notable effect in multi-scalar multiplication. For another example, exploiting the low cost of negation on elliptic curves, a signed sliding-window technique is designed for multi-scalar multiplication; by sign-extending the operands, the number of point multiplications in actual operation is reduced, lowering the complexity of large-number multiplication in elliptic curve cryptography and directly accelerating the generation flow of the zero-knowledge proof.
2) The invention provides a set of optimizations for the proof generation speed of a zero-knowledge proof system that does not require reconstructing the original proof generation logic. It optimizes communication between heterogeneous computing platforms and, by refining the data transmission mechanism between the CPU and the GPU, effectively reduces generation delay and greatly improves system performance. For example, the transmission mode between the systems is improved: the zero-copy transmission technique yields a faster, more resource-efficient data exchange during processing and transmission, greatly reducing the processing load on the CPU, the consumption of memory bandwidth, and the delay of data transmission, thereby accelerating data processing and markedly improving overall system performance. For another example, by improving how data is transmitted between heterogeneous architectures and implementing a batch transmission strategy, progressive delivery of the point set and the scalar set to the GPU is realized; idle waiting periods of the GPU are avoided and continuous, maximized utilization of GPU resources is ensured, significantly improving resource utilization and overall efficiency in zero-knowledge proof generation, especially when processing large-scale data sets.
The staged and decoupled data transfer strategy enables the CPU and the GPU to work cooperatively instead of waiting alternately, so as to ensure continuous utilization of GPU resources.
3) The invention has been subjected to multiple simulation tests, and experiments prove that the invention can meet the expected design index. For example, by taking advantage of the parallel computing capabilities of the CPU, a series of critical computing tasks are performed in advance, and a pre-computation table is built that enables direct reference to these pre-computed results when the zero knowledge proof is actually generated, avoiding duplicate computations, greatly compressing the time required to generate the proof.
The present invention may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (e.g., a light pulse through a fiber optic cable), or an electrical signal transmitted through a wire.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, C++, or Python, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the remote-computer case, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), with state information of computer readable program instructions, the electronic circuitry executing the computer readable program instructions.
Various aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, implementation by software, and implementation by a combination of software and hardware are all equivalent.
The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.
Claims (10)
1. A CPU-GPU data processing and transmission optimizing system oriented to zero knowledge proof is used for executing multi-scalar multiplication operation based on elliptic curve cryptography to achieve zero knowledge proof generation, the system comprises a CPU and a GPU, wherein a data transmission channel is arranged between the CPU and the GPU, the CPU is used for acquiring and storing a point set and a scalar set related to the multi-scalar multiplication operation, the CPU is communicated with the GPU by utilizing a set parallelization data transmission mechanism, and the GPU is used for executing multi-scalar multiplication calculation based on the point set and the scalar set and transmitting calculation results back to the CPU.
2. The system of claim 1, wherein the GPU performs the multi-scalar multiplication operation using a shared pre-computation table that stores a series of multiples of a base point G and stores a scalar relationship between the points in each point set and the base point G, the base point G being a generator of all points in the point set.
3. The system of claim 1, wherein the GPU implements a multi-scalar multiplication operation based on a bucket algorithm in which scalar k is converted into a signed sliding window by:
Determining a window size c, the window size c defining how many bits are within each window;
Representing scalar k as a binary form, forming a binary string;
Dividing the binary string into a plurality of windows according to the length of c bits to obtain a plurality of window values;
Converting each window value into a decimal number to obtain a decimal window value;
For each decimal window value v, if v > 2^(c-1) - 1, subtracting 2^c from v to make it a signed number;
if the value of the current window changes from positive to negative, then the value of the next window is increased by 1 to maintain the equivalence of scalar multiplication.
4. The system of claim 2, wherein the shared pre-calculation table is configured to be updated periodically according to the steps of:
When calculation starts, reserving a set number of storage spaces in the shared pre-calculation table for storing a point set obtained by pre-calculation;
Searching whether a corresponding calculated result exists in the shared pre-calculation table or not when the point set to be solved is calculated in the running process, acquiring the result from the table if the corresponding calculated result exists, re-calculating the point set if the corresponding calculated result does not exist, and returning the calculation result to the GPU;
when a new point set result is generated, if the shared pre-calculation table still has enough residual space, the newly generated point set result is inserted into the table, and when the shared pre-calculation table has insufficient space, the storage content in the table is updated by using a set updating strategy.
5. The system of claim 4, wherein the update policy employs a least-recently-used policy to preferentially evict the items that have not been used for the longest time.
6. The system according to claim 1, wherein the parallelized data transmission mechanism is a point set based batch transmission strategy comprising:
For the point set stored in the CPU, dividing the point set into a plurality of batches according to the performance of the GPU, and transmitting the batches to the GPU;
after the GPU obtains the calculation result of each batch, the calculation result of each batch is immediately transmitted back to the CPU, and the memory space occupied by the elements which do not participate in calculation is released.
7. The system of claim 1, wherein the CPU and the GPU share a unified virtual address space during execution of the multi-scalar multiplication operations to enable transfer of data between the CPU and the GPU via a zero-copy technique.
8. The system of claim 1, wherein in the shared pre-calculation table, a hash table is employed to store scalar values of base points G, and for points in the set of points, only the x-coordinates of the points are stored.
9. A CPU-GPU data processing and transmission optimizing method oriented to zero knowledge proof comprises the following steps:
Storing, in a CPU, a point set and a scalar set involved in a multi-scalar multiplication operation, the multi-scalar multiplication operation being a multi-scalar multiplication operation based on elliptic curve cryptography for enabling generation of a zero knowledge proof;
And transferring data between the CPU and the GPU by using a set parallelization data transmission mechanism to control the GPU to execute calculation of multi-scalar multiplication based on the point set and the scalar set, and transmitting calculation results back to the CPU.
10. A computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor realizes the steps of the method according to claim 9.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411314060.0A CN119402200B (en) | 2024-09-20 | 2024-09-20 | CPU-GPU Data Processing and Transmission Optimization System and Method Based on Zero-Knowledge Proof |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN119402200A CN119402200A (en) | 2025-02-07 |
| CN119402200B true CN119402200B (en) | 2025-12-16 |
Family
ID=94421710
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202411314060.0A Active CN119402200B (en) | 2024-09-20 | 2024-09-20 | CPU-GPU Data Processing and Transmission Optimization System and Method Based on Zero-Knowledge Proof |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN119402200B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119743270B (en) * | 2025-03-05 | 2025-07-15 | 北京网藤科技有限公司 | Industrial Internet of things security authentication method and system based on zero knowledge proof |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114880109A (en) * | 2021-12-15 | 2022-08-09 | 中国科学院深圳先进技术研究院 | Data processing method and device based on CPU-GPU heterogeneous architecture and storage medium |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| AUPO799197A0 (en) * | 1997-07-15 | 1997-08-07 | Silverbrook Research Pty Ltd | Image processing method and apparatus (ART01) |
- 2024-09-20: CN application CN202411314060.0A granted as patent CN119402200B (status: Active)
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114880109A (en) * | 2021-12-15 | 2022-08-09 | 中国科学院深圳先进技术研究院 | Data processing method and device based on CPU-GPU heterogeneous architecture and storage medium |
Also Published As
| Publication number | Publication date |
|---|---|
| CN119402200A (en) | 2025-02-07 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN104461466B (en) | The method for improving calculating speed based on MPI and OpenMP Hybrid paradigm parallel computations | |
| Gmeiner et al. | Parallel multigrid on hierarchical hybrid grids: a performance study on current high performance computing clusters | |
| Dong et al. | EC-ECC: Accelerating elliptic curve cryptography for edge computing on embedded GPU TX2 | |
| CN119402200B (en) | CPU-GPU Data Processing and Transmission Optimization System and Method Based on Zero-Knowledge Proof | |
| US20230062352A1 (en) | Efficient transforms and transposes for rate-distortion optimization and reconstruction in video encoders | |
| US12113896B1 (en) | End-to-end hardware acceleration for ZKP from witness generation to proof generation | |
| Xue et al. | CPU–GPU heterogeneous code acceleration of a finite volume Computational Fluid Dynamics solver | |
| Wan et al. | GPU implementation of a parallel two‐list algorithm for the subset‐sum problem | |
| Chen et al. | Load-balanced parallel implementation on GPUs for multi-scalar multiplication algorithm | |
| CN115801244B (en) | Post-quantum cryptographic algorithm implementation method and system for resource-constrained processors | |
| CN120597303B (en) | CSD array-based privacy information retrieval method and system | |
| US11921786B2 (en) | Scalable bandwidth efficient graph processing on field programmable gate arrays | |
| Bie et al. | An energy-efficient reconfigurable asymmetric modular cryptographic operation unit for RSA and ECC | |
| Yang et al. | Falic: An fpga-based multi-scalar multiplication accelerator for zero-knowledge proof | |
| Song et al. | Design and implementation of a parallel Stein algorithm based on a ternary optical computer | |
| CN102289364A (en) | Implementing parallel loops with serial semantics | |
| US20230244445A1 (en) | Techniques and devices for efficient montgomery multiplication with reduced dependencies | |
| CN117313811A (en) | Neural network accelerator device for image processing and three-dimensional reconstruction method | |
| Baboulin et al. | Solving dense symmetric indefinite systems using GPUs | |
| Sojoodi et al. | Collaborative Bandwidth-Efficient Intra-Node Allreduce | |
| Wong et al. | GPUs, CPUs, and... NICs: Rethinking the Network's Role in Serving Complex AI Pipelines | |
| CN116132049A (en) | Data encryption method, device, equipment and storage medium | |
| Jiang et al. | SimdMSM: SIMD-accelerated Multi-Scalar Multiplication Framework for zkSNARKs | |
| Ma et al. | Fast implementation for modular inversion and scalar multiplication in the elliptic curve cryptography | |
| US11954487B2 (en) | Techniques, devices, and instruction set architecture for efficient modular division and inversion |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | | |
| SE01 | Entry into force of request for substantive examination | | |
| GR01 | Patent grant | | |