Disclosure of Invention
Technical problem to be solved
In order to reduce the time consumed by three-dimensional model slicing and improve slicing efficiency, the invention provides a multi-level parallel slicing method for processing a 3D printing model. A four-level parallelized slicing scheme is designed that exploits the parallelization potential of the slicing problem: the time consumed by slicing the three-dimensional model is reduced through parallel processing on all execution units, without lowering the original slicing precision.
Technical scheme
A multi-level parallel slicing method of a 3D printing model is characterized by comprising the following steps:
step 1: executing computing-node-level slice parallelization: performing task division among the computing nodes of a cluster with the patch as the basic unit, and dividing the original model file according to the division result to obtain a plurality of sub-model files;
step 2: executing multi-GPU-level slice parallelization: performing task division among the plurality of GPUs within each computing node with the patch as the basic unit, reading the corresponding sub-model file into memory in multiple passes according to the division result, constructing patch and vertex arrays, and importing the patch and vertex arrays required by the slicing from the computing node's main memory into GPU memory;
step 3: executing thread-block-level slice parallelization: dividing the patch array imported into GPU memory into a plurality of sub-patch arrays among the thread blocks;
step 4: executing thread-level slice parallelization: dividing the sub-patch array belonging to a thread block among its threads; each thread processes its allocated patches in turn, determines all cutting planes that intersect each patch, computes the tangent segments, and mutually exclusively inserts the tangent segments into the tangent segment array of the corresponding layer;
step 5: exporting the tangent segment arrays from each GPU memory to main memory, merging the computation results of the GPUs on the CPU side, and generating, for every layer, the sub-tangent-segment array corresponding to the current computing node;
step 6: performing task division among the computing nodes by layer, and collecting and merging the sub-tangent-segment arrays by layer according to the division result, so that each allocated layer obtains its complete tangent segment array;
step 7: each computing node processes, in turn, the tangent segment array of every layer residing on the current node and executes the tangent segment connection in parallel to generate the in-layer slice contour lines.
The task division among the computing nodes in step 1 means that, with the patch as the basic unit, the original model file is divided in a contiguous equal-division manner, and the remainder of the equal division is allocated to the last computing node; the original model file is split according to the division result into a plurality of sub-model files, with one sub-model file allocated to each computing node.
The task division among the GPUs in step 2 means that, with the patch as the basic unit, the sub-model file is divided in a contiguous equal-division manner, and the remainder of the equal division is allocated to the last GPU.
Reading in the sub-model file and constructing the arrays in step 2 means that the sub-model file allocated to a computing node is read in several passes according to the task division result, the number of passes being equal to the number of GPUs contained in the computing node. After reading, the main memory of the computing node contains as many groups of model data as there are GPUs, each group containing a patch array and a vertex array.
The division of patches among the thread blocks in step 3 means that the patches are first allocated in a contiguous equal-division manner, and the remainders of the equal division are then allocated to the thread blocks one at a time in sequence, so that the task amounts are balanced and each thread block receives a contiguous range of patches.
The division of the sub-patch array among the threads in step 4 adopts a round-robin manner, assigning patches to the threads one at a time in sequence, so that the task amounts are balanced and the memory cells read by adjacent threads are contiguous, which facilitates coalesced reads of GPU memory and improves memory access bandwidth.
The mutually exclusive insertion of tangent segments into an array in step 4 means that, when the threads participating in the parallel processing handle their respective triangular patches, the computed tangent segments may belong to the same layer, so the data must be inserted into the same array in a mutually exclusive manner. The mutually exclusive insertion works as follows: an array index is maintained that points to the next free position in the array; when a thread is about to insert data, it performs an atomic addition on the array index and takes the returned value, i.e. the index value before the increment, as the insertion position.
The merging of the computation results of the GPUs in step 5 means merging the tangent segment data by layer on the CPU side, that is, merging the tangent segments belonging to the same layer into the same array.
The collection and merging of tangent segment data among the computing nodes in step 6 specifically comprises the following steps:
1) with the layer as the basic unit, allocating the merging tasks among the computing nodes in a contiguous equal-division manner, and allocating the remainder of the equal division to the last computing node;
2) each computing node performs data merging for its allocated layers, collecting the tangent segment data of the same layer from all computing nodes and merging them into a single array; after the merging of the tangent segment data is completed, each computing node discards the data of the layers that do not belong to it, releasing the memory space.
Advantageous effects
The invention provides a multi-level parallel slicing method for a 3D printing model, which accelerates three-dimensional model slicing through multi-level parallel processing, specifically four levels of slicing parallelization: the computing node level, the multi-GPU level, the thread block level and the thread level. At each level, a corresponding task division and data interaction scheme is designed according to the address space distribution, access pattern and data structure characteristics of that level, so that the load on the parallel execution units is balanced and the amount of data communication is reduced. At the thread level, the data race during parallel insertion is resolved by an atomic addition on the array index. The method can effectively reduce the time consumed by three-dimensional model slicing without lowering the original slicing precision. In addition, having each computing node read its own sub-model file in parallel reduces hard disk I/O time, and since each computing node only processes its sub-model data, the memory footprint is reduced.
Detailed Description
The invention will now be further described with reference to the following examples and drawings:
referring to fig. 1-2, the invention improves the slicing speed of the three-dimensional model by a multi-level parallel method comprising four levels of parallelization, which are, from top to bottom, the computing node level, the multi-GPU level, the thread block level and the thread level.
The method comprises the following steps:
Step one: executing computing-node-level slice parallelization, performing task division among the computing nodes of the cluster with the patch as the basic unit, dividing the original model file according to the division result, executing step two, and, after step two is finished, collecting and merging the tangent segment data among the computing nodes.
The task division among the computing nodes means that, with the patch as the basic unit, the original model file is divided in a contiguous equal-division manner, and the remainder of the equal division is allocated to the last computing node. The original model file is split according to the division result into a plurality of sub-model files, with one sub-model file allocated to each computing node.
Referring to fig. 3, if the total number of patches contained in the STL model file is N, the number of computing nodes in the cluster is n1, the number of patches allocated to each computing node by the equal division is count_uniform1, and the number of patches allocated to the last computing node is last_count1, then:
count_uniform1 = ⌊N / n1⌋
last_count1 = count_uniform1 + N % n1
wherein ⌊ ⌋ indicates rounding down and % indicates the remainder. According to the patch counts calculated above, the original STL model file is divided into n1 sub-model files.
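As an illustration only, the division rule above can be expressed as the following C++ sketch; the function and struct names are not part of the invention.

struct Range { long start; long count; };        // contiguous patch range [start, start + count)

// Contiguous equal division of N patches over n1 computing nodes;
// the remainder of the equal division goes to the last node (rank n1 - 1).
Range node_patch_range(long N, int n1, int rank) {
    long uniform = N / n1;                       // integer division rounds down
    Range r;
    r.start = (long)rank * uniform;
    r.count = (rank == n1 - 1) ? uniform + N % n1 : uniform;
    return r;
}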
The benefits of segmenting the original model file are:
1) the amount of data each computing node reads from the hard disk into memory is reduced, which in turn reduces I/O time;
2) the memory usage of each computing node is reduced, so that a larger model can be sliced for a fixed total amount of memory per computing node;
3) the computing nodes execute the slice computation in parallel, which effectively reduces the computation time.
The collection and merging of tangent segment data among the computing nodes specifically comprises the following steps:
1) with the layer as the basic unit, the merging tasks are allocated among the computing nodes in a contiguous equal-division manner, and the remainder of the equal division is allocated to the last computing node.
If the total number of slicing layers is denoted layer_count, the number of layers allocated to each computing node by the equal division is uniform_layer_count, and the number of layers allocated to the last computing node, which also receives the remainder of the equal division, is last_layer_count, then:
uniform_layer_count = ⌊layer_count / n1⌋
last_layer_count = uniform_layer_count + layer_count % n1
2) Each computing node performs data merging for its allocated layers, collecting the tangent segment data of the same layer from all computing nodes and merging them into a single array. In one embodiment, the merge operation on the tangent segment data is performed with the MPI_Gather() interface of MPI, which is called once per layer to be merged. After the merging of the tangent segment data is completed, each computing node discards the data of the layers that do not belong to it, releasing the memory space.
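As a hedged illustration of this per-layer collection: the embodiment calls MPI_Gather() once per layer, and because the number of tangent segments generally differs between nodes, a practical realization gathers the per-node byte counts first and then the segment data with MPI_Gatherv; the Segment layout and the function name below are assumptions, not part of the invention.

#include <mpi.h>
#include <vector>

struct Segment { float x1, y1, x2, y2; };        // illustrative layout of one in-layer tangent segment

// Collect the tangent segments of one layer from every computing node onto the
// node that owns the layer (rank == owner); returns the merged array on the
// owner and an empty vector elsewhere.
std::vector<Segment> gather_layer(const std::vector<Segment>& local, int owner, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int sendBytes = (int)(local.size() * sizeof(Segment));
    std::vector<int> counts(size), displs(size);
    MPI_Gather(&sendBytes, 1, MPI_INT, counts.data(), 1, MPI_INT, owner, comm);

    std::vector<Segment> merged;
    if (rank == owner) {
        int total = 0;
        for (int i = 0; i < size; ++i) { displs[i] = total; total += counts[i]; }
        merged.resize(total / sizeof(Segment));
    }
    MPI_Gatherv(local.data(), sendBytes, MPI_BYTE,
                merged.data(), counts.data(), displs.data(), MPI_BYTE, owner, comm);
    return merged;
}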
Step two: executing multi-GPU-level slice parallelization, performing task division among the plurality of GPUs within the computing node with the patch as the basic unit, reading the sub-model file into memory in multiple passes according to the division result, constructing the patch and vertex arrays, importing the data required by the slicing from the computing node's main memory into GPU memory, executing step three, and, after step three finishes, exporting the tangent segment data from each GPU memory to main memory and merging the computation results of the GPUs on the CPU side.
The task division among the GPUs means that, with the patch as the basic unit, the sub-model file is divided in a contiguous equal-division manner, and the remainder of the equal division is allocated to the last GPU.
Referring to fig. 4, if the number of GPUs in each computing node is n2, the number of patches contained in the sub-model file allocated to the current computing node is count1, the number of patches allocated to each GPU by the equal division is uniform_count2, and the number of patches allocated to the last GPU is last_count2, then:
uniform_count2 = ⌊count1 / n2⌋
last_count2 = uniform_count2 + count1 % n2
Reading in the sub-model file and constructing the arrays in step two means that the sub-model file allocated to the computing node is read in several passes according to the task division result, the number of passes being equal to the number of GPUs contained in the computing node. After reading, the main memory of the computing node contains n2 groups of data, each group containing a patch array Face[] and a vertex array Vertex[]. In the patch array, each element records the indices of the three vertices of one patch, an index being the subscript of the vertex in the Vertex[] array. In the vertex array, each element records the x, y, z coordinates of one vertex.
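A possible in-memory layout of the two arrays is sketched below, together with an illustrative helper that computes the z-extent of a patch, which determines the cutting planes that intersect it; the struct and field names are assumptions.

#include <algorithm>
#include <vector>

struct Vertex { float x, y, z; };                // one element of Vertex[]: the vertex coordinates
struct Face   { int v0, v1, v2; };               // one element of Face[]: subscripts into Vertex[]

// z-extent of patch i; a cutting plane z = h intersects the patch when zmin <= h <= zmax.
inline void patch_z_range(const std::vector<Face>& face, const std::vector<Vertex>& vertex,
                          int i, float& zmin, float& zmax) {
    const Vertex& a = vertex[face[i].v0];
    const Vertex& b = vertex[face[i].v1];
    const Vertex& c = vertex[face[i].v2];
    zmin = std::min(a.z, std::min(b.z, c.z));
    zmax = std::max(a.z, std::max(b.z, c.z));
}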
Furthermore, the merging of the computation results of the GPUs means merging the tangent segment data by layer on the CPU side, that is, merging the tangent segments belonging to the same layer into the same array.
In one embodiment, OpenMP is used to control the multiple GPUs in a computing node: the number of OpenMP threads created equals the number of GPUs in the computing node, and each OpenMP thread takes over one GPU computing card.
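A minimal sketch of this control pattern, assuming CUDA, is given below; the per-GPU work is only indicated by comments, and the function name is illustrative.

#include <omp.h>
#include <cuda_runtime.h>

void run_slicing_on_all_gpus() {
    int gpu_count = 0;
    cudaGetDeviceCount(&gpu_count);
    #pragma omp parallel num_threads(gpu_count)  // one OpenMP thread per GPU
    {
        int g = omp_get_thread_num();
        cudaSetDevice(g);                        // this thread takes over GPU g
        // ... copy the g-th group of patch and vertex data to GPU g, launch the
        //     slicing kernel, and copy the tangent segment arrays back (omitted) ...
    }
}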
Step three: executing thread-block-level slice parallelization, dividing the patch array imported into GPU memory among the thread blocks, and executing step four.
The division of patches among the thread blocks means that the patches are first allocated in a contiguous equal-division manner, and the remainders of the equal division are then allocated to the thread blocks one at a time in sequence, so that the task amounts are balanced and each thread block receives a contiguous range of patches.
Referring to fig. 5, if the number of thread blocks is n3, the number of patches in the Face[] array in GPU memory is count2, the remainder of the equal division is remain_count, the number of patches allocated to a thread block is face_count, and each thread block has a contiguous, unique number id, then:
remain_count = count2 % n3
face_count = ⌊count2 / n3⌋ + 1, if id < remain_count
face_count = ⌊count2 / n3⌋,     if id ≥ remain_count
in one embodiment, the thread blocks are organized into one dimension, and the number of thread blocks is reasonably set according to the number of streaming multiprocessors contained in the GPU.
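As an illustration, the block-level division rule above can be evaluated on the device as follows; the function name is illustrative, and count2 is passed in by the caller.

__device__ void block_patch_range(int count2, int* first, int* face_count) {
    int n3 = gridDim.x;                          // number of thread blocks
    int id = blockIdx.x;                         // contiguous, unique block number
    int uniform = count2 / n3;                   // integer division rounds down
    int remain_count = count2 % n3;
    *face_count = uniform + (id < remain_count ? 1 : 0);   // first remain_count blocks take one extra patch
    *first = id * uniform + min(id, remain_count);         // start of this block's contiguous sub-array
}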
Step four: executing thread-level slice parallelization, dividing the sub-patch array belonging to a thread block among its threads; each thread processes its allocated patches in turn, determines all cutting planes that intersect each patch, computes the tangent segments, and mutually exclusively inserts them into the tangent segment array of the corresponding layer.
Referring to fig. 6, with the number of threads being n4, the sub-patch array is divided among the threads in a round-robin manner, one patch at a time in sequence, so that the task amounts are balanced and the memory cells read by adjacent threads are contiguous, which facilitates coalesced reads of GPU memory and improves memory access bandwidth.
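A minimal device-side sketch of this round-robin assignment is shown below; the per-patch intersection work is omitted and only indicated by comments, and the kernel name is illustrative.

__global__ void slice_sub_patch_array(int face_count /* patches in this block's sub-array */) {
    for (int i = threadIdx.x; i < face_count; i += blockDim.x) {
        // Thread t handles patches t, t + blockDim.x, t + 2*blockDim.x, ...
        // Adjacent threads therefore read adjacent memory cells, and the loads coalesce.
        // ... intersect patch i with all cutting planes crossing its z-extent,
        //     compute the tangent segments, and insert them as described below ...
    }
}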
The mutually exclusive insertion of tangent segments into an array means that, when the threads participating in the parallel processing handle their respective triangular patches, the computed tangent segments may belong to the same layer, so the data must be inserted into the same array in a mutually exclusive manner. The mutually exclusive insertion works as follows: an array index is maintained that points to the next free position in the array; when a thread is about to insert data, it performs an atomic addition on the array index and takes the returned value, i.e. the index value before the increment, as the insertion position.
In one embodiment, the threads are organized in one dimension, the number of threads contained in a thread block is usually an integer multiple of 32, and the number of threads is chosen according to the hardware resources on the GPU, such as the number of registers.
Referring to fig. 7, in an embodiment, an index e records the current position to be inserted in the arr array. Thread x and thread y each compute the tangent segments s1 and s2 of the same cutting plane with their respective triangular patches and want to insert into the arr array at the same time; thread x executes atomicAdd(e) and Insert(s1), and thread y executes atomicAdd(e) and Insert(s2), so that correct, mutually exclusive insertion into the same array is achieved.
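A minimal CUDA sketch of this mutually exclusive insertion is given below; the segment layout is an assumption, while arr and e correspond to the array and index of fig. 7.

struct Seg { float x1, y1, x2, y2; };            // illustrative in-layer tangent segment

__device__ void insert_segment(Seg* arr, int* e, Seg s) {
    int pos = atomicAdd(e, 1);                   // reserve the next free position; returns the old index value
    arr[pos] = s;                                // no other thread can be assigned the same slot
}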
Step five: each computing node executes the tangent segment connection in parallel for its allocated layers to generate the in-layer slice contour lines.
The layers allocated to each computing node are those allocated when the merged tangent segment data was collected among the computing nodes in step one. During the data collection and merging, the tangent segment data is divided evenly by layer among the computing nodes of the cluster. After the division, the data of each layer is independent, so each computing node can execute the segment connection in parallel and generate the in-layer slice contour lines, completing the slicing of the model.
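Because the layers are independent, the per-node contour connection can be driven by a simple parallel loop; the sketch below assumes OpenMP, and connect_layer() is a hypothetical placeholder for the segment-linking routine, which is not detailed in this description.

#include <omp.h>
#include <vector>

struct Seg { float x1, y1, x2, y2; };
void connect_layer(std::vector<Seg>& layer);     // hypothetical: link tangent segments into closed contours

void connect_all_layers(std::vector<std::vector<Seg>>& layers_on_this_node) {
    #pragma omp parallel for
    for (long i = 0; i < (long)layers_on_this_node.size(); ++i)
        connect_layer(layers_on_this_node[i]);   // each layer is processed independently
}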
Referring to fig. 8, the slicing times of the original scheme and the present scheme are compared, where the original scheme is a serial slicing scheme and the present scheme is the multi-level parallel slicing scheme described above. The hardware used for the test is a GPU cluster of two computing nodes; the CPU model on the computing nodes is Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz, and each computing node is equipped with three GPU computing cards of model Tesla V100-PCIE-32GB. The test results show that the multi-level parallel slicing scheme effectively reduces slicing time, and the larger the model, the more obvious the acceleration effect.