[go: up one dir, main page]

TWI867624B - Method for improving LDPC decoder parallelism by dividing LDPC code - Google Patents

Method for improving LDPC decoder parallelism by dividing LDPC code Download PDF

Info

Publication number
TWI867624B
TWI867624B TW112126729A TW112126729A TWI867624B TW I867624 B TWI867624 B TW I867624B TW 112126729 A TW112126729 A TW 112126729A TW 112126729 A TW112126729 A TW 112126729A TW I867624 B TWI867624 B TW I867624B
Authority
TW
Taiwan
Prior art keywords
column
matrix
row
segment
swap
Prior art date
Application number
TW112126729A
Other languages
Chinese (zh)
Other versions
TW202505878A (en
Inventor
魏大鈞
曾戈忠
Original Assignee
睿寬智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 睿寬智能科技有限公司 filed Critical 睿寬智能科技有限公司
Priority to TW112126729A priority Critical patent/TWI867624B/en
Application granted granted Critical
Publication of TWI867624B publication Critical patent/TWI867624B/en
Publication of TW202505878A publication Critical patent/TW202505878A/en

Links

Images

Landscapes

  • Error Detection And Correction (AREA)
  • Complex Calculations (AREA)

Abstract

To use an LDPC layer decoder, an LDPC H matrix is converted into instructions readable by the decoder, and an order of the circulants of the matrix is processed to enhance decoding performance. To this end, at first, an Hab generator is used to preprocess the matrix to achieve high throughput. Secondly, a circulant scheduler is used to optimize circulant order to implement parallelization and further enhance hardware performance.

Description

分割LDPC碼而改善LDPC解碼器並行的方法 Method for improving LDPC decoder parallelism by splitting LDPC codes

本發明關於LDPC(low density parity check)最小和(min-sum)解碼,尤其關於分割LDPC碼而改善LDPC解碼器並行的方法。 The present invention relates to LDPC (low density parity check) min-sum decoding, and more particularly to a method for improving the parallelism of LDPC decoders by segmenting LDPC codes.

通常,最小和解碼器處理矩陣時,用以列為基礎的條目(row-based entries)。舉例而言,以列接列(row-by-row)的方式處理第1圖所示的LDPC最小和解碼的矩陣。在硬體裡的資料流也依循此以行為基礎的法則,要讀的一列資料與要寫的前一列資料用相同記憶體空間。 Usually, when processing matrices, the minimum sum decoder uses row-based entries. For example, the LDPC minimum sum decoding matrix shown in Figure 1 is processed row-by-row. The data flow in the hardware also follows this row-based rule, and a row of data to be read uses the same memory space as the previous row of data to be written.

把LDPC的矩陣轉換成硬體指令的典型方式是依矩陣順序。然而,在相鄰幾列間,常發生先讀後寫(read before write)衝突。為解決衝突,硬體須暫停一個元素的讀,直到在同欄的前一個元素被處理且被寫回記憶體。這種解決衝突的行為造成許多延宕。 The typical way to convert LDPC matrices into hardware instructions is in matrix order. However, read before write conflicts often occur between adjacent columns. To resolve conflicts, the hardware must pause the reading of an element until the previous element in the same column is processed and written back to memory. This conflict resolution behavior causes a lot of delays.

有鑑於習知技藝之問題,本發明之目的是提供一種高效的分割LDPC碼而改善LDPC解碼器並行的方法。 In view of the problems in the prior art, the purpose of the present invention is to provide a method for efficiently segmenting LDPC codes to improve the parallelism of LDPC decoders.

為達成目的,在此分割LDPC碼而改善LDPC解碼器並行的方法中,先把整個矩陣均分為數段。依序把矩陣中每一欄隨機分配到某段,分配時將當下每段之需求作為加權。分配完所有欄位後計算加權對調分數,若此加權對調分數小於稍早被貯存的加權對調分數,則更新至此加權對調分數並記錄對調分配方式。判斷矩陣的對調分配次數是否達最大遞迴數。若矩陣的對調分配次數未達最大遞迴數,則回上述把整個矩陣均分為數段的步驟。若矩陣的對調分配次數已達最大遞迴數,則依上述記錄之最小加權對調分數的對調分配方式產生重組矩陣及重組向量。 To achieve the purpose, in this method of improving the parallelism of LDPC decoders by segmenting LDPC codes, the entire matrix is first divided into several segments. Each column in the matrix is randomly assigned to a segment in sequence, and the current demand of each segment is used as a weight during the assignment. After all columns are assigned, the weighted swap score is calculated. If this weighted swap score is less than the weighted swap score stored earlier, it is updated to this weighted swap score and the swap assignment method is recorded. It is determined whether the number of swap assignments of the matrix reaches the maximum recurrence number. If the number of swap assignments of the matrix does not reach the maximum recurrence number, the above step of dividing the entire matrix into several segments is returned. If the number of matrix swaps has reached the maximum recurrence number, the reorganized matrix and reorganized vector are generated according to the swap distribution method with the minimum weighted swap score recorded above.

S10:對調矩陣的列以產生不同的列順序 S10: Swap the rows of the matrix to produce a different row order

S11:計算任意兩列的衝突數 S11: Calculate the number of conflicts between any two columns

S12:依序選一列當一個列順序之首 S12: Select a row in order as the first row in the sequence

S13:選衝突最少且未用過的列當下一列 S13: Select the row with the least conflicts and has not been used as the next row

S14:完成一個列順序? S14: Complete a row sequence?

S15:更新有最小總衝突數的列順序 S15: Update the column order with the minimum total number of conflicts

S16:已探索以同一列為首的所有可能列順序? S16: Have all possible column sequences starting with the same column been explored?

S17:再從被記錄的當今列開始,選另一候選者當下一列 S17: Start from the current row and select another candidate to be the next row

S18:每一列都當過一個列順序之首? S18: Has every row been the first in the column sequence?

S20:依次選一個列順序 S20: Select a column order in sequence

S30:對調欄位並分割矩陣 S30: Swap columns and split the matrix

S31:依序處理原始矩陣每一欄 S31: Process each column of the original matrix in sequence

S32:計算加權對調分數 S32: Calculate weighted swap scores

S33:矩陣的對調分配次數達最大遞迴數? S33: Has the number of matrix swaps reached the maximum number of recursions?

S34:產生重組矩陣及重組向量 S34: Generate reorganization matrix and reorganization vector

S40:調度在每一個核心裡的元素 S40: Scheduling elements in each core

S41:計算衝突標記及風險分數 S41: Calculate conflict markers and risk scores

S42:依風險分數,從高到低排序元素 S42: Sort elements by risk score, from high to low

S43:從記憶體讀取非零元素 S43: Read non-zero elements from memory

S44:把處理完的非零元素資料寫入記憶體 S44: Write the processed non-zero element data into memory

S45:完成整個矩陣? S45: Complete the entire matrix?

S50:產生多行CMEM指令 S50: Generate multiple lines of CMEM instructions

S55:找最少的CMEM指令的行數 S55: Find the minimum number of CMEM instruction lines

S60:達重複次數上限? S60: Reached the upper limit of repetition times?

S70:已處理所有列順序? S70: Have all row sequences been processed?

S75:依CMEM指令產生調度檔案並加CRC32校驗位元 S75: Generate a scheduling file according to the CMEM instruction and add CRC32 check bits

S80:用檢查器及C model測試調度檔案 S80: Use the checker and C model to test the scheduling file

〔圖1〕呈現一個LDPC矩陣; [Figure 1] shows an LDPC matrix;

〔圖2〕一個LDPC最小和解碼系統的方塊圖; [Figure 2] Block diagram of an LDPC minimum sum decoding system;

〔圖3〕本發明的重排LDPC解碼矩陣而縮短LDPC最小和解碼延遲的方法的流程圖; [Figure 3] Flowchart of the method of the present invention for rearranging the LDPC decoding matrix to shorten the LDPC minimum sum decoding delay;

〔圖4〕係圖3的方法的對調列的步驟的流程圖; [Figure 4] is a flow chart of the steps of the method of Figure 3 in reverse order;

〔圖5〕呈現一種分割第1圖的矩陣的欄的方式; [Figure 5] shows a way to partition the columns of the matrix in Figure 1;

〔圖6〕係圖3的方法的對調欄的步驟的流程圖; [Figure 6] is a flow chart of the steps of swapping columns in the method of Figure 3;

〔圖7〕呈現分配欄的階段(一); [Figure 7] Stage (I) of presenting the allocation column;

〔圖8〕呈現分配欄的階段(二); [Figure 8] Stage (II) of presenting the allocation column;

〔圖9〕呈現分配欄的階段(三); [Figure 9] Stage (III) of presenting the allocation column;

〔圖10〕矩陣計算衝突標記與風險分數的階段(一); 〔Figure 10〕The stage of matrix calculation of conflict markers and risk scores (I);

〔圖11〕矩陣計算衝突標記與風險分數的階段(二); 〔Figure 11〕The stage of matrix calculation of conflict markers and risk scores (II);

〔圖12〕圖3所示的方法的調度步驟的流程圖; [Figure 12] Flowchart of the scheduling steps of the method shown in Figure 3;

〔圖13〕從重組矩陣選元素的階段(一)。 〔Figure 13〕The stage of selecting elements from the reorganized matrix (I).

〔圖14〕從重組矩陣選元素的階段(二)。 〔Figure 14〕The stage of selecting elements from the reorganized matrix (II).

〔圖15〕從重組矩陣選元素的階段(三)。 〔Figure 15〕The stage of selecting elements from the reorganized matrix (III).

〔圖16〕從重組矩陣選元素的階段(四)。 [Figure 16] The stage of selecting elements from the reorganized matrix (IV).

以下參考相關圖式進一步說明本發明的重排LDPC解碼矩陣而縮短LDPC最小和解碼延遲的方法的較佳實施例。為便於理解本發明,以下用相同符號標示相同元件。 The following reference figures further illustrate a preferred embodiment of the method of rearranging the LDPC decoding matrix to shorten the LDPC minimum sum decoding delay. To facilitate understanding of the present invention, the same symbols are used to indicate the same elements.

參考第2圖,一個解碼系統包括硬體(在下方的虛線框裡)及本發明的方法(在上方的虛線框裡)。硬體(例如LDPC最小和解碼器)以列為基礎處理矩陣。硬體包括一個記憶體及數個核心。記憶體貯存一列元素(row of circulants)。每一個核心是一個處理器(processor)。 Referring to FIG. 2 , a decoding system includes hardware (in the lower dashed box) and the method of the present invention (in the upper dashed box). The hardware (e.g., LDPC minimum sum decoder) processes matrices on a column basis. The hardware includes a memory and a plurality of cores. The memory stores a row of circulants. Each core is a processor.

硬體裡處理矩陣時,為解決矩陣的欄(column)裡的元素(circulant)分享記憶體空間的問題,並提升並行速度,本發明的方法包括矩陣重組程序(Hab generator)及元素調度程序(circulant scheduler)。每一個元素是一個循環矩陣(circulant),亦即一個較小矩陣。 When processing a matrix in hardware, in order to solve the problem of sharing memory space among the elements (circulants) in the columns of the matrix and improve the parallel speed, the method of the present invention includes a matrix reorganization program ( Hab generator) and an element scheduling program (circulant scheduler). Each element is a circulant, that is, a smaller matrix.

矩陣重組程序預處理一個原始矩陣(例如LDPC的矩陣),產生一個重組的矩陣(欄位已對調)及重組向量(表示欄位對調的順序)。預處理包括拆開、重新排序並組合。在可行的分割數目下,此方法可拆原始矩陣成兩個以上的較小的矩陣。如此,矩陣重組程序預處理原始矩陣而達高產率。 The matrix reshuffling procedure preprocesses an original matrix (e.g., a matrix of LDPC) to produce a reshuffled matrix (with fields swapped) and a reshuffled vector (indicating the order of the swapped fields). The preprocessing includes splitting, reordering, and combining. Under a feasible number of splits, this method can split the original matrix into two or more smaller matrices. In this way, the matrix reshuffling procedure preprocesses the original matrix to achieve high yield.

元素調度程序用重組的矩陣及重組向量,產生CMEM檔案。在此,分拆在每一個核心裡的元素並重新排序而使結果有較佳效能。在每一個核心裡,元素調度程序最佳化(optimize)元素順序供並行(parallelization)。如此,每一個核心 及整個系統有較少的怠機(idle)時間,就是強化硬體效能。 The element scheduler uses the reorganized matrix and reorganized vector to generate a CMEM file. Here, the elements in each core are split and reordered to achieve better performance. In each core, the element scheduler optimizes the element order for parallelization. In this way, each core and the entire system have less idle time, which enhances hardware performance.

參考第3圖,步驟S10、S20及S30組成矩陣重組程序。 Referring to Figure 3, steps S10, S20 and S30 constitute a matrix reorganization procedure.

在步驟S10,對調(swap)矩陣的列(row)的位置而產生不同排列組合的列順序(row sequence)。 In step S10, the positions of the rows of the matrix are swapped to generate row sequences of different permutations.

在步驟S20,依次從這些列順序選一個。 In step S20, select one from these columns in sequence.

在步驟S30,依選中之列順序,用對調欄位的方式,分割矩陣給不同核心,並將結果整合成一個重組的矩陣(Hab matrix)。 In step S30, the matrix is divided into different cores by swapping the columns according to the selected row order, and the results are integrated into a reorganized matrix ( Hab matrix).

步驟S40、S50及S55組成元素調度程序。 Steps S40, S50 and S55 constitute the element scheduling procedure.

在步驟S40,調度在每一個核心裡的元素。換言之,所有核心同時執行步驟S40。 In step S40, the elements in each core are scheduled. In other words, all cores execute step S40 simultaneously.

在步驟S50,產生多行CMEM指令。 In step S50, multiple lines of CMEM instructions are generated.

在步驟S55,找到最少的CMEM指令的行數。 In step S55, find the minimum number of CMEM instruction lines.

依一個列順序,以不同方式對調欄的位置,就是將欄位對調到不同的核心,並經元素調度程序排列後,最終導致不同的CMEM指令的組合及行數。因此,執行步驟S30、S40、S50、S55及S60達最大遞迴數而找到一個特定列順序下的最佳結果。執行步驟S30、S40、S50、S55、S60及S70而找到 所有列順序的最佳結果。最佳結果是最少CMEM指令行數。 According to a row order, swapping the position of the column in different ways means swapping the column to different cores, and after being arranged by the element scheduler, it finally leads to different combinations and numbers of CMEM instructions. Therefore, executing steps S30, S40, S50, S55 and S60 to reach the maximum number of loops to find the best result under a specific row order. Executing steps S30, S40, S50, S55, S60 and S70 to find the best result of all row orders. The best result is the minimum number of CMEM instruction lines.

在步驟S75,依上述CMEM指令產生最終解碼器調度檔案(scheduling file),並在檔案最後加CRC32校驗位元。 In step S75, the final decoder scheduling file is generated according to the above CMEM instruction, and a CRC32 check bit is added at the end of the file.

在步驟S80,用一個檢查器(checker)及C model(以C語言寫的解碼器模型)執行解碼器調度檔案來完成復驗測試。 In step S80, a checker and C model (decoder model written in C language) are used to execute the decoder scheduling file to complete the verification test.

在步驟S10,對調列。換言之,重新排列原始矩陣的列,減少因列而生的先讀後寫的衝突。此舉有助於元素調度程序在步驟S40安排元素。參考第4圖,步驟S10被細分為步驟S11到S18。 In step S10, the columns are swapped. In other words, the columns of the original matrix are rearranged to reduce the read-before-write conflicts caused by the columns. This helps the element scheduler to arrange the elements in step S40. Referring to FIG. 4, step S10 is subdivided into steps S11 to S18.

在步驟S11,當元素在原始矩陣的同一欄,若它們在相鄰的數列(例如相鄰的2列、相鄰的3列、相鄰的4列、…),則它們可能有先讀後寫衝突。計算任意兩列的衝突數(number of conflicts)而建立一個如下的衝突表(conflict table)。依此衝突表可快速進行後續步驟。 In step S11, when elements are in the same column of the original matrix, if they are in adjacent columns (e.g., adjacent 2 columns, adjacent 3 columns, adjacent 4 columns, ...), they may have a read-before-write conflict. Calculate the number of conflicts between any two columns and create a conflict table as follows. Based on this conflict table, subsequent steps can be quickly performed.

Figure 112126729-A0101-12-0006-1
Figure 112126729-A0101-12-0006-1

參考表1,舉例描述兩列的情況。在列1與列2間無衝突。在列1與列3間有15個衝突。在列2與列3間有2個衝突。衝突表有助於找到有最小衝突數的下一列。 Refer to Table 1, which describes the situation of two columns. There are no conflicts between columns 1 and 2. There are 15 conflicts between columns 1 and 3. There are 2 conflicts between columns 2 and 3. The conflict table helps to find the next column with the minimum number of conflicts.

在步驟S12,為找到所有的列順序組合,依序從原始矩陣選一列,例如列K當一個列順序之首。 In step S12, to find all the row sequence combinations, select a row from the original matrix in order, for example, row K as the first row sequence.

在步驟S13,為當今列(例如列K),選衝突最少且可用的列當下一列。「衝突最少」表示有最小衝突數。「可用」表示未被列入當今或早先產生的列順序(從步驟S17回步驟S13)。可能不只一列(例如列L及列M)有最小衝突數,這些列都是下列的候選者。從這些候選者選一列(例如列L)當下一列。 In step S13, for the current row (e.g. row K), select the row with the least conflicts and available as the next row. "Minimum conflicts" means having the minimum number of conflicts. "Available" means not included in the current or previously generated row sequence (from step S17 back to step S13). There may be more than one row (e.g. row L and row M) with the minimum number of conflicts, and these rows are the following candidates. Select a row (e.g. row L) from these candidates to be the next row.

在另一實施例,可用最小衝突數+n的範圍取代「衝突最少」。此舉可找到更多可能性但消耗更多時間。另外若有太多候選者,可用隨機選取機制,其有助於減少處理時間並找到好的列順序。 In another embodiment, "minimum conflict" can be replaced by the range of minimum conflict number + n. This can find more possibilities but consumes more time. In addition, if there are too many candidates, a random selection mechanism can be used, which helps reduce processing time and find a good row order.

在步驟S14,檢查是否完成一個列順序(此列順序涵蓋所有列)。若是,則到步驟S15,否則回步驟S13。 In step S14, check whether a row sequence is completed (this row sequence covers all rows). If so, go to step S15, otherwise return to step S13.

在步驟S15,更新有最小總衝突數的列順序。為此,比較當今列順序的總衝突數與稍早被貯存的列順序的總衝 突數。若當今總衝突數大於稍早的總衝突數,則保留稍早被貯存的列順序。若當今總衝突數等於稍早的總衝突數,則貯存當今列順序與稍早的列順序並存。若當今總衝突數小於稍早的總衝突數,則貯存當今列順序。 In step S15, the column sequence with the minimum total conflict number is updated. To this end, the total conflict number of the current column sequence is compared with the total conflict number of the earlier stored column sequence. If the current total conflict number is greater than the earlier total conflict number, the earlier stored column sequence is retained. If the current total conflict number is equal to the earlier total conflict number, the current column sequence is stored together with the earlier column sequence. If the current total conflict number is less than the earlier total conflict number, the current column sequence is stored.

在步驟S16,判斷是否已探索以同一列(例如列K)為首的所有可能列順序組合。若是,則到步驟S18,否則到步驟S17。 In step S16, determine whether all possible column order combinations starting with the same column (e.g., column K) have been explored. If so, go to step S18, otherwise go to step S17.

在步驟S17,用遞迴方式,在目前已建構的列順序,由後往前尋找是否某列所在位置可由另一列所替代,找到時將此位置與後續組合移除,並到步驟S13挑選此位置下個合適候選者(例如列M),並繼續往後完成此狀況下的整個列順序(在步驟S13,已選的排列方式會被汰除)。 In step S17, a recursive method is used to search from the back to the front in the currently constructed row sequence to see if a row position can be replaced by another row. If found, this position and the subsequent combination are removed, and the next suitable candidate for this position (such as row M) is selected in step S13, and the entire row sequence under this condition is completed (in step S13, the selected arrangement will be eliminated).

以下,以表1(衝突表)的列1為例,描述如何建立一個列順序。 Below, we take row 1 of table 1 (conflict table) as an example to describe how to create a row order.

在步驟S12,選列1為當今列順序之首。此時,列順序=[1],總衝突數=0。 In step S12, select row 1 as the first row in the current row order. At this time, the row order = [1], and the total number of conflicts = 0.

在步驟S13,查表而決定列2與列1為首有最小衝突數。此時,列順序=[1,2],總衝突數=0。 In step S13, the table is looked up to determine that row 2 and row 1 have the smallest number of conflicts. At this time, the row order = [1,2], and the total number of conflicts = 0.

在步驟S14,決定列順序只有2個項目,並因此回 步驟S13。 In step S14, it is determined that the column sequence has only 2 items, and therefore returns to step S13.

在步驟S13,查表而決定列3與列2間有最小衝突數。此時,列順序=[1,2,3],總衝突數=2。 In step S13, the table is looked up to determine the minimum number of conflicts between row 3 and row 2. At this time, the row order = [1, 2, 3], and the total number of conflicts = 2.

重複程序(步驟S13加S14),直到所有列被寫入列順序。此時,列順序=[1,2,3,4,5,6,7,8],總衝突數=5。 Repeat the process (steps S13 plus S14) until all rows are written into the row order. At this point, the row order = [1,2,3,4,5,6,7,8], and the total number of conflicts = 5.

以下,討論置換且有相等或更佳結果的情形。 Below, we discuss situations where substitutions produce equal or better results.

上述完成一個列順序[1,2,3,4,5,6,7,8]後,在步驟S16,決定未找到所有以列1為首的列順序,就到步驟S17。在步驟S17,由後往前尋找有其他候選者可替換的列順序位置。因列順序中8、7、6、與5皆無其他候選者滿足原最小衝突數(衝突數=0),直到4的位置發現有其他候選者,此時將此位置與後續組合移除而得到列順序[1,2,3],並回到步驟S13選列8(因列3與列4的衝突數是0,且列3與列8的衝突數也是0,但列4先前已選過予以剃除)。然後,反覆執行步驟S13與步驟S14。最後,可完成新的列順序[1,2,3,8,7,6,5,4],總衝突數是5。 After completing a row sequence [1,2,3,4,5,6,7,8], in step S16, it is determined that all row sequences starting with row 1 have not been found, and step S17 is performed. In step S17, row sequence positions that have other candidates to replace are searched from back to front. Because there are no other candidates in the row sequence 8, 7, 6, and 5 that meet the original minimum conflict number (conflict number = 0), other candidates are found until position 4. At this time, this position and the subsequent combination are removed to obtain the row sequence [1,2,3], and return to step S13 to select row 8 (because the conflict number between row 3 and row 4 is 0, and the conflict number between row 3 and row 8 is also 0, but row 4 has been previously selected and shaved off). Then, repeat step S13 and step S14. Finally, the new row sequence [1,2,3,8,7,6,5,4] can be completed, and the total number of conflicts is 5.

在步驟S15,額外貯存此列順序,因它與稍早被貯存的列順序有相等的總衝突數。此時,找到兩個列順序。 In step S15, this row sequence is additionally stored because it has the same total conflict count as the row sequence stored earlier. At this point, two row sequences are found.

以下,討論置換且無更佳結果的情形。 Below, we discuss the case where substitution does not produce better results.

若把「最少衝突」擴大為一個大範圍,例如20,則在上述完成最初列順序[1,2,3,4,5,6,7,8]後,在步驟S17有另一種選擇方式。由後往前搜尋至列順序7的位置時,發現有另一個候選者(因列6與列8間的衝突數是14而14 <20),此時回到步驟S13,選擇列8替代原本的列7。藉反覆執行步驟S13及S14,建立另一個列順序。此時列順序=[1,2,3,4,5,6,8,7],總衝突數=16。 If "minimum conflict" is expanded to a large range, such as 20, then after completing the initial row sequence [1,2,3,4,5,6,7,8], there is another choice in step S17. When searching from back to front to the row sequence 7, another candidate is found (because the number of conflicts between row 6 and row 8 is 14 and 14 <20), then return to step S13 and select row 8 to replace the original row 7. By repeatedly executing steps S13 and S14, another row sequence is established. At this time, the row sequence = [1,2,3,4,5,6,8,7], and the total number of conflicts = 16.

然而,在步驟S15,因16>5(稍早被貯存的總衝突數),就跳過此列順序並到步驟S17去遞迴地探索其他列順序。 However, in step S15, since 16>5 (the total number of conflicts stored earlier), this column sequence is skipped and step S17 is performed to recursively explore other column sequences.

在步驟S18,判斷是否每一列都當過某個列順序之首。若是,則到步驟S20,否則選下一列(例如列K+1)並回步驟S12。 In step S18, determine whether each row has been the head of a row sequence. If so, go to step S20, otherwise select the next row (for example, row K+1) and return to step S12.

以下,舉例描述步驟S18。令找到的列順序如下: Below, we describe step S18 as an example. Let the column order found be as follows:

Found Row-Sequences (number = row index) Found Row-Sequences (number = row index)

[1, 3,4, 6,2, 5] [1, 3,4, 6,2, 5]

[1, 6, 5, 3, 2, 4] [1, 6, 5, 3, 2, 4]

[2, 5, 1, 3, 4, 6] [2, 5, 1, 3, 4, 6]

[2, 4, 1, 6, 5, 3] [2, 4, 1, 6, 5, 3]

此矩陣有6列,已找到以列1或列2為首的所有列順序,共4個列順序。在步驟S18,發現並非每一個列都曾做為列首來檢驗其列順序,因此依序選擇了下一列,也就是 列3做為列順序之首,並回到步驟S12來建構如此條件下的列順序。 This matrix has 6 columns, and all column sequences starting with column 1 or column 2 have been found, a total of 4 column sequences. In step S18, it is found that not every column has been used as the first column to check its column sequence, so the next column is selected in sequence, that is, column 3 is used as the first column sequence, and step S12 is returned to construct the column sequence under such conditions.

步驟S20,從步驟S10產生的所有列順序(以每一列為首並滿足最小衝突數的列順序),逐次選其中一個列順序。然後,對中選列順序執行步驟S30、S40、S50及S55而產生一個重組矩陣(Hab matrix)、一個重組向量(ab_vector)及一組CMEM指令。然後,在步驟S60判斷是否達到重複次數上限,若無則回到步驟S30。 In step S20, one of the column sequences (column sequences with each column as the first and satisfying the minimum number of conflicts) generated in step S10 is selected one by one. Then, steps S30, S40, S50 and S55 are executed for the selected column sequence to generate a reorganization matrix (H ab matrix), a reorganization vector (ab_vector) and a set of CMEM instructions. Then, in step S60, it is determined whether the upper limit of the number of repetitions is reached. If not, it returns to step S30.

在步驟S30,產生重組矩陣及重組向量。依中選列順序,逐一處理所有列。參考第5圖,依核心的數量,藉對調的方式,把原始矩陣平均分成兩個或是更多的段(part),並將對調後的段整合成重組矩陣及重組向量(向量資訊包含分拆及對調的順序)。執行步驟S30使硬體的每一個核心的負荷(須處理的元素量)與任何其他核心的負荷相同或相近。 In step S30, a reorganized matrix and a reorganized vector are generated. All rows are processed one by one in the order of the selected rows. Referring to FIG. 5, the original matrix is evenly divided into two or more segments (parts) by swapping according to the number of cores, and the swapped segments are integrated into a reorganized matrix and a reorganized vector (the vector information includes the order of splitting and swapping). Step S30 is performed so that the load (the number of elements to be processed) of each core of the hardware is the same or similar to the load of any other core.

分拆時,用加權後的結果,隨機決定每一欄到某段。以第5圖為例,範圍_k表示段k目前所缺的欄的量,所有的欄需求總數(所有的範圍_k累加)為總隨機範圍(total random range)。在此總隨機範圍裡,產生一個隨機數,並用此隨機數所在的區間來判斷此欄應該分配給哪段。一段的需求愈大(圖中長度 越大者),此隨機數落在其範圍裡的機率愈高,表示此欄愈有可能被分配到此段。當任何段被填滿,它會自然被移除,就不會有任何欄被指派給這段。 When splitting, use the weighted results to randomly determine which segment each column is assigned to. Taking Figure 5 as an example, range_k represents the number of columns currently missing in segment k, and the total number of column requirements (the sum of all range_k) is the total random range. In this total random range, a random number is generated, and the interval of this random number is used to determine which segment this column should be assigned to. The greater the demand for a segment (the larger the length in the figure), the higher the probability that this random number falls within its range, indicating that this column is more likely to be assigned to this segment. When any segment is filled, it will be naturally removed, and no column will be assigned to this segment.

另外,為幫助我們評估分拆後的結果,定義一個加權對調分數(weighted swap score)如下: In addition, to help us evaluate the results after the spin-off, a weighted swap score is defined as follows:

Figure 112126729-A0101-12-0012-2
Figure 112126729-A0101-12-0012-3
Figure 112126729-A0101-12-0012-2
Figure 112126729-A0101-12-0012-3

在上式中,M是總列數,L是總段數,j是列碼且從1到Mk是段碼且從1到LN k 是第k段的元素總量,

Figure 112126729-A0101-12-0012-4
是所有段的平均元素總量,N jk 是第j列的第k段的元素量,
Figure 112126729-A0101-12-0012-5
是第j列的所有段的平均元素量,α+β=1。 In the above formula, M is the total number of columns, L is the total number of segments, j is the column number and ranges from 1 to M , k is the segment number and ranges from 1 to L , N k is the total number of elements in the kth segment,
Figure 112126729-A0101-12-0012-4
is the average total number of elements in all segments, N jk is the number of elements in the kth segment of the jth column,
Figure 112126729-A0101-12-0012-5
is the average number of elements in all segments of the jth column, α + β =1.

用加權對調分數評估欄對調的結果,愈小愈好。加權對調分數考慮兩個因子,其一是在每一段裡的元素的總量,其二是在每一列裡每一段裡的元素的量,每一因子得到一個權重(α或β)。實際運用上,可依原始矩陣分布狀況,調整α與β之大小,使加權對調分數偏重或平均兩因子之影響。 Use the weighted swap score to evaluate the result of the column swap, the smaller the better. The weighted swap score considers two factors, one is the total number of elements in each segment, and the other is the number of elements in each segment in each column. Each factor gets a weight (α or β). In actual application, the size of α and β can be adjusted according to the distribution of the original matrix, so that the weighted swap score emphasizes or averages the impact of the two factors.

就第一因子而言,計算在每一段裡的元素的量與分拆平均值(理想值)之差值的絕對值,並把它們加在一起。為平衡兩個因子的影響,把第一因子乘以M。 For the first factor, calculate the absolute value of the difference between the amount of elements in each segment and the split mean (ideal value) and add them together. To balance the effects of the two factors, multiply the first factor by M.

就第二因子而言,計算在每一列裡的每一段元素的量與此列裡分拆平均值(理想值,每一列皆可能不同)之差值的絕對值,並把它們加在一起。 For the second factor, calculate the absolute value of the difference between the amount of each segment of elements in each column and the average value of the splits in this column (ideally, each column may be different), and add them together.

參考第6圖,以步驟S31、S32、S33與S34詳細描述步驟S30,將原始矩陣平均分成兩段或更多段,並產生重組矩陣及重組向量之方式。 Referring to FIG. 6, step S30 is described in detail with steps S31, S32, S33 and S34, which evenly divide the original matrix into two or more segments and generate a reorganized matrix and a reorganized vector.

在步驟S31,依序處理原始矩陣每一欄(欄1、欄2、欄3、…、欄n)。處理每一欄時,產生一個隨機數,並依此隨機數的大小,將其所在位置對應到相關段所在的範圍,決定把欄分配給對應的段。 In step S31, each column of the original matrix (column 1, column 2, column 3, ..., column n) is processed in sequence. When processing each column, a random number is generated, and according to the size of the random number, its position is matched to the range where the relevant segment is located, and the column is allocated to the corresponding segment.

參考第7圖、第8圖與第9圖,舉例描述步驟S31之方法。假定欲將原始矩陣均分為四段,分別為段1、段2、段3與段4,它們處理當下所需的欄數量分別是2、4、1及3(因欄位皆為隨機分配至任一段,處理當下各段所需的欄數量極可能不相等),總隨機範圍是10。圖中較長段表示需要較多欄,並有較多機會得到在其範圍裡的隨機數。舉例而言,目前處理至第k欄,欲分配欄k給其中一段,隨機從1到10選一數,此時得到隨機數5。因5在段2裡,將欄k派給段2。 Refer to Figures 7, 8 and 9 for an example of the method of step S31. Assume that the original matrix is to be divided into four segments, namely segment 1, segment 2, segment 3 and segment 4, and the number of columns required for processing them is 2, 4, 1 and 3 respectively (because the columns are randomly assigned to any segment, the number of columns required to process each segment is likely to be unequal), and the total random range is 10. The longer segment in the figure means that more columns are required and there is a greater chance of obtaining a random number within its range. For example, the current processing is to the kth column, and column k is to be assigned to one of the segments. A number is randomly selected from 1 to 10, and the random number 5 is obtained. Since 5 is in segment 2, column k is assigned to segment 2.

然後,把段2的需求數(範圍_2)減1,4-1=3,可得知目前段2只需3欄。各段所需的欄的量就如第8圖所示。隨機數的範圍變成1到9(總隨機範圍是9)。 Then, subtract 1 from the required number of segment 2 (range_2), 4-1=3, and we know that segment 2 only needs 3 columns. The number of columns required for each segment is shown in Figure 8. The range of random numbers becomes 1 to 9 (the total random range is 9).

然後,處理欄k+1,隨機從1到10選一數。舉例而言,此時得到隨機數6,則將欄k+1分配給段3。 Then, process column k+1 and randomly select a number from 1 to 10. For example, if the random number is 6, column k+1 is assigned to segment 3.

因段3被填滿,即從此需求圖中被移除。各段需要的欄的量就變化如第9圖所示。此時,隨機數的範圍變成1到8(總隨機範圍是8)。 Since segment 3 is filled, it is removed from this demand graph. The number of columns required for each segment changes as shown in Figure 9. At this time, the range of random numbers becomes 1 to 8 (the total random range is 8).

參考第6圖,在步驟S32,觀察每一段裡的非零(non-zero:NZ)元素的量,並依方程式計算加權對調分數。若此加權對調分數小於稍早被貯存的加權對調分數,則更新加權對調分數,並將此時分配給各段的欄位與順序記錄下來。 Referring to Figure 6, in step S32, the amount of non-zero (NZ) elements in each segment is observed, and the weighted swap score is calculated according to the equation. If the weighted swap score is less than the weighted swap score stored earlier, the weighted swap score is updated, and the fields and sequence assigned to each segment at this time are recorded.

參考下表,舉例計算加權對調分數: Refer to the following table for an example of calculating the weighted swap score:

Figure 112126729-A0101-12-0014-6
Figure 112126729-A0101-12-0014-6

首先,觀察各段的總量,計算第一因子。參考上表,4段的分別總數={29,27,26,22},平均值=26,各段總數與平均值的差值,並取絕對值的和是3+1+0+4=8。因總共有4列,故乘上總列數得8x4=32。 First, observe the total amount of each segment and calculate the first factor. Referring to the table above, the total amount of the 4 segments = {29, 27, 26, 22}, the average value = 26, and the difference between the total amount of each segment and the average value, and the sum of the absolute values is 3+1+0+4=8. Since there are 4 columns in total, multiplying by the total number of columns gives 8x4=32.

接著,觀察各列各段的總量,計算第二因子。列1的各段總數={5,5,8,6},平均值是6,差值的絕對值的和是1+1+2+0=4。列2的個數={7,9,7,5},平均值是7,差值的絕對值的和是0+2+0+2=4。列3的個數={9,5,5,5},平均值是6,差值的絕對值的和是3+1+1+1=6。列4的個數={8,8,6,6},平均值是7,差值的絕對值的和是1+1+1+1=4。把4列的差值的絕對值的和加在一起得到20。 Next, observe the total amount of each column and calculate the second factor. The total number of columns in each segment = {5,5,8,6}, the average value is 6, and the sum of the absolute values of the differences is 1+1+2+0=4. The number of columns in 2 = {7,9,7,5}, the average value is 7, and the sum of the absolute values of the differences is 0+2+0+2=4. The number of columns in 3 = {9,5,5,5}, the average value is 6, and the sum of the absolute values of the differences is 3+1+1+1=6. The number of columns in 4 = {8,8,6,6}, the average value is 7, and the sum of the absolute values of the differences is 1+1+1+1=4. Add up the sum of the absolute values of the differences of the four columns to get 20.

若α=β=0.5,表示兩因子影響均等,則加權對調分數是32*0.5+20*0.5=52。 If α=β=0.5, it means that the two factors have equal influence, then the weighted swap score is 32*0.5+20*0.5=52.

每一種對調(執行步驟S31及S32所得)有一個加權對調分數。若此加權對調分數小於稍早被貯存的加權對調分數,則貯存此加權對調分數為最小加權對調分數,此種對調就是最佳方式,並將此時分配給各段的欄位與順序記錄下來。 Each swap (obtained by executing steps S31 and S32) has a weighted swap score. If this weighted swap score is less than the weighted swap score stored earlier, then this weighted swap score is stored as the minimum weighted swap score. This swap is the best way, and the fields and sequence assigned to each segment at this time are recorded.

在步驟S33,判斷矩陣的對調分配次數是否達最大遞迴數。若達最大遞迴數,則到步驟S34,否則回步驟S31再次針對整個矩陣進行欄的對調分配。 In step S33, determine whether the number of matrix swap allocations has reached the maximum number of recursions. If it has reached the maximum number of recursions, go to step S34, otherwise return to step S31 to swap the columns of the entire matrix again.

在步驟S34,產生重組矩陣及重組向量。最小加權對調分數的對調就是最佳方式。依上述與最小加權對調分數對應的各段欄位分配與順序,將所有段對調並整合成重組矩陣及重組向量。 In step S34, a reorganization matrix and a reorganization vector are generated. The reorganization with the minimum weighted reorganization score is the best method. According to the above-mentioned field allocation and sequence of each segment corresponding to the minimum weighted reorganization score, all segments are swapped and integrated into a reorganization matrix and a reorganization vector.

元素調度程序將每一個有重組向量的重組矩陣最佳化元素順序,供兩個或更多個核心並行。元素調度程序為每一個核心組織矩陣元素的順序,並避免先讀後寫衝突的發生來減少系統延遲,因為系統用硬體怠機的方式來避免先讀後寫衝突。元素調度程序會在所有核心上執行,並針對核心進行優化。 The element scheduler optimizes the order of elements of each reshuffle matrix with a reshuffle vector for two or more cores to run in parallel. The element scheduler organizes the order of matrix elements for each core and reduces system latency by avoiding read-before-write conflicts, because the system uses hardware idling to avoid read-before-write conflicts. The element scheduler runs on all cores and is optimized for each core.

在步驟S40,調度(schedule)元素並產生CMEM指令行。依每一列與其他列衝突的可能性,決定把這列的元素交給硬體處理的順序。參考第10圖,為決定衝突的可能性,定義兩個值,其一是衝突標記(conflict flag:CF),其二是風險分數(Risk Score)。 In step S40, the elements are scheduled and CMEM instruction lines are generated. The order in which the elements of each row are handed over to the hardware for processing is determined based on the possibility of conflict between each row and other rows. Referring to Figure 10, two values are defined to determine the possibility of conflict, one is the conflict flag (CF), and the other is the risk score (Risk Score).

衝突標記(CF)表示同一欄裡在當今元素以上(之前)未處理元素的量。當每一個矩陣元素處理完並寫回記憶體時,就即時把同一欄以下(之後)的所有矩陣元素之衝突標記減1。因此,一個元素的衝突標記=0,表示在它以上的元素全被寫回記憶體,且它是一個不衝突(可用)的元素。 The conflict flag (CF) indicates the number of unprocessed elements above (before) the current element in the same column. When each matrix element is processed and written back to the memory, the conflict flag of all matrix elements below (after) the same column is immediately reduced by 1. Therefore, the conflict flag of an element = 0 means that all elements above it have been written back to the memory, and it is a non-conflicting (usable) element.

風險分數(Risk Score)表示同一欄裡在當今元素以下(之後)的元素的量。較高風險分數表示有較大機率發生衝突(表示同一欄有較多元素共用同一記憶空間)。為減少未來衝突,在可行的狀況下,先處理有較高風險分數的元素並把它寫回記憶體,即能減少未來衝突發生的機率。 The risk score indicates the number of elements below (after) the current element in the same column. A higher risk score indicates a greater chance of conflict (indicating that more elements in the same column share the same memory space). To reduce future conflicts, if possible, process elements with higher risk scores first and write them back to memory, which can reduce the probability of future conflicts.

參考第11圖,計算風險分數時,因硬體處理矩陣元素會由最後一列返回第一列環繞,為處理環繞時造成的衝突,須複製最前若干列(至少一列)至最後一列之後,作為額外列來近似環繞的衝突狀況。額外列的量取決於硬體的處理序列(pipeline)的長度,亦即可能處於衝突情況的列的最大量。 Refer to Figure 11. When calculating the risk score, the hardware processes the matrix elements from the last row back to the first row and wraps around. To handle the conflict caused by wrapping, the first few rows (at least one row) must be copied to the back of the last row as additional rows to approximate the conflict of wrapping. The number of additional rows depends on the length of the hardware processing sequence (pipeline), that is, the maximum number of rows that may be in conflict.

參考第12圖,用步驟S41、S42、S43、S44與S45來詳細描述步驟S40之處理。在步驟S41,預處理整個重組矩陣。為每一個元素,用兩個計數器(counter)計算衝突標記及風險分數。 Referring to Figure 12, the processing of step S40 is described in detail using steps S41, S42, S43, S44 and S45. In step S41, the entire reorganized matrix is preprocessed. For each element, two counters are used to calculate the conflict flag and risk score.

參考第13圖左側,舉例描述步驟S41。若衝突僅發生在相鄰的兩列之間,則在用額外列計算風險分數時,只複製列1到底部當額外列。參考第13圖右側結果,此時列2的所有元素的衝突標記是{1,X,0,X,1},風險分數是{1,X,0,X,3},X表示此位置在列2上為全零元素不予以計算。 Refer to the left side of Figure 13 for an example of step S41. If the conflict only occurs between two adjacent columns, then when using additional columns to calculate the risk score, only column 1 is copied to the bottom as an additional column. Refer to the right side of Figure 13 for the result. At this time, the conflict mark of all elements in column 2 is {1,X,0,X,1}, and the risk score is {1,X,0,X,3}, where X indicates that the position is all zero elements on column 2 and is not calculated.

參考第12圖,在步驟S42,依風險分數,從高到低排序(sort)元素。濾掉全零元素,只留非零元素當候選者。排序的目的是稍後步驟尋找可用且風險分數最大者時,可從左到右搜索可用(衝突標記=0)的元素,並在找到第一個滿足條件的元素時停下,即是可用元素中風險分數最大者。 Referring to Figure 12, in step S42, the elements are sorted from high to low according to the risk score. All zero elements are filtered out, leaving only non-zero elements as candidates. The purpose of sorting is to search for available (conflict mark = 0) elements from left to right when looking for the available element with the largest risk score in the later step, and stop when the first element that meets the conditions is found, that is, the element with the largest risk score among the available elements.

參考第13圖右側,舉例描述步驟S42。列2由欄1至欄5之風險分數依序={1,X,0,X,3},由大至小排列且剃除全零元素後為{3,1,0},分別來自欄位5、欄位1與欄位3,此為列2之候選者。 Refer to the right side of Figure 13 for an example of step S42. The risk scores of column 1 to column 5 in row 2 are {1, X, 0, X, 3} in order, arranged from large to small and after removing all zero elements, they are {3, 1, 0}, which come from column 5, column 1 and column 3 respectively. This is a candidate for row 2.

參考第12圖,步驟S43及S44用隊列(queue)的資料結構型態來模擬硬體行為,並用隊列之特性維持元素在硬體中輸入及輸出的時間差。隊列的深度取決於硬體本身處理序列的線性長度。 Referring to Figure 12, steps S43 and S44 use the queue data structure type to simulate the hardware behavior and use the queue characteristics to maintain the time difference between the input and output of elements in the hardware. The depth of the queue depends on the linear length of the hardware's own processing sequence.

在步驟S43,從記憶體讀取非零元素資料。在此步驟,須搜尋並挑選一個有最大風險分數的無衝突的元素(其衝突標記=0,表示在它以上的元素都被寫回記憶體),把此元素放入模擬其核心運算的隊列後端,做插入動作,此作法將運用到所有核心的運算上。若在某個核心找不到可用的元素,則使該核心停機(stall)一個系統週期。此挑選元素的順序即為讀取階段的排程結果,紀錄下為後續產生CMEM指令用。 In step S43, non-zero element data is read from the memory. In this step, a conflict-free element with the largest risk score must be searched and selected (its conflict flag = 0, indicating that the elements above it are written back to the memory), and this element is placed at the back of the queue simulating its core operation, and an insertion action is performed. This approach will be applied to the operation of all cores. If no available element is found in a core, the core is shut down (stall) for a system cycle. The order of selecting elements is the scheduling result of the reading phase, which is recorded for the subsequent generation of CMEM instructions.

在步驟S44,把處理完的非零元素資料寫入記憶體。從隊列的前端取出一個元素,將其從隊列中移除。將此元素所在欄位中以下(之後)的非零元素實施寫回(write-back)機制。寫回機制的方法為把同欄以下(之後)的非零元素衝突標記皆減1,表示此非零元素上方(之前)已有一非零元素處理完並寫回 In step S44, the processed non-zero element data is written into the memory. An element is taken from the front of the queue and removed from the queue. The non-zero elements below (after) the column where this element is located are implemented with a write-back mechanism. The write-back mechanism is to reduce the conflict flags of the non-zero elements below (after) the same column by 1, indicating that a non-zero element above (before) this non-zero element has been processed and written back.

在步驟S45,判斷是否已對每一個核心裡每一個元素執行讀寫步驟。換言之,判斷是否完成整個矩陣。若是,則到步驟S50,否則回步驟S43。 In step S45, determine whether the read and write steps have been performed on each element in each core. In other words, determine whether the entire matrix is completed. If so, go to step S50, otherwise return to step S43.

在步驟S50,把上述排程結果編譯為CMEM指令行。這些指令行的總數被用於評估硬體效能且被視為更新參考。 In step S50, the above scheduling results are compiled into CMEM command lines. The total number of these command lines is used to evaluate hardware performance and is considered as an update reference.

以第13圖之列2及列3為例,參考第14圖到第16圖,舉例描述此單一核心狀況下的記憶體讀取(步驟S43)及記憶體寫入(步驟S44)。此方法可多核心狀況下,運用至不同核心與不同子矩陣上。 Taking row 2 and row 3 of Figure 13 as an example, refer to Figures 14 to 16 to describe the memory read (step S43) and memory write (step S44) under this single core state. This method can be applied to different cores and different sub-matrices under multi-core conditions.

參考第14圖左側,經步驟S42排序完每列的非零元素後,可得列2候選清單依序為欄5、欄1、欄3,因其風險分數是3、1、0,而衝突標記依序是1、1、0。另外列3候選清單依序為的欄5、欄2、欄4,因其風險分數是2、1、1,而衝突標記依序衝突標記是2、0、0。 Referring to the left side of Figure 14, after step S42 sorts the non-zero elements of each column, the candidate list of column 2 is column 5, column 1, column 3 in order, because its risk scores are 3, 1, 0, and the conflict marks are 1, 1, 0 in order. In addition, the candidate list of column 3 is column 5, column 2, column 4 in order, because its risk scores are 2, 1, 1, and the conflict marks are 2, 0, 0 in order.

另外參考第14圖右側,目前此核心處理的隊列(queue)裡有4個元素,其中隊列的最前端(即將輸出者)是先前某列的欄5元素。 Also refer to the right side of Figure 14. Currently, there are 4 elements in the queue processed by this core, and the front of the queue (the one to be output) is the element in column 5 of a previous row.

在步驟S43,執行從記憶體讀取的步驟,做法為選擇列2候選清單中,可用且風險分數最大之元素。因列2候選清單中由左到右掃描,只有欄3元素衝突標記=0,所以選擇欄3元素輸入第14圖右側隊列後端(輸入端),並記錄列2欄3元素為此步驟排程結果。 In step S43, the step of reading from the memory is executed by selecting the available element with the largest risk score in the candidate list of row 2. Since only the element of column 3 has a conflict flag = 0 when scanning from left to right in the candidate list of row 2, the element of column 3 is selected and input into the rear end (input end) of the right queue of Figure 14, and the element of column 3 of row 2 is recorded as the scheduling result of this step.

在步驟S44,執行寫入記憶體的步驟,由隊列最前端(輸出端)取出一元素,並針對此元素上實施寫回機制。因 為此元素是來自於先前某列的欄5,在寫回記憶體時,需把在同一欄在它以下的元素的衝突標記減1。在此例,是把列2及列3候選清單中同樣來自欄5之元素的衝突標記減1,即為列2的欄5元素衝突標記減至0,列3的欄5元素衝突標記減至1,所得結果參考第15圖左側。 In step S44, the step of writing into memory is executed, an element is taken out from the front end (output end) of the queue, and the write-back mechanism is implemented for this element. Because this element comes from column 5 of a previous row, when writing back to memory, the conflict flag of the element below it in the same column needs to be reduced by 1. In this example, the conflict flag of the element from column 5 in the candidate list of row 2 and row 3 is reduced by 1, that is, the conflict flag of the element in column 5 of row 2 is reduced to 0, and the conflict flag of the element in column 5 of row 3 is reduced to 1. The result is shown on the left side of Figure 15.

在步驟S45,判斷未完成整個矩陣,並回步驟S43。 In step S45, it is determined that the entire matrix is not completed, and the process returns to step S43.

接著參考第15圖,執行步驟S43,在列2候選清單中由左到右掃描,僅欄5的元素衝突標記是0且風險分數最大,所以選擇欄5元素輸入第15圖右側隊列後端(輸入端),並記錄列2欄5元素為此步驟排程結果。 Next, refer to Figure 15, execute step S43, scan from left to right in the candidate list of row 2, only the element of column 5 has a conflict mark of 0 and the risk score is the largest, so select the element of column 5 and input it into the back end (input end) of the right queue of Figure 15, and record the element of column 5 of row 2 as the scheduling result of this step.

在步驟S44,由隊列最前端(輸出端)取出一元素,並針對此元素上實施寫回機制。因為此元素是來自於先前某列的欄1,所以將列2的欄1元素衝突標記減至0,而列3因無欄1相關元素不做變動,所得結果參考第16圖左側。 In step S44, an element is taken from the front end (output end) of the queue, and a write-back mechanism is implemented for this element. Because this element comes from column 1 of a previous row, the conflict flag of column 1 element of row 2 is reduced to 0, and row 3 does not change because there is no column 1 related element. The result is shown on the left side of Figure 16.

在步驟S45,判斷未完成整個矩陣,並回步驟S43。 In step S45, it is determined that the entire matrix is not completed, and the process returns to step S43.

接著參考第16圖,執行步驟S43,在列2候選清單中由左到右掃描,因清單中只剩欄1元素,且其衝突標記=0,所以選擇欄1元素輸入第15圖右側隊列最後端(輸入端),並記錄列2欄1元素為此步驟排程結果。 Next, refer to Figure 16 and execute step S43. Scan from left to right in the candidate list of row 2. Since only the column 1 element is left in the list and its conflict flag = 0, select the column 1 element and input it to the end (input end) of the right queue of Figure 15, and record the row 2 column 1 element as the scheduling result of this step.

在步驟S44,由隊列最前端(輸出端)取出一元素,並針對此元素上實施寫回機制。因為此元素是來自於先前某列的欄2,所以將列3的欄2元素衝突標記減至0,而列2因已無元素不做變動。 In step S44, an element is taken from the front end (output end) of the queue, and a write-back mechanism is implemented for this element. Because this element comes from column 2 of a previous row, the conflict flag of column 2 of row 3 is reduced to 0, and row 2 is not changed because there is no element.

在步驟S45,判斷未完成整個矩陣,並回步驟S43。重複步驟S43、S44、S45直到完成整個矩陣的所有元素,並進到步驟S50,依上述所記錄的排程結果編譯為CMEM指令行。 In step S45, it is determined that the entire matrix is not completed, and the process returns to step S43. Steps S43, S44, and S45 are repeated until all elements of the entire matrix are completed, and the process proceeds to step S50, where the CMEM command line is compiled according to the above recorded scheduling results.

參考第3圖,在步驟S55,若找到更好結果則更新。指令行愈少代表結果愈好,因硬體作業時間與指令數正相關。 Refer to Figure 3, in step S55, if a better result is found, it is updated. Fewer instruction lines means better results, because the hardware operation time is positively correlated with the number of instructions.

在步驟S60,判斷產生指令行的次數是否達最大遞迴數。若是,則到步驟S70,否則回步驟S30,依此選定之列順序,再分配矩陣與排程產生新指令行。 In step S60, determine whether the number of times the command line is generated reaches the maximum number of loops. If so, go to step S70, otherwise return to step S30, and allocate the matrix and schedule the generation of new command lines according to the selected column order.

在步驟S70,判斷是否已處理所有列順序,就是依每個列順序產生不同指令行並紀錄最佳結果。若是,則到步驟S75,否則回步驟S20依序選擇下個列順序。 In step S70, determine whether all row sequences have been processed, that is, generate different command lines for each row sequence and record the best result. If yes, go to step S75, otherwise return to step S20 to select the next row sequence in sequence.

在步驟S75,將上述指令行製作成最終解碼器調度檔案(scheduling file),並在檔案最後加入CRC32校驗位元。 In step S75, the above command line is made into the final decoder scheduling file, and CRC32 check bits are added at the end of the file.

在步驟S80,檢查上述完成之調度檔案。檢查步驟包含從檔案讀取內容,依指令格式重建矩陣,並檢查每 一項限制。同時用C語言撰寫的解碼器模型(C-model)直接執行調度檔案,並檢查解碼輸出是否正確。 In step S80, the above completed scheduling file is checked. The checking step includes reading the content from the file, rebuilding the matrix according to the instruction format, and checking each restriction. At the same time, the decoder model (C-model) written in C language directly executes the scheduling file and checks whether the decoded output is correct.

以上僅為描述本發明的較佳實施方式,非用以限定本發明的範圍。本技術領域內的一般技術人員根據實施例所作的均等變化,以及本領域內技術人員熟知的改變,仍在本發明的範圍內。 The above is only to describe the preferred implementation of the present invention, and is not intended to limit the scope of the present invention. Equal changes made by ordinary technicians in this technical field based on the embodiments, as well as changes known to technicians in this field, are still within the scope of the present invention.

S10:對調矩陣的列以產生不同的列順序 S10: Swap the rows of the matrix to produce a different row order

S20:依次選一個列順序 S20: Select a column order in sequence

S30:對調欄位並分割矩陣 S30: Swap columns and split the matrix

S40:調度在每一個核心裡的元素 S40: Scheduling elements in each core

S50:產生多行CMEM指令 S50: Generate multiple lines of CMEM instructions

S55:找最少的CMEM指令的行數 S55: Find the minimum number of CMEM instruction lines

S60:達重複次數上限? S60: Reached the upper limit of repetition times?

S70:已處理所有列順序? S70: Have all row sequences been processed?

S75:依CMEM指令產生調度檔案並加CRC32校驗位元 S75: Generate a scheduling file according to the CMEM instruction and add CRC32 check bits

S80:用檢查器及C model測試調度檔案 S80: Use the checker and C model to test the scheduling file

Claims (2)

一種分割LDPC碼而改善LDPC解碼器並行的方法,其包括以下步驟:把整個矩陣均分為數段,依序把矩陣中每一欄隨機分配到某段,分配時將當下每段之需求作為加權(S31);計算一個加權對調分數如下(S32):
Figure 112126729-A0305-02-0028-1
其中M是總列數,L是總段數,j是列碼且從1到Mk是段碼且從1到LN k 是第k段的元素總量,
Figure 112126729-A0305-02-0028-3
是所有段的平均元素總量,N jk 是第j列的第k段的元素量,
Figure 112126729-A0305-02-0028-5
是第j列的所有段的平均元素量,α+β=1;若此加權對調分數小於稍早被貯存的加權對調分數,則更新加權對調分數並記錄對調分配方式(S32);判斷矩陣的對調分配次數是否達最大遞迴數(S33);若矩陣的對調分配次數未達最大遞迴數,則回上述把整個矩陣均分為數段的步驟(S31);及若矩陣的對調分配次數已達最大遞迴數,則依上述記錄之最小加權對調分數的對調分配方式產生重組矩陣 及重組向量(S34)。
A method for improving LDPC decoder parallelism by segmenting an LDPC code includes the following steps: dividing the entire matrix into a number of segments, randomly assigning each column in the matrix to a segment in sequence, and using the current demand of each segment as a weight during the assignment (S31); calculating a weighted swap score as follows (S32):
Figure 112126729-A0305-02-0028-1
Where M is the total number of columns, L is the total number of segments, j is the column number and ranges from 1 to M , k is the segment number and ranges from 1 to L , N k is the total number of elements in the kth segment,
Figure 112126729-A0305-02-0028-3
is the average total number of elements in all segments, N jk is the number of elements in the kth segment of the jth column,
Figure 112126729-A0305-02-0028-5
is the average number of elements of all segments in the j-th column, α + β =1; if this weighted swap score is less than the weighted swap score stored earlier, then update the weighted swap score and record the swap allocation method (S32); determine whether the swap allocation number of the matrix reaches the maximum recurrence number (S33); if the swap allocation number of the matrix does not reach the maximum recurrence number, return to the above step of dividing the entire matrix into several segments (S31); and if the swap allocation number of the matrix has reached the maximum recurrence number, generate a reorganized matrix and a reorganized vector according to the swap allocation method of the minimum weighted swap score recorded above (S34).
如請求項1所述分割LDPC碼而改善LDPC解碼器並行的方法,其中把整個矩陣均分為數段的步驟(S31)包括以下步驟;計算矩陣均分後每一段可容納的欄數,並將此欄數作為此段需求;統計所有需求為總隨機範圍;產生一個隨機數,此隨機數是1到總隨機範圍;依上述隨機數的落點,從上述數段擇一段作為中選段;把此欄數派給上述中選段;把上述中選段需求的欄數減1;把總隨機範圍減1;回上述產生一個隨機數的步驟來指派下一欄。 A method for improving the parallelism of LDPC decoders by dividing LDPC codes as described in claim 1, wherein the step (S31) of equally dividing the entire matrix into segments includes the following steps: calculating the number of columns that can be accommodated in each segment after the matrix is equally divided, and using the number of columns as the requirement of the segment; summarizing all requirements as the total random range; generating a random number, which is 1 to the total random range; selecting a segment from the segments as a middle segment according to the location of the random number; assigning the number of columns to the middle segment; reducing the number of columns required by the middle segment by 1; reducing the total random range by 1; and returning to the step of generating a random number to assign the next column.
TW112126729A 2023-07-18 2023-07-18 Method for improving LDPC decoder parallelism by dividing LDPC code TWI867624B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW112126729A TWI867624B (en) 2023-07-18 2023-07-18 Method for improving LDPC decoder parallelism by dividing LDPC code

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW112126729A TWI867624B (en) 2023-07-18 2023-07-18 Method for improving LDPC decoder parallelism by dividing LDPC code

Publications (2)

Publication Number Publication Date
TWI867624B true TWI867624B (en) 2024-12-21
TW202505878A TW202505878A (en) 2025-02-01

Family

ID=94769545

Family Applications (1)

Application Number Title Priority Date Filing Date
TW112126729A TWI867624B (en) 2023-07-18 2023-07-18 Method for improving LDPC decoder parallelism by dividing LDPC code

Country Status (1)

Country Link
TW (1) TWI867624B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080155372A1 (en) * 2006-12-21 2008-06-26 Radiospire Networks, Inc. Methods and apparatus for improving error indication performance in systems with low-density parity check codes
US9362955B2 (en) * 2010-09-10 2016-06-07 Trellis Phase Communications, Lp Encoding and decoding using constrained interleaving
US20160164537A1 (en) * 2014-12-08 2016-06-09 Samsung Electronics Co., Ltd. Method and apparatus for parallel concatenated ldpc convolutional codes enabling power-efficient decoders
US20170019211A1 (en) * 2014-03-17 2017-01-19 Lg Electronics Inc. Method and device for decoding low density parity check code for forward error correction in wireless communication system
WO2018126496A1 (en) * 2017-01-09 2018-07-12 Qualcomm Incorporated Bit allocation for encoding and decoding

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080155372A1 (en) * 2006-12-21 2008-06-26 Radiospire Networks, Inc. Methods and apparatus for improving error indication performance in systems with low-density parity check codes
US9362955B2 (en) * 2010-09-10 2016-06-07 Trellis Phase Communications, Lp Encoding and decoding using constrained interleaving
US20170019211A1 (en) * 2014-03-17 2017-01-19 Lg Electronics Inc. Method and device for decoding low density parity check code for forward error correction in wireless communication system
US20160164537A1 (en) * 2014-12-08 2016-06-09 Samsung Electronics Co., Ltd. Method and apparatus for parallel concatenated ldpc convolutional codes enabling power-efficient decoders
WO2018126496A1 (en) * 2017-01-09 2018-07-12 Qualcomm Incorporated Bit allocation for encoding and decoding

Also Published As

Publication number Publication date
TW202505878A (en) 2025-02-01

Similar Documents

Publication Publication Date Title
US6643644B1 (en) Method and apparatus for retrieving accumulating and sorting table formatted data
JP5460486B2 (en) Apparatus and method for sorting data
US11934696B2 (en) Machine learning assisted quality of service (QoS) for solid state drives
CN115668217A (en) Position mask for transformer model
US9053209B2 (en) System, method, and computer program product for performing graph coloring
US11249727B2 (en) Efficient matrix data format applicable for artificial neural network
TWI867624B (en) Method for improving LDPC decoder parallelism by dividing LDPC code
TWI866338B (en) Method for improving the latency of ldpc min-sum decoder by re-ordering the ldpc codes
CN118036678A (en) Automatic attention thinning method, device, electronic equipment and storage medium
CN107133271B (en) Semantic brain graph real-time expression system and operation method thereof
US20130013244A1 (en) Pattern based test prioritization using weight factors
CN113919255A (en) Digital circuit parallel static learning method and system under memory limitation
CN116578425A (en) Load balancing method and system based on rasterization
WO2011066057A2 (en) Grouping mechanism for multiple processor core execution
CN114428936B (en) Apparatus, method and medium for distributing processing threads for matrix-matrix multiplication
JP2013127750A (en) Partitioning device, method and program
Gal et al. Using machine learning clustering to find large coverage holes
CN113849298A (en) Information processing system, information processing equipment, information processing method and program
CN117520306B (en) Data verification method and device and electronic equipment
Paradis Enumeration and Simulation of Random Tree Topologies
JP2022521105A (en) Sampling device
CN116405155B (en) Method for decoding VDIF format data by using GPU
JP7496952B1 (en) Optimization device, optimization method, and program
CN1581190A (en) List generation system, winning list generation method and recording medium
Ben-Artzy et al. SpeLLM: Character-Level Multi-Head Decoding