Summary of the invention
The object of the present invention is to provide and a kind ofly can support employing 4 degree of parallelisms of binary system and duobinary compatible LTE and WiMAX standard, the Turbo decoder of base-16 structure simultaneously, to be applicable to the 4G mobile communication system such as 3GPP LTE/LTE-Advanced, WiMAX/WiMAX-Advanced (IEEE 802.16e/m).
Turbo decoder provided by the invention, adopts 4 degree of parallelisms, base-16 structure, completes chip design based on TSMC 65nm 1P9M LP CMOS technique, and chip core area is 1.39 mm
2, maximum clock frequency is 600MHz, the high-throughput under 5.5 iteration is: 821Mbps under LTE pattern, 810.6Mbps under WiMAX pattern.There is following improvement compared to other Turbo decoders: (1) proposes new high radix decoding architecture, by comparing in advance the branch in grid chart with same initial state and same state of termination, reduce the complexity of recursive operation unit and reduced critical path simultaneously; (2) for high degree of parallelism, high radix decoder, the interleaver of simplifying has been proposed, the interleaver in LTE and WiMAX standard all can be formed based on simple barrel shift network and two-dimensional address maker; (3) optimize memory construction, avoided memory access conflict, reduced space and the number of memory; (4) by algorithm optimization and resource-sharing, reduced computational complexity.
Turbo decoder provided by the invention, as shown in Figure 8, system configuration comprises its system configuration: the soft inputting and soft output sub-decoder (SISO) of external information memory cell, system information memory cell, check digit 1 memory cell, check digit 2 memory cell, P-road permutation network, base-16, corresponding control module, address generation module and hard decision buffer memory.Wherein, P-road permutation network, soft inputting and soft output sub-decoder and corresponding memory cell are all 4 degree of parallelisms, i.e. P=4 shown in figure.External information memory cell is mainly used to be stored in last iterative process and obtains external information, replace by permutation network, input soft inputting and soft output sub-decoder is as the prior information of next iteration, and the size of external information memory cell is N/P, N is code length, and P is degree of parallelism.System information memory cell is the system information to code word for storing received, the systematic code receiving, be input to soft inputting and soft output sub-decoder by permutation network, carry out iterative decoding as the system information in iterative decoding, system information memory cell size is N/P equally.And check digit 1 memory cell and check digit 2 memory cell are respectively for corresponding 2 check codes of storing received code word, by a selector, during different sub-iteration, send into different check digit, to complete iterative decoding.The effect of address generation module is according to the progress of iterative decoding and requirement, generates corresponding address and visits these memory cell, to read corresponding external information, system information and check digit information.Soft inputting and soft output sub-decoder is the core of whole decoder, input corresponding external information, system information and selection check position by permutation network, obtain the required initial value of decoding, carry out iterative decoding, after sub-iteration completes, obtain posterior information, posterior information stores the external information as next second son iteration in external information memory cell into by permutation network.And control module is for total control address generation module and the process of permutation network and iterative decoding.After iterative decoding completes, the code word that obtains translating by hard decision, exists in hard decision buffer memory, exports as decoding.
In the present invention, system information and external information adopt four-quadrant to divide, to ensure the real-time storage of data in the decoding of base-16.Each clock cycle, 4 parallel sub-decoders need read 16 groups of intrinsic information and external information altogether, they and through 4 group of 4 parallel permutation network in tunnel.After maximum posteriori decoding, each clock cycle writes back 16 class values to external information.Wherein, address generation module provides 16 addresses under LTE pattern, 8 addresses are only provided under WiMAX pattern, their hardware resource can be completely multiplexing, and permutation network has stronger configurability, all code lengths and 1,2 under can two kinds of patterns of complete support, 4 three kinds of degree of parallelisms, permutation network is all based on simple barrel shift network.
Below some modules are described further:
1. configurable base-16 soft inputting and soft output sub-decoder
Soft inputting and soft output sub-decoder is to forward state metric α, the unit that backward state measurement β and branch metric γ calculate, as shown in Figure 9, the RAM that the inside has comprised twoport is for depositing calculative numeral information, branch metric calculation unit calculates after branch metric, be kept in register, for follow-up forward state metric and the calculating of backward state measurement, the forward state metric calculating and backward state measurement are kept in by register again, calculate forward and backward state measurement as next iteration, the corresponding LLR value of forward-backward algorithm metric calculation output that LLR arithmetic element utilization is simultaneously current are preserved.
The soft inputting and soft output sub-decoder of the present invention's design can be supported binary system turbo code and duobinary system turbo code, can support sliding window (SW) and two kinds of modes of two-way simultaneous window (PW), can support information transmission and training calculate two kinds of recursive operation initialization, and according to the pattern of rate adjust window and initialize mode, there is dynamic control power consumption.
In fact this configuration provides several important mechanism: be first to peek by two-way simultaneous, this is the prerequisite that reduces time delay, secondly LLR and backward training can be carried out to hardware resource sharing.Due to LLR arithmetic element can with forward recursive walked abreast also can with backward recursive arithmetic element, therefore the key of concurrency is the concurrency of forward direction and backward recursive.In the middle of traditional sliding window flow process, it should be noted that the grid chart between sub-block joins, therefore also can transmit mode metric.
Number at window has very high similitude in fact compared with hour two kinds of flow processs, and PW is more effective, and at this moment in PW, needed training calculating does not need to increase hardware, only need two existing arithmetic elements to calculate, therefore need classification discussion, comprise a window, two windows, the situation of multiple windows.Training computing in the time of high code check, open, in the time of low code check, can close or reduce to train length, performance loss is not obvious because in the time of different code check performance loss difference, this becomes the basis of optimised power consumption.
2. address generation module---configurable ARP and QPP interleaver
The interleaver of the turbo code using in distinct communication standards is conventionally different, and in the design of many standards decoder, interleaver designs is a key issue flexibly.ARP interleaver is a polynomial form, and as Fig. 5, QPP interleaver is quadratic polynomial form.In fact parallel interleaver can utilize the relation between parallel address, and recurrence relation between adjacent periods address is simplified the calculating of interleaving address.What in WiMAX standard, adopt is ARP interleaver, also can calculate interleaving address by recursive form.
Because window is long larger, write address is larger by the method cost of reading to store address, therefore adopts the mode of calculating in real time here, adopts the address generator of two forward directions; Two backward addresses can be obtained by LIFO, or use two reverse address generators.
3. memory cell
Storage resources under two kinds of patterns can efficient multiplexing, notice that duobinary system turbo code carries out decoding based on symbol, the external information of transmitting between adjacent sub-iteration is 2 times of binary system turbo, by the definition of log-likelihood ratio in algorithm, can be reduced to 1.5 times.For base-16 decoding, application four-quadrant splitting scheme.The each symbol of WiMAX (2-bit information bit) need to be stored 3 external information values, and therefore, under code length same case, WiMAX need to increase by 50% external information memory capacity.External information is converted to bit-level storage from symbol level, while reading, is again converted to symbol level, make like this external information memory cell of duobinary system and traditional binary decoding can be completely multiplexing.But this method can cause the performance loss of 0.1 ~ 0.2dB left and right, in the situation that degree of parallelism is lower, this part external information is higher at the proportion of whole decoder, and in the decoding architecture of highly-parallel, proportion can constantly decline.Therefore the present invention does not adopt this external information conversion method, but retains this department's storage resources to obtain good decoding performance.
Embodiment
Below in conjunction with accompanying drawing to the present invention specifically realize optimization and improvement, be described in further details.
1. the fusion of decoding algorithm in the soft inputting and soft of base-16 output sub-decoder (SISO):
Although binary system and duobinary grid chart are distinct, by making grid chart there is identical annexation after reordering.Concrete, the position of 2,7 two states of WiMAX is exchanged, the position of 3,6 two states exchanges, and just can obtain the grid chart through rearrangement.The grid chart shape of it and Binary Turbo codes is closely similar, and in each like this state measurement recursive operation unit, the input of state variable is consistent, does not therefore need extra selector.Because inside, recursive operation unit is difficult to insert pipeline register conventionally, therefore, this effect that can reduce critical path is very useful.
Can make the two there is similar state transitions situation by rearrangement, but due to their difference of coding structure, its output situation in the situation that input information bits is identical is also inconsistent, therefore in branch metric and log-likelihood calculations, considers this species diversity again.Fig. 6 has provided the situation of two kinds of code state transitions and State-output.For its base-4 grid chart of Binary Turbo codes, obtain by conversion, therefore its state transitions be from
moment is transferred to
moment.For duobinary system Turbo code, each moment of encoder input dibit, therefore its grid progression only has the half of code length, its state transitions be all from
arrive
moment.
According to the information bit of each state transitions output, the combined situation of check bit, have 16 kinds of branch metrics, table 2 has provided the branch metric calculation formula after fortran.
Each branch metric can be split as two parts, and wherein a part is only relevant to information bit, and another part is only relevant to check digit.The former is prior information and the intrinsic information sum of information bit, we are referred to as measure information (Information-Only Metric, IOM), and the latter is check digit intrinsic information sum, we are referred to as verification tolerance (Parity-Only Metric, POM).By this fractionation, the branch metric calculation under two kinds of patterns can be completely multiplexing.After the branch metric publicity conversion of introducing, the measure information under LTE pattern has
,
,
, 0 four kinds of values, the measure information under WiMAX pattern has
,
,
, 0 four kinds of values.And verification tolerance is under two kinds of patterns
,
,
, 0 these four kinds of values.Meanwhile, in branch metric buffer memory, only need to store three measure informations and two check digit intrinsic information (or three verification tolerance).
Searching out the similitude of two kinds of pattern inferior division metric calculation formula and carrying out after fortran, the amount of calculation of branch metric greatly reduces, and needed computational resource and storage resources be corresponding reducing also.Adopt original method, the calculating of each branch metric needs 5 adders, altogether needs 80 adders.And adopt after the computational methods of simplifying, altogether only need 13 adders, reduce 83.75%
Table 2 bimodulus base-4 decoding branch metric calculation formula
。
In order further to make hardware resource better multiplexing, the computing formula of base-4 log-likelihood ratio to two-stage system Turbo code converts, and can obtain the computational methods of the symbol level posterior probability identical with duobinary system Turbo code.32 possible state transitions branches in grid chart, according to the value condition of information bit, are divided into 4 classes, in each class, have 8 kinds of situations, then by the posterior probability of forward state metric and backward state measurement and branch metric addition acquisition symbol level.Due to rear in the computational process of state measurement, as calculated all situations of branch metric and backward state sum, therefore the two sum in formula
,
,
,
results of intermediate calculations that can backward state measurement is carried out buffer memory and is obtained, and this improvement can be saved 32 adders, and for high degree of parallelism decoder, this improvement can be saved a large amount of calculation resources and power consumption.
(3)
By 4 symbol level posterior probability
,
,
,
can calculate the external information of Binary Turbo codes
,
or the external information of duobinary system Turbo code
,
,
.But both external information computational process difference are very large, can not allow fusion, need to calculate log-likelihood ratio and the external information of bit-level for LTE pattern, to calculate the soft output of symbol level for WiMAX pattern.Under LTE pattern, their computational methods are:
(4)
(5)
Notice
,
these two were just calculated before Branch Computed tolerance with value, therefore need not recalculate.
Under WiMAX pattern, their computational methods are:
(6)
(7)
Notice
,
,
these three were just calculated before Branch Computed tolerance with value, therefore need not recalculate.
The output hard decision of Binary Turbo codes can obtain according to bit-level log-likelihood ratio information, is normally obtained by following method for the output judgement of Turbo code:
(8)
The consistency of considering subtraction operation and compare operation, the generation of the two hard decision information all can, according to formula below, be adjudicated according to the sign bit of bit-level posterior probability log-likelihood ratio.
(9)。
The Binary Turbo codes that adopts base-4 decoding, the calculating of its posterior probability also can be converted to the calculating of symbol level probability.It should be noted that, although formula 3 shows that both posterior probability computing formula are in full accord, but because the state transitions situation of the two encoder is also inconsistent, 32 kinds of paths situation in the time being divided into 4 groups according to information bit is also inconsistent, therefore must increase extra data selector, realize two kinds of compatibilities under pattern.
Also can calculate the posterior probability of bit-level for duobinary system Turbo code, but because its coding and decoding process is based on symbol, therefore can cause obvious performance loss.There is document to propose a kind of method converting between symbol level and bit-level, by stored bits level external information, in serial decoding device, reduced the storage resources of 20% left and right, can reduce performance loss by inverse transformation.But existing by performance simulation hair, this method still can cause the BER/FER performance loss of 0.1dB ~ 0.2dB left and right, on the other hand, for high degree of parallelism decoder, external information memory shared area proportion in whole decoder less (in 4 degree of parallelism base-16 decoders that the present invention realizes, external information memory proportion is less than 2%), therefore, symbolization level log-likelihood ratio storage external information of the present invention, to reduce performance loss.
2. a new high radix decoding architecture, by comparing in advance the branch in grid chart with same initial state and same state of termination, reduces the complexity of recursive operation unit and has reduced critical path simultaneously:
Computation complexity increases along with radix increases and is index, and the critical path of state measurement recursive operation unit also can constantly increase, and therefore, must take effective method reduce the complexity of high radix decoding and promote its operating frequency.
Due to the high radix Jia-ratio of state recursive operation unit-select circuit to occupy the more hardware resource of SISO sub-decoder, and limit the maximum clock frequency of decoding, high radix Jia-ratio-select the needs of circuit improve and optimize.There is not recursive operation in the calculating of branch metric and posterior probability, can adopt pipelining to promote operating frequency, but their computational complexity also can be along with radix increases and increases rapidly.Therefore, also the low complex degree of considering branch metric calculation unit and posterior probability computing unit is realized.
Gao Jijia ratio selects circuit implementation to mainly contain two kinds: a kind of is grid chart based on cascade by some groups of low order Jia-ratios-select circuit to carry out cascade, and another kind is that addition and comparison operation are carried out cascade by the grid chart based on merging.The present invention proposes a kind of high radix recursive operation unit of novelty, it is limited observing state number, therefore Gao Jijia can not exceed status number than the possible situation of the state measurement input of selecting circuit, have in the path of common initial state or common state of termination for these, maximum path only need to can obtain by the maximum of respective branch tolerance relatively, instead of must first computing mode tolerance and branch metric with could calculate maximum path.
In the renewal process of each state measurement, need the number of path of comparison for being radix, because the number of state measurement is certain, therefore branched measurement value that only need to be more identical from two paths of same state, is kept to state number by path number relatively like this.This is equivalent to and has compressed the path number that participates in recursive operation, and obviously, in the time that radix is greater than status number, this method is very effective.Because the status number of Turbo code is generally 8, therefore, in the time of base 16, can adopting said method the required 16 road acs unit of base 16 be reduced to 8 road acs units, can reach the effect that reduces complexity and reduce critical path simultaneously.As shown in Figure 1.
Fig. 2 provided traditional base-
state measurement recursive operation unit, the number of adder is consistent with radix, and need to be from
individual adder output compares maximum.Base provided by the invention-
state measurement recursive operation cellular construction as shown in Figure 3, first compares
individual branch metric, is divided into 2
mgroup compares, and draws the branch metric of each group of maximum, is then added with the larger branch metric drawing and corresponding state measurement, by 2
mthe comparator of input compares, and finally obtains optimum path.In recursive operation unit in Fig. 3 owing to having compared in advance part branch metric, therefore the number of adder is reduced to status number, also corresponding reducing of the input number of comparator, in reducing computational complexity, also make the critical path of this recursive operation unit reduce, thereby obtain higher operating frequency.
In the decoding of base-16, owing to having increased the step-length that in grid chart, state measurement upgrades, every four grid progression only calculate and store one-level state measurement value, and this makes the calculating of soft output more complicated.The method of calculating the log-likelihood ratio of base-16 has two kinds at present, a kind of method is that 4 grades of 128 kinds of state status corresponding to grid are divided into 8 groups according to every bit value condition, every group has 64 kinds of state transitions situations, this method need to be used 256 adders and 8 Zu64 road comparators, and amount of calculation is very large.Another kind method is the posterior probability that first calculates 4 bit symbols, is then converted to the log-likelihood ratio of bit-level, and this method still needs number of adders constant, and comparator is reduced to 16 group of 8 input comparator and 8 group of 8 input comparator.By using the intermediate object program of add operation in backward state measurement, the number of adder can reduce half.For base-16 in this paper, owing to having simplified the add operation of recursive operation unit, therefore the intermediate object program of adder no longer can be unit multiplexed by log-likelihood calculations.What the present invention taked is the state measurement that first recalculates the middle one-level of 4 grades of grids, then calculate respectively 2 bit symbol posterior probability corresponding to front two-stage and rear two-stage, then be converted to respectively symbol rank probability, by this new method, the number of adder and comparator can reduce, and its complexity is low.The computational methods of log-likelihood ratio in base-16 decoding architecture of the present invention's design, as shown in Figure 4, wherein, base-16 to be split into two base-4 calculate the LLR value of each bit (top and the bottom on the right in figure), in figure, the state measurement in the middle of base-16 is first calculated by ACS-4 in the left side, above 8 ACS-4 unit calculate forward state metric, below 8 ACS-4 calculate backward state measurement, 4 paths before and after comparing respectively, select optimal path and have calculated state measurement.And utilize intermediate value alpha+γ, the β+γ of the calculating of 16 ACS-4, then send into ACS-8 and be added and compare with corresponding state measurement, obtain LL
uvthe value of (uv=00,01,10,11), obtains the finally LLR value of each bit according to these values.
3. for high degree of parallelism, high radix decoder, proposed the interleaver of simplifying, the interleaver in LTE and WiMAX standard all can be formed based on simple barrel shift network and two-dimensional address maker:
The interleaver of the turbo code using in distinct communication standards is conventionally different, and in the design of many standards decoder, interleaver designs is a key issue flexibly.ARP interleaver is a polynomial form, and QPP interleaver is quadratic polynomial form.In fact parallel interleaver can utilize the relation between parallel address, and recurrence relation between adjacent periods address is simplified the calculating of interleaving address.What in WiMAX standard, adopt is ARP interleaver, also can calculate interleaving address by recursive form.
Because window is long larger, write address is larger by the method cost of reading to store address, therefore, the present invention adopts the mode of real-time calculating, the address generator that adopts two forward directions, two backward addresses can be obtained by LIFO, or use two reverse address generators.
4. optimize storage structure, avoids memory access conflict, reduces space and the number of memory:
Except total storage size, the bit wide of memory, the degree of depth, type (single port, twoport, SRAM, Register Files etc.), Deng area and power consumption that all can appreciable impact memory, in addition, except these factors, read-write clock frequency, read-write number of times in the unit interval also can appreciable impact power consumption.
In storage resources, the storage size of channel intrinsic information and external information is determined by maximum code length and the lowest bit rate of system, does not change along with sub-block degree of parallelism.Adopt after sliding window setting technique, the branch metric in each MAP sub-decoder and state measurement memory space are by the long decision of window, and the value of boundary condition tolerance reduces along with the increase of degree of parallelism.And when degree of parallelism is lower, channel intrinsic information and external information storage account for the largest percentage, therefore can consider the optimization of its memory.Along with the increase of degree of parallelism, the storage resources proportion in MAP sub-decoder constantly increases.
External information need to provide access hole conventionally simultaneously, and the method therefore realizing has double port memory, adopts two single port memories of ping-pong operation, can be operated in the memory of Clock Doubled.Find by the research to ARP and QPP interleaver, it can be divided into Ji Qiouji or four-quadrant set according to the lowest order of address, and between sequence address and interleaving address, meet the rule of mapping one by one, therefore divide to realize by rational memory block and use single port memory.State measurement employing method for normalizing can increase the bit wide of storage, but the increase of this part is very limited.
5. by algorithm optimization and resource-sharing, reduced computational complexity:
Although the present invention uses after (Scaling)-MAX-Log-MAP decoding algorithm, addition and comparison operation in decode procedure, are only comprised, but meeting under high-throughput requirement, concurrency in algorithm is fully excavated, and be mapped to a large amount of hardware resources, how in the process that completes necessary calculating, to excavate shared data and computing can be saved great amount of hardware resources.
From the arthmetic statement process of MAX-Log-MAP, can see that in decoding, main arithmetic operation comprises: branch metric calculation, state measurement recursive calculation, posterior probability are calculated, in fact in existing document, conventionally only describe state measurement recursive calculation and improve one's methods, and branch metric and posterior probability are calculated description and improve less, but in high radix decoding circuit, the area that Zhe Liang department is shared and power dissipation ratio regular meeting enlarge markedly, and therefore become one of key of design of encoder.
The computation complexity of branch metric is also relevant to lowest bit rate and decoding radix that system is supported.The lowest bit rate of CDMA2000, the EV-DO that for example 3GPP2 organizes to set up, middle regulation is 1/5, and along with the increase of decoding radix, the complexity of branch metric also increases rapidly on the other hand.
What in the calculating due to Max-Log-MAP algorithm state tolerance and posterior probability, rely on is the relative size of state measurement and branch metric, therefore can do a fortran to branch metric, difference is participated in to computing as new branch metric, thereby reduce storage and the amount of calculation of branch metric.Be decoded as example with base-2, original branch metric formula is:
(1)
Here we deduct 4 kinds of branch metrics
this branch metric, thereby the branch metric calculation formula being simplified:
(2)。
Table 1 has shown and utilizes above-mentioned fortran's method can save more calculation resources.
The branch metric calculation formula (3GPP standard, 1/3 code check) of table 1 base-4 decoding architecture
。