Performance Enhancement of Video Compression Algorithms With SIMD
Performance Enhancement of Video Compression Algorithms With SIMD
Performance Enhancement of Video Compression Algorithms With SIMD
The algorithms, however, consist of repetitive and regular operations by nature, which
could benefit greatly from the use of some architectures that are better able to perform
such repetitive tasks efficiently. In recent years, general purpose microprocessors have
also been endowed with functional units capable of Single Instruction Stream Multiple
Data Stream (SIMD) operation. This project attempts to study the speedup achievable for
the most critical parts of these algorithms by utilizing the Streaming SIMD Extensions 2
from the Intel Pentium 4 processors. We also deal with the improvement schemes for the
DCT algorithm on the SSE2 architecture.
Current block
Since most of the video streams have a frame rate ranging from 15 to 30 frames per
second, there is never a very large motion of any object between two successive frames.
Therefore most search algorithms search for matching block in the neighborhood of the
position of the current block in the next frame. The region where matching block is
searched for is called the search region. Search region around a block is shown in the
figure 2.1.1.
The choice for the value of p will depend upon the type of broadcast that has to be sent.
For fast-moving videos such as sports events a higher value of p such as 16 or 32 may be
used. On the other hand for broadcasts with less motion such as a news-telecast a smaller
value of p such as 4 or 8 may be used.
(x,y)
(x+u,y+v)
u,v
The task to be performed by the search algorithm is to find the best match for a block in
the current frame in the next frame. A typical block size is 8x8 or 16x16 pixels. The
quality of match found will depend on the value of Mean Absolute Error, more
commonly known as MAE between the blocks. This is the average absolute pixel-
wise difference between two blocks, reference block in the current frame and
probable match found in the next frame. The matching block is figured out on the
basis of the magnitude of the value of its mean error. Smaller the magnitude better is
the match. The displacement of the block with the minimum MAE is taken as the
motion vector.
Next we explain the two search algorithms that were used in this project.
2.1.1 Full Search
Full search is an exhaustive search algorithm. Full search is the simplest method to find
the motion vector for each block; in it the MAE(i,j) is found at each point in the search
region. Thus a search for the match block is made in the complete (-p, +p) range in the
future frames for every block of the current frame.
For each motion vector, there are (2p) 2 search locations. At each search location (i,j) we
compare N x M pixels. Each pixel comparison requires three operations, namely: a
subtraction, an absolute value calculation and an addition. We ignore the cost of
accessing the pixels C(x + k, y + l) and R(x + i + k, y + j + l). Thus the total complexity
per block is (2p) 2 x MN x 3 operations. For a picture resolution of I x J, and a picture rate
of F pictures/second, the overall complexity is IJF/MN x (2p) 2 x MN x 3 operations per
second.
But this makes it a very intensive method computationally. CPU time for full search is the
highest of all the algorithms. At the same time the accuracy of Full search is also highest
and the best match for every block in the current frame is always found. Full search,
therefore is a benchmark for comparison of the quality of a search algorithm, which as
was previously mentioned depends on CPU time and accuracy. There is a trade-off
between the efficiency of the algorithm and the quality of the prediction image. Keeping
this trade-off in mind a lot of algorithms have been developed.
Step 1: An initial step size is picked. Eight points at a distance of step size from the
centre (around the centre point) are picked for comparison.
Step 2: The step size is halved. The centre is moved to the point with the minimum
distortion. Steps 1 and 2 are repeated till the step size becomes smaller
than 1. A particular path for the convergence of this algorithm is shown
below:
The block diagram of the Encoder is given below in Figure 2.2.1 in order to illustrate the
idea.
Frame (n)
I(x,y,t) Motion
u,v
Estimation
Frame (n+1)
I(x,y,t+1)
Motion
Compensation
DCT coding
C(k) = () if k = 0
C(k) = 1 otherwise
The DCT transformation can be viewed as the process of finding for each waveform, the
corresponding weight Y(k,l) so that the sum of 64 waveforms scaled by the
corresponding weights Y(k,l) yields the reconstructed version of the original 8 x 8 block.
Energy compaction of DCT is among the highest next only to the Karhunen- Loeve
Transform. This means that the information can be compressed to a very high degree with
DCT, which is why DCT is commonly used. At the same time DCT also minimizes the
block artifact that is present in many other transforms due to the favorable periodic nature
of DCT.
2.4 Quantization
The human eye is not sensitive to the high frequency content in an image. Therefore
removal of these spatial frequencies does not lead to any perceptible loss in image
quality. This is the basic principle behind quantization. The spatial frequency content of
the image is obtained by using the DCT operation, which is followed by a removal of the
high frequency content that is the quantization process.
The JPEG standard recommends standard values of quantization tables which are used to
deemphasize higher frequencies in the DCT image. Quantization is a lossy process and
some data is lost during quantization. This loss of information is irreversible.
2.5 Run Length Encoding (RLE)
Run-length encoding is the next stage of the compression process. It encodes the runs of
zeroes. If pixel values are correlated to their neighbors, then there will be sequences of
the same value. Instead of coding all the repeat values, just encode the first value and
then give the run length of the sequence. Intuitively, one can understand how RLE can
help in achieving compression. Suppose the data is 000000(ten times). Now instead of
writing ten zeroes one can send only 0-10, which could be taken to mean that a zero
occurs 10 times. This is how compression is achieved in Run-Length Encoding. Runs of
zeroes are encoded in a 16 bit or 8 bit format.
Shannon has proved that the entropy of the total message gives the most efficient code,
with minimum average code length, for sending a message.
Given n symbols S1 to Sn-1 with probabilities of occurrence P1 to Pn-1 in a certain
message, the entropy of the message will be given by
Entropy = Pi log2 (1/ Pi)
Huffman encoding attempts to minimize the average number of bits per symbol and try to
get a value close to entropy.
Example:
We describe the algorithm for Huffman encoding with the help of an example.
Consider 4 symbols with probabilities as shown in the first column of the table.
Step 1: Sort the probabilities and arranged in descending order as shown in the column
marked Probability.
Step 2: Add the probabilities of the last two symbols and add them to the next column
after sorting the values.
Step 3: Continue steps 1 and 2 until only two symbols remain.
Step 4: Assign bit 0 to upper symbol and 1 to the lower symbol. (or vice-versabut then
this format should be followed throughout the process.)
Step 5: Trace back the probabilities according to where they have come from in the
previous column and append a 0 or 1 depending on the format chosen above.
Step 6: Follow this procedure up to first column and you have the variable length code
ready.
Huffman encoding isnt implemented in this manner in the JPEG image compression
standard. The standard code tables for Huffman coding are defined in the standard for the
DC values and the non-DC values as well. These values are looked up both for encoding
and decoding. Huffman code is a prefix code and hence it can be uniquely decoded.
The other alternative methods for entropy coding or source coding are Shannon-Fano
encoding, and arithmetic coding. Arithmetic coding has even been adopted in the JPEG
2000 standard for entropy coding after IBM agreed to release its patent on this technique
for the JPEG 2000 standard.
3. Key Bottlenecks
After performing an analysis of the Video compression algorithms and a survey in the
literature to improve its performance we were able to identify two algorithms (i.e. Motion
estimation and DCT), which are the most resource intensive and in which a very high
proportion of the time in a video compression algorithm is spent. Motion Estimation
requires making use of highly repetitive methods applied to the whole image. DCT is
essentially a matrix multiplication loop which is to be performed on every 8 or 16 pixels
of the image. Also with the increase in image resolution the problem becomes even worse
as the loop iterations will increase and will require more computational resources.
However, it can be seen that there isnt any data dependence between the various data
elements that are used in the algorithm. Therefore it is possible to try and improve
performance of these programs by exploiting parallelism inherent in these media
algorithms and running different data points in parallel to obtain higher throughput.
SIMD architecture exploits this parallelism by use of increase datapath size and
performing the same operations on the different data point (in our case pixels).
4. SIMD Architecture:
Usually, processors process one data element in one instruction, a processing style called
Single Instruction Single Data, or SISD. In contrast, processors having the SIMD
capability process more than one data element in one instruction. The Single Instruction
Stream Multiple Data Stream (SIMD) Architectures perform operations on many
elements in a lockstep fashion. The same instruction is performed on different data
elements computed by different functional units.
The Intels MMX/SSE/SSE2, AMDs 3DNow, Power PCs Altivec ISA extensions are
testimonial to the benefits of SIMD support to traditional superscalar.
4.1 Intels Streaming SIMD Extensions
SIMD Extensions for the IA-32 ISA began with the Multimedia Extensions (MMX) in
1997 for the Pentium processor. MMX datapath of 64 bits subword parallel ALUs for
bytes, words and doublewords enhanced its performance on multimedia benchmarks.
However, these instructions had a very limited function, in that only integer data-types
could be handled. Also since the MMX instructions utilized the floating point registers, it
was very hard to inter-mingle floating point and MMX instructions.
Streaming SIMD Extensions (SSE) from the Pentium III marked the advent of 68 new
instructions to the IA-32 ISA, in particular the MMX. The biggest winners from the new
instructions were applications that handled 3D or streaming media, as applying identical
instructions to multiple pieces of code was now handled in parallel. AMD wasn't idle
over this time though, and introduced 3DNow! to the world. This much catchier-sounding
set offered capabilities similar to those made possible by SSE, but was incompatible with
it.
The SSE2 technology from the Pentium-4, introduced new Single Instruction Multiple
Data (SIMD) double-precision floating-point instructions and new SIMD integer
instruction into the IA-32 Intel architecture. The 128-bit SIMD integer extensions are a
full superset of the 64-bit integer SIMD instructions, with additional instructions to
support more integer data types, conversion between integer and floating-point data
types, and efficient operations between the caches and system memory. These
instructions provide a means to accelerate operations typical of 3D graphics, real-time
physics, spatial (3D) audio, video encoding/decoding, encryption, and scientific
application.
4.1.1 SSE Vs MMX
MMX and SSE, both of which are extensions to existing architectures, share the concept
of SIMD, but they differ in the data types they handle, and in the way they are supported
in the processor.
MMX instructions are SIMD for integers, while SSE instructions are SIMD for single-
precision floating-point numbers also. MMX instructions operate on two 32-bit integers
simultaneously, while SSE instructions operate on four 32-bit floats simultaneously.
A major difference between MMX and SSE is that no new registers were defined for
MMX, while eight new registers have been defined for SSE. Each of the registers for
SSE is 128 bits long and can hold four single-precision floating-point numbers (each
being 32 bits long). The arrangement of the floating-point numbers in the new data type
handled by SSE is illustrated in Figure 4.1.
The immediate question is: Where did the registers for MMX come from? The MMX
registers were allocated out of the floating-point registers of the floating-point unit. A
floating-point register is 80 bits long, of which 64 bits were used for an MMX register. A
limitation of this architecture is that an application cannot execute MMX instructions and
perform floating-point operations simultaneously. Additionally, a large number of
processor clock cycles are needed to change the state of executing MMX instructions to
the state of executing floating-point operations and vice versa. SSE does not have such a
restriction. Separate registers have been defined for SSE. Hence, applications can execute
SIMD integer (MMX) and SIMD floating-point (SSE) instructions simultaneously.
Applications can also execute non-SIMD floating-point and SIMD floating-point
instructions simultaneously.
The arrangement of the registers in MMX and SSE is illustrated in Figure 4.2. Figure
4.2(a) illustrates the mutually exclusive floating-point and MMX registers, while Figure
4.2(b) illustrates the SSE registers.
MMX and SSE have one more similarity: Both have eight registers. MMX registers are
named mm0 through mm7, while SSE registers are named xmm0 through xmm7. For the
purpose of our experiment we make use of the SSE2 extensions.
4.2 SSE2 Coding Techniques
There is limited compiler support available for the SIMD ISA extension. As a result to
make use of the rich features provided by this extension we need to go through different
programming techniques. One can use one of the following techniques to code programs
with SSE2.
Advantages of using Intrinsics and the Vector Class Library is that the Intrinsics and
Vector Classes free the programmer from managing registers while ensuring easier
maintenance and modularization of code. The compiler optimizes instruction scheduling
and register allocation and hence the executable runs faster.
- i , si, su, pi, pu, epi, epu indicates an integer, 64-bit signed or unsigned integer,
128 bit (ep) signed or unsigned extended precision operation for 8, 16, 32 or 64
bits.
To use the intrinsics library, the file xmmintrin.h must be included. Thus we chose to
utilize the Intrinsics style of coding for Motion Estimation Algorithms. We chose the
Intels C++ compiler over the Microsofts Visual Studio pack to compile our motion
estimation algorithms. For most of the parts we made use of normal C code constructs.
However in cases where we could exploit parallelism with SIMD we made use of SSE2
intrinsics to indicate to the compiler its use.
4.3.1Code Snippets
Blockdiff is the main computationally intensive function call in the program. It also can
make use of the SSE2 features to improve its performance. We change the code using
intrinsics to employ the SSE2 datapath. Snippet below provides the blockdiff function
call.
int blockdiff(int x1,int x2,int y11,int y22)
{
unsigned char block1[16], block2[16];
int i1,j1,k1,ch,offset1,offset2;
int diff1[16][16], totaldiff = 0;
FILE *fp1, *fp2;
__m128i *b1,*b2,m1;
union mmx
{
__m128i m;
short int x[8];
}m;
.
.
.
The top part shows the declarations inside the function, while the bottom part shows the
calculation of the difference using the SSE intrinsic. We define a union called mmx,
which can be used to address the m register of the mmx datatype __m128i and as an
array of 8 intgers as well. This __m128i register consists of 16 8-bit integer values.
The block1 and block2 arrays will contain the 16 8-bit pixel values from the image.
These are typecast into the __m128i format and put into the locations pointed by b1 and
b2. Next the __mm_sad_epu8() instruction finds the sum of differences of these 16
values directly and places it in the m1 register. Totaldiff adds up the total difference from
previous iterations and this one.
union mmx
{
__m64 m;
int x[2];
}m;
.
.
.
// type casting pointers.
b1 = (__m64*)block1;
b2 = (__m64*)block2;
The operations performed are similar but the datatype and intrinsic used are different.
The datatype used is the __m64 type, which consists of 8 8-bit values. The intrinsic used
to calculate the sum of differences is the _m_psadbw() operation.
Totaldiff will again contain the accumulated difference from all the iterations.
4.3.2. Results
Without SSE With SSE
Figure 4.4.3 Part of frame (4) Figure 4.4.4 Part of frame (5)
Figure 4.4.5 Part of Predicted frame Figure 4.4.6 Part of Predicted frame
(5) with block size of 8 x (5) with block size of
8 16x16
We draw the following conclusions from the results given above. We notice that the
speedup is by a factor of 3-4 for most programs with SSE. Also the 8 x 8 block programs
for both algorithms take longer to execute compared to the 16 x 16 block programs. The
reason is that the loop overhead for the programs goes up, even though the number of
addition or subtractions to be performed are the same. From the images we see that the 8
x 8 blocks perform a better job at matching than the 16 x 16 blocks. The predicted images
after motion compensation show that the 8 x 8 blocks are better suited for tracking
movement of the smaller image regions with these algorithms.
4.4 DCT
Our second candidate algorithm which is highly computationally intensive is the DCT. It
is essentially matrix multiplication of an image block by a DCT constant multiplication
matrix. This structure can also make use of the SIMD architecture to improve
performance. For the following section we present a few suggestions that would improve
the performance of the DCT algorithm. However, we dont go into the performance
comparisons of the algorithms due to lack of the capability to add additional functionality
to the compiler tool. Below is a code snippet for the DCT algorithm.
void DCT (int InBlock[][8], int OutBlock[][8])
{
int TempBlock[8][8], CosTrans[8][8];
Transpose(CosBlock, CosTrans);
(a) (b)
(c) (d)
Figure 4.4.2:Matrix Multiplication for computation of a single element.Parts a,b,c,d
show the various steps for obtaining a single result for matrix multiplication
The traditional method requires that the row and column elements of the two matrices
that are multiplied be accessed one at a time and a MAC operation performed. This will
require 64 sequential operations of accessing the elements from memory and multiply
accumulate. However, when we employ the use of the SIMD architecture it will require
16 operations on the SIMD architecture. The illustration is given below.
(a) (b)
(c) (d)
Figure 4.4.3:Matrix Multiplication for four elements.Parts a,b,c,d show the
computation on the SIMD platform
Essentially, because of the large datapath of the SSE architecture, it is possible to
concurrently perform operations that are independent from each other. Therefore, partial
products for 4 different elements of the matrix are carried on in parallel. Hence improving
performance at the cost of extra hardware.
We believe this will improve performance of the DCT code till upto 4 times the original
sequential code for DCT
4.4.2 Specialized hardware support for DCT on the SIMD architecture
Using the SIMD architecture of the SSE2 for the implementation of 8 point 1-D DCT
does improve performance over the use of simple C code implementation of 1-D DCT.A
specialized accelerator for DCT incorporated on the SIMD would improve performance
further.
The motivation of this study is therefore to study the trade off between the cost in terms
of hardware v/s performance improvement obtained by using a dedicated accelerator for
1-D DCT implementation. Choice of the DCT accelerator was highly driven by its
capability to scale with the SIMD architecture. Hence implementing distributed
arithmetic for DCT which is easily scalable with the SIMD architecture.
The 2-D DCT has been recognized as the most cost effective techniques among various
transform coding schemes for image compression. The DCT is one of the orthogonal
transforms and the N x N 2-D DCT is defined as follows
2 N 1 N 1
(2i 1)u (2 j 1)u
X (u , v) C (u ).C (v). x(i, j ). cos cos
N i 0 j 0 2N 2N
The 2-D DCT unit is comprised of two 1-D DCT units and a transpose operation. This 2-
D DCT is separated into two 1-D DCTs by the row-column decomposition technique.
The input data are fed into the first DCT unit where 1-D DCT is calculated in row order.
Then the intermediate data is transposed. Finally, the transposed data are inputted to the
second 1-D DCT unit and processed in column order.
The recursive fast DCT algorithm is used to calculate the eight point DCT as shown
X0 A AAA 0xx 7
X xx
1 2 B C C B 1 6
.
X4 2A C AA 2xx 5
X 6 C BB C 3xx 4
X1 D E F G 0xx 7
X xx
1 3 E G D F 1 6
.
X5 2F D G E 2xx 5
X 7 G F E D 3xx 4
, B cos , C sin , D cos
A = cos 4 8 8 16 , E cos 3 , F sin 3 , G sin
16 16 16
16
0.5
+
X2 X4 X6 X1 X3 X5 X7
16
16
R 0.25
16
The figure 4.2.2.1 shows the block diagram of the 1-D DCT processing unit on the SSE2
platform.The preprocessed values of addition and substraction can be obtains by the
parallel addition and substraction instructions on the SSE2.
Multiplier accumulator in the DCT core processor has been designed with the distributed
arithmetic. According to the distributed arithmetic, the parallel multipliers can be
eliminated from the core processor and the hardware amount is greatly reduced.
Furthermore, a very high speed operation can be achieved because the critical path is
formed in adder instead of multiplier.
Here, we illustrate the principle of distributed arithmetic (DA).Assume the input vector is
presented in N-bit twos complement code as follows:
N 1
x k bk 0 bkn .2 n
n 1
The multiply accumulate in the normal way can be presented as the following equation:
K
y a k .x k
k 1
K N 1
a k .(bk 0 bkn .2 n
k 1 n 1
where ak (k=1,2,3,.K) is the multiply coefficient. Based on the distributed arithmetic y
can be calculated as follows
N 1
K K
y a k .bkn .2 n a k .(bk 0 )
n 1 k 1 k 1
The multiply operation is implemented with a ROM that stores the precalculated partial
products. Therefore, the hardware of the multiply accumulation based on the DA includes
ROM and an adder that accumulates the partial products read from ROM.
In this case the two adjacent bits are processed simultaneously. Thus, two ROMs
required to offer a pair of partial products for higher and lower bits in every cycle, and
both have two banks for the two modes of DCT. The ROM size itself was reduced by 2 4
times by using the fast algorithm in conjunction with the DA scheme. With the present
configuration of the DAP(Distributed Arithmetic processor) structure we would require a
16 ROMs each 16 x 16 bits. Each of the DAPs will complete the operation in 8 cycles.
Hence the entire DCT is calculated in 8 cycles because all the DAPs work in parallel on
the SIMD architecture of SSE2.
The DAP structure was implemented using Verilog and synthesis was done using the
gflx-p library. The control flow for the DAP is given below:
The DAP will take a total of 8 cycles to complete. For all the fractional binary bits we use
the shift add method. Essentially we shift the operand right by 2 and add it every cycle as
shown in the DAP figure.
The Shiftadd complement method is used in case when DAP is processing the integer
part of the binary fixed point number. This can be further understood by looking into the
code placed in the appendix.
Synthesis Results
Area results:
Combinational Area : 6660.396
Non-Combinational Area : 1128.06
Interconnect Net Area :1211.4841
int i1,j1,k1;
FILE *fpold,*fpnew;
first = time(NULL);
}
}
//------------------------------------------------------------------------
//SUBPROGARM FOR SORTING THE DIFFERENCES
//-----------------------------------------------------------------------
motionx = x - 8;
motiony = y - 8;
//----------------------------------------------------------------------
// SUBPROGRAM TO OVERWRITE OLD IMAGE AND PRODUCE SHIFTED IMAGE
//----------------------------------------------------------------------
/* fpold = fopen("C:\\ECE734\\f4.raw","rb");
fpnew = fopen("C:\\ECE734\\f5recon.raw","r+b");
if(motionx != 0 || motiony != 0)
{
//SKIP PIXELS UPTO INITIAL POINT
fseek(fpold,(176 * y11) + x1, 0);
fseek(fpnew,(176 * (y11 + motiony)) + x1 + motionx, 0);//SKIP
PIXELS USING MOTION VECTOR FOR THE NEW IMAGE
fseek(fpold,160,1);
fseek(fpnew,160,1);
}
}
fclose(fpnew);
fclose(fpold); */
amad = amad + sqrt(motionx*motionx + motiony*motiony);
peldiff = peldiff + mindiff[i][j];
}
}
amad = amad/99;
printf("\nAMAD = %f", amad);
printf("\nPixel Difference %ld", peldiff/99);
second=time(NULL);
printf("\nDifference in time %ld", second - first);
getch();
}
//--------------------------------------------------------------------------
// FUNCTION BLOCKDIFF
//--------------------------------------------------------------------------
else
{
fp1 = fopen("C:\\ECE734\\f4.raw", "rb");
fp2 = fopen("C:\\ECE734\\f5.raw", "rb");
fclose(fp1);
fclose(fp2);
}// else loop end
return(totaldiff);
}
void main()
{
int i1,j1,k1;
FILE *fpold,*fpnew;
int diff1,diff,diffx,diffy;
amad =0.0;
DWORD start, finish, duration;
//first = clock();
start = timeGetTime();
printf("START TIME!! %ld\n", start);
diff = 1000000;
diffy =0;
diffx =0;
//j and i move the reference blocks
for(j=0; j<9 ;j++)//j<9 9*16 = 144
{
for(i=0; i<11 ;i++)//i<11 16*11 = 176
{
x1 = 16*i; //START OF BLOCKS IN REFERENCE IMAGE
y11 = 16*j;
//----------------------------------------------------------------------
// SUBPROGRAM TO OVERWRITE OLD IMAGE AND PRODUCE SHIFTED IMAGE
//----------------------------------------------------------------------
fpold = fopen("C:\\ECE734\\f4.raw","rb");
fpnew = fopen("C:\\ECE734\\f5recon.raw","r+b");
if(motionx != 0 || motiony != 0)
{
//SKIP PIXELS UPTO INITIAL POINT
fseek(fpold,(176 * y11) + x1, 0);
fseek(fpnew,(176 * (y11 + motiony)) + x1 + motionx, 0);//SKIP PIXELS
USING MOTION VECTOR FOR THE NEW IMAGE
fseek(fpold,160,1);
fseek(fpnew,160,1);
}
}
fclose(fpnew);
fclose(fpold);
// amad = 99.0f ;// 1.0;((float)motionx* motionx) + ((float)motiony*motiony);
peldiff = peldiff + diff;
}
}
//amad = amad/99;
// printf("\nAMAD = %f",amad);
printf("\nPixel Difference %ld", peldiff/99);
second = clock();
printf("\nDifference in time %ld", second);
getch();
}
//--------------------------------------------------------------------------
// FUNCTION BLOCKDIFF
//--------------------------------------------------------------------------
union mmx
{
__m128i m;
short int x[8];
}m;
else
{
fp1 = fopen("C:\\ECE734\\f4.raw", "rb");
fp2 = fopen("C:\\ECE734\\f5.raw", "rb");
if(fp1 == NULL || fp2 == NULL)
{
printf("File cannot be opened");
getch();
exit(0);
}
fseek(fp1,offset1,SEEK_SET);
fread(block1,1,16,fp1);
//for(i=0;i<16;i++)
//printf("%c \n",block1[i]);
fseek(fp2,offset2,SEEK_SET);
fread(block2,1,16,fp2);
fclose(fp1);
fclose(fp2);
}// else loop end
return(totaldiff);
}
A.3 Three Step Search Program with simple C for 16 x 16 image blocks
//------------------------------------------------------------------------
// PROGRAM TO IMPLEMENT 3 STEP SEARCH
//------------------------------------------------------------------------
#include<stdio.h>
#include<conio.h>
#include<math.h>
#include<stdlib.h>
#include<time.h>
void main()
{
FILE *fpold, *fpnew;
int i1,j1,k1,ch,c;
first = time(NULL);
locx = x1;
locy = y11;
p = 16;
while(p >= 1)
{
//ALGORITHM FOR 3 STEP SEARCH
//FOR POINT NO. 0
x2 = locx - p;
y22 = locy - p;
diff[i][j][0] = blockdiff(x1, x2, y11, y22);
//------------------------------------------------------------------------
//SUBPROGARM FOR SORTING THE DIFFERENCES
//-----------------------------------------------------------------------
if(l==0)
{
locx = locx - p;
locy = locy - p;
}
if(l==1)
{
locx = locx;
locy = locy - p;
}
if(l==2)
{
locx = locx + p;
locy = locy - p;
}
if(l==3)
{
locx = locx - p;
locy = locy;
}
if(l==4)
{
locx = locx;
locy = locy;
}
if(l==5)
{
locx = locx + p;
locy = locy;
}
if(l==6)
{
locx = locx - p;
locy = locy + p;
}
if(l==7)
{
locx = locx;
locy = locy + p;
}
if(l==8)
{
locx = locx + p;
locy = locy + p;
}
p = p/2;
} //while loop end.
if(motionx != 0 || motiony != 0)
{
//SKIP PIXELS UPTO INITIAL POINT
fseek(fpold,(176 * y11) + x1, 0);
fseek(fpnew,(176 * (y11 + motiony)) + x1 + motionx, 0);//SKIP PIXELS
USING MOTION VECTOR FOR THE NEW IMAGE
fseek(fpold,160,1);
fseek(fpnew,160,1);
}
}
fclose(fpnew);
fclose(fpold);//*/
}
}
second = time(NULL);
printf("Time taken %ld", second - first);//TO FIND CPU TIME
printf("\nAMAD=%lf", amad/99);
printf("\nPixel Difference %ld", peldiff/99);
getch();
}
//--------------------------------------------------------------------------
// FUNCTION BLOCKDIFF
//--------------------------------------------------------------------------
else
{
fp1 = fopen("C:\\ECE734\\f4.raw", "rb");
fp2 = fopen("C:\\ECE734\\f5.raw", "rb");
if(fp2 == NULL)
{
printf("File 2 cannot be opened");
getch();
exit(0);
}
if(fp1 == NULL)
{
printf("File 1 cannot be opened");
getch();
exit(0);
}
fclose(fp1);
fclose(fp2);
}// else loop end
return(totaldiff);
}
A.4 Three Step Search Program with SSE 2 intrinsics for 16 x 16 blocks
//------------------------------------------------------------------------
// PROGRAM TO IMPLEMENT 3 STEP SEARCH
//------------------------------------------------------------------------
#include<stdio.h>
#include<conio.h>
#include<math.h>
#include<stdlib.h>
#include<time.h>
#include<xmmintrin.h>
#include<sse2mmx.h>
void main()
{
FILE *fpold, *fpnew;
int i1,j1,k1,ch,c;
first = time(NULL);
locx = x1;
locy = y11;
p = 16;
while(p >= 1)
{
//ALGORITHM FOR 3 STEP SEARCH
//FOR POINT NO. 0
x2 = locx - p;
y22 = locy - p;
diff[i][j][0] = blockdiff(x1, x2, y11, y22);
//------------------------------------------------------------------------
//SUBPROGARM FOR SORTING THE DIFFERENCES
//-----------------------------------------------------------------------
if(l==0)
{
locx = locx - p;
locy = locy - p;
}
if(l==1)
{
locx = locx;
locy = locy - p;
}
if(l==2)
{
locx = locx + p;
locy = locy - p;
}
if(l==3)
{
locx = locx - p;
locy = locy;
}
if(l==4)
{
locx = locx;
locy = locy;
}
if(l==5)
{
locx = locx + p;
locy = locy;
}
if(l==6)
{
locx = locx - p;
locy = locy + p;
}
if(l==7)
{
locx = locx;
locy = locy + p;
}
if(l==8)
{
locx = locx + p;
locy = locy + p;
}
p = p/2;
} //while loop end.
if(motionx != 0 || motiony != 0)
{
//SKIP PIXELS UPTO INITIAL POINT
fseek(fpold,(176 * y11) + x1, 0);
fseek(fpnew,(176 * (y11 + motiony)) + x1 + motionx, 0);//SKIP PIXELS
USING MOTION VECTOR FOR THE NEW IMAGE
fseek(fpold,160,1);
fseek(fpnew,160,1);
}
}
fclose(fpnew);
fclose(fpold);//*/
}
}
second = time(NULL);
printf("Time taken %ld", second - first);//TO FIND CPU TIME
printf("\nAMAD=%lf", amad/99);
printf("\nPixel Difference %ld", peldiff/99);
getch();
}
//--------------------------------------------------------------------------
// FUNCTION BLOCKDIFF
//--------------------------------------------------------------------------
union mmx
{
__m128i m;
short int x[8];
}m;
else
{
fp1 = fopen("C:\\ECE734\\f4.raw", "rb");
fp2 = fopen("C:\\ECE734\\f5.raw", "rb");
fseek(fp1,offset1,SEEK_SET);
fread(block1,1,16,fp1);
//for(i=0;i<16;i++)
//printf("%c \n",block1[i]);
fseek(fp2,offset2,SEEK_SET);
fread(block2,1,16,fp2);
fclose(fp1);
fclose(fp2);
}// else loop end
return(totaldiff);
}
A.5 Full Search with simple C for 8 x 8 blocks
//------------------------------------------------------------------------
// PROGRAM TO IMPLEMENT FULL SEARCH
//------------------------------------------------------------------------
#include<stdio.h>
#include<conio.h>
#include<math.h>
#include<stdlib.h>
#include<time.h>
void main()
{
int i1,j1,k1;
FILE *fpold,*fpnew;
first = time(NULL);
//----------------------------------------------------------------------
// SUBPROGRAM TO OVERWRITE OLD IMAGE AND PRODUCE SHIFTED IMAGE
//----------------------------------------------------------------------
fpold = fopen("C:\\ECE734\\f4.raw","rb");
fpnew = fopen("C:\\ECE734\\f5recon.raw","r+b");
if(motionx != 0 || motiony != 0)
{
//SKIP PIXELS UPTO INITIAL POINT
fseek(fpold,(176 * y11) + x1, 0);
fseek(fpnew,(176 * (y11 + motiony)) + x1 + motionx, 0);//SKIP PIXELS
USING MOTION VECTOR FOR THE NEW IMAGE
}
}
fclose(fpnew);
fclose(fpold);
//amad = amad + sqrt(motionx*motionx + motiony*motiony);
peldiff = peldiff + mindiff[i][j];
}
}
amad = amad/99;
printf("\nAMAD = %f", amad);
printf("\nPixel Difference %ld", peldiff/99);
second=time(NULL);
printf("\nDifference in time %ld", second - first);
getch();
}
//--------------------------------------------------------------------------
// FUNCTION BLOCKDIFF
//--------------------------------------------------------------------------
else
{
fp1 = fopen("C:\\ECE734\\f4.raw", "rb");
fp2 = fopen("C:\\ECE734\\f5.raw", "rb");
fclose(fp1);
fclose(fp2);
}// else loop end
return(totaldiff);
}
A.7 Three Step Search with simple C for 8 x 8 blocks
//------------------------------------------------------------------------
// PROGRAM TO IMPLEMENT 3 STEP SEARCH
//------------------------------------------------------------------------
#include<stdio.h>
#include<conio.h>
#include<math.h>
#include<stdlib.h>
#include<time.h>
void main()
{
FILE *fpold, *fpnew;
int i1,j1,k1,ch,c;
//clrscr();
first = time(NULL);
locx = x1;
locy = y11;
p = 8;
while(p >= 1)
{
//ALGORITHM FOR 3 STEP SEARCH
//FOR POINT NO. 0
x2 = locx - p;
y22 = locy - p;
diff[i][j][0] = blockdiff(x1, x2, y11, y22);
//------------------------------------------------------------------------
//SUBPROGARM FOR SORTING THE DIFFERENCES
//-----------------------------------------------------------------------
if(l==0)
{
locx = locx - p;
locy = locy - p;
}
if(l==1)
{
locx = locx;
locy = locy - p;
}
if(l==2)
{
locx = locx + p;
locy = locy - p;
}
if(l==3)
{
locx = locx - p;
locy = locy;
}
if(l==4)
{
locx = locx;
locy = locy;
}
if(l==5)
{
locx = locx + p;
locy = locy;
}
if(l==6)
{
locx = locx - p;
locy = locy + p;
}
if(l==7)
{
locx = locx;
locy = locy + p;
}
if(l==8)
{
locx = locx + p;
locy = locy + p;
}
p = p/2;
} //while loop end.
if(motionx != 0 || motiony != 0)
{
//SKIP PIXELS UPTO INITIAL POINT
fseek(fpold,(176 * y11) + x1, 0);
fseek(fpnew,(176 * (y11 + motiony)) + x1 + motionx, 0);//SKIP PIXELS
USING MOTION VECTOR FOR THE NEW IMAGE
//--------------------------------------------------------------------------
// FUNCTION BLOCKDIFF
//--------------------------------------------------------------------------
else
{
fp1 = fopen("C:\\ECE734\\f4.raw", "rb");
fp2 = fopen("C:\\ECE734\\f5.raw", "rb");
if(fp2 == NULL)
{
printf("File 2 cannot be opened");
getch();
exit(0);
}
if(fp1 == NULL)
{
printf("File 1 cannot be opened");
getch();
exit(0);
}
fclose(fp1);
fclose(fp2);
}// else loop end
return(totaldiff);
}
A.8 Three Step Search Program with SSE 2 for 8 x 8 blocks
//------------------------------------------------------------------------
// PROGRAM TO IMPLEMENT 3 STEP SEARCH
//------------------------------------------------------------------------
#include<stdio.h>
#include<conio.h>
#include<math.h>
#include<stdlib.h>
#include<time.h>
#include<sse2mmx.h>
#include<xmmintrin.h>
void main()
{
FILE *fpold, *fpnew;
int i1,j1,k1,ch,c;
first = time(NULL);
locx = x1;
locy = y11;
p = 8;
while(p >= 1)
{
//ALGORITHM FOR 3 STEP SEARCH
//FOR POINT NO. 0
x2 = locx - p;
y22 = locy - p;
diff[i][j][0] = blockdiff(x1, x2, y11, y22);
//------------------------------------------------------------------------
//SUBPROGARM FOR SORTING THE DIFFERENCES
//-----------------------------------------------------------------------
if(l==0)
{
locx = locx - p;
locy = locy - p;
}
if(l==1)
{
locx = locx;
locy = locy - p;
}
if(l==2)
{
locx = locx + p;
locy = locy - p;
}
if(l==3)
{
locx = locx - p;
locy = locy;
}
if(l==4)
{
locx = locx;
locy = locy;
}
if(l==5)
{
locx = locx + p;
locy = locy;
}
if(l==6)
{
locx = locx - p;
locy = locy + p;
}
if(l==7)
{
locx = locx;
locy = locy + p;
}
if(l==8)
{
locx = locx + p;
locy = locy + p;
}
p = p/2;
} //while loop end.
if(motionx != 0 || motiony != 0)
{
//SKIP PIXELS UPTO INITIAL POINT
fseek(fpold,(176 * y11) + x1, 0);
fseek(fpnew,(176 * (y11 + motiony)) + x1 + motionx, 0);//SKIP PIXELS
USING MOTION VECTOR FOR THE NEW IMAGE
//--------------------------------------------------------------------------
// FUNCTION BLOCKDIFF
//--------------------------------------------------------------------------
union mmx
{
__m64 m;
int x[2];
}m;
else
{
fp1 = fopen("C:\\ECE734\\f4.raw", "rb");
fp2 = fopen("C:\\ECE734\\f5.raw", "rb");
fseek(fp1,offset1,SEEK_SET);
fread(block1,1,8,fp1);
fseek(fp2,offset2,SEEK_SET);
fread(block2,1,8,fp2);
fclose(fp1);
fclose(fp2);
}// else loop end
return(totaldiff);
}
/*****************************************************************************
* File Name : DCTmatrix.c
*
* Comment : This file will help produce ROM values stored to compute DCT
* using distributed arithmetic.
*
* Project :ECE734
* Author :Shamik Valia
*
****************************************************************************/
#include<stdio.h>
#include<math.h>
#include<stdlib.h>
#define FOR_HARDWARE
//#define FOR_SIMULATOR
FILE *R1,*R2,*R3,*R4,*R5,*R6,*R7,*R8 ;
double A,B,C,D,E,F,G,v1;
int i,j,k,l;
//char *C;
char c[atoi(argv[1])+1] ;
void dectobinary(double ,int, char[]);
void complement(char[]);
A = cos(M_PI/4);
B = cos(M_PI/8);
C = sin(M_PI/8);
D = cos(M_PI/16);
E = cos(3*M_PI/16);
F = sin(3*M_PI/16);
G = sin(M_PI/16);
R1 = fopen("dct_ROM1.dat","w");
R2 = fopen("dct_ROM2.dat","w");
R3 = fopen("dct_ROM3.dat","w");
R4 = fopen("dct_ROM4.dat","w");
R5 = fopen("dct_ROM5.dat","w");
R6 = fopen("dct_ROM6.dat","w");
R7 = fopen("dct_ROM7.dat","w");
R8 = fopen("dct_ROM8.dat","w");
for(i=0;i<=1;i++){
for(j=0;j<=1;j++){
for(k=0;k<=1;k++) {
for(l=0;l<=1;l++){
#ifdef FOR_SIMULATOR
v1=0.5*((A*i)+(A*j)+(A*k)+(A*l));
fprintf(R1,"sequence %d%d%d%d value = %lf \n",i,j,k,l,v1);
v1=0.5*((B*i)+(C*j)-(C*k)-(B*l));
fprintf(R2,"sequence %d%d%d%d value = %lf \n",i,j,k,l,v1);
v1=0.5*((A*i)-(A*j)-(A*k)+(A*l));
fprintf(R3,"sequence %d%d%d%d value = %lf \n",i,j,k,l,v1);
v1=0.5*((C*i)-(B*j)+(B*k)-(C*l));
fprintf(R4,"sequence %d%d%d%d value = %lf \n ",i,j,k,l,v1);
v1=0.5*((D*i)+(E*j)+(F*k)+(G*l));
fprintf(R5,"sequence %d%d%d%d value = %lf \n",i,j,k,l,v1);
v1=0.5*((E*i)-(G*j)-(D*k)-(F*l));
fprintf(R6,"sequence %d%d%d%d value = %lf \n",i,j,k,l,v1);
v1=0.5*((F*i)-(D*j)+(G*k)+(E*l));
fprintf(R7,"sequence %d%d%d%d value = %lf \n",i,j,k,l,v1);
v1=0.5*((G*i)-(F*j)+(E*k)-(D*l));
fprintf(R8,"sequence %d%d%d%d value = %lf \n",i,j,k,l,v1);
#endif
#ifdef FOR_HARDWARE
v1=0.5*((A*i)+(A*j)+(A*k)+(A*l));
dectobinary(v1,atoi(argv[1]),c);
fprintf(R1,"4'b%d%d%d%d : out = %s'b%s ;\n",i,j,k,l,argv[1],c);
v1=0.5*((B*i)+(C*j)-(C*k)-(B*l));
dectobinary(v1,atoi(argv[1]),c);
fprintf(R2,"4'b%d%d%d%d : out = %s'b%s ;\n",i,j,k,l,argv[1],c);
v1=0.5*((A*i)-(A*j)-(A*k)+(A*l));
dectobinary(v1,atoi(argv[1]),c);
fprintf(R3,"4'b%d%d%d%d : out = %s'b%s ;\n",i,j,k,l,argv[1],c);
v1=0.5*((C*i)-(B*j)+(B*k)-(C*l));
dectobinary(v1,atoi(argv[1]),c);
fprintf(R4,"4'b%d%d%d%d : out = %s'b%s ;\n",i,j,k,l,argv[1],c);
v1=0.5*((D*i)+(E*j)+(F*k)+(G*l));
dectobinary(v1,atoi(argv[1]),c);
fprintf(R5,"4'b%d%d%d%d : out = %s'b%s ;\n",i,j,k,l,argv[1],c);
v1=0.5*((E*i)-(G*j)-(D*k)-(F*l));
dectobinary(v1,atoi(argv[1]),c);
fprintf(R6,"4'b%d%d%d%d : out = %s'b%s ;\n",i,j,k,l,argv[1],c);
v1=0.5*((F*i)-(D*j)+(G*k)+(E*l));
dectobinary(v1,atoi(argv[1]),c);
fprintf(R7,"4'b%d%d%d%d : out = %s'b%s ;\n",i,j,k,l,argv[1],c);
v1=0.5*((G*i)-(F*j)+(E*k)-(D*l));
dectobinary(v1,atoi(argv[1]),c);
fprintf(R8,"4'b%d%d%d%d : out = %s'b%s ;\n",i,j,k,l,argv[1],c);
#endif
}
}
}
}
fclose (R1);
fclose (R2);
fclose (R3);
fclose (R4);
fclose (R5);
fclose (R6);
fclose (R7);
fclose (R8);
//double no;
///int precision ;
//given the nos before the decimal point it will give binary representation
float fixed ;
int decimal ;
int ans = 0, x ,i=0,two_complement = 0;
void complement(char[]);
//char C[precision];
//printf("i am inside");
if (no<0){
two_complement = 1;
no = - no ;
}
decimal = (int)no ;
fixed = no-decimal;
while(decimal/2 !=0)
{
x = decimal % 2 ;
ans = ans + x*pow(10,i);
i++;
decimal = decimal / 2 ;
}
x= decimal%2;
ans = ans + x*pow(10,i);
sprintf(c,"%d",ans);
if(strlen(c)!=2)
{
if(strlen(c)>2) fprintf(stderr,"error at decimal..more than 2 places");
c[1] = c[0];
c[0] = '0';
c[2]='\0';
}
// printf("stringlenth is %d",strlen(c));
//given the nos after decimal point , gives the binary.
for(i=strlen(c);i<precision;i++)
{
fixed = 2*fixed ;
if(fixed>=1.0) {
fixed = fixed-1.0 ;
c[i] = '1';
}
else c[i] ='0';
}
c[i]='\0';
if(two_complement==1){
complement(c);
}
// return C ;
}
int i = strlen(c);
int flag = 0;
while(i!=-1){
if (flag ==0){
if(c[i]=='1')
flag = 1;
i--;
}else //flag ==0 ends
{ //flag ==1
if(c[i]=='1')
c[i]='0';
else c[i]='1';
i--;
} //flag==1 ends
} //while ends
}
/* Real dct implementation.
x0 = x0 + x7 ;
x1 = x1 + x6 ;
x2 = x2 + x5 ;
x3 = x3 + x4 ;
x4 = x0 - x7 ;
x5 = x0 - x6 ;
x6 = x0 - x5 ;
x7 = x0 - x4 ;
*************************************************/
#include<stdio.h>
#include<math.h>
#include<stdlib.h>
main ()
{
short A,B,C,D,E,F,G;
short int X0,X1,X2,X3,X4,X5,X6,X7 ;
short int x0,x1,x2,x3,x4,x5,x6,x7;
double X0_,X1_,X2_,X3_,X4_,X5_,X6_,X7_;
int Add_SS2(int ,int);
int n1,n2,n3;
/*
A = cos(M_PI/4);
B = cos(M_PI/8);
C = sin(M_PI/8);
D = cos(M_PI/16);
E = cos(3*M_PI/16);
F = sin(3*M_PI/16);
G = sin(M_PI/16);
*/
A =23170;
B=30273;
C=12539;
D=32138;
E=27245;
F=18204;
G=6392;
printf("input x0 : ");
scanf("%hd",&x0);
printf("\ninput x1 : ");
scanf("%hd",&x1);
printf("\ninput x2 : ");
scanf("%hd",&x2);
printf("\ninput x3 : ");
scanf("%hd",&x3);
printf("\ninput x4 : ");
scanf("%hd",&x4);
printf("\ninput x5 : ");
scanf("%hd",&x5);
printf("\ninput x6 : ");
scanf("%hd",&x6);
printf("\ninput x7 : ");
scanf("%hd",&x7);
/*
X0 = (short int)(((A*x0) + (A*x1) + (A*x2) + (A*x3))>>1);
X2 = (short int)(((B*x0) + (C*x1) - (C*x2) - (B*x3))>>1);
X4 = (short int)(((A*x0) - (A*x1) - (A*x2) + (A*x3))>>1);
X6 = (short int)(((C*x0) - (B*x1) - (B*x2) - (C*x3))>>1);
n1 = Add_SS2((B*x0) ,- (C*x2));
n2 = Add_SS2((C*x1) ,- (B*x3));
n3 = Add_SS2(n1,n2);
X2 = (short int)(n3>>17);
// X2=X2+ (((B*x_[0]) + (C*x_[1]) - (C*x_[2]) - (B*x_[3]))/(power(2,15-i)));
n1 = Add_SS2((A*x0) , - (A*x1));
n2 = Add_SS2((A*x3),- (A*x2) );
n3 = Add_SS2(n1,n2);
X4 = (short int)(n3>>17);
n1 = Add_SS2((F*x4), - (D*x5));
n2 = Add_SS2((G*x6), (E*x7) );
n3 = Add_SS2(n1,n2);
X5 = (short int)(n3>>17);
n1 = Add_SS2((G*x4), - (F*x5));
n2 = Add_SS2((E*x6),- (D*x7));
n3 = Add_SS2(n1,n2);
X7 = (short int)(n3>>17);
out = a + b;
#include<stdio.h>
FILE *fo;
//Copyright information...
fprintf(fo,"//#########################################################\n");
fprintf(fo,"//\n");
fprintf(fo,"// File Name:%s \n",argv[3]);
fprintf(fo,"//\n");
fprintf(fo,"// Comment : The file is ROM code for inbit#%s and output#%s \n",argv[2],argv[4]);
fprintf(fo,"// Project: \n");
fprintf(fo,"// Author : Shamik Valia \n");
fprintf(fo,"// Creation Date : \n");
fprintf(fo,"// Advisor:Prof.Mike Schutle\n");
fprintf(fo,"//\n");
fprintf(fo,"//#########################################################\n");
fprintf(fo,"\n\n");
//
module struc_16dap(in1,in2,in3,in4,reset,clk,start,out,done);
dap_controller_4 D1(reset_reg_use,done,S1,S2,reset_reg,shiftAdd,shiftaddcomplement,rd,reset,start,clk);
//adder block
adder Add1(adder_out,ext_out1,ext_out2,out,shiftAdd,shiftaddcomplement,reset_reg,reset_reg_use);
assign strobe = 1'b1;
reg_ #(21) RE1(out,adder_out,clk,strobe,reset_reg);
endmodule
module adder(adder_out,in_adder_,in_adder,r,shiftAdd,shiftaddcomplement,reset_reg,reset_reg_use);
always@(in_adder,in_adder_,r,shiftAdd,shiftaddcomplement,reset_reg_use)
begin
if (shiftAdd ==1'b1)
begin
if (reset_reg_use == 1'b1)
begin
c = {in_adder_,1'b0,1'b0};
d = {in_adder[18],in_adder,1'b0};
adder_out = {in_adder_,1'b0,1'b0}+{in_adder[18],in_adder,1'b0};
end
else
begin
c = {in_adder_,1'b0,1'b0};
d = {in_adder[18],in_adder,1'b0};
e = {r[20],r[20],r[20:2]} ;
adder_out = {in_adder_,1'b0,1'b0}+({in_adder[18],in_adder,1'b0})+
({r[20],r[20],r[20:2]});
end
end
else
begin
if (shiftaddcomplement ==1'b1)
begin
adder_out = {(~in_adder_ + 1'b1),1'b0,1'b0} +({in_adder[18],in_adder,1'b0}) +
({r[20],r[20],r[20:2]});
c = {(~in_adder_ + 1'b1),1'b0,1'b0};
d = {in_adder[18],in_adder,1'b0};
e = {r[20],r[20],r[20:2]} ;
end
else
begin
c= 20'b0;
d = 20'b0;
e = 20'b0;
adder_out = 20'b0;
end
end
end
endmodule
module
dap_controller_4(reset_reg_use,done,S1,S2,reset_reg,shiftAdd,shiftaddcomplement,rd,reset,start,clk);
ST0:if(start ==1'b1)
begin
S1 = 3'b000;
S2 = 3'b000;
nextstate = ST1 ;
reset_reg_use = 1'b1;
shiftAdd = 1'b1;
rd = 1'b1;
done = 1'b0;
end
else
nextstate = ST0;
ST1:begin
nextstate = ST2;
S1 = 3'b001;
S2 = 3'b001;
shiftAdd = 1'b1;
reset_reg_use =1'b0;
rd = 1'b1;
end
ST2 : begin
nextstate = ST3;
S1 = 3'b010;
S2 = 3'b010;
shiftAdd = 1'b1;
reset_reg_use =1'b0;
rd = 1'b1;
end
ST3 : begin
nextstate = ST4;
S1 = 3'b011;
S2 = 3'b011;
shiftAdd = 1'b1;
reset_reg_use =1'b0;
rd = 1'b1;
end
ST4 : begin
nextstate = ST5;
S1 = 3'b100;
S2 = 3'b100;
shiftAdd = 1'b1;
reset_reg_use =1'b0;
rd = 1'b1;
end
ST5 : begin
nextstate = ST6;
S1 = 3'b101;
S2 = 3'b101;
shiftAdd = 1'b1;
reset_reg_use =1'b0;
rd = 1'b1;
end
ST6 : begin
nextstate = ST7;
S1 = 3'b110;
S2 = 3'b110;
shiftAdd = 1'b1;
reset_reg_use =1'b0;
rd = 1'b1;
end
ST7 : begin
nextstate = ST0;
S1 = 3'b111;
S2 = 3'b111;
//shiftAdd = 1'b1;
shiftaddcomplement = 1'b1;
rd = 1'b1;
done =1'b1;
end
state = nextstate ;
end
end
endmodule
/*
module struc_16;
check16 c1(in1,in2,in3,in4,out1);
struc_16dap D1(in1,in2,in3,in4,reset,clk,start,out,done);
assign value = out<<1;
//assign error = (val==1'b1) ? (out1[30:15] - out[16:1]) : 16'b0;
assign error = out1[30:15] - out[16:1];
always
#5 clk = ~clk ;
always
begin
# 5 start = 1'b1;val = 1'b0;
t = t+1;
# 10 start =1'b0;
# 95 val=1'b1;
end
initial
begin
clk = 1'b1;reset =1'b1;t = 64'h0000_0000_0000_0000;
#7 reset =1'b0;
end
endmodule
*/