DE3750017T2

DE3750017T2 - Orthogonal transformation processor.

Info

Publication number: DE3750017T2
Application number: DE3750017T
Authority: DE
Inventors: Adrianus Ligtenberg; Jay Henry O'neill
Original assignee: American Telephone and Telegraph Co Inc
Current assignee: AT&T Corp
Priority date: 1986-11-10
Filing date: 1987-11-04
Publication date: 1994-09-29
Anticipated expiration: 2007-11-05
Also published as: CN87107679A; EP0267729A2; EP0267729B1; JPS63136167A; ES2005021A6; KR880006617A; EP0267729A3; US4760543A; DE3750017D1; CA1277036C; KR920001618B1

Description

Die vorliegende Erfindung betrifft einen Prozessor für orthogonale Transformationen.The present invention relates to a processor for orthogonal transformations.

In der Vergangenheit wurden Spezialrechner in beträchtlichem Umfange für rechenintensive Algorithmen verwendet, wenn effektive Lösungen von Echtzeit- Informationsverarbeitungsproblemen benötigt wurden. Nachdem Mehrzweckcomputer billiger wurden und sich ihre Geschwindigkeit erhöhte, nahm der Bedarf nach speziell entwickelten Computern ab, aber es sind natürlich immer größere Probleme in Angriff zu nehmen. Kundenintegrierte Prozessoren, verwendet als periphere Coprozessoren, können dieselben Erfordernisse erfüllen. Wenn sie geeignet entworfen und verwendet werden, steigern sie drastisch die Leistungsfähigkeit von Mehrzweckcomputersystemen für einen großen Bereich von rechenintensiven Programmen.In the past, special purpose computers were used extensively for computationally intensive algorithms when effective solutions to real-time information processing problems were needed. As general purpose computers became cheaper and faster, the need for purpose-built computers decreased, but of course larger problems remained to be tackled. Custom processors, used as peripheral coprocessors, can meet the same needs. When properly designed and used, they dramatically increase the performance of general purpose computer systems for a wide range of computationally intensive programs.

Beispiele für kundenspezifische Echtzeit- Informationsverarbeitungschips beinhalten verschiedene Ausführungsformen zur Durchführung der direkten und inversen diskreten Kosinus-Transformation (DCT) mit Anwendungen bei der Bildcodierung. Solch ein Chip wird beispielsweise von M. Vetterli und A. Ligtenberg in "A Discrete Fourier-Cosine Transform Chip", IEEE Journal on Selected Areas in Communications, Band SAC-4, Nr. 1, Januar 1986, Seiten 49-61 beschrieben.Examples of custom real-time information processing chips include various embodiments for performing the direct and inverse discrete cosine transform (DCT) with applications in image coding. Such a chip is described, for example, by M. Vetterli and A. Ligtenberg in "A Discrete Fourier-Cosine Transform Chip," IEEE Journal on Selected Areas in Communications, Vol. SAC-4, No. 1, January 1986, pages 49-61.

Die DCT, die eine orthogonale Transformation darstellt, ist nützlich zur Bildcodierung, da Bilder ein beträchtliches Ausmaß an Redundanz aufweisen und es daher sinnvoll ist, Bilder in Blocks zu verarbeiten. Eine zweidimensionale DCT- Transformation solcher Blocks liefert normalerweise hauptsächlich niederfrequente Komponenten, so daß das Nichtbeachten der hochfrequenten Komponenten (geringer Größe) die Qualität des codierten Bildes nur wenig beeinträchtigt. Obwohl die Verwendung der DCT zur Bildcodierung potentielle übertragungs- und Speichervorteile aufweist, ist der Rechenaufwand sehr groß. Das Arbeiten bei Videoabtastraten erfordert eine Verarbeitungsgeschwindigkeit von etwa 6,4 Millionen Abtastwerte pro Sekunde (unter der Annahme von 15.734 Abtastzeilen pro Sekunde und etwa 400 Abtastwerte pro Zeile, wie sie für die NTSC-Norm benötigt werden). Eine Achtpunkt-DCT erfordert zumindest 13 Multiplikationen und 29 Additionen. Eine zweidimensionale Transformation kann durch Anwendung einer eindimensionalen Transformation auf die Zeilen, der sich eine eindimensionale Transformation der Spalten anschließt, durchgeführt werden. Folglicherweise wird zur Echtzeit-Bildverarbeitung ein integrierter DCT-Schaltkreis benötigt zur Durchführung von 1,6 Millionen Achtpunkt- Transformationen pro Sekunde, was etwa 20 Millionen Multiplikationen und 47 Millionen Additionen beinhaltet.The DCT, which is an orthogonal transform, is useful for image coding, since images have a considerable degree of redundancy and it is therefore sensible to process images in blocks. A two-dimensional DCT transformation of such blocks normally yields mainly low frequency components, so that ignoring the high frequency components (small size) has little impact on the quality of the encoded image. Although using DCT for image coding has potential transmission and storage advantages, the computational cost is very high. Working at video sampling rates requires a processing speed of about 6.4 million samples per second (assuming 15,734 scanning lines per second and about 400 samples per line as required for the NTSC standard). An eight-point DCT requires at least 13 multiplications and 29 additions. A two-dimensional transformation can be performed by applying a one-dimensional transformation to the rows, followed by a one-dimensional transformation to the columns. Consequently, real-time image processing requires a DCT integrated circuit to perform 1.6 million eight-point transforms per second, which involves approximately 20 million multiplications and 47 million additions.

Eine andere Anwendungsmöglichkeit für kundenspezifische Prozessoren für orthogonale Echtzeit-Signaltransformationen liegt bei Systemen zur Lösung von Problemen entsprechend der Methode der kleinsten Quadrate. Solche Probleme sind bei der Signalverarbeitung und der linearen Programmierung vorherrschend. Beispielsweise besteht der bei weitem rechenintensivste Aspekt eines linearen Programmierungsproblems unter Verwendung von Karmarkars Algorithmus in der Lösung eines Problems entsprechend der Methode der kleinsten Quadrate bei jeder Iteration. Dieses Problem kann mit einem Coprozessor gelöst werden, der aus einem Feld von Prozessoren für orthogonale Transformationen besteht (jeder bestehend aus vier Multiplizierern, einem Addierer und einem Subtrahierer) und die Verwendung solch eines Coprozessors würde die Bearbeitungszeit für Karmarkars Softwarerealisation um eine Größenordnung verringern.Another application for custom real-time orthogonal signal transform processors is in systems for solving least squares problems. Such problems are prevalent in signal processing and linear programming. For example, by far the most computationally intensive aspect of a linear programming problem using Karmarkar's algorithm is solving a least squares problem at each iteration. This problem can be solved using a coprocessor consisting of an array of orthogonal transform processors (each consisting of four multipliers, a adder and a subtractor) and the use of such a coprocessor would reduce the processing time for Karmarkar's software implementation by an order of magnitude.

Um die hohe Effizienz zu erreichen, die - wie in Zusammenhang mit der DCT-Bildcodierung dargestellt - bei manchen Anwendungen erforderlich ist, sind zwei Fragen zu beantworten. Zum einen die nach den grundlegenden Bausteinen, die mit diesen Operationen verknüpft sind und zum anderen die nach den effizienten VLSI-Strukturen für diese Bausteine. Die Antworten auf diese zwei Fragen sind stark miteinander verknüpft, da sich ohne gute Aufteilung des Algorithmus keine effiziente VLSI-Implementierung erreichen läßt und da sich keine optimale Aufteilung des Algorithmus erreichen läßt, ohne zu wissen, wie eine effiziente VLSI-Struktur implementiert werden kann.In order to achieve the high efficiency required in some applications, as shown in the context of DCT image coding, two questions must be answered. First, the basic building blocks associated with these operations and second, the efficient VLSI structures for these building blocks. The answers to these two questions are closely linked, since without good partitioning of the algorithm, an efficient VLSI implementation cannot be achieved and since an optimal partitioning of the algorithm cannot be achieved without knowing how to implement an efficient VLSI structure.

Diese Schwierigkeiten sind in dem am 9. April 1986 erteilten amerikanischen Patent 4,510,578 von Miyaguchi et al dargestellt, in dem eine Schaltung beschrieben ist, bei der ein Eingangssignal einer orthogonalen Transformation unterworfen wird. Die Schaltung umfaßt eine erste Stufe von drei Speichern, die parallel betrieben werden und acht Multiplizierer für konstante Koeffizienten speist. Die Ausgangssignale der Multiplizierer werden an drei Addierer angelegt, von denen zwei Volladdierer sind. Die drei Ausgänge der ersten Stufe werden an drei Sekundärstufen aus komplexen Multiplizierern angelegt, wobei jede der drei Stufen komplexer Multiplizierer zwei Speicher, vier Multiplizierer und zwei Addierer umfaßt. Bei dem Schaltkreis von Miyaguchi et al werden keine Anstrengungen unternommen, eine Architektur zu verwenden, die besonders schnell ist. Die Betonung scheint vielmehr darauf zu liegen, herkömmliche Module (Addierer, Multiplizierer) in einer Zusammenstellung zu verwenden, die die gewünschte Transformation in einer möglichst direkten Art und Weise durchführt.These difficulties are illustrated in U.S. Patent 4,510,578 issued to Miyaguchi et al. on April 9, 1986, which describes a circuit in which an input signal is subjected to an orthogonal transformation. The circuit comprises a first stage of three memories operating in parallel and feeding eight constant coefficient multipliers. The outputs of the multipliers are applied to three adders, two of which are full adders. The three outputs of the first stage are applied to three secondary stages of complex multipliers, each of the three stages of complex multipliers comprising two memories, four multipliers and two adders. In the Miyaguchi et al. circuit, no effort is made to use an architecture that is particularly fast. The emphasis seems to be on using conventional modules (adders, multipliers) in a combination that performs the desired transformation in the most direct way possible.

Die Überlegungen, die bei der Verwirklichung eines möglichst guten integrierten Schaltkreises zu berücksichtigen sind, betreffen im Gegensatz hierzu nicht nur die Anzahl der benötigten Multiplizierer und Addierer sondern auch die Größe dieser Elemente und die durch sie hervorgerufenen Verzögerungen. Bei einigen Ausführungsformen wird beispielsweise die Verzögerung dadurch sehr stark erhöht, daß zuerst addiert und dann multipliziert wird und nicht zuerst multipliziert und dann addiert wird.In contrast, the considerations to be taken into account in realizing the best possible integrated circuit concern not only the number of multipliers and adders required, but also the size of these elements and the delays they introduce. In some embodiments, for example, the delay is greatly increased by adding first and then multiplying, rather than multiplying first and then adding.

Die GB-A-2141847 beschreibt eine Anordnung zur Matrizen- Muliplikation von Bilddaten (mit den drei Dimensionen x, y und z), bei der eine Rotationstransformation und eine Translationstransformation von Koordinatendaten L ausgeführt werden. In der Beschreibung wird eine Rotationstransformationsmatrix R und eine Translationstransformationsmatrix T beschrieben. Die Berechnung wird so durchgeführt, daß zuerst die Matrix P mit der Matrix R multipliziert wird, um eine Transformationsmatrix W zu erhalten und dann alle Zeilensegmente mit der Transformationsmatrix W zu multiplizieren. Die Multiplikationen werden für alle Datenkomponenten zur gleichen Zeit durchgeführt (innerhalb der Periode, die für die verschiedenen seriellen Multiplikationen aufgewendet wird).GB-A-2141847 describes an arrangement for matrix multiplication of image data (with the three dimensions x, y and z) in which a rotation transformation and a translation transformation of coordinate data L are carried out. In the description, a rotation transformation matrix R and a translation transformation matrix T are described. The calculation is carried out by first multiplying the matrix P by the matrix R to obtain a transformation matrix W and then multiplying all row segments by the transformation matrix W. The multiplications are carried out for all data components at the same time (within the period spent on the various serial multiplications).

Entsprechend einem Aspekt der vorliegenden Erfindung wird ein Prozessor für orthogonale Transformationen gemäß Anspruch 1 zur Verfügung gestellt.According to one aspect of the present invention, there is provided an orthogonal transform processor according to claim 1.

Entsprechend einem anderen Aspekt der vorliegenden Erfindung wird ein Prozessor für orthogonale Transformationen gemäß Anspruch 11 zur Verfügung gestellt.According to another aspect of the present invention, there is provided an orthogonal transform processor according to claim 11.

Eine Ausführungsform der Erfindung stellt eine Prozessorschaltung für orthogonale Transformationen zur Verfügung, die besonders gut zur Verwirklichung in integrierten Schaltkreisen geeignet ist.One embodiment of the invention provides a processor circuit for orthogonal transformations for which is particularly well suited for implementation in integrated circuits.

Eine andere Ausführungsform der Erfindung stellt einen Prozessor für orthogonale Transformationen zur Verfügung, dessen Architektur die maximale Ausnutzung der Geschwindigkeitsmöglichkeiten von integrierten Schaltkreisen ermöglicht.Another embodiment of the invention provides an orthogonal transform processor whose architecture enables maximum utilization of the speed capabilities of integrated circuits.

Bei einer Ausführungsform, die eine Rotatorschaltung, einen Kommunikationsschaltung und eine interne oder externe Suchtabelle umfaßt, führt die Rotatorschaltung die für einen Rotator erforderlichen komplexen Multiplikationen durch, wobei die benötigten Koeffizienten von der Suchtabelle zur Verfügung gestellt werden, während die Kommunikationsschaltung mit der Aufgabe beschäftigt ist, die Eingangssignale aufzunehmen, den Rotator mit den für jede Iteration benötigten Signalen zu versorgen und die sich ergebenden Ausgangssignale zu liefern. Der Rotator wird durch eine Matrix von räumlich benachbarten, miteinander verbundenen Multiplizierern und Addierermodulen verwirklicht, wobei jedes Modul einen Bereich jeder der in dem Rotator benötigten Multiplizierer (oder Addierer, je nachdem) umfaßt. Die Suchtabelle ist ein zählerloser Festwertspeicher und die Kommunikationsschaltung besteht aus einer miteinander verbundene Gruppe von Dualeingang-/Dualausgangregistern.In an embodiment comprising a rotator circuit, a communication circuit and an internal or external lookup table, the rotator circuit performs the complex multiplications required by a rotator, with the required coefficients provided by the lookup table, while the communication circuit is busy taking the input signals, supplying the rotator with the signals required for each iteration and providing the resulting output signals. The rotator is implemented by a matrix of spatially adjacent, interconnected multiplier and adder modules, each module comprising a portion of each of the multipliers (or adders, as the case may be) required in the rotator. The lookup table is a counterless read-only memory and the communication circuit consists of an interconnected set of dual-input/dual-output registers.

Die Figuren zeigen:The figures show:

Fig. 1 ein verallgemeinertes Blockschaltbild einer Ausführungsform des erfindungsgemäßen Prozessors für orthogonale Transformationen,Fig. 1 is a generalized block diagram of an embodiment of the processor according to the invention for orthogonal transformations,

Fig. 2 eine Ausführungsform eines Rotators 300,Fig. 2 an embodiment of a rotator 300,

Fig. 3 ein Blockschaltbild eines typischen Elementes 320 des in Fig. 2 dargestellten Rotators,Fig. 3 is a block diagram of a typical element 320 of the rotator shown in Fig. 2,

Fig. 4 den Signalfluß und die für die Implementierung einer diskreten Kosinustransformation mit dem in Fig. 1 dargestellten System benötigte Reihenfolge von Rotationen,Fig. 4 shows the signal flow and the sequence of rotations required to implement a discrete cosine transformation using the system shown in Fig. 1,

Fig. 5 die allgemeine Struktur des Speichers der in Fig. 1 dargestellten Kommunikationsschaltung 100,Fig. 5 shows the general structure of the memory of the communication circuit 100 shown in Fig. 1,

Fig. 6 eine andere Möglichkeit des Signalflusses und der für die Durchführung der diskreten Kosinustransformation mit dem in Fig. 1 dargestellten System benötigten Reihenfolge von Rotationen,Fig. 6 shows another possibility of the signal flow and the sequence of rotations required to perform the discrete cosine transformation with the system shown in Fig. 1,

Fig. 7 ein umfassenderes Blockschaltbild der in Fig. 1 dargestellten Kommunikationsschaltung,Fig. 7 is a more comprehensive block diagram of the communication circuit shown in Fig. 1,

Fig. 8 die Art und Weise, mit der Information zwischen den inFig. 8 the way in which information is exchanged between the

Fig. 5 dargestellten Speicherelementen übertragen wird,Fig. 5 shown storage elements is transferred,

Fig. 9 eine Realisierungsmöglichkeit für die Normierungsblocks gemäß Fig. 7,Fig. 9 shows a possible implementation for the standardization blocks according to Fig. 7,

Fig. 10 die Struktur einer schnellen zählerlosen Suchtabelle,Fig. 10 the structure of a fast counterless search table,

Fig. 11 eine vergrößerte Struktur für die Suchtabelle gemäß Fig. 1, enthaltend zwei Speicherelemente 56,Fig. 11 shows an enlarged structure for the search table according to Fig. 1, containing two storage elements 56,

Fig. 12 ein Blockschaltbild zur Implementierung einer zweidimensionalen orthogonalen Transformation,Fig. 12 a block diagram for implementing a two-dimensional orthogonal transformation,

Fig. 13 eine Ausführungsform des in Fig. 12 dargestellten Transponierungsspeichers 502 undFig. 13 shows an embodiment of the transposition memory 502 shown in Fig. 12 and

Fig. 14 eine Struktur der Speicherzellen 503 des in Fig. 13 dargestellten Transponierungsspeichers.Fig. 14 shows a structure of the memory cells 503 of the transposition memory shown in Fig. 13.

Fig. 1 zeigt das allgemeine Blockschaltbild eines erfindungsgemäßen Prozessors für orthogonale Transformationen. Darin werden Eingangssignale über Leitung 10 an die Kommunikationsschaltung 100 angelegt und Ausgangssignale werden über Leitung 20 durch die Kommunikationsschaltung 100 übergeben. Der Schaltkreis 100 spricht auf ein "Bereit"- Steuersignal von Leitung 40 und auf von der Suchtabelle 200 gelieferte Steuersignale an. Die Kommunikationsschaltung 100 wirkt mit dem Rotator 300 über die Leitung 30 zusammen, die Signale von und zu dem Rotator befördern. Die Suchtabelle 200 spricht auf das gleiche "Bereit"-Steuersignal an und wahlweise auf ein "inverses" Steuersignal. Zusätzlich zu der Lieferung von Steuersignalen an die Kommunikationsschaltung 100 liefert die Suchtabelle 200 auch Koeffizientensignale an den Rotator 300. Die Funktion und die Betriebsweise jedes der in Fig. 1 dargestellten Elemente wird im folgenden ausführlicher beschrieben.Fig. 1 shows the general block diagram of an orthogonal transform processor according to the invention. Therein, input signals are applied to the communication circuit 100 via line 10 and output signals are passed through the communication circuit 100 via line 20. The circuit 100 is responsive to a "ready" control signal from line 40 and to control signals provided by the lookup table 200. The communication circuit 100 interacts with the rotator 300 via line 30 which carry signals to and from the rotator. The lookup table 200 is responsive to the same "ready" control signal and optionally to an "inverse" control signal. In addition to providing control signals to communication circuit 100, lookup table 200 also provides coefficient signals to rotator 300. The function and operation of each of the elements shown in Figure 1 is described in more detail below.

Der kleinste gemeinsame Nenner von Rechengrundelementen für die orthogonale Transformation ist eine komplexe Multiplikation (oder Rotation). Die Gleichungen, die den Rotator beschreiben sind: The lowest common denominator of computational primitives for orthogonal transformation is a complex multiplication (or rotation). The equations describing the rotator are:

wobei xi und yi die Ein- bzw. Ausgabewerte und c&sub1; und c&sub2; die Koeffizienten, wie z. B. cosR und sinR sind. Die direkte Realisierung von Gleichung (1) erfordert vier Multiplikationen und zwei Additionen. Durch Manipulation von Gleichung (1) erhält man die Beziehung: where xi and yi are the input and output values, respectively, and c₁ and c₂ are the coefficients, such as cosR and sinR. The direct realization of equation (1) requires four multiplications and two additions. By manipulating equation (1), one obtains the relationship:

mit A=c&sub1;-c&sub2;, B=c&sub2; und C=c&sub1;+c&sub2;.with A=c₁-c₂, B=c₂ and C=c₁+c₂.

Diese Version erfordert drei Multiplikationen und drei Additionen und erscheint auf den ersten Blick eine kompaktere Realisierung zu ermöglichen. Bei einem sorgfältigen Vergleich der VLSI-Eignung der beiden Schemata unter Verwendung paralleler Multiplizierer zeigt sich jedoch, daß Gleichung (1) alles in allem gesehen zu einer bevorzugten Realisierung führt. Erstens ist der totale Zeitaufwand für die Realisierung von Gleichung (2) größer als für die Realisierung von Gleichung (1) wegen der Addition (x&sub1;+x&sub2;), die vor der Multiplikation Vorrang hat. Zweitens werden nur zwei Koeffizienten für die Realisierung von Gleichung (1) benötigt im Vergleich zu drei Koeffizienten für die Realisierung von Gleichung (2) und drittens verbraucht die Realisierung von Gleichung (1) weniger Siliciumfläche, da sie einer regelmäßigeren Kommunikationsstruktur zwischen den verschiedenen Elementen, die den Rotator bilden, zugänglich ist.This version requires three multiplications and three additions and at first glance appears to allow a more compact realization. However, a careful comparison of the VLSI suitability of the two schemes using parallel multipliers shows that equation (1) leads to a preferred realization on balance. First, the total time required to realize equation (2) is greater than that required to realize equation (1) because of the addition (x1+x2) which takes precedence over multiplication. Second, only two coefficients are required to realize equation (1) compared to three coefficients required to realize equation (2) and third, the realization of equation (1) consumes less silicon area because it is amenable to a more regular communication structure between the various elements forming the rotator.

Die Realisierung der Rotatorbeziehungen von Gleichung (1) in der Rotatorschaltung 300 erfolgt gemäß einem Multiplikationsalgorithmus, der dem von Baugh-Wooley vorgeschlagenen ähnlich ist und, wie unten ausgeführt, die Umordnung und Vereinigung unterschiedlicher Glieder in dem Multiplikationsprozeß beinhaltet.The implementation of the rotator relationships of equation (1) in the rotator circuit 300 is carried out according to a multiplication algorithm similar to that proposed by Baugh-Wooley and, as discussed below, involves the rearrangement and merging of different terms in the multiplication process.

Es ist bekannt, daß eine negative Zahl mit n-1 Magnitude-Bits, dargestellt in 2-Komplementform, als die Differenz It is known that a negative number with n-1 magnitude bits, represented in 2's complement form, is the difference

dargestellt werden kann. Mit solch einer Darstellung läßt sich ein Multiplikationsprodukt als With such a representation, a multiplication product can be represented as

ausdrücken, woraus sich durch Ausmultiplizieren folgende Beziehung ergibt: which, by multiplying out, yields the following relationship:

Das folgende Feld zeigt die partiellen Produktglieder, die für die Terme, aus denen Gleichung (5) besteht (für N=3) benötigt werden. Die ersten drei Zeilen entsprechen dem ersten Term, die vierte Zeile entspricht dem zweiten Term, die fünfte und sechste Zeile entsprechen dem dritten Term, dargestellt in 2- Komplementform und die siebte und achte Zeile entsprechen dem letzten Term von Gleichung 5. The following panel shows the partial product terms needed for the terms that make up equation (5) (for N=3). The first three rows correspond to the first term, the fourth row corresponds to the second term, the fifth and sixth rows correspond to the third term, presented in 2's complement form, and the seventh and eighth rows correspond to the last term of equation 5.

Durch Umschreiben ergibt sich das im folgenden dargestellte Feld partieller Produktterme. Rewriting this yields the field of partial product terms shown below.

Aus dem obigen ergibt sich, daß das gewünschte Produkt durch Summation von nur zwei Bit Produkttermen erhalten werden kann, wobei die benötigten Grundelemente UND-Glieder und NAND- Glieder in Kombination mit Volladdierern und Halbaddierern sind, bei denen der Carry-in bei einigen auf "1" und bei den meisten auf "0" gesetzt ist. Ausführlicher dargestellt ergibt sich bei einer Untersuchung des oben angeführten zweiten Multiplikationsfeldes, daß die erste Zeile lediglich drei UND- Glieder (x&sub0;c&sub0;, x&sub0;c&sub1;, x&sub0;c&sub2;) und ein NAND-Glied ( ) erfordert. Die zweite Zeile erfordert drei UND-Glieder und Halbaddierer mit einem"0"-Carry-in (x&sub1;c&sub0;, x&sub1;c&sub1;, x&sub1;c&sub2;) und ein NAND-Glied und einen Halbaddierer mit einem "1"-Carry-in ( ). Die nächste Zeile erfordert drei UND-Glieder und Volladdierer (x&sub2;C&sub0;, x&sub2;c&sub1;, x&sub2;c&sub2;) und ein NAND-Glied und einen Volladdierer ( ). Die letzte Zeile erfordert drei NAND- Glieder und Volladdierer ( ) und ein UND-Glied und einen Volladdierer (x&sub3;c&sub3;). Dieses Schema an Gliedern und Addierern läßt sich in gewohnter Art und Weise sehr einfach auf Fälle erweitern, bei denen der Multiplikator und der Multiplikand mehr als vier Bits aufweisen.From the above it follows that the desired product can be obtained by summing only two bit product terms, where the basic elements required are AND gates and NAND gates in combination with full adders and half adders, where the carry-in is set to "1" for some and to "0" for most. In more detail, an examination of the second Multiplication field that the first line requires only three AND gates (x₀c₀, x₀c₁, x₀c₂) and one NAND gate ( ). The second line requires three AND gates and half adders with a "0" carry-in (x₁c₀, x₁c₁, x₁c₂) and one NAND gate and one half adder with a "1" carry-in ( ). The next line requires three AND gates and full adders (x₂C₀, x₂c₁, x₂c₂) and one NAND gate and one full adder ( ). The last line requires three NAND gates and full adders ( ) and one AND gate and full adder (x₃c₃). This scheme of gates and adders can be easily extended in the usual way to cases where the multiplier and the multiplicand have more than four bits.

Fig. 2 illustriert eine Rotatorstruktur, die einen Multiplikationsbereich 310 und einen Additions- und einen Subtraktionsbereich 330 aufweist. Ein wichtiger Aspekt des in Fig. 2 dargestellten Rotators besteht darin, daß alle Elemente in einer verschachtelten Art und Weise konstruiert sind, was bedeutet, daß korrespondierende Funktionen jedes der im Bereich 310 realisierten vier Multiplizierers als eine Einheit und in großer physischer Nähe zueinander geschaffen sind. Diese Verschachtelung liefert eine Reihe von Vorteilen. Ein Vorteil besteht darin, daß alle Signalleitungen (einschließlich der Eingangsleitungen) kurz sind, wodurch die Geschwindigkeitsmöglichkeiten verbessert werden. Ein zweiter Vorteil besteht darin, daß alle korrespondierenden Leitungen im wesentlichen dieselbe Länge aufweisen, wodurch Asymmetrien (skew) minimiert und - daraus folgend - die Geschwindigkeitsmöglichkeiten verbessert werden. Ein dritter Vorteil besteht darin, daß die Struktur vollständig regelmäßig ist, was eine effiziente Verwendung der Siliciumfläche ermöglicht.Figure 2 illustrates a rotator structure having a multiplication section 310 and an addition and subtraction section 330. An important aspect of the rotator shown in Figure 2 is that all elements are constructed in an interleaved manner, meaning that corresponding functions of each of the four multipliers implemented in section 310 are created as a unit and in close physical proximity to each other. This interleaving provides a number of advantages. One advantage is that all signal lines (including the input lines) are short, thereby improving speed capabilities. A second advantage is that all corresponding lines are essentially the same length, thereby minimizing skew and, consequently, improving speed capabilities. A third advantage is that the structure is completely regular, allowing efficient use of silicon area.

In Fig. 2 umfaßt der Multiplikationsbereich 310 eine Vielzahl von Vierergruppen-Multipliziererblocks (Quad Multiplier (QM) Blocks) 320. Jeder Block 320 hat zwei Signaleingänge und zwei Koeffizienteneingänge wie auch Summen- und Übertragseingänge zum Weiterleiten von Informationen von anderen Blocks 320. Die Blocks 320 sind so entworfen, daß sie eine zweidimensionale rechteckige Matrix bilden, wobei die Elemente 320 in jeder "Zeile" und jeder "Spalte" mit zwei Elementen in einer "Zeile" höherer Nummer verbunden sind: Mit einem Element in derselben "Spalte" und mit einem Element in einer "Spalte" höherer Nummer. Das heißt, daß ein Element 320 aus einer "Zeile" i und einer "Spalte" j (QM)i,j mit (QM)i+1,j und (QM)i+1,j+1 verbunden ist.In Fig. 2, the multiplication area 310 comprises a plurality of quad multiplier blocks (Quad Multiplier (QM) Blocks) 320. Each block 320 has two signal inputs and two coefficient inputs as well as sum and carry inputs for passing information from other blocks 320. The blocks 320 are designed to form a two-dimensional rectangular matrix, with the elements 320 in each "row" and each "column" connected to two elements in a higher numbered "row": to an element in the same "column" and to an element in a higher numbered "column". That is, an element 320 from a "row" i and a "column" j (QM)i,j is connected to (QM)i+1,j and (QM)i+1,j+1.

In Übereinstimmung mit dem oben Gesagten ist die Struktur des Bereiches 310 im wesentlichen rechteckig und entspricht - wie im folgenden dargestellt - einer geshifteten Version des Multiplikationsfeldes. In accordance with the above, the structure of the region 310 is substantially rectangular and corresponds to a shifted version of the multiplication field, as shown below.

Wie von der obigen Zerlegung des Multiplikationsprozesses zu sehen ist, sind die Blocks 320 nicht in jeder Hinsicht identisch. Sie sind in dem Sinne identisch, daß sie alle zu dem Produkt beitragen durch Bearbeiten von drei Eingangsbits (zwei Bits in gewisser gegengekoppelter Position) und durch Erzeugen von Summen- und Übertragsausgangsbits. Sie unterscheiden sich dadurch, daß einige UND-Glieder erfordern, während andere NAND-Glieder erfordern. Zudem erfordern einige Volladdierer, während andere Halbaddierer erfordern, wie oben beschrieben. Obwohl bei einigen Anwendungen keines der NAND- Glieder getaktet oder registriert ist (d. h. keine Pipelineverarbeitung), werden bei anderen Anwendungen außerdem einige oder alle der Blocks registriert, um vorzusorgen, welcher Umfang an Pipelineverarbeitung auch immer wünschenswert erscheint.As can be seen from the above decomposition of the multiplication process, the blocks 320 are not identical in every respect. They are identical in the sense that they all contribute to the product by operating on three input bits (two bits in some negative feedback position) and by Generate sum and carry output bits. They differ in that some require AND gates while others require NAND gates. In addition, some require full adders while others require half adders, as described above. Additionally, although in some applications none of the NAND gates are clocked or registered (ie, no pipeline processing), in other applications some or all of the blocks are registered to provide for whatever amount of pipeline processing is desirable.

Fig. 3 zeigt ein QM-Element, das einen Volladdierer und ein Register umfaßt. Dies ist die allgemeine Ausführungsform, da ein QM-Element, das einen Halbaddierer und kein Register umfaßt, im wesentlichen eine abgemagerte Version des in Fig. 3 dargestellten QM-Elementes darstellt. In Fig. 3 ist das Element 400 ein QM-Element 320 in einer speziellen Spalte am Zeilenende innerhalb des Bereiches 310. Das Element 390 ist ein QM-Element in einer Zeile oberhalb des Elementes 400 und in derselben Spalte, das Element 380 ist ein QM-Element in einer Zeile oberhalb des Elementes 400 und in einer Spalte, die von geringerer arithmetischer Bedeutung (um ein Bit) ist, als diejenige von Element 400 und Element 410 ist ein QM- Element in einer Zeile unterhalb des Elementes 400.Fig. 3 shows a QM element that includes a full adder and a register. This is the general embodiment, since a QM element that includes a half adder and no register is essentially a stripped down version of the QM element shown in Fig. 3. In Fig. 3, element 400 is QM element 320 in a special column at the end of the row within region 310. Element 390 is a QM element in a row above element 400 and in the same column, element 380 is a QM element in a row above element 400 and in a column that is of less arithmetic significance (by one bit) than that of element 400, and element 410 is a QM element in a row below element 400.

Innerhalb des Elementes 400 sind Volladdierer 401-404, die auf die Multiplikatorenbits ci und cj sowie auf Multiplikandenbits (z. B. den dritten Bit) der Multiplikandenwörter xm und xn ansprechen, zum Aufsummieren von Bits von dem QM-Element 380 und zum Übertragen von Bits von dem QM-Element 390. Insbesondere spricht der Addierer 401 auf eine Summen- und Übertragseingabe von den Elementen 380 und 390 an und auf eine ausgewählte logische Kombination von ci und xm. Der Addierer 402 spricht auf eine Summen- und eine Übertragseingabe von den Elementen 380 und 390 an und auf dieselbe logische Kombination von cj und xm. Der Addierer 403 spricht auf eine Summen- und eine Übertragseingabe von den Elementen 380 und 390 an und auf dieselbe logische Kombination von ci und xm und schließlich spricht der Addierer 404 auf eine Summen- und eine Übertragseingabe von den Elementen 380 und 390 an und auf dieselbe logische Kombination von cj und xn. Jeder der Addierer (401-404) erzeugt ein Signalpaar, umfassend ein Summenausgangssignal und ein Übertragsausgangssignal. Jedes dieser Signalpaare wird in der in Fig. 3 dargestellten Ausführungsform an ein Register angelegt, welches ebenfalls auf ein Taktsignal C anspricht. Die getakteten Ausgangssignale dieser Register (409-413) bilden die Ausgangssignale des QM- Elementes 400. Die Übertragssignale werden an das QM-Element 410 angelegt, während die Summensignale an das signifikanteste nächste QM-Element in der Zeile des Elementes 410 angelegt werden. Die oben erwähnte logischen Kombinationen von ci und cj mit xm und xn, die durch die Elemente 405 bis 408 ausgeführt werden, sind entweder UND-Glieder oder NAND- Glieder, abhängig von der speziellen Zeile und Spalte, die das Element 390 belegt.Within element 400 are full adders 401-404 responsive to the multiplier bits ci and cj and to multiplicand bits (e.g., the third bit) of the multiplicand words xm and xn for summing bits from the QM element 380 and for carrying bits from the QM element 390. In particular, adder 401 is responsive to a sum and carry input from elements 380 and 390 and to a selected logical combination of ci and xm. Adder 402 is responsive to a sum and carry input from elements 380 and 390 and to the same logical combination of cj and xm. Adder 403 is responsive to a sum and a carry input from elements 380 and 390 and to the same logical combination of ci and xm and finally adder 404 is responsive to a sum and a carry input from elements 380 and 390 and to the same logical combination of cj and xn. Each of the adders (401-404) produces a signal pair comprising a sum output signal and a carry output signal. Each of these signal pairs is applied in the embodiment shown in Fig. 3 to a register which is also responsive to a clock signal C. The clocked output signals of these registers (409-413) form the output signals of QM element 400. The carry signals are applied to QM element 410 while the sum signals are applied to the most significant next QM element in the row of element 410. The above-mentioned logical combinations of ci and cj with xm and xn performed by elements 405 through 408 are either AND gates or NAND gates, depending on the particular row and column that element 390 occupies.

Der Bereich 330 in Fig. 2 umfaßt das Addierer- und Subtrahierernetz, das zur Komplettierung der Rotatorfunktion benötigt wird. Jedes QM-Element in der untersten (letzten) Zeile und in der am wenigsten signifikanten (äußersten rechten) Spalte des Feldes 320 liefert vier Summenbits und diese Bits müssen geeignet addiert und subtrahiert werden. Jedes Addition-/Subtraktionselement 340 innerhalb des Bereiches 330 umfaßt daher einen Zweibit-Addierer und einen Zweibit-Subtrahierer. In Übereinstimmung mit herkömmlichen Entwurfstechniken wird der Subtrahierer implementiert durch einfaches Invertieren des zu subtrahierenden Eingabewertes und durch Addition einer "1" in der Carry-in-Position des ersten Addierers in dem Feld. Mit anderen Worten ausgedrückt, kann der Bereich 330 einfach zwei Schnellübertrags-Addierer (Ripple-Through-Adders) umfassen.Region 330 in Figure 2 comprises the adder and subtractor network needed to complete the rotator function. Each QM element in the bottom (last) row and in the least significant (rightmost) column of array 320 provides four sum bits and these bits must be appropriately added and subtracted. Each addition/subtraction element 340 within region 330 therefore comprises a two-bit adder and a two-bit subtractor. In accordance with conventional design techniques, the subtractor is implemented by simply inverting the input value to be subtracted and adding a "1" in the carry-in position of the first adder in the array. In other words, the region 330 may simply comprise two ripple-through adders.

Die Kommunikationsschaltung in Fig. 1 sorgt für den Transfer von Daten von und zu dem Rotator. Dieser Transfer ist spezifisch für den implementierten Algorithmus, aber die unten beschriebene Hardwarerealisation ist grundlegend. Es kann gezeigt werden, daß eine orthogonale Transformation (Matrix Q) in Übereinstimmung mit: The communication circuit in Fig. 1 provides for the transfer of data to and from the rotator. This transfer is specific to the algorithm implemented, but the hardware realization described below is fundamental. It can be shown that an orthogonal transformation (matrix Q) in accordance with:

als eine Folge von ebenen Rotationen (Matrix T&ijlig;) implementiert werden kann.can be implemented as a sequence of plane rotations (matrix T&ijlig;).

Dieses Prinzip wird in dem vorliegenden Transformationsprozessor verwendet, wie im folgenden in Verbindung mit der Darstellung einer diskreten Kosinus- Transformation (DCT) illustriert wird.This principle is used in the present transformation processor, as described below in connection with the representation of a discrete cosine transform (DCT).

Eine Achtpunkt-DCT-Transformation kann durch die Matrix An eight-point DCT transformation can be given by the matrix

ausgedrückt werden. Durch Umordnen von Spalten und Betrachten von ausgewählten transformierten Ausgangssignalen als Signalpaare kann die obige Matrix zerlegt und in vier Gruppierungen strukturiert werden, wobei jede Gruppierung vier Terme der durch Gleichung (1) spezifizierten Form umfaßt.By rearranging columns and considering selected transformed output signals as signal pairs, the above matrix can be decomposed and structured into four groupings, each grouping comprising four terms of the form specified by equation (1).

Eine Hardwarerealisation einer solchen Formulierung kann mit einer Rotatorschaltung der oben beschriebenen Form erreicht werden und mit einer Kommunikationsschaltung, die ausreichend Speicherplatz zum Speichern der Eingangssignale (xi) und der entstehenden Zwischenresultate aufweist. Eine effizientere Realisation ergibt sich jedoch unter Verwendung eines "schnellen DCT"-Algorithmus.A hardware realization of such a formulation can be achieved with a rotator circuit of the form described above and with a communication circuit that has sufficient memory space to store the input signals (xi) and the resulting intermediate results. However, a more efficient realization is achieved using a "fast DCT" algorithm.

Fig. 4 zeigt den Signalfluß für einen "schnellen DCT"- Algorithmus, wobei jeder der Kreise in Fig. 4 (z. B. Kreis 17) eine Am-Ort-Rotation repräsentiert. Der Ausdruck "Am-Ort- Rotation" bedeutet, daß die Rotationsoperation durch eine Kommunikationsschaltung ausgeführt wird, die beim Eingeben zweier Eingangssignale von speziellen Speicherpositionen in den Rotator die Ergebnisse (von dem Rotator) in dieselben Speicherpositionen zurückleitet. Erforderlich ist es dann noch, die Reihenfolge der Signale zu und von der Kommunikationsschaltung 100 geeignet zu steuern zum Erzielen der Endresultate, die in Fig. 4 spezifiziert und in der folgenden Tabelle I zusammengefaßt sind. Tabelle 1 Fig.4 shows the signal flow for a "fast DCT" algorithm, where each of the circles in Fig. 4 (e.g., circle 17) represents an in-place rotation. The term "in-place rotation" means that the rotation operation is performed by a communications circuit which, upon inputting two input signals from specific memory locations to the rotator, returns the results (from the rotator) to the same memory locations. It is then necessary to appropriately control the order of signals to and from the communications circuit 100 to achieve the final results specified in Fig. 4 and summarized in Table I below. Table 1

Zusätzlich zu der Aufgabe, der Rotatorschaltung Eingangssignale und Zwischenergebnisse zuzuleiten, muß die Kommunikationsschaltung 100, die zwei Operationsmodi darstellt, auch ankommende neue Daten annehmen. Unter Berücksichtigung dieser zweier Funktionen zeigt Fig. 5 ein Funktionsdiagramm einer Kommunikationsschaltung 100.In addition to the task of supplying input signals and intermediate results to the rotator circuit, the communication circuit 100, which represents two modes of operation, must also accept incoming new data. Taking these two functions into account, Fig. 5 shows a functional diagram of a communication circuit 100.

In Übereinstimmung mit Fig. 5 umfaßt die Schaltkreis 100 einen adressierbaren Speicher 121 und einen nicht adressierbaren Speicher 122. Der Speicher 122 ist hauptsächlich ein serieller Speicher. Das heißt, daß die Eingangsdaten über Leitung 123 in den Speicher 122 geshiftet werden und daß die transformierten Ausgangssignale über Leitung 124 aus dem Speicher 122 geshiftet werden. Dieser Zustand ist unter der Überschrift "normal" auf der linken Seite von Fig. 5 dargestellt. Während die Eingangsdaten in den Speicher 122 geshiftet werden, wirkt der Speicher 121 mit dem Rotator 300 zusammen und führt die Am-Ort-Substitution von Rotator-Eingangssignalen durch Rotatorergebnisse aus. Dies läßt sich einfach erreichen, da der Speicher 121 adressierbar ist und so angeordnet, daß Daten von jeder Speicheradresse an allen Adressen außer an den zwei ausgewählten Adressen zu sich selbst zurückgeführt werden. An den zwei ausgewählten Adressen tritt die zuvor erwähnte Substitution auf, wie durch die Signale X&sub0;, X&sub1;, Y&sub0;, Y&sub1; in Fig. 5 dargestellt. Wenn die in dem Rotator 300 durchgeführte Transformation beendet ist, müssen die in dem Speicher 122 gesammelten Daten als Vorbereitung für die nächste Transformation in dem Speicher 121 untergebracht werden. Gleichzeitig müssen die in dem Speicher 121 gespeicherten Transformationsresultate entfernt werden. Dies wird dadurch erreicht, daß der Inhalt des Speichers 121 mit dem Inhalt des Speichers 122 ausgetauscht wird, wie auf der rechten Seite von Fig. 5 unter der Überschrift "swap" (Austausch) dargestellt ist.In accordance with Fig. 5, the circuit 100 includes an addressable memory 121 and a non-addressable memory 122. The memory 122 is essentially a serial memory. That is, the input data is shifted into the memory 122 via line 123 and the transformed output signals are shifted out of the memory 122 via line 124. This state is shown under the heading "normal" on the left side of Fig. 5. While the input data is being shifted into the memory 122, the memory 121 cooperates with the rotator 300 and performs the in-place substitution of rotator input signals with rotator results. This is easily accomplished because the memory 121 is addressable and arranged so that data from any memory address is fed back to itself at all addresses except the two selected addresses. At the two selected addresses, the aforementioned substitution occurs, as shown by the signals X0, X1, Y0, Y1 in Fig. 5. When the transformation performed in the rotator 300 is completed, the data collected in the memory 122 must be placed in the memory 121 in preparation for the next transformation. At the same time, the transformation results stored in the memory 121 must be removed. This is accomplished by exchanging the contents of the memory 121 with the contents of the memory 122, as shown on the right side of Fig. 5 under the heading "swap".

Tabelle 2 zeigt die Adressierungsreihenfolge für den Speicher 122 und die für den Rotator 300 verwendeten Koeffizienten. Die Adressen und Koeffizienten von Tabelle 2 entsprechen den Bezeichnungen in Fig. 4. Tabelle 2 Adresse Koeffizienten Table 2 shows the addressing order for the memory 122 and the coefficients used for the rotator 300. The addresses and coefficients of Table 2 correspond to the designations in Fig. 4. Table 2 Address coefficients

Die Reihenfolge von Tabelle 2 ist nicht die einzige mögliche Reihenfolge. Fig. 4 offenbart, daß jede Reihenfolge, die sicherstellt, daß gewisse Rotationen keinen Vorrang haben vor gewissen anderen Rotationen, akzeptabel ist. Es sei auch darauf hingewiesen, daß der schnelle DCT-Algorithmus von Fig. 4 nicht der einzig mögliche "schnelle DCT"-Algorithmus ist. Zur Illustration zeigt Fig. 6 einen Algorithmus, der in gewissem Sinne regulärer ist als der in Fig. 4 dargestellte Algorithmus. Die Kreise und Zahlen in Fig. 6 haben dieselbe Bedeutung wie in Fig. 4.The order of Table 2 is not the only possible order. Fig. 4 reveals that any order that ensures that certain rotations do not take precedence over certain other rotations is acceptable. It should also be noted that the fast DCT algorithm of Fig. 4 is not the only possible "fast DCT" algorithm. To illustrate, Fig. 6 shows an algorithm that is in some sense more regular than the algorithm shown in Fig. 4. The circles and numbers in Fig. 6 have the same meaning as in Fig. 4.

Fig. 7 illustriert eine Ausführungsform einer Kommunikationsschaltung 100 zur Implementierung des Funktionsdiagramms von Fig. 5. Die Eingangssignale Y&sub0; und Y&sub1; sind an die Normierungseinheiten 110 und 120 angelegt und die Ausgaben der Normierungseinheiten 110 und 120 dienen zur Adressierung der Multiplexer 111 und 112. Die Multiplexer 111 und 112 sprechen auf die Adreßsignale addr&sub0; und addr&sub1; an. Diese Adreßsignale werden der Kommunikationsschaltung 100 durch die Suchtabelle 200 (Look-up Table) zugeführt und folgen der Reihenfolge, wie sie beispielsweise in Tabelle 2 festgelegt ist. Die Multiplexer 111 und 112 sind herkömmliche Selektoren (one-to-many selectors) in dem Sinne, daß sie Eingangssignale Y&sub0; und Y&sub1; dazu veranlassen, an einem der vielen Ausgänge der Multiplexer 111 bzw. 112 aufzutreten. Die Multiplexer 111 und 112 unterscheiden sich von herkömmlichen Multiplexern dadurch, daß jeder Ausgangsleitung eine Steuerleitung zugeordnet ist, die die Ausgangsleitung, auf der Signale anliegen, identifiziert. Diese Leitung ermöglicht die Rückführung von Signalen von all den Registern, die keine Signale Y empfangen, wie im Zusammenhang mit Fig. 5 diskutiert. Die Ausgänge der Multiplexer 111 und 112 sind mit einem Vielfacheingabe-Vielfachausgabe-Speicherblock 130 verbunden, welcher die Speicher 121 und 122 enthält und eine Vielzahl von Speicherblocks 131 umfaßt. Jeder Speicherblock 131 besitzt zwei Eingänge und zwei Ausgänge, einen "Bereit"- Steuersignaleingang, ein Freigabe-Steuersignal und zwei Register. Die Ausgänge des Multiplexers 111 sind je mit einem Eingang eines anderen Speicherblocks 131 verbunden. Die Ausgänge des Multiplexers 112 sind jeweils parallel zu den Ausgängen des Multiplexers 111 geschaltet. Der andere Eingang jedes Speicherblockes ist mit dem einen Ausgang des vorgeschalteten Speicherblockes 131 verbunden und bildet hierbei die Eingangs- und Ausgangsverbindung mit der seriellen Speicheranordnung 122. Die anderen Ausgänge des Speicherblockes 131 sind an die Adreß-Multiplexer 113 und 114 angelegt, die durch die addr&sub0; und addr&sub1; Steuersignale gesteuert werden. Die Ausgangssignale der Multiplexer 113 und 114 (many-to-one) sind die Signale X&sub0; und X&sub1;, die an dem Rotator 300 anliegen. Die Signale X&sub0; und X&sub1; sind entweder das Eingangssignal x oder Zwischenresultatglieder, wie durch das Signalflußdiagramm in Fig. 4 beschrieben. Wie oben ausgeführt, sind die Signale Y&sub0; und Y&sub1; die Rechenergebnisse des Rotators, die bei Beendigung gleich y&sub0; und y&sub1; sind.Fig. 7 illustrates one embodiment of a communications circuit 100 for implementing the functional diagram of Fig. 5. The input signals Y₀ and Y₁ are applied to the normalization units 110 and 120 and the outputs of the normalization units 110 and 120 are used to address the multiplexers 111 and 112. The multiplexers 111 and 112 are responsive to the address signals addr₀ and addr₁. These address signals are applied to the communications circuit 100 through the look-up table 200 and follow the order as defined, for example, in Table 2. The multiplexers 111 and 112 are conventional one-to-many selectors in the sense that they accept input signals Y₀ and Y₁. to appear at one of the many outputs of the multiplexers 111 and 112, respectively. The multiplexers 111 and 112 differ from conventional multiplexers in that each output line is associated with a control line which identifies the output line on which signals are present. This line allows signals to be returned from all of the registers which do not receive Y signals, as discussed in connection with Fig. 5. The outputs of the multiplexers 111 and 112 are connected to a multiple input multiple output memory block 130 which contains the memories 121 and 122 and comprises a plurality of memory blocks 131. Each memory block 131 has two inputs and two outputs, a "ready" control signal input, an enable control signal and two registers. The outputs of the multiplexer 111 are each connected to an input of a different memory block 131. The outputs of the multiplexer 112 are each connected in parallel to the outputs of the multiplexer 111. The other input of each memory block is connected to one output of the upstream memory block 131 and forms the input and output connection to the serial memory arrangement 122. The other outputs of the memory block 131 are applied to the address multiplexers 113 and 114, which are controlled by the addr₀ and addr₁ control signals. The output signals of the multiplexers 113 and 114 (many-to-one) are the signals X₀ and X₁, which are applied to the rotator 300. The signals X₀ and X₁ are either the input signal x or intermediate result terms as described by the signal flow diagram in Fig. 4. As stated above, the signals Y₀ and Y₁ are the computational results of the rotator, which upon completion are equal to y₀ and y₁.

Fig. 8 zeigt eine Realisierungsmöglichkeit für die Speicherblöcke 131. Sie umfaßt Register 133 und 134, einen zweipoligen Umschalter 132 und einen einpoligen Umschalter 135. Die Eingangssignale von den Multiplexern 111 und 112 liegen an dem Eingangsanschluß des Schalters 135 an und die Freigabesignale von den Multiplexern 111 und 112 liegen an dem Steueranschluß des Schalters 135 an. Die andere Eingabe an den Schalter 135 erfolgt von der Ausgabe des Registers 133 und ermöglicht hierdurch die Signalrückkopplungsmöglichkeit. Das Ausgangssignal des Schalters 135 liegt an einen Eingang des Schalters 132 an, während der serielle Ausgang des Blocks 131 an dem anderen Eingang des Schalters 132 anliegt ist. Der Schalter 132 wird durch das "Bereit"-Steuersignal gesteuert. Normalerweise ist das "Bereit"-Steuersignal gegeben, so daß der serielle Eingang durch den Schalter 132 an das Register 134 anliegt und der andere Eingang (von Schalter 135) an das Register 133 anliegt. Die Ausgabe des Registers 133 liegt an den Multiplexern 113 und 114 an (zusätzlich dazu, daß sie auch an Schalter 135 anliegt), während das Ausgangssignal des Registers 134 an der seriellen Eingabe des nächsten Blockes 135 anliegt.Fig. 8 shows one possible implementation for the memory blocks 131. It includes registers 133 and 134, a double pole switch 132 and a single pole switch 135. The input signals from the multiplexers 111 and 112 are applied to the input terminal of the switch 135 and the enable signals from the multiplexers 111 and 112 are applied to the control terminal of the switch 135. The other input to the switch 135 is from the output of the register 133, thereby enabling the signal feedback capability. The output signal of the switch 135 is applied to one input of the switch 132, while the serial output of the block 131 is applied to the other input of the switch 132. The switch 132 is controlled by the "ready" control signal. Normally, the "Ready" control signal is given so that the serial input through switch 132 is applied to register 134 and the other input (from switch 135) is applied to register 133. The output of register 133 is applied to the multiplexers 113 and 114 (in addition to being applied to switch 135), while the output of register 134 is applied to the serial input of the next block 135.

Die Normierungseinheiten 110 und 120 in Fig. 7 werden benötigt, da die Multiplikationsergebnisse des Rotators 300 eine Bitanzahl aufweist, die gleich der Summe der Bits in dem Multiplikator und dem Multiplikanden (+1) ist. Diese Anzahl muß auf die Anzahl von Bits in dem Multiplikanden reduziert werden, wenn die Ergebnisse nicht mit jeder Iteration wachsen sollen. Dies kann durch einfachen Abbruch erfolgen, es wird jedoch vorgeschlagen, eine Normierungseinheit zu verwenden, die gelegentlich auftretende große Werte abschneidet. Dieses Abschneiden gestattet es, weniger Bits abzubrechen und ermöglicht damit einen geringeren Grad an Abbruchfehlern. Fig. 9 zeigt eine einfache Realisierungsmöglichkeit für die Normierungseinheiten 110 und 120. Das Register 115 übernimmt die Resultate des Rotators 300 und ausgewählte Ausgangsbits des Registers 115 hoher Signifikanz liegen an dem Detektor 116 an. Wenn der Detektor 116 alle Nuller oder Einer (das Vorzeichenbit) empfängt, wird eine ausgewählte Gruppe der nächsten signifikantesten Bits des Registers 115 durch die Gatteranordnung 117 zu den Multiplexern der Fig. 7 geleitet. Ansonsten blockiert der Detektor 117 diese Ausgangsbits und ersetzt sie durch das Vorzeichenbit.The normalization units 110 and 120 in Fig. 7 are needed because the multiplication results of the rotator 300 have a number of bits equal to the sum of the bits in the multiplier and the multiplicand (+1). This number must be reduced to the number of bits in the multiplicand if the results are not to grow with each iteration. This can be done by simple truncation, but it is suggested to use a normalization unit that truncates occasionally occurring large values. This truncation allows fewer bits to be truncated and thus enables a lower degree of truncation errors. Fig. 9 shows a simple implementation for the normalization units 110 and 120. The register 115 receives the results of the rotator 300 and selected output bits of the register 115 of high significance are applied to the detector 116. If the detector 116 receives all zeros or ones (the sign bit), a selected group of the next most significant bits of the register 115 are passed through the gate arrangement 117 to the multiplexers of Figure 7. Otherwise, the detector 117 blocks these output bits and replaces them with the sign bit.

Die Struktur der orthogonalen Transformation ist ein iteratives Verfahren, das eine Sequenz von Koeffizienten und "addr"-Adressensteuersignalen in rascher Reihenfolge erfordert. Ein herkömmlicher Festwertspeicher und ein Zähler stellen eine Lösung für diese Funktion dar, aber es ist schwierig, die Geschwindigkeit der Adressendecodierung in einem Standard ROM zu erhöhen und Zähler sind komplizierter als nötig. Ein seriell adressierter Festwertspeicher erfüllt die Anforderungen des Transformationsverfahrens und besitzt die Vorteile eines Schieberegisterintensive-Designs.The structure of the orthogonal transform is an iterative process that requires a sequence of coefficients and "addr" address control signals in rapid order. A conventional read-only memory and a counter provide a solution for this function, but it is difficult to increase the speed of address decoding in a standard ROM and counters are more complicated than necessary. A serially addressed read-only memory satisfies the requirements of the transformation process and has the advantages of a shift register intensive design.

Bezug nehmend auf Fig. 10 ist der Festwertspeicher in Block 56 konkretisiert, der eine Ansammlung von Signalleitungen 51, 52 und 53 und aktivierbare Verbindungspunkte 54 enthält. Die Signalleitungen sind in Übereinstimmung mit der Reihenfolge 51, 52, 53, 52, 51, 52, . . . , verschachtelt und die Verbindungspunkte 54 verbinden (wenn aktiviert) ausgewählte benachbarte Leitungen. Die Leitungen 51 sind alle mit einer ersten Spannungsquelle Vi verbunden, entsprechend einem logischen Niveau "1" und die Leitungen 54 sind alle mit einer zweiten Spannungsquelle V&sub0; verbunden, entsprechend einem logischen Niveau "0". Die Leitungen 52 bilden den Ausgang des Speichers. Die Verbindungspunkte 54 sind herkömmliche Halbleiterschalter, die über Aktivierungssignale gesteuert werden. Die Verbindungspunkte sind in Gruppen angeordnet, wobei jede Gruppe aus einem Verbindungspunkt, der jeder Leitung 52 zugeordnet ist, besteht und wobei alle Verbindungspunkte durch ein einfaches Steuersignal gesteuert werden. Die Steuersignale für die unterschiedlichen Gruppen werden von dem Register 55 erhalten, in das mit jedem "Bereit"-Signal, das an die Suchtabelle 200 angelegt ist, ein Puls zwischengeschaltet wird. Wenn der "Bereit"-Puls durch das Register 55 geshiftet wird, wird auf sukzessive Wörter des Speichers zugegriffen.Referring to Fig. 10, the read-only memory is embodied in block 56, which includes a collection of signal lines 51, 52, and 53 and activatable connection points 54. The signal lines are interleaved in accordance with the order 51, 52, 53, 52, 51, 52, . . . and the connection points 54 connect (when activated) selected adjacent lines. The lines 51 are all connected to a first voltage source Vi corresponding to a logic level "1" and the lines 54 are all connected to a second voltage source V0 corresponding to a logic level "0". The lines 52 form the output of the memory. The connection points 54 are conventional semiconductor switches controlled by activation signals. The connection points are arranged in groups, each group consisting of one connection point associated with each line 52, and all of the connection points are controlled by a single control signal. The control signals for the different groups are obtained from register 55, into which a pulse is interposed with each "ready" signal applied to lookup table 200. As the "ready" pulse is shifted through register 55, successive words of the memory are accessed.

Die Suchtabelle 200 in Fig. 1 zeigt auch ein an die Tabelle angelegtes "inverses" Steuersignal. Dieses Signal sorgt für die Durchführung der inversen Transformation. Die inverse Transformation läßt sich, wie in Fig. 11 dargestellt, durch eine zweite, in das Element 200 integrierte Suchtabelle realisieren. Alles was benötigt wird ist die Verwendung zweier Blocks 56, die durch ein Register 55 gesteuert werden und eines Multiplexers 57, der (gesteuert durch das "inverse" Steuersignal) einen der Speicher der beiden Blocks 56 auswählt. Diese zweite Suchtabelle gestattet die Bestimmung einer anderen Sequenz von Adreßkoeffizienten.The lookup table 200 in Fig. 1 also shows an "inverse" control signal applied to the table. This signal ensures that the inverse transformation is carried out. The inverse transformation can be realized, as shown in Fig. 11, by a second lookup table integrated into the element 200. All that is required is the use of two blocks 56 controlled by a register 55 and a multiplexer 57 which (controlled by the "inverse" control signal) accesses one of the memories of the two blocks 56. This second lookup table allows the determination of a different sequence of address coefficients.

Ein Kandidat für den internationalen Videocodierer- Transformationsstandard niederer Bitrate ist eine zweidimensionale DCT von 8 · 8 Pixelblocks. Diese läßt sich in eine eindimensionale Achtpunkt-Transformation von jeder Reihe von 8 Pixeln, gefolgt von einer eindimensionalen Achtpunkt- Transformation von jeder Spalte zerlegen. Wie in Fig. 12 dargestellt, läßt sich solch eine Transformation durch einen Prozessor für orthogonale Transformationen 501, der mit einem Transponierungsspeicher 502 und einem anderen Prozessor für orthogonale Transformationen 501 hintereinandergeschaltet ist implementieren. Sie kann auch in einem einzigen Prozessor für orthogonale Transformationen 501 implementiert werden, wobei der Speicher 64 Muster enthält und die Suchtabelle geeignet angeordnet ist.A candidate for the international low bit rate video encoder transform standard is a two-dimensional DCT of 8 x 8 pixel blocks. This can be decomposed into a one-dimensional eight-point transform of each row of 8 pixels, followed by a one-dimensional eight-point transform of each column. As shown in Fig. 12, such a transform can be implemented by an orthogonal transform processor 501 connected in series with a transpose memory 502 and another orthogonal transform processor 501. It can also be implemented in a single orthogonal transform processor 501, with the memory containing 64 patterns and the lookup table suitably arranged.

Fig. 13 stellt eine Realisierungsmöglichkeit für den Transponierungsspeicher 502 dar. Sie zeigt ein zweidimensionales Feld aus Speicherregistern 503, das so gruppiert werden kann, um in einer horizontalen oder vertikalen Zeilenabtastungssequenz zu shiften. Ausführlicher dargestellt hat jedes Speicherregister 503 einen horizontalen Ein- und Ausgang, einen vertikalen Ein- und Ausgang und einen Richtungssteuereingang. Die Register 503 des Feldes sind somit einander verbunden, daß die horizontalen Ausgänge innerhalb einer Zeile mit horizontalen Eingängen innerhalb derselben Zeile verbunden sind und die vertikalen Ausgänge in einer Zeile mit den vertikalen Eingängen innerhalb derselben Zeile verbunden sind. Dies gilt für alle Elemente, die nicht in der ersten und letzten Spalte oder Zeile sind. Der vertikale Ausgang in der letzten Zeile jeder Spalte ist mit dem vertikalen Eingang der ersten Zeile in der nächsten Spalte verbunden und der letzte horizontale Ausgang in jeder Spalte ist auf ähnliche Art mit dem ersten horizontalen Eingang in der nächsten Zeile verbunden. Die zwei Eingänge des Registers in der letzten Zeile und der letzten Spalte sind miteinander verbunden und bilden den Ausgang des Transponierungsspeichers 502.Fig. 13 illustrates one possible implementation for the transpose memory 502. It shows a two-dimensional array of memory registers 503 that can be grouped to shift in a horizontal or vertical line scan sequence. In more detail, each memory register 503 has a horizontal input and output, a vertical input and output, and a direction control input. The registers 503 of the array are thus connected to each other such that the horizontal outputs within a row are connected to horizontal inputs within the same row, and the vertical outputs in a row are connected to the vertical inputs within the same row. This applies to all elements that are not in the first and last column or row. The vertical output in the last row of each column is connected to the vertical input of the first row in the next column, and the last horizontal output in each column is similarly connected to the first horizontal input in the next row. The two inputs of the register in the last row and the last column are connected together and form the output of the transpose memory 502.

Wie in Fig. 14 dargestellt, beinhaltet jedes Speicherregister 503 ein Register 504 und einen Selektor 505. Der Selektor 505 spricht auf das Richtungssteuersignal an, das entweder den horizontalen oder den vertikalen Eingang auswählt. Der vertikale Eingang wird an das Register 504 angelegt und der Ausgang des Registers 504 wird an beide horizontale und vertikale Ausgänge angelegt.As shown in Figure 14, each storage register 503 includes a register 504 and a selector 505. The selector 505 is responsive to the direction control signal which selects either the horizontal or vertical input. The vertical input is applied to the register 504 and the output of the register 504 is applied to both the horizontal and vertical outputs.

Während des Betriebes werden Daten hineingeschoben, bis die Matrix voll ist. Dann wird die Richtung des Shiftens (Richtungsteuerung) umgekehrt und die Daten werden hinausgeschoben, während die Daten des nächsten Blockes hineingeschoben werden. Das Richtungssteuersignal ist demgemäß eine einfache Rechteckwelle. Diese Struktur besitzt den Vorteil, daß sie bei extrem hoher Geschwindigkeit arbeiten kann.During operation, data is shifted in until the matrix is full. Then the direction of shifting (direction control) is reversed and the data is shifted out while the data of the next block is shifted in. The direction control signal is therefore a simple square wave. This structure has the advantage of being able to operate at extremely high speed.

Die dargestellte Offenbarung und die hier beschriebenen einzelnen Ausführungsbeispiele dienen im wesentlichen nur zur Veranschaulichung der vorliegenden Erfindung. Für den Fachmann ergeben sich viele Änderungen in der Konstruktion sowie stark unterschiedliche Ausführungsformen und Anwendungen.The disclosure presented and the individual embodiments described here serve essentially only to illustrate the present invention. Many changes in design as well as widely different embodiments and applications will become apparent to those skilled in the art.

Der Prozessor kann beispielsweise mehr als ein Verarbeitungselement enthalten, z. B. einen Rotator und eine separate Addierer/Subtrahierereinheit, die Rotationen von 45 durchführt, wodurch die Verarbeitungsgeschwindigkeit der Einheit durch Ausnutzen der Parallelitäten in dem Algorithmus erhöht wird. Während Fig. 1 eine als I/O-Interface dienender Kommunikationsschaltung 100 und eine mit einem Prozessorelement (Rotator 300) kommunizierende Speichereinheit darstellt, heißt das, daß man durch einfache Verallgemeinerung der vorliegenden Erfindung auch eine Vielzahl von Prozessorelementen verwenden kann, die mit dem I/O-Interface und den Speichereinrichtungen (von denen angenommen wird, daß sie in den Rotator 300 oder sonstwie integriert sind) verbunden sind.For example, the processor may include more than one processing element, e.g. a rotator and a separate adder/subtractor unit that performs rotations of 45, thereby increasing the processing speed of the unit by exploiting the parallelism in the algorithm. While Fig. 1 shows a communication circuit 100 serving as an I/O interface and a memory unit communicating with a processor element (rotator 300) , this means that by simple generalization of the present invention, one may also use a plurality of processor elements connected to the I/O interface and the storage devices (which are assumed to be integrated into the rotator 300 or otherwise).

Claims

1. Orthogonal transform processor comprising a rotator (300), coefficient storage means (200) and interface means for transmitting input signals to the processor and for receiving output signals from the processor, characterized in that the interface means include a communication circuit (100) responsive to an applied processor input signal vector, applying components of the input signal vector to be transformed in pairs - designated x₀ and x₁ - to the rotator, controlling the transfer of the rotator output signals, the x₀ and x₁ component signals belonging to a set of signals including the processor input signal vector and first and second rotator output signals, controlling the order of coefficients applied to the rotator, and providing a transformed vector at an output of the transformation processor,

that the rotator is responsive to the pairs of x�0 and x₁ component signals and to first and second coefficient signals c�0 and c₁ associated with each of the pairs, forming a first rotator output signal corresponding to c�0 x�0 + c₁ x₁ and a second rotator output signal corresponding to -c₁ x�0 + c₁ x₁, and

that the coefficient storage means responsive to an initiation control signal are communication circuit supplying control signals and first and second coefficient signals to the rotator.

2. A processor according to claim 1, characterized in that the coefficient storage means are also responsive to an "inverse" control signal for effecting in the processor a preselected orthogonal transformation when the "inverse" control signal is at a first logic level, and for effecting in the processor the inverse transformation of the preselected orthogonal transformation when the "inverse" control signal is at a second logic level.

3. A processor according to claim 1, characterized in that the rotator includes four multipliers, each of which includes a plurality of modules, corresponding modules of the four multipliers being in close physical proximity to each other.

4. A processor according to claim 1, characterized in that the rotator performs four multiplications and includes an array of interconnected modules, each module including four regions and each region of a module performing a similar portion of the multiplications performed in the rotator.

5. A processor according to claim 1, characterized in that the rotator performs four multiplications and comprises a matrix of spatially adjacent and interconnected multiplier-adder modules, each module having four regions and each region of a module performs a similar part of the multiplications performed by the rotator.

6. A processor according to claim 1, characterized in that the rotator performs four multiplications and includes a matrix of spatially adjacent, interconnected multiplier and adder modules, each module comprising:

a first logical multiplication element (405) responsive to x�0 and c�0,

a first adder (401) responsive to the applied sum and carry signals and to the first logical multiplication element for generating an output sum signal and an output carry signal, a second logical multiplication element (406) responsive to x₀ and c₁,

an adder (402) responsive to the applied sum and carry signals and to the second logical multiplication element for generating an output sum signal and an output carry signal, a third logical multiplication element (407) responsive to x₁ and c₀,

an adder (403) responsive to the applied sum and carry signals and the third logical multiplication element for generating an output sum signal and an output carry signal,

a fourth logical multiplication element (408) responsive to x₁ and c₁, and

a fourth adder (404) responsive to the applied sum and carry signals and the fourth logical multiplication element for generating an output sum signal and an output carry signal.

7. A processor according to claim 6, characterized in that it has registration devices (409-413) which are responsive to the output sum and output carry signals of the first, second, third and fourth adders and to an applied clock signal for generating clocked sum and carry output signals of the modules.

8. A processor according to claim 6, characterized in that the logical multiplication elements are AND gates.

9. A processor according to claim 1, characterized in that the means for generating control signals comprises a shift register (55) responsive to the initiation control signal for generating delayed replicas of the initiation control signal, and a plurality of groups (56) of connection points, each group being controlled by a different one of the delayed replicas of the initiation control signal.

10. A processor according to claim 1, characterized in that the communication circuit has first and second memory sections, the first memory section (120) storing the applied processor input signals and forwarding the stored processor input signals to the second memory section under the control of the initiation control signal, and that the second memory section (121, 122) responsive to the control signals from the means for generating the control signals, storing the signals supplied by the first memory segment, applying signals to the rotator, storing the output signals of the rotator and controlled by the initiation control signal applies the transformed output signals to the first memory segment.

11. A processor for two-dimensional m x n orthogonal transformations with a first orthogonal transformation processor (501) responsive to input signals for generating m-transformations, a second orthogonal transformation processor (501) responsive to intermediate input signals for generating n-transformations and a transpose memory (502) responsive to the m-transformations of the first orthogonal transformation processor for storing n of the m-transformations of the first orthogonal transformation processor and for reading out the n stored in-transformations in transposed order for generating the intermediate input signals, each of the transformation processors having a rotator (300), coefficient storage means (200) and interface means for supplying input signals to the processor and for receiving output signals from the processor, characterized in that the Interface means of a communication circuit (100) responsive to an applied processor input signal vector, which applies components of the input signal vector to be transformed in pairs - designated x₀ and x₁ - to the rotator, which controls the transfer of the rotator output signals, the x₀ and x₁ component signals belonging to a set of signals including the processor input signal vector and first and second rotator output signals, which controls the order of coefficients which are applied to the rotator, and which provides a transformed vector at an output of the transformation processor,

that the rotator is responsive to the pairs of - x�0 and x�1 - component signals and to first and second coefficient signals c�0 and c�1 associated with each of the pairs, forming a first rotator output signal corresponding to c�0 x�0 + c�1 x�1 and a second rotator output signal corresponding to -c�1 x�0 + c�0 x�1, and

that the coefficient storage means, responsive to an initiation control signal, provide control signals to the communication circuit and first and second coefficient signals to the rotator.