Vectorized transforms in scalar processors

Author

Trelewicz, J.Q. ; Mitchell, Joan L. ; Brady, Michael T.

Author_Institution

IBM Almaden Res. Center, San Jose, CA, USA

Volume

19

Issue

4

fYear

2002

fDate

7/1/2002

Firstpage

22

Lastpage

31

Abstract

We disclose a generalized approach to creating efficient implementations of linear, orthogonal transforms, with specific examples discussed for the 8 x 8 DCT used in image compression. We connect this with a method for performing signed, parallel processing in scalar, off-the-shelf processors for integer transforms. Uniform data precision may be used, but is not required for the method. The coefficients resulting from the new algorithm converge more quickly than the approximation made to the coefficients. Furthermore, the new algorithm allows more control of the specific representation chosen for the coefficients, as is detailed in the article. The methods described were designed for addressing this need with two's-complement arithmetic. Data that can be processed in parallel, because of the algorithm structure, are packed into registers in a "vector" format described in the article. Many signed arithmetic operations can be performed on these vectors, including addition, subtraction, multiplication by scalars, shifting, and others. When the parallel processing is completed, the vectors can be unpacked into scalar values for storage or subsequent processing. The importance of these methods lies in their handling of carries and borrows in the packed vector format. The generalized method is described. Notation is given at the beginning to establish consistency throughout the article. We discuss a generalized approach to integer transforms, using the DCT as a specific example. Then we detail the vector format, which allows vector computation in scalar processors of parallelizable algorithms. The IDCT is used as a numerical example in the discussion of the vector format. The results were developed for high-end printers (e.g., more than 100 pages per minute), where image compression and decompression must be performed in real time, either in FPGAs or in embedded processors; however, the methods are applicable to a broad range of signal processing systems.
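The packing idea summarized in the abstract can be illustrated with a minimal sketch. It is not the article's actual format, which the abstract does not specify. The field width, register width, and all function names (`wrap`, `pack`, `unpack`, `vadd`, `vsub`, `vscale`) are assumptions chosen for illustration. The sketch stores four signed 16-bit fields in a simulated 64-bit two's-complement register, performs one scalar operation on the whole register, and recovers the signed fields while undoing the borrows that negative fields take from their neighbors.

```python
FIELD_BITS = 16                       # assumed field width (hypothetical)
NUM_FIELDS = 4                        # assumed number of fields per register
REG_BITS = FIELD_BITS * NUM_FIELDS    # simulated 64-bit scalar register


def wrap(x, bits):
    """Interpret x modulo 2**bits as a signed two's-complement value."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x


def pack(values):
    """Pack signed values by summing each one scaled to its field position.

    A negative value borrows from the field above it; unpack() undoes this.
    """
    total = 0
    for i, v in enumerate(values):
        total += v << (i * FIELD_BITS)
    return wrap(total, REG_BITS)


def unpack(reg):
    """Recover the signed fields from lowest to highest."""
    out = []
    for _ in range(NUM_FIELDS):
        low = wrap(reg, FIELD_BITS)        # signed value of the lowest field
        out.append(low)
        reg = (reg - low) >> FIELD_BITS    # remove it, restoring any borrow
    return out


# Each operation is a single scalar operation on the packed register.
def vadd(a, b):
    return wrap(a + b, REG_BITS)


def vsub(a, b):
    return wrap(a - b, REG_BITS)


def vscale(a, k):
    return wrap(a * k, REG_BITS)


a = pack([3, -5, 7, -2])
b = pack([-1, 2, -9, 4])
print(unpack(vadd(a, b)))     # element-wise sum
print(unpack(vsub(a, b)))     # element-wise difference
print(unpack(vscale(a, -3)))  # every element multiplied by -3
```

The results are only meaningful while every per-field result stays within the signed range of its field (here -32768 to 32767). Choosing field widths with enough headroom for the intermediate values of the transform is the designer's responsibility in this sketch.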

Keywords

data compression; digital arithmetic; discrete cosine transforms; embedded systems; field programmable gate arrays; image coding; inverse problems; parallel algorithms; parallel architectures; transform coding; DCT; FPGA; addition; algorithm structure; coefficients convergence; embedded processors; high-end printers; image compression; image decompression; integer transforms; linear orthogonal transforms; multiplication; parallelizable algorithms; registers; scalar off-the-shelf processors; signal processing systems; signed arithmetic operations; signed parallel processing; subtraction; two's-complement arithmetic; uniform data precision; vector format; vectorized transforms; Approximation algorithms; Arithmetic; Concurrent computing; Design methodology; Discrete cosine transforms; Image coding; Image converters; Parallel processing; Printers; Signal processing algorithms

fLanguage

English

Journal_Title

Signal Processing Magazine, IEEE

Publisher

IEEE

ISSN

1053-5888

Type

jour

DOI

10.1109/MSP.2002.1012347

Filename

1012347