Monday, November 26, 2007

Street Vendor's Golden Brain Becomes an Invention King

Wang Teng-fu (王登福) is full of ideas: a pocket calculator pen, an automatic umbrella, a bill-counting pen... have won him quite a few major awards
 Wang Teng-fu, an inventor from Chiayi County who started out as a street vendor, has grown wealthy through his inventions.
 Wang's calculator pen, which can write, calculate, and look up the time in various time zones as well as exchange rates, is now in its sixth generation.
Chiayi County

●From running a street stall to being named one of the ten outstanding inventors, Wang Teng-fu, a native of Puzi in Chiayi County, chose the "invention" business as a way to make money quickly, and rose to prominence in the field.

Wang says that after completing his military service, he headed north from Chiayi alone, carrying only the NT$1,000 his father had given him for the fare and the NT$30,000 he had saved while in the army. Once up north, a distant relative in Banqiao kindly offered him space on a rooftop; it had only a single water tap and a "roof" made of straw, and the "house" had originally been a doghouse.

Wang took the NT$30,000 to Wufenpu in Taipei to buy women's clothing wholesale, and set up stalls on Guangzhou Street in Wanhua and at the Shilin Night Market. The whole street was lined with vendors, and passers-by strolled past without stopping to look at his clothing and goods, so Wang racked his brains for a rapid-fire, pun-laden sales patter that finally caught people's curiosity; some asked about his background, while others stopped to browse and pick out belts, women's clothing, and so on. He says that when business was good, earning NT$200,000 to 300,000 a month was no problem.

Having made some money, Wang asked himself: was he really going to run a street stall for the rest of his life? He told himself, "I can't spend my whole life like this."

He set aside time from his stall to take land administration courses at Tamkang University. Meanwhile, haggling with customers at the stall and working out costs when ordering stock from suppliers pushed him to invent a calculator pen he could carry with him. Over the years the pen has evolved from its first generation to its sixth, gaining functions along the way: besides calculating, it can look up the time in different time zones and convert exchange rates. Take a single calculator pen abroad and it takes care of calculations, local time, and currency conversion.

In a downpour, a person holding an umbrella in the right hand and a shopping bag in the left reaches the car and needs a free hand to open the door, so the bag goes on the ground and the umbrella switches to the left hand; the result is that the contents of the bag get soaked, and so does the person. From just such a scene, Wang's thoughts turned to inventing a fully automatic umbrella that opens and closes freely without electricity or any external force, bringing a ray of hope to the parasol and umbrella business, long regarded as a sunset industry.

Having made money, Wang often had to count cash in front of clients, and stacks of bills usually had to be counted several times before the amount was certain, which he found a headache. One day in the shower, jolted by the cold water, he thought of borrowing the principle of the bill-counting machine and invented a bill-counting pen based on vibration.

Creativity can turn into gold! Wang encourages people to use their wits to make money and to turn flashes of inspiration into wealth. He believes the shower is when the mind wanders most freely: the stimulation of cold or hot water often sparks unexpected ideas.

In 1997, Wang won two gold medals at a world invention competition in Switzerland, and three years later took a double championship at a science exhibition in London, cementing his selection as one of Taiwan's ten outstanding inventors. He stresses that success is never an accident: only by refusing to live in the shadows of the past and by learning from experience can one lay the groundwork for striking it rich through invention.

2006-06-13

Friday, November 16, 2007

Tutorial: Floating-point arithmetic on FPGAs

Inside microprocessors, numbers are represented as integers—one or several bytes strung together. A four-byte value comprising 32 bits can hold a relatively large range of numbers: 2^32, to be specific. The 32 bits can represent the numbers 0 to 4,294,967,295 or, alternatively, -2,147,483,648 to +2,147,483,647. A 32-bit processor is architected such that basic arithmetic operations on 32-bit integer numbers can be completed in just a few clock cycles, and with some performance overhead a 32-bit CPU can also support operations on 64-bit numbers. The largest value that can be represented by 64 bits is truly astronomical: 18,446,744,073,709,551,615. In fact, if a Pentium processor could count 64-bit values at a frequency of 2.4 GHz, it would take it 243 years to count from zero to the maximum 64-bit integer.

Dynamic Range and Rounding Error Problems
Considering this, you would think that integers work fine, but that is not always the case. Integers suffer from two problems: limited dynamic range and rounding errors.

The quantization introduced through a finite resolution in the number format distorts the representation of the signal. However, as long as a signal is utilizing the range of numbers that can be represented by integer numbers, also known as the dynamic range, this distortion may be negligible.

Figure 1 shows what a quantized signal looks like for large and small dynamic swings, respectively. Clearly, with the smaller amplitude, each quantization step is bigger relative to the signal swing and introduces higher distortion or inaccuracy.


http://i.cmpnet.com/dspdesignline/2006/12/xilinxfigure1_big.gif

Figure 1: Signal quantization and dynamic range

The following example illustrates how integer math can mess things up.

A Calculation Gone Bad
An electronic motor control measures the velocity of a spinning motor, which typically ranges from 0 to 10,000 RPM. The value is measured using a 32-bit counter. To allow some overflow margin, let's assume that the measurement is scaled so that 15,000 RPM corresponds to the full 32-bit scale of 4,294,967,296 (2^32). If the motor is spinning at 105 RPM, this value corresponds to the number 30,064,771 within 0.0000033%, which you would think is accurate enough for most practical purposes.

Assume that the motor control is instructed to increase motor velocity by 0.15% of the current value. Because we are operating with integers, multiplying with 1.0015 is out of the question—as is multiplying by 10,015 and dividing by 10,000—because the intermediate result will cause overflow.

The only option is to divide by integer 10,000 first and then multiply by integer 10,015. If you do that, you end up with 30,105,090; but the correct answer is 30,109,868. Because of the truncation that happens when you divide by 10,000, the resulting velocity increase is 10.6% smaller than what you asked for. Now, an error of 10.6% of 0.15% may not sound like anything to worry about, but as you continue to perform similar adjustments to the motor speed, these errors will almost certainly accumulate to a point where they become a problem.
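For readers who want to reproduce the arithmetic, here is a minimal C sketch of the calculation above (the variable names and scaling are ours, purely for illustration):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t velocity = 30064771u;   /* 105 RPM scaled to the 32-bit range */

    /* Integer-only adjustment: divide first to avoid overflow, then multiply. */
    uint32_t int_result = (velocity / 10000u) * 10015u;

    /* Reference result computed in double-precision floating point. */
    double fp_result = velocity * 1.0015;

    printf("integer math:   %u\n", (unsigned)int_result);   /* 30105090 */
    printf("floating point: %.0f\n", fp_result);            /* 30109868 */
    return 0;
}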

What you need to overcome this problem is a numeric computer representation that represents small and large numbers with equal precision. That is exactly what floating-point arithmetic does.

Floating Point to the Rescue
As you have probably guessed, floating-point arithmetic is important in industrial applications like motor control, but also in a variety of other applications. An increasing number of applications that traditionally have used integer math are turning to floating-point representation. I'll discuss this once we have looked at how floating-point math is performed inside a computer.



IEEE 754 at a Glance
A floating-point number representation on a computer uses something similar to a scientific notation with a base and an exponent. A scientific representation of 30,064,771 is 3.0064771 x 10^7, whereas 1.001 can be written as 1.001 x 10^0.

In the first example, 3.0064771 is called the mantissa, 10 the exponent base, and 7 the exponent.

IEEE standard 754 specifies a common format for representing floating-point numbers in a computer. Two grades of precision are defined: single precision and double precision. The representations use 32 and 64 bits, respectively. This is shown in Figure 2.

http://i.cmpnet.com/dspdesignline/2006/12/xilinxfigure2_big.gif
Figure 2: IEEE floating-point formats

In IEEE 754 floating-point representation, each number comprises three basic components: the sign, the exponent, and the mantissa. To maximize the range of possible numbers, the mantissa is divided into a fraction and leading digit. As I'll explain, the latter is implicit and left out of the representation.

The sign bit simply defines the polarity of the number. A value of zero means that the number is positive, whereas a 1 denotes a negative number.

The exponent represents a range of numbers, positive and negative; thus a bias value must be subtracted from the stored exponent to yield the actual exponent. The single precision bias is 127, and the double precision bias is 1,023. This means that a stored value of 100 indicates a single-precision exponent of -27. The exponent base is always 2, and this implicit value is not stored.

For both representations, exponent representations of all 0s and all 1s are reserved and indicate special numbers:

  • Zero: all digits set to 0, sign bit can be either 0 or 1
  • ±∞: exponent all 1s, fraction all 0s
  • Not a Number (NaN): exponent all 1s, non-zero fraction. Two versions of NaN are used to signal the results of invalid operations such as dividing zero by zero, and indeterminate results such as operations with uninitialized operand(s).

The mantissa represents the number to be multiplied by 2 raised to the power of the exponent. Numbers are always normalized; that is, represented with one non-zero leading digit in front of the radix point. In binary math, there is only one non-zero number, 1. Thus the leading digit is always 1, allowing us to leave it out and use all the mantissa bits to represent the fraction (the decimals).

Following the previous number examples, here is what the single precision representation of the decimal value 30,064,771 will look like:

The binary integer representation of 30,064,771 is 1 1100 1010 1100 0000 1000 0011. This can be written as 1.110010101100000010000011 x 2^24. The leading digit is omitted, and the fraction—the string of digits following the radix point—is 1100 1010 1100 0000 1000 0011. The sign is positive and the exponent is 24 decimal. Adding the bias of 127 and converting to binary yields an IEEE 754 exponent of 1001 0111.

Putting all of the pieces together, the single representation for 30,064,771 is shown in Figure 3.


Figure 3: 30,064,771 represented in IEEE 754 single-precision format

Gain Some, Lose Some
Notice that you lose the least significant bit (LSB) of value 1 from the 32-bit integer representation—this is because of the limited precision for this format.
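If you want to inspect the bit pattern on your own machine, the small C sketch below copies a float into a 32-bit integer and splits out the sign, exponent, and fraction fields. It assumes the compiler stores float in IEEE 754 single precision (virtually all do); note that the compiler's round-to-nearest conversion may set the last fraction bit differently from the simple truncation described above.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    float f = 30064771.0f;
    uint32_t bits;

    memcpy(&bits, &f, sizeof bits);             /* reinterpret the float's bits */

    uint32_t sign     = bits >> 31;             /* 1 bit                        */
    uint32_t exponent = (bits >> 23) & 0xFFu;   /* 8 bits, biased by 127        */
    uint32_t fraction = bits & 0x7FFFFFu;       /* 23 bits, implicit leading 1  */

    printf("sign=%u exponent=%u (unbiased %d) fraction=0x%06X\n",
           (unsigned)sign, (unsigned)exponent, (int)exponent - 127,
           (unsigned)fraction);
    return 0;
}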

The range of numbers that can be represented with single precision IEEE 754 representation is ±(2-2^-23) x 2^127, or approximately ±10^38.53. This range is astronomical compared to the maximum range of 32-bit integer numbers, which by comparison is limited to around ±2.15 x 10^9. Also, whereas the integer representation cannot represent values between 0 and 1, single-precision floating-point can represent values down to ±2^-149, or approximately ±10^-44.85. And we are still using only 32 bits—so this has to be a much more convenient way to represent numbers, right?

The answer depends on the requirements.

  • Yes, because in our example of multiplying 30,064,771 by 1.001, we can simply multiply the two numbers and the result will be extremely accurate.
  • No, because as in the preceding example the number 30,064,771 is not represented with full precision. In fact, 30,064,771 and 30,064,770 are represented by the exact same 32-bit bit pattern, meaning that a software algorithm will treat the numbers as identical. Worse yet, if you increment either number by 1 a billion times, neither of them will change (see the short C sketch after this list). By using 64 bits and representing the numbers in double precision format, that particular example could be made to work, but even double-precision representation will face the same limitations once the numbers get big enough—or small enough.
  • No, because the ALUs (arithmetic logic units) of most embedded processor cores only support integer operations, which leaves floating-point operations to be emulated in software. This severely affects processor performance. A 32-bit CPU can add two 32-bit integers with one machine code instruction; however, a library routine involving bit manipulations and multiple arithmetic operations is needed to add two IEEE single-precision floating-point values. With multiplication and division, the performance gap only widens; thus for many applications, software floating-point emulation is not practical.
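As a quick illustration of that second point, here is a short C sketch (ours, not part of the original article) showing single-precision arithmetic silently swallowing increments of 1 at this magnitude:

#include <stdio.h>

int main(void)
{
    float f = 30064771.0f;   /* stored with only about 7 significant decimal digits */

    /* At this magnitude the spacing between adjacent representable floats
       is 2.0f, so adding 1.0f rounds straight back to the original value. */
    printf("f + 1.0f == f ? %s\n", (f + 1.0f == f) ? "yes" : "no");

    /* Incrementing by 1 a billion times changes nothing either: each
       addition is rounded away before the next one happens.              */
    float g = f;
    for (long i = 0; i < 1000000000L; ++i)
        g += 1.0f;
    printf("unchanged after 1e9 increments ? %s\n", (g == f) ? "yes" : "no");

    return 0;
}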


Floating Point Co-Processor Units
Those who remember PCs based on the Intel 8086 or 8088 processor may recall that they came with the option of adding a floating-point coprocessor unit (FPU), the 8087. Through a compiler switch, you could tell the compiler that an 8087 was present in the system. Whenever the 8086 encountered a floating-point operation, the 8087 would take over, perform the operation in hardware, and present the result on the bus.

Hardware FPUs are complex logic circuits, and in the 1980s the cost of the additional circuitry was significant; thus Intel decided that only those who needed floating-point performance would have to pay for it. The FPU was kept as an optional discrete solution until the introduction of the 80486, which came in two versions, one with and one without an FPU. With the Pentium family, the FPU was offered as a standard feature.

Floating Point is Gaining Ground
These days, applications using 32-bit embedded processors with far less processing power than a Pentium also require floating-point math. Our initial example of motor control is one of many—other applications that benefit from FPUs are industrial process control, automotive control, navigation, image processing, CAD tools, and 3D computer graphics, including games.

As floating-point capability becomes more affordable and proliferated, applications that traditionally have used integer math turn to floating-point representation. Examples of the latter include high-end audio and image processing. The latest version of Adobe Photoshop, for example, supports image formats where each color channel is represented by a floating-point number rather than the usual integer representation. The increased dynamic range fixes some problems inherent in integer-based digital imaging.

If you have ever taken a picture of a person against a bright blue sky, you know that without a powerful flash you are left with two choices; a silhouette of the person against a blue sky or a detailed face against a washed-out white sky. A floating-point image format partly solves this problem, as it makes it possible to represent subtle nuances in a picture with a wide range in brightness.

Compared to software emulation, FPUs can speed up floating-point math operations by a factor of 20 to 100 (depending on type of operation) but the availability of embedded processors with on-chip FPUs is limited. Although this feature is becoming increasingly more common at the higher end of the performance spectrum, these derivatives often come with an extensive selection of advanced peripherals and very high-performance processor cores—features and performance that you have to pay for even if you only need the floating-point math capability.

FPUs on Embedded Processors
With the MicroBlaze 4.00 processor, Xilinx makes an optional single precision FPU available. You now have the choice whether to spend some extra logic to achieve real floating-point performance or to do traditional software emulation and free up some logic (20-30% of a typical processor system) for other functions.

Why Integrated FPU is the Way to Go
A soft processor without hardware support for floating-point math can be connected to an external FPU implemented on an FPGA. Similarly, any microcontroller can be connected to an external FPU. However, unless you take special considerations on the compiler side, you cannot expect seamless cooperation between the two.

C-compilers for CPU architecture families that have no floating-point capability will always emulate floating-point operations in software by linking in the necessary library routines. If you were to connect an FPU to the processor bus, FPU access would occur through specifically designed driver routines such as this one:

void user_fmul(float *op1, float *op2, float *res)
{
    /* FPU_operand1/2, FPU_operation, FPU_stat and FPU_result stand for
       memory-mapped FPU registers */
    FPU_operand1 = *op1;            // write operand a to FPU
    FPU_operand2 = *op2;            // write operand b to FPU
    FPU_operation = MUL;            // tell FPU to multiply
    while (!(FPU_stat & FPUready)); // wait for FPU to finish
    *res = FPU_result;              // return result
}

To perform the operation z = x*y in the main program, you would have to call the above driver function as:

float x, y, z;
user_fmul (&x, &y, &z);

For small and simple operations, this may work reasonably well, but for complex operations involving multiple additions, subtractions, divisions, and multiplications, such as a proportional integral derivative (PID) algorithm, this approach has three major drawbacks:

  • The code will be hard to write, maintain, and debug
  • The overhead in function calls will severely decrease performance
  • Each operation involves at least five bus transactions; as the bus is likely to be shared with other resources, this not only hurts performance, but also makes the time needed to perform an operation dependent on the bus load at that moment

The MicroBlaze Way
The optional MicroBlaze soft processor with FPU is a fully integrated solution that offers high performance, deterministic timing, and ease of use. The FPU operation is completely transparent to the user.

When you build a system with an FPU, the development tools automatically equip the CPU core with a set of floating-point assembly instructions known to the compiler.

To perform z = x*y, you would simply write:

float x, y, z;
z = x * y;

and the compiler will use those special instructions to invoke the FPU and perform the operation.

Not only is this simpler, but a hardware-connected FPU guarantees a constant number of CPU cycles for each floating-point operation. Finally, the FPU provides an extreme performance boost: every basic floating-point operation is accelerated by a factor of 25 to 150, as shown in Figure 4.


Figure 4: MicroBlaze floating-point acceleration

Conclusion
Floating-point arithmetic is necessary to meet precision and performance requirements for an increasing number of applications.

Today, most 32-bit embedded processors that offer this functionality are derivatives at the higher end of the price range.

The MicroBlaze soft processor with FPU can be a cost-effective alternative to ASSP products, and results show that with the correct implementation you can benefit not only from ease-of-use but vast improvements in performance as well.

For more information on the MicroBlaze FPU, visit www.xilinx.com/ipcenter/processor_central/microblaze/microblaze_fpu.htm.

[Editor's Note: This article first appeared in the Xilinx Embedded Magazine and is presented here with the kind permission of Xcell Publications.]

About the Author
Geir Kjosavik is the Senior Staff Product Marketing Engineer of the Embedded Processing Division at Xilinx, Inc. He can be reached at geir.kjosavik@xilinx.com.

Fundamentals of embedded audio, part 3

Audio Processing Methods
Getting data to the processor's core
There are a number of ways to get audio data into the processor's core. For example, a foreground program can poll a serial port for new data, but this type of transfer is uncommon in embedded media processors because it makes inefficient use of the core.

Instead, a processor connected to an audio codec usually uses a DMA engine to transfer the data from the codec link (like a serial port) to some memory space available to the processor. This transfer of data occurs in the background without the core's intervention. The only overhead is in setting up the DMA sequence and handling the interrupts once the buffer of data has been received or transmitted.

Block processing versus sample processing
Sample processing and block processing are two approaches for dealing with digital audio data. In the sample-based method, the processor crunches the data as soon as it's available. Here, the processing function incurs overhead during each sample period. Many filters (like FIR and IIR, described later) are implemented this way because the effective latency is lower for sample-based processing than for block processing.

In block processing, a buffer of a specific length must be filled before the data is passed to the processing function. Some filters are implemented using block processing because it is more efficient than sample processing. For one, the processing function does not need to be called for each sample, greatly reducing overhead. Also, many embedded processors contain multiple processing units such as multipliers or full ALUs that can crunch blocks of data in parallel. What's more, some algorithms by nature must be processed in blocks. A well-known one is the Fourier Transform (and its practical counterpart, the Fast Fourier Transform, or FFT), which accepts blocks of temporal or spatial data and converts them into frequency domain representations.

Double-Buffering
In a block-based processing system that uses DMA to transfer data to and from the processor core, a "double buffer" must exist to arbitrate between the DMA transfers and the core. This is done so that the processor core and the core-independent DMA engine do not access the same data at the same time and cause a data coherency problem.

For example, to facilitate the processing of a buffer of length N, simply create a buffer of length 2×N. For a bi-directional system, two buffers of length 2×N must be created. As shown in Figure 1a, the core processes the in1 buffer and stores the result in the out1 buffer, while the DMA engine is filling in0 and transmitting the data from out0. It can be seen in Figure 1b that once the DMA engine is done with the left half of the double buffers, it starts transferring data into in1 and out of out1, while the core processes data from in0 and into out0. This configuration is sometimes called "ping-pong buffering," because the core alternates between processing the left and right halves of the double buffers.

Note that in real-time systems, the serial port DMA (or another peripheral's DMA tied to the audio sampling rate) dictates the timing budget. For this reason, the block processing algorithm must be optimized in such a way that its execution time is less than or equal to the time it takes the DMA to transfer data to/from one half of a double-buffer.


http://i.cmpnet.com/dspdesignline/2007/09/adifigure11_big.gif
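A bare-bones sketch of the ping-pong loop in C might look like the following. The DMA programming and interrupt handling are processor-specific and only hinted at here; dma_buffer_done(), process_block(), and the buffer size N are placeholder names of our own.

#include <stdint.h>

#define N 256                                  /* samples per processing block */

/* Double buffers: DMA fills/drains one half while the core works on the other. */
static int16_t in[2][N], out[2][N];

extern int  dma_buffer_done(void);             /* placeholder: half-complete flag */
extern void process_block(const int16_t *src, int16_t *dst, int n);

void audio_loop(void)
{
    int ping = 0;                              /* half currently owned by the core  */

    for (;;) {
        while (!dma_buffer_done())             /* wait until DMA releases this half */
            ;
        process_block(in[ping], out[ping], N); /* core crunches one half...         */
        ping ^= 1;                             /* ...then swaps to the other half   */
    }
}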



Two-dimensional (2D) DMA
When data is transferred across a digital link like I2S, it may contain several channels. These may all be multiplexed onto one data line going into the same serial port. In such a case, 2D DMA can be used to de-interleave the data so that each channel is linearly arranged in memory. Take a look at Figure 2 for a graphical depiction of this arrangement, where samples from the left and right channels are de-multiplexed into two separate blocks. This automatic data arrangement is extremely valuable for those systems that employ block processing.


http://i.cmpnet.com/dspdesignline/2007/09/adifigure12_big.gif

Figure 2. A 2D DMA engine used to de-interleave (a) I²S stereo data into (b) separate left and right buffers.

Basic Operations
There are three fundamental operations in audio processing. They are the summing, multiplication, and time delay operations. Many more complicated effects and algorithms can be implemented using these three elements. A summer has the obvious duty of adding two signals together. A multiplication can be used to boost or attenuate an audio signal. On most media processors, these operations can be executed in a single cycle.

A time delay is a bit more complicated. The delay is accomplished with a delay line, which is really nothing more than an array in memory that holds previous data. For example, an echo algorithm might hold 500 ms of input samples for each channel. For a simple delay effect, the current output value is computed by adding the current input value to a slightly attenuated previous sample. If the audio system is sample-based, then the programmer can simply keep track of an input pointer and an output pointer (spaced at 500 ms worth of samples apart), and increment them after each sampling period.

Since delay lines are meant to be reused for subsequent sets of data, the input and output pointers will need to wrap around from the end of the delay line buffer back to the beginning. In C/C++, this is usually done by appending the modulus operator (%) to the pointer increment.
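For example, a bare-bones delay/echo routine in C using the modulus trick could look like this (the 48 kHz rate, the resulting 24,000-sample delay length, and the 0.6 feedback gain are arbitrary illustration values):

#define DELAY_LENGTH 24000              /* e.g. 500 ms at a 48 kHz sampling rate */

static float delay_line[DELAY_LENGTH];  /* holds previous input samples          */
static int   idx = 0;                   /* current position in the delay line    */

/* Process one sample: output = input + attenuated sample from 500 ms ago. */
float delay_sample(float in)
{
    float out = in + 0.6f * delay_line[idx];   /* read the oldest sample       */
    delay_line[idx] = in;                      /* overwrite it with the newest */
    idx = (idx + 1) % DELAY_LENGTH;            /* wrap around with modulus     */
    return out;
}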

This wrap-around may incur no extra processing cycles for a processor that supports circular buffering (see Figure 3). In this case, the beginning location and length of a circular buffer must be provided only once. During processing, the software increments or decrements the current pointer within the buffer, but the hardware takes care of wrapping around to the beginning of the buffer if the current pointer falls outside of the bounds. Without this automated address generation, the programmer would have to manually keep track of the buffer, thus wasting valuable processing cycles.


http://i.cmpnet.com/dspdesignline/2007/09/adifigure13_big.gif



Figure 3. (a) Graphical representation of a delay line using a circular buffer (b) Layout of a circular buffer in memory.

A delay line structure can give rise to an important audio building block called the comb filter, which is essentially a delay with a feedback element. When multiple comb filters are used simultaneously, they can create the effect of reverberation.

Signal generation
In some audio systems, a signal (for example, a sine wave) might need to be synthesized. Taylor Series function approximations can emulate trigonometric functions. Uniform random number generators are handy for creating white noise.

However, synthesis might not fit into a given system's processing budget. On fixed-point systems with ample memory, you can use a table lookup instead of generating a signal. This has the side effect of taking up precious memory resources, so hybrid methods can be used as a compromise. For example, you can store a coarse lookup table to save memory. During runtime, the exact values can be extracted from the table using interpolation, an operation that can take significantly less time than computing using a full Taylor Series approximation. This hybrid approach provides a good balance between computation time and memory resources.
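Here is a sketch of the hybrid lookup-plus-interpolation idea, shown in floating point for clarity (a fixed-point system would store scaled integers instead); the table size of 256 is an arbitrary choice:

#include <math.h>

#define TABLE_SIZE 256                       /* coarse table covering one sine cycle */

static float sine_table[TABLE_SIZE];

void init_sine_table(void)
{
    for (int i = 0; i < TABLE_SIZE; ++i)
        sine_table[i] = sinf(6.28318530718f * i / TABLE_SIZE);
}

/* phase is in [0, 1) for one cycle; linearly interpolate between table entries */
float sine_lookup(float phase)
{
    float pos  = phase * TABLE_SIZE;
    int   i0   = (int)pos;
    int   i1   = (i0 + 1) % TABLE_SIZE;      /* wrap at the end of the table */
    float frac = pos - (float)i0;
    return sine_table[i0] + frac * (sine_table[i1] - sine_table[i0]);
}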

Filtering and Algorithms
Digital filters are used in audio systems for attenuating or boosting the energy content of a sound wave at specific frequencies. The most common filter forms are high-pass, low-pass, band-pass, and notch. Any of these filters can be implemented in one of two basic forms: the finite impulse response (FIR) filter and the infinite impulse response (IIR) filter. These are often used as building blocks for more complicated filtering algorithms such as parametric and graphic equalizers.

Finite Impulse Response (FIR) filter
The FIR filter's output is determined by the sum of the current and past inputs, each of which is first multiplied by a filter coefficient. The FIR summation equation, shown in Figure 4a, is also known as "convolution," one of the most important operations in signal processing. In this syntax, x is the input vector, y is the output vector, and h holds the filter coefficients. Figure 4a also shows a graphical representation of the FIR implementation.

The convolution is such a common operation in media processing that many processors can execute a multiply-accumulate (MAC) instruction along with multiple data accesses (reads and writes) in one cycle.
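As a concrete reference, here is the per-sample FIR computation written as plain floating-point C; a production fixed-point version would lean on the MAC instruction and the hardware circular buffering mentioned above (the function and argument names are ours):

/* One output sample of an N-tap FIR filter.
   x[0] holds the newest input sample, x[1] the previous one, and so on,
   so the loop computes y[n] = sum over k of h[k] * x[n - k].            */
float fir_sample(const float *x, const float *h, int num_taps)
{
    float acc = 0.0f;
    for (int k = 0; k < num_taps; ++k)
        acc += h[k] * x[k];          /* multiply-accumulate */
    return acc;
}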

http://i.cmpnet.com/dspdesignline/2007/09/adifgure14_big.gif


Figure 4. (a) FIR filter equation and structure (b) IIR filter equation and structure.

Infinite Impulse Response (IIR) filter
Unlike the FIR, whose output depends only on inputs, the IIR filter relies on both inputs and past outputs. The basic equation for an IIR filter is a difference equation, as shown in Figure 4b. Because of the current output's dependence on past outputs, IIR filters are often referred to as "recursive filters." Figure 4b also gives a graphical perspective on the structure of the IIR filter.
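For reference, a minimal C sketch of a second-order IIR section (a biquad) in direct form is shown below; the a/b coefficient naming follows the usual convention, and the coefficient values would come from a filter design tool rather than from this article:

/* y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2] */
typedef struct {
    float b0, b1, b2, a1, a2;   /* filter coefficients          */
    float x1, x2, y1, y2;       /* previous inputs and outputs  */
} biquad_t;

float biquad_sample(biquad_t *f, float x)
{
    float y = f->b0 * x + f->b1 * f->x1 + f->b2 * f->x2
                        - f->a1 * f->y1 - f->a2 * f->y2;
    f->x2 = f->x1;  f->x1 = x;      /* shift the input history  */
    f->y2 = f->y1;  f->y1 = y;      /* shift the output history */
    return y;
}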

Fast Fourier Transform
Quite often, we can do a better job describing an audio signal by characterizing its frequency composition. A Fourier Transform takes a time-domain signal and rearranges it into the frequency domain; the inverse Fourier Transform achieves the opposite, converting a frequency-domain representation back into the time domain. Mathematically, there are some nice property relationships between operations in the time domain and those in the frequency domain. Specifically, a time-domain convolution (or an FIR filter) is equivalent to a multiplication in the frequency domain. This tidbit would not be too practical if it weren't for a special optimized implementation of the Fourier transform called the Fast Fourier Transform (FFT). In fact, it is often more efficient to implement a sufficiently long FIR filter by transforming the input signal and coefficients into the frequency domain with an FFT, multiplying the transforms, and then transforming the result back into the time domain with an inverse FFT.

There are other transforms that are used often in audio processing. Among them, the most common is the modified discrete cosine transform (MDCT), which is the basis for many audio compression algorithms.

Sample Rate Conversion
There are times when you will need to convert a signal sampled at one frequency to a different sampling rate. One situation where this is useful is when you want to decode an audio signal sampled at, say, 8 kHz, but the DAC you're using does not support that sampling frequency. Another scenario is when a signal is oversampled, and converting it to a lower frequency can lead to a reduction in computation time. The process of converting the sampling rate of a signal from one rate to another is called sampling rate conversion (or SRC).

Increasing the sampling rate is called interpolation, and decreasing it is called decimation. Decimating a signal by a factor of M is achieved by keeping only every Mth sample and discarding the rest. Interpolating a signal by a factor of L is accomplished by padding the original signal with L-1 zeros between each sample.

Even though interpolation and decimation factors are integers, you can apply them in series to an input signal and get a rational conversion factor. When you upsample by 5 and then downsample by 3, the resulting conversion factor is 5/3 ≈ 1.67.


http://i.cmpnet.com/dspdesignline/2007/09/adifigure15_big.gif

Figure 5. Sample rate conversion through upsampling and downsampling.

Admittedly, we have oversimplified the SRC process a bit. In order to prevent artifacts due to zero-padding a signal (which creates images in the frequency domain), an interpolated signal must be low-pass-filtered before being used as an output or as an input into a decimator. This anti-imaging low-pass filter can operate at the input sample rate, rather than at the faster output sample rate, by using a special FIR filter structure that recognizes that the inputs associated with the L-1 inserted samples have zero values.

Similarly, before they're decimated, all input signals must be low-pass-filtered to prevent aliasing. The anti-aliasing low-pass filter may be designed to operate at the decimated sample rate, rather than at the faster input sample rate, by using a FIR filter structure that realizes the output samples associated with the discarded samples need not be computed. Figure 5 shows a flow diagram of a sample rate converter. Note that it is possible to combine the anti-imaging and anti-aliasing filter into one component for computational savings.
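Putting the pieces together, here is a deliberately naive C sketch of rational sample rate conversion: zero-stuff by L, low-pass filter, keep every Mth sample. lowpass_fir() is a placeholder for an anti-imaging/anti-aliasing FIR with its cutoff at the lower of the two Nyquist frequencies; the efficient structures described above avoid ever computing the samples that are thrown away.

#include <stdlib.h>

extern void lowpass_fir(const float *in, float *out, int n);  /* placeholder filter */

/* Naive rational SRC by a factor of L/M; out must hold n*L/M (rounded up) samples. */
void src_naive(const float *in, int n, float *out, int L, int M)
{
    int    n_up = n * L;
    float *up   = calloc(n_up, sizeof(float));
    float *filt = malloc(n_up * sizeof(float));

    /* 1. Zero-stuff: insert L-1 zeros between input samples (the gain of L
          compensates for the energy spread into the inserted zeros).        */
    for (int i = 0; i < n; ++i)
        up[i * L] = in[i] * (float)L;

    /* 2. Low-pass filter at the higher, intermediate rate. */
    lowpass_fir(up, filt, n_up);

    /* 3. Decimate: keep every Mth sample, discard the rest. */
    for (int i = 0; i * M < n_up; ++i)
        out[i] = filt[i * M];

    free(up);
    free(filt);
}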

Obviously, we've only been able to give a surface discussion on these embedded audio topics, but hopefully we've provided a useful template for the kinds of considerations necessary for developing an embedded audio processing application.

This series is adapted from the book "Embedded Media Processing" (Newnes 2005) by David Katz and Rick Gentile. See the book's web site for more information.

Fundamentals of embedded audio, part 2

Dynamic Range and Precision
You may have seen dB specs thrown around for various products available on the market today. Table 1 lists a few fairly established products along with their assigned signal quality, measured in dB.


Table 1: Dynamic range comparison of various audio systems.

So what exactly do those numbers represent? Let's start by getting some definitions down. Use Figure 1 as a reference signal for the following "cheat sheet" of the essentials.


Figure 1: Relationship between some important terms in audio systems.

The dynamic range of the human ear (the ratio of the loudest to the quietest signal level) is about 120 dB. In systems where noise is present, dynamic range is described as the ratio of the maximum signal level to the noise floor. In other words,

Dynamic Range (dB) = Peak Level (dB) - Noise Floor (dB)

The noise floor in a purely analog system comes from the electrical properties of the system itself. In digital systems, audio signals also acquire noise from the ADCs and DACs, as well as from the quantization errors due to sampling.

Another important measure is the signal-to-noise ratio (SNR). In analog systems, this means the ratio of the nominal signal to the noise floor, where "line level" is the nominal operating level. On professional equipment, the nominal level is usually 1.228 Vrms, which translates to +4 dBu. The headroom is the difference between nominal line level and the peak level where signal distortion starts to occur. The definition of SNR is a bit different in digital systems, where it is defined as the dynamic range.

Now, armed with an understanding of dynamic range, we can start to discuss how this is useful in practice. Without going into a long derivation, let's simply state what is known as the "6 dB rule". This rule is key to the relationship between dynamic range and computational word width. The complete formulation is described in the equation below, but in shorthand the 6 dB rule means that the addition of one bit of precision will lead to a dynamic range increase of 6 dB. Note that the 6 dB rule does not take into account the analog subsystem of an audio design, so the imperfections of the transducers on both the input and the output must be considered separately.

Dynamic Range (dB) = 6.02n + 1.76 ≈ 6n dB
where
n = the number of precision bits

The "6 dB rule" dictates that the more bits we use, the higher the audio quality we can attain. In practice, however, there are only a few realistic choices of word width. Most devices suitable for embedded media processing come in three word width flavors: 16-bit, 24-bit, and 32-bit. Table 2 summarizes the dynamic ranges for these three types of processors.


Table 2: Dynamic range of various fixed-point architectures.

Since we're talking about the 6 dB rule, it is worth mentioning something about the nonlinear quantization methods that are typically used for speech signals. A telephone-quality linear PCM encoding requires 12 bits of precision. However, our ears are more sensitive to audio changes at small amplitudes than at high amplitudes. Therefore, linear PCM sampling is overkill for telephone communications. The logarithmic quantization used by the A-law and μ-law companding standards achieves a 12-bit PCM level of quality using only 8 bits of precision. To make our lives easier, some processor vendors have implemented A-law and μ-law companding into the serial ports of their devices. This relieves the processor core from doing logarithmic calculations.
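For reference, the continuous μ-law companding curve (the idea behind those 8-bit codecs, not the exact segmented G.711 bit format) can be written in a few lines of C:

#include <math.h>

#define MU 255.0f        /* North American / Japanese mu-law constant */

/* Compress a sample in [-1.0, 1.0] onto the logarithmic mu-law curve. */
float mulaw_compress(float x)
{
    float sign = (x < 0.0f) ? -1.0f : 1.0f;
    return sign * logf(1.0f + MU * fabsf(x)) / logf(1.0f + MU);
}

/* Expand a compressed value back to (approximately) the original sample. */
float mulaw_expand(float y)
{
    float sign = (y < 0.0f) ? -1.0f : 1.0f;
    return sign * (powf(1.0f + MU, fabsf(y)) - 1.0f) / MU;
}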

After reviewing Table 2, recall once again that the dynamic range of the human ear is around 120 dB. Because of this, 16-bit data representation doesn't quite cut it for high quality audio. This is why vendors introduced 24-bit processors. However, these 24-bit systems are a bit non-standard from a C compiler standpoint, so many audio designs these days use 32-bit processing.

Choosing the right processor is not the end of the story, because the total quality of an audio system is dictated by the quality level of the "lowest-achieving" component. Besides the processor, a complete system includes analog components like microphones and speakers, as well the converters to translate signals between the analog and digital domains. The analog domain is outside of the scope of this discussion, but the audio converters do cross into the digital realm.

Let's say that you want to use the AD1871 for sampling audio. The datasheet for this converter explains that it is a 24-bit converter, but its dynamic range is not the theoretical 144 dB – it is 105 dB. The reason for this is that a converter is not a perfect system, and vendors publish only the useful dynamic range.

If you were to hook up a 24-bit processor to the AD1871, then the SNR of your complete system would be 105 dB. The noise floor would amount to 144 dB – 105 dB = 39 dB. Figure 2 is a graphical representation of this situation. However, there is still another component of a digital audio system that we have not discussed yet: computation on the processor's core.

http://i.cmpnet.com/dspdesignline/2007/09/adifigure4_big.gif
Figure 2: An audio system's SNR consists of the weakest component's SNR.

Passing data through a processor's computational units can potentially introduce a variety of errors. One is quantization error. This can be introduced when a series of computations causes a data value to be either truncated or rounded (up or down). For example, a 16-bit processor may be able to add a vector of 16-bit data and store this in an extended-length accumulator. However, when the value in the accumulator is eventually written to a 16-bit data register, some of the bits are truncated.

Take a look at Figure 3 to see how computation errors can affect a real system. For an ideal 16-bit A/D converter (Figure 3a), the signal-to-noise ratio would be 16 x 6 = 96 dB. If quantization errors did not exist, then 16-bit computations would suffice to keep the SNR at 96 dB. Both 24-bit and 32-bit systems would dedicate 8 and 16 bits, respectively, to the dynamic range below the noise floor. In essence, those extra bits would be wasted.

However, all digital audio systems do introduce some round-off and truncation errors. If we can quantify this error to take, for example, 18 dB (or 3 bits), then it becomes clear that 16-bit computations will not suffice in keeping the system's SNR at 96 dB (Figure 3b). Another way to interpret this is to say that the effective noise floor is raised by 18 dB, and the total SNR is decreased to 96 dB – 18 dB = 78 dB. This leads to the conclusion that having extra bits below the noise floor helps to deal with the nuisance of quantization.

Figure 3 (a) Allocation of extra bits with various word width computations for an ideal 16-bit, 96 dB SNR system, when quantization error is neglected (b) Allocation of extra bits with various word width computations for an ideal 16-bit, 96 dB SNR system, when quantization noise is present.

Numeric Formats for Audio
There are many ways to represent data inside a processor. The two main processor architectures used for audio processing are fixed-point and floating-point. Fixed-point processors are designed for integer and fractional arithmetic, and they usually natively support 16-bit, 24-bit, or 32-bit data. Floating-point processors provide very good performance with native support for 32-bit or 64-bit floating-point data types. However, floating-point processors are typically more costly and consume more power than their fixed-point counterparts, and most real systems must strike a balance between quality and engineering cost.

Fixed-point Arithmetic
Processors that can perform fixed-point operations typically use two's complement binary notation for representing signals. A fixed-point format can represent both signed and unsigned integers and fractions. The signed fractional format is most common for digital signal processing on fixed-point processors. The difference between integer and fractional formats lies in the location of the binary point. For integers, the binary point is to the right of the least significant digit, whereas fractions usually have their binary point to the left of the sign bit. Figure 4a shows integer and fractional formats.

While the fixed-point convention simplifies numeric operations and conserves memory, it presents a tradeoff between dynamic range and precision. In situations that require a large range of numbers while maintaining high resolution, a radix point that can shift based on magnitude and exponent (i.e., floating-point) is desirable.

http://i.cmpnet.com/dspdesignline/2007/09/adifigure6_big.gif
Figure 4. (a) Fractional and integer formats

Floating-point Arithmetic
Using floating-point format, very large and very small numbers can be represented in the same system. Floating-point numbers are quite similar to scientific notation representation of rational numbers. They are described with a mantissa and an exponent. The mantissa dictates precision, and the exponent controls dynamic range.

There is a standard that governs floating-point computations of digital machines. It is called IEEE-754 (Figure 4b) and can be summarized as follows for 32-bit floating-point numbers. Bit 31 (MSB) is the sign bit, where a 0 represents a positive sign and a 1 represents a negative sign. Bits 30 through 23 represent an exponent field (exp_field) as a power of 2, biased with an offset of 127. Finally, bits 22 through 0 represent a fractional mantissa (mantissa). The hidden bit is basically an implied value of 1 to the left of the radix point.

The value of a 32-bit IEEE floating-point number can be represented with the following equation:

(-1)^sign_bit x (1.mantissa) x 2^(exp_field - 127)

With an 8-bit exponent and a 23-bit mantissa, IEEE-754 reaches a balance between dynamic range and precision. In addition, IEEE floating-point libraries include support for additional features such as ±infinity, zero, and NaN (not a number).




http://i.cmpnet.com/dspdesignline/2007/09/adifigure7_big.gif

Figure 4. (a) Fractional and integer formats (b) IEEE 754 32-bit single-precision floating-point format.

Table 3 shows the smallest and largest values attainable from the common floating-point and fixed-point types.


http://i.cmpnet.com/dspdesignline/2007/09/adifigure8_big.gif

Table 3. Comparison of dynamic range for various data formats.

Emulation on 16-bit Architectures
As explained earlier, 16-bit processing does not provide a high enough SNR for high quality audio, but this does not mean that you shouldn't choose a 16-bit processor. For example, while a 32-bit floating-point machine makes it easier to code an algorithm that preserves 32-bit data natively, a 16-bit processor can also maintain 32-bit integrity through emulation at a much lower cost. Figure 5 illustrates some of the possibilities for choosing a data type for an embedded algorithm.

In the remainder of this section, we'll describe how to achieve floating-point and 32-bit extended precision fixed-point functionality on a 16-bit fixed-point machine.


http://i.cmpnet.com/dspdesignline/2007/09/adifigure9_big.gif

Figure 5: Depending on the goals of an application, there are many data types that can satisfy system requirements.

Floating-point emulation on fixed-point processors
On most 16-bit fixed-point processors, IEEE-754 floating-point functions are available as library calls from either C/C++ or assembly language. These libraries emulate the required floating-point processing using fixed-point multiply and ALU logic. This emulation requires additional cycles to complete. However, as fixed-point processor core clock speeds venture into the 500 MHz - 1 GHz range, the extra cycles required to emulate IEEE-754-compliant floating-point math become less significant.

It is sometimes advantageous to use a "relaxed" version of IEEE-754 in order to reduce computational complexity. This means that the floating-point arithmetic doesn't implement standard features such as ±infinity, zero, and NaN.

A further optimization is to use a more native type for the mantissa and exponent. Take, for example, Analog Devices' fixed-point Blackfin processor architecture, which has a register file set that consists of sixteen 16-bit registers that can be used instead as eight 32-bit registers. In this configuration, on every core clock cycle, two 32-bit registers can source operands for computation on all four register halves. To make optimized use of the Blackfin register file, a two-word format can be used. In this way, one word (16 bits) is reserved for the exponent and the other word (16 bits) is reserved for the fraction.
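A relaxed two-word format of this kind could be sketched in portable C as follows. This is purely illustrative (the real Blackfin library routines pack and operate on the halves differently); it leans on the standard frexpf()/ldexpf() calls to split a value into a fraction and a power-of-two exponent and to rebuild it.

#include <math.h>
#include <stdint.h>

/* One 16-bit word for the exponent, one for the signed 1.15 fraction. */
typedef struct {
    int16_t exp;    /* power-of-two exponent          */
    int16_t frac;   /* signed fraction scaled by 2^15 */
} fp16x16_t;

fp16x16_t pack_fp(float x)
{
    fp16x16_t r;
    int e;
    float m = frexpf(x, &e);            /* x = m * 2^e, 0.5 <= |m| < 1 */
    r.exp  = (int16_t)e;
    r.frac = (int16_t)(m * 32768.0f);   /* quantize mantissa to 1.15   */
    return r;
}

float unpack_fp(fp16x16_t v)
{
    return ldexpf((float)v.frac / 32768.0f, v.exp);
}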

Double-Precision Fixed-Point Emulation
There are many applications where 16-bit fixed-point data is not sufficient, but where emulating floating-point arithmetic may be too computationally intensive. For these applications, extended-precision fixed-point emulation may be enough to satisfy system requirements. Using a high-speed fixed-point processor will ensure a significant reduction in the amount of required processing. Two popular extended-precision formats for audio are 32-bit and 31-bit fixed-point representations.

32-Bit-Accurate Emulation
32-bit arithmetic is a natural software extension for 16-bit fixed-point processors. For processors whose 32-bit register files can be accessed as two 16-bit halves, the halves can be used together to represent a single 32-bit fixed-point number. The Blackfin processor's hardware implementation allows for single-cycle 32-bit addition and subtraction.

For instances where a 32-bit multiply will be iterated with accumulation (as is the case in some algorithms we'll talk about soon), we can achieve 32-bit accuracy with 16-bit multiplications in just 3 cycles. Each of the two 32-bit operands (R0 and R1) can be broken up into two 16-bit halves (R0.H / R0.L and R1.H / R1.L).


http://i.cmpnet.com/dspdesignline/2007/09/adifigure10_big.gif

Figure 6. 32-bit multiplication with 16-bit operations.

From Figure 6, it is easy to see that the following operations are required to emulate the 32-bit multiplication R0 x R1 with a combination of instructions using 16-bit multipliers:

Four 16-bit multiplications to yield four 32-bit results:

  1. R1.L x R0.L
  2. R1.L x R0.H
  3. R1.H x R0.L
  4. R1.H x R0.H

Three shift operations preserve the bit positions of the partial products in the final answer (the >> symbol denotes a right shift). Since we are performing fractional arithmetic, the result is in 1.63 format (1.31 x 1.31 = 2.62 with a redundant sign bit). Most of the time, the result can be truncated to 1.31 in order to fit in a 32-bit data register. Therefore, the result of the multiplication should be aligned to the sign bit, or the most significant bit, so that the rightmost least significant bits can be safely discarded in a truncation:

  1. (R1.L x R0.L) >> 32
  2. (R1.L x R0.H) >> 16
  3. (R1.H x R0.L) >> 16

The final expression for a 32-bit multiplication is:

((R1.L x R0.L) >> 32 + (R1.L x R0.H) >> 16) + ((R1.H x R0.L) >> 16 + R1.H x R0.H)

On the Blackfin architecture, these instructions can be issued in parallel to yield an effective rate of a 32-bit multiplication in three cycles.
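The same decomposition can be checked in portable C. The sketch below uses unsigned integer halves purely to show how the four 16-bit partial products line up against a direct 64-bit multiplication; the actual Blackfin code works on signed 1.15/1.31 fractional operands and folds the shifts into its MAC instructions.

#include <stdio.h>
#include <stdint.h>

/* Rebuild a 32x32-bit product from four 16x16-bit partial products,
   mirroring the layout of Figure 6 (unsigned, integer version). */
uint64_t mul32_by_halves(uint32_t a, uint32_t b)
{
    uint32_t a_h = a >> 16, a_l = a & 0xFFFFu;    /* R0.H / R0.L */
    uint32_t b_h = b >> 16, b_l = b & 0xFFFFu;    /* R1.H / R1.L */

    uint64_t ll = (uint64_t)a_l * b_l;            /* R1.L x R0.L */
    uint64_t lh = (uint64_t)a_l * b_h;            /* R1.H x R0.L */
    uint64_t hl = (uint64_t)a_h * b_l;            /* R1.L x R0.H */
    uint64_t hh = (uint64_t)a_h * b_h;            /* R1.H x R0.H */

    /* Shift each partial product to its bit position and sum. */
    return ll + ((lh + hl) << 16) + (hh << 32);
}

int main(void)
{
    uint32_t a = 0x12345678u, b = 0x9ABCDEF0u;
    printf("by halves: %llx\n", (unsigned long long)mul32_by_halves(a, b));
    printf("direct:    %llx\n", (unsigned long long)((uint64_t)a * b));
    return 0;
}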

31-Bit-Accurate Emulation
We can reduce a fixed-point multiplication requiring at most 31-bit accuracy to just 2 cycles. This technique is especially appealing for audio systems, which usually require at least 24-bit representation, but where 32-bit accuracy may be a bit excessive. Using the "6 dB rule," 31-bit-accurate emulation still maintains a dynamic range of around 186 dB, which is plenty of headroom even with all the quantization effects.

From the multiplication diagram shown in Figure 6, it is apparent that the multiplication of the least significant half-word R1.L x R0.L does not contribute much to the final result. In fact, if the result is truncated to 1.31, then this multiplication can only have an effect on the least significant bit of the 1.31 result. For many applications, the loss of accuracy due to this bit is balanced by the speeding up of the 32-bit multiplication through eliminating one 16-bit multiplication, one shift, and one addition.

The expression for 31-bit accurate multiplication is:

((R1.L x R0.H) + (R1.H x R0.L) ) >> 16 + (R1.H x R0.H)

On the Blackfin architecture, these instructions can be issued in parallel to yield an effective rate of 2 cycles for each multiplication.

So that's the scoop on numeric formats for audio. In the final article of this series, we'll talk about some strategies for developing embedded audio applications, focusing primarily on data movement and building blocks for common algorithms.

This series is adapted from the book "Embedded Media Processing" (Newnes 2005) by David Katz and Rick Gentile. See the book's web site for more information.







Fundamentals of embedded audio, part 1

Audio functionality plays a critical role in embedded media processing. While audio takes less processing power in general than video processing, it should be considered equally important.

In this article, the first of a three-part series, we will explore how data is presented to an embedded processor from a variety of audio converters (DACs and ADCs). Following this, we will explore some common peripheral standards used for connecting to audio converters.

Converting between Analog and Digital Audio Signals

Sampling
All A/D and D/A conversions should obey the Shannon-Nyquist sampling theorem. In short, this theorem dictates that an analog signal must be sampled at a rate (the Nyquist sampling rate) equal to or exceeding twice its bandwidth (the Nyquist frequency) in order for it to be reconstructed in the eventual D/A conversion. Sampling below the Nyquist sampling rate will introduce aliases, which are low-frequency "ghost" images of frequencies that fall above the Nyquist frequency. For example, if we take an audio signal that is band-limited to 0-20 kHz and sample it at 2 × 20 kHz = 40 kHz, then the Nyquist Theorem assures us that the original signal can be reconstructed without any signal loss. Sampling this 0-20 kHz band-limited signal at anything less than 40 kHz will introduce distortions due to aliasing. Figure 1 shows the aliasing effect on a 20 kHz sine wave. When sampled at 40 kHz, a 20 kHz signal is represented correctly (Figure 1a). However, the same 20 kHz sine wave sampled at 30 kHz actually looks like a lower-frequency alias of the original sine wave (Figure 1b).


http://i.cmpnet.com/dspdesignline/2007/08/adifigure1_big.gif

Figure 1. (a) Sampling a 20 kHz signal at 40 kHz captures the original signal correctly (b) Sampling the same 20 kHz signal at 30 kHz captures an aliased (low frequency ghost) signal.
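You can see the alias numerically with a few lines of C: sampling a 20 kHz sine at 30 kHz produces exactly the same sample values as a phase-inverted 10 kHz sine.

#include <stdio.h>
#include <math.h>

int main(void)
{
    const double pi = 3.14159265358979323846;
    const double fs = 30000.0;                 /* sampling rate below 2 x 20 kHz */

    for (int n = 0; n < 6; ++n) {
        double t   = n / fs;
        double s20 =  sin(2.0 * pi * 20000.0 * t);   /* 20 kHz input tone */
        double s10 = -sin(2.0 * pi * 10000.0 * t);   /* 10 kHz alias      */
        printf("n=%d  20 kHz sample = % .4f   10 kHz alias = % .4f\n", n, s20, s10);
    }
    return 0;
}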

No practical system will sample at exactly twice the Nyquist frequency, however. This is because restricting a signal into a specific band requires an analog low-pass filter. Since analog filters are never ideal, high frequency components above the Nyquist frequency can still pass through, causing aliasing. Therefore, it is common to sample above the Nyquist frequency in order to minimize this aliasing. For example, the sampling rate for CD audio is 44.1 kHz, not 40 kHz, and many high-quality systems sample at 48 kHz in order to capture the 0-20 kHz range of hearing even more faithfully.

For speech signals, the energy content below 4 kHz is enough to store an intelligible reproduction of a speech signal. For this reason, telephony applications usually use only 8 kHz sampling (= 2 × 4 kHz). Table 1 summarizes some sampling rates used by familiar systems.


Table 1. Commonly used sampling rates.

PCM Output
The most common digital representation for audio is called PCM (pulse-code modulation). In this representation, an analog amplitude is encoded with a digital level for each sampling period. The resulting digital wave is a vector of snapshots taken to approximate the input analog wave. Since all A/D converters have finite resolution, they introduce quantization noise that is inherent in digital audio systems. Figure 2 shows a PCM representation of an analog sine wave (Figure 2a) converted using an ideal A/D converter. In this case, quantization manifests itself as the "staircase effect" (Figure 2b). You can see that lower resolution leads to a worse representation of the original wave (Figure 2c).

For a numerical example, let's assume that a 24-bit A/D converter is used to sample an analog signal whose range is -2.828 V to 2.828 V (5.656 Vpp). The 24 bits allow for 2^24 (16,777,216) quantization levels. Therefore, the effective voltage resolution is 5.656 V / 16,777,216 = 337.1 nV. In the second part of this series, we'll see how codec resolution affects the dynamic range of audio systems.


http://i.cmpnet.com/dspdesignline/2007/08/adifigure3_big.gif

Figure 2. (a) An analog signal (b) Digitized PCM signal (c) Digitized PCM signal using fewer bits of precision.

PWM Output
Another popular type of modulation is pulse-width modulation (PWM). In PWM, it is the duty cycle, not voltage level, that codes a signal's amplitude. An advantage of PWM is that PWM signals can drive an output circuit directly without any need for a DAC. This is especially useful when a low-cost solution is required. PWM signals can be generated with general-purpose I/O pins, or they can be driven directly by specialized PWM timers, available on many processors.

To achieve decent quality, the PWM carrier frequency should be at least 12 times the bandwidth of the signal, and the resolution of the timer (i.e. granularity of the duty cycle) should be at least 16 bits. Because of the high carrier frequency requirement, traditional PWM audio circuits were used only for low-bandwidth audio such as audio sent to a subwoofer. However, with today's high-speed processors, it's possible to carry higher bandwidth audio.

Before the PWM stream is output, it must be low-pass-filtered to remove the high-frequency carrier. This is usually done in the amplifier circuit that drives the speaker. A class of amplifiers, called Class D, has been used successfully in such a configuration. When amplification is not required a low-pass filter is sufficient as the output stage. In some low-cost applications, where sound quality is not as important, the PWM streams can connect directly to a speaker. In such a system, the mechanical inertia of the speaker's cone acts as a low-pass filter to remove the carrier frequency.

Brief Background on Audio Converters

Audio ADCs
There are many ways to perform A/D conversion. The first commercially successful ADCs used a successive approximation scheme, in which a comparator compares the input analog voltage against a series of discrete voltage levels and finds the closest match.

Most audio ADCs today, however, are sigma-delta converters. Instead of employing successive approximations to create wide resolutions, sigma-delta converters use 1-bit ADCs. In this scheme, the single bit codes whether the current sample has a higher or lower voltage than the previous sample.

In order to compensate for the reduced number of quantization steps, the signal is oversampled at a frequency much higher than the Nyquist frequency. In order to accommodate the more traditional PCM stream processing, conversion from this super-sampled 1-bit stream into a slower, higher-resolution stream is performed using digital filtering blocks inside these converters. For example, a 16-bit 44.1 kHz sigma-delta ADC might oversample at 64x, yielding a 1-bit stream at a rate of 2.8224 MHz. A digital decimation filter converts this super-sampled stream to a 16-bit one at 44.1 kHz.

Because they oversample analog signals, sigma-delta ADCs relax the performance requirements of the analog low-pass filters that band-limit input signals. They also have the advantage of spreading out noise over a wider spectrum than traditional converters.

Audio DACs
Traditional approaches to D/A conversion include weighted resistor, R-2R ladder, and zero-cross distortion methods. Just as in the A/D case, sigma-delta designs rule the D/A conversion space. To take an example, a sigma delta converter might take a 16-bit 44.1 kHz signal, convert it into a 1-bit 2.8224 MHz stream using an interpolating filter, and then feed the 1-bit signal to a DAC that converts the super-sampled stream to an analog signal.

Many audio systems employed today use a sigma-delta audio ADC and a sigma-delta DAC. Therefore the conversion between PCM signals and oversampled 1-bit signals is done twice. For this reason, Sony and Philips have introduced an alternative to PCM, called Direct-Stream Digital (DSD), in their Super Audio CD (SACD) format. This format stores data using the 1-bit high-frequency (2.8224 MHz) sigma-delta stream, bypassing the PCM conversion. The disadvantage is that DSD streams are less intuitive than PCM and require a separate set of digital audio algorithms.

Connecting to Audio Converters: An ADC example
OK, enough background information. Now let's look at an actual ADC. A good choice for a low-cost audio ADC is the Analog Devices AD1871, a sigma-delta converter featuring 24-bit resolution and a 96 kHz sampling frequency. A functional block diagram of the AD1871 is shown in Figure 3a. Stereo audio is input via the left (VINLx) and right (VINRx) input channels and digitized audio data is streamed out serially through the data port, usually to a corresponding serial port on a signal processor.
http://i.cmpnet.com/dspdesignline/2007/08/adifigure4_big.gif
Figure 3. (a) Functional block diagram of the AD1871 audio ADC
(b) Glueless connection of an ADSP-BF533 media processor to the AD1871.

As the block diagram in Figure 3b implies, the interface between the AD1871 ADC and Blackfin processor is glueless (the analog part of the circuit is simplified since only the digital signals are important in this discussion). The Blackfin connects to the ADC via 2 serial ports (SPORTs) and an SPI (Serial Peripheral Interface) port that allows the AD1871 to be configured via software commands. Parameters configurable through the SPI include the sampling rate, word width, and channel gain and muting. The oversampling rate of the AD1871 is supplied with an external crystal.

The SPORT is the data link to the AD1871 and is configured in I²S mode. I²S is a standard protocol developed by Philips for transmission of digital audio signals.

This standard allows audio equipment manufacturers to create components that are compatible with each other. To be exact, I²S is simply a 3-wire serial interface used to transmit stereo data. As shown in Figure 4a, it specifies a bit clock (middle), a data line (bottom), and a left/right synchronization line (top) that selects whether a left or right channel frame is currently being transmitted. In essence, I²S is a time-division-multiplexed (TDM) serial stream with two active channels. TDM is a method of transferring multiple channels (for example, stereo audio) over one physical link.

During setup, the AD1871 can divide down the 12.288 MHz clock it receives from the external crystal and use this reduced clock to drive the SPORT clock (RSCLK) and frame synchronization (RFS) lines. This configuration ensures that sampling and data transmission stay in sync.

The SPI interface, shown in Figure 4b, was designed by Motorola for connecting host processors to a variety of digital components. The interface between an SPI master and an SPI slave essentially consists of a clock line (SCK), two data lines (MOSI and MISO), and a slave select (SPISEL) line. One of the data lines is driven by the master (MOSI), and the other is driven by the slave (MISO). In the example shown in Figure 3b, the processor's SPI port interfaces gluelessly to the SPI block of the AD1871.


http://i.cmpnet.com/dspdesignline/2007/08/adifigure5_big.gif

Figure 4. (a) The data signals transmitted by the AD1871 using the I²S protocol
(b) The SPI 3-wire interface used to control the AD1871.

Audio codecs with a separate SPI control port allow a host processor to change the ADC settings on the fly. Besides muting and gain control, one of the really useful settings on ADCs like the AD1871 is the ability to place it in power-down mode. For battery-powered applications, this is often an essential function.

DACs and Codecs
Connecting an audio DAC to a host processor follows the same process as the ADC connection we just discussed. In a system that uses both an ADC and a DAC, the same serial port can hook up to both, if it supports bidirectional transfers.

But if you're tackling full-duplex audio, then you're better off using a single-chip audio codec that handles both the analog-to-digital and digital-to-analog conversions. A good example of such a codec is the Analog Devices AD1836, which features three stereo DACs and two stereo ADCs, and is able to communicate through a number of serial protocols, including I²S.

In this article, we have covered some basics of connecting audio converters to embedded processors. In part 2 of this series, we will describe the formats in which audio data is stored and processed. In particular, we will review the compromises associated with selecting data sizes. This is important because it dictates the data types used and may also rule out some processor choices if the desired quality level is too high for what a particular device can achieve. Furthermore, data size selection helps with making tradeoffs between increased dynamic range and additional processing power.

This series is adapted from the book "Embedded Media Processing" (Newnes 2005) by David Katz and Rick Gentile. See the book's web site for more information.


CCD and CMOS image sensor processing pipeline


Camera sources today are overwhelmingly based on either Charge-Coupled Device (CCD) or CMOS technology. Both of these technologies convert light into electrical signals, but they differ in how this conversion occurs.

In CCD devices, an array of millions of light-sensitive picture elements, or pixels, spans the surface of the sensor. After exposure to light, the accumulated charge over the entire CCD pixel array is read out at one end of the device and then digitized via an Analog Front End (AFE) chip or CCD processor. On the other hand, CMOS sensors directly digitize the exposure level at each pixel site.

In general, CCDs have the highest quality and lowest noise, but they are not power-efficient. CMOS sensors are easy to manufacture and have low power dissipation, but at reduced quality. Part of the reason is that the transistors at each pixel site tend to occlude light from reaching part of the pixel. However, CMOS has started giving CCD a run for its money in the quality arena, and increasing numbers of mid-tier camera sensors are now CMOS-based.

Regardless of their underlying technology, all pixels in the sensor array are sensitive to grayscale intensity -- from total darkness (black) to total brightness (white). The number of intensity gradations a pixel can resolve is known as its "bit depth." Therefore, 8-bit pixels can distinguish between 2⁸, or 256, shades of gray, whereas 12-bit pixel values differentiate between 4,096 shades. Layered over the entire pixel array is a color filter that segments each pixel into several color-sensitive "subpixels." This arrangement allows a measure of different color intensities at each pixel site. Thus, the color at each pixel location can be viewed as the sum of its red, green and blue channel light content, superimposed in an additive manner. The higher the bit depth, the more colors that can be generated in the RGB space. For example, 24-bit color (8 bits each of R, G and B) results in 2²⁴, or 16.7 million, discrete colors.

In order to properly represent a color image, a sensor needs 3 color samples -- most commonly, Red, Green and Blue -- for every pixel location. However, putting 3 separate sensors in every camera is not a financially tenable solution (although lately such technology is becoming more practical). What's more, as sensor resolutions increase into the 5-10 Megapixel range, it becomes apparent that some form of image compression is necessary to prevent the need to output 3 bytes (or worse yet, 3 12-bit words for higher-resolution sensors) for each pixel location.

Not to worry, because camera manufacturers have developed clever ways of reducing the number of color samples necessary. The most common approach is to use a Color Filter Array (CFA), which measures only a single color at any given pixel location. Then, the results can be interpolated by the image processor to appear as if 3 colors were measured at every location.

The most popular CFA in use today is the Bayer pattern, shown in Figure 1. This scheme, invented by Kodak, takes advantage of the fact that the human eye discerns differences in green-channel intensities more than red or blue changes. Therefore, in the Bayer color filter array, the Green subfilter occurs twice as often as either the Blue or Red subfilter. This results in an output format sometimes known as '4:2:2 RGB', where 4 Green values are sent for every 2 Red and 2 Blue values.



Figure 1: Bayer pattern image sensor arrangement
Connecting to Image Sensors
CMOS sensors ordinarily output a parallel digital stream of pixel components in either YCbCr or RGB format, along with horizontal and vertical synchronization and a pixel clock. Sometimes, they allow for an external clock and sync signals to control the transfer of image frames out from the sensor.

CCDs, on the other hand, usually hook up to an "Analog Front End" (AFE) chip, such as the AD9948, that processes the analog output signal, digitizes it, and generates appropriate timing to scan the CCD array. A processor supplies synchronization signals to the AFE, which needs this control to manage the CCD array. The digitized parallel output stream from the AFE might be in 10-bit, or even 12-bit, resolution per pixel component.

Recently, LVDS (low-voltage differential signaling) has become an important alternative to the parallel data bus approach. LVDS is a low-cost, low pin-count, high-speed serial interconnect that has better noise immunity and lower power consumption than the standard parallel approach. This is important as sensor resolutions and color depths increase, and as portable multimedia applications become more widespread.

Image Pipe
Of course, the picture-taking process doesn't end at the sensor; on the contrary, its journey is just beginning. Let's take a look at what a raw image has to go through before becoming a pretty picture on a display. In digital cameras, this sequence of processing stages is known as the "image processing pipeline," or just "image pipe." Refer to Figure 2 for one possible dataflow. These algorithms are typically performed on a media processor such as those in Analog Devices' Blackfin family.


http://i.cmpnet.com/videsignline/2006/06/blackfin-2.gif

Figure 2: Example Software Image Pipe Flow

Mechanical Feedback Control
Before the shutter button is even released, the focus and exposure systems work with the mechanical camera components to control lens position based on scene characteristics.

Auto-exposure algorithms measure brightness over discrete scene regions to compensate for overexposed or underexposed areas by manipulating shutter speed and/or aperture size. The net goals here are to maintain relative contrast between different regions in the image and to achieve a target average luminance.

Auto-focus algorithms divide into two categories. Active methods use infrared or ultrasonic emitters/receivers to estimate the distance between the camera and the object being photographed. Passive methods, on the other hand, make focusing decisions based on the received image in the camera.

In both of these subsystems, the media processor manipulates the various lens and shutter motors via PWM output signals. For auto-exposure control, it also adjusts the Automatic Gain Control (AGC) circuit of the sensor.

Preprocessing
As we discussed earlier, a sensor's output needs to be gamma-corrected to account for eventual display, as well as to compensate for nonlinearities in the sensor's capture response.

Since sensors usually have a few inactive or defective pixels, a common preprocessing technique is to eliminate these via median filtering, relying on the fact that sharp changes from pixel to pixel are abnormal, since the optical process blurs the image somewhat.
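
As an illustration, here is a minimal C sketch of a 3x3 median filter over a grayscale buffer. The function name and the flat 8-bit buffer layout are assumptions; a real image pipe would typically operate per Bayer color plane and often only at known defect locations.

#include <stdint.h>
#include <stdlib.h>

/* 3x3 median filter sketch for suppressing stuck/defective pixels.
   Border pixels are left untouched for simplicity. */
static int cmp_u8(const void *a, const void *b)
{
    return (int)*(const uint8_t *)a - (int)*(const uint8_t *)b;
}

void median3x3(const uint8_t *in, uint8_t *out, int width, int height)
{
    for (int y = 1; y < height - 1; y++) {
        for (int x = 1; x < width - 1; x++) {
            uint8_t win[9];
            int k = 0;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    win[k++] = in[(y + dy) * width + (x + dx)];
            qsort(win, 9, sizeof win[0], cmp_u8);
            out[y * width + x] = win[4];   /* median of the 3x3 window */
        }
    }
}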

Lens correction (shading / distortion correction)
This set of algorithms accounts for the physical properties of lenses that warp the output image compared to the actual scene the user is viewing. Different lenses can cause different distortions; for instance, wide-angle lenses create a "barreling" or "bulging" effect, while telephoto lenses create a "pincushion" or "pinching" effect.

Lens shading distortion (vignetting) reduces image brightness toward the periphery of the image. Chromatic aberration causes color fringes around an image. The media processor needs to mathematically transform the image in order to correct for these distortions.

Image stabilization, or hand-shake correction, is another area of preprocessing. Here, the processor adjusts for the translational motion of the received image, often with the help of external transducers that relay the real-time motion profile of the sensor.

White balance is another important stage of preprocessing. When we look at a scene, regardless of lighting conditions, our eyes tend to normalize everything to the same set of natural colors. For instance, an apple looks deep red to us whether we're indoors under fluorescent lighting, or outside in sunny weather. However, an image sensor's "perception" of color depends largely on lighting conditions, so it needs to map its acquired image to appear "lighting-agnostic" in its final output. This mapping can be done either manually or automatically.

In manual systems, you point your camera at an object you determine to be "white," and the camera will then shift the "color temperature" of all images it takes to accommodate this mapping. Automatic White Balance (AWB), on the other hand, uses inputs from the image sensor and an extra white balance sensor to determine what should be regarded as "true white" in an image. It tweaks the relative gains between the R, G and B channels of the image. Naturally, AWB requires more image processing than manual methods, and it's another target of proprietary vendor algorithms.
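
A simple flavor of AWB is the "gray-world" method, which assumes the scene averages out to neutral gray and scales the R and B channels so their means match the G mean. The sketch below shows that idea in C for an interleaved 8-bit RGB buffer; the function name and Q8 fixed-point gains are illustrative, and production AWB algorithms are far more elaborate.

#include <stdint.h>

/* Gray-world white balance sketch: assume the scene averages to neutral gray,
   then scale R and B so their means match the G mean. */
void gray_world_awb(uint8_t *rgb, int num_pixels)
{
    uint64_t sum_r = 0, sum_g = 0, sum_b = 0;
    for (int i = 0; i < num_pixels; i++) {
        sum_r += rgb[3 * i + 0];
        sum_g += rgb[3 * i + 1];
        sum_b += rgb[3 * i + 2];
    }
    if (sum_r == 0 || sum_b == 0)
        return;
    /* Fixed-point gains in Q8 format (256 == 1.0). */
    uint32_t gain_r = (uint32_t)((sum_g * 256) / sum_r);
    uint32_t gain_b = (uint32_t)((sum_g * 256) / sum_b);
    for (int i = 0; i < num_pixels; i++) {
        uint32_t r = (rgb[3 * i + 0] * gain_r) >> 8;
        uint32_t b = (rgb[3 * i + 2] * gain_b) >> 8;
        rgb[3 * i + 0] = (uint8_t)(r > 255 ? 255 : r);
        rgb[3 * i + 2] = (uint8_t)(b > 255 ? 255 : b);
    }
}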

De-mosaic / Pixel interpolation / Noise reduction / Edge enhancement
De-mosaicing is perhaps the most crucial and numerically intensive operation in the image pipeline. Each camera manufacturer typically has its own "secret recipe," but in general, the approaches fall into a few main algorithm categories.

Nonadaptive algorithms like bilinear interpolation or bicubic interpolation are among the simplest to implement, and they work well in smooth areas of an image. However, edges and texture-rich regions present a challenge to these straightforward implementations. Adaptive algorithms, those that change behavior based on localized image traits, can provide better results.

One example of an adaptive approach is edge-directed reconstruction. Here, the algorithm analyzes the region surrounding a pixel and determines in which direction to perform interpolation. If it finds an edge nearby, it interpolates along the edge, rather than across it. Another adaptive scheme assumes a constant hue for an entire object, and this prevents abrupt changes in color gradients within individual objects. Many other de-mosaicing approaches exist, some involving frequency-domain analysis, Bayesian probabilistic estimation, and even neural networks.
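
For reference, here is a C sketch of plain bilinear de-mosaicing, assuming an RGGB Bayer layout and processing interior pixels only. The function name and buffer layout are assumptions, and as noted above, adaptive (e.g., edge-directed) schemes generally produce better results.

#include <stdint.h>

/* Bilinear demosaic sketch for an assumed RGGB Bayer layout:
     row even:  R G R G ...
     row odd :  G B G B ...
   Interior pixels only; borders are skipped for brevity. */
void demosaic_bilinear(const uint8_t *raw, uint8_t *rgb, int w, int h)
{
    for (int y = 1; y < h - 1; y++) {
        for (int x = 1; x < w - 1; x++) {
            const uint8_t *p = &raw[y * w + x];
            int r, g, b;
            int cross = (p[-w] + p[w] + p[-1] + p[1]) / 4;          /* N,S,W,E */
            int diag  = (p[-w - 1] + p[-w + 1] + p[w - 1] + p[w + 1]) / 4;
            if ((y & 1) == 0 && (x & 1) == 0) {        /* red site        */
                r = p[0];  g = cross;  b = diag;
            } else if ((y & 1) == 1 && (x & 1) == 1) { /* blue site       */
                b = p[0];  g = cross;  r = diag;
            } else if ((y & 1) == 0) {                 /* green, red row  */
                g = p[0];  r = (p[-1] + p[1]) / 2;  b = (p[-w] + p[w]) / 2;
            } else {                                   /* green, blue row */
                g = p[0];  b = (p[-1] + p[1]) / 2;  r = (p[-w] + p[w]) / 2;
            }
            uint8_t *out = &rgb[3 * (y * w + x)];
            out[0] = (uint8_t)r;  out[1] = (uint8_t)g;  out[2] = (uint8_t)b;
        }
    }
}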

Color Transformation
In this stage, the interpolated RGB image is transformed to the targeted output color space (if not already in the right space). For compression or display to a television, this will usually involve an RGB→YCbCr matrix transformation, often with another gamma correction stage to accommodate the target display. The YCbCr outputs may also be chroma subsampled at this stage to the standard 4:2:2 format for color bandwidth reduction with little visual impact.
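
A possible integer implementation of the RGB→YCbCr step is sketched below, using the standard BT.601 coefficients scaled by 256. This is the full-range (0-255) variant; studio-swing video systems add offsets of 16 and use slightly different scaling, and the function name here is illustrative.

#include <stdint.h>

/* RGB -> YCbCr (ITU-R BT.601, full range) sketch in integer arithmetic.
   Coefficients are the standard ones scaled by 256 and rounded. */
void rgb_to_ycbcr(uint8_t r, uint8_t g, uint8_t b,
                  uint8_t *y, uint8_t *cb, uint8_t *cr)
{
    *y  = (uint8_t)((   77 * r + 150 * g +  29 * b + 128) >> 8);
    *cb = (uint8_t)((32768 -  43 * r -  85 * g + 128 * b) >> 8);
    *cr = (uint8_t)((32768 + 128 * r - 107 * g -  21 * b) >> 8);
}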

Postprocessing
In this phase, the image is perfected via a variety of filtering operations before being sent to the display and/or storage media. For instance, edge enhancement, pixel thresholding for noise reduction, and color-artifact removal are all common at this stage.

Display / Compress / Store
Once the image itself is ready for viewing, the image pipe branches off in two different directions. In the first, the postprocessed image is output to the target display, usually an integrated LCD screen (but sometimes an NTSC/PAL television monitor, in certain camera modes). In the second, the image is sent to the media processor's compression algorithm, where industry-standard compression techniques (JPEG, for instance) are applied before the picture is stored locally in some storage medium (usually a non-volatile Flash memory card).

About the authors
David Katz is a Senior DSP Applications Engineer at Analog Devices, Inc., where he is involved in specifying and supporting Blackfin media processors. He has published dozens of embedded processor articles both domestically and internationally. Previously, he worked at Motorola, Inc., as a senior design engineer in cable modem and automation groups. David holds both a B.S. and M. Eng. in Electrical Engineering from Cornell University. He can be reached at David.Katz@analog.com.

Rick Gentile joined ADI in 2000 as a Senior DSP Applications Engineer, and he currently leads the Blackfin DSP Applications Group. Prior to joining ADI, Rick was a Member of the Technical Staff at MIT Lincoln Laboratory, where he designed several signal processors used in a wide range of radar sensors. He received a B.S. in 1987 from the University of Massachusetts at Amherst and an M.S. in 1994 from Northeastern University, both in Electrical and Computer Engineering. He can be reached at Rick.Gentile@analog.com.


Fundamentals of embedded video, part 5

Let's walk through the system of Figure 1 to illustrate some fundamental video processing steps present in various combinations in an embedded video application. In the diagram, an interlaced-scan CMOS sensor sends a 4:2:2 YCbCr video stream through the processor's video port, at which point it is deinterlaced and scan-rate converted. It then passes through some computational algorithm(s) and is prepared for output to an LCD panel. This preparation involves chroma resampling, gamma correction, color conversion, scaling, blending with graphics, and packing into the appropriate output format for display on the LCD panel. Note that this system is only provided as an example, and not all of these components are necessary in a given system. Additionally, these steps may occur in a different order than shown here.
http://i.cmpnet.com/dspdesignline/2007/10/adifigure4_1_big.gif
Figure 1. Example flow of camera input to LCD output, with processing stages in-between

Deinterlacing
When taking video source data from a camera that outputs interlaced NTSC data, it's often necessary to deinterlace it so that the odd and even lines are interleaved in memory, instead of being located in two separate field buffers. Deinterlacing is needed not only for efficient block-based video processing, but also for displaying interlaced video in progressive format (for instance, on an LCD panel). There are many ways to deinterlace, including line doubling, line averaging, median filtering, and motion compensation.
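
The simplest of these, sometimes called "weave" deinterlacing, just interleaves the two field buffers line by line into one progressive frame. A minimal C sketch follows; the function name and bytes-per-line interface are assumptions, and motion between the fields will still produce combing artifacts unless one of the filtering or motion-compensated methods above is applied.

#include <stdint.h>
#include <string.h>

/* "Weave" deinterlace sketch: interleave two field buffers (top/even and
   bottom/odd lines) into a single progressive frame in memory. */
void weave_fields(const uint8_t *top_field, const uint8_t *bot_field,
                  uint8_t *frame, int bytes_per_line, int frame_lines)
{
    for (int line = 0; line < frame_lines; line++) {
        const uint8_t *src = (line & 1)
            ? &bot_field[(line / 2) * bytes_per_line]
            : &top_field[(line / 2) * bytes_per_line];
        memcpy(&frame[line * bytes_per_line], src, bytes_per_line);
    }
}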

Scan Rate Conversion
Once the video has been deinterlaced, a scan-rate conversion process may be necessary to ensure that the input frame rate matches the output display refresh rate. To equalize the two, fields may need to be dropped or duplicated. Of course, as with deinterlacing, some sort of filtering is preferable in order to smooth out high-frequency artifacts caused by abrupt frame transitions.

A special case of frame rate conversion that's employed to convert a 24 frame/sec stream (common for 35 mm and 70 mm movie recordings) to the 30 frame/sec required by NTSC video is 3:2 pulldown. For instance, motion pictures recorded at 24 fps would run 25% faster (=30/24) if each film frame is used only once in an NTSC video system. Therefore, 3:2 pulldown was conceived to adapt the 24 fps stream into a 30 fps video sequence. It does so by repeating frames in a certain periodic pattern, shown in Figure 2.
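
A small C sketch of the cadence is shown below: it maps an output field index to the film frame that should supply it, using a 3-2-3-2 repetition pattern over each group of four film frames (ten fields, or five interlaced video frames). The function name is hypothetical, and real systems may start the cadence at a different phase.

/* 3:2 pulldown sketch: map an output video field index to the source film
   frame it should display. Each group of 4 film frames becomes 10 fields
   (5 interlaced video frames), following a 3-2-3-2 cadence. */
int pulldown_source_frame(int field_index)
{
    static const int cadence[10] = { 0, 0, 0, 1, 1, 2, 2, 2, 3, 3 };
    int group = field_index / 10;          /* which block of 4 film frames */
    return group * 4 + cadence[field_index % 10];
}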

http://i.cmpnet.com/dspdesignline/2007/10/adifigure4_2_big.gif

Figure 2. 3:2 pulldown frame repetition pattern
(24 fps progressive movie → 30 fps interlaced TV)

Pixel Processing
As we discussed above, there are a lot of video algorithms in common use, and they're stratified into spatial and temporal classifications. One particularly common video operator is the two-dimensional (2D) convolution kernel, which is used for many different forms of image filtering.

2D convolution
Since a video stream is really an image sequence moving at a specified rate, image filters need to operate fast enough to keep up with the succession of input images. Thus, it is imperative that image filter kernels be optimized for execution in the lowest possible number of processor cycles. This can be illustrated by examining a simple image filter set based on two-dimensional convolution.

Convolution is one of the fundamental operations in image processing. In a two-dimensional convolution, the calculation performed for a given pixel is a weighted sum of intensity values from pixels in its immediate neighborhood. Since the neighborhood of a mask is centered on a given pixel, the mask usually has odd dimensions. The mask size is typically small relative to the image. A 3x3 mask is a common choice, because it is computationally reasonable on a per-pixel basis, yet large enough to detect edges in an image. However, it should be noted that 5x5, 7x7 and beyond are also widely used. Camera image pipes, for example, can employ 11x11 (and larger!) kernels for extremely complex filtering operations.

The basic structure of the 3x3 kernel is shown in Figure 3a. As an example, the output of the convolution process for the pixel at row 20, column 10 of input image In would be:

Out(20,10) = A*In(19,9) + B*In(19,10) + C*In(19,11) + D*In(20,9) + E*In(20,10) + F*In(20,11) + G*In(21,9) + H*In(21,10) + I*In(21,11)
http://i.cmpnet.com/dspdesignline/2007/10/adifigure4_4_big.gif
Figure 3. The 3x3 Convolution mask and how it can be used

It is important to choose coefficients in a way that aids computation. For instance, scale factors that are powers of 2 (including fractions) are preferred because multiplications can then be replaced by simple shift operations.
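
Putting these ideas together, here is a generic 3x3 convolution sketch in C that scales the accumulated sum with a right shift, in keeping with the power-of-two advice above. The function name, kernel layout, and the choice to skip border pixels (the simplest boundary policy, discussed later) are all assumptions; optimized implementations would unroll the loops and use the processor's multiply-accumulate hardware.

#include <stdint.h>

/* Generic 3x3 convolution sketch. The kernel holds integer weights and the
   result is scaled by a right shift (divide by 2^shift). Border pixels are
   skipped, and results are clamped to the 8-bit range. */
void conv3x3(const uint8_t *in, uint8_t *out, int w, int h,
             const int kernel[3][3], int shift)
{
    for (int y = 1; y < h - 1; y++) {
        for (int x = 1; x < w - 1; x++) {
            int acc = 0;
            for (int ky = -1; ky <= 1; ky++)
                for (int kx = -1; kx <= 1; kx++)
                    acc += kernel[ky + 1][kx + 1] * in[(y + ky) * w + (x + kx)];
            if (acc < 0) acc = 0;            /* clamp negative sums          */
            acc >>= shift;                   /* scale by 1 / 2^shift         */
            if (acc > 255) acc = 255;        /* clamp to 8-bit range         */
            out[y * w + x] = (uint8_t)acc;
        }
    }
}

/* Example: the smoothing kernel of Figure 3d (average of the 8 neighbors),
   applied with shift = 3 (divide by 8):
     static const int smooth[3][3] = { {1, 1, 1}, {1, 0, 1}, {1, 1, 1} };
     conv3x3(in, out, width, height, smooth, 3);                          */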

Figures 3b-e show several useful 3x3 kernels, each of which is explained briefly below.

The Delta Function shown in Figure 3b is among the simplest of image manipulations—it passes the current pixel through without modification.

Figure 3c shows two popular forms of an edge detection mask. The first one detects vertical edges, while the second one detects horizontal edges. High output values correspond to higher degrees of edge presence.

The kernel in Figure 3d is a smoothing filter. It performs an average of the 8 surrounding pixels and places the result at the current pixel location. This has the effect of "smoothing," or low-pass filtering, the image.

The filter in Figure 3e is known as an "unsharp masking" operator. It can be considered as producing an edge-enhanced image by subtracting from the current pixel a smoothed version of itself (constructed by averaging the 8 surrounding pixels).

Dealing with Image Boundaries
What happens when a function like 2D convolution operates on pixels near an image's border regions? Filtering these pixels properly requires pixel information from "outside" the image boundaries. There are a couple of remedies for this situation. The simplest is just to ignore the edge regions: a 5x5 convolution kernel needs 2 pixels to the left, right, top and bottom of the current pixel in order to function properly, so why not just shave 2 rows and 2 columns off each edge of the image, guaranteeing that the kernel always acts on real data? Of course, this isn't always an ideal approach, since it throws away real image data. Also, when filters are strung together to create more complex pixel manipulations, this scheme continually narrows the input image with every new filter stage that executes.

Other popular ways of handling the image boundary quandary are to duplicate rows and/or columns of pixels, or to wrap around from the left (top) edge back to the previous right (bottom) edge. While these might be easy to implement in practice, they create data that didn't exist before, and therefore corrupt filtering results to some extent.

Perhaps the most straightforward, and least damaging, method for dealing with image boundaries is to consider everything that lies outside of the actual image to be zero-valued, or black. Although this scheme, too, distorts filtering results, it is not as invasive as creating lines of potentially random non-zero-valued pixels.

Chroma Resampling and Color Conversion
Ultimately, the data stream in our example needs to be converted to RGB space. We already discussed how to convert between 4:4:4 YCbCr and RGB spaces, via a 3x3 matrix multiplication. However, up to this point, our pixel values are still in 4:2:2 YCbCr space. Therefore, we need to resample the chroma values to achieve a 4:4:4 format. Then the transformation to RGB will be straightforward, as we've already seen.

Resampling from 4:2:2 to 4:4:4 involves interpolating Cb and Cr values for those Y samples that are missing one of these components. A clear-cut way to resample is to interpolate the missing chroma values from their nearest neighbors by simple averaging. That is, a missing Cb value at a pixel site would be replaced by the average of the nearest two Cb values. Higher-order filtering might be necessary for some applications, but this simplified approach is often sufficient. Another approach is to replicate the chrominance values of neighboring pixels for those values that are missing in the current pixel's representation.
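
A minimal C sketch of the averaging approach for one line of chroma samples is shown below (call it once for Cb and once for Cr); the function name and the co-sited sample alignment are assumptions.

#include <stdint.h>

/* 4:2:2 -> 4:4:4 chroma upsampling sketch for one line of chroma samples.
   Missing chroma values are reconstructed as the average of their two
   nearest horizontal neighbors; the last sample is simply replicated. */
void chroma_422_to_444_line(const uint8_t *c422, uint8_t *c444, int luma_width)
{
    int chroma_width = luma_width / 2;
    for (int i = 0; i < chroma_width; i++) {
        c444[2 * i] = c422[i];                       /* co-sited sample     */
        c444[2 * i + 1] = (i + 1 < chroma_width)     /* interpolated sample */
            ? (uint8_t)((c422[i] + c422[i + 1] + 1) / 2)
            : c422[i];                               /* replicate at edge   */
    }
}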

In general, conversions from 4:1:1 space to 4:2:2 or 4:4:4 formats involve only a one-dimensional filter (with tap values and quantities consistent with the level of filtering desired). However, resampling from 4:2:0 format into 4:2:2 or 4:4:4 format involves vertical sampling as well, necessitating a two-dimensional convolution kernel.

Because chroma resampling and YCbCr→RGB conversion are both linear operations, it is possible to combine the steps into a single mathematical operation, thus achieving 4:2:2 YCbCr→RGB conversion efficiently.

Scaling and Cropping
Video scaling allows the generation of an output stream whose resolution is different from that of the input format. Ideally, the fixed scaling requirements (input data resolution, output panel resolution) are known ahead of time, in order to avoid the computational load of arbitrary scaling between input and output streams.

Depending on the application, scaling can be done either upwards or downwards. It is important to understand the content of the image to be scaled (e.g., the presence of text and thin lines). Improper scaling can make text unreadable or cause some horizontal lines to disappear in the scaled image.

The easiest method to adjust an input frame size to an output frame that's smaller is simply to crop the image. For instance, if the input frame size is 720x480 pixels, and the output is a VGA frame (640x480 pixels), you can drop the first 40 and the last 40 pixels on each line. The advantage here is that there are no artifacts associated with dropping pixels or duplicating them. Of course, the disadvantage is that you'd lose 80 pixels (about 11%) of frame content. Sometimes this isn't too much of an issue, because the leftmost and rightmost extremities of the screen (as well as the top and bottom regions) are often obscured from view by the mechanical enclosure of the display.
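
For illustration, a C sketch of this crop is shown below, assuming a packed 2-bytes-per-pixel (e.g., 4:2:2) frame; the function and macro names are hypothetical. On a media processor, a 2D DMA controller can usually perform the same operation with no CPU involvement.

#include <stdint.h>
#include <string.h>

/* Cropping sketch: drop 40 pixels from each end of every 720-pixel line to
   produce a 640x480 frame. Assumes 2 bytes per pixel (e.g., packed 4:2:2). */
#define IN_W   720
#define OUT_W  640
#define LINES  480
#define BPP    2

void crop_720_to_640(const uint8_t *in, uint8_t *out)
{
    for (int line = 0; line < LINES; line++)
        memcpy(&out[line * OUT_W * BPP],
               &in[(line * IN_W + 40) * BPP],       /* skip first 40 pixels */
               OUT_W * BPP);
}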

If cropping isn't an option, there are several ways to downsample (reduce pixel and/or line count) or upsample (increase pixel and/or line count) an image that allow tradeoffs between processing complexity and resultant image quality.

Increasing or decreasing pixels per row
One straightforward method of scaling involves either dropping pixels (for downsampling) or duplicating existing pixels (for upsampling). That is, when scaling down to a lower resolution, some number of pixels on each line (and/or some number of lines per frame) can be thrown away. While this certainly reduces processing load, the results will yield aliasing and visual artifacts.

A small step up in complexity uses linear interpolation to improve the image quality. For example, when scaling down an image, each output pixel is computed as a weighted average of neighboring input pixels in the horizontal or vertical direction, and this single value replaces the pixels used in the interpolation. As with the previous technique, information is still thrown away, so some artifacts and aliasing will remain.

If image quality is paramount, there are better-performing (though more computationally expensive) ways to scale. These methods strive to preserve as much of the image's high-frequency content as the new horizontal and vertical resolutions allow, while reducing the effects of aliasing. For example, assume that an image is to be scaled by a ratio of Y:X. To accomplish this, the image could be upsampled ("interpolated") by a factor of Y, filtered to eliminate aliasing, and then downsampled ("decimated") by a factor of X. In practice, these two sampling processes can be combined into a single multirate filter.
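
As a simpler point of reference, the C sketch below resamples one line using the linear-interpolation approach described earlier (tracked in 16.16 fixed point), rather than a full multirate polyphase filter; the function name is illustrative and the output width is assumed to be at least 2.

#include <stdint.h>

/* Horizontal resampling sketch using linear interpolation. Source position
   is tracked in 16.16 fixed point; works for both up- and downscaling. */
void resize_line_linear(const uint8_t *in, int in_w, uint8_t *out, int out_w)
{
    uint32_t step = ((uint32_t)(in_w - 1) << 16) / (uint32_t)(out_w - 1);
    uint32_t pos = 0;
    for (int x = 0; x < out_w; x++, pos += step) {
        int i = (int)(pos >> 16);          /* integer source index   */
        uint32_t frac = pos & 0xFFFF;      /* fractional part (Q16)  */
        uint32_t next = (i + 1 < in_w) ? in[i + 1] : in[i];
        out[x] = (uint8_t)((in[i] * (0x10000 - frac) + next * frac) >> 16);
    }
}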

Increasing or Reducing lines per frame
The guidelines for increasing or reducing the number of pixels per row generally extend to modifying the number of lines per frame of an image. For example, throwing out every other line (or one entire interlaced field) provides a quick method of reducing vertical resolution. However, as we've mentioned above, some sort of vertical filtering is necessary whenever removing or duplicating lines, because these processes introduce artifacts into the image. The same filter strategies apply here: simple vertical averaging, higher-order FIR filters, or multirate filters to scale vertically to an exact ratio.

Display Processing
Alpha Blending
Often it is necessary to combine two image and/or video buffers prior to display. A practical example of this is overlaying of icons like signal strength and battery level indicators onto a cellular phone's graphics display. An example involving two video streams is picture-in-picture functionality.

When combining two streams, you need to decide which stream "wins" in places where content overlaps. This is where alpha blending comes in. It defines a variable alpha (α) that indicates a "transparency factor" between an overlay stream and a background stream as follows:

Output value = α (foreground pixel value) + (1-α) (background pixel value)

As the equation shows, an α value of 0 results in a completely transparent overlay, whereas a value of 1 results in a completely opaque overlay that disregards the background image entirely.
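
In integer form, the blend can be written as in the C sketch below, using an 8-bit alpha where 0 is fully transparent and 255 is fully opaque; the function names are illustrative, and the same routine works whether alpha is a per-pixel channel or a single global value.

#include <stdint.h>

/* Alpha blending sketch: out = alpha * fg + (1 - alpha) * bg,
   with alpha expressed as 0..255 instead of 0..1. */
static inline uint8_t blend(uint8_t fg, uint8_t bg, uint8_t alpha)
{
    return (uint8_t)((fg * alpha + bg * (255 - alpha) + 127) / 255);
}

void alpha_blend(const uint8_t *fg, const uint8_t *bg, const uint8_t *alpha,
                 uint8_t *out, int num_samples)
{
    for (int i = 0; i < num_samples; i++)
        out[i] = blend(fg[i], bg[i], alpha[i]);
}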

Alpha is sometimes sent as a separate channel along with the pixel-wise luma and chroma information. This results in the notation "4:2:2:4," where the last digit indicates an alpha key that accompanies each 4:2:2 pixel entity. Alpha is coded in the same way as the luma component, but often only a few discrete levels of transparency (perhaps 16) are needed for most applications. Sometimes a video overlay buffer is premultiplied by alpha or premapped via a lookup table, in which case it's referred to as a "shaped" video buffer.

Compositing
The act of compositing involves positioning an overlay buffer inside a larger image buffer. Common examples are a "Picture-in-Picture" mode on a video display, and placement of graphics icons (like battery and signal strength indicators) over the background image or video. In general, the composition function can take several iterations before the output image is complete. In other words, there may be many "layers" of graphics and video that combine to generate a composite image.

Two-dimensional DMA capability is very useful for compositing functions, because it allows the positioning of arbitrarily-sized rectangular buffers inside a larger buffer. One thing to keep in mind is that any image cropping should take place after the composition process, because the positioned overlay might violate any previously cropped boundaries. Of course, an alternative is to ensure that the overlay won't violate the boundaries in the first place, but this is sometimes asking too much!

Chroma Keying
The term "chroma keying" refers to a process by which a particular color (usually blue or green) in one image is replaced by the content in a second image when the two are composited together. This provides a convenient way to combine two video images by purposefully tailoring parts of the first image to be replaced by the appropriate sections of the second image. Chroma keying can be performed in either software or hardware on a media processor.

Output Formatting
Most color LCD displays targeted for consumer applications (TFT-LCDs) have a digital RGB interface. Each pixel in the display actually has 3 subpixels—one each with Red, Green and Blue filters—that the human eye resolves as a single color pixel. For example, a 320x240 pixel display actually has 960x240 pixel components, accounting for the R, G, and B subpixels. Each subpixel has 8 bits of intensity, thus forming the basis of the common 24-bit color LCD display.

The three most common configurations use either 8 bits per channel for RGB (RGB888 format), 6 bits per channel (RGB666 format), or 5 bits per channel for R and B, and 6 bits for G (RGB565 format).

RGB888 provides the greatest color clarity of the three. With a total of 24 bits of resolution, this format provides over 16 million shades of color. It offers the high resolution and precision needed in high performance applications like LCD TVs.

The RGB666 format is popular in portable electronics. Providing over 262,000 shades of color, this format has a total of 18 bits of resolution. However, because the 18-pin (6+6+6) data bus doesn't conform nicely to 16-bit processor data paths, a popular industry compromise is to use 5 bits each of R and B, and 6 bits of G (5+5+6 = a 16-bit data bus) to connect to a RGB666 panel. This scenario works well because green is the most visually important color of the three. The least-significant bits of both Red and Blue are tied at the panel to their respective most-significant bits. This ensures a full dynamic range for each color channel (full intensity down to total black).
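
The same bit-replication idea can be expressed in software. The C sketch below packs RGB888 into RGB565 and expands it back, copying each channel's most significant bits into the vacated least significant bits so that full-scale values remain full-scale; the function names are illustrative.

#include <stdint.h>

/* RGB888 <-> RGB565 sketch. Packing truncates the low bits; unpacking
   replicates the most significant bits into the vacated low bits, mirroring
   the MSB-to-LSB tie-off described above. */
static inline uint16_t pack_rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

static inline void unpack_rgb565(uint16_t p, uint8_t *r, uint8_t *g, uint8_t *b)
{
    uint8_t r5 = (p >> 11) & 0x1F, g6 = (p >> 5) & 0x3F, b5 = p & 0x1F;
    *r = (uint8_t)((r5 << 3) | (r5 >> 2));   /* replicate 3 MSBs into LSBs */
    *g = (uint8_t)((g6 << 2) | (g6 >> 4));   /* replicate 2 MSBs into LSBs */
    *b = (uint8_t)((b5 << 3) | (b5 >> 2));
}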

We hope that this article series has given you a good understanding of the basics involved in embedded video processing. For a more in-depth discussion on media processing issues, including data flow and media framework analyses, you may wish to refer to "Embedded Media Processing."

This series is adapted from the book "Embedded Media Processing" (Newnes 2005) by David Katz and Rick Gentile. See the book's web site for more information.