Friday, November 16, 2007

Fundamentals of embedded video, part 5

Let's walk through the system of Figure 1 to illustrate some fundamental video processing steps present in various combinations in an embedded video application. In the diagram, an interlaced-scan CMOS sensor sends a 4:2:2 YCbCr video stream through the processor's video port, at which point it is deinterlaced and scan-rate converted. It then passes through some computational algorithm(s) and is prepared for output to an LCD panel. This preparation involves chroma resampling, gamma correction, color conversion, scaling, blending with graphics, and packing into the appropriate output format for display on the LCD panel. Note that this system is only provided as an example, and not all of these components are necessary in a given system. Additionally, these steps may occur in a different order than shown here.
Figure 1. Example flow of camera input to LCD output, with processing stages in-between

When taking video source data from a camera that outputs interlaced NTSC data, it's often necessary to deinterlace it so that the odd and even lines are interleaved in memory, instead of being located in two separate field buffers. Deinterlacing is needed not only for efficient block-based video processing, but also for displaying interlaced video in progressive format (for instance, on an LCD panel). There are many ways to deinterlace, including line doubling, line averaging, median filtering, and motion compensation.
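Line averaging, mentioned above, can be sketched in a few lines of Python. This is illustrative only (a real embedded implementation would be optimized C or assembly), and the function name `deinterlace_line_average` is hypothetical; it rebuilds a full progressive frame from a single field by averaging vertically adjacent field lines:

```python
def deinterlace_line_average(field):
    """Rebuild a progressive frame from one field (every other line of
    the original frame). Kept lines pass through; each missing line is
    the average of the field lines above and below it. The bottom edge
    simply duplicates the last line."""
    frame = []
    for i, line in enumerate(field):
        frame.append(line[:])                       # keep the field line
        if i + 1 < len(field):
            nxt = field[i + 1]
            # missing line = vertical average of its two neighbors
            frame.append([(a + b) // 2 for a, b in zip(line, nxt)])
        else:
            frame.append(line[:])                   # duplicate at bottom
    return frame
```

More sophisticated schemes (median filtering, motion compensation) trade extra computation for fewer motion artifacts along the interpolated lines.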

Scan Rate Conversion
Once the video has been deinterlaced, a scan-rate conversion process may be necessary in order to ensure that the input frame rate matches the output display refresh rate. To equalize the two, fields may need to be dropped or duplicated. Of course, as with deinterlacing, some sort of filtering is preferable in order to smooth out the high-frequency artifacts caused by abrupt frame transitions.

A special case of frame rate conversion that's employed to convert a 24 frame/sec stream (common for 35 mm and 70 mm movie recordings) to the 30 frame/sec required by NTSC video is 3:2 pulldown. For instance, motion pictures recorded at 24 fps would run 25% faster (=30/24) if each film frame is used only once in an NTSC video system. Therefore, 3:2 pulldown was conceived to adapt the 24 fps stream into a 30 fps video sequence. It does so by repeating frames in a certain periodic pattern, shown in Figure 2.
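The repetition pattern is simple enough to sketch directly. In this illustrative Python fragment (the function name and the choice of which frames contribute 2 versus 3 fields are conventions for illustration; real telecine equipment also alternates odd/even field parity), four film frames expand into ten fields, i.e. five interlaced video frames:

```python
def pulldown_3_2(frames):
    """Expand a 24 fps frame sequence into interlaced fields using the
    3:2 pulldown cadence: frames alternately contribute 2 and 3 fields,
    so every 4 film frames become 10 fields (5 video frames)."""
    fields = []
    for i, frame in enumerate(frames):
        repeat = 3 if i % 2 else 2      # cadence: 2, 3, 2, 3, ...
        fields.extend([frame] * repeat)
    return fields
```

One second of film (24 frames) thus yields 60 fields, exactly the 30 interlaced frames per second that NTSC expects.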

Figure 2. 3:2 pulldown frame repetition pattern
(24 fps progressive movie → 30 fps interlaced TV)

Pixel Processing
As we discussed above, there are a lot of video algorithms in common use, and they're stratified into spatial and temporal classifications. One particularly common video operator is the two-dimensional (2D) convolution kernel, which is used for many different forms of image filtering.

2D convolution
Since a video stream is really an image sequence moving at a specified rate, image filters need to operate fast enough to keep up with the succession of input images. Thus, it is imperative that image filter kernels be optimized for execution in the lowest possible number of processor cycles. This can be illustrated by examining a simple image filter set based on two-dimensional convolution.

Convolution is one of the fundamental operations in image processing. In a two-dimensional convolution, the calculation performed for a given pixel is a weighted sum of intensity values from pixels in its immediate neighborhood. Since the neighborhood of a mask is centered on a given pixel, the mask usually has odd dimensions. The mask size is typically small relative to the image. A 3x3 mask is a common choice, because it is computationally reasonable on a per-pixel basis, yet large enough to detect edges in an image. However, it should be noted that 5x5, 7x7 and beyond are also widely used. Camera image pipes, for example, can employ 11x11 (and larger!) kernels for extremely complex filtering operations.

The basic structure of the 3x3 kernel is shown in Figure 3a. As an example, the output of the convolution process for a pixel at row 20, column 10 in an image would be:

Out(20,10) = A*(19,9) + B*(19,10) + C*(19,11) + D*(20,9) + E*(20,10) + F*(20,11) + G*(21,9) + H*(21,10) + I*(21,11)
Figure 3. The 3x3 Convolution mask and how it can be used
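The weighted-sum calculation above can be sketched in Python (illustrative only; an embedded implementation would be hand-optimized C or assembly, often using the processor's MAC units). The helper name `convolve3x3` is hypothetical, and border pixels are simply left at zero here:

```python
def convolve3x3(image, kernel):
    """Apply a 3x3 kernel to the interior pixels of a grayscale image
    (lists of lists of ints). kernel[0][0] corresponds to coefficient A
    in the article's formula, multiplying the pixel up and to the left.
    Border pixels are left at zero (the 'ignore the edges' strategy)."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            acc = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    acc += kernel[dr + 1][dc + 1] * image[r + dr][c + dc]
            out[r][c] = acc
    return out
```

Running the delta kernel (all zeros except a 1 in the center) through this routine passes interior pixels through unchanged, which is a handy sanity check for any convolution implementation.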

It is important to choose coefficients in a way that aids computation. For instance, scale factors that are powers of 2 (including fractions) are preferred because multiplications can then be replaced by simple shift operations.

Figures 3b-e show several useful 3x3 kernels, each of which is explained briefly below.

The Delta Function shown in Figure 3b is among the simplest of image manipulations—it passes the current pixel through without modification.

Figure 3c shows two popular forms of an edge detection mask. The first one detects vertical edges, while the second one detects horizontal edges. High output values correspond to higher degrees of edge presence.

The kernel in Figure 3d is a smoothing filter. It performs an average of the 8 surrounding pixels and places the result at the current pixel location. This has the effect of "smoothing," or low-pass filtering, the image.

The filter in Figure 3e is known as an "unsharp masking" operator. It can be considered as producing an edge-enhanced image by subtracting from the current pixel a smoothed version of itself (constructed by averaging the 8 surrounding pixels).
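The Figure 3d smoothing kernel is concrete enough to sketch on its own. This illustrative Python version (function name hypothetical; integer arithmetic is assumed, as it would be on a fixed-point embedded processor) averages the 8 neighbors of each interior pixel and leaves the borders untouched:

```python
def smooth3x3(image):
    """Smoothing filter per Figure 3d: each interior output pixel is the
    average of its 8 surrounding input pixels (zero-weight center).
    Border pixels are copied through unmodified."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            s = sum(image[r + dr][c + dc]
                    for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                    if (dr, dc) != (0, 0))
            out[r][c] = s // 8          # divide by 8 == right-shift by 3
    return out
```

Note the division by 8 reduces to a right shift, illustrating the earlier point about choosing power-of-two coefficients. The unsharp-masking output of Figure 3e is then just `2 * pixel - smoothed`, i.e. the pixel minus its smoothed version, added back to the pixel.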

Dealing with Image Boundaries
What happens when a function like 2D convolution operates on pixels near an image's border regions? Properly filtering such pixels requires pixel information from "outside" these boundaries. There are a couple of remedies for this situation. The simplest is just to ignore these edge regions. That is, consider that a 5x5 convolution kernel needs 2 pixels to the left, top, bottom, and right of the current pixel in order to function properly. Therefore, why not just shave 2 pixels off of each edge of the image, so as to guarantee that the kernel will always act on real data? Of course, this isn't always an ideal approach, since it throws out real image data. Also, in cases where filters are strung together to create more complex pixel manipulations, this scheme will continually narrow the input image with every new filter stage that executes.

Other popular ways of handling the image boundary quandary are to duplicate rows and/or columns of pixels, or to wrap around from the left (top) edge back to the previous right (bottom) edge. While these might be easy to implement in practice, they create data that didn't exist before, and therefore corrupt filtering results to some extent.

Perhaps the most straightforward, and least damaging, method for dealing with image boundaries is to consider everything that lies outside of the actual image to be zero-valued, or black. Although this scheme, too, distorts filtering results, it is not as invasive as creating lines of potentially random non-zero-valued pixels.
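The zero-valued border approach amounts to padding the image before filtering. A minimal sketch in Python (function name hypothetical; a 2D DMA engine on an embedded processor could fill the border regions instead):

```python
def pad_zero(image, n):
    """Surround an image (lists of lists) with an n-pixel black (zero)
    border, so a (2n+1)x(2n+1) kernel has valid data available at every
    original pixel position."""
    w = len(image[0])
    blank = [0] * (w + 2 * n)
    padded = [blank[:] for _ in range(n)]           # top border rows
    for row in image:
        padded.append([0] * n + row + [0] * n)      # pad each side
    padded += [blank[:] for _ in range(n)]          # bottom border rows
    return padded
```

Filtering the padded image and keeping only the original region yields an output the same size as the input, with no accumulated shrinkage across cascaded filter stages.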

Chroma Resampling and Color Conversion
Ultimately, the data stream in our example needs to be converted to RGB space. We already discussed how to convert between 4:4:4 YCbCr and RGB spaces, via a 3x3 matrix multiplication. However, up to this point, our pixel values are still in 4:2:2 YCbCr space. Therefore, we need to resample the chroma values to achieve a 4:4:4 format. Then the transformation to RGB will be straightforward, as we've already seen.

Resampling from 4:2:2 to 4:4:4 involves interpolating Cb and Cr values for those Y samples that are missing one of these components. A clear-cut way to resample is to interpolate the missing chroma values from their nearest neighbors by simple averaging. That is, a missing Cb value at a pixel site would be replaced by the average of the nearest two Cb values. Higher-order filtering might be necessary for some applications, but this simplified approach is often sufficient. Another approach is to replicate the chrominance values of neighboring pixels for those values that are missing in the current pixel's representation.
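The nearest-neighbor averaging scheme can be sketched for one chroma row. This illustrative Python fragment (function name hypothetical; integer averaging assumed, and the last missing site falls back to replication since it has only one neighbor) doubles the horizontal chroma sample count, as required going from 4:2:2 to 4:4:4:

```python
def upsample_chroma_422(chroma):
    """Expand one 4:2:2 chroma row (one Cb or Cr sample per two luma
    samples) to 4:4:4 density: each missing site gets the average of
    its two nearest neighbors; the final site replicates its left
    neighbor, which has no right-hand partner."""
    out = []
    for i, c in enumerate(chroma):
        out.append(c)                                # co-sited sample
        if i + 1 < len(chroma):
            out.append((c + chroma[i + 1]) // 2)     # interpolated site
        else:
            out.append(c)                            # edge: replicate
    return out
```

The same routine is applied independently to the Cb and Cr rows; the Y samples are untouched.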

In general, conversions from 4:1:1 space to 4:2:2 or 4:4:4 formats involve only a one-dimensional filter (with tap values and quantities consistent with the level of filtering desired). However, resampling from 4:2:0 format into 4:2:2 or 4:4:4 format involves vertical sampling as well, necessitating a two-dimensional convolution kernel.

Because chroma resampling and YCbCr→RGB conversion are both linear operations, it is possible to combine the steps into a single mathematical operation, thus achieving 4:2:2 YCbCr→RGB conversion efficiently.

Scaling and Cropping
Video scaling allows the generation of an output stream whose resolution is different from that of the input format. Ideally, the fixed scaling requirements (input data resolution, output panel resolution) are known ahead of time, in order to avoid the computational load of arbitrary scaling between input and output streams.

Depending on the application, scaling can be done either upwards or downwards. It is important to understand the content of the image to be scaled (e.g., the presence of text and thin lines). Improper scaling can make text unreadable or cause some horizontal lines to disappear in the scaled image.

The easiest method to adjust an input frame size to an output frame that's smaller is simply to crop the image. For instance, if the input frame size is 720x480 pixels, and the output is a VGA frame (640x480 pixels), you can drop the first 40 and the last 40 pixels on each line. The advantage here is that cropping introduces none of the artifacts associated with dropping or duplicating pixels. Of course, the disadvantage is that you'd lose 80 pixels (about 11%) of each line's content. Sometimes this isn't too much of an issue, because the leftmost and rightmost extremities of the screen (as well as the top and bottom regions) are often obscured from view by the mechanical enclosure of the display.
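A center crop like the 720 → 640 example reduces to simple slicing. Illustrative Python sketch (function name hypothetical; on an embedded processor a 2D DMA transfer would typically do this with zero core cycles):

```python
def crop_center_columns(frame, out_width):
    """Center-crop each line of a frame to out_width pixels, dropping an
    equal number of pixels from the left and right edges (e.g. 720 ->
    640 drops 40 pixels on each side)."""
    in_width = len(frame[0])
    left = (in_width - out_width) // 2
    return [row[left:left + out_width] for row in frame]
```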

If cropping isn't an option, there are several ways to downsample (reduce pixel and/or line count) or upsample (increase pixel and/or line count) an image that allow tradeoffs between processing complexity and resultant image quality.

Increasing or decreasing pixels per row
One straightforward method of scaling involves either dropping pixels (for downsampling) or duplicating existing pixels (for upsampling). That is, when scaling down to a lower resolution, some number of pixels on each line (and/or some number of lines per frame) can be thrown away. While this certainly reduces processing load, the results will yield aliasing and visual artifacts.

A small step up in complexity uses linear interpolation to improve the image quality. For example, when scaling down an image, filtering in either the horizontal or vertical direction produces a new output pixel that replaces the pixels used in the interpolation process. As with the previous technique, information is still discarded, so artifacts and aliasing will again be present.

If image quality is paramount, there are other ways to perform scaling without reducing quality. These methods strive to maintain the high frequency content of the image consistent with the horizontal and vertical scaling, while reducing the effects of aliasing. For example, assume that an image is to be scaled by a factor of Y:X. To accomplish this scaling, the image could be upsampled ("interpolated") by a factor of Y, filtered to eliminate aliasing, and then downsampled ("decimated") by a factor of X. In practice, these two sampling processes can be combined into a single multirate filter.
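The upsample-filter-downsample chain can be sketched crudely for one row of pixels. In this illustrative Python fragment, the function name is hypothetical, sample replication stands in for interpolation, and a 3-tap moving average stands in for a properly designed anti-alias filter; a real system would use a polyphase multirate filter that computes only the samples it keeps:

```python
def resample_row(row, up, down):
    """Rational up:down resampling sketch: replicate each sample `up`
    times, smooth with a 3-tap moving average (crude anti-alias filter
    with edge clamping), then keep every `down`-th sample."""
    expanded = [v for v in row for _ in range(up)]
    n = len(expanded)
    smoothed = [
        (expanded[max(i - 1, 0)] + expanded[i] +
         expanded[min(i + 1, n - 1)]) // 3
        for i in range(n)
    ]
    return smoothed[::down]                # decimate by `down`
```

For a row of length L, the output has roughly L x up / down samples, which is exactly the Y:X scaling ratio described above.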

Increasing or Reducing lines per frame
The guidelines for increasing or reducing the number of pixels per row generally extend to modifying the number of lines per frame of an image. For example, throwing out every other line (or one entire interlaced field) provides a quick method of reducing vertical resolution. However, as we've mentioned above, some sort of vertical filtering is necessary whenever removing or duplicating lines, because these processes introduce artifacts into the image. The same filter strategies apply here: simple vertical averaging, higher-order FIR filters, or multirate filters to scale vertically to an exact ratio.

Display Processing
Alpha Blending
Often it is necessary to combine two image and/or video buffers prior to display. A practical example of this is overlaying of icons like signal strength and battery level indicators onto a cellular phone's graphics display. An example involving two video streams is picture-in-picture functionality.

When combining two streams, you need to decide which stream "wins" in places where content overlaps. This is where alpha blending comes in. It defines a variable alpha (α) that indicates a "transparency factor" between an overlay stream and a background stream as follows:

Output value = α × (foreground pixel value) + (1 − α) × (background pixel value)

As the equation shows, an α value of 0 results in a completely transparent overlay, whereas a value of 1 results in a completely opaque overlay that disregards the background image entirely.
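In fixed-point practice, alpha is usually carried as an 8-bit integer rather than a fraction. A minimal per-component sketch in Python (function name hypothetical; the +127 term is one common rounding choice for an approximate divide-by-255, and real implementations often use shift-based approximations instead of a true division):

```python
def alpha_blend(fg, bg, alpha):
    """Blend one 8-bit component: out = a*fg + (1-a)*bg, with alpha as
    an 8-bit value (0 = fully transparent overlay, 255 = fully opaque).
    The +127 rounds the divide-by-255 to the nearest integer."""
    return (alpha * fg + (255 - alpha) * bg + 127) // 255
```

The same computation is applied to every component (Y, Cb, Cr, or R, G, B) of every pixel where the two streams overlap, so keeping it to multiplies, adds, and shifts matters greatly for throughput.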

Alpha is sometimes sent as a separate channel along with the pixel-wise luma and chroma information. This results in the notation "4:2:2:4," where the last digit indicates an alpha key that accompanies each 4:2:2 pixel entity. Alpha is coded in the same way as the luma component, but often only a few discrete levels of transparency (perhaps 16) are needed for most applications. Sometimes a video overlay buffer is premultiplied by alpha or premapped via a lookup table, in which case it's referred to as a "shaped" video buffer.

The act of compositing involves positioning an overlay buffer inside a larger image buffer. Common examples are a "Picture-in-Picture" mode on a video display, and placement of graphics icons (like battery and signal strength indicators) over the background image or video. In general, the composition function can take several iterations before the output image is complete. In other words, there may be many "layers" of graphics and video that combine to generate a composite image.

Two-dimensional DMA capability is very useful for compositing functions, because it allows the positioning of arbitrarily-sized rectangular buffers inside a larger buffer. One thing to keep in mind is that any image cropping should take place after the composition process, because the positioned overlay might violate any previously cropped boundaries. Of course, an alternative is to ensure that the overlay won't violate the boundaries in the first place, but this is sometimes asking too much!

Chroma Keying
The term "chroma keying" refers to a process by which a particular color (usually blue or green) in one image is replaced by the content in a second image when the two are composited together. This provides a convenient way to combine two video images by purposefully tailoring parts of the first image to be replaced by the appropriate sections of the second image. Chroma keying can be performed in either software or hardware on a media processor.
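A software chroma key reduces to a per-pixel select between the two streams. This illustrative Python sketch (function name hypothetical) matches the key color exactly for simplicity; a practical keyer matches a tolerance region around the key color, and often blends softly at the key boundary rather than switching hard:

```python
def chroma_key(fg_frame, bg_frame, key_color):
    """Composite two equal-sized frames: wherever a foreground pixel
    equals the key color, substitute the corresponding background
    pixel; otherwise keep the foreground pixel."""
    return [
        [bg if fg == key_color else fg
         for fg, bg in zip(fg_row, bg_row)]
        for fg_row, bg_row in zip(fg_frame, bg_frame)
    ]
```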

Output Formatting
Most color LCD displays targeted for consumer applications (TFT-LCDs) have a digital RGB interface. Each pixel in the display actually has 3 subpixels—one each with Red, Green and Blue filters—that the human eye resolves as a single color pixel. For example, a 320x240 pixel display actually has 960x240 pixel components, accounting for the R, G, and B subpixels. Each subpixel has 8 bits of intensity, thus forming the basis of the common 24-bit color LCD display.

The three most common configurations use either 8 bits per channel for RGB (RGB888 format), 6 bits per channel (RGB666 format), or 5 bits per channel for R and B, and 6 bits for G (RGB565 format).

RGB888 provides the greatest color clarity of the three. With a total of 24 bits of resolution, this format provides over 16 million shades of color. It offers the high resolution and precision needed in high performance applications like LCD TVs.

The RGB666 format is popular in portable electronics. Providing over 262,000 shades of color, this format has a total of 18 bits of resolution. However, because the 18-pin (6+6+6) data bus doesn't conform nicely to 16-bit processor data paths, a popular industry compromise is to use 5 bits each of R and B, and 6 bits of G (5+5+6 = a 16-bit data bus) to connect to a RGB666 panel. This scenario works well because green is the most visually important color of the three. The least-significant bits of both Red and Blue are tied at the panel to their respective most-significant bits. This ensures a full dynamic range for each color channel (full intensity down to total black).
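The RGB565 packing and the MSB-to-LSB wiring trick are easy to show concretely. Illustrative Python sketch (function names hypothetical; in hardware the "unpack" step is just panel wiring, not computation):

```python
def pack_rgb565(r, g, b):
    """Pack 8-bit R, G, B components into a 16-bit RGB565 word:
    5 bits R, 6 bits G, 5 bits B, truncating the low bits."""
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)

def unpack_to_666(pixel):
    """Recover 6-bit channels from an RGB565 word the way the panel
    wiring described above does: each 5-bit channel's MSB also drives
    its missing LSB, so full scale maps to full scale and zero to zero."""
    r5 = (pixel >> 11) & 0x1F
    g6 = (pixel >> 5) & 0x3F
    b5 = pixel & 0x1F
    r6 = (r5 << 1) | (r5 >> 4)      # replicate MSB into LSB
    b6 = (b5 << 1) | (b5 >> 4)
    return r6, g6, b6
```

Without the MSB replication, a maximum 5-bit value (31) would map to only 62 of 63 on the 6-bit channel, slightly compressing the bright end of the range.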

We hope that this article series has given you a good understanding of the basics involved in embedded video processing. For a more in-depth discussion on media processing issues, including data flow and media framework analyses, you may wish to refer to "Embedded Media Processing."

This series is adapted from the book "Embedded Media Processing" (Newnes 2005) by David Katz and Rick Gentile. See the book's web site for more information.
