Intelligent Automated Inspection Laboratory and Robotic Embedded Systems Lab: Fundamentals of embedded video, part 2

Color Spaces
There are many different ways of representing color, and each color system is suited for different purposes. The most fundamental representation is RGB color space.

RGB stands for "Red-Green-Blue," and it is a color system commonly employed in camera sensors and computer graphics displays. Because red, green, and blue are the three primary colors that sum to form white light, they can be combined in varying proportions to create almost any color in the visible spectrum. RGB is the basis for all other color spaces, and it is the overwhelming choice of color space for computer graphics.

Gamma Correction
"Gamma" is a crucial phenomenon to understand when dealing with color spaces. This term describes the nonlinear nature of luminance perception and display. Note that this is a twofold manifestation: the human eye perceives brightness in a nonlinear manner, and physical output devices (such as CRTs and LCDs) display brightness nonlinearly. It turns out, by way of coincidence, that human perception of luminance sensitivity is almost exactly the inverse of a CRT's output characteristics.

Stated another way, luminance on a display is roughly proportional to the input analog signal voltage raised to the power of gamma. On a CRT or LCD, this value is ordinarily between 2.2 and 2.5. A camera's precompensation, then, raises the RGB values to the power of (1/gamma).
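As a rough illustration of the power-law relationship just described, the Python sketch below models a display and the matching camera precompensation. The function names, the gamma value of 2.2, and the normalization of signals to the range 0.0 to 1.0 are assumptions made for this example; real transfer curves are usually piecewise rather than a pure power law.

```python
GAMMA = 2.2  # assumed display gamma, within the 2.2 to 2.5 range cited above


def gamma_encode(linear, gamma=GAMMA):
    """Camera-side precompensation: raise a normalized value to 1/gamma."""
    return linear ** (1.0 / gamma)


def display_response(signal, gamma=GAMMA):
    """Display-side model: luminance is the input signal raised to gamma."""
    return signal ** gamma


if __name__ == "__main__":
    for scene in (0.0, 0.25, 0.5, 1.0):
        signal = gamma_encode(scene)
        shown = display_response(signal)
        print(f"scene {scene:.2f} -> signal {signal:.3f} -> display {shown:.3f}")
```

Because the two exponents cancel, the displayed luminance tracks the original scene value, which is the linearization that Figure 1 depicts.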

The upshot of this effect is that video cameras and computer graphics routines, through a process called "gamma correction," prewarp their RGB output stream both to compensate for the target display's nonlinearity and to create a realistic model of how the eye actually views the scene. Figure 1 illustrates this process.

Gamma-corrected RGB coordinates are referred to as R'G'B' space, and the luma value Y' is derived from these coordinates. Strictly speaking, the term "luma" should only refer to this gamma-corrected luminance value, whereas the true "luminance" Y is a color science term formed from a weighted sum of R, G, and B (with no gamma correction applied).

Often when we talk about YCbCr and RGB color spaces in this series, we are referring to gamma-corrected components – in other words, Y'CbCr or R'G'B'. However, because this notation can be distracting and doesn't affect the substance of our discussion, and since it's clear that gamma correction needs to take place at sensor and/or display interfaces to a processor, we will confine ourselves to the YCbCr/RGB nomenclature even in cases where gamma adjustment has been applied. The exception to this convention is when we discuss actual color space conversion equations.

Figure 1: Gamma correction linearizes the intensity produced for a given input amplitude.

While RGB channel format is a natural scheme for representing real-world color, each of the three channels is highly correlated with the other two. You can see this by independently viewing the R, G, and B channels of a given image – you'll be able to perceive the entire image in each channel. Also, RGB is not a preferred choice for image processing, because a change made to one channel must usually be made to the other two as well, and each channel requires the same bandwidth.

To reduce required transmission bandwidths and increase video compression ratios, other color spaces were devised whose components are far less correlated, thus providing better compression characteristics than RGB does. The most popular ones (YPbPr, YCbCr, and YUV) all separate a luminance component from two chrominance components. This separation is performed via scaled color difference factors (B'-Y') and (R'-Y'). The Pb/Cb/U term corresponds to the (B'-Y') factor, and the Pr/Cr/V term corresponds to the (R'-Y') factor. YPbPr is used in component analog video, YUV applies to composite NTSC and PAL systems, and YCbCr relates to component digital video.

Separating luminance and chrominance information saves image processing bandwidth. Also, as we'll see shortly, we can reduce chrominance bandwidth considerably via subsampling, without much loss in visual perception. This is a welcome feature for video-intensive systems.

As an example of how to convert between color spaces, the following equations illustrate translation between 8-bit representations of Y'CbCr and R'G'B' color spaces, where Y', R', G' and B' normally range from 16 to 235, and Cr and Cb range from 16 to 240.

Y' = (0.299)R' + (0.587)G' + (0.114)B'

Cb = -(0.168)R' - (0.330)G' + (0.498)B' + 128

Cr = (0.498)R' - (0.417)G' - (0.081)B' + 128

R' = Y' + 1.397(Cr - 128)

G' = Y' - 0.711(Cr - 128) - 0.343(Cb - 128)

B' = Y' + 1.765(Cb - 128)
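The six equations above translate directly into code. The following Python sketch uses exactly the coefficients listed; the function names are assumptions for this example, and the results are left as floating-point values, whereas a real pipeline would round and clamp them to the legal 8-bit range.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert gamma-corrected 8-bit R'G'B' values to Y'CbCr."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168 * r - 0.330 * g + 0.498 * b + 128
    cr = 0.498 * r - 0.417 * g - 0.081 * b + 128
    return y, cb, cr


def ycbcr_to_rgb(y, cb, cr):
    """Convert Y'CbCr values back to gamma-corrected R'G'B'."""
    r = y + 1.397 * (cr - 128)
    g = y - 0.711 * (cr - 128) - 0.343 * (cb - 128)
    b = y + 1.765 * (cb - 128)
    return r, g, b


if __name__ == "__main__":
    original = (200, 100, 50)
    encoded = rgb_to_ycbcr(*original)
    decoded = ycbcr_to_rgb(*encoded)
    print("Y'CbCr:", tuple(round(v, 2) for v in encoded))
    print("R'G'B':", tuple(round(v, 2) for v in decoded))
```

Because the coefficients are rounded to three decimal places, a round trip reproduces the original values only to within about one code level. A neutral gray input maps to Cb = Cr = 128, the "no color" point.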

Chroma Subsampling
With many more rods than cones, the human eye is more attuned to brightness and less to color differences. As luck (or really, design) would have it, the YCbCr color system allows us to pay more attention to Y, and less to Cb and Cr. As a result, by subsampling these chroma values, video standards and compression algorithms can achieve large savings in video bandwidth.

Before discussing this further, let's get some nomenclature straight. Before subsampling, let's assume we have a full-bandwidth YCbCr stream. That is, a video source generates a stream of pixel components in the form of Figure 2a. This is called "4:4:4 YCbCr." This notation looks rather odd, but the simple explanation is this: the first number is always '4', corresponding historically to the ratio between the luma sampling frequency and the NTSC color subcarrier frequency. The second number corresponds to the ratio between luma and chroma within a given line (horizontally): if there's no downsampling of chroma with respect to luma, this number is '4.' The third number, if it's the same as the second digit, implies no vertical subsampling of chroma. On the other hand, if it's a 0, there is a 2:1 chroma subsampling between lines. Therefore, 4:4:4 implies that each pixel on every line has its own unique Y, Cr and Cb components.

Now, if we filter a 4:4:4 YCbCr signal by subsampling the chroma by a factor of 2 horizontally, we end up with 4:2:2 YCbCr. '4:2:2' implies that there are 4 luma values for every 2 chroma values on a given video line. Each (Y,Cb) or (Y,Cr) pair represents one pixel value. Another way to say this is that a chroma pair coincides spatially with every other luma value, as shown in Figure 2b. Believe it or not, 4:2:2 YCbCr qualitatively shows little loss in image quality compared with its 4:4:4 YCbCr source, even though it represents a savings of 33% in bandwidth over 4:4:4 YCbCr. As we'll discuss soon, 4:2:2 YCbCr is a foundation for the ITU-R BT.601 video recommendation, and it is the most common format for transferring digital video between subsystem components.
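To make the 4:2:2 layout concrete, here is a minimal Python sketch that turns one line of full-bandwidth pixels into a 4:2:2 sample stream. The function name and the Cb, Y, Cr, Y word order are assumptions for this example, and the chroma of every second pixel is simply discarded; a production design would normally low-pass filter the chroma before decimating.

```python
def subsample_422(line):
    """Convert a line of (Y, Cb, Cr) pixels to a Cb, Y, Cr, Y sample list."""
    if len(line) % 2:
        raise ValueError("line must contain an even number of pixels")
    out = []
    for i in range(0, len(line), 2):
        y0, cb, cr = line[i]    # chroma is taken from the even pixel
        y1, _, _ = line[i + 1]  # chroma of the odd pixel is dropped
        out.extend((cb, y0, cr, y1))
    return out


if __name__ == "__main__":
    line_444 = [(16, 128, 128), (50, 90, 200), (100, 60, 70), (235, 240, 16)]
    line_422 = subsample_422(line_444)
    print(line_422)
    print("samples before:", 3 * len(line_444), "after:", len(line_422))
```

Four pixels shrink from 12 samples to 8, which is the 33% saving quoted above.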

Figure 2: a) 4:4:4 vs. b) 4:2:2 YCbCr pixel sampling.

Note that 4:2:2 is not the only chroma subsampling scheme. Figure 3 shows others in popular use. For instance, we could subsample the chroma of a 4:4:4 YCbCr stream by a factor of 4 horizontally, as shown in Figure 3c, to end up with a 4:1:1 YCbCr stream. Here, the chroma pairs are spatially coincident with every fourth luma value. This chroma filtering scheme results in a 50% bandwidth savings. 4:1:1 YCbCr is a popular format for inputs to video compression algorithms and outputs from video decompression algorithms.
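The bandwidth savings quoted for these schemes can be checked with simple arithmetic. The sketch below counts samples in a block four pixels wide and two lines high, reading the three digits as the text explains them; the function names are assumptions for this example.

```python
def samples_per_pixel(j, a, b):
    """Average samples per pixel for a J:a:b scheme over a J-wide, 2-line block.

    a is the number of chroma positions on the first line and b the number
    of additional chroma positions on the second line (0 means the second
    line reuses the chroma of the first).
    """
    luma = 2 * j            # J luma samples on each of the two lines
    chroma = 2 * (a + b)    # one Cb and one Cr sample at every chroma position
    return (luma + chroma) / (2 * j)


def savings_vs_444(j, a, b):
    """Fraction of bandwidth saved relative to a 4:4:4 stream."""
    return 1.0 - samples_per_pixel(j, a, b) / samples_per_pixel(4, 4, 4)


if __name__ == "__main__":
    for scheme in ((4, 4, 4), (4, 2, 2), (4, 1, 1)):
        name = ":".join(str(n) for n in scheme)
        print(f"{name}: {samples_per_pixel(*scheme):.2f} samples/pixel, "
              f"{savings_vs_444(*scheme):.0%} saved")
```

The same function reports 1.5 samples per pixel for a scheme whose third digit is 0, since halving chroma in both directions costs the same as quartering it horizontally.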

Another format popular in video compression/decompression is 4:2:0 YCbCr, and it's more complex than the others we've described for a couple of reasons. For one, the Cb and Cr components are each subsampled by 2 horizontally and vertically. This means we have to store multiple video lines in order to generate this subsampled stream. What's more, there are two popular formats for 4:2:0 YCbCr. MPEG-2 compression uses a horizontally co-located scheme (Figure 3d, top), whereas MPEG-1 and JPEG algorithms use a form where the chroma are centered between Y samples (Figure 3d, bottom).
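The need to hold more than one video line becomes obvious in code. The Python sketch below reduces one chroma plane (Cb or Cr) by 2 in each direction by averaging every 2x2 block, which roughly corresponds to the centered chroma siting described for MPEG-1 and JPEG. The function name and the simple rounding average are assumptions for this example, not a filter mandated by any standard.

```python
def subsample_420(plane):
    """Average each 2x2 block of a chroma plane (a list of equal-length rows)."""
    height, width = len(plane), len(plane[0])
    if height % 2 or width % 2:
        raise ValueError("plane dimensions must be even")
    out = []
    for row in range(0, height, 2):
        # Two adjacent lines must be available at once.
        top, bottom = plane[row], plane[row + 1]
        out.append([
            (top[col] + top[col + 1] + bottom[col] + bottom[col + 1] + 2) // 4
            for col in range(0, width, 2)
        ])
    return out


if __name__ == "__main__":
    cb_plane = [
        [100, 102, 50, 50],
        [104, 106, 50, 50],
    ]
    print(subsample_420(cb_plane))
```

The luma plane is left untouched; only Cb and Cr pass through this step.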

Figure 3: (a) YCbCr 4:4:4 stream and its chroma-subsampled derivatives (b) 4:2:2 (c) 4:1:1 (d) 4:2:0.

Digital Video
Before the mid-1990s, nearly all video was in analog form. Only then did forces like the advent of MPEG-2 compression, proliferation of streaming media on the Internet, and the FCC's adoption of a Digital Television (DTV) Standard create a "perfect storm" that brought the benefits of digital representation into the video world. These advantages over analog include better signal-to-noise performance, improved bandwidth utilization (fitting several digital video channels into each existing analog channel), and reduction in storage space through digital compression techniques.

At its root, digitizing video involves both sampling and quantizing the analog video signal. In the 2D context of a video frame, sampling entails dividing the image space, gridlike, into small regions and assigning relative amplitude values based on the intensities of color space components in each region. Note that analog video is already sampled vertically (discrete number of rows) and temporally (discrete number of frames per second).

Quantization is the process that determines these discrete amplitude values assigned during the sampling process. 8-bit video is common in consumer applications, where a value of 0 is darkest (total black) and 255 is brightest (white), for each color channel (R,G,B or YCbCr). However, it should be noted that 10-bit and 12-bit quantization per color channel is rapidly entering mainstream video products, allowing extra precision that can be useful in reducing received image noise by avoiding roundoff error.
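The effect of word length on precision can be sketched in a few lines of Python. This example assumes the analog amplitude has been normalized to the range 0.0 to 1.0 and uses full-range codes (0 to 255 for 8 bits); the function names are assumptions for illustration.

```python
def quantize(value, bits):
    """Map a normalized amplitude (0.0 to 1.0) to the nearest integer code."""
    levels = (1 << bits) - 1
    clipped = min(1.0, max(0.0, value))
    return round(clipped * levels)


def dequantize(code, bits):
    """Map an integer code back to a normalized amplitude."""
    return code / ((1 << bits) - 1)


if __name__ == "__main__":
    sample = 0.37
    for bits in (8, 10, 12):
        code = quantize(sample, bits)
        error = abs(dequantize(code, bits) - sample)
        print(f"{bits}-bit code {code}, roundoff error {error:.6f}")
```

Each extra bit halves the step size, so the worst-case roundoff error for 10-bit video is roughly a quarter of that for 8-bit video.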

The advent of digital video provided an excellent opportunity to standardize, to a large degree, the interfaces to NTSC and PAL systems. When the ITU (International Telecommunication Union) met to define recommendations for digital video standards, it focused on achieving a large degree of commonality between NTSC and PAL formats, such that the two could share the same coding formats.

It defined two separate recommendations: ITU-R BT.601 and ITU-R BT.656. Together, these two define a structure that enables different digital video system components to interoperate. Whereas BT.601 defines the parameters for digital video transfer, BT.656 defines the interface itself.

ITU-R BT.601 (formerly CCIR-601)
BT.601 specifies methods for digitally coding video signals, using the YCbCr color space for better use of channel bandwidth. It proposes 4:2:2 YCbCr as a preferred format for broadcast video. Synchronization signals (HSYNC, VSYNC, FIELD) and a clock are also provided to delineate the boundaries of active video regions. Figure 4 shows typical timing relationships between sync signals, clock and data.

Figure 4: Common Digital Video Format Timing.

Each BT.601 pixel component (Y, Cr, or Cb) is quantized to either 8 or 10 bits, and both NTSC and PAL have 720 pixels of active video per line. However, they differ in their vertical resolution. While 30 frames/sec NTSC has 525 lines (including vertical blanking, or retrace, regions), PAL offsets its lower rate of 25 frames/sec with 100 extra lines per frame, for a total of 625.

BT.601 specifies Y with a nominal range from 16 (total black) to 235 (total white). The color components Cb and Cr span from 16 to 240, but a value of 128 corresponds to no color. Sometimes, due to noise or rounding errors, a value might dip outside the nominal boundaries, but never all the way to 0 or 255.

ITU-R BT.656 (formerly CCIR-656)
Whereas BT.601 outlines how to digitally encode video, BT.656 actually defines the physical interfaces and data streams necessary to implement BT.601. It defines both bit-parallel and bit-serial modes. The bit-parallel mode requires only a 27 MHz clock (for NTSC 30 frames/sec) and 8 or 10 data lines (depending on the pixel resolution). All synchronization signals are embedded in the data stream, so no extra hardware lines are required.

The bit-serial mode requires only a multiplexed 10 bit/pixel serial data stream over a single channel, but it involves complex synchronization, spectral shaping and clock recovery conditioning. Furthermore, the bit clock runs at 270 MHz (10 bits for every 27 MHz word), so it can be challenging to implement bit-serial BT.656 in many systems. For our purposes, we'll focus our attention on the bit-parallel mode only.
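The clock rates of both modes follow from the frame structure. The Python sketch below derives them, assuming 858 total samples per line for 525-line systems and 864 for 625-line systems (both including horizontal blanking) and an NTSC frame rate of 30000/1001 frames/sec. These figures come from BT.601 rather than from the text above, and the function names are assumptions for this example.

```python
from fractions import Fraction

# (total luma samples per line, total lines per frame, frames per second)
# The per-line totals include horizontal blanking; they are assumptions
# taken from BT.601 and are not stated in the article text.
SYSTEMS = {
    "525/60 (NTSC)": (858, 525, Fraction(30000, 1001)),
    "625/50 (PAL)": (864, 625, Fraction(25)),
}


def luma_sample_rate(samples_per_line, lines, frame_rate):
    """Luma samples per second, counting blanking as well as active video."""
    return samples_per_line * lines * frame_rate


def parallel_word_rate(samples_per_line, lines, frame_rate):
    """4:2:2 sends one chroma word alongside every luma word."""
    return 2 * luma_sample_rate(samples_per_line, lines, frame_rate)


def serial_bit_rate(word_rate, bits_per_word=10):
    """Bit-serial mode shifts out every word one bit at a time."""
    return word_rate * bits_per_word


if __name__ == "__main__":
    for name, params in SYSTEMS.items():
        words = parallel_word_rate(*params)
        print(f"{name}: {float(words) / 1e6:.1f} MHz parallel, "
              f"{float(serial_bit_rate(words)) / 1e6:.0f} Mbit/s serial")
```

Both systems arrive at the same 13.5 MHz luma sampling rate, which is why a single 27 MHz parallel clock serves NTSC and PAL alike.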

The frame partitioning and data stream characteristics of ITU-R BT.656 are shown in Figures 5 and 6, respectively, for 525/60 (NTSC) and 625/50 (PAL) systems.

Figure 5: ITU-R BT.656 Frame Partitioning.

Figure 6: ITU-R BT.656 Data Stream.

In BT.656, the Horizontal (H), Vertical (V), and Field (F) signals are sent as an embedded part of the video data stream in a series of bytes that form a control word. The Start of Active Video (SAV) and End of Active Video (EAV) signals indicate the beginning and end of data elements to read in on each line. SAV occurs on a 1-to-0 transition of H, and EAV occurs on a 0-to-1 transition of H. An entire field of video consists of Active Video, Horizontal Blanking (the space between an EAV and SAV code), and Vertical Blanking (the space where V = 1).

A field of video commences on a transition of the F bit. The "odd field" is denoted by a value of F = 0, whereas F = 1 denotes an even field. Progressive video makes no distinction between Field 1 and Field 2, whereas interlaced video requires each field to be handled uniquely, because alternate rows of each field combine to create the actual video image.

The SAV and EAV codes are shown in more detail in Figure 7. Note there is a defined preamble of three bytes (0xFF, 0x00, 0x00 for 8-bit video, or 0x3FF, 0x000, 0x000 for 10-bit video), followed by the XY Status word, which, aside from the F (Field), V (Vertical Blanking) and H (Horizontal Blanking) bits, contains four protection bits for single-bit error detection and correction. Note that F and V are only allowed to change as part of EAV sequences (that is, transitions from H = 0 to H = 1). Also, notice that for 10-bit video, the two additional bits are actually the least-significant bits, not the most-significant bits.

Figure 7: SAV/EAV Preamble codes.

The bit definitions are as follows:

F = 0 for Field 1
F = 1 for Field 2
V = 1 during Vertical Blanking
V = 0 when not in Vertical Blanking
H = 0 at SAV
H = 1 at EAV
P3 = V XOR H
P2 = F XOR H
P1 = F XOR V
P0 = F XOR V XOR H
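The bit definitions above are enough to build the XY status word in software. The Python sketch below assumes the 8-bit layout in which the most significant bit is fixed at 1 and is followed by F, V, H, P3, P2, P1 and P0 in that order; the layout is taken from the standard rather than from the list above, and the function names are assumptions for this example.

```python
def xy_status(f, v, h):
    """Build the 8-bit XY status word from the F, V and H flags (each 0 or 1)."""
    p3 = v ^ h
    p2 = f ^ h
    p1 = f ^ v
    p0 = f ^ v ^ h
    return ((1 << 7) | (f << 6) | (v << 5) | (h << 4)
            | (p3 << 3) | (p2 << 2) | (p1 << 1) | p0)


def timing_code(f, v, h):
    """Return the full 4-byte SAV (h=0) or EAV (h=1) sequence for 8-bit video."""
    return bytes((0xFF, 0x00, 0x00, xy_status(f, v, h)))


if __name__ == "__main__":
    for f in (0, 1):
        for v in (0, 1):
            for h in (0, 1):
                kind = "EAV" if h else "SAV"
                print(f"F={f} V={v} {kind}: 0x{xy_status(f, v, h):02X}")
```

Only eight XY values are legal, so a receiver can validate an incoming status word with a small lookup table.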

The vertical blanking interval (the time during which V=1) can be used to send non-video information, like audio, teletext, closed-captioning, or even data for interactive television applications. BT.656 accommodates this functionality through the use of ancillary data packets. Instead of the "0xFF, 0x00, 0x00" preamble that normally precedes control codes, the ancillary data packets all begin with a "0x00, 0xFF, 0xFF" preamble.

Assuming that ancillary data is not being sent, during horizontal and vertical blanking intervals the (Cb, Y, Cr, Y, Cb, Y, ...) stream is (0x80, 0x10, 0x80, 0x10, 0x80, 0x10, ...). Also, note that because the values 0x00 and 0xFF have special meaning as control preamble demarcators, they are not allowed as part of the active video stream. In 10-bit systems, the values 0x000 through 0x003 and 0x3FC through 0x3FF are also reserved, so as not to cause problems in 8-bit implementations.
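Because 0x00 and 0xFF never appear in active video, a receiver can locate timing codes by searching for the preamble alone. The Python sketch below scans an 8-bit stream for the 0xFF, 0x00, 0x00 sequence and decodes F, V and H from the status word that follows. The function names and the tiny test stream are assumptions for this example, and the protection bits are not checked.

```python
PREAMBLE = b"\xff\x00\x00"


def blanking_fill(num_pixels):
    """Return the Cb/Cr = 0x80, Y = 0x10 fill used when no ancillary data is sent."""
    return bytes((0x80, 0x10)) * num_pixels


def find_timing_codes(stream):
    """Yield (offset, F, V, H) for each preamble plus status word in the stream."""
    pos = stream.find(PREAMBLE)
    while pos != -1 and pos + 3 < len(stream):
        xy = stream[pos + 3]
        yield pos, (xy >> 6) & 1, (xy >> 5) & 1, (xy >> 4) & 1
        pos = stream.find(PREAMBLE, pos + 4)


if __name__ == "__main__":
    eav = b"\xff\x00\x00\x9d"                       # F=0, V=0, H=1
    sav = b"\xff\x00\x00\x80"                       # F=0, V=0, H=0
    active = bytes((0x5A, 0x51, 0xF0, 0x51))        # two arbitrary pixels
    line = eav + blanking_fill(4) + sav + active
    for offset, f, v, h in find_timing_codes(line):
        print(f"offset {offset}: F={f} V={v} {'EAV' if h else 'SAV'}")
```

A fuller receiver would also verify P3 through P0 and check for the 0x00, 0xFF, 0xFF ancillary preamble during blanking.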

So that's a wrap on our discussion of digital video concepts. In part 3 we turn our focus to a systems view of video, covering how video streams enter and exit embedded systems.

This series is adapted from the book "Embedded Media Processing" (Newnes 2005) by David Katz and Rick Gentile. See the book's web site for more information.

Friday, November 16, 2007
