Friday, November 16, 2007

Fundamentals of embedded video, part 1

Part 1 of this 5-part series explains how video signals are tailored to the human vision system, and reviews the basics of NTSC and PAL video signals.
DSP DesignLine
[Part 2 looks at the basics of digital video. For the accompanying intro to audio, see Fundamentals of embedded audio.]

As consumers, we're intimately familiar with video systems in many embodiments. However, from the embedded developer's viewpoint, video represents a tangled web of different resolutions, formats, standards, sources and displays.

In this series, we will strive to untangle some of this intricate web, focusing on the most common circumstances you're likely to face in today's media processing systems. After reviewing the basics of video, we will discuss some common scenarios you may encounter in embedded video design and provide some tips and tricks for dealing with challenging video design issues.

Human Visual Perception
Let's start by discussing a little physiology. As we'll see, understanding how our eyes work has paved an important path in the evolution of video and imaging.

Our eyes contain two types of vision cells: rods and cones. Rods are primarily sensitive to light intensity rather than color, and they give us night vision capability. Cones, on the other hand, are not tuned to intensity, but instead are sensitive to wavelengths of light between 400 nm (violet) and 770 nm (red). Thus, the cones provide the foundation for our color perception.

There are three types of cones, each with a different pigment that's most sensitive to either red, green or blue energy, although there's a lot of overlap between the three responses. Taken together, the response of our cones peaks in the green region, at around 555 nm. This is why, as we'll see, we can make compromises in LCD displays by assigning the Green channel more bits of resolution than the Red or Blue channels.
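To make the "more bits for Green" idea concrete, here is a minimal sketch that assumes the widely used 16-bit RGB565 layout (5 bits Red, 6 bits Green, 5 bits Blue). The function names are ours, chosen for illustration, and a given LCD controller may expect a different bit order.

```python
def pack_rgb565(r, g, b):
    """Pack 8-bit R, G, B values (0..255) into one 16-bit RGB565 word."""
    # Keep the 5 most significant bits of red and blue, but 6 of green,
    # since the eye is most sensitive in the green region.
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)


def unpack_rgb565(word):
    """Expand a 16-bit RGB565 word back to approximate 8-bit R, G, B."""
    r5 = (word >> 11) & 0x1F
    g6 = (word >> 5) & 0x3F
    b5 = word & 0x1F
    # Replicate the top bits into the vacated low bits so that the
    # maximum code maps back to 255 rather than 248 or 252.
    r = (r5 << 3) | (r5 >> 2)
    g = (g6 << 2) | (g6 >> 4)
    b = (b5 << 3) | (b5 >> 2)
    return r, g, b


if __name__ == "__main__":
    word = pack_rgb565(200, 100, 50)
    print(hex(word), unpack_rgb565(word))
```

Green survives the round trip with a worst-case error of 3 codes out of 255, versus 7 for Red and Blue, which is the compromise the paragraph above describes.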

The discovery of the Red, Green and Blue cones ties into the development of the trichromatic color theory, which states that almost any color of light can be conveyed by combining proportions of monochromatic Red, Green and Blue wavelengths.

Because our eyes have a lot more rods than cones, they are more sensitive to intensity than color. This allows us to save bandwidth in video and image representations by subsampling the color information.
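As a rough illustration of color subsampling, the sketch below averages each 2x2 block of a single color plane while the intensity plane would be left at full resolution. The function name and the list-of-rows representation are assumptions made for this example; real systems apply the idea to chroma planes in formats covered later in the series.

```python
def subsample_chroma_2x2(plane):
    """Halve a color plane in both directions by averaging 2x2 blocks.

    `plane` is a list of equal-length rows of integer samples, with an
    even number of rows and columns.
    """
    out = []
    for y in range(0, len(plane), 2):
        row = []
        for x in range(0, len(plane[0]), 2):
            total = (plane[y][x] + plane[y][x + 1]
                     + plane[y + 1][x] + plane[y + 1][x + 1])
            row.append((total + 2) // 4)  # rounded average of four samples
        out.append(row)
    return out


if __name__ == "__main__":
    chroma = [[100, 102, 50, 50],
              [104, 106, 50, 50]]
    print(subsample_chroma_2x2(chroma))
```

With two color planes reduced this way and intensity untouched, a picture needs 1.5 samples per pixel on average instead of 3, halving the raw data.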

Our perception of brightness is logarithmic, not linear. In other words, the actual intensity required to produce a 50% gray image (exactly between total black and total white) is only around 18% of the intensity we need to produce total white. This characteristic is extremely important in camera sensor and display technology, as we'll see in our discussion of gamma correction. Also, this effect leads to a reduced sensitivity to quantization distortion at high intensities, a trait that many media encoding algorithms use to their advantage.
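The 18% figure can be checked numerically. The sketch below uses the CIE 1976 lightness formula (L*), which is our choice of perceptual model for this example; it is a cube-root curve rather than a true logarithm, but it captures the same compressive behavior.

```python
def cie_lightness(y):
    """CIE 1976 lightness L* (0..100) from relative luminance y (0..1)."""
    epsilon = 216 / 24389  # (6/29)**3, below which the curve is linear
    kappa = 24389 / 27     # (29/3)**3, slope of the linear segment
    if y > epsilon:
        return 116 * y ** (1 / 3) - 16
    return kappa * y


if __name__ == "__main__":
    for y in (0.0, 0.18, 0.5, 1.0):
        print(f"relative luminance {y:.2f} -> L* {cie_lightness(y):.1f}")
```

Under this model, a relative luminance of 0.18 lands at an L* of roughly 50, the perceptual midpoint, while a physical intensity of 50% already looks about three-quarters of the way to white.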

Another visual novelty is that our eyes continually adjust to the viewing environment, creating their own reference for white, even in low-lighting or artificial-lighting situations. Because camera sensors don't innately act the same way, this gives rise to a white balance control in which the camera picks its reference point for absolute white.

Perhaps most important for image and video codecs, the eye is less sensitive to high-frequency information than low-frequency information. What's more, although it can detect fine details and color resolution in still images, it cannot do so for rapidly moving images. As a result, transform coding (DCT, FFT, etc.) and low-pass filtering can be used to reduce the total bandwidth needed to represent an image or video sequence.

Our eyes can notice a "flicker" effect at image update rates less than 50-60 times per second, or 50-60 Hz, in bright light. Under dim lighting conditions, this rate drops to about 24 Hz. Additionally, we tend to notice flicker in large uniform regions more so than in localized areas. These traits have important implications for interlaced video, refresh rates and display technologies.

What's a video signal?
At its root, a video signal is basically just a two-dimensional array of intensity and color data that is updated at a regular frame rate, conveying the perception of motion. On conventional cathode-ray tube (CRT) TVs and monitors, an electron beam modulated by the analog video signal shown in Figure 1 illuminates phosphors on the screen in a top-bottom, left-right fashion. Synchronization signals embedded in the analog signal define when the beam is actively "painting" phosphors and when it is inactive, so that the electron beam can retrace from right to left to start on the next row, or from bottom to top to begin the next video field or frame. These synchronization signals are represented in Figure 2.

Figure 1. Composition of Luma signal.

HSYNC is the horizontal synchronization signal. It demarcates the start of active video on each row (left to right) of a video frame. Horizontal Blanking is the interval in which the electron gun retraces from the right side of the screen back over to the next row on the left side.

VSYNC is the vertical synchronization signal. It defines the start (top to bottom) of a new video image. Vertical Blanking is the interval in which the electron gun retraces from the bottom right corner of the screen image back up to the top left corner.

FIELD distinguishes, for interlaced video, which field is currently being displayed. This signal is not applicable for progressive-scan video systems.

Figure 2. Typical timing relationships between HSYNC, VSYNC, FIELD.

The transmission of video information originated as a display of relative luminance from black to white – thus was born the black-and-white television system. The voltage level at a given point in space correlates to the brightness level of the image at that point.

When color TV became available, it had to be backward-compatible with B/W systems, so the color burst information was added on top of the existing luminance signal, as shown in Figure 3. Color information is also called chrominance. We'll talk more about it in our discussion on color spaces (in part 2 of this series).

Figure 3. Analog video signal with color burst.

Broadcast TV – NTSC and PAL
Analog video standards differ in the ways they encode brightness and color information. Two standards dominate the broadcast television realm – NTSC and PAL. NTSC, devised by the National Television System Committee, is prevalent in Asia and North America, whereas PAL ("Phase Alternation Line") dominates Europe and South America. PAL developed as an offshoot of NTSC, improving on its color distortion performance. A third standard, SECAM, is popular in France and parts of eastern Europe, but many of these areas use PAL as well. Our discussions will center on NTSC systems, but the results relate also to PAL-based systems.

Video Resolution
Horizontal resolution indicates the number of pixels on each line of the image, and vertical resolution designates how many horizontal lines are displayed on the screen to create the entire frame. Standard definition (SD) NTSC systems are interlaced-scan, with 480 lines of active pixels, each with 720 active pixels per line (i.e., 720x480 pixels). Frames refresh at a rate of roughly 30 frames/second (actually 29.97 fps), with interlaced fields updating at a rate of 60 fields/second (actually 59.94 fields/sec).
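These numbers translate directly into the throughput an embedded system has to sustain. The following back-of-the-envelope sketch counts active pixels only (blanking intervals are ignored) and assumes 2 bytes per pixel purely for illustration; the exact NTSC frame rate is 30000/1001, which rounds to 29.97.

```python
from fractions import Fraction

ACTIVE_WIDTH = 720                  # active pixels per line
ACTIVE_HEIGHT = 480                 # active lines per frame
FRAME_RATE = Fraction(30000, 1001)  # exact NTSC rate, about 29.97 fps
FIELD_RATE = 2 * FRAME_RATE         # two fields per frame, about 59.94 Hz


def active_pixel_rate(width, height, frame_rate):
    """Active pixels per second for the given frame size and rate."""
    return width * height * frame_rate


def data_rate_bytes(width, height, frame_rate, bytes_per_pixel):
    """Bytes per second for active video at the given pixel depth."""
    return active_pixel_rate(width, height, frame_rate) * bytes_per_pixel


if __name__ == "__main__":
    pixels = active_pixel_rate(ACTIVE_WIDTH, ACTIVE_HEIGHT, FRAME_RATE)
    # bytes_per_pixel=2 is an assumption for this example.
    rate = data_rate_bytes(ACTIVE_WIDTH, ACTIVE_HEIGHT, FRAME_RATE, 2)
    print(f"frame rate : {float(FRAME_RATE):.2f} fps")
    print(f"field rate : {float(FIELD_RATE):.2f} fields/sec")
    print(f"pixel rate : {float(pixels) / 1e6:.2f} Mpixels/sec")
    print(f"data rate  : {float(rate) / 1e6:.2f} MB/sec at 2 bytes/pixel")
```

Even standard definition therefore works out to a little over 10 million active pixels every second, before any processing is applied.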

High definition systems (HD) often employ progressive scanning and can have much higher horizontal and vertical resolutions than SD systems. We will focus on SD systems rather than HD systems, but most of our discussion also generalizes to the higher frame and pixel rates of the high-definition systems.

When discussing video, there are two main branches along which resolutions and frame rates have evolved. These are computer graphics formats and broadcast video formats. Table 1 shows some common screen resolutions and frame rates belonging to each category. Even though these two branches emerged from separate domains with different requirements (for instance, computer graphics uses RGB progressive-scan schemes, while broadcast video uses YCbCr interlaced schemes), today they are used almost interchangeably in the embedded world. That is, VGA compares closely with the NTSC "D-1" broadcast format, and QVGA parallels CIF. It should be noted that although D-1 is 720 pixels x 486 rows, it's commonly referred to as being 720x480 pixels (which is really the arrangement of the NTSC "DV" format used for DVDs and other digital video).

Table 1. Graphics vs Broadcast standards.

Interlaced vs. Progressive Scanning
Interlaced scanning originates from early analog television broadcast, where the image needed to be updated rapidly in order to minimize visual flicker, but the technology available did not allow for refreshing the entire screen this quickly. Therefore, each frame was "interlaced," or split into two fields, one consisting of odd-numbered scan lines, and the other composed of even-numbered scan lines, as depicted in Figure 4. The frame refresh rate for NTSC (PAL) was set at approximately 30 (25) frames/sec. Thus, large areas flicker at 60 (50) Hz, while localized regions flicker at 30 (25) Hz. This was a compromise to conserve bandwidth while accounting for the eye's greater sensitivity to flicker in large uniform regions.

Not only does some flickering persist, but interlacing also causes other artifacts. For one, the scan lines themselves are often visible. Because each NTSC field is a snapshot of activity occurring at 1/60 second intervals, a video frame consists of two temporally different fields. This isn't a problem when you're watching the display, because it presents the video in a temporally appropriate manner. However, converting interlaced fields into progressive frames (a process known as "deinterlacing") can cause jagged edges when there's motion in an image. Deinterlacing is important because it's often more efficient to process video frames as a series of adjacent lines.
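A small sketch shows where the jagged edges come from. It models a frame as a list of rows and implements two simple deinterlacing approaches, commonly called "weave" and "bob"; these function names and the toy 4x4 picture are assumptions for illustration, and practical deinterlacers are considerably more elaborate.

```python
def split_fields(frame):
    """Split a frame (list of rows) into its two interlaced fields."""
    odd_lines = frame[0::2]   # scan lines 1, 3, 5, ... (1-based numbering)
    even_lines = frame[1::2]  # scan lines 2, 4, 6, ...
    return odd_lines, even_lines


def weave(odd_lines, even_lines):
    """Deinterlace by interleaving two fields into one frame."""
    frame = []
    for odd_row, even_row in zip(odd_lines, even_lines):
        frame.append(odd_row)
        frame.append(even_row)
    return frame


def bob(field):
    """Deinterlace by repeating each line of a single field."""
    frame = []
    for row in field:
        frame.append(row)
        frame.append(row)
    return frame


if __name__ == "__main__":
    # A vertical bar ("X") that moves one pixel right between two fields.
    scene_t0 = ["X...", "X...", "X...", "X..."]
    scene_t1 = [".X..", ".X..", ".X..", ".X.."]
    odd_t0, _ = split_fields(scene_t0)   # first field, captured at t0
    _, even_t1 = split_fields(scene_t1)  # second field, captured at t1

    print("weave:", weave(odd_t0, even_t1))  # bar edge alternates per line
    print("bob:  ", bob(odd_t0))             # clean edge, half the detail
```

Weaving keeps full vertical resolution but interleaves two moments in time, so the moving bar comes out as a comb. Line doubling avoids the comb at the cost of half the vertical detail.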

With the advent of digital television, progressive (that is, non-interlaced) scan has become a very popular input and output video format for improved image quality. Here, the entire image updates sequentially from top to bottom, at twice the scan rate of a comparable interlaced system. This eliminates many of the artifacts associated with interlaced scanning. In progressive scanning, the notion of two fields composing a video frame does not apply.

Figure 4. Interlaced Scan vs Progressive Scan illustration.

Now that we've briefly discussed the basis for video signals and some common terminology, we're almost ready to move to the really interesting stuff—digital video. We'll get to that in part 2.

This series is adapted from the book "Embedded Media Processing" (Newnes 2005) by David Katz and Rick Gentile. See the book's web site for more information.
