Friday, November 16, 2007

Fundamentals of embedded video, part 4

Video Port Features
To handle video streams, processors must have a suitable interface that can maintain a high data transfer rate into and out of the part. Some processors accomplish this through an FPGA and/or FIFO connected to the processor's external memory interface. Typically, this device arbitrates between the constant, relatively slow stream of video (~27 MB/s for NTSC video) into or out of the processor and the sporadic but fast, bursty transfers of the external memory controller (~133 MWords/sec, or 266 MB/s).

However, there are problems with this arrangement. For example, FPGAs and FIFOs are expensive, often costing as much as the video processor itself. Additionally, using the external memory interface for video transfer steals bandwidth from its other prime use in these systems—moving video buffers back and forth between the processor core and external memory.

Therefore, a dedicated video interface is highly preferable for media processing systems. For example, on Blackfin processors, this is the Parallel Peripheral Interface (PPI). The PPI is a multifunction parallel interface that can be configured between 8 and 16 bits in width. It supports bi-directional data flow and includes three synchronization lines and a clock pin for connection to an externally supplied clock. The PPI can gluelessly decode ITU-R BT.656 data and can also interface to ITU-R BT.601 video sources and displays, as well as TFT-LCD panels. It can serve as a conduit for high-speed analog-to-digital converters (ADCs) and digital-to-analog converters (DACs). It can also emulate a host interface for an external processor.

The PPI has some built-in features that can reduce system costs and improve data flow. For instance, in BT.656 mode the PPI can decode an input video stream and automatically ignore everything except active video, effectively reducing an NTSC input's video stream rate from 27 MB/s to roughly 20 MB/s, and markedly reducing the amount of off-chip memory needed to handle the video. Alternatively, it can ignore active video regions and only read in ancillary data that's embedded in vertical blanking intervals. These modes are shown pictorially in Figure 1.

Figure 1: Selective Masking of BT.656 regions in PPI.
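To see where the 27 MB/s and roughly 20 MB/s figures come from, here is a back-of-the-envelope sketch. The timing numbers (1716 bytes per total line, 525 lines per frame, 1440 active bytes per line, 480 active lines, 30000/1001 frames per second) are common NTSC assumptions and are not taken from the text above; the function name is purely illustrative.

```python
def bt656_rates(bytes_per_line=1716, lines_per_frame=525,
                active_bytes_per_line=1440, active_lines=480,
                frame_rate=30000 / 1001):
    """Return (total, active) byte rates in bytes per second.

    All default values are assumed NTSC timing figures, chosen only
    to illustrate the arithmetic.
    """
    # Every byte on the wire, including blanking and timing codes.
    total = bytes_per_line * lines_per_frame * frame_rate
    # Only the bytes inside the active video region.
    active = active_bytes_per_line * active_lines * frame_rate
    return total, active


total, active = bt656_rates()
print(f"total:  {total / 1e6:.1f} MB/s")
print(f"active: {active / 1e6:.1f} MB/s")
```

Under these assumptions, discarding blanking removes a little under a quarter of the incoming bytes before they ever reach the DMA controller.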

Likewise, the PPI can "ignore" every other field of an interlaced stream; in other words, it will not forward this data to the DMA controller. While this instantly halves the input bandwidth requirement, it also discards 50% of the source content, so this tradeoff is not always acceptable. Nevertheless, this can be a useful feature when the input video resolution is much greater than the required output resolution.

On a similar note, the PPI allows "skipping" of odd- or even-numbered elements, again saving DMA bandwidth for the skipped pixel elements. For example, in a 4:2:2 YCbCr stream, this feature allows only luma or chroma elements to be read in, providing convenient partitioning of an algorithm between different processors; one can read in the luma, and the other can read the chroma. Also, it provides a simple way to convert an image or video stream to grayscale (luma-only). Finally, in high-speed converter applications with interleaved I/Q data, this feature allows partitioning between in-phase and quadrature components.
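As a software model of what element skipping achieves, the sketch below splits a 4:2:2 stream into luma and chroma. It assumes the byte order Cb, Y, Cr, Y (so luma sits at odd positions); the function name is hypothetical, and the real PPI does this in hardware before the data reaches the DMA buses.

```python
def split_422(stream: bytes):
    """Split an interleaved 4:2:2 stream into (luma, chroma).

    Assumes the order Cb, Y, Cr, Y, ... so even positions hold chroma
    and odd positions hold luma.
    """
    chroma = stream[0::2]  # keep even-numbered elements only
    luma = stream[1::2]    # keep odd-numbered elements only
    return luma, chroma


# Four pixels: chroma values near 128, luma values 16..19.
stream = bytes([128, 16, 129, 17, 130, 18, 131, 19])
luma, chroma = split_422(stream)
print(list(luma), list(chroma))
```

Keeping only the luma half is also the grayscale conversion mentioned above, and each half is exactly 50% of the original stream.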

Importantly, the PPI is format-agnostic, in that it is not hardwired to a specific video standard. It allows for programmable row lengths and frame lengths. This aids applications that need, say, CIF or QCIF video instead of standard NTSC/PAL formats. In general, as long as the incoming video has the proper EAV/SAV codes (for BT.656 video) or hardware synchronization signals (for BT.601 video), the PPI can process it.
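For readers who want to see what "proper EAV/SAV codes" means in practice, here is a minimal software sketch of BT.656 timing-code decoding. Each code is the four-byte sequence FF 00 00 XY, where XY carries the F (field), V (vertical blanking) and H (0 for SAV, 1 for EAV) flags plus four protection bits. The function names are illustrative, and the sketch only checks the protection bits rather than correcting errors.

```python
PREAMBLE = b"\xff\x00\x00"


def decode_xy(xy: int):
    """Decode the XY byte of a timing code into (F, V, H, valid)."""
    f = (xy >> 6) & 1  # field bit
    v = (xy >> 5) & 1  # 1 during vertical blanking
    h = (xy >> 4) & 1  # 0 = SAV, 1 = EAV
    # Protection bits P3..P0 are XOR combinations of F, V and H.
    protection = ((v ^ h) << 3) | ((f ^ h) << 2) | ((f ^ v) << 1) | (f ^ v ^ h)
    valid = bool(xy & 0x80) and (xy & 0x0F) == protection
    return f, v, h, valid


def find_timing_codes(stream: bytes):
    """Return a list of (offset, (F, V, H, valid)) for each code found."""
    codes = []
    for i in range(len(stream) - 3):
        if stream[i:i + 3] == PREAMBLE:
            codes.append((i, decode_xy(stream[i + 3])))
    return codes


# SAV for active video, two pixels of payload, then the matching EAV.
line = b"\xff\x00\x00\x80" + bytes([0x80, 0x10] * 2) + b"\xff\x00\x00\x9d"
print(find_timing_codes(line))
```

Because row and frame boundaries are carried by these codes rather than by fixed counts, a receiver that follows them can handle frame sizes other than standard NTSC or PAL.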

Although the BT.656 and BT.601 recommendations allow for 10-bit pixel elements, this is not a very friendly word length for processing. The problem is that most processors are very efficient at handling data in 8-bit, 16-bit or 32-bit chunks, but anything in-between results in data movement inefficiencies. For example, even though a 10-bit pixel value is only 2 bits wider than an 8-bit value, most processors will treat it as a 16-bit entity with the 6 most significant bits (MSBs) set to 0. Not only does this waste bandwidth on the internal data transfer (DMA) buses, but it also wastes a lot of memory – a disadvantage in video applications, where several entire frame buffers are usually stored in external memory.

A related inefficiency associated with data sizes larger than 8 bits is non-optimal packing. Usually, a high-performance media processor will equip its peripherals with a data packing mechanism that sits between the outside world and the internal data movement buses of the processor. Its goal is to minimize the overall bandwidth burden that the data entering or exiting the peripheral places on these buses. For example, an 8-bit video stream clocking into a peripheral at 27 MB/s might be packed onto a 32-bit internal data movement bus, thereby requesting service from this bus at a rate of only 27/4, or 6.75 MHz. Note that the overall data transfer rate remains the same (6.75 MHz * 32 bits = 27 MB/s). In contrast, a 10-bit video stream running at 27 Msamples/s can only be packed onto the 32-bit internal bus as two 16-bit chunks, so it requests service at 27/2, or 13.5 MHz, which is twice as often as the 8-bit stream. Moreover, since only 10 data bits out of every 16 are relevant, 37.5% of the internal bus bandwidth is wasted.
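The packing arithmetic can be checked with a short sketch. The function names are hypothetical, and the lane order (first sample in the least significant bits of the bus word) is an assumption made only for illustration.

```python
def bus_request_rate_mhz(sample_rate_mhz, container_bits, bus_bits=32):
    """Bus requests per microsecond after packing samples into bus words."""
    samples_per_word = bus_bits // container_bits
    return sample_rate_mhz / samples_per_word


def wasted_fraction(useful_bits, container_bits):
    """Fraction of each container that carries no pixel data."""
    return (container_bits - useful_bits) / container_bits


def pack_samples(samples, container_bits, bus_bits=32):
    """Pack samples into bus-width words, first sample in the low bits."""
    per_word = bus_bits // container_bits
    mask = (1 << container_bits) - 1
    words = []
    for i in range(0, len(samples), per_word):
        word = 0
        for lane, sample in enumerate(samples[i:i + per_word]):
            word |= (sample & mask) << (lane * container_bits)
        words.append(word)
    return words


print(bus_request_rate_mhz(27, 8))    # 8-bit samples, four per word
print(bus_request_rate_mhz(27, 16))   # 10-bit samples in 16-bit containers
print(wasted_fraction(10, 16))
print([hex(w) for w in pack_samples([0x11, 0x22, 0x33, 0x44], 8)])
```

The same 27 Msamples/s source therefore costs twice as many bus requests when each sample has to travel in a 16-bit container.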

Possible Data Flows
It is instructive to examine some ways in which a video port connects in multimedia systems, to show how the system as a whole depends on each component flow. In Figure 2a, an image source sends data to the PPI, at which point the DMA engine transfers it to L1 memory, where the data is processed to its final form before being sent out through a high-speed serial port. This model works very well for low-resolution video processing and for image compression algorithms like JPEG, where small blocks of video (several lines' worth) can be processed and are not needed again afterward. This flow can also work well for some data converter applications.

In Figure 2b, the video data is not routed to L1 memory, but instead is directed to L3 memory. This configuration supports algorithms such as MPEG-2 and MPEG-4, which require storage of intermediate video frames in memory in order to perform temporal compression. In such a scenario, a bidirectional DMA stream between L1 and L3 memories allows for transfers of pixel macroblocks and other intermediate data.

Figure 2: Possible video port data transfer scenarios.

Video ALUs
Most video applications need to deal with 8-bit data, since individual pixel components (whether RGB or YCbCr) are usually byte quantities. Therefore, 8-bit video ALUs and byte-based address generation can make a huge difference in pixel manipulation. This is a nontrivial point, because embedded processors typically operate on 16-bit or 32-bit boundaries.

Embedded media processors sometimes have instructions that are geared to processing 8-bit video data efficiently. For instance, Table 1 shows a summary of the specific Blackfin instructions that can be used together to handle a variety of video operations.

Table 1: Native Blackfin video instructions.

Let's look at a few examples of how these instructions can be used.

The Quad 8-bit Subtract-Absolute-Accumulate (SAA) instruction is well-suited for block-based video motion estimation. The instruction subtracts four pairs of bytes, takes the absolute value of each difference, and accumulates the results. All this happens within a single cycle. In formula terms, for two N x N blocks a and b, the accumulated result is the sum of |a(i, j) - b(i, j)| over all rows i and columns j from 0 to N-1.

Consider the macroblocks shown in Figure 3a. The reference macroblock of 16 pixels x 16 pixels can be further divided into 4 groups. A very reasonable assumption is that neighboring video frames are correlated with each other. That is, if there is motion, then pieces of each frame will move in relation to macroblocks in previous frames. It takes less information to encode the movement of macroblocks than it does to encode each video frame as a separate entity. MPEG compression uses this technique.

This motion detection of macroblocks decomposes into two basic steps. Given a reference macroblock in one frame, we can search all surrounding macroblocks (target macroblocks) in a subsequent frame to determine the closest match. The offset in location between the reference macroblock (in Frame n) and the best-matching target macroblock (in Frame n+1) is the motion vector.
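The two steps can be sketched in plain software as an exhaustive block search that uses the sum of absolute differences as the match cost. This is only an illustrative model: the function names, the square search window, and the list-of-lists frame layout are assumptions, and a real encoder would use a far more efficient search.

```python
def sad(ref, frame, top, left, size):
    """Sum of absolute differences between ref and a block of frame."""
    total = 0
    for r in range(size):
        for c in range(size):
            total += abs(ref[r][c] - frame[top + r][left + c])
    return total


def best_motion_vector(ref, ref_top, ref_left, next_frame, search_range):
    """Search next_frame around (ref_top, ref_left) for the best match.

    Returns (cost, dx, dy) for the lowest-cost candidate block.
    """
    size = len(ref)
    height, width = len(next_frame), len(next_frame[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            top, left = ref_top + dy, ref_left + dx
            # Skip candidates that fall outside the frame.
            if top < 0 or left < 0 or top + size > height or left + size > width:
                continue
            cost = sad(ref, next_frame, top, left, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

The inner loop of sad() is exactly the subtract, absolute-value, accumulate pattern that the SAA instruction performs on four bytes at a time.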

Figure 3b shows how this can be visualized in a system.

  • Circle = some object in a video frame
  • Solid square = reference macroblock
  • Dashed square = search area for possible macroblocks
  • Dotted square = best-matching target macroblock (i.e., the one representing the motion vector of the circle object)

Figure 3: Illustration of Subtract-Absolute-Accumulate (SAA) instruction.

The SAA instruction on a Blackfin processor is fast because it utilizes four 8-bit ALUs in each clock cycle. We can implement the following loop to iterate over each of the four entities shown in Figure 3b.

    /* used in a loop that iterates over an image block */

    SAA (R1:0,R3:2) || R1 = [I0++] || R2 = [I1++]; /* compute absolute difference and accumulate */

    SAA (R1:0,R3:2) (R) || R0 = [I0++] || R3 = [I1++];

    SAA (R1:0,R3:2) || R1 = [I0 ++ M3] || R2 = [I1++M1]; /* after fetch of 4th word of target block, pointer is made to point to the next row */

    SAA (R1:0,R3:2) (R) || R0 = [I0++] || R2 = [I1++];

Let's now consider another example, the 4-Neighborhood Average computation whose basic kernel is shown in Figure 4a. Normally, four additions and one division (or multiplication or shift) are necessary to compute the average. The BYTEOP2P instruction can accelerate the implementation of this filter.

The value of the center pixel of Figure 4b is defined as:

    x = Average(xN, xS, xE, xW)

The BYTEOP2P instruction can perform this kind of average on two pixels (Figures 4c,d) in 1 cycle. So, if x1 = Average(x1N, x1S, x1E, x1W), and x2 = Average(x2N, x2S, x2E, x2W), then

    R3 = BYTEOP2P(R1:0, R3:2)

will compute both pixel averages in a single cycle, assuming the x1 (N, S, E, W) information is stored in registers R1 and R0, and the x2 (N, S, E, W) data is sourced from R3 and R2.
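For reference, here is what the four-neighbor average looks like when written out in software. This is a behavioral sketch only: the function names are hypothetical, the choice between rounding and truncation is exposed as a parameter, and the register byte layout of the hardware instruction is not modeled.

```python
def average4(n, s, e, w, rounding=True):
    """Average four 8-bit neighbors with a shift instead of a divide."""
    total = n + s + e + w
    if rounding:
        total += 2          # add half of the divisor before shifting
    return (total >> 2) & 0xFF


def neighborhood_average(image):
    """Apply the 4-neighborhood average to interior pixels of a 2D list.

    Border pixels are copied unchanged, which is an arbitrary choice
    made for this sketch.
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            out[y][x] = average4(image[y - 1][x], image[y + 1][x],
                                 image[y][x + 1], image[y][x - 1])
    return out
```

Each call to average4() costs three additions and a shift (plus an extra add for rounding), which is the work the single-cycle instruction replaces for two pixels at once.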

Figure 4: Neighborhood Average Computation.

DMA Considerations
An embedded media processor with two-dimensional DMA (2D DMA) capability offers several system-level benefits. For starters, 2D DMA can facilitate transfers of macroblocks to and from external memory, allowing data manipulation as part of the actual transfer. This eliminates the overhead typically associated with transferring non-contiguous data. It can also allow the system to minimize data bandwidth by selectively transferring, say, only the desired region of an input image, instead of the entire image.
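To make the idea concrete, the sketch below models a generic 2D DMA address generator and uses it to pull a rectangular region out of a larger image. The parameter names (x_count, x_modify, y_count, y_modify) are modeled loosely on typical 2D DMA registers and should be read as assumptions; here y_modify is the step applied after the last element of each row, in place of x_modify.

```python
def dma_2d_addresses(start, x_count, x_modify, y_count, y_modify):
    """Yield the sequence of addresses a simple 2D DMA would visit."""
    addr = start
    for _ in range(y_count):
        for col in range(x_count):
            yield addr
            last_in_row = (col == x_count - 1)
            addr += y_modify if last_in_row else x_modify


def copy_region(image, image_width, top, left, width, height):
    """Copy a width x height region out of a row-major byte image."""
    start = top * image_width + left
    # After the last byte of a row, jump to the first byte of the next one.
    y_modify = image_width - width + 1
    addresses = dma_2d_addresses(start, width, 1, height, y_modify)
    return bytes(image[a] for a in addresses)


image = bytes(range(64))            # an 8 x 8 test image
print(list(copy_region(image, 8, top=2, left=3, width=3, height=2)))
```

Only the bytes inside the region are ever touched, which is how the selective transfer saves bandwidth compared with moving the whole image.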

As another example, 2D DMA allows data to be placed into memory in a sequence more natural to processing. For example, as shown in Figure 5, RGB data may enter a processor's L2 memory from a CCD sensor in interleaved RGB444 format, but using 2D DMA, it can be transferred to L3 memory in separate R, G and B planes. Interleaving/deinterleaving of color space components for video and image data saves additional data moves prior to processing.

Figure 5: Deinterleaving data with 2D DMA.
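One way to picture this deinterleaving is as a linear read from the source combined with a 2D write pattern at the destination. The sketch below assumes one byte per color component and a hypothetical destination layout of three back-to-back planes; the count and modify values are illustrative, not settings taken from a hardware manual.

```python
def deinterleave_rgb(interleaved: bytes) -> bytes:
    """Reorder R,G,B,R,G,B,... into an R plane, a G plane and a B plane."""
    n = len(interleaved) // 3        # number of pixels
    x_count, x_modify = 3, n         # step one plane ahead per component
    y_count, y_modify = n, 1 - 2 * n # then back to the R plane, next pixel
    dest = bytearray(3 * n)
    src = 0
    addr = 0
    for _ in range(y_count):
        for k in range(x_count):
            dest[addr] = interleaved[src]
            src += 1
            addr += x_modify if k < x_count - 1 else y_modify
    return bytes(dest)


pixels = bytes([1, 2, 3, 4, 5, 6])   # two pixels: (1,2,3) and (4,5,6)
print(list(deinterleave_rgb(pixels)))
```

The reordering happens entirely in the address arithmetic, so no extra pass over the data is needed before processing.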

Planar vs. Interleaved Buffer Formats
How do you decide whether to structure your memory buffers as interleaved or planar? The advantage to interleaved data is that it's the natural output format of image sensors, and the natural input format for video encoders. However, planar buffers (that is, separate memory regions for each pixel component) are more effective structures for many video algorithms, since many of them (JPEG and MPEG included) work on luma and chroma channels separately. What's more, accessing planar buffers in L3 is more efficient than striding through interleaved data, because the latency penalty for SDRAM page misses is spread out over a much larger sample size when the buffers are structured in a planar manner.

We previously discussed the need for double-buffering as a means of ensuring that current data is not overwritten by new data until you're ready for this to happen. Managing a video display buffer serves as a perfect example of this scheme. Normally, in systems involving different rates between source video and the final displayed content, it's necessary to have a smooth switchover between the old content and the new video frame. This is accomplished using a double-buffer arrangement. One buffer holds the present video frame, which is sent to the display at a certain refresh rate. The second buffer fills with the newest output frame. When this latter buffer is full, a DMA interrupt signals that it's time to output the new frame to the display. At this point, the first buffer starts filling with processed video for display, while the second buffer outputs the current display frame. The two buffers keep switching back and forth in a "ping-pong" arrangement.
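The ping-pong arrangement can be sketched as a small class. The class and method names are hypothetical, and the "interrupt" is simulated by an ordinary method call; a real system would perform the swap inside the DMA-complete interrupt handler.

```python
class PingPongBuffers:
    """Two frame buffers that alternate between display and fill roles."""

    def __init__(self, frame_size):
        self.buffers = [bytearray(frame_size), bytearray(frame_size)]
        self.display_index = 0   # which buffer the display reads from

    @property
    def display_buffer(self):
        return self.buffers[self.display_index]

    @property
    def fill_buffer(self):
        return self.buffers[1 - self.display_index]

    def on_fill_complete(self):
        """Simulated DMA-complete interrupt: swap the two roles."""
        self.display_index = 1 - self.display_index


pp = PingPongBuffers(frame_size=4)
pp.fill_buffer[:] = b"AAAA"      # new frame arrives in the fill buffer
pp.on_fill_complete()            # swap: the new frame is now displayed
print(bytes(pp.display_buffer))
```

The display always reads a complete frame, because the buffer being written is never the one being shown.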

It should be noted that multiple buffers can be used, instead of just two, in order to provide more margin for synchronization, and to reduce the frequency of interrupts and their associated latencies.

So now we've covered some basic issues and features associated with efficient video data movement in embedded applications. In the final part of this series, we extend these ideas into a "walkthrough" of a sample embedded video application.

This series is adapted from the book "Embedded Media Processing" (Newnes 2005) by David Katz and Rick Gentile. See the book's web site for more information.
