Intelligent Automated Inspection Laboratory and Robotic Embedded Systems Lab: 2007

Thursday, December 27, 2007

Virtual Keyboard : VKB

Ideal for having a full-size keyboard in any situation. It takes some getting used to, since you are typing on a hard surface, but it works very well!

The Virtual Laser Keyboard (VKB) is the ultimate new gadget for BlackBerry, smartphone and PDA owners, as well as Mac and PC users. The VKB comes with an elegant leather jacket, making it the perfect addition to your collection (and just what you want to take out of your inner suit pocket in front of your business colleagues :-)

About the size of a Zippo lighter and styled with a space-age “Enterprise” look, it uses a laser beam to project a full-size, fully working keyboard that connects smoothly to Macs, smartphones, the new BlackBerry models (8100, 8300, 8800), any kind of PC and most handheld devices (PDAs, tablet PCs).

Virtual Laser Keyboard Features:

Personal Digital Assistants (PDA’s)

Cellular Telephones

Laptops

Tablet PCs

Space saving Computers

Clean Rooms

Industrial Environments

Test Equipment

Sterile and Medical Environments

Transport (Air, Rail, Automotive)

Tuesday, December 25, 2007

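Applying the Kalman Filter in Embedded Control Systems

Embedded control systems use Kalman filters to observe the variables of a process so that the process can be controlled. This article introduces the basic principles of Kalman filter design and implementation, and then uses a vehicle navigation problem to show how the Kalman filter produces a reliable estimate of a vehicle's current position during position control.

The Kalman filter was originally developed for spacecraft navigation and has since been applied successfully in many fields. It is mainly used to estimate system states that can only be observed indirectly or inaccurately by the system itself.

Many engineering systems and embedded systems need filtering. For example, when a radio communication signal corrupted by noise is received, a good filtering algorithm removes the noise from the electromagnetic signal while retaining the useful information. Another example is power supply voltage: uninterruptible power supplies filter the line voltage to smooth out undesirable fluctuations that would otherwise shorten the life of electrical devices such as computers and printers.

The Kalman filter is suited to observing the variables of a process. Mathematically, it estimates the states of a linear system. The Kalman filter not only works well in practice but is also attractive in theory, because of all possible filters it is the one that minimizes the variance of the estimation error. Embedded control systems often need Kalman filters, because to control a process you must first estimate its variables accurately.

This article introduces the basic principles of Kalman filter design and implementation. It first presents the Kalman filter algorithm and then uses it to solve a vehicle navigation problem. To control the position of a vehicle, you must first have a reliable estimate of its current position. The Kalman filter is an effective tool for exactly this.

Figure 1: Vehicle position (true, measured, and estimated)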

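Linear systems

To use a Kalman filter to remove noise from a signal, the process being measured must be describable by a linear system. Many physical processes, such as a vehicle driving along a road, a satellite orbiting the earth, a motor shaft driven by winding currents, or a sinusoidal radio-frequency carrier signal, can be approximated as linear systems. A linear system is simply a process that can be described by the following two equations:

State equation:

$$x_{k+1} = A x_k + B u_k + w_k$$

Output equation:

$$y_k = C x_k + z_k$$

In the above equations, A, B, and C are matrices; k is the time index; x is called the state of the system; u is a known input to the system; and y is the measured output. The variables w and z are noise: w is called the process noise and z the measurement noise. Each of these quantities is in general a vector and therefore contains more than one element. The vector x contains all of the information about the present state of the system, but it cannot be measured directly. Instead we measure y, which is a function of x corrupted by the noise z. We can use y to obtain an estimate of x, but we cannot take the measured y at face value, because it is corrupted by noise.

For example, consider a model of a vehicle travelling in a straight line. The state consists of the vehicle position p and velocity v. The input u is the commanded acceleration and the output y is the measured position. Suppose we can change the acceleration and measure the position every T seconds. By the basic laws of physics, the velocity v is governed by the following equation:

$$v_{k+1} = v_k + T u_k$$

That is, the velocity one time step from now (T seconds later) equals the present velocity plus the commanded acceleration multiplied by T. But this equation does not give a precise value for v_{k+1}, because in reality the velocity is perturbed by gusts of wind and other unforeseen disturbances. The velocity noise is a random variable that changes with time, so the following equation describes v more realistically:

$$v_{k+1} = v_k + T u_k + \tilde{v}_k$$

where \tilde{v}_k is the velocity noise. A similar equation can be derived for the position p:

$$p_{k+1} = p_k + T v_k + \tfrac{1}{2} T^2 u_k + \tilde{p}_k$$

where \tilde{p}_k is the position noise. The state vector x, consisting of position and velocity, is defined as:

$$x_k = \begin{bmatrix} p_k \\ v_k \end{bmatrix}$$

Finally, since the measured output is the vehicle position, the linear system equations become:

$$x_{k+1} = \begin{bmatrix} 1 & T \\ 0 & 1 \end{bmatrix} x_k + \begin{bmatrix} T^2/2 \\ T \end{bmatrix} u_k + w_k$$

$$y_k = \begin{bmatrix} 1 & 0 \end{bmatrix} x_k + z_k$$

Here z_k is the measurement noise caused by such things as instrumentation errors. If we want to control the vehicle with a feedback system, we need accurate estimates of the position p and velocity v. In other words, we need a way to estimate the state x. The Kalman filter is an effective tool for state estimation.

Figure 2: Position measurement error and position estimation error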

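Kalman filter theory and algorithm

Suppose we have a linear system model as described above and want to use the available measurements y to estimate the state x of the system. We know how the system behaves according to its state equation, and we have measurements of the position, so how can we determine the best estimate of the state x? We want an estimator that gives an accurate estimate of the true state even though we cannot measure it directly. What criteria should the estimator satisfy? Two requirements are obvious.

First, the average value of the state estimate should equal the average value of the true state. That is, we do not want the estimate to be biased one way or the other. Mathematically, the expected value of the estimate should equal the expected value of the state.

Second, we want the state estimate to vary from the true state as little as possible. That is, not only should the average of the estimate equal the average of the true state, but the estimator should also produce the smallest possible deviation from the true state. Mathematically, we want the estimator with the smallest possible error variance.

The Kalman filter is the estimator that satisfies these two criteria, but the Kalman filter solution also requires certain assumptions about the noise. In the system model, w is the process noise and z is the measurement noise. We must assume that w and z each have zero mean and that they are uncorrelated with each other. That is, at any time k, w_k and z_k are uncorrelated random variables. The noise covariance matrices S_w and S_z are defined as:

Process noise covariance:

$$S_w = E(w_k w_k^T)$$

Measurement noise covariance:

$$S_z = E(z_k z_k^T)$$

where w^T and z^T denote the transposes of the random noise vectors w and z, and E(·) denotes the expected value.

Now we can look at the Kalman filter equations. There are many equivalent ways to express them; one formulation is as follows: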
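As a sketch only, assuming the common one-step-predictor arrangement (the article's own typeset equations are not reproduced here, and the author's exact form may differ), three equations that match the description in the text are:

```latex
\begin{aligned}
K_k &= A P_k C^T \left( C P_k C^T + S_z \right)^{-1} \\
\hat{x}_{k+1} &= \left( A \hat{x}_k + B u_k \right) + K_k \left( y_k - C \hat{x}_k \right) \\
P_{k+1} &= A P_k A^T + S_w - K_k C P_k A^T
\end{aligned}
```

Here \hat{x}_k is the state estimate, K_k the gain, and P_k the estimation-error covariance; A, B, C, S_w and S_z are as defined in the system model.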

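The Kalman filter consists of three equations, each involving matrix operations. In these equations, the superscript −1 denotes matrix inversion and the superscript T denotes matrix transposition. The matrix K is called the Kalman gain, and the matrix P is the covariance of the estimation error.

The state estimate equation is fairly intuitive. The first term used to derive the state estimate at time k + 1 is just A times the state estimate at time k plus B times the known input at time k. This would be the state estimate if no measurement were taken. In other words, the state estimate would propagate in time just like the state vector in the system model. The second term in the equation is called the correction term; it represents the amount by which the propagated state estimate is corrected because of the measurement.

Inspection of the K equation shows that if the measurement noise is large, S_z is large, so K is small and the measurement y should be given little credibility when computing the next state estimate. Conversely, if the measurement noise is small, S_z is small, so K is large and the measurement can be given a lot of credibility when computing the next state estimate.

Figure 3: Velocity (true and estimated)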

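Vehicle navigation example

Now consider the vehicle navigation problem introduced earlier. The vehicle travels in a straight line. The position is measured with an error of 10 feet (one standard deviation). The commanded acceleration is a constant 1 foot/sec². The acceleration noise is 0.2 feet/sec² (one standard deviation). The position is measured every 0.1 seconds (T = 0.1). How can we get the best estimate of the position of the moving vehicle? Given the large measurement noise, we can be sure that a computed estimate will do better than the raw measurements.

Since T = 0.1, the linear model that represents the system follows from the system model derived earlier:

$$x_{k+1} = \begin{bmatrix} 1 & 0.1 \\ 0 & 1 \end{bmatrix} x_k + \begin{bmatrix} 0.005 \\ 0.1 \end{bmatrix} u_k + w_k$$

$$y_k = \begin{bmatrix} 1 & 0 \end{bmatrix} x_k + z_k$$

Because the standard deviation of the measurement noise is 10 feet, the S_z matrix is simply equal to 100.

Now derive the S_w matrix. Since the position is proportional to 0.005 times the acceleration, and the acceleration noise is 0.2 feet/sec², the variance of the position noise is (0.005)² × (0.2)² = 10⁻⁶. Similarly, since the velocity is proportional to 0.1 times the acceleration, the variance of the velocity noise is (0.1)² × (0.2)² = 4 × 10⁻⁴. Finally, the covariance of the position noise and the velocity noise equals the standard deviation of the position noise times the standard deviation of the velocity noise, which is (0.005 × 0.2) × (0.1 × 0.2) = 2 × 10⁻⁵. Combining these results gives the matrix S_w:

$$S_w = \begin{bmatrix} 10^{-6} & 2 \times 10^{-5} \\ 2 \times 10^{-5} & 4 \times 10^{-4} \end{bmatrix}$$

Next, initialize the position and velocity estimates to the best available initial guess, and initialize the estimation-error covariance P to the uncertainty in that initial estimate. Then execute the Kalman filter equations once per time step.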
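The per-time-step computation for this two-state example can be sketched in C. This is a minimal illustration, not the program from the article: the names `kf2_t`, `kf2_init` and `kf2_step` are hypothetical, the filter is written in one-step-predictor form, and starting from rest with P equal to S_w is an assumption. Because the measurement is a scalar, the matrix inverse in the gain equation reduces to a single division.

```c
/* Two-state (position, velocity) Kalman filter, one-step-predictor form.
 * kf2_t, kf2_init and kf2_step are illustrative names, not from the article. */
typedef struct {
    double x[2];      /* state estimate: x[0] = position, x[1] = velocity */
    double P[2][2];   /* estimation-error covariance */
    double Sw[2][2];  /* process-noise covariance */
    double Sz;        /* measurement-noise variance (scalar measurement) */
    double T;         /* sample period in seconds */
} kf2_t;

/* Build S_w and S_z from the noise standard deviations, as in the text.
 * Assumption: start at rest at the origin with P = S_w. */
void kf2_init(kf2_t *kf, double T, double meas_std, double accel_std)
{
    double gp = 0.5 * T * T * accel_std;  /* position-noise std deviation */
    double gv = T * accel_std;            /* velocity-noise std deviation */
    kf->T = T;
    kf->Sz = meas_std * meas_std;
    kf->Sw[0][0] = gp * gp;
    kf->Sw[0][1] = kf->Sw[1][0] = gp * gv;
    kf->Sw[1][1] = gv * gv;
    kf->x[0] = kf->x[1] = 0.0;
    kf->P[0][0] = kf->Sw[0][0];  kf->P[0][1] = kf->Sw[0][1];
    kf->P[1][0] = kf->Sw[1][0];  kf->P[1][1] = kf->Sw[1][1];
}

/* One filter step with A = [1 T; 0 1], B = [T^2/2; T], C = [1 0].
 * u is the commanded acceleration, y the measured position. */
void kf2_step(kf2_t *kf, double u, double y)
{
    double T = kf->T;
    double (*P)[2] = kf->P;

    /* A*P */
    double AP00 = P[0][0] + T * P[1][0], AP01 = P[0][1] + T * P[1][1];
    double AP10 = P[1][0],               AP11 = P[1][1];

    /* Gain K = A P C^T / (C P C^T + Sz); the inverse is a scalar divide. */
    double s  = P[0][0] + kf->Sz;
    double K0 = AP00 / s, K1 = AP10 / s;

    /* Row vector C P A^T */
    double CPAt0 = P[0][0] + T * P[0][1], CPAt1 = P[0][1];

    /* State estimate: propagate through the model, then correct. */
    double innov = y - kf->x[0];
    double p = kf->x[0] + T * kf->x[1] + 0.5 * T * T * u + K0 * innov;
    double v = kf->x[1] + T * u + K1 * innov;
    kf->x[0] = p;
    kf->x[1] = v;

    /* Covariance: P = A P A^T + Sw - K C P A^T */
    double N00 = AP00 + T * AP01 + kf->Sw[0][0] - K0 * CPAt0;
    double N01 = AP01            + kf->Sw[0][1] - K0 * CPAt1;
    double N10 = AP10 + T * AP11 + kf->Sw[1][0] - K1 * CPAt0;
    double N11 = AP11            + kf->Sw[1][1] - K1 * CPAt1;
    P[0][0] = N00;  P[0][1] = N01;
    P[1][0] = N10;  P[1][1] = N11;
}
```

With the example's numbers, a caller would run `kf2_init(&kf, 0.1, 10.0, 0.2)` once and then `kf2_step(&kf, 1.0, measured_position)` every 0.1 seconds, reading the estimates from `kf.x`.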

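We simulated the Kalman filter in Matlab; the results are shown in the figures. Figure 1 shows the true position and the estimated position of the vehicle. The two smooth curves are the true and estimated positions, which are almost too close to tell apart. The rough-looking curve is the measured position.

Figure 2 shows the error between the true position and the measured position, and the error between the true position and the Kalman filter's estimated position. The measurement error has a standard deviation of about 10 feet, with occasional spikes up to 30 feet (3 sigma). The error of the estimated position stays at about 2 feet.

Figure 3 shows a bonus of the Kalman filter. Since the vehicle velocity is part of the state x, we get a velocity estimate along with the position estimate. Figure 4 shows the error between the true velocity and the Kalman filter's estimated velocity.

The Matlab program that generated these results is shown in Table 1. Matlab is an easy-to-read language, much like pseudocode, but with built-in matrix operations. Each run of the program gives slightly different results because of the random noise in the simulation, but the overall results stay close to the figures shown here.

Figure 4: Velocity estimation error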

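Practical issues and extensions

The basic ideas of the Kalman filter are straightforward, but the filter equations rely heavily on matrix algebra. Table 2 shows the Kalman filter update equations in C. The matrix algebra routines referenced in Table 2 are available at www.embedded.com/code.html. The routines listed there are very general; if the problem is small enough, they can be simplified accordingly. For example, the inverse of the 2 × 2 matrix:

$$D = \begin{bmatrix} d_1 & d_2 \\ d_3 & d_4 \end{bmatrix}$$

is equal to:

$$D^{-1} = \frac{1}{d_1 d_4 - d_2 d_3} \begin{bmatrix} d_4 & -d_2 \\ -d_3 & d_1 \end{bmatrix}$$

So if you need to invert a 2 × 2 matrix, you can use the above equation. Additional C code for matrix manipulation and Kalman filtering can be found at http://wad.www.media.mit.edu/people/wad/mas864/proj_src.html. Systems with more than three states significantly increase the code size and computational effort. The computational effort associated with matrix inversion is proportional to n³ (where n is the dimension of the matrix). This means that if the number of states in the Kalman filter doubles, the computational effort increases by a factor of eight. For a Kalman filter of moderate dimension, nearly all of the processing is spent on matrix operations. There is no need to worry too much, though: the steady-state Kalman filter greatly reduces the computational load while still providing good estimation performance. In the steady-state Kalman filter the matrices K_k and P_k are constant, so they can be hard-coded as constants, and the only Kalman filter equation that needs to be implemented in real time is the state estimate equation, which consists of simple multiplies and additions (or multiply-accumulates if you are using a DSP).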
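Both simplifications can be sketched in a few lines of C. The function names `inv2x2` and `kf2_steady_step` are hypothetical, the vehicle model (position and velocity states, position measurement) is taken from the example above, and the constant gain `K` is assumed to have been computed offline; no gain values are given in the text, so the caller supplies them.

```c
/* Closed-form inverse of a 2x2 matrix.
 * Returns 0 on success, -1 if the matrix is singular. */
int inv2x2(double m[2][2], double out[2][2])
{
    double det = m[0][0] * m[1][1] - m[0][1] * m[1][0];
    if (det == 0.0)
        return -1;
    out[0][0] =  m[1][1] / det;  out[0][1] = -m[0][1] / det;
    out[1][0] = -m[1][0] / det;  out[1][1] =  m[0][0] / det;
    return 0;
}

/* Steady-state filter step for the position/velocity model:
 * only the state estimate equation runs in real time.
 * K is a constant gain supplied by the caller (assumed precomputed). */
void kf2_steady_step(double x[2], const double K[2],
                     double T, double u, double y)
{
    double innov = y - x[0];  /* measurement minus predicted position */
    double p = x[0] + T * x[1] + 0.5 * T * T * u + K[0] * innov;
    double v = x[1] + T * u + K[1] * innov;
    x[0] = p;
    x[1] = v;
}
```

The steady-state step is a handful of multiply-adds with no matrix inversion at all, which is why it suits small processors and DSPs.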

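We have discussed state estimation for linear systems. But what if we want to estimate the states of a nonlinear system? In fact, almost all real engineering processes are nonlinear. Some can be approximated well by linear systems and some cannot. This was recognized early in the history of the Kalman filter and led to the development of the “extended Kalman filter,” which simply extends linear Kalman filter theory to nonlinear systems.

Up to now we have discussed estimating the state step by step as measurements are obtained. But what if we want to estimate the state as a function of time after we already have the entire time history of measurements? For example, how could we reconstruct the trajectory of the vehicle in the example above after the fact? It seems we should be able to do better than the Kalman filter, because to estimate the state at time k we can use measurements not only up to and including time k but also after time k. The Kalman filter can be modified for this problem, and the result is called the Kalman smoother.

The Kalman filter is not only practical; its theory is attractive as well. The Kalman filter minimizes the variance of the estimation error. But what if we are more concerned with the worst-case estimation error, that is, we want to minimize the “worst” estimation error rather than the “average” error? This problem is solved by the H∞ filter. The H∞ filter (pronounced “H infinity”) is an alternative to Kalman filtering that was developed in the 1980s. It is less widely used and less well known than the Kalman filter, but it is more effective in some situations.

Kalman filter theory assumes that the process noise w and the measurement noise z are uncorrelated with each other. What if the process noise and measurement noise in a system are correlated? This is the correlated noise problem, and it too can be solved by modifying the Kalman filter. In addition, the Kalman filter requires that the noise covariances S_w and S_z be known. What if they are unknown? How can we then obtain the best estimate of the state? Again, this calls for the H∞ filter.

Kalman filtering is a huge field that cannot be covered in a single article. Since its introduction in 1960, thousands of papers and dozens of textbooks have been published on the subject.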

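History and outlook

The Kalman filter was first developed by Rudolph Kalman, and his results were published in a prominent journal. Because the work was more general and more complete than that of others, the filter was named after him. It is sometimes called the Kalman-Bucy filter, because Richard Bucy worked with Kalman on the filter in its early days.

The roots of the algorithm can be traced back to the method of least squares proposed in 1795 by the 18-year-old Karl Gauss. Like many new technologies, the Kalman filter was developed to solve a specific problem, in this case spacecraft navigation for the Apollo space program. Since then the Kalman filter has found its way into many fields, including all forms of navigation (aerospace, land, and marine), nuclear power plant instrumentation, demographic modeling, manufacturing, the detection of underground radioactivity, and fuzzy logic and neural network training.

Dan Simon is a professor in the electrical and computer engineering department at Cleveland State University and an industry consultant. His teaching and research interests include filtering, control theory, embedded systems, fuzzy logic, and neural networks. He is currently working on a DSP-based motor controller that uses a Kalman filter. He can be reached by e-mail at d.j.simon@csuohio.edu.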

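References

Gelb, A. Applied Optimal Estimation. Cambridge, MA: MIT Press, 1974. This is an “oldie but goodie.” Published by MIT Press, it is easy to follow, starts from the basics, and places a strong emphasis on practical application.
Anderson, B. and J. Moore. Optimal Filtering. Englewood Cliffs, NJ: Prentice-Hall, 1979. This is a purely mathematical and very demanding reference, but it helps in understanding the foundations of the Kalman filter and related issues.
Grewal, M. and A. Andrews. Kalman Filtering Theory and Practice. Englewood Cliffs, NJ: Prentice-Hall, 1993. This book sits between the two above and bridges theory and practice. A nice feature is the accompanying disk containing Kalman filter source code; a drawback is that the source code is written in Fortran.
Sorenson, H. Kalman Filtering: Theory and Application. Los Alamitos, CA: IEEE Press, 1985. This is a collection of classic papers on Kalman filtering, including Kalman's original 1960 paper. The collection leans toward academic research, but it helps the reader follow the historical development of Kalman filter theory.
http://ourworld.compuserve.com/homepages/PDJoseph/ is Peter Joseph's website, which offers many resources on Kalman filtering. Dr. Joseph has worked with Kalman filters since 1960 and in 1968 co-authored one of the earliest textbooks on the subject. His site also includes introductory, intermediate, and advanced Kalman filter tutorials.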

Friday, December 21, 2007

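Robot of the Year: Three Arms, Fast as the Wind

China Times, 2007.12.21
Huang Wen-cheng (黃文正) / Compiled from wire reports, Tokyo, Dec. 20

▲ The dance partner robot “PBDR,” developed by Tohoku University, on display at the “Great Robot Exhibition” at the science museum in Ueno, Tokyo. (Photo: Huang Ching-ching (黃菁菁))

Walking aid and dance partner in one ▲ The robot suit “HAL,” developed by the University of Tsukuba, is billed as the world's first gear that turns its wearer into a half-robot; worn on the body, it enhances physical capability. (Photo: Huang Ching-ching)

The award ceremony for the second Japan Robot Award, organized by Japan's Ministry of Economy, Trade and Industry, was held on the 20th at the Tokyo Metropolitan Government building. This year's “Robot of the Year” is the M-430iA, an industrial robot with three rotating, articulated mechanical arms that can pick up 120 items per minute.

The government-sponsored Robot Award aims to encourage and promote research and development in the robotics industry. Judging from this year's results, functionality and commercial potential clearly attract more official support than entertainment value or pure academic research.

Fast picking avoids fatal errors

Finalists this year included the Mindstorms NXT educational robot from Danish toy maker Lego and a work robot from Fuji Heavy Industries that can carry 200 kg. By comparison, this year's “Robot of the Year,” the three-arm robot designed by Fanuc, was clearly a cut above in functionality.

Fanuc manager Ryo Nihei (仁平涼) said the fast-picking robot is already in service in food and pharmaceutical factories. Hygiene in both kinds of workplace has long been a concern, and the robot uses a camera at its front end to pick items precisely, eliminating potentially fatal human error.

Nihei said the robot has no exposed cables or tubing, is easy to wash, and is hygienically reliable. It can work flat out around the clock without ever complaining or going on strike. “Using as little human labor as possible is an unavoidable trend, because people can carry germs and other contaminants.”

Transparent medical robot for surgery simulation

A transparent medical robot developed by Nagoya University has coiled plastic tubing inside that mimics human blood vessels, letting doctors simulate or practice vascular surgery techniques. If handled too roughly, it groans in protest. Seiichi Ikeda (池田誠一), who leads the development team, said the robot is already in volume production and that he hopes everyone can benefit from it. He said the robot, named “Eve,” can be custom-built for individual patients and sells for 250,000 yen each.

A humanoid intelligent robot designed by Fujitsu stands 60 cm tall, can sway and hop from side to side, and sells for 6 million yen. The company said buyers include NASA in the United States and the University of Hamburg in Germany, who use it as easy-to-use robot hardware for artificial intelligence research.

Firefighting robot hauls water into danger

Machinery maker Komatsu has introduced a firefighting robot that carries a huge water tank and can be driven by remote control toward explosion sites or other dangerous locations. It can spray 1,300 gallons of water over a distance of up to 100 meters, nearly three times the reach of a typical firefighter.

The egg-shaped robot “Miuro,” the latest from Japanese robot design company ZMP, also made the finals. The roughly 35 cm tall Miuro, in cooperation with Apple…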

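Care Robots: A “Smart Cane” for Seniors

China Times, 2007.12.21
Huang Ching-ching (黃菁菁) / Special report from Tokyo

▲ “My Spoon,” the feeding robot developed by SECOM, helps elderly or disabled people with limited use of their hands to eat on their own without relying on others. It won the jury's special award at the 2006 Robot Award. (Photo: Huang Ching-ching)

Japan is becoming a super-aged society in which roughly every three young people support one elderly person. Elderly medical care and nursing are becoming a heavy burden on society. With caregivers in short supply, Japan has long looked to robots, hoping they can shed their cold, mechanical image and become a “smart cane” that seniors love.

Robots developed in Japan for the seniors market include ones that help the elderly walk, do rehabilitation, and build strength; ones that help bedridden seniors bathe and handle toileting; ones that carry seniors in place of caregivers or feed those with limited mobility; and even entertainment robots such as pets, piano and flute players, and dance partners.

Masakatsu Fujie (藤江正克), professor of mechanical engineering at Waseda University, noted that latent market demand for robots is growing. The Japanese government wants robots to move out of laboratories and factories and contribute to hospitals, welfare institutions, and ordinary homes, and the Ministry of Economy, Trade and Industry has invested 2.7 billion yen (about NT$774 million) in a joint industry-academia project to bring human-support robots into practical use.

The “robot suit” HAL, developed by the University of Tsukuba, is billed as the world's first mechanical gear that turns a person into something like a half-robot. Worn on the body, it enhances physical capability, and it responds to the wearer's intent, so squatting, standing, and walking all come naturally. It is expected to become an assistive device for people with limited mobility.

Seniors with limited mobility used to struggle to walk even with a cane. Walking-support robots work like motorized medical walkers, helping seniors move around outdoors or at home with less effort. Toyota and others are also working on lighter, smaller single-seat vehicles that could become personal transport for seniors.

To test whether these mobility robots can really help seniors go out or shop, the Fukuoka area of Kyushu has designated certain shopping streets, malls, and restaurants as a “robot special zone” where research groups can run field trials, letting robots leave the lab and experience real living environments.

SECOM's feeding robot “My Spoon” helps elderly or disabled people with limited hand function to eat on their own; it won the jury's special award at the Ministry of Economy, Trade and Industry's 2006 Robot Award. SECOM has also developed a robotic arm capable of fine finger movements that can be mounted on a wheelchair or at the bedside to handle everyday tasks such as fetching a drink or a book.

Tohoku University's dance partner robot PBDR has extensive knowledge of ballroom dancing and strong sensing abilities, so it can follow the movements of its human partner. It was designed mainly for seniors who enjoy ballroom dancing.

Gait-training robots are one type of rehabilitation robot. Traditional gait rehabilitation requires the patient to walk while holding a railing, with a therapist alongside to guide and support. A gait-training robot works like a gym treadmill, letting the patient practice in place, and it records the duration of each session to verify the effectiveness of the rehabilitation.

To boost patients' motivation, some gait-training robots resemble video games: a screen shows an animation in which a puppy gets closer to its goal when the patient walks faster and falls behind when the patient slows down. In addition, a finger rehabilitation robot from Marutomi Seiko (丸富精工) and upper- and lower-limb rehabilitation robots from the New Industry Research Organization (新產業創造研究機構) have begun trials in hospitals and elderly-care facilities.

Japan hopes the robot industry will bring boundless business opportunities, but whether robots become truly practical depends on price and safety. Professor Fujie said price is the lesser problem: once robots are accepted and widespread, prices will naturally come down, so they should not remain the preserve of the rich. Harder to overcome are people's concerns about safety and ethics, and legal restrictions.

Fujie said it is hard to find volunteers to test robots in Japan, and nobody is willing to take responsibility if an accident happens during an experiment. As a result, test subjects are almost never patients; most are medical or laboratory staff who volunteer. However, after various experiments demonstrated the robots' safety, insurers have changed their stance: they used to refuse accident coverage for robots but are now willing to pay claims.

As for the scope of use for medical robots, Fujie said practical deployment is still somewhat difficult. The Ministry of Health, Labour and Welfare applies strict safety review standards, so some robots developed in Japan cannot be used domestically and are instead bought up, used, and promoted in Europe and the United States, which is a pity.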

Thursday, December 06, 2007

One Controller Does It All for Machine and Robot



Rockwell has added native support for a variety of robot kinematics to its ControlLogix platform. Other vendors have made similar moves in an effort to eliminate the need for stand-alone robot controllers.

Integrating kinematics into general motion controllers promises to eliminate some stand-alone robot controls

Joseph Ogando, Senior Editor -- Design News, November 27, 2007

Robots and packaging machines already work together closely – except for their controllers. “OEMs have essentially had to deal with two completely different control systems in the past, one for the machine and another for the robotics,” says Mike Wagner, global business manager for packaging at Rockwell Automation.

Yet robots and packaging machines may increasingly find themselves sharing a controller. More and more motion control suppliers have started to offer robot kinematics as part of their general motion control platforms, potentially eliminating the need for stand-alone robot controllers and the associated integration headaches.

According to Wagner, the notion of integrated kinematics has always been an attractive one for machine builders and their customers. “Limitations on processing power and intellectual property in the robotics industry kept the controls separate,” he says. Now, though, more powerful processors and the expiration of patents related to Delta-style kinematics have made integrated robot kinematics a likely outcome for many packaging applications. A couple of new examples were on display at the recent Pack Expo show in Las Vegas.

B&R Industrial Automation for the first time in North America demonstrated how its Generic Motion Control (GMC) system can replace a dedicated robotics controller with software that runs on B&R’s controllers.

The software includes a variety of robot kinematics, including a six-axis articulated arm, SCARA and Delta. The demo at the show involved a six-axis articulated arm robot powered by DC motors, but the GMC system has been shown on servo-driven robots as well. “We wanted to show that we could do things other than servo,” says Helmut Kirnstoetter, B&R's international sales manager. In fact, the same GMC software can handle not just AC and DC motors but stepper motors and analog outputs too. Aside from robotics, the GMC system also offers CNC controller functionality and can handle general motion control tasks, including coordinated motion. It supports PLCopen motion control function blocks as well as motion control functions developed in-house by B&R.

Rockwell, meanwhile, showed a more tightly integrated robot control than it has had in years past. As Wagner explains, the company has long been able to offer software that would add some robot control functionality on its ControlLogix platform. Now, though, Rockwell has added native kinematics support to ControlLogix, meaning that the kinematics reside in the controller’s firmware rather than running as function blocks.

“It’s an important difference,” Wagner says, explaining that the native kinematics takes away about 70 percent of the robot-related overhead that would otherwise burden the processor, especially in very fast or multi-robot applications.

Currently, Rockwell’s native kinematics support extends to Cartesian, SCARA and Delta style robots. Wagner says the system can handle basic moves for a six-axis articulated arm robot, though full support for this more complex type of robot is not available yet.

B&R and Rockwell are certainly not alone in the push toward integrated robot control. At the previous Pack Expo, Bosch Rexroth demonstrated its take on the technology.

Monday, November 26, 2007

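Street Vendor with a Golden Brain Becomes an Invention King

Wang Teng-fu is full of ideas: a pocket calculator pen, an automatic umbrella, a banknote-counting pen... they have won him many major awards

Wang Teng-fu (王登福), an inventor from Chiayi County who started out as a street vendor, made his fortune through inventing.

Wang Teng-fu's calculator pen, which can write, calculate, and look up world time zones and exchange rates, is now in its sixth generation.

Chiayi County

● From running a street stall to being named one of the Ten Outstanding Inventors, Wang Teng-fu, a native of Puzi in Chiayi County, chose “inventing,” a trade that can make money fast, and made his name in the invention world.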

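Wang said that after finishing his military service he headed north from Chiayi alone, carrying only NT$1,000 in travel money from his father and NT$30,000 he had saved while in the army. In the north, a distant relative in Banqiao kindly let him use the rooftop, which had only a water tap and a “roof” of straw; the “house” had been a doghouse.

Wang took his NT$30,000 to Wufenpu in Taipei to buy women's clothing wholesale, then set up stalls on Guangzhou Street in Wanhua and at the Shilin Night Market. The whole street was lined with vendors, and passers-by strolled on without stopping to look at the clothes and goods. Wang came up with a rapid-fire sales patter full of puns, which finally piqued people's curiosity. Some asked about his background; others browsed and picked out belts, women's clothing, and other items at his stall. In good times, he said, making NT$200,000 to NT$300,000 a month was no problem.

With money coming in, Wang wondered whether he would be running a street stall all his life. He told himself, “I can't spend my life like this.”

He set aside time from the stall to take land administration courses at Tamkang University. Meanwhile, haggling with customers while hawking and working out costs when buying from suppliers pushed him to invent a calculator pen he could carry anywhere. Over the years the pen has gone from its first to its sixth generation, gaining features along the way: besides calculating, it can show the time in different time zones and convert currencies. Take one abroad, and it handles calculations, local time, and currency conversion.

In a downpour, someone holding an umbrella in the right hand and a bag in the left arrives at the car and needs a free hand to open the door. The bag goes on the ground, the umbrella switches to the left hand, and the result is a wet bag and a wet person. With that scene in mind, Wang had a flash of inspiration and invented a fully automatic umbrella that opens and closes freely without electricity or outside force, bringing a ray of hope to the umbrella and parasol business, long regarded as a sunset industry.

After he made money, Wang often had to count cash in front of clients. He usually had to count each stack several times to be sure of the amount, which gave him a headache. Once, in the shower, jolted by cold water, he thought of adapting the principle of a banknote counter and invented a vibration-based banknote-counting pen.

Creativity can turn into gold! Wang encourages people to use their wits to make money and to turn flashes of creativity into wealth. He believes shower time is when the mind roams free: the stimulation of cold or hot water often sparks unexpected ideas.

In 1997, Wang won a double gold medal at a world invention competition in Switzerland, and three years later he took a double championship at a science exhibition in London, which cemented his selection as one of Taiwan's Ten Outstanding Inventors. He stresses that success is no accident: not living in the shadow of the past, but learning from it, is what lays the groundwork for making a fortune from inventing.

2006-06-13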

Friday, November 16, 2007

Tutorial: Floating-point arithmetic on FPGAs

Inside microprocessors, numbers are represented as integers: one or several bytes strung together. A four-byte value comprising 32 bits can hold a relatively large range of numbers: 2³², to be specific. The 32 bits can represent the numbers 0 to 4,294,967,295 or, alternatively, -2,147,483,648 to +2,147,483,647. A 32-bit processor is architected such that basic arithmetic operations on 32-bit integer numbers can be completed in just a few clock cycles, and with some performance overhead a 32-bit CPU can also support operations on 64-bit numbers. The largest value that can be represented by 64 bits is really astronomical: 18,446,744,073,709,551,615. In fact, if a Pentium processor could count 64-bit values at a frequency of 2.4 GHz, it would take 243 years to count from zero to the maximum 64-bit integer.

Dynamic Range and Rounding Error Problems
Considering this, you would think that integers work fine, but that is not always the case. The problems with integers are limited dynamic range and rounding errors.

The quantization introduced through a finite resolution in the number format distorts the representation of the signal. However, as long as a signal is utilizing the range of numbers that can be represented by integer numbers, also known as the dynamic range, this distortion may be negligible.

Figure 1 shows what a quantized signal looks like for large and small dynamic swings, respectively. Clearly, with the smaller amplitude, each quantization step is bigger relative to the signal swing and introduces higher distortion or inaccuracy.

Figure 1: Signal quantization and dynamic range

The following example illustrates how integer math can mess things up.

A Calculation Gone Bad
An electronic motor control measures the velocity of a spinning motor, which typically ranges from 0 to 10,000 RPM. The value is measured using a 32-bit counter. To allow some overflow margin, let's assume that the measurement is scaled so that 15,000 RPM corresponds to the full 32-bit range of 4,294,967,296 (2³²) counts. If the motor is spinning at 105 RPM, this value corresponds to the number 30,064,771 within 0.0000033%, which you would think is accurate enough for most practical purposes.

Assume that the motor control is instructed to increase motor velocity by 0.15% of the current value. Because we are operating with integers, multiplying with 1.0015 is out of the question—as is multiplying by 10,015 and dividing by 10,000—because the intermediate result will cause overflow.

The only option is to divide by integer 10,000 and multiply by integer 10,015. If you do that, you end up with 30,105,090; but the correct answer is 30,109,868. Because of the truncation that happens when you divide by 10,000, the resulting velocity increase is 10.6% smaller than what you asked for. Now, an error of 10.6% of 0.15% may not sound like anything to worry about, but as you continue to perform similar adjustments to the motor speed, these errors will almost certainly accumulate to a point where they become a problem.
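The arithmetic can be checked with a few lines of C. This is a minimal sketch with hypothetical function names; the 64-bit variant is included only as a reference for the exact answer and is not something the text assumes the motor control has available.

```c
#include <stdint.h>

/* Scale by 1.0015 using only 32-bit integers: divide first (which
 * truncates), then multiply. */
uint32_t scale_divide_first(uint32_t v)
{
    return (v / 10000u) * 10015u;
}

/* Reference result using a 64-bit intermediate, which avoids both the
 * overflow and most of the truncation. */
uint32_t scale_wide(uint32_t v)
{
    return (uint32_t)(((uint64_t)v * 10015u) / 10000u);
}

/* Returns 1 if v * 10015 would not fit in 32 bits, i.e. if the
 * multiply-first ordering would overflow. */
int multiply_first_overflows(uint32_t v)
{
    return v > UINT32_MAX / 10015u;
}
```

For the value 30,064,771, `scale_divide_first` returns 30,105,090 while `scale_wide` returns 30,109,868, and `multiply_first_overflows` reports that multiplying first is not an option in 32 bits.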

What you need to overcome this problem is a numeric computer representation that represents small and large numbers with the same relative precision. That is exactly what floating-point arithmetic does.

Floating Point to the Rescue
As you have probably guessed, floating-point arithmetic is important in industrial applications like motor control, but also in a variety of other applications. An increasing number of applications that traditionally have used integer math are turning to floating-point representation. I'll discuss this once we have looked at how floating-point math is performed inside a computer.

IEEE 754 at a Glance
A floating-point number representation on a computer uses something similar to a scientific notation with a base and an exponent. A scientific representation of 30,064,771 is 3.0064771 x 10⁷, whereas 1.001 can be written as 1.001 x 10⁰.

In the first example, 3.0064771 is called the mantissa, 10 the exponent base, and 7 the exponent.

IEEE standard 754 specifies a common format for representing floating-point numbers in a computer. Two grades of precision are defined: single precision and double precision. The representations use 32 and 64 bits, respectively. This is shown in Figure 2.

Figure 2: IEEE floating-point formats

In IEEE 754 floating-point representation, each number comprises three basic components: the sign, the exponent, and the mantissa. To maximize the range of possible numbers, the mantissa is divided into a fraction and leading digit. As I'll explain, the latter is implicit and left out of the representation.

The sign bit simply defines the polarity of the number. A value of zero means that the number is positive, whereas a 1 denotes a negative number.

The exponent represents a range of numbers, positive and negative; thus a bias value must be subtracted from the stored exponent to yield the actual exponent. The single precision bias is 127, and the double precision bias is 1,023. This means that a stored value of 100 indicates a single-precision exponent of -27. The exponent base is always 2, and this implicit value is not stored.

For both representations, exponent representations of all 0s and all 1s are reserved and indicate special numbers:

Zero: all digits set to 0, sign bit can be either 0 or 1
±∞: exponent all 1s, fraction all 0s
Not a Number (NaN): exponent all 1s, non-zero fraction. Two versions of NaN are used to signal the result of invalid operations such as dividing zero by zero, and indeterminate results such as operations with non-initialized operand(s).

The mantissa represents the number to be multiplied by 2 raised to the power of the exponent. Numbers are always normalized; that is, represented with one non-zero leading digit in front of the radix point. In binary math, there is only one non-zero digit, 1. Thus the leading digit is always 1, allowing us to leave it out and use all the mantissa bits to represent the fraction (the digits after the radix point).

Following the previous number examples, here is what the single precision representation of the decimal value 30,064,771 will look like:

The binary integer representation of 30,064,771 is 1 1100 1010 1100 0000 1000 0011. This can be written as 1.110010101100000010000011 x 2²⁴. The leading digit is omitted, and the fraction—the string of digits following the radix point—is 1100 1010 1100 0000 1000 0011. The sign is positive and the exponent is 24 decimal. Adding the bias of 127 and converting to binary yields an IEEE 754 exponent of 1001 0111.

Putting all of the pieces together, the single-precision representation for 30,064,771 is shown in Figure 3.

Figure 3: 30,064,771 represented in IEEE 754 single-precision format
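The three fields can be pulled out of a value in software. The following is a small sketch, assuming that `float` on the target is the IEEE 754 single-precision format; the type and function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t sign;      /* 1 bit: 0 = positive, 1 = negative */
    uint32_t exponent;  /* 8 bits, as stored (biased by 127) */
    uint32_t fraction;  /* 23 bits; the implicit leading 1 is not stored */
} f32_fields;

/* Split a single-precision value into its three stored fields. */
f32_fields f32_decode(float f)
{
    uint32_t bits;
    f32_fields out;
    memcpy(&bits, &f, sizeof bits);  /* well-defined way to view the bits */
    out.sign     = bits >> 31;
    out.exponent = (bits >> 23) & 0xFFu;
    out.fraction = bits & 0x7FFFFFu;
    return out;
}

/* Actual exponent: stored exponent minus the single-precision bias. */
int f32_unbiased_exponent(f32_fields x)
{
    return (int)x.exponent - 127;
}
```

For a value in the range of the example, the stored exponent comes out as 151 (binary 1001 0111), which corresponds to an actual exponent of 24.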

Gain Some, Lose Some
Notice that you lose the least significant bit (LSB) of value 1 from the 32-bit integer representation—this is because of the limited precision for this format.

The range of numbers that can be represented with single precision IEEE 754 representation is ±(2-2^-23) x 2¹²⁷, or approximately ±10^38.53. This range is astronomical compared to the maximum range of 32-bit integer numbers, which by comparison is limited to around ±2.15 x 10⁹. Also, whereas the integer representation cannot represent values between 0 and 1, single-precision floating-point can represent values down to ±2^-149, or ±~10^-44.85. And we are still using only 32 bits—so this has to be a much more convenient way to represent numbers, right?

The answer depends on the requirements.

Yes, because in our example of multiplying 30,064,771 by 1.0015, we can simply multiply the two numbers and the result will be extremely accurate.
No, because as in the preceding example the number 30,064,771 is not represented with full precision. In fact, 30,064,771 has no bit pattern of its own: it is stored as a neighboring even value (30,064,770 if the conversion truncates, 30,064,772 under the default round-to-nearest-even mode), meaning that a software algorithm will treat the two numbers as identical. Worse yet, at this magnitude an increment of 1 is lost to rounding: add 1 to such a number a billion times and it gets stuck instead of growing by a billion. By using 64 bits and representing the numbers in double precision format, that particular example could be made to work, but even double-precision representation will face the same limitations once the numbers get big enough or small enough.
No, because the ALUs (arithmetic logic units) of most embedded processor cores only support integer operations, which leaves floating-point operations to be emulated in software. This severely affects processor performance. A 32-bit CPU can add two 32-bit integers with one machine code instruction; however, a library routine including bit manipulations and multiple arithmetic operations is needed to add two IEEE single-precision floating-point values. With multiplication and division, the performance gap just increases; thus for many applications, software floating-point emulation is not practical.
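The precision limit is easy to observe. The sketch below uses hypothetical function names and assumes IEEE 754 single precision with the default round-to-nearest-even mode; `volatile` is used only to force each intermediate result to be stored as a true 32-bit float.

```c
#include <stdint.h>

/* Round-trip a 32-bit integer through single precision. */
int32_t through_float(int32_t v)
{
    volatile float f = (float)v;  /* rounded to 24 significant bits */
    return (int32_t)f;
}

/* Add 1.0f to `start` n times and return the result. */
float add_one_repeatedly(float start, unsigned n)
{
    volatile float f = start;
    for (unsigned i = 0; i < n; i++)
        f = f + 1.0f;             /* each sum is rounded back to float */
    return f;
}
```

Between 2²⁴ and 2²⁵ only even integers are representable, so `through_float(30064771)` does not return 30,064,771, and `add_one_repeatedly(30064772.0f, 1000)` still returns 30,064,772: every intermediate sum lands exactly halfway between two representable values and is rounded back down.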

Floating Point Co-Processor Units
Those who remember PCs based on the Intel 8086 or 8088 processor will recall that they came with the option of adding a floating-point coprocessor unit (FPU), the 8087. Through a compiler switch, you could tell the compiler that an 8087 was present in the system. Whenever the 8086 encountered a floating-point operation, the 8087 would take over, do the operation in hardware, and present the result on the bus.

Hardware FPUs are complex logic circuits, and in the 1980s the cost of the additional circuitry was significant; thus Intel decided that only those who needed floating-point performance would have to pay for it. The FPU was kept as an optional discrete solution until the introduction of the 80486, which came in two versions, one with and one without an FPU. With the Pentium family, the FPU was offered as a standard feature.

Floating Point is Gaining Ground
These days, applications using 32-bit embedded processors with far less processing power than a Pentium also require floating-point math. Our initial example of motor control is one of many—other applications that benefit from FPUs are industrial process control, automotive control, navigation, image processing, CAD tools, and 3D computer graphics, including games.

As floating-point capability becomes more affordable and widespread, applications that traditionally have used integer math turn to floating-point representation. Examples of the latter include high-end audio and image processing. The latest version of Adobe Photoshop, for example, supports image formats where each color channel is represented by a floating-point number rather than the usual integer representation. The increased dynamic range fixes some problems inherent in integer-based digital imaging.

If you have ever taken a picture of a person against a bright blue sky, you know that without a powerful flash you are left with two choices; a silhouette of the person against a blue sky or a detailed face against a washed-out white sky. A floating-point image format partly solves this problem, as it makes it possible to represent subtle nuances in a picture with a wide range in brightness.

Compared to software emulation, FPUs can speed up floating-point math operations by a factor of 20 to 100 (depending on type of operation) but the availability of embedded processors with on-chip FPUs is limited. Although this feature is becoming increasingly more common at the higher end of the performance spectrum, these derivatives often come with an extensive selection of advanced peripherals and very high-performance processor cores—features and performance that you have to pay for even if you only need the floating-point math capability.

FPUs on Embedded Processors
With the MicroBlaze 4.00 processor, Xilinx makes an optional single precision FPU available. You now have the choice whether to spend some extra logic to achieve real floating-point performance or to do traditional software emulation and free up some logic (20-30% of a typical processor system) for other functions.

Why Integrated FPU is the Way to Go
A soft processor without hardware support for floating-point math can be connected to an external FPU implemented on an FPGA. Similarly, any microcontroller can be connected to an external FPU. However, unless you take special considerations on the compiler side, you cannot expect seamless cooperation between the two.

C-compilers for CPU architecture families that have no floating-point capability will always emulate floating-point operations in software by linking in the necessary library routines. If you were to connect an FPU to the processor bus, FPU access would occur through specifically designed driver routines such as this one:

void user_fmul(float *op1, float *op2, float *res)
{
    FPU_operand1 = *op1;              // write operand a to FPU
    FPU_operand2 = *op2;              // write operand b to FPU
    FPU_operation = MUL;              // tell FPU to multiply
    while (!(FPU_stat & FPUready));   // wait for FPU
    *res = FPU_result;                // return result
}

To do the operation, z = x*y in the main program, you would have to call the above driver function as:

float x, y, z;
user_fmul(&x, &y, &z);

For small and simple operations, this may work reasonably well, but for complex operations involving multiple additions, subtractions, divisions, and multiplications, such as a proportional integral derivative (PID) algorithm, this approach has three major drawbacks:

The code will be hard to write, maintain, and debug
The overhead in function calls will severely decrease performance
Each operation involves at least five bus transactions; as the bus is likely to be shared with other resources, this not only affects performance, but the time needed to perform an operation will depend on the bus load at that moment

The MicroBlaze Way
The MicroBlaze soft processor with the optional FPU is a fully integrated solution that offers high performance, deterministic timing, and ease of use. The FPU operation is completely transparent to the user.

When you build a system with an FPU, the development tools automatically equip the CPU core with a set of floating-point assembly instructions known to the compiler.

To perform z = x*y, you would simply write:

float x, y, z;
z = x * y;

and the compiler will use those special instructions to invoke the FPU and perform the operation.

Not only is this simpler, but a hardware-connected FPU guarantees a constant number of CPU cycles for each floating-point operation. Finally, the FPU provides an extreme performance boost. Every basic floating-point operation is accelerated by a factor of 25 to 150, as shown in Figure 4.

Figure 4: MicroBlaze floating-point acceleration
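To illustrate what transparent floating-point support means for the PID case mentioned earlier, here is a minimal sketch of one PID step written with ordinary `float` expressions. The type and function names and the textbook PID form are assumptions for illustration, not code from the article; with an integrated FPU, the compiler would map each operation to a floating-point instruction, and no driver calls appear in the source.

```c
/* Controller state and gains (all names are illustrative). */
typedef struct {
    float kp, ki, kd;   /* proportional, integral, derivative gains */
    float integral;     /* accumulated error * dt */
    float prev_error;   /* error seen on the previous call */
} pid_ctrl;

/* One PID update; dt is the time since the previous call, in seconds. */
float pid_step(pid_ctrl *c, float setpoint, float measured, float dt)
{
    float error = setpoint - measured;
    float derivative = (error - c->prev_error) / dt;

    c->integral += error * dt;
    c->prev_error = error;

    return c->kp * error + c->ki * c->integral + c->kd * derivative;
}
```

Written against a bus-attached FPU driver instead, the same function would need a separate call for every subtraction, division, multiplication and addition above.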

Conclusion
Floating-point arithmetic is necessary to meet precision and performance requirements for an increasing number of applications.

Today, most 32-bit embedded processors that offer this functionality are derivatives at the higher end of the price range.

The MicroBlaze soft processor with FPU can be a cost-effective alternative to ASSP products, and results show that with the correct implementation you can benefit not only from ease-of-use but vast improvements in performance as well.

For more information on the MicroBlaze FPU, visit www.xilinx.com/ipcenter/processor_central/microblaze/microblaze_fpu.htm.

[Editor's Note: This article first appeared in the Xilinx Embedded Magazine and is presented here with the kind permission of Xcell Publications.]

About the Author
Geir Kjosavik is the Senior Staff Product Marketing Engineer of the Embedded Processing Division at Xilinx, Inc. He can be reached at geir.kjosavik@xilinx.com.

Fundamentals of embedded audio, part 3

Audio Processing Methods
Getting data to the processor's core
There are a number of ways to get audio data into the processor's core. For example, a foreground program can poll a serial port for new data, but this type of transfer is uncommon in embedded media processors because it makes inefficient use of the core.

Instead, a processor connected to an audio codec usually uses a DMA engine to transfer the data from the codec link (like a serial port) to some memory space available to the processor. This transfer of data occurs in the background without the core's intervention. The only overhead is in setting up the DMA sequence and handling the interrupts once the buffer of data has been received or transmitted.

Block processing versus sample processing
Sample processing and block processing are two approaches for dealing with digital audio data. In the sample-based method, the processor crunches the data as soon as it's available. Here, the processing function incurs overhead during each sample period. Many filters (like FIR and IIR, described later) are implemented this way because the effective latency is lower for sample-based processing than for block processing.

In block processing, a buffer of a specific length must be filled before passing the data to the processing function. Some filters are implemented using block processing because it is more efficient than sample processing. For one, the processing function does not need to be called for each sample, greatly reducing overhead. Also, many embedded processors contain multiple processing units such as multipliers or full ALUs that can crunch blocks of data in parallel. What's more, some algorithms, by nature, must be processed in blocks. A well-known one is the Fourier Transform (and its practical counterpart, the Fast Fourier Transform, or FFT), which accepts blocks of temporal or spatial data and converts them into frequency domain representations.
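The difference in calling pattern between the two approaches can be sketched as follows. The function names are hypothetical, and halving each sample is only a placeholder for real processing such as a filter.

```c
#include <stddef.h>
#include <stdint.h>

/* Sample-based: called once per sample period, so the call overhead is
 * paid for every sample, but each result is available immediately. */
int16_t process_sample(int16_t in)
{
    return (int16_t)(in / 2);  /* placeholder processing: halve the sample */
}

/* Block-based: called once per filled buffer of n samples, so the call
 * overhead is paid once per block. */
void process_block(const int16_t *in, int16_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (int16_t)(in[i] / 2);
}
```

The block version adds latency of up to one buffer length, which is the trade-off against its lower per-sample overhead.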

Double-Buffering
In a block-based processing system that uses DMA to transfer data to and from the processor core, a "double buffer" must exist to arbitrate between the DMA transfers and the core. This is done so that the processor core and the core-independent DMA engine do not access the same data at the same time and cause a data coherency problem.

For example, to facilitate the processing of a buffer of length N, simply create a buffer of length 2N. For a bi-directional system, two buffers of length 2N must be created. As shown in Figure 1a, the core processes the in1 buffer and stores the result in the out1 buffer, while the DMA engine is filling in0 and transmitting the data from out0. It can be seen in Figure 1b that once the DMA engine is done with the left half of the double buffers, it starts transferring data into in1 and out of out1, while the core processes data from in0 and into out0. This configuration is sometimes called "ping-pong buffering," because the core alternates between processing the left and right halves of the double buffers.

Note that in real-time systems, the serial port DMA (or another peripheral's DMA tied to the audio sampling rate) dictates the timing budget. For this reason, the block processing algorithm must be optimized in such a way that its execution time is less than or equal to the time it takes the DMA to transfer data to/from one half of a double-buffer.
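The ping-pong scheme can be sketched in a few lines of Python. This is only a sequential software model under stated assumptions: the "DMA" is simulated by an ordinary copy, the names run_ping_pong, process_block and halve are hypothetical, and the block length N = 4 is arbitrary. On real hardware the DMA fills one half in the background while the core works on the other.

```python
# Minimal software model of ping-pong (double) buffering.
# The "DMA" here is an ordinary copy; on real hardware it runs in the
# background and signals completion with an interrupt.

N = 4  # block length (arbitrary value chosen for this sketch)

def run_ping_pong(samples, process_block):
    """Pass samples through a double buffer of length 2*N, one block at a time."""
    in_buf = [0] * (2 * N)    # halves: in0 = in_buf[0:N], in1 = in_buf[N:2N]
    out = []
    dma_half = 0              # index of the half the simulated DMA fills next
    for start in range(0, len(samples) - N + 1, N):
        # Simulated DMA transfer into one half of the input buffer.
        base = dma_half * N
        in_buf[base:base + N] = samples[start:start + N]
        # "Interrupt": hand the filled half to the core and point the
        # DMA at the other half.
        core_half = dma_half
        dma_half ^= 1
        block = in_buf[core_half * N:(core_half + 1) * N]
        out.extend(process_block(block))
    return out

def halve(block):
    """Stand-in processing function: attenuate each sample by 2."""
    return [x // 2 for x in block]

result = run_ping_pong(list(range(0, 16, 2)), halve)
print(result)
```

In a real system, process_block would have to finish within the time the DMA needs to fill the other half, which is the timing budget described above.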

Two-dimensional (2D) DMA
When data is transferred across a digital link like I2S, it may contain several channels. These may all be multiplexed onto one data line going into the same serial port. In such a case, 2D DMA can be used to de-interleave the data so that each channel is linearly arranged in memory. Take a look at Figure 2 for a graphical depiction of this arrangement, where samples from the left and right channels are de-multiplexed into two separate blocks. This automatic data arrangement is extremely valuable for those systems that employ block processing.

Figure 2. A 2D DMA engine used to de-interleave (a) I²S stereo data into (b) separate left and right buffers.
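As a rough illustration of how 2D addressing can de-interleave a stream, the sketch below models a destination address generator with an inner and an outer loop. The function name dma_2d_write and the parameters x_count, x_modify, y_count and y_modify are assumptions made for this example (counted in samples, not bytes); they are not taken from any particular DMA controller's register definition.

```python
def dma_2d_write(src, x_count, x_modify, y_count, y_modify):
    """Copy src to a destination buffer using two-dimensional address stepping.

    The inner loop runs x_count times and steps the address by x_modify;
    after the last inner element the address steps by y_modify instead.
    """
    dest = [None] * len(src)
    addr = 0
    i = 0
    for _row in range(y_count):
        for col in range(x_count):
            dest[addr] = src[i]
            i += 1
            addr += x_modify if col < x_count - 1 else y_modify
    return dest

N = 3  # samples per channel in one block
stream = ["L0", "R0", "L1", "R1", "L2", "R2"]  # interleaved stereo data

# Inner loop: one sample per channel, N locations apart.
# Outer loop: jump back to the left block, one sample further along.
buf = dma_2d_write(stream, x_count=2, x_modify=N, y_count=N, y_modify=-N + 1)
left, right = buf[:N], buf[N:]
print(left, right)
```

With this arrangement each channel ends up contiguous in memory, so a block-processing routine can run over left and right without any stride arithmetic.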

Basic Operations
There are three fundamental operations in audio processing. They are the summing, multiplication, and time delay operations. Many more complicated effects and algorithms can be implemented using these three elements. A summer has the obvious duty of adding two signals together. A multiplication can be used to boost or attenuate an audio signal. On most media processors, these operations can be executed in a single cycle.

A time delay is a bit more complicated. The delay is accomplished with a delay line, which is really nothing more than an array in memory that holds previous data. For example, an echo algorithm might hold 500 ms of input samples for each channel. For a simple delay effect, the current output value is computed by adding the current input value to a slightly attenuated previous sample. If the audio system is sample-based, then the programmer can simply keep track of an input pointer and an output pointer (spaced at 500 ms worth of samples apart), and increment them after each sampling period.

Since delay lines are meant to be reused for subsequent sets of data, the input and output pointers will need to wrap around from the end of the delay line buffer back to the beginning. In C/C++, this is usually done by appending the modulus operator (%) to the pointer increment.

This wrap-around may incur no extra processing cycles for a processor that supports circular buffering (see Figure 3). In this case, the beginning location and length of a circular buffer must be provided only once. During processing, the software increments or decrements the current pointer within the buffer, but the hardware takes care of wrapping around to the beginning of the buffer if the current pointer falls outside of the bounds. Without this automated address generation, the programmer would have to manually keep track of the buffer, thus wasting valuable processing cycles.

Figure 3. (a) Graphical representation of a delay line using a circular buffer (b) Layout of a circular buffer in memory.
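A sample-based delay line with software wrap-around might look like the following sketch. The class name DelayLine and the function echo are hypothetical, and a single index is used in place of separate input and output pointers, which is equivalent when the buffer length equals the delay.

```python
class DelayLine:
    """Fixed-length delay line stored in a circular buffer."""

    def __init__(self, length):
        self.buf = [0.0] * length
        self.idx = 0

    def step(self, x):
        """Store x and return the sample written `length` steps earlier."""
        delayed = self.buf[self.idx]      # oldest sample in the buffer
        self.buf[self.idx] = x            # overwrite it with the newest
        # Modulus wrap-around; circular-buffer hardware does this for free.
        self.idx = (self.idx + 1) % len(self.buf)
        return delayed

def echo(samples, delay, gain):
    """Simple delay effect: current input plus an attenuated past input."""
    line = DelayLine(delay)
    return [x + gain * line.step(x) for x in samples]

out = echo([1.0, 0.0, 0.0, 0.0, 0.0], delay=2, gain=0.5)
print(out)
```

For a 500 ms echo at a 48 kHz sampling rate the delay would be 24000 samples per channel; the tiny delay here just keeps the output easy to inspect.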

A delay line structure can give rise to an important audio building block called the comb filter, which is essentially a delay with a feedback element. When multiple comb filters are used simultaneously, they can create the effect of reverberation.
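A feedback comb filter can be sketched by feeding the output, rather than the input, back into the delay line. The difference equation y[n] = x[n] + g * y[n - D] used here is one common form and is an assumption of this example, as is the function name comb_filter.

```python
def comb_filter(samples, delay, feedback):
    """Feedback comb filter: y[n] = x[n] + feedback * y[n - delay]."""
    line = [0.0] * delay      # circular buffer of past outputs
    idx = 0
    out = []
    for x in samples:
        y = x + feedback * line[idx]
        line[idx] = y                 # store the output, not the input
        idx = (idx + 1) % delay       # wrap around the delay line
        out.append(y)
    return out

impulse_response = comb_filter([1.0, 0.0, 0.0, 0.0, 0.0], delay=2, feedback=0.5)
print(impulse_response)
```

The impulse produces a train of decaying repeats, which is why several such filters with different delays are a typical starting point for reverberation. The feedback gain must stay below 1 in magnitude for the output to decay.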

Signal generation
In some audio systems, a signal (for example, a sine wave) might need to be synthesized. Taylor Series function approximations can emulate trigonometric functions. Uniform random number generators are handy for creating white noise.

However, synthesis might not fit into a given system's processing budget. On fixed-point systems with ample memory, you can use a table lookup instead of generating a signal. This has the side effect of taking up precious memory resources, so hybrid methods can be used as a compromise. For example, you can store a coarse lookup table to save memory. During runtime, the exact values can be extracted from the table using interpolation, an operation that can take significantly less time than computing using a full Taylor Series approximation. This hybrid approach provides a good balance between computation time and memory resources.
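The hybrid approach can be illustrated with a coarse sine table and linear interpolation. The table size of 16 and the names SINE_TABLE and sine_lookup are arbitrary choices for this sketch; on an embedded target the table would normally be generated offline and stored in fixed-point form.

```python
import math

TABLE_SIZE = 16  # coarse table; size chosen only for illustration

# One extra entry so that interpolation never reads past the end.
SINE_TABLE = [math.sin(2 * math.pi * i / TABLE_SIZE)
              for i in range(TABLE_SIZE + 1)]

def sine_lookup(phase):
    """Approximate sin(2*pi*phase), with phase measured in cycles."""
    pos = (phase % 1.0) * TABLE_SIZE
    i = int(pos)                 # table entry at or below the position
    frac = pos - i               # distance to the next entry, 0 <= frac < 1
    return SINE_TABLE[i] + frac * (SINE_TABLE[i + 1] - SINE_TABLE[i])

worst = max(abs(sine_lookup(p / 1000) - math.sin(2 * math.pi * p / 1000))
            for p in range(1000))
print(worst)
```

Doubling the table size cuts the worst-case interpolation error by roughly a factor of four, so the table can be sized against the accuracy the application actually needs.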

Filtering and Algorithms
Digital filters are used in audio systems for attenuating or boosting the energy content of a sound wave at specific frequencies. The most common filter forms are high-pass, low-pass, band-pass and notch. Any of these filters can be implemented in two ways. These are the finite impulse response filter (FIR) and the infinite impulse response filter (IIR), and they are often used as building blocks to more complicated filtering algorithms like parametric equalizers and graphic equalizers.

Finite Impulse Response (FIR) filter
The FIR filter's output is determined by the sum of the current and past inputs, each of which is first multiplied by a filter coefficient. The FIR summation equation, shown in Figure 4a, is also known as "convolution," one of the most important operations in signal processing. In this syntax, x is the input vector, y is the output vector, and h holds the filter coefficients. Figure 4a also shows a graphical representation of the FIR implementation.

The convolution is such a common operation in media processing that many processors can execute a multiply-accumulate (MAC) instruction along with multiple data accesses (reads and writes) in one cycle.
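Since the figure is not reproduced here, the sketch below assumes the usual FIR summation, y[n] = sum over k of h[k] * x[n - k], with inputs before the start of the signal taken as zero. The function name fir_filter is hypothetical. The inner loop is the multiply-accumulate that a MAC instruction performs once per cycle.

```python
def fir_filter(x, h):
    """Direct-form FIR filter: y[n] = sum_k h[k] * x[n - k]."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(len(h)):
            if n - k >= 0:               # samples before the start are zero
                acc += h[k] * x[n - k]   # one multiply-accumulate per tap
        y.append(acc)
    return y

# Two-tap moving average as a very simple low-pass example.
y = fir_filter([1.0, 2.0, 3.0, 4.0], [0.5, 0.5])
print(y)
```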

Figure 4. (a) FIR filter equation and structure (b) IIR filter equation and structure.

Infinite Impulse Response (IIR) filter
Unlike the FIR, whose output depends only on inputs, the IIR filter relies on both inputs and past outputs. The basic equation for an IIR filter is a difference equation, as shown in Figure 4b. Because of the current output's dependence on past outputs, IIR filters are often referred to as "recursive filters." Figure 4b also gives a graphical perspective on the structure of the IIR filter.
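A direct-form IIR difference equation can be sketched as follows. Because Figure 4b is not reproduced here, the sign convention is an assumption: y[n] = sum of b[k] * x[n - k] minus sum of a[k] * y[n - k], with the feedback coefficients starting at a1. The function name iir_filter is hypothetical.

```python
def iir_filter(x, b, a):
    """IIR filter: y[n] = sum_k b[k]*x[n-k] - sum_k a[k-1]*y[n-k], k >= 1.

    b holds the feedforward coefficients b0, b1, ...
    a holds the feedback coefficients a1, a2, ... (a0 is taken as 1).
    """
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        # The dependence on past outputs is what makes the filter recursive.
        acc -= sum(a[k - 1] * y[n - k]
                   for k in range(1, len(a) + 1) if n - k >= 0)
        y.append(acc)
    return y

# First-order example: y[n] = x[n] + 0.5 * y[n - 1]
response = iir_filter([1.0, 0.0, 0.0, 0.0], b=[1.0], a=[-0.5])
print(response)
```

A single impulse keeps producing output indefinitely (halving each step here), which is the "infinite" impulse response that gives the filter its name.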

Fast Fourier Transform
Quite often, we can do a better job describing an audio signal by characterizing its frequency composition. A Fourier Transform takes a time-domain signal and rearranges it into the frequency domain; the inverse Fourier Transform achieves the opposite, converting a frequency-domain representation back into the time domain. Mathematically, there are some nice property relationships between operations in the time domain and those in the frequency domain. Specifically, a time-domain convolution (or an FIR filter) is equivalent to a multiplication in the frequency domain. This tidbit would not be too practical if it weren't for a special optimized implementation of the Fourier transform called the Fast Fourier Transform (FFT). In fact, it is often more efficient to implement a sufficiently long FIR filter by transforming the input signal and coefficients into the frequency domain with an FFT, multiplying the transforms, and then transforming the result back into the time domain with an inverse FFT.
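The equivalence between time-domain convolution and frequency-domain multiplication can be checked with a small example. The sketch below uses a textbook recursive radix-2 FFT written for clarity rather than speed; the function names fft and fast_convolve are hypothetical, and a production system would use an optimized vendor or library FFT instead.

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two.

    With invert=True the inverse transform is computed without the
    1/N scaling, which the caller applies.
    """
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def fast_convolve(x, h):
    """Linear convolution of x and h via FFT, multiply, inverse FFT."""
    out_len = len(x) + len(h) - 1
    size = 1
    while size < out_len:          # zero-pad to a power of two
        size *= 2
    X = fft([complex(v) for v in x] + [0j] * (size - len(x)))
    H = fft([complex(v) for v in h] + [0j] * (size - len(h)))
    Y = [p * q for p, q in zip(X, H)]
    y = fft(Y, invert=True)
    return [v.real / size for v in y[:out_len]]

y = fast_convolve([1.0, 2.0, 3.0], [1.0, 1.0])
print([round(v, 9) for v in y])
```

For a filter this short the direct FIR sum is cheaper; the FFT route only pays off once the number of taps is large enough that the transforms cost less than the multiply-accumulates they replace.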

There are other transforms that are used often in audio processing. Among them, the most common is the modified discrete cosine transform (MDCT), which is the basis for many audio compression algorithms.

Sample Rate Conversion
There are times when you will need to convert a signal sampled at one frequency to a different sampling rate. One situation where this is useful is when you want to decode an audio signal sampled at, say 8 kHz, but the DAC you're using does not support that sampling frequency. Another scenario is when a signal is oversampled, and converting it to a lower frequency can lead to a reduction in computation time. The process of converting the sampling rate of a signal from one rate to another is called sampling rate conversion (or SRC).

Increasing the sampling rate is called interpolation, and decreasing it is called decimation. Decimating a signal by a factor of M is achieved by keeping only every Mth sample and discarding the rest. Interpolating a signal by a factor of L is accomplished by padding the original signal with L-1 zeros between each sample.

Even though interpolation and decimation factors are integers, you can apply them in series to an input signal and get a rational conversion factor. When you upsample by 5 and then downsample by 3, the resulting factor is 5/3 ≈ 1.67.

Figure 5. Sample-rate conversion through upsampling and downsampling.

To be honest, we oversimplified the SRC process a bit too much. In order to prevent artifacts due to zero-padding a signal (which creates images in the frequency domain), an interpolated signal must be low-pass-filtered before being used as an output or as an input into a decimator. This anti-imaging low-pass filter can operate at the input sample rate, rather than at the faster output sample rate, by using a special FIR filter structure that recognizes that the inputs associated with the L-1 inserted samples have zero values.

Similarly, before they're decimated, all input signals must be low-pass-filtered to prevent aliasing. The anti-aliasing low-pass filter may be designed to operate at the decimated sample rate, rather than at the faster input sample rate, by using a FIR filter structure that realizes the output samples associated with the discarded samples need not be computed. Figure 5 shows a flow diagram of a sample rate converter. Note that it is possible to combine the anti-imaging and anti-aliasing filter into one component for computational savings.
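The zero-stuff, filter, discard chain can be sketched as below. Several things here are assumptions made for illustration: the function names, the factors L = 3 and M = 2, and above all the filter, which is a five-tap triangular FIR that merely performs linear interpolation. It is a crude stand-in for a properly designed combined anti-imaging and anti-aliasing low-pass filter, and the code does not include the optimizations that skip zero-valued inputs or discarded outputs.

```python
def upsample(x, L):
    """Insert L - 1 zeros after every sample."""
    out = []
    for v in x:
        out.append(v)
        out.extend([0.0] * (L - 1))
    return out

def fir(x, h):
    """Causal FIR filter, y[n] = sum_k h[k] * x[n - k]."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def decimate(x, M):
    """Keep every Mth sample and discard the rest."""
    return x[::M]

def resample(x, L, M, h):
    """Rational rate change by L/M with a single low-pass filter h."""
    return decimate(fir(upsample(x, L), h), M)

L, M = 3, 2
# Triangular filter: a linear interpolator for L = 3, delayed by 2 samples.
h = [1 / 3, 2 / 3, 1.0, 2 / 3, 1 / 3]
y = resample([0.0, 3.0, 6.0], L, M, h)
print([round(v, 9) for v in y])
```

The input ramp rises by 3 per sample; after conversion by 3/2 the output ramp rises by 2 per sample once the filter's two-sample start-up delay has passed.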

Obviously, we've only been able to give a surface discussion on these embedded audio topics, but hopefully we've provided a useful template for the kinds of considerations necessary for developing an embedded audio processing application.

This series is adapted from the book "Embedded Media Processing" (Newnes 2005) by David Katz and Rick Gentile. See the book's web site for more information.

Fundamentals of embedded audio, part 2

Dynamic Range and Precision
You may have seen dB specs thrown around for various products available on the market today. Table 1 lists a few fairly established products along with their assigned signal quality, measured in dB.

Table 1: Dynamic range comparison of various audio systems.

So what exactly do those numbers represent? Let's start by getting some definitions down. Use Figure 1 as a reference signal for the following "cheat sheet" of the essentials.

Figure 1: Relationship between some important terms in audio systems.

The dynamic range of the human ear (the ratio of the loudest to the quietest signal level) is about 120 dB. In systems where noise is present, dynamic range is described as the ratio of the maximum signal level to the noise floor. In other words,

Dynamic Range (dB) = Peak Level (dB) - Noise Floor (dB)

The noise floor in a purely analog system comes from the electrical properties of the system itself. In digital systems, audio signals also acquire noise from the ADCs and DACs, as well as from the quantization errors due to sampling.

Another important measure is the signal-to-noise ratio (SNR). In analog systems, this means the ratio of the nominal signal to the noise floor, where "line level" is the nominal operating level. On professional equipment, the nominal level is usually 1.228 Vrms, which translates to +4 dBu. The headroom is the difference between nominal line level and the peak level where signal distortion starts to occur. The definition of SNR is a bit different in digital systems, where it is defined as the dynamic range.

Now, armed with an understanding of dynamic range, we can start to discuss how this is useful in practice. Without going into a long derivation, let's simply state what is known as the "6 dB rule". This rule is key to the relationship between dynamic range and computational word width. The complete formulation is described in the equation below, but in shorthand the 6 dB rule means that the addition of one bit of precision will lead to a dynamic range increase of 6 dB. Note that the 6 dB rule does not take into account the analog subsystem of an audio design, so the imperfections of the transducers on both the input and the output must be considered separately.

Dynamic Range (dB) = 6.02n + 1.76 ≈ 6n dB
where
n = the number of precision bits
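As a quick numerical check of the formula, the short sketch below evaluates it for the three common word widths. The function name dynamic_range_db is hypothetical.

```python
def dynamic_range_db(bits):
    """Theoretical dynamic range of an n-bit word: 6.02n + 1.76 dB."""
    return 6.02 * bits + 1.76

for bits in (16, 24, 32):
    exact = dynamic_range_db(bits)
    shorthand = 6 * bits              # the "6 dB rule" approximation
    print(f"{bits}-bit: {exact:.2f} dB (6 dB rule: {shorthand} dB)")
```

The shorthand values of 96, 144 and 192 dB are the ones used in the rest of this discussion.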

The "6 dB rule" dictates that the more bits we use, the higher the audio quality we can attain. In practice, however, there are only a few realistic choices of word width. Most devices suitable for embedded media processing come in three word width flavors: 16-bit, 24-bit, and 32-bit. Table 2 summarizes the dynamic ranges for these three types of processors.

Table 2: Dynamic range of various fixed-point architectures.

Since we're talking about the 6 dB rule, it is worth mentioning something about the nonlinear quantization methods that are typically used for speech signals. A telephone-quality linear PCM encoding requires 12 bits of precision. However, our ears are more sensitive to audio changes at small amplitudes than at high amplitudes. Therefore, the linear PCM sampling is overkill for telephone communications. The logarithmic quantization used by the A-law and μ-law companding standards achieves a 12-bit PCM level of quality using only 8 bits of precision. To make our lives easier, some processor vendors have implemented A-law and μ-law companding into the serial ports of their devices. This relieves the processor core from doing logarithmic calculations.
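The benefit of logarithmic quantization for small signals can be seen with the continuous μ-law curve. This sketch uses the textbook formula with μ = 255, not the segmented 8-bit encoding that telephony codecs actually implement, and the function names and the simple 127-step quantizer are assumptions made for the example.

```python
import math

MU = 255.0

def mu_compress(x):
    """Continuous mu-law compression of x in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_expand(y):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def quantize(v, steps=127):
    """Uniform quantizer with `steps` levels per polarity (about 8 bits)."""
    return round(v * steps) / steps

x = 0.01  # a quiet sample, 40 dB below full scale
linear_error = abs(quantize(x) - x)
companded_error = abs(mu_expand(quantize(mu_compress(x))) - x)
print(linear_error, companded_error)
```

For this quiet sample the companded path reproduces the value far more closely than uniform 8-bit quantization, at the cost of coarser steps near full scale.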

After reviewing Table 2, recall once again that the dynamic range of the human ear is around 120 dB. Because of this, 16-bit data representation doesn't quite cut it for high quality audio. This is why vendors introduced 24-bit processors. However, these 24-bit systems are a bit non-standard from a C compiler standpoint, so many audio designs these days use 32-bit processing.

Choosing the right processor is not the end of the story, because the total quality of an audio system is dictated by the quality level of the "lowest-achieving" component. Besides the processor, a complete system includes analog components like microphones and speakers, as well as the converters to translate signals between the analog and digital domains. The analog domain is outside of the scope of this discussion, but the audio converters do cross into the digital realm.

Let's say that you want to use the AD1871 for sampling audio. The datasheet for this converter explains that it is a 24-bit converter, but its dynamic range is not the theoretical 144 dB – it is 105 dB. The reason for this is that a converter is not a perfect system, and vendors publish only the useful dynamic range.

If you were to hook up a 24-bit processor to the AD1871, then the SNR of your complete system would be 105 dB. The noise floor would amount to 144 dB – 105 dB = 39 dB. Figure 2 is a graphical representation of this situation. However, there is still another component of a digital audio system that we have not discussed yet: computation on the processor's core.

http://i.cmpnet.com/dspdesignline/2007/09/adifigure4_big.gif

Figure 2: An audio system's SNR consists of the weakest component's SNR.

Passing data through a processor's computational units can potentially introduce a variety of errors. One is quantization error. This can be introduced when a series of computations causes a data value to be either truncated or rounded (up or down). For example, a 16-bit processor may be able to add a vector of 16-bit data and store this in an extended-length accumulator. However, when the value in the accumulator is eventually written to a 16-bit data register, some of the bits are truncated.

Take a look at Figure 3 to see how computation errors can affect a real system. For an ideal 16-bit A/D converter (Figure 3a), the signal-to-noise ratio would be 16 x 6 = 96 dB. If quantization errors did not exist, then 16-bit computations would suffice to keep the SNR at 96 dB. The 24-bit and 32-bit systems would dedicate 8 and 16 bits, respectively, to the dynamic range below the noise floor. In essence, those extra bits would be wasted.

However, all digital audio systems do introduce some round-off and truncation errors. If we can quantify this error to take, for example, 18 dB (or 3 bits), then it becomes clear that 16-bit computations will not suffice in keeping the system's SNR at 96 dB (Figure 3b). Another way to interpret this is to say that the effective noise floor is raised by 18 dB, and the total SNR is decreased to 96 dB – 18 dB = 78 dB. This leads to the conclusion that having extra bits below the noise floor helps to deal with the nuisance of quantization.

Figure 3. (a) Allocation of extra bits with various word width computations for an ideal 16-bit, 96 dB SNR system, when quantization error is neglected (b) Allocation of extra bits with various word width computations for an ideal 16-bit, 96 dB SNR system, when quantization noise is present.

Numeric Formats for Audio
There are many ways to represent data inside a processor. The two main processor architectures used for audio processing are fixed-point and floating-point. Fixed-point processors are designed for integer and fractional arithmetic, and they usually natively support 16-bit, 24-bit, or 32-bit data. Floating-point processors provide very good performance with native support for 32-bit or 64-bit floating-point data types. However, floating-point processors are typically more costly and consume more power than their fixed-point counterparts, and most real systems must strike a balance between quality and engineering cost.

Fixed-point Arithmetic
Processors that can perform fixed-point operations typically use two's complement binary notation for representing signals. A fixed-point format can represent both signed and unsigned integers and fractions. The signed fractional format is most common for digital signal processing on fixed-point processors. The difference between integer and fractional formats lies in the location of the binary point. For integers, the binary point is to the right of the least significant digit, whereas signed fractions usually have their binary point immediately to the right of the sign bit (as in the 1.15 and 1.31 formats). Figure 4a shows integer and fractional formats.

While the fixed-point convention simplifies numeric operations and conserves memory, it presents a tradeoff between dynamic range and precision. In situations that require a large range of numbers while maintaining high resolution, a radix point that can shift based on magnitude and exponent (i.e., floating-point) is desirable.

http://i.cmpnet.com/dspdesignline/2007/09/adifigure6_big.gif

Figure 4. (a) Fractional and integer formats

Floating-point Arithmetic
Using floating-point format, very large and very small numbers can be represented in the same system. Floating-point numbers are quite similar to scientific notation representation of rational numbers. They are described with a mantissa and an exponent. The mantissa dictates precision, and the exponent controls dynamic range.

There is a standard that governs floating-point computations of digital machines. It is called IEEE-754 (Figure 4b) and can be summarized as follows for 32-bit floating-point numbers. Bit 31 (MSB) is the sign bit, where a 0 represents a positive sign and a 1 represents a negative sign. Bits 30 through 23 represent an exponent field (exp_field) as a power of 2, biased with an offset of 127. Finally, bits 22 through 0 represent a fractional mantissa (mantissa). A hidden bit, which is an implied value of 1 to the left of the radix point, precedes the stored mantissa.

The value of a 32-bit IEEE floating-point number can be represented with the following equation:

(-1)^sign_bit x (1.mantissa) x 2^(exp_field - 127)

With an 8-bit exponent and a 23-bit mantissa, IEEE-754 reaches a balance between dynamic range and precision. In addition, IEEE floating-point libraries include support for additional features such as ±infinity, zero, and NaN (not a number).
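The field layout can be verified by pulling a 32-bit float apart and rebuilding it from the equation above. The sketch uses Python's standard struct module; the function names decode_float32 and rebuild are hypothetical, and rebuild only covers normalized numbers (it ignores zero, denormals, infinity and NaN).

```python
import struct

def decode_float32(value):
    """Return (sign_bit, exp_field, mantissa) of value stored as float32."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign_bit = bits >> 31              # bit 31
    exp_field = (bits >> 23) & 0xFF    # bits 30..23, biased by 127
    mantissa = bits & 0x7FFFFF         # bits 22..0, fraction without hidden 1
    return sign_bit, exp_field, mantissa

def rebuild(sign_bit, exp_field, mantissa):
    """Evaluate (-1)^sign_bit x (1.mantissa) x 2^(exp_field - 127)."""
    return (-1) ** sign_bit * (1 + mantissa / 2 ** 23) * 2.0 ** (exp_field - 127)

fields = decode_float32(-6.5)   # -6.5 = -1.625 x 2^2
print(fields, rebuild(*fields))
```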

Figure 4. (a) Fractional and integer formats (b) IEEE 754 32-bit single-precision floating-point format.

Table 3 shows the smallest and largest values attainable from the common floating-point and fixed-point types.

Table 3. Comparison of dynamic range for various data formats.

Emulation on 16-bit Architectures
As explained earlier, 16-bit processing does not provide a high enough SNR for high quality audio, but this does not mean that you shouldn't choose a 16-bit processor. For example, while a 32-bit floating-point machine makes it easier to code an algorithm that preserves 32-bit data natively, a 16-bit processor can also maintain 32-bit integrity through emulation at a much lower cost. Figure 5 illustrates some of the possibilities for choosing a data type for an embedded algorithm.

In the remainder of this section, we'll describe how to achieve floating-point and 32-bit extended precision fixed-point functionality on a 16-bit fixed-point machine.

Figure 5: Depending on the goals of an application, there are many data types that can satisfy system requirements.

Floating-point emulation on fixed-point processors
On most 16-bit fixed-point processors, IEEE-754 floating-point functions are available as library calls from either C/C++ or assembly language. These libraries emulate the required floating-point processing using fixed-point multiply and ALU logic. This emulation requires additional cycles to complete. However, as fixed-point processor core clock speeds venture into the 500 MHz - 1 GHz range, the extra cycles required to emulate IEEE-754-compliant floating-point math become less significant.

It is sometimes advantageous to use a "relaxed" version of IEEE-754 in order to reduce computational complexity. This means that the floating-point arithmetic doesn't implement standard features such as ±infinity, zero, and NaN.

A further optimization is to use a more native type for the mantissa and exponent. Take, for example, Analog Devices' fixed-point Blackfin processor architecture, which has a register file set that consists of sixteen 16-bit registers that can be used instead as eight 32-bit registers. In this configuration, on every core clock cycle, two 32-bit registers can source operands for computation on all four register halves. To make optimized use of the Blackfin register file, a two-word format can be used. In this way, one word (16 bits) is reserved for the exponent and the other word (16 bits) is reserved for the fraction.

Double-Precision Fixed-Point Emulation
There are many applications where 16-bit fixed-point data is not sufficient, but where emulating floating-point arithmetic may be too computationally intensive. For these applications, extended-precision fixed-point emulation may be enough to satisfy system requirements. Using a high-speed fixed-point processor will ensure a significant reduction in the amount of required processing. Two popular extended-precision formats for audio are 32-bit and 31-bit fixed-point representations.

32-Bit-Accurate Emulation
32-bit arithmetic is a natural software extension for 16-bit fixed-point processors. For processors whose 32-bit register files can be accessed as two 16-bit halves, the halves can be used together to represent a single 32-bit fixed-point number. The Blackfin processor's hardware implementation allows for single-cycle 32-bit addition and subtraction.

For instances where a 32-bit multiply will be iterated with accumulation (as is the case in some algorithms we'll talk about soon), we can achieve 32-bit accuracy with 16-bit multiplications in just 3 cycles. Each of the two 32-bit operands (R0 and R1) can be broken up into two 16-bit halves (R0.H / R0.L and R1.H / R1.L).

Figure 6. 32-bit multiplication with 16-bit operations.

From Figure 6, it is easy to see that the following operations are required to emulate the 32-bit multiplication R0 x R1 with a combination of instructions using 16-bit multipliers:

Four 16-bit multiplications to yield four 32-bit results:

R1.L x R0.L
R1.L x R0.H
R1.H x R0.L
R1.H x R0.H

Three shift operations to preserve bit place in the final answer (the >> symbol denotes a right shift). Since we are performing fractional arithmetic, the result is in 1.63 format (1.31 x 1.31 = 2.62 with a redundant sign bit). Most of the time, the result can be truncated to 1.31 in order to fit in a 32-bit data register. Therefore, the result of the multiplication should be aligned with reference to the sign bit, or the most significant bit. This way the rightmost least significant bits can be safely discarded in a truncation:

(R1.L x R0.L) >> 32
(R1.L x R0.H) >> 16
(R1.H x R0.L) >> 16

The final expression for a 32-bit multiplication is:

((R1.L x R0.L) >> 32) + ((R1.L x R0.H) >> 16) + ((R1.H x R0.L) >> 16) + (R1.H x R0.H)

On the Blackfin architecture, these instructions can be issued in parallel to yield an effective rate of a 32-bit multiplication in three cycles.
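The four-product scheme can be modeled with ordinary integers to see how closely it tracks the true result. This is a sketch under assumptions of its own: the operands are treated as signed 32-bit integers, the high halves are taken as signed and the low halves as unsigned, and the result is compared with the upper 32 bits of the exact 64-bit product. The extra one-bit left shift that removes the redundant sign bit in fractional mode is left out, and the function names are hypothetical.

```python
def split(r):
    """Split a signed 32-bit integer into (signed high half, unsigned low half)."""
    return r >> 16, r & 0xFFFF

def mul32_from_16(r0, r1):
    """Upper 32 bits of r0 * r1, built from four 16 x 16 products."""
    r0h, r0l = split(r0)
    r1h, r1l = split(r1)
    return (((r1l * r0l) >> 32) + ((r1l * r0h) >> 16)
            + ((r1h * r0l) >> 16) + (r1h * r0h))

pairs = [(0x40000000, 0x20000000), (0x7FFFFFFF, 0x7FFFFFFF),
         (-0x80000000, 0x12345678), (0x0001FFFF, -0x0000FFFF)]
for a, b in pairs:
    exact = (a * b) >> 32
    print(hex(a), hex(b), exact - mul32_from_16(a, b))
```

Each of the three shifts discards a fraction of one unit, so the emulated result can fall short of the exact upper word by at most 2 in the least significant place.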

31-Bit-Accurate Emulation
We can reduce a fixed-point multiplication requiring at most 31-bit accuracy to just 2 cycles. This technique is especially appealing for audio systems, which usually require at least 24-bit representation, but where 32-bit accuracy may be a bit excessive. Using the "6 dB rule," 31-bit-accurate emulation still maintains a dynamic range of around 186 dB, which is plenty of headroom even with all the quantization effects.

From the multiplication diagram shown in Figure 6, it is apparent that the multiplication of the least significant half-word R1.L x R0.L does not contribute much to the final result. In fact, if the result is truncated to 1.31, then this multiplication can only have an effect on the least significant bit of the 1.31 result. For many applications, the loss of accuracy due to this bit is balanced by the speeding up of the 32-bit multiplication through eliminating one 16-bit multiplication, one shift, and one addition.

The expression for 31-bit accurate multiplication is:

(((R1.L x R0.H) + (R1.H x R0.L)) >> 16) + (R1.H x R0.H)

On the Blackfin architecture, these instructions can be issued in parallel to yield an effective rate of 2 cycles for each 32-bit multiplication.
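The effect of dropping the low-by-low product can be checked with the same kind of integer model. As before, this is a sketch with its own assumptions: signed 32-bit operands, signed high halves, unsigned low halves, comparison against the upper 32 bits of the exact product, and no final fractional-mode left shift. The function names are hypothetical.

```python
def split(r):
    """Split a signed 32-bit integer into (signed high half, unsigned low half)."""
    return r >> 16, r & 0xFFFF

def mul31_from_16(r0, r1):
    """Approximate upper 32 bits of r0 * r1 using three 16 x 16 products."""
    r0h, r0l = split(r0)
    r1h, r1l = split(r1)
    # The low x low product is skipped; the two cross terms share one shift.
    return (((r1l * r0h) + (r1h * r0l)) >> 16) + (r1h * r0h)

pairs = [(0x40000000, 0x20000000), (0x7FFFFFFF, 0x7FFFFFFF),
         (-0x80000000, 0x12345678), (0x0001FFFF, -0x0000FFFF)]
for a, b in pairs:
    exact = (a * b) >> 32
    print(hex(a), hex(b), exact - mul31_from_16(a, b))
```

The result is never more than 1 below the exact upper word, which matches the observation that only the least significant bit is affected.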

So that's the scoop on numeric formats for audio. In the final article of this series, we'll talk about some strategies for developing embedded audio applications, focusing primarily on data movement and building blocks for common algorithms.