{"id":175,"date":"2012-12-21T16:17:39","date_gmt":"2012-12-21T06:17:39","guid":{"rendered":"http:\/\/www.computer-vision-software.com\/blog\/?p=175"},"modified":"2012-12-21T16:17:39","modified_gmt":"2012-12-21T06:17:39","slug":"fpga-implementation-of-myaudience-count-overview-and-details","status":"publish","type":"post","link":"http:\/\/www.computer-vision-software.com\/blog\/2012\/12\/fpga-implementation-of-myaudience-count-overview-and-details\/","title":{"rendered":"FPGA implementation of myAudience-Count. Overview and details."},"content":{"rendered":"<p>Recently, <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.rhondasoftware.com\/');\"  href=\"http:\/\/www.rhondasoftware.com\/\" title=\"Rhonda Software\" target=\"_blank\">Rhonda Software<\/a> took yet another step towards more power-, area- and cost-effective solutions targeting a broad range of embedded devices. In an effort to make one of our leading solutions, <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.myaudience.com\/count\/overview');\" title=\"myAudience-Count\"  href=\"http:\/\/www.myaudience.com\/count\/overview\" target=\"_blank\">myAudience-Count<\/a>, embedded-friendly, different possibilities were considered. This is where <b>FPGA<\/b> technology came in handy.<\/p>\n<p><!--more--><\/p>\n<p>With a <b>Video Analytics<\/b> target in mind, after extensive market research it was decided to use the <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.latticesemi.com\/products\/applicationcutsheets\/hdr60cutsheet.cfm');\"  href=\"http:\/\/www.latticesemi.com\/products\/applicationcutsheets\/hdr60cutsheet.cfm\" title=\"Lattice HDR-60 development kit\" target=\"_blank\">Lattice HDR-60 development kit<\/a> as the base platform for our <b>Embedded Count<\/b> solution. 
The selected kit is a good choice for several reasons, including an on-board 1280&#215;960 camera sensor, an Ethernet PHY, DDR2 memory, 2 USB ports, and of course the main decision driver \u2013 a <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.latticesemi.com\/products\/fpga\/ecp3');\"  href=\"http:\/\/www.latticesemi.com\/products\/fpga\/ecp3\" title=\"Lattice ECP3 FPGA\" target=\"_blank\">Lattice ECP3 FPGA<\/a> device with 70K LUTs, 150 KB of embedded EBR memory, 256 DSP multipliers and other useful ASIC components. All of the above comes packaged on a rather compact base board, accompanied by a development toolchain that is ready to use.<\/p>\n<p>It is now time to look at what is inside the <b>Embedded Count<\/b> product and unveil some of its core algorithms and approaches.<\/p>\n<p>\nAs with the <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.myaudience.com\/count\/overview');\" title=\"myAudience-Count\"  href=\"http:\/\/www.myaudience.com\/count\/overview\" target=\"_blank\">PC version<\/a>, at the heart of the system is an <b>Optical Flow<\/b> estimator, which is basically a motion tracker that calculates, for each pixel, its position relative to the position of the same pixel in the previous and next frames of the video sequence. In general, if there is movement in some part of a video frame, the algorithm has to find its position and direction. For static areas the algorithm has to yield nothing.\n<\/p>\n<p>For those who are interested, here are some technical details. Simply put, any change in a pixel from frame to frame can be explained either by spatial motion or by a change in its brightness over time. The latter can be considered a temporal motion. 
More formally, here is a commonly used differential equation relating a pixel\u2019s brightness change to its movement, called the <b>Optical Flow Constraint<\/b>:<\/p>\n<p style=\"text-align: center\">\u2202I\/\u2202x*Vx + \u2202I\/\u2202y*Vy + \u2202I\/\u2202t = 0,<\/p>\n<p>where Vx and Vy are the components of the pixel\u2019s velocity in the spatial directions, and \u2202I\/\u2202x, \u2202I\/\u2202y and \u2202I\/\u2202t are the spatial and temporal partial derivatives of the brightness I. As one can see, the equation is under-determined (one equation with two unknowns, Vx and Vy) and calls for regularization. The most common way of transforming the problem into a well-posed one is to add a Smoothness Constraint of some form, which basically postulates that adjacent pixels tend to move in the same or a similar direction. This constraint, which usually comes in the form of a Laplacian or some other mixture of second derivatives, effectively transforms the original equation into, in general, an over-determined system of linear equations that can be solved approximately using convolutions only. Interested readers may refer to a <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/ojs.academypublisher.com\/index.php\/jmm\/article\/view\/02053845');\" title=\"FPGA-based Real-time Optical Flow Algorithm Design and Implementation\"  href=\"http:\/\/ojs.academypublisher.com\/index.php\/jmm\/article\/view\/02053845\" target=\"_blank\">nice article<\/a> by Zhaoyi Wei et al., where the idea is explained clearly without too much analytical overhead.<\/p>\n<p>So, the building blocks of the <b>Optical Flow<\/b> algorithm are spatial and temporal <b>derivatives<\/b>, which call for intermediate frame buffers, and <b>convolutions<\/b>, which require buffers, multiplications, summations and divisions. A hardware implementation imposes constraints of its own, the main one being that, to run in real time, the algorithm has to be non-iterative and fully pipelined. 
As the sensor produces a new pixel on each clock cycle, that pixel has to be pushed into the processing pipeline immediately, before the next pixel becomes ready. The amount of memory for storing intermediate results is strictly limited to 150 KB in total. And of course, there is no such luxury as floating-point arithmetic.<\/p>\n<p>Fortunately, both convolutions and spatial derivatives need only a limited number of frame lines at a time, equal to the size of the convolution kernel. Even better, with a tapped <b>shift-register<\/b> structure they can easily be pipelined, so that one output pixel is produced on every clock cycle as each new pixel is pushed into the pipeline.<\/p>\n<p>Multiplications and divisions are much tougher on an <b>FPGA<\/b>. The chip has a limited number of fixed-point DSP multipliers and no dividers. Implementing either in LUT logic would quickly eat up all of the available resources. To overcome the shortage of multipliers and dividers, approximate convolution kernels for both smoothing and differentiation were carefully designed. The coefficients, as well as their sum, were chosen to be powers of two, so only additions and shifts were required. Shifts by a constant amount are resource-free on an <b>FPGA<\/b>, because they add no logic or interconnect. To reduce resource consumption even further, the separability of the kernels in the spatial directions was exploited heavily, which made it possible to transform each 2D sub-problem into 1D ones efficiently.<\/p>\n<p>The next logical step in simplifying the problem was to rewrite the <b>Optical Flow Constraint<\/b> equation, eliminating the \u2202I\/\u2202x*Vx term and leaving only one spatial and one temporal dimension. This could be done for this particular problem because the original task of counting people who cross a virtual line considers only motion in the direction orthogonal to this line and ignores movement parallel to it. 
At the cost of a small quality penalty, this greatly reduced the amount of required calculations and freed a lot of <b>FPGA<\/b> resources.<\/p>\n<p>Additionally, as a highly detailed motion map was not required for solving the problem, frame downscaling was implemented, which both lessened the intermediate buffer requirements and reduced the working clock frequency and power consumption.<\/p>\n<p>The result of the <b>Optical Flow<\/b> calculation is then combined with the result of the <b>Background Model<\/b> module, which is also implemented in hardware and works in parallel with the <b>Optical Flow<\/b> estimator. Next, the combined field is reduced to a line, and the line segments that correspond to the people being counted are extracted. The result of the reduction is then transferred to a CPU implemented as a soft core on the same chip for final post-processing and transmission to the <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/portal.myaudience.com\/login.php');\"  href=\"https:\/\/portal.myaudience.com\/login.php\" title=\"myAudience portal\" target=\"_blank\">myAudience portal<\/a> over Ethernet.<\/p>\n<p>Other important hardware modules of the system include: <b>Background Model<\/b> (mentioned above), <b>Debayer<\/b> (responsible for converting the <b>Bayer<\/b> pattern coming from the sensor to RGB), <b>Tone-mapper<\/b> (for compressing the <b>tonal range<\/b> of input pixels from 12 to 8 bits), <b>JPEG encoder<\/b> (for streaming preview frames to the calibration web UI), Ethernet MAC, DDR2 controller, <b>LM32 CPU<\/b> + <b>Embedded Linux<\/b> (for running the Ethernet stack, transmitting <b>People Count<\/b> results to the <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/portal.myaudience.com\/login.php');\"  href=\"https:\/\/portal.myaudience.com\/login.php\" title=\"myAudience portal\" target=\"_blank\">myAudience portal<\/a>, and running the web server and JPEG preview streamer), I2C master (for programming the sensor\u2019s registers), UART and 
others. All of them were successfully fitted into a single 70K-LUT <b>FPGA<\/b>, consuming about 85% of the available chip resources in both LUTs and memory blocks and forming a finished, production-ready <b>People Count<\/b> solution targeting embedded systems.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Recently, Rhonda Software took yet another step towards more power-, area- and cost-effective solutions targeting a broad range of embedded devices. In an effort to make one of our leading solutions, myAudience-Count, embedded-friendly, different possibilities were considered. This is where FPGA technology came in handy.<\/p>\n","protected":false},"author":37,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[78],"tags":[81,88,89,83,82,80],"class_list":["post-175","post","type-post","status-publish","format-standard","hentry","category-fpga","tag-background-model","tag-fpga","tag-hw","tag-motion","tag-myaudience-count","tag-optical-flow"],"_links":{"self":[{"href":"http:\/\/www.computer-vision-software.com\/blog\/wp-json\/wp\/v2\/posts\/175","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.computer-vision-software.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.computer-vision-software.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.computer-vision-software.com\/blog\/wp-json\/wp\/v2\/users\/37"}],"replies":[{"embeddable":true,"href":"http:\/\/www.computer-vision-software.com\/blog\/wp-json\/wp\/v2\/comments?post=175"}],"version-history":[{"count":0,"href":"http:\/\/www.computer-vision-software.com\/blog\/wp-json\/wp\/v2\/posts\/175\/revisions"}],"wp:attachment":[{"href":"http:\/\/www.computer-vision-software.com\/blog\/wp-json\/wp\/v2\/media?parent=175"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.computer-vision-software.com\/blog\/wp-json\/wp\/v2\/categories?post=175"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.computer-vision-software.com\/blog\/wp-json\/wp\/v2\/tags?post=175"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}