Posted on : 21-12-2012 | By : Sergey Koulik | In : FPGA
Recently, Rhonda Software took another step towards more power-, area- and cost-effective solutions targeting a broad range of embedded devices. In an effort to make one of our leading solutions, myAudience-Count, embedded-friendly, several options were considered. This is where FPGA technology came in handy.
With a video analytics target in mind, and after extensive market research, we decided to use the Lattice HDR-60 development kit as the base platform for our Embedded Count solution. The kit is a good choice for several reasons: an on-board 1280×960 camera sensor, an Ethernet PHY, DDR2 memory, 2 USB ports and, as the main decision driver, a Lattice ECP3 FPGA device with 70K LUTs, 150 KB of embedded EBR memory blocks, 256 DSP multipliers and other useful ASIC components. All of the above comes on a rather compact base board, accompanied by a development toolchain that is ready to use.
It is now time to look inside the Embedded Count product and unveil some of its core algorithms and approaches.
As in the PC version, at the heart of the system is an Optical Flow estimator: essentially a motion tracker that calculates, for each pixel, its position relative to the position of the same pixel in the previous and next frames of the video sequence. In general, if there is movement in some part of the video frame, the algorithm has to find its position and direction. For static areas the algorithm should yield nothing.
For those who are interested, here are some technical details. Simply put, any change in a pixel from frame to frame can be explained either by spatial motion or by a change in its brightness over time. The latter can be considered temporal motion. More strictly, a commonly used differential equation relating a pixel’s brightness change to its movement is the Optical Flow Constraint:
∂I/∂x*Vx + ∂I/∂y*Vy + ∂I/∂t = 0,
where Vx and Vy are the components of the pixel’s velocity in the spatial directions, and ∂I/∂x, ∂I/∂y and ∂I/∂t are the spatial and temporal partial derivatives of the brightness I. As one can see, the equation is under-determined (one equation, two unknowns) and calls for regularization. The most common way of making the problem well-posed is to add a Smoothness Constraint of some form, which basically postulates that adjacent pixels tend to move in the same or a similar direction. This constraint, which usually comes in the form of a Laplacian or some other combination of second derivatives, effectively transforms the original equation into what is, in general, an over-determined system of linear equations that can be solved approximately using convolutions only. Interested readers may refer to a nice article by Zhaoyi Wei et al., where the idea is clearly explained without too much analytical overhead.
So, the building blocks of the Optical Flow algorithm are spatial and temporal derivatives, which call for intermediate frame buffers, and convolutions, which require buffers, multiplications, summations and divisions. A hardware implementation imposes constraints of its own, the most important being that, to run in real time, the algorithm has to be non-iterative and fully pipelined. Since the sensor produces a new pixel on every clock cycle, each pixel has to be pushed into the processing pipeline before the next one becomes ready. The amount of memory for storing intermediate results is strictly limited to 150 KB in total. And of course, there is no such luxury as floating-point calculations.
Fortunately, both convolutions and spatial derivatives need only a limited number of frame lines at a time, equal to the size of the convolution kernel. Even better, with a tapped shift-register structure they can easily be pipelined so that one output pixel is produced on every clock cycle while a new pixel is pushed into the pipeline.
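The line-buffer idea can be illustrated with a small software model. This is only a sketch under assumptions, not the actual HDL: the function name stream_filter, the use of Python deques as line buffers and the 3×3 kernel size in the usage note are all hypothetical choices made for illustration. Each loop iteration stands for one clock cycle, and the kernel is applied as written (no flipping).

```python
from collections import deque

def stream_filter(pixels, width, kernel):
    """Software model of a k x k window fed one pixel per 'clock'.

    pixels: flat, row-major stream of integers; width: pixels per line.
    Returns the filter output for every fully valid window position.
    """
    k = len(kernel)
    # k-1 line buffers; each delays the stream by exactly one image line.
    line_bufs = [deque([0] * width, maxlen=width) for _ in range(k - 1)]
    # k x k tap registers; window[0] holds the oldest line, window[k-1] the newest.
    window = [deque([0] * k, maxlen=k) for _ in range(k)]
    out = []
    for n, pix in enumerate(pixels):
        row, col = divmod(n, width)
        # Build the column of k vertically aligned pixels, newest first.
        column = [pix]
        for buf in line_bufs:
            delayed = buf[0]        # value that entered this buffer `width` clocks ago
            buf.append(column[-1])  # feed the buffer from the previous stage
            column.append(delayed)
        # Shift the new column into the tap registers.
        for i in range(k):
            window[i].append(column[k - 1 - i])
        # The window is valid only once k lines and k columns have been seen.
        if row >= k - 1 and col >= k - 1:
            out.append(sum(kernel[i][j] * window[i][j]
                           for i in range(k) for j in range(k)))
    return out
```

For a 5-pixel-wide image and a 3×3 kernel, this model stores only two lines of 5 pixels plus 9 tap registers, regardless of the image height, which is the property that makes the approach attractive when memory is scarce.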
Multiplications and divisions are much tougher on an FPGA. There is a limited number of fixed-point DSP multipliers within the chip and there are no dividers. Implementing either in LUT logic quickly eats up the available resources. To overcome the lack of multipliers and dividers, approximate convolution kernels for both smoothing and differentiation were carefully designed. The coefficients, as well as their sum, were chosen to be powers of two, so only summations and shifts were required. Shifts by a constant are resource-free on an FPGA, because they produce no additional logic or interconnect. To reduce resource consumption even further, the separability of the kernels in the spatial directions was heavily exploited, which made it possible to efficiently transform a 2D sub-problem into 1D ones.
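As an illustration of both tricks, here is a minimal sketch that uses the well-known binomial kernel [1, 2, 1]. The kernel choice is an assumption made for the example; the actual coefficients used in the product are not given here. The function names are hypothetical as well. The 2D kernel is the outer product of [1, 2, 1] with itself, its coefficients sum to 16, and so the whole filter needs only additions and shifts.

```python
def pass_121(seq):
    """Unnormalised [1, 2, 1] filter over a sequence (valid positions only).
    Multiplying by 2 is a left shift by one bit."""
    return [seq[i - 1] + (seq[i] << 1) + seq[i + 1]
            for i in range(1, len(seq) - 1)]

def smooth_separable(img):
    """3x3 binomial smoothing as two 1D passes plus one final shift."""
    rows = [pass_121(r) for r in img]                # horizontal pass
    cols = [pass_121(list(c)) for c in zip(*rows)]   # vertical pass, column by column
    # Transpose back; dividing by 16 is a right shift by four bits.
    return [[v >> 4 for v in r] for r in zip(*cols)]

def smooth_direct(img):
    """Reference: the same filter written as a full 2D kernel with
    real multiplications and an integer division."""
    kernel = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
    h, w = len(img), len(img[0])
    return [[sum(kernel[i][j] * img[r - 1 + i][c - 1 + j]
                 for i in range(3) for j in range(3)) // 16
             for c in range(1, w - 1)]
            for r in range(1, h - 1)]
```

For non-negative integer pixels both functions return identical results. The separable form needs 4 additions per output pixel instead of the 8 additions and 9 multiplications of the direct form, at the price of a few extra bits in the intermediate values because normalisation is deferred to the end.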
The next logical step in simplifying the problem was to rewrite the Optical Flow Constraint equation, eliminating the ∂I/∂x*Vx term and leaving only one spatial and one temporal dimension. This is possible for this particular problem because the original task, counting people who cross a virtual line, considers only motion orthogonal to that line and ignores movement parallel to it. At a small cost in quality, this greatly reduced the amount of required calculations and freed up a lot of FPGA resources.
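With the x term removed, the constraint reduces to ∂I/∂y*Vy + ∂I/∂t = 0. The sketch below shows one way such a reduced equation could be solved: a least-squares fit over a small window, giving Vy = -Σ(Iy*It) / Σ(Iy*Iy). This windowed least-squares solver is an assumption chosen for brevity and is not necessarily the regularization used in the product. The function name, the window size and the use of floating point (which the hardware does not have) are likewise illustrative only.

```python
def flow_1d(prev_line, next_line, half_window=2, eps=1e-9):
    """Estimate per-pixel velocity along one axis from two 1D brightness
    profiles taken from consecutive frames."""
    n = len(prev_line)
    iy = [0.0] * n  # spatial derivative
    it = [0.0] * n  # temporal derivative
    for k in range(1, n - 1):
        # Central difference, averaged over both frames.
        iy[k] = ((prev_line[k + 1] - prev_line[k - 1]) +
                 (next_line[k + 1] - next_line[k - 1])) / 4.0
        # Frame difference.
        it[k] = float(next_line[k] - prev_line[k])
    v = [0.0] * n
    for k in range(n):
        lo, hi = max(0, k - half_window), min(n, k + half_window + 1)
        num = sum(iy[j] * it[j] for j in range(lo, hi))
        den = sum(iy[j] * iy[j] for j in range(lo, hi))
        # Textureless windows carry no motion information: report zero.
        v[k] = -num / den if den > eps else 0.0
    return v
```

A brightness ramp shifted by one pixel between frames yields a velocity of about 1 pixel per frame, while identical frames and flat regions yield zero, matching the requirement that static areas produce nothing.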
Additionally, since a highly detailed motion map was not required to solve the problem, frame downscaling was implemented, which both lessened the intermediate buffer requirements and reduced the working clock frequency and power consumption.
The result of the Optical Flow calculation is then combined with the result of the Background Model module, which is also implemented in hardware and works in parallel with the Optical Flow estimator. Next, the combined field is reduced to a line, and line segments corresponding to the persons being counted are extracted. The result of the reduction is then transferred to a CPU, implemented as a soft core on the same chip, for final post-processing and transmission to the myAudience portal over Ethernet.
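The reduction and segment extraction steps could look roughly like the sketch below. The details are not described in this post, so everything here is an assumption for illustration: summing each column as the reduction, a fixed threshold, a minimum segment length, and the function names themselves.

```python
def reduce_to_line(field):
    """Collapse a 2D field (list of equally long rows) into one value
    per column by summing down each column."""
    return [sum(col) for col in zip(*field)]

def extract_segments(profile, threshold, min_len=1):
    """Return (start, end) index pairs, inclusive, of runs where the
    profile stays at or above the threshold for at least min_len samples."""
    segments = []
    start = None
    for i, value in enumerate(profile):
        if value >= threshold:
            if start is None:
                start = i               # a new run begins
        elif start is not None:
            if i - start >= min_len:
                segments.append((start, i - 1))
            start = None                # the run has ended
    # Close a run that reaches the end of the profile.
    if start is not None and len(profile) - start >= min_len:
        segments.append((start, len(profile) - 1))
    return segments
```

For example, a two-row field [[0, 1, 1, 0, 0, 1, 1, 1], [0, 1, 1, 0, 0, 0, 1, 1]] reduces to the profile [0, 2, 2, 0, 0, 1, 2, 2], and a threshold of 2 gives the segments (1, 2) and (6, 7). A single pass like this maps naturally onto a streaming implementation, since it needs only a run-start register and a comparator.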
Other important hardware modules of the system include: Background Model (mentioned above), Debayer (responsible for converting the Bayer pattern coming from the sensor to RGB), Tone-mapper (for compressing the tonal range of input pixels from 12 to 8 bits), JPEG encoder (for streaming preview frames to the calibration web UI), Ethernet MAC, DDR2 controller, LM32 CPU + Embedded Linux (for running the Ethernet stack, transmitting People Count results to the myAudience portal, and running the web server and JPEG preview streamer), I2C master (for programming the sensor’s registers), UART and others. All of them were successfully fitted into a single 70K LUT FPGA, consuming about 85% of the available chip resources in both LUTs and memory blocks and forming a finished, production-ready People Count solution targeted at embedded systems.