Implementation of an 8-bit Dynamic Fixed-Point Convolutional Neural Network for Human Sign Language Recognition on a Xilinx FPGA Board
Abstract: The goal of this thesis work is to implement a convolutional neural network on an FPGA device with the capability of recognising human sign language. The set of gestures that the neural network can identify is taken from Swedish Sign Language and consists of the signs used to represent the letters of the Swedish alphabet (a.k.a. fingerspelling). The motivation for this project lies in the tremendous interest that neural networks have attracted in recent years for their ability to solve complex problems and their capacity to learn by example. More specifically, convolutional neural networks are used extensively for image classification, and this project aims to design a hardware accelerator that computes the convolutional layers of this type of network and to test its accuracy and performance on human sign language. The network topology of choice is ZynqNet, proposed by Gschwend in 2016, which has already been implemented successfully on an FPGA platform and has been trained on the large image dataset provided by ImageNet for its popular image recognition contest. In this regard, the aim of this work is not to propose a new neural network topology but to reuse an existing one, introducing improvements such as an 8-bit dynamic fixed-point scheme, and to challenge it with a different but related task: human sign language recognition. The methodology followed to carry out a successful hardware implementation consisted, first, of installing and setting up a reliable framework for training the neural network. Different frameworks were tried, such as MATLAB and Caffe, but DIGITS from NVIDIA finally proved the most convenient because of its graphical environment and because it provides the compatibility and drivers needed to run with the GPU used in this project.
Then, an image dataset of more than 13,000 pictures of hand gestures was built to provide enough input data for the framework to fine-tune ZynqNet for the new task, i.e. to give the neural network the ability to classify the different hand signs into their corresponding alphabet letters. In parallel, the Register-Transfer Level (RTL) abstraction of the hardware architecture was generated using a High-Level Synthesis tool chain, in which the algorithmic descriptions are written in C/C++. Finally, the design was validated by means of co-simulation, in which the golden data obtained from the C test bench is compared with the output data of the RTL implementation, all within the simulation environment provided by the Vivado Design Suite. As a result, the best-performing solution achieved an accuracy of 80.1% in the inference test and a frame rate of 6.4 FPS at a clock frequency of 250 MHz.
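
To make the 8-bit dynamic fixed-point idea mentioned above more concrete, the following is a minimal sketch, not the thesis implementation. It assumes the common formulation in which every group of values (for example, the weights or activations of one layer) shares the same 8-bit word length but gets its own fractional length, chosen from the largest magnitude in the group. The function names `choose_frac_bits`, `quantize` and `dequantize` are hypothetical and used only for illustration.

```python
import math


def choose_frac_bits(values, total_bits=8):
    """Pick a fractional length so the largest magnitude in the group still fits.

    Assumption: one bit is reserved for the sign, and the remaining bits are
    split between the integer and fractional parts.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return total_bits - 1  # all zeros: spend every non-sign bit on the fraction
    int_bits = math.floor(math.log2(max_abs)) + 1  # bits left of the binary point
    return total_bits - 1 - int_bits


def quantize(values, frac_bits, total_bits=8):
    """Scale by 2**frac_bits, round, and saturate to the signed integer range."""
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    scale = 2.0 ** frac_bits
    return [max(lo, min(hi, round(v * scale))) for v in values]


def dequantize(codes, frac_bits):
    """Map integer codes back to real values for comparison with the originals."""
    scale = 2.0 ** frac_bits
    return [c / scale for c in codes]


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the thesis.
    weights = [0.5, -0.25, 0.1]       # small values: many fractional bits
    activations = [3.0, -12.5, 40.0]  # large values: few fractional bits
    for name, group in (("weights", weights), ("activations", activations)):
        fl = choose_frac_bits(group)
        codes = quantize(group, fl)
        print(name, "frac_bits =", fl, "codes =", codes, "approx =", dequantize(codes, fl))
```

In this sketch the small-valued group ends up with 7 fractional bits and the large-valued group with 1, which is the point of a dynamic scheme: the binary point moves per group so that 8 bits cover very different value ranges.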