Object Detection Using CNNs on Microcontrollers with TensorFlow Lite
Introduction
Computer vision is the field that enables machines to interpret and understand visual information from images. Object detection, a key area within computer vision, focuses on identifying and locating objects within images or video streams. The field started in the 1960s but evolved rapidly, especially after 2010, following the advent of deep learning and the introduction of CNNs. These made it possible to train models directly on raw pixel data, leading to major improvements in accuracy and speed. By 2024, the first Level 3 autonomous vehicle, capable of driving up to 95 km/h under certain conditions on German highways, had been certified [1].
However, these are computationally intensive operations that require substantial hardware resources. Still, object detection can be implemented on resource-constrained embedded systems, and this is what we will explore in this article.
Use Case and Constraints
There are many use cases where computer vision can be applied to IoT edge devices, but for this article we will implement a model that detects when a person is wearing safety equipment, specifically a high-visibility vest and/or a helmet. The main advantage of running the model on the edge device is cost saving: the images don't have to be sent to a server for analysis, which reduces the network traffic immensely for every device. Another advantage is the improved response time, since data doesn't need to be transmitted to a server, analyzed, and sent back. At the same time, the system is immune to any drop in the data connection. As for constraints, the resolution of the image should be 240×240, resulting in a size of 57 KB per image, the object detection should take at most 1 second, and the accuracy should be above 80%. To properly display the images with the detected objects, they will be sent via Wi-Fi to a PC, where they will be displayed.
Hardware
We will be using the Espressif ESP32-S3-EYE board. It features a dual-core Xtensa® 32-bit LX7 microprocessor running at up to 240 MHz and 512 KB of internal RAM, supplemented by 8 MB of external SPI RAM and 8 MB of flash memory. The processor includes 128-bit vector instructions optimized for machine learning tasks. The board supports TensorFlow Lite for Microcontrollers, enabling the deployment of lightweight machine learning models directly on the device. This allows for efficient inference of trained models without needing a cloud connection, which is exactly what we plan to do. The board also contains an LCD, but at 1.3 inches it is too small for our purposes, so we won't be using it.
Model Used
The model we will be using is FOMO [2]. Its main benefits compared to other models are its speed and size. For example, compared to MobileNet SSD, FOMO is up to 30 times faster [2] and occupies up to 10 times less memory. Needing less than 200 KB of RAM [2], it can fit in most microcontrollers. The reason for these huge differences lies in the design of the algorithm. Rather than detecting the exact position and size of an object, the algorithm focuses only on its position in the image. Based on this observation, classical bounding boxes are no longer needed; a detection based on the centroids of objects is enough [2]. These advantages come with two limitations: the objects need to be similar in size, and they can't be too close to one another. However, these limitations have a limited impact on our use case and the training set used.
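To make the centroid idea concrete, here is a minimal, hypothetical sketch of how a FOMO-style output grid could be turned into centroid detections. The grid size, class names, threshold, and simple per-cell logic are assumptions for illustration, not the exact post-processing of the deployed model.

```python
# Hedged sketch: converting a FOMO-style per-cell probability grid into
# centroid detections. Grid size, class order and threshold are assumptions.
import numpy as np

CLASS_NAMES = ["background", "helmet", "vest"]  # assumed label order

def grid_to_centroids(prob_grid: np.ndarray, threshold: float = 0.5):
    """prob_grid: (rows, cols, num_classes) per-cell class probabilities.

    Returns (class_name, x_center, y_center, confidence) tuples, with the
    centers expressed as fractions of the image width/height. A fuller
    implementation would also merge adjacent active cells into one object.
    """
    rows, cols, _ = prob_grid.shape
    detections = []
    for r in range(rows):
        for c in range(cols):
            class_id = int(np.argmax(prob_grid[r, c]))
            confidence = float(prob_grid[r, c, class_id])
            if class_id == 0 or confidence < threshold:
                continue  # background cell or too uncertain
            # centroid of the cell, relative to the whole image
            x_center = (c + 0.5) / cols
            y_center = (r + 0.5) / rows
            detections.append((CLASS_NAMES[class_id], x_center, y_center, confidence))
    return detections

# Example: a 12x12 grid with 3 classes, as a FOMO-style head might produce
fake_grid = np.zeros((12, 12, 3))
fake_grid[..., 0] = 1.0               # everything is background...
fake_grid[4, 7] = [0.05, 0.9, 0.05]   # ...except one cell with a "helmet"
print(grid_to_centroids(fake_grid))
```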
Training Set
For training our model we used an open-source dataset, released under the Apache 2.0 license, found on Kaggle.com [3]. The dataset consists of 10,500 images captured in diverse locations, such as construction sites, factories, and outdoor work environments. Each image contains an average of five annotated objects, providing a comprehensive set of scenarios for the model to learn from. The advantage of using an existing dataset is that the time needed to create or gather images is saved, and using an open-source one means the dataset is free. The disadvantage is the unknown quality of the dataset. For example, our chosen dataset contains invalidly annotated objects, as you can see in the images below, as well as images that are not relevant but would still influence the training of the model.
And especially with a big dataset, curating it takes a lot of time. In our case, from the 10,500 images in the set, we ended up after curation with only 1,892 images, so below 20%, meaning that less than 1 image in 5 was useful to us.
Training
For training, the following parameters could be adjusted:
- Batches — the number of epochs to train the neural network for, i.e., the number of times the training data is passed through the model
- Learning rate — how fast the neural network learns; it defines how big a step the model takes during training. Too high a learning rate will make the learning jump over minima, while too low a learning rate will either take too long to converge or get stuck in an undesirable local minimum
- Batch size — the number of data samples processed together in one pass during training. A smaller batch size often results in noisier gradient estimates, which can help escape local minima and potentially lead to better generalization, but may also cause more fluctuations in the training process. Conversely, a larger batch size provides more accurate gradient estimates, which can make training more stable and utilize hardware resources more efficiently, but it may also increase the risk of converging to suboptimal minima and require adjustments to the learning rate and other hyperparameters
- Data augmentation — applies small changes to the dataset to create new data samples. The most common methods are geometric transformations, color space adjustments, and noise injection
After running more than 50 trainings, the best results were obtained with the following parameters: Batches — 80, Learning Rate — 0.01, Batch Size — 152, with data augmentation enabled and the learned optimizer disabled. With these parameters, the accuracy on the test set was 75.25%, while on the validation set it was 88.5%, which indicates slight overfitting.
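As an illustration only, the sketch below shows how these hyperparameters could map onto a generic tf.keras training run. The backbone, input size, class count, and dataset pipeline are placeholders and not the actual FOMO architecture or tooling used to obtain the results above.

```python
# Hedged sketch: the chosen hyperparameters applied to a generic tf.keras
# training run. The model below is a placeholder, not the real FOMO network.
import tensorflow as tf

EPOCHS = 80            # "Batches" parameter: passes over the training data
LEARNING_RATE = 0.01
BATCH_SIZE = 152

# Data augmentation: geometric transformations applied on the fly
augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.05),
    tf.keras.layers.RandomZoom(0.1),
])

def build_model(num_classes: int = 3) -> tf.keras.Model:
    # Placeholder convolutional backbone with a per-cell classification head,
    # mimicking the centroid-grid output described earlier
    inputs = tf.keras.Input(shape=(96, 96, 3))
    x = augmentation(inputs)
    x = tf.keras.layers.Conv2D(16, 3, strides=2, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
    outputs = tf.keras.layers.Conv2D(num_classes, 1, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_model()
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# train_ds and val_ds would be tf.data.Dataset pipelines built from the
# curated helmet/vest images, batched with BATCH_SIZE:
# model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS)
```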
Testing
Testing the model showed overall good performance; it was able to detect objects even when several were close together. The main limitation seems to be that it prefers uniform backgrounds: as soon as the background is no longer uniform, the detection stops working.
Software Implementation
To obtain the best possible frame rate with a complex model whose inference alone takes around 1120 milliseconds, the following architecture decisions were implemented:
— Displaying the image with the detected objects is offloaded to the receiving device: the ESP captures the image, runs the detection, and sends both the image and the detection result over Wi-Fi to the PC. A Python script on the PC (sketched after this list) performs the following actions:
- Every 50 milliseconds, if an image is available, it is displayed using the OpenCV (cv2) library
- It creates a TCP server and waits for a connection and messages from the ESP device. The ESP sends two types of messages, distinguished by an identifier: either a JPEG image or the result of the detection
- When a JPEG image is received from the ESP, the script loads it as a cv2 image
- When a detection message is received, the script parses the message and draws each detected object's centroid position, as well as the detection certainty, on the image
— Two tasks run in parallel on the ESP, one on each core, performing the following actions:
- On core 0, the task takes a photo from the camera module as a JPEG image, sends it to the queue for the second task, and then also sends it via Wi-Fi to the PC
- A queue with a size of one element transfers the image between the tasks running on the two cores
- On core 1, the task waits until it receives an image from the queue, converts it from JPEG to RGB888, crops and interpolates it if needed, runs the detection, and then sends the results to the receiving PC if any objects were detected
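Below is a minimal sketch of the PC-side receiver described above. The message framing (a one-byte type identifier followed by a four-byte length prefix), the port number, and the detection payload format are assumptions, since the actual protocol isn't shown here; in the real script, receiving and displaying are decoupled, while here they are combined in one loop for brevity.

```python
# Hedged sketch of the PC-side receiver: framing, port and payload format
# are assumptions, not the actual protocol used in the project.
import socket
import struct
import cv2
import numpy as np

HOST, PORT = "0.0.0.0", 3333   # assumed listening port

def recv_exact(conn: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from the socket."""
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("connection closed")
        buf += chunk
    return buf

def main() -> None:
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((HOST, PORT))
    server.listen(1)
    conn, _ = server.accept()

    image = None
    while True:
        # Assumed framing: 1-byte message type, 4-byte big-endian payload length
        msg_type = recv_exact(conn, 1)
        (length,) = struct.unpack(">I", recv_exact(conn, 4))
        payload = recv_exact(conn, length)

        if msg_type == b"I":                           # JPEG image from the ESP
            image = cv2.imdecode(np.frombuffer(payload, np.uint8), cv2.IMREAD_COLOR)
        elif msg_type == b"D" and image is not None:   # detection result
            # Assumed payload: "label,x,y,score" lines, x/y as image fractions
            for line in payload.decode().splitlines():
                label, x, y, score = line.split(",")
                cx = int(float(x) * image.shape[1])
                cy = int(float(y) * image.shape[0])
                cv2.circle(image, (cx, cy), 6, (0, 255, 0), 2)
                cv2.putText(image, f"{label} {float(score):.2f}", (cx + 8, cy),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)

        if image is not None:
            cv2.imshow("ESP32-S3-EYE detections", image)
        if cv2.waitKey(50) & 0xFF == ord("q"):         # refresh roughly every 50 ms
            break

if __name__ == "__main__":
    main()
```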
By parallelizing the architecture as much as possible, the average time between frames was reduced to 1195 milliseconds, of which just 75 milliseconds is overhead for capturing the image in JPEG format, converting it to RGB888, sending the image and the detection result to the receiving PC, and receiving, parsing, and displaying this data.
Conclusion
Computer vision has evolved tremendously in the last 10 years, perhaps more than any other field in this timeframe. It is not only advances in hardware performance that push the field forward, but even more so the development of the software algorithms used for object detection. Frameworks like TensorFlow Lite, OpenVINO, and PyTorch Mobile, coupled with the introduction of dedicated AI accelerators and microcontrollers, have made it feasible to run CNNs on embedded systems. The remaining challenges are finding the proper datasets to train the models and adopting computer vision in day-to-day applications in an economically viable way.
References
[2] https://www.edgeimpulse.com/blog/announcing-fomo-faster-objects-more-objects/
[3] https://www.kaggle.com/datasets/niravnaik/safety-helmet-and-reflective-jacket/data