Posts | Comments

Planet Arduino

Archive for the ‘TensorFlow Lite for Microcontrollers’ Category

Being able to add dynamic lighting and images that can synchronize with a dancer is important to many performances, which rely on both and music and visual effects to create the show. Eduardo Padrón aimed to do exactly that by monitoring a performer’s moves with an accelerometer and playing the appropriate effect based on the recognized movement. 

Padrón’s system is designed around a Raspberry Pi 4 running an MQTT server for communication with auxiliary IoT boards. Movement data was collected via a Nano 33 BLE Sense and its onboard accelerometer to gather information and send it to a Google Colab environment. From here, a model was trained on these samples for 600 epochs, achieving an accuracy of around 91%. After deploying this model onto the Arduino, he was able to output the correct gesture over USB where it interacts with the running Python script. Once the gesture is received, the MQTT server publishes the message to any client devices such as an ESP8266 for lighting and plays an associated video or sound. 

Padrón plans on adding more features to his system in the future such as Bluetooth connectivity, kinetic sculptures, and more animation effects. You can read more about his project, which was named a winner in the TensorFlow Lite for Microcontrollers Challenge, on DevPost here.

The post Mapping Dance syncs movement and stage lighting using tinyML appeared first on Arduino Blog.

In this deep dive article, performance optimization specialist Larry Bank (a.k.a The Performance Whisperer) takes a look at the work he did for the Arduino team on the latest version of the Arduino_OV767x library.

Arduino recently announced an update to the Arduino_OV767x camera library that makes it possible to run machine vision using TensorFlow Lite Micro on your Arduino Nano 33 BLE board. 

If you just want to try this and run machine learning on Arduino, you can skip to the project tutorial.

The rest of this article is going to look at some of the lower level optimization work that made this all possible. There are higher performance industrial-targeted options like the Arduino Portenta available for machine vision, but the Arduino Nano 33 BLE has sufficient performance with TensorFlow Lite Micro support ready in the Arduino IDE. Combined with an OV767x module makes a low-cost machine vision solution for lower frame-rate applications like the person detection example in TensorFlow Lite Micro.

Need for speed

Recent optimizations done by Google and Arm to the CMSIS-NN library also improved the TensorFlow Lite Micro inference speed by over 16x, and as a consequence bringing down inference time from 19 seconds to just 1.2 seconds on the Arduino Nano 33 BLE boards.  By selecting the person_detection example in the Arduino_TensorFlowLite library, you are automatically including CMSIS-NN underneath and benefitting from these optimizations. The only difference you should see is that it runs a lot faster!

The CMSIS-NN library provides optimized neural network kernel implementations for all Arm’s Cortex-M processors, ranging from Cortex-M0 to Cortex-M55. The library utilizes the processor’s capabilities, such as DSP and M-Profile Vector (MVE) extensions, to enable the best possible performance. 

The Arduino Nano 33 BLE board is powered by Arm Cortex-M4, which supports DSP extensions. That will enable the optimized kernels to perform multiple operations in one cycle using SIMD (Single Instruction Multiple Data) instructions. Another optimization technique used by the CMSIS-NN library is loop unrolling. These techniques combined will give us the following example where the SIMD instruction, SMLAD (Signed Multiply with Addition), is used together with loop unrolling to perform a matrix multiplication y=a*b, where

  a=[1,2]

and

  b=[3,5
     4,6]

a, b are 8-bit values and y is a 32-bit value. With regular C, the code would look something like this:

for(i=0; i<2; ++i)
    for(j=0; j<2; ++j)
      y[i] += a[j] * b[j][i]

However, using loop unrolling and SIMD instructions, the loop will end up looking like this:

a_operand = a[0] | a[1] << 16 // put a[0], a[1] into one variable
  for(i=0; i<2; ++i)
    b_operand = b[0][i] | b[1][i] << 16 // vice versa for b
    y[i] = __SMLAD(a_operand, b_operand, y[i])

This code will save cycles due to

  • fewer for-loop checks
  • __SMLAD performs two multiply and accumulate in one cycle

This is a simplified example of how two of the CMSIS-NN optimization techniques are used.

Figure 1: Performance with initial versions of libraries
Figure 2: Performance with CMSIS-NN optimizations 

This improvement means the image acquisition and preprocessing stages now have a proportionally bigger impact on machine vision performance. So in Arduino our objective was to improve the overall performance of machine vision inferencing on Arduino Nano BLE sense by optimizing the Arduino_OV767X library while maintaining the same library API, usability and stability. 

Figure 3: Performance with CMSIS-NN and camera library optimizations 

For this, we enlisted the help of Larry Bank who specializes in embedded software optimization. Larry’s work got the camera image read down from 1500ms to just 393ms for a QCIF (176×144 pixel) image. This was a great improvement!

Let’s have a look at how Larry approached the camera library optimization and how some of these techniques can apply to your Arduino code in general.

Performance optimizing Arduino code

It’s rarely practical or necessary to optimize every line of code you write. In fact there are very good reasons to prioritize readable, maintainable code. Being readable and optimized don’t necessarily have to be mutually exclusive. However, embedded systems have constrained resources, and when applications demand more performance, some trade-offs might have to be made. Sometimes it is necessary to restructure algorithms, pay attention to compiler behavior, or even analyze timing of machine code instructions in order to squeeze the most out of a microcontroller. In some cases this can make the code less readable — but the beauty of an Arduino library is that this can be abstracted (hidden) from user sketch code beneath the cleaner library function APIs. 

What does “Camera.readFrame” do?

We’ve connected a camera to the Arduino. The Arduino_OV767X library sets up the camera and lets us transfer the raw image data from the camera into the Arduino Nano BLE memory. The smallest resolution setting, QCIF, is 176 x 144 pixels. Each pixel is encoded in 2 bytes. We therefore need to transfer at least 50688 bytes (176 x 144 x 2 ) every time we capture an image with Camera.readFrame. Because the function is performing a byte read operation over 50 thousand times per frame, the way it’s implemented has a big impact on performance. So let’s have a look at how we can most efficiently connect the camera to the Arduino and read a byte of data from it. 

Philosophy

I tend to see the world of code through the “lens” of optimization. I’m not advocating for everyone to share my obsession with optimization. However, when it does become necessary, it’s helpful to understand details of the target hardware and CPU. What I often encounter with my clients is that their code implements their algorithm neatly and is very readable, but it’s not necessarily ‘performance friendly’ to the target machine. I assume this is because most people see code from a top-down approach: they think in terms of the abstract math and how to process the data. My history in working with very humble machines and later turning that into a career has flipped that narrative on its head. I see software from the bottom up: I think about how the memory, I/O and CPU registers interact to move and process the data used by the algorithm. It’s often possible to make dramatic improvements to the code execution speed without losing any of its readability. When your readable/maintainable solution still isn’t fast enough, the next phase is what I call ‘uglification.’ This involves writing code that takes advantage of specific features of the CPU and is nearly always more difficult to follow (at least at first glance!).

Optimization methodology

Optimization is an iterative process. I usually work in this order:

  • Test assumptions in the algorithm (sometimes requires tracing the data)
  • Make innocuous changes in the logic to better suit the CPU (e.g. change modulus to logical AND)
  • Flatten the hierarchy or simplify overly nested classes/structures
  • Test any slow/fast paths (aka statistics of the data — e.g. is 99% of the incoming data 0?)
  • Go back to the author(s) and challenge their decisions on data precision / storage
  • Make the code more suitable for the target architecture (e.g. 32 vs 64-bit CPU registers)
  • If necessary (and permitted by the client) use intrinsics or other CPU-specific features
  • Go back and test every assumption again

If you would like to investigate this topic further, I’ve written a more detailed presentation on Writing Performant C++ code.

Depending on the size of the project, sometimes it’s hard to know where to start if there are too many moving parts. If a profiler is available, it can help narrow the search for the “hot spots” or functions which are taking the majority of the time to do their work. If no profiler is available, then I’ll usually use a time function like micros() to read the current tick counter to measure execution speed in different parts of the code. Here is an example of measuring absolute execution time on Arduino:

long lTime;
  lTime = micros();
  <do the work>
  iTime = micros() - lTime;
  Serial.printf(“Time to execute xxx = %d microseconds\n”, (int)lTime);

I’ve also used a profiler for my optimization work with OpenMV. I modified the embedded C code to run as a MacOS command line app to make use of the excellent XCode Instruments profiler. When doing that, it’s important to understand how differently code executes on a PC versus embedded — this is mostly due to the speed of the CPU compared to the speed of memory.

Pins, GPIO and PORTs

One of the most powerful features of the Arduino platform is that it presents a consistent API to the programmer for accessing hardware and software features that, in reality, can vary greatly across different target architectures. For example, the features found in common on most embedded devices like GPIO pins, I2C, SPI, FLASH, EEPROM, RAM, etc. have many diverse implementations and require very different code to initialize and access them.

Let’s look at the first in our list, GPIO (General Purpose Input/Output pins). On the original Arduino Uno (AVR MCU), the GPIO lines are arranged in groups of 8 bits per “PORT” (it’s an 8-bit CPU after all) and each port has a data direction register (determines if it’s configured for input or output), a read register and a write register. The newer Arduino boards are all built around various Arm Cortex-M microcontrollers. These MCUs have GPIO pins arranged into groups of 32-bits per “PORT” (hmm – it’s a 32-bit CPU, I wonder if that’s the reason). They have a similar set of control mechanisms, but add a twist — they include registers to SET or CLR specific bits without disturbing the other bits of the port (e.g. port->CLR = 1; will clear GPIO bit 0 of that port). From the programmer’s view, Arduino presents a consistent set of functions to access these pins on these diverse platforms (clickable links below to the function definitions on Arduino.cc):

pinMode(pin, mode);
digitalRead(pin);
digitalWrite(pin, value);

For me, this is the most powerful idea of Arduino. I can build and deploy my code to an AVR, a Cortex-M, ESP8266 or an ESP32 and not have to change a single line of code nor maintain multiple build scripts. In fact, in my daily work (both hobby and professional), I’m constantly testing my code on those 4 platforms. For example, my LCD/OLED display library (OneBitDisplay) can control various monochrome LCD and OLED displays and the same code runs on all Arduino boards and can even be built on Linux.

One downside to having these ‘wrapper’ functions hide the details of the underlying implementation is that performance can suffer. For most projects it’s not an issue, but when you need to get every ounce of speed out of your code, it can make a huge difference.

Camera data capture

One of the biggest challenges of this project was that the original OV7670 library was only able to run at less than 1 frame per second (FPS) when talking to the Nano 33. The reason for the low data rate is that the Nano 33 doesn’t expose any hardware which can directly capture the parallel image data, so it must be done ‘manually’ by testing the sync signals and reading the data bits through GPIO pins (e.g. digitalRead) using software loops. The Arduino pin functions (digitalRead, digitalWrite) actually contain a lot of code which checks that the pin number is valid, uses a lookup table to convert the pin number to the I/O port address and bit value and may even disable interrupts before reading or changing the pin state. If we were to use the digitalRead function for an application like this, it would limit the data capture rate to be too slow to operate the camera. You’ll see this further down when we examine the actual code used to capture the data. 

First, a quick review of the OV7670 camera module: According to its datasheet, it’s capable of capturing a VGA (640×480) color image at up to 30 FPS. The kit used for this project has the camera mounted to a small PCB and presents an 8-bit parallel data bus and various sync signals.

It requires an external “master clock” (MCLK in the photo) to drive its internal state machine which is used to generate all of the other timing signals. The Nano 33 can provide this external clock source by using its I2S clock. The OV767X library sets this master clock to 16Mhz (the camera can handle up to 48Mhz) and then there is a set of configuration registers to divide this value to arrive at the desired frame rate. Only a few possible frame rates are available (1, 5, 10, 15, 20, and 30 FPS).

Above is one of the timing diagrams from the OV7670 datasheet. This particular drawing shows the timing of the data for each byte received along each image row. The HREF signal is used to signal the start and end of a row and then each byte is clocked in with the PCLK signal. The original library code read each bit (D0-D7) in a loop and combined them together to form each data byte. The image data comes quickly, so we have very little time to read each byte. Assembling them one bit at a time is not very efficient. You might be thinking that it’s not that hard of a problem to solve on the Nano 33. After all, it has 22 GPIO pins and the Cortex-M inside it has 32-bit wide GPIO ports, so just hook up the data bits sequentially and you’ll be able to read the 8 data bits in one shot, then Mission Accomplished™. If only things were that easy. The Nano 33 does have plenty of GPIO pins, but there isn’t a continuous sequence of 8 bits available using any of the pins! I’m guessing that the original code did it one bit at a time because it didn’t look like there was a better alternative. In the pinout diagram below, please notice the P0.xx and P1.xx numbers. These are the Cortex-M GPIO port 0 and 1-bit numbers (other Cortex-M processors would label them PA and PB).

I wasn’t going to let this little bump in the road stop me from making use of bit parallelism. If you look carefully at the bit positions, the best continuous run we can get is 6 bits in a row with P1.10 through P1.15. It’s not possible to read the 8 data bits in one shot…or is it? If we connect D0/D1 of the camera to P1.02/P1.03 and D2-D7 to P1.10-P1.15, we can do a single 32-bit read from port P1 and get all 8 bits in one shot. The bits are in order, but will have a gap between D1 and D2 (P1.04 to P1.09). Luckily the Arm CPU has what’s called a barrel shifter. It also has a smart instruction set which allows data to be shifted ‘for free’ at the same time the instruction is doing something else. Let’s take a look at how and why I changed the code:

Original:

uint8_t in = 0;
  for (int k = 0; k < 8; k++) {
     bitWrite(in, k, (*_dataPorts[k] & _dataMasks[k]) != 0);
  }

Optimized:

  uint32_t in = port->IN; // read all bits in parallel
  in >>= 2; // place bits 0 and 1 at the "bottom" of the 
register
  in &= 0x3f03; // isolate the 8 bits we care about
  in |= (in >> 6); // combine the upper 6 and lower 2 bits

Code analysis

If you’re not interested in the nitty gritty details of the code changes I made, you can skip this section and go right to the results below.First, let’s look at what the original code did. When I first looked at it, I didn’t recognize bitWrite; apparently it’s not a well known Arduino bit manipulation macro; it’s defined as:

#define bitWrite(value, bit, bitvalue) (bitvalue ? 
bitSet(value, bit) : bitClear(value, bit))

This macro was written with the intention of being used on GPIO ports (the variable value) where the logical state of bitvalue would be turned into a single write of a 0 or 1 to the appropriate bit. It makes less sense to be used on a regular variable because it inserts a branch to switch between the two possible outcomes. For the task at hand, it’s not necessary to use bitClear() on the in variable since it’s already initialized to 0 before the start of each byte loop. A better choice would be:

if (*_dataPorts[k] & _dataMasks[k]) in |= (1 << k);

The arrays _dataPorts[] and _dataMasks[] contain the memory mapped GPIO port addresses and bit masks to directly access the GPIO pins (bypassing digitalRead). So here’s a play-by-play of what the original code was doing:

  • Set in to 0
  • Set k to 0
  • Read the address of the GPIO port from _dataPorts[] at index k
  • Read the bit mask of the GPIO port from _dataMasks[] at index k
  • Read 32-bit data from the GPIO port address
  • Logical AND the data with the mask
  • Shift 1 left by k bits to prepare for bitClear and bitSet
  • Compare the result of the AND to zero
  • Branch to bitSet() code if true or use bitClear() if false
  • bitClear or bitSet depending on the result
  • Increment loop variable k
  • Compare k to the constant value 8
  • Branch if less back to step 3
  • Repeat steps 3 through 13, 8 times
  • Store the byte in the data array (not shown above)

The new code does the following:

  • Read the 32-bit data from the GPIO port address
  • Shift it right by 2 bits
  • Logical AND (mask) the 8 bits we’re interested in
  • Shift and OR the results to form 8 continuous bits
  • Store the byte in the data array (not shown above)

Each of the steps listed above basically translates into a single Arm instruction. If we assume that each instruction takes roughly the same amount of time to execute (mostly true on Cortex-M), then old vs. new is 91 versus 5 instructions to capture each byte of camera data, an 18x improvement! If we’re capturing a QVGA frame (320x240x2 = 153600 bytes), that becomes many millions of extra instructions.

Results

The optimized byte capture code translates into 5 Arm instructions and allows the capture loop to now handle a setting of 5 FPS instead of 1 FPS. The FPS numbers don’t seem to be exact, but the original capture time (QVGA @ 1 FPS) was 1.5 seconds while the new capture time when set to 5 FPS is 0.393 seconds. I tested 10 FPS, but readFrame() doesn’t read the data correctly at that speed. I don’t have an oscilloscope handy to probe the signals to see why it’s failing. The code may be fast enough now (I think it is), but the sync signals may become too unstable at that speed. I’ll leave this as an exercise to the readers who have the equipment to see what happens to the signals at 10 FPS.

For the work I did on the OV767X library, I created a test fixture to make sure that the camera data was being received correctly. For ML/data processing applications, it’s not necessary to do this. The built-in camera test pattern can be used to confirm the integrity of the data by using a CRC32.

My tinned protoboard test fixture with 320×240 LCD

Note: The frames come one immediately after another. If you capture a frame and then do some processing and then try to capture another frame, you may hit the middle of the next frame when you call readFrame(). The code will then wait until the next VSync signal, so that frame’s capture time could be as much as 2x as long as a single frame time.

More tips

I enjoy testing the limits of embedded hardware, especially when it involves bits, bytes and pixels. I’ve written a few blog posts that explore the topics of speed and power usage if you’re interested in learning more about it.

Conclusion

  • The embedded microcontrollers available today are capable of handling jobs that were unimaginable just a few years ago.
  • Optimized ML solutions from Google and Edge Impulse are capable of running on low-cost, battery-powered boards (vision, vibration, audio, whatever sensor you want to monitor).
  • Python and Arduino programming environments can test your project idea with little effort.
  • Software can be written an infinite number of ways to accomplish the same task, but one constant remains: TANSTATFC (there ain’t no such thing as the fastest code).
  • Never assume the performance you’re seeing is what you’re stuck with. Think of existing libraries and generic APIs available through open source libraries and environments as a starting point.
  • Knowing a bit of info about the target platform can be helpful, but it’s not necessary to read the MCU datasheet. In the code above, the larger concept of Arm Cortex-M 32-bit GPIO ports was sufficient to accomplish the task without knowing the specifics of the nRF52’s I/O hardware.
  • Don’t be afraid to dig a little deeper and test every assumption.
  • If you encounter difficulties, the community is large and there are a ton of resources out there. Asking for help is a sign of strength, not weakness.

Machine learning (ML) algorithms come in all shapes and sizes, each with their own trade-offs. We continue our exploration of TinyML on Arduino with a look at the Arduino KNN library.

In addition to powerful deep learning frameworks like TensorFlow for Arduino, there are also classical ML approaches suitable for smaller data sets on embedded devices that are useful and easy to understand — one of the simplest is KNN.

One advantage of KNN is once the Arduino has some example data it is instantly ready to classify! We’ve released a new Arduino library so you can include KNN in your sketches quickly and easily, with no off-device training or additional tools required. 

In this article, we’ll take a look at KNN using the color classifier example. We’ve shown the same application with deep learning before — KNN is a faster and lighter weight approach by comparison, but won’t scale as well to larger more complex datasets. 

Color classification example sketch

In this tutorial, we’ll run through how to classify objects by color using the Arduino_KNN library on the Arduino Nano 33 BLE Sense.

To set up, you will need the following:

  • Arduino Nano 33 BLE Sense board
  • Micro USB cable
  • Open the Arduino IDE or Arduino Create
  • Install the Arduino_KNN library 
  • Select ColorClassifier from File > Examples > Arduino_KNN 
  • Compile this sketch and upload to your Arduino board

The Arduino_KNN library

The example sketch makes use of the Arduino_KNN library.  The library provides a simple interface to make use of KNN in your own sketches:

#include <Arduino_KNN.h>

// Create a new KNNClassifier
KNNClassifier myKNN(INPUTS);

In our example INPUTS=3 – for the red, green and blue values from the color sensor.

Sampling object colors

When you open the Serial Monitor you should see the following message:

Arduino KNN color classifier
Show me an example Apple

The Arduino board is ready to sample an object color. If you don’t have an Apple, Pear and Orange to hand you might want to edit the sketch to put different labels in. Keep in mind that the color sensor works best in a well lit room on matte, non-shiny objects and each class needs to have distinct colors! (The color sensor isn’t ideal to distinguish between an orange and a tangerine — but it could detect how ripe an orange is. If you want to classify objects by shape you can always use a camera.)

When you put the Arduino board close to the object it samples the color and adds it to the KNN examples along with a number labelling the class the object belongs to (i.e. numbers 0,1 or 2 representing Apple, Orange or Pear). ML techniques where you provide labelled example data are also called supervised learning.

The code in the sketch to add the example data to the KNN function is as follows:

readColor(color);

// Add example color to the KNN model
myKNN.addExample(color, currentClass);

The red, green and blue levels of the color sample are also output over serial:

The sketch takes 30 color samples for each object class. You can show it one object and it will sample the color 30 times — you don’t need 30 apples for this tutorial! (Although a broader dataset would make the model more generalized.)

Classification

With the example samples acquired the sketch will now ask to guess your object! The example reads the color sensor using the same function as it uses when it acquired training data — only this time it calls the classify function which will guess an object class when you show it a color:

 readColor(color);

 // Classify the object
 classification = myKNN.classify(color, K);

You can try showing it an object and see how it does:

Let me guess your object
0.44,0.28,0.28
You showed me an Apple

Note: It will not be 100% accurate especially if the surface of the object varies or the lighting conditions change. You can experiment with different numbers of examples, values for k and different objects and environments to see how this affects results.

How does KNN work?

Although the  Arduino_KNN library does the math for you it’s useful to understand how ML algorithms work when choosing one for your application. In a nutshell, the KNN algorithm classifies objects by comparing how close they are to previously seen examples. Here’s an example chart with average daily temperature and humidity data points. Each example is labelled with a season:

To classify a new object (the “?” on the chart) the KNN classifier looks for the most similar previous example(s) it has seen.  As there are two inputs in our example the algorithm does this by calculating the distance between the new object and each of the previous examples. You can see the closest example above is labelled “Winter”.

The k in KNN is just the number of closest examples the algorithm considers. With k=3 it counts the three closest examples. In the chart above the algorithm would give two votes for Spring and one for Winter — so the result would change to Spring. 

One disadvantage of KNN is the larger the amount of training example data there is, the longer the KNN algorithm needs to spend checking each time it classifies an object. This makes KNN less feasible for large datasets and is a major difference between KNN and a deep learning based approach. 

Classifying objects by color

In our color classifier example there are three inputs from the color sensor. The example colors from each object can be thought of as points in three dimensional space positioned on red, green and blue axes. As usual the KNN algorithm guesses objects by checking how close the inputs are to previously seen examples, but because there are three inputs this time it has to calculate the distances in three dimensional space. The more dimensions the data has the more work it is to compute the classification result.

Further thoughts

This is just a quick taste of what’s possible with KNN. You’ll find an example for board orientation in the library examples, as well as a simple example for you to build on. You can use any sensor on the BLE Sense board as an input, and even combine KNN with other ML techniques.

Of course there are other machine learning resources available for Arduino include TensorFlow Lite tutorials as well as support from professional tools such as Edge Impulse and Qeexo. We’ll be inviting more experts to explore machine learning on Arduino more in the coming weeks.

This post was originally published by Sandeep Mistry and Dominic Pajak on the TensorFlow blog.

Arduino is on a mission to make machine learning simple enough for anyone to use. We’ve been working with the TensorFlow Lite team over the past few months and are excited to show you what we’ve been up to together: bringing TensorFlow Lite Micro to the Arduino Nano 33 BLE Sense. In this article, we’ll show you how to install and run several new TensorFlow Lite Micro examples that are now available in the Arduino Library Manager.

The first tutorial below shows you how to install a neural network on your Arduino board to recognize simple voice commands.

Example 1: Running the pre-trained micro_speech inference example.

Next, we’ll introduce a more in-depth tutorial you can use to train your own custom gesture recognition model for Arduino using TensorFlow in Colab. This material is based on a practical workshop held by Sandeep Mistry and Dan Coleman, an updated version of which is now online

If you have previous experience with Arduino, you may be able to get these tutorials working within a couple of hours. If you’re entirely new to microcontrollers, it may take a bit longer. 

Example 2: Training your own gesture classification model.

We’re excited to share some of the first examples and tutorials, and to see what you will build from here. Let’s get started!

Note: The following projects are based on TensorFlow Lite for Microcontrollers which is currently experimental within the TensorFlow repo. This is still a new and emerging field!

Microcontrollers and TinyML

Microcontrollers, such as those used on Arduino boards, are low-cost, single chip, self-contained computer systems. They’re the invisible computers embedded inside billions of everyday gadgets like wearables, drones, 3D printers, toys, rice cookers, smart plugs, e-scooters, washing machines. The trend to connect these devices is part of what is referred to as the Internet of Things.

Arduino is an open-source platform and community focused on making microcontroller application development accessible to everyone. The board we’re using here has an Arm Cortex-M4 microcontroller running at 64 MHz with 1MB Flash memory and 256 KB of RAM. This is tiny in comparison to Cloud, PC, or mobile but reasonable by microcontroller standards.

Arduino Nano 33 BLE Sense board is smaller than a stick of gum.

There are practical reasons you might want to squeeze ML on microcontrollers, including: 

  • Function – wanting a smart device to act quickly and locally (independent of the Internet).
  • Cost – accomplishing this with simple, lower cost hardware.
  • Privacy – not wanting to share all sensor data externally.
  • Efficiency – smaller device form-factor, energy-harvesting or longer battery life.

There’s a final goal which we’re building towards that is very important:

  • Machine learning can make microcontrollers accessible to developers who don’t have a background in embedded development 

On the machine learning side, there are techniques you can use to fit neural network models into memory constrained devices like microcontrollers. One of the key steps is the quantization of the weights from floating point to 8-bit integers. This also has the effect of making inference quicker to calculate and more applicable to lower clock-rate devices. 

TinyML is an emerging field and there is still work to do – but what’s exciting is there’s a vast unexplored application space out there. Billions of microcontrollers combined with all sorts of sensors in all sorts of places which can lead to some seriously creative and valuable TinyML applications in the future.

What you need to get started

The Arduino Nano 33 BLE Sense has a variety of onboard sensors meaning potential for some cool TinyML applications:

  • Voice – digital microphone
  • Motion – 9-axis IMU (accelerometer, gyroscope, magnetometer)
  • Environmental – temperature, humidity and pressure
  • Light – brightness, color and object proximity

Unlike classic Arduino Uno, the board combines a microcontroller with onboard sensors which means you can address many use cases without additional hardware or wiring. The board is also small enough to be used in end applications like wearables. As the name suggests it has Bluetooth LE connectivity so you can send data (or inference results) to a laptop, mobile app or other BLE boards and peripherals.

Tip: Sensors on a USB stick – Connecting the BLE Sense board over USB is an easy way to capture data and add multiple sensors to single board computers without the need for additional wiring or hardware – a nice addition to a Raspberry Pi, for example.

TensorFlow Lite for Microcontrollers examples

The inference examples for TensorFlow Lite for Microcontrollers are now packaged and available through the Arduino Library manager making it possible to include and run them on Arduino in a few clicks. In this section we’ll show you how to run them. The examples are:

  • micro_speech – speech recognition using the onboard microphone
  • magic_wand – gesture recognition using the onboard IMU
  • person_detection – person detection using an external ArduCam camera

For more background on the examples you can take a look at the source in the TensorFlow repository. The models in these examples were previously trained. The tutorials below show you how to deploy and run them on an Arduino. In the next section, we’ll discuss training.

How to run the examples using Arduino Create web editor

Once you connect your Arduino Nano 33 BLE Sense to your desktop machine with a USB cable you will be able to compile and run the following TensorFlow examples on the board by using the Arduino Create web editor:

Compiling an example from the Arduino_TensorFlowLite library.

Focus on the speech recognition example: micro_speech

One of the first steps with an Arduino board is getting the LED to flash. Here, we’ll do it with a twist by using TensorFlow Lite Micro to recognise voice keywords. It has a simple vocabulary of “yes” and “no”. Remember this model is running locally on a microcontroller with only 256KB of RAM, so don’t expect commercial ‘voice assistant’ level accuracy – it has no Internet connection and on the order of 2000x less local RAM available.

Note the board can be battery powered as well. As the Arduino can be connected to motors, actuators and more this offers the potential for voice-controlled projects.

Running the micro_speech example.

How to run the examples using the Arduino IDE

Alternatively you can use try the same inference examples using Arduino IDE application.

First, follow the instructions in the next section Setting up the Arduino IDE.

In the Arduino IDE, you will see the examples available via the File > Examples > Arduino_TensorFlowLite menu in the ArduinoIDE.

Select an example and the sketch will open. To compile, upload and run the examples on the board, and click the arrow icon:

For advanced users who prefer a command line, there is also the arduino-cli.

Training a TensorFlow Lite Micro model for Arduino

[optimize output image]
Gesture classification on Arduino BLE 33 Nano Sense, output as emojis.

Next we will use ML to enable the Arduino board to recognise gestures. We’ll capture motion data from the Arduino Nano 33 BLE Sense board, import it into TensorFlow to train a model, and deploy the resulting classifier onto the board.

The idea for this tutorial was based on Charlie Gerard’s awesome Play Street Fighter with body movements using Arduino and Tensorflow.js. In Charlie’s example, the board is streaming all sensor data from the Arduino to another machine which performs the gesture classification in Tensorflow.js. We take this further and “TinyML-ifiy” it by performing gesture classification on the Arduino board itself. This is made easier in our case as the Arduino Nano 33 BLE Sense board we’re using has a more powerful Arm Cortex-M4 processor, and an on-board IMU.

We’ve adapted the tutorial below, so no additional hardware is needed – the sampling starts on detecting movement of the board. The original version of the tutorial adds a breadboard and a hardware button to press to trigger sampling. If you want to get into a little hardware, you can follow that version instead.

Setting up the Arduino IDE

Following the steps below sets up the Arduino IDE application used to both upload inference models to your board and download training data from it in the next section. There are a few more steps involved than using Arduino Create web editor because we will need to download and install the specific board and libraries in the Arduino IDE.

  • In the Arduino IDE menu select Tools > Board > Boards Manager…
    • Search for “Nano BLE” and press install on the board 
    • It will take several minutes to install
    • When it’s done close the Boards Manager window
This image has an empty alt attribute; its file name is k88wXiRDpbadmTW1EreSetBJwHgMN3IP4skuTRywedIgp2aAWvzg3mqyDPZ2_fafH7tFXK-GtFwPEcnMAM0fqfa8XeYCc7orh6LXg4pD2_fKu1JQqw8LALMHfv6lFIBA3a_9pYeg
  • Now go to the Library Manager Tools > Manage Libraries…
    • Search for and install the Arduino_TensorFlowLite library

Next search for and install the Arduino_LSM9DS1 library:

  • Finally, plug the micro USB cable into the board and your computer
  • Choose the board Tools > Board > Arduino Nano 33 BLE
  • Choose the port Tools > Port > COM5 (Arduino Nano 33 BLE) 
    • Note that the actual port name may be different on your computer

There are more detailed Getting Started and Troubleshooting guides on the Arduino site if you need help.

Streaming sensor data from the Arduino board

First, we need to capture some training data. You can capture sensor data logs from the Arduino board over the same USB cable you use to program the board with your laptop or PC.

Arduino boards run small applications (also called sketches) which are compiled from .ino format Arduino source code, and programmed onto the board using the Arduino IDE or Arduino Create. 

We’ll be using a pre-made sketch IMU_Capture.ino which does the following:

  • Monitor the board’s accelerometer and gyroscope 
  • Trigger a sample window on detecting significant linear acceleration of the board 
  • Sample for one second at 119Hz, outputting CSV format data over USB 
  • Loop back and monitor for the next gesture

The sensors we choose to read from the board, the sample rate, the trigger threshold, and whether we stream data output as CSV, JSON, binary or some other format are all customizable in the sketch running on the Arduino. There is also scope to perform signal preprocessing and filtering on the device before the data is output to the log – this we can cover in another blog. For now, you can just upload the sketch and get sampling.

To program the board with this sketch in the Arduino IDE:

  • Download IMU_Capture.ino and open it in the Arduino IDE
  • Compile and upload it to the board with Sketch > Upload

Visualizing live sensor data log from the Arduino board

With that done we can now visualize the data coming off the board. We’re not capturing data yet this is just to give you a feel for how the sensor data capture is triggered and how long a sample window is. This will help when it comes to collecting training samples.

  • In the Arduino IDE, open the Serial Plotter Tools > Serial Plotter
    • If you get an error that the board is not available, reselect the port:
    • Tools > Port > portname (Arduino Nano 33 BLE) 
  • Pick up the board and practice your punch and flex gestures
    • You’ll see it only sample for a one second window, then wait for the next gesture
  • You should see a live graph of the sensor data capture (see GIF below)
Arduino IDE Serial Plotter will show a live graph of CSV data output from your board.

When you’re done be sure to close the Serial Plotter window – this is important as the next step won’t work otherwise.

Capturing gesture training data 

To capture data as a CSV log to upload to TensorFlow, you can use Arduino IDE > Tools > Serial Monitor to view the data and export it to your desktop machine:

  • Reset the board by pressing the small white button on the top
  • Pick up the board in one hand (picking it up later will trigger sampling)
  • In the Arduino IDE, open the Serial Monitor Tools > Serial Monitor
    • If you get an error that the board is not available, reselect the port:
    • Tools > Port > portname (Arduino Nano 33 BLE) 
  • Make a punch gesture with the board in your hand (Be careful whilst doing this!)
    • Make the outward punch quickly enough to trigger the capture
    • Return to a neutral position slowly so as not to trigger the capture again 
  • Repeat the gesture capture step 10 or more times to gather more data
  • Copy and paste the data from the Serial Console to new text file called punch.csv 
  • Clear the console window output and repeat all the steps above, this time with a flex gesture in a file called flex.csv 
    • Make the inward flex fast enough to trigger capture returning slowly each time

Note the first line of your two csv files should contain the fields aX,aY,aZ,gX,gY,gZ.

Linux tip: If you prefer you can redirect the sensor log output from the Arduino straight to a .csv file on the command line. With the Serial Plotter / Serial Monitor windows closed use:

 $ cat /dev/cu.usbmodem[nnnnn] > sensorlog.csv

Training in TensorFlow

We’re going to use Google Colab to train our machine learning model using the data we collected from the Arduino board in the previous section. Colab provides a Jupyter notebook that allows us to run our TensorFlow training in a web browser.

Arduino gesture recognition training colab.

The colab will step you through the following:

  • Set up Python environment
  • Upload the punch.csv and flex.csv data 
  • Parse and prepare the data
  • Build and train the model
  • Convert the trained model to TensorFlow Lite
  • Encode the model in an Arduino header file

The final step of the colab is generates the model.h file to download and include in our Arduino IDE gesture classifier project in the next section:

Let’s open the notebook in Colab and run through the steps in the cells – arduino_tinyml_workshop.ipynb

Classifying IMU Data

Next we will use model.h file we just trained and downloaded from Colab in the previous section in our Arduino IDE project:

  • Open IMU_Classifier.ino in the Arduino IDE.
  • Create a new tab in the IDE. When asked name it model.h
  • Open the model.h tab and paste in the version you downloaded from Colab
  • Upload the sketch: Sketch > Upload
  • Open the Serial Monitor: Tools > Serial Monitor
  • Perform some gestures
  • The confidence of each gesture will be printed to the Serial Monitor (0 = low confidence, 1 =  high confidence)

Congratulations you’ve just trained your first ML application for Arduino!

For added fun the Emoji_Button.ino example shows how to create a USB keyboard that prints an emoji character in Linux and macOS. Try combining the Emoji_Button.ino example with the IMU_Classifier.ino sketch to create a gesture controlled emoji keyboard ?.

Conclusion

It’s an exciting time with a lot to learn and explore in TinyML. We hope this blog has given you some idea of the potential and a starting point to start applying it in your own projects. Be sure to let us know what you build and share it with the Arduino community.

For a comprehensive background on TinyML and the example applications in this article, we recommend Pete Warden and Daniel Situnayake’s new O’Reilly book “TinyML: Machine Learning with TensorFlow on Arduino and Ultra-Low Power Microcontrollers.”



  • Newsletter

    Sign up for the PlanetArduino Newsletter, which delivers the most popular articles via e-mail to your inbox every week. Just fill in the information below and submit.

  • Like Us on Facebook