
Getting a Running Start with the NVIDIA Jetson Nano

1. Introduction

On March 18th, 2019, NVIDIA pre-announced their new “Jetson Nano” GPU development board, with shipments then-scheduled to begin June 2019. This is an intriguing little system claiming 472 GFLOP of performance via a 128-core NVIDIA Maxwell GPU, a quad core ARM A57 processor, 4GB of RAM, and gigabit Ethernet — and all at a sub-$100 price point.

Phoronix benchmarked and reported on the Jetson Nano, stating that “overall, this is arguably the best sub-$100 ARM developer board we’ve seen to date depending upon your use-cases. The Jetson Nano will certainly open up NVIDIA Tegra SoCs [“System on a Chip”] to appearing in more low-cost DIY projects and other hobbyist use-cases as well as opening up GPU/CUDA acceleration that until now has really not been possible on the low-cost boards.”

While the device is largely intended for developers building machine learning/neural network applications at the edge on low-power mobile devices, we were intrigued by its low cost, its relatively high-performance claims, and its NVIDIA GPU, all conveniently running on top of a tailored version of Ubuntu Linux.

We were particularly interested in whether the neural network models described in our earlier Farsight blog article entitled ‘A Second Approach To Automating Detection of “Random-Looking” Domain Names: Neural Networks/Deep Learning’ could be run faster on the inexpensive Jetson Nano.

We had high hopes for a speed-up, since our early 2015-era 3.1 GHz Dual-Core Intel Core i7 MacBook Pro laptop (as used for the earlier runs) peaked at just 79.3 GFLOP under Catalina (when tested with the Intel LINPACK benchmark), roughly 1/6th the quoted performance of the Jetson Nano. On the other hand, the neural network/deep learning model we'd built was not purely a floating point computational exercise: it also involved non-parallel computation and I/O, which might take longer on the slower ARM CPU.
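That "1/6th" figure is just the ratio of the two quoted peak numbers:

```shell
# Ratio of the laptop's measured LINPACK peak (79.3 GFLOPS) to the
# Nano's quoted peak (472 GFLOPS)
python3 -c 'print(round(79.3/472, 3))'   # prints 0.168, i.e. roughly 1/6th
```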

Of course, the Jetson Nano is not the only device competing in this space, see also, among others:

  • The Intel Neural Compute Stick 2, $73.27 from Amazon (“Intel claims the chipset can hit 4 teraflops of compute and 1 trillion operations per second of dedicated neural net compute at full blast.”)

  • The Google Coral USB Accelerator, $74.99 from Coral.ai (“An individual Edge TPU is capable of performing 4 trillion operations (tera-operations) per second (TOPS).”) Coral also offers a dev board version ($149.99) for those who’d prefer that to a USB stick form factor.

  • And in a completely different price class, the UDOO Bolt V8 ($418.00 excl. tax/VAT/shipping) is described by the vendor as “Almost twice as fast as the MacBook Pro 13″, for VR, AR, and AI projects.”

One comparison of those alternatives can be seen in “Battle of Edge AI — Nvidia vs Google vs Intel”.

Anyhow, having recently received a Jetson Nano as a Christmas gift from a family member (thank you Bev!), we’ve begun to experiment a little with it. This article is meant to capture some of what we’ve learned in that process, so if you decide you want to experiment with one of your own you can have a quick(er) and smoother start.

READ THE WHOLE BLOGPOST before you buy a Jetson Nano or begin to work with your own system.

Disclaimer: The information in this article is offered “as-is,” with all faults, errors and omissions, and no warranty whatsoever. Proceed at your own risk. The prices you pay and the performance you may see may vary, etc. This is not an endorsement (nor is it meant to be taken as a negative critique) of any product mentioned.

2. Parts and Hardware Assembly Notes

We built our system from the following parts:

Total: $161.95 (with the usual free shipping if you’re buying from Amazon as a Prime member, etc.)

You’ll also need (at least for your initial setup, if not for routine use):

  • A laptop or desktop (I used an Apple Macbook Pro running Catalina)

  • An HDMI-capable monitor with HDMI cable

  • A USB keyboard and mouse

  • An Ethernet cable (we note that the board has an M.2 Key E socket under the GPU module, should you want to add a compatible WiFi daughter card, but the system doesn’t come with such a card pre-installed). You can see an example of how to install one here.

Before You Get Started: The case mentioned above comes unassembled. Assembling the case is somewhat fiddly and can prove tricky if you have large hands.

Part of assembling the case includes installing the Nano board in the case. The Jetson Nano is static sensitive and is shipped in a static-protective bag, so be sure to follow appropriate antistatic control measures to avoid damage to the board or any of its components (for some ESD hints, see an example here.)

The case comes with a tiny printed manual; that manual may be enough for some, but I encourage you to also consider reviewing:

Note that multiple versions of the basic metal Jetson Nano case exist: depending on the version you have, you may wonder why a video shows easy access while you’re struggling to get stuff hooked to the right pin. The answer may simply be that the video’s authors have a different version of the case than you do.

Getting Started:

  • Eight tiny screws need to be removed to open the case and get at the enclosed parts pack (we would have preferred to not have the empty case screwed together as shipped).

  • Installing the switches: Spin the nuts off, then install the two switches finger tight.

  • GPU fan orientation: The case includes two cooling fans, one premounted on the case, and one for you to install on the heatsink attached to the GPU. While you might expect to use self-tapping screws to mount the fan on the GPU heatsink, the case expects you to mount the GPU fan with the provided long skinny bolts and nuts, instead. Because the fan is not marked, note that the fan is correctly oriented when the bolt head recesses are pointing up (one side of the heat sink fan is counterbored for screw heads, the other side of the heat sink fan is flush, and the bolts aren’t long enough unless they’re recessed in the counterbored holes).

  • Actually securing the fan bolts: You can hold the nuts with the small antistatic tweezers supplied with the case, however the tweezers aren’t very strong and are relatively easy-to-break (or at least I broke mine). You may find it helpful to try a fine hemostat if you have one handy, or you may want to have someone assist you while installing the small nuts and bolts. (Note that if the nuts fall off while the system is running, they may fall onto the board and potentially cause a short, so be sure you have them adequately secured).

  • Wiring the power switch: There are four wires on the supplied power switch: the black and red wires are used to turn the system on/off. The other two wires provide power to the switch’s own light.

    As stated on the raspberrypiwiki.com/index.php/N100 page, ‘And the blue and white cable of power control switch need to be insert in “3.3V” and “GND” pin at J41.’ An illuminated magnifier may be helpful when it comes to finding the right pins on the board. Use the included mini zip ties to manage the wires in the case.

  • Jumpers:

    • Be SURE to jump the “disable auto-on” pins on J40 pins 7 and 8 (assuming you’re installing the board in a case which has a manual power switch, like the one we’re using).

    • While the board can use either a micro USB power supply or a power supply with a barrel connector, we urge you to use the barrel connector power supply to avoid issues with insufficient power. Assuming you are using the barrel power supply, BE SURE to jump J48 to enable the 5V/4A barrel power supply jack. (I missed this and had to remove those eight tiny case screws, install this jumper, and then replace the eight dang tiny screws again)

  • You then need to flash the micro SD card and install it before the board can boot.

3. Flashing And Installing The Micro SD Card

Flash your micro SD card as described here.

When formatting your micro SD card, tell the formatting tool to use (almost) all the space on the card (exclude/reserve perhaps 4 GB for ram disk space). If you’re too conservative (for example electing to only use 95GB or 100GB on a 128GB card for the main partition), you likely won’t be able to easily increase the size of the partition later.

Speaking of micro SD cards, be sure to buy and use a large-enough one. A basic 128GB micro SD card can be had for under $20 from some vendors, and a 256GB card for less than twice that, so there’s really no excuse for trying to limp along with some overly-tiny $6 16GB micro SD card!

Once you’ve flashed the micro SD card, install it on the Jetson Nano. The socket for the micro SD card is on the side of the board opposite the USB ports and other connectors. The contacts on the micro SD card should be oriented toward the center of the board, and the contacts should be facing upward. You should feel a spring-like detent click when the card locks into place.

4. Initial Configuration

Connect the Jetson Nano’s power supply, monitor, keyboard, mouse and Ethernet cable.

Push the power button on the case to boot the system. The power button should light, and you should also be able to see a glowing green LED on the board through the ventilation holes on the case (you may need to look closely to see it through the holes). The fans are thermally controlled and may NOT be running at this point.

The basic configuration is menu driven and pretty straightforward. You’ll be choosing the language you speak, accepting a license agreement, selecting a time zone and keyboard style, and other routine system-setup tasks. Two things to note:

  • Make a note of the name you picked for the system, and your username
  • Be sure to use a strong password and make a note of it, too.

5. Configuring Networking

a) IPv4 Address: We will normally be accessing the Jetson Nano over the local network, rather than by sitting at the monitor with a keyboard directly attached to the device. We’re going to assume that we do NOT need to directly access the system from the Internet, so we’re not going to give it a public IP nor a globally resolvable domain name.

We assume the system is going to be plugged into a wireless home “router.” The system will get an RFC1918 IP address from the wireless router via DHCP when the board boots. To ensure that IP doesn’t vary over time, we “locked” that IP to the Jetson Nano device via our wireless router’s management interface. Make a note of the RFC1918 IP address the router assigned to the Jetson Nano.

Using your favorite editor, add that IP and the name of your system to your Mac’s /etc/hosts file (you’ll need to use sudo or su (if you’ve enabled su on your Mac) to modify that file). Creating an /etc/hosts file entry will let you connect to the Jetson Nano using the system’s name from your local laptop.
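The mechanics are simple enough to sketch. The IP address and hostname below are hypothetical (substitute the values from your own router), and we rehearse against a scratch file here, since modifying the real /etc/hosts requires sudo:

```shell
NANO_IP="192.168.1.50"     # hypothetical: the address your router assigned
NANO_NAME="gpuboard"       # hypothetical: the name you gave your Jetson Nano
SCRATCH=$(mktemp)          # stand-in for /etc/hosts while rehearsing
printf '%s\t%s\n' "$NANO_IP" "$NANO_NAME" >> "$SCRATCH"
cat "$SCRATCH"
```

For the real thing, append the same tab-separated line to /etc/hosts with sudo.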

b) Configuring ssh: You can ssh into the Jetson Nano using username and password, but we prefer to use ssh preshared keys for convenience and for improved security.

For the purposes of this article, we include the machine name, either mymac or gpuboard, as part of the prompt to help make it clear where each command is being run. To create new keys on your Mac, say:

mymac $ ssh-keygen -b 4096 -t rsa

To install that key on the Jetson Nano, assuming your username is jsmith and the system is called gpuboard, you’d enter the following command on your Mac:

mymac $ ssh-copy-id jsmith@gpuboard

You’ll need to provide your Jetson Nano’s account password to install that preshared key.

After that, you should then be able to login to your Jetson Nano by saying:

mymac $ ssh jsmith@gpuboard      <-- regular login

mymac $ ssh -X jsmith@gpuboard   <-- login with X11 traffic forwarded back
                                     (doing so requires an X server
                                     installed on your local laptop)
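If you expect to log in frequently, a stanza in ~/.ssh/config saves retyping the username and -X flag. A minimal sketch, again assuming the hypothetical jsmith/gpuboard names:

```shell
# Append a host stanza for the Nano (names are the hypothetical ones above)
mkdir -p ~/.ssh
cat >> ~/.ssh/config <<'EOF'
Host gpuboard
    User jsmith
    ForwardX11 yes
EOF
chmod 600 ~/.ssh/config
```

With that in place, a plain `ssh gpuboard` behaves like `ssh -X jsmith@gpuboard`.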

6) Enabling Access to The Root Account (Optional)

If you dislike sudo and prefer to su to root when doing a series of tasks that require administrative privileges, you can enable the root account by setting a (strong!) password for root with

gpuboard $ sudo passwd root

In the rest of this document, we will show commands that need root privileges by prefixing them with a hash sign.

If you’d rather not enable the root account, simply remember to prefix any command needing root permissions (denoted below with a # prompt) with sudo.

7) Restoring The Missing Documentation

Because the Jetson Nano was envisioned as potentially being used as an embedded device with limited storage, the default install has been “minimized.” In particular, Canonical, the publisher of Ubuntu, decided that they’d exclude all documentation from the default Ubuntu install in order to save disk space.

As a result, /etc/dpkg/dpkg.cfg.d/excludes comes pre-set to exclude documentation from what gets installed. We, on the other hand, REALLY LIKE documentation, and we have plenty of storage space on our 128GB micro SD card. We’ll restore what was intentionally omitted by entering:

gpuboard # unminimize

8) Ensuring Full Performance From The System

Because the Jetson Nano may be used in power-constrained situations (e.g., for robotic vision projects and that sort of thing), it also comes preconfigured to minimize power draw on battery power sources. This may include disabling some of the system’s four CPUs or running the CPUs at a reduced clock rate. Because we have abundant wall power with our 5V/4A barrel power supply and want maximum performance from the board, we need to set the system to ensure power management doesn’t inadvertently downrate performance.

For maximum performance we want to set the system power profile to zero (also known as “10 watt” mode), ensure that all four CPUs are online, and ensure that these power-related settings survive a system reboot. Therefore we’ll use vim to add the following lines to /etc/rc.local:

echo 1 > /sys/devices/system/cpu/cpu0/online
echo 1 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu2/online
echo 1 > /sys/devices/system/cpu/cpu3/online
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
nvpmodel -m 0
( sleep 60 && jetson_clocks )&

After exiting your editor and saving that file, set /etc/rc.local to be executable with:

gpuboard # chmod a+rx /etc/rc.local

Confirm the file runs okay now:

gpuboard # /etc/rc.local

Also disable ondemand by saying:

gpuboard # update-rc.d -f ondemand remove

Since we’re “semi-hotrodding” the board, this is also a good time to consider enabling dynamic fan control.

9) Ensure The System’s Software Is Up-To-Date

Let’s now ensure the system is up-to-date by saying:

gpuboard # apt-get update
gpuboard # apt-get upgrade

Important note: because of the customized version of Ubuntu (Linux For Tegra, L4T) running on the board, do NOT EVER attempt to run dist-upgrade! You are “stuck” (at least for now) on Ubuntu 18.04 LTS.

10) Let’s Also Remove Packages We Don’t Need

Unneeded packages may potentially slow the system down, either during boot or routine use, but be careful you don’t inadvertently delete a package you actually need. An example of a package that I was comfortable removing was:

gpuboard # apt-get purge modemmanager

You can also probably safely clean up no-longer needed packages with:

gpuboard # apt autoremove

11) Adding Swap Space

You can see the current amount of swap space you’ve got on the Jetson Nano with:

gpuboard $ zramctl

See also:

gpuboard $ cat /proc/swaps

If you need (or want) to add additional swap space on the Jetson Nano, see the process described here.

Steps 2 and 3 of this are also worth a look.
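If you elect to go the disk-backed swap file route, the usual recipe looks roughly like the following (a sketch only: the size is illustrative, we write the signature to a local scratch file, and the swapon/fstab steps must be run as root on the Nano itself):

```shell
# Create and format a swap file (16 MB here for illustration;
# use e.g. count=4096 for a real 4 GB file at /swapfile)
dd if=/dev/zero of=./swapfile bs=1M count=16 status=none
chmod 600 ./swapfile      # swap files must not be world-readable
mkswap ./swapfile         # write the swap signature
# Then, as root on the Nano, activate it and make it persistent:
#   swapon /swapfile
#   echo '/swapfile none swap sw 0 0' >> /etc/fstab
```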

12) Ensure The System’s Clock Is NTP-Sync’d For Accurate Logging

I’m a big believer in accurate logs. Do yourself a favor and make sure your system time is accurate by running NTP. Install an NTP client by saying:

gpuboard # apt install chrony
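On Ubuntu 18.04, chrony begins syncing against Ubuntu's NTP pool as soon as it's installed. If you'd rather point it at a specific pool or server, the relevant directive in /etc/chrony/chrony.conf looks like the following (the pool name here is illustrative):

```
pool pool.ntp.org iburst
```

After editing, restart the service (systemctl restart chrony) and confirm synchronization with chronyc tracking.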

13) Ensure The Cuda Compiler and Libraries Are Installed And Usable

It’s easy to mistakenly get the impression that the Cuda GPU-enabled compiler and libraries AREN’T installed. For example, perhaps you tried saying:

gpuboard $ nvcc --version
bash: nvcc: command not found

That might lead you down a futile path of trying to download and install the Cuda tools when in fact they’re ALREADY installed as part of the JetPack installer, just not included in your default PATH.

Make sure nvcc and related libraries are accessible by editing ~/.bashrc with your favorite editor, adding at the bottom:

export PATH=${PATH}:/usr/local/cuda/bin
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda/lib64

Now run that file by saying:

gpuboard $ source ~/.bashrc

From then on, you should then see what you’d hope to see when you run nvcc, e.g.:

gpuboard $ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Mon_Mar_11_22:13:24_CDT_2019
Cuda compilation tools, release 10.0, V10.0.326
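If nvcc had still not been found at this point, the culprit would almost certainly be the PATH edit itself. The PATH mechanics can be rehearsed anywhere with a stand-in directory (the scratch directory and fake nvcc below are hypothetical, standing in for /usr/local/cuda/bin):

```shell
CUDA_BIN=$(mktemp -d)                   # stand-in for /usr/local/cuda/bin
printf '#!/bin/sh\necho nvcc-found\n' > "$CUDA_BIN/nvcc"
chmod +x "$CUDA_BIN/nvcc"
export PATH=${PATH}:$CUDA_BIN           # the same edit ~/.bashrc makes
nvcc                                    # prints: nvcc-found
```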

Now let’s try one of the examples that come with the Cuda tools. We need to copy the read-only samples to a directory under our home directory by saying:

gpuboard $ cuda-install-samples-10.0.sh ~

We can then make those programs (this may take an hour or so to run) by saying:

gpuboard $ cd NVIDIA_CUDA-10.0_Samples/
gpuboard $ time make
[lots of output elided here]
Finished building CUDA samples

real	61m27.359s
user	54m47.080s
sys	5m37.152s

If you review the output from that make, you may notice some warning messages (unused variables, deprecated declarations, etc.), but those are believed to be non-problematic. Let’s try one of the programs that we’ve now got compiled…

gpuboard $ cd ~/NVIDIA_CUDA-10.0_Samples/bin/aarch64/linux/release
gpuboard $ ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X1"
  CUDA Driver Version / Runtime Version          10.0 / 10.0
  CUDA Capability Major/Minor version number:    5.3
  Total amount of global memory:                 3964 MBytes (4156911616 bytes)
  ( 1) Multiprocessors, (128) CUDA Cores/MP:     128 CUDA Cores
  GPU Max Clock rate:                            922 MHz (0.92 GHz)
  Memory Clock rate:                             1600 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 262144 bytes
[etc]
Result = PASS

Looks like we’re in business!

14) Installing Application Software

Now we’re ready to begin installing our application environment.

Let’s start by installing numpy and scipy, and the packages required by those applications (again, some of these may take a while to download and install):

gpuboard # apt-get install git cmake libatlas-base-dev gfortran
gpuboard # apt-get install libhdf5-serial-dev hdf5-tools python3-dev python3-pip
gpuboard # apt-get install libpcap-dev libpq-dev python3-matplotlib 
gpuboard # pip3 install numpy testresources setuptools cython pandas
gpuboard # apt-get install python3-scipy

Everything should install fine, but we can verify that numpy is working with a tiny test program to transpose a small matrix:

gpuboard $ cat test.py
#!/usr/bin/python3

import numpy as np

x = np.array([[1, 2], [3, 4], [5, 6]])
print ("x ==> \n", x)

xprime = x.T
print ("\nxprime ==> \n", xprime)

gpuboard $ chmod a+rx test.py
gpuboard $ ./test.py
x ==> 
 [[1 2]
 [3 4]
 [5 6]]

xprime ==> 
 [[1 3 5]
 [2 4 6]]

This step looks good, too! Now let’s go on to install our machine learning/neural network applications.

15) Machine Learning/Neural Networking Applications

The Jetson Nano has always been positioned as a platform for running machine learning/neural network applications. A previous Farsight blog article used keras (with theano/Tensorflow). Let’s install that software now.

a) We’ll begin with keras; it’s easy:

gpuboard # pip3 install keras

b) Ditto theano:

gpuboard # pip3 install theano

c) Tensorflow, on the other hand, is somewhat fussy on the Jetson Nano. For details, please visit here.

gpuboard # pip3 install --upgrade numpy==1.16.1 future==0.17.1 mock==3.0.5 h5py==2.9.0 keras_preprocessing==1.0.5 keras_applications==1.0.8 gast==0.2.2 enum34 futures protobuf

gpuboard # pip3 install --pre --extra-index-url https://developer.download.nvidia.com/compute/redist/jp/v43 tensorflow-gpu

If you need to/want to install from source, be aware that this can literally take over 40 hours.

Other machine learning packages are also available for the Jetson Nano, but we’re not going to go over installing them here.

16) Performance

Per this NVIDIA article, let’s install jtop as a nice summary performance monitoring tool. It’s available here or via pip3:

gpuboard # pip3 install jetson-stats

We also want iostat, which is part of the sysstat package:

gpuboard # apt-get install sysstat -y

Now we’re ready to actually run a real application, namely the code from our earlier machine learning blog article.

17) Sample Code From Machine Learning Blog Article: macOS Catalina With Keras and Theano

All the preceding was basically meant to get us set up to run the code shown in Appendices III and IV of our earlier machine learning article.

For reference, here’s the performance we want to beat: Keras on a MacBook Pro Retina (Early 2015) laptop (3.1 GHz dual-core Intel Core i7, 16GB 1867 MHz DDR3, Intel Iris Graphics 6100 1536 MB), running under Mac OS X Catalina 10.15.2 with the Intel Math Kernel Library for OS X (submitted December 16, 2019):

mymac $ cat .keras/keras.json
{
    "floatx": "float32",
    "epsilon": 1e-07,
    "backend": "theano",
    "image_data_format": "channels_last"
}

mymac $ time python3 run-model-embeddings.py
Using Theano backend.
[...]
Post model save, elapsed time in seconds = 1116.3568624319998

real	18m38.418s
user	17m14.130s
sys	0m33.749s

mymac $ time python3 run-model-embeddings-2.py
Using Theano backend.
[...]
real	6m26.779s
user	6m12.230s
sys	0m5.035s

18) Sample Code From Machine Learning Blog Article: MacOS Catalina With Keras and TensorFlow

The TensorFlow backend to Keras proved to be even faster on that Mac:

mymac $ cat .keras/keras.json 
{
    "floatx": "float32",
    "epsilon": 1e-07,
    "backend": "tensorflow",
    "image_data_format": "channels_last"
}

mymac $ time python3 run-model-embeddings.py
Using TensorFlow.
[...]
Post model save, elapsed time in seconds = 755.7816738820001

real	12m38.517s
user	19m41.349s
sys	2m38.368s

mymac $ time python3 run-model-embeddings-2.py
Using TensorFlow backend.
[...]
real	3m53.989s
user	4m18.361s
sys	0m13.532s

19) Advanced Instruction Support On Catalina?

An interesting thing we noted during the Keras run on the Mac with the TensorFlow backend was the informational message that:

"tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA"

See here for a discussion of this message.

We attempted to compile a custom version of TensorFlow from source for our Mac that enabled those instructions, but saw (for example):

Warning: "Disabling AVX support: clang compiler shipped with XCode 11.[012] generates broken assembly with -macosx-version-min=10.15 and AVX enabled."

As a result, the results reported here were from the default brew version of TensorFlow (even if that didn’t include AVX2 and FMA instructions). It’s possible that if/when these bugs can be resolved, performance on the Mac might be improved still further.

20) Sample Code From Machine Learning Blog Article: Jetson Nano With Keras and TensorFlow

With current Mac laptop benchmark results in hand, we’re finally ready to try running the code from our earlier machine learning blog article on the Jetson Nano.

We moved that code and the associated data file (new-tokenized-20-char.txt.gz) to the Jetson Nano using sftp. We then ran that code on the Jetson Nano.

We ensured Keras on the Jetson Nano was configured to use TensorFlow as a backend:

gpuboard $ cat .keras/keras.json
{
    "floatx": "float32",
    "epsilon": 1e-07,
    "backend": "tensorflow",
    "image_data_format": "channels_last"
}

gpuboard $ time python3 run-model-embeddings.py
Using TensorFlow backend.
2020-01-01 19:33:29.027814: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
Reading in the tokenized data...
After reading in training data, elapsed time in seconds = 18.10690813[...]
[...]

We can watch the system using jtop in another ssh window:

[jtop screenshot]

Output from the run looks like:

[...]
3264272/3264272 [==============================] - 1900s 582us/step - loss: 0.0265 - accuracy: 0.9942
Epoch 2/3
3264272/3264272 [==============================] - 1848s 566us/step - loss: 0.0156 - accuracy: 0.9966 
Epoch 3/3
3264272/3264272 [==============================] - 1845s 565us/step - loss: 0.0138 - accuracy: 0.9969 
After fitting the model, elapsed time in seconds = 5694.876962216
Evaluate...
accuracy: 99.73%
[...]
Post model save, elapsed time in seconds = 6399.951873356999
real	106m59.793s
user	100m32.988s
sys	20m25.848s

Let’s now try running run-model-embeddings-2.py:

gpuboard $ time python3 run-model-embeddings-2.py
[...]
real	19m22.221s
user	20m36.464s
sys	1m47.480s

21) Sample Code From Machine Learning Blog Article: Jetson Nano With Keras and Theano

Now let’s try Keras on the Jetson Nano with theano for the backend instead of TensorFlow. We set the .keras/keras.json file to look like:

gpuboard $ cat .keras/keras.json
{
    "floatx": "float32",
    "epsilon": 1e-07,
    "backend": "theano",
    "image_data_format": "channels_last"
}

We also realized that the ~/.theanorc file configuration might impact use/non-use of the GPU.

We began by reviewing the stock configuration reported by:

gpuboard $ python3 -c 'import theano; print(theano.config)' | more

We noted in particular the following settings from that config report:

device (cpu, opencl*, cuda*)
    Doc:  Default device for computations. If cuda* or opencl*, change
    the default to try to move computation to the GPU. Do not use upper
    case letters, only lower case even if NVIDIA uses capital letters.
    Value:  cpu

[...]
gpuarray.single_stream () 
    Doc:  If your computations are mostly lots of small elements, using 
    single-stream will avoid the synchronization overhead and usually be faster. 
    For larger elements it does not make a difference yet.  In the future when 
    true multi-stream is enabled in libgpuarray, this may change. If you want to 
    make sure to have optimal performance, check both options.     
    Value:  True

Running WITHOUT Any GPU setting in the .theanorc file:

gpuboard $ time python3 run-model-embeddings.py
Using Theano backend.
2020-01-01 21:59:46.847850: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
Reading in the tokenized data...
After reading in training data, elapsed time in seconds = 18.17661204400065
[...]
Epoch 1/3
3264272/3264272 [==============================] - 1204s 369us/step - loss: 0.0255 - accuracy: 0.9943
Epoch 2/3
3264272/3264272 [==============================] - 1202s 368us/step - loss: 0.0152 - accuracy: 0.9967 
Epoch 3/3
3264272/3264272 [==============================] - 1206s 369us/step - loss: 0.0136 - accuracy: 0.9970 
After fitting the model, elapsed time in seconds = 3886.5203647100006
Evaluate...
accuracy: 99.70%
[...]

Post model save, elapsed time in seconds = 4270.506449021001

real	71m30.840s
user	70m41.324s
sys	0m21.528s
gpuboard $ time python3 run-model-embeddings-2.py
[...]
real	22m36.315s
user	22m29.492s
sys	0m2.668s

22) Sample Code From Machine Learning Blog Article: Jetson Nano With Keras and Theano (Using An Explicit GPU Setting)

We now tried running WITH the GPU set explicitly in the .theanorc file:

gpuboard $ cat .theanorc
[cuda]
root = /usr/local/cuda

[global]
device = cuda0
floatX = float32
gpuarray.single_stream = True

[dnn]
library_path = /usr/local/cuda/lib64/

We initially saw:

ERROR (theano.gpuarray): pygpu was configured but could not be imported or is too old (version 0.7 or higher required)

pygpu is part of libgpuarray, which is in turn part of Theano. We attempted to upgrade just pygpu/libgpuarray, but eventually ended up building a full new copy of Theano 2 from source, more-or-less as described here.

Note that this can take a substantial period of time (as in, don’t bother staying up, go ahead and get some sleep). [N.B. After getting this built and running, we stumbled upon this article (scroll down to the item mentioning “TensorFlow 2.0 can be installed with JetPack4.3 now” posted 12/26/2019 03:13 AM).]

Anyhow, when we re-ran the model with our built-from-source Theano 2.0 with gpuarray.single_stream = True (designated by +SING in the summary result table below), we had success:

gpuboard $ time python3 run-model-embeddings.py
Using Theano backend.
/usr/local/lib/python3.6/dist-packages/theano/gpuarray/dnn.py:184: UserWarning: Your cuDNN version is more recent than Theano. If you encounter problems, try updating Theano or downgrading cuDNN to a version >= v5 and <= v7.
  warnings.warn("Your cuDNN version is more recent than "
Using cuDNN version 7603 on context None
Mapped name None to device cuda0: NVIDIA Tegra X1 (0000:00:00.0)
2020-01-05 18:51:28.809377: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
[...]
Fit...
Epoch 1/3
3264272/3264272 [==============================] - 764s 234us/step - loss: 0.0263 - accuracy: 0.9940
Epoch 2/3
3264272/3264272 [==============================] - 764s 234us/step - loss: 0.0155 - accuracy: 0.9966  
Epoch 3/3
3264272/3264272 [==============================] - 762s 233us/step - loss: 0.0134 - accuracy: 0.9970  
After fitting the model, elapsed time in seconds = 2439.6910164309666
[...]
real	44m58.908s
user	34m55.716s
sys	8m26.236s

Obviously 44m58.9s is significantly better than 71m30.8s. How about the second job in the sequence?

gpuboard $ time python3 run-model-embeddings-2.py
Using Theano backend.
/usr/local/lib/python3.6/dist-packages/theano/gpuarray/dnn.py:184: UserWarning: 
Your cuDNN version is more recent than Theano. If you encounter problems, try updating Theano or downgrading cuDNN to a version >= v5 and <= v7.
warnings.warn("Your cuDNN version is more recent than "
Using cuDNN version 7603 on context None
Mapped name None to device cuda0: NVIDIA Tegra X1 (0000:00:00.0)
2020-01-05 19:41:42.214777: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
[...]
real	18m38.689s
user	17m38.668s
sys	0m49.432s

This was again better than the Nano Theano run w/o the GPU explicitly set (18m38.7s vs. 22m36.315s).

After finishing those runs, we then set gpuarray.single_stream = False (signified by -SING in the table below) and reran. When we did so, performance was essentially unchanged.

23) Summary Performance Table

So did the Nano beat the baseline Mac laptop? No:

Mac Laptop       run-model-embeddings.py    [...]-2.py     Sum           This/Best

theano            18m 38.4s                  6m 26.8s      25m  5.2s      1.36
tensorflow        12m 38.5s                  3m 54.0s      16m 32.5s      1 (Best)

Jetson Nano      run-model-embeddings.py    [...]-2.py     Sum           This/Best

tensorflow       106m 59.8s                 19m 22.2s     126m 22.0s      7.63
theano            71m 30.8s                 22m 36.3s      94m  7.1s      5.68
theano+GPU-SING   45m 25.4s                 18m 32.4s      63m 57.1s      3.86
theano+GPU+SING   44m 58.9s                 18m 38.7s      63m 37.6s      3.85

Obviously, the Jetson Nano’s performance was not what we’d hoped to see, but the observed results may be more a product of our inexperience with Theano and the Jetson Nano than of the board itself. We welcome any feedback/suggestions from readers for improving the performance of the board.

24) Conclusion

We hope that any of you who have a Jetson Nano (or who are thinking of getting one to experiment with) find the preceding information helpful. If you have additional tips or tricks for improving the performance of the sample model on the Nano, we’d love to hear from you. The author can be reached at [email protected]

Acknowledgements

Many thanks to my colleagues David Waitzman, Kelvin Dealca and Jeremy Reed for their assistance in reviewing this article, and to Farsight CTO Ben April for sharing his testing experiences with his own Jetson Nano board. Any remaining flaws are solely the responsibility of the author.

Joe St Sauver Ph.D. is a Distinguished Scientist with Farsight Security®, Inc.