Getting a Running Start with the NVIDIA Jetson Nano
1. Introduction
On March 18th, 2019, NVIDIA pre-announced their new “Jetson Nano” GPU development board, with shipments then-scheduled to begin June 2019. This is an intriguing little system claiming 472 GFLOPS of performance via a 128-core NVIDIA Maxwell GPU, a quad-core ARM Cortex-A57 processor, 4GB of RAM, and gigabit Ethernet — and all at a sub-$100 price point.
Phoronix benchmarked and reported on the Jetson Nano, stating that “overall, this is arguably the best sub-$100 ARM developer board we’ve seen to date depending upon your use-cases. The Jetson Nano will certainly open up NVIDIA Tegra SoCs [“System on a Chip”] to appearing in more low-cost DIY projects and other hobbyist use-cases as well as opening up GPU/CUDA acceleration that until now has really not been possible on the low-cost boards.”
While the device is largely intended for developers building machine learning/neural network applications at the edge on low-power mobile devices, we were intrigued by the device’s low cost, its relatively high-performance claims, and its NVIDIA GPU, all conveniently running on top of a tailored version of Ubuntu Linux.
We were particularly interested in whether the neural network models described in our earlier Farsight blog article entitled ‘A Second Approach To Automating Detection of “Random-Looking” Domain Names: Neural Networks/Deep Learning’ could be run faster on the inexpensive Jetson Nano.
We had high hopes for a speed-up since our early 2015-era 3.1 GHz Dual-Core Intel Core i7 Macbook Pro laptop (as used for the earlier runs) peaked at just 79.3 GFLOPS under Catalina (when tested with the Intel LINPACK benchmark), roughly 1/6th the quoted performance of the Jetson Nano. On the other hand, the Neural Network/Deep Learning model we’d built was not purely a floating point computational exercise — it also involved non-parallel computation and I/O, which might take longer on the slower ARM CPU.
Of course, the Jetson Nano is not the only device competing in this space, see also, among others:
The Intel Neural Compute Stick 2, $73.27 from Amazon (“Intel claims the chipset can hit 4 teraflops of compute and 1 trillion operations per second of dedicated neural net compute at full blast.”)
The Google Coral USB Accelerator, $74.99 from Coral.ai (“An individual Edge TPU is capable of performing 4 trillion operations (tera-operations) per second (TOPS).”) Coral also offers a dev board version for those who’d prefer that to a USB stick form factor, $149.99
And in a completely different price class, the UDOO Bolt V8, $418.00 excl. tax/VAT/shipping is described as “Almost twice as fast as the MacBook Pro 13″, for VR, AR, and AI projects,” according to the vendor.
One comparison of those alternatives can be seen in “Battle of Edge AI — Nvidia vs Google vs Intel”
Anyhow, having recently received a Jetson Nano as a Christmas gift from a family member (thank you Bev!), we’ve begun to experiment a little with it. This article is meant to capture some of what we’ve learned in that process, so if you decide you want to experiment with one of your own you can have a quick(er) and smoother start.
READ THE WHOLE BLOGPOST before you buy a Jetson Nano or begin to work with your own system.
Disclaimer: The information in this article is offered “as-is,” with all faults, errors and omissions, and no warranty whatsoever. Proceed at your own risk. The prices you pay and the performance you may see may vary, etc. This is not an endorsement (nor is it meant to be taken as a negative critique) of any product mentioned.
2. Parts and Hardware Assembly Notes
We built our system from the following parts:
GeeekPi Jetson Nano Case $29.99
128GB Micro SD Card $23.99
5V/4A Barrel Style Power Supply $8.99
(Power Supply Note: One of my colleagues who tested a Jetson Nano board with a bench power supply delivering exactly 5 volts DC (with plenty of amps) via the barrel power supply port found that the board would sometimes brown out under load until he increased the voltage to 5.2V, at which point the power problems he ran into completely disappeared.)
Total: $161.95, including the roughly $99 Jetson Nano Developer Kit itself (with the usual free shipping if you’re buying from Amazon as a Prime member, etc.)
You’ll also need (at least for your initial setup, if not for routine use):
A laptop or desktop (I used an Apple Macbook Pro running Catalina)
An HDMI-capable monitor with HDMI cable
A USB keyboard and mouse
An Ethernet cable (we note that the board has an M.2 Key E socket under the GPU module, should you want to add a compatible WiFi daughter card, but the system doesn’t come with such a card pre-installed). You can see an example of how to install one here.
Before You Get Started: The case mentioned above comes unassembled. Assembling the case is somewhat fiddly and can prove tricky if you have large hands.
Part of assembling the case includes installing the Nano board in the case. The Jetson Nano is static sensitive and is shipped in a static-protective bag, so be sure to follow appropriate antistatic control measures to avoid damage to the board or any of its components (for some ESD hints, see an example here.)
The case comes with a tiny printed manual; that manual may be enough for some, but I encourage you to also review one of the case assembly videos available online.
Note that multiple versions of the basic metal Jetson Nano case exist: depending on the version you have, you may wonder why the video shows easy access while you’re struggling to get stuff hooked to the right pin. Answer: they may actually have a different version of the case than you do.
Getting Started:
Eight tiny screws need to be removed to open the case and get at the enclosed parts pack (we would have preferred to not have the empty case screwed together as shipped).
Installing the switches: Spin the nuts off, then install the two switches finger tight.
GPU fan orientation: The case includes two cooling fans, one premounted on the case, and one for you to install on the heatsink attached to the GPU. While you might expect to use self-tapping screws to mount the fan on the GPU heatsink, the case expects you to mount the GPU fan with the provided long skinny bolts and nuts, instead. Because the fan is not marked, note that the fan is correctly oriented when the bolt head recesses are pointing up (one side of the heat sink fan is counterbored for screw heads, the other side of the heat sink fan is flush, and the bolts aren’t long enough unless they’re recessed in the counterbored holes).
Actually securing the fan bolts: You can hold the nuts with the small antistatic tweezers supplied with the case; however, the tweezers aren’t very strong and are relatively easy to break (or at least I broke mine). You may find it helpful to try a fine hemostat if you have one handy, or you may want to have someone assist you while installing the small nuts and bolts. (Note that if the nuts fall off while the system is running, they may fall onto the board and potentially cause a short, so be sure you have them adequately secured).
Wiring the power switch: There are four wires on the supplied power switch: the black and red wires are used to turn the system on/off. The other two wires provide power to the switch’s own light.
As stated on the raspberrypiwiki.com/index.php/N100 page, ‘And the blue and white cable of power control switch need to be insert in “3.3V” and “GND” pin at J41.’ An illuminated magnifier may be helpful when it comes to finding the right pins on the board. Use the included mini zip ties to manage the wires in the case.
Jumpers:
Be SURE to jump the “disable auto-on” pins on J40 pins 7 and 8 (assuming you’re installing the board in a case which has a manual power switch, like the one we’re using).
While the board can use either a micro USB power supply or a power supply with a barrel connector, we urge you to use the barrel connector power supply to avoid issues with insufficient power. Assuming you are using the barrel power supply, BE SURE to jump J48 to enable the 5V/4A barrel power supply jack. (I missed this and had to remove those eight tiny case screws, install this jumper, and then replace the eight dang tiny screws again)
You then need to flash the micro SD card and install it before the board can boot.
3. Flashing And Installing The Micro SD Card
Flash your micro SD card as described here.
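NVIDIA’s instructions walk you through flashing with a graphical tool (Etcher); if you prefer the command line on a Mac, a dd-based flash looks roughly like the following sketch. The device node and image filename below are placeholders — double-check the device with diskutil first, since dd to the wrong device will destroy data:

```shell
# Identify the micro SD card's device node (here assumed to be /dev/disk2 --
# a PLACEHOLDER; yours will likely differ):
diskutil list
# Unmount (but don't eject) the card, then write the unzipped image to the
# raw device. The image filename is also a placeholder:
diskutil unmountDisk /dev/disk2
sudo dd if=sd-blob.img of=/dev/rdisk2 bs=1m
```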
When formatting your micro SD card, tell the formatting tool to use (almost) all the space on the card (exclude/reserve perhaps 4 GB for ram disk space). If you’re too conservative (for example electing to only use 95GB or 100GB on a 128GB card for the main partition), you likely won’t be able to easily increase the size of the partition later.
Speaking of micro SD cards, be sure to buy and use a large-enough one. A basic 128GB micro SD card can be had for under $20 from some vendors, and a 256GB card for less than twice that, so there’s really no excuse for trying to limp along with some overly-tiny $6 16GB micro SD card!
Once you’ve flashed the micro SD card, install it on the Jetson Nano. The socket for the micro SD card is on the side of the board opposite the USB ports and other connectors. The contacts on the micro SD card should be oriented toward the center of the board, and the contacts should be facing upward. You should feel a spring-like detent click when the card locks into place.
4. Initial Configuration
Connect the Jetson Nano’s power supply, monitor, keyboard, mouse and Ethernet cable.
Push the power button on the case to boot the system. The power button should light, and you should also be able to see a glowing green LED on the board through the ventilation holes on the case (you may need to look closely to see it through the holes). The fans are thermally controlled and may NOT be running at this point.
The basic configuration is menu driven and pretty straightforward. You’ll be choosing the language you speak, accepting a license agreement, selecting a time zone and keyboard style, and other routine system-setup tasks. Two things to note:
- Make a note of the name you picked for the system, and your username
- Be sure to use a strong password and make a note of it, too.
5. Configuring Networking
a) IPv4 Address: We will normally be accessing the Jetson Nano over the local network, rather than by sitting at the monitor with a keyboard directly attached to the device. We’re going to assume that we do NOT need to directly access the system from the Internet, so we’re not going to give it a public IP nor a globally resolvable domain name.
We assume the system is going to be plugged into a wireless home “router.” The system will get an RFC1918 IP address from the wireless router via DHCP when the board boots. To ensure that IP doesn’t vary over time, we “locked” that IP to the Jetson Nano device via our wireless router’s management interface. Make a note of the RFC1918 IP address the router assigned to the Jetson Nano.
Using your favorite editor, add that IP and the name of your system to your Mac’s /etc/hosts file (you’ll need to use sudo or su (if you’ve enabled su on your Mac) to modify that file). Creating an /etc/hosts file entry will let you connect to the Jetson Nano using the system’s name from your local laptop.
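For example, if the router handed the board 192.168.1.42 (a placeholder — substitute the address your router actually assigned) and you named the system gpuboard, appending the entry on the Mac would look like:

```shell
# Append the Jetson Nano's entry to /etc/hosts on the Mac
# (the IP address and hostname here are placeholders):
sudo sh -c 'echo "192.168.1.42   gpuboard" >> /etc/hosts'
```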
b) Configuring ssh: You can ssh into the Jetson Nano using username and password, but we prefer to use ssh preshared keys for convenience and for improved security.
For the purposes of this article, we include the machine name, either mymac or gpuboard, as part of the prompt to help make it clear where each command is being run. To create new keys on your Mac, say:
mymac $ ssh-keygen -b 4096 -t rsa
To install that key on the Jetson Nano, assuming your username is jsmith and the system is called gpuboard, you’d enter the following command on your Mac:
mymac $ ssh-copy-id jsmith@gpuboard
You’ll need to provide your Jetson Nano’s account password to install that preshared key.
After that, you should then be able to login to your Jetson Nano by saying:
mymac $ ssh jsmith@gpuboard       <-- regular login
mymac $ ssh -X jsmith@gpuboard    <-- if you want to login and forward X11 traffic back (doing so will require an X server installed on your local laptop)
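Optionally, an entry in ~/.ssh/config on the Mac saves retyping the username and host details (the HostName and User values below are placeholders — substitute your own):

```shell
# Record the host details once; afterwards plain "ssh gpuboard" just works.
mkdir -p ~/.ssh
cat >> ~/.ssh/config <<'EOF'
Host gpuboard
    HostName 192.168.1.42
    User jsmith
EOF
```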
6) Enabling Access to The Root Account (Optional)
If you dislike sudo and prefer to su to root when doing a series of tasks that require administrative privileges, you can enable the root account by setting a (strong!) password for root with
gpuboard $ sudo passwd root
In the rest of this document, we will show commands that need root privileges by prefixing them with a hash sign.
If you’d rather not enable the root account, simply remember to prefix any command needing root permissions (denoted below with a # prompt) with sudo.
7) Restoring The Missing Documentation
Because the Jetson Nano was envisioned as potentially being used as an embedded device with limited storage, the default install has been “minimized.” In particular, Canonical, the publisher of Ubuntu, decided that they’d exclude all documentation from the default Ubuntu install in order to save disk space.
As a result, /etc/dpkg/dpkg.cfg.d/excludes comes pre-set to exclude documentation from what gets installed. We, on the other hand, REALLY LIKE documentation, and we have plenty of storage space on our 128GB micro SD card. We’ll restore what was intentionally omitted by entering:
gpuboard # unminimize
8) Ensuring Full Performance From The System
Because the Jetson Nano may be used in power-constrained situations (e.g., for robotic vision projects and that sort of thing), it also comes preconfigured to minimize power draw on battery power sources. This may include disabling some of the system’s four CPUs or running the CPUs at a reduced clock rate. Because we have abundant wall power with our 5V/4A barrel power supply and want maximum performance from the board, we need to set the system to ensure power management doesn’t inadvertently downrate performance.
For maximum performance we want to set the system power profile to zero (also known as “10 watt” mode), ensure that all four CPUs are available, and ensure that these power-related settings survive a system reboot. Therefore we’ll use vim to add the following lines to /etc/rc.local:
echo 1 > /sys/devices/system/cpu/cpu0/online
echo 1 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu2/online
echo 1 > /sys/devices/system/cpu/cpu3/online
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
nvpmodel -m 0
( sleep 60 && jetson_clocks )&
After exiting your editor and saving that file, set /etc/rc.local to be executable with:
gpuboard # chmod a+rx /etc/rc.local
Confirm the file runs okay now:
gpuboard # /etc/rc.local
Also disable ondemand by saying:
gpuboard # update-rc.d -f ondemand remove
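You can confirm the settings took effect with a few queries on the board itself (a sketch — exact output will vary with your L4T release):

```shell
# Power mode should report mode 0 ("10 watt" / maximum performance):
nvpmodel -q
# Each of the four CPUs should print 1 (i.e., online):
cat /sys/devices/system/cpu/cpu*/online
# The scaling governor should report "performance":
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
```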
Since we’re “semi-hotrodding” the board, this is also a good time to consider enabling dynamic fan control.
9) Ensure The System’s Software Is Up-To-Date
Let’s now ensure the system is up-to-date by saying:
gpuboard # apt-get update
gpuboard # apt-get upgrade
Important note: because of the customized version of Ubuntu (Linux For Tegra, L4T) running on the board, do NOT EVER attempt to run dist-upgrade! You are “stuck” (at least for now) on Ubuntu 18.04 LTS.
10) Let’s Also Remove Packages We Don’t Need
Unneeded packages may potentially slow the system down, either during boot or routine use, but be careful you don’t inadvertently delete a package you actually need. An example of a package that I was comfortable removing was:
gpuboard # apt-get purge modemmanager
You can also probably safely clean up no-longer needed packages with:
gpuboard # apt autoremove
11) Adding Swap Space
You can see the current amount of swap space you’ve got on the Jetson Nano with:
gpuboard $ zramctl
See also:
gpuboard $ cat /proc/swaps
If you need (or want) to add additional swap space on the Jetson Nano, see the process described here.
Steps 2 and 3 of that guide are also worth a look.
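For reference, the usual swap-file recipe on Ubuntu looks something like the following (the 4 GB size is an arbitrary choice, and this is a sketch of the general process rather than the exact steps in the linked articles):

```shell
sudo fallocate -l 4G /swapfile   # create a 4 GB file to back the swap
sudo chmod 600 /swapfile         # swap files must not be world-readable
sudo mkswap /swapfile            # write swap metadata to the file
sudo swapon /swapfile            # enable it immediately
# To make the swap file survive reboots, also add this line to /etc/fstab:
#   /swapfile none swap sw 0 0
```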
12) Ensure The System’s Clock Is NTP-Sync’d For Accurate Logging
I’m a big believer in accurate logs. Do yourself a favor and make sure your system time is accurate by running NTP. Install an NTP client by saying:
gpuboard # apt install chrony
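Once chrony is installed and running, you can verify that the clock is actually syncing (output will obviously vary with your network and configured servers):

```shell
chronyc tracking      # current reference source, offset, and drift
chronyc sources -v    # the NTP servers being polled, with status flags
```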
13) Ensure The Cuda Compiler and Libraries Are Installed And Usable
It’s easy to mistakenly get the impression that the Cuda GPU-enabled compiler and libraries AREN’T installed. For example, perhaps you tried saying:
gpuboard $ nvcc --version
bash: nvcc: command not found
That might lead you down a futile path of trying to download and install the Cuda tools when in fact they’re ALREADY installed as part of the Jetpack installer, just not in your default PATH.
Make sure nvcc and related libraries are accessible by editing ~/.bashrc with your favorite editor, adding at the bottom:
export PATH=${PATH}:/usr/local/cuda/bin
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda/lib64
Now run that file by saying:
gpuboard $ source ~/.bashrc
From then on, you should see what you’d hope to see when you run nvcc, e.g.:
gpuboard $ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Mon_Mar_11_22:13:24_CDT_2019
Cuda compilation tools, release 10.0, V10.0.326
Now let’s try one of the examples that come with the Cuda tools. We need to copy the read-only samples to a directory under our home directory by saying:
gpuboard $ cuda-install-samples-10.0.sh ~
We can then make those programs (this may take an hour or so to run) by saying:
gpuboard $ cd NVIDIA_CUDA-10.0_Samples/
gpuboard $ time make
[lots of output elided here]
Finished building CUDA samples

real	61m27.359s
user	54m47.080s
sys	5m37.152s
If you review the output from that make, you may notice some warning messages (unused variables, deprecated declarations, etc.), but those are believed to be non-problematic. Let’s try one of the programs that we’ve now got compiled…
gpuboard $ cd ~/NVIDIA_CUDA-10.0_Samples/bin/aarch64/linux/release
gpuboard $ ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X1"
  CUDA Driver Version / Runtime Version          10.0 / 10.0
  CUDA Capability Major/Minor version number:    5.3
  Total amount of global memory:                 3964 MBytes (4156911616 bytes)
  ( 1) Multiprocessors, (128) CUDA Cores/MP:     128 CUDA Cores
  GPU Max Clock rate:                            922 MHz (0.92 GHz)
  Memory Clock rate:                             1600 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 262144 bytes
[etc]
Result = PASS
Looks like we’re in business!
14) Installing Application Software
Now we’re ready to begin installing our application environment.
Let’s start by installing numpy and scipy, and the packages required by those applications (again, some of these may take a while to download and install):
gpuboard # apt-get install git cmake libatlas-base-dev gfortran
gpuboard # apt-get install libhdf5-serial-dev hdf5-tools python3-dev python3-pip
gpuboard # apt-get install libpcap-dev libpq-dev python3-matplotlib
gpuboard # pip3 install numpy testresources setuptools cython pandas
gpuboard # apt-get install python3-scipy
Everything should install fine, but we can verify that numpy is working with a tiny test program to transpose a small matrix:
gpuboard $ cat test.py
#!/usr/bin/python3
import numpy as np
x = np.array([[1, 2], [3, 4], [5, 6]])
print ("x ==> \n", x)
xprime = x.T
print ("\nxprime ==> \n", xprime)

gpuboard $ chmod a+rx test.py
gpuboard $ ./test.py
x ==>
 [[1 2]
 [3 4]
 [5 6]]

xprime ==>
 [[1 3 5]
 [2 4 6]]
This step looks good, too! Now let’s go on to install our machine learning/neural network applications.
15) Machine Learning/Neural Networking Applications
The Jetson Nano has always been positioned as a platform for running machine learning/neural network applications. A previous Farsight blog article used keras (with theano/Tensorflow). Let’s install that software now.
a) We’ll begin with keras; it’s easy:
gpuboard # pip3 install keras
b) Ditto theano:
gpuboard # pip3 install theano
c) Tensorflow, on the other hand, is somewhat fussy on the Jetson Nano. For details, please visit here.
gpuboard # pip3 install --upgrade numpy==1.16.1 future==0.17.1 mock==3.0.5 h5py==2.9.0 keras_preprocessing==1.0.5 keras_applications==1.0.8 gast==0.2.2 enum34 futures protobuf
gpuboard # pip3 install --pre --extra-index-url https://developer.download.nvidia.com/compute/redist/jp/v43 tensorflow-gpu
If you need to/want to install from source, be aware that this can literally take over 40 hours.
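Once everything is installed, a quick sanity check is to import each package and confirm TensorFlow can see the GPU. This is a sketch: version numbers will differ, and tf.test.is_gpu_available() is the call as of TensorFlow 1.x/2.0 (it is deprecated in later releases):

```shell
python3 -c 'import keras;  print("keras", keras.__version__)'
python3 -c 'import theano; print("theano", theano.__version__)'
python3 -c 'import tensorflow as tf; print("tensorflow", tf.__version__)'
python3 -c 'import tensorflow as tf; print("GPU visible:", tf.test.is_gpu_available())'
```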
Other machine learning packages are also available for the Jetson Nano, but we’re not going to go over installing them here.
16) Performance
Per this NVIDIA article, let’s install jtop as a nice summary performance monitoring tool. It’s available here or via pip3:
gpuboard # pip3 install jetson-stats
We also want iostat, which is part of the sysstat package:
gpuboard # apt-get install sysstat -y
Now we’re ready to actually run a real application, namely the code from our earlier machine learning blog article.
17) Sample Code From Machine Learning Blog Article: macOS Catalina With Keras and Theano
All the preceding was basically meant to get us set up to run the code shown in Appendices III and IV from our earlier machine learning article.
For reference, here’s the performance we want to beat — Keras on a MacBook Pro Retina (Early 2015) laptop (3.1 GHz dual-core Intel Core i7, 16GB 1867 MHz DDR3, Intel Iris Graphics 6100 1536 MB, running under Mac OS X Catalina 10.15.2, with the Intel Math Kernel Library for OS X (submitted December 16, 2019):
mymac $ cat .keras/keras.json
{
"floatx": "float32",
"epsilon": 1e-07,
"backend": "theano",
"image_data_format": "channels_last"
}
mymac $ time python3 run-model-embeddings.py
Using Theano backend.
[...]
Post model save, elapsed time in seconds = 1116.3568624319998
real 18m38.418s
user 17m14.130s
sys 0m33.749s
mymac $ time python3 run-model-embeddings-2.py
Using Theano backend.
[...]
real 6m26.779s
user 6m12.230s
sys 0m5.035s
18) Sample Code From Machine Learning Blog Article: MacOS Catalina With Keras and TensorFlow
The TensorFlow backend to Keras proved to be even faster on that Mac:
mymac $ cat .keras/keras.json
{
    "floatx": "float32",
    "epsilon": 1e-07,
    "backend": "tensorflow",
    "image_data_format": "channels_last"
}

mymac $ time python3 run-model-embeddings.py
Using TensorFlow backend.
[...]
Post model save, elapsed time in seconds = 755.7816738820001

real	12m38.517s
user	19m41.349s
sys	2m38.368s

mymac $ time python3 run-model-embeddings-2.py
Using TensorFlow backend.
[...]
real	3m53.989s
user	4m18.361s
sys	0m13.532s
19) Advanced Instruction Support On Catalina?
An interesting thing we noted during the Keras run on the Mac with the TensorFlow backend was the informational message that:
"tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA"
See here for a discussion of this message.
We attempted to compile a custom version of TensorFlow from source for our Mac that enabled those instructions, but saw (for example):
Warning: "Disabling AVX support: clang compiler shipped with XCode 11.[012] generates broken assembly with -macosx-version-min=10.15 and AVX enabled."
As a result, the results reported here were from the default brew
version of TensorFlow (even if that didn’t include AVX2 and FMA instructions). It’s possible that if/when these bugs can be resolved, performance on the Mac might be improved still further.
20) Sample Code From Machine Learning Blog Article: Jetson Nano With Keras and TensorFlow
With current Mac laptop benchmark results in hand, we’re finally ready to try running the code from our earlier machine learning blog article on the Jetson Nano.
We moved that code and the associated data file (new-tokenized-20-char.txt.gz) to the Jetson Nano using sftp, and then ran that code on the Jetson Nano.
We ensured Keras on the Jetson Nano was configured to use TensorFlow as a backend:
gpuboard $ cat .keras/keras.json
{
    "floatx": "float32",
    "epsilon": 1e-07,
    "backend": "tensorflow",
    "image_data_format": "channels_last"
}

gpuboard $ time python3 run-model-embeddings.py
Using TensorFlow backend.
2020-01-01 19:33:29.027814: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
Reading in the tokenized data...
After reading in training data, elapsed time in seconds = 18.10690813[...]
[...]
We can watch the system using jtop in another ssh window while the run proceeds.
Output from the run looks like:
[...]
3264272/3264272 [==============================] - 1900s 582us/step - loss: 0.0265 - accuracy: 0.9942
Epoch 2/3
3264272/3264272 [==============================] - 1848s 566us/step - loss: 0.0156 - accuracy: 0.9966
Epoch 3/3
3264272/3264272 [==============================] - 1845s 565us/step - loss: 0.0138 - accuracy: 0.9969
After fitting the model, elapsed time in seconds = 5694.876962216
Evaluate...
accuracy: 99.73%
[...]
Post model save, elapsed time in seconds = 6399.951873356999

real	106m59.793s
user	100m32.988s
sys	20m25.848s
Let’s now try running run-model-embeddings-2.py:
gpuboard $ time python3 run-model-embeddings-2.py
[...]
real	19m22.221s
user	20m36.464s
sys	1m47.480s
21) Sample Code From Machine Learning Blog Article: Jetson Nano With Keras and Theano
Now let’s try Keras on the Jetson Nano with theano for the backend instead of TensorFlow. We set the .keras/keras.json file to look like:
gpuboard $ cat .keras/keras.json
{
"floatx": "float32",
"epsilon": 1e-07,
"backend": "theano",
"image_data_format": "channels_last"
}
We also realized that the ~/.theanorc file configuration might impact use/non-use of the GPU.
We began by reviewing the stock configuration reported by:
gpuboard $ python3 -c 'import theano; print(theano.config)' | more
We noted in particular the following settings from that config report:
device (cpu, opencl*, cuda*)
    Doc:  Default device for computations. If cuda* or opencl*, change the
    default to try to move computation to the GPU. Do not use upper case
    letters, only lower case even if NVIDIA uses capital letters.
    Value: cpu
[...]
gpuarray.single_stream ()
    Doc:  If your computations are mostly lots of small elements, using
    single-stream will avoid the synchronization overhead and usually be
    faster. For larger elements it does not make a difference yet. In the
    future when true multi-stream is enabled in libgpuarray, this may change.
    If you want to make sure to have optimal performance, check both options.
    Value: True
Running WITHOUT Any GPU setting in the .theanorc file:
gpuboard $ time python3 run-model-embeddings.py
Using Theano backend.
2020-01-01 21:59:46.847850: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
Reading in the tokenized data...
After reading in training data, elapsed time in seconds = 18.17661204400065
[...]
Epoch 1/3
3264272/3264272 [==============================] - 1204s 369us/step - loss: 0.0255 - accuracy: 0.9943
Epoch 2/3
3264272/3264272 [==============================] - 1202s 368us/step - loss: 0.0152 - accuracy: 0.9967
Epoch 3/3
3264272/3264272 [==============================] - 1206s 369us/step - loss: 0.0136 - accuracy: 0.9970
After fitting the model, elapsed time in seconds = 3886.5203647100006
Evaluate...
accuracy: 99.70%
[...]
Post model save, elapsed time in seconds = 4270.506449021001
real	71m30.840s
user	70m41.324s
sys	0m21.528s
gpuboard $ time python3 run-model-embeddings-2.py
[...]
real	22m36.315s
user	22m29.492s
sys	0m2.668s
22) Sample Code From Machine Learning Blog Article: Jetson Nano With Keras and Theano (Using An Explicit GPU Setting)
We now tried setting the GPU explicitly in the .theanorc file:
gpuboard $ cat .theanorc
[cuda]
root = /usr/local/cuda

[global]
device = cuda0
floatX = float32
gpuarray.single_stream = True

[dnn]
library_path = /usr/local/cuda/lib64/
We initially saw:
ERROR (theano.gpuarray): pygpu was configured but could not be imported or is too old (version 0.7 or higher required)
pygpu is part of libgpuarray, which is in turn part of Theano. We attempted to upgrade just pygpu/libgpuarray, but eventually ended up building a full new copy of Theano 2 from source, more-or-less as described here.
Note that this can take a substantial period of time (as in, don’t bother staying up, go ahead and get some sleep). [N.B. After getting this built and running, I stumbled upon this article (scroll down to the item mentioning “TensorFlow 2.0 can be installed with JetPack4.3 now” posted 12/26/2019 03:13 AM).]
Anyhow, when we re-ran the model with our built-from-source Theano 2.0 with gpuarray.single_stream = True
(designated by +SING in the summary result table below), we had success:
gpuboard $ time python3 run-model-embeddings.py
Using Theano backend.
/usr/local/lib/python3.6/dist-packages/theano/gpuarray/dnn.py:184: UserWarning: Your cuDNN version is more recent than Theano. If you encounter problems, try updating Theano or downgrading cuDNN to a version >= v5 and <= v7.
  warnings.warn("Your cuDNN version is more recent than "
Using cuDNN version 7603 on context None
Mapped name None to device cuda0: NVIDIA Tegra X1 (0000:00:00.0)
2020-01-05 18:51:28.809377: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
[...]
Fit...
Epoch 1/3
3264272/3264272 [==============================] - 764s 234us/step - loss: 0.0263 - accuracy: 0.9940
Epoch 2/3
3264272/3264272 [==============================] - 764s 234us/step - loss: 0.0155 - accuracy: 0.9966
Epoch 3/3
3264272/3264272 [==============================] - 762s 233us/step - loss: 0.0134 - accuracy: 0.9970
After fitting the model, elapsed time in seconds = 2439.6910164309666
[...]

real	44m58.908s
user	34m55.716s
sys	8m26.236s
Obviously 44m58.9s is significantly better than 71m30.8s. How about the second job in the sequence?
gpuboard $ time python3 run-model-embeddings-2.py
Using Theano backend.
/usr/local/lib/python3.6/dist-packages/theano/gpuarray/dnn.py:184: UserWarning: Your cuDNN version is more recent than Theano. If you encounter problems, try updating Theano or downgrading cuDNN to a version >= v5 and <= v7.
  warnings.warn("Your cuDNN version is more recent than "
Using cuDNN version 7603 on context None
Mapped name None to device cuda0: NVIDIA Tegra X1 (0000:00:00.0)
2020-01-05 19:41:42.214777: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
[...]

real	18m38.689s
user	17m38.668s
sys	0m49.432s
This was again better than the Nano Theano run w/o the GPU explicitly set (18m38.7s vs. 22m36.315s).
After finishing those runs, we then set gpuarray.single_stream = False (signified by -SING in the table below) and reran. When we did so, performance was essentially unchanged.
23) Summary Performance Table
So did the Nano beat the baseline Mac laptop? No:
Mac Laptop        run-model-embeddings.py   [...]-2.py   Sum          This/Best
theano            18m 38.4s                 6m 26.8s     25m 5.2s     1.36
tensorflow        12m 38.5s                 3m 54.0s     16m 32.5s    1 (Best)

Jetson Nano       run-model-embeddings.py   [...]-2.py   Sum          This/Best
tensorflow        106m 59.8s                19m 22.2s    126m 22s     7.63
theano            71m 30.8s                 22m 36.3s    94m 7.1s     5.68
theano+GPU-SING   45m 25.4s                 18m 32.4s    63m 57.1s    3.86
theano+GPU+SING   44m 58.9s                 18m 38.7s    63m 37.6s    3.85
Obviously, the Jetson Nano’s performance was not the performance we’d hoped to see, but the observed performance may be more the result of my inexperience with Theano and the Jetson Nano than the board itself. We welcome any feedback/suggestions for improving the performance of the board from readers.
24) Conclusion
We hope that any of you who have a Jetson Nano (or who are thinking of getting one to experiment with) find the preceding information helpful. If you have additional tips or tricks for improving the performance of the sample model on the Nano, we’d love to hear from you. The author can be reached at [email protected]
Acknowledgements
Many thanks to my colleagues David Waitzman, Kelvin Dealca and Jeremy Reed for their assistance in reviewing this article, and to Farsight CTO Ben April for sharing his testing experiences with his own Jetson Nano board. Any remaining flaws are solely the responsibility of the author.
Joe St Sauver Ph.D. is a Distinguished Scientist with Farsight Security®, Inc.