Ubuntu_Machine_for_Deep_Learning

Basics

Install ssh stuff: just edit ~/.ssh/config

ssh-keygen -t ed25519 -C "your_email@example.com"
# do not use sudo here: with sudo the key pair is written to /root/.ssh/
# instead of the ~/.ssh/ directory of your own user
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519
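
As a rough sketch of what the ~/.ssh/config edit could look like, the function below appends a Host block that points at the key generated above. The alias gpu-box, the address 192.0.2.10, and the user me are hypothetical placeholders, not values from this setup.

```shell
# Append a Host entry to an ssh config file (default: ~/.ssh/config).
# Usage: add_ssh_host ALIAS HOSTNAME USER [CONFIG_FILE]
add_ssh_host() {
    alias_name=$1; host_name=$2; user_name=$3
    config_file=${4:-$HOME/.ssh/config}
    mkdir -p "$(dirname "$config_file")"
    cat >> "$config_file" <<EOF

Host $alias_name
    HostName $host_name
    User $user_name
    IdentityFile ~/.ssh/id_ed25519
    AddKeysToAgent yes
EOF
    chmod 600 "$config_file"   # ssh refuses config files that others can write
}

# Hypothetical example: afterwards `ssh gpu-box` connects as me@192.0.2.10.
# add_ssh_host gpu-box 192.0.2.10 me
```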

Install shell stuff (systemwide)

sudo apt install zsh tmux neofetch
sh -c "$(curl -fsSL https://raw.github.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"
# and add neofetch to .zshrc

Customize zsh:

Clone the two repositories below, then set ZSH_THEME="powerlevel10k/powerlevel10k" in .zshrc and add zsh-autosuggestions to plugins=(...) in .zshrc.

git clone --depth=1 https://github.com/romkatv/powerlevel10k.git ${ZSH_CUSTOM:-$HOME/.oh-my-zsh/custom}/themes/powerlevel10k
git clone https://github.com/zsh-users/zsh-autosuggestions ${ZSH_CUSTOM:-~/.oh-my-zsh/custom}/plugins/zsh-autosuggestions
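
The two .zshrc changes can also be scripted. This is a minimal sketch that assumes GNU sed (as shipped with Ubuntu) and a single-line plugins=(...) entry, as in the stock oh-my-zsh template; the function name configure_zshrc is made up for illustration.

```shell
# Point an oh-my-zsh config at powerlevel10k and enable zsh-autosuggestions.
# Usage: configure_zshrc [ZSHRC_FILE]   (default: ~/.zshrc)
configure_zshrc() {
    zshrc=${1:-$HOME/.zshrc}
    cp "$zshrc" "$zshrc.bak"   # keep a backup of the file as it was before this run
    # Replace whatever theme is currently set.
    sed -i 's|^ZSH_THEME=.*|ZSH_THEME="powerlevel10k/powerlevel10k"|' "$zshrc"
    # Add the plugin only if it is not listed yet.
    if ! grep -q '^plugins=(.*zsh-autosuggestions' "$zshrc"; then
        sed -i 's|^plugins=(\(.*\))|plugins=(\1 zsh-autosuggestions)|' "$zshrc"
    fi
}

# Example: configure_zshrc
```

Running it twice is harmless because the plugin is only appended when it is missing.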

For settings, choose:

Install python stuff (user)

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Random Stuff

Krita

To install Krita, download the .AppImage from their website and put it into /home/koke_cacao/bin/. Then run cd ~/bin && ln -s krita-5.1.0-x86_64.appimage krita.
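
A small sketch of the same step as a reusable function. It also marks the AppImage as executable, which the download usually is not; the function name link_appimage is hypothetical.

```shell
# Make an AppImage executable and expose it under a short command name.
# Usage: link_appimage APPIMAGE_FILE COMMAND_NAME [BIN_DIR]   (default: ~/bin)
link_appimage() {
    appimage=$1; name=$2; bin_dir=${3:-$HOME/bin}
    chmod +x "$bin_dir/$appimage"          # AppImages must be executable to run
    ln -sfn "$appimage" "$bin_dir/$name"   # relative link inside the same directory
}

# Example with the file name used above:
# link_appimage krita-5.1.0-x86_64.appimage krita
```

The short name only works from any directory if ~/bin is on PATH.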

Cloud Servers

Oracle Setup

# open up the firewall on Oracle Cloud
# remember to go to the subnet and set Ingress Rules
# https://stackoverflow.com/questions/54794217/opening-port-80-on-oracle-cloud-infrastructure-compute-node
# https://medium.com/@harjulthakkar/part5-firewall-configuration-on-oracle-public-cloud-3b71b487666c
echo "Configuring iptables..."
sudo iptables -L && \
sudo iptables-save > ~/iptables-rules && \
sudo iptables -P INPUT ACCEPT && \
sudo iptables -P OUTPUT ACCEPT && \
sudo iptables -P FORWARD ACCEPT && \
sudo iptables -F && \
sudo iptables-save | sudo tee /etc/iptables.conf && \
echo "My service will automatically do sudo iptables-restore < /etc/iptables.conf to load saved iptables.conf on server start."
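
The service mentioned in the last message is not shown in these notes. One way it could look is a oneshot systemd unit, sketched below; the unit name iptables-restore.service and the function name write_iptables_unit are assumptions for illustration.

```shell
# Write a oneshot systemd unit that reloads /etc/iptables.conf at boot.
# Usage: write_iptables_unit [UNIT_PATH]
# The default unit name is an assumption, not taken from the original setup.
write_iptables_unit() {
    unit_path=${1:-/etc/systemd/system/iptables-restore.service}
    cat > "$unit_path" <<'EOF'
[Unit]
Description=Restore iptables rules from /etc/iptables.conf
Before=network-pre.target
Wants=network-pre.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'iptables-restore < /etc/iptables.conf'

[Install]
WantedBy=multi-user.target
EOF
}
```

Writing to /etc/systemd/system needs a root shell. After that, sudo systemctl daemon-reload followed by sudo systemctl enable iptables-restore.service would activate it.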

Nvidia Stuff

Understand NVCC, CUDA Driver, CUDA Toolkit, cuDNN

Terminologies:

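Compute Unified Device Architecture (CUDA): a parallel computing platform, a programming model, and an API, with C/C++ language extensions.

cuDNN: a GPU-accelerated library of primitives for deep neural networks. It is distributed separately from the CUDA Toolkit.

CUDA Toolkit: the development kit. It contains the nvcc compiler, the CUDA runtime, and many GPU-accelerated libraries such as cuFFT and cuBLAS.

NVCC: the CUDA compiler driver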

File extension    Meaning                                            Compilation option
.cu               CUDA source file, includes host and device code
.cup              Preprocessed CUDA source file                      --preprocess/-E
.c                C source file
.cc/.cxx/.cpp     C++ source file
.gpu              GPU intermediate file                              --gpu
.ptx              PTX file, similar to assembly code                 --ptx
.o/.obj           Object file                                        --compile/-c
.a/.lib           Library file                                       --lib/-lib
.res              Resource file
.so               Shared object file                                 --shared/-shared
.cubin            CUDA binary file                                   -cubin

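nvidia-smi: a command-line utility based on the NVIDIA Management Library (NVML) for monitoring and managing GPU performance and state

Sometimes the CUDA version shown by nvcc --version differs from the one shown by nvidia-smi. This is because there are two APIs, the runtime API and the driver API: nvcc --version reports the version of the installed CUDA Toolkit (runtime API), while nvidia-smi reports the highest CUDA version that the installed driver supports (driver API).

Most applications use only the runtime API, which is a higher-level wrapper and easier to use, while the driver API is a lower-level API. Early CUDA releases required choosing one of the two; current releases allow mixing them. The difference is documented here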
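
To compare the two version numbers on a given machine, something like the following can be used. It is a sketch that relies on the usual wording of the two tools' output ("release X.Y," for nvcc and "CUDA Version: X.Y" for nvidia-smi); the function names are made up.

```shell
# Extract the toolkit release from `nvcc --version` output on stdin.
toolkit_cuda_version() {
    sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# Extract "CUDA Version" from the `nvidia-smi` banner on stdin.
driver_cuda_version() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p'
}

# Print both numbers when the tools are installed.
if command -v nvcc >/dev/null 2>&1; then
    echo "toolkit (nvcc):          $(nvcc --version | toolkit_cuda_version)"
fi
if command -v nvidia-smi >/dev/null 2>&1; then
    echo "driver max (nvidia-smi): $(nvidia-smi | driver_cuda_version)"
fi
```

A toolkit version that is lower than or equal to the driver's number is the normal case.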

Installation

Install cuda stuff

Pre-Installation Checks:

You need to know which CUDA version you need and which driver version matches that CUDA version:

CUDA Forward Compatible Upgrade 418.40.04+ (CUDA 10.1) 450.36.06+ (CUDA 11.0) 470.57.02+ (CUDA 11.4) 495.29.05+ (CUDA 11.5) 510.39.01+ (CUDA 11.6) 515.43.04+ (CUDA 11.7)
11-7 X C C X* C Not Required
11-6 C C C Not required X
11-5 C C C X X
11-4 C C Not required X X
11-3 C C X X X
11-2 C C X X X
11-1 C C X X X
11-0 C Not required X X X
10-2 C X X X X
10-1 Not required X X X X
10-0 X X X X X

For updated version of table and more information, read here: https://docs.nvidia.com/deploy/cuda-compatibility/index.html#use-the-right-compat-package
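
As a quick check against the minimum driver versions in the table header, the sketch below compares the installed driver with a required minimum. It assumes GNU sort (for -V); the function name driver_at_least is hypothetical, and 515.43.04 is the CUDA 11.7 minimum taken from the table header.

```shell
# Succeed when driver version $1 is at least the minimum version $2.
# Versions are compared field by field via GNU `sort -V`.
driver_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Compare the installed driver (if any) with the CUDA 11.7 minimum.
if command -v nvidia-smi >/dev/null 2>&1; then
    installed=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    if driver_at_least "$installed" 515.43.04; then
        echo "driver $installed is new enough for CUDA 11.7"
    else
        echo "driver $installed is older than 515.43.04"
    fi
fi
```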

To perform the recommended default installation of the Nvidia driver and CUDA, here is what you should do:

  1. To install the driver: sudo ubuntu-drivers autoinstall
  2. To install CUDA: add Nvidia's CUDA repository package, then run sudo apt install cuda

Debugging During Installations

If sudo apt install cuda fails, use sudo apt list --installed | grep nvidia to check whether any nvidia-related packages are already installed.

If apt reports a version conflict such as "x but it is not going to be installed", try sudo apt install x to see what it says. A message like "x but y is to be installed" means that a dependency requires version x, but apt has resolved the package to version y. This typically happens on an operating system installed by the computer vendor, which overrides the apt sources, so the package we are trying to install resolves to the vendor-specific version y. To see all the apt sources:

ls /etc/apt/sources.list.d

You can remove the source files you do not need, then run sudo apt update.

You could also try sudo apt --fix-broken install, but you should be careful what it does.

Post-Installation

Environment Variables

Define the environment variables in .bashrc (or in .zshrc if you use zsh) like the following:

export PATH="/usr/local/cuda-11.7/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-11.7/lib64:$LD_LIBRARY_PATH"
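
Sourcing the rc file repeatedly makes these variables grow, and an empty LD_LIBRARY_PATH leaves a trailing colon, which the loader treats as the current directory. A possible alternative is a small helper that prepends a directory only once; the name prepend_path is made up for this sketch.

```shell
# Prepend DIR to the colon-separated variable VAR unless it is already there.
# Usage: prepend_path VAR DIR
prepend_path() {
    eval "current=\${$1:-}"
    case ":$current:" in
        *":$2:"*) ;;   # already present, nothing to do
        *) eval "export $1=\"\$2\${current:+:\$current}\"" ;;
    esac
}

prepend_path PATH /usr/local/cuda-11.7/bin
prepend_path LD_LIBRARY_PATH /usr/local/cuda-11.7/lib64
```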

Systemwide Settings

  1. Make sure nvidia-persistenced is enabled: check with systemctl status nvidia-persistenced and, if it is not, enable it.
  2. Disable hot-pluggable memory.

The default udev rules enable hot-pluggable memory. The default settings can be viewed in /lib/udev/rules.d/40-vm-hotadd.rules.

However, Nvidia doesn't like this default setting. Do not change files directly in /lib/udev/rules.d/. Instead, override the default by copying the file to /etc/udev/rules.d and changing it there.

We therefore run sudo cp /lib/udev/rules.d/40-vm-hotadd.rules /etc/udev/rules.d and then remove the line that looks like:

SUBSYSTEM=="memory", ACTION=="add", PROGRAM="/bin/uname -p", RESULT!="s390*", ATTR{state}=="offline", ATTR{state}="online"
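
The copy and the line removal can be combined in one step. This is a sketch assuming GNU sed; the function name disable_memory_hotadd is hypothetical, and the pattern matches any rule line starting with SUBSYSTEM=="memory", ACTION=="add".

```shell
# Copy the packaged rule file into an override directory and drop the
# rule that automatically onlines hot-added memory.
# Usage: disable_memory_hotadd [SOURCE_RULES] [OVERRIDE_DIR]
disable_memory_hotadd() {
    src=${1:-/lib/udev/rules.d/40-vm-hotadd.rules}
    dest_dir=${2:-/etc/udev/rules.d}
    dest="$dest_dir/$(basename "$src")"
    cp "$src" "$dest"
    # Delete only the memory "add" rules; every other line is kept.
    sed -i '/^SUBSYSTEM=="memory", *ACTION=="add"/d' "$dest"
}
```

With the default paths it has to run in a root shell, since a shell function is not visible to sudo. The packaged file under /lib stays untouched.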

Small Tweaks

sudo add-apt-repository -y ppa:lubomir-brindza/nautilus-typeahead
sudo apt install nautilus

Other Config

  1. add Chinese language support by following this
  2. set nautilus search to only 1 level
  3. bind flameshot gui to a keyboard shortcut to take screenshots

Gnome Extensions:

- Desktop Icons NG (DING) by rastersoft (system)

- Freon by UshakovVasilii

- OpenWeather by skrewball

- Ubuntu AppIndicators by didrocks (system)

- Ubuntu Dock by didrocks (system)

CUDA Toolkit

sudo apt install nvidia-cuda-toolkit

APT Source

Install CMake by following this guide

Install additional codecs

sudo apt install ubuntu-restricted-extras

Issues

Device Issues

If mouse middle click scroll does not work:

Make Cuda Failed

When doing apt install, you might get:

Error! Bad return status for module build on kernel: 5.15.0-46-generic (x86_64)
Consult /var/lib/dkms/nvidia/515.65.01/build/make.log for more information.
dpkg: error processing package nvidia-dkms-515 (--configure):
 installed nvidia-dkms-515 package post-installation script subprocess returned error exit status 10

When consulting the make.log, you would see:

ProblemType: Package
DKMSBuildLog:
 DKMS make.log for nvidia-515.65.01 for kernel 5.15.0-46-generic (x86_64)
 Thu Aug 11 03:40:04 AM EDT 2022
 make[1]: Entering directory '/usr/src/linux-headers-5.15.0-46-generic'
 test -e include/generated/autoconf.h -a -e include/config/auto.conf || (               \
 echo >&2;                                                      \
 echo >&2 "  ERROR: Kernel configuration is invalid.";          \
 echo >&2 "         include/generated/autoconf.h or include/config/auto.conf are missing.";\
 echo >&2 "         Run 'make oldconfig && make prepare' on kernel src to fix it.";     \
 echo >&2 ;                                                     \
 /bin/false)
 warning: the compiler differs from the one used to build the kernel
   The kernel was built by: gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0
   You are using:           cc (Ubuntu 10.3.0-15ubuntu1) 10.3.0
 make -f ./scripts/Makefile.build obj=/var/lib/dkms/nvidia/515.65.01/build \
 single-build= \
 need-builtin=1 need-modorder=1
   ln -sf /var/lib/dkms/nvidia/515.65.01/build/nvidia/nv-kernel.o_binary /var/lib/dkms/nvidia/515.65.01/build/nvidia/nv-kernel.o
   ln -sf /var/lib/dkms/nvidia/515.65.01/build/nvidia-modeset/nv-modeset-kernel.o_binary /var/lib/dkms/nvidia/515.65.01/build/nvidia-modeset/nv-modeset-kernel.o

In the log, you might also see a warning saying that the kernel was compiled with gcc-11 but the driver is being compiled with an older gcc (10.3.0 in the log above). Switching the default compiler to gcc-11 and redoing apt install will solve the problem.
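
To spot the compiler mismatch without reading the whole log, the two warning lines can be parsed. This is a sketch based on the wording of the log shown above; the function name check_dkms_compiler is made up.

```shell
# Read a DKMS make.log on stdin and report a compiler major-version mismatch.
# Prints e.g. "kernel=11 current=10" and returns 1 on mismatch, 0 otherwise.
check_dkms_compiler() {
    awk '
        /The kernel was built by:/ { split($NF, v, "."); kernel = v[1] }
        /You are using:/           { split($NF, v, "."); current = v[1] }
        END {
            if (kernel != "" && current != "" && kernel != current) {
                printf "kernel=%s current=%s\n", kernel, current
                exit 1
            }
        }
    '
}

# Path taken from the error message above:
# check_dkms_compiler < /var/lib/dkms/nvidia/515.65.01/build/make.log
```

If it reports a mismatch, one possible way to switch compilers is sudo apt install gcc-11 followed by sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-11 110, then rerun the apt install.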

ZFS Storage Full

Use the zfs list -r -t snapshot -o name,used,referenced,creation bpool/BOOT command to see all snapshots. To remove the last 4 snapshots, use zfs list -r -t snapshot -o name,used,referenced,creation bpool/BOOT | tail -n 4 | cut -c 35-40 | xargs -n 1 sudo zsysctl state remove --system.
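
The cut -c 35-40 step depends on the dataset name having a fixed length. A less position-dependent sketch is shown below. It assumes the snapshot names have the form dataset@autozsys_ID, which is what the character range in the command above suggests; the function names are hypothetical.

```shell
# Read snapshot names on stdin and print the zsys state ID of each,
# i.e. the part after "@autozsys_".
zsys_state_ids() {
    sed -n 's/^[^@[:space:]]*@autozsys_\([^[:space:]]*\).*/\1/p'
}

# Remove the last N listed system states (N defaults to 4, as in the text).
remove_last_states() {
    zfs list -H -r -t snapshot -o name bpool/BOOT \
        | zsys_state_ids | tail -n "${1:-4}" \
        | xargs -n 1 sudo zsysctl state remove --system
}

# Example: remove_last_states 4
```

Piping the zfs list output through zsys_state_ids alone first is a cheap way to review which states would be removed.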

Error Occurred at Startup

You can run sudo dmesg or check /var/crash.

Davinci Resolve

Davinci Resolve gives the error "The GPU failed to perform image processing because of an error. Error code 999." This link gives the solution.

If the Nvidia GPU is used in on-demand mode, you have to request it explicitly. To do so, set the following environment variables:

export __NV_PRIME_RENDER_OFFLOAD=1
export __GLX_VENDOR_LIBRARY_NAME=nvidia

Davinci Resolve could then be launched at /opt/resolve/bin/resolve.
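
To avoid exporting the two variables for the whole session, they can be set for a single process only. A minimal sketch, with the wrapper name run_on_nvidia made up for illustration:

```shell
# Run a command with PRIME render offload enabled for that process only.
# Usage: run_on_nvidia COMMAND [ARGS...]
run_on_nvidia() {
    __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia "$@"
}

# Example: run_on_nvidia /opt/resolve/bin/resolve
```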

Other solution (not working) involve:

Connecting to Public WiFi

After you set up Ethernet to connect to your laptop, the route table looks like this:

(base) ➜  ~ route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         10.0.0.2        0.0.0.0         UG    20100  0        0 enp2s0
default         _gateway        0.0.0.0         UG    20600  0        0 wlp3s0
10.0.0.0        0.0.0.0         255.255.255.0   U     100    0        0 enp2s0
100.64.0.0      0.0.0.0         255.255.240.0   U     600    0        0 wlp3s0
link-local      0.0.0.0         255.255.0.0     U     1000   0        0 enp2s0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0

You can see that the default route through gateway 10.0.0.2 on enp2s0 has the lowest metric (20100), so it takes priority over the default route on wlp3s0. This is a problem when we connect to a public WLAN. Usually, you connect to https://_gateway to register for the WiFi, but since the wired route wins, the redirection goes out through 10.0.0.2 instead of the WiFi. To resolve this issue, we raise the metric of enp2s0 above that of wlp3s0, which lowers its priority.

This can be done by installing ifmetric and running sudo ifmetric enp2s0 30600. You can download an offline version of the package and copy it onto the server over ssh.

(base) ➜  ~ route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         _gateway        0.0.0.0         UG    20600  0        0 wlp3s0
default         10.0.0.2        0.0.0.0         UG    30600  0        0 enp2s0
10.0.0.0        0.0.0.0         255.255.255.0   U     30600  0        0 enp2s0
100.64.0.0      0.0.0.0         255.255.240.0   U     600    0        0 wlp3s0
link-local      0.0.0.0         255.255.0.0     U     30600  0        0 enp2s0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0

This method will not interrupt your current connection. The route table should reset automatically after a reboot.
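
To confirm which interface currently wins, the default routes can be sorted by metric. This sketch parses the column layout of the route output shown above (metric in column 5, interface in column 8); the function name preferred_default_iface is made up.

```shell
# Read `route` or `route -n` output on stdin and print the interface of the
# default route with the lowest metric, i.e. the one that takes priority.
preferred_default_iface() {
    awk '($1 == "default" || $1 == "0.0.0.0") && $4 ~ /G/ { print $5, $8 }' \
        | sort -n | head -n 1 | cut -d " " -f 2
}

# Example: route -n | preferred_default_iface
```

For the two tables above it yields enp2s0 before the change and wlp3s0 after it.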
