
32-TFLOP Deep Learning GPU Box

A super-fast linux-based machine with multiple GPUs for training deep neural nets

I decided to build a machine to speed up the process of training convolutional neural networks for a variety of applications. Training times are normally very long, so having a powerful machine means you don't have to wait days to optimize parameters.

I wanted to make use of the new GTX 1080 graphics cards that Nvidia has just released. These have 8GB of RAM and can operate at over 8 TFLOPS.

Nvidia sells a deep learning dev box, but it costs more than $10K, so building one yourself is a good option.

The machine I put together was built for less than $5K, using a Gigabyte GA-X99P-SLI motherboard with a 512GB M.2 SSD and a 1TB secondary SATA drive for training-data storage.

In this project I describe the hardware that I chose, and also the process of installing Ubuntu and TensorFlow (Google's machine-learning Python package).

After spending quite a long time learning the details about things like PCIe lanes, I eventually managed to specify the parts for this machine:

Since the GTX 1080 only just came out, I decided not to buy any until the price drops a bit and third-party vendors start selling them. I read that it's not that smart to buy the "Founders Edition", because the cooling solution is not that good and there is an early-adopter price premium. For the moment I am using a single GTX 970 to get everything set up and working, and plan to replace it eventually.

The CPU has 40 PCIe version 3 lanes. The motherboards all seem to share the issue that you can't tell them how many lanes to use for each slot. I would rather it used x8, x8, x8, x8, leaving 8 lanes for the M.2 drive and other peripherals, but in fact it tries to use x16 for the graphics cards and then has to fall back to PCIe version 2 for the M.2 slot. This is a known limitation.

I think for these compute-intensive tasks the actual bandwidth between CPU and GPUs is not too important, because most of the time is spent computing on the GPU.
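As a rough sanity check on what x8 vs x16 actually costs, PCIe 3.0 delivers about 985 MB/s of effective bandwidth per lane (8 GT/s with 128b/130b encoding):

```shell
# Back-of-envelope PCIe 3.0 host<->GPU bandwidth per slot width.
# ~985 MB/s effective per lane (8 GT/s, 128b/130b encoding).
echo "x8:  $((985 * 8)) MB/s"    # prints: x8:  7880 MB/s
echo "x16: $((985 * 16)) MB/s"   # prints: x16: 15760 MB/s
```

Either way there is several GB/s per card, which is plenty for shipping minibatches to the GPU.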

I will be installing Ubuntu 14.04, the CUDA Toolkit 7.5, cuDNN v4, and TensorFlow on Python 3.

  • 1 × Gigabyte LGA2011-3 Intel X99 SLI ATX Motherboard GA-X99P-SLI Motherboard
  • 1 × Samsung 950 PRO Series - 512GB PCIe NVMe - M.2 Internal SSD (MZ-V5P512BW) Drive
  • 1 × Samsung 850 EVO 1 TB 2.5-Inch SATA III Internal SSD (MZ-75E1T0B/AM) Secondary drive
  • 1 × General 2.5" SSD to 3.5" SATA Hard Disk Drive HDD Adapter CADDY TRAY CAGE Hot Swap Plug SSD drive bay
  • 1 × EVGA SuperNOVA 1600 G2 80+ GOLD, 1600W Fully Modular NVIDIA SLI and Crossfire Ready 10 Year Warranty Power Supply 120-G2-1600-X1 PSU

View all 9 components

  • Success

    robotbugs 06/27/2016 at 23:11 0 comments

    I finally got TensorFlow compiled from source and running with Python 3, cuDNN version 5, and CUDA Toolkit 7.5 on the single GTX 970 that I currently have in this box.

    The results are pretty awesome: the conv-net training example on MNIST runs more than 25 times faster than it does on my MacBook Pro.

    Now I am trying to decide if I should just add in a bunch of GTX 970s for the moment, because the 1080 is still very pricey.

  • Progress

    robotbugs 06/27/2016 at 22:24 0 comments

    Well, I got past the Bazel problem (I think) by downloading and installing the binary.

    Then the next issue was running the TensorFlow configure script. It wants to know a bunch of information that is tricky to find, the worst being the location of the cuDNN library. It turns out that the headers are installed in /usr/include and the libs in /usr/lib/x86_64-linux-gnu/, and the script can't handle that. I had to copy the header to /usr/local/cuda/include and the libs to lib64 in the same directory. I also had to add the libraries to the search path by adding the path to /etc/ld.so.conf.d/cuda.conf and running sudo ldconfig, which is something I hadn't seen before.
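The workaround above, as a shell sketch. The exact cuDNN file names are assumptions based on the v5 deb packages; check what is actually in /usr/include and /usr/lib/x86_64-linux-gnu/ first.

```shell
# Copy the cuDNN header and libraries to where TensorFlow's configure
# script expects them (requires root).
sudo cp /usr/include/cudnn.h /usr/local/cuda/include/
sudo cp /usr/lib/x86_64-linux-gnu/libcudnn* /usr/local/cuda/lib64/

# Make the dynamic linker aware of the CUDA library directory,
# then rebuild the linker cache.
echo /usr/local/cuda/lib64 | sudo tee -a /etc/ld.so.conf.d/cuda.conf
sudo ldconfig
```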

    I'm compiling TensorFlow from source now. I can't believe how many horrible warning messages are generated by this code.
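For the record, the from-source GPU build at the time went roughly like this (these commands follow the TensorFlow build instructions of that era, so treat the exact flags and paths as assumptions):

```shell
# Answer the interactive CUDA/cuDNN questions, then build the
# GPU-enabled pip package and install it for python3.
./configure
bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
sudo pip3 install /tmp/tensorflow_pkg/tensorflow-*.whl
```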

  • Unable to install Bazel dependency

    robotbugs 06/27/2016 at 20:02 1 comment

    I am trying to install TensorFlow from source on Ubuntu 14.04. Installing the dependencies is the problem. I was able to install the Python-related dependencies and then moved on to Bazel. The first step was to install Java 8, and that went OK.

    The next step was to install Bazel itself. When running apt-get update I just get the error:

    "W: Failed to fetch http://storage.googleapis.com/bazel-apt/dists/stable/InRelease Unable to find expected entry 'jdk1.8/binary-i386/Packages' in Release file (Wrong sources.list entry or malformed file) / E: Some index files failed to download. They have been ignored, or old ones used instead."

    and then I am lost. I can't find anything useful about this error on the web. I might try installing the binary directly.
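One hedged guess at a fix (an assumption on my part, not something from the log): the error shows apt looking for i386 package indexes that the Bazel repository doesn't publish, so restricting the sources.list entry to amd64 may make apt stop asking for them:

```shell
# /etc/apt/sources.list.d/bazel.list
# Restrict the Bazel repo to amd64 so apt does not request
# the nonexistent i386 Packages index.
deb [arch=amd64] http://storage.googleapis.com/bazel-apt stable jdk1.8
```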

  • TensorFlow and cuDNN versions.

    robotbugs 06/27/2016 at 19:28 0 comments

    Turns out I have to install TensorFlow from source, because the binary version only works with cuDNN version 4. I had installed cuDNN version 5 and was going to roll back to v4, but then I found out that version 4 does not work with the GTX 1080 cards; those need version 5. So now I am trying to uninstall TensorFlow and then reinstall it from source.

  • SLI

    robotbugs 06/05/2016 at 03:08 0 comments

    One of the things I assumed early on was that I would want to use SLI; however, this is not true. SLI is useful for games, where you want multiple graphics cards to look like a single card. But when using TensorFlow or cuDNN, one just specifies which GPU device each part of the code should run on, and there's no need to pretend several cards are a single device. So I didn't bother buying any SLI bridges.
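For example, one way to pin work to particular cards with no SLI involvement is at the process level (the script name here is hypothetical):

```shell
# The CUDA runtime exposes only the listed physical GPUs to the process;
# inside TensorFlow they then appear renumbered as /gpu:0 and /gpu:1.
CUDA_VISIBLE_DEVICES=0,2 python3 train_convnet.py
```

Finer-grained placement (putting individual ops on specific devices) is then done inside the TensorFlow code itself.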

  • GTX 1080

    robotbugs 06/05/2016 at 03:06 0 comments

    Here are some details about the GTX 1080. I definitely want to use four of these cards, but I'm having to wait for the price to come down. Currently the Founders Edition is available on Amazon, but with a huge markup at around $850. I'm not sure how long it will take for these cards to settle at a more reasonable price. But I'm happy to get things going with the GTX 970 for the moment, since it's better than training on my laptop and there's a lot of setup to do in customizing the environment.

  • Installing packages

    robotbugs 06/04/2016 at 05:03 0 comments

      The next tasks were

      1. Ensure all packages are up to date with apt-get
      2. Ensure that python3 was ok
      3. Install scipy and any other relevant packages
      4. Install CUDA toolkit 7.5
      5. Install cuDNN
      6. Install pip3 because TensorFlow needs this
      7. Install TensorFlow

      I installed scipy using 'sudo apt-get install python3-scipy'.

      I installed the CUDA Toolkit using the deb(network) link, using dpkg and then apt-get.
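The deb(network) route went roughly like this (the repo deb filename is an assumption based on the CUDA 7.5 download page of the time):

```shell
# Register Nvidia's network repository, then pull the toolkit via apt.
sudo dpkg -i cuda-repo-ubuntu1404_7.5-18_amd64.deb
sudo apt-get update
sudo apt-get install cuda
```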

      I then went on to attempt to install cuDNN, and chose the newest version (5). This turned out to be a mistake. cuDNN installation is also confusing. In the end I installed it using the two deb packages, not manually. This puts the header files and shared libraries in the usual system locations, but not in the same place as the CUDA Toolkit.

      Then I installed pip3, because this is the only way to get TensorFlow for Linux. I followed the instructions on the TensorFlow site for Linux with Python 3 and a GPU, and everything built OK.

      I needed to set the LD_LIBRARY_PATH to get python3 to find the cuda libraries.
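The environment setup was along these lines (a sketch; the exact paths depend on where the toolkit landed on your system):

```shell
# Added to ~/.bashrc so python3 can locate the CUDA/cuDNN shared libraries
# at import time.
export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```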

      I ran the tests on the TensorFlow page and everything seemed ok. But then when I tried to run the MNIST training example it crashed.

      Eventually I found that this crash happens because TensorFlow needs cuDNN version 4, not version 5. So now I have to go back and screw with the installation, trying to remove cuDNN and install the older one, and possibly rebuild TensorFlow.

  • Installing the OS

    robotbugs 06/04/2016 at 04:50 0 comments

      I first installed Ubuntu 16.04 from a DVD ISO using an external DVD drive. This went OK. Then I formatted the new data drive:

      1. Install gksudo using apt-get
      2. Install gparted using gksudo
      3. Run gparted and partition and format the drive with an MSDOS partition table

      Then I opened the Ubuntu GUI for installation / software updates and found that the machine was using an open-source driver for the installed GTX 970. It gave me the option of switching to the Nvidia proprietary driver, which I did.

      Then I attempted to install CUDA Toolkit 7.5 and hit my first snag: there is no version available for Ubuntu 16.04. If you try to download the deb package anyway, you will find that it will not install, complaining that the signing key is too short. This is because Ubuntu 16.04 deprecated the package-validation scheme that Nvidia is using.

      So I started again and installed Ubuntu 14.04.4 LTS.

      Then I ran into another problem: a UI-related crash on startup (unity-settings-daemon). However, when I switched to the Nvidia display driver again, this stopped happening.

  • Getting the hardware running

    robotbugs 06/04/2016 at 04:33 0 comments

    The building process was not too hard and kind of fun. The components look impressive.

    One hassle with putting this together was fitting the water cooling radiator to the provided fan and attaching this to the case. I had to tie the thing in place with zip ties so that I could then put the bolts through and insert the washers without it all falling apart and losing small parts in the case.

    It's not obvious which holes to use for screwing the PSU to the wall of the case. Some openings that look like screw holes are not threaded, and it was easy to cross-thread the screws into them - inspect carefully to find the legitimate holes.

    The mains cable is very thick. At 1600W the machine basically needs its own 15A circuit. However, four GTX 1080s are not that power hungry, so I think the PSU is probably over-specified.

    Another hassle was plugging all the cables into the motherboard when it's actually quite dark in there (I used a flashlight); it was also easy to bend the fine pins on the USB board interconnects.

    There are four fans and a water pump in the cooling system. I plugged the fan that is on the radiator into the CPU_FAN board connector and the water pump into the CPU_OPT connector. There are three other fan connectors distributed around the board which I used for the case fans.

    It's important to put the RAM in the right location. This system uses 4 DDR4 modules to make up 32GB. These go in the gray RAM sockets.

    I was pleased to find that it booted up fine.

    I upgraded the BIOS using the QFLASH tool and a USB drive containing the BIOS image from the Gigabyte site. Initially I was getting an error during the upgrade, but eventually I realized that I was using the X99-SLI BIOS instead of the one for the X99P-SLI.

    The big problem that stopped me for almost a week was that the BIOS could not see the M.2 drive, so I could not install any OS. I looked everywhere online for solutions and went through every setting in the BIOS. In the end I replaced both the M.2 drive and the motherboard, and then it worked OK, so I think there was some problem with the motherboard. Note that this motherboard does not support SATA M.2 drives; they must be PCIe.

View all 9 project logs


Discussions

Hendrik Wiese wrote 04/12/2017 at 08:27 point

Hey, very interesting project. I'm planning to build something similar. How's the status of the project? Do you already have the four 1080 GPUs up and running? What can you say about performance? Would be happy to see an update! 


karl.94404 wrote 02/13/2017 at 17:24 point

How do you measure TFLOPS? How fast is your box now?


oliver wrote 11/15/2016 at 02:23 point

Fixed it! Updating the BIOS is all I needed to do.


oliver wrote 10/07/2016 at 17:59 point

So I have built the same machine here. Right now I have 3 1080s available to me (1 mine and 2 borrowed that I need to return). I have been able to get 2 to work with no problem; however, when I plug in the third I can't get past the BIOS screen. I never updated the BIOS like you did, though. Has anyone experienced a similar problem? Thanks so much for this project. It has been very helpful.


liu2bao3yuan1 wrote 08/30/2016 at 16:55 point

Do you mind sharing with me how well the cooling works with this configuration? How's the GPU temperature when running with full speed? Is the fan running loud? I want to build a similar machine, but am worried about overheating. Thanks a lot!


Will middle wrote 08/11/2016 at 18:58 point

From what I understand, adding more GPUs in SLI doesn't increase the total video RAM available for training neural nets. I too have acquired 2 GTX 970s, but it seems that I am not able to load a neural net of more than 3.5GB even though the total video RAM between the two is 7GB.

Is this consistent with what you have found or researched? 


Frank Dai wrote 07/15/2016 at 00:52 point

As for the motherboard, I suggest the Asus X99-E WS or Asus X99-E WS/USB3.1, because they have two PCIe bridges which support 4-way PCIe 3.0 x16 at full speed.


Hector Flores wrote 06/07/2016 at 03:14 point

This is a cool project. I am also building a deep learning machine. My parts are very similar to yours, but I'm not ballin' like you are. I only opted for one ssd (the samsung 950) and I am not water cooling my cpu. 


Marco Foco wrote 06/06/2016 at 08:57 point

Isn't the PCIe M.2 drive taking up some lanes, too? I think the only viable config with this hardware is 8x, 8x, 8x, 8x.


robotbugs wrote 06/07/2016 at 20:05 point

Yeah, I'd like it to be 8x/8x/8x/8x, but I think there's no option to do that. From what I read, the difference between 8x and 16x is not worth worrying about.


justin.francis wrote 06/05/2016 at 22:42 point

P.S. You should not have to rebuild TensorFlow; you should be able to just drag and drop the cuDNN v4 files into your CUDA folder using sudo nautilus.


robotbugs wrote 06/07/2016 at 20:06 point

I was worried that the header files for v5 that were used in the build would be different from the ones for v4.


justin.francis wrote 06/05/2016 at 22:40 point

Here is a video I made on how to install GPU Tensorflow, maybe it can help you out!



robotbugs wrote 06/07/2016 at 20:07 point

OK useful stuff. The main problem I had was that I used cuDNN v5 and now have to roll back.


deknus wrote 06/05/2016 at 14:54 point

An LGA2011 X79 motherboard can provide four x16 PCIe slots, and the i7-4820K has 40 PCIe version 3 lanes too. If I have 4 GTX 1080 cards, will they work at 16x, 8x, 8x, 8x? X79 and the i7-4820K can be cheaper.


robotbugs wrote 06/07/2016 at 20:06 point

Sounds like a possibility.

