
Raspberry Pi Ceph Cluster

A Raspberry Pi Ceph Cluster using 2TB USB drives.

I needed an easily expandable storage solution for warehousing my ever-growing hoard of data. I decided to go with Ceph since it's open source and I had some experience with it from work. The most important benefit is that I can continuously expand the storage capacity as needed simply by adding more nodes and drives.

Current Hardware:

  • 1x RPi 4 w/ 2GB RAM
    • management machine
    • influxdb, prometheus, apcupsd, apt-cacher
  • 3x RPi 4 w/ 4GB RAM
    • 1x Ceph mon/mgr/mds per RPi
  • 15x RPi 4 w/ 8GB RAM
    • 2x Ceph OSDs per RPi
    • 2x Seagate 2TB USB 3.0 HDD per RPi

Current Total Raw Capacity: 55 TiB

The RPi's are all housed in a nine-drawer cabinet with rear exhaust fans.  Each drawer has an independent 5V 10A power supply.  There is a 48-port network switch in the rear of the cabinet to provide the necessary network fabric.

The HDDs are double-stacked five wide to fit 10 HDDs in each drawer along with five RPi 4's.  A 2" x 7" x 1/8" aluminum bar is sandwiched between the drives for heat dissipation.  Each drawer has a custom 5-port USB power fanout board to power the RPi's.  The RPi's have the USB PMIC bypassed with a jumper wire to power the HDDs since the 1.2A current limit is insufficient to spin up both drives.

  • 1 × Raspberry Pi 4 w/ 2GB RAM
  • 3 × Raspberry Pi 4 w/ 4GB RAM
  • 15 × Raspberry Pi 4 w/ 8GB RAM
  • 19 × MB-MJ64GA/AM Samsung PRO Endurance 64GB 100MB/s (U1) MicroSDXC Memory Card with Adapter
  • 19 × USB C Cable (1 ft) USB Type C Cable Braided Fast Charge Cord


  • Rebalance Complete

    Robert Rouquette12/10/2020 at 15:15 0 comments

    The rebalance finally completed.  I had to relax the global mon_target_pg_per_osd setting on my cluster to allow the PG count increase and the balancer to settle.  Without setting that parameter to 1, the balancer and PG autoscaler were caught in a slow thrashing loop.
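    A minimal sketch of that kind of adjustment, assuming Octopus's centralized config database; the follow-up checks are illustrative, not a record of the exact commands run on this cluster:

        # lower the per-OSD PG target so the autoscaler stops requesting more splits
        ceph config set global mon_target_pg_per_osd 1
        ceph osd pool autoscale-status   # confirm the autoscaler's new PG targets
        ceph balancer status             # verify the balancer has settled
        ceph -s                          # watch the misplaced/backfilling PGs drain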

  • Thermal Performance

    Robert Rouquette11/21/2020 at 00:49 0 comments

    Per the inquiry by @Toby Archer, here's the plot of CPU and disk temperature over the last 30 days.
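    For anyone wanting to spot-check the same readings by hand on a node, a rough sketch (the device path is an assumption):

        # SoC temperature, reported in millidegrees Celsius
        cat /sys/class/thermal/thermal_zone0/temp
        # drive temperature via SMART; USB-SATA bridges may need an extra -d option
        sudo smartctl -A /dev/sda | grep -i temperature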

  • OSD Memory Consumption

    Robert Rouquette11/20/2020 at 00:44 0 comments

    Per the inquiry from @Aaron Covrig, here's the memory consumption over the past week:

    The three series with nightly jumps are the mon instances, and the jumps are the mon services shutting down for the backup process.

  • Rebalance Still In Progress

    Robert Rouquette11/13/2020 at 02:46 1 comment

    The OSD rebalance is still running, but it did free up enough space that I was able to finish migrating my media files.  The total raw space currently in use is 66% of 54 TiB.
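    The usage figures above come from the standard status commands; for reference, a quick sketch:

        ceph df       # raw vs. pool-level usage
        ceph osd df   # per-OSD fill level and PG counts
        ceph -s       # progress of the ongoing rebalance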

  • Added 10 more drives

    Robert Rouquette10/27/2020 at 00:57 0 comments

    I added the (hopefully) last 10 drives for this year.  This brings the total raw capacity to 55 TiB and should be enough to finish migrating my data.

  • Ceph PG Autoscaler

    Robert Rouquette10/24/2020 at 18:48 0 comments

    It looks like the PG autoscaler kicked in last night and made the following changes:

    • fs_ec_5_2: 64 pgs -> 256 pgs

    The autobalancer should kick in later tonight or tomorrow to pull the OSDs back into alignment.
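    For reference, the autoscaler's view of each pool, and the knob that lets it manage a pool at all, can be checked with something like the sketch below; the mode value shown is illustrative:

        ceph osd pool autoscale-status                     # current vs. target PG counts per pool
        ceph osd pool set fs_ec_5_2 pg_autoscale_mode on   # let the autoscaler manage this pool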

  • Rebalance Complete

    Robert Rouquette10/20/2020 at 17:16 0 comments

    The cluster rebalance is finally complete.   The final distribution also appears to be tighter than before the rebalance.

  • Rebalance Progress

    Robert Rouquette10/13/2020 at 21:01 0 comments

    The rebalance is still progressing.  It's going more slowly than usual because I also changed the failure domain on my CephFS EC pool from OSD to HOST, which increased the number of PGs that needed to be remapped.  Aside from that, the rate of progression appears fairly typical.  The two gaps in the plot are power outages caused by Hurricane Delta.
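    Changing the failure domain of an existing EC pool generally means creating a new CRUSH rule and pointing the pool at it; a sketch of that procedure is below. The profile and rule names are placeholders reconstructed from the pool name mentioned in these logs, not copied from the cluster:

        ceph osd erasure-code-profile set ec_5_2_host k=5 m=2 crush-failure-domain=host
        ceph osd crush rule create-erasure fs_ec_5_2_host ec_5_2_host
        ceph osd pool set fs_ec_5_2 crush_rule fs_ec_5_2_host   # triggers the remap/backfill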

  • Added 8 more HDDs

    Robert Rouquette10/07/2020 at 22:51 0 comments

    I added 4 more RPi 4's and eight 2TB drives to bring the overall disk count to 20.  The cluster rebalance looks like it's going to take a few days, but at least the I/O performance isn't too severely impacted.
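    Since the cluster is managed with ceph-deploy (see the discussion below), bringing a new node's drives in as OSDs boils down to something like this sketch; the hostname and device paths are placeholders:

        ceph-deploy install rpi-osd-11                      # install the Ceph packages on the new node
        ceph-deploy osd create --data /dev/sda rpi-osd-11
        ceph-deploy osd create --data /dev/sdb rpi-osd-11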

  • Grafana Dashboard

    Robert Rouquette09/29/2020 at 20:52 0 comments

    I added a dashboard to my Grafana server to show the Telegraf data from the Ceph cluster.  You can see the disk I/O and CPU load from the cluster rebalancing from 10 OSDs to 12 OSDs.



Discussions

christian.boeing wrote 12/29/2020 at 06:43 point

Thanks for your answer. I expected that you were going 64-bit. (Sure, I see now that they're 8GB Pis.)
I have been using a Ceph cluster at home for some years too.

Started with 3 nodes and Ceph Hammer. First I used ODROID XU4s, but because of USB HDD issues I changed to the ODROID HC1/HC2. The disadvantage is that they are only 32-bit.
But I love the flexibility of Ceph.
Now my cluster is a little mixed: 3 mon/mgr/mds nodes on Octopus 64-bit and 12 OSDs on Mimic 32-bit. On armhf I found no way to go beyond Mimic; the Octopus build for armhf in Ubuntu 20.04 is broken.


Jeremy Rosengren wrote 12/28/2020 at 16:39 point

What kind of throughput are you getting when you're transferring files to/from cephfs?


Robert Rouquette wrote 12/29/2020 at 00:49 point

Cephfs can fully saturate my single 1 Gbps NIC indefinitely.  I don't currently have any machines with bonded ports, so I don't know the tipping point.


for.domination wrote 6 days ago point

Since seeing your project two weeks ago I couldn't get it out of my head, and now I'm trying to find a use case that would necessitate building a cluster ;)

Have you ever benchmarked your setup? Red Hat has a walkthrough written up at https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/administration_guide/ceph-performance-benchmarking

I'd be quite interested in how performance scales at this size compared to the other RPi clusters found on the net, which are mostly only 3-5 units. If you have a client with multiple ports, it'd be interesting to see how much link aggregation increases speeds... Maybe that would remove the client as the limiting factor.
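For reference, the Red Hat guide linked above is built around the stock rados bench tool; a minimal sketch of such a run looks like the following, where the throwaway pool name and PG count are placeholders (and deleting pools may additionally require mon_allow_pool_delete to be enabled):

    ceph osd pool create scratch 64
    rados bench -p scratch 30 write --no-cleanup   # 30-second write phase, keep objects for the read tests
    rados bench -p scratch 30 seq                  # sequential reads of the objects just written
    rados bench -p scratch 30 rand                 # random reads
    rados -p scratch cleanup
    ceph osd pool delete scratch scratch --yes-i-really-really-mean-it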


christian.boeing wrote 12/28/2020 at 15:51 point

Looks very nice.

I assume you are running the Pis in 64-bit mode because of Ceph's limited armhf support?
Which version of Ceph are you running?


Robert Rouquette wrote 12/29/2020 at 00:46 point

The RPi 4's are running Ubuntu 20.04 aarch64, and I'm using Ceph Octopus (15.2.7) from the Ubuntu repos (not the Ceph repos).
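On Ubuntu 20.04 those packages come straight from the distro archive, so an OSD node setup is roughly the sketch below; the exact package selection is an assumption:

    sudo apt update
    sudo apt install ceph-osd ceph-common
    ceph --version   # should report 15.2.x (octopus)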


miouge wrote 11/25/2020 at 08:48 point

Looks great. Just a couple of questions:

- Which install method did you use for Ceph?

- What kind of data protection do you use? Replication or EC? How is the performance?


Robert Rouquette wrote 11/25/2020 at 19:22 point

I used the ceph-deploy method.  It's simpler and makes more sense for lower-power embedded systems like the Raspberry Pi since it's a bare-metal installation instead of being containerized.

I use 3x replication for the metadata, FS root, and RBD pools.  I use 5:2 EC for the majority of my CephFS data.
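An illustrative reconstruction of that pool layout is sketched below; the pool names, PG counts, filesystem name, and mount path are assumptions, not the actual names on this cluster:

    ceph osd erasure-code-profile set ec_5_2 k=5 m=2 crush-failure-domain=host
    ceph osd pool create cephfs_metadata 64 replicated    # replicated pools default to size 3
    ceph osd pool create cephfs_root 64 replicated
    ceph osd pool create fs_ec_5_2 256 erasure ec_5_2
    ceph osd pool set fs_ec_5_2 allow_ec_overwrites true  # required before CephFS can write to an EC pool
    ceph fs new cephfs cephfs_metadata cephfs_root
    ceph fs add_data_pool cephfs fs_ec_5_2
    # direct a directory tree onto the EC pool via a file layout attribute (placeholder path)
    setfattr -n ceph.dir.layout.pool -v fs_ec_5_2 /mnt/cephfs/media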


Toby Archer wrote 11/20/2020 at 14:15 point

20TB raw capacity per shelf is great. How are you finding heat? It would be very cool to wire in two temperature probes to your management Pi's GPIO and monitor temperature by the exhausts.

Have you found single gigabit per node to become a bottleneck?

Awesome project, makes me very jealous.


Robert Rouquette wrote 11/21/2020 at 00:54 point

The CPU and disk temperatures tend to stay below 55 °C.  The Gigabit Ethernet hasn't been a bottleneck so far, and I don't expect it to be.  The disks don't perform well enough with random I/O to saturate the networking, and filesystem I/O on CephFS is typically uniformly distributed across all of the OSDs.  As a result, the networking bottleneck is almost always on the client side.


Aaron Covrig wrote 11/16/2020 at 18:33 point

This is a sweet looking project!  I noticed that you look to be playing it safe with how you distributed your Pi's based on available RAM. Are you able to provide any details on what the memory consumption has looked like when idle and under load?


Robert Rouquette wrote 11/20/2020 at 00:40 point

The OSDs are configured for a maximum of 3 GiB per service instance, and they tend to consume all of it.  That comes to 6 GiB per 8 GiB RPi just for the OSD services.  The kernel and other system services consume a minor amount as well, so the nodes consistently hover around 20% free memory.  The extra "unused" memory is necessary as padding since there is no swap space.  (Adding swap on an SD card is simply inviting premature hardware failure.)
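That per-OSD cap is typically set with the standard osd_memory_target option; a minimal sketch of applying it cluster-wide (the byte value is just 3 GiB expressed in bytes):

    ceph config set osd osd_memory_target 3221225472   # 3 GiB per OSD daemon
    ceph config get osd.0 osd_memory_target            # verify what a given OSD actually sees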


Aaron Covrig wrote 11/20/2020 at 02:00 point

Awesome, thank you.  

