Raspberry Pi Ceph Cluster

A Raspberry Pi Ceph Cluster using 2TB USB drives.

I needed an easily expandable storage solution for warehousing my ever-growing hoard of data. I decided to go with Ceph since it's open source and I had some experience with it from work. The most important benefit is that I can continuously expand the storage capacity as needed simply by adding more nodes and drives.

Current Hardware:

  • 1x RPi 4 w/ 2GB RAM
    • management machine
    • influxdb, prometheus, apcupsd, apt-cacher
  • 3x RPi 4 w/ 4GB RAM
    • 1x ceph mon/mgr/mds per RPi
  • 15x RPi 4 w/ 8GB RAM
    • 2x ceph osds per RPi
    • 2x Seagate 2TB USB 3.0 HDD per RPi

Current Total Raw Capacity: 55 TiB

The RPi's are all housed in a nine drawer cabinet with rear exhaust fans.  Each drawer has an independent 5V 10A power supply.  There is a 48-port network switch in the rear of the cabinet to provide the necessary network fabric.

The HDDs are double-stacked five wide to fit 10 HDDs in each drawer along with five RPi 4's.  A 2" x 7" x 1/8" aluminum bar is sandwiched between the drives for heat dissipation.  Each drawer has a custom 5-port USB power fanout board to power the RPi's.  The RPi's have the USB PMIC bypassed with a jumper wire to power the HDDs since the 1.2A current limit is insufficient to spin up both drives.

  • 1 × Raspberry Pi 4 w/ 2GB RAM
  • 3 × Raspberry Pi 4 w/ 4GB RAM
  • 15 × Raspberry Pi 4 w/ 8GB RAM
  • 19 × MB-MJ64GA/AM Samsung PRO Endurance 64GB 100MB/s (U1) MicroSDXC Memory Card with Adapter
  • 19 × USB C Cable (1 ft) USB Type C Cable Braided Fast Charge Cord


  • Inconsistent Placement Group

    Robert Rouquette • 2 days ago • 0 comments

    The OSD deep scrubbing found an inconsistency in one of the placement groups.  I've marked the PG in question for repair, so hopefully it's correctable and is merely a transient issue.  The repair operation should be complete in the next few hours.


    [UPDATE]
    Ceph was able to successfully repair the PG.
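
    For reference, marking a PG for repair is done with the standard Ceph CLI. Below is a minimal sketch of the same steps driven from Python; the PG id shown is a placeholder, since the real one comes out of the health report.

        import subprocess

        # Hypothetical placement group id; the actual id is listed by
        # "ceph health detail" when a deep scrub finds an inconsistency.
        PG_ID = "7.1a"

        # Show which PGs are flagged inconsistent, then ask Ceph to repair the one in question.
        subprocess.run(["ceph", "health", "detail"], check=True)
        subprocess.run(["ceph", "pg", "repair", PG_ID], check=True)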

  • Zabbix Monitoring

    Robert Rouquette • 01/22/2021 at 00:30 • 0 comments

    I decided I have enough nodes that some comprehensive monitoring was worth the time, so I configured Zabbix to monitor the nodes in the ceph cluster.  I used the zbx-smartctl project for collecting the SMART data.
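
    For anyone reproducing the monitoring, the core of it is just feeding smartctl readings into the monitoring stack. Here's a minimal sketch of pulling a drive temperature with smartctl's JSON output (assumes smartmontools 7+; the device path is an example), separate from the zbx-smartctl templates actually used here.

        import json
        import subprocess

        def disk_temperature(device="/dev/sda"):
            """Read the current drive temperature via smartctl's JSON output."""
            # smartctl's exit code encodes warnings, so don't treat non-zero as fatal here.
            out = subprocess.run(
                ["smartctl", "-j", "-A", device],
                capture_output=True, text=True,
            ).stdout
            data = json.loads(out)
            # Most drives report a top-level "temperature" object in the JSON output.
            return data.get("temperature", {}).get("current")

        print(disk_temperature())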

  • Rebalance Complete

    Robert Rouquette • 12/10/2020 at 15:15 • 0 comments

    The rebalance finally completed.  I had to relax the global mon_target_pg_per_osd setting on my cluster to allow the PG count increase and the balancer to settle.  Without setting that parameter to 1, the balancer and PG autoscaler were caught in a slow thrashing loop.
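
    For reference, that knob can be changed at runtime through the centralized config store; a minimal sketch of the change described above (the value 1 is the one quoted in this log):

        import subprocess

        # Lower the per-OSD PG target so the autoscaler stops fighting the balancer.
        subprocess.run(
            ["ceph", "config", "set", "global", "mon_target_pg_per_osd", "1"],
            check=True,
        )

        # Watch the balancer settle afterwards.
        subprocess.run(["ceph", "balancer", "status"], check=True)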

  • Thermal Performance

    Robert Rouquette • 11/21/2020 at 00:49 • 0 comments

    Per the inquiry by @Toby Archer, here's the plot of CPU and disk temperature over the last 30 days.

  • OSD Memory Consumption

    Robert Rouquette • 11/20/2020 at 00:44 • 2 comments

    Per the inquiry from @Aaron Covrig, here's the memory consumption over the past week:

    The three series with nightly jumps are the mon instances; the jumps are caused by the mon services shutting down for the backup process.

  • Rebalance Still In Progress

    Robert Rouquette • 11/13/2020 at 02:46 • 2 comments

    The OSD rebalance is still running, but it did free up enough space that I was able to finish migrating my media files.  The total raw space currently in use is 66% of 54 TiB.

  • Added 10 more drives

    Robert Rouquette • 10/27/2020 at 00:57 • 0 comments

    I added the (hopefully) last 10 drives for this year.  This brings the total raw capacity to 55 TiB and should be enough to finish migrating my data.

  • Ceph PG Autoscaler

    Robert Rouquette • 10/24/2020 at 18:48 • 0 comments

    It looks like the PG autoscaler kicked in last night and made the following changes:

    • fs_ec_5_2: 64 pgs -> 256 pgs

    The autobalancer should kick in later tonight or tomorrow to pull the OSDs back into alignment.
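
    For reference, the same information the autoscaler acted on can be inspected, and the change made by hand, roughly as sketched below (the pool name is the one from this log):

        import subprocess

        # Report per-pool PG counts and the autoscaler's targets.
        subprocess.run(["ceph", "osd", "pool", "autoscale-status"], check=True)

        # The manual equivalent of the change the autoscaler made overnight.
        subprocess.run(
            ["ceph", "osd", "pool", "set", "fs_ec_5_2", "pg_num", "256"],
            check=True,
        )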

  • Rebalance Complete

    Robert Rouquette • 10/20/2020 at 17:16 • 0 comments

    The cluster rebalance is finally complete.  The final distribution also appears to be tighter than before the rebalance.

  • Rebalance Progress

    Robert Rouquette • 10/13/2020 at 21:01 • 0 comments

    The rebalance is still progressing.  It's going more slowly than usual because I also changed the failure domain on my cephfs EC pool from OSD to HOST, which increased the number of PGs that needed to be remapped.  Aside from that, the rate of progression appears fairly typical.  The two gaps in the plot are power outages caused by Hurricane Delta.
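
    For anyone curious what that failure-domain change looks like in practice, here's a rough sketch (the profile and rule names are placeholders, and this is a generic recipe rather than the exact commands used here): the failure domain of an existing EC pool is governed by its CRUSH rule, so pointing the pool at a host-level rule is what triggers the remapping described above.

        import subprocess

        # Hypothetical profile/rule names; the pool name matches the earlier logs.
        # Define a 5+2 EC profile whose failure domain is the host rather than the OSD.
        subprocess.run(
            ["ceph", "osd", "erasure-code-profile", "set", "ec52_host",
             "k=5", "m=2", "crush-failure-domain=host"],
            check=True,
        )

        # Build a CRUSH rule from that profile and point the pool at it;
        # Ceph then remaps (and backfills) the affected PGs.
        subprocess.run(
            ["ceph", "osd", "crush", "rule", "create-erasure", "ec52_host_rule", "ec52_host"],
            check=True,
        )
        subprocess.run(
            ["ceph", "osd", "pool", "set", "fs_ec_5_2", "crush_rule", "ec52_host_rule"],
            check=True,
        )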


Discussions

Stefano wrote 02/12/2021 at 23:50 point

My compliments! From the photos, you must be a fan of r/cableporn! :)


marazm wrote 02/04/2021 at 11:50 point

The cooling is weak.

In my opinion, a tube would be a better approach. Look at how the Cray-1 or the new Mac was designed.

Boards arranged in a ring create a natural chimney-like airflow. You can also easily add cooling inside and outside (e.g. a container of water).


Robert Rouquette wrote 02/04/2021 at 16:39 point

From my own observations of the hardware temperatures over time, the cooling is already adequate.  While none of the hardware is running cold, it's still well within acceptable operating limits.  More aggressive cooling would simply increase the power consumption and the TCO.


josephchrzempiec wrote 02/04/2021 at 11:26 point

Hello, is there a way to set this up? I'm having a hard time trying to figure it out.

It would be nice if there was a video tutorial on how to set up the Raspberry Pi cluster for this.


Robert Rouquette wrote 02/04/2021 at 16:45 point

I'm not particularly inclined to create a video tutorial.  There are already countless online tutorials available (not counting the official Ceph documentation) that walk through the planning, creation, and stand-up of Ceph clusters.  There are no special steps or tricks to the RPi OS installation either, and any distro that supports the RPi will work.


josephchrzempiec wrote 02/05/2021 at 12:54 point

Hello, I understand. I just don't know where to start with the software after setting up all the hardware.


Kevin Morrison wrote 02/03/2021 at 17:56 point

This is so overkill for what it does. Technology should be used to simplify a need or a process, not pervert and overcomplicate it!


Robert Rouquette wrote 02/03/2021 at 20:18 point

I agree with your sentiment about overcomplication. However, after many years of maintaining RAID arrays both large and small, and having watched Ceph and other distributed storage systems mature, I've come to appreciate that the distributed approach is significantly better in terms of failure tolerance and data integrity.  For data warehousing, which is the principal purpose of this project, distributed storage offers better long-term viability than RAID in both data-loss probability and capacity expansion.  I have personally experienced instances where rebuilding a failed drive on a RAID array triggered secondary drive failures.

I do ultimately agree that for the average person, this type of storage would be inappropriate, but in a situation where substantial continuous growth is expected, conventional single-machine approaches have their limits.


Kevin Morrison wrote 02/03/2021 at 21:32 point

The more I look into this, the more sense it makes. Like you, I have decades of experience building and working with RAID systems, and this does look like a more flexible way to work with storage.


Timo Birnschein wrote 02/03/2021 at 17:18 point

I'm not familiar with Ceph but from what you described (based on the 8GB RP4 requirements) and what's written online, it seems like the RAM is being used as a cache and Ceph distributes the data to the disks. That would make writing extremely fast but reading is still bound to the read speed of the drives. Or do I misunderstand the concept?


Robert Rouquette wrote 02/03/2021 at 19:58 point

The data is distributed on the client side among the OSD service instances.  The OSD in-memory cache fulfills the same purpose as a kernel disk cache would, but allows the OSD daemons more nuanced access for their internal operations and data integrity verification.

In terms of IO performance, the caching does help repetitive reads, but, for the most part, performance is strictly limited by the random IO performance of the physical storage media.  Ceph write operations block on the client side until the data is actually committed to disk.  The only exception is CephFS, where there is some limited client-side buffering to allow for data chunking on large block writes.
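
A simple way to see that synchronous behavior from a client is the librados Python binding, where a plain write doesn't return until the OSDs have acknowledged the commit. A minimal sketch (the pool name is just an example):

    import rados

    # Connect using the local ceph.conf and keyring.
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()

    # "rbd" is just an example pool name.
    ioctx = cluster.open_ioctx("rbd")
    try:
        # write_full() blocks until the OSDs acknowledge the write,
        # which is the client-side blocking behavior described above.
        ioctx.write_full("demo-object", b"hello ceph")
    finally:
        ioctx.close()
        cluster.shutdown()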


andrewroffey wrote 01/29/2021 at 02:14 point

I've been looking at doing something similar, but the cost of 4TB drives looks to be better per GB. Is there any reason for going for 2x2TB drives per Pi, e.g. performance, cost? Could you go 1x4TB, or even 2x4TB per Pi? 8TB per OSD would still technically meet the "~1GB RAM per 1TB" rule of thumb for Ceph.


Robert Rouquette wrote 01/29/2021 at 22:16 point

I went with 2TB drives because they seemed like a good trade-off between capacity and rebuild time.  There's also the fact that Ceph performance is directly tied to the number of OSDs.  More OSDs give you more throughput, especially with low-performance hard drives.

That being said, there's no reason you can't stretch the OSD memory that far, but it's important to keep in mind that the full 8GB on the RPi is not truly usable by the OSD daemons.  The kernel carves out its own dedicated chunk, and you need to leave some for the kernel disk cache.  I would not use more than 7.5GB of the 8GB if you want the RPi to remain stable.
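
To make the memory point concrete, here's the rough per-node budget implied by the numbers quoted in this thread (not a measurement):

    # Rough memory budget for one 8 GB RPi 4 running two OSDs,
    # using the figures quoted elsewhere in this discussion.
    TOTAL_GIB = 8.0
    OSD_TARGET_GIB = 3.0      # per-OSD memory target
    OSDS_PER_PI = 2

    osd_total = OSD_TARGET_GIB * OSDS_PER_PI   # 6.0 GiB for the two OSD daemons
    headroom = TOTAL_GIB - osd_total           # ~2 GiB left for kernel, page cache, services
    print(f"OSDs: {osd_total} GiB, remaining headroom: {headroom} GiB")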


Andrew Luke Nesbit wrote 01/24/2021 at 16:59 point

I was thinking of doing almost the exact same thing, except instead of using RPI 4's I was going to use Orange Pi +2E boards.  Does anybody have any comments on how this change will affect the feasibility of the project?


Robert Rouquette wrote 01/24/2021 at 17:17 point

The main limitation with that approach would be the lack of USB 3 and memory.  The +2E only has USB 2 and 2GB of RAM.  It would certainly be possible to run an OSD daemon, but I don't think you would get usable read/write performance.


Andrew Luke Nesbit wrote 01/24/2021 at 17:42 point

You know, you're absolutely right... without USB 3 it's going to be practically unusable.  But I just had another thought.... what about if, instead of using a USB-connected SSD, I use a large SD flash card for the "user storage" and use the eMMC and/or SPI (still trying to work out the differences and capabilities of each) for the OS?


christian.boeing wrote 12/29/2020 at 06:43 point

Thanks for your answer. I expected that you were running 64-bit (sure, I see the 8GB Pi's now).
I have been using a ceph cluster at home for some years too.

I started with 3 nodes and Ceph Hammer. At first I used Odroid XU4 boards, but because of USB-HDD issues I switched to Odroid HC1/HC2. The disadvantage is that they are only 32-bit.
But I love the flexibility of ceph.
Now my cluster is a little mixed: 3 mon/mgr/mds on Octopus 64-bit, 12 OSDs on Mimic 32-bit. With armhf I found no way to go beyond Mimic; the Octopus build for armhf in Ubuntu 20.04 is broken.


Jeremy Rosengren wrote 12/28/2020 at 16:39 point

What kind of throughput are you getting when you're transferring files to/from cephfs?


Robert Rouquette wrote 12/29/2020 at 00:49 point

Cephfs can fully saturate my single 1 Gbps NIC indefinitely.  I don't currently have any machines with bonded ports, so I don't know the tipping point.


for.domination wrote 01/11/2021 at 13:54 point

Since seeing your project two weeks ago I couldn't get it out of my head, and now I'm trying to find a use case that would necessitate building a cluster ;)

Have you ever benchmarked your setup? Red Hat has a benchmarking guide at https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/administration_guide/ceph-performance-benchmarking

I'd be quite interested in how performance scales at this size compared to the other RPi clusters found on the net, which are mostly only 3-5 units. If you have a client with multiple ports, it'd be interesting to see how much link aggregation increases speeds... Maybe that can circumvent the client as the limiting factor.


christian.boeing wrote 12/28/2020 at 15:51 point

Looks very nice.

I think you are using the raspies in 64-bit mode because of the limited ceph support for armhf?
Which version of ceph is running?


Robert Rouquette wrote 12/29/2020 at 00:46 point

The RPi 4's are running Ubuntu 20.04 aarch64, and I'm using Ceph Octopus (15.2.7) from the Ubuntu repos (not the Ceph repos).


miouge wrote 11/25/2020 at 08:48 point

Looks great. Just a couple of questions:

- Which install method did you use for Ceph?

- What kind of data protection do you use? Replication or EC? How is the performance?


Robert Rouquette wrote 11/25/2020 at 19:22 point

I used the ceph-deploy method.  It's simpler and makes more sense for lower-power embedded systems like the Raspberry Pi since it's a bare-metal installation instead of being containerized.

I use 3x replication for the meta, fs root, and rbd pools.  I use 5:2 EC for the majority of my cephfs data.
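
For anyone following along, a rough sketch of what that pool layout looks like when created by hand is below; the pool names and PG counts are placeholders, not the actual commands used on this cluster.

    import subprocess

    def ceph(*args):
        """Tiny helper to run a ceph CLI command."""
        subprocess.run(["ceph", *args], check=True)

    # Replicated pool (size 3), e.g. for metadata; name and PG count are examples.
    ceph("osd", "pool", "create", "cephfs_metadata", "64")
    ceph("osd", "pool", "set", "cephfs_metadata", "size", "3")

    # 5:2 erasure-coded data pool with a host-level failure domain.
    ceph("osd", "erasure-code-profile", "set", "ec52", "k=5", "m=2",
         "crush-failure-domain=host")
    ceph("osd", "pool", "create", "cephfs_data_ec", "64", "64", "erasure", "ec52")

    # EC pools need overwrites enabled before CephFS or RBD can use them.
    ceph("osd", "pool", "set", "cephfs_data_ec", "allow_ec_overwrites", "true")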


Toby Archer wrote 11/20/2020 at 14:15 point

20TB raw capacity per shelf is great. How are you finding heat? It would be very cool to wire in two temperature probes to your management Pi's GPIO and monitor temperature by the exhausts.

Have you found single gigabit per node to become a bottleneck?

Awesome project, makes me very jealous.


Robert Rouquette wrote 11/21/2020 at 00:54 point

The CPU and disk temperatures tend to stay below 55 C.  The Gigabit ethernet hasn't been a bottleneck so far, and I don't expect it to be.  The disks don't perform well enough with random IO to saturate the networking, and filesystem IO on CephFS is typically uniformly distributed across all of the OSDs.  As a result, the networking bottleneck is almost always on the client side.


Aaron Covrig wrote 11/16/2020 at 18:33 point

This is a sweet looking project!  I noticed that you look to be playing it safe with how you distributed your Pi's based on available RAM. Are you able to provide any details on what the memory consumption has looked like when idle and under load?


Robert Rouquette wrote 11/20/2020 at 00:40 point

The OSDs are configured for a maximum of 3 GiB per service instance, and they tend to consume all of it.  That comes to 6 GiB per 8 GiB RPi just for the OSD services.  The kernel and other system services consume a minor amount as well, so the nodes tend to consistently hover around 20% free memory nearly all the time.  The extra "unused" memory is necessary as padding since there is no swap space.  (Adding swap on an SD card is simply inviting premature hardware failure.)
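
For reference, that 3 GiB ceiling corresponds to Ceph's osd_memory_target option; a minimal sketch of applying it to all OSDs through the centralized config store:

    import subprocess

    # 3 GiB expressed in bytes for osd_memory_target.
    target_bytes = 3 * 1024 ** 3

    # Apply the limit to every OSD daemon via the monitors' config database.
    subprocess.run(
        ["ceph", "config", "set", "osd", "osd_memory_target", str(target_bytes)],
        check=True,
    )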


Aaron Covrig wrote 11/20/2020 at 02:00 point

Awesome, thank you.  

