Raspberry Pi Ceph Cluster

A Raspberry Pi Ceph Cluster using 2TB USB drives.

I needed an easily expandable storage solution for warehousing my ever-growing hoard of data. I decided to go with Ceph since it's open source and I had some experience with it from work. The most important benefit is that I can continuously expand the storage capacity as needed simply by adding more nodes and drives.

Current Hardware:

  • 1x RPi 4 w/ 2GB RAM
    • management machine
    • influxdb, apcupsd, apt-cacher
  • 3x RPi 4 w/ 4GB RAM
    • 1x ceph mon/mgr/mds per RPi
  • 17x RPi 4 w/ 8GB RAM
    • 2 ceph osds per RPi
    • 2x Seagate 2TB USB 3.0 HDD per RPi

Current Total Raw Capacity: 62 TiB
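
The raw capacity figure follows from the drive count. A minimal sketch of the arithmetic, assuming each drive holds exactly 2 x 10^12 bytes (the vendor's decimal "2TB") and that Ceph reports binary TiB; the function name is only illustrative:

```python
# Convert vendor (decimal) drive capacity into the binary TiB that Ceph reports.
TB = 10**12   # one terabyte as marketed by drive vendors
TIB = 2**40   # one tebibyte


def raw_capacity_tib(num_drives: int, drive_tb: float = 2.0) -> float:
    """Raw capacity in TiB for num_drives drives of drive_tb terabytes each."""
    return num_drives * drive_tb * TB / TIB


if __name__ == "__main__":
    # 17 OSD nodes with 2 drives each = 34 drives
    print(f"{raw_capacity_tib(17 * 2):.1f} TiB")  # prints "61.8 TiB"
```

The same arithmetic for 30 drives gives roughly 55 TiB, which matches the total shown in the cluster status from the May 2021 log.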

The RPis are all housed in a nine-drawer cabinet with rear exhaust fans.  Each drawer has an independent 5V 10A power supply.  There is a 48-port network switch in the rear of the cabinet to provide the necessary network fabric.

The HDDs are double-stacked five wide to fit 10 HDDs in each drawer along with five RPi 4s.  A 2" x 7" x 1/8" aluminum bar is sandwiched between the drives for heat dissipation.  Each drawer has a custom 5-port USB power fanout board to power the RPis.  The RPis have the USB PMIC bypassed with a jumper wire to power the HDDs, since the 1.2A current limit is insufficient to spin up both drives.

  • 1 × Raspberry Pi 4 w/ 2GB RAM
  • 3 × Raspberry Pi 4 w/ 4GB RAM
  • 17 × Raspberry Pi 4 w/ 8GB RAM
  • 21 × MB-MJ64GA/AM Samsung PRO Endurance 64GB 100MB/s (U1) MicroSDXC Memory Card with Adapter
  • 21 × USB C Cable (1 ft) USB Type C Cable Braided Fast Charge Cord

  • Unstable Kernel (5.4.0-1041-raspi)

    Robert Rouquette, 08/24/2021 at 16:42

    The linux-image-5.4.0-1041-raspi version of the Ubuntu linux-image package appears to be unstable.  I've had two of the OSD RPi boards randomly lock up.  The boards recover on their own once power-cycled, but this is the first time I've observed this behavior.  There are no messages in the system logs.  The logs simply stop at the time of the lockup and resume on reboot.  I've upgraded all of the RPis to the latest image version (1042), which should hopefully resolve the issue.

  • Rebalance Complete

    Robert Rouquette, 06/28/2021 at 14:26

  • Two More OSDs

    Robert Rouquette, 06/17/2021 at 01:40

    I've added the last two drives for this round of expansion which brings the total for the cluster to 34 HDDs (OSDs).  Amazon would not allow me to purchase more of the STGX2000400 drives, so I went with the STKB2000412 drives instead.  They have roughly the same performance, but cost about $4 more per drive.  The aluminum top portion of the case should provide better thermal performance though.


  • Additional Storage

    Robert Rouquette, 06/09/2021 at 01:32

    Added the first two of four additional drives.  I plan to add the other two once the rebalance completes.


  • OSD Recovery Complete

    Robert Rouquette, 05/19/2021 at 03:24

    Once the failed drive was replaced, the cluster was able to rebalance and repair the inconsistent PGs.

      cluster:
        id:     105370dd-a69b-4836-b18c-53bcb8865174
        health: HEALTH_OK
     
      services:
        mon: 3 daemons, quorum ceph-mon00,ceph-mon01,ceph-mon02 (age 33m)
        mgr: ceph-mon02(active, since 13d), standbys: ceph-mon00, ceph-mon01
        mds: cephfs:2 {0=ceph-mon00=up:active,1=ceph-mon02=up:active} 1 up:standby
        osd: 30 osds: 30 up (since 9d), 30 in (since 9d)
     
      data:
        pools:   5 pools, 385 pgs
        objects: 8.42M objects, 29 TiB
        usage:   41 TiB used, 14 TiB / 55 TiB avail
        pgs:     385 active+clean
     
      io:
        client:   7.7 KiB/s wr, 0 op/s rd, 0 op/s wr
     
    

    After poking around on the failed drive, it looks like the actual 2.5" drive itself is fine.  The USB-to-SATA controller seems to be the culprit, and randomly garbles data over the USB interface.  I was also able to observe it fail to enumerate on the USB bus.  A failure rate of 1 in 30 isn't bad considering the cost of the drives.

  • OSD Failure

    Robert Rouquette, 05/04/2021 at 19:13

    The deep scrubs turned up a few more repairable inconsistencies until a few days ago when they grew concerning.  It turned out that one of the OSDs had unexplained read errors.  Smartctl showed that there were no read errors recorded by the disk, so I initially assumed it was just the result of a power failure.  It became obvious that something was physically wrong with the disk when previously clean or repaired PGs were found to have new errors.

    As a result I've marked the suspect OSD out of the cluster and I have ordered a replacement drive.  The exact cause of the read errors is unknown, but since it is isolated to a single drive, and the other OSD on the same RPi is fine, it's most likely just a bad drive.

    Ceph is currently rebuilding the data from the bad drive, and I'll post an update once the new drive arrives.

    The inconsistent PGs all have a single OSD in common: 2147483647 (formerly identified as 25)

  • Inconsistent Placement Group

    Robert Rouquette, 04/08/2021 at 16:54

    The OSD deep scrubbing found an inconsistency in one of the placement groups.  I've marked the PG in question for repair, so hopefully it's correctable and is merely a transient issue.  The repair operation should be complete in the next few hours.


    [UPDATE]
    Ceph was able to successfully repair the PG.

  • Zabbix Monitoring

    Robert Rouquette, 01/22/2021 at 00:30

    I decided I have enough nodes that some comprehensive monitoring was worth the time, so I configured Zabbix to monitor the nodes in the Ceph cluster.  I used the zbx-smartctl project for collecting the SMART data.

  • Rebalance Complete

    Robert Rouquette, 12/10/2020 at 15:15

    The rebalance finally completed.  I had to relax the global mon_target_pg_per_osd setting on my cluster to allow the PG count increase and the balancer to settle.  Without setting that parameter to 1, the balancer and PG autoscaler were caught in a slow thrashing loop.

  • Thermal Performance

    Robert Rouquette, 11/21/2020 at 00:49

    Per the inquiry by @Toby Archer, here's the plot of CPU and disk temperature over the last 30 days.

Discussions

6355 wrote 08/25/2021 at 17:04 point

Very interesting approach, thank you!

But I suspect the following problems.

Most (if not all) current 2TB 2.5" HDDs are SMR, and I doubt any of them are HM-SMR. Seagate can be very secretive about which drives are SMR (rather than CMR), and that goes for the STGX2000400 too.

And currently, Ceph on SMR gives very surprising performance. Some experiments to support HM-SMR with BlueStore are underway ( https://docs.ceph.com/en/latest/dev/zoned-storage/ ), but only for HM-SMR, not for DM-SMR or even HA-SMR...

Another problem, in my experience, could be the Type A USB connectors. I've seen many cases where they oxidized and lost electrical contact (since then I have preferred Type C USB connectors). There are rumors that some connector grease can resolve this, though...

Robert Tisma wrote 08/01/2021 at 02:27 point

Phenomenal! This is truly amazing. I am planning on creating a much smaller version of your setup. Two questions.

What is your approach for backups? At this scale, it doesn't seem practical (for a home lab at least).

How much time does it really take to maintain the cluster? Since you put the cluster online, how much maintenance have you had to do (not including adding or removing nodes or HDDs)? I'm curious whether jumping into this is something that will be a second child or something that won't require much maintenance time (a failure a week vs. a few failures a year, for example). I know that's a loaded question and dependent on a lot of things, but I'm curious about your particular setup.

Robert Rouquette wrote 08/02/2021 at 03:50 point

There are no backups of the stored data itself.  The Ceph cluster, by design, has built-in data redundancy for protection against drive and hardware failure.  The only part that is backed up is the monitor databases, since a total loss of the monitors would be catastrophic.  Individual drive or node failure isn't particularly worrisome since Ceph is self healing and will automatically rebuild any lost data using the remaining space on the cluster.

Aside from the one failed drive and adding storage, the cluster has been relatively maintenance free.  I perform software updates every few months whenever there are critical security updates or bug fixes, but that's about it.  I've averaged less than two hours per month in maintenance and monitoring time so far.

Stefano wrote 02/12/2021 at 23:50 point

My compliments! Judging from the photos, you must be a fan of r/cableporn! :)

marazm wrote 02/04/2021 at 11:50 point

The cooling is weak.

In my opinion, a tube would be a better approach. Look at how the Cray-1 or the new Mac was designed.

Boards arranged in a circle create natural air circulation, like in a chimney. On top of that, you can easily add cooling on the inside and the outside (e.g. a container of water).

Robert Rouquette wrote 02/04/2021 at 16:39 point

From my own observations of the hardware temperatures over time, the cooling is already adequate.  While none of the hardware is running cold, it's still certainly well within acceptable operating limits.  More aggressive cooling would simply increase the power consumption and increase the TCO.

josephchrzempiec wrote 02/04/2021 at 11:26 point

Hello, is there a guide on how to set this up? I'm having a hard time trying to figure this out.

It would be nice if there was a video tutorial on how to set up the Raspberry Pi cluster for this.

Robert Rouquette wrote 02/04/2021 at 16:45 point

I'm not particularly inclined to create a video tutorial.  There are already countless online tutorials available (not counting the official Ceph documentation) that walk through the planning, creation, and stand-up of Ceph clusters.  There are no special steps or tricks to the RPi OS installation either, and any distro that supports the RPi will work.

josephchrzempiec wrote 02/05/2021 at 12:54 point

Hello, I understand. I just don't know where to start with the software after setting up all the hardware, or where to begin.

Robert Tisma wrote 08/03/2021 at 04:02 point

Did you use ceph-ansible for provisioning the RPis?

Robert Rouquette wrote 08/24/2021 at 16:28 point

@Robert Tisma My provisioning process was script based.  I stand up the RPi with the standard Ubuntu Server raspi image.  Once the RPi boots and has a DHCP lease, I ssh in and change the default password.  (This is required by the Ubuntu image on first login.)  Once the password is set, I have provisioning scripts that set the system hostname, install ssh keys, install the base packages I want, perform package updates, and reboot the RPi.

I do the initial Ceph-related package installation and configuration with ceph-deploy.
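
The provisioning steps described above could be sketched roughly as follows. This is not the author's script: the hostname `ceph-osd00` and the package list are hypothetical placeholders, and the ssh key step is left out because it is usually run from the management machine (for example with `ssh-copy-id`). The function only builds the command list so it can be reviewed before anything is executed.

```python
from typing import List


def provisioning_commands(hostname: str, packages: List[str]) -> List[List[str]]:
    """Build the per-node commands in the order described in the reply.

    Nothing is executed here; each entry is an argv list that could be
    passed to subprocess.run(..., check=True) as root on the new node.
    """
    return [
        ["hostnamectl", "set-hostname", hostname],   # set the system hostname
        ["apt-get", "update"],                       # refresh package indexes
        ["apt-get", "install", "-y", *packages],     # install base packages
        ["apt-get", "dist-upgrade", "-y"],           # apply package updates
        ["systemctl", "reboot"],                     # reboot into the new state
    ]


if __name__ == "__main__":
    # Hypothetical node name and package selection, for illustration only.
    for argv in provisioning_commands("ceph-osd00", ["chrony", "smartmontools"]):
        print(" ".join(argv))
```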

Kevin Morrison wrote 02/03/2021 at 17:56 point

This is so overkill for what it does. Technology should be used to simplify a need or a process, not pervert and over complicate it!

Robert Rouquette wrote 02/03/2021 at 20:18 point

I agree with your sentiment about over complication. However, after many years of maintaining RAID arrays both large and small, and having watched as Ceph and other distributed storage systems have matured, I've come to appreciate that the distributed storage approach is significantly better in terms of failure tolerance and data integrity.  In terms of data warehousing, which is the principal purpose of this project, distributed storage offers the best long-term viability compared to RAID systems in terms of both data loss probability and capacity expansion.  I have personally experienced instances where rebuilding a failed drive on a RAID array triggered secondary drive failures.

I do ultimately agree that for the average person, this type of storage would be inappropriate, but in a situation where substantial continuous growth is expected, conventional single-machine approaches have their limits.

Kevin Morrison wrote 02/03/2021 at 21:32 point

The more I look into this, the more sense it makes. Like you, I have spent decades building and working with RAID systems, and this does look like a more flexible way to work with storage.

Timo Birnschein wrote 02/03/2021 at 17:18 point

I'm not familiar with Ceph but from what you described (based on the 8GB RP4 requirements) and what's written online, it seems like the RAM is being used as a cache and Ceph distributes the data to the disks. That would make writing extremely fast but reading is still bound to the read speed of the drives. Or do I misunderstand the concept?

Robert Rouquette wrote 02/03/2021 at 19:58 point

The data is distributed on the client side among the OSD service instances.  The OSD in-memory cache fulfills the same purpose as a kernel disk cache would, but allows the OSD daemons more nuanced access for its internal operations and data integrity verification.

In terms of IO performance, the caching does help repetitive reads, but, for the most part, the performance is strictly limited by the RIO performance of the physical storage media.  Ceph write operations block on the client side until the data is actually committed to disk.  The only exception is CephFS where there is some limited client side buffering to allow for data chunking on large block writes.

andrewroffey wrote 01/29/2021 at 02:14 point

I've been looking at doing something similar, but the cost of 4TB drives looks to be better per GB. Is there any reason for going for 2x2TB drives per Pi, e.g. performance, cost? Could you go 1x4TB, or even 2x4TB per Pi? 8TB per OSD would still technically meet the "~1GB RAM per 1TB" rule of thumb for Ceph.

Robert Rouquette wrote 01/29/2021 at 22:16 point

I went with 2TB drives because they seemed like a good trade-off between capacity and rebuild time.  There's also the fact that Ceph performance is directly tied to the number of OSDs.  More OSDs give you more throughput, especially with low-performance hard drives.

That being said, there's no reason you can't stretch the OSD memory that far, but it's important to keep in mind that the full 8GB on the RPi is not truly usable by the OSD daemons.  The kernel carves out its own dedicated chunk, and you need to leave some for the kernel disk cache.  I would not use more than 7.5GB of the 8GB if you want the RPi to remain stable.

Andrew Luke Nesbit wrote 01/24/2021 at 16:59 point

I was thinking of doing almost the exact same thing, except instead of using RPI 4's I was going to use Orange Pi +2E boards.  Does anybody have any comments on how this change will affect the feasibility of the project?

Robert Rouquette wrote 01/24/2021 at 17:17 point

The main limitation with that approach would be the lack of USB 3 and memory. The +2E only has USB 2 and 2 GB of RAM.  It would certainly be possible to run an OSD daemon, but I don't think you would get usable read/write performance.

Andrew Luke Nesbit wrote 01/24/2021 at 17:42 point

You know, you're absolutely right... without USB 3 it's going to be practically unusable.  But I just had another thought.... what about if, instead of using a USB-connected SSD, I use a large SD flash card for the "user storage" and use the eMMC and/or SPI (still trying to work out the differences and capabilities of each) for the OS?

christian.boeing wrote 12/29/2020 at 06:43 point

Thanks for your answer. I expected that you were going 64-bit. (Of course, with 8GB Pis, as I see now.)
I have been running a Ceph cluster at home for some years too.

I started with 3 nodes and Ceph Hammer. At first I used ODROID-XU4 boards, but because of USB HDD issues I changed to ODROID-HC1/HC2. The disadvantage is that they are only 32-bit.
But I love the flexibility of Ceph.
My cluster is now a little mixed: 3 mon/mgr/mds on 64-bit Octopus and 12 OSDs on 32-bit Mimic. With armhf I found no way to go beyond Mimic. The Octopus version for armhf in Ubuntu 20.04 is broken.

Jeremy Rosengren wrote 12/28/2020 at 16:39 point

What kind of throughput are you getting when you're transferring files to/from cephfs?

Robert Rouquette wrote 12/29/2020 at 00:49 point

Cephfs can fully saturate my single 1 Gbps NIC indefinitely.  I don't currently have any machines with bonded ports, so I don't know the tipping point.
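
For a sense of scale, saturating a 1 Gbps link puts an upper bound on client throughput. A minimal sketch of that bound, assuming the full line rate is available as payload (real transfers lose a few percent to Ethernet, IP, and TCP framing); the function names are illustrative:

```python
def line_rate_bytes_per_s(link_gbps: float) -> float:
    """Theoretical maximum bytes per second for a link of link_gbps gigabits/s."""
    return link_gbps * 1e9 / 8


def transfer_hours(size_tib: float, link_gbps: float = 1.0) -> float:
    """Hours needed to move size_tib tebibytes at the full line rate."""
    return size_tib * 2**40 / line_rate_bytes_per_s(link_gbps) / 3600


if __name__ == "__main__":
    print(f"{line_rate_bytes_per_s(1.0) / 1e6:.0f} MB/s")  # prints "125 MB/s"
    print(f"{transfer_hours(1.0):.1f} h per TiB")          # prints "2.4 h per TiB"
```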

for.domination wrote 01/11/2021 at 13:54 point

Since seeing your project two weeks ago I couldn't get it out of my head, and now I'm trying to find a use case that would necessitate building a cluster ;)

Have you ever benchmarked your setup? Red Hat has a walkthrough written up at https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/administration_guide/ceph-performance-benchmarking

I'd be quite interested in how performance increases at such a large scale compared to the other RPi clusters found on the net, which are mostly 3-5 units only. If you have a client with multiple ports, it'd be interesting to see how much link aggregation (LA) increases speeds... Maybe that can circumvent the client as the limiting factor.

christian.boeing wrote 12/28/2020 at 15:51 point

Looks very nice.

I assume you are running the Pis in 64-bit mode because of Ceph's limited support for armhf?
Which version of ceph is running?

Robert Rouquette wrote 12/29/2020 at 00:46 point

The rpi4's are running Ubuntu 20.04 aarch64, and I'm using Ceph Octopus (15.2.7) from the Ubuntu repos (not the Ceph repos)

miouge wrote 11/25/2020 at 08:48 point

Looks great. Just a couple of questions:

- Which install method did you use for Ceph?

- What kind of data protection do you use? Replication or EC? How is the performance?

Robert Rouquette wrote 11/25/2020 at 19:22 point

I used the ceph-deploy method.  It's simpler and makes more sense for lower-power embedded systems like the Raspberry Pi since it's a bare-metal installation instead of being containerized.

I use 3x replication for the meta, fs root, and rbd pools.  I use 5:2 EC for the majority of my cephfs data.
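
The space cost of these two schemes can be estimated with a short calculation. This sketch assumes "5:2" means 5 data chunks plus 2 coding chunks (k=5, m=2), and it ignores the smaller replicated pools and any BlueStore allocation overhead; the function names are illustrative:

```python
def replication_overhead(copies: int) -> float:
    """Raw bytes consumed per byte of user data with N-way replication."""
    return float(copies)


def ec_overhead(k: int, m: int) -> float:
    """Raw bytes consumed per byte of user data with k data + m coding chunks."""
    return (k + m) / k


if __name__ == "__main__":
    print(replication_overhead(3))  # 3.0 raw bytes per stored byte
    print(ec_overhead(5, 2))        # 1.4 raw bytes per stored byte
    # 29 TiB of objects at 1.4x is about 40.6 TiB of raw usage
    print(f"{29 * ec_overhead(5, 2):.1f} TiB")
```

The 40.6 TiB estimate is close to the 41 TiB used for 29 TiB of objects shown in the cluster status from the May 2021 log, which is what one would expect if most of the data sits in the erasure-coded pool.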

Toby Archer wrote 11/20/2020 at 14:15 point

20TB raw capacity per shelf is great. How are you finding heat? It would be very cool to wire in two temperature probes to your management Pi's GPIO and monitor temperature by the exhausts.

Have you found single gigabit per node to become a bottleneck?

Awesome project, makes me very jealous.

Robert Rouquette wrote 11/21/2020 at 00:54 point

The CPU and disk temperatures tend to stay below 55 C.  The Gigabit Ethernet hasn't been a bottleneck so far, and I don't expect it to be.  The disks don't perform well enough with random IO to saturate the networking, and filesystem IO on CephFS is typically uniformly distributed across all of the OSDs.  As a result, the networking bottleneck is almost always on the client side.

Aaron Covrig wrote 11/16/2020 at 18:33 point

This is a sweet looking project!  I noticed that you seem to be playing it safe with how you distributed your Pis based on available RAM. Are you able to provide any details on what the memory consumption has looked like when idle and under load?

Robert Rouquette wrote 11/20/2020 at 00:40 point

The OSDs are configured for a maximum of 3 GiB per service instance and they tend to consume all of it.  That comes to 6 GiB per 8 GiB RPi just for the OSD services.  The kernel and other system services consume a minor amount as well, so the nodes tend to consistently hover around 20% free memory nearly all the time.  The extra "unused" memory is necessary as padding since there is no swap space.  (Adding swap on an SD card is simply inviting premature hardware failure.)
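
The memory budget above works out as follows. This is only a sketch of the arithmetic: it assumes the 3 GiB limit corresponds to Ceph's `osd_memory_target` option (which takes a value in bytes) and treats the full nominal 8 GiB as available, which slightly overstates what the kernel actually leaves usable.

```python
GIB = 2**30


def osd_memory_budget(total_gib: float, osds_per_node: int, osd_target_gib: float):
    """Return (GiB used by OSDs, GiB of headroom, headroom as a fraction of total)."""
    osd_total = osds_per_node * osd_target_gib
    headroom = total_gib - osd_total
    return osd_total, headroom, headroom / total_gib


if __name__ == "__main__":
    used, free, frac = osd_memory_budget(8.0, 2, 3.0)
    print(f"OSDs: {used} GiB, headroom: {free} GiB ({frac:.0%})")
    # Byte value for a 3 GiB per-OSD target (assumed to be osd_memory_target)
    print(3 * GIB)  # 3221225472
```

The 25% headroom computed here is before the kernel and other system services take their share.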

Aaron Covrig wrote 11/20/2020 at 02:00 point

Awesome, thank you.  
