Raspberry Pi Ceph Cluster

A Raspberry Pi Ceph Cluster using 2TB USB drives.

I needed an easily expandable storage solution for warehousing my ever-growing hoard of data. I decided to go with Ceph since it's open source and I had slight experience with it from work. The most important benefit is that I can continuously expand the storage capacity as needed simply by adding more nodes and drives.

Current Hardware:

  • 1x RPi 4 w/ 2GB RAM
    • management machine
    • influxdb, apcupsd, apt-cacher
  • 3x RPi 4 w/ 4GB RAM
    • 1x ceph mon/mgr/mds per RPi
  • 18x RPi 4 w/ 8GB RAM
    • 2 ceph osds per RPi
    • 2x Seagate 2TB USB 3.0 HDD per RPi

Current Total Raw Capacity: 65 TiB
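
As a quick sanity check on that figure (my arithmetic, not a number reported by Ceph): 36 OSDs × 2 TB per drive = 72 × 10^12 bytes, which is roughly 65.5 TiB once converted to binary units, in line with the 65 TiB above.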

The RPi's are all housed in a nine drawer cabinet with rear exhaust fans.  Each drawer has an independent 5V 10A power supply.  There is a 48-port network switch in the rear of the cabinet to provide the necessary network fabric.

The HDDs are double-stacked five wide to fit 10 HDDs in each drawer along with five RPi 4's.  A 2" x 7" x 1/8" aluminum bar is sandwiched between the drives for heat dissipation.  Each drawer has a custom 5-port USB power fanout board to power the RPi's.  The RPi's have the USB PMIC bypassed with a jumper wire to power the HDDs since the 1.2A current limit is insufficient to spin up both drives.

  • 1 × Raspberry Pi 4 w/ 2GB RAM
  • 3 × Raspberry Pi 4 w/ 4GB RAM
  • 18 × Raspberry Pi 4 w/ 8GB RAM
  • 22 × MB-MJ64GA/AM Samsung PRO Endurance 64GB 100MB/s (U1) MicroSDXC Memory Card with Adapter
  • 22 × USB C Cable (1 ft) USB Type C Cable Braided Fast Charge Cord


  • Instability Followup and Resolution

    Robert Rouquette 03/30/2022 at 15:36

    The OSD instability I encountered after the kernel update persisted, though with less frequency.  I've finally determined that the cause is a confluence of small issues that amplify each other:

    • Power Supply Aging - The 5V 10A supplies have lost a small amount of their output headroom with age.
    • CPU Governor Changes - The ondemand CPU governor is no longer as aggressive at reducing the CPU frequency.
    • CPU Aging - The CPU and PCIe controller appear to have become more prone to core undervoltage.

    I've remediated the instability by underclocking the CPU.  Underclocking was insufficient on its own, so I've also applied a slight overvoltage.  The OSD RPi's have been holding steady after applying both changes.

    Here's the current state of my usercfg.txt:

    # Place "config.txt" changes (dtparam, dtoverlay, disable_overscan, etc.) in
    # this file. Please refer to the README file for a description of the various
    # configuration files on the boot partition.
    
    max_usb_current=1
    over_voltage=2
    arm_freq=500 
    

    These changes do not appear to have impacted my Ceph performance.
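
    For anyone applying similar tweaks, the firmware tools can confirm the settings actually took effect.  This is a generic sketch (vcgencmd ships with the Raspberry Pi userland utilities; availability and package name vary by distro):

    # current ARM clock in Hz (should track arm_freq under load)
    vcgencmd measure_clock arm
    # current core voltage (should reflect the over_voltage offset)
    vcgencmd measure_volts core
    # bit flags for under-voltage or throttling events since boot
    vcgencmd get_throttled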

  • Unstable Kernel (5.4.0-1041-raspi)

    Robert Rouquette 08/24/2021 at 16:42

    The linux-image-5.4.0-1041-raspi version of the Ubuntu linux-image package appears to be unstable.  I've had two of the OSD RPi boards randomly lock up.  The boards recover on their own once power-cycled, but this is the first time I've observed this behavior.  There are no messages in the system logs; the logs simply stop at the time of the lockup and resume on reboot.  I've upgraded all of the RPis to the latest image version (1042), which should hopefully resolve the issue.
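
    For reference, the upgrade itself is nothing exotic; roughly the following on the stock Ubuntu raspi image (a sketch, not a transcript of my exact session):

    # confirm which kernel is currently running
    uname -r
    # pull the newer linux-image packages and reboot into them
    sudo apt update && sudo apt full-upgrade
    sudo reboot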

  • Rebalance Complete

    Robert Rouquette 06/28/2021 at 14:26

  • Two More OSDs

    Robert Rouquette 06/17/2021 at 01:40

    I've added the last two drives for this round of expansion which brings the total for the cluster to 34 HDDs (OSDs).  Amazon would not allow me to purchase more of the STGX2000400 drives, so I went with the STKB2000412 drives instead.  They have roughly the same performance, but cost about $4 more per drive.  The aluminum top portion of the case should provide better thermal performance though.
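
    For anyone following along, the provisioning step for a new drive depends on how the OSDs were deployed; on a plain ceph-volume setup it is roughly the following, with /dev/sda standing in for whatever device the new drive enumerates as:

    # wipe any old partitions or signatures on the new drive
    sudo ceph-volume lvm zap /dev/sda --destroy
    # create a BlueStore OSD on it and start the daemon
    sudo ceph-volume lvm create --data /dev/sda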


  • Additional Storage

    Robert Rouquette 06/09/2021 at 01:32

    Added the first two of four additional drives.  I plan to add the other two once the rebalance completes.
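
    Watching the data movement is just a matter of polling the cluster; for example:

    # overall health plus recovery/backfill progress
    ceph -s
    # per-OSD utilization, handy for spotting imbalance while PGs move
    ceph osd df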


  • OSD Recovery Complete

    Robert Rouquette 05/19/2021 at 03:24

    Once the failed drive was replaced, the cluster was able to rebalance and repair the inconsistent PGs.

      cluster:
        id:     105370dd-a69b-4836-b18c-53bcb8865174
        health: HEALTH_OK
     
      services:
        mon: 3 daemons, quorum ceph-mon00,ceph-mon01,ceph-mon02 (age 33m)
        mgr: ceph-mon02(active, since 13d), standbys: ceph-mon00, ceph-mon01
        mds: cephfs:2 {0=ceph-mon00=up:active,1=ceph-mon02=up:active} 1 up:standby
        osd: 30 osds: 30 up (since 9d), 30 in (since 9d)
     
      data:
        pools:   5 pools, 385 pgs
        objects: 8.42M objects, 29 TiB
        usage:   41 TiB used, 14 TiB / 55 TiB avail
        pgs:     385 active+clean
     
      io:
        client:   7.7 KiB/s wr, 0 op/s rd, 0 op/s wr
     
    

    After poking around on the failed drive, it looks like the actual 2.5" drive itself is fine.  The USB-to-SATA controller seems to be the culprit, randomly garbling data over the USB interface.  I was also able to observe it fail to enumerate on the USB bus.  A failure rate of 1 in 30 isn't bad considering the cost of the drives.
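
    If you want to reproduce that kind of diagnosis, the usual places to look are the USB enumeration and the SMART data behind the bridge.  A generic sketch, with /dev/sda as a stand-in for the suspect device:

    # does the enclosure show up on the bus at all?
    lsusb
    # look for USB resets or UAS errors around the failure
    dmesg | grep -iE 'usb|uas|sd[a-z]'
    # SMART data read through the USB-to-SATA bridge
    sudo smartctl -a -d sat /dev/sda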

  • OSD Failure

    Robert Rouquette 05/04/2021 at 19:13

    The deep scrubs turned up a few more repairable inconsistencies until a few days ago when they grew concerning.  It turned out that one of the OSDs had unexplained read errors.  Smartctl showed that there were no read errors recorded by the disk, so I initially assumed it was just the result of a power failure.  It became obvious that something was physically wrong with the disk when previously clean or repaired PGs were found to have new errors.

    As a result I've marked the suspect OSD out of the cluster and I have ordered a replacement drive.  The exact cause of the read errors is unknown, but since it is isolated to a single drive, and the other OSD on the same RPi is fine, it's most likely just a bad drive.

    Ceph is currently rebuilding the data from the bad drive, and I'll post an update once the new drive arrives.

    The inconsistent PGs all have a single OSD in common: 2147483647 (formerly identified as 25).
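
    For anyone unfamiliar with the workflow, taking the suspect OSD out and listing the flagged PGs uses the standard tooling; a sketch, with <pool-name> as a placeholder:

    # take OSD 25 out so Ceph re-replicates its data elsewhere
    ceph osd out 25
    # list the PGs that scrubbing has flagged as inconsistent
    ceph health detail
    rados list-inconsistent-pg <pool-name>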

  • Inconsistent Placement Group

    Robert Rouquette 04/08/2021 at 16:54

    The OSD deep scrubbing found an inconsistency in one of the placement groups.  I've marked the PG in question for repair, so hopefully it's correctable and is merely a transient issue.  The repair operation should be complete in the next few hours.
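
    The scrub and repair plumbing is standard Ceph; roughly the following, with <pg-id> standing in for the flagged placement group:

    # show which object copies in the PG disagree
    rados list-inconsistent-obj <pg-id> --format=json-pretty
    # ask Ceph to repair the PG from the authoritative copies
    ceph pg repair <pg-id>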


    [UPDATE]
    Ceph was able to successfully repair the PG.

  • Zabbix Monitoring

    Robert Rouquette 01/22/2021 at 00:30

    I decided I have enough nodes that comprehensive monitoring is worth the time, so I configured Zabbix to monitor the nodes in the Ceph cluster.  I used the zbx-smartctl project to collect the SMART data.

  • Rebalance Complete

    Robert Rouquette 12/10/2020 at 15:15

    The rebalance finally completed.  I had to relax the global mon_target_pg_per_osd setting on my cluster to allow the PG count increase and the balancer to settle.  Without setting that parameter to 1, the balancer and PG autoscaler were caught in a slow thrashing loop.
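
    For anyone hitting the same thrashing, the setting lives in the cluster config database and can be changed at runtime; a sketch of the relevant commands:

    # relax the per-OSD PG target so the balancer and autoscaler stop fighting
    ceph config set global mon_target_pg_per_osd 1
    # check what the autoscaler now wants for each pool
    ceph osd pool autoscale-status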


Discussions

Toby Archer wrote 11/20/2020 at 14:15

20TB raw capacity per shelf is great. How are you finding heat? It would be very cool to wire in two temperature probes to your management Pi's GPIO and monitor temperature by the exhausts.

Have you found single gigabit per node to become a bottleneck?

Awesome project, makes me very jealous.


Robert Rouquette wrote 11/21/2020 at 00:54

The CPU and disk temperatures tend to stay below 55 °C.  The Gigabit Ethernet hasn't been a bottleneck so far, and I don't expect it to be.  The disks don't perform well enough with random IO to saturate the networking, and filesystem IO on CephFS is typically uniformly distributed across all of the OSDs.  As a result, the networking bottleneck is almost always on the client side.


Aaron Covrig wrote 11/16/2020 at 18:33

This is a sweet-looking project!  I noticed that you look to be playing it safe with how you distributed your Pi's based on available RAM.  Are you able to provide any details on what the memory consumption has looked like when idle and under load?


Robert Rouquette wrote 11/20/2020 at 00:40

The OSDs are configured for a maximum of 3 GiB per service instance, and they tend to consume all of it.  That comes to 6 GiB per 8 GiB RPi just for the OSD services.  The kernel and other system services consume a minor amount as well, so the boards consistently hover at around 20% free memory.  The extra "unused" memory is necessary as padding since there is no swap space.  (Adding swap on an SD card is simply inviting premature hardware failure.)
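
For context, that per-OSD cap corresponds to the standard osd_memory_target knob; one way to set such a cap cluster-wide (a sketch, with 3 GiB expressed in bytes):

# cap each OSD daemon's memory target at 3 GiB
ceph config set osd osd_memory_target 3221225472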


Aaron Covrig wrote 11/20/2020 at 02:00

Awesome, thank you.  

