CEPH
In computing, Ceph (pronounced /ˈsɛf/ or /ˈkɛf/) is a free-software storage platform that implements object storage on a single distributed computer cluster and provides interfaces for object-, block- and file-level storage. Ceph aims primarily for completely distributed operation without a single point of failure, scalability to the exabyte level, and free availability. [Wikipedia]
Our software automates the process of building and managing Linux clusters in your data center and in the cloud
Open Source
Community Focused
Software Defined Storage Solution
Distributed object storage
Redundancy
Replication
Erasure Coding
Cache Tiering
Efficient scale out
Built on commodity hardware
Most popular choice of distributed storage for OpenStack: Nova (VM virtual disks), Glance (images), Cinder (block storage), RadosGW
Copy-On-Write (Glance image to Nova/Cinder)
Self healing
Self managed
No bottlenecks
Object Access (like Amazon S3)
Block Access
Distributed File System (cephfs)
RADOS (Reliable Autonomic Distributed Object Store) .. Documentation Ceph Storage Cluster
radosgw (object storage) .. Documentation Ceph Object Storage
RESTful Interface
S3 and Swift APIs
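Since radosgw exposes an S3-compatible endpoint, any S3 client can talk to it. A minimal sketch using boto3; the endpoint URL, credentials, and bucket name are placeholders (real keys come from a radosgw user created with radosgw-admin):

```python
import boto3

# Point a standard S3 client at the radosgw endpoint (placeholder values).
s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:7480",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.create_bucket(Bucket="demo-bucket")
s3.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"hello ceph")
obj = s3.get_object(Bucket="demo-bucket", Key="hello.txt")
print(obj["Body"].read())
```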
rbd (block device) .. Documentation Ceph Block Device
Block devices
Snapshot
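A minimal sketch of creating, writing, and snapshotting a block device with the python-rbd bindings; the pool name, image name, and size are placeholders, and a reachable cluster with a client keyring is assumed:

```python
import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("rbd")                    # pool name (placeholder)
    rbd.RBD().create(ioctx, "demo-image", 4 * 1024**3)   # 4 GiB image
    with rbd.Image(ioctx, "demo-image") as image:
        image.write(b"hello block device", 0)            # write at offset 0
        image.create_snap("demo-snap")                   # snapshot of the image
    ioctx.close()
finally:
    cluster.shutdown()
```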
CephFS (File System) .. Documentation Ceph Filesystem
POSIX Compliant
Separate Data and Metadata
For use e.g. with Hadoop
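A minimal sketch of file I/O through the libcephfs Python bindings (python3-cephfs); the path and config file are placeholders, and a running MDS is assumed:

```python
import os
import cephfs

fs = cephfs.LibCephFS(conffile="/etc/ceph/ceph.conf")
fs.mount()                                             # needs an active MDS
fd = fs.open("/hello.txt", os.O_CREAT | os.O_WRONLY, 0o644)
fs.write(fd, b"hello cephfs", 0)                       # write at offset 0
fs.close(fd)
fd = fs.open("/hello.txt", os.O_RDONLY, 0o644)
print(fs.read(fd, 0, 32))                              # read back up to 32 bytes
fs.close(fd)
fs.unmount()
fs.shutdown()
```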
We recommend using XFS
Head Node (Controller)
SQL Database
Cluster Management GUI (JSON + SSL)
Cluster Management Shell
Web Based User Portal
Third Party Applications
Node-005 (Nova App)
Node-004 (CEPH OSD)
Node-003 (CEPH OSD)
Node-002 (Nova Compute)
Node-001 (Nova Compute)
Server: CEPH OSD Node
Type
Fat Node
Many Cores / Sockets, 20+ HDDs, 1+ Journal SSDs
Thin Node
Faster recovery
1 Socket is enough
Physical Disk
SSD Journals (Fast) Vs HDDs (Slow)
File System (btrfs, xfs)
One Object Storage Daemon (OSD) per disk
OSDs serve object storage to clients
OSDs peer with each other to perform replication and recovery
Server: CEPH Monitor
Store Cluster Map (run at least 3 Monitors for quorum)
Brain of the Cluster
Do not serve stored objects to clients
Server: CEPH Metadata Server (MDS, for CephFS)
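A minimal sketch that asks the Monitors for cluster state through librados (python3-rados); the config file path is a placeholder and a client keyring with monitor access is assumed:

```python
import json
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    # Ask the Monitors (the "brain" of the cluster) for the overall status.
    ret, outbuf, errs = cluster.mon_command(
        json.dumps({"prefix": "status", "format": "json"}), b"")
    status = json.loads(outbuf)
    print(status.get("health"))          # health summary (layout varies by release)

    # Capacity as aggregated from the OSDs.
    stats = cluster.get_cluster_stats()
    print(stats["kb_used"], "KB used of", stats["kb"], "KB")
finally:
    cluster.shutdown()
```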
Pool
Logical container for storage objects
Parameters
Name, ID
Replicas
CRUSH rules
Operations
Create / Read / Write Objects
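A minimal sketch of these pool operations with the python-rados bindings; the pool and object names are placeholders, and the replica count / CRUSH rule are left at the cluster defaults:

```python
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    if not cluster.pool_exists("demo-pool"):
        cluster.create_pool("demo-pool")        # replicas / CRUSH rule from defaults
    ioctx = cluster.open_ioctx("demo-pool")
    ioctx.write_full("obj-1", b"hello rados")   # create / write an object
    print(ioctx.read("obj-1"))                  # read it back
    ioctx.remove_object("obj-1")                # delete it
    ioctx.close()
finally:
    cluster.shutdown()
```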
Placement Groups (PGs)
Balance data across OSDs
1 PG spans several OSDs
1 OSD serves many PGs
Tunable (50-100 per OSD)
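A small sketch of the usual rule of thumb for picking a pool's PG count from the 50-100 PGs per OSD guideline above; the target of 100 and the replica count of 3 are assumptions for the example:

```python
def pg_count(num_osds: int, replicas: int = 3, target_pgs_per_osd: int = 100) -> int:
    """Rule-of-thumb PG count, rounded up to the next power of two."""
    raw = num_osds * target_pgs_per_osd / replicas
    power = 1
    while power < raw:
        power *= 2
    return power

print(pg_count(7))   # 7 OSDs, 3 replicas -> 256 PGs
```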
CRUSH: Controlled Replication Under Scalable Hashing
Monitors maintain the CRUSH map
Clients understand CRUSH
Standalone Storage System
Back End for OpenStack Block Storage
Storage Nodes
CPU: 1.5 GHz per OSD
Memory: 1 or 2 GB per Terabyte of Storage (e.g. 16 GB)
Storage Controller
SSDs for OSD Journal
HDDs
Krusty The Cloud
17 Hypervisor Nodes
400 VMs
7 CEPH OSDs (Extending to 10)
Considerations (see the sizing sketch below)
Network determines the number of SSDs
Number of SSDs determines the number of HDDs
Number of HDDs determines the CPU core count
Size and count of HDDs determine the amount of memory needed
Network
Single Fabric
Single Switch, VLANs
Problems: One broadcast domain, bandwidth
Multiple Fabric
Fabric for VLAN/VXLAN
CEPH Access (ceph-public)
CEPH Cluster (ceph-cluster)
NICs
1 GigE, 10 GigE
MTUs
1500 Vs 9000 (Jumbo Frames)
Disks
SSD Journals
Endurance: amount of data that can be written before failure
1 GigE good for SATA SSD
10 GigE good for PCIe SSD
Hard Disk
5 HDDs per 1 SSD (from 4 to 8 is common)
3 SSDs per OSD node (on 10 GigE)
5 OSD Daemons per node
Processor
1 Socket, but how many cores?
Depends on SSD and networking
1 CPU core per OSD Daemon / Disk
1 SATA SSD Journal per ~4-6 HDD
1 PCIe SSD Journal per ~6-20 HDD
Example: 2 SATA SSDs could handle 12 OSDs, which would require a 12-core CPU
Hyper Threading Cores Vs Physical Cores
HT enabled
Memory
0.5 GB - 1 GB per TB per Daemon
More is better (Linux VFS caching)
OSD node with 4 x 2 TB Disks (4 Daemons) -> 8 GB of RAM
OSD node with 16 x 2 TB Disks (16 Daemons) -> 32 GB of RAM
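A small sizing sketch that strings together the rules of thumb from these notes (roughly 1 CPU core per OSD/disk, 4-6 HDDs per SATA journal SSD, about 1 GB of RAM per TB of storage) and reproduces the worked examples above; the ratios are guidelines, not hard limits:

```python
def size_osd_node(num_hdds: int, hdd_tb: float, hdds_per_sata_ssd: int = 6):
    """Rough OSD-node sizing from the rules of thumb in these notes."""
    journal_ssds = -(-num_hdds // hdds_per_sata_ssd)   # ceiling division
    cpu_cores = num_hdds                               # ~1 core per OSD daemon
    ram_gb = num_hdds * hdd_tb * 1.0                   # ~1 GB RAM per TB of storage
    return journal_ssds, cpu_cores, ram_gb

print(size_osd_node(12, 2))   # 2 SATA journal SSDs, 12 cores -> matches the CPU example
print(size_osd_node(4, 2))    # 4 x 2 TB disks  -> 8 GB RAM
print(size_osd_node(16, 2))   # 16 x 2 TB disks -> 32 GB RAM
```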
Up to 16
CMDaemon