CEPH
In computing, Ceph (pronounced /ˈsɛf/ or /ˈkɛf/) is a free-software storage platform that implements object storage on a single distributed computer cluster and provides interfaces for object-, block- and file-level storage. Ceph aims primarily for completely distributed operation without a single point of failure, scalability to the exabyte level, and free availability. [Wikipedia]
Our software automates the process of building and managing Linux clusters in your data center and in the cloud
Open Source
Community Focused
Software Defined Storage Solution
Distributed object storage
Redundancy
Replication
Erasure Coding
Cache Tiering
Efficient scale out
Built on commodity hardware
Most popular choice of distributed storage for OpenStack: Nova (VM virtual disks), Glance (images), Cinder (block storage), RadosGW
Copy-On-Write (Glance image to Nova/Cinder)
Self healing
Self managed
No bottlenecks
Object Access (like Amazon S3)
Block Access
Distributed File System (cephfs)
RADOS (Reliable Autonomic Distributed Object Store) .. Documentation Ceph Storage Cluster
radosgw (object storage) .. Documentation Ceph Object Storage
RESTful Interface
S3 and Swift APIs
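Since radosgw exposes an S3-compatible endpoint, any S3 client can talk to it. A minimal sketch using boto3; the endpoint URL, credentials, and bucket name are placeholders (real keys come from a radosgw user created with radosgw-admin):

```python
import boto3

# Point a standard S3 client at the radosgw endpoint (placeholder values).
s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:7480",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.create_bucket(Bucket="demo-bucket")
s3.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"hello ceph")
obj = s3.get_object(Bucket="demo-bucket", Key="hello.txt")
print(obj["Body"].read())
```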
rbd (block device) .. Documentation Ceph Block Device
Block devices
Snapshot
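A minimal sketch of creating, writing, and snapshotting a block device with the python-rbd bindings; the pool name, image name, and size are placeholders, and a reachable cluster with a client keyring is assumed:

```python
import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("rbd")                    # pool name (placeholder)
    rbd.RBD().create(ioctx, "demo-image", 4 * 1024**3)   # 4 GiB image
    with rbd.Image(ioctx, "demo-image") as image:
        image.write(b"hello block device", 0)            # write at offset 0
        image.create_snap("demo-snap")                   # snapshot of the image
    ioctx.close()
finally:
    cluster.shutdown()
```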
CephFS (File System) .. Documentation Ceph Filesystem
POSIX Compliant
Separate Data and Metadata
For use e.g. with Hadoop
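A minimal sketch of file I/O through the libcephfs Python bindings (python3-cephfs); the path and config file are placeholders, and a running MDS is assumed:

```python
import os
import cephfs

fs = cephfs.LibCephFS(conffile="/etc/ceph/ceph.conf")
fs.mount()                                             # needs an active MDS
fd = fs.open("/hello.txt", os.O_CREAT | os.O_WRONLY, 0o644)
fs.write(fd, b"hello cephfs", 0)                       # write at offset 0
fs.close(fd)
fd = fs.open("/hello.txt", os.O_RDONLY, 0o644)
print(fs.read(fd, 0, 32))                              # read back up to 32 bytes
fs.close(fd)
fs.unmount()
fs.shutdown()
```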
We recommend using XFS
Head Node (Controller)
SQL Database
Cluster Management GUI (JSON + SSL)
Cluster Management Shell
Web Based User Portal
Third Party Applications
Node-005 (Nova App)
Node-004 (CEPH OSD)
Node-003 (CEPH OSD)
Node-002 (Nova Compute)
Node-001 (Nova Compute)
Server: CEPH OSD Node
Type
Fat Node
Many Cores / Sockets, 20+ HDDs, 1+ Journal SSDs
Thin Node
Faster recovery
1 Socket is enough
Physical Disk
SSD Journals (Fast) Vs HDDs (Slow)
File System (btrfs, xfs)
One Object Storage Daemon (OSD) per disk
OSDs serve object storage to clients
OSDs peer with each other to perform replication and recovery
Server: CEPH Monitor
Store Cluster Map (run at least 3 Monitors for quorum)
Brain of the Cluster
Do not serve stored objects to clients
Server: CEPH Metadata Server (MDS, for CephFS)
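A minimal sketch that asks the Monitors for cluster state through librados (python3-rados); the config file path is a placeholder and a client keyring with monitor access is assumed:

```python
import json
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    # Ask the Monitors (the "brain" of the cluster) for the overall status.
    ret, outbuf, errs = cluster.mon_command(
        json.dumps({"prefix": "status", "format": "json"}), b"")
    status = json.loads(outbuf)
    print(status.get("health"))          # health summary (layout varies by release)

    # Capacity as aggregated from the OSDs.
    stats = cluster.get_cluster_stats()
    print(stats["kb_used"], "KB used of", stats["kb"], "KB")
finally:
    cluster.shutdown()
```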
Pool
Logical container for storage objects
Parameters
Name, ID
Replicas
CRUSH rules
Operations
Create / Read / Write Objects
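A minimal sketch of these pool operations with the python-rados bindings; the pool and object names are placeholders, and the replica count / CRUSH rule are left at the cluster defaults:

```python
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    if not cluster.pool_exists("demo-pool"):
        cluster.create_pool("demo-pool")        # replicas / CRUSH rule from defaults
    ioctx = cluster.open_ioctx("demo-pool")
    ioctx.write_full("obj-1", b"hello rados")   # create / write an object
    print(ioctx.read("obj-1"))                  # read it back
    ioctx.remove_object("obj-1")                # delete it
    ioctx.close()
finally:
    cluster.shutdown()
```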
Placement Groups (PGs)
Balance data across OSDs
1 PG spans several OSDs
1 OSD serves many PGs
Tunable (50-100 per OSD)
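A small sketch of the usual rule of thumb for picking a pool's PG count from the 50-100 PGs per OSD guideline above; the target of 100 and the replica count of 3 are assumptions for the example:

```python
def pg_count(num_osds: int, replicas: int = 3, target_pgs_per_osd: int = 100) -> int:
    """Rule-of-thumb PG count, rounded up to the next power of two."""
    raw = num_osds * target_pgs_per_osd / replicas
    power = 1
    while power < raw:
        power *= 2
    return power

print(pg_count(7))   # 7 OSDs, 3 replicas -> 256 PGs
```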
CRUSH: Controlled Replication Under Scalable Hashing
Monitors maintain the CRUSH map
Clients understand CRUSH
Standalone Storage System
Back End for OpenStack Block Storage
Storage Nodes
CPU: 1.5 GHz per OSD
Memory: 1 or 2 GB per Terabyte of Storage (e.g. 16 GB)
Storage Controller
SSDs for OSD Journal
HDDs
Krusty The Cloud
17 Hypervisor Nodes
400 VMs
7 CEPH OSDs (Extending to 10)
Considerations (see the sizing sketch below)
Network determines the number of SSDs
Number of SSDs determines the number of HDDs
Number of HDDs determines the CPU core count
Size and count of HDDs determine the amount of memory needed
Network
Single Fabric
Single Switch, VLANs
Problems: One broadcast domain, bandwidth
Multiple Fabric
Fabric for VLAN/VXLAN
CEPH Access (ceph-public)
CEPH Cluster (ceph-cluster)
NICs
1 GigE, 10 GigE
MTUs
1500 Vs 9000 (Jumbo Frames)
Disks
SSD Journals
Endurance: amount of data that can be written before failure
1 GigE good for SATA SSD
10 GigE good for PCIe SSD
Hard Disk
5 HDDs per 1 SSD (from 4 to 8 is common)
3 SSDs per OSD node (on 10 GigE)
5 OSD Daemons per node
Processor
1 Socket, but how many cores?
Depends on SSD and networking
1 CPU core per OSD Daemon / Disk
1 SATA SSD Journal per ~4-6 HDD
1 PCIe SSD Journal per ~6-20 HDD
Example: 2 SATA SSDs could handle 12 OSDs, which would require a 12-core CPU
Hyper Threading Cores Vs Physical Cores
HT enabled
Memory
0.5 GB - 1 GB per TB per Daemon
More is better (Linux VFS caching)
OSD node with 4 x 2 TB Disks (4 Daemons) -> 8 GB of RAM
OSD node with 16 x 2 TB Disks (16 Daemons) -> 32 GB of RAM
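A small sizing sketch that strings together the rules of thumb from these notes (roughly 1 CPU core per OSD/disk, 4-6 HDDs per SATA journal SSD, about 1 GB of RAM per TB of storage) and reproduces the worked examples above; the ratios are guidelines, not hard limits:

```python
def size_osd_node(num_hdds: int, hdd_tb: float, hdds_per_sata_ssd: int = 6):
    """Rough OSD-node sizing from the rules of thumb in these notes."""
    journal_ssds = -(-num_hdds // hdds_per_sata_ssd)   # ceiling division
    cpu_cores = num_hdds                               # ~1 core per OSD daemon
    ram_gb = num_hdds * hdd_tb * 1.0                   # ~1 GB RAM per TB of storage
    return journal_ssds, cpu_cores, ram_gb

print(size_osd_node(12, 2))   # 2 SATA journal SSDs, 12 cores -> matches the CPU example
print(size_osd_node(4, 2))    # 4 x 2 TB disks  -> 8 GB RAM
print(size_osd_node(16, 2))   # 16 x 2 TB disks -> 32 GB RAM
```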
Up to 16
CMDaemon