CEPH basics

Base terms

  • OSD - Basic physical unit of the cluster, usually representing one hard disk
  • Bucket - A set describing one unit of the cluster layout (datacenter, rack, host); buckets can be nested
  • Placement Group (PG) - Basic logical unit of the cluster
  • Pool - Higher-level logical unit of the cluster, roughly equivalent to a logical partition on an HDD

Replica distribution in the cluster:

After the cluster is set up, a pool is created; it has a fixed number of PGs and a size value that determines how many replicas of each object the cluster should keep. Each PG is mapped onto as many OSDs as the pool's size specifies. Which OSDs are selected is determined by the ruleset (it can be configured differently for each pool). The CRUSH algorithm tries to make the distribution as even and as random as possible. If an OSD fails, all of its PGs are remapped onto the remaining OSDs; which ones, again, is determined by the pool's ruleset. Once the failed OSDs come back up, they are added back into the cluster and another remapping is performed. The ruleset can be changed while the cluster is running; after such a change, rebalancing starts automatically again.
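
A minimal sketch of the knobs described above, assuming a pool named pool-A and a ruleset with id 1 (the crush_ruleset parameter name is the one used by the pre-Luminous releases this guide appears to target; newer releases call it crush_rule):

    # ceph osd pool get pool-A size              # how many replicas the pool keeps
    # ceph osd pool set pool-A size 3            # change the replica count at runtime
    # ceph osd pool get pool-A crush_ruleset     # which ruleset the pool currently uses
    # ceph osd pool set pool-A crush_ruleset 1   # switch the ruleset; rebalancing starts automatically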

Deployment via Ceph-deploy:

https://github.com/ceph/ceph-deploy

Requirements

  • a set of Linux servers accessible via hostnames (node1, … , nodeX)
  • an admin server with passwordless SSH access to them (it can be part of the cluster, but that is not recommended); see the sketch below
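
One possible way to set up the passwordless access, assuming the root account is used on the nodes (ceph-deploy can also work with a dedicated user that has passwordless sudo):

    # ssh-keygen -t rsa              # generate a key pair on the admin server (if none exists yet)
    # ssh-copy-id root@node1         # repeat for node2 … nodeX
    # ssh node1 hostname             # verify that login works without a password prompt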

Deployment

  1. Install ceph-deploy on admin server:

    # pip install ceph-deploy
    
  2. Create initial config:

    # mkdir ~/ceph_cluster && cd ~/ceph_cluster
    # ceph-deploy new node1
    
  3. Install ceph on nodes:

    # ceph-deploy install node1 … nodeX
    
  4. Deploy the initial monitor member(s):

    # ceph-deploy mon create
    
  5. Create OSD on nodex (replace nodex and sdx):

    # ceph-deploy osd create nodex:/dev/sdx
    
  6. Activate the created OSD (NOTE: osd create creates two partitions on the block device - the first is used for actual data storage, the second for the journal; see the verification note after this list):

    # ceph-deploy osd activate nodex:/dev/sdx1:/dev/sdx2
    
  7. Print the OSD tree on one of the ceph nodes:

    # ceph osd tree
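
For a small two-node cluster with one OSD per host, the output looks roughly like this (IDs and weights are illustrative):

    ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
    -1 2.00000 root default
    -2 1.00000     host node1
     0 1.00000         osd.0       up  1.00000          1.00000
    -3 1.00000     host node2
     1 1.00000         osd.1       up  1.00000          1.00000

To verify the partition layout mentioned in step 6, the block device can be inspected on the node itself; ceph-disk (the tool ceph-deploy drives in this era) labels the partitions as "ceph data" and "ceph journal":

    # lsblk /dev/sdx     # should show sdx1 (data) and sdx2 (journal)
    # ceph-disk list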
    

Basic usage

Creating a pool and putting an object into the cluster

Useful for testing purposes.

  1. Create pool (for ideal pg_number see documentation or use pgcalc):

    # ceph osd pool create pool-A {pg_number}
    
  2. Create a dummy object:

    # dd if=/dev/zero of=object-A bs=4M count=1
    
  3. Put it into the pool:

    # rados -p pool-A put object-A object-A        
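
The pool and the object can then be verified and the object read back (the output path is arbitrary):

    # ceph osd lspools
    # ceph osd pool get pool-A pg_num
    # rados -p pool-A ls
    # rados -p pool-A stat object-A
    # rados -p pool-A get object-A /tmp/object-A.out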
    

Locating an object in the cluster

    # ceph osd map pool-A object-A
    osdmap e407 pool 'pool-A' (1) object 'object-A' -> pg 1.b301e3e8 (1.1e8) -> up ([1,17], p1) acting ([1,17,4], p1)

which means:

  • PG name: 1.b301e3e8
  • pgid: 1.1e8
  • up set (where CRUSH currently says the PG should be): OSDs 1, 17
  • acting set (the OSDs that actually serve the PG at the moment): OSDs 1, 17, 4

Show placement group info

Produces a long JSON output.

    # ceph pg {pgid} query 
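
For a quicker overview than the full JSON dump, the following may be enough (pgid as obtained from ceph osd map above):

    # ceph pg map 1.1e8          # just the up/acting mapping of a single PG
    # ceph pg dump pgs_brief     # one-line state summary for every PG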

CRUSH rulesets

example:

    rule jbods {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 0 type jbod
        step chooseleaf firstn 1 type host
        step emit
    }

description:

  • ruleset: integer id of the rule
  • type: replicated (plain replicas of the object) / erasure (erasure coding, comparable to parity RAID)
  • min_size: minimal pool size (replica count) for which the ruleset applies
  • max_size: maximal pool size (replica count) for which the ruleset applies
  • step …: the actual placement rules (a small domain-specific language)
  • step take default: selects the entry point in the cluster tree (must be a bucket instance, not a type - e.g. default or rack1)
  • step choose firstn 0 type jbod: selects as many buckets of the given type as the pool size requires and passes them to the next step (an exact positive number can be given instead; a negative number means pool_size - |num|)
  • step chooseleaf firstn 1 type host: for each jbod passed from the previous step, descends into one host and selects one leaf (OSD) under it
  • step emit: outputs the current selection (does not end the ruleset if further steps follow)
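
For comparison, a minimal replicated rule that simply places each replica on a different host (this mirrors the structure of the default rule Ceph generates; the name and id below are illustrative) could look like this:

    rule replicated_hosts {
        ruleset 2
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
    }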

Modifying a CRUSH map & testing a ruleset

  1. Dump current crushmap:

    # ceph osd getcrushmap -o crushmap.img
    
  2. Decompile:

    # crushtool -d crushmap.img -o crushmap.decompiled
    
  3. Edit rule:

    # vim crushmap.decompiled
    
  4. Compile:

    # crushtool -c crushmap.decompiled -o crushmap.changed.img
    
  5. Run test:

    # crushtool -i crushmap.changed.img --test --ruleset 2 --num-rep 3 --show-mappings
    
  6. If OK, apply the map:

    # ceph osd setcrushmap -i crushmap.changed.img
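
Before applying, the test in step 5 can also be run with --show-bad-mappings instead of --show-mappings; crushtool then prints only the inputs for which the rule failed to produce the requested number of replicas:

    # crushtool -i crushmap.changed.img --test --ruleset 2 --num-rep 3 --show-bad-mappings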
    

Last modified: 2017-05-25