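Ceph is a new-generation, free-software distributed file system that Sage Weil (co-founder of DreamHost) designed for his doctoral dissertation at the University of California, Santa Cruz. After graduating in 2007, Sage began working on Ceph full time to make it suitable for production environments. Ceph's main goal is to be a POSIX-based distributed file system with no single point of failure, in which data is fault tolerant and replicated seamlessly. In March 2010, Linus Torvalds merged the Ceph client into Linux kernel 2.6.34.
Ceph contains many techniques that are novel in the distributed-systems field, and they offer useful guidance for research into common problems in distributed file systems. That makes it well worth studying.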
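Introduction to RADOS
1 RADOS Overview
RADOS (Reliable, Autonomic Distributed Object Store) is one of the core components of Ceph. It is a subproject of the Ceph distributed file system and was designed specifically for Ceph's needs. On top of a dynamically changing, heterogeneous cluster of storage devices, it provides a stable, scalable, high-performance object storage interface that presents a single logical object store, together with a storage system whose nodes are self-adapting and self-managing. RADOS can also be used on its own as a distributed data store, providing storage services to any distributed file system with matching requirements.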
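2 RADOS Architecture
RADOS consists of two main parts (see Figure 1):
1. A large cluster made up of a variable number of OSDs (Object Storage Devices), which stores all object data;
2. A small, tightly coupled cluster made up of a few Monitors, which manages the Cluster Map. The Cluster Map is the key data structure of the whole RADOS system: it records all members of the cluster, their relationships and attributes, and how data is distributed.

Figure 1: RADOS system architecture
In RADOS, node organization and management and the data distribution policy are handled entirely by the internal Monitors. The design is therefore relatively simple from the client's point of view: applications see only a simple storage interface.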
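3 RADOS in Detail
3.1 Scalable Cluster
1. Cluster Map
The storage cluster is managed exclusively by manipulating the Cluster Map through the Monitor cluster. The Cluster Map is the core data structure of the whole RADOS system; it specifies the OSDs in the cluster and how all data is distributed. Every storage node and client involved in the RADOS system holds a copy of the Cluster Map at its latest epoch. Because of this special role of the Cluster Map, the client can expose a very simple interface that abstracts the entire storage cluster as a single logical object store.
Cluster Map updates are driven by changes in OSD state or by other events that change the data layout. Every update increments the map epoch. The map epoch keeps the copies of the Cluster Map on all nodes in sync, and it also lets a node with an out-of-date Cluster Map be brought up to date promptly by the peers it communicates with.
In a large distributed system, OSD failures and recoveries are common, so the Cluster Map is updated frequently. Distributing or broadcasting the whole Cluster Map each time would clearly waste resources. RADOS avoids this by distributing incremental maps, each of which contains only the changes to the Cluster Map between two consecutive epochs.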
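The incremental-map idea can be sketched as follows. This is a minimal illustration, not RADOS code: the names ClusterMap, Incremental, and apply_incrementals are hypothetical, and the map is reduced to a dictionary of OSD states. The assumption is that each delta carries the epoch it produces and can only be applied to the map at the immediately preceding epoch.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterMap:
    epoch: int = 0
    # Simplified map contents: OSD id -> state string such as "up" or "down"
    osd_state: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Incremental:
    epoch: int      # the epoch this delta produces
    changes: tuple  # ((osd_id, new_state), ...)


def apply_incrementals(cmap, incs):
    """Bring cmap forward by applying consecutive deltas in epoch order."""
    for inc in sorted(incs, key=lambda i: i.epoch):
        if inc.epoch <= cmap.epoch:
            continue  # this delta is already reflected in the map
        if inc.epoch != cmap.epoch + 1:
            raise ValueError(f"missing delta for epoch {cmap.epoch + 1}")
        for osd_id, state in inc.changes:
            cmap.osd_state[osd_id] = state
        cmap.epoch = inc.epoch
    return cmap
```

A node at epoch 3 that receives the deltas for epochs 4 and 5 ends up at epoch 5 without ever receiving a full map; a gap in the sequence is reported instead of being silently skipped.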
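2. Data Placement
Data migration: when a new storage device joins, a randomly selected portion of the data in the cluster is migrated to the new device, keeping the existing storage balanced.
Data distribution: the storage location of an object is computed in two stages, as shown in Figure 2.

Figure 2: Data distribution
1. Mapping objects to PGs. A PG (Placement Group) is a logical collection of objects. Objects in the same PG are distributed to the same set of OSDs. The PG for an object is obtained by hashing the object's name and combining the result with a few other adjusting parameters.
2. Mapping PGs to OSDs. RADOS assigns PGs to OSDs according to the Cluster Map, and that set of OSDs is where the objects in the PG are stored. RADOS uses the CRUSH algorithm, a stable, pseudo-random hashing algorithm that produces a balanced, capacity-aware distribution of data. The set of OSDs returned by CRUSH is not yet the final storage target; it first passes through a filter. In a large distributed cluster some nodes may have failed because of crashes or other causes, and the filter removes those nodes. If the targets that remain after filtering are not usable, the current operation blocks.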
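The two stages and the filter can be sketched as below. This is not CRUSH: stage 2 uses weighted rendezvous hashing as a stand-in, chosen only because it is also stable, pseudo-random, and capacity-aware. All function names, the SHA-256 based hash, and the flat weights dictionary are assumptions made for this illustration.

```python
import hashlib
import math


def stable_hash(*parts):
    """Deterministic 64-bit hash (the built-in hash() is salted per process)."""
    data = "/".join(str(p) for p in parts).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def object_to_pg(name, pg_count):
    """Stage 1: hash the object name into one of pg_count placement groups."""
    return stable_hash("obj", name) % pg_count


def pg_to_osds(pg, weights, replicas):
    """Stage 2 (stand-in for CRUSH): weighted rendezvous hashing.

    weights maps OSD id -> relative capacity; heavier OSDs win more PGs.
    """
    def score(osd):
        # Map the hash to a float strictly inside (0, 1)
        u = ((stable_hash("pg", pg, osd) >> 12) + 0.5) / 2**52
        return -weights[osd] / math.log(u)
    return sorted(weights, key=score, reverse=True)[:replicas]


def filter_live(osds, up):
    """Drop OSDs that are not up; an empty result means the I/O must wait."""
    live = [o for o in osds if o in up]
    if not live:
        raise RuntimeError("no live OSD for this PG; the operation would block")
    return live


weights = {0: 1.0, 1: 1.0, 2: 2.0, 3: 1.0}
pg = object_to_pg("my-object", pg_count=128)
targets = filter_live(pg_to_osds(pg, weights, replicas=3), up={0, 1, 2, 3})
```

Because every node computes the same result from the same inputs, no lookup table is needed. In this stand-in, adding an OSD only moves the PGs for which the new OSD scores highest, which mirrors the migration behaviour described under "Data migration"; real CRUSH additionally takes the cluster hierarchy into account.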
3. Device State
Table 1 describes the device states recorded in the Cluster Map.
Table 1: Device State descriptions
| State | Meaning | in (assigned PGs) | out (not assigned PGs) |
| up | online & reachable | active | online & idle |
| down | unreachable | unreachable & not remapped | failed |
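Table 1 can be read as a lookup on two independent flags: whether the device is up or down, and whether it is in or out of the data mapping. The small sketch below encodes exactly the four cells of the table; the function name and the boolean parameters are hypothetical.

```python
def describe_device(up: bool, in_mapping: bool) -> str:
    """Return the Table 1 description for an (up/down, in/out) combination."""
    table = {
        (True, True): "active",
        (True, False): "online & idle",
        (False, True): "unreachable & not remapped",
        (False, False): "failed",
    }
    return table[(up, in_mapping)]
```

For example, a device that is down but still in keeps its PG assignments, so its data has not yet been remapped elsewhere.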
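4. Map Propagation
Cluster Map updates propagate between OSDs preemptively. A difference in Cluster Map epoch matters only between two entities that communicate with each other; before exchanging information, the two entities exchange epochs so that their Cluster Maps stay in sync. Because of this property, the updates passed between communicating entities share the load of distributing the Cluster Map across the whole cluster.
Every OSD caches the most recent Cluster Map together with all incremental maps up to the present, embeds incremental map information in every message it sends, and tracks the Cluster Map epoch of each peer it communicates with. When an OSD receives a message and finds that the peer's epoch is out of date, it shares the incremental maps the peer is missing, so that both peers stay in sync. Likewise, when an OSD receives a message and finds that its own epoch is out of date, it extracts the incremental maps it is missing from those embedded in the message and applies them.
An epoch difference between two OSDs that do not communicate with each other does not affect synchronization.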
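The peer-to-peer exchange described above can be sketched as follows. This is a simplified model under several assumptions: every OSD starts at epoch 0, a map is just a dictionary, a "message" is a direct method call, and an OSD that knows nothing about a peer assumes the peer is at epoch 0. The class and method names are hypothetical.

```python
class Osd:
    def __init__(self, name):
        self.name = name
        self.epoch = 0
        self.state = {}       # simplified map contents
        self.history = {}     # epoch -> delta that produced that epoch
        self.peer_epoch = {}  # last epoch known to be present at each peer

    def apply(self, deltas):
        for epoch, delta in deltas:
            if epoch == self.epoch + 1:  # only the next delta in sequence applies
                self.state.update(delta)
                self.history[epoch] = delta
                self.epoch = epoch

    def deltas_since(self, epoch):
        return [(e, self.history[e]) for e in range(epoch + 1, self.epoch + 1)]

    def send(self, peer):
        # Tag the message with our epoch and include the deltas the peer may lack
        believed = self.peer_epoch.get(peer.name, 0)
        peer.receive(self, self.epoch, self.deltas_since(believed))
        self.peer_epoch[peer.name] = max(believed, self.epoch)

    def receive(self, sender, sender_epoch, deltas):
        self.peer_epoch[sender.name] = sender_epoch
        if sender_epoch > self.epoch:
            self.apply(deltas)  # local map is stale: catch up from the message
        elif sender_epoch < self.epoch:
            # Sender is stale: share what it is missing
            sender.apply(self.deltas_since(sender_epoch))
            self.peer_epoch[sender.name] = sender.epoch


a, b, c = Osd("a"), Osd("b"), Osd("c")
a.apply([(1, {3: "down"}), (2, {4: "up"})])  # a learns two updates first
a.send(b)  # b is stale and catches up from a's message
c.send(a)  # c is stale; a notices and shares the missing deltas
```

After the two exchanges all three OSDs are at epoch 2, although only one of them ever heard about the updates directly. OSDs that never talk to each other may stay at different epochs without harm, as noted above.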
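3.2 Intelligent Storage
1. Replication
RADOS implements three replication schemes, shown in Figure 3:

Figure 3: The three replication schemes implemented by RADOS
Primary-copy: reads and writes are both handled by the primary OSD, which updates the replicas in parallel;
Chain: reads and writes pass along a chain of OSDs, with reads separated from writes;
Splay: a compromise between primary-copy and chain, combining parallel replica updates with the separation of reads and writes.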
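The differences between the three schemes can be summarized in a small lookup. The sketch assumes, as is usual for chain replication, that writes enter at the first OSD in the PG's ordered list and that "separated" reads are served by the last one; the function name and the dictionary keys are hypothetical.

```python
def replication_plan(scheme, osds):
    """Describe where I/O goes for an ordered list of OSDs holding one PG."""
    head, tail = osds[0], osds[-1]
    if scheme == "primary-copy":
        return {"write_to": head, "read_from": head, "replica_updates": "parallel"}
    if scheme == "chain":
        return {"write_to": head, "read_from": tail, "replica_updates": "sequential"}
    if scheme == "splay":
        return {"write_to": head, "read_from": tail, "replica_updates": "parallel"}
    raise ValueError(f"unknown scheme: {scheme}")
```

Seen this way, splay takes its update pattern from primary-copy and its read target from chain.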
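2. Consistency
Consistency has two main aspects, updates and reads:
- Update: every message in RADOS embeds the sender's map epoch, which is used to keep the cluster consistent.
- Read: an OSD may fail, so that data can no longer be read from it and reads must be redirected to a new OSD, while the initiator of the read has not yet learned of the failure. To avoid this problem, the OSDs that hold the same PG exchange heartbeats in real time.
3. Failure Detection
Failure detection: RADOS uses asynchronous, ordered, point-to-point heartbeats. (The failure detection described here is performed by the OSDs themselves.)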
4. Data Migration & Failure Recovery
Cluster Map updates caused by device failure, cluster expansion, or failure recovery change the mapping from PGs to OSDs. Whenever the Cluster Map changes, the data on the affected OSDs must be adjusted accordingly.
Both data migration and data recovery are coordinated by the primary OSD.
(Details of the data migration and failure recovery methods are to be continued.)
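3.3 Monitors
The Monitors hold the master copy of the Cluster Map; every copy held elsewhere was originally obtained by requesting it from the Monitors. The Monitors manage the storage cluster by periodically updating the Cluster Map.
A Monitor works in two phases:
1. First, a Leader is elected among the Monitors. The Leader then requests the map epoch from all Monitors. The Monitors periodically report back to the Leader, which also tells it that they are alive (Active Monitors), and the Leader tallies the quorum. The purpose of this phase is to ensure that every Monitor has the latest map epoch; out-of-date Cluster Maps are brought up to date through incremental updates.
2. The Leader periodically grants each Active Monitor a lease that permits it to serve copies of the Cluster Map to OSDs and clients. If a lease expires and the Leader has not renewed it, the Leader is presumed dead, and the Monitors return to the first phase to elect a new Leader. If an Active Monitor does not periodically acknowledge (ACK) the Leader, that Monitor is presumed dead, and the Monitors return to the first phase to elect a Leader and update the quorum. The periodic leases from the Leader and the periodic ACKs from the Active Monitors also serve to synchronize the Monitors' Cluster Maps. When an Active Monitor receives an update request, it first checks whether its current epoch is the latest; if not, it updates and then reports to the Leader. The Leader distributes the update to all Monitors and revokes the leases, and a new cycle begins, from Leader election through to Cluster Map service.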
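The lease and ACK exchange in the second phase can be sketched as below. This is only an illustration of the timing logic, not the Monitors' actual protocol: the class ActiveMonitor, the function renew_round, the explicit "now" clock, and the "reachable" set used to simulate a dead Monitor are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class ActiveMonitor:
    name: str
    lease_expiry: float = 0.0

    def grant(self, now, duration):
        """Accept a lease from the Leader and return an ACK (our name)."""
        self.lease_expiry = now + duration
        return self.name

    def can_serve(self, now):
        """Cluster Map copies may be served only while the lease is valid."""
        return now < self.lease_expiry


def renew_round(monitors, now, duration, reachable):
    """Leader side: renew every lease and return the Monitors that did not ACK."""
    acked = set()
    for m in monitors:
        if m.name in reachable:  # an unreachable Monitor never sees the lease
            acked.add(m.grant(now, duration))
    return sorted(m.name for m in monitors if m.name not in acked)


mons = [ActiveMonitor("a"), ActiveMonitor("b")]
missing = renew_round(mons, now=0.0, duration=5.0, reachable={"a"})
```

In this run Monitor "b" never acknowledges, so the Leader would treat it as dead and return to the first phase. Symmetrically, if no renewal reaches "a" before time 5.0, can_serve turns false and "a" would presume the Leader dead.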
The load on the Monitors is normally small: Cluster Map updates on the OSDs are propagated by the OSDs among themselves, and OSD state changes are rare enough that they do not affect Monitor load. Some special situations can still put pressure on the Monitors. For example, if n OSDs fail at the same time and each OSD stores m PGs, m × n failure reports arrive at the Monitors, which is a large volume for a large cluster. To limit this load, the OSDs stagger their failure detection at pseudo-random intervals before reporting upward (this is the detection path from OSDs to Monitors). In addition, because the Monitors work in parallel and balance load among themselves, adding Monitors is another way to relieve the pressure.
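To make the m × n figure concrete: with n = 10 failed OSDs that each store m = 100 PGs, up to 1000 failure reports could reach the Monitors at once. The sketch below computes that bound and shows one simple way to stagger checks with pseudo-random intervals. The function names, the base-plus-jitter interval, and the fixed seed are assumptions for illustration; the text above does not specify how RADOS chooses its intervals.

```python
import random


def failure_reports(n_failed_osds, pgs_per_osd):
    """Worst-case report count if every PG on every failed OSD triggers one."""
    return n_failed_osds * pgs_per_osd


def staggered_intervals(osd_ids, base, jitter, seed=0):
    """Give each OSD its own check interval in [base - jitter, base + jitter]."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return {osd: base + rng.uniform(-jitter, jitter) for osd in osd_ids}


worst_case = failure_reports(10, 100)
intervals = staggered_intervals(range(50), base=2.0, jitter=0.5)
```

Because the intervals differ from one OSD to the next, detections of the same failure, and the reports that follow, are spread over time instead of arriving in a single burst.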
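4 Summary
Compared with traditional distributed data stores, the most distinctive features of RADOS are:
1. After files are mapped to objects, the location of file data on the storage devices is computed with CRUSH from the Cluster Map instead of being found in a lookup table. This removes the traditional file-to-block mapping and block-map management.
2. RADOS makes full use of the intelligence of the OSDs by delegating some tasks to them, which maximizes scalability.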
Ceph's OSD nodes store data autonomously, so the cluster should be able to scale without an upper limit; MooseFS, by comparison, is said to be limited to 75 data nodes.
from: http://www.tbdata.org/archives/1589