Big Talk on Storage, The Sequel: Next-Generation Data Storage Thinking and Technology (大话存储后传)
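The book is organized into the following major chapters: flexible data layout; application awareness and visualized storage intelligence; storage chips; gleanings from the sea of storage; clusters and multi-controller systems; traditional storage systems; emerging storage systems; optical storage systems; architecture; the I/O protocol stack and performance analysis; storage software; and solid-state storage. Each chapter contains multiple sections, and each section is a self-contained topic.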
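Donggua Ge's pursuit of technology has reached the point of obsession. Compared with 10 years ago, his explanations are more on point and his technical understanding is more precise. Every article on his public account is a bellwether for the storage industry.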
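Donggua Ge (Zhang Dong) is currently a system architect at a semiconductor company and the author of the Big Talk on Storage (大话存储) book series. He is a technical expert and evangelist in the storage field.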
Chapter 1 Flexible Data Layout
1.1 Raid1.0 and Raid1.5
1.2 Raid5EE and Raid2.0
1.3 Lun2.0/SmartMotion
Chapter 2 Application Awareness and Visualized Storage Intelligence
2.1 Application-Aware Fine-Grained Automatic Storage Tiering
2.2 Application-Aware Fine-Grained SmartMotion
2.3 Application-Aware Fine-Grained QoS
2.4 Productization and Visualization
2.5 Packaging the Concept into a PPT
2.6 A Review of Inspur's "Active" Storage Concept
Chapter 3 Storage Chips
3.1 Channel and Raid Controller Architecture
3.2 SAS Expander Architecture
Chapter 4 Gleanings from the Sea of Storage
4.1 Two Classy Kinds of Storage You Would Never Have Imagined
4.2 What Is Inside a JBOD
4.3 The Woes of the Raid4 Parity Disk
4.4 Why a Raid Card Is Said to Be a Small Computer
4.5 Why Raid Card Batteries Were Replaced by Supercapacitors
4.6 What Exactly Is the Difference Between Firmware and Microcode
4.7 Is There Really a Loop Inside an FC Loop Hub
4.8 Why SAS and FC Are Said to Cost Less CPU than TCP/IP plus Ethernet
4.9 What Traffic Runs over the Heartbeat Links Between Dual Controllers
Chapter 5 Clusters and Multi-Controller Systems
5.1 A Brief Talk on Active-Active and Multipathing
5.2 A "Brief" Talk on Disaster Recovery and Active-Active Data Centers (Part 1)
5.3 A "Brief" Talk on Disaster Recovery and Active-Active Data Centers (Part 2)
5.4 An In-Depth Illustrated Walkthrough of the Evolution of Cluster File System Architectures
5.5 From Multi-Controller Cache Management to Cluster Locks
5.6 On Shared and Distributed Architectures
5.7 "Donggua Ge Draws a PPT": Active-Active Is a Trap
Chapter 6 Traditional Storage Systems
6.1 Sharing Some Basic Topics Related to Storage Systems
6.2 Chronicles of the High-End Storage World!
6.3 Amazing! So This Is How High-End Storage Architecture Evolved!
6.4 Traditional High-End Storage Centralizes and Externalizes the Data Cache: Three Birds with One Stone
6.5 Traditional External Storage Is Nearing Its Twilight
6.6 Storage Veterans versus Fresh Faces
6.7 Traditional Storage Is Aging; Can Emerging Storage Take Up the Mantle?
Chapter 7 Next-Generation Storage Systems
7.1 An Old Gun Can Still Handle Next-Generation Storage Systems
7.2 The Next-Generation Storage System with the Most Traditional-Storage Flavor
7.3 The Next-Generation Storage System Best Suited to Large-Scale Data Centers
7.4 The Highest-Performance Next-Generation Storage System
7.5 The Most Application-Aware Next-Generation Storage System
7.6 The Next-Generation Storage System with the Most Flexible Data Management
Chapter 8 Optical Storage Systems
8.1 Basic Principles of Optical Storage
8.2 The Mysterious Laser Head and Blu-ray Technology
8.3 Dissecting Blu-ray Storage Systems
8.4 The Optical Storage Ecosystem
8.5 Looking at the Present from the Future
Chapter 9 Architecture
9.1 Big Talk on Many-Core Processor Architecture
9.2 A Salute to Loongson! Donggua Ge Hand-Designed a CPU Decoder!
9.3 NUMA Architecture Lands for the First Time in the InCloudRack Cabinet
9.4 A Review of MacroSAN's CloudSAN Architecture
9.5 Memory Can Be Used Like This?!
9.6 PCIe Switching: What on Earth Is That?
9.7 A Chat about FPGA/GPCPU/PCIe/Cache-Coherency
9.8 [Primer] How Does a Supercomputer Actually Compute?
Chapter 10 I/O Protocol Stack and Performance Analysis
10.1 The Most Complete Summary of Storage System Interfaces, Protocols, and Connection Methods
10.2 Research Trends in Cutting-Edge I/O Protocol Stack Technology
10.3 What Stripe Size Should a Raid Group Actually Use?
10.4 Concurrent I/O: The Foundation of System Performance!
10.5 How Long Have You Been Fooled about I/O Latency?
10.6 How Do You Measure Concurrency along the Entire I/O Path?
10.7 What Exactly Is the Relationship among Queue Depth, Latency, Concurrency, and Throughput
10.8 Why Does Raid Provide No Speedup in Some Scenarios?
10.9 Why Is Performance Excellent in Testing but Dismal in Production?
10.10 What Is the Effect of a Queue Depth That Is Too Shallow?
10.11 What Queue Depth Is Ideal?
10.12 Why Does the Average Random I/O Latency of Mechanical Disks Show a Transient Drop?
10.13 How Exactly Does Data Layout Affect Performance?
10.14 Misconceptions about Synchronous I/O and Blocking I/O
10.15 Atomic Writes: What on Earth Are They?!
10.16 Why Not Build a USB Target?
10.17 A New Storage Technology Patent by Donggua Ge Has Been Officially Granted
10.18 A Quick Rundown of the iSCSI Lower Layers
10.19 A Brief Analysis of the Four FC Login Steps
Chapter 11 Storage Software
11.1 Thin Is a Trap; Whoever Uses It Is Asking for Trouble!
11.2 The Evolution of Storage System Operating Systems
Chapter 12 Solid-State Storage
12.1 A Brief Analysis of How Solid-State Media Are Used in Storage Systems
12.2 Misconceptions about SSD Metadata and Power-Loss Protection
12.3 Misconceptions about Host Based and Device Based Flash FTL
12.4 On SSD HMB and CMB
12.5 Toyou Feiji Spreads Its Wings and Returns
12.6 Crosstalk with Lao Tang: The "Jade" of SSD Performance Testing
12.7 How Should Raid Actually Be Done on Solid-State Drives?
12.8 When Raid2.0 Meets All-Solid-State Storage
12.9 Upper/Lower Pages, Fast/Slow Pages, MSB/LSB: What Are They All?
12.10 On the Steps for Writing 0 to MSB/LSB
1.1 Raid1.0 and Raid1.5
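In the era of mechanical disks, final I/O performance comes down to two root factors. One is the source at the top: how the application issues I/O and what the attributes of that I/O are. The other is the source at the bottom: in what form and state the data is ultimately laid out, and across how many mechanical disks. How an application issues I/O is entirely outside the storage system's control, so there is nothing the storage system can do to solve performance problems from that end. How the data is organized and laid out, however, is absolutely the most important job of a storage system.

This has been evolving ever since Raid was born. Take the simplest example, the progression from Raid3 to Raid4 to Raid5. Raid3 was designed to maximize the throughput of single-threaded, large-block I/O to contiguous addresses. To achieve this, its stripe is very narrow, so narrow that the target address range of nearly every I/O sent down from the upper layer falls across all the disks. Almost every I/O therefore has multiple disks reading or writing in parallel to serve it, while other I/Os have to wait. That is why we say that under a Raid3 array, upper-layer I/Os cannot run concurrently with one another, but a single I/O can be served by many disks concurrently. So if the system has only one thread (or user, program, or workload), and that thread issues large-block, contiguous-address I/O and cares about throughput, Raid3 is a very good fit.

Most workloads are not like that, though. They want upper-layer I/Os to execute fully in parallel, for example so that I/Os from many threads and many users are served concurrently. In that case the stripe has to be enlarged to a suitable value, so that the address range of one I/O does not drag every disk in the Raid group into serving it. There is then some probability that the group of disks can respond to multiple I/Os at once, and the more disks there are, the higher that probability. Raid4 is roughly Raid3 with an adjustable stripe, but its dedicated parity disk not only becomes a hot spot with a high failure rate, it also throttles I/Os that could otherwise run concurrently. Every write requires the parity block of the corresponding stripe to be updated, and because all parity blocks live on that one disk, upper-layer writes can only execute one after another, not concurrently. Raid5 scatters the parity blocks across all the disks in the Raid group and thereby achieves concurrent I/O.

Most storage vendors let you configure the stripe width, for example from 32KB to 128KB. Suppose an I/O request reads 16KB from a Raid5 group built from 8 disks. If the stripe is 32KB, the segment on each disk is 4KB, and this I/O occupies at least 4 disks. Assuming a 100% concurrency probability, the Raid group can serve two 16KB I/Os concurrently, or eight 4KB I/Os. If the stripe width is raised to 128KB, the segment becomes 16KB, and under the same 100% assumption the group can serve eight I/Os of 16KB or less concurrently.

By now we can see that merely adjusting the stripe width and optimizing the layout of the parity blocks produces very different performance. But no matter how you tune, I/O performance is always capped by the pitifully few disks in a Raid group, a handful or a dozen or so. Why only a handful or a dozen? Can't we build one large Raid5 group out of 100 disks and create all the logical volumes on top of it to boost each volume's performance? You would not choose to do that. The moment one disk fails and the system has to rebuild, you will regret the decision, because overall system performance drops sharply and no logical volume is spared: the other 99 disks are all reading data at full speed while the system computes the XOR to regenerate the lost blocks and writes them to the hot spare. You can of course throttle the rebuild to relieve the I/O performance of online workloads, but the price is a longer rebuild time, and if another disk fails during the rebuild window, all the data is gone. So the failure domain must be kept small, which is why a Raid group is best limited to a handful or a dozen or so disks.

That is rather awkward, so people came up with a workaround: concatenate multiple small Raid5/6 groups into one large Raid0, that is, Raid50/60, and distribute the logical volumes across it. Of course, today's storage vendors have run out of tricks and can no longer come up with anything new, so they have taken to calling this large Raid50/60 a "Pool" to fool some people into believing that storage is being reinvented and is still full of vitality.

Donggua Ge may as well go with the flow and do a bit of spinning here too: let us call the traditional Raid group Raid1.0 and call Raid50/60 Raid1.5. We can actually sense a pattern of cyclical ascent here. In the early days there were few disks, and performance for different scenarios was tuned mainly through the stripe width. Later people figured it out: why not use Raid50 and spread the data directly across hundreds of disks? Wouldn't that be great? Concurrent I/O threads from the upper layer could then be served with massive concurrency at the bottom, reaching very high throughput. At this point people let success go to their heads, and nobody stopped to consider another frightening problem. As these words are being written, still nobody is considering it, at least judging from vendors' product directions. The reason may be another round of evolution at the bottom layer, namely solid-state media. The wheels at the bottom keep speeding up, while the forms at the top go round in cycles; but sometimes the upper layer leaps ahead and skips a form that should have existed along the way. That form may be fleeting, or may never have appeared at all, yet someone will always strike a spark, however faint.

This frightening problem was in fact overshadowed by an even more frightening one: rebuilds take too long. Take a 4TB SATA disk. Even writing at full speed during a rebuild, its rotational speed limits throughput to roughly 80MB/s, so a quick calculation (4TB divided by 80MB/s is about 50,000 seconds) gives about 14h. In practice, to protect the performance of online workloads, rebuilds are usually throttled to a medium speed of around 40MB/s, which takes about 28h, more than a full day and night. I would bet that no system administrator sleeps well until it finishes.

1.2 Raid5EE and Raid2.0

Twenty years ago someone invented a technique called Raid5EE. It had two goals: first, to put the hot spare disk, which normally sits idle, to work; second, to speed up rebuilds. Clearly, if the space of the hot spare disk, marked "H (hot spare)" in the figure below, is scattered across all the disks just like the parity, the result is the layout shown on the right of the figure, where every P block is followed by an H block. The Raid group then has one more disk doing useful work than before. In addition, because the H space is scattered, rebuilds should in theory be faster when a disk fails, since the rebuilt data can now be written to many disks concurrently.

In reality this is not the case. The system's rebuild speed is not limited by the single hot spare disk; it is limited by all the disks together, because the hot spare can only write rebuilt data at full rate if every other disk reads its data out at full rate for the system to XOR. Even if the hot spare is scattered, or even replaced with an SSD or with memory, the outcome does not change at all. So how can rebuilds actually be accelerated? The only way is, as the figure below shows, to scatter horizontally the stripes that were originally crammed into 5 disks. Note that the scattering is done at stripe granularity; scattering a single disk is useless. Only then can the rebuild speed be multiplied.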
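The two rebuild constraints discussed in this section, the XOR dependency on every surviving disk and the time needed to stream a whole disk, can be sketched in a few lines of Python. This is a minimal illustration rather than any vendor's implementation: the function names `xor_blocks`, `rebuild_block`, and `rebuild_hours` are hypothetical, the stripe is a toy single-parity stripe held in memory, and decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) are assumed.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_block(surviving_blocks):
    """Regenerate the one missing block of a single-parity stripe.

    The lost block (data or parity) equals the XOR of every surviving
    block in the same stripe, so each surviving disk has to be read.
    """
    return xor_blocks(surviving_blocks)


def rebuild_hours(capacity_tb, rate_mb_per_s):
    """Hours needed to stream one disk's capacity at a sustained rate.

    Assumption: decimal units, 1 TB = 10**12 bytes and 1 MB = 10**6 bytes.
    """
    seconds = capacity_tb * 10**12 / (rate_mb_per_s * 10**6)
    return seconds / 3600


if __name__ == "__main__":
    # Toy stripe: three data blocks plus one parity block.
    data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
    parity = xor_blocks(data)

    # Pretend the disk holding data[1] has failed.
    survivors = [data[0], data[2], parity]
    print(rebuild_block(survivors) == data[1])

    # Rebuild time at full speed and at a throttled speed.
    print(round(rebuild_hours(4, 80), 1), round(rebuild_hours(4, 40), 1))
```

Because `rebuild_block` needs every surviving block of the stripe, the read side of all the remaining disks bounds the rebuild rate, no matter how fast the spare space can be written. Halving the permitted rate in `rebuild_hours` doubles the time, which is the trade-off behind throttled rebuilds.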