Micron 9200 MAX Reference Architecture Block Performance

Ryan Meredith | April 2018

Why do you only test 2x replication!?!

There are reasons SSD guys like me usually test Ceph with 2x replication: SSDs are more reliable than spinners, performance is better with 2x, and so on. But what if you absolutely need 3x replication minimum? How does that impact performance on our super-fast all-NVMe Ceph reference architecture? I'm glad you asked.

This blog is a quick performance review of our new Ceph reference architecture (RA) on the Intel® Purley platform, featuring our fastest NVMe drive, the Micron 9200 MAX (6.4TB).

Our new reference architecture uses Red Hat Ceph Storage 3.0, which is based on Ceph Luminous (12.2.1). Testing in the RA is limited to FileStore performance since that is the currently supported storage engine for RHCS 3.0.

Performance is impacted exactly as one would expect when comparing 2x replication to 3x. 4KB random write IOPS decrease by about 35%, reads stay exactly the same, and 70/30 IOPS decrease by around 25%.

| Block Workload | 2x Replication IOPS | 3x Replication IOPS | 2x Replication Average Latency | 3x Replication Average Latency |
| --- | --- | --- | --- | --- |
| 4KB Random Read | 2 Million | 2 Million | 1.6 ms | 1.6 ms |
| 4KB Random Write | 363,000 | 237,000 | 5.3 ms | 8.1 ms |
| 4KB 70/30 R/W | 781,000 | 577,000 | 1.4 ms read / 3.5 ms write | 1.7 ms read / 5.4 ms write |


This solution is optimized for block performance. Random small-block testing using the RADOS Block Driver in Linux saturates the Intel Xeon Platinum 8168 (Purley) processors in a 2-socket storage node.

With 10 drives per storage node, this architecture has a usable storage capacity of 232TB that can be scaled out by adding additional 1U storage nodes.

Reference Design - Hardware

[Figure: reference design hardware and network switch configuration]

Test Results and Analysis

Ceph Test Methodology

Ceph is configured using FileStore with 2 Object Storage Daemons (OSDs) per Micron 9200 MAX NVMe SSD. A 20GB journal was used for each OSD. With 10 drives per storage node and 2 OSDs per drive, Ceph has 80 total OSDs with 232TB of usable capacity.
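As a quick sanity check, the capacity figure lines up with the drive count implied by this paragraph. The sketch below (in Python, purely illustrative) assumes four storage nodes, which follows from 80 OSDs at 2 OSDs per drive and 10 drives per node, and reads the 232TB figure as raw drive capacity expressed in binary (TiB-style) units:

```python
# Rough capacity check for the configuration described above (illustrative).
# Assumption: the "232TB usable" figure is raw drive capacity expressed in
# binary (TiB-style) units, before replication overhead.

osds_total = 80
osds_per_drive = 2
drives_per_node = 10
drive_capacity_tb = 6.4          # vendor (decimal) terabytes, 10**12 bytes

drives_total = osds_total // osds_per_drive     # 40 drives
nodes = drives_total // drives_per_node         # 4 storage nodes
raw_bytes = drives_total * drive_capacity_tb * 10**12

usable_tib = raw_bytes / 2**40                  # ~232.8 TiB
print(f"{nodes} nodes, {drives_total} drives, ~{usable_tib:.0f} TiB raw")
```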

The Ceph pools tested were created with 8192 placement groups. The 2x replicated pool in Red Hat Ceph 3.0 is tested with 100 RBD images at 75GB each, providing 7.5TB of data on a 2x replicated pool (15TB of total data).

The 3x replicated pool in Red Hat Ceph 3.0 is tested with 100 RBD images at 50GB each, providing 5TB of data on a 3x replicated pool (15TB of total data).
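For readers who want to set up a comparable test bed, a minimal sketch of the pool and image creation is shown below. It simply drives the standard ceph/rbd CLI from Python; the pool and image names are hypothetical, not the ones used in the reference architecture:

```python
import subprocess

def run(cmd):
    """Run a Ceph/RBD CLI command, echoing it and failing on error."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

PG_NUM = "8192"

# 2x replicated pool with 100 x 75GB RBD images (names are hypothetical).
run(["ceph", "osd", "pool", "create", "rbd-rep2", PG_NUM, PG_NUM, "replicated"])
run(["ceph", "osd", "pool", "set", "rbd-rep2", "size", "2"])
run(["ceph", "osd", "pool", "application", "enable", "rbd-rep2", "rbd"])
for i in range(100):
    run(["rbd", "create", f"rbd-rep2/image-{i:03d}", "--size", "75G"])

# 3x replicated pool with 100 x 50GB RBD images.
run(["ceph", "osd", "pool", "create", "rbd-rep3", PG_NUM, PG_NUM, "replicated"])
run(["ceph", "osd", "pool", "set", "rbd-rep3", "size", "3"])
run(["ceph", "osd", "pool", "application", "enable", "rbd-rep3", "rbd"])
for i in range(100):
    run(["rbd", "create", f"rbd-rep3/image-{i:03d}", "--size", "50G"])
```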

4KB random block performance was measured using the FIO synthetic load generation tool against the RADOS Block Driver.
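Below is a hedged sketch of what one such FIO job might look like using FIO's librbd ("rbd") ioengine, again driven from Python. The queue depth, runtime, and pool/image names are placeholder values, not the exact settings from the reference architecture:

```python
import subprocess

def fio_rbd(pool, image, rw, runtime_s=600, iodepth=32):
    """Run one 4KB random FIO job against an RBD image via librbd."""
    cmd = [
        "fio",
        "--name", f"{rw}-{image}",
        "--ioengine=rbd",          # librbd engine; no kernel mapping required
        "--clientname=admin",      # Ceph client keyring name
        f"--pool={pool}",
        f"--rbdname={image}",
        f"--rw={rw}",              # randread or randwrite
        "--bs=4k",
        "--direct=1",
        f"--iodepth={iodepth}",
        "--time_based",
        f"--runtime={runtime_s}",
        "--group_reporting",
    ]
    subprocess.run(cmd, check=True)

# Example: 4KB random read and write against one image in the 2x pool
# (pool and image names are hypothetical).
fio_rbd("rbd-rep2", "image-000", "randread")
fio_rbd("rbd-rep2", "image-000", "randwrite")
```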

RBD FIO 4KB Random Read Performance

4KB Random read performance is essentially identical between a 2x and 3x replicated pool.

[Chart: RBD FIO 4KB random read IOPS and average latency, 2x vs. 3x replication]

RBD FIO 4KB Random Write Performance

With 3x replication, IOPS performance is reduced by ~35% compared to a 2x replicated pool. Average latency increases by a similar margin.

[Chart: RBD FIO 4KB random write IOPS and average latency, 2x vs. 3x replication]

4KB write performance hits an optimal mix of IOPS and latency at 60 FIO clients: 363k IOPS at 5.3 ms average latency on a 2x replicated pool, and 237k IOPS at 8.1 ms average latency on a 3x replicated pool. At this point, the average CPU utilization on the Ceph storage nodes is over 90%, limiting performance.

RBD FIO 4KB Random 70% Read / 30% Write Performance

The 70/30 random R/W workload IOPS decrease by 25% when going from a 2x replicated pool to a 3x replicated pool. Read latencies are close, slightly higher on the 3x replicated pool. Write latencies are more than 50% higher on the 3x replicated pool.
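For completeness, the 70/30 workload is just a variation of the 4KB random job sketched earlier, using FIO's randrw mode with a 70% read mix; the values below are again placeholders rather than the exact test settings:

```python
import subprocess

# 70/30 mixed workload: same 4KB random profile, but FIO's randrw mode
# with a 70% read mix (pool/image names and settings are placeholders).
subprocess.run([
    "fio", "--name", "rw70-image-000",
    "--ioengine=rbd", "--clientname=admin",
    "--pool=rbd-rep3", "--rbdname=image-000",
    "--rw=randrw", "--rwmixread=70",
    "--bs=4k", "--direct=1", "--iodepth=32",
    "--time_based", "--runtime=600", "--group_reporting",
], check=True)
```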

[Chart: RBD FIO 4KB 70/30 random read/write IOPS and average latency, 2x vs. 3x replication]

Want to know more?

RHCS 3.0 + the Micron 9200 MAX NVMe SSD on the Intel Purley platform is super fast. See the newly published Micron / Red Hat / Supermicro Reference Architecture. I will present our RA and other Ceph tuning and performance topics during my session at OpenStack Summit 2018. More to come. Stay tuned!

Have additional questions about our testing or methodology? Leave a comment below or email us at ssd@shopping-wonder.com.

Director, Storage Solutions Architecture

Ryan Meredith

Ryan Meredith is director of Data Center Workload Engineering for Micron's Storage Business Unit, testing new technologies to help build Micron's thought leadership and awareness in fields like AI and NVMe-oF/TCP, along with all-flash software-defined storage technologies.