Merge lp://qastaging/~syuu1228/eucalyptus/rados4eucalyptus-devel into lp://qastaging/eucalyptus

Proposed by Takuya ASADA
Status: Needs review
Proposed branch: lp://qastaging/~syuu1228/eucalyptus/rados4eucalyptus-devel
Merge into: lp://qastaging/eucalyptus
Diff against target: 16619 lines (has conflicts)
Text conflict in clc/build.xml
Contents conflict in clc/modules/database/conf/scripts/caches.groovy
To merge this branch: bzr merge lp://qastaging/~syuu1228/eucalyptus/rados4eucalyptus-devel
Reviewer: Neil Soman (Pending)
Review via email: mp+40838@code.qastaging.launchpad.net

Description of the change

Hi,

I'm trying to integrate a distributed storage system named "RADOS" into Eucalyptus for both the S3 service and the EBS service, to make them scalable.

RADOS is part of the Ceph filesystem and provides a distributed object store.
It offers a C/C++ API, an S3-compatible server, and an EBS-compatible block device (rbd) for qemu/kvm.
Details are described on this page:
http://ceph.newdream.net/2009/05/the-rados-distributed-object-store/

This patch is a minimal implementation of the work: it implements SystemStorageManager, FileIO, and ChunkedDataFile for RADOS.
This provides a RADOS-based Walrus backend; when it is enabled, Walrus stores objects in a RADOS cluster instead of the local file system.

The existing local file system module has been moved from storage-common to storage-fs, and a new RADOS module has been added as storage-rados.
The backend is selected via the configure option --with-rados=[rados home]: if it is specified, storage-rados is installed; otherwise storage-fs is installed.
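
To give an idea of the shape of the backend, here is a minimal sketch of a RADOS-backed storage class. All names here (RadosClient, StorageBackend, RadosStorageBackend) are hypothetical stand-ins for illustration, not the actual SystemStorageManager/FileIO classes in this branch.

import java.io.IOException;

/** Minimal view of the librados operations the backend needs (hypothetical JNI wrapper). */
interface RadosClient {
    void write(String oid, byte[] data) throws IOException;
    byte[] read(String oid) throws IOException;
}

/** Minimal view of the Walrus storage backend contract (stand-in for SystemStorageManager). */
interface StorageBackend {
    void putObject(String bucket, String key, byte[] data) throws IOException;
    byte[] getObject(String bucket, String key) throws IOException;
}

/** Stores objects in a RADOS pool instead of the local file system. */
class RadosStorageBackend implements StorageBackend {
    private final RadosClient rados;

    RadosStorageBackend(RadosClient rados) {
        this.rados = rados;
    }

    @Override
    public void putObject(String bucket, String key, byte[] data) throws IOException {
        // Map (bucket, key) onto a single RADOS object name.
        rados.write(bucket + "/" + key, data);
    }

    @Override
    public byte[] getObject(String bucket, String key) throws IOException {
        return rados.read(bucket + "/" + key);
    }
}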

This patch does not include the following features; they will be submitted as separate patches:
- Zero-copy on JNI
  Currently the implementation copies buffers between librados and the Java code; we need to avoid that copy to get better performance (a rough JNI sketch follows this list).
  Making it zero-copy requires modifying the Ceph implementation as well, not only my code.
  I'm discussing it on the Ceph mailing list now, but it will take some more time.
- Multiple Walrus support for CLC
  We need to support multiple Walrus instances to make the system scalable.
  I have a "quick hack" implementation of multiple Walrus support, but I need to clean it up before posting the patch.
- RADOS-based EBS (rbd) support
  I have only worked on Walrus so far, but I would also like to work on rbd support for the Storage Controller.
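
On the zero-copy point: with a heap byte[], the JNI layer has to copy data between librados and the Java heap, whereas a direct ByteBuffer lets native code fill memory that Java can read without an extra copy. The sketch below only illustrates that idea under this assumption; the class and native methods shown are hypothetical and not part of this patch.

import java.nio.ByteBuffer;

/** Hypothetical JNI binding, shown only to illustrate the zero-copy idea. */
class RadosNative {
    // Current approach: native code must copy from librados' buffer into the Java byte[].
    static native int readCopy(String oid, byte[] dst, int len);

    // Zero-copy idea: native code gets the buffer address via JNI's
    // GetDirectBufferAddress() and lets librados fill it directly.
    static native int readDirect(String oid, ByteBuffer dst, int len);

    static ByteBuffer allocateReadBuffer(int chunkSize) {
        // Direct buffers live outside the Java heap, so their memory can be
        // handed to native code without an intermediate copy.
        return ByteBuffer.allocateDirect(chunkSize);
    }
}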

*Performance test 1: chunk size
On storage-rados, the default chunk size (8KB) in StorageManager.sendObject() is too small.
Here is the throughput with the default chunk size:
- storage-fs: 33.34MB/s
- storage-rados: 4.81MB/s

And here is the read throughput on storage-rados as the chunk size is increased from 8KB to 80MB:
8KB:   4.81MB/s
80KB:  13.37MB/s
800KB: 29.52MB/s
1MB:   31.58MB/s
2MB:   33.54MB/s
3MB:   35.19MB/s
4MB:   35.67MB/s
5MB:   37.49MB/s
6MB:   35MB/s
8MB:   33.65MB/s
80MB:  23.85MB/s

So we should change the default chunk size to 5MB on storage-rados.
For comparison, I also measured a 5MB chunk size on storage-fs: 65.60MB/s.
That means increasing the default chunk size makes storage-fs faster as well.
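
To illustrate why the chunk size matters, here is a minimal sketch of a chunked send loop of the kind sendObject() performs, assuming a hypothetical offset-based reader interface (this is not the actual Walrus code). Each iteration is one read round trip, so a tiny chunk such as 8KB multiplies the per-read overhead.

import java.io.IOException;
import java.io.OutputStream;

/** Sketch of a chunked send loop; the reader interface is a hypothetical stand-in. */
class ChunkedSender {
    /** Reads up to dst.length bytes starting at offset; returns bytes read, or -1 at EOF. */
    interface ObjectReader {
        int read(long offset, byte[] dst) throws IOException;
    }

    // 5MB chunks gave the best throughput in the measurements above;
    // the old default of 8KB makes every read a tiny round trip.
    static final int CHUNK_SIZE = 5 * 1024 * 1024;

    static void sendObject(ObjectReader reader, OutputStream out) throws IOException {
        byte[] chunk = new byte[CHUNK_SIZE];
        long offset = 0;
        int n;
        while ((n = reader.read(offset, chunk)) > 0) {
            out.write(chunk, 0, n);   // stream the chunk to the client
            offset += n;
        }
        out.flush();
    }
}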

The testing environment is as follows:
Throughput was measured with s3cmd over Gigabit Ethernet, on the same segment as Walrus.
The RADOS cluster consists of 9 nodes: 1 monitor node and 8 storage (OSD) nodes.
The test file is a single 1GB file.

[node assignment]
node0: CC/CLC, s3cmd
node1: Walrus, RADOS Monitor
node2: RADOS Storage
node3: RADOS Storage
node4: RADOS Storage
node5: RADOS Storage
node6: RADOS Storage
node7: RADOS Storage
node8: RADOS Storage
node9: RADOS Storage

[node spec]
CPU: Athlon II X4 605e
Memory: 16GB
HDD: SATA 250GB via Areca SATA Host Adapter RAID Controller
OS: Ubuntu Server 10.04

*Performance test 2: scalability
I also measured throughput when issuing multiple read requests concurrently.
(This actually requires the multiple-Walrus implementation, which is not included in this patch, as described earlier.)

The test conditions are as follows:
I compared performance with 1, 2, 4, and 8 Walrus nodes using storage-rados, and also against storage-fs.
The RADOS cluster always has 8 storage nodes, which share hardware with the Walrus nodes.
Read requests are sent from 2 client nodes; each node sends 1, 2, 4, 8, 16, 32, or 64 requests concurrently,
so the overall request counts are 2, 4, 8, 16, 32, 64, and 128.
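
For reference, this is a rough sketch of how the concurrent requests per client node could be driven and how the per-request average is computed. The fetch() call is a placeholder; the actual measurement used a modified s3cmd.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

/** Sketch of a per-node concurrency driver; fetch() is a hypothetical stand-in. */
class ConcurrencyTest {
    interface Fetcher {
        long fetch(String key) throws Exception;   // returns bytes downloaded
    }

    /** Issues `concurrency` downloads at once and returns the average per-request MB/s. */
    static double averageThroughput(Fetcher fetcher, List<String> keys, int concurrency)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        List<Future<Double>> results = new ArrayList<>();
        for (int i = 0; i < concurrency; i++) {
            final String key = keys.get(i % keys.size());
            results.add(pool.submit(() -> {
                long start = System.nanoTime();
                long bytes = fetcher.fetch(key);
                double seconds = (System.nanoTime() - start) / 1e9;
                return (bytes / (1024.0 * 1024.0)) / seconds;   // MB/s for this request
            }));
        }
        double sum = 0;
        for (Future<Double> r : results) sum += r.get();
        pool.shutdown();
        return sum / concurrency;
    }
}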

The graph is at the following URL:
http://cid-35288454e2692e6b.photos.live.com/self.aspx/public/graph.png
The Y-axis is the average throughput per request, in MB/s.
The X-axis is the number of concurrent requests.

From the graph we can see that the system scales as Walrus nodes are added.
Even with 1 Walrus node it is faster than storage-fs, which probably means that an 8-node storage cluster outperforms the local filesystem when multiple requests arrive concurrently.

The testing environment is as follows:
Throughput was measured with a modified s3cmd over Gigabit Ethernet, on the same segment as Walrus, from 2 client nodes.
The RADOS cluster consists of 1 monitor node and 8 storage (OSD) nodes.
The test data is 64 files of 10MB each.

[node assignment]
node0: CC/CLC, s3cmd
node1: Walrus, RADOS Monitor, RADOS Storage
node2: Walrus, RADOS Storage
node3: Walrus, RADOS Storage
node4: Walrus, RADOS Storage
node5: Walrus, RADOS Storage
node6: Walrus, RADOS Storage
node7: Walrus, RADOS Storage
node8: Walrus, RADOS Storage
node9: s3cmd

[node spec]
same as test1

*Documents
Here is a document describing the installation procedure:
http://r4eucalyptus.wikia.com/wiki/Installing_RADOS4Eucalyptus_2.0


Unmerged revisions

1239. By tasada <tasada@7ae02a3f>

storage-rados module implemented. local filesystem classes moved to storage-rados module.

