Ceph Disk write slow on dd oflag=dsync on small block sizes - dd

We have deployed a Ceph cluster running version 12.2.5, using Dell R730xd servers as storage nodes, each with 10 × 7.2k NL-SAS drives as OSDs. We have 3 storage nodes.
We did not configure any RAID and used the drives directly to create the OSDs.
We are using ceph-ansible-stable-3.1 to deploy the ceph cluster.
We have encountered slow performance in a disk write test inside a VM that uses an RBD image.
[root@test-vm-1 vol2_common]# dd if=/dev/zero of=disk-test bs=512 count=1000 oflag=direct ; dd if=/dev/zero of=disk-test bs=512 count=1000 oflag=dsync ; dd if=/dev/zero of=disk-test bs=512 count=1000
1000+0 records in
1000+0 records out
512000 bytes (512 kB) copied, 0.101852 s, 5.0 MB/s
1000+0 records in
1000+0 records out
512000 bytes (512 kB) copied, 21.7985 s, 23.5 kB/s
1000+0 records in
1000+0 records out
512000 bytes (512 kB) copied, 0.00702407 s, 72.9 MB/s
When checking on an OSD node, under the OSD data directory, we identified the same low disk write speeds.
[root@storage01moc ~]# cd /var/lib/ceph/osd/ceph-26
[root@storage01moc ceph-26]# dd if=/dev/zero of=disk-test bs=512 count=1000 oflag=direct ; dd if=/dev/zero of=disk-test bs=512 count=1000 oflag=dsync ; dd if=/dev/zero of=disk-test bs=512 count=1000
1000+0 records in
1000+0 records out
512000 bytes (512 kB) copied, 14.6416 s, 35.0 kB/s
1000+0 records in
1000+0 records out
512000 bytes (512 kB) copied, 9.93967 s, 51.5 kB/s
1000+0 records in
1000+0 records out
512000 bytes (512 kB) copied, 0.00591158 s, 86.6 MB/s
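Note that oflag=dsync makes dd open the file with O_DSYNC, so every 512-byte write must reach stable storage before the next one is issued; the test therefore measures per-I/O sync latency rather than throughput. A roughly equivalent latency test with fio (a sketch; the file name and size here are arbitrary, not from the original test) would be:
fio --name=synclat --filename=disk-test-fio --size=512k --bs=512 \
    --rw=write --ioengine=sync --fdatasync=1
fio reports latency percentiles for the data-sync operations, which makes it easier to see how much each flush costs on these drives.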
We suspect the cause of the issue is that no hardware write caching is available when the individual OSD drives are used without any RAID configuration (such as single-drive RAID 0).
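Whether the drives' own volatile write cache is enabled can also be checked on the OSD nodes. A sketch, where /dev/sdX stands for one of the NL-SAS OSD drives:
smartctl -g wcache /dev/sdX    # reports whether the drive's write cache is enabled
sdparm --get=WCE /dev/sdX      # same information from the SCSI Caching mode page (WCE bit)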
Ceph Configurations
[global]
fsid = ....
mon initial members = ...
mon host = ....
public network = ...
cluster network = ...
mon_pg_warn_max_object_skew=500
[osd]
osd mkfs type = xfs
osd mkfs options xfs = -f -i size=2048
osd mount options xfs = noatime,largeio,inode64,swalloc
osd journal size = 10240
[client]
rbd cache = true
rbd cache writethrough until flush = true
rbd_concurrent_management_ops = 20
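These [client] options only affect the librbd layer in the VM. To check whether the slowness also exists below RBD, a single-threaded small-block write benchmark can be run directly against a pool with rados bench (a sketch; the pool name is a placeholder):
rados bench -p <pool> 10 write -t 1 -b 4096 --no-cleanup   # 10 s of 4 KiB writes, one in flight
rados -p <pool> cleanup                                    # remove the benchmark objects afterwards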
Disk Details
=== START OF INFORMATION SECTION ===
Vendor: TOSHIBA
Product: MG04SCA60EE
Revision: DR07
Compliance: SPC-4
User Capacity: 6,001,175,126,016 bytes [6.00 TB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
Formatted with type 2 protection
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Wed Aug 1 20:59:52 2018 +08
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Disabled or Not Supported
Please let me know: if we shrink (remove) the OSDs, configure RAID 0 on the drives, and recreate the OSDs, will it help increase the disk write speed?
Thanks in advance.

When we configured each OSD drive as a single-drive RAID 0 on the storage controller, the slow disk write issue was resolved.
The reason for the slowness was identified as the RAID controller's write cache not being used for drives that are not configured with any RAID level.
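For reference, on the Dell PERC controllers in R730xd nodes these single-drive RAID 0 virtual disks can be created with perccli. This is only a sketch: the controller index and enclosure:slot IDs are placeholders, and the exact syntax should be checked against the tool's help for your controller and firmware:
perccli64 /c0 show                                  # list the controller, enclosures and physical drives
perccli64 /c0 add vd type=raid0 drives=32:0 wb ra   # one RAID 0 VD per drive, write-back cache, read-ahead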

Related

Docker cgroup.procs no space left on device

After some problems with Docker on my dedicated Debian server (the provider supplies an OS image lacking some features Docker needs, so yesterday I recompiled the Linux kernel to enable them, following instructions from a blog).
Now that I had Docker working, I tried to create a container... and I got an error.
$ docker run -d -t -i phusion/baseimage /sbin/my_init -- bash -l
Unable to find image 'phusion/baseimage:latest' locally
Pulling repository phusion/baseimage
5a14c1498ff4: Download complete
511136ea3c5a: Download complete
53f858aaaf03: Download complete
837339b91538: Download complete
615c102e2290: Download complete
b39b81afc8ca: Download complete
8254ff58b098: Download complete
ec5f59360a64: Download complete
2ce4ac388730: Download complete
2eccda511755: Download complete
Status: Downloaded newer image for phusion/baseimage:latest
0bd93f0053140645a930a3411972d8ea9a35385ac9fafd94012c9841562beea8
FATA[0039] Error response from daemon: Cannot start container 0bd93f0053140645a930a3411972d8ea9a35385ac9fafd94012c9841562beea8: [8] System error: write /sys/fs/cgroup/docker/0bd93f0053140645a930a3411972d8ea9a35385ac9fafd94012c9841562beea8/cgroup.procs: no space left on device
More information:
$ docker info
Containers: 3
Images: 12
Storage Driver: devicemapper
Pool Name: docker-8:1-275423-pool
Pool Blocksize: 65.54 kB
Backing Filesystem: extfs
Data file: /dev/loop0
Metadata file: /dev/loop1
Data Space Used: 814.4 MB
Data Space Total: 107.4 GB
Data Space Available: 12.22 GB
Metadata Space Used: 1.413 MB
Metadata Space Total: 2.147 GB
Metadata Space Available: 2.146 GB
Udev Sync Supported: false
Data loop file: /var/lib/docker/devicemapper/devicemapper/data
Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
Library Version: 1.02.82-git (2013-10-04)
Execution Driver: native-0.2
Kernel Version: 3.19.0-xxxx-std-ipv6-64
Operating System: Debian GNU/Linux 8 (jessie)
CPUs: 4
Total Memory: 7.691 GiB
Name: ns3289160.ip-5-135-180.eu
ID: JK54:ZD2Q:F75Q:MBD6:7MPA:NGL6:75EP:MLAN:UYVU:QIPI:BTDP:YA2Z
System :
$ df -h
Filesystem Size Used Avail Use% Mounted on
udev 10M 0 10M 0% /dev
tmpfs 788M 456K 788M 1% /run
/dev/sda1 20G 7.8G 11G 43% /
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 1.7G 4.0K 1.7G 1% /dev/shm
/dev/sda2 898G 11G 842G 2% /home
Edit: command du -sk /var
# du -sk /var
3927624 /var
Edit: command fdisk -l
# fdisk -l
Disk /dev/loop0: 100 GiB, 107374182400 bytes, 209715200 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk /dev/loop1: 2 GiB, 2147483648 bytes, 4194304 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk /dev/sda: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x00060a5c
Device Boot Start End Sectors Size Id Type
/dev/sda1 * 4096 40962047 40957952 19.5G 83 Linux
/dev/sda2 40962048 1952471039 1911508992 911.5G 83 Linux
/dev/sda3 1952471040 1953517567 1046528 511M 82 Linux swap / Solaris
Disk /dev/mapper/docker-8:1-275423-pool: 100 GiB, 107374182400 bytes, 209715200 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 65536 bytes / 65536 bytes
You should not remove cgroup support from the kernel Docker runs on; otherwise you may get warnings like WARNING: Your kernel does not support memory swappiness capabilities, memory swappiness discarded. when you run a Docker container.
A simple command should do the trick:
echo 1 | sudo tee /sys/fs/cgroup/docker/cgroup.clone_children
If it still does not work, run the commands below and restart the Docker service:
echo 0 | sudo tee /sys/fs/cgroup/docker/cpuset.mems
echo 0 | sudo tee /sys/fs/cgroup/docker/cpuset.cpus
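After writing those values, the docker cgroup should accept new tasks again. A quick way to verify, assuming the same /sys/fs/cgroup/docker hierarchy as above:
cat /sys/fs/cgroup/docker/cgroup.clone_children   # should now print 1
cat /sys/fs/cgroup/docker/cpuset.cpus             # should list the CPUs new child cgroups may use, e.g. 0-3
cat /sys/fs/cgroup/docker/cpuset.mems             # should list the memory nodes, e.g. 0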
I had installed Docker via docker-lxc from the Debian repos, following a tutorial. I then tried another solution (with success): I updated my /etc/apt/sources.list from jessie to sid, removed docker-lxc with a purge, and installed docker.io.
The error changed. It became mkdir -p /sys/... can't create dir : access denied
So I found a comment on a blog and tried its solution, which was to comment out this line previously added by the tutorial, and then reboot the server:
## file /etc/fstab
# cgroup /sys/fs/cgroup cgroup defaults 0 0
# On RHEL/CentOS, reinstall the cgroup tooling and remount the hierarchy:
yum install -y libcgroup libcgroup-devel libcgroup-tools
cgclear                     # tear down the existing cgroup hierarchy
service cgconfig restart    # rebuild it from /etc/cgconfig.conf
mount -t cgroup none /cgroup
vi /etc/fstab               # make the mount persistent by adding the line below
cgroup /sys/fs/cgroup cgroup defaults 0 0

What happens if my stripe count is set to more than my number of stripes

I have a question about the Lustre file system. If I have a file of size 64 GB and I set the stripe size to 1 GB, my number of stripes becomes 64. But if I set my stripe count to 128, what does Lustre do in that case?
You are probably overlooking details such as the number of OSTs when thinking about this. I will elaborate, because I have seen this confusion about striping many times. Please bear with me.
For a 64 GB file with stripe_size=1GB, there are 64 stripes of 1 GB each, but that is not the stripe_count. A file can have any number of stripes, depending on the file size and the stripe_size, but the stripe_count depends on the OSTs. Here is a small experiment: I have 2 OSTs and create a 64 MB file with the default stripe_size of 1M...
/dev/loop0 on /mnt/mds1 type lustre (rw,loop=/dev/loop0)
/dev/loop1 on /mnt/ost1 type lustre (rw,loop=/dev/loop1)
/dev/loop2 on /mnt/ost2 type lustre (rw,loop=/dev/loop2)
ashish-203@tcp:/lustre on /mnt/lustre type lustre (rw,user_xattr,flock)
[root@ashish-203 tests]# lfs df -h
UUID bytes Used Available Use% Mounted on
lustre-MDT0000_UUID 146.4M 17.5M 119.4M 13% /mnt/lustre[MDT:0]
lustre-OST0000_UUID 183.1M 25.2M 147.7M 15% /mnt/lustre[OST:0]
lustre-OST0001_UUID 183.1M 25.2M 147.7M 15% /mnt/lustre[OST:1]
filesystem summary: 366.1M 50.3M 295.5M 15% /mnt/lustre
Now I create a file foo with -c -1(stripe on all OSTs)...
[root@ashish-203 tests]# lfs setstripe -c -1 /mnt/lustre/foo
[root@ashish-203 tests]# lfs getstripe /mnt/lustre/foo
/mnt/lustre/foo
lmm_stripe_count: 2
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 0
lmm_stripe_offset: 1
obdidx objid objid group
1 2 0x2 0
0 2 0x2 0
[root@ashish-203 tests]# dd if=/dev/urandom of=/mnt/lustre/foo bs=1M count=64
64+0 records in
64+0 records out
67108864 bytes (67 MB) copied, 7.465 s, 9.0 MB/s
[root@ashish-203 tests]# du -h /mnt/lustre/foo
64M /mnt/lustre/foo
[root@ashish-203 tests]# lfs df -h /mnt/lustre/foo
UUID bytes Used Available Use% Mounted on
lustre-MDT0000_UUID 146.4M 17.5M 119.4M 13% /mnt/lustre[MDT:0]
lustre-OST0000_UUID 183.1M 57.2M 115.8M 33% /mnt/lustre[OST:0]
lustre-OST0001_UUID 183.1M 57.2M 115.8M 33% /mnt/lustre[OST:1]
filesystem summary: 366.1M 114.3M 231.6M 33% /mnt/lustre
[root@ashish-203 tests]# lfs getstripe -c /mnt/lustre/foo
2
So here we can see that the 64 MB file is created with a stripe_count of 2, which means the data is written equally across both OSTs.
The stripe_count equals the number of objects in a single file, and each object of the file is stored on a different OST. Hence it is the number of OSTs that determines the stripe_count.
Now, if you want to change the stripe_count to 128, you would need 128 OSTs; if you don't have that many, the file will be striped only across the available OSTs, and that becomes your stripe_count (if the file is created with the "-c -1" option).
But if I set my stripe count as 128, what does Lustre do in that case?
So if you have, say, 64 OSTs, then Lustre will stripe the file across only those 64 OSTs.
Here is a small experiment demonstrating the above...
[root@ashish-203 tests]# lfs setstripe -c 4 /mnt/lustre/bar
[root@ashish-203 tests]# dd if=/dev/urandom of=/mnt/lustre/bar bs=1M count=64
64+0 records in
64+0 records out
67108864 bytes (67 MB) copied, 7.31459 s, 9.2 MB/s
[root@ashish-203 tests]# du -h /mnt/lustre/bar
64M /mnt/lustre/bar
[root@ashish-203 tests]# lfs df -h /mnt/lustre
UUID bytes Used Available Use% Mounted on
lustre-MDT0000_UUID 146.4M 17.5M 119.4M 13% /mnt/lustre[MDT:0]
lustre-OST0000_UUID 183.1M 89.2M 83.9M 52% /mnt/lustre[OST:0]
lustre-OST0001_UUID 183.1M 89.2M 83.9M 52% /mnt/lustre[OST:1]
filesystem summary: 366.1M 178.3M 167.7M 52% /mnt/lustre
[root@ashish-203 tests]# lfs getstripe -c /mnt/lustre/bar
2
You can see that, in spite of setting stripe_count=4, the stripe_count is 2 and the data was written to only 2 OSTs.
Summary: don't confuse stripe_count with the number of stripes.
stripe_count is how many OSTs you want to stripe your data across (if that many are available), while stripes = (file size / stripe_size).
Hope that answers your question...
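To see this capping directly for the 128 case, you can mirror the -c 4 experiment above (a sketch on the same 2-OST test setup; the file name is just an example):
lfs setstripe -c 128 /mnt/lustre/baz     # request 128 stripes
dd if=/dev/urandom of=/mnt/lustre/baz bs=1M count=64
lfs getstripe -c /mnt/lustre/baz         # prints 2 here: the effective stripe_count is capped at the number of OSTs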
If the stripe count is set to 128 but only 64 stripes are utilized, the remaining 64 are left out. The Lustre filesystem writes data in a round-robin fashion, which is necessary for striping. Also, to make sure the remaining stripes are not left imbalanced, we need to set a property so that the write starts from the 65th stripe.

Unable to boot after shrinking an Amazon EBS volume

I have followed several guides found via Google and on these forums to shrink EBS volumes on Amazon AWS, including these links:
http://wiki.jokeru.ro/shrink-amazon-ebs-root-volume
and
http://www.lantean.co/shrinking-ebs-volume/
I have a 254 GB EBS volume which needs to be resized to 150 GB. Here are the steps I took:
Create a new instance with an 8 GB volume /dev/xvde (base OS)
Attach the 254 GB volume as /dev/xvdj on the base OS.
Attach the 150 GB empty volume as /dev/xvdk on the base OS.
/dev/xvdj has 8 partitions as follows:
Disk /dev/xvdj: 272.7 GB, 272730423296 bytes
255 heads, 63 sectors/track, 33157 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00029527
Device Boot Start End Blocks Id System
/dev/xvdj1 * 1 13 102400 83 Linux
Partition 1 does not end on cylinder boundary.
/dev/xvdj2 13 6540 52428800 83 Linux
/dev/xvdj3 6540 10457 31457280 83 Linux
/dev/xvdj4 10457 33114 181998592 5 Extended
/dev/xvdj5 10457 11501 8388608 82 Linux swap / Solaris
/dev/xvdj6 11501 12154 5242880 83 Linux
/dev/xvdj7 12154 12285 1048576 83 Linux
/dev/xvdj8 12285 33114 167314432 83 Linux
Since /dev/xvdk is an empty volume, it has no partitions, which I suppose have to be created to match /dev/xvdj.
Following the links above, I ran e2fsck -f /dev/xvdj1, followed by resize2fs -M -p /dev/xvdj1, for all the partitions on /dev/xvdj (except /dev/xvdj4 and 5).
After those commands completed, I created partitions on the /dev/xvdk volume as required, keeping in mind that each had to be large enough to hold the shrunken filesystem from the corresponding /dev/xvdj partition.
The partitions of the 150 GB volume are as follows:
Disk /dev/xvdk: 161.1 GB, 161061273600 bytes
255 heads, 63 sectors/track, 19581 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xeea3d8c8
Device Boot Start End Blocks Id System
/dev/xvdk1 * 1 132 1060258+ 83 Linux
/dev/xvdk2 133 9271 73409017+ 83 Linux
/dev/xvdk3 9272 17105 62926605 83 Linux
/dev/xvdk4 17106 19581 19888470 5 Extended
/dev/xvdk5 17106 18150 8393931 82 Linux swap / Solaris
/dev/xvdk6 18151 19456 10490413+ 83 Linux
/dev/xvdk7 19457 19581 1004031 83 Linux
For the first partition, I also set the bootable flag using fdisk /dev/xvdk, selecting a and then partition number 1.
After partitioning, I followed the above links to count the blocks and issued the dd command to copy the data.
When the dd command completed, I ran e2fsck -f /dev/xvdk1, followed by resize2fs -p /dev/xvdk1, for all the partitions on /dev/xvdk (except /dev/xvdk4 and 5).
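For reference, the per-partition shrink-and-copy procedure from the linked guides looks roughly like this (a sketch for one partition; the block count is whatever resize2fs reports, not a measured value):
e2fsck -f /dev/xvdj2
resize2fs -M -p /dev/xvdj2        # shrink the filesystem to its minimum size and note the block count it reports
dd if=/dev/xvdj2 of=/dev/xvdk2 bs=4K count=<blocks-reported-by-resize2fs>
e2fsck -f /dev/xvdk2
resize2fs -p /dev/xvdk2           # grow the filesystem to fill the new, smaller partition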
After completing the above, I powered off the base OS and detached the 150 GB volume.
I created a snapshot of the 150 GB volume, and once the snapshot was ready, I created an image (AMI) from it.
I used this image to launch an instance, which succeeded, but after launching I am unable to connect to that instance.
Also, one of the two status checks reports a connectivity error, and I am unable to work out where I could have gone wrong.
Can someone tell me where I went wrong, or am I completely off track?
I have found an elegant solution in the AWS knowledge base.
The link is here: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/storage_expand_partition.html#expanding-partition-parted
It makes use of the parted tool.
NOTE: You need to attach the volume to a different instance in order to resize it.
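Before snapshotting the copied volume, it is also worth double-checking its partition table and boot flag with parted (a sketch, separate from the AWS article's own steps):
parted /dev/xvdk unit s print      # verify the partition boundaries on the new volume
parted /dev/xvdk set 1 boot on     # make sure the boot flag is set on the first (boot) partition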

Performance difference between two EBS volumes mounted to the same EC2 instance

I'm trying to do some very basic performance tests on EC2 with EBS.
I have created an m3.large instance with 2 EBS volumes:
root@ip-172-31-37-37:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
xvda 202:0 0 8G 0 disk
└─xvda1 202:1 0 8G 0 part /
xvdb 202:16 0 30G 0 disk /mnt
xvdc 202:32 0 30G 0 disk /data
Then I run a simple disk write test with dd. One volume is about 3x faster than the other.
root@ip-172-31-37-37:~# time sh -c "dd if=/dev/zero of=/mnt/tmp1 bs=4k count=800000 && sync"
800000+0 records in
800000+0 records out
3276800000 bytes (3.3 GB) copied, 13.9361 s, 235 MB/s
real 0m27.571s
user 0m0.073s
sys 0m4.695s
root#ip-172-31-37-37:~# time sh -c "dd if=/dev/zero of=/data/tmp1 bs=4k count=800000 && sync"
800000+0 records in
800000+0 records out
3276800000 bytes (3.3 GB) copied, 39.5834 s, 82.8 MB/s
real 0m40.203s
user 0m0.086s
sys 0m3.347s
root@ip-172-31-37-37:~#
Any idea why?
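For a more cache-independent comparison, the same test could be repeated with O_DIRECT and a final data flush (a sketch, not something that was run here; the file names are arbitrary):
dd if=/dev/zero of=/mnt/tmp2 bs=1M count=1024 oflag=direct conv=fdatasync
dd if=/dev/zero of=/data/tmp2 bs=1M count=1024 oflag=direct conv=fdatasync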

'cat /proc/swaps' returns nothing [closed]

Please do not waste any more of your time on this question... I ended up deleting the whole VM and creating another. The time it took me to do this was less than the time it would have taken to fix the issue. I have a couple of SSDs in RAID mode.
Thank you to all those who tried to troubleshoot the issue!
I am having a problem with Ubuntu not showing any active swap spaces when I run the command cat /proc/swaps. Here is a list of the commands I ran. I even added a new swap space (file: /swapfile1) just to make sure there was at least one swap space, but I still get nothing.
hebbo@ubuntu-12-lts:~$ sudo fdisk -l
[sudo] password for hebbo:
Disk /dev/sda: 26.8 GB, 26843545600 bytes
255 heads, 63 sectors/track, 3263 cylinders, total 52428800 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000e3a7a
Device Boot Start End Blocks Id System
/dev/sda1 * 46569472 52426751 2928640 82 Linux swap / Solaris
/dev/sda2 2046 46567423 23282689 5 Extended
/dev/sda5 2048 46567423 23282688 83 Linux
Partition table entries are not in disk order
hebbo@ubuntu-12-lts:~$ sudo su
root@ubuntu-12-lts:/home/hebbo# cat /proc/swaps
Filename Type Size Used Priority
root@ubuntu-12-lts:/home/hebbo# dd if=/dev/zero of=/swapfile1 bs=1024 count=524288
524288+0 records in
524288+0 records out
536870912 bytes (537 MB) copied, 1.18755 s, 452 MB/s
root@ubuntu-12-lts:/home/hebbo# mkswap /swapfile1
Setting up swapspace version 1, size = 524284 KiB
no label, UUID=cb846612-5f27-428f-9f83-bbe24b410a78
root@ubuntu-12-lts:/home/hebbo# chown root:root /swapfile1
root@ubuntu-12-lts:/home/hebbo# chmod 0600 /swapfile1
root@ubuntu-12-lts:/home/hebbo# swapon /swapfile1
root@ubuntu-12-lts:/home/hebbo# cat /proc/swaps
Filename Type Size Used Priority
root@ubuntu-12-lts:/home/hebbo#
Any idea how to fix this?
This is Ubuntu 12.04 LTS running kernel 3.9.0 in a VMware VM.
Thanks in advance!
To activate /swapfile1 after a Linux system reboot, add an entry to the /etc/fstab file. Open this file using a text editor such as vi:
# vi /etc/fstab
Add the following line:
/swapfile1 swap swap defaults 0 0
Save and close the file. Next time Linux comes up after reboot, it enables the new swap file for you automatically.
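To enable the swap file immediately, without waiting for a reboot (assuming the fstab entry above is in place):
swapon -a       # activate all swap areas listed in /etc/fstab
swapon -s       # show active swap areas; equivalent to cat /proc/swaps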
Have a look here for more info.
I just tried it and it works on my box.
Linux fileserver 3.8.0-32-generic #47~precise1-Ubuntu SMP Wed Oct 2 16:19:35 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
ortang@fileserver:~$ cat /proc/swaps
Filename Type Size Used Priority
/dev/dm-2 partition 4194300 0 -1
ortang@fileserver:~$ sudo su
root@fileserver:/home/ortang# dd if=/dev/zero of=/swapfile bs=1M count=512
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 0.695721 s, 772 MB/s
root@fileserver:/home/ortang# chmod 600 /swapfile
root@fileserver:/home/ortang# mkswap /swapfile
Setting up swapspace version 1, size = 524284 KiB
no label, UUID=63cdcf3d-ba03-42ce-b598-15b6aa3ca67d
root@fileserver:/home/ortang# swapon /swapfile
root@fileserver:/home/ortang# cat /proc/swaps
Filename Type Size Used Priority
/dev/dm-2 partition 4194300 0 -1
/swapfile file 524284 0 -2
One reason I can imagine it works on my box is that I already have a working swap partition, and it seems you don't.
It could also be caused by the kernel you are using; 3.9.0 is not the regular 12.04.3 LTS kernel. Did you build the kernel yourself?
What is the output of
grep CONFIG_SWAP /boot/config-`uname -r`
or
zcat /proc/config.gz | grep CONFIG_SWAP
Is swap support enabled in your kernel?
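On a kernel built with swap support, the output should include:
CONFIG_SWAP=y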
I ended up deleting the whole VM and creating another. The time it took me to do this was less than the time it would have taken to fix the issue. I have a couple of SSDs in RAID mode, and I already had all the downloads on the same host machine. All in all, ~7 minutes.
Thanks to all those who helped troubleshoot the issue.

Resources