GlusterFS 3.4 striped volume disk usage - glusterfs

In GlusterFS 3.4.3 server,
When I create a volume with
gluster volume create NEW-VOLNAME stripe COUNT NEW-BRICK-LIST...
and store some files, the volume consumes 1.5 times the space of the actual data stored, regardless of the number of stripes.
e.g. if I create a 1GB file in the volume with
dd if=/dev/urandom of=random1gb bs=1M count=1000
It consumes 1.5GB of total disk space across the bricks. "ls -alhs", "du -hs", and "df -h" all indicate the same thing: 1.5GB of space used for a 1GB file. Inspecting each brick and summing up the usage also shows the same result.
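For reference, a sketch of the kind of check described above (the brick paths are hypothetical; substitute your own):
# on the client mount
ls -alhs random1gb
# on each brick server, then sum the results by hand
du -hs /export/brick*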
Interestingly, this doesn't happen with the newer GlusterFS 3.5 server: a 1GB file uses 1GB of total brick space, as expected.
It's good that it is fixed in 3.5, but I cannot use 3.5 right now due to another issue.
I couldn't find any document or article about this. Did I set a wrong option (I left everything at the defaults)? Or is it a bug in 3.4? It seems too serious a problem to be just a bug.
If it is by design, why? To me it looks like a huge waste of space for a storage system.
To be fair, I'd like to point out that GlusterFS works very well except for this issue. Excellent performance (especially with qemu-libgfapi integration), easy setup, and flexibility.

Related

How can I quickly erase all partition information and data on partitions in Linux?

I'm testing a program to use on Raspberry Pi OS. A good part of what it does is read the partitioning info on the system drive, which in this case is just /boot and /, with no extra partitions. I'm using a Python script that calls sfdisk. I do what so many examples show: I get the partition info from the system drive, read it as output, then use it as input to the command that partitions the target drive.
I'm using Python and doing this with subprocess.run(). When my script writes the 2nd partition on the target drive, it writes it at a small size, and then I use parted to extend the partition to the end of the drive. In between tests, to wipe my data so I can start fresh, I've been using sfdisk to make one partition spanning the full size of the drive. Also, I'm using USB memory sticks for testing at this point; in general I'll be using those, or SD cards, as the drives.
The problem I'm finding is that the file structure on the partitions of the target drive persists. (This whole paragraph is about ONLY the target drive.) If I divide it into 2 partitions (as I eventually need to), I find that /boot, the small 1st partition, still has all the files from the previous use of that partition. If I try to wipe the information by making only one big partition on the drive, I still see, in that one partition, the original files from the /boot partition. And if I split it into 2 partitions again, the locations are the same as when I normally write a Raspbian image, and the files in both /boot and the root partition are still there.
So repartitioning, with the partitions in the same location, leaves me with the files still intact from the previous incarnation of a partition in the same sectors.
I'd like to, for testing, just wipe out all the information so I start fresh with each test, but I do not want to just use dd and write gigabytes of 0s or 1s to the full drive to wipe out the data.
What can I do to make sure:
The partition table is wiped out between tests
Any directory structure or file information for the partitions is wiped out so there are no files still surviving on any partitions when I start my testing?
A "nice" thing about linux filesystems is that they are separate from partition tables. This has saved me in the past when partition tables have been accidentally deleted or corrupted - recreate the partition table and the filesystem is still there! For your use case, if you want the files to be "gone", you need to destroy the filesystem superblocks. Destroying just the first one is probably sufficient for your use case.
Using dd to overwrite just the first MB of each of your filesystems should get you what you need. So, if you're starting your first partition/FS on block 0, you could do something like
# write 1MB of zeros to wipe out /boot
dd if=/dev/zero of=/dev/path_to_your_device bs=1024 count=1024
That ought to wipe out the /boot file system. From there you'll need to calculate the start of your root volume, and you can use dd's seek= option (the output-side counterpart of the skip= trick described at https://superuser.com/questions/380717/how-to-output-file-from-the-specified-offset-but-not-dd-bs-1-skip-n) to write a meg of zeros at the start of your root filesystem.
Alternately, if /boot is small, you can just write sizeof(/boot)+1MB (assuming the root filesystem starts immediately after /boot) and that will overwrite the root filesystem's primary superblock too, while saving you some calculations.
Note that the alternate superblocks will still exist, so if at some point you (or someone) wanted to get back what was there previously, recovery from the alternate superblocks might be possible, except that whatever files were in that first 1MB of disk would be corrupt due to the overwrite.
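As a concrete sketch of the seek= variant (the device name and the assumption that /boot occupies the first 256 MiB are mine; adjust for your layout):
# wipe 1 MiB at the start of the root filesystem, assumed here to begin 256 MiB into the device
dd if=/dev/zero of=/dev/sdX bs=1M seek=256 count=1
If you would rather avoid the offset arithmetic entirely, wipefs -a run against each partition erases the filesystem signatures directly; that is a different tool from dd, mentioned only as an alternative.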

Too much disk space used by Apache Kudu for WALs

I have a Hive table that is 2.7 MB (stored in the Parquet format). When I use impala-shell to convert this Hive table to Kudu, I notice that the /tserver/ folder size increases by around 300 MB. Exploring further, I see it is the /tserver/wals/ folder that holds the majority of this increase. I am facing serious issues because of this: if a 2.7 MB file generates a 300 MB WAL, then I cannot really work with bigger data. Is there a solution to this?
My Kudu version is 1.1.0 and Impala is 2.7.0.
I've never used Kudu, but I was able to Google a few keywords and read some documentation.
From the Kudu configuration reference section "Unsupported flags"...
--log_preallocate_segments    Whether the WAL should preallocate the entire segment before writing to it. Default: true
--log_segment_size_mb         The default segment size for log roll-overs, in MB. Default: 64
--log_min_segments_to_retain  The minimum number of past log segments to keep at all times, regardless of what is required for durability. Must be at least 1. Default: 2
--log_max_segments_to_retain  The maximum number of past log segments to keep at all times for the purposes of catching up other peers. Default: 10
Looks like you have a minimum disk requirement of (2+1)x64 MB per tablet, for the WAL only. And it can grow up to 10x64 MB if some tablets are straggling and cannot catch up.
Plus some temp disk space for compaction etc. etc.
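To put rough numbers on it (my own arithmetic, assuming the defaults above and a hypothetical tablet server hosting 50 tablet replicas):
(2 + 1) x 64 MB x 50 tablets ≈ 9.6 GB of WAL space as a baseline
10 x 64 MB x 50 tablets ≈ 32 GB in the worst case, if peers fall behind
That would dwarf a 2.7 MB table no matter how little data it holds.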
[Edit] these default values have changed in Kudu 1.4 (released in June 2017); quoting the Release Notes...
The default size for Write Ahead Log (WAL) segments has been reduced from 64MB to 8MB. Additionally, in the case that all replicas of a tablet are fully up to date and data has been flushed from memory, servers will now retain only a single WAL segment rather than two. These changes are expected to reduce the average consumption of disk space on the configured WAL disk by 16x.

VMware ESXi slow local storage only on one disk partition

I'm experiencing very odd behaviour of the free VMware ESXi 6 hypervisor while using local hard drives as storage for VMs.
Everything works fine except for one partition.
Here's the setup.
A 2TB WD Red drive divided into two pieces: one partition 1 TB in size, and another 500 GB. Both parts/partitions of this drive are assigned to one VM (running Ubuntu 14.04 LTS) and are formatted and configured in fstab in the usual way. Everything is fine there.
Now the issue with performance.
When I try to read from or write to the big (1TB) partition, mounted at /mnt/bigpart, I get the expected read and write speeds (~150 MB/s).
But if I try to do the same with the smaller partition (500GB), both read and write speeds are 50% lower! I cannot push the read speed above 80 MB/s, and writes are even lower.
I just don't get it. esxtop (d) shows exactly the same results. The smaller partition just cannot seem to go any faster.
This is very odd as both partitions are preallocated (in favour of spinning drive speed), and both are physically located on the same hard drive.
I know that in theory with spinning hard drives the end of the drive platter can be somewhat slower than the beginning, but this is just too much of a performance hit.
Additionally, the hard drive has ~360 GB of free space after those preallocations.
Perhaps I should try to re-assign the smaller partition again but this time with thin provisioning.
Take a look at measurements:
BIGGER (1TB) PARTITION / DISK
11649792+0 records in
11649792+0 records out
5964693504 bytes (6.0 GB) copied, 39.873 s, 150 MB/s
SMALLER (500GB) PARTITION / DISK
11649792+0 records in
11649792+0 records out
5964693504 bytes (6.0 GB) copied, 67.1635 s, 88.8 MB/s
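Those figures look like the output of a plain sequential dd run; a sketch that would produce output of this shape (the guest device names are assumptions) is:
dd if=/dev/sdb of=/dev/null bs=512 count=11649792   # virtual disk on the 1TB datastore
dd if=/dev/sdc of=/dev/null bs=512 count=11649792   # virtual disk on the 500GB datastore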
I know that in theory with spinning hard drives it can be that the end of the drive platter is somewhat slower than the beginning
This is true in practice as well. Look at the sustained transfer rate of this 2 TB hard drive, which is from a different vendor. The chart displays the sequential read throughput depending on the offset.
In the first terabyte, the sequential read throughput is between 170 and 130 MiB/s, which is pretty close to what you experience (150 MB/s). The throughput drops sharply in the second half of the hard drive. Even if it does not explain 100% of the performance hit you experience, it is probably the dominant factor.
This can (but doesn't have to) be a problem with block alignment.
In real-world cases there is no big difference between a thin- and a thick-provisioned VMDK.
So you have two local datastores (VMFS5?) on the same hard disk?
Do both datastores have a block size of 1 MB? (Host -> Configuration -> Storage)
If yes, do both partitions in your guest have a block size of 1 MB too?
Is it possible that one partition was created with MBR and the other with GPT? (GPT would be the better choice.)
Maybe you can also do a SMART check of the HDD; there may be some bad sectors. (A sketch follows below.)
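For the SMART check, a hedged sketch from the ESXi shell (the device identifier is a placeholder; pick the real one from the device list):
esxcli storage core device list
esxcli storage core device smart get -d t10.ATA_____WDC_WD20EFRX_PLACEHOLDER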

How to prevent Cassandra commit logs filling up disk space

I'm running a two node Datastax AMI cluster on AWS. Yesterday, Cassandra started refusing connections from everything. The system logs showed nothing. After a lot of tinkering, I discovered that the commit logs had filled up all the disk space on the allotted mount and this seemed to be causing the connection refusal (deleted some of the commit logs, restarted and was able to connect).
I'm on DataStax AMI 2.5.1 and Cassandra 2.1.7
If I decide to wipe and restart everything from scratch, how do I ensure that this does not happen again?
You could try lowering the commitlog_total_space_in_mb setting in your cassandra.yaml. The default is 8192MB for 64-bit systems (it should be commented-out in your .yaml file... you'll have to un-comment it when setting it). It's usually a good idea to plan for that when sizing your disk(s).
You can verify this by running a du on your commitlog directory:
$ du -d 1 -h ./commitlog
8.1G ./commitlog
Although, a smaller commit log space will cause more frequent flushes (increased disk I/O), so you'll want to keep an eye on that.
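For example, a minimal cassandra.yaml sketch (4096 is an arbitrary value here; size it for your own disk and write load):
commitlog_total_space_in_mb: 4096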
Edit 20190318
Just had a related thought (on my 4-year-old answer). I saw that it received some attention recently, and wanted to make sure that the right information is out there.
It's important to note that sometimes the commit log can grow in an "out of control" fashion. Essentially, this can happen because the write load on the node exceeds Cassandra's ability to keep up with flushing the memtables (and thus, removing old commitlog files). If you find a node with dozens of commitlog files, and the number seems to keep growing, this might be your issue.
Essentially, your memtable_cleanup_threshold may be too low. Although this property is deprecated, you can still control how it is calculated by lowering the number of memtable_flush_writers.
memtable_cleanup_threshold = 1 / (memtable_flush_writers + 1)
The documentation has been updated as of 3.x, but used to say this:
# memtable_flush_writers defaults to the smaller of (number of disks,
# number of cores), with a minimum of 2 and a maximum of 8.
#
# If your data directories are backed by SSD, you should increase this
# to the number of cores.
#memtable_flush_writers: 8
...which (I feel) led to many folks setting this value WAY too high.
Assuming a value of 8, the memtable_cleanup_threshold is .111. When the footprint of all memtables exceeds this ratio of total memory available, flushing occurs. Too many flush (blocking) writers can prevent this from happening expediently. With a single /data dir, I recommend setting this value to 2.
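A minimal cassandra.yaml sketch of that recommendation (the threshold shown is just the formula above worked out):
# 1 / (2 + 1) ≈ 0.33, so flushing kicks in at about a third of the memtable space
memtable_flush_writers: 2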
In addition to decreasing the commitlog size as suggested by BryceAtNetwork23, a proper solution to ensure it won't happen again is to monitor the disk setup, so that you are alerted when it's getting full and have time to act/increase the disk size.
Seeing as you are using DataStax, you could set an alert for this in OpsCenter. I haven't used this in the cloud myself, but I imagine it would work. Alerts can be set by clicking Alerts in the top banner -> Manage Alerts -> Add Alert. Configure the mounts to watch and the thresholds to trigger on.
Or, I'm sure there are better tools to monitor disk space out there.

How to reduce the default metadata size for an XFS file system?

I have a special-purpose 12-disk volume, 48 TB total. After mkfs with default parameters and mounting with inode64, the reported available space for files is 44 TB. So there is 4 TB of metadata overhead, almost 10%.
I'm thinking this metadata size is probably intended to accommodate tens of millions of inodes, whereas I use only large files and would need 1-2 million files at most. Given this, my question is whether it's possible to recover 2-3 TB out of the 4 TB of metadata to use for file data.
In the man page I see a maxpct option, and possibly others, but I cannot figure out the correct way to use them in my case. I still need to make sure the volume can hold the 2 million files.
Also, I understand some metadata space is used for journaling, and I don't know how much would be enough there.
Based on the specific percentage of storage that you're seeing missing, it seems likely that you're being misled by the difference between binary and decimal units of storage. Since disks are measured in decimal terabytes, using software tools that measure available storage in binary terabytes (which are 10% larger) will give you results that appear to be about 9% too low. The storage hasn't actually gone anywhere, though; you're just using units that make it look smaller!
By default, the coreutils versions of the df and du commands use binary units. You can use df's -H flag (or du's --si flag) to make them use decimal units instead.
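Working that out for this volume: 48 x 10^12 bytes / 2^40 bytes per TiB ≈ 43.7 TiB, which matches the 44 TB being reported.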
You could drop it from 5% to 1%.
maxpct
This specifies the maximum percentage of space in the filesystem that can be allocated to inodes. The default value is 25% for filesystems under 1TB, 5% for filesystems under 50TB and 1% for filesystems over 50TB.
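A hedged sketch of how that could be applied (the device path and mount point are placeholders):
mkfs.xfs -i maxpct=1 /dev/mapper/bigvol     # at filesystem creation time
xfs_growfs -m 1 /mnt/bigvol                 # or adjust imaxpct on an already-mounted filesystem
As far as I know, maxpct is a cap on how much space inodes may consume rather than a reservation, so lowering it mainly guards against inode allocation eating into data space.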
Just keep in mind that tuning a filesystem is a tricky affair. There's a delicate balance achieved between stability, reliability, and performance.
