Is there a way to compare (binary, checksum, etc.) file blocks between 2 different Linux servers? The files on both servers reside on SAN. Reason - replication is set up on a Postgres database, and I want to check whether the blocks for a table on the primary and the mirror are exactly the same or different.
I would use dd to read the block of the file you care about, and then pipe it to md5sum to get a checksum that I can compare to the other machine, like:
$ dd if=/path/to/postgresql/data bs=4096 skip=<block number minus one> count=1 | md5sum
5561f64d760047a7a56e99a71a66c890
(You should substitute your own block size in the bs= parameter if 4KB isn't right for PostgreSQL; the PostgreSQL default block size is 8KB.)
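As a rough sketch of comparing a single block across both servers, assuming SSH access to the mirror and that the relation file lives at the same path on both sides (the path, host name, and block number below are placeholders):
BLOCK=41                        # zero-based block number to compare
BS=8192                         # PostgreSQL's default block size; adjust if your build differs
FILE=/path/to/postgresql/data   # placeholder path to the relation file
local_sum=$(dd if="$FILE" bs="$BS" skip="$BLOCK" count=1 2>/dev/null | md5sum | cut -d' ' -f1)
remote_sum=$(ssh mirror-host "dd if=$FILE bs=$BS skip=$BLOCK count=1 2>/dev/null | md5sum" | cut -d' ' -f1)
[ "$local_sum" = "$remote_sum" ] && echo "block $BLOCK matches" || echo "block $BLOCK differs"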
Assume there is a bucket with a root folder that has subfolders and files. Is there any way to get the total file count and total size of the root folder?
What I tried:
With gsutil du I'm getting the size quickly but not the count. With gsutil ls ___ I'm getting the list and sizes; if I pipe it to awk and sum them I might get the expected result, but ls itself is taking a lot of time.
So is there a better/faster way to handle this?
Doing an object listing of some sort is the way to go - both the ls and du commands in gsutil perform object listing API calls under the hood.
If you want to get a summary of all objects in a bucket, check Cloud Monitoring (as mentioned in the docs). But, this isn't applicable if you want statistics for a subset of objects - GCS doesn't support actual "folders", so all your objects under the "folder" foo are actually just objects named with a common prefix, foo/.
If you want to analyze the number of objects under a given prefix, you'll need to perform object listing API calls (either using a client library or using gsutil). The listing operations can only return so many objects per response and thus are paginated, meaning you'll have to make several calls if you have lots of objects under the desired prefix. The max number of results per listing call is currently 1,000. So as an example, if you had 200,000 objects to list, you'd have to make 200 sequential API calls.
A note on gsutil's ls:
There are several scenarios in which gsutil can do "extra" work when completing an ls command, like when doing a "long" listing using the -L flag or performing recursive listings using the -r flag. To save time and perform the fewest number of listings possible in order to obtain a total count of bytes under some prefix, you'll want to do a "flat" listing using gsutil's wildcard support, e.g.:
gsutil ls -l gs://my-bucket/some-prefix/**
Alternatively, you could try writing a script using one of the GCS client libraries, like the Python library and its list_blobs functionality.
If you want to track the count of objects in a bucket over a long time, Cloud Monitoring offers the metric "storage/object_count". The metric updates about once per day, which makes it more useful for long-term trends.
As for counting instantaneously, unfortunately gsutil ls is probably your best bet.
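For example, if all you need is an instantaneous object count under a prefix, a flat wildcard listing piped to wc is about as simple as it gets (the bucket and prefix are placeholders):
gsutil ls gs://my-bucket/some-prefix/** | wc -l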
Using gsutil du -sh could be a good idea for small directories.
For big directories, I was not able to get a result, even after a few hours; I only kept getting a retrying message.
Using gsutil ls is more efficient.
For big directories, it could take tens of minutes, but at least it completes.
To retrieve the number of files and the total size of a directory with gsutil ls, you can use the following command:
gsutil ls -l gs://bucket/dir/** | awk '{size+=$1} END {print "nb_files:", NR, "\ntotal_size:",size,"B"}'
Then divide the value by:
1024 for KB
1024 * 1024 for MB
1024 * 1024 * 1024 for GB
...
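For instance, a variant of the same command that folds the conversion into awk and prints the total in GB (gs://bucket/dir is a placeholder):
gsutil ls -l gs://bucket/dir/** | awk '{size+=$1} END {print "nb_files:", NR, "\ntotal_size:", size/1024/1024/1024, "GB"}'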
If I know that the partition is, for example, /dev/sda1, how can I get the disk name (/dev/sda in this case) that contains the partition?
The output should be only the path to the disk (like '/dev/sda').
EDIT: It shouldn't be string manipulation.
You can use the shell's built-in string chopping:
$ d=/dev/sda1
$ echo ${d%%[0-9]*}
/dev/sda
$ d=/dev/sda11212
$ echo ${d%%[0-9]*}
/dev/sda
This only works for some disk names. If the disk name itself contains digits (for example /dev/nvme0n1), it will chop everything from the first digit onward, so /dev/nvme0n1p1 would become /dev/nvme.
What is the exact specification to separate a disk name from a partition name?
You can use sed to get the disk. Because a partition name is usually just the disk name with a number appended, it's easy to strip (note that this removes every digit, so it only behaves as expected when the disk name itself contains no digits):
echo "/dev/sda1" | sed 's/[0-9]*//g'
which produces the output /dev/sda
Another command you can use to obtain disk information is lsblk. Just typing it without args prints out all info pertaining to your disks and partitions.
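If you want to avoid string manipulation entirely, one option (assuming a reasonably recent util-linux) is to ask lsblk for the partition's parent device; the PKNAME column prints the parent's kernel name without the /dev/ prefix:
part=/dev/sda1
disk="/dev/$(lsblk -no PKNAME "$part")"
echo "$disk"    # prints /dev/sda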
I have a situation where I have to read a sparse file. This file has data at specific offsets. Now I want to achieve:
1) Read 3 blocks (custom sizes) from the given offset
2) The offset needs to be specified in units of 1M
So I am trying the command below, but it is not successful. I am reading more content for sure.
dd if=a_Sparse_file_ofSIZe_1024M of=/dev/null ibs=1M skip=512 obs=262144 count=3
Skip 512M worth of blocks and read from the 512M+1th offset using 256K blocks for 3 counts.
skip should always be in MBs, and the number of count blocks is variable.
I am sure I am reading more data than intended. Can someone please correct me?
You can always string 2 dds together, the first one to skip and the second one to read your actual data:
dd if=a_Sparse_file_ofSIZe_1024M bs=1M skip=N | dd bs=262144 count=3
The count parameter is based on ibs (it counts input blocks), so the obs value does not matter here. As your obs value is four times smaller than ibs, I would suggest setting bs=256K and multiplying the skip value by four: skip=2048.
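Putting that together, a single-dd sketch of what the question seems to want (the input file name is the one from the question, and /dev/null is just a stand-in for the real destination):
dd if=a_Sparse_file_ofSIZe_1024M of=/dev/null bs=256K skip=2048 count=3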
I have a 500GB text file with about 10 billion rows that needs to be sorted in alphabetical order. What is the best algorithm to use? Can my implementation & set-up be improved?
For now, I am using the coreutils sort command:
LANG=C
sort -k2,2 --field-separator=',' --buffer-size=(80% RAM) --temporary-directory=/volatile BigFile
I am running this in AWS EC2 on a 120GB RAM & 16 cores virtual machine. It takes the better part of a day.
/volatile is a 10TB RAID0 array.
The 'LANG=C' trick delivers a 2x speed gain (thanks to 1)
By default 'sort' uses 50% of the available RAM. Going up to 80-90% gives some improvement.
My understanding is that GNU 'sort' is a variant of the merge-sort algorithm with O(n log n), which is the fastest: see 2 & 3. Would moving to QuickSort help (I'm happy with an unstable sort)?
One thing I have noticed is that only 8 cores are used. This is related to default_max_threads being set to 8 in the coreutils sort.c (see 4). Would it help to recompile sort.c with 16?
Thanks!
FOLLOW-UP :
#dariusz
I used Chris and your suggestions below.
As the data was already generated in batches, I sorted each bucket separately (on several separate machines) and then used the 'sort --merge' function, as sketched below. It works like a charm and is much faster: roughly O(N log(N/K)) per machine vs O(N log N) for the whole file.
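For reference, the merge step looks roughly like this, assuming each machine produced a chunk already sorted with the same key options (the chunk file names are illustrative):
LANG=C sort --merge --field-separator=',' -k2,2 sorted_chunk_*.csv > BigFile.sorted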
I also rethought the project from scratch: some data post-processing is now performed while the data is generated, so that some un-needed data (noise) can be discarded before sorting takes place.
All together, data size reduction & sort/merge led to massive reduction in computing resources needed to achieve my objective.
Thanks for all your helpful comments.
The benefit of quicksort over mergesort is no additional memory overhead. The benefit of mergesort is the guaranteed O(n log n) run time, whereas quicksort can be much worse in the event of poor pivot point sampling. If you have no reason to be concerned about memory use, don't change. If you do, just ensure you pick a quicksort implementation that does solid pivot sampling.
I don't think it would help spectacularly to recompile sort.c. It might, on a micro-optimization scale, but your bottleneck here is going to be memory/disk speed, not the number of processors available. My intuition is that 8 threads will already be maxing out your I/O throughput, so you would see no performance improvement, but this certainly depends on your specific setup.
Also, you can gain significant performance increases by taking advantage of the distribution of your data. For example, evenly distributed data can be sorted very quickly by a single bucket-sort pass, then using mergesort to sort the buckets. This also has the added benefit of decreasing the total memory overhead of mergesort: if the memory complexity of mergesort is O(N) and you can separate your data into K buckets, your new memory overhead is O(N/K).
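A rough shell sketch of that bucket-then-sort idea, assuming lines can be bucketed on their first character and that the file names below are placeholders:
# split the input into one bucket file per leading character
awk '{ print > ("bucket_" substr($0, 1, 1) ".txt") }' BigFile
# sort each bucket independently and concatenate the results in bucket order
# (this relies on the glob order of the bucket names matching the desired collation)
for f in bucket_*.txt; do LANG=C sort "$f"; done > BigFile.sorted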
Just an idea:
I assume the file contents are generated over quite a large amount of time. Write an application (script?) which would periodically move the data generated so far to a different location, append its contents to another file, perform a sort on that other file, and repeat until all data is gathered.
That way your system would spend more time sorting, but the results would be available sooner, since sorting partially-sorted data will be faster than sorting the unsorted data.
I think you need to perform that sort in 2 stages:
Split into trie-like buckets which fit into memory.
Iterate over the buckets in alphabetical order, fetch each one, sort it, and append it to the output file.
Here is an example.
Imagine you have a bucket limit of 2 lines only, and your input file is:
infile:
0000
0001
0002
0003
5
53
52
7000
On the 1st iteration, you read your input file (the "super-bucket", with an empty prefix) and split it according to the 1st letter.
There would be 3 output files:
0:
000
001
002
003
5:
(empty)
3
2
7:
000
As you can see, the bucket with filename/prefix 7 contains only one record, 000, which is "7000" split into 7 (the filename) and 000 (the tail of the string). Since this is just one record, we do not need to split this file any further. But the files "0" and "5" contain 4 and 3 records, which is more than the limit of 2, so they need to be split again.
After split:
00:
00
01
02
03
5:
(empty)
52:
(empty)
53:
(empty)
7:
000
As you can see, the files with prefixes "5" and "7" are already fully split, so we just need to split file "00".
As you can see, after splitting you will have a set of relatively small files.
Thereafter, run the 2nd stage:
Sort the filenames, and process them in sorted order.
Sort each file, and append the result to the output, prepending the file name (prefix) to each output string.
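A rough shell sketch of that 2nd stage, assuming the stage-1 buckets ended up as files named after their prefix in a buckets/ directory (the directory and output names are placeholders):
# walk the bucket files in byte order, sort each one's tails,
# and glue the prefix (the file name) back onto every line
for f in $(ls buckets/ | LANG=C sort); do
    LANG=C sort "buckets/$f" | sed "s/^/$f/"
done > outfile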
I want to shuffle a large file with millions of lines of strings in Linux. I tried 'sort -R', but it is very slow (it takes about 50 minutes for a 16M file). Is there a faster utility that I can use in its place?
Use shuf instead of sort -R (man page).
The slowness of sort -R is probably due to it hashing every line. shuf just does a random permutation so it doesn't have that problem.
(This was suggested in a comment but for some reason not written as an answer by anyone)
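Basic usage is just (the file names are placeholders):
shuf strings.txt > strings_shuffled.txt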
The 50 minutes is not caused by the actual mechanics of sorting, based on your description. The time is likely spent waiting on /dev/random to generate enough entropy.
One approach is to use an external source of random data (http://random.org, for example) along with a variation on a Schwartzian Transform. The Schwartzian Transform turns the data to be sorted into "enriched" data with the sort key embedded. The data is sorted using the key and then the key is discarded.
To apply this to your problem:
generate a text file with random numbers, 1 per line, with the same number of lines as the file to be sorted. This can be done at any time, run in the background, run on a different server, downloaded from random.org, etc. The point is that this randomness is not generated while you are trying to sort.
create an enriched version of the file using paste:
paste random_number_file.txt string_data.txt > tmp_string_data.txt
sort this file:
sort tmp_string_data.txt > sorted_tmp_string_data.txt
remove the random data:
cut -f2- sorted_tmp_string_data.txt > random_string_data.txt
This is the basic idea. I tried it and it does work, but I don't have 16 million lines of text or 16 million lines of random numbers. You may want to pipeline some of those steps instead of saving it all to disk.
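A pipelined sketch of the same steps, assuming random_number_file.txt already exists and has exactly one random number per line of string_data.txt:
paste random_number_file.txt string_data.txt | sort | cut -f2- > random_string_data.txt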
You may try my tool: HugeFileProcessor. It's capable of shuffling files of hundreds of GBs in a reasonable time.
Here are the details of the shuffling implementation. It requires specifying batchSize - the number of lines to keep in RAM when writing to the output. The bigger the better (unless you run out of RAM), because the total shuffling time would be (number of lines in sourceFile) / batchSize * (time to fully read sourceFile). Please note that the program shuffles the whole file, not on a per-batch basis.
The algorithm is as follows.
Count lines in sourceFile. This is done simply by reading the whole file line-by-line. (See some comparisons here.) This also gives a measurement of how much time it would take to read the whole file once, so we can estimate how long a complete shuffle would take, because it requires Ceil(linesCount / batchSize) complete file reads.
As we now know the total linesCount, we can create an index array of linesCount size and shuffle it using Fisher–Yates (called orderArray in the code). This gives us the order in which we want the lines to appear in the shuffled file. Note that this is a global order over the whole file, not per batch or chunk.
Now the actual code. We need to get all lines from sourceFile in the order we just computed, but we can't read the whole file into memory. So we just split the task.
We go through sourceFile reading all lines, and keep in memory only those lines whose index falls within the first batchSize entries of orderArray. When we have all these lines, we write them to outFile in the required order, and that's batchSize/linesCount of the work done.
Next we repeat the whole process again and again, taking the next part of orderArray and reading sourceFile from start to end for each part. Eventually the whole orderArray is processed and we are done.
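A rough shell equivalent of the same multi-pass idea (this is not the tool's actual implementation; the file names and batch size are placeholders):
linesCount=$(wc -l < sourceFile)
batchSize=1000000
# orderArray: a random permutation of the source line numbers, i.e. the order
# in which source lines should appear in the shuffled output
# (note that shuf keeps the whole permutation in memory)
shuf -i 1-"$linesCount" > orderArray.txt
split -l "$batchSize" orderArray.txt batch_
# one sequential pass over sourceFile per batch: keep only the lines this
# batch needs in memory, then emit them in the batch's order
for b in batch_*; do
    awk 'NR==FNR { pos[$1] = ++n; next }      # source line number -> position within this batch
         FNR in pos { keep[pos[FNR]] = $0 }   # remember the wanted source lines
         END { for (i = 1; i <= n; i++) print keep[i] }' "$b" sourceFile
done > shuffledFile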
Why does it work?
Because all we do is read the source file from start to end. No seeks forward/backward, and that's what HDDs like. The file gets read in chunks according to internal HDD buffers, FS blocks, CPU cache, etc., and everything is read sequentially.
Some numbers
On my machine (Core i5, 16GB RAM, Win8.1, HDD Toshiba DT01ACA200 2TB, NTFS) I was able to shuffle a file of 132 GB (84 000 000 lines) in around 5 hours using batchSize of 3 500 000. With batchSize of 2 000 000 it took around 8 hours. Reading speed was around 118000 lines per second.