Why does liblzma fail to compress any random string?

I'm using the ruby binding, ruby-xz.
random_string = SecureRandom.random_bytes(100)
compressed_string = XZ.compress(random_string, compression_level = 9, check = :none, extreme = true)
compressed_string.size # => always 148
I've tested this tens of thousands of times, on strings of varying length.
I know that at least half of all strings are 1-incompressible (cannot be compressed by more than 1 bit), 3/4 of the strings are 2-incompressible, etc. (This follows from a counting argument.) This obviously says nothing about a lower bound on the number of compressible strings, but there are bound to be a few, aren't there?
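(To spell out the counting argument: there are 2^n bit strings of length n, but only 2^0 + 2^1 + ... + 2^(n-k-1) = 2^(n-k) - 1 outputs of length at most n-k-1 bits, so fewer than a 2^-k fraction of the inputs can be mapped losslessly to something more than k bits shorter. Hence at least 1/2 of all strings are 1-incompressible, at least 3/4 are 2-incompressible, and so on.)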

Explanation
There are a few reasons:
liblzma, when not in RAW mode, adds a header describing the dictionary size and a few other settings. That is one of the reasons it grows in size.
LZMA, like a lot of other compressors, uses a range encoder to encode the output of the dictionary compression (in essence a badass version of LZ77) in the least number of bits needed. So at the end of the bit stream, the last bits are padded out to a full byte.
You are compressing random noise, which, as you note, is hard to compress. The range encoder tries to find the least number of bits to encode the symbols output by the dictionary compression round. So in this case, there will be a lot of symbols. If there were one (or two) recurring patterns that LZMA found, it might save only a bit or two in the output, which, as explained in point 2, you cannot observe at the byte level. (See the quick check below.)
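A rough sketch of that quick check from the command line, assuming the xz command-line tool is available; exact byte counts depend on the xz version, so treat the numbers as approximate:
$ head -c 100 /dev/urandom | xz -9 -e --check=none -c | wc -c                 # full .xz container, comparable to what ruby-xz produces
$ head -c 100 /dev/urandom | xz -9 -e --format=raw -c 2>/dev/null | wc -c     # headerless raw stream
The first count should land in the general neighbourhood of the 148 bytes you observe, while the raw stream should stay within a few bytes of the 100-byte input.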
Experiment
Some small experiments for observing the overhead.
empty file with lzma in raw mode:
$ dd if=/dev/urandom bs=1k count=0 2>/dev/null | xz -9 -e --format=raw -c 2>/dev/null | wc -c
1
it needed at least one or two bits to say it reached the end of the stream, and this was padded to one byte
1k file filled with zeroes
$ dd if=/dev/zero bs=1k count=1 2>/dev/null | xz -9 -e --format=raw -c 2>/dev/null | wc -c
19
quite nice, but complexity-theory-wise still perhaps a few bytes too many (1024x'\0' would have been the optimal encoding)
1k file with all bits at 1
$ dd if=/dev/zero bs=1k count=1 2>/dev/null | sed 's/\x00/\xFF/g'| xz -9 -e --format=raw -c 2>/dev/null | wc -c
21
interestingly, xz compresses this a little worse than all zeroes, most likely related to the fact that the LZMA dictionary works on a bit level (which was one of the novel ideas of LZMA).
1k random file:
$ dd if=/dev/urandom bs=1k count=1 2>/dev/null | xz -9 -e --format=raw -c 2>/dev/null | wc -c
1028
so 4 bytes more than the input, still not bad.
1000 runs of 1k random files:
$ for i in {1..1000}; do dd if=/dev/urandom bs=1k count=1 2>/dev/null | xz -9 -e --format=raw -c 2>/dev/null | wc -c; done | sort | uniq -c
1000 1028
so every time, 1028 bytes needed.
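the same sweep can be repeated at the question's 100-byte size (a sketch in the same spirit as the runs above, with the raw-mode warning redirected away as before):
$ for i in {1..1000}; do head -c 100 /dev/urandom | xz -9 -e --format=raw -c 2>/dev/null | wc -c; done | sort -n | uniq -c
if the 1k pattern carries over, expect essentially every run to land a few bytes above 100 and none below it; random data simply gives the dictionary stage nothing to exploit.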

Related

Merge text files in a numeric order

I'm stuck with a problem. I would like to merge two text files based on their onset times.
For example:
Text1 (in column):
30
100
200
Text2 (in column):
10
50
70
My output should be 1 text file (single column) like this:
10
30
50
70
100
200
I can use cat or merge to combine files, but I'm not sure how to keep the onset times in order.
Thank you in advance for all your help!
Like this:
sort -n file1 file2
Most sort implementations (e.g. GNU coreutils, FreeBSD, OpenBSD, macOS, uutils) have a merge option for creating one sorted file from multiple files that are already sorted.
sort -m -n text1 text2
The only sort without such an option I could find is from busybox. But even that version tolerates an -m option, ignores it, sorts the files as usual, and therefore still gives the expected result.
I would have assumed that using -m doesn't really matter that much compared to just sorting the concatenated files like busybox does, since sorting algorithms should have optimizations for already sorted parts. However, a small test on my system with GNU coreutils 8.28 proved the contrary:
shuf -i 1-1000000000 -n 10000000 | sort -n > text1 # 10M lines, 95MB
shuf -i 1-1000000000 -n 10000000 | sort -n > text2
time sort -m -n text1 text2 | md5sum # real: 2.0s (used only 1 CPU core)
time sort -n text1 text2 | md5sum # real: 4.5s (used 2 CPU cores)
Although you could just pass both files to sort -n, it seems inelegant not to use the fact that your input files are already sorted. If that is indeed the case, you could do something like:
awk 'BEGIN{n = getline a < "text2"}
     { while (n > 0 && a+0 < $1+0) { print a; n = getline a < "text2" } }
     1
     END{ while (n > 0) { print a; n = getline a < "text2" } }' text1
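As a quick check with the sample numbers from the question (assuming they live in files named text1 and text2, as above), both approaches produce the same merged column:
printf '%s\n' 30 100 200 > text1
printf '%s\n' 10 50 70 > text2
sort -m -n text1 text2    # prints 10 30 50 70 100 200, one per line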

extracting last n percentage of a file output from zcat command

I am trying to extract the last 2 percent of the output of the zcat command. I tried something like
numlines=$(zcat file.tar.gz | wc -l)
zcat file.tar.gz | tail -n + $numlines*(98/100)
But the problem with this approach is that my file is too big, and I can't afford to run the zcat command twice. Is there some way I could do it, maybe by passing the number of lines along the pipe, or some other way?
EDIT :
The output of zcat file.tar.gz | tar -xO | dd 2>&1 | tail -n 1 is
16942224047 bytes (17 GB, 16 GiB) copied, 109.154 s, 155 MB/s
Any help would be greatly appreciated.
Read the content into a variable. I assume that there is enough RAM available.
content=$(zcat file.tar.gz | tar -xO)
lines=$(wc -l <<<"$content")
lastlines=$(( lines - lines*98/100 ))   # roughly the last 2% of the lines
tail -n "$lastlines" <<<"$content"
This only works if the file contains at least 100 lines.
The following awk program will keep only the last n% of your file in memory. The percentage is taken floor-wise; that is to say, if n% of the file represents 134.56 lines, it will print 134 lines:
awk -v n=2 '{a[FNR]=$0; min=FNR-int(FNR*n/100)}
{i=min; while(i in a) delete a[i--]}
END{for(i=min+1;i<=FNR;++i) print a[i]}' - < <(zcat file)
you can verify this by replacing zcat file with seq 100:
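with n=2 and 100 input lines it keeps int(100*2/100) = 2 lines, so that verification run should print exactly 99 and 100:
awk -v n=2 '{a[FNR]=$0; min=FNR-int(FNR*n/100)}
{i=min; while(i in a) delete a[i--]}
END{for(i=min+1;i<=FNR;++i) print a[i]}' - < <(seq 100)
99
100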

How to create a large file (requiring a long compression time) on Linux

I'm building a parallel job, so I'm trying to create dummy files and compress them in the background.
Like this
Create dummy file
for in ()
do
Compress that file &
done
wait
I need to create dummy data, so I tried
fallocate -l 1g test.txt
And
tar cfv test.txt
But this compression job finishes in just 5 seconds.
How can I create dummy data that is big and requires a long compression time (3 to 5 minutes)?
There are two things going on here. The first is that tar won't compress anything unless you pass it a z flag along with what you already have to trigger gzip compression:
tar czvf test.tar.gz test.txt
For a very similar effect, you can invoke gzip directly:
gzip test.txt
The second issue is that with most compression schemes, a gigantic string of zeros, which is likely what you are generating, is very easy to compress. You can fix that by supplying random data. On a Unix-like system you can use the pseudo-file /dev/urandom. This answer gives three options in decreasing order of preference, depending on what works; a combined sketch for the parallel-compression goal follows them:
head that understands suffixes like G for Gibibyte:
head -c 1G < /dev/urandom > test.txt
head that needs it spelled out:
head -c 1073741824 < /dev/urandom > test.txt
No head at all, so use dd, where file size is block size (bs) times count (1073741824 = 1024 * 1048576):
dd bs=1024 count=1048576 < /dev/urandom > test.txt
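Putting that together with the background loop from the question, a rough sketch along these lines should work, assuming GNU head and gzip are available; the file count and size are arbitrary placeholders:
for i in 1 2 3 4
do
    head -c 1G < /dev/urandom > "test$i.dat"   # random data resists compression
done
for i in 1 2 3 4
do
    gzip "test$i.dat" &                        # compress each file in the background
done
wait                                           # block until all compression jobs finish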
Something like this may work. There are some bash specific operators.
#!/bin/bash
function createCompressDelete()
{
    _rdmfile="$1"
    cat /dev/urandom > "$_rdmfile" &   # this writes to the file in the background
    pidcat=$!                          # save the backgrounded pid for later use
    echo "createCompressDelete::$_rdmfile::pid[$pidcat]"
    sleep 2
    while [ -f "$_rdmfile" ]
    do
        fsize=$(du "$_rdmfile" | awk '{print $1}')
        if (( $fsize < (1024*1024) )); then       # check whether the size has reached 1G (du reports 1K blocks)
            sleep 10
            echo -n "...$fsize"
        else
            kill "$pidcat"                            # kill the background cat
            tar czvf "${_rdmfile}".tar.gz "$_rdmfile" # compress
            rm -f "${_rdmfile}"                       # delete the created file
            rm -f "${_rdmfile}".tar.gz                # delete the tarball
        fi
    done
}

# Run for any number of files
for i in file1 file2 file3 file4
do
    createCompressDelete "$i" &> "$i".log &   # run it in the background
done

How to run multiple instances of this command dd

I want to run this command multiple times but I'm not sure how to do it. It has to be a single command. What this command does is push my CPU cores to 100%:
dd if=/dev/zero of=/dev/null
It's for an assignment. Please help if you can. Thank you.
This is what the assignment says; maybe it can be helpful to figure it out:
"Figure out how to run multiple instances of the command dd
if=/dev/zero of=/dev/null at the same time. You could also use the
command sum /dev/zero. You should run one instance per CPU core, so as
to push CPU utilization to 100% on all of the CPU cores in your
virtual machine. You should be able to launch all of the instances by
running a single command or pipeline as a regular user "
So far I tried doing
dd if=/dev/zero of=/dev/null | xargs -p2
but that doesn't do the job right
Your assignment is probably already due and over. But for future readers, here's a single line solution.
perl -e 'print "/dev/zero\n" x'$(nproc --all) | xargs -n 1 -P $(nproc --all) -I{} dd if={} of=/dev/null
How does this work? Let's dissect the pipeline.
nproc --all will return the number of cores in the system. Let's pretend your system has 4 cores.
perl -e 'print "/dev/zero\n" x 4' will print 4 lines of /dev/zero.
Output
/dev/zero
/dev/zero
/dev/zero
/dev/zero
The output of perl is then passed to xargs.
-n 1 tells xargs to use only one argument at a time.
-I {} tells xargs that the argument shall replace the occurrences of {}
-P 4 tells xargs to run as many as 4 instances of the command in parallel
A shorter version of the above command can be written like this:
perl -e 'print "if=/dev/zero of=/dev/null\n" x '$(nproc --all) | xargs -n2 -P0 dd
This will run 4 copies:
dd if=/dev/zero of=/dev/null | dd if=/dev/zero of=/dev/null | dd if=/dev/zero of=/dev/null | dd if=/dev/zero of=/dev/null
But it is really not recommended as a solution for homework, as it looks as if you do not understand what | does. Here nothing is being sent through the pipe. It has the advantage that it is easy to stop with a Ctrl-C.
If the goal is simply to increase carbon emissions then this is shorter:
burnP6 | burnP6 | burnP6 | burnP6
If you have GNU Parallel:
yes /dev/zero | parallel dd if={} of=/dev/null
yes | parallel burnP6
GNU Parallel starts by default 1 job per CPU core, and thus it only reads that many arguments from yes.
Many ways... for example, repeating the command four times:
command & command & command & command &
..or in a more systematic way:
for i in {1..4}
do
dd if=/dev/zero of=/dev/null &
done
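Whichever variant you use, the dd processes keep running until stopped; assuming they were launched as background jobs of your current shell (as in the loop above), they can all be cleaned up with:
kill $(jobs -p)    # terminates every background job of the current shell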
Or you could try my homemade parallel data transfer tool, pdd. This tool spawns several threads and each thread is bound to a CPU core.

How to show the first x bytes using hexdump?

I have two files and I want to see if their first 40 bytes are similar. How can I do this using hexdump?
If you are using the BSD hexdump utility (which will also be installed as hd, with a different default output format) then you can supply the -n40 command line parameter to limit the dump to the first 40 bytes:
hexdump -n40 filename
If you are using the POSIX-standard od, you need a capital N. You might find the following invocation useful:
od -N40 -w40 -tx1 -Ax filename
(You can do that with hexdump, too, but the format string is more work to figure out :) ).
Try this:
head -c 40 myfile | hexdump
Not sure why you need hexdump here,
diff <(dd bs=1 count=40 if=file1) <(dd bs=1 count=40 if=file2)
with hexdump:
diff <(dd bs=1 count=40 if=file1|hexdump) <(dd bs=1 count=40 if=file2|hexdump)
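If you only need to know whether the first 40 bytes match, rather than to inspect them, cmp can limit the comparison (assuming GNU cmp, where -n/--bytes sets a byte limit):
cmp -n 40 file1 file2 && echo "first 40 bytes are identical"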
