How to create a large file (requiring a long compression time) on Linux

I'm building a parallel job, so I'm trying to create dummy files and compress them in the background, like this:
Create dummy file
for ... in ...
do
    Compress that file &
done
wait
I need to create the dummy data, so I tried
fallocate -l 1g test.txt
and
tar cfv test.txt
but the compression job finishes in just 5 seconds.
How can I create dummy data big enough to require a long compression time (3 to 5 minutes)?

There are two things going on here. The first is that tar won't compress anything unless you pass it a z flag to trigger gzip compression; you also need to name both the archive and the file going into it:
tar czvf test.txt.tar.gz test.txt
For a very similar effect, you can invoke gzip directly:
gzip test.txt
The second issue is that with most compression schemes, a gigantic string of zeros, which is likely what you're generating, is very easy to compress. You can fix that by supplying random data. On a Unix-like system you can use the pseudo-file /dev/urandom. This answer gives three options in decreasing order of preference, depending on what your system supports:
A head that understands suffixes like G for gibibyte:
head -c 1G < /dev/urandom > test.txt
A head that needs the size spelled out in bytes:
head -c 1073741824 < /dev/urandom > test.txt
No head at all, so use dd, where the file size is block size (bs) times count (1073741824 = 1024 * 1048576):
dd bs=1024 count=1048576 < /dev/urandom > test.txt
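Putting that together with the background-compression loop sketched in the question gives something like the following. This is a minimal sketch, assuming a head that understands the 1G suffix and that gzip is the compressor; the file names are placeholders:
#!/bin/bash
# Create a few 1G files of random data, then compress them in parallel.
for i in 1 2 3 4; do
    head -c 1G < /dev/urandom > "test$i.dat"
done

for i in 1 2 3 4; do
    gzip "test$i.dat" &    # each compression runs in the background
done
wait                       # block until every background gzip has finished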

Something like this may work. It uses some bash-specific operators.
#!/bin/bash
function createCompressDelete()
{
    _rdmfile="$1"
    cat /dev/urandom > "$_rdmfile" &  # write random data to the file in the background
    pidcat=$!                         # save the backgrounded pid for later use
    echo "createCompressDelete::$_rdmfile::pid[$pidcat]"
    sleep 2
    while [ -f "$_rdmfile" ]
    do
        fsize=$(du "$_rdmfile" | awk '{print $1}')
        if (( fsize < (1024*1024) )); then  # check whether the file has reached 1G (du reports 1K blocks)
            sleep 10
            echo -n "...$fsize"
        else
            kill "$pidcat"                            # stop the background cat
            tar czvf "${_rdmfile}".tar.gz "$_rdmfile" # compress
            rm -f "${_rdmfile}"                       # delete the created file
            rm -f "${_rdmfile}".tar.gz                # delete the tarball
        fi
    done
}
# Run for any number of files
for i in file1 file2 file3 file4
do
createCompressDelete "$i" &> "$i".log & # run it in the background
done

Related

How to touch all files that are returned by a sorted ls?

If I have the following:
ls|sort -n
How would I touch all those files in the order of the sorted files? Something like:
ls|sort -n|touch
What would be the proper syntax? Note that I need to touch the files in the exact order they're sorted, as I'm trying to order these files for a FAT reader with minimal metadata reading.
ls -1tr | while read file; do touch "$file"; sleep 1; done
If you want to preserve the distance in modification time from one file to the next, then call this instead:
upmodstamps() {
    oldest_elapsed=$(( $(date +%s) - $(stat -c %Y "$(ls -1tr | head -1)") ))
    for file in *; do
        oldstamp=$(stat -c %Y "$file")
        newstamp=$(( oldstamp + oldest_elapsed ))
        newstamp_fmt=$(date --date=@"${newstamp}" +'%Y%m%d%H%M.%S')
        touch -t "${newstamp_fmt}" "$file"
    done
}
Note: the date and stat usage assumes GNU coreutils.
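A possible invocation, assuming the function has been sourced into the current shell and that /path/to/sorted/files is a placeholder for the directory whose files you want to restamp:
cd /path/to/sorted/files && upmodstamps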
You can use this command:
(ls | sort -n >> list.txt)
touch $(cat list.txt)
Or:
touch $(ls /path/to/dir | sort -n)
Or, if you want to copy the listed files instead of creating empty files, use this command:
cp $(cat list.txt) ./DirectoryWhereYouWantToCopy
Try it like this:
touch $(ls | sort -n)
Can you give a few file names?
If you have file names with numbers, such as 1file, 10file, 11file .. 20file, then you need to use --general-numeric-sort:
ls | sort --general-numeric-sort --output=../workingDirectory/sortedFiles.txt
cat sortedFiles.txt
1file
10file
11file
12file
20file
Then move sortedFiles.txt into your working directory or wherever you want.
touch $(cat ../workingDirectory/sortedFiles.txt)
This will create empty files with the exact same names.
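Combining the numeric sort with the ordered, delayed touch from the first answer gives one possible end-to-end sketch, assuming a 1-second gap between timestamps is acceptable and that the file names contain no spaces:
ls | sort --general-numeric-sort | while read -r file; do
    touch "$file"   # touched in sorted order
    sleep 1         # keep the modification times strictly increasing
done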

Search, match and copy directories into another based on names in a txt file

My goal is to copy a bulk of specific directories whose names are in a txt file, as follows:
$ cat names.txt
raw1
raw2
raw3
raw4
raw5
These directories have subdirectories, hence it is important to copy all the contents. When I list in my terminal it looks like this:
$ ls -l
raw3
raw7
raw1
raw8
raw5
raw6
raw2
raw4
To perform this task, I have tried the following:
cat names.txt | while read line; do grep -l '$line' | xargs -r0 cp -t <desired_destination>; done
But I get this error:
cp: cannot stat No such file or directory
I suppose it's because the names in the list file (names.txt) aren't in the same order as the ones shown in the terminal. Notice that they are unsorted, and using while read line doesn't work. Thank you for taking the time and commitment to help me.
I'm having problems following the logic of the current code, so in the name of K.I.S.S. I propose:
tgtdir=/my/target/directory

while read -r srcdir
do
    [[ -d "${srcdir}" ]] && cp -rp "${srcdir}" "${tgtdir}"
done < <(tr -d '\r' < names.dat)
NOTES:
the < <(tr -d '\r' < names.dat) is used to remove Windows/DOS line endings from names.dat (per comments from the OP); if names.dat is updated to remove the \r, then the tr -d will be a no-op (i.e., a bit of overhead to spawn the subprocess, but the script will still read names.dat correctly)
assumes the script is run from the directory where the source directories reside; otherwise the code can be modified to either cd to that directory or prefix the ${srcdir} references with it (see the sketch after these notes)
the OP can add/modify the cp flags as needed, but I'm assuming at a minimum -r will be needed in order to recursively copy the directories
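A sketch of the cd variant mentioned in the notes; the source path is a placeholder, not part of the original answer:
tgtdir=/my/target/directory
srcbase=/path/to/source/directories   # hypothetical location of the raw* directories
cd "$srcbase" || exit 1
while read -r srcdir
do
    [[ -d "${srcdir}" ]] && cp -rp "${srcdir}" "${tgtdir}"
done < <(tr -d '\r' < names.dat)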
UUoC (Useless Use of Cat).
cat names.txt | while read line; do ...; done
is better written
while read line; do ...; done < names.txt
The grep -l '$line' | is eating your input:
printf "%s\n" 1 2 3 | while read line; do echo "Read: [$line]"; grep . | cat; done
Read: [1]
2
3
In your case, it is likely finding no lines that match the literal string $line, which you have embedded in single-quote marks; single quotes prevent the variable from being expanded. Use "$line" in double quotes (and avoid capitals). And -l wouldn't be helpful even if it did match:
$: printf "%s\n" 1 2 3 | grep -l .
(standard input)
You didn't tell it what to read from, so -l is pointless since it's reading the same stdin stream that the read is.
I think what you want is a little simpler -
xargs cp -Rt /your/desired/target/directory/ < names.txt
Assuming you wanted to leave the originals where they were.
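If the directory names could contain spaces, a variant that feeds xargs one name per line may be safer; this is a sketch assuming GNU xargs (for -d) and GNU cp (for -t):
xargs -d '\n' cp -Rt /your/desired/target/directory/ < names.txt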

How to append the output of Parallel Grep to a file?

I have a file of 500 MB and a pattern file of 20 MB. Since it was taking too much time to grep the 1.2 million patterns against the file with 5 million lines, I split the pattern file into 100 parts.
I tried to run grep in parallel with the multiple pattern files as below.
for pat1 in vailtar_*
do
parallel --block 75M --pipe grep $pat1 infile >> outfile
done;
But I cannot get the output to append to a file. I tried without the block option and as below too -
cat infile | parallel --block 75M --pipe grep $pat1 >> outfile
< infile parallel --block 75M --pipe grep $pat1 >> outfile
Is there any way to make the parallel grep append the output to a file?
Thanks in advance.
Perhaps it will work better like this?
for pat1 in vailtar_*
do
parallel --block 75M --pipe grep -f $pat1 < infile
done > outfile
That will take all the output from everything inside the for loop, and put it in outfile.
Incidentally, I think you meant to use infile as stdin instead of as an argument to grep, and I think you meant -f $pat1, not just the filename as the pattern. I've fixed both issues in my version.
However, if I were trying to solve this problem I might do it like this:
parallel 'grep -f {} infile' ::: vailtar_*
(I've not tested that.)
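To collect the combined output in one file, the same one-liner can be redirected as a whole (equally untested):
parallel 'grep -f {} infile' ::: vailtar_* > outfile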

Merge sort gzipped files

I have 40 files of 2GB each, stored on an NFS architecture. Each file contains two columns: a numeric id and a text field. Each file is already sorted and gzipped.
How can I merge all of these files so that the resulting output is also sorted?
I know sort -m -k 1 should do the trick for uncompressed files, but I don't know how to do it directly with the compressed ones.
PS: I don't want the simple solution of uncompressing the files onto disk, merging them, and compressing again, as I don't have sufficient disk space for that.
This is a use case for process substitution. Say you have two files to sort, sorta.gz and sortb.gz. You can give the output of gunzip -c FILE.gz to sort for both of these files using the <(...) shell operator:
sort -m -k1 <(gunzip -c sorta.gz) <(gunzip -c sortb.gz) >sorted
Process substitution substitutes a command with a file name that represents the output of that command, and is typically implemented with either a named pipe or a /dev/fd/... special file.
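For illustration, echoing a process substitution shows the file name the shell hands to the command (the exact path varies by system):
echo <(true)    # prints something like /dev/fd/63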
For 40 files, you will want to create the command with that many process substitutions dynamically, and use eval to execute it:
cmd="sort -m -k1 "
for input in file1.gz file2.gz file3.gz ...; do
cmd="$cmd <(gunzip -c '$input')"
done
eval "$cmd" >sorted # or eval "$cmd" | gzip -c > sorted.gz
Another approach is an iterative pairwise merge that keeps all data compressed on disk:
#!/bin/bash
FILES=file*.gz        # list of your 40 gzip files (e.g. file1.gz ... file40.gz)
WORK1="merged.gz"     # first temp file and the final file
WORK2="tempfile.gz"   # second temp file

> "$WORK1"                     # create empty final file
> "$WORK2"                     # create empty temp file
gzip -qc "$WORK2" > "$WORK1"   # compress the content of the empty second file into the first temp file

for I in $FILES; do
    echo current file: "$I"
    sort -k 1 -m <(gunzip -c "$I") <(gunzip -c "$WORK1") | gzip -c > "$WORK2"
    mv "$WORK2" "$WORK1"
done
Fill $FILES either with bash globbing (file*.gz), as shown, or with a list of the 40 filenames separated by blanks. The files in $FILES stay unchanged.
Finally, the 80 GB of data end up compressed in $WORK1. While this script runs, no uncompressed data is written to disk.
Adding a differently flavoured multi-file merge within a single pipeline: it takes all (pre-sorted) files in $OUT/uniques, sort-merges them and compresses the output; lz4 is used due to its speed:
find $OUT/uniques -name '*.lz4' |
awk '{print "<( <" $0 " lz4cat )"}' |
tr "\n" " " |
(echo -n sort -m -k3b -k2 " "; cat -; echo) |
bash |
lz4 \
> $OUT/uniques-merged.tsv.lz4
It is true there are zgrep and other common utilities that play with compressed files, but in this case you need to sort/merge uncompressed data and compress the result.

wc gzipped files?

I have a directory with both uncompressed and gzipped files and want to run wc -l on this directory. wc will report a line count for the compressed files which is not accurate (since it counts newlines in the gzipped version of the file). Is there a way to create a zwc script, similar to zgrep, that will detect the gzipped files and count the uncompressed lines?
Try this zwc script:
#! /bin/bash --
for F in "$@"; do
    echo "$(zcat -f < "$F" | wc -l) $F"
done
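A possible invocation over a mixed directory; since zcat -f passes uncompressed input through unchanged, plain and gzipped files can be listed together (the file names are placeholders):
./zwc *.txt *.gz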
You can use zgrep to count lines as well (or rather the beginnings of lines):
zgrep -c ^ file.txt
I also use cat file_name | gzip -d | wc -l
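The same pipeline without the extra cat (a minor variation):
gzip -dc file_name | wc -l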
