Calculate combined filesize of thousands of files - linux

We have a software package that performs tasks by assigning each batch of files a job number. Batches can contain any number of files. The files are then stored in a directory structure similar to this:
/asc/array1/.storage/10/10297/10297-Low-res.m4a
...
/asc/array1/.storage/3/3814/3814-preview.jpg
The filename is generated automatically. The first-level directory under .storage is the thousands part of the file number (the file number with its last three digits dropped).
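For illustration only (a hypothetical one-liner, not part of the package), the storage path for a given file number could be derived like this:

filenum=10297
# 10297/1000 = 10, matching /asc/array1/.storage/10/10297/ above
echo "/asc/array1/.storage/$(( filenum / 1000 ))/${filenum}/"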
There is also a database which associates the job number and the file number with the client in question. Running a SQL query, I can list out the job number, client and the full path to the files. Example:
213 sample-data /asc/array1/.storage/10/10297/10297-Low-res.m4a
...
214 client-abc /asc/array1/.storage/3/3814/3814-preview.jpg
My task is to calculate the total storage being used per client. So, I wrote a quick and dirty bash script to iterate over every single row, run du on the file, and add the size to an associative array. I then plan to echo this out or produce a CSV file for ingest into PowerBI or some other tool. Is this the best way to handle this? Here is a copy of the script as it stands:
#!/bin/bash
declare -A clientArr
# 1 == Job Num
# 2 == Client
# 3 == Path
while read -r line; do
    client=$(echo "$line" | awk '{ print $2 }')
    path=$(echo "$line" | awk '{ print $3 }')
    if [ -f "$path" ]; then
        size=$(du -s "$path" | awk '{ print $1 }')
        clientArr[$client]=$(( ${clientArr[$client]:-0} + size ))
    fi
done < /tmp/pm_report.txt
for key in "${!clientArr[@]}"; do
    echo "$key,${clientArr[$key]}"
done

Assuming:
you have GNU coreutils du
the filenames do not contain whitespace
This has no shell loops, calls du once, and iterates over the pm_report file twice.
file=/tmp/pm_report.txt
awk '{printf "%s\0", $3}' "$file" \
| du -s --files0-from=- 2>/dev/null \
| awk '
NR == FNR {du[$2] = $1; next}
{client_du[$2] += du[$3]}
END {
OFS = "\t"
for (client in client_du) print client, client_du[client]
}
' - "$file"
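If the end goal is the CSV mentioned in the question, the tab-separated result above can be converted in one more step. A sketch (client_usage.tsv and client_usage.csv are made-up names, and the sizes are in du's default 1 KiB units):

# assuming the pipeline above was redirected to client_usage.tsv
awk -F '\t' 'BEGIN { OFS = ","; print "client,size_kb" } { print $1, $2 }' client_usage.tsv > client_usage.csv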

Using file foo:
$ cat foo
213 sample-data foo # this file
214 client-abc bar # some file I had in the dir
215 some nonexistent # didn't have this one
and the awk:
$ gawk '                          # using GNU awk
@load "filefuncs"                 # for this default extension
!stat($3,statdata) {              # "returns zero upon success"
    a[$2]+=statdata["size"]       # get the size and update array
}
END {                             # in the end
    for(i in a)                   # iterate all
        print i,a[i]              # and output
}' foo foo                        # running twice for testing array grouping
Output:
client-abc 70
sample-data 18
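One thing to note when comparing the two approaches: stat() reports the apparent size in bytes, while du by default reports allocated disk space in 1 KiB blocks. With GNU du (assumed in the first answer) you can get byte counts for a fair comparison, for example:

du -s --apparent-size --block-size=1 /asc/array1/.storage/10/10297/10297-Low-res.m4a   # apparent size in bytes (comparable to stat)
du -s --block-size=1 /asc/array1/.storage/10/10297/10297-Low-res.m4a                   # allocated space in bytes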

Bash function with input fails awk command

I am writing a function in a Bash shell script that should return lines from CSV files with headers, where a line has more commas than the header. This can happen because some values inside these files can contain commas. For quality control, I must identify these lines so I can clean them up later. What I have currently:
#!/bin/bash
get_bad_lines () {
local correct_no_of_commas=$(head -n 1 $1/$1_0_0_0.csv | tr -cd , | wc -c)
local no_of_files=$(ls $1 | wc -l)
for i in $(seq 0 $(( ${no_of_files}-1 )))
do
# Check that the file exist
if [ ! -f "$1/$1_0_${i}_0.csv" ]; then
echo "File: $1_0_${i}_0.csv not found!"
continue
fi
# Search for error-lines inside the file and print them out
echo "$1_0_${i}_0.csv has over $correct_no_of_commas commas in the following lines:"
grep -o -n '[,]' "$1/$1_0_${i}_0.csv" | cut -d : -f 1 | uniq -c | awk '$1 > $correct_no_of_commas {print}'
done
}
get_bad_lines products
get_bad_lines users
The output of this program is currently all of the comma counts with all of the line numbers from all of the files, and I suspect this is because the input $1 (the folder name, i.e. products & users) conflicts with the $1 referenced in the call to awk (where I want to grab the first column, i.e. the count of commas for that line of the current file in the loop).
Is this the issue? And if so, could it be solved by referring to the first column and the folder name with different variable names instead of both of them using $1?
Example, current output:
5 6667
5 6668
5 6669
5 6670
(should only show lines for that file having more than 5 commas).
I tried variable declaration in the call to awk as well, with the same effect (as in the accepted answer to Awk field variable clash with function argument):
get_bad_lines () {
local table_name=$1
local correct_no_of_commas=$(head -n 1 $table_name/${table_name}_0_0_0.csv | tr -cd , | wc -c)
local no_of_files=$(ls $table_name | wc -l)
for i in $(seq 0 $(( ${no_of_files}-1 )))
do
# Check that the file exist
if [ ! -f "$table_name/${table_name}_0_${i}_0.csv" ]; then
echo "File: ${table_name}_0_${i}_0.csv not found!"
continue
fi
# Search for error-lines inside the file and print them out
echo "${table_name}_0_${i}_0.csv has over $correct_no_of_commas commas in the following lines:"
grep -o -n '[,]' "$table_name/${table_name}_0_${i}_0.csv" | cut -d : -f 1 | uniq -c | awk -v table_name="$table_name" '$1 > $correct_no_of_commas {print}'
done
}
You can use awk all the way to achieve that:
get_bad_lines () {
find "$1" -maxdepth 1 -name "$1_0_*_0.csv" | while read -r my_file ; do
awk -v table_name="$1" '
NR==1 { num_comma=gsub(/,/, ""); }
/,/ { if (gsub(/,/, ",", $0) > num_comma) wrong_array[wrong++]=NR":"$0;}
END { if (wrong > 0) {
print(FILENAME" has over "num_comma" commas in the following lines:");
for (i=0;i<wrong;i++) { print(wrong_array[i]); }
}
}' "${my_file}"
done
}
As for why your original awk command failed to print only the lines with too many commas: you are using the shell variable correct_no_of_commas inside a single-quoted awk statement ('$1 > $correct_no_of_commas {print}'). Thus there is no substitution by the shell, and awk reads $correct_no_of_commas as-is. More precisely, awk looks for an awk variable named correct_no_of_commas, which is undefined in the awk script and therefore an empty string. An empty string is numerically 0, so awk effectively evaluates $1 > $0 as the matching condition, i.e. it compares the count in $1 against the full input line, and that comparison ends up being true for every line of the uniq -c output. That is why every line is printed.
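A minimal sketch of the usual fix, keeping the rest of the function from the second attempt above, is to pass the shell value in with awk -v so the threshold becomes an awk variable instead of a field reference:

grep -o -n ',' "$table_name/${table_name}_0_${i}_0.csv" \
    | cut -d : -f 1 | uniq -c \
    | awk -v max="$correct_no_of_commas" '$1 > max { print }'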
You can identify all the bad lines with a single awk command
awk -F, 'FNR==1{print FILENAME; headerCount=NF;} NF>headerCount{print} ENDFILE{print "#######\n"}' /path/here/*.csv
If you want the line number also to be printed, use this
awk -F, 'FNR==1{print FILENAME"\nLine#\tLine"; headerCount=NF;} NF>headerCount{print FNR"\t"$0} ENDFILE{print "#######\n"}' /path/here/*.csv

How to find a list of words (thousands) in a list of tsv files (hundreds), with output as the number of matches for each string in each file, in Linux?

I have hundreds of tsv files with the following structure (example):
GH1 123 family1
GH2 23 family2
.
.
.
GH4 45 family4
GH6 34 family6
And I have a text file with a list of words (thousands):
GH1
GH2
GH3
.
.
.
GH1000
I want output that contains the number of times each word occurs in each file, like this:
GH1 GH2 GH3 ... GH1000
filename1 1 1 0... 4
.
.
.
filename2 2 3 1... 0
I tried this code, but it only gives me zeros:
for file in *.tsv; do
echo $file >> output.tsv
cat fore.txt | while read line; do
awk -F "\\t" '{print $1}' $file | grep -wc $line >>output.tsv
echo "\\t">>output.tsv;
done ;
done
Use the following script.
Just redirect stdout to an output.txt file.
#!/bin/bash
while read -r p; do
    echo -n "$p "
done <words.txt
echo ""
for file in *.tsv; do
    echo -n "$file = "
    while read -r p; do
        # double quotes so $p is expanded; the sed puts every match on its own line
        COUNT=$(sed "s/$p/$p\n/g" "$file" | grep -c "$p")
        echo -n "$COUNT "
    done <words.txt
    echo ""
done
Here is a simple Awk script which collects a list like the one you describe.
awk 'NR==FNR { a[$1] = n = FNR              # remember each word and its column number
               printf "\t%s", $1; next }    # ...and emit it on the header line
     FNR==1  { if (f) {                     # a new file starts: flush the previous one
                   printf "%s", f
                   for (i=1; i<=n; i++) printf "\t%s", 0+b[i]
               }
               printf "\n"
               delete b
               f = FILENAME }
     $1 in a { b[a[$1]]++ }' fore.txt *.tsv /etc/motd   # count hits, keyed by column number
To avoid repeating the big block in END, we add a short sentinel file at the end whose only purpose is to supply a file after the last whose counts will not be reported.
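If GNU awk is available, a roughly equivalent sketch (an untested adaptation, not the original answer) can use ENDFILE instead and drop the sentinel file:

gawk 'NR==FNR { a[$1] = n = FNR; printf "\t%s", $1; next }   # header row from the word list
      FNR==1 && !h { printf "\n"; h = 1 }                    # terminate the header line once
      $1 in a { b[a[$1]]++ }                                 # count words found in this file
      ENDFILE { if (NR > FNR) {                              # skip the word-list file itself
                    printf "%s", FILENAME
                    for (i=1; i<=n; i++) printf "\t%s", 0+b[i]
                    printf "\n"; delete b
                } }' fore.txt *.tsv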
The shell's while read loop is slow and inefficient and somewhat error-prone (you basically always want read -r and handling incomplete text files is hairy); in addition, the brute-force method will require reading the word file once per iteration, which incurs a heavy I/O penalty.
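For reference, the safer read idiom looks roughly like this (a minimal sketch, still a slow shell loop):

while IFS= read -r word; do
    printf '%s\n' "$word"    # placeholder for per-word processing; -r keeps backslashes intact
done < fore.txt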

Speeding up separation of large text file based on line content in Bash

I have a very large text file (around 20 GB and 300 million lines), which contains three columns separated by tabs:
word1 word2 word3
word1 word2 word3
word1 word2 word3
word1 word2 word3
word1, word2, and word3 are different in each line. word3 specifies the class of the line, and repeats often for different lines (having thousands of different values). The goal is to separate the file by the line class (word3). I.e. word1 and word2 should be stored in a file called word3, for all the lines. For example, for the line:
a b c
the string "a b" should be appended to the file called c.
Now I know how this can be done with a while loop, reading line by line of a file, and appending the proper file for each line:
while IFS='' read -r line || [[ -n "$line" ]]; do
    # Variables
    read -a line_array <<< "${line}"
    word1=${line_array[0]}
    word2=${line_array[1]}
    word3=${line_array[2]}
    # Adding word1 and word2 to file word3
    echo "${word1} ${word2}" >> "${word3}"
done < "inputfile"
It works, but it is very slow (even though I have a fast workstation with an SSD). How can this be sped up? I have already tried carrying out this procedure in /dev/shm, and I split the file into 10 pieces and ran the above script in parallel on each piece. But it is still quite slow. Is there a way to further speed this up?
Let's generate an example file:
$ seq -f "%.0f" 3000000 | awk -F $'\t' '{print $1 FS "Col_B" FS int(2000*rand())}' >file
That generates a 3 million line file with 2,000 different values in column 3 similar to this:
$ head -n 3 file; echo "..."; tail -n 3 file
1 Col_B 1680
2 Col_B 788
3 Col_B 1566
...
2999998 Col_B 1562
2999999 Col_B 1803
3000000 Col_B 1252
With a simple awk you can generate the files you describe this way:
$ time awk -F $'\t' '{ print $1 " " $2 >> $3; close($3) }' file
real 3m31.011s
user 0m25.260s
sys 3m0.994s
So that awk will generate the 2,000 group files in about 3 minutes 31 seconds. Certainly faster than Bash, but this can be made faster by presorting the file by the third column and writing each group file in one go.
You can use the Unix sort utility in a pipe and feed the output to a script that separates the sorted groups into different files. Use the -s (stable sort) option together with -k3 so that only the value of the third field determines the ordering of the lines.
Since we can assume sort has partitioned the file into groups based on column 3 of the file, the script only needs to detect when that value changes:
$ time sort -s -k3 file | awk -F $'\t' 'fn != ($3 "") { close(fn); fn = $3 } { print $1 " " $2 > fn }'
real 0m4.727s
user 0m5.495s
sys 0m0.541s
Because of the efficiency gained by presorting, the same net process completes in 5 seconds.
If you are sure that the 'words' in column 3 are ascii only (ie, you do not need to deal with UTF-8), you can set LC_ALL=C for additional speed:
$ time LC_ALL=C sort -s -k3 file | awk -F $'\t' 'fn != ($3 "") { close(fn); fn = $3 } { print $1 " " $2 > fn }'
real 0m3.801s
user 0m3.796s
sys 0m0.479s
From comments:
1) Please add a line to explain why we need the bracketed expression in fn != ($3 ""):
The awk construct fn != ($3 "") {action} is an effective shortcut for fn != $3 || fn == "" {action}; use the one you consider most readable.
2) Not sure if this also works if the file is larger than the available memory, so this might be a limiting factor:
I ran the first and the last awk with 300 million records and 20,000 output files. The last one with sort did the task in 12 minutes. The first took 10 hours...
It may be that the sort version actually scales better, since opening, appending to, and closing 20,000 files 300 million times takes a while. It is more efficient to batch up similar data and stream it in one go.
3) I was thinking about sort earlier, but then felt it might not be the fastest because we have to read the whole file twice with this approach:
This is the case for purely random data; if the actual data is somewhat ordered, there is a tradeoff with reading the file twice. The first awk would be significantly faster with less random data, but then it also takes time to determine whether the file is sorted. If you know the file is mostly sorted, use the first; if it is likely somewhat disordered, use the last.
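If it helps with that decision, GNU sort can check orderedness cheaply via its -c/-C options. A sketch, assuming the example file generated above:

if sort -s -k3 -C file; then
    echo "already ordered on column 3 - the plain awk pass may be enough"
else
    echo "not ordered - presorting is probably worth it"
fi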
You can use awk:
awk -F $'\t' '{ print $1 " " $2 >> $3; close($3) }' file
This solution uses GNU parallel, but may be tuned with the other awk solutions. Also it has a nice progress bar:
parallel -a data_file --bar 'read -a arr <<< {}; echo "${arr[0]} ${arr[1]}" >> ${arr[2]}'
Use awk for example:
awk -F '\t' '{ print $1 FS $2 > $3 }' FILES
Or this Perl script (written by me); I won't repost it here, as it is a bit longer. awk should be somewhat slower, as it (re)opens the files for every line, but this is better than the Perl script whenever you have more than 250 different values/output files (or whatever limit your OS has on the number of simultaneously open filehandles). The Perl script tries to hold all input data in memory, which is much faster but can be problematic for large inputs.
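As a rough way to see where that limit sits on a given system (standard bash/ulimit usage, nothing specific to this answer):

ulimit -n        # current soft limit on open file descriptors for this shell
ulimit -n 4096   # raise it for the current session, if the hard limit allows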
The solution for a large number of output files was posted by user oguzismail:
awk '{ print $1 FS $2 >> $3; close($3) }' file
This (re)opens the output file for every line, so it won't run into the issue of having too many output filehandles open at the same time. (Re)opening the file might be slower, but reportedly isn't.
Edit: Fixed awk invocation - it printed the whole line to the output, instead of the first two columns.
Your question is very similar in nature to Is it possible to parallelize awk writing to multiple files through GNU parallel?
If your disk can handle it:
splitter() {
    mkdir -p dir-"$1"
    cd dir-"$1" || exit
    awk -F $'\t' '{ print $1 " " $2 >> $3; close($3) }'
}
export -f splitter
# Do the splitting in each dir
parallel --pipepart -a myfile --block -1 splitter {%}
# Merge the results
parallel 'cd {}; ls' ::: dir-* | sort -u | parallel 'cat dir-*/{} > {}'
# Cleanup dirs
rm -r dir-*/

How should I count the duplicate lines in each file?

I have tried this:
dirs=$1
for dir in $dirs
do
ls -R $dir
done
Like this?:
$ cat > foo
this
nope
$ cat > bar
neither
this
$ sort *|uniq -c
1 neither
1 nope
2 this
and weed out the ones with just 1s:
... | awk '$1>1'
2 this
Use sort with uniq to find the duplicate lines.
#!/bin/bash
dirs=("$@")
for dir in "${dirs[@]}" ; do
cat "$dir"/*
done | sort | uniq -c | sort -n | tail -n1
uniq -c will prepend the number of occurrences to each line
sort -n will sort the lines by the number of occurrences
tail -n1 will only output the last line, i.e. the maximum. If you want to see all the lines with the same number of duplicates, add the following instead of tail:
perl -ane 'if ($F[0] == $n) { push @buff, $_ }
           else { @buff = ($_) }
           $n = $F[0];
           END { print for @buff }'
You could use awk. If you just want to "count the duplicate lines", we could infer that you're after "all lines which have appeared earlier in the same file". The following would produce these counts:
#!/bin/sh
for file in "$@"; do
    if [ -s "$file" ]; then
        awk '$0 in a {c++} {a[$0]} END {printf "%s: %d\n", FILENAME, c}' "$file"
    fi
done
The awk script first checks whether the current line is already stored in the array a, and if it is, increments a counter. It then adds the line to the array. At the end of the file, we print the total.
Note that this might have problems on very large files, since the entire input file needs to be read into memory in the array.
Example:
$ printf 'foo\nbar\nthis\nbar\nthat\nbar\n' > inp.txt
$ awk '$0 in a {c++} {a[$0]} END {printf "%s: %d\n", FILENAME, c}' inp.txt
inp.txt: 2
The word 'bar' exists three times in the file, thus there are two duplicates.
To aggregate multiple files, you can just feed multiple files to awk:
$ printf 'foo\nbar\nthis\nbar\n' > inp1.txt
$ printf 'red\nblue\ngreen\nbar\n' > inp2.txt
$ awk '$0 in a {c++} {a[$0]} END {print c}' inp1.txt inp2.txt
2
Here, the word 'bar' appears twice in the first file and once in the second file, a total of three times, so there are again two duplicates.
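If the files are too large for the in-memory array, a rough alternative (a sketch, not one of the answers above) is to reuse the sort | uniq -c idea and sum up everything beyond the first occurrence of each line:

# total duplicates = total lines minus distinct lines
sort inp1.txt inp2.txt | uniq -c | awk '{ dups += $1 - 1 } END { print dups + 0 }'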

GREP command in loop

I have about 3000 files in a folder. My files have data as given below:
VISITERM_0 VISITERM_20 VISITERM_35 ..... and so on
The files do not all have the same values as above; the values vary from 0 to 99.
I want to find out how many files in the folder have each of the VISITERMS. For example, if VISITERM_0 is present in 300 files in the folder, then I need it to print
VISITERM_0 300
Similarly, if there are 1000 files that contain VISITERM_1, I need it to print
VISITERM_1 1000
So, I want to print the VISITERMs and the number of files that contain them, from VISITERM_0 through VISITERM_99.
I made use of grep command which is
grep VISITERM_0 * -l | wc -l
However, this is for a single term, and I want to loop from VISITERM_0 to VISITERM_99. Please help!
#!/bin/bash
# ^^- the above is important; #!/bin/sh would allow only POSIX syntax
# use a C-style for loop, which is a bash extension
for ((i=0; i<100; i++)); do
# Calculate number of matches...
num_matches=$(find . -type f -exec grep -l -e "VISITERM_$i" '{}' + | wc -l)
# ...and print the result.
printf 'VISITERM_%d\t%d\n' "$i" "$num_matches"
done
Here is a GNU awk solution (GNU due to multiple characters in RS) that should do it:
awk -v RS=" |\n" '{n=split($1,a,"VISITERM_");if (n==2 && a[2]<100) b[a[2]]++} END {for (i in b) print "VISITERM_"i,b[i]}' *
Example:
cat file1
VISITERM_0 VISITERM_320 VISITERM_35
cat file2
VISITERM_0 VISITERM_20 VISITERM_32
VISITERM_20 VISITERM_42 VISITERM_11
Gives:
awk -v RS=" |\n" '{n=split($1,a,"VISITERM_");if (n==2 && a[2]<100) b[a[2]]++} END {for (i in b) print "VISITERM_"i,b[i]}' file*
VISITERM_0 2
VISITERM_11 1
VISITERM_20 2
VISITERM_32 1
VISITERM_35 1
VISITERM_42 1
How it works:
awk -v RS=" |\n" '            # Set the record separator to space or newline
{n=split($1,a,"VISITERM_")    # Split the record on "VISITERM_" and store the number of resulting parts in "n"
if (n==2 && a[2]<100)         # If "n" is 2 (the record does contain "VISITERM_") and the number is less than 100
b[a[2]]++}                    # Count the hits for each number and store them in array "b"
END {for (i in b)             # Walk through array "b"
print "VISITERM_"i,b[i]}      # Print the hits
' file*                       # Read the files
PS
If everything is only on one line, change to RS=" ". Then it should work with most awk implementations.
