I have about 3000 files in a folder. My files have data as given below:
VISITERM_0 VISITERM_20 VISITERM_35 ..... and so on
Not every file has the same set of values as above; the numbers vary from 0 to 99.
I want to find out how many files in the folder have each of the VISITERMS. For example, if VISITERM_0 is present in 300 files in the folder, then I need it to print
VISITERM_0 300
Similarly, if there are 1000 files that contain VISITERM_1, I need it to print
VISITERM_1 1000
So, I want to print each VISITERM and the number of files that contain it, from VISITERM_0 through VISITERM_99.
I used the grep command:
grep VISITERM_0 * -l | wc -l
However, this is for a single term, and I want to loop this from VISITERM_0 to VISITERM_99. Please help!
#!/bin/bash
# ^^- the above is important; #!/bin/sh would allow only POSIX syntax
# use a C-style for loop, which is a bash extension
for ((i=0; i<100; i++)); do
  # Calculate the number of matching files; -w keeps VISITERM_1 from also matching VISITERM_10..VISITERM_19
  num_matches=$(find . -type f -exec grep -l -w -e "VISITERM_$i" '{}' + | wc -l)
  # ...and print the result.
  printf 'VISITERM_%d\t%d\n' "$i" "$num_matches"
done
Here is a GNU awk solution (GNU awk is needed because RS contains more than one character) that should do it:
awk -v RS=" |\n" '{n=split($1,a,"VISITERM_");if (n==2 && a[2]<100) b[a[2]]++} END {for (i in b) print "VISITERM_"i,b[i]}' *
Example:
cat file1
VISITERM_0 VISITERM_320 VISITERM_35
cat file2
VISITERM_0 VISITERM_20 VISITERM_32
VISITERM_20 VISITERM_42 VISITERM_11
Gives:
awk -v RS=" |\n" '{n=split($1,a,"VISITERM_");if (n==2 && a[2]<100) b[a[2]]++} END {for (i in b) print "VISITERM_"i,b[i]}' file*
VISITERM_0 2
VISITERM_11 1
VISITERM_20 2
VISITERM_32 1
VISITERM_35 1
VISITERM_42 1
How it works:
awk -v RS=" |\n" ' # Set record selector to space or new line
{n=split($1,a,"VISITERM_") # Split the record on "VISITERM_" and store the number of parts in "n"
if (n==2 && a[2]<100) # If "n" is 2 (the record does contain "VISITERM_") and the number is less than 100
b[a[2]]++} # Count the hits for each number and store them in array "b"
END {for (i in b) # Walk through array "b"
print "VISITERM_"i,b[i]} # Print the hits
' file* # Read the files
PS
If everything is on a single line, change to RS=" ". Then it should work with most awk implementations.
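For example, an untested sketch of that single-line variant (same logic as above, only RS changed):
awk -v RS=" " '{n=split($1,a,"VISITERM_"); if (n==2 && a[2]<100) b[a[2]]++} END {for (i in b) print "VISITERM_"i,b[i]}' *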
Related
We have a software package that performs tasks by assigning the batch of files a job number. Batches can have any number of files in them. The files are then stored in a directory structure similar to this:
/asc/array1/.storage/10/10297/10297-Low-res.m4a
...
/asc/array1/.storage/3/3814/3814-preview.jpg
The filename is generated automatically. The directory under .storage is the thousands part of the file number (the file number divided by 1000, so file 10297 lives under .storage/10/ and 3814 under .storage/3/).
There is also a database which associates the job number and the file number with the client in question. Running a SQL query, I can list out the job number, client and the full path to the files. Example:
213 sample-data /asc/array1/.storage/10/10297/10297-Low-res.m4a
...
214 client-abc /asc/array1/.storage/3/3814/3814-preview.jpg
My task is to calculate the total storage being used per client. So, I wrote a quick and dirty bash script to iterate over every single row and du the file, adding it to an associative array. I then plan to echo this out or produce a CSV file for ingest into PowerBI or some other tool. Is this the best way to handle this? Here is a copy of the script as it stands:
#!/bin/bash
declare -A clientArr
# 1 == Job Num
# 2 == Client
# 3 == Path
while read -r line; do
    client=$(echo "$line" | awk '{ print $2 }')
    path=$(echo "$line" | awk '{ print $3 }')
    if [ -f "$path" ]; then
        size=$(du -s "$path" | awk '{ print $1 }')
        clientArr[$client]=$((${clientArr[$client]}+${size}))
    fi
done < /tmp/pm_report.txt
for key in "${!clientArr[@]}"; do
    echo "$key,${clientArr[$key]}"
done
Assuming:
you have GNU coreutils du
the filenames do not contain whitespace
This has no shell loops, calls du once, and iterates over the pm_report file twice.
file=/tmp/pm_report.txt
awk '{printf "%s\0", $3}' "$file" \
| du -s --files0-from=- 2>/dev/null \
| awk '
NR == FNR {du[$2] = $1; next}
{client_du[$2] += du[$3]}
END {
OFS = "\t"
for (client in client_du) print client, client_du[client]
}
' - "$file"
Using file foo:
$ cat foo
213 sample-data foo # this file
214 client-abc bar # some file I had in the dir
215 some nonexistent # didn't have this one
and the awk:
$ gawk ' # using GNU awk
#load "filefuncs" # for this default extension
!stat($3,statdata) { # "returns zero upon success"
a[$2]+=statdata["size"] # get the size and update array
}
END { # in the end
for(i in a) # iterate all
print i,a[i] # and output
}' foo foo # running twice for testing array grouping
Output:
client-abc 70
sample-data 18
I have a list file, which has an id and a number, and I am trying to get those lines from a master file which do not have those ids.
List file
nw_66 17296
nw_67 21414
nw_68 21372
nw_69 27387
nw_70 15830
nw_71 32348
nw_72 21925
nw_73 20363
master file
nw_1 5896
nw_2 52814
nw_3 14537
nw_4 87323
nw_5 56466
......
......
nw_n xxxxx
So far I am trying this, but it is not working as expected.
for i in $(awk '{print $1}' list.txt); do grep -v -w $i master.txt; done;
Kindly help
Give this awk one-liner a try:
awk 'NR==FNR{a[$1]=1;next}!a[$1]' list master
Maybe this helps:
awk 'NR == FNR {id[$1]=1; next}
     {
         if (id[$1] == "") {
             print $0
         }
     }' listfile masterfile
We accept two files as input above; the first one is listfile, the second is masterfile.
NR == FNR is true while awk is going through listfile. In the associative array id[], every id in listfile is made a key with the value 1.
When awk goes through masterfile, it only prints a line if $1, i.e. the id, is not a key in the array id.
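To see it in action on two tiny throw-away files (the master line nw_66 99999 is made up so that one id from the list actually appears in master):
$ printf 'nw_66 17296\nnw_67 21414\n' > list
$ printf 'nw_1 5896\nnw_66 99999\nnw_2 52814\n' > master
$ awk 'NR == FNR {id[$1]=1;next} {if (id[$1] == "") print $0}' list master
nw_1 5896
nw_2 52814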
The OP attempted the following line:
for i in $(awk '{print $1}' list.txt); do grep -v -w $i master.txt; done;
This line will not work: for every entry $i, you print all entries in master.txt that are not equivalent to "$i". As a consequence, you will end up with multiple copies of master.txt, each missing a single line.
Example:
$ for i in 1 2; do grep -v -w "$i" <(seq 1 3); done
2 \ copy of seq 1 3 without entry 1
3 /
1 \ copy of seq 1 3 without entry 2
3 /
Furthermore, the attempt reads the file master.txt multiple times. This is very inefficient.
The Unix tool grep allows one to check multiple expressions stored in a file in a single go. This is done using the -f flag. Normally this looks like:
$ grep -f list.txt master.txt
The OP can use this now in the following way:
$ grep -vwf <(awk '{print $1}' list.txt) master.txt
But this would match anywhere in the line, not just in column 1.
The awk solution presented by Kent is more flexible and allows the OP to define a more precise match:
awk 'NR==FNR{a[$1]=1;next}!a[$1]' list master
Here the OP states explicitly: match column 1 of list against column 1 of master, and ignore whatever is in column 2. The grep solution could still match entries in column 2, as illustrated below.
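A contrived illustration (the master line nw_5 nw_66 is made up; the real column 2 holds numbers, but any value grep can match would behave the same way):
$ printf 'nw_66 17296\n' > list.txt
$ echo 'nw_5 nw_66' | grep -vwf <(awk '{print $1}' list.txt)
$ echo 'nw_5 nw_66' | awk 'NR==FNR{a[$1]=1;next}!a[$1]' list.txt -
nw_5 nw_66
The grep variant silently drops the line because the pattern nw_66 matches in column 2, while the awk variant keeps it because only column 1 is compared.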
This is my script:
SourceFile='/root/Document/Source/'
FND=$(find $SourceFile. -regextype posix-regex -iregex "^.*/ABCDEF_555_[0-9]{5}\.txt$")
echo $FND
# I've tried using awk but haven't gotten perfect results
File Name:
ABCDEF_555_12345.txt
ABCDEF_555_54321.txt
ABCDEF_555_11223.txt
BEFORE
File Content from ABCDEF_555_12345.txt:
no|name|address|pos_code
1|rick|ABC|12342
2|rock|ABC|12342
3|Robert|DEF|54321
File Content from ABCDEF_555_54321.txt:
no|id|name|city
1|0101|RIZKI|JKT
2|0102|LALA|SMG
3|0302|ROY|YGY
I want to append a column that shows the file name to every row starting from the 2nd, append a header column named name_file to the first row, and change the contents of the original files in place.
AFTER
file: ABCDEF_555_12345.txt
no|name|address|pos_code|name_file
1|rick|ABC|12342|ABCDEF_555_12345.txt
2|rock|ABC|12342|ABCDEF_555_12345.txt
3|Robert|DEF|54321|ABCDEF_555_12345.txt
file: ABCDEF_555_54321.txt
no|id|name|city|name_file
1|0101|RIZKI|JKT|ABCDEF_555_54321.txt
2|0102|LALA|SMG|ABCDEF_555_54321.txt
3|0302|ROY|YGY|ABCDEF_555_54321.txt
Please shed some light on a solution :))
Thanks :))
The best solution is to use awk.
If it's the first line (NR == 1), print the line and append |name_file.
For all other lines print the line and append the filename using the FILENAME variable:
awk 'NR == 1 {print $0 "|name_file"; next;}{print $0 "|" FILENAME;}' foo.txt
You can also use it with multiple files:
find . -iname "*.txt" -print0 | xargs -0 awk '
NR == 1 {print $0 "|name_file"; next;}
FNR == 1 {next;} # Skip headers of the following files
{print $0 "|" FILENAME;}'
My first solution used the paste command.
Paste allows you to concatenate files horizontally (compared to cat which concatenates vertically).
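A tiny illustration with two throw-away files:
$ printf 'a\nb\n' > col1 ; printf '1\n2\n' > col2
$ cat col1 col2
a
b
1
2
$ paste -d'|' col1 col2
a|1
b|2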
To achieve the desired output with paste:
first, concatenate the first line of your file (head -n1 foo.txt) with the column header (echo "name_file"). The paste command accepts the -d flag to define the separator between columns.
second, extract all lines except the first (tail -n+2 foo.txt) and concatenate them with as many repetitions of foo.txt as required (using a for loop that computes the number of lines to fill).
The solution looks like this:
paste -d'|' <(head -n1 foo.txt) <(echo "name_file")
paste -d'|' <(tail -n+2 foo.txt) <(for i in $(seq $(tail -n+2 foo.txt | wc -l)); do echo "foo.txt"; done)
no|name|address|pos_code|name_file
1|rick|ABC|12342|foo.txt
2|rock|ABC|12342|foo.txt
3|Robert|DEF|54321|foo.txt
However, the awk solution should be preferred because it is clearer (only one call, fewer process substitutions and so on) and faster.
$ wc -l foo.txt
100004 foo.txt
$ time ./awk.sh >/dev/null
./awk.sh > /dev/null 0,03s user 0,01s system 98% cpu 0,041 total
$ time ./paste.sh >/dev/null
./paste.sh > /dev/null 0,38s user 0,33s system 154% cpu 0,459 total
Using find and GNU awk:
My find implementation doesn't have regextype posix-regex and I used posix-extended instead, but since you got the correct results it should be fine.
srcdir='/root/Document/Source/'
find "$srcdir" -regextype posix-regex -iregex ".*/ABCDEF_555_[0-9]{5}\.txt$"\
-exec awk -i inplace -v fname="{}" '
BEGIN{ OFS=FS="|"; sub(/.*\//, "", fname) } # set field separators / extract filename
{ $(NF+1)=NR==1 ? "name_file" : fname; print } # add header field / filename, print line
' {} \;
The pathname found by find is passed to awk in variable fname. In the BEGIN block the filename is extracted from the path.
The files are modified "in place"; make sure you make a backup of your files before running this.
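If you would rather have gawk keep the backups itself, the inplace extension accepts a suffix variable (inplace::suffix in gawk 5.0 and later, INPLACE_SUFFIX in older releases); a minimal sketch with a no-op program, just to show the flag:
awk -i inplace -v inplace::suffix=.bak '{ print }' ABCDEF_555_12345.txt
Each original file is then kept next to the rewritten one with a .bak extension.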
How to find 10 most frequent words in the file in Unix/Linux?
I tried using this command in Unix:
sort file.txt | uniq -c | sort -nr | head -10
However I am not sure if it's correct and whether it is showing me 10 most frequent words in the large file.
I have a shell demo to deal with your problem, even if your file has more than one word per line.
wordcount.sh
#!/bin/bash
# filename: wordcount.sh
# usage: word count
# handle position arguments
if [ $# -ne 1 ]; then
    echo "Usage: $0 filename"
    exit 1
fi
# perform the word count
printf "%-14s%s\n" "Word" "Count"
cat "$1" | tr 'A-Z' 'a-z' | \
    egrep -o "\b[[:alpha:]]+\b" | \
    awk '{ count[$0]++ }
         END {
             for (ind in count)
                 printf("%-14s%d\n", ind, count[ind])
         }' | sort -k2 -n -r | head -n 10
just run ./wordcount.sh filename.txt
Explanation
Use the tr command to convert all uppercase letters to lowercase, then use the egrep command to grab all the words in the text and output them one per line. Finally, use the awk command with an associative array to count the words, and sort the output in decreasing order of occurrences.
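Roughly the same pipeline can also be written as a one-liner (untested sketch; note the count comes first in this output, and -o needs a grep that supports it):
tr 'A-Z' 'a-z' < file.txt | grep -oE '[[:alpha:]]+' | sort | uniq -c | sort -nr | head -n 10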
I have tried this:
dirs=$1
for dir in $dirs
do
ls -R $dir
done
Like this?:
$ cat > foo
this
nope
$ cat > bar
neither
this
$ sort * | uniq -c
1 neither
1 nope
2 this
and weed out the ones with just 1s:
... | awk '$1>1'
2 this
Use sort with uniq to find the duplicate lines.
#!/bin/bash
dirs=("$#")
for dir in "${dirs[#]}" ; do
cat "$dir"/*
done | sort | uniq -c | sort -n | tail -n1
uniq -c will prepend the number of occurrences to each line
sort -n will sort the lines by the number of occurrences
tail -n1 will only output the last line, i.e. the maximum. If you want to see all the lines with the same number of duplicates, add the following instead of tail:
perl -ane 'if ($F[0] == $n) { push @buff, $_ }
           else { @buff = ($_) }
           $n = $F[0];
           END { print for @buff }'
You could use awk. If you just want to "count the duplicate lines", we could infer that you're after "all lines which have appeared earlier in the same file". The following would produce these counts:
#!/bin/sh
for file in "$@"; do
    if [ -s "$file" ]; then
        awk '$0 in a {c++} {a[$0]} END {printf "%s: %d\n", FILENAME, c}' "$file"
    fi
done
The awk script first checks whether the current line is already stored in the array a, and if it is, increments a counter. Then it adds the line to the array. At the end of the file, we print the total.
Note that this might have problems on very large files, since the entire input file needs to be read into memory in the array.
Example:
$ printf 'foo\nbar\nthis\nbar\nthat\nbar\n' > inp.txt
$ awk '$0 in a {c++} {a[$0]} END {printf "%s: %d\n", FILENAME, c}' inp.txt
inp.txt: 2
The word 'bar' exists three times in the file, thus there are two duplicates.
To aggregate multiple files, you can just feed multiple files to awk:
$ printf 'foo\nbar\nthis\nbar\n' > inp1.txt
$ printf 'red\nblue\ngreen\nbar\n' > inp2.txt
$ awk '$0 in a {c++} {a[$0]} END {print c}' inp1.txt inp2.txt
2
For this, the word 'bar' appears twice in the first file and once in the second file -- a total of three times, thus we still have two duplicates.
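If the files are too large for the in-memory array, one possible alternative (a rough sketch, not part of the answer above) is to let sort do the work, since sort can spill to temporary files: the number of duplicates is simply the total line count minus the distinct line count.
#!/bin/sh
# rough sketch: duplicates = total lines - distinct lines
file=inp.txt
total=$(wc -l < "$file")
distinct=$(sort -u "$file" | wc -l)
echo "$file: $((total - distinct))"
For the inp.txt above this prints inp.txt: 2, matching the awk result.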