Linux Bash Script for calculating average of multiple files

I am writing a script that takes the path of a folder as a parameter and does its job on the files inside it. The aim is to calculate the average of the review scores and print the result next to the name of each file. I wrote the script for a single file and it works fine, but I couldn't find a way to do it for multiple files. I should get output like this:
% ./averagereviews.sh path_to_folder
hotel_11212 3.51
hotel_2121 2.62
hotel_31212 2.43
...
I did this task for a single hotel and the code looks like this:
grep "<Overall>" $1 | sed 's/<Overall>//g'| awk '{SUM += $1} END {print SUM/NR}'
This simply searches for "<Overall>" in the file and takes the number next to it, then adds these numbers up and divides the sum by NR to find the average.
When I run it, the output is the average value for the given hotel:
./averagereviews.sh hotel_190158.dat
4.00578
But I need to do this for multiple .dat files in a folder, printing the name of each hotel. How can I do that?

You could "cheat"
> cat averagereviews.sh
#!/bin/bash
SUM=0
data_files=$(ls $1/dataFile*.dat)
cat $data_files | grep "<Overall>" | sed -e 's/<Overall>//g' | awk '{SUM += $1} END {print SUM/NR}'
and run (from wherever, with whichever paths you need)
> ~/tools/averagereviews.sh /tmp/data/
Simply put, I'm cat-ing all the files first and applying your command to the combined stream, so the pipeline behaves as if it were a single file.
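If you also want the per-file output from the question (each hotel's name next to its average), a loop over the individual files is one way to do it. A minimal sketch, assuming the files match hotel_*.dat as in the question:
#!/bin/bash
# Sketch: print each hotel's base name followed by its <Overall> average
for f in "$1"/hotel_*.dat; do
    name=$(basename "$f" .dat)
    avg=$(grep "<Overall>" "$f" | sed 's/<Overall>//g' | awk '{SUM += $1} END {if (NR) print SUM/NR}')
    echo "$name $avg"
done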

Related

Print number of lines matching conditions in files generated by count

I'm trying to figure out how to print, using pure awk, the lines that satisfy the count number provided by a while-count loop in bash. Here are some lines of the input:
NODE_1_posplwpl
NODE_1_owkokwo
NODE_1_kslkow
NODE_2_fbjfh
NODE_2_lsmlsm
NODE_3_Loskos
NODE_3_pospls
What I want to do is print the lines whose second field matches the count number provided by the while loop into a file named file_${count}_test.
So a file called "file_1_test" will contain the lines with "NODE_1..", "file_2_test" will contain the lines with "NODE_2..", and so on for all the lines of the file.
Here's my code.
#! /bin/bash
while read CNAME
do
let count=$count+1
grep "^${CNAME}_" > file_${count}_test
awk -v X=$count '{ FS="_" } { if ($2 == X) print $0 }' > file_${count}_test
done <$1
exit 1
This code creates only file_1_test, and it is empty. So the awk condition seems to be wrong.
Looks like you're trying to split your input into separate files named based on the number between the underscores. That'd just be:
awk -F'_' '{print > ("file_" $2 "_test")}' file
You may need to change it to:
awk -F'_' '$2!=prev{close(out); out="file_" $2 "_test"} {print > out; prev=$2}' file
if you're generating a lot of output files and not using GNU awk as that could lead to a "too many open files" error.
Regarding your comments below, look:
$ cat file
NODE_1_posplwpl
NODE_1_owkokwo
NODE_1_kslkow
NODE_2_fbjfh
NODE_2_lsmlsm
NODE_3_Loskos
NODE_3_pospls
$ awk -F'_' '{print $0 " > " ("file_" $2 "_test")}' file
NODE_1_posplwpl > file_1_test
NODE_1_owkokwo > file_1_test
NODE_1_kslkow > file_1_test
NODE_2_fbjfh > file_2_test
NODE_2_lsmlsm > file_2_test
NODE_3_Loskos > file_3_test
NODE_3_pospls > file_3_test
Just change $0 " > " to > as in the first script to have the output go to the separate files, instead of just showing you what would happen, which is all this last script does.

How to add numbers in C-shell

I have a question about C-shell. In my script, I want to automatically add up all the numbers and get the total. How can I implement such a function in C-shell?
My script is shown below:
#!/bin/csh -f
set log_list = $1
echo "Search begins now"
foreach subdir(`cat $log_list`)
grep "feature identified" "$subdir" -A1 | grep "ne=" | awk '{print $7}'
echo "done"
end
This script greps each log file listed in "log_list" for the keyword "feature identified" and the line that follows it, which contains the keyword "ne=". I care about the number after "ne=", for example ne=140.
Then the grep output will be like this:
ne=100
ne=115
ne=120
...
There are more than 1K lines of such numbers. Of course I could redirect the grep output to a new file (in Linux) and then copy all the data into an Excel spreadsheet to add them up, but I want to do this in the script. It would make things easier.
The final result should be like this:
total_ne=335
Do you know how to do this in the C-shell? Thanks!
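One common way to do the summing is to hand the final step to awk, since csh itself is awkward with arithmetic over many values. A minimal sketch built on the grep pipeline from the question (the temporary file name is an arbitrary choice here):
#!/bin/csh -f
set log_list = $1
set tmp = /tmp/ne_values.$$
touch $tmp
foreach subdir (`cat $log_list`)
    # collect the ne=... fields from each log into a scratch file
    grep "feature identified" "$subdir" -A1 | grep "ne=" | awk '{print $7}' >> $tmp
end
# strip the "ne=" prefix and add everything up
awk -F= '{sum += $2} END {print "total_ne=" sum}' $tmp
rm -f $tmp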

Linux commandline find and add up numbers

In the log file, which may contain information of different forms, I need to grep only those lines that contain the substring "ABC", then from the chosen lines extract the number of Kb at the end (the pattern is ": %n Kb", where %n is a number from 0 upwards, and it is always present). Finally I need to add up all the values to get the amount of memory used by an app.
2016-01-14T16:15:01.695Z [INFO] application - ABC 5f18dda7-a30a-44f5-82dd-69d4b5469245: 118 Kb
2016-01-14T16:15:04.535Z [INFO] application - 5f18dda7-a30a-44f5-82dd-69d4b5469245
grep isn't a verb, but awk is!
awk '/ABC/ {s+= $(NF-1)} END {print s "Kb"}'
should work (untested)
You can use the following chain:
grep ABC logfile.txt | egrep -o "[0-9]+ Kb" | cut -f1 -d" "| paste -s -d+ | bc
I need to grep only those lines that contain substring "ABC", then among chosen lines extract (it always exists) the number of Kb at the end
This looks like a job for awk. The number is always the second last column, which awk can extract easily:
awk '/ABC/ { print $(NF-1) }' filename_here
Here NF-1 is the index of the second-last column, and $ gets the value in that column.
But you want to sum it up, rather than just extract. That's a simple task, and it shows off a slightly more advanced usage of awk:
awk '
BEGIN { sum = 0; }
/ABC/ { sum += $(NF-1); }
END { print sum; }
' filename_here
Technically speaking you can omit the entire BEGIN line, but I consider it good style to be up-front about the variables you expect to use in the program.
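As a quick sanity check against the two sample lines shown in the question (saved in a hypothetical file app.log): only the first line matches /ABC/ and its second-last column is 118, so the sum is simply 118:
$ awk '/ABC/ { sum += $(NF-1) } END { print sum }' app.log
118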

Finding duplicate entries across very large text files in bash

I am working with very large data files extracted from a database. There are duplicates across these files that I need to remove; if a line is duplicated, the copies will exist across files, not within the same file. The files contain entries that look like the following:
File1
623898/bn-oopi-990iu/I Like Potato
982347/ki-jkhi-767ho/Let's go to Sesame Street
....
File2
568798/jj-ytut-786hh/Hello Mike
982347/ki-jkhi-767ho/Let's go to Sesame Street
....
So the Sesame Street line will have to be removed, possibly even across 5 files, but it must remain in at least one of them. From what I have been able to put together so far, I can run cat * | sort | uniq -cd to get each duplicated line and the number of times it has been duplicated, but I have no way of getting the file names. cat * | sort | uniq -cd | grep "" * doesn't work. Any ideas or approaches for a solution would be great.
Expanding your original idea:
sort * | uniq -cd | awk '{print $2}' | grep -Ff- *
i.e. from the output, print only the duplicate strings, then search all the files for them (the list of things to search for is taken from -, i.e. stdin), literally (-F).
Something along these lines might be useful:
awk '!seen[$0] { print $0 > (FILENAME ".new") } { seen[$0] = 1 }' file1 file2 file3 ...
twalberg's solution works perfectly, but if your files are really large it could exhaust the available memory because it creates one entry in an associative array per unique record encountered. If that happens, you can try a similar approach where there is only one entry per duplicate record (I assume you have GNU awk and your files are named *.txt):
sort *.txt | uniq -d > dup
awk 'BEGIN {while(getline < "dup") {dup[$0] = 1}} \
!($0 in dup) {print >> (FILENAME ".new")} \
$0 in dup {if(dup[$0] == 1) {print >> (FILENAME ".new");dup[$0] = 0}}' *.txt
Note that if you have many duplicates it could also exhaust the available memory. You can solve this by splitting the dup file into smaller chunks and running the awk script on each chunk.
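For example, the dup file could be broken up with split (the chunk size and the dup_chunk_ prefix are arbitrary choices here), and the awk script above would then be run once per chunk:
split -l 100000 dup dup_chunk_    # produces dup_chunk_aa, dup_chunk_ab, ...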

Bash script to list files periodically

I have a huge set of files, 64,000 of them, and I want to create a Bash script that lists the file names using
ls -1 > file.txt
for every 4,000 files and stores the resulting file.txt in a separate folder. So every 4,000 files have their names listed in a text file that is stored in its own folder. The result is:
folder01 contains file.txt that lists files #0-#4000
folder02 contains file.txt that lists files #4001-#8000
folder03 contains file.txt that lists files #8001-#12000
.
.
.
folder16 contains file.txt that lists files #60000-#64000
Thank you very much in advance
You can try
ls -1 | awk '
{
if (! ((NR-1)%4000)) {
if (j) close(fnn)
fn=sprintf("folder%02d",++j)
system("mkdir "fn)
fnn=fn"/file.txt"
}
print >> fnn
}'
Explanation:
NR is the current record number in awk, that is: the current line number.
NR starts at 1, on the first line, so we subtract 1 such that the if statement is true for the first line
system calls an operating system function from within awk
print by itself prints the current line to standard output; we can redirect (and append) the output to the file using >>
All uninitialized variables in awk have a zero value, so we do not need to set j=0 at the beginning of the program
This will get you pretty close:
ls -1 | split -l 4000 -d - folder
Run the result of ls through split, breaking every 4000 lines (-l 4000), using numeric suffixes (-d), reading from standard input (-), and naming the output files starting with folder.
This results in folder00, folder01, ...
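If you then need the exact folderNN/file.txt layout from the question, one way (a sketch based on the split output names above) is to move each piece into its own directory afterwards:
for f in folder[0-9][0-9]; do
    mkdir "$f.d"             # temporary directory name
    mv "$f" "$f.d/file.txt"  # the split output becomes file.txt
    mv "$f.d" "$f"           # rename the directory to folderNN
done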
Here is an exact solution using awk:
ls -1 | awk '
(NR-1) % 4000 == 0 {
dir = sprintf("folder%02d", ++nr)
system("mkdir -p " dir);
}
{ print >> (dir "/file.txt") }'
There are already some good answers above, but I would also suggest you take a look at the watch command. This will re-run a command every n seconds, so you can, well, watch the output.
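For example, to re-run the listing every two seconds and see how many files are left (the interval is an arbitrary choice):
watch -n 2 'ls -1 | wc -l'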
