I have 3 files:
File #1 contains a list of strings to check for.
File #2 contains the new prices.
File #3 contains the old prices, which need to be replaced whenever they differ from the prices in file #2.
Example:
file #1
item1
item2
file #2
item1cost100
item2cost200
file #3
item1cost101
item2cost199
After running the script, file #3 should be updated:
file #3
item1cost100
item2cost200
Files #2 and #3 contain a lot of entries, but only the entries listed in file #1 need to be checked and, if different, written to file #3.
I only got as far as comparing the two files for one string; I am not sure how to loop through the contents of file #1 or how to write the changes to file #3.
I started working with the sed command and got stuck, not knowing how to expand variables inside it.
Here is what I have:
item="item1"
itemold=$(cat file2 | grep item1)
echo $itemold
itemnew=$(cat file3 | grep item1)
echo $itemnew
echo $item
if [ $itemold = $itemnew ]; then
echo "MATCH!"
else
echo "NO MATCH!"
fi
#!/bin/bash
# Handle the first entry of file1
item=$(sed -n '1p' file1)          # first string to check
new=$(grep "$item" file2)          # up-to-date line from file2
old=$(grep "$item" file3)          # stale line in file3
sed "s/$old/$new/g" file3 > d

# Handle the second entry of file1
item=$(sed -n '2p' file1)
new=$(grep "$item" file2)
old=$(grep "$item" file3)
sed "s/$old/$new/g" d > file3
This takes care of the case you gave. It could be generalized, as sketched below.
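For instance, a minimal sketch of one possible generalization, assuming each string in file1 matches exactly one line per file, the matched lines contain no sed metacharacters, and GNU sed's -i is available:

#!/bin/bash
# For each string in file1, fetch the current line from file2 and the
# stale line from file3, then rewrite file3 in place when they differ.
while IFS= read -r item; do
    new=$(grep -F "$item" file2)    # -F: match the string literally
    old=$(grep -F "$item" file3)
    if [ -n "$new" ] && [ "$new" != "$old" ]; then
        # Double quotes are what let the shell expand $old and $new
        # inside the sed expression; single quotes would not.
        sed -i "s/$old/$new/" file3
    fi
done < file1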
I'm pretty new to bash scripting, so some of my syntax may not be optimal. Please do point it out if you see anything.
I have files in a directory named sequentially.
Example: prob01_01 prob01_03 prob01_07 prob02_01 prob02_03 ....
I am trying to have the script iterate through the current directory and count how many extensions each problem has, then print each pre-extension name followed by its count.
Sample output for above would be:
prob01 3
prob02 2
This is my code:
#!/bin/bash
temp=$(mktemp)
element=''
count=0
for i in *
do
current=${i%_*}
if [[ $current == $element ]]
then
let "count+=1"
else
echo $element $count >> temp
element=$current
count=1
fi
done
echo 'heres the temp:'
cat temp
rm 'temp'
The Problem:
Current output:
prob1 3
Desired output:
prob1 3
prob2 2
The last count isn't appended because the loop never sees a different element after it.
My Guess on possible solutions:
Have the last append occur at the end of the for loop?
Your code has two problems.
The first doesn't actually affect your question: you make a temporary file and store its name in $temp, but you then write to a file with the literal name temp instead of using the variable.
The second problem is the one you guessed: you only write results when you see a new problem name, so the last one is never printed.
Fixing only these problems results in
results() {
if (( count == 0 )); then
return
fi
echo $element $count >> "${temp}"
}
temp=$(mktemp)
element=''
count=0
for i in prob*
do
current=${i%_*}
if [[ $current == $element ]]
then
let "count+=1" # Better is using ((count++))
else
results
element=$current
count=1
fi
done
results
echo 'heres the temp:'
cat "${temp}"
rm "${temp}"
You can also do this without the script:
ls prob* | cut -d"_" -f1 | sort | uniq -c
If you want the output displayed exactly as given, you need one more step:
ls prob* | cut -d"_" -f1 | sort | uniq -c | awk '{print $2 " " $1}'
You may use a printf + awk solution:
printf '%s\n' *_* | awk -F_ '{a[$1]++} END{for (i in a) print i, a[i]}'
prob01 3
prob02 2
We use printf to print each file name that contains at least one _.
We use awk to count each name's first _-delimited field in an associative array, then print the counts.
I would do it like this:
$ ls | awk -F_ '{print $1}' | sort | uniq -c | awk '{print $2 " " $1}'
prob01 3
prob02 2
For example, I have the below log files from the 16th to the 20th of Feb 2015. I want to create a single file named mainentrywatcherReport_2015-02-16_2015-02-20.log. In other words, I want to extract the date from the first and last file of the week (Mon-Fri) and create one output file every Saturday. I will be using cron to trigger the script every Saturday.
$ ls -l
mainentrywatcher_2015-02-16.log
mainentrywatcher_2015-02-17.log
mainentrywatcher_2015-02-18.log
mainentrywatcher_2015-02-19.log
mainentrywatcher_2015-02-20.log
$ cat *.log >> mainentrywatcherReport_2015-02-16_2015-02-20.log
$ mv *.log archive/
Can anybody help on how to rename the output file to above format?
Perhaps try this:
parta=$(ls mainentrywatcher_*.log | head -n1 | cut -d'_' -f2 | cut -d'.' -f1)
partb=$(ls mainentrywatcher_*.log | tail -n1 | cut -d'_' -f2 | cut -d'.' -f1)
filename=mainentrywatcherReport_${parta}_${partb}.log
cat *.log >> ${filename}
"ls -l" output is described in the question
"head -nX" takes the Xth line of the output
"cut -d'_' -f2" takes everything (that remains) after the first underscore
"cut -d'.' -f1" times everything (that remains) before the first period
both commands are surrounded by ` marks (above tilde ~) to capture the output of the command to a variable
file name assembles the two dates stripped of the unnecessary with the other formatting desired for the final file name.
the cat command demonstrates one possible way to use the resulting filename
Happy coding! Leave a comment if you have any questions.
You can try this if you want to introduce simple looping...
FROM=$(ls -lrt mainentrywatcher_* | awk '{print $9}' | head -1 | cut -d"_" -f2 | cut -d"." -f1)
TO=$(ls -lrt mainentrywatcher_* | awk '{print $9}' | tail -1 | cut -d"_" -f2 | cut -d"." -f1)
FINAL_LOG=mainentrywatcherReport_${FROM}_${TO}.log
for i in $(ls -lrt mainentrywatcher_* | awk '{print $9}')
do
cat "$i" >> "$FINAL_LOG"
done
echo "All Logs Stored in $FINAL_LOG"
Another approach, given your daily files with the following test contents:
mainentrywatcher_2015-02-16.log -> a
mainentrywatcher_2015-02-17.log -> b
mainentrywatcher_2015-02-18.log -> c
mainentrywatcher_2015-02-19.log -> d
mainentrywatcher_2015-02-20.log -> e
This utilizes bash parameter expansion/substring extraction in a simple loop:
#!/bin/bash
declare -i cnt=0 # simple counter to determine begin
for i in mainentrywatcher_2015-02-*; do # loop through each matching file
tmp=${i//*_/}                            # isolate date (strip through the underscore)
tmp=${tmp//.*/}                          # strip the .log extension
[ $cnt -eq 0 ] && begin=$tmp || end=$tmp # assign first to begin, last to end
((cnt++)) # increment counter
done
cmbfname="${i//_*/}_${begin}_${end}.log" # form the combined logfile name
cat ${i//_*/}* > $cmbfname # cat all into combined name
## print out begin/end/cmbfname & contents to verify
printf "\nbegin: %s\nend : %s\nfname: %s\n\n" $begin $end $cmbfname
printf "contents: %s\n\n" $cmbfname
cat $cmbfname
exit 0
use/output:
alchemy:~/scr/tmp/stack/tmp> bash weekly.sh
begin: 2015-02-16
end : 2015-02-20
fname: mainentrywatcher_2015-02-16_2015-02-20.log
contents: mainentrywatcher_2015-02-16_2015-02-20.log
a
b
c
d
e
You can, of course, modify the for loop to accept a positional parameter containing the partial filename and pass the partial file name from the command line, as sketched below.
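For reference, a minimal sketch of that variant; the parameter handling and the usage message are my own additions, and it assumes the same filename layout as above:

#!/bin/bash
pfx=${1:?usage: $0 partial-filename}     # e.g. mainentrywatcher_2015-02-
declare -i cnt=0
for i in "${pfx}"*; do                   # loop through each matching file
    tmp=${i//*_/}                        # isolate date
    tmp=${tmp//.*/}
    [ "$cnt" -eq 0 ] && begin=$tmp || end=$tmp
    ((cnt++))
done
cmbfname="${i//_*/}_${begin}_${end}.log" # form the combined logfile name
cat "${i//_*/}"* > "$cmbfname"
printf "combined %d files into %s\n" "$cnt" "$cmbfname"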
Something like this:
#!/bin/sh
LOGS="`echo mainentrywatcher_2[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].log`"
HEAD=
TAIL=
for logs in $LOGS
do
TAIL=`echo $logs | sed -e 's/^.*mainentrywatcher_//' -e 's/\.log$//'`
test -z "$HEAD" && HEAD=$TAIL
done
cat $LOGS >mainentrywatcherReport_${HEAD}_${TAIL}.log
mv $LOGS archive/
That is:
get a list of the existing logs (which happen to be sorted) in a variable $LOGS
walk through the list, getting just the date according to the example
save the first date as $HEAD
save the last date as $TAIL
after the loop, cat all of those files into the new output file
move the used-up log-files into the archive directory.
Basically I want to get all records from file2, but filter out columns whose header doesn't appear in file1
Example:
file1
Name Location
file2
Name Phone_Number Location Email
Jim 032131 xyz xyz@qqq.com
Tim 037903 zzz zzz@qqq.com
Pimp 039141 xxz xxz@qqq.com
Output
Name Location
Jim xyz
Tim zzz
Pimp xxz
Is there a way to do this without awk or sed, but still using coreutils tools? I've tried doing it with join, but couldn't get it working.
ALL_COLUMNS=$(head -n1 file2)
for COLUMN in $(head -n1 file1); do
JOIN_FORMAT+="2.$(( $(echo ${ALL_COLUMNS%%$COLUMN*} | wc -w)+1 )),"
done
join -a2 -o ${JOIN_FORMAT%?} /dev/null file2
Explanation:
ALL_COLUMNS=$(head -n1 file2)
It saves all the column names to filter next
for COLUMN in $(head -n1 file1); do
JOIN_FORMAT+="2.$(( $(echo ${ALL_COLUMNS%%$COLUMN*} | wc -w)+1 )),"
done
For every column in file1, we look for the position of the column with the same name in file2 and append it to JOIN_FORMAT in the form "2.<number_of_column>,".
join -a2 -o ${JOIN_FORMAT%?} /dev/null file2
Once we have the complete option string (2.1,2.3,), we pass it to join, removing the trailing , with ${JOIN_FORMAT%?}.
Because /dev/null is empty, every line of file2 is unpairable, so join -a2 prints them all, but only the columns specified in the -o option.
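To make that concrete with the sample files above, the loop builds JOIN_FORMAT as 2.1,2.3, (Name is field 1 and Location is field 3 of file2), so the effective command and its output are as follows (depending on your join version you may need --nocheck-order, since file2 isn't sorted on the join field):

$ join -a2 -o 2.1,2.3 /dev/null file2
Name Location
Jim xyz
Tim zzz
Pimp xxz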
Not very efficient, but works for your example:
#!/bin/bash
read -r -a cols < file1
echo "${cols[#]}"
read -r -a header < <(head -n1 file2)
keep=()
for (( i=0; i<${#header[@]}; i++ )) ; do
for c in "${cols[@]}" ; do
if [[ ${header[i]} == "$c" ]] ; then
keep+=($i)
fi
done
done
while read -r -a data ; do
for idx in "${keep[@]}" ; do
printf '%s ' "${data[idx]}"
done
printf '\n'
done < <(tail -n+2 file2)
Tools used: head and tail. They aren't essential, though; see the sketch below. And bash, of course.
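If you want to avoid head and tail, a minimal sketch of the same logic that reads file2 exactly once, treating its first line as the header:

#!/bin/bash
read -r -a cols < file1
{
    read -r -a header                    # first line: column names
    keep=()
    for (( i=0; i<${#header[@]}; i++ )) ; do
        for c in "${cols[@]}" ; do
            [[ ${header[i]} == "$c" ]] && keep+=("$i")
        done
    done
    for idx in "${keep[@]}" ; do         # print the filtered header
        printf '%s ' "${header[idx]}"
    done
    printf '\n'
    while read -r -a data ; do           # then each data row
        for idx in "${keep[@]}" ; do
            printf '%s ' "${data[idx]}"
        done
        printf '\n'
    done
} < file2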
Can anybody help me with writing a multi-threaded shell script?
Basically, I have two files: one contains around 10K lines (main_file) and the other around 200 lines (sub_file). The 200 lines are strings that occur repeatedly, at random positions, in the main file; I collected them into sub_file.
I'm trying to write the matching lines for each string into a separate file, using the command below.
a=0
while IFS= read -r line
do
a=$((a+1))
users[$a]=$line                     # note: this array is never used afterwards
egrep "$line" "$main_file" >> "$line"
done < "$sub_file"
Run as a single process this takes a lot of time, so I'm thinking of using multiple processes to complete the work in the minimum time.
Help me out...
The tool you need for this is GNU parallel:
parallel egrep '{}' "$main_file" '>' '{}' < "$sub_file"
You can adjust the number of jobs processed with the option -P:
parallel -P 4 egrep '{}' "$main_file" '>' '{}' < "$sub_file"
Please see the manual for more info.
By the way, to make sure that you don't process a line twice, you can make the input unique first (awk '!a[$0]++' prints each line only the first time it is seen):
awk '!a[$0]++' "$sub_file" | parallel -P 4 egrep '{}' "$main_file" '>' '{}'
NOTE: Reposting from a previous answer of mine. It is not directly applicable here, but it is very similar and easy to tweak.
I have a file 1.txt with the below contents.
-----cat 1.txt-----
1234
5678
1256
1234
1247
I have 3 more files in a folder
-----ls -lrt-------
A1.txt
A2.txt
A3.txt
The contents of the three files have a similar format with different data values (all three files are tab-delimited).
-----cat A1.txt----
A X 1234 B 1234
A X 5678 B 1234
A X 1256 B 1256
-----cat A2.txt----
A Y 8888 B 1234
A Y 9999 B 1256
A X 1234 B 1256
-----cat A3.txt----
A Y 6798 C 1256
My objective is to search A1, A2, and A3 (only the 3rd column of each tab-delimited file) for the text in 1.txt,
and the output must be redirected to the file matches.txt as given below.
matches.txt:
/home/A1.txt:A X 1234 B 1234
/home/A1.txt:A X 5678 B 1234
/home/A1.txt:A X 1256 B 1256
/home/A2.txt:A X 1234 B 1256
The following should work.
cat A*.txt | tr -s '\t' '|' > combined.dat
{ while read myline; do
recset=$(echo "$myline" | cut -f3 -d'|' | tr -d '\r')   # 3rd column (my earlier post used -f19 for a different layout)
var=$(grep -c "$recset" 1.txt)
if [[ $var -ne 0 ]]; then
echo "$myline" >> final.dat
fi
done } < combined.dat
Using AWK
awk 'NR==FNR{a[$0]=1}$3 in a{print FILENAME":"$0}' 1.txt A* > matches.txt
For pipe-delimited input
awk -F'|' 'NR==FNR{a[$0]=1}$3 in a{print FILENAME":"$0}' 1.txt A* > matches.txt
I'm sorry for the very noob question, but I'm kind of new to bash programming (I started a few days ago). Basically, what I want to do is keep one file with all the word occurrences of another file.
I know I can do this:
sort | uniq -c | sort
The thing is that after that, I want to take a second file, calculate the occurrences again, and update the first one; then a third file, and so on.
What I'm doing at the moment works without any problem (I'm using grep, sed and awk), but it looks pretty slow.
I'm pretty sure there is a very efficient way to do it with just a command or two, using uniq, but I can't figure it out.
Could you please lead me to the right way?
I'm also pasting the code I wrote:
#!/bin/bash
# counts the number of word occurrences in a file and writes them to another file #
# the words are listed from the most frequent to the less one #
touch .check # used to check the occurrences. Temporary file
touch distribution.txt # final file with all the occurrences calculated
page=$1 # contains the file I'm calculating
occurrences=$2 # temporary file for the occurrences
# extracts all the words from the file $page, one per line, lowercased
cat $page | tr -cs A-Za-z\' '\n'| tr A-Z a-z > .check
# loop to update the old file with the new information
# basically what I do is check word by word and add them to the old file as an update
cat .check | while read words
do
word=${words} # word I'm calculating
strlen=${#word} # word's length
# I use a blacklist to skip banned words (for example, very small or unimportant words, like articles and prepositions)
if ! grep -Fxq $word .blacklist && [ $strlen -gt 2 ]
then
# if the word was never found before it writes it with 1 occurrence
if [ `egrep -c -i "^$word: " $occurrences` -eq 0 ]
then
echo "$word: 1" | cat >> $occurrences
# else it calculates the occurrences
else
old=`awk -v words=$word -F": " '$1==words { print $2 }' $occurrences`
let "new=old+1"
sed -i "s/^$word: $old$/$word: $new/g" $occurrences
fi
fi
done
rm .check
# finally it orders the words
awk -F": " '{print $2" "$1}' $occurrences | sort -rn | awk -F" " '{print $2": "$1}' > distribution.txt
Well, I'm not sure that I've understood the point of what you are trying to do,
but I would do it this way:
while read file
do
cat $file | tr -cs A-Za-z\' '\n'| tr A-Z a-z | sort | uniq -c > stat.$file
done < file-list
Now you have statistics for all your files, and you simply aggregate them:
while read file
do
cat stat.$file
done < file-list \
| sort -k2 \
| awk '{if ($2!=prev) {print s" "prev; s=0;}s+=$1;prev=$2;}END{print s" "prev;}'
Example of usage:
$ for i in ls bash cp; do man $i > $i.txt ; done
$ cat <<EOF > file-list
> ls.txt
> bash.txt
> cp.txt
> EOF
$ while read file; do
> cat $file | tr -cs A-Za-z\' '\n'| tr A-Z a-z | sort | uniq -c > stat.$file
> done < file-list
$ while read file
> do
> cat stat.$file
> done < file-list \
> | sort -k2 \
> | awk '{if ($2!=prev) {print s" "prev; s=0;}s+=$1;prev=$2;}END{print s" "prev;}' | sort -rn | head
3875 the
1671 is
1137 to
1118 a
1072 of
793 if
744 and
533 command
514 in
507 shell