Comparing two csv files with different lengths but outputting only the lines where the same value appears in two different columns - linux

I've been trying to compare two csv files using a simple shell script, but I think the code that I was using is not doing its job. What I want to do is compare the two files using Column 6 from first.csv and Column 2 in second.csv, and when they match, output the matching line from first.csv. See below for an example.
first.csv
1,0,3210820,0,536,7855712
1,0,3523340820,0,36,53712
1,0,321023423i420,0,336,0255712
1,0,321082234324,0,66324,027312
second.csv
14,7855712,Whie,Black
124,7855712,Green,Black
174,1197,Black,Orange
1284,98132197,Yellow,purple
35384,9811123197,purple,purple
13354,0981123131197,green,green
183434,0811912313127,white,green
Output should be from the first file:
1,0,3210820,0,536,7855712
I've been using the code below.
cat first.csv | while read line
do
cat second.csv | grep $line > output_file
done
Please help. Thank you.

Your question is not entirely clear, but here is what I think you want:
cat first.csv | while read -r LINE; do
VAL=$(echo "$LINE" | cut -d, -f6)
grep -q "$VAL" second.csv && echo "$LINE"
done
The first line in the loop extracts the 6th field from the line and stores it in VAL. The next line checks (quietly), if VAL occurs in second.csv and if so, outputs the line.
Note that grep will check for any occurrence in second.csv, not only in field 2. To check only against field 2, change it to:
cut -d, -f2 second.csv | grep -q "$VAL" && echo $LINE
Unrelated to your question, I would like to comment that these things can be solved much more efficiently in a language like Python.
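If you want to stay in the shell, a single awk pass over both files also avoids re-reading second.csv for every line of first.csv; here is a minimal sketch (it matches field 2 of second.csv against field 6 of first.csv, as in the question):
# remember field 2 of each line of second.csv, then print every line of first.csv whose field 6 was seen
awk -F, 'FNR==NR {seen[$2]=1; next} seen[$6]' second.csv first.csv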

Well... If you have bash with process substitution, you can treat all the 2nd fields in second.csv (each with a $ appended to anchor the search at the end of the line) as patterns read from a file. Then, using grep -f, you match data from the 2nd column of second.csv against the end of each line in first.csv, which does what you intend.
You can use the <(process) form to feed those 2nd fields to grep as if they were a file:
grep -f <(awk -F, '{print $2"$"}' second.csv) first.csv
Example Output
With the data you show in first.csv and second.csv you get:
1,0,3210820,0,536,7855712
Adding the "$" anchor as part of the 2nd field from second.csv should satisfy the match only in the 6th field (end of line) in first.csv.
The benefit here is that there is only a single call each to grep and awk, rather than an additional subshell spawned per iteration. That doesn't matter with small files like your sample input, but with millions of lines we are talking hours (or days) of difference in processing time.

Related

How to get first word of every line and pipe it into dmenu script

I have a text file like this:
first state
second state
third state
Getting the first word from every line isn't difficult, but the problem comes when adding the extra \n required to separate every word (selection) in dmenu, per its syntax:
echo -e "first\nsecond\nthird" | dmenu
I haven't been able to figure out how to add the separating \n. I've tried this:
state=$(awk '{for(i=1;i<=NF;i+=2)print $(i)'\n'}' text.txt)
But it doesn't work. I also tried this:
lol=$(grep -o "^\S*" states.txt | perl -ne 'print "$_"')
But same deal. Not sure what I'm doing wrong.
Your problem is in the AWK script. You need to identify each input line as a record. This way, you can control how each record in the output is separated via the ORS variable (output record separator). By default this separator is the newline, which should be good enough for your purpose.
Now to print the first word of every input record (each line in the input stream in this case), you just need to print the first field:
awk '{print $1}' textfile | dmenu
If you need the output to include the explicit \n string (not the control character), then you can just overwrite the ORS variable to fit your needs:
awk 'BEGIN{ORS="\\n"}{print $1}' textfile | dmenu
This could be done more easily with a while loop; could you please try the following. It is simple: while reads the file line by line and creates 2 variables, the 1st named first and the other named rest. first contains the first field of each line; those fields are printed one per line and the whole list is piped to dmenu.
while read -r first rest
do
  printf '%s\n' "$first"
done < "Input_file" | dmenu
Based on the text file example, the following should achieve what you require:
awk '{ printf "%s\\n",$1 }' textfile | dmenu
Print the first space-separated field of each line along with a literal \n (the backslash needs to be escaped to stop it being interpreted by awk).
In your code
state=$(awk '{for(i=1;i<=NF;i+=2)print $(i)'\n'}' text.txt)
you attempted to use ' inside your awk code; however, the awk program is whatever lies between the opening ' and the first following ', therefore the program becomes {for(i=1;i<=NF;i+=2)print $(i) and this does not work. You should use " for strings inside awk code.
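For illustration, here is a corrected sketch of that assignment using printf, which builds the literal-\n separated list (with no trailing \n) that your echo -e example expects:
# print the first word of every line, joined by a literal \n
state=$(awk '{printf "%s%s", sep, $1; sep="\\n"}' text.txt)
echo -e "$state" | dmenu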
If you merely want to get the nth column, cut will be enough in most cases. Let the content of states.txt be:
first state
second state
third state
then you can do:
cut -d ' ' -f 1 states.txt | dmenu
Explanation: treat space as delimiter (-d ' ') and get 1st column (-f 1)
(tested in cut (GNU coreutils) 8.30)

Delete lines from a file matching first 2 fields from a second file in shell script

Suppose I have setA.txt:
a|b|0.1
c|d|0.2
b|a|0.3
and I also have setB.txt:
c|d|200
a|b|100
Now I want to delete from setA.txt lines that have the same first 2 fields with setB.txt, so the output should be:
b|a|0.3
I tried:
comm -23 <(sort setA.txt) <(sort setB.txt)
But the equality is defined over the whole line, so it won't work. How can I do this?
$ awk -F\| 'FNR==NR{seen[$1,$2]=1;next;} !seen[$1,$2]' setB.txt setA.txt
b|a|0.3
This reads through setB.txt just once, extracts the needed information from it, and then reads through setA.txt while deciding which lines to print.
How it works
-F\|
This sets the field separator to a vertical bar, |.
FNR==NR{seen[$1,$2]=1;next;}
FNR is the number of lines read so far from the current file and NR is the total number of lines read. Thus, when FNR==NR, we are reading the first file, setB.txt. If so, set the value of associative array seen to true, 1, for the key consisting of fields one and two. Lastly, skip the rest of the commands and start over on the next line.
!seen[$1,$2]
If we get to this command, we are working on the second file, setA.txt. Since ! means negation, the condition is true if seen[$1,$2] is false which means that this combination of fields one and two was not in setB.txt. If so, then the default action is performed which is to print the line.
This should work:
sed -n 's#\(^[^|]*|[^|]*\)|.*#/^\1/d#p' setB.txt | sed -f- setA.txt
How this works:
sed -n 's#\(^[^|]*|[^|]*\)|.*#/^\1/d#p'
generates an output:
/^c|d/d
/^a|b/d
which is then used as a sed script for the next sed after the pipe and outputs:
b|a|0.3
(IFS=$'|'; cat setA.txt | while read x y z; do grep -q -P "\Q$x|$y|\E" setB.txt || echo "$x|$y|$z"; done; )
Explanation: grep -q means only test whether grep can find the regexp, but do not output anything; -P means use Perl syntax, so that the | is matched literally thanks to the \Q..\E construct.
IFS=$'|' makes bash use | instead of whitespace (space, tab, etc.) as the token separator.

Find the most common line in a file in bash

I have a file of strings:
string-string-123
string-string-123
string-string-123
string-string-12345
string-string-12345
string-string-12345-123
How do I retrieve the most common line in bash (string-string-123)?
You can use sort with uniq
sort file | uniq -c | sort -n -r
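With the sample input above, that pipeline prints every distinct line prefixed by its occurrence count, most frequent first, along the lines of:
      3 string-string-123
      2 string-string-12345
      1 string-string-12345-123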
You could use awk to do this:
awk '{++a[$0]}END{for(i in a)if(a[i]>max){max=a[i];k=i}print k}' file
The array a keeps a count of each line. Once the file has been read, we loop through it and find the line with the maximum count.
Alternatively, you can skip the loop in the END block by assigning the line during the processing of the file:
awk 'max < ++c[$0] {max = c[$0]; line = $0} END {print line}' file
Thanks to glenn jackman for this useful suggestion.
It has rightly been pointed out that the two approaches above will only print out one of the most frequently occurring lines in the case of a tie. The following version will print out all of the most frequently occurring lines:
awk 'max<++c[$0] {max=c[$0]} END {for(i in c)if(c[i]==max)print i}' file
Tom Fenech's elegant awk answer works great [in the amended version that prints all most frequently occurring lines in the event of a tie].
However, it may not be suitable for large files, because all distinct input lines are stored in an associative array in memory, which could be a problem if there are many non-duplicate lines; that said, it's much faster than the approaches discussed below.
Grzegorz Żur's answer combines multiple utilities elegantly to implicitly produce the desired result, but:
all distinct lines are printed (highest-frequency count first)
output lines are prefixed by their occurrence count (which may actually be desirable).
While you can pipe Grzegorz Żur's answer to head to limit the number of lines shown, you can't assume a fixed number of lines in general.
Building on Grzegorz's answer, here's a generic solution that shows all most-frequently-occurring lines - however many there are - and only them:
sort file | uniq -c | sort -n -r | awk 'NR==1 {prev=$1} $1!=prev {exit} 1'
If you don't want the output lines prefixed with the occurrence count:
sort file | uniq -c | sort -n -r | awk 'NR==1 {prev=$1} $1!=prev {exit} 1' |
sed 's/^ *[0-9]\{1,\} //'
Explanation of Grzegorz Żur's answer:
uniq -c outputs the set of unique input lines prefixed with their respective occurrence count (-c), followed by a single space.
sort -n -r then sorts the resulting lines numerically (-n), in descending order (-r), so that the most frequently occurring line(s) are at the top.
Note that sort, if -k is not specified, will generally try to sort by the entire input line, but -n causes only the longest prefix that is recognized as an integer to be used for sorting, which is exactly what's needed here.
Explanation of my awk command:
NR==1 {prev=$1} stores the 1st whitespace-separated field ($1) in variable prev for the first input line (NR==1)
$1!=prev {exit} terminates processing, if the 1st whitespace-separated field is not the same as the previous line's - this means that a non-topmost line has been reached, and no more lines need printing.
1 is shorthand for { print } meaning that the input line at hand should be printed as is.
Explanation of my sed command:
^ *[0-9]\{1,\} matches the numeric prefix (denoting the occurrence count) of each output line, as (originally) produced by uniq -c
applying s/...// means that the prefix is replaced with an empty string, i.e., effectively removed.
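With the sample input from the question there is no tie, so the full pipeline with the count prefix stripped prints just:
string-string-123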

Error with a script in bash

I have a little error with a script I wrote in bash and I can't figure out what I'm doing wrong.
Note that I'm using this script for thousands of calculations and this error happened only a few times (like 20 or so), but it still happened.
What the script does is this: basically it takes as input a web page that I got from a site with the utility w3m, and it counts all the occurrences of the words in it... Afterwards it orders them from the most common to the ones that occur only once.
This is the code:
#!/bin/bash
# counts the numbers of words from specific sites #
# writes in a file the occurrences ordered from the most common #
touch check # file used to analyze the occurrences
touch distribution # final file ordered
page=$1 # the web page that needs to be analyzed
occurrences=$2 # temporary file for the occurrences
dictionary=$3 # dictionary used for another purpose (ignore this)
# write the words one by column
cat $page | tr -c [:alnum:] "\n" | sed '/^$/d' > check
# loop to analyze the words
cat check | while read words
do
word=${words}
strlen=${#word}
# ignores blacklisted words or small ones
if ! grep -Fxq $word .blacklist && [ $strlen -gt 2 ]
then
# if the word isn't in the file
if [ `egrep -c -i "^$word: " $occurrences` -eq 0 ]
then
echo "$word: 1" | cat >> $occurrences
# else if it is already in the file, it calculates the occurrences
else
old=`awk -v words=$word -F": " '$1==words { print $2 }' $occurrences`
### HERE IS THE ERROR, EITHER THE LET OR THE SED ###
let "new=old+1"
sed -i "s/^$word: $old$/$word: $new/g" $occurrences
fi
fi
done
# orders the words
awk -F": " '{print $2" "$1}' $occurrences | sort -rn | awk -F" " '{print $2": "$1}' > distribution
# ignore this, not important
grep -w "1" distribution | awk -F ":" '{print $1}' > temp_dictionary
for line in `cat temp_dictionary`
do
if ! grep -Fxq $line $dictionary
then
echo $line >> $dictionary
fi
done
rm check
rm temp_dictionary
This is the error (I'm translating it, so it could be different in English):
./wordOccurrences line:30 let:x // where x is a number, usually 9 or 10 (but also 11, 13, etc)
1: syntax error in the expression (the error token is 1)
sed: expression -e #1, character y: command 's' not terminated // where y is another number (this one is also usually 9 or 10) with y being different from x
EDIT:
Talking with kev it looks like it's a newline problem
I added an echo between the let and the sed to print the sed command, and it worked perfectly for like 5 to 10 minutes until that error came up. Usually the sed without the error looked like this:
s/^CONSULENTI: 6$/CONSULENTI: 7/g
but when I got the error it was like this:
s/^00145: 1
1$/00145: 4/g
How do I fix this?
If you get a newline in $old, it means awk printed two lines, so there is a duplicate in $occurrences.
The script seems complicated for counting words, and not efficient because it launches many processes and processes the file in a loop;
maybe you can do something similar with
sort | uniq -c
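For instance, here is a minimal sketch reusing the word splitting your script already performs (the blacklist and minimum-length filtering are left out); it replaces the whole counting loop:
# one word per line, drop empty lines, count duplicates, sort by count, reformat as "word: count"
tr -c '[:alnum:]' '\n' < "$page" | sed '/^$/d' |
sort | uniq -c | sort -rn |
awk '{print $2": "$1}' > distribution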
You should also consider that your case-insensitivity is not consistent throughout the program. I created a page with just "foooo" in it and ran the program, then created one with "Foooo" in it and ran the program again. The 'old=`awk...' line sets 'old' to the empty string because awk is matching case sensitively. This results in the occurrences file not being updated. The subsequent sed and possibly some of the greps are also case sensitive.
This may not be the only error since it doesn't explain the error message you saw, but it is an indication that the same word with different capitalization will be handled erroneously by your script.
The following would separate the words, lowercase them, and then remove the ones smaller than three characters:
tr -cs '[:alnum:]' '\n' <foo | tr '[:upper:]' '[:lower:]' | egrep -v '^.{0,2}$'
Using this at the front of your script would mean that the rest of the script would not have to be case insensitive to be correct.

Count the number of occurrences in a string. Linux

Okay, so what I am trying to figure out is how to count the number of periods in a string and then cut everything up to that point, minus 2. Meaning, like this:
string="aaa.bbb.ccc.ddd.google.com"
number_of_periods="5"
number_of_periods=`expr $number_of_periods-2`
string=`echo $string | cut -d"." -f$number_of_periods`
echo $string
result: "aaa.bbb.ccc.ddd"
The way that I was thinking of doing it was sending the string to a text file and then just grepping for the number of times, like this:
grep -c "." infile
The reason I don't want to do that is that I want to avoid creating another text file, since I do not have permission to do so. It would also be simpler for the code I am trying to build right now.
EDIT
I don't think I made it clear, but I want to make finding the number of periods more dynamic, because the address I will be looking at will change as the script moves forward.
If you don't need to count the dots, but just remove the penultimate dot and everything afterwards, you can use Bash's built-in string manipulation.
${string%substring}
Deletes shortest match of $substring from back of $string.
Example:
$ string="aaa.bbb.ccc.ddd.google.com"
$ echo ${string%.*.*}
aaa.bbb.ccc.ddd
Nice and simple and no need for sed, awk or cut!
What about this:
echo "aaa.bbb.ccc.ddd.google.com"|awk 'BEGIN{FS=OFS="."}{NF=NF-2}1'
(further shortened by helpful comment from #steve)
gives:
aaa.bbb.ccc.ddd
The awk command:
awk 'BEGIN{FS=OFS="."}{NF=NF-2}1'
works by separating the input line into fields (FS) by ., then joining them as output (OFS) with ., but the number of fields (NF) has been reduced by 2. The final 1 in the command is responsible for the print.
This will reduce a given input line by eliminating the last two period separated items.
This approach is "shell-agnostic" :)
Perhaps this will help:
#!/bin/sh
input="aaa.bbb.ccc.ddd.google.com"
number_of_fields=$(echo $input | tr "." "\n" | wc -l)
interesting_fields=$(($number_of_fields-2))
echo $input | cut -d. -f-${interesting_fields}
grep -o "\." <<<"aaa.bbb.ccc.ddd.google.com" | wc -l
5
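Hooking that count into the cut from the question then gives a dynamic version without any temporary file, for example (a sketch):
string="aaa.bbb.ccc.ddd.google.com"
number_of_periods=$(grep -o "\." <<<"$string" | wc -l)
# keep everything before the penultimate period
echo "$string" | cut -d. -f-$((number_of_periods - 1))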
