I was trying to read a file, count a specific number at a specific place, and show how many times it appears. For example:
The 1st field is a number, the 2nd the brand name, the 3rd the group it belongs to; the 4th and 5th are not important.
1:audi:2:1990:5
2:bmw:2:1987:4
3:bugatti:3:1988:19
4.buick:4:2000:12
5:dodge:2:1999:4
6:ferrari:2:2000:4
As output, I want to search by column 3, group the 2's together (by brand name), and count how many of them I have.
The output i am looking for should look like this:
1:audi:2:1990:5
2:bmw:2:1987:4
5:dodge:2:1999:4
6:ferrari:2:2000:4
4 -> showing how many lines there are.
I have tried the following approach but can't figure it out:
file="cars.txt"; sort -t ":" -k3 $file #sorting by the 3rd field
grep -c '2' cars.txt # this counts every line containing a 2 anywhere, including line number 2.
I hope you understand, and thank you in advance.
I am not sure exactly what you mean by "group together by brand name", but the following will get you the output that you describe.
awk -F':' '$3 == 2' Input.txt
If you want a line count, you can pipe that to wc -l.
awk -F':' '$3 == 2' Input.txt | wc -l
I guess line 4 is 4:buick and not 4.buick. Then I suggest this (note $3 == 2 rather than $3 ~ 2, since the regex match would also match 12, 20, 21, and so on):
$ awk 'BEGIN{FS=":"} $3==2{total++; print} END{print "TOTAL --- " total}' Input.txt
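For completeness, a grep-only variant is also possible by anchoring the match to the third field instead of matching a bare 2 anywhere on the line (a sketch, assuming no field ever contains a colon):
grep '^[^:]*:[^:]*:2:' cars.txt      # print the matching lines
grep -c '^[^:]*:[^:]*:2:' cars.txt   # count them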
Plain bash solution:
#!/bin/bash
count=0
while IFS=":" read -ra line; do
    if (( ${line[2]} == 2 )); then
        (IFS=":"; echo "${line[*]}")  # re-join the fields with ':' in a subshell so IFS isn't changed globally
        (( count++ ))
    fi
done < file
echo "Count = $count"
Output:
1:audi:2:1990:5
2:bmw:2:1987:4
5:dodge:2:1999:4
6:ferrari:2:2000:4
Count = 4
I would like to write a little shell script that checks whether all lines in a file have the same number of ;
I have a file in the following format:
$ cat filename.txt
34567890;098765456789;098765567;9876;9876;EXTG;687J;
4567800987987;09876789;9667876YH;9876;098765;098765;09876;
SLKL987H;09876LKJ;POIUYT;PÖIUYT;88765K;POIUYTY;LKJHGFDF;
TYUIO;09876LKJ;POIUYT;LKJHG;88765K;POIUYTY;OIUYT;
...
...
...
SDFGHJK;RTYUIO9876;4567890LKJHGFD;POIUYTRF56789;POIUY;POIUYT;9876;
I use the following command to determine the number of ; on each line:
awk -F';' 'NF{print (NF-1)}' filename.txt
I have the following output :
7
7
7
7
...
...
...
7
because the number of ; on each line of this file is 7.
Now, I want to write a script that verifies that all the lines in the file have 7 semicolons. If so, it tells me that the file is correct. Otherwise, if any line contains a different number of semicolons, it tells me that the file is not correct.
Rather than printing output, return a value, e.g.
awk -F';' 'NR==1{count = NF} NF!=count{status=1}END{exit status}' filename.txt
If there are no lines or if all lines contain the same number of fields, this will return 0. Otherwise, it returns 1 to indicate failure.
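Since the result is an exit status, it can drive the good/bad message directly; a minimal sketch of the wrapper the question asks for:
if awk -F';' 'NR==1{count = NF} NF!=count{status=1}END{exit status}' filename.txt; then
    echo "the file is correct"
else
    echo "the file is not correct"
fi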
Count the number of unique lines and verify that the count is 1.
if (($(awk -F';' 'NF{print (NF-1)}' filename.txt | uniq | wc -l) == 1)); then
echo good
else
echo bad
fi
Just pipe the result through sort -u | wc -l. If all lines have the same number of fields, this will produce one line of output.
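Putting that together, a sketch of the same check using sort -u:
if [ "$(awk -F';' 'NF{print (NF-1)}' filename.txt | sort -u | wc -l)" -eq 1 ]; then
    echo good
else
    echo bad
fi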
Alternatively, just look for a line in awk that doesn't have the same number of fields as the first line.
awk -F';' 'NR==1 {linecount=NF}
linecount != NF { print "Bad line " $0; exit 1}
' filename.txt && echo "Good file"
You can also adapt the old trick used to output only the first of duplicate lines.
awk -F';' '{a[NF]=1}; length(a) > 1 {exit 1}' filename.txt
Each line records its field count as a key in the array a. As soon as a holds more than one distinct key, exit with status 1. Basically, a acts as a set of all field counts seen so far. (Note that length() on an array is a gawk extension.)
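As with the other answers, the exit status can drive the message (a sketch; length() on an array requires gawk):
awk -F';' '{a[NF]=1}; length(a) > 1 {exit 1}' filename.txt && echo "correct" || echo "not correct"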
Based on all the information you have given me, I ended up doing the following. And it works for me.
nbCol=$(awk -F';' 'NR==1{print NF-1}' "$1")   # number of ';' on the first line
val=7
awk -F';' 'NR==1{count = NF} NF != count { exit 1 }' "$1"
result=$?
if [ "$result" -eq 0 ] && [ "$nbCol" -eq "$val" ]; then
echo "Good Format"
else
echo "Bad Format"
fi
I have a list file, which has an id and a number, and I am trying to get the lines from a master file which do not have those ids.
List file
nw_66 17296
nw_67 21414
nw_68 21372
nw_69 27387
nw_70 15830
nw_71 32348
nw_72 21925
nw_73 20363
master file
nw_1 5896
nw_2 52814
nw_3 14537
nw_4 87323
nw_5 56466
......
......
nw_n xxxxx
So far I am trying this, but it is not working as expected:
for i in $(awk '{print $1}' list.txt); do grep -v -w $i master.txt; done;
Kindly help
Give this awk one-liner a try:
awk 'NR==FNR{a[$1]=1;next}!a[$1]' list master
Maybe this helps:
awk 'NR == FNR {id[$1]=1;next}
{
if (id[$1] == "") {
print $0
}
}' listfile masterfile
We accept 2 files as input above; the first one is listfile, the second is masterfile.
NR == FNR is true while awk is going through listfile. In the associative array id[], every id in listfile is made a key with value 1.
When awk goes through masterfile, it only prints a line if $1, i.e. the id, is not a key in the array id.
The OP attempted the following line:
for i in $(awk '{print $1}' list.txt); do grep -v -w $i master.txt; done;
This line will not work: for every entry $i, it prints all lines in master.txt that are not equivalent to "$i". As a consequence, you will end up with multiple copies of master.txt, each missing a single line.
Example:
$ for i in 1 2; do grep -v -w "$i" <(seq 1 3); done
2 \ copy of seq 1 3 without entry 1
3 /
1 \ copy of seq 1 3 without entry 2
3 /
Furthermore, the attempt reads the file master.txt multiple times, which is very inefficient.
The unix tool grep allows one to check multiple patterns stored in a file in a single pass. This is done using the -f flag. Normally this looks like:
$ grep -f list.txt master.txt
The OP can use this now in the following way:
$ grep -vwf <(awk '{print $1}' list.txt) master.txt
But this would match anywhere on the full line, not just against column 1.
The awk solution presented by Kent is more flexible and allows the OP to define a more tuned match:
awk 'NR==FNR{a[$1]=1;next}!a[$1]' list master
Here the OP clearly states: I want to match column 1 of list with column 1 of master, and I don't care about spaces or whatever is in column 2. The grep solution could still match entries in column 2.
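If the OP prefers to stay with grep, the patterns can be anchored to column 1 by hand; a sketch, assuming the id is always the first field and is followed by a space:
grep -vf <(awk '{print "^" $1 " "}' list.txt) master.txt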
I have the following csv file
more file.csv
1,yes,yes,customer1,1,2,3,4
2,no,yes,customer5,34,56,33,2
3,yes,yes,customer11
4,no,no,customer14
5,yes,no,customer15
6,yes,yes,customer21
7,no,yes,customer34
8,no,yes,customer89
The following awk line was written in order to take lines from the csv and put each matching element (line) into the variable LINES:
declare LINES=` awk -F, 'BEGIN{IGNORECASE=1} $2=="yes" {printf "\"Line number %d customer %s\"\n", $1, $4}' file.csv `
echo $LINES
"Line number 1 customer customer1" "Line number 3 customer customer11" "Line number 5 customer customer15" "Line number 6 customer 21”
but when I want to print the number of elements in the LINES variable, I get 1:
echo ${#LINES[*]}
1
while I actually need to get 4 elements (lines).
Please advise how to fix the awk line so that I get 4 elements.
Remark:
please see this example: when I populate LINES manually, the element count is 4:
declare LINES=( "Line number 1 customer customer1" "Line number 3 customer customer11" "Line number 5 customer customer15" "Line number 6 customer 21” )
echo ${#LINES[*]}
4
The awk output isn't being stored in an array. You’d need declare -a LINES=($(...)) to do that. But even then, bash splits array elements on any whitespace, not only newlines. And if you were to wrap the process substitution in quotes like LINES=("$(...)") you would only have a single element containing the entire output from the command.
You could do the necessary text manipulation with a read loop instead, which preserves elements that contain whitespace.
declare -a lines
while IFS=, read -r line_number answer _ customer _; do
if [[ $answer == @(yes|YES) ]]; then
lines+=("Line number $line_number customer $customer")
fi
done < file.csv
As noted in the comments, depending on the bash version, usage of @(...) inside [[ ... ]] may require shopt -s extglob.
Alternatively, the if could be replaced with a case:
case $answer in
yes|YES)
LINES+=("Line number $line_number customer $customer")
;;
esac
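Either way, a quick sanity check against the sample file.csv:
echo "${#lines[@]}"    # prints 4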
Try this:
a=$(awk -F, 'BEGIN{IGNORECASE=1} $2=="yes" {printf "Line number %d customer %s;", $1, $4}' file.csv)
IFS=';' read -a LINES <<< "${a}"
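A quick check against the sample file.csv (hypothetical session):
$ echo "${#LINES[@]}"
4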
As @JohnB mentioned, you are populating LINES as a scalar variable, not an array. Try this:
$ IFS=$'\n' LINES=( $(awk 'BEGIN{for(i=1;i<=3;i++) printf "\"Line number %d\"\n", i}') )
$ echo ${#LINES[*]}
3
$ echo "${LINES[0]}"
"Line number 1"
$ echo "${LINES[1]}"
"Line number 2"
$ echo "${LINES[2]}"
"Line number 3"
and tweak to suit your real input/output which would probably result in:
IFS=$'\n' LINES=( $(awk -F, 'BEGIN{IGNORECASE=1} $2=="yes"{printf "\"Line number %d customer %s\"\n", $1, $4}' file.csv) )
If you're using bash, you can just use the mapfile builtin:
$ mapfile -t LINES < \
<(awk -F, 'BEGIN{IGNORECASE=1}
$2=="yes" {printf "\"Line number %d customer %s\"\n", $1, $4}' file.csv)
$ echo "${#LINES[*]}"
4
$ echo "${LINES[#]}"
"Line number 1 customer customer1" "Line number 3 customer customer11" "Line number 5 customer customer15" "Line number 6 customer customer21"
I am trying to get the column names of a file and print them iteratively. I guess the problem is with the print $i but I don't know how to correct it. The code I tried is:
#! /bin/bash
for i in {2..5}
do
set snp = head -n 1 smaller.txt | awk '{print $i}'
echo $snp
done
Example input file:
ID Name Age Sex State Ext
1 A 12 M UT 811
2 B 12 F UT 818
Desired output:
Name
Age
Sex
State
Ext
But the output I get is a blank screen.
You'd better just read the first line of your file and store the result as an array:
read -a header < smaller.txt
and then printf the relevant fields:
printf "%s\n" "${header[#]:1}"
Moreover, this uses bash only, and involves no unnecessary loops.
Edit. To also answer your comment, you'll be able to loop through the header fields thus:
read -a header < smaller.txt
for snp in "${header[@]:1}"; do
echo "$snp"
done
Edit 2. Your original method had many many mistakes. Here's a corrected version of it (although what I wrote before is a much preferable way of solving your problem):
for i in {2..5}; do
snp=$(head -n 1 smaller.txt | awk "{print \$$i}")
echo "$snp"
done
set probably doesn't do what you think it does.
Because of the single quotes in awk '{print $i}', the $i never gets expanded by bash.
This algorithm is not good since you're calling head and awk 4 times, whereas you don't need a single external process.
Hope this helps!
You can print it using awk itself:
awk 'NR==1{for (i=2; i<=5; i++) print $i}' smaller.txt
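If you'd rather not hardcode the last column number, a variant of the same idea prints every header field after the first (a sketch):
awk 'NR==1{for (i=2; i<=NF; i++) print $i; exit}' smaller.txt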
The main problem with your code is that your assignment syntax is wrong. Change this:
set snp = head -n 1 smaller.txt | awk '{print $i}'
to this:
snp=$(head -n 1 smaller.txt | awk -v i="$i" '{print $i}')
That is:
Do not use set. set is for setting shell options, numbered parameters, and so on, not for assigning arbitrary variables.
Remove the spaces around =.
To run a command and capture its output as a string, use $(...) (or `...`, but $(...) is less error-prone).
Pass the loop variable into awk with -v, since a $i inside single quotes is never expanded by the shell.
That said, I agree with gniourf_gniourf's approach.
Here's another alternative; not necessarily better or worse than any of the others (note that it relies on word splitting and also prints the first column):
for n in $(head -n 1 smaller.txt)
do
echo ${n}
done
Something like:
for x1 in $(head -n1 smaller.txt );do
echo $x1
done
I am able to find the number of times a word occurs in a text file. For example, in Linux we can use:
cat filename|grep -c tom
My question is, how do I find the count of multiple words, like "tom" and "joe", in a text file?
Since you have a couple of names, regular expressions are the way to go on this one. At first I thought it was as simple as just a grep count on the regular expression of joe or tom, but found that this did not account for the scenario where tom and joe are on the same line (or tom and tom, for that matter).
test.txt:
tom is really really cool! joe for the win!
tom is actually lame.
$ grep -c '\<\(tom\|joe\)\>' test.txt
2
As you can see from the test.txt file, 2 is the wrong answer, so we needed to account for names being on the same line.
I then used grep -o to show only the parts of matching lines that match the pattern, which gave the correct matches of tom or joe in the file. I then piped the results into wc -l for the line count (keeping the same word boundaries as above):
$ grep -o '\<\(joe\|tom\)\>' test.txt | wc -l
3
3...the correct answer! Hope this helps
Ok, so first split the file into words, then sort and uniq:
tr -cs '[:alnum:]' '\n' < testdata | sort | uniq -c
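To pull out just the words of interest from that frequency table, filter it (a sketch):
tr -cs '[:alnum:]' '\n' < testdata | sort | uniq -c | grep -wE 'tom|joe'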
You use uniq:
sort filename | uniq -c
Note that this counts identical lines, not words, so split the file into words first (as shown above).
Use awk:
# count every field (word) on every input line
{for (i=1;i<=NF;i++)
    count[$i]++
}
# print the frequency table: count followed by word
END {
    for (i in count)
        print count[i], i
}
This will produce a complete word frequency count for the input.
Pipe the output to grep to get the desired words:
awk -f w.awk input | grep -E 'tom|joe'
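With the test.txt from the first answer this prints something like the following (the order of for (i in count) is unspecified):
2 tom
1 joe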
BTW, you do not need cat in your example; most programs that act as filters can take the filename as a parameter, hence it's better to use
grep -c tom filename
Otherwise, there is a strong possibility that people will start throwing the Useless Use of Cat Award at you ;-)
The sample you gave does not search for the word "tom". It will count "atom" and "bottom" and many more.
Grep searches for regular expressions. Regular expression that matches word "tom" or "joe" is
\<\(tom\|joe\)\>
You could do it with a regexp:
cat filename | tr ' ' '\n' | grep -c -e '\<\(joe\|tom\)\>'
Here is one ([:blank:] is a subset of [:space:], so it can be dropped):
tr -s '[:punct:][:space:]' '\n' < txt | sort | uniq -c
UPDATE
A shell script solution:
#!/bin/bash
file_name="$2"
string="$1"
if [ $# -ne 2 ]
then
echo "Usage: $0 <pattern to search> <file_name>"
exit 1
fi
if [ ! -f "$file_name" ]
then
echo "file \"$file_name\" does not exist, or is not a regular file"
exit 2
fi
line_no_list=("")
curr_line_indx=1
line_no_indx=0
total_occurance=0
# line_no_list holds, at index k, a line number and, at index k+1, the number
# of times the string occurs on that line
while IFS= read -r line
do
flag=0
while [[ "$line" == *$string* ]]
do
flag=1
line_no_list[line_no_indx]=$curr_line_indx
line_no_list[line_no_indx+1]=$((line_no_list[line_no_indx+1]+1))
total_occurance=$((total_occurance+1))
# replace the first occurrence of "$string" with nothing and recheck
line=${line/"$string"/}
done
# if we have entered the while loop then increment the
# line index to access the next array pos in the next
# iteration
if (( flag == 1 ))
then
line_no_indx=$((line_no_indx+2))
fi
curr_line_indx=$((curr_line_indx+1))
done < "$file_name"
echo -e "\nThe string \"$string\" occurs \"$total_occurance\" times"
echo -e "The string \"$string\" occurs in \"$((line_no_indx/2))\" lines"
echo "[Occurence # : Line Number : Nos of Occurance in this line]: "
for ((i=0; i<line_no_indx; i=i+2))
do
echo "$((i/2+1)) : ${line_no_list[i]} : ${line_no_list[i+1]} "
done
echo
I completely forgot about grep -f:
grep -cf names filename
(The -f must come last in the option cluster, since it takes the pattern file as its argument; also note this counts matching lines, not total occurrences.)
AWK solution:
Assuming the names are in a file called names:
cat filename | awk 'NR==FNR {h[NR] = $1; ct[NR] = 0; cnt=NR} NR != FNR {for(i=1;i<=cnt;++i) if(match($0,h[i])!=0) ++ct[i] } END {for(i in h) print h[i], ct[i]}' names -
Note that match() registers at most one hit per line per name.
Note that your original grep doesn't search for words. e.g.
$ echo tomorrow | grep -c tom
1
You need grep -w
gawk -vRS='[^[:alpha:]]+' '{print}' | grep -Ec '^(tom|joe|bob|sue)$'
The gawk program sets the record separator to anything non-alphabetic, so every word will end up on a separate line. Then grep counts lines that match one of the words you want exactly.
We use gawk because POSIX awk doesn't allow a regex record separator.
For brevity, you can replace '{print}' with 1 - either way, it's an Awk program that simply prints out all input records ("is 1 true? it is? then do the default action, which is {print}.")
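For example, with the two-line test.txt from the first answer (hypothetical session):
$ gawk -vRS='[^[:alpha:]]+' 1 test.txt | grep -Ec '^(tom|joe)$'
3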
To find all hits in all lines
echo "tom is really really cool! joe for the win!
tom is actually lame." | awk '{i+=gsub(/tom|joe/,"")} END {print i}'
3
This will count "tomtom" as 2 hits.