How do I count lines where a specific column matches either of two patterns? - linux

year start year end location topic data type data value
2016 2017 AL Alcohol Crude Prevalence 16.9
2016 2017 CA Alcohol Other 15
2016 2017 AZ Neuropathy Other 13.1
2016 2017 HI Smoke Crude Prevalence 20
2016 2017 IL Cancer Other 20
2016 2017 KS Cancer Other 14
2016 2017 AZ Smoke Crude Prevalence 16.9
2016 2017 KY Cancer Other 13.8
2016 2017 LA Alcohol Crude Prevalence 18
The task is to count the lines whose "topic" column is either "Alcohol" or "Cancer".
I already found the index of the column named "topic", but the contents I extract from it are not correct, so I am not able to count the lines containing "Alcohol" and "Cancer". How can I solve this?
Here is my code:
awk '{print $4}' AAA.csv > topic.txt
head -n5 topic.txt | less

You could try the following:
The call to awk extracts the column in question, the grep filters for the keywords, and the word count counts the matching lines:
$ awk '{ print $4 }' data.txt | grep -e Alcohol -e Cancer | wc -l
6
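Alternatively, awk can do the filtering and the counting by itself; a sketch, assuming the topic values appear as exact strings in the 4th field:
$ awk '$4 == "Alcohol" || $4 == "Cancer" { n++ } END { print n }' data.txt
6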

Using a regexp with grep:
cat data.txt | tr -s " " | cut -d " " -f 4 | grep -E '(Alcohol|Cancer)' | wc -l
If you are sure that the words "Alcohol" and "Cancer" only appear in the 4th column, you can simply do:
grep -E '(Alcohol|Cancer)' data.txt | wc -l
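As a side note, grep can count matching lines itself with -c, so the wc -l can be dropped:
grep -cE 'Alcohol|Cancer' data.txt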
Addition
The OP asked in a comment:
If there are many columns and I don't know their indices, how can I extract a column based only on its name ("topic")?
The following code stores in the variable i the number of the column containing "topic". Essentially, it reads the first line of data.txt into an array s, then walks through the array elements until it finds the desired word. (You have to increase i by one at the end because bash array indices start from 0, while awk field numbers start from 1.)
Note: the code only gives a meaningful result if a column named "topic" actually exists.
read -ra s < <(head -n 1 data.txt)   # not piped into read, so the array survives in this shell
for (( i=0; i<${#s[@]}; i++ ))
do
    if [ "${s[$i]}" == "topic" ]
    then
        break
    fi
done
i=$(( i + 1 ))
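Alternatively, awk can look the column up by name and do the counting in a single pass. A sketch, assuming each header name is a single word (note that the sample header above has multi-word names such as "year start", which would throw the field positions off):
awk 'NR == 1 { for (j = 1; j <= NF; j++) if ($j == "topic") c = j; next }  # find the column on the header line
     $c == "Alcohol" || $c == "Cancer" { n++ }                             # count matching rows
     END { print n }' data.txt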

How can I change my program so that it displays the expected output

I'm trying to print out the happiest countries in the world for 2022 by fetching the data from https://en.wikipedia.org/wiki/World_Happiness_Report?action=raw and then displaying the first 5 countries. Here is my code:
#!/bin/bash
content=$(curl -s "https://en.wikipedia.org/wiki/World_Happiness_Report?action=raw")
lines=$(echo "$content" | grep '^\|' | sed -n '/2022/{n;p;}')
top_5=$(echo "$lines" | awk '{print $3}' | sort | head -n 5)
echo "$top_5"
However, when I run this code in Ubuntu, nothing shows up; it's just blank, like this:
....(My computer server).....:~$ bash happy_countriesnew.sh
#(I'm expecting there to be a list here)
....(My computer server).....:~$
I'm expecting something like this instead of the blank space my terminal is displaying:
Finland
Norway
Denmark
Iceland
Switzerland
Netherlands
Canada
New Zealand
Sweden
Australia
What am I doing wrong and what should I change?
echo | grep | sed | awk is a bit of an anti-pattern. Typically, you want to refactor such pipelines into a single call to awk. In your case, the code that attempts to extract the 2022 data is flawed. The data is already sorted, so you can drop the sort and get the data you want with:
sed -n '/^=== 2022 report/,/^=/{ s/}}//; /^|[12345]|/s/.*|//p; }'
The first portion (the /^=== 2022 report/,/^=/) tells sed to work only on the lines between those matching the two given patterns, which is the data you are interested in. The rest just cleans up and extracts the country name, printing only those lines whose 2nd field is exactly one of the single digits 1, 2, 3, 4, or 5.
Note that this is not terribly flexible, and it is difficult to modify it to print the top 7 or the top 12, so you might want something like:
sed -n '/^=== 2022 report/,/^=/{ s/}}//; /^|[[:digit:]]/s/.*|//p; }' | head -n 5
Note that it could be argued that sed | head is also a bit of an anti-pattern, but keeping track of lines of output in sed is tedious and the pipe to head is less egregious than attempting to write such code.
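For completeness, the whole thing as a single pipeline (assuming the page markup is still laid out the way it was when this was written):
curl -s "https://en.wikipedia.org/wiki/World_Happiness_Report?action=raw" |
sed -n '/^=== 2022 report/,/^=/{ s/}}//; /^|[[:digit:]]/s/.*|//p; }' |
head -n 5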
I guess you are seeing this error (but ignoring it):
grep: empty (sub)expression
The problem is with your grep expression; remove the escape:
lines=$(echo "$content" | grep '^|' | sed -n '/2022/{n;p;}')
and check for errors.
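You could also write the anchor as a bracket expression, which keeps the | literal in every grep flavour:
lines=$(echo "$content" | grep '^[|]' | sed -n '/2022/{n;p;}')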
Using awk:
awk -F"{{|}}|[|]" '/^=== 2022 rep/ {f=1} /^=== 2021 rep/ {f=0} {if(f==1 && /flag/) {print $6}}' <<<"$content" | head -n 5
Finland
Denmark
Iceland
Switzerland
Netherlands
-F"{{|}}|[|]" # set field separator to '{{' or '}}' or '|'
/^=== 2022 rep/ {f=1} # set flag if line starts with '=== 2022 rep'
/^=== 2021 rep/ {f=0} # unset flag if line starts with '=== 2021 rep'
{if(f==1 && /flag/) {print $6}} # if f is set and the line contains 'flag', print the 6th field
Note: Assumes "$content" variable is populated via content=$(curl -s "https://en.wikipedia.org/wiki/World_Happiness_Report?action=raw")
-- or --
You could use bash process substitution and avoid the intermediate content variable altogether:
awk -F"{{|}}|[|]" '/^=== 2022 rep/ {f=1} /^=== 2021 rep/ {f=0} {if(f==1 && /flag/) {print $6}}' < <(curl -s "https://en.wikipedia.org/wiki/World_Happiness_Report?action=raw") | head -n 5
Output:
Finland
Denmark
Iceland
Switzerland
Netherlands
curl …………… |
gawk 'NF *= 2<NF' FS='^[|][1-5][|][|][{][{]flag[|]|[}][}]$' OFS=
Finland
Denmark
Iceland
Switzerland
Netherlands
If you want to shrink it even further:
mawk 'NF *= 2<NF' FS='^[|][1-5][|].+[|]|[}]+$' OFS=
This approach makes it easy to expand the list to, say, the top 17:
nawk 'NF *= 2<NF' FS='^[|]([1-9]|1[0-7])[|].+[|]|[}]+$' OFS=
1 Finland
2 Denmark
3 Iceland
4 Switzerland
5 Netherlands
6 Luxembourg
7 Sweden
8 Norway
9 Israel
10 New Zealand
11 Austria
12 Australia
13 Ireland
14 Germany
15 Canada
16 United States
17 United Kingdom

Bash alias for awk command produces different result from running command directly [duplicate]

I wrote an awk command to deduplicate a .csv file. I'm running Ubuntu 20.04. This is the command:
awk -F, ' {key = $2 FS} !seen[key]++' gigs.csv > try.csv
I don't want to have to type it all the time, so I made an alias for it in ~/.bash_aliases as follows:
alias dedupe="awk -F, ' {key = $2 FS} !seen[key]++' gigs.csv > try.csv"
However, when I run dedupe in my terminal, it produces only one line, which is not the same result as when I type out the full command. The full command produces the desired results. Did I make a mistake with the aliasing? Why does this happen, and how can I resolve it?
Here is a sample from the original .csv file:
Tue 30 Aug 08:34:17 AM,Do you use facebook? work remote from home. we are hiring!,https://atlanta.craigslist.org/atl/cpg/d/atlanta-do-you-use-facebook-work-remote/7527729597.html
Mon 29 Aug 03:51:29 PM,Cash for your opinions!,https://atlanta.craigslist.org/atl/cpg/d/atlanta-cash-for-your-opinions/7527517063.html
Mon 29 Aug 01:22:54 PM,Telecommute earn $20 per easy online product test gig w/ free products,https://montgomery.craigslist.org/cpg/d/hope-hull-telecommute-earn-20-per-easy/7527471859.html
Mon 29 Aug 01:53:58 PM,Telecommute earn $20 per easy online product test gig w/ free products,https://atlanta.craigslist.org/atl/cpg/d/smyrna-telecommute-earn-20-per-easy/7527456060.html
Mon 29 Aug 12:50:59 PM,Telecommute earn $20 per easy online product test gig w/ free products,https://bham.craigslist.org/cpg/d/adamsville-telecommute-earn-20-per-easy/7527454527.html
Wed 31 Aug 09:23:41 PM,Looking for a sales development rep,https://bham.craigslist.org/cpg/d/adamsville-looking-for-sales/7528472497.html
Wed 31 Aug 11:21:58 AM,Earn ~$30 | work from home | looking for 'ok google' users | taskverse,https://bham.craigslist.org/cpg/d/harbor-city-earn-30-work-from-home/7528233394.html
Mon 29 Aug 12:50:59 PM,Telecommute earn $20 per easy online product test gig w/ free products,https://bham.craigslist.org/cpg/d/adamsville-telecommute-earn-20-per-easy/7527454527.html
Wed 31 Aug 11:28:56 AM,Earn ~$30 | work from home | looking for 'ok google' users | taskverse,https://tuscaloosa.craigslist.org/cpg/d/harbor-city-earn-30-work-from-home/7528236901.html
Wed 31 Aug 11:27:53 AM,Earn ~$30 | work from home | looking for 'ok google' users | taskverse,https://montgomery.craigslist.org/cpg/d/harbor-city-earn-30-work-from-home/7528236389.html
Define the alias using single quotes rather than double quotes. Nothing is special inside single quotes, so you won't have unexpected expansions like "... $2 ..." being replaced with the value of the 2nd positional parameter. (That is exactly what happened here: $2 was empty when the alias was defined, leaving key = FS for every line; with a constant key, !seen[key]++ is true only once, which is why the alias prints a single line.) The only catch is that to include an inner single quote you have to break the quoting with ' ... '\'' ... ' or ' ... '"'"' ... '
alias dedupe='awk -F, '\'' {key = $2 FS} !seen[key]++'\'' gigs.csv > try.csv'
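If you are ever unsure what an alias actually stored, the type builtin prints its definition, which makes a prematurely expanded $2 easy to spot:
type dedupe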
A function may be preferable in this case:
dedupe () { awk -F, ' {key = $2 FS} !seen[key]++' gigs.csv > try.csv; }
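A function also makes the file names easy to parameterize; a sketch, where the optional positional parameters (defaulting to the original gigs.csv and try.csv) are a hypothetical extension:
dedupe () { awk -F, ' {key = $2 FS} !seen[key]++' "${1:-gigs.csv}" > "${2:-try.csv}"; }
You could then call it as dedupe other.csv deduped.csv.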

How to sort this in bash

Hello, I have a file containing these lines:
apple
12
orange
4
rice
16
How can I use bash to sort it by the numbers?
Suppose each number is the price of the object above it.
I want them formatted like this:
12 apple
4 orange
16 rice
or
apple 12
orange 4
rice 16
Thanks
A solution using paste + sort to get each product sorted by its price:
$ paste - - < file | sort -k 2nr
rice 16
apple 12
orange 4
Explanation
From paste man:
Write lines consisting of the sequentially corresponding lines from
each FILE, separated by TABs, to standard output. With no FILE, or
when FILE is -, read standard input.
paste reads the stream coming from stdin (your < file); each - stands for standard input, so - - consumes two consecutive lines and writes them out as two TAB-separated columns.
sort uses the key specification -k 2nr to sort paste's output by the second column in reverse numerical order.
You can use awk:
awk '!(NR%2){printf "%s %s\n", $0, p} {p=$0}' inputfile
(slightly adapted from this answer)
If you want to sort the output afterwards, you can use sort (quite logically):
awk '!(NR%2){printf "%s %s\n", $0, p} {p=$0}' inputfile | sort -n
This would give:
4 orange
12 apple
16 rice
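For reference, here is the same one-liner spelled out with comments:
awk '!(NR%2) {                    # NR is even: this line is the number
         printf "%s %s\n", $0, p  # print the number, then the saved name
     }
     { p = $0 }                   # remember this line for the next record
    ' inputfile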
Another solution using awk
$ awk '/[0-9]+/{print prev, $0; next} {prev=$0}' input
apple 12
orange 4
rice 16
while read -r line1 && read -r line2; do
    printf '%s %s\n' "$line1" "$line2"
done < input_file
If you want the lines sorted numerically by price, pipe the result to sort -k2n (note the n; a plain lexical sort would put 12 before 4):
while read -r line1 && read -r line2; do
    printf '%s %s\n' "$line1" "$line2"
done < input_file | sort -k2n
You can do this using paste and awk
$ paste - - <lines.txt | awk '{printf("%s %s\n",$2,$1)}'
12 apple
4 orange
16 rice
An awk-based solution that needs no external paste/sort, no regex matching, no modulo (%) arithmetic, and no awk/bash loops:
{m,g}awk '(_*=--_) ? (__ = $!_)<__ : ($++NF = __)_' FS='\n'
12 apple
4 orange
16 rice

How to grep two words in a line when some number of arbitrary words appear between them

Given a file with this content:
Feb 1 ohio a1 rambo
Feb 1 ny a1 sandy
Feb 1 dc a2 rambo
Feb 2 alpht a1 jazzy
I only want the count of those lines containing Feb 1 and rambo.
You can use awk to do this more efficiently:
$ awk '/Feb 1/ && /rambo/' file
Feb 1 ohio a1 rambo
Feb 1 dc a2 rambo
To count matches:
$ awk '/Feb 1/ && /rambo/ {sum++} END{print sum}' file
2
Explanation
awk '/Feb 1/ && /rambo/' says: match all lines in which both Feb 1 and rambo appear. When this evaluates to true, awk performs its default behaviour: print the line.
awk '/Feb 1/ && /rambo/ {sum++} END{print sum}' does the same, only that instead of printing the line it increments the variable sum. When the file has been fully scanned, awk enters the END block, where it prints the value of sum.
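One small robustness note: if no line matches, sum is never set and the END block prints an empty line; adding 0 forces numeric context so you get 0 instead:
awk '/Feb 1/ && /rambo/ {sum++} END{print sum+0}' file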
Is Feb 1 always before rambo? If yes:
grep -c "Feb 1 .* rambo" file
Try this, as per #Marc's suggestion:
grep 'Feb 1.*rambo' file | wc -l
If the positions of the two strings are not guaranteed to be as shown in the question, the following command will be useful:
grep 'rambo' file | grep 'Feb 1' | wc -l
The output will be,
2
Here is what I tried. The awk solution is probably clearer, but this is a nice sed technique: the outer /Feb 1/ address selects matching lines, and the nested /rambo/p prints only those that also contain rambo.
sed -n '/Feb 1/{/rambo/p; }' file | wc -l

How to keep only those rows which are unique in a tab-delimited file in unix

Here, two rows are considered redundant if their second values are the same.
Is there any unix/linux command that can achieve the following?
1 aa
2 aa
1 ss
3 dd
4 dd
Result
1 aa
1 ss
3 dd
I generally use the following command, but it does not achieve what I want here.
sort -k2 /Users/fahim/Desktop/delnow2.csv | uniq
Edit:
My file had roughly 25 million lines:
Time when using the solution suggested by #Steve: 33 seconds.
$date; awk -F '\t' '!a[$2]++' myfile.txt > outfile.txt; date
Wed Nov 27 18:00:16 EST 2013
Wed Nov 27 18:00:49 EST 2013
The sort and uniq approach was taking too much time; I quit after waiting for 5 minutes.
Perhaps this is what you're looking for:
awk -F "\t" '!a[$2]++' file
Results:
1 aa
1 ss
3 dd
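For reference, the idiom works because a[$2]++ evaluates to the number of times the second field has been seen so far, which is 0 (false) on the first occurrence; negating it makes the expression true exactly once per key, and awk's default action prints the line. Spelled out:
awk -F "\t" 'seen[$2] == 0 { print }   # first occurrence of this 2nd field
             { seen[$2]++ }            # count every occurrence
' file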
I understand that you want the file deduplicated and sorted by the second field. You need to add -u to sort to achieve this:
sort -u -k2 /Users/fahim/Desktop/delnow2.csv
Note that, unlike the awk approach, this does not preserve the original line order, and which of the duplicate lines survives is not guaranteed.
