Grep total amount of specific elements based on date - linux

Is there a way in Linux to filter multiple files with a bunch of data in one command, without writing a script?
For this example I want to know how many males appear per date. The added problem is that a specific date (January 3rd) appears in two separate files:
file1
Jan 1 john male=yes
Jan 1 james male=yes
Jan 2 kate male=no
Jan 3 jonathan male=yes
file2
Jan 3 alice male=no
Jan 4 john male=yes
Jan 4 jonathan male=yes
Jan 4 alice male=no
I want the total number of males for each date, across all files. If there are no males for a specific date, no output will be given.
Jan 1 2
Jan 3 1
Jan 4 2
The only way I can think of is to count the total number of males for one specific date at a time, but this would not be performant: in real-world examples there could be many more files, and manually entering all the dates would be a waste of time. Any help would be appreciated, thank you!
localhost:~# cat file1 file2 | grep "male=yes" | grep "Jan 1" | wc -l
2

grep -h 'male=yes' file? | \
cut -c-6 | \
awk '{c[$0] += 1} END {for(i in c){printf "%6s %4d\n", i, c[i]}}'
The grep prints the male=yes lines, cut removes everything but the first 6 characters (the date), and awk counts every date and prints each date with its count at the end.
Given your files the output will be:
Jan 1 2
Jan 3 1
Jan 4 2
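For completeness, the same idea fits in a single awk sketch, assuming the date is always the first two fields (the output order of the for (d in c) loop is unspecified, so pipe through sort if you need it ordered):
awk '/male=yes/ { c[$1 " " $2]++ } END { for (d in c) print d, c[d] }' file1 file2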

Related

How do I count lines where a specific column matches one of two patterns?

year start year end location topic data type data value
2016 2017 AL Alcohol Crude Prevalence 16.9
2016 2017 CA Alcohol Other 15
2016 2017 AZ Neuropathy Other 13.1
2016 2017 HI Smoke Crude Prevalence 20
2016 2017 IL Cancer Other 20
2016 2017 KS Cancer Other 14
2016 2017 AZ Smoke Crude Prevalence 16.9
2016 2017 KY Cancer Other 13.8
2016 2017 LA Alcohol Crude Prevalence 18
The task is to count the lines whose "topic" column is "Alcohol" or "Cancer".
I already got the index of the column named "topic", but the contents I extract from "topic" are not correct, so I am not able to count the lines containing "Alcohol" and "Cancer". How can I solve this?
Here is my code:
awk '{print $4}' AAA.csv > topic.txt
head -n5 topic.txt | less
You could try the following:
the call to awk extracts the column in question, grep filters for the keywords, and wc counts the matching lines:
$ awk '{ print $4 }' data.txt | grep -e Alcohol -e Cancer | wc -l
6
Using a regexp with grep:
cat data.txt|tr -s " "|cut -d " " -f 4|grep -E '(Alcohol|Cancer)'|wc -l
If you are sure that words "Alcohol" and "Cancer" only appear in the 4th column you can just do
grep -E '(Alcohol|Cancer)' data.txt|wc -l
Addition
The OP asks in the comment:
If there are many columns, and I don't know the index of them. How can I extract the columns just based on their name ("topic")?
This code stores in the variable i the column number of "topic". Essentially, it reads the first line of data.txt into the array variable s, then walks the array elements until it finds the desired word. (You have to increase i by one at the end because array indices start from 0.)
Note: the code works only if a column named "topic" is actually found.
# A plain pipe would run read in a subshell and lose s,
# so feed the header line to read via a here-string instead.
read -r -a s <<< "$(head -n 1 data.txt)"
for (( i=0; i<${#s[@]}; i++ ))
do
    if [ "${s[$i]}" == "topic" ]
    then
        break
    fi
done
i=$(( i + 1 ))
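With the index in hand, a minimal sketch of using it to do the counting (this assumes the header's whitespace-separated fields line up one-to-one with the data fields, which multi-word values such as "Crude Prevalence" can break):
awk -v c="$i" '$c == "Alcohol" || $c == "Cancer"' data.txt | wc -l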

Linux Grouping and Counting Files by attribute

I am trying to return a list of the months in which files were created, using the following code.
ls -l|awk '{A[$6":"]++}END{for (i in A){print i" "A[i]}}'
I am using the below code to validate each output.
ls -la | grep -c "Jan"
However as you can see from my output:
: 1
Jan: 19
Feb: 11
Mar: 28
Apr: 10
May: 14
Jun: 24
Jul: 4
Aug: 16
Sep: 10
Oct: 30
Nov: 4
Dec: 1
(Screenshot omitted: output of ls | grep.)
I end up with 1 record showing no date. Also both January and December are short by 1. Can anyone assist?
You could do it this way, using awk and sort:
$ ls -l | awk '$6!=""{m[$6]++}END{for(i in m){printf "%s : %s%s",i,m[i],ORS }}' | sort -k1M
Jan : 7
Mar : 1
Apr : 8
Aug : 2
The problem comes from the first line of ls -l output (the "total …" summary), which doesn't contain a month field; the $6!="" guard skips it. The M in sort -k1M makes sort order the first field as month names.
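You can see the offending line directly; on most Linux systems the first line of ls -l is a block-count summary (the number will vary):
$ ls -l | head -n 1
total 48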

How to grep two words in a line when a varying number of words appears between them

Given a file with this content:
Feb 1 ohio a1 rambo
Feb 1 ny a1 sandy
Feb 1 dc a2 rambo
Feb 2 alpht a1 jazzy
I only want the count of the lines containing both Feb 1 and rambo.
You can use awk to do this more efficiently:
$ awk '/Feb 1/ && /rambo/' file
Feb 1 ohio a1 rambo
Feb 1 dc a2 rambo
To count matches:
$ awk '/Feb 1/ && /rambo/ {sum++} END{print sum}' file
2
Explanation
awk '/Feb 1/ && /rambo/' is saying: match all lines in which both Feb 1 and rambo are matched. When this evaluates to True, awk performs its default behaviour: print the line.
awk '/Feb 1/ && /rambo/ {sum++} END{print sum}' does the same, only that instead of printing the line, it increments the variable sum. When the file has been fully scanned, awk enters the END block, where it prints the value of sum.
Is Feb 1 always before rambo? If yes:
grep -c "Feb 1 .* rambo" file
Try this, as per @Marc's suggestion:
grep 'Feb 1.*rambo' file |wc -l
In case the positions of both strings are not guaranteed to be as mentioned in the question, the following command will be useful:
grep 'rambo' file|grep 'Feb 1'|wc -l
The output will be,
2
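If you'd rather stay with a single grep regardless of which string comes first, a sketch using alternation:
grep -cE 'Feb 1.*rambo|rambo.*Feb 1' file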
Here is what I tried. The awk solution is probably clearer, but this is a nice sed technique:
sed -n '/Feb 1/{/rambo/p; }' file | wc -l

Show a list of users that logged in exactly 5 days ago from today in Linux?

The last command displays the history of login attempts. How can I filter the output so that it shows only the users who logged in 5 days before the current date?
Here is what I've been able to do so far:
last | grep Dec | grep -v reboot | awk '{print$5}'
This parses the dates from the output of the last command.
#!/bin/bash
count=$(date "+%d")
count=$((count - 5))
# pass the shell variable into awk with -v; $count inside single quotes would never expand
last | grep -v reboot | grep Dec | awk -v d="$count" '$5+0 >= d+0'
worked for me :) Thanks for the help @Olivier Dulac
I couldn't do it in one line, but here's a little bash script which might get the job done:
#! /bin/bash
# Find the date string we want
x=$(date --date="5 days ago" +"%a %b %e");
# And now chain a heap of commands together to...
# 1. Get the list of users
# 2. Ignore reboot
# 3. Filter the date lines we want
# 4. Print the user name using awk
# 5+6. Sort them and extract the unique values
last | grep -v "reboot" | grep "$x" | awk '{print $1}' | sort | uniq
In your awk (I don't have last here, so I can't know the format), just add a condition to print the whole line only when you see what you want.
For example, if the month is the 3rd field and the day is the 4th field:
last | grep -v reboot | awk ' ( ($3 == "Dec") && ($4 == "07") ) { print $0 ; }'
(once again, without an actual excerpt of "last", I can't tell if the above works, but I hope you get the general idea)
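Building on the same idea, a hedged sketch that computes the month and day instead of hard-coding them (assuming GNU date, and that the month and day really are fields 3 and 4):
last | grep -v reboot | awk -v m="$(date -d '-5 day' +%b)" -v d="$(date -d '-5 day' +%-d)" '($3 == m) && ($4 == d)'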
I think chooban's solution is the closest, but it lists only the matching lines. I found a better solution, and it most probably handles the 2013-12-31 / 2014-01-01 rollover properly (I found no trace of the output format when a user is logged in for more than one year, or when the login time is in the previous year). It is a grep-less (long) one-liner:
last | awk -v l="$(last -t $(date -d '-5 day' +%Y%m%d%H%M%S)|head -n 1)" 'BEGIN {l=substr(l,1,55)} /^reboot / {next} substr($0,1,55) == l {exit} 1'
I assumed that there is no user named 'reboot'. This uses the fact that last -t YYYYMMDDHHMMSS prints the lines from before the specified date, but unfortunately it changes the format if the logout falls inside the specified period (it shows "gone - no logout"), so the comparison has to be cut down to the first 55 characters.
This is not the nicest solution, as it calls last twice, but it seems to work.
Output:
root pts/1 mytst.xyzzy.tv Wed Dec 11 12:45 still logged in
root pts/0 mytst.xyzzy.tv Wed Dec 11 11:25 still logged in
root pts/0 mytst.xyzzy.tv Tue Dec 10 16:02 - 17:14 (01:12)
root pts/0 mytst.xyzzy.tv Tue Dec 10 10:59 - 15:04 (04:05)
root pts/0 mytst.xyzzy.tv Mon Dec 9 13:23 - 17:10 (03:46)
root pts/1 mytst.xyzzy.tv Fri Dec 6 16:01 - 16:07 (00:06)
root pts/0 mytst.xyzzy.tv Fri Dec 6 15:52 - 16:08 (00:15)
I hope this could help!

How to keep only those rows which are unique in a tab-delimited file in unix

Here, two rows are considered redundant if their second value is the same.
Is there any Unix/Linux command that can achieve the following?
1 aa
2 aa
1 ss
3 dd
4 dd
Result
1 aa
1 ss
3 dd
I generally use the following command but it does not achieve what I want here.
sort -k2 /Users/fahim/Desktop/delnow2.csv | uniq
Edit:
My file had roughly 25 million lines:
Time when using the solution suggested by @Steve: 33 seconds.
$ date; awk -F '\t' '!a[$2]++' myfile.txt > outfile.txt; date
Wed Nov 27 18:00:16 EST 2013
Wed Nov 27 18:00:49 EST 2013
The sort-and-uniq approach was taking too much time; I quit after waiting for 5 minutes.
Perhaps this is what you're looking for (!a[$2]++ is true only the first time a given second field is seen, so awk's default action, printing the line, keeps the first row for each value of $2):
awk -F "\t" '!a[$2]++' file
Results:
1 aa
1 ss
3 dd
I understand that you want the file sorted and unique by the second field.
You need to add -u to sort to achieve this:
sort -u -k2 /Users/fahim/Desktop/delnow2.csv
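One caveat: -k2 keys on everything from field 2 through the end of the line, and sort -u keeps an arbitrary line per key rather than the first occurrence in original order (the awk answer above preserves it). To key strictly on the second tab-delimited field, a sketch:
sort -t $'\t' -u -k2,2 /Users/fahim/Desktop/delnow2.csv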
