Find the most common line in a file in bash - linux

I have a file of strings:
string-string-123
string-string-123
string-string-123
string-string-12345
string-string-12345
string-string-12345-123
How do I retrieve the most common line in bash (string-string-123)?

You can use sort with uniq
sort file | uniq -c | sort -n -r
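For the sample input above, this would produce something like the following (the exact count alignment depends on your uniq implementation), with the most common line on top:
      3 string-string-123
      2 string-string-12345
      1 string-string-12345-123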

You could use awk to do this:
awk '{++a[$0]}END{for(i in a)if(a[i]>max){max=a[i];k=i}print k}' file
The array a keeps a count of each line. Once the file has been read, we loop through it and find the line with the maximum count.
Alternatively, you can skip the loop in the END block by assigning the line during the processing of the file:
awk 'max < ++c[$0] {max = c[$0]; line = $0} END {print line}' file
Thanks to glenn jackman for this useful suggestion.
It has rightly been pointed out that the two approaches above will only print out one of the most frequently occurring lines in the case of a tie. The following version will print out all of the most frequently occurring lines:
awk 'max<++c[$0] {max=c[$0]} END {for(i in c)if(c[i]==max)print i}' file
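For instance, on input with a tie (a and b both appearing twice), it prints both top lines; note that for (i in c) iterates in an unspecified order, so the order of the output may vary:
$ printf 'a\na\nb\nb\nc\n' | awk 'max<++c[$0] {max=c[$0]} END {for(i in c)if(c[i]==max)print i}'
a
b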

Tom Fenech's elegant awk answer works great [in the amended version that prints all most frequently occurring lines in the event of a tie].
However, it may not be suitable for large files, because all distinct input lines are stored in an associative array in memory, which could be a problem if there are many non-duplicate lines; that said, it's much faster than the approaches discussed below.
Grzegorz Żur's answer combines multiple utilities elegantly to implicitly produce the desired result, but:
all distinct lines are printed (highest-frequency count first)
output lines are prefixed by their occurrence count (which may actually be desirable).
While you can pipe Grzegorz Żur's answer to head to limit the number of lines shown, you can't assume a fixed number of lines in general.
Building on Grzegorz's answer, here's a generic solution that shows all most-frequently-occurring lines - however many there are - and only them:
sort file | uniq -c | sort -n -r | awk 'NR==1 {prev=$1} $1!=prev {exit} 1'
If you don't want the output lines prefixed with the occurrence count:
sort file | uniq -c | sort -n -r | awk 'NR==1 {prev=$1} $1!=prev {exit} 1' |
sed 's/^ *[0-9]\{1,\} //'
Explanation of Grzegorz Żur's answer:
uniq -c outputs the set of unique input lines prefixed with their respective occurrence count (-c), followed by a single space.
sort -n -r then sorts the resulting lines numerically (-n), in descending order (-r), so that the most frequently occurring line(s) are at the top.
Note that sort, if -k is not specified, will generally try to sort by the entire input line, but -n causes only the longest prefix that is recognized as an integer to be used for sorting, which is exactly what's needed here.
Explanation of my awk command:
NR==1 {prev=$1} stores the 1st whitespace-separated field ($1) in variable prev for the first input line (NR==1)
$1!=prev {exit} terminates processing, if the 1st whitespace-separated field is not the same as the previous line's - this means that a non-topmost line has been reached, and no more lines need printing.
1 is shorthand for { print } meaning that the input line at hand should be printed as is.
Explanation of my sed command:
^ *[0-9]\{1,\} matches the numeric prefix (denoting the occurrence count) of each output line, as (originally) produced by uniq -c
applying s/...// means that the prefix is replaced with an empty string, i.e., effectively removed.
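Putting it all together with the sample input from the question, the pipeline prints only the top line, first with and then without the count prefix (output shown approximately, since the count padding comes from uniq -c):
$ sort file | uniq -c | sort -n -r | awk 'NR==1 {prev=$1} $1!=prev {exit} 1'
      3 string-string-123
$ sort file | uniq -c | sort -n -r | awk 'NR==1 {prev=$1} $1!=prev {exit} 1' | sed 's/^ *[0-9]\{1,\} //'
string-string-123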

Related

Filtering large data file by date using command line

I have a csv file that contains a bunch of data with one of the columns being date. I am trying to extract all lines that have dates in a specific year and save it into a new file.
The format of the file is like this, with the date and time in the second column:
000000000,10/04/2021 02:10:15 AM,.....
So far I tried:
grep -E ^2020 data.csv >> temp.csv
But it just produced an empty temp list. Any ideas on how I can do this?
One potential solution is with awk:
awk -F"," '$2 ~ /\/2020 /' data.csv > temp.csv
Another potential option is with grep:
grep "\/2020 " data.csv > temp.csv
However, the grep solution may detect "/2020 " elsewhere in the file, rather than in column 2.
Although an awk solution is best here, e.g.
awk -F, 'index($2, "/2021 ")' file
grep can also be used here:
grep '^[^,]*,[^,]*/2021 ' file
Notes:
awk -F, 'index($2, "/2021 ")' splits the lines (records) into fields on commas (see -F,), and if /2021 followed by a space occurs in the second field ($2), the line is printed
the ^[^,]*,[^,]*/2021 pattern in the grep command matches
^ - start of string
[^,]* - zero or more non-comma chars
,[^,]* - a , and zero or more non-comma chars
/2021 - a literal substring (note the trailing space).
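As a quick sanity check against the sample row from the question (a hypothetical one-liner, not part of the original answer), the awk version keeps the line because /2021 followed by a space occurs in the second field:
$ printf '%s\n' '000000000,10/04/2021 02:10:15 AM,.....' | awk -F, 'index($2, "/2021 ")'
000000000,10/04/2021 02:10:15 AM,.....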

How to get first word of every line and pipe it into dmenu script

I have a text file like this:
first state
second state
third state
Getting the first word from every line isn't difficult, but the problem comes when adding the extra \n required to separate every word (selection) in dmenu, per its syntax:
echo -e "first\nsecond\nthird" | dmenu
I haven't been able to figure out how to add the separating \n. I've tried this:
state=$(awk '{for(i=1;i<=NF;i+=2)print $(i)'\n'}' text.txt)
But it doesn't work. I also tried this:
lol=$(grep -o "^\S*" states.txt | perl -ne 'print "$_"')
But same deal. Not sure what I'm doing wrong.
Your problem is in the awk script. awk already treats each input line as a record, and you can control how records are separated in the output via the ORS variable (output record separator). By default this separator is the newline, which should be good enough for your purpose.
Now to print the first word of every input record (each line in the input stream in this case), you just need to print the first field:
awk '{print $1}' textfile | dmenu
If you need the output to include the explicit \n string (not the control character), then you can just overwrite the ORS variable to fit your needs:
awk 'BEGIN{ORS="\\n"}{print $1}' textfile | dmenu
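Run against the sample file, this emits a single string with literal \n separators (print also appends ORS after the last record), something like:
$ awk 'BEGIN{ORS="\\n"}{print $1}' textfile
first\nsecond\nthird\n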
This could also be done with a simple while loop: read splits each line into two variables, first (the first field) and rest (everything else), and the first fields are piped to dmenu (which reads its menu entries from stdin):
while read -r first rest
do
  printf '%s\n' "$first"
done < "Input_file" | dmenu
Based on the text file example, the following should achieve what you require:
awk '{ printf "%s\\n",$1 }' textfile | dmenu
This prints the first space-separated field of each line followed by \n (the backslash needs to be escaped so that awk doesn't interpret \n as a newline).
In your code
state=$(awk '{for(i=1;i<=NF;i+=2)print $(i)'\n'}' text.txt)
you attempted to use ' inside the awk program. However, the program is whatever lies between the opening ' and the first ' that follows, so the program actually ends up as {for(i=1;i<=NF;i+=2)print $(i), which does not work. Use " for strings inside awk code.
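To illustrate just the quoting fix (a hypothetical corrected version of that line, not necessarily the best dmenu solution), keep the whole program inside one pair of single quotes and build the literal \n with a double-quoted awk string:
state=$(awk '{for(i=1;i<=NF;i+=2) printf "%s\\n", $i}' text.txt)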
If you merely want to get the nth column, cut will be enough in most cases. Let the content of states.txt be
first state
second state
third state
then you can do:
cut -d ' ' -f 1 states.txt | dmenu
Explanation: treat space as delimiter (-d ' ') and get 1st column (-f 1)
(tested in cut (GNU coreutils) 8.30)

Linux Bash: extracting text from file into variable

I haven't found anything that clearly answers my question. Although very close, I think...
I have a file with a line:
# Skipsdata for serienummer 1158
I want to extract the 4-digit number at the end and put it into a variable. This number changes from file to file, so I can't just search for "1158", but the "# Skipsdata for serienummer" part always remains the same.
I believe that either grep, sed or awk may be the answer but I'm not 100 % clear on their usage.
Using awk as follows:
numberRequired=$(awk '/# Skipsdata for serienummer/{print $NF}' file)
printf "%s\n" "$numberRequired"
1158
You can use grep with the -o switch, which prints only the matched part instead of the whole line.
Print all numbers at the end of lines from file yourFile
grep -Po '\d+$' yourFile
Print all four-digit numbers at the end of lines, as described in your question:
grep -Po '^# Skipsdata for serienummer \K\d{4}$' yourFile
-P enables perl style regexes which support \d and especially \K.
\d matches any digit (0-9).
\d{4} matches exactly four digits.
\K lets grep forget the previously matched part, such that only the part afterwards is printed.
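With the example line from the question, this prints just the number:
$ grep -Po '^# Skipsdata for serienummer \K\d{4}$' yourFile
1158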
There are multiple ways to find your number. Assuming the input data is in a file called inputfile:
mynumber=$(sed -n 's/# Skipsdata for serienummer //p' <inputfile) will print only the number and ignore all the other lines;
mynumber=$(grep '^# Skipsdata for serienummer' inputfile | cut -d ' ' -f 5) will filter the relevant lines first, then only output the 5th field (the number)
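A quick check of both variants, using a hypothetical inputfile that contains only the sample line:
$ printf '%s\n' '# Skipsdata for serienummer 1158' > inputfile
$ sed -n 's/# Skipsdata for serienummer //p' <inputfile
1158
$ grep '^# Skipsdata for serienummer' inputfile | cut -d ' ' -f 5
1158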

Delete lines from a file matching first 2 fields from a second file in shell script

Suppose I have setA.txt:
a|b|0.1
c|d|0.2
b|a|0.3
and I also have setB.txt:
c|d|200
a|b|100
Now I want to delete from setA.txt the lines whose first 2 fields match a line in setB.txt, so the output should be:
b|a|0.3
I tried:
comm -23 <(sort setA.txt) <(sort setB.txt)
But equality is defined over the whole line, so it won't work. How can I do this?
$ awk -F\| 'FNR==NR{seen[$1,$2]=1;next;} !seen[$1,$2]' setB.txt setA.txt
b|a|0.3
This reads through setB.txt just once, extracts the needed information from it, and then reads through setA.txt while deciding which lines to print.
How it works
-F\|
This sets the field separator to a vertical bar, |.
FNR==NR{seen[$1,$2]=1;next;}
FNR is the number of lines read so far from the current file and NR is the total number of lines read. Thus, when FNR==NR, we are reading the first file, setB.txt. If so, set the value of associative array seen to true, 1, for the key consisting of fields one and two. Lastly, skip the rest of the commands and start over on the next line.
!seen[$1,$2]
If we get to this command, we are working on the second file, setA.txt. Since ! means negation, the condition is true if seen[$1,$2] is false which means that this combination of fields one and two was not in setB.txt. If so, then the default action is performed which is to print the line.
This should work:
sed -n 's#\(^[^|]*|[^|]*\)|.*#/^\1/d#p' setB.txt |sed -f- setA.txt
How this works:
sed -n 's#\(^[^|]*|[^|]*\)|.*#/^\1/d#p'
generates an output:
/^c|d/d
/^a|b/d
which is then used as a sed script for the next sed after the pipe and outputs:
b|a|0.3
(IFS=$'|'; cat setA.txt | while read x y z; do grep -q -P "\Q$x|$y|\E" setB.txt || echo "$x|$y|$z"; done; )
Explanation: grep -q only tests whether grep can find the regexp, without printing anything; -P means use Perl syntax, so that the | is matched literally thanks to the \Q...\E construct.
IFS=$'|' makes bash use | instead of whitespace (space, tab, etc.) as the token separator for read.
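For the sample setA.txt and setB.txt, this loop prints the expected line:
$ (IFS=$'|'; cat setA.txt | while read x y z; do grep -q -P "\Q$x|$y|\E" setB.txt || echo "$x|$y|$z"; done; )
b|a|0.3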

Count the number of occurrences in a string. Linux

Okay so what I am trying to figure out is how do I count the number of periods in a string and then cut everything up to that point but minus 2. Meaning like this:
string="aaa.bbb.ccc.ddd.google.com"
number_of_periods="5"
number_of_periods=`expr $number_of_periods-2`
string=`echo $string | cut -d"." -f$number_of_periods`
echo $string
result: "aaa.bbb.ccc.ddd"
The way that I was thinking of doing it was sending the string to a text file and then just grepping for the number of occurrences, like this:
grep -c "." infile
The reason I don't want to do that is because I want to avoid creating another text file for I do not have permission to do so. It would also be simpler for the code I am trying to build right now.
EDIT
I don't think I made it clear but I want to make finding the number of periods more dynamic because the address I will be looking at will change as the script moves forward.
If you don't need to count the dots, but just remove the penultimate dot and everything afterwards, you can use Bash's built-in string manipulation.
${string%substring}
Deletes shortest match of $substring from back of $string.
Example:
$ string="aaa.bbb.ccc.ddd.google.com"
$ echo ${string%.*.*}
aaa.bbb.ccc.ddd
Nice and simple and no need for sed, awk or cut!
What about this:
echo "aaa.bbb.ccc.ddd.google.com"|awk 'BEGIN{FS=OFS="."}{NF=NF-2}1'
(further shortened by helpful comment from @steve)
gives:
aaa.bbb.ccc.ddd
The awk command:
awk 'BEGIN{FS=OFS="."}{NF=NF-2}1'
works by separating the input line into fields (FS) by ., then joining them as output (OFS) with ., but the number of fields (NF) has been reduced by 2. The final 1 in the command is responsible for the print.
This will reduce a given input line by eliminating the last two period separated items.
This approach is "shell-agnostic" :)
Perhaps this will help:
#!/bin/sh
input="aaa.bbb.ccc.ddd.google.com"
number_of_fields=$(echo $input | tr "." "\n" | wc -l)
interesting_fields=$(($number_of_fields-2))
echo $input | cut -d. -f-${interesting_fields}
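For the sample input this works out to 6 fields minus 2, i.e. cut -d. -f-4, so running the script (saved under any name, e.g. a hypothetical trim.sh) prints:
$ sh trim.sh
aaa.bbb.ccc.ddd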
grep -o "\." <<<"aaa.bbb.ccc.ddd.google.com" | wc -l
5
