How to count repeated sentences in shell - Linux

cat file1.txt
abc bcd abc ...
abcd bcde cdef ...
abcd bcde cdef ...
abcd bcde cdef ...
efg fgh ...
efg fgh ...
hig ...
My expected result is as below:
abc bcd abc ...
abcd bcde cdef ...
<!!! pay attention, above sentence has repeated 3 times !!!>
efg fgh ...
<!!! pay attention, above sentence has repeated 2 times !!!>
hig ...
I have found a way to deal with the issues, but my code is a little noisy.
cat file1.txt | uniq -c | sed -e 's/ \+/ /g' -e 's/^.//g' | awk '{print $0," ",$1}'| sed -e 's/^[2-9] /\n/g' -e 's/^[1] //g' |sed -e 's/[^1]$/\n<!!! pay attention, above sentence has repeated & times !!!> \n/g' -e 's/[1]$//g'
abc bcd abc ...
abcd bcde cdef ...
<!!! pay attention, above sentence has repeated 3 times !!!>
efg fgh ...
<!!! pay attention, above sentence has repeated 2 times !!!>
hig ...
I was wondering if you could show me a more efficient way to achieve the goal. Thanks a lot.

sort + uniq + sed solution:
sort file1.txt | uniq -c | sed -E 's/^ +1 (.+)/\1\n/;
s/^ +([2-9]|[0-9]{2,}) (.+)/\2\n<!!! pay attention, the above sentence has repeated \1 times !!!>\n/'
The output:
abc bcd abc ...

abcd bcde cdef ...
<!!! pay attention, the above sentence has repeated 3 times !!!>

efg fgh ...
<!!! pay attention, the above sentence has repeated 2 times !!!>

hig ...
Or with awk:
sort file1.txt | uniq -c | awk '{ n=$1; sub(/^ +[0-9]+ +/,"");
printf "%s\n%s",$0,(n==1? ORS:"<!!! pay attention, the above sentence has repeated "n" times !!!>\n\n") }'

$ awk '
    $0==prev { cnt++; next }
    { prt(); prev=$0; cnt=1 }
    END { prt() }
    function prt() {
        if (NR>1) print prev (cnt>1 ? ORS "repeated " cnt " times" : "") ORS
    }
' file
abc bcd abc ...

abcd bcde cdef ...
repeated 3 times

efg fgh ...
repeated 2 times

hig ...

If your lines are not already grouped, then you could use
awk '
NR == FNR {count[$0]++; next}
!seen[$0]++ {
print
if (count[$0] > 1)
print "... repeated", count[$0], "times"
}
' file1.txt file1.txt
This will consume a lot of memory if your file is very large. You might want to sort it first.
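If the two-pass approach uses too much memory, one sketch is to sort first and reuse the grouped one-pass script from earlier (note the output then comes out in sorted order rather than the original order):
sort file1.txt | awk '
    # identical logic to the grouped-input script above; sort provides the grouping
    $0==prev { cnt++; next }
    { prt(); prev=$0; cnt=1 }
    END { prt() }
    function prt() {
        if (NR>1) print prev (cnt>1 ? ORS "repeated " cnt " times" : "") ORS
    }
'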

I want to find strings/words in columns 1 and 2 of file1 that match column 1 in file2 and replace them with the column 2 strings/words from file2

I'm still learning to code on the Linux platform. I have searched for problems similar to mine, but the ones I found were either too specific or focused only on changing the entire column 1.
Here are example of my files:
File 1
abc Gamma 3.44
bcd abc 5.77
abc Alpha 1.99
beta abc 0.88
bcd Alpha 5.66
File 2
Gamma Bacteria
Alpha Bacteria
Beta Bacteria
Output file3
abc Bacteria 3.44
bcd abc 5.77
abc Bacteria 1.99
Bacteria abc 0.88
bcd Bacteria 5.66
I have tried:
awk:
$ awk 'FNR==NR{a[$1]=$2;next} {if ($1,$2 in a){$1,$2=a[$1,$2]}; print $0}' file2 file1
$ awk 'NR==FNR {a[FNR]=$0; next} /$1|$2/ {$1 $2=a[FNR]} 1' file2 file1
They gave me:
abc Gamma 3.44
abc 5.77
abc Alpha 1.99
Bacteria abc 0.88
bcd Alpha 5.66
These only changed $1, and they removed the other text strings in column 1 that are not found in $2 of file2.
And this one:
$ awk -F'\t' -v OFS='\t' 'FNR==1 { next }FNR == NR { file2[$1,$2] = $1 FS $2 } FNR != NR { file1[$1,$2,] = $1 FS $2} END { print "Match:"; for (k in file1) if (k in file1) print file2[k] # Or file1[k]}' file2 file1
Didn't work
Then I tried sed:
$ sed = file2 | sed -r 'N;s/(.*)\n(.*)/\1s|\&$|\2|/' | sed -f - file1
This gave me an error complaining that sed -e was not called properly.
Then, afterwards, I want to keep only the line with the smallest $3 when $1 and $2 (or $2 and $1) are the same:
file 4
bcd abc 5.77
Bacteria abc 0.88
bcd Bacteria 5.66
I have tried this code:
$ awk 'NR == $1&$2 || $3 < min {line = $0; min = $3}END{print line}' file3
$ awk '/^$1/{if(h){print h RS m}min=""; h=$0; next}min=="" || $3 < min{min=$3; m=$0}END{print h RS m}' file3
$ awk -F'\t' '$3 != "NF==min"' OFS='\t' file3
$ awk -v a=NODE '{c=a*$3+(1-a)} !($1 in min) || c<min[$1]{min[$1]=c; minLine[$1]=$0} END{for(k in minLine) print minLine[k]}' file3 | column -t
None of these worked. I researched what each line means and changed them to fit my problem, but they all failed.
This might work for you (GNU sed):
sed -E 's#(.*) (.*)#/^\1 /Is/\\S+/\2/;/^\\S+ \1 /Is/\\S+/\2/2#' file2 |
sed -Ef - file1
This generates a sed script from file2, which is then run against file1 to produce the required format.
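For reference, the intermediate script that the first sed generates from the example file2 looks like this (the I flag makes the matches case-insensitive, which is what lets it catch the lowercase beta):
/^Gamma /Is/\S+/Bacteria/;/^\S+ Gamma /Is/\S+/Bacteria/2
/^Alpha /Is/\S+/Bacteria/;/^\S+ Alpha /Is/\S+/Bacteria/2
/^Beta /Is/\S+/Bacteria/;/^\S+ Beta /Is/\S+/Bacteria/2
Each pair of commands replaces the first word of a line when the key is in column 1, or the second word when the key is in column 2.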

How to print all lines between a word and the first occurrence of another word?

input.txt
ABC
CDE
EFG
XYZ
ABC
PQR
EFG
From the above file I want to print the lines between 'ABC' and the first occurrence of 'EFG'.
Expected output :
ABC
CDE
EFG
ABC
PQR
EFG
How can I print lines from one word to the first occurrence of a second word?
EDIT: In case you want to print all occurrences of lines coming between ABC and EFG and leave the others, then try the following.
awk '/ABC/{found=1} found;/EFG/{found=""}' Input_file
Could you please try the following.
awk '/ABC/{flag=1} flag && !count;/EFG/{count++}' Input_file
$ awk '/ABC/,/EFG/' file
Output:
ABC
CDE
EFG
ABC
PQR
EFG
This might work for you (GNU sed):
sed -n '/ABC/{:a;N;/EFG/!ba;p}' file
Turn off implicit printing by using the -n option.
Gather up lines between ABC and EFG and then print them. Repeat.
If you want to only print between the first occurrence of ABC to EFG, use:
sed -n '/ABC/{:a;N;/EFG/!ba;p;q}' file
To print the second through fourth occurrences, use:
sed -En '/ABC/{:a;N;/EFG/!ba;x;s/^/x/;/^x{2,4}$/{x;p;x};x;}' file
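For comparison, a rough awk sketch of the same occurrence-range idea (prints the second through fourth ABC-to-EFG blocks by counting how many times ABC has been seen):
awk '/ABC/{f=1; c++} f && c>=2 && c<=4; /EFG/{f=0}' file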

How to obtain output in query order when using grep?

I have 2 files
File1.txt
1
3
5
2
File2.txt
1 aaa
2 bbb
3 ccc
4 aaa
5 bbb
Desired output:
1 aaa
3 ccc
5 bbb
2 bbb
Command used: cat File1.txt | grep -wf - File2.txt, but the output was:
1 aaa
2 bbb
3 ccc
5 bbb
Is there a way to return the output in query order?
Thanks in advance!!!
Important Edit
On second thought, do not use grep with redirection as it's incredibly slow. Use awk to read the original patterns to get the order back.
Use this instead
grep -f patterns searchdata | awk 'NR==FNR { line[$1] = $0; next } $1 in line { print line[$1] }' - patterns > matched
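For readability, here is the same command expanded; awk reads grep's matches first (from -, i.e. stdin) and then replays the patterns file in its original order:
grep -f patterns searchdata |
awk '
    # first input: the grep matches (via "-"); index each matched line by its key
    NR==FNR { line[$1] = $0; next }
    # second input: the patterns file; print the stored lines in pattern order
    $1 in line { print line[$1] }
' - patterns > matched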
Benchmark
#!/bin/bash
paste <(shuf -i 1-10000) <(crunch 4 4 2>/dev/null | shuf -n 10000) > searchdata
shuf -i 1-10000 > patterns
printf 'Testing awk:'
time grep -f patterns searchdata | awk 'NR==FNR { line[$1] = $0; next } $1 in line { print line[$1] }' - patterns > matched
wc -l matched
cat /dev/null > matched
printf '\nTesting grep with redirection:'
time {
while read -r pat; do
grep -w "$pat" searchdata >> matched
done < patterns
}
wc -l matched
Output
Testing awk:
real 0m0.022s
user 0m0.017s
sys 0m0.010s
10000 matched
Testing grep with redirection:
real 0m36.370s
user 0m28.761s
sys 0m7.909s
10000 matched
Original
To preserve the query order, read the file line-by-line:
while read -r pat; do grep -w "$pat" file2.txt; done < file1.txt
I don't think grep has an option to support this, but note that this solution will be slow if you have large files to read from.
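Alternatively, a two-pass awk sketch that avoids grep entirely (it assumes each key appears at most once in File2.txt, since later lines would overwrite earlier ones):
awk '
    NR==FNR { order[++n] = $1; next }   # pass 1: remember the query order
    { line[$1] = $0 }                   # pass 2: index data lines by first field
    END {
        for (i = 1; i <= n; i++)
            if (order[i] in line)
                print line[order[i]]
    }
' File1.txt File2.txt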

Printing all the lines that contain a certain word exactly k times

I have to search for all the lines from a file which contain a given word exactly k times. I think that I should use grep/sed/awk but I don't know how. My idea was to check line by line using sed and grep like this:
line=1
while [ (sed -n -'($line)p' $name) -n ]; do
if [ (sed -n -'($line)p' $name | grep -w -c $word) -eq "$number" ]; then
sed -n -'($line)p' $name
fi
let line+=1
done
My first problem is that I get the following error: syntax error near unexpected token 'sed'. Then I realized that for my test file the command sed -n -'p1' test.txt | grep -w -c "ab" doesn't return the exact number of occurrences of "ab" in the first line of my file (it returns 1 but there are 3 occurrences).
My test.txt file:
abc ab cds ab abcd edfs ab
kkmd ab jnabc bad ab
abcdefghijklmnop ab cdab ab ab
abcde bad abc cdef a b
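As an aside, the miscount happens because grep -c counts matching lines, not individual matches. To count whole-word occurrences within a line, GNU grep's -o option (print each match on its own line) can be combined with wc -l:
$ sed -n '1p' test.txt | grep -ow 'ab' | wc -l
3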
awk to the rescue!
$ awk -F'\\<ab\\>' -v count=2 'NF==count+1' file
kkmd ab jnabc bad ab
note that \< and \> word boundaries might be gawk specific.
for variable assignment, I think the easiest will be
$ word=ab; awk -F"\\\<$word\\\>" -v count=2 'NF==count+1' file
kkmd ab jnabc bad ab
You could use grep, but you'd have to use it twice. (You can't use a single grep because ERE has no way to negate a string, you can only negate a bracket expression, which will match single characters.)
The following is tested with GNU grep v2.5.1, where you can use \< and \> as (possibly non-portable) word delimiters:
$ word="ab"
$ < input.txt egrep "(\<$word\>.*){3}" | egrep -v "(\<$word\>.*){4}"
abc ab cds ab abcd edfs ab
abcdefghijklmnop ab cdab ab ab
$ < input.txt egrep "(\<$word\>.*){2}" | egrep -v "(\<$word\>.*){3}"
kkmd ab jnabc bad ab
The idea here is that we'll extract from our input file lines with N occurrences of the word, then strip from that result any lines with N+1 occurrences. Lines with fewer than N occurrences of course won't be matched by the first grep.
Or, you might also do this in pure bash, if you're feeling slightly masochistic:
$ word="ab"; num=3
$ readarray lines < input.txt
$ for this in "${lines[@]}"; do declare -A words=(); x=( $this ); for y in "${x[@]}"; do ((words[$y]++)); done; [ "0${words[$word]}" -eq "$num" ] && echo "$this"; done
abc ab cds ab abcd edfs ab
abcdefghijklmnop ab cdab ab ab
Broken out for easier reading (or scripting):
#!/usr/bin/env bash
# Salt to taste
word="ab"; num=3
# Pull content into an array. This isn't strictly necessary, but I like
# getting my file IO over with quickly if possible.
readarray lines < input.txt
# Walk through the array (or you could just walk through the input file)
for this in "${lines[@]}"; do
# Initialize this line's counter array
declare -A words=()
# Break up the words into array elements
x=( $this )
# Step though the array, counting each unique word
for y in "${x[@]}"; do
((words[$y]++))
done
# Check the count for "our" word
[ "0${words[$word]}" -eq $num ] && echo "$this"
done
Wasn't that fun? :)
But this awk option makes the most sense to me. It's a portable one-liner that doesn't depend on GNU awk (so it'll work in OS X, BSD, etc.)
awk -v word="ab" -v num=3 '{for(i=1;i<=NF;i++){a[$i]++}} a[word]==num; {delete a}' input.txt
This works by building an associative array to count the words on each line, then printing the line if the count for the "interesting" word is what's specified as num. It's the same basic concept as the bash script above, but awk lets us do this so much better. :)
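For instance, run against the test file above it should print the two lines containing exactly three whole-word occurrences of "ab", matching the double-grep result:
$ awk -v word="ab" -v num=3 '{for(i=1;i<=NF;i++){a[$i]++}} a[word]==num; {delete a}' input.txt
abc ab cds ab abcd edfs ab
abcdefghijklmnop ab cdab ab ab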
You can do this with grep
grep -E "(${word}.*){${number}}" test.txt
This looks for ${number} or more occurrences of ${word} per line (as substrings, not whole words). The wildcard .* is needed since we also want to match occurrences where matches of ${word} are not next to each other.
Here's what I do:
$ echo 'abc ab cds ab abcd edfs ab
kkmd ab jnabc bad ab
abcdefghijklmnop ab cdab ab ab
abcde bad abc cdef a b' > test.txt
$ word=abc
$ number=2
$ grep -E "(${word}.*){${number}}" test.txt
> abc ab cds ab abcd edfs ab
> abcde bad abc cdef a b
Maybe you need to use sed. If you are looking for character sequences, you can use code like this. However, it doesn't distinguish between the word on its own and the word embedded in another word (so it treats ab and abc as both containing ab).
word="ab"
number=2
sed -n -e "/\($word.*\)\{$(($number + 1))\}/d" -e "/\($word.*\)\{$number\}/p" test.txt
By default, nothing is printed (-n).
The first -e expression looks for 3 (or more) occurrences of $word and deletes lines containing them (and skips to the next line of input). The $(($number + 1)) is shell arithmetic.
The second -e expression looks for 2 occurrences of $word (there won't be more) and prints the lines that match.
If you want words on their own, then you have to work a lot harder. You'd need extended regular expressions, triggered with the -E option on BSD (Mac OS X), or -r with GNU sed.
number=2
plus1=$(($number + 1))
word=ab
sed -En -e "/(^|[^[:alnum:]])($word([^[:alnum:]]).*){$plus1}/d" \
-e "/(^|[^[:alnum:]])($word([^[:alnum:]]).*){$number}$word$/d" \
-e "/(^|[^[:alnum:]])($word([^[:alnum:]]|$).*){$number}/p" test.txt
This is similar to the previous version, but it has considerably more delicate word handling.
The unit (^|[^[:alnum:]]) looks for either the start of line or a non-alphanumeric character (change alnum to alpha throughout if you don't want digits to stop matches).
The first -e looks for start of line or a non-alphanumeric character, followed by the word and a non-alphanumeric and zero or more other characters, N+1 times, and deletes such lines (skipping to the next line of input).
The second -e looks for start of line or a non-alphanumeric character, followed by the word and a non-alphanumeric and zero or more other characters N times, and then the word again followed by end of line, and deletes such lines.
The third -e looks for start of line or a non-alphanumeric character, followed by the word and a non-alphanumeric and zero or more other characters N times and prints such lines.
Given the (extended) input file:
abc NO ab cds ab abcd edfs ab
kkmd YES ab jnabc bad ab
abcd NO efghijklmnop ab cdab ab ab
abcd NO efghijklmnop ab cdab ab ab
abcd NO e bad abc cdef a b
ab YES abcd abcd ab
best YES ab ab candidly
best YES ab ab candidly
ab NO abcd abcd ab ab
hope NO abcd abcd ab ab ab
nope NO abcd abcd ab ab ab
ab YES abcd abcd ab not bad
said YES ab not so bad ab or bad
Example output:
kkmd YES ab jnabc bad ab
ab YES abcd abcd ab
best YES ab ab candidly
best YES ab ab candidly
ab YES abcd abcd ab not bad
said YES ab not so bad ab or bad
It is not a trivial exercise in sed. It would be simpler if you could rely on word-boundary detection. For example, in Perl:
number=2
plus1=$(($number + 1))
word=ab
perl -n -e "next if /(\b$word\b.*?){$plus1}/;
print if /(\b$word\b.*?){$number}/" test.txt
This produces the same output as the sed script, but is a lot simpler because of the \b word boundary detection (the .*? non-greedy matching isn't crucial to the operation of the script).

How do I parse out a text file with awk and printf in Bash?

I have a sample.txt file as follows:
Name         City    ST Zip  CTY
John Smith   BrooklynNY10050USA
Paul DavidsonQueens  NY10040USA
Michael SmithNY      NY10030USA
George HermanBronx   NY10020USA
Desired output is the same data split into separate, aligned columns:
Name          City     ST Zip   CTY
John Smith    Brooklyn NY 10050 USA
Paul Davidson Queens   NY 10040 USA
Michael Smith NY       NY 10030 USA
George Herman Bronx    NY 10020 USA
I tried this:
#!/bin/bash
awk '{printf "%13-s %-8s %-2s %-5s %-3s\n", $1, $2, $3, $4, $5}' sample.txt > new.txt
And it's unsuccessful with this result:
Name City ST Zip CTY
John Smith BrooklynNY10050USA
Paul DavidsonQueens NY10040USA
Michael SmithNY NY10030USA
George HermanBronx NY10020USA
I would appreciate it if anyone could tweak this so the text file ends up in the delimited format shown above. Thank you so much!!
You can use sed to insert spaces at specific positions:
sed -e 's#\(.\{13\}\)\(.*\)#\1 \2#g' data.txt |
sed -e 's#\(.\{22\}\)\(.*\)#\1 \2#g' |
sed -e '1s#\(.\{29\}\)\(.*\)#\1 \2#g' |
sed -e '2,$s#\(.\{25\}\)\(.*\)#\1 \2#g' |
sed -e 's#\(.\{31\}\)\(.*\)#\1 \2#g'
With gawk you can set the input field widths in the BEGIN block:
$ gawk 'BEGIN { FIELDWIDTHS = "13 8 2 5 3" } { print $1, $2, $3, $4, $5 }' fw.txt
Name          City     ST Zip   CTY
John Smith    Brooklyn NY 10050 USA
Paul Davidson Queens   NY 10040 USA
Michael Smith NY       NY 10030 USA
George Herman Bronx    NY 10020 USA
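If you want real tab-delimited output instead of space-aligned columns, a gawk-only sketch along the same lines (FIELDWIDTHS keeps each field's trailing padding, so it is stripped before the record is rebuilt with OFS):
gawk 'BEGIN { FIELDWIDTHS = "13 8 2 5 3"; OFS = "\t" }
      {
          for (i = 1; i <= NF; i++) sub(/ +$/, "", $i)  # strip the fixed-width padding
          $1 = $1                                       # force a rebuild of $0 with OFS
          print
      }' fw.txt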
If your awk does not have FIELDWIDTHS, it's a bit tedious but you can use substr:
$ awk '{ print substr($0,1,13), substr($0,14,8), substr($0,22,2), substr($0,24,5), substr($0,29,3) }' fw.txt
Name          City     ST Zip   CTY
John Smith    Brooklyn NY 10050 USA
Paul Davidson Queens   NY 10040 USA
Michael Smith NY       NY 10030 USA
George Herman Bronx    NY 10020 USA
You can split the field lengths into an array then loop over $0 and gather the substrings in regular awk:
awk 'BEGIN {n=split("13 8 2 5 3",ar)}
{
j=1
s=""
sep="\t"
for(i=1;i<n;i++)
{s=s substr($0, j, ar[i]) sep; j+=ar[i]}
s=s substr($0, j, ar[i])
print s
}' file
That uses a tab to delimit the fields, but you can also use a space if preferred.
