Printing all the lines that contain a certain word exactly k times - linux

I have to find all the lines in a file that contain a given word exactly k times. I think that I should use grep/sed/awk, but I don't know how. My idea was to check the file line by line using sed and grep, like this:
line=1
while [ (sed -n -'($line)p' $name) -n ]; do
if [ (sed -n -'($line)p' $name | grep -w -c $word) -eq "$number" ]; then
sed -n -'($line)p' $name
fi
let line+=1
done
My first problem is that I get the following error: syntax error near unexpected token 'sed'. Then I realized that, for my test file, the command sed -n -'p1' test.txt | grep -w -c "ab" doesn't return the number of occurrences of "ab" in the first line of my file (it returns 1, but there are 3 occurrences).
My test.txt file:
abc ab cds ab abcd edfs ab
kkmd ab jnabc bad ab
abcdefghijklmnop ab cdab ab ab
abcde bad abc cdef a b

awk to the rescue!
$ awk -F'\\<ab\\>' -v count=2 'NF==count+1' file
kkmd ab jnabc bad ab
Note that the \< and \> word boundaries are a GNU awk (gawk) extension, so this may not work in other awks.
To pass the word in from a shell variable, I think the easiest way is:
$ word=ab; awk -F"\\\<$word\\\>" -v count=2 'NF==count+1' file
kkmd ab jnabc bad ab

You could use grep, but you'd have to use it twice. (You can't use a single grep because ERE has no way to negate a string, you can only negate a bracket expression, which will match single characters.)
The following is tested with GNU grep v2.5.1, where you can use \< and \> as (possibly non-portable) word delimiters:
$ word="ab"
$ < input.txt egrep "(\<$word\>.*){3}" | egrep -v "(\<$word\>.*){4}"
abc ab cds ab abcd edfs ab
abcdefghijklmnop ab cdab ab ab
$ < input.txt egrep "(\<$word\>.*){2}" | egrep -v "(\<$word\>.*){3}"
kkmd ab jnabc bad ab
The idea here is that we'll extract from our input file lines with N occurrences of the word, then strip from that result any lines with N+1 occurrences. Lines with fewer than N occurrences of course won't be matched by the first grep.
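The two-pass idea can be wrapped so that N is a parameter. This is only a sketch: exact_count_grep is a hypothetical helper name, it uses grep -E rather than egrep (recent GNU grep releases warn that egrep is obsolescent), it relies on the GNU \< and \> extensions, and it assumes the word contains no regex metacharacters.

```shell
#!/usr/bin/env bash
# exact_count_grep is a hypothetical helper name, not part of grep.
# Usage: exact_count_grep WORD N FILE
exact_count_grep() {
    local word=$1 n=$2 file=$3
    # First grep keeps lines with at least N whole-word matches;
    # second grep drops the ones that have N+1 or more.
    # \< and \> are GNU extensions; WORD is assumed to be regex-safe.
    grep -E "(\<$word\>.*){$n}" "$file" |
        grep -Ev "(\<$word\>.*){$((n + 1))}"
}

# Demo on the question's sample data.
sample=$(mktemp)
printf '%s\n' 'abc ab cds ab abcd edfs ab' \
              'kkmd ab jnabc bad ab' \
              'abcdefghijklmnop ab cdab ab ab' \
              'abcde bad abc cdef a b' > "$sample"
exact_count_grep ab 2 "$sample"
rm -f "$sample"
```

With the sample data, asking for 2 occurrences of ab should print only the kkmd line, matching the second egrep example above.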
Or, you might also do this in pure bash, if you're feeling slightly masochistic:
$ word="ab"; num=3
$ readarray lines < input.txt
$ for this in "${lines[@]}"; do declare -A words=(); x=( $this ); for y in "${x[@]}"; do ((words[$y]++)); done; [ "0${words[$word]}" -eq "$num" ] && echo "$this"; done
abc ab cds ab abcd edfs ab
abcdefghijklmnop ab cdab ab ab
Broken out for easier reading (or scripting):
#!/usr/bin/env bash
# Salt to taste
word="ab"; num=3
# Pull content into an array. This isn't strictly necessary, but I like
# getting my file IO over with quickly if possible.
readarray lines < input.txt
# Walk through the array (or you could just walk through the input file)
for this in "${lines[@]}"; do
# Initialize this line's counter array
declare -A words=()
# Break up the words into array elements
x=( $this )
# Step through the array, counting each unique word
for y in "${x[@]}"; do
((words[$y]++))
done
# Check the count for "our" word
[ "0${words[$word]}" -eq $num ] && echo "$this"
done
Wasn't that fun? :)
But this awk option makes the most sense to me. It's a portable one-liner that doesn't depend on GNU awk (so it'll work in OS X, BSD, etc.)
awk -v word="ab" -v num=3 '{for(i=1;i<=NF;i++){a[$i]++}} a[word]==num; {delete a}' input.txt
This works by building an associative array to count the words on each line, then printing the line if the count for the "interesting" word is what's specified as num. It's the same basic concept as the bash script above, but awk lets us do this so much better. :)
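The same count-per-line idea also explains the result reported in the question: grep -c counts matching lines, not matches, so it reports 1 for a single input line however often the word occurs. A minimal sketch of a per-line count using grep -o, which prints each match on its own line, follows. The function name is hypothetical, and -F (treat the word as a fixed string) is a choice made here, not something from the answers above.

```shell
#!/usr/bin/env bash
# print_lines_with_count is a hypothetical helper name.
# Usage: print_lines_with_count WORD K FILE
print_lines_with_count() {
    local word=$1 k=$2 file=$3 line n
    while IFS= read -r line || [ -n "$line" ]; do
        # -o prints every match on its own line, -w keeps whole words only,
        # -F treats WORD literally; wc -l then counts the matches.
        n=$(printf '%s\n' "$line" | grep -o -w -F -- "$word" | wc -l)
        if [ "$n" -eq "$k" ]; then
            printf '%s\n' "$line"
        fi
    done < "$file"
}

# Demo on the question's sample data.
sample=$(mktemp)
printf '%s\n' 'abc ab cds ab abcd edfs ab' \
              'kkmd ab jnabc bad ab' \
              'abcdefghijklmnop ab cdab ab ab' \
              'abcde bad abc cdef a b' > "$sample"
print_lines_with_count ab 2 "$sample"
rm -f "$sample"
```

This starts one grep per input line, so it is far slower than the awk one-liner on large files; it is shown only because it stays close to the loop the question attempted.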

You can do this with grep
grep -E "(${word}.*){${number}}" test.txt
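This looks for at least ${number} occurrences of ${word} per line; it does not exclude lines with more, and it also matches ${word} inside longer words (in the example below, abcd counts as a match for abc). The .* is needed because the occurrences of ${word} need not be next to each other.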
Here's what I do:
$ echo 'abc ab cds ab abcd edfs ab
kkmd ab jnabc bad ab
abcdefghijklmnop ab cdab ab ab
abcde bad abc cdef a b' > test.txt
$ word=abc
$ number=2
$ grep -E "(${word}.*){${number}}" test.txt
> abc ab cds ab abcd edfs ab
> abcde bad abc cdef a b

Maybe you need to use sed. If you are looking for character sequences, you can use code like this. However, it doesn't distinguish between the word on its own and the word embedded in another word (so it treats ab and abc as both containing ab).
word="ab"
number=2
sed -n -e "/\($word.*\)\{$(($number + 1))\}/d" -e "/\($word.*\)\{$number\}/p" test.txt
By default, nothing is printed (-n).
The first -e expression looks for 3 (or more) occurrences of $word and deletes lines containing them (and skips to the next line of input). The $(($number + 1)) is shell arithmetic.
The second -e expression looks for 2 occurrences of $word (there won't be more) and prints the lines that match.
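To make the substring caveat concrete, here is the same command wrapped in a function and run on the question's sample file. The wrapper name exact_count_sed is hypothetical, and the word is assumed to contain no regex metacharacters.

```shell
#!/usr/bin/env bash
# exact_count_sed is a hypothetical wrapper around the sed command above.
# It counts substring matches, so "abc" counts as an occurrence of "ab".
# Usage: exact_count_sed WORD NUMBER FILE
exact_count_sed() {
    local word=$1 number=$2 file=$3
    sed -n -e "/\($word.*\)\{$((number + 1))\}/d" \
           -e "/\($word.*\)\{$number\}/p" "$file"
}

# Demo on the question's sample data.
sample=$(mktemp)
printf '%s\n' 'abc ab cds ab abcd edfs ab' \
              'kkmd ab jnabc bad ab' \
              'abcdefghijklmnop ab cdab ab ab' \
              'abcde bad abc cdef a b' > "$sample"
exact_count_sed ab 2 "$sample"
rm -f "$sample"
```

For ab and 2 this prints only the line abcde bad abc cdef a b, because abcde and abc each contain ab. The kkmd line is dropped because jnabc supplies a third occurrence.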
If you want words on their own, then you have to work a lot harder. You'd need extended regular expressions, triggered with the -E option on BSD (Mac OS X), or -r with GNU sed.
number=2
plus1=$(($number + 1))
word=ab
sed -En -e "/(^|[^[:alnum:]])($word([^[:alnum:]]).*){$plus1}/d" \
-e "/(^|[^[:alnum:]])($word([^[:alnum:]]).*){$number}$word$/d" \
-e "/(^|[^[:alnum:]])($word([^[:alnum:]]|$).*){$number}/p" test.txt
This is similar to the previous version, but it has considerably more delicate word handling.
The unit (^|[^[:alnum:]]) looks for either the start of line or a non-alphanumeric character (change alnum to alpha throughout if you don't want digits to stop matches).
The first -e looks for start of line or a non-alphanumeric character, followed by the word and a non-alphanumeric and zero or more other characters, N+1 times, and deletes such lines (skipping to the next line of input).
The second -e looks for start of line or a non-alphanumeric character, followed by the word and a non-alphanumeric and zero or more other characters N times, and then the word again followed by end of line, and deletes such lines.
The third -e looks for start of line or a non-alphanumeric character, followed by the word and a non-alphanumeric and zero or more other characters N times and prints such lines.
Given the (extended) input file:
abc NO ab cds ab abcd edfs ab
kkmd YES ab jnabc bad ab
abcd NO efghijklmnop ab cdab ab ab
abcd NO efghijklmnop ab cdab ab ab
abcd NO e bad abc cdef a b
ab YES abcd abcd ab
best YES ab ab candidly
best YES ab ab candidly
ab NO abcd abcd ab ab
hope NO abcd abcd ab ab ab
nope NO abcd abcd ab ab ab
ab YES abcd abcd ab not bad
said YES ab not so bad ab or bad
Example output:
kkmd YES ab jnabc bad ab
ab YES abcd abcd ab
best YES ab ab candidly
best YES ab ab candidly
ab YES abcd abcd ab not bad
said YES ab not so bad ab or bad
It is not a trivial exercise in sed. It would be simpler if you could rely on word-boundary detection. For example, in Perl:
number=2
plus1=$(($number + 1))
word=ab
perl -n -e "next if /(\b$word\b.*?){$plus1}/;
print if /(\b$word\b.*?){$number}/" test.txt
This produces the same output as the sed script, but is a lot simpler because of the \b word boundary detection (the .*? non-greedy matching isn't crucial to the operation of the script).
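For comparison, the same word rule (anything non-alphanumeric separates words) can be sketched in awk without relying on \b or GNU extensions. The function name exact_word_lines is hypothetical, and the explicit A-Za-z0-9 range is used instead of [:alnum:] on the assumption that some older awks lack character classes.

```shell
#!/usr/bin/env bash
# exact_word_lines is a hypothetical helper name.
# Usage: exact_word_lines WORD NUM FILE
exact_word_lines() {
    awk -v word="$1" -v num="$2" '
    {
        line = $0
        # Turn every run of non-alphanumeric characters into one space.
        gsub(/[^A-Za-z0-9]+/, " ", line)
        # Split the cleaned copy on blanks and count exact matches.
        n = split(line, w, " ")
        count = 0
        for (i = 1; i <= n; i++)
            if ((w[i] "") == (word "")) count++   # force string comparison
        if (count == num) print                   # prints the original line
    }' "$3"
}

# Demo on a few lines of the extended input file.
sample=$(mktemp)
printf '%s\n' 'abc NO ab cds ab abcd edfs ab' \
              'kkmd YES ab jnabc bad ab' \
              'ab NO abcd abcd ab ab' \
              'said YES ab not so bad ab or bad' > "$sample"
exact_word_lines ab 2 "$sample"
rm -f "$sample"
```

On the full extended input file this should select the same six YES lines as the sed and Perl versions.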

Related

sed: filter string subset from lines matching regexp

I have a file of the following format:
abc: A B C D E
abc: 1 2 3 4 5
def D E F G H
def: 10 11 12 23 99
...
That is a first line with strings after ':' is a header for the next line with numbers. I'd like to use sed to extract only a line starting with PATTERN string with numbers in the line.
Number of numbers in a line is variable, but assume that I know exactly how many I'm expecting, so I tried this command:
% sed 's/^abc: \([0-9]+ [0-9]+ [0-9]+\)$/\1/g' < file.txt
But it dumps all entries from the file. What am I doing wrong?
1. sed does substitutions and prints each line, whether a substitution happens or not.
2. Your regular expression is wrong. It would match only three numbers separated by spaces if extended regex flag was given (-E). Without it, not even that, because the + sign will be interpreted literally.
3. The best here is to use addresses and only print lines that have a match:
sed -nE '/^abc: [0-9]+ [0-9]+ [0-9]+ [0-9]+ [0-9]+$/p' < file.txt
or better,
sed -nE '/^abc:( [0-9]+){5}$/p' < file.txt
The -n flag disables the "print all lines" behavior of sed described in (1). Only the lines that reach the p command will be printed.
I read "to extract only a line starting with PATTERN string with numbers in the line" and "Number of numbers in a line is variable" as meaning at least one number, so:
$ sed -n '/abc: \([0-9]\+\)/p' file
Output:
abc: 1 2 3 4 5
With exactly 5 numbers, use:
$ sed -n '/abc: \([0-9]\+\( \|$\)\)\{5\}/p' file
With @Mark's additional question in a comment "If I want to just extract the matched numbers (and remove prefix, e.g, abc)…" this is the pattern I came up with:
sed -En 's/^abc: (([0-9]+[ \t]?)+)[ \t]*$/\1/gp' file.txt
I'm using the -E flag for extended regular expressions to avoid all the escaping that would be needed.
Given this file:
abc: A B C D E
abc: 1 2 3 4 5
abc: 1 c9 A 7f
def D E F G H
def: 10 11 12 23 99
… this regex matches abc: 1 2 3 4 5 while excluding abc: 1 c9 A 7f — it also allows variable whitespace and trailing whitespace.
With any sed:
$ sed -n 's/^abc: \([0-9 ]*\)$/\1/p' file
1 2 3 4 5
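The pattern common to these answers is -n plus the p flag on s///, so that only lines where the substitution succeeded are printed. A small sketch with the prefix as a parameter follows. The function name extract_numbers is hypothetical, the prefix is assumed to contain no regex metacharacters, and requiring a leading digit (so that a bare "abc: " line is skipped) is an addition here.

```shell
#!/usr/bin/env bash
# extract_numbers is a hypothetical helper name.
# Usage: extract_numbers PREFIX FILE
extract_numbers() {
    # -n suppresses default output; the p flag prints only lines where
    # the substitution matched. The capture must start with a digit and
    # may then contain only digits and spaces up to end of line.
    sed -n "s/^$1: \([0-9][0-9 ]*\)\$/\1/p" "$2"
}

# Demo on the sample file from the answer above.
sample=$(mktemp)
printf '%s\n' 'abc: A B C D E' \
              'abc: 1 2 3 4 5' \
              'abc: 1 c9 A 7f' \
              'def D E F G H' \
              'def: 10 11 12 23 99' > "$sample"
extract_numbers abc "$sample"
rm -f "$sample"
```

This uses only basic regular expressions, so it does not depend on -E or on the GNU \+ extension.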

Is it possible to repeat a match in a grep regexp?

I am using this:
grep '\s[A-Z]+\s[A-Z]+\s' file.txt -Po
Which will match
ABC DE
AB AB
DEF GHIFOO
etc
What I want to do is something like
grep '\s([A-Z]+)\s%1\s' file.txt -Po
to only match
AB AB
BC BC
DDD DDD
etc.
I can't work out if it's even possible, let alone how. Is it?
Thanks
The first captured group should be specified as \1 not as %1:
Sample file.txt:
AA AB
AB AB
BC BC
DDD DDD
NN WN
Consider the updated regex pattern:
grep -Po '\b([A-Z]+)\s\1\s*' file.txt
The output:
AB AB
BC BC
DDD DDD
Bonus approach for opposite action:
grep -Po '\b([A-Z]+)\s(?!\1)[A-Z]+\s*' file.txt
The output:
AA AB
NN WN
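If grep -P is not available, backreferences also work in plain basic regular expressions. The sketch below assumes each line holds exactly two space-separated words, as in the sample file. It selects whole lines rather than extracting matches with -o, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# repeated_pairs / differing_pairs are hypothetical helper names.
# Usage: repeated_pairs FILE
repeated_pairs() {
    # \( ... \) captures the first word; \1 requires the same text again.
    # LC_ALL=C keeps [A-Z] limited to ASCII uppercase letters.
    LC_ALL=C grep '^\([A-Z][A-Z]*\) \1$' "$1"
}

# The opposite selection: -v inverts the match.
differing_pairs() {
    LC_ALL=C grep -v '^\([A-Z][A-Z]*\) \1$' "$1"
}

# Demo on the sample file.txt contents.
sample=$(mktemp)
printf '%s\n' 'AA AB' 'AB AB' 'BC BC' 'DDD DDD' 'NN WN' > "$sample"
repeated_pairs "$sample"
rm -f "$sample"
```

Note that -v here simply prints every line that is not a repeated pair, which is looser than the negative lookahead in the -P version.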

How to shift string by some number of characters in linux

Is there a one liner that shifts all characters in a string by some i number. The input string can contain any ascii characters. It would be for a cypher.
For example, if b comes after a then command 1 "ab" returns "bc", command 3 "ab" returns "de". It should work with any ascii character not just with letters.
The command you want is called caesar.
This gawk command gives you a new sequence with each ASCII code shifted by s (here s=2), wrapping around at 128:
awk 'BEGIN{FS=OFS="";s=2;for(n=0;n<=127;n++)ord[sprintf("%c",n)]=n}
{for(i=1;i<=NF;i++)$i=sprintf("%c",(ord[$i]+s)%128)}7'
Test with a string shifted with step 2:
kent$ echo "xyab+123"|awk 'BEGIN{FS=OFS="";s=2;for(n=0;n<=127;n++)ord[sprintf("%c",n)]=n}{for(i=1;i<=NF;i++)$i=sprintf("%c",(ord[$i]+s)%128)}7'
z{cd-345
To define the shift step, you just need to pass s in as a variable (for example with -v s=3, instead of setting s=2 in BEGIN).
Use Perl:
echo -n bbb | perl -F'' -ane 'foreach(@F){$_++; printf "$_"}END{print "\n"}'
ccc
If you need to shift by N characters (4 in the case below):
echo -n bbb | perl -F'' -ane 'foreach(@F){ $a=ord($_); $a+=4; print chr($a)} END{print "\n"}'
fff
Shifting by a negative value:
echo -n bbb | perl -F'' -ane 'foreach(@F){ $a=ord($_); $a-=1; print chr($a)} END{print "\n"}'
aaa
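The ord-table technique from the awk answer can also be written with substr instead of an empty field separator, and with the shift taken as an argument. This is a sketch under assumptions: shift_ascii is a hypothetical name, the input is assumed to be single-line 7-bit ASCII, and results that land on control characters (NUL in particular) may not print usefully.

```shell
#!/usr/bin/env bash
# shift_ascii is a hypothetical helper name.
# Usage: shift_ascii SHIFT STRING
shift_ascii() {
    printf '%s\n' "$2" | awk -v s="$1" '
    BEGIN {
        # Build a character -> code table for ASCII 1..127.
        for (n = 1; n < 128; n++) ord[sprintf("%c", n)] = n
    }
    {
        out = ""
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            # Wrap within 0..127; the extra +128 keeps negative shifts valid.
            code = ((ord[c] + s) % 128 + 128) % 128
            out = out sprintf("%c", code)
        }
        print out
    }'
}

shift_ascii 1 "ab"
shift_ascii 3 "ab"
```

The two calls correspond to the question's examples, command 1 "ab" and command 3 "ab".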

Making horizontal String vertical shell or awk

I have a string
ABCDEFGHIJ
I would like it to print.
A
B
C
D
E
F
G
H
I
J
ie horizontal, no editing between characters to vertical. Bonus points for how to put a number next to each one with a single line. It'd be nice if this were an awk or shell script, but I am open to learning new things. :) Thanks!
If you just want to convert a string to one-char-per-line, you just need to tell awk that each input character is a separate field and that each output field should be separated by a newline and then recompile each record by assigning a field to itself:
awk -v FS= -v OFS='\n' '{$1=$1}1'
e.g.:
$ echo "ABCDEFGHIJ" | awk -v FS= -v OFS='\n' '{$1=$1}1'
A
B
C
D
E
F
G
H
I
J
and if you want field numbers next to each character, see @Kent's solution or pipe to cat -n.
The sed solution you posted is non-portable and will fail with some seds on some OSs, and it will add an undesirable blank line to the end of your sed output which will then become a trailing line number after your pipe to cat -n so it's not a good alternative. You should accept @Kent's answer.
awk one-liner:
awk 'BEGIN{FS=""}{for(i=1;i<=NF;i++)print i,$i}'
test :
kent$ echo "ABCDEF"|awk 'BEGIN{FS=""}{for(i=1;i<=NF;i++)print i,$i}'
1 A
2 B
3 C
4 D
5 E
6 F
So I figured this one out on my own with sed.
sed 's/./&\n/g' horiz.txt > vert.txt
One more awk
echo "ABCDEFGHIJ" | awk '{gsub(/./,"&\n")}1'
A
B
C
D
E
F
G
H
I
J
This might work for you (GNU sed):
sed 's/\B/\n/g' <<<ABCDEFGHIJ
for line numbers:
sed 's/\B/\n/g' <<<ABCDEFGHIJ | sed = | sed 'N;y/\n/ /'
or:
sed 's/\B/\n/g' <<<ABCDEFGHIJ | cat -n
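One more option that none of the answers above use is fold, which wraps input at a given width; a width of 1 puts each character on its own line. This is a sketch assuming single-byte (ASCII) input, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# to_vertical / to_vertical_numbered are hypothetical helper names.
# Usage: to_vertical STRING
to_vertical() {
    # fold -w 1 breaks the line after every column.
    printf '%s\n' "$1" | fold -w 1
}

# Same, with a running number in front of each character.
to_vertical_numbered() {
    to_vertical "$1" | awk '{ print NR, $0 }'
}

to_vertical_numbered "ABCDEFGHIJ"
```

Unlike the sed 's/./&\n/g' version, this does not append an extra blank line at the end.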

How to do sum from the file and move in particular way in another file in linux?

Actually, this is my assignment. I have three or four files of student records. Every file has two or three student records, like this:
Course Name:Opreating System
Credit: 4
123456 1 1 0 1 1 0 1 0 0 0 1 5 8 0 12 10 25
243567 0 1 1 0 1 1 0 1 0 0 0 7 9 12 15 17 15
Every file has a different course name. I have moved every course name and student ID into one file, but now I don't know how to add up all the marks and put the totals in another file, in the same row as the matching ID. Can you please tell me how to do it?
It looks like this:
Student# Operating Systems JAVA C++ Web Programming GPA
123456 76 63 50 82 67.75
243567 80 - 34 63 59
I did like this:
#!/bin/sh
find ~/2011/Fall/StudentsRecord -name "*.rec" | xargs grep -l 'CREDITS' | xargs cat > rsh1
echo "STUDENT ID" > rsh2
sed -n /COURSE/p rsh1 | sed 's/COURSE NAME: //g' >> rsh2
echo "GPA" >> rsh2
sed -e :a -e '{N; s/\n/ /g; ta}' rsh2 > rshf
sed '/COURSE/d;/CREDIT/d' rsh1 | sort -uk 1,1 | cut -d' ' -f1 | paste -d' ' >> rshf
Some comments and a few pointers :
It would help to add 'comments' for each line of code that is not self-evident; i.e. code like mv f f.bak doesn't need to be commented, but I'm not sure what the intent of many of your lines of code is.
You insert a comment with the '#' char, like
# concatenate all files that contain the word CREDITS into a file called rsh1
find ~/2011/Fall/StudentsRecord -name "*.rec" | xargs grep -l 'CREDITS' | xargs cat > rsh1
Also note that you consistently use all uppercase for your search targets, i.e. CREDITS, when your sample file shows mixed case. Either use the correct case for your search targets, i.e.
grep -l 'Credits'
OR tell grep to -i(gnore case), i.e.
grep -il 'Credits'
Your line
sed -n /COURSE/p rsh1 | sed 's/COURSE NAME: //g' >> rsh2
can be reduced to 1 call to sed (and you have the same case confusion thing going on). With GNU sed, try
sed -n '/COURSE/I{s/COURSE NAME: //gIp;}' rsh1 >> rsh2
This means: -n = don't print every line by default, and the gIp flags on the substitution are
g = global substitute
I = ignore case in matching (a GNU sed extension, also used on the /COURSE/I address)
p = print only lines where a substitution was made
So you're editing out the string COURSE NAME for any line that has COURSE in it, and only printing those lines (you probably don't need the 'g' (global) specifier given that you expect only 1 instance per line).
Your line
sed -e :a -e '{N; s/\n/ /g; ta}' rsh2 > rshf
Actually looks pretty good, very advanced, you're trying to 'fold' each 2 lines together into 1 line, right?
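If the intent is to join every line of rsh2 into one header line (an assumption about what the script is after), paste -s does the same job as the sed loop. A minimal sketch, with a hypothetical function name and made-up demo contents:

```shell
#!/usr/bin/env bash
# join_lines is a hypothetical helper name.
# Usage: join_lines FILE
join_lines() {
    # -s pastes all lines of one file serially onto a single line;
    # -d ' ' uses a space as the separator instead of a tab.
    paste -s -d ' ' "$1"
}

# Demo with contents similar to what the script writes into rsh2.
sample=$(mktemp)
printf '%s\n' 'STUDENT ID' 'Operating Systems' 'JAVA' 'GPA' > "$sample"
join_lines "$sample"
rm -f "$sample"
```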
But,
sed '/COURSE/d;/CREDIT/d' rsh1 | sort -uk 1,1 | cut -d' ' -f1 | paste -d' ' >> rshf
I'm really confused by this. Is this where you're trying to total a student's score? (With a sort embedded, I guess not.) Why do you think you need a sort?
While it is possible to perform arithmetic in sed, it is super-crazy hard, so you can either use bash variables to calculate the values OR use a unix tool that is designed to process text AND perform logical and mathematical operations of the data presented, awk or perl come to mind here
Anyway, one solution to total each score is to use awk
echo "123456 1 1 0 1 1 0 1 0 0 0 1 5 8 0 12 10 25" |\
awk '{for (i=2;i<=NF;i++) { tot+=$i }; print $1 "\t" tot }'
Will give you a clue on how to proceed for that.
Awk has predefined variables that it populates for each file, and each line of text that it reads, i.e.
$0 = complete line of text (as defined by the internal variable RS (RecordSeparator),
which defaults to '\n', the new-line char, the unix end-of-line char)
$1 = first field in text (as defined by the internal variable FS (FieldSeparator),
which defaults to a run of one or more space or tab chars;
e.g. text split by one run of 2 spaces and by 1 tab char has 3 fields)
NF = Number(of)Fields in current line of data (again fields defined by value of FS as
described above)
(there are many others besides $0, $n, NF, FS and RS; note that only fields take a leading '$').
You can programmatically step through values like $1, $2, $3 by using a variable, as in the example code, like $i (i is a variable that holds a number between 2 and NF). The leading '$'
says give me the value of field i (i.e. $2, $3, $4 ...)
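Putting those variables together, here is a sketch that totals every student line in one course file. The total is reset for each line, which the one-line echo example above did not need to do. The function name student_totals is hypothetical, and it assumes a student line is any line whose first field is all digits.

```shell
#!/usr/bin/env bash
# student_totals is a hypothetical helper name.
# Usage: student_totals FILE
student_totals() {
    # Only lines whose first field is purely numeric are student records;
    # "Course Name:" and "Credit:" lines are skipped by the pattern.
    awk '$1 ~ /^[0-9]+$/ && NF > 1 {
        tot = 0                                # reset for every student
        for (i = 2; i <= NF; i++) tot += $i    # add fields 2..NF
        print $1 "\t" tot
    }' "$1"
}

# Demo on the sample record file from the question.
sample=$(mktemp)
printf '%s\n' 'Course Name:Opreating System' \
              'Credit: 4' \
              '123456 1 1 0 1 1 0 1 0 0 0 1 5 8 0 12 10 25' \
              '243567 0 1 1 0 1 1 0 1 0 0 0 7 9 12 15 17 15' > "$sample"
student_totals "$sample"
rm -f "$sample"
```

Each output line is a student ID and a total separated by a tab, which join or paste could then combine across courses.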
Incidentally, your problem could be easily solved with a single awk script, but apparently, you're supposed to learn about cat, cut, grep, etc, which is a very worthwhile goal.
I hope this helps.

Resources