sed: filter string subset from lines matching regexp - linux

I have a file of the following format:
abc: A B C D E
abc: 1 2 3 4 5
def D E F G H
def: 10 11 12 23 99
...
That is, a first line with strings after ':' is a header for the next line with numbers. I'd like to use sed to extract only a line starting with a PATTERN string with numbers in the line.
The number of numbers in a line is variable, but assume that I know exactly how many I'm expecting, so I tried this command:
% sed 's/^abc: \([0-9]+ [0-9]+ [0-9]+\)$/\1/g' < file.txt
But it dumps all entries from the file. What am I doing wrong?

sed does substitutions and prints each line, whether a substitution happens or not.
Your regular expression is wrong. It would match only three numbers separated by spaces, and only if the extended-regex flag (-E) were given. Without it, not even that, because the + sign is interpreted literally.
The best approach here is to use addresses and print only the lines that match:
sed -nE '/^abc: [0-9]+ [0-9]+ [0-9]+ [0-9]+ [0-9]+$/p' < file.txt
or better,
sed -nE '/^abc:( [0-9]+){5}$/p' < file.txt
The -n flag disables the "print all lines" behavior of sed described above. Only the lines that reach the p command will be printed.
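For reference, against the sample input in the question, that address form should print just the numeric abc line (a quick sketch, assuming a sed with -E support):
$ sed -nE '/^abc:( [0-9]+){5}$/p' < file.txt
abc: 1 2 3 4 5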

"To extract only a line starting with a PATTERN string with numbers in the line", where "the number of numbers in a line is variable" means at least one number, so:
$ sed -n '/abc: \([0-9]\+\)/p' file
Output:
abc: 1 2 3 4 5
With exactly 5 numbers, use:
$ sed -n '/abc: \([0-9]\+\( \|$\)\)\{5\}/p' file

With @Mark's additional question in a comment, "If I want to just extract the matched numbers (and remove the prefix, e.g. abc)…", this is the pattern I came up with:
sed -En 's/^abc: (([0-9]+[ \t]?)+)[ \t]*$/\1/gp' file.txt
I'm using the -E flag for extended regular expressions to avoid all the escaping that would be needed.
Given this file:
abc: A B C D E
abc: 1 2 3 4 5
abc: 1 c9 A 7f
def D E F G H
def: 10 11 12 23 99
… this regex matches abc: 1 2 3 4 5 while excluding abc: 1 c9 A 7f — it also allows variable whitespace and trailing whitespace.
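A quick check against that file (a sketch; the capture group \1 keeps only the numbers) should leave:
$ sed -En 's/^abc: (([0-9]+[ \t]?)+)[ \t]*$/\1/gp' file.txt
1 2 3 4 5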

With any sed:
$ sed -n 's/^abc: \([0-9 ]*\)$/\1/p' file
1 2 3 4 5

Related

printf format specifiers in awk do not work for multiple parameters

I'm trying to write a script in Bash named example7, which accepts as parameters a file name (let's call it File 1) and a list of
numbers (below we'll call it List 1). What the program needs to do is print as output the columns from
File 1 after realigning them to the right or left according to the numbers in List 1. (This is obtainable by using awk's printf command.)
Example
Suppose the contents of a file F1 are:
A abcd ddd eee zz tt
ab gggwe 12 88 iii jjj
yaara yyzz 12abcd xyz x y z
After running the program by command:
example7 F1 -8 -7 6 4
Output:
A abcd ddd eee
ab gggwe 12 88
yaara yyzz 12abcd xyz
In the example above, between A and abcd there are 7 spaces, between abcd and ddd there are 6 spaces, and between ddd and eee there is one space.
Another example:
After running the program by command:
example7 F1 -8 -7 6 4 5
Output:
A abcd ddd eee zz
ab gggwe 12 88 iii
yaara yyzz 12abcd xyz x
In the example above, between A and abcd there are 7 spaces, between abcd and ddd there are 6 spaces, between ddd and eee there is one space, between eee and zz there are 3 spaces, between 88 and iii there are two spaces, and between xyz and x there are 4 spaces.
I've tried doing something like this:
file=$1
shift
awk '{printf "%'$1's\n" ,$1}' $file
but it only works for one number and one parameter, and I don't know how to do it for multiple columns and multiple parameters.
Any help will be appreciated.
Set an awk variable to all the remaining parameters, then split it and loop over them.
file=$1
shift
awk -v sizes="$*" '{words = split(sizes, s); for(i = 1; i <= words; i++) printf("%" s[i] "s", $i); print ""; }' "$file"
It's generally wrong to try to substitute a shell variable directly into an awk script. You should prefer to set an awk variable using -v, and then use awk's own string concatenation operation, as I did with s[i].
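As a quick usage sketch, assuming the three lines above are saved as an executable script named example7 (with a bash shebang) and the sample file is saved as F1, the question's first example should come out like this:
$ ./example7 F1 -8 -7 6 4
A       abcd      ddd eee
ab      gggwe      12  88
yaara   yyzz   12abcd xyz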

Calculating a sum of numbers in C shell

I'm trying to calculate a sum of numbers positioned on separate lines using C shell.
I must do it with specific commands using pipes.
There is a chain of commands: command.. | command.. | (commands...)
printing lines in the following form:
1
2
8
4
7
The result should be 22, since 1 + 2 + 8 + 4 + 7 = 22.
I tried ... | bc | tr "\n" "+" | bc, but it didn't work.
I can't use awk or variables. That is why I am asking for help.
You actually can use the C shell variables, as they are part of the syntax. Without using variables, you need to pipe, and pipe again:
your-command | sed '2~1 s/^/+/' | xargs | bc
The sed command prepends a plus character to every line starting from the second; xargs joins the lines into a single sequence of arguments.
The sed expression can be improved to filter out non-numeric lines:
'/^[^0-9]\+$/ d; 2~1 s/\([0-9]\+\)/+\1/'
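Putting the pipeline together as a self-contained check (a sketch: printf stands in for your command chain, and the 2~1 step address is a GNU sed extension):
% printf '%s\n' 1 2 8 4 7 | sed '2~1 s/^/+/' | xargs | bc
22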

4 lines invert grep search in a directory that contains many files

I have many log files in a directory. In those files, there are many lines. Some of these lines contain ERROR word.
I am using grep ERROR abc.* to get error lines from all the abc1,abc2,abc3,etc files.
Now, there are 4-5 ERROR lines that I want to avoid.
So, I am using
grep ERROR abc* | grep -v 'str1\| str2'
This works fine. But when I insert 1 more string,
grep ERROR abc* | grep -v 'str1\| str2\| str3
it has no effect on the output.
I need to avoid 4-5 strings. Can anybody suggest a solution?
You are using multiple search patterns, i.e. in effect a regex alternation. grep's -E option supports extended regular expressions, as you can see from the man page excerpts below:
-e PATTERN, --regexp=PATTERN
Use PATTERN as the pattern. This can be used to specify multiple search patterns, or to protect a pattern beginning with a hyphen (-). (-e is specified by POSIX.)
-E, --extended-regexp
Interpret PATTERN as an extended regular expression (ERE, see below). (-E is specified by POSIX.)
So you need to use the -E flag along with -v (invert match):
grep ERROR abc* | grep -Ev 'str1|str2|str3|str4|str5'
An example of the usage for your reference:
$ cat sample.txt
ID F1 F2 F3 F4 ID F1 F2 F3 F4
aa aa
bb 1 2 3 4 bb 1 2 3 4
cc 1 2 3 4 cc 1 2 3 4
dd 1 2 3 4 dd 1 2 3 4
xx xx
$ grep -vE "aa|xx|yy|F2|cc|dd" sample.txt
bb 1 2 3 4 bb 1 2 3 4
Your example should work, but you can also use
grep ERROR abc* | grep -e 'str1' -e 'str2' -e 'str3' -v
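If the strings you want to drop are plain literals rather than regex patterns, a fixed-string variant is another option (my suggestion, not from the answers above); -F spares you from escaping any metacharacters:
grep ERROR abc* | grep -Fv -e 'str1' -e 'str2' -e 'str3' -e 'str4' -e 'str5'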

Printing all the lines that contain a certain word exactly k times

I have to search for all the lines from a file which contain a given word exactly k times. I think that I should use grep/sed/awk but I don't know how. My idea was to check the file line by line using sed and grep, like this:
line=1
while [ (sed -n -'($line)p' $name) -n ]; do
if [ (sed -n -'($line)p' $name | grep -w -c $word) -eq "$number" ]; then
sed -n -'($line)p' $name
fi
let line+=1
done
My first problem is that I get the following error: syntax error near unexpected token 'sed'. Then I realized that for my test file the command sed -n -'p1' test.txt | grep -w -c "ab" doesn't return the exact number of occurrences of "ab" in the first line of my file (it returns 1 but there are 3 occurrences).
My test.txt file:
abc ab cds ab abcd edfs ab
kkmd ab jnabc bad ab
abcdefghijklmnop ab cdab ab ab
abcde bad abc cdef a b
awk to the rescue!
$ awk -F'\\<ab\\>' -v count=2 'NF==count+1' file
kkmd ab jnabc bad ab
note that \< and \> word boundaries might be gawk specific.
for variable assignment, I think easiest will be
$ word=ab; awk -F"\\\<$word\\\>" -v count=2 'NF==count+1' file
kkmd ab jnabc bad ab
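To see why NF==count+1 works: splitting a line on the whole-word separator leaves one more field than there are occurrences. A quick check on the sample line (again assuming gawk-style word boundaries):
$ echo 'kkmd ab jnabc bad ab' | awk -F'\\<ab\\>' '{print NF}'
3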
You could use grep, but you'd have to use it twice. (You can't use a single grep because ERE has no way to negate a string, you can only negate a bracket expression, which will match single characters.)
The following is tested with GNU grep v2.5.1, where you can use \< and \> as (possibly non-portable) word delimiters:
$ word="ab"
$ < input.txt egrep "(\<$word\>.*){3}" | egrep -v "(\<$word\>.*){4}"
abc ab cds ab abcd edfs ab
abcdefghijklmnop ab cdab ab ab
$ < input.txt egrep "(\<$word\>.*){2}" | egrep -v "(\<$word\>.*){3}"
kkmd ab jnabc bad ab
The idea here is that we'll extract from our input file lines with N occurrences of the word, then strip from that result any lines with N+1 occurrences. Lines with fewer than N occurrences of course won't be matched by the first grep.
Or, you might also do this in pure bash, if you're feeling slightly masochistic:
$ word="ab"; num=3
$ readarray -t lines < input.txt
$ for this in "${lines[@]}"; do declare -A words=(); x=( $this ); for y in "${x[@]}"; do ((words[$y]++)); done; [ "0${words[$word]}" -eq "$num" ] && echo "$this"; done
abc ab cds ab abcd edfs ab
abcdefghijklmnop ab cdab ab ab
Broken out for easier reading (or scripting):
#!/usr/bin/env bash
# Salt to taste
word="ab"; num=3
# Pull content into an array. This isn't strictly necessary, but I like
# getting my file IO over with quickly if possible.
readarray -t lines < input.txt
# Walk through the array (or you could just walk through the input file)
for this in "${lines[@]}"; do
# Initialize this line's counter array
declare -A words=()
# Break up the words into array elements
x=( $this )
# Step though the array, counting each unique word
for y in "${x[@]}"; do
((words[$y]++))
done
# Check the count for "our" word
[ "0${words[$word]}" -eq $num ] && echo "$this"
done
Wasn't that fun? :)
But this awk option makes the most sense to me. It's a portable one-liner that doesn't depend on GNU awk (so it'll work in OS X, BSD, etc.)
awk -v word="ab" -v num=3 '{for(i=1;i<=NF;i++){a[$i]++}} a[word]==num; {delete a}' input.txt
This works by building an associative array to count the words on each line, then printing the line if the count for the "interesting" word is what's specified as num. It's the same basic concept as the bash script above, but awk lets us do this so much better. :)
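For reference, running that one-liner against the question's sample file (saved as input.txt here) should pick out the same lines as the bash version above:
$ awk -v word="ab" -v num=3 '{for(i=1;i<=NF;i++){a[$i]++}} a[word]==num; {delete a}' input.txt
abc ab cds ab abcd edfs ab
abcdefghijklmnop ab cdab ab ab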
You can do this with grep
grep -E "(${word}.*){${number}}" test.txt
This looks for ${number} occurrences of ${word} per line. The wildcard .* is needed since we also want to match occurrences where matches of ${word} are not next to each other.
Here's what I do:
$ echo 'abc ab cds ab abcd edfs ab
kkmd ab jnabc bad ab
abcdefghijklmnop ab cdab ab ab
abcde bad abc cdef a b' > test.txt
$ word=abc
$ number=2
$ grep -E "(${word}.*){${number}}" test.txt
> abc ab cds ab abcd edfs ab
> abcde bad abc cdef a b
Maybe you need to use sed. If you are looking for character sequences, you can use code like this. However, it doesn't distinguish between the word on its own and the word embedded in another word (so it treats ab and abc as both containing ab).
word="ab"
number=2
sed -n -e "/\($word.*\)\{$(($number + 1))\}/d" -e "/\($word.*\)\{$number\}/p" test.txt
By default, nothing is printed (-n).
The first -e expression looks for 3 (or more) occurrences of $word and deletes lines containing them (and skips to the next line of input). The $(($number + 1)) is shell arithmetic.
The second -e expression looks for 2 occurrences of $word (there won't be more) and prints the lines that match.
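For example, against the question's test.txt this character-sequence version prints only the last line, because abcde and abc each count as containing ab (a sketch illustrating the embedded-word caveat above):
$ word="ab"; number=2
$ sed -n -e "/\($word.*\)\{$(($number + 1))\}/d" -e "/\($word.*\)\{$number\}/p" test.txt
abcde bad abc cdef a b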
If you want words on their own, then you have to work a lot harder. You'd need extended regular expressions, triggered with the -E option on BSD (Mac OS X), or -r with GNU sed.
number=2
plus1=$(($number + 1))
word=ab
sed -En -e "/(^|[^[:alnum:]])($word([^[:alnum:]]).*){$plus1}/d" \
-e "/(^|[^[:alnum:]])($word([^[:alnum:]]).*){$number}$word$/d" \
-e "/(^|[^[:alnum:]])($word([^[:alnum:]]|$).*){$number}/p" test.txt
This is similar to the previous version, but it has considerably more delicate word handling.
The unit (^|[^[:alnum:]]) looks for either the start of line or a non-alphanumeric character (change alnum to alpha throughout if you don't want digits to stop matches).
The first -e looks for start of line or a non-alphanumeric character, followed by the word and a non-alphanumeric and zero or more other characters, N+1 times, and deletes such lines (skipping to the next line of input).
The second -e looks for start of line or a non-alphanumeric character, followed by the word and a non-alphanumeric and zero or more other characters N times, and then the word again followed by end of line, and deletes such lines.
The third -e looks for start of line or a non-alphanumeric character, followed by the word and a non-alphanumeric and zero or more other characters N times and prints such lines.
Given the (extended) input file:
abc NO ab cds ab abcd edfs ab
kkmd YES ab jnabc bad ab
abcd NO efghijklmnop ab cdab ab ab
abcd NO efghijklmnop ab cdab ab ab
abcd NO e bad abc cdef a b
ab YES abcd abcd ab
best YES ab ab candidly
best YES ab ab candidly
ab NO abcd abcd ab ab
hope NO abcd abcd ab ab ab
nope NO abcd abcd ab ab ab
ab YES abcd abcd ab not bad
said YES ab not so bad ab or bad
Example output:
kkmd YES ab jnabc bad ab
ab YES abcd abcd ab
best YES ab ab candidly
best YES ab ab candidly
ab YES abcd abcd ab not bad
said YES ab not so bad ab or bad
It is not a trivial exercise in sed. It would be simpler if you could rely on word-boundary detection. For example, in Perl:
number=2
plus1=$(($number + 1))
word=ab
perl -n -e "next if /(\b$word\b.*?){$plus1}/;
print if /(\b$word\b.*?){$number}/" test.txt
This produces the same output as the sed script, but is a lot simpler because of the \b word boundary detection (the .*? non-greedy matching isn't crucial to the operation of the script).
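A quick run against the original test.txt, for what it's worth, picks out the line with exactly two whole-word occurrences:
$ number=2; plus1=$(($number + 1)); word=ab
$ perl -n -e "next if /(\b$word\b.*?){$plus1}/; print if /(\b$word\b.*?){$number}/" test.txt
kkmd ab jnabc bad ab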

How to sum values from a file and move them in a particular way into another file in Linux?

Actually this is my assignment. I have three or four files related to student records. Every file has two or three student records, like this:
Course Name: Operating System
Credit: 4
123456 1 1 0 1 1 0 1 0 0 0 1 5 8 0 12 10 25
243567 0 1 1 0 1 1 0 1 0 0 0 7 9 12 15 17 15
Every file has a different course name. I managed to move every course name and student ID
into one file, but now I don't know how to add up all the marks and move them into another file in the same place as the ID. Can you please tell me how to do it?
The desired output looks like this:
Student# Operating Systems JAVA C++ Web Programming GPA
123456 76 63 50 82 67.75
243567 80 - 34 63 59
I did it like this:
#!/bin/sh
find ~/2011/Fall/StudentsRecord -name "*.rec" | xargs grep -l 'CREDITS' | xargs cat > rsh1
echo "STUDENT ID" > rsh2
sed -n /COURSE/p rsh1 | sed 's/COURSE NAME: //g' >> rsh2
echo "GPA" >> rsh2
sed -e :a -e '{N; s/\n/ /g; ta}' rsh2 > rshf
sed '/COURSE/d;/CREDIT/d' rsh1 | sort -uk 1,1 | cut -d' ' -f1 | paste -d' ' >> rshf
Some comments and a few pointers:
It would help to add comments for each line of code that is not self-evident; i.e. code like mv f f.bak doesn't need to be commented, but I'm not sure what the intent of your many lines of code is.
You insert a comment with the '#' char, like
# concatenate all files that contain the word CREDITS into a file called rsh1
find ~/2011/Fall/StudentsRecord -name "*.rec" | xargs grep -l 'CREDITS' | xargs cat > rsh1
Also note that you consistently use all uppercase for your search targets, i.e. CREDITS, when your sample files show mixed case. Either use the correct case for your search targets, i.e.
`grep -l 'Credits'`
OR tell grep to -i(gnore case), i.e.
`grep -il 'Credits'`
Your line
sed -n /COURSE/p rsh1 | sed 's/COURSE NAME: //g' >> rsh2
can be reduced to 1 call to sed (and you have the same case confusion thing going on), try
sed -n '/COURSE/I{s/COURSE NAME: //gip;}' rsh1 >> rsh2
This means: -n, don't print every line by default; the uppercase I on the /COURSE/ address makes that match case-insensitive (a GNU sed extension); and in `gip`:
`g` = global substitute,
`i` = ignore case in matching,
`p` = print only the lines where a substitution was made.
So you're editing out the string COURSE NAME for any line that has COURSE in it, and only printing those lines (you probably don't need the 'g' (global) specifier given that you expect only 1 instance per line).
Your line
sed -e :a -e '{N; s/\n/ /g; ta}' rsh2 > rshf
Actually looks pretty good, very advanced; you're trying to 'fold' every 2 lines together into 1 line, right?
But,
sed '/COURSE/d;/CREDIT/d' rsh1 | sort -uk 1,1 | cut -d' ' -f1 | paste -d' ' >> rshf
I'm really confused by this; is this where you're trying to total a student's score? (With a sort embedded, I guess not.) Why do you think you need a sort?
While it is possible to perform arithmetic in sed, it is super-crazy hard, so you can either use bash variables to calculate the values OR use a Unix tool that is designed to process text AND perform logical and mathematical operations on the data presented; awk or perl come to mind here.
Anyway, one solution to total each score is to use awk
echo "123456 1 1 0 1 1 0 1 0 0 0 1 5 8 0 12 10 25" |\
awk '{for (i=2;i<=NF;i++) { tot+=$i }; print $1 "\t" tot }'
That will give you a clue on how to proceed.
Awk has predefined variables that it populates for each file and each line of text that it reads, i.e.
$0 = the complete line of text (a record, as delimited by the internal variable RS (RecordSeparator), which defaults to the '\n' new-line char, the Unix end-of-line char)
$1 = the first field in the text (as delimited by the internal variable FS (FieldSeparator), which defaults to (possibly multiple) space chars or a tab char, so a line whose fields are separated by 2 connected space chars and 1 tab char still has 3 fields)
NF = Number of Fields in the current line of data (fields again defined by the value of FS as described above)
(there are many others besides $0, $n, NF, FS, RS).
You can programmatically step through field values like $1, $2, $3 by using a variable, as in the example code with $i (i is a variable that holds a number between 2 and NF; the leading '$' says give me the value of field i, i.e. $2, $3, $4 ...).
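Putting that loop to work on one of your course files, here's a rough sketch (the file name course.rec is hypothetical, and NR > 2 simply skips the two header lines shown in your sample):
# for each data line: reset tot, sum every mark after the student ID, print ID and total
awk 'NR > 2 { tot = 0; for (i = 2; i <= NF; i++) tot += $i; print $1 "\t" tot }' course.rec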
Incidentally, your problem could be easily solved with a single awk script, but apparently you're supposed to learn about cat, cut, grep, etc., which is a very worthwhile goal.
I hope this helps.
