Regarding the grep command - Linux

I have two queries:
1. I do a grep and get the line number of a match in the input file. I want to retrieve a set of lines before and after that line number from the input file and redirect them to a /tmp/testout file. How can I do it?
2. I have two line numbers, 10000 and 20000. I want to retrieve the lines between 10000 and 20000 of the input file and redirect them to a /tmp/testout file. How can I do it?

For the 1st question, grep -C is the straightforward option.
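For example, with 5 lines of context either side (the pattern and file names here are just placeholders):
grep -C 5 'your pattern' inputfile > /tmp/testout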
For the 2nd question, try this:
sed -n '10000,20000p' bar.txt > foo.txt

You want to look into the -A, -B and -C options of grep. See man grep for more information:
-A NUM, --after-context=NUM
        Print NUM lines of trailing context after matching lines.
        Places a line containing -- between contiguous groups of matches.
-B NUM, --before-context=NUM
        Print NUM lines of leading context before matching lines.
        Places a line containing -- between contiguous groups of matches.
-C NUM, --context=NUM
        Print NUM lines of output context. Places a line containing --
        between contiguous groups of matches.
For redirecting the output, do the following: grep "your pattern" yourinputfile > /tmp/testout

See head and/or tail.
For example:
head -n 20000 <input> | tail -n 10001 > /tmp/testout
where the argument of tail is (20000 - 10000 + 1), so that line 10000 itself is included.

If you're using GNU grep, you can supply -B and -A to get lines before and after the match with grep.
E.g.
grep -B 5 -A 10 SearchString File
will print each line of File matching SearchString, plus 5 lines before and 10 lines after the matching line.
For the other part of your question, you can use head/tail or sed. Please see other answers for details.

For part 2, awk will allow you to print a range of lines thus:
awk 'NR==10000,NR==20000{print}' inputfile.txt > /tmp/testout
This basically gives a range based on the record number NR.
For part 1, context from grep can be obtained using the --after-context=X and --before-context=X switches. If you're running a grep that doesn't allow that, you can dummy up an awk script based on the part 2 answer above, as sketched below.
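For instance, here is a rough awk sketch of grep's -B/-A behavior (PATTERN, the B/A values, and the file names are placeholders, not anything from the question); unlike grep it may reprint lines where context regions overlap:
awk -v B=2 -v A=3 '
  /PATTERN/ {                         # a matching line
    for (i = NR - B; i < NR; i++)     # print the saved before-context
      if (i in buf) print buf[i]
    print                             # print the match itself
    after = A                         # arm the after-context counter
    next
  }
  after > 0 { print; after-- }        # still inside the after-context
  { buf[NR] = $0; delete buf[NR-B] }  # keep a rolling window of B lines
' inputfile > /tmp/testout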

To see lines before and after a match (3 lines before and 3 lines after):
grep -C3 foo bar.txt
For the second question:
head -n 20000 bar.txt | tail -n 10001 > foo.txt

You can do these with just awk, e.g. display 2 lines before and after "6", and display the range from line number 4 to 8:
$ cat file
1
2
3
4
5
6
7
8
9
10
$ awk 'c--&&c>=0{print "2 numbers below 6: "$0};/6/{c=2;for(i=d;i>d-2;i--)print "2 numbers above 6: "a[i];delete a}{a[++d]=$0} NR>3&&NR<9{print "With range: ->"$0}' file
With range: ->4
With range: ->5
2 numbers above 6: 5
2 numbers above 6: 4
With range: ->6
2 numbers below 6: 7
With range: ->7
2 numbers below 6: 8
With range: ->8

If your grep doesn't have -A, -B and -C, then this sed command may work for you:
sed -n '1bb;:a;/PATTERN/{h;n;p;H;g;bb};N;//p;:b;99,$D;ba' inputfile > outputfile
where PATTERN is the regular expression you're looking for and 99 is one greater than the number of context lines you want (equivalent to -C 98).
It works by keeping a window of lines in memory and when the regex matches, the captured lines are output.
If your sed doesn't like semicolons and prefers -e, this version may work for you:
sed -n -e '1bb' -e ':a' -e '/PATTERN/{h' -e 'n' -e 'p' -e 'H' -e 'g' -e 'bb}' -e 'N' -e '//p' -e ':b' -e '99,$D' -e 'ba' inputfile > outputfile
For your line range output, this will work and will finish a little more quickly if there are a large number of lines after the end of the range (the 20000q quits as soon as the last line of the range has been printed):
sed -n '10000,20000p;20000q' inputfile > outputfile
or
sed -n -e '10000,20000p' -e '20000q' inputfile > outputfile

Related

How to check that the 4th character in a file is 'a' using linux grep command

To find the 2nd character it was grep -e '^.[aA]'. Then what will it be for the 4th character? I tried grep -e '^...[aA]', but it went wrong.
grep processes the input line by line. ^.[aA] is true if a or A is the second character on any line.
You can combine grep with head to only inspect the first line:
head -n1 filename | grep '^...[aA]'
But it still wouldn't work for a file whose first line is shorter than four characters:
x
ya
To really check the fourth character in a file, grep is not the best tool.
#! /bin/bash
# Read exactly the first four characters of the file into $chars.
read -N4 chars < filename
# Test the fourth character (zero-based index 3).
if [[ "${chars:3:1}" == [aA] ]] ; then
    echo Found
fi
But if you try hard enough, you can still use it. E.g., use tr to replace newlines by spaces, then you can run your grep:
tr '\n' ' ' < filename | grep '^...[aA]'
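Alternatively, if the literal 4th byte of the file is what you want (a newline then counts as a character, and multi-byte characters are not handled), you can isolate it with head and tail first; filename is the same placeholder as above:
head -c4 filename | tail -c1 | grep -q '[aA]' && echo Found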

Is sed not able to match the first line of a file with the end of the address range?

sed is failing me on macOS and Linux:
$ printf "1\n2\n3" | sed -n -e '1,1p'
1
$ printf "1\n2\n3" | sed -n -e '1,/1/p'
1
2
3
The end address range pattern /1/ doesn't work. /2/ would though.
printf "1\n2\n3" | sed -n -e '1,/2/p'
1
2
In your sed examples (BSD and GNU sed behave the same here), line 1 starts the range and /1/ closes it, but the search for /1/ begins only after line 1 (the range start). Since no later line matches /1/, the range never closes and runs to the end of the file.
In GNU sed, there's an extension that handles your exact case, the 0,/regexp/ range address. The docs explain it best:
0,/regexp/ A line number of 0 can be used in an address specification like 0,/regexp/ so that sed will try to match regexp in the first input line too. In other words, 0,/regexp/ is similar to 1,/regexp/, except that if addr2 matches the very first line of input the 0,/regexp/ form will consider it to end the range, whereas the 1,/regexp/ form will match the beginning of its range and hence make the range span up to the second occurrence of the regular expression.
For example:
$ printf "%d\n" {1..3} | sed -n -e '0,/1/p'
1
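If your sed lacks this GNU extension, a small awk sketch gives the same print-up-to-and-including-the-first-match behavior:
$ printf "%d\n" {1..3} | awk '{print} /1/{exit}'
1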

A Shell Script to simulate the wc command with its options?

We have to write a shell script that works like the wc command and receives -l, -c and -w as its options.
Shell scripting syntax aside, MY QUESTION is: can we simulate the logic of wc -c, wc -l or wc -w using sed or grep or anything else? If yes, then how?
Important: don't use wc in the script.
A single awk command that you can parameterize by setting the appropriate -v variables to 0:
LC_ALL=C awk -v l=1 -v w=1 -v c=1 '
{ wc+=NF; cc+=1+length($0) }
END { printf "%s\t%s\t%s\n", l ? NR : "", w ? wc: "", c ? cc : ""}
' file
Note:
For simplicity, you always get 3 \t-separated output fields, with fields whose output wasn't requested empty; it wouldn't be hard to modify this to emulate wc's output behavior, however.
As explained in choroba's grep answer, you must prepend LC_ALL=C to awk ... if you really want to count bytes (-c) rather than (potentially multi-byte) characters (-m).
To count characters (the equivalent of wc -m), remove LC_ALL=C above.
Caveat: This won't work with BSD awk, as also found on macOS, unfortunately, because it is not Unicode-aware and always counts the number of bytes (try awk '{print length($0)}' <<<ü).
wc -l strictly counts the number of \n characters, so it doesn't count an incomplete line - one missing a trailing \n - at the end of its input; the above awk command, by contrast, does count that line (and an implied trailing newline in the byte/character count).
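A quick demonstration of that last difference, on input with no trailing newline:
$ printf 'one two' | wc -l
0
$ printf 'one two' | awk 'END { print NR }'
1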
How it works:
awk's NF variable contains the number of fields on each input line, where the line is broken into fields by arbitrary runs of whitespace by default; in other words: by default, fields are words.
$0 is the input line at hand, whose length() tells you the number of characters / bytes, with 1 added to account for the \n character at the end of the line.
Note how variables wc and cc need no initialization, because awk implicitly treats empty/undefined variables as 0 in a numeric context (such as with the compound operator +=).
NR contains the current, 1-based line number, which in the END block is equal to the total number of input lines.
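For example, to emulate just wc -l, turn the other two fields off (this is the same command as above; only the -v values change):
LC_ALL=C awk -v l=1 -v w=0 -v c=0 '
{ wc+=NF; cc+=1+length($0) }
END { printf "%s\t%s\t%s\n", l ? NR : "", w ? wc: "", c ? cc : ""}
' file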
Using awk:
-l:
awk 'END{print NR}' inFile
-w:
awk '{words+=NF}END{print words}' inFile
-c:
ls -l inFile | awk '{print $5}'
If you can use grep, simulating the line count is easy: just count how many times something that matches always happens:
grep -c '^' filename
This should output the same as wc -l (but it might report one more line if the file doesn't end in a newline).
To get the number of words, you can use the following pipeline:
grep -o '[^[:space:]]\+' filename | grep -c '^'
You need a grep that supports the -o option, which prints each matching string on a line of its own. The expression matches all non-space sequences, and piping them into what we used in the previous case just counts them.
To get the number of characters (wc -c), you can use
LC_ALL=C grep -o . filename | grep -c '^'
Setting LC_ALL=C is needed if your locale supports UTF-8; otherwise you'd be counting characters (wc -m) rather than bytes. The newline characters themselves never appear in grep -o's output, so you need to add the number of lines to the result:
echo $(( $( grep -c '^' filename )
+ $( LC_ALL=C grep -o . filename | grep -c '^' ) ))
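Putting these grep pipelines together, a minimal option-driven script might look like this sketch (the script name, option parsing, and variable names are my own; only the counting commands come from the answer above):
#!/bin/sh
# mywc.sh - a hypothetical wc look-alike built on grep
# usage: mywc.sh [-l] [-w] [-c] file
lines=0 words=0 chars=0
while getopts lwc opt; do
  case $opt in
    l) lines=1 ;;
    w) words=1 ;;
    c) chars=1 ;;
    *) echo "usage: $0 [-l] [-w] [-c] file" >&2; exit 1 ;;
  esac
done
shift $((OPTIND - 1))
file=$1
# -l: every line starts somewhere, so count matches of '^'
[ "$lines" -eq 1 ] && grep -c '^' "$file"
# -w: print each non-space run on its own line, then count the lines
[ "$words" -eq 1 ] && grep -o '[^[:space:]]\+' "$file" | grep -c '^'
# -c: count the characters, then add one newline per line
[ "$chars" -eq 1 ] && echo $(( $(grep -c '^' "$file") + $(LC_ALL=C grep -o . "$file" | grep -c '^') ))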

How to concatenate multiple lines of output to one line?

If I run the command cat file | grep pattern, I get many lines of output. How do you concatenate all lines into one line, effectively replacing each "\n" with "\" " (a quote followed by a space, so each line ends with " followed by a space)?
cat file | grep pattern | xargs sed s/\n/ /g
isn't working for me.
Use tr '\n' ' ' to translate all newline characters to spaces:
$ grep pattern file | tr '\n' ' '
Note: grep reads files, cat concatenates files. Don't cat file | grep!
Edit:
tr can only handle single-character translations. You could use awk to change the output record separator instead:
$ grep pattern file | awk '{print}' ORS='" '
This would transform:
one
two
three
to:
one" two" three"
Piping output to xargs will concatenate each line of output into a single line with spaces:
grep pattern file | xargs
This works with any command, e.g. ls | xargs. The default limit of xargs output is ~4096 characters, but it can be increased with e.g. xargs -s 8192.
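A quick demonstration:
$ seq 1 5 | xargs
1 2 3 4 5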
In bash, echo without quotes removes carriage returns, tabs and multiple spaces:
echo $(cat file)
This could be what you want
cat file | grep pattern | paste -sd' '
As to your edit, I'm not sure what it means, perhaps this?
cat file | grep pattern | paste -sd'~' | sed -e 's/~/" "/g'
(this assumes that ~ does not occur in file)
This is an example which produces output separated by commas. You can replace the comma by whatever separator you need.
cat <<EOD | xargs | sed 's/ /,/g'
> 1
> 2
> 3
> 4
> 5
> EOD
produces:
1,2,3,4,5
The fastest and easiest ways I know to solve this problem:
When we want to replace the newline character \n with a space:
xargs < file
xargs has its own limits on the number of characters per line and the number of all characters combined, but we can increase them. Details can be found by running xargs --show-limits and of course in the manual: man xargs.
When we want to replace one character with another exactly one character:
tr '\n' ' ' < file
When we want to replace one character with many characters:
tr '\n' '~' < file | sed 's/~/many_characters/g'
First we replace the newline characters \n with tildes ~ (or choose another unique character not present in the text), and then we replace the tilde characters with any other string (many_characters), once for each tilde (flag g).
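A quick demonstration with ', ' as the replacement; note the trailing separator, which comes from the file's final newline:
$ printf '1\n2\n3\n' | tr '\n' '~' | sed 's/~/, /g'
1, 2, 3, 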
Here is another simple method using awk:
# cat > file.txt
a
b
c
# cat file.txt | awk '{ printf("%s ", $0) }'
a b c
Also, if your file has columns, this gives an easy way to concatenate only certain columns:
# cat > cols.txt
a b c
d e f
# cat cols.txt | awk '{ printf("%s ", $2) }'
b e
I like the xargs solution, but if it's important to not collapse spaces, then one might instead do:
sed ':b;N;$!bb;s/\n/ /g'
That will replace newlines with spaces, without substituting the last line terminator as tr '\n' ' ' would.
This also allows you to use other joining strings besides a space, like a comma, etc, something that xargs cannot do:
$ seq 1 5 | sed ':b;N;$!bb;s/\n/,/g'
1,2,3,4,5
Here is the method using ex editor (part of Vim):
Join all lines and print to the standard output:
$ ex +%j +%p -scq! file
Join all lines in-place (in the file):
$ ex +%j -scwq file
Note: This will concatenate all lines inside the file itself!
Probably the best way to do it is using the awk tool, which will generate the output as one line:
$ awk '/pattern/ {print}' ORS=' ' /path/to/file
It will merge all matching lines into one, space-delimited.
paste -sd'~' was giving an error for me. Here's what worked on a Mac using bash:
cat file | grep pattern | paste -d' ' -s -
From man paste:
-d list   Use one or more of the provided characters to replace the newline
          characters instead of the default tab. The characters in list are
          used circularly, i.e., when list is exhausted the first character
          from list is reused. This continues until a line from the last
          input file (in default operation) or the last line in each file
          (using the -s option) is displayed, at which time paste begins
          selecting characters from the beginning of list again.

          The following special characters can also be used in list:

          \n    newline character
          \t    tab character
          \\    backslash character
          \0    Empty string (not a null character).

          Any other character preceded by a backslash is equivalent to the
          character itself.

-s        Concatenate all of the lines of each separate input file in
          command line order. The newline character of every line except
          the last line in each input file is replaced with the tab
          character, unless otherwise specified by the -d option.

If '-' is specified for one or more of the input files, the standard input
is used; standard input is read one line at a time, circularly, for each
instance of '-'.
On Red Hat Linux I just use echo:
echo $(cat /some/file/name)
This gives me all records of the file on just one line.

sed how to delete first 17 lines and last 8 lines in a file

I have a big 150 GB CSV file and I would like to remove the first 17 lines and the last 8 lines. I have tried the following, but it seems that it's not working right:
sed -i -n -e :a -e '1,8!{P;N;D;};N;ba'
and
sed -i '1,17d'
I wonder if someone can help with sed or awk; a one-liner would be great.
head and tail are better for the job than sed or awk.
tail -n+18 file | head -n-8 > newfile
awk -v nr="$(wc -l < file)" 'NR>17 && NR<=(nr-8)' file
All awk (keep a rolling buffer of the last y lines; printing starts only after the first x+y lines, and whatever is left in the buffer at the end - the last y lines - is never printed):
awk 'NR>y+x{print A[NR%y]} {A[NR%y]=$0}' x=17 y=8 file
Try this:
sed '<n>d' <fileName>
sed '/<string or regex>/d' <fileName>
sed '<addr1>[,<addr2>]d' <fileName>
where
/.../ = delimiters
n = line number
string = string found in the line
regex = regular expression corresponding to the searched pattern
addr = address of a line (number or pattern)
d = delete
LENGTH=`wc -l < file`
head -n $((LENGTH-8)) file | tail -n $((LENGTH-17)) > file
Edit: As mtk posted in a comment, this won't work. If you want to use wc and track the file length, you should use:
LENGTH=$(wc -l < file)
head -n $((LENGTH-8)) file | tail -n $((LENGTH-8-17)) > newfile
or:
LENGTH=$(wc -l < file)
head -n $((LENGTH-8)) file > tmpfile
LENGTH=$(wc -l < tmpfile)
tail -n $((LENGTH-17)) tmpfile > newfile
(Write to a new file rather than redirecting back onto file itself; the shell would truncate the input before head could read it.)
Which makes this solution less elegant than the one posted by choroba :)
I learnt this today for the shell.
{
  ghead -17 > /dev/null
  sed -n -e :a -e '1,8!{P;N;D;};N;ba'
} < my-bigfile > subset-of
One has to use a non-consuming head, hence the use of ghead from the GNU coreutils.
Similar to Thor's answer, but a bit shorter:
sed -i '' -e $'1,17d;:a\nN;19,25ba\nP;D' file.txt
The -i '' tells sed to edit the file in place. (The syntax may be a bit different on your system. Check the man page.)
If you want to delete front lines from the front and tail lines from the end, you'd have to use the following numbers:
1,{front}d;:a\nN;{front+2},{front+tail}ba\nP;D
(I put them in curly braces here, but that's just pseudocode. You'll have to replace them by the actual numbers. Also, it should work with {front+1}, but it doesn't on my machine (macOS 10.12.4). I think that's a bug.)
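For example, to delete the first 5 lines and the last 3 (front=5, tail=3), the range becomes 7,8:
sed -i '' -e $'1,5d;:a\nN;7,8ba\nP;D' file.txt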
I'll try to explain how the command works. Here's a human-readable version:
1,17d # delete lines 1 ... 17, goto start
:a # define label a
N # add next line from file to buffer, quit if at end of file
19,25ba # if line number is 19 ... 25, goto start (label a)
P # print first line in buffer
D # delete first line from buffer, go back to start
First we skip 17 lines. That's easy. The rest is tricky, but basically we keep a buffer of eight lines. We only start printing lines when the buffer is full, but we stop printing when we reach the end of the file, so at the end, there are still eight lines left in the buffer that we didn't print - in other words, we deleted them.
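A quick way to check the numbers on a throwaway file (BSD/macOS sed, as above): 30 input lines minus the first 17 and the last 8 should leave lines 18 through 22.
$ seq 1 30 > test.txt
$ sed -i '' -e $'1,17d;:a\nN;19,25ba\nP;D' test.txt
$ cat test.txt
18
19
20
21
22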
