Using the command line to combine non-adjacent sections of a file - linux

Is it possible to concatenate the header lines of a file with the output of a grep filter? Perhaps using the cat command or something else from GNU coreutils?
In particular, I have a tab delimited file that roughly looks like the following:
var1 var2 var3
1 MT 500
30 CA 40000
10 NV 1240
40 TX 500
30 UT 35000
10 AZ 1405
35 CO 500
15 UT 9000
1 NV 1505
30 CA 40000
10 NV 1240
I would like to select from lines 2 - N all lines that contain "CA" using grep and also to place the first row, the variable names, in the first line of the output file using GNU/Linux commands.
The desired output for the example would be:
var1 var2 var3
30 CA 40000
35 CA 65000
15 CA 2500
I can select the two sets of desired output with the following lines of code.
head -1 filename
grep -E CA filename
My initial idea is to combine the output of these commands using cat, but I have not been successful so far.

If you're running the commands from a shell (including shell scripts), you can run each command separately and redirect the output:
head -1 filename > outputfile
grep -E CA filename >> outputfile
The first line will overwrite outputfile, because a single > was used. The second line will append to outputfile, because >> was used.
If you want to do this in a single command, the following worked in bash:
(head -1 filename && grep -E CA filename) > outputfile
If you want the output to go to standard output, leave off the parentheses and the redirection:
head -1 filename && grep -E CA filename
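The question asks to filter only lines 2 through N. A minimal sketch of that variant follows; it assumes a small made-up sample file and uses grep -w (whole-word match, supported by GNU and BSD grep) so the header can never be matched and printed twice. The temporary directory and the sample rows are assumptions for illustration only.

```shell
# Hypothetical sample data: tab-delimited, header on line 1.
tmpdir=$(mktemp -d)
printf 'var1\tvar2\tvar3\n1\tMT\t500\n30\tCA\t40000\n10\tNV\t1240\n' > "$tmpdir/filename"

# Print the header, then search only lines 2..N for the whole word CA.
{
  head -n 1 "$tmpdir/filename"
  tail -n +2 "$tmpdir/filename" | grep -w CA
} > "$tmpdir/outputfile"

cat "$tmpdir/outputfile"
```

The braces group both commands so a single redirection captures the output of each, in order.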

It's not clear what you're looking for, but perhaps just:
{ head -1 filename; grep -E CA filename; } > output
or
awk 'NR==1 || /CA/' filename > output
But another interpretation of your question is best addressed using sed or awk.
For example, to print lines 5-9 and line 14, you can do:
sed -n -e 5,9p -e 14p
or
awk '(NR >=5 && NR <=9) || NR==14'
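As a quick check of the line-range approach, here is a small sketch that builds a 15-line sample file (a made-up file, not from the question), runs both commands on it, and confirms they select the same lines.

```shell
# Hypothetical 15-line input: "line 1" .. "line 15".
tmpdir=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 15; i++) print "line " i }' > "$tmpdir/filename"

# Select lines 5-9 and line 14 with sed and with awk.
sed -n -e '5,9p' -e '14p' "$tmpdir/filename" > "$tmpdir/by_sed"
awk '(NR >= 5 && NR <= 9) || NR == 14' "$tmpdir/filename" > "$tmpdir/by_awk"

# cmp exits 0 when both files are identical; then show the result.
cmp "$tmpdir/by_sed" "$tmpdir/by_awk" && cat "$tmpdir/by_sed"
```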

I just came across a method that uses the cat command.
cat <(head -1 filename) <(grep -E CA filename) > outputfile
This site, tldp.org, calls the <(command) syntax "process substitution."
It is unclear to me what method would be more efficient in terms of memory / speed, but this is testable.

Related

Using grep -m to save X amount of lines into new zipped file

I have a file that has this pattern (the following text is equivalent to 1 sequence):
#A00479:60:HL5HKDSXX:1:1101:1759:1000 1:N:0:CAGCGTTA
TGAGCCACAGACCCTGGATCCCTCCCTGAGGTCCCATGGGACGGGCAGGCTGGGCATACCTGCAGAGAAGATGTGGCCAGCCACGGCCAGGAACGCATCGGTCACCACAGGCTCAGACTGCAGGGAGATGTGCAGCTGACGCGCCACGTTG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
I'd like to use grep to "pick" the first 100 sequences that have the pattern "#" and save them to a new zipped file.
I was trying something like this:
gzip | grep -m 10 # test_seq_R1.fasta | cat test_seq_R1.fasta > test_seq_R1_zipped
But it is basically returning the same content from the original file test_seq_R1.fasta.
How can I choose the first 100 sequences that start with the # pattern and zip them to a new file using grep and gzip?
Thank you
Suggesting an awk script (it prints the first 100 lines that start with #):
awk 'count < 100 && /^#/ {++count; print}' input.txt
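If whole records (header plus the lines that follow it) should end up in a gzip file, one possible sketch is below. It assumes that only header lines start with #, which may not hold if a quality line happens to begin with that character. The sample records, the limit variable and the output name first_records.gz are made up for illustration.

```shell
tmpdir=$(mktemp -d)

# Hypothetical input: 3 four-line records whose header lines start with '#'.
for n in 1 2 3; do
  printf '#read%s\nACGT\n+\nFFFF\n' "$n"
done > "$tmpdir/test_seq_R1.fasta"

limit=2   # would be 100 for the case in the question

# Stop at header number limit+1; every line before it is printed and compressed.
awk -v limit="$limit" '/^#/ && ++count > limit { exit } { print }' \
  "$tmpdir/test_seq_R1.fasta" | gzip > "$tmpdir/first_records.gz"

# Inspect the compressed result.
gzip -dc "$tmpdir/first_records.gz"
```

Because awk exits at the first unwanted header, the rest of a large input file is never read.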

How do I insert text to the 1st line of a file using sed?

Hi, I'm trying to add text to the first line of a file using sed.
So far I've tried:
#!/bin/bash
touch test
sed -i -e '1i/etc/example/live/example.com/fullchain.pem;\' test
This doesn't work.
I also tried:
#!/bin/bash
touch test
sed -i "1i ssl_certificate /etc/example/live/example.com/fullchain.pem;" test
This doesn't seem to work either.
Oddly, when I try:
#!/bin/bash
touch test
echo "ssl_certificate /etc/example/live/example.com/fullchain.pem;" > test
I get the first line of text to appear when I use cat test,
but as soon as I run sed -i "2i ssl_certificate_key /etc/example/live/example.com/privkey.pem;"
I can't see the text that should be on line 2, namely ssl_certificate_key /etc/example/live/example.com/privkey.pem;
So, to summarise my question:
Can text be inserted into the first line of a newly created file using sed?
If yes, what's the best way of inserting text after the first line of text?
Suppose you have a file like this:
one
two
Then to append to the first line:
$ sed '1 s_$_/etc/example/live/example.com/fullchain.pem;_' file
one/etc/example/live/example.com/fullchain.pem;
two
To insert before the first line:
$ sed '1 i /etc/example/live/example.com/fullchain.pem;' file
/etc/example/live/example.com/fullchain.pem;
one
two
Or, to append after the first line:
$ sed '1 a /etc/example/live/example.com/fullchain.pem;' file
one
/etc/example/live/example.com/fullchain.pem;
two
Note the number 1 in those sed expressions: that's called the address in sed terminology. It tells sed which line the command that follows operates on.
If your file doesn't contain the line you're addressing, the sed command won't get executed. That's why you can't insert/append on line 1 if your file is empty.
Instead of using the stream editor, to append (to empty files), just use a shell redirection >>:
echo "content" >> file
Your problem stems from the fact that sed cannot locate the line you're telling it to write at, for example:
touch test
sed -i -e '1i/etc/example/live/example.com/fullchain.pem;\' test
attempts to insert at line 1 of test, but that line doesn't exist at this point. If you had created your file as:
echo -en "\n" > test
sed -i '1i/etc/example/live/example.com/fullchain.pem;\' test
it would not complain, but you would end up with an extra line. Similarly, when you call:
sed -i "2i ssl_certificate_key /etc/example/live/example.com/privkey.pem;"
you're telling sed to insert the data at line 2, which doesn't exist at that point, so sed doesn't get to edit the file.
So, for the initial line or the last line in the file, you should not use sed because simple > and >> stream redirects are more than enough.
Your command will work if you make sure the input file has at least one line:
[ "$(wc -l < test)" -gt 0 ] || printf '\n' >> test
sed -i -e '1 i/etc/example/live/example.com/fullchain.pem;\' test
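An alternative that sidesteps the empty-file problem entirely is to rebuild the file through a temporary copy. The sketch below is only one way to do it; prepend_line is a hypothetical helper name, and the temporary file is simply the target name with .tmp appended.

```shell
tmpdir=$(mktemp -d)
target="$tmpdir/test"
: > "$target"   # create an empty file, like touch on a new name

# Hypothetical helper: write the new line first, then the old content.
# $1 = text to insert, $2 = file to modify
prepend_line() {
  { printf '%s\n' "$1"; cat "$2"; } > "$2.tmp" && mv "$2.tmp" "$2"
}

prepend_line 'ssl_certificate_key /etc/example/live/example.com/privkey.pem;' "$target"
prepend_line 'ssl_certificate /etc/example/live/example.com/fullchain.pem;' "$target"

cat "$target"
```

This works whether the file is empty or not, and it does not rely on the GNU-only one-line form of the sed i command.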
To insert text at the first line and put the rest on a new line using sed on macOS, this worked for me:
sed -i '' '1 i \
Insert
' ~/Downloads/File-path.txt
First and Last
I would assume that anyone who searched for how to insert/append text to the beginning/end of a file probably also needs to know how to do the other also.
cal | \
gsed -E \
-e '1i\{' \
-e '1i\ "lines": [' \
-e 's/(.*)/ "\1",/' \
-e '$s/,$//' \
-e '$a\ ]' \
-e '$a\}'
Explanation
This is cal output piped to gnu-sed (called gsed on macOS installed via brew.sh) with extended RegEx (-E) and 6 "scripts" applied (-e) and line breaks escaped with \ for readability. Scripts 1 & 2 use 1i\ to "at line 1, insert". Scripts 5 & 6 use $a\ to "at line <last>, append". I vertically aligned the text outputs to make the code represent what is expected in the result. Scripts 3 & 4 do substitutions (the latter applying only to "line <last>"). The result is converting command output to valid JSON.
output
{
"lines": [
" October 2019 ",
"Su Mo Tu We Th Fr Sa ",
" 1 2 3 4 5 ",
" 6 7 8 9 10 11 12 ",
"13 14 15 16 17 18 19 ",
"20 21 22 23 24 25 26 ",
"27 28 29 30 31 ",
" "
]
}
For help getting this to work with the macos/BSD version of sed, see my answer here.

Process large amount of data using bash

I've got to process a large number of txt files in a folder using bash scripting.
Each file contains millions of rows, formatted like this:
File #1:
en ample_1 200
it example_3 24
ar example_5 500
fr.b example_4 570
fr.c example_2 39
en.n bample_6 10
File #2:
de example_3 4
uk.n example_5 50
de.n example_4 70
uk example_2 9
en ample_1 79
en.n bample_6 1
...
I've got to filter by "en" or "en.n", find duplicate occurrences in the second column, sum the third column and get a sorted file like this:
en ample_1 279
en.n bample_6 11
Here is my script:
#! /bin/bash
clear
BASEPATH=<base_path>
FILES=<folder_with_files>
TEMP_UNZIPPED="tmp"
FINAL_RES="pg-1"
#iterate each file in folder and apply grep
INDEX=0
DATE=$(date "+DATE: %d/%m/%y - TIME: %H:%M:%S")
echo "$DATE" > log
for i in ${BASEPATH}${FILES}
do
FILENAME="${i%.*}"
if [ $INDEX = 0 ]; then
VAR=$(gunzip $i)
#-e -> multiple condition; -w exact word; -r grep recursively; -h remove file path
FILTER_EN=$(grep -e '^en.n\|^en ' $FILENAME > $FINAL_RES)
INDEX=1
#remove file to free space
rm $FILENAME
else
VAR=$(gunzip $i)
FILTER_EN=$(grep -e '^en.n\|^en ' $FILENAME > $TEMP_UNZIPPED)
cat $TEMP_UNZIPPED >> $FINAL_RES
#AWK BLOCK
#create array a indexed with page title and adding frequency parameter as value.
#eg. a['ciao']=2 -> the second time I find "ciao", I sum previous value 2 with the new. This is why i use "+=" operator
#for each element in array I print i=page_title and array content such as frequency
PARSING=$(awk '{ page_title=$1" "$2;
frequency=$3;
array[page_title]+=frequency
}END{
for (i in array){
print i,array[i] | "sort -k2,2"
}
}' $FINAL_RES)
echo "$PARSING" > $FINAL_RES
#END AWK BLOCK
rm $FILENAME
rm $TEMP_UNZIPPED
fi
done
mv $FINAL_RES $BASEPATH/06/01/
DATE=$(date "+DATE: %d/%m/%y - TIME: %H:%M:%S")
echo "$DATE" >> log
Everything works, but it takes a very long time to execute. Does anyone know how to get the same result in less time and with fewer lines of code?
The UNIX shell is an environment from which to manipulate files and processes and sequence calls to tools. The UNIX tool which shell calls to manipulate text is awk so just use it:
$ awk '$1~/^en(\.n)?$/{tot[$1" "$2]+=$3} END{for (key in tot) print key, tot[key]}' file | sort
en ample_1 279
en.n bample_6 11
Your script has too many issues to comment on, which suggests you are a beginner at shell programming. Get the books Bash Shell Scripting Recipes by Chris Johnson and Effective Awk Programming, 4th Edition, by Arnold Robbins.
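Since the files in the question are gzipped, the same awk program can read them as one stream, without writing uncompressed copies to disk. The following is a rough sketch under that assumption; the file names file1.gz and file2.gz and their contents are made up for the example.

```shell
tmpdir=$(mktemp -d)

# Hypothetical compressed inputs in the format from the question.
printf 'en ample_1 200\nit example_3 24\nen.n bample_6 10\n' | gzip > "$tmpdir/file1.gz"
printf 'de example_3 4\nen ample_1 79\nen.n bample_6 1\n' | gzip > "$tmpdir/file2.gz"

# Decompress all files to stdout, sum column 3 per (column 1, column 2) pair
# for rows whose first field is exactly "en" or "en.n", then sort the totals.
result=$(gzip -dc "$tmpdir"/*.gz |
  awk '$1 ~ /^en(\.n)?$/ { tot[$1 " " $2] += $3 }
       END { for (key in tot) print key, tot[key] }' |
  sort)

printf '%s\n' "$result"
```

A single pass like this avoids re-aggregating the growing result file after every input file, which is where the original loop likely spends much of its time.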

how to edit a line with an extension in one file from the data of another file in linux

In my text file I have the following lines.
input.k
has
2684717 -194.7050476 64.2345581 150.6500092 0 0
2684718 -213.1575623 62.7032242 150.6500092 0 0
*INCLUDE
$# filename
./meshes/exportneu/147.k
*END
and
mesh.k
has
100
I want to replace the 147.k in input.k with another number from another file, which is 100 in mesh.k.
Required output
2684717 -194.7050476 64.2345581 150.6500092 0 0
2684718 -213.1575623 62.7032242 150.6500092 0 0
*INCLUDE
$# filename
../meshes/exportneu/100.k
*END
I used
sed '/\<meshes\>/!d;=;s/.* ([^ ]\+).*/\1/;R mesh.k' input.k |
sed 'N;N;s|\n|s/|;s|\n|/|;s|$|/|;q' >temp.sed
sed -i -f temp.sed input.k
The point is that I want to replace 147.k with 100.k, where 100 is read from the other file, mesh.k. That file contains only the number, which is 100 here but could be any other 3-digit number.
I know it should be possible to find the line containing the word meshes, split it at the last /, and pipe in the data from the other file, but I am not able to formulate the sed or awk command.
Regards
You can try something like this.
awk 'NR==FNR{a[++i]=$1; next} {{sub(/[0-9]+/,a[++j]); print}}' f2 f1
However, please note that the substitutions from the other file need to be in the same order as the input lines that need them:
$ cat f1
../meshes/exportneu/147.k
../secondline/exportneu/10.k
$ cat f2
100
40
$ awk 'NR==FNR{a[++i]=$1; next} {{sub(/[0-9]+/,a[++j]); print}}' f2 f1
../meshes/exportneu/100.k
../secondline/exportneu/40.k
You can improve upon your substitution inside awk as per your file. This is just to get you in the right direction.
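For the input.k layout in the question, where other lines also begin with numbers, the substitution probably needs to be limited to the path line. A sketch of one way to do that, assuming mesh.k holds the number as the first field of its first line and the path line contains the word meshes (output.k is a made-up name):

```shell
tmpdir=$(mktemp -d)

# Hypothetical copies of the two files from the question.
cat > "$tmpdir/input.k" <<'EOF'
2684717 -194.7050476 64.2345581 150.6500092 0 0
*INCLUDE
$# filename
./meshes/exportneu/147.k
*END
EOF
printf '100\n' > "$tmpdir/mesh.k"

# Read the replacement number from mesh.k.
num=$(awk 'NR == 1 { print $1; exit }' "$tmpdir/mesh.k")

# Only on lines containing "meshes", replace the digits just before ".k" at end of line.
sed "/meshes/ s|[0-9][0-9]*\.k\$|${num}.k|" "$tmpdir/input.k" > "$tmpdir/output.k"

cat "$tmpdir/output.k"
```

The /meshes/ address keeps the numeric data rows untouched, and anchoring on .k at the end of the line leaves digits elsewhere in the path alone.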

grep: show lines surrounding each match

How do I grep and show the preceding and following 5 lines surrounding each matched line?
For BSD or GNU grep you can use -B num to set how many lines before the match and -A num for the number of lines after the match.
grep -B 3 -A 2 foo README.txt
If you want the same number of lines before and after you can use -C num.
grep -C 3 foo README.txt
This will show 3 lines before and 3 lines after.
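A small sketch of what the context output looks like when matches are far apart. The ten-line numeric input is made up, and the expected output in the comments reflects GNU grep, which prints a -- line between non-adjacent groups.

```shell
# Hypothetical input: the numbers 1..10, one per line.
# -x matches whole lines only, so "2" does not also match "12" etc.
result=$(printf '%s\n' 1 2 3 4 5 6 7 8 9 10 | grep -C 1 -x -e 2 -e 9)

printf '%s\n' "$result"
# With GNU grep this prints:
# 1
# 2
# 3
# --
# 8
# 9
# 10
```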
-A and -B will work, as will -C n (for n lines of context), or just -n where n is the number itself, e.g. -5 (not the literal -n flag, which prints line numbers).
ack works with similar arguments as grep, and accepts -C. But it's usually better for searching through code.
grep astring myfile -A 5 -B 5
That will grep "myfile" for "astring", and show 5 lines before and after each match
ripgrep
If you care about the performance, use ripgrep which has similar syntax to grep, e.g.
rg -C5 "pattern" .
-C, --context NUM - Show NUM lines before and after each match.
There are also parameters such as -A/--after-context and -B/--before-context.
The tool is built on top of Rust's regex engine, which makes it very efficient on large data.
I normally use
grep searchstring file -C n # n for number of lines of context up and down
Many tools like grep also have really good man pages. I find myself referring to grep's man page a lot because there is so much you can do with it.
man grep
Many GNU tools also have an info page that may have more useful information in addition to the man page.
info grep
Use grep
$ grep --help | grep -i context
Context control:
-B, --before-context=NUM print NUM lines of leading context
-A, --after-context=NUM print NUM lines of trailing context
-C, --context=NUM print NUM lines of output context
-NUM same as --context=NUM
If you search code often, ag (The Silver Searcher) is much more efficient (i.e. faster) than grep.
You show context lines by using the -C option.
Eg:
ag -C 3 "foo" myFile
line 1
line 2
line 3
line that has "foo"
line 5
line 6
line 7
Search for "17655" in /some/file.txt showing 10 lines context before and after (using Awk), output preceded with line number followed by a colon. Use this on Solaris when grep does not support the -[ACB] options.
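awk -v n=10 '
/17655/ {
    # print buffered lines (oldest first) that were not printed already
    for (i = NR - n; i < NR; i++) {
        if (i > last) print i ":" buf[i % n]
    }
    print NR ":" $0
    last = NR
    after = n
}
!/17655/ && after > 0 {
    # trailing context after the most recent match
    print NR ":" $0
    last = NR
    after--
}
{
    # ring buffer holding the previous n lines
    buf[NR % n] = $0
}' /some/file.txt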
Let's understand using an example.
We can use grep with options:
-A 5 # this will give you 5 lines after searched string.
-B 5 # this will give you 5 lines before searched string.
-C 5 # this will give you 5 lines before & after searched string
Example.
File.txt contains 6 lines and following are the operations.
[abc#xyz]~/% cat file.txt # print all file data
this is first line
this is 2nd line
this is 3rd line
this is 4th line
this is 5th line
this is 6th line
[abc#xyz]~% grep "3rd" file.txt # we are searching for keyword '3rd' in the file
this is 3rd line
[abc#xyz]~% grep -A 2 "3rd" file.txt # print 2 lines after finding the searched string
this is 3rd line
this is 4th line
this is 5th line
[abc#xyz]~% grep -B 2 "3rd" file.txt # Print 2 lines before the search string.
this is first line
this is 2nd line
this is 3rd line
[abc#xyz]~% grep -C 2 "3rd" file.txt # print 2 line before and 2 line after the searched string
this is first line
this is 2nd line
this is 3rd line
this is 4th line
this is 5th line
Trick to remember options:
-A  → A means "after"
-B  → B means "before"
-C  → C means "context" (both before and after)
I do it the compact way:
grep -5 string file
That is the equivalent of:
grep -A 5 -B 5 string file
Here is the @Ygor solution in awk
awk 'c-->0;$0~s{if(b)for(c=b+1;c>1;c--)print r[(NR-c+1)%b];print;c=a}b{r[NR%b]=$0}' b=3 a=3 s="pattern" myfile
Note: Set the b and a variables to the number of lines before and after the match, respectively.
It's especially useful for systems that don't support grep's -A, -B and -C parameters.
grep has a group of options called context line control; you can use --context (short form -C) from it, simply:
| grep -C 5
or
| grep -5
Should do the trick
$ grep thestring thefile -5
-5 gets you 5 lines above and below each match of 'thestring'; it is equivalent to -C 5 or -A 5 -B 5.

Resources