I have a column
1
1
1
2
2
2
I would like to insert a blank line when the value in the column changes:
1
1
1
<- blank line
2
2
2
I would recommend using awk:
awk -v i=1 'NR>1 && $i!=p { print "" }{ p=$i } 1' file
On any line after the first, if the value of the "i"th column differs from the previous value, print a blank line. Always set p to the current value. The 1 at the end evaluates to true, which makes awk print the line. Set i to the column number of your choice.
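A quick way to check the one-liner, with the sample data written to a throwaway file (the file name column.txt is assumed for illustration):

```shell
# Recreate the sample column and run the awk one-liner on it.
printf '1\n1\n1\n2\n2\n2\n' > column.txt
result=$(awk -v i=1 'NR>1 && $i!=p { print "" }{ p=$i } 1' column.txt)
printf '%s\n' "$result"
```

This prints the three 1s, a blank line, then the three 2s.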
while IFS= read -r L; do [[ "$L" != "$PL" && -n "$PL" ]] && echo; echo "$L"; PL="$L"; done < file
awk(1) seems like the obvious answer to this problem:
#!/usr/bin/awk -f
BEGIN { prev = "" }
/./ {
if (prev != "" && prev != $1) print ""
print
prev = $1
}
You can also do this with sed:
sed '{N;s/^\(.*\)\n\1$/\1\n\1/;tx;P;s/^.*\n/\n/;P;D;:x;P;D}'
The long version with explanations is:
sed '{
N # read second line; (terminate if there are no more lines)
s/^\(.*\)\n\1$/\1\n\1/ # try to replace two identical lines with themselves
tx # if replacement succeeded then goto label x
P # print the first line
s/^.*\n/\n/ # replace first line by empty line
P # print this empty line
D # delete empty line and proceed with input
:x # label x
P # print first line
D # delete first line and proceed with input
}'
One thing I like about using (GNU) sed (though it is not clear from your question whether this is useful to you) is that you can easily apply the changes in-place with the -i switch, e.g.
sed -i '{N;s/^\(.*\)\n\1$/\1\n\1/;tx;P;s/^.*\n/\n/;P;D;:x;P;D}' FILE
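A minimal in-place run (GNU sed and a scratch file name are assumed):

```shell
# Create a sample file, rewrite it in place, then read it back.
printf '1\n1\n2\n2\n' > data.txt
sed -i '{N;s/^\(.*\)\n\1$/\1\n\1/;tx;P;s/^.*\n/\n/;P;D;:x;P;D}' data.txt
result=$(cat data.txt)
printf '%s\n' "$result"
```

Afterwards data.txt contains the two 1s, a blank line, and the two 2s.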
You could use the getline function in Awk to match the current line against the following line (checking getline's return value so the last line isn't printed twice when the file has an odd number of lines):
awk '{f=$1; print; if ((getline) > 0) { if (f != $1) print ""; print }}' file
I am looking for a way to filter a (~12 Gb) largefile.txt with long strings in each line for each of the words (one per line) in a queryfile.txt. But afterwards, instead of outputting/saving the whole line that each query word is found in, I'd like to save only that query word and a second word which I only know the start of (e.g. "ABC") and that I know for certain is in the same line the first word was found in.
For example, if queryfile.txt has the words:
this
next
And largefile.txt has the lines:
this is the first line with an ABCword # contents of first line will be saved
and there is an ABCword2 in this one as well # contents of 2nd line will be saved
and the next line has an ABCword2 too # contents of this line will be saved as well
third line has an ABCword3 # contents of this line won't
(Notice that the largefile.txt always has a word starting with ABC included in every line. It's also impossible for one of the query words to start with "ABC")
The save file should look similar to:
this ABCword1
this ABCword2
next ABCword2
So far I've looked into other similar posts' suggestions, namely combining grep and awk, with commands similar to:
LC_ALL=C grep -f queryfile.txt largefile.txt | awk -F"," '$2~/ABC/' > results.txt
The problem is that not only is the query word not being saved, but -F"," '$2~/ABC/' doesn't seem to be the correct command for fetching words beginning with 'ABC' either.
I also found ways of only using awk, but still haven't managed to adapt the code to save the word #2 as well instead of the whole line:
awk 'FNR==NR{A[$1]=$1;next} ($1 in A){print}' queryfile.txt largefile.txt > results.txt
2nd attempt based on updated sample input/output in question:
$ cat tst.awk
FNR==NR { words[$1]; next }
{
queryWord = otherWord = ""
for (i=1; i<=NF; i++) {
if ( $i in words ) {
queryWord = $i
}
else if ( $i ~ /^ABC/ ) {
otherWord = $i
}
}
if ( (queryWord != "") && (otherWord != "") ) {
print queryWord, otherWord
}
}
$ awk -f tst.awk queryfile.txt largefile.txt
this ABCword
next ABCword2
Original answer:
This MAY be what you're trying to do (untested):
awk '
FNR==NR { word2lgth[$1] = length($1); next }
($1 in word2lgth) && (match(substr($0,word2lgth[$1]+1),/ ABC[[:alnum:]_]+/) ) {
print substr($0,1,word2lgth[$1]+1+RSTART+RLENGTH)
}
' queryfile.txt largefile.txt > results.txt
Given:
cat large_file
this is the first line with an ABCword
and the next line has an ABCword2 too CRABCAKE
third line has an ABCword3
ABCword4 and this is behind
cat query_file
this
next
(The comments you have on each line of large_file have been removed; otherwise ABCword3 prints too, since there is a 'this' in that comment.)
You can actually do this entirely with GNU sed and tr manipulation of the query file:
pat=$(gsed -E 's/^(.+)$/\\b\1\\b/' query_file | tr '\n' '|' | gsed 's/|$//')
gsed -nE "s/.*(${pat}).*(\<ABC[a-zA-Z0-9]*).*/\1 \2/p; s/.*(\<ABC[a-zA-Z0-9]*).*(${pat}).*/\1 \2/p" large_file
Prints:
this ABCword
next ABCword2
ABCword4 this
This one assumes your queryfile has more entries than there are words on a line in the largefile. Also, it does not treat your comments as comments but processes them as regular data, so if cut-and-pasted, the third record is a match too.
$ awk '
NR==FNR { # process queryfile
a[$0] # hash those query words
next
}
{ # process largefile
for(i=1;i<=NF && !(f1 && f2);i++) # iterate until both words found
if(!f1 && ($i in a)) # f1 holds the matching query word
f1=$i
else if(!f2 && ($i~/^ABC/)) # f2 holds the ABC starting word
f2=$i
if(f1 && f2) # if both were found
print f1,f2 # output them
f1=f2=""
}' queryfile largefile
Using sed in a while loop
$ cat queryfile.txt
this
next
$ cat largefile.txt
this is the first line with an ABCword # contents of this line will be saved
and the next line has an ABCword2 too # contents of this line will be saved as well
third line has an ABCword3 # contents of this line won't
$ while read -r line; do sed -n "s/.*\($line\).*\(ABC[^ ]*\).*/\1 \2/p" largefile.txt; done < queryfile.txt
this ABCword
next ABCword2
How can I get the information from specimen1 to specimen3 and paste it into another file 'DNA_combined.txt'?
I tried the cut command and the awk command, but I found that it is tricky to cut by paragraph(?) or sequence.
My trial was something like cut -d '>' -f 1-3 dna1.fasta > DNA_combined.txt
You can display the line number for each row in vim by pressing Esc + : and typing set nu
Once you can see the line number corresponding to each row:
Note down the line number of the line containing >Specimen1 (say X) and of the last line of Specimen3 (say Y)
Then, use sed command to get the text between two lines
sed -n 'X,Yp' dna1.fasta > DNA_combined.txt
Please let me know if you have any questions.
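For instance, if X turns out to be 1 and Y turns out to be 6 (line numbers and file contents assumed purely for illustration):

```shell
# A toy dna1.fasta; lines 1-6 cover Specimen1 through Specimen3 here.
printf '>Specimen1\nAAAA\n>Specimen2\nCCCC\n>Specimen3\nGGGG\n>Specimen4\nTTTT\n' > dna1.fasta
sed -n '1,6p' dna1.fasta > DNA_combined.txt
result=$(cat DNA_combined.txt)
printf '%s\n' "$result"
```

DNA_combined.txt then holds the first three records and not >Specimen4.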
If you want the first three sequences irrespective of the content after >, you can use this:
$ cat ip.txt
>one
ACGTA
TCGAAA
>two
TGACA
>three
ACTG
AAAAC
>four
ATGC
>five
GTA
$ awk '/^>/ && ++count==4{exit} 1' ip.txt
>one
ACGTA
TCGAAA
>two
TGACA
>three
ACTG
AAAAC
/^>/ matches the start of a sequence
for such sequences, increment the count variable
if count reaches 4, the exit command will terminate the script
1 is an idiomatic way to print the contents of the input record
Would you please try the following:
awk '
BEGIN {print ">Specimen1-3"} # print header
/^>Specimen/ {f = match($0, "^>Specimen[1-3]") ? 1 : 0; next}
# set the flag depending on the number
f # print if f == 1
' dna1.fasta > DNA_combined.txt
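A minimal run with made-up data (sample content assumed; note that the individual >Specimen headers are consumed by next, leaving only the >Specimen1-3 header):

```shell
# Specimen4 sits between 1 and 2 to show the flag being switched off.
printf '>Specimen1\nAAAA\n>Specimen4\nTTTT\n>Specimen2\nCCCC\n' > dna1.fasta
result=$(awk '
BEGIN {print ">Specimen1-3"}
/^>Specimen/ {f = match($0, "^>Specimen[1-3]") ? 1 : 0; next}
f
' dna1.fasta)
printf '%s\n' "$result"
```

Only the sequences of Specimen1 and Specimen2 survive; Specimen4's sequence is skipped.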
Hello, let's say I have a file such as:
$OUT some text
some text
some text
$OUT
$OUT
$OUT
how can I use sed to replace the 3 consecutive $OUT lines with "replace-thing"?
and get
$OUT some text
some text
some text
replace-thing
With sed:
sed -n '1h; 1!H; ${g; s/\$OUT\n\$OUT\n\$OUT/replace-thing/g; p;}' file
GNU sed does not require the semicolon after p.
With commentary
sed -n ' # without printing every line:
# next 2 lines read the entire file into memory
1h # line 1, store current line in the hold space
1!H # not line 1, append a newline and current line to hold space
# now do the search-and-replace on the file contents
${ # on the last line:
g # replace pattern space with contents of hold space
s/\$OUT\n\$OUT\n\$OUT/replace-thing/g # do replacement
p # and print the revised contents
}
' file
This is the main reason I only use sed for very simple things: once you start using the lesser-used commands, you need extensive commentary to understand the program.
Note that the commented version does not work with the BSD-derived sed on macOS -- the comments break it, but removing them makes it work.
In plain bash:
pattern=$'$OUT\n$OUT\n$OUT' # using ANSI-C quotes
contents=$(< file)
echo "${contents//$pattern/replace-thing}"
And the perl one-liner:
perl -0777 -pe 's/\$OUT(\n\$OUT){2}/replace-thing/g' file
For this particular task I recommend using awk instead (hope that's an option too).
Update: to replace every run of 3 consecutive $OUT lines, use (thanks to @thanasisp and @glenn jackman):
awk '
BEGIN {
    p = "$OUT"          # pattern to match
    n = 3               # N consecutive matches
    r = "replace-thing"
}
$0 == p {
    if (++i == n) {     # found n consecutive matching lines
        print(r)
        i = 0           # reset counter
    }
    next
}
$0 != p {
    while (i > 0) {     # flush a run that was shorter than n
        print(p); i--
    }
    print($0)
}
END {
    while (i > 0) { print(p); i-- }  # flush a trailing short run
}' input.txt
If you just want to replace the 3rd $OUT usage, use:
awk '
BEGIN {
    p = "\\$OUT"        # pattern to match
    n = 3               # Nth match
    r = "replace-thing"
}
$0 ~ p {
    if (++i == n) {     # this is the nth match: replace it
        print(r)
        next
    }
}
{ print($0) }' input.txt
This might work for you (GNU sed):
sed -E ':a;N;s/[^\n]*/&/3;Ta;/^(\$OUT\n?){3}$/d;P;D' file
Gather up 3 lines in the pattern space and, if each of those 3 lines is exactly $OUT, delete them. Otherwise, print and delete the first line and repeat.
I have the below task to achieve. I completed it using awk/sed/tail/grep together, but I believe it's doable using only awk, so I'm asking for your kind help:
What would be the awk syntax for the following?
Get last line from file A (csv format)
LAST=$(tail -n1 A)
Check if line from file A exist in file B (csv as well), if yes...
NO=$(grep -nw "$LAST" B | awk -F: '{print $1}')
Check if there are newer lines in file B, if yes...
BELOW=$(expr $NO + 1)
if awk "NR==$BELOW" B; then
Delete everything in file B from the 2nd row down to row $NO
sed -i "2,$NO d" B; fi
BIG THANKS for any help - appreciated!
Something like this might work:
awk 'FNR == NR { last = $0; next }
!newer && $0 == last { newer = 1; next }
newer || FNR == 1' A B
Basically FNR == NR { last = $0; next } sets last to each line as long as we are in the first file, so in the end it holds the first file's last line.
In the second file if we are not in the newer lines and the line equals last, the next line is newer than what was in the first file: !newer && $0 == last { newer = 1; next }.
And when we are either in the first line or the newer lines of the second file, it is printed: newer || FNR == 1.
This differs from the original in that it prints out the newer lines of B, instead of modifying B in place. Of course you can redirect the output to a temporary file in the shell and then move it over B if it contains more than one line. Or have Awk return an exit status and use that, e.g.,:
tmpf=`mktemp B'.XXXXXX'`
awk 'FNR == NR { last = $0; next }
!newer && $0 == last { newer = NR; next }
newer || FNR == 1 { print }
END { exit (!newer || NR == newer) }' \
A B >"$tmpf" && mv "$tmpf" B || rm -f "$tmpf"
Admittedly not entirely in Awk anymore, but I'd say close enough and better in practice.
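A quick demonstration with tiny CSV files (names and contents assumed): A's last line is b,2, so the output is B's header plus the rows B has beyond that line:

```shell
# B repeats everything in A and then has two newer rows.
printf 'id,val\na,1\nb,2\n' > A
printf 'id,val\na,1\nb,2\nc,3\nd,4\n' > B
result=$(awk 'FNR == NR { last = $0; next }
!newer && $0 == last { newer = 1; next }
newer || FNR == 1' A B)
printf '%s\n' "$result"
```

Only the header and the two rows newer than b,2 come out.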
Suppose I have a file with this structure:
1001.txt
1002.txt
1003.txt
1004.txt
1005.txt
2001.txt
2002.txt
2003.txt
...
Now how can I delete the first 10 lines that start with '2'? There might be more than 10 lines starting with '2'.
I know I can use grep '^2' file | wc -l to count the lines that start with '2'. But how do I delete the first 10 of them?
You can pipe your list through this Perl one-liner:
perl -p -e '$_ = "" if /^2/ and $i++ < 10'
Another in awk. Testing with a value of 2, as your data only had 3 related lines; replace that 2 with a 10 for the real case.
$ awk '/^2/ && ++c<=2 {next} 1' file
1001.txt
1002.txt
1003.txt
1004.txt
1005.txt
2003.txt
...
Explained:
$ awk '/^2/ && ++c<=2 { # if it starts with a 2 and counter still got iterations left
next # skip to the next record
} 1 # (else) output
' file
awk alternative:
awk '{ if (substr($0,1,1)=="2") { count++ } if ( count > 10 || substr($0,1,1)!="2") { print $0 } }' filename
If the first character of the line is 2, increment a counter. Then print the line only if the counter is greater than 10 or the first character isn't 2.
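The same logic with the threshold dropped to 2 so a short sample exercises both branches (threshold and file contents assumed):

```shell
# Delete the first 2 lines starting with '2'; later ones survive.
printf '1001.txt\n2001.txt\n2002.txt\n2003.txt\n' > files.txt
result=$(awk '{ if (substr($0,1,1)=="2") { count++ } if ( count > 2 || substr($0,1,1)!="2") { print $0 } }' files.txt)
printf '%s\n' "$result"
```

2001.txt and 2002.txt are dropped; 2003.txt is the third match and is kept.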