I have thousands of imported text files that contain a piece of text I would like to remove.
It is not just a fixed block of text but a pattern:
<!--
# Translator(s):
#
# username1 <email1>
# username2 <email2>
# usernameN <emailN>
#
-->
If the block appears, it will list one or more users with their email addresses.
I have a small awk program that accomplishes the task in a few lines of code. It can be used to remove patterns of text from a file; the start as well as the stop regexp can be set.
# This block is a range pattern: it matches all lines between (and including)
# the start '<!--' and the end '-->'. Each line in the range arrives in $0.
# Run it with: awk -f remove_email.awk yourfile
# The if statement is not needed to accomplish the task, but may be useful:
# if uncommented, it prints "Found an email..." whenever a line in the
# range contains a '#'.
# The 'next' command discards the current record, reads the next one, and
# restarts the rule cycle from the top of the program.
/<!--/, /-->/ {
#if( $0 ~ /#/ ){
# print "Found an email and removed that!"
#}
next
}
# This rule prints every line to standard output - unless the line was
# already consumed (and skipped) by the range pattern above.
1 {
print
}
Save the code in 'remove_email.awk' and run it with:
awk -f remove_email.awk yourfile
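Since there are thousands of files, you will probably want to batch this. A minimal sketch, assuming GNU awk 4.1+ for its in-place extension (the *.txt glob is a placeholder for your files):
gawk -i inplace -f remove_email.awk *.txt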
This sed solution might work:
sed '/^<!--/,/^-->/{/^<!--/{h;d};H;/^-->/{x;/^<!--\n# Translator(s):\n#\(\n# [^<]*<email[0-9]\+>\)\+\n#\n-->$/!p};d}' file
An alternative (perhaps better?) solution:
sed '/^<!--/{:a;N;/^-->/M!ba;/^<!--\n# Translator(s):\n#\(\n# \w\+ <[^>]\+>\)\+\n#\n-->/d}' file
This gathers up the lines that start with <!-- and end with -->, then pattern-matches on the collection: the second line is # Translator(s):, the third line is #, the fourth and perhaps more lines have the form # username <email address>, the penultimate line is #, and the last line is -->. If a match is made, the entire collection is deleted; otherwise it is printed as normal.
For this task you need look-ahead, which is normally done with a parser.
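If it helps, here is the same alternative spread out as a commented sed script (a sketch for GNU sed; save it as, say, remove_block.sed and run sed -f remove_block.sed file):
# collect lines from '<!--' up to a line starting with '-->'
/^<!--/ {
:a
N
/^-->/M!ba
# delete the collection only if it is exactly a translator block
/^<!--\n# Translator(s):\n#\(\n# \w\+ <[^>]\+>\)\+\n#\n-->/d
}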
Another, though not very efficient, solution would be:
sed "s/-->/&\n/;s/<!--/\n&/" file | awk 'BEGIN {RS = "";FS = "\n"} !/username/ {print}'
The sed pass isolates each comment block as a blank-line-separated paragraph, and awk then prints only the paragraphs that do not contain username.
This Perl one-liner reads each file in paragraph mode (-00), deletes the translator blocks in place, and keeps a backup of each original with the .orig suffix:
perl -i.orig -00 -pe 's/<!--\s+#\s*Translator.*?\s-->//gs' file1 file2 file3
Here is my solution, if I understood your problem correctly. Save the following to a file called remove_blocks.awk:
# See the beginning of the block, mark it
/<!--/ {
state = "block_started"
}
# At the end of the block, if the block does not contain email, print
# out the whole block.
/^-->/ {
if (!block_contains_user_email) {
for (i = 0; i < count; i++) {
print saved_line[i];
}
print
}
count = 0
block_contains_user_email = 0
state = ""
next
}
# Inside a block: save the lines and wait until the end of the block
# to decide if we should print it out
state == "block_started" {
saved_line[count++] = $0
if (NF>=3 && $3 ~ /^<.*>$/) {  # third field looks like "<email>" in the template
block_contains_user_email = 1
}
next
}
# For everything else, print the line
1
Assume that your text file is in data.txt (or many files, for that matter):
awk -f remove_blocks.awk data.txt
The above command will print out everything in the text file, minus the blocks which contain user email.
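Since there are thousands of text files, a small wrapper loop may help; this is a sketch using a temporary file (the *.txt glob is an assumption about your file names):
for f in *.txt; do
    awk -f remove_blocks.awk "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done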
I am looking for a way to filter a large (~12 GB) largefile.txt, which has long strings in each line, for each of the words (one per line) in queryfile.txt. Instead of outputting/saving the whole line that each query word is found in, I'd like to save only that query word and a second word which I only know the start of (e.g. "ABC") and which I know for certain is in the same line the first word was found in.
For example, if queryfile.txt has the words:
this
next
And largefile.txt has the lines:
this is the first line with an ABCword # contents of first line will be saved
and there is an ABCword2 in this one as well # contents of 2nd line will be saved
and the next line has an ABCword2 too # contents of this line will be saved as well
third line has an ABCword3 # contents of this line won't
(Notice that the largefile.txt always has a word starting with ABC included in every line. It's also impossible for one of the query words to start with "ABC")
The save file should look similar to:
this ABCword
this ABCword2
next ABCword2
So far I've looked into other similar posts' suggestions, namely combining grep and awk, with commands similar to:
LC_ALL=C grep -f queryfile.txt largefile.txt | awk -F"," '$2~/ABC/' > results.txt
The problem is that not only is the query word not being saved, but -F"," '$2~/ABC/' is not the right way to fetch the word beginning with 'ABC' either: the lines contain no commas, so with -F"," the whole line lands in $1 and $2 is empty.
I also found ways of using awk alone, but still haven't managed to adapt the code to save only the two words instead of the whole line:
awk 'FNR==NR{A[$1]=$1;next} ($1 in A){print}' queryfile.txt largefile.txt > results.txt
2nd attempt based on updated sample input/output in question:
$ cat tst.awk
FNR==NR { words[$1]; next }    # first file: collect the query words
{
queryWord = otherWord = ""
for (i=1; i<=NF; i++) {
if ( $i in words ) {
queryWord = $i
}
else if ( $i ~ /^ABC/ ) {
otherWord = $i
}
}
if ( (queryWord != "") && (otherWord != "") ) {
print queryWord, otherWord
}
}
$ awk -f tst.awk queryfile.txt largefile.txt
this ABCword
next ABCword2
Original answer:
This MAY be what you're trying to do (untested):
awk '
FNR==NR { word2lgth[$1] = length($1); next }
($1 in word2lgth) && (match(substr($0,word2lgth[$1]+1),/ ABC[[:alnum:]_]+/) ) {
print substr($0,1,word2lgth[$1]+1+RSTART+RLENGTH)
}
' queryfile.txt largefile.txt > results.txt
Given:
cat large_file
this is the first line with an ABCword
and the next line has an ABCword2 too CRABCAKE
third line has an ABCword3
ABCword4 and this is behind
cat query_file
this
next
(The comments on each line of large_file were removed; otherwise the ABCword3 line would print too, since its comment contains 'this'.)
You can actually do this entirely with GNU sed and tr manipulation of the query file:
pat=$(gsed -E 's/^(.+)$/\\b\1\\b/' query_file | tr '\n' '|' | gsed 's/|$//')
gsed -nE "s/.*(${pat}).*(\<ABC[a-zA-Z0-9]*).*/\1 \2/p; s/.*(\<ABC[a-zA-Z0-9]*).*(${pat}).*/\1 \2/p" large_file
Prints:
this ABCword
next ABCword2
ABCword4 this
This one assumes your queryfile has more entries than there are words on a line in the largefile. Also, it does not treat your comments as comments but processes them as regular data; therefore, if cut-and-pasted, the third record is a match too.
$ awk '
NR==FNR { # process queryfile
a[$0] # hash those query words
next
}
{ # process largefile
for(i=1;i<=NF && !(f1 && f2);i++) # iterate until both words found
if(!f1 && ($i in a)) # f1 holds the matching query word
f1=$i
else if(!f2 && ($i~/^ABC/)) # f2 holds the ABC starting word
f2=$i
if(f1 && f2) # if both were found
print f1,f2 # output them
f1=f2=""
}' queryfile largefile
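For example, fed the question's sample queryfile.txt and largefile.txt with the trailing comments left in, it should print the three expected pairs plus one extra, because the last line's comment also contains the query word this:
this ABCword
this ABCword2
next ABCword2
this ABCword3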
Using sed in a while loop
$ cat queryfile.txt
this
next
$ cat largefile.txt
this is the first line with an ABCword # contents of this line will be saved
and the next line has an ABCword2 too # contents of this line will be saved as well
third line has an ABCword3 # contents of this line won't
$ while read -r line; do sed -n "s/.*\($line\).*\(ABC[^ ]*\).*/\1 \2/p" largefile.txt; done < queryfile.txt
this ABCword
next ABCword2
We are trying to update the configuration files of multiple servers, adding new entries at a specific location, before some specific lines. The example below contains part of the configuration file:
#
#
#
#
# The lines below should not be replaced in this file
# Contact Sysadmins before make changes to this line
I need to match these lines, including the "#" lines and newlines, and then add the new entries above them, as in this example:
New entry 1
New entry 2
#
#
#
#
# The lines below should not be replaced in this file
# Contact Sysadmins before make changes to this line
I tried this in Perl as a one-liner, as follows:
/usr/bin/perl -lne 'print "\nNew entry 1\nNew entry 2" if (/[#\n]*# The lines below should not be replaced in this file/); print $_' filename
My regex does not work. I am not an expert in Perl, or in regex in any other language. On many of the servers there may be 2 or 3 "#" lines instead of 4 before the marker line. Any help would be greatly appreciated; I have to update the same file on 2000+ servers.
You are processing the file one line at a time, so [#\n]*# won't ever match anything but #.
One solution involves telling Perl to treat the entire file as one line and thus reading the entire file into memory.
perl -0777pe's/^([#\n]*# The lines below)/New entry 1\nNew entry 2\n$1/mg'
The other would involve postponing the printing of lines consisting of just #.
perl -ne'
$buf .= $_;
next if /^#$/;
print "New entry 1\nNew entry 2\n" if /^# The lines below/;
print $buf;
$buf = "";
END { print $buf; }
'
Tested:
$ perl -0777pe's/^([#\n]*# The lines below)/New entry 1\nNew entry 2\n$1/mg' file
New entry 1
New entry 2
#
#
#
#
# The lines below should not be replaced in this file
# Contact Sysadmins before make changes to this line
$ perl -ne'
$buf .= $_;
next if /^#$/;
print "New entry 1\nNew entry 2\n" if /^# The lines below/;
print $buf;
$buf = "";
END { print $buf; }
' file
New entry 1
New entry 2
#
#
#
#
# The lines below should not be replaced in this file
# Contact Sysadmins before make changes to this line
(Other test cases pass too.)
You can use the sed command to substitute the new entry lines before the position where the # lines start.
sed '/#/ s/^/New Entry 1\n/' input.txt
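Note that this inserts before every line containing a #. If the entries must instead go above the whole block of # lines, the buffering idea from the Perl answer above can also be sketched in awk (the marker text is taken from the example):
awk '
# hold back bare "#" lines instead of printing them immediately
/^#$/ { buf = buf $0 ORS; next }
# on the marker line, emit the new entries before the held-back block
/^# The lines below/ { printf "New entry 1\nNew entry 2\n" }
# flush any held-back "#" lines, then print the current line
{ printf "%s%s\n", buf, $0; buf = "" }
END { printf "%s", buf }
' filename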
I have a fasta file which looks like this.
>header1
ATGC....
>header2
ATGC...
My list file looks like this:
organism1
organism2
and contains a list of organisms whose names I want to replace the headers with.
I tried a for loop with a sed command, as follows:
for i in `cat list7b`; do sed "s/^>/$i/g" sequence.fa; done
but it didn't work. Please tell me how I can achieve this task.
The result file should look like this:
>organism1
ATGC...
>organism2
ATGC....
that is, >header1 is replaced with >organism1 and so on.
Headers are distinguished from the ATGC lines by the greater-than sign: a header always starts with >, whereas a sequence line does not.
The header lines should be replaced in order of appearance, i.e. the first header is replaced with the first line from the list file, the second header with the second line, and so on.
I also ask that the logic be explained, if possible.
Thanks in advance.
With awk this is easy to do in one run.
Assuming your fasta file is named sequence.fa and your organisms list file is named list7b as in the question you can use
awk 'NR == FNR { o[n++] = $0; next } /^>/ && i < n { $0 = ">" o[i++] } 1' list7b sequence.fa > output.fa
Explanation:
NR == FNR is a condition for doing something with the first file only (the total number of records read so far equals the record number within the current file only while the first file is being read).
{ o[n++] = $0; next } puts the input line into array o, counts the entries and skips further processing of the input line, so o will contain all your organism lines.
The next part is executed for the remaining file(s).
/^>/ && i < n is valid for lines that start with > as long as i is less than the number of elements n that were put into array o.
{ $0 = ">" o[i++] } replaces the current line with > followed by the array element (i.e. a line from the first file) and increments the index i to the next element.
1 is an "always true" condition with the implicit default action { print } to print the current line for every input line.
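Applied to the sample files from the question, the command should produce:
>organism1
ATGC....
>organism2
ATGC...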
I'm new to bash and need help copying row 2 onwards from one file into a specific position (150 characters in) in another file. Looking through the forum, I've found a way to insert specific literal text at this position:
sed -i -E 's/^(.{150})/\1specifictextlisted/' destinationfile.txt
However, I can't seem to find a way to copy content from another file into this position.
Basically, I'm working with these 2 starting files and need the following output:
File 1 contents:
Sequence
AAAAAAAAAGGGGGGGGGGGCCCCCCCCCTTTTTTTTT
File 2 contents:
chr2
tccccagcccagccccggccccatccccagcccagcctatccccagcccagcctatccccagcccagccccggccccagccccagccccggccccagccccagccccggccccagccccggccccatccccggccccggccccatccccggccccggccccggccccggccccggccccatccccagcccagccccagccccatccccagcccagccccggcccagccccagcccagccccagccacagcccagccccggccccagccccggcccaggcccagcccca
Desired output contents:
chr2
tccccagcccagccccggccccatccccagcccagcctatccccagcccagcctatccccagcccagccccggccccagccccagccccggccccagccccagccccggccccagccccggccccatccccggccccggccccatccccgAAAAAAAAAGGGGGGGGGGGCCCCCCCCCTTTTTTTTTgccccggccccggccccggccccggccccatccccagcccagccccagccccatccccagcccagccccggcccagccccagcccagccccagccacagcccagccccggccccagccccggcccaggcccagcccca
Can anybody put me on the right track to achieving this?
If the file is really huge, instead of just 327 characters, you might want to use dd:
dd if=chr2 bs=1 count=150 status=none of=destinationfile.txt
tr -d '\n' < Sequence >> destinationfile.txt
dd if=chr2 bs=1 skip=150 seek=189 status=none of=destinationfile.txt
189 is 150+length of Sequence.
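For comparison, the same byte surgery can be sketched with head and tail, writing a new file instead of patching in place (same assumptions: the target text is in chr2, the sequence to insert is in Sequence):
{
  head -c 150 chr2        # first 150 bytes
  tr -d '\n' < Sequence   # the insert, newlines stripped
  tail -c +151 chr2       # the rest, from byte 151 onward
} > destinationfile.txt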
You can use awk for that:
awk 'NR==FNR{a=$2;next}{print $1, substr($2, 1, 149) a substr($2, 150)}' file1 file2
Explanation:
# Total row number == row number in file
# This is only true when processing file1
NR==FNR {
a=$2 # store column 2 in a variable 'a'
next # do not process the block below
}
# Because of the 'next' statement above, this
# block gets only executed for file2
{
# put 'a' in the middle of the second column and print it
print $1, substr($2, 1, 149) a substr($2, 150)
}
I assume that both files contain only a single line, like in your example.
Edit: In comments you said that the files actually span two lines each; in that case you can use the following awk script:
# usage: awk -f this_file.awk file1 file2
# True for the second line in each file
FNR==2 {
# Total line number equals line number in file
# This is only true while we are processing file1
if(NR==FNR) {
insert=$0 # Store the string to be inserted in a variable
} else {
# Insert the string in file1
# Assigning to $0 will modify the current line
$0 = substr($0, 1, 149) insert substr($0, 150)
}
}
# Print lines of file2 (line 2 has been modified above)
NR!=FNR
You can use bash and read one char at a time from the file:
i=1
while read -n 1 -r; do
echo -n "$REPLY"
let i++
if [ $i -eq 150 ]; then
echo -n "AAAAAAAAAGGGGGGGGGGGCCCCCCCCCTTTTTTTTT"
fi
done < chr2 > destinationfile.txt
This simply reads a char, echoes it, and increments the counter. If the counter is 150, it echoes your sequence. You can replace that echo with cat file | tr -d '\n'. Just make sure to remove any newlines, as done here with tr; that is also why echo -n is used, so it doesn't add any.
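A sketch of that variant, assuming the text to insert sits in a file named Sequence (a hypothetical name):
i=1
while read -n 1 -r; do
  echo -n "$REPLY"
  let i++
  if [ $i -eq 150 ]; then
    tr -d '\n' < Sequence   # insert the file contents without newlines
  fi
done < chr2 > destinationfile.txt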
Problem Statement:
I have a delimited text file offloaded from Teradata which happens to have "\n" (newline characters or EOL markers) inside data fields.
The same EOL marker also appears at the end of each complete line, terminating the record.
I need to split this file into two or more files (based on a number of records given by me) while retaining the newline chars inside the data fields and splitting only at the line breaks that end each record.
Example:
1|Alan
Wake|15
2|Nathan
Drake|10
3|Gordon
Freeman|11
Expectation:
file1.txt
1|Alan
Wake|15
2|Nathan
Drake|10
file2.txt
3|Gordon
Freeman|11
What I have tried:
awk 'BEGIN{RS="\n"}NR%2==1{x="SplitF"++i;}{print > x}' inputfile.txt
This code can't discern between the newlines inside data fields and the actual record-ending newlines. Is there a way this can be achieved?
EDIT: I have changed the problem statement and the example. Please share your thoughts on the new example.
Use the following awk approach:
awk '{ r=(r!="") ? r RS $0 : $0; if(NR%4==0){ print r > ("file" ++i ".txt"); r="" } }
END{ if(r) print r > ("file" ++i ".txt") }' inputfile.txt
NR%4==0 - each logical record occupies two physical lines, so with two logical records per output file we split after every 4 physical lines
Results:
> cat file1.txt
1|Alan
Wake|15
2|Nathan
Drake|10
> cat file2.txt
3|Gordon
Freeman|11
If you are using GNU awk you can do this by setting RS appropriately, e.g.:
parse.awk
BEGIN { RS="[0-9]\\|" }
# Skip the empty first record by checking NF (Note: this will also skip
# any empty records later in the input)
NF {
# Send record with the appropriate key to a numbered file
printf("%s", d $0) > "file" i ".txt"
}
# When we have found enough records, close the current file and
# increment i so the next record opens the next file
#
# Note: NR-1 because of the empty first record
(NR-1)%n == 0 {
close("file" i ".txt")
i++
}
# Remember the record key in d, again
# because of the empty first record
{ d=RT }
Run it like this:
gawk -f parse.awk n=2 infile
Where n is the number of records to put into each file.
Output:
file1.txt
1|Alan
Wake|15
2|Nathan
Drake|10
file2.txt
3|Gordon
Freeman|11