My file contains something like the below:
X-TM-AS-Product-Ver: IMSVA-8.2.0.1391-8.0.0.1202-22662.005
X-TM-AS-Result: No--0.364-7.0-31-10
X-imss-scan-details: No--0.364-7.0-31-10
X-TMASE-Version: IMSVA-8.2.0.1391-8.0.1202-22662.005
X-TMASE-Result: 10--0.363600-5.000000
X-TMASE-MatchedRID: 40jyuBT4FtykMGOaBzW2QbxygpRxo469FspPdEyOR1qJNv6smPBGj5g3
9Rgsjteo4vM1YF6AJbZcLc3sLtjOty5V0GTrwsKpl6V6bOpOzUAdzA5USlz33EYWGTXfmDJJ3Qf
wsVk0UbuGrPnef/I+eo9h73qb6JgVCR2fClyPE+EPh2lMKov3fdtvzshqXylpWZGeMhmJ7ScqBW
z6M5VHW/fngY5M/1HkzhvqqZL61o+ZdBoyruxjzQ==
This is my real text! I need to extract this line!
The existing code, written in the past by someone else, executes the below line:
cat $my_file | egrep -v "^(X-TM-AS)" | egrep -v "X-imss-scan-details"
supposedly to remove all those key value lines which start with "X-".
The above piece of code worked fine until today, because keys starting with X-TMASE had never appeared in the files before. They started to appear today, and as a result the code no longer extracts the useful data correctly.
Among the newly added keys, it seems to me that X-TMASE-MatchedRID is the one creating the headache for us, as it has a value which spans multiple lines:
X-TMASE-MatchedRID: 40jyuBT4FtykMGOaBzW2QbxygpRxo469FspPdEyOR1qJNv6smPBGj5g3
9Rgsjteo4vM1YF6AJbZcLc3sLtjOty5V0GTrwsKpl6V6bOpOzUAdzA5USlz33EYWGTXfmDJJ3Qf
wsVk0UbuGrPnef/I+eo9h73qb6JgVCR2fClyPE+EPh2lMKov3fdtvzshqXylpWZGeMhmJ7ScqBW
z6M5VHW/fngY5M/1HkzhvqqZL61o+ZdBoyruxjzQ==
Initially I tried the below:
cat $my_file | egrep -v "^(X-TM-AS)" | egrep -v "X-imss-scan-details" | egrep -v "^(X-TMASE-)"
But it didn't work. It didn't completely eliminate the value for X-TMASE-MatchedRID:
9Rgsjteo4vM1YF6AJbZcLc3sLtjOty5V0GTrwsKpl6V6bOpOzUAdzA5USlz33EYWGTXfmDJJ3Qf
wsVk0UbuGrPnef/I+eo9h73qb6JgVCR2fClyPE+EPh2lMKov3fdtvzshqXylpWZGeMhmJ7ScqBW
z6M5VHW/fngY5M/1HkzhvqqZL61o+ZdBoyruxjzQ==
This is my real text! I need to extract this line!
I wanted the output to be:
This is my real text! I need to extract this line!
That is, I don't want any metadata to be seen in the output.
Any idea how that can be achieved using egrep or any equivalent command?
If you just want to remove the first paragraph (everything up to and including the first empty line), another tool such as sed is a better fit:
sed '1,/^$/ d' "$my_file"
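As a minimal sketch of what that command does, assuming the header block is separated from the body by one empty line (the sample in the question does not show such a line, so this is an assumption), the following builds a small hypothetical file and strips its headers:

```shell
# Hypothetical sample: header block, one empty line, then the body.
my_file=$(mktemp)
cat > "$my_file" <<'EOF'
X-TM-AS-Result: No--0.364-7.0-31-10
X-TMASE-MatchedRID: 40jyuBT4FtykMGOaBzW2QbxygpRxo469FspPdEyOR1qJNv6smPBGj5g3
9Rgsjteo4vM1YF6AJbZcLc3sLtjOty5V0GTrwsKpl6V6bOpOzUAdzA5USlz33EYWGTXfmDJJ3Qf

This is my real text! I need to extract this line!
EOF

# Delete everything from line 1 through the first empty line; print the rest.
body=$(sed '1,/^$/d' "$my_file")
printf '%s\n' "$body"

rm -f "$my_file"
```

Because the whole block is dropped by position rather than by key name, multi-line values and newly introduced X- keys need no special handling.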
I'm trying to substitute headers in a file that looks like this:
NC_037638.1 Apis mellifera strain DH4 linkage group LG1, Amel_HAv3.1, whole genome shotgun sequence
GAGAGAATTAACTACCTTAACCTGAACCTAAACCTACCGATAACCTAACTCTAAACTATACCTTTAACCCCTAAACCCTA
CACCTAAGTCCTAAACCAATAACCTTAACCCTAACAACTATATAAAACACTAACCTATAACCTAATCCCCTAACTACTAA
ActactaacctaacctaaaactatatacctaacctaaaccttaCCCTAACCATAACCTATTACTCTAACCCTACCAAGAG
CCTAAACCTAGAAACTTAACCCCTACAACCCTTAACCTTAACCTACACCTAACTACCTAATCCTACCTAACCtataccta
The file (Bee.fasta) has several headers (one for each sequence), the headers look like this:
NC_037638.1 Apis mellifera strain DH4 linkage group LG1, Amel_HAv3.1, whole genome shotgun sequence
I want to change them into this:
LG1
*LG1 is just an example, depending on the line of the file it can be LG1, LG2, LG3, ...
Thanks in advance :)
I'm trying to substitute headers in a file with the following code:
#!/bin/bash
grep 'LG' Bee.fasta > old_headers.txt
while read header
do
new_header=$(echo $header | awk -F ' ' '{print $8}')
sed "s/$header/$new_header/g" Bee.fasta >> somefile.txt
done < old_headers.txt
The above code changes only the first header per iteration, leaving the latter headers unchanged.
You're overthinking this. Besides, looping over lines of text in bash is almost always a bad idea performance-wise. Tools like sed, awk and perl were made for this job (text processing).
Since we know that the word "group" can only appear in headers and never in the gene sequence, Jason's sed from the comments should do everything you ask for.
$ cat Bee.fasta
NC_037638.1 Apis mellifera strain DH4 linkage group LG1, Amel_HAv3.1, whole genome shotgun sequence
GAGAGAATTAACTACCTTAACCTGAACCTAAACCTACCGATAACCTAACTCTAAACTATACCTTTAACCCCTAAACCCTA
CACCTAAGTCCTAAACCAATAACCTTAACCCTAACAACTATATAAAACACTAACCTATAACCTAATCCCCTAACTACTAA
ActactaacctaacctaaaactatatacctaacctaaaccttaCCCTAACCATAACCTATTACTCTAACCCTACCAAGAG
CCTAAACCTAGAAACTTAACCCCTACAACCCTTAACCTTAACCTACACCTAACTACCTAATCCTACCTAACCtataccta
$ sed -E 's/^.*group *([^,]+).*$/\1/g' Bee.fasta > somefile.txt
$ cat somefile.txt
LG1
GAGAGAATTAACTACCTTAACCTGAACCTAAACCTACCGATAACCTAACTCTAAACTATACCTTTAACCCCTAAACCCTA
CACCTAAGTCCTAAACCAATAACCTTAACCCTAACAACTATATAAAACACTAACCTATAACCTAATCCCCTAACTACTAA
ActactaacctaacctaaaactatatacctaacctaaaccttaCCCTAACCATAACCTATTACTCTAACCCTACCAAGAG
CCTAAACCTAGAAACTTAACCCCTACAACCCTTAACCTTAACCTACACCTAACTACCTAATCCTACCTAACCtataccta
$
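To check that the same command rewrites every header and not only the first, here is a small sketch on a hypothetical two-sequence file. The second accession (NC_037639.1) and the shortened sequence lines are made up for illustration:

```shell
# Hypothetical two-sequence file; the second header is invented for the demo.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
NC_037638.1 Apis mellifera strain DH4 linkage group LG1, Amel_HAv3.1, whole genome shotgun sequence
GAGAGAATTAACTACC
NC_037639.1 Apis mellifera strain DH4 linkage group LG2, Amel_HAv3.1, whole genome shotgun sequence
CACCTAAGTCCTAAAC
EOF

# sed applies the substitution to each line in turn. Lines without the word
# "group" (the sequence data) do not match and pass through unchanged.
renamed=$(sed -E 's/^.*group *([^,]+).*$/\1/' "$fasta")
printf '%s\n' "$renamed"

rm -f "$fasta"
```

A single sed pass reads the file once, whereas the loop in the question reran sed over the whole file for each header and appended each complete copy to the output.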
I have a text file, called texto.txt in Documentos folder, with some values like the ones below:
cat ~/Documentos/texto.txt
65f8: Testado
a4a1: Testado 2
So I want to change a whole line by using a customized function which gets as parameters the new value.
The new value will always keep the first 6 characters, changing only what comes after them. Although I am testing only the first four.
Then I edited my .bashrc including my function like shown below.
muda()
{
export BUSCA="$(echo $* | cut -c 1-4)";
sed -i "/^$BUSCA/s/.*/$*/" ~/Documentos/texto.txt ;}
When I run the command below it works like a charm, but I feel it could be improved.
muda a4a1: Testado 3
Result:
cat ~/Documentos/texto.txt
65f8: Testado
a4a1: Testado 3
Is there a smarter way to do this? Maybe by getting rid of BUSCA variable?
I'd write:
muda() {
local new_line="$*"
local key=${new_line:0:4}
sed -i "s/^${key//\//\\/}.*/${new_line//\//\\/}/" ~/Documentos/texto.txt
}
Notes:
using local variables, not exported environment variables
does not call out to cut, bash can extract a substring
escaping any slashes in the variable values so the sed code is not broken.
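Escaping slashes still leaves other characters that sed treats specially in a replacement, such as & and backslashes. As an alternative sketch (a different technique from the sed answer above, not a correction of it), awk can compare the prefix and print the new line as literal text. The function name muda_awk and the choice to pass the file as the first argument are assumptions made for this example:

```shell
# Hypothetical variant: muda_awk FILE NEW LINE TEXT...
muda_awk() {
    local file="$1"
    shift
    local tmp
    tmp=$(mktemp) || return 1
    # Hand the new line to awk through the environment so that no character
    # in it is interpreted as a regex or replacement metacharacter.
    NEW_LINE="$*" awk '
        BEGIN { new = ENVIRON["NEW_LINE"]; key = substr(new, 1, 4) }
        substr($0, 1, 4) == key { print new; next }   # replace matching line
        { print }                                     # keep all other lines
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

Usage would look like muda_awk ~/Documentos/texto.txt a4a1: Testado 3, with the file path moved from the function body to the call.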
In bash I am trying to parse the following file:
Input:
</a></td></tr><tr><td>stuff.txt (15.18 KB)</td><td>12/01/2015</td><td>Large things</td><td>158520312</td><td><a class="btn-down" download href="https://resource.com/stones">
</a></td></tr><tr><td>flowers.pdf (83.03 MB)</td><td>23/03/2011</td><td>Large flowers</td><td>872448000</td><td><a class="btn-down" download href="https://resource.com/flosers with stuff">
</a></td></tr><tr><td>apples.pdf (281.16 MB)</td><td>21/04/2012</td><td>Large things like apples</td><td>299009564</td><td><a class="btn-down" download href="https://resource.com/apples">
</a></td></tr><tr><td>stones.pdf (634.99 MB)</td><td>11/07/2011</td><td>Large stones from mountains</td><td>67100270</td><td><a class="btn-down" download href="https://stuff.com/findstones">
Wanted output:
12/01/2015 158520312 "https://resource.com/stones"
23/03/2011 872448000 "https://resource.com/flosers with stuff"
21/04/2012 299009564 "https://resource.com/apples~withstuff"
11/07/2011 67100270 "https://stuff.com/findstones"
I got to the point that I have:
# less input.txt | sed -e "s/><tr><td//" -e "s/\///" -e "s/a>//" -e "s/<\/td><\/tr>//g" -e "s/<\/td><td>//g" -e "s/>$//g" -e "s/<a class=\"btn-down\" download href=//g"
<stuff.txt (15.18 KB)12/01/2015Large things158520312"https://resource.com/stones"
<flowers.pdf (83.03 MB)23/03/2011Large flowers872448000"https://resource.com/flosers with stuff"
<apples.pdf (281.16 MB)21/04/2012Large things like apples299009564"https://resource.com/apples"
<stones.pdf (634.99 MB)11/07/2011Large stones from mountains67100270"https://stuff.com/findstones"
Is there an easier way to parse it? I feel that it can be done much more simply, and I am not even halfway through the parsing.
Could you please try the following and let us know if it helps?
awk -F"[><]" '{sub(/.*=/,"",$28);print $15,$23,$28}' Input_file
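The field numbers 15, 23 and 28 come from splitting each line on every < and >. A quick way to see where they come from, sketched here on the first sample line, is to print each non-empty field together with its index:

```shell
line='</a></td></tr><tr><td>stuff.txt (15.18 KB)</td><td>12/01/2015</td><td>Large things</td><td>158520312</td><td><a class="btn-down" download href="https://resource.com/stones">'

# Split on < or > and list every non-empty field with its position.
fields=$(printf '%s\n' "$line" |
    awk -F'[><]' '{ for (i = 1; i <= NF; i++) if ($i != "") print i ": " $i }')
printf '%s\n' "$fields"
```

The listing shows the date at index 15, the number at index 23, and the whole anchor tag at index 28, which is why the answer strips everything up to the last = from $28 before printing.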
I'm sure the best way to solve your problem is to use an HTML parser. A solution for the sample shown:
sed -r 's/.*(..\/..\/....).*>([0-9]*)<\/.*href=([^>]*)>/\1 \2 \3/I' input.txt
Personally, I'd use perl, but that's not what you asked, so...
A pedantic stepwise approach, so that you can edit bits of the logic when needed.
Assuming the input is a file named x:
</a></td></tr><tr><td>stuff.txt (15.18 KB)</td><td>12/01/2015</td><td>Large things</td><td>158520312</td><td><a class="btn-down" download href="https://resource.com/stones">
</a></td></tr><tr><td>flowers.pdf (83.03 MB)</td><td>23/03/2011</td><td>Large flowers</td><td>872448000</td><td><a class="btn-down" download href="https://resource.com/flosers with stuff">
</a></td></tr><tr><td>apples.pdf (281.16 MB)</td><td>21/04/2012</td><td>Large things like apples</td><td>299009564</td><td><a class="btn-down" download href="https://resource.com/apples">
</a></td></tr><tr><td>stones.pdf (634.99 MB)</td><td>11/07/2011</td><td>Large stones from mountains</td><td>67100270</td><td><a class="btn-down" download href="https://stuff.com/findstones">
Try this:
sed -E '
s/>$//;
s/href=/>/;
s/(<[^>]+>)+/~/g;
s/~[^~]+~//;
s/~[^~]+~/ /;
s/~/ /;
' x
Output:
12/01/2015 158520312 "https://resource.com/stones"
23/03/2011 872448000 "https://resource.com/flosers with stuff"
21/04/2012 299009564 "https://resource.com/apples"
11/07/2011 67100270 "https://stuff.com/findstones"
Explained:
sed -E '
This uses extended regexes, and opens a script of sed code so that I can list each pattern individually. Each will be executed in order on each line, so it's not super efficient, but it's "readable" as regex code goes, and reasonably maintainable once you understand it, and so easy to edit when something needs tweaking.
s/>$//;
Strip the closing > off the end, to preserve the URL before squashing out all the other tags.
s/href=/>/;
Use the href= as a hook to insert the > back, so we can squash out all the tags in one pass.
s/(<[^>]+>)+/~/g;
Convert ALL the strings of tags and everything still in them to a simple delimiter each.
s/~[^~]+~//;
Eliminate the leading and second delimiter and the first unneeded field between them.
s/~[^~]+~/ /;
Eliminate the third and fourth delimiters and the unneeded third field between them, replacing them with the space you wanted in the output.
Those two are very similar, and could certainly be combined with minimal shenanigans, but I left them nigh-redundant for easier explication.
s/~/ /;
Convert the remaining delimiter to the other space you wanted between the remaining fields.
' x
Close the script and give it the filename to read.
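To make the intermediate state visible, here is a sketch that runs the script on the first sample line twice: once stopping after the tags have been squashed to delimiters, and once to the end. The helper name step is made up for this example:

```shell
line='</a></td></tr><tr><td>stuff.txt (15.18 KB)</td><td>12/01/2015</td><td>Large things</td><td>158520312</td><td><a class="btn-down" download href="https://resource.com/stones">'

# Hypothetical helper: run a sed -E script against the sample line.
step() { printf '%s\n' "$line" | sed -E "$1"; }

# After the first three commands every run of tags has become one "~".
after_tags=$(step 's/>$//; s/href=/>/; s/(<[^>]+>)+/~/g')

# The full script then drops the unwanted fields and turns delimiters into spaces.
final=$(step 's/>$//; s/href=/>/; s/(<[^>]+>)+/~/g; s/~[^~]+~//; s/~[^~]+~/ /; s/~/ /')

printf '%s\n' "$after_tags" "$final"
```

Printing the line after each stage like this is a convenient way to debug the script when the markup changes.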
Obviously, this leaves a LOT of room for improvement, and is in many ways stylistically repulsive, but hopefully it is a simple explanation of tricks you can hack into a maintainably useful solution to your issue.
Good luck.
How do I get the first n lines of the output of a makefile (specifically, my compiler is g++)? Either a script in Linux or something in the makefile would work (if you could provide both, that would be even better).
I have tried
make | head -n 5
but it's not working.
Currently, the process I go through is tedious; I'm piping the output to a text file before using head on it (then having to delete the file).
Given that the messages from the compiler appear on standard error rather than standard output, you need to redirect both:
make 2>&1 | head -n 20
(I think 5 lines will be too small to be useful.)
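The difference can be reproduced without a real build. In this sketch, noisy_build is a made-up stand-in for make that writes one status line to standard output and its error messages to standard error:

```shell
# Hypothetical stand-in for a build that reports errors on stderr.
noisy_build() {
    echo "compiling main.cpp"
    i=1
    while [ "$i" -le 10 ]; do
        echo "main.cpp:$i: error: something went wrong" >&2
        i=$((i + 1))
    done
}

# Without the redirection only stdout reaches head; stderr bypasses the pipe
# (it is discarded here so the demo stays quiet).
stdout_only=$(noisy_build 2>/dev/null | head -n 3)

# With 2>&1 both streams go through the pipe and head can truncate them.
merged=$(noisy_build 2>&1 | head -n 3)

printf '%s\n' "$merged"
```

In the first case the error messages would normally go straight to the terminal, which is why make | head appeared not to work.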
I'm using a bash script based on the technique used here: Get color output in bash to color the output of my builds and other scripts to make things easier to read. One of the steps in my build executes a "git pull" and the git server spits out a "welcome" string like this amidst a bunch of other output:
** WARNING: THIS IS A PRIVATE NETWORK. UNAUTHORIZED ACCESS IS PROHIBITED. **
Use of this system constitutes your consent to interception, monitoring,
and recording for official purposes of information related to such use,
including criminal investigations.
I'd like to color this specific message yellow or possibly delete it from the output while leaving the rest of the output alone. I've tried to replace a simple string like this:
WelcomeMessage="WARNING"
pathpat=".*"
ccred=$(echo -e "\033[0;31m")
ccyellow=$(echo -e "\033[0;33m")
ccend=$(echo -e "\033[0m")
git pull 2>&1 | sed -r -e "/$WelcomeMessage/ s%$pathpat%$ccyellow&$ccend%g"
The first line of the welcome string is colored yellow as expected but the rest of the lines are not. I'd really like to color the exact welcome string and only that string but for many reasons, this doesn't work:
WelcomeMessage="** WARNING: THIS IS A PRIVATE NETWORK. UNAUTHORIZED ACCESS IS PROHIBITED. **
Use of this system constitutes your consent to interception, monitoring,
and recording for official purposes of information related to such use,
including criminal investigations."
pathpat=".*"
ccred=$(echo -e "\033[0;31m")
ccyellow=$(echo -e "\033[0;33m")
ccend=$(echo -e "\033[0m")
git pull 2>&1 | sed -r -e "/$WelcomeMessage/ s%$pathpat%$ccyellow&$ccend%g"
This fails with the error: sed: -e expression #1, char 78: unterminated address regex
I've looked at a couple other questions and I was able to get the asterisks escaped (by preceding them with backslashes) but I'm baffled by the periods and multiple lines. I'd like to continue using sed to solve this problem since it integrates nicely with the colorizing solution.
Any help is appreciated. Thanks!
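The following colors yellow every line from the first line containing ** through the next later line containing a period. sed only starts looking for the end of a range on the line after the one that started it, so the periods on the first line are ignored. This matches the entire warning message as written.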
NORMAL=$(tput sgr0)
YELLOW=$(tput setaf 3)
git pull 2>&1 | sed "/\*\*/,/\./s/.*/$YELLOW&$NORMAL/"
Note: If you want to delete the message you can use this:
git pull 2>&1 | sed '/\*\*/,/\./d'
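Both variants can be tried without a git server. In this sketch, fake_pull is a made-up stand-in for git pull 2>&1, the trailing "Already up to date." line is assumed output, and the colors are written as literal escape sequences (as in the question) rather than through tput so that it does not depend on the terminal type:

```shell
# Hypothetical stand-in for the output of: git pull 2>&1
fake_pull() {
    cat <<'EOF'
** WARNING: THIS IS A PRIVATE NETWORK. UNAUTHORIZED ACCESS IS PROHIBITED. **
Use of this system constitutes your consent to interception, monitoring,
and recording for official purposes of information related to such use,
including criminal investigations.
Already up to date.
EOF
}

yellow=$(printf '\033[0;33m')
normal=$(printf '\033[0m')

# Color every line of the range; lines outside it pass through untouched.
colored=$(fake_pull | sed "/\*\*/,/\./s/.*/$yellow&$normal/")

# Delete the range instead of coloring it.
cleaned=$(fake_pull | sed '/\*\*/,/\./d')

printf '%s\n' "$colored"
```

Note that the last line also contains a period but is left alone, because the range has already ended on the line before it.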