I'm trying to subtitute headers in a file that looks like this:
NC_037638.1 Apis mellifera strain DH4 linkage group LG1, Amel_HAv3.1, whole genome shotgun sequence
GAGAGAATTAACTACCTTAACCTGAACCTAAACCTACCGATAACCTAACTCTAAACTATACCTTTAACCCCTAAACCCTA
CACCTAAGTCCTAAACCAATAACCTTAACCCTAACAACTATATAAAACACTAACCTATAACCTAATCCCCTAACTACTAA
ActactaacctaacctaaaactatatacctaacctaaaccttaCCCTAACCATAACCTATTACTCTAACCCTACCAAGAG
CCTAAACCTAGAAACTTAACCCCTACAACCCTTAACCTTAACCTACACCTAACTACCTAATCCTACCTAACCtataccta
The file (Bee.fasta) has several headers (one for each sequence), the headers look like this:
NC_037638.1 Apis mellifera strain DH4 linkage group LG1, Amel_HAv3.1, whole genome shotgun sequence
I want to change them into this:
LG1
*LG1 is just an example, depending on the line of the file it can be LG1, LG2, LG3, ...
The above code changes only the first header per iteration, leaving the latter headers unchanged.
Thanks in advance :)
I'm trying to subtitute headers in a file with the following code:
#!/bin/bash
grep 'LG' Be.fasta > old_headers.txt
while read header
do
new_header=$(echo $header | awk -F ' ' '{print $8}')
sed "s/$header/$new_header/g" Bee.fasta >> somefile.txt
done < old_headers.txt
The above code changes only the first header per iteration, leaving the latter headers unchanged.
You're overthinking this. Plus looping over lines of text using bash is pretty much always a bad idea, performance-wise. Tools like sed, awk & perl were born to this job (text processing).
Since we know that the word group can only appear in headers and never in the gene-sequence Jason's sed in the comments should do all you ask for.
$ cat Bee.fasta
NC_037638.1 Apis mellifera strain DH4 linkage group LG1, Amel_HAv3.1, whole genome shotgun sequence
GAGAGAATTAACTACCTTAACCTGAACCTAAACCTACCGATAACCTAACTCTAAACTATACCTTTAACCCCTAAACCCTA CACCTAAGTCCTAAACCAATAACCTTAACCCTAACAACTATATAAAACACTAACCTATAACCTAATCCCCTAACTACTAA ActactaacctaacctaaaactatatacctaacctaaaccttaCCCTAACCATAACCTATTACTCTAACCCTACCAAGAG CCTAAACCTAGAAACTTAACCCCTACAACCCTTAACCTTAACCTACACCTAACTACCTAATCCTACCTAACCtataccta
$ sed -E 's/^.*group *([^,]+).*$/\1/g' Bee.fasta > somefile.txt
$ cat somefile.txt
LG1
GAGAGAATTAACTACCTTAACCTGAACCTAAACCTACCGATAACCTAACTCTAAACTATACCTTTAACCCCTAAACCCTA CACCTAAGTCCTAAACCAATAACCTTAACCCTAACAACTATATAAAACACTAACCTATAACCTAATCCCCTAACTACTAA ActactaacctaacctaaaactatatacctaacctaaaccttaCCCTAACCATAACCTATTACTCTAACCCTACCAAGAG CCTAAACCTAGAAACTTAACCCCTACAACCCTTAACCTTAACCTACACCTAACTACCTAATCCTACCTAACCtataccta
$
Related
Does anybody know the command to remove the header from a ppm file in Linux? I've tried this already
´´´´´´´´´´´´´´´´´´´´´´´´´´´´´´´´´´
head -n 4 Example.ppm > header.txt
tail -n 5+ Example.ppm > body.bin
´´´´´´´´´´´´´´´´´´´´´´´´´´´´´´´´´´
It tells me that "Tail" could not be found.
Most ppm files use newlines in the header so your first command is fine. However, the rest of the file is binary, so:
head -n 4 Example.ppm > header.txt
filesize=$(wc -c header.txt)
dd if=Example.ppm of=body.bin bs=1 skip=$filesize
You should have /bin/tail if you have /bin/head; both are in the coreutils RPM package.
The format of a ppm(5) file (http://netpbm.sourceforge.net/doc/ppm.html) is awkward to use with the line-based head/tail/sed family. The documentation describes fields separated by whitespace that is not necessarily a line break.
You will need to: 1) Ignore comments from '#' to end of line; and 2) process the remainder one field (not column, not line) at a time. Using awk(1) could be an option here.
Check the documentation (http://netpbm.sourceforge.net/doc/directory.html) for a list of conversion programs. You may find one that converts the PPM file into a form better suited to whatever usage is your ultimate goal.
My file contains something like the below:
X-TM-AS-Product-Ver: IMSVA-8.2.0.1391-8.0.0.1202-22662.005
X-TM-AS-Result: No--0.364-7.0-31-10
X-imss-scan-details: No--0.364-7.0-31-10
X-TMASE-Version: IMSVA-8.2.0.1391-8.0.1202-22662.005
X-TMASE-Result: 10--0.363600-5.000000
X-TMASE-MatchedRID: 40jyuBT4FtykMGOaBzW2QbxygpRxo469FspPdEyOR1qJNv6smPBGj5g3
9Rgsjteo4vM1YF6AJbZcLc3sLtjOty5V0GTrwsKpl6V6bOpOzUAdzA5USlz33EYWGTXfmDJJ3Qf
wsVk0UbuGrPnef/I+eo9h73qb6JgVCR2fClyPE+EPh2lMKov3fdtvzshqXylpWZGeMhmJ7ScqBW
z6M5VHW/fngY5M/1HkzhvqqZL61o+ZdBoyruxjzQ==
This is my real text! I need to extract this line!
The existing code, written in the past by someone else, executes the below line:
cat $my_file | egrep -v "^(X-TM-AS)"
| egrep -v "X-imss-scan-details"
supposedly to remove all those key value lines which start with "X-".
The above piece of code has been working fine up until today because keys starting with X-TMASE has never been among the keys in the past. It has started to appear in the files today, and therefore it has caused the code to fail in extraction of the useful data.
Among the newly added keys, it seems to me that X-TMASE-MatchedRID is the one creating the headache for us, as it has a value which spans multiple lines:
X-TMASE-MatchedRID: 40jyuBT4FtykMGOaBzW2QbxygpRxo469FspPdEyOR1qJNv6smPBGj5g3
9Rgsjteo4vM1YF6AJbZcLc3sLtjOty5V0GTrwsKpl6V6bOpOzUAdzA5USlz33EYWGTXfmDJJ3Qf
wsVk0UbuGrPnef/I+eo9h73qb6JgVCR2fClyPE+EPh2lMKov3fdtvzshqXylpWZGeMhmJ7ScqBW
z6M5VHW/fngY5M/1HkzhvqqZL61o+ZdBoyruxjzQ==
Initially I tried the below:
cat $my_file | egrep -v "^(X-TM-AS)"
| egrep -v "X-imss-scan-details"
| egrep -v "^(X-TMASE-)"
But it didn't work. It didn't completely eliminate the value for X-TMASE-MatchedRID:
9Rgsjteo4vM1YF6AJbZcLc3sLtjOty5V0GTrwsKpl6V6bOpOzUAdzA5USlz33EYWGTXfmDJJ3Qf
wsVk0UbuGrPnef/I+eo9h73qb6JgVCR2fClyPE+EPh2lMKov3fdtvzshqXylpWZGeMhmJ7ScqBW
z6M5VHW/fngY5M/1HkzhvqqZL61o+ZdBoyruxjzQ==
This is my real text! I need to extract this line!
I wanted the output to be:
This is my real text! I need to extract this line!
That is, I don't want any metadata to be seen in the output.
Any idea how that can be achieved using egrep or any equivalent command?
If you just want to remove the first paragraph some other command is better, for example sed
sed '1,/^$/ d' "$my_file"
I have this file in a directory say test.php whose contents are below
< ? php $XZKsyG=’as’;
I want to pick up the file test.php with a search based on its content. So from the directory containing it I do:
grep 'php \$[a-zA-Z]*=.as.;'
However I get no result...what am I doing wrong?
Thanks
It works for me:
$ cat file
< ? php $XZKsyG=’as’;
$ grep 'php \$[a-zA-Z]*=.as.;' file
< ? php $XZKsyG=’as’;
Are you sure the contents of the file are exactly what you showed us?
Try cat -A file or od -c file to see whether the file really looks the way you think it does.
(Note that you don't need to escape the $ character; it's only a metacharacter at the end of a line. But escaping it should be ok.)
EDIT :
The characters around the as in your file are not ASCII apostrophes; they're Unicode RIGHT SINGLE QUOTATION MARK characters (0x2019). If the file is stored in UTF-8, each of them is represented as a 3-byte sequence. The grep command works for me because my locale settings "en_US.UTF-8" are such that a UTF-8 character is matched by . in a regexp, even if it has a multi-byte representation. I suspect your locale is such that it would be matched by ....
Probably the simplest solution is to edit the file to use ASCII apostrophes.
You might also want to play around with your locale settings. Try the grep command with $LANG set to "en_US.UTF-8".
What's the output of the locale command?
That works fine for me, though you may want to look into those "funny" single quotes you have around as:
pax$ cat testfile
< ? php $XZKsyG='as';
pax$ grep 'php \$[a-zA-Z]*=.as.;' testfile
< ? php $XZKsyG='as';
Failing that, there's some things you can look at. Some of these may sound silly but I'm really just checking all bases.
Are you sure the file contains only what you think it does? Executing od -xcb file will give you a hex dump of it for better checking.
Are you sure you're accessing the right file, in the right directory?
Have you done something silly like aliasing grep to be something else?
That's if you're looking for a file containing that string. If instead you're looking for a file named like that, you can use something like:
ls -1 | grep 'php \$[a-zA-Z]*=.as.;'
The ls -1 command gives you one file per line, and piping that through grep will filter out those not matching the pattern.
I suppose I should mention that I'm not really a big fan of file names with spaces in them, but I'm violently opposed to file names made up of PHP scripts :-)
How do I quickly scrape off first few lines from a large file, without opening the whole file in main memory?
UPDATE
I do not want to pipe the starting x lines into another file and then cut the first few lines, I want to update the original file.
Not exactly vim, but to cut of the first 10 lines you could use
tail --lines=+10 somefile.txt > newfile.txt
tail -n+11 somefile.txt | vim -
To chop off the first 10* lines and open the file for edit, without creating a temporary file. Note that the file will have no name in vim when you open it this way. That's the only drawback.
* Note that although I used 11 in the command, this starts from line 11. So it will chop off the first 10 lines.
The original question was never actually answered here. I believe this is a solution:
sed -i 's/`head -n 500 foo.txt`//' foo.txt
This would eliminate the first 500 lines of a file without having to create a temporary file. (Actually, you might have to do head -n 499) I think it's actually quite useful as a one-liner for say, cleaning up log files, without just erasing the entire log.
$ seq 1 502 > foo.txt
$ sed -i 1,500d foo.txt
$ cat foo.txt
501
502
vim will always want/need to read in the whole file, so there's no way to do it using (only) vim. Darcara's suggestion looks good.
This process will always involve copying all but the first part of the file to another, so I don't see any way of doing it quickly.
Depending on what you can do with the file you may be better of using sed or awk for editing such a big file.
How about ..
split the original file into 2 parts. (p1: lines 0 - x) (p2: lines x+1 - n)
edit p1 since you want to edit the first x lines. We'll call it p1'
combine p1' and p2
In short
file -> p1 and p2
p1 -> p1'
p1' + p2 -> new_file
Commands
use split or cut
use vim or editor of your choice.
use cat to combine
I have a 2GB text file on my linux box that I'm trying to import into my database.
The problem I'm having is that the script that is processing this rdf file is choking on one line:
mismatched tag at line 25462599, column 2, byte 1455502679:
<link r:resource="http://www.epuron.de/"/>
<link r:resource="http://www.oekoworld.com/"/>
</Topic>
=^
I want to replace the </Topic> with </Line>. I can't do a search/replace on all lines but I do have the line number so I'm hoping theres some easy way to just replace that one line with the new text.
Any ideas/suggestions?
sed -i yourfile.xml -e '25462599s!</Topic>!</Line>!'
sed -i '25462599 s|</Topic>|</Line>|' nameoffile.txt
The tool for editing text files in Unix, is called ed (as opposed to sed, which as the name implies is a stream editor).
ed was once intended as an interactive editor, but it can also easily scripted. The way ed works, is that all commands take an address parameter. The way to address a specific line is just the line number, and the way to change the addressed line(s) is the s command, which takes the same regexp that sed would. So, to change the 42nd line, you would write something like 42s/old/new/.
Here's the entire command:
FILENAME=/path/to/whereever
LINENUMBER=25462599
ed -- "${FILENAME}" <<-HERE
${LINENUMBER}s!</Topic>!</Line>!
w
q
HERE
The advantage of this is that ed is standardized, while the -i flag to sed is a proprietary GNU extension that is not available on a lot of systems.
Use "head" to get the first 25462598 lines and use "tail" to get the remaining lines (starting at 25462601). Though... for a 2GB file this will likely take a while.
Also are you sure the problem is just with that line and not somewhere previous (ie. the error looks like an XML parse error which might mean the actual problem is someplace else).
My shell script:
#!/bin/bash
awk -v line=$1 -v new_content="$2" '{
if (NR == line) {
print new_content;
} else {
print $0;
}
}' $3
Arguments:
first: line number you want change
second: text you want instead original line contents
third: file name
This script prints output to stdout then you need to redirect. Example:
./script.sh 5 "New fifth line text!" file.txt
You can improve it, for example, by taking care that all your arguments has expected values.