find string and replace - string

Hi I have a file like this
L_00001_mRNA_interferase_MazF
ATGGATTATCCAAAACAAAAGGATATTGTCTGGATTGATTTTGACCCTTCTAAAGGCAAA
GAGATAAGAAAGCGGAGACCTGCGTTAGTAGTTAGTAAAGATGAATTTAATGAACGTACA
GGTTTCTGTTTAGTTTGCCCCATCACATCTACTAAAAGGAACTTTGCAACGTATATTGAA
ATAACAGACCCACAGAAAGTAGAAGGGGACGTAGTTACCCATCAATTGCGAGCGGTTGAT
TACACCACAAGAAATATCGAAAAAATTGAACAATGTGATATGTTGACGTGGATTGATGTA
GTAGAAGTAATCGGAATGTTTATTTAA
L_00002_hypothetical_protein
ATGGAAACGGTAGTTAGAAAGATAGGGAATTCAGTAGGAACTATTTTTCCGAAAAGTATT
TCACCACAAGTTGGAGAAAAGTTCACTATTCTTAAAGTTGGGGAAGCGTATATATTGAAA
CCTAAGAGAGAAGATATTTTTAAAAATGCTGAAGATTGGGTAGGGTTTAGAGAAGCTTTG
ACTAATGAAGATAAAGAATGGGACGAGATGAAACTTGAGGGAGGAGAACGCTAG
L_00003_hypothetical_protein
ATGACAACGTTTGGAGAAATTCATAGCAATGCAGAAGGTTATAAAAACGATTTTAATGAG
TTGAATAAATTAGTATTACGTGTAGCTGAAGAAAAAGCAAAAGGAGAGCCATTAGTAACG
TGGTTTCGGTTGCGGAATCGTAGGATTGCACAAGTATTAGACCCAATGAAAGAAGAAGTA
GAAAGTAAATCAAAGTACGAAAAAAGAAGAGTAGCAGCAATTAGTAAAAGCTTTTTTCTA
CTTAAAAAAGCTTTTAACTTTATTGAAGCAGAACAATTTGAAAAAGCAGAAAAATTAATT
I would like to substitute the header of each sequence with a string.
I have a conversion file like
L_00001_mRNA_interferase_MazF galM,GALM,aldose1-epimerase[EC:5.1.3.3]
L_00002_hypothetical_protein E3.2.1.85,lacG,6-phospho-beta-galactosidase[EC:3.2.1.85]
L_00003_hypothetical_protein PTS-Lac-EIIB,lacE,PTSsystem,lactose-specificIIBcomponent[EC:2.7.1.69]

Your question is unclear as to what platform you're on (Windows, Linux, Mac, ...), what languages you're constrained to, and the exact details of your input files.
On the assumption that you're on Linux, or otherwise have sed and awk available and a command shell, it could be as simple as (where $ indicates a Bourne-like shell prompt):
$ awk '{print "s/^" $1 "/" $2 "/"}' conversions.txt > conversions.sed
$ sed -f conversions.sed sequences.txt > relabeled.txt
This assumes that your first file (with the headings you want changed) is called sequences.txt and your second file (the “conversion file”) is called conversions.txt. It is further assumed that the “conversion file” contains one record per line with exactly two fields — the original and substitute headers — separated by whitespace (i.e. neither the original header nor the new header contain any spaces) and no blank lines.
In this solution, the first (awk) line converts the conversions.txt file into a sed script, conversions.sed; the second (sed) line then runs this script on the sequences.txt file, producing the relabeled.txt file, which may (or may not) be what you're looking for.
Depending on the exact nature of your input files, which isn't clear from your question, this may need a bit of tweaking.
Hope this helps.

Related

How to echo/print actual file contents on a unix system

I would like to see the actual file contents without it being formatted to print. For example, to show:
\n0.032,170\n0.034,290
Instead of:
0.032,170
0.34,290
Is there a command to echo the file's actual data in bash? I've tried using head, cat, more, etc. but all those seem to echo the "print-formatted" text. For example:
$ cat example.csv
0.032,170
0.34,290
How can I print the actual characters within the file?
This reads as if you miss understand what the "actual characters in the file" are. You will not find the characters \ and n in that file. But only a line feed, which is a specific character. So the utilities like cat do actually output exactly the characters in the file.
Putting it the other way around: if you really had those two characters literally in the file, then a utility like cat would actually output them. I just checked that, just to be sure.
You can easily check that yourself if you open the file using a hexeditor. There you will see the character 0A (decimal 10) which is a line feed character. You will not see the pair of the two characters \ and n somewhere in that file.
Many programming languages and also shell environments use escape sequences like \n in string definitions and identify those as control characters which would not be typable otherwise. So maybe that is where your impression comes from that your files should contain those two characters.
To display newlines as \n, you might try:
awk 1 ORS='\\n' input-file
This is not the "actual characters in the file", as \n is merely a conventional method of displaying a newline, but this does seem to be what you want.

concatenate two strings and one variable using bash

I need to generate filename from three parts, two strings, and one variable.
for f in `cat files.csv`; do echo fastq/$f\_1.fastq.gze; done
files.csv has the following lines:
Sample_11
Sample_12
I need to generate the following:
fastq/Sample_11_1.fastq.gze
fastq/Sample_12_1.fastq.gze
My problem is that I got the below files:
_1.fastq.gze_11
_1.fastq.gze_12
the string after the variable deletes the string before it.
I appreciate any help
Regards
By the way your idiom: for f in cat files.csv should be avoid. Refer: Dangerous Backticks
while read f
do
echo "fastq/${f}/_1.fastq.gze"
done < files.csv
You can make it a one-liner with xargs and printf.
xargs printf 'fastq/%s_1.fastq.gze\n' <files.csv
The function of printf is to apply the first argument (the format string) to each argument in turn.
xargs says to run this command on as many files as it can fit onto the command line (splitting it up into multiple invocations if the input file is too large to fit all the arguments onto a single command line, subject to the ARG_MAX constant in your kernel).
Your best bet, generally, is to wrap the variable name in braces. So, in this case:
echo fastq/${f}_1.fastq.gz
See this answer for some details about the general concept, as well.
Edit: An additional thought looking at the now-provided output makes me think that this isn't a coding problem at all, but rather a conflict between line-endings and the terminal/console program.
Specifically, if the CSV file ends its lines with just a carriage return (ASCII/Unicode 13), the end of Sample_11 might "rewind" the line to the start and overwrite.
In that case, based loosely on this article, I'd recommend replacing cat (if you understandably don't want to re-architect the actual script with something like while) with something that will strip the carriage returns, such as:
for f in $(tr -cd '\011\012\040-\176' < temp.csv)
do
echo fastq/${f}_1.fastq.gze
done
As the cited article explains, Octal 11 is a tab, 12 a line feed, and 40-176 are typeable characters (Unicode will require more thinking). If there aren't any line feeds in the file, for some reason, you probably want to replace that with tr '\015' '\012', which will convert the carriage returns to line feeds.
Of course, at that point, better is to find whatever produces the file and ask them to put reasonable line-endings into their file...

Ignore spaces, tabs and new line in SED

I tried to replace a string in a file that contains tabs and line breaks.
the command in the shell file looked something like this:
FILE="/Somewhere"
STRING_OLD="line 1[ \t\r\n]*line 2"
sed -i 's/'"$STRING_OLD"'/'"$STRING_NEW"'/' $FILE
if I manually remove the line breaks and the tabs and leave only the spaces then I can replace successfully the file. but if I leave the line breaks then SED is unable to locate the $STRING_OLD and unable to replace to the new string
thanks in advance
Kobi
sed reads lines one at a time, and usually lines are also processed one at a time, as they are read. However, sed does have facilities for reading additional lines and operating on the combined result. There are several ways that could be applied to your problem, such as:
FILE="/Somewhere"
STRING_OLD="line 1[ \t\r\n]*line 2"
sed -n "1h;2,\$H;\${g;s/$STRING_OLD/$STRING_NEW/g;p}"
That that does more or less what you describe doing manually: it concatenates all the lines of the file (but keeps newlines), and then performs the substitution on the overall buffer, all at once. That does assume, however, either that the file is short (POSIX does not require it to work if the overall file length exceeds 8192 bytes) or that you are using a sed that does not have buffer-size limitations, such as GNU sed. Since you tagged Linux, I'm supposing that GNU sed can be assumed.
In detail:
the -n option turns off line echoing, because we save everything up and print the modified text in one chunk at the end.
there are multiple sed commands, separated by semicolons, and with literal $ characters escaped (for the shell):
1h: when processing the first line of input, replace the "hold space" with the contents of the pattern space (i.e. the first line, excluding newline)
2,\$H: when processing any line from the second through the last, append a newline to the hold space, then the contents of the pattern space
\${g;s/$STRING_OLD/$STRING_NEW/g;p}: when processing the last line, perform this group of commands: copy the hold space into the pattern space; perform the substitution, globally; print the resulting contents of the pattern space.
That's one of the simpler approaches, but if you need to accommodate seds that are not as capable as GNU's with regard to buffer capacity then there are other ways to go about it. Those start to get ugly, though.

How to I copy a variable number of lines from one file to another using shell script?

I have a wpa_supplicant.conf file (WiFi developers might be familiar with it). I want to copy a variable number of lines at the beginning of it to another file. Every wpa_supplicant has the following format:
ctrl_interface=wlan0
update_config=1
device_type=0-00000000-0
network={
..
After the first few lines the network={.. part starts however the number of lines above it are variable. I don't know how many lines might be there beforehand. Is there any way to copy all the part before network{.. to another file using shell script?
This is a classic use case for awk:
awk '{ if ($0 ~ /network=\{/) { exit } print }' wpa_supplicant.conf > /some/file
Copying itself is trivial and done via a simple redirection >. Note that this will overwrite /some/file. If you want to append to a file instead of overwriting it, just use >>.
More interesting is the first part, which consists of an awk script that reads and prints wpa_supplicant.conf line by line and stops as soon as a line matches the regular expression network=\{.
I'm not going to explain how this script works in detail - if you're not familiar with awk, you should read up on it, as it comes in handy quite often when you have to do these kinds of things under time constraints.

Copy a section within two keywords into a target file

I have thousand of files in a directory and each file contains numbers of defined variables starting with keyword DEFINE and ending with a semicolon (;), I want to copy all the occurrences of the data between this keyword(Inclusive) into a target file.
Example: Below is the content of the text file:
/* This code is for lookup */
DEFINE variable as a1 expr= extract (n123f1 using brach, code);
END.
Now from the above content i just want to copy the section starting with DEFINE and ending with ; into a target file i.e. the output should be:
DEFINE variable as a1 expr= extract (n123f1 using brach, code);
this needs to done for thousands of scripts and multiple occurences, Please help out.
Thanks a lot , the provided code works, but to a limited extent only when the whole sentence is in a single line but the data is not supposed to be in one single line it is spread in multiple line like below:
/* This code is for lookup */
DEFINE variable as a1 expr= if branchno > 55
then
extract (n123f1 using brach, code)
else
branchno = null
;
END.
The code is also in the above fashion i need to capture all the data between DEFINE and semicolon (;) after every define there will be an ending semicolon ;, this is the pattern.
It sounds like you want grep(1):
grep '^DEFINE.*;$' input > output
Try using grep. Let's say you have files with extension .txt in present directory,
grep -ho 'DEFINE.*;' *.txt > outfile
Output:
DEFINE variable as a1 expr= extract (n123f1 using brach, code);
Short Description
-o will give you only matching string rather than whole line, if line also contains something else and want to ommit it.
-h will suppress file names before matching result
Read man page of grep by typing man grep on your terminal
EDIT
If you want capability to search in multiple lines, you can use pcregrep with -M option
pcregrep -M 'DEFINE.*?(\n|.)*?;' *.txt > outfile
Works fine on my system. Check man pcregrep for more details
Reference : SO Question
One can make a simple solution using sed with version :
sed -n -e '/^DEFINE/{:a p;/;$/!{n;ba}}' your-file
Option -n prevents sed from printing every line; then each time a line begins with DEFINE, print the line (command p) then enter a loop: until you find a line ending with ;, grab the next line and loop to the print command. When exiting the loop, you do nothing.
It looks a bit dirty; it seems that the version sed15 has a shorter (and more straightforward) way to achieve this in one line:
sed -n -e '/^DEFINE/,/;$/p' your-file
Indeed, only for this version of sed, both patterns are treated; for other versions of sed like mine under cygwin, the range patterns must be on separate lines to work properly.
One last thing to remember: it does not treat inclusive patterned ranges, i.e. it stops printing after the first encountered end-pattern even if multiple start patterns have been matched. Prefer something with awk if this is a feature you are looking for.

Resources