I have a sequence file that has a repeated pattern that looks like this:
$>g34 | effector probability: 0.6
GPCKPRTSASNTLTTTLTTAEPTPTTIATETTIATSDSSKTTTIDNITTTTSEAESNTKTESSTIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS"
$>g104 | effector probability: 0.65
GIFSSLICATTAVTTGIICHGTVTLATGGTCALATLPAPTTSIAQTRTTTDTSEH
$>g115 | effector probability: 0.99
IAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS
and so on.
I want to extract the text between and including each >g## and create a new file titled protein_g##.faa
In the above example it would create a file called "protein_g34.faa" and it would be:
$>g34 | effector probability: 0.6
GPCKPRTSASNTLTTTLTTAEPTPTTIATETTIATSDSSKTTTIDNITTTTSEAESNTKTESSTIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS
I was trying to use sed but I am not very experienced using it. My guess was something like this:
$ sed -n '/^>g*/s///p; y/ /\n/' file > "g##"
but I can clearly tell that that is wrong... maybe the right thing is using awk?
Thanks!
Yeah, I would use awk for that. Plain sed can write to files with its w command, but the file name has to be spelled out in the script, so it can't pick the name from the data.
Here's how I would write that:
< input.txt awk '/^\$>/{fname = "protein_" substr($1, 3) ".faa"; print "sending to " fname} {print $0 > fname}'
Breaking it down into details:
< input.txt This part reads in the input file.
awk Runs awk.
/^\$>/ On lines which start with the literal string $>, run the piece of code in braces.
(If previous step matched) {fname = "protein_" substr($1, 3) ".faa"; print "sending to " fname} Take the first field of the current line. Remove the first two characters of that field. Surround that with protein_ .faa. Save it as the variable fname. Print a message about switching files.
This next block has no condition before it. Implicitly, that means that it matches every line.
{print $0 > fname} Take the entire line, and send it to the filename held by fname. If no file is selected, this will cause an error.
Hope that helps!
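As a variation on the one-liner above, here is a minimal sketch that also closes the previous output file whenever a new header shows up and ignores any lines that come before the first header. The scratch directory and the shortened sequences are made up for the example; only the awk logic matters.

```shell
# Sketch: work in a scratch directory with a small sample (made-up sequences).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' '$>g34 | effector probability: 0.6' 'GPCKPRT' \
              '$>g104 | effector probability: 0.65' 'GIFSSLI' > input.txt

awk '
  /^\$>/ {                            # header line such as "$>g34 | ..."
    if (fname != "") close(fname)     # release the previous output file
    fname = "protein_" substr($1, 3) ".faa"
  }
  fname != "" { print > fname }       # skip anything before the first header
' input.txt

ls protein_*.faa
```

Closing each file as soon as its record is finished keeps the number of open files at one, which should matter only when the input holds a lot of sequences.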
If awk is an option:
awk '/\|/ {split($1,a,">"); fname="protein_"a[2]".faa"} {print $0 >> fname}' src.dat
awk is better than sed for this problem. You can implement it in sed with
sed -rz 's/(\$>)(g[^ ]*)([^\n]*\n[^\n]*)\n/echo '\''\1\2\3'\'' > protein_\2.faa/ge' file
This solution is nice for showing some sed tricks:
-z for parsing fragments that span several lines
(..) for remembering strings
\$ matching a literal $
[^\n]* matching until end of line
'\'' for a single quote
End single quoted string, escape single quote and start new single quoted string
\2 for recalling the second remembered string
Write a bash command in the replacement string
e execute result of replacement
awk procedure
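awk can split the input into records at empty lines by setting the record separator to an empty string, RS="". The separating lines must be completely empty, and this approach assumes that the entries in the file are separated by such lines.
Thus the records intended for each file are obtained automatically.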
The id to be used in the filename can be extracted from field 1 $1 by splitting the (default white-space-separated) field at the ">" mark, and using element 2 of the split array (named id in this example).
Each file is closed right after it is written, which prevents "too many open files" errors if you have many records to process.
The awk procedure
The example data was saved in a file named all.seq and the following procedure used to process it:
awk 'BEGIN{RS="";} {split($1,id,">"); fn="protein_"id[2]".faa"; print $0 > fn; close(fn)}' all.seq
Test results
(terminal listings/outputs)
$ ls
all.seq protein_g104.faa protein_g115.faa protein_g34.faa
$ cat protein_g104.faa
$>g104 | effector probability: 0.65
GIFSSLICATTAVTTGIICHGTVTLATGGTCALATLPAPTTSIAQTRTTTDTSEH
$ cat protein_g115.faa
$>g115 | effector probability: 0.99
IAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS
$ cat protein_g34.faa
$>g34 | effector probability: 0.6
GPCKPRTSASNTLTTTLTTAEPTPTTIATETTIATSDSSKTTTIDNITTTTSEAESNTKTESSTIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS"
Tested using GNU Awk 5.1.0
I have written the following command
#!/bin/bash
awk -v value="$newvalue" -v row="$rownum" -v col=1 'BEGIN{FS=OFS=","} NR==row {$col=value}1' "${file}".csv > temp.csv && mv temp.csv "${file}".csv
Sample Input of file.csv
Header,1
Field1,Field2,Field3
1,ABC,4567
2,XYZ,7890
Assuming newvalue=3, rownum=4 and col=1, the above code replaces the 2 in the 1st column of row 4 with 3:
Required Output
Header,1
Field1,Field2,Field3
1,ABC,4567
3,XYZ,7890
So if I know the row and column, is it possible to replace the said value using grep or sed?
Edit1: Field3 will always have a unique value for their respective rows. ( in case that info helps anyway)
Assuming your CSV file is as simple as what you show (no commas in quoted fields), and your newvalue does not contain characters that sed would interpret in a special way (e.g. ampersands, slashes or backslashes), the following should work with just sed (tested with GNU sed):
sed -Ei "$rownum s/[^,]*/$newvalue/$col" file.csv
Demo:
$ cat file.csv
Header,1
Field1,Field2,Field3
1,ABC,4567
3,XYZ,7890
$ rownum=3
$ col=2
$ newvalue="NEW"
$ sed -Ei "$rownum s/[^,]*/$newvalue/$col" file.csv
$ cat file.csv
Header,1
Field1,Field2,Field3
1,NEW,4567
3,XYZ,7890
Explanations: $rownum is used as the address (here the line number) where to apply the following command. s is the sed substitute command. [^,]* is the regular expression to search for and replace: the longest possible string not containing a comma. $newvalue is the replacement string. $col is the occurrence to replace.
If newvalue may contain ampersands, slashes or backslashes we must sanitize it first:
sanitizednewvalue=$(sed -E 's/([/\&])/\\\1/g' <<< "$newvalue")
sed -Ei "$rownum s/[^,]*/$sanitizednewvalue/$col" file.csv
Demo:
$ newvalue='NEW&\/&NEW'
$ sanitizednewvalue=$(sed -E 's/([/\&])/\\\1/g' <<< "$newvalue")
$ echo "$sanitizednewvalue"
NEW\&\\\/\&NEW
$ sed -Ei "$rownum s/[^,]*/$sanitizednewvalue/$col" file.csv
$ cat file.csv
Header,1
Field1,Field2,Field3
1,NEW&\/&NEW,4567
3,XYZ,7890
With sed, how about:
#!/bin/bash
newvalue=3
rownum=4
col=1
sed -i -E "${rownum} s/(([^,]+,){$((col-1))})[^,]+/\\1${newvalue}/" file.csv
Result of file.csv
Header,1
Field1,Field2,Field3
1,ABC,4567
3,XYZ,7890
${rownum} matches the line number.
(([^,]+,){n}) matches n repetitions of the group "non-comma characters followed by a comma". With n set to col - 1, it matches the part of the line before the target column, which is captured and put back as \1; the following [^,]+ is the target field that gets replaced.
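To see what sed actually receives after the shell has expanded the variables, here is a small sketch. The values and the sample row are assumptions for illustration, and the row is fed through standard input so no file is modified.

```shell
# Sketch with assumed values: print the expanded sed script, then apply it
# to a single sample row read from standard input.
rownum=1
col=3
newvalue=NEW
script="${rownum} s/(([^,]+,){$((col-1))})[^,]+/\\1${newvalue}/"
printf '%s\n' "$script"     # 1 s/(([^,]+,){2})[^,]+/\1NEW/
result=$(printf '%s\n' '2,XYZ,7890' | sed -E "$script")
printf '%s\n' "$result"     # 2,XYZ,NEW
```

Note that [^,]+ needs at least one character per field, so a row with empty fields before the target column would not match.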
Let's Try to Implement sed command
Let us consider a sample CSV file with the following content:
$ cat file
Solaris,25,11
Ubuntu,31,2
Fedora,21,3
LinuxMint,45,4
RedHat,12,5
To remove the 1st field or column :
$ sed 's/[^,]*,//' file
25,11
31,2
21,3
45,4
12,5
This regular expression matches a sequence of non-comma characters ([^,]*) together with the comma that follows and deletes them, which results in the 1st field getting removed.
To print only the last field, OR remove all fields except the last field:
$ sed 's/.*,//' file
11
2
3
4
5
This regex removes everything till the last comma(.*,) which results in deleting all the fields except the last field.
To print only the 1st field:
$ sed 's/,.*//' file
Solaris
Ubuntu
Fedora
LinuxMint
RedHat
This regex (,.*) removes the characters starting from the 1st comma till the end, resulting in deleting all the fields except the 1st field.
To delete the 2nd field:
$ sed 's/,[^,]*,/,/' file
Solaris,11
Ubuntu,2
Fedora,3
LinuxMint,4
RedHat,5
The regex (,[^,]*,) searches for a comma and sequence of characters followed by a comma which results in matching the 2nd column, and replaces this pattern matched with just a comma, ultimately ending in deleting the 2nd column.
Note: Deleting fields in the middle gets tougher in sed, since every field before the target has to be matched explicitly.
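One way to avoid spelling out every preceding field is a repetition count with extended regular expressions. This is only a sketch: n is an assumed variable holding the 1-based number of the field to delete, and it is meant for a field that is not the last one on the line.

```shell
# Sketch: delete field number n (assumed here to be 2) from comma-separated rows.
n=2
result=$(printf '%s\n' 'Solaris,25,11' 'Ubuntu,31,2' |
         sed -E "s/^(([^,]*,){$((n-1))})[^,]*,/\\1/")
printf '%s\n' "$result"
# Solaris,11
# Ubuntu,2
```

The first n-1 fields are captured and put back as \1, and the n-th field with its trailing comma is dropped.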
To print only the 2nd field:
$ sed 's/[^,]*,\([^,]*\).*/\1/' file
25
31
21
45
12
The regex matches the first field, second field and the rest, however groups the 2nd field alone. The whole line is now replaced with the 2nd field(\1), hence only the 2nd field gets displayed.
Print only lines in which the last column is a single digit number:
$ sed -n '/.*,[0-9]$/p' file
Ubuntu,31,2
Fedora,21,3
LinuxMint,45,4
RedHat,12,5
The regex (,[0-9]$) checks for a single digit in the last field and the p command prints the line which matches this condition.
To number all lines in the file:
$ sed = file | sed 'N;s/\n/ /'
1 Solaris,25,11
2 Ubuntu,31,2
3 Fedora,21,3
4 LinuxMint,45,4
5 RedHat,12,5
This simulates the cat -n command. awk does it easily using the special variable NR. The '=' command of sed prints the line number of every line on a line of its own, followed by the line itself. The sed output is piped to another sed command which joins every 2 lines.
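For comparison, the awk version mentioned above could look like this. It is a sketch on three of the sample rows, fed through standard input.

```shell
# Sketch: number the lines with awk's built-in record counter NR.
result=$(printf '%s\n' 'Solaris,25,11' 'Ubuntu,31,2' 'Fedora,21,3' |
         awk '{ print NR, $0 }')
printf '%s\n' "$result"
# 1 Solaris,25,11
# 2 Ubuntu,31,2
# 3 Fedora,21,3
```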
Replace the last field by 99 if the 1st field is 'Ubuntu':
$ sed 's/\(Ubuntu\)\(,.*,\).*/\1\299/' file
Solaris,25,11
Ubuntu,31,99
Fedora,21,3
LinuxMint,45,4
RedHat,12,5
This regex matches 'Ubuntu' and till the end except the last column and groups each of them as well. In the replacement part, the 1st and 2nd group along with the new number 99 is substituted.
Delete the 2nd field if the 1st field is 'RedHat':
$ sed 's/\(RedHat,\)[^,]*\(.*\)/\1\2/' file
Solaris,25,11
Ubuntu,31,2
Fedora,21,3
LinuxMint,45,4
RedHat,,5
The 1st field 'RedHat' and the remaining fields are grouped, and the replacement is done with only the 1st and the last group, resulting in the 2nd field being emptied. Its comma stays in place, which is why the output shows RedHat,,5.
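If the column should disappear completely instead of being left empty, the comma after the 2nd field can be included in the match. A sketch on two of the sample rows:

```shell
# Sketch: drop the 2nd field together with its trailing comma, only on
# rows whose 1st field is RedHat.
result=$(printf '%s\n' 'Fedora,21,3' 'RedHat,12,5' |
         sed 's/^\(RedHat,\)[^,]*,/\1/')
printf '%s\n' "$result"
# Fedora,21,3
# RedHat,5
```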
To insert a new column at the end(last column) :
$ sed 's/.*/&,A/' file
Solaris,25,11,A
Ubuntu,31,2,A
Fedora,21,3,A
LinuxMint,45,4,A
RedHat,12,5,A
The regex (.*) matches the entire line and replaces it with the line itself (&) followed by the new field.
To insert a new column in the beginning(1st column):
$ sed 's/.*/A,&/' file
A,Solaris,25,11
A,Ubuntu,31,2
A,Fedora,21,3
A,LinuxMint,45,4
A,RedHat,12,5
Same as the last example, except that the matched line is preceded by the new column.
I hope this will help. Let me know if you need to use Awk or any other command.
Thank you
I have data in the text file as below
E993143|65282
C960954567|50222
P1_ABCDEFG_bbb|26153
A960416|25654
D987747|13410
I would like to have in a proper alignment using linux as below
E993143        |65282
C960954567     |50222
P1_ABCDEFG_bbb |26153
A960416        |25654
D987747        |13410
Can somebody help me here?
Note: I cannot use excel format
You can use awk as follows over your text:
awk -F"|" '{printf("%-15s \t |%-10i\n", $1, $2)}'
Here the 1st column is padded to a minimum width of 15 characters and the second to 10. You can change these numbers if you expect longer values.
Explanation:
"-F" flag defines the delimiter as "|"
"%-15s \t |%-10i\n" is the format string for the output. In "%-15s", the "-" left-aligns the value and "15s" pads the string to at least 15 characters. Similarly, "%-10i" prints an integer left-aligned in a field at least 10 characters wide. "\t" adds a tab between the columns and "\n" ends the line.
Output:
➜ test cat test.txt | awk -F"|" '{printf("%-15s \t |%-10i\n", $1, $2)}'
E993143 |65282
C960954567 |50222
P1_ABCDEFG_bbb |26153
A960416 |25654
D987747 |13410
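If you would rather not hard-code the width, a two-pass awk can measure the longest key first. This is a sketch only: the scratch file name is made up, and the same file is passed twice so that awk reads it once to measure and once to print.

```shell
# Sketch: pad the 1st column to the width of its longest value.
workdir=$(mktemp -d)
printf '%s\n' 'E993143|65282' 'C960954567|50222' 'P1_ABCDEFG_bbb|26153' \
    > "$workdir/test.txt"

result=$(awk -F'|' '
  NR == FNR { if (length($1) > w) w = length($1); next }  # pass 1: find the widest key
  { printf("%-" w "s |%s\n", $1, $2) }                    # pass 2: pad to that width
' "$workdir/test.txt" "$workdir/test.txt")
printf '%s\n' "$result"
```

The format string is built by concatenation, so with the sample above it ends up as "%-14s |%s\n".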
The "column" tool might help you, see https://www.stefaanlippens.net/pretty-csv.html. For example:
cat test.csv | column -t -s \| -o \|
The "|" has to be escaped (or quoted) when it is passed as an argument; otherwise the shell treats it as a pipe.
This is my file:
$cat filename
10023a,vija45,8877au,qwer65,guru12 0099888das,baburam123,ganeshan1,feild55512
What I tried is the sed command below, to get only the 6-character words from that file:
sed -ne 's/[a-z][0-9]\{6}/&/p' filename
It displays all words and lines.
Could anyone please help me with this?
Expected output is
vija45 baburam123
8877au ganeshan1
qwer65 feild55512
guru12
Use that:
tr "," "\n" <file | grep '^.\{6\}$\|^.\{10\}$'
First, tr replaces every , with a newline, so that each comma-separated segment ends up on its own line.
Then grep searches for 6 or 10 character long lines and prints them.
With your given example, the output would then be:
10023a
vija45
8877au
qwer65
baburam123
feild55512
If guru12 0099888das must also be matched as a 6 character and a 10 character word, then just change the tr part to include also spaces:
tr ", " "\n" <file | grep '^.\{6\}$\|^.\{10\}$'
I suggest using grep for the matching.
grep -o '\b\w\{6\}\b' file
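Here is a sketch of that grep idea applied to the sample line, once for 6-character and once for 10-character words. It assumes GNU grep, since \b and \w are GNU extensions, and it uses -E only to avoid escaping the braces.

```shell
# Sketch: pull out words of an exact length with grep -o (GNU grep assumed).
line='10023a,vija45,8877au,qwer65,guru12 0099888das,baburam123,ganeshan1,feild55512'
six=$(printf '%s\n' "$line" | grep -oE '\b\w{6}\b')
ten=$(printf '%s\n' "$line" | grep -oE '\b\w{10}\b')
printf '%s\n' "$six"   # 10023a vija45 8877au qwer65 guru12, one per line
printf '%s\n' "$ten"   # 0099888das baburam123 feild55512, one per line
```

The word boundaries on both sides are what stop a 6-character match from being cut out of a longer word such as baburam123.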
sed '
# keep only 6 char word (and space) by removing less or more than 6 character word
s/.*/,&,/
s/[^[:space:],]\{11,\}//g;s/[[:space:],][^[:space:],]\{1,5\}[[:space:],]/,/g;s/[[:space:],][^[:space:],]\{7,9\}[[:space:],]/,/g
# clean space element
s/[[:space:],]\{2,\}/,/g;s/^[[:space:],]*//g;s/[[:space:],]*$//g
# remove empty line
/^[[:space:],]*$/d
# 1 word per line (optional)
y/ ,/\n\n/
' YourFile
Detail:
prints every 6-character word found in each line (optionally one word per output line)
the comments in the script explain each step
adapted for comma-separated input
Correction: the first version was missing some g flags, had a small bug when removing short words, and kept only 6-character words; 10-character words are now kept as well.
I'm trying to find and remove strings like:
[1126604244001,85.00], [1122204245002,85.00], [1221104246003,85.00],
[1222204247004,85.00], [1823304248005,85.00], [1424404249006,85.00],
85.00 is constant. I mean [xxxxxxxxxxxxx,85.00],
In notepad++ is simple:
find: "[^........].............,85.00]" and replace:""
I would like to use awk or sed to remove the strings automatically, without importing the file into notepad++.
ok, I have file
temp.txt
[1126604244001,17.00], [1126604244001,17.00], [1126604244001,17.00],
[1126604244001,85.00], [1122204245002,85.00], [1221104246003,85.00],
[1222204247004,85.00], [1823304248005,85.00], [1424404249006,85.00], [1126604244001,17.00], [1126604244001,17.00],
My desired output
temp.txt
[1126604244001,17.00],[1126604244001,17.00],[1126604244001,17.00],[1126604244001,17.00],[1126604244001,17.00],
Thx in advance!
With sed, simply:
sed 's/\[[^]]*,85\.00\],[[:space:]]*//g' filename
With this, everything that matches the regex \[[^]]*,85\.00\],[[:space:]]* is removed. The regex matches [ followed by an arbitrary number of characters that are not ], followed by ,85.00], and optionally whitespace. The dot is escaped so that it only matches a literal dot; the only syntactically tricky bit is the [^]] character set, which matches all characters other than ].
Alternatively with awk:
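awk -v RS='],' -v ORS='],' '!/,85\.00$/' filename
This splits the input into records delimited by ], and prints only those that don't end with ,85.00. A multi-character RS is not covered by POSIX, so this needs an awk that supports it, such as GNU awk or mawk.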
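The desired output in the question has everything on one line without spaces between the entries. One possible way to get there, as a sketch only, is to join the lines with tr before running sed, and to squeeze the whitespace after each remaining ], in a second substitution. The scratch file path is made up for the example.

```shell
# Sketch: join all lines, drop the 85.00 entries, then remove the spaces
# left between the remaining entries.
workdir=$(mktemp -d)
cat > "$workdir/temp.txt" <<'EOF'
[1126604244001,17.00], [1126604244001,17.00], [1126604244001,17.00],
[1126604244001,85.00], [1122204245002,85.00], [1221104246003,85.00],
[1222204247004,85.00], [1823304248005,85.00], [1424404249006,85.00], [1126604244001,17.00], [1126604244001,17.00],
EOF

result=$(tr -d '\n' < "$workdir/temp.txt" |
         sed -E 's/\[[^]]*,85\.00\],[[:space:]]*//g; s/\],[[:space:]]+/],/g')
printf '%s\n' "$result"
```

This prints to standard output; redirect it to a new file and move that over temp.txt if the file itself should change.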
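grep -Ev '[^0-9]85\.00]' YourFile
This removes every line that contains your pattern. The whole line is dropped, so it only fits when such lines hold nothing but 85.00 entries.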