I need to get a particular string from a text file. The content of my file is below:
Components at each of the following levels must be
built before components at higher-numbered levels.
1. SACHHYA-opkg-utils master#964c29cc453ccd3d1b28fb40bae8df11c0dc3b3c
SACHHYA-web-SABARMATI-ap-page master#3bdc2dc1e5cee745cfced370201352045cd57195
SACHHYA-web-update-page master#24b0ffaad4d130ae5a2df0e470868846c7888392
SACHHYAWebMonaco Release/MR1_2019/3.0.7-570+36a238d
googletest-qc8017_32 branches/googletest#2692
LpmMfgTool Release/master/0.0.1-4+34833d6
opensource-avahi-qc8017_32 Release/SACHHYA-master/v1.0-4-gb70507e
opensource-OpenAvnuApple-qc8017_32 Release/SACHHYA-master/v1.0-1766-g1098033
opensource-opkg-qc8017_32 Release/SACHHYA-dev/v0.3.6.2-2-gb1e1aba
opensource-unzip-qc8017_32 Release/master/v6.0.0
opensource-util-linux-qc8017_32 Release/SACHHYA-master/1.5.0-10+877ade5
opensource-zip-qc8017_32 Release/master/v3.0.0
product-startup Release/master/4.0.0-5+5179185
ProductControllerCommon master#a1e71509aaaa9cf7a9e70d4e9c7bfc80d76e13a2
ProductUIAssets master#220944def647a72ce0194d43ef23f1d3fe146987
proprietary-airplay2-qc8017_32 Release/SACHHYA-master/2.0.2-15-g88c1c1d
SABARMATI-HSP-Images Release/master/4.4
SABARMATI-Toolchain Release/master/4.4
SABRMATILPM trunk#3408
SABARMATILpmTools #3604
SABARMATILpmUpdater Release/master/1.0.0-69+a38d6c8
The command that I am trying is:
awk /SACHHYAWebMonaco/ MyFile.txt
Using this command, I am able to get that particular line in which my string is present. Here is the result of the awk command :
SACHHYAWebMonaco Release/MR1_2019/3.0.7-570+36a238d
What I want to extract is only "3.0.7" (the version) from that line.
Does anyone have a suggestion for how to do that?
You can use / and - as field separators and print the third field.
This assumes that the format of the lines and the position of the version within them never change.
$ awk -F'[/-]' '/SACHHYAWebMonaco/ {print $3}' file
3.0.7
Perl solution
$ perl -F"[/-]" -lane ' print "$F[2]" if /SACHHYAWebMonaco/ ' sachhya.txt
3.0.7
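If the number of / and - characters before the version cannot be relied on, matching the shape of the version itself may be more robust. The following is only a sketch: it assumes the version is the first run of three dot-separated numbers in the second whitespace-separated field, and the helper name get_version is made up for illustration.

```shell
# Hypothetical helper: print the first x.y.z number found in the second
# field of the line whose first field equals the given component name.
get_version() {
    awk -v name="$1" '
        $1 == name && match($2, /[0-9]+\.[0-9]+\.[0-9]+/) {
            print substr($2, RSTART, RLENGTH)   # RSTART/RLENGTH are set by match()
            exit
        }' "$2"
}

# Usage: get_version SACHHYAWebMonaco MyFile.txt
```

Lines whose version has only two numeric parts (for example v1.0-4-gb70507e) produce no output with this pattern.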
I have a sequence file that has a repeated pattern that looks like this:
$>g34 | effector probability: 0.6
GPCKPRTSASNTLTTTLTTAEPTPTTIATETTIATSDSSKTTTIDNITTTTSEAESNTKTESSTIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS"
$>g104 | effector probability: 0.65
GIFSSLICATTAVTTGIICHGTVTLATGGTCALATLPAPTTSIAQTRTTTDTSEH
$>g115 | effector probability: 0.99
IAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS
and so on.
I want to extract the text between and including each >g## and create a new file titled protein_g##.faa
In the above example it would create a file called "protein_g34.faa" and it would be:
$>g34 | effector probability: 0.6
GPCKPRTSASNTLTTTLTTAEPTPTTIATETTIATSDSSKTTTIDNITTTTSEAESNTKTESSTIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS
I was trying to use sed but I am not very experienced using it. My guess was something like this:
$ sed -n '/^>g*/s///p; y/ /\n/' file > "g##"
but I can clearly tell that that is wrong... maybe the right thing is using awk?
Thanks!
Yeah, I would use awk for that. I don't think sed can easily write to output files whose names are taken from the data.
Here's how I would write that:
< input.txt awk '/^\$>/{fname = "protein_" substr($1, 3) ".faa"; print "sending to " fname} {print $0 > fname}'
Breaking it down into details:
< input.txt This part reads in the input file.
awk Runs awk.
/^\$>/ On lines which start with the literal string $>, run the piece of code in braces.
(If the previous pattern matched) {fname = "protein_" substr($1, 3) ".faa"; print "sending to " fname} Take the first field of the current line. Remove the first two characters of that field (the $>). Surround the rest with protein_ and .faa. Save the result in the variable fname. Print a message about switching files.
This next block has no condition before it. Implicitly, that means that it matches every line.
{print $0 > fname} Take the entire line, and send it to the file named by fname. If fname has not been set yet (the input does not start with a $> line), this will cause an error.
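The two failure modes mentioned in the breakdown (no file selected yet, and many output files open at once) can be guarded against. This is a minimal sketch, not the answer's original command: the function name split_proteins is hypothetical, and it assumes each id appears only once, since reopening a closed file with > would truncate it.

```shell
# Hypothetical wrapper: split a sequence file into protein_<id>.faa files.
split_proteins() {
    awk '
        /^\$>/ {
            if (fname != "") close(fname)   # release the previous file handle
            fname = "protein_" substr($1, 3) ".faa"
        }
        fname != "" { print > fname }       # skip lines that precede the first header
    ' "$1"
}

# Usage: split_proteins input.txt
```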
Hope that helps!
If awk is an option:
awk '/\|/ {split($1,a,">"); fname="protein_"a[2]".faa"} {print $0 >> fname}' src.dat
awk is better than sed for this problem. You can implement it in sed with
sed -rz 's/(\$>)(g[^ ]*)([^\n]*\n[^\n]*)\n/echo '\''\1\2\3'\'' > protein_\2.faa/ge' file
This solution is nice for showing some sed tricks:
-z for parsing fragments that span several lines
(..) for remembering strings
\$ matching a literal $
[^\n]* matching until end of line
'\'' for a single quote
End single quoted string, escape single quote and start new single quoted string
\2 for recalling the second remembered string
Write a bash command in the replacement string
e execute result of replacement
awk procedure
awk allows records to be extracted between empty (or white space only) lines by setting the record separator to an empty string RS=""
Thus the records intended for each file can be got automatically.
The id to be used in the filename can be extracted from field 1 $1 by splitting the (default white-space-separated) field at the ">" mark, and using element 2 of the split array (named id in this example).
Each file is closed right after it is written, which prevents "too many open files" errors if you have many records to process.
The awk procedure
The example data was saved in a file named all.seq and the following procedure used to process it:
awk 'BEGIN{RS="";} {split($1,id,">"); fn="protein_"id[2]".faa"; print $0 > fn; close(fn)}' all.seq
Test results
(terminal listings/outputs)
$ ls
all.seq protein_g104.faa protein_g115.faa protein_g34.faa
$ cat protein_g104.faa
$>g104 | effector probability: 0.65
GIFSSLICATTAVTTGIICHGTVTLATGGTCALATLPAPTTSIAQTRTTTDTSEH
$ cat protein_g115.faa
$>g115 | effector probability: 0.99
IAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS
$ cat protein_g34.faa
$>g34 | effector probability: 0.6
GPCKPRTSASNTLTTTLTTAEPTPTTIATETTIATSDSSKTTTIDNITTTTSEAESNTKTESSTIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS"
Tested using GNU Awk 5.1.0
I have a directory in which the file names look like
Abc_def_ijk.txt-1
Abc_def_ijk.txt-2
Abc_def_ijk.txt-3
Abc_def_ijk.txt-4
Abc_def_ijk.txt-5
Abc_def_ijk.txt-6
Abc_def_ijk.txt-7
Abc_def_ijk.txt-8
Abc_def_ijk.txt-9
I would like to divide them into 4 variables as below
v1=Abc_def_ijk.txt-1,Abc_def_ijk.txt-5,Abc_def_ijk.txt-9
V2=Abc_def_ijk.txt-2,Abc_def_ijk.txt-6
V3=Abc_def_ijk.txt-3,Abc_def_ijk.txt-7
V4=Abc_def_ijk.txt-4,Abc_def_ijk.txt-8
If the number of files increases, the new files should go into one of the above variables. I'm looking for an awk one-liner to achieve this.
I would do it with GNU AWK in the following way. Let file.txt content be
Abc_def_ijk.txt-1
Abc_def_ijk.txt-2
Abc_def_ijk.txt-3
Abc_def_ijk.txt-4
Abc_def_ijk.txt-5
Abc_def_ijk.txt-6
Abc_def_ijk.txt-7
Abc_def_ijk.txt-8
Abc_def_ijk.txt-9
then
awk '{arr[NR%4]=arr[NR%4] "," $0}END{print substr(arr[1],2);print substr(arr[2],2);print substr(arr[3],2);print substr(arr[0],2)}' file.txt
output
Abc_def_ijk.txt-1,Abc_def_ijk.txt-5,Abc_def_ijk.txt-9
Abc_def_ijk.txt-2,Abc_def_ijk.txt-6
Abc_def_ijk.txt-3,Abc_def_ijk.txt-7
Abc_def_ijk.txt-4,Abc_def_ijk.txt-8
Explanation: I store lines in the array arr and decide where to put each line based on the line number (NR) modulo (%) four (4). I concatenate what is currently stored (an empty string if nothing so far) with , and the content of the current line ($0). This results in a leading , which I remove with the substr function by starting at the 2nd character.
(tested in GNU Awk 5.0.1)
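The same modulo idea can be written for any number of groups, which avoids one print statement per group. A sketch only: the function name round_robin and its second argument (the group count) are illustrative assumptions, and input lines are assumed to be non-empty.

```shell
# Hypothetical helper: distribute the lines of a file round-robin
# into n comma-separated groups, one group per output line.
round_robin() {
    awk -v n="$2" '
        {
            k = (NR - 1) % n                               # 0-based group index
            grp[k] = (grp[k] == "" ? "" : grp[k] ",") $0   # no leading comma to strip
        }
        END { for (k = 0; k < n; k++) print grp[k] }
    ' "$1"
}

# Usage (bash 4+), filling an array so that v[0] is the first group:
# mapfile -t v < <(round_robin file.txt 4)
```

With the nine sample names and n=4, the first output line holds entries 1, 5 and 9, matching v1 in the question.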
I have the output of grep in a folder as below,
./Data1/TEST_Data1.xml:<def-query collection="FT_R1Event" count="-1" desc="" durationEnd="1" durationStart="0" durationType="CAL" fromWS="Data1" id="_q1" timeUnit="D">
./Data2/TEST_Data2.xml:<def-query collection="FT_R2Event" count="-1" desc="" durationEnd="2" durationStart="0" durationType="ABS" fromWS="Data2" id="_q1" timeUnit="M">
I want to extract the following parts, separated by some delimiter, say ',', as below:
Data1/TEST_Data1, durationEnd="1", timeUnit="D"
Data2/TEST_Data2, durationEnd="2", timeUnit="M"
Please help me achieve this using basic Linux commands.
I would do it with GNU AWK in the following way. Let file.txt content be
./Data1/TEST_Data1.xml:<def-query collection="FT_R1Event" count="-1" desc="" durationEnd="1" durationStart="0" durationType="CAL" fromWS="Data1" id="_q1" timeUnit="D">
./Data2/TEST_Data2.xml:<def-query collection="FT_R2Event" count="-1" desc="" durationEnd="2" durationStart="0" durationType="ABS" fromWS="Data2" id="_q1" timeUnit="M">
then
awk 'BEGIN{OFS=", ";FPAT="(^[^ ]+xml)|((durationEnd|timeUnit)=\"[^\"]+\")"}{gsub(/\.([/]|xml)/, "", $1);print}' file.txt
output
Data1/TEST_Data1, durationEnd="1", timeUnit="D"
Data2/TEST_Data2, durationEnd="2", timeUnit="M"
Explanation: I used FPAT to describe what a field looks like, rather than what separates fields. A field is either the leading run of non-space characters that ends in xml, or durationEnd or timeUnit followed by = and a non-empty value in double quotes. Then I remove from the first field every . that is followed by / or xml (the . is escaped so that it matches a literal dot). Finally I print everything, joined by , because I set that as the output field separator (OFS).
Disclaimer: I tested it only with shown samples.
(tested in gawk 4.2.1)
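FPAT is specific to gawk. Where only a POSIX awk is available, roughly the same result can be had by walking the whitespace-separated fields. This is a sketch under two assumptions: attribute values contain no spaces, and the function name extract_attrs is made up for illustration.

```shell
# Hypothetical helper: print "path, durationEnd=..., timeUnit=..." per line.
extract_attrs() {
    awk '{
        path = $1
        sub(/^\.\//, "", path)        # drop the leading ./
        sub(/\.xml:.*$/, "", path)    # drop .xml: and the start of the tag
        out = path
        for (i = 2; i <= NF; i++)
            if ($i ~ /^(durationEnd|timeUnit)=/) {
                val = $i
                sub(/>$/, "", val)    # the last attribute carries the closing >
                out = out ", " val
            }
        print out
    }' "$1"
}

# Usage: extract_attrs file.txt
```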
I'm making this replacement:
sed 's/<n3:CustId>.*<\/n3:CustId>/<n3:CustId>'"${orgkey}"'<\/n3:CustId>/' CAMBIOMINI.txt > CAMBIOMINI2.txt
But now I want to replace line by line with a different orgkey value each time. I want orgkey+=1, but I don't know how to do that in the same command for the whole CAMBIOMINI.txt file.
Sed may not be suitable if you want to alter the substitution
for each occurrence.
If my understanding of your requirement is correct, the following would work:
awk 'FNR==NR {orgkey[++i]=$0; next}
{print gensub(/<n3:CustId>[^<]*<\/n3:CustId>/,"<n3:CustId>" orgkey[++j] "</n3:CustId>", "g")} ' orgkey.txt CAMBIOMINI1.txt
where orgkey.txt holds the list of substitutions:
orgkey_a
orgkey_b
orgkey_c
orgkey_d
and CAMBIOMINI1.txt will look like:
<n3:CustId>id1</n3:CustId>
<n3:CustId>id2</n3:CustId>
<n3:CustId>id3</n3:CustId>
<n3:CustId>id4</n3:CustId>
then the result will be:
<n3:CustId>orgkey_a</n3:CustId>
<n3:CustId>orgkey_b</n3:CustId>
<n3:CustId>orgkey_c</n3:CustId>
<n3:CustId>orgkey_d</n3:CustId>
Note that this assumes the tag does not appear more than once on the
same line of CAMBIOMINI1.txt, as in:
<n3:CustId>id1</n3:CustId> <n3:CustId>id2</n3:CustId>
<n3:CustId>id3</n3:CustId>
<n3:CustId>id4</n3:CustId>
In that case, use a Perl version instead:
perl -nle 'if (@ARGV) {push(@orgkey, $_); next}
s#<n3:CustId>.*?</n3:CustId>#"<n3:CustId>" . $orgkey[$j++] . "</n3:CustId>"#ge; print' orgkey.txt CAMBIOMINI1.txt
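If the new values are simply consecutive numbers, as the orgkey+=1 in the question suggests, a counter inside awk avoids the separate list file. A minimal sketch, assuming orgkey is an integer and at most one tag per line; the function name renumber_custid and the start value are illustrative.

```shell
# Hypothetical helper: replace each CustId value with a counter that
# grows by one for every line containing the tag.
renumber_custid() {
    awk -v orgkey="$2" '
        /<n3:CustId>[^<]*<\/n3:CustId>/ {
            sub(/<n3:CustId>[^<]*<\/n3:CustId>/, "<n3:CustId>" orgkey "</n3:CustId>")
            orgkey++                  # next matching line gets the next value
        }
        { print }                     # lines without the tag pass through unchanged
    ' "$1"
}

# Usage: renumber_custid CAMBIOMINI.txt 1000 > CAMBIOMINI2.txt
```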
Case example:
$ cat data.txt
foo,bar,moo
I can obtain the field data by using cut, assuming , as the separator, but only if I know its position. Example to obtain the value bar (second field):
$ cat data.txt | cut -d "," -f 2
bar
How can I obtain that same bar (or field number == 2) if I only know it contains the letter a?
Something like:
$ cat data.txt | reversecut -d "," --string "a"
[results could be both "2" or "bar"]
In other words: how can I find out which field contains a given substring in a delimited text file using Linux shell commands/tools?
Of course, programming is allowed, but do I really need looping and conditional structures? Isn't there a command that solves this?
If the shell matters, I would prefer Bash solutions.
A close solution here, but not exactly the same.
More scenarios based on the same example (upon request):
For a search pattern of m or mo, the result could be either 3 or moo.
For a search pattern of f or fo, the result could be either 1 or foo.
The following simple awk may also help.
awk -F, '$2~/a/{print $2}' data.txt
The output will be bar in this case.
Explanation:
-F,: Sets the field separator to a comma, so that the fields are easy to identify.
$2~/a/: Checks whether the 2nd field contains the letter a; if it does, the 2nd field is printed.
EDIT: Adding a solution as per the OP's comment and the edited question.
Let's say the following Input_file is there
cat data.txt
foo,bar,moo
mo,too,far
foo,test,test1
fo,test2,test3
Then the following is the code for it:
awk -F, '{for(i=1;i<=NF;i++){if($i ~ /fo/){print $i}}}' data.txt
foo
foo
fo
OR
awk -F, '{for(i=1;i<=NF;i++){if($i ~ /mo/){print $i}}}' data.txt
moo
mo
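The question also asks for the field number, which the loop already has in i. A sketch that prints both the position and the value; the function name find_field is hypothetical, and index() is used so that the search text is treated as a plain substring rather than a regular expression.

```shell
# Hypothetical helper: print "field_number value" for every
# comma-separated field that contains the given substring.
find_field() {
    awk -F, -v pat="$1" '{
        for (i = 1; i <= NF; i++)
            if (index($i, pat)) print i, $i   # index() returns 0 when pat is absent
    }' "$2"
}

# Usage: find_field a data.txt    # prints "2 bar" for the line foo,bar,moo
```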