How to extract specific key value pairs from a grep output - linux
I have the output of grep in a folder as below,
./Data1/TEST_Data1.xml:<def-query collection="FT_R1Event" count="-1" desc="" durationEnd="1" durationStart="0" durationType="CAL" fromWS="Data1" id="_q1" timeUnit="D">
./Data2/TEST_Data2.xml:<def-query collection="FT_R2Event" count="-1" desc="" durationEnd="2" durationStart="0" durationType="ABS" fromWS="Data2" id="_q1" timeUnit="M">
I want to extract the below followed by some delimiter, say ',' as below,
Data1/TEST_Data1, durationEnd="1", timeUnit="D"
Data2/TEST_Data2, durationEnd="2", timeUnit="M"
Please help me achieve this using basic Linux commands.
I would do it using GNU AWK in the following way. Let the content of file.txt be
./Data1/TEST_Data1.xml:<def-query collection="FT_R1Event" count="-1" desc="" durationEnd="1" durationStart="0" durationType="CAL" fromWS="Data1" id="_q1" timeUnit="D">
./Data2/TEST_Data2.xml:<def-query collection="FT_R2Event" count="-1" desc="" durationEnd="2" durationStart="0" durationType="ABS" fromWS="Data2" id="_q1" timeUnit="M">
then
awk 'BEGIN{OFS=", ";FPAT="(^[^ ]+xml)|((durationEnd|timeUnit)=\"[^\"]+\")"}{gsub(/\.([/]|xml)/, "", $1);print}' file.txt
output
Data1/TEST_Data1, durationEnd="1", timeUnit="D"
Data2/TEST_Data2, durationEnd="2", timeUnit="M"
Explanation: I used FPAT to describe the interesting parts of the input, namely a run of non-space characters from the start of the line that ends in xml, or durationEnd or timeUnit followed by = and a quoted, non-empty value. Then I remove ./ and .xml from the first field (the . has to be a literal dot, so it is escaped). Finally I print everything, joined by ", " because I set that as the output field separator (OFS).
Disclaimer: I tested it only with shown samples.
(tested in gawk 4.2.1)
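FPAT is a gawk extension, so the command above will not work in other awk implementations. As a rough sketch of the same idea without FPAT, the function below uses match() to pull out each attribute. The name extract_pairs and the helper attr are made up for illustration, and it assumes every attribute is preceded by a single space, as in the sample lines.

```shell
# Hypothetical helper: print "path, durationEnd=..., timeUnit=..." per grep line.
extract_pairs() {
  awk '
    # Return the first name="value" pair for the given attribute, or "".
    function attr(line, name) {
      if (match(line, " " name "=\"[^\"]*\""))
        return substr(line, RSTART + 1, RLENGTH - 1)  # skip the leading space
      return ""
    }
    {
      path = $0
      sub(/\.xml:.*/, "", path)  # drop ".xml:" and everything after it
      sub(/^\.\//, "", path)     # drop the leading "./"
      print path ", " attr($0, "durationEnd") ", " attr($0, "timeUnit")
    }
  ' "$@"
}

# Demo with the first sample line from the question.
printf '%s\n' './Data1/TEST_Data1.xml:<def-query collection="FT_R1Event" count="-1" desc="" durationEnd="1" durationStart="0" durationType="CAL" fromWS="Data1" id="_q1" timeUnit="D">' | extract_pairs
```

Like the gawk version, this has only been thought through against the two sample lines.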
Related
Convert specific field to upper case by other field using sed
Using sed, I need to convert the second field to upper case when the city field is miami or chicago. My code currently looks like this; it converts every name to upper case without filtering by city.
id,name,country,sex,year,price,team,city,x,y,z
266,Aaron Russell,USA,m,1989,50,12,miami,0,0,1
179872,Abbos Rakhmonov,UZB,m,1979,0,25,chicago,0,0,0
3662,Abby Erceg,NZL,m,1977,67,20,toronto,0,0,0
5554573,Amar Music,CRO,m,1991,110,24,miami,0,0,0
3931111,Anirban Lahiri,IND,m,1987,105,27,boston,0,0,0
98402,Anissa Khelfaoui,ALG,f,1967,45,2,toronto,0,0,0
sed 's/^\(\([^,]*,\)\{1\}\)\([^,]*\)/\1\U\3/'
My output:
id,name,country,sex,year,price,team,city,x,y,z
266,AARON RUSELL,USA,m,1989,50,12,miami,0,0,1
179872,ABBOS RAKHMONV,UZB,m,1979,0,25,chicago,0,0,0
3662,ABBY ERCEG,NZL,m,1977,67,20,toronto,0,0,0,
5554573,AMAR MUSIC,CRO,m,1991,110,24,miami,0,0,0,
393115111,ANIRBAN LAHIRI,IND,m,1987,105,27,boston,0,0,0
998460252,ANISSA KHELFAOUI,ALG,f,1967,45,2,toronto,0,0,0
Expected output, using only sed:
id,name,country,sex,year,price,team,city,x,y,z
266,AARON RUSELL,USA,m,1989,50,12,miami,0,0,1
179872,ABBOS RAKHMONV,UZB,m,1979,0,25,chicago,0,0,0
3662,Abby Erceg,NZL,m,1977,67,20,toronto,0,0,0
5554573,AMAR MUSIC,CRO,m,1991,110,24,miami,0,0,0
393115111,Anirban Lahiri,IND,m,1987,105,27,boston,0,0,0
998460252,Anissa Khelfaoui,ALG,f,1967,45,2,toronto,0,0,0
Easier IMHO with awk:
awk 'BEGIN{city=8; FS=OFS=","} $city=="miami" || $city=="chicago" {$2=toupper($2)} 1' file
Prints:
id,name,country,sex,year,price,team,city,x,y,z
266,AARON RUSSELL,USA,m,1989,50,12,miami,0,0,1
179872,ABBOS RAKHMONOV,UZB,m,1979,0,25,chicago,0,0,0
3662,Abby Erceg,NZL,m,1977,67,20,toronto,0,0,0
5554573,AMAR MUSIC,CRO,m,1991,110,24,miami,0,0,0
3931111,Anirban Lahiri,IND,m,1987,105,27,boston,0,0,0
98402,Anissa Khelfaoui,ALG,f,1967,45,2,toronto,0,0,0
sed -E '/^([^,]*,){7}(miami|chicago),/{s/(^[^,]*,)([^,]+)/\1\U\2/}'
This relies on counting commas: seven comma-terminated fields are skipped to test field 8, and one is skipped to reach field 2. If any field in the CSV contained quoted or escaped commas, this would break. Note that the \U syntax is specific to GNU sed and is not portable. This task would probably be expressed more clearly in awk, and gawk can also handle quoted commas using FPAT.
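Both answers above hard-code the field positions (2 and 8). As a small sketch of a variation that does not, the function below looks up the columns by their header names using only POSIX awk features. The function name upcase_names is made up for illustration, and the sketch assumes a plain CSV without quoted commas whose header row contains the columns name and city, as in the sample.

```shell
# Hypothetical helper: upper-case the "name" column for selected cities.
upcase_names() {
  awk -F, -v OFS=, '
    NR == 1 {
      # Find the column numbers from the header row.
      for (i = 1; i <= NF; i++) {
        if ($i == "city") city = i
        if ($i == "name") name = i
      }
      print
      next
    }
    # Only touch rows whose city matches; skip if a column was not found.
    city && name && ($city == "miami" || $city == "chicago") {
      $name = toupper($name)
    }
    { print }
  ' "$@"
}

# Demo with a few rows from the question.
printf '%s\n' \
  'id,name,country,sex,year,price,team,city,x,y,z' \
  '266,Aaron Russell,USA,m,1989,50,12,miami,0,0,1' \
  '3662,Abby Erceg,NZL,m,1977,67,20,toronto,0,0,0' | upcase_names
```

If the column order changes, only the header lookup has to cope with it; the rule itself stays the same.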
awk command to split filename based on substring
I have a directory in which the file names look like this:
Abc_def_ijk.txt-1
Abc_def_ijk.txt-2
Abc_def_ijk.txt-3
Abc_def_ijk.txt-4
Abc_def_ijk.txt-5
Abc_def_ijk.txt-6
Abc_def_ijk.txt-7
Abc_def_ijk.txt-8
Abc_def_ijk.txt-9
I would like to divide them into 4 variables as below:
v1=Abc_def_ijk.txt-1,Abc_def_ijk.txt-5,Abc_def_ijk.txt-9
V2=Abc_def_ijk.txt-2,Abc_def_ijk.txt-6
V3=Abc_def_ijk.txt-3,Abc_def_ijk.txt-7
V4=Abc_def_ijk.txt-4,Abc_def_ijk.txt-8
If the number of files increases, the extra files should go into one of the variables above. I'm looking for an awk one-liner to achieve this.
I would do it using GNU AWK in the following way. Let the content of file.txt be
Abc_def_ijk.txt-1
Abc_def_ijk.txt-2
Abc_def_ijk.txt-3
Abc_def_ijk.txt-4
Abc_def_ijk.txt-5
Abc_def_ijk.txt-6
Abc_def_ijk.txt-7
Abc_def_ijk.txt-8
Abc_def_ijk.txt-9
then
awk '{arr[NR%4]=arr[NR%4] "," $0}END{print substr(arr[1],2);print substr(arr[2],2);print substr(arr[3],2);print substr(arr[0],2)}' file.txt
output
Abc_def_ijk.txt-1,Abc_def_ijk.txt-5,Abc_def_ijk.txt-9
Abc_def_ijk.txt-2,Abc_def_ijk.txt-6
Abc_def_ijk.txt-3,Abc_def_ijk.txt-7
Abc_def_ijk.txt-4,Abc_def_ijk.txt-8
Explanation: I store the lines in the array arr and decide where to put a given line based on the line number (NR) modulo (%) four (4). I concatenate what is currently stored (an empty string if nothing so far) with , and the content of the current line ($0). This results in a leading , which I remove using the substr function, i.e. by starting at the 2nd character. (tested in GNU Awk 5.0.1)
extract certain string from variable
I've got a text file containing the HTML source of a web page. There are lines with "data-adid="...", and these are the lines I'd like to capture. Therefore, I use:
Id=$(grep -m 10 -A 1 "data-adid" Textfile)
to get the first ten results. The variable Id contains the following:
<arcicle class="aditem" data-adid="1234567890"
<div class="aditem-image">
--
<arcicle class="aditem" data-adid="2134567890"
<div class="aditem-image">
--
<arcicle class="aditem" data-adid="2134567890"
<div class="aditem-image">
--
...
I would like to get the following output:
id="1234567890"
id="2134567890"
id="3124567890"
When using the grep command, I only manage to get the numbers, e.g.
Id2=$(echo $Id | grep -oP '(?<=data-ad=").*?(?=")')
gets
1234567890
2134567890
3124567890
When trying
Id2=$(echo $Id | grep -oP '(?<=data-ad).*?(?=")')
this only gives me
id=
id=
id=
How could the code be changed to get the desired output?
HTML should really be handled with tools that understand HTML well, but since the OP mentions needing shell-like tools, I would go for awk for this one. Written and tested at https://ideone.com/EpU1aW
echo "$var" |
awk '
match($0,/data-adid="[^"]*"/){
  val=substr($0,RSTART,RLENGTH)
  sub(/^data-ad/,"",val)
  print val
  val=""
}
'
The pattern data-ad matches only data-ad; you need to match the id= part as well, from the opening " up to the next ". I also see no reason to use fancy lookarounds: just match the string and output only the part you want.
grep -oP 'data-ad\Kid="[^"]*"'
should be enough. Note that the unquoted $Id undergoes word splitting and most probably should be quoted, and that HTML cannot be parsed reliably with regular expressions, so you should most probably use HTML-aware tools instead.
With any sed:
$ sed 's/.*data-ad\(id="[^"]*"\).*/\1/' file
id="1234567890"
id="2134567890"
id="2134567890"
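One detail worth spelling out: when the input is the grep -A 1 output stored in the variable, it also contains the <div ...> and -- lines, and a plain s/// prints those unchanged. A minimal sketch that prints only the lines where the substitution succeeded uses -n together with the p flag, and quotes the variable so its line breaks survive. The function name extract_ids is made up for illustration; the sample value of Id is taken from the question.

```shell
# Hypothetical helper: print only the id="..." part of lines with data-adid.
extract_ids() {
  # -n suppresses default output; the p flag prints only substituted lines.
  sed -n 's/.*data-ad\(id="[^"]*"\).*/\1/p'
}

# Sample data in the shape of the grep -A 1 output from the question.
Id='<arcicle class="aditem" data-adid="1234567890"
<div class="aditem-image">
--
<arcicle class="aditem" data-adid="2134567890"
<div class="aditem-image">'

# Quote "$Id" so the shell does not collapse the lines into one.
printf '%s\n' "$Id" | extract_ids
```

The same caveat as in the other answers applies: this is pattern matching on text, not HTML parsing.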
Get a particular string from text file
I need to get a particular string from a text file. The content of my file is below:
Components at each of the following levels must be built before components at higher-numbered levels.
1. SACHHYA-opkg-utils master#964c29cc453ccd3d1b28fb40bae8df11c0dc3b3c
SACHHYA-web-SABARMATI-ap-page master#3bdc2dc1e5cee745cfced370201352045cd57195
SACHHYA-web-update-page master#24b0ffaad4d130ae5a2df0e470868846c7888392
SACHHYAWebMonaco Release/MR1_2019/3.0.7-570+36a238d
googletest-qc8017_32 branches/googletest#2692
LpmMfgTool Release/master/0.0.1-4+34833d6
opensource-avahi-qc8017_32 Release/SACHHYA-master/v1.0-4-gb70507e
opensource-OpenAvnuApple-qc8017_32 Release/SACHHYA-master/v1.0-1766-g1098033
opensource-opkg-qc8017_32 Release/SACHHYA-dev/v0.3.6.2-2-gb1e1aba
opensource-unzip-qc8017_32 Release/master/v6.0.0
opensource-util-linux-qc8017_32 Release/SACHHYA-master/1.5.0-10+877ade5
opensource-zip-qc8017_32 Release/master/v3.0.0
product-startup Release/master/4.0.0-5+5179185
ProductControllerCommon master#a1e71509aaaa9cf7a9e70d4e9c7bfc80d76e13a2
ProductUIAssets master#220944def647a72ce0194d43ef23f1d3fe146987
proprietary-airplay2-qc8017_32 Release/SACHHYA-master/2.0.2-15-g88c1c1d
SABARMATI-HSP-Images Release/master/4.4
SABARMATI-Toolchain Release/master/4.4
SABRMATILPM trunk#3408
SABARMATILpmTools #3604
SABARMATILpmUpdater Release/master/1.0.0-69+a38d6c8
The command that I am trying is:
awk /SACHHYAWebMonaco/ MyFile.txt
Using this command, I am able to get the particular line in which my string is present. Here is the result of the awk command:
SACHHYAWebMonaco Release/MR1_2019/3.0.7-570+36a238d
What I want to grep is only "3.0.7" (which is the version) from that line. Does anyone have a suggestion for how to do that?
You can use / and - as field separators and print the third field. This assumes that the format of the lines and the position of the information you seek never change. The separator is quoted so that the shell does not try to expand the brackets as a glob pattern.
$ awk -F'[/-]' '/SACHHYAWebMonaco/ {print $3}' file
3.0.7
Perl solution:
$ perl -F"[/-]" -lane ' print "$F[2]" if /SACHHYAWebMonaco/ ' sachhya.txt
3.0.7
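Both answers depend on the version being the third /- or --separated field. As a sketch of an alternative that depends on the shape of the version instead, the function below matches the first run of three dot-separated numbers in the second column. The function name get_version is made up for illustration, and the sketch assumes that the component name is the first whitespace-separated word on its line, as it is for SACHHYAWebMonaco in the sample.

```shell
# Hypothetical helper: print the first N.N.N version of a named component.
get_version() {
  awk -v comp="$1" '
    # Exact match on the component name, then look for digits.digits.digits.
    $1 == comp && match($2, /[0-9]+\.[0-9]+\.[0-9]+/) {
      print substr($2, RSTART, RLENGTH)
      exit
    }
  '
}

# Demo with two lines from the question's file.
printf '%s\n' \
  'SACHHYAWebMonaco Release/MR1_2019/3.0.7-570+36a238d' \
  'product-startup Release/master/4.0.0-5+5179185' | get_version SACHHYAWebMonaco
```

Comparing $1 with == also avoids accidentally matching a longer component name that merely contains the search string.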
Obtaining the field that contains a value or string on Linux shell
Case example:
$ cat data.txt
foo,bar,moo
I can obtain a field by using cut, assuming , as the separator, but only if I know its position. Example to obtain the value bar (second field):
$ cat data.txt | cut -d "," -f 2
bar
How can I obtain that same bar (or the field number, 2) if I only know that it contains the letter a? Something like:
$ cat data.txt | reversecut -d "," --string "a"
[results could be either "2" or "bar"]
In other words: how can I find out which field contains a substring in a delimited text file using Linux shell commands/tools? Of course, programming is allowed, but do I really need loops and conditional structures? Isn't there a command that solves this? As for a specific shell, I would prefer Bash solutions. A close solution here, but not exactly the same.
A further scenario based on the same example (upon request): For a search pattern of m or mo, the result could be either 3 or moo. For a search pattern of f or fo, the result could be either 1 or foo.
The following simple awk may help:
awk -F, '$2~/a/{print $2}' data.txt
The output will be bar in this case.
Explanation:
-F,: sets the field separator to a comma, so the fields can be identified easily.
$2~/a/: checks whether the 2nd field contains the letter a and, if so, prints that field.
EDIT: Adding a solution as per the OP's comment and the edited question. Let's say the following Input_file is given:
cat data.txt
foo,bar,moo
mo,too,far
foo,test,test1
fo,test2,test3
Then the code is:
awk -F, '{for(i=1;i<=NF;i++){if($i ~ /fo/){print $i}}}' data.txt
foo
foo
fo
OR
awk -F, '{for(i=1;i<=NF;i++){if($i ~ /mo/){print $i}}}' data.txt
moo
mo
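Since the question asks for either the field number or the value, a small variation on the loop above can print both. This is only a sketch: the function name find_field and the number:value output format are made up for illustration. It uses index() so that the search string is treated as a literal substring rather than a regular expression.

```shell
# Hypothetical helper: print "number:value" for every comma-separated field
# that contains the given literal substring.
find_field() {
  awk -F, -v pat="$1" '
    {
      for (i = 1; i <= NF; i++)
        if (index($i, pat)) print i ":" $i
    }
  '
}

printf 'foo,bar,moo\n' | find_field a
printf 'foo,bar,moo\n' | find_field mo
```

For the sample line this prints 2:bar for the pattern a and 3:moo for the pattern mo.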