How to extract specific key value pairs from a grep output - linux

I have the output of grep in a folder as below,
./Data1/TEST_Data1.xml:<def-query collection="FT_R1Event" count="-1" desc="" durationEnd="1" durationStart="0" durationType="CAL" fromWS="Data1" id="_q1" timeUnit="D">
./Data2/TEST_Data2.xml:<def-query collection="FT_R2Event" count="-1" desc="" durationEnd="2" durationStart="0" durationType="ABS" fromWS="Data2" id="_q1" timeUnit="M">
I want to extract the following pieces, separated by some delimiter, say ',', as below:
Data1/TEST_Data1, durationEnd="1", timeUnit="D"
Data2/TEST_Data2, durationEnd="2", timeUnit="M"
Please help me achieve this using basic Linux commands.

I would do it with GNU AWK in the following way. Let the content of file.txt be
./Data1/TEST_Data1.xml:<def-query collection="FT_R1Event" count="-1" desc="" durationEnd="1" durationStart="0" durationType="CAL" fromWS="Data1" id="_q1" timeUnit="D">
./Data2/TEST_Data2.xml:<def-query collection="FT_R2Event" count="-1" desc="" durationEnd="2" durationStart="0" durationType="ABS" fromWS="Data2" id="_q1" timeUnit="M">
then
awk 'BEGIN{OFS=", ";FPAT="(^[^ ]+xml)|((durationEnd|timeUnit)=\"[^\"]+\")"}{gsub(/\.([/]|xml)/, "", $1);print}' file.txt
output
Data1/TEST_Data1, durationEnd="1", timeUnit="D"
Data2/TEST_Data2, durationEnd="2", timeUnit="M"
Explanation: I use FPAT to describe the fields of interest: either a run of non-space characters at the start of the line that ends in xml, or durationEnd or timeUnit followed by = and a non-empty double-quoted value. gsub then removes from the first field every . that is followed by / or xml (the . is escaped so that it is matched literally). Finally, print outputs the fields joined by ", ", because that is what I set as the output field separator (OFS).
Disclaimer: I tested it only with shown samples.
(tested in gawk 4.2.1)
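FPAT is a gawk extension, so the command above will not work in other awks. A rough sketch of the same extraction using only POSIX awk features might look like the following. The function name extract_pairs is made up for illustration, and the sketch assumes that attribute values never contain spaces, which holds for the sample lines but has not been checked beyond them.

```shell
# Hypothetical helper: print "path, durationEnd=..., timeUnit=..." for each grep line.
# Assumption: fields are separated by spaces and attribute values contain no spaces.
extract_pairs() {
  awk '{
    path = $1
    sub(/^\.\//, "", path)        # drop the leading ./
    sub(/\.xml:.*$/, "", path)    # drop .xml: and the tag name glued to it
    out = path
    for (i = 2; i <= NF; i++)
      if ($i ~ /^(durationEnd|timeUnit)="[^"]*"/) {
        f = $i
        sub(/>$/, "", f)          # the last attribute carries the closing >
        out = out ", " f
      }
    print out
  }' "$@"
}

# Usage: extract_pairs file.txt    (or pipe the grep output into extract_pairs)
```

Attributes are printed in the order in which they appear on the line, so this matches the requested output only while durationEnd comes before timeUnit.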

Related

Convert specific field to upper case by other field using sed

Using sed, I need to convert the second field to upper case when the city field is miami or chicago.
My code currently looks like this; it converts every name to upper case without filtering by city.
id,name,country,sex,year,price,team,city,x,y,z
266,Aaron Russell,USA,m,1989,50,12,miami,0,0,1
179872,Abbos Rakhmonov,UZB,m,1979,0,25,chicago,0,0,0
3662,Abby Erceg,NZL,m,1977,67,20,toronto,0,0,0
5554573,Amar Music,CRO,m,1991,110,24,miami,0,0,0
3931111,Anirban Lahiri,IND,m,1987,105,27,boston,0,0,0
98402,Anissa Khelfaoui,ALG,f,1967,45,2,toronto,0,0,0
sed 's/^\(\([^,]*,\)\{1\}\)\([^,]*\)/\1\U\3/'
My output:
id,name,country,sex,year,price,team,city,x,y,z
266,AARON RUSELL,USA,m,1989,50,12,miami,0,0,1
179872,ABBOS RAKHMONV,UZB,m,1979,0,25,chicago,0,0,0
3662,ABBY ERCEG,NZL,m,1977,67,20,toronto,0,0,0,
5554573,AMAR MUSIC,CRO,m,1991,110,24,miami,0,0,0,
393115111,ANIRBAN LAHIRI,IND,m,1987,105,27,boston,0,0,0
998460252,ANISSA KHELFAOUI,ALG,f,1967,45,2,toronto,0,0,0
Expected output, using only sed:
id,name,country,sex,year,price,team,city,x,y,z
266,AARON RUSELL,USA,m,1989,50,12,miami,0,0,1
179872,ABBOS RAKHMONV,UZB,m,1979,0,25,chicago,0,0,0
3662,Abby Erceg,NZL,m,1977,67,20,toronto,0,0,0
5554573,AMAR MUSIC,CRO,m,1991,110,24,miami,0,0,0
393115111,Anirban Lahiri,IND,m,1987,105,27,boston,0,0,0
998460252,Anissa Khelfaoui,ALG,f,1967,45,2,toronto,0,0,0
Easier IMHO with awk:
awk 'BEGIN{city=8; FS=OFS=","}
$city=="miami" || $city=="chicago" {$2=toupper($2)} 1' file
Prints:
id,name,country,sex,year,price,team,city,x,y,z
266,AARON RUSSELL,USA,m,1989,50,12,miami,0,0,1
179872,ABBOS RAKHMONOV,UZB,m,1979,0,25,chicago,0,0,0
3662,Abby Erceg,NZL,m,1977,67,20,toronto,0,0,0
5554573,AMAR MUSIC,CRO,m,1991,110,24,miami,0,0,0
3931111,Anirban Lahiri,IND,m,1987,105,27,boston,0,0,0
98402,Anissa Khelfaoui,ALG,f,1967,45,2,toronto,0,0,0
sed -E '/^([^,]*,){7}(miami|chicago),/{s/(^[^,]*,)([^,]+)/\1\U\2/}'
This relies on matching comma number 1 and comma number 7 in each line (i.e. to locate field 2 and field 8). If any field in the CSV contained quoted or escaped commas, this would break.
Note that \U syntax is specific to GNU sed, and not portable.
This task would probably be more clearly expressed in awk, and gawk can also handle quoted commas, using FPAT.
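As one possible refinement of the awk approach, the column positions can be looked up from the header line instead of hard-coding city=8. The sketch below uses only POSIX awk; the function name upper_names_by_city is invented here, and it assumes that the first line is a header containing the columns name and city and that no field contains a quoted comma.

```shell
# Hypothetical helper: upper-case the "name" column where "city" is miami or chicago.
# Assumption: line 1 is a header naming the columns; fields contain no quoted commas.
upper_names_by_city() {
  awk 'BEGIN { FS = OFS = "," }
  NR == 1 {
    for (i = 1; i <= NF; i++) col[$i] = i   # map header name -> column number
    print
    next
  }
  $(col["city"]) == "miami" || $(col["city"]) == "chicago" {
    $(col["name"]) = toupper($(col["name"]))
  }
  { print }' "$@"
}

# Usage: upper_names_by_city file
```

Looking columns up by name keeps the command working if columns are reordered, at the cost of requiring a header line.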

awk command to split filename based on substring

I have a directory in which the file names look like
Abc_def_ijk.txt-1
Abc_def_ijk.txt-2
Abc_def_ijk.txt-3
Abc_def_ijk.txt-4
Abc_def_ijk.txt-5
Abc_def_ijk.txt-6
Abc_def_ijk.txt-7
Abc_def_ijk.txt-8
Abc_def_ijk.txt-9
I would like to divide them into 4 variables as below
v1=Abc_def_ijk.txt-1,Abc_def_ijk.txt-5,Abc_def_ijk.txt-9
V2=Abc_def_ijk.txt-2,Abc_def_ijk.txt-6
V3=Abc_def_ijk.txt-3,Abc_def_ijk.txt-7
V4=Abc_def_ijk.txt-4,Abc_def_ijk.txt-8
If the number of files increases, the new files should go into any of the above variables. I'm looking for an awk one-liner to achieve this.
I would do it with GNU AWK in the following way. Let the content of file.txt be
Abc_def_ijk.txt-1
Abc_def_ijk.txt-2
Abc_def_ijk.txt-3
Abc_def_ijk.txt-4
Abc_def_ijk.txt-5
Abc_def_ijk.txt-6
Abc_def_ijk.txt-7
Abc_def_ijk.txt-8
Abc_def_ijk.txt-9
then
awk '{arr[NR%4]=arr[NR%4] "," $0}END{print substr(arr[1],2);print substr(arr[2],2);print substr(arr[3],2);print substr(arr[0],2)}' file.txt
output
Abc_def_ijk.txt-1,Abc_def_ijk.txt-5,Abc_def_ijk.txt-9
Abc_def_ijk.txt-2,Abc_def_ijk.txt-6
Abc_def_ijk.txt-3,Abc_def_ijk.txt-7
Abc_def_ijk.txt-4,Abc_def_ijk.txt-8
Explanation: I store the lines in the array arr and decide where to put a given line based on the line number (NR) modulo (%) four (4). Each line ($0) is appended, preceded by a comma, to whatever is currently stored under that key (an empty string if nothing so far). This leaves a leading comma, which I remove with the substr function by starting at the 2nd character.
(tested in GNU Awk 5.0.1)

extract certain string from variable

I've got a text file containing the HTML source of a web page. Some lines contain data-adid="...", and these are the lines I'd like to capture.
Therefore, I use:
Id=$(grep -m 10 -A 1 "data-adid" Textfile)
to get the first ten results.
The variable Id contains the following:
<arcicle class="aditem" data-adid="1234567890" <div class="aditem-image"> --
<arcicle class="aditem" data-adid="2134567890" <div class="aditem-image"> --
<arcicle class="aditem" data-adid="2134567890" <div class="aditem-image"> --
...
I would like to get the following output:
id="1234567890" id="2134567890" id="3124567890"
When using the grep command, I only manage to get the numbers, e.g.
Id2=$(echo $Id | grep -oP '(?<=data-ad=").*?(?=")')
gets 1234567890 2134567890 3124567890
When trying
Id2=$(echo $Id | grep -oP '(?<=data-ad).*?(?=")')
this will only give me id= id= id=
How could the code be changed to get the desired output?
HTML should ideally be handled by tools that understand HTML, but since the OP asks for shell tools, I would use awk for this one (the text is assumed to be in the shell variable var). Written and tested at https://ideone.com/EpU1aW
echo "$var" |
awk '
match($0,/data-adid="[^"]*"/){
val=substr($0,RSTART,RLENGTH)
sub(/^data-ad/,"",val)
print val
val=""
}
'
The pattern data-ad matches only data-ad. Match the id= part as well, from the opening " up to the next ". I also see no reason to use fancy lookarounds: just match the string and output only the part you want.
grep -oP 'data-ad\Kid="[^"]*"'
Should be enough. Note that $Id is unquoted, so it undergoes word splitting, and most probably should be quoted. Also, HTML cannot be parsed reliably with regular expressions, so you should most probably use HTML-aware tools instead.
With any sed:
$ sed 's/.*data-ad\(id="[^"]*"\).*/\1/' file
id="1234567890"
id="2134567890"
id="2134567890"
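To get all ids on a single line, as the question asks, the sed substitution above can be combined with paste. This is only a sketch: the function name ids_on_one_line is invented here, the sample text is shortened from the question, and it assumes at most one data-adid attribute per line.

```shell
# Hypothetical helper: print every id="..." found on stdin, space-separated on one line.
# -n plus the p flag prints only lines where the substitution matched,
# so separator lines such as "--" are skipped.
ids_on_one_line() {
  sed -n 's/.*data-ad\(id="[^"]*"\).*/\1/p' | paste -s -d ' ' -
}

Id='<arcicle class="aditem" data-adid="1234567890" <div class="aditem-image">
--
<arcicle class="aditem" data-adid="2134567890" <div class="aditem-image">'

# Quote "$Id" so its newlines survive; unquoted, the shell would word-split it.
printf '%s\n' "$Id" | ids_on_one_line
# prints: id="1234567890" id="2134567890"
```

If a line held several data-adid attributes, the greedy .* would keep only the last one, so this is not a substitute for an HTML-aware tool.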

Get a particular string from text file

I need to get a particular string from a text file. The content of my file is below:
Components at each of the following levels must be
built before components at higher-numbered levels.
1. SACHHYA-opkg-utils master#964c29cc453ccd3d1b28fb40bae8df11c0dc3b3c
SACHHYA-web-SABARMATI-ap-page master#3bdc2dc1e5cee745cfced370201352045cd57195
SACHHYA-web-update-page master#24b0ffaad4d130ae5a2df0e470868846c7888392
SACHHYAWebMonaco Release/MR1_2019/3.0.7-570+36a238d
googletest-qc8017_32 branches/googletest#2692
LpmMfgTool Release/master/0.0.1-4+34833d6
opensource-avahi-qc8017_32 Release/SACHHYA-master/v1.0-4-gb70507e
opensource-OpenAvnuApple-qc8017_32 Release/SACHHYA-master/v1.0-1766-g1098033
opensource-opkg-qc8017_32 Release/SACHHYA-dev/v0.3.6.2-2-gb1e1aba
opensource-unzip-qc8017_32 Release/master/v6.0.0
opensource-util-linux-qc8017_32 Release/SACHHYA-master/1.5.0-10+877ade5
opensource-zip-qc8017_32 Release/master/v3.0.0
product-startup Release/master/4.0.0-5+5179185
ProductControllerCommon master#a1e71509aaaa9cf7a9e70d4e9c7bfc80d76e13a2
ProductUIAssets master#220944def647a72ce0194d43ef23f1d3fe146987
proprietary-airplay2-qc8017_32 Release/SACHHYA-master/2.0.2-15-g88c1c1d
SABARMATI-HSP-Images Release/master/4.4
SABARMATI-Toolchain Release/master/4.4
SABRMATILPM trunk#3408
SABARMATILpmTools #3604
SABARMATILpmUpdater Release/master/1.0.0-69+a38d6c8
The command that I am trying is:
awk /SACHHYAWebMonaco/ MyFile.txt
Using this command, I am able to get that particular line in which my string is present. Here is the result of the awk command :
SACHHYAWebMonaco Release/MR1_2019/3.0.7-570+36a238d
What I want to extract is only "3.0.7" (the version) from that line.
Does anyone have a suggestion for how to do that?
You can use / and - as field separators and print the third field.
This assumes that the line format and the position of the version are always the same.
$ awk -F'[/-]' '/SACHHYAWebMonaco/ {print $3}' file
3.0.7
Perl solution
$ perl -F"[/-]" -lane ' print "$F[2]" if /SACHHYAWebMonaco/ ' sachhya.txt
3.0.7
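Both commands above depend on the version being the third field once the line is split on / and -. A slightly less position-dependent sketch takes whatever follows the last slash and cuts it at the first - or +. The function name component_version is made up for illustration, and the sketch assumes the component name is the first whitespace-separated word on its line.

```shell
# Hypothetical helper: print the version of one component from the listing.
# First argument: exact component name; remaining arguments: input files (or stdin).
# Assumption: lines look like "<name> <path>/<version>[-build][+hash]".
component_version() {
  name=$1
  shift
  awk -v name="$name" '$1 == name {
    v = $2
    sub(/^.*\//, "", v)     # keep only what follows the last slash
    sub(/[-+].*$/, "", v)   # drop the build number and hash suffix
    print v
  }' "$@"
}

# Usage: component_version SACHHYAWebMonaco MyFile.txt
```

Comparing $1 with == instead of a regular expression avoids matching other components whose names merely contain the search string.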

Obtaining the field that contains a value or string on Linux shell

Case example:
$ cat data.txt
foo,bar,moo
I can obtain the field data by using cut, assuming , as separator, but only if I know which position it has. Example to obtain value bar (second field):
$ cat data.txt | cut -d "," -f 2
bar
How can I obtain that same bar (or field number == 2) if I only know that it contains the letter a?
Something like:
$ cat data.txt | reversecut -d "," --string "a"
[results could be both "2" or "bar"]
In other words: how can I know what is the field containing a substring in a text-delimited file using linux shell commands/tools?
Of course, programming is allowed, but do I really need looping and conditional structures? Isn't there a command that solves this?
If the solution is shell-specific, I would prefer Bash.
A close solution here, but not exactly the same.
More scenarios based on the same example (upon request):
For a search pattern of m or mo, the results could be both 3 or moo.
For a search pattern of f or fo, the results could be both 1 or foo.
The following simple awk command may also help.
awk -F, '$2~/a/{print $2}' data.txt
Output will be bar in this case.
Explanation:
-F,: Setting field separator for lines as comma, to identify the fields easily.
$2~/a/: checks whether the 2nd field contains the letter a; if so, that field is printed.
EDIT: Adding a solution as per the OP's comment and the edited question.
Let's say the following Input_file is there
cat data.txt
foo,bar,moo
mo,too,far
foo,test,test1
fo,test2,test3
Then the following code does this:
awk -F, '{for(i=1;i<=NF;i++){if($i ~ /fo/){print $i}}}' data.txt
foo
foo
fo
OR
awk -F, '{for(i=1;i<=NF;i++){if($i ~ /mo/){print $i}}}' data.txt
moo
mo
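Since the question accepts either the field number or the field value, the loop above can be sketched so that it prints both. The function name find_field is hypothetical. It uses index() rather than a regular expression, so characters such as . or * in the search string are taken literally.

```shell
# Hypothetical helper: for each comma-separated field containing the given
# substring, print the field number followed by the field value.
# First argument: substring to look for; remaining arguments: input files (or stdin).
find_field() {
  needle=$1
  shift
  awk -F, -v s="$needle" '{
    for (i = 1; i <= NF; i++)
      if (index($i, s)) print i, $i   # index() returns 0 when s is not found
  }' "$@"
}

# Usage: find_field a data.txt
```

Note that awk processes backslash escapes in values passed with -v, so a search string containing backslashes would need extra care.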
