Using Unix Tools to Extract String Values - linux

I wrote a small Perl script to extract all the values from a JSON formatted string for a given key name (shown below). So, if I set a command line switch for the Perl script to id, then it would return 1,2, and stringVal from the JSON example below. This script does the job, but I want to see how others would solve this same problem using other unix style tools such as awk, sed, or perl itself. Thanks
{
"id":"1",
"key2":"blah"
},
{
"id":"2",
"key9":"more blah"
},
{
"id":"stringVal",
"anotherKey":"even more blah"
}
Excerpt of perl script that extracts JSON values:
my @values;
while(<STDIN>) {
chomp;
s/\s+//g; # Remove spaces
s/"//g; # Remove quotes
push @values, /$opt_s:([\w]+),?/g; # $opt_s is a command line switch for the key to find
}
print join("\n",@values);

use JSON;

I would strongly suggest using the JSON module. It will parse your json input in one function (and back). It also offers an OOP interface.

gawk
gawk 'BEGIN{
FS=":"
printf "Enter key name: "
getline key < "-"
}
$0~key{
k=$2; getline ; v = $2
gsub("\"","",k)
gsub("\"","",v)
print k,v
}' file
output
$ ./shell.sh
Enter key name: id
1, blah
2, more blah
stringVal, even more blah
If you just want the id value,
$ key="id"
$ awk -vkey=$key -F":" '$0~key{gsub("\042|,","",$2);print $2}' file
1
2
stringVal

Here is a very rough Awk script to accomplish the task:
awk -v k=id -F: '/{|}/{next}{gsub(/^ +|,$/,"");gsub(/"/,"");if($1==k)print $2}' data
-F: specifies ':' as the field separator.
-v k=id sets the key you're searching for.
Lines containing '{' or '}' are skipped.
The first gsub gets rid of leading whitespace and trailing commas.
The second gsub gets rid of double quotes.
Finally, if k matches $1, $2 is printed.
data is the file containing your JSON.
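The steps above can be wrapped in a small shell function. This is only a sketch: the name extract_values is made up for illustration, and it assumes one "key":"value" pair per line with no colons or escaped quotes inside the values.

```shell
# extract_values KEY FILE: print the value of every "KEY":"value" line.
# Hypothetical helper; assumes one pair per line and no colons in values.
extract_values() {
    awk -v k="$1" -F: '
        /[{}]/ { next }                    # skip lines holding a brace
        {
            gsub(/^[ \t]+|,[ \t]*$/, "")   # trim leading blanks and a trailing comma
            gsub(/"/, "")                  # drop the double quotes
            if ($1 == k) print $2          # exact key match, then print the value
        }
    ' "$2"
}
```

Called as extract_values id data, it prints one value per line for the sample in the question. For anything beyond this simple layout, a real JSON parser is the safer choice.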

sed (provided that file is formatted as above, no more than one entry per line):
KEY=id;cat file|sed -n "s/^[[:space:]]*\"$KEY\":\"//p"|sed 's/".*$//'

Why are you parsing the string yourself when there are libraries to do this for you? json.org has JSON parsing and encoding libraries for practically every language you can think of (and probably a few that you haven't). In Perl:
use strict;
use warnings;
use JSON qw(from_json to_json);
# enable slurp mode
local $/;
my $string = <DATA>;
my $data = from_json($string);
use Data::Dumper;
print "the data was parsed as: " . Dumper($data);
__DATA__
[
{
"id":"1",
"key2":"blah"
},
{
"id":"2",
"key9":"more blah"
},
{
"id":"stringVal",
"anotherKey":"even more blah"
}
]
..produces the output (I added a top level array around the data so it would be parsed as one object):
the data was parsed as: $VAR1 = [
{
'key2' => 'blah',
'id' => '1'
},
{
'key9' => 'more blah',
'id' => '2'
},
{
'anotherKey' => 'even more blah',
'id' => 'stringVal'
}
];

If you don't mind seeing the quote and colon characters, I would simply use grep:
grep id file.json

Related

How to replace newlines between brackets

I have log file similar to this format
test {
seq-cont {
0,
67,
266
},
grp-id 505
}
}
test{
test1{
val
}
}
Here is the echo command to produce that output
$ echo -e "test {\nseq-cont {\n\t\t\t0,\n\t\t\t67,\n\t\t\t266\n\t\t\t},\n\t\tgrp-id 505\n\t}\n}\ntest{\n\ttest1{\n\t\tval\n\t}\n}\n"
The question is how to remove all whitespace between seq-cont { and the next }; this pattern may occur multiple times in the file.
I want the output to look like this, preferably produced with sed.
test{seq-cont{0,67,266},
grp-id 505
}
}
test{
test1{
val
}
}
Efforts by OP: Here is the one somewhat worked but not exactly what I wanted:
sed ':a;N;/{/s/[[:space:]]\+//;/}/s/}/}/;ta;P;D' logfile
It can be done using gnu-awk with a custom RS regex that matches { and closing }:
awk -v RS='{[^}]+}' 'NR==1 {gsub(/[[:space:]]+/, "", RT)} {ORS=RT} 1' file
test {seq-cont{0,67,266},
grp-id 505
}
}
test{
test1{
val
}
}
Here:
NR==1 {gsub(/[[:space:]]+/, "", RT)}: For the first record, remove all whitespace (including line breaks) from RT, the text that matched RS.
{ORS=RT}: Set ORS to RT, so each record is printed followed by the (possibly edited) text that matched RS.
PS: Remove NR==1 if you want to do this for the entire file.
With your shown samples, please try following awk program. Tested and written in GNU awk.
awk -v RS= '
match($0,/{\nseq-cont {\n[^}]*/){
val=substr($0,RSTART,RLENGTH)
gsub(/[[:space:]]+/,"",val)
print substr($0,1,RSTART-1) val substr($0,RSTART+RLENGTH)
}
' Input_file
Explanation: Setting RS to the empty string puts awk in paragraph mode, so each blank-line-separated block is read as one record. The match function then finds everything from seq-cont { up to the next occurrence of }. All whitespace, including newlines, is removed from the matched text. Finally the record is printed with the edited text in place of the original, giving the output the OP expects.
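For comparison, here is a line-by-line sketch that does not depend on GNU awk features such as RT. The function name compact_seq_cont is hypothetical, and the sketch assumes each block opens on a line containing seq-cont { and ends at the first following line that contains }. Unlike the expected output in the question, it leaves the preceding test { line on its own line.

```shell
# compact_seq_cont FILE: join each seq-cont { ... } block onto one line
# with all blanks removed. Hypothetical helper, plain POSIX awk.
compact_seq_cont() {
    awk '
        /seq-cont[ \t]*[{]/ { inblock = 1 }   # opening line starts a block
        inblock {
            gsub(/[ \t]+/, "")                # strip blanks inside the block
            buf = buf $0                      # join lines without newlines
            if (index($0, "}")) {             # first closing brace ends it
                print buf
                buf = ""
                inblock = 0
            }
            next
        }
        { print }                             # everything else is unchanged
    ' "$1"
}
```

On the sample log this turns the five-line block into seq-cont{0,67,266}, and passes all other lines through untouched.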
You can do that much easier with perl:
perl -0777 -i -pe 's/\s+(seq-cont\s*\{[^}]*\})/$1=~s|\s+||gr/ge' logfilepath
The -0777 option tells perl to slurp the file into a single string, and -i saves the changes in place. The regex \s+(seq-cont\s*\{[^}]*\}) matches one or more whitespace characters, then captures into Group 1 ($1) the text seq-cont, zero or more whitespace characters, and the substring from the leftmost { to the next } char ([^}]* matches zero or more chars other than }). Because of the e flag the replacement is evaluated as code: the inner substitution $1=~s|\s+||gr removes every chunk of one or more whitespace characters from the Group 1 value and returns the result. All occurrences are handled due to the g flag (next to e).
See the online demo:
#!/bin/bash
s=$(echo -e "test {\nseq-cont {\n\t\t\t0,\n\t\t\t67,\n\t\t\t266\n\t\t\t},\n\t\tgrp-id 505\n\t}\n}\ntest{\n\ttest1{\n\t\tval\n\t}\n}\n")
perl -0777 -pe 's/\s+(seq-cont\s*\{[^}]*\})/$1=~s|\s+||gr/ge' <<< "$s"
Output:
test {seq-cont{0,67,266},
grp-id 505
}
}
test{
test1{
val
}
}

Replace multiline string with sed

I have a file that's basically an INI/CFG file that looks like this:
[thing-a]
attribute1=foo
attribute2=bar
attribute3=foobar
attribute4=barfoo
[thing-b]
attribute1=dog
attribute3=foofoo
attribute4=castles
[thing-c]
attribute1=foo
attribute4=barfoo
[thing-d]
attribute1=123455
attribute2=dogs
attribute3=biscuits
attribute4=1234
Each 'thing' has a set of attributes that could include all the same ones or a subset thereof.
I am trying to write a small bash script that will replace the attributes for 'thing-c' with a predefined block ($a1, $a2 and $a3 are generated elsewhere in the wider script):
NEW_BLOCK="[thing-c]
attribute1=${a1}
attribute2=${a2}
attribute3=${a3}"
I can find the right block with sed like this:
THING_BLOCK=$(sed -nr "/^\[thing-c\]/ { :l /^\s*[^#].*/ p; n; /^\[/ q; b l; }" ./myThingFile)
I'm not sure if I've gone down a rabbit hole with this, and I'm pretty sure there is a better way of doing it.
What I want to do is:
sed "s/${THING_BLOCK}/${NEW_BLOCK}/"
But I can't quite figure out the multiline aspect of this, and I'm not sure what the best route to take is.
Is there a way to do this sort of multiline find and replace with sed (or a better way with bash)?
Is there a way to do this sort of multiline find and replace ...
Yes there is indeed a better way, albeit using awk:
awk -v blk="$NEW_BLOCK" -v RS= '{ORS = RT} $1 == "[thing-c]" {$0 = blk} 1' file
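With -v RS= the record separator is empty, which puts awk in paragraph mode: the input is split into records at blank lines rather than at every newline, so each section is one record (this assumes the sections in the file are separated by blank lines). ORS = RT, a GNU awk feature, writes each record back out with the separator that followed it.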
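To see how an empty RS groups lines into records, here is a minimal sketch. The helper name list_sections is made up for illustration, and it assumes the sections are separated by blank lines.

```shell
# list_sections FILE: print the record number and header line of each
# blank-line-separated block. Hypothetical helper, plain POSIX awk.
list_sections() {
    awk 'BEGIN { RS = ""; FS = "\n" }   # paragraph mode, one field per line
         { print NR ": " $1 }' "$1"
}
```

If the file has no blank lines between sections, the whole file becomes a single record and only the first header is listed, which is a quick way to check whether paragraph mode applies to your data.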
Another awk. Store the replacement to file2 and:
$ awk -v RS="" '
NR==FNR {
b=$0
next
}
$1~/thing-c/ {
$0=b
}
{
print (++c==1?"":ORS) $0
}' file2 file1
Output:
[thing-a]
attribute1=foo
attribute2=bar
attribute3=foobar
attribute4=barfoo
[thing-b]
attribute1=dog
attribute3=foofoo
attribute4=castles
[thing-c]
attribute1=${a1}
attribute2=${a2}
attribute3=${a3}
[thing-d]
attribute1=123455
attribute2=dogs
attribute3=biscuits
attribute4=1234
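If the sections are not separated by blank lines, a line-oriented approach may be easier. The following is a sketch only: replace_section is a hypothetical name, the replacement block is passed through the environment to avoid awk's escape processing of -v values, and it assumes every section starts with a line beginning with [.

```shell
# replace_section NAME NEW_BLOCK FILE: print FILE with section [NAME]
# replaced by NEW_BLOCK. Hypothetical helper, plain POSIX awk.
replace_section() {
    blk="$2" awk -v name="[$1]" '
        /^\[/ {                             # a section header starts here
            skip = ($0 == name)             # 1 while inside the target section
            if (skip) print ENVIRON["blk"]  # emit the replacement once
        }
        !skip                               # print lines outside the target
    ' "$3"
}
```

Usage would look like replace_section thing-c "$NEW_BLOCK" ./myThingFile > ./myThingFile.new, where NEW_BLOCK already contains the [thing-c] header line.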
If you want to use sed (IMHO awk is better here), you must have "nice" data: no special characters that sed will try to interpret, and no [ inside the thing-c block.
I tested with
read -d '' -r NEW_BLOCK <<END
[thing-c]
attribute1=${a1}
attribute2=${a2}
attribute3=${a3}
END
For my solution I first need to replace newlines in $NEW_BLOCK with the two characters \n.
echo "This is the replacement string: ${NEW_BLOCK//$'\n'/\\n}"
With the "multi-line" option "-z" you can do
sed -rz "s/\[thing-c\][^[]*/${NEW_BLOCK//$'\n'/\\n}\n\n/" myThingFile

Partial String split in Bash

Consider this string:
00x\00x\00x\00x\00x\00x\00x\00x\00x\g09x\t20x\00x\00x\00x\
What I want to retrieve is this:
00x\00x\00x\00x\00x\00x\00x\00x\00x\g09x\t20x\
Basically, the logic is:
As long as it's 00x\ keep reading the remaining of the string.
As long as it's not 00x\ keep reading the remaining of the string.
Split there.
How can this be achieved in bash? Note that there is a "9" in the middle, and a "t", so there might be "garbage" between two 00x\ tokens. So I can't just split the string into tokens, nor can I use cut (the length is not fixed). Is there any magic I can do with awk or sed?
Thanks.
Edit: The input string can have other strings after the 00x\. Like this: 00x\00x\00x\00x\00x\00x\00x\00x\00x\g09x\t20x\00x\00x\00x\00x\00x\00x\00x\00x\00x\GL7Dx\00x\00x\00x\00x\00x\00x\00x\00x\00x\00x\BCx\V6Ax\00x\00x\00x\00x\00x\00x\00x\00x\00x\00x\H50x\ where what I want is still 00x\00x\00x\00x\00x\00x\00x\00x\00x\g09x\t20x\
Something in awk:
$ awk '
BEGIN {
FS=ORS="\\"
}
{
for(i=1;i<=NF;i++)
if(($i=="00x")&&p!="00x"&&p!="") {
printf "\n"
exit
} else {
p=$i
print $i
}
}' file
Output on the updated data
00x\00x\00x\00x\00x\00x\00x\00x\00x\g09x\t20x\
In perl using negative lookbehind:
$ perl -ne 's/(?<!00x)\\00x.*/\\/g;print' file
00x\00x\00x\00x\00x\00x\00x\00x\00x\g09x\t20x\

AWK to find first occurrence of string and assign to variable for compare

I have written the following line of code, which splits the string at the first occurrence of the delimiter.
echo "$line" | awk -F':' '{ st = index($0,":");print "field1: "$1 " => " substr($0,st+1)}';
But I don't want to display it. I want to store both parts in variables, so I tried the following code:
explodetext="$line" | awk -F':' '{ st = index($0,":")}';
Sample data:
id:1
url:http://test.com
The expected output is:
key=id
val=1
key=url
val=http://test.com
but it is not working as expected. Any solution?
Thanks
Your code, expanded:
echo "$line" \
| awk -F':' '
{
st = index($0,":")
print "field1: " $1 " => " substr($0,st+1)
}'
The output of this appears merely to split the line according to the first colon. From the sample data you've provided, it seems that your lines contain two fields, which are separated by the first colon found. This means you can't safely use awk's field separator to find your data (though you can use it for field names), making index() a reasonable approach.
One strategy might be to place your input into an array, for assessment:
#!/usr/bin/awk -f
BEGIN {
FS=":"
}
{
record[$1]=substr($0,index($0,":")+1);
}
END {
if (record["id"] > 0) {
printf("Record ID %d had a value of %s.\n", record["id"], record["url"])
} else {
print "No valid records found."
}
}
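If the goal is to hold each key and value in shell variables rather than print them from awk, plain parameter expansion may be enough. This is a sketch under the assumption that every input line has the form key:value; the function name split_pairs is hypothetical.

```shell
# split_pairs: read key:value lines on stdin and split each one at the
# FIRST colon only, so values such as URLs keep their own colons.
split_pairs() {
    while IFS= read -r line; do
        [ -n "$line" ] || continue     # ignore empty lines
        key=${line%%:*}                # everything before the first colon
        val=${line#*:}                 # everything after the first colon
        printf 'key=%s\nval=%s\n' "$key" "$val"
    done
}
```

For example, printf 'id:1\nurl:http://test.com\n' | split_pairs prints the four key=/val= lines shown in the question. Inside the loop, key and val are ordinary shell variables that can be compared or passed on instead of printed.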
I suppose that your text file input.txt is stored in the format as given below:
id:1
url:http://test1.com
You could use the below piece of code, say awkscript, to achieve what you wish to do :
#!/bin/bash
awk '
BEGIN{FS=":"}
{
if ($2 > 0) {
if ( getline > 0){
st = index($0,":")
url = substr($0,st+1);
system("echo Do something with " url);
}
}
}' $1
Run the code as ./awkscript input.txt
Note: I assume that the input file contains only one id/url pair, as you confirmed in your comment.

make a change on the string based on mapping

I have the following string format
str="aaa.[any_1].bbb.[any_2].ccc"
I have the following mapping
map1:
any_1 ==> 1
cny_1 ==> 2
map2
any_2 ==> 1
bny_2 ==> 2
cny_2 ==> 3
What's the best command to run on str, taking the above mapping into account, in order to get
$ command $str
aaa.1.bbb.1.ccc
Turn your map files into sed scripts:
sed 's%^%s/%;s% ==> %/%;s%$%/g%' map?
Apply the resulting script to the input string. You can do it directly by process substitution:
sed 's%^%s/%;s% ==> %/%;s%$%/g%' map? | sed -f- <(echo "$str")
Output:
aaa.[1].bbb.[1].ccc
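The output above still contains the square brackets. One way to drop them is to make the generated sed script match the brackets as well. This is a sketch, not a tested drop-in: apply_maps is a hypothetical name, and it assumes the map files contain only lines of the form name ==> value, with no slashes or regex metacharacters in names or values.

```shell
# apply_maps STRING MAPFILE...: replace each [name] in STRING with the
# value mapped to name in the given map files. Hypothetical helper.
apply_maps() {
    str=$1
    shift
    # Turn "name ==> value" into the sed command s/\[name]/value/g;
    # lines that do not have that form are ignored (-n ... p).
    script=$(sed -n 's%^\(.*\) ==> \(.*\)$%s/\\[\1]/\2/g%p' "$@")
    printf '%s\n' "$str" | sed "$script"
}
```

With the two map files from the question, apply_maps "$str" map1 map2 would print aaa.1.bbb.1.ccc.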
Update: I now think that I didn't understand the question correctly, and my solution therefore is wrong. I'm leaving it in here because I don't know if parts of this answer will be helpful to your question, but I encourage you to look at the other answers first.
Not sure what you mean. But here's something:
any_1="1"
any_2="2"
str="aaa.${any_1}.bbb.${any_2}.ccc"
echo $str
The curly brackets tell the interpreter where the variable name ends and the normal string resumes. Result:
aaa.1.bbb.2.ccc
You can loop this:
for any_1 in {1..2}; do
for any_2 in {1..3}; do
echo aaa.${any_1}.bbb.${any_2}.ccc
done
done
Here {1..3} represents the numbers 1, 2, and 3. Result
aaa.1.bbb.1.ccc
aaa.1.bbb.2.ccc
aaa.1.bbb.3.ccc
aaa.2.bbb.1.ccc
aaa.2.bbb.2.ccc
aaa.2.bbb.3.ccc
{
echo "${str}"
cat Map1
cat Map2
} | sed -n '1h;1!H;$!d
x
s/[[:space:]]*==>[[:space:]]*/ /g
:a
s/\[\([^]]*\)\]\(.*\)\n\1 \([^[:cntrl:]]*\)/\3\2/
ta
s/\n.*//p'
You can use several mapping files, not limited to 2 (you could even use find to cat every mapping file found).
This relies on the alias and value containing no spaces (it can be adapted if they do).
I have upvoted @chw21's answer as it promotes using the right tool for the problem. However,
you can devise a Perl-based command along the following lines.
#!/usr/bin/perl
use strict;
use warnings;
my $text = join '',<DATA>;
my %myMap = (
'any_1' => '1',
'any_2' => '2'
);
$text =~s/\[([^]]+)\]/replace($1)/ge;
print $text;
sub replace {
my ($needle) = @_;
return "\[$needle\]" if ! exists $myMap{ lc $needle};
return $myMap{lc $needle};
}
__DATA__
aaa.[any_1].bbb.[any_2].ccc
The only thing that may need a bit of explanation is the regex: it matches the text between square brackets and passes that text to the replace routine. In replace, we return the mapped value for the argument, or the original bracketed text if the map has no entry for it.
$ cat tst.awk
BEGIN {
FS=OFS="."
m["any_1"]=1; m["cny_1"]=2
m["any_2"]=1; m["bny_2"]=2; m["cny_2"]=3
for (i in m) map["["i"]"] = m[i]
}
{
for (i=1;i<=NF;i++) {
$i = ($i in map ? map[$i] : $i)
}
print
}
$ awk -f tst.awk <<<'aaa.[any_1].bbb.[any_2].ccc'
aaa.1.bbb.1.ccc
