using grep commands to find a duplicate id within a json file

I am looking for a way to use grep on a Linux server to find duplicate JSON records. Is it possible to have grep search for duplicate ids in the example below?
For this example, the grep would return: 01
{
"book": [
{
"id": "01",
"language": "Java",
"edition": "third",
"author": "Herbert Schildt"
},
{
"id": "02",
"language": "Java",
"edition": "third",
"author": "Herbert Schildt"
},
{
"id": "03",
"language": "Java",
"edition": "third",
"author": "Herbert Schildt"
},
{
"id": "01",
"language": "Java",
"edition": "third",
"author": "Herbert Schildt"
},
{
"id": "04",
"language": "C++",
"edition": "second",
"author": "E.Balagurusamy"
}
]
}

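Use grep along with sort and uniq:
grep '"id":' filename | sort | uniq -d
The -d option prints only duplicated lines, so the output here is the whole matching line ("id": "01",) rather than the bare value.
However, this depends on the JSON being laid out neatly, with one id per line and consistent spacing. To handle more general formatting, I recommend you use the jq utility.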

A jq-based approach:
jq -r '.book[].id' < in.json | sort | uniq -d
01
This should work even for minified JSON files with no newlines.
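If jq cannot be installed, the same check can be sketched with Python's standard json module. This is a minimal sketch, assuming the file has the top-level "book" array shown in the question; the function name find_duplicate_ids is hypothetical, not part of any library.

```python
import json
from collections import Counter


def find_duplicate_ids(text, key="book"):
    """Return the ids that occur more than once in the array under `key`."""
    records = json.loads(text)[key]
    counts = Counter(record["id"] for record in records)
    # Keep only ids seen at least twice, sorted for stable output.
    return sorted(id_ for id_, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Hypothetical usage: read in.json and print one duplicate id per line.
    with open("in.json", encoding="utf-8") as handle:
        for duplicate in find_duplicate_ids(handle.read()):
            print(duplicate)
```

Because the input is parsed rather than pattern-matched, this does not depend on line layout, just like the jq version.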

OK, if awk is acceptable, I can offer the following, which ignores any whitespace in the JSON. Here hutch is a file containing the formatted chunk of JSON above.
I use tr to remove all whitespace and use , as the field separator in awk. A for-loop iterates over the fields of the resulting single long line, a pattern match isolates the id fields, and an array entry is incremented for each matched id. At the end of processing I iterate over the array and print the ids that occur more than once.
Here is your data:
$ cat hutch
{
"book": [
{
"id": "01",
"language": "Java",
"edition": "third",
"author": "Herbert Schildt"
},
{
"id": "02",
"language": "Java",
"edition": "third",
"author": "Herbert Schildt"
},
{
"id": "03",
"language": "Java",
"edition": "third",
"author": "Herbert Schildt"
},
{
"id": "01",
"language": "Java",
"edition": "third",
"author": "Herbert Schildt"
},
{
"id": "04",
"language": "C++",
"edition": "second",
"author": "E.Balagurusamy"
}
]
}
And here is how the duplicates are found (gensub requires GNU awk):
$ tr -d '[:space:]' <hutch | awk -F, '{for(i=1;i<=NF;i++){if($i~/"id":/){a[gensub(/^.*"id":"([0-9]+)"$/, "\\1","1",$i)]++}}}END{for(i in a){if(a[i]>1){print i}}}'
01

Use a Perl one-liner to extract the numeric ids, then sort | uniq -d to print only the duplicates (as in the answer by Barmar):
This assumes that the id key/value pair is on the same line, but disregards whitespace (or lack of whitespace) anywhere on the line (leading, trailing, and in between):
perl -lne 'print for /"id":\s*"(\d+)"/' in.json | sort | uniq -d
This makes no assumptions (disregards whitespace and newlines). Note that it reads the entire json file into memory (using the -0777 command line switch):
perl -0777 -nE 'say for /"id":\s*"(\d+)"/g' in.json | sort | uniq -d
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-E : Tells Perl to look for code in-line, instead of in a file. Also enables all optional features. Here, enables say.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-0777 : Slurp files whole.
The regex uses this modifier:
/g : Multiple matches.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)
perldoc perlre: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups
perldoc perlrequick: Perl regular expressions quick start

Related

Convert a json file to csv using shell script without using jq?

I want to convert a json file to csv using shell script without using jq. Is it possible?
Here is a json :
{
"id": "0001",
"type": "donut",
"name": "Cake",
"ppu": 0.55,
},
{
"id": "0002",
"type": "donut2",
"name": "Cake2",
"ppu": 0.5522,
}
I don't want to use jq.
I want to store it in a csv file.

Bare-bones core-only perl one-liner version, to complement the python and ruby ones already given:
perl -MJSON::PP -0777 -nE '$,=","; say @$_{"id","type","name","ppu"} for @{decode_json $_}' input.json
A more robust one would use a more efficient non-core JSON parser and a CSV module to do things like properly quote fields when needed, but since your sample data doesn't include such fields I didn't bother. Can if requested.
And the unrequested jq version, because that really is the best approach whether you want it or not:
jq -r '.[] | [.id, .type, .name, .ppu] | @csv' input.json
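The point about properly quoting fields can be illustrated with a CSV module. The following is a minimal sketch in Python using only the standard library, assuming the input has first been fixed into a valid JSON array of flat objects (the sample in the question has trailing commas, which json.loads rejects). The function name json_to_csv and the default field list are illustrative choices.

```python
import csv
import io
import json


def json_to_csv(text, fields=("id", "type", "name", "ppu")):
    """Convert a JSON array of flat objects to CSV text, quoting when needed."""
    rows = json.loads(text)
    out = io.StringIO()
    # csv.writer quotes a field only if it contains a comma, quote or newline.
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        writer.writerow([row[field] for field in fields])
    return out.getvalue()


if __name__ == "__main__":
    sample = '[{"id": "0001", "type": "donut", "name": "Cake, iced", "ppu": 0.55}]'
    print(json_to_csv(sample), end="")
```

A name such as "Cake, iced" is written as a quoted field, which plain string concatenation would silently turn into two columns.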

Bash is not the right tool for this at all. But if for some reason you cannot install jq, you can simply use Python 3, which comes by default in most Linux distros and in macOS.
#!/usr/local/bin/python3
import json
objs=json.loads("""
[
{
"id": "0001",
"type": "donut",
"name": "Cake",
"ppu": 0.55
},
{
"id": "0002",
"type": "donut2",
"name": "Cake2",
"ppu": 0.5522
}
]
""")
# Print one comma-separated line per object.
for item in objs:
    print("{}{}{}{}{}{}{}".format(item['id'],",",item['type'],",",item['name'],",",item['ppu']))
If you do not have Python 3 either, you can then do it in Ruby, which also comes by default in most distros and MacOS :
#!/usr/bin/ruby
require 'json'
content = '
[
{
"id": "0001",
"type": "donut",
"name": "Cake",
"ppu": 0.55
},
{
"id": "0002",
"type": "donut2",
"name": "Cake2",
"ppu": 0.5522
}
]
'
JSON.parse(content).each { |item| puts "#{item['id']},#{item['type']},#{item['name']},#{item['ppu']}" }
You can then redirect the output to a file :
script.rb > output.csv
And that's it.
Nevertheless, if you can be completely sure of the format of your input, you can do some bash magic, especially using awk. But as others also said, please don't do that.

Replace pom version using sed in Jenkinsfile

I have a variable myStr that contains the following value:
"app": {
"services": {
"app": [{
"groupID": "com.mycompany",
"artifactId": "myapp-versions",
"version": "1.0.0"
},
{
"groupID": "com.mycompany.xyz",
"artifactId": "car-stats",
"version": "1.0-master"
},
{
"groupID": "com.mycompany.service",
"artifactId": "my-differential-service",
"version": "1.0.0-master"
}
]
}
}
Now I want to replace the version of only the my-differential-service artifactId with NEW_VERSION.
I tried using the sed command on the myStr variable but couldn't get it to work, as I am not very familiar with this command.
Can anyone please guide me on how I should proceed to achieve this?
Any help would be highly appreciated.

If you need sed to parse this, check whether your sed supports the -z option. With -z, sed splits its input on NUL characters instead of newlines, so the whole text is processed as one record and a pattern can match across \n.
When you know that version is the first field after the artifactId you can try
new_version=1.0.1 # avoid uppercase variable names
echo "$myStr" |
sed -rz 's/("my-differential-service",[^:]*: ")[^"]*/\1'"${new_version}"'/'
When the order of fields can change, you might want to ask for jq in Jenkins or try awk.
When you want to use sed but don't have the -z option, you can translate first:
echo "$myStr" | tr '\n' '\r' | sed -r 's/..../' | tr '\r' '\n'

If jq is available, congrats! and please try the following:
echo "{ $myStr }" | jq '(.app.services.app[] | select(.artifactId == "my-differential-service") | .version) = "NEW_VERSION"'
which yields:
{
"app": {
"services": {
"app": [
{
"groupID": "com.mycompany",
"artifactId": "myapp-versions",
"version": "1.0.0"
},
{
"groupID": "com.mycompany.xyz",
"artifactId": "car-stats",
"version": "1.0-master"
},
{
"groupID": "com.mycompany.service",
"artifactId": "my-differential-service",
"version": "NEW_VERSION"
}
]
}
}
}
If you do not need the outermost curly braces, please remove them by bash's parameter expansion or something similar.
As a fallback, you can say with sed:
echo "$myStr" | sed '
:l
N
$!b l
s/\("my-differential-service"[^"]*"version": *\)"[^"]*"/\1"NEW_VERSION"/g'
Hope this helps.
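If neither jq nor a multi-line sed is convenient, the same replacement can be sketched in Python with the standard json module. This is a minimal sketch, assuming the value of myStr is exactly the fragment shown in the question, so that wrapping it in braces yields valid JSON. The function name set_version is hypothetical.

```python
import json


def set_version(fragment, artifact_id, new_version):
    """Return the fragment as a full JSON document with one version replaced."""
    # The fragment lacks its outermost braces, so add them before parsing.
    doc = json.loads("{" + fragment + "}")
    for service in doc["app"]["services"]["app"]:
        if service["artifactId"] == artifact_id:
            service["version"] = new_version
    return json.dumps(doc, indent=2)


if __name__ == "__main__":
    fragment = '"app": {"services": {"app": [{"artifactId": "my-differential-service", "version": "1.0.0-master"}]}}'
    print(set_version(fragment, "my-differential-service", "NEW_VERSION"))
```

Unlike the sed variants, this does not care about the order of the fields inside each object.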

Removing pattern from multiple lines using sed or awk in two places in the same line

I have a JSON file with 12,166,466 lines.
I want to remove the quotes from the values of two keys, so that
"timestamp": "1538564256", and "score": "10", become
"timestamp": 1538564256, and "score": 10,.
Input:
{
"title": "DNS domain", ,
"timestamp": "1538564256",
"domain": {
"dns": [
"www.google.com"
]
},
"score": "10",
"link": "www.bit.ky/sdasd/asddsa"
"id": "c-1eOWYB9XD0VZRJuWL6"
}, {
"title": "DNS domain",
"timestamp": "1538564256",
"domain": {
"dns": [
"google.de"
]
},
"score": "10",
"link": "www.bit.ky/sdasd/asddsa",
"id": "du1eOWYB9XD0VZRJuWL6"
}
}
Expected output:
{
"title": "DNS domain", ,
"timestamp": 1538564256,
"domain": {
"dns": [
"www.google.com"
]
},
"score": 10,
"link": "www.bit.ky/sdasd/asddsa"
"id": "c-1eOWYB9XD0VZRJuWL6"
}, {
"title": "DNS domain",
"timestamp": 1538564256,
"domain": {
"dns": [
"google.de"
]
},
"score": 10,
"link": "www.bit.ky/sdasd/asddsa",
"id": "du1eOWYB9XD0VZRJuWL6"
}
}
I have tried:
sed -E '
s/"timestamp": "/"timestamp": /g
s/"score": "/"score": /g
'
The first part is quite straightforward, but how do I remove the ", at the end of the lines that contain "timestamp" and "score"? How do I reach that with sed, awk, or another tool, keeping in mind that I have 12 million lines to process?

Assuming that you fix your JSON input file like this:
<file jq .
[
{
"title": "DNS domain",
"timestamp": "1538564256",
"domain": {
"dns": [
"www.google.com"
]
},
"score": "10",
"link": "www.bit.ky/sdasd/asddsa",
"id": "c-1eOWYB9XD0VZRJuWL6"
},
{
"title": "DNS domain",
"timestamp": "1538564256",
"domain": {
"dns": [
"google.de"
]
},
"score": "10",
"link": "www.bit.ky/sdasd/asddsa",
"id": "du1eOWYB9XD0VZRJuWL6"
}
]
You can use jq and its tonumber function to change the wanted strings to values:
<file jq '.[].timestamp |= tonumber | .[].score |= tonumber'

If the JSON structure matches roughly your example (e. g., there won't be any other whitespace characters between "timestamp", the colon, and the value), then this awk should be ok. If available, using jq for JSON transformation is the better choice by far!
awk '{print gensub(/("(timestamp|score)": )"([0-9]+)"/, "\\1\\3", "g")}' file

Be warned that tonumber can lose precision. If using tonumber is inadmissible, and if the output is produced by jq (or otherwise linearized vertically), then using awk as proposed elsewhere on this page is a good way to go. (If your awk does not have gensub, then the awk program can be easily adapted.) Here is the same thing using sed, assuming its flag for extended regex processing is -E:
sed -E -e 's/"(timestamp|score)": "([0-9]+)"/"\1": \2/'
For reference, if there's any doubt about where the relevant keys are located, here's a filter in jq that is agnostic about that:
walk(if type == "object"
then if has("timestamp") then .timestamp|=tonumber else . end
| if has("score") then .score|=tonumber else . end
else . end)
If your jq does not have walk/1, then simply snarf its def from the web, e.g. from https://raw.githubusercontent.com/stedolan/jq/master/src/builtin.jq
If you wanted to convert all number-valued strings to numbers, you could write:
walk(if type=="object" then map_values(tonumber? // .) else . end)
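The key-agnostic walk can also be sketched in Python, where integers have arbitrary precision, so converting a digit-only string to int loses nothing. This is a minimal sketch, assuming the file is valid JSON and fits in memory. The function name unquote_numbers is hypothetical, and only digit-only strings under the two keys from the question are converted.

```python
import json
import re

DIGITS_ONLY = re.compile(r"[0-9]+")


def unquote_numbers(node, keys=("timestamp", "score")):
    """Recursively convert digit-only string values of `keys` to ints."""
    if isinstance(node, dict):
        result = {}
        for name, value in node.items():
            if name in keys and isinstance(value, str) and DIGITS_ONLY.fullmatch(value):
                result[name] = int(value)
            else:
                result[name] = unquote_numbers(value, keys)
        return result
    if isinstance(node, list):
        return [unquote_numbers(item, keys) for item in node]
    return node


if __name__ == "__main__":
    data = json.loads('[{"timestamp": "1538564256", "score": "10", "id": "c-1"}]')
    print(json.dumps(unquote_numbers(data), indent=2))
```

Values that are not digit-only strings, such as the id field, are left untouched.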

This might work for you (GNU sed):
sed ':a;/"timestamp":\s*"1538564256",/{s/"//3g;:b;n;/timestamp/ba;/"score":\s*"10"/s/"//3g;Tb}' file
On encountering a line that contains "timestamp": "1538564256", remove the 3rd or more "'s. Then read on until another line containing timestamp and repeat or a line containing "score": "10 and remove the 3rd or more "'s.

How can i remove these special characters from JSON output file

^[[0;32m ?~V? ^[[0m
The JSON file is written by a shell script, and the text processing produces these special characters.
I tried dos2unix and also tried replacing the characters globally with the %s option, without success.

Check this out. I introduced some control characters into a sample JSON file; they can be displayed with the "cat -v" command. The ^B, ^A and ^D below are control characters.
Use perl to remove the control characters completely. You can redirect the output to a new file.
> cat -v json_control.txt
^B{"menu": {
"id": "file",
"value": "File",
"popup": ^B{
"menuitem": [
{"value": "New", "onclick": "CreateNewDoc()"},
{"value": "Open", "onclick": "OpenDoc()"},
{"value": "Close", "onclick": "CloseDoc()"}
]
}
}}^D
^A
> perl -pe ' { s/[\x00-\x09\x0B-\x1F]//g } ' json_control.txt | cat -v
{"menu": {
"id": "file",
"value": "File",
"popup": {
"menuitem": [
{"value": "New", "onclick": "CreateNewDoc()"},
{"value": "Open", "onclick": "OpenDoc()"},
{"value": "Close", "onclick": "CloseDoc()"}
]
}
}}
>
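The ^[[0;32m and ^[[0m in the question look like ANSI colour escape sequences. Removing only the ESC byte would leave [0;32m behind, so the whole sequence has to be matched. Here is a minimal sketch in Python, assuming the debris consists of CSI sequences plus stray control characters; the names ANSI_CSI, CONTROL and strip_terminal_codes are illustrative.

```python
import re

# CSI sequences such as ESC[0;32m (colour) and ESC[0m (reset):
# ESC [ , parameter bytes, intermediate bytes, one final byte.
ANSI_CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")
# Remaining control characters except tab, newline and carriage return.
CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def strip_terminal_codes(text):
    """Remove ANSI CSI escape sequences, then any leftover control characters."""
    return CONTROL.sub("", ANSI_CSI.sub("", text))


if __name__ == "__main__":
    print(strip_terminal_codes('\x1b[0;32m{"status": "ok"}\x1b[0m'))
```

If the shell script itself emits the colour codes, turning colour off at the source is usually cleaner than filtering afterwards.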

search for first occurrence of a string and print its value in linux

Hi, I have the content below in a file and I want the value of shortversion to be printed:
{
"app_versions": [
{
"version": "15",
"shortversion": "0.0.15",
"title": "java expert",
"timestamp": 1469530069,
"appsize": 3436229,
"notes": ,
"mandatory": false,
"external": false,
"device_family": null,
"id": 9,
"app_id": 356250,
"minimum_os_version": "4.1",
,
{
"version": "7",
"shortversion": "0.0.7",
"title": "java expert",
"timestamp": 1469528889,
"appsize": 3436225,
,
{
"version": "3",
"shortversion": "0.0.3",
"title": "javaExpert",
"timestamp": 1469209202,
"appsize": 3420965,
How can I print the value of the first occurrence of shortversion using sed? I have used the following awk command to get the shortversion: awk -F'"' '/\"shortversion\"/{print $10;}' read.version. This command outputs 0.0.15, which is correct, but the file is generated dynamically. I would appreciate your help with this.

It is more modular to use a command-line JSON parser like jq to parse your JSON input. Your script will also be easier to maintain if your JSON object tree changes in the future.
You can get shortversion for the first element of your app_versions array (index 0) with the following:
jq -r ".app_versions[0].shortversion" your_file.json

Maybe you can change the field separator to ':', e.g.
awk -F":" '/shortversion/{print $2}' datafile
and then use sed to strip the ',' and '"' characters from the result.
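The sample in the question is truncated and not valid JSON, so a parser would reject it. For input like that, the "first occurrence" idea can be sketched in Python with a regular expression. The names SHORTVERSION and first_shortversion are hypothetical, and the pattern is not JSON-aware: it simply takes the first "shortversion" key it sees.

```python
import re

# Match "shortversion", optional whitespace around the colon, then a quoted value.
SHORTVERSION = re.compile(r'"shortversion"\s*:\s*"([^"]*)"')


def first_shortversion(text):
    """Return the first "shortversion" value in the text, or None if absent."""
    match = SHORTVERSION.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Hypothetical usage with the file name from the question.
    with open("read.version", encoding="utf-8") as handle:
        print(first_shortversion(handle.read()))
```

If the file is complete, valid JSON, parsing it and reading app_versions[0] is the more robust choice.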
