Nested JSON with duplicate keys - groovy
I have to process 10 billion nested JSON records per day using NiFi (version 1.9). As part of the job, I am trying to convert the nested JSON to CSV using a Groovy script. I referred to the Stack Overflow questions below on the same topic and came up with the code that follows.
Groovy collect from map and submap
how to convert json into key value pair completely using groovy
But I am not sure how to retrieve the values of duplicate keys. The sample JSON is defined in the variable "json" in the code below. The key "Flag1" appears in multiple sections (i.e., "OF" and "SF"). I want to get the output as CSV.
This is the output when I execute the Groovy code below: 2019-10-08 22:33:29.244000,v12,-,36178,0,0/0,10.65.5.56,sf,sf (the Flag1 value is replaced by the value of that key's last occurrence, so "sf" appears twice).
I am not an expert in Groovy. If there is a better approach, please suggest it and I will give it a try.
import groovy.json.*
def json = '{"transaction":{"TS":"2019-10-08 22:33:29.244000","CIPG":{"CIP":"10.65.5.56","CP":"0"},"OF":{"Flag1":"of","Flag2":"-"},"SF":{"Flag1":"sf","Flag2":"-"}}}'
def jsonReplace = json.replace('{"transaction":{','{"transaction":[{').replace('}}}','}}]}')
def jsonRecord = new JsonSlurper().parseText(jsonReplace)
def columns = ["TS","V","PID","RS","SR","CnID","CIP","Flag1","Flag1"]
def flatten
flatten = { row ->
def flattened = [:]
row.each { k, v ->
if (v instanceof Map) {
flattened << flatten(v)
} else if (v instanceof Collection && v.every {it instanceof Map}) {
v.each { flattened << flatten(it) }
} else {
flattened[k] = v
}
}
flattened
}
print "output: " + jsonRecord.transaction.collect {row -> columns.collect {colName -> flatten(row)[colName]}.join(',')}.join('\n')
Edit: Based on the replies from @cfrick and @stck, I have tried the suggested option and have a follow-up question below.
@cfrick and @stck - Thanks for your responses.
The original source JSON record has more than 100 columns, and I am using "InvokeScriptedProcessor" in NiFi to trigger the Groovy script.
Below is the original Groovy script I am using in "InvokeScriptedProcessor", in which I have used streams (InputStream, OutputStream). Is this what you are referring to?
Am I doing anything wrong?
import groovy.json.JsonSlurper
class customJSONtoCSV implements Processor {
def REL_SUCCESS = new Relationship.Builder().name("success").description("FlowFiles that were successfully processed").build();
def log
static def flatten(row, prefix="") {
def flattened = new HashMap<String, String>()
row.each { String k, Object v ->
def key = prefix ? prefix + "_" + k : k;
if (v instanceof Map) {
flattened.putAll(flatten(v, k))
} else {
flattened.put(key, v.toString())
}
}
return flattened
}
static def toCSVRow(HashMap row) {
def columns = ["CIPG_CIP","CIPG_CP","CIPG_SLP","CIPG_SLEP","CIPG_CVID","SIPG_SIP","SIPG_SP","SIPG_InP","SIPG_SVID","TG_T","TG_R","TG_C","TG_SDL","DL","I_R","UAP","EDBL","Ca","A","RQM","RSM","FIT","CSR","OF_Flag1","OF_Flag2","OF_Flag3","OF_Flag4","OF_Flag5","OF_Flag6","OF_Flag7","OF_Flag8","OF_Flag9","OF_Flag10","OF_Flag11","OF_Flag12","OF_Flag13","OF_Flag14","OF_Flag15","OF_Flag16","OF_Flag17","OF_Flag18","OF_Flag19","OF_Flag20","OF_Flag21","OF_Flag22","OF_Flag23","SF_Flag1","SF_Flag2","SF_Flag3","SF_Flag4","SF_Flag5","SF_Flag6","SF_Flag7","SF_Flag8","SF_Flag9","SF_Flag10","SF_Flag11","SF_Flag12","SF_Flag13","SF_Flag14","SF_Flag15","SF_Flag16","SF_Flag17","SF_Flag18","SF_Flag19","SF_Flag20","SF_Flag21","SF_Flag22","SF_Flag23","SF_Flag24","GF_Flag1","GF_Flag2","GF_Flag3","GF_Flag4","GF_Flag5","GF_Flag6","GF_Flag7","GF_Flag8","GF_Flag9","GF_Flag10","GF_Flag11","GF_Flag12","GF_Flag13","GF_Flag14","GF_Flag15","GF_Flag16","GF_Flag17","GF_Flag18","GF_Flag19","GF_Flag20","GF_Flag21","GF_Flag22","GF_Flag23","GF_Flag24","GF_Flag25","GF_Flag26","GF_Flag27","GF_Flag28","GF_Flag29","GF_Flag30","GF_Flag31","GF_Flag32","GF_Flag33","GF_Flag34","GF_Flag35","VSL_VSID","VSL_TC","VSL_MTC","VSL_NRTC","VSL_ET","VSL_HRES","VSL_VRES","VSL_FS","VSL_FR","VSL_VSD","VSL_ACB","VSL_ASB","VSL_VPR","VSL_VSST","HRU_HM","HRU_HD","HRU_HP","HRU_HQ","URLF_CID","URLF_CGID","URLF_CR","URLF_RA","URLF_USM","URLF_USP","URLF_MUS","TCPSt_WS","TCPSt_SE","TCPSt_WSFNS","TCPSt_WSF","TCPSt_EM","TCPSt_RSTE","TCPSt_MSS","NS_OPID","NS_ODID","NS_EPID","NS_TrID","NS_VSN","NS_LSUT","NS_STTS","NS_TCPPR","CQA_NL","CQA_CL","CQA_CLC","CQA_SQ","CQA_SQC","TS","V","PID","RS","SR","CnID","A_S","OS","CPr","CVB","CS","HS","SUNR","SUNS","ML","MT","TCPSL","CT","MS","MSH","SID","SuID","UA","DID","UAG","CID","HR","CRG","CP1","CP2","AIDF","UCB","CLID","CLCL","OPTS","PUAG","SSLIL"]
return columns.collect { column ->
return row.containsKey(column) ? row.get(column) : ""
}.join(',')
}
@Override
void initialize(ProcessorInitializationContext context) {
log = context.getLogger()
}
@Override
Set<Relationship> getRelationships() {
return [REL_SUCCESS] as Set
}
@Override
void onTrigger(ProcessContext context, ProcessSessionFactory sessionFactory) throws ProcessException {
try {
def session = sessionFactory.createSession()
def flowFile = session.get()
if (!flowFile) return
flowFile = session.write(flowFile,
{ inputStream, outputStream ->
def bufferedReader = new BufferedReader(new InputStreamReader(inputStream, 'UTF-8'))
def jsonSlurper = new JsonSlurper()
def line
def header = "CIPG_CIP,CIPG_CP,CIPG_SLP,CIPG_SLEP,CIPG_CVID,SIPG_SIP,SIPG_SP,SIPG_InP,SIPG_SVID,TG_T,TG_R,TG_C,TG_SDL,DL,I_R,UAP,EDBL,Ca,A,RQM,RSM,FIT,CSR,OF_Flag1,OF_Flag2,OF_Flag3,OF_Flag4,OF_Flag5,OF_Flag6,OF_Flag7,OF_Flag8,OF_Flag9,OF_Flag10,OF_Flag11,OF_Flag12,OF_Flag13,OF_Flag14,OF_Flag15,OF_Flag16,OF_Flag17,OF_Flag18,OF_Flag19,OF_Flag20,OF_Flag21,OF_Flag22,OF_Flag23,SF_Flag1,SF_Flag2,SF_Flag3,SF_Flag4,SF_Flag5,SF_Flag6,SF_Flag7,SF_Flag8,SF_Flag9,SF_Flag10,SF_Flag11,SF_Flag12,SF_Flag13,SF_Flag14,SF_Flag15,SF_Flag16,SF_Flag17,SF_Flag18,SF_Flag19,SF_Flag20,SF_Flag21,SF_Flag22,SF_Flag23,SF_Flag24,GF_Flag1,GF_Flag2,GF_Flag3,GF_Flag4,GF_Flag5,GF_Flag6,GF_Flag7,GF_Flag8,GF_Flag9,GF_Flag10,GF_Flag11,GF_Flag12,GF_Flag13,GF_Flag14,GF_Flag15,GF_Flag16,GF_Flag17,GF_Flag18,GF_Flag19,GF_Flag20,GF_Flag21,GF_Flag22,GF_Flag23,GF_Flag24,GF_Flag25,GF_Flag26,GF_Flag27,GF_Flag28,GF_Flag29,GF_Flag30,GF_Flag31,GF_Flag32,GF_Flag33,GF_Flag34,GF_Flag35,VSL_VSID,VSL_TC,VSL_MTC,VSL_NRTC,VSL_ET,VSL_HRES,VSL_VRES,VSL_FS,VSL_FR,VSL_VSD,VSL_ACB,VSL_ASB,VSL_VPR,VSL_VSST,HRU_HM,HRU_HD,HRU_HP,HRU_HQ,URLF_CID,URLF_CGID,URLF_CR,URLF_RA,URLF_USM,URLF_USP,URLF_MUS,TCPSt_WS,TCPSt_SE,TCPSt_WSFNS,TCPSt_WSF,TCPSt_EM,TCPSt_RSTE,TCPSt_MSS,NS_OPID,NS_ODID,NS_EPID,NS_TrID,NS_VSN,NS_LSUT,NS_STTS,NS_TCPPR,CQA_NL,CQA_CL,CQA_CLC,CQA_SQ,CQA_SQC,TS,V,PID,RS,SR,CnID,A_S,OS,CPr,CVB,CS,HS,SUNR,SUNS,ML,MT,TCPSL,CT,MS,MSH,SID,SuID,UA,DID,UAG,CID,HR,CRG,CP1,CP2,AIDF,UCB,CLID,CLCL,OPTS,PUAG,SSLIL"
outputStream.write("${header}\n".getBytes('UTF-8'))
while (line = bufferedReader.readLine()) {
def jsonReplace = line.replace('{"transaction":{','{"transaction":[{').replace('}}}','}}]}')
def jsonRecord = new JsonSlurper().parseText(jsonReplace)
def a = jsonRecord.transaction.collect { row ->
return flatten(row)
}.collect { row ->
return toCSVRow(row)
}
outputStream.write("${a}\n".getBytes('UTF-8'))
}
} as StreamCallback)
session.transfer(flowFile, REL_SUCCESS)
session.commit()
}
catch (e) {
throw new ProcessException(e)
}
}
@Override
Collection<ValidationResult> validate(ValidationContext context) { return null }
@Override
PropertyDescriptor getPropertyDescriptor(String name) { return null }
@Override
void onPropertyModified(PropertyDescriptor descriptor, String oldValue, String newValue) { }
@Override
List<PropertyDescriptor> getPropertyDescriptors() {
return [] as List
}
@Override
String getIdentifier() { return null }
}
processor = new customJSONtoCSV()
If I should not use "collect", what else should I use to create the rows?
In the output flow file, each record comes out wrapped in []. I tried the line below, but it is not working, and I am not sure whether I am doing the right thing. I want the CSV output without the [].
return toCSVRow(row).toString()
If you know what you want to extract exactly (and given you want to
generate a CSV from it) IMHO you are way better off to just shape the
data in the way you later want to consume it. E.g.
def data = new groovy.json.JsonSlurper().parseText('[{"TS":"2019-10-08 22:33:29.244000","CIPG":{"CIP":"10.65.5.56","CP":"0"},"OF":{"Flag1":"of","Flag2":"-"},"SF":{"Flag1":"sf","Flag2":"-"}}]')
extractors = [
{ it.TS },
{ it.V },
{ it.PID },
{ it.RS },
{ it.SR },
{ it.CIPG.CIP },
{ it.CIPG.CP },
{ it.OF.Flag1 },
{ it.SF.Flag1 },]
def extract(row) {
extractors.collect{ it(row) }
}
println(data.collect{extract it})
// ⇒ [[2019-10-08 22:33:29.244000, null, null, null, null, 10.65.5.56, 0, of, sf]]
As stated in the other answer, given the sheer amount of data you are trying to
convert:
Make sure to use a library to generate the CSV file, or else
you will hit problems with the content you try to write (e.g. line
breaks, or data containing the separator char).
Don't use collect (it is eager) to create the rows.
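To make the two points above concrete, here is a minimal sketch that writes one CSV line per input record as it is read, instead of collecting all rows first. It also joins the values of each row into a plain string, so nothing is printed inside []. The escapeCsv helper is hypothetical and only stands in for what a real CSV library would do, and the StringReader/StringWriter pair is an assumption standing in for the streams NiFi hands to the callback.

```groovy
import groovy.json.JsonSlurper

// Hypothetical helper (a stand-in for a proper CSV library):
// quote a field when it contains the separator, a quote, or a line break.
def escapeCsv = { Object value ->
    def s = (value == null) ? '' : value.toString()
    if (s.contains(',') || s.contains('"') || s.contains('\n') || s.contains('\r')) {
        return '"' + s.replace('"', '""') + '"'
    }
    return s
}

// One extractor per output column, as in the answer above.
def extractors = [
    { it.TS },
    { it.CIPG?.CIP },
    { it.OF?.Flag1 },
    { it.SF?.Flag1 },
]

// Assumed sample input: one JSON record per line.
def input = new StringReader('''{"TS":"t1","CIPG":{"CIP":"10.0.0.1"},"OF":{"Flag1":"a,b"},"SF":{"Flag1":"sf"}}
{"TS":"t2","CIPG":{"CIP":"10.0.0.2"},"OF":{"Flag1":"of"}}
''')
def out = new StringWriter()
def slurper = new JsonSlurper()

// Each record is parsed, converted and written before the next line is read.
input.eachLine { line ->
    if (line.trim()) {
        def row = slurper.parseText(line)
        out << extractors.collect { escapeCsv(it(row)) }.join(',') << '\n'
    }
}

print out
```

The small collect inside the loop only builds the handful of values for a single row; the thing to avoid is collecting every row of a large input before writing anything.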
The idea is to modify the "flatten" method: it should differentiate between identical nested keys by using the parent key as a prefix.
I've simplified code a bit:
import groovy.json.*
def json = '{"transaction":{"TS":"2019-10-08 22:33:29.244000","CIPG":{"CIP":"10.65.5.56","CP":"0"},"OF":{"Flag1":"of","Flag2":"-"},"SF":{"Flag1":"sf","Flag2":"-"}}}'
// Wrap the single transaction object in a list so it can be iterated as rows
def jsonReplace = json.replace('{"transaction":{','{"transaction":[{').replace('}}}','}}]}')
def jsonRecord = new JsonSlurper().parseText(jsonReplace)
static def flatten(row, prefix="") {
    def flattened = new HashMap<String, String>()
    row.each { String k, Object v ->
        def key = prefix ? prefix + "." + k : k
        if (v instanceof Map) {
            // Pass the full key so deeper nesting keeps every parent in the prefix
            flattened.putAll(flatten(v, key))
        } else {
            flattened.put(key, v.toString())
        }
    }
    return flattened
}
static def toCSVRow(HashMap row) {
    def columns = ["TS","V","PID","RS","SR","CnID","CIPG.CIP","OF.Flag1","SF.Flag1"] // Nested keys now carry their parent prefix!
    return columns.collect { column ->
        return row.containsKey(column) ? row.get(column) : ""
    }.join(', ')
}
def a = jsonRecord.transaction.collect { row ->
    return flatten(row)
}.collect { row ->
    return toCSVRow(row)
}.join('\n')
println a
Output would be:
2019-10-08 22:33:29.244000, , , , , , 10.65.5.56, of, sf
Related
Regex for groovy not working
Please find my code below. I am trying to iterate through the files in a directory and print every match of the regex (&)(.+?\b), but this doesn't seem to work (it returns an empty string). Where did I go wrong?
import groovy.io.FileType
class FileExample {
    static void main(String[] args) {
        def list = []
        def dir = new File("*path to dir*")
        dir.eachFileRecurse (FileType.FILES) { file ->
            list << file.name;
            def s= "${file.text}";
            def w = s.toString();
            w.readLines().grep(~/(&)(.+?\b)/)
        }
    }
}
I figured it out:
import groovy.io.FileType
class FileExample {
    static void main(String[] args) {
        def list = []
        def dir = new File("/path/to/directory/containingfiles")
        def p = []
        dir.eachFileRecurse (FileType.FILES) { file ->
            list << file.name
            def s = file.text
            def findtp = (s =~ /(&)(.+?\b)/)
            findtp.each {
                p.push(it[2]) // As I'm only interested in the second matched group
            }
        }
        p.each { println it }
    }
}
Now all the matched strings are stored in the list p and can be printed or used elsewhere via p[0] etc.
w.readLines().grep(~/(&)(.+?\b)/) does not do any output, it gives you the matching lines as return value, so if you change it to println w.readLines().grep(~/(&)(.+?\b)/) or w.readLines().grep(~/(&)(.+?\b)/).each { println it }, you will get the matched lines printed on stdout. Btw. def s= "${file.text}"; def w = s.toString(); w.readLines() is just a biiiig waste of time. It is exactly the same as file.text.readLines(). "${file.text}" is a GString that replaces the placeholders on evaluation, but as you have nothing but the file.text placeholder, this is the same as file.text as String or file.text.toString(). But as file.text actually is a String already, it is identical to file.text. And even if you would need the GString because you have more than the placeholder in it, GString already has a readLines() method, so no need for using .toString() first, even if a GString would be necessary.
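As a small illustration of the difference between the two approaches in this thread, the sketch below runs the same regex once through a find-style matcher and once through grep. The sample strings are made up. Note that, as far as I know, grep with a Pattern keeps only elements that match the pattern as a whole, which is a further reason it is a poor fit for pulling out substrings.

```groovy
// Made-up sample text containing two "&name" occurrences.
def text = 'a=1&foo=2&bar=3'

// =~ creates a Matcher; iterating over it yields [fullMatch, group1, group2] per hit.
def names = (text =~ /(&)(.+?\b)/).collect { it[2] }

// grep returns whole list elements, and only those the pattern matches entirely.
def lines = ['&foo', 'x &bar y']
def fullMatches = lines.grep(~/(&)(.+?\b)/)

println names
println fullMatches
```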
Failed to add element to new ArrayList in Groovy
I am trying to simulate a multimap: each value of langVarMap is a list. When I add a new String to the list, I get the following error:
No signature of method: java.lang.Boolean.add() is applicable for argument types: (java.lang.String) values: [mm]
Here is the code snippet:
def langs = engine.languages as Set
def langVarMap = [:]
engine.models.each { model ->
    def lang = (model.@language.text()) // String
    def variant = (model.@variant.text()) // String
    langs.add(lang)
    if (langVarMap.get(lang)) {
        def a = langVarMap.get(lang) //ArrayList
        langVarMap.put(lang, a.add(variant))
    } else {
        langVarMap.put(lang, [variant])
    }
}
Thanks in advance.
The problem is with this line:
langVarMap.put(lang, a.add(variant))
ArrayList.add(E e) returns a boolean, not the list. Putting the result of add into the map stores the boolean value true under that key, after which you cannot call the add method on it. You need to rewrite it as follows:
if (langVarMap.get(lang)) {
    def a = langVarMap.get(lang) //ArrayList
    // a is already in langVarMap, so there is no need to put it into the map again
    a.add(variant)
} else {
    langVarMap.put(lang, [variant])
}
You can refine this further to remove the redundant lookup:
def a = langVarMap.get(lang) //ArrayList
if (a) {
    a.add(variant)
} else {
    langVarMap.put(lang, [variant])
}
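For what it's worth, the if/else can be avoided altogether with Groovy's withDefault, which creates and stores an empty list the first time a key is looked up. This is only a sketch of an alternative, not what the answer above proposes, and the language/variant pairs are made-up sample data.

```groovy
// add() reports success as a boolean; it does not return the list.
def addResult = [].add('mm')

// Missing keys get a fresh list that is stored in the map on first access.
def langVarMap = [:].withDefault { [] }

// Made-up (language, variant) pairs standing in for engine.models.
def pairs = [['en', 'US'], ['en', 'GB'], ['de', 'DE']]
pairs.each { lang, variant ->
    langVarMap[lang] << variant   // << appends to the stored list in place
}

println langVarMap
```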
Groovy object properties in map
Instead of having to declare all the properties of an object in a map, like:
prop1: object.prop1
can't you just drop the object in there somehow, like below? Or what would be a proper way to achieve this?
results: [
    object,
    values: [
        test: 'subject'
    ]
]
object.properties will give you a class as well. You should be able to do the following.
Given your POGO object:
class User {
    String name
    String email
}
def object = new User(name:'tim', email:'tim@tim.com')
Write a method to inspect the class and pull the non-synthetic properties from it:
def extractProperties(obj) {
    obj.getClass()
       .declaredFields
       .findAll { !it.synthetic }
       .collectEntries { field ->
           [field.name, obj."$field.name"]
       }
}
Then, map spread that into your result map:
def result = [
    value: true,
    *:extractProperties(object)
]
To give you:
['value':true, 'name':'tim', 'email':'tim@tim.com']
If you don't mind using a few libraries, here's an option where you convert the object to JSON and then parse it back out as a map. I added mine to a BaseObject, which in your case object would extend.
class BaseObject {
    Map asMap() {
        def jsonSlurper = new groovy.json.JsonSlurperClassic()
        Map map = jsonSlurper.parseText(this.asJson())
        return map
    }
    String asJson(){
        def jsonOutput = new groovy.json.JsonOutput()
        String json = jsonOutput.toJson(this)
        return json
    }
}
I also wrote it without the JSON library originally. This is like the other answers, but handles cases where the object property is a List.
class BaseObject {
    Map asMap() {
        Map map = objectToMap(this)
        return map
    }
    def objectToMap(object){
        Map map = [:]
        for(item in object.class.declaredFields){
            if(!item.synthetic){
                if (object."$item.name".hasProperty('length')){
                    map."$item.name" = objectListToMap(object."$item.name")
                }else if (object."$item.name".respondsTo('asMap')){
                    map << [ (item.name):object."$item.name"?.asMap() ]
                } else{
                    map << [ (item.name):object."$item.name" ]
                }
            }
        }
        return map
    }
    def objectListToMap(objectList){
        List list = []
        for(item in objectList){
            if (item.hasProperty('length')){
                list << objectListToMap(item)
            }else {
                list << objectToMap(item)
            }
        }
        return list
    }
}
This seems to work well *:object.properties
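One caveat when spreading object.properties directly: as noted earlier in this thread, the properties map also carries a class entry. A minimal sketch that filters it out first is shown below; the User class is reused from the earlier answer, and the extra metaClass filter is a precaution based on my assumption that some Groovy versions report that key too.

```groovy
class User {
    String name
    String email
}

def object = new User(name: 'tim', email: 'tim@tim.com')

// Drop the bookkeeping entries before spreading the rest into the result map.
def props = object.properties.findAll { !(it.key in ['class', 'metaClass']) }
def result = [value: true, *:props]

println result
```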
Creation of custom comparator for map in groovy
I have a class in Groovy:
class WhsDBFile {
    String name
    String path
    String svnUrl
    String lastRevision
    String lastMessage
    String lastAuthor
}
and a map object
def installFiles = [:]
that is filled in a loop by
WhsDBFile dbFile = new WhsDBFile()
installFiles[svnDiffStatus.getPath()] = dbFile
Now I try to sort this with a custom Comparator:
Comparator<WhsDBFile> whsDBFileComparator = new Comparator<WhsDBFile>() {
    @Override
    int compare(WhsDBFile o1, WhsDBFile o2) {
        if (FilenameUtils.getBaseName(o1.name) > FilenameUtils.getBaseName(o2.name)) {
            return 1
        } else if (FilenameUtils.getBaseName(o1.name) < FilenameUtils.getBaseName(o2.name)) {
            return -1
        }
        return 0
    }
}
installFiles.sort(whsDBFileComparator);
but I get this error:
java.lang.String cannot be cast to WhsDBFile
Any idea how to fix this? I need to use a custom comparator, because it will be much more complex in the future.
P.S. Full source of a sample Gradle task (the WhsDBFile class is described above):
project.task('sample') << {
    def installFiles = [:]
    WhsDBFile dbFile = new WhsDBFile()
    installFiles['sample_path'] = dbFile
    Comparator<WhsDBFile> whsDBFileComparator = new Comparator<WhsDBFile>() {
        @Override
        int compare(WhsDBFile o1, WhsDBFile o2) {
            if (o1.name > o2.name) {
                return 1
            } else if (o1.name < o2.name) {
                return -1
            }
            return 0
        }
    }
    installFiles.sort(whsDBFileComparator);
}
You can try to sort the entrySet():
def sortedEntries = installFiles.entrySet().sort { entry1, entry2 ->
    entry1.value <=> entry2.value
}
This invocation gives you a collection of Map.Entry. To get a map back, you can then call collectEntries() on the result:
def sortedMap = installFiles.entrySet().sort { entry1, entry2 ->
    ...
}.collectEntries()
sort can also take a closure as a parameter, which is coerced to a Comparator's compare() method as below. The toUpper() method just stands in for the implementation of FilenameUtils.getBaseName().
installFiles.sort { a, b ->
    toUpper(a.value.name) <=> toUpper(b.value.name)
}
// Replicating implementation of FilenameUtils.getBaseName()
// This can be customized according to requirement
String toUpper(String a) {
    a.toUpperCase()
}
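If the custom Comparator<WhsDBFile> has to stay (because it will grow more complex), one option is to keep it and call it on the entry values from inside the closure. As far as I can tell, Map.sort(Comparator) applies the comparator to the keys, which would explain the String cast error in the question. The sketch below uses a trimmed-down WhsDBFile and made-up paths and names.

```groovy
// Trimmed-down version of the class from the question.
class WhsDBFile {
    String name
}

// The custom comparator still works on WhsDBFile values only.
Comparator<WhsDBFile> byName = { WhsDBFile a, WhsDBFile b -> a.name <=> b.name } as Comparator

// Made-up entries whose key order differs from their name order.
def installFiles = [
    'x/1': new WhsDBFile(name: 'gamma.sql'),
    'x/2': new WhsDBFile(name: 'alpha.sql'),
    'x/3': new WhsDBFile(name: 'beta.sql'),
]

// The two-argument closure receives map entries; delegate to the comparator on their values.
def sorted = installFiles.sort { e1, e2 -> byName.compare(e1.value, e2.value) }

println sorted.keySet()
```

The closure form returns a new map in sorted order and leaves installFiles itself untouched.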
groovy.lang.MissingMethodException: No signature of method
I am getting the following error -
groovy.lang.MissingMethodException: No signature of method: Script64$_run_closure5_closure7_closure8_closure9_closure10_closure11.doCall() is applicable for argument types: (java.lang.String) values: Possible solutions: doCall(java.lang.Object, java.lang.Object), isCase(java.lang.Object), isCase(java.lang.Object) error at line:
Code - EDIT
import groovy.xml.*
List tempList = []
List listgenerated = []
def count = 0
for (a in 0..totalCount-1) {
    //nameList and valueList lists will have all the contents added as below commented pseudo code
    /*for (b in 0..50) {
        nameList.add(b,number) // number is some calculated value
        valueList.add(b,number)
        e.g. nameList=[name1, name2, name3,name4, name5]
        valueList =[val1, val2, val3, , val5]
        listgenerated should be = [[name1:val1, name2:val2], [name3:val3, name4: , name5:val5]]
    } */
    tempList = []
    for (j in count..nameList.size()) {
        count = j
        def nameKey = nameList[j]
        def value
        if (nameKey != null) {
            value = valueList[j]
            tempList << [(nameKey) : value]
        }
    }
    count = count
    listgenerated.putAt(a,tempList)
    number = number +1
}
def process = { binding, element, name ->
    if( element[ name ] instanceof Collection ) {
        element[ name ].each { n ->
            binding."$name"( n )
        }
    } else if( element[ name ] ) {
        binding."$name"( element[ name ] )
    }
}
class Form {
    List fields
}
def list = [[ name:'a', val:'1' ], [ name:'b', val :'2', name2:4, xyz:'abc', pqr:'']] //Edited list
f = new Form( fields: list ) //Works fine
f = new Form( fields: listgenerated ) //Gives the above error
String xml = XmlUtil.serialize( new StreamingMarkupBuilder().with { builder ->
    builder.bind { binding ->
        data {
            f.fields.each { fields ->
                item {
                    fields.each { name, value ->
                        process( binding, fields, name )
                    }
                }
            }
        }
    }
} )
If single quotes are added around the values while creating "listgenerated", they are taken as characters, and when printed the two lists look different. I am unable to figure out what exactly is going wrong. Any help is appreciated.
Thanks. Ref - Groovy: dynamically create XML for collection of objects with collections of properties
I believe, where you do:
//some loop to add multiple values to the list
listgenerated << name+":"+value
You need to do:
//some loop to add multiple values to the list
listgenerated << [ (name): value ]
That adds a map to the list rather than a String. It's hard to say, though, as your code example doesn't run without alteration, and I don't know whether it's the alterations that are solving the problem.
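A small sketch of the difference, using made-up names and values: a list of strings only looks like a list of maps when printed, and a two-parameter closure such as { name, value -> ... } can walk a Map but not a String.

```groovy
// Made-up sample data standing in for nameList and valueList.
def names = ['name1', 'name2']
def values = ['val1', 'val2']

def asStrings = []
def asMaps = []
names.eachWithIndex { name, i ->
    asStrings << (name + ":" + values[i])   // a plain String such as "name1:val1"
    asMaps << [(name): values[i]]           // a one-entry Map with a dynamic key
}

// A two-parameter closure receives key and value when iterating a Map.
def seen = []
asMaps.each { m -> m.each { k, v -> seen << (k + '=' + v) } }

// Iterating a String passes one character at a time, so the same closure shape fails.
def failed = false
try {
    asStrings[0].each { k, v -> }
} catch (MissingMethodException e) {
    failed = true
}

println asStrings
println asMaps
```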