DataFrame partitionBy on nested columns - apache-spark

I am trying to call partitionBy on a nested field like below:
val rawJson = sqlContext.read.json(filename)
rawJson.write.partitionBy("data.dataDetails.name").parquet(filenameParquet)
I get the error below when I run it. I do see 'name' listed as a field in the schema below. Is there a different format for specifying a column name that is nested?
java.lang.RuntimeException: Partition column data.dataDetails.name not found in schema StructType(StructField(name,StringType,true), StructField(time,StringType,true), StructField(data,StructType(StructField(dataDetails,StructType(StructField(name,StringType,true), StructField(id,StringType,true),true)),true))
This is my json file:
{
  "name": "AssetName",
  "time": "2016-06-20T11:57:19.4941368-04:00",
  "data": {
    "type": "EventData",
    "dataDetails": {
      "name": "EventName",
      "id": "1234"
    }
  }
}

This appears to be a known issue listed here: https://issues.apache.org/jira/browse/SPARK-18084
I had this issue as well and to work around it I was able to un-nest the columns on my dataset. My dataset was a little different than your dataset, but here is the strategy...
Original Json:
{
  "name": "AssetName",
  "time": "2016-06-20T11:57:19.4941368-04:00",
  "data": {
    "type": "EventData",
    "dataDetails": {
      "name": "EventName",
      "id": "1234"
    }
  }
}
Modified Json:
{
  "name": "AssetName",
  "time": "2016-06-20T11:57:19.4941368-04:00",
  "data_type": "EventData",
  "data_dataDetails_name": "EventName",
  "data_dataDetails_id": "1234"
}
Code to get to Modified Json:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StructField, StructType}

def main(args: Array[String]) {
  ...
  // Flatten the "data" struct and keep the root-level name and time columns.
  val data = df.select(children("data", df) ++ Seq(col("name"), col("time")): _*)
  data.printSchema
  data.write.partitionBy("data_dataDetails_name").format("csv").save(...)
}

// Returns one aliased column per field of the given struct column,
// e.g. data.type -> data_type, data.dataDetails -> data_dataDetails.
def children(colname: String, df: DataFrame) = {
  val parent = df.schema.fields.filter(_.name == colname).head
  val fields = parent.dataType match {
    case x: StructType => x.fields
    case _ => Array.empty[StructField]
  }
  fields.map(x => col(s"$colname.${x.name}").alias(s"${colname}_${x.name}"))
}

Since the feature is unavailable as of Spark 2.3.1, here's a workaround. Make sure to handle name conflicts between the nested fields and the fields at the root level.
{"date":"20180808","value":{"group":"xxx","team":"yyy"}}
df.select("date", "value.group", "value.team")
  .write
  .partitionBy("date", "group", "team")
  .parquet(filenameParquet)
The partitions end up like
date=20180808/group=xxx/team=yyy/part-xxx.parquet
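For reference, a minimal sketch (not part of the original answer) of reading the partitioned output back, assuming a SparkSession named spark and the same filenameParquet path: Spark reconstructs date, group, and team as columns from the directory names, so filters on them prune partitions.
import org.apache.spark.sql.functions.col

// Sketch: read the partitioned output back. The partition columns
// (date, group, team) are rebuilt from the directory layout, so this
// filter only scans the matching group=xxx directories.
val partitioned = spark.read.parquet(filenameParquet)
partitioned.filter(col("group") === "xxx").show()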

Related

Groovy: How do I iterate through a map to create a new map with values based on a specific condition

I am in no way an expert with groovy so please don't hold that against me.
I have JSON that looks like this:
{
  "metrics": [
    {
      "name": "metric_a",
      "help": "This tracks your A stuff.",
      "type": "GAUGE",
      "labels": [
        "pool"
      ],
      "unit": "",
      "aggregates": [],
      "meta": [
        {
          "category": "CAT A",
          "deployment": "environment-a"
        }
      ],
      "additional_notes": "Some stuff (potentially)"
    },
    ...
  ]
  ...
}
I'm using it as a source for automated documentation of all the metrics, so I'm iterating through it in various ways to get the information I need. So far so good; I'm most of the way there. The problem is that this all needs to be organized per deployment environment, meaning multiple metrics will share the same value for deployment.
My thought was that I could create a map with deployment as the key and, as the value, the names of all metrics with a matching deployment. Once I have that map, it should be easy to organize things the way they should be, but I can't figure out how to build it. With my current attempt, all the metric names get added under every key, which is expected since I'm not doing anything to filter them out. I was thinking groupBy would make sense here, but I can't figure out how to use it effectively, and frankly I'm not sure it will solve my problem by itself. Here is my code so far:
parentChild = [:]
children = []
metrics.each { metric ->
    def metricName = metric.name
    def depName = metric.meta.findResult{ it.deployment }
    children.add(metricName)
    parentChild.put(depName, children)
}
What is the best way to create a new map where the values for each key are based off a specific condition?
EDIT: The desired result would be each key in the resulting map would be a unique deployment value from all the metrics (as a string). Each value would be name of each metric that contains that deployment (as an array).
[environment-a:
[metric_a,metric_b,metric_c,...],
environment-b:
[metric_d,metric_e,metric_f,...]
...]
I would use a combo of withDefault() to pre-fill each map-entry value with a fresh TreeSet-instance (sorted no-duplicates set) and standard inject().
I reduced your sample data to the bare minimum and added some new nodes:
import groovy.json.*
String input = '''\
{
  "metrics": [
    {
      "name": "metric_a",
      "meta": [
        {
          "deployment": "environment-a"
        }
      ]
    },
    {
      "name": "metric_b",
      "meta": [
        {
          "deployment": "environment-a"
        }
      ]
    },
    {
      "name": "metric_c",
      "meta": [
        {
          "deployment": "environment-a"
        },
        {
          "deployment": "environment-b"
        }
      ]
    },
    {
      "name": "metric_d",
      "meta": [
        {
          "deployment": "environment-b"
        }
      ]
    }
  ]
}'''
def json = new JsonSlurper().parseText input
def groupedByDeployment = json.metrics.inject( [:].withDefault{ new TreeSet() } ){ res, metric ->
  metric.meta.each{ res[ it.deployment ] << metric.name }
  res
}
assert groupedByDeployment.toString() == '[environment-a:[metric_a, metric_b, metric_c], environment-b:[metric_c, metric_d]]'
If your metrics.meta array is supposed to have a single value, you can simplify the code by replacing the line:
metric.meta.each{ res[ it.deployment ] << metric.name }
with
res[ metric.meta.first().deployment ] << metric.name

Spark can not process recursive avro data

I have an avsc schema like the one below:
{
  "name": "address",
  "type": [
    "null",
    {
      "type": "record",
      "name": "Address",
      "namespace": "com.data",
      "fields": [
        {
          "name": "address",
          "type": ["null", "com.data.Address"],
          "default": null
        }
      ]
    }
  ],
  "default": null
}
On loading this data in pyspark:
jsonFormatSchema = open("Address.avsc", "r").read()
spark = SparkSession.builder.appName('abc').getOrCreate()
df = spark.read.format("avro")\
.option("avroSchema", jsonFormatSchema)\
.load("xxx.avro")
I got this exception:
"Found recursive reference in Avro schema, which can not be processed by Spark"
I tried many other configurations, but without any success.
To execute it, I use spark-submit with:
--packages org.apache.spark:spark-avro_2.12:3.0.1
This is intended behaviour; you can take a look at the issue:
https://issues.apache.org/jira/browse/SPARK-25718

Do json key transformation with apache nifi

I need to do a JSON transformation in Apache NiFi. The JSON keys in the payload are generated dynamically.
For example, in the input given below, 'customer' has the attributes 'fname' and 'lname'. I need to change 'fname' -> 'firstname' and 'lname' -> 'lastname' as provided in 'mappingvalues'.
Since I am a newbie to NiFi, I don't know where to start. I have tried some JSON transformers like Jolt, but couldn't achieve the expected result.
The Jolt transform that I have used is given below:
[
  {
    "operation": "shift",
    "spec": {
      "customer": {
        "*": {
          "#": "&"
        }
      }
    }
  }
]
which produced an output
{
  "fname": "akhil",
  "lname": "kumar"
}
The input and expected output of what I need to achieve is given below :
{
  "customer": {
    "fname": "akhil",
    "lname": "kumar",
    .
    .
    .
  },
  "mappingvalues": {
    "fname": "firstname",
    "lname": "lastname",
    .
    .
    .
  }
}
##OUTPUT
{
  "customer": {
    "firstname": "akhil",
    "lastname": "kumar",
    .
    .
    .
  }
}
Is there any way to achieve the same in NiFi, with or without using the Jolt transform? Is it possible to do the same with a Groovy script?
Please help me on the same.
The code in Groovy with recursive mapping:
import groovy.json.JsonSlurper
import groovy.json.JsonBuilder

def ff = session.get()
if(!ff)return

def json = ff.read().withReader("UTF-8"){ r-> new JsonSlurper().parse(r) }
def mappings = json.remove('mappingvalues')

def mapper(o, mappings){
    if(o instanceof Map){
        //json object: iterate it and rename keys according to the mappings
        o = o.collectEntries{ k,v-> [ (mappings[k] ?: k), mapper(v, mappings) ] }
    }else if(o instanceof List){
        //map elements in array
        o = o.collect{ v-> mapper(v, mappings) }
    }
    return o
}

json = mapper(json, mappings)
ff.write("UTF-8"){ w-> new JsonBuilder(json).writeTo(w) }
REL_SUCCESS << ff
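A note on context (my assumption, not stated in the original answer): this script appears to target NiFi's ExecuteGroovyScript processor, which exposes the session variable, the REL_SUCCESS relationship, and the flow-file read/write helpers used above; it would need adjusting for the generic ExecuteScript processor.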

JSON to dataset in Spark

I am facing an issue for which I am seeking your help. I have a task to convert a JSON file to a Dataset so that it can be loaded into Hive.
Code 1
SparkSession spark1 = SparkSession
.builder()
.appName("File_Validation")
.config("spark.some.config.option", "some-value")
.getOrCreate();
Dataset<Row> df = spark1.read().json("input/sample.json");
df.show();
The above code throws a NullPointerException.
I tried another way:
Code 2
JavaRDD<String> jsonFile = context.textFile("input/sample.json");
Dataset<Row> df2 = spark1.read().json(jsonFile);
df2.show();
I created an RDD and passed it to spark1 (the SparkSession).
Code 2 reads the JSON into a different format, with the header
+--------------------+
| _corrupt_record|
+--------------------+
with schema as - |-- _corrupt_record: string (nullable = true)
Please help in fixing it.
Sample JSON
{
  "user": "gT35Hhhre9m",
  "dates": ["2016-01-29", "2016-01-28"],
  "status": "OK",
  "reason": "some reason",
  "content": [{
    "foo": 123,
    "bar": "val1"
  }, {
    "foo": 456,
    "bar": "val2"
  }, {
    "foo": 789,
    "bar": "val3"
  }, {
    "foo": 124,
    "bar": "val4"
  }, {
    "foo": 126,
    "bar": "val5"
  }]
}
Your JSON should be line-delimited: one JSON object per line. For example:
{ "property1": 1 }
{ "property1": 2 }
This will be read as a Dataset with 2 rows and one column.
From documentation:
Note that the file that is offered as a json file is not a typical
JSON file. Each line must contain a separate, self-contained valid
JSON object. As a consequence, a regular multi-line JSON file will
most often fail.
Of course, read the data with SparkSession, as it will infer the schema.
You can't read pretty-printed (formatted) JSON in Spark by default. Your JSON should be on a single line, like this:
{"user": "gT35Hhhre9m","dates": ["2016-01-29", "2016-01-28"],"status": "OK","reason": "some reason","content": [{"foo": 123,"bar": "val1"}, {"foo": 456,"bar": "val2"}, {"foo": 789,"bar": "val3"}, {"foo": 124,"bar": "val4"}, {"foo": 126,"bar": "val5"}]}
Or it could be line-delimited JSON, one object per line, like this:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
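As a side note not covered by the answers above: since Spark 2.2 there is also a multiLine option that lets Spark parse a pretty-printed JSON file whose objects span several lines. A minimal Scala sketch, assuming the same input/sample.json path as in the question:
// Sketch: with multiLine enabled, Spark parses the whole file as a single
// JSON document (or an array of documents) instead of one object per line,
// so the pretty-printed sample no longer lands in _corrupt_record.
val df = spark.read
  .option("multiLine", true)
  .json("input/sample.json")
df.printSchema()
df.show(false)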

Creating RDD from sequence of GenericRecord in spark will change field values in generic record

When I create an RDD from GenericRecords (Avro), immediately collect it, and print those records, I receive wrong field values, modified in a strange way:
every field ends up with the value of the first field in the schema, i.e.
import org.apache.avro.Schema
import org.apache.avro.generic.GenericData.Record

def createGenericRecord(first: String, second: String) = {
  val schemaString =
    """
      |{
      |  "type": "record",
      |  "name": "test_schema",
      |  "fields": [
      |    { "name": "test_field1", "type": "string" },
      |    { "name": "test_field2", "type": ["null", "string"] }
      |  ]
      |}
    """.stripMargin
  val parser = new Schema.Parser()
  parser.setValidate(true)
  parser.setValidateDefaults(true)
  val schema = parser.parse(schemaString)
  val genericRecord = new Record(schema)
  genericRecord.put("test_field1", first)
  genericRecord.put("test_field2", second)
  genericRecord
}
val record1 = createGenericRecord("test1","test2")
val record2 = createGenericRecord("test3","test4")
println(record1)//prints {"test_field1": "test1", "test_field2": "test2"}
println(record2)//prints {"test_field1": "test3", "test_field2": "test4"}
val t = sc.makeRDD(Seq(record1, record2))
val collected = t.collect()
println(collected(0))//prints {"test_field1": "test1", "test_field2": "test1"}
println(collected(1))//prints {"test_field1": "test3", "test_field2": "test3"}
I am using Spark 1.2.0 with spark.serializer set to org.apache.spark.serializer.KryoSerializer.
The solution for this problem is to update the org.apache.avro % avro dependency to version 1.7.7.
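For example, with sbt the bump would look roughly like this (a sketch; adjust the coordinates to your own build tool):
// build.sbt -- pin the Avro dependency to 1.7.7, as suggested above
libraryDependencies += "org.apache.avro" % "avro" % "1.7.7"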
