Spark Java API to create schema from config file - apache-spark

 
I am looking for an approach to read a table schema from a config file, to avoid hard-coding it in Spark (Java). For example, to read two .csv files I create schemas as below:
#1
StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("emp_dept", DataTypes.StringType, true),
        DataTypes.createStructField("empid", DataTypes.IntegerType, true),
        DataTypes.createStructField("empdesignation", DataTypes.StringType, true),
        DataTypes.createStructField("emp_salary", DataTypes.IntegerType, true)
});
Dataset<Row> df1 = spark.read().format("csv")
        .option("header", "true")
        .schema(schema)
        .csv(path);
#2
StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("emp_details", DataTypes.StringType, true),
        DataTypes.createStructField("empid", DataTypes.IntegerType, true),
        DataTypes.createStructField("empfistname", DataTypes.StringType, true),
        DataTypes.createStructField("emplastname", DataTypes.IntegerType, true)
});
Dataset<Row> df2 = spark.read().format("csv")
        .option("header", "true")
        .schema(schema)
        .csv(path);
Instead of creating multiple schemas like this, I'd like to create them from a config file.

Perhaps you can use one of the static schema factory methods, StructType.fromDDL(String ddl) or DataType.fromJson(String json), depending on what you want your config file to look like. For example, with a simple DDL string:
scala> import org.apache.spark.sql.types._
scala> val struct = StructType.fromDDL("id int, descr string")
struct: org.apache.spark.sql.types.StructType = StructType(StructField(id,IntegerType,true),StructField(descr,StringType,true))
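In Java, the same idea could look like the sketch below: keep the DDL string in a plain properties file and build the StructType at runtime. The file name schemas.properties and the property key emp.schema are assumptions for illustration, not anything the question prescribes.
```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

public class SchemaFromConfig {

    // schemas.properties (hypothetical) could contain lines such as:
    //   emp.schema=emp_dept STRING, empid INT, empdesignation STRING, emp_salary INT

    static StructType loadSchema(String configPath, String key) throws IOException {
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream(configPath)) {
            props.load(in);
        }
        // Parse the DDL string from the config file into a Spark schema
        return StructType.fromDDL(props.getProperty(key));
    }

    public static void main(String[] args) throws IOException {
        SparkSession spark = SparkSession.builder().appName("schema-from-config").getOrCreate();
        StructType schema = loadSchema("schemas.properties", "emp.schema");
        Dataset<Row> df1 = spark.read()
                .option("header", "true")
                .schema(schema)
                .csv(args[0]);   // path to the csv file
        df1.printSchema();
    }
}
```
If you prefer JSON in the config file, DataType.fromJson accepts the format produced by StructType.json(), so the same kind of loader works with a cast of the result to StructType.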

Related

How to return GenericInternalRow from spark udf

I have a Spark UDF written in Scala that takes a couple of columns, applies some logic, and outputs an InternalRow. There is also a Spark schema of StructType present.
But when I try to return the InternalRow from the UDF there is an exception:
java.lang.UnsupportedOperationException: Schema for type
org.apache.spark.sql.catalyst.GenericInternalRow is not supported
val getData = (hash: String, `type`: String) => {
  val schema = hash match {
    case "people" => peopleSchema
    case "empl"   => emplSchema
  }
  getGenericInternalRow(schema)
}
val data = udf(getData)
Spark Version : 2.4.5

How to return json data in Dataset<Row> with encoder(structType) in Spark?

I am trying to return the required parameters in a Dataset<Row>. Whenever I return the data as a Row I am not able to encode it with the StructType, and if I use a Map/JSONObject instead it throws an error saying Map/JSONObject is not a valid external schema. Below is the code I tried. Any help will be appreciated, thanks in advance.
// method should return the rows for a Dataset<Row>
Row rowdat = RowFactory.create(jsondata);
return rowdat.iterator();
// Dataset data will be [[{"employees":"accountant","firstname":"walter", "age":"54"}]]
StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("employees", DataTypes.StringType, true),
        DataTypes.createStructField("firstname", DataTypes.StringType, true),
        DataTypes.createStructField("age", DataTypes.StringType, true)
});
ExpressionEncoder<Row> express = RowEncoder.apply(schema);
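For reference, a minimal sketch of the pattern this question appears to be aiming for in the Java API: produce Rows that line up with the StructType and pass an encoder built from that schema (RowEncoder.apply) to the transformation. The input Dataset, the helper name toStructuredRows, the flatMap body, and the literal values are assumptions for illustration only.
```java
import java.util.Collections;

import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Hypothetical helper: maps each input row to rows matching the target schema.
static Dataset<Row> toStructuredRows(Dataset<Row> input) {
    StructType schema = DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("employees", DataTypes.StringType, true),
            DataTypes.createStructField("firstname", DataTypes.StringType, true),
            DataTypes.createStructField("age", DataTypes.StringType, true)
    });

    // Each emitted Row must carry one value per schema field, in schema order.
    return input.flatMap(
            (FlatMapFunction<Row, Row>) row ->
                    Collections.singletonList(
                            RowFactory.create("accountant", "walter", "54")).iterator(),
            RowEncoder.apply(schema));
}
```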

Convert a Spark DataFrame Row to Avro and publish to Kafka

I have a Spark DataFrame with the schema below and am trying to stream this DataFrame to Kafka using Avro.
```
root
 |-- clientTag: struct (nullable = true)
 |    |-- key: string (nullable = true)
 |-- contactPoint: struct (nullable = true)
 |    |-- email: string (nullable = true)
 |    |-- type: string (nullable = true)
 |-- performCheck: string (nullable = true)
```
Sample Record: {"performCheck" : "N", "clientTag" :{"key":"value"}, "contactPoint": {"email":"abc#gmail.com", "type":"EML"}}
Avro Schema:
{
  "name":"Message",
  "namespace":"kafka.sample.avro",
  "type":"record",
  "fields":[
    {"type":"string", "name":"id"},
    {"type":"string", "name":"email"},
    {"type":"string", "name":"type"}
  ]
}
I have a couple of questions.
What is the best way to convert an org.apache.spark.sql.Row to an Avro message, given that I want to extract email and type from the DataFrame for each Row and use those values to construct the Avro message?
Eventually, all the Avro messages will be sent to Kafka. So, if there is an error while producing, how can I collect all the Rows that failed to be produced to Kafka and return them as a dataframe?
Thanks for the help
You can try this.
Q#1: You can extract child elements of the DataFrame using dot notation as:
val dfJSON = spark.read.json("/json/path/sample_avro_data_as_json.json") //can read from schema registry
.withColumn("id", $"clientTag.key")
.withColumn("email", $"contactPoint.email")
.withColumn("type", $"contactPoint.type")
Then you can directly use these columns when assigning values to the Avro record that you serialize and send to Kafka.
Q#2: You can keep track of success and failure with something like this. This is not fully working code, but it can give you an idea.
dfJSON.foreachPartition( currentPartition => {
  var producer = new KafkaProducer[String, Array[Byte]](props)
  var schema: Schema = ... //Get schema from schema registry or avsc file
  val schemaRegProps = Map("schema.registry.url" -> schemaRegistryUrl)
  val client = new CachedSchemaRegistryClient(schemaRegistryUrl, Int.MaxValue)
  val valueSerializer = new KafkaAvroSerializer(client)
  valueSerializer.configure(schemaRegProps, false)
  val failedRecDF = currentPartition.map(rec => {
    try {
      var avroRecord: GenericRecord = new GenericData.Record(schema)
      avroRecord.put("id", rec.getAs[String]("id"))
      avroRecord.put("email", rec.getAs[String]("email"))
      avroRecord.put("type", rec.getAs[String]("type"))
      // Serialize record in Producer record & send to Kafka
      producer.send(new ProducerRecord[String, Array[Byte]](kafkaTopic, rec.getAs[String]("id").toString(), valueSerializer.serialize(kafkaTopic, avroRecord).toArray))
      (rec.getAs[String]("id"), rec.getAs[String]("email"), rec.getAs[String]("type"), "Success")
    } catch {
      case e: Exception =>
        println("*** Exception *** ")
        e.printStackTrace()
        (rec.getAs[String]("id"), rec.getAs[String]("email"), rec.getAs[String]("type"), "Failed")
    }
  }) //.toDF("id", "email", "type", "sent_status")
  failedRecDF.foreach(println)
  //You can retry or log them
})
Response would be:
(111,abc#gmail.com,EML,Success)
You can do whatever you want to do with it.

How do I group by and aggregate columns in Spark and create nested JSON

I have data like this and I want to create the following JSON document.
How can I achieve it in Spark? What is the most efficient way to do it?
name|contact |type
jack|123-123-1234 |phone
jack|jack.reach#xyz.com |email
jack|123 main street |address
jack|34545544445 |mobile
{
  "name" : "jack",
  "contacts" : [
    {
      "contact" : "123-123-1234",
      "type" : "phone"
    },
    {
      "contact" : "jack.reach#xyz.com",
      "type" : "email"
    },
    {
      "contact" : "123 main street",
      "type" : "address"
    },
    {
      "contact" : "34545544445",
      "type" : "mobile"
    }
  ]
}
This is just a sample use case. I have a large data set where I have to collapse multi-column rows into one row with some grouping logic.
My current approach is to write a UDAF that reads each row, stores it in a buffer, and merges it. So the code would be:
val mergeUDAF = new ColumnUDAF
val tempTable = inputTable.withColumn("contacts", struct($"contact", $"type"))
val outputTable = tempTable.groupBy($"name").agg(mergeUDAF($"contacts").alias("contacts"))
I am trying to figure out what other approaches there can be. I am
trying to achieve this using Spark-SQL.
I think you should just create an RDD from your CSV data, group by "name", then map to a JSON string:
val data = sc.parallelize(Seq("jack|123-123-1234|phone", "jack|jack.reach#xyz.com |email", "david|123 main street|address", "david|34545544445|mobile")) // change to load your data as RDD
val result = data.map(_.split('|')).groupBy(a => a(0)).map(a => {
val contact = a._2.map(c => s"""{"contact": "${c(1)}", "type": "${c(2)}" }""" ).mkString(",")
s"""{"name": "${a._1}", "contacts":[ ${contact}] }"""
}).collect.mkString(",")
val json = s"""[ ${result} ] """
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class contact(contact: String, contactType: String)
case class Person(name: String, contact: Seq[contact])

object SparkTestGrouping {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LocalTest").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val inputData = Seq("jack|123-123-1234|phone", "jack|jack.reach#xyz.com|email", "jack|123 main street|address", "jack|34545544445|mobile")
    val finalData = sc.parallelize(inputData)
    val convertData = finalData.map(_.split('|'))
      .map(line => (line(0), Seq(line(1) + "|" + line(2))))
      .reduceByKey((x, y) => x ++: y)
    val output = convertData.map(line => (line._1, line._2.map(_.split('|')).map(obj => contact(obj(0), obj(1)))))
    val finalOutput = output.map(line => Person(line._1, line._2))
    finalOutput.toDF().toJSON.foreach(println)
    sc.stop()
  }
}
You can create tuples from the data with the key field and use reduceByKey to group the data. In the above example, I created a tuple (name, Seq("contact|contactType")) and used reduceByKey to group the data by name. After the data is grouped, you can use the case classes to convert to DataFrames and Datasets if you need to do further joins on them, or simply create the JSON document.
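Since the question mentions wanting to achieve this with Spark SQL, here is a minimal sketch of the built-in aggregation route (no UDAF): collect_list over a struct, followed by toJSON. It assumes a Java Dataset<Row> named inputTable with the columns name, contact and type from the question; everything else is illustrative.
```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.collect_list;
import static org.apache.spark.sql.functions.struct;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Group the flat rows by name and fold the (contact, type) pairs into an array of structs
Dataset<Row> nested = inputTable
        .groupBy(col("name"))
        .agg(collect_list(struct(col("contact"), col("type"))).alias("contacts"));

// Each element of the resulting Dataset<String> is a JSON document shaped like the one in the question
Dataset<String> json = nested.toJSON();
json.show(false);
```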

Exporting nested fields with invalid characters from Spark 2 to Parquet [duplicate]

This question already has answers here:
Spark Dataframe validating column names for parquet writes
(7 answers)
Closed 4 years ago.
I am trying to use Spark 2.0.2 to convert a JSON file into Parquet.
The JSON file comes from an external source and therefore the schema can't be changed before it arrives.
The file contains a map of attributes. The attribute names aren't known before I receive the file.
The attribute names contain characters that can't be used in Parquet:
{
  "id" : 1,
  "name" : "test",
  "attributes" : {
    "name=attribute" : 10,
    "name=attribute with space" : 100,
    "name=something else" : 10
  }
}
Neither the space nor the equals character can be used in Parquet; I get the following error:
org.apache.spark.sql.AnalysisException: Attribute name "name=attribute" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
As these are nested fields, I can't rename them using an alias. Is this true?
I have tried renaming the fields within the schema as suggested here: How to rename fields in an DataFrame corresponding to nested JSON. This works for some files; however, I now get the following stack overflow:
java.lang.StackOverflowError
at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:65)
at org.apache.spark.scheduler.DAGScheduler.getCacheLocs(DAGScheduler.scala:258)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1563)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1579)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply(DAGScheduler.scala:1578)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply(DAGScheduler.scala:1578)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1578)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1576)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1576)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1579)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply(DAGScheduler.scala:1578)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply(DAGScheduler.scala:1578)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1578)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1576)
at scala.collection.immutable.List.foreach(List.scala:381)
...
repeat
...
I want to do one of the following:
Strip invalid characters from the field names as I load the data into Spark
Change the column names in the schema without causing stack overflows
Somehow change the schema to load the original data but use the following internally:
{
  "id" : 1,
  "name" : "test",
  "attributes" : [
    {"key":"name=attribute", "value" : 10},
    {"key":"name=attribute with space", "value" : 100},
    {"key":"name=something else", "value" : 10}
  ]
}
I solved the problem this way:
df.toDF(df
.schema
.fieldNames
.map(name => "[ ,;{}()\\n\\t=]+".r.replaceAllIn(name, "_")): _*)
where I replaced all invalid characters with "_".
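For the Java API used elsewhere in this question, a rough equivalent could look like the sketch below. Note that, like the Scala snippet, it only renames top-level columns, so the nested attribute keys still need the schema rewrite shown in the next answer. The regex is the character set from the error message; the variable names are illustrative and df is assumed to be an existing Dataset<Row>.
```java
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Replace every run of characters Parquet rejects with an underscore
String[] sanitized = Arrays.stream(df.schema().fieldNames())
        .map(name -> name.replaceAll("[ ,;{}()\\n\\t=]+", "_"))
        .toArray(String[]::new);

Dataset<Row> cleaned = df.toDF(sanitized);
```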
The only solution I have found to work, so far, is to reload the data with a modified schema. The new schema will load the attributes into a map.
Dataset<Row> newData = sql.read().json(path);
StructType newSchema = (StructType) toMapType(newData.schema(), null, "attributes");
newData = sql.read().schema(newSchema).json(path);

private DataType toMapType(DataType dataType, String fullColName, String col) {
    if (dataType instanceof StructType) {
        StructType structType = (StructType) dataType;
        List<StructField> renamed = Arrays.stream(structType.fields()).map(
                f -> toMapType(f, fullColName == null ? f.name() : fullColName + "." + f.name(), col)).collect(Collectors.toList());
        return new StructType(renamed.toArray(new StructField[renamed.size()]));
    }
    return dataType;
}

private StructField toMapType(StructField structField, String fullColName, String col) {
    if (fullColName.equals(col)) {
        return new StructField(col, new MapType(DataTypes.StringType, DataTypes.LongType, true), true, Metadata.empty());
    } else if (col.startsWith(fullColName)) {
        return new StructField(structField.name(), toMapType(structField.dataType(), fullColName, col), structField.nullable(), structField.metadata());
    }
    return structField;
}
I have the same problem with # and :.
In our case, we solved it by flattening the DataFrame.
val ALIAS_RE: Regex = "[_.:#]+".r
val FIRST_AT_RE: Regex = "^_".r

def getFieldAlias(field_name: String): String = {
  FIRST_AT_RE.replaceAllIn(ALIAS_RE.replaceAllIn(field_name, "_"), "")
}

def selectFields(df: DataFrame, fields: List[String]): DataFrame = {
  var fields_to_select = List[Column]()
  for (field <- fields) {
    val alias = getFieldAlias(field)
    fields_to_select +:= col(field).alias(alias)
  }
  df.select(fields_to_select: _*)
}
So the following JSON:
{
  object: 'blabla',
  schema: {
    #type: 'blabla',
    name#id: 'blabla'
  }
}
will be transformed into [object, schema.#type, schema.name#id].
# and dots (in your case =) will create problems for Spark SQL.
So after our selectFields you end up with [object, schema_type, schema_name_id]: a flattened DataFrame.
