I have a Spark UDF written in Scala that takes a couple of columns, applies some logic, and outputs an InternalRow. There is also a Spark schema of StructType present.
But when I try to return the InternalRow from the UDF, I get this exception:
java.lang.UnsupportedOperationException: Schema for type
org.apache.spark.sql.catalyst.GenericInternalRow is not supported
val getData = (hash: String, `type`: String) => {
  val schema = hash match {
    case "people" => peopleSchema
    case "empl"   => emplSchema
  }
  getGenericInternalRow(schema)
}

val data = udf(getData)
Spark Version : 2.4.5
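For context, here is a minimal, self-contained sketch of the missing pieces; the two schemas and the getGenericInternalRow helper are assumptions made purely for illustration:
```
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.UTF8String

val peopleSchema = StructType(Seq(StructField("name", StringType), StructField("age", IntegerType)))
val emplSchema   = StructType(Seq(StructField("name", StringType), StructField("dept", StringType)))

// Hypothetical helper (ignores the schema in this sketch): builds an internal row directly.
def getGenericInternalRow(schema: StructType): GenericInternalRow =
  new GenericInternalRow(Array[Any](UTF8String.fromString("dummy"), 1))

// Registering the lambda above with udf(getData) is what fails: udf() has to derive a
// Catalyst schema for the lambda's return type, and it cannot do that for GenericInternalRow.
```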
This type of logic:
.mapPartitionsWithIndex((index, iter) => {
  iter.map(x => (index, x))
})
How can this be done in pure SQL (%sql) in Spark? I do not think it is possible.
I have a Spark DataFrame with the schema below, and I am trying to stream this DataFrame to Kafka using Avro.
```
root
 |-- clientTag: struct (nullable = true)
 |    |-- key: string (nullable = true)
 |-- contactPoint: struct (nullable = true)
 |    |-- email: string (nullable = true)
 |    |-- type: string (nullable = true)
 |-- performCheck: string (nullable = true)
```
Sample Record: {"performCheck" : "N", "clientTag" :{"key":"value"}, "contactPoint": {"email":"abc#gmail.com", "type":"EML"}}
Avro schema:
{
  "name": "Message",
  "namespace": "kafka.sample.avro",
  "type": "record",
  "fields": [
    {"type": "string", "name": "id"},
    {"type": "string", "name": "email"},
    {"type": "string", "name": "type"}
  ]
}
I have a couple of questions.
What is the best way to convert an org.apache.spark.sql.Row to an Avro message? I want to extract email and type from the DataFrame for each Row and use those values to construct the Avro message.
Eventually, all the Avro messages will be sent to Kafka. So, if there is an error while producing, how can I collect all the Rows that failed to be produced to Kafka and return them as a DataFrame?
Thanks for the help.
You can try this.
Q#1: You can extract the child elements of the DataFrame using dot notation:
val dfJSON = spark.read.json("/json/path/sample_avro_data_as_json.json") // can read from schema registry
  .withColumn("id", $"clientTag.key")
  .withColumn("email", $"contactPoint.email")
  .withColumn("type", $"contactPoint.type")
Then you can use these columns directly when assigning values to the Avro record that you serialize and send to Kafka.
Q#2: You can keep track of success and failure with something like this. This is not fully working code, but it should give you an idea.
import scala.collection.JavaConverters._

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient
import io.confluent.kafka.serializers.KafkaAvroSerializer

// props, schemaRegistryUrl and kafkaTopic are assumed to be defined elsewhere
dfJSON.foreachPartition(currentPartition => {
  // One producer and serializer per partition
  val producer = new KafkaProducer[String, Array[Byte]](props)
  val schema: Schema = ... // Get schema from the schema registry or an .avsc file
  val schemaRegProps = Map("schema.registry.url" -> schemaRegistryUrl)
  val client = new CachedSchemaRegistryClient(schemaRegistryUrl, Int.MaxValue)
  val valueSerializer = new KafkaAvroSerializer(client)
  valueSerializer.configure(schemaRegProps.asJava, false)

  val failedRecDF = currentPartition.map(rec => {
    try {
      // Build the Avro record from the flattened columns
      val avroRecord: GenericRecord = new GenericData.Record(schema)
      avroRecord.put("id", rec.getAs[String]("id"))
      avroRecord.put("email", rec.getAs[String]("email"))
      avroRecord.put("type", rec.getAs[String]("type"))
      // Serialize into a ProducerRecord and send to Kafka; .get() makes the send
      // synchronous so that failures actually surface in this try/catch
      producer.send(new ProducerRecord[String, Array[Byte]](kafkaTopic, rec.getAs[String]("id"), valueSerializer.serialize(kafkaTopic, avroRecord))).get()
      (rec.getAs[String]("id"), rec.getAs[String]("email"), rec.getAs[String]("type"), "Success")
    } catch {
      case e: Exception =>
        println("*** Exception *** ")
        e.printStackTrace()
        (rec.getAs[String]("id"), rec.getAs[String]("email"), rec.getAs[String]("type"), "Failed")
    }
  }) //.toDF("id", "email", "type", "sent_status")

  failedRecDF.foreach(println)
  // You can retry the failed records or log them
})
The output would look like:
(111,abc#gmail.com,EML,Success)
You can then do whatever you need with it.
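For the second part of Q#2 (returning the failed rows as a DataFrame), here is a minimal sketch of one possible variant, not part of the original answer: use mapPartitions instead of foreachPartition so the per-record status comes back as a Dataset that can be filtered. Column names match the snippet above; the producer/serializer setup is elided.
```
import spark.implicits._   // assumes a SparkSession named `spark`

val statusDF = dfJSON.mapPartitions { partition =>
  // set up the producer and serializer per partition exactly as above (omitted here)
  partition.map { rec =>
    val id    = rec.getAs[String]("id")
    val email = rec.getAs[String]("email")
    val typ   = rec.getAs[String]("type")
    val status =
      try {
        // build the Avro record and send it, as in the snippet above
        "Success"
      } catch {
        case _: Exception => "Failed"
      }
    (id, email, typ, status)
  }
}.toDF("id", "email", "type", "sent_status")

// Rows that failed to reach Kafka:
val failedDF = statusDF.filter($"sent_status" === "Failed")
```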
I have data like this and I want to create the following JSON document.
How can I achieve this in Spark? What is the most efficient way to do it?
name|contact |type
jack|123-123-1234 |phone
jack|jack.reach#xyz.com |email
jack|123 main street |address
jack|34545544445 |mobile
{
  "name": "jack",
  "contacts": [
    {
      "contact": "123-123-1234",
      "type": "phone"
    },
    {
      "contact": "jack.reach#xyz.com",
      "type": "email"
    },
    {
      "contact": "123 main street",
      "type": "address"
    },
    {
      "contact": "34545544445",
      "type": "mobile"
    }
  ]
}
This is just a sample use case. I have a large data set where I have to collapse multi-column rows into one row with some grouping logic.
My current approach is to write a UDAF that reads each row, stores it in a buffer, and merges it. So the code would be:
val mergeUDAF = new ColumnUDAF

val tempTable = inputTable.withColumn("contacts", struct($"contact", $"type"))
val outputTable = tempTable.groupBy($"name").agg(mergeUDAF($"contacts").alias("contacts"))
I am trying to figure out what other approaches there can be. I am trying to achieve this using Spark SQL.
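One other approach worth sketching (this is an assumption, not something from the original post; it reuses the same inputTable and column names): the built-in collect_list aggregate on a struct column produces the same grouped structure without a custom UDAF.
```
import spark.implicits._   // assumes a SparkSession named `spark`
import org.apache.spark.sql.functions.{collect_list, struct}

val contactsByName = inputTable
  .groupBy($"name")
  .agg(collect_list(struct($"contact", $"type")).alias("contacts"))

// Each grouped row can then be emitted as a JSON document in the desired shape:
contactsByName.toJSON.show(false)
```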
I think you should just create an RDD from your CSV data, group by "name", then map to a JSON string:
val data = sc.parallelize(Seq("jack|123-123-1234|phone", "jack|jack.reach#xyz.com |email", "david|123 main street|address", "david|34545544445|mobile")) // change to load your data as RDD

val result = data.map(_.split('|')).groupBy(a => a(0)).map(a => {
  val contact = a._2.map(c => s"""{"contact": "${c(1)}", "type": "${c(2)}" }""").mkString(",")
  s"""{"name": "${a._1}", "contacts":[ ${contact}] }"""
}).collect.mkString(",")

val json = s"""[ ${result} ] """
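If the assembled JSON string is needed as a DataFrame again, a possible follow-up (an assumption, not part of the original answer) is to parse it back with spark.read.json, which accepts a Dataset[String] in Spark 2.2+:
```
import spark.implicits._   // assumes a SparkSession named `spark`

val jsonDF = spark.read.json(Seq(json).toDS)
jsonDF.printSchema()
```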
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class contact(contact: String, contactType: String)
case class Person(name: String, contact: Seq[contact])

object SparkTestGrouping {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LocalTest").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val inputData = Seq("jack|123-123-1234|phone", "jack|jack.reach#xyz.com|email", "jack|123 main street|address", "jack|34545544445|mobile")
    val finalData = sc.parallelize(inputData)

    // Key by name and collect the "contact|contactType" strings per name
    val convertData = finalData.map(_.split('|'))
      .map(line => (line(0), Seq(line(1) + "|" + line(2))))
      .reduceByKey((x, y) => x ++: y)

    // Split each "contact|contactType" string back out into the contact case class
    val output = convertData.map(line => (line._1, line._2.map(_.split('|')).map(obj => contact(obj(0), obj(1)))))
    val finalOutput = output.map(line => Person(line._1, line._2))

    finalOutput.toDF().toJSON.foreach(println)

    sc.stop()
  }
}
You can create tuples from the data with the key field and use reduceByKey to group the data. In the above example, I created a tuple (name, Seq("contact|contactType")) and used reduceByKey to group the data by name. After the data is grouped, you can use case classes to convert to DataFrames and Datasets if you need to do further joins, or simply create the JSON document.
I am trying to use Spark 2.0.2 to convert a JSON file into parquet.
The JSON file comes from an external source and therefore the schema can't be changed before it arrives.
The file contains a map of attributes. The attribute names aren't known before I receive the file.
The attribute names contain characters that can't be used in parquet.
{
  "id": 1,
  "name": "test",
  "attributes": {
    "name=attribute": 10,
    "name=attribute with space": 100,
    "name=something else": 10
  }
}
Since neither the space nor the equals character can be used in parquet, I get the following error:
org.apache.spark.sql.AnalysisException: Attribute name "name=attribute" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
As these are nested fields, I can't rename them using an alias. Is that true?
I have tried renaming the fields within the schema as suggested here: How to rename fields in an DataFrame corresponding to nested JSON. This works for some files; however, I now get the following StackOverflowError:
java.lang.StackOverflowError
at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:65)
at org.apache.spark.scheduler.DAGScheduler.getCacheLocs(DAGScheduler.scala:258)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1563)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1579)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply(DAGScheduler.scala:1578)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply(DAGScheduler.scala:1578)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1578)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1576)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1576)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1579)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply(DAGScheduler.scala:1578)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply(DAGScheduler.scala:1578)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1578)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1576)
at scala.collection.immutable.List.foreach(List.scala:381)
...
repeat
...
I want to do one of the following:
Strip invalid characters from the field names as I load the data into Spark
Change the column names in the schema without causing stack overflows
Somehow change the schema so that it loads the original data but uses the following internally:
{
  "id": 1,
  "name": "test",
  "attributes": [
    {"key": "name=attribute", "value": 10},
    {"key": "name=attribute with space", "value": 100},
    {"key": "name=something else", "value": 10}
  ]
}
I solved the problem this way:
df.toDF(df
.schema
.fieldNames
.map(name => "[ ,;{}()\\n\\t=]+".r.replaceAllIn(name, "_")): _*)
where I replaced all invalid characters with "_".
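For illustration, a hypothetical before/after of that rename, plus one caveat: Dataset.toDF(...) only renames top-level columns, so fields nested inside a struct (like the ones under attributes in the question) keep their original names.
```
// Hypothetical example: top-level columns "id", "name=attribute", "name=attribute with space"
// become "id", "name_attribute", "name_attribute_with_space" before the parquet write.
val cleaned = df.toDF(
  df.schema.fieldNames.map(name => "[ ,;{}()\\n\\t=]+".r.replaceAllIn(name, "_")): _*)

cleaned.write.parquet("/tmp/cleaned")   // output path is an assumption
```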
The only solution I have found to work, so far, is to reload the data with a modified schema. The new schema will load the attributes into a map.
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.*;

// Read once to infer the schema, rewrite the "attributes" struct as a map, then re-read
Dataset<Row> newData = sql.read().json(path);
StructType newSchema = (StructType) toMapType(newData.schema(), null, "attributes");
newData = sql.read().schema(newSchema).json(path);

private DataType toMapType(DataType dataType, String fullColName, String col) {
    if (dataType instanceof StructType) {
        StructType structType = (StructType) dataType;
        List<StructField> renamed = Arrays.stream(structType.fields())
                .map(f -> toMapType(f, fullColName == null ? f.name() : fullColName + "." + f.name(), col))
                .collect(Collectors.toList());
        return new StructType(renamed.toArray(new StructField[renamed.size()]));
    }
    return dataType;
}

private StructField toMapType(StructField structField, String fullColName, String col) {
    if (fullColName.equals(col)) {
        // Replace the matching struct field with a map<string, long>, so attribute names become map keys
        return new StructField(col, new MapType(DataTypes.StringType, DataTypes.LongType, true), true, Metadata.empty());
    } else if (col.startsWith(fullColName)) {
        return new StructField(structField.name(), toMapType(structField.dataType(), fullColName, col), structField.nullable(), structField.metadata());
    }
    return structField;
}
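A brief usage sketch (written in Scala like the rest of the thread; newData and the attribute name are taken from the snippets above): once attributes is a MapType, the problematic strings are map keys, i.e. data rather than column names, so the parquet writer accepts them.
```
newData.write.parquet("/tmp/out")                                // output path is an assumption
newData.selectExpr("id", "attributes['name=attribute']").show()  // map keys may contain '=' and spaces
```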
I had the same problem with the # character.
In our case, we solved it by flattening the DataFrame.
import scala.util.matching.Regex
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

val ALIAS_RE: Regex = "[_.:#]+".r
val FIRST_AT_RE: Regex = "^_".r

// Replace problem characters with "_" and drop a leading underscore
def getFieldAlias(field_name: String): String = {
  FIRST_AT_RE.replaceAllIn(ALIAS_RE.replaceAllIn(field_name, "_"), "")
}

// Select the given (possibly nested) fields, each aliased to a flattened name
def selectFields(df: DataFrame, fields: List[String]): DataFrame = {
  var fields_to_select = List[Column]()
  for (field <- fields) {
    val alias = getFieldAlias(field)
    fields_to_select +:= col(field).alias(alias)
  }
  df.select(fields_to_select: _*)
}
So the following JSON:
{
  object: 'blabla',
  schema: {
    #type: 'blabla',
    name#id: 'blabla'
  }
}
will be transformed to the field list [object, schema.#type, schema.name#id].
The # characters and the dots (in your case, =) will create problems for Spark SQL.
So after our selectFields you end up with [object, schema_type, schema_name_id], a flattened DataFrame.
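A hypothetical usage of the helpers above, with an assumed input path and the field list from the example JSON:
```
val df = spark.read.json("/path/to/input.json")   // path is an assumption
val flattened = selectFields(df, List("object", "schema.#type", "schema.name#id"))
flattened.printSchema()   // the flattened column names are object, schema_type and schema_name_id
```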
I have two RDDs:
rdd1 [String, String, String]: Name, Address, Zipcode
rdd2 [String, String, String]: Name, Address, Landmark
I am trying to join these two RDDs using the function rdd1.join(rdd2),
but I am getting an error: error: value fullOuterJoin is not a member of org.apache.spark.rdd.RDD[String]
The join should combine the two RDDs, and the output RDD should be something like:
rddOutput: Name, Address, Zipcode, Landmark
I also want to save the result as a JSON file in the end.
Can someone help me with this?
As said in the comments, you have to convert your RDDs to PairRDDs before joining, which means that each RDD must be of type RDD[(key, value)]. Only then can you perform the join by key. In your case, the key is composed of (Name, Address), so you would have to do something like:
// First, we create the first PairRDD, with (name, address) as key and zipcode as value:
val pairRDD1 = rdd1.map { case (name, address, zipcode) => ((name, address), zipcode) }

// Then, we create the second PairRDD, with (name, address) as key and landmark as value:
val pairRDD2 = rdd2.map { case (name, address, landmark) => ((name, address), landmark) }

// Now we can join them.
// The result is an RDD of ((name, address), (zipcode, landmark)); note that after a
// full outer join, zipcode and landmark are Options, since either side may be missing.
// Map it to the desired format:
val joined = pairRDD1.fullOuterJoin(pairRDD2).map {
  case ((name, address), (zipcode, landmark)) => (name, address, zipcode, landmark)
}
More info about PairRDD functions can be found in Spark's Scala API documentation.
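Since the question also asks about writing the result as a JSON file, a possible follow-up (an assumption, not part of the original answer): convert the joined RDD to a DataFrame and write it out. After the full outer join, zipcode and landmark are Options, so they are unwrapped here; the column names and output path are assumptions.
```
import spark.implicits._   // assumes a SparkSession named `spark`

joined
  .map { case (name, address, zipcode, landmark) =>
    (name, address, zipcode.getOrElse(""), landmark.getOrElse(""))
  }
  .toDF("name", "address", "zipcode", "landmark")
  .write
  .json("/tmp/joined_output")   // output path is an assumption
```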