Expressing spark `StructType` in avro schema - apache-spark

How would you describe the Spark StructType data type in an Avro schema? I am generating a Parquet file whose format is described in an Avro schema; this file is then loaded from S3 into Spark. Avro has array and map data types, but these do not correspond to StructType.

Using the package org.apache.spark.sql.avro (Spark 2.4) you can convert Spark SQL schemas to Avro schemas and vice versa.
You can try it this way:
import org.apache.spark.sql.avro.SchemaConverters

// Convert the Avro schema to a Spark SQL schema (a SchemaType wrapping a StructType)
val sqlType = SchemaConverters.toSqlType(avroSchema)
// Map each Avro GenericRecord to a Spark Row (genericRecordToRow is a user-defined helper, sketched below)
val rowRDD = yourGenericRecordRDD.map(record => genericRecordToRow(record, sqlType))
val df = sqlContext.createDataFrame(rowRDD, sqlType.dataType.asInstanceOf[StructType])
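Note that genericRecordToRow is not part of spark-avro; it is a helper you write yourself. A minimal sketch that only handles flat records with primitive fields (nested structs, arrays, and maps would need recursive handling), assuming SchemaConverters.SchemaType is accessible:
import org.apache.avro.generic.GenericRecord
import org.apache.avro.util.Utf8
import org.apache.spark.sql.Row
import org.apache.spark.sql.avro.SchemaConverters.SchemaType
import org.apache.spark.sql.types.StructType

def genericRecordToRow(record: GenericRecord, sqlType: SchemaType): Row = {
  val struct = sqlType.dataType.asInstanceOf[StructType]
  Row.fromSeq(struct.fieldNames.toSeq.map { name =>
    record.get(name) match {
      case utf8: Utf8 => utf8.toString  // Avro strings come back as Utf8, not java.lang.String
      case other      => other
    }
  })
}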
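For the direction the original question asks about (Spark StructType to Avro schema), the same Spark 2.4 package also exposes SchemaConverters.toAvroType, which maps a StructType to an Avro record type. A small sketch with illustrative field names:
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types._

val structType = StructType(Seq(
  StructField("id", StringType),
  StructField("tags", ArrayType(StringType)),
  StructField("attrs", MapType(StringType, StringType))
))
// A StructType becomes an Avro "record"; ArrayType and MapType map to Avro array and map types.
val avroRecordSchema = SchemaConverters.toAvroType(structType, nullable = false, recordName = "MyRecord")
println(avroRecordSchema.toString(true))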

Related

spark read schema from separate file

I have my data in HDFS and its schema in MySQL. I'm able to fetch the schema into a DataFrame, and it looks like this:
col1,string
col2,date
col3,int
col4,string
How do I read this schema and apply it to the data while reading from HDFS?
The schema will be read from MySQL and will be different for different datasets, so I need a dynamic approach: for any dataset, fetch the schema details from MySQL, convert them into a Spark schema, and then apply that schema to the dataset.
You can use the built-in pyspark function _parse_datatype_string:
from pyspark.sql.types import _parse_datatype_string
df = spark.createDataFrame([
["col1,string"],
["col3,int"],
["col3,int"]
], ["schema"])
str_schema = ",".join(map(lambda c: c["schema"].replace(",", ":"), df.collect()))
# col1:string,col3:int,col3:int
final_schema = _parse_datatype_string(str_schema)
# StructType(List(StructField(col1,StringType,true),StructField(col3,IntegerType,true),StructField(col3,IntegerType,true)))
_parse_datatype_string expects a DDL-formatted string, e.g. col1:string, col2:int, so we first need to replace , with : and then join everything together, separated by commas. The function returns an instance of StructType, which will be your final schema.
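If you are on the Scala API (as in the other answers here), recent Spark versions provide StructType.fromDDL for the same purpose. A small sketch using the column list from the question:
import org.apache.spark.sql.types.StructType

// DDL format is "name TYPE" pairs separated by commas.
val ddl = "col1 STRING, col2 DATE, col3 INT, col4 STRING"
val finalSchema: StructType = StructType.fromDDL(ddl)
The resulting StructType can then be passed to spark.read.schema(finalSchema) when reading the data from HDFS.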

RDD String to Spark csv Reader

I want to read an RDD[String] using the Spark CSV reader. The reason I am doing this is that I need to filter some records before using the CSV reader.
val fileRDD: RDD[String] = spark.sparkContext.textFile("file")
I need to read fileRDD using the Spark CSV reader. I would rather not write the filtered data back to a file, since that increases HDFS I/O. I have looked through the options available in the Spark CSV reader but didn't find one for this.
spark.read.csv(file)
Sample Data
PHM|MERC|PHARMA|BLUEDRUG|50
CLM|BSH|CLAIM|VISIT|HSA|EMPLOYER|PAID|250
PHM|GSK|PHARMA|PARAC|70
CLM|UHC|CLAIM|VISIT|HSA|PERSONAL|PAID|72
As you can see, all records starting with PHM have one number of columns and records starting with CLM have a different number of columns. That is the reason I am filtering and then applying a schema: PHM and CLM records have different schemas.
val fileRDD: RDD[String] = spark.sparkContext.textFile("file").filter(_.startsWith("PHM"))
spark.read.schema(phcSchema).csv(fileRDD.toDS())
Since Spark 2.2, the .csv method can read a Dataset of strings, so this can be implemented as follows:
val rdd: RDD[String] = spark.sparkContext.textFile("csv.txt")
// ... do filtering
spark.read.csv(rdd.toDS())
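Putting it together with the question's PHM records, a sketch might look like this (the schema field names and the delimiter handling are assumptions, not from the original post):
import org.apache.spark.sql.types._
import spark.implicits._

// Assumed schema for PHM records such as: PHM|MERC|PHARMA|BLUEDRUG|50
val phmSchema = StructType(Seq(
  StructField("rec_type", StringType),
  StructField("source",   StringType),
  StructField("category", StringType),
  StructField("product",  StringType),
  StructField("amount",   IntegerType)
))

val phmLines = spark.sparkContext.textFile("file").filter(_.startsWith("PHM"))
val phmDF = spark.read
  .option("sep", "|")          // records are pipe-delimited
  .schema(phmSchema)
  .csv(phmLines.toDS())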

Spark read avro

Trying to read an avro file.
val df = spark.read.avro(file)
Running into Avro schema cannot be converted to a Spark SQL StructType: [ "null", "string" ]
Tried to manually create a schema, but now running into the following:
val s = StructType(List(StructField("value", StringType, nullable = true)))
val df = spark.read
.option("inferSchema", "false")
.schema(s)
.avro(file)
com.databricks.spark.avro.SchemaConverters$IncompatibleSchemaException: Cannot convert Avro schema to catalyst type because schema at path is not compatible (avroType = StructType(StructField(value,StringType,true)), sqlType = STRING).
Source Avro schema: ["null","string"].
Target Catalyst type: StructType(StructField(value,StringType,true))
Trying to override the avro schema (without the null) also does not work:
val df = spark.read
.option("inferSchema", "false")
.option("avroSchema", """["string"]""")
.avro(file)
Avro schema cannot be converted to a Spark SQL StructType: [ "string" ]
Looks like spark-avro only creates a GenericDatumReader[GenericRecord] and I need a GenericDatumReader[Utf8] :(
Please make sure you are providing an AVSC schema that matches the data types.
["null", "string"] is a union that allows null values in the Avro data.
You can create the schema of your Avro file with:
import org.apache.avro.Schema
import java.io.File
val schema = new Schema.Parser().parse(new File("user.avsc"))
Or, if you have an Avro-generated Java class, you can get the schema from the class itself:
val schema = MyGeneratedRecord.getClassSchema  // MyGeneratedRecord is a placeholder for your generated class
Once you have the schema, it is straightforward to build a DataFrame with it.
Code snippet:
val df = sparkSession.read.format("com.databricks.spark.avro")
  .option("avroSchema", schema.toString)
  .load("/home/garvit.vijay/000009_0.avro")
df.printSchema()
df.show()
Hope it works for you.

Converting CassandraRow obtained from joinWithCassandraTable to DataFrame

case class SourcePartition(id: String, host:String ,bucket: Int)
val joinedRDDs = partitions.joinWithCassandraTable("db_name", "table_name")
joinedRDDs.values.foreach(println)
I have to use joinWithCassandraTable. How do I convert the resulting CassandraRow into a DataFrame? Or is there an equivalent of joinWithCassandraTable for DataFrames?
I have to read a lot of partitions in one go. I'm aware of the DataStax Cassandra connector's predicate pushdown, but it only allows pulling one partition at a time (it doesn't seem to allow the IN operator; only = seems to be supported).
val spark: SparkSession = SparkSession.builder().master("local[4]").appName("RDD2DF").getOrCreate()
val sc: SparkContext = spark.sparkContext
import spark.implicits._
val internalJoinRDD = spark.sparkContext.cassandraTable("test", "test_table_1").joinWithCassandraTable("test", "table_table_2")
internalJoinRDD.toDebugString
internalJoinRDD.toDF()
Can you try the above code snippet?
If you have a schema for your data, you can use
def createDataFrame(internalJoinRDD: RDD[Row], schema: StructType): DataFrame
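A hedged sketch of that route: convert each CassandraRow coming out of the join into a Spark Row and pass a matching StructType (the column names and types below are assumed from the SourcePartition case class in the question):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id",     StringType,  nullable = true),
  StructField("host",   StringType,  nullable = true),
  StructField("bucket", IntegerType, nullable = true)
))

// joinWithCassandraTable yields (left, CassandraRow) pairs; keep the joined row.
val rowRDD = internalJoinRDD.map { case (_, cassandraRow) =>
  Row(cassandraRow.getString("id"), cassandraRow.getString("host"), cassandraRow.getInt("bucket"))
}
val df = spark.createDataFrame(rowRDD, schema)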

Avro format deserialization in Spark structured stream

I'm using Spark Structured Streaming as described on this page.
I get the correct messages from the Kafka topic, but the value is in Avro format. Is there a way to deserialize the Avro records (something like the KafkaAvroDeserializer approach)?
Spark >= 2.4
You can use the from_avro function from the spark-avro library.
import org.apache.spark.sql.avro._
val schema: String = ???
df.withColumn("value", from_avro($"value", schema))
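For example, a sketch of a full streaming read where the writer's Avro schema is loaded from an .avsc file (the broker, topic, and file path are placeholders):
import org.apache.spark.sql.avro._
import java.nio.file.{Files, Paths}
import spark.implicits._

// Avro schema as a JSON string, read from a local .avsc file (placeholder path).
val jsonFormatSchema = new String(Files.readAllBytes(Paths.get("value-schema.avsc")))

val parsed = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")  // placeholder broker
  .option("subscribe", "topic")                    // placeholder topic
  .load()
  .select(from_avro($"value", jsonFormatSchema).as("value"))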
Spark < 2.4
Define a function which takes Array[Byte] (serialized object):
import scala.reflect.runtime.universe.TypeTag
def decode[T : TypeTag](bytes: Array[Byte]): T = ???
which will deserialize the Avro data and create an object that can be stored in a Dataset.
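A minimal sketch of such a decode function for the simple case of a flat record with a single string field, using Avro's GenericDatumReader (the schema and field name here are hypothetical):
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

// Hypothetical writer schema; replace with the schema your producer actually uses.
val schemaJson =
  """{"type":"record","name":"Value","fields":[{"name":"field1","type":"string"}]}"""
val avroSchema = new Schema.Parser().parse(schemaJson)

def decode(bytes: Array[Byte]): String = {
  val reader  = new GenericDatumReader[GenericRecord](avroSchema)
  val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
  val record  = reader.read(null, decoder)
  record.get("field1").toString
}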
Create a udf based on the function.
val decodeUdf = udf(decode _)
Call the udf on the value column:
val df = spark
.readStream
.format("kafka")
...
.load()
df.withColumn("value", decodeUdf($"value"))
