I'm trying to use the to_avro() function to create Avro records. However, I'm not able to encode multiple columns: some columns are simply lost after encoding. A simple example to recreate the problem:
import org.apache.spark.sql.Row
import org.apache.spark.sql.avro.{from_avro, to_avro}
import org.apache.spark.sql.functions.{col, struct}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(List(
  StructField("entity_type", StringType),
  StructField("entity", StringType)
))
val rdd = sc.parallelize(Seq(
  Row("PERSON", "John Doe")
))
val df = sqlContext.createDataFrame(rdd, schema)

df
  .withColumn("struct", struct(col("entity_type"), col("entity")))
  .select("struct")
  .collect()
  .foreach(println)
// prints [[PERSON, John Doe]]

df
  .withColumn("struct", struct(col("entity_type"), col("entity")))
  .select(to_avro(col("struct")).as("value"))
  .select(from_avro(col("value"), entitySchema).as("entity"))
  .collect()
  .foreach(println)
// prints [[, PERSON]]
My Avro schema (the entitySchema string used above) looks like this:
{
  "type" : "record",
  "name" : "Entity",
  "fields" : [ {
    "name" : "entity_type",
    "type" : "string"
  },
  {
    "name" : "entity",
    "type" : "string"
  } ]
}
What's interesting is that if I change the column order in the struct, the result is [, John Doe].
I'm using Spark 2.4.5. According to the Spark documentation: "to_avro() can be used to turn structs into Avro records. This method is particularly useful when you would like to re-encode multiple columns into a single one when writing data out to Kafka."
It works after changing the field types from "string" to ["string", "null"]. Not sure if this behavior is intended, though.
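For reference, a minimal sketch of that fix as I understand it: the reader schema passed to from_avro carries the union types, everything else (df, the column names and imports) is taken from the question above.

val entitySchema = """
{
  "type" : "record",
  "name" : "Entity",
  "fields" : [
    { "name" : "entity_type", "type" : [ "string", "null" ] },
    { "name" : "entity", "type" : [ "string", "null" ] }
  ]
}
"""

df
  .withColumn("struct", struct(col("entity_type"), col("entity")))
  .select(to_avro(col("struct")).as("value"))
  .select(from_avro(col("value"), entitySchema).as("entity"))
  .collect()
  .foreach(println)
// both fields are now preserved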
Related
I have a Spark-Kafka structured streaming pipeline, listening to a topic which may have JSON records of varying schemas.
Now I want to resolve the schema based on the key (x_y) and then apply it to the value portion to parse the JSON record.
Here, the 'y' part of the key indicates the schema type.
I tried to get the schema string from a UDF and then pass it to the from_json() function.
But it fails with this exception:
org.apache.spark.sql.AnalysisException: Schema should be specified in DDL format as a string literal or output of the schema_of_json function instead of `schema`
Code used:
df.withColumn("data_type", element_at(split(col("key").cast("string"),"_"),1))
.withColumn("schema", schemaUdf($"data_type"))
.select(from_json(col("value").cast("string"), col("schema")).as("data"))
Schema demo:
{
  "type" : "struct",
  "fields" : [ {
    "name" : "name",
    "type" : {
      "type" : "struct",
      "fields" : [ {
        "name" : "firstname",
        "type" : "string",
        "nullable" : true,
        "metadata" : { }
      } ]
    },
    "nullable" : true,
    "metadata" : { }
  } ]
}
UDF used:
import java.io.File
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.DataType

lazy val fetchSchema = (fileName: String) => {  // mapper is a Jackson ObjectMapper
  DataType.fromJson(mapper.readTree(new File(fileName)).toString)
}
val schemaUdf = udf[DataType, String](fetchSchema)
Note: I am not using the Confluent schema registry feature.
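For context (my own sketch, not part of the post): from_json needs the schema at query-planning time, so it only accepts a DDL string literal, a StructType, or the output of schema_of_json, never a regular column. A common workaround under that constraint is to parse each schema type separately with its own literal schema. Here dfWithType stands for the DataFrame produced by the snippet above (already carrying the data_type column) and schemaA is hypothetical.

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical literal schema for records whose data_type is "a"
val schemaA = StructType(Seq(
  StructField("name", StructType(Seq(StructField("firstname", StringType))))
))

// Parse only the records of that type with their matching literal schema
val parsedA = dfWithType
  .filter(col("data_type") === "a")
  .select(from_json(col("value").cast("string"), schemaA).as("data"))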
I have updated the avsc file to rename a column, like this:
"fields" : [ {
"name" : "department_id",
"type" : [ "null", "int" ],
"default" : null
}, {
"name" : "office_name",
"type" : [ "null", "string" ],
"default" : null,
"aliases" : [ "department_name" ],
"columnName" : "department_name"
}
However, in my Avro file the columns are like department_id : 10, department_name : "maths".
Now when I query like below,
select office_name from t
it always returns null values. Will it not return the value from department_name in the Avro file? Is there a way to have multiple names for a column in an avsc file?
From the Cloudera community: "we recommend to use the original name rather than the aliased name of the field in the table, as the Avro aliases are stripped during loading into Spark."
Schema with aliases:
val schema = new Schema.Parser().parse(new File("../spark-2.4.3-bin-hadoop2.7/examples/src/main/resources/user.avsc"))
schema: org.apache.avro.Schema = {"type":"record","name":"User","namespace":"example.avro","fields":[{"name":"name","type":"string","aliases":["customer_name"],"columnName":"customer_name"},{"name":"favorite_color","type":["string","null"],"aliases":["color"],"columnName":"color"}]}
Spark stripping the aliases:
val usersDF = spark.read.format("avro").option("avroSchema",schema.toString).load("../spark-2.4.3-bin-hadoop2.7/examples/src/main/resources/users.avro")
usersDF: org.apache.spark.sql.DataFrame = [name: string, favorite_color: string]
I guess you can go with Spark's built-in features to rename the column, but if you find any other workaround, let me know as well.
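For illustration (my own sketch, not from the Cloudera post; the path is hypothetical): read the file with its original field names, then rename with Spark's built-in API.

val renamed = spark.read.format("avro")
  .load("/path/to/departments.avro")  // hypothetical path; no custom avroSchema, so original names come through
  .withColumnRenamed("department_name", "office_name")

renamed.select("office_name").show(false)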
My downstream system does not support a Map type, but my source does and sends one. I need to convert this map into an array of structs (tuples).
Scala supports Map.toArray, which creates an array of tuples for you; that seems like the transformation I need to apply to the Map:
{
  "a" : {
    "b" : {
      "key1" : "value1",
      "key2" : "value2"
    },
    "b_" : {
      "array" : [
        {
          "key" : "key1",
          "value" : "value1"
        },
        {
          "key" : "key2",
          "value" : "value2"
        }
      ]
    }
  }
}
What is the most efficient way in Spark to do this, assuming the field to change is also a nested one? E.g.:
a is the root-level DataFrame column
a.b is the map at level 1 (comes from the source)
a.b_ is the array of structs (this is what I want to generate by converting a.b to the array)
The answer so far goes some of the way, I think; I just can't get the suggested withColumn and UDF to generate the structure shown above.
Thanks!
Just use a UDF:
import org.apache.spark.sql.functions.udf

val toArray = udf((vs: Map[String, String]) => vs.toArray)
and adjust the input type according to your needs.
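A usage sketch for the nested case from the question (my assumption of the layout: a is a struct column containing the map b, and the struct has to be rebuilt because nested fields can't be modified in place). Note the resulting array elements are structs with fields named _1 and _2 rather than key and value.

import org.apache.spark.sql.functions.{col, struct}

// Rebuild the top-level struct "a", keeping "b" and adding "b_" via the toArray UDF defined above
val converted = df.withColumn("a",
  struct(
    col("a.b").as("b"),
    toArray(col("a.b")).as("b_")
  ))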
I have a dataset with one of the fields containing an array, as below:
{ "name" : "James", "subjects" : [ "english", "french", "botany" ] },
{ "name" : "neo", "subjects" : [ "english", "physics" ] },
{ "name" : "john", "subjects" : [ "spanish", "mathematics" ] }
Now I want to filter using the Dataset.filter function by passing a Column object. I tried the isin function of Column and the array_contains function from functions, but it did not work.
Is there a way to create a Column object that will filter the dataset where an array field contains one of the values?
There are multiple ways to do this once you've imported the implicit Encoders:
import sparkSession.implicits._
First, you can turn your DataFrame, which is a Dataset[Row], into a strongly typed Dataset[Student], which allows you to use familiar (at least if you know Scala) Scala idioms:
case class Student(name: String, subjects: Seq[String])
sparkSession.read.json("my.json")
.as[Student]
.filter(_.subjects.contains("english"))
You can also use a pure Column-based approach on your DataFrame with array_contains from the helpful Spark functions library:
sparkSession.read.json("my.json").filter(array_contains($"subjects", "english"))
Finally, although it may not be helpful to you here, keep in mind that you can also use explode from the same functions library to give each subject its own row in the column:
sparkSession.read.json("my.json")
.select($"name", explode($"subjects").as("subjects"))
.filter($"subjects" === "english")
Spark SQL's DataFrameReader supports the so-called JSON Lines text format (aka newline-delimited JSON), where:
Each Line is a Valid JSON Value
You can use the json operator to read the dataset.
// on command line
$ cat subjects.jsonl
{ "name" : "James", "subjects" : [ "english", "french", "botany" ] }
{ "name" : "neo", "subjects" : [ "english", "physics" ] }
{ "name" : "john", "subjects" : [ "spanish", "mathematics" ] }
// in spark-shell
scala> val subjects = spark.read.json("subjects.jsonl")
subjects: org.apache.spark.sql.DataFrame = [name: string, subjects: array<string>]
scala> subjects.show(truncate = false)
+-----+-------------------------+
|name |subjects |
+-----+-------------------------+
|James|[english, french, botany]|
|neo |[english, physics] |
|john |[spanish, mathematics] |
+-----+-------------------------+
scala> subjects.printSchema
root
|-- name: string (nullable = true)
|-- subjects: array (nullable = true)
| |-- element: string (containsNull = true)
With that, you should have a look at the functions library, where you can find collection functions that deal with array-based inputs, e.g. array_contains or explode.
That's what you can find in the answer from @Vidya.
What is missing is my beloved Dataset.flatMap that, given the subjects Dataset, could be used as follows:
scala> subjects
.as[(String, Seq[String])] // convert to Dataset[(String, Seq[String])] for more type-safety
.flatMap { case (student, subjects) => subjects.map(s => (student, s)) } // typed expand
.filter(_._2.toLowerCase == "english") // filter out non-english subjects
.show
+-----+-------+
| _1| _2|
+-----+-------+
|James|english|
| neo|english|
+-----+-------+
That, however, doesn't look as nice as the for-comprehension version.
val subjectsDF = subjects.as[(String, Seq[String])]
val englishStudents = for {
(student, ss) <- subjectsDF // flatMap
subject <- ss // map
if subject.toLowerCase == "english"
} yield (student, subject)
scala> englishStudents.show
+-----+-------+
| _1| _2|
+-----+-------+
|James|english|
| neo|english|
+-----+-------+
Moreover, as of Spark 2.2 (soon to be released), you've got the DataFrameReader.json operator that you can use to read from a Dataset[String].
scala> spark.version
res0: String = 2.3.0-SNAPSHOT
import org.apache.spark.sql.Dataset
val subjects: Dataset[String] = Seq(
"""{ "name" : "James", "subjects" : [ "english", "french", "botany" ] }""",
"""{ "name" : "neo", "subjects" : [ "english", "physics" ] }""",
"""{ "name" : "john", "subjects" : [ "spanish", "mathematics" ]}""").toDS
scala> spark.read.option("inferSchema", true).json(subjects).show(truncate = false)
+-----+-------------------------+
|name |subjects |
+-----+-------------------------+
|James|[english, french, botany]|
|neo |[english, physics] |
|john |[spanish, mathematics] |
+-----+-------------------------+
As per my understanding, you are trying to find the records in a DataFrame based on an array column that contains a particular string. For example, in this case, you are trying to find the records which contain the particular subject, say "english".
Let's first create a sample DataFrame:
import org.apache.spark.sql.functions._
val json_data = """[{ "name" : "James", "subjects" : [ "english", "french", "botany" ] },
{ "name" : "neo", "subjects" : [ "english", "physics" ] },
{ "name" : "john", "subjects" : [ "spanish", "mathematics" ] }]"""
val df = spark.read.json(Seq(json_data).toDS).toDF
Now let's try to find the records which contain the subject "english". Here we can use the array_contains function from the Spark functions library.
df.filter(array_contains($"subjects", "english")).show(truncate=false)
// Output
+-----+-------------------------+
|name |subjects                 |
+-----+-------------------------+
|James|[english, french, botany]|
|neo  |[english, physics]       |
+-----+-------------------------+
You can find more details about the functions here (scala and python).
I hope this helps.
I have some Avro classes that I generated, and I'm now trying to use them in Spark. So I imported my Avro-generated Java class, "twitter_schema", and refer to it when I deserialize. It seems to work, but I'm getting a cast exception at the end.
My Schema:
$ more twitter.avsc
{ "type" : "record", "name" : "twitter_schema", "namespace" :
"com.miguno.avro", "fields" : [ {
"name" : "username",
"type" : "string",
"doc" : "Name of the user account on Twitter.com" }, {
"name" : "tweet",
"type" : "string",
"doc" : "The content of the user's Twitter message" }, {
"name" : "timestamp",
"type" : "long",
"doc" : "Unix epoch time in seconds" } ], "doc:" : "A basic schema for storing Twitter messages" }
My code:
import java.nio.ByteBuffer
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.avro.mapred.AvroKey
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.NullWritable
import org.apache.avro.mapred.AvroInputFormat
import org.apache.avro.mapred.AvroWrapper
import org.apache.avro.file.DataFileReader
import org.apache.avro.file.DataFileWriter
import org.apache.avro.io.DatumReader
import org.apache.avro.io.DatumWriter
import org.apache.avro.specific.SpecificDatumReader
import com.miguno.avro.twitter_schema

val path = "/app/avro/data/twitter.avro"
val conf = new Configuration

// new Hadoop API
var avroRDD = sc.newAPIHadoopFile(path, classOf[AvroKeyInputFormat[twitter_schema]],
  classOf[AvroKey[ByteBuffer]], classOf[NullWritable], conf)
// old Hadoop API (this definition wins and is the one used in the map below)
var avroRDD = sc.hadoopFile(path, classOf[AvroInputFormat[twitter_schema]],
  classOf[AvroWrapper[twitter_schema]], classOf[NullWritable], 5)
avroRDD.map(l => {
//transformations here
new String(l._1.datum.username)
}
).first
And I get an error on the last line:
scala> avroRDD.map(l => {
| new String(l._1.datum.username)}).first
<console>:30: error: overloaded method constructor String with alternatives:
(x$1: StringBuilder)String <and>
(x$1: StringBuffer)String <and>
(x$1: Array[Byte])String <and>
(x$1: Array[Char])String <and>
(x$1: String)String
cannot be applied to (CharSequence)
new String(l._1.datum.username)}).first
What am I doing wrong? I don't understand the error.
Is this the right way of deserializing? I read about Kryo, but it seems to add complexity, and I read about the Spark SQL context accepting Avro in 1.2, but that sounds like a performance hog/workaround. Does anyone have best practices for this?
thanks,
Matt
I think your problem is that Avro has deserialized the string into a CharSequence, but Spark expected a Java String. Avro has three ways to deserialize a string in Java: into a CharSequence, into a String, and into a Utf8 (an Avro class for storing strings, kind of like Hadoop's Text).
You control that by adding the "avro.java.string" property to your Avro schema. Possible values are (case sensitive): "String", "CharSequence", "Utf8". There may be a way to control that dynamically through the input format as well, but I don't know exactly how.
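For illustration only (my own sketch, not part of the answer above), this is roughly where the property goes; only the username field is shown, parsed here with the plain Avro API:

import org.apache.avro.Schema

// "avro.java.string" : "String" on the string type pins deserialization to java.lang.String
val pinned = new Schema.Parser().parse("""
  { "type" : "record", "name" : "twitter_schema", "namespace" : "com.miguno.avro",
    "fields" : [
      { "name" : "username", "type" : { "type" : "string", "avro.java.string" : "String" } }
    ] }
""")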
OK, since CharSequence is the interface that String implements, I can keep my Avro schema the way it was and just turn the Avro string into a String via toString(), i.e.:
scala> avroRDD.map(l => {
| new String(l._1.datum.get("username").toString())
| } ).first
res2: String = miguno