Spark read avro - apache-spark

Trying to read an avro file.
val df = spark.read.avro(file)
I'm running into: Avro schema cannot be converted to a Spark SQL StructType: [ "null", "string" ]
I tried to create a schema manually, but now I'm running into the following:
val s = StructType(List(StructField("value", StringType, nullable = true)))
val df = spark.read
.option("inferSchema", "false")
.schema(s)
.avro(file)
com.databricks.spark.avro.SchemaConverters$IncompatibleSchemaException: Cannot convert Avro schema to catalyst type because schema at path is not compatible (avroType = StructType(StructField(value,StringType,true)), sqlType = STRING).
Source Avro schema: ["null","string"].
Target Catalyst type: StructType(StructField(value,StringType,true))
Trying to override the avro schema (without the null) also does not work:
val df = spark.read
.option("inferSchema", "false")
.option("avroSchema", """["string"]""")
.avro(file)
Avro schema cannot be converted to a Spark SQL StructType: [ "string" ]
Looks like spark-avro only creates a GenericDatumReader[GenericRecord] and I need a GenericDatumReader[Utf8] :(

Please make sure you are providing the correct AVSC schema with the right data type.
["null", "string"] is a union placed there to take care of null values in the Avro data.
You can create the schema of your Avro file with:
val schema = new Schema.Parser().parse(new File("user.avsc"))
Or, if you have a generated Avro Java class, you can get the schema from it:
val schema = YourGeneratedAvroClass.getClassSchema
Now once you have the schema, it is very simple to build a DataFrame with it.
Code snippet:
val df = sparkSession.read.format("com.databricks.spark.avro")
.option("avroSchema", schema.toString)
.load("/home/garvit.vijay/000009_0.avro")
df.printSchema()
df.show()
Hope it works for you.
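If the real blocker is the top-level union schema (spark-avro expects a record at the root, not a bare ["null","string"] union), one workaround is to read the file with the plain Avro API and hand the values to Spark afterwards. A minimal sketch, assuming the file is small enough to read on the driver and that the path is a placeholder:
import java.io.File
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.GenericDatumReader
import org.apache.avro.util.Utf8
import scala.collection.JavaConverters._
import spark.implicits._

// The datum reader picks the writer schema up from the file itself.
val reader = new DataFileReader[AnyRef](new File("/path/to/file.avro"), new GenericDatumReader[AnyRef]())
val values = reader.iterator().asScala.map {
  case u: Utf8 => u.toString   // the "string" branch of the union
  case null    => null         // the "null" branch
  case other   => other.toString
}.toList
reader.close()

val df = values.toDF("value")
df.printSchema()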

Related

spark.read.excel - Not reading all Excel rows when using custom schema

I am trying to read a Spark DataFrame from an Excel file using the crealytics dependency.
Without any predefined schema, all rows are read correctly, but only as string-type columns.
To prevent that, I am using my own schema (where I declared certain columns as Integer type), but in this case most of the rows are dropped when the file is being read.
The library dependency used in build.sbt:
"com.crealytics" %% "spark-excel" % "0.11.1",
Scala version - 2.11.8
Spark version - 2.3.2
val inputDF = sparkSession.read.excel(useHeader = true).load(inputLocation(0))
The above reads all the data - around 25000 rows.
But,
val inputWithSchemaDF: DataFrame = sparkSession.read
.format("com.crealytics.spark.excel")
.option("useHeader" , "false")
.option("inferSchema", "false")
.option("addColorColumns", "true")
.option("treatEmptyValuesAsNulls" , "false")
.option("keepUndefinedRows", "true")
.option("maxRowsInMey", 2000)
.schema(templateSchema)
.load(inputLocation)
This gives me only 450 rows.
Is there a way to prevent that? Thanks in advance!
As of now, I haven't found a fix to this problem, but I tried solving it in a different way by manually type-casting. To keep the number of lines of code down, I took the help of a for loop. My solution is as follows:
Step 1: Create my own schema of type 'StructType':
val requiredSchema = new StructType()
.add("ID", IntegerType, true)
.add("Vendor", StringType, true)
.add("Brand", StringType, true)
.add("Product Name", StringType, true)
.add("Net Quantity", StringType, true)
Step 2: Type-cast the DataFrame AFTER it has been read (WITHOUT the custom schema) from the Excel file, instead of applying the schema while reading the data:
import org.apache.spark.sql.functions.col

def convertInputToDesiredSchema(inputDF: DataFrame, requiredSchema: StructType)(implicit sparkSession: SparkSession): DataFrame =
{
  var schemaDf: DataFrame = inputDF
  for (i <- inputDF.columns.indices)
  {
    // Cast only the columns whose current type differs from the required one.
    if (inputDF.schema(i).dataType.typeName != requiredSchema(i).dataType.typeName)
    {
      schemaDf = schemaDf.withColumn(schemaDf.columns(i), col(schemaDf.columns(i)).cast(requiredSchema.apply(i).dataType))
    }
  }
  schemaDf
}
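For completeness, a hedged usage sketch of the helper above, with the inputDF read earlier and the requiredSchema from step 1 (the sparkSession is passed explicitly):
val typedDF = convertInputToDesiredSchema(inputDF, requiredSchema)(sparkSession)
typedDF.printSchema()  // the cast columns should now show their required types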
This might not be an efficient solution, but it is better than typing out too many lines of code to typecast multiple columns.
I am still searching for a solution to my original question.
This solution is just in case someone wants to try it and is in immediate need of a quick fix.
Here's a workaround using PySpark, based on a schema file that consists of "fieldname" and "dataType" pairs:
# 1st load the dataframe with StringType for all columns
from pyspark.sql.types import *
from pyspark.sql.functions import col
import pandas as pd
input_df = spark.read.format("com.crealytics.spark.excel") \
.option("header", isHeaderOn) \
.option("treatEmptyValuesAsNulls", "true") \
.option("dataAddress", xlsxAddress1) \
.option("setErrorCellsToFallbackValues", "true") \
.option("ignoreLeadingWhiteSpace", "true") \
.option("ignoreTrailingWhiteSpace", "true") \
.load(inputfile)
# 2nd Modify the datatypes within the dataframe using a file containing column names and the expected data type.
dtypes = pd.read_csv("/dbfs/mnt/schema/{}".format(file_schema_location), header=None).to_records(index=False).tolist()
fields = [StructField(dtype[0], globals()[f'{dtype[1]}']()) for dtype in dtypes]
schema = StructType(fields)
for dt in dtypes:
    colname = dt[0]
    coltype = dt[1].replace("Type", "")
    input_df = input_df.withColumn(colname, col(colname).cast(coltype))

Expressing spark `StructType` in avro schema

How would you describe the Spark StructType data type in an Avro schema? I am generating a Parquet file, the format of which is described in an Avro schema. This file is then loaded from S3 into Spark. There are array and map data types, but these do not correspond to StructType.
Using the package org.apache.spark.sql.avro (Spark 2.4), you can convert Spark SQL schemas to Avro schemas and vice versa.
You can try it this way:
import org.apache.spark.sql.avro.SchemaConverters

val sqlType = SchemaConverters.toSqlType(avroSchema)
// genericRecordToRow is a helper from the answers linked below
val rowRDD = yourGenericRecordRDD.map(record => genericRecordToRow(record, sqlType))
val df = sqlContext.createDataFrame(rowRDD, sqlType.dataType.asInstanceOf[StructType])
Here you can find more answers too: Code
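For the direction the question actually asks about (Spark SQL to Avro), the same package also exposes SchemaConverters.toAvroType in Spark 2.4. A minimal sketch, with illustrative field names, showing that a nested StructType maps to a nested Avro record rather than to an array or map:
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types._

val sparkSchema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("address", StructType(Seq(
    StructField("city", StringType)
  )))
))

// Convert the Spark SQL schema to an Avro schema and print it as AVSC JSON.
val avroSchema = SchemaConverters.toAvroType(sparkSchema, nullable = false, recordName = "MyRecord")
println(avroSchema.toString(true))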

Trim csv file before importing into a Spark Dataset

I've seen this post about how to specify a schema for creating a Dataset:
Spark Scala: Cannot up cast from string to int as it may truncate
val spark = SparkSession.builder()
.master("local")
.appName("test")
.getOrCreate()
import org.apache.spark.sql.Encoders
val schema = Encoders.product[Record].schema
val ds = spark.read
.option("header", "true")
.schema(schema) // passing schema
.option("timestampFormat", "MM/dd/yyyy HH:mm") // passing timestamp format
.csv(path)// csv path
.as[Record] // convert to DS
It works for me, but not when there are whitespaces in the CSV. Is it possible to trim the CSV in this same spark.read sequence?
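One possibility, as a hedged sketch: the built-in CSV reader has ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace options (both default to false when reading), which should trim the values before the schema is applied:
val ds = spark.read
.option("header", "true")
.option("ignoreLeadingWhiteSpace", "true")
.option("ignoreTrailingWhiteSpace", "true")
.schema(schema) // passing schema
.option("timestampFormat", "MM/dd/yyyy HH:mm")
.csv(path)
.as[Record]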

KafkaAvroDecoder object to DataFrame conversion

In Spark Streaming with Kafka and Schema Registry, how can I convert a DStream batch to a DataFrame in Spark after receiving it?
The DStream type after using KafkaAvroDecoder from Confluent is DStream[(String, Object)]. When I use the code below, it changes the schema data types in the Avro columns, e.g. Int becomes Long.
val kafkaStream: DStream[(String, Object)] =
  KafkaUtils.createDirectStream[String, Object, StringDecoder, KafkaAvroDecoder](
    ssc, kafkaParams, Set(topic)
  )

// Load JSON strings into DataFrame
kafkaStream.foreachRDD { rdd =>
  // Get the singleton instance of SQLContext
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._
  val topicValueStrings = rdd.map(_._2.toString)
  val df = sqlContext.read.json(topicValueStrings)
}
code reference
Object.toString and reading the result as JSON loses the schema for Int. Is there any other way, instead of casting the types in the DataFrame columns?
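One hedged alternative sketch: instead of letting read.json infer the types (inference widens integral numbers to LongType), pass an explicit schema to the reader. The field names below are illustrative placeholders, not the actual topic schema:
import org.apache.spark.sql.types._

val explicitSchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType)
))

kafkaStream.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  val topicValueStrings = rdd.map(_._2.toString)
  // With an explicit schema, the Int columns stay IntegerType instead of being inferred as LongType.
  val df = sqlContext.read.schema(explicitSchema).json(topicValueStrings)
  df.printSchema()
}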

Create Spark Dataset from a CSV file

I would like to create a Spark Dataset from a simple CSV file. Here are the contents of the CSV file:
name,state,number_of_people,coolness_index
trenton,nj,"10","4.5"
bedford,ny,"20","3.3"
patterson,nj,"30","2.2"
camden,nj,"40","8.8"
Here is the code to make the Dataset:
var location = "s3a://path_to_csv"
case class City(name: String, state: String, number_of_people: Long)
val cities = spark.read
.option("header", "true")
.option("charset", "UTF8")
.option("delimiter",",")
.csv(location)
.as[City]
Here is the error message: "Cannot up cast number_of_people from string to bigint as it may truncate"
Databricks talks about creating Datasets and this particular error message in this blog post.
Encoders eagerly check that your data matches the expected schema,
providing helpful error messages before you attempt to incorrectly
process TBs of data. For example, if we try to use a datatype that is
too small, such that conversion to an object would result in
truncation (i.e. numStudents is larger than a byte, which holds a
maximum value of 255) the Analyzer will emit an AnalysisException.
I am using the Long type, so I didn't expect to see this error message.
Without inferSchema, the CSV reader types every column as a string, and Spark will not up-cast a string to bigint. Use schema inference:
val cities = spark.read
.option("inferSchema", "true")
...
or provide schema:
val cities = spark.read
.schema(StructType(Array(StructField("name", StringType), ...)
or cast:
val cities = spark.read
.option("header", "true")
.csv(location)
.withColumn("number_of_people", col("number_of_people").cast(LongType))
.as[City]
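As a further hedged sketch, the cast route can also drop the coolness_index column that the City case class does not declare, by selecting only the fields the Dataset needs:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType
import spark.implicits._

val citiesDS = spark.read
.option("header", "true")
.csv(location)
.select(col("name"), col("state"), col("number_of_people").cast(LongType).as("number_of_people"))
.as[City]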
with your case class as
case class City(name: String, state: String, number_of_people: Long),
you just need one line
private val cityEncoder = Seq(City("", "", 0)).toDS
then your code
val cities = spark.read
.option("header", "true")
.option("charset", "UTF8")
.option("delimiter",",")
.csv(location)
.as[City]
will just work.
This is the official source: http://spark.apache.org/docs/latest/sql-programming-guide.html#overview
