Spark SQL : Handling schema evolution - apache-spark

I want to read two Avro files of the same data set, but whose schema has evolved:
first Avro file schema: {String, String, Int}
second Avro file schema: {String, String, Long}
(the Int field has evolved to Long)
I want to read these two Avro files into a single DataFrame using Spark SQL.
To read Avro files I am using 'spark-avro' from Databricks:
https://github.com/databricks/spark-avro
How can I do this efficiently?
Spark version: 2.0.1
Scala: 2.11.8
PS. In this example I have mentioned only 2 files, but in the actual scenario a file is generated daily, so there are more than 1000 such files.
Thank you in advance :)

Would a union like
{string, string, [int, long]}
be a valid solution for you? It should allow reading both the new and the old files.
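A minimal sketch of an alternative handled entirely on the Spark side (rather than through the Avro union itself): read each schema generation separately, widen the old Int column to Long, and union the DataFrames. It assumes a SparkSession named spark; the paths and the column name num are hypothetical:

import org.apache.spark.sql.functions.col

// Hypothetical paths; "num" stands in for the field that evolved from int to long.
val oldDf = spark.read.format("com.databricks.spark.avro").load("/data/avro/old/*.avro")
val newDf = spark.read.format("com.databricks.spark.avro").load("/data/avro/new/*.avro")

// Widen the old int column to long so both frames share one schema, then union them.
val merged = oldDf
  .withColumn("num", col("num").cast("long"))
  .union(newDf)

merged.printSchema()

With more than a thousand daily files, you would group the input paths by schema version (for example by date) and apply the cast only to the old group before the union.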

Related

Read avro file with bytes schema in spark

I am trying to read some Avro files into a Spark DataFrame and have the following situation:
The avro file schema is defined as
Schema(
    org.apache.avro.Schema.create(org.apache.avro.Schema.Type.BYTES),
    "ByteBlob", "1.0");
The file contains a nested JSON structure stored under this simple bytes schema.
I can't seem to find a way to read this into a dataframe in spark. Any pointers on how I can read files like these?
Output from avro-tools:
hadoop jar avro-tools/avro-tools-1.10.2.jar getmeta /projects/syslog_paranoids/encrypted/dhr/complete/visibility/zeeklog/202207251345/1.0/202207251351/stg-prd-dhrb-edg-003.data.ne1.yahoo.com_1658690707314_zeeklog_1.0_202207251349_202207251349_6c64f2210c568092c1892d60b19aef36.6.avro
avro.schema "bytes"
avro.codec deflate
The tojson function within avro-tools is able to read the file properly and return a json output contained in the file.
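Not an answer from the thread, but one way to approach this is to bypass spark-avro, read the container files with the plain Avro API, and hand the decoded payloads to spark.read.json. A minimal sketch, assuming a SparkSession named spark and that the bytes payload is UTF-8 JSON; the input path is hypothetical:

import java.nio.ByteBuffer
import org.apache.avro.file.DataFileStream
import org.apache.avro.generic.GenericDatumReader
import scala.collection.JavaConverters._

// binaryFiles yields one (path, PortableDataStream) pair per file.
val jsonLines = spark.sparkContext
  .binaryFiles("hdfs:///data/bytes-avro/*.avro")   // hypothetical path
  .flatMap { case (_, portable) =>
    val reader = new DataFileStream[AnyRef](portable.open(), new GenericDatumReader[AnyRef]())
    try {
      // With a top-level "bytes" schema, each datum is a java.nio.ByteBuffer.
      reader.iterator().asScala.map { datum =>
        val buf = datum.asInstanceOf[ByteBuffer]
        val bytes = new Array[Byte](buf.remaining())
        buf.get(bytes)
        new String(bytes, "UTF-8")
      }.toList
    } finally reader.close()
  }

// Let Spark infer the nested JSON structure.
val df = spark.read.json(jsonLines)
df.printSchema()

Note that binaryFiles loads each file as a whole, so this sketch suits many modest files rather than a few very large ones.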

How to save files in same directory using saveAsNewAPIHadoopFile spark scala

I am using Spark Streaming and I want to save each batch locally in Avro format. I have used saveAsNewAPIHadoopFile to save the data in Avro format. This works well, but it overwrites the existing files: the next batch's data overwrites the old data. Is there any way to save each batch's Avro files into a common directory? I tried setting some Hadoop job configuration properties to add a prefix to the file names, but none of the properties worked.
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyOutputFormat
import org.apache.hadoop.io.NullWritable

dstream.foreachRDD { rdd =>
  // rdd is a pair RDD of (AvroKey[T], NullWritable)
  rdd.saveAsNewAPIHadoopFile(
    path,                             // same path on every batch, hence the overwrite
    classOf[AvroKey[T]],
    classOf[NullWritable],
    classOf[AvroKeyOutputFormat[T]],
    job.getConfiguration())
}
Try this - you can split your process into two steps:
Step 1: write the Avro file with saveAsNewAPIHadoopFile to <temp-path>
Step 2: move the file from <temp-path> to <actual-target-path>
This will solve your problem for now. I will share my thoughts if I manage to fulfil this scenario in one step instead of two.
Hope this is helpful.
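A minimal sketch of the two steps inside foreachRDD, reusing the T, job and Avro classes from the question's snippet; the staging and target directories are hypothetical, and the batch time keeps each batch's output unique:

import org.apache.hadoop.fs.{FileSystem, Path}

dstream.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {
    val tempPath  = s"/data/avro/_staging/${time.milliseconds}"   // Step 1 target (hypothetical)
    val finalPath = s"/data/avro/batches/${time.milliseconds}"    // Step 2 target (hypothetical)

    // Step 1: write this batch to a temporary location.
    rdd.saveAsNewAPIHadoopFile(
      tempPath,
      classOf[AvroKey[T]],
      classOf[NullWritable],
      classOf[AvroKeyOutputFormat[T]],
      job.getConfiguration())

    // Step 2: move it under the common parent directory.
    val fs = FileSystem.get(job.getConfiguration())
    fs.mkdirs(new Path("/data/avro/batches"))
    fs.rename(new Path(tempPath), new Path(finalPath))
  }
}

On HDFS the rename in step 2 is a metadata-only operation, so the data is not rewritten.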

Saving empty DataFrame with known schema (Spark 2.2.1)

Is it possible to save an empty DataFrame with a known schema such that the schema is written to the file, even though it has 0 records?
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
import org.apache.spark.sql.types.StructType

def example(spark: SparkSession, path: String, schema: StructType) = {
  val dataframe = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
  val dataframeWriter = dataframe.write.mode(SaveMode.Overwrite).format("parquet")
  dataframeWriter.save(path)
  spark.read.load(path) // ERROR!! No files to read, so schema unknown
}
This is the answer I received from Databricks Support:
This is actually a known issue in Spark. A fix has already been made in the
open-source JIRA -> https://issues.apache.org/jira/browse/SPARK-23271.
For more details on how this behavior will change from 2.4, please
check this doc change:
https://github.com/apache/spark/pull/20525/files#diff-d8aa7a37d17a1227cba38c99f9f22511R1808
The behavior will change from Spark 2.4. Until then you need to use
one of the following workarounds:
Save a DataFrame with at least one record to preserve its schema
Save the schema in a JSON file and use it later (see the sketch below)
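For the second workaround, a StructType can be round-tripped through its JSON representation. A minimal sketch, reusing spark, path and schema from the example above; the schema file location is hypothetical:

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.types.{DataType, StructType}

// Persist the schema itself next to the (possibly empty) data.
Files.write(
  Paths.get("/tmp/example.schema.json"),                       // hypothetical location
  schema.json.getBytes(StandardCharsets.UTF_8))

// Later: rebuild the StructType and pass it to the reader, so an empty
// directory no longer breaks schema inference.
val schemaJson = new String(
  Files.readAllBytes(Paths.get("/tmp/example.schema.json")), StandardCharsets.UTF_8)
val restored = DataType.fromJson(schemaJson).asInstanceOf[StructType]
val df = spark.read.schema(restored).parquet(path)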
I ran into a similar problem with Spark 2.1.0 and solved it by repartitioning before writing:
df.repartition(1).write.parquet("my/path")

Parse Hive text format RDD[String] to DataFrame using the schema of an existing table

I have an RDD[String]; each String is a row of Hive text-format data, and the Hive table is in a Hive database, so I can get its schema. Is there a way to have Spark parse the RDD[String] into a DataFrame using that schema, so I don't need to do it manually?
If each string in your RDD[String] represents a particular structure like (id, name, salary), you can create a case class in Scala, convert your RDD[String] to an RDD of that case class, and then use the toDF() function to convert the RDD to a DataFrame.
If your file is delimited, you can use the csv package to create the DataFrame on the delimited file if you are using Spark 2.x or later; if you are using Spark 1.6.x or earlier, you can use the external spark-csv package for the same.
Hope it helps.
Regards,
Neeraj
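A minimal sketch of the case-class route, assuming a hypothetical (id, name, salary) layout and Hive's default \u0001 field delimiter; rddOfStrings stands in for your RDD[String]:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical row layout matching the Hive table's columns.
case class Employee(id: Int, name: String, salary: Double)

val spark = SparkSession.builder().appName("hive-text-to-df").getOrCreate()
import spark.implicits._

val rddOfStrings: RDD[String] = ???   // your Hive text-format rows

val df = rddOfStrings
  .map { line =>
    val f = line.split('\u0001')      // Hive's default field separator (^A)
    Employee(f(0).toInt, f(1), f(2).toDouble)
  }
  .toDF()

df.printSchema()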

Read parquet files in Spark with pattern matching

I'm running Spark 1.3.0 and want to read a number of parquet files based on pattern matching. The parquet files are basically the underlying files of a Hive DB, and I want to read only some of the files (across different folders). The folder structure is
hdfs://myhost:8020/user/hive/warehouse/db/blogs/some/meta/files/
hdfs://myhost:8020/user/hive/warehouse/db/blogs/yymmdd=20160101/01/file1.parq
hdfs://myhost:8020/user/hive/warehouse/db/blogs/yymmdd=20160101/02/file2.parq
hdfs://myhost:8020/user/hive/warehouse/db/blogs/yymmdd=20160103/01/file3.parq
Something like
val v1 = sqlContext.parquetFile("hdfs://myhost:8020/user/hive/warehouse/db/blogs/yymmdd={[0-9]*}")
I want to ignore the meta files and load only the parquet files inside the date folders. Is this possible?
You can use a wildcard in the parquet path like so (works on 1.5; I didn't test on 1.3):
val v1 = sqlContext.parquetFile("hdfs://myhost:8020/user/hive/warehouse/db/blogs/yymmdd*")
Another thing you can do, in case that doesn't work, is to create an external Hive table partitioned by yymmdd and read the parquet files from that table using:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("SELECT * FROM ...")
You can't use regular expressions.
Also, I think your folder structure is problematic. It should be
hdfs://myhost:8020/user/hive/warehouse/db/blogs/yymmdd=150204/
or
hdfs://myhost:8020/user/hive/warehouse/db/blogs/yymmdd=150204/part=01
and not:
hdfs://myhost:8020/user/hive/warehouse/db/blogs/yymmdd=150204/1
because, the way you use it, I think you will have trouble using the folder names (yymmdd) as partitions, since the files are not directly under them.
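To flesh out the external-table suggestion, a minimal sketch with hypothetical column names; it assumes the parquet files sit directly under the yymmdd=... folders, as recommended above:

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Hypothetical columns; the partition column matches the yymmdd=... folders.
sqlContext.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS blogs (title STRING, views INT)
  PARTITIONED BY (yymmdd STRING)
  STORED AS PARQUET
  LOCATION 'hdfs://myhost:8020/user/hive/warehouse/db/blogs'
""")

// Register the existing date folders as partitions (or run MSCK REPAIR TABLE blogs).
sqlContext.sql(
  "ALTER TABLE blogs ADD IF NOT EXISTS PARTITION (yymmdd='20160101') LOCATION " +
  "'hdfs://myhost:8020/user/hive/warehouse/db/blogs/yymmdd=20160101'")

// Partition pruning then reads only the requested dates.
val v1 = sqlContext.sql("SELECT * FROM blogs WHERE yymmdd BETWEEN '20160101' AND '20160103'")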
