How to merge schema while loading avro in spark dataframe? - apache-spark

I am trying to read avro files using https://github.com/databricks/spark-avro and the avro schema evolved over time. I read like this with mergeSchema option set to true hoping that it would merge schema itself but it didn't work.
sqlContext.read.format("com.databricks.spark.avro").option("mergeSchema", "true").load('s3://xxxx/d=2015-10-27/h=*/')
What is the work around ?

Merging schema is not implemented for avro files in spark and there is no easy workaround. One solution would be to read your avro data file-by-file (or partition-by-partition) as separate data sets and then union those data sets. But that can be terribly slow.

Related

Transform CSV into Parquet using Apache Flume?

I have a question, is it possible to execute ETL for data using flume.
To be more specific I have flume configured on spoolDir which contains CSV files and I want to convert those files into Parquet files before storing them into Hadoop. Is it possible ?
If it's not possible would you recommend transforming them before storing in Hadoop or transform them using spark on Hadoop?
I'd probably suggest using nifi to move the files around. Here's a specific tutorial on how to do that with Parquet. I feel nifi was the replacement for Apache Flume.
Flume partial answers:(Not Parquet)
If you are flexible on format you can use an avro sink. You can use a hive sink and it will create a table in ORC format.(You can see if it also allows parquet in the definition but I have heard that ORC is the only supported format.)
You could likely use some simple script to use hive to move the data from the Orc table to a Parquet table. (Converting the files into the parquet files you asked for.)

Generate Spark schema code/persist and reuse schema

I am implementing some Spark Structured Streaming transformations from a Parquet data source. In order to read the data into a streaming DataFrame, one has to specify the schema (it cannot be automatically inferred). The schema is really complex and manually writing the schema code will be a very complex task.
Can you suggest a walkaround? Currently I am creating a batch DataFrame beforehand (using the same data source), Spark infers the schema and then I save the schema to a Scala object and use it as an input for the Structured Streaming reader.
I don't think it is a reliable or a well performing solution. Please suggest how to generate the schema code automatically or somehow persist the schema in a file and reuse it.
From the docs:
By default, Structured Streaming from file based sources requires you
to specify the schema, rather than rely on Spark to infer it
automatically. This restriction ensures a consistent schema will be
used for the streaming query, even in the case of failures. For ad-hoc
use cases, you can reenable schema inference by setting
spark.sql.streaming.schemaInference to true.
You could also open a shell, read one of the parquet files with automatic schema inference enabled, and save the schema to JSON for later reuse. You only have to do this one time, so it might be faster / more efficient than doing the similar-sounding workaround you're using now.

Does presto require a hive metastore to read parquet files from S3?

I am trying to generate parquet files in S3 file using spark with the goal that presto can be used later to query from parquet. Basically, there is how it looks like,
Kafka-->Spark-->Parquet<--Presto
I am able to generate parquet in S3 using Spark and its working fine. Now, I am looking at presto and what I think I found is that it needs hive meta store to query from parquet. I could not make presto read my parquet files even though parquet saves the schema. So, does it mean at the time of creating the parquet files, the spark job has to also store metadata in hive meta store?
If that is the case, can someone help me find an example of how it's done. To add to the problem, my data schema is changing, so to handle it, I am creating a programmatic schema in spark job and applying it while creating parquet files. And, if I am creating the schema in hive metastore, it needs to be done keeping this in consideration.
Or could you shed light on it if there is any better alternative way?
You keep the Parquet files on S3. Presto's S3 capability is a subcomponent of the Hive connector. As you said, you can let Spark define tables in Spark or you can use Presto for that, e.g.
create table hive.default.xxx (<columns>)
with (format = 'parquet', external_location = 's3://s3-bucket/path/to/table/dir');
(Depending on Hive metastore version and its configuration, you might need to use s3a instead of s3.)
Technically, it should be possible to create a connector that infers tables' schemata from Parquet headers, but I'm not aware of an existing one.

Convert Xml to Avro from Kafka to hdfs via spark streaming or flume

I want to convert xml files to avro. The data will be in xml format and will be hit the kafka topic first. Then, I can either use flume or spark-streaming to ingest and convert from xml to avro and land the files in hdfs. I have a cloudera enviroment.
When the avro files hit hdfs, I want the ability to read them into hive tables later.
I was wondering what is the best method to do this? I have tried automated schema conversion such as spark-avro (this was without spark-streaming) but the problem is spark-avro converts the data but hive cannot read it. Spark avro converts the xml to dataframe and then from dataframe to avro. The avro file can only be read by my spark application. I am not sure if I am using this correctly.
I think I will need to define an explicit schema for the avro schema. Not sure how to go about this for the xml file. It has multiple namespaces and is quite massive.
If you are on cloudera(since u have flume, may u have it), you can use morphline to work on conversion at record level. You can use batch/streaming. You can see here for more info.

Spark with Avro, Kryo and Parquet

I'm struggling to understand what exactly Avro, Kryo and Parquet do in the context of Spark. They all are related to serialization but I've seen them used together so they can't be doing the same thing.
Parquet describes its self as a columnar storage format and I kind of get that but when I'm saving a parquet file can Arvo or Kryo have anything to do with it? Or are they only relevant during the spark job, ie. for sending objects over the network during a shuffle or spilling to disk? How do Arvo and Kryo differ and what happens when you use them together?
Parquet works very well when you need to read only a few columns when querying your data. However if your schema has lots of columns (30+) and in your queries/jobs you need to read all of them then record based formats (like AVRO) will work better/faster.
Another limitation of Parquet is that it is essentially write-once format. So usually you need to collect data in some staging area and write it to a parquet file once a day (for example).
This is where you might want to use AVRO. E.g. you can collect AVRO-encoded records in a Kafka topic or local files and have a batch job that converts all of them to Parquet file at the end of the day. This is fairly easy to implement thanks to parquet-avro library that provides tools to convert between AVRO and Parquet formats automatically.
And of course you can use AVRO outside of Spark/BigData. It is fairly good serialization format similar to Google Protobuf or Apache Thrift.
This very good blog post explains the details for everything but Kryo.
http://grepalex.com/2014/05/13/parquet-file-format-and-object-model/
Kryo would be used for fast serialization not involving permanent storage, such as shuffle data and cached data, in memory or on disk as temp files.

Resources