How to write Avro objects to Parquet with partitions in Java? How to append data to the same Parquet file? - apache-spark

I am using Confluent's KafkaAvroDeserializer to deserialize Avro objects sent over Kafka.
I want to write the received data to a Parquet file.
I want to be able to append data to the same Parquet file and to create Parquet files with partitions.
I managed to create a Parquet file with AvroParquetWriter, but I couldn't find how to add partitions or how to append to an existing file.
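For illustration, a minimal sketch of that kind of AvroParquetWriter usage (the class name, the batch method, and the target path are made up, not the asker's actual code). Note that a ParquetWriter only creates new files; because Parquet stores its metadata in a footer, an existing file cannot be appended to in place, so writers roll a new file per batch instead.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

import java.io.IOException;
import java.util.List;

public class AvroBatchToParquet {
    // Writes one batch of deserialized Avro records into a new Parquet file on HDFS.
    static void writeBatch(Schema schema, List<GenericRecord> records, String file) throws IOException {
        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path(file))
                .withSchema(schema)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build()) {
            for (GenericRecord record : records) {
                writer.write(record);
            }
        }
    }
}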
Before using Avro I used Spark to write the Parquet files. With Spark, writing a partitioned Parquet output in append mode was trivial. Should I try creating RDDs from my Avro objects and use Spark to create the Parquet files?

I want to write the Parquet files to HDFS.
Personally, I would not use Spark for this.
Rather, I would use the Kafka Connect HDFS connector. Here is a config file that can get you started.
name=hdfs-sink
# List of topics to read
topics=test_hdfs
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
# increase to be the sum of the partitions for all connected topics
tasks.max=1
# the folder where core-site.xml and hdfs-site.xml exist
hadoop.conf.dir=/etc/hadoop
# the namenode url, defined as fs.defaultFS in the core-site.xml
hdfs.url=hdfs://hdfs-namenode.example.com:9000
# number of messages per file
flush.size=10
# The format to write the message values
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
# Setup Avro parser
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://schema-registry.example.com:8081
value.converter.schemas.enable=true
schema.compatibility=BACKWARD
If you want HDFS Partitions based on a field rather than the literal "Kafka Partition" number, then refer to the configuration docs on the FieldPartitioner. If you want automatic Hive integration, see the docs on that as well.
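For example, partitioning by a record field would look roughly like this in the same config file (the field name event_date is just a placeholder):
partitioner.class=io.confluent.connect.hdfs.partitioner.FieldPartitioner
partition.field.name=event_date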
If you did want to use Spark, though, you could try AbsaOSS/ABRiS to read the Avro records into a DataFrame, then you should be able to do something like df.write.format("parquet").save("/some/path") (not exact code, because I have not tried it).
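For the partitioning and append parts of the question, the standard DataFrame writer covers both, however the Avro is decoded. A rough Java sketch (the Dataset, the "date" column, and the output path are assumptions):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class ParquetSink {
    // Appends the rows of df to a partitioned Parquet directory on HDFS.
    static void writePartitioned(Dataset<Row> df) {
        df.write()
          .mode(SaveMode.Append)         // add new part files, keep existing ones
          .partitionBy("date")           // one subdirectory per distinct value of "date"
          .parquet("hdfs:///some/path"); // target directory
    }
}

Note that Spark "appends" by adding new part files under the directory rather than modifying existing Parquet files.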

Related

Transform CSV into Parquet using Apache Flume?

Is it possible to do ETL on data using Flume?
To be more specific: I have Flume configured with a spoolDir source that contains CSV files, and I want to convert those files into Parquet before storing them in Hadoop. Is that possible?
If it's not possible, would you recommend transforming them before storing them in Hadoop, or transforming them with Spark on Hadoop?
I'd probably suggest using NiFi to move the files around. Here's a specific tutorial on how to do that with Parquet. I feel NiFi was the replacement for Apache Flume.
Flume partial answers (not Parquet):
If you are flexible on format, you can use an Avro sink. You can also use a Hive sink, which will create a table in ORC format. (You can check whether the definition also allows Parquet, but I have heard that ORC is the only supported format.)
You could likely use a simple Hive script to move the data from the ORC table to a Parquet table (converting the files into the Parquet files you asked for).
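If you go that route, the Hive part of the script could be roughly as simple as this (table names are placeholders):
CREATE TABLE flume_data_parquet STORED AS PARQUET AS SELECT * FROM flume_data_orc;
-- or, once the Parquet table already exists, per batch:
INSERT INTO TABLE flume_data_parquet SELECT * FROM flume_data_orc;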

How to append Kafka consumer output to a file (Parquet) in HDFS using Java/Scala?

This is a Kafka batch process. I want to read a local CSV file and write it into a Kafka topic.
Then the consumer has to get the data from the topic it subscribed to.
Expected: I want the consumed data to be appended to a file in Parquet format in HDFS.
Please help me achieve this in an efficient manner.
Kafka Producer input:
Kafka Consumer output:
I want the value to be appended to a file in HDFS.
Doing that from scratch would be quite complicated.
You can use the Kafka Connect HDFS sink connector, which handles Parquet output out of the box (this would need a bit of preprocessing of your records, though, to put them in a suitable format such as JSON with a schema).
More info here :
https://docs.confluent.io/current/connect/kafka-connect-hdfs/index.html
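For reference, "JSON with a schema" here means the envelope that Kafka Connect's JsonConverter expects when schemas.enable=true; roughly like this (the field names are just an example):
{
  "schema": {
    "type": "struct",
    "name": "csv_record",
    "optional": false,
    "fields": [
      { "field": "id",   "type": "int32",  "optional": false },
      { "field": "name", "type": "string", "optional": true }
    ]
  },
  "payload": { "id": 1, "name": "alice" }
}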

Spark Streaming to Hive, too many small files per partition

I have a Spark Streaming job with a batch interval of 2 minutes (configurable).
This job reads from a Kafka topic, creates a Dataset, applies a schema on top of it, and inserts the records into the Hive table.
The Spark Job creates one file per batch interval in the Hive partition like below:
dataset.coalesce(1).write().mode(SaveMode.Append).insertInto(targetEntityName);
The incoming data is not that big, and even if I increase the batch duration to 10 minutes or so, I might still end up with only 2-3 MB of data, which is far less than the HDFS block size.
This is the expected behaviour in Spark Streaming.
I am looking for efficient ways to do a post processing to merge all these small files and create one big file.
If anyone's done it before, please share your ideas.
I would encourage you to not use Spark to stream data from Kafka to HDFS.
Kafka Connect HDFS Plugin by Confluent (or Apache Gobblin by LinkedIn) exist for this very purpose. Both offer Hive integration.
Find my comments about compaction of small files in this GitHub issue.
If you need to write Spark code to process the Kafka data into a schema, you can still do that, and write into another topic in (preferably) Avro format, which Hive can easily read without a predefined table schema.
I personally have written a "compaction" process that grabs a bunch of hourly Avro data partitions from a Hive table and converts them into a daily-partitioned Parquet table for analytics. It's been working great so far.
If you want to batch the records before they land on HDFS, that's where Kafka Connect or Apache NiFi (mentioned in the link) can help, given that you have enough memory to store the records before they are flushed to HDFS.
I had exactly the same situation as you. I solved it as follows:
Let's assume that your newly arriving data is stored in a dataset: dataset1.
1- Partition the table with a good partition key. In my case I found a combination of keys that gives around 100 MB per partition.
2- Save using Spark Core rather than Spark SQL (a rough sketch follows after these steps):
a- Load the whole target partition into memory (into a dataset: dataset2) when you want to save.
b- Then apply the dataset union function: dataset3 = dataset1.union(dataset2)
c- Make sure the resulting dataset is partitioned as you wish, e.g. dataset3.repartition(1)
d- Save the resulting dataset in "Overwrite" mode to replace the existing files.
If you need more details about any step, please reach out.
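A rough Java sketch of steps a-d above (the method signature, the path arguments, and the staging-directory step are assumptions, not the answerer's exact code):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class PartitionCompactor {
    // Merges the new micro-batch with the data already stored in one partition,
    // then rewrites that partition as a single file.
    static void mergeAndRewrite(SparkSession spark, Dataset<Row> dataset1,
                                String partitionPath, String stagingPath) {
        Dataset<Row> dataset2 = spark.read().parquet(partitionPath); // a- load the existing partition
        Dataset<Row> dataset3 = dataset1.union(dataset2);            // b- union new and existing rows
        dataset3.repartition(1)                                      // c- one output file for the partition
                .write()
                .mode(SaveMode.Overwrite)                            // d- replace previous contents
                .parquet(stagingPath); // written to a staging directory because Spark refuses to
                                       // overwrite a path it is also reading from; swap it in with
                                       // an HDFS rename afterwards
    }
}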

Does presto require a hive metastore to read parquet files from S3?

I am trying to generate Parquet files in S3 using Spark, with the goal that Presto can later be used to query the Parquet. Basically, here is how it looks:
Kafka-->Spark-->Parquet<--Presto
I am able to generate Parquet in S3 using Spark and it is working fine. Now I am looking at Presto, and what I think I have found is that it needs a Hive metastore in order to query Parquet. I could not make Presto read my Parquet files even though Parquet stores the schema. So, does that mean that at the time of creating the Parquet files, the Spark job also has to store metadata in the Hive metastore?
If that is the case, can someone help me find an example of how it's done? To add to the problem, my data schema is changing, so to handle that I am creating a programmatic schema in the Spark job and applying it while creating the Parquet files. If I create the schema in the Hive metastore, it needs to be done with this in mind.
Or could you shed light on any better alternative way?
You keep the Parquet files on S3. Presto's S3 capability is a subcomponent of the Hive connector. As you said, you can define the tables from Spark, or you can use Presto for that, e.g.
create table hive.default.xxx (<columns>)
with (format = 'parquet', external_location = 's3://s3-bucket/path/to/table/dir');
(Depending on Hive metastore version and its configuration, you might need to use s3a instead of s3.)
Technically, it should be possible to create a connector that infers tables' schemata from Parquet headers, but I'm not aware of an existing one.

Convert Xml to Avro from Kafka to hdfs via spark streaming or flume

I want to convert XML files to Avro. The data will be in XML format and will hit the Kafka topic first. Then I can use either Flume or Spark Streaming to ingest it, convert it from XML to Avro, and land the files in HDFS. I have a Cloudera environment.
When the Avro files hit HDFS, I want to be able to read them into Hive tables later.
I was wondering what the best method to do this is. I have tried automated schema conversion such as spark-avro (this was without Spark Streaming), but the problem is that spark-avro converts the data and Hive cannot read it. spark-avro converts the XML to a DataFrame and then from the DataFrame to Avro, but the resulting Avro files can only be read by my Spark application. I am not sure I am using it correctly.
I think I will need to define an explicit Avro schema. I am not sure how to go about this for the XML file; it has multiple namespaces and is quite large.
If you are on Cloudera (since you have Flume, you may well have it), you can use Morphlines to do the conversion at record level. You can use it in batch or streaming mode. See here for more info.
