How to create a DStream from parquet files stored in HDFS? - apache-spark

I want to read a large number of records, stored in HDFS in parquet format, and convert them into a Spark Stream, so they can be processed in batches (written to another data-store).
Is there any way to convert DataFrames to DStreams?
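One possible approach (a rough sketch, not tested end-to-end; the paths, chunk count, and batch interval below are placeholders) is to read the parquet data into a DataFrame, split it into chunks, and feed the chunks' underlying RDDs into a queueStream so that each chunk is processed as one micro-batch:

from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

spark = SparkSession.builder.appName("parquet-to-dstream").getOrCreate()
ssc = StreamingContext(spark.sparkContext, batchDuration=30)

# Read the parquet data and split it into chunks; each chunk becomes one micro-batch.
df = spark.read.parquet("hdfs:///data/input")        # placeholder input path
chunks = df.randomSplit([1.0] * 10)                  # placeholder chunk count
stream = ssc.queueStream([c.rdd for c in chunks], oneAtATime=True)

def write_batch(rdd):
    if not rdd.isEmpty():
        # Rebuild a DataFrame for this micro-batch and push it to the target data-store.
        spark.createDataFrame(rdd, schema=df.schema) \
            .write.mode("append").parquet("hdfs:///data/output")   # placeholder sink

stream.foreachRDD(write_batch)
ssc.start()
ssc.awaitTermination()

Note that queueStream is mainly intended for testing and does not support checkpointing; on newer Spark versions, Structured Streaming's file source (spark.readStream with an explicit schema) plus foreachBatch is usually the more robust way to process a directory of parquet files in batches.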

Related

PySpark is Writing Large Single Parquet Files instead of Partitioned Files

For most of my files, when I read in delimited files and write them out as snappy parquet, Spark executes as I expect and creates multiple partitioned snappy parquet files.
That said, I have some large .out files that are pipe-separated (25GB+), and when I read them in:
inputFile = spark.read.load(s3PathIn, format='csv', sep=fileSeparator, quote=fileQuote, escape=fileEscape, inferSchema='true', header='true', multiline='true')
Then output the results to S3:
inputFile.write.parquet(pathOut, mode="overwrite")
I am getting large single snappy parquet files (20GB+). Is there a reason for this? All my other Spark pipelines generate nicely split files that make queries in Athena more performant, but in these specific cases I am only getting single large files. I am NOT executing any repartition or coalesce commands.
Check how many partitions the inputFile DataFrame has; it seems like it has a single partition.
It seems like you are just reading a CSV file and then writing it out as a parquet file. Check the size of your CSV file; it seems like it is really large.
inputFile.rdd.getNumPartitions()
If it's one, try repartitioning the DataFrame:
inputFile.repartition(10)
# or repartition by a column:
inputFile.repartition("col_name")

Parquet file size doubles after deduplication in Spark

We have a deduplication process that reads parquet files, drops duplicate records, and writes the distinct DataFrame back as parquet output files using Spark SQL.
But the output file size doubles its original size. We are writing the parquet with gzip compression, which is also the original files' compression codec.
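For reference, a minimal sketch of the pipeline described above (paths are placeholders):

df = spark.read.parquet("hdfs:///dedup/input")
deduped = df.dropDuplicates()
deduped.write.parquet("hdfs:///dedup/output", mode="overwrite", compression="gzip")

One thing worth noting is that dropDuplicates introduces a shuffle, so the output rows generally end up in a different order than the input; since gzip and parquet's encodings compress similar adjacent values well, a change in row ordering alone can noticeably change the file size.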

How to write Avro Objects to Parquet with partitions in Java? How to append data to the same parquet?

I am using Confluent's KafkaAvroDeserializer to deserialize Avro objects sent over Kafka.
I want to write the received data to a Parquet file.
I want to be able to append data to the same Parquet file and to create a Parquet file with partitions.
I managed to create a Parquet file with AvroParquetWriter, but I didn't find how to add partitions or append to the same file.
Before using Avro, I used Spark to write the Parquet files. With Spark, writing a partitioned Parquet file in append mode was trivial. Should I try creating RDDs from my Avro objects and use Spark to create the Parquet files?
I want to write the Parquet files to HDFS.
Personally, I would not use Spark for this.
Rather, I would use the HDFS Kafka Connector. Here is a config file that can get you started.
name=hdfs-sink
# List of topics to read
topics=test_hdfs
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
# increase to be the sum of the partitions for all connected topics
tasks.max=1
# the folder where core-site.xml and hdfs-site.xml exist
hadoop.conf.dir=/etc/hadoop
# the namenode url, defined as fs.defaultFS in the core-site.xml
hdfs.url=hdfs://hdfs-namenode.example.com:9000
# number of messages per file
flush.size=10
# The format to write the message values
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
# Setup Avro parser
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://schema-registry.example.com:8081
value.converter.schemas.enable=true
schema.compatibility=BACKWARD
If you want HDFS Partitions based on a field rather than the literal "Kafka Partition" number, then refer to the configuration docs on the FieldPartitioner. If you want automatic Hive integration, see the docs on that as well.
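For example, a field-based partitioner would be configured with something like the following properties (check the partitioner docs for your connector version, and substitute your own field name for my_field):

# partition HDFS output by a field in the record value instead of the Kafka partition
partitioner.class=io.confluent.connect.storage.partitioner.FieldPartitioner
partition.field.name=my_field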
If you did want to use Spark, though, you could try AbsaOSS/ABRiS to read the Avro data into a DataFrame; then you should be able to do something like df.write.format("parquet").save("/some/path") (not exact code, because I have not tried it).

Concatenate ORC partition files on disk?

I am using Spark 2.3 to convert some CSV data to ORC for use with Amazon Athena; it is working fine! Athena works best with files that are not too small, so, after manipulating the data a bit, I am using Spark to coalesce the partitions into a single partition before writing to disk, like so:
df.coalesce(1).write.orc("out.orc", compression='zlib', mode='append')
This results in a single ORC file that is an optimal file size for use with Athena. However, the coalesce step takes a very long time. It adds about 33% to the total amount of time to convert the data!
This is obviously due to the fact that Spark cannot parallelize the coalesce step when saving to a single file. When I create the same number of partitions as there are CPUs available, the ORC write to disk is much faster!
My question is: can I parallelize the ORC write to disk and then concatenate the files somehow? This would allow me to parallelize the write and merge the files without having to compress everything on a single CPU.
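For comparison, here is a sketch of the two write paths described above (the ORC concatenation step itself is the open question; output paths are placeholders):

num_partitions = spark.sparkContext.defaultParallelism   # roughly one partition per available core

# fast, parallel write: one ORC file per partition
df.repartition(num_partitions).write.orc("out_parallel.orc", compression='zlib', mode='append')

# slow path from the question: the whole dataset is compressed and written by a single task
df.coalesce(1).write.orc("out.orc", compression='zlib', mode='append')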

Merge multiple parquet files into single file on S3

I don't want to partition or repartition the Spark DataFrame; writing multiple part files as-is gives the best performance. Is there any way I can merge the files after they have been written to S3?
I have used parquet-tools and it does the merge on local files. I want to do this on S3.
