How to save results of streaming query as PDF / XLSX (for report generation)? - apache-spark

Curious to know if we can generate PDF or XLSX files for reports using Spark Streaming / Spark Structured Streaming. According to the official documentation there is a File Sink, but are PDF and XLSX supported? If so, can we use it for report generation?

if we can generate PDF or XLSX files for report using spark streaming / spark structured streaming
If you want to generate PDF/XLSX files in a distributed, streaming manner, you can certainly use Spark Structured Streaming.
As per the official document there is File Sink but is PDF and XLSX supported?
No. There is no direct support for the PDF or XLSX formats, so you'd have to write a custom data source yourself (with a streaming sink).
if so can we make use of it for report generations?
I've never heard of such a data source, but it's certainly possible to write one yourself.
Think of Spark as a general-purpose computation platform and whatever can be modelled (designed) as a distributed computation should certainly be doable using Spark machinery.
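A full custom sink is not the only route. On Spark 2.4 or later, foreachBatch hands each micro-batch to your own function as a regular DataFrame, so an ordinary report library can do the writing. Below is a minimal sketch in Python, assuming each batch result is small enough to collect on the driver. Note that write_csv_report is a hypothetical stand-in for a real XLSX or PDF writer (for example, one built on a library such as openpyxl), and the output directory and file naming are made up for illustration.

```python
import csv
import os


def write_csv_report(columns, rows, path):
    """Stand-in report writer: one header line, then one line per row.

    Hypothetical placeholder; swap in an XLSX/PDF writer with the same signature.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(columns)
        writer.writerows(rows)


def make_batch_handler(out_dir, write_report=write_csv_report, ext="csv"):
    """Build a function with the (batch_df, batch_id) shape foreachBatch expects."""
    def handle(batch_df, batch_id):
        os.makedirs(out_dir, exist_ok=True)
        path = os.path.join(out_dir, f"report-{batch_id:06d}.{ext}")
        # collect() pulls the whole micro-batch to the driver, so this only
        # suits report-sized results, not raw event streams.
        write_report(batch_df.columns, batch_df.collect(), path)
        return path
    return handle


# With a real streaming DataFrame `df` (not created here):
# df.writeStream.foreachBatch(make_batch_handler("/tmp/reports")).start()
```

Each micro-batch produces one file named after its batch id, so a restarted query that replays a batch overwrites the same report rather than adding a duplicate.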

Related

Spark Streaming with side inputs using Java

I am looking for some sample Java code for Spark streaming using some side inputs.
The streaming input will be joined with some static input from a local file / HDFS file.
I believe this is a standard use case, but unfortunately I could not find any guidance after extensive searching.
Code should provide the following:
How to create the session / context.
How to read the streaming data (preferably from Kafka).
How to read the static file (I need to use csv format).
How to join the streaming data with the static data (preferably using SQL).
I have tried searching for sample code and also checked the Apache Spark programming guide.
However, I could not find any sample code covering a streaming + static input join.
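A rough sketch of one way to cover the four points with Structured Streaming follows. It is written in Python rather than the requested Java, but the DataFrame methods used here have the same names in the Java API. The join column id, the lookup column name, the view names, and all paths and addresses are hypothetical, and the Kafka source assumes the spark-sql-kafka-0-10 package is available to the job.

```python
def build_joined_stream(spark, static_csv_path, bootstrap_servers, topic):
    """Join a Kafka stream with a static CSV lookup table via Spark SQL."""
    # Static side: an ordinary batch read of the CSV file (local or HDFS path).
    static_df = (
        spark.read
        .option("header", "true")
        .csv(static_csv_path)
    )
    # Streaming side: Kafka delivers binary key/value columns, cast to strings.
    stream_df = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .load()
        .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS payload")
    )
    # Register both sides as views so the join can be written in SQL.
    static_df.createOrReplaceTempView("static_ref")
    stream_df.createOrReplaceTempView("events")
    # Hypothetical columns: `id` as the join key, `name` from the static file.
    return spark.sql(
        "SELECT e.id, e.payload, s.name "
        "FROM events e JOIN static_ref s ON e.id = s.id"
    )


# Usage sketch (session creation shown as comments; names are made up):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName("side-input-join").getOrCreate()
# joined = build_joined_stream(spark, "hdfs:///ref/customers.csv",
#                              "localhost:9092", "events")
# joined.writeStream.format("console").start().awaitTermination()
```

The static DataFrame is read once with spark.read and the stream with spark.readStream; the result of the join is itself a streaming DataFrame, so it has to be started with writeStream rather than shown with a batch action.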

Transform CSV into Parquet using Apache Flume?

Is it possible to run ETL on data using Flume?
To be more specific, I have Flume configured on a spoolDir that contains CSV files, and I want to convert those files into Parquet files before storing them in Hadoop. Is that possible?
If it's not possible, would you recommend transforming them before storing them in Hadoop, or transforming them with Spark on Hadoop?
I'd probably suggest using NiFi to move the files around. Here's a specific tutorial on how to do that with Parquet. I feel NiFi is the replacement for Apache Flume.
Partial answers using Flume (not Parquet):
If you are flexible on the format, you can use an Avro sink. You can also use a Hive sink, which will create a table in ORC format. (You can check whether it also allows Parquet in the definition, but I have heard that ORC is the only supported format.)
You could likely use a simple script that has Hive move the data from the ORC table to a Parquet table, which converts the data into the Parquet files you asked for.
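For the other option raised in the question, transforming with Spark once the CSV files have landed in Hadoop, a batch job along these lines is a common pattern. This is only a sketch: the function name and directory arguments are hypothetical, and it assumes the CSV files have a header row.

```python
def csv_to_parquet(spark, src_dir, dst_dir):
    """Batch-convert CSV files under src_dir into Parquet files under dst_dir."""
    df = (
        spark.read
        .option("header", "true")       # assumes the CSV files carry a header row
        .option("inferSchema", "true")  # extra pass over the data; an explicit schema avoids it
        .csv(src_dir)
    )
    # Append so that repeated runs over new input add to the existing dataset.
    df.write.mode("append").parquet(dst_dir)
    return df


# Usage sketch with an existing SparkSession `spark` (paths are made up):
# csv_to_parquet(spark, "hdfs:///landing/csv", "hdfs:///warehouse/parquet")
```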

Best file formats for S3 using Spark for ETL on EMR

We are planning to perform ETL processing using Spark with source data sitting on S3. The data volume for ETL processing is less than 100 million. What is the best format to store data in S3 in this scenario, i.e. the best compression and file format (text, sequence, Parquet, etc.)?
ORC or Parquet for queries, compressed with Snappy. Avro is another general-purpose format, but it is much less efficient for Spark SQL queries because you have to scan a lot more data.
Important: At the time of writing (June 2017), you cannot safely use S3 as a direct destination of Spark RDD/DataFrame queries (i.e. save() calls). See Cloud Integration for an explanation. Write to HDFS, then copy.

Any benefit for my case when using Hive as datawarehouse?

Currently, I am trying to adopt big data tools to replace my current data analysis platform. My current platform is pretty simple: my system gets a lot of structured CSV feed files from various upstream systems, and we load them as Java objects (i.e. in memory) for aggregation.
I am looking at using Spark to replace my Java object layer for the aggregation process.
I understand that Spark supports loading files from HDFS / the filesystem, so Hive as a data warehouse does not seem to be a must. However, I could still load my CSV files into Hive first and then use Spark to load the data from Hive.
My question is: in my situation, what are the benefits of introducing a Hive layer rather than loading the CSV files directly into a Spark DataFrame?
Thanks.
You can always inspect the data through the tables.
Ad hoc queries and aggregations can be performed using HiveQL.
When accessing that data through Spark, you do not need to specify the schema of the data separately.

Spark with Avro, Kryo and Parquet

I'm struggling to understand what exactly Avro, Kryo and Parquet do in the context of Spark. They all are related to serialization but I've seen them used together so they can't be doing the same thing.
Parquet describes itself as a columnar storage format and I kind of get that, but when I'm saving a Parquet file, can Avro or Kryo have anything to do with it? Or are they only relevant during the Spark job, i.e. for sending objects over the network during a shuffle or spilling to disk? How do Avro and Kryo differ, and what happens when you use them together?
Parquet works very well when you need to read only a few columns when querying your data. However, if your schema has lots of columns (30+) and your queries/jobs need to read all of them, then record-based formats (like Avro) will work better/faster.
Another limitation of Parquet is that it is essentially a write-once format. So you usually need to collect data in some staging area and write it to a Parquet file once a day (for example).
This is where you might want to use Avro. E.g. you can collect Avro-encoded records in a Kafka topic or local files and have a batch job that converts all of them to a Parquet file at the end of the day. This is fairly easy to implement thanks to the parquet-avro library, which provides tools to convert between the Avro and Parquet formats automatically.
And of course you can use Avro outside of Spark/big data. It is a fairly good serialization format, similar to Google Protobuf or Apache Thrift.
This very good blog post explains the details for everything but Kryo.
http://grepalex.com/2014/05/13/parquet-file-format-and-object-model/
Kryo would be used for fast serialization not involving permanent storage, such as shuffle data and cached data, in memory or on disk as temp files.
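To make the separation concrete: Kryo is switched on through a Spark configuration setting, independently of whatever file format the job writes. A small sketch, where the helper with_kryo and the dictionary name are hypothetical; spark.serializer and the KryoSerializer class name are the standard Spark setting and value.

```python
# Serializer used for shuffle and cached objects; it does not affect output files.
KRYO_SETTINGS = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}


def with_kryo(builder, settings=KRYO_SETTINGS):
    """Apply the Kryo settings to a SparkSession-style builder and return it."""
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder


# Usage sketch (requires pyspark; the path is made up):
# from pyspark.sql import SparkSession
# spark = with_kryo(SparkSession.builder.appName("demo")).getOrCreate()
# df.write.parquet("hdfs:///out/data")  # on-disk format is still Parquet
```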
