Ingesting Data Sketches from Presto into Druid - presto

I am trying to ingest a dataset (S3, Parquet format) that has an HLL data sketch built by the Presto engine.
When I ingest it into Druid using HLLSketchMerge, it fails even though I have tried all three "tgtHllType" values (HLL_4, HLL_6, and HLL_8). I believe both engines use the same methodology for data sketches. I could use "HLLSketchBuild" to build an HLL data sketch instead, but I want to avoid that if possible, as the sketch has already been pre-computed by a different engine (Presto). Any guidance would be greatly appreciated.
https://druid.apache.org/docs/latest/development/extensions-core/datasketches-hll.html
https://prestodb.io/docs/current/functions/hyperloglog.html

Related

How to save results of streaming query as PDF / XLSX (for report generation)?

Curious to know if we can generate PDF or XLSX files for reports using Spark Streaming / Spark Structured Streaming. As per the official documentation there is a File Sink, but are PDF and XLSX supported? If so, can we make use of it for report generation?
if we can generate PDF or XLSX files for reports using Spark Streaming / Spark Structured Streaming
If you want to generate PDF/XLSX files in a distributed streaming manner, you could indeed use Spark Structured Streaming.
As per the official documentation there is a File Sink, but are PDF and XLSX supported?
No. There is no direct support for PDF/XLSX formats so you'd have to write a custom data source yourself (with a streaming sink).
If so, can we make use of it for report generation?
I've never heard of such a data source before, but it's certainly possible to write one yourself.
Think of Spark as a general-purpose computation platform, and whatever can be modelled (designed) as a distributed computation should certainly be doable using Spark machinery.
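As a concrete illustration of the "write one yourself" route, here is a minimal sketch that produces one XLSX report per micro-batch using foreachBatch (Spark 2.4+) and Apache POI. This is not a built-in sink; the socket source, the column names, and the /tmp output path are assumptions, and it only works when the aggregated report is small enough to collect on the driver.

```scala
// Sketch only: one XLSX report per micro-batch via foreachBatch and Apache POI.
// Assumes the aggregated report is small enough to collect on the driver.
import java.io.FileOutputStream
import org.apache.poi.xssf.usermodel.XSSFWorkbook
import org.apache.spark.sql.{DataFrame, SparkSession}

object XlsxReportSink {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("xlsx-report").getOrCreate()
    import spark.implicits._

    // Hypothetical input: a socket stream of "productId,amount" lines.
    val sales = spark.readStream
      .format("socket").option("host", "localhost").option("port", 9999).load()
      .as[String]
      .map(_.split(","))
      .map(a => (a(0), a(1).toDouble))
      .toDF("productId", "amount")

    val totals = sales.groupBy("productId").sum("amount")

    // Explicitly typed so the Scala overload of foreachBatch is picked.
    val writeReport: (DataFrame, Long) => Unit = (batch, batchId) => {
      val rows = batch.collect()                       // only safe for small reports
      val wb = new XSSFWorkbook()
      val sheet = wb.createSheet("report")
      rows.zipWithIndex.foreach { case (r, i) =>
        val row = sheet.createRow(i)
        row.createCell(0).setCellValue(r.getString(0))
        row.createCell(1).setCellValue(r.getDouble(1))
      }
      val out = new FileOutputStream(s"/tmp/report-$batchId.xlsx")
      try wb.write(out) finally { out.close(); wb.close() }
    }

    totals.writeStream
      .outputMode("complete")
      .foreachBatch(writeReport)
      .start()
      .awaitTermination()
  }
}
```

The same shape works for PDF if you swap POI for a PDF library inside the foreachBatch body; the streaming machinery does not care what file format you emit.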

Zeppelin with Spark Structured Streaming Example

I am trying to visualize Spark Structured Streaming results in Zeppelin. I am able to do this using the memory sink, but it is not a reliable solution for high data volumes. What would be a better solution?
Example implementation or demo would be helpful.
Thanks,
Rilwan
Thanks for asking the question!! Having 2+ years of experience developing Spark monitoring tools, I think I will be able to resolve your doubt!!
There are two types of processing available when data comes to Spark as a stream.
Discretized Stream (DStream): In this mode, Spark provides the data as RDDs and you have to write your own logic to handle them.
Pros:
1. If you want to do some processing before saving the streaming data, RDDs are the best way to handle it compared to DataFrames.
2. DStreams give you a nice Streaming UI that graphically shows how much data has been processed. Check this link - https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html#monitoring-applications
Cons:
1. Handling raw RDDs is not so convenient or easy.
Structured Streaming: In this mode, Spark provides the data as a DataFrame, and you only need to specify where to store/send the data.
Pros:
1. Structured Streaming comes with predefined sources and sinks which are very common, and 95% of real-life scenarios can be solved by plugging these in. Check this link - https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Cons:
1. There is no Streaming UI available with Structured Streaming :(, although you can get the metrics and build your own UI. Check this link - https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries
You can also store the metrics in a plain-text file, read the file in Zeppelin through spark.read.json, and plot your own graph, as sketched below.
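Here is a rough sketch of that "store the metrics, read them back" idea using a StreamingQueryListener. The file path and the selected fields are assumptions; a production setup would write to a shared location that the Zeppelin interpreter can read.

```scala
// Sketch: append each micro-batch's progress as one JSON object, then plot it in Zeppelin.
import java.io.{FileWriter, PrintWriter}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val spark = SparkSession.builder.appName("streaming-metrics").getOrCreate()

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // StreamingQueryProgress.json is one JSON document per micro-batch.
    val out = new PrintWriter(new FileWriter("/tmp/streaming-metrics.json", true))
    try out.println(event.progress.json) finally out.close()
  }
})

// Later, in a Zeppelin paragraph, the metrics file can be loaded and plotted:
// val metrics = spark.read.json("/tmp/streaming-metrics.json")
// z.show(metrics.select("timestamp", "numInputRows", "inputRowsPerSecond"))
```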

Parquet with Athena VS Redshift

I hope someone out there can help me with this issue. I am currently working on a data pipeline project, and my current dilemma is whether to use Parquet with Athena or to store the data in Redshift.
2 Scenarios:
First,
EVENTS --> STORE IT IN S3 AS JSON.GZ --> USE SPARK(EMR) TO CONVERT TO PARQUET --> STORE PARQUET BACK INTO S3 --> ATHENA FOR QUERY --> VIZ
Second,
EVENTS --> STORE IT IN S3 --> USE SPARK(EMR) TO STORE DATA INTO REDSHIFT
Issues with this scenario:
Spark JDBC with Redshift is slow.
The Spark-Redshift repo by Databricks has a failing build and was last updated 2 years ago.
I am unable to find useful information on which method is better. Should I even use Redshift, or is Parquet good enough?
It would also be great if someone could tell me whether there are other methods for connecting Spark with Redshift, because I have only found two solutions online: JDBC and Spark-Redshift (Databricks).
P.S. The pricing model is not a concern to me; also, I'm dealing with millions of events.
Here are some ideas / recommendations:
Don't use JDBC.
Spark-Redshift works fine but is a complex solution.
You don't have to use Spark to convert to Parquet; there is also the option of using Hive. See https://docs.aws.amazon.com/athena/latest/ug/convert-to-columnar.html
Athena is great when used against Parquet, so you don't need to use Redshift at all.
If you want to use Redshift, then use Redshift Spectrum to set up a view against your Parquet tables, then, if necessary, a CTAS within Redshift to bring the data in.
AWS Glue Crawler can be a great way to create the metadata needed to map the Parquet into Athena and Redshift Spectrum.
My proposed architecture:
EVENTS --> STORE IT IN S3 --> HIVE to convert to parquet --> Use directly in Athena
and/or
EVENTS --> STORE IT IN S3 --> HIVE to convert to parquet --> Use directly in Redshift using Redshift Spectrum
You MAY NOT need to convert to Parquet at all: if you use the right partitioning structure (S3 folders) and gzip the data, Athena/Spectrum performance can be good enough without the complexity of the conversion. This depends on your use case (the volume of data and the types of query you need to run).
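If you do go with the Spark-on-EMR conversion from the first scenario, a minimal sketch of that step might look like the following. The bucket names, paths, and the partition columns (event_time, event_type) are assumptions about your data.

```scala
// Sketch of the "Spark on EMR converts JSON.GZ to partitioned Parquet" step.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}

object EventsToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("events-to-parquet").getOrCreate()

    // Spark reads .json.gz transparently; compression is detected from the extension.
    val events = spark.read.json("s3://my-bucket/raw/events/*.json.gz")

    // Partition on the columns you expect to filter on in Athena (here: date and type).
    events
      .withColumn("event_date", to_date(col("event_time")))
      .write
      .partitionBy("event_date", "event_type")
      .mode("overwrite")
      .parquet("s3://my-bucket/curated/events_parquet/")

    spark.stop()
  }
}
```

Choosing the partition columns is the same decision as choosing the S3 key structure discussed below: partition on what you filter by most often.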
Which one to use depends on your data and access patterns. Athena directly uses S3 key structure to limit the amount of data to be scanned. Let's assume you have event type and time in events. The S3 keys could be e.g. yyyy/MM/dd/type/* or type/yyyy/MM/dd/*. The former key structure allows you to limit the amount of data to be scanned by date or date and type but not type alone. If you wanted to search only by type x but don't know the date, it would require a full bucket scan. The latter key schema would be the other way around. If you mostly need to access the data just one way (e.g. by time), Athena might be a good choice.
On the other hand, Redshift is a PostgreSQL-based data warehouse which is much more complicated and flexible than Athena. Data partitioning plays a big role in terms of performance, but the schema can be designed in many ways to suit your use case. In my experience the best way to load data into Redshift is to first store it in S3 and then use COPY (https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html). It is orders of magnitude faster than JDBC, which I found good only for testing with small amounts of data. This is also how Kinesis Firehose loads data into Redshift. If you don't want to implement the S3 copying yourself, Firehose provides an alternative for that.
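A minimal sketch of that S3-then-COPY pattern, driven from Spark plus a plain JDBC connection. The table name, S3 paths, IAM role, and credentials are placeholders, and the Redshift JDBC driver has to be on the classpath.

```scala
// Sketch of the "write to S3, then issue COPY" loading pattern for Redshift.
import java.sql.DriverManager
import org.apache.spark.sql.SparkSession

object LoadRedshiftViaCopy {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("redshift-copy").getOrCreate()

    // 1. Let Spark stage the prepared data in S3 (Parquet keeps the column types).
    spark.read.json("s3://my-bucket/raw/events/*.json.gz")
      .write.mode("overwrite")
      .parquet("s3://my-bucket/staging/events_parquet/")

    // 2. Ask Redshift to pull the files in with COPY, which loads in parallel across
    //    the cluster and is far faster than row-by-row inserts over JDBC.
    val conn = DriverManager.getConnection(
      "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev",
      "dbuser", "dbpassword")
    try {
      conn.createStatement().execute(
        """COPY analytics.events
          |FROM 's3://my-bucket/staging/events_parquet/'
          |IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
          |FORMAT AS PARQUET""".stripMargin)
    } finally {
      conn.close()
    }
    spark.stop()
  }
}
```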
There are a few details missing in the question: how would you manage incremental upserts in the data pipeline?
If you have implemented Slowly Changing Dimensions (SCD type 1 or 2), they can't be managed using Parquet files, but they can easily be managed in Redshift.

Any benefit for my case when using Hive as a data warehouse?

Currently, I am trying to adopt big data tooling to replace my current data analysis platform. My current platform is pretty simple: my system gets a lot of structured CSV feed files from various upstream systems, and then we load them as Java objects (i.e. in memory) for aggregation.
I am looking at using Spark to replace my Java object layer for the aggregation process.
I understand that Spark supports loading files from HDFS / the filesystem, so Hive as a data warehouse does not seem to be a must. However, I could still load my CSV files into Hive first and then use Spark to load the data from Hive.
My question is: in my situation, what are the pros / benefits of introducing a Hive layer rather than loading the CSV files directly into a Spark DataFrame?
Thanks.
You can always inspect and explore the data through the tables.
Ad hoc queries/aggregations can be performed using HiveQL.
When accessing that data through Spark, you do not need to specify the schema of the data separately.
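To make the schema point concrete, here is a small sketch contrasting the two approaches; the table and file names are made up for illustration.

```scala
// Sketch: reading the raw CSV needs an explicit schema (or schema inference scans),
// while a Hive table already carries its schema in the metastore.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder
  .appName("hive-vs-csv")
  .enableHiveSupport()   // needed so spark.table can resolve Hive metastore tables
  .getOrCreate()

// Without Hive: the schema lives in your code (or you pay for inferSchema scans).
val feedSchema = StructType(Seq(
  StructField("account_id", StringType),
  StructField("trade_date", DateType),
  StructField("amount", DoubleType)))

val fromCsv = spark.read
  .option("header", "true")
  .schema(feedSchema)
  .csv("hdfs:///feeds/trades/*.csv")

// With Hive: the metastore owns the schema, and other tools (HiveQL, BI) see the same table.
val fromHive = spark.table("staging.trades")

// Either DataFrame supports the same aggregations from here on.
fromCsv.printSchema()
fromHive.printSchema()
```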

Library to process .rrd (round robin data) using Spark

I have huge time series data in .rrd (round-robin database) format stored in S3. I am planning to use Apache Spark to run analysis on it and compute various performance metrics.
Currently I am downloading the .rrd files from S3 and processing them with the rrd4j library. I am going to process longer terms, like a year or more, which involves hundreds of thousands of .rrd files. I want the Spark nodes to get the files directly from S3 and run the analysis.
How can I make Spark use rrd4j to read the .rrd files? Is there any library that helps me do that?
Is there any support in Spark for processing this kind of data?
The Spark part is rather easy: use either wholeTextFiles or binaryFiles on the SparkContext (see the docs). According to the documentation, rrd4j usually wants a path to construct an RRD, but with RrdByteArrayBackend you could load the data in there; that might be a problem, though, because most of that API is protected. You'll have to figure out a way to load an Array[Byte] into rrd4j.
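Here is a rough sketch of the binaryFiles route. Rather than fighting the protected byte-array backend, it spills each file to a local temp path on the executor and opens it read-only with rrd4j. The S3 path, the time range, and the exact rrd4j calls are assumptions and may need adjusting for your rrd4j version.

```scala
// Sketch: read .rrd files straight from S3 and compute a per-file average.
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.rrd4j.ConsolFun
import org.rrd4j.core.RrdDb

object RrdOnSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("rrd-on-spark").getOrCreate()
    val sc = spark.sparkContext

    // (path, PortableDataStream) pairs; each .rrd file is read as a whole.
    val rrdFiles = sc.binaryFiles("s3a://my-bucket/rrd/*.rrd")

    val averages = rrdFiles.map { case (path, stream) =>
      // Spill to a local temp file so rrd4j can open it by path.
      val tmp = Files.createTempFile("rrd", ".rrd")
      Files.write(tmp, stream.toArray())
      val rrd = new RrdDb(tmp.toString, true)   // read-only
      try {
        val end = rrd.getLastUpdateTime
        val start = end - 24 * 3600             // last day, as an example window
        val data = rrd.createFetchRequest(ConsolFun.AVERAGE, start, end).fetchData()
        val values = data.getValues()(0)        // first datasource
        val valid = values.filter(!_.isNaN)
        (path, if (valid.isEmpty) Double.NaN else valid.sum / valid.length)
      } finally {
        rrd.close()
        Files.delete(tmp)
      }
    }

    averages.take(10).foreach(println)
    spark.stop()
  }
}
```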
