Spark for realtime OLAP queries - apache-spark

I have IoT data streaming in via Kafka, and would like to use Spark SQL to analyze it. I was planning on persisting data to S3 using Sector, but there will be a delay of a few minutes while the batch of data gets collected before being written to S3.
How can I make Spark query both the streaming data, and the historical data on S3? Do I run two queries - one with Spark Streaming, and one Spark SQL, and try to combine the results?
Or do I need to use an OLTP database for this type of functionality? I wanted to independently scale compute and storage, which is why I went with Spark + S3.

Related

Process real time data using kafka

I have a requirement to implement the solution for below usecase.
Currently Applications are storing data into Postgres database but Postgres database is facing storage issue. So the plan is to move the data from postgres to Hadoop with near realtime data in hadoop. So we thought of below solution .
Write Kafka producer application to listen to postgres tables and capture changing data and write to Kafka Topic .
Write a Kafka sink application to read from kafka topic and write to hive tables(parquet -- external tables -- partitioned and non partitioned) . So for non partitioned tables if we want to apply updates/deletes then we need to touch the whole table in spark code right? which will lead to performance degrade for every record getting from kafka topic . We have already developed sqoop incremental job which runs for every 5 minutes to do the same. But client needs real time data in hadoop so kafka+spark processing came into discussion .
Could you provide pro's and con's for step2 comparing to sqoop incremental.
please share code snippets/links if any which helps my thought process.
Getting data into Kafka is easy - use Debezium.
For getting it out...
I wouldn't use Hive at all for this. Real time data (depending on on the volume of the data, obviously) results in tiny files in HDFS. Subsequently, Hive queries become slower and slower over time.
Hive is not a replacement for Postgres. In fact, the Hive metastore requires a relational database still, such as Postgres.
I also wouldn't use Spark. You have to write code when ingesting Kafka topics into queryable formats is already a solved problem with other tools.
Popular options include Apache Pinot, Druid, or Apache Iceberg storage with Presto (some of which may overlap with HDFS storage, but will be much, much faster than Hive to query). Only the third option requires writing Kafka consumer code; the other two have native Kafka ingestion.
And even still, if you're stuck with HDFS, Kafka Connect framework comes with Kafka. There's an HDFS Sink plugin, written by Confluent, which supports Hive integration.

Spark Streaming: in-memory aggregation - correct usage

I have a Spark 2.2 Structured streaming flow from an on-premise system into a containerized cloud spark cluster where kafka recieves the data, and SSS maintains a number of queries that flush to disk every ten seconds. A query console-sink is not accessible to external sessions outside the streaming context (hence the CSV flush); the monitoring dashboard runs spark sql from another context to get metrics.
Right now I am only aggregating the data that has come in since streaming was last started. Now I need to aggregate data since forever with the incoming streaming data to provide (near) realtime views. This will mean running a bunch of GROUP BY's on billions of records - maintaining several million aggregate rows in-memory.
My question is regarding how Spark streaming queries can scale like this: how efficient is memory usage (I'll probably use 32 worker contaiers) and is this the correct way to manage a (near-) realtime view of incoming data using kafka and SSS?

Kafka to Spark batch processing

I'm looking for an optimal data architecture.
I'm dealing with TS data that is flushed from a Redis database to OpenTSDB database each week.
OpenTSDB stores its data on HBase which is launched on a Hadoop cluster.
Then, the time series data available on OpenTSDB has to be batch processed (at 1-6 months interval).
Knowing that OpenTSDB data is stored in Binary large object format on HBase, I can't currently tackle HBase HTTP API.
Since Spark cannot directly access OpenTSDB API (While Kafka seems to be Okay with HTTP Api)... I'm facing architecture issues which can be expressed as follows, would it be more convenient to:
Use Apache Kafka to extract batch data (TByte) and use it as a pipeline to ingest and analyze data into Spark Dataframes ?
Flush redis data directly in HBase and hence use Spark directly from it ?
That said, I want to be sure than Spark can handle terabytes batch analytics and kafka can handle that amount before loading it as Spark RDD.
Any suggestion on help will be welcome. Thanks

running interactive sql queries over millions of parquet files

I have millions of streaming parquet files being written . I want to support running ad hoc interactive queries for debugging and analytics purpose ( added bonus if i can run streaming queries for some real time monitoring of key metrics as well).
What is a scalable solution for supporting this.
The two ways I have observed is running spark sql interactively over millions of parquet files (not too familiar with spark ecosystem but does this mean running a spark job for every sql user submits or do i need to run some streaming job and submit queries somehow) and second being using a presto sql engine on top of parquet (not exactly sure how presto ingests new incoming parquet files).
Any recommendations or pros and cons of either approach . Any better solutions considering i have > ~10Tb data produced every day .
Let me address your use cases :
Support running ad hoc interactive queries for debugging and analytics purpose
I would recommend building a presto cluster if you care about minimizing the latency of your queries and are willing to invest in many machines with a large amount of memory.
Reason: Presto would run fully in-memory without touching disk (in most cases)
A Spark Cluster can also do the job, however, it won't be as fast as Presto. The advantage of Spark over presto is its fault tolerance capabilities and its ability to fail over to disk in case of out of memory conditions which may be important for you given that you have too much data.
Run streaming queries for some real-time monitoring of key metrics as well
As long as you have basic queries, you can build dashboards on top of Presto which could run these queries every x minutes.
Having a considerable amount of processing may be a good reason to look at Spark streaming if real-time monitoring is important.
If it isn't then you could build an ETL (using Spark) for calculating your metrics, storing the data as a new hive table and then expose for querying via Presto/SparkSQL again.
How presto ingests new incoming parquet files?
I'm now aware of your architecture, but in any case, you need to provide Presto with a Hive connection (Hive Metastore to be precise).
Hive provides Presto with few schemas attached to the directories where you ingest your data. Presto dynamically sees the new data by default. Spark is not different by the way.
Presto has nothing to do with data ingestion. It only starts its job once the data is there.

Redshift with Spark Streaming

I have a Kafka - Spark Streaming application to ingest and process 60K events per min. I need a database to store my transformed dataframes to be accessed by visualization layer. Can Redshift be used for this with Spark Streaming or should Cassandra be used? I will be processing and storing the dataframes in every spark window of 30 seconds. Also I need to read from the datastore in every window. I guess Redhsift is primarily a data warehousing database not for OLTP sort of the processing.. any ideas?
You should check out SnappyData. SnappyData deeply integrates an in-memory database with Spark that allows hybrid OLTP/OLAP applications. You can write Spark Streaming applications on top of Snappy that can update/delete data from the database. Further, because it does not go over a connector, it performs better than the myriad datastores that have Spark connectors and even the native Spark cache. There may be other datastores that offer hybrid OLTP/OLAP applications on Spark in the aforementioned link.
Disclaimer: I am a SnappyData employee.

Resources