Spark Kafka Structured Streaming integration with Apache Ignite

Right now there is no way for me to save Spark DataFrames in Apache Ignite. It will be included in Apache Ignite 2.2, as mentioned here: https://issues.apache.org/jira/browse/IGNITE-3084. I am using the Structured Streaming API of Apache Spark with Kafka for consuming data. I want to do some aggregations on the consumed data, like the average value of a particular column or min-max values.
My question is whether I should use the Spark SQL DataFrame API to do the above-mentioned aggregations, or should I wait for the Apache Ignite 2.2 release? The documentation mentions that Ignite SQL is 100s of times faster than Spark SQL.

Actually, it's up to you. You could go ahead with Spark now, then, once DataFrame support in Ignite is ready, compare the two approaches and choose whichever fits your needs better.
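If you do go with Spark now, the aggregations you describe are only a few lines of Structured Streaming code. Below is a minimal sketch, assuming a Kafka topic named "events" whose messages carry a JSON payload with a numeric "price" field; the topic, bootstrap servers and column names are placeholders, not something from your setup:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("kafka-aggregations").getOrCreate()
    import spark.implicits._

    // schema of the JSON payload carried in the Kafka message value
    val schema = new StructType().add("price", DoubleType)

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()
      .select(from_json($"value".cast("string"), schema).as("data"))
      .select("data.*")

    // global avg / min / max over the stream
    val stats = events.agg(avg($"price"), min($"price"), max($"price"))

    stats.writeStream
      .outputMode("complete")   // a global aggregation requires "complete" mode
      .format("console")        // console sink just to keep the sketch self-contained
      .start()
      .awaitTermination()

A global aggregation like this needs the "complete" output mode; in practice you would swap the console sink for whatever store your application uses.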

Related

Understanding kappa architecture with Apache Superset

There is a lot of information about kappa architecture on the internet, and after going through some of the conceptual aspects I am trying to drill down to something more concrete. As my main source I used this website.
Let's imagine you want to implement a kappa architecture involving the following tech stack:
Apache Kafka
Apache Spark
Apache Superset
Now imagine the application you want to do data analytics against has a PostgreSQL database. Of course you can easily connect Apache Superset directly with the PostgreSQL database and create charts.
But now you want to see how you would do this with a kappa architecture, so you add Kafka and Spark.
You can emit events to Kafka and you can read such events in Apache Spark. Kafka will retain messages for topics for a certain period, as pointed out in the answers to this question. When I read about connecting Superset with Spark in the docs, it says Hive should be used as a connector (the project website also states the tool is unsupported, and if you look at this issue on pyhive you find that impyla could be an alternative). But Apache Hive is a completely different project, for a storage system. So how would this connection work?
Assume you have Kafka nodes running (with ZooKeeper, obviously), you also have Spark running, and then you connect Apache Superset through this Hive connector with Spark.
How can you write queries against the data that is in Kafka (which is in fact the live data)?
On the Spark side itself you can easily write a Scala program that reads data from Kafka and does something with it, but how can you achieve this from Apache Superset?
Or is this not the intended way of connecting the things?
If I understood your question, you'd need to use Spark Structured Streaming to register a streaming SQL table into the Hive metastore, which could then be queried from Superset via the Spark Thrift Server.
Hive itself doesn't store any of the data. Hive also has a built-in Kafka query handler, so Spark isn't completely necessary.
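A rough sketch of that route is shown below, assuming the Thrift Server is started inside the same streaming application so that it shares the session; the topic name, servers and table name are placeholders, and you need the spark-hive-thriftserver dependency on the classpath:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val spark = SparkSession.builder()
      .appName("superset-streaming")
      .enableHiveSupport()
      .getOrCreate()

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")
      .option("subscribe", "app-events")
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value", "timestamp")

    // the "memory" sink keeps the latest results queryable as the table "live_events"
    stream.writeStream
      .format("memory")
      .queryName("live_events")
      .outputMode("append")
      .start()

    // expose this session over a HiveServer2-compatible JDBC endpoint
    HiveThriftServer2.startWithContext(spark.sqlContext)

    spark.streams.awaitAnyTermination()

Superset would then connect to this application through its Hive (HiveServer2) connector and query live_events like any other table.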
But, Hive/Spark isn't the only option. You could use Spark to write to HDFS/S3 and have Presto query that from Superset.
Or you can remove Spark and use Kafka Connect to write to anything else that a dashboarding tool (Tableau is another popular one) can support - a JDBC database (e.g. Postgres), Mongo, Cassandra, etc. Then you'd just refresh the panels to run a new query.

Spark structured streaming from JDBC source

Can someone let me know if it's possible to do Spark Structured Streaming from a JDBC source? E.g. a SQL database or any RDBMS.
I have looked at a few similar questions on SO, e.g.:
Spark streaming jdbc read the stream as and when data comes - Data source jdbc does not support streamed reading
jdbc source and spark structured streaming
However, I would like to know if it's officially supported by Apache Spark?
If there is any sample code, that would be helpful.
Thanks
No, there is no such built-in support in Spark Structured Streaming. The main reason is that most databases don't provide a unified interface for obtaining the changes.
It's possible to get changes from some databases using archive logs, write-ahead logs, etc., but it's database-specific. For many databases the popular choice is Debezium, which can read such logs and push the list of changes into Kafka, or something similar, from which it could be consumed by Spark.
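On the Spark side, consuming those change events is just a regular Kafka read. The sketch below is illustrative only: the topic name, column names and the simplified envelope are assumptions, since the real Debezium payload depends on the connector configuration:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("cdc-from-kafka").getOrCreate()
    import spark.implicits._

    // shape of one table row; real schemas depend on the captured table
    val rowSchema = new StructType()
      .add("id", LongType)
      .add("amount", DoubleType)

    // simplified Debezium-style envelope: operation plus before/after images
    val envelope = new StructType()
      .add("op", StringType)      // c = create, u = update, d = delete
      .add("before", rowSchema)
      .add("after", rowSchema)

    val changes = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")
      .option("subscribe", "dbserver1.public.orders")
      .load()
      .select(from_json($"value".cast("string"), envelope).as("c"))
      .select($"c.op", $"c.after.*")

    changes.writeStream.format("console").start().awaitTermination()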
I am on a project right now architecting this using SharePlex CDC from Oracle, writing to Kafka, and then using Spark Structured Streaming with Kafka integration and MERGE on Delta format on HDFS.
I.e., that is the way to do it if not using Debezium. You can use change logs for base tables or materialized views to feed the CDC.
So direct JDBC is not possible.
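For the MERGE-on-Delta part of that pipeline, a common pattern is foreachBatch plus the Delta Lake merge API. The following is a hedged sketch: the table path, the join key "id", and the `changes` streaming DataFrame (e.g. one read from Kafka as in the earlier sketch) are all assumptions:

    import io.delta.tables.DeltaTable
    import org.apache.spark.sql.DataFrame

    // upsert one micro-batch of change rows into a Delta table on HDFS
    val upsert: (DataFrame, Long) => Unit = (batch, _) => {
      val target = DeltaTable.forPath(batch.sparkSession, "hdfs:///data/orders_delta")
      target.as("t")
        .merge(batch.as("s"), "t.id = s.id")
        .whenMatched().updateAll()
        .whenNotMatched().insertAll()
        .execute()
    }

    changes.writeStream
      .foreachBatch(upsert)
      .option("checkpointLocation", "hdfs:///checkpoints/orders")
      .start()
      .awaitTermination()

Deletes (op = "d") would need an extra whenMatched(...).delete() branch keyed on the operation column; the sketch keeps only the upsert path.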

Apache Spark Structured Streaming vs Apache Flink: what is the difference?

We have discussed the questions below:
What is the difference between Apache Spark and Apache Flink? [closed]
What does “streaming” mean in Apache Spark and Apache Flink?
What is the difference between mini-batch vs real time streaming in practice (not theory)?
But Spark Structured Streaming became production-ready in Spark 2.2; it brings a lot of changes for streaming, and it is outstanding.
Can we say Spark Structured Streaming is stream processing, or is it still batch processing?
Now what is the big difference between Apache Flink and Apache Spark Structured Streaming?
Currently:
Spark Structured Streaming still uses micro-batches in the background. However, it supports event-time processing and quite low latency (though not as low as Flink's), and it supports SQL and type-safe queries on the streams in one API; there is no distinction, every Dataset can be queried both with SQL and with type-safe operators. It has end-to-end exactly-once semantics (at least that's what they say ;) ). The throughput is better than Flink's (there have been benchmarks with different results, but look at the Databricks post about the results).
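To illustrate the "one API" and event-time points with a small, hedged example (the `events` stream and its `timestamp`/`key` columns are made up): the same streaming Dataset can be aggregated with operators or registered as a view and queried with SQL, with late data bounded by a watermark:

    import org.apache.spark.sql.functions._
    import spark.implicits._

    // typed/untyped operator style: windowed counts by event time
    val byMinute = events
      .withWatermark("timestamp", "10 minutes")   // bound how late data may arrive
      .groupBy(window($"timestamp", "1 minute"), $"key")
      .count()

    // the very same Dataset queried with SQL
    events.createOrReplaceTempView("events")
    val byMinuteSql = spark.sql(
      """SELECT window(timestamp, '1 minute') AS w, key, count(*) AS cnt
         FROM events
         GROUP BY window(timestamp, '1 minute'), key""")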
In near future:
Spark's Continuous Processing Mode is in progress, and it will give Spark ~1 ms latency, comparable to Flink's. However, as I said, it's still in progress. The API is ready for non-batch jobs, so it's easier to use than the previous Spark Streaming.
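For reference, continuous mode surfaces in the API as an experimental trigger (shipped in Spark 2.3); in this sketch the `events` stream, servers, topic and checkpoint path are placeholders:

    import org.apache.spark.sql.streaming.Trigger

    events.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")
      .option("topic", "output")
      .option("checkpointLocation", "/tmp/checkpoints/continuous")
      .trigger(Trigger.Continuous("1 second"))   // checkpoint interval, not a latency target
      .start()

Note that continuous mode currently supports only map-like operations and a limited set of sources and sinks.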
The main difference:
Spark relies on micro-batching for now, while Flink has pre-scheduled operators. That means Flink's latency is lower, but the Spark community is working on Continuous Processing Mode, which (as far as I understand) will work similarly to receivers.

Redshift with Spark Streaming

I have a Kafka - Spark Streaming application to ingest and process 60K events per minute. I need a database to store my transformed dataframes so they can be accessed by the visualization layer. Can Redshift be used for this with Spark Streaming, or should Cassandra be used? I will be processing and storing the dataframes in every Spark window of 30 seconds. I also need to read from the datastore in every window. I guess Redshift is primarily a data warehousing database, not meant for OLTP-style processing. Any ideas?
You should check out SnappyData. SnappyData deeply integrates an in-memory database with Spark, which allows hybrid OLTP/OLAP applications. You can write Spark Streaming applications on top of Snappy that can update/delete data in the database. Further, because it does not go over a connector, it performs better than the myriad datastores that have Spark connectors, and even the native Spark cache. The aforementioned link may list other datastores that offer hybrid OLTP/OLAP applications on Spark.
Disclaimer: I am a SnappyData employee.

Improve reading speed of Cassandra in Spark (Parallel reads implementation)

I am new to Spark and trying to combine Cassandra and Spark to do some analytical tasks.
From the Spark web UI I found that most of the time is spent in the reading process.
When I dug into this particular task, I found that only a single executor is working on it.
Is it possible to improve the performance of this task via some tricks like parallelization?
p.s. I am using the pyspark cassandra connector (https://github.com/TargetHolding/pyspark-cassandra).
UPDATE: I am using a 3-node Spark cluster running Spark 1.6 and a 3-node Cassandra cluster running Cassandra 2.2.4.
And I am selecting data in the form of
"select * from tbl where partitionKey IN [pk_1,pk_2,....,pk_N] where
clusteringKey > ck_1 and clusteringKey < ck_2"
UPDATE2: I've read an article suggesting replacing the IN clause with parallel reads (https://ahappyknockoutmouse.wordpress.com/2014/11/12/246/). How can this be achieved in Spark?
I will be able to answer more precisely if you provide more details about the cluster, the Spark and Cassandra versions, and related configuration. Still, I will try to answer as per my understanding.
Make sure your RDD (parallelized collection) is properly partitioned.
If your Spark job is running on only a single executor, please verify the spark-submit command. You can get more details about spark-submit commands here, depending on your cluster manager.
For speeding up Cassandra read operations, make use of proper indexing. I would recommend using Solr, which will help you with fast data retrieval from Cassandra.
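Regarding UPDATE2, one way to turn the IN clause into parallel reads is sketched below with the DataStax spark-cassandra-connector in Scala (the pyspark-cassandra library used in the question has a different API); the keyspace, table, column names and bounds are placeholders:

    import com.datastax.spark.connector._
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("parallel-cassandra-reads").getOrCreate()
    val sc = spark.sparkContext

    case class PK(partitionkey: Int)

    val pks = Seq(1, 2, 3).map(PK)      // the keys that were in the IN (...) list
    val (ck1, ck2) = (10, 100)          // clustering-key bounds

    // each partition key becomes its own token-aware read, executed in parallel across executors
    val rows = sc.parallelize(pks, numSlices = pks.size)
      .joinWithCassandraTable("my_keyspace", "tbl")
      .on(SomeColumns("partitionkey"))
      .where("clusteringkey > ? AND clusteringkey < ?", ck1, ck2)

Whether this helps depends on how many partition keys you query and how they are distributed across the Cassandra nodes.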
