What is the best approach to join data in a Spark Streaming application? - cassandra

Question: Essentially, rather than running a join against the C* table for each streaming record, is there any way to run the join once per micro-batch of records in Spark Streaming?
We have almost finalized on spark-sql 2.4.x and the datastax-spark-cassandra-connector for Cassandra 3.x.
But we have one fundamental question about efficiency in the scenario below.
For the streaming records (i.e. streamingDataSet), I need to look up existing records (i.e. cassandraDataset) from a Cassandra (C*) table.
i.e.
Dataset<Row> streamingDataSet = //kafka read dataset
Dataset<Row> cassandraDataset= //loaded from C* table those records loaded earlier from above.
To look up the data I need to join the above datasets,
i.e.
Dataset<Row> joinDataSet = streamingDataSet.join(cassandraDataset).where(//somelogic)
and then process joinDataSet further to implement the business logic ...
In the above scenario, my understanding is that for each record received
from the Kafka stream it would query the C* table, i.e. make a database call.
Wouldn't that take a huge amount of time and network bandwidth if the C* table
contains billions of records? What approach should be followed to improve
lookups against the C* table?
What is the best solution in this scenario? I CANNOT load once from the
C* table and look up against that, as data keeps being added to the C* table, i.e.
new lookups might need newly persisted data.
How should this kind of scenario be handled? Any advice please.

If you're using Apache Cassandra, then you have only one possibility for an effective join with data in Cassandra - the RDD API's joinWithCassandraTable. The open source version of the Spark Cassandra Connector (SCC) supports only that, while the DSE version includes code that allows an effective join against Cassandra for Spark SQL as well - the so-called DSE Direct Join. If you use a plain Spark SQL join against a Cassandra table, Spark will need to read all the data from Cassandra and then perform the join - that's very slow.
I don't have an example of doing this join from Spark Structured Streaming with the OSS SCC, but I do have some examples of a "normal" join, like this:
// trdd: an RDD of Tuple1<Integer> join keys, wrapped via the connector's Java API
CassandraJavaPairRDD<Tuple1<Integer>, Tuple2<Integer, String>> joinedRDD =
    trdd.joinWithCassandraTable("test", "jtest",
        someColumns("id", "v"), someColumns("id"),
        mapRowToTuple(Integer.class, String.class), mapTupleToRow(Integer.class));
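For the per-micro-batch lookup the question asks about, one common pattern is to do the join inside foreachBatch, so Cassandra is queried once per micro-batch instead of once per record. Below is a minimal, untested Scala sketch of that idea using the RDD-level joinWithCassandraTable; the keyspace, table, and key column names (ks, existing, id) are hypothetical:
import com.datastax.spark.connector._
import org.apache.spark.sql.DataFrame

// Join each Kafka micro-batch against the C* table by partition key,
// instead of loading the whole table or issuing a query per record.
streamingDataSet.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    val keys   = batch.rdd.map(row => Tuple1(row.getAs[Int]("id"))) // extract join keys
    val joined = keys.joinWithCassandraTable("ks", "existing")      // one query per key/partition
    // ... apply the business logic to `joined` here ...
  }
  .start()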

Related

Spark Connect Hive to HDFS vs Spark connect HDFS directly and Hive on the top of it?

Summary of the problem:
I have a particular use case of writing >10 GB of data per day to HDFS via Spark Streaming. We are currently in the design phase. We want to write the data to HDFS (a constraint) using Spark Streaming. The data is columnar.
We have 2 options (so far):
Naturally, I would like to use a Hive context to feed data to HDFS. The schema is defined, and the data is fed in batches or row-wise.
There is another option: we can write data directly to HDFS through the Spark Streaming API. We are also considering this because in this use case we can then query the data from HDFS through Hive. That would leave options open to use other technologies in the future for new use cases that may come.
What is best?
Spark Streaming -> Hive -> HDFS -> Consumed by Hive.
VS
Spark Streaming -> HDFS -> Consumed by Hive , or other technologies.
Thanks.
So far I have not found a discussion on the topic; my research may have been too brief. If there is any article you can suggest, I would be happy to read it.
I have a particular use case to write >10 GB of data per day and the data is columnar
That means you are storing day-wise data. If that's the case, Hive can have a partition column on the date, so that you can query the data for each day easily. You can query the raw data from BI tools like Looker or Presto, or any other BI tool. If you are querying from Spark, then you can use Hive features/properties. Moreover, if you store the data in a columnar format such as Parquet, Impala can query the data using the Hive metastore.
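As an illustration of the day-wise layout, here is a minimal sketch of writing the stream as date-partitioned Parquet on HDFS; the DataFrame name, paths, and the use of the processing date as the partition column are assumptions, not part of the question:
import org.apache.spark.sql.functions.current_date

// Write the stream as date-partitioned Parquet files on HDFS; an external
// Hive table (PARTITIONED BY dt, LOCATION 'hdfs:///data/events') can then
// expose the same files to Hive, Presto, or Impala.
val query = eventsDF
  .withColumn("dt", current_date())                       // day-wise partition column (processing date)
  .writeStream
  .format("parquet")
  .option("path", "hdfs:///data/events")
  .option("checkpointLocation", "hdfs:///checkpoints/events")
  .partitionBy("dt")
  .start()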
If your data is columnar, consider Parquet or ORC.
Regarding option 2:
If Hive is an option, there is NO need to feed the data through Hive; write it to HDFS, create an external table in Hive over it, and access it that way.
Conclusion:
I feel both are much the same, but Hive is preferred considering direct querying of the raw data using BI tools or Spark. We can also query data on HDFS using Spark, and if it is already there in formats like JSON, Parquet, or XML there won't be an added advantage for option 2.
It depends on your final use cases. Please consider the two scenarios below when making the decision:
If you have an RT/NRT case and all your data is a full refresh, then I would suggest going with the second approach, Spark Streaming -> HDFS -> Consumed by Hive. It will be faster than your first approach, Spark Streaming -> Hive -> HDFS -> Consumed by Hive, since there is one less layer in it.
If your data is incremental and also has multiple update and delete operations, then it will be difficult to use HDFS, or Hive over HDFS, with Spark, since Spark does not allow updating or deleting data in HDFS. In that case, both your approaches will be difficult to implement. Either you can go with a Hive managed table and do updates/deletes using HQL (only supported in the Hortonworks Hive distribution), or you can go with a NoSQL database like HBase or Cassandra so that Spark can do upserts and deletes easily. From a programming perspective, it will also be easier compared to both your approaches.
If you dump the data into NoSQL, then you can use Hive over it for normal SQL or reporting purposes.
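To illustrate the NoSQL route, here is a minimal sketch of writing a DataFrame to Cassandra with the Spark Cassandra Connector; since Cassandra writes are upserts keyed by the primary key, a plain append acts as insert-or-update. The keyspace and table names (ks, events) are hypothetical:
// df: a DataFrame whose schema matches the target table's columns.
// Appending a row whose primary key already exists overwrites it (upsert).
df.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "events"))
  .mode("append")
  .save()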
There are many tools and approaches available, but go with the one that fits all your cases. :)

Ignite Spark Dataframe slow performance

I was trying to improve the performance of some existing Spark DataFrames by adding Ignite on top of them. The following code is how we currently read the DataFrame:
val df = sparksession.read.parquet(path).cache()
I managed to save and load a Spark DataFrame from Ignite following the example here: https://apacheignite-fs.readme.io/docs/ignite-data-frame. The following code is how I do it now with Ignite:
val df = spark.read
  .format(IgniteDataFrameSettings.FORMAT_IGNITE)              // data source
  .option(IgniteDataFrameSettings.OPTION_TABLE, "person")     // table to read
  .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE, CONFIG) // Ignite config
  .load()
df.createOrReplaceTempView("person")
SQL queries (like select a, b, c from table where x) on the Ignite DataFrame work, but the performance is much slower than Spark alone (i.e. without Ignite, querying the Spark DF directly): an SQL query often takes 5 to 30 seconds, and it is commonly 2 or 3 times slower than Spark alone. I noticed that a lot of data (100 MB+) is exchanged between the Ignite container and the Spark container for every query. A query with the same "where" but a smaller result is processed faster. Overall, Ignite's DataFrame support seems to be a simple wrapper on top of Spark, and hence in most cases slower than Spark alone. Is my understanding correct?
Also, following the code example, when the cache is created in Ignite it automatically gets a name like "SQL_PUBLIC_name_of_table_in_spark", so I couldn't change any cache configuration in XML (because I need to specify the cache name in XML/code to configure it, and Ignite complains that it already exists). Is this expected?
Thanks
First of all, it doesn't seem that your test is fair. In the first case you prefetch the Parquet data and cache it locally in Spark, and only then execute the query. In the Ignite DF case you don't use caching, so data is fetched during query execution. Typically you will not be able to cache all your data, so performance with Parquet will go down significantly once some of the data has to be fetched during execution.
However, with Ignite you can use indexing to improve the performance. For this particular case, you should create an index on the x field to avoid scanning all the data every time the query is executed. Here is the information on how to create an index: https://apacheignite-sql.readme.io/docs/create-index
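For reference, a rough sketch of creating such an index from Scala via Ignite's SQL DDL; the cache name follows the SQL_PUBLIC_ convention mentioned in the question, the column x and index name are placeholders, and the same statement can equally be run from any JDBC/SQL client:
import org.apache.ignite.Ignition
import org.apache.ignite.cache.query.SqlFieldsQuery

// Create a secondary index on the filtered column so Ignite can avoid a
// full scan; CONFIG is the same Ignite configuration file used above.
val ignite = Ignition.start(CONFIG)
val cache  = ignite.cache[Any, Any]("SQL_PUBLIC_PERSON")   // cache auto-created for table "person"
cache.query(new SqlFieldsQuery("CREATE INDEX IF NOT EXISTS person_x_idx ON person (x)")).getAll()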

Spark-Cassandra: how to efficiently restrict partitions

After days of thinking about it I'm still stuck with this problem: I have one table where "timestamp" is the partition key. This table contains billions of rows.
I also have "timeseries" tables that contain timestamps related to specific measurement processes.
With Spark I want to analyze the content of the big table. Of course it is not efficient to do a full table scan, and with a rather fast lookup in the timeseries table I should be able to target only, say, 10k partitions.
What is the most efficient way to achieve this?
Is Spark SQL smart enough to optimize something like this?
sqlContext.sql("""
SELECT timeseries.timestamp, bigtable.value1 FROM timeseries
JOIN bigtable ON bigtable.timestamp = timeseries.timestamp
WHERE timeseries.parameter = 'xyz'
""")
Ideally I would expect Cassandra to fetch the timestamps from the timeseries table and then use that to query only that subset of partitions from bigtable.
If you add an "Explain" call to your query you'll see what the Catalyst planner will do for your query but I know it will not do the optimizations you want.
Currently Catalyst has no support for pushing down joins to DataSources which means the structure of your query is most likely got to look like.
Read Data From Table timeseries with predicate parameter = 'xyz'
Read Data From Table bigtable
Join these two results
Filter on bigtable.timestamp == timeseries.timestamp
The Spark Cassandra Connector will be given the predicate from the timeseries table read and will be able to optimize it if is a clustering key or a partition key. See the Spark Cassandra Connector Docs. If it doesn't fit into one of those pushdown categories it will require a Full Table Scan followed by a filter in Spark.
Since the Read Data From Table bigtable has no restrictions on it, Spark will instruct the Connector to read the entire table (Full Table Scan).
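As a workaround at the RDD level, here is a minimal sketch using joinWithCassandraTable, which does restrict the bigtable read to only the looked-up partitions. The keyspace name ks is assumed, the pushdown of the parameter predicate assumes it is a clustering or indexed column, and getLong assumes the timestamp is stored as a bigint (adjust the getter to the actual type):
import com.datastax.spark.connector._

// 1. Pull only the relevant timestamps from the (small) timeseries table.
val timestamps = sc.cassandraTable("ks", "timeseries")
  .where("parameter = ?", "xyz")
  .map(row => Tuple1(row.getLong("timestamp")))

// 2. Join those keys against bigtable: one query per partition key,
//    instead of a full table scan.
val subset = timestamps.joinWithCassandraTable("ks", "bigtable")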
I can only guess at the optimizations done by the driver, but I'd certainly expect a query such as that one to restrict the JOIN using the WHERE clause, which means your simple query would be optimized.
I will also point you in the general direction of optimizing Spark SQL: have a look at Catalyst for Spark SQL, which optimizes queries all the way down to the physical level.
Here is a breakdown of how it works:
Deep Dive into Spark SQL Catalyst Optimizer
And the link to the git-repo: Catalyst repo

How to parallelize RDD work when using the Cassandra Spark connector for data aggregation?

Here is the sample scenario: we have real-time data records in Cassandra, and we want to aggregate the data over different time ranges. I wrote code like the below:
val timeRanges = getTimeRanges(report)
timeRanges.foreach { timeRange =>
  val (timestampStart, timestampEnd) = timeRange
  val query = _sc.get.cassandraTable(report.keyspace, utilities.Helper.makeStringValid(report.scope))
    .where("TIMESTAMP > ?", timestampStart)
    .where("VALID_TIMESTAMP <= ?", timestampEnd)
  // ...do the aggregation work...
}
The issue with the code is that, for each time range, the aggregation work does not run in parallel. My question is: how can I parallelize the aggregation work, given that an RDD can't be used inside another RDD or a Future? Is there any way to parallelize the work, or can't we use the Spark connector here?
Use the joinWithCassandraTable function. This allows you to use the data from one RDD to access C* and pull records just like in your example.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#performing-efficient-joins-with-cassandra-tables-since-12
joinWithCassandraTable utilizes the Java driver to execute a single
query for every partition required by the source RDD, so no unneeded
data will be requested or serialized. This means a join between any
RDD and a Cassandra table can be performed without doing a full table
scan. When performed between two Cassandra tables which share the
same partition key, this will not require movement of data between
machines. In all cases this method will use the source RDD's
partitioning and placement for data locality.
Finally, we used union to combine the per-range RDDs and make the work run in parallel.
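A rough sketch of that union approach, reusing the names from the question and assuming the where clauses can be pushed down (e.g. TIMESTAMP / VALID_TIMESTAMP are clustering columns); the count shown is only a placeholder for the real aggregation:
import com.datastax.spark.connector._

// Build one RDD per time range (each with a server-side pushed-down where
// clause), then union them so Spark schedules all ranges in a single job.
val perRange = timeRanges.map { case (timestampStart, timestampEnd) =>
  _sc.get.cassandraTable(report.keyspace, utilities.Helper.makeStringValid(report.scope))
    .where("TIMESTAMP > ?", timestampStart)
    .where("VALID_TIMESTAMP <= ?", timestampEnd)
    .map(row => ((timestampStart, timestampEnd), 1L))     // placeholder aggregation
}
val combined = _sc.get.union(perRange.toSeq)
val counts   = combined.reduceByKey(_ + _)                // one job for all ranges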

Spark Cassandra Iterative Query

I am applying the following through the Spark Cassandra Connector:
val links = sc.textFile("linksIDs.txt")
links.map { link_id =>
  val link_speed_records = sc.cassandraTable[Double]("freeway", "records")
    .select("speed")
    .where("link_id = ?", link_id)
  val average = link_speed_records.mean().toDouble
  average
}
I would like to ask whether there is a way to apply the above sequence of queries more efficiently, given that the only parameter I ever change is 'link_id'.
The 'link_id' value is the only Partition Key of my Cassandra 'records' table.
I am using Cassandra v.2.0.13, Spark v.1.2.1 and Spark-Cassandra Connector v.1.2.1
I was wondering whether it is possible to open a Cassandra session in order to apply those queries and still get 'link_speed_records' back as a Spark RDD.
Use the joinWithCassandraTable method to use an RDD of keys to pull data out of a Cassandra table. The approach given in the question will be extremely expensive by comparison and will also not function well as a parallelizable request.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#performing-efficient-joins-with-cassandra-tables-since-12
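A minimal sketch of that approach for this example, assuming link_id is an Int partition key and speed is a double column in freeway.records:
import com.datastax.spark.connector._
import org.apache.spark.SparkContext._   // pair-RDD implicits on older Spark versions

// Turn the id file into an RDD of keys, join them against the table
// (one query per partition key, no full table scan), then average per link.
val linkIds = sc.textFile("linksIDs.txt").map(id => Tuple1(id.trim.toInt))

val speeds = linkIds
  .joinWithCassandraTable("freeway", "records")
  .map { case (Tuple1(linkId), row) => (linkId, row.getDouble("speed")) }

val averages = speeds
  .mapValues(v => (v, 1L))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
  .mapValues { case (sum, n) => sum / n }                 // mean speed per link_id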
