Efficient Filtering on a huge data frame in Spark - apache-spark

I have a Cassandra table with 500 million rows. I would like to filter based on a field which is a partition key in Cassandra using spark.
Can you suggest the best possible/efficient approach to filter in Spark/Spark SQL based on the list keys which is also a pretty large.
Basically i need only those rows from the Cassandra table which are present in the list of keys.
We are using DSE and its features.
The approach i am using is taking lot of time roughly around an hour.

Have you checked repartitionByCassandraReplica and joinWithCassandraTable ?
https://github.com/datastax/spark-cassandra-connector/blob/75719dfe0e175b3e0bb1c06127ad4e6930c73ece/doc/2_loading.md#performing-efficient-joins-with-cassandra-tables-since-12
joinWithCassandraTable utilizes the java drive to execute a single
query for every partition required by the source RDD so no un-needed
data will be requested or serialized. This means a join between any
RDD and a Cassandra Table can be performed without doing a full table
scan. When performed between two Cassandra Tables which share the same
partition key this will not require movement of data between machines.
In all cases this method will use the source RDD's partitioning and
placement for data locality.
The method repartitionByCassandraReplica can be used to relocate data
in an RDD to match the replication strategy of a given table and
keyspace. The method will look for partition key information in the
given RDD and then use those values to determine which nodes in the
Cluster would be responsible for that data.

Related

Spark 2.4.6 + JDBC Reader: When predicate pushdown set to false, is data read in parallel by spark from the engine?

I am trying to extract data from a big table in SAP HANA, which is around 1.5tb in size, and the best way is to run in parallel across nodes and threads. Spark JDBC is the perfect candidate for the task, but in order to actually extract in parallel it requires partition column, lower/upper bound and number of partitions option to be set. To make the operation of the extraction easier, I considered adding an added partition column which would be the row_number() function and use MIN(), MAX() as lower/upper bounds respectively. And then the operations team just would be required to provide the number of partitions to have.
The problem is that HANA runs out of memory and it is very likely that row_number() is too costly on the engine. I can only imagine that over 100 threads run the same query during every fetch to apply the where filters and retrieve the corresponding chunk.
So my question is, if I disable the predicate pushdown option, how does spark behave? is it only read by one executor and then the filters are applied on spark side? Or does it do some magic to split the fetching part from the DB?
What could you suggest for extracting such a big table using the available JDBC reader?
Thanks in advance.
Before executing your primary query from Spark, run pre-ingestion query to fetch the size of the Dataset being loaded, i.e. as you have mentioned Min(), Max() etc.
Expecting that the data is uniformly distributed between Min and Max keys, you can partition across executors in Spark by providing Min/Max/Number of Executors.
You don't need(want) to change your primary datasource by adding additional columns to support data ingestion in this case.

Performance consideration when reading from hive view Vs hive table via DataFrames

We have a view that unions multiple hive tables. If i use spark SQL in pyspark and read that view will there be any performance issue as against reading directly from the table.
In hive we had something called full table scan if we don't limit the where clause to an exact table partition. Is spark intelligent enough to directly read the table that has the data that we are looking for rather than searching through the entire view ?
Please advise.
You are talking about partition pruning.
Yes spark supports it spark automatically omits large data read when partition filters are specified.
Partition pruning is possible when data within a table is split across multiple logical partitions. Each partition corresponds to a particular value of a partition column and is stored as a subdirectory within the table root directory on HDFS. Where applicable, only the required partitions (subdirectories) of a table are queried, thereby avoiding unnecessary I/O
After partitioning the data, subsequent queries can omit large amounts of I/O when the partition column is referenced in predicates. For example, the following query automatically locates and loads the file under peoplePartitioned/age=20/and omits all others:
val peoplePartitioned = spark.read.format("orc").load("peoplePartitioned")
peoplePartitioned.createOrReplaceTempView("peoplePartitioned")
spark.sql("SELECT * FROM peoplePartitioned WHERE age = 20")
more detailed info is provided here
You can also see this in the logical plan if you run an explain(True) on your query:
spark.sql("SELECT * FROM peoplePartitioned WHERE age = 20").explain(True)
it will show which partitions are read by spark

Looking up about 40k records out 150 million records in Cassandra in every job run?

I am building a near real time/ microbatch data application with Cassandra as the lookup store. Each incremental run has ~40K records, while the Cassandra table has about 150 million records. In each run, I need to lookup the id field and get some attributes from Cassandra. These lookups can be random (not any time/ region/ country dependency), so there is no clear partitioning scheme.
How should I try to partition the Cassandra table to ensure decent/ good performance (for microbatches running every 15-30 mins)?
Apart from partitioning, any other tips?
joinWithCassandraTable and leftJoinWithCassandraTable functions were specifically designed for efficient data lookup in Cassandra from Spark jobs. It performs fetching of data by primary or partition key, and because it's executed by multiple executors in parallel, it could be fast (although ~40K could still take time, but it depends on size of your Cassandra and Spark clusters). See the SCC's documentation for detailed information how to use it - but remember, that these functions are available only in RDD API. The DataStax's version of connector has support for so-called "DirectJoin" - efficient joins with Cassandra in the DataFrame API.
Regarding partitioning - it depends on how do you perform lookup - you have 1 record in Cassandra matching one record in Spark? If yes, then just use this ID as primary key (it's equal to partition key in this case).

Cassandra Batch statement-Multiple tables

I want to use batch statement to delete a row from 3 tables in my database to ensure atomicity. The partition key is going to be the same in all the 3 tables. In all the examples that I read about batch statements, all the queries were for a single table? In my case, is it a good idea to use batch statements? Or, should I avoid it?
I'm using Cassandra-3.11.2 and I execute my queries using the C++ driver.
Yes, you can use batch to ensure atomicity. Single partition batches are faster (same table and same partition key) but only for a limited number of partitions (in your case three) it is okay. But don't use it for performance optimization (Ex: reduce of multiple requests). If you need atomicity you can use it.
You can check below links:
Cassandra batch query performance on tables having different partition keys
Cassandra batch query vs single insert performance
How single parition batch in cassandra function for multiple column update?
EDITED
In my case, the tables are different but the partition key is the same in all 3 tables. So is this a special case of single partition batch or is it something entirely different.
For different tables partitions are also different. So this is a multi partition batch. LOGGED batches are used to ensure atomicity for different partitions (different tables or different partition keys). UNLOGGED batches are used to ensure atomicity and isolation for single partition batch. If you use UNLOGGED batch for multi partition batch atomicity will not be ensured. Default is LOGGED batch. For single partition batch default is UNLOGGED. Cause single partition batch is considered as single row mutation. For single row update, there is no need of using LOGGED batch. To know about LOGGED or UNLOGGED batch, I have shared a link below.
Multi partition batches should only be used to achieve atomicity for a few writes on different tables. Apart from this they should be avoided because they’re too expensive.
Single partition batches can be used to achieve atomicity and isolation. They’re not much more expensive than normal writes.
But you can use multi partition LOGGED batch as partitions are limited.
A very useful Doc in Batch and all the details are provided. If you read this, all the confusions will be cleared.
Cassandra - to BATCH or not to BATCH
Partition Key tokens vs row partition
Table partitions and partition key tokens are different. Partition key is used to decide which node the data resides. For same row key partition tokens are same thus resides in the same node. For different partition key or same key different tables they are different row mutation. You cannot get data with one query for different partition keys or from different tables even if for the same key. Coordinator nodes have to treat it as different request or mutation and request the actual data from replicated nodes separately. It's the internal structure of how C* stores data.
Every table even has it's own directory structure making it clear that a partition from one table will never interact with the partition of another.
Does the same partition key in different cassandra tables add up to cell theoretical limit?
To know details how C* maps data check this link:
Understanding How CQL3 Maps to Cassandra's Internal Data Structure
Yes, this is a good use-case for BATCH according to the Cassandra documentation.
See the "Note:" on https://docs.datastax.com/en/dse/6.0/cql/cql/cql_using/useBatchGoodExample.html
If there are two different tables in the same keyspace and the two tables have the same partition key, this scenario is considered a single partition batch. There will be a single mutation for each table. This happens because the two tables could have different columns, even though the keyspace and partition are the same. Batches allow a caller to bundle multiple operations into a single batch request. All the operations are performed by the same coordinator. The best use of a batch request is for a single partition in multiple tables in the same keyspace. Also, batches provide a guarantee that mutations will be applied in a particular order.
Specifically, if they have the same partition key, this will be considered a single-partition batch. Hence: "The best use of a batch request is for a single partition in multiple tables in the same keyspace."

Spark SQL and Cassandra JOIN

My Cassandra schema contains a table with a partition key which is a timestamp, and a parameter column which is a clustering key.
Each partition contains 10k+ rows. This is logging data at a rate of 1 partition per second.
On the other hand, users can define "datasets" and I have another table which contains, as a partition key the "dataset name" and a clustering column which is a timestamp referring to the other table (so a "dataset" is a list of partition keys).
Of course what I would like to do looks like an anti-pattern for Cassandra as I'd like to join two tables.
However using Spark SQL I can run such a query and perform the JOIN.
SELECT * from datasets JOIN data
WHERE data.timestamp = datasets.timestamp AND datasets.name = 'my_dataset'
Now the question is: is Spark SQL smart enough to read only the partitions of data which correspond to the timestamps defined in datasets?
Edit: fix the answer with regard to join optimization
is Spark SQL smart enough to read only the partitions of data which correspond to the timestamps defined in datasets?
No. In fact, since you provide the partition key for the datasets table, the Spark/Cassandra connector will perform predicate push down and execute the partition restriction directly in Cassandra with CQL. But there will be no predicate push down for the join operation itself unless you use the RDD API with joinWithCassandraTable()
See here for all possible predicate push down situations: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/BasicCassandraPredicatePushDown.scala

Resources