Delete rows from a Cassandra table using pyspark or a CQL query - apache-spark

I have a table with lots of columns, for example test_event, and I also have another table, test, in the same keyspace that contains the ids of the rows I have to delete from test_event.
I tried deleteFromCassandra, but it doesn't work because Spark cannot see the SparkContext.
I found some solutions that use DELETE FROM, but they were written in Scala.
After about a hundred attempts I am thoroughly confused, so I'm asking for your help. Can somebody walk me through it step by step?

Take a look at this code:
from pyspark.sql import SQLContext
from cassandra.cluster import Cluster  # DataStax Python driver, used here to execute the CQL deletes

def main_function():
    sql = SQLContext(sc)
    session = Cluster(["<cassandra host>"]).connect("your keyspace")
    tests = sql.read.format("org.apache.spark.sql.cassandra")\
        .load(keyspace="your keyspace", table="test").where(...)
    for test in tests.collect():
        delete_sql = "delete from test_event where id = " + str(test["id"])
        session.execute(delete_sql)
Be aware that deleting one row at a time is not a best practice in Spark, but the above code is just an example to help you figure out your implementation. Note that SQLContext cannot execute a CQL DELETE itself, so the snippet assumes the DataStax cassandra-driver package is available for the actual delete statements; "<cassandra host>" is a placeholder.

Spark Cassandra Connector (SCC) itself provides only the Dataframe API for Python. But there is a pyspark-cassandra package that provides an RDD API on top of the SCC, so deletion can be performed as follows.
Start the pyspark shell (I've tried this with Spark 2.4.3) with:
bin/pyspark --conf spark.cassandra.connection.host=IPs \
  --packages anguenot:pyspark-cassandra:2.4.0
and inside it, read the data from one table and perform the delete. The source data needs to contain the columns corresponding to the primary key. This could be the full primary key, a partial primary key, or only the partition key - depending on which, Cassandra will use the corresponding tombstone type (row/range/partition tombstone).
In my example, the table has a primary key consisting of a single column - that's why I specified only one element in the array:
rdd = sc.cassandraTable("test", "m1")
rdd.deleteFromCassandra("test","m1", keyColumns = ["id"])
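For illustration, if the table instead had a composite primary key - say a partition key id plus a clustering column (hypothetical layout and table name below) - listing only the partition key column would make Cassandra write partition tombstones, as described above:
rdd = sc.cassandraTable("test", "m1_by_id")
rdd.deleteFromCassandra("test", "m1_by_id", keyColumns = ["id"])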

Related

What is the difference between saveAsTable and save in Spark

I am using Pyspark and want to insert-overwrite partitions into an existing Hive table.
In this use case saveAsTable() is not suitable; it overwrites the whole existing table.
insertInto() is behaving strangely: I have 3 partition levels, but it is only inserting into one.
And what is the right way to use save()?
Can save() take options like the database name and table name to insert into, or only an HDFS path?
Example:
df\
    .write\
    .format('orc')\
    .mode('overwrite')\
    .option('database', db_name)\
    .option('table', table_name)\
    .save()
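For reference, a minimal pyspark sketch of one common way to overwrite only the matching partitions is dynamic partition overwrite (Spark 2.3+) combined with insertInto(); db_name and table_name are placeholders, and insertInto() matches columns by position, not by name:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df\
    .write\
    .mode('overwrite')\
    .insertInto(db_name + '.' + table_name)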

Cassandra full table scan using Spark Performance

I have a requirement to scan a table which contains 100 million records in production. The search will be made on the first clustering key. The requirement is to find the unique partition keys where the first clustering key matches a condition. The table looks like the following -
employeeid, companyname , lastdateloggedin, floorvisted, swipetimestamp
Partition Key - employeeid
Clustering Key - companyname , lastdateloggedin
I would like to get select distinct(employeeid), companyname, swipetimestamp where companyname = 'XYZ'. This is an SQL representation of what I would like to fetch from the table.
SparkConf conf = new SparkConf().set("spark.cassandra.connection.enabled", "true")
        .set("spark.cassandra.auth.username", "XXXXXXXXXX")
        .set("spark.cassandra.auth.password", "XXXXXXXXX")
        .set("spark.cassandra.connection.host", "hostname")
        .set("spark.cassandra.connection.port", "29042")
        .set("spark.cassandra.connection.factory", ConnectionFactory.class)
        .set("spark.cassandra.connection.cluster_name", "ZZZZ")
        .set("spark.cassandra.connection.application_name", "ABC")
        .set("spark.cassandra.connection.local_dc", "DC1")
        .set("spark.cassandra.connection.cachedClusterFile", "/tmp/xyz/test.json")
        .set("spark.cassandra.connection.ssl.enabled", "true")
        .set("spark.cassandra.input.fetch.size_in_rows", "10000")
        .set("spark.driver.allowMultipleContexts", "true")
        .set("spark.cassandra.connection.ssl.trustStore.path", "sampleabc-spark-util/src/main/resources/x.jks")
        .set("spark.cassandra.connection.ssl.trustStore.password", "cassandrasam");
CassandraJavaRDD<CassandraRow> ctable = javaFunctions(jsc).cassandraTable("keyspacename", "employeedetails")
        .select("employeeid", "companyname", "swipetimestamp").where("companyname= ?", "XYZ");
List<CassandraRow> cassandraRows = ctable.distinct().collect();
This code ran in non-production with close to 5 million records. Since this is production, I would like to approach the query with caution. Questions -
What config should be present in my SparkConf?
Will the Spark job ever bring down the DB because of the large table?
Might running that job starve Cassandra of threads at that moment?
I would recommend using the Dataframe API instead of the RDDs - theoretically, SCC may do more optimizations for that API. If you have a condition on the first clustering column, then this condition should be pushed down by SCC to Cassandra and the filtering will happen there. You can check that by calling .explain on the dataframe and verifying that you have rules marked with * in the PushedFilters part.
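For example, a pyspark sketch of that check (keyspace, table, and column names taken from the question; the companyname condition should then show up under PushedFilters in the plan):
df = (spark.read.format("org.apache.spark.sql.cassandra")
      .options(keyspace="keyspacename", table="employeedetails")
      .load()
      .where("companyname = 'XYZ'")
      .select("employeeid", "companyname", "swipetimestamp")
      .distinct())
df.explain()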
Regarding config - use the default value of spark.cassandra.input.fetch.size_in_rows - if the value is too high, you have a higher chance of getting timeouts. You can still bring down nodes even with the default value, as SCC reads with LOCAL_ONE, and that can overload single nodes. Sometimes reading with LOCAL_QUORUM is faster because it doesn't overload individual nodes as much and doesn't restart the tasks that are reading data.
And I recommend making sure that you're using the latest Spark Cassandra Connector (2.5.0) - it has a lot of new optimizations and new functionality.

Does spark saveAsTable really create a table?

This may be a dumb question due to my lack of some fundamental knowledge of Spark. I tried this:
SparkSession spark = SparkSession.builder().appName("spark ...").master("local").enableHiveSupport().getOrCreate();
Dataset<Row> df = spark.range(10).toDF();
df.write().saveAsTable("foo");
This creates a table under the 'default' database in Hive, and of course, I can fetch data from the table anytime I want.
I updated the above code to get rid of "enableHiveSupport":
SparkSession spark = SparkSession.builder().appName("spark ...").master("local").getOrCreate();
Dataset<Row> df = spark.range(10).toDF();
df.write().saveAsTable("bar");
The code runs fine, without any error, but when I try "select * from bar", spark says,
Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'bar' not found in database 'default';
So I have 2 questions here,
1) Is it possible to create a 'raw' Spark table, not a Hive table? I know Hive maintains the metadata in a database like MySQL; does Spark also have a similar mechanism?
2) In the 2nd code snippet, what does spark actually create when calling saveAsTable?
Many thanks.
Check answers below:
If you want to create a raw table only in Spark, createOrReplaceTempView could help you. For the second part, check the next answer.
By default, if you call saveAsTable on your dataframe, it will persist the table into the Hive metastore, provided you use enableHiveSupport. And if you don't enableHiveSupport, the table will be managed by Spark and its data will be stored under the spark-warehouse location. You will lose these tables after restarting the Spark session.
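A minimal pyspark sketch of the session-scoped alternative mentioned above (the view name is illustrative):
df = spark.range(10)
df.createOrReplaceTempView("bar_view")      # registered only for this SparkSession
spark.sql("select * from bar_view").show()  # works here, but the view disappears when the session ends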

How to use accumulators with spark 2.3.1 api

I am using spark-sql_2.11-2.3.1 version with Cassandra 3.x.
I need to provide a validation feature which has the following columns:
column_family_name text,
oracle_count bigint,
cassandra_count bigint,
create_timestamp timestamp,
last_update_timestamp timestamp,
update_user text
For this I need to count the successfully inserted records, i.e. populate cassandra_count, and for that I want to make use of a Spark accumulator. But unfortunately I am not able to find the required API samples for the spark-sql_2.11-2.3.1 version.
Below is my save-to-Cassandra snippet:
o_model_df.write.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> columnFamilyName, "keyspace" -> keyspace ))
.mode(SaveMode.Append)
.save()
How do I implement an accumulator increment for each row successfully saved into Cassandra?
Any help would be highly appreciated.
Spark's accumulators are usually used in the transformations and actions that you write yourself; don't expect the Spark Cassandra Connector to provide something like this for you.
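For example, a pyspark sketch of counting rows yourself with an accumulator (here in a separate foreach action; the dataframe name mirrors the question's o_model_df):
row_counter = spark.sparkContext.accumulator(0)

def count_row(row):
    row_counter.add(1)          # runs on the executors

o_model_df.foreach(count_row)   # extra pass over the data, before or after the write
print(row_counter.value)        # driver-side total, e.g. for cassandra_count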
But overall - if your job finished without an error, then it means that the data was written correctly into the database.
If you want to check how many rows are really in the database, then you need to count the data in the database - you can use the cassandraCount method of the Spark Cassandra Connector. The main reason for that is that your DataFrame may contain multiple rows that map onto a single Cassandra row (for example, if you defined the primary key incorrectly, so that multiple rows share it).
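From pyspark, a sketch of the same count-back check using the DataFrame API instead of the RDD cassandraCount call (keyspace and columnFamilyName mirror the snippet above):
written_count = (spark.read.format("org.apache.spark.sql.cassandra")
                 .options(table=columnFamilyName, keyspace=keyspace)
                 .load()
                 .count())
# compare written_count with o_model_df.count() to populate cassandra_count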

How to list partition-pruned inputs for Hive tables?

I am using Spark SQL to query data in Hive. The data is partitioned and Spark SQL correctly prunes the partitions when querying.
However, I need to list either the source tables along with partition filters or the specific input files (.inputFiles would be an obvious choice for this but it does not reflect pruning) for a given query in order to determine on which part of the data the computation will be taking place.
The closest I was able to get was by calling df.queryExecution.executedPlan.collectLeaves(). This contains the relevant plan nodes as HiveTableScanExec instances. However, this class is private[hive] for the org.apache.spark.sql.hive package. I think the relevant fields are relation and partitionPruningPred.
Is there any way to achieve this?
Update: I was able to get the relevant information thanks to Jacek's suggestion and by using getHiveQlPartitions on the returned relation and providing partitionPruningPred as the parameter:
scan.findHiveTables(execPlan).flatMap(e => e.relation.getHiveQlPartitions(e.partitionPruningPred))
This contained all the data I needed, including the paths to all input files, properly partition pruned.
Well, you're asking for low-level details of the query execution and things are bumpy down there. You've been warned :)
As you noted in your comment, all the execution information is in this private[hive] HiveTableScanExec.
One way to get some insight into HiveTableScanExec physical operator (that is a Hive table at execution time) is to create a sort of backdoor in org.apache.spark.sql.hive package that is not private[hive].
package org.apache.spark.sql.hive
import org.apache.spark.sql.hive.execution.HiveTableScanExec
object scan {
  def findHiveTables(execPlan: org.apache.spark.sql.execution.SparkPlan) = execPlan.collect { case hiveTables: HiveTableScanExec => hiveTables }
}
Change the code to meet your needs.
With the scan.findHiveTables, I usually use :paste -raw while in spark-shell to sneak into such "uncharted areas".
You could then simply do the following:
scala> spark.version
res0: String = 2.4.0-SNAPSHOT
// Create a Hive table
import org.apache.spark.sql.types.StructType
spark.catalog.createTable(
  tableName = "h1",
  source = "hive", // <-- that makes for a Hive table
  schema = new StructType().add($"id".long),
  options = Map.empty[String, String])
// select * from h1
val q = spark.table("h1")
val execPlan = q.queryExecution.executedPlan
scala> println(execPlan.numberedTreeString)
00 HiveTableScan [id#22L], HiveTableRelation `default`.`h1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#22L]
// Use the above code and :paste -raw in spark-shell
import org.apache.spark.sql.hive.scan
scala> scan.findHiveTables(execPlan).size
res11: Int = 1
The relation field is the Hive table after it has been resolved by the ResolveRelations and FindDataSourceTable logical rules that the Spark analyzer uses to resolve data sources and Hive tables.
You can get pretty much all the information Spark uses from a Hive metastore through the ExternalCatalog interface that is available as spark.sharedState.externalCatalog. That gives you pretty much all the metadata Spark uses to plan queries over Hive tables.
