Spark SQL cassandra delete records - cassandra

Is there a way to delete some records based on a select query?
I have this query,
Select min(id) from ID having count(*)>1 which will show the duplicates. I need to get those ids and delete them. How can I do it in spark sql?

Spark SQL does not support DELETE.
If the number of ids to delete is small, you can do it using the Cassandra driver instead of through Spark:
import scala.collection.JavaConverters._
import scala.collection.JavaConversions._
import com.datastax.driver.core.{Cluster, Session, BatchStatement}
import com.datastax.driver.core.querybuilder.QueryBuilder
val cluster = Cluster.builder().addContactPoint(host_ip).build()
val session = cluster.connect(keyspace)
val idsToDelete = ... // perform your query and collect the ids
val queries = idsToDelete.map({ id => QueryBuilder.delete().from(keyspace, table).where(QueryBuilder.eq("id", id)) })
val batch = batchStatement().addAll(queries.asJava)
session.execute(batch)
cluster.close

Related

Spark sql querying a Hive table from workers

I am trying to querying a Hive table from a map operation in Spark, but when it run a query the execution getting frozen.
This is my test code
val sc = new SparkContext(conf)
val datasetPath = "npiCodesMin.csv"
val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()
val df = sparkSession.read.option("header", true).option("sep", ",").csv(datasetPath)
df.createOrReplaceTempView("npicodesTmp")
sparkSession.sql("DROP TABLE IF EXISTS npicodes");
sparkSession.sql("CREATE TABLE npicodes AS SELECT * FROM npicodesTmp");
val res = sparkSession.sql("SELECT * FROM npicodes WHERE NPI = '1588667638'") //This works
println(res.first())
val NPIs = sc.parallelize(List("1679576722", "1588667638", "1306849450", "1932102084"))//Some existing NPIs
val rows = NPIs.mapPartitions{ partition =>
val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()
partition.map{code =>
val res = sparkSession.sql("SELECT * FROM npicodes WHERE NPI = '"+code+"'")//The program stops here
res.first()
}
}
rows.collect().foreach(println)
It loads the data from a CSV, creates a new Hive table and fills it with the CSV data.
Then, if I query the table from the master it works perfectly, but if I try to do that in a map operation the execution getting frozen.
It do not generate any error, it continue running without do anything.
The Spark UI shows this situation
Actually, I am not sure if I can query a table in a distributed way, I cannot find it in the documentation.
Any suggestion?
Thanks.

HiveContext in Spark Version 2

I am working on a spark program that inserts dataframe into Hive Table as below.
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql._
val hiveCont = val hiveCont = new org.apache.spark.sql.hive.HiveContext(sc)
val partfile = sc.textFile("partfile")
val partdata = partfile.map(p => p.split(","))
case class partc(id:Int, name:String, salary:Int, dept:String, location:String)
val partRDD = partdata.map(p => partc(p(0).toInt, p(1), p(2).toInt, p(3), p(4)))
val partDF = partRDD.toDF()
partDF.registerTempTable("party")
hiveCont.sql("insert into parttab select id, name, salary, dept from party")
I know that Spark V2 has come out and we can use SparkSession object in it.
Can we use SparkSession object to directly insert the dataframe into Hive table or do we have to use the HiveContext in version 2 also ? Can anyone let me know what is the major difference in version with respect to HiveContext ?
You can use your SparkSession (normally called spark or ss) directly to fire a sql query (make sure hive-support is enabled when creating the spark-session):
spark.sql("insert into parttab select id, name, salary, dept from party")
But I would suggest this notation, you don't need to create a temp-table etc:
partDF
.select("id","name","salary","dept")
.write.mode("overwrite")
.insertInto("parttab")

How to avoid multiple jobs spawning scenio in spark sql

I have an application wherein I should query hive which would return me a set of records(hardly 50). For each record returned I have to fire a query on hive and get the relavant dataframe. This is how it would look like:
val employeeIds = hiveContext.sql("select id from employee")
val vertices = employeeIds.foreach(row => {
val employeeId = row.getInt(0)
val query = s""" select * from department where employeeId = $employeeId"""
//.... I would have to create a hive context here ....
})
But if I do so there will be new contexts spawning up from the executors, any pointers to eliminate this approach would be very helpful.
Note:
I have masked the information to post of stackoverflow. I have to fire a query based on the records of first query. I cannot join employee and department table.

Overwrite specific partitions in spark dataframe write method

I want to overwrite specific partitions instead of all in spark. I am trying the following command:
df.write.orc('maprfs:///hdfs-base-path','overwrite',partitionBy='col4')
where df is dataframe having the incremental data to be overwritten.
hdfs-base-path contains the master data.
When I try the above command, it deletes all the partitions, and inserts those present in df at the hdfs path.
What my requirement is to overwrite only those partitions present in df at the specified hdfs path. Can someone please help me in this?
Finally! This is now a feature in Spark 2.3.0:
SPARK-20236
To use it, you need to set the spark.sql.sources.partitionOverwriteMode setting to dynamic, the dataset needs to be partitioned, and the write mode overwrite. Example:
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
data.write.mode("overwrite").insertInto("partitioned_table")
I recommend doing a repartition based on your partition column before writing, so you won't end up with 400 files per folder.
Before Spark 2.3.0, the best solution would be to launch SQL statements to delete those partitions and then write them with mode append.
This is a common problem. The only solution with Spark up to 2.0 is to write directly into the partition directory, e.g.,
df.write.mode(SaveMode.Overwrite).save("/root/path/to/data/partition_col=value")
If you are using Spark prior to 2.0, you'll need to stop Spark from emitting metadata files (because they will break automatic partition discovery) using:
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
If you are using Spark prior to 1.6.2, you will also need to delete the _SUCCESS file in /root/path/to/data/partition_col=value or its presence will break automatic partition discovery. (I strongly recommend using 1.6.2 or later.)
You can get a few more details about how to manage large partitioned tables from my Spark Summit talk on Bulletproof Jobs.
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
data.toDF().write.mode("overwrite").format("parquet").partitionBy("date", "name").save("s3://path/to/somewhere")
This works for me on AWS Glue ETL jobs (Glue 1.0 - Spark 2.4 - Python 2)
Adding 'overwrite=True' parameter in the insertInto statement solves this:
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
df.write.mode("overwrite").insertInto("database_name.partioned_table", overwrite=True)
By default overwrite=False. Changing it to True allows us to overwrite specific partitions contained in df and in the partioned_table. This helps us avoid overwriting the entire contents of the partioned_table with df.
Using Spark 1.6...
The HiveContext can simplify this process greatly. The key is that you must create the table in Hive first using a CREATE EXTERNAL TABLE statement with partitioning defined. For example:
# Hive SQL
CREATE EXTERNAL TABLE test
(name STRING)
PARTITIONED BY
(age INT)
STORED AS PARQUET
LOCATION 'hdfs:///tmp/tables/test'
From here, let's say you have a Dataframe with new records in it for a specific partition (or multiple partitions). You can use a HiveContext SQL statement to perform an INSERT OVERWRITE using this Dataframe, which will overwrite the table for only the partitions contained in the Dataframe:
# PySpark
hiveContext = HiveContext(sc)
update_dataframe.registerTempTable('update_dataframe')
hiveContext.sql("""INSERT OVERWRITE TABLE test PARTITION (age)
SELECT name, age
FROM update_dataframe""")
Note: update_dataframe in this example has a schema that matches that of the target test table.
One easy mistake to make with this approach is to skip the CREATE EXTERNAL TABLE step in Hive and just make the table using the Dataframe API's write methods. For Parquet-based tables in particular, the table will not be defined appropriately to support Hive's INSERT OVERWRITE... PARTITION function.
Hope this helps.
Tested this on Spark 2.3.1 with Scala.
Most of the answers above are writing to a Hive table. However, I wanted to write directly to disk, which has an external hive table on top of this folder.
First the required configuration
val sparkSession: SparkSession = SparkSession
.builder
.enableHiveSupport()
.config("spark.sql.sources.partitionOverwriteMode", "dynamic") // Required for overwriting ONLY the required partitioned folders, and not the entire root folder
.appName("spark_write_to_dynamic_partition_folders")
Usage here:
DataFrame
.write
.format("<required file format>")
.partitionBy("<partitioned column name>")
.mode(SaveMode.Overwrite) // This is required.
.save(s"<path_to_root_folder>")
I tried below approach to overwrite particular partition in HIVE table.
### load Data and check records
raw_df = spark.table("test.original")
raw_df.count()
lets say this table is partitioned based on column : **c_birth_year** and we would like to update the partition for year less than 1925
### Check data in few partitions.
sample = raw_df.filter(col("c_birth_year") <= 1925).select("c_customer_sk", "c_preferred_cust_flag")
print "Number of records: ", sample.count()
sample.show()
### Back-up the partitions before deletion
raw_df.filter(col("c_birth_year") <= 1925).write.saveAsTable("test.original_bkp", mode = "overwrite")
### UDF : To delete particular partition.
def delete_part(table, part):
qry = "ALTER TABLE " + table + " DROP IF EXISTS PARTITION (c_birth_year = " + str(part) + ")"
spark.sql(qry)
### Delete partitions
part_df = raw_df.filter(col("c_birth_year") <= 1925).select("c_birth_year").distinct()
part_list = part_df.rdd.map(lambda x : x[0]).collect()
table = "test.original"
for p in part_list:
delete_part(table, p)
### Do the required Changes to the columns in partitions
df = spark.table("test.original_bkp")
newdf = df.withColumn("c_preferred_cust_flag", lit("Y"))
newdf.select("c_customer_sk", "c_preferred_cust_flag").show()
### Write the Partitions back to Original table
newdf.write.insertInto("test.original")
### Verify data in Original table
orginial.filter(col("c_birth_year") <= 1925).select("c_customer_sk", "c_preferred_cust_flag").show()
Hope it helps.
Regards,
Neeraj
As jatin Wrote you can delete paritions from hive and from path and then append data
Since I was wasting too much time with it I added the following example for other spark users.
I used Scala with spark 2.2.1
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkConf
import org.apache.spark.sql.{Column, DataFrame, SaveMode, SparkSession}
case class DataExample(partition1: Int, partition2: String, someTest: String, id: Int)
object StackOverflowExample extends App {
//Prepare spark & Data
val sparkConf = new SparkConf()
sparkConf.setMaster(s"local[2]")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
val tableName = "my_table"
val partitions1 = List(1, 2)
val partitions2 = List("e1", "e2")
val partitionColumns = List("partition1", "partition2")
val myTablePath = "/tmp/some_example"
val someText = List("text1", "text2")
val ids = (0 until 5).toList
val listData = partitions1.flatMap(p1 => {
partitions2.flatMap(p2 => {
someText.flatMap(
text => {
ids.map(
id => DataExample(p1, p2, text, id)
)
}
)
}
)
})
val asDataFrame = spark.createDataFrame(listData)
//Delete path function
def deletePath(path: String, recursive: Boolean): Unit = {
val p = new Path(path)
val fs = p.getFileSystem(new Configuration())
fs.delete(p, recursive)
}
def tableOverwrite(df: DataFrame, partitions: List[String], path: String): Unit = {
if (spark.catalog.tableExists(tableName)) {
//clean partitions
val asColumns = partitions.map(c => new Column(c))
val relevantPartitions = df.select(asColumns: _*).distinct().collect()
val partitionToRemove = relevantPartitions.map(row => {
val fields = row.schema.fields
s"ALTER TABLE ${tableName} DROP IF EXISTS PARTITION " +
s"${fields.map(field => s"${field.name}='${row.getAs(field.name)}'").mkString("(", ",", ")")} PURGE"
})
val cleanFolders = relevantPartitions.map(partition => {
val fields = partition.schema.fields
path + fields.map(f => s"${f.name}=${partition.getAs(f.name)}").mkString("/")
})
println(s"Going to clean ${partitionToRemove.size} partitions")
partitionToRemove.foreach(partition => spark.sqlContext.sql(partition))
cleanFolders.foreach(partition => deletePath(partition, true))
}
asDataFrame.write
.options(Map("path" -> myTablePath))
.mode(SaveMode.Append)
.partitionBy(partitionColumns: _*)
.saveAsTable(tableName)
}
//Now test
tableOverwrite(asDataFrame, partitionColumns, tableName)
spark.sqlContext.sql(s"select * from $tableName").show(1000)
tableOverwrite(asDataFrame, partitionColumns, tableName)
import spark.implicits._
val asLocalSet = spark.sqlContext.sql(s"select * from $tableName").as[DataExample].collect().toSet
if (asLocalSet == listData.toSet) {
println("Overwrite is working !!!")
}
}
If you use DataFrame, possibly you want to use Hive table over data.
In this case you need just call method
df.write.mode(SaveMode.Overwrite).partitionBy("partition_col").insertInto(table_name)
It'll overwrite partitions that DataFrame contains.
There's not necessity to specify format (orc), because Spark will use Hive table format.
It works fine in Spark version 1.6
Instead of writing to the target table directly, i would suggest you create a temporary table like the target table and insert your data there.
CREATE TABLE tmpTbl LIKE trgtTbl LOCATION '<tmpLocation';
Once the table is created, you would write your data to the tmpLocation
df.write.mode("overwrite").partitionBy("p_col").orc(tmpLocation)
Then you would recover the table partition paths by executing:
MSCK REPAIR TABLE tmpTbl;
Get the partition paths by querying the Hive metadata like:
SHOW PARTITONS tmpTbl;
Delete these partitions from the trgtTbl and move the directories from tmpTbl to trgtTbl
I would suggest you doing clean-up and then writing new partitions with Append mode:
import scala.sys.process._
def deletePath(path: String): Unit = {
s"hdfs dfs -rm -r -skipTrash $path".!
}
df.select(partitionColumn).distinct.collect().foreach(p => {
val partition = p.getAs[String](partitionColumn)
deletePath(s"$path/$partitionColumn=$partition")
})
df.write.partitionBy(partitionColumn).mode(SaveMode.Append).orc(path)
This will delete only new partitions. After writing data run this command if you need to update metastore:
sparkSession.sql(s"MSCK REPAIR TABLE $db.$table")
Note: deletePath assumes that hfds command is available on your system.
My solution implies overwriting each specific partition starting from a spark dataframe. It skips the dropping partition part. I'm using pyspark>=3 and I'm writing on AWS s3:
def write_df_on_s3(df, s3_path, field, mode):
# get the list of unique field values
list_partitions = [x.asDict()[field] for x in df.select(field).distinct().collect()]
df_repartitioned = df.repartition(1,field)
for p in list_partitions:
# create dataframes by partition and send it to s3
df_to_send = df_repartitioned.where("{}='{}'".format(field,p))
df_to_send.write.mode(mode).parquet(s3_path+"/"+field+"={}/".format(p))
The arguments of this simple function are the df, the s3_path, the partition field, and the mode (overwrite or append). The first part gets the unique field values: it means that if I'm partitioning the df by daily, I get a list of all the dailies in the df. Then I'm repartition the df. Finally, I'm selecting the repartitioned df by each daily and I'm writing it on its specific partition path.
You can change the repartition integer by your needs.
You could do something like this to make the job reentrant (idempotent):
(tried this on spark 2.2)
# drop the partition
drop_query = "ALTER TABLE table_name DROP IF EXISTS PARTITION (partition_col='{val}')".format(val=target_partition)
print drop_query
spark.sql(drop_query)
# delete directory
dbutils.fs.rm(<partition_directoy>,recurse=True)
# Load the partition
df.write\
.partitionBy("partition_col")\
.saveAsTable(table_name, format = "parquet", mode = "append", path = <path to parquet>)
For >= Spark 2.3.0 :
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
data.write.insertInto("partitioned_table", overwrite=True)

How to stop load the whole table in spark?

The thing is, I have read right to one table,which is partition by year month and day.But I don't have right read the data from 2016/04/24.
when I execute in Hive command:
hive>select * from table where year="2016" and month="06" and day="01";
I CAN READ OTHER DAYS' DATA EXCEPT 2016/04/24
But,when I read in spark
sqlContext.sql.sql(select * from table where year="2016" and month="06" and day="01")
exceptition is throwable That I dont have the right to hdfs/.../2016/04/24
THIS SHOW SPARK SQL LOAD THE WHOLE TABLE ONCE AND THEN FILTER?
HOW CAN I AVOID LOAD THE WHOLE TABLE?
You can use JdbcRDDs directly. With it you can bypass spark sql engine therefore your queries will be directly sent to hive.
To use JdbcRDD you need to create hive driver and register it first (of course it is not registered already).
val driver = "org.apache.hive.jdbc.HiveDriver"
Class.forName(driver)
Then you can create a JdbcRDD;
val connUrl = "jdbc:hive2://..."
val query = """select * from table where year="2016" and month="06" and day="01" and ? = ?"""
val lowerBound = 0
val upperBound = 0
val numOfPartitions = 1
new JdbcRDD(
sc,
() => DriverManager.getConnection(connUrl),
query,
lowerBound,
upperBound,
numOfPartitions,
(r: ResultSet) => (r.getString(1) /** get data here or with a function**/)
)
JdbcRDD query must have two ? in order to create partition your data. So you should write a better query than me. This just creates one partition to demonstrate how it works.
However, before doing this I recommend you to check HiveContext. This supports HiveQL as well. Check this.

Resources