How to avoid the multiple-jobs-spawning scenario in Spark SQL - apache-spark

I have an application in which I have to query Hive, which returns a set of records (hardly 50). For each record returned I then have to fire another query on Hive and get the relevant DataFrame. This is how it would look:
val employeeIds = hiveContext.sql("select id from employee")
val vertices = employeeIds.foreach(row => {
  val employeeId = row.getInt(0)
  val query = s"""select * from department where employeeId = $employeeId"""
  // .... I would have to create a hive context here ....
})
But if I do so, new contexts will be spawned from the executors; any pointers on how to eliminate this approach would be very helpful.
Note:
I have masked the information to post it on Stack Overflow. I have to fire a query based on the records of the first query, and I cannot join the employee and department tables.
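Since the first query returns hardly 50 rows, one possible workaround (a sketch only, not from the original post) is to collect those ids to the driver and fire the follow-up queries from the driver with the existing hiveContext, so that no context is ever needed on an executor:

// Sketch: collect the ~50 ids on the driver and reuse the driver-side hiveContext
// for the per-id queries; the resulting DataFrames can then be combined,
// e.g. with unionAll (Spark 1.x), assuming they all share the same schema.
val employeeIds = hiveContext.sql("select id from employee")

val departmentDFs = employeeIds.collect().map { row =>
  val employeeId = row.getInt(0)
  hiveContext.sql(s"select * from department where employeeId = $employeeId")
}

val vertices = departmentDFs.reduce(_ unionAll _)

All queries here are issued from the driver, so no extra contexts are spawned on the executors.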

Related

Spark sql querying a Hive table from workers

I am trying to query a Hive table from a map operation in Spark, but when it runs the query the execution freezes.
This is my test code
val sc = new SparkContext(conf)
val datasetPath = "npiCodesMin.csv"
val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()

val df = sparkSession.read.option("header", true).option("sep", ",").csv(datasetPath)
df.createOrReplaceTempView("npicodesTmp")
sparkSession.sql("DROP TABLE IF EXISTS npicodes")
sparkSession.sql("CREATE TABLE npicodes AS SELECT * FROM npicodesTmp")

val res = sparkSession.sql("SELECT * FROM npicodes WHERE NPI = '1588667638'") // This works
println(res.first())

val NPIs = sc.parallelize(List("1679576722", "1588667638", "1306849450", "1932102084")) // Some existing NPIs
val rows = NPIs.mapPartitions { partition =>
  val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()
  partition.map { code =>
    val res = sparkSession.sql("SELECT * FROM npicodes WHERE NPI = '" + code + "'") // The program stops here
    res.first()
  }
}
rows.collect().foreach(println)
It loads the data from a CSV, creates a new Hive table and fills it with the CSV data.
Then, if I query the table from the master it works perfectly, but if I try to do that inside a map operation the execution freezes.
It does not generate any error; it keeps running without doing anything.
The Spark UI shows this situation.
Actually, I am not sure whether I can query a table in a distributed way at all; I cannot find it in the documentation.
Any suggestion?
Thanks.
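For reference, a join-based rewrite (a sketch only, assuming the lookup column is literally named NPI) that avoids calling sparkSession.sql inside mapPartitions, since the session cannot be used from executor-side code:

// Sketch: express the lookups as a join planned on the driver instead of
// per-element queries issued from executors (which is what freezes above).
import sparkSession.implicits._

val npiDF = sc.parallelize(List("1679576722", "1588667638", "1306849450", "1932102084"))
  .toDF("NPI")

val rows = sparkSession.table("npicodes").join(npiDF, "NPI")
rows.collect().foreach(println)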

Filter Partition Before Reading Hive table (Spark)

Currently I'm trying to filter a Hive table by the latest date_processed.
The table is partitioned by:
System
date_processed
Region
The only way I've managed to filter it is with a join query:
query = "select * from contracts_table as a join (select max(date_processed) as maximum from contracts_table) as b on a.date_processed = b.maximum"
This approach is really time consuming, as I have to repeat the same procedure for 25 tables.
Does anyone know a way to read the latest loaded partition of a table directly in Spark < 1.6?
This is the method I'm using to read:
public static DataFrame loadAndFilter(String query) {
    return SparkContextSingleton.getHiveContext().sql(query);
}
Many thanks!
A DataFrame with all of the table's partitions can be obtained with:
val partitionsDF = hiveContext.sql("show partitions TABLE_NAME")
The values can then be parsed to get the max value.
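A sketch of that parsing step (assuming the usual system=.../date_processed=.../region=... layout of the SHOW PARTITIONS output, and that a lexicographic max is meaningful for the date_processed format):

// Sketch: extract date_processed from each partition spec string and keep the max.
val partitionsDF = hiveContext.sql("show partitions contracts_table")

val maxDateProcessed = partitionsDF
  .map(_.getString(0)) // e.g. "system=A/date_processed=2017-01-31/region=EU"
  .flatMap(_.split("/").find(_.startsWith("date_processed=")))
  .map(_.stripPrefix("date_processed="))
  .collect()
  .max

val latest = hiveContext.sql(
  s"select * from contracts_table where date_processed = '$maxDateProcessed'")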

Cassandra Spark Connector

My Cassandra CF has date and id as the partition key.
While querying I only know the date, so I loop over a range of ids.
My question is about how the connector executes the following code.
The Spark driver code looks like this:
SparkConf conf = new SparkConf().setAppName("DemoApp")
        .setMaster("local[*]")
        .set("spark.cassandra.connection.host", "10.*.*.*")
        .set("spark.cassandra.connection.port", "*");
JavaSparkContext sc = new JavaSparkContext(conf);
SparkContextJavaFunctions javaFunctions = CassandraJavaUtil.javaFunctions(sc);
String date = "23012017";
for (String id : idlist) {
    JavaRDD<CassandraRow> cassandraRowsRDD =
        javaFunctions.cassandraTable("datakeyspace", "sample2")
            .where("date = ?", date)
            .where("id = ?", id)
            .select("data");
    cassandraRowsRDDList.add(cassandraRowsRDD);
}
List<CassandraRow> collectAllRows = new ArrayList<CassandraRow>();
for (JavaRDD<CassandraRow> rdd : cassandraRowsRDDList) {
    // do transformations
    collectAllRows.addAll(rdd.collect());
}
1) First of all, if I loop over the idlist (say it has 1000 elements, and it may keep growing), will this be efficient? How will each select query be distributed across the cluster? In particular, how will the Cassandra DB connections be maintained?
2) In my driver program, after looping I put all the rows into a List, then apply transformations to each row and filter out the duplicates. Will this also be distributed by Spark across the cluster, or will it take place on the driver side?
Kindly help!
There is a better way of doing this, provided by the Spark Cassandra connector.
You can create an RDD of (date, id) pairs and then call the joinWithCassandraTable function on the date and id columns. The connector handles this smartly: all the data is fetched by the workers only, and without a shuffle, i.e. each worker fetches only the data for the (date, id) pairs it is holding.
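A rough Scala sketch of that approach (assuming date and id are the partition key columns of datakeyspace.sample2, that the case-class field names match the Cassandra column names, and that sc and idlist here are a Scala SparkContext and collection):

import com.datastax.spark.connector._

// Sketch: build an RDD of the keys you already know and let each worker pull
// only the rows for the (date, id) pairs it holds; no shuffle is involved.
case class SampleKey(date: String, id: String)

val date = "23012017"
val keys = sc.parallelize(idlist.map(id => SampleKey(date, id)))

val rows = keys.joinWithCassandraTable("datakeyspace", "sample2")
rows.foreach { case (key, row) => println(row.getString("data")) }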

How to stop loading the whole table in Spark?

The thing is, I have read rights on one table, which is partitioned by year, month and day, but I don't have the right to read the data under 2016/04/24.
When I execute this on the Hive command line:
hive> select * from table where year="2016" and month="06" and day="01";
I can read every other day's data except 2016/04/24.
But when I run the same query in Spark:
sqlContext.sql("select * from table where year='2016' and month='06' and day='01'")
an exception is thrown saying that I don't have the right to hdfs/.../2016/04/24.
Does this mean Spark SQL loads the whole table first and then filters?
How can I avoid loading the whole table?
You can use a JdbcRDD directly. With it you bypass the Spark SQL engine, so your queries are sent straight to Hive.
To use JdbcRDD you need to load the Hive JDBC driver and register it first (if it is not registered already).
val driver = "org.apache.hive.jdbc.HiveDriver"
Class.forName(driver)
Then you can create the JdbcRDD:
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

val connUrl = "jdbc:hive2://..."
val query = """select * from table where year="2016" and month="06" and day="01" and ? = ?"""
val lowerBound = 0
val upperBound = 0
val numOfPartitions = 1
new JdbcRDD(
  sc,
  () => DriverManager.getConnection(connUrl),
  query,
  lowerBound,
  upperBound,
  numOfPartitions,
  (r: ResultSet) => r.getString(1) // get data here or with a function
)
The JdbcRDD query must contain two ? placeholders so that Spark can partition your data, so you should write a better query than this one; it only creates a single partition, just to demonstrate how it works.
However, before doing this I recommend you check HiveContext, which supports HiveQL as well.
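A minimal sketch of that HiveContext route (whether the year/month/day predicates are pruned at the metastore, so that only the requested partition is touched, depends on the Spark version and configuration):

import org.apache.spark.sql.hive.HiveContext

// Sketch: run the same HiveQL through a HiveContext instead of a JdbcRDD.
val hiveContext = new HiveContext(sc)
val df = hiveContext.sql(
  "select * from table where year='2016' and month='06' and day='01'")
df.show()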

Spark SQL Cassandra delete records

Is there a way to delete some records based on a select query?
I have this query,
Select min(id) from ID having count(*) > 1, which will show the duplicates. I need to get those ids and delete them. How can I do that in Spark SQL?
Spark SQL does not support DELETE.
If the number of ids to delete is small, you can do it using the Cassandra driver instead of through Spark:
import scala.collection.JavaConverters._
import com.datastax.driver.core.{Cluster, Session, BatchStatement}
import com.datastax.driver.core.querybuilder.QueryBuilder

val cluster = Cluster.builder().addContactPoint(host_ip).build()
val session = cluster.connect(keyspace)
val idsToDelete = ... // perform your query and collect the ids
val queries = idsToDelete.map(id => QueryBuilder.delete().from(keyspace, table).where(QueryBuilder.eq("id", id)))
val batch = new BatchStatement().addAll(queries.asJava)
session.execute(batch)
cluster.close()
