In PySpark HiveContext, what is the equivalent of SQL OFFSET? - apache-spark

Or, a more specific question: how can I process a large amount of data that does not fit into memory at once? With OFFSET I was trying to do hiveContext.sql("select ... limit 10 offset 10") while incrementing the offset to get all the data, but offset doesn't seem to be valid within hiveContext. What is the alternative usually used to achieve this goal?
For some context, the pyspark code starts with:
from pyspark.sql import HiveContext
hiveContext = HiveContext(sc)
hiveContext.sql("select ... limit 10 offset 10").show()

Your code will look like:
from pyspark.sql import HiveContext
hiveContext = HiveContext(sc)
hiveContext.sql("""
    WITH result AS (
        SELECT column1, column2, column3, ROW_NUMBER() OVER (ORDER BY columnname) AS RowNum FROM tablename
    )
    SELECT column1, column2, column3 FROM result
    WHERE RowNum >= OFFSETvalue AND RowNum < (OFFSETvalue + limitvalue)
""").show()
Note: update the placeholders column1, column2, column3, columnname, tablename, OFFSETvalue and limitvalue according to your requirement.
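For illustration, here is a minimal sketch of paging through a table with that ROW_NUMBER() pattern (tablename and columnname are the same placeholders as above; the collect() only stands in for whatever per-page processing you need):
from pyspark.sql import HiveContext

hiveContext = HiveContext(sc)
page_size = 10
offset = 0
while True:
    # Recompute the row numbers and pull one page at a time to the driver.
    page = hiveContext.sql("""
        WITH result AS (
            SELECT *, ROW_NUMBER() OVER (ORDER BY columnname) AS RowNum FROM tablename
        )
        SELECT * FROM result WHERE RowNum > {0} AND RowNum <= {0} + {1}
    """.format(offset, page_size))
    rows = page.collect()
    if not rows:
        break
    offset += page_size
Keep in mind that each page still evaluates the window over the whole table, so this avoids holding all rows in driver memory at once rather than avoiding repeated scans.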

Related

Hive hql to Spark sql conversion

I have a requirement to convert HQL to Spark SQL. I am using the approach below, but I am not seeing much change in performance. If anybody has a better suggestion, please let me know.
Hive:
create table temp1 as select * from Table1 T1 join (select id , min(activity_date) as dt from Table1 group by id) T2 on T1.id=T2.id and T1.activity_date=T2.dt ;
create table temp2 as select * from temp1 join diff_table
I have around 70 such internal Hive temp tables, and the source Table1 has around 1.8 billion rows, with no partitioning and 200 HDFS files.
Spark code, running with 20 executors, 5 executor cores, 10G executor memory, yarn-client mode, 4G driver memory:
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
val spark = SparkSession.builder().appName("test").config("spark.sql.warehouse.dir", "/usr/hive/warehouse").enableHiveSupport().getOrCreate()
import spark.implicits._
import spark.sql
val id_df = sql("select id, min(activity_date) as dt from Table1 group by id")
val all_id_df = sql("select * from Table1")
id_df.createOrReplaceTempView("min_id_table")
all_id_df.createOrReplaceTempView("all_id_table")
val temp1_df = sql("select * from all_id_table T1 join min_id_table T2 on T1.id=T2.id and T1.activity_date=T2.dt")
temp1_df.createOrReplaceTempView("temp2")
sql("create table temp1 as select * from temp2")

How to concurrently insert SparkSQL query output to HIVE while using it for another SparkSQL query

Is it possible to insert a SparkSQL dataframe output into a Hive table and, in parallel, use the same dataframe as a subquery for another SparkSQL action? The pseudo-code below should give an idea of what I am trying to achieve -
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext
conf = SparkConf().setAppName("test_app")
sc = SparkContext(conf=conf)
hive_context = HiveContext(sc)
query1 = "select col1, col2, sum(col3) from input_table_1 group by col1, col2"
query2 = "select col1, sum(col1) from temp_table col1"
qry1_df = hive_context.sql(query1)
qry1_df.write.format("parquet").insertInto("output_table_1", overwrite=True)
qry1_df.registerTempTable("temp_table")
qry2_df = hive_context.sql(query2)
qry2_df.write.format("parquet").insertInto("output_table_2", overwrite=True)
I want the execution of query2 to leverage the qry1_df output without recalculating the entire DAG (which is what happens with the above code).
UPDATE:
Based on the suggestion to use cache, below is the modified code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext
conf = SparkConf().setAppName("test_app")
sc = SparkContext(conf=conf)
hive_context = HiveContext(sc)
query1 = "select col1, col2, sum(col3) from input_table_1 group by col1, col2"
query2 = "select col1, sum(col1) from temp_table col1"
hive_context.sql("CACHE TABLE temp_table as " + query1)
qry1_df = hive_context.sql("Select * from temp_table")
qry1_df.write.format("parquet").insertInto("output_table_1", overwrite=True)
qry2_df = hive_context.sql(query2)
qry2_df.write.format("parquet").insertInto("output_table_2", overwrite=True)
It works. Just one clarification - would these two tasks, writing to the Hive table "output_table_1" and executing "query2", happen asynchronously or sequentially?
Try .cacheTable() on the temp view:
hive_context.cacheTable("temp_table")
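For completeness, a minimal PySpark sketch of the same idea using DataFrame-level .cache() instead of the CACHE TABLE statement (the names reuse those from the question):
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = SparkConf().setAppName("test_app")
sc = SparkContext(conf=conf)
hive_context = HiveContext(sc)

# Materialise query1 once and keep it in memory for both downstream uses.
qry1_df = hive_context.sql("select col1, col2, sum(col3) from input_table_1 group by col1, col2").cache()
qry1_df.registerTempTable("temp_table")

# First action: fills the cache while writing output_table_1.
qry1_df.write.format("parquet").insertInto("output_table_1", overwrite=True)

# Second action: reads from the cache instead of recomputing the aggregation.
qry2_df = hive_context.sql("select col1, sum(col1) from temp_table group by col1")
qry2_df.write.format("parquet").insertInto("output_table_2", overwrite=True)
Note that the two insertInto() calls are blocking actions on the driver, so as written they run sequentially; the caching only ensures query1 is computed once, not that the writes happen in parallel.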

HiveContext in Spark Version 2

I am working on a Spark program that inserts a dataframe into a Hive table, as below.
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql._
val hiveCont = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveCont.implicits._  // needed for .toDF() below
val partfile = sc.textFile("partfile")
val partdata = partfile.map(p => p.split(","))
case class partc(id:Int, name:String, salary:Int, dept:String, location:String)
val partRDD = partdata.map(p => partc(p(0).toInt, p(1), p(2).toInt, p(3), p(4)))
val partDF = partRDD.toDF()
partDF.registerTempTable("party")
hiveCont.sql("insert into parttab select id, name, salary, dept from party")
I know that Spark v2 has come out and that we can use the SparkSession object in it.
Can we use the SparkSession object to directly insert the dataframe into a Hive table, or do we still have to use HiveContext in version 2? Can anyone let me know what the major difference is in version 2 with respect to HiveContext?
You can use your SparkSession (normally called spark or ss) directly to fire a SQL query (make sure Hive support is enabled when creating the SparkSession):
spark.sql("insert into parttab select id, name, salary, dept from party")
But I would suggest this notation instead; you don't need to create a temp table, etc.:
partDF
.select("id","name","salary","dept")
.write.mode("overwrite")
.insertInto("parttab")

Pyspark query hive extremely slow even the final result is quite small

I am using Spark 2.0.0 to query a Hive table. My SQL is:
select * from app.abtestmsg_v limit 10
Yes, I want to get the first 10 records from the view app.abtestmsg_v.
When I run this SQL in spark-shell it is very fast, taking about 2 seconds.
But the problem comes when I try to implement this query in my Python code.
I am using Spark 2.0.0 and wrote a very simple pyspark program. Below is my pyspark code:
from pyspark.sql import HiveContext
from pyspark.sql.functions import *
import json
hc = HiveContext(sc)
hc.setConf("hive.exec.orc.split.strategy", "ETL")
hc.setConf("hive.security.authorization.enabled",false)
zj_sql = 'select * from app.abtestmsg_v limit 10'
zj_df = hc.sql(zj_sql)
zj_df.collect()
Below is my Scala code:
val hive = new org.apache.spark.sql.hive.HiveContext(sc)
hive.setConf("hive.exec.orc.split.strategy", "ETL")
val df = hive.sql("select * from silver_ep.zj_v limit 10")
df.rdd.collect()
From the info log, I find:
Although I use "limit 10" to tell Spark that I just want the first 10 records, Spark still scans and reads all the files of the view (in my case, the source data of the view contains 100 files, each about 1 GB in size). So there are nearly 100 tasks, each task reads one file, and all the tasks are executed serially. It takes nearly 15 minutes to finish these 100 tasks, yet all I want is the first 10 records.
So I don't know what to do or what is wrong.
Could anybody give me some suggestions?
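As a diagnostic sketch (not a fix), one thing worth checking is the physical plan, to see whether the limit is applied before or after the scan of the view:
from pyspark.sql import HiveContext

hc = HiveContext(sc)
zj_df = hc.sql("select * from app.abtestmsg_v limit 10")

# Print the parsed, analyzed, optimized and physical plans; look for the
# limit operator and check how much of the underlying view is scanned.
zj_df.explain(True)

# take(10) returns the same first 10 rows as collect() does here.
rows = zj_df.take(10)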

sparkSQL wrong result

My Spark version is 1.5.0, and I use Spark SQL in spark-shell to do some ETL. Here is my code:
import com.databricks.spark.avro._
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
import java.security.MessageDigest
val dfGoods = sqlContext.read.avro("hdfs:///user/data/date=*")
dfGoods.registerTempTable("goodsinfo")
val dfGoodsLmt=sqlContext.sql("SELECT * FROM (SELECT goodsid, etype, goodsattribute, row_number() over (partition by goodsid order by runid DESC) rank_num FROM goodsinfo) tmp WHERE rank_num =1")
I use dfGoodsLmt.count() to check the row count. The first time, the result is always wrong, but when I rerun dfGoodsLmt.count() after that, the result is right. I have tried this many times and I don't know why.
Here is a demo; the result should be 1000, but I need to try more than once to get the right answer.
case class data(id:Int,name:Int)
val tmp=(1 to 1000) zip (1 to 1000)
tmp.map(x=>data(x._1,x._2)).toDF.registerTempTable("test_table")
sqlContext.sql("select * from (select *,row_number() over(partition by id order by id DESC)rank from test_table)tmp where rank=1").count
