Hive HQL to Spark SQL conversion - apache-spark

I have a requirement to convert HQL to Spark SQL. I am using the approach below, but I am not seeing much change in performance. If anybody has a better suggestion, please let me know.
Hive:
create table temp1 as select * from Table1 T1 join (select id , min(activity_date) as dt from Table1 group by id) T2 on T1.id=T2.id and T1.activity_date=T2.dt ;
create table temp2 as select * from temp1 join diff_table
I have around 70 such intermediate Hive temp tables, and the source Table1 holds around 1.8 billion rows, with no partitioning and 200 HDFS files.
Spark code (running with 20 executors, 5 executor cores, 10G executor memory, yarn-client mode, 4G driver memory):
import org.apache.spark.sql.{Row,SaveMode,SparkSession}
val spark=SparkSession.builder().appName("test").config("spark.sql.warehouse.dir","/usr/hive/warehouse").enableHiveSupport().getOrCreate()
import spark.implicits._
import spark.sql
val id_df=sql("select id , min(activity_date) as dt from Table1 group by id")
val all_id_df=sql("select * from Table1")
id_df.createOrReplaceTempView("min_id_table")
all_id_df.createOrReplaceTempView("all_id_table")
val temp1_df=sql("select * from all_id_table T1 join min_id_table T2 on T1.id=T2.id and T1.activity_date=T2.dt")
temp1_df.createOrReplaceTempView("temp2")
sql("create or replace table temp as select * from temp2")

Related

Spark zeppelin: how to obtain %sql result in %pyspark interpreter?

I know I can use
%pyspark
df = sqlContext.sql('select * from train_table')
And I can use df.registerTempTable('xxx') to make df accessible in %sql.
But sometimes I would like to use %sql to draw a plot, and the calculation may be expensive:
%sql
select C.name, count(C.name) from orderitems as A
left join clientpagemodules as C on C.code = A.from_module
left join orders as B on A.ref_id = B.id
left join products as P on P.id = A.product_id
where B.time_create > (unix_timestamp(NOW()) - 3600*24*30) *1000 group by C.name
If I decide to write some code to clean up the result, I have to move the SQL above into df = sqlContext.sql(sql) and calculate it again.
Is there any way to access the %sql result in %pyspark?
I'm not aware of a way to do it after you have executed your SQL statement, but you can access the temporary table created in %sql from %pyspark if you register it as a temporary view initially:
%sql
--initial step
CREATE OR REPLACE TEMPORARY VIEW temp_bla AS select * from YOURSTATEMENT
%sql
--your work as usual
Select * from temp_bla
%pyspark
# and continuing in pyspark
spark.sql('select * from temp_bla').show()
This is how you get the SQL table in another paragraph as a pandas DataFrame:
%sql(saveAs=choose_name)
SELECT * FROM your_table
%pyspark
dataframe = z.getAsDataFrame('choose_name')
As written in the Zeppelin %python docs
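The same temporary view should also be reachable from the Scala interpreter, since %sql, %pyspark, and %spark paragraphs in the Spark interpreter group share one session; a minimal sketch, assuming the temp_bla view from the first answer:
%spark
// The view registered in the %sql paragraph lives in the shared SparkSession,
// so it can be queried from Scala as well.
val df = spark.table("temp_bla")
df.show()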

In PySpark HiveContext, what is the equivalent of SQL OFFSET?

Or, a more specific question: how can I process large amounts of data that do not fit into memory at once? With OFFSET I was trying to do hiveContext.sql("select ... limit 10 offset 10") while incrementing the offset to get all the data, but OFFSET doesn't seem to be valid within HiveContext. What is the usual alternative for achieving this?
For some context, the PySpark code starts with:
from pyspark.sql import HiveContext
hiveContext = HiveContext(sc)
hiveContext.sql("select ... limit 10 offset 10").show()
Your code will look like:
from pyspark.sql import HiveContext
hiveContext = HiveContext(sc)
hiveContext.sql(" with result as
( SELECT colunm1 ,column2,column3, ROW_NUMBER() OVER (ORDER BY columnname) AS RowNum FROM tablename )
select colunm1 ,column2,column3 from result where RowNum >= OFFSEtvalue and RowNum < (OFFSEtvalue +limtvalue ").show()
Note: update the placeholders above according to your requirements: column1/column2/column3, columnname, tablename, OFFSETvalue, and limitvalue.
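The same windowed paging can also be written against the DataFrame API; a minimal sketch in Scala, assuming a Scala-side HiveContext named hiveContext and reusing the placeholder names from the answer above (tablename, columnname, and the offset/limit values):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val offsetValue = 10   // placeholder, as in the answer above
val limitValue  = 10   // placeholder, as in the answer above

// Note: a window with no partitionBy funnels every row through one task,
// so a global ordering like this only makes sense when it is truly required.
val w = Window.orderBy(col("columnname"))

val page = hiveContext.table("tablename")
  .withColumn("RowNum", row_number().over(w))
  .filter(col("RowNum") >= offsetValue && col("RowNum") < offsetValue + limitValue)
  .drop("RowNum")

page.show()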

DataFrame repartition is not happening as expected

I am running the code below in Spark to create table temp1 with 200 partitions. But when I check the actual number of partitions by creating an RDD out of the temp1 table, it comes out to be more than 200.
How is this possible? Am I missing anything? It would be really helpful if anyone could tell me what I am missing. Thanks.
val TransDataFrame = hiveContext.sql(
s""" SELECT *
FROM uacc.TRANS
WHERE PROD_SURRO_ID != 0
AND MONTH_ID >= 201401
AND MONTH_ID <= 201403
AND CRE_DT <= '2016-11-13'
""").repartition(200,$"NDC").registerTempTable("temp")
hiveContext.sql(
s"""
CREATE TABLE uacc.temp1
AS SELECT * FROM temp
""")
val df = hiveContext.sql("SELECT * FROM uacc.temp1")
df.rdd.getNumPartitions
1224
When you create the table uacc.temp1 you actually write your DataFrame to HDFS; when you then load that table again, the number of partitions is controlled by the number of HDFS files (more specifically, file splits). See: How does partitioning work for data from files on HDFS?
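If the goal is for the re-read table to come back with roughly 200 partitions, one option is to control the number of files written in the first place; a minimal sketch, assuming the DataFrameWriter API (Spark 1.4+) and files small enough not to be split on read:
val trans = hiveContext.sql(
  """SELECT *
     FROM uacc.TRANS
     WHERE PROD_SURRO_ID != 0
       AND MONTH_ID >= 201401
       AND MONTH_ID <= 201403
       AND CRE_DT <= '2016-11-13'""")

// Repartition immediately before writing, so the 200 shuffle partitions
// become (roughly) 200 files on HDFS.
trans.repartition(200, trans("NDC"))
  .write
  .saveAsTable("uacc.temp1")

// A later read may still not give exactly 200 partitions: the count depends
// on how the written files are turned into input splits at scan time.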

How to stop loading the whole table in Spark?

The thing is, I have read rights to one table, which is partitioned by year, month, and day. But I don't have the right to read the data from 2016/04/24.
When I execute this in Hive:
hive> select * from table where year="2016" and month="06" and day="01";
I can read other days' data, just not 2016/04/24.
But when I read it in Spark:
sqlContext.sql("select * from table where year='2016' and month='06' and day='01'")
an exception is thrown saying that I don't have the right to hdfs/.../2016/04/24.
Does this mean Spark SQL loads the whole table first and then filters? How can I avoid loading the whole table?
You can use JdbcRDD directly. With it you bypass the Spark SQL engine, so your queries are sent directly to Hive.
To use JdbcRDD you need to load the Hive JDBC driver and register it first (if it is not registered already).
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

val driver = "org.apache.hive.jdbc.HiveDriver"
Class.forName(driver)
Then you can create a JdbcRDD:
val connUrl = "jdbc:hive2://..."
val query = """select * from table where year="2016" and month="06" and day="01" and ? = ?"""
val lowerBound = 0
val upperBound = 0
val numOfPartitions = 1
new JdbcRDD(
  sc,
  () => DriverManager.getConnection(connUrl),
  query,
  lowerBound,
  upperBound,
  numOfPartitions,
  (r: ResultSet) => r.getString(1) // get data here, or with a mapping function
)
The JdbcRDD query must contain two ? placeholders so that it can partition your data, so you should write a better query than mine; this one just creates a single partition to demonstrate how it works.
However, before doing this I recommend that you check out HiveContext, which supports HiveQL as well.
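For comparison, a minimal sketch of the HiveContext route, assuming the table is a partitioned Hive table and that the Spark version in use prunes partitions when the filter is on the partition columns (directory listing during planning may still touch other partitions, depending on the version):
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Filtering on the partition columns lets Spark prune down to the
// year=2016/month=06/day=01 directories instead of scanning every partition.
val df = hiveContext.table("table")
  .filter("year = '2016' AND month = '06' AND day = '01'")

df.show()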

While joining two DataFrames in Spark, getting an empty result

I am trying to join two DataFrames in Spark, loaded from a Cassandra database.
val table1=cc.sql("select * from test123").as("table1")
val table2=cc.sql("select * from test1234").as("table2")
table1.join(table2, table1("table1.id") === table2("table2.id1"), "inner")
.select("table1.name", "table2.name1")
The result I am getting is empty.
You can try the pure SQL way, if you are unsure of the join syntax here.
table1.registerTempTable("tbl1")
table2.registerTempTable("tbl2")
val table3 = sqlContext.sql("SELECT tbl1.name, tbl2.name1 FROM tbl1 INNER JOIN tbl2 ON tbl1.id = tbl2.id1")
Also, you should check whether table1 and table2 really do have matching ids to join on in the first place.
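A quick way to check that overlap, sketched below with the column names from the question (the intersect-based count is just a hypothetical diagnostic, not part of the original answer):
// Count how many join keys the two DataFrames actually share.
// A result of 0 would explain the empty join.
val commonIds = table1.select(table1("id"))
  .intersect(table2.select(table2("id1").as("id")))

println(commonIds.count())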
Update:
import org.apache.spark.sql.SQLContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Ideally, yes, csc should also work.
You should refer to http://spark.apache.org/docs/latest/sql-programming-guide.html
Another suggestion: first union both DataFrames, and after that register the result as a temp table.
