Spark SQL wrong result - apache-spark

My Spark version is 1.5.0, and I use Spark SQL in spark-shell to do some ETL. Here is my code:
import com.databricks.spark.avro._
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
import java.security.MessageDigest
val dfGoods = sqlContext.read.avro("hdfs:///user/data/date=*")
dfGoods.registerTempTable("goodsinfo")
val dfGoodsLmt=sqlContext.sql("SELECT * FROM (SELECT goodsid, etype, goodsattribute, row_number() over (partition by goodsid order by runid DESC) rank_num FROM goodsinfo) tmp WHERE rank_num =1")
I use dfGoodsLmt.count() to see the row count. The first time, the result is always wrong, but after that, when I rerun dfGoodsLmt.count(), the result is right. I have tried this many times and I don't know why.
Here is a demo; the result should be 1000, but I need to run it more than once to get the right answer.
case class data(id:Int,name:Int)
val tmp=(1 to 1000) zip (1 to 1000)
tmp.map(x=>data(x._1,x._2)).toDF.registerTempTable("test_table")
sqlContext.sql("select * from (select *,row_number() over(partition by id order by id DESC)rank from test_table)tmp where rank=1").count

Related

How do I print out a spark.sql object?

I have a spark.sql object that includes a couple of variables.
import com.github.nscala_time.time.Imports.LocalDate
val first_date = new LocalDate(2020, 4, 1)
val second_date = new LocalDate(2020, 4, 7)
val mydf = spark.sql(s"""
select *
from tempView
where timestamp between '{0}' and '{1}'
""".format(start_date.toString, end_date.toString))
I want to print out mydf because I ran mydf.count and got 0 as the outcome.
I ran mydf and got back mydf: org.apache.spark.sql.DataFrame = [column: type]
I also tried println(mydf) and it didn't return the query.
There is this related question, but it does not have the answer.
How can I print out the query?
The easiest way would be to store your query in a variable, print out the variable to get the query, and then use that variable in spark.sql.
Example:
In Spark-scala:
val start_date="2020-01-01"
val end_date="2020-02-02"
val query=s"""select * from tempView where timestamp between'${start_date}' and '${end_date}'"""
print (query)
//select * from tempView where timestamp between'2020-01-01' and '2020-02-02'
spark.sql(query)
In Pyspark:
start_date="2020-01-01"
end_date="2020-02-02"
query="""select * from tempView where timestamp between'{0}' and '{1}'""".format(start_date,end_date)
print(query)
#select * from tempView where timestamp between'2020-01-01' and '2020-02-02'
#use same query in spark.sql
spark.sql(query)
Here it is in PySpark.
start_date="2020-01-01"
end_date="2020-02-02"
q="select * from tempView where timestamp between'{0}' and '{1}'".format(start_date,end_date)
print(q)
Here is the online running version: https://repl.it/repls/FeistyVigorousSpyware

Databricks: Inserting an argument into the LIKE-clause of a query in a %sql cell

In a Databricks notebook, I have a widget that allows setting a value for the argument kw. I need to use that value in a query as part of a LIKE clause. The snippet below runs, but doesn't return anything (even when it should).
%sql
SELECT *
FROM table
WHERE keyword LIKE '%getArgument("kw")%'
I don't know what 'kw' represents, but I think it should be:
sqlContext.sql("SELECT * FROM SomeTable WHERE SomeField LIKE CONCAT('%', kw, '%')")
Use the appropriate libraries:
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc) // Make sure you use HiveContext
import sqlContext.implicits._
sqlContext.sql("SELECT * FROM SomeTable WHERE SomeField LIKE CONCAT('%', kw, '%')")
This works:
%sql
SELECT *
FROM table
WHERE keyword LIKE '%$kw%'
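If you would rather build the query in a notebook cell instead of relying on $-substitution in %sql, the widget value can be read with dbutils.widgets.get (a sketch, assuming a Scala cell, a text widget named kw, and the table name from the question):
// Read the widget value and splice it into the LIKE pattern.
val kw = dbutils.widgets.get("kw")
val matches = spark.sql(s"SELECT * FROM table WHERE keyword LIKE '%$kw%'")
matches.show()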

How to concurrently insert SparkSQL query output to HIVE while using it for another SparkSQL query

Is it possible to insert a Spark SQL dataframe's output into a Hive table and, in parallel, use the same dataframe as a subquery for another Spark SQL action? The pseudo-code below should give an idea of what I am trying to achieve:
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext
conf = SparkConf().setAppName("test_app")
sc = SparkContext(conf=conf)
hive_context = HiveContext(sc)
query1 = "select col1, col2, sum(col3) from input_table_1 group by col1, col2"
query2 = "select col1, sum(col1) from temp_table col1"
qry1_df = hive_context.sql(query1)
qry1_df.write.format("parquet").insertInto("output_table_1", overwrite=True)
qry1_df.registerTempTable("temp_table")
qry2_df = hive_context.sql(query2)
qry2_df.write.format("parquet").insertInto("output_table_2", overwrite=True)
I want the execution of query2 to leverage the qry1_df output without recalculating the entire DAG (which is what happens with the code above).
UPDATE:
Based on the suggestion to use cache, below is the modified code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext
conf = SparkConf().setAppName("test_app")
sc = SparkContext(conf=conf)
hive_context = HiveContext(sc)
query1 = "select col1, col2, sum(col3) from input_table_1 group by col1, col2"
query2 = "select col1, sum(col1) from temp_table col1"
hive_context.sql("CACHE TABLE temp_table as " + query1)
qry1_df = hive_context.sql("Select * from temp_table")
qry1_df.write.format("parquet").insertInto("output_table_1", overwrite=True)
qry2_df = hive_context.sql(query2)
qry2_df.write.format("parquet").insertInto("output_table_2", overwrite=True)
It works. Just one clarification: would these two tasks, writing to the Hive table "output_table_1" and executing "query2", happen asynchronously or sequentially?
Try .cacheTable() on the temp view:
hive_context.cacheTable("temp_table")  # or spark.catalog.cacheTable("temp_table") with a SparkSession
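On the follow-up about asynchronous vs. sequential: each write is a blocking action, so issued back-to-back from the driver they run one after the other. A hedged sketch of how the two writes could be overlapped (shown in Scala for illustration; the dataframe names mirror the question's qry1_df/qry2_df and are assumptions):
// Sketch: submit each insertInto from its own thread so the jobs can overlap.
// Assumes the cached temp_table is already materialized, as in the updated code.
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val write1 = Future { qry1Df.write.format("parquet").insertInto("output_table_1") }
val write2 = Future { qry2Df.write.format("parquet").insertInto("output_table_2") }
Await.result(Future.sequence(Seq(write1, write2)), Duration.Inf)  // wait for both writes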

HiveContext in Spark Version 2

I am working on a Spark program that inserts a dataframe into a Hive table, as below.
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql._
val hiveCont = new org.apache.spark.sql.hive.HiveContext(sc)
val partfile = sc.textFile("partfile")
val partdata = partfile.map(p => p.split(","))
case class partc(id:Int, name:String, salary:Int, dept:String, location:String)
val partRDD = partdata.map(p => partc(p(0).toInt, p(1), p(2).toInt, p(3), p(4)))
val partDF = partRDD.toDF()
partDF.registerTempTable("party")
hiveCont.sql("insert into parttab select id, name, salary, dept from party")
I know that Spark V2 has come out and we can use SparkSession object in it.
Can we use the SparkSession object to directly insert the dataframe into the Hive table, or do we have to use HiveContext in version 2 as well? Can anyone tell me what the major difference is in this version with respect to HiveContext?
You can use your SparkSession (normally called spark or ss) directly to run a SQL query (make sure Hive support is enabled when creating the SparkSession):
spark.sql("insert into parttab select id, name, salary, dept from party")
But I would suggest this notation instead; you don't need to create a temp table, etc.:
partDF
.select("id","name","salary","dept")
.write.mode("overwrite")
.insertInto("parttab")

In Pyspark HiveContext what is the equivalent of SQL OFFSET?

Or, a more specific question: how can I process large amounts of data that do not fit into memory at once? With OFFSET I was trying to do hiveContext.sql("select ... limit 10 offset 10") while incrementing the offset to get all the data, but OFFSET doesn't seem to be valid within hiveContext. What alternative is usually used to achieve this goal?
For some context, the PySpark code starts with:
from pyspark.sql import HiveContext
hiveContext = HiveContext(sc)
hiveContext.sql("select ... limit 10 offset 10").show()
Your code will look like:
from pyspark.sql import HiveContext
hiveContext = HiveContext(sc)
hiveContext.sql(" with result as
( SELECT colunm1 ,column2,column3, ROW_NUMBER() OVER (ORDER BY columnname) AS RowNum FROM tablename )
select colunm1 ,column2,column3 from result where RowNum >= OFFSEtvalue and RowNum < (OFFSEtvalue +limtvalue ").show()
Note: update the placeholders column1, column2, column3, columnname, tablename, OFFSETvalue, and limitvalue according to your requirements.
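As a hedged follow-up (Scala used for illustration; every table and column name below is a placeholder): the same window-based query can be driven by a loop that advances the offset one page at a time.
// Sketch: page through tablename in chunks of pageSize using ROW_NUMBER().
// Assumes a SparkSession `spark` with Hive support.
val pageSize = 1000
var offset = 0
var done = false
while (!done) {
  val page = spark.sql(
    s"""WITH result AS (
       |  SELECT column1, column2, column3,
       |         ROW_NUMBER() OVER (ORDER BY columnname) AS rn
       |  FROM tablename
       |)
       |SELECT column1, column2, column3
       |FROM result
       |WHERE rn > $offset AND rn <= ${offset + pageSize}""".stripMargin)
  if (page.head(1).isEmpty) {
    done = true       // no more rows: stop paging
  } else {
    page.show()       // process this page here
    offset += pageSize
  }
}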
