Databricks: Inserting an argument into the LIKE clause of a query in a %sql cell

In a Databricks notebook, I have a widget that allows me to set a value for the argument kw. I need to use that value in a query as part of a LIKE clause. The snippet below runs, but doesn't return anything (even though it should).
%sql
SELECT *
FROM table
WHERE keyword LIKE '%getArgument("kw")%'

I don't know what 'kw' represents, but I think it should be something like:
sqlContext.sql("SELECT * FROM SomeTable WHERE SomeField LIKE CONCAT('%', kw, '%')")
Use the appropriate libraries:
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc) // Make sure you use HiveContext
import sqlContext.implicits._
sqlContext.sql("SELECT * FROM SomeTable WHERE SomeField LIKE CONCAT('%', kw, '%')")

This works:
%sql
SELECT *
FROM table
WHERE keyword LIKE '%$kw%'
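If you prefer to build the filter in Python instead of relying on $-substitution, a minimal sketch (assuming a text widget named kw and the table/column names from the question) reads the widget value with dbutils.widgets.get and uses the DataFrame API:
%python
from pyspark.sql.functions import col

kw = dbutils.widgets.get("kw")  # value set by the widget
df = spark.table("table").filter(col("keyword").contains(kw))  # equivalent to LIKE '%<kw>%'
display(df)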

Related

Spark zeppelin: how to obtain %sql result in %pyspark interpreter?

I know I can use
%pyspark
df = sqlContext.sql('select * from train_table')
And I can use df.registerTempTable('xxx') to make df accessible in %sql.
But sometimes I would like to use %sql to draw a plot, and the calculation may be expensive:
%sql
select C.name, count(C.name) from orderitems as A
left join clientpagemodules as C on C.code = A.from_module
left join orders as B on A.ref_id = B.id
left join products as P on P.id = A.product_id
where B.time_create > (unix_timestamp(NOW()) - 3600*24*30) *1000 group by C.name
If I decide to write some code to clean the result, I have to move the SQL above into df = sqlContext.sql(sql) and calculate it again.
I wonder, is there any way to access the %sql result in %pyspark?
I'm not aware of a way to do it after you have executed your sql statement, but you can access the temporary table created in %sql from %pyspark if you register it as a temporary view initially:
%sql
--initial step
CREATE OR REPLACE TEMPORARY VIEW temp_bla AS select * from YOURSTATEMENT
%sql
--your work as usual
Select * from temp_bla
%pyspark
# and continuing in pyspark
spark.sql('select * from temp_bla').show()
This is how you get the SQL table in another paragraph as a pandas DataFrame:
%sql(saveAs=choose_name)
SELECT * FROM your_table
%pyspark
dataframe = z.getAsDataFrame('choose_name')
As described in the Zeppelin %python docs.
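If your Zeppelin version does not have z.getAsDataFrame, a small %pyspark fallback (assuming the temp_bla view registered above; use sqlContext instead of spark on older releases) is to read the view and convert it to pandas yourself:
%pyspark
# read the already-registered view once, then clean or plot it in pandas
pdf = spark.sql('select * from temp_bla').toPandas()
print(pdf.head())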

How to concurrently insert SparkSQL query output to HIVE while using it for another SparkSQL query

Is it possible to insert a SparkSQL dataframe output into a Hive table and, in parallel, use the same dataframe as a subquery for another SparkSQL action? The pseudo-code below should give an idea of what I am trying to achieve:
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext
conf = SparkConf().setAppName("test_app")
sc = SparkContext(conf=conf)
hive_context = HiveContext(sc)
query1 = "select col1, col2, sum(col3) from input_table_1 group by col1, col2"
query2 = "select col1, sum(col1) from temp_table group by col1"
qry1_df = hive_context.sql(query1)
qry1_df.write.format("parquet").insertInto("output_table_1", overwrite=True)
qry1_df.registerTempTable("temp_table")
qry2_df = hive_context.sql(query2)
qry2_df.write.format("parquet").insertInto("output_table_2", overwrite=True)
I want the execution of query2 to leverage the qry1_df output without having to recalculate the entire DAG (which is what happens with the above code).
UPDATE:
Based on the suggestion to use cache, below is the modified code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext
conf = SparkConf().setAppName("test_app")
sc = SparkContext(conf=conf)
hive_context = HiveContext(sc)
query1 = "select col1, col2, sum(col3) from input_table_1 group by col1, col2"
query2 = "select col1, sum(col1) from temp_table group by col1"
hive_context.sql("CACHE TABLE temp_table as " + query1)
qry1_df = hive_context.sql("Select * from temp_table")
qry1_df.write.format("parquet").insertInto("output_table_1", overwrite=True)
qry2_df = hive_context.sql(query2)
qry2_df.write.format("parquet").insertInto("output_table_2", overwrite=True)
It works. Just one clarification: would these two tasks, writing to the Hive table "output_table_1" and executing "query2", happen asynchronously or sequentially?
Try .cacheTable() on the temp view:
hive_context.cacheTable("temp_table")
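For completeness, a minimal PySpark sketch of that caching idea, reusing the table and query names from the question (a sketch only; cacheTable is lazy, so the table is materialized on first use):
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = SparkConf().setAppName("test_app")
sc = SparkContext(conf=conf)
hive_context = HiveContext(sc)

# compute the aggregation once and keep it in memory
qry1_df = hive_context.sql("select col1, col2, sum(col3) from input_table_1 group by col1, col2")
qry1_df.registerTempTable("temp_table")
hive_context.cacheTable("temp_table")

# both writes read the cached data instead of re-running the aggregation
qry1_df.write.insertInto("output_table_1", overwrite=True)
hive_context.sql("select col1, sum(col1) from temp_table group by col1").write.insertInto("output_table_2", overwrite=True)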

In Pyspark HiveContext what is the equivalent of SQL OFFSET?

Or, a more specific question: how can I process large amounts of data that do not fit into memory at once? With OFFSET I was trying to do hiveContext.sql("select ... limit 10 offset 10") while incrementing the offset to get all the data, but OFFSET doesn't seem to be valid within hiveContext. What is the alternative usually used to achieve this goal?
For some context, the pyspark code starts with:
from pyspark.sql import HiveContext
hiveContext = HiveContext(sc)
hiveContext.sql("select ... limit 10 offset 10").show()
Your code will look like:
from pyspark.sql import HiveContext
hiveContext = HiveContext(sc)
hiveContext.sql("""
    WITH result AS (
        SELECT column1, column2, column3,
               ROW_NUMBER() OVER (ORDER BY columnname) AS RowNum
        FROM tablename
    )
    SELECT column1, column2, column3
    FROM result
    WHERE RowNum >= OFFSETvalue AND RowNum < (OFFSETvalue + limitvalue)
""").show()
Note: update the placeholders column1, column2, column3, columnname, tablename, OFFSETvalue, and limitvalue according to your requirements.
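If the goal is to walk through the whole table in fixed-size chunks, a hedged sketch of the same ROW_NUMBER trick in a loop (tablename, columnname, and the column list are placeholders) could look like:
from pyspark.sql import HiveContext

hiveContext = HiveContext(sc)
page_size = 10
offset = 0
while True:
    page = hiveContext.sql("""
        WITH result AS (
            SELECT column1, column2, column3,
                   ROW_NUMBER() OVER (ORDER BY columnname) AS RowNum
            FROM tablename
        )
        SELECT column1, column2, column3
        FROM result
        WHERE RowNum > {0} AND RowNum <= {0} + {1}
    """.format(offset, page_size))
    rows = page.collect()  # only one page is brought to the driver at a time
    if not rows:
        break
    # ... process this chunk ...
    offset += page_size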

While joining two dataframes in Spark, getting an empty result

I am trying to join two dataframes in Spark, read from a Cassandra database.
val table1=cc.sql("select * from test123").as("table1")
val table2=cc.sql("select * from test1234").as("table2")
table1.join(table2, table1("table1.id") === table2("table2.id1"), "inner")
.select("table1.name", "table2.name1")
The result I am getting is empty.
You can try the pure SQL way, if you are unsure of the join syntax here.
table1.registerTempTable("tbl1")
table2.registerTempTable("tbl2")
val table3 = sqlContext.sql("Select tbl1.name, tbl2.name FROM tbl1 INNER JOIN tbl2 on tbl1.id=tbl2.id")
Also, you should check whether table1 and table2 really have matching ids to join on in the first place.
Update:
import org.apache.spark.sql.SQLContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Ideally, yes, csc should also work.
You should refer to http://spark.apache.org/docs/latest/sql-programming-guide.html
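For what it's worth, a quick sanity check (shown here in PySpark for brevity, using the hypothetical temp-table names from the snippet above) to confirm the two tables share any ids at all before debugging the join further:
# count how many ids actually overlap; 0 here would explain an empty inner join
t1 = sqlContext.sql("select id from tbl1")
t2 = sqlContext.sql("select id1 as id from tbl2")
print(t1.join(t2, "id").count())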
First union both data frames, and after that register the result as a temp table.

sparkSQL wrong result

My Spark version is 1.5.0, and I use Spark SQL in spark-shell to do some ETL. Here is my code:
import com.databricks.spark.avro._
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
import java.security.MessageDigest
val dfGoods = sqlContext.read.avro("hdfs:///user/data/date=*")
dfGoods.registerTempTable("goodsinfo")
val dfGoodsLmt=sqlContext.sql("SELECT * FROM (SELECT goodsid, etype, goodsattribute, row_number() over (partition by goodsid order by runid DESC) rank_num FROM goodsinfo) tmp WHERE rank_num =1")
I use dfGoodsLmt.count() to check the row count. The first time, the result is always wrong, but after that, when I rerun dfGoodsLmt.count(), the result is right. I have tried this many times and I don't know why.
Here is demo code; the result should be 1000, but I need to run it more than once to get the right answer.
case class data(id:Int,name:Int)
val tmp=(1 to 1000) zip (1 to 1000)
tmp.map(x=>data(x._1,x._2)).toDF.registerTempTable("test_table")
sqlContext.sql("select * from (select *,row_number() over(partition by id order by id DESC)rank from test_table)tmp where rank=1").count
