Spark Zeppelin: how to obtain a %sql result in the %pyspark interpreter?

I know I can use
%pyspark
df = sqlContext.sql('select * from train_table')
And I can use df.registerTempTable('xxx') to make df accessible in %sql.
But sometimes I would like to use %sql to draw a plot, and the calculation may be expensive:
%sql
select C.name, count(C.name) from orderitems as A
left join clientpagemodules as C on C.code = A.from_module
left join orders as B on A.ref_id = B.id
left join products as P on P.id = A.product_id
where B.time_create > (unix_timestamp(NOW()) - 3600*24*30) *1000 group by C.name
If I decide to write some code to clean the result, I have to move the SQL above into df = sqlContext.sql(sql) and calculate it again.
Is there any way to access the %sql result in %pyspark?

I'm not aware of a way to do it after you have executed your SQL statement, but you can access the table from %pyspark if you register it as a temporary view in %sql initially:
%sql
--initial step
CREATE OR REPLACE TEMPORARY VIEW temp_bla AS select * from YOURSTATEMENT
%sql
--your work as usual
Select * from temp_bla
%pyspark
# and continue in pyspark
spark.sql('select * from temp_bla').show()
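From there the %sql result can be cleaned in %pyspark and re-registered, so that a later %sql paragraph can plot the cleaned data without repeating the expensive query. A minimal sketch (the dropna/rename steps and column names are only hypothetical cleanup):
%pyspark
# Work on the shared temporary view created in the %sql paragraph above.
df = spark.sql('select * from temp_bla')
cleaned = df.dropna().withColumnRenamed('count(name)', 'cnt')  # hypothetical cleanup
# Register the cleaned result so another %sql paragraph can plot it.
cleaned.createOrReplaceTempView('temp_bla_clean')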

This is how you get the %sql result in another paragraph as a pandas DataFrame:
%sql(saveAs=choose_name)
SELECT * FROM your_table
%pyspark
dataframe = z.getAsDataFrame('choose_name')
As written in the Zeppelin %python docs

Related

Are spark.read and spark.sql lazy transformations?

In Spark, if the source data has changed between two action calls, why do I still get the previous output and not the most recent one? Through the DAG, all operations, including the read, should get executed once an action is called. Isn't that so?
e.g.
df = spark.sql("select * from dummy.table1")
# Read from a Spark table that has two records into a DataFrame.
df.count()
# Gives a count of 2 records.
Now a record is inserted into the table and the action is called again without re-running command1.
df.count()
# Still gives a count of 2 records.
I was expecting Spark to execute the read operation again and fetch a total of 3 records into the DataFrame.
Where is my understanding wrong?
To contrast your assertion, the example below does give a difference, using Databricks notebook cells. It is not clear what kind of insert operation you are referring to.
But the following, using a parquet- or csv-based Spark table (thus not a Hive table), does force a difference in results as the files making up the table change. For a DAG re-compute, the same set of files is used, as far as I know.
//1st time in a cell
val df = spark.read.csv("/FileStore/tables/count.txt")
df.write.mode("append").saveAsTable("tab2")
//1st time in another cell
val df2 = spark.sql("select * from tab2")
df2.count()
//4 is returned
//2nd time in a different cell
val df = spark.read.csv("/FileStore/tables/count.txt")
df.write.mode("append").saveAsTable("tab2")
//2nd time in another cell
df2.count()
//8 is returned
This refutes your assertion. I also tried with .enableHiveSupport(); no difference.
Even when creating a Hive table directly in Databricks:
spark.sql("CREATE TABLE tab5 (id INT, name STRING, age INT) STORED AS ORC;")
spark.sql(""" INSERT INTO tab5 VALUES (1, 'Amy Smith', 7) """)
...
df.count()
...
spark.sql(""" INSERT INTO tab5 VALUES (2, 'Amy SmithS', 77) """)
df.count()
...
I still get updated counts.
However, for a Hive-created ORC SerDe table, the following "hive" approach, or an insert via spark.sql:
val dfX = Seq((88,"John", 888)).toDF("id" ,"name", "age")
dfX.write.format("hive").mode("append").saveAsTable("tab5")
or
spark.sql(""" INSERT INTO tab5 VALUES (1, 'Amy Smith', 7) """)
will sometimes show an updated count, and sometimes not, when just the 2nd df.count() is issued. This is due to a lack of synchronization between Hive and Spark that may depend on some internal flagging of changes. In any event, it is not consistent. Double-checked.
This is mostly related to immutability, as I see it. DataFrames are immutable, hence changes in the original table are not reflected in them.
Once a DataFrame is evaluated, it is never calculated again. So once the DataFrame named df is evaluated, it is a picture of table1 at the time of evaluation; it doesn't matter if table1 changes, df won't. So the second df.count() does not trigger evaluation, it just returns the previous result, which is 2.
If you want the desired result, you have to load the DataFrame again into a different variable:
val df = spark.sql("select * from dummy.table1")
df.count() //Will trigger evaluation and return 2
//Insert record
val df2 = spark.sql("select * from dummy.table1")
df2.count() //Will trigger evaluation and return 3
Or use var instead of val (which is bad practice):
var df = spark.sql("select * from dummy.table1")
df.count() //Will trigger evaluation and return 2
//Insert record
df = spark.sql("select * from dummy.table1")
df.count() //Will trigger evaluation and return 3
That said: yes, spark.read and spark.sql are lazy; they are not executed until an action is found, but once that happens, evaluation won't be triggered again for that DataFrame.
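For completeness, the same re-read pattern in pyspark, matching the question's own snippet (this is only a sketch of the Scala example above, using the question's dummy.table1):
df = spark.sql("select * from dummy.table1")
df.count()  # triggers evaluation and returns 2
# ... a record is inserted into dummy.table1 ...
df = spark.sql("select * from dummy.table1")  # build a new DataFrame over the table
df.count()  # triggers evaluation again and returns 3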

Spark to Spark SQL equivalent syntax

I have these two lines in Spark and I want to get the equivalent in Spark SQL (I'm working in a Python environment):
df = spark_df.filter(F.col("col_name".lower()).rlike("[0-9]{9}$")).count()
spark_df = spark_df.withColumn(columnname, F.to_date(F.col(columnname), "yyyyMMdd"))
For Spark SQL, first register the DataFrame as a temp view, then run the SQL.
Example:
spark_df.createOrReplaceTempView("tmp")
df=spark.sql("""select count(*) from tmp where lower(col_name) rlike("[0-9]{9}$") """).collect()[0][0]
spark_df = spark.sql("""select *, to_date(columnname,"yyyyMMdd") columnname from tmp """)

Hive HQL to Spark SQL conversion

I have a requirement to convert HQL to Spark SQL. I am using the approach below, but I am not seeing much change in performance. If anybody has a better suggestion, please let me know.
Hive:
create table temp1 as select * from Table1 T1 join (select id , min(activity_date) as dt from Table1 group by id) T2 on T1.id=T2.id and T1.activity_date=T2.dt ;
create table temp2 as select * from temp1 join diff_table
I have around 70 such intermediate Hive temp tables, and the source Table1 holds around 1.8 billion records, with no partitioning and 200 HDFS files.
Spark code, running with 20 executors, 5 executor cores, 10G executor memory, yarn-client mode, and a 4G driver:
import org.apache.spark.sql.{Row,SaveMode,SparkSession}
val spark=SparkSession.builder().appName("test").config("spark.sql.warehouse.dir","/usr/hive/warehouse").enableHiveSupport().getOrCreate()
import spark.implicits._
import spark.sql
val id_df=sql("select id , min(activity_date) as dt from Table1 group by id")
val all_id_df=sql("select * from Table1")
id_df.createOrReplaceTempView("min_id_table")
all_id_df.createOrReplaceTempView("all_id_table")
val temp1_df=sql("select * from all_id_table T1 join min_id_table T2 on T1.id=T2.id and T1.activity_date=T2.dt")
temp1_df.createOrReplaceTempView("temp2")
sql("create or replace table temp as select * from temp2")

Databricks: Inserting an argument into the LIKE-clause of a query in a %sql cell

In a Databricks notebook, I have a widget that allows setting a value for the argument kw. I need to use that value in a query as part of a LIKE clause. The snippet below runs, but doesn't return anything (even when it should).
%sql
SELECT *
FROM table
WHERE keyword LIKE '%getArgument("kw")%'
I don't know what kw represents, but I think it should be:
sqlContext.sql("SELECT * FROM SomeTable WHERE SomeField LIKE CONCAT('%', kw, '%')")
Use the appropriate libraries:
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc) // Make sure you use HiveContext
import sqlContext.implicits._
sqlContext.sql("SELECT * FROM SomeTable WHERE SomeField LIKE CONCAT('%', kw, '%')")
This works:
%sql
SELECT *
FROM table
WHERE keyword LIKE '%$kw%'
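For reference, the widget value can also be read in a Python cell with dbutils.widgets.get and spliced into the query there. A minimal sketch (your_table and keyword stand in for the real table and column):
kw = dbutils.widgets.get("kw")  # read the widget value set in the notebook
display(spark.sql(f"SELECT * FROM your_table WHERE keyword LIKE '%{kw}%'"))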

While joining two DataFrames in Spark, getting an empty result

I am trying to join two DataFrames in Spark that come from a Cassandra database.
val table1=cc.sql("select * from test123").as("table1")
val table2=cc.sql("select * from test1234").as("table2")
table1.join(table2, table1("table1.id") === table2("table2.id1"), "inner")
.select("table1.name", "table2.name1")
The result I am getting is empty.
You can try the pure SQL way, if you are unsure of the join syntax here.
table1.registerTempTable("tbl1")
table2.registerTempTable("tbl2")
val table3 = sqlContext.sql("SELECT tbl1.name, tbl2.name1 FROM tbl1 INNER JOIN tbl2 ON tbl1.id = tbl2.id1")
Also, you should check whether table1 and table2 really have matching ids to join on in the first place (see the check sketched below).
Update:
import org.apache.spark.sql.SQLContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Ideally, yes, csc should also work.
You should refer to http://spark.apache.org/docs/latest/sql-programming-guide.html
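To make that id check concrete, a minimal pyspark sketch (table1 and table2 here stand for the same two Cassandra tables loaded through the Python API):
# Count how many join keys the two tables actually share.
overlap = table1.select("id").intersect(table2.selectExpr("id1 as id")).count()
print(overlap)  # 0 would explain the empty join result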
First union both DataFrames, and after that register the result as a temp table.
