While joining two dataframes in Spark, getting empty result - apache-spark

I am trying to join two dataframes in Spark that come from a Cassandra database.
val table1=cc.sql("select * from test123").as("table1")
val table2=cc.sql("select * from test1234").as("table2")
table1.join(table2, table1("table1.id") === table2("table2.id1"), "inner")
.select("table1.name", "table2.name1")
The result I am getting is empty.

You can try the pure SQL way if you are unsure of the join syntax here.
table1.registerTempTable("tbl1")
table2.registerTempTable("tbl2")
val table3 = sqlContext.sql("Select tbl1.name, tbl2.name FROM tbl1 INNER JOIN tbl2 on tbl1.id=tbl2.id")
Also, you should check whether table1 and table2 really have matching ids to join on in the first place.
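A minimal sketch of such a check, assuming the same table1/table2 DataFrames and the key columns id and id1 from the question:
// Sanity check: how many distinct keys are on each side, and how many actually match?
val ids1 = table1.select("id").distinct()
val ids2 = table2.select("id1").distinct()
val matching = ids1.join(ids2, ids1("id") === ids2("id1")).count()
println(s"ids in table1: ${ids1.count()}, ids in table2: ${ids2.count()}, matching: $matching")
If the matching count is zero, the empty join result is expected; a common culprit is that the id values simply don't overlap, or that the two columns hold different types.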
Update :-
import org.apache.spark.sql.SQLContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Ideally, yes, csc should also work.
You should refer to http://spark.apache.org/docs/latest/sql-programming-guide.html

First union both DataFrames and after that register the result as a temp table.

Related

Are spark.read and spark.sql lazy transformations?

In Spark, if the source data has changed in between two action calls, why do I still get the previous output and not the most recent one? Through the DAG, all operations, including the read, should get executed once an action is called. Isn't that so?
e.g.
df = spark.sql("select * from dummy.table1")
#Reading from spark table which has two records into dataframe.
df.count()
#Gives count as 2 records
Now, a record is inserted into the table and the action is called without re-running command1.
df.count()
#Still gives count as 2 records.
I was expecting Spark to execute the read operation again and fetch a total of 3 records into the dataframe.
Where is my understanding wrong?
To contrast your assertion, the example below does give a difference, using Databricks Notebook cells. The exact insert operation you indicate is not known to me.
But the following, using a parquet- or csv-based Spark table rather than a Hive table, does force a difference in results, because the files making up the table change. For a DAG re-computation, the same set of files is used, as far as I know.
//1st time in a cell
val df = spark.read.csv("/FileStore/tables/count.txt")
df.write.mode("append").saveAsTable("tab2")
//1st time in another cell
val df2 = spark.sql("select * from tab2")
df2.count()
//4 is returned
//2nd time in a different cell
val df = spark.read.csv("/FileStore/tables/count.txt")
df.write.mode("append").saveAsTable("tab2")
//2nd time in another cell
df2.count()
//8 is returned
This refutes your assertion. I also tried with .enableHiveSupport(); no difference.
Even if creating a Hive table directly in Databricks:
spark.sql("CREATE TABLE tab5 (id INT, name STRING, age INT) STORED AS ORC;")
spark.sql(""" INSERT INTO tab5 VALUES (1, 'Amy Smith', 7) """)
...
df.count()
...
spark.sql(""" INSERT INTO tab5 VALUES (2, 'Amy SmithS', 77) """)
df.count()
...
I still get updated counts.
However, for a Hive-created ORC SerDe table, the following "hive" write approach, or an insert via spark.sql:
val dfX = Seq((88,"John", 888)).toDF("id" ,"name", "age")
dfX.write.format("hive").mode("append").saveAsTable("tab5")
or
spark.sql(""" INSERT INTO tab5 VALUES (1, 'Amy Smith', 7) """)
will sometimes show and sometimes not show an updated count when just the second df.count() is issued. This is due to a lack of synchronization between Hive and Spark that may depend on some internal flagging of changes. In any event it is not consistent; I double-checked.
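If stale metadata is suspected here, one diagnostic worth trying (not something the experiment above covers) is refreshing the table before re-counting:
// Invalidate any cached metadata/file listing Spark holds for the table.
spark.catalog.refreshTable("tab5")
// Equivalent SQL form:
spark.sql("REFRESH TABLE tab5")
spark.sql("select * from tab5").count()  // re-read and count again
Whether this resolves the inconsistency depends on where the staleness comes from, so treat it as a diagnostic step rather than a guaranteed fix.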
This is mostly related to immutability, as I see it. DataFrames are immutable, hence changes in the original table are not reflected in them.
Once a dataframe is evaluated, it will never be calculated again. So once the dataframe named df is evaluated, it is a picture of table1 at the time of evaluation; it doesn't matter if table1 changes afterwards, df won't. So the second df.count() does not trigger evaluation, it just returns the previous result, which is 2.
If you want the desired results you have to load the DF again into a different variable:
val df = spark.sql("select * from dummy.table1")
df.count() //Will trigger evaluation and return 2
//Insert record
val df2 = spark.sql("select * from dummy.table1")
df2.count() //Will trigger evaluation and return 3
Or use var instead of val (which is bad practice):
var df = spark.sql("select * from dummy.table1")
df.count() //Will trigger evaluation and return 2
//Insert record
df = spark.sql("select * from dummy.table1")
df.count() //Will trigger evaluation and return 3
That said: yes, spark.read and spark.sql are lazy; they are not executed until an action is found, but once that happens, evaluation won't be triggered again for that dataframe.

PySpark Partition Dataframe created from HiveContext

I'm fetching data through HiveContext and creating DataFrames. To achieve performance benefits, I want to partition the DFs before applying the join operation. How can I partition the data on the 'ID' column and then apply a join on 'ID'?
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
hiveCtx = HiveContext(spark)
df1 = hiveCtx.sql("select id,name,address from db.table1")
df2 = hiveCtx.sql("select id,name,marks from db.table2")
I need to perform the following operations on the data:
Dataframe partitionBy 'ID'
Join by 'ID'
You can use repartition.
Refer to the Spark docs: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=repartition#pyspark.sql.DataFrame.repartition
Based on your data size, choose the number of partitions.
df1 = df1.repartition(7, "id")

Spark zeppelin: how to obtain %sql result in %pyspark interpreter?

I know I can use
%pyspark
df = sqlContext.sql('select * from train_table')
And I can use df.registerTempTable('xxx') to make df accessible in %sql.
But sometimes I would like to use %sql to draw a plot, and the calculation may be expensive:
%sql
select C.name, count(C.name) from orderitems as A
left join clientpagemodules as C on C.code = A.from_module
left join orders as B on A.ref_id = B.id
left join products as P on P.id = A.product_id
where B.time_create > (unix_timestamp(NOW()) - 3600*24*30) *1000 group by C.name
If I decide to write some code to clean the result, I have to move the SQL above into df = sqlContext.sql(sql) and calculate it again.
I wonder, is there any way to access the %sql result in %pyspark?
I'm not aware of a way to do it after you have executed your SQL statement, but you can access the temporary table created in %sql from %pyspark if you register it as a temporary view initially:
%sql
--initial step
CREATE OR REPLACE TEMPORARY VIEW temp_bla AS select * from YOURSTATEMENT
%sql
--your work as usual
Select * from temp_bla
%pyspark
# and continuing in pyspark
spark.sql('select * from temp_bla').show()
This is how you get the SQL table in another paragraph as a pandas DataFrame:
%sql(saveAs=choose_name)
SELECT * FROM your_table
%pyspark
dataframe = z.getAsDataFrame('choose_name')
As written in the Zeppelin %python docs

Will Spark reuse an RDD in the DAG for a single action?

Will Spark reuse an RDD in the DAG for a single action?
Case 1
val df1 = spark.sql("select id, value from table")
val df2 = spark.sql("select id, value from table")
df1.join(df2, "id").show()
Case 2
val df1 = spark.sql("select id, value from table")
val df2 = df1.filter($"value" > 0)
df1.join(df2, "id").show()
Questions
In case 1, will the query select id, value from table be executed only once?
In case 2, will the query be executed only once?
If not, how can I optimize the code so that the query executes only once? The query may be very slow.
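A common way to guarantee that the underlying query runs only once, independent of whatever plan-level reuse Spark may or may not do, is to cache the shared DataFrame and derive both sides of the join from it; a minimal sketch along the lines of case 2:
import spark.implicits._
// Materialized once on the first action, then served from the cache.
val base = spark.sql("select id, value from table").cache()
val df1 = base
val df2 = base.filter($"value" > 0)
df1.join(df2, "id").show()  // the source is read once; both join sides reuse the cached data
base.unpersist()            // release the cached data when it is no longer needed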

INSERT IF NOT EXISTS ELSE UPDATE in Spark SQL

Is there any provision for doing "INSERT IF NOT EXISTS ELSE UPDATE" in Spark SQL?
I have a Spark SQL table "ABC" that has some records.
I then have another batch of records that I want to insert or update in this table, based on whether they already exist in the table or not.
Is there a SQL command that I can use in a SQL query to make this happen?
In regular Spark this could be achieved with a join followed by a map like this:
import spark.implicits._
val df1 = spark.sparkContext.parallelize(List(("id1", "orginal"), ("id2", "original"))).toDF("df1_id", "df1_status")
val df2 = spark.sparkContext.parallelize(List(("id1", "new"), ("id3","new"))).toDF("df2_id", "df2_status")
val df3 = df1
  .join(df2, 'df1_id === 'df2_id, "outer")
  .map(row => {
    if (row.isNullAt(2))
      (row.getString(0), row.getString(1))
    else
      (row.getString(2), row.getString(3))
  })
This yields:
scala> df3.show
+---+--------+
| _1| _2|
+---+--------+
|id3| new|
|id1| new|
|id2|original|
+---+--------+
You could also use select with udfs instead of map, but in this particular case with null-values, I personally prefer the map variant.
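For reference, a sketch of that select-with-udf variant, assuming the same df1/df2 as above:
import org.apache.spark.sql.functions.udf
// Prefer the df2 value when it is present, otherwise fall back to the df1 value.
val preferNew = udf((oldValue: String, newValue: String) =>
  if (newValue != null) newValue else oldValue)
val df3b = df1
  .join(df2, 'df1_id === 'df2_id, "outer")
  .select(
    preferNew('df1_id, 'df2_id).as("id"),
    preferNew('df1_status, 'df2_status).as("status"))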
You can use Spark SQL like this:
select * from (
  select c.*, row_number() over (partition by tac order by tag desc) as TAG_NUM
  from (
    select a.tac, a.name, 0 as tag from tableA a
    union all
    select b.tac, b.name, 1 as tag from tableB b
  ) c
) d
where TAG_NUM = 1
tac is the column you want to insert/update by.
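To run this from code, a sketch assuming the existing records sit in a DataFrame named existingDf and the new batch in newDf (hypothetical names), both with columns tac and name:
existingDf.createOrReplaceTempView("tableA")
newDf.createOrReplaceTempView("tableB")
val merged = spark.sql("""
  select * from (
    select c.*, row_number() over (partition by tac order by tag desc) as TAG_NUM
    from (
      select a.tac, a.name, 0 as tag from tableA a
      union all
      select b.tac, b.name, 1 as tag from tableB b
    ) c
  ) d
  where TAG_NUM = 1
""")
// merged keeps the helper columns tag and TAG_NUM; drop them if they are not needed:
val result = merged.drop("tag", "TAG_NUM")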
I know it's a bit late to share my code, but to add or update my database, I wrote a function that looks like this:
import pandas as pd
# Returns a spark dataframe with added and updated data
# The key parameter is the primary key of the dataframes
# The two parameters dfToUpdate and dfToAddAndUpdate are spark dataframes
def AddOrUpdateDf(dfToUpdate, dfToAddAndUpdate, key):
    # Cast the spark dataframe dfToUpdate to a pandas dataframe
    dfToUpdatePandas = dfToUpdate.toPandas()
    # Cast the spark dataframe dfToAddAndUpdate to a pandas dataframe
    dfToAddAndUpdatePandas = dfToAddAndUpdate.toPandas()
    # Update the table records with the latest records, and add new records if there are any
    AddOrUpdatePandasDf = pd.concat([dfToUpdatePandas, dfToAddAndUpdatePandas]).drop_duplicates([key], keep='last').sort_values(key)
    # Cast back to get a spark dataframe
    AddOrUpdateDf = spark.createDataFrame(AddOrUpdatePandasDf)
    return AddOrUpdateDf
As you can see, we need to convert the Spark dataframes to pandas dataframes to be able to do the pd.concat, and especially the drop_duplicates with keep='last'; then we convert back to a Spark dataframe and return it.
I don't think this is the best way to handle the add-or-update, but at least it works.
