Joining 3 SchemaRDDs - apache-spark

I'm doing a three-way join with three SchemaRDDs (each on the order of a million records, stored in Parquet files on HDFS).
The schema is as follows:
table1 has four fields: id, group_id, t2_id and a date
table2 has three fields: id, group_id and t3_id
table3 has three fields: id, group_id and date
I'm trying to figure out the relationships between table1 and table3 within a group.
The SQL Query I'd use would be:
SELECT t1.group_id, t1.id, t3.id
FROM table1 t1, table2 t2, table3 t3
WHERE t1.group_id = t2.group_id AND t1.t2_id = t2.id
  AND t2.group_id = t3.group_id AND t2.t3_id = t3.id
  AND t3.date < t1.date
However, I'm trying to do it in Spark:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
import org.apache.spark.sql.catalyst.plans.{Inner, JoinType}
val tab1 = sqlContext.parquetFile("warehouse/tab1_pq")
val tab2 = sqlContext.parquetFile("warehouse/tab2_pq")
val tab3 = sqlContext.parquetFile("warehouse/tab3_pq")
val relationship = tab1.as('t1)
  .join(tab2.as('t2), Inner, Some(("t2.group_id".attr === "t1.group_id".attr) && ("t2.id".attr === "t1.t2_id".attr)))
  .join(tab3.as('t3), Inner, Some(("t3.group_id".attr === "t2.group_id".attr) && ("t3.id".attr === "t2.t3_id".attr)))
  .where("t3.date".attr <= "t1.date".attr)
  .select("t1.group_id".attr, "t1.id".attr, "t3.id".attr)
This seems to work; however, it runs significantly slower than Impala on the same 3-node EMR cluster. Is this the right way to do it? Is there a way to make it more performant?
Thanks for any help
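One thing worth trying, sketched below with the same Parquet paths as above (whether it closes the gap to Impala is not something I can promise), is to register each file as a table and express the three-way join in plain SQL, so the query planner picks the join strategy instead of the hand-built DSL plan:

val tab1 = sqlContext.parquetFile("warehouse/tab1_pq")
val tab2 = sqlContext.parquetFile("warehouse/tab2_pq")
val tab3 = sqlContext.parquetFile("warehouse/tab3_pq")

// Register the SchemaRDDs so they can be referenced by name in SQL
// (registerTempTable in Spark 1.1+, registerAsTable in earlier releases).
tab1.registerTempTable("t1")
tab2.registerTempTable("t2")
tab3.registerTempTable("t3")

val relationship = sqlContext.sql("""
  SELECT t1.group_id, t1.id, t3.id
  FROM t1
  JOIN t2 ON t2.group_id = t1.group_id AND t2.id = t1.t2_id
  JOIN t3 ON t3.group_id = t2.group_id AND t3.id = t2.t3_id
  WHERE t3.date < t1.date
""")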

Related

Join apache spark dataframes properly with scala avoiding null values

Hello everyone!
I have two DataFrames in Apache Spark (2.3) and I want to join them properly. I will explain below what I mean by 'properly'. First of all, the two dataframes hold the following information:
nodeDf: ( id, year, title, authors, journal, abstract )
edgeDf: ( srcId, dstId, label )
The label is 1 if node1 is connected with node2 and 0 otherwise.
I want to combine these two dataframes to get one dataframe with the following information:
JoinedDF: ( id_from, year_from, title_from, journal_from, abstract_from, id_to, year_to, title_to, journal_to, abstract_to, time_dist )
time_dist = abs(year_from - year_to)
By 'properly' I mean that the query must be as fast as possible, and I don't want any null rows or cells (null values in a row).
I have tried the following, but it took 500-540 seconds to execute the query and the final dataframe contains null values. I don't even know if the dataframes were joined correctly.
I want to mention that the node file from which I create the nodeDF has 27770 rows and the edge file (edgeDf) has 615512 rows.
Code:
val spark = SparkSession.builder().master("local[*]").appName("Logistic Regression").getOrCreate()
val sc = spark.sparkContext
val data = sc.textFile("resources/data/training_set.txt").map(line =>{
val fields = line.split(" ")
(fields(0),fields(1), fields(2).toInt)
})
val data2 = sc.textFile("resources/data/test_set.txt").map(line =>{
val fields = line.split(" ")
(fields(0),fields(1))
})
import spark.implicits._
val trainingDF = data.toDF("srcId","dstId", "label")
val testDF = data2.toDF("srcId","dstId")
val infoRDD = spark.read.option("header","false").option("inferSchema","true").format("csv").load("resources/data/node_information.csv")
val infoDF = infoRDD.toDF("srcId","year","title","authors","jurnal","abstract")
println("Showing linksDF sample...")
trainingDF.show(5)
println("Rows of linksDF: ",trainingDF.count())
println("Showing infoDF sample...")
infoDF.show(2)
println("Rows of infoDF: ",infoDF.count())
println("Joining linksDF and infoDF...")
var joinedDF = trainingDF.as("a").join(infoDF.as("b"),$"a.srcId" === $"b.srcId")
println(joinedDF.count())
joinedDF = joinedDF.select($"a.srcId",$"a.dstId",$"a.label",$"b.year",$"b.title",$"b.authors",$"b.jurnal",$"b.abstract")
joinedDF.show(5)
val graphX = new GraphX()
val pageRankDf =graphX.computePageRank(spark,"resources/data/training_set.txt",0.0001)
println("Joining joinedDF and pageRankDf...")
joinedDF = joinedDF.as("a").join(pageRankDf.as("b"),$"a.srcId" === $"b.nodeId")
var dfWithRanks = joinedDF.select("srcId","dstId","label","year","title","authors","jurnal","abstract","rank").withColumnRenamed("rank","pgRank")
dfWithRanks.show(5)
println("Renameming joinedDF...")
dfWithRanks = dfWithRanks
.withColumnRenamed("srcId","id_from")
.withColumnRenamed("dstId","id_to")
.withColumnRenamed("year","year_from")
.withColumnRenamed("title","title_from")
.withColumnRenamed("authors","authors_from")
.withColumnRenamed("jurnal","jurnal_from")
.withColumnRenamed("abstract","abstract_from")
var infoDfRenamed = dfWithRanks
.withColumnRenamed("id_from","id_from")
.withColumnRenamed("id_to","id_to")
.withColumnRenamed("year_from","year_to")
.withColumnRenamed("title_from","title_to")
.withColumnRenamed("authors_from","authors_to")
.withColumnRenamed("jurnal_from","jurnal_to")
.withColumnRenamed("abstract_from","abstract_to").select("id_to","year_to","title_to","authors_to","jurnal_to","jurnal_to")
var finalDF = dfWithRanks.as("a").join(infoDF.as("b"),$"a.id_to" === $"b.srcId")
finalDF = finalDF
.withColumnRenamed("year","year_to")
.withColumnRenamed("title","title_to")
.withColumnRenamed("authors","authors_to")
.withColumnRenamed("jurnal","jurnal_to")
.withColumnRenamed("abstract","abstract_to")
println("Dropping unused columns from joinedDF...")
finalDF = finalDF.drop("srcId")
finalDF.show(5)
Here are my results.
You can ignore all the calculations and code related to pgRank. Is there a proper way to make this join work?
You can filter your data first and then join; that way you avoid the nulls:
df.filter($"ColumnName".isNotNull)
Alternatively, use the <=> (null-safe equality) operator in your join condition:
var joinedDF = trainingDF.as("a").join(infoDF.as("b"),$"a.srcId" <=> $"b.srcId")
In Spark 2.1 or later, the equivalent function is eqNullSafe:
var joinedDF = trainingDF.join(infoDF,trainingDF("srcId").eqNullSafe(infoDF("srcId")))
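For the shape of result the question asks for, a sketch along the following lines may help (column names are taken from the question; nodeDf and edgeDf are assumed to be already loaded with non-null keys). It joins the node information onto both ends of each edge and computes time_dist, which avoids the long chain of withColumnRenamed calls:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{abs, col}

// Prefix the node columns once per side instead of renaming them after the join.
def suffixed(df: DataFrame, suffix: String): DataFrame =
  df.select(df.columns.map(c => col(c).as(s"${c}_$suffix")): _*)

val fromDf = suffixed(nodeDf, "from")   // id_from, year_from, title_from, ...
val toDf   = suffixed(nodeDf, "to")     // id_to, year_to, title_to, ...

val joinedDF = edgeDf
  .join(fromDf, edgeDf("srcId") === fromDf("id_from"))
  .join(toDf, edgeDf("dstId") === toDf("id_to"))
  .withColumn("time_dist", abs(col("year_from") - col("year_to")))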

Will Spark reuse an RDD in the DAG for a single action?

Will Spark reuse an RDD in the DAG for a single action?
Case 1
val df1 = spark.sql("select id, value from table")
val df2 = spark.sql("select id, value from table")
df1.join(df2, "id").show()
Case 2
val df1 = spark.sql("select id, value from table")
val df2 = df1.filter($"value" > 0)
df1.join(df2, "id").show()
Questions
In Case 1, will the query select id, value from table be executed only once?
In Case 2, will the query be executed only once?
If not, how can I optimize the code so that the query executes only once? The query may be very slow.
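If the physical plan does end up scanning the table twice (which is what you would typically see in Case 1, and can also happen in Case 2 since the filtered branch re-reads the same source), one way to guarantee a single execution, sketched here as a minimal example, is to cache the shared DataFrame before reusing it:

import spark.implicits._

// cache() materializes df1 the first time an action needs it, so the
// underlying query runs once and both sides of the join read the cached
// data instead of re-executing the scan.
val df1 = spark.sql("select id, value from table").cache()
val df2 = df1.filter($"value" > 0)

df1.join(df2, "id").show()

df1.unpersist()   // release the cached data once you are done with it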

Confusion in Spark's partitioning strategy on dataFrames

In the following I am getting the same number of partitions (200) in all four print statements. The initial dataframe (df1) is partitioned on 4 columns (account_id, schema_name, table_name, column_name), but the subsequent dataframes are partitioned only on 3 fields (account_id, schema_name, table_name). Can someone please explain whether Spark retains the partitioning strategy from step 1 through step 4, so that it doesn't need to shuffle the data again after step 1?
val query1: String =
  """SELECT account_id, schema_name, table_name, column_name,
    |       COLLECT_SET(u.query_id) AS query_id_set
    |FROM usage_tab
    |GROUP BY account_id, schema_name, table_name, column_name""".stripMargin
val df1 = session.sql(query1)
println("1 " + df1.rdd.getNumPartitions)
df1.createOrReplaceTempView("wtftempusage")
val query2 = "SELECT DISTINCT account_id, schema_name, table_name FROM wtftempusage"
val df2 = session.sql(query2)
println("2 " + df2.rdd.getNumPartitions)
//MyFuncIterator retains all columns for df2 and adds an additional column
val extendedDF = df2.mapPartitions(MyFuncIterator)
println("3 " + extendedDF.rdd.getNumPartitions)
val joinedDF = df1.join(extendedDF, Seq("account_id", "schema_name", "table_name"))
println("4 " + joinedDF.rdd.getNumPartitions)
Thanks,
Devj
The default number of shuffle partitions in the DataFrame API is 200.
You can set spark.sql.shuffle.partitions to a lower number, for example:
sqlContext.setConf("spark.sql.shuffle.partitions", "5")
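If you are on the SparkSession API (as the question's session.sql calls suggest), the equivalent, as a minimal sketch, would be:

// Same setting through the SparkSession (here called `session`, as in the
// question); 5 is only an example value.
session.conf.set("spark.sql.shuffle.partitions", "5")

The value can also be supplied at build time via .config("spark.sql.shuffle.partitions", "5") on SparkSession.builder().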

How to avoid the multiple-jobs-spawning scenario in Spark SQL

I have an application in which I query Hive, and the query returns a small set of records (barely 50). For each record returned I have to fire another query on Hive and get the relevant dataframe. This is how it would look:
val employeeIds = hiveContext.sql("select id from employee")
val vertices = employeeIds.foreach(row => {
val employeeId = row.getInt(0)
val query = s""" select * from department where employeeId = $employeeId"""
//.... I would have to create a hive context here ....
})
But if I do this, new contexts would be spawned from the executors. Any pointers on how to avoid this approach would be very helpful.
Note:
I have masked the information to post it on Stack Overflow. I have to fire a query based on the records of the first query; I cannot join the employee and department tables.
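Given that the outer query returns only about 50 rows, one hedged workaround (names taken from the question; whether the per-id queries are fast enough depends on your data) is to collect those ids to the driver and reuse the single existing HiveContext there, instead of trying to create contexts inside foreach on the executors:

// Sketch: bring the ~50 ids back to the driver, then issue the per-id
// queries from the driver with the HiveContext that already exists.
val employeeIds = hiveContext.sql("select id from employee")
  .collect()
  .map(_.getInt(0))

val departmentDfs = employeeIds.map { employeeId =>
  hiveContext.sql(s"select * from department where employeeId = $employeeId")
}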

While joining two dataframes in Spark, getting an empty result

I am trying to join two dataframes in Spark that come from a Cassandra database.
val table1=cc.sql("select * from test123").as("table1")
val table2=cc.sql("select * from test1234").as("table2")
table1.join(table2, table1("table1.id") === table2("table2.id1"), "inner")
.select("table1.name", "table2.name1")
The result I am getting is empty.
You can try the pure SQL way, if you are unsure of the join syntax here.
table1.registerTempTable("tbl1")
table2.registerTempTable("tbl2")
val table3 = sqlContext.sql("SELECT tbl1.name, tbl2.name1 FROM tbl1 INNER JOIN tbl2 ON tbl1.id = tbl2.id1")
Also, you should check whether table1 and table2 really have matching ids to join on in the first place.
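A quick way to check that, sketched here with the column names from the question, is to count the overlapping keys before joining:

import org.apache.spark.sql.functions.col

// If this prints 0, the inner join is empty simply because no keys match
// (for example, differing types or stray whitespace in the id columns).
val ids1 = table1.select(col("id"))
val ids2 = table2.select(col("id1").as("id"))
println(ids1.intersect(ids2).count())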
Update :-
import org.apache.spark.sql.SQLContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Ideally, yes, csc should also work.
You should refer to http://spark.apache.org/docs/latest/sql-programming-guide.html
Another option: first union both data frames, and after that register the result as a temp table.
