Confusion about Spark's partitioning strategy on DataFrames - apache-spark

In the following I get the same number of partitions (200) in all four print statements. The initial DataFrame (df1) is partitioned on 4 columns (account_id, schema_name, table_name, column_name), but the subsequent DataFrames are partitioned on only 3 fields (account_id, schema_name, table_name). Can someone please explain whether Spark retains the partitioning strategy across steps 1-4 and therefore doesn't need to shuffle the data again after step 1?
val query1: String = """SELECT account_id, schema_name, table_name, column_name,
    COLLECT_SET(u.query_id) AS query_id_set
  FROM usage_tab u
  GROUP BY account_id, schema_name, table_name, column_name"""
val df1 = session.sql(query1)
println("1 " + df1.rdd.getNumPartitions)
df1.createOrReplaceTempView("wtftempusage")
val query2 = "SELECT DISTINCT account_id, schema_name, table_name
FROM wtftempusage"
val df2 = session.sql(query2)
println("2 " + df2.rdd.getNumPartitions)
//MyFuncIterator retains all columns for df2 and adds an additional column
val extendedDF = df2.mapPartitions(MyFuncIterator)
println("3 " + extendedDF.rdd.getNumPartitions)
val joinedDF = df1.join(extendedDF, Seq("account_id", "schema_name", "table_name"))
println("4 " + joinedDF.rdd.getNumPartitions)
Thanks,
Devj

The default number of shuffle partitions in the DataFrame API is 200.
You can set spark.sql.shuffle.partitions to a lower number, for example:
sqlContext.setConf("spark.sql.shuffle.partitions", "5")
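On Spark 2.x the same setting can be changed through the SparkSession, and the physical plan shows whether a shuffle is actually planned for each step. A minimal sketch, assuming a SparkSession named spark and the df1/joinedDF variables from the question:

// Lower the shuffle partition count used by subsequent wide operations
spark.conf.set("spark.sql.shuffle.partitions", "5")

// Each Exchange operator in the printed physical plan corresponds to a shuffle
df1.explain()
joinedDF.explain()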

Related

how to check number of partitions count in amazon keyspaces?

Is there a way to create a dashboard to check the number of partitions created,
and whether the number of partition keys is equal to the number of partitions?
You will want to use AWS Glue and the Spark Cassandra connector. You can use the following to grab the distinct keys of any combination of columns; the script below reads a comma-separated list of column names to use for the distinct count. Make sure you first enable the Murmur3 partitioner.
// Table and keyspace are passed in as Glue job arguments
val tableName = args("TABLE_NAME")
val keyspaceName = args("KEYSPACE_NAME")
// Read the Keyspaces table through the Spark Cassandra connector
val tableDf = sparkSession.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> tableName, "keyspace" -> keyspaceName, "pushdown" -> "false"))
  .load()
// Comma-separated list of key columns to count distinct combinations of
val distinctKeys = args("DISTINCT_KEYS").filterNot(_.isWhitespace).split(",")
logger.info("distinctKeys: " + distinctKeys.mkString(", "))
val results = tableDf.select(distinctKeys.head, distinctKeys.tail:_*).distinct().count()
logger.info("Total number of distinct keys: " + results)
The full example can be found here.
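If you also want to see how rows are spread across those keys (for the dashboard question), a small sketch reusing the tableDf and distinctKeys values from above could group by the same columns and count; what exactly you want to chart is an assumption here:

import org.apache.spark.sql.functions.desc

// Number of rows stored under each distinct partition key combination
val rowsPerKey = tableDf.groupBy(distinctKeys.head, distinctKeys.tail:_*).count()
rowsPerKey.orderBy(desc("count")).show(20)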

Join apache spark dataframes properly with scala avoiding null values

Hello everyone!
I have two DataFrames in Apache Spark (2.3) and I want to join them properly. I will explain below what I mean by 'properly'. First of all, the two DataFrames hold the following information:
nodeDf: ( id, year, title, authors, journal, abstract )
edgeDf: ( srcId, dstId, label )
The label is 1 if node1 is connected with node2 and 0 otherwise.
I want to combine these two DataFrames into one DataFrame with the following information:
JoinedDF: ( id_from, year_from, title_from, journal_from, abstract_from, id_to, year_to, title_to, journal_to, abstract_to, time_dist )
time_dist = abs(year_from - year_to)
By 'properly' I mean that the query must be as fast as possible and the result must not contain null rows or cells (values in a row).
I have tried the following, but it took me 500-540 seconds to execute the query and the final DataFrame contains null values. I don't even know if the DataFrames were joined correctly.
I want to mention that the node file from which I create the nodeDf has 27770 rows and the edge file (edgeDf) has 615512 rows.
Code:
val spark = SparkSession.builder().master("local[*]").appName("Logistic Regression").getOrCreate()
val sc = spark.sparkContext
val data = sc.textFile("resources/data/training_set.txt").map(line =>{
val fields = line.split(" ")
(fields(0),fields(1), fields(2).toInt)
})
val data2 = sc.textFile("resources/data/test_set.txt").map(line =>{
val fields = line.split(" ")
(fields(0),fields(1))
})
import spark.implicits._
val trainingDF = data.toDF("srcId","dstId", "label")
val testDF = data2.toDF("srcId","dstId")
val infoRDD = spark.read.option("header","false").option("inferSchema","true").format("csv").load("resources/data/node_information.csv")
val infoDF = infoRDD.toDF("srcId","year","title","authors","jurnal","abstract")
println("Showing linksDF sample...")
trainingDF.show(5)
println("Rows of linksDF: ",trainingDF.count())
println("Showing infoDF sample...")
infoDF.show(2)
println("Rows of infoDF: ",infoDF.count())
println("Joining linksDF and infoDF...")
var joinedDF = trainingDF.as("a").join(infoDF.as("b"),$"a.srcId" === $"b.srcId")
println(joinedDF.count())
joinedDF = joinedDF.select($"a.srcId",$"a.dstId",$"a.label",$"b.year",$"b.title",$"b.authors",$"b.jurnal",$"b.abstract")
joinedDF.show(5)
val graphX = new GraphX()
val pageRankDf =graphX.computePageRank(spark,"resources/data/training_set.txt",0.0001)
println("Joining joinedDF and pageRankDf...")
joinedDF = joinedDF.as("a").join(pageRankDf.as("b"),$"a.srcId" === $"b.nodeId")
var dfWithRanks = joinedDF.select("srcId","dstId","label","year","title","authors","jurnal","abstract","rank").withColumnRenamed("rank","pgRank")
dfWithRanks.show(5)
println("Renameming joinedDF...")
dfWithRanks = dfWithRanks
.withColumnRenamed("srcId","id_from")
.withColumnRenamed("dstId","id_to")
.withColumnRenamed("year","year_from")
.withColumnRenamed("title","title_from")
.withColumnRenamed("authors","authors_from")
.withColumnRenamed("jurnal","jurnal_from")
.withColumnRenamed("abstract","abstract_from")
var infoDfRenamed = dfWithRanks
.withColumnRenamed("id_from","id_from")
.withColumnRenamed("id_to","id_to")
.withColumnRenamed("year_from","year_to")
.withColumnRenamed("title_from","title_to")
.withColumnRenamed("authors_from","authors_to")
.withColumnRenamed("jurnal_from","jurnal_to")
.withColumnRenamed("abstract_from","abstract_to").select("id_to","year_to","title_to","authors_to","jurnal_to","jurnal_to")
var finalDF = dfWithRanks.as("a").join(infoDF.as("b"),$"a.id_to" === $"b.srcId")
finalDF = finalDF
.withColumnRenamed("year","year_to")
.withColumnRenamed("title","title_to")
.withColumnRenamed("authors","authors_to")
.withColumnRenamed("jurnal","jurnal_to")
.withColumnRenamed("abstract","abstract_to")
println("Dropping unused columns from joinedDF...")
finalDF = finalDF.drop("srcId")
finalDF.show(5)
Here are my results!
Feel free to ignore the calculations and code related to pgRank. Is there a proper way to make this join work?
You can filter your data first and then join; that way you avoid the nulls:
df.filter($"ColumnName".isNotNull)
Or use the <=> operator in your join condition:
var joinedDF = trainingDF.as("a").join(infoDF.as("b"), $"a.srcId" <=> $"b.srcId")
In Spark 2.1 or greater there is also the eqNullSafe function:
var joinedDF = trainingDF.join(infoDF, trainingDF("srcId").eqNullSafe(infoDF("srcId")))
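If the goal is simply to drop rows whose keys are missing before an inner join, a minimal sketch (assuming srcId and dstId are the only columns that may be null) would be:

import spark.implicits._

// Drop rows with missing keys, then do a plain inner join on srcId
val cleanTraining = trainingDF.filter($"srcId".isNotNull && $"dstId".isNotNull)
val cleanInfo = infoDF.filter($"srcId".isNotNull)
val joined = cleanTraining.as("a").join(cleanInfo.as("b"), $"a.srcId" === $"b.srcId", "inner")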

Will Spark reuse an RDD in the DAG for a single action?

Will Spark reuse an RDD in the DAG for a single action?
Case 1
val df1 = spark.sql("select id, value from table")
val df2 = spark.sql("select id, value from table")
df1.join(df2, "id").show()
Case 2
val df1 = spark.sql("select id, value from table")
val df2 = df1.filter($"value" > 0)
df1.join(df2, "id").show()
Questions
In case 1, will the query select id, value from table be executed only once?
In case 2, will the query be executed only once?
If not, how can I optimize the code so the query executes only once? The query may be very slow.
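One common way to make sure the shared scan runs only once is to cache (or persist) the DataFrame before reusing it. A minimal sketch, assuming the cached data fits in memory or can spill to disk:

import org.apache.spark.storage.StorageLevel
import spark.implicits._

// Materialize the scan once and reuse it for both sides of the join
val df1 = spark.sql("select id, value from table").persist(StorageLevel.MEMORY_AND_DISK)
val df2 = df1.filter($"value" > 0)
df1.join(df2, "id").show()
df1.unpersist()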

How to distribute dataset evenly to avoid a skewed join (and long-running tasks)?

I am writing an application using the Spark Dataset API in a Databricks notebook.
I have 2 tables. One has 1.5 billion rows and the second 2.5 million. Both tables contain telecommunication data and the join is done using the country code and the first 5 digits of a number. The output has 55 billion rows. The problem is that I have skewed data (long-running tasks). No matter how I repartition the dataset, I get long-running tasks because of the uneven distribution of hashed keys.
I tried using broadcast joins, tried persisting the big table's partitions in memory, etc.
What are my options here?
Spark will repartition the data based on the join key, so repartitioning before the join won't change the skew (it only adds an unnecessary shuffle).
If you know the key that is causing the skew (usually it will be something like null or 0 or ""), split your data into 2 parts - one dataset with the skewed key, and another with the rest -
then do the join on the sub-datasets and union the results.
For example:
val df1 = ...
val df2 = ...
val skewKey = null
val df1Skew = df1.where($"key" === skewKey)
val df2Skew = df2.where($"key" === skewKey)
val df1NonSkew = df1.where($"key" =!= skewKey)
val df2NonSkew = df2.where($"key" =!= skewKey)
val dfSkew = df1Skew.join(df2Skew) //this is a cross join
val dfNonSkew = df1NonSkew.join(df2NonSkew, "key")
val res = dfSkew.union(dfNonSkew)
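If you are not sure which key value is causing the skew, a quick sketch (assuming the join column is named key, as in the example above) is to look at the heaviest keys on each side:

import org.apache.spark.sql.functions.desc

// The most frequent key values are the candidates for the skew split above
df1.groupBy("key").count().orderBy(desc("count")).show(10)
df2.groupBy("key").count().orderBy(desc("count")).show(10)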

Joining 3 SchemaRDDs

I'm doing a three-way join with three SchemaRDDs (each on the order of a million records, stored in Parquet files on HDFS).
The schema is as follows:
table1 has four fields: id, group_id, t2_id and a date
table2 has three fields: id, group_id and t3_id
table3 has three fields: id, group_id and date
I'm trying to figure out the relationships between table1 and table3 within a group.
The SQL query I'd use would be:
SELECT t1.group_id, t1.id, t3.id
FROM table1 t1, table2 t2, table3 t3
WHERE t1.group_id = t2.group_id AND t1.t2_id = t2.id
  AND t2.group_id = t3.group_id AND t2.t3_id = t3.id
  AND t3.date < t1.date
However, I'm trying to do it in Spark:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
import org.apache.spark.sql.catalyst.plans.{Inner, JoinType}
val tab1 = sqlContext.parquetFile("warehouse/tab1_pq")
val tab2 = sqlContext.parquetFile("warehouse/tab2_pq")
val tab3 = sqlContext.parquetFile("warehouse/tab3_pq")
val relationship = tab1.as('t1).
join(tab2.as('t2), Inner, Some(("t2.group_id".attr === "t1.group_id".attr) && ("t2.id".attr === "t1.t2_id".attr))).
join(tab3.as('t3), Inner, Some(("t3.group_id".attr === "t2.group_id".attr) && ("t3.id".attr === "t2.t3_id".attr))).
where("t3.date".attr <= "t1.date".attr).
select("t1.group_id".attr, "t1.id".attr, "t3.id".attr)
So this seems to work -- however it runs significantly slower than Impala on the same (3-node EMR) cluster. Is this the right way to do it? Is there a way to make it more performant?
Thanks for any help
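Since the SQL version of the query already exists, one thing worth trying (a sketch, not a guaranteed speedup) is to register the three Parquet tables and let the SQL planner drive the join. Note that the plain SQLContext parser in older Spark 1.x releases is limited, so a HiveContext may be needed for this syntax:

// Register the Parquet-backed SchemaRDDs so the original SQL can be run directly
tab1.registerTempTable("table1")
tab2.registerTempTable("table2")
tab3.registerTempTable("table3")

val relationship = sqlContext.sql("""
  SELECT t1.group_id, t1.id, t3.id
  FROM table1 t1, table2 t2, table3 t3
  WHERE t1.group_id = t2.group_id AND t1.t2_id = t2.id
    AND t2.group_id = t3.group_id AND t2.t3_id = t3.id
    AND t3.date < t1.date
""")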
