Understanding repartition in PySpark - apache-spark

My question is a bit conceptual and related to how things work under the hood. I have written the code below related to repartition.
df_cube = spark.createDataFrame([("Sachin", "M"), ("Dipti", "F"), ("Roshani", "F"), ("Tushar", "M"), ("Satyendra", "M")], ["Name", "Gender"])
data = df_cube.union(df_cube).repartition("Gender")
data.show()
The above code gives me the output below.
+---------+------+
| Name|Gender|
+---------+------+
| Dipti| F|
| Sachin| M|
| Roshani| F|
| Tushar| M|
|Satyendra| M|
| Dipti| F|
| Sachin| M|
| Roshani| F|
| Tushar| M|
|Satyendra| M|
+---------+------+
After that I repartition by both Name and Gender and get the output below.
df_name = data.repartition(7, "Name", "Gender")
df_name.show()
+---------+------+
| Name|Gender|
+---------+------+
| Tushar| M|
| Tushar| M|
|Satyendra| M|
|Satyendra| M|
| Sachin| M|
| Dipti| F|
| Sachin| M|
| Dipti| F|
| Roshani| F|
| Roshani| F|
+---------+------+
My main question is: how can I figure out the ordering of the rows when I call repartition on one column versus two columns, as shown above? How does Spark rearrange the rows? On my local machine it shows two partitions by default; is there a way to view which rows go into which partition, and after repartitioning, how can I see which partition holds which rows? Please help me answer both of these queries, if possible in a verbose manner.

how can I figure out the ordering of the rows when I call repartition
No, you can't, and in fact it should be irrelevant. Ordering matters when you query the data back, i.e. select ... from ... where ... order by; you can even make it a "view" for users. In short, ordering cannot be guaranteed because Spark distributes and parallelizes the data.
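If a particular row order matters for a downstream consumer, it has to be requested explicitly when the data is queried back; a minimal sketch on the data DataFrame from above:
data.orderBy("Gender", "Name").show()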
is there a way to view which rows go into which partition and after repartitioning
Yes, you can experiment with the following snippet which will show what is inside each partition.
for i, part in enumerate(df.rdd.glom().collect()):
    print({i: part})
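If collecting whole partitions to the driver is too heavy, another option (a small sketch using the built-in spark_partition_id function, applied to the DataFrames from the question) is to tag every row with the partition it lives in:
from pyspark.sql import functions as F

# Show which partition each row ends up in, before and after repartitioning
data.withColumn("partition_id", F.spark_partition_id()).show()
df_name.withColumn("partition_id", F.spark_partition_id()).show()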

Related

How to update dataframe column value while joinining with other dataframe in pyspark?

I have 3 DataFrames, df1 (EMPLOYEE_INFO), df2 (DEPARTMENT_INFO), and df3 (COMPANY_INFO), and I want to update a column in df1 by joining all three DataFrames. The column is called FLAG_DEPARTMENT and it is in df1. I need to set FLAG_DEPARTMENT='POLITICS'. In SQL the query would look like this:
UPDATE [COMPANY_INFO] INNER JOIN ([DEPARTMENT_INFO]
INNER JOIN [EMPLOYEE_INFO] ON [DEPARTMENT_INFO].DEPT_ID = [EMPLOYEE_INFO].DEPT_ID)
ON [COMPANY_INFO].[COMPANY_DEPT_ID] = [DEPARTMENT_INFO].[DEP_COMPANYID]
SET EMPLOYEE_INFO.FLAG_DEPARTMENT = "POLITICS";
If the values in the columns of these three tables match, I need to set FLAG_DEPARTMENT='POLITICS' in my EMPLOYEE_INFO table.
How can I achieve the same thing in PySpark? I have just started learning PySpark and don't have much in-depth knowledge yet.
You can use a chain of joins with a select on top of it.
Suppose that you have the following pyspark DataFrames:
employee_df
+---------+-------+
| Name|dept_id|
+---------+-------+
| John| dept_a|
| Liù| dept_b|
| Luke| dept_a|
| Michail| dept_a|
| Noe| dept_e|
|Shinchaku| dept_c|
| Vlad| dept_e|
+---------+-------+
department_df
+-------+----------+------------+
|dept_id|company_id| description|
+-------+----------+------------+
| dept_a| company1|Department A|
| dept_b| company2|Department B|
| dept_c| company5|Department C|
| dept_d| company3|Department D|
+-------+----------+------------+
company_df
+----------+-----------+
|company_id|description|
+----------+-----------+
| company1| Company 1|
| company2| Company 2|
| company3| Company 3|
| company4| Company 4|
+----------+-----------+
Then you can run the following code to add the flag_department column to your employee_df:
from pyspark.sql import functions as F
employee_df = (
    employee_df.alias('a')
    .join(
        department_df.alias('b'),
        on='dept_id',
        how='left',
    )
    .join(
        company_df.alias('c'),
        on=F.col('b.company_id') == F.col('c.company_id'),
        how='left',
    )
    .select(
        *[F.col(f'a.{c}') for c in employee_df.columns],
        F.when(
            F.col('b.dept_id').isNotNull() & F.col('c.company_id').isNotNull(),
            F.lit('POLITICS')
        ).alias('flag_department')
    )
)
The new employee_df will be:
+---------+-------+---------------+
| Name|dept_id|flag_department|
+---------+-------+---------------+
| John| dept_a| POLITICS|
| Liù| dept_b| POLITICS|
| Luke| dept_a| POLITICS|
| Michail| dept_a| POLITICS|
| Noe| dept_e| null|
|Shinchaku| dept_c| null|
| Vlad| dept_e| null|
+---------+-------+---------------+
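If you would rather have an explicit default instead of null for employees whose department or company did not match, one possible variation (the 'NONE' literal below is just an illustrative placeholder) is to extend the same expression with .otherwise:
from pyspark.sql import functions as F

# Drop-in replacement for the F.when(...) expression in the select above
flag_col = F.when(
    F.col('b.dept_id').isNotNull() & F.col('c.company_id').isNotNull(),
    F.lit('POLITICS')
).otherwise(F.lit('NONE')).alias('flag_department')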

union() method giving weird dataset values

I want to pile up the rows of many Datasets into one. The Datasets are stored in an array, and I use union() to append the rows of each Dataset to the previous one. Given the array of Datasets I described:
ArrayList<Dataset<Row>> dropped_added_cols_dataset_list = ...
I do it like so:
AD = dropped_added_cols_dataset_list.get(0);
for (int i = 1; i < dropped_added_cols_dataset_list.size(); i++) {
    AD = AD.union(dropped_added_cols_dataset_list.get(i));
}
But the final AD shows weird values that are not related to those in the individual sub-Datasets from the array described above.
Does anyone have an idea why?
Let's say we have n Datasets (here n == 3) as input, stored in your dropped_added_cols_dataset_list list.
df1:
+---+----+----+
| _1| _2| _3|
+---+----+----+
| 0|cat1|30.9|
| 2|cat2|22.1|
| 0|cat3|19.6|
| 1|cat4| 1.3|
+---+----+----+
df2:
+---+----+----+
| _1| _2| _3|
+---+----+----+
| 1|cat5|28.5|
| 2|cat6|26.8|
| 1|cat7|12.6|
| 1|cat8| 5.3|
+---+----+----+
df3:
+---+-----+----+
| _1| _2| _3|
+---+-----+----+
| 1| cat9|39.6|
| 2|cat10|29.7|
| 0|cat11|27.9|
| 1|cat12| 9.8|
+---+-----+----+
We can apply the unions for all Datasets in the list without writing an explicit loop by using reduce to merge them pairwise: in Java via the Stream API's reduce, and in Scala via the standard collection reduce.
// Java (`dropped_added_cols_dataset_list` is of type ArrayList<Dataset<Row>>)
Dataset<Row> result = dropped_added_cols_dataset_list.stream().reduce(Dataset::union).get();
// Scala (`dropped_added_cols_dataset_list` is of type List[Dataset[Row]])
val result = dropped_added_cols_dataset_list.reduce(_ union _)
The resulting Dataset will look like this:
+---+-----+----+
| _1| _2| _3|
+---+-----+----+
| 0| cat1|30.9|
| 2| cat2|22.1|
| 0| cat3|19.6|
| 1| cat4| 1.3|
| 1| cat5|28.5|
| 2| cat6|26.8|
| 1| cat7|12.6|
| 1| cat8| 5.3|
| 1| cat9|39.6|
| 2|cat10|29.7|
| 0|cat11|27.9|
| 1|cat12| 9.8|
+---+-----+----+
Of course, for the unions to be successful, you need to be careful that all of the Datasets stored in your list have exactly the same column schema; note that union resolves columns by position rather than by name, so differently ordered schemas are a common source of "weird" values.
It also needs to be said that this solution's performance depends heavily on the number of Datasets being merged, so a relatively large number of unions can result in non-linear time complexity.
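For completeness, a minimal PySpark sketch of the same reduce idea (dfs is assumed to be a Python list of DataFrames with identical columns); using unionByName instead of union matches columns by name rather than by position, which avoids the positional mismatches mentioned above:
from functools import reduce

# dfs: assumed list of DataFrames sharing the same set of column names
merged = reduce(lambda left, right: left.unionByName(right), dfs)
merged.show()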

Is there a feature in spark that skips a left join in case when the field that gets added by left join is not required

Suppose we have a table A and we are doing a left join with a large Table B (to fetch field colB)
Then the output is again left joined with a large table C (to fetch field colC) and finally we left join this with a table D (to fetch field colD)
So above 3 left-joins help to create a final dataset that is shared by multiple consumers.
As a consumer of this code, I select colA and colD from the final dataset (I don't need colB and colC).
Is there a feature that will skip the 2 joins with B and C (since colB and colC are not required downstream in my case)?
FYI:
I don't want to change the implementation (i.e. the 3 joins), since this method is used by multiple teams.
I don't want to create my own implementation (to avoid code duplication and to stay up to date with the logic that is used across the teams).
PS for clarity:
B,C,D are huge dim tables
A is a fact table (relatively smaller than B,C,D)
I do not think that this is possible without changing the original code. The reason is that even if the final result does not contain columns from tables B and C, the result might still depend on which tables were part of the join chain.
An example: let's assume we have this data and we want to join the four tables on the id column.
Table A       Table B       Table C       Table D
+---+----+    +---+----+    +---+----+    +---+----+
| id|colA|    | id|colB|    | id|colC|    | id|colD|
+---+----+    +---+----+    +---+----+    +---+----+
|  1|  A1|    |  1|  B1|    |  1|  C1|    |  1|  D1|
|  2|  A2|    |  2|  B2|    |  2|  C2|    |  2|  D2|
+---+----+    +---+----+    |  2| C2b|    +---+----+
                            +---+----+
The important point to note is that the table C contains a duplicate value in the join column.
If the four tables are joined with a left join and the columns A and D are selected, the result would be
+---+----+----+----+----+      +---+----+----+
| id|colA|colB|colC|colD|      | id|colA|colD|
+---+----+----+----+----+      +---+----+----+
|  1|  A1|  B1|  C1|  D1|  ==> |  1|  A1|  D1|
|  2|  A2|  B2| C2b|  D2|      |  2|  A2|  D2|
|  2|  A2|  B2|  C2|  D2|      |  2|  A2|  D2|
+---+----+----+----+----+      +---+----+----+
On the other hand, if only the tables A and D are joined directly without tables B and C, the result would be
+---+----+----+
| id|colA|colD|
+---+----+----+
|  1|  A1|  D1|
|  2|  A2|  D2|
+---+----+----+
So even if the final result contains no columns from tables B and C, the result differs depending on whether you join A->D or A->B->C->D, which means Spark cannot skip the joins with tables B and C.
The good news: if you go the A->B->C->D route and exclude the columns from tables B and C, Spark will only process the join column(s) of tables B and C and skip (for example during a shuffle) all of their other columns. So at least the amount of data that is processed will be lower when you do not select columns from tables B and C.
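If you want to verify this behaviour yourself, here is a small PySpark sketch of the example above (same data as in the tables); the chained version produces the extra row even though no column from B or C is selected, and .explain() on it shows that the unused columns are pruned while the joins themselves still run:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

a = spark.createDataFrame([(1, "A1"), (2, "A2")], ["id", "colA"])
b = spark.createDataFrame([(1, "B1"), (2, "B2")], ["id", "colB"])
c = spark.createDataFrame([(1, "C1"), (2, "C2"), (2, "C2b")], ["id", "colC"])
d = spark.createDataFrame([(1, "D1"), (2, "D2")], ["id", "colD"])

chained = (a.join(b, "id", "left")
            .join(c, "id", "left")
            .join(d, "id", "left")
            .select("id", "colA", "colD"))
direct = a.join(d, "id", "left").select("id", "colA", "colD")

print(chained.count())  # 3 rows, because of the duplicate key in table C
print(direct.count())   # 2 rows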

how to count distinct values for data with too many columns in pyspark

I ran into a problem when processing data with a large number of columns in Spark.
I am currently using countDistinct function as follows:
from pyspark.sql import functions as F
distinct_cnts = df.agg(*(F.countDistinct(col).alias(col) for col in df.columns)).toPandas().T[0].to_dict()
for data with 3300 columns and only 50 rows (the sampled data is around 1 MB).
I am using a Spark cluster environment (1 GB for the driver and for each executor).
When I tried to run the above function, I ran into memory problems and a stack overflow error:
java.lang.StackOverflowError
I don't really understand how data of around 1 MB can cause a memory issue. Could anyone explain this?
When I tried to allocate more memory to the Spark driver and executors (3 GB each, with dynamicAllocation enabled), the above function works, but other per-column computation jobs run into the same issue again.
For example, a function like the following:
df.select(*(F.sum(F.col(c).isNull().cast('int')).alias(c) for c in df.columns)).toPandas().T[0].to_dict()
Is there any other way to solve this problem besides changing the Spark configuration (i.e. a better way to write the code)?
Here are two approaches you can try or challenge, written in Scala. Note that I haven't tested them on very wide dataframes yet, but feel free to share some anonymized data if possible:
Let's say we have a wide DataFrame (here with only 6 columns):
val wideDF = Seq(
  ("1","a","b","c","d","e","f"),
  ("2","a","b","c","d2","e","f"),
  ("3","a","b","c2","d3","e","f"),
  ("4","a2","b2","c3","d4","e2","f2")
).toDF("key","col1","col2","col3","col4","col5","col6")
+---+----+----+----+----+----+----+
|key|col1|col2|col3|col4|col5|col6|
+---+----+----+----+----+----+----+
| 1| a| b| c| d| e| f|
| 2| a| b| c| d2| e| f|
| 3| a| b| c2| d3| e| f|
| 4| a2| b2| c3| d4| e2| f2|
+---+----+----+----+----+----+----+
Approach 1:
Here, I take all the DF columns (3,300 in your case) and group them into buckets of 100 columns each (so 33 buckets). The buckets can be processed in parallel for better scalability:
val allColumns = wideDF.columns.grouped(100).toList
val reducedDF = allColumns.par.map{ colBucket: Array[String] =>
  // We will process aggregates on narrowed dataframes
  wideDF.select(colBucket.map(c => countDistinct($"$c").as(s"distinct_$c")): _*)
}.reduce(_ join _)
reducedDF.show(false)
+------------+------------+------------+------------+------------+------------+
|distinctcol1|distinctcol2|distinctcol3|distinctcol4|distinctcol5|distinctcol6|
+------------+------------+------------+------------+------------+------------+
|2           |2           |3           |4           |2           |2           |
+------------+------------+------------+------------+------------+------------+
Approach 2:
Create a transpose method (or use pandas'), then aggregate by column name (this could also be done with the RDD API):
def transposeDF(dataframe: DataFrame, transposeBy: Seq[String]): DataFrame = {
  val (cols, types) = dataframe.dtypes.filter{ case (c, _) => !transposeBy.contains(c) }.unzip
  require(types.distinct.size == 1)
  val kvs = explode(array(
    cols.map(c => struct(lit(c).alias("column_name"), col(c).alias("column_value"))): _*
  ))
  val byExprs = transposeBy.map(col)
  dataframe
    .select(byExprs :+ kvs.alias("_kvs"): _*)
    .select(byExprs ++ Seq($"_kvs.column_name", $"_kvs.column_value"): _*)
}
// We get 1 record per key and column name
transposeDF(wideDF, Seq("key")).groupBy("column_name").agg(countDistinct($"column_value").as("distinct_count")).show(false)
+-----------+--------------+
|column_name|distinct_count|
+-----------+--------------+
|col3       |3             |
|col4       |4             |
|col1       |2             |
|col6       |2             |
|col5       |2             |
|col2       |2             |
+-----------+--------------+
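If you want to stay in PySpark rather than Scala, a rough adaptation of Approach 1 could look like the sketch below (the bucket size of 100 is just an example; tune it to your data and cluster). Collecting one small Row per bucket keeps every individual plan narrow instead of building one huge aggregation:
from pyspark.sql import functions as F

def distinct_counts_bucketed(df, bucket_size=100):
    # Count distinct values per column, processing bucket_size columns at a time
    buckets = [df.columns[i:i + bucket_size]
               for i in range(0, len(df.columns), bucket_size)]
    results = {}
    for bucket in buckets:
        row = df.agg(*[F.countDistinct(F.col(c)).alias(c) for c in bucket]).first()
        results.update(row.asDict())
    return results

# distinct_cnts = distinct_counts_bucketed(df)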

Spark SQL doesn't respect the DataFrame format

I'm analyzing Twitter files in JSON format with Spark SQL, with the goal of extracting the trending topics.
After taking all the text from a tweet and splitting it into words, my DataFrame looks like this:
+--------------------+--------------------+
| line| words|
+--------------------+--------------------+
|[RT, #ONLYRPE:, #...| RT|
|[RT, #ONLYRPE:, #...| #ONLYRPE:|
|[RT, #ONLYRPE:, #...| #tlrp|
|[RT, #ONLYRPE:, #...| followan?|
I just need the words column, so I convert my DataFrame into a temp view.
df.createOrReplaceTempView("Twitter_test_2")
With the help of Spark SQL it should be very easy to get the trending topics; I just need a SQL query that uses the LIKE operator in the WHERE condition: words like '#%'.
spark.sql("select words,
count(words) as count
from words_Twitter
where words like '#%'
group by words
order by count desc limit 10").show(20,False)
but I'm getting some strange results that I can't find an explanation for:
+---------------------+---+
|words |cnt|
+---------------------+---+
|#izmirescort |211|
|#PRODUCE101 |101|
|#VeranoMTV2017 |91 |
|#سلمان_يدق_خشم_العايل|89 |
|#ALDUBHomeAgain |67 |
|#BTS |32 |
|#سود_الله_وجهك_ياتميم|32 |
|#NowPlaying |32 |
For some reason, the two rows with counts 89 and 32, the ones that contain Arabic characters, are not where they should be; the text appears to have been exchanged with the count.
Other times I am confronted with this kind of format:
spark.sql("select words, lang,count(words) count from Twitter_test_2 group by words,lang order by count desc limit 10 ").show()
After that query, my DataFrame looks very strange:
+--------------------+----+-----+
| words|lang|count|
+--------------------+----+-----+
| #VeranoMTV2017| pl| 6|
| #umRei| pt| 2|
| #Virgem| pt| 2|
| #rt
2| pl| 2|
| #rt
gazowaną| pl| 1|
| #Ziobro| pl| 1|
| #SomosPorto| pt| 1|
+--------------------+----+-----+
Why is this happening, and how can I avoid it?
