I am doing a cube operation with a count aggregation on a Spark dataframe that has close to 1 million rows. I am using 4 columns for this cube operation. I notice that the dataframe returned by the cube operation has duplicate rows, especially for null combinations.
There are no nulls in my input DF since I have replaced all nulls with separate default values for each column before doing the cube operation. I am also filtering out the rows of the cube output where all 4 grouping columns are null, because that row represents the total count and I am already aware of it.
An example could be:
val dimensions = List("A", "B", "C", "D")
val cube_df = input_df.cube(dimensions.head, dimensions.tail: _*)
  .count()
  .filter(!(col("A").isNull && col("B").isNull && col("C").isNull && col("D").isNull))
Now if I do a show on the cube like this:
cube_df
  .filter(col("A").isNull && col("B").isNull && col("C").isNull && col("D") === "xyz")
  .show(false)
+----+----+----+---------------+-----------+
|A |B |C |D |Count |
+----+----+----+---------------+-----------+
|null|null|null|xyz |10221 |
|null|null|null|xyz |232638 |
+----+----+----+---------------+-----------+
I see two rows in the output, and obviously only one of these rows represents the correct count as per the input_df (the second row in my case).
I am also aware that cube basically does a group by for each combination of the grouping columns, one by one, keeping the columns that do not participate in a combination as null, and then performs a union-all of the results of each combination's group by. But this still seems a little strange to me.
So why is this happening? And if I cannot avoid this duplicate combination output, then how do I identify which of the returned combinations represents the correct output?
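A minimal PySpark sketch of one way to tell such rows apart, assuming the same input_df as above (grouping_id() also exists in the Scala API as org.apache.spark.sql.functions.grouping_id); grouping_id() records which of the cube's grouping sets produced each output row:
from pyspark.sql import functions as F

dimensions = ["A", "B", "C", "D"]

# Assumed: input_df is the dataframe from the question, with nulls already
# replaced by per-column default values.
cube_df = (
    input_df.cube(*dimensions)
    .agg(F.count("*").alias("Count"),
         # grouping_id() encodes, per output row, which dimension columns were
         # rolled up by the cube, so rows with identical dimension values but
         # different grouping sets get different ids.
         F.grouping_id().alias("gid"))
)

# The two seemingly duplicate (null, null, null, "xyz") rows should now show
# different gid values, identifying which grouping set each one came from.
cube_df.filter(
    F.col("A").isNull() & F.col("B").isNull() &
    F.col("C").isNull() & (F.col("D") == "xyz")
).show(truncate=False)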
Assuming I have some PySpark df, e.g.:
Key | Value
0 | "a"
2 | "c"
0 | "b"
1 | "z"
I want to perform a map-reduce-like shuffle,
i.e. I want to group rows onto partitions by key.
I believe df.rdd.groupByKey() does that, but it changes the df structure:
it returns a list of tuples with a list as the value (grouped by key).
How can I perform a "pure" shuffle, i.e. move my rows to a specific partition without changing anything in the df's appearance/structure?
So the output would be the same but the partitioning would be different. For example, we start with 2 partitions:
(0,"a")
(1,"c")
(1,"d")
and
(1,"d")
(0:"e")
(1,"w")
as a result we get two partitions:
(0,"a")
(0:"e")
and
(1,"d")
(1,"c")
(1,"d")
(1,"w")
Sorry if the title does not explain this clearly; I could not think of a better way to phrase it. So I have a data frame that is organized as follows:
ID Depart Arrive Time
****************************
A 1 2 1pm
A 2 3 2pm
A 4 1 5pm
So what I'm trying to do is find all the Times where one row's Arrive does not match the next row's Depart.
For example, the second row here has 3 as its Arrive, but the third row has 4 as its Depart (as opposed to 3).
What I'm hoping to get is a new data frame with all such mismatches. In the case of this data frame it would look like this:
ID From To Time
********************************
A 3 4 [2pm,5pm]
I'm struggling to figure out how to do this with Spark, as opposed to converting the data frame into a different data structure and iterating over it. Apologies if I missed anything; I'm new to Spark.
You can check the value of Depart in the next row using lead and compare it to the value of Arrive in the current row. If they are different, collect all the necessary information into a struct and expand it later. Note that this solution only works for your time format (ha: hour followed by am/pm).
from pyspark.sql import functions as F, Window

w = Window.partitionBy('ID').orderBy('Time2')

df2 = df.withColumn(
    'Time2',
    F.to_timestamp('Time', 'ha')
).withColumn(
    'unmatched_arrive',
    F.when(
        F.lead('Depart').over(w) != F.col('Arrive'),
        F.struct(
            F.col('Arrive').alias('From'),
            F.lead('Depart').over(w).alias('To'),
            F.array('Time', F.lead('Time').over(w)).alias('Time')
        )
    )
).dropna(subset=['unmatched_arrive']).dropDuplicates().select('ID', 'unmatched_arrive.*')
df2.show()
+---+----+---+----------+
|ID |From|To |Time |
+---+----+---+----------+
|A |3 |4 |[2pm, 5pm]|
+---+----+---+----------+
I have a very big PySpark dataframe. The dataframe contains two important columns: a key and the tokens related to that key. So each row has a key and a list of tokens:
load_df.show(5)
+--------------------+-----------+
| token | key |
+--------------------+-----------+
|[-LU4KeI8o, FrWx6...| h9-1256 |
|[] | h1-2112 |
|[HDOksdh_vv, aIHD...| e3-0139 |
|[-LU4KeI8o, FrWx6...| S3-4156 |
+--------------------+-----------+
Now, I want to count the number of times each token appears relative to different keys. But the problem is that whatever I do turns out to be very slow.
I want to know what the best way to do this is.
I have tried to explode the token column and then count.
Something like this:
explode_df = load_df.withColumn('token', F.explode('token'))
load_freq = explode_df.groupby('token')\
    .count()\
    .sort('count', ascending=False)
or this:
explode_df = load_df.withColumn('token', F.explode('token'))
load_freq = explode_df.groupby('token')\
    .agg(F.collect_set('key'), F.count(F.col('key')).alias('count'))\
    .sort('count', ascending=True)
The dataframe has more than 250 million rows and this method is very slow. I wonder if there is a better way to reach the same result faster and more efficiently.
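If only the number of distinct keys per token is needed rather than the full set of keys, one thing worth trying (a sketch, not a guaranteed speed-up) is replacing collect_set with approx_count_distinct, which estimates the distinct count with HyperLogLog instead of materialising every key for every token:
from pyspark.sql import functions as F

explode_df = load_df.withColumn('token', F.explode('token'))

load_freq = (
    explode_df
    .groupby('token')
    .agg(
        # Approximate distinct-key count per token; avoids building the exact
        # set of keys that collect_set would materialise.
        F.approx_count_distinct('key').alias('distinct_keys'),
        F.count('key').alias('count'),
    )
    .sort('count', ascending=False)
)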
Can anyone explain to me why these two approaches produce different outputs (even a different count())?
FIRST:
(df
    .where(cond1)
    .where((cond2) | (cond3))
    .groupBy('id')
    .agg(F.avg(F.column('col1')).alias('name'),
         F.avg(F.column('col2')).alias('name'))
).count()
SECOND:
(df
    .groupBy('id')
    .agg(F.avg(F.when(((cond2) | (cond3)) & (cond1),
                      F.column('col1'))).alias('name'),
         F.avg(F.when(((cond2) | (cond3)) & (cond1),
                      F.column('col2'))).alias('name'))
).count()
I just figured it out. when() without otherwise() evaluates to null when it finds no match, but null is still a value, which means the aggregation still sees every row and therefore every group. When compared to a simple df grouped by the same column and aggregated with no conditions, the result is the same.
On the other hand, where() filters the DataFrame, so the aggregation is only applied to the filtered version of the DataFrame, hence the lower number of results.
Without knowing what the conditions are, my understanding is that these are different processes. In the first case you first filter the rows you need to process, then group by id and get the averages of the filtered data, which results in, let's say, x rows. In the second case you group by id first, with no row filtering, and you tell Spark to add a column named 'name' that holds the conditional average for each group. You don't filter any rows, so you end up with x plus some more rows (depending on your conditions).
(df
    .where(cond1)                                   # remove rows by applying cond1
    .where((cond2) | (cond3))                       # remove rows by applying cond2, cond3
    .groupBy('id')                                  # group the *remaining* rows by id
    .agg(F.avg(F.column('col1')).alias('name'),     # then get the averages
         F.avg(F.column('col2')).alias('name'))
).count()
But:
(df
    .groupBy('id')                                          # group the initial data by id
    .agg(F.avg(F.when(((cond2) | (cond3)) & (cond1),        # add a column to the grouped data
                      F.column('col1'))).alias('name'),     # that computes the average conditionally
         F.avg(F.when(((cond2) | (cond3)) & (cond1),
                      F.column('col2'))).alias('name'))
).count()
# the agg does not change the number of rows
Hope this helps (I think you've already figured it out though :) ). Good luck!
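To make the difference concrete, here is a small self-contained sketch with made-up data and a single condition (not the original df or conditions): filtering first can drop whole groups, while conditional aggregation keeps every group and just yields a null average for groups with no matching rows.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Made-up data: id=2 has no row satisfying the condition below.
df = spark.createDataFrame([(1, 10), (1, 20), (2, 5)], ['id', 'col1'])
cond = F.col('col1') > 8

# Filter first: id=2 disappears entirely, so only one group is left.
filtered = df.where(cond).groupBy('id').agg(F.avg('col1').alias('name'))
print(filtered.count())     # 1

# Conditional aggregation: both groups survive; id=2 just gets a null average.
conditional = df.groupBy('id').agg(
    F.avg(F.when(cond, F.col('col1'))).alias('name'))
print(conditional.count())  # 2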
I recently started using PySpark. I have a two-column dataframe with one column containing some nulls, e.g.
df1
A B
1a3b 7
0d4s 12
6w2r null
6w2r null
1p4e null
and another dataframe has the correct mapping, i.e.
df2
A B
1a3b 7
0d4s 12
6w2r 0
1p4e 3
So I want to fill in the nulls in df1 using df2, such that the result is:
A B
1a3b 7
0d4s 12
6w2r 0
6w2r 0
1p4e 3
In pandas, I would first create a lookup dictionary from df2 and then use apply on df1 to populate the nulls. But I'm not really sure what functions to use in PySpark; most of the null-replacement examples I have seen are based on simple conditions, for example filling all the nulls in a certain column with a single constant value.
What I have tried is:
from pyspark.sql.functions import when, col
df1.withColumn('B', when(df.B.isNull(), df2.where(df2.B == df1.B).select('A')))
However, I was getting AttributeError: 'DataFrame' object has no attribute '_get_object_id'. The logic is to first filter out the nulls and then replace them with column B's value from df2, but I think df.B.isNull() evaluates the whole column instead of a single value, which is probably not the right way to do it. Any suggestions?
A left join on the common column A and selecting the appropriate columns should get you your desired output:
df1.join(df2, df1.A == df2.A, 'left').select(df1.A, df2.B).show(truncate=False)
which should give you
+----+---+
|A |B |
+----+---+
|6w2r|0 |
|6w2r|0 |
|1a3b|7 |
|1p4e|3 |
|0d4s|12 |
+----+---+
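A possible variation on the same join, in case df2 does not cover every key in df1 (an assumption beyond the sample data): coalesce keeps df1's own B whenever it is already non-null and only falls back to df2's mapping otherwise.
from pyspark.sql import functions as F

filled = (
    df1.alias("l")
    .join(df2.alias("r"), on="A", how="left")
    # Prefer df1's existing value; use df2's mapping only where df1.B is null.
    .select("A", F.coalesce(F.col("l.B"), F.col("r.B")).alias("B"))
)
filled.show(truncate=False)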