I have a use case where I need to deduplicate a dataframe on a column (a GUID column). But instead of discarding the duplicates, I need to store them in a separate location. For example, given the following data with schema (name, GUID):
(a, 1), (b, 2), (a, 2), (a, 3), (c, 1), (c, 4), I want to split the dataset so that I have:
(a, 1), (b, 2), (a, 3), (c, 4) in one part and (a, 2), (c, 1) in the second part. If I use dropDuplicates(["GUID"]), the second part gets lost. What would be an efficient way to do this?
You can assign a row number, and split the dataframe into two parts based on whether the row number is equal to 1.
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    'rn',
    F.row_number().over(Window.partitionBy('GUID').orderBy(F.monotonically_increasing_id()))
)
df2.show()
+----+----+---+
|name|GUID| rn|
+----+----+---+
| a| 1| 1|
| c| 1| 2|
| a| 3| 1|
| b| 2| 1|
| a| 2| 2|
| c| 4| 1|
+----+----+---+
df2_part1 = df2.filter('rn = 1').drop('rn')
df2_part2 = df2.filter('rn != 1').drop('rn')
df2_part1.show()
+----+----+
|name|GUID|
+----+----+
| a| 1|
| a| 3|
| b| 2|
| c| 4|
+----+----+
df2_part2.show()
+----+----+
|name|GUID|
+----+----+
| c| 1|
| a| 2|
+----+----+
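The same split, first occurrence per GUID versus later occurrences, can be sketched in plain Python to make the logic explicit. This is only an illustration on a local list, not a replacement for the window approach above; the helper name split_first_and_duplicates is made up for this example.

```python
def split_first_and_duplicates(rows, key):
    """Split rows into (first occurrence per key, later occurrences)."""
    seen = set()
    firsts, duplicates = [], []
    for row in rows:
        k = key(row)
        if k in seen:
            # key already encountered: this row is a duplicate
            duplicates.append(row)
        else:
            seen.add(k)
            firsts.append(row)
    return firsts, duplicates


data = [("a", 1), ("b", 2), ("a", 2), ("a", 3), ("c", 1), ("c", 4)]
part1, part2 = split_first_and_duplicates(data, key=lambda r: r[1])
print(part1)
print(part2)
```

Here "first" means first in input order, which is roughly what ordering by monotonically_increasing_id() stands in for in the Spark version.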
I thought rangeBetween(start, end) looks at values in the range [cur_value + start, cur_value + end]. https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/expressions/WindowSpec.html
But I saw an example that used a descending orderBy() on a timestamp together with rangeBetween(unboundedPreceding, 0), which led me to explore the following example:
dd = spark.createDataFrame(
    [(1, "a"), (3, "a"), (3, "a"), (1, "b"), (2, "b"), (3, "b")],
    ['id', 'category']
)
dd.show()
# output
+---+--------+
| id|category|
+---+--------+
| 1| a|
| 3| a|
| 3| a|
| 1| b|
| 2| b|
| 3| b|
+---+--------+
It seems to include the preceding row whose value is higher by 1.
from pyspark.sql import Window
from pyspark.sql.functions import desc, sum as Fsum
byCategoryOrderedById = Window.partitionBy('category')\
    .orderBy(desc('id'))\
    .rangeBetween(-1, Window.currentRow)
dd.withColumn("sum", Fsum('id').over(byCategoryOrderedById)).show()
# output
+---+--------+---+
| id|category|sum|
+---+--------+---+
| 3| b| 3|
| 2| b| 5|
| 1| b| 3|
| 3| a| 6|
| 3| a| 6|
| 1| a| 1|
+---+--------+---+
And with start set to -2, it also includes preceding rows whose values are greater by up to 2.
byCategoryOrderedById = Window.partitionBy('category')\
    .orderBy(desc('id'))\
    .rangeBetween(-2, Window.currentRow)
dd.withColumn("sum", Fsum('id').over(byCategoryOrderedById)).show()
# output
+---+--------+---+
| id|category|sum|
+---+--------+---+
| 3| b| 3|
| 2| b| 5|
| 1| b| 6|
| 3| a| 6|
| 3| a| 6|
| 1| a| 7|
+---+--------+---+
So, what is the exact behavior of rangeBetween with desc orderBy?
It's not well documented, but with range (value-based) frames the sort direction, ascending or descending, determines which values are included in the frame.
Let's take the example you provided:
RANGE BETWEEN 1 PRECEDING AND CURRENT ROW
Depending on the order by direction, 1 PRECEDING means:
current_row_value - 1 if ASC
current_row_value + 1 if DESC
Consider the row with value 1 in partition b.
With descending order, the frame includes:
(current_value and all preceding values where x <= current_value + 1) = (1, 2)
With ascending order, the frame includes:
(current_value and all preceding values where x >= current_value - 1) = (1)
PS: using rangeBetween(-1, Window.currentRow) with desc ordering is equivalent to rangeBetween(Window.currentRow, 1) with asc ordering.
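To make the rule concrete, here is a small plain-Python sketch that mimics a value-based frame for a single partition. It is an illustration of the behavior described above, not Spark's implementation; the function range_sum and its parameters are invented for this example.

```python
def range_sum(values, start, end, descending=False):
    """Sum over a value-based (RANGE) frame for every row of one partition.

    The offsets are applied along the sort direction, so with descending
    order an offset of -1 ("1 PRECEDING") means current_value + 1.
    """
    sign = -1 if descending else 1
    result = []
    for cur in sorted(values, reverse=descending):
        a, b = cur + sign * start, cur + sign * end
        lo, hi = min(a, b), max(a, b)
        # every row whose value falls inside [lo, hi] belongs to the frame
        result.append((cur, sum(v for v in values if lo <= v <= hi)))
    return result


partition_b = [1, 2, 3]
partition_a = [1, 3, 3]

desc_b = range_sum(partition_b, -1, 0, descending=True)
desc_a_wide = range_sum(partition_a, -2, 0, descending=True)
asc_b = range_sum(partition_b, 0, 1, descending=False)
print(desc_b)
print(desc_a_wide)
print(asc_b)
```

The first two results reproduce the sum columns shown in the question for partitions b (start -1) and a (start -2), and the last one gives the same sum per value as the descending (-1, currentRow) frame, only listed in ascending order.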
I have a list like the one below:
rrr=[[(1,(3,1)),(2, (3,2)),(3, (3, 2)),(1,(4,1)),(2, (4,2))]]
df_input = []
Next, I defined the header:
df_header=['sid', 'tid', 'srank']
Then I append the data to the empty list with a for loop:
for i in rrr:
    for j in i:
        df_input.append((j[0], j[1][0], j[1][1]))
df_input
Output: [(1, 3, 1), (2, 3, 2), (3, 3, 2), (1, 4, 1), (2, 4, 2)]
Then I create the data frame:
df = spark.createDataFrame(df_input, df_header)
df.show()
+---+---+-----+
|sid|tid|srank|
+---+---+-----+
| 1| 3| 1|
| 2| 3| 2|
| 3| 3| 2|
| 1| 4| 1|
| 2| 4| 2|
+---+---+-----+
My question is how to create the data frame without an explicit for loop like the one above. The input list contains more than 1 lakh (100,000) records.
Notice that your initial list is nested, i.e. an actual list that is the only element of an outer one. The solution then comes easily by taking only its first (and only) element:
spark.version
# u'2.1.1'
from pyspark.sql import Row
# your exact data:
rrr=[[(1,(3,1)),(2, (3,2)),(3, (3, 2)),(1,(4,1)),(2, (4,2))]]
df_header=['sid', 'tid', 'srank']
df = sc.parallelize(rrr[0]).map(lambda x: Row(x[0], x[1][0],x[1][1])).toDF(schema=df_header)
df.show()
# +---+---+-----+
# |sid|tid|srank|
# +---+---+-----+
# | 1| 3| 1|
# | 2| 3| 2|
# | 3| 3| 2|
# | 1| 4| 1|
# | 2| 4| 2|
# +---+---+-----+
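If going through an RDD is not a requirement, the same flattening can also be sketched with a plain list comprehension that unpacks each nested tuple. The commented createDataFrame call assumes an active SparkSession named spark, as in the question; it is left as a comment so the snippet runs without Spark.

```python
rrr = [[(1, (3, 1)), (2, (3, 2)), (3, (3, 2)), (1, (4, 1)), (2, (4, 2))]]
df_header = ['sid', 'tid', 'srank']

# Unpack (sid, (tid, srank)) into flat 3-tuples; this also handles
# an outer list with more than one inner list.
rows = [(sid, tid, srank) for inner in rrr for sid, (tid, srank) in inner]
print(rows)

# With an active SparkSession named `spark` (assumed here):
# df = spark.createDataFrame(rows, df_header)
```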
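Solution 1: use the toDF() transformation (with the input modified). Note that in Spark versions before 3.0, a Row built from keyword arguments sorts its fields alphabetically, which is why the columns come out as sid, srank, tid.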
from pyspark.sql import Row
ar=[[1,(3,1)],[2, (3,2)],[3, (3,2)]]
sc.parallelize(ar).map(lambda x: Row(sid=x[0], tid=x[1][0],srank=x[1][1])).toDF().show()
+---+-----+---+
|sid|srank|tid|
+---+-----+---+
| 1| 1| 3|
| 2| 2| 3|
| 3| 2| 3|
+---+-----+---+
Solution 2: with the requested input matrix, use a list comprehension plus numpy flatten and reshape
import numpy as np
x=[[(1,(3,1)),(2, (3,2)),(3, (3, 2))]]
ar=[[(j[0],j[1][0],j[1][1]) for j in i] for i in x]
flat=np.array(ar).flatten()
flat=flat.reshape(len(flat)//3, 3)
sc.parallelize(flat).map(lambda x: Row(sid=int(x[0]),tid=int(x[1]),srank=int(x[2]))).toDF().show()
+---+-----+---+
|sid|srank|tid|
+---+-----+---+
| 1| 1| 3|
| 2| 2| 3|
| 3| 2| 3|
+---+-----+---+
#works also with N,M matrix
number_columns=3
x=[[(1,(3,1)),(2, (3,2)),(3, (3, 2))],[(5,(6,7)),(8, (9,10)),(11, (12, 13))]]
ar=[[(j[0],j[1][0],j[1][1]) for j in i] for i in x]
flat=np.array(ar).flatten()
flat=flat.reshape(int(len(flat)/number_columns), number_columns)
sc.parallelize(flat).map(lambda x: Row(sid=int(x[0]),tid=int(x[1]),srank=int(x[2]))).toDF().show()
+---+-----+---+
|sid|srank|tid|
+---+-----+---+
| 1| 1| 3|
| 2| 2| 3|
| 3| 2| 3|
| 5| 7| 6|
| 8| 10| 9|
| 11| 13| 12|
+---+-----+---+
I have two Spark dataframes loaded from CSV, of the form:
mapping_fields (the df with the mapped names):
new_name old_name
A aa
B bb
C cc
and the data itself:
aa bb cc dd
1 2 3 43
12 21 4 37
to be transformed into:
A B C D
1 2 3
12 21 4
Since dd has no mapping in the mapping table, the D column should contain only null values.
How can I do this without converting mapping_fields into a dictionary and checking each mapped name individually? (That would mean collecting mapping_fields to the driver, which rather contradicts my use case of handling all the datasets in a distributed way.)
Thanks!
With melt borrowed from here you could:
from pyspark.sql import functions as f
mapping_fields = spark.createDataFrame(
    [("A", "aa"), ("B", "bb"), ("C", "cc")],
    ("new_name", "old_name"))
df = spark.createDataFrame(
    [(1, 2, 3, 43), (12, 21, 4, 37)],
    ("aa", "bb", "cc", "dd"))
(melt(df.withColumn("id", f.monotonically_increasing_id()),
      id_vars=["id"], value_vars=df.columns, var_name="old_name")
    .join(mapping_fields, ["old_name"], "left_outer")
    .withColumn("value", f.when(f.col("new_name").isNotNull(), f.col("value")))
    .withColumn("new_name", f.coalesce("new_name", f.upper(f.col("old_name"))))
    .groupBy("id")
    .pivot("new_name")
    .agg(f.first("value"))
    .drop("id")
    .show())
+---+---+---+----+
| A| B| C| DD|
+---+---+---+----+
| 1| 2| 3|null|
| 12| 21| 4|null|
+---+---+---+----+
but nothing in your description justifies this. Because the number of columns is fairly limited, I'd rather:
mapping = dict(
    mapping_fields
    .filter(f.col("old_name").isin(df.columns))
    .select("old_name", "new_name").collect())
df.select([
    (f.lit(None).cast(t) if c not in mapping else f.col(c)).alias(mapping.get(c, c.upper()))
    for (c, t) in df.dtypes]).show()
+---+---+---+----+
| A| B| C| DD|
+---+---+---+----+
| 1| 2| 3|null|
| 12| 21| 4|null|
+---+---+---+----+
At the end of the day you should use distributed processing when it provides performance or scalability improvements. Here it would do the opposite and make your code overly complicated.
To ignore columns without a match:
(melt(df.withColumn("id", f.monotonically_increasing_id()),
      id_vars=["id"], value_vars=df.columns, var_name="old_name")
    .join(mapping_fields, ["old_name"])
    .groupBy("id")
    .pivot("new_name")
    .agg(f.first("value"))
    .drop("id")
    .show())
or
df.select([
    f.col(c).alias(mapping.get(c))
    for (c, t) in df.dtypes if c in mapping])
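The driver-side variant boils down to deciding, per column, whether to keep it under a new name or replace it with nulls. A minimal plain-Python sketch of that decision is shown below; build_select_plan and the keep_unmapped flag are names invented for this illustration, and the resulting (source, alias) pairs would then be turned into f.col(source).alias(alias) or f.lit(None).alias(alias) expressions.

```python
def build_select_plan(columns, mapping, keep_unmapped=True):
    """Return (source_column_or_None, output_name) pairs.

    A source of None stands for "emit a null column under this name".
    """
    plan = []
    for c in columns:
        if c in mapping:
            plan.append((c, mapping[c]))
        elif keep_unmapped:
            # no mapping: keep a null placeholder named after the upper-cased column
            plan.append((None, c.upper()))
    return plan


mapping = {"aa": "A", "bb": "B", "cc": "C"}
columns = ["aa", "bb", "cc", "dd"]

with_nulls = build_select_plan(columns, mapping)
mapped_only = build_select_plan(columns, mapping, keep_unmapped=False)
print(with_nulls)
print(mapped_only)
```

Only the small mapping table is collected to the driver here; the data itself stays distributed, since the plan just drives a single select.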
I tried with a simple for loop; hope this helps too.
from pyspark.sql import functions as F
l1 = [('A','aa'),('B','bb'),('C','cc')]
l2 = [(1,2,3,43),(12,21,4,37)]
df1 = spark.createDataFrame(l1,['new_name','old_name'])
df2 = spark.createDataFrame(l2,['aa','bb','cc','dd'])
>>> df1.show()
+--------+--------+
|new_name|old_name|
+--------+--------+
| A| aa|
| B| bb|
| C| cc|
+--------+--------+
>>> df2.show()
+---+---+---+---+
| aa| bb| cc| dd|
+---+---+---+---+
| 1| 2| 3| 43|
| 12| 21| 4| 37|
+---+---+---+---+
When you need the unmapped column kept with null values:
>>> cols = df2.columns
>>> for i in cols:
...     val = df1.where(df1['old_name'] == i).first()
...     if val is not None:
...         df2 = df2.withColumnRenamed(i, val['new_name'])
...     else:
...         df2 = df2.withColumn(i, F.lit(None))
>>> df2.show()
+---+---+---+----+
| A| B| C| dd|
+---+---+---+----+
| 1| 2| 3|null|
| 12| 21| 4|null|
+---+---+---+----+
When only the mapped columns are needed, change the else branch:
...     else:
...         df2 = df2.drop(i)
>>> df2.show()
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 2| 3|
| 12| 21| 4|
+---+---+---+
Note that this rebinds df2, so the original dataframe is no longer available under that name.
I used the first and last functions to get the first and last values of a column, but found that neither works the way I expected. I referred to @zero323's answer, but I am still confused by both. The code looks like this:
from pyspark.sql import functions as F, Window
df = spark.sparkContext.parallelize([
    ("a", None), ("a", 1), ("a", -1), ("b", 3), ("b", 1)
]).toDF(["k", "v"])
w = Window().partitionBy("k").orderBy('k','v')
df.select(F.col("k"), F.last("v",True).over(w).alias('v')).show()
the result:
+---+----+
| k| v|
+---+----+
| b| 1|
| b| 3|
| a|null|
| a| -1|
| a| 1|
+---+----+
I expected it to be:
+---+----+
| k| v|
+---+----+
| b| 3|
| b| 3|
| a| 1|
| a| 1|
| a| 1|
+---+----+
because ordering df by 'k' and 'v' shows:
df.orderBy('k','v').show()
+---+----+
| k| v|
+---+----+
| a|null|
| a| -1|
| a| 1|
| b| 1|
| b| 3|
+---+----+
Additionally, I tried another way to test this kind of problem:
df.orderBy('k','v').groupBy('k').agg(F.first('v')).show()
I found that its results can differ from run to run. Has anyone else experienced this? I would like to use both functions in my project, but these solutions give inconsistent results.
Try inverting the sort order using .desc(); first() will then give the desired output.
w2 = Window().partitionBy("k").orderBy(df.v.desc())
df.select(F.col("k"), F.first("v",True).over(w2).alias('v')).show()
Outputs:
+---+---+
| k| v|
+---+---+
| b| 3|
| b| 3|
| a| 1|
| a| 1|
| a| 1|
+---+---+
You should also be careful about partitionBy vs. orderBy. Since you are partitioning by 'k', all of the values of k in any given window are the same. Sorting by 'k' does nothing.
The last function is not really the opposite of first in terms of which item from the window it returns. It returns the last non-null value it has seen as it progresses through the ordered rows.
To compare their effects, here is a dataframe with both function/ordering combinations. Notice how in column 'last_w2', the null value has been replaced by -1.
df = spark.sparkContext.parallelize([
    ("a", None), ("a", 1), ("a", -1), ("b", 3), ("b", 1)]).toDF(["k", "v"])
# create two windows for comparison
w = Window().partitionBy("k").orderBy('v')
w2 = Window().partitionBy("k").orderBy(df.v.desc())
df.select('k','v',
    F.first("v",True).over(w).alias('first_w1'),
    F.last("v",True).over(w).alias('last_w1'),
    F.first("v",True).over(w2).alias('first_w2'),
    F.last("v",True).over(w2).alias('last_w2')
).show()
Output:
+---+----+--------+-------+--------+-------+
| k| v|first_w1|last_w1|first_w2|last_w2|
+---+----+--------+-------+--------+-------+
| b| 1| 1| 1| 3| 1|
| b| 3| 1| 3| 3| 3|
| a|null| null| null| 1| -1|
| a| -1| -1| -1| 1| -1|
| a| 1| -1| 1| 1| 1|
+---+----+--------+-------+--------+-------+
Have a look at Question 47130030.
The issue is not with the last() function but with the frame, which includes only rows up to the current one.
Using
w = Window().partitionBy("k").orderBy('k','v').rowsBetween(Window.unboundedPreceding,Window.unboundedFollowing)
will yield correct results for first() and last().
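The effect of the frame can be sketched in plain Python for a single partition. This is a simplified model, not Spark itself: it assumes the rows are already in window order with nulls placed where Spark puts them (first for ascending, last for descending), and it treats the default frame as "all rows up to the current one", which matches the example only because the values within each partition are distinct. The function name first_last is invented for this illustration.

```python
def first_last(values, full_frame=False):
    """Return (value, first_non_null, last_non_null) for each row.

    full_frame=False models a frame ending at the current row;
    full_frame=True models unboundedPreceding..unboundedFollowing.
    """
    out = []
    for i, v in enumerate(values):
        frame = values if full_frame else values[: i + 1]
        non_null = [x for x in frame if x is not None]
        first = non_null[0] if non_null else None
        last = non_null[-1] if non_null else None
        out.append((v, first, last))
    return out


asc_a = [None, -1, 1]    # partition "a" ordered by v ascending
desc_a = [1, -1, None]   # partition "a" ordered by v descending

running_asc = first_last(asc_a)
running_desc = first_last(desc_a)
whole_asc = first_last(asc_a, full_frame=True)
print(running_asc)
print(running_desc)
print(whole_asc)
```

The running results line up with the first_w1/last_w1 and first_w2/last_w2 columns for partition a in the comparison table above, while the full frame gives the same first and last value on every row.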
Given a DataFrame
+---+---+----+
| id| v|date|
+---+---+----+
| 1| a| 1|
| 2| a| 2|
| 3| b| 3|
| 4| b| 4|
+---+---+----+
And we want to add a column with the mean value of date by v
+---+---+----+---------+
| v| id|date|avg(date)|
+---+---+----+---------+
| a| 1| 1| 1.5|
| a| 2| 2| 1.5|
| b| 3| 3| 3.5|
| b| 4| 4| 3.5|
+---+---+----+---------+
Is there a better way (e.g. in terms of performance) than the following?
val df = sc.parallelize(List((1,"a",1), (2, "a", 2), (3, "b", 3), (4, "b", 4))).toDF("id", "v", "date")
val aggregated = df.groupBy("v").agg(avg("date"))
df.join(aggregated, usingColumn = "v")
More precisely, I think this join will trigger a shuffle.
[Update] Adding some details, because I don't think this is a duplicate: the join has a key in this case.
I see several options to avoid the shuffle:
automatically: Spark has automatic broadcast joins (spark.sql.autoBroadcastJoinThreshold), but this requires that the Hive metadata has been computed. Right?
by using a known partitioner? If yes, how can that be done with a DataFrame?
by forcing a broadcast (leftDF.join(broadcast(rightDF), usingColumn = "v"))?