How to create column from an expression - apache-spark

The doc says:
# 2. Create from an expression
df.colName + 1
1 / df.colName
Can anyone explain the meaning and usage of the code?

It means that an arithmetic operation on an existing Column creates a new Column object:
df = spark.createDataFrame([[1], [2]], ['a'])
df.show()
+---+
| a|
+---+
| 1|
| 2|
+---+
df.a
# Column<b'a'>
df.a + 1
# Column<b'(a + 1)'>
1 / df.a
# Column<b'(1 / a)'>
df.a, df.a + 1, and 1 / df.a are all Column objects. What you probably want to ask is how to attach such a column to the DataFrame, for which you can use select:
df.select('a', (df.a + 1).alias('b')).show()
+---+---+
| a| b|
+---+---+
| 1| 2|
| 2| 3|
+---+---+
Or withColumn:
df.withColumn('b', df.a + 1).show()
+---+---+
| a| b|
+---+---+
| 1| 2|
| 2| 3|
+---+---+
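The same pattern works for the other expression from the docs, 1 / df.colName; a minimal sketch, with 'inv_a' as a purely illustrative alias:
df.select('a', (1 / df.a).alias('inv_a')).show()
# attaches 1 / df.a as a new column named inv_a (1.0 and 0.5 for the two rows above)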

Related

Renaming the duplicate column name or performing select operation on it in PySpark

Code:
pdf=[(1,'a',4,'a',4.1,'d'),(2,'b',3,'b',3.2,'c'),(3,'c',2,'c',2.3,'b'),(1,'d',1,'d',1.4,'a')]
df15 = spark.createDataFrame(pdf, ('x','y','z','a','b','a') )
df15.show(2)
try: df15.select(df15.a).show(2)
except: print("failed")
df15.columns
try: df15.select(df15.columns[3]).show(2)
except: print("failed")
df15.withColumnRenamed('a', 'b_id').show(2)
df15.drop('a').show(2)
Output:
+---+---+---+---+---+---+
| x| y| z| a| b| a|
+---+---+---+---+---+---+
| 1| a| 4| a|4.1| d|
| 2| b| 3| b|3.2| c|
+---+---+---+---+---+---+
only showing top 2 rows
failed
failed
+---+---+---+----+---+----+
| x| y| z|b_id| b|b_id|
+---+---+---+----+---+----+
| 1| a| 4| a|4.1| d|
| 2| b| 3| b|3.2| c|
+---+---+---+----+---+----+
only showing top 2 rows
+---+---+---+---+
| x| y| z| b|
+---+---+---+---+
| 1| a| 4|4.1|
| 2| b| 3|3.2|
+---+---+---+---+
only showing top 2 rows
How to rename a duplicate column or perform select operations on it?
A select operation doesn't work on duplicate column names.
Rename and drop operations apply the change to both duplicate columns.
You can define a list of new column names and rename all of the DataFrame's columns at once, then drop whichever column you want to drop:
new_cols = ['x','y','z','b_id','b','b_id_to_drop']
df = df.toDF(*new_cols)
df = df.drop('b_id_to_drop')
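As a quick check, here is a sketch of that approach applied to the df15 frame above (df15_renamed is just an illustrative variable name):
new_cols = ['x', 'y', 'z', 'b_id', 'b', 'b_id_to_drop']
df15_renamed = df15.toDF(*new_cols)          # rename all six columns positionally
df15_renamed = df15_renamed.drop('b_id_to_drop')
df15_renamed.select('b_id').show(2)          # 'b_id' is no longer ambiguous, so select works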

How spark RangeBetween works with Descending Order?

I thought rangeBetween(start, end) looks at values in the range [cur_value + start, cur_value + end]. https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/expressions/WindowSpec.html
But I saw an example where they used a descending orderBy() on a timestamp and then used (unboundedPreceding, 0) with rangeBetween, which led me to explore the following example:
dd = spark.createDataFrame(
[(1, "a"), (3, "a"), (3, "a"), (1, "b"), (2, "b"), (3, "b")],
['id', 'category']
)
dd.show()
# output
+---+--------+
| id|category|
+---+--------+
| 1| a|
| 3| a|
| 3| a|
| 1| b|
| 2| b|
| 3| b|
+---+--------+
It seems to include preceding rows whose value is higher by 1.
from pyspark.sql.window import Window
from pyspark.sql.functions import desc, sum as Fsum

byCategoryOrderedById = Window.partitionBy('category')\
    .orderBy(desc('id'))\
    .rangeBetween(-1, Window.currentRow)
dd.withColumn("sum", Fsum('id').over(byCategoryOrderedById)).show()
# output
+---+--------+---+
| id|category|sum|
+---+--------+---+
| 3| b| 3|
| 2| b| 5|
| 1| b| 3|
| 3| a| 6|
| 3| a| 6|
| 1| a| 1|
+---+--------+---+
And with start set to -2, it includes values greater by up to 2, but only from preceding rows.
byCategoryOrderedById = Window.partitionBy('category')\
.orderBy(desc('id'))\
.rangeBetween(-2,Window.currentRow)
dd.withColumn("sum", Fsum('id').over(byCategoryOrderedById)).show()
# output
+---+--------+---+
| id|category|sum|
+---+--------+---+
| 3| b| 3|
| 2| b| 5|
| 1| b| 6|
| 3| a| 6|
| 3| a| 6|
| 1| a| 7|
+---+--------+---+
So, what is the exact behavior of rangeBetween with desc orderBy?
It's not well documented, but when using range (i.e. value-based) frames, the ascending or descending sort order affects which values are included in the frame.
Let's take the example you provided:
RANGE BETWEEN 1 PRECEDING AND CURRENT ROW
Depending on the order by direction, 1 PRECEDING means:
current_row_value - 1 if ASC
current_row_value + 1 if DESC
Consider the row with value 1 in partition b.
With the descending order, the frame includes:
(the current value and all preceding values x where current_value <= x <= current_value + 1) = (1, 2)
With the ascending order, the frame includes:
(the current value and all preceding values x where current_value - 1 <= x <= current_value) = (1)
PS: using rangeBetween(-1, Window.currentRow) with desc ordering is just equivalent to rangeBetween(Window.currentRow, 1) with asc ordering.
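A small sketch of that equivalence, reusing the dd DataFrame from above (asc, desc and Fsum come from pyspark.sql.functions; the window names are illustrative):
from pyspark.sql.window import Window
from pyspark.sql.functions import asc, desc, sum as Fsum

w_desc = Window.partitionBy('category').orderBy(desc('id')).rangeBetween(-1, Window.currentRow)
w_asc = Window.partitionBy('category').orderBy(asc('id')).rangeBetween(Window.currentRow, 1)

# Both frames cover values in [id, id + 1], so the two sums match row for row.
dd.select('id', 'category',
          Fsum('id').over(w_desc).alias('sum_desc'),
          Fsum('id').over(w_asc).alias('sum_asc')).show()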

How to compare two dataframes and add new flag column in pyspark?

I have created two data frames by executing the commands below.
test1 = sc.parallelize([
("a",1,1),
("b",2,2),
("d",4,2),
("e",4,1),
("c",3,4)]).toDF(['SID','SSection','SRank'])
test1.show()
+---+--------+-----+
|SID|SSection|SRank|
+---+--------+-----+
| a| 1| 1|
| b| 2| 2|
| d| 4| 2|
| e| 4| 1|
| c| 3| 4|
+---+--------+-----+
test2=sc.parallelize([
("a",1,1),
("b",2,3),
("f",4,2),
("e",4,1),
("c",3,4)]).toDF(['SID','SSection','SRank'])
test2.show()
+---+--------+-----+
|SID|SSection|SRank|
+---+--------+-----+
| a| 1| 1|
| b| 2| 3|
| f| 4| 2|
| e| 4| 1|
| c| 3| 4|
+---+--------+-----+
Using the test1 and test2 DataFrames, I need to produce a new DataFrame that contains a result like the one below.
+---+--------+----------+------------+------------+
|SID|SSection|test1SRank|test2SRank | flag |
+---+--------+----------+------------+------------+
| a| 1| 1 | 1 | same_rank |
| b| 2| 2 | 3 |rank_changed|
| d| 4| 2 | 0 |No_rank |
| e| 4| 1 | 1 |same_rank |
| c| 3| 4 | 4 |same_rank |
| f| 4| 0 | 2 |new_rank |
+---+--------+----------+------------+------------+
I want to produce the result above by comparing test1 and test2 on the combination of columns SID and SSection, and by comparing the ranks.
For example:
1) SID (a) and SSection (1): the test1 rank is 1 and the test2 rank is 1, so my flag value should be same_rank.
2) SID (b) and SSection (2): the test1 rank is 2 and the test2 rank is 3; the rank changed, so my flag value should be rank_changed.
3) SID (d) and SSection (4): the test1 rank is 2 and in test2 the rank was lost, so my flag value should be No_rank.
4) SID (f) and SSection (4): in test1 there was no rank at all, and in test2 the rank is 2, so my flag value should be new_rank.
This should give you what you want:
from pyspark.sql import functions as f
test3=test1.withColumnRenamed('SRank','test1SRank')\
.join(test2.drop('SSection')\
.withColumnRenamed('SRank','test2SRank'), on='SID', how='outer')\
.fillna(0)
test3=test3.withColumn('flag', f.expr("case when test1SRank=0 and test2SRank>0 then 'new_rank'\
when test1SRank>0 and test2SRank=0 then 'No_rank'\
when test1SRank=test2SRank then 'same_rank'\
else 'rank_changed' end"))
test3.orderBy('SID').show()
Explanation: Outer join the data frames so you have test1 and test2 ranks for all SIDs. Then fill nulls with 0 and apply the flag logic with a SQL CASE WHEN statement.
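If you prefer to stay in the DataFrame API rather than a SQL expression, the same flag logic can be written with when/otherwise; a sketch assuming the test3 columns above:
test3 = test3.withColumn(
    'flag',
    f.when((f.col('test1SRank') == 0) & (f.col('test2SRank') > 0), 'new_rank')
     .when((f.col('test1SRank') > 0) & (f.col('test2SRank') == 0), 'No_rank')
     .when(f.col('test1SRank') == f.col('test2SRank'), 'same_rank')
     .otherwise('rank_changed'))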

spark SQL to perform simple arithmetic with constant

I'm trying to do an arithmetic operation with two operands: a constant literal and a Column. Is there an approach other than withColumn?
Let df be a DataFrame:
+---+
| i|
+---+
| 1|
| 2|
| 3|
+---+
Then you can use select to add the result:
import org.apache.spark.sql.functions.lit
df
.select($"i",($"i" + lit(1)).as("j"))
.show
+---+---+
| i| j|
+---+---+
| 1| 2|
| 2| 3|
| 3| 4|
+---+---+
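For reference, a PySpark sketch of the same idea, assuming a DataFrame df with the same column i (the plain integer 1 would also work here, since Column arithmetic wraps literals automatically):
from pyspark.sql import functions as F

df.select(F.col('i'), (F.col('i') + F.lit(1)).alias('j')).show()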

How to use first and last function in pyspark?

I used the first and last functions to get the first and last values of one column, but I found that both functions don't work as I expected. I referred to the answer by zero323, but I am still confused by both. The code looks like:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.sparkContext.parallelize([
    ("a", None), ("a", 1), ("a", -1), ("b", 3), ("b", 1)
]).toDF(["k", "v"])
w = Window().partitionBy("k").orderBy('k', 'v')
df.select(F.col("k"), F.last("v", True).over(w).alias('v')).show()
the result:
+---+----+
| k| v|
+---+----+
| b| 1|
| b| 3|
| a|null|
| a| -1|
| a| 1|
+---+----+
I expected it to be like:
+---+----+
| k| v|
+---+----+
| b| 3|
| b| 3|
| a| 1|
| a| 1|
| a| 1|
+---+----+
because this is what df looks like when ordered by 'k' and 'v':
df.orderBy('k','v').show()
+---+----+
| k| v|
+---+----+
| a|null|
| a| -1|
| a| 1|
| b| 1|
| b| 3|
+---+----+
Additionally, I tried another approach to test this kind of problem; my code is:
df.orderBy('k','v').groupBy('k').agg(F.first('v')).show()
I found that its results can differ each time I run it. Has anyone else had the same experience? I would like to use both functions in my project, but I find these results inconclusive.
Try inverting the sort order using .desc() and then first() will give the desired output.
w2 = Window().partitionBy("k").orderBy(df.v.desc())
df.select(F.col("k"), F.first("v",True).over(w2).alias('v')).show()
Outputs:
+---+---+
| k| v|
+---+---+
| b| 3|
| b| 3|
| a| 1|
| a| 1|
| a| 1|
+---+---+
You should also be careful about partitionBy vs. orderBy. Since you are partitioning by 'k', all of the values of k in any given window are the same. Sorting by 'k' does nothing.
The last function is not really the opposite of first in terms of which item from the window it returns: it returns the last non-null value it has seen as it progresses through the ordered rows.
To compare their effects, here is a dataframe with both function/ordering combinations. Notice how in column 'last_w2', the null value has been replaced by -1.
df = spark.sparkContext.parallelize([
("a", None), ("a", 1), ("a", -1), ("b", 3), ("b", 1)]).toDF(["k", "v"])
#create two windows for comparison.
w = Window().partitionBy("k").orderBy('v')
w2 = Window().partitionBy("k").orderBy(df.v.desc())
df.select('k','v',
F.first("v",True).over(w).alias('first_w1'),
F.last("v",True).over(w).alias('last_w1'),
F.first("v",True).over(w2).alias('first_w2'),
F.last("v",True).over(w2).alias('last_w2')
).show()
Output:
+---+----+--------+-------+--------+-------+
| k| v|first_w1|last_w1|first_w2|last_w2|
+---+----+--------+-------+--------+-------+
| b| 1| 1| 1| 3| 1|
| b| 3| 1| 3| 3| 3|
| a|null| null| null| 1| -1|
| a| -1| -1| -1| 1| -1|
| a| 1| -1| 1| 1| 1|
+---+----+--------+-------+--------+-------+
Have a look at Question 47130030.
The issue is not with the last() function but with the frame, which includes only rows up to the current one.
Using
w = Window().partitionBy("k").orderBy('k', 'v').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
will yield correct results for first() and last().
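A minimal sketch of that fix, reusing df, F and Window from above (w_full is an illustrative name):
w_full = Window().partitionBy("k").orderBy('k', 'v')\
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

# With the frame spanning the whole partition, first() and last() see every row.
df.select('k', 'v',
          F.first('v', True).over(w_full).alias('first_v'),
          F.last('v', True).over(w_full).alias('last_v')).show()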
