Replacing all column values using Window operation? - apache-spark

Hi Data frame created like below.
df = sc.parallelize([
(1, 3),
(2, 3),
(3, 2),
(4,2),
(1, 3)
]).toDF(["id",'t'])
it shows like below.
+---+---+
| id| t|
+---+---+
| 1| 3|
| 2| 3|
| 3| 2|
| 4| 2|
| 1| 3|
+---+---+
my main aim is ,I want to replace repeated value in every column with how many times repeated.
so i have tried flowing code it is not working as expected.
from pyspark.sql.functions import col
column_list = ["id",'t']
w = Window.partitionBy(column_list)
dfmax=df.select(*((count(col(c)).over(w)).alias(c) for c in df.columns))
dfmax.show()
+---+---+
| id| t|
+---+---+
| 2| 2|
| 2| 2|
| 1| 1|
| 1| 1|
| 1| 1|
+---+---+
my expected output will be
+---+---+
| id| t|
+---+---+
| 2| 3|
| 1| 3|
| 1| 1|
| 1| 1|
| 2| 3|
+---+---+

If I understand you correctly, what you're looking for is simply:
df.select(*[count(c).over(Window.partitionBy(c)).alias(c) for c in df.columns]).show()
#+---+---+
#| id| t|
#+---+---+
#| 2| 3|
#| 2| 3|
#| 1| 2|
#| 1| 3|
#| 1| 2|
#+---+---+
The difference between this and what you posted is that we only partition by one column at a time.
Remember that DataFrames are unordered. If you wanted to maintain your row order, you could add an ordering column using pyspark.sql.functions.monotonically_increasing_id():
from pyspark.sql.functions import monotonically_increasing_id
df.withColumn("order", monotonically_increasing_id())\
.select(*[count(c).over(Window.partitionBy(c)).alias(c) for c in df.columns])\
.sort("order")\
.drop("order")\
.show()
#+---+---+
#| id| t|
#+---+---+
#| 2| 3|
#| 1| 3|
#| 1| 2|
#| 1| 2|
#| 2| 3|
#+---+---+

Related

Check if a column is consecutive with groupby in pyspark

I have a pyspark dataframe that looks like this:
import pandas as pd
foo = pd.DataFrame({'group': ['a','a','a','b','b','c','c','c'], 'value': [1,2,3,4,5,2,4,5]})
I would like to create a new binary column is_consecutive that indicates if the values in the value column are consecutive by group.
The output should look like this:
foo = pd.DataFrame({'group': ['a','a','a','b','b','c','c','c'], 'value': [1,2,3,4,5,2,4,5],
'is_consecutive': [1,1,1,1,1,0,0,0]})
How could I do that in pyspark?
You can use lag to compare values with the previous row and check if they are consecutive, then use min to determine whether all rows are consecutive in a given group.
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
'consecutive',
F.coalesce(
F.col('value') - F.lag('value').over(Window.partitionBy('group').orderBy('value')) == 1,
F.lit(True)
).cast('int')
).withColumn(
'all_consecutive',
F.min('consecutive').over(Window.partitionBy('group'))
)
df2.show()
+-----+-----+-----------+---------------+
|group|value|consecutive|all_consecutive|
+-----+-----+-----------+---------------+
| c| 2| 1| 0|
| c| 4| 0| 0|
| c| 5| 1| 0|
| b| 4| 1| 1|
| b| 5| 1| 1|
| a| 1| 1| 1|
| a| 2| 1| 1|
| a| 3| 1| 1|
+-----+-----+-----------+---------------+
You can use lead and subtract the same with the existing value then find max of the window, once done , put a condition saying return 0 is max is >1 else return 1
w = Window.partitionBy("group").orderBy(F.monotonically_increasing_id())
(foo.withColumn("Diff",F.lead("value").over(w)-F.col("value"))
.withColumn("is_consecutive",F.when(F.max("Diff").over(w)>1,0).otherwise(1))
.drop("Diff")).show()
+-----+-----+--------------+
|group|value|is_consecutive|
+-----+-----+--------------+
| a| 1| 1|
| a| 2| 1|
| a| 3| 1|
| b| 4| 1|
| b| 5| 1|
| c| 2| 0|
| c| 4| 0|
| c| 5| 0|
+-----+-----+--------------+

Pyspark adding a column of repeating values from a list

I have a pyspark dataframe and want to add a column that adds values from a list in a repeating fashion. If this were just python, I would probably use itertools' cycle function. I don't know how to do this in pyspark.
names = ['Julia', 'Tim', 'Zoe']
My dataframe looks like this:
+-----+------+
| id_A| idx_B|
+-----+------+
| a| 0|
| b| 0|
| b| 2|
| b| 2|
| b| 2|
| b| 2|
+-----+------+
I want it to look like this:
+-----+------+--------+
| id_A| idx_B| names |
+-----+------+--------+
| a| 0| Julia|
| b| 0| Tim|
| b| 2| Zoe|
| b| 2| Julia|
| b| 2| Tim|
| b| 2| Zoe|
+-----+------+--------+
Here's one way.
1 - add a unique incremental id for your dataframe:
df = spark.createDataFrame(
df.rdd.zipWithIndex().map(lambda x: Row(*x[0], x[1]))
).toDF("id_A", "idx_B", "id")
df.show()
#+----+-----+---+
#|id_A|idx_B| id|
#+----+-----+---+
#| a| 0| 0|
#| b| 0| 1|
#| b| 2| 2|
#| b| 2| 3|
#| b| 2| 4|
#| b| 2| 5|
#+----+-----+---+
2 - create dataframe from the list of names:
names_df = spark.createDataFrame([(idx, name) for idx, name in enumerate(names)], ["name_id", "names"])
3 - join using modulo 3 (length of names list) in condition:
from pyspark.sql import functions as F
result = df.join(
names_df,
F.col("id") % 3 == F.col("name_id")
).orderBy("id").drop("id", "name_id")
result.show()
#+----+-----+-----+
#|id_A|idx_B|names|
#+----+-----+-----+
#| a| 0|Julia|
#| b| 0| Tim|
#| b| 2| Zoe|
#| b| 2|Julia|
#| b| 2| Tim|
#| b| 2| Zoe|
#+----+-----+-----+

Drop function doesn't work properly after joining same columns of Dataframe

I am facing this same issue while joining two Data frame A, B.
For ex:
c = df_a.join(df_b, [df_a.col1 == df_b.col1], how="left").drop(df_b.col1)
And when I try to drop the duplicate column like as above this query doesn't drop the col1 of df_b. Instead when I try to drop col1 of df_a, then it able to drop the col1 of df_a.
Could anyone please say about this.
Note: I tried the same in my project which has more than 200 columns and shows the same problem. Sometimes this drop function works properly if we have few columns but not if we have more columns.
Drop function not working after left outer join in pyspark
function to drop duplicates column after merge.
def dropDupeDfCols(df):
newcols = []
dupcols = []
for i in range(len(df.columns)):
if df.columns[i] not in newcols:
newcols.append(df.columns[i])
else:
dupcols.append(i)
df = df.toDF(*[str(i) for i in range(len(df.columns))])
for dupcol in dupcols:
df = df.drop(str(dupcol))
return df.toDF(*newcols)
There are some similar issues I faced recently. Let me show them below with your case.
I am creating two dataframes with the same data
scala> val df_a = Seq((1, 2, "as"), (2,3,"ds"), (3,4,"ew"), (4, 1, "re"), (3,1,"ht")).toDF("a", "b", "c")
df_a: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
scala> val df_b = Seq((1, 2, "as"), (2,3,"ds"), (3,4,"ew"), (4, 1, "re"), (3,1,"ht")).toDF("a", "b", "c")
df_b: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
Joining them
scala> val df = df_a.join(df_b, df_a("b") === df_b("a"), "leftouter")
df: org.apache.spark.sql.DataFrame = [a: int, b: int ... 4 more fields]
scala> df.show
+---+---+---+---+---+---+
| a| b| c| a| b| c|
+---+---+---+---+---+---+
| 1| 2| as| 2| 3| ds|
| 2| 3| ds| 3| 1| ht|
| 2| 3| ds| 3| 4| ew|
| 3| 4| ew| 4| 1| re|
| 4| 1| re| 1| 2| as|
| 3| 1| ht| 1| 2| as|
+---+---+---+---+---+---+
Let's drop a column that is not present in the above dataframe
+---+---+---+---+---+---+
| a| b| c| a| b| c|
+---+---+---+---+---+---+
| 1| 2| as| 2| 3| ds|
| 2| 3| ds| 3| 1| ht|
| 2| 3| ds| 3| 4| ew|
| 3| 4| ew| 4| 1| re|
| 4| 1| re| 1| 2| as|
| 3| 1| ht| 1| 2| as|
+---+---+---+---+---+---+
Ideally we will expect spark to throw an error, but it executes successfully.
Now, if you drop a column from the above dataframe
scala> df.drop("a").show
+---+---+---+---+
| b| c| b| c|
+---+---+---+---+
| 2| as| 3| ds|
| 3| ds| 1| ht|
| 3| ds| 4| ew|
| 4| ew| 1| re|
| 1| re| 2| as|
| 1| ht| 2| as|
+---+---+---+---+
It drops all the columns with provided column name in the input dataframe.
If you want to drop specific columns, it should be done as below:
scala> df.drop(df_a("a")).show()
+---+---+---+---+---+
| b| c| a| b| c|
+---+---+---+---+---+
| 2| as| 2| 3| ds|
| 3| ds| 3| 1| ht|
| 3| ds| 3| 4| ew|
| 4| ew| 4| 1| re|
| 1| re| 1| 2| as|
| 1| ht| 1| 2| as|
+---+---+---+---+---+
I don't think spark accepts the input as give by you(see below):
scala> df.drop(df_a.a).show()
<console>:30: error: value a is not a member of org.apache.spark.sql.DataFrame
df.drop(df_a.a).show()
^
scala> df.drop(df_a."a").show()
<console>:1: error: identifier expected but string literal found.
df.drop(df_a."a").show()
^
If you provide the input to drop, as below, it executes but will have no impact
scala> df.drop("df_a.a").show
+---+---+---+---+---+---+
| a| b| c| a| b| c|
+---+---+---+---+---+---+
| 1| 2| as| 2| 3| ds|
| 2| 3| ds| 3| 1| ht|
| 2| 3| ds| 3| 4| ew|
| 3| 4| ew| 4| 1| re|
| 4| 1| re| 1| 2| as|
| 3| 1| ht| 1| 2| as|
+---+---+---+---+---+---+
The reason being, spark interprets "df_a.a" as a nested column. As that column is not present ideally it should have thrown error, but as explained above, it just executes.
Hope this helps..!!!

Joining two data frames and result data frames contain non duplicate items in PySpark?

I have created two data frames by executing below command. I want to
join the two data frames and result data frames contain non duplicate items in PySpark.
df1 = sc.parallelize([
("a",1,1),
("b",2,2),
("d",4,2),
("e",4,1),
("c",3,4)]).toDF(['SID','SSection','SRank'])
df1.show()
+---+--------+-----+
|SID|SSection|SRank|
+---+--------+-----+
| a| 1| 1|
| b| 2| 2|
| d| 4| 2|
| e| 4| 1|
| c| 3| 4|
+---+--------+-----+
df2 is
df2=sc.parallelize([
("a",2,1),
("b",2,3),
("f",4,2),
("e",4,1),
("c",3,4)]).toDF(['SID','SSection','SRank'])
+---+--------+-----+
|SID|SSection|SRank|
+---+--------+-----+
| a| 2| 1|
| b| 2| 3|
| f| 4| 2|
| e| 4| 1|
| c| 3| 4|ggVG
+---+--------+-----+
I want to join above two tables like below.
+---+--------+----------+----------+
|SID|SSection|test1SRank|test2SRank|
+---+--------+----------+----------+
| f| 4| 0| 2|
| e| 4| 1| 1|
| d| 4| 2| 0|
| c| 3| 4| 4|
| b| 2| 2| 3|
| a| 1| 1| 0|
| a| 2| 0| 1|
+---+--------+----------+----------+
Doesn't look like something that can be achieved with a single join. Here's a solution involving multiple joins:
from pyspark.sql.functions import col
d1 = df1.unionAll(df2).select("SID" , "SSection" ).distinct()
t1 = d1.join(df1 , ["SID", "SSection"] , "leftOuter").select(d1.SID , d1.SSection , col("SRank").alias("test1Srank"))
t2 = d1.join(df2 , ["SID", "SSection"] , "leftOuter").select(d1.SID , d1.SSection , col("SRank").alias("test2Srank"))
t1.join(t2, ["SID", "SSection"]).na.fill(0).show()
+---+--------+----------+----------+
|SID|SSection|test1Srank|test2Srank|
+---+--------+----------+----------+
| b| 2| 2| 3|
| c| 3| 4| 4|
| d| 4| 2| 0|
| e| 4| 1| 1|
| f| 4| 0| 2|
| a| 1| 1| 0|
| a| 2| 0| 1|
+---+--------+----------+----------+
You can simply rename the SRank column names and use outer join and use na.fill function
df1.withColumnRenamed("SRank", "test1SRank").join(df2.withColumnRenamed("SRank", "test2SRank"), ["SID", "SSection"], "outer").na.fill(0)

Reducing a dataframe to the most frequent combinations of two columns

I have a json file which I import using the following code:
spark = SparkSession.builder.master("local").appName('GPS').config(conf=SparkConf()).getOrCreate()
df = spark.read.json("SensorData.json")
The result is a dataframe similar to this:
+---+---+
| A| B|
+---+---+
| 1| 3|
| 2| 1|
| 2| 3|
| 1| 2|
| 3| 1|
| 1| 2|
| 2| 1|
| 1| 3|
| 1| 2|
+---+---+
My task is using PySpark to reduce the data to only the most frequent combinations of two columns (A and B)
So the wanted output is this
+---+---+-----+
| A| B|count|
+---+---+-----+
| 1| 2| 3|
| 2| 1| 2|
+---+---+-----+
You can do that with a combination of groupBy and limit:
spark = SparkSession.builder.master("local").appName('GPS').config(conf=SparkConf()).getOrCreate()
df = spark.read.json("SensorData.json")
df.groupBy("A","B")
.count()
.sort("count",ascending = False)
.limit(2)
.show()
+---+---+-----+
| A| B|count|
+---+---+-----+
| 1| 2| 3|
| 2| 1| 2|
+---+---+-----+

Resources