How to merge duplicate columns in pyspark?

How to merge duplicate columns in pyspark? - apache-spark

I have a pyspark dataframe in which some of the columns have same name. I want to merge all the columns having same name in one column.
For example, Input dataframe:
How can I do this in pyspark? Any help would be highly appreciated.

Check below scala code. It might help you.
scala> :paste
// Entering paste mode (ctrl-D to finish)
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import scala.annotation.tailrec
import scala.util.Try
implicit class DFHelpers(df: DataFrame) {
def mergeColumns() = {
val dupColumns = df.columns
val newColumns = dupColumns.zipWithIndex.map(c => s"${c._1}${c._2}")
val columns = newColumns
.map(c => (c(0),c))
.groupBy(_._1)
.map(c => (c._1,c._2.map(_._2)))
.map(c => s"""coalesce(${c._2.mkString(",")}) as ${c._1}""")
.toSeq
df.toDF(newColumns:_*).selectExpr(columns:_*)
}
}
// Exiting paste mode, now interpreting.
scala> df.show(false)
+----+----+----+----+----+----+
|a |b |a |c |a |b |
+----+----+----+----+----+----+
|4 |null|null|8 |null|21 |
|null|8 |7 |6 |null|null|
|96 |null|null|null|null|78 |
+----+----+----+----+----+----+
scala> df.printSchema
root
|-- a: string (nullable = true)
|-- b: string (nullable = true)
|-- a: string (nullable = true)
|-- c: string (nullable = true)
|-- a: string (nullable = true)
|-- b: string (nullable = true)
scala> df.mergeColumns.show(false)
+---+---+----+
|b |a |c |
+---+---+----+
|21 |4 |8 |
|8 |7 |6 |
|78 |96 |null|
+---+---+----+

Edited to answer OP request to coalesce from list,
Here's a reproducible example
import pyspark.sql.functions as F
df = spark.createDataFrame([
("z","a", None, None),
("b",None,"c", None),
("c","b", None, None),
("d",None, None, "z"),
], ["a","c", "c","c"])
df.show()
#fix duplicated column names
old_col=df.schema.names
running_list=[]
new_col=[]
i=0
for column in old_col:
if(column in running_list):
new_col.append(column+"_"+str(i))
i=i+1
else:
new_col.append(column)
running_list.append(column)
print(new_col)
df1 = df.toDF(*new_col)
#coalesce columns to get one column from a list
a=['c','c_0','c_1']
to_drop=['c_0','c_1']
b=[]
[b.append(df1[col]) for col in a]
#coalesce columns to get one column
df_merged=df1.withColumn('c',F.coalesce(*b)).drop(*to_drop)
df_merged.show()
Output:
+---+----+----+----+
| a| c| c| c|
+---+----+----+----+
| z| a|null|null|
| b|null| c|null|
| c| b|null|null|
| d|null|null| z|
+---+----+----+----+
['a', 'c', 'c_0', 'c_1']
+---+---+
| a| c|
+---+---+
| z| a|
| b| c|
| c| b|
| d| z|
+---+---+

Related

SparkSQL Windows: Creating Frame Based On Array Column

I am looking to use SparkSQL's window function, but with a custom condition on the frame specification.
The dataframe being operated on is as follows:
+--------------------+--------------------+--------------------+-----+
| userid| elementid| prerequisites|score|
+--------------------+--------------------+--------------------+-----+
|a |1 |[] | 1 |
|a |2 |[] | 1 |
|a |3 |[] | 1 |
|b |1 |[] | 1 |
|a |4 |[1, 2] | 1 |
+--------------------+--------------------+--------------------+-----+
Every element in the prerequisites column is a value in another row's elementid column.
I would like to create a window where I partition by userid, and then grab all preceding rows where elementid is contained in the present row's prerequisites column.
Once I attain this window, I want to perform a sum on the score column.
Desired output for the above example:
+--------------------+--------------------+--------------------+-----+
| userid| elementid| prerequisites|sum |
+--------------------+--------------------+--------------------+-----+
|a |1 |[] | 0 |
|a |2 |[] | 0 |
|a |3 |[] | 0 |
|b |1 |[] | 0 |
|a |4 |[1, 2] | 2 |
+--------------------+--------------------+--------------------+-----+
Notice how because user a is the only user with the prerequisites of its element preceding it, its the only one with > 0 sum.
The closest question I saw was this question, which utilises collect_list.
However, that doesn't construct a window so much as collect a potential list of IDs. Anyone have any ideas on how to construct the aforementioned window?

scala> import org.apache.spark.sql.expressions.{Window,UserDefinedFunction}
scala> df.show()
+------+---------+-------------+-----+
|userid|elementid|prerequisites|score|
+------+---------+-------------+-----+
| a| 1| []| 1|
| a| 2| []| 1|
| a| 3| []| 1|
| b| 1| []| 1|
| a| 4| [1, 2]| 1|
+------+---------+-------------+-----+
scala> df.printSchema
root
|-- userid: string (nullable = true)
|-- elementid: string (nullable = true)
|-- prerequisites: array (nullable = true)
| |-- element: string (containsNull = true)
|-- score: string (nullable = true)
scala> val W = Window.partitionBy("userid")
scala> val df1 = df.withColumn("elementidList", collect_set(col("elementid")).over(W))
.withColumn("elementidScoreMap", map_from_arrays(col("elementidList"), collect_list(col("score").cast("long")).over(W)))
.withColumn("common", array_intersect(col("prerequisites"), col("elementidList")))
.drop("elementidList", "score")
scala> def getSumUDF:UserDefinedFunction = udf((Score:Map[String,Long], Id:String) => {
| var out:Long = 0
| Id.split(",").foreach{ x => out = Score(x.toString) + out}
| out})
scala> df1.withColumn("sum", when(size(col("common")) =!= 0 ,getSumUDF(col("elementidScoreMap"), concat_ws(",",col("prerequisites")))).otherwise(lit(0)))
.drop("elementidScoreMap", "common")
.show()
+------+---------+-------------+---+
|userid|elementid|prerequisites|sum|
+------+---------+-------------+---+
| b| 1| []| 0|
| a| 1| []| 0|
| a| 2| []| 0|
| a| 3| []| 0|
| a| 4| [1, 2]| 2|
+------+---------+-------------+---+

Adding values of two column which datatypes are in string format in pyspark

The log files is in json format,i extracted the data to dataframe of pyspark
There are two column whose values are in int but datatype of column is string.
cola|colb
45|10
10|20
Expected Output
newcol
55
30
but I am getting output like
4510
1020
Code i have used like
df = .select (F.concat("cola","colb") as newcol).show()
kindly help me how can i get correct output.

>>> from pyspark.sql.functions import col
>>> df.show()
+----+----+
|cola|colb|
+----+----+
| 45| 10|
| 10| 20|
+----+----+
>>> df.printSchema()
root
|-- cola: string (nullable = true)
|-- colb: string (nullable = true)
>>> df.withColumn("newcol", col("cola") + col("colb")).show()
+----+----+------+
|cola|colb|newcol|
+----+----+------+
| 45| 10| 55.0|
| 10| 20| 30.0|
+----+----+------+

Explode Maptype column in pyspark

I have a dataframe like this
data = [(("ID1", {'A': 1, 'B': 2}))]
df = spark.createDataFrame(data, ["ID", "Coll"])
df.show()
+---+----------------+
| ID| Coll|
+---+----------------+
|ID1|[A -> 1, B -> 2]|
+---+----------------+
df.printSchema()
root
|-- ID: string (nullable = true)
|-- Coll: map (nullable = true)
| |-- key: string
| |-- value: long (valueContainsNull = true)
I want to explode the 'Coll' column such that
+---+-----------+
| ID| Key| Value|
+---+-----------+
|ID1| A| 1|
|ID1| B| 2|
+---+-----------+
I am trying to do this in pyspark
I am successful if I use only one column, however I want the ID column as well
df.select(explode("Coll").alias("x", "y")).show()
+---+---+
| x| y|
+---+---+
| A| 1|
| B| 2|
+---+---+

Simply add the ID column to the select and it should work:
df.select("id", explode("Coll").alias("x", "y"))

How to filter dataframe using two dates?

I have a scenario where dataframe has data_date as below
root
|-- data_date: timestamp (nullable = true)
+-------------------+
| data_date|
+-------------------+
|2009-10-19 00:00:00|
|2004-02-24 00:00:00|
+-------------------+
I Need to filter the data between two dates i.e. data_date between '01-Jan-2017' and '31-dec-2017'
I tried many ways like
df.where(col("data_date") >= "2017-01-01" )
df.filter(col("data_date").gt("2017-01-01"))
df.filter(col("data_date").gt(lit("2017-01-01"))).filter(col("data_date").lt("2017-12-31")
but nothing worked.
I am getting below error:
java.lang.AssertionError: assertion failed: unsafe symbol Unstable (child of <none>) in runtime reflection universe
at scala.reflect.internal.Symbols$Symbol.<init>(Symbols.scala:205)
at scala.reflect.internal.Symbols$TypeSymbol.<init>(Symbols.scala:3030)
at scala.reflect.internal.Symbols$ClassSymbol.<init>(Symbols.scala:3222)
at scala.reflect.internal.Symbols$StubClassSymbol.<init>(Symbols.scala:3522)
at scala.reflect.internal.Symbols$class.newStubSymbol(Symbols.scala:191)
at scala.reflect.internal.SymbolTable.newStubSymbol(SymbolTable.scala:16)\
How can I solve it?

You need to cast the literal value as "date" datatype. BTW.. the input is not between the condition that you are specifying. Check this out:
scala> val df = Seq(("2009-10-19 00:00:00"),("2004-02-24 00:00:00")).toDF("data_date").select('data_date.cast("timestamp"))
df: org.apache.spark.sql.DataFrame = [data_date: timestamp]
scala> df.printSchema
root
|-- data_date: timestamp (nullable = true)
scala> df.withColumn("greater",'data_date.gt(lit("2017-01-01").cast("date"))).withColumn("lesser",'data_date.lt(lit("2017-12-31").cast("date"))).show
+-------------------+-------+------+
| data_date|greater|lesser|
+-------------------+-------+------+
|2009-10-19 00:00:00| false| true|
|2004-02-24 00:00:00| false| true|
+-------------------+-------+------+
scala>
If I change the input as below, the filter works.
val df = Seq(("2017-10-19 00:00:00"),("2017-02-24 00:00:00")).toDF("data_date").select('data_date.cast("timestamp"))
val df2= df.withColumn("greater",'data_date.gt(lit("2017-01-01").cast("date"))).withColumn("lesser",'data_date.lt(lit("2017-12-31").cast("date")))
df2.filter("greater and lesser ").show(false)
+-------------------+-------+------+
|data_date |greater|lesser|
+-------------------+-------+------+
|2017-10-19 00:00:00|true |true |
|2017-02-24 00:00:00|true |true |
+-------------------+-------+------+

spark drop multiple duplicated columns after join

I am getting many duplicated columns after joining two dataframes,
now I want to drop the columns which comes in the last, below is my printSchema
root
|-- id: string (nullable = true)
|-- value: string (nullable = true)
|-- test: string (nullable = true)
|-- details: string (nullable = true)
|-- test: string (nullable = true)
|-- value: string (nullable = true)
now I want to drop the last two columns
|-- test: string (nullable = true)
|-- value: string (nullable = true)
I tried with df..dropDuplicates() but it dropping all
how to drop the duplicated columns which comes in the last ?

You have to use a vararg syntax to get the column names from an array and drop it.
Check below:
scala> dfx.show
+---+---+---+---+------------+------+
| A| B| C| D| arr|mincol|
+---+---+---+---+------------+------+
| 1| 2| 3| 4|[1, 2, 3, 4]| A|
| 5| 4| 3| 1|[5, 4, 3, 1]| D|
+---+---+---+---+------------+------+
scala> dfx.columns
res120: Array[String] = Array(A, B, C, D, arr, mincol)
scala> val dropcols = Array("arr","mincol")
dropcols: Array[String] = Array(arr, mincol)
scala> dfx.drop(dropcols:_*).show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
Update1:
scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val df2 = df.select("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]
scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").show
+---+---+---+---+---+---+
| A| B| C| D| B| C|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 2| 3|
| 5| 4| 3| 1| 4| 3|
+---+---+---+---+---+---+
scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").drop($"t2.B").drop($"t2.C").show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
Update2:
To remove the columns dynamically, check the below solution.
scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val df2 = Seq((1,9,9),(5,8,8)).toDF("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]
scala> val df3 = df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner")
df3: org.apache.spark.sql.DataFrame = [A: int, B: int ... 4 more fields]
scala> df3.show
+---+---+---+---+---+---+
| A| B| C| D| B| C|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 9| 9|
| 5| 4| 3| 1| 8| 8|
+---+---+---+---+---+---+
scala> val rem1 = Array("B","C")
rem1: Array[String] = Array(B, C)
scala> val rem2 = rem1.map(x=>"t2."+x)
rem2: Array[String] = Array(t2.B, t2.C)
scala> val df4 = rem2.foldLeft(df3) { (acc: DataFrame, colName: String) => acc.drop(col(colName)) }
df4: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> df4.show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
Update3
Renaming/aliasing in one go.
scala> val dfa = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
dfa: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val dfa2 = dfa.columns.foldLeft(dfa) { (acc: DataFrame, colName: String) => acc.withColumnRenamed(colName,colName+"_2")}
dfa2: org.apache.spark.sql.DataFrame = [A_2: int, B_2: int ... 2 more fields]
scala> dfa2.show
+---+---+---+---+
|A_2|B_2|C_2|D_2|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>

df.dropDuplicates() works only for rows.
You can df1.drop(df2.column("value"))
You can specify columns you want to select, for example, with df.select(Seq of columns)

Suppose if you have two dataframes DF1 and DF2,
You can use either of the ways to join on a particular column
1. DF1.join(DF2,Seq("column1","column2"))
2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2")))
So to drop the duplicate columns you can use
1. DF1.join(DF2,Seq("column1","column2")).drop(DF1("column1")).drop(DF1("column1"),DF1("column2"))
2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2"))).drop(DF1("column1"),DF1("column2"))
In either case you can use drop("columnname") to drop what ever columns you need doesn't matter from which df it comes from as it is equal in this case.

I wasn't completely satisfied with the answers in this. For the most part, especially #stack0114106 's answers, they hint at the right way and the complexity of doing it in a clean way. But they seem to be incomplete answers. To me a clean automated way of doing this is to use the df.columns functionality to get the columns as list of strings and then use sets to find the common columns to drop or find the unique columns to keep depending on your use case. However, if you use the select you will have to alias the dataframes so it knows which of the non-unique columns to keep. Anyways, using pseudocode because I can't be bothered to write the scala code proper.
common_cols = df_b.columns.toSet().intersection(df_a.columns.toSet())
df_a.join(df_b.drop(*common_cols))
The select version of this looks similar but you have to add in the aliasing.
unique_b_cols = df_b.columns.toSet().difference(df_a.columns.toSet()).toList
a_cols_aliased = df_a.columns.foreach(cols => "a." + cols)
keep_columns = a_cols_aliased.toList + unique_b_cols.toList
df_a.alias("a")
.join(df_b.alias("b"))
.select(*keep_columns)
I prefer the drop way, but having written a bunch of spark code. A select statement can often lead to cleaner code.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

How to merge duplicate columns in pyspark? - apache-spark

I have a pyspark dataframe in which some of the columns have same name. I want to merge all the columns having same name in one column. For example, Input dataframe: How can I do this in pyspark? Any help would be highly appreciated.

Related

SparkSQL Windows: Creating Frame Based On Array Column

Adding values of two column which datatypes are in string format in pyspark

Explode Maptype column in pyspark

How to filter dataframe using two dates?

spark drop multiple duplicated columns after join

Categories

Resources