I have data like this:
>>> data = sc.parallelize([[1,5,10,0,[1,2,3,4,5,6]],[0,10,20,1,[2,3,4,5,6,7]],[1,15,25,0,[3,4,5,6,7,8]],[0,30,40,1,[4,5,6,7,8,9]]]).toDF(('a','b','c',"d","e"))
>>> data.show()
+---+---+---+---+------------------+
| a| b| c| d| e|
+---+---+---+---+------------------+
| 1| 5| 10| 0|[1, 2, 3, 4, 5, 6]|
| 0| 10| 20| 1|[2, 3, 4, 5, 6, 7]|
| 1| 15| 25| 0|[3, 4, 5, 6, 7, 8]|
| 0| 30| 40| 1|[4, 5, 6, 7, 8, 9]|
+---+---+---+---+------------------+
# columns that should be kept in the result
keep_cols = ["a","b"]
# column 'e' should be split into split_e_cols
split_e_cols = ["one","two","three","four","five","six"]
# I hope the result dataframe has keep_cols + split_e_cols
I want to split column e into multiple columns and keep columns a and b at the same time.
I have tried:
data.select(*(col("e").getItem(i).alias(split_e_cols[i]) for i in range(len(split_e_cols))))
and
data.select("e").rdd.flatMap(lambda x:x).toDF(split_e_cols)
Neither of these keeps columns a and b.
Could anyone help me? Thanks.
Try this:
from pyspark.sql.functions import col
select_cols = [col(c) for c in keep_cols] + [col("e").getItem(i).alias(split_e_cols[i]) for i in range(len(split_e_cols))]
data.select(*select_cols).show()
#+---+---+---+---+-----+----+----+---+
#| a| b|one|two|three|four|five|six|
#+---+---+---+---+-----+----+----+---+
#| 1| 5| 1| 2| 3| 4| 5| 6|
#| 0| 10| 2| 3| 4| 5| 6| 7|
#| 1| 15| 3| 4| 5| 6| 7| 8|
#| 0| 30| 4| 5| 6| 7| 8| 9|
#+---+---+---+---+-----+----+----+---+
Or using a for loop with withColumn:
data = data.select(keep_cols + ["e"])
for i in range(len(split_e_cols)):
    data = data.withColumn(split_e_cols[i], col("e").getItem(i))
data.drop("e").show()
You can concatenate the lists using +:
from pyspark.sql.functions import col
data.select(
    keep_cols +
    [col("e").getItem(i).alias(split_e_cols[i]) for i in range(len(split_e_cols))]
).show()
+---+---+---+---+-----+----+----+---+
| a| b|one|two|three|four|five|six|
+---+---+---+---+-----+----+----+---+
| 1| 5| 1| 2| 3| 4| 5| 6|
| 0| 10| 2| 3| 4| 5| 6| 7|
| 1| 15| 3| 4| 5| 6| 7| 8|
| 0| 30| 4| 5| 6| 7| 8| 9|
+---+---+---+---+-----+----+----+---+
A more pythonic way is to use enumerate instead of range(len()):
from pyspark.sql.functions import col
data.select(
    keep_cols +
    [col("e").getItem(i).alias(c) for (i, c) in enumerate(split_e_cols)]
).show()
+---+---+---+---+-----+----+----+---+
| a| b|one|two|three|four|five|six|
+---+---+---+---+-----+----+----+---+
| 1| 5| 1| 2| 3| 4| 5| 6|
| 0| 10| 2| 3| 4| 5| 6| 7|
| 1| 15| 3| 4| 5| 6| 7| 8|
| 0| 30| 4| 5| 6| 7| 8| 9|
+---+---+---+---+-----+----+----+---+
Related
I have a table like the one below. I want to calculate an average of median, but only for Q=2 and Q=3. I don't want to include the other Qs, but I still want to preserve their rows.
df = spark.createDataFrame([('2018-03-31',6,1),('2018-03-31',27,2),('2018-03-31',3,3),('2018-03-31',44,4),('2018-06-30',6,1),('2018-06-30',4,3),('2018-06-30',32,2),('2018-06-30',112,4),('2018-09-30',2,1),('2018-09-30',23,4),('2018-09-30',37,3),('2018-09-30',3,2)],['date','median','Q'])
+----------+--------+---+
| date| median | Q |
+----------+--------+---+
|2018-03-31| 6| 1|
|2018-03-31| 27| 2|
|2018-03-31| 3| 3|
|2018-03-31| 44| 4|
|2018-06-30| 6| 1|
|2018-06-30| 4| 3|
|2018-06-30| 32| 2|
|2018-06-30| 112| 4|
|2018-09-30| 2| 1|
|2018-09-30| 23| 4|
|2018-09-30| 37| 3|
|2018-09-30| 3| 2|
+----------+--------+---+
Expected output:
+----------+--------+---+------------+
| date| median | Q |result |
+----------+--------+---+------------+
|2018-03-31| 6| 1| null|
|2018-03-31| 27| 2| 15|
|2018-03-31| 3| 3| 15|
|2018-03-31| 44| 4| null|
|2018-06-30| 6| 1| null|
|2018-06-30| 4| 3| 18|
|2018-06-30| 32| 2| 18|
|2018-06-30| 112| 4| null|
|2018-09-30| 2| 1| null|
|2018-09-30| 23| 4| null|
|2018-09-30| 37| 3| 20|
|2018-09-30| 3| 2| 20|
+----------+--------+---+------------+
OR
+----------+--------+---+------------+
| date| median | Q |result |
+----------+--------+---+------------+
|2018-03-31| 6| 1| 15|
|2018-03-31| 27| 2| 15|
|2018-03-31| 3| 3| 15|
|2018-03-31| 44| 4| 15|
|2018-06-30| 6| 1| 18|
|2018-06-30| 4| 3| 18|
|2018-06-30| 32| 2| 18|
|2018-06-30| 112| 4| 18|
|2018-09-30| 2| 1| 20|
|2018-09-30| 23| 4| 20|
|2018-09-30| 37| 3| 20|
|2018-09-30| 3| 2| 20|
+----------+--------+---+------------+
I tried the following code, but when I include the where clause it drops the Q=1 and Q=4 rows:
window = (
    Window
    .partitionBy("date")
    .orderBy("date")
)
df_avg = (
    df
    .where(
        (F.col("Q") == 2) |
        (F.col("Q") == 3)
    )
    .withColumn("result", F.avg("median").over(window))
)
For both of your expected outputs, you can use conditional aggregation: avg combined with when (and otherwise where needed).
If you want the 1st expected output:
window = (
    Window
    .partitionBy("date", F.col("Q").isin([2, 3]))
)
df_avg = (
    df.withColumn("result", F.when(F.col("Q").isin([2, 3]), F.avg("median").over(window)))
)
For the 2nd expected output:
window = (
    Window
    .partitionBy("date")
)
df_avg = (
    df.withColumn("result", F.avg(F.when(F.col("Q").isin([2, 3]), F.col("median"))).over(window))
)
Alternatively, since you are really aggregating a (small?) subset, you can replace the window with an aggregation that you join back to the original DataFrame:
>>> from pyspark.sql.functions import avg, col
>>> df_avg = df.where(col("Q").isin([2,3])).groupBy("date","Q").agg(avg("median").alias("result"))
>>> df_result = df.join(df_avg,["date","Q"],"left")
This might turn out to be faster than using a window.
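If you want the 2nd expected output with this approach, aggregate per date only and join on "date" alone, so every Q row of that date receives the same average. A sketch (df_avg2 and df_result2 are just illustrative names):
>>> df_avg2 = df.where(col("Q").isin([2,3])).groupBy("date").agg(avg("median").alias("result"))
>>> df_result2 = df.join(df_avg2,["date"],"left")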
I am trying to test the usage of F.count(F.col().isNotNull()) in a window function. Please see the following code:
from pyspark.sql import functions as F
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
rows = ([1,5,4],
        [1,5,None],
        [1,5,1],
        [1,5,4],
        [2,5,1],
        [2,5,2],
        [2,5,None],
        [2,5,None],
        [2,5,4])
df = spark.createDataFrame(rows, ['I_id','p_id','xyz'])
w= Window().partitionBy("I_id","p_id").orderBy(F.col("xyz").asc_nulls_first())
df.withColumn("xyz1",F.count(F.col("xyz").isNotNull()).over(w)).show()
The result is shown as follows. For the first two rows, my understanding is that F.count(F.col("xyz")) should count the non-null items in the frame from xyz = -infinity up to xyz = null, so how does the trailing isNotNull() affect this? Why does it return 2 for the first two rows of the xyz1 column?
If you count the Booleans, you count all the rows in the window, because they are either True or False (never null), regardless of whether xyz itself is null.
What you can do instead is sum the isNotNull Booleans rather than count them:
df.withColumn("xyz1",F.sum(F.col("xyz").isNotNull().cast('int')).over(w)).show()
+----+----+----+----+
|I_id|p_id| xyz|xyz1|
+----+----+----+----+
| 2| 5|null| 0|
| 2| 5|null| 0|
| 2| 5| 1| 1|
| 2| 5| 2| 2|
| 2| 5| 4| 3|
| 1| 5|null| 0|
| 1| 5| 1| 1|
| 1| 5| 4| 3|
| 1| 5| 4| 3|
+----+----+----+----+
Another way is to do a conditional count using when:
df.withColumn("xyz1",F.count(F.when(F.col("xyz").isNotNull(), 1)).over(w)).show()
+----+----+----+----+
|I_id|p_id| xyz|xyz1|
+----+----+----+----+
| 2| 5|null| 0|
| 2| 5|null| 0|
| 2| 5| 1| 1|
| 2| 5| 2| 2|
| 2| 5| 4| 3|
| 1| 5|null| 0|
| 1| 5| 1| 1|
| 1| 5| 4| 3|
| 1| 5| 4| 3|
+----+----+----+----+
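As for why the original expression returned 2 for the first two rows: count() counts the non-null values of its argument, and isNotNull() produces a Boolean that is itself never null, so the expression reduces to a plain running row count. With an orderBy, the default window frame runs from the start of the partition through the current row and its ordering peers, and the two null rows are peers of each other. A sketch that makes this explicit (running_rows is just an illustrative column name):
df.withColumn("running_rows", F.count(F.lit(1)).over(w)).show()
# running_rows matches xyz1 from the original code: 2, 2, 3, 4, 5 within I_id = 2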
I have a Spark dataframe with six columns, say (col1, col2, ..., col6). I want to create a unique id for each combination of values from "col1" and "col2" and add it to the dataframe. Can someone help me with some PySpark code to do this?
You can achieve it using monotonically_increasing_id (PySpark >= 1.6) or monotonicallyIncreasingId (PySpark < 1.6):
>>> from pyspark.sql.functions import monotonically_increasing_id
>>> rdd=sc.parallelize([[12,23,3,4,5,6],[12,23,56,67,89,20],[12,23,0,0,0,0],[12,2,12,12,12,23],[1,2,3,4,56,7],[1,2,3,4,56,7]])
>>> df = rdd.toDF(['col_1','col_2','col_3','col_4','col_5','col_6'])
>>> df.show()
+-----+-----+-----+-----+-----+-----+
|col_1|col_2|col_3|col_4|col_5|col_6|
+-----+-----+-----+-----+-----+-----+
| 12| 23| 3| 4| 5| 6|
| 12| 23| 56| 67| 89| 20|
| 12| 23| 0| 0| 0| 0|
| 12| 2| 12| 12| 12| 23|
| 1| 2| 3| 4| 56| 7|
| 1| 2| 3| 4| 56| 7|
+-----+-----+-----+-----+-----+-----+
>>> df_1=df.groupBy(df.col_1,df.col_2).count().withColumn("id", monotonically_increasing_id()).select(['col_1','col_2','id'])
>>> df_1.show()
+-----+-----+-------------+
|col_1|col_2| id|
+-----+-----+-------------+
| 12| 23| 34359738368|
| 1| 2|1434519076864|
| 12| 2|1554778161152|
+-----+-----+-------------+
>>> df.join(df_1,(df.col_1==df_1.col_1) & (df.col_2==df_1.col_2)).drop(df_1.col_1).drop(df_1.col_2).show()
+-----+-----+-----+-----+-----+-----+-------------+
|col_3|col_4|col_5|col_6|col_1|col_2| id|
+-----+-----+-----+-----+-----+-----+-------------+
| 3| 4| 5| 6| 12| 23| 34359738368|
| 56| 67| 89| 20| 12| 23| 34359738368|
| 0| 0| 0| 0| 12| 23| 34359738368|
| 3| 4| 56| 7| 1| 2|1434519076864|
| 3| 4| 56| 7| 1| 2|1434519076864|
| 12| 12| 12| 23| 12| 2|1554778161152|
+-----+-----+-----+-----+-----+-----+-------------+
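If you need ids that are small and consecutive rather than the large values produced by monotonically_increasing_id, a hedged alternative is dense_rank over a window ordered by the two columns. Note that this window has no partitionBy, so Spark will move all the data through a single partition, which only makes sense for modest data sizes:
>>> from pyspark.sql.functions import dense_rank
>>> from pyspark.sql.window import Window
>>> w = Window.orderBy("col_1", "col_2")
>>> df.withColumn("id", dense_rank().over(w)).show()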
If you really need to generate the unique ID from col1 and col2, you can also create a hash value leveraging Spark's sha2 function.
First, let's generate some dummy data:
from random import randint
max_range = 10
df1 = spark.createDataFrame(
    [(x, x * randint(1, max_range), x * 10 * randint(1, max_range)) for x in range(1, max_range)],
    ['C1', 'C2', 'C3'])
>>> df1.show()
+---+---+---+
| C1| C2| C3|
+---+---+---+
| 1| 1| 60|
| 2| 14|180|
| 3| 21|270|
| 4| 16|360|
| 5| 35|250|
| 6| 30|480|
| 7| 28|210|
| 8| 80|320|
| 9| 45|360|
+---+---+---+
Then create a new uid column from columns C2 and C3 with the following code:
from pyspark.sql.functions import col, sha2, concat
df1.withColumn("uid", sha2(concat(col("C2"), col("C3")), 256)).show(10, False)
And the output:
+---+---+---+--------------------+
| C1| C2| C3| uid|
+---+---+---+--------------------+
| 1| 1| 60|a512db2741cd20693...|
| 2| 14|180|2f6543dc6c0e06e4a...|
| 3| 21|270|bd3c65ddde4c6f733...|
| 4| 16|360|c7a1e8c59fc9dcc21...|
| 5| 35|250|cba1aeb7a72d9ae27...|
| 6| 30|480|ad7352ff8927cf790...|
| 7| 28|210|ea7bc25aa7cd3503f...|
| 8| 80|320|02e1d953517339552...|
| 9| 45|360|b485cf8f710a65755...|
+---+---+---+--------------------+
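One caveat with this approach: concat returns null if any input is null, and concatenating without a separator means pairs like (1, 23) and (12, 3) hash to the same uid. If either could bite you, concat_ws is a safer choice, since it skips nulls and inserts a separator (a sketch; the "|" separator is an arbitrary choice and should not occur inside the values):
from pyspark.sql.functions import col, concat_ws, sha2
df1.withColumn("uid", sha2(concat_ws("|", col("C2"), col("C3")), 256)).show(10, False)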
I have created two data frames by executing the commands below. I want to join the two data frames so that the resulting data frame contains the non-duplicate items, in PySpark.
df1 = sc.parallelize([
("a",1,1),
("b",2,2),
("d",4,2),
("e",4,1),
("c",3,4)]).toDF(['SID','SSection','SRank'])
df1.show()
+---+--------+-----+
|SID|SSection|SRank|
+---+--------+-----+
| a| 1| 1|
| b| 2| 2|
| d| 4| 2|
| e| 4| 1|
| c| 3| 4|
+---+--------+-----+
df2 is
df2 = sc.parallelize([
    ("a",2,1),
    ("b",2,3),
    ("f",4,2),
    ("e",4,1),
    ("c",3,4)]).toDF(['SID','SSection','SRank'])
+---+--------+-----+
|SID|SSection|SRank|
+---+--------+-----+
| a| 2| 1|
| b| 2| 3|
| f| 4| 2|
| e| 4| 1|
|  c|       3|    4|
+---+--------+-----+
I want to join the above two tables as shown below:
+---+--------+----------+----------+
|SID|SSection|test1SRank|test2SRank|
+---+--------+----------+----------+
| f| 4| 0| 2|
| e| 4| 1| 1|
| d| 4| 2| 0|
| c| 3| 4| 4|
| b| 2| 2| 3|
| a| 1| 1| 0|
| a| 2| 0| 1|
+---+--------+----------+----------+
Doesn't look like something that can be achieved with a single join. Here's a solution involving multiple joins:
from pyspark.sql.functions import col
d1 = df1.unionAll(df2).select("SID", "SSection").distinct()
t1 = d1.join(df1, ["SID", "SSection"], "leftOuter").select(d1.SID, d1.SSection, col("SRank").alias("test1Srank"))
t2 = d1.join(df2, ["SID", "SSection"], "leftOuter").select(d1.SID, d1.SSection, col("SRank").alias("test2Srank"))
t1.join(t2, ["SID", "SSection"]).na.fill(0).show()
+---+--------+----------+----------+
|SID|SSection|test1Srank|test2Srank|
+---+--------+----------+----------+
| b| 2| 2| 3|
| c| 3| 4| 4|
| d| 4| 2| 0|
| e| 4| 1| 1|
| f| 4| 0| 2|
| a| 1| 1| 0|
| a| 2| 0| 1|
+---+--------+----------+----------+
You can simply rename the SRank columns, do an outer join, and then use the na.fill function:
df1.withColumnRenamed("SRank", "test1SRank").join(df2.withColumnRenamed("SRank", "test2SRank"), ["SID", "SSection"], "outer").na.fill(0)
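The same one-liner, written out step by step (a sketch; t1 and t2 are just intermediate names for readability):
t1 = df1.withColumnRenamed("SRank", "test1SRank")
t2 = df2.withColumnRenamed("SRank", "test2SRank")
t1.join(t2, ["SID", "SSection"], "outer").na.fill(0).show()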
I have a dataframe where the number of columns is variable. Every column is of type Int, and I want to get the sum of all columns. I thought of using :_*; this is my code:
val arr = Array(1,4,3,2,5,7,3,5,4,18)
val input = new ArrayBuffer[(Int,Int)]()
for (i <- 0 until 10) {
  input.append((i, arr(i % 10)))
}
var df = sc.parallelize(input, 3).toDF("value1", "value2")
val cols = new ArrayBuffer[Column]()
val colNames = df.columns
for (name <- colNames) {
  cols.append(col(name))
}
val func = udf((s: Int*) => s.sum)
df.withColumn("sum", func(cols:_*)).show()
But I get an error:
Error:(101, 27) ')' expected but identifier found.
val func = udf((s: Int*) => s.sum)
How do I use :_* in a udf?
My expected result is:
+------+------+---+
|value1|value2|sum|
+------+------+---+
| 0| 1| 1|
| 1| 4| 5|
| 2| 3| 5|
| 3| 2| 5|
| 4| 5| 9|
| 5| 7| 12|
| 6| 3| 9|
| 7| 5| 12|
| 8| 4| 12|
| 9| 18| 27|
+------+------+---+
This may be what you expect:
val func = udf((s: Seq[Int]) => s.sum)
df.withColumn("sum", func(array(cols: _*))).show()
where array is org.apache.spark.sql.functions.array, which creates a new array column. The input columns must all have the same data type.
Spark UDFs do not support variable-length arguments.
Here is a solution for your problem.
import org.apache.spark.sql.functions.col
import spark.implicits._
val input = Array(1,4,3,2,5,7,3,5,4,18).zipWithIndex
var df = spark.sparkContext.parallelize(input, 3).toDF("value2", "value1")
df.withColumn("total", df.columns.map(col(_)).reduce(_ + _)).show()
Output:
+------+------+-----+
|value2|value1|total|
+------+------+-----+
| 1| 0| 1|
| 4| 1| 5|
| 3| 2| 5|
| 2| 3| 5|
| 5| 4| 9|
| 7| 5| 12|
| 3| 6| 9|
| 5| 7| 12|
| 4| 8| 12|
| 18| 9| 27|
+------+------+-----+
Hope this helps
You can try VectorAssembler:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector
import breeze.linalg.DenseVector
val assembler = new VectorAssembler().
  setInputCols(Array("value1", "value2")).
  setOutputCol("allNum")
val assembledDF = assembler.transform(df)
assembledDF.show
+------+------+----------+
|value1|value2| allNum|
+------+------+----------+
| 0| 1| [0.0,1.0]|
| 1| 4| [1.0,4.0]|
| 2| 3| [2.0,3.0]|
| 3| 2| [3.0,2.0]|
| 4| 5| [4.0,5.0]|
| 5| 7| [5.0,7.0]|
| 6| 3| [6.0,3.0]|
| 7| 5| [7.0,5.0]|
| 8| 4| [8.0,4.0]|
| 9| 18|[9.0,18.0]|
+------+------+----------+
def yourSumUDF = udf((allNum:Vector) => new DenseVector(allNum.toArray).sum)
assembledDF.withColumn("sum", yourSumUDF($"allNum")).show
+------+------+----------+----+
|value1|value2| allNum| sum|
+------+------+----------+----+
| 0| 1| [0.0,1.0]| 1.0|
| 1| 4| [1.0,4.0]| 5.0|
| 2| 3| [2.0,3.0]| 5.0|
| 3| 2| [3.0,2.0]| 5.0|
| 4| 5| [4.0,5.0]| 9.0|
| 5| 7| [5.0,7.0]|12.0|
| 6| 3| [6.0,3.0]| 9.0|
| 7| 5| [7.0,5.0]|12.0|
| 8| 4| [8.0,4.0]|12.0|
| 9| 18|[9.0,18.0]|27.0|
+------+------+----------+----+