I have a dataframe where the number of columns is variable. Every column's type is Int, and I want to get the sum of all columns. I thought of using :_*. This is my code:
val arr = Array(1, 4, 3, 2, 5, 7, 3, 5, 4, 18)
val input = new ArrayBuffer[(Int, Int)]()
for (i <- 0 until 10) {
  input.append((i, arr(i % 10)))
}
var df = sc.parallelize(input, 3).toDF("value1", "value2")

val cols = new ArrayBuffer[Column]()
val colNames = df.columns
for (name <- colNames) {
  cols.append(col(name))
}

val func = udf((s: Int*) => s.sum)
df.withColumn("sum", func(cols: _*)).show()
But I get an error:
Error:(101, 27) ')' expected but identifier found.
val func = udf((s: Int*) => s.sum)
How do I use :_* in a UDF?
My expected result is:
+------+------+---+
|value1|value2|sum|
+------+------+---+
| 0| 1| 1|
| 1| 4| 5|
| 2| 3| 5|
| 3| 2| 5|
| 4| 5| 9|
| 5| 7| 12|
| 6| 3| 9|
| 7| 5| 12|
| 8| 4| 12|
| 9| 18| 27|
+------+------+---+
This may be what you expect:
val func = udf((s: Seq[Int]) => s.sum)
df.withColumn("sum", func(array(cols: _*))).show()
where array is org.apache.spark.sql.functions.array, which creates a new array column. The input columns must all have the same data type.
Spark UDFs do not support variable-length arguments. Here is a solution for your problem:
import spark.implicits._
import org.apache.spark.sql.functions.col

val input = Array(1, 4, 3, 2, 5, 7, 3, 5, 4, 18).zipWithIndex
var df = spark.sparkContext.parallelize(input, 3).toDF("value2", "value1")
df.withColumn("total", df.columns.map(col(_)).reduce(_ + _)).show()
Output:
+------+------+-----+
|value2|value1|total|
+------+------+-----+
| 1| 0| 1|
| 4| 1| 5|
| 3| 2| 5|
| 2| 3| 5|
| 5| 4| 9|
| 7| 5| 12|
| 3| 6| 9|
| 5| 7| 12|
| 4| 8| 12|
| 18| 9| 27|
+------+------+-----+
Hope this helps
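For readers working in PySpark, here is a minimal sketch of the same column-wise reduce. It assumes an active SparkSession named spark; the DataFrame name pydf is only illustrative.

from functools import reduce
from pyspark.sql.functions import col

# Same data as above: (index, value) pairs become two Int columns.
values = [1, 4, 3, 2, 5, 7, 3, 5, 4, 18]
pydf = spark.createDataFrame(list(enumerate(values)), ["value1", "value2"])
# Reduce the list of Column objects with + to sum every column row-wise.
pydf.withColumn("total", reduce(lambda a, b: a + b, [col(c) for c in pydf.columns])).show()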
You can try VectorAssembler:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector
import breeze.linalg.DenseVector

val assembler = new VectorAssembler().
  setInputCols(df.columns).
  setOutputCol("allNum")

val assembledDF = assembler.transform(df)
assembledDF.show
+------+------+----------+
|value1|value2| allNum|
+------+------+----------+
| 0| 1| [0.0,1.0]|
| 1| 4| [1.0,4.0]|
| 2| 3| [2.0,3.0]|
| 3| 2| [3.0,2.0]|
| 4| 5| [4.0,5.0]|
| 5| 7| [5.0,7.0]|
| 6| 3| [6.0,3.0]|
| 7| 5| [7.0,5.0]|
| 8| 4| [8.0,4.0]|
| 9| 18|[9.0,18.0]|
+------+------+----------+
def yourSumUDF = udf((allNum:Vector) => new DenseVector(allNum.toArray).sum)
assembledDF.withColumn("sum", yourSumUDF($"allNum")).show
+------+------+----------+----+
|value1|value2| allNum| sum|
+------+------+----------+----+
| 0| 1| [0.0,1.0]| 1.0|
| 1| 4| [1.0,4.0]| 5.0|
| 2| 3| [2.0,3.0]| 5.0|
| 3| 2| [3.0,2.0]| 5.0|
| 4| 5| [4.0,5.0]| 9.0|
| 5| 7| [5.0,7.0]|12.0|
| 6| 3| [6.0,3.0]| 9.0|
| 7| 5| [7.0,5.0]|12.0|
| 8| 4| [8.0,4.0]|12.0|
| 9| 18|[9.0,18.0]|27.0|
+------+------+----------+----+
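For completeness, here is a rough PySpark sketch of the same VectorAssembler idea. It assumes a DataFrame df with the integer columns value1 and value2, and Spark 2.x ML vectors.

from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Assemble the numeric columns into a single vector column.
assembler = VectorAssembler(inputCols=["value1", "value2"], outputCol="allNum")
assembled = assembler.transform(df)

# Sum the vector's entries in a Python UDF (toArray() gives a numpy array).
sum_udf = udf(lambda v: float(v.toArray().sum()), DoubleType())
assembled.withColumn("sum", sum_udf("allNum")).show()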
Related
I am trying to test the usage of F.count(F.col().isNotNull()) in a window function. Please see the following code script:
from pyspark.sql import functions as F
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = ([1, 5, 4],
        [1, 5, None],
        [1, 5, 1],
        [1, 5, 4],
        [2, 5, 1],
        [2, 5, 2],
        [2, 5, None],
        [2, 5, None],
        [2, 5, 4])
df = spark.createDataFrame(data, ['I_id', 'p_id', 'xyz'])
w = Window.partitionBy("I_id", "p_id").orderBy(F.col("xyz").asc_nulls_first())
df.withColumn("xyz1",F.count(F.col("xyz").isNotNull()).over(w)).show()
In the first two rows of the result, my understanding is that F.count(F.col("xyz")) should count the non-zero items from xyz = -infinity up to xyz = null. How does the trailing isNotNull() process this? Why does it get 2 for the first two rows in the xyz1 column?
If you count the Booleans, since they are either True or False, you will count all the rows in the specified window, regardless of whether xyz is null or not.
What you could do instead is sum the isNotNull Booleans rather than count them:
df.withColumn("xyz1",F.sum(F.col("xyz").isNotNull().cast('int')).over(w)).show()
+----+----+----+----+
|I_id|p_id| xyz|xyz1|
+----+----+----+----+
| 2| 5|null| 0|
| 2| 5|null| 0|
| 2| 5| 1| 1|
| 2| 5| 2| 2|
| 2| 5| 4| 3|
| 1| 5|null| 0|
| 1| 5| 1| 1|
| 1| 5| 4| 3|
| 1| 5| 4| 3|
+----+----+----+----+
Another way is to do a conditional count using when:
df.withColumn("xyz1",F.count(F.when(F.col("xyz").isNotNull(), 1)).over(w)).show()
+----+----+----+----+
|I_id|p_id| xyz|xyz1|
+----+----+----+----+
| 2| 5|null| 0|
| 2| 5|null| 0|
| 2| 5| 1| 1|
| 2| 5| 2| 2|
| 2| 5| 4| 3|
| 1| 5|null| 0|
| 1| 5| 1| 1|
| 1| 5| 4| 3|
| 1| 5| 4| 3|
+----+----+----+----+
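As a side note, count itself already ignores nulls, so counting the column directly (without the isNotNull expression) over the same window should give the same result here:

# count skips null xyz values, so this is equivalent for this window.
df.withColumn("xyz1", F.count(F.col("xyz")).over(w)).show()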
I have data like this:
>>> data = sc.parallelize([[1,5,10,0,[1,2,3,4,5,6]],[0,10,20,1,[2,3,4,5,6,7]],[1,15,25,0,[3,4,5,6,7,8]],[0,30,40,1,[4,5,6,7,8,9]]]).toDF(('a','b','c',"d","e"))
>>> data.show()
+---+---+---+---+------------------+
| a| b| c| d| e|
+---+---+---+---+------------------+
| 1| 5| 10| 0|[1, 2, 3, 4, 5, 6]|
| 0| 10| 20| 1|[2, 3, 4, 5, 6, 7]|
| 1| 15| 25| 0|[3, 4, 5, 6, 7, 8]|
| 0| 30| 40| 1|[4, 5, 6, 7, 8, 9]|
+---+---+---+---+------------------+
# columns that should be kept in the result
keep_cols = ["a", "b"]
# column 'e' should be split into split_e_cols
split_e_cols = ["one", "two", "three", "four", "five", "six"]
# I hope the result dataframe has keep_cols + split_e_cols
I want to split column e into multiple columns and keep columns a and b at the same time.
I have tried:
data.select(*(col("e").getItem(i).alias(split_e_cols[i]) for i in range(len(split_e_cols))))
and
data.select("e").rdd.flatMap(lambda x:x).toDF(split_e_cols)
neither can keep columns a and b.
Could anyone help me? Thanks.
Try this:
from pyspark.sql.functions import col

select_cols = [col(c) for c in keep_cols] + \
    [col("e").getItem(i).alias(split_e_cols[i]) for i in range(len(split_e_cols))]
data.select(*select_cols).show()
#+---+---+---+---+-----+----+----+---+
#| a| b|one|two|three|four|five|six|
#+---+---+---+---+-----+----+----+---+
#| 1| 5| 1| 2| 3| 4| 5| 6|
#| 0| 10| 2| 3| 4| 5| 6| 7|
#| 1| 15| 3| 4| 5| 6| 7| 8|
#| 0| 30| 4| 5| 6| 7| 8| 9|
#+---+---+---+---+-----+----+----+---+
Or using a for loop with withColumn:
data = data.select(keep_cols + ["e"])
for i in range(len(split_e_cols)):
    data = data.withColumn(split_e_cols[i], col("e").getItem(i))
data.drop("e").show()
You can concatenate the lists using +:
from pyspark.sql.functions import col
data.select(
keep_cols +
[col("e").getItem(i).alias(split_e_cols[i]) for i in range(len(split_e_cols))]
).show()
+---+---+---+---+-----+----+----+---+
| a| b|one|two|three|four|five|six|
+---+---+---+---+-----+----+----+---+
| 1| 5| 1| 2| 3| 4| 5| 6|
| 0| 10| 2| 3| 4| 5| 6| 7|
| 1| 15| 3| 4| 5| 6| 7| 8|
| 0| 30| 4| 5| 6| 7| 8| 9|
+---+---+---+---+-----+----+----+---+
A more Pythonic way is to use enumerate instead of range(len(...)):
from pyspark.sql.functions import col
data.select(
keep_cols +
[col("e").getItem(i).alias(c) for (i, c) in enumerate(split_e_cols)]
).show()
+---+---+---+---+-----+----+----+---+
| a| b|one|two|three|four|five|six|
+---+---+---+---+-----+----+----+---+
| 1| 5| 1| 2| 3| 4| 5| 6|
| 0| 10| 2| 3| 4| 5| 6| 7|
| 1| 15| 3| 4| 5| 6| 7| 8|
| 0| 30| 4| 5| 6| 7| 8| 9|
+---+---+---+---+-----+----+----+---+
I have a Spark dataframe of six columns, say (col1, col2, ..., col6). I want to create a unique ID for each combination of values from col1 and col2 and add it to the dataframe. Can someone help me with some PySpark code on how to do it?
You can achieve it using monotonically_increasing_id (PySpark > 1.6) or monotonicallyIncreasingId (PySpark < 1.6):
>>> from pyspark.sql.functions import monotonically_increasing_id
>>> rdd=sc.parallelize([[12,23,3,4,5,6],[12,23,56,67,89,20],[12,23,0,0,0,0],[12,2,12,12,12,23],[1,2,3,4,56,7],[1,2,3,4,56,7]])
>>> df = rdd.toDF(['col_1','col_2','col_3','col_4','col_5','col_6'])
>>> df.show()
+-----+-----+-----+-----+-----+-----+
|col_1|col_2|col_3|col_4|col_5|col_6|
+-----+-----+-----+-----+-----+-----+
| 12| 23| 3| 4| 5| 6|
| 12| 23| 56| 67| 89| 20|
| 12| 23| 0| 0| 0| 0|
| 12| 2| 12| 12| 12| 23|
| 1| 2| 3| 4| 56| 7|
| 1| 2| 3| 4| 56| 7|
+-----+-----+-----+-----+-----+-----+
>>> df_1=df.groupBy(df.col_1,df.col_2).count().withColumn("id", monotonically_increasing_id()).select(['col_1','col_2','id'])
>>> df_1.show()
+-----+-----+-------------+
|col_1|col_2| id|
+-----+-----+-------------+
| 12| 23| 34359738368|
| 1| 2|1434519076864|
| 12| 2|1554778161152|
+-----+-----+-------------+
>>> df.join(df_1,(df.col_1==df_1.col_1) & (df.col_2==df_1.col_2)).drop(df_1.col_1).drop(df_1.col_2).show()
+-----+-----+-----+-----+-----+-----+-------------+
|col_3|col_4|col_5|col_6|col_1|col_2| id|
+-----+-----+-----+-----+-----+-----+-------------+
| 3| 4| 5| 6| 12| 23| 34359738368|
| 56| 67| 89| 20| 12| 23| 34359738368|
| 0| 0| 0| 0| 12| 23| 34359738368|
| 3| 4| 56| 7| 1| 2|1434519076864|
| 3| 4| 56| 7| 1| 2|1434519076864|
| 12| 12| 12| 23| 12| 2|1554778161152|
+-----+-----+-----+-----+-----+-----+-------------+
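If you prefer to avoid the groupBy-plus-join round trip, another possible sketch is a window-based dense_rank: it assigns the same consecutive id to every row that shares a (col_1, col_2) combination, at the cost of ordering everything in a single partition since no partitionBy is specified.

from pyspark.sql.functions import dense_rank
from pyspark.sql.window import Window

# Identical (col_1, col_2) pairs receive the same rank, i.e. the same id.
w = Window.orderBy("col_1", "col_2")
df.withColumn("id", dense_rank().over(w)).show()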
If you really need to generate the unique ID from col1 and col2, you can also create a hash value leveraging the sha2 function of Spark.
First, let's generate some dummy data:
from random import randint

max_range = 10
df1 = spark.createDataFrame(
    [(x, x * randint(1, max_range), x * 10 * randint(1, max_range)) for x in range(1, max_range)],
    ['C1', 'C2', 'C3'])
>>> df1.show()
+---+---+---+
| C1| C2| C3|
+---+---+---+
| 1| 1| 60|
| 2| 14|180|
| 3| 21|270|
| 4| 16|360|
| 5| 35|250|
| 6| 30|480|
| 7| 28|210|
| 8| 80|320|
| 9| 45|360|
+---+---+---+
Then create a new uid column from columns C2 and C3 with the following code:
from pyspark.sql.functions import col, sha2, concat
df1.withColumn("uid", sha2(concat(col("C2"), col("C3")), 256)).show(10, False)
And the output:
+---+---+---+--------------------+
| C1| C2| C3| uid|
+---+---+---+--------------------+
| 1| 1| 60|a512db2741cd20693...|
| 2| 14|180|2f6543dc6c0e06e4a...|
| 3| 21|270|bd3c65ddde4c6f733...|
| 4| 16|360|c7a1e8c59fc9dcc21...|
| 5| 35|250|cba1aeb7a72d9ae27...|
| 6| 30|480|ad7352ff8927cf790...|
| 7| 28|210|ea7bc25aa7cd3503f...|
| 8| 80|320|02e1d953517339552...|
| 9| 45|360|b485cf8f710a65755...|
+---+---+---+--------------------+
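One caveat worth noting: concat returns null as soon as any input is null, so rows with a null C2 or C3 would end up with a null uid. If that matters, concat_ws is a possible alternative, since it skips nulls and also inserts a separator (which avoids collisions like "1"+"23" vs "12"+"3"):

from pyspark.sql.functions import col, sha2, concat_ws

# concat_ws ignores null inputs, so a hash is still produced when C2 or C3 is null.
df1.withColumn("uid", sha2(concat_ws("|", col("C2"), col("C3")), 256)).show(10, False)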
I want to find the IDs of groups (or blocks) of trues in a Spark DataFrame. That is, I want to go from this:
>>> df.show()
+---------+-----+
|timestamp| bool|
+---------+-----+
| 1|false|
| 2| true|
| 3| true|
| 4|false|
| 5| true|
| 6| true|
| 7| true|
| 8| true|
| 9|false|
| 10|false|
| 11|false|
| 12|false|
| 13|false|
| 14| true|
| 15| true|
| 16| true|
+---------+-----+
to this:
>>> df.show()
+---------+-----+-----+
|timestamp| bool|block|
+---------+-----+-----+
| 1|false| 0|
| 2| true| 1|
| 3| true| 1|
| 4|false| 0|
| 5| true| 2|
| 6| true| 2|
| 7| true| 2|
| 8| true| 2|
| 9|false| 0|
| 10|false| 0|
| 11|false| 0|
| 12|false| 0|
| 13|false| 0|
| 14| true| 3|
| 15| true| 3|
| 16| true| 3|
+---------+-----+-----+
(The zeros are optional; they could be null, -1, or whatever is easier to implement.)
I have a solution in Scala; it should be easy to adapt to PySpark. Consider the following dataframe df:
+---------+-----+
|timestamp| bool|
+---------+-----+
| 1|false|
| 2| true|
| 3| true|
| 4|false|
| 5| true|
| 6| true|
| 7| true|
| 8| true|
| 9|false|
| 10|false|
| 11|false|
| 12|false|
| 13|false|
| 14| true|
| 15| true|
| 16| true|
+---------+-----+
Then you could do:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, sum, when}

df
  .withColumn("prev_bool", lag($"bool", 1).over(Window.orderBy($"timestamp")))
  .withColumn("block", sum(when(!$"prev_bool" and $"bool", 1).otherwise(0)).over(Window.orderBy($"timestamp")))
  .drop($"prev_bool")
  .withColumn("block", when($"bool", $"block").otherwise(0))
  .show()
+---------+-----+-----+
|timestamp| bool|block|
+---------+-----+-----+
| 1|false| 0|
| 2| true| 1|
| 3| true| 1|
| 4|false| 0|
| 5| true| 2|
| 6| true| 2|
| 7| true| 2|
| 8| true| 2|
| 9|false| 0|
| 10|false| 0|
| 11|false| 0|
| 12|false| 0|
| 13|false| 0|
| 14| true| 3|
| 15| true| 3|
| 16| true| 3|
+---------+-----+-----+
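Since the answer mentions adapting it to PySpark, here is a rough, untested PySpark sketch of the same logic (with the same single-partition window caveat as the Scala version):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy("timestamp")
(df
    # Flag the start of each true block: previous row false, current row true.
    .withColumn("prev_bool", F.lag("bool", 1).over(w))
    .withColumn("block", F.sum(F.when((~F.col("prev_bool")) & F.col("bool"), 1).otherwise(0)).over(w))
    .drop("prev_bool")
    # Reset block to 0 for false rows.
    .withColumn("block", F.when(F.col("bool"), F.col("block")).otherwise(0))
    .show())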
I have the following dataframe showing the revenue of purchases.
+-------+--------+-------+
|user_id|visit_id|revenue|
+-------+--------+-------+
| 1| 1| 0|
| 1| 2| 0|
| 1| 3| 0|
| 1| 4| 100|
| 1| 5| 0|
| 1| 6| 0|
| 1| 7| 200|
| 1| 8| 0|
| 1| 9| 10|
+-------+--------+-------+
Ultimately I want the new column purch_revenue to show the revenue generated by the purchase in every row.
As a workaround, I have also tried to introduce a purchase identifier purch_id, which is incremented each time a purchase is made. This is listed just as a reference.
+-------+--------+-------+-------------+--------+
|user_id|visit_id|revenue|purch_revenue|purch_id|
+-------+--------+-------+-------------+--------+
| 1| 1| 0| 100| 1|
| 1| 2| 0| 100| 1|
| 1| 3| 0| 100| 1|
| 1| 4| 100| 100| 1|
| 1| 5| 0| 200| 2|
| 1| 6| 0| 200| 2|
| 1| 7| 200| 200| 2|
| 1| 8| 0| 10| 3|
| 1| 9| 10| 10| 3|
+-------+--------+-------+-------------+--------+
I've tried to use the lag/lead function like this:
from pyspark.sql import functions as fn
from pyspark.sql.window import Window

user_timeline = Window.partitionBy("user_id").orderBy("visit_id")
find_rev = fn.when(fn.col("revenue") > 0, fn.col("revenue"))\
    .otherwise(fn.lead(fn.col("revenue"), 1).over(user_timeline))
df.withColumn("purch_revenue", find_rev)
This duplicates the revenue column if revenue > 0 and also pulls it up by one row. Clearly, I can chain this for a finite N, but that's not a solution.
Is there a way to apply this recursively until revenue > 0?
Alternatively, is there a way to increment a value based on a condition? I've tried to figure out a way to do that but struggled to find one.
Window functions don't support recursion, but it is not required here. This type of sessionization can be easily handled with a cumulative sum:
from pyspark.sql.functions import col, sum, when, lag
from pyspark.sql.window import Window
w = Window.partitionBy("user_id").orderBy("visit_id")
purch_id = sum(lag(
    when(col("revenue") > 0, 1).otherwise(0),
    1, 0
).over(w)).over(w) + 1
df.withColumn("purch_id", purch_id).show()
+-------+--------+-------+--------+
|user_id|visit_id|revenue|purch_id|
+-------+--------+-------+--------+
| 1| 1| 0| 1|
| 1| 2| 0| 1|
| 1| 3| 0| 1|
| 1| 4| 100| 1|
| 1| 5| 0| 2|
| 1| 6| 0| 2|
| 1| 7| 200| 2|
| 1| 8| 0| 3|
| 1| 9| 10| 3|
+-------+--------+-------+--------+
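If you also need the purch_revenue column from the question, one possible follow-up sketch is to broadcast each purchase's revenue across its rows with a window max keyed on the derived purch_id (within one purchase only the purchase row has revenue > 0, so max picks exactly that value):

from pyspark.sql.functions import max as max_

# Hypothetical follow-up: reuse the purch_id expression defined above as the grouping key.
w2 = Window.partitionBy("user_id", "purch_id")
(df.withColumn("purch_id", purch_id)
   .withColumn("purch_revenue", max_("revenue").over(w2))
   .show())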