Append a value after every element in PySpark list Dataframe - python-3.x

I have a dataframe like this:
Data        ID
[1,2,3,4]   22
I want to create a new column where each value from the Data field is joined with the ID by a | symbol and the pairs are separated by ~, like below:
Data        ID   New_Column
[1,2,3,4]   22   [1|22~2|22~3|22~4|22]
Note: the array size in the Data field is not fixed; it may be empty or contain any number of entries.
Can anyone please help me solve this?

package spark

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DF extends App {

  val spark = SparkSession.builder()
    .master("local")
    .appName("DataFrame-example")
    .getOrCreate()

  import spark.implicits._

  val df = Seq(
    (22, Seq(1, 2, 3, 4)),
    (23, Seq(1, 2, 3, 4, 5, 6, 7, 8)),
    (24, Seq())
  ).toDF("ID", "Data")

  val arrUDF = udf((id: Long, array: Seq[Long]) => {
    val r = array.size match {
      case 0 => ""
      case _ => array.map(x => s"$x|$id").mkString("~")
    }
    s"[$r]"
  })

  val resDF = df.withColumn("New_Column", lit(arrUDF('ID, 'Data)))

  resDF.show(false)
  //+---+------------------------+-----------------------------------------+
  //|ID |Data                    |New_Column                               |
  //+---+------------------------+-----------------------------------------+
  //|22 |[1, 2, 3, 4]            |[1|22~2|22~3|22~4|22]                    |
  //|23 |[1, 2, 3, 4, 5, 6, 7, 8]|[1|23~2|23~3|23~4|23~5|23~6|23~7|23~8|23]|
  //|24 |[]                      |[]                                       |
  //+---+------------------------+-----------------------------------------+
}

Spark 2.4+
The PySpark equivalent of the same goes like this:
from pyspark.sql import functions as f

df = spark.createDataFrame([(22, [1,2,3,4]), (23, [1,2,3,4,5,6,7,8]), (24, [])], ['Id', 'Data'])
df.show()
+---+--------------------+
| Id| Data|
+---+--------------------+
| 22| [1, 2, 3, 4]|
| 23|[1, 2, 3, 4, 5, 6...|
| 24| []|
+---+--------------------+
df.withColumn(
    'ff',
    f.when(f.size('Data') == 0, '').otherwise(
        f.expr("concat_ws('~', transform(Data, x -> concat(x, '|', Id)))")
    )
).show(20, False)
+---+------------------------+---------------------------------------+
|Id |Data |ff |
+---+------------------------+---------------------------------------+
|22 |[1, 2, 3, 4] |1|22~2|22~3|22~4|22 |
|23 |[1, 2, 3, 4, 5, 6, 7, 8]|1|23~2|23~3|23~4|23~5|23~6|23~7|23~8|23|
|24 |[] | |
+---+------------------------+---------------------------------------+
If you want the final output as an array:
df.withColumn(
    'ff',
    f.array(
        f.when(f.size('Data') == 0, '').otherwise(
            f.expr("concat_ws('~', transform(Data, x -> concat(x, '|', Id)))")
        )
    )
).show(20, False)
+---+------------------------+-----------------------------------------+
|Id |Data |ff |
+---+------------------------+-----------------------------------------+
|22 |[1, 2, 3, 4] |[1|22~2|22~3|22~4|22] |
|23 |[1, 2, 3, 4, 5, 6, 7, 8]|[1|23~2|23~3|23~4|23~5|23~6|23~7|23~8|23]|
|24 |[] |[] |
+---+------------------------+-----------------------------------------+
Hope this helps
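For reference, on Spark 3.1+ the same logic can be written with Python column functions instead of a SQL expression string. A minimal sketch, assuming the same df and the f alias from above (transform and concat_ws became Python-level functions in 3.1):
from pyspark.sql import functions as f

df.withColumn(
    'ff',
    f.when(f.size('Data') == 0, '').otherwise(
        f.concat_ws(
            '~',
            # build "<element>|<Id>" for every element, then join the pieces with "~"
            f.transform('Data', lambda x: f.concat(x.cast('string'), f.lit('|'), f.col('Id').cast('string')))
        )
    )
).show(20, False)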

A udf can help:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def func(array, suffix):
    # join each array element with the suffix using '|', separating the pairs with '~'
    return '~'.join([str(x) + '|' + str(suffix) for x in array])

my_udf = F.udf(func, StringType())

df.withColumn("New_Column", my_udf("Data", "ID")).show()
prints
+------------+---+-------------------+
|        Data| ID|         New_Column|
+------------+---+-------------------+
|[1, 2, 3, 4]| 22|1|22~2|22~3|22~4|22|
+------------+---+-------------------+
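If you also want the bracketed form shown in the question, the udf can add the brackets itself; a minimal variation of func above (func_bracketed is just an illustrative name), which also tolerates empty or missing arrays:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def func_bracketed(array, suffix):
    # same '~' join as func, wrapped in [ ] and safe for empty/None arrays
    body = '~'.join(str(x) + '|' + str(suffix) for x in (array or []))
    return '[' + body + ']'

my_udf_bracketed = F.udf(func_bracketed, StringType())
df.withColumn("New_Column", my_udf_bracketed("Data", "ID")).show(truncate=False)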

Related

Choose from multinomial distribution

I have a series of values and a probability with which I want each of those values sampled. Is there a PySpark method to sample from that distribution for each row? I know how to hard-code it with a random number generator, but I want this method to be flexible for any number of assignment values and probabilities:
assignment_values = ["foo", "buzz", "boo"]
value_probabilities = [0.3, 0.3, 0.4]
Hard-coded method with a random number generator:
from pyspark.sql import Row
from pyspark.sql import functions as F

data = [
    {"person": 1, "company": "5g"},
    {"person": 2, "company": "9s"},
    {"person": 3, "company": "1m"},
    {"person": 4, "company": "3l"},
    {"person": 5, "company": "2k"},
    {"person": 6, "company": "7c"},
    {"person": 7, "company": "3m"},
    {"person": 8, "company": "2p"},
    {"person": 9, "company": "4s"},
    {"person": 10, "company": "8y"},
]
df = spark.createDataFrame(Row(**x) for x in data)

(
    df
    .withColumn("rand", F.rand())
    .withColumn(
        "assignment",
        F.when(F.col("rand") < F.lit(0.3), "foo")
        .when(F.col("rand") < F.lit(0.6), "buzz")
        .otherwise("boo")
    )
    .show()
)
+-------+------+-------------------+----------+
|company|person| rand|assignment|
+-------+------+-------------------+----------+
| 5g| 1| 0.8020603266148111| boo|
| 9s| 2| 0.1297179045352752| foo|
| 1m| 3|0.05170251723736685| foo|
| 3l| 4|0.07978240998283603| foo|
| 2k| 5| 0.5931269297050258| buzz|
| 7c| 6|0.44673560271164037| buzz|
| 3m| 7| 0.1398027427612647| foo|
| 2p| 8| 0.8281404801171598| boo|
| 4s| 9|0.15568513681001817| foo|
| 8y| 10| 0.6173220502731542| boo|
+-------+------+-------------------+----------+
I think randomSplit may serve you. It randomly splits your dataframe into several dataframes and puts them all into a list.
df.randomSplit([0.3, 0.3, 0.4])
You can also provide a seed to it.
You can join the dfs back together using reduce
from pyspark.sql import functions as F
from functools import reduce
df = spark.createDataFrame(
    [(1, "5g"),
     (2, "9s"),
     (3, "1m"),
     (4, "3l"),
     (5, "2k"),
     (6, "7c"),
     (7, "3m"),
     (8, "2p"),
     (9, "4s"),
     (10, "8y")],
    ['person', 'company'])
assignment_values = ["foo", "buzz", "boo"]
value_probabilities = [0.3, 0.3, 0.4]
dfs = df.randomSplit(value_probabilities, 5)
dfs = [df.withColumn('assignment', F.lit(assignment_values[i])) for i, df in enumerate(dfs)]
df = reduce(lambda a, b: a.union(b), dfs)
df.show()
# +------+-------+----------+
# |person|company|assignment|
# +------+-------+----------+
# | 1| 5g| foo|
# | 2| 9s| foo|
# | 6| 7c| foo|
# | 4| 3l| buzz|
# | 5| 2k| buzz|
# | 8| 2p| buzz|
# | 3| 1m| boo|
# | 7| 3m| boo|
# | 9| 4s| boo|
# | 10| 8y| boo|
# +------+-------+----------+
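For comparison, the hard-coded when() chain from the question can also be built programmatically, so it stays flexible for any number of values. A minimal sketch, assuming a person/company dataframe like the one above (here called df) and the assignment_values / value_probabilities lists from the question; the chain compares rand() against cumulative probabilities:
from itertools import accumulate
from pyspark.sql import functions as F

# cumulative upper bounds for each value: 0.3, 0.6, 1.0
bounds = list(accumulate(value_probabilities))

assignment_col = None
for value, bound in zip(assignment_values, bounds):
    cond = F.col("rand") < F.lit(bound)
    assignment_col = F.when(cond, value) if assignment_col is None else assignment_col.when(cond, value)

# the last bound is 1.0, so every row matches one of the branches
df.withColumn("rand", F.rand(seed=5)).withColumn("assignment", assignment_col).show()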

Alternative of groupby in Pyspark to improve performance of Pyspark code

My PySpark data frame looks like the one built below. I have to remove the groupBy function from the PySpark code to improve its performance; the operations have to run on about 100k rows of data.
To create the DataFrame:
df = spark.createDataFrame([
    (0, ['-9.53', '-9.35', '0.18']),
    (1, ['-7.77', '-7.61', '0.16']),
    (2, ['-5.80', '-5.71', '0.10']),
    (0, ['1', '2', '3']),
    (1, ['4', '5', '6']),
    (2, ['8', '98', '32'])
], ["id", "Array"])
The expected output is produced using this code:
import pyspark.sql.functions as f

df.groupBy('id').agg(f.collect_list(f.col("Array")).alias('Array')).\
    select("id", f.flatten("Array")).show()
I have to achieve the output in this format. The code above already gives it to me, but I need the same result without the groupBy.
+---+-------------------------------+
|id |flatten(Array) |
+---+-------------------------------+
|0 |[-9.53, -9.35, 0.18, 1, 2, 3] |
|1 |[-7.77, -7.61, 0.16, 4, 5, 6] |
|2 |[-5.80, -5.71, 0.10, 8, 98, 32]|
+---+-------------------------------+
If you don't want to do group by, you can use window functions:
import pyspark.sql.functions as f
from pyspark.sql.window import Window

df2 = df.select(
    "id",
    f.flatten(f.collect_list(f.col("Array")).over(Window.partitionBy("id"))).alias("Array")
).distinct()

df2.show(truncate=False)
+---+-------------------------------+
|id |Array |
+---+-------------------------------+
|0 |[-9.53, -9.35, 0.18, 1, 2, 3] |
|1 |[-7.77, -7.61, 0.16, 4, 5, 6] |
|2 |[-5.80, -5.71, 0.10, 8, 98, 32]|
+---+-------------------------------+
You can also try
df.select(
    'id',
    f.explode('Array').alias('Array')
).groupBy('id').agg(f.collect_list('Array').alias('Array'))
Although I'm not sure if it'll be faster.
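To check what each variant costs, comparing the physical plans is a quick sanity check; a small sketch reusing df, df2 and the f alias from above (look for Exchange operators, which mark shuffles):
# window-based variant
df2.explain()

# explode + groupBy variant
df.select(
    'id',
    f.explode('Array').alias('Array')
).groupBy('id').agg(f.collect_list('Array').alias('Array')).explain()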

Can I use regexp_replace or some equivalent to replace multiple values in a pyspark dataframe column with one line of code?

Can I use regexp_replace or some equivalent to replace multiple values in a pyspark dataframe column with one line of code?
Here is the code to create my dataframe:
from pyspark import SparkContext, SparkConf, SQLContext
from datetime import datetime

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

data1 = [
    ('George', datetime(2010, 3, 24, 3, 19, 58), 13),
    ('George', datetime(2020, 9, 24, 3, 19, 6), 8),
    ('George', datetime(2009, 12, 12, 17, 21, 30), 5),
    ('Micheal', datetime(2010, 11, 22, 13, 29, 40), 12),
    ('Maggie', datetime(2010, 2, 8, 3, 31, 23), 8),
    ('Ravi', datetime(2009, 1, 1, 4, 19, 47), 2),
    ('Xien', datetime(2010, 3, 2, 4, 33, 51), 3),
]
df1 = sqlContext.createDataFrame(data1, ['name', 'trial_start_time', 'purchase_time'])
df1.show(truncate=False)
Here is the dataframe:
+-------+-------------------+-------------+
|name |trial_start_time |purchase_time|
+-------+-------------------+-------------+
|George |2010-03-24 07:19:58|13 |
|George |2020-09-24 07:19:06|8 |
|George |2009-12-12 22:21:30|5 |
|Micheal|2010-11-22 18:29:40|12 |
|Maggie |2010-02-08 08:31:23|8 |
|Ravi |2009-01-01 09:19:47|2 |
|Xien |2010-03-02 09:33:51|3 |
+-------+-------------------+-------------+
Here is a working example to replace one string:
from pyspark.sql.functions import regexp_replace, regexp_extract, col
df1.withColumn("name", regexp_replace('name', "Ravi", "Ravi_renamed")).show()
Here is the output:
+------------+-------------------+-------------+
| name| trial_start_time|purchase_time|
+------------+-------------------+-------------+
| George|2010-03-24 07:19:58| 13|
| George|2020-09-24 07:19:06| 8|
| George|2009-12-12 22:21:30| 5|
| Micheal|2010-11-22 18:29:40| 12|
| Maggie|2010-02-08 08:31:23| 8|
|Ravi_renamed|2009-01-01 09:19:47| 2|
| Xien|2010-03-02 09:33:51| 3|
+------------+-------------------+-------------+
In pandas I could replace multiple strings in one line of code with a lambda expression:
df1['name'].apply(lambda x: x.replace('George', 'George_renamed1').replace('Ravi', 'Ravi_renamed2'))
I am not sure if this can be done in pyspark with regexp_replace. Perhaps there is another alternative? When I read about using lambda expressions in pyspark, it seems I have to create udf functions (which seem to get a little long). But I am curious whether I can simply run some type of regex expression on multiple strings like the above in one line of code.
This is what you're looking for:
Using when() (most readable):
from pyspark.sql.functions import when, col

df1.withColumn('name',
    when(col('name') == 'George', 'George_renamed1')
    .when(col('name') == 'Ravi', 'Ravi_renamed2')
    .otherwise(col('name'))
)
With a mapping expr (less explicit, but handy if there are many values to replace):
from pyspark.sql import functions as F

df1 = df1.withColumn('name', F.expr("coalesce(map('George', 'George_renamed1', 'Ravi', 'Ravi_renamed2')[name], name)"))
or, if you already have a list to use, i.e.
name_changes = ['George', 'George_renamed1', 'Ravi', 'Ravi_renamed2']
# str()[1:-1] converts the list to a string and strips the [ ]
df1 = df1.withColumn('name', F.expr(f'coalesce(map({str(name_changes)[1:-1]})[name], name)'))
The above, but using only PySpark imported functions:
from pyspark.sql.functions import create_map, lit, coalesce

mapping_expr = create_map([lit(x) for x in name_changes])
df1 = df1.withColumn('name', coalesce(mapping_expr[df1['name']], 'name'))
Result
df1.withColumn('name', F.expr("coalesce(map('George', 'George_renamed1', 'Ravi', 'Ravi_renamed2')[name],name)")).show()
+---------------+-------------------+-------------+
| name| trial_start_time|purchase_time|
+---------------+-------------------+-------------+
|George_renamed1|2010-03-24 03:19:58| 13|
|George_renamed1|2020-09-24 03:19:06| 8|
|George_renamed1|2009-12-12 17:21:30| 5|
| Micheal|2010-11-22 13:29:40| 12|
| Maggie|2010-02-08 03:31:23| 8|
| Ravi_renamed2|2009-01-01 04:19:47| 2|
| Xien|2010-03-02 04:33:51| 3|
+---------------+-------------------+-------------+
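For completeness, two other one-liners that should also work here: nesting regexp_replace calls, and DataFrame.replace, which accepts a dict of exact (non-regex) substitutions. A minimal sketch on df1:
from pyspark.sql.functions import regexp_replace

# nested regexp_replace: one expression, two patterns
df1.withColumn(
    "name",
    regexp_replace(regexp_replace("name", "George", "George_renamed1"), "Ravi", "Ravi_renamed2")
).show()

# DataFrame.replace with a dict does exact string matches (no regex involved)
df1.replace({'George': 'George_renamed1', 'Ravi': 'Ravi_renamed2'}, subset=['name']).show()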

Pyspark UDF to return result similar to groupby().sum() between two columns

I have the following sample dataframe
fruit_list = ['apple', 'apple', 'orange', 'apple']
qty_list = [16, 2, 3, 1]
spark_df = spark.createDataFrame([(101, 'Mark', fruit_list, qty_list)], ['ID', 'name', 'fruit', 'qty'])
and I would like to create another column which contains a result similar to what I would achieve with a pandas groupby('fruit').sum():
        qty
fruits
apple    19
orange    3
The above result could be stored in the new column in any form (either a string, dictionary, list of tuples...).
I've tried an approach similar to the following one, which does not work:
sum_cols = udf(lambda x: pd.DataFrame({'fruits': x[0], 'qty': x[1]}).groupby('fruits').sum())
spark_df.withColumn('Result', sum_cols(F.struct('fruit', 'qty'))).show()
One example of result dataframe could be
+---+----+--------------------+-------------+-------------------------+
| ID|name| fruit| qty| Result|
+---+----+--------------------+-------------+-------------------------+
|101|Mark|[apple, apple, or...|[16, 2, 3, 1]|[(apple,19), (orange,3)] |
+---+----+--------------------+-------------+-------------------------+
Do you have any suggestion on how I could achieve that?
Thanks
Edit: running on Spark 2.4.3
As @pault mentioned, as of Spark 2.4+ you can use Spark SQL built-in functions to handle your task; here is one way with array_distinct + transform + aggregate:
from pyspark.sql.functions import expr
# set up data
spark_df = spark.createDataFrame([
      (101, 'Mark', ['apple', 'apple', 'orange', 'apple'], [16, 2, 3, 1])
    , (102, 'Twin', ['apple', 'banana', 'avocado', 'banana', 'avocado'], [5, 2, 11, 3, 1])
    , (103, 'Smith', ['avocado'], [10])
], ['ID', 'name', 'fruit', 'qty'])
>>> spark_df.show(5,0)
+---+-----+-----------------------------------------+----------------+
|ID |name |fruit |qty |
+---+-----+-----------------------------------------+----------------+
|101|Mark |[apple, apple, orange, apple] |[16, 2, 3, 1] |
|102|Twin |[apple, banana, avocado, banana, avocado]|[5, 2, 11, 3, 1]|
|103|Smith|[avocado] |[10] |
+---+-----+-----------------------------------------+----------------+
>>> spark_df.printSchema()
root
|-- ID: long (nullable = true)
|-- name: string (nullable = true)
|-- fruit: array (nullable = true)
| |-- element: string (containsNull = true)
|-- qty: array (nullable = true)
| |-- element: long (containsNull = true)
Set up the SQL statement:
stmt = '''
    transform(array_distinct(fruit), x -> (x, aggregate(
        transform(sequence(0, size(fruit)-1), i -> IF(fruit[i] = x, qty[i], 0))
        , 0
        , (y, z) -> int(y + z)
    ))) AS sum_fruit
'''
>>> spark_df.withColumn('sum_fruit', expr(stmt)).show(10,0)
+---+-----+-----------------------------------------+----------------+----------------------------------------+
|ID |name |fruit |qty |sum_fruit |
+---+-----+-----------------------------------------+----------------+----------------------------------------+
|101|Mark |[apple, apple, orange, apple] |[16, 2, 3, 1] |[[apple, 19], [orange, 3]] |
|102|Twin |[apple, banana, avocado, banana, avocado]|[5, 2, 11, 3, 1]|[[apple, 5], [banana, 5], [avocado, 12]]|
|103|Smith|[avocado] |[10] |[[avocado, 10]] |
+---+-----+-----------------------------------------+----------------+----------------------------------------+
Explanation:
Use array_distinct(fruit) to find all distinct entries in the array fruit.
transform this new array (with element x) from x to (x, aggregate(..x..)).
The above function aggregate(..x..) takes the simple form of summing up all elements in array_T:
aggregate(array_T, 0, (y, z) -> int(y + z))
where array_T comes from the following transformation:
transform(sequence(0, size(fruit)-1), i -> IF(fruit[i] = x, qty[i], 0))
which iterates through the array fruit: if fruit[i] = x, it returns the corresponding qty[i], otherwise it returns 0. For example, for ID=101, when x = 'orange', it returns the array [0, 0, 3, 0].
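To see that inner step in isolation, the aggregate call can be run on a literal array; a quick sketch just for illustration:
# the [0, 0, 3, 0] array from the ID=101 / x='orange' example sums to 3
spark.sql("SELECT aggregate(array(0, 0, 3, 0), 0, (y, z) -> int(y + z)) AS total").show()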
There may be a fancy way to do this using only the API functions on Spark 2.4+, perhaps with some combination of arrays_zip and aggregate, but I can't think of any that don't involve an explode step followed by a groupBy. With that in mind, using a udf may actually be better for you in this case.
I think creating a pandas DataFrame just for the purpose of calling .groupby().sum() is overkill. Furthermore, even if you did do it that way, you'd need to convert the final output to a different data structure because a udf can't return a pandas DataFrame.
Here's one way with a udf using collections.defaultdict:
from collections import defaultdict
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StructType, StructField, StringType, IntegerType

def sum_cols_func(frt, qty):
    # sum the quantities per fruit
    d = defaultdict(int)
    for x, y in zip(frt, map(int, qty)):
        d[x] += y
    # return a list (not a dict_items view) so it maps cleanly to the ArrayType
    return list(d.items())

sum_cols = udf(
    lambda x: sum_cols_func(*x),
    ArrayType(
        StructType([StructField("fruit", StringType()), StructField("qty", IntegerType())])
    )
)
Then call this by passing in the fruit and qty columns:
from pyspark.sql.functions import array, col
spark_df.withColumn(
"Result",
sum_cols(array([col("fruit"), col("qty")]))
).show(truncate=False)
#+---+----+-----------------------------+-------------+--------------------------+
#|ID |name|fruit |qty |Result |
#+---+----+-----------------------------+-------------+--------------------------+
#|101|Mark|[apple, apple, orange, apple]|[16, 2, 3, 1]|[[orange, 3], [apple, 19]]|
#+---+----+-----------------------------+-------------+--------------------------+
If you have Spark < 2.4, use the following to explode (otherwise check this answer):
from pyspark.sql import functions as F

df_split = (spark_df.rdd
            .flatMap(lambda row: [(row.ID, row.name, f, q) for f, q in zip(row.fruit, row.qty)])
            .toDF(["ID", "name", "fruit", "qty"]))
df_split.show()
Output:
+---+----+------+---+
| ID|name| fruit|qty|
+---+----+------+---+
|101|Mark| apple| 16|
|101|Mark| apple| 2|
|101|Mark|orange| 3|
|101|Mark| apple| 1|
+---+----+------+---+
Then prepare the result you want. First find the aggregated dataframe:
df_aggregated = df_split.groupby('ID', 'fruit').agg(F.sum('qty').alias('qty'))
df_aggregated.show()
Output:
+---+------+---+
| ID| fruit|qty|
+---+------+---+
|101|orange| 3|
|101| apple| 19|
+---+------+---+
And finally change it to the desired format:
df_aggregated.groupby('ID').agg(F.collect_list(F.struct(F.col('fruit'), F.col('qty'))).alias('Result')).show()
Output:
+---+--------------------------+
|ID |Result |
+---+--------------------------+
|101|[[orange, 3], [apple, 19]]|
+---+--------------------------+
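If that Result column is needed back on the original spark_df rows, a left join on ID would do it; a minimal sketch reusing df_aggregated from above (assuming ID is unique per row):
from pyspark.sql import functions as F

result = df_aggregated.groupby('ID').agg(
    F.collect_list(F.struct(F.col('fruit'), F.col('qty'))).alias('Result')
)

# attach the per-ID Result list back to the original rows
spark_df.join(result, on='ID', how='left').show(truncate=False)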

How to do Rdd and broadcasted Rdd multiplication in pyspark?

I have two data frames like below:
data frame 1 (df1):
+---+----------+
|id |features |
+---+----------+
|8 |[5, 4, 5] |
|9 |[4, 5, 2] |
+---+----------+
data frame 2 (df2):
+---+----------+
|id |features |
+---+----------+
|1 |[1, 2, 3] |
|2 |[4, 5, 6] |
+---+----------+
After that I converted the DataFrames to RDDs:
rdd1 = df1.rdd
If I do rdd1.collect() the result is like below:
[Row(id=8, f=[5, 4, 5]), Row(id=9, f=[4, 5, 2])]
rdd2 = df2.rdd
broadcastedrddif = sc.broadcast(rdd2.collectAsMap())
Now broadcastedrddif.value gives
{1: [1, 2, 3], 2: [4, 5, 6]}
Now I want the sum of the element-wise multiplication of rdd1 and broadcastedrddif, i.e. it should return output like below:
((8,[(1,(5*1+2*4+5*3)),(2,(5*4+4*5+5*6))]),(9,[(1,(4*1+5*2+2*3)),(2,(4*4+5*5+2*6)]) ))
so my final output should be
((8,[(1,28),(2,70)]),(9,[(1,20),(2,53)]))
where (1, 28) is a tuple not a float.
Please help me with this.
I did not understand why you used sc.broadcast(), but I used it anyway...
mapValues is very useful on the last RDD in this case, and I used a list comprehension to execute the operations using the dictionary.
x1 = sc.parallelize([[8, 5, 4, 5], [9, 4, 5, 2]]).map(lambda x: (x[0], (x[1], x[2], x[3])))
x1.collect()

x2 = sc.parallelize([[1, 1, 2, 3], [2, 4, 5, 6]]).map(lambda x: (x[0], (x[1], x[2], x[3])))
x2.collect()

# I built the RDDs directly because it is simpler to test
broadcastedrddif = sc.broadcast(x2.collectAsMap())
d2 = broadcastedrddif.value

def sum_prod(x, y):
    # element-wise product of x and y, summed
    c = 0
    for i in range(0, len(x)):
        c += x[i] * y[i]
    return c

x1.mapValues(lambda x: [(i, sum_prod(list(x), list(d2[i]))) for i in d2.keys()]).collect()
Out[19]: [(8, [(1, 28), (2, 70)]), (9, [(1, 20), (2, 53)])]
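For what it's worth, the same dot product can also be written with zip instead of an index loop; a small sketch reusing x1 and d2 from above:
def dot(x, y):
    # element-wise multiply and sum, equivalent to sum_prod above
    return sum(a * b for a, b in zip(x, y))

x1.mapValues(lambda v: [(k, dot(v, d2[k])) for k in sorted(d2)]).collect()
# [(8, [(1, 28), (2, 70)]), (9, [(1, 20), (2, 53)])]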
