PySpark - Dynamically update a column depending on the value of a cell

This is a modified version of the problem posted here https://stackoverflow.com/questions/35898687/adding-column-to-pyspark-dataframe-depending-on-whether-column-value-is-in-anoth
I am using PySpark (Spark 1.6).
I have the following data:
myDict
{'1': 'c1', '3': 'c3', '2': 'c2', '5': 'c5', '4': 'c4', '6': 'c6'}
I have the following DataFrame:
+----+----+---------+---+---+---+---+---+---+
|user|item|fav_items| c1| c2| c3| c4| c5| c6|
+----+----+---------+---+---+---+---+---+---+
| u1| 1| 1,1,3| 0| 0| 0| 0| 0| 0|
| u1| 4|4,4,4,5,6| 0| 0| 0| 0| 0| 0|
+----+----+---------+---+---+---+---+---+---+
The output should be:
+----+----+---------+---+---+---+---+---+---+
|user|item|fav_items| c1| c2| c3| c4| c5| c6|
+----+----+---------+---+---+---+---+---+---+
| u1| 1| 1,1,3| 2| 0| 1| 0| 0| 0|
| u1| 4|4,4,4,5,6| 0| 0| 0| 3| 1| 1|
+----+----+---------+---+---+---+---+---+---+
Depending on the counts and values in fav_items, look up myDict to get the column mapping and update the corresponding column. For instance, in the first row the value 1 occurs twice, and 1 maps to 'c1' in myDict, so c1 for row 1 should be 2.
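To make the intended mapping concrete, here is a small plain-Python sketch (illustration only, not part of the Spark job) of the per-row logic:
from collections import Counter

myDict = {'1': 'c1', '2': 'c2', '3': 'c3', '4': 'c4', '5': 'c5', '6': 'c6'}
counts = Counter('1,1,3'.split(','))                # Counter({'1': 2, '3': 1})
print({myDict[k]: v for k, v in counts.items()})    # {'c1': 2, 'c3': 1}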
I got the following method working, where we iterate over the columns, but this approach is inefficient since there are more than 2,000 columns:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
for key in myDict.keys():
    contains_event = udf(lambda x: x.count(key), IntegerType())
    df = df.withColumn(myDict[key], contains_event('fav_items'))
I'm looking for a more efficient method for this problem.
Thanks in advance.

Just tried it my way; hope it helps.
>>> from pyspark.sql.types import *
>>> from pyspark.sql import functions as F
>>> from collections import Counter
>>> d = {'1': 'c1', '3': 'c3', '2': 'c2', '5': 'c5', '4': 'c4', '6': 'c6'}
>>> df = spark.createDataFrame([('u1',1,'1,1,3',0,0,0,0,0,0),('u1',4,'4,4,4,5,6',0,0,0,0,0,0),('u1',1,'3,6,2',0,0,0,0,0,0)],['user','item','fav_items','c1','c2','c3','c4','c5','c6'])
>>> df.show()
+----+----+---------+---+---+---+---+---+---+
|user|item|fav_items| c1| c2| c3| c4| c5| c6|
+----+----+---------+---+---+---+---+---+---+
| u1| 1| 1,1,3| 0| 0| 0| 0| 0| 0|
| u1| 4|4,4,4,5,6| 0| 0| 0| 0| 0| 0|
| u1| 1| 3,6,2| 0| 0| 0| 0| 0| 0|
+----+----+---------+---+---+---+---+---+---+
>>> udf1 = F.udf(lambda c: Counter(c).most_common(),ArrayType(ArrayType(StringType())))
>>> df1 = df.select('user','item','fav_items',udf1(F.split(df.fav_items,',')).alias('item_counter'))
>>> df1.show(3,False)
+----+----+---------+------------------------------------------------------------+
|user|item|fav_items|item_counter |
+----+----+---------+------------------------------------------------------------+
|u1 |1 |1,1,3 |[WrappedArray(1, 2), WrappedArray(3, 1)] |
|u1 |4 |4,4,4,5,6|[WrappedArray(4, 3), WrappedArray(5, 1), WrappedArray(6, 1)]|
|u1 |1 |3,6,2 |[WrappedArray(3, 1), WrappedArray(6, 1), WrappedArray(2, 1)]|
+----+----+---------+------------------------------------------------------------+
>>> df2 = df1.select('user','item','fav_items',F.explode('item_counter').alias('val'))
>>> df2 = df2.select('user','item','fav_items','val',df2.val[0].alias('val1'),df2.val[1].alias('val2'))
>>> df2.show()
+----+----+---------+------+----+----+
|user|item|fav_items| val|val1|val2|
+----+----+---------+------+----+----+
| u1| 1| 1,1,3|[1, 2]| 1| 2|
| u1| 1| 1,1,3|[3, 1]| 3| 1|
| u1| 4|4,4,4,5,6|[4, 3]| 4| 3|
| u1| 4|4,4,4,5,6|[5, 1]| 5| 1|
| u1| 4|4,4,4,5,6|[6, 1]| 6| 1|
| u1| 1| 3,6,2|[3, 1]| 3| 1|
| u1| 1| 3,6,2|[6, 1]| 6| 1|
| u1| 1| 3,6,2|[2, 1]| 2| 1|
+----+----+---------+------+----+----+
>>> udf2 = F.udf(lambda x : d[x],StringType())
>>> df2 = df2.withColumn('d_col',udf2(df2.val1))
>>> df2.show()
+----+----+---------+------+----+----+-----+
|user|item|fav_items| val|val1|val2|d_col|
+----+----+---------+------+----+----+-----+
| u1| 1| 1,1,3|[1, 2]| 1| 2| c1|
| u1| 1| 1,1,3|[3, 1]| 3| 1| c3|
| u1| 4|4,4,4,5,6|[4, 3]| 4| 3| c4|
| u1| 4|4,4,4,5,6|[5, 1]| 5| 1| c5|
| u1| 4|4,4,4,5,6|[6, 1]| 6| 1| c6|
| u1| 1| 3,6,2|[3, 1]| 3| 1| c3|
| u1| 1| 3,6,2|[6, 1]| 6| 1| c6|
| u1| 1| 3,6,2|[2, 1]| 2| 1| c2|
+----+----+---------+------+----+----+-----+
>>> pvtdf = df2.groupby(['user','item','fav_items']).pivot('d_col').agg(F.first('val2')).na.fill({'c1':0,'c2':0,'c3':0,'c4':0,'c5':0,'c6':0})
>>> pvtdf.show()
+----+----+---------+---+---+---+---+---+---+
|user|item|fav_items| c1| c2| c3| c4| c5| c6|
+----+----+---------+---+---+---+---+---+---+
| u1| 1| 1,1,3| 2| 0| 1| 0| 0| 0|
| u1| 1| 3,6,2| 0| 1| 1| 0| 0| 1|
| u1| 4|4,4,4,5,6| 0| 0| 0| 3| 1| 1|
+----+----+---------+---+---+---+---+---+---+
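A possible follow-up for the 2k+ column case: pivot can be given the list of expected values up front, which avoids the extra pass Spark otherwise needs to discover them, and the fill dict can be built from the mapping rather than typed out. A sketch along those lines, reusing d and df2 from above:
pvtdf = df2.groupBy('user', 'item', 'fav_items') \
    .pivot('d_col', values=list(d.values())) \
    .agg(F.first('val2')) \
    .na.fill({c: 0 for c in d.values()})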

Related

How can I give an index to users' events and preserve order with PySpark?

I have the following dataframe:
+------------+------------------+--------------------+
|id. |install_time_first| timestamp|
+------------+------------------+--------------------+
| 2| 2022-02-02|2022-02-01 10:03:...|
| 3| 2022-02-01|2022-02-01 10:00:...|
| 2| 2022-02-02| null|
| 3| 2022-02-01|2022-02-03 11:35:...|
| 1| 2022-02-01| null|
| 2| 2022-02-02|2022-02-02 10:05:...|
| 3| 2022-02-01|2022-02-01 10:05:...|
| 4| 2022-02-02| null|
| 1| 2022-02-01|2022-02-01 10:05:...|
| 2| 2022-02-02|2022-02-02 10:05:...|
| 4| 2022-02-02|2022-02-03 11:35:...|
| 1| 2022-02-01| null|
| 1| 2022-02-01|2022-02-01 10:03:...|
| 1| 2022-02-01|2022-02-01 10:05:...|
| 4| 2022-02-02|2022-02-03 11:35:...|
| 2| 2022-02-02|2022-02-02 11:00:...|
| 4| 2022-02-02|2022-02-03 11:35:...|
| 3| 2022-02-01|2022-02-04 11:35:...|
| 1| 2022-02-01|2022-02-01 10:00:...|
+------------+------------------+--------------------+
And I want to sort the dataframe by install_time_first and add an index to each user (all his events) and preserve the order. For example:
+------------+------------------+--------------------+-----+
|id. |install_time_first| timestamp|index|
+------------+------------------+--------------------+-----+
| 1| 2022-02-01| null| 1|
| 1| 2022-02-01| null| 1|
| 1| 2022-02-01|2022-02-01 10:00:...| 1|
| 1| 2022-02-01|2022-02-01 10:03:...| 1|
| 1| 2022-02-01|2022-02-01 10:05:...| 1|
| 1| 2022-02-01|2022-02-01 10:05:...| 1|
| 3| 2022-02-01|2022-02-01 10:00:...| 2|
| 3| 2022-02-01|2022-02-01 10:05:...| 2|
| 3| 2022-02-01|2022-02-03 11:35:...| 2|
| 3| 2022-02-01|2022-02-04 11:35:...| 2|
| 2| 2022-02-02| null| 3|
| 2| 2022-02-02|2022-02-01 10:03:...| 3|
| 2| 2022-02-02|2022-02-02 10:05:...| 3|
| 2| 2022-02-02|2022-02-02 10:05:...| 3|
| 2| 2022-02-02|2022-02-02 11:00:...| 3|
| 4| 2022-02-02| null| 4|
| 4| 2022-02-02|2022-02-03 11:35:...| 4|
| 4| 2022-02-02|2022-02-03 11:35:...| 4|
| 4| 2022-02-02|2022-02-03 11:35:...| 4|
+------------+------------------+--------------------+-----+
How can I do that? I couldn't find a way to do it and keep the order.
The key observation here is that the "index" column takes the same value for all rows with the same "id", with groups ordered by "install_time_first". One way to look at this is to partition/order by (install_time_first, id) and assign a unique index to each couple. I wrote two solutions, the first using joins and the second using windows with some tricks; I would prefer the first one because the second can be heavy on performance:
PS: you can delete the .orderBy("install_time_first", "id") call in both solutions; I added it just to make sure the output is sorted and readable:
Prepare data:
spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([
(2, "2022-02-02", "2022-02-01 10:03"),
(3, "2022-02-01", "2022-02-01 10:00"),
(2, "2022-02-02", None),
(3, "2022-02-01", "2022-02-03 11:35"),
(1, "2022-02-01", None),
(2, "2022-02-02", "2022-02-02 10:05"),
(3, "2022-02-01", "2022-02-01 10:05"),
(4, "2022-02-02", None),
(1, "2022-02-01", "2022-02-01 10:05"),
(2, "2022-02-02", "2022-02-02 10:05"),
(4, "2022-02-02", "2022-02-03 11:35"),
(1, "2022-02-01", None),
(1, "2022-02-01", "2022-02-01 10:03"),
(1, "2022-02-01", "2022-02-01 10:05"),
(4, "2022-02-02", "2022-02-03 11:35"),
(2, "2022-02-02", "2022-02-02 11:00"),
(4, "2022-02-02", "2022-02-03 11:35"),
(3, "2022-02-01", "2022-02-04 11:35"),
(1, "2022-02-01", "2022-02-01 10:00"),
], ("id", "install_time_first", "timestamp"))
Solution 1:
df_with_index = df.select("id", "install_time_first").distinct().orderBy("install_time_first", "id") \
    .withColumn("index", monotonically_increasing_id() + 1) \
    .withColumnRenamed("id", "id2").withColumnRenamed("install_time_first", "install_time_first2")
df.join(df_with_index,
        (df.id == df_with_index.id2) & (df.install_time_first == df_with_index.install_time_first2),
        "left").orderBy("install_time_first", "id").drop("id2", "install_time_first2").show()
Solution 2:
w = Window.partitionBy(col("id")).orderBy(col("install_time_first"))
w2 = Window.orderBy(col("install_time_first"))
df = df.withColumn("prev_id", lag("id", 1, None).over(w))
df.withColumn("index", when(df.prev_id.isNull() | (df.prev_id != df.id), 1).otherwise(0))\
.withColumn("index", sum("index").over(w2.rowsBetween(Window.unboundedPreceding, Window.currentRow)))\
.orderBy("install_time_first", "id").drop("prev_id").show()
Both give the same results:
+---+------------------+----------------+-----+
| id|install_time_first| timestamp|index|
+---+------------------+----------------+-----+
| 1| 2022-02-01|2022-02-01 10:05| 1|
| 1| 2022-02-01|2022-02-01 10:00| 1|
| 1| 2022-02-01| null| 1|
| 1| 2022-02-01| null| 1|
| 1| 2022-02-01|2022-02-01 10:03| 1|
| 1| 2022-02-01|2022-02-01 10:05| 1|
| 3| 2022-02-01|2022-02-03 11:35| 2|
| 3| 2022-02-01|2022-02-01 10:00| 2|
| 3| 2022-02-01|2022-02-04 11:35| 2|
| 3| 2022-02-01|2022-02-01 10:05| 2|
| 2| 2022-02-02| null| 3|
| 2| 2022-02-02|2022-02-02 10:05| 3|
| 2| 2022-02-02|2022-02-02 10:05| 3|
| 2| 2022-02-02|2022-02-02 11:00| 3|
| 2| 2022-02-02|2022-02-01 10:03| 3|
| 4| 2022-02-02| null| 4|
| 4| 2022-02-02|2022-02-03 11:35| 4|
| 4| 2022-02-02|2022-02-03 11:35| 4|
| 4| 2022-02-02|2022-02-03 11:35| 4|
+---+------------------+----------------+-----+
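For what it's worth, the same index can also be computed in one step with dense_rank over a global window ordered by (install_time_first, id), since each such couple gets its own dense rank; like Solution 2 this pulls all rows through a single partition, so the join-based approach stays the safer choice for large data. A sketch:
from pyspark.sql import Window
from pyspark.sql.functions import dense_rank

w3 = Window.orderBy("install_time_first", "id")
df.withColumn("index", dense_rank().over(w3)) \
  .orderBy("install_time_first", "id").show()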

RowNumber with Reset

I am trying to achieve the expected output shown here:
+---+-----+--------+--------+--------+----+
| ID|State| Time|Expected|lagState|rank|
+---+-----+--------+--------+--------+----+
| 1| P|20220722| 1| null| 1|
| 1| P|20220723| 2| P| 2|
| 1| P|20220724| 3| P| 3|
| 1| P|20220725| 4| P| 4|
| 1| D|20220726| 1| P| 1|
| 1| O|20220727| 1| D| 1|
| 1| D|20220728| 1| O| 1|
| 1| P|20220729| 2| D| 1|
| 1| P|20220730| 3| P| 9|
| 1| P|20220731| 4| P| 10|
+---+-----+--------+--------+--------+----+
from pyspark.sql import functions as F
from pyspark.sql import Window as w
# create df
df = spark.createDataFrame(sc.parallelize([
[1, 'P', 20220722, 1],
[1, 'P', 20220723, 2],
[1, 'P', 20220724, 3],
[1, 'P', 20220725, 4],
[1, 'D', 20220726, 1],
[1, 'O', 20220727, 1],
[1, 'D', 20220728, 1],
[1, 'P', 20220729, 2],
[1, 'P', 20220730, 3],
[1, 'P', 20220731, 4],
]),
['ID', 'State', 'Time', 'Expected'])
# lag
df = df.withColumn('lagState', F.lag('State').over(w.partitionBy('id').orderBy('time')))
# rn
df = df.withColumn('rank', F.when( F.col('State') == F.col('lagState'), F.rank().over(w.partitionBy('id').orderBy('time', 'state'))).otherwise(1))
# view
df.show()
The general problem is that the tail of the DF is not resetting to the expected value as hoped.
data_sdf. \
withColumn('st_notsame',
func.coalesce(func.col('state') != func.lag('state').over(wd.partitionBy('id').orderBy('time')),
func.lit(False)).cast('int')
). \
withColumn('rank_temp',
func.sum('st_notsame').over(wd.partitionBy('id').orderBy('time').rowsBetween(-sys.maxsize, 0))
). \
withColumn('rank',
func.row_number().over(wd.partitionBy('id', 'rank_temp').orderBy('time'))
). \
show()
# +---+-----+--------+--------+----------+---------+----+
# | id|state| time|expected|st_notsame|rank_temp|rank|
# +---+-----+--------+--------+----------+---------+----+
# | 1| P|20220722| 1| 0| 0| 1|
# | 1| P|20220723| 2| 0| 0| 2|
# | 1| P|20220724| 3| 0| 0| 3|
# | 1| P|20220725| 4| 0| 0| 4|
# | 1| D|20220726| 1| 1| 1| 1|
# | 1| O|20220727| 1| 1| 2| 1|
# | 1| D|20220728| 1| 1| 3| 1|
# | 1| P|20220729| 2| 1| 4| 1|
# | 1| P|20220730| 3| 0| 4| 2|
# | 1| P|20220731| 4| 0| 4| 3|
# +---+-----+--------+--------+----------+---------+----+
Your expected field looks a little incorrect; I believe the rank against "20220729" should be 1.
First, flag all consecutive occurrences of the state as 0 and everything else as 1 - this enables a running sum.
Then use a sum window with an unbounded lookback within each id to get a temporary rank.
Finally, use that temporary rank as a partition column for row_number().
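For completeness, the snippet above assumes data_sdf is the input DataFrame and aliases roughly like these:
import sys
from pyspark.sql import Window as wd
from pyspark.sql import functions as func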

Pyspark groupby for all column with unpivot

I have 101 columns in a pipe-delimited file and I am looking to get counts for all columns by unpivoting the data.
Sample data:
+------------+-----------+-----------+-----------+-----------+-----------+-----------+
|       rm_ky|flag_010961|flag_011622|flag_009670|flag_009708|flag_009890|flag_009893|
+------------+-----------+-----------+-----------+-----------+-----------+-----------+
|193012020044|          0|          0|          0|          0|          0|          0|
|115012030044|          0|          0|          1|          1|          1|          1|
|140012220044|          0|          0|          0|          0|          0|          0|
|189012240044|          0|          0|          0|          0|          0|          0|
|151012350044|          0|          0|          0|          0|          0|          0|
+------------+-----------+-----------+-----------+-----------+-----------+-----------+
I have tried each column based out like
df.groupBy("flag_011622").count().show()
+------------+--------+
|flag_011622| count|
+------------+--------+
| 1| 192289|
| 0|69861980|
+------------+--------+
Instead, I'm looking for something like the following, so that I don't have to loop over each column. Any suggestions?
+-----------+-----+---------+
|  flag_name|value|   counts|
+-----------+-----+---------+
|flag_011622|    1|   192289|
|flag_011622|    0| 69861980|
|flag_009670|    1|120011800|
|flag_009670|    0|   240507|
|flag_009708|    1|119049838|
|flag_009708|    0|  1202469|
+-----------+-----+---------+
You could use the stack function, which unpivots a set of columns into rows: each (column name, value) pair becomes its own row, so the flag columns of the current DataFrame end up as name/value pairs that you can group on.
Using your sample as df:
df = df.select(
"rm_ky",
F.expr(
"""stack(5,
'flag_010961', flag_010961,
'flag_009670', flag_009670,
'flag_009708', flag_009708,
'flag_009890', flag_009890,
'flag_009893', flag_009893
) AS (flag_name, value)"""
),
)
gives:
+------------+-----------+-----+
|rm_ky |flag_name |value|
+------------+-----------+-----+
|193012020044|flag_010961|0 |
|193012020044|flag_009670|0 |
|193012020044|flag_009708|0 |
|193012020044|flag_009890|0 |
|193012020044|flag_009893|0 |
|115012030044|flag_010961|0 |
|115012030044|flag_009670|0 |
|115012030044|flag_009708|1 |
|115012030044|flag_009890|1 |
|115012030044|flag_009893|1 |
|140012220044|flag_010961|0 |
|140012220044|flag_009670|0 |
|140012220044|flag_009708|0 |
|140012220044|flag_009890|0 |
|140012220044|flag_009893|0 |
|189012240044|flag_010961|0 |
|189012240044|flag_009670|0 |
|189012240044|flag_009708|0 |
|189012240044|flag_009890|0 |
|189012240044|flag_009893|0 |
|151012350044|flag_010961|0 |
|151012350044|flag_009670|0 |
|151012350044|flag_009708|0 |
|151012350044|flag_009890|0 |
|151012350044|flag_009893|0 |
+------------+-----------+-----+
Which you can then group and order:
df = (
df.groupBy("flag_name", "value")
.agg(F.count("*").alias("counts"))
.orderBy("flag_name", "value")
)
to get:
+-----------+-----+------+
|flag_name |value|counts|
+-----------+-----+------+
|flag_009670|0 |5 |
|flag_009708|0 |4 |
|flag_009708|1 |1 |
|flag_009890|0 |4 |
|flag_009890|1 |1 |
|flag_009893|0 |4 |
|flag_009893|1 |1 |
|flag_010961|0 |5 |
+-----------+-----+------+
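Since the real data has 101 flag columns, the stack expression does not have to be written by hand; a sketch of generating it from df.columns (assuming the same F alias as above):
flag_cols = [c for c in df.columns if c != "rm_ky"]
stack_expr = "stack({}, {}) as (flag_name, value)".format(
    len(flag_cols),
    ", ".join("'{0}', {0}".format(c) for c in flag_cols),
)
df = df.select("rm_ky", F.expr(stack_expr))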
Example:
data = [ ("193012020044",0, 0, 0, 0, 0, 1)
,("115012030044",0, 0, 1, 1, 1, 1)
,("140012220044",0, 0, 0, 0, 0, 0)
,("189012240044",0, 1, 0, 0, 0, 0)
,("151012350044",0, 0, 0, 1, 1, 0)]
columns= ["rm_ky","flag_010961","flag_011622","flag_009670","flag_009708","flag_009890","flag_009893"]
df = spark.createDataFrame(data = data, schema = columns)
df.show()
+------------+-----------+-----------+-----------+-----------+-----------+-----------+
| rm_ky|flag_010961|flag_011622|flag_009670|flag_009708|flag_009890|flag_009893|
+------------+-----------+-----------+-----------+-----------+-----------+-----------+
|193012020044| 0| 0| 0| 0| 0| 1|
|115012030044| 0| 0| 1| 1| 1| 1|
|140012220044| 0| 0| 0| 0| 0| 0|
|189012240044| 0| 1| 0| 0| 0| 0|
|151012350044| 0| 0| 0| 1| 1| 0|
+------------+-----------+-----------+-----------+-----------+-----------+-----------+
Creating an expression to unpivot:
x = ""
cnt = 0
for col in df.columns:
    if col != 'rm_ky':
        cnt += 1
        x += "'" + str(col) + "', " + str(col) + ", "
x = x[:-2]
xpr = """stack({}, {}) as (Type,Value)""".format(cnt, x)
print(xpr)
>> stack(6, 'flag_010961', flag_010961, 'flag_011622', flag_011622, 'flag_009670', flag_009670, 'flag_009708', flag_009708, 'flag_009890', flag_009890, 'flag_009893', flag_009893) as (Type,Value)
Then, using expr and pivot:
from pyspark.sql import functions as F
df\
.drop('rm_ky')\
.select(F.lit('dummy'),F.expr(xpr))\
.drop('dummy')\
.groupBy('Type')\
.pivot('Value')\
.agg(F.count('Value'))\
.fillna(0)\
.show()
+-----------+---+---+
| Type| 0| 1|
+-----------+---+---+
|flag_009890| 3| 2|
|flag_009893| 3| 2|
|flag_011622| 4| 1|
|flag_010961| 5| 0|
|flag_009708| 3| 2|
|flag_009670| 4| 1|
+-----------+---+---+
I think this is what you are looking for:
>>> df2.show()
+------------+-----------+-----------+-----------+-----------+-----------+-----------+
| rm_ky|flag_010961|flag_011622|flag_009670|flag_009708|flag_009890|flag_009893|
+------------+-----------+-----------+-----------+-----------+-----------+-----------+
|193012020044| 0| 0| 0| 0| 0| 0|
|115012030044| 0| 0| 1| 1| 1| 1|
|140012220044| 0| 0| 0| 0| 0| 0|
|189012240044| 0| 0| 0| 0| 0| 0|
|151012350044| 0| 0| 0| 0| 0| 0|
+------------+-----------+-----------+-----------+-----------+-----------+-----------+
>>> unpivotExpr = "stack(6, 'flag_010961',flag_010961,'flag_011622',flag_011622,'flag_009670',flag_009670, 'flag_009708',flag_009708, 'flag_009890',flag_009890, 'flag_009893',flag_009893) as (flag,flag_val)"
>>> unPivotDF = df2.select("rm_ky", expr(unpivotExpr))
>>> unPivotDF.show()
+------------+-----------+--------+
| rm_ky| flag|flag_val|
+------------+-----------+--------+
|193012020044|flag_010961| 0|
|193012020044|flag_011622| 0|
|193012020044|flag_009670| 0|
|193012020044|flag_009708| 0|
|193012020044|flag_009890| 0|
|193012020044|flag_009893| 0|
|115012030044|flag_010961| 0|
|115012030044|flag_011622| 0|
|115012030044|flag_009670| 1|
|115012030044|flag_009708| 1|
|115012030044|flag_009890| 1|
|115012030044|flag_009893| 1|
|140012220044|flag_010961| 0|
|140012220044|flag_011622| 0|
|140012220044|flag_009670| 0|
|140012220044|flag_009708| 0|
|140012220044|flag_009890| 0|
|140012220044|flag_009893| 0|
|189012240044|flag_010961| 0|
|189012240044|flag_011622| 0|
+------------+-----------+--------+
only showing top 20 rows
>>> unPivotDF.groupBy("flag","flag_val").count().show()
+-----------+--------+-----+
| flag|flag_val|count|
+-----------+--------+-----+
|flag_009670| 0| 4|
|flag_009708| 0| 4|
|flag_009893| 0| 4|
|flag_009890| 0| 4|
|flag_009670| 1| 1|
|flag_009893| 1| 1|
|flag_011622| 0| 5|
|flag_010961| 0| 5|
|flag_009890| 1| 1|
|flag_009708| 1| 1|
+-----------+--------+-----+
>>> unPivotDF.groupBy("rm_ky","flag","flag_val").count().show()
+------------+-----------+--------+-----+
| rm_ky| flag|flag_val|count|
+------------+-----------+--------+-----+
|151012350044|flag_009708| 0| 1|
|115012030044|flag_010961| 0| 1|
|140012220044|flag_009670| 0| 1|
|189012240044|flag_010961| 0| 1|
|151012350044|flag_009670| 0| 1|
|115012030044|flag_009890| 1| 1|
|151012350044|flag_009890| 0| 1|
|189012240044|flag_009890| 0| 1|
|193012020044|flag_011622| 0| 1|
|193012020044|flag_009670| 0| 1|
|115012030044|flag_009670| 1| 1|
|140012220044|flag_011622| 0| 1|
|151012350044|flag_009893| 0| 1|
|140012220044|flag_009893| 0| 1|
|189012240044|flag_011622| 0| 1|
|189012240044|flag_009893| 0| 1|
|115012030044|flag_009893| 1| 1|
|140012220044|flag_009708| 0| 1|
|189012240044|flag_009708| 0| 1|
|193012020044|flag_010961| 0| 1|
+------------+-----------+--------+-----+

Get all possible combinations recursively in an RDD in pyspark

I have written this algorithm, but with higher numbers it doesn't seem to work or is very slow. It will run on a big-data cluster (Cloudera), so I think I have to move the function into PySpark. Any tips on how to improve it, please?
import pandas as pd
import itertools as itts

number_list = [10953, 10423, 10053]

def reducer(nums):
    def ranges(n):
        print(n)
        return range(n, -1, -1)
    num_list = list(map(ranges, nums))
    return list(itts.product(*num_list))

data = pd.DataFrame(reducer(number_list))
print(data)
You can use crossJoin with DataFrame:
Here we have a simple example trying to compute the cross-product of three arrays,
i.e. [1,0], [2,1,0], [3,2,1,0]. Their cross-product should have 2*3*4 = 24 elements.
The code below shows how to achieve this.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('test').getOrCreate()
df1 = spark.createDataFrame([(1,),(0,)], ['v1'])
df2 = spark.createDataFrame([(2,), (1,),(0,)], ['v2'])
df3 = spark.createDataFrame([(3,), (2,),(1,),(0,)], ['v3'])
df1.show()
df2.show()
df3.show()
+---+
| v1|
+---+
| 1|
| 0|
+---+
+---+
| v2|
+---+
| 2|
| 1|
| 0|
+---+
+---+
| v3|
+---+
| 3|
| 2|
| 1|
| 0|
+---+
df = df1.crossJoin(df2).crossJoin(df3)
print('----------- Total rows: ', df.count())
df.show(30)
----------- Total rows: 24
+---+---+---+
| v1| v2| v3|
+---+---+---+
| 1| 2| 3|
| 1| 2| 2|
| 1| 2| 1|
| 1| 2| 0|
| 1| 1| 3|
| 1| 1| 2|
| 1| 1| 1|
| 1| 1| 0|
| 1| 0| 3|
| 1| 0| 2|
| 1| 0| 1|
| 1| 0| 0|
| 0| 2| 3|
| 0| 2| 2|
| 0| 2| 1|
| 0| 2| 0|
| 0| 1| 3|
| 0| 1| 2|
| 0| 1| 1|
| 0| 1| 0|
| 0| 0| 3|
| 0| 0| 2|
| 0| 0| 1|
| 0| 0| 0|
+---+---+---+
Your computation is pretty big:
(10953+1)*(10423+1)*(10053+1) = 1148010922784, which is about 1 trillion rows. I would suggest increasing the numbers slowly; Spark is not as fast as you might think when it involves table joins.
Also, try using broadcast on all your initial DataFrames, i.e. df1, df2, df3, and see if it helps.
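A sketch of that broadcast hint, using the broadcast function from pyspark.sql.functions:
from pyspark.sql.functions import broadcast

df = broadcast(df1).crossJoin(broadcast(df2)).crossJoin(broadcast(df3))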

Pyspark apply function to column value if condition is met [duplicate]

This question already has answers here:
Spark Equivalent of IF Then ELSE
(4 answers)
Closed 3 years ago.
Given a pyspark dataframe, for example:
ls = [
['1', 2],
['2', 7],
['1', 3],
['2',-6],
['1', 3],
['1', 5],
['1', 4],
['2', 7]
]
df = spark.createDataFrame(pd.DataFrame(ls, columns=['col1', 'col2']))
df.show()
+----+-----+
|col1| col2|
+----+-----+
| 1| 2|
| 2| 7|
| 1| 3|
| 2| -6|
| 1| 3|
| 1| 5|
| 1| 4|
| 2| 7|
+----+-----+
How can I apply a function to col2 values where col1 == '1' and store the result in a new column?
For example the function is:
f = x**2
Result should look like this:
+----+-----+-----+
|col1| col2| y|
+----+-----+-----+
| 1| 2| 4|
| 2| 7| null|
| 1| 3| 9|
| 2| -6| null|
| 1| 3| 9|
| 1| 5| 25|
| 1| 4| 16|
| 2| 7| null|
+----+-----+-----+
I tried defining a separate function and using df.withColumn(y).when(condition, function), but it wouldn't work.
So what is a way to do this?
I hope this helps:
from pyspark.sql.functions import when
from pyspark.sql.types import IntegerType

def myFun(x):
    return (x**2).cast(IntegerType())
df2 = df.withColumn("y", when(df.col1 == 1, myFun(df.col2)).otherwise(None))
df2.show()
+----+----+----+
|col1|col2| y|
+----+----+----+
| 1| 2| 4|
| 2| 7|null|
| 1| 3| 9|
| 2| -6|null|
| 1| 3| 9|
| 1| 5| 25|
| 1| 4| 16|
| 2| 7|null|
+----+----+----+
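Equivalently, since squaring is a native Column operation, the helper is not strictly required; something like this sketch should give the same result:
from pyspark.sql import functions as F

df2 = df.withColumn("y", F.when(df.col1 == '1', F.pow(df.col2, 2).cast("int")))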
