pyspark convert rows to columns - apache-spark

I have a dataframe where I need to convert rows of the same group to columns, basically pivot them. Below is my df:
+------------+-------+-----+-------+
|Customer |ID |unit |order |
+------------+-------+-----+-------+
|John |123 |00015|1 |
|John |123 |00016|2 |
|John |345 |00205|3 |
|John |345 |00206|4 |
|John |789 |00283|5 |
|John |789 |00284|6 |
+------------+-------+-----+-------+
I need the resulting data for the above to be:
+--------+-------+--------+----------+--------+--------+-----------+--------+-------+----------+
|state | ID_1 | unit_1 |seq_num_1 | ID_2 | unit_2 | seq_num_2 | ID_3 |unit_3 |seq_num_3 |
+--------+-------+--------+----------+--------+--------+-----------+--------+-------+----------+
|John | 123 | 00015 | 1 | 345 | 00205 | 3 | 789 |00283 | 5 |
|John | 123 | 00016 | 2 | 345 | 00206 | 4 | 789 |00284 | 6 |
+--------+-------+--------+----------+--------+--------+-----------+--------+-------+----------+
I tried the groupBy and pivot() functions, but it throws an error saying large pivot values were found. Is there any way to get the result without using the pivot() function? Any help is greatly appreciated.
thanks.

This looks like a typical case for the dense_rank() window function: create a generic sequence (dr in the code below) of distinct IDs within each Customer group, then pivot on this sequence. We can do something similar for the order column using row_number() so that it can be used in the groupBy:
from pyspark.sql import Window, functions as F
# below I added an extra row for a reference when the number of rows vary for different IDs
df = spark.createDataFrame([
    ('John', '123', '00015', '1'), ('John', '123', '00016', '2'), ('John', '345', '00205', '3'),
    ('John', '345', '00206', '4'), ('John', '789', '00283', '5'), ('John', '789', '00284', '6'),
    ('John', '789', '00285', '7')
], ['Customer', 'ID', 'unit', 'order'])
Add two window specs: w1 to get the dense_rank() of IDs within each Customer, and w2 to get the row_number() of order within the same Customer and ID.
w1 = Window.partitionBy('Customer').orderBy('ID')
w2 = Window.partitionBy('Customer','ID').orderBy('order')
Add two new columns based on the above two window specs: dr (dense_rank) and sid (row_number):
df1 = df.select(
    "*",
    F.dense_rank().over(w1).alias('dr'),
    F.row_number().over(w2).alias('sid')
)
+--------+---+-----+-----+---+---+
|Customer| ID| unit|order| dr|sid|
+--------+---+-----+-----+---+---+
| John|123|00015| 1| 1| 1|
| John|123|00016| 2| 1| 2|
| John|345|00205| 3| 2| 1|
| John|345|00206| 4| 2| 2|
| John|789|00283| 5| 3| 1|
| John|789|00284| 6| 3| 2|
| John|789|00285| 7| 3| 3|
+--------+---+-----+-----+---+---+
Find the max(dr), so that we can pre-define the list to pivot on, which is range(1, N+1) (this will improve the efficiency of the pivot method):
N = df1.agg(F.max('dr')).first()[0]
Group by Customer and sid, pivot on dr, and then aggregate:
df_new = df1.groupby('Customer', 'sid') \
    .pivot('dr', list(range(1, N+1))) \
    .agg(
        F.first('ID').alias('ID'),
        F.first('unit').alias('unit'),
        F.first('order').alias('order')
    )
df_new.show()
+--------+---+----+------+-------+----+------+-------+----+------+-------+
|Customer|sid|1_ID|1_unit|1_order|2_ID|2_unit|2_order|3_ID|3_unit|3_order|
+--------+---+----+------+-------+----+------+-------+----+------+-------+
| John| 1| 123| 00015| 1| 345| 00205| 3| 789| 00283| 5|
| John| 2| 123| 00016| 2| 345| 00206| 4| 789| 00284| 6|
| John| 3|null| null| null|null| null| null| 789| 00285| 7|
+--------+---+----+------+-------+----+------+-------+----+------+-------+
Rename the columns if needed:
import re
df_new.toDF(*['_'.join(reversed(re.split('_',c,1))) for c in df_new.columns]).show()
+--------+---+----+------+-------+----+------+-------+----+------+-------+
|Customer|sid|ID_1|unit_1|order_1|ID_2|unit_2|order_2|ID_3|unit_3|order_3|
+--------+---+----+------+-------+----+------+-------+----+------+-------+
| John| 1| 123| 00015| 1| 345| 00205| 3| 789| 00283| 5|
| John| 2| 123| 00016| 2| 345| 00206| 4| 789| 00284| 6|
| John| 3|null| null| null|null| null| null| 789| 00285| 7|
+--------+---+----+------+-------+----+------+-------+----+------+-------+
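To match the exact layout asked for in the question, a last cleanup step (my sketch, not part of the original answer) would rename Customer to state and drop the helper sid column; note the third columns end up named order_N rather than seq_num_N here:
df_renamed = df_new.toDF(*['_'.join(reversed(re.split('_', c, 1))) for c in df_new.columns])
df_final = df_renamed.withColumnRenamed('Customer', 'state').drop('sid')
df_final.show()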

Below is my solution: doing the rank and then flattening the results.
from pyspark.sql import Window
from pyspark.sql.functions import col, collect_list, max, row_number  # note: this max shadows the builtin
df = spark.createDataFrame([
    ('John', '123', '00015', '1'), ('John', '123', '00016', '2'), ('John', '345', '00205', '3'),
    ('John', '345', '00206', '4'), ('John', '789', '00283', '5'), ('John', '789', '00284', '6'),
    ('John', '789', '00285', '7')
], ['Customer', 'ID', 'unit', 'order'])
rankedDF = df.withColumn("rank", row_number().over(Window.partitionBy("customer").orderBy("order")))
w1 = Window.partitionBy("customer").orderBy("order")
groupedDF = (rankedDF
             .select("customer", "rank",
                     collect_list("ID").over(w1).alias("ID"),
                     collect_list("unit").over(w1).alias("unit"),
                     collect_list("order").over(w1).alias("seq_num"))
             .groupBy("customer", "rank")
             .agg(max("ID").alias("ID"), max("unit").alias("unit"), max("seq_num").alias("seq_num")))
groupedColumns = [col("customer")]
pivotColumns = [[col(a)[i - 1].alias(a + "_" + str(i)) for a in ["ID", "unit", "seq_num"]]
                for i in [1, 2, 3]]
flattenedCols = [item for sublist in pivotColumns for item in sublist]
finalDf = groupedDF.select(groupedColumns + flattenedCols)
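If the number of distinct IDs is not known up front, the hard-coded [1, 2, 3] above could be derived from the data instead. A small sketch (countDistinct over the whole frame works for this single-customer sample; a per-customer maximum would need the dense_rank approach from the first answer):
from pyspark.sql.functions import countDistinct
n = df.agg(countDistinct("ID")).first()[0]  # 3 for this sample
pivotColumns = [[col(a)[i - 1].alias(a + "_" + str(i)) for a in ["ID", "unit", "seq_num"]]
                for i in range(1, n + 1)]
flattenedCols = [item for sublist in pivotColumns for item in sublist]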

There may be multiple ways to do this, but a pandas UDF is one of them. Here is a toy example based on your data:
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = pd.DataFrame({'Customer': ['John'] * 6,
                   'ID': [123] * 2 + [345] * 2 + [789] * 2,
                   'unit': ['00015', '00016', '00205', '00206', '00283', '00284'],
                   'order': range(1, 7)})
sdf = spark.createDataFrame(df)
# Spark 2.4 syntax. Spark 3.0 is less verbose
return_types = 'state string, ID_1 int, unit_1 string, seq_num_1 int, ID_2 int, unit_2 string, seq_num_2 int, ID_3 int, unit_3 string, seq_num_3 int'
@pandas_udf(returnType=return_types, functionType=PandasUDFType.GROUPED_MAP)
def convert_to_wide(pdf):
    groups = pdf.groupby('ID')
    out = pd.concat([group.set_index('Customer') for _, group in groups], axis=1).reset_index()
    out.columns = ['state', 'ID_1', 'unit_1', 'seq_num_1', 'ID_2', 'unit_2', 'seq_num_2', 'ID_3', 'unit_3', 'seq_num_3']
    return out
sdf.groupby('Customer').apply(convert_to_wide).show()
+-----+----+------+---------+----+------+---------+----+------+---------+
|state|ID_1|unit_1|seq_num_1|ID_2|unit_2|seq_num_2|ID_3|unit_3|seq_num_3|
+-----+----+------+---------+----+------+---------+----+------+---------+
| John| 123| 00015| 1| 345| 00205| 3| 789| 00283| 5|
| John| 123| 00016| 2| 345| 00206| 4| 789| 00284| 6|
+-----+----+------+---------+----+------+---------+----+------+---------+
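As the comment in the code notes, Spark 3.0 is less verbose; a sketch of the equivalent call (assuming convert_to_wide is left undecorated) uses applyInPandas:
# Spark 3.0+: pass the plain function and the result schema directly
sdf.groupby('Customer').applyInPandas(convert_to_wide, schema=return_types).show()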

Related

Calculate duration within groups in PySpark

I want to calculate the duration within groups of the same date_id, subs_no, year, month, and day. If it's the first entry, it should just display "first".
Here's my dataset:
+--------+---------------+--------+----+-----+---+
| date_id| ts| subs_no|year|month|day|
+--------+---------------+--------+----+-----+---+
|20200801|14:27:18.000000|10007239|2022| 6| 1|
|20200801|14:29:44.000000|10054647|2022| 6| 1|
|20200801|08:24:21.000000|10057750|2022| 6| 1|
|20200801|13:49:27.000000|10019958|2022| 6| 1|
|20200801|20:07:32.000000|10019958|2022| 6| 1|
+--------+---------------+--------+----+-----+---+
NB: column "ts" is of string type.
Here's my expected output:
+--------+---------------+--------+----+-----+---+---------+
| date_id| ts| subs_no|year|month|day| duration|
+--------+---------------+--------+----+-----+---+---------+
|20200801|14:27:18.000000|10007239|2022| 6| 1| first |
|20200801|14:29:44.000000|10054647|2022| 6| 1| first |
|20200801|08:24:21.000000|10057750|2022| 6| 1| first |
|20200801|13:49:27.000000|10019958|2022| 6| 1| first |
|20200801|20:07:32.000000|10019958|2022| 6| 1| 6:18:05 |
+--------+---------------+--------+----+-----+---+---------+
You could try combining some of the columns into one that represents a real timestamp. Then do the calculation using min as a window function. Finally, replace durations of "00:00:00" with "first".
Input:
from pyspark.sql import functions as F, Window as W
df = spark.createDataFrame(
    [('20200801', '14:27:18.000000', '10007239', 2022, 6, 1),
     ('20200801', '14:29:44.000000', '10054647', 2022, 6, 1),
     ('20200801', '08:24:21.000000', '10057750', 2022, 6, 1),
     ('20200801', '13:49:27.000000', '10019958', 2022, 6, 1),
     ('20200801', '20:07:32.000000', '10019958', 2022, 6, 1)],
    ['date_id', 'ts', 'subs_no', 'year', 'month', 'day'])
Script:
ts = F.to_timestamp(F.format_string('%d-%d-%d %s','year', 'month', 'day', 'ts'))
w = W.partitionBy('date_id', 'subs_no', 'year', 'month', 'day').orderBy(ts)
df = df.withColumn(
    'duration',
    F.regexp_extract(ts - F.min(ts).over(w), r'\d\d:\d\d:\d\d', 0)
)
df = df.replace('00:00:00', 'first', 'duration')
df.show()
# +--------+---------------+--------+----+-----+---+--------+
# |date_id |ts |subs_no |year|month|day|duration|
# +--------+---------------+--------+----+-----+---+--------+
# |20200801|14:27:18.000000|10007239|2022|6 |1 |first |
# |20200801|13:49:27.000000|10019958|2022|6 |1 |first |
# |20200801|20:07:32.000000|10019958|2022|6 |1 |06:18:05|
# |20200801|14:29:44.000000|10054647|2022|6 |1 |first |
# |20200801|08:24:21.000000|10057750|2022|6 |1 |first |
# +--------+---------------+--------+----+-----+---+--------+
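If extracting a substring from the interval feels fragile, an alternative sketch (not from the original answer, reusing the ts and w definitions above) computes the gap in seconds and formats it explicitly:
secs = F.unix_timestamp(ts) - F.min(F.unix_timestamp(ts)).over(w)
df = df.withColumn(
    'duration',
    F.when(secs == 0, F.lit('first')).otherwise(
        F.format_string('%02d:%02d:%02d',
                        (secs / 3600).cast('int'),
                        ((secs % 3600) / 60).cast('int'),
                        (secs % 60).cast('int'))))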
Use window functions. Code and logic below:
from pyspark.sql import Window
from pyspark.sql.functions import col, first, regexp_extract, to_timestamp, when
w = Window.partitionBy('date_id', 'subs_no', 'year', 'month').orderBy('date_id', 'subs_no', 'year', 'month')
new = (df.withColumn('ty', to_timestamp('ts'))  # coerce to timestamp
       .withColumn('duration', when(first('ty').over(w) == col('ty'), 'first')  # first row in its window
                   .otherwise(regexp_extract(col('ty') - first('ty').over(w), r'\d{2}:\d{2}:\d{2}', 0)))  # elapsed time
       .drop('ty')
       .orderBy('date_id', 'ts', 'subs_no', 'year', 'month'))
new.show()
+--------+---------------+--------+----+-----+---+--------+
| date_id| ts| subs_no|year|month|day|duration|
+--------+---------------+--------+----+-----+---+--------+
|20200801|08:24:21.000000|10057750|2022| 6| 1| first|
|20200801|13:49:27.000000|10019958|2022| 6| 1| first|
|20200801|14:27:18.000000|10007239|2022| 6| 1| first|
|20200801|14:29:44.000000|10054647|2022| 6| 1| first|
|20200801|20:07:32.000000|10019958|2022| 6| 1|06:18:05|
+--------+---------------+--------+----+-----+---+--------+
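One caveat with this version (my note, not the answer's): the window orders by the same columns it partitions on, so first() is effectively arbitrary within a group; ordering by the zero-padded ts string keeps it chronological:
w = Window.partitionBy('date_id', 'subs_no', 'year', 'month').orderBy('ts')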

Databricks spark dataframe create dataframe by each column

In pandas, I can do something like this:
data = {"col1" : [np.random.randint(10) for x in range(1,10)],
"col2" : [np.random.randint(100) for x in range(1,10)]}
mypd = pd.DataFrame(data)
mypd
and it gives the two columns.
Is there any similar way to create a Spark dataframe in PySpark?
The answer shared by Steven is brilliant. Additionally, if you are comfortable with pandas, you can directly supply your pandas dataframe to the createDataFrame function.
Spark >= 2.x
data = {
    "col1": [np.random.randint(10) for x in range(1, 10)],
    "col2": [np.random.randint(100) for x in range(1, 10)],
}
mypd = pd.DataFrame(data)
sparkDF = sql.createDataFrame(mypd)  # sql is your SparkSession / SQLContext
sparkDF.show()
+----+----+
|col1|col2|
+----+----+
| 6| 4|
| 1| 39|
| 7| 4|
| 7| 95|
| 6| 3|
| 7| 28|
| 2| 26|
| 0| 4|
| 4| 32|
+----+----+
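For completeness, the same kind of frame can also be built without pandas by handing rows straight to createDataFrame; a minimal sketch using the standard library's random instead of numpy:
import random
rows = [(random.randint(0, 9), random.randint(0, 99)) for _ in range(9)]
sparkDF = spark.createDataFrame(rows, ["col1", "col2"])
sparkDF.show()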

Crossjoin between two dataframes that is dependent on a common column

A crossJoin can be done as follows:
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
date_today = datetime.now()
df1 = pd.DataFrame({'subgroup': ['A', 'B', 'C', 'D']})
df2 = pd.DataFrame({'dates': pd.date_range(date_today, date_today + timedelta(3), freq='D')})
sdf1 = spark.createDataFrame(df1)
sdf2 = spark.createDataFrame(df2)
sdf1.crossJoin(sdf2).toPandas()
In this example there are two dataframes, each containing 4 rows; in the end, I get 16 rows.
However, for my problem, I would like to do a cross join per user, and the user is another column in the two dataframes, e.g.:
df1 = pd.DataFrame({'user':[1,1,1,1,2,2,2,2],'subgroup':['A','B','C','D','A','B','D','E']})
df2 = pd.DataFrame({'user':[1,1,1,1,2,2,2,2],'dates':np.hstack([np.array(pd.date_range(date_today, date_today + timedelta(3), freq='D')),np.array(pd.date_range(date_today+timedelta(1), date_today + timedelta(4), freq='D'))])})
The result of applying the per-user crossJoin should be a dataframe with 32 rows. Is this possible in pyspark and how can this be done?
A cross join is a join that multiplies rows because the joining key does not identify rows uniquely (in our case the joining key is trivial, or there is no joining key at all).
Let's start with sample data frames:
import numpy as np
import pyspark.sql.functions as psf
import pyspark.sql.types as pst
df1 = spark.createDataFrame(
    [[user, value] for user, value in zip(5 * list(range(2)), np.random.randint(0, 100, 10).tolist())],
    schema=pst.StructType([pst.StructField(c, pst.IntegerType()) for c in ['user', 'value1']]))
df2 = spark.createDataFrame(
    [[user, value] for user, value in zip(5 * list(range(2)), np.random.randint(0, 100, 10).tolist())],
    schema=pst.StructType([pst.StructField(c, pst.IntegerType()) for c in ['user', 'value2']]))
+----+------+
|user|value1|
+----+------+
| 0| 76|
| 1| 59|
| 0| 14|
| 1| 71|
| 0| 66|
| 1| 61|
| 0| 2|
| 1| 22|
| 0| 16|
| 1| 83|
+----+------+
+----+------+
|user|value2|
+----+------+
| 0| 65|
| 1| 81|
| 0| 60|
| 1| 69|
| 0| 21|
| 1| 61|
| 0| 98|
| 1| 76|
| 0| 40|
| 1| 21|
+----+------+
Let's try joining the data frames on a constant column to see the equivalence between a cross join and regular join on a constant (trivial) column:
df = df1.withColumn('key', psf.lit(1)) \
    .join(df2.withColumn('key', psf.lit(1)), on=['key'])
With Spark > 2 we get an error, because Spark realises we're trying to do a cross join (cartesian product):
Py4JJavaError: An error occurred while calling o1865.showString.
: org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
LogicalRDD [user#1538, value1#1539], false
and
LogicalRDD [user#1542, value2#1543], false
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
If your joining key (user here) is not a column that uniquely identifies rows, you'll get a multiplication of rows as well, but within each user group:
df = df1.join(df2, on='user')
print("Number of rows : \tdf1: {} \tdf2: {} \tdf: {}".format(df1.count(), df2.count(), df.count()))
Number of rows : df1: 10 df2: 10 df: 50
+----+------+------+
|user|value1|value2|
+----+------+------+
| 1| 59| 81|
| 1| 59| 69|
| 1| 59| 61|
| 1| 59| 76|
| 1| 59| 21|
| 1| 71| 81|
| 1| 71| 69|
| 1| 71| 61|
| 1| 71| 76|
| 1| 71| 21|
| 1| 61| 81|
| 1| 61| 69|
| 1| 61| 61|
| 1| 61| 76|
| 1| 61| 21|
| 1| 22| 81|
| 1| 22| 69|
| 1| 22| 61|
| 1| 22| 76|
| 1| 22| 21|
+----+------+------+
5 * 5 rows for user 0 + 5 * 5 rows for user 1, hence 50
Note: Using a self join followed by a filter usually means you should be using window functions instead.
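Applied to the per-user example from the question, the same plain join on user gives the expected 32 rows; a quick sketch, where pdf1 and pdf2 stand for the pandas frames df1 and df2 defined in the question:
sdf1 = spark.createDataFrame(pdf1)  # 8 rows: user x subgroup
sdf2 = spark.createDataFrame(pdf2)  # 8 rows: user x dates
print(sdf1.join(sdf2, on='user').count())  # 4 * 4 per user, over 2 users => 32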

PySpark : change column names of a df based on relations defined in another df

I have two Spark data frames loaded from CSV, of the form:
mapping_fields (the df with mapped names):
new_name old_name
A aa
B bb
C cc
and
aa bb cc dd
1 2 3 43
12 21 4 37
to be transformed into :
A B C D
1 2 3
12 21 4
As dd didn't have any mapping in the original table, column D should have all null values.
How can I do this without converting the mapping_df into a dictionary and checking individually for mapped names? (That would mean I have to collect the mapping_fields and check, which somewhat contradicts my use case of handling all the datasets in a distributed way.)
Thanks!
With melt borrowed from here (a sketch of such a helper is included at the end of this answer) you could:
from pyspark.sql import functions as f
mapping_fields = spark.createDataFrame(
    [("A", "aa"), ("B", "bb"), ("C", "cc")],
    ("new_name", "old_name"))
df = spark.createDataFrame(
    [(1, 2, 3, 43), (12, 21, 4, 37)],
    ("aa", "bb", "cc", "dd"))
(melt(df.withColumn("id", f.monotonically_increasing_id()),
      id_vars=["id"], value_vars=df.columns, var_name="old_name")
    .join(mapping_fields, ["old_name"], "left_outer")
    .withColumn("value", f.when(f.col("new_name").isNotNull(), f.col("value")))
    .withColumn("new_name", f.coalesce("new_name", f.upper(f.col("old_name"))))
    .groupBy("id")
    .pivot("new_name")
    .agg(f.first("value"))
    .drop("id")
    .show())
+---+---+---+----+
| A| B| C| DD|
+---+---+---+----+
| 1| 2| 3|null|
| 12| 21| 4|null|
+---+---+---+----+
but in your description nothing justifies this. Because the number of columns is fairly limited, I'd rather:
mapping = dict(
    mapping_fields
    .filter(f.col("old_name").isin(df.columns))
    .select("old_name", "new_name").collect())
df.select([
    (f.lit(None).cast(t) if c not in mapping else f.col(c)).alias(mapping.get(c, c.upper()))
    for (c, t) in df.dtypes]).show()
+---+---+---+----+
| A| B| C| DD|
+---+---+---+----+
| 1| 2| 3|null|
| 12| 21| 4|null|
+---+---+---+----+
At the end of the day you should use distributed processing when it provides performance or scalability improvements. Here it would do the opposite and make your code overly complicated.
To ignore no-matches:
(melt(df.withColumn("id", f.monotonically_increasing_id()),
      id_vars=["id"], value_vars=df.columns, var_name="old_name")
    .join(mapping_fields, ["old_name"])
    .groupBy("id")
    .pivot("new_name")
    .agg(f.first("value"))
    .drop("id")
    .show())
or
df.select([
    f.col(c).alias(mapping.get(c))
    for (c, t) in df.dtypes if c in mapping])
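For reference, melt is not part of the PySpark DataFrame API; a minimal sketch of such a helper (an assumption based on the common explode-of-structs pattern, not necessarily the exact code the answer links to):
from pyspark.sql import functions as f
def melt(df, id_vars, value_vars, var_name="variable", value_name="value"):
    # one (column name, column value) struct per value column, exploded into rows
    vars_and_vals = f.array(*[
        f.struct(f.lit(c).alias(var_name), f.col(c).alias(value_name))
        for c in value_vars])
    tmp = df.withColumn("_vars_and_vals", f.explode(vars_and_vals))
    return tmp.select(*id_vars,
                      f.col("_vars_and_vals")[var_name].alias(var_name),
                      f.col("_vars_and_vals")[value_name].alias(value_name))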
I tried with a simple for loop, hope this helps too.
from pyspark.sql import functions as F
l1 = [('A','aa'),('B','bb'),('C','cc')]
l2 = [(1,2,3,43),(12,21,4,37)]
df1 = spark.createDataFrame(l1,['new_name','old_name'])
df2 = spark.createDataFrame(l2,['aa','bb','cc','dd'])
>>> df1.show()
+--------+--------+
|new_name|old_name|
+--------+--------+
| A| aa|
| B| bb|
| C| cc|
+--------+--------+
>>> df2.show()
+---+---+---+---+
| aa| bb| cc| dd|
+---+---+---+---+
| 1| 2| 3| 43|
| 12| 21| 4| 37|
+---+---+---+---+
When you need the missing column with null values:
>>> cols = df2.columns
>>> for i in cols:
...     val = df1.where(df1['old_name'] == i).first()
...     if val is not None:
...         df2 = df2.withColumnRenamed(i, val['new_name'])
...     else:
...         df2 = df2.withColumn(i, F.lit(None))
>>> df2.show()
+---+---+---+----+
| A| B| C| dd|
+---+---+---+----+
| 1| 2| 3|null|
| 12| 21| 4|null|
+---+---+---+----+
When we need only the mapped columns, change the else part:
...     else:
...         df2 = df2.drop(i)
>>> df2.show()
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 2| 3|
| 12| 21| 4|
+---+---+---+
This will transform the original df2 dataframe though.

How to join two data frames in Apache Spark and merge keys into one column?

I have the following two Spark data frames:
sale_df:
+-------+----------+
|user_id|total_sale|
+-------+----------+
| a| 1100|
| b| 2100|
| c| 3300|
| d| 4400|
+-------+----------+
and target_df:
+-------+-------------------+
|user_id|personalized_target|
+-------+-------------------+
| b| 1000|
| c| 2000|
| d| 3000|
| e| 4000|
+-------+-------------------+
How can I join them in a way that output is:
user_id total_sale personalized_target
a 1100 NA
b 2100 1000
c 3300 2000
d 4400 3000
e NA 4000
I have tried almost all the join types, but it seems that a single join cannot produce the desired output.
Any PySpark, SQL, or HiveContext solution would help.
You can use the equi-join syntax in Scala:
val output = sales_df.join(target_df,Seq("user_id"),joinType="outer")
You should check if it works in Python:
output = sales_df.join(target_df,['user_id'],"outer")
You need to perform an outer equi-join:
data1 = [['a', 1100], ['b', 2100], ['c', 3300], ['d', 4400]]
sales = sqlContext.createDataFrame(data1,['user_id','total_sale'])
data2 = [['b', 1000],['c',2000],['d',3000],['e',4000]]
target = sqlContext.createDataFrame(data2,['user_id','personalized_target'])
sales.join(target, 'user_id', "outer").show()
# +-------+----------+-------------------+
# |user_id|total_sale|personalized_target|
# +-------+----------+-------------------+
# | e| null| 4000|
# | d| 4400| 3000|
# | c| 3300| 2000|
# | b| 2100| 1000|
# | a| 1100| null|
# +-------+----------+-------------------+
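If the literal NA values from the desired output are needed instead of nulls, one extra step (a sketch, not part of the original answers) is to cast the numeric columns to strings and fill the gaps:
from pyspark.sql import functions as F
joined = sales.join(target, 'user_id', 'outer')
joined.select(
    'user_id',
    *[F.col(c).cast('string') for c in ['total_sale', 'personalized_target']]
).fillna('NA').show()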
