Pivot fixed amount of rows to fixed amount of columns in PySpark - apache-spark

Imagine I have the following table:
| unique_id | column_A | column_B |
|-----------|----------|----------|
| 123       | 12345    | ABCDEFG  |
| 123       | 23456    | BCDEFGH  |
| 123       | 34567    | CDEFGHI  |
| 234       | 12345    | ABCDEFG  |
The number of rows per unique ID is at most 3.
The result I want to achieve is the following:

| unique_id | column_A_1 | column_A_2 | column_A_3 | column_B_1 | column_B_2 | column_B_3 |
|-----------|------------|------------|------------|------------|------------|------------|
| 123       | 12345      | 23456      | 34567      | ABCDEFG    | BCDEFGH    | CDEFGHI    |
| 234       | 12345      |            |            | ABCDEFG    |            |            |

You can assign a row_number to each record and pivot that.
Here's an example of retaining 2 values per ID using your input dataframe.
from pyspark.sql import functions as func
from pyspark.sql import Window as wd

pivoted_sdf = data_sdf. \
    withColumn('rn',
               func.row_number().over(wd.partitionBy('unique_id').orderBy(func.rand()))
               ). \
    filter(func.col('rn') <= 2). \
    groupBy('unique_id'). \
    pivot('rn', values=['1', '2']). \
    agg(func.first('col_a').alias('col_a'),
        func.first('col_b').alias('col_b')
        )
# +---------+-------+-------+-------+-------+
# |unique_id|1_col_a|1_col_b|2_col_a|2_col_b|
# +---------+-------+-------+-------+-------+
# | 234| 12345|ABCDEFG| null| null|
# | 123| 34567|CDEFGHI| 23456|BCDEFGH|
# +---------+-------+-------+-------+-------+
Notice the column names - Spark added the row number as a prefix to the aggregation alias. You can rename the columns to turn that prefix into a suffix.
def renameTheCol(column):
    col_split = column.split('_')
    col_split_rearr = col_split[1:] + [col_split[0]]
    new_column = '_'.join(col_split_rearr)
    return new_column

pivoted_sdf. \
    select('unique_id',
           *[func.col(k).alias(renameTheCol(k)) for k in pivoted_sdf.columns if k != 'unique_id']
           ). \
    show()
# +---------+-------+-------+-------+-------+
# |unique_id|col_a_1|col_b_1|col_a_2|col_b_2|
# +---------+-------+-------+-------+-------+
# | 234| 12345|ABCDEFG| null| null|
# | 123| 23456|BCDEFGH| 34567|CDEFGHI|
# +---------+-------+-------+-------+-------+
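To match the question's layout with up to 3 values per ID, the same pattern extends to values=['1', '2', '3']. A minimal sketch, assuming data_sdf has the columns unique_id, column_A and column_B, and that ordering by column_A is acceptable (swap in whatever column defines your row order):
pivoted_3_sdf = data_sdf. \
    withColumn('rn',
               func.row_number().over(wd.partitionBy('unique_id').orderBy('column_A'))
               ). \
    filter(func.col('rn') <= 3). \
    groupBy('unique_id'). \
    pivot('rn', values=['1', '2', '3']). \
    agg(func.first('column_A').alias('column_A'),
        func.first('column_B').alias('column_B')
        )

# reuse renameTheCol() from above to move the row-number prefix to a suffix, e.g. 1_column_A -> column_A_1
pivoted_3_sdf. \
    select('unique_id',
           *[func.col(k).alias(renameTheCol(k)) for k in pivoted_3_sdf.columns if k != 'unique_id']
           ). \
    show()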

Related

How to get the mean for multiple groups at once in PySpark

Let's say I have the following dataframe:
age_group occupation sex country debt
31_40 attorney M USA 10000
41_50 doctor F Mexico 2000
21_30 dentist F Canada 7000
51_60 engineer M Hungary 9000
61_70 driver M Egypt 23000
Considering it contains millions of rows, how can I get the debt mean for each specific group in only one dataframe, so it would return something like this:
group value debt_mean
country egypt 12500
country usa 25000
age_group 21_30 5000
sex f 15000
sex m 15000
occupation driver 5200
Use a list comprehension to create a list of mean dataframes, one for each of the grouping columns. Then use reduce to union the dataframes in that list.
Long hand:
from functools import reduce
from pyspark.sql.functions import mean

meanDF_list = [df.groupby(x).agg(mean('debt').alias('mean')).toDF('Group', 'mean') for x in df.drop('debt')]
reduce(lambda a, b: b.unionByName(a),meanDF_list).show()
Chained:
reduce(lambda a, b: b.unionByName(a), [df.groupby(x).agg(mean('debt').alias('mean')).toDF('Group','mean') for x in df.drop('debt')]).show()
+--------+-------+
| Group| mean|
+--------+-------+
| USA|10000.0|
| Mexico| 2000.0|
| Canada| 7000.0|
| Hungary| 9000.0|
| Egypt|23000.0|
| M|14000.0|
| F| 4500.0|
|attorney|10000.0|
| doctor| 2000.0|
| dentist| 7000.0|
|engineer| 9000.0|
| driver|23000.0|
| 31_40|10000.0|
| 41_50| 2000.0|
| 21_30| 7000.0|
| 51_60| 9000.0|
| 61_70|23000.0|
+--------+-------+
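If you also want the group and value columns from the expected output, a small variation of the same idea tags each partial result with the column it was grouped by - a sketch, assuming a dataframe df with the question's columns:
from functools import reduce
from pyspark.sql import functions as F

# one small dataframe per grouping column, each tagged with the column name it came from
mean_dfs = [
    df.groupby(c).agg(F.avg('debt').alias('debt_mean'))
      .select(F.lit(c).alias('group'), F.col(c).cast('string').alias('value'), 'debt_mean')
    for c in df.columns if c != 'debt'
]
reduce(lambda a, b: a.unionByName(b), mean_dfs).show()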
This is what can be done in Spark:
Setup:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('31_40', 'attorney', 'M', 'USA', 10000),
('41_50', 'doctor', 'F', 'Mexico', 2000),
('21_30', 'dentist', 'F', 'Canada', 7000),
('51_60', 'engineer', 'M', 'Hungary', 9000),
('61_70', 'driver', 'M', 'Egypt', 23000)],
['age_group', 'occupation', 'sex', 'country', 'debt']
)
Script:
grp_cols = ['age_group', 'occupation', 'sex', 'country']
df = (df
    .cube(grp_cols)
    .agg(F.avg('debt').alias('debt_mean'))
    .filter(F.aggregate(F.array(*grp_cols), F.lit(0), lambda acc, x: acc + x.isNotNull().cast('int')) == 1)
    .withColumn('map', F.from_json(F.to_json(F.struct(*grp_cols)), 'map<string,string>'))
    .select(
        F.map_keys('map')[0].alias('group'),
        F.map_values('map')[0].alias('value'),
        'debt_mean'
    )
)
Result:
df.show()
# +----------+--------+---------+
# | group| value|debt_mean|
# +----------+--------+---------+
# | sex| F| 4500.0|
# |occupation|attorney| 10000.0|
# | age_group| 41_50| 2000.0|
# | sex| M| 14000.0|
# |occupation| doctor| 2000.0|
# | country| Mexico| 2000.0|
# | age_group| 31_40| 10000.0|
# | country| USA| 10000.0|
# |occupation| dentist| 7000.0|
# | country| Canada| 7000.0|
# | age_group| 61_70| 23000.0|
# | age_group| 51_60| 9000.0|
# | country| Egypt| 23000.0|
# | country| Hungary| 9000.0|
# |occupation| driver| 23000.0|
# |occupation|engineer| 9000.0|
# | age_group| 21_30| 7000.0|
# +----------+--------+---------+
This does not match your example, so I hope the example was just for structure, not for data.
A way to improve this would be to go into SQL and use grouping sets. It gets a bit messier, so I'll leave that to you if you really need the extra efficiency.
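For reference, a minimal sketch of that grouping-sets route, assuming the df from the setup block is registered as a temporary view named debts:
df.createOrReplaceTempView('debts')

spark.sql("""
    SELECT age_group, occupation, sex, country, avg(debt) AS debt_mean
    FROM debts
    GROUP BY age_group, occupation, sex, country
    GROUPING SETS ((age_group), (occupation), (sex), (country))
""").show()
This computes only the four single-column groupings instead of the full cube; the same from_json/to_json map trick as above can then reshape the NULL-padded grouping columns into the group/value pair.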

Removing unwanted characters in python pandas

I have a pandas dataframe column like below :
| ColumnA |
+-------------+
| ABCD(!) |
| <DEFG>(23) |
| (MNPQ. ) |
| 32.JHGF |
| "QWERT" |
Aim is to remove the special characters and produce the output as below :
| ColumnA |
+------------+
| ABCD |
| DEFG |
| MNPQ |
| JHGF |
| QWERT |
I tried using the replace method like below, but without success:
df['ColumnA'] = df['ColumnA'].str.replace(r"[^a-zA-Z\d\_]+", "", regex=True)
print(df)
So, how can I replace the special characters using replace method in pandas?
Your pattern also keeps numbers (\d) and underscores (_); to remove them as well, match letters only:
df['ColumnA'] = df['ColumnA'].str.replace(r"[^a-zA-Z]+", "", regex=True)
print (df)
ColumnA
0 ABCD
1 DEFG
2 MNPQ
3 JHGF
4 QWERT
The regex should be r'[^a-zA-Z]+'; it means keep only the characters from A to Z and a to z.
import pandas as pd
# | ColumnA |
# +-------------+
# | ABCD(!) |
# | <DEFG>(23) |
# | (MNPQ. ) |
# | 32.JHGF |
# | "QWERT" |
# create a dataframe from a list
df = pd.DataFrame(['ABCD(!)', 'DEFG(23)', '(MNPQ. )', '32.JHGF', 'QWERT'], columns=['ColumnA'])
# | ColumnA |
# +------------+
# | ABCD |
# | DEFG |
# | MNPQ |
# | JHGF |
# | QWERT |
# keep only the characters that are from A to Z, a-z
df['ColumnB'] = df['ColumnA'].str.replace(r'[^a-zA-Z]+', '', regex=True)
print(df['ColumnB'])
Result:
0 ABCD
1 DEFG
2 MNPQ
3 JHGF
4 QWERT
Your suggested code works fine on my installation, leaving only extra digits, so you just need to update your regex statement to r"[^a-zA-Z]+". If this doesn't work, then maybe try updating your pandas:
import pandas as pd
d = {'ColumnA': [' ABCD(!)', '<DEFG>(23)', '(MNPQ. )', ' 32.JHGF', '"QWERT"']}
df = pd.DataFrame(d)
df['ColumnA'] = df['ColumnA'].str.replace(r"[^a-zA-Z]+", "", regex=True)
print(df)
Output
ColumnA
0 ABCD
1 DEFG
2 MNPQ
3 JHGF
4 QWERT

PySpark get related records from its array object values

I have a Spark dataframe that has an ID column and, along with other columns, an array column that contains the IDs of its related records as its value.
example dataframe will be of
ID | NAME | RELATED_IDLIST
--------------------------
123 | mike | [345,456]
345 | alen | [789]
456 | sam | [789,999]
789 | marc | [111]
555 | dan | [333]
From the above, I need to append all the related child IDs to the array column of the parent ID. The resultant DF should look like this:
ID | NAME | RELATED_IDLIST
--------------------------
123 | mike | [345,456,789,999,111]
345 | alen | [789,111]
456 | sam | [789,999,111]
789 | marc | [111]
555 | dan | [333]
I need help on how to do this. Thanks.
One way to handle this task is to do a self left-join, update RELATED_IDLIST, and repeat this for several iterations until some condition is satisfied (this works only when the max depth of the whole hierarchy is small). For Spark 2.3, we can convert the ArrayType column into a comma-delimited StringType column, and use the SQL builtin function find_in_set plus a new column PROCESSED_IDLIST to set up the join conditions; see below for the main logic:
Functions:
from pyspark.sql import functions as F
import pandas as pd
# define a function which takes a dataframe as input, does a self left-join and then returns another
# dataframe with exactly the same schema as the input dataframe. do the same repeatedly until some condition is satisfied
def recursive_join(d, max_iter=10):
    # function to find direct child-IDs and merge into RELATED_IDLIST
    def find_child_idlist(_df):
        return _df.alias('d1').join(
            _df.alias('d2'),
            F.expr("find_in_set(d2.ID,d1.RELATED_IDLIST)>0 AND find_in_set(d2.ID,d1.PROCESSED_IDLIST)<1"),
            "left"
        ).groupby("d1.ID", "d1.NAME").agg(
            F.expr("""
                /* combine d1.RELATED_IDLIST with all matched entries from collect_set(d2.RELATED_IDLIST)
                 * and remove the trailing comma when all d2.RELATED_IDLIST are NULL */
                trim(TRAILING ',' FROM
                    concat_ws(",", first(d1.RELATED_IDLIST), concat_ws(",", collect_list(d2.RELATED_IDLIST)))
                ) as RELATED_IDLIST"""),
            F.expr("first(d1.RELATED_IDLIST) as PROCESSED_IDLIST")
        )
    # below is the main code logic
    d = find_child_idlist(d).persist()
    if (d.filter("RELATED_IDLIST!=PROCESSED_IDLIST").count() > 0) & (max_iter > 1):
        d = recursive_join(d, max_iter-1)
    return d

# define a pandas_udf to remove duplicates from an ArrayType column
get_uniq = F.pandas_udf(lambda s: pd.Series([list(set(x)) for x in s]), "array<int>")
Where:
in the function find_child_idlist(), the left-join must satisfy the following two conditions:
d2.ID is in d1.RELATED_IDLIST: find_in_set(d2.ID,d1.RELATED_IDLIST)>0
d2.ID not in d1.PROCESSED_IDLIST: find_in_set(d2.ID,d1.PROCESSED_IDLIST)<1
quit recursive_join() when there is no row satisfying RELATED_IDLIST != PROCESSED_IDLIST or when max_iter reaches 1
Processing:
set up dataframe:
df = spark.createDataFrame([
(123, "mike", [345,456]), (345, "alen", [789]), (456, "sam", [789,999]),
(789, "marc", [111]), (555, "dan", [333])
],["ID", "NAME", "RELATED_IDLIST"])
add a new column PROCESSED_IDLIST to keep track of the RELATED_IDLIST already processed in previous joins (initialized to ID), and run recursive_join():
df1 = df.withColumn('RELATED_IDLIST', F.concat_ws(',','RELATED_IDLIST')) \
    .withColumn('PROCESSED_IDLIST', F.col('ID'))
df_new = recursive_join(df1, 5)
df_new.show(10,0)
+---+----+-----------------------+-----------------------+
|ID |NAME|RELATED_IDLIST |PROCESSED_IDLIST |
+---+----+-----------------------+-----------------------+
|555|dan |333 |333 |
|789|marc|111 |111 |
|345|alen|789,111 |789,111 |
|123|mike|345,456,789,789,999,111|345,456,789,789,999,111|
|456|sam |789,999,111 |789,999,111 |
+---+----+-----------------------+-----------------------+
split RELATED_IDLIST into an array of integers and then use the pandas_udf function to drop duplicate array elements:
df_new.withColumn("RELATED_IDLIST", get_uniq(F.split('RELATED_IDLIST', ',').cast('array<int>'))).show(10,0)
+---+----+-------------------------+-----------------------+
|ID |NAME|RELATED_IDLIST |PROCESSED_IDLIST |
+---+----+-------------------------+-----------------------+
|555|dan |[333] |333 |
|789|marc|[111] |111 |
|345|alen|[789, 111] |789,111 |
|123|mike|[999, 456, 111, 789, 345]|345,456,789,789,999,111|
|456|sam |[111, 789, 999] |789,999,111 |
+---+----+-------------------------+-----------------------+
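As a side note, on Spark 2.4+ the de-duplication step can likely be done with the builtin array_distinct instead of a pandas_udf - a sketch, assuming the df_new from above:
# on Spark 2.4+, array_distinct can replace the pandas_udf used above
df_new.withColumn("RELATED_IDLIST", F.array_distinct(F.split('RELATED_IDLIST', ',').cast('array<int>'))).show(10,0)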

filter dataframe by multiple columns after exploding

My df contains product names and corresponding information. Relevant here are the name and the countries sold to:
+--------------------+-------------------------+
| Product_name|collect_set(Countries_en)|
+--------------------+-------------------------+
| null| [Belgium,United K...|
| #5 pecan/almond| [Belgium]|
| #8 mango/strawberry| [Belgium]|
|& Sully A Mild Th...| [Belgium,France]|
|"70CL Liqueu...| [Belgium,France]|
|"Gingembre&q...| [Belgium]|
|"Les Schtrou...| [Belgium,France]|
|"Sho-key&quo...| [Belgium]|
|"mini Chupa ...| [Belgium,France]|
| 'S Lands beste| [Belgium]|
|'T vlierbos confi...| [Belgium]|
|(H)eat me - Spagh...| [Belgium]|
| -cheese flips| [Belgium]|
| .soupe cerfeuil| [Belgium]|
|1 1/2 Minutes Bas...| [Belgium,Luxembourg]|
| 1/2 Reblochon AOP| [Belgium]|
| 1/2 nous de jambon| [Belgium]|
|1/2 tarte cerise ...| [Belgium]|
|10 Original Knack...| [Belgium,France,S...|
| 10 pains au lait| [Belgium,France]|
+--------------------+-------------------------+
sample input data:
[Row(code=2038002038.0, Product_name='Formula 2 men multi vitaminic', Countries_en='France,Ireland,Italy,Mexico,United States,Argentina-espanol,Armenia-pyсский,Aruba-espanol,Asia-pacific,Australia-english,Austria-deutsch,Azerbaijan-русский,Belarus-pyсский,Belgium-francais,Belgium-nederlands,Bolivia-espanol,Bosnia-i-hercegovina-bosnian,Botswana-english,Brazil-portugues,Bulgaria-български,Cambodia-english,Cambodia-ភាសាខ្មែរ,Canada-english,Canada-francais,Chile-espanol,China-中文,Colombia-espanol,Costa-rica-espanol,Croatia-hrvatski,Cyprus-ελληνικά,Czech-republic-čeština,Denmark-dansk,Ecuador-espanol,El-salvador-espanol,Estonia-eesti,Europe,Finland-suomi,France-francais,Georgia-ქართული,Germany-deutsch,Ghana-english,Greece-ελληνικά,Guatemala-espanol,Honduras-espanol,Hong-kong-粵語,Hungary-magyar,Iceland-islenska,India-english,Indonesia-bahasa-indonesia,Ireland-english,Israel-עברית,Italy-italiano,Jamaica-english,Japan-日本語,Kazakhstan-pyсский,Korea-한국어,Kyrgyzstan-русский,Latvia-latviešu,Lebanon-english,Lesotho-english,Lithuania-lietuvių,Macau-中文,Malaysia-bahasa-melayu,Malaysia-english,Malaysia-中文,Mexico-espanol,Middle-east-africa,Moldova-roman,Mongolia-монгол-хэл,Namibia-english,Netherlands-nederlands,New-zealand-english,Nicaragua-espanol,North-macedonia-македонски-јазик,Norway-norsk,Panama-espanol,Paraguay-espanol,Peru-espanol,Philippines-english,Poland-polski,Portugal-portugues,Puerto-rico-espanol,Republica-dominicana-espanol,Romania-romană,Russia-русский,Serbia-srpski,Singapore-english,Slovak-republic-slovenčina,Slovenia-slovene,South-africa-english,Spain-espanol,Swaziland-english,Sweden-svenska,Switzerland-deutsch,Switzerland-francais,Taiwan-中文,Thailand-ไทย,Trinidad-tobago-english,Turkey-turkce,Ukraine-yкраї́нська,United-kingdom-english,United-states-english,United-states-espanol,Uruguay-espanol,Venezuela-espanol,Vietnam-tiếng-việt,Zambia-english', Traces_en=None, Additives_tags=None, Main_category_en='Vitamins', Image_url='https://static.openfoodfacts.org/images/products/203/800/203/8/front_en.12.400.jpg', Quantity='60 compresse', Packaging_tags='barattolo,tablet', )]
Since I want to explore which countries the products are sold to besides Belgium, I split the country column to show every country individually, using the code below:
#create df with grouped products
countriesDF = productsDF\
    .select("Product_name", "Countries_en")\
    .groupBy("Product_name")\
    .agg(F.collect_set("Countries_en").cast("string").alias("Countries"))\
    .orderBy("Product_name")
#split df to show countries the product is sold to in a separate column
countriesDF = countriesDF\
    .where(col("Countries") != "null")\
    .select("Product_name",
            F.split("Countries", ",").alias("Countries"),
            F.posexplode(F.split("Countries", ",")).alias("pos", "val")
            )\
    .drop("val")\
    .select(
        "Product_name",
        F.concat(F.lit("Countries"), F.col("pos").cast("string")).alias("name"),
        F.expr("Countries[pos]").alias("val")
    )\
    .groupBy("Product_name").pivot("name").agg(F.first("val"))\
    .show()
However, this table now has over 400 columns for countries alone, which is not presentable. So my questions are:
am I doing the splitting / exploding correctly?
can I split the df so I get the countries as column names (e.g. 'France' instead of 'countries1' etc.), counting the number of times the product is sold in this country?
Some sample data :
val sampledf = Seq(("p1","BELGIUM,GERMANY"),("p1","BELGIUM,ITALY"),("p1","GERMANY"),("p2","BELGIUM")).toDF("Product_name","Countries_en")
Transform to the required df:
val df = sampledf
  .withColumn("country_list", split(col("Countries_en"), ","))
  .select(col("Product_name"), explode(col("country_list")).as("country"))
+------------+-------+
|Product_name|country|
+------------+-------+
| p1|BELGIUM|
| p1|GERMANY|
| p1|BELGIUM|
| p1| ITALY|
| p1|GERMANY|
| p2|BELGIUM|
+------------+-------+
If you need only counts per country :
countDF = df.groupBy("Product_name","country").count()
countDF.show()
+------------+-------+-----+
|Product_name|country|count|
+------------+-------+-----+
| p1|BELGIUM| 2|
| p1|GERMANY| 1|
| p2|BELGIUM| 1|
+------------+-------+-----+
Except Belgium:
countDF.filter(col("country") =!= "BELGIUM").show()
+------------+-------+-----+
|Product_name|country|count|
+------------+-------+-----+
| p1|GERMANY| 1|
+------------+-------+-----+
And if you really want countries as Columns :
countDF.groupBy("Product_name").pivot("country").agg(first("count"))
+------------+-------+-------+
|Product_name|BELGIUM|GERMANY|
+------------+-------+-------+
| p2| 1| null|
| p1| 2| 1|
+------------+-------+-------+
And you can .drop("BELGIUM") to achieve it.
Final code used:
#create df where countries are split off
df = productsDF\
    .withColumn("country_list", split(col("Countries_en"), ","))\
    .select(col("Product_name"), explode(col("country_list")).alias("Country"))

#create count and filter out Country Belgium, Product_name can be changed as needed
countDF = df.groupBy("Product_name", "Country").count()\
    .filter(col("Country") != "Belgium")\
    .filter(col('Product_name') == 'Café').show()
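As a side note, the count and pivot steps can likely be collapsed into one pass - a sketch, assuming the df with Product_name and Country columns built in the final code above:
# count per product and country, countries as columns, Belgium column dropped
df.groupBy("Product_name").pivot("Country").count().drop("Belgium").show()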

Add columns based on a changing number of rows below

I'm trying to solve a machine learning problem for a university project. As input I got an Excel table.
I need to access information below specific rows (condition: df['c1'] != 0) and create new columns with it, but the number of rows below each such row is not fixed.
There are various pandas functions I tried to get running (e.g. while loops combined with iloc, iterrows), but nothing seemed to work. Now I wonder if I need to create a function that builds a new df for every group below each top element. I assume there must be a better option. I use Python 3.6 and pandas 0.25.0.
I am trying to get the following result.
Input:
| name | c1 | c2 |
|------|-------|--------------|
| ab | 1 | info |
| tz | 0 | more info |
| ka | 0 | more info |
| cd | 2 | info |
| zz | 0 | more info |
The output should look like this:
Output:
| name | c1 | c2 | tz3 | ka4 | zz5 |
|------|-------|--------------|-----------|-----------|------------|
| ab | 1 | info | more info | more info | |
| tz | 0 | more info | | | |
| ka | 0 | more info | | | |
| cd | 2 | info | | | more info |
| zz | 0 | more info | | | |
You can do this as follows:
# make sure c1 is of type int (if it isn't already)
# if it is string, just change the comparison further below
df['c1']= df['c1'].astype('int32')
# create two temporary aux columns in the original dataframe
# the first contains 1 for each row where c1 is nonzero
df['nonzero']= (df['c1'] != 0).astype('int')
# the second contains a "group index" to give
# all rows that belong together the same number
df['group']= df['nonzero'].cumsum()
# create a working copy from the original dataframe
df2= df[['c1', 'c2', 'group']].copy()
# add another column which contains the name of the
# column under which the text should appear
df2['col']= df['name'].where(df['nonzero']==0, 'c2')
# add a dummy column with all ones
# (needed to merge the original dataframe
# with the "transposed" dataframe later)
df2['nonzero']= 1
# now the main part
# use the prepared copy and index it on
# group, nonzero(1) and col
df3= df2[['group', 'nonzero', 'col', 'c2']].set_index(['group', 'nonzero', 'col'])
# unstack it, meaning col is "split off" to create a new column
# level (like pivoting), the rest remains in the index
df3= df3.unstack()
# now df3 has a multilevel column index
# to get rid of it and have regular column names
# just rename the columns and remove c2 which
# we get from the original dataframe
df3_names= ['{1}'.format(*tup) for tup in df3.columns]
df3.columns= df3_names
df3.drop(['c2'], axis='columns', inplace=True)
# df3 now contains the "transposed" infos in column c1
# which should appear in the row for which 'nonzero' contains 1
# to get this, use merge
result= df.merge(df3, left_on=['group', 'nonzero'], right_index=True, how='left')
# if you don't like the NaN values (for the rows with nonzero=0), use fillna
result.fillna('', inplace=True)
# remove the aux columns and the merged c2_1 column
# for c2_1 we can use the original c2 column from df
result.drop(['group', 'nonzero'], axis='columns', inplace=True)
# therefore we rename it to get the same naming schema
result.rename({'c2': 'c2_1'}, axis='columns', inplace=True)
The result looks like this:
Out[191]:
name c1 c2 ka tz zz
0 ab 1 info even more info more info
1 tz 0 more info
2 ka 0 even more info
3 cd 2 info more info
4 zz 0 more info
For this input data:
Out[166]:
name c1 c2
0 ab 1 info
1 tz 0 more info
2 ka 0 even more info
3 cd 2 info
4 zz 0 more info
# created by the following code:
import io
import pandas as pd
raw=""" name c1 c2
0 ab 1 info
1 tz 0 more_info
2 ka 0 even_more_info
3 cd 2 info
4 zz 0 more_info"""
df= pd.read_csv(io.StringIO(raw), sep='\s+', index_col=0)
df['c2']=df['c2'].str.replace('_', ' ')
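For comparison, a more compact sketch of the same transposition using cumsum, pivot and join - assuming the df built above, and that child names are unique within each group:
tmp = df.copy()
# rows with c1 != 0 start a new group; cumsum gives every row its group id
tmp['group'] = (tmp['c1'] != 0).cumsum()

# spread the child rows' c2 values into columns named after the child rows
children = tmp[tmp['c1'] == 0]
wide = children.pivot(index='group', columns='name', values='c2')

# attach the child columns to every row of the group, then blank them out on the child rows themselves
result = tmp.join(wide, on='group')
result.loc[tmp['c1'] == 0, wide.columns] = ''
result = result.drop(columns='group').fillna('')
print(result)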
