Randomly Split DataFrame by Unique Values in One Column - apache-spark

I have a pyspark DataFrame like the following:
+--------+--------+-----------+
| col1 | col2 | groupId |
+--------+--------+-----------+
| val11 | val21 | 0 |
| val12 | val22 | 1 |
| val13 | val23 | 2 |
| val14 | val24 | 0 |
| val15 | val25 | 1 |
| val16 | val26 | 1 |
+--------+--------+-----------+
Each row has a groupId and multiple rows can have the same groupId.
I want to randomly split this data into two datasets. But all the data having a particular groupId must be in one of the splits.
This means that if d1.groupId = d2.groupId, then d1 and d2 are in the same split.
For example:
# Split 1:
+--------+--------+-----------+
| col1 | col2 | groupId |
+--------+--------+-----------+
| val11 | val21 | 0 |
| val13 | val23 | 2 |
| val14 | val24 | 0 |
+--------+--------+-----------+
# Split 2:
+--------+--------+-----------+
| col1 | col2 | groupId |
+--------+--------+-----------+
| val12 | val22 | 1 |
| val15 | val25 | 1 |
| val16 | val26 | 1 |
+--------+--------+-----------+
What is a good way to do this in PySpark? Can I use the randomSplit method somehow?

You can use randomSplit to split just the distinct groupIds, and then use the results to split the source DataFrame using join.
For example:
split1, split2 = df.select("groupId").distinct().randomSplit(weights=[0.5, 0.5], seed=0)
split1.show()
#+-------+
#|groupId|
#+-------+
#| 1|
#+-------+
split2.show()
#+-------+
#|groupId|
#+-------+
#| 0|
#| 2|
#+-------+
Now join these back to the original DataFrame:
df1 = df.join(split1, on="groupId", how="inner")
df2 = df.join(split2, on="groupId", how="inner")
df1.show()
#+-------+-----+-----+
#|groupId| col1| col2|
#+-------+-----+-----+
#| 1|val12|val22|
#| 1|val15|val25|
#| 1|val16|val26|
#+-------+-----+-----+
df2.show()
#+-------+-----+-----+
#|groupId| col1| col2|
#+-------+-----+-----+
#| 0|val11|val21|
#| 0|val14|val24|
#| 2|val13|val23|
#+-------+-----+-----+
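Since split1 and split2 only contain the distinct groupId values, they are usually small, so the joins above could equally be written with an explicit broadcast hint. A minimal sketch, assuming the same names as above:
from pyspark.sql.functions import broadcast
# broadcast the small split DataFrames so the join with df avoids a full shuffle
df1 = df.join(broadcast(split1), on="groupId", how="inner")
df2 = df.join(broadcast(split2), on="groupId", how="inner")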

Related

How to map one column to multiple binary columns in Spark?

This might be related to pivoting, but I am not sure. Basically, what I want to achieve is the following binary transformation:
+----+-----+
| C1 | C2  |
+----+-----+
| A  | xxx |
| B  | yyy |
| A  | yyy |
| B  | www |
| B  | xxx |
| A  | zzz |
| A  | xxx |
| A  | yyy |
+----+-----+
to
+----+-----+-----+-----+-----+
| C1 | www | xxx | yyy | zzz |
+----+-----+-----+-----+-----+
| A  |  0  |  1  |  1  |  1  |
| B  |  1  |  1  |  1  |  0  |
+----+-----+-----+-----+-----+
How does one achieve this in PySpark? Presence is 1 and absence is 0.
Yes, you will need pivot. For the aggregation, in your case it is simplest to use F.first(F.lit(1)), and then replace the resulting nulls with 0 using df.fillna(0).
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('A', 'xxx'),
     ('B', 'yyy'),
     ('A', 'yyy'),
     ('B', 'www'),
     ('B', 'xxx'),
     ('A', 'zzz'),
     ('A', 'xxx'),
     ('A', 'yyy')],
    ['C1', 'C2'])

df = df.groupBy('C1').pivot('C2').agg(F.first(F.lit(1)))
df = df.fillna(0)
df.show()
# +---+---+---+---+---+
# | C1|www|xxx|yyy|zzz|
# +---+---+---+---+---+
# | B| 1| 1| 1| 0|
# | A| 0| 1| 1| 1|
# +---+---+---+---+---+
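If the distinct C2 values are known up front, they can be passed to pivot explicitly, which lets Spark skip the extra job it otherwise runs to discover them. The pivot line above would then read, as a sketch:
# passing the pivot values avoids a separate pass over the data to collect them
df = df.groupBy('C1').pivot('C2', ['www', 'xxx', 'yyy', 'zzz']).agg(F.first(F.lit(1)))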

Dynamically add padding Zeros

mock_data = [('TYCO', ' 1303','13'),('EMC', ' 120989 ','123'), ('VOLVO ', '102329 ','1234'),('BMW', '1301571345 ',' '),('FORD', '004','21212')]
df = spark.createDataFrame(mock_data, ['col1', 'col2','col3'])
+-------+------------+-----+
| col1 | col2| col3|
+-------+------------+-----+
| TYCO| 1303| 13|
| EMC| 120989 | 123|
|VOLVO | 102329 | 1234|
| BMW|1301571345 | |
| FORD| 004|21212|
+-------+------------+-----+
Trim col2 and, based on the remaining length (10 minus the length of trimmed col2), dynamically left-pad col3 with zeros, then concatenate col2 and col3.
from pyspark.sql.functions import length, trim
df2 = df.withColumn('length_col2', 10 - length(trim(df.col2)))
+-------+------------+-----+-----------+
| col1| col2| col3|length_col2|
+-------+------------+-----+-----------+
| TYCO| 1303| 13| 6|
| EMC| 120989 | 123| 4|
|VOLVO | 102329 | 1234| 4|
| BMW|1301571345 | | 0|
| FORD| 004|21212| 7|
+-------+------------+-----+-----------+
expected output
+-------+------------+-----+----------+
|   col1|        col2| col3|    output|
+-------+------------+-----+----------+
|   TYCO|        1303|   13|1303000013|
|    EMC|     120989 |  123|1209890123|
|VOLVO  |     102329 | 1234|1023291234|
|    BMW| 1301571345 |     |1301571345|
|   FORD|         004|21212|0040021212|
+-------+------------+-----+----------+
What you are looking for is the rpad function (available both in Spark SQL and as pyspark.sql.functions.rpad), as listed here: https://spark.apache.org/docs/2.3.0/api/sql/index.html
See the solution below:
%pyspark
mock_data = [('TYCO', ' 1303','13'),('EMC', ' 120989 ','123'), ('VOLVO ', '102329 ','1234'),('BMW', '1301571345 ',' '),('FORD', '004','21212')]
df = spark.createDataFrame(mock_data, ['col1', 'col2','col3'])
df.createOrReplaceTempView("input_df")
spark.sql("SELECT *, concat(rpad(trim(col2),10,'0') , col3) as OUTPUT from input_df").show(20,False)
And the result:
+-------+------------+-----+---------------+
|col1 |col2 |col3 |OUTPUT |
+-------+------------+-----+---------------+
|TYCO | 1303 |13 |130300000013 |
|EMC | 120989 |123 |1209890000123 |
|VOLVO |102329 |1234 |10232900001234 |
|BMW |1301571345 | |1301571345 |
|FORD |004 |21212|004000000021212|
+-------+------------+-----+---------------+
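Note that the query above pads col2 to a fixed width of 10 before appending col3, so its results are longer than the 10-character values in the expected output. If the goal is a total width of 10 (trimmed col2 on the left, col3 on the right, zeros in between), a sketch using lpad instead, assuming the same input_df view registered above:
# left-pad trimmed col3 with zeros so trimmed col2 plus padded col3 is 10 characters wide
spark.sql("""
    SELECT *,
           concat(trim(col2), lpad(trim(col3), 10 - length(trim(col2)), '0')) AS output
    FROM input_df
""").show(20, False)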

Sum of only positive values of a column when aggregating a Spark Dataset

I have a sample dataset as below
+------+------+--------+
| col1 | col2 | values |
+------+------+--------+
| abc | def | -4 |
+------+------+--------+
| abc | def | 4 |
+------+------+--------+
| abc | efg | 8 |
+------+------+--------+
When aggregating the above dataset, I need the sum of only the positive values when grouping by col1, col2, as below:
+------+------+--------+
| col1 | col2 | values |
+------+------+--------+
| abc | def | 4 |
+------+------+--------+
| abc | efg | 8 |
+------+------+--------+
Currently I get:
+------+------+--------+
| col1 | col2 | values |
+------+------+--------+
| abc | def | 0 |
+------+------+--------+
| abc | efg | 8 |
+------+------+--------+
I would not filter before aggregation, because you may need the other rows for a different aggregation. Just formulate your requirement directly in the agg clause:
df
  .groupBy($"col1", $"col2")
  .agg(
    sum(when($"values" > 0, $"values")).as("values")  // sum of positive values
  )
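A direct PySpark translation of the same conditional sum, as a sketch with the column names from the question:
import pyspark.sql.functions as F

df.groupBy('col1', 'col2').agg(
    F.sum(F.when(F.col('values') > 0, F.col('values'))).alias('values')
).show()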
You can replace negative values with zero, then do the aggregation. The following is a demo in PySpark; it should be similar in other languages:
import pyspark.sql.functions as f

df.groupBy('col1', 'col2').agg(
    f.sum(
        f.when(f.col('values') < 0, 0).otherwise(f.col('values'))
    ).alias('values')
).show()
+----+----+------+
|col1|col2|values|
+----+----+------+
| abc| def| 4|
| abc| efg| 8|
+----+----+------+
You can filter the Dataset before the aggregation:
df
  .filter(col("values") > 0)
  .groupBy("col1", "col2")
  .agg(sum("values").as("values"))
  .show()
+----+----+------+
|col1|col2|values|
+----+----+------+
| abc| def| 4|
| abc| efg| 8|
+----+----+------+
val sparkSession = SparkSession.builder.config("spark.master","local").getOrCreate()
import sparkSession.implicits._
val someDF = Seq(("abc","def",-4),("abc","def",4),("abc","efg",8)).toDF("col1","col2","values")
someDF.filter("values>0").groupBy("col1","col2").sum("values").show()
+----+----+-----------+
|col1|col2|sum(values)|
+----+----+-----------+
| abc| def| 4|
| abc| efg| 8|
+----+----+-----------+

Expand last value of string column to groupby Pandas Dataframe

I have the following Pandas dataframe:
+--------+----+
|id |name|
+--------+----+
| 1| |
| 1| |
| 1| |
| 1|Carl|
| 2| |
| 2| |
| 2|John|
+--------+----+
What I want to achieve is to expand the last value of each group to the rest of the group:
+--------+----+
|id |name|
+--------+----+
| 1|Carl|
| 1|Carl|
| 1|Carl|
| 1|Carl|
| 2|John|
| 2|John|
| 2|John|
+--------+----+
It looks pretty easy, but I am struggling to achieve it because of the column's type.
What I've tried so far is:
df['name'] = df.groupby('id')['name'].transform('last')
This works for int or float columns, but not for string columns.
I am getting the following error:
No numeric types to aggregate
Thanks in advance.
Edit
bfill() is not valid because I can have the following:
+--------+----+
|id |name|
+--------+----+
| 1| |
| 1| |
| 1| |
| 1|Carl|
| 2| |
| 2| |
| 2| |
| 3| |
| 3| |
| 3|John|
+--------+----+
In this case, I want id = 2 to remain NaN, but with bfill() it would end up as John, which is incorrect. The desired output would be:
+--------+----+
|id |name|
+--------+----+
| 1|Carl|
| 1|Carl|
| 1|Carl|
| 1|Carl|
| 2| |
| 2| |
| 2| |
| 3|John|
| 3|John|
| 3|John|
+--------+----+
If the empty values are NaN, you could try bfill:
df['name'] = df['name'].bfill()
If not, replace the empty strings with NaN first.
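Given the edit above, a plain bfill would leak a name across group boundaries; a minimal sketch (with a hypothetical sample matching the edited question) that back-fills within each id only:
import numpy as np
import pandas as pd

# hypothetical sample matching the edited question
df = pd.DataFrame({"id": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "name": ["", "", "", "Carl", "", "", "", "", "", "John"]})

# empty strings -> NaN, then back-fill within each id so a group without a
# name (id = 2) stays NaN instead of borrowing "John" from the next group
df["name"] = df["name"].replace("", np.nan)
df["name"] = df.groupby("id")["name"].bfill()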
Try this.
import pandas as pd
import numpy as np

dff = pd.DataFrame({"id": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                    "name": ["", "", "", "car1", "", "", "", "", "", "john"]})
dff = dff.replace(r'', np.NaN)

def c(x):
    # if the group has at least one non-null value, repeat its string value across the group
    if sum(pd.isnull(x)) != np.size(x):
        l = [v for v in x if type(v) == str]
        return [l[0]] * np.size(x)
    else:
        return [""] * np.size(x)

df = dff.groupby('id')["name"].apply(lambda x: c(list(x)))
df = df.to_frame().reset_index()
df = df.set_index('id').name.apply(pd.Series).stack().reset_index(level=0).rename(columns={0: 'name'})
output
id name
0 1 car1
1 1 car1
2 1 car1
3 1 car1
0 2
1 2
2 2
0 3 john
1 3 john
2 3 john

Combine dataframes columns consisting of multiple values - Spark

I have two Spark dataframes that share the same ID column:
df1:
+------+---------+---------+
|ID | Name1 | Name2 |
+------+---------+---------+
| 1 | A | B |
| 2 | C | D |
| 3 | E | F |
+------+---------+---------+
df2:
+------+-------+
|ID | key |
+------+-------+
| 1 | w |
| 1 | x |
| 2 | y |
| 3 | z |
+------+-------+
Now, I want to create a new column in df1 that contains all key values denoted in df2. So, I aim for the result:
+------+---------+---------+---------+
|ID | Name1 | Name2 | keys |
+------+---------+---------+---------+
| 1 | A | B | w,x |
| 2 | C | D | y |
| 3 | E | F | z |
+------+---------+---------+---------+
Ultimately, I want to find a solution for an arbitrary number of keys.
My attempt in PySpark:
def get_keys(id):
    x = df2.where(df2.ID == id).select('key')
    return x

df_keys = df1.withColumn("keys", get_keys(col('ID')))
In the above code, x is a DataFrame. Since the second argument of the .withColumn function needs to be a Column, I am not sure how to convert x correctly.
You are looking for the collect_list function.
from pyspark.sql.functions import collect_list
df3 = df1.join(df2, df1.ID == df2.ID).drop(df2.ID)
df3.groupBy('ID','Name1','Name2').agg(collect_list('key').alias('keys')).show()
#+---+-----+-----+------+
#| ID|Name1|Name2|  keys|
#+---+-----+-----+------+
#|  1|    A|    B|[w, x]|
#|  3|    E|    F|   [z]|
#|  2|    C|    D|   [y]|
#+---+-----+-----+------+
If you want only unique keys, you can use collect_set instead.
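As a sketch, the only change from the snippet above would be the aggregate function:
from pyspark.sql.functions import collect_set

df3.groupBy('ID', 'Name1', 'Name2').agg(collect_set('key').alias('keys')).show()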
