Randomly Split DataFrame by Unique Values in One Column - apache-spark

I have a pyspark DataFrame like the following:
+--------+--------+-----------+
| col1 | col2 | groupId |
+--------+--------+-----------+
| val11 | val21 | 0 |
| val12 | val22 | 1 |
| val13 | val23 | 2 |
| val14 | val24 | 0 |
| val15 | val25 | 1 |
| val16 | val26 | 1 |
+--------+--------+-----------+
Each row has a groupId and multiple rows can have the same groupId.
I want to randomly split this data into two datasets. But all the data having a particular groupId must be in one of the splits.
This means that if d1.groupId = d2.groupId, then d1 and d2 are in the same split.
For example:
# Split 1:
+--------+--------+-----------+
| col1 | col2 | groupId |
+--------+--------+-----------+
| val11 | val21 | 0 |
| val13 | val23 | 2 |
| val14 | val24 | 0 |
+--------+--------+-----------+
# Split 2:
+--------+--------+-----------+
| col1 | col2 | groupId |
+--------+--------+-----------+
| val12 | val22 | 1 |
| val15 | val25 | 1 |
| val16 | val26 | 1 |
+--------+--------+-----------+
What is a good way to do this in PySpark? Can I use the randomSplit method somehow?

You can use randomSplit to split just the distinct groupIds, and then use the results to split the source DataFrame using join.
For example:
split1, split2 = df.select("groupId").distinct().randomSplit(weights=[0.5, 0.5], seed=0)
split1.show()
#+-------+
#|groupId|
#+-------+
#| 1|
#+-------+
split2.show()
#+-------+
#|groupId|
#+-------+
#| 0|
#| 2|
#+-------+
Now join these back to the original DataFrame:
df1 = df.join(split1, on="groupId", how="inner")
df2 = df.join(split2, on="groupId", how="inner")
df1.show()
#+-------+-----+-----+
#|groupId| col1| col2|
#+-------+-----+-----+
#| 1|val12|val22|
#| 1|val15|val25|
#| 1|val16|val26|
#+-------+-----+-----+
df2.show()
#+-------+-----+-----+
#|groupId| col1| col2|
#+-------+-----+-----+
#| 0|val11|val21|
#| 0|val14|val24|
#| 2|val13|val23|
#+-------+-----+-----+
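Since split1 and split2 only contain the distinct groupId values, they are usually small, so the joins above could equally be written with an explicit broadcast hint. A minimal sketch, assuming the same names as above:
from pyspark.sql.functions import broadcast
# broadcast the small split DataFrames so the join with df avoids a full shuffle
df1 = df.join(broadcast(split1), on="groupId", how="inner")
df2 = df.join(broadcast(split2), on="groupId", how="inner")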

Related

How to map one column to multiple binary columns in Spark?

This might be related to pivoting, but I am not sure. Basically, what I want to achieve is the following binary transformation:
+----+-----+
| C1 | C2  |
+----+-----+
| A  | xxx |
| B  | yyy |
| A  | yyy |
| B  | www |
| B  | xxx |
| A  | zzz |
| A  | xxx |
| A  | yyy |
+----+-----+
to
+----+-----+-----+-----+-----+
| C1 | www | xxx | yyy | zzz |
+----+-----+-----+-----+-----+
| A  |  0  |  1  |  1  |  1  |
| B  |  1  |  1  |  1  |  0  |
+----+-----+-----+-----+-----+
How does one achieve this in PySpark? Presence is 1 and absence is 0.
Yes, you will need pivot. For the aggregation, in your case it is simplest to use F.first(F.lit(1)), and then replace the resulting nulls with 0 using df.fillna(0).
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('A', 'xxx'),
     ('B', 'yyy'),
     ('A', 'yyy'),
     ('B', 'www'),
     ('B', 'xxx'),
     ('A', 'zzz'),
     ('A', 'xxx'),
     ('A', 'yyy')],
    ['C1', 'C2'])

df = df.groupBy('C1').pivot('C2').agg(F.first(F.lit(1)))
df = df.fillna(0)
df.show()
# +---+---+---+---+---+
# | C1|www|xxx|yyy|zzz|
# +---+---+---+---+---+
# | B| 1| 1| 1| 0|
# | A| 0| 1| 1| 1|
# +---+---+---+---+---+
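If the distinct C2 values are known up front, they can be passed to pivot explicitly, which lets Spark skip the extra job it otherwise runs to discover them. The pivot line above would then read, as a sketch:
# passing the pivot values avoids a separate pass over the data to collect them
df = df.groupBy('C1').pivot('C2', ['www', 'xxx', 'yyy', 'zzz']).agg(F.first(F.lit(1)))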

Dynamically add padding Zeros

mock_data = [('TYCO', ' 1303','13'),('EMC', ' 120989 ','123'), ('VOLVO ', '102329 ','1234'),('BMW', '1301571345 ',' '),('FORD', '004','21212')]
df = spark.createDataFrame(mock_data, ['col1', 'col2','col3'])
+-------+------------+-----+
| col1 | col2| col3|
+-------+------------+-----+
| TYCO| 1303| 13|
| EMC| 120989 | 123|
|VOLVO | 102329 | 1234|
| BMW|1301571345 | |
| FORD| 004|21212|
+-------+------------+-----+
Trim col2 and, based on the remaining length (10 minus the length of trimmed col2), dynamically left-pad col3 with zeros, then concatenate col2 and col3.
from pyspark.sql.functions import length, trim
df2 = df.withColumn('length_col2', 10 - length(trim(df.col2)))
+-------+------------+-----+-----------+
| col1| col2| col3|length_col2|
+-------+------------+-----+-----------+
| TYCO| 1303| 13| 6|
| EMC| 120989 | 123| 4|
|VOLVO | 102329 | 1234| 4|
| BMW|1301571345 | | 0|
| FORD| 004|21212| 7|
+-------+------------+-----+-----------+
expected output
+-------+------------+-----+----------+
|   col1|        col2| col3|    output|
+-------+------------+-----+----------+
|   TYCO|        1303|   13|1303000013|
|    EMC|     120989 |  123|1209890123|
|VOLVO  |     102329 | 1234|1023291234|
|    BMW| 1301571345 |     |1301571345|
|   FORD|         004|21212|0040021212|
+-------+------------+-----+----------+
What you are looking for is the rpad function (available both in Spark SQL and as pyspark.sql.functions.rpad), as listed here: https://spark.apache.org/docs/2.3.0/api/sql/index.html
See the solution below:
%pyspark
mock_data = [('TYCO', ' 1303','13'),('EMC', ' 120989 ','123'), ('VOLVO ', '102329 ','1234'),('BMW', '1301571345 ',' '),('FORD', '004','21212')]
df = spark.createDataFrame(mock_data, ['col1', 'col2','col3'])
df.createOrReplaceTempView("input_df")
spark.sql("SELECT *, concat(rpad(trim(col2),10,'0') , col3) as OUTPUT from input_df").show(20,False)
And the result:
+-------+------------+-----+---------------+
|col1 |col2 |col3 |OUTPUT |
+-------+------------+-----+---------------+
|TYCO | 1303 |13 |130300000013 |
|EMC | 120989 |123 |1209890000123 |
|VOLVO |102329 |1234 |10232900001234 |
|BMW |1301571345 | |1301571345 |
|FORD |004 |21212|004000000021212|
+-------+------------+-----+---------------+
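Note that the query above pads col2 to a fixed width of 10 before appending col3, so its results are longer than the 10-character values in the expected output. If the goal is a total width of 10 (trimmed col2 on the left, col3 on the right, zeros in between), a sketch using lpad instead, assuming the same input_df view registered above:
# left-pad trimmed col3 with zeros so trimmed col2 plus padded col3 is 10 characters wide
spark.sql("""
    SELECT *,
           concat(trim(col2), lpad(trim(col3), 10 - length(trim(col2)), '0')) AS output
    FROM input_df
""").show(20, False)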

Sum of only positive values of a column when aggregating a Spark Dataset

I have a sample dataset as below
+------+------+--------+
| col1 | col2 | values |
+------+------+--------+
| abc | def | -4 |
+------+------+--------+
| abc | def | 4 |
+------+------+--------+
| abc | efg | 8 |
+------+------+--------+
When aggregating the above dataset, I need the sum of only the positive values when grouping by col1, col2, as below:
+------+------+--------+
| col1 | col2 | values |
+------+------+--------+
| abc | def | 4 |
+------+------+--------+
| abc | efg | 8 |
+------+------+--------+
Currently I get:
+------+------+--------+
| col1 | col2 | values |
+------+------+--------+
| abc | def | 0 |
+------+------+--------+
| abc | efg | 8 |
+------+------+--------+
I would not filter before aggregation, because you may need the other rows for a different aggregation. Just formulate your requirement directly in the agg clause:
df
  .groupBy($"col1", $"col2")
  .agg(
    sum(when($"values" > 0, $"values")).as("values")  // sum of positive values
  )
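A direct PySpark translation of the same conditional sum, as a sketch with the column names from the question:
import pyspark.sql.functions as F

df.groupBy('col1', 'col2').agg(
    F.sum(F.when(F.col('values') > 0, F.col('values'))).alias('values')
).show()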
You can replace negative values with zero, then do the aggregation. The following is a demo in PySpark; it should be similar in other languages:
import pyspark.sql.functions as f

df.groupBy('col1', 'col2').agg(
    f.sum(
        f.when(f.col('values') < 0, 0).otherwise(f.col('values'))
    ).alias('values')
).show()
+----+----+------+
|col1|col2|values|
+----+----+------+
| abc| def| 4|
| abc| efg| 8|
+----+----+------+
You can filter the Dataset before the aggregation:
df
  .filter(col("values") > 0)
  .groupBy("col1", "col2")
  .agg(sum("values").as("values"))
  .show()
+----+----+------+
|col1|col2|values|
+----+----+------+
| abc| def| 4|
| abc| efg| 8|
+----+----+------+
val sparkSession = SparkSession.builder.config("spark.master","local").getOrCreate()
import sparkSession.implicits._
val someDF = Seq(("abc","def",-4),("abc","def",4),("abc","efg",8)).toDF("col1","col2","values")
someDF.filter("values>0").groupBy("col1","col2").sum("values").show()
+----+----+-----------+
|col1|col2|sum(values)|
+----+----+-----------+
| abc| def| 4|
| abc| efg| 8|
+----+----+-----------+

Expand last value of string column to groupby Pandas Dataframe

I have the following Pandas dataframe:
+--------+----+
|id |name|
+--------+----+
| 1| |
| 1| |
| 1| |
| 1|Carl|
| 2| |
| 2| |
| 2|John|
+--------+----+
What I want to achieve is to expand the last value of each group to the rest of the group:
+--------+----+
|id |name|
+--------+----+
| 1|Carl|
| 1|Carl|
| 1|Carl|
| 1|Carl|
| 2|John|
| 2|John|
| 2|John|
+--------+----+
It looks pretty easy, but I am struggling to achieve it because of the column's type.
What I've tried so far is:
df['name'] = df.groupby('id')['name'].transform('last')
This works for int or float columns, but not for string columns.
I am getting the following error:
No numeric types to aggregate
Thanks in advance.
Edit
bfill() is not valid because I can have the following:
+--------+----+
|id |name|
+--------+----+
| 1| |
| 1| |
| 1| |
| 1|Carl|
| 2| |
| 2| |
| 2| |
| 3| |
| 3| |
| 3|John|
+--------+----+
In this case, I want id = 2 to remain NaN, but with bfill() it would end up as John, which is incorrect. The desired output would be:
+--------+----+
|id |name|
+--------+----+
| 1|Carl|
| 1|Carl|
| 1|Carl|
| 1|Carl|
| 2| |
| 2| |
| 2| |
| 3|John|
| 3|John|
| 3|John|
+--------+----+
If the empty values are NaN, you could try bfill:
df['name'] = df['name'].bfill()
If not, replace the empty strings with NaN first.
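Given the edit above, a plain bfill would leak a name across group boundaries; a minimal sketch (with a hypothetical sample matching the edited question) that back-fills within each id only:
import numpy as np
import pandas as pd

# hypothetical sample matching the edited question
df = pd.DataFrame({"id": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "name": ["", "", "", "Carl", "", "", "", "", "", "John"]})

# empty strings -> NaN, then back-fill within each id so a group without a
# name (id = 2) stays NaN instead of borrowing "John" from the next group
df["name"] = df["name"].replace("", np.nan)
df["name"] = df.groupby("id")["name"].bfill()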
Try this.
import pandas as pd
import numpy as np

dff = pd.DataFrame({"id": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                    "name": ["", "", "", "car1", "", "", "", "", "", "john"]})
dff = dff.replace(r'', np.NaN)

def c(x):
    # if the group has at least one non-null value, repeat its string value across the group
    if sum(pd.isnull(x)) != np.size(x):
        l = [v for v in x if type(v) == str]
        return [l[0]] * np.size(x)
    else:
        return [""] * np.size(x)

df = dff.groupby('id')["name"].apply(lambda x: c(list(x)))
df = df.to_frame().reset_index()
df = df.set_index('id').name.apply(pd.Series).stack().reset_index(level=0).rename(columns={0: 'name'})
output
id name
0 1 car1
1 1 car1
2 1 car1
3 1 car1
0 2
1 2
2 2
0 3 john
1 3 john
2 3 john

Combine dataframes columns consisting of multiple values - Spark

I have two Spark dataframes that share the same ID column:
df1:
+------+---------+---------+
|ID | Name1 | Name2 |
+------+---------+---------+
| 1 | A | B |
| 2 | C | D |
| 3 | E | F |
+------+---------+---------+
df2:
+------+-------+
|ID | key |
+------+-------+
| 1 | w |
| 1 | x |
| 2 | y |
| 3 | z |
+------+-------+
Now, I want to create a new column in df1 that contains all key values denoted in df2. So, I aim for the result:
+------+---------+---------+---------+
|ID | Name1 | Name2 | keys |
+------+---------+---------+---------+
| 1 | A | B | w,x |
| 2 | C | D | y |
| 3 | E | F | z |
+------+---------+---------+---------+
Ultimately, I want to find a solution for an arbitrary number of keys.
My attempt in PySpark:
def get_keys(id):
    x = df2.where(df2.ID == id).select('key')
    return x

df_keys = df1.withColumn("keys", get_keys(col('ID')))
In the above code, x is a DataFrame. Since the second argument of the .withColumn function needs to be a Column, I am not sure how to convert x correctly.
You are looking for the collect_list function.
from pyspark.sql.functions import collect_list
df3 = df1.join(df2, df1.ID == df2.ID).drop(df2.ID)
df3.groupBy('ID','Name1','Name2').agg(collect_list('key').alias('keys')).show()
#+---+-----+-----+------+
#| ID|Name1|Name2|  keys|
#+---+-----+-----+------+
#|  1|    A|    B|[w, x]|
#|  3|    E|    F|   [z]|
#|  2|    C|    D|   [y]|
#+---+-----+-----+------+
If you want only unique keys, you can use collect_set instead.
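As a sketch, the only change from the snippet above would be the aggregate function:
from pyspark.sql.functions import collect_set

df3.groupBy('ID', 'Name1', 'Name2').agg(collect_set('key').alias('keys')).show()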
