Splitting rows of a dataset depending on a column values - apache-spark

I am using Spark 3.1.1 along with JAVA 8, i am trying to split a dataset<Row> according to values of one of its numerical columns (greater or lesser than a threshold), the split is possible only if some string column values of the rows are identical : i am trying something like this :
Iterator<Row> iter2 = partition.toLocalIterator();
while (iter2.hasNext()) {
Row item = iter2.next();
//getColVal is a function that gets the value given a column
String numValue = getColVal(item, dim);
if (Integer.parseInt(numValue) < threshold)
pl.add(item);
else
pr.add(item);
But how to check, beforehand splitting, if some other column values (string) of the concerned rows are identical in order to perform the split ?
PS : i tried to groupBy the columns before splitting like so :
Dataset<Row> newDataset=oldDataset.groupBy("col1","col4").agg(col("col1"));
but it's not working
Thank you for the help
EDIT :
A sample dataset which i want to split is :
abc,9,40,A
abc,7,50,A
cde,4,20,B
cde,3,25,B
If the threshold is 30 then the two first and last rows will form two datasets because the first and fourth columns of these are identical; otherwise the split is not possible.
EDIT : the resulting outpout would be
abc,9,40,A
abc,7,50,A
cde,4,20,B
cde,3,25,B

I mainly use pyspark but you could adapt to your environment
## could add some conditional logic or just always output 2 data frames where
## one would be empty
print("pdf - two dataframe")
## create pandas dataframe
pdf = pd.DataFrame({'col1':['abc','abc','cde','cde'],'col2':[9,7,4,3],'col3':[40,50,20,25],'col4':['A','A','B','B']})
print( pdf )
## move it to spark
print("sdf")
sdf = spark.createDataFrame(pdf)
sdf.show()
# +----+----+----+----+
# |col1|col2|col3|col4|
# +----+----+----+----+
# | abc| 9| 40| A|
# | abc| 7| 50| A|
# | cde| 4| 20| B|
# | cde| 3| 25| B|
# +----+----+----+----+
## filter
pl = sdf.filter('col3 <= 30')\
.groupBy("col1","col4").agg(F.sum('col2').alias('sumC2'))
pr = sdf.filter('col3 > 30')\
.groupBy("col1","col4").agg(F.sum('col2').alias('sumC2'))
print("pl")
pl.show()
# +----+----+-----+
# |col1|col4|sumC2|
# +----+----+-----+
# | cde| B| 7|
# +----+----+-----+
print("pr")
pr.show()
# +----+----+-----+
# |col1|col4|sumC2|
# +----+----+-----+
# | abc| A| 16|
# +----+----+-----+
print("pdf - one dataframe")
## create pandas dataframe
pdf = pd.DataFrame({'col1':['abc','abc','cde','cde'],'col2':[9,7,4,3],'col3':[11,29,20,25],'col4':['A','A','B','B']})
print( pdf )
## move it to spark
print("sdf")
sdf = spark.createDataFrame(pdf)
sdf.show()
# +----+----+----+----+
# |col1|col2|col3|col4|
# +----+----+----+----+
# | abc| 9| 11| A|
# | abc| 7| 29| A|
# | cde| 4| 20| B|
# | cde| 3| 25| B|
# +----+----+----+----+
pl = sdf.filter('col3 <= 30')\
.groupBy("col1","col4").agg( F.sum('col2').alias('sumC2') )
pr = sdf.filter('col3 > 30')\
.groupBy("col1","col4").agg(F.sum('col2').alias('sumC2'))
print("pl")
pl.show()
# +----+----+-----+
# |col1|col4|sumC2|
# +----+----+-----+
# | abc| A| 16|
# | cde| B| 7|
# +----+----+-----+
print("pr")
pr.show()
# +----+----+-----+
# |col1|col4|sumC2|
# +----+----+-----+
# +----+----+-----+
Filtering by a dynamic mean
print("pdf - filter by mean")
## create pandas dataframe
pdf = pd.DataFrame({'col1':['abc','abc','cde','cde'],'col2':[9,7,4,3],'col3':[40,50,20,25],'col4':['A','A','B','B']})
print( pdf )
## move it to spark
print("sdf")
sdf = spark.createDataFrame(pdf)
sdf.show()
# +----+----+----+----+
# |col1|col2|col3|col4|
# +----+----+----+----+
# | abc| 9| 40| A|
# | abc| 7| 50| A|
# | cde| 4| 20| B|
# | cde| 3| 25| B|
# +----+----+----+----+
w = Window.partitionBy("col1").orderBy("col2")
## add another column, the mean of col2 partitioned by col1
sdf = sdf.withColumn('mean_c2', F.mean('col2').over(w))
## filter by the dynamic mean
pr = sdf.filter('col2 > mean_c2')
pr.show()
# +----+----+----+----+-------+
# |col1|col2|col3|col4|mean_c2|
# +----+----+----+----+-------+
# | cde| 4| 20| B| 3.5|
# | abc| 9| 40| A| 8.0|
# +----+----+----+----+-------+

Related

Trying to regex match null in Spark dataframe

I have issue with null on Spark DF which I want to overcome. Let's say I have this spark DF:
spark_df.show()
# Output
# +----+----+
# |keys|vals|
# +----+----+
# | k1| 0|
# | k2| 1|
# | k3|null|
# +----+----+
And I have these functions:
def add_col_by_vals():
df_with_col = spark_df.withColumn('target_col', get_column())
return df_with_col
def get_column():
return ~f.lower(f.col("vals")).rlike("0|null|None")
The expected result is:
df_after_add_col = add_col_by_vals()
df_after_add_col.show()
# Output
# +----+----+----------+
# |keys|vals|target_col|
# +----+----+----------+
# | k1| 0| false|
# | k2| 1| true|
# | k3|null| false|
# +----+----+----------+
The actual result is:
df_after_add_col = add_col_by_vals()
df_after_add_col.show()
# Output
# +----+----+----------+
# |keys|vals|target_col|
# +----+----+----------+
# | k1| 0| false|
# | k2| 1| true|
# | k3|null| null|
# +----+----+----------+
I understand there is a problem with null. I don't want to change the DF at all, the only place I can change something in the code is get_column function.
How can I overcome this issue?
Regex works with strings (null is not a string). If you provide null, you will get null. You will have to use another function to deal with nulls. You could use coalesce. It will return False if the result of regex is null.
from pyspark.sql import functions as F
spark_df = spark.createDataFrame([("k1", 0), ("k2", 1), ("k3", None)], ["keys", "vals"])
def add_col_by_vals():
df_with_col = spark_df.withColumn('target_col', get_column())
return df_with_col
def get_column():
return F.coalesce(~F.lower("vals").rlike("0"), F.lit(False))
add_col_by_vals().show()
# +----+----+----------+
# |keys|vals|target_col|
# +----+----+----------+
# | k1| 0| false|
# | k2| 1| true|
# | k3|null| false|
# +----+----+----------+

Sample random n rows from each group

I would like to sample at most n rows from each group in the data, where the grouping is defined by a single column. There are many answers for selecting the top n rows, but I dont't need order and am not sure whether ordering would not introduce unnecessary shuffling.
I have looked at
sampleBy(), but I don't need a fraction but a maximal absolute amount of rows.
Windows, but they always seem to imply ordering the values
groupBy, but was not able to construct something of the available aggregate functions.
Code example:
data = [('A',1), ('B',1), ('C',2)]
columns = ["field_1","field_2"]
df = spark.createDataFrame(data=data, schema = columns)
Where I would be looking for a pandas-like
df.groupby('field_2').head(1)
I would also be happy with a suitable SQL expression.
Otherwise if there is no better performance than using
Window.partitionBy(df['field_2']).orderBy('field_1')...
then I'd also be happy to know that.
Thanks!
The below would work if a sort isn't required, and it uses RDD transformations.
For a dataframe like the following
sdf.show()
# +-----------+-------+--------+----+
# |bvdidnumber|dt_year|dt_rfrnc|goal|
# +-----------+-------+--------+----+
# | 1| 2020| 202006| 0|
# | 1| 2020| 202012| 1|
# | 1| 2020| 202012| 0|
# | 1| 2021| 202103| 0|
# | 1| 2021| 202106| 0|
# | 1| 2021| 202112| 1|
# | 2| 2020| 202006| 0|
# | 2| 2020| 202012| 0|
# | 2| 2020| 202012| 1|
# | 2| 2021| 202103| 0|
# | 2| 2021| 202106| 0|
# | 2| 2021| 202112| 1|
# +-----------+-------+--------+----+
I created a function that can be shipped to all executors, and then used with flatMapValues() in RDD transformation.
# best to ship this function to all executors for optimum performance
def get_n_from_group(group, num_recs):
"""
get `N` number of sample records
"""
res = []
i = 0
for rec in group:
res.append(rec)
i = i + 1
if i == num_recs:
break
return res
rdd = sdf.rdd. \
groupBy(lambda x: x.bvdidnumber). \
flatMapValues(lambda k: get_n_from_group(k, 2)) # 2 records only
top_n_sdf = spark.createDataFrame(rdd.values(), schema=sdf.schema)
top_n_sdf.show()
# +-----------+-------+--------+----+
# |bvdidnumber|dt_year|dt_rfrnc|goal|
# +-----------+-------+--------+----+
# | 1| 2020| 202006| 0|
# | 1| 2020| 202012| 1|
# | 2| 2020| 202006| 0|
# | 2| 2020| 202012| 0|
# +-----------+-------+--------+----+

Pyspark Dataframe - Map Strings to Numerics

I'm looking for a way to convert a given column of data, in this case strings, and convert them into a numeric representation. For example, I have a dataframe of strings with values:
+------------+
| level |
+------------+
| Medium|
| Medium|
| Medium|
| High|
| Medium|
| Medium|
| Low|
| Low|
| High|
| Low|
| Low|
And I want to create a new column where these values get converted to:
"High"= 1, "Medium" = 2, "Low" = 3
+------------+
| level_num|
+------------+
| 2|
| 2|
| 2|
| 1|
| 2|
| 2|
| 3|
| 3|
| 1|
| 3|
| 3|
I've tried defining a function and doing a foreach over the dataframe like so:
def f(x):
if(x == 'Medium'):
return 2
elif(x == "Low"):
return 3
else:
return 1
a = df.select("level").rdd.foreach(f)
But this returns a "None" type. Thoughts? Thanks for the help as always!
You can certainly do this along the lines you have been trying - you'll need a map operation instead of foreach.
spark.version
# u'2.2.0'
from pyspark.sql import Row
# toy data:
df = spark.createDataFrame([Row("Medium"),
Row("High"),
Row("High"),
Row("Low")
],
["level"])
df.show()
# +------+
# | level|
# +------+
# |Medium|
# | High|
# | High|
# | Low|
# +------+
Using your f(x) with these toy data, we get:
df.select("level").rdd.map(lambda x: f(x[0])).collect()
# [2, 1, 1, 3]
And one more map will give you a dataframe:
df.select("level").rdd.map(lambda x: f(x[0])).map(lambda x: Row(x)).toDF(["level_num"]).show()
# +---------+
# |level_num|
# +---------+
# | 2|
# | 1|
# | 1|
# | 3|
# +---------+
But it would be preferable to do it without invoking a temporary intermediate RDD, using the dataframe function when instead of your f(x):
from pyspark.sql.functions import col, when
df.withColumn("level_num", when(col("level")=='Medium', 2).when(col("level")=='Low', 3).otherwise(1)).show()
# +------+---------+
# | level|level_num|
# +------+---------+
# |Medium| 2|
# | High| 1|
# | High| 1|
# | Low| 3|
# +------+---------+
An alternative would be to use a Python dictionary to represent the map for Spark >= 2.4.
Then use array and map_from_arrays Spark functions to implement a key-based search mechanism for filling in the level_num field:
from pyspark.sql.functions import lit, map_from_arrays, array
_dict = {"High":1, "Medium":2, "Low":3}
df = spark.createDataFrame([
["Medium"], ["Medium"], ["Medium"], ["High"], ["Medium"], ["Medium"], ["Low"], ["Low"], ["High"]
], ["level"])
keys = array(list(map(lit, _dict.keys()))) # or alternatively [lit(k) for k in _dict.keys()]
values = array(list(map(lit, _dict.values())))
_map = map_from_arrays(keys, values)
df.withColumn("level_num", _map.getItem(col("level"))) # or element_at(_map, col("level"))
# +------+---------+
# | level|level_num|
# +------+---------+
# |Medium| 2|
# |Medium| 2|
# |Medium| 2|
# | High| 1|
# |Medium| 2|
# |Medium| 2|
# | Low| 3|
# | Low| 3|
# | High| 1|
# +------+---------+

Saving iteratively to a new DataFrame in Pyspark

I'm performing computations based on 3 different PySpark DataFrames.
This script works in the sense that it performs the computation as it should, however, I struggle with working properly with the results of said computation.
import sys
import numpy as np
from pyspark import SparkConf, SparkContext, SQLContext
sc = SparkContext("local")
sqlContext = SQLContext(sc)
# Dummy Data
df = sqlContext.createDataFrame([[0,1,0,0,0],[1,1,0,0,1],[0,0,1,0,1],[1,0,1,1,0],[1,1,0,0,0]], ['p1', 'p2', 'p3', 'p4', 'p5'])
df.show()
+---+---+---+---+---+
| p1| p2| p3| p4| p5|
+---+---+---+---+---+
| 0| 1| 0| 0| 0|
| 1| 1| 0| 0| 1|
| 0| 0| 1| 0| 1|
| 1| 0| 1| 1| 0|
| 1| 1| 0| 0| 0|
+---+---+---+---+---+
# Values
values = sqlContext.createDataFrame([(0,1,'p1'),(None,1,'p2'),(0,0,'p3'),(None,0, 'p4'),(1,None,'p5')], ('f1', 'f2','index'))
values.show()
+----+----+-----+
| f1| f2|index|
+----+----+-----+
| 0| 1| p1|
|null| 1| p2|
| 0| 0| p3|
|null| 0| p4|
| 1|null| p5|
+----+----+-----+
# Weights
weights = sqlContext.createDataFrame([(4,3,'p1'),(None,1,'p2'),(2,2,'p3'),(None, 3, 'p4'),(3,None,'p5')], ('f1', 'f2','index'))
weights.show()
+----+----+-----+
| f1| f2|index|
+----+----+-----+
| 4| 3| p1|
|null| 1| p2|
| 2| 2| p3|
|null| 3| p4|
| 3|null| p5|
+----+----+-----+
# Function: it sums the vector W for the values of Row equal to the value of V and then divide by the length of V.
# If there a no similarities between Row and V outputs 0
def W_sum(row,v,w):
if len(w[row==v])>0:
return float(np.sum(w[row==v])/len(w))
else:
return 0.0
For each of the columns and for each row in Data, the above function is applied.
# We iterate over the columns of Values (except the last one called index)
for val in values.columns[:-1]:
# we filter the data to work only with the columns that are defined for the selected Value
defined_col = [i[0] for i in values.where(F.col(val) >= 0).select(values.index).collect()]
# we select only the useful columns
df_select= df.select(defined_col)
# we retrieve the reference value and weights
V = np.array(values.where(values.index.isin(defined_col)).select(val).collect()).flatten()
W = np.array(weights.where(weights.index.isin(defined_col)).select(val).collect()).flatten()
W_sum_udf = F.udf(lambda row: W_sum(row, V, W), FloatType())
df_select.withColumn(val, W_sum_udf(F.array(*(F.col(x) for x in df_select.columns))))
This gives :
+---+---+---+---+---+---+
| p1| p2| p3| p4| p5| f1|
+---+---+---+---+---+---+
| 0| 1| 0| 0| 0|2.0|
| 1| 1| 0| 0| 1|1.0|
| 0| 0| 1| 0| 1|2.0|
| 1| 0| 1| 1| 0|0.0|
| 1| 1| 0| 0| 0|0.0|
+---+---+---+---+---+---+
It added the column to the sliced DataFrame as I asked it to. The problem is that I would rather collect the data into a new one that I could access at the end to consult the results.
It it possible to grow (somewhat efficiently) a DataFrame in PySpark as I would with pandas?
Edit to make my goal clearer:
Ideally I would get a DataFrame with the just the computed columns, like this:
+---+---+
| f1| f2|
+---+---+
|2.0|1.0|
|1.0|2.0|
|2.0|0.0|
|0.0|0.0|
|0.0|2.0|
+---+---+
There are some issues with your question...
First, your for loop will produce an error, since df_select in the last line is nowhere defined; there is also no assignment at the end (what does it produce?).
Assuming that df_select is actually your subsubsample dataframe, defined some lines before, and that your last line is something like
new_df = subsubsample.withColumn(val, W_sum_udf(F.array(*(F.col(x) for x in subsubsample.columns))))
then your problem starts getting more clear. Since
values.columns[:-1]
# ['f1', 'f2']
the result of the whole loop would be just
+---+---+---+---+---+
| p1| p2| p3| p4| f2|
+---+---+---+---+---+
| 0| 1| 0| 0|1.0|
| 1| 1| 0| 0|2.0|
| 0| 0| 1| 0|0.0|
| 1| 0| 1| 1|0.0|
| 1| 1| 0| 0|2.0|
+---+---+---+---+---+
i.e. with only the column f2 included (natural, since the results with f1 are simply overwritten).
Now, as I said, assuming that the situation is like this, and that your problem is actually how to have both columns f1 & f2 together rather in different dataframes, you can just forget subsubsample and append columns to your initial df, possibly dropping afterwards the unwanted ones:
init_cols = df.columns
init_cols
# ['p1', 'p2', 'p3', 'p4', 'p5']
new_df = df
for val in values.columns[:-1]:
# we filter the data to work only with the columns that are defined for the selected Value
defined_col = [i[0] for i in values.where(F.col(val) >= 0).select(values.index).collect()]
# we retrieve the reference value and weights
V = np.array(values.where(values.index.isin(defined_col)).select(val).collect()).flatten()
W = np.array(weights.where(weights.index.isin(defined_col)).select(val).collect()).flatten()
W_sum_udf = F.udf(lambda row: W_sum(row, V, W), FloatType())
new_df = new_df.withColumn(val, W_sum_udf(F.array(*(F.col(x) for x in defined_col)))) # change here
# drop initial columns:
for i in init_cols:
new_df = new_df.drop(i)
The resulting new_df will be:
+---+---+
| f1| f2|
+---+---+
|2.0|1.0|
|1.0|2.0|
|2.0|0.0|
|0.0|0.0|
|0.0|2.0|
+---+---+
UPDATE (after comment): To force the division in your W_sum function to be a float, use:
from __future__ import division
new_df now will be:
+---------+----+
| f1| f2|
+---------+----+
| 2.0| 1.5|
|1.6666666|2.25|
|2.3333333|0.75|
| 0.0|0.75|
|0.6666667|2.25|
+---------+----+
with f2 exactly as it should be according to your comment.

Spark: Replace missing values with values from another column

Suppose you have a Spark dataframe containing some null values, and you would like to replace the values of one column with the values from another if present. In Python/Pandas you can use the fillna() function to do this quite nicely:
df = spark.createDataFrame([('a', 'b', 'c'),(None,'e', 'f'),(None,None,'i')], ['c1','c2','c3'])
DF = df.toPandas()
DF['c1'].fillna(DF['c2']).fillna(DF['c3'])
How can this be done using Pyspark?
You need to use the coalesce function :
cDf = spark.createDataFrame([(None, None), (1, None), (None, 2)], ("a", "b"))
cDF.show()
# +----+----+
# | a| b|
# +----+----+
# |null|null|
# | 1|null|
# |null| 2|
# +----+----+
cDf.select(coalesce(cDf["a"], cDf["b"])).show()
# +--------------+
# |coalesce(a, b)|
# +--------------+
# | null|
# | 1|
# | 2|
# +--------------+
cDf.select('*', coalesce(cDf["a"], lit(0.0))).show()
# +----+----+----------------+
# | a| b|coalesce(a, 0.0)|
# +----+----+----------------+
# |null|null| 0.0|
# | 1|null| 1.0|
# |null| 2| 0.0|
# +----+----+----------------+
You can also apply coalesce on multiple columns :
cDf.select(coalesce(cDf["a"], cDf["b"], lit(0))).show()
# ...
This example is taken from the pyspark.sql API documentation.

Resources