Suppose you have a Spark dataframe containing some null values, and you would like to replace the values of one column with the values from another if present. In Python/Pandas you can use the fillna() function to do this quite nicely:
df = spark.createDataFrame([('a', 'b', 'c'),(None,'e', 'f'),(None,None,'i')], ['c1','c2','c3'])
DF = df.toPandas()
DF['c1'].fillna(DF['c2']).fillna(DF['c3'])
How can this be done using Pyspark?
You need to use the coalesce function :
cDf = spark.createDataFrame([(None, None), (1, None), (None, 2)], ("a", "b"))
cDF.show()
# +----+----+
# | a| b|
# +----+----+
# |null|null|
# | 1|null|
# |null| 2|
# +----+----+
cDf.select(coalesce(cDf["a"], cDf["b"])).show()
# +--------------+
# |coalesce(a, b)|
# +--------------+
# | null|
# | 1|
# | 2|
# +--------------+
cDf.select('*', coalesce(cDf["a"], lit(0.0))).show()
# +----+----+----------------+
# | a| b|coalesce(a, 0.0)|
# +----+----+----------------+
# |null|null| 0.0|
# | 1|null| 1.0|
# |null| 2| 0.0|
# +----+----+----------------+
You can also apply coalesce on multiple columns :
cDf.select(coalesce(cDf["a"], cDf["b"], lit(0))).show()
# ...
This example is taken from the pyspark.sql API documentation.
Related
I have to select a column out of two columns which has more data or values in it using PySpark and keep it in my DataFrame.
For example, we have two columns A and B:
In example, the column B has more values so I will keep it in my DF for transformations. Similarly, I would take A, if A had more values. I think we can do it using if else conditions, but I'm not able to get the correct logic.
You could first aggregate the columns (count the values in each). This way you will get just 1 row which you could extract as dictionary using .head().asDict(). Then use Python's max(your_dict, key=your_dict.get) to get the dict's key having the max value (i.e. the name of the column which has maximum number of values). Then just select this column.
Example input:
from pyspark.sql import functions as F
df = spark.createDataFrame([(1, 7), (2, 4), (3, 7), (None, 8), (None, 4)], ['A', 'B'])
df.show()
# +----+---+
# | A| B|
# +----+---+
# | 1| 7|
# | 2| 4|
# | 3| 7|
# |null| 8|
# |null| 4|
# +----+---+
Scalable script using built-in max:
val_cnt = df.agg(*[F.count(c).alias(c) for c in {'A', 'B'}]).head().asDict()
df = df.select(max(val_cnt, key=val_cnt.get))
df.show()
# +---+
# | B|
# +---+
# | 7|
# | 4|
# | 7|
# | 8|
# | 4|
# +---+
Script for just 2 columns (A and B):
head = df.agg(*[F.count(c).alias(c) for c in {'A', 'B'}]).head()
df = df.select('B' if head.B > head.A else 'A')
df.show()
# +---+
# | B|
# +---+
# | 7|
# | 4|
# | 7|
# | 8|
# | 4|
# +---+
Scalable script adjustable to more columns, without built-in max:
val_cnt = df.agg(*[F.count(c).alias(c) for c in {'A', 'B'}]).head().asDict()
key, val = '', -1
for k, v in val_cnt.items():
if v > val:
key, val = k, v
df = df.select(key)
df.show()
# +---+
# | B|
# +---+
# | 7|
# | 4|
# | 7|
# | 8|
# | 4|
# +---+
Create a data frame with the data
df = spark.createDataFrame(data=[(1,7),(2,4),(3,7),(4,8),(5,0),(6,0),(None,3),(None,5),(None,8),(None,4)],schema = ['A','B'])
Define a condition to check for that
from pyspark.sql.functions import *
import pyspark.sql.functions as fx
condition = fx.when((fx.col('A').isNotNull() & (fx.col('A')>fx.col('B'))),fx.col('A')).otherwise(fx.col('B'))
df_1 = df.withColumn('max_value_among_A_and_B',condition)
Print the dataframe
df_1.show()
Please check the below screenshot for details
or
If you want to pick up the whole column just based on the count. you can try this:
from pyspark.sql.functions import *
import pyspark.sql.functions as fx
df = spark.createDataFrame(data=[(1,7),(2,4),(3,7),(4,8),(5,0),(6,0),(None,3),(None,5),(None,8),(None,4)],schema = ['A','B'])
if df.select('A').count() > df.select('B').count():
pickcolumn = 'A'
else:
pickcolumn = 'B'
df_1 = df.withColumn('NewColumnm',col(pickcolumn)).drop('A','B')
df_1.show()
I want to replace negative values in spark dataframe with previous positive values. I am using Spark with Java. In Python pandas we are having ffill() api which will help here to solve this issue but in Java it is getting difficult to resolve. I tried using lead/lag function but till where I can check negative values that I am not sure hence this solution will not work.
You can use a Window Function. Take this df as an exemple:
df = spark.createDataFrame(
[
('2018-03-01','6'),
('2018-03-02','1'),
('2018-03-03','-2'),
('2018-03-04','7'),
('2018-03-05','-3'),
],
["date", "value"]
)\
.withColumn('date', F.col('date').cast('date'))\
.withColumn('value', F.col('value').cast('integer'))
df.show()
# +----------+-----+
# | date|value|
# +----------+-----+
# |2018-03-01| 6|
# |2018-03-02| 1|
# |2018-03-03| -2|
# |2018-03-04| 7|
# |2018-03-05| -3|
# +----------+-----+
Then, you can use create a column with a when and use a Window Function:
from pyspark.sql import Window
window = Window.partitionBy().rowsBetween(Window.unboundedPreceding, Window.currentRow)
df\
.withColumn('if<0_than_null', F.when(F.col('value')<0, F.lit(None)).otherwise(F.col('value')))\
.withColumn('desired_output', F.last('if<0_than_null', ignorenulls=True).over(window))\
.show()
# +----------+-----+--------------+--------------+
# | date|value|if<0_than_null|desired_output|
# +----------+-----+--------------+--------------+
# |2018-03-01| 6| 6| 6|
# |2018-03-02| 1| 1| 1|
# |2018-03-03| -2| null| 1|
# |2018-03-04| 7| 7| 7|
# |2018-03-05| -3| null| 7|
# +----------+-----+--------------+--------------+
I have issue with null on Spark DF which I want to overcome. Let's say I have this spark DF:
spark_df.show()
# Output
# +----+----+
# |keys|vals|
# +----+----+
# | k1| 0|
# | k2| 1|
# | k3|null|
# +----+----+
And I have these functions:
def add_col_by_vals():
df_with_col = spark_df.withColumn('target_col', get_column())
return df_with_col
def get_column():
return ~f.lower(f.col("vals")).rlike("0|null|None")
The expected result is:
df_after_add_col = add_col_by_vals()
df_after_add_col.show()
# Output
# +----+----+----------+
# |keys|vals|target_col|
# +----+----+----------+
# | k1| 0| false|
# | k2| 1| true|
# | k3|null| false|
# +----+----+----------+
The actual result is:
df_after_add_col = add_col_by_vals()
df_after_add_col.show()
# Output
# +----+----+----------+
# |keys|vals|target_col|
# +----+----+----------+
# | k1| 0| false|
# | k2| 1| true|
# | k3|null| null|
# +----+----+----------+
I understand there is a problem with null. I don't want to change the DF at all, the only place I can change something in the code is get_column function.
How can I overcome this issue?
Regex works with strings (null is not a string). If you provide null, you will get null. You will have to use another function to deal with nulls. You could use coalesce. It will return False if the result of regex is null.
from pyspark.sql import functions as F
spark_df = spark.createDataFrame([("k1", 0), ("k2", 1), ("k3", None)], ["keys", "vals"])
def add_col_by_vals():
df_with_col = spark_df.withColumn('target_col', get_column())
return df_with_col
def get_column():
return F.coalesce(~F.lower("vals").rlike("0"), F.lit(False))
add_col_by_vals().show()
# +----+----+----------+
# |keys|vals|target_col|
# +----+----+----------+
# | k1| 0| false|
# | k2| 1| true|
# | k3|null| false|
# +----+----+----------+
I am using Spark 3.1.1 along with JAVA 8, i am trying to split a dataset<Row> according to values of one of its numerical columns (greater or lesser than a threshold), the split is possible only if some string column values of the rows are identical : i am trying something like this :
Iterator<Row> iter2 = partition.toLocalIterator();
while (iter2.hasNext()) {
Row item = iter2.next();
//getColVal is a function that gets the value given a column
String numValue = getColVal(item, dim);
if (Integer.parseInt(numValue) < threshold)
pl.add(item);
else
pr.add(item);
But how to check, beforehand splitting, if some other column values (string) of the concerned rows are identical in order to perform the split ?
PS : i tried to groupBy the columns before splitting like so :
Dataset<Row> newDataset=oldDataset.groupBy("col1","col4").agg(col("col1"));
but it's not working
Thank you for the help
EDIT :
A sample dataset which i want to split is :
abc,9,40,A
abc,7,50,A
cde,4,20,B
cde,3,25,B
If the threshold is 30 then the two first and last rows will form two datasets because the first and fourth columns of these are identical; otherwise the split is not possible.
EDIT : the resulting outpout would be
abc,9,40,A
abc,7,50,A
cde,4,20,B
cde,3,25,B
I mainly use pyspark but you could adapt to your environment
## could add some conditional logic or just always output 2 data frames where
## one would be empty
print("pdf - two dataframe")
## create pandas dataframe
pdf = pd.DataFrame({'col1':['abc','abc','cde','cde'],'col2':[9,7,4,3],'col3':[40,50,20,25],'col4':['A','A','B','B']})
print( pdf )
## move it to spark
print("sdf")
sdf = spark.createDataFrame(pdf)
sdf.show()
# +----+----+----+----+
# |col1|col2|col3|col4|
# +----+----+----+----+
# | abc| 9| 40| A|
# | abc| 7| 50| A|
# | cde| 4| 20| B|
# | cde| 3| 25| B|
# +----+----+----+----+
## filter
pl = sdf.filter('col3 <= 30')\
.groupBy("col1","col4").agg(F.sum('col2').alias('sumC2'))
pr = sdf.filter('col3 > 30')\
.groupBy("col1","col4").agg(F.sum('col2').alias('sumC2'))
print("pl")
pl.show()
# +----+----+-----+
# |col1|col4|sumC2|
# +----+----+-----+
# | cde| B| 7|
# +----+----+-----+
print("pr")
pr.show()
# +----+----+-----+
# |col1|col4|sumC2|
# +----+----+-----+
# | abc| A| 16|
# +----+----+-----+
print("pdf - one dataframe")
## create pandas dataframe
pdf = pd.DataFrame({'col1':['abc','abc','cde','cde'],'col2':[9,7,4,3],'col3':[11,29,20,25],'col4':['A','A','B','B']})
print( pdf )
## move it to spark
print("sdf")
sdf = spark.createDataFrame(pdf)
sdf.show()
# +----+----+----+----+
# |col1|col2|col3|col4|
# +----+----+----+----+
# | abc| 9| 11| A|
# | abc| 7| 29| A|
# | cde| 4| 20| B|
# | cde| 3| 25| B|
# +----+----+----+----+
pl = sdf.filter('col3 <= 30')\
.groupBy("col1","col4").agg( F.sum('col2').alias('sumC2') )
pr = sdf.filter('col3 > 30')\
.groupBy("col1","col4").agg(F.sum('col2').alias('sumC2'))
print("pl")
pl.show()
# +----+----+-----+
# |col1|col4|sumC2|
# +----+----+-----+
# | abc| A| 16|
# | cde| B| 7|
# +----+----+-----+
print("pr")
pr.show()
# +----+----+-----+
# |col1|col4|sumC2|
# +----+----+-----+
# +----+----+-----+
Filtering by a dynamic mean
print("pdf - filter by mean")
## create pandas dataframe
pdf = pd.DataFrame({'col1':['abc','abc','cde','cde'],'col2':[9,7,4,3],'col3':[40,50,20,25],'col4':['A','A','B','B']})
print( pdf )
## move it to spark
print("sdf")
sdf = spark.createDataFrame(pdf)
sdf.show()
# +----+----+----+----+
# |col1|col2|col3|col4|
# +----+----+----+----+
# | abc| 9| 40| A|
# | abc| 7| 50| A|
# | cde| 4| 20| B|
# | cde| 3| 25| B|
# +----+----+----+----+
w = Window.partitionBy("col1").orderBy("col2")
## add another column, the mean of col2 partitioned by col1
sdf = sdf.withColumn('mean_c2', F.mean('col2').over(w))
## filter by the dynamic mean
pr = sdf.filter('col2 > mean_c2')
pr.show()
# +----+----+----+----+-------+
# |col1|col2|col3|col4|mean_c2|
# +----+----+----+----+-------+
# | cde| 4| 20| B| 3.5|
# | abc| 9| 40| A| 8.0|
# +----+----+----+----+-------+
I'm looking for a way to convert a given column of data, in this case strings, and convert them into a numeric representation. For example, I have a dataframe of strings with values:
+------------+
| level |
+------------+
| Medium|
| Medium|
| Medium|
| High|
| Medium|
| Medium|
| Low|
| Low|
| High|
| Low|
| Low|
And I want to create a new column where these values get converted to:
"High"= 1, "Medium" = 2, "Low" = 3
+------------+
| level_num|
+------------+
| 2|
| 2|
| 2|
| 1|
| 2|
| 2|
| 3|
| 3|
| 1|
| 3|
| 3|
I've tried defining a function and doing a foreach over the dataframe like so:
def f(x):
if(x == 'Medium'):
return 2
elif(x == "Low"):
return 3
else:
return 1
a = df.select("level").rdd.foreach(f)
But this returns a "None" type. Thoughts? Thanks for the help as always!
You can certainly do this along the lines you have been trying - you'll need a map operation instead of foreach.
spark.version
# u'2.2.0'
from pyspark.sql import Row
# toy data:
df = spark.createDataFrame([Row("Medium"),
Row("High"),
Row("High"),
Row("Low")
],
["level"])
df.show()
# +------+
# | level|
# +------+
# |Medium|
# | High|
# | High|
# | Low|
# +------+
Using your f(x) with these toy data, we get:
df.select("level").rdd.map(lambda x: f(x[0])).collect()
# [2, 1, 1, 3]
And one more map will give you a dataframe:
df.select("level").rdd.map(lambda x: f(x[0])).map(lambda x: Row(x)).toDF(["level_num"]).show()
# +---------+
# |level_num|
# +---------+
# | 2|
# | 1|
# | 1|
# | 3|
# +---------+
But it would be preferable to do it without invoking a temporary intermediate RDD, using the dataframe function when instead of your f(x):
from pyspark.sql.functions import col, when
df.withColumn("level_num", when(col("level")=='Medium', 2).when(col("level")=='Low', 3).otherwise(1)).show()
# +------+---------+
# | level|level_num|
# +------+---------+
# |Medium| 2|
# | High| 1|
# | High| 1|
# | Low| 3|
# +------+---------+
An alternative would be to use a Python dictionary to represent the map for Spark >= 2.4.
Then use array and map_from_arrays Spark functions to implement a key-based search mechanism for filling in the level_num field:
from pyspark.sql.functions import lit, map_from_arrays, array
_dict = {"High":1, "Medium":2, "Low":3}
df = spark.createDataFrame([
["Medium"], ["Medium"], ["Medium"], ["High"], ["Medium"], ["Medium"], ["Low"], ["Low"], ["High"]
], ["level"])
keys = array(list(map(lit, _dict.keys()))) # or alternatively [lit(k) for k in _dict.keys()]
values = array(list(map(lit, _dict.values())))
_map = map_from_arrays(keys, values)
df.withColumn("level_num", _map.getItem(col("level"))) # or element_at(_map, col("level"))
# +------+---------+
# | level|level_num|
# +------+---------+
# |Medium| 2|
# |Medium| 2|
# |Medium| 2|
# | High| 1|
# |Medium| 2|
# |Medium| 2|
# | Low| 3|
# | Low| 3|
# | High| 1|
# +------+---------+