I have an issue with nulls in a Spark DataFrame that I want to overcome. Let's say I have this Spark DataFrame:
spark_df.show()
# Output
# +----+----+
# |keys|vals|
# +----+----+
# | k1| 0|
# | k2| 1|
# | k3|null|
# +----+----+
And I have these functions:
def add_col_by_vals():
    df_with_col = spark_df.withColumn('target_col', get_column())
    return df_with_col

def get_column():
    return ~f.lower(f.col("vals")).rlike("0|null|None")
The expected result is:
df_after_add_col = add_col_by_vals()
df_after_add_col.show()
# Output
# +----+----+----------+
# |keys|vals|target_col|
# +----+----+----------+
# | k1| 0| false|
# | k2| 1| true|
# | k3|null| false|
# +----+----+----------+
The actual result is:
df_after_add_col = add_col_by_vals()
df_after_add_col.show()
# Output
# +----+----+----------+
# |keys|vals|target_col|
# +----+----+----------+
# | k1| 0| false|
# | k2| 1| true|
# | k3|null| null|
# +----+----+----------+
I understand there is a problem with null. I don't want to change the DF at all; the only place where I can change the code is the get_column function.
How can I overcome this issue?
Regex functions work on strings, and null is not a string: if the input is null, rlike returns null, and negating null still gives null. You need another function to deal with the nulls. You could use coalesce, which returns its first non-null argument, so it falls back to False when the result of the regex is null.
from pyspark.sql import functions as F
spark_df = spark.createDataFrame([("k1", 0), ("k2", 1), ("k3", None)], ["keys", "vals"])
def add_col_by_vals():
    df_with_col = spark_df.withColumn('target_col', get_column())
    return df_with_col

def get_column():
    return F.coalesce(~F.lower("vals").rlike("0"), F.lit(False))
add_col_by_vals().show()
# +----+----+----------+
# |keys|vals|target_col|
# +----+----+----------+
# | k1| 0| false|
# | k2| 1| true|
# | k3|null| false|
# +----+----+----------+
I would like to sample at most n rows from each group in the data, where the grouping is defined by a single column. There are many answers for selecting the top n rows, but I don't need any ordering and am not sure whether ordering would introduce unnecessary shuffling.
I have looked at
sampleBy(), but I need a maximum absolute number of rows, not a fraction.
Windows, but they always seem to imply ordering the values.
groupBy, but I was not able to construct something from the available aggregate functions.
Code example:
data = [('A',1), ('B',1), ('C',2)]
columns = ["field_1","field_2"]
df = spark.createDataFrame(data=data, schema = columns)
Where I would be looking for a pandas-like
df.groupby('field_2').head(1)
I would also be happy with a suitable SQL expression.
Otherwise if there is no better performance than using
Window.partitionBy(df['field_2']).orderBy('field_1')...
then I'd also be happy to know that.
Thanks!
The approach below works if a sort isn't required; it uses RDD transformations.
For a dataframe like the following
sdf.show()
# +-----------+-------+--------+----+
# |bvdidnumber|dt_year|dt_rfrnc|goal|
# +-----------+-------+--------+----+
# | 1| 2020| 202006| 0|
# | 1| 2020| 202012| 1|
# | 1| 2020| 202012| 0|
# | 1| 2021| 202103| 0|
# | 1| 2021| 202106| 0|
# | 1| 2021| 202112| 1|
# | 2| 2020| 202006| 0|
# | 2| 2020| 202012| 0|
# | 2| 2020| 202012| 1|
# | 2| 2021| 202103| 0|
# | 2| 2021| 202106| 0|
# | 2| 2021| 202112| 1|
# +-----------+-------+--------+----+
I created a function that can be shipped to all executors, and then used with flatMapValues() in RDD transformation.
# best to ship this function to all executors for optimum performance
def get_n_from_group(group, num_recs):
    """
    get `N` number of sample records
    """
    res = []
    i = 0
    for rec in group:
        res.append(rec)
        i = i + 1
        if i == num_recs:
            break
    return res

rdd = sdf.rdd. \
    groupBy(lambda x: x.bvdidnumber). \
    flatMapValues(lambda k: get_n_from_group(k, 2))  # 2 records only
top_n_sdf = spark.createDataFrame(rdd.values(), schema=sdf.schema)
top_n_sdf.show()
# +-----------+-------+--------+----+
# |bvdidnumber|dt_year|dt_rfrnc|goal|
# +-----------+-------+--------+----+
# | 1| 2020| 202006| 0|
# | 1| 2020| 202012| 1|
# | 2| 2020| 202006| 0|
# | 2| 2020| 202012| 0|
# +-----------+-------+--------+----+
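The "take the first N records of a group" helper can be expressed more compactly with itertools.islice from the standard library. This is only a sketch of the same idea in plain Python; take_n is a hypothetical name, not part of the answer above.

```python
from itertools import islice

def take_n(group, num_recs):
    """Return at most `num_recs` records from an iterable, in iteration order."""
    # islice stops after num_recs items and does not consume the rest
    return list(islice(group, num_recs))

sample = take_n(iter(["r1", "r2", "r3"]), 2)
print(sample)
```

Under that assumption it could be passed to flatMapValues in the same way, e.g. flatMapValues(lambda k: take_n(k, 2)). Unlike the loop version, it returns an empty list when num_recs is 0.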
I am using Spark 3.1.1 with Java 8. I am trying to split a Dataset<Row> according to the values of one of its numeric columns (greater or less than a threshold). The split is allowed only if the values of certain string columns are identical across the rows. I am trying something like this:
Iterator<Row> iter2 = partition.toLocalIterator();
while (iter2.hasNext()) {
    Row item = iter2.next();
    // getColVal is a function that gets the value given a column
    String numValue = getColVal(item, dim);
    if (Integer.parseInt(numValue) < threshold)
        pl.add(item);
    else
        pr.add(item);
}
But how can I check, before splitting, whether the other (string) column values of the rows concerned are identical, so that the split can be performed?
PS: I tried to groupBy the columns before splitting, like so:
Dataset<Row> newDataset=oldDataset.groupBy("col1","col4").agg(col("col1"));
but it's not working.
Thank you for the help.
EDIT :
A sample dataset which i want to split is :
abc,9,40,A
abc,7,50,A
cde,4,20,B
cde,3,25,B
If the threshold is 30, then the first two and the last two rows will form two datasets, because their first and fourth columns are identical; otherwise the split is not possible.
EDIT: the resulting output would be
abc,9,40,A
abc,7,50,A
cde,4,20,B
cde,3,25,B
I mainly use PySpark, but you can adapt this to your environment.
import pandas as pd
from pyspark.sql import Window, functions as F
## could add some conditional logic or just always output 2 data frames where
## one would be empty
print("pdf - two dataframe")
## create pandas dataframe
pdf = pd.DataFrame({'col1':['abc','abc','cde','cde'],'col2':[9,7,4,3],'col3':[40,50,20,25],'col4':['A','A','B','B']})
print( pdf )
## move it to spark
print("sdf")
sdf = spark.createDataFrame(pdf)
sdf.show()
# +----+----+----+----+
# |col1|col2|col3|col4|
# +----+----+----+----+
# | abc| 9| 40| A|
# | abc| 7| 50| A|
# | cde| 4| 20| B|
# | cde| 3| 25| B|
# +----+----+----+----+
## filter
pl = sdf.filter('col3 <= 30')\
.groupBy("col1","col4").agg(F.sum('col2').alias('sumC2'))
pr = sdf.filter('col3 > 30')\
.groupBy("col1","col4").agg(F.sum('col2').alias('sumC2'))
print("pl")
pl.show()
# +----+----+-----+
# |col1|col4|sumC2|
# +----+----+-----+
# | cde| B| 7|
# +----+----+-----+
print("pr")
pr.show()
# +----+----+-----+
# |col1|col4|sumC2|
# +----+----+-----+
# | abc| A| 16|
# +----+----+-----+
print("pdf - one dataframe")
## create pandas dataframe
pdf = pd.DataFrame({'col1':['abc','abc','cde','cde'],'col2':[9,7,4,3],'col3':[11,29,20,25],'col4':['A','A','B','B']})
print( pdf )
## move it to spark
print("sdf")
sdf = spark.createDataFrame(pdf)
sdf.show()
# +----+----+----+----+
# |col1|col2|col3|col4|
# +----+----+----+----+
# | abc| 9| 11| A|
# | abc| 7| 29| A|
# | cde| 4| 20| B|
# | cde| 3| 25| B|
# +----+----+----+----+
pl = sdf.filter('col3 <= 30')\
.groupBy("col1","col4").agg( F.sum('col2').alias('sumC2') )
pr = sdf.filter('col3 > 30')\
.groupBy("col1","col4").agg(F.sum('col2').alias('sumC2'))
print("pl")
pl.show()
# +----+----+-----+
# |col1|col4|sumC2|
# +----+----+-----+
# | abc| A| 16|
# | cde| B| 7|
# +----+----+-----+
print("pr")
pr.show()
# +----+----+-----+
# |col1|col4|sumC2|
# +----+----+-----+
# +----+----+-----+
Filtering by a dynamic mean
print("pdf - filter by mean")
## create pandas dataframe
pdf = pd.DataFrame({'col1':['abc','abc','cde','cde'],'col2':[9,7,4,3],'col3':[40,50,20,25],'col4':['A','A','B','B']})
print( pdf )
## move it to spark
print("sdf")
sdf = spark.createDataFrame(pdf)
sdf.show()
# +----+----+----+----+
# |col1|col2|col3|col4|
# +----+----+----+----+
# | abc| 9| 40| A|
# | abc| 7| 50| A|
# | cde| 4| 20| B|
# | cde| 3| 25| B|
# +----+----+----+----+
## no orderBy here: with an ordered window the default frame would make this a running mean
w = Window.partitionBy("col1")
## add another column, the mean of col2 partitioned by col1
sdf = sdf.withColumn('mean_c2', F.mean('col2').over(w))
## filter by the dynamic mean
pr = sdf.filter('col2 > mean_c2')
pr.show()
# +----+----+----+----+-------+
# |col1|col2|col3|col4|mean_c2|
# +----+----+----+----+-------+
# | cde| 4| 20| B| 3.5|
# | abc| 9| 40| A| 8.0|
# +----+----+----+----+-------+
I have an id column for each person (rows with the same id belong to one person). I want the following:
The id column is currently not a sequence number; it is a 10-digit value. How can I reset the ids to integers, e.g. 1, 2, 3, 4?
For example:
id col1
12a4 summer
12a4 goest
3b yes
3b No
3b why
4t Hi
Output:
id col1
1 summer
1 goest
2 yes
2 No
2 why
3 Hi
How can I get the data corresponding to id=2?
In the above example:
id col1
2 yes
2 No
2 why
from pyspark.sql import SparkSession
from pyspark.sql import Window, functions as F
spark = SparkSession.builder.getOrCreate()
data = [
('12a4', 'summer'),
('12a4', 'goest'),
('3b', 'yes'),
('3b', 'No'),
('3b', 'why'),
('4t', 'Hi')
]
df1 = spark.createDataFrame(data, ['id', 'col1'])
df1.show()
# +----+------+
# | id| col1|
# +----+------+
# |12a4|summer|
# |12a4| goest|
# | 3b| yes|
# | 3b| No|
# | 3b| why|
# | 4t| Hi|
# +----+------+
df = df1.select('id').distinct()
df = df.withColumn('new_id', F.row_number().over(Window.orderBy('id')))
df.show()
# +----+------+
# | id|new_id|
# +----+------+
# |12a4| 1|
# | 3b| 2|
# | 4t| 3|
# +----+------+
df = df.join(df1, 'id', 'full')
df.show()
# +----+------+------+
# | id|new_id| col1|
# +----+------+------+
# |12a4| 1|summer|
# |12a4| 1| goest|
# | 4t| 3| Hi|
# | 3b| 2| yes|
# | 3b| 2| No|
# | 3b| 2| why|
# +----+------+------+
df = df.drop('id').withColumnRenamed('new_id', 'id')
df.show()
# +---+------+
# | id| col1|
# +---+------+
# | 1|summer|
# | 1| goest|
# | 3| Hi|
# | 2| yes|
# | 2| No|
# | 2| why|
# +---+------+
df = df.filter(F.col('id') == 2)
df.show()
# +---+----+
# | id|col1|
# +---+----+
# | 2| yes|
# | 2| No|
# | 2| why|
# +---+----+
I'm looking for a way to convert a given column of data, in this case strings, into a numeric representation. For example, I have a dataframe of strings with values:
+------------+
| level |
+------------+
| Medium|
| Medium|
| Medium|
| High|
| Medium|
| Medium|
| Low|
| Low|
| High|
| Low|
| Low|
And I want to create a new column where these values get converted to:
"High"= 1, "Medium" = 2, "Low" = 3
+------------+
| level_num|
+------------+
| 2|
| 2|
| 2|
| 1|
| 2|
| 2|
| 3|
| 3|
| 1|
| 3|
| 3|
I've tried defining a function and doing a foreach over the dataframe like so:
def f(x):
    if(x == 'Medium'):
        return 2
    elif(x == "Low"):
        return 3
    else:
        return 1

a = df.select("level").rdd.foreach(f)
But this returns a "None" type. Thoughts? Thanks for the help as always!
You can certainly do this along the lines you have been trying, but you'll need a map operation instead of foreach: foreach is an action that runs the function only for its side effects and returns None.
spark.version
# u'2.2.0'
from pyspark.sql import Row
# toy data:
df = spark.createDataFrame([Row("Medium"),
                            Row("High"),
                            Row("High"),
                            Row("Low")
                           ],
                           ["level"])
df.show()
# +------+
# | level|
# +------+
# |Medium|
# | High|
# | High|
# | Low|
# +------+
Using your f(x) with these toy data, we get:
df.select("level").rdd.map(lambda x: f(x[0])).collect()
# [2, 1, 1, 3]
And one more map will give you a dataframe:
df.select("level").rdd.map(lambda x: f(x[0])).map(lambda x: Row(x)).toDF(["level_num"]).show()
# +---------+
# |level_num|
# +---------+
# | 2|
# | 1|
# | 1|
# | 3|
# +---------+
But it would be preferable to do it without invoking a temporary intermediate RDD, using the dataframe function when instead of your f(x):
from pyspark.sql.functions import col, when
df.withColumn("level_num", when(col("level")=='Medium', 2).when(col("level")=='Low', 3).otherwise(1)).show()
# +------+---------+
# | level|level_num|
# +------+---------+
# |Medium| 2|
# | High| 1|
# | High| 1|
# | Low| 3|
# +------+---------+
An alternative, for Spark >= 2.4, is to represent the mapping as a Python dictionary.
Then use the array and map_from_arrays Spark functions to build a map column and look up level_num by key:
from pyspark.sql.functions import array, col, lit, map_from_arrays
_dict = {"High":1, "Medium":2, "Low":3}
df = spark.createDataFrame([
["Medium"], ["Medium"], ["Medium"], ["High"], ["Medium"], ["Medium"], ["Low"], ["Low"], ["High"]
], ["level"])
keys = array(list(map(lit, _dict.keys()))) # or alternatively [lit(k) for k in _dict.keys()]
values = array(list(map(lit, _dict.values())))
_map = map_from_arrays(keys, values)
df.withColumn("level_num", _map.getItem(col("level"))).show() # or element_at(_map, col("level"))
# +------+---------+
# | level|level_num|
# +------+---------+
# |Medium| 2|
# |Medium| 2|
# |Medium| 2|
# | High| 1|
# |Medium| 2|
# |Medium| 2|
# | Low| 3|
# | Low| 3|
# | High| 1|
# +------+---------+
Suppose you have a Spark dataframe containing some null values, and you would like to replace the values of one column with the values from another if present. In Python/Pandas you can use the fillna() function to do this quite nicely:
df = spark.createDataFrame([('a', 'b', 'c'),(None,'e', 'f'),(None,None,'i')], ['c1','c2','c3'])
DF = df.toPandas()
DF['c1'].fillna(DF['c2']).fillna(DF['c3'])
How can this be done using Pyspark?
You need to use the coalesce function:
from pyspark.sql.functions import coalesce, lit
cDf = spark.createDataFrame([(None, None), (1, None), (None, 2)], ("a", "b"))
cDf.show()
# +----+----+
# | a| b|
# +----+----+
# |null|null|
# | 1|null|
# |null| 2|
# +----+----+
cDf.select(coalesce(cDf["a"], cDf["b"])).show()
# +--------------+
# |coalesce(a, b)|
# +--------------+
# | null|
# | 1|
# | 2|
# +--------------+
cDf.select('*', coalesce(cDf["a"], lit(0.0))).show()
# +----+----+----------------+
# | a| b|coalesce(a, 0.0)|
# +----+----+----------------+
# |null|null| 0.0|
# | 1|null| 1.0|
# |null| 2| 0.0|
# +----+----+----------------+
You can also apply coalesce on multiple columns:
cDf.select(coalesce(cDf["a"], cDf["b"], lit(0))).show()
# ...
This example is taken from the pyspark.sql API documentation.