Python built-in function conflicts with spark [duplicate] - apache-spark

I'm having trouble getting the round function in PySpark to work. In the block of code below I'm trying to round the new_bid column to 2 decimal places and rename it to bid. For reference, I'm importing pyspark.sql.functions as func and using the round function it contains:
output = output.select(col("ad").alias("ad_id"),
col("part").alias("part_id"),
func.round(col("new_bid"), 2).alias("bid"))
The new_bid column here is of type float. The resulting dataframe does not have the renamed bid column rounded to 2 decimal places as intended; the values still show 8 or 9 decimal places.
I've tried various things but can't seem to get the resulting dataframe to have the rounded value - any pointers would be greatly appreciated! Thanks!

Here are a couple of ways to do it with some toy data:
spark.version
# u'2.2.0'
import pyspark.sql.functions as func
df = spark.createDataFrame(
[(0.0, 0.2, 3.45631),
(0.4, 1.4, 2.82945),
(0.5, 1.9, 7.76261),
(0.6, 0.9, 2.76790),
(1.2, 1.0, 9.87984)],
["col1", "col2", "col3"])
df.show()
# +----+----+-------+
# |col1|col2| col3|
# +----+----+-------+
# | 0.0| 0.2|3.45631|
# | 0.4| 1.4|2.82945|
# | 0.5| 1.9|7.76261|
# | 0.6| 0.9| 2.7679|
# | 1.2| 1.0|9.87984|
# +----+----+-------+
# round 'col3' in a new column:
df2 = df.withColumn("col4", func.round(df["col3"], 2)).withColumnRenamed("col4","new_col3")
df2.show()
# +----+----+-------+--------+
# |col1|col2| col3|new_col3|
# +----+----+-------+--------+
# | 0.0| 0.2|3.45631| 3.46|
# | 0.4| 1.4|2.82945| 2.83|
# | 0.5| 1.9|7.76261| 7.76|
# | 0.6| 0.9| 2.7679| 2.77|
# | 1.2| 1.0|9.87984| 9.88|
# +----+----+-------+--------+
# round & replace existing 'col3':
df3 = df.withColumn("col3", func.round(df["col3"], 2))
df3.show()
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# | 0.0| 0.2|3.46|
# | 0.4| 1.4|2.83|
# | 0.5| 1.9|7.76|
# | 0.6| 0.9|2.77|
# | 1.2| 1.0|9.88|
# +----+----+----+
It's a matter of personal taste, but I am not a great fan of either col or alias; I prefer withColumn and withColumnRenamed instead. Nevertheless, if you would like to stick with select and col, here is how you could adapt your own code snippet:
from pyspark.sql.functions import col
df4 = df.select(col("col1").alias("new_col1"),
col("col2").alias("new_col2"),
func.round(df["col3"],2).alias("new_col3"))
df4.show()
# +--------+--------+--------+
# |new_col1|new_col2|new_col3|
# +--------+--------+--------+
# | 0.0| 0.2| 3.46|
# | 0.4| 1.4| 2.83|
# | 0.5| 1.9| 7.76|
# | 0.6| 0.9| 2.77|
# | 1.2| 1.0| 9.88|
# +--------+--------+--------+
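The title of this question mentions conflicts with Python built-ins. As a minimal sketch of that mechanism, using only the standard library rather than Spark: importing a name directly into the current namespace hides the built-in of the same name. Here math.pow stands in for pyspark.sql.functions.round; the stand-in is an assumption made so the example runs without a Spark session.

```python
import builtins
from math import pow  # rebinds `pow` here, hiding the built-in of the same name

# The bare name now refers to math.pow, which always returns a float.
shadowed_result = pow(2, 3)

# The built-in is still reachable explicitly through the builtins module.
builtin_result = builtins.pow(2, 3)

print(type(shadowed_result).__name__, type(builtin_result).__name__)
# float int
```

In the same way, from pyspark.sql.functions import * rebinds names such as round, sum, min and max to the Spark column functions, which is one reason to keep them behind an alias like func, as the snippets above do.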

Related

How to replace negative values with previous positive values in Spark?

I want to replace negative values in a Spark dataframe with the previous positive value. I am using Spark with Java. In Python, pandas has the ffill() API, which solves this, but in Java it is harder. I tried the lead/lag functions, but I can't tell how far back I would need to look past consecutive negative values, so that approach does not work.
You can use a window function. Take this df as an example:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[
('2018-03-01','6'),
('2018-03-02','1'),
('2018-03-03','-2'),
('2018-03-04','7'),
('2018-03-05','-3'),
],
["date", "value"]
)\
.withColumn('date', F.col('date').cast('date'))\
.withColumn('value', F.col('value').cast('integer'))
df.show()
# +----------+-----+
# | date|value|
# +----------+-----+
# |2018-03-01| 6|
# |2018-03-02| 1|
# |2018-03-03| -2|
# |2018-03-04| 7|
# |2018-03-05| -3|
# +----------+-----+
Then create a helper column with when and apply last over a window function. The window is ordered by date so that "previous" is well defined:
from pyspark.sql import Window
window = Window.partitionBy().orderBy('date').rowsBetween(Window.unboundedPreceding, Window.currentRow)
df\
.withColumn('if<0_than_null', F.when(F.col('value')<0, F.lit(None)).otherwise(F.col('value')))\
.withColumn('desired_output', F.last('if<0_than_null', ignorenulls=True).over(window))\
.show()
# +----------+-----+--------------+--------------+
# | date|value|if<0_than_null|desired_output|
# +----------+-----+--------------+--------------+
# |2018-03-01| 6| 6| 6|
# |2018-03-02| 1| 1| 1|
# |2018-03-03| -2| null| 1|
# |2018-03-04| 7| 7| 7|
# |2018-03-05| -3| null| 7|
# +----------+-----+--------------+--------------+
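To make the logic of the two added columns easier to follow, here is a plain-Python sketch of the same idea without Spark. It only illustrates what the when plus last(..., ignorenulls=True) combination computes over rows in date order; the function name is hypothetical and this is not how Spark evaluates the window.

```python
def fill_negatives_with_previous(values):
    """Replace each negative value with the most recent non-negative one."""
    filled = []
    last_seen = None  # stays None until the first non-negative value appears
    for v in values:
        # Negative (or missing) values are skipped, like the nulls in the helper column.
        if v is not None and v >= 0:
            last_seen = v
        filled.append(last_seen)
    return filled


print(fill_negatives_with_previous([6, 1, -2, 7, -3]))
# [6, 1, 1, 7, 7]
```

Note that a negative value at the very start has nothing to fall back on and stays empty (None here, null in the dataframe).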

How to reset index and find specific id?

I have an id column that identifies each person (rows with the same id belong to one person). I have two questions.
The id column is currently not a running number; it is a 10-digit code. How can I replace the ids with integers, e.g. 1, 2, 3, 4?
For example:
id col1
12a4 summer
12a4 goest
3b yes
3b No
3b why
4t Hi
Output:
id col1
1 summer
1 goest
2 yes
2 No
2 why
3 Hi
How can I get the data corresponding to id=2?
In the above example:
id col1
2 yes
2 No
2 why
from pyspark.sql import SparkSession
from pyspark.sql import Window, functions as F
spark = SparkSession.builder.getOrCreate()
data = [
('12a4', 'summer'),
('12a4', 'goest'),
('3b', 'yes'),
('3b', 'No'),
('3b', 'why'),
('4t', 'Hi')
]
df1 = spark.createDataFrame(data, ['id', 'col1'])
df1.show()
# +----+------+
# | id| col1|
# +----+------+
# |12a4|summer|
# |12a4| goest|
# | 3b| yes|
# | 3b| No|
# | 3b| why|
# | 4t| Hi|
# +----+------+
df = df1.select('id').distinct()
df = df.withColumn('new_id', F.row_number().over(Window.orderBy('id')))
df.show()
# +----+------+
# | id|new_id|
# +----+------+
# |12a4| 1|
# | 3b| 2|
# | 4t| 3|
# +----+------+
df = df.join(df1, 'id', 'full')
df.show()
# +----+------+------+
# | id|new_id| col1|
# +----+------+------+
# |12a4| 1|summer|
# |12a4| 1| goest|
# | 4t| 3| Hi|
# | 3b| 2| yes|
# | 3b| 2| No|
# | 3b| 2| why|
# +----+------+------+
df = df.drop('id').withColumnRenamed('new_id', 'id')
df.show()
# +---+------+
# | id| col1|
# +---+------+
# | 1|summer|
# | 1| goest|
# | 3| Hi|
# | 2| yes|
# | 2| No|
# | 2| why|
# +---+------+
df = df.filter(F.col('id') == 2)
df.show()
# +---+----+
# | id|col1|
# +---+----+
# | 2| yes|
# | 2| No|
# | 2| why|
# +---+----+
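The steps above (number the distinct ids, join the numbers back, then filter) can be sketched in plain Python to show what each stage produces. This is only an illustration of the logic on the same toy rows, not a replacement for the dataframe code; all variable names are hypothetical.

```python
rows = [
    ("12a4", "summer"), ("12a4", "goest"),
    ("3b", "yes"), ("3b", "No"), ("3b", "why"),
    ("4t", "Hi"),
]

# Number the distinct ids 1, 2, 3, ... in sorted (string) order.
distinct_ids = sorted({old_id for old_id, _ in rows})
id_map = {old_id: new_id for new_id, old_id in enumerate(distinct_ids, start=1)}

# Swap each original id for its new integer id.
renumbered = [(id_map[old_id], col1) for old_id, col1 in rows]

# Keep only the rows whose new id is 2.
rows_for_id_2 = [row for row in renumbered if row[0] == 2]
print(rows_for_id_2)
# [(2, 'yes'), (2, 'No'), (2, 'why')]
```

Because the ids are strings, the numbering follows string order rather than numeric order. In Spark itself, F.dense_rank() over the same Window.orderBy('id') would probably give the same numbering without the distinct-and-join step, though that is not what the answer above uses.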

Pyspark Dataframe - Map Strings to Numerics

I'm looking for a way to convert a given column of data, in this case strings, into a numeric representation. For example, I have a dataframe of strings with values:
+------------+
| level |
+------------+
| Medium|
| Medium|
| Medium|
| High|
| Medium|
| Medium|
| Low|
| Low|
| High|
| Low|
| Low|
And I want to create a new column where these values get converted to:
"High"= 1, "Medium" = 2, "Low" = 3
+------------+
| level_num|
+------------+
| 2|
| 2|
| 2|
| 1|
| 2|
| 2|
| 3|
| 3|
| 1|
| 3|
| 3|
I've tried defining a function and doing a foreach over the dataframe like so:
def f(x):
    if(x == 'Medium'):
        return 2
    elif(x == "Low"):
        return 3
    else:
        return 1
a = df.select("level").rdd.foreach(f)
But this returns a "None" type. Thoughts? Thanks for the help as always!
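You can certainly do this along the lines you have been trying, but you'll need a map operation instead of foreach. foreach is an action that runs only for its side effects and always returns None, whereas map returns a new RDD of the transformed values.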
spark.version
# u'2.2.0'
from pyspark.sql import Row
# toy data:
df = spark.createDataFrame([Row("Medium"),
Row("High"),
Row("High"),
Row("Low")
],
["level"])
df.show()
# +------+
# | level|
# +------+
# |Medium|
# | High|
# | High|
# | Low|
# +------+
Using your f(x) with these toy data, we get:
df.select("level").rdd.map(lambda x: f(x[0])).collect()
# [2, 1, 1, 3]
And one more map will give you a dataframe:
df.select("level").rdd.map(lambda x: f(x[0])).map(lambda x: Row(x)).toDF(["level_num"]).show()
# +---------+
# |level_num|
# +---------+
# | 2|
# | 1|
# | 1|
# | 3|
# +---------+
But it would be preferable to do it without invoking a temporary intermediate RDD, using the dataframe function when instead of your f(x):
from pyspark.sql.functions import col, when
df.withColumn("level_num", when(col("level")=='Medium', 2).when(col("level")=='Low', 3).otherwise(1)).show()
# +------+---------+
# | level|level_num|
# +------+---------+
# |Medium| 2|
# | High| 1|
# | High| 1|
# | Low| 3|
# +------+---------+
An alternative, for Spark >= 2.4, is to represent the mapping as a Python dictionary.
Then use the array and map_from_arrays Spark functions to build a map column, and look up each level in it to fill in the level_num field:
from pyspark.sql.functions import array, col, lit, map_from_arrays
_dict = {"High":1, "Medium":2, "Low":3}
df = spark.createDataFrame([
["Medium"], ["Medium"], ["Medium"], ["High"], ["Medium"], ["Medium"], ["Low"], ["Low"], ["High"]
], ["level"])
keys = array(list(map(lit, _dict.keys()))) # or alternatively [lit(k) for k in _dict.keys()]
values = array(list(map(lit, _dict.values())))
_map = map_from_arrays(keys, values)
df.withColumn("level_num", _map.getItem(col("level"))).show() # or element_at(_map, col("level"))
# +------+---------+
# | level|level_num|
# +------+---------+
# |Medium| 2|
# |Medium| 2|
# |Medium| 2|
# | High| 1|
# |Medium| 2|
# |Medium| 2|
# | Low| 3|
# | Low| 3|
# | High| 1|
# +------+---------+
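One difference between the approaches above is how they treat a level outside the three known ones. The plain-Python sketch below contrasts the branching of f(x) with a dictionary lookup; the value "Unknown" is a hypothetical input added for illustration. In Spark, when(...).otherwise(1) behaves like the branching version, while a map lookup on a missing key typically yields null (this can depend on the Spark version and ANSI settings).

```python
level_to_num = {"High": 1, "Medium": 2, "Low": 3}


def f(x):
    # Same branching as the function in the question: anything else becomes 1.
    if x == "Medium":
        return 2
    elif x == "Low":
        return 3
    else:
        return 1


levels = ["Medium", "High", "Low", "Unknown"]  # "Unknown" is a made-up extra value

via_branches = [f(x) for x in levels]
via_lookup = [level_to_num.get(x) for x in levels]  # .get returns None for a missing key

print(via_branches)  # [2, 1, 3, 1]
print(via_lookup)    # [2, 1, 3, None]
```

If unexpected values should be flagged rather than silently treated as "High", the lookup style is probably the safer choice.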

Spark: Replace missing values with values from another column

Suppose you have a Spark dataframe containing some null values, and you would like to replace the values of one column with the values from another if present. In Python/Pandas you can use the fillna() function to do this quite nicely:
df = spark.createDataFrame([('a', 'b', 'c'),(None,'e', 'f'),(None,None,'i')], ['c1','c2','c3'])
DF = df.toPandas()
DF['c1'].fillna(DF['c2']).fillna(DF['c3'])
How can this be done using Pyspark?
You need to use the coalesce function:
from pyspark.sql.functions import coalesce, lit
cDf = spark.createDataFrame([(None, None), (1, None), (None, 2)], ("a", "b"))
cDf.show()
# +----+----+
# | a| b|
# +----+----+
# |null|null|
# | 1|null|
# |null| 2|
# +----+----+
cDf.select(coalesce(cDf["a"], cDf["b"])).show()
# +--------------+
# |coalesce(a, b)|
# +--------------+
# | null|
# | 1|
# | 2|
# +--------------+
cDf.select('*', coalesce(cDf["a"], lit(0.0))).show()
# +----+----+----------------+
# | a| b|coalesce(a, 0.0)|
# +----+----+----------------+
# |null|null| 0.0|
# | 1|null| 1.0|
# |null| 2| 0.0|
# +----+----+----------------+
You can also apply coalesce to multiple columns:
cDf.select(coalesce(cDf["a"], cDf["b"], lit(0))).show()
# ...
This example is taken from the pyspark.sql API documentation.
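To connect coalesce back to the data in the question, here is a plain-Python sketch of its row-wise semantics: take the first value that is not missing, reading the columns left to right. The helper name coalesce_values is hypothetical and stands in for Spark's coalesce only for illustration.

```python
def coalesce_values(*values):
    """Return the first value that is not None, or None if all are None."""
    for v in values:
        if v is not None:
            return v
    return None


# The rows from the question, as (c1, c2, c3) tuples.
rows = [("a", "b", "c"), (None, "e", "f"), (None, None, "i")]

filled_c1 = [coalesce_values(c1, c2, c3) for c1, c2, c3 in rows]
print(filled_c1)
# ['a', 'e', 'i']
```

On the dataframe from the question, the corresponding expression would presumably be coalesce(df["c1"], df["c2"], df["c3"]), which mirrors the chained fillna calls in the pandas version.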
