Pyspark Dataframe - Map Strings to Numerics - apache-spark

I'm looking for a way to convert a column of strings into a numeric representation. For example, I have a dataframe with a string column containing these values:
+------+
| level|
+------+
|Medium|
|Medium|
|Medium|
|  High|
|Medium|
|Medium|
|   Low|
|   Low|
|  High|
|   Low|
|   Low|
+------+
And I want to create a new column where these values get converted to:
"High"= 1, "Medium" = 2, "Low" = 3
+---------+
|level_num|
+---------+
|        2|
|        2|
|        2|
|        1|
|        2|
|        2|
|        3|
|        3|
|        1|
|        3|
|        3|
+---------+
I've tried defining a function and doing a foreach over the dataframe like so:
def f(x):
    if x == 'Medium':
        return 2
    elif x == "Low":
        return 3
    else:
        return 1
a = df.select("level").rdd.foreach(f)
But this returns a "None" type. Thoughts? Thanks for the help as always!

You can certainly do this along the lines you have been trying - you'll need a map operation instead of foreach.
spark.version
# u'2.2.0'
from pyspark.sql import Row
# toy data:
df = spark.createDataFrame([Row("Medium"),
                            Row("High"),
                            Row("High"),
                            Row("Low")],
                           ["level"])
df.show()
# +------+
# | level|
# +------+
# |Medium|
# |  High|
# |  High|
# |   Low|
# +------+
Using your f(x) with these toy data, we get:
df.select("level").rdd.map(lambda x: f(x[0])).collect()
# [2, 1, 1, 3]
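As a plain-Python analogy (not Spark code) for why the original attempt produced None: an operation like foreach runs the function on every element purely for its side effects and throws the return values away, whereas map keeps them. The foreach helper below is a hypothetical stand-in written only for this illustration.

```python
def f(x):
    if x == "Medium":
        return 2
    elif x == "Low":
        return 3
    else:
        return 1


def foreach(items, func):
    """Hypothetical stand-in: call func on each item, discard what it returns."""
    for item in items:
        func(item)
    # No return statement, so the call itself evaluates to None.


levels = ["Medium", "High", "High", "Low"]

foreach_result = foreach(levels, f)   # return values of f are lost
map_result = list(map(f, levels))     # return values of f are collected

print(foreach_result)  # None
print(map_result)      # [2, 1, 1, 3]
```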
And one more map will give you a dataframe:
df.select("level").rdd.map(lambda x: f(x[0])).map(lambda x: Row(x)).toDF(["level_num"]).show()
# +---------+
# |level_num|
# +---------+
# |        2|
# |        1|
# |        1|
# |        3|
# +---------+
But it would be preferable to do it without invoking a temporary intermediate RDD, using the dataframe function when instead of your f(x):
from pyspark.sql.functions import col, when
df.withColumn("level_num", when(col("level")=='Medium', 2).when(col("level")=='Low', 3).otherwise(1)).show()
# +------+---------+
# | level|level_num|
# +------+---------+
# |Medium|        2|
# |  High|        1|
# |  High|        1|
# |   Low|        3|
# +------+---------+

An alternative for Spark >= 2.4 is to represent the mapping as a Python dictionary.
Then use the array and map_from_arrays Spark functions to build a map column, and look up each level in it to fill in the level_num field:
from pyspark.sql.functions import array, col, lit, map_from_arrays
_dict = {"High":1, "Medium":2, "Low":3}
df = spark.createDataFrame([
    ["Medium"], ["Medium"], ["Medium"], ["High"], ["Medium"], ["Medium"], ["Low"], ["Low"], ["High"]
], ["level"])
keys = array(list(map(lit, _dict.keys()))) # or alternatively [lit(k) for k in _dict.keys()]
values = array(list(map(lit, _dict.values())))
_map = map_from_arrays(keys, values)
df.withColumn("level_num", _map.getItem(col("level"))).show() # or element_at(_map, col("level"))
# +------+---------+
# | level|level_num|
# +------+---------+
# |Medium|        2|
# |Medium|        2|
# |Medium|        2|
# |  High|        1|
# |Medium|        2|
# |Medium|        2|
# |   Low|        3|
# |   Low|        3|
# |  High|        1|
# +------+---------+
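Where map_from_arrays is not available, one possible workaround is to turn the same dictionary into a SQL CASE expression string. The sketch below is plain Python; build_case_expr and sql_literal are hypothetical helper names, the column name is assumed to be a simple identifier, and strings containing quotes or backslashes are rejected rather than escaped.

```python
def sql_literal(value):
    """Render a Python str or number as a SQL literal (hypothetical helper)."""
    if isinstance(value, str):
        # Assumption: keep it simple and refuse strings that would need escaping.
        if "'" in value or "\\" in value:
            raise ValueError(f"unsupported characters in {value!r}")
        return f"'{value}'"
    return str(value)


def build_case_expr(column, mapping):
    """Build a SQL CASE expression string from a dict (hypothetical helper)."""
    branches = " ".join(
        f"WHEN {column} = {sql_literal(key)} THEN {sql_literal(val)}"
        for key, val in mapping.items()
    )
    return f"CASE {branches} END"


level_map = {"High": 1, "Medium": 2, "Low": 3}
case_sql = build_case_expr("level", level_map)
print(case_sql)
# CASE WHEN level = 'High' THEN 1 WHEN level = 'Medium' THEN 2 WHEN level = 'Low' THEN 3 END
```

The resulting string could then be passed to expr from pyspark.sql.functions, for example df.withColumn("level_num", expr(case_sql)). Because the CASE has no ELSE branch, values missing from the dictionary yield null.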

Related

How to interact with each element of an ArrayType column in pyspark?

If I have an ArrayType column in pyspark:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    ((1, []), (2, [1, 2, 3]), (3, [-2])),
    schema=StructType([StructField("a", IntegerType()), StructField("b", ArrayType(IntegerType()))]))
df.show()
output:
+---+---------+
|  a|        b|
+---+---------+
|  1|       []|
|  2|[1, 2, 3]|
|  3|     [-2]|
+---+---------+
Now I want to be able to operate on each element of column b, for example:
Divide each element by 5
output:
+---+---------------+
|  a|              b|
+---+---------------+
|  1|             []|
|  2|[0.2, 0.4, 0.6]|
|  3|         [-0.4]|
+---+---------------+
Add a value to each element, etc.
How do I go about such transformations, where some operator or function is applied to each element of an ArrayType column?
You are looking for the transform function. It applies a computation to each element of an array.
from pyspark.sql import functions as F
# Spark >= 2.4 (SQL expression syntax; the only option before 3.1.0)
df.withColumn("b", F.expr("transform(b, x -> x / 5)")).show()
"""
+---+---------------+
| a| b|
+---+---------------+
| 1| []|
| 2|[0.2, 0.4, 0.6]|
| 3| [-0.4]|
+---+---------------+
"""
# Spark >= 3.1.0
df.withColumn("b", F.transform("b", lambda x: x / 5)).show()
"""
+---+---------------+
| a| b|
+---+---------------+
| 1| []|
| 2|[0.2, 0.4, 0.6]|
| 3| [-0.4]|
+---+---------------+
"""
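To sanity-check the expected result without a Spark session, the element-wise behavior of transform can be modeled in plain Python. This is only a sketch of the semantics, not Spark code, and transform_rows is a hypothetical helper name.

```python
def transform_rows(rows, func):
    """Apply func to every element of the array in each (a, b) row."""
    return [(a, [func(x) for x in b]) for a, b in rows]


rows = [(1, []), (2, [1, 2, 3]), (3, [-2])]

divided = transform_rows(rows, lambda x: x / 5)
print(divided)
# [(1, []), (2, [0.2, 0.4, 0.6]), (3, [-0.4])]
```

An empty array stays empty, and any other per-element operation (adding a constant, for instance) only changes the lambda.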

Splitting rows of a dataset depending on a column values

I am using Spark 3.1.1 with Java 8. I am trying to split a Dataset<Row> according to the values of one of its numerical columns (greater or less than a threshold). The split is possible only if some string column values of the rows are identical. I am trying something like this:
Iterator<Row> iter2 = partition.toLocalIterator();
while (iter2.hasNext()) {
    Row item = iter2.next();
    // getColVal is a function that gets the value given a column
    String numValue = getColVal(item, dim);
    if (Integer.parseInt(numValue) < threshold)
        pl.add(item);
    else
        pr.add(item);
}
But how can I check, before splitting, whether some other (string) column values of the rows concerned are identical, so that the split can be performed?
PS: I tried to groupBy the columns before splitting, like so:
Dataset<Row> newDataset=oldDataset.groupBy("col1","col4").agg(col("col1"));
but it's not working.
Thank you for the help.
EDIT :
A sample dataset which i want to split is :
abc,9,40,A
abc,7,50,A
cde,4,20,B
cde,3,25,B
If the threshold is 30, then the first two rows and the last two rows will form two datasets, because the first and fourth columns within each pair are identical; otherwise the split is not possible.
EDIT: the resulting output would be
abc,9,40,A
abc,7,50,A
cde,4,20,B
cde,3,25,B
I mainly use PySpark, but you could adapt this to your environment.
import pandas as pd
from pyspark.sql import Window, functions as F
## could add some conditional logic or just always output 2 data frames where
## one would be empty
print("pdf - two dataframe")
## create pandas dataframe
pdf = pd.DataFrame({'col1':['abc','abc','cde','cde'],'col2':[9,7,4,3],'col3':[40,50,20,25],'col4':['A','A','B','B']})
print( pdf )
## move it to spark
print("sdf")
sdf = spark.createDataFrame(pdf)
sdf.show()
# +----+----+----+----+
# |col1|col2|col3|col4|
# +----+----+----+----+
# | abc| 9| 40| A|
# | abc| 7| 50| A|
# | cde| 4| 20| B|
# | cde| 3| 25| B|
# +----+----+----+----+
## filter
pl = sdf.filter('col3 <= 30')\
    .groupBy("col1","col4").agg(F.sum('col2').alias('sumC2'))
pr = sdf.filter('col3 > 30')\
    .groupBy("col1","col4").agg(F.sum('col2').alias('sumC2'))
print("pl")
pl.show()
# +----+----+-----+
# |col1|col4|sumC2|
# +----+----+-----+
# | cde| B| 7|
# +----+----+-----+
print("pr")
pr.show()
# +----+----+-----+
# |col1|col4|sumC2|
# +----+----+-----+
# | abc| A| 16|
# +----+----+-----+
print("pdf - one dataframe")
## create pandas dataframe
pdf = pd.DataFrame({'col1':['abc','abc','cde','cde'],'col2':[9,7,4,3],'col3':[11,29,20,25],'col4':['A','A','B','B']})
print( pdf )
## move it to spark
print("sdf")
sdf = spark.createDataFrame(pdf)
sdf.show()
# +----+----+----+----+
# |col1|col2|col3|col4|
# +----+----+----+----+
# | abc| 9| 11| A|
# | abc| 7| 29| A|
# | cde| 4| 20| B|
# | cde| 3| 25| B|
# +----+----+----+----+
pl = sdf.filter('col3 <= 30')\
    .groupBy("col1","col4").agg(F.sum('col2').alias('sumC2'))
pr = sdf.filter('col3 > 30')\
    .groupBy("col1","col4").agg(F.sum('col2').alias('sumC2'))
print("pl")
pl.show()
# +----+----+-----+
# |col1|col4|sumC2|
# +----+----+-----+
# | abc| A| 16|
# | cde| B| 7|
# +----+----+-----+
print("pr")
pr.show()
# +----+----+-----+
# |col1|col4|sumC2|
# +----+----+-----+
# +----+----+-----+
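The filter-and-group logic above can be checked against a small plain-Python model of the same rule: rows are keyed by (col1, col4) and routed left or right by comparing col3 with the threshold. This is a sketch under the assumption that col3 is the column being thresholded; split_by_threshold is a hypothetical helper name.

```python
from collections import defaultdict

rows = [
    ("abc", 9, 40, "A"),
    ("abc", 7, 50, "A"),
    ("cde", 4, 20, "B"),
    ("cde", 3, 25, "B"),
]


def split_by_threshold(rows, threshold):
    """Key rows by (col1, col4), then route each row left or right by col3."""
    left, right = defaultdict(list), defaultdict(list)
    for row in rows:
        key = (row[0], row[3])
        side = left if row[2] <= threshold else right
        side[key].append(row)
    return dict(left), dict(right)


pl, pr = split_by_threshold(rows, 30)

# Sum of col2 per key, mirroring the sumC2 aggregation.
pl_sums = {key: sum(r[1] for r in group) for key, group in pl.items()}
pr_sums = {key: sum(r[1] for r in group) for key, group in pr.items()}
print(pl_sums)  # {('cde', 'B'): 7}
print(pr_sums)  # {('abc', 'A'): 16}
```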
Filtering by a dynamic mean
print("pdf - filter by mean")
## create pandas dataframe
pdf = pd.DataFrame({'col1':['abc','abc','cde','cde'],'col2':[9,7,4,3],'col3':[40,50,20,25],'col4':['A','A','B','B']})
print( pdf )
## move it to spark
print("sdf")
sdf = spark.createDataFrame(pdf)
sdf.show()
# +----+----+----+----+
# |col1|col2|col3|col4|
# +----+----+----+----+
# | abc| 9| 40| A|
# | abc| 7| 50| A|
# | cde| 4| 20| B|
# | cde| 3| 25| B|
# +----+----+----+----+
## partition only: adding an orderBy here would turn the mean into a running mean
w = Window.partitionBy("col1")
## add another column, the mean of col2 partitioned by col1
sdf = sdf.withColumn('mean_c2', F.mean('col2').over(w))
## filter by the dynamic mean
pr = sdf.filter('col2 > mean_c2')
pr.show()
# +----+----+----+----+-------+
# |col1|col2|col3|col4|mean_c2|
# +----+----+----+----+-------+
# | cde| 4| 20| B| 3.5|
# | abc| 9| 40| A| 8.0|
# +----+----+----+----+-------+
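The intended logic (one mean per col1 group, then keep the rows above it) can be modeled in plain Python as a cross-check. This is a sketch of the semantics only, not Spark code, and the variable names are made up for the illustration.

```python
from collections import defaultdict

rows = [("abc", 9), ("abc", 7), ("cde", 4), ("cde", 3)]  # (col1, col2)

# Collect col2 values per col1 group.
values_by_group = defaultdict(list)
for col1, col2 in rows:
    values_by_group[col1].append(col2)

# One mean per group, shared by every row of that group.
group_mean = {key: sum(vals) / len(vals) for key, vals in values_by_group.items()}

# Keep only the rows whose col2 exceeds their group's mean.
above_mean = [
    (col1, col2, group_mean[col1])
    for col1, col2 in rows
    if col2 > group_mean[col1]
]
print(above_mean)  # [('abc', 9, 8.0), ('cde', 4, 3.5)]
```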

How to reset index and find specific id?

I have an id column for each person (rows with the same id belong to one person). I want the following:
The id column is currently not a running number; it is a 10-digit value. How can I reset the ids to integers, e.g. 1, 2, 3, 4?
For example:
id col1
12a4 summer
12a4 goest
3b yes
3b No
3b why
4t Hi
Output:
id col1
1 summer
1 goest
2 yes
2 No
2 why
3 Hi
How can I get the data corresponding to id=2?
In the above example:
id col1
2 yes
2 No
2 why
from pyspark.sql import SparkSession
from pyspark.sql import Window, functions as F
spark = SparkSession.builder.getOrCreate()
data = [
('12a4', 'summer'),
('12a4', 'goest'),
('3b', 'yes'),
('3b', 'No'),
('3b', 'why'),
('4t', 'Hi')
]
df1 = spark.createDataFrame(data, ['id', 'col1'])
df1.show()
# +----+------+
# | id| col1|
# +----+------+
# |12a4|summer|
# |12a4| goest|
# | 3b| yes|
# | 3b| No|
# | 3b| why|
# | 4t| Hi|
# +----+------+
df = df1.select('id').distinct()
df = df.withColumn('new_id', F.row_number().over(Window.orderBy('id')))
df.show()
# +----+------+
# | id|new_id|
# +----+------+
# |12a4| 1|
# | 3b| 2|
# | 4t| 3|
# +----+------+
df = df.join(df1, 'id', 'full')
df.show()
# +----+------+------+
# | id|new_id| col1|
# +----+------+------+
# |12a4| 1|summer|
# |12a4| 1| goest|
# | 4t| 3| Hi|
# | 3b| 2| yes|
# | 3b| 2| No|
# | 3b| 2| why|
# +----+------+------+
df = df.drop('id').withColumnRenamed('new_id', 'id')
df.show()
# +---+------+
# | id| col1|
# +---+------+
# | 1|summer|
# | 1| goest|
# | 3| Hi|
# | 2| yes|
# | 2| No|
# | 2| why|
# +---+------+
df = df.filter(F.col('id') == 2)
df.show()
# +---+----+
# | id|col1|
# +---+----+
# | 2| yes|
# | 2| No|
# | 2| why|
# +---+----+
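The same renumbering can be modeled in plain Python to make the steps explicit: take the distinct ids, sort them, number them from 1, then map every row through that lookup. This is a sketch of the logic rather than Spark code; the variable names are hypothetical.

```python
data = [
    ("12a4", "summer"),
    ("12a4", "goest"),
    ("3b", "yes"),
    ("3b", "No"),
    ("3b", "why"),
    ("4t", "Hi"),
]

# Distinct ids in sorted order, numbered from 1 (mirrors row_number over an ordered window).
distinct_ids = sorted({old_id for old_id, _ in data})
new_id = {old_id: number for number, old_id in enumerate(distinct_ids, start=1)}

# Replace the old id with the new one in every row, then select id == 2.
renumbered = [(new_id[old_id], col1) for old_id, col1 in data]
selected = [row for row in renumbered if row[0] == 2]

print(new_id)    # {'12a4': 1, '3b': 2, '4t': 3}
print(selected)  # [(2, 'yes'), (2, 'No'), (2, 'why')]
```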

Spark: Replace missing values with values from another column

Suppose you have a Spark dataframe containing some null values, and you would like to replace the values of one column with the values from another if present. In Python/Pandas you can use the fillna() function to do this quite nicely:
df = spark.createDataFrame([('a', 'b', 'c'),(None,'e', 'f'),(None,None,'i')], ['c1','c2','c3'])
DF = df.toPandas()
DF['c1'].fillna(DF['c2']).fillna(DF['c3'])
How can this be done using Pyspark?
You need to use the coalesce function:
from pyspark.sql.functions import coalesce, lit
cDf = spark.createDataFrame([(None, None), (1, None), (None, 2)], ("a", "b"))
cDf.show()
# +----+----+
# | a| b|
# +----+----+
# |null|null|
# | 1|null|
# |null| 2|
# +----+----+
cDf.select(coalesce(cDf["a"], cDf["b"])).show()
# +--------------+
# |coalesce(a, b)|
# +--------------+
# | null|
# | 1|
# | 2|
# +--------------+
cDf.select('*', coalesce(cDf["a"], lit(0.0))).show()
# +----+----+----------------+
# | a| b|coalesce(a, 0.0)|
# +----+----+----------------+
# |null|null| 0.0|
# | 1|null| 1.0|
# |null| 2| 0.0|
# +----+----+----------------+
You can also apply coalesce to multiple columns:
cDf.select(coalesce(cDf["a"], cDf["b"], lit(0))).show()
# ...
This example is taken from the pyspark.sql API documentation.
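To connect this back to the c1/c2/c3 example from the question, here is a plain-Python model of what coalesce does per row: it returns the first argument that is not null. The function below is a hypothetical stand-in written for illustration, not the Spark function itself.

```python
def coalesce(*values):
    """Return the first value that is not None, or None if all are None."""
    for value in values:
        if value is not None:
            return value
    return None


rows = [("a", "b", "c"), (None, "e", "f"), (None, None, "i")]  # (c1, c2, c3)

filled_c1 = [coalesce(c1, c2, c3) for c1, c2, c3 in rows]
print(filled_c1)  # ['a', 'e', 'i']
```

In PySpark the corresponding column expression would be coalesce(col("c1"), col("c2"), col("c3")), with coalesce and col imported from pyspark.sql.functions.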

How to join two data frames in Apache Spark and merge keys into one column?

I have the following two Spark data frames:
sale_df:
+-------+----------+
|user_id|total_sale|
+-------+----------+
|      a|      1100|
|      b|      2100|
|      c|      3300|
|      d|      4400|
+-------+----------+
and target_df:
+-------+-------------------+
|user_id|personalized_target|
+-------+-------------------+
|      b|               1000|
|      c|               2000|
|      d|               3000|
|      e|               4000|
+-------+-------------------+
How can I join them so that the output is:
user_id total_sale personalized_target
a 1100 NA
b 2100 1000
c 3300 2000
d 4400 3000
e NA 4000
I have tried almost all the join types, but it seems that a single join cannot produce the desired output.
Any PySpark, SQL, or HiveContext solution would help.
You can use the equi-join syntax in Scala:
val output = sale_df.join(target_df, Seq("user_id"), joinType = "outer")
The Python equivalent should be:
output = sale_df.join(target_df, ['user_id'], "outer")
You need to perform an outer equi-join:
data1 = [['a', 1100], ['b', 2100], ['c', 3300], ['d', 4400]]
sales = sqlContext.createDataFrame(data1,['user_id','total_sale'])
data2 = [['b', 1000],['c',2000],['d',3000],['e',4000]]
target = sqlContext.createDataFrame(data2,['user_id','personalized_target'])
sales.join(target, 'user_id', "outer").show()
# +-------+----------+-------------------+
# |user_id|total_sale|personalized_target|
# +-------+----------+-------------------+
# | e| null| 4000|
# | d| 4400| 3000|
# | c| 3300| 2000|
# | b| 2100| 1000|
# | a| 1100| null|
# +-------+----------+-------------------+
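For readers who want to see why a single outer join is enough, the full outer equi-join can be modeled in plain Python: take the union of the keys from both sides and look each key up in both tables, leaving None where one side has no match. This is a sketch of the semantics only; the dictionaries stand in for the two dataframes.

```python
sales = {"a": 1100, "b": 2100, "c": 3300, "d": 4400}    # user_id -> total_sale
target = {"b": 1000, "c": 2000, "d": 3000, "e": 4000}   # user_id -> personalized_target

# Union of keys from both sides; a missing side becomes None (null in Spark).
joined = [
    (user_id, sales.get(user_id), target.get(user_id))
    for user_id in sorted(sales.keys() | target.keys())
]

for row in joined:
    print(row)
# ('a', 1100, None)
# ('b', 2100, 1000)
# ('c', 3300, 2000)
# ('d', 4400, 3000)
# ('e', None, 4000)
```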
