Split Spark dataframe string column into multiple columns - string

I've seen various people suggesting that Dataframe.explode is a useful way to do this, but it results in more rows than the original dataframe, which isn't what I want at all. I simply want to do the Dataframe equivalent of the very simple:
rdd.map(lambda row: row + [row.my_str_col.split('-')])
which takes something looking like:
col1 | my_str_col
-----+-----------
18 | 856-yygrm
201 | 777-psgdg
and converts it to this:
col1 | my_str_col | _col3 | _col4
-----+------------+-------+------
18 | 856-yygrm | 856 | yygrm
201 | 777-psgdg | 777 | psgdg
I am aware of pyspark.sql.functions.split(), but it results in a nested array column instead of two top-level columns like I want.
Ideally, I want these new columns to be named as well.

pyspark.sql.functions.split() is the right approach here - you simply need to flatten the nested ArrayType column into multiple top-level columns. In this case, where each array only contains 2 items, it's very easy. You simply use Column.getItem() to retrieve each part of the array as a column itself:
split_col = pyspark.sql.functions.split(df['my_str_col'], '-')
df = df.withColumn('NAME1', split_col.getItem(0))
df = df.withColumn('NAME2', split_col.getItem(1))
The result will be:
col1 | my_str_col | NAME1 | NAME2
-----+------------+-------+------
18 | 856-yygrm | 856 | yygrm
201 | 777-psgdg | 777 | psgdg
I am not sure how I would solve this in a general case where the nested arrays were not the same size from Row to Row.

Here's a solution to the general case that doesn't involve needing to know the length of the array ahead of time, using collect, or using udfs. Unfortunately this only works for spark version 2.1 and above, because it requires the posexplode function.
Suppose you had the following DataFrame:
df = spark.createDataFrame(
[
[1, 'A, B, C, D'],
[2, 'E, F, G'],
[3, 'H, I'],
[4, 'J']
]
, ["num", "letters"]
)
df.show()
#+---+----------+
#|num| letters|
#+---+----------+
#| 1|A, B, C, D|
#| 2| E, F, G|
#| 3| H, I|
#| 4| J|
#+---+----------+
Split the letters column and then use posexplode to explode the resultant array along with the position in the array. Next use pyspark.sql.functions.expr to grab the element at index pos in this array.
import pyspark.sql.functions as f
df.select(
"num",
f.split("letters", ", ").alias("letters"),
f.posexplode(f.split("letters", ", ")).alias("pos", "val")
)\
.show()
#+---+------------+---+---+
#|num| letters|pos|val|
#+---+------------+---+---+
#| 1|[A, B, C, D]| 0| A|
#| 1|[A, B, C, D]| 1| B|
#| 1|[A, B, C, D]| 2| C|
#| 1|[A, B, C, D]| 3| D|
#| 2| [E, F, G]| 0| E|
#| 2| [E, F, G]| 1| F|
#| 2| [E, F, G]| 2| G|
#| 3| [H, I]| 0| H|
#| 3| [H, I]| 1| I|
#| 4| [J]| 0| J|
#+---+------------+---+---+
Now we create two new columns from this result. First one is the name of our new column, which will be a concatenation of letter and the index in the array. The second column will be the value at the corresponding index in the array. We get the latter by exploiting the functionality of pyspark.sql.functions.expr which allows us use column values as parameters.
df.select(
"num",
f.split("letters", ", ").alias("letters"),
f.posexplode(f.split("letters", ", ")).alias("pos", "val")
)\
.drop("val")\
.select(
"num",
f.concat(f.lit("letter"),f.col("pos").cast("string")).alias("name"),
f.expr("letters[pos]").alias("val")
)\
.show()
#+---+-------+---+
#|num| name|val|
#+---+-------+---+
#| 1|letter0| A|
#| 1|letter1| B|
#| 1|letter2| C|
#| 1|letter3| D|
#| 2|letter0| E|
#| 2|letter1| F|
#| 2|letter2| G|
#| 3|letter0| H|
#| 3|letter1| I|
#| 4|letter0| J|
#+---+-------+---+
Now we can just groupBy the num and pivot the DataFrame. Putting that all together, we get:
df.select(
"num",
f.split("letters", ", ").alias("letters"),
f.posexplode(f.split("letters", ", ")).alias("pos", "val")
)\
.drop("val")\
.select(
"num",
f.concat(f.lit("letter"),f.col("pos").cast("string")).alias("name"),
f.expr("letters[pos]").alias("val")
)\
.groupBy("num").pivot("name").agg(f.first("val"))\
.show()
#+---+-------+-------+-------+-------+
#|num|letter0|letter1|letter2|letter3|
#+---+-------+-------+-------+-------+
#| 1| A| B| C| D|
#| 3| H| I| null| null|
#| 2| E| F| G| null|
#| 4| J| null| null| null|
#+---+-------+-------+-------+-------+

Here's another approach, in case you want split a string with a delimiter.
import pyspark.sql.functions as f
df = spark.createDataFrame([("1:a:2001",),("2:b:2002",),("3:c:2003",)],["value"])
df.show()
+--------+
| value|
+--------+
|1:a:2001|
|2:b:2002|
|3:c:2003|
+--------+
df_split = df.select(f.split(df.value,":")).rdd.flatMap(
lambda x: x).toDF(schema=["col1","col2","col3"])
df_split.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| a|2001|
| 2| b|2002|
| 3| c|2003|
+----+----+----+
I don't think this transition back and forth to RDDs is going to slow you down...
Also don't worry about last schema specification: it's optional, you can avoid it generalizing the solution to data with unknown column size.

I understand your pain. Using split() can work, but can also lead to breaks.
Let's take your df and make a slight change to it:
df = spark.createDataFrame([('1:"a:3":2001',),('2:"b":2002',),('3:"c":2003',)],["value"])
df.show()
+------------+
| value|
+------------+
|1:"a:3":2001|
| 2:"b":2002|
| 3:"c":2003|
+------------+
If you try to apply split() to this as outlined above:
df_split = df.select(split(df.value,":")).rdd.flatMap(
lambda x: x).toDF(schema=["col1","col2","col3"]).show()
you will get
IllegalStateException: Input row doesn't have expected number of values required by the schema. 4 fields are required while 3 values are provided.
So, is there a more elegant way of addressing this? I was so happy to have it pointed out to me. pyspark.sql.functions.from_csv() is your friend.
Taking my above example df:
from pyspark.sql.functions import from_csv
# Define a column schema to apply with from_csv()
col_schema = ["col1 INTEGER","col2 STRING","col3 INTEGER"]
schema_str = ",".join(col_schema)
# define the separator because it isn't a ','
options = {'sep': ":"}
# create a df from the value column using schema and options
df_csv = df.select(from_csv(df.value, schema_str, options).alias("value_parsed"))
df_csv.show()
+--------------+
| value_parsed|
+--------------+
|[1, a:3, 2001]|
| [2, b, 2002]|
| [3, c, 2003]|
+--------------+
Then we can easily flatten the df to put the values in columns:
df2 = df_csv.select("value_parsed.*").toDF("col1","col2","col3")
df2.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| a:3|2001|
| 2| b|2002|
| 3| c|2003|
+----+----+----+
No breaks. Data correctly parsed. Life is good. Have a beer.

Instead of Column.getItem(i) we can use Column[i].
Also, enumerate is useful in big dataframes.
from pyspark.sql import functions as F
Keep parent column:
for i, c in enumerate(['new_1', 'new_2']):
df = df.withColumn(c, F.split('my_str_col', '-')[i])
or
new_cols = ['new_1', 'new_2']
df = df.select('*', *[F.split('my_str_col', '-')[i].alias(c) for i, c in enumerate(new_cols)])
Replace parent column:
for i, c in enumerate(['new_1', 'new_2']):
df = df.withColumn(c, F.split('my_str_col', '-')[i])
df = df.drop('my_str_col')
or
new_cols = ['new_1', 'new_2']
df = df.select(
*[c for c in df.columns if c != 'my_str_col'],
*[F.split('my_str_col', '-')[i].alias(c) for i, c in enumerate(new_cols)]
)

Related

Managing multiple columns with duplicate names in pyspark dataframe using spark_sanitize_names

I have a dataframe with columns with duplicate names. The contents of these columns are different, but unfortunately the names are the same. I would like to change the names of the columns by adding say - a number series to the columns to make each column unique like this..
foo1 | foo2 | laa3 | boo4 ...
----------------------------------
| | |
Is there a way to do that? I found a tool for scala spark here, but none for pyspark.
https://rdrr.io/cran/sparklyr/src/R/utils.R#sym-spark_sanitize_names
We can use enumerate on df.columns then append index value to the column name.
finally create dataframe with new column names!
In Pyspark:
df.show()
#+---+---+---+---+
#| i| j| k| l|
#+---+---+---+---+
#| a| 1| v| p|
#+---+---+---+---+
new_cols=[elm + str(index+1) for index,elm in enumerate(df.columns)]
#['i1', 'j2', 'k3', 'l4']
#creating new dataframe with new column names
df1=df.toDF(*new_cols)
df1.show()
#+---+---+---+---+
#| i1| j2| k3| l4|
#+---+---+---+---+
#| a| 1| v| p|
#+---+---+---+---+
In Scala:
val new_cols=df.columns.zipWithIndex.collect{case(a,i) => a+(i+1)}
val df1=df.toDF(new_cols:_*)
df1.show()
//+---+---+---+---+
//| i1| j2| k3| l4|
//+---+---+---+---+
//| a| 1| v| p|
//+---+---+---+---+

Adding a group count column to a PySpark dataframe

I am coming from R and the tidyverse to PySpark due to its superior Spark handling, and I am struggling to map certain concepts from one context to the other.
In particular, suppose that I had a dataset like the following
x | y
--+--
a | 5
a | 8
a | 7
b | 1
and I wanted to add a column containing the number of rows for each x value, like so:
x | y | n
--+---+---
a | 5 | 3
a | 8 | 3
a | 7 | 3
b | 1 | 1
In dplyr, I would just say:
import(tidyverse)
df <- read_csv("...")
df %>%
group_by(x) %>%
mutate(n = n()) %>%
ungroup()
and that would be that. I can do something almost as simple in PySpark if I'm looking to summarize by number of rows:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.getOrCreate()
spark.read.csv("...") \
.groupBy(col("x")) \
.count() \
.show()
And I thought I understood that withColumn was equivalent to dplyr's mutate. However, when I do the following, PySpark tells me that withColumn is not defined for groupBy data:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count
spark = SparkSession.builder.getOrCreate()
spark.read.csv("...") \
.groupBy(col("x")) \
.withColumn("n", count("x")) \
.show()
In the short run, I can simply create a second dataframe containing the counts and join it to the original dataframe. However, it seems like this could become inefficient in the case of large tables. What is the canonical way to accomplish this?
When you do a groupBy(), you have to specify the aggregation before you can display the results. For example:
import pyspark.sql.functions as f
data = [
('a', 5),
('a', 8),
('a', 7),
('b', 1),
]
df = sqlCtx.createDataFrame(data, ["x", "y"])
df.groupBy('x').count().select('x', f.col('count').alias('n')).show()
#+---+---+
#| x| n|
#+---+---+
#| b| 1|
#| a| 3|
#+---+---+
Here I used alias() to rename the column. But this only returns one row per group. If you want all rows with the count appended, you can do this with a Window:
from pyspark.sql import Window
w = Window.partitionBy('x')
df.select('x', 'y', f.count('x').over(w).alias('n')).sort('x', 'y').show()
#+---+---+---+
#| x| y| n|
#+---+---+---+
#| a| 5| 3|
#| a| 7| 3|
#| a| 8| 3|
#| b| 1| 1|
#+---+---+---+
Or if you're more comfortable with SQL, you can register the dataframe as a temporary table and take advantage of pyspark-sql to do the same thing:
df.registerTempTable('table')
sqlCtx.sql(
'SELECT x, y, COUNT(x) OVER (PARTITION BY x) AS n FROM table ORDER BY x, y'
).show()
#+---+---+---+
#| x| y| n|
#+---+---+---+
#| a| 5| 3|
#| a| 7| 3|
#| a| 8| 3|
#| b| 1| 1|
#+---+---+---+
as #pault appendix
import pyspark.sql.functions as F
...
(df
.groupBy(F.col('x'))
.agg(F.count('x').alias('n'))
.show())
#+---+---+
#| x| n|
#+---+---+
#| b| 1|
#| a| 3|
#+---+---+
enjoy
I found we can get even more close to the tidyverse example:
from pyspark.sql import Window
w = Window.partitionBy('x')
df.withColumn('n', f.count('x').over(w)).sort('x', 'y').show()

Apply a transformation to multiple columns pyspark dataframe

Suppose I have the following spark-dataframe:
+-----+-------+
| word| label|
+-----+-------+
| red| color|
| red| color|
| blue| color|
| blue|feeling|
|happy|feeling|
+-----+-------+
Which can be created using the following code:
sample_df = spark.createDataFrame([
('red', 'color'),
('red', 'color'),
('blue', 'color'),
('blue', 'feeling'),
('happy', 'feeling')
],
('word', 'label')
)
I can perform a groupBy() to get the counts of each word-label pair:
sample_df = sample_df.groupBy('word', 'label').count()
#+-----+-------+-----+
#| word| label|count|
#+-----+-------+-----+
#| blue| color| 1|
#| blue|feeling| 1|
#| red| color| 2|
#|happy|feeling| 1|
#+-----+-------+-----+
And then pivot() and sum() to get the label counts as columns:
import pyspark.sql.functions as f
sample_df = sample_df.groupBy('word').pivot('label').agg(f.sum('count')).na.fill(0)
#+-----+-----+-------+
#| word|color|feeling|
#+-----+-----+-------+
#| red| 2| 0|
#|happy| 0| 1|
#| blue| 1| 1|
#+-----+-----+-------+
What is the best way to transform this dataframe such that each row is divided by the total for that row?
# Desired output
+-----+-----+-------+
| word|color|feeling|
+-----+-----+-------+
| red| 1.0| 0.0|
|happy| 0.0| 1.0|
| blue| 0.5| 0.5|
+-----+-----+-------+
One way to achieve this result is to use __builtin__.sum (NOT pyspark.sql.functions.sum) to get the row-wise sum and then call withColumn() for each label:
labels = ['color', 'feeling']
sample_df.withColumn('total', sum([f.col(x) for x in labels]))\
.withColumn('color', f.col('color')/f.col('total'))\
.withColumn('feeling', f.col('feeling')/f.col('total'))\
.select('word', 'color', 'feeling')\
.show()
But there has to be a better way than enumerating each of the possible columns.
More generally, my question is:
How can I apply an arbitrary transformation, that is a function of the current row, to multiple columns simultaneously?
Found an answer on this Medium post.
First make a column for the total (as above), then use the * operator to unpack a list comprehension over the labels in select():
labels = ['color', 'feeling']
sample_df = sample_df.withColumn('total', sum([f.col(x) for x in labels]))
sample_df.select(
'word', *[(f.col(col_name)/f.col('total')).alias(col_name) for col_name in labels]
).show()
The approach shown on the linked post shows how to generalize this for arbitrary transformations.

PySpark : change column names of a df based on relations defined in another df

I have two Spark data-frames loaded from csv of the form :
mapping_fields (the df with mapped names):
new_name old_name
A aa
B bb
C cc
and
aa bb cc dd
1 2 3 43
12 21 4 37
to be transformed into :
A B C D
1 2 3
12 21 4
as dd didn't have any mapping in the original table, D column should have all null values.
How can I do this without converting the mapping_df into a dictionary and checking individually for mapped names? (this would mean I have to collect the mapping_fields and check, which kind of contradicts my use-case of distributedly handling all the datasets)
Thanks!
With melt borrowed from here you could:
from pyspark.sql import functions as f
mapping_fields = spark.createDataFrame(
[("A", "aa"), ("B", "bb"), ("C", "cc")],
("new_name", "old_name"))
df = spark.createDataFrame(
[(1, 2, 3, 43), (12, 21, 4, 37)],
("aa", "bb", "cc", "dd"))
(melt(df.withColumn("id", f.monotonically_increasing_id()),
id_vars=["id"], value_vars=df.columns, var_name="old_name")
.join(mapping_fields, ["old_name"], "left_outer")
.withColumn("value", f.when(f.col("new_name").isNotNull(), col("value")))
.withColumn("new_name", f.coalesce("new_name", f.upper(col("old_name"))))
.groupBy("id")
.pivot("new_name")
.agg(f.first("value"))
.drop("id")
.show())
+---+---+---+----+
| A| B| C| DD|
+---+---+---+----+
| 1| 2| 3|null|
| 12| 21| 4|null|
+---+---+---+----+
but in your description nothing justifies this. Because number of columns is fairly limited, I'd rather:
mapping = dict(
mapping_fields
.filter(f.col("old_name").isin(df.columns))
.select("old_name", "new_name").collect())
df.select([
(f.lit(None).cast(t) if c not in mapping else col(c)).alias(mapping.get(c, c.upper()))
for (c, t) in df.dtypes])
+---+---+---+----+
| A| B| C| DD|
+---+---+---+----+
| 1| 2| 3|null|
| 12| 21| 4|null|
+---+---+---+----+
At the end of the day you should use distributed processing when it provides performance or scalability improvements. Here it would do the opposite and make your code overly complicated.
To ignore no-matches:
(melt(df.withColumn("id", f.monotonically_increasing_id()),
id_vars=["id"], value_vars=df.columns, var_name="old_name")
.join(mapping_fields, ["old_name"])
.groupBy("id")
.pivot("new_name")
.agg(f.first("value"))
.drop("id")
.show())
or
df.select([
col(c).alias(mapping.get(c))
for (c, t) in df.dtypes if c in mapping])
I tried with a simple for loop,hope this helps too.
from pyspark.sql import functions as F
l1 = [('A','aa'),('B','bb'),('C','cc')]
l2 = [(1,2,3,43),(12,21,4,37)]
df1 = spark.createDataFrame(l1,['new_name','old_name'])
df2 = spark.createDataFrame(l2,['aa','bb','cc','dd'])
print df1.show()
+--------+--------+
|new_name|old_name|
+--------+--------+
| A| aa|
| B| bb|
| C| cc|
+--------+--------+
>>> df2.show()
+---+---+---+---+
| aa| bb| cc| dd|
+---+---+---+---+
| 1| 2| 3| 43|
| 12| 21| 4| 37|
+---+---+---+---+
when you need the missing column with null values,
>>>cols = df2.columns
>>> for i in cols:
val = df1.where(df1['old_name'] == i).first()
if val is not None:
df2 = df2.withColumnRenamed(i,val['new_name'])
else:
df2 = df2.withColumn(i,F.lit(None))
>>> df2.show()
+---+---+---+----+
| A| B| C| dd|
+---+---+---+----+
| 1| 2| 3|null|
| 12| 21| 4|null|
+---+---+---+----+
when we need only the mapping columns,changing the else part,
else:
df2 = df2.drop(i)
>>> df2.show()
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 2| 3|
| 12| 21| 4|
+---+---+---+
This will transform the original df2 dataframe though.

How to aggregate on one column and take maximum of others in pyspark?

I have columns X (string), Y (string), and Z (float).
And I want to
aggregate on X
take the maximum of column Z
report ALL the values for columns X, Y, and Z
If there are multiple values for column Y that correspond to the maximum for column Z, then take the maximum of those values in column Y.
For example, my table is like: table1:
col X col Y col Z
A 1 5
A 2 10
A 3 10
B 5 15
resulting in:
A 3 10
B 5 15
If I were using SQL, I would do it like this:
select X, Y, Z
from table1
join (select max(Z) as max_Z from table1 group by X) table2
on table1.Z = table2.max_Z
However how do I do this when 1) column Z is a float? and 2) I'm using pyspark sql?
The two following solutions are in Scala, but honestly could not resist posting them to promote my beloved window aggregate functions. Sorry.
The only question is which structured query is more performant/effective?
Window Aggregate Function: rank
val df = Seq(
("A",1,5),
("A",2,10),
("A",3,10),
("B",5,15)
).toDF("x", "y", "z")
scala> df.show
+---+---+---+
| x| y| z|
+---+---+---+
| A| 1| 5|
| A| 2| 10|
| A| 3| 10|
| B| 5| 15|
+---+---+---+
// describe window specification
import org.apache.spark.sql.expressions.Window
val byX = Window.partitionBy("x").orderBy($"z".desc).orderBy($"y".desc)
// use rank to calculate the best X
scala> df.withColumn("rank", rank over byX)
.select("x", "y", "z")
.where($"rank" === 1) // <-- take the first row
.orderBy("x")
.show
+---+---+---+
| x| y| z|
+---+---+---+
| A| 3| 10|
| B| 5| 15|
+---+---+---+
Window Aggregate Function: first and dropDuplicates
I've always been thinking about the alternatives to rank function and first usually sprung to mind.
// use first and dropDuplicates
scala> df.
withColumn("y", first("y") over byX).
withColumn("z", first("z") over byX).
dropDuplicates.
orderBy("x").
show
+---+---+---+
| x| y| z|
+---+---+---+
| A| 3| 10|
| B| 5| 15|
+---+---+---+
You can consider using Window function. My approach here is to create Window function that partition dataframe by X first. Then, order columns Y and Z by its value.
We can simply select rank == 1 for row that we're interested.
Or we can use first and drop_duplicates to achieve the same task.
PS. Thanks Jacek Laskowski for the comments and Scala solution that leads to this solution.
Create toy example dataset
from pyspark.sql.window import Window
import pyspark.sql.functions as func
data=[('A',1,5),
('A',2,10),
('A',3,10),
('B',5,15)]
df = spark.createDataFrame(data,schema=['X','Y','Z'])
Window Aggregate Function: rank
Apply windows function with rank function
w = Window.partitionBy(df['X']).orderBy([func.col('Y').desc(), func.col('Z').desc()])
df_max = df.select('X', 'Y', 'Z', func.rank().over(w).alias("rank"))
df_final = df_max.where(func.col('rank') == 1).select('X', 'Y', 'Z').orderBy('X')
df_final.show()
Output
+---+---+---+
| X| Y| Z|
+---+---+---+
| A| 3| 10|
| B| 5| 15|
+---+---+---+
Window Aggregate Function: first and drop_duplicates
This task can also be achieved by using first and drop_duplicates as follows
df_final = df.select('X', func.first('Y').over(w).alias('Y'), func.first('Z').over(w).alias('Z'))\
.drop_duplicates()\
.orderBy('X')
df_final.show()
Output
+---+---+---+
| X| Y| Z|
+---+---+---+
| A| 3| 10|
| B| 5| 15|
+---+---+---+
Lets create a dataframe from your sample data as -
data=[('A',1,5),
('A',2,10),
('A',3,10),
('B',5,15)]
df = spark.createDataFrame(data,schema=['X','Y','Z'])
df.show()
output:
+---+---+---+
| X| Y| Z|
+---+---+---+
| A| 1| 5|
| A| 2| 10|
| A| 3| 10|
| B| 5| 15|
+---+---+---+
:
# create a intermediate dataframe that find max of Z
df1 = df.groupby('X').max('Z').toDF('X2','max_Z')
:
# create 2nd intermidiate dataframe that finds max of Y where Z = max of Z
df2 = df.join(df1,df.X==df1.X2)\
.where(col('Z')==col('max_Z'))\
.groupBy('X')\
.max('Y').toDF('X','max_Y')
:
# join above two to form final result
result = df1.join(df2,df1.X2==df2.X)\
.select('X','max_Y','max_Z')\
.orderBy('X')
result.show()
:
+---+-----+-----+
| X|max_Y|max_Z|
+---+-----+-----+
| A| 3| 10|
| B| 5| 15|
+---+-----+-----+

Resources