Concat multiple columns of a dataframe using pyspark - apache-spark

Suppose I have a list of columns, for example:
col_list = ['col1','col2']
df = spark.read.json(path_to_file)
print(df.columns)
# ['col1','col2','col3']
I need to create a new column by concatenating col1 and col2. I don't want to hard-code the column names while concatenating; I need to pick them up from the list.
How can I do this?

You can use pyspark.sql.functions.concat() to concatenate as many columns as you like; simply pass them all as arguments.
from pyspark.sql.functions import concat
# Creating an example DataFrame
values = [('A1',11,'A3','A4'),('B1',22,'B3','B4'),('C1',33,'C3','C4')]
df = spark.createDataFrame(values, ['col1','col2','col3','col4'])
df.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|  A1|  11|  A3|  A4|
|  B1|  22|  B3|  B4|
|  C1|  33|  C3|  C4|
+----+----+----+----+
You pass concat() all the columns you need to concatenate, e.g. concat('col1', 'col2'). If you have a list, you can unpack it with the * operator, so concat(*['col1','col2']) is equivalent to concat('col1','col2').
col_list = ['col1','col2']
df = df.withColumn('concatenated_cols',concat(*col_list))
df.show()
+----+----+----+----+-----------------+
|col1|col2|col3|col4|concatenated_cols|
+----+----+----+----+-----------------+
|  A1|  11|  A3|  A4|             A111|
|  B1|  22|  B3|  B4|             B122|
|  C1|  33|  C3|  C4|             C133|
+----+----+----+----+-----------------+
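If you also need a separator between the values, pyspark.sql.functions.concat_ws() works the same way but takes the separator as its first argument. A minimal sketch along the same lines (the '-' separator is just an example):
from pyspark.sql.functions import concat_ws
# same unpacking of the column list, with a separator between values
df = df.withColumn('concatenated_cols', concat_ws('-', *col_list))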

Related

Managing multiple columns with duplicate names in pyspark dataframe using spark_sanitize_names

I have a dataframe with columns with duplicate names. The contents of these columns are different, but unfortunately the names are the same. I would like to rename the columns by appending, say, a number series so that each column becomes unique, like this:
foo1 | foo2 | laa3 | boo4 ...
----------------------------------
| | |
Is there a way to do that? I found a tool for scala spark here, but none for pyspark.
https://rdrr.io/cran/sparklyr/src/R/utils.R#sym-spark_sanitize_names
We can use enumerate on df.columns and append the index value to each column name, then create a new dataframe with the new column names.
In PySpark:
df.show()
#+---+---+---+---+
#|  i|  j|  k|  l|
#+---+---+---+---+
#|  a|  1|  v|  p|
#+---+---+---+---+
new_cols=[elm + str(index+1) for index,elm in enumerate(df.columns)]
#['i1', 'j2', 'k3', 'l4']
#creating new dataframe with new column names
df1=df.toDF(*new_cols)
df1.show()
#+---+---+---+---+
#| i1| j2| k3| l4|
#+---+---+---+---+
#|  a|  1|  v|  p|
#+---+---+---+---+
In Scala:
val new_cols=df.columns.zipWithIndex.collect{case(a,i) => a+(i+1)}
val df1=df.toDF(new_cols:_*)
df1.show()
//+---+---+---+---+
//| i1| j2| k3| l4|
//+---+---+---+---+
//|  a|  1|  v|  p|
//+---+---+---+---+
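If you only want to suffix the names that actually repeat, rather than every column, a small PySpark sketch (assuming plain string column names) could look like this:
from collections import Counter
counts = Counter(df.columns)
seen = Counter()
new_cols = []
for c in df.columns:
    if counts[c] > 1:
        # only duplicated names get a numeric suffix
        seen[c] += 1
        new_cols.append(c + str(seen[c]))
    else:
        new_cols.append(c)
df1 = df.toDF(*new_cols)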

Case wise using mapping from columns to fill value in another column in a pyspark dataframe

I have a data frame with multiple columns:
+-----------+-----------+-----------+
|       col1|       col2|       col3|
+-----------+-----------+-----------+
|         s1|         c1|         p3|
|         s2|         c1|         p3|
|         s1|         c3|         p3|
|         s3|         c4|         p4|
|         s4|         c5|         p4|
|         s2|         c6|         p4|
+-----------+-----------+-----------+
What I want to achieve is to create a new column from a mapping over multiple columns using, let's say, dicts (since the number of unique values is large, individual when/case statements would be tedious).
The idea is to first map the values of col1; if null values remain in the new column, fill them from the col2 mapping; if nulls still remain, fill them from the col3 mapping; and finally replace any remaining nulls with a string literal:
col1_map = {'s1' : 'apple', 's3' : 'orange'}
col2_map = {'c1' : 'potato', 'c6' : 'tomato'}
col3_map = {'p3' : 'ball', 'p4' : 'bat'}
The final output would look like this:
+-----------+-----------+-----------+-----------+
|       col1|       col2|       col3|       col4|
+-----------+-----------+-----------+-----------+
|         s1|         c1|         p3|      apple|
|         s2|         c1|         p3|     potato|
|         s1|         c3|         p3|      apple|
|         s3|         c4|         p4|     orange|
|         s4|         c5|         p4|        bat|
|         s2|         c6|         p4|     tomato|
+-----------+-----------+-----------+-----------+
My approach so far is to create the new column from the col1 mapping:
from itertools import chain
from pyspark.sql.functions import create_map, lit
mapping_expr = create_map([lit(x) for x in chain(*col1_map.items())])
df = df.withColumn('col4', mapping_expr[df['col1']])
This fills col4 from the col1 mapping. However, my issue is that if I repeat this for col2 and there is already a mapped value from col1 in col4, the new mapping will overwrite it, which I do not want.
Does anyone have any suggestion to maintain this order of addition of values in the new column?
You were almost right; you just need to apply the mapping expressions in succession.
from pyspark.sql.functions import col, create_map, lit, when
from itertools import chain
values = [('s1','c1','p3'),('s2','c1','p3'),('s1','c3','p3'),('s3','c4','p4'),('s4','c5','p4'),('s2','c6','p4')]
df = spark.createDataFrame(values, ['col1','col2','col3'])
df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|  s1|  c1|  p3|
|  s2|  c1|  p3|
|  s1|  c3|  p3|
|  s3|  c4|  p4|
|  s4|  c5|  p4|
|  s2|  c6|  p4|
+----+----+----+
Take the dictionaries as you provided them and create a map expression from each:
col1_map = {'s1' : 'apple', 's3' : 'orange'}
col2_map = {'c1' : 'potato', 'c6' : 'tomato'}
col3_map = {'p3' : 'ball', 'p4' : 'bat'}
# Build a map expression from each dictionary
mapping_expr1 = create_map([lit(x) for x in chain(*col1_map.items())])
mapping_expr2 = create_map([lit(x) for x in chain(*col2_map.items())])
mapping_expr3 = create_map([lit(x) for x in chain(*col3_map.items())])
Finally, apply the mappings in succession. The only extra step is checking whether, after handling col1/col2, we still have nulls, which is done with the isNull() function.
df=df.withColumn('col4', mapping_expr1.getItem(col('col1')))
df=df.withColumn('col4', when(col('col4').isNull(),mapping_expr2.getItem(col('col2'))).otherwise(col('col4')))
df=df.withColumn('col4', when(col('col4').isNull(),mapping_expr3.getItem(col('col3'))).otherwise(col('col4')))
df.show()
+----+----+----+------+
|col1|col2|col3|  col4|
+----+----+----+------+
|  s1|  c1|  p3| apple|
|  s2|  c1|  p3|potato|
|  s1|  c3|  p3| apple|
|  s3|  c4|  p4|orange|
|  s4|  c5|  p4|   bat|
|  s2|  c6|  p4|tomato|
+----+----+----+------+
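A more compact variant, which also covers the final string-literal default the question asks for, is to chain the three lookups with coalesce(). A sketch, where 'unknown' stands in for whatever literal you prefer:
from pyspark.sql.functions import coalesce
df = df.withColumn('col4', coalesce(
    mapping_expr1.getItem(col('col1')),
    mapping_expr2.getItem(col('col2')),
    mapping_expr3.getItem(col('col3')),
    lit('unknown')))  # fallback literal for rows no mapping covers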

How to add column with alternate values in PySpark dataframe?

I have the following sample dataframe
df = spark.createDataFrame([('start','end'), ('start1','end1')] ,["start", "end"])
and I want to explode the values in each row and associate alternating 1-0 values in the generated rows. This way I can identify the start/end entries in each row.
I am able to achieve the desired result this way
import pyspark.sql.functions as fn
from pyspark.sql.functions import lit
from pyspark.sql.window import Window

w = Window().orderBy(lit('A'))
df = (df.withColumn('start_end', fn.array('start', 'end'))
        .withColumn('date', fn.explode('start_end'))
        .withColumn('row_num', fn.row_number().over(w)))
df = (df.withColumn('is_start', fn.when(fn.col('row_num') % 2 == 0, 0).otherwise(1))
        .select('date', 'is_start'))
which gives
| date | is_start |
|--------|----------|
| start | 1 |
| end | 0 |
| start1 | 1 |
| end1 | 0 |
but it seems overly complicated for such a simple task.
Is there any better/cleaner way without using UDFs?
You can use pyspark.sql.functions.posexplode along with pyspark.sql.functions.array.
First build an array from your end and start columns, in that order, so that the position (0 for end, 1 for start) doubles as the is_start flag, then explode it together with its position:
from pyspark.sql.functions import array, posexplode
df.select(posexplode(array("end", "start")).alias("is_start", "date")).show()
#+--------+------+
#|is_start|  date|
#+--------+------+
#|       0|   end|
#|       1| start|
#|       0|  end1|
#|       1|start1|
#+--------+------+
You can try union:
from pyspark.sql import functions as F

df = spark.createDataFrame([('start','end'), ('start1','end1')], ["start", "end"])
df = df.withColumn('startv', F.lit(1))
df = df.withColumn('endv', F.lit(0))
df = df.select(['start', 'startv']).union(df.select(['end', 'endv']))
df.show()
+------+------+
| start|startv|
+------+------+
| start|     1|
|start1|     1|
|   end|     0|
|  end1|     0|
+------+------+
From here you can rename the columns and re-order the rows as needed, for example as sketched below.
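A minimal sketch of that renaming step (after the union, the columns keep the names from the first select):
df = (df.withColumnRenamed('start', 'date')
        .withColumnRenamed('startv', 'is_start'))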
I had a similar situation in my use case. I had a huge dataset (~50 GB), and any self-join or heavy transformation resulted in higher memory use and unstable execution.
I went one level lower and used flatMap on the underlying RDD. This is a map-side transformation, so it is cheap in terms of shuffle, CPU, and memory.
df = spark.createDataFrame([('start','end'), ('start1','end1')] ,["start", "end"])
df.show()
+------+----+
| start| end|
+------+----+
| start| end|
|start1|end1|
+------+----+
final_df = df.rdd.flatMap(lambda row: [(row.start, 1), (row.end, 0)]).toDF(['date', 'is_start'])
final_df.show()
+------+--------+
|  date|is_start|
+------+--------+
| start|       1|
|   end|       0|
|start1|       1|
|  end1|       0|
+------+--------+

How to do a conditional aggregation after a groupby in pyspark dataframe?

I'm trying to group by an ID column in a pyspark dataframe and sum a column depending on the value of another column.
To illustrate, consider the following dummy dataframe:
+-----+-------+---------+
|   ID|   type|   amount|
+-----+-------+---------+
|    1|      a|       55|
|    2|      b|     1455|
|    2|      a|       20|
|    2|      b|      100|
|    3|   null|      230|
+-----+-------+---------+
My desired output is:
+-----+--------+----------+----------+
|   ID|   sales|   sales_a|   sales_b|
+-----+--------+----------+----------+
|    1|      55|        55|         0|
|    2|    1575|        20|      1555|
|    3|     230|         0|         0|
+-----+--------+----------+----------+
So basically, sales will be the sum of amount, while sales_a and sales_b are the sum of amount when type is a or b respectively.
For sales, I know this could be done like this:
from pyspark.sql import functions as F
df = df.groupBy("ID").agg(F.sum("amount").alias("sales"))
For the others, I'm guessing F.when would be useful but I'm not sure how to go about it.
You could create two columns before the aggregation, based on the value of type:
from pyspark.sql import functions as F
from pyspark.sql.functions import col

df.withColumn("sales_a", F.when(col("type") == "a", col("amount"))) \
  .withColumn("sales_b", F.when(col("type") == "b", col("amount"))) \
  .groupBy("ID") \
  .agg(F.sum("amount").alias("sales"),
       F.sum("sales_a").alias("sales_a"),
       F.sum("sales_b").alias("sales_b"))
Another option is to pivot on type and join the result back to the sales aggregate:
from pyspark.sql import functions as F

dfSales = df.groupBy("ID").agg(F.sum("amount").alias("sales"))
dfPivot = df.filter("type is not null").groupBy("ID").pivot("type").agg(F.sum("amount").alias("sales"))
res = dfSales.join(dfPivot, on="ID", how="left")
Then replace the nulls with 0.
This is a generic solution that works irrespective of the values in the type column; if a type c is added to the dataframe, a column for c will be created automatically.
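A minimal sketch of that final step; the renaming is optional, and the pivoted column names a and b are assumptions about what pivot("type") produces here:
res = (res.na.fill(0)
          .withColumnRenamed('a', 'sales_a')
          .withColumnRenamed('b', 'sales_b'))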

Apply a transformation to multiple columns pyspark dataframe

Suppose I have the following spark-dataframe:
+-----+-------+
| word|  label|
+-----+-------+
|  red|  color|
|  red|  color|
| blue|  color|
| blue|feeling|
|happy|feeling|
+-----+-------+
Which can be created using the following code:
sample_df = spark.createDataFrame([
        ('red', 'color'),
        ('red', 'color'),
        ('blue', 'color'),
        ('blue', 'feeling'),
        ('happy', 'feeling')
    ],
    ('word', 'label')
)
I can perform a groupBy() to get the counts of each word-label pair:
sample_df = sample_df.groupBy('word', 'label').count()
#+-----+-------+-----+
#| word|  label|count|
#+-----+-------+-----+
#| blue|  color|    1|
#| blue|feeling|    1|
#|  red|  color|    2|
#|happy|feeling|    1|
#+-----+-------+-----+
And then pivot() and sum() to get the label counts as columns:
import pyspark.sql.functions as f
sample_df = sample_df.groupBy('word').pivot('label').agg(f.sum('count')).na.fill(0)
#+-----+-----+-------+
#| word|color|feeling|
#+-----+-----+-------+
#|  red|    2|      0|
#|happy|    0|      1|
#| blue|    1|      1|
#+-----+-----+-------+
What is the best way to transform this dataframe such that each row is divided by the total for that row?
# Desired output
+-----+-----+-------+
| word|color|feeling|
+-----+-----+-------+
|  red|  1.0|    0.0|
|happy|  0.0|    1.0|
| blue|  0.5|    0.5|
+-----+-----+-------+
One way to achieve this result is to use Python's built-in sum (NOT pyspark.sql.functions.sum) to get the row-wise sum and then call withColumn() for each label:
labels = ['color', 'feeling']
sample_df.withColumn('total', sum([f.col(x) for x in labels]))\
    .withColumn('color', f.col('color')/f.col('total'))\
    .withColumn('feeling', f.col('feeling')/f.col('total'))\
    .select('word', 'color', 'feeling')\
    .show()
But there has to be a better way than enumerating each of the possible columns.
More generally, my question is:
How can I apply an arbitrary transformation, that is a function of the current row, to multiple columns simultaneously?
Found an answer on this Medium post.
First make a column for the total (as above), then use the * operator to unpack a list comprehension over the labels in select():
labels = ['color', 'feeling']
sample_df = sample_df.withColumn('total', sum([f.col(x) for x in labels]))
sample_df.select(
    'word', *[(f.col(col_name)/f.col('total')).alias(col_name) for col_name in labels]
).show()
The approach shown on the linked post shows how to generalize this for arbitrary transformations.
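As a hedged illustration of that idea (the apply_to_labels helper below is hypothetical, not the code from the post), you can parametrize the per-column expression and reuse the same select for any row-wise transformation:
# transform is any function mapping a Column to a Column
def apply_to_labels(df, label_cols, transform):
    return df.select('word', *[transform(f.col(c)).alias(c) for c in label_cols])

# the same normalization as above, expressed through the helper
apply_to_labels(sample_df, labels, lambda c: c / f.col('total')).show()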
