Choose the column having more data

Choose the column having more data - apache-spark

I have to select a column out of two columns which has more data or values in it using PySpark and keep it in my DataFrame.
For example, we have two columns A and B:
In example, the column B has more values so I will keep it in my DF for transformations. Similarly, I would take A, if A had more values. I think we can do it using if else conditions, but I'm not able to get the correct logic.

You could first aggregate the columns (count the values in each). This way you will get just 1 row which you could extract as dictionary using .head().asDict(). Then use Python's max(your_dict, key=your_dict.get) to get the dict's key having the max value (i.e. the name of the column which has maximum number of values). Then just select this column.
Example input:
from pyspark.sql import functions as F
df = spark.createDataFrame([(1, 7), (2, 4), (3, 7), (None, 8), (None, 4)], ['A', 'B'])
df.show()
# +----+---+
# | A| B|
# +----+---+
# | 1| 7|
# | 2| 4|
# | 3| 7|
# |null| 8|
# |null| 4|
# +----+---+
Scalable script using built-in max:
val_cnt = df.agg(*[F.count(c).alias(c) for c in {'A', 'B'}]).head().asDict()
df = df.select(max(val_cnt, key=val_cnt.get))
df.show()
# +---+
# | B|
# +---+
# | 7|
# | 4|
# | 7|
# | 8|
# | 4|
# +---+
Script for just 2 columns (A and B):
head = df.agg(*[F.count(c).alias(c) for c in {'A', 'B'}]).head()
df = df.select('B' if head.B > head.A else 'A')
df.show()
# +---+
# | B|
# +---+
# | 7|
# | 4|
# | 7|
# | 8|
# | 4|
# +---+
Scalable script adjustable to more columns, without built-in max:
val_cnt = df.agg(*[F.count(c).alias(c) for c in {'A', 'B'}]).head().asDict()
key, val = '', -1
for k, v in val_cnt.items():
if v > val:
key, val = k, v
df = df.select(key)
df.show()
# +---+
# | B|
# +---+
# | 7|
# | 4|
# | 7|
# | 8|
# | 4|
# +---+

Create a data frame with the data
df = spark.createDataFrame(data=[(1,7),(2,4),(3,7),(4,8),(5,0),(6,0),(None,3),(None,5),(None,8),(None,4)],schema = ['A','B'])
Define a condition to check for that
from pyspark.sql.functions import *
import pyspark.sql.functions as fx
condition = fx.when((fx.col('A').isNotNull() & (fx.col('A')>fx.col('B'))),fx.col('A')).otherwise(fx.col('B'))
df_1 = df.withColumn('max_value_among_A_and_B',condition)
Print the dataframe
df_1.show()
Please check the below screenshot for details
or
If you want to pick up the whole column just based on the count. you can try this:
from pyspark.sql.functions import *
import pyspark.sql.functions as fx
df = spark.createDataFrame(data=[(1,7),(2,4),(3,7),(4,8),(5,0),(6,0),(None,3),(None,5),(None,8),(None,4)],schema = ['A','B'])
if df.select('A').count() > df.select('B').count():
pickcolumn = 'A'
else:
pickcolumn = 'B'
df_1 = df.withColumn('NewColumnm',col(pickcolumn)).drop('A','B')
df_1.show()

Related

Merge two columns in a single DataFrame and count the occurrences using PySpark

I've two columns in my DataFrame name1 and name2.
I want to join them and count the occurrence (without Null values!).
df = spark.createDataFrame([
["Luc Krier","Jeanny Thorn"],
["Jeanny Thorn","Ben Weller"],
[ "Teddy E Beecher","Luc Krier"],
["Philippe Schauss","Jeanny Thorn"],
["Meindert I Tholen","Liam Muller"],
["Meindert I Tholen",""]
]).toDF("name1", "name2")
Desired result:
+------------------------------+
|name |Occurrence |
+------------------------------+
|Luc Krier |2 |
|Jeanny Thorn |3 |
|Teddy E Beecher |1 |
|Philippe Schauss |1 |
|Meindert I Tholen |2 |
|Liam Muller |1 |
|Ben Weller |1 |
+------------------------------+
How can I achieve this?

You can use explode with array fuction to merge the columns into one then simply group by and count, like this :
from pyspark.sql.functions import col, array, explode, count
df.select(explode(array("name1", "name2")).alias("name")) \
.filter("nullif(name, '') is not null") \
.groupBy("name") \
.agg(count("*").alias("Occurrence")) \
.show()
#+-----------------+----------+
#| name|Occurrence|
#+-----------------+----------+
#|Meindert I Tholen| 2|
#| Jeanny Thorn| 3|
#| Luc Krier| 2|
#| Teddy E Beecher| 1|
#|Philippe Schauss| 1|
#| Ben Weller| 1|
#| Liam Muller| 1|
#+-----------------+----------+
Another way is to select each column, union then group by and count:
df.select(col("name1").alias("name")).union(df.select(col("name2").alias("name"))) \
.filter("nullif(name, '') is not null")\
.groupBy("name") \
.agg(count("name").alias("Occurrence")) \
.show()

Many fancy answers out there, but the easiest solution should be to do a union and then aggregate the count:
df2 = (df.select('name1')
.union(df.select('name2'))
.filter("name1 != ''")
.groupBy('name1')
.count()
.toDF('name', 'Occurrence')
)
df2.show()
+-----------------+----------+
| name|Occurrence|
+-----------------+----------+
|Meindert I Tholen| 2|
| Jeanny Thorn| 3|
| Luc Krier| 2|
| Teddy E Beecher| 1|
|Philippe Schauss| 1|
| Ben Weller| 1|
| Liam Muller| 1|
+-----------------+----------+

There are better ways to do it. One naive way of doing it is as follows
from collections import Counter
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("OccurenceCount").getOrCreate()
df = spark.createDataFrame([
["Luc Krier","Jeanny Thorn"],
["Jeanny Thorn","Ben Weller"],
[ "Teddy E Beecher","Luc Krier"],
["Philippe Schauss","Jeanny Thorn"],
["Meindert I Tholen","Liam Muller"],
["Meindert I Tholen",""]
]).toDF("name1", "name2")
counter_dict = dict(Counter(df.select("name1", "name2").rdd.flatMap(lambda x: x).collect()))
counter_list = list(map(list, counter_dict.items()))
frequency_df = spark.createDataFrame(counter_list, ["name", "Occurrence"])
frequency_df.show()
Output:
+-----------------+----------+
| name|Occurrence|
+-----------------+----------+
| | 1|
| Liam Muller| 1|
| Teddy E Beecher| 1|
| Ben Weller| 1|
| Jeanny Thorn| 3|
| Luc Krier| 2|
|Philippe Schauss| 1|
|Meindert I Tholen| 2|
+-----------------+----------+

Does this work?
# Groupby & count both dataframes individually to reduce size.
df_name1 = (df.groupby(['name1']).count()
.withColumnRenamed('name1', 'name')
.withColumnRenamed('count', 'count1'))
df_name2 = (df.groupby(['name2']).count()
.withColumnRenamed('name2', 'name')
.withColumnRenamed('count', 'count2'))
# Join the two dataframes containing frequency counts
# Any null value in the 'count' column can be correctly interpreted as zero.
df_count = (df_name1.join(df_name2, on=['name'], how='outer')
.fillna(0, subset=['count1', 'count2']))
# Sum the two counts and drop the useless columns
df_count = (df_count.withColumn('count', df_count['count1'] + df_count['count2'])
.drop('count1').drop('count2').dropna(subset=['name']))
# (Optional) While any rows with a null name have been removed, rows with an
# empty string ("") for a name are still there. We can drop the empty name
# rows like this.
df_count = df_count[df_count['name'] != '']
df_count.show()
# +-----------------+-----+
# | name|count|
# +-----------------+-----+
# |Meindert I Tholen| 2|
# | Jeanny Thorn| 3|
# | Luc Krier| 2|
# | Teddy E Beecher| 1|
# |Philippe Schauss| 1|
# | Ben Weller| 1|
# | Liam Muller| 1|
# +-----------------+-----+

You can get the required output as follows in scala :
import org.apache.spark.sql.functions._
val df = Seq(
("Luc Krier","Jeanny Thorn"),
("Jeanny Thorn","Ben Weller"),
( "Teddy E Beecher","Luc Krier"),
("Philippe Schauss","Jeanny Thorn"),
("Meindert I Tholen","Liam Muller"),
("Meindert I Tholen","")
).toDF("name1", "name2")
val df1 = df.filter($"name1".isNotNull).filter($"name1" !==
"").groupBy("name1").agg(count("name1").as("count1"))
val df2 = df.filter($"name2".isNotNull).filter($"name2" !==
"").groupBy("name2").agg(count("name2").as("count2"))
val newdf = df1.join(df2, $"name1" === $"name2","outer").withColumn("count1",
when($"count1".isNull,0).otherwise($"count1")).withColumn("count2",
when($"count2".isNull,0).otherwise($"count2")).withColumn("Count",$"count1" +
$"count2")
val finalDF =newdf.withColumn("name",when($"name1".isNull,$"name2")
.when($"name2".isNull,$"name1").otherwise($"name1")).select("name","Count")
display(finalDF)
You can see the final output as image below :

How do I use multiple conditions with pyspark.sql.funtions.when() from a dict?

I want to generate a when clause based on values in a dict. Its very similar to what's being done How do I use multiple conditions with pyspark.sql.funtions.when()?
Only I want to pass a dict of cols and values
Let's say I have a dict:
{
'employed': 'Y',
'athlete': 'N'
}
I want to use that dict to generate the equivalent of:
df.withColumn("call_person",when((col("employed") == "Y") & (col("athlete") == "N"), "Y")
So the end result is:
+---+-----------+--------+-------+
| id|call_person|employed|athlete|
+---+-----------+--------+-------+
| 1| Y | Y | N |
| 2| N | Y | Y |
| 3| N | N | N |
+---+-----------+--------+-------+
Note part of the reason I want to do it programmatically is I have different length dicts (number of conditions)

Use reduce() function:
from functools import reduce
from pyspark.sql.functions import when, col
# dictionary
d = {
'employed': 'Y',
'athlete': 'N'
}
# set up the conditions, multiple conditions merged with `&`
cond = reduce(lambda x,y: x&y, [ col(c) == v for c,v in d.items() if c in df.columns ])
# set up the new column
df.withColumn("call_person", when(cond, "Y").otherwise("N")).show()
+---+--------+-------+-----------+
| id|employed|athlete|call_person|
+---+--------+-------+-----------+
| 1| Y| N| Y|
| 2| Y| Y| N|
| 3| N| N| N|
+---+--------+-------+-----------+

you can access dictionary items directly also:
dict ={
'code': 'b',
'amt': '4'
}
list = [(1, 'code'),(1,'amt')]
df=spark.createDataFrame(list, ['id', 'dict_key'])
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
user_func = udf (lambda x: dict.get(x), StringType())
newdf = df.withColumn('new_column',user_func(df.dict_key))
>>> newdf.show();
+---+--------+----------+
| id|dict_key|new_column|
+---+--------+----------+
| 1| code| b|
| 1| amt| 4|
+---+--------+----------+
or broadcasting a dictionary
broadcast_dict = sc.broadcast(dict)
def my_func(key):
return broadcast_dict.value.get(key)
new_my_func = udf(my_func, StringType())
newdf = df.withColumn('new_column',new_my_func(df.dict_key))
>>> newdf.show();
+---+--------+----------+
| id|dict_key|new_column|
+---+--------+----------+
| 1| code| b|
| 1| amt| 4|
+---+--------+----------+

Dot Products of Rows of a Dataframe with a Fixed Vector in Spark

I have a dataframe (df1) with m rows and n columns in Spark. I have another dataframe (df2) with 1 row and n columns. How can I efficiently compute the dot product of each row of df1 with the single row of df2?

We can use VectorAssembler to do dot product.
Create sample DataFrames:
from pyspark.ml.linalg import Vectors, DenseVector
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.types import *
v = [('a', 1,2,3),
('b', 4,5,6),
('c', 9,8,7)]
df1 = spark.createDataFrame(v, ['id', 'v1', 'v2', 'v3'])
df2 = spark.createDataFrame([('d',3,2,1)], ['id', 'v1', 'v2', 'v3'])
df1.show()
df2.show()
They look like this:
+---+---+---+---+
| id| v1| v2| v3|
+---+---+---+---+
| a| 1| 2| 3|
| b| 4| 5| 6|
| c| 9| 8| 7|
+---+---+---+---+
+---+---+---+---+
| id| v1| v2| v3|
+---+---+---+---+
| d| 3| 2| 1|
+---+---+---+---+
Use VectorAssembler to convert the columns to Vector
vecAssembler = VectorAssembler(inputCols=["v1", "v2", "v3"], outputCol="values")
dfv1 = vecAssembler.transform(df1)
dfv2 = vecAssembler.transform(df2)
dfv1.show()
dfv2.show()
Now they look like this:
+---+---+---+---+-------------+
| id| v1| v2| v3| values|
+---+---+---+---+-------------+
| a| 1| 2| 3|[1.0,2.0,3.0]|
| b| 4| 5| 6|[4.0,5.0,6.0]|
| c| 9| 8| 7|[9.0,8.0,7.0]|
+---+---+---+---+-------------+
+---+---+---+---+-------------+
| id| v1| v2| v3| values|
+---+---+---+---+-------------+
| d| 3| 2| 1|[3.0,2.0,1.0]|
+---+---+---+---+-------------+
Define a udf to do the dot product
# Get the fixed vector from DataFrame dfv2
vm = Vectors.dense(dfv2.take(1)[0]['values'])
dot_prod_udf = F.udf(lambda v: float(v.dot(vm)), FloatType())
dfv1 = dfv1.withColumn('dot_prod', dot_prod_udf('values'))
dfv1.show()
The final result is:
+---+---+---+---+-------------+--------+
| id| v1| v2| v3| values|dot_prod|
+---+---+---+---+-------------+--------+
| a| 1| 2| 3|[1.0,2.0,3.0]| 10.0|
| b| 4| 5| 6|[4.0,5.0,6.0]| 28.0|
| c| 9| 8| 7|[9.0,8.0,7.0]| 50.0|
+---+---+---+---+-------------+--------+

You can simply form a matrix with the first data frame and another matrix with the second data frame and multiply them. Here is a code snippet to use (here I'm using block matrix since I assume your data frame can not be stored in your local machine)
v = [('a', [1,2,3]),
('b', [4,5,6]),
('c', [9,8,7])]
df1 = spark.createDataFrame(v, ['id', 'vec'])
df2 = spark.createDataFrame([('d',[3,2,1])], ['id', 'vec'])
m1 = matdf1.toBlockMatrix(100,100)
m2 = matdf2.toBlockMatrix(100,100)
m1.multiply(m2.transpose())

Since Python udfs are not very performant, its probably better to implement it in Scala. But if you want a pure Python implementation, you can try the following hack.
A dot product is equivalent to a linear prediction. So you can leverage LinearRegressionModel.transform if you create one with the desired coefficients. You can do that by "estimating" a regression with the features as an identity matrix and the labels as your desired coefficients. See the following implementation:
from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.regression import LinearRegression
from pyspark.sql import Row, DataFrame
class DotProduct:
_regressors_col = 'regressors'
_dot_product_col = 'dot_product'
_coef_col = 'coef'
_index_col = 'index'
def __init__(
self,
coefficients: np.ndarray
):
self._coefficients = coefficients
def transform(self, dataframe: DataFrame) -> DataFrame:
coef_df = self._convert_coefficients_to_dataframe()
w_identity_df = self._add_identity_matrix_as_regressors(coef_df)
linear_reg = LinearRegression(
featuresCol=self._regressors_col,
labelCol=self._coef_col,
predictionCol=self._dot_product_col,
fitIntercept=False,
standardization=False
)
model = linear_reg.fit(w_identity_df)
return model.transform(dataset=dataframe)
def _convert_coefficients_to_dataframe(self) -> DataFrame:
rows = [
Row(**{self._index_col: index, self._coef_col: float(coef)})
for index, coef
in enumerate(self._coefficients)
]
return spark_session.createDataFrame(rows)
def _add_identity_matrix_as_regressors(
self,
coefficients_df: DataFrame
) -> DataFrame:
encoder = OneHotEncoder(
inputCols=[self._index_col],
outputCols=[self._regressors_col],
dropLast=False
)
model = encoder.fit(coefficients_df)
return model.transform(coefficients_df)

Explode 2 columns (2 lists) in the same time in pyspark [duplicate]

I have a dataframe which has one row, and several columns. Some of the columns are single values, and others are lists. All list columns are the same length. I want to split each list column into a separate row, while keeping any non-list column as is.
Sample DF:
from pyspark import Row
from pyspark.sql import SQLContext
from pyspark.sql.functions import explode
sqlc = SQLContext(sc)
df = sqlc.createDataFrame([Row(a=1, b=[1,2,3],c=[7,8,9], d='foo')])
# +---+---------+---------+---+
# | a| b| c| d|
# +---+---------+---------+---+
# | 1|[1, 2, 3]|[7, 8, 9]|foo|
# +---+---------+---------+---+
What I want:
+---+---+----+------+
| a| b| c | d |
+---+---+----+------+
| 1| 1| 7 | foo |
| 1| 2| 8 | foo |
| 1| 3| 9 | foo |
+---+---+----+------+
If I only had one list column, this would be easy by just doing an explode:
df_exploded = df.withColumn('b', explode('b'))
# >>> df_exploded.show()
# +---+---+---------+---+
# | a| b| c| d|
# +---+---+---------+---+
# | 1| 1|[7, 8, 9]|foo|
# | 1| 2|[7, 8, 9]|foo|
# | 1| 3|[7, 8, 9]|foo|
# +---+---+---------+---+
However, if I try to also explode the c column, I end up with a dataframe with a length the square of what I want:
df_exploded_again = df_exploded.withColumn('c', explode('c'))
# >>> df_exploded_again.show()
# +---+---+---+---+
# | a| b| c| d|
# +---+---+---+---+
# | 1| 1| 7|foo|
# | 1| 1| 8|foo|
# | 1| 1| 9|foo|
# | 1| 2| 7|foo|
# | 1| 2| 8|foo|
# | 1| 2| 9|foo|
# | 1| 3| 7|foo|
# | 1| 3| 8|foo|
# | 1| 3| 9|foo|
# +---+---+---+---+
What I want is - for each column, take the nth element of the array in that column and add that to a new row. I've tried mapping an explode accross all columns in the dataframe, but that doesn't seem to work either:
df_split = df.rdd.map(lambda col: df.withColumn(col, explode(col))).toDF()

Spark >= 2.4
You can replace zip_ udf with arrays_zip function
from pyspark.sql.functions import arrays_zip, col, explode
(df
.withColumn("tmp", arrays_zip("b", "c"))
.withColumn("tmp", explode("tmp"))
.select("a", col("tmp.b"), col("tmp.c"), "d"))
Spark < 2.4
With DataFrames and UDF:
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType
from pyspark.sql.functions import col, udf, explode
zip_ = udf(
lambda x, y: list(zip(x, y)),
ArrayType(StructType([
# Adjust types to reflect data types
StructField("first", IntegerType()),
StructField("second", IntegerType())
]))
)
(df
.withColumn("tmp", zip_("b", "c"))
# UDF output cannot be directly passed to explode
.withColumn("tmp", explode("tmp"))
.select("a", col("tmp.first").alias("b"), col("tmp.second").alias("c"), "d"))
With RDDs:
(df
.rdd
.flatMap(lambda row: [(row.a, b, c, row.d) for b, c in zip(row.b, row.c)])
.toDF(["a", "b", "c", "d"]))
Both solutions are inefficient due to Python communication overhead. If data size is fixed you can do something like this:
from functools import reduce
from pyspark.sql import DataFrame
# Length of array
n = 3
# For legacy Python you'll need a separate function
# in place of method accessor
reduce(
DataFrame.unionAll,
(df.select("a", col("b").getItem(i), col("c").getItem(i), "d")
for i in range(n))
).toDF("a", "b", "c", "d")
or even:
from pyspark.sql.functions import array, struct
# SQL level zip of arrays of known size
# followed by explode
tmp = explode(array(*[
struct(col("b").getItem(i).alias("b"), col("c").getItem(i).alias("c"))
for i in range(n)
]))
(df
.withColumn("tmp", tmp)
.select("a", col("tmp").getItem("b"), col("tmp").getItem("c"), "d"))
This should be significantly faster compared to UDF or RDD. Generalized to support an arbitrary number of columns:
# This uses keyword only arguments
# If you use legacy Python you'll have to change signature
# Body of the function can stay the same
def zip_and_explode(*colnames, n):
return explode(array(*[
struct(*[col(c).getItem(i).alias(c) for c in colnames])
for i in range(n)
]))
df.withColumn("tmp", zip_and_explode("b", "c", n=3))

You'd need to use flatMap, not map as you want to make multiple output rows out of each input row.
from pyspark.sql import Row
def dualExplode(r):
rowDict = r.asDict()
bList = rowDict.pop('b')
cList = rowDict.pop('c')
for b,c in zip(bList, cList):
newDict = dict(rowDict)
newDict['b'] = b
newDict['c'] = c
yield Row(**newDict)
df_split = sqlContext.createDataFrame(df.rdd.flatMap(dualExplode))

One liner (for Spark>=2.4.0):
df.withColumn("bc", arrays_zip("b","c"))
.select("a", explode("bc").alias("tbc"))
.select("a", col"tbc.b", "tbc.c").show()
Import required:
from pyspark.sql.functions import arrays_zip
Steps -
Create a column bc which is an array_zip of columns b and c
Explode bc to get a struct tbc
Select the required columns a, b and c (all exploded as required).
Output:
> df.withColumn("bc", arrays_zip("b","c")).select("a", explode("bc").alias("tbc")).select("a", "tbc.b", col("tbc.c")).show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 1| 7|
| 1| 2| 8|
| 1| 3| 9|
+---+---+---+

Convert column of lists into one column of values in Pyspark [duplicate]

I have a dataframe which has one row, and several columns. Some of the columns are single values, and others are lists. All list columns are the same length. I want to split each list column into a separate row, while keeping any non-list column as is.
Sample DF:
from pyspark import Row
from pyspark.sql import SQLContext
from pyspark.sql.functions import explode
sqlc = SQLContext(sc)
df = sqlc.createDataFrame([Row(a=1, b=[1,2,3],c=[7,8,9], d='foo')])
# +---+---------+---------+---+
# | a| b| c| d|
# +---+---------+---------+---+
# | 1|[1, 2, 3]|[7, 8, 9]|foo|
# +---+---------+---------+---+
What I want:
+---+---+----+------+
| a| b| c | d |
+---+---+----+------+
| 1| 1| 7 | foo |
| 1| 2| 8 | foo |
| 1| 3| 9 | foo |
+---+---+----+------+
If I only had one list column, this would be easy by just doing an explode:
df_exploded = df.withColumn('b', explode('b'))
# >>> df_exploded.show()
# +---+---+---------+---+
# | a| b| c| d|
# +---+---+---------+---+
# | 1| 1|[7, 8, 9]|foo|
# | 1| 2|[7, 8, 9]|foo|
# | 1| 3|[7, 8, 9]|foo|
# +---+---+---------+---+
However, if I try to also explode the c column, I end up with a dataframe with a length the square of what I want:
df_exploded_again = df_exploded.withColumn('c', explode('c'))
# >>> df_exploded_again.show()
# +---+---+---+---+
# | a| b| c| d|
# +---+---+---+---+
# | 1| 1| 7|foo|
# | 1| 1| 8|foo|
# | 1| 1| 9|foo|
# | 1| 2| 7|foo|
# | 1| 2| 8|foo|
# | 1| 2| 9|foo|
# | 1| 3| 7|foo|
# | 1| 3| 8|foo|
# | 1| 3| 9|foo|
# +---+---+---+---+
What I want is - for each column, take the nth element of the array in that column and add that to a new row. I've tried mapping an explode accross all columns in the dataframe, but that doesn't seem to work either:
df_split = df.rdd.map(lambda col: df.withColumn(col, explode(col))).toDF()

Spark >= 2.4
You can replace zip_ udf with arrays_zip function
from pyspark.sql.functions import arrays_zip, col, explode
(df
.withColumn("tmp", arrays_zip("b", "c"))
.withColumn("tmp", explode("tmp"))
.select("a", col("tmp.b"), col("tmp.c"), "d"))
Spark < 2.4
With DataFrames and UDF:
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType
from pyspark.sql.functions import col, udf, explode
zip_ = udf(
lambda x, y: list(zip(x, y)),
ArrayType(StructType([
# Adjust types to reflect data types
StructField("first", IntegerType()),
StructField("second", IntegerType())
]))
)
(df
.withColumn("tmp", zip_("b", "c"))
# UDF output cannot be directly passed to explode
.withColumn("tmp", explode("tmp"))
.select("a", col("tmp.first").alias("b"), col("tmp.second").alias("c"), "d"))
With RDDs:
(df
.rdd
.flatMap(lambda row: [(row.a, b, c, row.d) for b, c in zip(row.b, row.c)])
.toDF(["a", "b", "c", "d"]))
Both solutions are inefficient due to Python communication overhead. If data size is fixed you can do something like this:
from functools import reduce
from pyspark.sql import DataFrame
# Length of array
n = 3
# For legacy Python you'll need a separate function
# in place of method accessor
reduce(
DataFrame.unionAll,
(df.select("a", col("b").getItem(i), col("c").getItem(i), "d")
for i in range(n))
).toDF("a", "b", "c", "d")
or even:
from pyspark.sql.functions import array, struct
# SQL level zip of arrays of known size
# followed by explode
tmp = explode(array(*[
struct(col("b").getItem(i).alias("b"), col("c").getItem(i).alias("c"))
for i in range(n)
]))
(df
.withColumn("tmp", tmp)
.select("a", col("tmp").getItem("b"), col("tmp").getItem("c"), "d"))
This should be significantly faster compared to UDF or RDD. Generalized to support an arbitrary number of columns:
# This uses keyword only arguments
# If you use legacy Python you'll have to change signature
# Body of the function can stay the same
def zip_and_explode(*colnames, n):
return explode(array(*[
struct(*[col(c).getItem(i).alias(c) for c in colnames])
for i in range(n)
]))
df.withColumn("tmp", zip_and_explode("b", "c", n=3))

You'd need to use flatMap, not map as you want to make multiple output rows out of each input row.
from pyspark.sql import Row
def dualExplode(r):
rowDict = r.asDict()
bList = rowDict.pop('b')
cList = rowDict.pop('c')
for b,c in zip(bList, cList):
newDict = dict(rowDict)
newDict['b'] = b
newDict['c'] = c
yield Row(**newDict)
df_split = sqlContext.createDataFrame(df.rdd.flatMap(dualExplode))

One liner (for Spark>=2.4.0):
df.withColumn("bc", arrays_zip("b","c"))
.select("a", explode("bc").alias("tbc"))
.select("a", col"tbc.b", "tbc.c").show()
Import required:
from pyspark.sql.functions import arrays_zip
Steps -
Create a column bc which is an array_zip of columns b and c
Explode bc to get a struct tbc
Select the required columns a, b and c (all exploded as required).
Output:
> df.withColumn("bc", arrays_zip("b","c")).select("a", explode("bc").alias("tbc")).select("a", "tbc.b", col("tbc.c")).show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 1| 7|
| 1| 2| 8|
| 1| 3| 9|
+---+---+---+

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Choose the column having more data - apache-spark

Related

Merge two columns in a single DataFrame and count the occurrences using PySpark

How do I use multiple conditions with pyspark.sql.funtions.when() from a dict?

Dot Products of Rows of a Dataframe with a Fixed Vector in Spark

Explode 2 columns (2 lists) in the same time in pyspark [duplicate]

Convert column of lists into one column of values in Pyspark [duplicate]

Categories

Resources