PySpark merge 2 column values by index into new list [duplicate] - apache-spark

I have a Pandas dataframe. I have tried to join two columns containing string values into a list first and then using zip, I joined each element of the list with '_'. My data set is like below:
df['column_1']: 'abc, def, ghi'
df['column_2']: '1.0, 2.0, 3.0'
I wanted to join these two columns in a third column like below for each row of my dataframe.
df['column_3']: [abc_1.0, def_2.0, ghi_3.0]
I have successfully done so in python using the code below but the dataframe is quite large and it takes a very long time to run it for the whole dataframe. I want to do the same thing in PySpark for efficiency. I have read the data in spark dataframe successfully but I'm having a hard time determining how to replicate Pandas functions with PySpark equivalent functions. How can I get my desired result in PySpark?
df['column_3'] = df['column_2']
for index, row in df.iterrows():
while index < 3:
if isinstance(row['column_1'], str):
row['column_1'] = list(row['column_1'].split(','))
row['column_2'] = list(row['column_2'].split(','))
row['column_3'] = ['_'.join(map(str, i)) for i in zip(list(row['column_1']), list(row['column_2']))]
I have converted the two columns to arrays in PySpark by using the below code
from pyspark.sql.types import ArrayType, IntegerType, StringType
from pyspark.sql.functions import col, split
crash.withColumn("column_1",
split(col("column_1"), ",\s*").cast(ArrayType(StringType())).alias("column_1")
)
crash.withColumn("column_2",
split(col("column_2"), ",\s*").cast(ArrayType(StringType())).alias("column_2")
)
Now all I need is to zip each element of the arrays in the two columns using '_'. How can I use zip with this? Any help is appreciated.

A Spark SQL equivalent of Python's would be pyspark.sql.functions.arrays_zip:
pyspark.sql.functions.arrays_zip(*cols)
Collection function: Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.
So if you already have two arrays:
from pyspark.sql.functions import split
df = (spark
.createDataFrame([('abc, def, ghi', '1.0, 2.0, 3.0')])
.toDF("column_1", "column_2")
.withColumn("column_1", split("column_1", "\s*,\s*"))
.withColumn("column_2", split("column_2", "\s*,\s*")))
You can just apply it on the result
from pyspark.sql.functions import arrays_zip
df_zipped = df.withColumn(
"zipped", arrays_zip("column_1", "column_2")
)
df_zipped.select("zipped").show(truncate=False)
+------------------------------------+
|zipped |
+------------------------------------+
|[[abc, 1.0], [def, 2.0], [ghi, 3.0]]|
+------------------------------------+
Now to combine the results you can transform (How to use transform higher-order function?, TypeError: Column is not iterable - How to iterate over ArrayType()?):
df_zipped_concat = df_zipped.withColumn(
"zipped_concat",
expr("transform(zipped, x -> concat_ws('_', x.column_1, x.column_2))")
)
df_zipped_concat.select("zipped_concat").show(truncate=False)
+---------------------------+
|zipped_concat |
+---------------------------+
|[abc_1.0, def_2.0, ghi_3.0]|
+---------------------------+
Note:
Higher order functions transform and arrays_zip has been introduced in Apache Spark 2.4.

For Spark 2.4+, this can be done using only zip_with function to zip a concatenate on the same time:
df.withColumn("column_3", expr("zip_with(column_1, column_2, (x, y) -> concat(x, '_', y))"))
The higher-order function takes 2 arrays to merge, element-wise, using a lambda function (x, y) -> concat(x, '_', y).

You can also UDF to zip the split array columns,
df = spark.createDataFrame([('abc,def,ghi','1.0,2.0,3.0')], ['col1','col2'])
+-----------+-----------+
|col1 |col2 |
+-----------+-----------+
|abc,def,ghi|1.0,2.0,3.0|
+-----------+-----------+ ## Hope this is how your dataframe is
from pyspark.sql import functions as F
from pyspark.sql.types import *
def concat_udf(*args):
return ['_'.join(x) for x in zip(*args)]
udf1 = F.udf(concat_udf,ArrayType(StringType()))
df = df.withColumn('col3',udf1(F.split(df.col1,','),F.split(df.col2,',')))
df.show(1,False)
+-----------+-----------+---------------------------+
|col1 |col2 |col3 |
+-----------+-----------+---------------------------+
|abc,def,ghi|1.0,2.0,3.0|[abc_1.0, def_2.0, ghi_3.0]|
+-----------+-----------+---------------------------+

For Spark 3.1+, they now provide pyspark.sql.functions.zip_with() with Python lambda function, therefore it can be done like this:
import pyspark.sql.functions as F
df = df.withColumn("column_3", F.zip_with("column_1", "column_2", lambda x,y: F.concat_ws("_", x, y)))

Related

Separate string by white space in pyspark

I have column with search queries that are represented by strings. I want to separate every string to different work.
Let say I have this data frame:
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.read.option("header", "true") \
.option("delimiter", "\t") \
.option("inferSchema", "true") \
.csv("/content/drive/MyDrive/my_data.txt")
data = df.groupBy("AnonID").agg(F.collect_list("Query").alias("Query"))
from pyspark.sql.functions import array_distinct
from pyspark.sql.functions import udf
data = data.withColumn("New_Data", array_distinct("Query"))
Z = data.drop(data.Query)
+------+------------------------+
|AnonID| New_Data |
+------+------------------------+
| 142|[Big House, Green frog] |
+------+------------------------+
And I want output like that:
+------+--------------------------+
|AnonID| New_Data |
+------+--------------------------+
| 142|[Big, House, Green, frog] |
+------+--------------------------+
I have tried to search In older posts but I was able to find only something that separates each word to different column and it's not what I want.
To separate the elements in an array and split each string into separate words, you can use the explode and split functions in Spark. The exploded elements can then be combined back into an array using the array function.
from pyspark.sql.functions import explode, split, array
data = data.withColumn("Words", explode(split(data["New_Data"], " ")))
data = data.groupBy("AnonID").agg(array(data["Words"]).alias("New_Data"))
You can do the collect_list first and then use the transform function to split the array elements and then flatten the elements and then finally apply array_distinct. Please check out the code and output below.
df=spark.createDataFrame([[142,"Big House"],[142,"Big Green Frog"]],["AnonID","Query"])
import pyspark.sql.functions as F
data = df.groupBy("AnonID").agg(F.collect_list("Query").alias("Query"))
data.withColumn("Query",F.array_distinct(flatten(transform(data["Query"], lambda x: split(x, " "))))).show(2,False)
+------+-------------------------+
|AnonID|Query |
+------+-------------------------+
|142 |[Big, House, Green, Frog]|
+------+-------------------------+

Create new dataframe in pyspark with column names and its associated values in other column using spark/pyspark

i have a dataset like below
and i would like to create a dataframe using above dataset like below
First you need stack your dataframe, group by var_name and apply collect_list
import pyspark.sql.functions as f
expr_columns = ', '.join(map(lambda col: '"{col}", {col}'.format(col=col), df.columns))
expr = "stack(2, {columns}) as (var_name, values)".format(columns=expr_columns)
df_stack = df.selectExpr(expr)
df_final = df_stack.groupBy("var_name").agg(f.collect_list(f.col("values")))

Pyspark - Index from monotonically_increasing_id changes after list aggregation

I'm creating an index using the monotonically_increasing_id() function in Pyspark 3.1.1.
I'm aware of the specific characteristics of that function, but they don't explain my issue.
After creating the index I do a simple aggregation applying the collect_list() function on the created index.
If I compare the results the index changes in certain cases, that is specifically on the upper end of the long-range when the input data is not too small.
Full example code:
import random
import string
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType
spark = SparkSession.builder\
.appName("test")\
.master("local")\
.config('spark.sql.shuffle.partitions', '8')\
.getOrCreate()
# Create random input data of around length 100000:
input_data = []
ii = 0
while ii <= 100000:
L = random.randint(1, 3)
B = ''.join(random.choices(string.ascii_uppercase, k=5))
for i in range(L):
C = random.randint(1,100)
input_data.append((B,))
ii += 1
# Create Spark DataFrame:
input_rdd = sc.parallelize(tuple(input_data))
schema = StructType([StructField("B", StringType())])
dg = spark.createDataFrame(input_rdd, schema=schema)
# Create id and aggregate:
dg = dg.sort("B").withColumn("ID0", f.monotonically_increasing_id())
dg2 = dg.groupBy("B").agg(f.collect_list("ID0"))
Output:
dg.sort('B', ascending=False).show(10, truncate=False)
dg2.sort('B', ascending=False).show(5, truncate=False)
This of course creates different data with every run, but if the length is large enough (problem appears slightly at 10000, but not at 1000), it should appear everytime. Here's an example result:
+-----+-----------+
|B |ID0 |
+-----+-----------+
|ZZZVB|60129554616|
|ZZZVB|60129554617|
|ZZZVB|60129554615|
|ZZZUH|60129554614|
|ZZZRW|60129554612|
|ZZZRW|60129554613|
|ZZZNH|60129554611|
|ZZZNH|60129554609|
|ZZZNH|60129554610|
|ZZZJH|60129554606|
+-----+-----------+
only showing top 10 rows
+-----+---------------------------------------+
|B |collect_list(ID0) |
+-----+---------------------------------------+
|ZZZVB|[60129554742, 60129554743, 60129554744]|
|ZZZUH|[60129554741] |
|ZZZRW|[60129554739, 60129554740] |
|ZZZNH|[60129554736, 60129554737, 60129554738]|
|ZZZJH|[60129554733, 60129554734, 60129554735]|
+-----+---------------------------------------+
only showing top 5 rows
The entry ZZZVB has the three IDs 60129554615, 60129554616, and 60129554617 before aggregation, but after aggregation the numbers have changed to 60129554742, 60129554743, 60129554744.
Why? I can't imagine this is supposed to happen. Isn't the result of monotonically_increasing_id() a simple long that keeps its value after having been created?
EDIT: As expected a workaround is to coalesce(1) the DataFrame before creating the id.
dg and df2 are two different dataframes, each with their own DAG. These DAGs are executed independently from each other when an action on one of the dataframes is called. So each time show() is called, the DAG of the respective dataframe is evaluated and during that evaluation, f.monotonically_increasing_id() is called.
To prevent f.monotonically_increasing_id() being called twice, you could add a cache after the withColumn transformation:
dg = dg.sort("B").withColumn("ID0", f.monotonically_increasing_id()).cache()
With the cache, the result of the first evaluation of f.monotonically_increasing_id() is cached and reused when evaluating the second dataframe.

pyspark getting the field names of a a struct datatype inside a udf

I am trying to pass multiple columns to a udf as a StructType (using pyspark.sql.functions.struct()).
Inside this udf I want to get the fields of the struct column that I passed as a list, so that I can iterate over the passed columns for every row.
Basically I am looking for a pyspark version of the scala code provided in this answer - Spark - pass full row to a udf and then get column name inside udf
You can use the same method as on the post you linked, i.e. by using a pyspark.sql.Row. But instead of .schema.fieldNames, you can use .asDict() to convert the Row into a dictionary.
For example, here is a way to iterate over the column names and values simultaneously:
from pyspark.sql.functions import col, struct, udf
df = spark.createDataFrame([(1, 2, 3)], ["a", "b", "c"])
f = udf(lambda row: "; ".join(["=".join(map(str, [k,v])) for k, v in row.asDict().items()]))
df.select(f(struct(*df.columns)).alias("myUdfOutput")).show()
#+-------------+
#| myUdfOutput|
#+-------------+
#|a=1; c=3; b=2|
#+-------------+
An alternative would be to build a MapType() of column name to value, and pass this to your udf.
from itertools import chain
from pyspark.sql.functions import create_map, lit
f2 = udf(lambda row: "; ".join(["=".join(map(str, [k,v])) for k, v in row.items()]))
df.select(
f2(
create_map(
*chain.from_iterable([(lit(c), col(c)) for c in df.columns])
)
).alias("myNewUdfOutput")
).show()
#+--------------+
#|myNewUdfOutput|
#+--------------+
#| a=1; c=3; b=2|
#+--------------+
This second method is arguably unnecessarily complicated, so the first option is the recommended approach.

How to reverse and combine string columns in a spark dataframe?

I am using pyspark version 2.4 and I am trying to write a udf which should take the values of column id1 and column id2 together, and returns the reverse string of it.
For example, my data looks like:
+---+---+
|id1|id2|
+---+---+
| a|one|
| b|two|
+---+---+
the corresponding code is:
df = spark.createDataFrame([['a', 'one'], ['b', 'two']], ['id1', 'id2'])
The returned value should be
+---+---+----+
|id1|id2| val|
+---+---+----+
| a|one|enoa|
| b|two|owtb|
+---+---+----+
My code is:
#udf(string)
def reverse_value(value):
return value[::-1]
df.withColumn('val', reverse_value(lit('id1' + 'id2')))
My errors are:
TypeError: Invalid argument, not a string or column: <function
reverse_value at 0x0000010E6D860B70> of type <class 'function'>. For
column literals, use 'lit', 'array', 'struct' or 'create_map'
function.
Should be:
from pyspark.sql.functions import col, concat
df.withColumn('val', reverse_value(concat(col('id1'), col('id2'))))
Explanation:
lit is a literal while you want to refer to individual columns (col).
Columns have to be concatenated using concat function (Concatenate columns in Apache Spark DataFrame)
Additionally it is not clear if argument of udf is correct. It should be either:
from pyspark.sql.functions import udf
#udf
def reverse_value(value):
...
or
#udf("string")
def reverse_value(value):
...
or
from pyspark.sql.types import StringType
#udf(StringType())
def reverse_value(value):
...
Additionally the stacktrace suggests that you have some other problems in your code, not reproducible with the snippet you've shared - the reverse_value seems to return function.
The answer by #user11669673 explains what's wrong with your code and how to fix the udf. However, you don't need a udf for this.
You will achieve much better performance by using pyspark.sql.functions.reverse:
from pyspark.sql.functions import col, concat, reverse
df.withColumn("val", concat(reverse(col("id2")), col("id1"))).show()
#+---+---+----+
#|id1|id2| val|
#+---+---+----+
#| a|one|enoa|
#| b|two|owtb|
#+---+---+----+

Resources