How to reverse and combine string columns in a spark dataframe? - apache-spark

I am using PySpark version 2.4 and I am trying to write a UDF that takes the values of column id1 and column id2 together and returns the reversed string of their concatenation.
For example, my data looks like:
+---+---+
|id1|id2|
+---+---+
| a|one|
| b|two|
+---+---+
the corresponding code is:
df = spark.createDataFrame([['a', 'one'], ['b', 'two']], ['id1', 'id2'])
The returned value should be
+---+---+----+
|id1|id2| val|
+---+---+----+
| a|one|enoa|
| b|two|owtb|
+---+---+----+
My code is:
@udf(string)
def reverse_value(value):
    return value[::-1]

df.withColumn('val', reverse_value(lit('id1' + 'id2')))
My errors are:
TypeError: Invalid argument, not a string or column: <function
reverse_value at 0x0000010E6D860B70> of type <class 'function'>. For
column literals, use 'lit', 'array', 'struct' or 'create_map'
function.

Should be:
from pyspark.sql.functions import col, concat
df.withColumn('val', reverse_value(concat(col('id1'), col('id2'))))
Explanation:
lit is a literal while you want to refer to individual columns (col).
Columns have to be concatenated using concat function (Concatenate columns in Apache Spark DataFrame)
Additionally, it is not clear whether the argument of udf is correct. It should be either:
from pyspark.sql.functions import udf
@udf
def reverse_value(value):
    ...
or
#udf("string")
def reverse_value(value):
...
or
from pyspark.sql.types import StringType
@udf(StringType())
def reverse_value(value):
    ...
Additionally, the stack trace suggests that you have some other problems in your code, not reproducible with the snippet you've shared - reverse_value seems to return a function.
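Putting the pieces together, a minimal working sketch of the udf-based fix (using just one of the valid decorator forms listed above, applied to the sample dataframe from the question) could look like this:
from pyspark.sql.functions import col, concat, udf

df = spark.createDataFrame([['a', 'one'], ['b', 'two']], ['id1', 'id2'])

@udf("string")
def reverse_value(value):
    # value is the already-concatenated string, e.g. "aone"
    return value[::-1]

df.withColumn('val', reverse_value(concat(col('id1'), col('id2')))).show()
#+---+---+----+
#|id1|id2| val|
#+---+---+----+
#| a|one|enoa|
#| b|two|owtb|
#+---+---+----+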

The answer by @user11669673 explains what's wrong with your code and how to fix the udf. However, you don't need a udf for this.
You will achieve much better performance by using pyspark.sql.functions.reverse:
from pyspark.sql.functions import col, concat, reverse
df.withColumn("val", concat(reverse(col("id2")), col("id1"))).show()
#+---+---+----+
#|id1|id2| val|
#+---+---+----+
#| a|one|enoa|
#| b|two|owtb|
#+---+---+----+
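An equivalent formulation, offered only as a hedged alternative, is to reverse the concatenation directly, which mirrors the wording of the question more closely:
from pyspark.sql.functions import col, concat, reverse

# reverse(id1 || id2) gives the same result as concat(reverse(id2), id1).
df.withColumn("val", reverse(concat(col("id1"), col("id2")))).show()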

Related

PySpark merge 2 column values by index into new list [duplicate]

I have a Pandas dataframe. I tried to join two columns containing string values into a list first, and then, using zip, I joined each element of the list with '_'. My data set looks like below:
df['column_1']: 'abc, def, ghi'
df['column_2']: '1.0, 2.0, 3.0'
I wanted to join these two columns in a third column like below for each row of my dataframe.
df['column_3']: [abc_1.0, def_2.0, ghi_3.0]
I have successfully done so in Python using the code below, but the dataframe is quite large and it takes a very long time to run for the whole dataframe. I want to do the same thing in PySpark for efficiency. I have read the data into a Spark dataframe successfully, but I'm having a hard time determining how to replicate Pandas functions with their PySpark equivalents. How can I get my desired result in PySpark?
df['column_3'] = df['column_2']
for index, row in df.iterrows():
    while index < 3:
        if isinstance(row['column_1'], str):
            row['column_1'] = list(row['column_1'].split(','))
            row['column_2'] = list(row['column_2'].split(','))
            row['column_3'] = ['_'.join(map(str, i)) for i in zip(list(row['column_1']), list(row['column_2']))]
I have converted the two columns to arrays in PySpark by using the below code
from pyspark.sql.types import ArrayType, IntegerType, StringType
from pyspark.sql.functions import col, split
crash.withColumn("column_1",
split(col("column_1"), ",\s*").cast(ArrayType(StringType())).alias("column_1")
)
crash.withColumn("column_2",
split(col("column_2"), ",\s*").cast(ArrayType(StringType())).alias("column_2")
)
Now all I need is to zip each element of the arrays in the two columns using '_'. How can I use zip with this? Any help is appreciated.
A Spark SQL equivalent of Python's zip would be pyspark.sql.functions.arrays_zip:
pyspark.sql.functions.arrays_zip(*cols)
Collection function: Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.
So if you already have two arrays:
from pyspark.sql.functions import split
df = (spark
    .createDataFrame([('abc, def, ghi', '1.0, 2.0, 3.0')])
    .toDF("column_1", "column_2")
    .withColumn("column_1", split("column_1", r"\s*,\s*"))
    .withColumn("column_2", split("column_2", r"\s*,\s*")))
You can just apply it to the result:
from pyspark.sql.functions import arrays_zip
df_zipped = df.withColumn(
    "zipped", arrays_zip("column_1", "column_2")
)
df_zipped.select("zipped").show(truncate=False)
+------------------------------------+
|zipped |
+------------------------------------+
|[[abc, 1.0], [def, 2.0], [ghi, 3.0]]|
+------------------------------------+
Now to combine the results you can use transform (How to use transform higher-order function?, TypeError: Column is not iterable - How to iterate over ArrayType()?):
from pyspark.sql.functions import expr

df_zipped_concat = df_zipped.withColumn(
    "zipped_concat",
    expr("transform(zipped, x -> concat_ws('_', x.column_1, x.column_2))")
)
df_zipped_concat.select("zipped_concat").show(truncate=False)
+---------------------------+
|zipped_concat |
+---------------------------+
|[abc_1.0, def_2.0, ghi_3.0]|
+---------------------------+
Note:
transform (a higher-order function) and arrays_zip were introduced in Apache Spark 2.4.
For Spark 2.4+, this can also be done using only the zip_with function, which zips and concatenates in a single step:
from pyspark.sql.functions import expr

df.withColumn("column_3", expr("zip_with(column_1, column_2, (x, y) -> concat(x, '_', y))"))
The higher-order function takes two arrays to merge element-wise, using the lambda function (x, y) -> concat(x, '_', y).
You can also use a UDF to zip the split array columns:
df = spark.createDataFrame([('abc,def,ghi','1.0,2.0,3.0')], ['col1','col2'])
+-----------+-----------+
|col1 |col2 |
+-----------+-----------+
|abc,def,ghi|1.0,2.0,3.0|
+-----------+-----------+ ## Hope this is how your dataframe is
from pyspark.sql import functions as F
from pyspark.sql.types import *
def concat_udf(*args):
    return ['_'.join(x) for x in zip(*args)]

udf1 = F.udf(concat_udf, ArrayType(StringType()))
df = df.withColumn('col3', udf1(F.split(df.col1, ','), F.split(df.col2, ',')))
df.show(1,False)
+-----------+-----------+---------------------------+
|col1 |col2 |col3 |
+-----------+-----------+---------------------------+
|abc,def,ghi|1.0,2.0,3.0|[abc_1.0, def_2.0, ghi_3.0]|
+-----------+-----------+---------------------------+
For Spark 3.1+, pyspark.sql.functions.zip_with() accepts a Python lambda function, so it can be done like this:
import pyspark.sql.functions as F
df = df.withColumn("column_3", F.zip_with("column_1", "column_2", lambda x,y: F.concat_ws("_", x, y)))

Spark SQL not recognizing \d+

I am trying to use the regexp_extract function to get the last three digits in the string ABCDF1_123 with:
regexp_extract('ABCDF1_123', 'ABCDF1_(\d+)', 1)
and it does not capture the group. If I change the function call to:
regexp_extract('ABCDF1_123', 'ABCDF1_([0-9]+)', 1)
it works. Can anyone give me some insight into why? I am also grabbing the data from a Postgres database using a JDBC connection.
I ran regexp_extract with both patterns and they give the same output, as shown below:
from pyspark.sql import Row
from pyspark.sql.functions import lit, when, col, regexp_extract
l = [('ABCDF1_123')]
rdd = sc.parallelize(l)
sample = rdd.map(lambda x: Row(name=x))
sample_df = sqlContext.createDataFrame(sample)
not_working = r'ABCDF1_(\d+)'
working = r'ABCDF1_([0-9]+)'
sample_df.select(regexp_extract('name', not_working, 1).alias('not_working'),
                 regexp_extract('name', working, 1).alias('working')).show(10)
+-----------+-------+
|not_working|working|
+-----------+-------+
| 123| 123|
+-----------+-------+
Is this what you are looking for?
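If the pattern was originally embedded in a SQL query string (for example via spark.sql or expr) rather than passed through the DataFrame API, a likely culprit is backslash escaping: the SQL parser consumes one level of backslashes before the regex engine ever sees the pattern, so \d has to be written as \\d. A hedged sketch of that case:
# Hedged sketch: double the backslash when the regex lives inside a SQL string.
spark.sql(r"SELECT regexp_extract('ABCDF1_123', 'ABCDF1_(\\d+)', 1) AS digits").show()
# With a single backslash, the SQL string literal 'ABCDF1_(\d+)' is effectively
# parsed as 'ABCDF1_(d+)', which no longer matches the digits.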

pyspark getting the field names of a struct datatype inside a udf

I am trying to pass multiple columns to a udf as a StructType (using pyspark.sql.functions.struct()).
Inside this udf I want to get the fields of the struct column that I passed as a list, so that I can iterate over the passed columns for every row.
Basically I am looking for a pyspark version of the scala code provided in this answer - Spark - pass full row to a udf and then get column name inside udf
You can use the same method as in the post you linked, i.e. by using a pyspark.sql.Row. But instead of .schema.fieldNames, you can use .asDict() to convert the Row into a dictionary.
For example, here is a way to iterate over the column names and values simultaneously:
from pyspark.sql.functions import col, struct, udf
df = spark.createDataFrame([(1, 2, 3)], ["a", "b", "c"])
f = udf(lambda row: "; ".join(["=".join(map(str, [k,v])) for k, v in row.asDict().items()]))
df.select(f(struct(*df.columns)).alias("myUdfOutput")).show()
#+-------------+
#| myUdfOutput|
#+-------------+
#|a=1; c=3; b=2|
#+-------------+
An alternative would be to build a MapType() of column name to value, and pass this to your udf.
from itertools import chain
from pyspark.sql.functions import create_map, lit
f2 = udf(lambda row: "; ".join(["=".join(map(str, [k,v])) for k, v in row.items()]))
df.select(
f2(
create_map(
*chain.from_iterable([(lit(c), col(c)) for c in df.columns])
)
).alias("myNewUdfOutput")
).show()
#+--------------+
#|myNewUdfOutput|
#+--------------+
#| a=1; c=3; b=2|
#+--------------+
This second method is arguably unnecessarily complicated, so the first option is the recommended approach.
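If all you need inside the udf are the field names themselves (which is what the question title asks for), a small hedged variant of the first approach is to take the keys of the Row's dictionary; field_names_udf below is just an illustrative name:
from pyspark.sql.functions import struct, udf

# Join the struct's field names into a single string so they are easy to inspect.
field_names_udf = udf(lambda row: ",".join(row.asDict().keys()))
df.select(field_names_udf(struct(*df.columns)).alias("fieldNames")).show()
# Prints something like: a,b,c (ordering may differ on older Python versions)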

Change the timestamp to UTC format in Pyspark

I have an input dataframe (ip_df); the data in this dataframe looks like below:
id timestamp_value
1 2017-08-01T14:30:00+05:30
2 2017-08-01T14:30:00+06:30
3 2017-08-01T14:30:00+07:30
I need to create a new dataframe (op_df), in which I need to convert the timestamp value to UTC format. So the final output dataframe will look like below:
id timestamp_value
1 2017-08-01T09:00:00+00:00
2 2017-08-01T08:00:00+00:00
3 2017-08-01T07:00:00+00:00
I want to achieve this using PySpark. Can someone please help me with it? Any help will be appreciated.
If you absolutely need the timestamp to be formatted exactly as indicated, namely, with the timezone represented as "+00:00", I think using a UDF as already suggested is your best option.
However, if you can tolerate a slightly different representation of the timezone, e.g. either "+0000" (no colon separator) or "Z", it's possible to do this without a UDF, which may perform significantly better for you depending on the size of your dataset.
Given the following representation of data
+---+-------------------------+
|id |timestamp_value |
+---+-------------------------+
|1 |2017-08-01T14:30:00+05:30|
|2 |2017-08-01T14:30:00+06:30|
|3 |2017-08-01T14:30:00+07:30|
+---+-------------------------+
as given by:
l = [(1, '2017-08-01T14:30:00+05:30'), (2, '2017-08-01T14:30:00+06:30'), (3, '2017-08-01T14:30:00+07:30')]
ip_df = spark.createDataFrame(l, ['id', 'timestamp_value'])
where timestamp_value is a String, you could do the following (this uses to_timestamp and session local timezone support which were introduced in Spark 2.2):
from pyspark.sql.functions import to_timestamp, date_format
spark.conf.set('spark.sql.session.timeZone', 'UTC')
op_df = ip_df.select(
    date_format(
        to_timestamp(ip_df.timestamp_value, "yyyy-MM-dd'T'HH:mm:ssXXX"),
        "yyyy-MM-dd'T'HH:mm:ssZ"
    ).alias('timestamp_value'))
which yields:
+------------------------+
|timestamp_value |
+------------------------+
|2017-08-01T09:00:00+0000|
|2017-08-01T08:00:00+0000|
|2017-08-01T07:00:00+0000|
+------------------------+
or, slightly differently:
op_df = ip_df.select(
    date_format(
        to_timestamp(ip_df.timestamp_value, "yyyy-MM-dd'T'HH:mm:ssXXX"),
        "yyyy-MM-dd'T'HH:mm:ssXXX"
    ).alias('timestamp_value'))
which yields:
+--------------------+
|timestamp_value |
+--------------------+
|2017-08-01T09:00:00Z|
|2017-08-01T08:00:00Z|
|2017-08-01T07:00:00Z|
+--------------------+
You can use parser and tz from the dateutil library.
I assume you have Strings and you want a String column:
from dateutil import parser, tz
from pyspark.sql.types import StringType
from pyspark.sql.functions import col, udf
# Create UTC timezone
utc_zone = tz.gettz('UTC')
# Create UDF function that apply on the column
# It takes the String, parse it to a timestamp, convert to UTC, then convert to String again
func = udf(lambda x: parser.parse(x).astimezone(utc_zone).isoformat(), StringType())
# Create new column in your dataset
df = df.withColumn("new_timestamp",func(col("timestamp_value")))
It gives this result:
+---+-------------------------+-------------------------+
|id |timestamp_value |new_timestamp |
+---+-------------------------+-------------------------+
|1 |2017-08-01T14:30:00+05:30|2017-08-01T09:00:00+00:00|
|2 |2017-08-01T14:30:00+06:30|2017-08-01T08:00:00+00:00|
|3 |2017-08-01T14:30:00+07:30|2017-08-01T07:00:00+00:00|
+---+-------------------------+-------------------------+
Finally, you can drop and rename:
df = df.drop("timestamp_value").withColumnRenamed("new_timestamp","timestamp_value")

How do I replace a string value with a NULL in PySpark?

I want to do something like this:
df.replace('empty-value', None, 'NAME')
Basically, I want to replace some value with NULL, but this function does not accept None. How can I do this?
You can combine a when clause with a NULL literal and type casting as follows:
from pyspark.sql.functions import when, lit, col
df = sc.parallelize([(1, "foo"), (2, "bar")]).toDF(["x", "y"])
def replace(column, value):
return when(column != value, column).otherwise(lit(None))
df.withColumn("y", replace(col("y"), "bar")).show()
## +---+----+
## | x| y|
## +---+----+
## | 1| foo|
## | 2|null|
## +---+----+
It doesn't introduce BatchPythonEvaluation and, because of that, should be significantly more efficient than using a UDF.
This will replace empty-value with None in your name column:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
df = sc.parallelize([(1, "empty-value"), (2, "something else")]).toDF(["key", "name"])
new_column_udf = udf(lambda name: None if name == "empty-value" else name, StringType())
new_df = df.withColumn("name", new_column_udf(df.name))
new_df.collect()
Output:
[Row(key=1, name=None), Row(key=2, name=u'something else')]
By using the old name as the first parameter in withColumn, it actually replaces the old name column with the new one generated by the UDF output.
You could also simply use a dict for the first argument of replace. I tried it and this seems to accept None as an argument.
df = df.replace({'empty-value':None}, subset=['NAME'])
Note that your 'empty-value' needs to be hashable.
The best alternative is to use when combined with NULL. Example:
from pyspark.sql.functions import when, lit, col
df = df.withColumn('foo', when(col('foo') != 'empty-value', col('foo')))
If you want to replace several values with null you can either use | inside the when condition or the powerful create_map function.
It is important to note that the worst way to solve this is with a UDF: udfs give your code great versatility, but they come with a huge performance penalty.
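For example, here is a hedged sketch of the multi-value case using Column.isin, which is equivalent to chaining several | conditions; the list of placeholder values is purely illustrative:
from pyspark.sql.functions import col, when

# Hypothetical placeholder values to null out; adjust to your data.
bad_values = ['empty-value', 'N/A', '']
df = df.withColumn('NAME', when(~col('NAME').isin(bad_values), col('NAME')))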
