Convert null values to empty array in Spark DataFrame - apache-spark

I have a Spark data frame where one column is an array of integers. The column is nullable because it is coming from a left outer join. I want to convert all null values to an empty array so I don't have to deal with nulls later.
I thought I could do it like so:
val myCol = df("myCol")
df.withColumn( "myCol", when(myCol.isNull, Array[Int]()).otherwise(myCol) )
However, this results in the following exception:
java.lang.RuntimeException: Unsupported literal type class [I [I#5ed25612
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:49)
at org.apache.spark.sql.functions$.lit(functions.scala:89)
at org.apache.spark.sql.functions$.when(functions.scala:778)
Apparently array types are not supported by the when function. Is there some other easy way to convert the null values?
In case it is relevant, here is the schema for this column:
|-- myCol: array (nullable = true)
| |-- element: integer (containsNull = false)

You can use an UDF:
import org.apache.spark.sql.functions.udf
val array_ = udf(() => Array.empty[Int])
combined with WHEN or COALESCE:
df.withColumn("myCol", when(myCol.isNull, array_()).otherwise(myCol))
df.withColumn("myCol", coalesce(myCol, array_())).show
In the recent versions you can use array function:
import org.apache.spark.sql.functions.{array, lit}
df.withColumn("myCol", when(myCol.isNull, array().cast("array<integer>")).otherwise(myCol))
df.withColumn("myCol", coalesce(myCol, array().cast("array<integer>"))).show
Please note that it will work only if conversion from string to the desired type is allowed.
The same thing can be of course done in PySpark as well. For the legacy solutions you can define udf
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType
def empty_array(t):
return udf(lambda: [], ArrayType(t()))()
coalesce(myCol, empty_array(IntegerType()))
and in the recent versions just use array:
from pyspark.sql.functions import array
coalesce(myCol, array().cast("array<integer>"))

With a slight modification to zero323's approach, I was able to do this without using a udf in Spark 2.3.1.
val df = Seq("a" -> Array(1,2,3), "b" -> null, "c" -> Array(7,8,9)).toDF("id","numbers")
df.show
+---+---------+
| id| numbers|
+---+---------+
| a|[1, 2, 3]|
| b| null|
| c|[7, 8, 9]|
+---+---------+
val df2 = df.withColumn("numbers", coalesce($"numbers", array()))
df2.show
+---+---------+
| id| numbers|
+---+---------+
| a|[1, 2, 3]|
| b| []|
| c|[7, 8, 9]|
+---+---------+

An UDF-free alternative to use when the data type you want your array elements in can not be cast from StringType is the following:
import pyspark.sql.types as T
import pyspark.sql.functions as F
df.withColumn(
"myCol",
F.coalesce(
F.col("myCol"),
F.from_json(F.lit("[]"), T.ArrayType(T.IntegerType()))
)
)
You can replace IntegerType() with whichever data type, also complex ones.

Related

Split column with JSON string to columns each containing one key-value pair from the string

I have a data frame that looks like this (one column named "value" with a JSON string in it). I send it to an Event Hub using Kafka API and then I want to read that data from the Event Hub and apply some transformations to it. The data is in received in binary format, as described in the Kafka documentation.
Here are a few columns in CSV format:
value
"{""id"":""e52f247c-f46c-4021-bc62-e28e56db1ad8"",""latitude"":""34.5016064725731"",""longitude"":""123.43996453687777""}"
"{""id"":""32782100-9b59-49c7-9d56-bb4dfc368a86"",""latitude"":""49.938541626415144"",""longitude"":""111.88360885971986""}"
"{""id"":""a72a600f-2b99-4c41-a388-9a24c00545c0"",""latitude"":""4.988768300413497"",""longitude"":""-141.92727675177588""}"
"{""id"":""5a5f056a-cdfd-4957-8e84-4d5271253509"",""latitude"":""41.802942545247134"",""longitude"":""90.45164573613573""}"
"{""id"":""d00d0926-46eb-45dd-9e35-ab765804340d"",""latitude"":""70.60161063520081"",""longitude"":""20.566520665122482""}"
"{""id"":""dda14397-6922-4bb6-9be3-a1546f08169d"",""latitude"":""68.400462882435"",""longitude"":""135.7167027587489""}"
"{""id"":""c7f13b8a-3468-4bc6-9db4-e0b1b34bf9ea"",""latitude"":""26.04757722355835"",""longitude"":""175.20227554031783""}"
"{""id"":""97f8f1cf-3aa0-49bb-b3d5-05b736e0c883"",""latitude"":""35.52624182094499"",""longitude"":""-164.18066699972852""}"
"{""id"":""6bed49bc-ee93-4ed9-893f-4f51c7b7f703"",""latitude"":""-24.319581484353847"",""longitude"":""85.27338980948076""}"
What I want to do is to apply a transformation and create a data frame with 3 columns one with id, one with latitude and one with longitude.
This is what I tried but the result is not what I expected:
from pyspark.sql.types import StructType
from pyspark.sql.functions import from_json
from pyspark.sql import functions as F
# df is the data frame received from Kafka
location_schema = StructType().add("id", "string").add("latitude", "float").add("longitude", "float")
string_df = df.selectExpr("CAST(value AS STRING)").withColumn("value", from_json(F.col("value"), location_schema))
string_df.printSchema()
string_df.show()
And this is the result:
It created a "value" column with a structure as a value. Any idea what to do to obtain 3 different columns, as I described?
Your df:
df = spark.createDataFrame(
[
(1, '{"id":"e52f247c-f46c-4021-bc62-e28e56db1ad8","latitude":"34.5016064725731","longitude":"123.43996453687777"}'),
(2, '{"id":"32782100-9b59-49c7-9d56-bb4dfc368a86","latitude":"49.938541626415144","longitude":"111.88360885971986"}'),
(3, '{"id":"a72a600f-2b99-4c41-a388-9a24c00545c0","latitude":"4.988768300413497","longitude":"-141.92727675177588"}'),
(4, '{"id":"5a5f056a-cdfd-4957-8e84-4d5271253509","latitude":"41.802942545247134","longitude":"90.45164573613573"}'),
(5, '{"id":"d00d0926-46eb-45dd-9e35-ab765804340d","latitude":"70.60161063520081","longitude":"20.566520665122482"}'),
(6, '{"id":"dda14397-6922-4bb6-9be3-a1546f08169d","latitude":"68.400462882435","longitude":"135.7167027587489"}'),
(7, '{"id":"c7f13b8a-3468-4bc6-9db4-e0b1b34bf9ea","latitude":"26.04757722355835","longitude":"175.20227554031783"}'),
(8, '{"id":"97f8f1cf-3aa0-49bb-b3d5-05b736e0c883","latitude":"35.52624182094499","longitude":"-164.18066699972852"}'),
(9, '{"id":"6bed49bc-ee93-4ed9-893f-4f51c7b7f703","latitude":"-24.319581484353847","longitude":"85.27338980948076"}')
],
['id', 'value']
).drop('id')
+--------------------------------------------------------------------------------------------------------------+
|value |
+--------------------------------------------------------------------------------------------------------------+
|{"id":"e52f247c-f46c-4021-bc62-e28e56db1ad8","latitude":"34.5016064725731","longitude":"123.43996453687777"} |
|{"id":"32782100-9b59-49c7-9d56-bb4dfc368a86","latitude":"49.938541626415144","longitude":"111.88360885971986"}|
|{"id":"a72a600f-2b99-4c41-a388-9a24c00545c0","latitude":"4.988768300413497","longitude":"-141.92727675177588"}|
|{"id":"5a5f056a-cdfd-4957-8e84-4d5271253509","latitude":"41.802942545247134","longitude":"90.45164573613573"} |
|{"id":"d00d0926-46eb-45dd-9e35-ab765804340d","latitude":"70.60161063520081","longitude":"20.566520665122482"} |
|{"id":"dda14397-6922-4bb6-9be3-a1546f08169d","latitude":"68.400462882435","longitude":"135.7167027587489"} |
|{"id":"c7f13b8a-3468-4bc6-9db4-e0b1b34bf9ea","latitude":"26.04757722355835","longitude":"175.20227554031783"} |
|{"id":"97f8f1cf-3aa0-49bb-b3d5-05b736e0c883","latitude":"35.52624182094499","longitude":"-164.18066699972852"}|
|{"id":"6bed49bc-ee93-4ed9-893f-4f51c7b7f703","latitude":"-24.319581484353847","longitude":"85.27338980948076"}|
+--------------------------------------------------------------------------------------------------------------+
Then:
from pyspark.sql import functions as F
from pyspark.sql.types import *
json_schema = StructType([
StructField("id", StringType(), True),
StructField("latitude", FloatType(), True),
StructField("longitude", FloatType(), True)
])
df\
.withColumn('json', F.from_json(F.col('value'), json_schema))\
.select(F.col('json').getItem('id').alias('id'),
F.col('json').getItem('latitude').alias('latitude'),
F.col('json').getItem('longitude').alias('longitude')
)\
.show(truncate=False)
+------------------------------------+-------------------+-------------------+
|id |latitude |longitude |
+------------------------------------+-------------------+-------------------+
|e52f247c-f46c-4021-bc62-e28e56db1ad8|34.5016064725731 |123.43996453687777 |
|32782100-9b59-49c7-9d56-bb4dfc368a86|49.938541626415144 |111.88360885971986 |
|a72a600f-2b99-4c41-a388-9a24c00545c0|4.988768300413497 |-141.92727675177588|
|5a5f056a-cdfd-4957-8e84-4d5271253509|41.802942545247134 |90.45164573613573 |
|d00d0926-46eb-45dd-9e35-ab765804340d|70.60161063520081 |20.566520665122482 |
|dda14397-6922-4bb6-9be3-a1546f08169d|68.400462882435 |135.7167027587489 |
|c7f13b8a-3468-4bc6-9db4-e0b1b34bf9ea|26.04757722355835 |175.20227554031783 |
|97f8f1cf-3aa0-49bb-b3d5-05b736e0c883|35.52624182094499 |-164.18066699972852|
|6bed49bc-ee93-4ed9-893f-4f51c7b7f703|-24.319581484353847|85.27338980948076 |
+------------------------------------+-------------------+-------------------+
If pattern remains unchanged then you can use regexp_replace()
>>> df = spark.read.option("header",False).option("inferSchema",True).csv("/dir1/dir2/Sample2.csv")
>>> df.show(truncate=False)
+-------------------------------------------------+------------------------------------+---------------------------------------+
|_c0 |_c1 |_c2 |
+-------------------------------------------------+------------------------------------+---------------------------------------+
|"{""id"":""e52f247c-f46c-4021-bc62-e28e56db1ad8""|""latitude"":""34.5016064725731"" |""longitude"":""123.43996453687777""}" |
|"{""id"":""32782100-9b59-49c7-9d56-bb4dfc368a86""|""latitude"":""49.938541626415144"" |""longitude"":""111.88360885971986""}" |
|"{""id"":""a72a600f-2b99-4c41-a388-9a24c00545c0""|""latitude"":""4.988768300413497"" |""longitude"":""-141.92727675177588""}"|
|"{""id"":""5a5f056a-cdfd-4957-8e84-4d5271253509""|""latitude"":""41.802942545247134"" |""longitude"":""90.45164573613573""}" |
|"{""id"":""d00d0926-46eb-45dd-9e35-ab765804340d""|""latitude"":""70.60161063520081"" |""longitude"":""20.566520665122482""}" |
|"{""id"":""dda14397-6922-4bb6-9be3-a1546f08169d""|""latitude"":""68.400462882435"" |""longitude"":""135.7167027587489""}" |
|"{""id"":""c7f13b8a-3468-4bc6-9db4-e0b1b34bf9ea""|""latitude"":""26.04757722355835"" |""longitude"":""175.20227554031783""}" |
|"{""id"":""97f8f1cf-3aa0-49bb-b3d5-05b736e0c883""|""latitude"":""35.52624182094499"" |""longitude"":""-164.18066699972852""}"|
|"{""id"":""6bed49bc-ee93-4ed9-893f-4f51c7b7f703""|""latitude"":""-24.319581484353847""|""longitude"":""85.27338980948076""}" |
+-------------------------------------------------+------------------------------------+---------------------------------------+
>>> df.withColumn("id",regexp_replace('_c0','\"\{\"\"id\"\":\"\"','')).withColumn("id",regexp_replace('id','\"\"','')).withColumn("latitude",regexp_replace('_c1','\"\"latitude\"\":\"\"','')).withColumn("latitude",regexp_replace('latitude','\"\"','')).withColumn("longitude",regexp_replace('_c2','\"\"longitude\"\":\"\"','')).withColumn("longitude",regexp_replace('longitude','\"\"\}\"','')).drop("_c0").drop("_c1").drop("_c2").show()
+--------------------+-------------------+-------------------+
| id| latitude| longitude|
+--------------------+-------------------+-------------------+
|e52f247c-f46c-402...| 34.5016064725731| 123.43996453687777|
|32782100-9b59-49c...| 49.938541626415144| 111.88360885971986|
|a72a600f-2b99-4c4...| 4.988768300413497|-141.92727675177588|
|5a5f056a-cdfd-495...| 41.802942545247134| 90.45164573613573|
|d00d0926-46eb-45d...| 70.60161063520081| 20.566520665122482|
|dda14397-6922-4bb...| 68.400462882435| 135.7167027587489|
|c7f13b8a-3468-4bc...| 26.04757722355835| 175.20227554031783|
|97f8f1cf-3aa0-49b...| 35.52624182094499|-164.18066699972852|
|6bed49bc-ee93-4ed...|-24.319581484353847| 85.27338980948076|
+--------------------+-------------------+-------------------+
You can use json_tuple to extract values from JSON string.
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('{"id":"e52f247c-f46c-4021-bc62-e28e56db1ad8","latitude":"34.5016064725731","longitude":"123.43996453687777"}',)],
['value'])
Script:
cols = ['id', 'latitude', 'longitude']
df = df.select(F.json_tuple('value', *cols)).toDF(*cols)
df.show(truncate=0)
# +------------------------------------+----------------+------------------+
# |id |latitude |longitude |
# +------------------------------------+----------------+------------------+
# |e52f247c-f46c-4021-bc62-e28e56db1ad8|34.5016064725731|123.43996453687777|
# +------------------------------------+----------------+------------------+
If needed, cast to double:
.withColumn('latitude', F.col('latitude').cast('double'))
.withColumn('longitude', F.col('longitude').cast('double'))
It's easy to extract JSON string as columns using inline and from_json
df = spark.createDataFrame(
[('{"id":"e52f247c-f46c-4021-bc62-e28e56db1ad8","latitude":"34.5016064725731","longitude":"123.43996453687777"}',)],
['value'])
df = df.selectExpr(
"inline(array(from_json(value, 'struct<id:string, latitude:string, longitude:string>')))"
)
df.show(truncate=0)
# +------------------------------------+----------------+------------------+
# |id |latitude |longitude |
# +------------------------------------+----------------+------------------+
# |e52f247c-f46c-4021-bc62-e28e56db1ad8|34.5016064725731|123.43996453687777|
# +------------------------------------+----------------+------------------+
I used the sample data provided, created a dataframe called df and proceeded to use the same method as you.
The following is the image of the rows present inside df dataframe.
The fields are not displayed as required because of the their datatype. The values for latitude and longitude are present as string types in the dataframe df. But while creating the schema location_schema you have specified their type as float. Instead, try changing their type to string and later convert them to double type. The code looks as shown below:
location_schema = StructType().add("id", "string").add("latitude", "string").add("longitude", "string")
string_df = df.selectExpr('CAST(value AS STRING)').withColumn("value", from_json(F.col("value"), location_schema))
string_df.printSchema()
string_df.show(truncate=False)
Now using DataFrame.withColumn(), Column.withField() and cast() convert the string type fields latitude and longitude to Double Type.
string_df = string_df.withColumn("value", col("value").withField("latitude", col("value.latitude").cast(DoubleType())))\
.withColumn("value", col("value").withField("longitude", col("value.longitude").cast(DoubleType())))
string_df.printSchema()
string_df.show(truncate=False)
So, you can get the desired output as shown below.
Update:
To get separate columns you can simply use json_tuple() method. Refer to this official spark documentation:
pyspark.sql.functions.json_tuple — PySpark 3.3.0 documentation (apache.org)

large number of big decimal type is null when query it

I have a simple spark code as follows that I want query large number of big decimal type
test("SparkTest 0458") {
val spark = SparkSession.builder().master("local").appName("SparkTest0456").getOrCreate()
import spark.implicits._
val data =
(
new java.math.BigDecimal("819021675302547012738064321"),
new java.math.BigDecimal("819021675302547012738064321"),
new java.math.BigDecimal("819021675302547012738064321")
)
val df = spark.createDataset(Seq(data)).toDF("a", "b", "c")
df.show(truncate = false)
}
But it shows 3 nulls
+----+----+----+
|a |b |c |
+----+----+----+
|null|null|null|
+----+----+----+
I would ask what's wrong here, thanks
The source of the problem is the schema inference mechanism for decimal types. Since neither scale nor precision is part of the type signature, Spark assumes that input is decimal(38, 18):
df.printSchema
root
|-- a: decimal(38,18) (nullable = true)
|-- b: decimal(38,18) (nullable = true)
|-- c: decimal(38,18) (nullable = true)
This means that you can store at most 20 digits before decimal point, and the numbers you use, have 26 digits.
As far as I know there is no workaround that works directly with reflection, but it is possible to convert data to Row objects and provide schema explicitly. For example with intermediate RDD
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import java.math.BigDecimal
val schema = StructType(
Seq("a", "b", "c") map (c => StructField(c, DecimalType(38, 0)))
)
spark.createDataFrame(
sc.parallelize(Seq(data)) map(t => Row(t.productIterator.toSeq: _*)),
schema
)
or Kryo-serialized dataset
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
spark.createDataset(Seq(data))(
Encoders.kryo: Encoder[(BigDecimal, BigDecimal, BigDecimal)]
).map(t => Row(t.productIterator.toSeq: _*))(RowEncoder(schema))

How to use transform higher-order function?

It's about transform higher-order function (https://issues.apache.org/jira/browse/SPARK-23908).
Is there any way to use it as a standard function (in package org.apache.spark.sql.functions._)?
I have an array of strings and I want to apply URI normalization to each of them. For now I did it with an UDF. I just hopped that with spark 2.4.0 I would be able to skip the UDF.
As I see it should be used in selectExpr like df.selectExpr("transform(i, x -> x + 1)") but is it only meant to be used with selectExpr?
Using it this way is there anyway to provide a custom function for the transformation? Is there any way to achieve it or should I resort to using good old UDFs?
Is there anyway to use it as a standard function located in package org.apache.spark.sql.functions._ ?
For now it is intended only for usage with SQL expressions, although if you want to return a Column your use expr:
org.apache.spark.sql.functions._
expr("transform(i, x -> x + 1)"): Column
Using it this way is there anyway to provide a custom function for the transformation?
It is possible to use Scala UDF*:
spark.udf.register("f", (x: Int) => x + 1)
Seq((1, Seq(1, 2, 3))).toDF("id", "xs")
.withColumn("xsinc", expr("transform(xs, x -> f(x))"))
.show
+---+---------+---------+
| id| xs| xsinc|
+---+---------+---------+
| 1|[1, 2, 3]|[2, 3, 4]|
+---+---------+---------+
although it doesn't seem to provide any real benefits over UDF taking a Seq.
* A partial support for Python UDFs seem to be in place (udfs are recognized, types are correctly derived, and calls are dispatched) as well, but as of 2.4.0 the serialization mechanism seems to be broken (all records are passed to UDF as None):
from typing import Optional
from pyspark.sql.functions import expr
sc.version
'2.4.0'
def f(x: Optional[int]) -> Optional[int]:
return x + 1 if x is not None else None
spark.udf.register('f', f, "integer")
df = (spark
.createDataFrame([(1, [1, 2, 3])], ("id", "xs"))
.withColumn("xsinc", expr("transform(xs, x -> f(x))")))
df.printSchema()
root
|-- id: long (nullable = true)
|-- xs: array (nullable = true)
| |-- element: long (containsNull = true)
|-- xsinc: array (nullable = true)
| |-- element: integer (containsNull = true)
df.show()
+---+---------+-----+
| id| xs|xsinc|
+---+---------+-----+
| 1|[1, 2, 3]| [,,]|
+---+---------+-----+
Of course there is no real potential for performance boost here - it dispatches to BasePythonRunner so overhead should be the same as of plain udf.
Related JIRA ticket SPARK-27052 - Using PySpark udf in transform yields NULL values
It seems that in python, as reported in the answer above, there is a problem with the expr function using a udf, but it can be done as follows:
from typing import Optional
from pyspark.sql.functions import expr, transform
def f(x: Optional[int]) -> Optional[int]:
return x + 1 if x is not None else None
spark.udf.register('f', f, "integer")
df = (spark
.createDataFrame(
[(1, [1, 2, 3])], ("id", "xs"))
.withColumn("xsinc", eval("transform(col('xs'), lambda x: f(x))")))
df.show()
+---+---------+---------+
| id| xs| xsinc|
+---+---------+---------+
| 1|[1, 2, 3]|[2, 3, 4]|
+---+---------+---------+

Set schema in pyspark dataframe read.csv with null elements

I have a data set (example) that when imported with
df = spark.read.csv(filename, header=True, inferSchema=True)
df.show()
will assign the column with 'NA' as a stringType(), where I would like it to be IntegerType() (or ByteType()).
I then tried to set
schema = StructType([
StructField("col_01", IntegerType()),
StructField("col_02", DateType()),
StructField("col_03", IntegerType())
])
df = spark.read.csv(filename, header=True, schema=schema)
df.show()
The output shows the entire row with 'col_03' = null to be null.
However col_01 and col_02 return appropriate data if they are called with
df.select(['col_01','col_02']).show()
I can find a way around this by post casting the data type of col_3
df = spark.read.csv(filename, header=True, inferSchema=True)
df = df.withColumn('col_3',df['col_3'].cast(IntegerType()))
df.show()
, but I think it is not ideal and would be much better if I can assign the data type for each column directly with setting schema.
Would anyone be able to guide me what I do incorrectly? Or casting the data types after importing is the only solution? Any comment regarding performance of the two approaches (if we can make assigning schema to work) is also welcome.
Thank you,
You can set a new null value in spark's csv loader using nullValue:
for a csv file looking like this:
col_01,col_02,col_03
111,2007-11-18,3
112,2002-12-03,4
113,2007-02-14,5
114,2003-04-16,NA
115,2011-08-24,2
116,2003-05-03,3
117,2001-06-11,4
118,2004-05-06,NA
119,2012-03-25,5
120,2006-10-13,4
and forcing schema:
from pyspark.sql.types import StructType, IntegerType, DateType
schema = StructType([
StructField("col_01", IntegerType()),
StructField("col_02", DateType()),
StructField("col_03", IntegerType())
])
You'll get:
df = spark.read.csv(filename, header=True, nullValue='NA', schema=schema)
df.show()
df.printSchema()
+------+----------+------+
|col_01| col_02|col_03|
+------+----------+------+
| 111|2007-11-18| 3|
| 112|2002-12-03| 4|
| 113|2007-02-14| 5|
| 114|2003-04-16| null|
| 115|2011-08-24| 2|
| 116|2003-05-03| 3|
| 117|2001-06-11| 4|
| 118|2004-05-06| null|
| 119|2012-03-25| 5|
| 120|2006-10-13| 4|
+------+----------+------+
root
|-- col_01: integer (nullable = true)
|-- col_02: date (nullable = true)
|-- col_03: integer (nullable = true)
Try this once - (But this will read every column as string type. You can type caste as per your requirement)
import csv
from pyspark.sql.types import IntegerType
data = []
with open('filename', 'r' ) as doc:
reader = csv.DictReader(doc)
for line in reader:
data.append(line)
df = sc.parallelize(data).toDF()
df = df.withColumn("col_03", df["col_03"].cast(IntegerType()))

How to change case of whole pyspark dataframe to lower or upper

I am trying to apply pyspark sql functions hash algorithm for every row in two dataframes to identify the differences. Hash algorithm is case sensitive .i.e. if column contains 'APPLE' and 'Apple' are considered as two different values, so I want to change the case for both dataframes to either upper or lower. I am able to achieve only for dataframe headers but not for dataframe values.Please help
#Code for Dataframe column headers
self.df_db1 =self.df_db1.toDF(*[c.lower() for c in self.df_db1.columns])
Assuming df is your dataframe, this should do the work:
from pyspark.sql import functions as F
for col in df.columns:
df = df.withColumn(col, F.lower(F.col(col)))
Both answers seems to be ok with one exception - if you have numeric column, it will be converted to string column. To avoid this, try:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val fields = df.schema.fields
val stringFields = df.schema.fields.filter(f => f.dataType == StringType)
val nonStringFields = df.schema.fields.filter(f => f.dataType != StringType).map(f => f.name).map(f => col(f))
val stringFieldsTransformed = stringFields .map (f => f.name).map(f => upper(col(f)).as(f))
val df = sourceDF.select(stringFieldsTransformed ++ nonStringFields: _*)
Now types are correct also when you have non-string fields, i.e. numeric fields).
If you know that each column is of String type, use one of the other answers - they are correct in that cases :)
Python code in PySpark:
from pyspark.sql.functions import *
from pyspark.sql.types import *
sourceDF = spark.createDataFrame([(1, "a")], ['n', 'n1'])
fields = sourceDF.schema.fields
stringFields = filter(lambda f: isinstance(f.dataType, StringType), fields)
nonStringFields = map(lambda f: col(f.name), filter(lambda f: not isinstance(f.dataType, StringType), fields))
stringFieldsTransformed = map(lambda f: upper(col(f.name)), stringFields)
allFields = [*stringFieldsTransformed, *nonStringFields]
df = sourceDF.select(allFields)
You can generate an expression using list comprehension:
from pyspark.sql import functions as psf
expression = [ psf.lower(psf.col(x)).alias(x) for x in df.columns ]
And then just call it over your existing dataframe
>>> df.show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
>>> df.select(*select_expression).show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+

Resources