Set schema in pyspark dataframe read.csv with null elements - python-3.x

I have a data set (example below) that, when imported with
df = spark.read.csv(filename, header=True, inferSchema=True)
df.show()
assigns StringType() to the column containing 'NA' values, where I would like it to be IntegerType() (or ByteType()).
I then tried to set the schema explicitly:
from pyspark.sql.types import StructType, StructField, IntegerType, DateType
schema = StructType([
    StructField("col_01", IntegerType()),
    StructField("col_02", DateType()),
    StructField("col_03", IntegerType())
])
df = spark.read.csv(filename, header=True, schema=schema)
df.show()
The output then shows the entire row as null wherever 'col_03' is 'NA'.
However, col_01 and col_02 return the appropriate data when selected on their own:
df.select(['col_01','col_02']).show()
I can work around this by casting the data type of col_03 after the import:
df = spark.read.csv(filename, header=True, inferSchema=True)
df = df.withColumn('col_03', df['col_03'].cast(IntegerType()))
df.show()
However, this does not feel ideal; it would be much better to assign the data type for each column directly when setting the schema.
Could anyone point out what I am doing incorrectly, or is casting the data types after importing the only solution? Any comment on the performance of the two approaches (assuming assigning a schema can be made to work) is also welcome.
Thank you,

You can set a custom null value in Spark's CSV loader using nullValue.
For a CSV file looking like this:
col_01,col_02,col_03
111,2007-11-18,3
112,2002-12-03,4
113,2007-02-14,5
114,2003-04-16,NA
115,2011-08-24,2
116,2003-05-03,3
117,2001-06-11,4
118,2004-05-06,NA
119,2012-03-25,5
120,2006-10-13,4
and forcing the schema:
from pyspark.sql.types import StructType, StructField, IntegerType, DateType
schema = StructType([
    StructField("col_01", IntegerType()),
    StructField("col_02", DateType()),
    StructField("col_03", IntegerType())
])
You'll get:
df = spark.read.csv(filename, header=True, nullValue='NA', schema=schema)
df.show()
df.printSchema()
+------+----------+------+
|col_01| col_02|col_03|
+------+----------+------+
| 111|2007-11-18| 3|
| 112|2002-12-03| 4|
| 113|2007-02-14| 5|
| 114|2003-04-16| null|
| 115|2011-08-24| 2|
| 116|2003-05-03| 3|
| 117|2001-06-11| 4|
| 118|2004-05-06| null|
| 119|2012-03-25| 5|
| 120|2006-10-13| 4|
+------+----------+------+
root
|-- col_01: integer (nullable = true)
|-- col_02: date (nullable = true)
|-- col_03: integer (nullable = true)
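On the performance question from the post: providing an explicit schema lets Spark skip the extra pass over the file that inferSchema needs to guess the types, and nullValue handles the 'NA' tokens during the same read, so no second projection is required. A rough sketch of the two variants for comparison (same filename and schema as above; not benchmarked):
from pyspark.sql import functions as F

# one pass: explicit schema + nullValue
df_schema = spark.read.csv(filename, header=True, schema=schema, nullValue='NA')

# inferSchema scans the data to guess types, then a second projection casts col_03
# (casting the 'NA' strings to int yields null)
df_cast = (spark.read.csv(filename, header=True, inferSchema=True)
           .withColumn('col_03', F.col('col_03').cast('int')))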

Try this approach (note that it will read every column as string type; you can then cast the types as per your requirement):
import csv
from pyspark.sql.types import IntegerType

data = []
with open('filename', 'r') as doc:
    reader = csv.DictReader(doc)
    for line in reader:
        data.append(line)

df = sc.parallelize(data).toDF()
df = df.withColumn("col_03", df["col_03"].cast(IntegerType()))

Related

spark schema difference in partitions

I have to read data from a path which is partitioned by region.
US region has columns a,b,c,d,e
EUR region has only a,b,c,d
When I read data from the path and do a printSchema, I see only a,b,c,d; 'e' is missing.
Is there any way to handle this situation, e.g. so that column e automatically gets populated with null for the EUR data?
You can use the mergeSchema option, which should do exactly what you are looking for, as long as columns with the same name have the same type.
Example:
spark.read.option("mergeSchema", "true").format("parquet").load(...)
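For illustration, a minimal sketch reproducing the idea (the paths are made up, and it assumes the data is stored as Parquet, since mergeSchema is a Parquet/ORC reader option):
us = spark.createDataFrame([(1, 2, 3, 4, 5)], ["a", "b", "c", "d", "e"])
eur = spark.createDataFrame([(6, 7, 8, 9)], ["a", "b", "c", "d"])

# write each region under its own partition directory
us.write.mode("overwrite").parquet("/tmp/regions/region=US")
eur.write.mode("overwrite").parquet("/tmp/regions/region=EUR")

merged = spark.read.option("mergeSchema", "true").parquet("/tmp/regions")
merged.printSchema()   # a, b, c, d, e plus the discovered region column
merged.show()          # e is null for the EUR rows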
Once you read the data from the path, you can check whether the data frame contains column 'e'. If it does not, you can add it with a default value, which is None in this case.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder \
    .appName('example') \
    .getOrCreate()

df = spark.createDataFrame(data=data, schema=columns)  # your own data / column list
if 'e' not in df.columns:
    df = df.withColumn('e', lit(None))
You can collect all the possible columns from both datasets, then fill in None where a column is not available in one of them:
from pyspark.sql import functions as F

df_ab = (spark
    .sparkContext
    .parallelize([
        ('a1', 'b1'),
        ('a2', 'b2'),
    ])
    .toDF(['a', 'b'])
)
df_ab.show()
# +---+---+
# | a| b|
# +---+---+
# | a1| b1|
# | a2| b2|
# +---+---+
df_abcd = (spark
    .sparkContext
    .parallelize([
        ('a3', 'b3', 'c3', 'd3'),
        ('a4', 'b4', 'c4', 'd4'),
    ])
    .toDF(['a', 'b', 'c', 'd'])
)
df_abcd.show()
# +---+---+---+---+
# | a| b| c| d|
# +---+---+---+---+
# | a3| b3| c3| d3|
# | a4| b4| c4| d4|
# +---+---+---+---+
unique_columns = list(set(df_ab.columns + df_abcd.columns))
# ['d', 'b', 'a', 'c']
for col in unique_columns:
    if col not in df_ab.columns:
        df_ab = df_ab.withColumn(col, F.lit(None))
    if col not in df_abcd.columns:
        df_abcd = df_abcd.withColumn(col, F.lit(None))
df_ab.printSchema()
# root
# |-- a: string (nullable = true)
# |-- b: string (nullable = true)
# |-- d: null (nullable = true)
# |-- c: null (nullable = true)
df_ab.show()
# +---+---+----+----+
# | a| b| d| c|
# +---+---+----+----+
# | a1| b1|null|null|
# | a2| b2|null|null|
# +---+---+----+----+
df_abcd.printSchema()
# root
# |-- a: string (nullable = true)
# |-- b: string (nullable = true)
# |-- c: string (nullable = true)
# |-- d: string (nullable = true)
df_abcd.show()
# +---+---+---+---+
# | a| b| c| d|
# +---+---+---+---+
# | a3| b3| c3| d3|
# | a4| b4| c4| d4|
# +---+---+---+---+
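One caveat with this approach: F.lit(None) gives the new columns NullType (as the printSchema above shows), which can cause trouble if you later union or write the two DataFrames. A small variant of the loop, run on the original df_ab and df_abcd, that casts the placeholder to the type used on the other side (assumed to be string here):
for col in unique_columns:
    if col not in df_ab.columns:
        df_ab = df_ab.withColumn(col, F.lit(None).cast("string"))
    if col not in df_abcd.columns:
        df_abcd = df_abcd.withColumn(col, F.lit(None).cast("string"))

# with matching types the two frames can be combined by column name
df_all = df_ab.unionByName(df_abcd)
df_all.show()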
I used pyspark and Spark SQL. I hope this implementation helps you get an idea; Spark provides an SQL environment, and it is very convenient to use Spark SQL for these kinds of things.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

class getData(object):
    """docstring for getData"""

    def get_data(self, n):
        spark = SparkSession.builder.appName('YourProjectName').getOrCreate()
        data2 = [("region 1", "region 2", "region 3", "region 4"),
                 ("region 5", "region 6", "region 7", "region 8")]
        schema = StructType([
            StructField("a", StringType(), True),
            StructField("b", StringType(), True),
            StructField("c", StringType(), True),
            StructField("d", StringType(), True)
        ])
        data3 = [("EU region 1", "EU region 2", "EU region 3"),
                 ("EU region 5", "EU region 6", "EU region 7")]
        schema3 = StructType([
            StructField("a", StringType(), True),
            StructField("b", StringType(), True),
            StructField("c", StringType(), True)
        ])
        df = spark.createDataFrame(data=data2, schema=schema)
        df.createOrReplaceTempView("USRegion")
        sqlDF = spark.sql("SELECT * FROM USRegion")
        sqlDF.show(n=600)
        df1 = spark.createDataFrame(data=data3, schema=schema3)
        df1.createOrReplaceTempView("EURegion")
        sqlDF1 = spark.sql("SELECT * FROM EURegion")
        sqlDF1.show(n=600)
        sql_union_df = spark.sql(
            "SELECT a, b, c, d FROM USRegion "
            "UNION ALL "
            "SELECT a, b, c, '' AS d FROM EURegion")
        sql_union_df.show(n=600)

# call the class
conn = getData()
# call the method implemented inside the class
print(conn.get_data(10))

pyspark read csv with user specified schema - returned all StringType

New to pyspark. I am trying to read a csv file from a datalake blob using pyspark with a user-specified schema (StructType). Below is the code I tried.
from pyspark.sql.types import *
customschema = StructType([
    StructField("A", StringType(), True),
    StructField("B", DoubleType(), True),
    StructField("C", TimestampType(), True)
])
df_1 = spark.read.format("csv").options(header="true", schema=customschema, multiline="true", enforceSchema='true').load(destinationPath)
df_1.show()
Out:
+---------+------+--------------------+
| A| B| C|
+---------+------+--------------------+
|322849691|9547.0|2020-09-24 07:30:...|
|322847371| 492.0|2020-09-23 13:15:...|
|322329853|6661.0|2020-09-07 09:45:...|
|322283810| 500.0|2020-09-04 13:12:...|
|322319107| 251.0|2020-09-02 13:51:...|
|322319096| 254.0|2020-09-02 13:51:...|
+---------+------+--------------------+
But I got the field type as String instead. I am not quite sure what I have done wrong.
df_1.printSchema()
Out:
root
|-- A: string (nullable = true)
|-- B: string (nullable = true)
|-- C: string (nullable = true)
When you use the DataFrameReader load method, you should pass the schema via the schema method and not through the options:
df_1 = spark.read.format("csv") \
.options(header="true", multiline="true")\
.schema(customschema).load(destinationPath)
That's not the same as the API method spark.read.csv, which accepts schema as an argument:
df_1 = spark.read.csv(destinationPath, schema=customschema, header=True)
It works with the following syntax:
customschema = StructType([
    StructField("A", StringType(), True),
    StructField("B", DoubleType(), True),
    StructField("C", TimestampType(), True)
])
df = spark.read.csv("test.csv", header=True, sep=";", schema=customschema)
df.show()
df.printSchema()
or you can also use
df = spark.read.load("test.csv",format="csv", sep=";", schema=customschema, header="true")
It is interesting that the read().options().load() syntax does not apply the schema: schema is not a recognized CSV option, so it is silently ignored when passed through .options() and has to be set with the .schema() method (or the schema argument of spark.read.csv) instead.
Another option would be to cast the data types afterwards:
import pyspark.sql.functions as f
df = (df
    .withColumn("A", f.col("A").cast("string"))
    .withColumn("B", f.col("B").cast("double"))
    .withColumn("C", f.col("C").cast("timestamp"))
)

How do you remove an ambiguous column in pyspark?

There are many questions similar to this that are asking a different question with regard to avoid duplicate columns in a join; that is not what I am asking here.
Given that I already have a DataFrame with ambiguous columns, how do I remove a specific column?
For example, given:
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, StringType

df = spark.createDataFrame(
    spark.sparkContext.parallelize([
        [1, 0.0, "ext-0.0"],
        [1, 1.0, "ext-1.0"],
        [2, 1.0, "ext-2.0"],
        [3, 2.0, "ext-3.0"],
        [4, 3.0, "ext-4.0"],
    ]),
    StructType([
        StructField("id", IntegerType(), True),
        StructField("shared", DoubleType(), True),
        StructField("shared", StringType(), True),
    ])
)
I wish to retain only the numeric columns.
However, attempting to do something like df.select("id", "shared").show() results in:
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "Reference 'shared' is ambiguous, could be: shared, shared.;"
Many related solutions to this problem simply amount to 'avoid ever getting into this situation', e.g. by using ['joinkey'] instead of a.joinkey = b.joinkey in the join. I reiterate that this is not the situation here; this relates to a dataframe that has already been converted into this form.
The metadata from the DF disambiguates these columns:
$ df.dtypes
[('id', 'int'), ('shared', 'double'), ('shared', 'string')]
$ df.schema
StructType(List(StructField(id,IntegerType,true),StructField(shared,DoubleType,true),StructField(shared,StringType,true)))
So the data is retained internally... I just can't see how to use it.
How do I pick one column over the other?
I expected to be able to use, e.g., col('shared#11') or similar... but there is nothing like that as far as I can see.
Is this simply not possible in spark?
To answer this question, please post either a) a working code snippet that solves the problem above, or b) a link to something official from the Spark developers saying that this simply isn't supported.
The easiest solution to this problem is to rename the columns using df.toDF(...<new-col-names>...), but if you don't want to change the column names, you can group the duplicated columns by their type into a struct<type1, type2> as below.
Please note that the solution below is written in Scala, but logically similar code can be implemented in Python. Also, this solution will work for all duplicate columns in the dataframe.
1. Load the test data
val df = Seq((1, 2.0, "shared")).toDF("id", "shared", "shared")
df.show(false)
df.printSchema()
/**
* +---+------+------+
* |id |shared|shared|
* +---+------+------+
* |1 |2.0 |shared|
* +---+------+------+
*
* root
* |-- id: integer (nullable = false)
* |-- shared: double (nullable = false)
* |-- shared: string (nullable = true)
*/
2. get all the duplicated column names
// 1. get all the duplicated column names
val findDupCols = (cols: Array[String]) => cols.map((_ , 1)).groupBy(_._1).filter(_._2.length > 1).keys.toSeq
val dupCols = findDupCols(df.columns)
println(dupCols.mkString(", "))
// shared
3. rename duplicate cols like shared => shared:string, shared:int, without touching the other column names
val renamedDF = df
// 2 rename duplicate cols like shared => shared:string, shared:int
.toDF(df.schema
.map{case StructField(name, dt, _, _) =>
if(dupCols.contains(name)) s"$name:${dt.simpleString}" else name}: _*)
4. create struct of all cols
// 3. create struct of all cols
val structCols = df.schema.map(f => f.name -> f ).groupBy(_._1)
.map{case(name, seq) =>
if (seq.length > 1)
struct(
seq.map { case (_, StructField(fName, dt, _, _)) =>
expr(s"`$fName:${dt.simpleString}` as ${dt.simpleString}")
}: _*
).as(name)
else col(name)
}.toSeq
val structDF = renamedDF.select(structCols: _*)
structDF.show(false)
structDF.printSchema()
/**
* +-------------+---+
* |shared |id |
* +-------------+---+
* |[2.0, shared]|1 |
* +-------------+---+
*
* root
* |-- shared: struct (nullable = false)
* | |-- double: double (nullable = false)
* | |-- string: string (nullable = true)
* |-- id: integer (nullable = false)
*/
5. get columns by their type using <column_name>.<datatype>
// Use the dataframe without losing any columns
structDF.selectExpr("id", "shared.double as shared").show(false)
/**
* +---+------+
* |id |shared|
* +---+------+
* |1 |2.0 |
* +---+------+
*/
Hope this is useful to someone!
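For anyone wanting the same idea in PySpark, here is a rough sketch of just the renaming step (it skips the struct-grouping part): disambiguate the duplicates by appending their type, then select the copy you want by its new name.
# names that appear more than once
dup_names = {c for c in df.columns if df.columns.count(c) > 1}

# append the simple type name to each duplicated column, leave the others untouched
new_names = [
    f"{f.name}:{f.dataType.simpleString()}" if f.name in dup_names else f.name
    for f in df.schema.fields
]

renamed = df.toDF(*new_names)      # e.g. id, shared:double, shared:string
renamed.selectExpr("id", "`shared:double` as shared").show()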
It seems this is possible by replacing the schema using .rdd.toDF() on the dataframe.
However, I'll still accept any answer that is less convoluted and annoying than the one below:
import random
import string
from pyspark.sql.types import DoubleType, LongType

def makeId():
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(6))

def makeUnique(column):
    return "%s---%s" % (column.name, makeId())

def makeNormal(column):
    return column.name.split("---")[0]

unique_schema = list(map(makeUnique, df.schema))
df_unique = df.rdd.toDF(schema=unique_schema)
df_unique.show()

numeric_cols = filter(lambda c: c.dataType.__class__ in [LongType, DoubleType], df_unique.schema)
numeric_col_names = list(map(lambda c: c.name, numeric_cols))
df_filtered = df_unique.select(*numeric_col_names)
df_filtered.show()

normal_schema = list(map(makeNormal, df_filtered.schema))
df_fixed = df_filtered.rdd.toDF(schema=normal_schema)
df_fixed.show()
Gives:
+-----------+---------------+---------------+
|id---chjruu|shared---aqboua|shared---ehjxor|
+-----------+---------------+---------------+
| 1| 0.0| ext-0.0|
| 1| 1.0| ext-1.0|
| 2| 1.0| ext-2.0|
| 3| 2.0| ext-3.0|
| 4| 3.0| ext-4.0|
+-----------+---------------+---------------+
+-----------+---------------+
|id---chjruu|shared---aqboua|
+-----------+---------------+
| 1| 0.0|
| 1| 1.0|
| 2| 1.0|
| 3| 2.0|
| 4| 3.0|
+-----------+---------------+
+---+------+
| id|shared|
+---+------+
| 1| 0.0|
| 1| 1.0|
| 2| 1.0|
| 3| 2.0|
| 4| 3.0|
+---+------+
Workaround: Simply rename the columns (in order) and then do whatever you wanted to do!
renamed_df = df.toDF("id", "shared_double", "shared_string")
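From there, selecting or dropping one of the copies is straightforward; a short follow-up using the names chosen in the rename above:
renamed_df.select("id", "shared_double").withColumnRenamed("shared_double", "shared").show()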

How do I add headers to a PySpark DataFrame?

I have created a PySpark RDD (converted from XML to CSV) that does not have headers. I need to convert it to a DataFrame with headers to perform some SparkSQL queries on it. I cannot seem to find a simple way to add headers. Most examples start with a dataset that already has headers.
df = spark.read.csv('some.csv', header=True, schema=schema)
However, I need to append headers.
headers = ['a', 'b', 'c', 'd']
This seems like a trivial problem; I am not sure why I cannot find a working solution. Thank you.
Like this... you need to specify the schema and .option("header", "false") if your csv does not contain a header row:
spark.version
'2.3.2'
! cat sample.csv
1, 2.0,"hello"
3, 4.0, "there"
5, 6.0, "how are you?"
PATH = "sample.csv"
from pyspark.sql.functions import *
from pyspark.sql.types import *
schema = StructType([
    StructField("col1", IntegerType(), True),
    StructField("col2", FloatType(), True),
    StructField("col3", StringType(), True)])
csvFile = spark.read.format("csv")\
.option("header", "false")\
.schema(schema)\
.load(PATH)
csvFile.show()
+----+----+---------------+
|col1|col2| col3|
+----+----+---------------+
| 1| 2.0| hello|
| 3| 4.0| "there"|
| 5| 6.0| "how are you?"|
+----+----+---------------+
# if you have rdd and want to convert straight to df
rdd = sc.textFile(PATH)
# just showing rows
for i in rdd.collect(): print(i)
1, 2.0,"hello"
3, 4.0, "there"
5, 6.0, "how are you?"
# use Row to construct a schema from rdd
from pyspark.sql import Row
csvDF = rdd\
    .map(lambda x: Row(col1=int(x.split(",")[0]),
                       col2=float(x.split(",")[1]),
                       col3=str(x.split(",")[2]))).toDF()
csvDF.show()
+----+----+---------------+
|col1|col2| col3|
+----+----+---------------+
| 1| 2.0| "hello"|
| 3| 4.0| "there"|
| 5| 6.0| "how are you?"|
+----+----+---------------+
csvDF.printSchema()
root
|-- col1: long (nullable = true)
|-- col2: double (nullable = true)
|-- col3: string (nullable = true)
rdd.toDF(schema=['a', 'b', 'c', 'd'])
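One caveat with that one-liner: toDF only works if each RDD element is already a tuple, list, or Row with one value per column; if the RDD holds raw text lines (as with sc.textFile), split them first. A minimal sketch using the four headers from the question, assuming comma-separated lines:
headers = ['a', 'b', 'c', 'd']
structured = rdd.map(lambda line: [v.strip() for v in line.split(",")])
df = structured.toDF(headers)   # every column comes in as string; cast afterwards if needed
df.show()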

Convert null values to empty array in Spark DataFrame

I have a Spark data frame where one column is an array of integers. The column is nullable because it is coming from a left outer join. I want to convert all null values to an empty array so I don't have to deal with nulls later.
I thought I could do it like so:
val myCol = df("myCol")
df.withColumn( "myCol", when(myCol.isNull, Array[Int]()).otherwise(myCol) )
However, this results in the following exception:
java.lang.RuntimeException: Unsupported literal type class [I [I#5ed25612
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:49)
at org.apache.spark.sql.functions$.lit(functions.scala:89)
at org.apache.spark.sql.functions$.when(functions.scala:778)
Apparently array types are not supported by the when function. Is there some other easy way to convert the null values?
In case it is relevant, here is the schema for this column:
|-- myCol: array (nullable = true)
| |-- element: integer (containsNull = false)
You can use a UDF:
import org.apache.spark.sql.functions.udf
val array_ = udf(() => Array.empty[Int])
combined with WHEN or COALESCE:
df.withColumn("myCol", when(myCol.isNull, array_()).otherwise(myCol))
df.withColumn("myCol", coalesce(myCol, array_())).show
In the recent versions you can use array function:
import org.apache.spark.sql.functions.{array, lit}
df.withColumn("myCol", when(myCol.isNull, array().cast("array<integer>")).otherwise(myCol))
df.withColumn("myCol", coalesce(myCol, array().cast("array<integer>"))).show
Please note that it will work only if conversion from string to the desired type is allowed.
The same thing can of course be done in PySpark as well. For the legacy solution you can define a udf:
from pyspark.sql.functions import udf, coalesce
from pyspark.sql.types import ArrayType, IntegerType

def empty_array(t):
    return udf(lambda: [], ArrayType(t()))()

coalesce(myCol, empty_array(IntegerType()))
and in the recent versions just use array:
from pyspark.sql.functions import array
coalesce(myCol, array().cast("array<integer>"))
With a slight modification to zero323's approach, I was able to do this without using a udf in Spark 2.3.1.
val df = Seq("a" -> Array(1,2,3), "b" -> null, "c" -> Array(7,8,9)).toDF("id","numbers")
df.show
+---+---------+
| id| numbers|
+---+---------+
| a|[1, 2, 3]|
| b| null|
| c|[7, 8, 9]|
+---+---------+
val df2 = df.withColumn("numbers", coalesce($"numbers", array()))
df2.show
+---+---------+
| id| numbers|
+---+---------+
| a|[1, 2, 3]|
| b| []|
| c|[7, 8, 9]|
+---+---------+
A UDF-free alternative, for when the data type you want your array elements in cannot be cast from StringType, is the following:
import pyspark.sql.types as T
import pyspark.sql.functions as F
df.withColumn(
    "myCol",
    F.coalesce(
        F.col("myCol"),
        F.from_json(F.lit("[]"), T.ArrayType(T.IntegerType()))
    )
)
You can replace IntegerType() with whichever data type, also complex ones.
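A quick end-to-end check of the snippet above on toy data (the column names and the SparkSession setup here are just for illustration):
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a", [1, 2, 3]), ("b", None)],
    T.StructType([
        T.StructField("id", T.StringType()),
        T.StructField("myCol", T.ArrayType(T.IntegerType())),
    ]),
)

df.withColumn(
    "myCol",
    F.coalesce(F.col("myCol"), F.from_json(F.lit("[]"), T.ArrayType(T.IntegerType())))
).show()   # the null row becomes an empty array []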
