Until recently, Parquet did not support null values - a questionable premise. In fact, a recent version did finally add that support:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
However, it will be a long time before Spark supports that new Parquet feature - if ever. Here is the associated (closed - will not fix) JIRA:
https://issues.apache.org/jira/browse/SPARK-10943
So what are folks doing with regard to null column values today when writing out dataframes to Parquet? I can only think of very ugly, horrible hacks like writing empty strings and... well, I have no idea what to do with numerical values to indicate null - short of putting some sentinel value in and having my code check for it (which is inconvenient and bug-prone).
You misinterpreted SPARK-10943. Spark does support writing null values to numeric columns.
The problem is that null alone carries no type information at all:
scala> spark.sql("SELECT null as comments").printSchema
root
|-- comments: null (nullable = true)
As per a comment by Michael Armbrust, all you have to do is cast:
scala> spark.sql("""SELECT CAST(null as DOUBLE) AS comments""").printSchema
root
|-- comments: double (nullable = true)
and the result can be safely written to Parquet.
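For example, roughly the same thing with the DataFrame API (the output path here is just a placeholder):
import org.apache.spark.sql.functions.lit
// give the all-null column a concrete type before writing
val df = spark.range(3).withColumn("comments", lit(null).cast("double"))
df.printSchema()
// root
//  |-- id: long (nullable = false)
//  |-- comments: double (nullable = true)
df.write.mode("overwrite").parquet("/tmp/comments_with_nulls.parquet")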
I wrote a PySpark solution for this (df is a dataframe with columns of NullType):
# get dataframe schema
my_schema = list(df.schema)
null_cols = []
# iterate over schema list to filter for NullType columns
for st in my_schema:
    if str(st.dataType) == 'NullType':
        null_cols.append(st)
# cast null type columns to string (or whatever you'd like)
for ncol in null_cols:
    mycolname = str(ncol.name)
    df = df \
        .withColumn(mycolname, df[mycolname].cast('string'))
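A rough Scala sketch of the same idea, assuming a DataFrame df (cast to whatever type suits your data):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.NullType
// cast every NullType column to string so Parquet gets a concrete type
def castNullColumnsToString(df: DataFrame): DataFrame =
  df.schema.fields
    .filter(_.dataType == NullType)
    .foldLeft(df) { (acc, field) =>
      acc.withColumn(field.name, acc(field.name).cast("string"))
    }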
Related
We have a scenario where we are using our custom serializer to serialize POJOs.
In our use case, we have formed dataframes with one of the columns holding these POJOs.
df.printSchema is as below:
root
|-- mykey: string (nullable = true)
|-- myPojo: binary (nullable = true)
When we are trying to fetch values from this dataframe, i.e. running the following code:
df.foreach((row) => {
  val value1 = row.getAs[String]("mykey")   // this works fine
  val value2 = row.getAs[MyPojo]("myPojo")  // getting exception here
})
We are getting a java.lang.ClassCastException when fetching value2.
Any pointers on how to resolve this issue?
Thanks
Anuj
You cannot read an object this way from a Row. You can only read the data out as Array[Byte] and then manually deserialize it.
In theory, Spark's Dataset[_] should allow for this by building a custom Encoder but the Spark implementation is bound to a single subtype right now: ExpressionEncoder.
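For illustration, a rough sketch of the manual route; plain Java serialization stands in here for whatever your custom serializer actually is, and MyPojo is the question's own class:
import java.io.{ByteArrayInputStream, ObjectInputStream}
// example deserializer only; replace with your custom serializer's counterpart
def deserialize[T](bytes: Array[Byte]): T = {
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
  try in.readObject().asInstanceOf[T] finally in.close()
}
df.foreach { row =>
  val key   = row.getAs[String]("mykey")
  val bytes = row.getAs[Array[Byte]]("myPojo")  // the binary column comes back as Array[Byte]
  val pojo  = deserialize[MyPojo](bytes)        // MyPojo is your own class
  // ... use key and pojo ...
}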
In Java, with Spark version 3.1.2:
byte[] data = item.getAs("column_name");
Just works.
I have the following code, which uses an IF statement to build a dataframe conditionally.
Does this work as I expect?
df = sqlContext.read.option("badRecordsPath", badRecordsPath).json([data_path_1, s3_prefix + "batch_01/2/2019-04-28/15723921/15723921_15.json"])
if "scrape_date" not in df.columns:
    df = df.withColumn("scrape_date", lit(None).cast(StringType()))
Is this what you are trying to do?
val result = <SOME Dataframe I previously created>
scala> result.printSchema
root
|-- VAR1: string (nullable = true)
|-- VAR2: double (nullable = true)
|-- VAR3: string (nullable = true)
|-- VAR4: string (nullable = true)
scala> result.columns.contains("VAR3")
res13: Boolean = true
scala> result.columns.contains("VAR9")
res14: Boolean = false
So the "result" dataframe has columns "VAR1", "VAR2" and so on.
The next line shows that it contains "VAR3" (result of expression is "true". But it does not contains a column called "VAR9" (result of the expression is "false").
The above is scala, but you should be able to do the same in Python (sorry I did not notice you were asking about python when I replied).
In terms of execution, the if statement will execute locally on the driver node. As a rule of thumb, if something returns an RDD, DataFrame or DataSet, it will be executed in parallel on the executor(s). Since DataFrame.columns returns an Array, any processing of the list of columns will be done on the driver node (because an Array is not an RDD, DataFrame nor DataSet).
Also note that RDD, DataFrame and DataSet operations are executed lazily. That is, Spark will "accumulate" the operations that generate these objects and only execute them when you do something that doesn't generate an RDD, DataFrame or DataSet - for example, when you do a show, a count or a collect. Part of the reason for doing this is so Spark can optimise the execution of the process. Another is so it only does what is actually needed to generate the answer.
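As a small illustration of both points, in Scala and mirroring the Python snippet above (the column name is just the example's):
import org.apache.spark.sql.functions.lit
// df.columns is a plain Array[String], so this check runs on the driver
val withDate =
  if (df.columns.contains("scrape_date")) df
  else df.withColumn("scrape_date", lit(null).cast("string"))  // transformation: recorded, not run
// nothing has executed on the cluster yet; an action such as show, count or collect triggers it
withDate.show(5)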
I have a dataframe like below.
itemName, itemCategory
Name1, C0
Name2, C1
Name3, C0
I would like to save this dataframe as a partitioned Parquet file:
df.write.mode("overwrite").partitionBy("itemCategory").parquet(path)
For this dataframe, when I read the data back, it will have String as the data type for itemCategory.
However, at times I have dataframes from other tenants as below.
itemName, itemCategory
Name1, 0
Name2, 1
Name3, 0
In this case, after being written as partitioned data and read back, the resulting dataframe will have Int as the data type of itemCategory.
The Parquet file has metadata that describes the data type. How can I specify the data type for the partition so it will be read back as String instead of Int?
If you set "spark.sql.sources.partitionColumnTypeInference.enabled" to "false", spark will infer all partition columns as Strings.
In spark 2.0 or greater, you can set like this:
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
In 1.6, like this:
sqlContext.setConf("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
The downside is you have to do this each time you read the data, but at least it works.
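Put together, the read-back looks roughly like this (the path is a placeholder):
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
val readBack = spark.read.parquet("/path/to/partitioned/output")
readBack.printSchema()
// itemCategory now comes back as string rather than an inferred int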
As you partition by the itemCategory column, this data will be stored in the file structure and not in the actual data files. Spark infers the datatype depending on the values; if all values are integers, then the column type will be int.
One simple solution would be to cast the column to StringType after reading the data:
import org.apache.spark.sql.types.StringType
import spark.implicits._
df.withColumn("itemCategory", $"itemCategory".cast(StringType))
Another option would be to duplicate the column itself. Then one of the columns will be used for the partitioning and, hence, be saved in the file structure. However, the other duplicated column would be saved normally in the parquet file. To make a duplicate simply use:
df.withColumn("itemCategoryCopy", $"itemCategory")
Read it with a schema:
import spark.implicits._
val path = "/tmp/test/input"
val source = Seq(("Name1", "0"), ("Name2", "1"), ("Name3", "0")).toDF("itemName", "itemCategory")
source.write.partitionBy("itemCategory").parquet(path)
spark.read.schema(source.schema).parquet(path).printSchema()
// will print
// root
// |-- itemName: string (nullable = true)
// |-- itemCategory: string (nullable = true)
See https://www.zepl.com/viewer/notebooks/bm90ZTovL2R2aXJ0ekBnbWFpbC5jb20vMzEzZGE2ZmZjZjY0NGRiZjk2MzdlZDE4NjEzOWJlZWYvbm90ZS5qc29u
For JDBC dataframes, if I specify a custom query like
(select * from table1 where col4 > 10.0) AS table1
then the schema for all columns turns out to be nullable = false:
col1: string (nullable = false)
col2: string (nullable = false)
col3: string (nullable = false)
col4: float (nullable = false)
This causes a null pointer exception when I use custom queries and the result set contains any null value. I also tried to transform the schema programmatically, but it still fails because of Spark lineage, as the original dataframe has the restricted schema irrespective of what schema the transformed dataframe has.
I found a workaround for this. If I just provide the table name and then add the select and where clauses,
sqlContext.read.jdbc(url, table1, dconnectionProperties).
  select("col1", "col2", "col3", "col4").
  where(s"col4 < 10.0")
the schema is correctly (or rather how I want it) inferred as
col1: string (nullable = true)
col2: string (nullable = true)
col3: string (nullable = true)
col4: float (nullable = true)
But I wanted to use custom queries, as my queries have some joins and aggregations which I want to be pushed down to the database to execute.
This started showing up after we moved to Spark 2.0.x; prior to that this was working fine.
This issue is related to the Teradata JDBC driver. The problem is discussed at https://community.teradata.com/t5/Connectivity/Teradata-JDBC-Driver-returns-the-wrong-schema-column-nullability/m-p/76667/highlight/true#M3798.
The root cause is discussed on the first page, and the solution is on the third page.
People from Teradata said they fixed the issue in the 16.10.* driver with a MAYBENULL parameter, but I am still seeing nondeterministic behaviour.
Here is a similar discussion
https://issues.apache.org/jira/browse/SPARK-17195
To close the loop, this issue is fixed in driver version 16.04.00.0. Two new parameters need to be added to the connection string: COLUMN_NAME=ON,MAYBENULL=ON.
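For reference, a rough sketch of what that might look like; the URL format, host and database names are assumptions, so check the Teradata driver documentation for your version:
// assumed Teradata JDBC URL format; adjust for your environment
val url = "jdbc:teradata://my-td-host/DATABASE=mydb,COLUMN_NAME=ON,MAYBENULL=ON"
val df = sqlContext.read.jdbc(
  url,
  "(select * from table1 where col4 > 10.0) AS table1",
  dconnectionProperties)  // same connection properties as in the snippet above
df.printSchema()  // columns should now come back as nullable = true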
I have two dataframes df1 and df2. Both of them have the following schema:
|-- ts: long (nullable = true)
|-- id: integer (nullable = true)
|-- managers: array (nullable = true)
| |-- element: string (containsNull = true)
|-- projects: array (nullable = true)
| |-- element: string (containsNull = true)
df1 is created from an Avro file while df2 is created from an equivalent Parquet file. However, if I execute df1.unionAll(df2).show(), I get the following error:
org.apache.spark.sql.AnalysisException: unresolved operator 'Union;
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
I ran into the same situation, and it turns out that not only do the fields need to be the same, but you also need to maintain the exact same ordering of the fields in both dataframes in order to make it work.
This is old and there are already some answers lying around, but I just faced this problem while trying to make a union of two dataframes, as in...
//Join 2 dataframes
val df = left.unionAll(right)
As others have mentioned, order matters. So just select the right dataframe's columns in the same order as the left dataframe's columns:
//Join 2 dataframes, but take columns in the same order
val df = left.unionAll(right.select(left.columns.map(col):_*))
I found the following PR on GitHub:
https://github.com/apache/spark/pull/11333
It relates to UDF (user-defined function) columns, which were not correctly handled during the union and thus would cause the union to fail. The PR fixes it, but it hasn't made it into Spark 1.6.2; I haven't checked on Spark 2.x yet.
If you're stuck on 1.6.x, there's a stupid workaround: map the DataFrame to an RDD and back to a DataFrame.
// for a DF with 2 columns (Long, Array[Long])
val simple = dfWithUDFColumn
  .map { r => (r.getLong(0), r.getAs[Array[Long]](1)) } // DF --> RDD[(Long, Array[Long])]
  .toDF("id", "tags")                                   // RDD --> back to DF, but now without the UDF column
// dfOrigin has the same structure but no UDF columns
val joined = dfOrigin.unionAll(simple).dropDuplicates(Seq("id")).cache()