How do I apply schema with nullable = false to json reading - apache-spark

I'm trying to write some test cases using JSON files for DataFrames (whereas production would use Parquet). I'm using the spark-testing-base framework, and I'm running into a snag when asserting that DataFrames equal each other: the schemas mismatch because the JSON schema always has nullable = true.
I'd like to be able to apply a schema with nullable = false to the JSON read.
I've written a small test case:
import com.holdenkarau.spark.testing.DataFrameSuiteBase
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
import org.scalatest.FunSuite

class TestJSON extends FunSuite with DataFrameSuiteBase {

  val expectedSchema = StructType(
    List(StructField("a", IntegerType, nullable = false),
      StructField("b", IntegerType, nullable = true))
  )

  test("testJSON") {
    val readJson =
      spark.read.schema(expectedSchema).json("src/test/resources/test.json")
    assert(readJson.schema == expectedSchema)
  }
}
And have a small test.json file of:
{"a": 1, "b": 2}
{"a": 1}
This returns an assertion failure of:
StructType(StructField(a,IntegerType,true), StructField(b,IntegerType,true)) did not equal
StructType(StructField(a,IntegerType,false), StructField(b,IntegerType,true))
ScalaTestFailureLocation: TestJSON$$anonfun$1 at (TestJSON.scala:15)
Expected: StructType(StructField(a,IntegerType,false), StructField(b,IntegerType,true))
Actual:   StructType(StructField(a,IntegerType,true), StructField(b,IntegerType,true))
Am I applying the schema the correct way?
I'm using Spark 2.2 and Scala 2.11.8.

There is a workaround: rather than reading the JSON directly from the file, read it into an RDD of strings first and then apply the schema. Below is the code:
val expectedSchema = StructType(
  List(StructField("a", IntegerType, nullable = false),
    StructField("b", IntegerType, nullable = true))
)

test("testJSON") {
  val jsonRdd = spark.sparkContext.textFile("src/test/resources/test.json")
  // val readJson = spark.read.schema(expectedSchema).json("src/test/resources/test.json")
  val readJson = spark.read.schema(expectedSchema).json(jsonRdd)
  readJson.printSchema()
  assert(readJson.schema == expectedSchema)
}
The test case passes and the printSchema result is:
root
|-- a: integer (nullable = false)
|-- b: integer (nullable = true)
There is a JIRA for this issue with Apache Spark, https://issues.apache.org/jira/browse/SPARK-10848, which was closed as not a problem with the comment:
This should be resolved in the latest file format refactoring in Spark 2.0. Please reopen it if you still hit the problem. Thanks!
If you are still getting the error, you can reopen the JIRA.
I tested on Spark 2.1.0 and still see the same issue.

The workaround above ensures the schema is correct, but null values are replaced with type defaults. In my case, when an Int is missing from the JSON string it is set to 0.
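
Another way to sidestep the comparison problem in tests, shown here as a minimal sketch that is not part of the original answers, is to re-apply the desired StructType after the read with createDataFrame. Spark does not enforce nullable = false on the data itself, so this only adjusts the reported schema, but that is enough for a schema equality assertion:
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val expectedSchema = StructType(
  List(StructField("a", IntegerType, nullable = false),
    StructField("b", IntegerType, nullable = true))
)

val readJson = spark.read.schema(expectedSchema).json("src/test/resources/test.json")
// Round-trip through the underlying RDD[Row] to force the exact StructType,
// including its nullability flags.
val withNullability = spark.createDataFrame(readJson.rdd, expectedSchema)
assert(withNullability.schema == expectedSchema)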

Related

Spark DataFrame - Applying Schema shows nullable always true

I am trying to apply a schema to one of my JSON files. No matter whether I set nullable to true or false for this particular field in the schema, when I apply the schema to my file the field always comes back as nullable.
Schema
val schema = StructType(
  List(
    StructField("SMS", StringType, false)
  )
)
output
schema: org.apache.spark.sql.types.StructType = StructType(StructField(SMS,StringType,false))
Applying the schema to the file
val SMSDF = spark.read.schema(schema).json("/mnt/aaa/log*")
SMSDF.printSchema()
output
root
|-- SMS: string (nullable = true)
I am using Spark 2.4.3, Scala 2.11

Why can't Spark properly load columns from HDFS? [duplicate]

Marked as a duplicate of: What is going wrong with `unionAll` of Spark `DataFrame`?
Below I provide my schema and the code that I use to read from partitions in HDFS.
An example of a partition could be this path: /home/maria_dev/data/key=key/date=19 jan (and of course inside this folder there's a CSV file that contains cnt).
So, the data is partitioned by the key and date columns.
When I read it as shown below, the columns are not read properly: cnt ends up in date and vice versa.
How can I resolve this?
private val tweetSchema = new StructType(Array(
  StructField("date", StringType, nullable = true),
  StructField("key", StringType, nullable = true),
  StructField("cnt", IntegerType, nullable = true)
))

// basePath example: /home/maria_dev/data
// path example: /home/maria_dev/data/key=key/date=19 jan
private def loadDF(basePath: String, path: String, format: String): DataFrame = {
  val df = spark.read
    .schema(tweetSchema)
    .format(format)
    .option("basePath", basePath)
    .load(path)
  df
}
I tried changing their order in the schema from (date, key, cnt) to (cnt, key, date), but it did not help.
My problem is that when I call union, it appends the two DataFrames
df1: {(key: 1, date: 2)}
df2: {(date: 3, key: 4)}
into the final DataFrame like this: {(key: 1, date: 2), (date: 3, key: 4)}. As you can see, the columns get mixed up.
The schema should be in the following order:
1. Columns present in the data files, in the case of CSV in their natural left-to-right order.
2. Columns used for partitioning, in the same order as defined by the directory structure.
So in your case the correct order will be:
new StructType(Array(
  StructField("cnt", IntegerType, nullable = true),
  StructField("key", StringType, nullable = true),
  StructField("date", StringType, nullable = true)
))
It turns out that everything was read properly; the columns only got mixed up because union matches columns by position, not by name.
So, now, instead of doing df1.union(df2), I do df1.select("key", "date").union(df2.select("key", "date")) and it works.
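
For reference, a minimal sketch (not part of the original answer, assuming Spark 2.3 or later): unionByName resolves columns by name instead of by position, so the explicit select is not needed.
import spark.implicits._

// Two hypothetical frames with the same columns in a different order.
val df1 = Seq((1, "19 jan")).toDF("key", "date")
val df2 = Seq(("20 jan", 2)).toDF("date", "key")

// union is positional and would mix the columns up;
// unionByName (Spark 2.3+) matches them by name.
val combined = df1.unionByName(df2)
combined.show()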

How to add a schema to a Dataset in Spark?

I am trying to load a file into spark.
If I load a normal textFile into Spark like below:
val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")
The outcome is:
partFile: org.apache.spark.sql.Dataset[String] = [value: string]
I can see a Dataset in the output. But if I load a JSON file:
val pfile = spark.read.json("hdfs://quickstart:8020/user/cloudera/pjson")
The outcome is a DataFrame with a ready-made schema:
pfile: org.apache.spark.sql.DataFrame = [address: struct<city: string, state: string>, age: bigint ... 1 more field]
JSON/Parquet/ORC files carry a schema, so I can understand that this is a feature of Spark 2.x which makes things easier: in this case we directly get a DataFrame, whereas for a plain textFile we get a Dataset with no schema, which makes sense.
What I'd like to know is how I can add a schema to a Dataset that results from loading a textFile into Spark. For an RDD, there is the case class / StructType option to add a schema and convert it to a DataFrame.
Could anyone let me know how I can do it?
When you use textFile, each line of the file becomes a String row in your Dataset. To convert it to a DataFrame with a schema, you can use toDF:
val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")
import spark.implicits._
val df = partFile.toDF("string_column")
In this case, the DataFrame will have a schema with a single column of type StringType.
If your file has a more complex structure, you can either use the CSV reader (if the file is in a structured CSV format):
val partFile = spark.read.option("header", "true").option("delimiter", ";").csv("hdfs://quickstart:8020/user/cloudera/partfile")
Or you can process your Dataset with map and then use toDF to convert it to a DataFrame. For example, suppose you want one column to be the first character of the line (as an Int) and the other column to be the fourth character (also as an Int):
import spark.implicits._

val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")
val processedDataset: Dataset[(Int, Int)] = partFile.map {
  // Note: Char.toInt yields the character's numeric code; use asDigit if you want the digit value.
  line: String => (line(0).toInt, line(3).toInt)
}
val df = processedDataset.toDF("value0", "value3")
Also, you can define a case class, which will represent the final schema for your DataFrame:
import spark.implicits._

case class MyRow(value0: Int, value3: Int)

val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")
val processedDataset: Dataset[MyRow] = partFile.map {
  line: String => MyRow(line(0).toInt, line(3).toInt)
}
val df = processedDataset.toDF
In both cases above, calling df.printSchema would show:
root
|-- value0: integer (nullable = true)
|-- value3: integer (nullable = true)
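
If you want to attach an explicit StructType instead of relying on toDF or a case class, a sketch along these lines (not part of the original answer, reusing the same hypothetical character-parsing logic) maps each line to a Row and calls createDataFrame:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val schema = StructType(Seq(
  StructField("value0", IntegerType, nullable = false),
  StructField("value3", IntegerType, nullable = false)
))

val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")
// Build an RDD[Row] matching the schema, then apply the schema explicitly.
val rowRdd = partFile.rdd.map(line => Row(line(0).toInt, line(3).toInt))
val df = spark.createDataFrame(rowRdd, schema)
df.printSchema()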

Is there a bug about StructField in SPARK Structured Streaming

When I try this:
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

cfg = SparkConf().setAppName('MyApp')
spark = SparkSession.builder.config(conf=cfg).getOrCreate()

lines = spark.readStream.load(format='socket', host='localhost', port=9999,
                              schema=StructType(StructField('value', StringType, True)))
words = lines.groupBy('value').count()
query = words.writeStream.format('console').outputMode("complete").start()
query.awaitTermination()
Then I get this error:
AssertionError: dataType should be DataType
I looked at the source code in ./pyspark/sql/types.py, line 403:
assert isinstance(dataType, DataType), "dataType should be DataType"
But StringType is based on AtomicType, not DataType:
class StringType(AtomicType):
"""String data type.
"""
__metaclass__ = DataTypeSingleton
So is there a mistake?
In Python, DataTypes are not used as singleton objects the way they are in Scala; when creating a StructField you have to pass an instance. Also, StructType requires a sequence of StructFields:
StructType([StructField('value', StringType(), True)])
Nevertheless, this is pointless here: the schema of TextSocketSource is fixed and cannot be modified with the schema argument.

Why specifying schema to be DateType / TimestampType will make querying extremely slow?

I'm using spark-csv 1.1.0 and Spark 1.5. I build the schema as follows:
private def makeSchema(tableColumns: List[SparkSQLFieldConfig]): StructType = {
  new StructType(
    tableColumns.map(p => p.ColumnDataType match {
      case FieldDataType.Integer  => StructField(p.ColumnName, IntegerType, nullable = true)
      case FieldDataType.Decimal  => StructField(p.ColumnName, FloatType, nullable = true)
      case FieldDataType.String   => StructField(p.ColumnName, StringType, nullable = true)
      case FieldDataType.DateTime => StructField(p.ColumnName, TimestampType, nullable = true)
      case FieldDataType.Date     => StructField(p.ColumnName, DateType, nullable = true)
      case FieldDataType.Boolean  => StructField(p.ColumnName, BooleanType, nullable = false)
      case _                      => StructField(p.ColumnName, StringType, nullable = true)
    }).toArray
  )
}
But when there are DateType columns, my queries on the DataFrames become very slow. (The queries are just simple groupBy(), sum() and so on.)
With the same dataset, after I commented out the two lines that map Date to DateType and DateTime to TimestampType (that is, mapping them to StringType instead), the queries become much faster.
What is the possible reason for this? Thank you very much!
We have found a possible answer to this problem.
When a column is simply specified as DateType or TimestampType, spark-csv tries to parse the dates with all of its internal formats for every row, which makes the parsing process much slower.
From its official documentation, it seems that we can specify the date format in an option, and I suppose that would make the parsing process much faster.
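
For example, a sketch assuming Spark 2.x's built-in CSV reader (rather than spark-csv 1.1.0), which exposes dateFormat and timestampFormat options; the schema and path below are hypothetical:
import org.apache.spark.sql.types.{DateType, IntegerType, StructField, StructType, TimestampType}

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("event_date", DateType, nullable = true),
  StructField("event_time", TimestampType, nullable = true)
))

// Giving the reader explicit patterns avoids trying every internal format on every row.
val df = spark.read
  .option("header", "true")
  .option("dateFormat", "yyyy-MM-dd")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
  .schema(schema)
  .csv("/path/to/data.csv")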

Resources