Spark DataFrame - Applying Schema shows nullable always true - apache-spark

I am trying to apply a schema to one of my JSON files. No matter whether I set nullable to true or false for this particular field in the schema, when I apply the schema to the file the field always comes out as nullable.
Schema
val schema = StructType(
  List(
    StructField("SMS", StringType, false)
  )
)
output
schema: org.apache.spark.sql.types.StructType = StructType(StructField(SMS,StringType,false))
applying the schema to the file
val SMSDF = spark.read.schema(schema).json("/mnt/aaa/log*")
SMSDF.printSchema()
output
root
|-- SMS: string (nullable = true)
I am using Spark 2.4.3, Scala 2.11

Related

error: overloaded method value createDataFrame

I tried to create an Apache Spark DataFrame:
val valuesCol = Seq(("Male","2019-09-06"),("Female","2019-09-06"),("Male","2019-09-07"))
valuesCol: Seq[(String, String)] = List((Male,2019-09-06), (Female,2019-09-06), (Male,2019-09-07))
Schema
val someSchema = List(StructField("sex", StringType, true),StructField("date", DateType, true))
someSchema: List[org.apache.spark.sql.types.StructField] = List(StructField(sex,StringType,true), StructField(date,DateType,true))
It does not work
val someDF = spark.createDataFrame(spark.sparkContext.parallelize(valuesCol),StructType(someSchema))
I got error
<console>:30: error: overloaded method value createDataFrame with alternatives:
(data: java.util.List[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rdd: org.apache.spark.api.java.JavaRDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rdd: org.apache.spark.rdd.RDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rows: java.util.List[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
(rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
(rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.rdd.RDD[(String, String)], org.apache.spark.sql.types.StructType)
val someDF = spark.createDataFrame(spark.sparkContext.parallelize(valuesCol),StructType(someSchema))
Should I change date formatting in valuesCol? What actually causes this error?
With import spark.implicits._ you can convert a Seq into a DataFrame in place:
val df: DataFrame = Seq(("Male","2019-09-06"),("Female","2019-09-06"),("Male","2019-09-07"))
.toDF() // <--- Here
Explicitly setting column names:
val df: DataFrame = Seq(("Male","2019-09-06"),("Female","2019-09-06"),("Male","2019-09-07"))
.toDF("sex", "date")
For the desired schema, you can either cast the column or use a different type:
//Cast
Seq(("Male","2019-09-06"),("Female","2019-09-06"),("Male","2019-09-07"))
.toDF("sex", "date")
.select($"sex", $"date".cast(DateType))
.printSchema()
//Types
val format = new java.text.SimpleDateFormat("yyyy-MM-dd")
Seq(
("Male", new java.sql.Date(format.parse("2019-09-06").getTime)),
("Female", new java.sql.Date(format.parse("2019-09-06").getTime)),
("Male", new java.sql.Date(format.parse("2019-09-07").getTime)))
.toDF("sex", "date")
.printSchema()
//Output
root
|-- sex: string (nullable = true)
|-- date: date (nullable = true)
Regarding your question: since the element type of your RDD is known, Spark creates the schema from it.
val rdd: RDD[(String, String)] = spark.sparkContext.parallelize(valuesCol)
spark.createDataFrame(rdd)
root
|-- _1: string (nullable = true)
|-- _2: string (nullable = true)
You can specify your valuesCol as a Seq of Row instead of a Seq of tuples:
val valuesCol = Seq(
  Row("Male", "2019-09-06"),
  Row("Female", "2019-09-06"),
  Row("Male", "2019-09-07"))

Printschema() in Apache Spark [duplicate]

Dataset<Tweet> ds = sc.read().json("/path").as(Encoders.bean(Tweet.class));
Tweet class:
long id;
String user;
String text;
ds.printSchema();
Output:-
root
|-- id: string (nullable = true)
|-- text: string (nullable = true)
|-- user: string (nullable = true)
The JSON file has all fields as strings.
My question: I am reading the input and encoding it as Tweet.class. The data type specified for id in the class is long, but when the schema is printed it shows up as string.
Does printSchema() report the schema according to how the file was read, or according to the encoding we apply (here Tweet.class)?
I don't know the exact reason why your code is not working, but if you want to change the field type you can write your own custom schema:
val schema = StructType(List(
  StructField("id", LongType, nullable = true),
  StructField("text", StringType, nullable = true),
  StructField("user", StringType, nullable = true)
))
You can then apply the schema to your DataFrame as follows:
Dataset<Tweet> ds = sc.read().schema(schema).json("/path").as(Encoders.bean(Tweet.class));
ds.printSchema();

How do I apply schema with nullable = false to json reading

I'm trying to write some test cases using JSON files for dataframes (whereas production would be Parquet). I'm using the spark-testing-base framework and I'm running into a snag when asserting that data frames equal each other, due to schema mismatches: the JSON schema always has nullable = true.
I'd like to be able to apply a schema with nullable = false to the json read.
I've written a small test case:
import com.holdenkarau.spark.testing.DataFrameSuiteBase
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
import org.scalatest.FunSuite
class TestJSON extends FunSuite with DataFrameSuiteBase {
  val expectedSchema = StructType(
    List(StructField("a", IntegerType, nullable = false),
      StructField("b", IntegerType, nullable = true))
  )
  test("testJSON") {
    val readJson =
      spark.read.schema(expectedSchema).json("src/test/resources/test.json")
    assert(readJson.schema == expectedSchema)
  }
}
And have a small test.json file of:
{"a": 1, "b": 2}
{"a": 1}
This returns an assertion failure of
StructType(StructField(a,IntegerType,true),
StructField(b,IntegerType,true)) did not equal
StructType(StructField(a,IntegerType,false),
StructField(b,IntegerType,true)) ScalaTestFailureLocation:
TestJSON$$anonfun$1 at (TestJSON.scala:15) Expected
:StructType(StructField(a,IntegerType,false),
StructField(b,IntegerType,true)) Actual
:StructType(StructField(a,IntegerType,true),
StructField(b,IntegerType,true))
Am I applying the schema the correct way?
I'm using spark 2.2, scala 2.11.8
There is a workaround: rather than reading the JSON directly from the file, read it into an RDD first and then apply the schema. Below is the code:
val expectedSchema = StructType(
  List(StructField("a", IntegerType, nullable = false),
    StructField("b", IntegerType, nullable = true))
)
test("testJSON") {
  val jsonRdd = spark.sparkContext.textFile("src/test/resources/test.json")
  //val readJson = sparksession.read.schema(expectedSchema).json("src/test/resources/test.json")
  val readJson = spark.read.schema(expectedSchema).json(jsonRdd)
  readJson.printSchema()
  assert(readJson.schema == expectedSchema)
}
The test case passes and the printSchema result is:
root
|-- a: integer (nullable = false)
|-- b: integer (nullable = true)
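A side note (my assumption about newer APIs, not part of the original answer): json(RDD[String]) is deprecated since Spark 2.2 in favour of json(Dataset[String]), so the same workaround can be written with textFile, which returns a Dataset[String]:
val jsonDS = spark.read.textFile("src/test/resources/test.json") // Dataset[String]
val readJson = spark.read.schema(expectedSchema).json(jsonDS)
readJson.printSchema()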
There is a JIRA for this issue with Apache Spark, https://issues.apache.org/jira/browse/SPARK-10848, which was closed as not a problem with the comment:
This should be resolved in the latest file format refactoring in Spark 2.0. Please reopen it if you still hit the problem. Thanks!
If you are still getting the error, you can reopen the JIRA.
I tested in Spark 2.1.0 and still see the same issue.
The workaround above ensures there is a correct schema, but null values are set to default ones. In my case, when an Int does not exist in the JSON string it is set to 0.
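Another option often used for this (a sketch of my own, not from the thread): read the JSON normally and then re-apply the stricter schema over the RDD of Rows. Spark takes the nullability flags at face value here, so this asserts intent rather than validating the data.
// Reuses readJson and expectedSchema from the test above. createDataFrame does
// not re-check the rows, so a real null in column "a" would only surface later.
val coerced = spark.createDataFrame(readJson.rdd, expectedSchema)
coerced.printSchema()
// root
//  |-- a: integer (nullable = false)
//  |-- b: integer (nullable = true)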

How to convert dataframe datatypes to String?

I have a Hive table with Date and Timestamp data types. I am creating a DataFrame using the Java code below:
SparkConf conf = new SparkConf(true).setMaster("yarn-cluster").setAppName("SAMPLE_APP");
SparkContext sc = new SparkContext(conf);
HiveContext hc = new HiveContext(sc);
DataFrame df = hc.table("testdb.tbl1");
Dataframe schema:
df.printSchema
root
|-- c_date: date (nullable = true)
|-- c_timestamp: timestamp (nullable = true)
I want to convert these columns to String. How can I achieve this?
I need this because of this issue: Spark csv data validation failed for date and timestamp data types of Hive
You can do the following:
df = df.withColumn("c_date", df.col("c_date").cast(DataTypes.StringType));
In Scala, we generally cast data types like this:
df.select($"date".cast(StringType).as("new_date"))
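If there are several date/timestamp columns, a small helper can cast all of them in one pass. This is a hedged sketch of my own (Scala, Spark 2.x Dataset API), not from the answers above, and castTemporalToString is a hypothetical name:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{DateType, StringType, TimestampType}

// Walk the schema and cast every date/timestamp column to string.
def castTemporalToString(df: DataFrame): DataFrame =
  df.schema.fields.foldLeft(df) { (acc, field) =>
    field.dataType match {
      case DateType | TimestampType =>
        acc.withColumn(field.name, acc.col(field.name).cast(StringType))
      case _ => acc
    }
  }
// Usage: castTemporalToString(df).printSchema() should then show c_date and c_timestamp as string.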

On Spark DataFrame save to JSON and load back, schema column sequence changes

I am using Spark DataFrames and trying to do de-duplication across two DataFrames of the same schema.
The schema of the DataFrame before saving it to JSON is:
root
|-- startTime: long (nullable = false)
|-- name: string (nullable = true)
The schema of the DataFrame after loading it back from the JSON file is:
root
|-- name: string (nullable = true)
|-- startTime: long (nullable = false)
I save to JSON as:
newDF.write.json(filePath)
and read back as:
existingDF = sqlContext.read.json(filePath)
After doing unionAll
existingDF.unionAll(newDF).distinct()
or except
newDF.except(existingDF)
The de-duplication fails because of schema change.
Can I avoid this schema conversion?
Is there a way to conserve (or enforce) schema sequence while saving to and loading back from JSON file?
I implemented a workaround to convert the schema back to what I need:
val newSchema = StructType(jsonDF.schema.map {
  case StructField(name, dataType, nullable, metadata) if name.equals("startTime") => StructField(name, LongType, nullable = false, metadata)
  case y: StructField => y
})
existingDF = sqlContext.createDataFrame(jsonDF.rdd, newSchema).select("startTime", "name")
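A simpler alternative (my own sketch, assuming only the column order matters for unionAll/except, not the nullability): JSON does not preserve column order, so re-select the loaded DataFrame's columns in the order of the original before comparing. Note this does not restore nullable = false; the createDataFrame workaround above handles that.
import org.apache.spark.sql.functions.col

// Reorder the loaded DataFrame's columns to match newDF before de-duplicating.
val reordered = existingDF.select(newDF.columns.map(col): _*)
newDF.except(reordered)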

Resources