printSchema() in Apache Spark [duplicate]

This question already has answers here:
Datasets in Apache Spark
(2 answers)
Closed 4 years ago.
Dataset<Tweet> ds = sc.read().json("/path").as(Encoders.bean(Tweet.class));
Tweet class:
long id;
String user;
String text;
ds.printSchema();
Output:
root
|-- id: string (nullable = true)
|-- text: string (nullable = true)
|-- user: string (nullable = true)
The JSON file has all fields as strings.
My question: I am reading the input and encoding it as Tweet.class. The data type specified for id in the Tweet class is long, but when the schema is printed it shows up as string. Does printSchema() reflect how the file was read, or the encoding applied to it (here Tweet.class)?

I don't know the exact reason why your code is not working, but if you want to change the field type you can define your own custom schema:
import org.apache.spark.sql.types._

val schema = StructType(List(
  StructField("id", LongType, nullable = true),
  StructField("text", StringType, nullable = true),
  StructField("user", StringType, nullable = true)
))
You can then apply the schema when reading, as follows:
Dataset<Tweet> ds = sc.read().schema(schema).json("/path").as(Encoders.bean(Tweet.class));
ds.printSchema();
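For comparison, a minimal PySpark sketch of the same idea (the path and field names come from the question; spark is assumed to be an existing SparkSession):
from pyspark.sql.types import StructType, StructField, LongType, StringType

# Declare the schema up front so Spark does not infer every JSON field as string.
tweet_schema = StructType([
    StructField("id", LongType(), True),
    StructField("text", StringType(), True),
    StructField("user", StringType(), True)
])

ds = spark.read.schema(tweet_schema).json("/path")
ds.printSchema()  # id now prints as long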

Related

How to get StructType object out of a StructType in spark java?

I'm working on a Spark Java application and I want to access a StructType object nested inside another StructType object. For instance, when we take the schema of a Spark DataFrame it looks something like this:
root
|-- name: struct (nullable = true)
| |-- firstname: string (nullable = true)
| |-- middlename: string (nullable = true)
| |-- lastname: string (nullable = true)
|-- language: string (nullable = true)
|-- fee: integer (nullable = true)
I want to grab name as a StructType so that I can analyze it further; it would make a sort of chain. The problem is that at the root level (or any level), we can only extract a StructField out of a StructType, not another StructType.
StructType st = df.schema(); --> gives the root-level StructType
st.fields(); --> gives an array of StructFields, but if I take name as a StructField I lose all the fields inside it, since name is itself a StructType and I want to keep it as-is.
StructType name = ... out of st --> this is what I want to achieve.
You can use the parameters and methods mentioned in the official documentation:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField('name', StructType([
        StructField('firstname', StringType()),
        StructField('middlename', StringType()),
        StructField('lastname', StringType())
    ])),
    StructField('language', StringType()),
    StructField('fee', IntegerType())
])

for f in schema.fields:
    if f.name == "name":
        print(f.dataType)
        for f2 in f.dataType.fields:
            print(f2.name)
[Out]:
StructType([StructField('firstname', StringType(), True), StructField('middlename', StringType(), True), StructField('lastname', StringType(), True)])
firstname
middlename
lastname
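A more direct way (a sketch, reusing the schema variable defined above) is to index the StructType by field name and take its dataType, which is itself a StructType:
# Grab the nested StructType of the 'name' field directly.
name_struct = schema["name"].dataType
print(type(name_struct))          # <class 'pyspark.sql.types.StructType'>
print(name_struct.fieldNames())   # ['firstname', 'middlename', 'lastname']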

DateType column read as StringType from CSV file even when appropriate schema provided

I am trying to read a CSV file using PySpark containing a DateType field in the format "dd/MM/yyyy". I have specified the field as DateType() in the schema definition and also provided the option "dateFormat" in the DataFrame CSV reader. However, the resulting DataFrame has the field as StringType() instead of DateType().
Sample input data:
"school_id","gender","class","doj"
"1","M","9","01/01/2020"
"1","M","10","01/03/2018"
"1","F","10","01/04/2018"
"2","M","9","01/01/2019"
"2","F","10","01/01/2018"
My code:
from pyspark.sql.types import StructField, StructType, StringType, DateType
school_students_schema = StructType([
    StructField("school_id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("class", StringType(), True),
    StructField("doj", DateType(), True)
])

school_students_df = spark.read.format("csv") \
    .option("header", True) \
    .option("schema", school_students_schema) \
    .option("dateFormat", "dd/MM/yyyy") \
    .load("/user/test/school_students.csv")
school_students_df.printSchema()
Actual output after running the above (the column doj is parsed as string instead of the specified DateType, and no exception is raised):
root
|-- school_id: string (nullable = true)
|-- gender: string (nullable = true)
|-- class: string (nullable = true)
|-- doj: string (nullable = true)
Expected output:
root
|-- school_id: string (nullable = true)
|-- gender: string (nullable = true)
|-- class: string (nullable = true)
|-- doj: date (nullable = true)
Runtime environment
Databricks Community Edition
7.3 LTS (includes Apache Spark 3.0.1, Scala 2.12)
Requesting your help to understand:
Why is the column being parsed as StringType even though DateType is mentioned in schema?
What needs to be done in the code so that the column doj is parsed as DateType()?
You should use
.schema(school_students_schema)
instead of
.option("schema", school_students_schema)
(There is no "schema" in the available option list.)
You also need
.option("dateFormat", "some format")
or the appropriate default format; the column falls back to StringType if the format does not match. Note that only one date format is possible this way; anything else requires per-column manipulation after the read.
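Putting both points together, a corrected reader (a sketch based on the code in the question, reusing the schema and path from above):
# Pass the schema with .schema(...) rather than .option("schema", ...),
# and keep the dateFormat option so "dd/MM/yyyy" values parse as dates.
school_students_df = spark.read.format("csv") \
    .option("header", True) \
    .option("dateFormat", "dd/MM/yyyy") \
    .schema(school_students_schema) \
    .load("/user/test/school_students.csv")

school_students_df.printSchema()  # doj should now print as date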

How can I copy the nullability state of a source Spark Dataframe schema and force it onto a target Spark Dataframe?

I'm using Databricks. Let's say I have two Spark Dataframes (I'm using PySpark):
df_source
df_target
If df_source has the following schema:
root
|-- name: string (nullable = false)
|-- id: long (nullable = false)
|-- age: long (nullable = true)
And df_target has the following schema:
root
|-- name: string (nullable = true)
|-- id: long (nullable = false)
|-- age: long (nullable = false)
How do I efficiently create another DataFrame, df_final, where the (nullable = true/false) property from df_source is forced onto df_target?
I have tried doing the following:
df_final = spark.createDataFrame(df_target.rdd, schema = df_source.schema)
By this method, I'm able to achieve the desired result, but it takes a long time for the dataset size that I have. For smaller datasets, it works fine. Using the collect() function instead of an RDD conversion is obviously far worse for larger datasets.
I would like to point out that the only thing I want to do here is copy the nullability part from the source schema and change it accordingly in target, for the final dataframe.
Is there a way to do some sort of nullability casting, which works similar to .withColumn() performance wise, without RDD conversion, without explicit column name specification in the code? The column ordering is already aligned between source and target.
Additional context: The reason I need to do this is because I need to write (append) df_final to a Google BigQuery table using the Spark BQ connector. So, even if my Spark Dataframe doesn't have any null values in a column but the nullability property is set to true, the BigQuery table will reject the write operation since that column in the BigQuery table may have the nullable property set to false, and the schema mismatches.
Since you know for a fact that age can't be null, you can coalesce age and a constant literal to create a non-nullable field. For fields whose nullable flag has to be converted from false to true, a when expression can be used.
from pyspark.sql.types import StructType, StructField, StringType, LongType
from pyspark.sql import functions as F
df_source_schema = StructType([
StructField("name", StringType(), False),
StructField("id", LongType(), False),
StructField("age", LongType(), True),
])
df_target_schema = StructType([
StructField("name", StringType(), True),
StructField("id", LongType(), False),
StructField("age", LongType(), False),
])
df_source = spark.createDataFrame([("a", 1, 18, ), ], df_source_schema)
df_source.printSchema()
"""
root
|-- name: string (nullable = false)
|-- id: long (nullable = false)
|-- age: long (nullable = true)
"""
df_target = spark.createDataFrame([("a", 1, 18), ], df_target_schema)
df_target.printSchema()
"""
root
|-- name: string (nullable = true)
|-- id: long (nullable = false)
|-- age: long (nullable = false)
"""
# Construct selection expression based on the logic described above
target_field_nullable_map = {field.name: field.nullable for field in df_target.schema}
selection_expr = []
for src_field in df_source.schema:
    field_name = src_field.name
    field_type = src_field.dataType
    if target_field_nullable_map[field_name] != src_field.nullable:
        if src_field.nullable:
            selection_expr.append(F.when(F.col(field_name).isNotNull(), F.col(field_name)).otherwise(F.lit(None)).alias(field_name))
        else:
            selection_expr.append(F.coalesce(F.col(field_name), F.lit("-1").cast(field_type)).alias(field_name))
    else:
        selection_expr.append(F.col(field_name))
df_final = df_target.select(*selection_expr)
df_final.printSchema()
"""
root
|-- name: string (nullable = false)
|-- id: long (nullable = false)
|-- age: long (nullable = true)
"""
Why does this work?
For a coalesce expression to be null, all of its child expressions have to be null, as seen here.
Since lit is a non-null expression when value != null, the coalesce results in a non-nullable column.
A when expression is nullable if any branch is nullable or if the else expression is nullable, as noted here.
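A quick check (a small sketch reusing the DataFrames above) confirms that the nullability of df_final now matches df_source field by field:
# Compare nullability flags of the final and source schemas.
for src_field, final_field in zip(df_source.schema, df_final.schema):
    print(final_field.name, final_field.nullable == src_field.nullable)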

Why do several datasets have an Array of Structs in Apache Spark

I see that several datasets have an array of Structs inside of an element instead of an Array of String or Integer.
|-- name: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- value: string (nullable = true)
I was wondering why, because ultimately what I want is to represent an array of strings, so why have a struct in between?
You can hold an array of strings using ArrayType and StructField; you don't need a StructType inside the StructField. In the example below, column2 holds an array of strings (see the schema for "column2"). Nevertheless, the schema for the whole row will still be a StructType.
StructType(
Array(
StructField("column1", LongType, nullable = true),
StructField("column2", ArrayType(StringType, true), nullable = true)
)
)
You need a StructType when the element is a complex type consisting of several data types; it is like holding a table within a column. See the schema for "column2" below.
StructType(
  Array(
    StructField("column1", LongType, nullable = true),
    StructField("column2", ArrayType(StructType(Array(
      StructField("column3", StringType, nullable = true),
      StructField("column4", StringType, nullable = true))),
      true))
  )
)
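For reference, a PySpark sketch of the same two shapes (field names follow the Scala examples above):
from pyspark.sql.types import StructType, StructField, StringType, LongType, ArrayType

# Plain array of strings: no struct in between.
plain = StructType([
    StructField("column1", LongType(), True),
    StructField("column2", ArrayType(StringType(), True), True)
])

# Array of structs: each element carries several named fields.
nested = StructType([
    StructField("column1", LongType(), True),
    StructField("column2", ArrayType(StructType([
        StructField("column3", StringType(), True),
        StructField("column4", StringType(), True)
    ]), True), True)
])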

On Spark DataFrame save to JSON and load back, schema column sequence changes

I am using Spark DataFrames and trying to do de-duplication across two DataFrames of the same schema.
The schema before saving the DataFrame to JSON is:
root
|-- startTime: long (nullable = false)
|-- name: string (nullable = true)
The schema of the DataFrame after loading it back from the JSON file is:
root
|-- name: string (nullable = true)
|-- startTime: long (nullable = false)
I save to JSON as:
newDF.write.json(filePath)
and read back as:
existingDF = sqlContext.read.json(filePath)
After doing unionAll
existingDF.unionAll(newDF).distinct()
or except
newDF.except(existingDF)
The de-duplication fails because of schema change.
Can I avoid this schema conversion?
Is there a way to conserve (or enforce) schema sequence while saving to and loading back from JSON file?
I implemented a workaround to convert the schema back to what I need:
val newSchema = StructType(jsonDF.schema.map {
  case StructField(name, dataType, nullable, metadata) if name.equals("startTime") =>
    StructField(name, LongType, nullable = false, metadata)
  case y: StructField => y
})
existingDF = sqlContext.createDataFrame(jsonDF.rdd, newSchema).select("startTime", "name")
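An alternative sketch that avoids the RDD round trip is to pass the original schema when reading the JSON back, so the column order matches newDF (note that Spark may still mark the fields as nullable after a JSON read, so the nullable = false part is not guaranteed):
# Read back with the schema of the original DataFrame to keep column order stable.
existingDF = sqlContext.read.schema(newDF.schema).json(filePath)
existingDF.unionAll(newDF).distinct()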
