Why do several datasets have an Array of Structs in Apache Spark?

I see that several datasets have an array of structs inside an element instead of an array of strings or integers.
 |-- name: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- value: string (nullable = true)
I was wondering why, because ultimately what I want is to represent an array of strings, so why have a struct in between?

You can hold an array of strings using ArrayType inside a StructField; you don't need a StructType wrapped around the elements. In the example below, "column2" holds an array of strings. Note that the schema for the whole row will still be a StructType.
StructType(
  Array(
    StructField("column1", LongType, nullable = true),
    StructField("column2", ArrayType(StringType, containsNull = true), nullable = true)
  )
)
You need a StructType inside the array only when each element is a complex type made up of several fields; it is like holding a table within a column. See the schema for "column2" below.
StructType(
  Array(
    StructField("column1", LongType, nullable = true),
    StructField("column2", ArrayType(StructType(Array(
      StructField("column3", StringType, nullable = true),
      StructField("column4", StringType, nullable = true)
    )), containsNull = true), nullable = true)
  )
)
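As a minimal PySpark sketch of the same idea (assuming an active SparkSession named spark; the schemas mirror the Scala ones above), the struct layer only appears when each array element bundles several named fields, and the plain strings can still be projected out of it:

from pyspark.sql.types import StructType, StructField, ArrayType, StringType, LongType

# Plain array of strings: no struct wrapper around the elements.
plain = StructType([
    StructField("column1", LongType(), True),
    StructField("column2", ArrayType(StringType(), True), True),
])

# Array of structs: each element carries two named fields.
nested = StructType([
    StructField("column1", LongType(), True),
    StructField("column2", ArrayType(StructType([
        StructField("column3", StringType(), True),
        StructField("column4", StringType(), True),
    ]), True), True),
])

df = spark.createDataFrame([(1, [("a", "b"), ("c", "d")])], nested)
df.printSchema()
# root
#  |-- column1: long (nullable = true)
#  |-- column2: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- column3: string (nullable = true)
#  |    |    |-- column4: string (nullable = true)

# If only the strings of one field are needed, project them out of the structs:
df.select("column2.column3").show()  # one row containing [a, c]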

Related

Convert a column of JSON list to a DataFrame

I'm loading a table over JDBC that has a JSON column formatted as:
[
{
"numero": 1,
"resposta": "A",
"peso": 2
},
{
"numero": 2,
"resposta": "A",
"peso": 1
},
...
]
Its data type is set as json in PostgreSQL, but when loading it into Spark, the value comes in with newline and tab characters.
I tried the following schema, which results in null (I imagine that's because I have to iterate through the list, but I'm not sure how to do that):
schema = StructType(
    [
        StructField("peso", IntegerType(), False),
        StructField("numero", IntegerType(), False),
        StructField("resposta", StringType(), False),
    ]
)
questoes.withColumn("questoes", from_json("questoes", schema)).show(truncate=200)
Output: the questoes column comes back null.
Desired DataFrame:
+------+--------+----+
|numero|resposta|peso|
+------+--------+----+
|     1|       A|   2|
|     2|       A|   1|
|   ...|     ...| ...|
+------+--------+----+
Code used to read from the DB:
from pyspark.sql import SparkSession

spark = SparkSession.builder.config(
    'spark.driver.extraClassPath',
    'C:/Users/vitor/AppData/Roaming/DBeaverData/drivers/maven/maven-central/org.postgresql/postgresql-42.2.25.jar'
).getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

url = 'jdbc:postgresql://localhost:5432/informacoes_concursos'
properties = {'user': 'postgres', 'password': '123'}

gabaritos = spark.read.jdbc(url, table="gabaritos", properties=properties)
concursos = spark.read.jdbc(url, table="concursos", properties=properties)
Edit: I fixed the newline and tab characters by changing the dtype from json to jsonb.
So there are two issues with your code:
Your json is not a struct with 3 fields, it is a collection of structs with 3 fields. Therefore you need to change the schema and use an ArrayType.
Inside your database, the json data seems to be stored with tabs \t and newlines \n (it is formatted). Spark's from_json function does not seem to be able to parse that. So we need to clean it.
# same as before, but wrapped within an array
from pyspark.sql import functions as f
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType, StringType

schema = ArrayType(
    StructType([
        StructField("peso", IntegerType(), False),
        StructField("numero", IntegerType(), False),
        StructField("resposta", StringType(), False),
    ])
)

result = questoes\
    .withColumn("questoes", f.regexp_replace("questoes", "\\s", ""))\
    .withColumn("data", f.from_json("questoes", schema))
result.show(truncate=False)
which yields:
+---+---------------------------------------------------------------------------+----------------------+
|id |questoes |data |
+---+---------------------------------------------------------------------------+----------------------+
|1 |[{"numero":1,"resposta":"A","peso":2},{"numero":2,"resposta":"A","peso":1}]|[{2, 1, A}, {1, 2, A}]|
+---+---------------------------------------------------------------------------+----------------------+
and the schema:
result.printSchema()
root
 |-- id: long (nullable = true)
 |-- questoes: string (nullable = true)
 |-- data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- peso: integer (nullable = true)
 |    |    |-- numero: integer (nullable = true)
 |    |    |-- resposta: string (nullable = true)
You may drop the questoes column; I just kept it to display the cleansed JSON.
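If the row-per-struct layout shown under "Desired DataFrame" is the end goal, one further step is to explode the parsed array and flatten the struct fields. A sketch, assuming the result DataFrame built above (final is just a name for the flattened output):

final = result.select("id", f.explode("data").alias("q")) \
    .select("id", "q.numero", "q.resposta", "q.peso")
final.show()
which yields:
+---+------+--------+----+
| id|numero|resposta|peso|
+---+------+--------+----+
|  1|     1|       A|   2|
|  1|     2|       A|   1|
+---+------+--------+----+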

How to get StructType object out of a StructType in spark java?

I'm working on a Spark Java application and I want to access a StructType object nested inside another StructType object. For instance,
when we take the schema of a Spark DataFrame it looks something like this:
root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- language: string (nullable = true)
 |-- fee: integer (nullable = true)
I want to grab name as a StructType so that I can analyze it further (it would make a sort of chain). The problem is that at the root level, or any level, we can only extract a StructField out of a StructType, not another StructType.
StructType st = df.schema();  // we get the root-level StructType
st.fields();  // gives us an array of StructFields, but if I take "name" as a StructField I will lose all the fields inside it, since "name" is a StructType and I want to keep it as is
StructType name = /* out of st */;  // this is what I want to achieve
You can use the parameters and methods mentioned in the official documentation:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField('name', StructType([StructField('firstname', StringType()), StructField('middlename', StringType()), StructField('lastname', StringType())])),
    StructField('language', StringType()),
    StructField('fee', IntegerType()),
])

for f in schema.fields:
    if f.name == "name":
        print(f.dataType)
        for f2 in f.dataType.fields:
            print(f2.name)
[Out]:
StructType([StructField('firstname', StringType(), True), StructField('middlename', StringType(), True), StructField('lastname', StringType(), True)])
firstname
middlename
lastname
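In PySpark (the language of the snippet above) you can also index the schema by field name and take the field's dataType directly; a short sketch, assuming the schema defined above:

# schema["name"] returns the StructField, and its dataType is itself a StructType
# that can be inspected further (or chained into deeper levels).
name_type = schema["name"].dataType
print(isinstance(name_type, StructType))  # True
print(name_type.fieldNames())             # ['firstname', 'middlename', 'lastname']

In Java the idea is the same: take the StructField for "name" from the root schema and cast its dataType() to StructType to keep working with the nested fields.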

How can I copy the nullability state of a source Spark Dataframe schema and force it onto a target Spark Dataframe?

I'm using Databricks. Let's say I have two Spark Dataframes (I'm using PySpark):
df_source
df_target
If df_source has the following schema:
root
|-- name: string (nullable = false)
|-- id: long (nullable = false)
|-- age: long (nullable = true)
And df_target has the following schema:
root
|-- name: string (nullable = true)
|-- id: long (nullable = false)
|-- age: long (nullable = false)
How do I efficiently create another DataFrame, df_final, where the (nullable = true/false) property from df_source is forced onto df_target?
I have tried doing the following:
df_final = spark.createDataFrame(df_target.rdd, schema = df_source.schema)
By this method I'm able to achieve the desired result, but it takes a long time for the dataset size I have. For smaller datasets it works fine. Using collect() instead of the RDD conversion is obviously far worse for larger datasets.
I would like to point out that the only thing I want to do here is copy the nullability part from the source schema and change it accordingly in target, for the final dataframe.
Is there a way to do some sort of nullability casting that performs similarly to .withColumn(), without RDD conversion and without explicit column name specification in the code? The column ordering is already aligned between source and target.
Additional context: The reason I need to do this is that I need to write (append) df_final to a Google BigQuery table using the Spark BQ connector. Even if my Spark DataFrame has no null values in a column, if its nullability property is set to true the BigQuery table will reject the write operation, since that column in the BigQuery table may have nullable set to false, and the schemas mismatch.
Since you know for a fact that age can't be null, you can coalesce age with a constant literal to create a non-nullable field. For fields where the nullable flag has to be converted from false to true, a when expression can be used.
from pyspark.sql.types import StructType, StructField, StringType, LongType
from pyspark.sql import functions as F
df_source_schema = StructType([
    StructField("name", StringType(), False),
    StructField("id", LongType(), False),
    StructField("age", LongType(), True),
])
df_target_schema = StructType([
    StructField("name", StringType(), True),
    StructField("id", LongType(), False),
    StructField("age", LongType(), False),
])
df_source = spark.createDataFrame([("a", 1, 18, ), ], df_source_schema)
df_source.printSchema()
"""
root
|-- name: string (nullable = false)
|-- id: long (nullable = false)
|-- age: long (nullable = true)
"""
df_target = spark.createDataFrame([("a", 1, 18), ], df_target_schema)
df_target.printSchema()
"""
root
|-- name: string (nullable = true)
|-- id: long (nullable = false)
|-- age: long (nullable = false)
"""
# Construct the selection expression based on the logic described above
target_field_nullable_map = {field.name: field.nullable for field in df_target.schema}
selection_expr = []

for src_field in df_source.schema:
    field_name = src_field.name
    field_type = src_field.dataType
    if target_field_nullable_map[field_name] != src_field.nullable:
        if src_field.nullable:
            # force nullable = true: wrap in when/otherwise with a nullable else branch
            selection_expr.append(F.when(F.col(field_name).isNotNull(), F.col(field_name)).otherwise(F.lit(None)).alias(field_name))
        else:
            # force nullable = false: coalesce with a non-null literal
            selection_expr.append(F.coalesce(F.col(field_name), F.lit("-1").cast(field_type)).alias(field_name))
    else:
        selection_expr.append(F.col(field_name))

df_final = df_target.select(*selection_expr)
df_final.printSchema()
"""
root
|-- name: string (nullable = false)
|-- id: long (nullable = false)
|-- age: long (nullable = true)
"""
Why does this work?
For a coalesce expression to be nullable, all of its child expressions have to be nullable, as defined in Spark's Coalesce expression.
Since lit is a non-nullable expression when its value is not null, the coalesce results in a non-nullable column.
A when expression is nullable if any branch value is nullable or if the else expression is nullable, as noted in Spark's CaseWhen expression.
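Both rules are easy to see in isolation; a minimal sketch, assuming an active SparkSession named spark:

from pyspark.sql import functions as F

demo = spark.range(1).select(
    # id is non-nullable and the literal fallback is non-null,
    # so the coalesce column is non-nullable.
    F.coalesce(F.col("id"), F.lit(-1)).alias("forced_not_null"),
    # otherwise(lit(None)) is a nullable else branch,
    # so the when column is nullable.
    F.when(F.col("id").isNotNull(), F.col("id")).otherwise(F.lit(None)).alias("forced_nullable"),
)
demo.printSchema()
"""
root
 |-- forced_not_null: long (nullable = false)
 |-- forced_nullable: long (nullable = true)
"""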

pyspark: Converting string to struct

I have data as follows -
{
"Id": "01d3050e",
"Properties": "{\"choices\":null,\"object\":\"demo\",\"database\":\"pg\",\"timestamp\":\"1581534117303\"}",
"LastUpdated": 1581530000000,
"LastUpdatedBy": "System"
}
Using AWS Glue, I want to relationalize the "Properties" column, but since its data type is string that can't be done directly. Converting it to a struct might make it possible, based on this blog:
https://aws.amazon.com/blogs/big-data/simplify-querying-nested-json-with-the-aws-glue-relationalize-transform/
>>> df.show
<bound method DataFrame.show of DataFrame[Id: string, LastUpdated: bigint, LastUpdatedBy: string, Properties: string]>
>>> df.show()
+--------+-------------+-------------+--------------------+
| Id| LastUpdated|LastUpdatedBy| Properties|
+--------+-------------+-------------+--------------------+
|01d3050e|1581530000000| System|{"choices":null,"...|
+--------+-------------+-------------+--------------------+
How can I un-nest the "Properties" column to break it into "choices", "object", "database" and "timestamp" columns, using the Relationalize transformer or any UDF in PySpark?
Use from_json since the column Properties is a JSON string.
If the schema is the same for all your records, you can convert it to a struct type by defining the schema like this:
schema = StructType([StructField("choices", StringType(), True),
StructField("object", StringType(), True),
StructField("database", StringType(), True),
StructField("timestamp", StringType(), True)],
)
df.withColumn("Properties", from_json(col("Properties"), schema)).show(truncate=False)
#+--------+-------------+-------------+---------------------------+
#|Id |LastUpdated |LastUpdatedBy|Properties |
#+--------+-------------+-------------+---------------------------+
#|01d3050e|1581530000000|System |[, demo, pg, 1581534117303]|
#+--------+-------------+-------------+---------------------------+
However, if the schema can change from one row to another, I'd suggest you convert it to a Map type instead:
df.withColumn("Properties", from_json(col("Properties"), MapType(StringType(), StringType()))).show(truncate=False)
#+--------+-------------+-------------+------------------------------------------------------------------------+
#|Id |LastUpdated |LastUpdatedBy|Properties |
#+--------+-------------+-------------+------------------------------------------------------------------------+
#|01d3050e|1581530000000|System |[choices ->, object -> demo, database -> pg, timestamp -> 1581534117303]|
#+--------+-------------+-------------+------------------------------------------------------------------------+
You can then access elements of the map using element_at (Spark 2.4+)
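For example, a short sketch built on the Map-typed Properties column from the snippet above:

from pyspark.sql.functions import from_json, col, element_at
from pyspark.sql.types import MapType, StringType

df.withColumn("Properties", from_json(col("Properties"), MapType(StringType(), StringType()))) \
  .select(
      "Id",
      element_at("Properties", "database").alias("database"),
      element_at("Properties", "timestamp").cast("long").alias("timestamp"),
  ).show(truncate=False)
#+--------+--------+-------------+
#|Id      |database|timestamp    |
#+--------+--------+-------------+
#|01d3050e|pg      |1581534117303|
#+--------+--------+-------------+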
Creating your dataframe:
from pyspark.sql import functions as F

data = [["01d3050e", "{\"choices\":null,\"object\":\"demo\",\"database\":\"pg\",\"timestamp\":\"1581534117303\"}", 1581530000000, "System"]]
df = spark.createDataFrame(data, ['Id', 'Properties', 'LastUpdated', 'LastUpdatedBy'])
df.show(truncate=False)
+--------+----------------------------------------------------------------------------+-------------+-------------+
|Id |Properties |LastUpdated |LastUpdatedBy|
+--------+----------------------------------------------------------------------------+-------------+-------------+
|01d3050e|{"choices":null,"object":"demo","database":"pg","timestamp":"1581534117303"}|1581530000000|System |
+--------+----------------------------------------------------------------------------+-------------+-------------+
Use the built-in regexp_replace, split, and element_at functions:
There is no need for a UDF; built-in functions are adequate and well optimized for big-data tasks.
df.withColumn("Properties", F.split(F.regexp_replace(F.regexp_replace((F.regexp_replace("Properties",'\{|}',"")),'\:',','),'\"|"',"").cast("string"),','))\
.withColumn("choices", F.element_at("Properties",2))\
.withColumn("object", F.element_at("Properties",4))\
.withColumn("database",F.element_at("Properties",6))\
.withColumn("timestamp",F.element_at("Properties",8).cast('long')).drop("Properties").show()
+--------+-------------+-------------+-------+------+--------+-------------+
| Id| LastUpdated|LastUpdatedBy|choices|object|database| timestamp|
+--------+-------------+-------------+-------+------+--------+-------------+
|01d3050e|1581530000000| System| null| demo| pg|1581534117303|
+--------+-------------+-------------+-------+------+--------+-------------+
root
|-- Id: string (nullable = true)
|-- LastUpdated: long (nullable = true)
|-- LastUpdatedBy: string (nullable = true)
|-- choices: string (nullable = true)
|-- object: string (nullable = true)
|-- database: string (nullable = true)
|-- timestamp: long (nullable = true)
Since I was using the AWS Glue service, I ended up using the "Unbox" class to unbox the string field in the DynamicFrame. It worked well for my use case.
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-Unbox.html
unbox = Unbox.apply(frame = dynamic_dframe, path = "Properties", format="json")

printSchema() in Apache Spark [duplicate]

This question already has answers here:
Datasets in Apache Spark
(2 answers)
Closed 4 years ago.
Dataset<Tweet> ds = sc.read().json("/path").as(Encoders.bean(Tweet.class));
Tweet class:
long id;
String user;
String text;
ds.printSchema();
Output:-
root
|-- id: string (nullable = true)
|-- text: string (nullable = true)
|-- user: string (nullable = true)
The JSON file has all values as strings.
My question: I am reading the input and encoding it as Tweet.class. The data type specified for id in the class is long, but when the schema is printed it comes out as string.
Does printSchema() report the schema according to how the file was read, or according to the encoding we apply (here Tweet.class)?
I don't know the exact reason why your code is not working, but if you want to change the field type you can supply your own custom schema.
val schema = StructType(List(
  StructField("id", LongType, nullable = true),
  StructField("text", StringType, nullable = true),
  StructField("user", StringType, nullable = true)
))
You can apply the schema to your Dataset as follows:
Dataset<Tweet> ds = sc.read().schema(schema).json("/path").as(Encoders.bean(Tweet.class));
ds.printSchema();
