Create spark dataframe schema from json schema representation - apache-spark

Is there a way to serialize a dataframe schema to json and deserialize it later on?
The use case is simple:
I have a json configuration file which contains the schema for dataframes I need to read.
I want to be able to create the default configuration from an existing schema (in a dataframe), and I want to be able to regenerate the schema later on by reading it back from the json string.

There are two steps for this: Creating the json from an existing dataframe and creating the schema from the previously saved json string.
Creating the string from an existing dataframe
val schema = df.schema
val jsonString = schema.json
Creating the schema from the json string
import org.apache.spark.sql.types.{DataType, StructType}
val newSchema = DataType.fromJson(jsonString).asInstanceOf[StructType]

I am posting a pyspark version of the answer by Assaf:
import json
from pyspark.sql.types import StructType

# Save schema from the original DataFrame into json:
schema_json = df.schema.json()

# Restore schema from json:
new_schema = StructType.fromJson(json.loads(schema_json))
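As a sanity check, the restored schema can be compared with the original and then reused when reading new data. A minimal sketch, assuming an active spark session and a hypothetical JSON source at data/events.json:

import json
from pyspark.sql.types import StructType

# round-trip: serialize, restore, and verify that the schemas match
schema_json = df.schema.json()
new_schema = StructType.fromJson(json.loads(schema_json))
assert new_schema == df.schema

# reuse the restored schema when reading new data (hypothetical path)
events = spark.read.schema(new_schema).json("data/events.json")
events.printSchema()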

Adding to the answers above, I already had a custom PySpark Schema defined as follows:
from pyspark.sql.types import StructType, StructField, StringType

custom_schema = StructType(
    [
        StructField("ID", StringType(), True),
        StructField("Name", StringType(), True),
    ]
)
I converted it into JSON and saved as a file as follows:
import json

with open("custom_schema.json", "w") as f:
    json.dump(custom_schema.jsonValue(), f)
Now you have a json file with the schema defined, which you can read back as follows:
with open("custom_schema.json") as f:
    new_schema = StructType.fromJson(json.load(f))

print(new_schema)
Inspired from: stefanthoss
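To close the loop with the original use case, the schema restored from the file can be passed straight to a reader. A minimal sketch, assuming an active spark session and a hypothetical CSV file data/people.csv:

import json
from pyspark.sql.types import StructType

with open("custom_schema.json") as f:
    restored_schema = StructType.fromJson(json.load(f))

# apply the stored schema instead of inferring it (hypothetical input path)
df = spark.read.csv("data/people.csv", header=False, schema=restored_schema)
df.printSchema()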

Related

Passing Schema Manually to a Spark dataframe

Question: Is there a way to just pass the column names to a Spark dataframe and expect Spark to infer the schema types?
My Scenario: I'm trying to fire a spark job using Kubernetes that basically reads CSV files from AWS S3 and creates a spark df using spark.read.csv().
If there is no header in the CSV file, I need to pass the schema manually to the Spark dataframe, which I can achieve with the following approach.
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField('column_name', StringType(), True),
    StructField('column_name1', StringType(), True)
])

df = spark.read.csv(csv_file, header=False, schema=schema)
That's all fine.
But
Problem: I'm passing the required parameters such as S3_access_key, secret_key, column_names ... etc. as environment variables to the executor pods. Refer to the snippet below.
ArgoDriverV2.ArgoDriver.create_spark_job(
    's3-connector',
    'WriteS3',
    namespace='default',
    executors=2,
    args={
        "USER": self.user.id,
        "COLUMN_SCHEMA": json.dumps(column_names),
        "S3_FILE_KEYS": json.dumps(s3_file_keys),
        "S3_ACCESS_KEY": params['access_key'],
        "S3_SECRET_KEY": params['secret_key'],
        "N_EXECUTORS": 2,
    })
Using the column_names, I can generate the schema in the spark job and pass it to a dataframe (see the sketch below), but I find this approach a bit complicated.
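For reference, a minimal sketch of what this env-var based approach might look like inside the Spark job; csv_file and the variable names are assumptions mirroring the args dictionary above, not actual job code:

import json
import os

from pyspark.sql.types import StructType, StructField, StringType

# rebuild the schema from the COLUMN_SCHEMA environment variable
column_names = json.loads(os.environ["COLUMN_SCHEMA"])
schema = StructType([StructField(name, StringType(), True) for name in column_names])

# csv_file would be resolved from S3_FILE_KEYS (hypothetical)
df = spark.read.csv(csv_file, header=False, schema=schema)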
Is there a way to just pass the column names to a Spark dataframe and expect Spark to infer the schema types?
You could read the csv using inferSchema=true and then simply rename the columns like this:
# let's say that we have a list of desired column names
cols = ['a', 'b', 'c']
df = spark.read.option("inferSchema", True).csv("test")
df = df.select([df[x].alias(y) for x,y in zip(df.columns, cols)])
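As a side note, the same renaming can be done with toDF, which replaces all column names positionally in one call; this is just an equivalent variant of the snippet above:

df = spark.read.option("inferSchema", True).csv("test")
df = df.toDF(*cols)  # positionally replaces every column name with the names in cols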

spark read schema from separate file

I have my data in HDFS and its schema in MySQL. I'm able to fetch the schema into a DataFrame, and it looks like this:
col1,string
col2,date
col3,int
col4,string
How to read this schema and assign it to data while reading from HDFS?
I will be reading the schema from MySQL. It will be different for different datasets. I need a dynamic approach where, for any dataset, I can fetch the schema details from MySQL -> convert them into a schema -> and then apply it to the dataset.
You can use the built-in pyspark function _parse_datatype_string:
from pyspark.sql.types import _parse_datatype_string
df = spark.createDataFrame([
["col1,string"],
["col3,int"],
["col3,int"]
], ["schema"])
str_schema = ",".join(map(lambda c: c["schema"].replace(",", ":"), df.collect()))
# col1:string,col3:int,col3:int
final_schema = _parse_datatype_string(str_schema)
# StructType(List(StructField(col1,StringType,true),StructField(col3,IntegerType,true),StructField(col3,IntegerType,true)))
_parse_datatype_string expects a DDL-formatted string, i.e. col1:string, col2:int, hence we first need to replace , with : and then join everything together separated by commas. The function returns an instance of StructType, which will be your final schema.
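Putting this together with the HDFS part of the question, the parsed schema can be applied directly when reading the data. A minimal sketch, assuming the schema rows were already fetched from MySQL into schema_df and a hypothetical HDFS path:

from pyspark.sql.types import _parse_datatype_string

# build a DDL-style string such as "col1:string,col2:date,col3:int,col4:string"
str_schema = ",".join(row["schema"].replace(",", ":") for row in schema_df.collect())
final_schema = _parse_datatype_string(str_schema)

# read the actual data with the dynamically built schema (hypothetical path)
data = spark.read.schema(final_schema).csv("hdfs:///data/my_dataset")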

Specify pyspark dataframe schema with string longer than 256

I'm reading a source that has descriptions longer than 256 characters. I want to write them to Redshift.
According to: https://github.com/databricks/spark-redshift#configuring-the-maximum-size-of-string-columns it is only possible in Scala.
According to this: https://github.com/databricks/spark-redshift/issues/137#issuecomment-165904691
the workaround should be to specify the schema when creating the dataframe, but I'm not able to get it to work.
How can I specify the schema with varchar(max)?
from pyspark.sql.types import StructType, StructField, StringType

df = ...  # from source
schema = StructType([
    StructField('field1', StringType(), True),
    StructField('description', StringType(), True)
])
df = sqlContext.createDataFrame(df.rdd, schema)
Redshift maxlength annotations are passed in the format
{"maxlength":2048}
so this is the structure you should pass to StructField constructor:
from pyspark.sql.types import StructField, StringType
StructField("description", StringType(), metadata={"maxlength":2048})
or via the alias method:
from pyspark.sql.functions import col
col("description").alias("description", metadata={"maxlength":2048})
If you use PySpark 2.2 or earlier please check How to change column metadata in pyspark? for workaround.
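A minimal end-to-end sketch of the schema-based variant, assuming the source DataFrame df from the question already exists; the metadata on the description field is the only Redshift-specific part:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField('field1', StringType(), True),
    StructField('description', StringType(), True, metadata={"maxlength": 2048})
])

df_out = sqlContext.createDataFrame(df.rdd, schema)

# the annotation travels with the schema and is picked up by the Redshift writer
print(df_out.schema['description'].metadata)  # {'maxlength': 2048}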

Is there a bug about StructField in SPARK Structured Streaming

When I try this :
cfg = SparkConf().setAppName('MyApp')
spark = SparkSession.builder.config(conf=cfg).getOrCreate()
lines = spark.readStream.load(format='socket', host='localhost', port=9999,
schema=StructType(StructField('value', StringType, True)))
words = lines.groupBy('value').count()
query = words.writeStream.format('console').outputMode("complete").start()
query.awaitTermination()
Then I get some error :
AssertionError: dataType should be DataType
And I searched the source code in ./pyspark/sql/types.py, at line 403:
assert isinstance(dataType, DataType), "dataType should be DataType"
But StringType is based on AtomicType, not DataType:
class StringType(AtomicType):
"""String data type.
"""
__metaclass__ = DataTypeSingleton
So is there a mistake?
In Python, DataTypes are not used as singletons. When creating a StructField you have to use an instance. Also, StructType requires a sequence of StructFields:
StructType([StructField('value', StringType(), True)])
Nevertheless, this is pointless here: the schema of TextSocketSource is fixed and cannot be modified with the schema argument.
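For completeness, the socket stream is therefore read without passing a schema at all; the source always yields a single string column named value. A minimal sketch of the usual pattern:

# the socket source defines its own fixed schema: one string column called "value"
lines = (spark.readStream
         .format('socket')
         .option('host', 'localhost')
         .option('port', 9999)
         .load())

words = lines.groupBy('value').count()
query = words.writeStream.format('console').outputMode('complete').start()
query.awaitTermination()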

Syntax while setting schema for Pyspark.sql using StructType

I am new to spark and was playing around with Pyspark.sql. According to the pyspark.sql documentation here, one can go about setting the Spark dataframe and schema like this:
from pyspark.sql import SparkSession
from pyspark.sql.types import (StringType, IntegerType, TimestampType,
                               StructType, StructField)

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.textFile('./some csv_to_play_around.csv')

schema = StructType([StructField('Name', StringType(), True),
                     StructField('DateTime', TimestampType(), True),
                     StructField('Age', IntegerType(), True)])

# create dataframe
df3 = spark.createDataFrame(rdd, schema)
My question is, what does the True stand for in the schema list above? I can't seem to find it in the documentation. Thanks in advance.
It indicates whether the column allows null values: true for nullable and false for not nullable.
StructField(name, dataType, nullable): Represents a field in a StructType. The name of a field is indicated by name. The data type of a field is indicated by dataType. nullable is used to indicate if values of this field can be null.
Refer to the Spark SQL and DataFrame Guide for more information.
You can also use a datatype string:
schema = 'Name STRING, DateTime TIMESTAMP, Age INTEGER'
There's not much documentation on datatype strings, but they are mentioned in the docs. They're much more compact and readable than StructTypes.
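A minimal sketch of the string form in use, assuming Spark 2.3+ where DataFrameReader.schema also accepts a DDL string:

# same schema as the StructType above, expressed as a datatype string
df = (spark.read
      .schema('Name STRING, DateTime TIMESTAMP, Age INTEGER')
      .csv('./some csv_to_play_around.csv', header=False))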
