Infer a dataframe schema in spark from json file - apache-spark

I am looking to infer a data schema from a .json doc then load data from a database that has a type="Customer"
The current code I am using at the moment does work but the issue i face is that not all documents within the couchbase database contain the exact same schema. As the schema scans from a selection of documents the inferred schema will not contain all the data fields that I require. When I then load the dataframe I get a wrapped array error.
Python Code
%pyspark
df = sqlContext.read.format("com.couchbase.spark.sql.DefaultSource").option("schemaFilter", "type=\"Customer\"").load()
if you have had this issue previously all suggestions welcome on how to solve.
Many thanks!!

Related

Pyspark Readstream is not enforcing schema as expected

We are running a readstream, trigger once setup to load in parquet files from an immutable log.
spark.readStream.schema(SCHEMA).option('enforceSchema', "true").parquet(source_location)
To test the schema enforcement, we have a parquet file where one column has a different label from the column defined in the schema. In the schema we expect: row_id in the test file we provide row_id_x.
I would expect a pyspark error because the schema is different from the file being loaded. However, the code runs without any error/message. Is this expected behavior or is something else happening?

Azure Data Factory - Google BigQuery Copy Data activity not returning nested column names

I have a copy activity in Azure Data Factory with a Google BigQuery source.
I need to import the whole table (which contains nested fields - Records in BigQuery).
Nested fields get imported as follows (a string containing only data values):
"{\"v\":{\"f\":[{\"v\":\"1\"},{\"v\":\"1\"},{\"v\":\"1\"},{\"v\":null},{\"v\":\"1\"},{\"v\":null},{\"v\":null},{\"v\":\"1\"},{\"v\":null},{\"v\":null},{\"v\":null},{\"v\":null},{\"v\":\"0\"}]}}"
Expected output would be something like:
{"nestedColName" : [{"subNestedColName": 1}, {"subNestedColName": 1}, {"subNestedColName": 1}, {"subNestedColName": null}, ...] }
I think this is a connector issue from Data Factory's side but am not sure how to proceed.
Have considered using Databricks to import data from GBQ directly and then saving the DataFrame to sink.
Have also considered querying for a subset of columns and using UNNEST where required but would rather not do this as Parquet handles both Array and Map types.
Anyone encountered this before / what did you do?
Solution used:
Databricks (Spark) connector for Google BigQuery:
https://docs.databricks.com/data/data-sources/google/bigquery.html
This preserves schemas and nested field names.
Preferring the simpler setup of ADF BigQuery connector to Databricks's BigQuery support, I opted for a solution where I extract the data in JSON and 'massage' it into Parquet using Databricks:
Use a Copy activity to get data from BigQuery with all the data packed into a single JSON string field. Output format can be Parquet or JSON (I'm using Parquet). Use a BigQuery query like this:
select TO_JSON_STRING(t) as value from `<your BigQuery table>` as t
NOTE: The name of the field must be value. The df.write.text() text file writer writes the contents of value column into each row of the text file, which is a JSON string in this case.
Run a Databrick notebook activity with code like this:
# Read data and write it out as text file to get the JSON. (Compression is optional).
dfInput=spark.read.parquet(inputpath)
dfInput.write.mode("overwrite").option("compression","gzip").text(tmppath)
# Read back as JSON to extract the correct schema.
dfTemp=spark.read.json(tmppath)
dfTemp.write.mode("overwrite").parquet(outputpath)
Use the output as is, or use a Copy activity to copy it to where you like.

How to print the schema of a dataframe with Python objects rather than Java objects?

For example:
df = spark.read.json("path")
print(df.schema)
prints:
StructType(List(StructField(timestamp,StringType,true)))
rather than:
StructType([StructField("timestamp",StringType(),True)])
This is an issue for me if i want to come up with a schema by initially inferring the schema from a file in order to then print the schema and hardcode it in my code.
Is there a way to print the schema of a dataframe and have it in python syntax so that i can set a hardcoded schema to a variable in my code and use it?
ideally (schema = df.schema) works in case of common file formats like csv etc, but for a file like json it's good to provide schema manually to avoid any error

Is dataframe.colums a Spark action?

If not, there is no action method in the following code, but "./demo.json" is read once.
val x = spark.read.json("./demo.json")
println(x.columns)
dataframe.columns is not an action per se, but it needs to get the schema of your dataframe. Depending on the file format, this needs a file-scan (json, csv). With other file formats like parquet, the columns can be extracted from the meta-data, so no actual file scan is needed
spark.read.json is an action that reads all your data to infer the schema (unless you specify it manually). Hence x.columns will not trigger any action.
According to the latest documentation (click on json):
This function goes through the input once to determine the input
schema. If you know the schema in advance, use the version that
specifies the schema to avoid the extra scan.

Spark: reading files with PERMISSIVE and provided schema - issues with corrupted records column

I am reading spark CSV. I am providing a schema for the file that I read and I read it permissive mode. I would like to keep all records in columnNameOfCorruptRecord (in my case corrupted_records).
I went trough hell to set this up and still get warnings that I cannot suppress i there something I miss.
So first in order to have the corrupted_records column I needed to add it to the schema as StringType. This is documented so okay.
But whenever I read a file a get a Warning that the schema doesn't match because the amount of columns is different. It's just a warning, but it's filling my logs.
Also when there is a field that is not nullable and there is a corrupted record, the corrupted record goes to the corrupt_records column and all it's fields are set to null thus I get an exception because I have non nullable field. The only to solve this is to set that the columns are not nullable to nullable. Which is quite strange.
Am I missing something?
Recap:
Is there a way to ignore the warning when I've added
corrupted_records column in the schema
Is there a way to use
PERMISSIVE mode and corrupted_records column with schema that has
non nullable fields.
Thanks!
The following documentation might help. It would be great if you atleast provide the code you've written.
https://docs.databricks.com/spark/latest/data-sources/read-csv.html
a demo of read json code snippet
df= self.spark.read.option("mode", "PERMISSIVE").option("primitivesAsString", True).json(self.src_path)
To answer your point 2, you should delve better point 1.
Point 1: you should do an analysis of your file and map your schema with all the fields in your file. After having imported your csv file into a DataFrame, I would select your fields of interest, and continue what you were doing.
Point 2: you will solve your problem defining your schema as follows (I would use scala):
import pyspark.sql.types as types
yourSchema = (types.StructType()
.add('field0', types.IntegerType(), True)
# all your .add(fieldsName, fieldsType, True which let your field be nullable)
.add('corrupted_records', types.StringType(), True) #your corrupted date will be here
)
With this having been defined, you can import your csv file into a DataFrame as follows:
df = ( spark.read.format("csv")
.schema(yourSchema)
.option("mode", "PERMISSIVE")
.option("columnNameOfCorruptRecord", "corrupted_records")
load(your_csv_files)
)
There are also other ways to do the same operation, and different modalities to handle bad bad; have a look at this insightful article: https://python.plainenglish.io/how-to-handle-bad-data-in-spark-sql-5e0276d37ca1

Resources