Pyspark Readstream is not enforcing schema as expected - apache-spark

We are running a readstream, trigger once setup to load in parquet files from an immutable log.
spark.readStream.schema(SCHEMA).option('enforceSchema', "true").parquet(source_location)
To test the schema enforcement, we have a parquet file where one column has a different label from the column defined in the schema. In the schema we expect: row_id in the test file we provide row_id_x.
I would expect a pyspark error because the schema is different from the file being loaded. However, the code runs without any error/message. Is this expected behavior or is something else happening?

Related

How to Ignore Empty Parquet Files in Databricks When Creating Dataframe

I am having issues loading multiple files into a dataframe in Databricks. When I load a parquet file in an individual folder, it is fine, but the following error returns when I try to load multiple files in the dataframe:
DF = spark.read.parquet('S3 path/')
"org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually."
Per other StackOverflow answers, I added spark.sql.files.ignoreCorruptFiles true to the cluster configuration but it didn't seem to resolve the issue. Any other ideas?

Is dataframe.colums a Spark action?

If not, there is no action method in the following code, but "./demo.json" is read once.
val x = spark.read.json("./demo.json")
println(x.columns)
dataframe.columns is not an action per se, but it needs to get the schema of your dataframe. Depending on the file format, this needs a file-scan (json, csv). With other file formats like parquet, the columns can be extracted from the meta-data, so no actual file scan is needed
spark.read.json is an action that reads all your data to infer the schema (unless you specify it manually). Hence x.columns will not trigger any action.
According to the latest documentation (click on json):
This function goes through the input once to determine the input
schema. If you know the schema in advance, use the version that
specifies the schema to avoid the extra scan.

What is the proper way to validate datatype of csv data in spark?

We have a JSON file as input to the spark program(which describe schema definition and constraints which we want to check on each column) and I want to perform some data quality checks such as (Not NULL, UNIQUE) and datatype validations as well(Wants to check whether csv file contains the data according to json schema or not?).
JSON File:
{
"id":"1",
"name":"employee",
"source":"local",
"file_type":"text",
"sub_file_type":"csv",
"delimeter":",",
"path":"/user/all/dqdata/data/emp.txt",
"columns":[
{"column_name":"empid","datatype":"integer","constraints":["not null","unique"],"values_permitted":["1","2"]},
{"column_name":"empname","datatype":"string","constraints":["not null","unique"],"values_permitted":["1","2"]},
{"column_name":"salary","datatype":"double","constraints":["not null","unique"],"values_permitted":["1","2"]},
{"column_name":"doj","datatype":"date","constraints":["not null","unique"],"values_permitted":["1","2"]},
{"column_name":"location","string":"number","constraints":["not null","unique"],"values_permitted":["1","2"]}
]
}
Sample CSV input :
empId,empname,salar,dob,location
1,a,10000,11-03-2019,pune
2,b,10020,14-03-2019,pune
3,a,10010,15-03-2019,pune
a,1,10010,15-03-2019,pune
Keep in mind that,
1) intentionally I have put the invalid data for empId and name field(check last record).
2) The number of column in the json file is not fixed?
Question:
How can I ensure that an input data file contains all the records as per the given datatype(in JSON) file or not?
We have tried below things:
1) If we try to load the data from the CSV file using a data frame by applying external schema, then the spark program immediately throws some cast exception(NumberFormatException, etc) and it abnormally terminates the program. But I want to continue the execution flow and log the specific error as "Datatype mismatch error for column empID".
Above scenario works only when we call some RDD action on data frame which I felt a weird way to validate schema.
Please guide me, How we can achieve it in spark?
I don't think there is a free lunch here you have to write this process yourself but the process you can do is...
Read the csv file as a Dataset of Strings so that every row is good
Parse the Dataset using the map function to check for Null or datatype problems per column
Add an extra two columns, a boolean called like validRow and a String called like message or description
With the parser mentioned in '2.', do some sort of try/catch or a Try/Success/Failure for each value in each column and catch the exception and set the validRow and the description column accordingly
Do a filter and write one DataFrame/DataSet that is successful (validRow flag is set to True) to a success place, and write the error DataFrame/DataSet to an error place

Dataframe not able to write on S3

I am creating a dataframe from existing hive table.Table is partitioned on date and site column.Now, when i am trying to overwrite the data in this same table after some computation with previous day data.It is successfully getting loaded.
But when i am trying to write final dataframe at S3 bucket. I am getting error saying file not found.Now the file it is mentioning is previous day file which is now overwritten.
If i write dataframe first and then overwrite table then its running fine.
For writing at S3 location , what it has to do with table partition file?
Below is the error and code.
java.io.FileNotFoundException: No such file or directory: s3://bucket_1/DM/web_fact_tbl/local_dt=2018-05-10/site_name=ABC/part-00000-882a6e29-eb6a-477c-8b88-6fe853956674.c000
fact_tbl = spark.table('db.web_fact_tbl')
fact_lkp = fact_tbl.filter(fact_tbl['local_dt']=='2018-05-10')
fact_join = fact_lkp.alias('a').join(fact_tbl.alias('b'),(col('a.id') == col('b.id')),"inner").select('a.*')
fact_final = fact_join.union(fact_tbl)
fact_final.coalesce(2).createOrReplaceTempView('cwf')
spark.sql('INSERT OVERWRITE TABLE dm.web_fact_tbl PARTITION (local_dt, site_name) \
SELECT * FROM cwf')
fact_final.write.csv('s3://bucket_1/yahoo')
Before last line fact_final is just a "lazy" dataframe object that contains definitions only. It does not contain any data. But it has pointer to exact data files, where data is stored actually.
When you try to perform actual operations (does not matter it's writing to S3, or executing query like fact_final.count()) you'll get the error as above. It looks like partition local_dt=2018-05-10 does not exists anymore (files/folder that sits behind it does not exists).
You can try to re-initialize dataframe once again, before final write (it's another lazy operation - all work is done in your case while you writing it on S3).

Infer a dataframe schema in spark from json file

I am looking to infer a data schema from a .json doc then load data from a database that has a type="Customer"
The current code I am using at the moment does work but the issue i face is that not all documents within the couchbase database contain the exact same schema. As the schema scans from a selection of documents the inferred schema will not contain all the data fields that I require. When I then load the dataframe I get a wrapped array error.
Python Code
%pyspark
df = sqlContext.read.format("com.couchbase.spark.sql.DefaultSource").option("schemaFilter", "type=\"Customer\"").load()
if you have had this issue previously all suggestions welcome on how to solve.
Many thanks!!

Resources