AzureML TabularDatasetFactory.from_parquet_files() error handling column types - azure-machine-learning-service

I'm reading in a folder of parquet files using azureml's TabularDatasetFactory method:
dataset = TabularDatasetFactory.from_parquet_files(path=[(datastore_instance, "path/to/files/*.parquet")])
but I'm running into an issue: one of the columns is typed 'List' in the parquet files, and it seems TabularDatasetFactory.from_parquet_files() can't handle that type:
ExecutionError:
Error Code: ScriptExecution.StreamAccess.Validation
Validation Error Code: NotSupported
Validation Target: ParquetType
Failed Step: xxxxxx
Error Message: ScriptExecutionException was caused by StreamAccessException.
StreamAccessException was caused by ValidationException.
No conversion exists for column: '[REDACTED COLUMN NAME]', from Parquet SchemaType: 'List' to DataPrep ValueKind
So I'm wondering if there's a way to tell TabularDatasetFactory.from_parquet_files() specifically which columns to pull in, or a way to tell it to fall back to object/string for any unsupported column types. Or maybe there's a workaround: first read the files in as a FileDataset, then select which columns in the files to use?
I do see the set_column_types parameter, but I don't know the columns until I've read the files into a dataset, since I'm using datasets to explore what data is available in the folder paths in the first place.
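One possible workaround (a sketch, not a documented fix for this exact error): pull the files down once as a FileDataset, inspect the parquet schema with pyarrow to discover the column names and which ones are List-typed, then create the TabularDataset with validate=False and drop the unsupported columns before materializing. The datastore reference and glob below mirror the question; whether drop_columns is applied early enough to avoid the conversion error is something to verify against your workspace.

# Hedged sketch: discover the columns first, then build the tabular dataset
# without the unsupported List-typed column(s). Paths mirror the question.
import pyarrow as pa
import pyarrow.parquet as pq
from azureml.core import Dataset
from azureml.data.dataset_factory import TabularDatasetFactory

# 1) Grab the files as a FileDataset and download one locally to read its schema.
#    (In practice you may want to narrow the glob to a single file first.)
file_ds = Dataset.File.from_files(path=[(datastore_instance, "path/to/files/*.parquet")])
local_paths = file_ds.download(target_path="sample_parquet", overwrite=True)
schema = pq.read_schema(local_paths[0])

list_columns = [field.name for field in schema if pa.types.is_list(field.type)]
print("all columns:", schema.names)
print("List-typed columns:", list_columns)

# 2) Create the tabular dataset without eager validation, then drop the columns
#    the DataPrep engine cannot convert.
dataset = TabularDatasetFactory.from_parquet_files(
    path=[(datastore_instance, "path/to/files/*.parquet")],
    validate=False,
)
dataset = dataset.drop_columns(list_columns)
df = dataset.to_pandas_dataframe()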

Related

Reading Parquet file with Pyspark returns java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths

I am trying to load parquet files in the following directories:
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-1
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-2
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-3
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-4
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-5
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-6
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-7
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-8
This is what I wrote in PySpark:
s3_bucket_location_of_data = "s3://dir1/model=m1/version=newest/versionnumber=3/scores/"
df = spark.read.parquet(s3_bucket_location_of_data)
but I received the following error:
Py4JJavaError: An error occurred while calling o109.parquet.
: java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-1
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-2
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-3
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-4
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-5
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-6
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-7
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-8
After reading other StackOverflow posts like this, I tried the following:
base_path="s3://dir1/" # I have tried to set this to "s3://dir1/model=m1/version=newest/versionnumber=3/scores/" as well, but it didn't work
s3_bucket_location_of_data = "s3://dir1/model=m1/version=newest/versionnumber=3/scores/"
df = spark.read.option("basePath", base_path).parquet(s3_bucket_location_of_data)
but that returned a similar error message to the one above. I am new to Spark/PySpark and don't know what I could be doing wrong here. Thank you in advance for your answers!
You don't need to specify the detailed path. Just load the files from the base path:
df = spark.read.parquet("s3://dir1")
df = df.filter("model = 'm1' and version = 'newest' and versionnumber = 3")
The directory structure is already partitioned by three columns: model, version, and versionnumber. So read from the base, filter on the partition columns, and Spark will read all the parquet files under the matching partition paths.
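If you want to confirm that the filter actually prunes to the intended partitions rather than scanning everything under s3://dir1, you can inspect the physical plan; for file sources the pruning shows up as PartitionFilters in the scan node:

# Optional check on the DataFrame built above: the partition filter should
# appear as PartitionFilters in the scan node of the plan.
df.explain()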

Spark: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.orc.storage.serde2.io.DateWritable

Received this error (java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.orc.storage.serde2.io.DateWritable) while executing a PySpark .py file that reads data from ORC files in a partitioned folder.
This input folder has data that should be read, transformed, and written to a folder that has an existing external table built on top (MSCK REPAIR will be run after writing data to this target folder).
Code sample (process):
Step 1:
Df = spark.read.orc("input_path")
Step 2:
--> apply transformations (no cast function used)
Step 3:
Transformed_Df.write\
.partitionBy("columns")\
.mode("overwrite")\
.orc("output_path")
When I checked the logs, I saw this error occur multiple times right after the partitions are read, so I believe it happens before the transformations are applied and before the data is written to the target.
I attached a picture of the log; please check.
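A cast error like this usually means that, for some column, the type stored in (or inferred from) the ORC files does not match the type the reader expects, for example a date column that is stored as plain text in some partitions. One hedged way to narrow it down, before touching the write step, is to read a couple of partition directories separately and compare the schemas Spark infers for them; the partition paths below are placeholders.

# Hedged diagnostic sketch, not a fix: a column that comes back as string in one
# partition and date in another is consistent with the Text -> DateWritable error.
schema_a = spark.read.orc("input_path/partition_col=a").schema
schema_b = spark.read.orc("input_path/partition_col=b").schema
for fa, fb in zip(schema_a, schema_b):
    if fa.name == fb.name and fa.dataType != fb.dataType:
        print(f"{fa.name}: {fa.dataType} vs {fb.dataType}")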

Databricks Error: AnalysisException: Incompatible format detected. with Delta

I'm getting the following error when I attempt to write to my data lake with Delta on Databricks:
fulldf = spark.read.format("csv").option("header", True).option("inferSchema",True).load("/databricks-datasets/flights/")
fulldf.write.format("delta").mode("overwrite").save('/mnt/lake/BASE/flights/Full/')
The above produces the following error:
AnalysisException: Incompatible format detected.
You are trying to write to `/mnt/lake/BASE/flights/Full/` using Databricks Delta, but there is no
transaction log present. Check the upstream job to make sure that it is writing
using format("delta") and that you are trying to write to the table base path.
To disable this check, SET spark.databricks.delta.formatCheck.enabled=false
To learn more about Delta, see https://docs.databricks.com/delta/index.html
Any reason for the error?
Such an error usually occurs when there is data in another format inside the folder, for example if Parquet or CSV files were written into it before. Remove the folder completely and try again.
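If that is the case here, one way to apply that suggestion on Databricks (a sketch; dbutils is available in Databricks notebooks, and the paths match the question) is:

# Clear out any non-Delta files at the target path, then rewrite as Delta.
dbutils.fs.rm("/mnt/lake/BASE/flights/Full/", True)  # True = recursive delete

fulldf = (
    spark.read.format("csv")
    .option("header", True)
    .option("inferSchema", True)
    .load("/databricks-datasets/flights/")
)
fulldf.write.format("delta").mode("overwrite").save("/mnt/lake/BASE/flights/Full/")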
This worked in my similar situation:
%sql CONVERT TO DELTA parquet.`/mnt/lake/BASE/flights/Full/`
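CONVERT TO DELTA is the gentler option when the files already under the path are Parquet: it builds a Delta transaction log over the existing files in place instead of rewriting or deleting them, which also clears the "no transaction log present" check.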

What is the proper way to validate datatype of csv data in spark?

We have a JSON file as input to the Spark program (it describes the schema definition and the constraints we want to check on each column), and I want to perform data quality checks such as NOT NULL and UNIQUE as well as datatype validation (i.e., check whether the CSV file contains data that conforms to the JSON schema).
JSON File:
{
  "id": "1",
  "name": "employee",
  "source": "local",
  "file_type": "text",
  "sub_file_type": "csv",
  "delimeter": ",",
  "path": "/user/all/dqdata/data/emp.txt",
  "columns": [
    {"column_name": "empid", "datatype": "integer", "constraints": ["not null", "unique"], "values_permitted": ["1", "2"]},
    {"column_name": "empname", "datatype": "string", "constraints": ["not null", "unique"], "values_permitted": ["1", "2"]},
    {"column_name": "salary", "datatype": "double", "constraints": ["not null", "unique"], "values_permitted": ["1", "2"]},
    {"column_name": "doj", "datatype": "date", "constraints": ["not null", "unique"], "values_permitted": ["1", "2"]},
    {"column_name": "location", "string": "number", "constraints": ["not null", "unique"], "values_permitted": ["1", "2"]}
  ]
}
Sample CSV input :
empId,empname,salar,dob,location
1,a,10000,11-03-2019,pune
2,b,10020,14-03-2019,pune
3,a,10010,15-03-2019,pune
a,1,10010,15-03-2019,pune
Keep in mind that:
1) I have intentionally put invalid data in the empId and name fields (check the last record).
2) The number of columns in the JSON file is not fixed.
Question:
How can I ensure that the input data file contains all records conforming to the datatypes given in the JSON file?
We have tried the following:
1) If we try to load the data from the CSV file into a DataFrame by applying an external schema, the Spark program immediately throws a cast exception (NumberFormatException, etc.) and terminates abnormally. But I want the execution flow to continue and to log a specific error such as "Datatype mismatch error for column empID".
The above scenario only works when we call some RDD action on the DataFrame, which feels like a weird way to validate the schema.
Please guide me: how can we achieve this in Spark?
I don't think there is a free lunch here; you have to write this process yourself, but it can look like this (a minimal sketch follows the list):
1. Read the CSV file as a Dataset of strings, so that every row is read in successfully.
2. Parse the Dataset with a map function, checking each column for nulls or datatype problems.
3. Add two extra columns: a boolean named something like validRow and a string named something like message or description.
4. In the parser from step 2, use some sort of try/catch or Try/Success/Failure for each value in each column, catch the exceptions, and set the validRow and description columns accordingly.
5. Filter and write one DataFrame/Dataset of successful rows (validRow set to true) to a success location, and write the error DataFrame/Dataset to an error location.
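A minimal PySpark sketch of that flow, assuming just a few of the columns/types from the JSON above; the function name validate_row, the output paths, and the case-insensitive column lookup are illustrative choices, and the UNIQUE/values_permitted checks are left out:

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# Expected types taken from the JSON config in the question (subset only).
expected_types = {"empid": int, "empname": str, "salary": float}

# 1. Read every column as a string so no row is rejected up front.
raw = spark.read.option("header", True).csv("/user/all/dqdata/data/emp.txt")

# 2-4. Per-row validation: try each cast, collect error messages.
def validate_row(row):
    data = row.asDict()
    lowered = {k.lower(): v for k, v in data.items()}  # the sample CSV header uses mixed case
    errors = []
    for col, py_type in expected_types.items():
        value = lowered.get(col)
        if value is None or value == "":
            errors.append(f"{col} is null or missing")
            continue
        try:
            py_type(value)
        except ValueError:
            errors.append(f"Datatype mismatch error for column {col}")
    data["validRow"] = len(errors) == 0
    data["description"] = "; ".join(errors)
    return Row(**data)

validated = spark.createDataFrame(raw.rdd.map(validate_row))

# 5. Split good and bad records and write each to its own location.
validated.filter("validRow").write.mode("overwrite").parquet("/tmp/dq/success")
validated.filter("NOT validRow").write.mode("overwrite").parquet("/tmp/dq/errors")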

Error when reading in .adf raster file

When reading in a raster dataset I get the error below. Previously I have been able to read this same raster dataset successfully in R this way, maintaining access to the attribute table and the correct field names. I've tried replacing the files with backups to rule out corrupt files, and I still get the error below. Besides corrupt files, what might be causing this error?
dat2 <- raster("data/data_LEMMA/lemma_clip/w001001.adf")
Error : GDAL Error 3: Failed reading table field info for table
lemma_clip.VAT File may be corrupt?
Warning message: In .rasterFromGDAL(x, band = band, objecttype, ...) :
Could not read RAT or Category names
This appears to be an ESRI GRID that works fine with Arc but not with GDAL (when reading the Raster Attribute Table, or RAT; VAT in ESRI speak). It would be useful if you could make the dataset available for others to look at and try to find a solution.
A workaround is to not read the RAT; perhaps that is acceptable in this case:
dat2 <- raster("data/data_LEMMA/lemma_clip/w001001.adf", RAT=FALSE)
