Spark is inconsistent with unusually encoded CSV file - apache-spark

Context:
As part of a data pipeline, I am working with some flat CSV files.
Those files have unusual encoding and escaping rules.
My intention is to preprocess them and convert them to Parquet for subsequent pipeline steps.
MCVE:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("...").getOrCreate()

min_schema = StructType(
    [
        StructField("dummy_col", StringType(), True),
        StructField("record_id", IntegerType(), nullable=False),
        StructField("dummy_after", StringType(), nullable=False),
    ]
)

df = (
    spark.read.option("mode", "FAILFAST")
    .option("quote", '"')
    .option("escape", '"')
    .option("inferSchema", "false")
    .option("multiline", "true")
    .option("ignoreLeadingWhiteSpace", "true")
    .option("ignoreTrailingWhiteSpace", "true")
    .schema(min_schema)
    .csv("min_repro.csv", header=True)
)
min_repro.csv:
dummy_col,record_id,dummy_after
"",1,", Unusual value with comma included"
B,2,"Unusual value with escaped quote and comma ""like, this"
CSV parses fine:
df.collect()
[Row(dummy_col=None, record_id=1, dummy_after=', Unusual value with comma included'),
Row(dummy_col='B', record_id=2, dummy_after='Unusual value with escaped quote and comma "like, this')]
Yet trivial Spark code on the same DF fails with an obscure error:
if df.count() != df.select('record_id').distinct().count():
    pass
Py4JJavaError: An error occurred while calling o357.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 17.0 failed 1 times, most recent failure: Lost task 0.0 in stage 17.0 (TID 13, localhost, executor driver): org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST.
...
Caused by: java.lang.NumberFormatException: For input string: "Unusual value with comma included""
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
I don't understand how .collect() on the same DF can return correct rows, yet queries on that same DF fail.
An upstream bug was filed: https://issues.apache.org/jira/browse/SPARK-39842
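A possible mitigation to try while the ticket is open (hedged: spark.sql.csv.parser.columnPruning.enabled is a real Spark SQL configuration, but it has not been verified against this exact bug): disable CSV column pruning, so that count() and other column-pruned queries parse full rows the same way collect() does.

# sketch: disable CSV column pruning before re-reading the file
spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", "false")

df = (
    spark.read.option("mode", "FAILFAST")
    .option("quote", '"')
    .option("escape", '"')
    .option("multiline", "true")
    .schema(min_schema)
    .csv("min_repro.csv", header=True)
)
# with pruning disabled, count() should parse the same columns that collect() sees
df.count()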

The correct way to handle a , inside the data is to enclose the field in double quotes and use the option "escapeQuotes", "true":
df = (
    spark.read.option("mode", "FAILFAST")
    .option("escapeQuotes", "true")
    .option("inferSchema", "false")
    .option("multiline", "true")
    .option("ignoreLeadingWhiteSpace", "true")
    .option("ignoreTrailingWhiteSpace", "true")
    .schema(min_schema)
    .csv('C:/Users/pc/Desktop/sample2.csv', header=True)
)
------------------------------------------------------------------------
>>> df.select('dummy_after').show(truncate=False)
+-----------------------------------+
|dummy_after |
+-----------------------------------+
|, Unusual value with comma included|
+-----------------------------------+
>>> if df.count() != df.select('record_id').distinct().count():
... pass
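Since the stated goal is to preprocess these files and convert them to Parquet for subsequent pipeline steps, once the CSV reads consistently the conversion could look roughly like this (the output path is illustrative, not from the original post):

# write the cleaned data out as Parquet for the downstream steps
df.write.mode("overwrite").parquet("preprocessed/min_repro.parquet")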

Related

How to override default timestamp format while reading csv in pyspark?

Suppose I have the following data in a CSV format,
ID|TIMESTAMP_COL
1|03-02-2003 08:37:55.671 PM
2|2003-02-03 08:37:55.671 AM
and my code for reading the above CSV is,
from pyspark.sql.types import *
sch = StructType([StructField("ID",StringType(),False),StructField("TIMESTAMP_COL",StringType(),True)])
df = spark.read \
.format("csv") \
.option("encoding", "utf-8") \
.option("mode", "PERMISSIVE") \
.option("header", "true") \
.option("dateFormat", "dd-MM-yyyy") \
.option("timestampFormat", "dd-MM-yyyy HH:mm:ss.SSS a") \
.option("delimiter", "|") \
.option("columnNameOfCorruptRecord", "_corrupt_record") \
.schema(sch) \
.load("data.csv")
So, according to the given timestamp format, I would expect the record with id '2' to be rejected, since it has a different format; instead it gets parsed, but with a wrong value.
The output I am getting is,
df.show(truncate=False)
+---+-----------------------+---------------+
|ID |TIMESTAMP_COL          |_corrupt_record|
+---+-----------------------+---------------+
|1  |2003-02-03 08:37:55.671|null           |
|2  |0008-07-26 08:37:55.671|null           |
+---+-----------------------+---------------+
Why is this happening?
Not sure if it helps, but here is what I found:
In your schema the second field is declared as StringType; shouldn't it be TimestampType()?
I was able to reproduce your results with spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY"). I also ran tests with the other possible options for this parameter:
object LegacyBehaviorPolicy extends Enumeration {
val EXCEPTION, LEGACY, CORRECTED = Value
}
and here is the doc for this parameter:
.doc("When LEGACY, java.text.SimpleDateFormat is used for formatting and parsing " +
"dates/timestamps in a locale-sensitive manner, which is the approach before Spark 3.0. " +
"When set to CORRECTED, classes from java.time.* packages are used for the same purpose. " +
"The default value is EXCEPTION, RuntimeException is thrown when we will get different " +
"results.")
So with LEGACY I am getting the same results as you.
With EXCEPTION Spark throws an exception:
org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0:
With CORRECTED Spark returns nulls for both records.
It does, however, parse the record with id 1 correctly when I change the pattern to hh instead of HH (the a AM/PM marker goes with the 12-hour hh pattern),
so with something like this:
from pyspark.sql.types import *
spark.conf.set("spark.sql.legacy.timeParserPolicy","CORRECTED")
sch = StructType([StructField("ID",StringType(),False),StructField("TIMESTAMP_COL",TimestampType(),True), StructField("_corrupt_record", StringType(),True)])
df = spark.read \
.format("csv") \
.option("encoding", "utf-8") \
.option("mode", "PERMISSIVE") \
.option("header", "true") \
.option("dateFormat", "dd-MM-yyyy") \
.option("timestampFormat", "dd-MM-yyyy hh:mm:ss.SSS a") \
.option("delimiter", "|") \
.option("columnNameOfCorruptRecord", "_corrupt_record") \
.schema(sch) \
.load("dbfs:/FileStore/tables/stack.csv") \
df.show(truncate = False)
I am able to get this output:
+---+-----------------------+----------------------------+
|ID |TIMESTAMP_COL |_corrupt_record |
+---+-----------------------+----------------------------+
|1 |2003-02-03 20:37:55.671|null |
|2 |null |2|2003-02-03 08:37:55.671 AM|
+---+-----------------------+----------------------------+
I am getting null here because that is how the Spark parser works: when the value does not match the pattern it assigns null, and the value is not moved to _corrupt_record. So if you want to remove the timestamps that do not match, you can filter out the nulls, for example as in the sketch below.
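A minimal sketch of that filter, using the column names from the example above:

from pyspark.sql.functions import col

# keep only rows whose timestamp parsed successfully
valid_df = df.filter(col("TIMESTAMP_COL").isNotNull())
# or inspect the rows that failed to parse
invalid_df = df.filter(col("TIMESTAMP_COL").isNull())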
Edit: As mentioned in the comments, I was missing the _corrupt_record column in the schema; it is added now, so you can get the corrupted value if you need it.

spark.read.excel - Not reading all Excel rows when using custom schema

I am trying to read a Spark DataFrame from an Excel file, using the crealytics spark-excel dependency.
Without any predefined schema, all rows are read correctly, but only as string-type columns.
To prevent that, I am using my own schema (in which I declared certain columns as Integer type), but in this case most of the rows are dropped when the file is read.
The library dependency used in build.sbt:
"com.crealytics" %% "spark-excel" % "0.11.1",
Scala version - 2.11.8
Spark version - 2.3.2
val inputDF = sparkSession.read.excel(useHeader = true).load(inputLocation(0))
The above reads all the data - around 25000 rows.
But,
val inputWithSchemaDF: DataFrame = sparkSession.read
.format("com.crealytics.spark.excel")
.option("useHeader" , "false")
.option("inferSchema", "false")
.option("addColorColumns", "true")
.option("treatEmptyValuesAsNulls" , "false")
.option("keepUndefinedRows", "true")
.option("maxRowsInMey", 2000)
.schema(templateSchema)
.load(inputLocation)
This gives me only 450 rows.
Is there a way to prevent that? Thanks in advance!
As of now, I haven't found a fix for this problem, but I tried solving it in a different way by manually type-casting. To keep the number of lines of code down, I used a for loop. My solution is as follows:
Step 1: Create my own schema of type 'StructType':
import org.apache.spark.sql.types._

val requiredSchema = new StructType()
  .add("ID", IntegerType, true)
  .add("Vendor", StringType, true)
  .add("Brand", StringType, true)
  .add("Product Name", StringType, true)
  .add("Net Quantity", StringType, true)
Step 2: Type casting the Dataframe AFTER it has been read (WITHOUT the custom schema) from the excel file (instead of using the schema while reading the data):
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

def convertInputToDesiredSchema(inputDF: DataFrame, requiredSchema: StructType)(implicit sparkSession: SparkSession): DataFrame =
{
  var schemaDf: DataFrame = inputDF
  for (i <- inputDF.columns.indices)
  {
    // cast a column only when its current type differs from the required type
    if (inputDF.schema(i).dataType.typeName != requiredSchema(i).dataType.typeName)
    {
      schemaDf = schemaDf.withColumn(schemaDf.columns(i), col(schemaDf.columns(i)).cast(requiredSchema.apply(i).dataType))
    }
  }
  schemaDf
}
This might not be an efficient solution, but it is better than typing out many lines of code to typecast multiple columns.
I am still searching for a solution to my original question.
This solution is just in case someone wants to try it and is in immediate need of a quick fix.
Here's a workaround using PySpark, driven by a schema file that consists of "fieldname" and "dataType" pairs:
# 1st load the dataframe with StringType for all columns
import pandas as pd
from pyspark.sql.types import *
from pyspark.sql.functions import col

input_df = spark.read.format("com.crealytics.spark.excel") \
.option("header", isHeaderOn) \
.option("treatEmptyValuesAsNulls", "true") \
.option("dataAddress", xlsxAddress1) \
.option("setErrorCellsToFallbackValues", "true") \
.option("ignoreLeadingWhiteSpace", "true") \
.option("ignoreTrailingWhiteSpace", "true") \
.load(inputfile)
# 2nd Modify the datatypes within the dataframe using a file containing column names and the expected data type.
dtypes = pd.read_csv("/dbfs/mnt/schema/{}".format(file_schema_location), header=None).to_records(index=False).tolist()
fields = [StructField(dtype[0], globals()[f'{dtype[1]}']()) for dtype in dtypes]
schema = StructType(fields)
for dt in dtypes:
    colname = dt[0]
    coltype = dt[1].replace("Type", "")
    input_df = input_df.withColumn(colname, col(colname).cast(coltype))
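To confirm the casts took effect, a quick check could be:

# verify that the column types now match the schema file
input_df.printSchema()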

ParseException Error using Regex in pyspark 2.4

I am trying to get only those rows where colADD contains a non-alphanumeric character.
Code :
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Test") \
.getOrCreate()
data = spark.read.csv("Customers");
data.registerTempTable("data");
spark.sql("SELECT colADD from data WHERE colADD REGEXP '^[A-Za-z0-9]+$'; ");
Error:
pyspark.sql.utils.ParseException: u"\nextraneous input ';'
expecting <EOF>(line 1, pos 56)\n\n== SQL ==\nSELECT CNME from data WHERE CNME REGEXP '^[A-Za-z0-9]+$';
Please help, am I missing something?
In Spark (Scala) I used this:
spark.sql("SELECT col2 from test WHERE col2 REGEXP '^[A-Za-z0-9]*\\-' ").show
I have not used pyspark, but how about just removing the ;? It does not seem to be needed.
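A sketch of the same call in PySpark without the trailing semicolon inside the SQL string (and, on the assumption that the goal really is rows that contain a non-alphanumeric character, with the pattern negated):

# same query, minus the trailing ';' that the parser rejects
spark.sql("SELECT colADD FROM data WHERE colADD REGEXP '^[A-Za-z0-9]+$'").show()

# if the intent is rows that DO contain a non-alphanumeric character,
# the match likely needs to be negated:
spark.sql("SELECT colADD FROM data WHERE NOT (colADD REGEXP '^[A-Za-z0-9]+$')").show()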

Unable to infer schema for CSV in pyspark

I'm using Databricks and trying to read in a CSV file like this:
df = (spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv(path_to_my_file)
)
and I'm getting the error:
AnalysisException: 'Unable to infer schema for CSV. It must be specified manually.;'
I've checked that my file is not empty, and I've also tried to specify the schema myself like this:
schema = "datetime timestamp, id STRING, zone_id STRING, name INT, time INT, a INT"
df = (spark.read
.option("header", "true")
.schema(schema)
.csv(path_to_my_file)
)
But when I try to view it using display(df), it just gives me the output below; I'm totally lost and don't know what to do.
df.show() and df.printSchema() give the following (output screenshot omitted):
It looks like the data is not being read into the dataframe.
Error snapshot: (screenshot omitted)
Note, this is an incomplete answer, as there isn't enough information about what your file looks like to understand why inferSchema did not work. I've posted this as an answer because it is too long for a comment.
Saying this, to programmatically specify a schema, you would need to define it using StructType().
Using your example of
"datetime timestamp, id STRING, zone_id STRING, name INT, time INT, mod_a INT"
it would look something like this:
# Import data types
from pyspark.sql.types import *
schema = StructType(
    [
        StructField('datetime', TimestampType(), True),
        StructField('id', StringType(), True),
        StructField('zone_id', StringType(), True),
        StructField('name', IntegerType(), True),
        StructField('time', IntegerType(), True),
        StructField('mod_a', IntegerType(), True),
    ]
)
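A short sketch of plugging that schema into the read from the question (path_to_my_file as in the original snippet):

# read the CSV with the explicit StructType schema instead of inferSchema
df = (spark.read
      .option("header", "true")
      .schema(schema)
      .csv(path_to_my_file)
      )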
Note how df.printSchema() showed that all of the columns were of string datatype.
I discovered that the problem was caused by the filename.
Perhaps Databricks is unable to read files whose names begin with '_' (an underscore).
I had the same problem, and when I uploaded the file without the leading underscore, I was able to process it.
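A minimal sketch of that fix on Databricks, with illustrative DBFS paths (underscore-prefixed files are treated as hidden by the underlying Hadoop file listing, which is why they get skipped):

# rename the file so it no longer starts with an underscore, then read it
dbutils.fs.mv("dbfs:/FileStore/tables/_myfile.csv", "dbfs:/FileStore/tables/myfile.csv")

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("dbfs:/FileStore/tables/myfile.csv")
      )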

Spark's .count() function is different to the contents of the dataframe when filtering on corrupt record field

I have a Spark job, written in Python, which is getting odd behaviour when checking for errors in its data. A simplified version is below:
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructType, StructField, DoubleType
from pyspark.sql.functions import col, lit
spark = SparkSession.builder.master("local[3]").appName("pyspark-unittest").getOrCreate()
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
SCHEMA = StructType([
    StructField("headerDouble", DoubleType(), False),
    StructField("ErrorField", StringType(), False)
])

dataframe = (
    spark.read
    .option("header", "true")
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "ErrorField")
    .schema(SCHEMA).csv("./x.csv")
)
total_row_count = dataframe.count()
print("total_row_count = " + str(total_row_count))
errors = dataframe.filter(col("ErrorField").isNotNull())
errors.show()
error_count = errors.count()
print("errors count = " + str(error_count))
The csv it is reading is simply:
headerDouble
wrong
The relevant output of this is
total_row_count = 1
+------------+----------+
|headerDouble|ErrorField|
+------------+----------+
| null| wrong|
+------------+----------+
errors count = 0
Now how does this possibly happen? If the dataframe has a record, how is it counted as 0? Is this a bug in the Spark infrastructure, or am I missing something?
EDIT: Looks like this might be a known bug on Spark 2.2 which has been fixed in Spark 2.3 - https://issues.apache.org/jira/browse/SPARK-21610
Thanks #user6910411 - does seem to be a bug. I've raised an issue in the Spark project's bug tracker.
I'm speculating that Spark is getting confused due to the presence of the ErrorField in the schema which is also being specified as the error column and being used to filter the dataframe.
Meanwhile I think I've found a workaround to count the dataframe rows at a reasonable speed:
def count_df_with_spark_bug_workaround(df):
return sum(1 for _ in df.toLocalIterator())
Not quite sure why this gives the right answer when .count() doesn't work.
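Another workaround often suggested for this class of problem (hedged: not verified in this thread, but the Spark CSV/JSON documentation recommends caching or saving the parsed results before querying only the corrupt record column) is to cache the DataFrame before filtering on ErrorField, so later queries reuse the parsed rows instead of re-reading the CSV with a pruned schema:

# cache the fully parsed DataFrame so count()/filter() on ErrorField
# operate on the parsed rows rather than re-parsing the CSV
dataframe = dataframe.cache()
errors = dataframe.filter(col("ErrorField").isNotNull())
print("errors count = " + str(errors.count()))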
Jira ticket I raised:
https://issues.apache.org/jira/browse/SPARK-24147
This turned out to be a duplicate of:
https://issues.apache.org/jira/browse/SPARK-21610
