Specifying Maximum Column Width When Loading Flat File - apache-spark

I am loading a file with many columns whose values exceed 1,000 characters (in some cases 4,000-8,000 characters), and I receive this error when I query the dataframe that results from loading the file:
FileReadException: Error while reading file dbfs:/fin/fm/spynotesandcommentsfile.txt
Caused by: TextParsingException: java.lang.ArrayIndexOutOfBoundsException - null
Caused by: ArrayIndexOutOfBoundsException:
In the reader I can specify options (.option("option", true)), and I have been looking for an option that allows the maximum possible width for all columns, since the resulting dataframe cannot be queried. The reader needs to allow the maximum width for every column loaded into the dataframe from the file, which is why the many answers about handling many columns (not the issue) or growing a dataframe (not the issue) don't solve this problem.
val spyfile = spark.read.format("csv")
  .option("delimiter", ",")
  .option("maximumColumnWidthAllowed", true) // if this, or a similar option, existed
  .load("dbfs:/fin/fm/spynotesandcommentsfile.txt")
I was able to confirm that one field on one of the rows in the file has a character length of 12,043. It still fails for the same reason if I specify .option("maxCharsPerColumn", -1) or .option("maxCharsPerColumn", "-1").
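For reference, a minimal PySpark sketch of the documented options that usually govern field size (maxCharsPerColumn, maxColumns, multiLine); the quote and escape values are assumptions about how the long text fields are wrapped, since an unbalanced quote can make the parser run rows together and fail with the same ArrayIndexOutOfBoundsException:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sketch only: the path comes from the error message above; quote/escape are assumptions.
spyfile = (spark.read.format("csv")
    .option("delimiter", ",")
    .option("maxCharsPerColumn", "-1")  # -1 removes the per-column character limit
    .option("maxColumns", "20480")      # raise this if the parser sees "extra" columns
    .option("multiLine", "true")        # allow quoted fields that span line breaks
    .option("quote", "\"")
    .option("escape", "\"")
    .load("dbfs:/fin/fm/spynotesandcommentsfile.txt"))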

Related

Spark CSV read option for number format

I'm loading a CSV file with numbers:
spark.read.format("csv")
.schema(StructType(Seq(StructField("result", IntegerType, true))))
.option("mode", "FAILFAST")
.option("delimiter", "|")
.option("encoding", "utf8")
.load(file)
Caused by: FileReadException: Error while reading file blah.csv.
Caused by: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
Caused by: BadRecordException: java.lang.NumberFormatException: For input string: "65,9"
Caused by: NumberFormatException: For input string: "65,9"
Oops... we use comma as decimal point. I see data source options like dateFormat and timestampFormat, but not anything about number format (decimal point and/or grouping).
Can I somehow force Spark to handle the comma decimal separator? Or is the only way to load the column as a string and parse it manually?
You should read the data as a string, then remove the comma (replace it with a dot) and convert the value to a float. Spark provides various read options, but none of them let you customize the number format, so in your case you will need to do the conversion manually.
I also think any leaner solution (if one exists) would follow the same steps under the hood.
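A minimal PySpark sketch of that manual approach (the column name, delimiter, and file path are taken from the question; the regexp_replace-and-cast step is the suggested conversion):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Read the numeric column as a string first, then fix the decimal separator.
schema = StructType([StructField("result", StringType(), True)])

df = (spark.read.format("csv")
    .schema(schema)
    .option("delimiter", "|")
    .option("encoding", "utf8")
    .load("blah.csv"))

# Replace the comma decimal separator with a dot and cast to double.
parsed = df.withColumn("result", F.regexp_replace("result", ",", ".").cast("double"))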

CSV upload to PySpark data frame counting linebreaks as new rows

I am trying to load a CSV into a Spark data frame using the usual instructions; however, the CSV is loading incorrectly. Below are the header and a problematic record, as seen in Vim (the ^M characters are carriage returns).
symbol,tweet_id,text,created_at,retweets,likes,geo,place,coordinates,location^M
AIG,1423790670557351945, "Next Monday, US stock market may edge up. The market is more likely to be mixed.
$MNST $AIG $WFC $GS $MET $BAC $JPM▒<80>▒",2021-08-06 23:38:43,1,0,,,,"Toronto, Ontario, Canada"^M
Here is the command I'm using to load the CSV into Spark data frame:
df = spark.read.load("symbols_tweets.csv",
format="csv", sep=",", header="true")
The issue is that spark.read.load ends up thinking $MNST is a new row, since it appears on a new line. Is there any way I can have Spark treat the carriage return ^M as the record terminator instead, so that it loads the rows as intended? As a workaround I tried converting the CSV to a Pandas data frame and then to a Spark data frame, but that resulted in more complex datatype issues; I would rather solve this in a more direct manner.
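For what it's worth, a hedged sketch of the usual fix: since the problematic text field is double-quoted in the sample, Spark's multiLine option should keep the embedded line break inside a single record (the quote and escape values are assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# multiLine lets quoted fields contain line breaks instead of starting a new record.
df = (spark.read
    .option("header", "true")
    .option("multiLine", "true")
    .option("quote", "\"")
    .option("escape", "\"")
    .csv("symbols_tweets.csv"))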

Spark: reading files with PERMISSIVE and provided schema - issues with corrupted records column

I am reading a CSV with Spark. I provide a schema for the file and read it in PERMISSIVE mode. I would like to keep all bad records in the columnNameOfCorruptRecord column (in my case corrupted_records).
I went through hell to set this up and I still get warnings that I cannot suppress. Is there something I'm missing?
So first, in order to have the corrupted_records column, I needed to add it to the schema as StringType. This is documented, so okay.
But whenever I read a file I get a warning that the schema doesn't match because the number of columns is different. It's just a warning, but it's filling my logs.
Also, when a field is not nullable and there is a corrupted record, the corrupted record goes to the corrupted_records column and all of its fields are set to null, so I get an exception because I have a non-nullable field. The only way to solve this is to change the non-nullable columns to nullable, which is quite strange.
Am I missing something?
Recap:
1. Is there a way to ignore the warning when I've added the corrupted_records column to the schema?
2. Is there a way to use PERMISSIVE mode and the corrupted_records column with a schema that has non-nullable fields?
Thanks!
The following documentation might help. It would be great if you could at least provide the code you've written.
https://docs.databricks.com/spark/latest/data-sources/read-csv.html
Here is a demo snippet for reading JSON:
df = self.spark.read.option("mode", "PERMISSIVE").option("primitivesAsString", True).json(self.src_path)
To answer your point 2, you first need to address point 1 properly.
Point 1: you should analyze your file and map your schema with all of the fields in the file. After importing your CSV file into a DataFrame, select your fields of interest and continue with what you were doing.
Point 2: you can solve your problem by defining your schema as follows (the code below is PySpark):
import pyspark.sql.types as types

yourSchema = (types.StructType()
    .add('field0', types.IntegerType(), True)
    # add the rest of your fields here: .add(fieldName, fieldType, True), so each field is nullable
    .add('corrupted_records', types.StringType(), True)  # your corrupted data will end up here
)
With this having been defined, you can import your csv file into a DataFrame as follows:
df = ( spark.read.format("csv")
.schema(yourSchema)
.option("mode", "PERMISSIVE")
.option("columnNameOfCorruptRecord", "corrupted_records")
load(your_csv_files)
)
There are also other ways to do the same operation, and different ways to handle bad records; have a look at this insightful article: https://python.plainenglish.io/how-to-handle-bad-data-in-spark-sql-5e0276d37ca1
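As a follow-up usage sketch, one way to inspect what landed in the corrupt-record column after the permissive read above (the cache() call is a precaution: some Spark versions refuse to filter a CSV source on the corrupt-record column alone without it):

# Split good and bad rows after the permissive read above.
df_cached = df.cache()

bad_rows = df_cached.filter(df_cached["corrupted_records"].isNotNull())
good_rows = df_cached.filter(df_cached["corrupted_records"].isNull())

bad_rows.select("corrupted_records").show(truncate=False)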

how to drop malformed records while loading xls file to spark

While loading a CSV file there is an option to drop malformed records. Can we do the same for an XLS file load?
I have tried loading an XLS file (almost 1 TB in size) and it shows this error:
warning: there was one deprecation warning; re-run with -deprecation for details
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@339370e
java.lang.IllegalArgumentException: MALFORMED
at java.util.zip.ZipCoder.toString(ZipCoder.java:58)
at java.util.zip.ZipInputStream.readLOC(ZipInputStream.java:300)
Please, advise. Thank you very much.
I think this is what you are looking for (as done in Java for CSV):
spark.read().format("csv").option("sep", ",")
.option("header", true)
.option("mode", "DROPMALFORMED")
.schema(schema)
.load(filePath);
Here, the mode option takes care of malformed records and drops them when they are encountered.
Similarly, header is set to true so that the header row is not treated as a data value, and the separator is defined as a comma for the CSV file format.
spark is the SparkSession to be created before the above snippet.
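For readers not using the Java API, a rough PySpark equivalent of the same DROPMALFORMED read (the schema fields and path are placeholders, not taken from the question):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

# Placeholder schema; substitute the real columns of your file.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

df = (spark.read.format("csv")
    .option("sep", ",")
    .option("header", "true")
    .option("mode", "DROPMALFORMED")  # rows that do not fit the schema are dropped
    .schema(schema)
    .load("/path/to/file.csv"))       # placeholder path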

Behavior of spark.read.csv with inferschema=True in case of multiple file loading

I am facing a difficulty (rather, a strange result) while trying to load multiple CSV files into Spark simultaneously with
df = spark.read.csv('/dir/*.csv', header=True, inferSchema=True)
df.printSchema()
# Sale_Value String(nullable=true) # it should be Double/float
# Name String # Fine for all string columns
So basically all my integer and double columns are being read as StringType, while I expect them to be double, since I am passing the inferSchema parameter as true.
I checked my data and confirmed that there are no null values or string values in those columns.
The strange thing is: when I read each file into a separate dataframe, df1 = spark.read.csv(file1, inferSchema=True),
df2 = spark.read.csv(file2, inferSchema=True), and print the schema of each dataframe, all the schemas are as expected (doubles come out as double, strings as string).
I then started appending the separate dataframes into a single df, like df = df1.union(df2), df = df.union(df3), etc., and checked df.printSchema(); the results are still as expected, no issue there.
So I am confused by this behavior when multiple files are loaded in a single load statement (*.csv).
Is there anything I am missing about the inferSchema behavior? Please shed some light.
Edit
My data is enclosed in double quotes (to avoid splitting a record when a comma appears inside a field), e.g.: "name","sale_target","sale_V","gender"...
I have 3 files; 2 of them are quote-enclosed and one is not (that is the reason I faced this issue: I dropped the file without quotes, and guess what, everything worked perfectly).
So is it mandatory that, if I am reading/loading multiple CSV files, either all of the files are quote-enclosed or none of them are?
File location https://github.com/satya-panda/king-s-things
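A sketch of the workaround implied above: read the quote-enclosed and plain files separately, so that schema inference sees consistent quoting within each read, and then union the results (the file names here are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical file names; adjust to the actual layout under /dir.
quoted1 = spark.read.csv("/dir/quoted_file1.csv", header=True, inferSchema=True, quote='"')
quoted2 = spark.read.csv("/dir/quoted_file2.csv", header=True, inferSchema=True, quote='"')
plain = spark.read.csv("/dir/plain_file.csv", header=True, inferSchema=True)

# unionByName keeps the result stable even if the column order differs between files.
df = quoted1.unionByName(quoted2).unionByName(plain)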
