Spark CSV read option for number format - apache-spark

I'm loading a CSV file with numbers:
spark.read.format("csv")
.schema(StructType(Seq(StructField("result", IntegerType, true))))
.option("mode", "FAILFAST")
.option("delimiter", "|")
.option("encoding", "utf8")
.load(file)
Caused by: FileReadException: Error while reading file blah.csv.
Caused by: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
Caused by: BadRecordException: java.lang.NumberFormatException: For input string: "65,9"
Caused by: NumberFormatException: For input string: "65,9"
Oops... we use comma as decimal point. I see data source options like dateFormat and timestampFormat, but not anything about number format (decimal point and/or grouping).
Can I somehow force Spark to treat the comma as a decimal separator? Or is the only way to load the column as a string and parse it manually?

You should read the column as a string, then replace the comma and cast it to a float/double.
Spark exposes various read options, but nothing that lets you customize the number format, so in your case you will need to do the conversion manually.
Any leaner solution (if one exists) would follow the same steps behind the scenes anyway.
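A minimal sketch of that approach, assuming the values should end up as doubles (the column name and reader options are taken from the question):
import org.apache.spark.sql.functions.regexp_replace
import org.apache.spark.sql.types._

// Read the value as a plain string first, so "65,9" survives parsing.
val raw = spark.read.format("csv")
  .schema(StructType(Seq(StructField("result", StringType, true))))
  .option("mode", "FAILFAST")
  .option("delimiter", "|")
  .option("encoding", "utf8")
  .load(file)

// Swap the decimal comma for a dot and cast; values that still do not
// parse become null after the cast.
val parsed = raw.withColumn(
  "result",
  regexp_replace(raw("result"), ",", ".").cast(DoubleType))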

Related

Specifying Maximum Column Width When Loading Flat File

I am loading a file that has many columns that exceed 1000+ characters (in some cases 4000-8000 characters) and am receiving this error when I query the resulting dataframe from loading it from the file:
FileReadException: Error while reading file dbfs:/fin/fm/spynotesandcommentsfile.txt
Caused by: TextParsingException: java.lang.ArrayIndexOutOfBoundsException - null
Caused by: ArrayIndexOutOfBoundsException:
In the reader, I can specify options (.option("option", true)) and have been looking for one that allows the maximum possible width for every column, since the resulting dataframe cannot be queried otherwise. The reader itself needs to accept the maximum width for every column loaded from the file, which is why solutions about handling many columns (not the issue) or growing a dataframe (not the issue) don't apply here.
val spyfile = spark.read.format("csv")
.option("delimiter", ",")
.option("maximumColumnWidthAllowed", true) ///if this existed or a similar option
I was able to confirm that one field on one of the rows in the file has a character length of 12,043. It still fails for the same reason if I specify .option("maxCharsPerColumn", -1) or .option("maxCharsPerColumn", "-1").
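For reference, maxCharsPerColumn is Spark's documented CSV option for the per-column character limit, with -1 meaning unlimited; it would normally be passed like this (whether it resolves this particular failure is not confirmed here, and the path is taken from the error message above):
// Sketch only: lift the per-column character limit on the CSV parser.
val spyfile = spark.read.format("csv")
  .option("delimiter", ",")
  .option("maxCharsPerColumn", "-1")  // -1 removes the limit
  .load("dbfs:/fin/fm/spynotesandcommentsfile.txt")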

Special characters not encoded properly when creating Spark dataframe from parquet

My input parquet file has a column defined as optional binary title (UTF8), which may include special characters such as the German umlaut (e.g. Schrödinger).
When using Spark to load the contents of the parquet into a DataFrame, the value Schrödinger comes through with its special character mangled. I believe the best explanation of why this could be happening is answered here, though I was under the impression that Spark reads the parquet file as UTF-8 by default anyway.
I have attempted to force UTF-8 encoding by using the option argument as described here, but still no luck. Any suggestions?
Can you try the CP1252 encoding? It worked for us for most of the special characters that were not coming through correctly as UTF-8.
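One extra thing worth checking (my own suggestion, not part of the answer above) is the JVM's default charset, since a driver or executor defaulting to something other than UTF-8 can mangle strings in exactly this way:
import java.nio.charset.Charset

// Prints the charset the JVM falls back to; anything other than UTF-8
// (e.g. US-ASCII or windows-1252) is a likely culprit.
println(Charset.defaultCharset())

// It can be forced at submit time, e.g.:
//   --conf spark.driver.extraJavaOptions=-Dfile.encoding=UTF-8
//   --conf spark.executor.extraJavaOptions=-Dfile.encoding=UTF-8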

how to drop malformed records while loading xls file to spark

While loading a CSV file, there is an option to drop malformed records. Can we do the same for an XLS file load?
I have tried loading an XLS file (almost 1 TB in size) and it shows this error:
warning: there was one deprecation warning; re-run with -deprecation for details
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@339370e
java.lang.IllegalArgumentException: MALFORMED
at java.util.zip.ZipCoder.toString(ZipCoder.java:58)
at java.util.zip.ZipInputStream.readLOC(ZipInputStream.java:300)
Please, advise. Thank you very much.
I think this is what you are looking for (as done in Java for CSV):
spark.read().format("csv").option("sep", ",")
.option("header", true)
.option("mode", "DROPMALFORMED")
.schema(schema)
.load(filePath);
Here, the mode option takes care of malformed records and drops them when they are encountered.
Similarly, header is set to true so that the header row is not treated as data, and the separator is set to a comma for the CSV format.
spark is the SparkSession created before the above snippet.
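For reference, the same read in Scala, since the snippet in the question looks like it comes from a Scala shell (the schema and filePath here are placeholders):
import org.apache.spark.sql.types._

// Placeholder schema; replace with the real column definitions.
val schema = StructType(Seq(StructField("col1", StringType, true)))

val df = spark.read.format("csv")
  .option("sep", ",")
  .option("header", true)
  .option("mode", "DROPMALFORMED")  // silently drops rows that fail to parse
  .schema(schema)
  .load(filePath)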

Unable to append "Quotes" in write for dataframe

I am trying to save a dataframe as .csv in Spark. All fields are required to be enclosed in quotes; currently, the file is written without them.
I am using Spark 2.1.0
Code :
DataOutputResult.write.format("com.databricks.spark.csv").
option("header", true).
option("inferSchema", false).
option("quoteMode", "ALL").
mode("overwrite").
save(Dataoutputfolder)
Output format(actual) :
Name, Id,Age,Gender
XXX,1,23,Male
Output format (Required) :
"Name", "Id" ," Age" ,"Gender"
"XXX","1","23","Male"
Options I tried so far :
quoteMode and quote in the options while writing the file, but with no success.
("quote", "all"), replace quoteMode with quote
or play with concat or concat_wsdirectly on df columns and save without quote - mode
import org.apache.spark.sql.functions.{concat, lit}
val newDF = df.select(concat($"Name", lit("\""), $"Age"))
or create your own UDF to add the desired behaviour (see the sketch below); you can find more examples in Concatenate columns in apache spark dataframe.
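A rough sketch of the UDF route (the function name and the null handling are illustrative, and it assumes all columns are strings):
import org.apache.spark.sql.functions.udf

// Wrap each value in double quotes before writing; nulls are passed through.
val quoteIt = udf((s: String) => if (s == null) null else "\"" + s + "\"")

val quoted = df.select(df.columns.map(c => quoteIt(df(c)).alias(c)): _*)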
Unable to add as a comment to the above answer, so posting as an answer.
In Spark 2.3.1, use quoteAll
df1.write.format("csv")
.option("header", true)
.option("quoteAll","true")
.save(Dataoutputfolder)
Also, to add to the comment of @Karol Sudol (great answer btw): .option("quote", "\u0000") will work only if you are using PySpark with Python 3, whose default encoding is 'utf-8'. A few people reported that the option did not work, presumably because they were using PySpark with Python 2, whose default encoding is 'ascii', hence the error "java.lang.RuntimeException: quote cannot be more than one character".

How to read sequence files exported from HBase

I used the following code to export an HBase table and save the output to HDFS:
hbase org.apache.hadoop.hbase.mapreduce.Export \
MyHbaseTable1 hdfs://nameservice1/user/ken/data/exportTable1
Output files are binary files. If I use pyspark to read the file folder:
test1 = sc.textFile('hdfs://nameservice1/user/ken/data/exportTable1')
test1.take(5)
It shows:
u'SEQ\x061org.apache.hadoop.hbase.io.ImmutableBytesWritable%org.apache.hadoop.hbase.client.Result\x00\x00\x00\x00\x00\x00\ufffd-\x10A\ufffd~lUE\u025bt\ufffd\ufffd\ufffd&\x00\x00\x04\ufffd\x00\x00\x00'
u'\x00\x00\x00\x067-2010\ufffd\t'
u'|'
u'\x067-2010\x12\x01r\x1a\x08clo-0101 \ufffd\ufffd\ufffd*(\x042\\6.67|10|10|10|7.33|6.67|6.67|6.67|6.67|6.67|6.67|5.83|3.17|0|0|0.67|0.67|0.67|0.67|0|0|0|0|0'
u'u'
I can tell that
'7-2010' in the 2nd line is the Rowkey,
'r' in the 4th line is the column family,
'clo-0101' in the 4th line is the column name,
'6.67|10|10|10|7.33|6.67|6.67|6.67|6.67|6.67|6.67|5.83|3.17|0|0|0.67|0.67|0.67|0.67|0|0|0|0|0' is the value.
I don't know where the 3rd and 5th lines came from. It seems like the HBase export followed its own rules when generating the file; if I try to decode it in my own way, the data might get corrupted.
Question:
How can I convert this file back to a readable format? For example:
7-2010, r, clo-0101, 6.67|10|10|10|7.33|6.67|6.67|6.67|6.67|6.67|6.67|5.83|3.17|0|0|0.67|0.67|0.67|0.67|0|0|0|0|0
I have tried:
test1 = sc.sequenceFile('/user/youyang/data/hbaseSnapshot1/', keyClass=None, valueClass=None, keyConverter=None, valueConverter=None, minSplits=None, batchSize=0)
test1.take(5)
and
test1 = sc.sequenceFile('hdfs://nameservice1/user/ken/data/exportTable1'
, keyClass='org.apache.hadoop.hbase.mapreduce.TableInputFormat'
, valueClass='org.apache.hadoop.hbase.io.ImmutableBytesWritable'
, keyConverter='org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter'
, valueConverter='org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter'
, minSplits=None
, batchSize=100)
No luck, the code did not work, ERROR:
Caused by: java.io.IOException: Could not find a deserializer for the Value class: 'org.apache.hadoop.hbase.client.Result'. Please ensure that the configuration 'io.serializations' is properly configured, if you're using custom serialization.
Any suggestions? Thank you!
I had this problem recently myself. I solved it by moving away from sc.sequenceFile and using sc.newAPIHadoopFile instead (or just hadoopFile if you are on the old API). The Spark SequenceFile reader appears to handle only keys/values that are Writable types (as stated in the docs).
If you use newAPIHadoopFile it uses the Hadoop deserialization logic, and you can specify which Serialization types you need in the config-dictionary you give it:
hadoop_conf = {"io.serializations": "org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.hbase.mapreduce.ResultSerialization"}
sc.newAPIHadoopFile(
<input_path>,
'org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat',
keyClass='org.apache.hadoop.hbase.io.ImmutableBytesWritable',
valueClass='org.apache.hadoop.hbase.client.Result',
keyConverter='org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter',
valueConverter='org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter',
conf=hadoop_conf)
Note that the value in hadoop_conf for "io.serializations" is a comma separated list which includes "org.apache.hadoop.hbase.mapreduce.ResultSerialization". That is the key configuration you need to be able to deserialize the Result. The WritableSerialization is also needed in order to be able to deserialize ImmutableBytesWritable.
You can also use sc.newAPIHadoopRDD, but then you also need to set a value for "mapreduce.input.fileinputformat.inputdir" in the config dictionary.
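A rough sketch of that variant, reusing the class names and path from above; the only real difference is that the input path moves into the config dictionary:
hadoop_conf = {
    "io.serializations": "org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.hbase.mapreduce.ResultSerialization",
    "mapreduce.input.fileinputformat.inputdir": "hdfs://nameservice1/user/ken/data/exportTable1",
}
rdd = sc.newAPIHadoopRDD(
    'org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat',
    'org.apache.hadoop.hbase.io.ImmutableBytesWritable',
    'org.apache.hadoop.hbase.client.Result',
    keyConverter='org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter',
    valueConverter='org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter',
    conf=hadoop_conf)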
