Spark read CSV - Not showing corrupt records - apache-spark

Spark has a PERMISSIVE mode for reading CSV files, which stores corrupt records in a separate column named _corrupt_record.
permissive -
Sets all fields to null when it encounters a corrupted record and places all corrupted records in a string column
called _corrupt_record
However, when I try the following example, I don't see any column named _corrupt_record. The records that don't match the schema simply appear as null.
data.csv
data
10.00
11.00
$12.00
$13
gaurang
code
import org.apache.spark.sql.types.{StructField, StructType, StringType, LongType, DecimalType}
val schema = new StructType(Array(
new StructField("value", DecimalType(25,10), false)
))
val df = spark.read.format("csv")
.option("header", "true")
.option("mode", "PERMISSIVE")
.schema(schema)
.load("../test.csv")
schema
scala> df.printSchema()
root
|-- value: decimal(25,10) (nullable = true)
scala> df.show()
+-------------+
| value|
+-------------+
|10.0000000000|
|11.0000000000|
| null|
| null|
| null|
+-------------+
If I change the mode to FAILFAST, I get an error when I try to view the data.

Adding the _corrupt_record column to the schema, as suggested by Andrew and Prateek, resolved the issue.
import org.apache.spark.sql.types.{StructField, StructType, StringType, LongType, DecimalType}
val schema = new StructType(Array(
new StructField("value", DecimalType(25,10), true),
new StructField("_corrupt_record", StringType, true)
))
val df = spark.read.format("csv")
.option("header", "true")
.option("mode", "PERMISSIVE")
.schema(schema)
.load("../test.csv")
Querying the data
scala> df.show()
+-------------+---------------+
| value|_corrupt_record|
+-------------+---------------+
|10.0000000000| null|
|11.0000000000| null|
| null| $12.00|
| null| $13|
| null| gaurang|
+-------------+---------------+
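The behaviour shown above can be mimicked in plain Python to make the mechanics clearer. This is only an illustrative sketch, not Spark's implementation: the function name parse_permissive is hypothetical, and Python's Decimal parser stands in for Spark's DecimalType conversion. Each input line either parses into the value column or is kept verbatim in a second slot that plays the role of _corrupt_record.

```python
from decimal import Decimal, InvalidOperation

def parse_permissive(lines):
    """Hypothetical helper: mimic PERMISSIVE mode for one decimal column.

    Returns (value, corrupt_record) pairs. Exactly one of the two is None.
    """
    rows = []
    for raw in lines:
        try:
            # Parsable input fills the value column.
            rows.append((Decimal(raw), None))
        except InvalidOperation:
            # Unparsable input leaves value empty and keeps the raw text.
            rows.append((None, raw))
    return rows

rows = parse_permissive(["10.00", "11.00", "$12.00", "$13", "gaurang"])
for value, corrupt in rows:
    print(value, corrupt)
```

Without the second slot there is nowhere to keep the raw text, which matches the first attempt in the question: the bad rows showed up only as null.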

Related

Pyspark : Problem with FloatType while writing data to parquet file

I have the following schema:
root
|-- A: string (nullable = true)
|-- B: float (nullable = true)
When I apply the schema to the data, the values in the float column come out wrong.
Original data:
("floadVal1", 0.404413386),
("floadVal2", 0.28563),
("floadVal3", 0.591290286),
("floadVal4", 0.404413386),
("floadVal5", 15.37610198),
("floadVal6", 15.261798303),
("floadVal7", 19.887814583),
("floadVal8", 0.0)
Please help me understand what exactly Spark is doing here to produce the output below.
After applying the schema, the DataFrame is:
+---------+----------+
| A| B|
+---------+----------+
|floadVal1|0.40441337|
|floadVal2| 0.28563|
|floadVal3| 0.5912903|
|floadVal4|0.40441337|
|floadVal5| 15.376102|
|floadVal6| 15.261798|
|floadVal7| 19.887815|
|floadVal8| 0.0|
+---------+----------+
After writing to Parquet:
A B
0 floadVal1 0.404413
1 floadVal2 0.285630
2 floadVal3 0.591290
3 floadVal4 0.404413
4 floadVal5 15.376102
5 floadVal6 15.261798
6 floadVal7 19.887815
7 floadVal8 0.000000
As per the Spark 2.4.5 documentation:
FloatType: Represents 4-byte single-precision floating point numbers.
Sample Code
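from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.types import StructType, StructField, StringType, FloatType
spark = SparkSession.builder.master('local').config(
"spark.sql.parquet.writeLegacyFormat", 'true').getOrCreate()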
sc = spark.sparkContext
sqlContext = SQLContext(sc)
schema = StructType([
StructField("A", StringType(), True),
StructField("B", FloatType(), True)])
df = spark.createDataFrame([
("floadVal1", 0.404413386),
("floadVal2", 0.28563),
("floadVal3", 0.591290286),
("floadVal4", 0.404413386),
("floadVal5", 15.37610198),
("floadVal6", 15.261798303),
("floadVal7", 19.887814583),
("floadVal8", 0.0)
], schema)
df.printSchema()
df.show()
df.write.format("parquet").save('floatTestParFile')
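The differences in the question are consistent with ordinary single-precision rounding. A 4-byte float carries roughly 7 significant decimal digits, so a literal such as 0.404413386 cannot be stored exactly. The sketch below reproduces the effect without Spark by round-tripping each value through a 32-bit float using the standard struct module. The helper name to_float32 is hypothetical.

```python
import struct

def to_float32(x):
    """Round a 64-bit Python float to the nearest 32-bit float and back."""
    return struct.unpack("f", struct.pack("f", x))[0]

originals = [0.404413386, 0.28563, 0.591290286, 15.37610198,
             15.261798303, 19.887814583, 0.0]

for x in originals:
    y = to_float32(x)
    # The error column shows how far the stored value drifts from the input.
    print(x, "->", y, "error:", abs(x - y))
```

If the extra digits matter, declaring column B as DoubleType (8 bytes) should preserve them. The table shown after the Parquet write looks like pandas output, whose default display uses six decimal places, so it probably reflects display formatting rather than a further loss of precision.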

How to override column names in a CSV file with a custom schema?

I have a CSV file with headers and data like this:
Date,Transaction,Name,Memo,Amount
12/31/2018,DEBIT,Amazon stuff,24000978364666403396802,-62.48
I want to override the column names to be like this:
transaction,credit_debit,description,memo,amount
Here is how I manually specify the schema I want to use and then read the file:
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField("transaction_date", DataTypes.TimestampType, true),
DataTypes.createStructField("credit_debit", DataTypes.StringType, true),
DataTypes.createStructField("description", DataTypes.StringType, true),
DataTypes.createStructField("memo", DataTypes.StringType, true),
DataTypes.createStructField("amount", DataTypes.DoubleType, true)
});
String csvPath = "input/mytransactions.csv";
DataFrameReader dataFrameReader = spark.read();
Dataset<Row> dataFrame =
dataFrameReader
.format("org.apache.spark.csv")
.option("header","true")
.option("inferSchema", false)
.schema(schema)
.csv(csvPath);
dataFrame.show(20);
But when I read the file, all of the column values are null.
+----------------+------------+-----------+----+------+
|transaction_date|credit_debit|description|memo|amount|
+----------------+------------+-----------+----+------+
| null| null| null|null| null|
| null| null| null|null| null|
| null| null| null|null| null|
Any idea what I'm doing incorrectly?
The problem is with the date column: you are missing the CSV option dateFormat.
Code below.
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField("transaction_date", DataTypes.DateType, true),
DataTypes.createStructField("credit_debit", DataTypes.StringType, true),
DataTypes.createStructField("description", DataTypes.StringType, true),
DataTypes.createStructField("memo", DataTypes.StringType, true),
DataTypes.createStructField("amount", DataTypes.DoubleType, true)
});
Dataset<Row> dataFrame =
dataFrameReader
.format("org.apache.spark.csv")
.option("header","true")
.option("dateFormat", "MM/dd/yyyy")
.option("inferSchema", false)
.schema(schema)
.csv(csvPath);
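The underlying idea is that a date parser only accepts text matching the pattern it was given. Without dateFormat, Spark falls back to an ISO-style default, and a value such as 12/31/2018 does not fit it. In PERMISSIVE mode a field that fails to parse makes the record malformed, which would explain the all-null rows. The sketch below shows the same idea with Python's standard library rather than Spark: the ISO parser rejects the sample value, while an explicit month/day/year pattern accepts it. Python's %m/%d/%Y is used here only as an analogy for a Java-style month/day/year pattern.

```python
from datetime import date, datetime

sample = "12/31/2018"

# Default ISO parsing expects year-month-day and rejects this layout.
try:
    parsed_default = date.fromisoformat(sample)
except ValueError:
    parsed_default = None

# An explicit month/day/year pattern matches the file's layout.
parsed_explicit = datetime.strptime(sample, "%m/%d/%Y").date()

print(parsed_default, parsed_explicit)
```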
I wanted to rename the columns. This does it:
Dataset<Row> dataFrame =
dataFrameReader
.format("org.apache.spark.csv")
.option("header","true")
.option("inferSchema", true)
.csv(csvPath);
// Rename Columns
dataFrame = dataFrame.toDF("transaction_date","debit_credit", "description", "memo", "amount");

Error when running a query involving ROUND function in spark sql

I am trying, in pyspark, to obtain a new column by rounding one column of a table to the precision specified, in each row, by another column of the same table, e.g., from the following table:
+--------+--------+
| Data|Rounding|
+--------+--------+
|3.141592| 3|
|0.577215| 1|
+--------+--------+
I should be able to obtain the following result:
+--------+--------+--------------+
| Data|Rounding|Rounded_Column|
+--------+--------+--------------+
|3.141592| 3| 3.142|
|0.577215| 1| 0.6|
+--------+--------+--------------+
In particular, I have tried the following code:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import (
StructType, StructField, FloatType, LongType,
IntegerType
)
pdDF = pd.DataFrame(columns=["Data", "Rounding"], data=[[3.141592, 3],
[0.577215, 1]])
mySchema = StructType([ StructField("Data", FloatType(), True),
StructField("Rounding", IntegerType(), True)])
spark = (SparkSession.builder
.master("local")
.appName("column rounding")
.getOrCreate())
df = spark.createDataFrame(pdDF,schema=mySchema)
df.show()
df.createOrReplaceTempView("df_table")
df_rounded = spark.sql("SELECT Data, Rounding, ROUND(Data, Rounding) AS Rounded_Column FROM df_table")
df_rounded.show()
but I get the following error:
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"cannot resolve 'round(df_table.`Data`, df_table.`Rounding`)' due to data type mismatch: Only foldable Expression is allowed for scale arguments; line 1 pos 23;\n'Project [Data#0, Rounding#1, round(Data#0, Rounding#1) AS Rounded_Column#12]\n+- SubqueryAlias df_table\n +- LogicalRDD [Data#0, Rounding#1], false\n"
Any help would be deeply appreciated :)
With Spark SQL, Catalyst throws the following error in your run: Only foldable Expression is allowed for scale arguments
i.e. @param scale new scale to be round to, this should be a constant int at runtime
ROUND only expects a literal for the scale, so it cannot take its scale from another column. You can try writing custom code instead of relying on the Spark SQL built-in.
EDIT:
With UDF,
val df = Seq(
(3.141592,3),
(0.577215,1)).toDF("Data","Rounding")
df.show()
df.createOrReplaceTempView("df_table")
import org.apache.spark.sql.functions._
def RoundUDF(customvalue:Double, customscale:Int):Double = BigDecimal(customvalue).setScale(customscale, BigDecimal.RoundingMode.HALF_UP).toDouble
spark.udf.register("RoundUDF", RoundUDF(_:Double,_:Int):Double)
val df_rounded = spark.sql("select Data, Rounding, RoundUDF(Data, Rounding) as Rounded_Column from df_table")
df_rounded.show()
Input:
+--------+--------+
| Data|Rounding|
+--------+--------+
|3.141592| 3|
|0.577215| 1|
+--------+--------+
Output:
+--------+--------+--------------+
| Data|Rounding|Rounded_Column|
+--------+--------+--------------+
|3.141592| 3| 3.142|
|0.577215| 1| 0.6|
+--------+--------+--------------+
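Since the question was asked in PySpark, the rounding logic of the Scala UDF above can be sketched in Python as well. This is a minimal sketch, assuming HALF_UP rounding as in the Scala version. The function name round_half_up is hypothetical, and the value is converted through its string form so that the decimal digits seen by the rounding step match what is printed.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_half_up(value, scale):
    """Round value to `scale` decimal places using HALF_UP rounding."""
    # Decimal(1).scaleb(-3) is 0.001, i.e. the smallest step at scale 3.
    quantum = Decimal(1).scaleb(-scale)
    return float(Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP))

print(round_half_up(3.141592, 3))
print(round_half_up(0.577215, 1))
```

In PySpark, a function like this could presumably be registered with spark.udf.register("RoundUDF", round_half_up, DoubleType()) and then called from SQL in the same way as the Scala version. That registration step is not exercised here.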

Round Spark DataFrame in-place

I read a .csv file into a Spark DataFrame. For a DoubleType column, is there a way to specify at read time that this column should be rounded to 2 decimal places? I'm also supplying a custom schema to the DataFrameReader API call. Here's my schema and API calls:
val customSchema = StructType(Array(StructField("id_1", IntegerType, true),
StructField("id_2", IntegerType, true),
StructField("id_3", DoubleType, true)))
// using Spark's CSV reader with custom schema
// spark == SparkSession()
val parsedSchema = spark.read.format("csv").schema(customSchema).option("header", "true").option("nullvalue", "?").load("C:\\Scala\\SparkAnalytics\\block_1.csv")
After the file is read into a DataFrame, I can round the decimals like this:
parsedSchema.withColumn("cmp_fname_c1", round($"cmp_fname_c1", 3))
But this creates a new DataFrame, so I'd also like to know if it can be done in-place instead of creating a new DataFrame.
Thanks
You can specify, say, DecimalType(10, 2) for the DoubleType column in your customSchema when loading your CSV file. Let's say you have a CSV file with the following content:
id_1,id_2,Id_3
1,10,5.555
2,20,6.0
3,30,7.444
Sample code below:
import org.apache.spark.sql.types._
val customSchema = StructType(Array(
StructField("id_1", IntegerType, true),
StructField("id_2", IntegerType, true),
StructField("id_3", DecimalType(10, 2), true)
))
spark.read.format("csv").schema(customSchema).
option("header", "true").option("nullvalue", "?").
load("/path/to/csvfile").
show
// +----+----+----+
// |id_1|id_2|id_3|
// +----+----+----+
// | 1| 10|5.56|
// | 2| 20|6.00|
// | 3| 30|7.44|
// +----+----+----+

Converting RDD into a dataframe int vs Double

Why is it possible to convert an RDD[Int] into a DataFrame using the implicit method:
import sqlContext.implicits._
//Concatenate rows
val rdd1 = sc.parallelize(Array(4,5,6)).toDF()
rdd1.show()
rdd1: org.apache.spark.sql.DataFrame = [_1: int]
+---+
| _1|
+---+
| 4|
| 5|
| 6|
+---+
but rdd[Double] is throwing an error:
val rdd2 = sc.parallelize(Array(1.1,2.34,3.4)).toDF()
error: value toDF is not a member of org.apache.spark.rdd.RDD[Double]
Spark 2.x
In Spark 2.x, toDF uses SparkSession.implicits, which provides rddToDatasetHolder and localSeqToDatasetHolder for any type with an Encoder, so with
val spark: SparkSession = ???
import spark.implicits._
both:
Seq(1.1, 2.34, 3.4).toDF()
and
sc.parallelize(Seq(1.1, 2.34, 3.4)).toDF()
are valid.
Spark 1.x
It is not possible. Apart from Product types, SQLContext provides implicit conversions only for RDD[Int] (intRddToDataFrameHolder), RDD[Long] (longRddToDataFrameHolder) and RDD[String] (stringRddToDataFrameHolder). You can always map to RDD[Tuple1[Double]] first:
sc.parallelize(Seq(1.1, 2.34, 3.4)).map(Tuple1(_)).toDF()

Resources