How to override column names in a CSV file with a custom schema? - apache-spark

I have a CSV file with headers and data like this:
Date,Transaction,Name,Memo,Amount
12/31/2018,DEBIT,Amazon stuff,24000978364666403396802,-62.48
I want to override the column names to be like this:
transaction,credit_debit,description,memo,amount
Here is how I manually specify the schema I want to use and then read the file:
StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("transaction_date", DataTypes.TimestampType, true),
        DataTypes.createStructField("credit_debit", DataTypes.StringType, true),
        DataTypes.createStructField("description", DataTypes.StringType, true),
        DataTypes.createStructField("memo", DataTypes.StringType, true),
        DataTypes.createStructField("amount", DataTypes.DoubleType, true)
});

String csvPath = "input/mytransactions.csv";

DataFrameReader dataFrameReader = spark.read();
Dataset<Row> dataFrame =
    dataFrameReader
        .format("org.apache.spark.csv")
        .option("header", "true")
        .option("inferSchema", false)
        .schema(schema)
        .csv(csvPath);
dataFrame.show(20);
But when I do this, all of the column values come back null when I read the file:
+----------------+------------+-----------+----+------+
|transaction_date|credit_debit|description|memo|amount|
+----------------+------------+-----------+----+------+
| null| null| null|null| null|
| null| null| null|null| null|
| null| null| null|null| null|
Any idea what I'm doing incorrectly?

The problem is with the Date column: you are missing the csv option dateFormat. Code below.
StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("transaction_date", DataTypes.DateType, true),
        DataTypes.createStructField("credit_debit", DataTypes.StringType, true),
        DataTypes.createStructField("description", DataTypes.StringType, true),
        DataTypes.createStructField("memo", DataTypes.StringType, true),
        DataTypes.createStructField("amount", DataTypes.DoubleType, true)
});

Dataset<Row> dataFrame =
    dataFrameReader
        .format("org.apache.spark.csv")
        .option("header", "true")
        .option("dateFormat", "MM/dd/yyyy") // lowercase yyyy is the calendar year; uppercase YYYY is the week-based year
        .option("inferSchema", false)
        .schema(schema)
        .csv(csvPath);

I wanted to rename the columns. This does it:
Dataset<Row> dataFrame =
    dataFrameReader
        .format("org.apache.spark.csv")
        .option("header", "true")
        .option("inferSchema", true)
        .csv(csvPath);

// Rename columns
dataFrame = dataFrame.toDF("transaction_date", "debit_credit", "description", "memo", "amount");

Related

PySpark: Problem with FloatType while writing data to parquet file

I have the following schema:
root
|-- A: string (nullable = true)
|-- B: float (nullable = true)
And when I apply this schema to the data, the dataframe values for the float column are populated incorrectly.
Original data:
("floadVal1", 0.404413386),
("floadVal2", 0.28563),
("floadVal3", 0.591290286),
("floadVal4", 0.404413386),
("floadVal5", 15.37610198),
("floadVal6", 15.261798303),
("floadVal7", 19.887814583),
("floadVal8", 0.0)
Please help me understand what exactly Spark is doing here to generate the output below.
Dataframe after applying the schema:
+---------+----------+
| A| B|
+---------+----------+
|floadVal1|0.40441337|
|floadVal2| 0.28563|
|floadVal3| 0.5912903|
|floadVal4|0.40441337|
|floadVal5| 15.376102|
|floadVal6| 15.261798|
|floadVal7| 19.887815|
|floadVal8| 0.0|
+---------+----------+
After writing to parquet:
A B
0 floadVal1 0.404413
1 floadVal2 0.285630
2 floadVal3 0.591290
3 floadVal4 0.404413
4 floadVal5 15.376102
5 floadVal6 15.261798
6 floadVal7 19.887815
7 floadVal8 0.000000
And, as per the Spark 2.4.5 docs:
FloatType: Represents 4-byte single-precision floating point numbers.
Sample Code
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.types import StructType, StructField, StringType, FloatType

spark = SparkSession.builder.master('local').config(
    "spark.sql.parquet.writeLegacyFormat", 'true').getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)

schema = StructType([
    StructField("A", StringType(), True),
    StructField("B", FloatType(), True)])

df = spark.createDataFrame([
    ("floadVal1", 0.404413386),
    ("floadVal2", 0.28563),
    ("floadVal3", 0.591290286),
    ("floadVal4", 0.404413386),
    ("floadVal5", 15.37610198),
    ("floadVal6", 15.261798303),
    ("floadVal7", 19.887814583),
    ("floadVal8", 0.0)
], schema)

df.printSchema()
df.show()
df.write.format("parquet").save('floatTestParFile')
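The roughly 7 significant digits you see in df.show() are simply what a 4-byte single-precision float can hold; the value is truncated when the DataFrame is created against the FloatType schema, and Parquet then stores that already-truncated float. A minimal sketch (plain Python standard library, independent of Spark; the helper name as_float32 is just illustrative) that reproduces the same truncation by round-tripping a Python float through a 4-byte IEEE 754 encoding:
import struct

def as_float32(x):
    # Pack as a 4-byte IEEE 754 single, then unpack back into a Python float (a 64-bit double)
    return struct.unpack('f', struct.pack('f', x))[0]

for v in [0.404413386, 0.591290286, 15.37610198]:
    # Only about 7 significant digits survive the single-precision round trip
    print(f"{v} -> {as_float32(v):.8g}")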

Reading csv files in PySpark

I am trying to read a csv file and convert it into a dataframe.
input.txt
4324,'Andy',43.5,20.3,53.21
2342,'Sam',22.1
3248,'Jane',11.05,12.87
6457,'Bob',32.1,75.23,71.6
Schema: Id, Name,Jan,Feb,March
As you can see, the csv file doesn't have trailing "," separators when there are no trailing expense values.
Code:
from pyspark.sql.types import *

input1 = sc.textFile('/FileStore/tables/input.txt').map(lambda x: x.split(","))

schema = StructType([StructField('Id', StringType(), True),
                     StructField('Name', StringType(), True),
                     StructField('Jan', StringType(), True),
                     StructField('Feb', StringType(), True),
                     StructField('Mar', StringType(), True)])

df3 = sqlContext.createDataFrame(input1, schema)
I get ValueError: Length of object (4) does not match with length of fields (5). How do I resolve this?
I would first import the file using pandas, which should handle everything for you. From there you can convert the pandas DataFrame to Spark and do all your usual stuff. I copied your example txt file and quickly wrote up some code to confirm that it all works:
import pandas as pd

# Reading in the txt file as csv
df_pandas = pd.read_csv('<your location>/test.txt', sep=",")

# Converting to a spark dataframe and displaying
df_spark = spark.createDataFrame(df_pandas)
display(df_spark)
Which produced the expected output.
The faster method would be to import through spark:
# Importing csv file using pyspark
csv_import = sqlContext.read \
    .format('csv') \
    .options(sep=',', header='true', inferSchema='true') \
    .load('<your location>/test.txt')
display(csv_import)
Which gives the same output.
from pyspark.sql.types import *
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Test").getOrCreate()
fields = [StructField('Id', StringType(), True), StructField('Name', StringType(), True),
          StructField('Jan', StringType(), True), StructField('Feb', StringType(), True),
          StructField('Mar', StringType(), True)]
schema = StructType(fields)
data = spark.read.format("csv").load("test2.txt")
df3 = spark.createDataFrame(data.rdd, schema)
df3.show()
Output:
+----+------+-----+-----+-----+
| Id| Name| Jan| Feb| Mar|
+----+------+-----+-----+-----+
|4324|'Andy'| 43.5| 20.3|53.21|
|2342| 'Sam'| 22.1| null| null|
|3248|'Jane'|11.05|12.87| null|
|6457| 'Bob'| 32.1|75.23| 71.6|
+----+------+-----+-----+-----+
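If you would rather keep the original sc.textFile approach from the question, another option is to pad short rows before applying the schema, which avoids the length-mismatch ValueError. A sketch, assuming sc and spark are already available and reusing the file path and field names from the question:
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField(name, StringType(), True)
                     for name in ['Id', 'Name', 'Jan', 'Feb', 'Mar']])

# Pad each split line with None so every row has exactly len(schema.fields) values
padded = sc.textFile('/FileStore/tables/input.txt') \
           .map(lambda line: line.split(',')) \
           .map(lambda fields: fields + [None] * (len(schema.fields) - len(fields)))

df3 = spark.createDataFrame(padded, schema)
df3.show()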
Here are a couple of options for you to consider. These use wildcard characters, so you can search through all folders and sub-folders, pick up files whose names match a specific pattern, and merge everything into a single dataframe.
val myDFCsv = spark.read.format("csv")
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("mnt/rawdata/2019/01/01/client/ABC*.gz")

myDFCsv.show()
myDFCsv.head()
myDFCsv.count()

//////////////////////////////////////////
// If you also need to load the filename
import org.apache.spark.sql.functions.input_file_name

val myDFCsv = spark.read.format("csv")
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("mnt/rawdata/2019/01/01/client/ABC*.gz")
  .withColumn("file_name", input_file_name())

myDFCsv.show(false)
myDFCsv.head()
myDFCsv.count()

Spark read CSV - Not showing corrupt records

Spark has a PERMISSIVE mode for reading CSV files, which stores corrupt records in a separate column named _corrupt_record.
permissive -
Sets all fields to null when it encounters a corrupted record and places all corrupted records in a string column
called _corrupt_record
However, when I try the following example, I don't see any column named _corrupt_record. The records that don't match the schema simply appear as null.
data.csv
data
10.00
11.00
$12.00
$13
gaurang
code
import org.apache.spark.sql.types.{StructField, StructType, StringType, LongType, DecimalType}
val schema = new StructType(Array(
  new StructField("value", DecimalType(25, 10), false)
))

val df = spark.read.format("csv")
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .schema(schema)
  .load("../test.csv")
schema
scala> df.printSchema()
root
|-- value: decimal(25,10) (nullable = true)
scala> df.show()
+-------------+
| value|
+-------------+
|10.0000000000|
|11.0000000000|
| null|
| null|
| null|
+-------------+
If I change the mode to FAILFAST, I get an error when I try to view the data.
Adding the _corrupt_record column to the schema, as suggested by Andrew and Prateek, resolved the issue.
import org.apache.spark.sql.types.{StructField, StructType, StringType, LongType, DecimalType}
val schema = new StructType(Array(
  new StructField("value", DecimalType(25, 10), true),
  new StructField("_corrupt_record", StringType, true)
))

val df = spark.read.format("csv")
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .schema(schema)
  .load("../test.csv")
querying Data
scala> df.show()
+-------------+---------------+
| value|_corrupt_record|
+-------------+---------------+
|10.0000000000| null|
|11.0000000000| null|
| null| $12.00|
| null| $13|
| null| gaurang|
+-------------+---------------+
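As an aside, the name of the corrupt-record column is itself configurable through the columnNameOfCorruptRecord option, so it does not have to be called _corrupt_record. A sketch in PySpark (the option name is the same in Scala; bad_record is just a hypothetical name, and the schema must contain a string column with that exact name):
from pyspark.sql.types import StructType, StructField, StringType, DecimalType

schema = StructType([
    StructField("value", DecimalType(25, 10), True),
    StructField("bad_record", StringType(), True)  # must match the option below
])

df = spark.read.format("csv") \
    .option("header", "true") \
    .option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "bad_record") \
    .schema(schema) \
    .load("../test.csv")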

Can't Transform Kafka JSON Data in Spark Structured Streaming

I am trying to get Kafka messages and process them with Spark in standalone mode. Kafka stores the data in JSON format. I can get the Kafka messages, but I cannot parse the JSON data by defining a schema.
When I run the bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my_kafka_topic --from-beginning command to see the kafka messages in kafka topic, it outputs as follows:
"{\"timestamp\":1553792312117,\"values\":[{\"id\":\"Simulation.Simulator.Temperature\",\"v\":21,\"q\":true,\"t\":1553792311686}]}"
"{\"timestamp\":1553792317117,\"values\":[{\"id\":\"Simulation.Simulator.Temperature\",\"v\":22,\"q\":true,\"t\":1553792316688}]}"
And I can get this data successfully with this code block in Spark:
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "my_kafka_topic") \
    .load() \
    .select(col("value").cast("string"))
The schema is like this:
df.printSchema()
root
|-- value: string (nullable = true)
Then, writing this dataframe to the console prints the Kafka messages:
Batch: 9
-------------------------------------------
+--------------------+
| value|
+--------------------+
|"{\"timestamp\":1...|
+--------------------+
But I want to parse the JSON data with a defined schema, and here is the code block I've tried:
schema = StructType([ StructField("timestamp", LongType(), False), StructField("values", ArrayType( StructType([ StructField("id", StringType(), True), StructField("v", IntegerType(), False), StructField("q", BooleanType(), False), StructField("t", LongType(), False) ]), True ), True) ])
parsed = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "my_kafka_topic") \
    .load() \
    .select(from_json(col("value").cast("string"), schema).alias("opc"))
And the schema of parsed dataframe:
parsed.printSchema()
root
|-- opc: struct (nullable = true)
| |-- timestamp: string (nullable = true)
| |-- values: struct (nullable = true)
| | |-- id: string (nullable = true)
| | |-- v: integer (nullable = true)
| | |-- q: boolean (nullable = true)
| | |-- t: string (nullable = true)
These code blocks run without error. But when I want to write the parsed dataframe to the console:
query = parsed \
    .writeStream \
    .format("console") \
    .start()
query.awaitTermination()
it writes null to the console, like this:
+----+
| opc|
+----+
|null|
+----+
So it seems there is a problem with parsing the JSON data, but I can't figure out what it is.
Can you tell me what is wrong?
It seems that the schema was not correct for your case; please try applying the next one:
schema = StructType([
    StructField("timestamp", LongType(), False),
    StructField("values", ArrayType(
        StructType([StructField("id", StringType(), True),
                    StructField("v", IntegerType(), False),
                    StructField("q", BooleanType(), False),
                    StructField("t", LongType(), False)]), True), True)])
Also remember that Spark's schema inference works quite well for batch reads, so you could let Spark discover the schema from a sample of the data and then reuse it.
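One way to do that, sketched below, is to infer the schema once with a batch read of a static sample of the same messages (sample.json is a hypothetical file holding a few of them) and reuse it when parsing the stream, df being the value-as-string dataframe from the question:
from pyspark.sql.functions import col, from_json

# Infer the schema from a batch read of a sample file, then reuse it on the stream
sample_schema = spark.read.json("sample.json").schema
parsed = df.select(from_json(col("value"), sample_schema).alias("opc"))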
Another issue is that your JSON data has leading and trailing double quotes and also contains backslashes; together these make the payload invalid JSON, which is what was preventing Spark from parsing the messages.
To remove the invalid characters, your code should be modified as follows:
from pyspark.sql.functions import col, from_json, regexp_replace

parsed = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "my_kafka_topic") \
    .load() \
    .withColumn("value", regexp_replace(col("value").cast("string"), "\\\\", "")) \
    .withColumn("value", regexp_replace(col("value"), "^\"|\"$", "")) \
    .select(from_json(col("value"), schema).alias("opc"))
Now your output should be:
+------------------------------------------------------------------------------------------------------------------+
|value |
+------------------------------------------------------------------------------------------------------------------+
|{"timestamp":1553588718638,"values":[{"id":"Simulation.Simulator.Temperature","v":26,"q":true,"t":1553588717036}]}|
+------------------------------------------------------------------------------------------------------------------+
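Once the messages parse, a natural follow-up is to flatten the nested struct for downstream processing. A sketch building on the parsed dataframe above (the alias "reading" is just illustrative):
from pyspark.sql.functions import col, explode

flat = parsed \
    .select(col("opc.timestamp").alias("timestamp"),
            explode(col("opc.values")).alias("reading")) \
    .select("timestamp", "reading.id", "reading.v", "reading.q", "reading.t")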
Good luck!

Round Spark DataFrame in-place

I read a .csv file into a Spark DataFrame. For a DoubleType column, is there a way to specify at the time of the file read that this column should be rounded to 2 decimal places? I'm also supplying a custom schema to the DataFrameReader API call. Here are my schema and API calls:
val customSchema = StructType(Array(StructField("id_1", IntegerType, true),
                                    StructField("id_2", IntegerType, true),
                                    StructField("id_3", DoubleType, true)))

// using Spark's CSV reader with a custom schema
// spark == SparkSession()
val parsedSchema = spark.read.format("csv")
  .schema(customSchema)
  .option("header", "true")
  .option("nullvalue", "?")
  .load("C:\\Scala\\SparkAnalytics\\block_1.csv")
After the file is read into a DataFrame, I can round the decimals like this:
parsedSchema.withColumn("cmp_fname_c1", round($"cmp_fname_c1", 3))
But this creates a new DataFrame, so I'd also like to know if it can be done in-place instead of creating a new DataFrame.
Thanks
You can specify, say, DecimalType(10, 2) for the DoubleType column in your customSchema when loading your CSV file. Let's say you have a CSV file with the following content:
id_1,id_2,Id_3
1,10,5.555
2,20,6.0
3,30,7.444
Sample code below:
import org.apache.spark.sql.types._

val customSchema = StructType(Array(
  StructField("id_1", IntegerType, true),
  StructField("id_2", IntegerType, true),
  StructField("id_3", DecimalType(10, 2), true)
))

spark.read.format("csv").schema(customSchema).
  option("header", "true").option("nullvalue", "?").
  load("/path/to/csvfile").
  show
// +----+----+----+
// |id_1|id_2|id_3|
// +----+----+----+
// | 1| 10|5.56|
// | 2| 20|6.00|
// | 3| 30|7.44|
// +----+----+----+
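On the in-place part of the question: Spark DataFrames are immutable, so every transformation (round included, as well as the schema-based cast above) returns a new DataFrame; the closest thing to in-place is to overwrite the column and reassign the variable. A sketch in PySpark (the Scala form is analogous), with df standing for the DataFrame returned by the CSV read:
from pyspark.sql.functions import col, round as sql_round

# Replace the column with its rounded value and reassign the reference
df = df.withColumn("id_3", sql_round(col("id_3"), 2))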
