Spark csv to dataframe skip first row - apache-spark

I am loading a CSV file into a DataFrame using:
sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", ",")
  .load("file.csv")
but my input file contains a date in the first row and the header from the second row.
Example:
20160612
id,name,age
1,abc,12
2,bcd,33
How can I skip this first row while converting the CSV to a DataFrame?

Here are several options I can think of, since the Databricks spark-csv module doesn't seem to provide a skip-lines option:
Option one: Add a "#" character in front of the first line; the line will then automatically be treated as a comment and ignored by the spark-csv module, as in the sketch below.
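A minimal sketch of option one, assuming the first line of file.csv has already been prefixed with "#" (e.g. "#20160612"); setting the comment option explicitly just makes the behaviour obvious:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", ",")
  .option("comment", "#") // lines starting with '#' are skipped before parsing
  .load("file.csv")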
Option two: Create your own schema and set the mode option to DROPMALFORMED, which will drop the first line because it contains fewer tokens than expected by the custom schema:
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val customSchema = StructType(Array(
  StructField("id", IntegerType, true),
  StructField("name", StringType, true),
  StructField("age", IntegerType, true)))

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .schema(customSchema)
  .load("test.txt")

df.show
16/06/12 21:24:05 WARN CsvRelation$: Number format exception. Dropping
malformed line: id,name,age
+---+----+---+
| id|name|age|
+---+----+---+
| 1| abc| 12|
| 2| bcd| 33|
+---+----+---+
Note the warning message above, which reports the dropped malformed line.
Option three: Write your own parser to drop any line that doesn't split into three fields:
val file = sc.textFile("pathToYourCsvFile")
val df = file.map(line => line.split(","))
  .filter(fields => fields.length == 3 && fields(0) != "id")
  .map(fields => (fields(0), fields(1), fields(2)))
  .toDF("id", "name", "age")
df.show
+---+----+---+
| id|name|age|
+---+----+---+
| 1| abc| 12|
| 2| bcd| 33|
+---+----+---+
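A further sketch, not from the original answer: if you are on Spark 2.2+ (with a SparkSession, rather than the sqlContext used in the question), you can drop just the first physical line and hand the rest to the built-in CSV reader, since DataFrameReader.csv accepts a Dataset[String]:
import spark.implicits._

// Drop only the very first line of the file (the date line), keep everything else.
val lines = spark.sparkContext.textFile("file.csv")
val withoutDate = lines
  .mapPartitionsWithIndex((idx, iter) => if (idx == 0) iter.drop(1) else iter)
  .toDS()

// Parse the remaining lines as CSV, treating "id,name,age" as the header.
val df = spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .csv(withoutDate)
df.show()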

Related

Escape newline character in DataFrame

I have a Parquet table in Hive which I read via Spark and write to a delimited file. The code I use is this
var x = spark.table("myschema.my_table")
x.write.mode("overwrite").format("csv").save("/tmp/abc")
So far, so good. But the Hive table can contain data that has \n in it, and when I write the data that character breaks the line, creating an extra, broken record. The character can appear in any column. How can I replace it with a space while writing? I tried the following, but it didn't work:
x.write.mode("overwrite").format("csv").option("multiline", "true").save("/tmp/abc")
Spark provides no such option to replace \n with a space while writing a DataFrame to CSV; check the available options here.
You can use regexp_replace to strip (or replace) the \n characters and then write the DataFrame to CSV.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("ERROR")

val inDF = List(("a\nb", "c\nd", "d\ne", "ef")).toDF("col1", "col2", "col3", "col4")
inDF.show()
/*
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| a
b| c
d| d
e| ef|
+----+----+----+----+ */
val outDF = inDF.columns.foldLeft(inDF)((df, c) => df.withColumn(c, regexp_replace(col(c), "\n", "")))
outDF.show()
/*
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| ab| cd| de| ef|
+----+----+----+----+ */
outDF.write.option("header", true).csv("outputPath")

How to access nested schema column?

I have a Kafka streaming source with JSONs, e.g. {"type":"abc","1":"23.2"}.
My query (shown below) gives the following exception:
org.apache.spark.sql.catalyst.parser.ParseException: extraneous
input '.1' expecting {<EOF>, .......}
== SQL ==
person.1
What is the correct syntax to access "person.1"?
I have even changed DoubleType to StringType, but that didn't work either. The example works fine if I keep only person.type and remove person.1 from selectExpr:
val personJsonDf = inputDf.selectExpr("CAST(value AS STRING)")
val struct = new StructType()
.add("type", DataTypes.StringType)
.add("1", DataTypes.DoubleType)
val personNestedDf = personJsonDf
.select(from_json($"value", struct).as("person"))
val personFlattenedDf = personNestedDf
.selectExpr("person.type", "person.1")
val consoleOutput = personNestedDf.writeStream
.outputMode("update")
.format("console")
.start()
Interesting, since select($"person.1") should work fine (but you used selectExpr which could've confused Spark SQL).
StructField(1,DoubleType,true) won't work, however, since the JSON value is a string and the type should actually be StringType.
Let's see...
$ cat input.json
{"type":"abc","1":"23.2"}
val input = spark.read.text("input.json")
scala> input.show(false)
+-------------------------+
|value |
+-------------------------+
|{"type":"abc","1":"23.2"}|
+-------------------------+
import org.apache.spark.sql.types._
val struct = new StructType()
.add("type", DataTypes.StringType)
.add("1", DataTypes.StringType)
val q = input.select(from_json($"value", struct).as("person"))
scala> q.show
+-----------+
| person|
+-----------+
|[abc, 23.2]|
+-----------+
val q = input.select(from_json($"value", struct).as("person")).select($"person.1")
scala> q.show
+----+
| 1|
+----+
|23.2|
+----+
I solved this problem by using person.*:
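For reference, a minimal sketch of that fix against the personNestedDf defined above; the backtick variant in the comment is an assumption about the SQL parser, not something confirmed in this thread:
// Flatten every field of the person struct at once.
val personFlattenedDf = personNestedDf.selectExpr("person.*")
// Assumed alternative: backtick-quote the numeric field name so the parser
// does not read ".1" as part of a numeric literal.
// val personFlattenedDf = personNestedDf.selectExpr("person.type", "person.`1`")
With the console sink, this produces: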
+-----+--------+
|type | 1 |
+-----+--------+
|abc |23.2 |
+-----+--------+

using pyspark how to reject bad (malformed) records from csv file and save these rejected records in a new file

I am using PySpark to load data from a CSV file into a DataFrame. I was able to load the data while dropping the malformed records, but how can I capture these bad (malformed) records and save them to a new file?
Here is an idea, although I am not very happy about it. The CSV parser has different modes, as you know, to drop malformed data. However, if no mode is specified, it 'fills the blanks' with a default null value. You can use that to your advantage.
Using this data, and assuming that the column article_id is not nullable by design:
1,abcd,correct record1,description1 haha
Bad record,Bad record description
3,hijk,another correct record,description2
Not_An_Integer,article,no integer type,description
Here is the code:
#!/usr/bin/env python
# coding: utf-8
import pyspark
from pyspark.sql.types import *
from pyspark.sql import Row, functions as F
sc = pyspark.SparkContext.getOrCreate()
spark = pyspark.sql.SparkSession(sc)
# Load the data with your schema, drop the malformed information
schema = StructType([ StructField("article_id", IntegerType()),
StructField("title", StringType()),
StructField("short_desc", StringType()),
StructField("article_desc", StringType())])
valid_data = spark.read.format("csv").schema(schema).option("mode","DROPMALFORMED").load("./data.csv")
valid_data.show()
"""
+----------+-----+--------------------+-----------------+
|article_id|title| short_desc| article_desc|
+----------+-----+--------------------+-----------------+
| 1| abcd| correct record1|description1 haha|
| 3| hijk|another correct r...| description2|
+----------+-----+--------------------+-----------------+
"""
# Load the data and let spark infer everything
malformed_data = spark.read.format("csv").option("header", "false").load("./data.csv")
malformed_data.show()
"""
+--------------+--------------------+--------------------+-----------------+
| _c0| _c1| _c2| _c3|
+--------------+--------------------+--------------------+-----------------+
| 1| abcd| correct record1|description1 haha|
| Bad record|Bad record descri...| null| null|
| 3| hijk|another correct r...| description2|
|Not_An_Integer| article| no integer type| description|
+--------------+--------------------+--------------------+-----------------+
"""
# Join and keep all data from the 'malformed' DataFrame.
merged = valid_data.join(malformed_data, on=valid_data.article_id == malformed_data._c0, how="right")
# Keep only the records for which no match with the 'valid' data was possible
malformed = merged.where(F.isnull(merged.article_id))
malformed.show()
"""
+----------+-----+----------+------------+--------------+--------------------+---------------+-----------+
|article_id|title|short_desc|article_desc| _c0| _c1| _c2| _c3|
+----------+-----+----------+------------+--------------+--------------------+---------------+-----------+
| null| null| null| null| Bad record|Bad record descri...| null| null|
| null| null| null| null|Not_An_Integer| article|no integer type|description|
+----------+-----+----------+------------+--------------+--------------------+---------------+-----------+
"""
I am not too fond of this, as it is very sensitive to how Spark parses the CSV and it might not work for all files, but you might find it useful.

Spark SQL error AnalysisException: cannot resolve column_name

This is a common error in Spark SQL. I tried all the other answers, but nothing made a difference!
I want to read the following small CSV file from HDFS (or even local filesystem).
+----+-----------+----------+-------------------+-----------+------+------+------+------+------+
| id| first_name| last_name| ssn| test1| test2| test3| test4| final| grade|
+----+-----------+----------+-------------------+-----------+------+------+------+------+------+
| 4.0| Dandy| Jim| 087-75-4321| 47.0| 1.0| 23.0| 36.0| 45.0| C+|
|13.0| Elephant| Ima| 456-71-9012| 45.0| 1.0| 78.0| 88.0| 77.0| B-|
|14.0| Franklin| Benny| 234-56-2890| 50.0| 1.0| 90.0| 80.0| 90.0| B-|
|15.0| George| Boy| 345-67-3901| 40.0| 1.0| 11.0| -1.0| 4.0| B|
|16.0| Heffalump| Harvey| 632-79-9439| 30.0| 1.0| 20.0| 30.0| 40.0| C|
+----+-----------+----------+-------------------+-----------+------+------+------+------+------+
Here is the code:
List<String> cols = new ArrayList<>();
Collections.addAll(cols, "id, first_name".replaceAll("\\s+", "").split(","));
Dataset<Row> temp = spark.read()
.format("org.apache.spark.csv")
.option("header", true)
.option("inferSchema", true)
.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
.csv(path)
.selectExpr(JavaConverters.asScalaIteratorConverter(cols.iterator()).asScala().toSeq());
but it fails with this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved attribute(s) 'first_name missing from last_name#14, test1#16,id#12, test4#19, ssn#15, test3#18, grade#21, test2#17, final#20, first_name#13 in operator 'Project [id#12, 'first_name];;
'Project [id#12, 'first_name]
+- Relation[id#12, first_name#13, last_name#14, ssn#15, test1#16, test2#17, test3#18, test4#19, final#20, grade#21] csv
In some cases it works without error:
If I don't select anything, it successfully returns all the data.
If I select only the column "id", it works.
I even tried using a view and the SQL method:
df.createOrReplaceTempView("csvFile");
spark.sql("SELECT id, first_name FROM csvFile").show();
but I got the same error!
I did the same thing with the same data read from a database, and there was no error.
I use Spark 2.2.1.
There is no need to convert String[] --> List<String> --> Seq<String>.
Simply pass the array to the selectExpr method, because selectExpr accepts varargs.
Dataset<Row> temp = spark.read()
.format("org.apache.spark.csv")
.option("header", true)
.option("inferSchema", true)
.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
.csv(path)
.selectExpr("id, first_name".replaceAll("\\s+", "").split(","));
It was because of the incorrect structure of the CSV file! I removed the whitespace from the CSV file and now it works!
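If editing the file by hand is not an option, a possible alternative (a sketch in Scala, assuming the only issue is stray whitespace around the header names; df stands for the DataFrame returned by the read shown in the question) is to trim the column names after loading:
// Rename every column to its trimmed form, e.g. " first_name" -> "first_name".
val cleaned = df.toDF(df.columns.map(_.trim): _*)
cleaned.selectExpr("id", "first_name").show()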

What is the efficient way to create schema for a dataframe?

I am new to Spark and I saw that there are two ways to create a DataFrame's schema.
I have an RDD, empRDD, with data (split by ","):
+---+-------+------+-----+
| 1| Mark| 1000| HR|
| 2| Peter| 1200|SALES|
| 3| Henry| 1500| HR|
| 4| Adam| 2000| IT|
| 5| Steve| 2500| IT|
| 6| Brian| 2700| IT|
| 7|Michael| 3000| HR|
| 8| Steve| 10000|SALES|
| 9| Peter| 7000| HR|
| 10| Dan| 6000| BS|
+---+-------+------+-----+
val empFile = sc.textFile("emp")
val empData = empFile.map(e => e.split(","))
The first way to create the schema is using a case class:
case class employee(id:Int, name:String, salary:Int, dept:String)
val empRDD = empData.map(e => employee(e(0).toInt, e(1), e(2).toInt, e(3)))
val empDF = empRDD.toDF()
The second way is using StructType:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val empSchema = StructType(Array(
  StructField("id", IntegerType, true),
  StructField("name", StringType, true),
  StructField("salary", IntegerType, true),
  StructField("dept", StringType, true)))

val empRDD = empData.map(e => Row(e(0).toInt, e(1), e(2).toInt, e(3)))
val empDF = sqlContext.createDataFrame(empRDD, empSchema)
Personally, I prefer coding with StructType, but I don't know which way is recommended in real industry projects. Could anyone let me know the preferred way?
You can use the spark-csv library to read CSV files; this library has lots of options to suit your requirements.
You can read a CSV file as:
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("data.csv")
However, you can also provide the schema manually.
I think the best way is to read the CSV with spark-csv as a Dataset:
val cities = spark.read
.option("header", "true")
.csv(location)
.as[employee]
Read about the advantages of Datasets over RDDs and DataFrames here.
You can also generate the schema from the case class if you already have it:
import org.apache.spark.sql.Encoders
val empSchema = Encoders.product[employee].schema
Hope this helps
In the case when you are creating your RDDs from a CSV file (or any delimited file), you can infer the schema automatically, as @Shankar Koirala mentioned.
In case you are creating your RDDs from a different source:
A. When you have a small number of fields (fewer than 22), you can create the schema using case classes (the 22-field limit applies to Scala 2.10; later Scala versions relax it).
B. When you have more than 22 fields, you need to create the schema programmatically.
Link to Spark Programming Guide
If your input file is a delimited file, you can use Databricks' spark-csv library.
Use it this way:
// For spark < 2.0
DataFrame df = sqlContext.read()
.format("com.databricks.spark.csv")
.option("header", "true")
.option("nullValue", "")
.load("./data.csv");
df.show();
For Spark 2.0+:
Dataset<Row> df = spark.read()
    .format("csv")
    .option("header", "true")
    .option("nullValue", "")
    .load("./data.csv");
df.show();
There are many customizations possible using option in these commands, such as:
.option("inferSchema", "true") to infer data types of each column automatically.
.option("codec", "org.apache.hadoop.io.compress.GzipCodec") to define compression codec
.option("delimiter", ",") to specify delimiter as ','
Databricks' spark-csv library has been ported into Spark 2.0.
Using this library frees you from the difficulties of parsing the various edge cases of delimited files.
Refer: https://github.com/databricks/spark-csv
