What is the efficient way to create a schema for a DataFrame? - apache-spark

I am new to Spark and I saw that there are two ways to create a DataFrame's schema.
I have an RDD, empRDD, with data (split by ","):
+---+-------+------+-----+
| 1| Mark| 1000| HR|
| 2| Peter| 1200|SALES|
| 3| Henry| 1500| HR|
| 4| Adam| 2000| IT|
| 5| Steve| 2500| IT|
| 6| Brian| 2700| IT|
| 7|Michael| 3000| HR|
| 8| Steve| 10000|SALES|
| 9| Peter| 7000| HR|
| 10| Dan| 6000| BS|
+---+-------+------+-----+
val empFile = sc.textFile("emp")
val empData = empFile.map(e => e.split(","))
First way to create schema is using a case class:
case class employee(id:Int, name:String, salary:Int, dept:String)
val empRDD = empData.map(e => employee(e(0).toInt, e(1), e(2).toInt, e(3)))
val empDF = empRDD.toDF()
Second way is using StructType:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val empSchema = StructType(Array(StructField("id", IntegerType, true),
StructField("name", StringType, true),
StructField("salary", IntegerType, true),
StructField("dept", StringType, true)))
val empRDD = empData.map(e => Row(e(0).toInt, e(1), e(2).toInt, e(3)))
val empDF = sqlContext.createDataFrame(empRDD, empSchema)
Personally I prefer to code using StructType, but I don't know which way is recommended in actual industry projects. Could anyone let me know the preferred way?

You can use the spark-csv library to read CSV files; it has lots of options to cover most requirements.
You can read a CSV file as:
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("data.csv")
However, you can also provide the schema manually.
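For example, a minimal sketch of supplying the schema by hand instead of inferring it, reusing the empSchema StructType from the question (the file path here is hypothetical):
import org.apache.spark.sql.types._
val empSchema = StructType(Array(StructField("id", IntegerType, true),
StructField("name", StringType, true),
StructField("salary", IntegerType, true),
StructField("dept", StringType, true)))
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "false") // the emp data has no header line
.schema(empSchema) // skip schema inference entirely
.load("emp.csv") // hypothetical path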
I think the best way is to read a CSV with spark-csv as a Dataset:
val cities = spark.read
.option("header", "true")
.csv(location)
.as[employee]
Read about the advantages of Dataset over RDD and DataFrame here.
You can also generate the schema from the case class if you already have it:
import org.apache.spark.sql.Encoders
val empSchema = Encoders.product[employee].schema
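A small sketch of putting that together with the reader, so the derived schema replaces inference (assuming the same location path used above):
import org.apache.spark.sql.Encoders
val empSchema = Encoders.product[employee].schema
val empDS = spark.read
.option("header", "true")
.schema(empSchema) // avoids a second pass over the data for schema inference
.csv(location)
.as[employee]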
Hope this helps

When you are creating your RDDs from a CSV file (or any delimited file), you can infer the schema automatically, as @Shankar Koirala mentioned.
If you are creating your RDDs from a different source, then:
A. When you have fewer fields (fewer than 22), you can create the schema using case classes.
B. When you have more than 22 fields, you need to create the schema programmatically (the 22-field case class limit applies to Scala 2.10; later Scala versions lift it). A sketch follows the link below.
Link to Spark Programming Guide
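A minimal sketch of the programmatic route, building a StructType from a list of column names; the all-String types and the reuse of empData from the question are assumptions for illustration:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
// hypothetical column list; in practice this could come from a header line or metadata
val columns = Seq("id", "name", "salary", "dept")
val schema = StructType(columns.map(name => StructField(name, StringType, true)))
val rowRDD = empData.map(fields => Row.fromSeq(fields.toSeq))
val df = sqlContext.createDataFrame(rowRDD, schema)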

If your input file is a delimited file, you can use Databricks' spark-csv library.
Use it this way:
// For spark < 2.0
DataFrame df = sqlContext.read()
.format("com.databricks.spark.csv")
.option("header", "true")
.option("nullValue", "")
.load("./data.csv");
df.show();
For Spark 2.0+:
Dataset<Row> df = spark.read()
.format("csv")
.option("header", "true")
.option("nullValue", "")
.load("./data.csv");
df.show();
Lots of customization is possible through options on the reader, such as:
.option("inferSchema", "true") to infer the data types of each column automatically.
.option("codec", "org.apache.hadoop.io.compress.GzipCodec") to define the compression codec.
.option("delimiter", ",") to specify the delimiter as ','.
Databricks' spark-csv library has been ported into Spark 2.0.
Using this library frees you from the difficulties of parsing the various edge cases of delimited files.
Refer: https://github.com/databricks/spark-csv

Related

Deserializing Spark structured stream data from Kafka topic

I am working off Kafka 2.3.0 and Spark 2.3.4. I have already built a Kafka Connector which reads off a CSV file and posts a line from the CSV to the relevant Kafka topic. The line is like so:
"201310,XYZ001,Sup,XYZ,A,0,Presales,6,Callout,0,0,1,N,Prospect".
The CSV contains thousands of such lines. The Connector is able to successfully post them to the topic, and I am also able to get the messages in Spark. I am not sure how I can deserialize a message to my schema. Note that the messages are headerless, so the key part of the Kafka message is null. The value part includes the complete CSV string as above. My code is below.
I looked at this - How to deserialize records from Kafka using Structured Streaming in Java? - but was unable to port it to my CSV case. In addition, I've tried other Spark SQL mechanisms to try to retrieve the individual fields from the 'value' column, but to no avail. If I do manage to get a compiling version (e.g. a map over the indivValues Dataset or dsRawData), I get errors similar to: "org.apache.spark.sql.AnalysisException: cannot resolve 'IC' given input columns: [value];". If I understand correctly, it is because value is a comma-separated string and Spark isn't really going to magically map it for me without me doing 'something'.
//build the spark session
SparkSession sparkSession = SparkSession.builder()
.appName(seCfg.arg0AppName)
.config("spark.cassandra.connection.host",config.arg2CassandraIp)
.getOrCreate();
...
//my target schema is this:
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField("timeOfOrigin", DataTypes.TimestampType, true),
DataTypes.createStructField("cName", DataTypes.StringType, true),
DataTypes.createStructField("cRole", DataTypes.StringType, true),
DataTypes.createStructField("bName", DataTypes.StringType, true),
DataTypes.createStructField("stage", DataTypes.StringType, true),
DataTypes.createStructField("intId", DataTypes.IntegerType, true),
DataTypes.createStructField("intName", DataTypes.StringType, true),
DataTypes.createStructField("intCatId", DataTypes.IntegerType, true),
DataTypes.createStructField("catName", DataTypes.StringType, true),
DataTypes.createStructField("are_vval", DataTypes.IntegerType, true),
DataTypes.createStructField("isee_vval", DataTypes.IntegerType, true),
DataTypes.createStructField("opCode", DataTypes.IntegerType, true),
DataTypes.createStructField("opType", DataTypes.StringType, true),
DataTypes.createStructField("opName", DataTypes.StringType, true)
});
...
Dataset<Row> dsRawData = sparkSession
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", config.arg3Kafkabootstrapurl)
.option("subscribe", config.arg1TopicName)
.option("failOnDataLoss", "false")
.load();
//getting individual terms like '201310', 'XYZ001'.. from "values"
Dataset<String> indivValues = dsRawData
.selectExpr("CAST(value AS STRING)")
.as(Encoders.STRING())
.flatMap((FlatMapFunction<String, String>) x -> Arrays.asList(x.split(",")).iterator(), Encoders.STRING());
// indivValues, when printed to the console, looks like below, which confirms that
// I receive the data correctly and completely
/*
When printed on console, looks like this:
+--------------------+
| value|
+--------------------+
| 201310|
| XYZ001|
| Sup|
| XYZ|
| A|
| 0|
| Presales|
| 6|
| Callout|
| 0|
| 0|
| 1|
| N|
| Prospect|
+--------------------+
*/
StreamingQuery sq = indivValues.writeStream()
.outputMode("append")
.format("console")
.start();
//await termination
sq.awaitTermination();
I require the data to be typed according to the custom schema shown above, since I will be running mathematical calculations over it (each new row combined with some older rows).
Is it better to synthesize headers in the Kafka Connector source task before pushing them onto the topic? Will having headers make this issue resolution simpler?
Thanks!
Given your existing code, the easiest way to parse your input from dsRawData is to convert it to a Dataset<String> and then use the native CSV reader API:
//dsRawData has raw incoming data from Kafka...
Dataset<String> indivValues = dsRawData
.selectExpr("CAST(value AS STRING)")
.as(Encoders.STRING());
Dataset<Row> finalValues = sparkSession.read()
.schema(schema)
.option("delimiter",",")
.csv(indivValues);
With such a construct you can use exactly the same CSV parsing options that are available when directly reading a CSV file from Spark.
I have been able to resolve this now, via Spark SQL. The code for the solution is below.
//dsRawData has raw incoming data from Kafka...
Dataset<String> indivValues = dsRawData
.selectExpr("CAST(value AS STRING)")
.as(Encoders.STRING());
//create new columns, parse out the orig message and fill column with the values
Dataset<Row> dataAsSchema2 = indivValues
.selectExpr("value",
"split(value,',')[0] as time",
"split(value,',')[1] as cname",
"split(value,',')[2] as crole",
"split(value,',')[3] as bname",
"split(value,',')[4] as stage",
"split(value,',')[5] as intid",
"split(value,',')[6] as intname",
"split(value,',')[7] as intcatid",
"split(value,',')[8] as catname",
"split(value,',')[9] as are_vval",
"split(value,',')[10] as isee_vval",
"split(value,',')[11] as opcode",
"split(value,',')[12] as optype",
"split(value,',')[13] as opname")
.drop("value");
//remove any whitespaces as they interfere with data type conversions
dataAsSchema2 = dataAsSchema2
.withColumn("intid", functions.regexp_replace(functions.col("int_id"),
" ", ""))
.withColumn("intcatid", functions.regexp_replace(functions.col("intcatid"),
" ", ""))
.withColumn("are_vval", functions.regexp_replace(functions.col("are_vval"),
" ", ""))
.withColumn("isee_vval", functions.regexp_replace(functions.col("isee_vval"),
" ", ""))
.withColumn("opcode", functions.regexp_replace(functions.col("opcode"),
" ", ""));
//change types to ready for calc
dataAsSchema2 = dataAsSchema2
.withColumn("intcatid",functions.col("intcatid").cast(DataTypes.IntegerType))
.withColumn("intid",functions.col("intid").cast(DataTypes.IntegerType))
.withColumn("are_vval",functions.col("are_vval").cast(DataTypes.IntegerType))
.withColumn("isee_vval",functions.col("isee_vval").cast(DataTypes.IntegerType))
.withColumn("opcode",functions.col("opcode").cast(DataTypes.IntegerType));
//build a POJO dataset
Encoder<Pojoclass2> encoder = Encoders.bean(Pojoclass2.class);
Dataset<Pojoclass2> pjClass = dataAsSchema2.as(encoder); // the public .as(Encoder) API instead of the internal Dataset constructor

How to access nested schema column?

I have a Kafka streaming source with JSONs, e.g. {"type":"abc","1":"23.2"}.
The query gives the following exception:
org.apache.spark.sql.catalyst.parser.ParseException: extraneous
input '.1' expecting {<EOF>, .......}
== SQL ==
person.1
What is the correct syntax to access "person.1"?
I have even changed DoubleType to StringType, but that didn't work either. The example works fine if I just keep person.type and remove person.1 from selectExpr:
val personJsonDf = inputDf.selectExpr("CAST(value AS STRING)")
val struct = new StructType()
.add("type", DataTypes.StringType)
.add("1", DataTypes.DoubleType)
val personNestedDf = personJsonDf
.select(from_json($"value", struct).as("person"))
val personFlattenedDf = personNestedDf
.selectExpr("person.type", "person.1")
val consoleOutput = personNestedDf.writeStream
.outputMode("update")
.format("console")
.start()
Interesting, since select($"person.1") should work fine (but you used selectExpr which could've confused Spark SQL).
StructField(1,DoubleType,true) won't work however since the type should actually be StringType.
Let's see...
$ cat input.json
{"type":"abc","1":"23.2"}
val input = spark.read.text("input.json")
scala> input.show(false)
+-------------------------+
|value |
+-------------------------+
|{"type":"abc","1":"23.2"}|
+-------------------------+
import org.apache.spark.sql.types._
val struct = new StructType()
.add("type", DataTypes.StringType)
.add("1", DataTypes.StringType)
val q = input.select(from_json($"value", struct).as("person"))
scala> q.show
+-----------+
| person|
+-----------+
|[abc, 23.2]|
+-----------+
val q = input.select(from_json($"value", struct).as("person")).select($"person.1")
scala> q.show
+----+
| 1|
+----+
|23.2|
+----+
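For completeness, selectExpr should also be able to reach the digit-named field if the identifier is backquoted; this is a small untested sketch based on Spark SQL's identifier-quoting rules, not something from the original answer:
import org.apache.spark.sql.functions.from_json
val q2 = input
.select(from_json($"value", struct).as("person"))
.selectExpr("person.type", "person.`1`") // backticks keep the parser from reading .1 as a number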
I have solved this problem by using person.* (output and a small sketch below):
+-----+--------+
|type | 1 |
+-----+--------+
|abc |23.2 |
+-----+--------+
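A minimal sketch of that person.* approach, reusing personNestedDf from the question (untested here):
val personFlattenedDf = personNestedDf.selectExpr("person.*") // expands the struct into its fields, type and 1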

Writing Spark SQL query on data without header or schema

I want to write a generic script that can run SQL queries on a file that doesn't have a header or pre-defined schema. For example, a file could look like:
Bob,32
Alice, 24
Jane,65
Doug,33
Peter,19
And the SQL query might be:
SELECT COUNT(DISTINCT ??)
FROM temp_table
WHERE ?? > 32
I am wondering what to put in the ??.
You can define a custom schema while reading, like:
import org.apache.spark.sql.types._
val schema = StructType(
StructField("field1", StringType, true) ::
StructField("field2", IntegerType, true) :: Nil
)
val df = spark.read.format("csv")
.option("sep", ",")
.option("header", "false")
.schema(schema)
.load("examples/src/main/resources/people.csv")
You can also skip the schema part, which ends up with default column names (not preferred):
val df = spark.read.format("csv")
.option("sep", ",")
.option("header", "false")
.load("examples/src/main/resources/people.csv")
+-----+-----+
| _c0| _c1|
+-----+-----+
| Bob| 32 |
| .. | ... |
+-----+-----+
With that, you can fill in the column names in your Spark SQL query, as sketched below.
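For instance, a small sketch of how the ?? placeholders could be filled once the schema above is applied; the field names come from the schema above and the query shape from the question:
df.createOrReplaceTempView("temp_table")
spark.sql("SELECT COUNT(DISTINCT field1) FROM temp_table WHERE field2 > 32").show()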
It seems the default schema has column names _c0, _c1, etc.
val df = spark.read.format("csv").load("test.txt")
scala> df.printSchema
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
In Spark 2.0,
df.createOrReplaceTempView("temp_table")
spark.sql("SELECT COUNT(DISTINCT _c1) FROM temp_table WHERE cast(_c1 as int) > 32")

Spark SQL error AnalysisException: cannot resolve column_name

This is a common error in Spark SQL; I tried all the other answers, but saw no difference!
I want to read the following small CSV file from HDFS (or even the local filesystem).
+----+-----------+----------+-------------------+-----------+------+------+------+------+------+
| id| first_name| last_name| ssn| test1| test2| test3| test4| final| grade|
+----+-----------+----------+-------------------+-----------+------+------+------+------+------+
| 4.0| Dandy| Jim| 087-75-4321| 47.0| 1.0| 23.0| 36.0| 45.0| C+|
|13.0| Elephant| Ima| 456-71-9012| 45.0| 1.0| 78.0| 88.0| 77.0| B-|
|14.0| Franklin| Benny| 234-56-2890| 50.0| 1.0| 90.0| 80.0| 90.0| B-|
|15.0| George| Boy| 345-67-3901| 40.0| 1.0| 11.0| -1.0| 4.0| B|
|16.0| Heffalump| Harvey| 632-79-9439| 30.0| 1.0| 20.0| 30.0| 40.0| C|
+----+-----------+----------+-------------------+-----------+------+------+------+------+------+
Here is the code:
List<String> cols = new ArrayList<>();
Collections.addAll(cols, "id, first_name".replaceAll("\\s+", "").split(","));
Dataset<Row> temp = spark.read()
.format("org.apache.spark.csv")
.option("header", true)
.option("inferSchema", true)
.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
.csv(path)
.selectExpr(JavaConverters.asScalaIteratorConverter(cols.iterator()).asScala().toSeq());
but it errors:
Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved attribute(s) 'first_name missing from last_name#14, test1#16,id#12, test4#19, ssn#15, test3#18, grade#21, test2#17, final#20, first_name#13 in operator 'Project [id#12, 'first_name];;
'Project [id#12, 'first_name]
+- Relation[id#12, first_name#13, last_name#14, ssn#15, test1#16, test2#17, test3#18, test4#19, final#20, grade#21] csv
In some cases it works without error:
If I don't select anything, it successfully gets all the data.
If I just select the column "id".
I even tried using a view and the SQL method:
df.createOrReplaceTempView("csvFile");
spark.sql("SELECT id, first_name FROM csvFile").show();
but I got the same error!
I did the same with the same data that I read from the database, and it was without any error.
I use Spark 2.2.1.
There is no need to convert String[] --> List<String> --> Seq<String>.
Simply pass the array to the selectExpr method, because selectExpr supports varargs:
Dataset<Row> temp = spark.read()
.format("org.apache.spark.csv")
.option("header", true)
.option("inferSchema", true)
.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
.csv(path)
.selectExpr("id, first_name".replaceAll("\\s+", "").split(","));
It was because of the incorrect structure of the CSV file! I removed the whitespace from the CSV file and now it works!
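If editing the file is not an option, the stray whitespace around the header names could likely also be stripped after reading; this is a small sketch in Scala (the question's code is Java), and the renaming loop is an assumption rather than part of the original fix:
// df: the DataFrame read from the CSV (hypothetical name)
val cleaned = df.columns.foldLeft(df)((acc, name) => acc.withColumnRenamed(name, name.trim))
cleaned.select("id", "first_name").show()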

Spark csv to dataframe skip first row

I am loading a CSV into a DataFrame using:
sqlContext.read.format("com.databricks.spark.csv").option("header", "true").
option("delimiter", ",").load("file.csv")
but my input file contains date in the first row and header from second row.
example
20160612
id,name,age
1,abc,12
2,bcd,33
How can I skip this first row while converting the CSV to a DataFrame?
Here are several options that I can think of, since the Databricks module doesn't seem to provide a skip-lines option:
Option one: Add a "#" character in front of the first line, and the line will automatically be treated as a comment and ignored by the Databricks csv module (sketched just below);
Option two: Create your own custom schema and set the mode option to DROPMALFORMED, which will drop the first line since it contains fewer tokens than expected by the customSchema:
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
val customSchema = StructType(Array(StructField("id", IntegerType, true),
StructField("name", StringType, true),
StructField("age", IntegerType, true)))
val df = sqlContext.read.format("com.databricks.spark.csv").
option("header", "true").
option("mode", "DROPMALFORMED").
schema(customSchema).load("test.txt")
df.show
16/06/12 21:24:05 WARN CsvRelation$: Number format exception. Dropping
malformed line: id,name,age
+---+----+---+
| id|name|age|
+---+----+---+
| 1| abc| 12|
| 2| bcd| 33|
+---+----+---+
Note the warning message above, which says it dropped the malformed line.
Option three: Write your own parser to drop any line that doesn't have a length of three:
val file = sc.textFile("pathToYourCsvFile")
val df = file.map(line => line.split(",")).
filter(lines => lines.length == 3 && lines(0)!= "id").
map(row => (row(0), row(1), row(2))).
toDF("id", "name", "age")
df.show
+---+----+---+
| id|name|age|
+---+----+---+
| 1| abc| 12|
| 2| bcd| 33|
+---+----+---+
