Deserializing Spark structured stream data from Kafka topic - apache-spark

I am working off Kafka 2.3.0 and Spark 2.3.4. I have already built a Kafka Connector which reads off a CSV file and posts a line from the CSV to the relevant Kafka topic. The line is like so:
"201310,XYZ001,Sup,XYZ,A,0,Presales,6,Callout,0,0,1,N,Prospect".
The CSV contains thousands of such lines. The Connector is able to successfully post them to the topic and I am also able to get the messages in Spark. I am not sure how I can deserialize that message to my schema. Note that the messages are headerless, so the key part of the Kafka message is null. The value part includes the complete CSV string as above. My code is below.
I looked at this: How to deserialize records from Kafka using Structured Streaming in Java? but was unable to port it to my CSV case. In addition, I've tried other Spark SQL mechanisms to try and retrieve the individual row from the 'value' column, but to no avail. If I do manage to get a compiling version (e.g. a map over the indivValues Dataset or dsRawData), I get errors similar to: "org.apache.spark.sql.AnalysisException: cannot resolve 'IC' given input columns: [value];". If I understand correctly, it is because value is a comma-separated string and Spark isn't really going to magically map it to my schema without me doing 'something'.
//build the spark session
SparkSession sparkSession = SparkSession.builder()
.appName(seCfg.arg0AppName)
.config("spark.cassandra.connection.host",config.arg2CassandraIp)
.getOrCreate();
...
//my target schema is this:
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField("timeOfOrigin", DataTypes.TimestampType, true),
DataTypes.createStructField("cName", DataTypes.StringType, true),
DataTypes.createStructField("cRole", DataTypes.StringType, true),
DataTypes.createStructField("bName", DataTypes.StringType, true),
DataTypes.createStructField("stage", DataTypes.StringType, true),
DataTypes.createStructField("intId", DataTypes.IntegerType, true),
DataTypes.createStructField("intName", DataTypes.StringType, true),
DataTypes.createStructField("intCatId", DataTypes.IntegerType, true),
DataTypes.createStructField("catName", DataTypes.StringType, true),
DataTypes.createStructField("are_vval", DataTypes.IntegerType, true),
DataTypes.createStructField("isee_vval", DataTypes.IntegerType, true),
DataTypes.createStructField("opCode", DataTypes.IntegerType, true),
DataTypes.createStructField("opType", DataTypes.StringType, true),
DataTypes.createStructField("opName", DataTypes.StringType, true)
});
...
Dataset<Row> dsRawData = sparkSession
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", config.arg3Kafkabootstrapurl)
.option("subscribe", config.arg1TopicName)
.option("failOnDataLoss", "false")
.load();
//getting individual terms like '201310', 'XYZ001', etc. from the "value" column
Dataset<String> indivValues = dsRawData
.selectExpr("CAST(value AS STRING)")
.as(Encoders.STRING())
.flatMap((FlatMapFunction<String, String>) x -> Arrays.asList(x.split(",")).iterator(), Encoders.STRING());
//indivValues, when printed to the console, looks like below, which confirms that I receive the data correctly and completely
/*
When printed on console, looks like this:
+--------------------+
| value|
+--------------------+
| 201310|
| XYZ001|
| Sup|
| XYZ|
| A|
| 0|
| Presales|
| 6|
| Callout|
| 0|
| 0|
| 1|
| N|
| Prospect|
+--------------------+
*/
StreamingQuery sq = indivValues.writeStream()
.outputMode("append")
.format("console")
.start();
//await termination
sq.awaitTermination();
I require the data to be typed according to my custom schema shown above, since I will be running mathematical calculations over it (for every new row combined with some older rows).
Is it better to synthesize headers in the Kafka Connector source task before pushing them onto the topic? Would having headers make resolving this issue simpler?
Thanks!

Given your existing code, the easiest way to parse your input from dsRawData is to convert it to a Dataset<String> and then use the native CSV reader API:
//dsRawData has raw incoming data from Kafka...
Dataset<String> indivValues = dsRawData
.selectExpr("CAST(value AS STRING)")
.as(Encoders.STRING());
Dataset<Row> finalValues = sparkSession.read()
.schema(schema)
.option("delimiter",",")
.csv(indivValues);
With such a construct you can use exactly the same CSV parsing options that are available when directly reading a CSV file from Spark.
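For example, a hedged sketch of adding a couple of those options to the same construct (shown in Scala here; the option names are standard CSV reader options, and the timestamp format is only an assumption about how the sample "201310" field is encoded):
val finalValues = sparkSession.read
  .schema(schema)
  .option("mode", "DROPMALFORMED")     // drop lines that do not fit the schema instead of failing the query
  .option("timestampFormat", "yyyyMM") // assumption: only if timeOfOrigin really arrives as "201310"
  .csv(indivValues)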

I have been able to resolve this now by using Spark SQL. The code for the solution is below.
//dsRawData has raw incoming data from Kafka...
Dataset<String> indivValues = dsRawData
.selectExpr("CAST(value AS STRING)")
.as(Encoders.STRING());
//create new columns, parse out the original message and fill the columns with the values
Dataset<Row> dataAsSchema2 = indivValues
.selectExpr("value",
"split(value,',')[0] as time",
"split(value,',')[1] as cname",
"split(value,',')[2] as crole",
"split(value,',')[3] as bname",
"split(value,',')[4] as stage",
"split(value,',')[5] as intid",
"split(value,',')[6] as intname",
"split(value,',')[7] as intcatid",
"split(value,',')[8] as catname",
"split(value,',')[9] as are_vval",
"split(value,',')[10] as isee_vval",
"split(value,',')[11] as opcode",
"split(value,',')[12] as optype",
"split(value,',')[13] as opname")
.drop("value");
//remove any whitespace as it interferes with data type conversions
dataAsSchema2 = dataAsSchema2
.withColumn("intid", functions.regexp_replace(functions.col("int_id"),
" ", ""))
.withColumn("intcatid", functions.regexp_replace(functions.col("intcatid"),
" ", ""))
.withColumn("are_vval", functions.regexp_replace(functions.col("are_vval"),
" ", ""))
.withColumn("isee_vval", functions.regexp_replace(functions.col("isee_vval"),
" ", ""))
.withColumn("opcode", functions.regexp_replace(functions.col("opcode"),
" ", ""));
//change types to ready for calc
dataAsSchema2 = dataAsSchema2
.withColumn("intcatid",functions.col("intcatid").cast(DataTypes.IntegerType))
.withColumn("intid",functions.col("intid").cast(DataTypes.IntegerType))
.withColumn("are_vval",functions.col("are_vval").cast(DataTypes.IntegerType))
.withColumn("isee_vval",functions.col("isee_vval").cast(DataTypes.IntegerType))
.withColumn("opcode",functions.col("opcode").cast(DataTypes.IntegerType));
//build a POJO dataset
Encoder<Pojoclass2> encoder = Encoders.bean(Pojoclass2.class);
Dataset<Pojoclass2> pjClass = new Dataset<Pojoclass2>(sparkSession, dataAsSchema2.logicalPlan(), encoder);
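As a hedged aside, not part of the original solution: the same typed view can usually be obtained through the public as(Encoder) API instead of invoking the Dataset constructor directly. Sketched here in Scala; the equivalent dataAsSchema2.as(encoder) call also exists on the Java Dataset.
import org.apache.spark.sql.{Dataset, Encoders}

// Sketch: typed Dataset via the public API, assuming the column names of dataAsSchema2
// line up with the bean properties of Pojoclass2.
val pjTyped: Dataset[Pojoclass2] = dataAsSchema2.as(Encoders.bean(classOf[Pojoclass2]))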

Related

How to add comments to Glue on an AWS EMR using PySpark

I'm having a problem where I cannot find a way to save comments on the Glue metadata with PySpark.
Currently I create new tables using :
df.write \
.saveAsTable(
'db_temp.tb_temp',
format='parquet',
path='s3://datalake-123/table/df/',
mode='overwrite'
)
So, if possible, I would like to add the comments in Glue using code, just like the picture below shows:
You need to modify the existing schema of the dataframe by adding the required comments. After modifying the schema, create a new dataframe using the modified schema and write the dataframe as a table.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

df = spark.createDataFrame([(1, 'abc'), (2, 'def')], ["id", "name"])
schema = StructType([StructField("id", IntegerType(), False, {"comment": "This is ID"}),
                     StructField("name", StringType(), True, {"comment": "This is name"})])
df_with_comment = spark.createDataFrame(df.rdd, schema)
df_with_comment.write.format('parquet').saveAsTable('mytable')
spark.sql('describe mytable').show()
+--------+---------+------------+
|col_name|data_type| comment|
+--------+---------+------------+
| id| int| This is ID|
| name| string|This is name|
+--------+---------+------------+

How to refresh loaded dataframe contents in spark streaming?

Using spark-sql 2.4.1 and Kafka for real-time streaming.
I have the following use case:
I need to load metadata from HDFS to join with a streaming dataframe from Kafka.
Particular columns of each streaming record should be looked up against particular columns (col-X) of the metadata dataframe.
If a match is found, pick the metadata column (col-Y) value.
If no match is found, insert the streaming record/column data into the metadata dataframe, i.e. into HDFS, so that it can be matched if the
streaming dataframe contains the same data again.
Since the metadata is loaded at the beginning of the Spark job, how do I refresh its contents in the streaming job so that it can be looked up and joined with the streaming dataframe?
I may have misunderstood the question, but refreshing the metadata dataframe should be a feature supported out of the box.
You simply don't have to do anything.
Let's have a look at the example:
// a batch dataframe
val metadata = spark.read.text("metadata.txt")
scala> metadata.show
+-----+
|value|
+-----+
|hello|
+-----+
// a streaming dataframe
val stream = spark.readStream.text("so")
// join on the only value column
stream.join(metadata, "value").writeStream.format("console").start
As long as the content of the files in the so directory matches the metadata.txt file, you should get a dataframe printed out to the console.
-------------------------------------------
Batch: 1
-------------------------------------------
+-----+
|value|
+-----+
|hello|
+-----+
Change metadata.txt to, say, world and only worlds from new files get matched.
EDIT: This solution is more elaborate and would work for all use cases.
For simpler cases, where the data is appended to existing files without changing the files, or is read from a database, the simpler solution can be used as pointed out in the other answer.
This is because the dataframe (and underlying RDD) partitions are created once and the data is read every time the dataframe is used (unless it is cached by Spark).
If you can afford it, you can try to (re)read this metadata dataframe in every micro-batch.
A better approach is to put the metadata dataframe in a cache (not to be confused with Spark caching the dataframe). A cache is similar to a map, except that it will not return entries that were inserted more than the configured time-to-live duration ago.
In your code, you'll try to fetch this metadata dataframe from the cache once for every micro-batch. If the cache returns null, you'll read the dataframe again, put it into the cache and then use it (a minimal foreachBatch sketch of this appears after the code below).
The Cache class would be
import scala.collection.mutable
// cache class to store the dataframe
class Cache[K, V](timeToLive: Long) extends mutable.Map[K, V] {
private var keyValueStore = mutable.HashMap[K, (V, Long)]()
override def get(key: K):Option[V] = {
keyValueStore.get(key) match {
case Some((value, insertedAt)) if insertedAt+timeToLive > System.currentTimeMillis => Some(value)
case _ => None
}
}
override def iterator: Iterator[(K, V)] = keyValueStore.iterator
.filter({
case (key, (value, insertedAt)) => insertedAt+timeToLive > System.currentTimeMillis
}).map(x => (x._1, x._2._1))
override def -=(key: K): this.type = {
keyValueStore-=key
this
}
override def +=(kv: (K, V)): this.type = {
keyValueStore += ((kv._1, (kv._2, System.currentTimeMillis())))
this
}
}
The logic to access the metadata dataframe through the cache:
import org.apache.spark.sql.DataFrame
object DataFrameCache {
lazy val cache = new Cache[String, DataFrame](600000) // ten minutes timeToLive
def readMetaData: DataFrame = ???
def getMetaData: DataFrame = {
cache.get("metadataDF") match {
case Some(df) => df
case None => {
val metadataDF = readMetaData
cache.put("metadataDF", metadataDF)
metadataDF
}
}
}
}
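A minimal sketch of wiring this into the streaming side, assuming Spark 2.4+ for foreachBatch; streamingDF, the join key "value" and the show() sink are placeholders:
import org.apache.spark.sql.DataFrame

// Placeholder streaming query: look up the (possibly refreshed) metadata once per micro-batch.
val joinWithMetadata: (DataFrame, Long) => Unit = (batchDF, batchId) => {
  val metadata = DataFrameCache.getMetaData // re-read at most once per time-to-live window
  batchDF.join(metadata, Seq("value"), "left_outer").show()
}
streamingDF.writeStream
  .foreachBatch(joinWithMetadata)
  .start()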
Below is the scenario I followed in Spark 2.4.5 for a left outer join with a stream join. The process below pushes Spark to read the latest dimension data changes.
The process is for a stream join with a batch dimension (always updated).
Step 1:
Before starting the Spark streaming job, make sure the dimension batch data folder has only one file, and the file should have at least one record (for some reason placing an empty file does not work).
Step 2:
Start your streaming job and add a record to the Kafka stream.
Step 3:
Overwrite the dimension data with new values (keep the same file name; the dimension folder should still have only one file).
Note: don't use Spark to write to this folder; use Java or Scala filesystem I/O to overwrite the file, or delete the file with bash and replace it with a new data file of the same name.
Step 4:
In the next batch, Spark is able to read the updated dimension data while joining with the Kafka stream.
Sample code:
package com.broccoli.streaming.streamjoinupdate
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}
import org.apache.spark.sql.{DataFrame, SparkSession}
object BroadCastStreamJoin3 {
def main(args: Array[String]): Unit = {
@transient lazy val logger: Logger = Logger.getLogger(getClass.getName)
Logger.getLogger("akka").setLevel(Level.WARN)
Logger.getLogger("org").setLevel(Level.ERROR)
Logger.getLogger("com.amazonaws").setLevel(Level.ERROR)
Logger.getLogger("com.amazon.ws").setLevel(Level.ERROR)
Logger.getLogger("io.netty").setLevel(Level.ERROR)
val spark = SparkSession
.builder()
.master("local")
.getOrCreate()
val schemaUntyped1 = StructType(
Array(
StructField("id", StringType),
StructField("customrid", StringType),
StructField("customername", StringType),
StructField("countrycode", StringType),
StructField("timestamp_column_fin_1", TimestampType)
))
val schemaUntyped2 = StructType(
Array(
StructField("id", StringType),
StructField("countrycode", StringType),
StructField("countryname", StringType),
StructField("timestamp_column_fin_2", TimestampType)
))
val factDf1 = spark.readStream
.schema(schemaUntyped1)
.option("header", "true")
.csv("src/main/resources/broadcasttest/fact")
val dimDf3 = spark.read
.schema(schemaUntyped2)
.option("header", "true")
.csv("src/main/resources/broadcasttest/dimension")
.withColumnRenamed("id", "id_2")
.withColumnRenamed("countrycode", "countrycode_2")
import spark.implicits._
factDf1
.join(
dimDf3,
$"countrycode_2" <=> $"countrycode",
"inner"
)
.writeStream
.format("console")
.outputMode("append")
.start()
.awaitTermination
}
}
Thanks
Sri

How does Spark 2.0 handle column nullability?

In the recently released The Data Engineer's Guide to Apache Spark, the authors stated (page 74):
"...when you define a schema where all columns are declared to not
have null values - Spark will not enforce that and will happily let
null values into that column. The nullable signal is simply to help
Spark SQL optimize for handling that column. If you have null values
in columns that should not have null values, you can get an incorrect
result or see strange exceptions that can be hard to debug."
While going over notes and previous JIRAs, it appears that the statement above may no longer be true.
According to SPARK-13740 and SPARK-15192, it looks like nullability is enforced when a schema is defined at DataFrame creation.
Could I get some clarification? I'm no longer certain what the behavior is.
Different DataFrame creation processes handle null types differently. It's not really straightforward, because there are at least three different areas in which nulls are handled completely differently.
First, SPARK-15192 is about RowEncoders. In the case of RowEncoders, no nulls are allowed, and the error messages have been improved. For example, among the two dozen or so overloads of SparkSession.createDataFrame(), there are quite a few implementations of createDataFrame() that are basically converting an RDD to a DataFrame.
In my example below, no nulls were accepted. So try something similar to converting an RDD to a DataFrame using the createDataFrame() method, as below, and you will get the same results...
val nschema = StructType(Seq(StructField("colA", IntegerType, nullable = false), StructField("colB", IntegerType, nullable = true), StructField("colC", IntegerType, nullable = false), StructField("colD", IntegerType, nullable = true)))
val intNullsRDD = sc.parallelize(List(org.apache.spark.sql.Row(null,null,null,null),org.apache.spark.sql.Row(2,null,null,null),org.apache.spark.sql.Row(null,3,null,null),org.apache.spark.sql.Row(null,null,null,4)))
spark.createDataFrame(intNullsRDD, nschema).show()
In Spark 2.1.1, the error message is pretty nice.
17/11/23 21:30:37 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 6)
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: The 0th field 'colA' of input row cannot be null.
validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, colA), IntegerType) AS colA#73
+- validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, colA), IntegerType)
+- getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, colA)
+- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
+- input[0, org.apache.spark.sql.Row, true]
Stepping through the code, you can see where this happens. Way below in the doGenCode() method there is the validation. And immediately below, when the RowEncoder object is being created with val encoder = RowEncoder(schema), that logic begins.
@DeveloperApi
@InterfaceStability.Evolving
def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame = {
createDataFrame(rowRDD, schema, needsConversion = true)
}
private[sql] def createDataFrame(
rowRDD: RDD[Row],
schema: StructType,
needsConversion: Boolean) = {
// TODO: use MutableProjection when rowRDD is another DataFrame and the applied
// schema differs from the existing schema on any field data type.
val catalystRows = if (needsConversion) {
val encoder = RowEncoder(schema)
rowRDD.map(encoder.toRow)
} else {
rowRDD.map{r: Row => InternalRow.fromSeq(r.toSeq)}
}
val logicalPlan = LogicalRDD(schema.toAttributes, catalystRows)(self)
Dataset.ofRows(self, logicalPlan)
}
After stepping through this logic more, here is that improved message in objects.scala and this is where the code handles null values. Actually the error message is passed into ctx.addReferenceObj(errMsg) but you get the idea.
case class GetExternalRowField(
child: Expression,
index: Int,
fieldName: String) extends UnaryExpression with NonSQLExpression {
override def nullable: Boolean = false
override def dataType: DataType = ObjectType(classOf[Object])
override def eval(input: InternalRow): Any =
throw new UnsupportedOperationException("Only code-generated evaluation is supported")
private val errMsg = s"The ${index}th field '$fieldName' of input row cannot be null."
override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
// Use unnamed reference that doesn't create a local field here to reduce the number of fields
// because errMsgField is used only when the field is null.
val errMsgField = ctx.addReferenceObj(errMsg)
val row = child.genCode(ctx)
val code = s"""
${row.code}
if (${row.isNull}) {
throw new RuntimeException("The input external row cannot be null.");
}
if (${row.value}.isNullAt($index)) {
throw new RuntimeException($errMsgField);
}
final Object ${ev.value} = ${row.value}.get($index);
"""
ev.copy(code = code, isNull = "false")
}
}
Something completely different happens when pulling from an HDFS data source. In this case there will be no error message when a column is non-nullable and a null comes in. The column still accepts null values. Check out the quick test file "testFile.csv" I created and then put into HDFS with hdfs dfs -put testFile.csv /data/nullTest
|colA|colB|colC|colD|
| | | | |
| | 2| 2| 2|
| | 3| | |
| 4| | | |
When I read from the file below with the same nschema schema, all of the blank values became null, even when the field was non-nullable. There are ways to handle blanks differently (a sketch of one such option appears after the output below), but this is the default. Both CSV and Parquet had the same results.
val nschema = StructType(Seq(StructField("colA", IntegerType, nullable = true), StructField("colB", IntegerType, nullable = true), StructField("colC", IntegerType, nullable = true), StructField("colD", IntegerType, nullable = true)))
val jListNullsADF = spark.createDataFrame(List(org.apache.spark.sql.Row(null,null,null,null),org.apache.spark.sql.Row(2,null,null,null),org.apache.spark.sql.Row(null,3,null,null),org.apache.spark.sql.Row(null,null,null,4)).asJava,nschema)
jListNullsADF.write.format("parquet").save("/data/parquetnulltest")
spark.read.format("parquet").schema(schema).load("/data/parquetnulltest").show()
+----+----+----+----+
|colA|colB|colC|colD|
+----+----+----+----+
|null|null|null|null|
|null| 2| 2| 2|
|null|null| 3|null|
|null| 4|null| 4|
+----+----+----+----+
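As an aside on handling blanks differently, a hedged sketch of one standard CSV reader option that changes which token is read back as null (reusing the test file location from above):
// Sketch: with nullValue set, the string "NA" (instead of the default empty string) becomes null.
val dfNA = spark.read
  .schema(nschema)
  .option("header", "true")
  .option("nullValue", "NA")
  .csv("/data/nullTest/testFile.csv")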
The cause of the nulls being allowed starts with the DataFrameReader creation, where a call is made to baseRelationToDataFrame() in DataFrameReader.scala. baseRelationToDataFrame() in SparkSession.scala uses a QueryPlan class in the method, and the QueryPlan recreates the StructType. The method fromAttributes(), which always produces nullable fields, yields basically the same schema as the original one but forces nullability. Thus, by the time it gets back to RowEncoder(), it is now a nullable version of the original schema.
Immediately below in DataFrameReader.scala you can see the baseRelationToDataFrame() call...
@scala.annotation.varargs
def load(paths: String*): DataFrame = {
sparkSession.baseRelationToDataFrame(
DataSource.apply(
sparkSession,
paths = paths,
userSpecifiedSchema = userSpecifiedSchema,
className = source,
options = extraOptions.toMap).resolveRelation())
}
Immediately below, in the file SparkSession.scala, you can see that the Dataset.ofRows(self: SparkSession, lr: LogicalRelation) method is being called; pay close attention to the LogicalRelation plan constructor.
def baseRelationToDataFrame(baseRelation: BaseRelation): DataFrame = {
Dataset.ofRows(self, LogicalRelation(baseRelation))
}
In Dataset.scala, the analyzed QueryPlan object's schema property is being passed as the third argument to create the Dataset in new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema)).
def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame = {
val qe = sparkSession.sessionState.executePlan(logicalPlan)
qe.assertAnalyzed()
new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema))
}
}
In QueryPlan.scala the StructType.fromAttributes() method is being used
lazy val schema: StructType = StructType.fromAttributes(output)
And finally, in StructType.scala, the nullable property of each field comes from the attribute, which at this point is always nullable.
private[sql] def fromAttributes(attributes: Seq[Attribute]): StructType =
StructType(attributes.map(a => StructField(a.name, a.dataType, a.nullable, a.metadata)))
About the query plan being different based on nullability: I think it is totally possible that the LogicalPlan was different based on whether a column was nullable or not. A lot of information is passed into that object and there is a lot of subsequent logic to create the plan. But the nullability is not being kept when the dataframe is actually written, as we saw a second ago.
The third case depends on the DataType. When you create a DataFrame using the method createDataFrame(rows: java.util.List[Row], schema: StructType), it will actually substitute zeros where a null is passed into a non-nullable IntegerType field. You can see the example below...
val schema = StructType(Seq(StructField("colA", IntegerType, nullable = false), StructField("colB", IntegerType, nullable = true), StructField("colC", IntegerType, nullable = false), StructField("colD", IntegerType, nullable = true)))
val jListNullsDF = spark.createDataFrame(List(org.apache.spark.sql.Row(null,null,null,null),org.apache.spark.sql.Row(2,null,null,null),org.apache.spark.sql.Row(null,3,null,null),org.apache.spark.sql.Row(null,null,null,4)).asJava,schema)
jListNullsDF.show()
+----+----+----+----+
|colA|colB|colC|colD|
+----+----+----+----+
| 0|null| 0|null|
| 2|null| 0|null|
| 0| 3| 0|null|
| 0|null| 0| 4|
+----+----+----+----+
It looks like there is logic in org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getInt() that substitutes zeros for nulls. However, with non-nullable StringType fields, nulls are not handled as gracefully.
val strschema = StructType(Seq(StructField("colA", StringType, nullable = false), StructField("colB", StringType, nullable = true), StructField("colC", StringType, nullable = false), StructField("colD", StringType, nullable = true)))
val strNullsRDD = sc.parallelize(List(org.apache.spark.sql.Row(null,null,null,null),org.apache.spark.sql.Row("r2colA",null,null,null),org.apache.spark.sql.Row(null,"r3colC",null,null),org.apache.spark.sql.Row(null,null,null,"r4colD")))
spark.createDataFrame(List(org.apache.spark.sql.Row(null,null,null,null),org.apache.spark.sql.Row("r2cA",null,null,null),org.apache.spark.sql.Row(null,"row3cB",null,null),org.apache.spark.sql.Row(null,null,null,"row4ColD")).asJava,strschema).show()
but below is the not very helpful error message that doesn't specify the ordinal position of the field...
java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210)
Long story short, we don't know. It is true that Spark has become much stricter about enforcing nullable attributes.
However, considering the complexity of Spark (the number of guest languages, the size of the library, the number of low-level mechanisms used for optimizations, pluggable data sources, and a relatively large pool of legacy code), there is really no guarantee that the fairly limited safety checks included in recent versions cover all possible scenarios.

What is the efficient way to create schema for a dataframe?

I am new to Spark and I saw that there are two ways to create a dataframe's schema.
I have an RDD, empRDD, with data (split by ","):
+---+-------+------+-----+
| 1| Mark| 1000| HR|
| 2| Peter| 1200|SALES|
| 3| Henry| 1500| HR|
| 4| Adam| 2000| IT|
| 5| Steve| 2500| IT|
| 6| Brian| 2700| IT|
| 7|Michael| 3000| HR|
| 8| Steve| 10000|SALES|
| 9| Peter| 7000| HR|
| 10| Dan| 6000| BS|
+---+-------+------+-----+
val empFile = sc.textFile("emp")
val empData = empFile.map(e => e.split(","))
First way to create schema is using a case class:
case class employee(id:Int, name:String, salary:Int, dept:String)
val empRDD = empData.map(e => employee(e(0).toInt, e(1), e(2).toInt, e(3)))
val empDF = empRDD.toDF()
Second way is using StructType:
val empSchema = StructType(Array(StructField("id", IntegerType, true),
StructField("name", StringType, true),
StructField("salary", IntegerType, true),
StructField("dept", StringType, true)))
val empRDD = empData.map(e => Row(e(0).toInt, e(1), e(2).toInt, e(3)))
val empDF = sqlContext.createDataFrame(empRDD, empSchema)
Personally, I prefer to code using StructType, but I don't know which way is recommended in actual industry projects. Could anyone let me know the preferred way?
You can use the spark-csv library to read CSV files. This library has lots of options to suit our requirements.
You can read a CSV file as:
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("data.csv")
However, you can also provide the schema manually.
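For example, a hedged sketch reusing a schema like the empSchema from the question and the same data.csv path as above:
val dfWithSchema = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(empSchema) // skip inference and enforce the expected column types
  .load("data.csv")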
I think the best way is to read a CSV with spark-csv as a dataset, as follows:
val cities = spark.read
.option("header", "true")
.csv(location)
.as[employee]
Read about the advantages of Dataset over RDD and DataFrame here.
You can also generate the schema from a case class if you have it already.
import org.apache.spark.sql.Encoders
val empSchema = Encoders.product[Employee].schema
Hope this helps
In the case when you are creating your RDDs from a CSV file (or any delimited file), you can infer the schema automatically, as @Shankar Koirala mentioned.
In case you are creating your RDDs from a different source, then:
A. When you have a small number of fields (fewer than 22), you can create the schema using case classes.
B. When you have more than 22 fields, you need to create the schema programmatically (see the sketch after the link below).
Link to Spark Programming Guide
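For case B, a short sketch of what creating the schema programmatically can look like (the field list here is hypothetical):
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Build the schema from a plain list of column names instead of a case class.
val fieldNames = Seq("id", "name", "salary", "dept") // imagine 22+ of these
val bigSchema = StructType(fieldNames.map(name => StructField(name, StringType, nullable = true)))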
If your input file is a delimited file, you can use Databricks' spark-csv library.
Use it this way:
// For spark < 2.0
DataFrame df = sqlContext.read()
.format("com.databricks.spark.csv")
.option("header", "true")
.option("nullValue", "")
.load("./data.csv");
df.show();
For Spark 2.0:
DataFrame df = sqlContext.read()
.format("csv")
.option("header", "true")
.option("nullValue", "")
.load("./data.csv");
df.show();
There are lots of customizations possible using option in the command, such as:
.option("inferSchema", "true") to infer data types of each column automatically.
.option("codec", "org.apache.hadoop.io.compress.GzipCodec") to define compression codec
.option("delimiter", ",") to specify delimiter as ','
Databricks' spark-csv library is ported into Spark 2.0.
Using this library will free you from the difficulties of parsing the various kinds of delimited files.
Refer: https://github.com/databricks/spark-csv

Reading a spark data frame as an array type from Postgres DB

I have a local PostgreSQL database on my computer. Some columns contain their data as an array (example below).
+--------------------+
| _authors|
+--------------------+
|[u'Miller, Roger ...|
|[u'Noyes, H.Pierre']|
|[u'Berman, S.M.',...|
+--------------------+
only showing top 3 rows
root
|-- _authors: string (nullable = true)
I need to read them as an Array / WrappedArray. How do I achieve that?
val sqlContext: SQLContext = new SQLContext(sc)
val df_records = sqlContext.read.format("jdbc").option("url", "jdbc:postgresql://localhost:5432/dbname")
.option("driver", "org.postgresql.Driver")
.option("dbtable", "public.records")
.option("user", "name")
.option("password", "pwd").load().select("_authors")
df_records.printSchema()
I need to explode / flatten this array in later stages of my pipeline.
Thanks,
I have two suggestions for your problem:
1) I'm not sure it works for arrays, but it's worth a try: It's possible to define a specific schema when reading a dataframe from a source. Example:
val customSchema = StructType(Seq(
StructField("_authors", DataTypes.createArrayType(StringType), true),
StructField("int_column", IntegerType, true),
// other columns...
))
val df_records = sqlContext.read
.format("jdbc")
.option("url", "jdbc:postgresql://localhost:5432/dbname")
.option("driver", "org.postgresql.Driver")
.option("dbtable", "public.records")
.option("user", "name")
.option("password", "pwd")
.schema(customSchema)
.load()
df_records.select("_authors").show()
2) If the other option doesn't work, at the moment I can only think of defining a parsing UDF:
val splitString: (String => Seq[String]) = { s =>
val seq = s.split(",").map(i => i.trim).toSeq
// Remove "u[" from the first element and "]" from the last:
Seq(seq(0).drop(2)) ++
seq.drop(1).take(seq.length-2) ++
Seq(seq.last.take(seq.last.length-1))
}
import org.apache.spark.sql.functions._
val newDF = df_records
.withColumn("authors_array", udf(splitString).apply(col("_authors")))
For more details about StructType: org.apache.spark.sql.types.StructType
For more examples of defining UDFs: this tutorial
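Since the question mentions exploding the array in later pipeline stages, a small hedged sketch of that step, assuming either approach above has produced an array-typed column:
import org.apache.spark.sql.functions.{col, explode}

// One row per author; "authors_array" comes from the UDF approach above
// (use "_authors" instead if the custom-schema read works).
val authorsDF = newDF.select(explode(col("authors_array")).as("author"))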
