KafkaAvroDecoder object to DataFrame conversion - apache-spark

In Spark Streaming with Kafka and Schema Registry, how can I convert each DStream batch into a DataFrame?
After using Confluent's KafkaAvroDecoder, the stream type is DStream[(String, Object)]. When I use the code below, the schema data types change, e.g. Int columns in the Avro schema become Long.
val kafkaStream: DStream[(String, Object)] =
  KafkaUtils.createDirectStream[String, Object, StringDecoder, KafkaAvroDecoder](
    ssc, kafkaParams, Set(topic)
  )

// Load the JSON strings into a DataFrame
kafkaStream.foreachRDD { rdd =>
  // Get the singleton instance of SQLContext
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._

  val topicValueStrings = rdd.map(_._2.toString)
  val df = sqlContext.read.json(topicValueStrings)
}
Calling Object.toString and reading the result as JSON loses the schema for int fields. Is there any other way to preserve the types, instead of casting the DataFrame columns afterwards?
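One possible workaround, if the Avro schema is known up front, is to hand an explicit schema to the JSON reader instead of letting Spark infer the types. The sketch below is only illustrative; the column names id and name are hypothetical placeholders for the real Avro fields.
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical schema matching the Avro record; replace with your real fields
val jsonSchema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true)
))

kafkaStream.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  val topicValueStrings = rdd.map(_._2.toString)
  // With an explicit schema the int columns stay IntegerType instead of being inferred as LongType
  val df = sqlContext.read.schema(jsonSchema).json(topicValueStrings)
}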

Related

Converting CassandraRow obtained from joinWithCassandraTable to DataFrame

case class SourcePartition(id: String, host: String, bucket: Int)

val joinedRDDs = partitions.joinWithCassandraTable("db_name", "table_name")
joinedRDDs.values.foreach(println)
I have to use joinWithCassandraTable. How do I convert the resulting CassandraRow into a DataFrame? Or is there an equivalent of joinWithCassandraTable for DataFrames?
I have to read a lot of partitions in one go. I'm aware of the DataStax Cassandra connector's predicate pushdown, but it only allows pulling one partition at a time (it doesn't seem to allow the IN operator; only = seems to be supported).
val spark: SparkSession = SparkSession.builder().master("local[4]").appName("RDD2DF").getOrCreate()
val sc: SparkContext = spark.sparkContext
import spark.implicits._

val internalJoinRDD = spark.sparkContext.cassandraTable("test", "test_table_1").joinWithCassandraTable("test", "table_table_2")
internalJoinRDD.toDebugString
internalJoinRDD.toDF()
Can you try the above code snippet?
If you have a schema for your data, you can use
def createDataFrame(internalJoinRDD: RDD[Row], schema: StructType): DataFrame
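For the explicit-schema route, a minimal sketch could look like the following; the column names and types mirror the SourcePartition case class above, and the mapping out of CassandraRow depends on your actual table layout.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Schema mirroring SourcePartition(id, host, bucket); adjust to the columns you actually select
val schema = StructType(Seq(
  StructField("id", StringType, nullable = false),
  StructField("host", StringType, nullable = true),
  StructField("bucket", IntegerType, nullable = true)
))

// joinWithCassandraTable yields (left, CassandraRow) pairs; pull the fields out of the joined row
val rowRdd = internalJoinRDD.map { case (_, cassandraRow) =>
  Row(
    cassandraRow.getString("id"),
    cassandraRow.getString("host"),
    cassandraRow.getInt("bucket")
  )
}

val df = spark.createDataFrame(rowRdd, schema)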

Avro format deserialization in Spark structured stream

I'm using Spark Structured Streaming as described on
this page.
I get the messages from the Kafka topic correctly, but the value is in Avro format. Is there some way to deserialize the Avro records (something like the KafkaAvroDeserializer approach)?
Spark >= 2.4
You can use the from_avro function from the spark-avro library.
import org.apache.spark.sql.avro._
val schema: String = ???
df.withColumn("value", from_avro($"value", schema))
Spark < 2.4
Define a function which takes Array[Byte] (serialized object):
import scala.reflect.runtime.universe.TypeTag
def decode[T : TypeTag](bytes: Array[Byte]): T = ???
which will deserialize the Avro data and create an object that can be stored in a Dataset.
Create a udf based on the function:
val decodeUdf = udf(decode _)
Call the udf on the value column:
val df = spark
.readStream
.format("kafka")
...
.load()
df.withColumn("value", decodeUdf($"value"))

How to create a DStream from a List of string?

I have a list of strings, but I can't find a way to turn the list into a DStream for Spark Streaming.
I tried this:
val tmpList = List("hi", "hello")
val rdd = sqlContext.sparkContext.parallelize(Seq(tmpList))
val rowRdd = rdd.map(v => Row(v: _*))
But Eclipse says sparkContext is not a member of sqlContext. How can I do this?
I'd appreciate your help.
A DStream is a sequence of RDDs and is created when you register a receiver for some streaming source like Kafka. For testing, if you want to create a DStream from a list of RDDs, you can do that as follows:
import scala.collection.mutable

val rdd1 = sqlContext.sparkContext.parallelize(tmpList)
val rdd2 = sqlContext.sparkContext.parallelize(tmpList1)
val stream = ssc.queueStream[String](mutable.Queue(rdd1, rdd2))
Hope it answers your question.
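Put together, a minimal end-to-end sketch (the app name, batch interval, and timeout are arbitrary choices) might look like this:
import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("ListToDStream")
val ssc = new StreamingContext(conf, Seconds(1))

val tmpList = List("hi", "hello")
val rdd = ssc.sparkContext.parallelize(tmpList)

// Each dequeued RDD becomes one micro-batch of the stream
val stream = ssc.queueStream[String](mutable.Queue(rdd))
stream.print()

ssc.start()
ssc.awaitTerminationOrTimeout(5000)
ssc.stop(stopSparkContext = true)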

Can I read a CSV represented as a string into Apache Spark using spark-csv

I know how to read a CSV file into Spark using spark-csv (https://github.com/databricks/spark-csv), but I already have the CSV file represented as a string and would like to convert this string directly to a DataFrame. Is this possible?
Update: starting from Spark 2.2.x, there is finally a proper way to do it, using a Dataset.
import org.apache.spark.sql.{Dataset, SparkSession}
val spark = SparkSession.builder().appName("CsvExample").master("local").getOrCreate()
import spark.implicits._
val csvData: Dataset[String] = spark.sparkContext.parallelize(
"""
|id, date, timedump
|1, "2014/01/01 23:00:01",1499959917383
|2, "2014/11/31 12:40:32",1198138008843
""".stripMargin.lines.toList).toDS()
val frame = spark.read.option("header", true).option("inferSchema",true).csv(csvData)
frame.show()
frame.printSchema()
Old Spark versions
Actually you can, though it's using library internals and not widely advertised. Just create and use your own CsvParser instance.
Below is an example that works for me on Spark 1.6.0 and spark-csv_2.10-1.4.0:
import com.databricks.spark.csv.CsvParser
val csvData = """
|userid,organizationid,userfirstname,usermiddlename,userlastname,usertitle
|1,1,user1,m1,l1,mr
|2,2,user2,m2,l2,mr
|3,3,user3,m3,l3,mr
|""".stripMargin
val rdd = sc.parallelize(csvData.lines.toList)
val csvParser = new CsvParser()
.withUseHeader(true)
.withInferSchema(true)
val csvDataFrame: DataFrame = csvParser.csvRdd(sqlContext, rdd)
You can parse your string as CSV using, e.g., scala-csv:
val myCSVdata: Array[List[String]] =
  myCSVString.split('\n').flatMap(CSVParser.parseLine(_))
Here you can do a bit more processing, data cleaning, verifying that every line parses well and has the same number of fields, etc ...
You can then make this an RDD of records:
val myCSVRDD: RDD[List[String]] = sparkContext.parallelize(myCSVdata)
Here you can massage your lists of Strings into a case class, to reflect the fields of your CSV data better. You can take inspiration from the creation of Person records in this example:
https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
I omit this step here (a sketch follows below).
You can then convert to a DataFrame:
import spark.implicits._
val myCSVDataframe = myCSVRDD.toDF()
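As a hedged sketch of the massaging step omitted above, with Person and its three fields being hypothetical placeholders for your own columns, the mapping into a case class could look like this; the final toDF then runs on the typed RDD instead of the raw lists:
import org.apache.spark.rdd.RDD

// Hypothetical record shape; adjust to your CSV columns
case class Person(name: String, age: Int, city: String)

// collect with a partial function drops rows that don't have exactly three fields
val peopleRDD: RDD[Person] = myCSVRDD.collect {
  case List(name, age, city) => Person(name.trim, age.trim.toInt, city.trim)
}

import spark.implicits._
val peopleDF = peopleRDD.toDF()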
The accepted answer wasn't working for me in Spark 2.2.0, but it led me to what I needed with csvData.lines.toList:
import java.io.InputStream
import scala.io.Source

val fileUrl = getClass.getResource(s"/file_in_resources.csv")
val stream = fileUrl.getContent.asInstanceOf[InputStream]
val streamString = Source.fromInputStream(stream).mkString
val csvList = streamString.lines.toList

import spark.implicits._
spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(csvList.toDS())
  .as[SomeCaseClass]

Joining Kafka and Cassandra DataFrames in Spark Streaming ignores C* predicate pushdown

Intent
I'm receiving data from Kafka via direct stream and would like to enrich the messages with data from Cassandra. The Kafka messages (Protobufs) are decoded into DataFrames and then joined with a (supposedly pre-filtered) DF from Cassandra. The relation of (Kafka) streaming batch size to raw C* data is [several streaming messages to millions of C* rows], BUT the join always yields exactly ONE result [1:1] per message. After the join the resulting DF is eventually stored to another C* table.
Problem
Even though I'm joining the two DFs on the full Cassandra primary key and pushing the corresponding filter to C*, it seems that Spark is loading the whole C* data-set into memory before actually joining (which I'd like to prevent by using the filter/predicate pushdown). This leads to a lot of shuffling and tasks being spawned, hence the "simple" join takes forever...
def main(args: Array[String]) {
val conf = new SparkConf()
.setAppName("test")
.set("spark.cassandra.connection.host", "xxx")
.set("spark.cassandra.connection.keep_alive_ms", "30000")
.setMaster("local[*]")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.sparkContext.setLogLevel("INFO")
// Initialise Kafka
val kafkaTopics = Set[String]("xxx")
val kafkaParams = Map[String, String](
"metadata.broker.list" -> "xxx:32000,xxx:32000,xxx:32000,xxx:32000",
"auto.offset.reset" -> "smallest")
// Kafka stream
val messages = KafkaUtils.createDirectStream[String, MyMsg, StringDecoder, MyMsgDecoder](ssc, kafkaParams, kafkaTopics)
// Executed on the driver
messages.foreachRDD { rdd =>
// Create an instance of SQLContext
val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
import sqlContext.implicits._
// Map MyMsg RDD
val MyMsgRdd = rdd.map{case (key, MyMsg) => (MyMsg)}
// Convert RDD[MyMsg] to DataFrame
val MyMsgDf = MyMsgRdd.toDF()
.select(
$"prim1Id" as 'prim1_id,
$"prim2Id" as 'prim2_id,
$...
)
// Load DataFrame from C* data-source
val base_data = base_data_df.getInstance(sqlContext)
// Left join on prim1Id and prim2Id
val joinedDf = MyMsgDf.join(base_data,
MyMsgDf("prim1_id") === base_data("prim1_id") &&
MyMsgDf("prim2_id") === base_data("prim2_id"), "left")
.filter(base_data("prim1_id").isin(MyMsgDf("prim1_id"))
&& base_data("prim2_id").isin(MyMsgDf("prim2_id")))
joinedDf.show()
joinedDf.printSchema()
// Select relevant fields
// Persist
}
// Start the computation
ssc.start()
ssc.awaitTermination()
}
Environment
Spark 1.6
Cassandra 2.1.12
Cassandra-Spark-Connector 1.5-RC1
Kafka 0.8.2.2
SOLUTION
From discussions on the DataStax Spark Connector for Apache Cassandra mailing list:
Joining Kafka and Cassandra DataFrames in Spark Streaming ignores C* predicate pushdown
How to create a DF from CassandraJoinRDD
I've learned the following:
Quoting Russell Spitzer
This wouldn't be a case of predicate pushdown. This is a join on a partition key column. Currently only joinWithCassandraTable supports this direct kind of join although we are working on some methods to try to have this automatically done within Spark.
Dataframes can be created from any RDD which can have a schema applied to it. The easiest thing to do is probably to map your joinedRDD[x,y] to Rdd[JoinedCaseClass] and then call toDF (which will require importing your sqlContext implicits.) See the DataFrames documentation here for more info.
So the actual implementation now resembles something like this:
// Join myMsg RDD with myCassandraTable
val joinedMsgRdd = myMsgRdd.joinWithCassandraTable(
"keyspace",
"myCassandraTable",
AllColumns,
SomeColumns(
"prim1_id",
"prim2_id"
)
).map { case (myMsg, cassandraRow) =>
  JoinedMsg(
    foo = myMsg.foo,
    bar = cassandraRow.bar
  )
}
// Convert RDD[JoinedMsg] to DataFrame
val myJoinedDf = joinedMsgRdd.toDF()
Have you tried joinWithCassandraTable? It should push down to C* all the keys you are looking for...
