I have a Spark (2.1) job that processes stream data using a Kafka direct stream. I enrich the stream data with data files (*.parquet) stored in HDFS: I first read the files into a DataFrame, then enrich each record against that DataFrame.
The code runs without any error, but the enrichment does not happen. Running the code in debug mode, I found the DataFrame (e.g. df) is shown as an invalid tree. Why is the DataFrame null inside rdd.foreachPartition, and how can I correct this problem? Thanks!
val kafkaSinkVar = ssc.sparkContext.broadcast(KafkaSink(kafkaServers, outputTopic))

Service.aggregate(kafkaInputStream).foreachRDD(rdd => {
  val df = ss.read.parquet(filePath + "/*.parquet")
  // the console shows the files were loaded successfully, record count = 1300
  println("Record Count in DF: " + df.count())
  rdd.foreachPartition(partition => {
    val futures = partition.map(event => {
      sentMsgsNo.add(1L)
      val eventEnriched = someEnrichmen1(event, df) // df is shown as an invalid tree here
      kafkaSinkVar.value.sendCef(eventEnriched)
    })
  })
})
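(For reference: a DataFrame is a driver-side handle, so referencing it inside rdd.foreachPartition, which runs on the executors, does not work. A common workaround is to do the enrichment as a DataFrame join on the driver and only ship plain, already-enriched values to the executors. The sketch below is only an illustration under assumptions not in the original post: a hypothetical join key id, an assumed event schema, and toJSON serialization; it is not code from this thread.)

import org.apache.spark.sql.SparkSession

Service.aggregate(kafkaInputStream).foreachRDD(rdd => {
  val spark = SparkSession.builder().getOrCreate()
  import spark.implicits._

  val refDf = spark.read.parquet(filePath + "/*.parquet") // reference data from HDFS
  val eventsDf = rdd.toDF("id", "payload")                // assumed event schema

  // Enrich on the driver with a join on a hypothetical "id" key
  // instead of calling someEnrichmen1(event, df) per record on the executors.
  val enrichedDf = eventsDf.join(refDf, Seq("id"), "left")

  // Ship only serialized rows to the executors; the broadcast Kafka sink is usable there.
  enrichedDf.toJSON.rdd.foreachPartition { partition =>
    partition.foreach(msg => kafkaSinkVar.value.sendCef(msg))
  }
})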
Related
I want to create and update a DataFrame inside foreachBatch of a Spark Structured Streaming query and access it outside the foreachBatch iterator. Below is what I am trying to do.
Is it possible to access DataFrames that are created or updated inside foreachBatch from outside foreachBatch in Spark Structured Streaming?
// start with an empty data frame
var df1: Option[DataFrame] = None

validatedFinalDf.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    println("I am here printing batchDF")
    batchDF.withColumn("extra", lit("batch-df")).show()

    // unpersist the previous data frame if it holds data
    if (df1.isDefined) {
      df1.get.unpersist()
    }

    // assign the current batch to the outer variable
    df1 = Some(batchDF.withColumn("extra", lit("batch-df-dim")))
  }.start()

// accessing the data frame outside foreachBatch is not working: stale data ....
if (df1.isDefined) {
  df1.get.show()
}

spark.streams.awaitAnyTermination()
I can't even access temp tables that are created inside foreachBatch from outside foreachBatch.
Even the DataFrame that is updated inside foreachBatch shows stale data when accessed from outside foreachBatch.
Thanks
Sri
foreachBatch iterates over the stream's micro-batches and, if I'm not mistaken, expects an effectful operation (e.g. a write, a print, etc.).
However, what you do inside the body is assign an intermediate result to an external var.
So there are the following problems:
Conceptually that is wrong because, even if it had worked fine, you would end up with just the last DataFrame assigned to your var.
I think you need to start the operation as exemplified in the doc here.
DataFrames are immutable. If you want to change a DataFrame, use mapping functions (e.g. withColumn) or other transformation APIs and return the new DataFrame.
Only when you're satisfied with the result should you persist it, using the foreach / foreachBatch calls.
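As an illustration of that advice, here is a minimal sketch (not from the original answer; the output path is made up) that keeps the transformation on the batch DataFrame and performs only an effectful write inside foreachBatch:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

validatedFinalDf.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF
      .withColumn("extra", lit("batch-df-dim"))    // pure transformation, returns a new DataFrame
      .write                                       // effectful operation on the micro-batch
      .mode("append")
      .parquet(s"/tmp/enriched/batch_id=$batchId") // hypothetical output path
  }
  .start()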
A small workaround did the trick: I converted the batch DataFrame into an in-memory stream (MemoryStream), which can then be accessed outside foreachBatch.
case class StreamData(
  account_id: String,
  run_dt: String,
  trxn_dt: String,
  trxn_amt: String)

import org.apache.spark.sql.execution.streaming.MemoryStream
import spark.implicits._
implicit val ctx = spark.sqlContext

val streamDataSource = MemoryStream[StreamData]

source.writeStream
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    val batchDs = batchDf.as[StreamData]
    // collect the micro-batch on the driver and feed it into the in-memory source
    val obj = batchDs
      .map(x => StreamData(x.account_id, x.run_dt, x.trxn_dt, x.trxn_amt))
      .collect()
    streamDataSource.addData(obj)
  }
  .start()

val datasetStreaming: Dataset[StreamData] = streamDataSource.toDS()
println("This is the streaming dataset:")
datasetStreaming
  .writeStream
  .format("console")
  .outputMode("append")
  .start()

spark.streams.awaitAnyTermination()
I am working on a Spark-JDBC program
I came up with the following code so far:
import java.io.FileInputStream
import java.util.Properties

import org.apache.log4j.{Level, LogManager, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object PartitionRetrieval {
  var conf = new SparkConf().setAppName("Spark-JDBC")
  val log = LogManager.getLogger("Spark-JDBC Program")
  Logger.getLogger("org").setLevel(Level.ERROR)

  val conFile = "/home/hmusr/ReconTest/inputdir/testconnection.properties"
  val properties = new Properties()
  properties.load(new FileInputStream(conFile))

  val connectionUrl = properties.getProperty("gpDevUrl")
  val devUserName = properties.getProperty("devUserName")
  val devPassword = properties.getProperty("devPassword")
  val driverClass = properties.getProperty("gpDriverClass")
  val tableName = "source.bank_accounts"

  try {
    Class.forName(driverClass).newInstance()
  } catch {
    case cnf: ClassNotFoundException =>
      log.error("Driver class: " + driverClass + " not found")
      System.exit(1)
    case e: Exception =>
      log.error("Exception: " + e.getMessage, e)
      System.exit(1)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().config(conf).master("yarn").enableHiveSupport().getOrCreate()
    val gpTable = spark.read.format("jdbc")
      .option("url", connectionUrl)
      .option("dbtable", tableName)
      .option("user", devUserName)
      .option("password", devPassword)
      .load()
    val rc = gpTable.filter(gpTable("source_system_name") === "ORACLE").count()
    println("gpTable Count: " + rc)
  }
}
In the above code, will the statement val gpTable = spark.read.format("jdbc").option("url", connectionUrl) dump the whole data of the table bank_accounts into the DataFrame gpTable, with rc then getting the filtered data? I ask because bank_accounts is a very small table and it doesn't matter if it is loaded into memory as a DataFrame as a whole, but in our production there are tables with billions of records. In that case, what is the recommended way to load data into a DataFrame using a JDBC connection?
Could anyone explain the concept of Spark-JDBC's entry point here?
will the statement ... dump the whole data of the table: bank_accounts into the DataFrame: gpTable and then DataFrame: rc gets the filtered data.
No. DataFrameReader is not eager. It only defines data bindings.
Additionally, simple predicates, like the trivial equality check here, are pushed down to the source, and only the required columns are loaded when the plan is executed.
In the database log you should see a query similar to
SELECT 1 FROM table WHERE source_system_name = 'ORACLE'
if it is loaded into memory as a dataframe as a whole.
No. Spark doesn't load data into memory unless it is instructed to (primarily with cache), and even then it limits itself to the blocks that fit into the available storage memory.
During normal processing it keeps only the data that is required to compute the plan, so for the overall plan the memory footprint shouldn't depend on the amount of data.
In that case what is the recommended way to load data into a DataFrame using a JDBC connection ?
Please check Partitioning in spark while reading from RDBMS via JDBC, Whats meaning of partitionColumn, lowerBound, upperBound, numPartitions parameters?, https://stackoverflow.com/a/45028675/8371915 for questions related to scalability.
Additionally you can read Does spark predicate pushdown work with JDBC?
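For illustration, here is a minimal sketch of a partitioned JDBC read using those options, with a made-up partition column and bounds (not part of the original answer):

// Spark issues numPartitions parallel queries, each covering a slice of the
// assumed numeric column account_number between lowerBound and upperBound.
val gpTablePartitioned = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("dbtable", tableName)
  .option("user", devUserName)
  .option("password", devPassword)
  .option("partitionColumn", "account_number")
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "10")
  .load()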
I have a Kafka topic in which data is stored in JSON format. I have written Spark Streaming code and I want to save just the values from the Kafka topic to a file in HDFS.
This is how the data in my Kafka topic looks:
{"group_city":"\"Washington\"","group_country":"\"us\"","event_name":"\"Outdoor Afro Goes Ziplining\""}
Below is the code I have written. When I print it, I get the parsed JSON, but my problem comes when I try to save just the values to a text file.
val dstream = KafkaUtils.createDirectStream[String, String](ssc, preferredHosts, ConsumerStrategies.Subscribe[String, String](topics, kafkaParams))

// ___PRINTING RECORDS________
val output = dstream.foreachRDD { rdd =>
  rdd.foreach { record =>
    val values = record.value()
    val tweet = scala.util.parsing.json.JSON.parseFull(values)
    val map: Map[String, String] = tweet.get.asInstanceOf[Map[String, String]]
    map.foreach(p => println(p._2))
  }
}
You can save the RDD with saveAsTextFile, but since you only want to save the values you can convert it to a DataFrame and write it as CSV:
import org.apache.spark.sql.SaveMode

dstream.foreachRDD(rawRDD => {
  // get the values (with the direct stream above the records are ConsumerRecord[String, String])
  val rdd = rawRDD.map(_.value())
  rdd.saveAsTextFile("file path")
  // or read the JSON strings into a dataframe and write as csv
  spark.read.json(rdd).write.mode(SaveMode.Append).csv("path for output")
})
Hope this helps!
I have a situation where I have to filter data points in a stream based on a condition involving a reference to external data. I have loaded the external data into a DataFrame (so that I can query it using the SQL interface). But when I try to query the DataFrame, I see that it cannot be accessed inside the transform (filter) function. (sample code below)
// DStream is created and temp table called 'locations' is registered
dStream.filter(dp => {
  val responseDf = sqlContext.sql("select location from locations where id='001'")
  responseDf.show() // nothing is displayed
  // some condition evaluation using responseDf
  true
})
Am I doing something wrong? If yes, then what would be a better approach to load external data in-memory and query it during stream transformation stage.
Using SparkSession instead of SQLContext solved the issue. Code below:
val sparkSession = SparkSession.builder().appName("APP").getOrCreate()
val df = sparkSession.createDataFrame(locationRepo.getLocationInfo, classOf[LocationVO])
df.createOrReplaceTempView("locations")

val dStream: DStream[StreamDataPoint] = getdStream()

dStream.filter(dp => {
  val sparkAppSession = SparkSession.builder().appName("APP").getOrCreate()
  val responseDf = sparkAppSession.sql("select location from locations where id='001'")
  responseDf.show() // this prints the results
  // some condition evaluation using responseDf
  true
})
I built a Spark Streaming app that fetches content from a Kafka queue and intends to put the data into a MySQL table after some pre-processing and structuring.
I call foreachRDD on the stream created from the StreamingContext. The issue I'm facing is that there is data loss between the saveAsTextFile call on the RDD and the DataFrame's write method with format("csv"). I can't seem to pinpoint why this is happening.
val ssc = new StreamingContext(spark.sparkContext, Seconds(60))
ssc.checkpoint("checkpoint")

val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val stream = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)

stream.foreachRDD { rdd =>
  rdd.saveAsTextFile("/Users/jarvis/rdds/" + new SimpleDateFormat("hh-mm-ss-dd-MM-yyyy").format(new Date) + "_rdd")

  import spark.implicits._
  val messagesDF = rdd
    .map(_.split("\t"))
    .map(w => Record(
      w(0),
      autoTag(w(1), w(4)),
      w(2), w(3), w(4),
      w(5).substring(w(5).lastIndexOf("http://")),
      w(6).split("\n")(0)))
    .toDF("recordTS", "tag", "channel_url", "title", "description", "link", "pub_TS")

  messagesDF.write.format("csv").save(dumpPath + new SimpleDateFormat("hh-mm-ss-dd-MM-yyyy").format(new Date) + "_DF")
}

ssc.start()
ssc.awaitTermination()
There's data loss, i.e. many rows don't make it from the RDD to the DataFrame.
There's also duplication: many rows that do reach the DataFrame are replicated many times.
Found the error. There was actually a misunderstanding about the ingested data format.
The intended data was "\t\t\t...", and hence each row was supposed to be split at "\n".
However, the actual data was:
"\t\t\t...\n\t\t\t...\n"
So the rdd.map(...) chain needed an additional split at every "\n" (a flatMap) before splitting the fields at "\t".
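For illustration, a minimal sketch of that fix, reusing the Record case class, the autoTag helper and the column layout from the question (not the author's final code):

// Each ingested element may contain several "\n"-separated rows,
// so split on "\n" first (flatMap), then split each row's fields on "\t".
val messagesDF = rdd
  .flatMap(_.split("\n"))
  .map(_.split("\t"))
  .map(w => Record(
    w(0),
    autoTag(w(1), w(4)),
    w(2), w(3), w(4),
    w(5).substring(w(5).lastIndexOf("http://")),
    w(6)))
  .toDF("recordTS", "tag", "channel_url", "title", "description", "link", "pub_TS")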