Spark Streaming: Using external data during stream transformation - apache-spark

I have a situation where I have to filter data-points in a stream based on some condition involving a reference to external data. I have loaded up the external data in a Dataframe (so that I get to query on it using SQL interface). But when I tried to query on Dataframe I see that we cannot access it inside the transform (filter) function. (sample code below)
// DStream is created and temp table called 'locations' is registered
dStream.filter(dp => {
val responseDf = sqlContext.sql("select location from locations where id='001'")
responseDf.show() //nothing is displayed
// some condition evaluation using responseDf
true
})
Am I doing something wrong? If yes, then what would be a better approach to load external data in-memory and query it during stream transformation stage.

Using SparkSession instead of SQLContext solved the issue. Code below,
val sparkSession = SparkSession.builder().appName("APP").getOrCreate()
val df = sparkSession.createDataFrame(locationRepo.getLocationInfo, classOf[LocationVO])
df.createOrReplaceTempView("locations")
val dStream: DStream[StreamDataPoint] = getdStream()
dStream.filter(dp => {
val sparkAppSession = SparkSession.builder().appName("APP").getOrCreate()
val responseDf = sparkAppSession.sql("select location from locations where id='001'")
responseDf.show() // this prints the results
// some condition evaluation using responseDf
true
})

Related

How to store data from a dataframe in a variable to use as a parameter in a select in cassandra?

I have a Spark Structured Streaming application. The application receives data from kafka, and should use these values ​​as a parameter to process data from a cassandra database. My question is how do I use the data that is in the input dataframe (kafka), as "where" parameters in cassandra "select" without taking the error below:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();
This is my df input:
val df = spark
.readStream
.format("kafka")
.options(
Map("kafka.bootstrap.servers"-> kafka_bootstrap,
"subscribe" -> kafka_topic,
"startingOffsets"-> "latest",
"fetchOffset.numRetries"-> "5",
"kafka.group.id"-> groupId
))
.load()
I get this error whenever I try to store the dataframe values ​​in a variable to use as a parameter.
This is the method I created to try to convert the data into variables. With that the spark give the error that I mentioned earlier:
def processData(messageToProcess: DataFrame): DataFrame = {
val messageDS: Dataset[Message] = messageToProcess.as[Message]
val listData: Array[Message] = messageDS.collect()
listData.foreach(x => println(x.country))
val mensagem = messageToProcess
mensagem
}
When you need to use data in Kafka to query data in Cassandra, then such operation is a typical join between two datasets - you don't need to call .collect to find entries, you just do the join. And it's quite typical thing - to enrich data in Kafka with data from the external dataset, and Cassandra provides low-latency operations.
Your code could look as following (you'll need to configure so-called DirectJoin, see link below):
import spark.implicits._
import org.apache.spark.sql.cassandra._
val df = spark.readStream.format("kafka")
.options(Map(...)).load()
... decode data in Kafka into columns
val cassdata = spark.read.cassandraFormat("table", "keyspace").load
val joined = df.join(cassdata, cassdata("pk") === df("some_column"))
val processed = ... process joined data
val query = processed.writeStream.....output data somewhere...start()
query.awaitTermination()
I have detailed blog post on how to perform efficient joins with data in Cassandra.
As the error message suggest, you have to use writeStream.start() in order to execute a Structured Streaming query.
You can't use the same actions you use for batch dataframes (like .collect(), .show() or .count()) on streaming dataframes, see the Unsupported Operations section of the Spark Structured Streaming documentation.
In your case, you are trying to use messageDS.collect() on a streaming dataset, which is not allowed. To achieve this goal you can use a foreachBatch output sink to collect the rows you need at each microbatch:
streamingDF.writeStream.foreachBatch { (microBatchDf: DataFrame, batchId: Long) =>
// Now microBatchDf is no longer a streaming dataframe
// you can check with microBatchDf.isStreaming
val messageDS: Dataset[Message] = microBatchDf.as[Message]
val listData: Array[Message] = messageDS.collect()
listData.foreach(x => println(x.country))
// ...
}

Creating an RDD from ConsumerRecord Value in Spark Streaming

I am trying to create a XmlRelation based on ConsumerRecord Value.
val value = record.value();
logger.info(".processRecord() : Value ={}" , value)
if(value !=null) {
val rdd = spark.sparkContext.parallelize(List(new String(value)))
How ever when i try to create an RDD based on the value i am getting NullPointerException.
org.apache.spark.SparkException: Job aborted due to stage failure:
Is this because i cannot create an RDD as i cannot get sparkContext on on worker nodes. Obviously i cannot send this information to back to the Driver as this is an infinite Stream.
What alternatives do i have.
The other alternative is write this record data along with Header info to another topic and write it back to another topic and have another streaming job process that info.
The ConsumerRecord Value i am getting is String (XML) and i want to parse it using an existing schema into an RDD and process it further.
Thanks
Sateesh
I am able to use the following code and make it work
val xmlStringDF:DataFrame = batchDF.selectExpr("value").filter($"value".isNotNull)
logger.info(".convert() : xmlStringDF Schema ={}",xmlStringDF.schema.treeString)
val rdd: RDD[String] = xmlStringDF.as[String].rdd
logger.info(".convert() : Before converting String DataFrame into XML DataFrame")
val relation = XmlRelation(
() => rdd,
None,
parameters.toMap,
xmlSchema)(spark.sqlContext)
val xmlDF = spark.baseRelationToDataFrame(relation)

How to collect a streaming dataset (to a Scala value)?

How can I store a dataframe value to a scala variable ?
I need to store values from the below dataframe (assuming column "timestamp" producing same values) to a variable and later I need to use this variable somewhere
i have tried following
val spark =SparkSession.builder().appName("micro").
enableHiveSupport().config("hive.exec.dynamic.partition", "true").
config("hive.exec.dynamic.partition.mode", "nonstrict").
config("spark.sql.streaming.checkpointLocation", "hdfs://dff/apps/hive/warehouse/area.db").
getOrCreate()
val xmlSchema = new StructType().add("id", "string").add("time_xml", "string")
val xmlData = spark.readStream.option("sep", ",").schema(xmlSchema).csv("file:///home/shp/sourcexml")
val xmlDf_temp = xmlData.select($"id",unix_timestamp($"time_xml", "dd/mm/yyyy HH:mm:ss").cast(TimestampType).as("timestamp"))
val collect_time = xmlDf_temp.select($"timestamp").as[String].collect()(0)
its thorwing error saying following:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start()
Is there any way i can store some dataframe values to a variable and use later?
is there any way i can store some dataframe values to a variable and use later ?
That's not possible in Spark Structured Streaming since a streaming query never ends and so it is not possible to express collect.
and later I need to use this variable somewhere
This "later" has to be another streaming query that you could join together and produce a result.

Filtering and selecting data from a DataFrame in Spark

I am working on a Spark-JDBC program
I came up with the following code so far:
object PartitionRetrieval {
var conf = new SparkConf().setAppName("Spark-JDBC")
val log = LogManager.getLogger("Spark-JDBC Program")
Logger.getLogger("org").setLevel(Level.ERROR)
val conFile = "/home/hmusr/ReconTest/inputdir/testconnection.properties"
val properties = new Properties()
properties.load(new FileInputStream(conFile))
val connectionUrl = properties.getProperty("gpDevUrl")
val devUserName = properties.getProperty("devUserName")
val devPassword = properties.getProperty("devPassword")
val driverClass = properties.getProperty("gpDriverClass")
val tableName = "source.bank_accounts"
try {
Class.forName(driverClass).newInstance()
} catch {
case cnf: ClassNotFoundException =>
log.error("Driver class: " + driverClass + " not found")
System.exit(1)
case e: Exception =>
log.error("Exception: " + e.printStackTrace())
System.exit(1)
}
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().config(conf).master("yarn").enableHiveSupport().getOrCreate()
val gpTable = spark.read.format("jdbc").option("url", connectionUrl)
.option("dbtable",tableName)
.option("user",devUserName)
.option("password",devPassword).load()
val rc = gpTable.filter(gpTable("source_system_name")==="ORACLE").count()
println("gpTable Count: " + rc)
}
}
In the above code, will the statement:val gpTable = spark.read.format("jdbc").option("url", connectionUrl) dump the whole data of the table: bank_accounts into the DataFrame: gpTable and then DataFrame: rc gets the filtered data. I have this doubt as the table: bank_accounts is a very small table and it doesn't have an effect if it is loaded into memory as a dataframe as a whole. But in our production, there are tables with billions of records. In that case what is the recommended way to load data into a DataFrame using a JDBC connection ?
Could anyone let me know the concept of Spark-Jdbc's entry point here ?
will the statement ... dump the whole data of the table: bank_accounts into the DataFrame: gpTable and then DataFrame: rc gets the filtered data.
No. DataFrameReader is not eager. It only defines data bindings.
Additionally, simple predicates, like trivial equality, checks are pushed to the source and only required columns should loaded when plan is executed.
In the database log you should see a query similar to
SELECT 1 FROM table WHERE source_system_name = 'ORACLE'
if it is loaded into memory as a dataframe as a whole.
No. Spark doesn't load data in memory unless it instructed to (primarily cache) and even then it limits itself to the blocks that fit into available storage memory.
During standard process it keep only the data that is required to compute the plan. For global plan memory footprint shouldn't depend on the amount of data.
In that case what is the recommended way to load data into a DataFrame using a JDBC connection ?
Please check Partitioning in spark while reading from RDBMS via JDBC, Whats meaning of partitionColumn, lowerBound, upperBound, numPartitions parameters?, https://stackoverflow.com/a/45028675/8371915 for questions related to scalability.
Additionally you can read Does spark predicate pushdown work with JDBC?

How to send transformed data from partitions to S3?

I have an RDD which is to big to collect. I have applied a chain of transformations to the RDD and want to send its transformed data directly from its partitions on my slaves to S3. I am currently operating as follows:
val rdd:RDD = initializeRDD
val rdd2 = rdd.transform
rdd2.first // in order to force calculation of RDD
rdd2.foreachPartition sendDataToS3
Unfortunately, the data that gets sent to S3 is untransformed. The RDD looks exactly like it did in stage initializeRDD.
Here is the body of sendDataToS3:
implicit class WriteableRDD[T](rdd:RDD[T]){
def transform:RDD[String] = rdd map {_.toString}
....
def sendPartitionsToS3(prefix:String) = {
rdd.foreachPartition { p =>
val filename = prefix+new scala.util.Random().nextInt(1000000)
val pw = new PrintWriter(new File(filename))
p foreach pw.println
pw.close
s3.putObject(S3_BUCKET, filename, new File(filename))
}
this
}
}
This is called with rdd.transform.sendPartitionsToS3(prefix).
How do I make sure the data that gets sent in sendDataToS3 is the transformed data?
My guess is there is a bug in your code that is not included in the question.
I'm answering anyway just to make sure you are aware of RDD.saveAsTextFile. You can give it a path on S3 (s3n://bucket/directory) and it will write each partition into that path directly from the executors.
I can hardly imagine when you would need to implement your own sendPartitionsToS3 instead of using saveAsTextFile.

Resources