How to collect a streaming dataset (to a Scala value)? - apache-spark

How can I store a dataframe value in a Scala variable?
I need to store values from the dataframe below (assuming the column "timestamp" produces the same values) in a variable, and later I need to use this variable somewhere.
I have tried the following:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("micro")
  .enableHiveSupport()
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .config("spark.sql.streaming.checkpointLocation", "hdfs://dff/apps/hive/warehouse/area.db")
  .getOrCreate()
import spark.implicits._

val xmlSchema = new StructType().add("id", "string").add("time_xml", "string")
val xmlData = spark.readStream.option("sep", ",").schema(xmlSchema).csv("file:///home/shp/sourcexml")
val xmlDf_temp = xmlData.select($"id", unix_timestamp($"time_xml", "dd/MM/yyyy HH:mm:ss").cast(TimestampType).as("timestamp"))
val collect_time = xmlDf_temp.select($"timestamp").as[String].collect()(0)
It throws an error saying the following:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start()
Is there any way I can store some dataframe values in a variable and use them later?

Is there any way I can store some dataframe values in a variable and use them later?
That's not possible in Spark Structured Streaming: a streaming query never ends, so there is no way to express collect on it.
and later I need to use this variable somewhere
This "later" has to be another streaming query that you could join together and produce a result.

Related

How to store data from a dataframe in a variable to use as a parameter in a select in cassandra?

I have a Spark Structured Streaming application. The application receives data from Kafka and should use these values as parameters to query data from a Cassandra database. My question is: how do I use the data in the input dataframe (Kafka) as "where" parameters in a Cassandra "select" without getting the error below?
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();
This is my df input:
val df = spark
  .readStream
  .format("kafka")
  .options(
    Map("kafka.bootstrap.servers" -> kafka_bootstrap,
        "subscribe" -> kafka_topic,
        "startingOffsets" -> "latest",
        "fetchOffset.numRetries" -> "5",
        "kafka.group.id" -> groupId
    ))
  .load()
I get this error whenever I try to store the dataframe values in a variable to use them as parameters.
This is the method I created to try to convert the data into variables. With it, Spark gives the error I mentioned earlier:
def processData(messageToProcess: DataFrame): DataFrame = {
  val messageDS: Dataset[Message] = messageToProcess.as[Message]
  val listData: Array[Message] = messageDS.collect()
  listData.foreach(x => println(x.country))
  val mensagem = messageToProcess
  mensagem
}
When you need to use data from Kafka to query data in Cassandra, such an operation is a typical join between two datasets: you don't need to call .collect to find entries, you just do the join. Enriching data in Kafka with data from an external dataset is a quite typical pattern, and Cassandra provides low-latency lookups for it.
Your code could look like the following (you'll need to configure the so-called DirectJoin; see the link below):
import spark.implicits._
import org.apache.spark.sql.cassandra._

val df = spark.readStream.format("kafka")
  .options(Map(...)).load()
// ... decode data in Kafka into columns ...

val cassdata = spark.read.cassandraFormat("table", "keyspace").load

val joined = df.join(cassdata, cassdata("pk") === df("some_column"))

val processed = ... // process joined data
val query = processed.writeStream. ... .start() // output data somewhere
query.awaitTermination()
I have a detailed blog post on how to perform efficient joins with data in Cassandra.
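A minimal sketch of how that configuration could be set on the SparkSession, assuming Spark Cassandra Connector 2.5+; the directJoinSetting property and the connection host are assumptions to verify against your connector version's documentation:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-cassandra-enrichment")
  // enable the connector's Catalyst extensions (needed for dataframe joins / DirectJoin)
  .config("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")
  // assumed property controlling the direct join optimization ("on", "off" or "auto")
  .config("spark.cassandra.sql.directJoinSetting", "on")
  // hypothetical contact point for the Cassandra cluster
  .config("spark.cassandra.connection.host", "cassandra-host")
  .getOrCreate()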
As the error message suggests, you have to use writeStream.start() in order to execute a Structured Streaming query.
You can't use the same actions you use for batch dataframes (like .collect(), .show() or .count()) on streaming dataframes; see the Unsupported Operations section of the Spark Structured Streaming documentation.
In your case, you are trying to use messageDS.collect() on a streaming dataset, which is not allowed. To achieve this goal, you can use a foreachBatch output sink to collect the rows you need at each micro-batch:
streamingDF.writeStream.foreachBatch { (microBatchDf: DataFrame, batchId: Long) =>
  // Now microBatchDf is no longer a streaming dataframe
  // (you can check with microBatchDf.isStreaming)
  val messageDS: Dataset[Message] = microBatchDf.as[Message]
  val listData: Array[Message] = messageDS.collect()
  listData.foreach(x => println(x.country))
  // ...
}.start()
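If the goal really is to hold a value in a driver-side variable, a rough sketch (not part of the original answer) is to assign it inside foreachBatch, which runs on the driver; the variable name is hypothetical and the country field comes from the Message case class above:
// Sketch: capture a value from each micro-batch into a driver-side variable.
@volatile var latestCountry: Option[String] = None

val query = streamingDF.writeStream.foreachBatch { (microBatchDf: DataFrame, batchId: Long) =>
  // collect() brings the micro-batch to the driver, so this assignment happens on the driver
  latestCountry = microBatchDf.as[Message].collect().headOption.map(_.country)
}.start()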

Apache spark custom log unfiltered data (LazyLogging)

I'm filtering a column to comply with some validations, and I can filter using Spark built-in functions,
but I need to log the invalid data with a proper message (I am using LazyLogging). Is there any way I can do it without a custom UDF, so I can keep Spark's optimizations?
For example, filtering names that are shorter than 20 characters:
df.filter(length($"name") <= lit(20))
In this scenario, how can I log the names that are longer than 20 characters without a custom UDF?
If the result of the filter operation is small enough to fit into your driver, you can collect it and print it out with your default logger.
val logCollection = df.filter(length($"name") > lit(20)).collect()
logCollection.foreach(row => logger.info(row.toString))
As an alternative, you can create a separate stream by applying another writeStream format to write the names to a database, the console, etc. Just keep in mind that when you do this, you will actually create multiple streaming queries within your SparkSession, each consuming the data independently:
val originalDf = df.[...]
val logDf = df.filter(length($"name") > lit(20))
val originalQuery = originalDf.writeStream.[...].start() // keep logic as is
val logQuery = logDf.writeStream.format("console").[...].start()
spark.streams.awaitAnyTermination()

scala joinWithCassandraTable result to dataframe

I'm using the Datastax Spark Cassandra Connector to access some data in Cassandra.
My requirement is to join an RDD with a Cassandra table, fetch the result and store it in a Hive table.
I'm using joinWithCassandraTable to join the Cassandra table. After the join, the resulting RDD looks like below:
com.datastax.spark.connector.rdd.CassandraJoinRDD[org.apache.spark.sql.Row,
com.datastax.spark.connector.CassandraRow] =
CassandraJoinRDD[17] at RDD at CassandraRDD.scala:19
I tried the steps below to convert it to a dataframe, but none of the approaches is working.
val data = joinWithRDD.map {
  case (_, cassandraRow) => Row(cassandraRow.columnValues: _*)
}
sqlContext.createDataFrame(data, schema)
I'm getting the error below:
java.lang.ClassCastException: cannot assign instance of
scala.collection.immutable.List$SerializationProxy to field
org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of
type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
Can you please help me convert the joinWithCassandraTable result to a dataframe?
As I see, you're using a dataframe on the left side of the join. Instead of using joinWithCassandraTable, which uses the RDD API, I recommend taking Spark Cassandra Connector 2.5.x (2.5.1 is the latest), which supports joins in the Dataframe API, and using it directly. It's really easy: you just need to start your job with --conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions to activate this functionality; after that, the code just uses normal joins on dataframes:
val parsed = ... // some dataframe
val cassandra = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "stock_info", "keyspace" -> "test"))
  .load

// We can use a left join to detect which data is incorrect: if we don't have some data in
// Cassandra, the symbol field will be null, so we can detect such entries and do something with them.
// We can omit the joinType parameter; in that case we'll process only data that is in Cassandra.
val joined = parsed.join(cassandra, cassandra("symbol") === parsed("ticker"), "left")
  .drop("ticker")
Full source code with README is here.

Creating an RDD from ConsumerRecord Value in Spark Streaming

I am trying to create an XmlRelation based on the ConsumerRecord value.
val value = record.value();
logger.info(".processRecord() : Value = {}", value)
if (value != null) {
  val rdd = spark.sparkContext.parallelize(List(new String(value)))
However, when I try to create an RDD based on the value, I am getting a NullPointerException.
org.apache.spark.SparkException: Job aborted due to stage failure:
Is this because I cannot create an RDD, as I cannot get the SparkContext on worker nodes? Obviously I cannot send this information back to the driver, as this is an infinite stream.
What alternatives do I have?
The other alternative is to write this record data along with the header info to another topic and have another streaming job process that info.
The ConsumerRecord value I am getting is a String (XML), and I want to parse it using an existing schema into an RDD and process it further.
Thanks
Sateesh
I was able to use the following code and make it work:
val xmlStringDF: DataFrame = batchDF.selectExpr("value").filter($"value".isNotNull)
logger.info(".convert() : xmlStringDF Schema = {}", xmlStringDF.schema.treeString)
val rdd: RDD[String] = xmlStringDF.as[String].rdd

logger.info(".convert() : Before converting String DataFrame into XML DataFrame")
val relation = XmlRelation(
  () => rdd,
  None,
  parameters.toMap,
  xmlSchema)(spark.sqlContext)
val xmlDF = spark.baseRelationToDataFrame(relation)
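The batchDF above suggests this runs inside a foreachBatch sink; a rough sketch of the surrounding wiring, where the Kafka options and the convert helper wrapping the code above are assumptions:
// Sketch: run the conversion above for every micro-batch delivered by the Kafka stream.
val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // hypothetical brokers
  .option("subscribe", "xml-topic")                 // hypothetical topic
  .load()
  .selectExpr("CAST(value AS STRING) AS value")

val query = kafkaStream.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  val xmlDF = convert(batchDF) // hypothetical helper containing the code above
  // ... process xmlDF further ...
}.start()

query.awaitTermination()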

createDataFrame() returning a list instead of DataFrame in Spark

I am running Spark 1.5.1. On startup I have a HiveContext available as sqlContext, but I set
sqlContext2 = SQLContext(sc)
I create a pipelined RDD by parsing a list of strings to JSON
data = points.map(lambda line: json.loads(line))
I then try to convert this into a dataframe using
DF = sqlContext2.createDataFrame(data).collect()
This runs perfectly, but when I run type(DF) it says that it is a list.
How is this possible? How does a list come out of createDataFrame()?
That's because when you apply collect() on a DataFrame, it returns a list that contains all of the elements (Rows) of that DataFrame.
If you want just a DataFrame, df = sqlContext.createDataFrame(data) is enough.
There is no need for sqlContext2 here.
