lag function in Spark structured streaming - apache-spark

I am using Spark 2.3 Structured Streaming and trying to use the 'lag' function. However, it looks like lag is not supported in Structured Streaming.
val output = spark.sql("SELECT temperature, time, lag(temperature, 1) OVER (ORDER BY time) AS PrevTemp FROM InputTable")
I get this error:
org.apache.spark.sql.AnalysisException: Non-time-based windows are not supported on streaming DataFrames/Datasets; line 1 pos 0;
Is there an alternate way to achieve this 'lag' functionality with structured streaming?
Thanks!

As far as I know, there isn't.
You could probably play with mapGroupsWithState. For example:
import java.sql.Timestamp
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
import spark.implicits._

case class PayLoad(event_time: Timestamp, data: String)

def mappingFunction(key: Timestamp, values: Iterator[PayLoad], state: GroupState[PayLoad]): PayLoad = {
  ??? // Work with the values iterator here
}

val temperature: DataFrame = ???

temperature
  .withColumn("event_time", org.apache.spark.sql.functions.current_timestamp())
  .as[PayLoad]
  .groupByKey(_.event_time)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(mappingFunction)
You don't have to keep any state, but this way you get access to the values iterator and can implement the lag logic yourself.
Keep in mind that with this approach all of the micro-batch data goes to one partition, which with a huge payload may lead to huge latencies or even OOM (as would OVER (ORDER BY time)).
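For illustration, here is a minimal, hypothetical sketch of a lag built on a variant of the idea above, using flatMapGroupsWithState so one output row can be emitted per input row: every row is keyed on a constant, each micro-batch is sorted by time, every reading is paired with its predecessor, and the last temperature is carried across batches in state. It assumes a SparkSession named spark; the Reading and ReadingWithPrev case classes, the readings Dataset, and the console sink are placeholders, not part of the original question.
import java.sql.Timestamp
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
import spark.implicits._

case class Reading(time: Timestamp, temperature: Double)
case class ReadingWithPrev(time: Timestamp, temperature: Double, prevTemp: Option[Double])

// Emit one output row per input row, carrying the previous temperature.
// The last temperature of each batch is stored in state for the next batch
// (assumes records arrive roughly in time order across batches).
def lagFunction(key: Int, values: Iterator[Reading], state: GroupState[Double]): Iterator[ReadingWithPrev] = {
  var prev: Option[Double] = state.getOption
  val out = values.toSeq.sortBy(_.time.getTime).map { r =>
    val withPrev = ReadingWithPrev(r.time, r.temperature, prev)
    prev = Some(r.temperature)
    withPrev
  }
  prev.foreach(state.update)
  out.iterator
}

val readings: Dataset[Reading] = ??? // the streaming temperature source

val withLag = readings
  .groupByKey(_ => 0) // a single constant key, so ordering is global (same single-partition caveat as above)
  .flatMapGroupsWithState[Double, ReadingWithPrev](OutputMode.Update, GroupStateTimeout.NoTimeout())(lagFunction)

withLag.writeStream
  .outputMode("update")
  .format("console")
  .start()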
Hope it helps.

Related

Spark - Java - Filter Streaming Queries

I have a Spark application that receives data in a dataframe:
Dataset<Row> df = spark.readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "topic")
    .load()
    .selectExpr("CAST(key AS STRING) as key");
String my_key = df.select("key").first().toString();
if (my_key.equals("a")) {
    // do_stuff
}
Basically, if the key's value is "a" I need to apply some transformations to the dataframe; otherwise I apply other transformations.
However, I am dealing with streaming queries, and when I tried to apply the code above I got:
Queries with streaming sources must be executed with writeStream.start()
The error happens when I make the first operation.
Anyone have any ideas?
Thanks in advance :)
I was able to solve my problem using:
Dataset<Row> df = spark.readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "topic")
    .load()
    .selectExpr("CAST(key AS STRING) as key")
    .filter(functions.col("key").contains("a"));
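If both branches are needed, another option is to keep two filtered streams and start a separate query for each. A sketch in Scala, matching the other snippets on this page; the console sinks and the elided transformations are placeholders:
import org.apache.spark.sql.functions.col

val keyed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topic")
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

// Branch 1: rows whose key is "a"
val queryA = keyed
  .filter(col("key") === "a")
  // ... transformations for the "a" case ...
  .writeStream
  .format("console")
  .start()

// Branch 2: everything else
val queryOther = keyed
  .filter(col("key") =!= "a")
  // ... other transformations ...
  .writeStream
  .format("console")
  .start()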

Multiple operations/aggregations on the same Dataframe/Dataset in Spark Structured Streaming

I use Spark 2.3.2.
I'm receiving data from Kafka and must do multiple aggregations on the same data. All aggregation results then go to the same database (the columns or tables may change). For example:
val kafkaSource = spark.readStream.format("kafka") ...
val agg1 = kafkaSource.groupBy().agg ...
val agg2 = kafkaSource.groupByKey(...).mapGroupsWithState(...) ...
val agg3 = kafkaSource.groupByKey(...).mapGroupsWithState(...) ...
But when I try to call writeStream for each aggregation result:
agg1.writeStream.foreach(...).start()
agg2.writeStream.foreach(...).start()
agg3.writeStream.foreach(...).start()
Spark reads the data from the source independently for each writeStream. Is this efficient?
Can I do multiple aggregations with a single writeStream, and if so, is that more efficient?
Every writeStream operation results in a new streaming query. Every streaming query will read from the source and execute the entire query plan. Unlike DStreams, there is no cache/persist option available.
In Spark 2.4 a new API, foreachBatch, was introduced to handle these kinds of scenarios more efficiently.
Within foreachBatch, caching can be used to avoid multiple reads of the same micro-batch:
kafkaSource.writeStream.foreachBatch { (df, id) =>
  df.persist()
  val agg1 = df.groupBy().agg ...
  val agg2 = df.groupByKey(...).mapGroupsWithState(...) ...
  val agg3 = df.groupByKey(...).mapGroupsWithState(...) ...
  // write agg1/agg2/agg3 to their sinks here, then release the cached batch
  df.unpersist()
}.start()
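For a more concrete picture, here is a minimal sketch of that foreachBatch pattern on Spark 2.4+. The Kafka options, the two example aggregations, the parquet output paths, and the checkpoint location are all placeholders, not a prescription for the actual database sink:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.max

val spark = SparkSession.builder.appName("multi-agg").getOrCreate()
import spark.implicits._

val kafkaSource = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topic")
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

// Runs once per micro-batch; the batch is cached so Kafka is read only once.
def writeBatch(batch: DataFrame, batchId: Long): Unit = {
  batch.persist()

  // Aggregation 1: row count per key
  batch.groupBy($"key").count()
    .write.mode("append").parquet("/tmp/out/agg1") // placeholder sink

  // Aggregation 2: max value per key
  batch.groupBy($"key").agg(max($"value"))
    .write.mode("append").parquet("/tmp/out/agg2") // placeholder sink

  batch.unpersist()
}

kafkaSource.writeStream
  .option("checkpointLocation", "/tmp/checkpoints/multi-agg") // placeholder path
  .foreachBatch(writeBatch _)
  .start()
  .awaitTermination()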

Convert a Spark SQL batch source to structured streaming sink

I am trying to convert an org.apache.spark.sql.sources.CreatableRelationProvider into an org.apache.spark.sql.execution.streaming.Sink by simply implementing addBatch(...), which calls createRelation(...). However, there is a df.rdd call inside createRelation(...), which causes the following error:
org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:374)
I was trying to look into how org.apache.spark.sql.execution.streaming.FileStreamSink, which also needs to get an RDD from a dataframe in a streaming job, handles this; it seems to play the trick of using df.queryExecution.executedPlan.execute() to generate the RDD instead of calling .rdd.
However, things do not seem to be that simple:
It seems the output ordering might need to be taken care of - https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L159
Might be some eager execution concerns? (not sure)
https://issues.apache.org/jira/browse/SPARK-20865
More details of the issue I am running into can be found here
I'm wondering what the idiomatic way to do this conversion would be.
Dataset.rdd creates a new plan, which breaks the incremental planning. Because StreamExecution uses the existing plan to collect metrics and to update the watermark, we should never create a new plan; otherwise metrics and the watermark are updated in the new plan, and StreamExecution cannot retrieve them.
Here is an example of Scala code that converts column values in Structured Streaming without creating a new plan:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow

// conversionFunctions (an array of (InternalRow, Int) => Any converters) is assumed to be defined elsewhere
val convertedRows: RDD[Row] = df.queryExecution.toRdd.mapPartitions { iter: Iterator[InternalRow] =>
  iter.map { row =>
    val convertedValues: Array[Any] = new Array(conversionFunctions.length)
    var i = 0
    while (i < conversionFunctions.length) {
      convertedValues(i) = conversionFunctions(i)(row, i)
      i += 1
    }
    Row.fromSeq(convertedValues)
  }
}
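Putting it together, a hypothetical skeleton of such a Sink could look like the following. The writeRows parameter is a placeholder for whatever the wrapped CreatableRelationProvider ultimately does with the data, and the conversion itself is the convertedRows logic shown above:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.execution.streaming.Sink

// Hypothetical wrapper: reuse a batch write path without ever calling data.rdd.
class BatchReusingSink(writeRows: RDD[Row] => Unit) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // Derive the RDD from data.queryExecution.toRdd (as in the snippet above)
    // rather than from data.rdd, so the incremental streaming plan is reused.
    val convertedRows: RDD[Row] = ??? // apply the conversion logic shown above to `data`
    writeRows(convertedRows)
  }
}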

How to do stateless aggregations in Spark using Structured Streaming 2.3.0 without using flatMapGroupsWithState?

How can I do stateless aggregations in Spark using Structured Streaming 2.3.0 without using flatMapGroupsWithState or the DStream API? I'm looking for a more declarative way.
Example:
select count(*) from some_view
I want the output to count only whatever records are available in each batch, not an aggregate accumulated from previous batches.
To do stateless aggregations in Spark Structured Streaming 2.3.0 without using flatMapGroupsWithState or the DStream API, you can use the following code:
import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._

def countValues = (_: String, it: Iterator[(String, String)]) => it.length

val query =
  dataStream
    .select(lit("a").as("newKey"), col("value"))
    .as[(String, String)]
    .groupByKey { case (newKey, _) => newKey }
    .mapGroups[Int](countValues)
    .writeStream
    .format("console")
    .start()
Here is what we are doing:
We add one column, newKey, to the stream so that we can group on it with groupByKey. I used the literal string "a", but you can use anything. You also need to select one of the columns available in the stream; I selected the value column, but any column will do.
We create a mapping function, countValues, that counts the values collected by groupByKey by calling it.length.
So, in this way, we count whatever records are available in each batch without aggregating across previous batches.
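For completeness, dataStream above could be a Kafka source such as the following sketch (the broker address and topic name are placeholders, mirroring the other snippets on this page):
val dataStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topic")
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")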
I hope it helps!

How to effectively read millions of rows from Cassandra?

I have the hard task of reading millions of rows from a Cassandra table. The table actually contains around 40-50 million rows.
The data consists of internal URLs for our system, and we need to fire all of them. To fire them we are using Akka Streams, and it has been working pretty well, applying back pressure as needed. But we still have not found a way to read everything effectively.
What we have tried so far:
Reading the data as a stream using Akka Streams. We are using phantom-dsl, which provides a publisher for a specific table, but it does not read everything, only a small portion; it actually stops reading after the first million rows.
Reading with Spark, filtered by a specific date. Our table is modeled as a time-series table with year, month, day, minute... columns. Right now we are selecting by day, so Spark will not fetch too much data to process, but selecting all those days one by one is a pain.
The code is the following:
val cassandraRdd =
sc
.cassandraTable("keyspace", "my_table")
.select("id", "url")
.where("year = ? and month = ? and day = ?", date.getYear, date.getMonthOfYear, date.getDayOfMonth)
Unfortunately I can't iterate over the partitions to fetch less data at a time; I have to use collect, because it complains that the actor is not serializable.
val httpPool: Flow[(HttpRequest, String), (Try[HttpResponse], String), HostConnectionPool] = Http().cachedHostConnectionPool[String](host, port).async
val source =
Source
.actorRef[CassandraRow](10000000, OverflowStrategy.fail)
.map(row => makeUrl(row.getString("id"), row.getString("url")))
.map(url => HttpRequest(uri = url) -> url)
val ref = Flow[(HttpRequest, String)]
.via(httpPool.withAttributes(ActorAttributes.supervisionStrategy(decider)))
.to(Sink.actorRef(httpHandlerActor, IsDone))
.runWith(source)
cassandraRdd.collect().foreach { row =>
ref ! row
}
I would like to know if any of you have experience reading millions of rows to do something other than aggregations and the like.
I have also thought about reading everything and sending it to a Kafka topic, where I would consume it with Spark or Akka Streaming, but the problem would be the same: how do I load all that data effectively?
EDIT
For now I'm running on a cluster with a reasonable amount of memory (100 GB), doing a collect and iterating over the result.
Also, this is quite different from fetching big data with Spark and analyzing it with operations like reduceByKey, aggregateByKey, etc.
I need to fetch everything and send it over HTTP =/
So far it works the way I did it, but I'm afraid the data will keep growing to the point where fetching everything into memory makes no sense.
Streaming this data, fetching it in chunks, would be the best solution, but I haven't found a good approach for it yet.
In the end I'm thinking of using Spark to fetch all the data, generate a CSV file, and use Akka Stream IO to process it; this way I would avoid keeping a lot of things in memory, since it takes hours to process each million rows.
Well, after spending some time reading, talking to other people, and running tests, I was able to achieve the result with the following code sample:
val sc = new SparkContext(sparkConf)
val cassandraRdd = sc.cassandraTable(config.getString("myKeyspace"), "myTable")
.select("key", "value")
.as((key: String, value: String) => (key, value))
.partitionBy(new HashPartitioner(2 * sc.defaultParallelism))
.cache()
cassandraRdd
.groupByKey()
.foreachPartition { partition =>
partition.foreach { row =>
implicit val system = ActorSystem()
implicit val materializer = ActorMaterializer()
val myActor = system.actorOf(Props(new MyActor(system)), name = "my-actor")
val source = Source.fromIterator { () => row._2.toIterator }
source
.map { str =>
myActor ! Count
str
}
.to(Sink.actorRef(myActor, Finish))
.run()
}
}
sc.stop()
class MyActor(system: ActorSystem) extends Actor {
var count = 0
def receive = {
case Count =>
count = count + 1
case Finish =>
println(s"total: $count")
system.shutdown()
}
}
case object Count
case object Finish
What I'm doing is the following:
Try to achieve a good number of partitions and a partitioner by using the partitionBy and groupByKey methods.
Use cache to prevent data shuffles, which would make Spark move large amounts of data across nodes with high IO, etc.
Create the whole actor system, with its dependencies, as well as the stream, inside the foreachPartition method. This is a trade-off: you could have only one ActorSystem, but then you would have to make bad use of .collect, as I wrote in the question. By creating everything inside, you still have the ability to run things distributed across your Spark cluster.
Finish each actor system at the end of its iterator by sending a kill message (Finish) through Sink.actorRef.
Perhaps this code could be improved further, but so far I'm happy not to be using .collect anymore and to be working only inside Spark.
