Trying to understand spark streaming windowing

Trying to understand spark streaming windowing - apache-spark

I'm investigating Spark Streaming as a solution for an anti-fraud service I am building, but I am struggling to figure out exactly how to apply it to my use case. The use case is: data from a user session is streamed, and a risk score is calculated for a given user, after 10 seconds of data is collected for that user. I am planning on using a batch interval time of 2 seconds, but need to use data from the full 10 second window. At first, updateStateByKey() seemed to be the perfect solution, as I could build up a UserRisk object using the events the system collects. The trouble is, I am not sure how to tell Spark to stop updating a user after the 10 seconds have passed, as at the 10 second mark, I run our inference engine against the UserRisk object, and persist the result. The other approach is the window transformation. The issue with the window transformation is that I have to dedup data manually, which might be wasteful. Any suggestions on how to tell updateStateByKey to stop reducing on a certain key after an interval of time has passed?

If you don't want use windowing, you can reduce batch interval to 1s and then in updateStateByKey update also an incremental counter and bypass the update function when it reachs 10.
myDstreamByKey.updateStateByKey( (newValues: Seq[Row], runningState: Option[(UserRisk, Int)]) => {
if(runningState.get._2 < 10){
Some(( updateUserRisk(runningState.get._1, newValues), runningState.get._2 + 1) )
}else{
runningState
}
} )
For semplicity I'm considering the State always a Some, but you have to handle it according your business logic, that I don't know.
In my example Row is as a fake case class that represents your original data and UserRisk is the accumulating state. the updateUserRisk function contains your business logic to update the UserRisk

According to your case, you can try reduceByKeyAndWindow Dstream function, It will fulfill your requirement
Here is sample code in java
JavaPairDStream<String, Integer> counts = pairs.reduceByKeyAndWindow(
new Function2<Integer, Integer, Integer>() {
public Integer call(Integer i1, Integer i2) {
return i1 + i2;
}
}, new Function2<Integer, Integer, Integer>() {
public Integer call(Integer i1, Integer i2) {
return i1 - i2;
}
}, new Duration(60 * 1000), new Duration(2 * 1000));
Some important links
http://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations
Spark Streaming Window Operation

Related

NoNodeAvailableException after some insert request to cassandra

I am trying to insert data into Cassandra local cluster using async execution and version 4 of the driver (as same as my Cassandra instance)
I have instantiated the cql session in this way:
CqlSession cqlSession = CqlSession.builder()
.addContactEndPoint(new DefaultEndPoint(
InetSocketAddress.createUnresolved("localhost",9042))).build();
Then I create a statement in an async way:
return session.prepareAsync(
"insert into table (p1,p2,p3, p4) values (?, ?,?, ?)")
.thenComposeAsync(
(ps) -> {
CompletableFuture<AsyncResultSet>[] result = data.stream().map(
(d) -> session.executeAsync(
ps.bind(d.p1,d.p2,d.p3,d.p4)
)
).toCompletableFuture()
).toArray(CompletableFuture[]::new);
return CompletableFuture.allOf(result);
}
);
data is a dynamic list filled with user data.
When I exec the code I get the following exception:
Caused by: com.datastax.oss.driver.api.core.NoNodeAvailableException: No node was available to execute the query
at com.datastax.oss.driver.api.core.AllNodesFailedException.fromErrors(AllNodesFailedException.java:53)
at com.datastax.oss.driver.internal.core.cql.CqlPrepareHandler.sendRequest(CqlPrepareHandler.java:210)
at com.datastax.oss.driver.internal.core.cql.CqlPrepareHandler.onThrottleReady(CqlPrepareHandler.java:167)
at com.datastax.oss.driver.internal.core.session.throttling.PassThroughRequestThrottler.register(PassThroughRequestThrottler.java:52)
at com.datastax.oss.driver.internal.core.cql.CqlPrepareHandler.<init>(CqlPrepareHandler.java:153)
at com.datastax.oss.driver.internal.core.cql.CqlPrepareAsyncProcessor.process(CqlPrepareAsyncProcessor.java:66)
at com.datastax.oss.driver.internal.core.cql.CqlPrepareAsyncProcessor.process(CqlPrepareAsyncProcessor.java:33)
at com.datastax.oss.driver.internal.core.session.DefaultSession.execute(DefaultSession.java:210)
at com.datastax.oss.driver.api.core.cql.AsyncCqlSession.prepareAsync(AsyncCqlSession.java:90)
The node is active and some data are inserted before the exception rise. I have also tried to set up a data center name on the session builder without any result.
Why this exception rise if the node is up and running? Actually I have only one local node and that could be a problem?

The biggest thing that I don't see, is a way to limit the current number of active async threads.
Basically, if that (mapped) data stream gets hit hard enough, it'll basically create all of these new threads that it's awaiting. If the number of writes coming in from those threads creates enough back-pressure that node can't keep up or catch up to, the node will become overwhelmed and not accept requests.
Take a look at this post by Ryan Svihla of DataStax:
Cassandra: Batch Loading Without the Batch — The Nuanced Edition
Its code is from the 3.x version of the driver, but the concepts are the same. Basically, provide some way to throttle-down the writes, or limit the number of "in flight threads" running at any given time, and that should help greatly.

Finally, I have found a solution using BatchStatement and a little custom code to create a chucked list.
int chunks = 0;
if (data.size() % 100 == 0) {
chunks = data.size() / 100;
} else {
chunks = (data.size() / 100) + 1;
}
final int finalChunks = chunks;
return session.prepareAsync(
"insert into table (p1,p2,p3, p4) values (?, ?,?, ?)")
.thenComposeAsync(
(ps) -> {
AtomicInteger counter = new AtomicInteger();
final List<CompletionStage<AsyncResultSet>> batchInsert = data.stream()
.map(
(d) -> ps.bind(d.p1,d.p2,d.p3,d.p4)
)
.collect(Collectors.groupingBy(it -> counter.getAndIncrement() / finalChunks))
.values().stream().map(
boundedStatements -> BatchStatement.newInstance(BatchType.LOGGED, boundedStatements.toArray(new BatchableStatement[0]))
).map(
session::executeAsync
).collect(Collectors.toList());
return CompletableFutures.allSuccessful(batchInsert);
}
);

Spark arbitrary stateful stream aggregation, flatMapGroupsWithState API

10 days old spark developer, trying to understand the flatMapGroupsWithState API of spark.
As I understand:
We pass 2 options to it which are timeout configuration. A possible value is GroupStateTimeout.ProcessingTimeTimeout i.e. kind of an instruction to spark to consider processing time and not event time. Other is the output mode.
We pass in a function, lets say myFunction, that is responsible for setting the state for each key. And we also set a timeout duration with groupState.setTimeoutDuration(TimeUnit.HOURS.toMillis(4)), assuming groupState is the instance of my groupState for a key.
As I understand, as micro batches of stream data keep coming in, spark maintain an intermediate state as we define in user defined function. Lets say the intermediate state after processing n micro batches of data is as follows:
State for Key1:
{
key1: [v1, v2, v3, v4, v5]
}
State for key2:
{
key2: [v11, v12, v13, v14, v15]
}
For any new data that come in, myFunction is called with state for the particular key. Eg. for key1, myFunction is called with key1, new key1 values, [v1,v2,v3,v4,v5] and it updates the key1 state as per the logic.
I read about the timeout and I found Timeout dictates how long we should wait before timing out some intermediate state.
Questions:
If this process run indefinitely, my intermediate states will keep on piling and hit the memory limits on nodes. So when are these intermediate states cleared. I found that in case of event time aggregation, watermarks dictates when the intermediate states will be cleared.
What does timing out the intermediate state mean in the context of Processing time.

If this process run indefinitely, my intermediate states will keep on piling and hit the memory limits on nodes. So when are these intermediate states cleared. I found that in case of event time aggregation, watermarks dictates when the intermediate states will be cleared.
Apache Spark will mark them as expired after the expiration time, so in your example after 4 hours of inactivity (real time + 4 hours, inactivity = no new event updating the state).
What does timing out the intermediate state mean in the context of Processing time.
It means that it will time out accordingly to the real clock (processing time, org.apache.spark.util.SystemClock class). You can check what clock is currently used by analyzing org.apache.spark.sql.streaming.StreamingQueryManager#startQuery triggerClock parameter.
You will find more details in FlatMapGroupsWithStateExec class, and more particularly here:
// Generate a iterator that returns the rows grouped by the grouping function
// Note that this code ensures that the filtering for timeout occurs only after
// all the data has been processed. This is to ensure that the timeout information of all
// the keys with data is updated before they are processed for timeouts.
val outputIterator =
processor.processNewData(filteredIter) ++ processor.processTimedOutState()
And if you analyze these 2 methods, you will see that:
processNewData applies mapping function to all active keys (present in the micro-batch)
/**
* For every group, get the key, values and corresponding state and call the function,
* and return an iterator of rows
*/
def processNewData(dataIter: Iterator[InternalRow]): Iterator[InternalRow] = {
val groupedIter = GroupedIterator(dataIter, groupingAttributes, child.output)
groupedIter.flatMap { case (keyRow, valueRowIter) =>
val keyUnsafeRow = keyRow.asInstanceOf[UnsafeRow]
callFunctionAndUpdateState(
stateManager.getState(store, keyUnsafeRow),
valueRowIter,
hasTimedOut = false)
}
}
processTimedOutState calls the mapping function on all expired states
def processTimedOutState(): Iterator[InternalRow] = {
if (isTimeoutEnabled) {
val timeoutThreshold = timeoutConf match {
case ProcessingTimeTimeout => batchTimestampMs.get
case EventTimeTimeout => eventTimeWatermark.get
case _ =>
throw new IllegalStateException(
s"Cannot filter timed out keys for $timeoutConf")
}
val timingOutPairs = stateManager.getAllState(store).filter { state =>
state.timeoutTimestamp != NO_TIMESTAMP && state.timeoutTimestamp < timeoutThreshold
}
timingOutPairs.flatMap { stateData =>
callFunctionAndUpdateState(stateData, Iterator.empty, hasTimedOut = true)
}
} else Iterator.empty
}
An important point to notice here is that Apache Spark will keep expired state in the state store if you don't invoke GroupState#remove method. The expired states won't be returned for processing though because they're flagged with NO_TIMESTAMP field. However, they will be stored in the state store delta files which may slow down the reprocessing if you need to reload the most recent state. If you analyze FlatMapGroupsWithStateExec again, you will see that the state is removed only when the state removed flag is set to true:
def callFunctionAndUpdateState(...)
// ...
// When the iterator is consumed, then write changes to state
def onIteratorCompletion: Unit = {
if (groupState.hasRemoved && groupState.getTimeoutTimestamp == NO_TIMESTAMP) {
stateManager.removeState(store, stateData.keyRow)
numUpdatedStateRows += 1
} else {
val currentTimeoutTimestamp = groupState.getTimeoutTimestamp
val hasTimeoutChanged = currentTimeoutTimestamp != stateData.timeoutTimestamp
val shouldWriteState = groupState.hasUpdated || groupState.hasRemoved || hasTimeoutChanged
if (shouldWriteState) {
val updatedStateObj = if (groupState.exists) groupState.get else null
stateManager.putState(store, stateData.keyRow, updatedStateObj, currentTimeoutTimestamp)
numUpdatedStateRows += 1
}
}
}

How do I limit write operations to 1k records/sec?

Currently, I am able to write to database in the batchsize of 500. But due to the memory shortage error and delay synchronization between child aggregator and leaf node of database, sometimes I am running into Leaf Node Memory Error. The only solution for this is if I limit my write operations to 1k records per second, I can get rid of the error.
dataStream
.map(line => readJsonFromString(line))
.grouped(memsqlBatchSize)
.foreach { recordSet =>
val dbRecords = recordSet.map(m => (m, Events.transform(m)))
dbRecords.map { record =>
try {
Events.setValues(eventInsert, record._2)
eventInsert.addBatch
} catch {
case e: Exception =>
logger.error(s"error adding batch: ${e.getMessage}")
val error_event = Events.jm.writeValueAsString(mapAsJavaMap(record._1.asInstanceOf[Map[String, Object]]))
logger.error(s"event: $error_event")
}
}
// Bulk Commit Records
try {
eventInsert.executeBatch
} catch {
case e: java.sql.BatchUpdateException =>
val updates = e.getUpdateCounts
logger.error(s"failed commit: ${updates.toString}")
updates.zipWithIndex.filter { case (v, i) => v == Statement.EXECUTE_FAILED }.foreach { case (v, i) =>
val error = Events.jm.writeValueAsString(mapAsJavaMap(dbRecords(i)._1.asInstanceOf[Map[String, Object]]))
logger.error(s"insert error: $error")
logger.error(e.getMessage)
}
}
finally {
connection.commit
eventInsert.clearBatch
logger.debug(s"committed: ${dbRecords.length.toString}")
}
}
The reason for 1k records is that, some of the data that I am trying to write can contains tons of json records and if batch size if 500, that may result in 30k records per second. Is there any way so that I can make sure that only 1000 records will be written to the database in a batch irrespective of the number of records?

I don't think Thead.sleep is a good idea to handle this situation. Generally we don't recommend to do so in Scala and we don't want to block the thread in any case.
One suggestion would be using any Streaming techniques such as Akka.Stream, Monix.Observable. There are some pro and cons between those libraries I don't want to spend too much paragraph on it. But they do support back pressure to control the producing rate when consumer is slower than producer. For example, in your case your consumer is database writing and your producer maybe is reading some json files and doing some aggregations.
The following code illustrates the idea and you will need to modify as your need:
val sourceJson = Source(dataStream.map(line => readJsonFromString(line)))
val sinkDB = Sink(Events.jm.writeValueAsString) // you will need to figure out how to generate the Sink
val flowThrottle = Flow[String]
.throttle(1, 1.second, 1, ThrottleMode.shaping)
val runnable = sourceJson.via[flowThrottle].toMat(sinkDB)(Keep.right)
val result = runnable.run()

The code block is already called by a thread and there are multiple threads running in parallel. Either I can use Thread.sleep(1000) or delay(1.0) in this scala code. But if I use delay() it will use a promise which might have to call outside the function. Looks like Thread.sleep() is the best option along with batch size of 1000. After performing the testing, I could benchmark 120,000 records/thread/sec without any problem.
According to the architecture of memsql, all loads into memsql are done into a rowstore first into the local memory and from there memsql will merge into the columnstore at the end leaves. That resulted into the leaf error everytime I pushed more number of data causing bottleneck. Reducing the batchsize and introducing a Thread.sleep() helped me writing 120,000 records/sec. Performed testing with this benchmark.

What is an efficient way to partition by column but maintain a fixed partition count?

What is the best way to partition the data by a field into predefined partition count?
I am currently partitioning the data by specifying the partionCount=600. The count 600 is found to give best query performance for my dataset/cluster setup.
val rawJson = sqlContext.read.json(filename).coalesce(600)
rawJson.write.parquet(filenameParquet)
Now I want to partition this data by the column 'eventName' but still keep the count 600. The data currently has around 2000 unique eventNames, plus the number of rows in each eventName is not uniform. Around 10 eventNames have more than 50% of the data causing data skew. Hence if I do the partitioning like below, its not very performant. The write is taking 5x more time than without.
val rawJson = sqlContext.read.json(filename)
rawJson.write.partitionBy("eventName").parquet(filenameParquet)
What is a good way to partition the data for these scenarios? Is there a way to partition by eventName but spread this into 600 partitions?
My schema looks like this:
{
"eventName": "name1",
"time": "2016-06-20T11:57:19.4941368-04:00",
"data": {
"type": "EventData",
"dataDetails": {
"name": "detailed1",
"id": "1234",
...
...
}
}
}
Thanks!

This is a common problem with skewed data and there are several approaches you can take.
List bucketing works if the skew remains stable over time, which may or may not be the case, especially if new values of the partitioning variable are introduced. I have not researched how easy it is to adjust list bucketing over time and, as your comment states, you can't use that anyway because it is a Spark 2.0 feature.
If you are on 1.6.x, the key observation is that you can create your own function that maps each event name into one of 600 unique values. You can do this as a UDF or as a case expression. Then, you simply create a column using that function and then partition by that column using repartition(600, 'myPartitionCol) as opposed to coalesce(600).
Because we deal with very skewed data at Swoop, I've found the following workhorse data structure to be quite useful for building partitioning-related tools.
/** Given a key, returns a random number in the range [x, y) where
* x and y are the numbers in the tuple associated with a key.
*/
class RandomRangeMap[A](private val m: Map[A, (Int, Int)]) extends Serializable {
private val r = new java.util.Random() // Scala Random is not serializable in 2.10
def apply(key: A): Int = {
val (start, end) = m(key)
start + r.nextInt(end - start)
}
override def toString = s"RandomRangeMap($r, $m)"
}
For example, here is how we build a partitioner for a slightly different case: one where the data is skewed and the number of keys is small so we have to increase the number of partitions for the skewed keys while sticking with 1 as the minimum number of partitions per key:
/** Partitions data such that each unique key ends in P(key) partitions.
* Must be instantiated with a sequence of unique keys and their Ps.
* Partition sizes can be highly-skewed by the data, which is where the
* multiples come in.
*
* #param keyMap maps key values to their partition multiples
*/
class ByKeyPartitionerWithMultiples(val keyMap: Map[Any, Int]) extends Partitioner {
private val rrm = new RandomRangeMap(
keyMap.keys
.zip(
keyMap.values
.scanLeft(0)(_+_)
.zip(keyMap.values)
.map {
case (start, count) => (start, start + count)
}
)
.toMap
)
override val numPartitions =
keyMap.values.sum
override def getPartition(key: Any): Int =
rrm(key)
}
object ByKeyPartitionerWithMultiples {
/** Builds a UDF with a ByKeyPartitionerWithMultiples in a closure.
*
* #param keyMap maps key values to their partition multiples
*/
def udf(keyMap: Map[String, Int]) = {
val partitioner = new ByKeyPartitionerWithMultiples(keyMap.asInstanceOf[Map[Any, Int]])
(key:String) => partitioner.getPartition(key)
}
}
In your case, you have to merge several event names into a single partition, which would require changes but I hope the code above gives you an idea how to approach the problem.
One final observation is that if the distribution of event names values a lot in your data over time, you can perform a statistics gathering pass over some part of the data to compute a mapping table. You don't have to do this all the time, just when it is needed. To determine that, you can look at the number of rows and/or size of output files in each partition. In other words, the entire process can be automated as part of your Spark jobs.

Spark RDD.isEmpty costs much time

I built a Spark cluster.
workers:2
Cores:12
Memory: 32.0 GB Total, 20.0 GB Used
Each worker gets 1 cpu, 6 cores and 10.0 GB memory
My program gets data source from MongoDB cluster. Spark and MongoDB cluster are in the same LAN(1000Mbps).
MongoDB document format:
{name:string, value:double, time:ISODate}
There is about 13 million documents.
I want to get the average value of a special name from a special hour which contains 60 documents.
Here is my key function
/*
*rdd=sc.newAPIHadoopRDD(configOriginal, classOf[com.mongodb.hadoop.MongoInputFormat], classOf[Object], classOf[BSONObject])
Apache-Spark-1.3.1 scala doc: SparkContext.newAPIHadoopFile[K, V, F <: InputFormat[K, V]](path: String, fClass: Class[F], kClass: Class[K], vClass: Class[V], conf: Configuration = hadoopConfiguration): RDD[(K, V)]
*/
def findValueByNameAndRange(rdd:RDD[(Object,BSONObject)],name:String,time:Date): RDD[BasicBSONObject]={
val nameRdd = rdd.map(arg=>arg._2).filter(_.get("name").equals(name))
val timeRangeRdd1 = nameRdd.map(tuple=>(tuple, tuple.get("time").asInstanceOf[Date]))
val timeRangeRdd2 = timeRangeRdd1.map(tuple=>(tuple._1,duringTime(tuple._2,time,getHourAgo(time,1))))
val timeRangeRdd3 = timeRangeRdd2.filter(_._2).map(_._1)
val timeRangeRdd4 = timeRangeRdd3.map(x => (x.get("name").toString, x.get("value").toString.toDouble)).reduceByKey(_ + _)
if(timeRangeRdd4.isEmpty()){
return basicBSONRDD(name, time)
}
else{
return timeRangeRdd4.map(tuple => {
val bson = new BasicBSONObject()
bson.put("name", tuple._1)
bson.put("value", tuple._2/60)
bson.put("time", time)
bson })
}
}
Here is part of Job information
My program works so slowly. Does it because of isEmpty and reduceByKey? If yes, how can I improve it ? If not, why?
=======update ===
timeRangeRdd3.map(x => (x.get("name").toString, x.get("value").toString.toDouble)).reduceByKey(_ + _)
is on the line of 34
I know reduceByKey is a global operation, and may costs much time, however, what it costed is beyond my budget. How can I improvet it or it is the defect of Spark. With the same calculation and hardware, it just costs several seconds if I use multiple thread of java.

First, isEmpty is merely the point at which the RDD stage ends. The maps and filters do not create a need for a shuffle, and the method used in the UI is always the method that triggers a stage change/shuffle...in this case isEmpty. Why it's running slow is not as easy to discern from this perspective, especially without seeing the composition of the originating RDD. I can tell you that isEmpty first checks the partition size and then does a take(1) and verifies whether data was returned or not. So, the odds are that there is a bottle neck in the network or something else blocking along the way. It could even be a GC issue... Click into the isEmpty and see what more you can discern from there.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string