I am trying to use the mapGroupsWithState method for stateful structured streaming on my incoming stream of data. The problem I face is that the key I am choosing for groupByKey makes my state grow too big, too fast. The obvious way out would be to change the key, but the business logic I wish to apply in the update method requires the key to be exactly the same as I have it right now, OR, if it is possible, to access GroupState for all keys.
For example, I have a stream of data coming in from various organizations, and typically an organization contains a userId, personId, etc. Please see the code below:
val stream: Dataset[User] = dataFrame.as[User]
val noTimeout = GroupStateTimeout.NoTimeout
val statisticStream = stream
  .groupByKey(key => key.orgId)
  .mapGroupsWithState(noTimeout)(updateUserStatistic)
val df = statisticStream.toDF()
val query = df
  .writeStream
  .outputMode(OutputMode.Update())
  .option("checkpointLocation", s"$checkpointLocation/$name")
  .foreach(new UserCountWriter(spark.sparkContext.getConf))
  .queryName(name)
  .trigger(Trigger.ProcessingTime(Duration.apply("10 seconds")))
  .start()
case classes:
case class User(
  orgId: Long,
  profileId: Long,
  userId: Long)

case class UserStatistic(
  orgId: Long,
  known: Long,
  unknown: Long,
  userSeq: Seq[User])
update method:
def updateUserStatistic(
    orgId: Long,
    newEvents: Iterator[User],
    oldState: GroupState[UserStatistic]): UserStatistic = {
  var state: UserStatistic = if (oldState.exists) oldState.get else UserStatistic(orgId, 0L, 0L, Seq.empty)
  for (event <- newEvents) {
    // business logic, e.g. checking whether the userId in this organization is of a certain
    // type and then updating the known or unknown counter for that particular user accordingly
  }
  oldState.update(state)
  state
}
The problem gets worse when I have to execute this on the driver-executor model, as I am expecting 1 to 10 million users in every organization, which could mean that many states on a single executor (correct me if I am wrong in understanding this).
Possible solutions that failed:
Grouping by userId as the key: this fails because I am then unable to get all userIds for a given orgId, since GroupStates are kept as aggregation key/value pairs and here the key is the userId. So for every new userId a new state is created, even if it belongs to the same organization.
Any help or suggestions are appreciated.
Your state keeps increasing in size because in the current implementation no key/state pair will ever be removed from the GroupState.
To mitigate exactly the problem you are facing (infinitely increasing state), the mapGroupsWithState method allows you to use a timeout. You can choose between two types of timeouts:
Processing-Time timeouts using GroupStateTimeout.ProcessingTimeTimeout with GroupState.setTimeoutDuration(), or
Event-Time timeouts using GroupStateTimeout.EventTimeTimeout with GroupState.setTimeoutTimestamp().
Note that the difference between them is a duration-based timeout versus the more flexible timestamp-based timeout; the latter additionally requires a watermark to be defined on the input.
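Wiring either variant into your pipeline only changes the timeout argument passed to mapGroupsWithState. As a minimal sketch (the eventTime column in the event-time case is a hypothetical name, since your User class does not carry a timestamp yet):

// processing-time variant
stream
  .groupByKey(_.orgId)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(updateUserStatistic)

// event-time variant; requires a watermark on an event-time column
stream
  .withWatermark("eventTime", "10 minutes") // "eventTime" is a hypothetical column
  .groupByKey(_.orgId)
  .mapGroupsWithState(GroupStateTimeout.EventTimeTimeout)(updateUserStatistic)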
In the ScalaDocs of the trait GroupState you will find a nice template on how to use timeouts in your mapping function:
def mappingFunction(key: String, value: Iterator[Int], state: GroupState[Int]): String = {
  if (state.hasTimedOut) {              // If called when timing out, remove the state
    state.remove()
  } else if (state.exists) {            // If state exists, use it for processing
    val existingState = state.get       // Get the existing state
    val shouldRemove = ...              // Decide whether to remove the state
    if (shouldRemove) {
      state.remove()                    // Remove the state
    } else {
      val newState = ...
      state.update(newState)            // Set the new state
      state.setTimeoutDuration("1 hour") // Set the timeout
    }
  } else {
    val initialState = ...
    state.update(initialState)          // Set the initial state
    state.setTimeoutDuration("1 hour")  // Set the timeout
  }
  ...
  // return something
}
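Applied to your case, a minimal sketch of the processing-time variant could look like the one below. It picks an arbitrary one-hour timeout; when an orgId receives no data for that long, the function is invoked once more with an empty iterator and hasTimedOut == true, which is your chance to drop the state for idle organizations:

def updateUserStatistic(
    orgId: Long,
    newEvents: Iterator[User],
    oldState: GroupState[UserStatistic]): UserStatistic = {
  if (oldState.hasTimedOut) {
    // no events for this orgId within the timeout: emit the last state and drop it
    val finalState = oldState.get
    oldState.remove()
    finalState
  } else {
    var state = if (oldState.exists) oldState.get else UserStatistic(orgId, 0L, 0L, Seq.empty)
    for (event <- newEvents) {
      // your business logic updating `state` goes here
    }
    oldState.update(state)
    oldState.setTimeoutDuration("1 hour") // arbitrary; tune to how long an org may stay idle
    state
  }
}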
In Service Fabric I am trying to call an ActorService and get a list of all actors. I'm not getting any errors, but no actors are returned. It's always zero.
This is how I add actors:
ActorProxy.Create<IUserActor>(
    new ActorId(uniqueName),
    "fabric:/ECommerce/UserActorService");
And this is how I try to get a list of all actors:
var proxy = ActorServiceProxy.Create(new Uri("fabric:/ECommerce/UserActorService"), 0);
ContinuationToken continuationToken = null;
CancellationToken cancellationToken = new CancellationTokenSource().Token;
List<ActorInformation> activeActors = new List<ActorInformation>();

do
{
    PagedResult<ActorInformation> page = await proxy.GetActorsAsync(continuationToken, cancellationToken);
    activeActors.AddRange(page.Items.Where(x => x.IsActive));
    continuationToken = page.ContinuationToken;
}
while (continuationToken != null);
But no matter how many users I've added, the page object will always have zero items. What am I missing?
The second argument int in ActorServiceProxy.Create(Uri, int, string) is the partition key (you can find out more about actor partitioning here).
The issue here is that your code checks only one partition (partitionKey = 0).
So the solution is quite simple - you have to iterate over all partitions of your service. Here is an answer with a code sample showing how to get the partitions and iterate over them.
UPDATE 2019.07.01
I didn't spot this the first time, but the reason why you aren't getting any actors returned is that you aren't creating any actors - you are creating proxies!
The reason for the confusion is that Service Fabric actors are virtual, i.e. from the user's point of view an actor always exists, while in real life Service Fabric manages the actor object's lifetime automatically, persisting and restoring its state as needed.
Here is a quote from the documentation:
An actor is automatically activated (causing an actor object to be constructed) the first time a message is sent to its actor ID. After some period of time, the actor object is garbage collected. In the future, using the actor ID again, causes a new actor object to be constructed. An actor's state outlives the object's lifetime when stored in the state manager.
In your example you've never sent any messages to actors!
Here is a code example I wrote in Program.cs of a newly created Actor project:
// Please don't forget to replace "fabric:/Application16/Actor1ActorService" with your actor service name.
ActorRuntime.RegisterActorAsync<Actor1>(
    (context, actorType) =>
        new ActorService(context, actorType)).GetAwaiter().GetResult();

var actor = ActorProxy.Create<IActor1>(
    ActorId.CreateRandom(),
    new Uri("fabric:/Application16/Actor1ActorService"));

_ = actor.GetCountAsync(default).GetAwaiter().GetResult();

ContinuationToken continuationToken = null;
var activeActors = new List<ActorInformation>();
var serviceName = new Uri("fabric:/Application16/Actor1ActorService");

using (var client = new FabricClient())
{
    var partitions = client.QueryManager.GetPartitionListAsync(serviceName).GetAwaiter().GetResult();
    foreach (var partition in partitions)
    {
        var pi = (Int64RangePartitionInformation) partition.PartitionInformation;
        var proxy = ActorServiceProxy.Create(new Uri("fabric:/Application16/Actor1ActorService"), pi.LowKey);
        var page = proxy.GetActorsAsync(continuationToken, default).GetAwaiter().GetResult();
        activeActors.AddRange(page.Items);
        continuationToken = page.ContinuationToken;
    }
}

Thread.Sleep(Timeout.Infinite);
Pay special attention to the line:
_ = actor.GetCountAsync(default).GetAwaiter().GetResult();
Here is where the first message to actor is sent.
Hope this helps.
Let's say I have two types of Hazelcast nodes running on cluster:
"Leader" nodes – these are able to load and populate Hazelcast map M. Leaders will also update values in M from time to time (based on external resource).
"Follower" nodes – these will need to read from M
My intent is for Follower nodes to trigger loading missing elements into M (the loading thus needs to be done on the Leader side).
Roughly, the steps made to get an element from map could look like this:
IMap m = hazelcastInstance.getMap("M");
if (!m.containsKey(k)) {
    if (iAmLeader()) {
        Object fresh = loadByKey(k); // loading from external resource
        m.put(k, fresh);
        return fresh; // note: m.put(k, fresh) itself returns the previous value, not the new one
    } else {
        makeSomeLeaderPopulateValueForKey(k);
    }
}
return m.get(k);
What approach could you suggest?
Notes
I want Followers to act as nodes, not just clients, because there are going to be far more Follower instances than Leaders and I would like them to participate in load distribution.
I could just build another level of service that would run only on Leader nodes and provide an interface to populate the map with requested keys. But that would mean adding an extra layer of communication and configuration, and I was hoping that the kind of requirements stated above could be solved within a single Hazelcast cluster.
I think I may have found an answer in the form of MapLoader (EDIT: since originally posting, I have confirmed this is indeed the way to do this).
final Config config = new Config();
config.getMapConfig("MY_MAP_NAME").setMapStoreConfig(
    new MapStoreConfig().setImplementation(new MapLoader<KeyType, ValueType>() {
        @Override
        public ValueType load(final KeyType key) {
            // when a client asks for data for a corresponding key of type
            // KeyType that isn't already loaded,
            // this function will be invoked and give you a chance
            // to load it and return it
            ValueType rv = ...;
            return rv;
        }

        @Override
        public Map<KeyType, ValueType> loadAll(
                final Collection<KeyType> keys) {
            // Similar to MapLoader#load(KeyType), except this is
            // a batched version of it for performance gains.
            // This gets called on first access to the cache,
            // where MapLoader#loadAllKeys() is called to get
            // the keys parameter for this function
            Map<KeyType, ValueType> rv = new HashMap<>();
            keys.forEach((key) -> {
                rv.put(key, /*figure out what key means*/);
            });
            return rv;
        }

        @Override
        public Set<KeyType> loadAllKeys() {
            // Prepopulate all the keys. My understanding is that
            // this is an initialization step, to give you a chance
            // to load data on startup so an initial set of data
            // will be available to anyone using the cache. Any keys
            // returned here are sent to MapLoader#loadAll(Collection)
            Set<KeyType> rv = new HashSet<>();
            // figure out what keys need to be in the return value
            // to load a key into the cache at first access to this map,
            // named "MY_MAP_NAME" in this example
            return rv;
        }
    }));
config.getGroupConfig().setName("MY_INSTANCE_NAME").setPassword("my_password");
final HazelcastInstance hazelcast = Hazelcast
    .getOrCreateHazelcastInstance(config);
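With this configured, a Follower does not need any extra API to trigger the load: reading a missing key via IMap.get causes Hazelcast to invoke MapLoader.load(key) on the member that owns that key's partition. A minimal sketch of the follower-side read path (written in Scala against the same Java API; someKey and the KeyType/ValueType placeholders are carried over from the example above):

val map = hazelcast.getMap[KeyType, ValueType]("MY_MAP_NAME")

// if someKey is absent, Hazelcast calls MapLoader.load(someKey) on the
// partition owner, caches the result in the map, and returns it here
val value: ValueType = map.get(someKey)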
I am creating an un-partitioned change feed that I want to resume, e.g. poll for new changes every X seconds. The checkpoint variable below holds the continuation token from the last response.
private string checkpoint;

private async Task ReadEvents()
{
    FeedResponse<dynamic> feedResponse;
    do
    {
        feedResponse = await client.ReadDocumentFeedAsync(commitsLink, new FeedOptions
        {
            MaxItemCount = this.subscriptionOptions.MaxItemCount,
            RequestContinuation = checkpoint
        });
        if (feedResponse.ResponseContinuation != null)
        {
            checkpoint = feedResponse.ResponseContinuation;
        }
        // Code to process docs goes here...
    } while (feedResponse.ResponseContinuation != null);
}
Note the use of the "if" block around the checkpoint assignment. This is done because if I leave it out, the continuation gets set to null, which essentially restarts the polling cycle, as setting the request continuation to null will pull the first set of documents in the change feed.
However, the downside is that each polling loop will replay the previous set of documents rather than just any additional changes. Is there anything I can do to optimize this further, or is this a limitation of the change feed API?
In order to read the change feed you must use CreateDocumentChangeFeedQuery (which never resets the ResponseContinuation), instead of ReadDocumentFeed (which sets it to null when there are no more results).
See https://learn.microsoft.com/en-us/azure/documentdb/documentdb-change-feed#working-with-the-rest-api-and-sdk for a sample.
On Spark 2.0.x, I have been using a JobProgressListener implementation to retrieve Job/Stage/Task progress information in real-time from our cluster. I understand how the event flow works, and successfully receive updates on the work.
My problem is that we have several different submissions running at the same time on the same Spark Context, and it is seemingly impossible to differentiate between which Job/Stage/Task belongs to each submittal. Each Job/Stage/Task receives a unique id, which is great. However, I'm looking for a way to provide a submission "id" or "name" that would be returned along with the received JobProgressListener event objects.
I realize that the "Job Group" can be set on the Spark Context, but if multiple jobs are simultaneously running on the same context, they will become scrambled.
Is there a way I can sneak in custom properties that would be returned with the listener events for a single SQLContext? In so doing, I should be able to link up subsequent Stage and Task events and get what I need.
Please note: I am not using spark-submit for these jobs. They are being executed using Java references to a SparkSession/SQLContext.
Thanks for any solutions or ideas.
I'm using a local property - this can be accessed from the listener during the onStageSubmitted event. After that I'm using the corresponding stageId in order to identify the tasks executed during that stage.
Future {
  sc.setLocalProperty("job-context", "second")
  val listener = new MetricListener("second")
  sc.addSparkListener(listener)
  // do some Spark actions
  val df = spark.read.load("...")
  val countResult = df.filter(....).count()
  println(listener.rows)
  sc.removeSparkListener(listener)
}
class MetricListener(name: String) extends SparkListener {
  var rows: Long = 0L
  var stageId = -1

  override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): Unit = {
    if (stageSubmitted.properties.getProperty("job-context") == name) {
      stageId = stageSubmitted.stageInfo.stageId
    }
  }

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    if (taskEnd.stageId == stageId)
      rows = rows + taskEnd.taskMetrics.inputMetrics.recordsRead
  }
}
We are using mapWithState to keep track of incoming Booking events. Once a group of Booking events has been completed I would like to publish a new Order event to a different Spark Stream, which will have its own key and its own state. Bookings will be on the Account level, and Orders will be on the Client level. I can create two custom streams like this:
val bookingsReceiver = new MyJMSReceiver()
val bookingsReceiverStream = ssc.receiverStream(bookingsReceiver)
val ordersReceiver = new MyCustomInMemoryReceiver()
val ordersReceiverStream = ssc.receiverStream(ordersReceiver)
So the definitions of the mapWithState StateSpec functions would look like this:
def bookingsStateFunc(batchTime: Time, key: AccountID, value: Option[Booking], state: State[(AccountID, BookingRequest)]): Option[(AccountID, BookingRequest)]
def ordersStateFunc(batchTime: Time, key: ClientID, value: Option[Order], state: State[(ClientID, OrderHistory)]): Option[(ClientID, OrderHistory)]
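For reference, this is roughly how I wire such a function up (assuming Booking carries an accountId to key the stream by; the resulting bookingsStateStream is referred to again below):

val bookingsStateSpec = StateSpec.function(bookingsStateFunc _)
val bookingsStateStream = bookingsReceiverStream
  .map(booking => (booking.accountId, booking)) // assumes Booking exposes its AccountID
  .mapWithState(bookingsStateSpec)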
And I can create a new Order when a Booking is complete, like this:
ordersStream.foreachRDD { rdd =>
  rdd.filter(_._2.bookingStatus == BookingStatus.Completed).foreach { bookingRequest =>
    val order = Order(bookingRequest)
    // What can I do to send the new Order to the Orders stream?
    ordersReceiver.storeOrder(order) // This will not work, as ordersReceiver is not available on the Workers ...
  }
}
I could create an external message queue to publish the new Orders to. But is there a way to either chain the new data into an existing stream, or to collect the new data on the Driver?
This example Custom Receiver will not work but shows how I am thinking...
class MyCustomInMemoryReceiver extends Receiver[Order](StorageLevel.MEMORY_AND_DISK_2) {
  def onStart() = {}
  def onStop() = {}
  def storeOrder(order: Order) = store(order)
}
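One way I imagine the "collect the new data to the Driver" option could work is with a queueStream: collect the completed bookings on the driver, turn them into Orders, and enqueue them as an RDD into a driver-side queue that backs the orders stream. This is only a sketch, assuming bookingsStateStream is the mapWithState output keyed by AccountID (as wired above) and that the per-batch volume of completed bookings is small enough for collect() to be safe; note also that queueStream does not support checkpointing:

import scala.collection.mutable
import org.apache.spark.rdd.RDD

// driver-side queue backing the orders stream
val orderRddQueue = new mutable.Queue[RDD[Order]]()
val ordersStream = ssc.queueStream(orderRddQueue)

bookingsStateStream.foreachRDD { rdd =>
  // bring the completed bookings to the driver as Orders ...
  val completedOrders = rdd
    .filter(_._2.bookingStatus == BookingStatus.Completed)
    .map { case (_, bookingRequest) => Order(bookingRequest) }
    .collect()
  // ... and feed them into the orders stream as a new RDD
  if (completedOrders.nonEmpty) {
    orderRddQueue.enqueue(ssc.sparkContext.parallelize(completedOrders))
  }
}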