Connecting two DStreams to have separate mapWithState States - apache-spark

We are using mapWithState to keep track of incoming Booking events. Once a group of Booking events has been completed I would like to publish a new Order event to a different Spark Stream, which will have its own key and its own state. Bookings will be on the Account level, and Orders will be on the Client level. I can create two custom streams like this:
val bookingsReceiver = new MyJMSReceiver()
val bookingsReceiverStream = ssc.receiverStream(bookingsReceiver)
val ordersReceiver = new MyCustomInMemoryReceiver()
val ordersReceiverStream = ssc.receiverStream(ordersReceiver)
So the definitions of the mapWithState StateSpec functions would look like this:
def bookingsStateFunc(batchTime: Time, key: AccountID, value: Option[Booking], state: State[(AccountID, BookingRequest)]): Option[(AccountID, BookingRequest)]
def ordersStateFunc(batchTime: Time, key: ClientID, value: Option[Order], state: State[(ClientID, OrderHistory)]): Option[(ClientID, OrderHistory)]
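For context, a rough sketch of how these functions would be attached with StateSpec (this is my assumption, not shown in the question: that MyJMSReceiver emits Booking events and that Booking and Order expose accountId and clientId fields for keying):
import org.apache.spark.streaming.StateSpec

// Sketch only: key the raw receiver streams and attach the state functions.
// The accountId/clientId fields used for keying are assumptions.
val bookingsStream = bookingsReceiverStream
  .map(booking => (booking.accountId, booking))
  .mapWithState(StateSpec.function(bookingsStateFunc _))

val ordersStream = ordersReceiverStream
  .map(order => (order.clientId, order))
  .mapWithState(StateSpec.function(ordersStateFunc _))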
And I can create a new Order when a Booking is complete, like this:
bookingsStream.foreachRDD { rdd =>
  rdd.filter(_._2.bookingStatus == BookingStatus.Completed).foreach { case (_, bookingRequest) =>
    val order = Order(bookingRequest)
    // What can I do to send the new Order to the Orders stream?
    ordersReceiver.storeOrder(order) // This will not work as ordersReceiver is not available on the Workers ...
  }
}
I could create an external message queue to publish the new Orders to. But is there a way to either chain the new data to an existing stream? Or is there a way to collect the new data to the Driver?
This example Custom Receiver will not work but shows how I am thinking...
class MyCustomInMemoryReceiver extends Receiver[Order](StorageLevel.MEMORY_AND_DISK_2) {
  def onStart() = {}
  def onStop() = {}
  def storeOrder(order: Order) = store(order)
}
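For what it's worth, one way to sketch the "collect the new data to the Driver" option (an assumption on my part, not something from the original post) is to drop the custom receiver and feed a queue-backed DStream from inside foreachRDD, which executes on the driver; the order.clientId field used for keying is likewise assumed:
import scala.collection.mutable
import org.apache.spark.rdd.RDD

// Hypothetical replacement for MyCustomInMemoryReceiver: a driver-side queue
// backing the orders stream.
val ordersQueue = new mutable.Queue[RDD[(ClientID, Order)]]()
val ordersInputStream = ssc.queueStream(ordersQueue)

bookingsStream.foreachRDD { rdd =>
  // collect() brings the completed bookings to the driver, where the queue is reachable
  val completed = rdd.filter(_._2.bookingStatus == BookingStatus.Completed).collect()
  if (completed.nonEmpty) {
    val orders = completed.map { case (_, bookingRequest) =>
      val order = Order(bookingRequest)
      (order.clientId, order) // keyed by ClientID for the orders-side mapWithState
    }
    ordersQueue += ssc.sparkContext.parallelize(orders)
  }
}
The resulting ordersInputStream could then go through mapWithState with ordersStateFunc just like a receiver-based stream. Note that ssc.queueStream does not support checkpointing, so this only works if checkpoint-based recovery of the orders stream is not required.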

Related

Spark Stateful Structured Streaming: State getting too big in mapGroupsWithState

I am trying to use the mapGroupsWithState method for stateful Structured Streaming on my incoming stream of data. The problem I face is that the key I am choosing for groupByKey makes my state grow too big too fast. The obvious way out would be to change the key, but the business logic I wish to apply in the update method requires the key to be exactly the same as I have it right now, OR, if it is possible, access to the GroupState of all keys.
For example, I have a stream of data coming in from various organizations, and typically an organization contains a userId, personId, etc. Please see the code below:
val stream: Dataset[User] = dataFrame.as[User]
val noTimeout = GroupStateTimeout.NoTimeout
val statisticStream = stream
  .groupByKey(key => key.orgId)
  .mapGroupsWithState(noTimeout)(updateUserStatistic)
val df = statisticStream.toDF()
val query = df
  .writeStream
  .outputMode(Update())
  .option("checkpointLocation", s"$checkpointLocation/$name")
  .foreach(new UserCountWriter(spark.sparkContext.getConf))
  .queryName(name)
  .trigger(Trigger.ProcessingTime(Duration.apply("10 seconds")))
  .start()
case classes:
case class User(
  orgId: Long,
  profileId: Long,
  userId: Long)

case class UserStatistic(
  orgId: Long,
  known: Long,
  unknown: Long,
  userSeq: Seq[User])
update method:
def updateUserStatistic(
    orgId: Long,
    newEvents: Iterator[User],
    oldState: GroupState[UserStatistic]): UserStatistic = {
  var state: UserStatistic = if (oldState.exists) oldState.get else UserStatistic(orgId, 0L, 0L, Seq.empty)
  for (event <- newEvents) {
    // business logic like checking if the userId in this organization is of a certain type,
    // then updating the known or unknown attribute for that particular user accordingly
  }
  oldState.update(state)
  state
}
The problem gets worse when I have to execute this on the driver-executor model, as I am expecting 1-10 million users in every organization, which could mean that many states on a single executor (correct me if my understanding is wrong).
Possible solutions that failed:
Grouping by userId as the key - because then I am unable to get all userIds for a given orgId, since GroupState is kept per aggregation key and here that key is the userId; so for every new userId a new state is created, even if it belongs to the same organization.
Any help or suggestions are appreciated.
Your state keeps increasing in size because in the current implementation no key/state pair will ever be removed from the GroupState.
To mitigate exactly the problem you are facing (infinitely growing state), the mapGroupsWithState method allows you to use a timeout. You can choose between two types of timeouts:
Processing-time timeouts using GroupStateTimeout.ProcessingTimeTimeout with GroupState.setTimeoutDuration(), or
Event-time timeouts using GroupStateTimeout.EventTimeTimeout with GroupState.setTimeoutTimestamp().
Note that the difference between them is that the first is duration-based, while the second is the more flexible timestamp-based timeout.
In the ScalaDocs of the trait GroupState you will find a nice template on how to use timeouts in your mapping function:
def mappingFunction(key: String, value: Iterator[Int], state: GroupState[Int]): String = {
  if (state.hasTimedOut) {                // If called when timing out, remove the state
    state.remove()
  } else if (state.exists) {              // If state exists, use it for processing
    val existingState = state.get         // Get the existing state
    val shouldRemove = ...                // Decide whether to remove the state
    if (shouldRemove) {
      state.remove()                      // Remove the state
    } else {
      val newState = ...
      state.update(newState)              // Set the new state
      state.setTimeoutDuration("1 hour")  // Set the timeout
    }
  } else {
    val initialState = ...
    state.update(initialState)            // Set the initial state
    state.setTimeoutDuration("1 hour")    // Set the timeout
  }
  ...
  // return something
}
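Applied to the function from the question, a minimal sketch (assuming a one-hour processing-time timeout is acceptable for discarding idle orgId states; the other names are taken from the question) could look like this:
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

def updateUserStatisticWithTimeout(
    orgId: Long,
    newEvents: Iterator[User],
    oldState: GroupState[UserStatistic]): UserStatistic = {
  if (oldState.hasTimedOut) {
    // No new events for this orgId within the timeout: emit the last state and drop it.
    val last = oldState.get
    oldState.remove()
    last
  } else {
    var state = if (oldState.exists) oldState.get else UserStatistic(orgId, 0L, 0L, Seq.empty)
    for (event <- newEvents) {
      // business logic as in the question
    }
    oldState.update(state)
    oldState.setTimeoutDuration("1 hour") // reset the timeout whenever new data arrives
    state
  }
}

val statisticStream = stream
  .groupByKey(_.orgId)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(updateUserStatisticWithTimeout)
With EventTimeTimeout you would instead call setTimeoutTimestamp() and also need a watermark defined on the input stream.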

Can't confirm any actors are being created

In Service Fabric I am trying to call an ActorService and get a list of all actors. I'm not getting any errors, but no actors are returned. It's always zero.
This is how I add actors:
ActorProxy.Create<IUserActor>(
    new ActorId(uniqueName),
    "fabric:/ECommerce/UserActorService");
And this is how I try to get a list of all actors:
var proxy = ActorServiceProxy.Create(new Uri("fabric:/ECommerce/UserActorService"), 0);
ContinuationToken continuationToken = null;
CancellationToken cancellationToken = new CancellationTokenSource().Token;
List<ActorInformation> activeActors = new List<ActorInformation>();
do
{
    PagedResult<ActorInformation> page = await proxy.GetActorsAsync(continuationToken, cancellationToken);
    activeActors.AddRange(page.Items.Where(x => x.IsActive));
    continuationToken = page.ContinuationToken;
}
while (continuationToken != null);
But no matter how many users I've added, the page object will always have zero items. What am I missing?
The second argument int in ActorServiceProxy.Create(Uri, int, string) is the partition key (you can find out more about actor partitioning here).
The issue here is that your code checks only one partition (partitionKey = 0).
So the solution is quite simple - you have to iterate over all partitions of your service. Here is an answer with a code sample showing how to get the partitions and iterate over them.
UPDATE 2019.07.01
I didn't spot this the first time, but the reason you aren't getting any actors returned is that you aren't creating any actors - you are only creating proxies!
The reason for the confusion is that Service Fabric actors are virtual, i.e. from the user's point of view an actor always exists, while in reality Service Fabric manages the actor object's lifetime automatically, persisting and restoring its state as needed.
Here is a quote from the documentation:
An actor is automatically activated (causing an actor object to be constructed) the first time a message is sent to its actor ID. After some period of time, the actor object is garbage collected. In the future, using the actor ID again, causes a new actor object to be constructed. An actor's state outlives the object's lifetime when stored in the state manager.
In your example you never send any messages to the actors!
Here is a code example I wrote in the Program.cs of a newly created Actor project:
// Please don't forget to replace "fabric:/Application16/Actor1ActorService" with your actor service name.
ActorRuntime.RegisterActorAsync<Actor1>(
    (context, actorType) => new ActorService(context, actorType)).GetAwaiter().GetResult();

var actor = ActorProxy.Create<IActor1>(
    ActorId.CreateRandom(),
    new Uri("fabric:/Application16/Actor1ActorService"));

_ = actor.GetCountAsync(default).GetAwaiter().GetResult();

ContinuationToken continuationToken = null;
var activeActors = new List<ActorInformation>();
var serviceName = new Uri("fabric:/Application16/Actor1ActorService");

using (var client = new FabricClient())
{
    var partitions = client.QueryManager.GetPartitionListAsync(serviceName).GetAwaiter().GetResult();
    foreach (var partition in partitions)
    {
        var pi = (Int64RangePartitionInformation)partition.PartitionInformation;
        var proxy = ActorServiceProxy.Create(new Uri("fabric:/Application16/Actor1ActorService"), pi.LowKey);
        var page = proxy.GetActorsAsync(continuationToken, default).GetAwaiter().GetResult();
        activeActors.AddRange(page.Items);
        continuationToken = page.ContinuationToken;
    }
}

Thread.Sleep(Timeout.Infinite);
Pay special attention to the line:
_ = actor.GetCountAsync(default).GetAwaiter().GetResult();
This is where the first message to the actor is sent.
Hope this helps.

How do I configure Hazelcast read-through Map when only part of the nodes are able to populate the Map data?

Let's say I have two types of Hazelcast nodes running in a cluster:
"Leader" nodes – these are able to load and populate Hazelcast map M. Leaders will also update values in M from time to time (based on external resource).
"Follower" nodes – these will need to read from M
My intent is for Follower nodes to trigger loading missing elements into M (loading thus needs to be done on the Leader side).
Roughly, the steps to get an element from the map could look like this:
IMap m = hazelcastInstance.getMap("M");
if (!m.containsKey(k)) {
    if (iAmLeader()) {
        Object fresh = loadByKey(k); // loading from external resource
        return m.put(k, fresh);
    } else {
        makeSomeLeaderPopulateValueForKey(k);
    }
}
return m.get(k);
What approach could you suggest?
Notes
I want Followers to act as nodes, not just clients, because there are going to be far more Follower instances than Leaders and I would like them to participate in load distribution.
I could just build another level of service that would run only on Leader nodes and provide an interface to populate the map with requested keys. But that would mean adding an extra layer of communication and configuration, and I was hoping that the kind of requirements stated above could be met within a single Hazelcast cluster.
I think I may have found an answer in the form of MapLoader (EDIT: since originally posting, I have confirmed this is indeed the way to do it).
final Config config = new Config();
config.getMapConfig("MY_MAP_NAME").setMapStoreConfig(
    new MapStoreConfig().setImplementation(new MapLoader<KeyType, ValueType>() {

        @Override
        public ValueType load(final KeyType key) {
            // When a client asks for data for a key of type KeyType that isn't
            // already loaded, this function is invoked and gives you a chance
            // to load the value and return it.
            ValueType rv = ...;
            return rv;
        }

        @Override
        public Map<KeyType, ValueType> loadAll(final Collection<KeyType> keys) {
            // Similar to MapLoader#load(KeyType), except this is a batched
            // version of it for performance gains. It gets called on first
            // access to the cache, where MapLoader#loadAllKeys() is called to
            // obtain the keys parameter for this function.
            Map<KeyType, ValueType> rv = new HashMap<>();
            keys.forEach(key -> {
                rv.put(key, /*figure out what key means*/);
            });
            return rv;
        }

        @Override
        public Set<KeyType> loadAllKeys() {
            // Prepopulate all the keys. My understanding is that this is an
            // initialization step, to give you a chance to load data on startup
            // so an initial set of data is available to anyone using the cache.
            // Any keys returned here are sent to MapLoader#loadAll(Collection).
            Set<KeyType> rv = new HashSet<>();
            // figure out what keys need to be in the return value to load a key
            // into the cache at first access to this map, named "MY_MAP_NAME"
            // in this example
            return rv;
        }
    }));
config.getGroupConfig().setName("MY_INSTANCE_NAME").setPassword("my_password");
final HazelcastInstance hazelcast = Hazelcast.getOrCreateHazelcastInstance(config);

Spark 2 Job Monitoring with Multiple Simultaneous Jobs on a Spark Context (JobProgressListener)

On Spark 2.0.x, I have been using a JobProgressListener implementation to retrieve Job/Stage/Task progress information in real-time from our cluster. I understand how the event flow works, and successfully receive updates on the work.
My problem is that we have several different submissions running at the same time on the same Spark Context, and it is seemingly impossible to tell which Job/Stage/Task belongs to which submission. Each Job/Stage/Task receives a unique id, which is great. However, I'm looking for a way to provide a submission "id" or "name" that would be returned along with the received JobProgressListener event objects.
I realize that the "Job Group" can be set on the Spark Context, but if multiple jobs are simultaneously running on the same context, they will become scrambled.
Is there a way I can sneak in custom properties that would be returned with the listener events for a single SQLContext? In so doing, I should be able to link up subsequent Stage and Task events and get what I need.
Please note: I am not using spark-submit for these jobs. They are being executed using Java references to a SparkSession/SQLContext.
Thanks for any solutions or ideas.
I'm using a local property - it can be read from the listener during the onStageSubmitted event. After that, I use the corresponding stageId to identify the tasks executed during that stage.
Future {
  sc.setLocalProperty("job-context", "second")
  val listener = new MetricListener("second")
  sc.addSparkListener(listener)
  // do some spark actions
  val df = spark.read.load("...")
  val countResult = df.filter(....).count()
  println(listener.rows)
  sc.removeSparkListener(listener)
}

class MetricListener(name: String) extends SparkListener {
  var rows: Long = 0L
  var stageId = -1

  override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): Unit = {
    if (stageSubmitted.properties.getProperty("job-context") == name) {
      stageId = stageSubmitted.stageInfo.stageId
    }
  }

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    if (taskEnd.stageId == stageId)
      rows = rows + taskEnd.taskMetrics.inputMetrics.recordsRead
  }
}
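A second simultaneous submission can then be tagged the same way; this sketch (assuming the MetricListener class and the imports/ExecutionContext from the snippet above) just uses a different value for the same local property:
Future {
  // Tag this submission's stages with a different value of the same property,
  // so its listener only matches stages carrying "first".
  sc.setLocalProperty("job-context", "first")
  val listener = new MetricListener("first")
  sc.addSparkListener(listener)
  val otherCount = spark.read.load("...").count()
  println(listener.rows)
  sc.removeSparkListener(listener)
}
Since setLocalProperty is per-thread, each submission has to set the property on the thread that triggers its own actions.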

Spring Aggregation Group

I created an aggregator service as below:
@EnableBinding(Processor.class)
class Configuration {

    @Autowired
    Processor processor;

    @ServiceActivator(inputChannel = Processor.INPUT)
    @Bean
    public MessageHandler aggregator() {
        AggregatingMessageHandler aggregatingMessageHandler =
                new AggregatingMessageHandler(new DefaultAggregatingMessageGroupProcessor(),
                        new SimpleMessageStore(10));
        //AggregatorFactoryBean aggregatorFactoryBean = new AggregatorFactoryBean();
        //aggregatorFactoryBean.setMessageStore();
        aggregatingMessageHandler.setOutputChannel(processor.output());
        //aggregatorFactoryBean.setDiscardChannel(processor.output());
        aggregatingMessageHandler.setSendPartialResultOnExpiry(true);
        aggregatingMessageHandler.setSendTimeout(1000L);
        aggregatingMessageHandler.setCorrelationStrategy(new ExpressionEvaluatingCorrelationStrategy("requestType"));
        aggregatingMessageHandler.setReleaseStrategy(new MessageCountReleaseStrategy(3)); //ExpressionEvaluatingReleaseStrategy("size() == 5")
        aggregatingMessageHandler.setExpireGroupsUponCompletion(true);
        aggregatingMessageHandler.setGroupTimeoutExpression(new ValueExpression<>(3000L)); //size() ge 2 ? 5000 : -1
        aggregatingMessageHandler.setExpireGroupsUponTimeout(true);
        return aggregatingMessageHandler;
    }
}
Now I want to release the group as soon as a new group is created, so that I only have one group at a time.
To be more specific, I receive two types of requests, 'PUT' and 'DEL'. I want to keep aggregating per the above rules, but as soon as I receive a request type other than the one I am currently aggregating, I want to release the current group and start aggregating the new type.
The reason I want to do this is that these requests are sent to another party that doesn't support having PUT and DEL requests at the same time, and I can't delay any DEL request, as the sequence between PUT and DEL is important.
I understand that I need to create a custom release POJO, but will I be able to check the current groups?
For example, if I receive 6 messages like below
PUT PUT PUT DEL DEL PUT
they should be aggregated as below:
3 PUT
2 DEL
1 PUT
OK. Thank you for sharing more info.
Yes, your custom ReleaseStrategy can check the message type and return true to trigger completion of the group.
As long as you have only a static correlationKey, only one group is ever in the store. When a message reaches the ReleaseStrategy, there is no magic needed beyond checking the current group for the completion signal; since there are no other groups in the store, no complex release logic is required.
You should add expireGroupsUponCompletion = true to let the group be removed after completion, so that the next message forms a new group for the same correlationKey.
UPDATE
Thank you for further info!
So, yes, your original PoC is good. And even a static correlationKey is fine, since you are just going to collect incoming messages into batches.
Your custom ReleaseStrategy should analyze the MessageGroup for a message with a different key and return true in that case.
The custom MessageGroupProcessor should filter the message with the different key out of the output List and send that message back to the aggregator, letting it form a new group for a sequence of its own key.
I ended up implementing the ReleaseStrategy below, as I found it simpler than removing the message and queuing it again.
class MessageCountAndOnlyOneGroupReleaseStrategy implements org.springframework.integration.aggregator.ReleaseStrategy {

    private final int threshold;
    private final MessageGroupProcessor messageGroupProcessor;

    private MessageGroup currentGroup;

    public MessageCountAndOnlyOneGroupReleaseStrategy(int threshold, MessageGroupProcessor messageGroupProcessor) {
        super();
        this.threshold = threshold;
        this.messageGroupProcessor = messageGroupProcessor;
    }

    @Override
    public boolean canRelease(MessageGroup group) {
        if (currentGroup == null)
            currentGroup = group;
        if (!group.getGroupId().equals(currentGroup.getGroupId())) {
            messageGroupProcessor.processMessageGroup(currentGroup);
            currentGroup = group;
            return false;
        }
        return group.size() >= this.threshold;
    }
}
Note that I used new HeaderAttributeCorrelationStrategy("request_type") instead of just "FOO" for the CorrelationStrategy.
