Passing arguments to Hazelcast MapStore - cassandra

I want to write my own MapStore which will access Cassandra.
I would like to be able to pass arguments such as Cassandra's address. How can I do this, assuming I can use the constructor?
I'm using Dropwizard, and specifically the dropwizard-cassandra library.

@danieln
@noctarius showed the declarative way to specify properties in hazelcast.xml.
But MapLoader doesn't have a way to inject those properties into the instance.
To do that, you need to implement the MapLoaderLifecycleSupport interface.
The properties will be injected via the init() method:
public interface MapLoaderLifecycleSupport {

    /**
     * Initializes this MapLoader implementation. Hazelcast will call
     * this method when the map is first used on the
     * HazelcastInstance. Implementation can
     * initialize required resources for the implementing
     * mapLoader, such as reading a config file and/or creating a
     * database connection. References to maps, other than the one on which
     * this {@code MapLoader} is configured, can be obtained from the
     * {@code hazelcastInstance} in this method's implementation.
     * <p>
     * On members joining a cluster, this method is executed during finalization
     * of the join operation, therefore care should be taken to adhere to the
     * rules for {@link com.hazelcast.spi.PostJoinAwareService#getPostJoinOperation()}.
     * If the implementation executes operations which may wait on locks or otherwise
     * block (e.g. waiting for network operations), this may result in a time-out and
     * obstruct the new member from joining the cluster. If blocking operations are
     * required for initialization of the {@code MapLoader}, consider deferring them
     * with a lazy initialization scheme.
     * </p>
     *
     * @param hazelcastInstance HazelcastInstance of this mapLoader.
     * @param properties        Properties set for this mapStore. see MapStoreConfig
     * @param mapName           name of the map.
     */
    void init(HazelcastInstance hazelcastInstance, Properties properties, String mapName);

    /**
     * Hazelcast will call this method before shutting down.
     * This method can be overridden to clean up the resources
     * held by this map loader implementation, such as closing the
     * database connections, etc.
     */
    void destroy();
}
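For illustration, a property-driven Cassandra-backed store could look roughly like the sketch below. This is only a hedged outline: the class name, the cassandra.address property key, and the stubbed persistence methods are assumptions (Hazelcast 3.x packages), and the actual Cassandra session handling is left out:
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.MapLoaderLifecycleSupport;
import com.hazelcast.core.MapStore;

import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

// Hypothetical sketch: a MapStore that picks up the Cassandra address from the
// properties declared in hazelcast.xml (or set programmatically, see further below).
public class CassandraMapStore implements MapStore<String, String>, MapLoaderLifecycleSupport {

    private String cassandraAddress;

    @Override
    public void init(HazelcastInstance hazelcastInstance, Properties properties, String mapName) {
        // "cassandra.address" is an assumed property key, not a Hazelcast built-in
        this.cassandraAddress = properties.getProperty("cassandra.address", "127.0.0.1");
        // open the Cassandra session here (e.g. with the driver you already use in Dropwizard)
    }

    @Override
    public void destroy() {
        // close the Cassandra session here
    }

    // The persistence methods would delegate to the Cassandra session; stubs only.
    @Override public void store(String key, String value) { /* INSERT ... */ }
    @Override public void storeAll(Map<String, String> map) { map.forEach(this::store); }
    @Override public void delete(String key) { /* DELETE ... */ }
    @Override public void deleteAll(Collection<String> keys) { keys.forEach(this::delete); }
    @Override public String load(String key) { return null; /* SELECT ... */ }
    @Override public Map<String, String> loadAll(Collection<String> keys) { return Collections.emptyMap(); }
    @Override public Iterable<String> loadAllKeys() { return Collections.emptyList(); }
}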
We don't have a Cassandra example in our samples, but we do have a Mongo example here at your disposal. It illustrates the approach of passing the properties to the loader.
Also, we would gladly accept a Cassandra example for hazelcast-code-samples, if you would be kind enough to contribute one.
If you have any questions, let me know in the comments below.
Thank you
Vik

Hazelcast provides the option to pass properties (configuration) from inside the hazelcast.xml to the MapStore implementation. Unfortunately, you're right, there is no example in the docs that shows how to do that, but here's the link to the XSD schema: https://github.com/hazelcast/hazelcast/blob/master/hazelcast/src/main/resources/hazelcast-config-3.7.xsd#L1731
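If you configure the member programmatically instead of via hazelcast.xml, the same properties can be set through the Config API; a small sketch (the map name, MapStore class name, and property key are assumptions):
import com.hazelcast.config.Config;
import com.hazelcast.config.MapStoreConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class HazelcastSetup {
    public static void main(String[] args) {
        // "myMap" and com.example.CassandraMapStore are placeholder names
        MapStoreConfig mapStoreConfig = new MapStoreConfig()
                .setEnabled(true)
                .setClassName("com.example.CassandraMapStore")
                .setProperty("cassandra.address", "127.0.0.1");

        Config config = new Config();
        config.getMapConfig("myMap").setMapStoreConfig(mapStoreConfig);

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
    }
}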
As for a documentation example, I'll pass the info on to our docs team to add one :)

Related

Sharing RDDs with storage level NONE among Spark jobs

I have multiple Spark jobs which share a part of the dataflow graph, including an expensive shuffle operation. If I persist that RDD, I see a huge improvement (22x), as expected.
However, even when I keep the storage level of those RDDs as NONE, I still see up to a 4x improvement just by sharing the RDDs among jobs.
Why?
I am under the assumption that Spark always recomputes RDDs with storage level NONE and that those are not subject to eviction/spilling.
My Spark version is 3.3.1. Showing the code is difficult, as it is spread across multiple files in a bigger system. I am essentially doing the following:
1. Identify the repetitive (and expensive) Spark operations across jobs. I do that by maintaining my own lineage traces [1].
2. After the first execution of those Spark operations, I keep the RDD handles locally in a cache, which is a hashmap <lineage-trace, RDD>.
3. From then on, when I get the same operation, I simply reuse the cached RDD.
If I persist the RDDs in the second step (by calling rdd.persist(StorageLevel.MEMORY_AND_DISK)), I see a huge improvement. But even if I just reuse the same RDD (storage level NONE), I still see an improvement.
[1] LIMA: Fine-grained Lineage Tracing and Reuse in Machine Learning Systems. Arnab Phani, Benjamin Rath, Matthias Boehm. SIGMOD 2021
If we have a look at the source code for the rdd.persist(StorageLevel) method, we see the following:
def persist(newLevel: StorageLevel): this.type = {
  if (isLocallyCheckpointed) {
    // This means the user previously called localCheckpoint(), which should have already
    // marked this RDD for persisting. Here we should override the old storage level with
    // one that is explicitly requested by the user (after adapting it to use disk).
    persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
  } else {
    persist(newLevel, allowOverride = false)
  }
}
So that calls a persist method with an extra input argument. It looks like this:
/**
 * Mark this RDD for persisting using the specified level.
 *
 * @param newLevel the target storage level
 * @param allowOverride whether to override any existing level with the new one
 */
private def persist(newLevel: StorageLevel, allowOverride: Boolean): this.type = {
  // TODO: Handle changes of StorageLevel
  if (storageLevel != StorageLevel.NONE && newLevel != storageLevel && !allowOverride) {
    throw SparkCoreErrors.cannotChangeStorageLevelError()
  }
  // If this is the first time this RDD is marked for persisting, register it
  // with the SparkContext for cleanups and accounting. Do this only once.
  if (storageLevel == StorageLevel.NONE) {
    sc.cleaner.foreach(_.registerRDDForCleanup(this))
    sc.persistRDD(this)
  }
  storageLevel = newLevel
  this
}
In there, we see something interesting. If the current storageLevel (not the new one) == StorageLevel.NONE, we're going to registerRDDForCleanup and persistRDD on this RDD.
Now, the default value for storageLevel is StorageLevel.NONE. That means that your case (calling persist on an unpersisted RDD) falls under this category.
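As a quick illustration of that default (a small standalone Java sketch, not taken from the question's code):
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class PersistDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "persist-demo");
        JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3));

        // A freshly created RDD starts at the default level, StorageLevel.NONE
        System.out.println(rdd.getStorageLevel());

        // The first persist() call is the one that registers the RDD with the
        // ContextCleaner and the SparkContext, as shown in the source above
        rdd.persist(StorageLevel.MEMORY_AND_DISK());
        System.out.println(rdd.getStorageLevel());

        sc.stop();
    }
}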
So we found out that calling rdd.persist(StorageLevel.NONE) actually does something with your RDD! Let's have a look at both of these operations.
registerRDDForCleanup
registerRDDForCleanup is a method of the ContextCleaner class. It looks like this:
/** Register an RDD for cleanup when it is garbage collected. */
def registerRDDForCleanup(rdd: RDD[_]): Unit = {
  registerForCleanup(rdd, CleanRDD(rdd.id))
}

// some other code between here that I removed for this explanation

/** Register an object for cleanup. */
private def registerForCleanup(objectForCleanup: AnyRef, task: CleanupTask): Unit = {
  referenceBuffer.add(new CleanupTaskWeakReference(task, objectForCleanup, referenceQueue))
}
So this method actually adds a cleanup task (associated with this RDD) to some buffer called the referenceBuffer. That looks like this:
/**
* A buffer to ensure that `CleanupTaskWeakReference`s are not garbage collected as long as they
* have not been handled by the reference queue.
*/
private val referenceBuffer =
Collections.newSetFromMap[CleanupTaskWeakReference](new ConcurrentHashMap)
As the code comment says, this referenceBuffer ensures that the cleanup tasks are not garbage collected too soon. So your RDDs are garbage collected less eagerly, which improves your performance!
persistRDD
The second method that was called on our RDD is the persistRDD method. I won't go into too much detail (since it is less important here), but this method (of SparkContext.scala) basically adds this RDD to a Map in which the SparkContext keeps track of all persisted RDDs.
Conclusion
We could go deeper in this investigation, but that would become impractical to write/read. I think this level of abstraction is enough to understand that calling rdd.persist(StorageLevel) actually does something to make your RDDs not be garbage collected too soon!
Hope this helps :)

Why we should use cache since we have persist in spark

Q: Why should we use cache, since we have persist, which has MEMORY_ONLY and other options?
This question was asked to me in an interview. I don't have any idea about this; please help me understand.
cache is the same as persist with the default storage level:
From the Scala code:
/**
 * Persist this Dataset with the default storage level (`MEMORY_AND_DISK`).
 *
 * @group basic
 * @since 1.6.0
 */
def cache(): this.type = persist()
So cache can be seen as a widely used convenience function.
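For example (a minimal Java sketch; the DataFrame is purely illustrative), the calls below end up persisting at the same level:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.storage.StorageLevel;

public class CacheVsPersist {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("cache-vs-persist")
                .master("local[*]")
                .getOrCreate();

        // cache() is shorthand for persist() with the default MEMORY_AND_DISK level
        Dataset<Row> cached = spark.range(1000).toDF().cache();

        // ...which is the same as spelling the level out explicitly
        Dataset<Row> persisted = spark.range(1000).toDF().persist(StorageLevel.MEMORY_AND_DISK());

        // persist() only matters when you want a non-default level, e.g. MEMORY_ONLY
        Dataset<Row> memoryOnly = spark.range(1000).toDF().persist(StorageLevel.MEMORY_ONLY());

        spark.stop();
    }
}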

Consuming batch of message from pubsub with Spring

How do I consume multiple messages from Pub/Sub? This seems like a simple problem that should have a simple solution, but currently I can't find an easy way to consume batches of records from Pub/Sub with spring-cloud-gcp-pubsub.
I'm using spring-cloud-gcp-pubsub to consume messages from Pub/Sub and process them in a Spring Boot application. My current setup is very simple: I have a PubSubInboundChannelAdapter and a ServiceActivator that consumes records. After some research I found Spring Integration Aggregators, but they didn't seem like a good way of doing this because it's not easy to propagate the acknowledgment downstream. Is there anything I'm missing? How can I consume batches of messages?
The PubSubInboundChannelAdapter is based on a subscription to the topic. So it is going to be a stream of messages, and the PubSubInboundChannelAdapter reacts to each of them, converting it to a Spring Message and sending it downstream to the configured channel.
There is really no way to get a batch of messages during subscription.
Also, keep in mind that there is nothing like an offset in GCP Pub/Sub. You indeed have to acknowledge every single message you consume from Pub/Sub.
There is, however, a way to pull a batch of messages at once, using PubSubMessageSource. The messageSource.setMaxFetchSize(5) call does the trick, but PubSubMessageSource still produces every message individually, so you are able to (n)ack them independently.
You can, of course, leverage the feature PubSubMessageSource uses under the hood - PubSubSubscriberOperations.pullAndConvert(). See its JavaDocs for more info:
/**
 * Pull a number of messages from a Google Cloud Pub/Sub subscription and convert them to Spring messages with
 * the desired payload type.
 * @param subscription the subscription name
 * @param maxMessages the maximum number of pulled messages
 * @param returnImmediately returns immediately even if subscription doesn't contain enough
 * messages to satisfy {@code maxMessages}
 * @param payloadType the type to which the payload of the Pub/Sub messages should be converted
 * @param <T> the type of the payload
 * @return the list of received acknowledgeable messages
 * @since 1.1
 */
<T> List<ConvertedAcknowledgeablePubsubMessage<T>> pullAndConvert(String subscription, Integer maxMessages,
        Boolean returnImmediately, Class<T> payloadType);
So this one looks like what you are looking for, because you indeed get a list of messages, and each of them is a wrapper with (n)ack callbacks.
This API could be used in a custom @InboundChannelAdapter MessageSource or Supplier @Bean implementation, as sketched below.
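A rough sketch of such a polled source built on pullAndConvert (the channel name, subscription name, fetch size, and spring-cloud-gcp 1.x package names are assumptions; the downstream consumer still has to (n)ack each element of the list):
import java.util.List;

import org.springframework.cloud.gcp.pubsub.core.subscriber.PubSubSubscriberOperations;
import org.springframework.cloud.gcp.pubsub.support.converter.ConvertedAcknowledgeablePubsubMessage;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.annotation.InboundChannelAdapter;
import org.springframework.integration.annotation.Poller;
import org.springframework.integration.core.MessageSource;
import org.springframework.messaging.support.GenericMessage;

@Configuration
public class BatchPullConfig {

    @Bean
    @InboundChannelAdapter(channel = "batchInputChannel", poller = @Poller(fixedDelay = "5000"))
    public MessageSource<List<ConvertedAcknowledgeablePubsubMessage<String>>> batchMessageSource(
            PubSubSubscriberOperations subscriberOperations) {
        return () -> {
            // pull up to 10 messages in one call; each element carries its own (n)ack callbacks
            List<ConvertedAcknowledgeablePubsubMessage<String>> batch =
                    subscriberOperations.pullAndConvert("testSubscription", 10, true, String.class);
            return batch.isEmpty() ? null : new GenericMessage<>(batch);
        };
    }
}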
But still: I don't see the benefit of whole-batch processing, since every message can be ack'ed individually without affecting the others.
Try using the following:
@Bean
@InboundChannelAdapter(channel = "pubsubInputChannel", poller = @Poller(fixedDelay = "5000", maxMessagesPerPoll = "3"))
public MessageSource<Object> pubsubAdapter(PubSubTemplate pubSubTemplate) {
    PubSubMessageSource messageSource = new PubSubMessageSource(pubSubTemplate, "testSubscription");
    messageSource.setAckMode(AckMode.MANUAL);
    return messageSource;
}
The maxMessagesPerPoll property determines how many messages will be polled in each poll.
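With AckMode.MANUAL, the downstream handler must acknowledge each message itself; a hedged sketch of such a handler, assuming the pubsubInputChannel above and spring-cloud-gcp's GcpPubSubHeaders.ORIGINAL_MESSAGE header:
import org.springframework.cloud.gcp.pubsub.support.BasicAcknowledgeablePubsubMessage;
import org.springframework.cloud.gcp.pubsub.support.GcpPubSubHeaders;
import org.springframework.integration.annotation.ServiceActivator;
import org.springframework.messaging.handler.annotation.Header;
import org.springframework.stereotype.Component;

@Component
public class PubsubMessageHandler {

    @ServiceActivator(inputChannel = "pubsubInputChannel")
    public void handle(String payload,
                       @Header(GcpPubSubHeaders.ORIGINAL_MESSAGE) BasicAcknowledgeablePubsubMessage message) {
        // process the payload, then ack explicitly (or nack() on failure)
        message.ack();
    }
}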

Drools in Spark - performance

I have a batch job in Scala/Spark that dynamically creates Drools rules depending on some input and then evaluates the rules. I also have an input RDD[T], which corresponds to the facts to be inserted into the rule engine.
So far, I have been inserting the facts one by one and then triggering all rules on each fact. I am doing this using rdd.aggregate.
The seqOp operator is defined like this:
/**
 * @param broadcastRules the broadcasted KieBase object containing all rules
 * @param aggregator used to accumulate values when rule matches
 * @param item the fact to run Drools with
 * @tparam T the type of the given item
 * @return the updated aggregator
 */
def seqOp[T: ClassTag](broadcastRules: Broadcast[KieBase])(
    aggregator: MyAggregator,
    item: T): MyAggregator = {
  val session = broadcastRules.value.newStatelessKieSession
  session.setGlobal("aggregator", aggregator)
  session.execute(CommandFactory.newInsert(item))
  aggregator
}
Here is an example of a generated rule:
dialect "mvel"
global batch.model.MyAggregator aggregator
rule "1"
when condition
then do something on the aggregator
end
For the same RDD, the batch took 20 minutes to evaluate 3K rules but 10 hours to evaluate 10K rules!
I am wondering if inserting the facts one by one is the best approach. Is it better to insert all items of the RDD at once and then fire all rules? It doesn't seem optimal to me, as all facts would be in working memory at the same time.
Do you see any issue with the code above?
I finally figured out the issue: it was linked more to the action performed on the aggregator when a rule matches than to the evaluation of the rules.

What's the difference between SparkSession.catalog and SparkSession.sessionState.catalog?

I'm learning Spark and got confused about Spark's Catalog.
I found a catalog in SparkSession, which is an instance of CatalogImpl, as below:
/**
 * Interface through which the user may create, drop, alter or query underlying
 * databases, tables, functions etc.
 *
 * @since 2.0.0
 */
@transient lazy val catalog: Catalog = new CatalogImpl(self)
And I found that there is a catalog in SparkSession.sessionState, which is an instance of SessionCatalog.
What's the difference between them?
What's the difference between them?
tl;dr None.
The line in CatalogImpl is the missing piece in your understanding:
private def sessionCatalog: SessionCatalog = sparkSession.sessionState.catalog
In other words, SparkSession.catalog creates a CatalogImpl that uses sparkSession.sessionState.catalog under the covers.
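For example (a small Java sketch), the public Catalog API is all most applications need, and it ends up querying that same SessionCatalog internally:
import org.apache.spark.sql.SparkSession;

public class CatalogDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("catalog-demo")
                .master("local[*]")
                .getOrCreate();

        // SparkSession.catalog is the public, stable entry point...
        spark.catalog().listDatabases().show();
        spark.catalog().listTables().show();

        // ...while SparkSession.sessionState.catalog (the SessionCatalog) is the
        // internal implementation it delegates to, and is not part of the public API.
        spark.stop();
    }
}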
