Q. Why should we use cache when we have persist, which offers memory-only and other storage levels?
I was asked this question in an interview and had no idea how to answer it. Please help me understand.
cache is the same as persist with the default storage level:
From the Scala code:
/**
 * Persist this Dataset with the default storage level (`MEMORY_AND_DISK`).
 *
 * @group basic
 * @since 1.6.0
 */
def cache(): this.type = persist()
So cache can be seen as a widely used convenience function.
I have multiple Spark jobs that share a part of the dataflow graph, including an expensive shuffle operation. If I persist that RDD, I see a huge improvement (22x), as expected.
However, even when I keep the storage level of those RDDs as NONE, I still see up to a 4x improvement just by sharing the RDDs among jobs.
Why?
I am under the assumption that Spark always recomputes RDDs with storage level NONE and that those are not subject to eviction/spilling.
My Spark version is 3.3.1. Showing the code is difficult as the code is spread in multiple files in a bigger system. I am essentially doing the following:
Identify the repetitive (and expensive) Spark operation across jobs. I do that by maintaining my own lineage traces [1].
After the first execution of those Spark operations, I keep the RDD handles locally in a cache, which is a hashmap <lineage-trace, RDD>.
The next time onwards, when I get the same operation, I simply reuse the cached RDD.
If I persist the RDDs in the second step (by calling rdd.persist(StorageLevel.MEMORY_AND_DISK)), I see a huge improvement. But even if I just reuse the same RDD (storage level NONE), I still see an improvement.
[1] LIMA: Fine-grained Lineage Tracing and Reuse in Machine Learning Systems. Arnab Phani, Benjamin Rath, Matthias Boehm. SIGMOD 2021
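For illustration, the reuse scheme described in the steps above might look roughly like the following Python sketch. The real system is Scala/Spark and is not shown here: HandleCache, get_or_compute, and the string used as a lineage trace are hypothetical names, and a plain object stands in for the RDD handle.

```python
class HandleCache:
    """Maps a lineage trace to a previously created handle (hypothetical helper)."""

    def __init__(self):
        self._handles = {}   # lineage trace -> handle (stand-in for an RDD)
        self.hits = 0        # how many times a handle was reused

    def get_or_compute(self, lineage_trace, compute):
        # Reuse the stored handle when the same operation shows up again.
        if lineage_trace in self._handles:
            self.hits += 1
            return self._handles[lineage_trace]
        # First occurrence: build the handle and remember it.
        handle = compute()
        self._handles[lineage_trace] = handle
        return handle


cache = HandleCache()
build_calls = []


def expensive_shuffle():
    # Stand-in for building the expensive part of the dataflow graph.
    build_calls.append(1)
    return object()


first = cache.get_or_compute("join(A,B)->groupBy(k)", expensive_shuffle)
second = cache.get_or_compute("join(A,B)->groupBy(k)", expensive_shuffle)
```

The second lookup returns the very same handle object without calling expensive_shuffle again, which mirrors "simply reuse the cached RDD" in the description.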
If we have a look at the source code for the rdd.persist(StorageLevel) method, we see the following:
def persist(newLevel: StorageLevel): this.type = {
  if (isLocallyCheckpointed) {
    // This means the user previously called localCheckpoint(), which should have already
    // marked this RDD for persisting. Here we should override the old storage level with
    // one that is explicitly requested by the user (after adapting it to use disk).
    persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
  } else {
    persist(newLevel, allowOverride = false)
  }
}
So that calls a persist method with an extra input argument. It looks like this:
/**
 * Mark this RDD for persisting using the specified level.
 *
 * @param newLevel the target storage level
 * @param allowOverride whether to override any existing level with the new one
 */
private def persist(newLevel: StorageLevel, allowOverride: Boolean): this.type = {
  // TODO: Handle changes of StorageLevel
  if (storageLevel != StorageLevel.NONE && newLevel != storageLevel && !allowOverride) {
    throw SparkCoreErrors.cannotChangeStorageLevelError()
  }
  // If this is the first time this RDD is marked for persisting, register it
  // with the SparkContext for cleanups and accounting. Do this only once.
  if (storageLevel == StorageLevel.NONE) {
    sc.cleaner.foreach(_.registerRDDForCleanup(this))
    sc.persistRDD(this)
  }
  storageLevel = newLevel
  this
}
In there, we see something interesting. If the current storageLevel (not the new one) == StorageLevel.NONE, we're going to registerRDDForCleanup and persistRDD on this RDD.
Now, the default value for storageLevel is StorageLevel.NONE. That means that your case (calling persist on an unpersisted RDD) falls under this category.
So we found out that calling rdd.persist(StorageLevel.NONE) actually does something with your RDD! Let's have a look at both of these operations.
registerRDDForCleanup
registerRDDForCleanup is a method of the ContextCleaner class. It looks like this:
/** Register an RDD for cleanup when it is garbage collected. */
def registerRDDForCleanup(rdd: RDD[_]): Unit = {
  registerForCleanup(rdd, CleanRDD(rdd.id))
}

// some other code between here that I removed for this explanation

/** Register an object for cleanup. */
private def registerForCleanup(objectForCleanup: AnyRef, task: CleanupTask): Unit = {
  referenceBuffer.add(new CleanupTaskWeakReference(task, objectForCleanup, referenceQueue))
}
So this method actually adds a cleanup task (associated with this RDD) to some buffer called the referenceBuffer. That looks like this:
/**
 * A buffer to ensure that `CleanupTaskWeakReference`s are not garbage collected as long as they
 * have not been handled by the reference queue.
 */
private val referenceBuffer =
  Collections.newSetFromMap[CleanupTaskWeakReference](new ConcurrentHashMap)
So, as the code comment says, this referenceBuffer is a buffer that ensures the cleanup task references don't get garbage collected too soon. So your RDDs are getting garbage collected less, which improves your performance!
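To illustrate the role of such a buffer, here is a rough Python analogy built on the standard weakref module. It is not Spark's code: Handle, Cleaner, and register_for_cleanup are hypothetical names, and Python's weak-reference callback stands in for the JVM reference queue. In this sketch the buffer keeps the weak-reference objects alive until their callback has run.

```python
import gc
import weakref


class Handle:
    """Stand-in for an RDD (hypothetical); it only carries an id."""

    def __init__(self, handle_id):
        self.handle_id = handle_id


class Cleaner:
    def __init__(self):
        # Strong references to the weak-reference objects themselves, so they
        # stay alive until their callback has run.
        self.reference_buffer = set()
        self.cleaned_ids = []

    def register_for_cleanup(self, obj):
        handle_id = obj.handle_id  # capture the id only, never the object

        def on_collected(ref):
            # Runs when the registered object is being collected.
            self.cleaned_ids.append(handle_id)
            self.reference_buffer.discard(ref)

        self.reference_buffer.add(weakref.ref(obj, on_collected))


cleaner = Cleaner()
handle = Handle(7)
cleaner.register_for_cleanup(handle)
pending_before = len(cleaner.reference_buffer)  # one cleanup task is pending

del handle      # drop the last strong reference to the object
gc.collect()    # not needed on CPython, but harmless elsewhere
```

If the weak reference were not stored anywhere, Python would discard it and the callback would never run, which is the situation the buffer is there to prevent.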
persistRDD
The second method that was called on our RDD is the persistRDD method. I won't go into too much detail (since it is less important here), but this method (in SparkContext.scala) basically adds this RDD to a Map in which the SparkContext keeps track of all persisted RDDs.
Conclusion
We could go deeper in this investigation, but that would become impractical to write/read. I think this level of abstraction is enough to understand that calling rdd.persist(StorageLevel) actually does something to make your RDDs not be garbage collected too soon!
Hope this helps :)
I am trying to connect Spark to Oracle. If the connection fails, the job fails. Instead, I want to set a connection retry limit so that it tries to reconnect up to that limit and fails the job only if it still cannot connect.
Please suggest how we could implement this.
Let's assume you are using PySpark. Recently I used this in my project so I know this works.
I used the retry project from PyPI (retry 0.9.2), and the application went through an extensive testing process.
I used a Python class to hold the retry-related configuration.
class RetryConfig:
    retry_count = 1
    delay_interval = 1
    backoff_multiplier = 1
I collected the application parameters from the runtime configuration and set them as below:
RetryConfig.retry_count = <retry_count supplied from config>
RetryConfig.delay_interval = <delay_interval supplied from config>
RetryConfig.backoff_multiplier = <backoff_multiplier supplied from config>
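Then I applied the retry decorator to the method that connects to the DB. Because decorator arguments are evaluated when the function is defined, the RetryConfig values have to be set before this definition runs: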
from retry import retry
import pyodbc

@retry((Exception), tries=RetryConfig.retry_count, delay=RetryConfig.delay_interval, backoff=RetryConfig.backoff_multiplier)
def connect(connection_string):
    print("trying")
    obj = pyodbc.connect(connection_string)
    return obj
Backoff multiplies the delay by the backoff factor after each retry, which is quite a common requirement.
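To make the backoff behaviour concrete, here is a minimal plain-Python sketch of the same idea. It is not the retry library's implementation: retry_call, flaky_connect, and the injectable sleep parameter are hypothetical names used only for illustration.

```python
import time


def retry_call(func, tries=3, delay=1, backoff=2, sleep=time.sleep):
    """Call func(); after a failure wait `delay`, then multiply it by `backoff`."""
    current_delay = delay
    for attempt in range(1, tries + 1):
        try:
            return func()
        except Exception:
            if attempt == tries:
                raise              # attempts exhausted: let the job fail
            sleep(current_delay)
            current_delay *= backoff


attempts = []


def flaky_connect():
    # Hypothetical connection that fails twice before succeeding.
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("database not reachable")
    return "connected"


waits = []  # record the waits instead of actually sleeping
result = retry_call(flaky_connect, tries=4, delay=1, backoff=2, sleep=waits.append)
```

With delay=1 and backoff=2, the waits between attempts are 1 and then 2 seconds; a fourth failure would have waited 4 seconds before the last attempt.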
Cheers!!
I have a batch job in Scala/Spark that dynamically creates Drools rules depending on some input and then evaluates the rules. I also have as input RDD[T] which corresponds to the facts to be inserted to the rule engine.
So far, I am inserting the facts one by one and then triggering all rules on this fact. I am doing this using rdd.aggregate.
The seqOp operator is defined like this:
/**
 * @param broadcastRules the broadcasted KieBase object containing all rules
 * @param aggregator used to accumulate values when rule matches
 * @param item the fact to run Drools with
 * @tparam T the type of the given item
 * @return the updated aggregator
 */
def seqOp[T: ClassTag](broadcastRules: Broadcast[KieBase])(
    aggregator: MyAggregator,
    item: T): MyAggregator = {
  val session = broadcastRules.value.newStatelessKieSession
  session.setGlobal("aggregator", aggregator)
  session.execute(CommandFactory.newInsert(item))
  aggregator
}
Here is an example of a generated rule:
dialect "mvel"
global batch.model.MyAggregator aggregator
rule "1"
when condition
then do something on the aggregator
end
For the same RDD, the batch took 20 minutes to evaluate 3K rules but 10 hours to evaluate 10K rules!
I am wondering whether inserting fact by fact is the best approach. Is it better to insert all items of the RDD at once and then fire all rules? That doesn't seem optimal to me, as all facts would be in the working memory at the same time.
Do you see any issue with the code above?
I finally figured out the issue: it was linked to the action performed on the aggregator when a rule matches rather than to the evaluation of the rules.
I'm learning Spark and got confused about Spark's Catalog.
I found a catalog in SparkSession, which is an instance of CatalogImpl, as below
/**
 * Interface through which the user may create, drop, alter or query underlying
 * databases, tables, functions etc.
 *
 * @since 2.0.0
 */
@transient lazy val catalog: Catalog = new CatalogImpl(self)
And I found that there is a catalog in SparkSession.sessionState, which is an instance of SessionCatalog.
What's the difference between them?
tl;dr None.
The line in CatalogImpl is the missing piece in your understanding:
private def sessionCatalog: SessionCatalog = sparkSession.sessionState.catalog
In other words, SparkSession.catalog creates a CatalogImpl that uses sparkSession.sessionState.catalog under the covers.
I want to write my own MapStore which will access Cassandra.
I would like to be able to pass arguments such as Cassandra's address. How can I do this, assuming I can use the constructor?
I'm using Dropwizard, specifically the dropwizard-cassandra library.
@danieln
@noctarius showed the declarative way to specify properties in hazelcast.xml.
But MapLoader by itself doesn't have a way to inject these properties into the instance.
To do that, you need to implement the MapLoaderLifecycleSupport interface.
The properties will be injected into the init() method:
public interface MapLoaderLifecycleSupport {

    /**
     * Initializes this MapLoader implementation. Hazelcast will call
     * this method when the map is first used on the
     * HazelcastInstance. Implementation can
     * initialize required resources for the implementing
     * mapLoader, such as reading a config file and/or creating a
     * database connection. References to maps, other than the one on which
     * this {@code MapLoader} is configured, can be obtained from the
     * {@code hazelcastInstance} in this method's implementation.
     * <p>
     * On members joining a cluster, this method is executed during finalization
     * of the join operation, therefore care should be taken to adhere to the
     * rules for {@link com.hazelcast.spi.PostJoinAwareService#getPostJoinOperation()}.
     * If the implementation executes operations which may wait on locks or otherwise
     * block (e.g. waiting for network operations), this may result in a time-out and
     * obstruct the new member from joining the cluster. If blocking operations are
     * required for initialization of the {@code MapLoader}, consider deferring them
     * with a lazy initialization scheme.
     * </p>
     *
     * @param hazelcastInstance HazelcastInstance of this mapLoader.
     * @param properties Properties set for this mapStore. see MapStoreConfig
     * @param mapName name of the map.
     */
    void init(HazelcastInstance hazelcastInstance, Properties properties, String mapName);

    /**
     * Hazelcast will call this method before shutting down.
     * This method can be overridden to clean up the resources
     * held by this map loader implementation, such as closing the
     * database connections, etc.
     */
    void destroy();
}
We don't have a Cassandra example, but we do have a Mongo example here at your disposal. This example illustrates the approach of passing the properties to the loader.
Also, we would gladly accept a Cassandra example for hazelcast-code-samples if you would be kind enough to contribute one.
If you have any questions, let me know in comments below.
Thank you
Vik
Hazelcast provides the option to pass properties (configuration) from inside the hazelcast.xml to the MapStore implementation. Unfortunately, you're right, there is no example in the docs that shows how to do that, but here's the link to the XSD schema: https://github.com/hazelcast/hazelcast/blob/master/hazelcast/src/main/resources/hazelcast-config-3.7.xsd#L1731
As for a documentation example, I'll pass the info to our docs team to add one :)