In our cluster of 4 application servers (all of which use embedded Hazelcast), we experimented with dynamically configured replicated maps. After an implementation error, we ended up with replicated maps consuming about 1 GB of memory, which we wanted to discard.
We created new distributed maps in our dynamic configuration and performed a rolling update of the cluster.
The current state is that the old replicated maps are still active and consuming memory, and hazelcastInstance.getDistributedObject("...").destroy() does not remove the map from all cluster members.
Also, asking the distributed objects for their service name or size returns null.
Is there a way to destroy unconfigured distributed objects from our cluster?
distributedObject.name == null
DistributedObjectUtil.getName(distributedObject) == "Some Name"
distributedObject.serviceName == null
distributedObject.getClass() == com.hazelcast.map.impl.proxy.MapProxyImpl.class
distributedObject.size() == null (!!)
distributedObject.keySet() == null
Thank you!
My bad: destroy() needed some time to remove the elements from all cluster members. After a few minutes, the distributed objects were gone.
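For anyone who ends up in the same situation, here is a minimal sketch of how stale distributed objects can be enumerated and destroyed from any member (illustrative only, not the exact code we ran; Hazelcast 3.x package names assumed):

import com.hazelcast.core.DistributedObject;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class StaleObjectCleanup {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        for (DistributedObject object : hz.getDistributedObjects()) {
            // Add a filter here if some of the listed objects must be kept.
            // DistributedObjectUtil.getName(object), as used above, is the reliable
            // way to read the name when getName() returns null for a proxy.
            System.out.println(object.getServiceName() + " / " + object.getName());
            // destroy() is cluster-wide, but memory on the other members is released
            // asynchronously, so allow a few minutes before expecting it to drop.
            object.destroy();
        }
    }
}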
I recently posted a question and received a full answer, but now I am encountering another problem.
The scenario is the same as in my recent question.
How can I configure a member to own a specific partition key?
For example, the DataCenterOnRussia partition key must always be owned by member1, and the DataCenterOnGermany partition key must always be owned by member2.
That way, member2 could request data from DataCenterOnRussia using a PartitionAwareKey.
The intent of PartitionAwareKey is to allow for data affinity: orders for a customer should be stored in the same partition as the customer record, for example, since they are frequently accessed together.
PartitionAwareKey lets you group items together, but it is not a way to specify the placement of those items on a specific cluster member. (If there were such a thing, it would probably be called MemberAwareKey.)
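As a small illustration of that affinity (a sketch only; the map name and the customer/order keys are made up, and Hazelcast 3.x package names are assumed), both entries below are routed by the partition key "customer-42" and therefore always live in the same partition, wherever that partition happens to be:

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.core.PartitionAwareKey;

public class AffinityExample {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<PartitionAwareKey<String, String>, String> orders = hz.getMap("orders");
        // Partitioning is driven by the second argument ("customer-42"), not the order id,
        // so both entries are co-located in one partition - but on no particular member.
        orders.put(new PartitionAwareKey<>("order-1", "customer-42"), "first order");
        orders.put(new PartitionAwareKey<>("order-2", "customer-42"), "second order");
    }
}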
A cluster in Hazelcast isn't a fixed-size entity; it is dynamically scalable, so members might be added or removed, and it is fault-tolerant, so a member could be lost without loss of the data that happened to be on that member. In order to support those features, the cluster must have the freedom to move partitions around to different machines as the cluster topology changes.
Hazelcast recommends that all members of a cluster be similarly configured (equivalent memory configuration, most particularly) because of the idea that cluster members are interchangeable, at least as far as data storage goes. (The MemberSelector facility does provide a way to handle systems with different processing capabilities, e.g., number of processor cores, but nothing similar exists to allow placement of specific data entries or partitions on a designated member; see the sketch below.)
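To make that distinction concrete, here is a rough sketch of what MemberSelector does let you do: route computation, not data, to members you pick, here via a hypothetical "datacenter" member attribute (Hazelcast 3.x API assumed):

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IExecutorService;
import com.hazelcast.core.Member;
import com.hazelcast.core.MemberSelector;
import java.io.Serializable;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;

public class SelectorExample {
    // The task must be serializable so it can be shipped to the selected member.
    static class ProbeTask implements Callable<String>, Serializable {
        public String call() {
            return "ran on a member in the selected datacenter";
        }
    }

    public static void main(String[] args) throws Exception {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IExecutorService executor = hz.getExecutorService("tasks");
        // Only members that advertise the (hypothetical) attribute datacenter=russia
        // are eligible to run the task; the data itself is still not pinned anywhere.
        MemberSelector russiaOnly = new MemberSelector() {
            public boolean select(Member member) {
                return "russia".equals(member.getAttributes().get("datacenter"));
            }
        };
        Future<String> result = executor.submit(new ProbeTask(), russiaOnly);
        System.out.println(result.get());
    }
}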
If your use case requires specific placement on machines, it's an indication that those machines probably should not be part of the same cluster.
I am currently using Spark to process documents. I have two servers at my disposal (innov1 and innov2), and I am using YARN as the resource manager.
The first step is to gather the paths of the files from a database, filter them, repartition them, and persist them in an RDD[String]. However, I can't get the persisted data to be shared fairly among all the executors:
[screenshot: persisted RDD memory taken among executors]
and this leads to the executors not doing the same amount of work afterwards:
[screenshot: work done by each executor (ignore the 'dead' executors here, that's another problem)]
And this happens randomly: sometimes it's innov1 that holds all the persisted data, and then only the executors on innov1 work (but it tends to be innov2 in general). Right now, whenever two executors end up on innov1, I just kill the job and relaunch it, praying that they land on innov2 (which is utterly stupid and defeats the purpose of using Spark).
What I have tried so far (none of which worked):
making the driver sleep 60 seconds before loading from the database (maybe innov1 takes more time to wake up?)
adding spark.scheduler.minRegisteredResourcesRatio=1.0 when I submit the job (same idea as above)
persisting with replication x2 (idea from this link), hoping that some of the blocks would be replicated on innov1
Note on point 3: sometimes the replica was persisted on the same executor (which is a bit counterintuitive), or, even weirder, not replicated at all (is innov2 unable to communicate with innov1?).
I am open to any suggestion, or a link to a similar problem I might have missed.
Edit:
I can't really put code here, as it's part of my company's product. I can give a simplified version however:
val rawHBaseRDD: RDD[(ImmutableBytesWritable, Result)] = sc
  .newAPIHadoopRDD(...)
  .map(x => (x._1, x._2)) // from the doc of newAPIHadoopRDD
  .repartition(200)
  .persist(MEMORY_ONLY)

val pathsRDD: RDD[(String, String)] = rawHBaseRDD
  .mapPartitions {
    // ... extract the key and the path from ImmutableBytesWritable
    // and Result.rawCells() ...
  }
  .filter(some cond)
  .repartition(200)
  .persist(MEMORY_ONLY)
For both persists, everything ends up on innov2. Could it be because the data is only on innov2? Even if that's the case, I would assume that the repartition helps share the rows between innov1 and innov2, but that doesn't happen here.
Your persisted data set is not very big - roughly 100 MB according to your screenshot. You have allocated 10 cores with 20 GB of memory, so the 100 MB fits easily into the memory of a single executor, and that is basically what is happening.
In other words, you have allocated many more resources than are actually needed, so Spark just randomly picks the subset of resources that it needs to complete the job. Sometimes those resources happen to be on one worker, sometimes on another and sometimes it uses resources from both workers.
You have to remember that to Spark, it makes no difference if all resources are placed on a single machine or on 100 different machines - as long as you are not trying to use more resources than are available (in which case you would get an OOM).
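If spreading the work over both workers matters, one option is to request only the resources the job actually needs, sized so that a single executor cannot hold everything. A rough sketch with made-up numbers, shown with Spark's Java API for illustration (the same values map to --num-executors, --executor-cores and --executor-memory on spark-submit):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class RightSizedJob {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("document-processing")
                // Several small executors instead of one over-sized pool; whether they
                // land on innov1, innov2 or both is still up to YARN container placement.
                .set("spark.executor.instances", "2")
                .set("spark.executor.cores", "2")
                .set("spark.executor.memory", "4g");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... build, repartition and persist the RDDs here ...
        sc.stop();
    }
}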
Unfortunately (fortunately?) the problem resolved itself today. I assume it is not Spark-related, as I hadn't modified the code before the resolution.
It's probably due to the complete reboot of all services with Ambari (though I am not 100% sure, because I had already tried that before), as it's the only "major" change that happened today.
When a new member joins a cluster, the partition table is rebalanced and data migration takes place.
If the data is large, I believe it will take some time. While it is happening, what is the state of the cache like?
If I am using embedded mode, does this block my application until the migration is completed? Or, if I don't want to work with an incomplete cache, do I need to wait (somehow) before starting my application's operations?
Partition migration will start as soon as the member joins the cluster. It will not block your application because it will progress asynchronously in the background.
Only mutating operations that fall into a migrating partition are blocked. Read-only operations are not blocked.
Mutating operations will get a PartitionMigrationException, which is a RetryableHazelcastException, so they will be retried for 2 minutes by default. If your partitions are small, migrating each one takes less time; you can increase the partition count via the system property hazelcast.partition.count.
If you want to block your application until all migrations finish, you can check the isClusterSafe method to make sure there are no migrating partitions in the cluster. But beware that isClusterSafe returns the status of the whole cluster rather than of the current member, so it may not be something to rely on. Instead, I would recommend not blocking the application while partitions are migrating.
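If you do decide to wait anyway, a minimal sketch of the polling approach in embedded mode (keeping in mind the caveat above that isClusterSafe describes the whole cluster) could look like this:

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.PartitionService;

public class WaitForMigrations {
    public static void main(String[] args) throws InterruptedException {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        PartitionService partitionService = hz.getPartitionService();
        // isClusterSafe() is true when no partitions are migrating and all backups
        // are in sync, cluster-wide.
        while (!partitionService.isClusterSafe()) {
            Thread.sleep(1000);
        }
        // Start the application's cache operations here.
    }
}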
(I could not find a good source explaining this, so if it is available elsewhere, you could just point me to it)
Hazelcast replicates data across all nodes in the cluster. So, if data is changed on one of the nodes, does that node update its own copy and then propagate it to the other nodes?
I read somewhere that each piece of data is owned by a node; how does Hazelcast determine the owner? Is the owner determined per data structure or per key within the data structure?
Does Hazelcast follow the "eventually consistent" principle? (While the data is being propagated across the nodes, there could be a small window during which the data is inconsistent between nodes.)
How are conflicts handled? (Two nodes update the same key-value simultaneously)
Hazelcast does not replicate (with the exception of the ReplicatedMap, obviously ;-)) but partitions data. That means there is one node that owns a given key. All updates to that key go to the owner, which then propagates them to the backups.
The owner is determined by consistent hashing using the following formula:
partitionId = hash(serialize(key)) % partitionCount
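You can observe that mapping at runtime; a small sketch (the map and key names are made up, Hazelcast 3.x packages assumed):

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.Partition;

public class OwnerLookup {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        hz.getMap("customers").put("customer-42", "some value");
        // Every key maps to exactly one partition, and every partition has exactly one owner.
        Partition partition = hz.getPartitionService().getPartition("customer-42");
        System.out.println("partitionId = " + partition.getPartitionId()
                + ", owner = " + partition.getOwner());
    }
}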
Since there is only one owner per key, it is not eventually consistent but consistent: as soon as a mutating operation returns, all following read operations will see the new value - under normal operational circumstances. When any kind of failure happens (network, host, ...), Hazelcast chooses availability over consistency, and it can happen that a not-yet-updated backup is promoted to owner (especially if you use async backups).
Conflicts can happen after a split-brain, when the split clusters re-merge. For this case you have to configure a MergePolicy (or use the default one) to define how conflicting entries are merged, or which of the two wins.
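A minimal sketch of setting one explicitly, assuming the Hazelcast 3.x configuration API (the map name and the chosen policy are just examples; newer versions configure this through MergePolicyConfig instead):

import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class MergePolicySetup {
    public static void main(String[] args) {
        Config config = new Config();
        // After a split-brain heals, the entry with the most recent update wins.
        config.getMapConfig("customers")
              .setMergePolicy("com.hazelcast.map.merge.LatestUpdateMapMergePolicy");
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
    }
}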
When adding a new datacenter, the DynamicSnitch causes us to read data from the new DC before the data is there.
We have a Cassandra (1.0.11) cluster running on 3 datacenters and we want to add a fourth. The cluster is configured with PropertyFileSnitch and DynamicSnitch enabled with a badness threshold of 0.0. The relevant keyspaces' replication factors are DC1:2, DC2:2, DC3:2. Our plan was to add the new datacenter to the ring, add it to the schema, and run a rolling repair -pr on all the nodes so the new nodes would get all the data they need.
Once we started the process, we noticed that the new datacenter receives read calls from the other datacenters because it has a lower load and the DynamicSnitch decides it is better to read from it. The problem is that this datacenter does not have the data yet and returns no results.
We tried removing the DynamicSnitch entirely, but once we did that, every time a single server got a bit of load we experienced extreme performance degradation.
Has anyone encountered this issue?
Is there a way to directly influence the score of a specific datacenter so it won't be picked by the DynamicSnitch?
Are there any better ways to add a datacenter in Cassandra 1.0.11? Has anyone written a snitch that handles these issues?
Thanks,
Izik.
You could bootstrap the nodes instead of adding them to the ring without bootstrap and then repairing. The former ensures that no reads will be routed to them until they have all the data they need. (That is why Cassandra defaults to auto_bootstrap: true; in fact, disabling it is a sufficiently bad idea that we removed it from the example cassandra.yaml.)
The problem with this, and the reason the documentation recommends adding all the nodes first without bootstrap, is that if you have N replicas configured for DC4, Cassandra will try to replicate the entire dataset for that keyspace to the first N nodes you add, which can be problematic!
So here are the options I see:
If your dataset is small enough, go ahead and use the bootstrap plan
Increase ConsistencyLevel on your reads so that they will always touch a replica that does have the data, as well as one that does not
Upgrade to 1.2 and use ConsistencyLevel.LOCAL_ONE on your reads which will force it to never make cross-DC requests
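For the third option, a rough sketch of a LOCAL_ONE read with the DataStax Java driver (the driver needs the native protocol, which only exists from 1.2 onward; the contact point, keyspace and query are placeholders):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class LocalOneRead {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
        Session session = cluster.connect("my_keyspace");
        SimpleStatement statement = new SimpleStatement("SELECT * FROM my_table WHERE id = 42");
        // LOCAL_ONE restricts the read to replicas in the coordinator's own datacenter,
        // so the not-yet-repaired DC4 is never asked for data.
        statement.setConsistencyLevel(ConsistencyLevel.LOCAL_ONE);
        ResultSet rows = session.execute(statement);
        System.out.println(rows.one());
        cluster.close();
    }
}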