I have read in a few places that the MapStore and MapLoader must run with the Hazelcast node. I would like to find out whether there is any way to implement the MapStore/MapLoader separately from the Hazelcast node.
Background:
If I have a Hazelcast cluster for the team, and this cluster is to be used by different sub-teams, each providing a different map as data, and each sub-team should implement the MapStore/MapLoader for the map they own, how can this be done? (Note that each sub-team has its own SVN repository.)
Thanks in advance~
MapLoader's load() operation is invoked only on the node that owns the key, and only when the key is missing, so there is no way to push this processing elsewhere.
However, each map can have a different MapStore/MapLoader implementation, so having a different team provide each is certainly feasible.
Exactly how you achieve this comes down to your build and deploy practices. For example, each team's classes could be in a separate jar file on the classpath. Or, there could be a single jar file constructed containing the classes provided by each team. Many ways exist!
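For illustration, here is a minimal sketch of wiring two maps to two different MapStore implementations via programmatic configuration; the map names and the classes com.teama.OrdersMapStore and com.teamb.CustomersMapStore are hypothetical, and each team's jar just needs to be on the member's classpath:

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;
import com.hazelcast.config.MapStoreConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class ClusterBootstrap {
    public static void main(String[] args) {
        Config config = new Config();

        // Team A's map, backed by their own MapStore implementation (class name assumed)
        MapConfig ordersMap = new MapConfig("orders");
        ordersMap.setMapStoreConfig(new MapStoreConfig()
                .setEnabled(true)
                .setClassName("com.teama.OrdersMapStore"));
        config.addMapConfig(ordersMap);

        // Team B's map, backed by a different implementation (class name assumed)
        MapConfig customersMap = new MapConfig("customers");
        customersMap.setMapStoreConfig(new MapStoreConfig()
                .setEnabled(true)
                .setClassName("com.teamb.CustomersMapStore"));
        config.addMapConfig(customersMap);

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
    }
}
```

The same per-map wiring can equally be expressed in hazelcast.xml; the only requirement is that every member sees the same configuration and the relevant classes.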
Related
I have one application running on Node.js and I am trying to make it a distributed app. All write requests go to the Node application, which writes to CouchDB A and, on success, writes to CouchDB B. We read data through an ELB (which reads from the two DBs). It's working fine.
But I recently faced a problem: CouchDB B went down, and after it came back up there is now a document _rev mismatch between the two instances.
What would be the best approach to resolve the above scenario without any downtime?
If your CouchDB A & CouchDB B are in the same data centre, then @Flimzy's suggestion of using CouchDB 2.0 in a clustered deployment is a good one. You can have n CouchDB nodes configured in a cluster with a load balancer sitting above the cluster, delivering HTTP(S) traffic to any node that is "up".
If A & B are geographically separated, you can use CouchDB Replication to move data from A-->B and B-->A which would keep both instances perfectly in sync. A & B could each be clusters of 3 or more CouchDB 2.0 nodes, or single instances of CouchDB 1.7.
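As a rough sketch of that second option (hosts, database name and admin credentials are placeholders; add credentials to the source/target URLs if the databases require them), continuous replication in both directions can be set up by creating a replication document on each server:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class SetupReplication {
    // Hypothetical admin credentials for the PUT requests -- adjust to your setup.
    private static final String AUTH = "Basic "
            + Base64.getEncoder().encodeToString("admin:secret".getBytes());

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Continuous replication in both directions keeps A and B converging.
        createReplication(client, "http://couch-a:5984", "a-to-b",
                "http://couch-a:5984/mydb", "http://couch-b:5984/mydb");
        createReplication(client, "http://couch-b:5984", "b-to-a",
                "http://couch-b:5984/mydb", "http://couch-a:5984/mydb");
    }

    static void createReplication(HttpClient client, String server, String id,
                                  String source, String target) throws Exception {
        // A document in the _replicator database describes one replication job.
        String body = String.format(
                "{\"source\":\"%s\",\"target\":\"%s\",\"continuous\":true}", source, target);
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(server + "/_replicator/" + id))
                .header("Content-Type", "application/json")
                .header("Authorization", AUTH)
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(id + ": " + response.statusCode() + " " + response.body());
    }
}
```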
None of these solutions will "fix" the problem you are seeing when two copies of the database are modified in different ways at the same time. This "conflict" state is CouchDB's way of preventing data loss when two writes clash. Your app can resolve the conflict by picking a winning revision or writing a new one. It's not a fault condition; it's helping your application avoid losing data when concurrent writes occur in a distributed system.
You can read more about document conflicts in this blog post series.
If both of your 1.6.x nodes are syncing buckets using standard replication, turning off one node shouldn't be an issue. When the node comes back up it receives all updates without conflicts, because there was no way to create them while the node was down.
If you experience conflicts during normal operation, unfortunately there is no general way to resolve them automatically. However, in most cases you can find a strategy of marking affected doc subtrees in a way that lets you determine which version is most recent (or more important).
To detect docs that have conflicts you can use standard views: a doc received by a view function has the _conflicts property if conflicting revisions exist. Using an appropriate view you can detect conflicts and merge docs. In any case, regardless of how you detect conflicts, you need external code to resolve them.
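To illustrate the "external code" part, here is a rough sketch (host, database and document id are placeholders, and the winner-picking logic is left as a stub): fetch a document with ?conflicts=true, decide which revision should win, and delete the losing revisions.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ResolveConflict {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Placeholder database and document id; add an Authorization header
        // if the database requires credentials.
        String docUrl = "http://couch-a:5984/mydb/some-doc-id";

        // 1. Fetch the document together with its conflicting revision ids.
        HttpRequest get = HttpRequest.newBuilder()
                .uri(URI.create(docUrl + "?conflicts=true"))
                .GET()
                .build();
        String doc = client.send(get, HttpResponse.BodyHandlers.ofString()).body();
        System.out.println(doc); // the _conflicts array lists the losing candidates

        // 2. After deciding which revision wins (application-specific logic),
        //    delete each losing revision; the remaining revision becomes the winner.
        String losingRev = "2-placeholder-rev-id"; // taken from _conflicts in real code
        HttpRequest delete = HttpRequest.newBuilder()
                .uri(URI.create(docUrl + "?rev=" + losingRev))
                .DELETE()
                .build();
        System.out.println(client.send(delete,
                HttpResponse.BodyHandlers.ofString()).statusCode());
    }
}
```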
If your conflicting data is numeric by nature, consider using CRDT structures and standard map/reduce to obtain the final value. If your data is text-like, you may also try CRDTs, but to get reasonable performance you need to use reducers written in Erlang.
As for 2.x: I do not recommend using 2.x for your case (actually, for any production case except experiments). First, using 2.x will not remove conflicts, so it does not solve your problem. Also, taking into account that 2.x requires a lot of poorly documented manual operations across nodes and is unable to rebalance, you will get more pain than value.
BTW, using any cluster solution makes very little sense for two nodes.
As for the above-mentioned CVE-2017-12635 and CouchDB 1.6.x: you can use this patch https://markmail.org/message/kunbxk7ppzoehih6 to cover the vulnerability.
I have some extra security constraints compared to a normal job. I usually use sbt to build, and I give it some libraries to grab from a Maven repository. But now I'm unable to use a lot of external libraries, and I'm unsure at this point whether I will be able to go out to Maven to get the Spark libraries I might need. Even if I could get the external libraries, there would be a vetting process that would take months for each library. Has anyone been in a similar situation? From the standpoint of not being able to use external libraries, can anyone share what they did to build a successful suite of Spark jobs to do their data munging and data science on a Hadoop cluster?
I think there isn't a standard solution for your problem within the context you described. It depends on how far you go with external dependencies and what you really need. Let me give you an example: parsing CSV rows and constructing DataFrames/Datasets or RDDs. You have plenty of options:
use an external library (from Databricks or others)
rely on your own code and do it by hand, so no external dependency
rely on newer Spark versions that know how to deal with CSV natively (see the sketch after this list)
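For the third option, here is a minimal sketch using the CSV reader built into Spark 2.x+ (the input path is a placeholder):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvWithoutExtras {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("csv-example")
                .getOrCreate();

        // Spark 2.x+ reads CSV natively, so no separate databricks-csv dependency is needed.
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("hdfs:///data/input.csv"); // placeholder path

        df.printSchema();
        df.show(10);

        spark.stop();
    }
}
```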
If you have a Hadoop cluster, then the Spark runtime environment already contains plenty of libraries that will be loaded (JSON manipulation, networking, logging, just to name a few). Most of the business logic inside your Spark jobs can be done with those.
Let me give some examples of how I have approached the problem of external dependencies, although I didn't have any security constraints. In one case we had to use a Spring dependency within our Spark application (because we wanted to update some relational tables), so we ended up with a fat jar containing all the Spring dependencies, and they were many. Conclusion: we got a lot of dependencies for nothing (a horror to maintain :) ), so that was not a good approach. In another case we had to do the same thing, but we kept the dependency to a minimum (the simplest thing that can read/update a table over JDBC). Conclusion: the fat jar was not that big; we kept only what was really needed, nothing more, nothing less.
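To illustrate that second, minimal approach: Spark's built-in JDBC data source can read and write a relational table without pulling in a framework like Spring; only the JDBC driver jar is needed. The URL, table names and credentials below are placeholders.

```java
import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class JdbcWithoutSpring {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("jdbc-example")
                .getOrCreate();

        // Connection details are placeholders; the JDBC driver jar still has to be
        // on the classpath, but that is the only external dependency required.
        Properties props = new Properties();
        props.setProperty("user", "appuser");
        props.setProperty("password", "secret");
        String url = "jdbc:postgresql://dbhost:5432/appdb";

        // Read the relational table into a DataFrame.
        Dataset<Row> relations = spark.read().jdbc(url, "relation_table", props);

        // ... derive the updated rows from your Spark job ...
        Dataset<Row> updated = relations; // stand-in for the real business logic

        // Write the result back to the database.
        updated.write().mode(SaveMode.Overwrite).jdbc(url, "relation_table_out", props);

        spark.stop();
    }
}
```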
Spark already provides you with a lot of functionality. Knowing an external library that can do something does not mean that Spark can't do it with what it has.
I'm currently evaluating using Hazelcast for our software. Would be glad if you could help me elucidate the following.
I have one specific requirement: I want to be able to configure distributed objects (say maps, queues, etc.) dynamically. That is, I can't have all the configuration data at hand when I start the cluster. I want to be able to initialise (and dispose) services on-demand, and their configuration possibly to change in-between.
The version I'm evaluating is 3.6.2.
The documentation I have available (the Reference Manual, the Deployment Guide, as well as the "Mastering Hazelcast" e-book) is very skimpy on details w.r.t. this subject, and even partially contradictory.
So, to clarify an intended usage: I want to start the cluster; then, at some point, create, say, a distributed map structure, use it across the nodes; then dispose it and use a map with a different configuration (say, number of backups, eviction policy) for the same purposes.
The documentation mentions, and this is to be expected, that bad things will happen if nodes have different configurations for the same distributed object. That makes perfect sense and is fine; I can ensure that the configs will be consistent.
Looking at the code, it would seem to be possible to do what I intend: when creating a distributed object, if it doesn't already have a proxy, the HazelcastInstance will go look at its Config to create a new one and store it in its local list of proxies. When that object is destroyed, its proxy is removed from the list. On the next invocation, it would go reload from the Config. Furthermore, that config is writeable, so if it has been changed in-between, it should pick up those changes.
So this would seem like it should work, but given how silent the documentation is on the matter, I'd like some confirmation.
Is there any reason why the above shouldn't work?
If it should work, is there any reason not to do the above? For instance, are there plans to change the code in future releases in a way that would prevent this from working?
If so, is there any alternative?
Changing the configuration of an already created distributed object on the fly is not possible with the current version, though there is a plan to add this feature in a future release. Once created, the map's config stays at the node level, not at the cluster level.
As long as you are creating the distributed map fresh from the config, using it, and then destroying it, your approach should work without any issues.
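A minimal sketch of that create/use/destroy cycle against the 3.6.x API, following the pattern described in the question (the map name and backup counts are arbitrary):

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class OnDemandMaps {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(new Config());

        // Register a config for the map before the first getMap() call creates its proxy.
        hz.getConfig().addMapConfig(new MapConfig("sessions").setBackupCount(2));
        IMap<String, String> sessions = hz.getMap("sessions");
        sessions.put("k", "v");

        // Dispose of the structure; its proxy (and data) are removed.
        sessions.destroy();

        // Re-register with different settings before the next getMap() call.
        hz.getConfig().addMapConfig(new MapConfig("sessions").setBackupCount(1));
        IMap<String, String> recreated = hz.getMap("sessions");
        recreated.put("k", "v");
    }
}
```

As noted above, all members must apply the same config changes in the same way, or the usual "mismatched configuration" problems apply.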
I am exploring the notion of using Hazelcast (or another caching framework) to advertise services within a cluster. Ideally, when a cluster member departs, its services (or the objects advertising them) should be removed from the cache.
Is this at all possible?
It is possible for sure.
The question is which solution you prefer.
If the services can be stored in a map, you could create a map with a TTL of, say, a few minutes, and have each member refresh its service entries to prevent them from expiring.
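A minimal sketch of the TTL approach (map name, service key, endpoint and refresh period are arbitrary):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class ServiceAdvertiser {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, String> services = hz.getMap("services");
        String memberId = hz.getCluster().getLocalMember().getUuid();

        // Re-advertise every minute with a 3-minute TTL; if the member dies,
        // its entry simply expires instead of lingering in the map.
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(
                () -> services.put(memberId, "http://host:8080/my-service",
                        3, TimeUnit.MINUTES),
                0, 1, TimeUnit.MINUTES);
    }
}
```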
An alternative solution is to listen for member changes using a MembershipListener and, once a member leaves, remove the services that belong to that member from the map.
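And a sketch of the listener approach (Hazelcast 3.x API), assuming by convention that each member registers its services under its own UUID:

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.core.MembershipAdapter;
import com.hazelcast.core.MembershipEvent;

public class ServiceCleanup {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, String> services = hz.getMap("services");

        // Remove a departed member's entry as soon as the cluster notices it is gone.
        hz.getCluster().addMembershipListener(new MembershipAdapter() {
            @Override
            public void memberRemoved(MembershipEvent event) {
                services.remove(event.getMember().getUuid());
            }
        });
    }
}
```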
If you don't like any of these, you could create your own SPI-based implementation. The SPI is the lower-level infrastructure used by Hazelcast to create its distributed data structures. A lot more work, but also a lot more flexibility.
So there are many solutions.
I started to use Jackrabbit in my project. As I found out, there is no full-featured LoginModule and AccessManager provided. I mean, we can find SimpleLoginModule, but it is just a mock.
What I need is a simple LoginModule which can be configured, e.g., from a file with users, passwords and groups. I know that I can implement my own classes, but it is hard to believe that after so many years there is no ready-made solution...
There are a couple of Jackrabbit-based open-source / closed-source projects out there that use JCR as their reference implementation and provide such implementations. You are most probably best off choosing one of them in order not to reinvent the wheel. For a complete list: http://en.wikipedia.org/wiki/Apache_Jackrabbit
Are you running inside an app server or web container? If so, you would usually expect the container to provide a JAAS implementation. For example, for instructions on how to set it up with Jetty, storing user information in a database, a properties file, or LDAP, see:
http://www.eclipse.org/jetty/documentation/current/jaas-support.html