Is there a compatibility matrix for Hadoop components? - apache-spark

I wonder if there is a compatibility matrix for the various Hadoop components of the ecosystem?
Each Hadoop upgrade has a big compatibility impact, e.g.:
Apache Spark 2.4 does not support Hadoop v3,
Hadoop does not support Java 9 and 10,
and so on...
I know that vendors such as Hortonworks publish component lists with each version of their distribution, but these are not meant for the public at large because they include patched components.
Does one have to go through all the bug trackers on Jira for each tool to find out about compatibility problems?

One of the key things that a company like Cloudera/Hortonworks does is take all the open source projects that make up Hadoop and make sure they work well together. Both from a functional and a security perspective, a lot of testing and tweaking is done to ensure that everything together forms a proper release.
Now that you have some insight into how much effort goes into the release of just one distribution, with a comparatively strong focus on recent versions, you might understand that there will not be a general overview of 'how everything works with everything' beyond these distributions.
Full disclosure: I am an employee of Cloudera, but even without this I would still recommend working with a distribution where possible.

Related

Cassandra cluster upgrade from 4.0.1 to 4.0.5, looking for any official documentation about it

I'm surprised not to find official guidelines around upgrades for the community edition of Cassandra on their website.
https://cassandra.apache.org/doc/4.0/index.html
I see that DataStax provides some guidelines for their enterprise product, but not really for the community versions. Maybe I'm not looking at the right sites?
By googling around I see various how-tos and advice about upgrading minor versions in a cluster, but nothing specific about making the jump from 4.0.1 to 4.0.5 (which is the latest rpm available in their official repos).
[context]
About a year ago I put together a very simple Cassandra cluster with 3 seed nodes and 2 normal nodes. This cluster has been running on 4.0.1 since then without any issues, and now I'm looking to upgrade a few minor versions to 4.0.5.
Besides replacing a node from time to time, the maintenance is pretty simple, and tbh I cannot complain about the Cassandra software itself.
[/context]
My understanding from other Stack Overflow questions is that for minor versions (I saw a lot of questions about 3.11 minor upgrades) the risk is low, and sometimes, depending on the versions, you can get away without even having to upgrade the sstables, but I cannot find whether this applies to 4.0.1 to 4.0.5.
I would like to understand how the community is handling this lack of official community guidelines for upgrades in C*, so I'm looking to see what sites or docs the great Stack Overflow heroes recommend as a reference.
I'm still in the research phase before the upgrade. I was thinking of running some sort of "online" upgrade by updating node by node, to avoid downtime. I know this will mean having mixed versions in the cluster for a while, but I understand the other option is to bring the whole cluster down, perform the upgrade on all nodes, and then start it back up.
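To make the node-by-node idea concrete, below is a rough sketch of the pass I have in mind, driven from scala.sys.process. The hostnames, ssh access, and yum as the package manager are all assumptions on my side, not anything from official docs:

```scala
// Sketch only: one pass of the rolling upgrade, node by node.
// Hostnames, ssh access, and yum are assumptions, not official steps.
import scala.sys.process._

object RollingUpgradeSketch {
  def main(args: Array[String]): Unit = {
    val nodes = Seq("cass-1", "cass-2", "cass-3", "cass-4", "cass-5")

    for (node <- nodes) {
      // Flush memtables and stop accepting traffic before the restart.
      Seq("ssh", node, "nodetool drain").!
      Seq("ssh", node, "sudo systemctl stop cassandra").!
      // Assumes the 4.0.5 rpm is available in the configured repo.
      Seq("ssh", node, "sudo yum install -y cassandra-4.0.5").!
      Seq("ssh", node, "sudo systemctl start cassandra").!
      // Check the node reports Up/Normal ("UN") before moving on.
      Seq("ssh", node, "nodetool status").!
    }
  }
}
```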

Node version - using an obsolete version

I would like to know the following.
We are using Node version 6, and an upgrade to Node 12 or 14 is easier said than done, as in our case it may demand a rewrite of our code.
That said, I would like to know the disadvantages of continuing with Node 6 for a significant amount of time? I know Node 6 is not supported, but what would that mean for a production application that has been running for several years? Thanks.
Let's assume your production application is running perfectly fine so far, so there is no need to change anything (here we are focusing on business logic). But apart from that, there are certain things we focus on when maintaining production projects, like adding new features, improving performance, and many more.
Let's focus on those two points.
Adding new features: if we want to add new functionality, we have to stick with the older version of Node.js, so any libraries we use must also run on Node.js version 6 or lower, which is problematic for developers.
Improving performance:
Let's look at the key improvements made in Node.js after version 6:
Heap size & dump improvements
Native modules N-API improvements
Improved startup performance, TLS, and security
Performance improvements with V8 Engine v6.6
and many more
These are just the system improvements; apart from these there are language and API improvements like promises, async/await, ES6 features, diagnostics, and monitoring.
So updating gradually to newer stable versions helps both with maintainability for developers and with getting high performance.
Note that Node version 8 is coming to the end of its life; after that there'll be no support for it, including security patches.
Considering your case, I wouldn't recommend jumping 8 major versions and upgrading to version 14.
Instead, go one version up and see how things go; then continue doing the same until you get to an LTS version.

JanusGraph + Cassandra (Generic questions)

I have a few questions regarding the integration of the two tools. Not technical questions about how to set it up (I will have my fun with that later), but more about the course and direction of the project, seeing that JanusGraph is still very young.
I am starting a new project and have already decided to use Cassandra for storage, and using a graph on top sounds very appealing to me.
A couple of things I would like to know in advance before I take that road:
JanusGraph is very young and picks up from where Titan left off about a year or so ago. There is a gap there, but the fact that it is part of the Linux Foundation and all the big players are going to support it sounds promising. Is it safe to assume at this point that JanusGraph is here to stay? Would it be safe for a startup project to depend on Janus, following development of course and staying as up to date as possible?
Cassandra. Titan/JanusGraph integrates with Cassandra 2.1.9 using the Thrift API, which will eventually be deprecated in Cassandra 4. I know that work is being done at the moment to make Janus work with Cassandra 3 and eventually with CQL as well. Is it safe to start with the existing Janus and Cassandra 2.1.9 and deal with the migration later on? Will it be a huge task for a startup to handle?
Production-ready JanusGraph. (This question relates to any kind of software in its early stages and whether it's safe for a startup to use.) As I understand it, it will take some time for JanusGraph to be production-ready and catch up with the rest of the tools it integrates with (although work is being done as we speak :)). Again, would it be safe to start using Janus at this point, follow development, and finally migrate to a production-ready version? What is the overall roadmap for JanusGraph?
My concern in general is whether the combination of these tools is a safe choice for a startup. The whole stack is already new to us, and we are excited to try it and learn, but we will hit a migration period pretty quickly. Is it something that you would do/recommend? Is it suicide?
Please share your thoughts, and keep in mind that it doesn't have to be about the stack I am talking about. It could be any startup company dealing with any kind of software in its early stages.
Cheers
Full disclosure, I'm a developer for JanusGraph on Compose.
It's as safe as any other OSS project with a large number of backers. Everyone could jump to some new toy tomorrow, but I doubt it. Companies are putting money into it, and the development community is very active.
There is a CQL backend for Janus that's compatible with the Thrift data model. Migration to CQL should be simple and pretty painless when 0.2.0 is released.
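For illustration, opening a graph against the CQL backend should look roughly like this once 0.2.0 is out (a sketch using the JanusGraphFactory builder API; the hostname is a placeholder):

```scala
// Sketch: opening JanusGraph over the CQL backend (assumes 0.2.0+,
// where the CQL backend ships). The hostname is a placeholder.
import org.janusgraph.core.JanusGraphFactory

object OpenJanusOverCql {
  def main(args: Array[String]): Unit = {
    val graph = JanusGraphFactory.build()
      .set("storage.backend", "cql")        // instead of "cassandrathrift"
      .set("storage.hostname", "127.0.0.1") // placeholder Cassandra host
      .open()

    // Same underlying data model as the Thrift backend, per the above.
    println(graph.isOpen())
    graph.close()
  }
}
```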
I know there are already people using Titan for production applications. With JanusGraph being forked from Titan, I think it's pretty reasonable to start with JanusGraph, from everything I've seen. As for a roadmap, I'd check out the JanusGraph mailing lists (dev/users) to see what's going on and what's being talked about.
Disclosure: I am one of the co-founders of the JanusGraph project; I am also seeking out and adding production users to our GitHub repo and website, so I may be slightly biased. :)
Regarding your questions:
Is it safe to use?
The project is young, but it is built on a foundation of Titan, a very popular graph database that's been around since 2012 and has already been running in production. We have contributors from a number of well-known companies, and several companies are building their business-critical applications directly on JanusGraph, e.g.,
GRAKN.AI is building their knowledge graph on JanusGraph
IBM's Compose.io has built a managed JanusGraph service
Uber is already running JanusGraph in production (having previously run Titan)
several other companies run JanusGraph as a core part of their production environment
We are also starting to identify companies who will provide consulting services around JanusGraph in case someone needs production-level support for their own self-managed deployments.
So as you can see, there is significant interest in and support for this project.
Cassandra upgrade
@pantalohnes answered this question; I won't repeat it here.
Production readiness
As I linked above (GitHub repo and website), we already have production users of JanusGraph which you can find there. Those are just the companies that are publicly willing to lend their name/logo to the project; I'm sure there are more. Also, Titan has been running in many production environments for several years; JanusGraph is a more up-to-date version of Titan, despite the low version number.
I am also speaking with other companies who are planning to migrate to JanusGraph soon; look for announcements via the @JanusGraph Twitter handle to learn about more production deployments.

Spark security consideration

I have some extra security considerations beyond a normal job. I usually use sbt to build, and I give it some libraries to grab from a Maven repository. But now I'm unable to use a lot of external libraries, and I'm unsure at this point whether I will be able to go out to Maven to get the Spark libraries I might need. Even if I could get the external libraries, there would be a vetting process that would take months for each library. Has anyone been in a similar situation? Given that external libraries are off the table, can anyone share what they did to build a successful suite of Spark jobs for data munging and data science on a Hadoop cluster?
I don't think there is a standard solution for your problem within the context you described. It depends on how far you go with external dependencies and what you really need. Let me give you an example: parsing CSV rows and constructing DataFrames/Datasets or RDDs. You have plenty of options:
use an external library (from Databricks or others)
rely on your own code and do it by hand, so no external dependency
rely on newer Spark versions that know how to deal with CSV natively (see the sketch below)
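To illustrate the third option, here is a minimal sketch assuming Spark 2.0+, where CSV support is built in and no external library is needed (the input path is a placeholder):

```scala
// Sketch of the third option: the CSV reader built into Spark 2.0+,
// so no external dependency is needed. The input path is a placeholder.
import org.apache.spark.sql.SparkSession

object CsvWithoutExternalLibs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-no-external-deps")
      .getOrCreate()

    val df = spark.read
      .option("header", "true")      // first line holds column names
      .option("inferSchema", "true") // let Spark guess column types
      .csv("hdfs:///data/input.csv") // placeholder path

    df.printSchema()
    df.show(10)
    spark.stop()
  }
}
```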
If you have a Hadoop cluster, then the Spark runtime environment already contains plenty of libraries that will be loaded (JSON manipulation, networking, logging, just to name a few). Most of the business logic inside your Spark jobs can be done with those.
Let me give you some examples of how I have approached the problem of external dependencies, although I didn't have any security constraints. In one case we had to use a Spring dependency within our Spark application (because we wanted to update some relational tables), so we ended up with a fat jar containing all the Spring dependencies, and they were many. Conclusion: a lot of dependencies for nothing (a horror to maintain :) ). So that was not a good approach. In another case we had to do the same thing, but we kept the dependencies to a minimum: the simplest thing that can read/update a table over JDBC. Conclusion: the fat jar was not that big; we kept only what was really needed, nothing more, nothing less.
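For that second case, the "simplest thing" amounted to plain java.sql, roughly like this sketch (connection details and the table are placeholders; only the JDBC driver jar is needed on top of the JDK):

```scala
// Sketch of the minimal-dependency approach: plain java.sql instead of
// Spring. URL, credentials, and table are placeholders; only the JDBC
// driver jar is needed beyond the JDK.
import java.sql.DriverManager

object UpdateRelationalTable {
  def main(args: Array[String]): Unit = {
    val url  = "jdbc:postgresql://db-host:5432/mydb" // placeholder
    val conn = DriverManager.getConnection(url, "user", "password")
    try {
      val stmt = conn.prepareStatement(
        "UPDATE job_status SET state = ? WHERE job_id = ?")
      stmt.setString(1, "FINISHED")
      stmt.setLong(2, 42L)
      stmt.executeUpdate()
    } finally {
      conn.close()
    }
  }
}
```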
Spark already provides you with a lot of functionality. Knowing of an external library that can do something does not mean that Spark can't do it with what it has.

Which Cassandra version is more stable for Production deployment? And which Cassandra driver is better?

In my organisation we are planning to use Cassandra, and these days we are running some experimental tests against a custom configuration to work out the better, more stable version of Cassandra. We are using DataStax drivers.
We are running tests with INSERT INTO and SELECT * FROM CQL statements in a very tight loop under high load, around 10K qps.
Does anyone have any experience with which Cassandra version is better and more stable, and which drivers should be used?
Thanks in advance.
You cannot go wrong with the latest 2.0 release (2.0.9). You can get that version from either the Apache Cassandra project or DataStax. The Apache Cassandra download page also has links for the latest release candidates (RC5 is the latest) of 2.1, but those are still in development, so consider that before installing them.
As for the driver, there are drivers available for more than a dozen languages. Chances are that you probably know or use one of them. There is no one driver (at least that I am aware of) that significantly out-performs all of the others. So pick the driver for the language that either:
You have the most thorough knowledge of.
Complies with the usage standards of your team.
For instance, you could make an argument for using Java. After all, Cassandra is written in Java, and all of the examples on the original DataStax Academy are done with the Java CQL driver. But that argument loses ground quickly if you have never done Java before, or if your team is a .NET shop and nobody else understands Java. InfoWorld's Andrew Oliver put it best when he wrote:
The lesson to be learned here is: Don't solve a simple problem with a completely unfamiliar technology and apply it to use cases it isn't especially appropriate for.
Again, you cannot go wrong with using a "DataStax Supported Driver" from their downloads page.
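To make that concrete, a tight INSERT/SELECT loop like the one you describe would look roughly like this with the DataStax Java driver (here called from Scala; contact point, keyspace, and table are placeholders, and the prepared statement is what matters at 10K qps):

```scala
// Sketch of a tight INSERT/SELECT loop with the DataStax Java driver
// (2.x/3.x era API), called from Scala. Contact point, keyspace, and
// table are placeholders.
import com.datastax.driver.core.Cluster

object TightLoopSketch {
  def main(args: Array[String]): Unit = {
    val cluster = Cluster.builder()
      .addContactPoint("127.0.0.1") // placeholder contact point
      .build()
    val session = cluster.connect("test_ks") // placeholder keyspace

    // Prepare once, bind many times; re-parsing CQL on every request
    // would dominate at ~10K qps.
    val insert = session.prepare("INSERT INTO users (id, name) VALUES (?, ?)")
    for (i <- 1 to 10000) {
      session.execute(insert.bind(Int.box(i), s"user-$i"))
    }

    val rows = session.execute("SELECT * FROM users LIMIT 10")
    rows.forEach(row => println(row.getString("name")))

    cluster.close() // also closes the session
  }
}
```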
“You should not deploy a Cassandra version X.Y.Z to production where Z <= 5.”
Source:
https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/
Hence, go with 2.0.x; currently that's 2.0.10.
