I am referring to the project here: https://github.com/amplab/spark-indexedrdd
My questions are:
Is this still maintained? I noticed the last commit was in Sep 2015.
Are there plans to add a 2.11.x version on Maven?
Are there any plans to add the Indexed RDDs into Spark Core?
Writing this, I also realise the Spark project uses Scala 2.10.x. Is there any reason why there hasn't been a move to 2.11?
I don't believe it is still maintained, but you should open an issue on their GitHub repo to check that directly with the authors.
The point about adding Scala 2.11 support is best raised there as well.
Concerning Spark Core integration, there is a JIRA discussion on the topic. The project appears to have started as a pull request into Spark Core, but in the end the decision was to keep it as a separate package for now.
I'm surprised not to find official guidelines around upgrades for the community edition of Cassandra on their website.
https://cassandra.apache.org/doc/4.0/index.html
I see that DataStax provides some guidelines for their enterprise product, but not really for the community versions. Maybe I'm not looking at the right sites?
Googling around, I see various how-tos and differing advice about upgrading minor versions in a cluster, but nothing specific to making the jump from 4.0.1 to 4.0.5 (which is the latest RPM available in their official repos).
[context]
About a year ago, I put together a very simple Cassandra cluster with 3 seed nodes and 2 normal nodes. The cluster has been running 4.0.1 since then without any issues, and now I'm looking to upgrade a few minor versions to 4.0.5.
Besides replacing a node from time to time, the maintenance is pretty simple, and to be honest I cannot complain about the Cassandra software itself.
[/context]
My understanding from other Stack Overflow questions is that for minor versions (I saw a lot of questions about 3.11 minor upgrades) the risk is low, and sometimes, depending on the versions, you can get away without even having to upgrade the SSTables, but I cannot find whether this applies to going from 4.0.1 to 4.0.5.
I would like to understand how the community handles this lack of official community upgrade guidelines for C*, so I'm looking for the sites or docs that the great Stack Overflow heroes recommend as a reference.
I'm still in the research phase before the upgrade. I was thinking of running some sort of "online" upgrade, updating node by node to avoid downtime. I know this means having mixed versions in the cluster for a while, but I understand the other option is to bring the whole cluster down, perform the upgrade on all nodes, and then start it back up.
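Concretely, the per-node sequence I have in mind is drain, stop, upgrade package, start, verify. A minimal sketch of that sequence, written as a Scala sys.process script purely for illustration (the service name, package manager, and target version are assumptions about my own setup):

    import scala.sys.process._

    // Per-node rolling-upgrade sketch: run on one node at a time, and wait
    // for the node to come back as UN (Up/Normal) before moving on.
    // Assumes an RPM-based install managed by systemd; adjust as needed.
    object RollingUpgradeNode extends App {
      "nodetool drain".!                       // flush memtables, stop accepting traffic
      "sudo systemctl stop cassandra".!        // stop the daemon
      "sudo yum install -y cassandra-4.0.5".!  // upgrade the package in place
      "sudo systemctl start cassandra".!       // bring the node back up
      "nodetool status".!                      // verify UN before the next node
    }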
All our serving code is in Rust. To prevent training/serving skew, we would like to use the same serving code in our batch processes, which use Apache Beam. Any pointers on using Rust in Apache Beam?
As of now, there are only Beam SDKs for Java, Python and Go (TypeScript is on the way, currently experimental).
I see that there is a feature-request issue on GitHub, which was imported from BEAM-12658. The JIRA ticket had some discussion and effort put into bringing up a Rust SDK, but it doesn't seem to have much recent traction.
I'd suggest commenting on and tracking the GitHub issue above, or contributing to the project so we can make it happen.
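In the meantime, one stopgap (not an official Beam feature, just a pattern) is to ship the Rust serving code as a standalone binary and invoke it from a DoFn in one of the supported SDKs. A minimal sketch using the Java SDK from Scala, where the binary path and file locations are made-up placeholders:

    import org.apache.beam.sdk.Pipeline
    import org.apache.beam.sdk.io.TextIO
    import org.apache.beam.sdk.options.PipelineOptionsFactory
    import org.apache.beam.sdk.transforms.{DoFn, ParDo}
    import org.apache.beam.sdk.transforms.DoFn.ProcessElement

    // Scores each element by shelling out to the Rust serving binary.
    // "/opt/serving/score" and the I/O paths are placeholders.
    class RustScoreFn extends DoFn[String, String] {
      @ProcessElement
      def process(ctx: DoFn[String, String]#ProcessContext): Unit = {
        import scala.sys.process._
        // One process per element keeps the sketch short; a real job would
        // start a long-lived worker process in @Setup and stream records to it.
        val scored = Seq("/opt/serving/score", ctx.element()).!!
        ctx.output(scored.trim)
      }
    }

    object RustViaSubprocess {
      def main(args: Array[String]): Unit = {
        val options = PipelineOptionsFactory.fromArgs(args: _*).create()
        val pipeline = Pipeline.create(options)
        pipeline
          .apply(TextIO.read().from("/tmp/features.txt"))
          .apply(ParDo.of(new RustScoreFn))
          .apply(TextIO.write().to("/tmp/scored"))
        pipeline.run().waitUntilFinish()
      }
    }

Spawning a process per element is obviously slow; the point is only that the exact same Rust binary serves online and in batch, which is what prevents the skew.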
I wonder if there is a compatibility matrix for the various components of the Hadoop ecosystem?
Each Hadoop upgrade has a big compatibility impact, e.g.:
Apache Spark 2.4 does not support Hadoop v3,
Hadoop does not support Java 9 and 10,
and so on...
I know that vendors such as Hortonworks publish component lists with each version of their distribution, but these are not meant for the public at large, because they include patched components.
Does one have to go through each tool's JIRA bug tracker to find out about compatibility problems?
One of the key things that a company like Cloudera/Hortonworks does is take all the open-source projects that make up Hadoop and make sure they work well together. A lot of testing and tweaking is done, from both a functional and a security perspective, to ensure that everything together forms a proper release.
Now that you have some insight into how much effort goes into the release of just one distribution, with a comparatively strong focus on recent versions, you might understand why there is no general overview of 'how everything works with everything' beyond these distributions.
Full disclosure: I am an employee of Cloudera, but even without this I would still recommend working with a distribution where possible.
I have some extra security considerations compared with a normal job. I usually use sbt to build, and I will give it some libraries to grab from a Maven repository. But now I'm unable to use a lot of external libraries, and I'm unsure at this point whether I will be able to go out to Maven to get the Spark libraries I might need. Even if I could get the external libraries, there would be a vetting process that would take months for each library. Has anyone been in a similar situation? Given that external libraries are off the table, can anyone share what they did to build a successful suite of Spark jobs for their data munging and data science on a Hadoop cluster?
I don't think there is a standard solution for your problem within the context you describe. It depends on how far you go with external dependencies and what you really need. Take an example: parsing CSV rows and constructing DataFrames/Datasets or RDDs. You have plenty of options:
use an external library (from Databricks or others)
rely on your own code and do it by hand, so no external dependency
rely on newer Spark versions, which know how to deal with CSV natively (see the sketch after this list)
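For the last option, a minimal sketch, assuming Spark 2.x or later where the CSV reader is built in (the path and options are illustrative):

    import org.apache.spark.sql.SparkSession

    // Spark's built-in CSV reader (available since Spark 2.0), so no
    // external package is required. The input path is a placeholder.
    object CsvNoExternalDeps {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("csv-example").getOrCreate()
        val df = spark.read
          .option("header", "true")       // first row holds column names
          .option("inferSchema", "true")  // infer column types from the data
          .csv("hdfs:///data/input.csv")
        df.printSchema()
        spark.stop()
      }
    }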
If you have a Hadoop cluster, then the Spark runtime environment already contains plenty of libraries that will be loaded (JSON manipulation, networking, logging, just to name a few). Most of the business logic inside your Spark jobs can be done with those.
Let me give some examples of how I have approached the problem of external dependencies, although I didn't have any security constraints. In one case we had to use a Spring dependency within our Spark application (because we wanted to update some relational tables), so we got a fat JAR with all the Spring dependencies, and there were many. Conclusion: a lot of dependencies for nothing (a horror to maintain :) ), so that was not a good approach. In another case we had to do the same thing, but we kept the dependency to a minimum (the simplest thing that can read/update a table over JDBC). Conclusion: the fat JAR was not that big; we kept only what was really needed, nothing more, nothing less.
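For illustration, the "minimal JDBC" approach from the second case can be as small as plain java.sql with no framework at all; the connection URL, credentials, and table here are placeholders:

    import java.sql.DriverManager

    // Plain java.sql update with no framework: the only external piece is
    // the vendor's JDBC driver jar. URL, credentials, and SQL are placeholders.
    object MinimalJdbcUpdate {
      def main(args: Array[String]): Unit = {
        val conn = DriverManager.getConnection(
          "jdbc:postgresql://db-host:5432/app", "app_user", "secret")
        try {
          val stmt = conn.prepareStatement(
            "UPDATE relation_table SET status = ? WHERE id = ?")
          stmt.setString(1, "processed")
          stmt.setLong(2, 42L)
          println(s"updated ${stmt.executeUpdate()} row(s)")
          stmt.close()
        } finally conn.close()
      }
    }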
Spark already provides a lot of functionality. Knowing an external library that can do something does not mean that Spark can't do it with what it has.
In my organisation we are planning to use Cassandra, and these days we are running some experimental tests against a custom configuration to determine the better and more stable version of Cassandra. We are using the DataStax drivers.
We are running tests with INSERT INTO and SELECT * FROM CQL statements in a very tight loop under high load, around 10K QPS.
Does anyone have experience with which Cassandra version is better and more stable, and which drivers should be used?
Thanks in advance.
You cannot go wrong with the latest 2.0 release (2.0.9). You can get that version from either the Apache Cassandra project or DataStax. The Apache Cassandra download page also has links for the latest release candidates (RC5 is the latest) of 2.1, but those are still in development, so consider that before installing them.
As for the driver, there are drivers available for more than a dozen languages, so chances are you already know or use one of them. There is no single driver (at least that I am aware of) that significantly outperforms all of the others. So pick the driver for the language that either:
you have the most thorough knowledge of, or
complies with the usage standards of your team.
For instance, you could make an argument for using Java. After all, Cassandra is written in Java and all of the examples on the original DataStax Academy are done with the Java CQL Driver. But that argument loses ground quickly if you have never done Java before. Or if your team is a .Net shop, and there's nobody else who understands Java. InfoWorld's Andrew Oliver put it best when he wrote:
“The lesson to be learned here is: Don't solve a simple problem with a completely unfamiliar technology and apply it to use cases it isn't especially appropriate for.”
Again, you cannot go wrong with using a "DataStax Supported Driver" from their downloads page.
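Whichever language you pick, the basic usage pattern is similar. Here is a minimal sketch with the DataStax Java driver (the 2.x-era API), called from Scala; the contact point, keyspace, and schema are placeholders. Note that prepared statements matter for the tight 10K-QPS loop you describe:

    import java.util.UUID
    import com.datastax.driver.core.Cluster
    import scala.collection.JavaConverters._

    // Minimal sketch with the DataStax Java driver (2.x-era API), called
    // from Scala. Contact point, keyspace, and schema are placeholders.
    object CassandraSmokeTest {
      def main(args: Array[String]): Unit = {
        val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
        val session = cluster.connect("test_ks")
        // Prepare once, bind per call: re-parsing the CQL string on every
        // request wastes server CPU, which shows up quickly at 10K QPS.
        val insert = session.prepare("INSERT INTO users (id, name) VALUES (?, ?)")
        session.execute(insert.bind(UUID.randomUUID(), "alice"))
        for (row <- session.execute("SELECT * FROM users LIMIT 10").asScala)
          println(row.getString("name"))
        cluster.close()
      }
    }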
“You should not deploy a Cassandra version X.Y.Z to production where Z <= 5.”
Source:
https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/
Hence, go with 2.0.x. Currently it's 2.0.10.