A quick guide on Salt-based install of Spark cluster - apache-spark

I tried asking this on the official Salt user forum, but for some reason I did not get any assistance there. I am hoping I might get help here.
I am a new user of Salt. I am still evaluating the framework as a candidate for our SCM tool (as opposed to Ansible).
I went through the tutorial, and I am able to successfully manage master-minion/s relationship as covered in the first half of the tutorial.
The tutorial then forks into many different, intricate areas.
What I need is relatively straightforward, so I am hoping that someone can guide me here on how to accomplish it.
I am looking to install Spark and HDFS on 20 RHEL 7 machines (let's say in the range 168.192.10.0-20, where .0 is the name node).
I see:
https://github.com/saltstack-formulas/hadoop-formula
and I found a third-party Spark formula:
https://github.com/beauzeaux/spark-formula
Could someone be kind enough to suggest a set of instructions on how to go about this install in the most straightforward way?

Disclaimer: This answer describes only the rough process of what you need to do. I've distilled it from the respective documentation chapters and added the sources for reference. I'm assuming that you are familiar with the basic workings of Salt (states and pillars and whatnot) and also with Hadoop (I'm not).
1. Configure GitFS
The typical way to install Salt formulas is using GitFS. See the respective chapter of the Salt manual for in-depth documentation.
This needs to be done on your Salt master node.
Enable GitFS in the master configuration file (typically /etc/salt/master, or a separate file in /etc/salt/master.d):
fileserver_backend:
- git
Add the two Salt formulas that you need as remotes (same file). This is also covered in the documentation:
gitfs_remotes:
- https://github.com/saltstack-formulas/hadoop-formula.git
- https://github.com/beauzeaux/spark-formula
(optional): Note the following warning from the Formula documentation:
We strongly recommend forking a formula repository into your own GitHub account to avoid unexpected changes to your infrastructure.
Many Salt Formulas are highly active repositories so pull new changes with care. Plus any additions you make to your fork can be easily sent back upstream with a quick pull request!
Fork the formulas into your own Git repository (using GitHub or otherwise) and use your private Git URL as remote in order to prevent unexpected changes to your configuration.
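For example, assuming you forked both formulas into a hypothetical your-org GitHub account, the remotes would simply point there instead:
gitfs_remotes:
- https://github.com/your-org/hadoop-formula.git
- https://github.com/your-org/spark-formula.git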
Restart Salt master.
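On RHEL 7 (as in the question), the master typically runs under systemd, so restarting it would be something like:
systemctl restart salt-master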
2. Install Hadoop
This is documented in depth in the formula's README file. From a cursory reading, the formula can set up both Hadoop masters and slaves; the role is determined using a Salt grain.
Configure the Hadoop role in the file /etc/salt/grains. This needs to be done on each Salt minion node (use hadoop_master and hadoop_slave appropriately):
roles:
- hadoop_master
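On the data nodes, the grain would be hadoop_slave instead:
roles:
- hadoop_slave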
Configure the Salt mine on your Salt minion (typically /etc/salt/minion or a separate file in /etc/salt/minion.d):
mine_functions:
  network.interfaces: []
  network.ip_addrs: []
  grains.items: []
Have a look at additional configuration grains and set them as you see fit.
Add the required pillar data for configuring your Hadoop setup. We're back on the Salt master node for this step (I'm assuming you are familiar with states and pillars; see the manual or this walkthrough otherwise). Have a look at the example pillar for possible configuration options.
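As a rough illustration only (the key names below are placeholders for this sketch; the authoritative options are listed in the formula's pillar.example), a pillar top file plus a Hadoop pillar could look along these lines:
# /srv/pillar/top.sls
base:
  'your-hadoop-hostname*':
    - hadoop

# /srv/pillar/hadoop.sls (illustrative keys only; check the formula's pillar.example)
hadoop:
  version: apache-2.7.1
hdfs:
  namenode_target: "roles:hadoop_master"
  datanode_target: "roles:hadoop_slave"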
Use the hadoop and hadoop.hdfs states in your top.sls:
base:
  'your-hadoop-hostname*':
    - hadoop
    - hadoop.hdfs
3. Install Spark
According to the Formula's README, there's nothing to configure via grains or pillars, so all that's left is to use the spark state in your top.sls:
base:
  'your-hadoop-hostname*':
    - hadoop
    - hadoop.hdfs
    - spark
4. Fire!
Apply all states:
salt 'your-hadoop-hostname*' state.highstate
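If you later change the gitfs remotes or the pillar data, it can help to update the fileserver and refresh the pillar on the minions before re-running the highstate:
salt-run fileserver.update
salt 'your-hadoop-hostname*' saltutil.refresh_pillar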

Related

Spark S3Guard - Skip listing S3

I'm using Spark (2.4) to process data stored on S3.
I'm trying to understand if there's a way to avoid listing the objects that I read as my batch job's input (I'm talking about ~1M objects).
I know about S3Guard, which stores the objects' metadata, and thought I could use it to skip the S3 listing.
I've read this Cloudera blog post:
Note that it is possible to skip querying S3 in some cases, just
serving results from the Metadata Store. S3Guard has mechanisms for
this but it is not yet supported in production.
I know it's quite old; is it available in production yet?
As of July 2019 it is still tagged as experimental; HADOOP-14936 lists the tasks there.
The recent work has generally been on corner cases you aren't going to encounter on a daily basis, but which we know exist and can't ignore.
The specific feature you are talking about, "auth mode", relies on all clients using S3Guard and updating the tables, and on us being happy that we can handle the failure conditions for consistency.
For a managed table, I'm going to say Hadoop 3.3 will be ready to use this. For Hadoop 3.2, it's close. Really, more testing is needed.
In the meantime, if you can't reduce the number of files in S3, make sure you don't have a deep directory tree, as it's the recursive directory scan that really suffers there.

Get cache entries (keys, values) list on a particular node in Apache Ignite

Is there any option in ignitevisorcmd to see which entries (keys and values) are present on a particular node? I tried the cache -scan -c=mycache -id8=12345678 command, but it prints entries for mycache from all the other nodes as well, instead of printing the data for node 12345678 only.
The current version of Visor Cmd does not support this, but I think it would be easy to implement. I created an issue in the Ignite JIRA, which you may track or even contribute to.
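In the meantime, a possible workaround is to scan the entries held on one specific node through the Java API instead of Visor. The sketch below is illustrative only: it assumes the cache is called mycache (as in your command), that you know the node's full UUID (id8 is just the short form of it), and that the node running this code has the usual discovery configuration to join the cluster.
import java.util.UUID;
import javax.cache.Cache;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.QueryCursor;
import org.apache.ignite.cache.query.ScanQuery;
import org.apache.ignite.lang.IgniteRunnable;

public class LocalCacheScan {
    public static void main(String[] args) {
        // Full node ID of the node whose locally held entries you want to see.
        UUID nodeId = UUID.fromString(args[0]);
        try (Ignite ignite = Ignition.start()) {
            // Execute the scan task on that one node only.
            ignite.compute(ignite.cluster().forNodeId(nodeId)).run(new ScanTask());
        }
    }

    static class ScanTask implements IgniteRunnable {
        @Override public void run() {
            IgniteCache<Object, Object> cache = Ignition.localIgnite().cache("mycache");
            ScanQuery<Object, Object> scan = new ScanQuery<>();
            scan.setLocal(true); // only entries stored on this node (primaries and backups)
            try (QueryCursor<Cache.Entry<Object, Object>> cursor = cache.query(scan)) {
                for (Cache.Entry<Object, Object> e : cursor) {
                    System.out.println(e.getKey() + " -> " + e.getValue());
                }
            }
        }
    }
}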

How do production Cassandra DBAs do table changes & additions?

I am interested in how a production Cassandra DBA's processes change when using Cassandra and performing many releases over a year. During the releases, columns in tables would change frequently, and so would the number of Cassandra tables, as new features and queries are supported.
In a relational DB, in production, you create the 'view' and BOOM, the data is already there, loaded from the view's query.
With Cassandra, does the DBA have to create a new Cassandra table AND write/run a script to copy all the required data into that table? Can a production-level Cassandra DBA provide some pointers on their processes?
We run a small shop, so I can tell you how I manage table/keyspace changes, and that may differ from how others get it done. First, I keep a text .cql file in our (private) Git repository that has all of our tables and keyspaces in their current formats. When changes are made, I update that file. This lets other developers know what the current tables look like, without having to use SSH or DevCenter. This also has the added advantage of giving us a file that allows us to restore our schema with a single command.
If it's a small change (like adding a new column) I'll try to get that out there just prior to deploying our application code. If it's a new table, I may create that earlier, as a new table without code to use it really doesn't hurt anything.
However, if it is a significant change...such as updating/removing an existing column or changing a key...I will create it as a new table. That way, we can deploy our code to use the new table(s), and nobody ever knows that we switched something behind the scenes. Obviously, if the table needs to have data in it, I'll have export/import scripts ready ahead of time and run those right after we deploy.
Larger corporations with enterprise deployments use tools like Chef to manage their schema deployments. When you have a large number of nodes or clusters, an automated deployment tool is really the best way to go.

How to reload maps from the DB when split brain occurs due to network partitioning in Hazelcast

I am using hazelcast 3.2.2 community edition.
I am performing various tests with Hazelcast. I have two separate VMs, each running a Hazelcast instance as a Linux service, forming a single cluster. I will refer to them as HAZ-A and HAZ-B in this context.
Here is the test flow (link means Physical link in this context):
1) HAZ-A is up, HAZ-B is up.
2) Link of HAZ-A goes down; HAZ-B's link is up.
Perform some operations, say change a user's password, so HAZ-B will have two versions of the user object (one is the backup from HAZ-A, say version 1; the other is the updated copy, say version 2).
3) Link of HAZ-B goes down; HAZ-A's link is already down. Hence the links of both HAZ-A and HAZ-B are down.
4) Restore the link of HAZ-A. HAZ-B's link is still down.
Perform some operations, say change a user's password. At this point I get stale data, since HAZ-A never got a chance to sync with HAZ-B.
So the point here is:
Can we implement/inject any kind of listener that will detect interface up/down or link up/down, so that upon detection we can simply re-sync the data from the DB?
From the documentation, it seems both HAZ-A and HAZ-B will load the values from the DB, and when they eventually see each other, they will merge.
From Chapter 18:
If a MapStore was in use, those lost partitions would be reloaded from some database, making each mini-cluster complete. Each mini-cluster will then recreate the missing primary partitions and continue to store data in them, including backups on the other nodes.
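If you additionally want to force a re-read from the database once the two sides see each other again, one option (only a sketch, using the Hazelcast 3.x API from the question, assuming a read-through MapStore is configured and using a hypothetical map name of users) is to listen for the MERGED lifecycle event and evict the local entries so that subsequent reads go back through the MapStore:
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.core.LifecycleEvent;
import com.hazelcast.core.LifecycleListener;

public class ResyncOnMerge {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        final IMap<String, Object> users = hz.getMap("users"); // hypothetical map name

        hz.getLifecycleService().addLifecycleListener(new LifecycleListener() {
            @Override
            public void stateChanged(LifecycleEvent event) {
                // MERGED is fired on the members of the smaller cluster after a split-brain heals.
                if (event.getState() == LifecycleEvent.LifecycleState.MERGED) {
                    // Simplistic approach: evict the local copies so the next get()
                    // goes through the MapStore (read-through) and re-reads from the DB.
                    for (String key : users.keySet()) {
                        users.evict(key);
                    }
                }
            }
        });
    }
}
You would register such a listener on every member; Cluster.addMembershipListener() can be used in the same way if you also want to react to members dropping out of the cluster.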

How does ManifoldCF job scheduling behave?

I am working on integrating ManifoldCF (MCF) with Alfresco CMS as a repository connector using a CMIS query, with Solr as the output channel where the index is stored. I am able to do this fine and can search documents in the Solr index.
Now, as part of the implementation, I am planning to introduce multiple repositories such as SharePoint, file systems, etc., so I now have three document repositories: Alfresco, SharePoint, and the filesystem. I am planning to have scheduled jobs which run through each repository and crawl it at particular intervals. But I have the following concerns.
Although I am scheduling jobs at frequent intervals, I want to make sure that MCF jobs pick up only the content that is newly added or updated. Say I have 100 docs during the current job run but 110 at the next job run; I only want the job to process the 10 new docs, not the entire 110.
As there are relatively few MCF tutorials available, I have no way to confirm that MCF jobs behave this way. I assume it is intelligent enough to do so, but again I have no proof to substantiate it.
I want to know more about the MCF job schedule types: scan every document once / rescan documents directly. Similarly, I want to know more about job invocation: complete/minimal. Sorry for being a newbie.
I am also considering doing some custom coding to ensure that only the latest/updated docs are eligible for processing, but again that means going through the code, as little documentation is available.
Is it wise to do custom coding in this case, or does MCF provide all these features out of the box?
Many thanks in advance.
ManifoldCF schedules the job based on what you have configured for the Job.
It depends on how your repository connector is written. Usually, when a job runs, it calls the repository connector's getDocumentVersions(); if a document's version is different from the earlier version, ManifoldCF indexes that document, otherwise not. Usually your document version string is the last-modified date of the document.
Unfortunately, ManifoldCF does not have much documentation from the developer's perspective; your best bet is to go through the code, which is quite self-explanatory.
This is how "minimal" is presented in the MCF jobs documentation:
Using the "minimal" variant of the listed actions will perform the minimum possible amount of work, given the model that the connection type for the job uses. In some cases, this will mean that additions and modifications are indexed, but deletions are not detected.
You should implement your logic in public String[] getDocumentVersions(..).
The OOTB features are generally enough. But one additional thing to consider is document permissions: if a document's permissions change, you can choose to change the document's version.
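To illustrate the versioning idea only (the real ManifoldCF connector method has a framework-defined signature with more parameters; DocumentSource below is a made-up stand-in for your repository's client API), the logic boils down to returning a version string that changes whenever a document should be re-processed, e.g. by encoding the last-modified date and a hash of the permissions:
// Hypothetical stand-in for your repository's client API.
interface DocumentSource {
    long lastModified(String documentIdentifier);
    String permissionsHash(String documentIdentifier);
}

public class VersioningSketch {
    private final DocumentSource source;

    public VersioningSketch(DocumentSource source) {
        this.source = source;
    }

    // Simplified sketch of the getDocumentVersions(..) idea: ManifoldCF compares each
    // returned version string with the one recorded at the previous crawl and only
    // re-processes documents whose string has changed.
    public String[] getDocumentVersions(String[] documentIdentifiers) {
        String[] versions = new String[documentIdentifiers.length];
        for (int i = 0; i < documentIdentifiers.length; i++) {
            String id = documentIdentifiers[i];
            // Including a permissions hash means a permission change also bumps
            // the version, as suggested above.
            versions[i] = source.lastModified(id) + ":" + source.permissionsHash(id);
        }
        return versions;
    }
}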
