I am trying to build a Hadoop cluster with four nodes.
The four machines are in my school's lab, and I found that /usr/local on each of them is mounted from the same shared disk, so the contents of /usr/local are identical on all four.
The problem is that I cannot start the DataNode on the slaves, because the Hadoop files (such as tmp/dfs/data) are always the same on every machine.
I am planning to install and configure Hadoop in another directory, such as /opt.
However, almost every installation tutorial says to install it in /usr/local, so I was wondering: are there any bad consequences if I install Hadoop somewhere else, such as /opt?
By the way, I am using Ubuntu 16.04.
As long as HADOOP_HOME points to where you extracted the Hadoop binaries, it shouldn't matter.
You'll also want to update PATH, for example in ~/.bashrc:
export HADOOP_HOME=/path/to/hadoop_x.yy
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
For reference, I have some configuration files inside of /etc/hadoop.
(Note: Apache Ambari makes installation easier)
It is not at all necessary to install Hadoop under /usr/local. That location is commonly used when installing a single-node Hadoop cluster (although even then it is not mandatory). As long as the following variables are set in .bashrc, any location should work.
export HADOOP_HOME=<path-to-hadoop-install-dir>
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
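As a rough sketch of the step above, the exports can be appended to an rc file by a small helper so that running it twice does not duplicate them. The function name add_hadoop_env and the /opt/hadoop path in the usage comment are assumptions made for illustration, not something Hadoop ships or requires.

```shell
# Sketch only: add_hadoop_env is a hypothetical helper, not part of Hadoop.
# Usage (assumed install path): add_hadoop_env /opt/hadoop ~/.bashrc
add_hadoop_env() {
    hadoop_home=$1
    rc_file=${2:-$HOME/.bashrc}
    # Do nothing if the rc file already exports HADOOP_HOME.
    if grep -q '^export HADOOP_HOME=' "$rc_file" 2>/dev/null; then
        return 0
    fi
    {
        printf 'export HADOOP_HOME=%s\n' "$hadoop_home"
        # Single quotes keep $PATH and $HADOOP_HOME literal in the rc file.
        printf 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin\n'
    } >> "$rc_file"
}
```

After running it, open a new shell (or source the rc file) so the variables take effect.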
On Linux (Ubuntu 20.04), using Yarn 3.2.0, I had an issue where yarn install would always fail with a number of "permission denied" errors during the Link step, where it was trying to use other modules installed in node_modules as part of the same install (e.g. node-gyp, node-gyp-build, node-pre-gyp, prebuild-install).
After a long investigation (mainly focused on file permissions, because of the permission denied error), it turned out to be down to Yarn's use of the /tmp folder, which it apparently uses during its Link step for placing and executing some files. That is all very well, but not on a Linux server that follows "best practice" by having the noexec flag on the /tmp mount point (see /etc/fstab). noexec prevents executables from running, hence the permission denied. If I take the noexec flag off, Yarn works flawlessly.
So the question is: how do I get around this behaviour in Yarn so that I don't have to break best practice on the /tmp folder? I have dug hard into Yarn's configuration options, but there appears to be nothing in this area.
Fortunately, Yarn respects the standard TMPDIR variable. I'm guessing it uses the standard Node.js os.tmpdir() method, which supports this.
Citing Wikipedia's page about TMPDIR:
TMPDIR is the canonical environment variable in Unix and POSIX that should be used to specify a temporary directory for scratch space. Most Unix programs will honor this setting and use its value to denote the scratch area for temporary files instead of the common default of /tmp or /var/tmp.
You can easily do something like:
mkdir -p ~/tmp && export TMPDIR=~/tmp && yarn install
By the way, I went through the same thing, spending far too much time chasing those weird permission denied errors and forgetting that the files are executed in /tmp. It would be great if Yarn detected this automatically.
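If you would rather not export TMPDIR for the whole session, a small wrapper can set it for a single command. This is only a sketch: with_tmpdir is a made-up helper name, and ~/tmp is just the directory used in the answer above.

```shell
# Sketch only: with_tmpdir is a hypothetical helper name.
# Usage: with_tmpdir ~/tmp yarn install
with_tmpdir() {
    wt_dir=$1
    shift
    mkdir -p "$wt_dir" || return 1
    # The subshell keeps the TMPDIR change from leaking into the caller.
    ( TMPDIR=$wt_dir; export TMPDIR; "$@" )
}
```

The directory you pass has to live on a filesystem that is not mounted noexec, otherwise the same errors come back.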
To test and learn Spark functions, developers need a recent version of Spark, because the APIs and methods from before version 2.0 are obsolete and no longer work in newer versions. The Quickstart VM does not provide one, so developers are forced to install Spark manually, which wastes a considerable amount of development time.
How do I use a later version of Spark on the Quickstart VM?
Nobody should have to waste the setup time that I did, so here is the solution.
SPARK 2.2 Installation Setup on Cloudera VM
Step 1: Download a quickstart_vm from the link:
Prefer the VMware platform, as it is easy to use; that said, all of the options are viable.
The entire tar file is around 5.4 GB. You need to provide a business email address, as personal email addresses are not accepted.
Step 2: The virtual environment requires around 8 GB of RAM; allocate sufficient memory to avoid performance glitches.
Step 3: Open the terminal and switch to the root user:
su root
password: cloudera
Step 4: Cloudera provides Java version 1.7.0_67, which is old and does not meet our needs. To avoid Java-related exceptions, install Java with the following commands:
Downloading Java:
wget -c --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u131-b11/d54c1d3a095b4ff2b6607d096fa80163/jdk-8u131-linux-x64.tar.gz
Switch to the /usr/java/ directory with the "cd /usr/java/" command.
Copy (cp) the downloaded Java tar file to the /usr/java/ directory.
Untar the file with "tar -zxvf jdk-8u131-linux-x64.tar.gz"
Open the profile file with the command "vi ~/.bash_profile"
Export JAVA_HOME, pointing it to the new Java directory:
export JAVA_HOME=/usr/java/jdk1.8.0_131
Save and exit.
To apply the change, run the following command in the shell:
source ~/.bash_profile
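Since the tarball name (jdk-8u131-...) and the directory it unpacks to (jdk1.8.0_131) differ, it is easy to get JAVA_HOME wrong. A possible sketch that unpacks the archive and prints the directory to use, assuming the archive has a single top-level directory; install_jdk_tarball is a hypothetical helper name:

```shell
# Sketch only: install_jdk_tarball is a hypothetical helper name.
# Usage: install_jdk_tarball jdk-8u131-linux-x64.tar.gz /usr/java
install_jdk_tarball() {
    jdk_tarball=$1
    jdk_dest=$2
    # First path component of the first archive entry, e.g. jdk1.8.0_131.
    jdk_top=$(tar -tzf "$jdk_tarball" | head -n 1 | cut -d/ -f1) || return 1
    [ -n "$jdk_top" ] || return 1
    mkdir -p "$jdk_dest" || return 1
    tar -xzf "$jdk_tarball" -C "$jdk_dest" || return 1
    # Print the path that JAVA_HOME should be set to.
    printf '%s/%s\n' "${jdk_dest%/}" "$jdk_top"
}
```

The printed path is what goes on the export JAVA_HOME= line in ~/.bash_profile.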
The Cloudera VM provides Spark 1.6 by default. However, the 1.6 APIs are old and do not match production environments, so we need to download and manually install Spark 2.2.
Switch to /opt/ directory with the command:
cd /opt/
Download spark with the command:
wget https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz
Untar the spark tar with the following command:
tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz
We need to define some environment variables as default settings.
Open the following file:
vi /opt/spark-2.2.0-bin-hadoop2.7/conf/spark-env.sh
Paste the following configurations in the file:
SPARK_MASTER_IP=192.168.50.1
SPARK_EXECUTOR_MEMORY=512m
SPARK_DRIVER_MEMORY=512m
SPARK_WORKER_MEMORY=512m
SPARK_DAEMON_MEMORY=512m
Save and exit
Start Spark with the following command:
/opt/spark-2.2.0-bin-hadoop2.7/sbin/start-all.sh
Export SPARK_HOME:
export SPARK_HOME=/opt/spark-2.2.0-bin-hadoop2.7/
Change the permissions of the directory:
chmod 777 -R /tmp/hive
Try "spark-shell"; it should work.
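One thing to watch for, assuming the VM's bundled Spark 1.6 launcher is also on PATH: a bare spark-shell may still start the old version. A minimal sketch that exports SPARK_HOME and puts its bin directory first on PATH; use_spark_home is a made-up helper name.

```shell
# Sketch only: use_spark_home is a hypothetical helper name.
# Usage: use_spark_home /opt/spark-2.2.0-bin-hadoop2.7/
use_spark_home() {
    # Strip a trailing slash so the PATH entry is clean.
    SPARK_HOME=${1%/}
    export SPARK_HOME
    # Prepend $SPARK_HOME/bin only if it is not already on PATH.
    case ":$PATH:" in
        *":$SPARK_HOME/bin:"*) ;;
        *) PATH="$SPARK_HOME/bin:$PATH" ;;
    esac
    export PATH
}
```

Afterwards, "command -v spark-shell" should print a path under /opt/spark-2.2.0-bin-hadoop2.7/bin.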
I am learning Hadoop (2.7.1). I am configuring it on Ubuntu (15.04), and I created a separate user for Hadoop to isolate the Hadoop file system from the Linux file system. But when I try to use sudo as this hadoop user, I get an error:
hadoop is not in the sudoers file. This incident will be reported.
Should this user be in the sudoers file? In which cases should I work as the hadoop user, and in which as root?
No, the hadoop user does not need to be (and should not be) in the sudoers file.
As you said, to isolate Hadoop-related operations from your other local operations, you should use specific users for specific purposes.
Use your normal Linux user (or root) for tasks such as installing the Linux packages Hadoop needs, e.g. OpenSSH, Java, etc.
Use the hadoop user for Hadoop-related operations, e.g. starting the cluster, using HDFS, running MapReduce programs, etc.
Hope this helps!
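If you wrap the Hadoop commands in your own scripts, one way to enforce this split is a guard at the top of each script. This is just a sketch; require_user is a hypothetical helper, and hadoop is the user name from the question.

```shell
# Sketch only: require_user is a hypothetical helper name.
# Usage at the top of a cluster script: require_user hadoop || exit 1
require_user() {
    ru_expected=$1
    ru_current=$(id -un)
    if [ "$ru_current" != "$ru_expected" ]; then
        echo "run this as '$ru_expected', not '$ru_current'" >&2
        return 1
    fi
}
```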
I have a virtual machine with Spark 1.3 on it, but I want to upgrade it to Spark 1.5, primarily because of certain features that are not supported in 1.3. Is it possible to upgrade Spark from 1.3 to 1.5, and if so, how can I do that?
Pre-built Spark distributions, like the one I believe you are using based on another question of yours, are rather straightforward to "upgrade", since Spark is not actually "installed". Actually, all you have to do is:
Download the appropriate Spark distro (pre-built for Hadoop 2.6 and later, in your case)
Unzip the tar file in the appropriate directory (i.e. where the folder spark-1.3.1-bin-hadoop2.6 already is)
Update your SPARK_HOME (and possibly some other environment variables depending on your setup) accordingly
Here is what I just did myself, to go from 1.3.1 to 1.5.2, in a setting similar to yours (vagrant VM running Ubuntu):
1) Download the tar file in the appropriate directory
vagrant@sparkvm2:~$ cd $SPARK_HOME
vagrant@sparkvm2:/usr/local/bin/spark-1.3.1-bin-hadoop2.6$ cd ..
vagrant@sparkvm2:/usr/local/bin$ ls
ipcluster     ipcontroller2  iptest   ipython2    spark-1.3.1-bin-hadoop2.6
ipcluster2    ipengine       iptest2  jsonschema
ipcontroller  ipengine2      ipython  pygmentize
vagrant@sparkvm2:/usr/local/bin$ sudo wget http://apache.tsl.gr/spark/spark-1.5.2/spark-1.5.2-bin-hadoop2.6.tgz
[...]
vagrant@sparkvm2:/usr/local/bin$ ls
ipcluster     ipcontroller2  iptest   ipython2    spark-1.3.1-bin-hadoop2.6
ipcluster2    ipengine       iptest2  jsonschema  spark-1.5.2-bin-hadoop2.6.tgz
ipcontroller  ipengine2      ipython  pygmentize
Note that the exact mirror you should use with wget will probably be different from mine, depending on your location; you can get it by clicking the "Download Spark" link on the download page after you have selected the package type to download.
2) Unpack the tgz file with
vagrant@sparkvm2:/usr/local/bin$ sudo tar -xzf spark-1.*.tgz
vagrant@sparkvm2:/usr/local/bin$ ls
ipcluster     ipcontroller2  iptest   ipython2    spark-1.3.1-bin-hadoop2.6
ipcluster2    ipengine       iptest2  jsonschema  spark-1.5.2-bin-hadoop2.6
ipcontroller  ipengine2      ipython  pygmentize  spark-1.5.2-bin-hadoop2.6.tgz
You can see that now you have a new folder, spark-1.5.2-bin-hadoop2.6.
3) Update SPARK_HOME (and possibly other environment variables you are using) accordingly, so that it points to this new directory instead of the previous one.
And you should be done, after restarting your machine.
Notice that:
You don't need to remove the previous Spark distribution, as long as all the relevant environment variables point to the new one. That way, you may even quickly move "back-and-forth" between the old and new version, in case you want to test things (i.e. you just have to change the relevant environment variables).
sudo was necessary in my case; it may be unnecessary for you depending on your settings.
After ensuring that everything works fine, it's a good idea to delete the downloaded tgz file.
You can use the exact same procedure to upgrade to future versions of Spark, as they come out (rather fast). If you do this, either make sure that previous tgz files have been deleted, or modify the tar command above to point to a specific file (i.e. no * wildcards as above).
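1) Set your SPARK_HOME to /opt/spark
2) Download the latest pre-built binary, e.g. spark-2.2.1-bin-hadoop2.7.tgz (you can use wget)
3) Create the symlink to the latest download: ln -s /opt/spark-2.2.1 /opt/spark
4) Edit the files in $SPARK_HOME/conf accordingly
For every new version you download, just point the symlink at it (step 3). Because /opt/spark already exists at that point, use -sfn so that the existing link is replaced rather than a new link being created inside the old directory:
ln -sfn /opt/spark-x.x.x /opt/spark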
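The symlink switch can be wrapped in a small function that refuses to point the link at a directory that does not exist. A minimal sketch, where switch_spark is a hypothetical helper name and /opt/spark is the link location from the steps above:

```shell
# Sketch only: switch_spark is a hypothetical helper name.
# Usage: switch_spark /opt/spark-2.2.1 /opt/spark
switch_spark() {
    sw_target=$1
    sw_link=${2:-/opt/spark}
    if [ ! -d "$sw_target" ]; then
        echo "no such directory: $sw_target" >&2
        return 1
    fi
    # -n treats an existing link to a directory as the thing to replace,
    # -f removes it first, so the link itself is repointed.
    ln -sfn "$sw_target" "$sw_link"
}
```

Switching back to an older version is the same call with the old directory as the first argument.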
I am working with a MemSQL cluster as the primary storage. By default, the data files are installed in a location like the following on CentOS 6.x:
/var/lib/memsql-ops/data/installs/MI9dfcc72a5b044f2694b5f7028803a21e
Is there any way to relocate the data path to another folder on the same machine?
This is not the best way, but it works. I just reinstalled MemSQL to another directory:
sudo mkdir /data/memsql
sudo ./install.sh --root-dir /data/memsql
In this case MemSQL Ops will still be in /var/lib/memsql-ops, but all nodes will be installed to the /data/memsql directory (look at the symlink /var/lib/memsql), and all data will be inside this directory too.
P.S. You can find additional installation options with the memsql-ops agent-install --help command.