Configure Apache Hive for multiple local users, or within Spark standalone, without Hadoop - apache-spark

We run Apache Spark standalone for teaching purposes on Fedora 27. I have it configured with MySQL for the metastore_db.
I am trying to follow a tip for running Hive in local mode, which means running this export command before starting hive:
export HIVE_OPTS='-hiveconf mapred.job.tracker=local -hiveconf fs.default.name=file:///tmp -hiveconf hive.metastore.warehouse.dir=file:///tmp/warehouse'
Another blog suggested this command:
export HIVE_OPTS='-hiveconf mapred.job.tracker=local -hiveconf fs.default.name=file:///tmp -hiveconf hive.metastore.warehouse.dir=file:///tmp/warehouse -hiveconf javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/tmp/metastore_db;create=true'
So I ran chmod 777 /tmp/warehouse, but won't any subsequent users just overwrite everything in this local file database? Is there a better way to achieve this? Whenever I try to use Hive within Spark, or without exporting a local file database, I get "localhost:8020 failed on connection exception: java.net.ConnectException: Connection refused" errors, where 8020 is the port that Hadoop runs on. We're trying to do this without Hadoop for teaching purposes. Should each user just specify a different path for either the warehouse or databaseName, perhaps in their home directory?
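The per-user idea raised at the end of the question could be sketched roughly as follows. This is only an illustration: the helper name hive_opts_for and the ~/hive_local directory are hypothetical, the property names are copied from the second export command above, and whether Hive then behaves well for a whole class has not been verified here.

```shell
# hive_opts_for DIR: print a HIVE_OPTS string whose warehouse and Derby
# metastore both live under DIR (hypothetical helper, not part of Hive).
hive_opts_for() {
  printf '%s %s %s %s' \
    "-hiveconf mapred.job.tracker=local" \
    "-hiveconf fs.default.name=file://$1" \
    "-hiveconf hive.metastore.warehouse.dir=file://$1/warehouse" \
    "-hiveconf javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=$1/metastore_db;create=true"
}

# Each student would run this in their own shell before starting hive,
# so that no two users share a warehouse or metastore path.
HIVE_OPTS=$(hive_opts_for "${HOME}/hive_local")
export HIVE_OPTS
```

Because every path is derived from the user's home directory, the chmod 777 on a shared /tmp/warehouse would no longer be needed under this assumption.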

Related

How to transfer data from local file system (linux) to a Hadoop Cluster made on Google Cloud Platform

I am a beginner in Hadoop. I made a Hadoop cluster (one master and two slaves) on Google Cloud Platform.
I accessed the master of the cluster from the local file system (Linux) using: ssh -i key key@public_ip_of_master
Then I did sudo su - inside the cluster, because Hadoop functions only appear while being root.
Then I initiated HDFS using start-dfs.sh and start-all.sh
Now the problem is that I want to transfer files from the local Linux file system to the Hadoop cluster and vice versa using the following command (entered inside the cluster while being root):
root@master:~# hdfs dfs -put /home/abas1/Desktop/chromFa.tar.gz /Hadoop_File
The problem is that the local path, /home/abas1/Desktop/chromFa.tar.gz, is never recognized and I cannot figure out what to do.
I am sure I am missing something trivial but I do not know what it is. I have to use either -copyFromLocal or -put.
"local path is never recognized"
That is not a Hadoop problem, then. You are on the master node (over SSH) as the root user. There is a /root folder with files, and probably no /home/abas1.
In other words, run ls -l /home and you will see which local files are available.
To upload files from that terminal session, you first need to SCP them to the master server from the other machine.
Exit the SSH session, then copy the file over and log back in:
scp -i key /home/abas1/Desktop/chromFa.tar.gz root@master-ip:/tmp
ssh -i key root@master-ip
Then you can do this:
hdfs dfs -mkdir /Hadoop_File
ls -l /tmp | grep chromFa # for example, to check the file
hdfs dfs -put /tmp/chromFa.tar.gz /Hadoop_File/
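The point of this answer, that the source path must exist on the machine where hdfs runs, can be made explicit with a small guard. This is a minimal sketch: put_if_local is a hypothetical wrapper name, and it assumes the hdfs command is on the PATH of the node where it is called.

```shell
# put_if_local SRC DEST: upload SRC to HDFS only if SRC exists on THIS host.
# (Hypothetical wrapper; the upload itself is the plain hdfs dfs -put.)
put_if_local() {
  src=$1
  dest=$2
  if [ ! -f "$src" ]; then
    # uname -n prints the name of the node we are actually logged in to
    echo "local file not found on $(uname -n): $src" >&2
    return 1
  fi
  hdfs dfs -put "$src" "$dest"
}

# Example call on the master, after the scp step:
#   put_if_local /tmp/chromFa.tar.gz /Hadoop_File/
```

Run on the master with the original /home/abas1/Desktop path, this would report the missing file together with the host name, which makes it clearer that the path belongs to a different machine.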
"Hadoop functions only appear while being root."
Please do not use root for interacting with Hadoop services. Create unique user accounts for HDFS, YARN, Zookeeper, etc. with restricted permissions, like you would for any other Unix process.
Using Dataproc will do this for you, and you can still SSH to it, so you should really consider using it instead of a manual GCE cluster.

How to initialize the Spark shell with a specific user to save data to HDFS with Apache Spark

I'm using Ubuntu.
I'm using the Spark dependency through IntelliJ.
When I enter spark in the shell I get: Command 'spark' not found, but can be installed with: ..
I have two users, amine and hadoop_amine (where Hadoop HDFS is set up).
When I try to save a dataframe to HDFS (Spark Scala):
procesed.write.format("json").save("hdfs://localhost:54310/mydata/enedis/POC/processed.json")
I get this error:
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=root, access=WRITE, inode="/mydata/enedis/POC":hadoop_amine:supergroup:drwxr-xr-x
Either change the permissions of the HDFS directory, or simply change the user that Spark runs as.
To change the directory permissions you can use the hdfs command line like this:
hdfs dfs -chmod ...
With spark-submit you can use the --proxy-user option.
Finally, you can run spark-submit or spark-shell as the proper user with a command like this:
sudo -u hadoop_amine spark-submit ...
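The error line in the question already says which of these fixes applies: it names the requesting user, the owner of the directory, and its mode. As a rough sketch (the variable names are illustrative and the sed patterns assume the message has exactly the shape shown in the question), the fields can be pulled apart like this:

```shell
# Error text copied from the question.
msg='Permission denied: user=root, access=WRITE, inode="/mydata/enedis/POC":hadoop_amine:supergroup:drwxr-xr-x'

# Requesting user, inode owner, and the 10-character mode string.
req_user=$(printf '%s\n' "$msg" | sed -n 's/.*user=\([^,]*\),.*/\1/p')
owner=$(printf '%s\n' "$msg" | sed -n 's/.*inode="[^"]*":\([^:]*\):.*/\1/p')
mode=$(printf '%s\n' "$msg" | sed -n 's/.*:\([-drwxstT]\{10\}\)$/\1/p')

# Character 6 of the mode is the group write bit, character 9 the "other" one.
group_w=$(printf '%s\n' "$mode" | cut -c6)
other_w=$(printf '%s\n' "$mode" | cut -c9)

if [ "$req_user" != "$owner" ] && [ "$group_w" != "w" ] && [ "$other_w" != "w" ]; then
  echo "$req_user is not the owner ($owner) and mode $mode lets only the owner write"
fi
```

Here the job runs as root while the directory belongs to hadoop_amine with owner-only write access, which is why either a chmod or running as hadoop_amine resolves it.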

pyspark.sql.utils.IllegalArgumentException: "Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuild in Windows 10

I have installed Spark 2.2 with winutils on Windows 10. When I run pyspark I get the exception below:
pyspark.sql.utils.IllegalArgumentException: "Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder'
I have already tried setting permission 777 on the tmp/hive folder as well, but it does not help:
winutils.exe chmod -R 777 C:\tmp\hive
After applying this the problem remains the same. I am using pyspark 2.2 on Windows 10.
Here is the spark-shell environment
Here is the pyspark shell
Kindly help me figure this out.
Thank you.
I had the same problem using the command 'pyspark' as well as 'spark-shell' (for Scala) on macOS with apache-spark 2.2. Based on some research I figured it was because of my JDK version, 9.0.1, which does not work well with Apache Spark. Both errors were resolved by switching back from JDK 9 to JDK 8.
That might help with your Windows Spark installation too.
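To check which Java major version a shell resolves, something like the following sketch could be used. The helper name java_major is hypothetical; it assumes the two version schemes mentioned in this thread ("1.8.0_..." for Java 8 and "9.0.1" for Java 9) and is written for a POSIX shell rather than the Windows command prompt.

```shell
# java_major VERSION: map a version string to its major number.
# Legacy strings start with "1." (1.8.0_151 is Java 8); newer ones do not.
java_major() {
  case "$1" in
    1.*) printf '%s\n' "$1" | cut -d. -f2 ;;
    *)   printf '%s\n' "$1" | cut -d. -f1 ;;
  esac
}

# java -version writes to stderr, hence the 2>&1.
if command -v java >/dev/null 2>&1; then
  ver=$(java -version 2>&1 | sed -n 's/.* version "\([^"]*\)".*/\1/p' | head -n 1)
  echo "Java major version: $(java_major "$ver")"
fi
```

If this reports 9 on a Spark 2.2 setup, switching the default JDK to 8 as described above is the first thing to try.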
Port 9000?! It must be something Hadoop-related, as I don't remember Spark using that port. I'd recommend using spark-shell first, since that eliminates any additional "hops", i.e. spark-shell does not require two runtimes (Spark itself and Python).
Given the exception, I'm pretty sure the issue is that you've got some Hive- or Hadoop-related configuration lying around somewhere and Spark is picking it up.
The "Caused by" seems to show that 9000 is used when Spark SQL is created, which is when the Hive-aware subsystem is loaded.
Caused by: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.net.ConnectException: Call From DESKTOP-SDNSD47/192.168.10.143 to 0.0.0.0:9000 failed on connection exception: java.net.ConnectException: Connection refused
Please review the environment variables in Windows 10 (for example with the set command on the command line) and remove anything Hadoop-related.
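On Windows the set command mentioned above lists the variables. For readers who hit the same error from a Unix-like shell (Linux, macOS, or Git Bash/WSL on Windows, which is an assumption beyond the original question), a rough equivalent is to filter the environment by variable name. The function name list_hadoop_env is illustrative only.

```shell
# list_hadoop_env: print every environment variable whose NAME mentions
# hadoop or hive (case-insensitive). Prints nothing if there are none.
list_hadoop_env() {
  env | grep -i -E '^[^=]*(hadoop|hive)[^=]*=' || true
}

list_hadoop_env
```

Anything this prints, such as a stray HADOOP_CONF_DIR or HADOOP_HOME pointing at an old configuration, is a candidate for the leftover settings the answer describes.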
Posting this answer for posterity. I faced the same error.
The way I solved it was by first trying spark-shell instead of pyspark. The error message was more direct.
This gave a better idea: there was an S3 access error.
Next, I checked the EC2 role/instance profile for that instance; it has S3 administrator access.
Then I did a grep for s3:// in all the conf files under the /etc/ directory.
Then I found that core-site.xml contains this property:
<property>
<!-- URI of NN. Fully qualified. No IP.-->
<name>fs.defaultFS</name>
<value>s3://arvind-glue-temp/</value>
</property>
Then I remembered: I had removed HDFS as the default file system and set it to S3. I had created the EC2 instance from an earlier AMI and had forgotten to update the S3 bucket to one belonging to the newer account.
Once I updated the S3 bucket to one that is accessible by the current EC2 instance profile, it worked.
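The grep step in this answer can be sketched as a small helper that lists the XML config files mentioning a given URI scheme. The name find_fs_refs is hypothetical, and restricting the search to *.xml files is an assumption made here to keep the output short; the original answer simply grepped everything under /etc/.

```shell
# find_fs_refs DIR PATTERN: list *.xml files under DIR that contain PATTERN.
# Unreadable directories are skipped silently.
find_fs_refs() {
  find "$1" -type f -name '*.xml' -exec grep -l -- "$2" {} + 2>/dev/null || true
}

# Example, mirroring the answer:
#   find_fs_refs /etc 's3://'
```

Any core-site.xml that shows up is worth opening to see what fs.defaultFS points at.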
To use Spark on Windows, you may follow this guide.
NOTE: Ensure that your IP address resolves correctly against your hostname as well as localhost. Lack of localhost resolution has caused problems for us in the past.
Also, you should provide the full stack trace, as it helps to debug the issue quickly and saves guesswork.
Let me know if this helps. Cheers.
Try this, it worked for me: open a command prompt in administrator mode and then run the command 'pyspark'. This should open a Spark session without errors.
I also came across this error on Ubuntu 16.04:
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder'
This was because I already had ./bin/spark-shell running.
So just kill that spark-shell and re-run ./bin/pyspark.
I also came across this error on macOS 10, and solved it by using Java 8 instead of Java 9.
When Java 9 is the default version resolved in the environment, pyspark throws this error, and you will see a name 'xx' is not defined error when trying to access sc, spark, etc. from the shell or Jupyter.
For more details, see this link.
You must have a hive-site.xml file in the Spark configuration directory.
Changing the port from 9000 to 9083 resolved the problem for me.
Please ensure that the property is updated in both hive-site.xml files, the one under the Hive config directory and the one under the Spark config directory.
<property>
<name>hive.metastore.uris</name>
<value>thrift://localhost:9083</value>
<description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
</property>
For me, on Ubuntu, the locations of hive-site.xml are:
/home/hadoop/hive/conf/
and
/home/hadoop/spark/conf/
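Since the property has to match in both copies, a quick consistency check can help. The sketch below uses hypothetical helper names (metastore_uri, same_metastore_uri) and assumes the simple layout shown above, with the name and value elements each on one line; it is not a general XML parser.

```shell
# metastore_uri FILE: print the <value> that follows the
# hive.metastore.uris <name> element in FILE.
metastore_uri() {
  awk '
    /<name>hive\.metastore\.uris<\/name>/ { found = 1 }
    found && match($0, /<value>[^<]*<\/value>/) {
      # strip the 7-character <value> and 8-character </value> tags
      print substr($0, RSTART + 7, RLENGTH - 15)
      exit
    }
  ' "$1"
}

# same_metastore_uri FILE_A FILE_B: succeed only if both files agree.
same_metastore_uri() {
  [ "$(metastore_uri "$1")" = "$(metastore_uri "$2")" ]
}

# Example with the two locations from this answer:
#   same_metastore_uri /home/hadoop/hive/conf/hive-site.xml \
#                      /home/hadoop/spark/conf/hive-site.xml || echo "mismatch"
```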

Could not connect to cassandra with cqlsh

I want to connect to cassandra but got this error:
$ bin/cqlsh
Connection error: ('Unable to connect to any servers', {'192.168.1.200': error(10061, "Tried connecting to [('192.168.1.200', 9042)]. Last error: No connection could be made because the target machine actively refused it")})
Pretty simple.
The machine is actively refusing the connection because Cassandra is not running on your system. Follow these steps to fix it:
1. Install Cassandra from DataStax (Datastax-DDC; Cassandra version 3).
2. Go to ~\installation\path\DataStax-DDC\apache-cassandra\bin.
3. Open cmd there. (Use Alt+F+P to open it if you are on Windows 8 or later.)
4. Type cassandra -f. This prints a lot of output in the window, and the last line should be INFO 11:32:31 Created default superuser role 'cassandra'
5. Now open another cmd window in the same folder.
6. Type cqlsh
This should give you a prompt, without any error.
I also discovered that this error doesn't pop up if I use Cassandra v2.x, found in the archived versions of Cassandra. I don't know why :( (If you find out, please comment.)
So, if the above steps do not work, you can always go back to Cassandra v2.x.
Cheers.
Check that you have started the Cassandra server, then provide the host and port as arguments.
$ bin/cqlsh 127.0.0.1 9042
I ran into the same problem. This worked for me.
Go to any directory, for example E:\ (it doesn't have to be on the same disk as the Cassandra installation).
Create the following directories:
E:\cassandra\storage\commitlogs
E:\cassandra\storage\data
E:\cassandra\storage\savedcaches
Then go to your Cassandra installation's conf path. In my case:
D:\DataStax-DDC\apache-cassandra\conf
Open cassandra.yaml. Edit the lines containing data_file_directories, commitlog_directory and saved_caches_directory to look like the code below (change the paths according to where you created the folders):
data_file_directories:
    - E:\cassandra\storage\data
commitlog_directory: E:\cassandra\storage\commitlogs
saved_caches_directory: E:\cassandra\storage\savedcaches
Then open cmd (I did it as administrator, but didn't check whether that is necessary) in your Cassandra installation's bin path. In my case:
D:\DataStax-DDC\apache-cassandra\bin
Run cassandra -f
Lots of output will be logged to your screen.
You should now be able to run cqlsh and everything else without problems.
Edit: The operating system was Windows 10 64-bit.
Edit2: If it stops working after a while, check whether the service is still running using nodetool status. If it isn't, follow this instruction.
I also faced the same problem on a 32-bit Windows 7 machine.
Check that you have Java installed correctly and the JAVA_HOME variable set.
Once you have checked the Java installation and set JAVA_HOME, uninstall Cassandra and install it again.
Hopefully this solves the problem. Mine was solved after applying the above two steps.
You need to specify the host, user and password for the cqlsh connection. The default Cassandra cqlsh user is cassandra and the password is cassandra.
$ bin/cqlsh <host> -u cassandra -p cassandra
I also had the same problem. I tried many methods found on Google and YouTube, but none of them worked in my case. Finally, I applied the following 3 steps and it worked:
1. Create a folder without any spaces in its name on C or D, whichever is your system drive, e.g. C:\cassandra
2. Install Cassandra in this folder instead of installing it in "Program Files". After installation, it will look like this: C:\cassandra\apache-cassandra-3.11.6
3. Copy the installed Python 2.7 into the bin folder, i.e. C:\cassandra\apache-cassandra-3.11.6\bin
Now your program is ready to work.
There is no special method for connecting with cqlsh; it is as simple as this:
$ bin/cqlsh 127.0.0.1(host IP) 9042
or, for an older version of Cassandra:
$ bin/cqlsh 127.0.0.1(host IP) 9160
Don't forget to check port connectivity if you are connecting cqlsh to a remote host. You can also use a username/password if you have enabled authentication; by default it is disabled.

PDI Hadoop file browser shows no file list

I have a single-instance Hadoop cluster configured to run with an IP address (instead of localhost) on CentOS Linux. I was able to execute the example MapReduce job correctly, which tells me that the Hadoop setup appears to be fine.
I have also added a couple of data files to the Hadoop file system under the "/data" folder, and they are visible through the "dfs" command:
bin/hadoop dfs -ls /data
I am trying to connect to this HDFS system from PDI/Kettle. In the HDFS file browser, if I enter the HDFS connection parameters incorrectly, e.g. the wrong port, it says it cannot connect to the HDFS server. If I instead enter all parameters correctly (server, port, user, password) and click 'connect', it does not give the error, meaning it is able to connect. But the file list shows only "/".
It doesn't show the data folder. What could be going wrong?
I've already tried this:
tried chmod 777 on the data files using "bin/hadoop dfs -chmod -R 777 /data"
tried using root and also the hdfs Linux user in the PDI file browser
tried adding the data files in some other location
re-formatting HDFS several times and adding the data files again
copying the hadoop-core jar file from the Hadoop installation to PDI extlib
but it does not list files in the PDI browser. I cannot see anything in the PDI log either. Need quick help, thanks!
-abhay
I got past this issue. On Windows, PDI was not logging anything in the log file. I tried the same thing on Linux, where the log showed that it was missing an Apache library, commons-configuration. I downloaded the latest version of it, put it under the extlib/pentaho folder, and boom, it worked!
