From Spark I am saving a table using:
DataFrame.write().mode(SaveMode.Ignore).format("orc").saveAsTable("myTableName")
The table is getting saved; I can see it with the command hadoop fs -ls /apps/hive/warehouse/test.db, where test is my database name:
drwxr-xr-x - psudhir hdfs 0 2016-01-04 05:02
/apps/hive/warehouse/test.db/myTableName
But when I try to check the tables in Hive I cannot view them, not even with SHOW TABLES from hiveContext.
This worked for me in a Cloudera quick start Virtual Box.
You have to copy the hive-site.xml file (mine is located at /etc/hive/conf.dist/hive-site.xml) to the Spark conf folder (mine is located at /etc/spark/conf/):
sudo cp /etc/hive/conf.dist/hive-site.xml /etc/spark/conf/
Restart Spark and it should work.
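As a quick sanity check after the restart, something along these lines should now show the saved table (a minimal Spark 1.x sketch matching the hiveContext mentioned in the question; the tiny sample DataFrame is only a stand-in for the real data):
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc) // sc is the existing SparkContext
// With hive-site.xml on Spark's classpath, saveAsTable registers the table in the Hive metastore
val df = hiveContext.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "value")
hiveContext.sql("USE test")
df.write.mode(SaveMode.Ignore).format("orc").saveAsTable("myTableName")
hiveContext.sql("SHOW TABLES").show() // myTableName should now be listed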
I think you need to run INVALIDATE METADATA; to refresh the metadata and see your new table. Note that INVALIDATE METADATA is an Impala statement, so it is run from the Impala shell rather than the Hive console.
I was trying to copy a file from HDFS to the local filesystem using Hadoop's copyToLocalFile function from my Spark 2 application.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hadoopConf = new Configuration()
val hdfs = FileSystem.get(hadoopConf)
val src = new Path("/user/yxs7634/all.txt")
val dest = new Path("file:///home/yxs7634/all.txt")
hdfs.copyToLocalFile(src, dest)
The above code works fine when I submit my Spark application in YARN client mode, but it keeps failing with the below exception in YARN cluster mode.
18/10/03 12:18:40 ERROR yarn.ApplicationMaster: User class threw exception: java.io.FileNotFoundException: /home/yxs7634/all.txt (Permission denied)
In yarn-cluster mode the driver is also managed by YARN, and the selected driver node may not be the one you're submitting the job from. Hence, for this job to work in yarn-cluster mode, I believe you need the local destination path to be available on all the Spark nodes in the cluster.
In yarn-cluster mode the Spark job is submitted through YARN, and the driver is started on a different node.
To tackle this issue, you can use a distributed file system such as HDFS to store your file and then give its absolute path, e.g.:
val src = new Path("hdfs://nameservicehost:8020/user/yxs7634/all.txt")
It looks like the Spark service runs under one user (for example, "spark"), while the file path in the code points to another user's ("yxs7634") home directory. In cluster mode the "spark" user is not allowed to write into the "yxs7634" user's directory, and such an exception occurs. The Spark user needs additional permission to write to "/home/yxs7634". In client mode it worked fine because Spark runs as the "yxs7634" user.
You have a permission denied error, meaning the user you are using to run the job is not able to access the file. The directory should give at least read permission to "other" users, something like this: -rw-rw-r--
Can you paste the permissions of the directory and the file? The command is
hdfs dfs -ls /your-directory/
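If it is easier to check from inside the application, roughly the same information can be read through the Hadoop FileSystem API (a small sketch; the path is the one from the question):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val status = fs.getFileStatus(new Path("/user/yxs7634/all.txt"))
// Prints something like: rw-r--r-- yxs7634 hdfs
println(s"${status.getPermission} ${status.getOwner} ${status.getGroup}")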
ERROR received in the logs:
FATAL datanode.DataNode: Initialization failed for Block pool <registering> (Datanode Uuid unassigned) service to hadooptest3/100.6.89.29:8020
There are two possible solutions to resolve this.
First:
Your namenode and datanode cluster IDs do not match; make sure to make them the same.
On the namenode, change your cluster ID in the file located at:
cat HADOOP_FILE_SYSTEM/namenode/current/VERSION
On the datanode, your cluster ID is stored in the file:
cat HADOOP_FILE_SYSTEM/datanode/current/VERSION
These locations are set in the cluster's hdfs-site.xml file; check the dfs.datanode.data.dir and dfs.namenode.name.dir properties.
Going through those folders, here are the contents I get (in my pseudo-distributed cluster):
clusterID=CID-483c19b1-b198-4806-93d2-af7508d1a5e5
You should have exactly the same cluster ID in both files.
Second:
Format the namenode:
Hadoop 1.x: hadoop namenode -format
Hadoop 2.x: hdfs namenode -format
Alternatively, remove the HDFS root directory /tmp/hadoop-root/ (set in the conf files) and format the namenode to initialize it from the beginning.
Your config files look fine. From the error log you posted in the comments, Unexpected version of storage directory /home/hadoop/hdfs. Reported: -60. Expecting = -56., it seems that the data directory created inside /home/hadoop/hdfs was not reformatted when you ran the hadoop namenode -format command.
So I suggest you delete that data directory inside /home/hadoop/hdfs before you format the namenode. Then run the format command and start the Hadoop cluster. That should solve it.
I'm a bit at a loss here (Spark newbie). I spun up an EC2 cluster and submitted a Spark job whose last step saves output as a text file. The code reads
reduce_tuples.saveAsTextFile('september_2015')
and the working directory of the Python file I'm submitting is /root. I cannot find the directory called september_2015, and if I try to run the job again I get the error:
: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://ec2-54-172-88-52.compute-1.amazonaws.com:9000/user/root/september_2015 already exists
The EC2 address is the master node I'm SSHing into, but I don't have a folder /user/root.
It seems like Spark is creating the september_2015 directory somewhere, but find doesn't find it. Where does Spark write the resulting directory? Why is it pointing me to a directory that doesn't exist on the master node's filesystem?
You're not saving it to the local file system, you're saving it to the HDFS cluster. Try eph*-hdfs/bin/hadoop fs -ls /, and you should see your file. See eph*-hdfs/bin/hadoop help for more commands, e.g. -copyToLocal.
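If you want to be explicit about where the output lands, you can also spell out the URI scheme in the save path. A small sketch (in Scala, though the call is the same in PySpark; the RDD here is only a stand-in for the real data):
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("save-example"))
val reduceTuples = sc.parallelize(Seq(("a", 1), ("b", 2))) // stand-in for the real RDD
// A bare path like "september_2015" resolves against the default filesystem,
// i.e. hdfs://<namenode>:9000/user/<username>/september_2015 on this cluster
reduceTuples.saveAsTextFile("hdfs:///user/root/september_2015")
// To write to the nodes' local filesystems instead, use an explicit file:// URI:
// reduceTuples.saveAsTextFile("file:///root/september_2015")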
I am trying to load a schema into a Cassandra server from a file. As suggested by someone, I tried sstable2json and json2sstable, but I guess those import and export data files, while I am trying to load only the schema of the database. Any suggestions on possible ways to do it?
I am using Cassandra 1.2.
To get the schema file, go to the directory where Cassandra resides (not the bin directory within it):
echo -e "use your_keyspace;\r\n show schema;\n" | bin/cassandra-cli -h your_listen_address(e.g.localhost) > mySchema.cdl
To load that file:
bin/cassandra-cli -h localhost -f mySchema.cdl
I have followed the below steps to configure Hive 0.8.1 in Cygwin. Hive starts properly, as I get the Hive CLI when I type hive, but when I run any command in Hive it does not return any response and the command runs into an infinite loop.
Please help if I miss anything.
Steps to configure Hive
Change the ownership (chown) of the hive folder
Change the permissions of the hive folder to 755
Set this in hive-site.xml:
<property>
<name>hive.exec.scratchdir</name>
<value>/home/yourusername/mydir</value>
<description>Scratch space for Hive jobs</description>
</property>
Put the following in the hive lib folder:
hadoop-0.20-core.jar
hive/lib/hive-exec-0.7.1.jar
hive/lib/hive-jdbc-0.7.1.jar
hive/lib/hive-metastore-0.7.1.jar
hive/lib/hive-service-0.7.1.jar
hive/lib/libfb303.jar
lib/commons-logging-1.0.4.jar
slf4j-api-1.6.1.jar
slf4j-log4j12-1.6.1.jar
In hive-env.sh change the following:
# Set HADOOP_HOME to point to a specific hadoop install directory
# (instead of the path given here, use your own path where Hadoop is installed)
export HADOOP_HOME=/home/user/Hadoop/hadoop-0.20.205
# The Hive configuration directory can be controlled by:
# (here you specify the conf directory path of Hive)
export HIVE_CONF_DIR=/home/user/Hadoop/hive-0.8.1/conf
# The folder containing extra libraries required for Hive compilation/execution
# can be controlled by:
# (here you specify the lib directory)
I had this issue. I could successfully run Hive after starting all the Hadoop daemons (NameNode, DataNode, JobTracker, and TaskTracker), and by running queries from files using "hive -f" instead of writing queries directly at the Hive command prompt. You may also use bin/hive -e 'SHOW TABLES'.