PDI Hadoop file browser does not list files

I have a single-instance Hadoop cluster configured to run with a specific IP address (instead of localhost) on CentOS Linux. I was able to execute the example MapReduce job correctly, which tells me that the Hadoop setup appears to be fine.
I have also added a couple of data files to HDFS under the "/data" folder, and they are visible through the "dfs" command:
bin/hadoop dfs -ls /data
I am trying to connect to this HDFS system from PDI/Kettle. In the HDFS file browser, if I enter the connection parameters incorrectly (e.g. a wrong port), it says it cannot connect to the HDFS server. However, if I enter all parameters correctly (server, port, user, password) and click 'Connect', it does not give an error, meaning it is able to connect. But the file list only shows "/".
It doesn't show the /data folder. What could be going wrong?
I've already tried the following:
tried chmod 777 on the data files using "bin/hadoop dfs -chmod -R 777 /data"
tried using both the root and hdfs Linux users in the PDI file browser
tried adding the data files in a different location
re-formatted HDFS several times and added the data files again
copied the hadoop-core jar file from the Hadoop installation to the PDI extlib folder
But it still does not list the files in the PDI browser, and I cannot see anything in the PDI log either. Need quick help... thanks!
-abhay

I got past this issue. On Windows, PDI was not logging anything in the log file. I tried the same thing on Linux, and the log showed that it was missing an Apache library, commons-configuration. I downloaded the latest version, put it under the extlib/pentaho folder, and it worked!
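For reference, a minimal sketch of that fix, assuming the jar was downloaded from the Apache Commons Configuration site and that PDI is installed under /opt/data-integration (the version number and install path are only examples):
# after downloading commons-configuration from the Apache Commons site,
# copy the jar into PDI's extlib/pentaho folder (version and paths are examples)
cp ~/Downloads/commons-configuration-1.10.jar /opt/data-integration/extlib/pentaho/
# restart Spoon/PDI so the new library is picked up on the classpath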

Related

Windows log path for running the Spark HistoryServer

I have followed the instructions on the Spark website for configuring the pyspark HistoryServer locally on Windows, but I cannot get past this error when I run spark-class.cmd org.apache.spark.deploy.history.HistoryServer:
Log directory specified does not exist: file:/tmp/spark-events Did you configure the correct one through spark.history.fs.logDirectory?
spark-defaults.conf has:
spark.eventLog.enabled true
spark.history.fs.logDirectory file:/tmp/spark-events
spark.eventLog.dir file:/tmp/spark-events
I can get pyspark to run, and I can successfully submit a .py script with spark-submit.
I have created the directory /tmp/spark-events in both SPARK_HOME and SPARK_HOME/bin because I'm not exactly sure where "file:/tmp/spark-events" should actually be located. Where exactly on Windows do I need to create this "tmp/spark-events" directory so it can be found? Am I missing anything else? Also, even if I change the paths in the .conf file, it still gives an error saying it can't find tmp/spark-events, so it seems it isn't even using the values in the config.
You can choose where spark.history.fs.logDirectory points to! In your case, it should be a Windows path. The idea is the following:
You make a directory wherever you would like, with the proper permissions (more info on that here).
When that is done, you should be able to start up your history server, with spark.history.fs.logDirectory pointing to that one directory you made. This is not a relative path w.r.t. your $SPARK_HOME env variable, but an absolute path.
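For example, a minimal spark-defaults.conf sketch for Windows, assuming the directory was created at C:\tmp\spark-events (the path is only an illustration, adjust it to wherever you made the directory):
# point the History Server at an absolute Windows path, written as a file URI
spark.history.fs.logDirectory  file:///C:/tmp/spark-events
# have applications write their event logs to the same place
spark.eventLog.enabled         true
spark.eventLog.dir             file:///C:/tmp/spark-events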
If that works, you should see a rather uninteresting screen (the default port is 18080, so locally you should visit localhost:18080): since none of your applications have written to your directory yet, you will see an empty History Server screen.
If you want to make use of the history server, you have to make your apps write to the eventlog directory you made. That can be done by adding --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=<your-dir> to your spark-submit call.
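For instance, a sketch of such a call (the script name is a placeholder, not from the original question):
spark-submit --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=file:///C:/tmp/spark-events your_script.py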
If that was successful, you should see a file in the log directory you made!
Have a look at your History server (by default on localhost:18080). You should see your application's logs in there!
Hope this helps :)

How to transfer data from a local file system (Linux) to a Hadoop cluster made on Google Cloud Platform

I am a beginner in Hadoop. I made a Hadoop cluster (one master and two slaves) on Google Cloud Platform.
I accessed the master of the cluster from the local file system (Linux) using: ssh -i key key@public_ip_of_master
Then I did sudo su - inside the cluster, because the Hadoop functions only appear while being root.
Then I started HDFS using start-dfs.sh and start-all.sh.
Now the problem is that I want to transfer files from the local Linux file system to the Hadoop cluster and vice versa using the following command (running the command inside the cluster while being root):
root@master:~# hdfs dfs -put /home/abas1/Desktop/chromFa.tar.gz /Hadoop_File
The problem is that the local path, /home/abas1/Desktop/chromFa.tar.gz, is never recognized, and I cannot figure out what to do.
I am sure I am missing something trivial but I do not know what it is. I have to use either -copyFromLocal or -put.
local path is never recognized
That is not a Hadoop problem, then. You are on the master node (over SSH), as the root user. There is a /root folder with files, and probably no /home/abas1.
In other words, run ls -l /home and you will see which local files are actually available there.
To get files onto the master server so you can upload them from that terminal session, you will want to scp them there first from your local machine:
Exit the SSH session
scp -i key /home/abas1/Desktop/chromFa.tar.gz root@master-ip:/tmp
ssh -i key root@master-ip
Then you can do this:
hdfs dfs -mkdir /Hadoop_File
ls -l /tmp | grep chromFa   # for example, to check the file arrived
hdfs dfs -put /tmp/chromFa.tar.gz /Hadoop_File/
Hadoop functions only appear while being root.
Please do not use root for interacting with Hadoop services. Create unique user accounts for HDFS, YARN, Zookeeper, etc. with restricted permissions like you would for any other Unix process.
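As a minimal sketch of that idea (the user, group, and directory names are just conventional examples, not something your cluster already has):
# create a dedicated group and user for the HDFS daemons (names are illustrative)
sudo groupadd hadoop
sudo useradd -m -g hadoop hdfs
# give that user ownership of the Hadoop installation/data directories (path is an example)
sudo chown -R hdfs:hadoop /opt/hadoop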
Using Dataproc will do this for you... and you can still SSH to it, so you should really consider using it instead of a manual GCE cluster.

Can't copy local file in linux to hadoop

I just installed Hadoop on a Linux VM. Now I am following my guide book to copy a file from the local file system to Hadoop (the file is saved on the VM desktop). Here is what I did:
hdfs dfs -copyFromLocal filename.csv /user/root
However, I received a message saying:
"copyFromLocal: 'filename.csv': no such file or directory"
Can anyone tell me what went wrong and what I should do to make it right?
Thanks!
You need to be in your Desktop folder (the one containing your file) so the relative file name can be found:
cd /root/Desktop
There are two methods for placing a file from the local host into Hadoop's HDFS (a combined example follows below):
1) copyFromLocal - as you have used
2) put - hadoop dfs -put yourlocalfilepath hdfspath
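A minimal sketch of the whole sequence, assuming the file really is at /root/Desktop/filename.csv and the HDFS home directory is /user/root (both taken from the question, adjust to your setup):
# go to the folder that actually holds the file
cd /root/Desktop
# then either of these uploads it to the HDFS home directory
hdfs dfs -copyFromLocal filename.csv /user/root
hdfs dfs -put /root/Desktop/filename.csv /user/root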

Command to store File on HDFS

Introduction
A Hadoop NameNode and three DataNodes have been installed and are running. The next step is to put a file into HDFS. The following commands have been executed:
hadoop fs -copyFromLocal ubuntu-14.04-desktop-amd64.iso
copyFromLocal: `.': No such file or directory
and
hadoop fs -put ubuntu-14.04-desktop-amd64.iso
put: `.': No such file or directory
without success.
Question
Which command needs to be issued in order to store a file on HDFS?
If no destination path is provided, Hadoop will try to copy the file into your HDFS home directory. In other words, if you're logged in as utrecht, it will try to copy ubuntu-14.04-desktop-amd64.iso to /user/utrecht.
However, this folder doesn't exist by default (you can normally check the DFS via a web browser).
To make your command work, you have two choices (a short example follows below):
copy it elsewhere (/ works, but putting everything there may lead to complications in the future)
create the directory you want with hdfs dfs -mkdir /yourFolderPath
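For instance, a minimal sketch of the second choice, assuming the user is utrecht as in the answer above:
# create the HDFS home directory (and any parents) for the user
hdfs dfs -mkdir -p /user/utrecht
# now the original command has somewhere to land
hdfs dfs -put ubuntu-14.04-desktop-amd64.iso
# or give an explicit destination
hdfs dfs -put ubuntu-14.04-desktop-amd64.iso /user/utrecht/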

how to load schema file into Cassandra with cqlsh

I have a schema file for Cassandra. I'm using a Windows 7 machine (Cassandra is on this machine as well - 1 node). I want to load the schema with cqlsh. So far I have not been able to find out how. I was hoping to be able to pass the file to cqlsh: cqlsh mySchemaFile. However, since I run on Windows, to start cqlsh I do the following:
python "C:\Program Files (x86)\DataStax Community\apache-cassandra\bin\cqlsh" localhost 9160
Even though I have cqlsh in my path, when calling it like this from python it needs the full path.
I tried adding the file name to that command, but no luck so far.
Is this even possible?
cqlsh takes a file to execute via the -f or --file option, not as a positional argument (like the host and port), so the correct form would be:
python "C:\Program Files (x86)\DataStax Community\apache-cassandra\bin\cqlsh" localhost 9160 -f mySchemaFile
Note: I'm not 100% sure about whether you'd use -f or \f in Windows.
