How to access files in Hadoop HDFS? - Linux

I have a .jar file (containing a Java project that I want to modify) in my Hadoop HDFS that I want to open in Eclipse.
When I type hdfs dfs -ls /user/... I can see that the .jar file is there. However, when I open Eclipse and try to import it, I can't find it anywhere. I do see a hadoop/hdfs folder in my file system, which takes me to two folders, namenode and namesecondary - neither of these contains the file I'm looking for.
Any ideas? I have been stuck on this for a while. Thanks in advance for any help.

As HDFS is virtual storage spanned across the cluster, your local file system only holds its metadata; you can't see the actual data there.
Try downloading the jar file from HDFS to your local file system and make the required modifications there.
You can access HDFS through its web UI.
Open your browser and go to localhost:50070. In the HDFS web UI, open the Utilities tab on the right-hand side and click Browse the file system; you will see the list of files in your HDFS.
Follow the steps below to download the file to your local file system.
Open Browser-->localhost:50070-->Utilities-->Browse the file system-->Open your required file directory-->Click on the file (a pop-up will open)-->Click on Download
The file will be downloaded to your local file system and you can make your required modifications.

The HDFS filesystem and the local filesystem are different.
You can copy the jar file from the HDFS filesystem to your preferred location in your local filesystem by using this command:
bin/hadoop fs -copyToLocal locationOfFileInHDFS locationWhereYouWantToCopyFileInYourFileSystem
For example
bin/hadoop fs -copyToLocal file.jar /home/user/file.jar
I hope this helps you.

1) Get the file from HDFS to your local system
bin/hadoop fs -get /hdfs/source/path /localfs/destination/path
2) You can then manage it in Eclipse this way:
New Java Project -> Java settings -> Source -> Link source (Source folder).
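As a rough sketch of step 1, assuming the jar actually contains the .java sources you want to edit (the jar name and paths below are placeholders):
bin/hadoop fs -get /hdfs/source/path/project.jar /localfs/destination/path/
cd /localfs/destination/path
# unpack the jar so Eclipse can link the unpacked folder as a source folder
unzip project.jar -d project-src
In Eclipse, point Link source at the unpacked project-src folder.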

You can install a plugin for Eclipse that can browse HDFS:
http://hdt.incubator.apache.org
OR
you can mount HDFS via FUSE:
https://wiki.apache.org/hadoop/MountableHDFS

You cannot directly import files that are in HDFS into Eclipse. First you have to move the file from HDFS to your local drive; only then can you use it in any utility.
hadoop fs -copyToLocal hdfsLocation localDirectoryPath

Related

Hadoop getmerge fails when trying to merge files to local directory

I am trying to merge two files from my HDFS to a folder on my local machine's desktop. The command that I am using is:
hadoop fs -getmerge -nl /user/hadoop/folder_name/ /Desktop/test_files/finalfile.csv
But that returns the following error:
getmerge: Mkdirs failed to create file:/Desktop/test_files (exists=false, cwd=file:/home/hadoop)
Does anyone know why this might be? I couldn't find much of anything else in my search.
You need to create the local folder /Desktop/test_files/ before running getmerge. Note that /Desktop/... is an absolute path at the root of the local filesystem (the error shows cwd=file:/home/hadoop), so you may actually want a path under your home directory, such as /home/hadoop/Desktop/test_files/.
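A minimal sketch of the fix, assuming the merged file should live under the hadoop user's home directory rather than at the filesystem root (adjust the paths as needed):
# create a writable destination directory first
mkdir -p ~/Desktop/test_files/
# then run the same getmerge, pointed at that directory
hadoop fs -getmerge -nl /user/hadoop/folder_name/ ~/Desktop/test_files/finalfile.csv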

How do we copy a file from Hadoop to ABFS remotely?

How do we copy files from Hadoop to ABFS (Azure Blob File System)?
I want to copy from the Hadoop filesystem to the ABFS filesystem, but it throws an error.
This is the command I ran:
hdfs dfs -ls abfs://....
ls: No FileSystem for scheme "abfs"
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem not found
Any idea how this can be done?
In the core-site.xml you need to add a config property for fs.abfs.impl with value org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem, and then add any other related authentication configurations it may need.
More details on installation/configuration here - https://hadoop.apache.org/docs/current/hadoop-azure/abfs.html
The abfs binding is already in core-default.xml for any release that ships the abfs client. However, the hadoop-azure jar and its dependencies are not in the hadoop common/lib dir where they are needed (they are in HDI and CDH, but not in the Apache release).
You can tell the hadoop script to pick it and its dependencies up by setting the HADOOP_OPTIONAL_TOOLS env var; you can do this in ~/.hadoop-env, but try it on your command line first:
export HADOOP_OPTIONAL_TOOLS="hadoop-azure,hadoop-aws"
After doing that, download the latest cloudstore jar and use its storediag command to attempt to connect to an abfs URL; it's the place to start debugging classpath and config issues:
https://github.com/steveloughran/cloudstore
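For example, a hedged sketch of the whole check (the container and storage account names are placeholders, and the exact cloudstore jar name depends on the release you download):
export HADOOP_OPTIONAL_TOOLS="hadoop-azure,hadoop-aws"
# storediag reports classpath, configuration and connectivity for the given store
hadoop jar cloudstore-1.0.jar storediag abfs://mycontainer@myaccount.dfs.core.windows.net/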

SparkContext.addFile uploads the file to the driver node but not the workers

I tried to run sc.textFile("file:///.../myLocalFile.txt") on a cluster and I got java.io.FileNotFoundException on the workers.
So I googled and found sc.addFile / SparkFiles.get as a way to ship the file to each worker.
So here is my code:
sc.addFile("file:///.../myLocalFile.txt")
val input = sc.textFile(SparkFiles.get("myLocalFile.txt"))
I see that the driver node uploads the file to a directory in /tmp, and then my workers get the FileNotFoundException because:
I don't see any printout saying that the workers have downloaded the file as they should have
They try to access the file with the driver's path. So I assume SparkFiles.get() is run on the driver node, not the workers (which I confirmed by adding a println).
I tried the spark-submit --files option and I see exactly the same problem.
So what am I doing wrong? All I want is to sc.textFile() on a cluster.
You need to copy the files onto the workers at the same path as on the driver, or use HDFS, as HDFS is available to all workers. The workers don't have these files - you can go to the folder and see for yourself; I would scp them.
sc.addFile is not for this purpose. If you want to read files through sc, you need to put your file on HDFS instead of using sc.addFile.
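For example, a minimal sketch of the HDFS route (the local path and the HDFS directory below are placeholders):
# put the local file into HDFS so every worker can reach it
hdfs dfs -mkdir -p /user/myuser/data
hdfs dfs -put /path/to/myLocalFile.txt /user/myuser/data/
# then read it in Spark with an hdfs:// path instead of file://
#   sc.textFile("hdfs:///user/myuser/data/myLocalFile.txt")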

Spark Streaming reading from local file gives NullPointerException

Using Spark 2.2.0 on OS X High Sierra. I'm running a Spark Streaming application to read a local file:
val lines = ssc.textFileStream("file:///Users/userName/Documents/Notes/MoreNotes/sampleFile")
lines.print()
This gives me
org.apache.spark.streaming.dstream.FileInputDStream logWarning - Error finding new files
java.lang.NullPointerException
at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:192)
The file exists, and I am able to read it using SparkContext (sc) from spark-shell in the terminal. For some reason, going through the IntelliJ application with Spark Streaming is not working. Any ideas appreciated!
Quoting the doc comments of textFileStream:
Create an input stream that monitors a Hadoop-compatible filesystem
for new files and reads them as text files (using key as LongWritable, value
as Text and input format as TextInputFormat). Files must be written to the
monitored directory by "moving" them from another location within the same
file system. File names starting with . are ignored.
@param directory HDFS directory to monitor for new file
So, the method expects the path to a directory in the parameter.
So I believe this should avoid that error:
ssc.textFileStream("file:///Users/userName/Documents/Notes/MoreNotes/")
Spark Streaming will not read old files, so first run the spark-submit command and then create the local file in the specified directory. Make sure that in the spark-submit command you give only the directory name and not the file name. Below is a sample command. Here I am passing the directory name through the spark command as the first parameter. You can specify this path in your Scala program as well.
spark-submit --class com.spark.streaming.streamingexample.HdfsWordCount --jars /home/cloudera/pramod/kafka_2.12-1.0.1/libs/kafka-clients-1.0.1.jar --master local[4] /home/cloudera/pramod/streamingexample-0.0.1-SNAPSHOT.jar /pramod/hdfswordcount.txt

Where is Spark writing SaveAsTextFile in cluster?

I'm a bit at a loss here (Spark newbie). I spun up an EC2 cluster and submitted a Spark job that saves a text file in the last step. The code reads
reduce_tuples.saveAsTextFile('september_2015')
and the working directory of the python file I'm submitting is /root. I cannot find the directory called september_2015, and if I try to run the job again I get the error:
: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://ec2-54-172-88-52.compute-1.amazonaws.com:9000/user/root/september_2015 already exists
The ec2 address is the master node where I'm ssh'ing to, but I don't have a folder /user/root.
It seems like Spark is creating the september_2015 directory somewhere, but find doesn't find it. Where does Spark write the resulting directory? Why is it pointing me to a directory that doesn't exist on the master node's filesystem?
You're not saving it to the local file system, you're saving it to the HDFS cluster. Try eph*-hdfs/bin/hadoop fs -ls /, then you should see your file. See eph*-hdfs/bin/hadoop help for more commands, e.g. -copyToLocal.
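For example, a short sketch using the output path from the error message and the same eph*-hdfs prefix as above:
# list the job output inside HDFS
eph*-hdfs/bin/hadoop fs -ls /user/root/september_2015
# copy it down to the local filesystem on the master node
eph*-hdfs/bin/hadoop fs -copyToLocal /user/root/september_2015 /root/september_2015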
