I am trying to merge two files from my HDFS to a folder on my local machine's desktop. The command that I am using is:
hadoop fs -getmerge -nl /user/hadoop/folder_name/ /Desktop/test_files/finalfile.csv
But that returns the following error:
getmerge: Mkdirs failed to create file:/Desktop/test_files (exists=false, cwd=file:/home/hadoop)
Does anyone know why this might be? I couldn't find much of anything else in my search.
You need to create the local folder /Desktop/test_files/
Related
how do we copy files from Hadoop to abfs (azure blob file system)
I want to copy from Hadoop fs to abfs file system but it throws an error
this is the command I ran
hdfs dfs -ls abfs://....
ls: No FileSystem for scheme "abfs"
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem not found
any idea how this can be done ?
In the core-site.xml you need to add a config property for fs.abfs.impl with value org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem, and then add any other related authentication configurations it may need.
More details on installation/configuration here - https://hadoop.apache.org/docs/current/hadoop-azure/abfs.html
the abfs binding is already in core-default.xml for any release with the abfs client present. however, the hadoop-azure jar and dependency is not in the hadoop common/lib dir where it is needed (it is in HDI, CDH, but not the apache one)
you can tell the hadoop script to pick it and its dependencies up by setting the HADOOP_OPTIONAL_TOOLS env var; you can do this in ~/.hadoop-env; just try on your command line first
export HADOOP_OPTIONAL_TOOLS="hadoop-azure,hadoop-aws"
after doing that, download the latest cloudstore jar and use its storediag command to attempt to connect to an abfs URL; it's the place to start debugging classpath and config issues
https://github.com/steveloughran/cloudstore
I have a Node-RED app running in a docker container, with the aim to periodically read contents of a directory where .csv files are constantly updated and new .csv files are sometimes added. The point is to read new entries periodically, parse data, and send it onward.
I have not utilized the numerous 'contrib' nodes, as I have enabled the NodeJS 'fs' module and played with it. Additionally the built-in 'file' and 'file in' Node-RED modules are useful when reading the .csv files' contents, so that is not an issue.
The problem comes with the new .csv files being added into the directory where all the .csv files are. I want be able to read all the file names and subsequently read all the .csv files.
I have mounted the .csv file directory into the docker container, and when testing whether I'm able to read the file names, weird things happen. Even though the files are visible in the container (viewed using docker exec -it CONTAINER /bin/bash) a piece of code containing fs.readdir does not list the files. When I try the fs.readdir too see the contents of /data directory, which is mounted into the container, it lists the contents like 10 % of the time (injecting a timestamp into the node to run it)
As you can see from the image, the contents of the directorty in question are not listed on every execution of the node. The contents of the mounted directory containing the .csv files are never listed upon running this node with the correct path as parameter.
The operating system is CentOS 7, where I am not a sudoer. I have managed to make it so that none of the mounted files or directories are owned by root, so they are owned by user node-red within the container. I managed to pull this directory file listing through on my ubuntu where I am a sudoer, but as none of the stuff is root-owned there either, I am not sure if that is the problem. I have a feeling this might be an operating system -relating thing.
Notes:
All relevant files and directories have permissions rwxr-xr-x
I have tried to mount the .csv files containing directory under /data directory, and as its own directory directly under root as /files
I am able to read the file contents with the Node-RED file nodes, just not the directories. Reading static file names is not enough as the directory contents keep changing
I have enabled NodeJS 'fs' module from the settings.js file which is mounted into the container
The Node-RED node (in image) does not output any errors (I tried this by adding an error return to the function in the image)
I have tried to run the Node-RED container as root user and without defining the user
I am running the Node-RED container using docker-compose
I hope this was not too much text or too unclear, I just wanted to make sure at least most of the stuff I have tried would be written here. If someone has some insight on the workings of Node-RED under docker and using the NodeJS fs module, it would be most appreciated :)
The core Watch node should do all of this for you, no need to write function nodes.
If you want walk subdirectories make sure you tick the right box in the config.
From the Sidebar docs for the watch node:
The full filename of the file that actually changed is put into
msg.payload and msg.filename, while a stringified version of the watch
list is returned in msg.topic.
msg.file contains just the short filename of the file that changed.
msg.type has the type of thing changed, usually file or directory,
while msg.size holds the file size in bytes.
To answer my question of why Node-RED was unable to read directory contents most of the time, it was because of using the asynchronous fs.readdir module. When I switched to using the synchronous version fs.readdirSync, Node-RED was able to read directory contents without problems.
I have a .jar file (containing a Java project that I want to modify) in my Hadoop HDFS that I want to open in Eclipse.
When I type hdfs dfs -ls /user/... I can see that the .jar file is there - however, when I open up Eclipse and try to import it I just can't seem to find it anywhere. I do see a hadoop/hdfs folder in my File System which takes me to 2 folders; namenode and namesecondary - none of these have the file that I'm looking for.
Any ideas? I have been stuck on this for a while. Thanks in advance for any help.
As HDFS is virtual storage it is spanned across the cluster so you can see only the metadata in your File system you can't see the actual data.
Try downloading the jar file from HDFS to your Local File system and do the required modifications.
Access the HDFS using its web UI.
Open your Browser and type localhost:50070 You can see the web UI of HDFS move to utilities tab which is on the right side and click on Browse the File system, you can see the list of files which are in your HDFS.
Follow the below steps to download the file to your local file system.
Open Browser-->localhost:50070-->Utilities-->Browse the file system-->Open your required file directory-->Click on the file(a pop up will open)-->Click on download
The file will be downloaded into your local File System and you can do your required modifications.
HDFS filesystem and local filesystem are both different.
You can copy the jar file from hdfs filesystem to your preferred location in your local filesytem by using this command:
bin/hadoop fs -copyToLocal locationOfFileInHDFS locationWhereYouWantToCopyFileInYourFileSystem
For example
bin/hadoop fs -copyToLocal file.jar /home/user/file.jar
I hope this helps you.
1) Get the file from HDFS to your local system
bin/hadoop fs -get /hdfs/source/path /localfs/destination/path
2) you can manage it in this way:
New Java Project -> Java settings -> Source -> Link source (Source folder).
You can install plugin to Eclipse that can browse HDFS:
http://hdt.incubator.apache.org
OR
you can mount HDFS via fuse:
https://wiki.apache.org/hadoop/MountableHDFS
You can not directly import the files present in HDFS to Eclipse. First you will have to move file from HDFS to local drive then only you can use it in any utility.
fs -copyToLocal hdfsLocation localDirectoryPath
I'm a bit at loss here (Spark newbie). I spun up an EC2 cluster, and submitted a Spark job which saves as text file in the last step. The code reads
reduce_tuples.saveAsTextFile('september_2015')
and the working directory of the python file I'm submitting is /root. I cannot find the directory called september_2005, and if I try to run the job again I get the error:
: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://ec2-54-172-88-52.compute-1.amazonaws.com:9000/user/root/september_2015 already exists
The ec2 address is the master node where I'm ssh'ing to, but I don't have a folder /user/root.
Seems like Spark is creating the september_2015 directory somehwere, but find doesn't find it. Where does Spark write the resulting directory? Why is it pointing me to a directory that doesn't exist in the master node filesystem?
You're not saving it in the local file system, you're saving it in the hdfs cluster. Try eph*-hdfs/bin/hadoop fs -ls /, then you should see your file. See eph*-hdfs/bin/hadoop help for more commands, eg. -copyToLocal.
This question is related to Hadoop on Azure environment.
I am trying to use Runtime.exec() to execute a batch script in the reduce function. I could not get this running in Hadoop on Azure environment while it runs fine in the Hadoop on Linux. I tested the Runtime.exec() code snippet in my desktop (windows 7) environment and it runs fine there. I have made sure that I consume the output and error streams of the sub-process after Runtime.exec().
The batch script contains the below ( a single command):
c:\hdfs\mapred\local\taskTracker\nabeel\jobcache\job_201207121317_0024\attempt_201207121317_0024_r_000001_0\work\tool.exe
-f c:\hdfs\mapred\local\taskTracker\nabeel\jobcache\job_201207121317_0024\work\11_task_201207121317_0024_r_000001.out
-i c:\hdfs\mapred\local\taskTracker\nabeel\jobcache\job_201207121317_0024\attempt_201207121317_0024_r_000001_0\work\input.txt
I distribute the tool.exe and input.txt files using Distributed cache and it creates a symlink from the working directory. tool.exe and input.txt points to the actual files in the jobcache directory.
2012-07-16 04:31:51,613 INFO org.apache.hadoop.mapred.TaskRunner: Creating symlink: /hdfs/mapred/local/taskTracker/distcache/-978619214658189372_-1497645545_209290723/10.73.50.78tool.exe <- \hdfs\mapred\local\taskTracker\nabeel\jobcache\job_201207121317_0024\attempt_201207121317_0024_r_000001_0\work\tool.exe
2012-07-16 04:31:51,644 INFO org.apache.hadoop.mapred.TaskRunner: Creating symlink: /hdfs/mapred/local/taskTracker/distcache/-4944695173898834237_1545037473_2085004342/10.73.50.78input.txt <- \hdfs\mapred\local\taskTracker\nabeel\jobcache\job_201207121317_0024\attempt_201207121317_0024_r_000001_0\work\input.txt
The reducer gives the below error when it runs.
Command Execution Error: Cannot run program
"cmd /q /c c:\hdfs\mapred\local\taskTracker\nabeel\jobcache\job_201207121317_0024\work\11_task_201207121317_0024_r_0000011513543720767963399.bat":
CreateProcess error=2, The system cannot find the file specified
In another case, I tried running the same but without using the absolute paths.. The output stream from the sub-process is shown below:
c:\hdfs\mapred\local\taskTracker\nabeel\jobcache\job_201207121317_0022\attempt_201207121317_0022_r_000000_0\work>tool.exe -f /hdfs/mapred/local/taskTracker/nabeel/jobcache/job_201207121317_0022/work/1_task_201207121317_0022_r_000000.out
-i input.txt
I do not know how the job working directory paths and distributed cache works in Hadoop on Azure environment. Could you please let me know if I am missing something here (or) there is something I need to take care of while using Runtime.exec() in Hadoop on Azure environment.
Thanks,
.,._
Reply to sender | Reply to group | Reply via web post | Start a New Topic
I am not familiar with Hadoop. But the error message seems to be obvious. It would be better if you can check whether the file exists.
c:\hdfs\mapred\local\taskTracker\nabeel\jobcache\job_201207121317_0024\work\11_task_201207121317_0024_r_0000011513543720767963399.bat
Best Regards,
Ming Xu