Access hdfs cluster from pydoop - python-3.x

I have an HDFS cluster and Python on the same Google Cloud Platform. I want to access the files in the HDFS cluster from Python. I found that pydoop can do this, but I am probably not giving it the right parameters. Below is the code that I have tried so far:
import pydoop.hdfs as hdfs
import pydoop
pydoop.hdfs.hdfs(host='url of the file system goes here',
port=9864, user=None, groups=None)
"""
class pydoop.hdfs.hdfs(host='default', port=0, user=None, groups=None)
A handle to an HDFS instance.
Parameters
host (str) – hostname or IP address of the HDFS NameNode. Set to an empty string (and port to 0) to connect to the local file system; set to 'default' (and port to 0) to connect to the default (i.e., the one defined in the Hadoop configuration files) file system.
port (int) – the port on which the NameNode is listening
user (str) – the Hadoop domain user name. Defaults to the current UNIX user. Note that, in MapReduce applications, since tasks are spawned by the JobTracker, the default user will be the one that started the JobTracker itself.
groups (list) – ignored. Included for backwards compatibility.
"""
#print (hdfs.ls("/vs_co2_all_2019_v1.csv"))
It gives this error:
RuntimeError: Hadoop config not found, try setting HADOOP_CONF_DIR
And if I execute this line of code:
print (hdfs.ls("/vs_co2_all_2019_v1.csv"))
nothing happens. The file "vs_co2_all_2019_v1.csv" does exist in the cluster, although it was not available at the moment I took the screenshot.
My hdfs screenshot is shown below:
and the credentials that I have are shown below:
Can anybody tell me what I am doing wrong? Which credentials do I need to put where in the pydoop API? Or maybe there is a simpler way around this problem. Any help will be much appreciated!

Have you tried the following?
import pydoop.hdfs as hdfs
import pydoop
hdfs_object = pydoop.hdfs.hdfs(host='url of the file system goes here',
port=9864, user=None, groups=None)
hdfs_object.list_directory("/vs_co2_all_2019_v1.csv")
or simply:
hdfs_object.list_directory("/")
Keep in mind that the module-level functions in pydoop.hdfs (such as hdfs.ls) are not tied to an instance of the hdfs class (hdfs_object). The connection that you established in the first statement is therefore not used by hdfs.ls("/vs_co2_all_2019_v1.csv").
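The error in the question says that pydoop could not find the Hadoop configuration. As a hedged sketch (this is not pydoop's own logic): the NameNode address that clients use is normally the fs.defaultFS value in core-site.xml, and in Hadoop 3 port 9864 is by default a DataNode HTTP port rather than the NameNode RPC port, so it may be worth reading the address from the configuration. The helper name namenode_address and the sample host and port are hypothetical; only the standard library is used.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def namenode_address(core_site_xml):
    """Return (host, port) from the fs.defaultFS property of a core-site.xml string.

    Hypothetical helper for illustration; port is None if the URI omits it.
    """
    root = ET.fromstring(core_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "fs.defaultFS":
            uri = urlparse(prop.findtext("value", "").strip())
            return uri.hostname, uri.port
    raise KeyError("fs.defaultFS not found in core-site.xml")


# Sample config; the host name and port are made up for this example.
SAMPLE = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.internal:8020</value>
  </property>
</configuration>"""

host, port = namenode_address(SAMPLE)
print(host, port)
```

The resulting pair could then be passed as pydoop.hdfs.hdfs(host=host, port=port), and the returned handle used for list_directory as in the answer above. Exporting HADOOP_CONF_DIR so that it points at the directory holding core-site.xml, before starting Python, is what the error message itself suggests.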

Related

Get correct path for sparkdid.cassandra.connection.config.cloud.path in AWS EMR

I searched the web but could not find a solution.
I am trying to connect to Astra Cassandra from AWS EMR using the secure connect bundle. The bundle file is downloaded, but it is not loaded.
spark.conf.set("sparkdid.cassandra.connection.config.cloud.path", "secure-connect-app.zip")
This is how I am giving the path. I know the path is wrong, because the correct config is not loaded and the connection is refused on localhost.
I don't know what the correct path is on EMR.
If you look into the documentation, you will see that the file can either be specified as the URL of a file on S3, or you can pass it with the --files option when submitting with spark-submit or spark-shell; it will then be available under just its file name, as in your code.
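A minimal sketch of the --files route, assuming the bundle has already been downloaded to the node from which the job is submitted. The property name used here follows the Spark Cassandra Connector documentation (spark.cassandra.connection.config.cloud.path); the helper name, the local path, and the script name are hypothetical.

```python
import os
import shlex


def spark_submit_args(bundle_path, app,
                      prop="spark.cassandra.connection.config.cloud.path"):
    """Build a spark-submit command that ships the bundle with --files.

    Files passed via --files are referenced by bare file name, so the
    property value is just the base name of the bundle.
    """
    bundle_name = os.path.basename(bundle_path)
    return [
        "spark-submit",
        "--files", bundle_path,
        "--conf", f"{prop}={bundle_name}",
        app,
    ]


# Hypothetical local path and script name, for illustration only.
args = spark_submit_args("/home/hadoop/secure-connect-app.zip", "job.py")
print(shlex.join(args))
```

The alternative from the answer is to skip --files and set the same property to the URL of the bundle on S3.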

Cannot import CSV file into h2o from Databricks cluster DBFS

I have successfully installed h2o on my AWS Databricks cluster, and then successfully started the h2o server with:
h2o.init()
When I attempt to import the iris CSV file that is stored in my Databricks DBFS:
train, valid = h2o.import_file(path="/FileStore/tables/iris.csv").split_frame(ratios=[0.7])
I get an H2OResponseError: Server error water.exceptions.H2ONotFoundArgumentException
The CSV file is absolutely there; in the same Databricks notebook, I am able to read it directly into a DataFrame and view the contents using the exact same fully qualified path:
df_iris = ks.read_csv("/FileStore/tables/iris.csv")
df_iris.head()
I've also tried calling:
h2o.upload_file("/FileStore/tables/iris.csv")
but to no avail; I get H2OValueError: File /FileStore/tables/iris.csv does not exist. I've also tried uploading the file directly from my local computer (C drive), but that doesn't succeed either.
I've tried not using the fully qualified path, and just specifying the file name, but I get the same errors. I've read through the H2O documentation and searched the web, but cannot find anyone who has ever encountered this problem before.
Can someone please help me?
Thanks.
H2O may not understand that this path is on DBFS. You can try specifying the path as /dbfs/FileStore/tables/iris.csv, in which case it will be read as a local file, or specifying the full path with the scheme, like dbfs:/FileStore/tables/iris.csv, although this may require DBFS-specific jars for H2O.
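The first suggestion amounts to a path rewrite. A small sketch, assuming the /dbfs mount is available on the node where H2O runs; the helper name to_local_dbfs_path is hypothetical.

```python
def to_local_dbfs_path(path):
    """Map a DBFS path to its /dbfs local-mount equivalent.

    Accepts "dbfs:/a/b", "/a/b" or an already-local "/dbfs/a/b".
    """
    if path.startswith("dbfs:"):
        path = path[len("dbfs:"):]
    if path == "/dbfs" or path.startswith("/dbfs/"):
        return path
    return "/dbfs/" + path.lstrip("/")


local_path = to_local_dbfs_path("/FileStore/tables/iris.csv")
print(local_path)
```

h2o.import_file(path=local_path) would then receive a path that the H2O process can open as an ordinary file.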

Can't access files via Local file API on Databricks

I'm trying to access a small text file stored directly on DBFS using the local file API.
I'm getting the following error.
No such file or directory
My code:
val filename = "/dbfs/test/test.txt"
for (line <- Source.fromFile(filename).getLines()) {
println(line)
}
At the same time, I can access this file without any problems using dbutils, or load it into an RDD via the Spark context.
I've tried specifying the path starting with dbfs:/ or /dbfs/, or just with the test folder name, both in Scala and Python, and I get the same error each time. I'm running the code from a notebook. Is it some problem with the cluster configuration?
Check whether your cluster has credential passthrough enabled. If so, the local file API is not available.
https://docs.azuredatabricks.net/data/databricks-file-system.html#local-file-apis

Read CSV file from AWS S3

I have an EC2 instance running pyspark and I'm able to connect to it (ssh) and run interactive code within a Jupyter Notebook.
I have an S3 bucket with a CSV file that I want to read. When I attempt to read it with:
spark = SparkSession.builder.appName('Basics').getOrCreate()
df = spark.read.csv('https://s3.us-east-2.amazonaws.com/bucketname/filename.csv')
it throws a long Python error message and then something related to:
Py4JJavaError: An error occurred while calling o131.csv.
Specify the S3 path along with the access key and secret key as follows:
's3n://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>@my.bucket/folder/input_data.csv'
Access key-related information can be introduced in the typical username + password manner for URLs. As a rule, the access protocol should be s3a, the successor to s3n (see Technically what is the difference between s3n, s3a and s3?). Putting this together, you get
spark.read.csv("s3a://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>@bucketname/filename.csv")
As an aside, some Spark execution environments, e.g., Databricks, allow S3 buckets to be mounted as part of the file system. You can do the same when you build a cluster using something like s3fs.
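Credentials can also be kept out of the URI. A minimal sketch, assuming the hadoop-aws (s3a) connector is on the Spark classpath: the keys go into the fs.s3a.access.key and fs.s3a.secret.key Hadoop options (prefixed with spark.hadoop. when set through Spark), and the HTTPS URL from the question is rewritten as an s3a:// URI. Both helper names are hypothetical.

```python
from urllib.parse import urlparse


def https_to_s3a(url):
    """Convert a path-style S3 HTTPS URL to an s3a:// URI.

    Assumes the form https://s3.<region>.amazonaws.com/<bucket>/<key>.
    """
    parsed = urlparse(url)
    bucket, _, key = parsed.path.lstrip("/").partition("/")
    if not bucket or not key:
        raise ValueError(f"not a path-style S3 URL: {url}")
    return f"s3a://{bucket}/{key}"


def s3a_credential_conf(access_key, secret_key):
    """Spark options that hand the keys to the s3a connector."""
    return {
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }


uri = https_to_s3a("https://s3.us-east-2.amazonaws.com/bucketname/filename.csv")
print(uri)
```

Each pair from s3a_credential_conf(...) could be applied with SparkSession.builder.config(key, value) before getOrCreate(), after which spark.read.csv(uri) would read the object.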

Apache Spark FileNotFoundException

I am trying to play a little bit with apache-spark cluster mode.
My cluster consists of a driver on my machine, and a worker and manager on a host machine (a separate machine).
I send a text file using sparkContext.addFile(filepath), where filepath is the path of my text file on the local machine, for which I get the following output:
INFO Utils: Copying /home/files/data.txt to /tmp/spark-b2e2bb22-487b-412b-831d-19d7aa96f275/userFiles-147c9552-1a77-427e-9b17-cb0845807860/data.txt
INFO SparkContext: Added file /home/files/data.txt at http://192.XX.XX.164:58143/files/data.txt with timestamp 1457432207649
But when I try to access the same file using SparkFiles.get("data.txt"), I get the path to the file on my driver instead of on the worker.
I am setting up my file like this:
SparkConf conf = new SparkConf().setAppName("spark-play").setMaster("spark://192.XX.XX.172:7077");
conf.setJars(new String[]{"jars/SparkWorker.jar"});
JavaSparkContext sparkContext = new JavaSparkContext(conf);
sparkContext.addFile("/home/files/data.txt");
List<String> file =sparkContext.textFile(SparkFiles.get("data.txt")).collect();
I am getting FileNotFoundException here.
I recently faced the same issue, and hopefully my solution can help other people solve it.
When you use SparkContext.addFile(<file_path>), it sends the file to the automatically created working directories on the driver node (in this case, your machine) as well as on the worker nodes of the Spark cluster.
The block of code that you shared, where you use SparkFiles.get("data.txt"), is executed on the driver, so it returns the path to the file on the driver instead of on the worker. But the task runs on the worker, and the path to the file on the driver does not match the path to the file on the worker, because the driver and worker nodes have different working directory paths. Hence, you get the FileNotFoundException.
There is a workaround for this problem that does not use any distributed file system or FTP server. Put the file in your working directory on your host machine. Then, instead of using SparkFiles.get("data.txt"), use "./data.txt".
List<String> file = sparkContext.textFile("./data.txt").collect();
Now, even though there is a mismatch of working directory paths between the Spark driver and worker nodes, you will not face a FileNotFoundException, since you are using a relative path to access the file.
I think that the main issue is that you are trying to read the file via the textFile method. What is inside the brackets of the textFile method is executed in the driver program. On the worker node, only the code to be run against an RDD is performed. When you call textFile, an RDD object with a trivial associated DAG is created in your driver program, but nothing happens on the worker node.
Thus, when you try to collect the data, the worker is asked to read the file at the URL you passed to textFile, as instructed by the driver. Since your file is in the local filesystem of the driver and the worker node doesn't have access to it, you get the FileNotFoundException.
The solution is to make the file available to the worker node, either by putting it into a distributed filesystem such as HDFS, by serving it via (S)FTP, or by transferring the file to the worker node before running the Spark job and then passing textFile the path of the file in the worker's filesystem.
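Both answers turn on where the path is resolved. A minimal PySpark sketch (the question uses Java; Python is used here for brevity), assuming an existing SparkContext is passed in as sc: the file is shipped with addFile and SparkFiles.get is called inside the task, so the path is resolved on the worker rather than on the driver. The helper names are hypothetical, and the resolve parameter exists only so that the function can be tried without a cluster.

```python
import os


def read_shipped_file(name, resolve=None):
    """Read a file that was shipped with addFile, from inside a task.

    resolve maps the bare file name to a local path. It defaults to
    SparkFiles.get, which on an executor returns that executor's copy.
    """
    if resolve is None:
        from pyspark import SparkFiles  # imported lazily, on the executor
        resolve = SparkFiles.get
    with open(resolve(name)) as fh:
        return [line.rstrip("\n") for line in fh]


def collect_lines(sc, driver_path):
    """Ship driver_path to the executors and read it inside a single task."""
    sc.addFile(driver_path)
    name = os.path.basename(driver_path)
    # The file name travels as data; the path lookup happens in the task.
    return sc.parallelize([name], 1).flatMap(read_shipped_file).collect()
```

With a running context, collect_lines(sc, "/home/files/data.txt") would return the lines of the file as read by a worker task.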
