Cannot import CSV file into h2o from Databricks cluster DBFS - python-3.x

I have successfully installed h2o on my AWS Databricks cluster and started the h2o server with:
h2o.init()
When I attempt to import the iris CSV file that is stored in my Databricks DBFS:
train, valid = h2o.import_file(path="/FileStore/tables/iris.csv").split_frame(ratios=[0.7])
I get an H2OResponseError: Server error water.exceptions.H2ONotFoundArgumentException
The CSV file is absolutely there; in the same Databricks notebook, I am able to read it directly into a DataFrame and view the contents using the exact same fully qualified path:
df_iris = ks.read_csv("/FileStore/tables/iris.csv")
df_iris.head()
I've also tried calling:
h2o.upload_file("/FileStore/tables/iris.csv")
but to no avail; I get H2OValueError: File /FileStore/tables/iris.csv does not exist. I've also tried uploading the file directly from my local computer (C drive), but that doesn't succeed either.
I've tried not using the fully qualified path, and just specifying the file name, but I get the same errors. I've read through the H2O documentation and searched the web, but cannot find anyone who has ever encountered this problem before.
Can someone please help me?
Thanks.

H2O may not understand that this path is on DBFS. You can try specifying the path as /dbfs/FileStore/tables/iris.csv, in which case it will be read as a "local file", or specify the full path with the scheme, like dbfs:/FileStore/tables/iris.csv, but this may require DBFS-specific jars for H2O.
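A minimal sketch of both variants (the file location is the one from the question):
import h2o
h2o.init()

# Variant 1: /dbfs/ prefix, so H2O reads the file through the local file API
iris = h2o.import_file(path="/dbfs/FileStore/tables/iris.csv")

# Variant 2: full dbfs:/ scheme; may require DBFS-specific jars on the H2O side
# iris = h2o.import_file(path="dbfs:/FileStore/tables/iris.csv")

train, valid = iris.split_frame(ratios=[0.7])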

Related

Get correct path for sparkdid.cassandra.connection.config.cloud.path in AWS EMR

I have searched the entire web but could not find a solution.
I am trying to connect to Astra Cassandra using the secure connect bundle on AWS EMR. It is able to download the bundle file but not load it.
spark.conf.set("sparkdid.cassandra.connection.config.cloud.path", "secure-connect-app.zip")
This is how I am specifying the path. I know this is the wrong path, since it is not loading the correct config and returns connection refused with localhost.
I don't know what the correct path is in EMR.
If you look into the documentation, you will see that the file can either be specified as a URL of a file on S3, or you can pass it with the --files option when submitting with spark-submit or spark-shell; it will then be available by just its file name, like you are doing.
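A hedged PySpark sketch of both options, assuming the property name is the connector's documented spark.cassandra.connection.config.cloud.path (the "sparkdid" spelling in the question appears to be a typo) and with the bucket name as a placeholder:
from pyspark.sql import SparkSession

# Option 1: point the connector at the bundle via an S3 URL
spark = (SparkSession.builder
         .appName("astra-connect")
         .config("spark.cassandra.connection.config.cloud.path",
                 "s3a://my-bucket/secure-connect-app.zip")
         .getOrCreate())

# Option 2: ship the bundle at submit time, e.g.
#   spark-submit --files secure-connect-app.zip my_job.py
# and then reference it by its bare file name, as in the question:
# spark.conf.set("spark.cassandra.connection.config.cloud.path", "secure-connect-app.zip")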

How to mount a Google Bucket in Kubeflow Pipeline?

I have a Kubeflow Pipeline up and running on a VM in GCP.
I create the pipeline using a Jupyter Notebook server with the jupyter-kale image and Python.
The first part of the pipeline does the data prep: it downloads images and saves them to a PVC. This all works just fine, but I ran out of storage space, so I decided to save the downloaded images directly to a Google Cloud Storage bucket instead of the PVC.
I modified my pipeline as shown in the code below:
import kfp
import kfp.dsl as dsl
import kfp.onprem as onprem
import kfp.compiler as compiler
import os
@dsl.pipeline(
    name='try_mount',
    description='...'
)
def one_d_pipe(pvc_name="gs://xxx-images/my_folder/"):
    trymount = dsl.ContainerOp(
        name="trymount",
        # image = "sprintname3:0.2.0",
        image="eu.gcr.io/xxx-admin/kubeflow/trymount_1:0.1"
    )
    steps = [trymount]
    for step in steps:
        step.apply(onprem.mount_pvc(pvc_name, "gs://xxx-images/my_folder/", '/home/jovyan/data'))
But this code results in an error message right after starting, saying that the volume has an invalid value and could not be found:
This step is in Error state with this message: Pod "try-mount-75vrt-3151677017" is invalid: [spec.volumes[2].name: Invalid value: "gs://xxx-images/my_folder/": a DNS-1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc', regex used for validation is 'a-z0-9?'), spec.containers[0].volumeMounts[3].name: Not found: "gs://xxx-images/my_folder/", spec.containers[1].volumeMounts[0].name: Not found: "gs://xxx-images/my_folder/"]
So, my question:
How to mount a google bucket in Kubeflow Pipelines?
You can't mount a bucket as a volume; it's not a file system. However, I'm sure that you can cheat by using gcsfuse on your VM.
On your VM, mount the GCS bucket with fuse
gcsfuse xxx-images /path/to/mount-gcs
Then use this directory in your code. No volume mount is required in the pipeline; the GCS bucket is already mounted with gcsfuse.
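A minimal sketch of what the dataprep step could then do, assuming the bucket is mounted at /path/to/mount-gcs as above (my_folder is taken from the question):
import os

# Anything written under the mount point lands in gs://xxx-images/ via gcsfuse
out_dir = "/path/to/mount-gcs/my_folder"
os.makedirs(out_dir, exist_ok=True)

with open(os.path.join(out_dir, "example.txt"), "w") as f:
    f.write("written straight into the GCS bucket through the gcsfuse mount")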

Can't access files via Local file API on Databricks

I'm trying to access a small text file stored directly on DBFS using the local file API.
I'm getting the following error.
No such file or directory
My code:
val filename = "/dbfs/test/test.txt"
for (line <- Source.fromFile(filename).getLines()) {
println(line)
}
At the same time, I can access this file without any problems using dbutils, or load it into an RDD via the Spark context.
I've tried specifying the path starting with dbfs:/ or /dbfs/ or just with the test folder name, both in Scala and Python, getting the same error each time. I'm running the code from the notebook. Is it some problem with the cluster configuration?
Check whether your cluster has Credential Passthrough enabled. If so, the local file API is not available.
https://docs.azuredatabricks.net/data/databricks-file-system.html#local-file-apis
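If passthrough turns out to be the cause, a workaround sketch (using the path from the question; dbutils and spark are the objects a Databricks notebook provides) is to go through the DBFS APIs instead of the local file API:
# Peek at the file with dbutils (returns the first 64 KB by default) ...
print(dbutils.fs.head("dbfs:/test/test.txt"))

# ... or read the whole file line by line through Spark
for row in spark.read.text("dbfs:/test/test.txt").collect():
    print(row.value)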

Access hdfs cluster from pydoop

I have an HDFS cluster and Python on the same Google Cloud Platform project. I want to access files in the HDFS cluster from Python. I found that pydoop can do that, but I am struggling to give it the right parameters. Below is the code that I have tried so far:
import pydoop.hdfs as hdfs
import pydoop
pydoop.hdfs.hdfs(host='url of the file system goes here',
port=9864, user=None, groups=None)
"""
class pydoop.hdfs.hdfs(host='default', port=0, user=None, groups=None)
A handle to an HDFS instance.
Parameters
host (str) – hostname or IP address of the HDFS NameNode. Set to an empty string (and port to 0) to connect to the local file system; set to 'default' (and port to 0) to connect to the default (i.e., the one defined in the Hadoop configuration files) file system.
port (int) – the port on which the NameNode is listening
user (str) – the Hadoop domain user name. Defaults to the current UNIX user. Note that, in MapReduce applications, since tasks are spawned by the JobTracker, the default user will be the one that started the JobTracker itself.
groups (list) – ignored. Included for backwards compatibility.
"""
#print (hdfs.ls("/vs_co2_all_2019_v1.csv"))
It gives this error:
RuntimeError: Hadoop config not found, try setting HADOOP_CONF_DIR
And if I execute this line of code:
print (hdfs.ls("/vs_co2_all_2019_v1.csv"))
nothing happens. The "vs_co2_all_2019_v1.csv" file does exist in the cluster, although it was not available at the moment I took the screenshot.
My hdfs screenshot is shown below:
and the credentials that I have are shown below:
Can anybody tell me what I am doing wrong? Which credentials do I need to put where in the pydoop API? Or maybe there is a simpler way around this problem; any help will be much appreciated!
Have you tried the following?
import pydoop.hdfs as hdfs
import pydoop
hdfs_object = pydoop.hdfs.hdfs(host='url of the file system goes here',
port=9864, user=None, groups=None)
hdfs_object.list_directory("/vs_co2_all_2019_v1.csv")
or simply:
hdfs_object.list_directory("/")
Keep in mind that the pydoop.hdfs module is not directly tied to the hdfs class instance (hdfs_object). Thus, the connection that you established in the first command is not used by hdfs.ls("/vs_co2_all_2019_v1.csv").
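To make that distinction concrete, a hedged sketch (the hostname and NameNode port are placeholders, and the module-level call assumes HADOOP_CONF_DIR points at your cluster configuration):
import pydoop.hdfs as hdfs

fs = hdfs.hdfs(host='namenode-host', port=8020)  # explicit connection, like hdfs_object above
print(fs.list_directory("/"))                    # uses that explicit connection
print(hdfs.ls("/"))                              # module-level helper; talks to the default FS instead
fs.close()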

Read CSV file from AWS S3

I have an EC2 instance running pyspark and I'm able to connect to it (ssh) and run interactive code within a Jupyter Notebook.
I have an S3 bucket with a CSV file that I want to read. When I attempt to read it with:
spark = SparkSession.builder.appName('Basics').getOrCreate()
df = spark.read.csv('https://s3.us-east-2.amazonaws.com/bucketname/filename.csv')
it throws a long Python error message, ending with something related to:
Py4JJavaError: An error occurred while calling o131.csv.
Specify the S3 path along with the access key and secret key as follows:
's3n://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>#my.bucket/folder/input_data.csv'
Access key-related information can be introduced in the typical username + password manner for URLs. As a rule, the access protocol should be s3a, the successor to s3n (see Technically what is the difference between s3n, s3a and s3?). Putting this together, you get
spark.read.csv("s3a://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>#bucketname/filename.csv")
As an aside, some Spark execution environments, e.g., Databricks, allow S3 buckets to be mounted as part of the file system. You can do the same when you build a cluster using something like s3fs.
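As a hedged alternative to embedding credentials in the URL, you can set the standard s3a keys on the Hadoop configuration and keep the path clean (bucket and file names are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Basics').getOrCreate()

# Set the s3a credentials on the underlying Hadoop configuration
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>")
hconf.set("fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>")

df = spark.read.csv("s3a://bucketname/filename.csv", header=True, inferSchema=True)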
