Can't access files via Local file API on Databricks

I'm trying to access a small text file stored directly on DBFS using the local file API.
I'm getting the following error:
No such file or directory
My code:
import scala.io.Source

val filename = "/dbfs/test/test.txt"
for (line <- Source.fromFile(filename).getLines()) {
  println(line)
}
At the same time I can access this file without any problems using dbutils, or load it into an RDD via the Spark context.
I've tried specifying the path starting with dbfs:/, with /dbfs/, or with just the test folder name, both in Scala and Python, and I get the same error each time. I'm running the code from a notebook. Is it some problem with the cluster configuration?

Check if your cluster has credential passthrough enabled. If so, the local file API is not available.
https://docs.azuredatabricks.net/data/databricks-file-system.html#local-file-apis
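If passthrough is indeed enabled, you can still read the file through dbutils or Spark, as the question already notes works. A minimal sketch in Python, using the path from the question:

# Read the file through DBFS-aware APIs instead of the local file API.
print(dbutils.fs.head("dbfs:/test/test.txt"))                   # first bytes of the file

for row in spark.read.text("dbfs:/test/test.txt").collect():    # line by line
    print(row.value)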

Related

I am trying to connect to abfss directly (without mounting to DBFS) and open a JSON file using open() in Databricks

I am trying to connect to abfss directly (without mounting to DBFS) and to open a JSON file with the open() method in Databricks:
json_file = open("abfss://#.dfs.core.windows.net/test.json")
Databricks is unable to open the file in the Azure blob container, and I get the error below:
FileNotFoundError: [Errno 2] No such file or directory: 'abfss://#.dfs.core.windows.net/test.json'
I have done all the configuration using a service principal. Please suggest another way of opening the file using the direct abfss path.
open() works only with local files; it doesn't know anything about abfss or other cloud storage. You have the following choices:
use dbutils.fs.cp to copy the file from ADLS to the local disk of the driver node and then work with it, like: dbutils.fs.cp("abfss:/....", "file:/tmp/my-copy")
copy the file from ADLS to the driver node using the Azure SDK
The first method is easier to use than the second.
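Putting the first option together (a sketch only; the abfss URL is a placeholder for your own container, storage account, and file name):

# Copy the file from ADLS to the driver's local disk, then open it as a local file.
src = "abfss://<container>@<storage-account>.dfs.core.windows.net/test.json"
dbutils.fs.cp(src, "file:/tmp/test.json")

import json
with open("/tmp/test.json") as f:
    data = json.load(f)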

Cannot import CSV file into h2o from Databricks cluster DBFS

I have successfully installed h2o on my AWS Databricks cluster, and then successfully started the h2o server with:
h2o.init()
When I attempt to import the iris CSV file that is stored in my Databricks DBFS:
train, valid = h2o.import_file(path="/FileStore/tables/iris.csv").split_frame(ratios=[0.7])
I get an H2OResponseError: Server error water.exceptions.H2ONotFoundArgumentException
The CSV file is absolutely there; in the same Databricks notebook, I am able to read it directly into a DataFrame and view the contents using the exact same fully qualified path:
df_iris = ks.read_csv("/FileStore/tables/iris.csv")
df_iris.head()
I've also tried calling:
h2o.upload_file("/FileStore/tables/iris.csv")
but to no avail; I get H2OValueError: File /FileStore/tables/iris.csv does not exist. I've also tried uploading the file directly from my local computer (C drive), but that doesn't succeed either.
I've tried not using the fully qualified path, and just specifying the file name, but I get the same errors. I've read through the H2O documentation and searched the web, but cannot find anyone who has ever encountered this problem before.
Can someone please help me?
Thanks.
H2O may not understand that this path is on DBFS. You may try specifying the path as /dbfs/FileStore/tables/iris.csv, in which case it will be read as a "local" file, or try specifying the full path with the scheme, like dbfs:/FileStore/tables/iris.csv, but this may require DBFS-specific JARs for H2O.
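A minimal sketch of the first suggestion (assuming h2o.init() has already been run, as in the question):

# Point H2O at the /dbfs FUSE path so it reads the CSV as a local file.
import h2o
iris = h2o.import_file(path="/dbfs/FileStore/tables/iris.csv")
train, valid = iris.split_frame(ratios=[0.7])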

Access hdfs cluster from pydoop

I have an HDFS cluster and Python on the same Google Cloud Platform. I want to access files in the HDFS cluster from Python. I found that pydoop can do that, but I am struggling with giving it the right parameters. Below is the code that I have tried so far:
import pydoop.hdfs as hdfs
import pydoop
pydoop.hdfs.hdfs(host='url of the file system goes here',
                 port=9864, user=None, groups=None)
"""
class pydoop.hdfs.hdfs(host='default', port=0, user=None, groups=None)
A handle to an HDFS instance.
Parameters
host (str) – hostname or IP address of the HDFS NameNode. Set to an empty string (and port to 0) to connect to the local file system; set to 'default' (and port to 0) to connect to the default (i.e., the one defined in the Hadoop configuration files) file system.
port (int) – the port on which the NameNode is listening
user (str) – the Hadoop domain user name. Defaults to the current UNIX user. Note that, in MapReduce applications, since tasks are spawned by the JobTracker, the default user will be the one that started the JobTracker itself.
groups (list) – ignored. Included for backwards compatibility.
"""
#print (hdfs.ls("/vs_co2_all_2019_v1.csv"))
It gives this error:
RuntimeError: Hadoop config not found, try setting HADOOP_CONF_DIR
And if I execute this line of code:
print(hdfs.ls("/vs_co2_all_2019_v1.csv"))
nothing happens. But the "vs_co2_all_2019_v1.csv" file does exist in the cluster; it just was not available at the moment I took the screenshot.
My hdfs screenshot is shown below:
and the credentials that I have are shown below:
Can anybody tell me what I am doing wrong? Which credentials do I need to put where in the pydoop API? Or maybe there is a simpler way around this problem; any help will be much appreciated!
Have you tried the following?
import pydoop.hdfs as hdfs
import pydoop
hdfs_object = pydoop.hdfs.hdfs(host='url of the file system goes here',
                               port=9864, user=None, groups=None)
hdfs_object.list_directory("/vs_co2_all_2019_v1.csv")
or simply:
hdfs_object.list_directory("/")
Keep in mind that the module-level functions in pydoop.hdfs are not tied to your hdfs instance (hdfs_object). Thus, the connection that you established in the first command is not used in hdfs.ls("/vs_co2_all_2019_v1.csv").

Read CSV file from AWS S3

I have an EC2 instance running pyspark, and I'm able to connect to it (ssh) and run interactive code within a Jupyter Notebook.
I have an S3 bucket with a CSV file that I want to read. When I attempt to read it with:
spark = SparkSession.builder.appName('Basics').getOrCreate()
df = spark.read.csv('https://s3.us-east-2.amazonaws.com/bucketname/filename.csv')
it throws a long Python error message and then something related to:
Py4JJavaError: An error occurred while calling o131.csv.
Specify the S3 path along with the access key and secret key as follows:
's3n://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>@my.bucket/folder/input_data.csv'
Access-key information can be embedded in the typical username + password manner for URLs. As a rule, the access protocol should be s3a, the successor to s3n (see Technically what is the difference between s3n, s3a and s3?). Putting this together, you get:
spark.read.csv("s3a://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>@bucketname/filename.csv")
As an aside, some Spark execution environments, e.g., Databricks, allow S3 buckets to be mounted as part of the file system. You can do the same when you build a cluster using something like s3fs.
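If you prefer not to embed the credentials in the URL (where they can end up in logs), a common alternative is to pass them through the Hadoop configuration when building the session. A sketch, assuming the s3a connector (hadoop-aws) is on the classpath; the key values and bucket name are placeholders:

from pyspark.sql import SparkSession

# Credentials are passed via spark.hadoop.* properties instead of the URL.
spark = (SparkSession.builder
         .appName('Basics')
         .config('spark.hadoop.fs.s3a.access.key', '<AWS_ACCESS_KEY_ID>')
         .config('spark.hadoop.fs.s3a.secret.key', '<AWS_SECRET_ACCESS_KEY>')
         .getOrCreate())

df = spark.read.csv('s3a://bucketname/filename.csv', header=True)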

Load props file in EMR Spark Application

I am trying to load custom properties in my Spark application using:
command-runner.jar,spark-submit,--deploy-mode,cluster,--properties-file,s3://spark-config-test/myprops.conf,--num-executors,5,--executor-cores,2,--class,com.amazon.Main,#{input.directoryPath}/SWALiveOrderModelSpark-1.0-super.jar
However, I am getting the following exception:
Exception in thread "main" java.lang.IllegalArgumentException: Invalid properties file 's3://spark-config-test/myprops.conf'.
    at org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:241)
    at org.apache.spark.launcher.AbstractCommandBuilder.loadPropertiesFile(AbstractCommandBuilder.java:284)
    at org.apache.spark.launcher.AbstractCommandBuilder.getEffectiveConfig(AbstractCommandBuilder.java:264)
    at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitCommand(SparkSubmitCommandBuilder.java:233)
    at org
Is this the correct way to load a properties file from S3?
You can't load a properties file directly from S3. Instead, you will need to download the properties file to your master node somewhere, then submit the Spark job referencing the local path on that node. You can do the download by using command-runner.jar to run the AWS CLI.
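For example (a sketch only, reusing the bucket and file names from the question; the /home/hadoop path is an arbitrary choice), you could add one step that copies the file down and then reference the local copy in the spark-submit step:
command-runner.jar,aws,s3,cp,s3://spark-config-test/myprops.conf,/home/hadoop/myprops.conf
command-runner.jar,spark-submit,--deploy-mode,cluster,--properties-file,/home/hadoop/myprops.conf,--num-executors,5,--executor-cores,2,--class,com.amazon.Main,#{input.directoryPath}/SWALiveOrderModelSpark-1.0-super.jar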
