In Azure Databricks I get different results for the DBFS directory listing simply by adding two dots (a colon).
Can anybody explain to me why this happens?
With dbutils, you can only use "dbfs:/" paths.
If you do not specify "dbfs:/" at the start of your path, it is simply auto-added.
dbutils.fs.ls('pathA')
--> dbfs:/pathA
is exactly the same as
dbutils.fs.ls('dbfs:/pathA')
but if you leave out the ':', the prefix is still added, silently:
dbutils.fs.ls('dbfs/pathB')
--> dbfs:/dbfs/pathB
This means your dbfs/ is treated as a folder named dbfs at the root of dbfs:/.
To avoid confusion, always specify dbfs:/ at the start of your path.
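To make the implicit prefixing explicit, here is a minimal sketch (the helper name is mine, not a Databricks API), assuming a notebook where dbutils is in scope:

def dbfs_path(path: str) -> str:
    """Return the path with an explicit dbfs:/ scheme."""
    if path.startswith("dbfs:/"):
        return path
    return "dbfs:/" + path.lstrip("/")

print(dbfs_path("pathA"))       # dbfs:/pathA
print(dbfs_path("dbfs/pathB"))  # dbfs:/dbfs/pathB -- 'dbfs' is just a folder name here
dbutils.fs.ls(dbfs_path("pathA"))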
I uploaded files to DBFS:
/FileStore/shared_uploads/name_surname#xxx.xxx/file_name.csv
I tried to access them with pandas, but I always get an error saying that such files don't exist.
I tried to use the following paths:
/dbfs/FileStore/shared_uploads/name_surname#xxx.xxx/file_name.csv
dbfs/FileStore/shared_uploads/name_surname#xxx.xxx/file_name.csv
dbfs:/FileStore/shared_uploads/name_surname#xxx.xxx/file_name.csv
./FileStore/shared_uploads/name_surname#xxx.xxx/file_name.csv
What is funny is that when I check them with dbutils.fs.ls, I see all the files.
I found this solution, and I tried it already: Databricks dbfs file read issue
Moved them to a new folder:
dbfs:/new_folder/
I tried to access them from this folder, but still, it didn't work for me. The only difference is that I copied files to a different place.
I checked as well the documentation: https://docs.databricks.com/data/databricks-file-system.html
I use Databricks Community Edition.
I don't understand what I'm doing wrong and why it's happening like that.
I don't have any other ideas.
The /dbfs/ mount point isn't available on the Community Edition (that's a known limitation), so you need to do what is recommended in the linked answer:
dbutils.fs.cp(
    'dbfs:/FileStore/shared_uploads/name_surname#xxx.xxx/file_name.csv',
    'file:/tmp/file_name.csv')
and then use /tmp/file_name.csv as the input parameter to pandas' functions. If you need to write something to DBFS, do it the other way around: write to a local file under /tmp/... and copy that file to DBFS.
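A minimal round-trip sketch under those assumptions (the result file name is a placeholder I made up):

import pandas as pd

# Copy from DBFS to the driver's local disk, then read with pandas.
dbutils.fs.cp('dbfs:/FileStore/shared_uploads/name_surname#xxx.xxx/file_name.csv',
              'file:/tmp/file_name.csv')
df = pd.read_csv('/tmp/file_name.csv')

# The reverse direction: write locally first, then copy the result up to DBFS.
df.to_csv('/tmp/result.csv', index=False)
dbutils.fs.cp('file:/tmp/result.csv', 'dbfs:/FileStore/result.csv')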
How to check whether a directory exists, and create it if it does not
filesystem_client.create_directory("my_directory")
will create a directory, but I want to achieve something like this:
if not os.path.exists("my_directory"):
    filesystem_client.create_directory("my_directory")
A directory will automatically be created if it does not exist. However, if you're referring to a file system/container, then to my knowledge there is no Pythonic way of checking for one directly. You can instead wrap the code in a try/except block: if the file system/container does not exist, the call raises an exception, which is caught in the except block, and you can then create it.
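A minimal sketch of that pattern, assuming the azure-storage-file-datalake SDK and placeholder account details:

from azure.core.exceptions import ResourceExistsError
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder account URL and credential.
service_client = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",
    credential="<account-key>")

# Creating a file system/container raises ResourceExistsError if it already exists.
try:
    filesystem_client = service_client.create_file_system("my-container")
except ResourceExistsError:
    filesystem_client = service_client.get_file_system_client("my-container")

# Directories are created without complaint whether or not they already exist.
filesystem_client.create_directory("my_directory")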
I have a pipeline job in Azure Data Factory which I want to run while, for example, passing through all files for a specific month.
I have a folder called 2020/01; inside this folder are numerous files with different names.
The question is: Can one pass a parameter through to only extract and load the files for 2020/01/01 and 2020/01/02 if that makes sense?
Excellent, thanks Jay, it worked and I can now run my pipeline jobs passing through the month or even day level.
Really appreciate your response, have a fantastic day.
Regards
Rayno
The question is: Can one pass a parameter through to only extract and load the files for 2020/01/01 and 2020/01/02 if that makes sense?
You didn't mention which connector you are using in the pipeline job, but you mentioned a folder in your question. As far as I know, most folder paths can be parameterized in the ADF copy activity configuration.
You could create a param :
Then apply it in the wildcard folder path:
Even if your files' names share the same prefix, you could apply 01*.json to the wildcard file name property; see the sketch below.
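For illustration only, the copy activity source could look roughly like this (the parameter name month and the connector-specific type values are assumptions, not taken from the question):

{
  "source": {
    "type": "JsonSource",
    "storeSettings": {
      "type": "AzureBlobStorageReadSettings",
      "recursive": true,
      "wildcardFolderPath": {
        "value": "@concat('2020/', pipeline().parameters.month)",
        "type": "Expression"
      },
      "wildcardFileName": "01*.json"
    }
  }
}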
I am trying to read data from AWS S3, where I am getting an error.
The S3 bucket and paths are, for example, as below:
s3://USA/Texas/Austin/valid
s3://USA/Texas/Austin/invalid
s3://USA/Texas/Houston/valid
s3://USA/Texas/Houston/invalid
s3://USA/Texas/Dallas/valid
s3://USA/Texas/Dallas/invalid
s3://USA/Texas/San_Antonio/valid
s3://USA/Texas/San_Antonio/invalid
When I try to read it as
spark.read.parquet("s3://USA/Texas/Austin/valid")
or
spark.read.parquet("s3://USA/Texas/Austin/invalid")
or
spark.read.parquet("s3://USA/Texas/Austin")
it works just fine.
But when I try to read it as
spark.read.parquet("s3://USA/Texas/*")
or
spark.read.parquet("s3://USA/Texas")
it throws an exception.
java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.
As per the suggestion I can read them individually, but I have more than 500 paths; reading them individually and unioning them would be hectic.
Is there any other way to achieve this?
I am using HDFS with Parquet but I ran into the same issue. For me, setting the basePath to a path level above anything you will be accessing in that query works.
Also, I believe the '*' is unnecessary, though I'm not sure of the behavior of S3 on this one.
e.g.
spark.read.option("basePath", "s3://USA/Texas/").parquet("s3://USA/Texas/")
Perhaps this is off-base for your S3 scenario but will hopefully help someone else with HDFS getting the same error.
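If a single basePath read still hits the conflict, a fallback sketch (the city list and the unionByName choice are my assumptions) is to read each prefix separately and union the results:

from functools import reduce

# Second-level prefixes under s3://USA/Texas/, per the paths in the question.
cities = ["Austin", "Houston", "Dallas", "San_Antonio"]

dfs = [spark.read.option("basePath", "s3://USA/Texas/")
                 .parquet(f"s3://USA/Texas/{city}")
       for city in cities]

# Union by column name so column order differences across files don't matter.
df = reduce(lambda a, b: a.unionByName(b), dfs)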
If you can use Hive, then set these two configurations:
hive.input.dir.recursive=true
hive.mapred.supports.subdirectories=true
and create an external table on the root path. The table should then read the data in all the subdirectories, but the schema must be the same throughout or you will get an error.
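A minimal sketch of that via Spark SQL, assuming a Hive-enabled SparkSession and a made-up two-column schema:

# Assumes `spark` was built with .enableHiveSupport(); the schema below is hypothetical.
spark.sql("SET hive.input.dir.recursive=true")
spark.sql("SET hive.mapred.supports.subdirectories=true")
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS texas_data (id INT, payload STRING)
    STORED AS PARQUET
    LOCATION 's3://USA/Texas/'
""")
spark.sql("SELECT COUNT(*) FROM texas_data").show()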
I am working on a Spark Java wrapper which uses third-party libraries that read files from a hard-coded directory name, say "resdata", relative to where the job executes. I know this is twisted, but I will try to explain.
When I execute the job, it tries to find the required files in a path something like this:
/data/Hadoop/yarn/local//appcache/application_xxxxx_xxx/container_00_xxxxx_xxx/resdata
I am assuming it is looking for the files in the current working directory and, under that, for a directory named "resdata". At this point I don't know how to set the current directory to any path on HDFS or local disk.
So I am looking for options to create a directory structure similar to what the third-party libraries expect and to copy the required files there. This I need to do on each node. I am working on Spark 2.2.0.
Please help me achieve this.
Just now I got the answer: I need to put all the files under a resdata directory and zip it, say resdata.zip, then pass the archive using the --archives option. Each node will then have the directory resdata.zip/resdata/file1, etc.
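For illustration, a hedged sketch of the submit command under those assumptions (the class and jar names are placeholders):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --archives resdata.zip \
  --class com.example.Wrapper \
  wrapper.jar

On each executor the archive is unpacked into the container's working directory, so the files appear under ./resdata.zip/resdata/file1 relative to where the job executes; an optional #alias suffix on --archives renames the unpacked directory.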