I am trying to load the XML files from a folder and its subfolders as whole-text files, but when I use
sc.wholeTextFiles("folder/*/*.xml")
I am getting an error:
IllegalArgumentException: 'java.net.URISyntaxException: Expected scheme-specific part at index
I am using Databricks.
Identified the root cause of the issue: a ":" in one of the folder names was causing it. See github.com/apache/spark/pull/4368.
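For anyone hitting the same error, a quick way to locate the offending directory is to list the subfolders and flag any name containing ":" (a sketch, assuming Databricks dbutils is available; "folder" is a placeholder for the real root path):
# Sketch: flag subfolders whose names contain ":", which breaks the URI
# parsing done by wholeTextFiles. "folder" is a placeholder path.
suspect = [f.path for f in dbutils.fs.ls("folder") if ":" in f.name]
print(suspect)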
I'm trying to use node-gpg: https://github.com/drudge/node-gpg for encrypting/decrypting files using GPG.
snippet of code
A file is being created at the fileDecrypted location, but it doesn't contain anything.
Does anyone know what might be wrong or has encountered a similar issue?
Any help would be appreciated.
I also looked into this implementation: https://jaygould.co.uk/2019-01-21-decrypting-gpg-file-node-programatically/
and am following the same steps, but I still run into the same problem.
I uploaded files to DBFS:
/FileStore/shared_uploads/name_surname#xxx.xxx/file_name.csv
I tried to access them with pandas, but I always get a message that the files don't exist.
I tried to use the following paths:
/dbfs/FileStore/shared_uploads/name_surname#xxx.xxx/file_name.csv
dbfs/FileStore/shared_uploads/name_surname#xxx.xxx/file_name.csv
dbfs:/FileStore/shared_uploads/name_surname#xxx.xxx/file_name.csv
./FileStore/shared_uploads/name_surname#xxx.xxx/file_name.csv
Funnily enough, when I check with dbutils.fs.ls, I can see all the files.
I found this solution, and I tried it already: Databricks dbfs file read issue
I moved them to a new folder:
dbfs:/new_folder/
I tried to access them from this folder, but it still didn't work; the only difference was that I had copied the files to a different place.
I checked as well the documentation: https://docs.databricks.com/data/databricks-file-system.html
I use Databricks Community Edition.
I don't understand what I'm doing wrong and why it's happening like that.
I don't have any other ideas.
The /dbfs/ mount point isn't available on the Community Edition (that's a known limitation), so you need to do what is recommended in the linked answer:
dbutils.fs.cp(
'dbfs:/FileStore/shared_uploads/name_surname#xxx.xxx/file_name.csv',
'file:/tmp/file_name.csv')
and then use /tmp/file_name.csv as the input path for pandas functions. If you need to write something to DBFS, do it the other way around: write to a local file under /tmp/..., and copy that file to DBFS.
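For example, after the copy above, reading with pandas and writing a result back might look like this (a sketch; the result file name and destination path are illustrative):
import pandas as pd

# Read the local copy made by dbutils.fs.cp above.
df = pd.read_csv('/tmp/file_name.csv')

# The reverse direction: write locally first, then copy the file into DBFS.
df.to_csv('/tmp/result.csv', index=False)
dbutils.fs.cp('file:/tmp/result.csv', 'dbfs:/FileStore/shared_uploads/result.csv')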
I have a file containing sensitive API keys. It is git-ignored, so I want to load it into a variable in order to use its values.
I have seen multiple questions about this, but nothing solved my issue.
I have tried the following method, which did not work -
I am getting the following error -
The JSON file is located in the same folder as module_fcm, so I am really clueless about what could have caused this error.
Edit -
Here is my directory -
And here is my file - (information is censored, of course)
In your module_fcm.js file:
let jsonInput = require("./cloudinary-account.json");
console.log(jsonInput);
I am trying to read data from AWS S3, but I am getting an error.
The S3 bucket and paths are, for example, as below:
s3://USA/Texas/Austin/valid
s3://USA/Texas/Austin/invalid
s3://USA/Texas/Houston/valid
s3://USA/Texas/Houston/invalid
s3://USA/Texas/Dallas/valid
s3://USA/Texas/Dallas/invalid
s3://USA/Texas/San_Antonio/valid
s3://USA/Texas/San_Antonio/invalid
When I try to read with
spark.read.parquet("s3://USA/Texas/Austin/valid")
or
spark.read.parquet("s3://USA/Texas/Austin/invalid")
or
spark.read.parquet("s3://USA/Texas/Austin")
it works just fine.
But when I try to read with
spark.read.parquet("s3://USA/Texas/*")
or
spark.read.parquet("s3://USA/Texas")
it throws an exception.
java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.
As per the suggestion I can read them individually, but I have more than 500 paths; reading them individually and unioning them will be hectic.
Is there any other way to achieve this?
I am using HDFS with Parquet but I ran into the same issue. For me, setting the basePath to a path level above anything you will be accessing in that query works.
Also, I believe the '*' is unnecessary, though I'm not sure of the behavior of S3 on this one.
e.g.
spark.read.option("basePath", "s3://USA/Texas/").parquet("s3://USA/Texas/")
Perhaps this is off-base for your S3 scenario but will hopefully help someone else with HDFS getting the same error.
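If basePath still doesn't fit your layout, the "load them separately and then union them" route from the error message can at least be scripted instead of done by hand (a sketch, assuming PySpark; the path list is illustrative and would normally be built by listing the bucket):
from functools import reduce

# Illustrative subset of the leaf paths; in practice build this list by
# listing the bucket rather than typing 500 entries.
paths = [
    "s3://USA/Texas/Austin/valid",
    "s3://USA/Texas/Austin/invalid",
    "s3://USA/Texas/Houston/valid",
]

dfs = [spark.read.parquet(p) for p in paths]
combined = reduce(lambda a, b: a.unionByName(b), dfs)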
If you can use Hive, then set two configurations:
hive.input.dir.recursive=true
hive.mapred.supports.subdirectories=true
and create an external table on the root path. The table should then read data from all the subdirectories, but the schema must be the same across them, or you will get an error.
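A rough sketch of that approach driven from Spark SQL (assuming a Hive-enabled Spark session; the table name and columns are made up for illustration):
# Assumes spark was created with .enableHiveSupport(); the table name and
# columns are illustrative, not taken from the question.
spark.sql("SET hive.input.dir.recursive=true")
spark.sql("SET hive.mapred.supports.subdirectories=true")
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS texas_records (id STRING, payload STRING)
    STORED AS PARQUET
    LOCATION 's3://USA/Texas/'
""")
spark.sql("SELECT COUNT(*) FROM texas_records").show()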
I have a problem with MapDB version 1.0.6. When I create a database, I end up with two files with the same name but with different file types.
One is, for example, IRTree with file type FILE (no extension) and the other is IRTree with file type .p.
Having said that, whenever I try to read my database providing the filename IRTree, I end up with an exception:
a NullPointerException from the command DBMaker.newFileDB(new File(filename)).readOnly().make(); or an IOException: storage header is invalid.
Can anyone explain to me what's going on?
MapDB uses two files. The .p file is used to store data. Always open the file without the extension; otherwise it will try to open the wrong file.