Will Spark create an S3 folder path if it doesn't exist? - apache-spark

Let's say I have the lines below. I want to know whether Spark automatically creates the folder path and writes to the folder, the way it does on a local file system:
val path = "s3a://dev-us-east-1/"
val op = df_formatted.coalesce(1).write.mode("overwrite").format("csv").save(path + "report/output")
Will this be written to "s3a://dev-us-east-1/report/output"?

Yes. S3 is not a folder system but rather a key-value store, so there are no directories to create ahead of time. Provided that you have set up the "security stuff" correctly, that is, you have credentials for an IAM user with write access to the bucket, Spark will create the "folders" and files for you.
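As a rough sketch of what that looks like end to end (assuming credentials are passed in explicitly rather than picked up from an instance profile, and using a placeholder DataFrame in place of df_formatted from the question):

// Scala sketch. In spark-shell or a notebook, `spark` already exists.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3-write-example").getOrCreate()

// Only needed if credentials are not already supplied via the environment,
// ~/.aws/credentials, or an attached IAM role / instance profile.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

// Placeholder DataFrame standing in for df_formatted.
val df_formatted = spark.range(10).toDF("value")

val path = "s3a://dev-us-east-1/"

// No mkdir step is needed: the CSV part file and _SUCCESS marker are written
// directly under the "report/output/" key prefix, which the S3 console then
// displays as a folder.
df_formatted.coalesce(1)
  .write.mode("overwrite")
  .format("csv")
  .save(path + "report/output")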

Related

How do you create a folder and write to Heroku's ephemeral storage?

I'm using Node.js, and when my script uses fs.mkdir nothing seems to happen... it works well locally. Is there an alternate command/function I can use to create and write to folders in Heroku's file system?
(Yes, I'm aware the ephemeral filesystem is temporary; in my use case, all files will be deleted after 5 minutes.)
You can use the tmp folder (as in the code below), which is where you can write to and read files from on Heroku. For instance, the first line of code below saves data to a specified file path inside the tmp folder; the second line creates a read stream for that file.
fs.writeFileSync(`/tmp/${filename}.json`, dataToSave)
const fileStream = fs.createReadStream(`/tmp/${filename}.json`)

Databricks load file from S3 bucket path parameter

I am new to Databricks and Spark and am learning from this demo from Databricks. I have a Databricks workspace set up on AWS.
The code below is from the official demo and it runs fine. But where is this CSV file? I want to check the file and also understand how the path parameter works.
DROP TABLE IF EXISTS diamonds;
CREATE TABLE diamonds
USING csv
OPTIONS (path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
header "true")
I have checked the Databricks location in the S3 bucket and have not found the file:
/databricks-datasets is a special mount location that is owned by Databricks and available out of the box in all workspaces. You can't browse it via an S3 browser, but you can use display(dbutils.fs.ls("/databricks-datasets")), or %fs ls /databricks-datasets, or the DBFS File Browser (in the "Data" tab) to explore its contents - see the separate documentation page about it.
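If you want to look at the file from a notebook cell, a short sketch (Scala here; the same dbutils calls work from Python) could look like the following, using the paths from the demo:

// List the folder that contains the demo CSV. dbutils and display are
// available inside Databricks notebooks, not in plain Spark.
display(dbutils.fs.ls("/databricks-datasets/Rdatasets/data-001/csv/ggplot2"))

// Read the same CSV directly, which is equivalent to what the
// CREATE TABLE ... USING csv OPTIONS (path ...) statement does.
val diamonds = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv")

display(diamonds.limit(5))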

What Visual Studio Code extension API should be used to add folders and move files in a Visual Studio Code workspace?

I want to write an extension that will:
Read the folders and files in the current workspace
Reorganise those files by creating new folders and moving the files into them
So I need an API to do the above. I am not clear on whether I should use
the standard fs node module,
the File System Provider, or
the workspace namespace.
I read the answer here, but it doesn't say what to use to create folders and move files within the workspace.
The FileSystemProvider is to be used if you want to serve files from non-local storage like FTP sites or virtual file systems inside remote devices and present them to Visual Studio Code as storage directories. The FileSystemProvider is a view/controller of the remote storage. In other words, you have to implement all the file manipulation operations by communicating with the remote storage.
If you just want to manipulate the files in the current workspace and also be able to use URIs from FileSystemProviders, use vscode.workspace.fs.
You can also use the Node.js fs module, but that only handles local disk workspaces (URIs with the file: scheme). I recommend using the synchronous versions of the fs methods; I had some trouble with the asynchronous fs methods in Visual Studio Code (I did not know about vscode.workspace.fs at that time).
I found extension.ts, sample code provided by Microsoft for implementing a FileSystemProvider.
The steps below help you understand how to create a new folder (createDirectory) and move files within the workspace (using the copy command to copy all files from the old folder to the new one, followed by delete if you don't want the files to remain in the old folder).
The first step is to register the file system provider for a given scheme using registerFileSystemProvider:
const memFs = new MemFS();
context.subscriptions.push(
  vscode.workspace.registerFileSystemProvider('memfs', memFs, {
    isCaseSensitive: true
  }));
Next, register a command with registerCommand to perform your operations on the FileSystemProvider, such as readDirectory, readFile, createDirectory, copy, and delete.
context.subscriptions.push(
  vscode.commands.registerCommand('memfs.init', _ => {
    // TODO: add the required functionality
  }));
Read folders and read files
for (const [name] of memFs.readDirectory(vscode.Uri.parse('memfs:/'))) {
  memFs.readFile(vscode.Uri.parse(`memfs:/${name}`));
}
Create the new folder
memFs.createDirectory(vscode.Uri.parse(`memfs:/folder/`));
Move the files to the newly created folder. Unfortunately, there doesn't seem to be a separate move-file command, but you can use the copy command to copy files into the new folder and then delete the originals:
COPY
copy(source: Uri, destination: Uri, options: {overwrite: boolean})
DELETE
context.subscriptions.push(vscode.commands.registerCommand('memfs.reset', _ => {
  for (const [name] of memFs.readDirectory(vscode.Uri.parse('memfs:/'))) {
    memFs.delete(vscode.Uri.parse(`memfs:/${name}`));
  }
}));

IFileProvider with Azure File Storage

I am thinking about implementing the IFileProvider interface on top of Azure File Storage.
What I am trying to find in the docs is whether there is a way to send the whole path to the file to the Azure API, like rootDirectory/sub1/sub2/example.file, or whether that should be mapped to a recursive function that takes the path and traverses the directory structure on the file storage.
I just want to make sure I am not missing something and reinventing the wheel for something that already exists.
[UPDATE]
I'm using the Azure Storage Client for .NET. I would not like to mount anything.
My intention is to have several IFileProviders which I could switch between based on the environment and other conditions.
So, for example, if my environment is Cloud, I would use an IFileProvider implementation that uses Azure File Services through the Azure Storage Client. If the environment is MyServer, I would use the server's local file system. A third option would be the environment someOther with its own implementation.
Now, for all of them, IFileProvider operates with a path like root/sub1/sub2/sub3. For Azure File Storage, is there a way to send the whole path at once to get the sub3 info/content, or should the path be broken into individual directories, getting a reference/content for each step?
I hope that clears up the question.
To access a specific subdirectory nested several levels deep, you can pass the whole relative path to the GetDirectoryReference method to construct the CloudFileDirectory, as follows:
var fileshare = storageAccount.CreateCloudFileClient().GetShareReference("myshare");
var rootDir = fileshare.GetRootDirectoryReference();
var dir = rootDir.GetDirectoryReference("2017-10-24/15/52");
var items = dir.ListFilesAndDirectories();
To access a specific file under a subdirectory, you can use the GetFileReference method to get a CloudFile instance, as follows:
var file = rootDir.GetFileReference("2017-10-24/15/52/2017-10-13-2.png");

Get Azure Blob path from MapReduce

In Hadoop, we can get the map input file path as:
Path pt = new Path(((FileSplit) context.getInputSplit()).getPath().toString());
But I cannot find any documentation on how to achieve this with an Azure Blob Storage account. Is there a way to get the Azure Blob path from a MapReduce program?
If you want to get the input file path for the current mapper or reducer task, your code is the only way to get it: the path comes from the MapContext/ReduceContext.
If not, to get the file list of the container defined in core-site.xml, try the code below.
// Build a configuration from the cluster's config files; core-site.xml defines
// the default file system (e.g. the wasb:// container for Azure Blob Storage).
Configuration configuration = new Configuration();
FileSystem fs = FileSystem.get(configuration);
// List the files under the home directory of that default file system.
Path home = fs.getHomeDirectory();
FileStatus[] files = fs.listStatus(home);
Hope it helps.
