I am new to Databricks and Spark, and I am learning from this official Databricks demo. I have a Databricks workspace set up on AWS.
The code below is from the official demo and it runs fine. But where is this CSV file? I want to inspect the file and also understand how the path option works.
DROP TABLE IF EXISTS diamonds;
CREATE TABLE diamonds
USING csv
OPTIONS (path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
header "true")
I have looked at the Databricks location in the S3 bucket and have not found the file:
/databricks-datasets is a special mount location that is owned by Databricks and available out of the box in all workspaces. You can't browse it with an S3 browser, but you can explore its contents with display(dbutils.fs.ls("/databricks-datasets")), with %fs ls /databricks-datasets, or with the DBFS file browser (in the "Data" tab) - see a separate page about it.
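To see how the path option resolves: "/databricks-datasets/..." is a DBFS path, which is shorthand for the same location under the dbfs: scheme. A minimal sketch (the dbutils and spark calls only work inside a Databricks notebook):

```python
# "/databricks-datasets/..." is a DBFS path; prefixing "dbfs:" names the same location.
path = "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv"
dbfs_uri = "dbfs:" + path
print(dbfs_uri)

# Inside a Databricks notebook (not runnable elsewhere) you could then run:
# display(dbutils.fs.ls("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/"))
# df = spark.read.option("header", "true").csv(path)  # same file the table reads
```

The CREATE TABLE ... OPTIONS (path ...) statement points the table at exactly this DBFS location.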
Related
I am attempting to use an S3 access point to store data in an S3 bucket. I have tried saving as I would if I had access to the bucket directly:
someDF.write.format("csv").option("header","true").mode("Overwrite")
.save("arn:aws:s3:us-east-1:000000000000:accesspoint/access-point/prefix/")
This returns the error
IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: "arn:aws:s3:us-east-1:000000000000:accesspoint/access-point/prefix/"
I haven't been able to find any documentation on how to do this. Are access points not supported? Is there a way to set up the access point as a custom data source?
Thank you
The problem is that you have provided the ARN instead of the S3 URL. The URL would look something like this (assuming accesspoint is the bucket name):
s3://accesspoint/access-point/prefix/
If you browse to the object or prefix in the AWS console, there is a Copy S3 URI button at the top right.
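The error itself hints at this: Hadoop tries to parse the string as a filesystem URI, and an ARN has no scheme it recognizes. A small illustration (the bucket/prefix names are the hypothetical ones from the question):

```python
from urllib.parse import urlparse

# The ARN is not a filesystem URI, which is why Hadoop's URI parser rejects it:
arn = "arn:aws:s3:us-east-1:000000000000:accesspoint/access-point/prefix/"
# What Spark/Hadoop expects is a URL with an s3/s3a scheme:
url = "s3a://accesspoint/access-point/prefix/"

print(urlparse(arn).scheme)  # "arn" - not a scheme any Hadoop FileSystem handles
print(urlparse(url).scheme)  # "s3a" - handled by the S3A connector
```

With the URL form, the original write becomes someDF.write.format("csv").option("header","true").mode("overwrite").save("s3a://accesspoint/access-point/prefix/").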
I am accessing a website that allows me to download a CSV file. I would like to store the CSV file directly in the blob container. I know that one way is to download the file locally and then upload it, but I would like to skip the step of downloading the file locally. Is there a way I could achieve this?
I tried the following:
block_blob_service.create_blob_from_path('containername','blobname','https://*****.blob.core.windows.net/containername/FlightStats',content_settings=ContentSettings(content_type='application/CSV'))
but I keep getting errors stating that the path is not found.
Any help is appreciated. Thanks!
The file_path argument of create_blob_from_path is the path of a local file, something like "C:\xxx\xxx". The path you passed ('https://*****.blob.core.windows.net/containername/FlightStats') is a Blob URL, not a local path.
You could download your file into a byte array or stream, then use the create_blob_from_bytes or create_blob_from_stream method instead.
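A minimal sketch of the in-memory approach, using the legacy SDK from the question. The CSV bytes here are hypothetical stand-ins for the website response, and the upload call is commented out since it needs an authenticated BlockBlobService:

```python
import io

# Hypothetical CSV content; in real code you would fetch it in memory, e.g.
# with urllib.request.urlopen(url).read(), instead of saving it to disk first.
csv_bytes = b"flight,delay\nAA100,12\nAA101,3\n"

stream = io.BytesIO(csv_bytes)

# Legacy-SDK sketch (block_blob_service is an authenticated BlockBlobService;
# not runnable here):
# block_blob_service.create_blob_from_stream(
#     'containername', 'FlightStats.csv', stream,
#     content_settings=ContentSettings(content_type='text/csv'))

header = csv_bytes.decode().splitlines()[0]
print(header)
```

This keeps the whole round trip in memory; nothing ever touches the local filesystem.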
The other answer uses the so-called "Azure SDK for Python legacy".
If it's a fresh implementation, I recommend using a Gen2 storage account (instead of Gen1 or plain Blob storage).
For a Gen2 storage account, see the example here:
from azure.storage.filedatalake import DataLakeFileClient

data = b"abc"
file = DataLakeFileClient.from_connection_string(
    "my_connection_string",
    file_system_name="myfilesystem",
    file_path="myfile")
file.append_data(data, offset=0, length=len(data))
file.flush_data(len(data))
It's painful: if you're appending multiple times, you'll have to keep track of the offset on the client side.
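The offset bookkeeping looks roughly like this; the DataLakeFileClient calls are commented out (`file` would be the client created above), so only the pure counting logic runs:

```python
# Client-side offset bookkeeping for repeated append_data calls.
chunks = [b"abc", b"defg", b"hi"]  # hypothetical pieces arriving over time

offset = 0
for chunk in chunks:
    # file.append_data(chunk, offset=offset, length=len(chunk))
    offset += len(chunk)

# One flush at the end, at the final cumulative offset:
# file.flush_data(offset)
print(offset)  # total bytes that would be flushed
```

Each append must state where it starts, so the client has to carry the running total between calls.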
I have built a deep learning model in TensorFlow for image recognition, and it works when reading an image file from a local directory with the tf.read_file() method. But now I need TensorFlow to read the file from a byte stream that pulls the image from an Amazon S3 bucket, without storing the stream in a local directory.
You should be able to pass the fully formed S3 path to tf.read_file(), like:
s3://bucket-name/path/to/file.jpeg, where bucket-name is the name of your S3 bucket and path/to/file.jpeg is where the file is stored in your bucket. It seems possible you are running into an access-permissions issue, depending on whether your bucket is private. You can follow https://github.com/tensorflow/examples/blob/master/community/en/docs/deploy/s3.md to set up your credentials.
Is there an error you ran into when doing this?
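For concreteness, a sketch of building the URI (bucket and key names are the hypothetical ones above; the TensorFlow call is commented out since it needs TF with S3 support and valid credentials):

```python
# Hypothetical bucket and key:
bucket = "bucket-name"
key = "path/to/file.jpeg"
s3_uri = "s3://{}/{}".format(bucket, key)
print(s3_uri)

# With TF's S3 filesystem support and AWS credentials configured (see the
# linked guide), the file can be read without downloading it first:
# image_bytes = tf.read_file(s3_uri)
```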
Let's say I have the lines below. I want to know whether Spark automatically creates the folder path and writes to it the way it does on a local filesystem.
val path = "s3a://dev-us-east-1/"
val op = df_formatted.coalesce(1).write.mode("overwrite").format("csv").save(path + "report/output")
Will this be written to "s3a://dev-us-east-1/report/output"?
Yes. Note that S3 is not a folder system but rather a key-value store. Provided you have set up the "security stuff" correctly, i.e. you have credentials for an IAM user with write access, Spark will create the folders and files.
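Since S3 has no real directories, what Spark "creates" is just a set of keys sharing a prefix. A small illustration (the part-file name is hypothetical; real Spark part files include a task/job id):

```python
# "Folders" in S3 are only shared key prefixes; listing a folder is a prefix query.
keys = [
    "report/output/_SUCCESS",
    "report/output/part-00000-c000.csv",  # hypothetical part-file name
]

prefix = "report/output/"
under_prefix = [k for k in keys if k.startswith(prefix)]
print(len(under_prefix))
```

So "s3a://dev-us-east-1/report/output" appears as a folder in the console only because keys with that prefix exist.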
In Hadoop, we can get the map input file path like this:
Path pt = new Path(((FileSplit) context.getInputSplit()).getPath().toString());
But I cannot find any documentation on how to achieve this for an Azure Blob Storage account. Is there a way to get the Azure Blob path from a MapReduce program?
If you want the input file path for the current mapper or reducer task, your code is the only way to get the path, via MapContext/ReduceContext.
If instead you want the file list of the container defined in the core-site.xml file, try the code below.
// Uses the default filesystem configured in core-site.xml
// (e.g. wasb://<container>@<account>.blob.core.windows.net)
Configuration configuration = new Configuration();
FileSystem hdfs = FileSystem.get(configuration);
// List the files under the home directory of that filesystem
Path home = hdfs.getHomeDirectory();
FileStatus[] files = hdfs.listStatus(home);
Hope it helps.