Get Azure Blob path from MapReduce - azure

In Hadoop, we can get the map input file path like this:
Path pt = new Path(((FileSplit) context.getInputSplit()).getPath().toString());
But I cannot find any documentation on how to achieve this with an Azure Blob Storage account. Is there a way to get the Azure Blob path from a MapReduce program?

If you want the input file path for the current mapper or reducer task, your code is the only way: it gets the path via MapContext/ReduceContext.
Otherwise, to list the files in the container defined in the core-site.xml file, try the code below.
Configuration configuration = new Configuration();
FileSystem hdfs = FileSystem.get(configuration);
Path home = hdfs.getHomeDirectory();
FileStatus[] files = hdfs.listStatus(home);
Hope it helps.

Related

databricks load file from s3 bucket path parameter

I am new to Databricks and Spark and am learning from this Databricks demo. I have a Databricks workspace set up on AWS.
The code below is from the official demo and it runs fine. But where is this CSV file? I want to check the file and also understand how the path parameter works.
DROP TABLE IF EXISTS diamonds;
CREATE TABLE diamonds
USING csv
OPTIONS (path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
header "true")
I have checked the Databricks location in the S3 bucket and have not found the file.
/databricks-datasets is a special mount location that is owned by Databricks and available out of the box in all workspaces. You can't browse it via the S3 browser, but you can use display(dbutils.fs.ls("/databricks-datasets")), or %fs ls /databricks-datasets, or the DBFS file browser (in the "Data" tab) to explore its content - see a separate page about it.

Azure Blob Using Python

I am accessing a website that allows me to download a CSV file. I would like to store the CSV file directly in the blob container. I know that one way is to download the file locally and then upload it, but I would like to skip the local download. Is there a way to achieve this?
I tried the following:
block_blob_service.create_blob_from_path('containername','blobname','https://*****.blob.core.windows.net/containername/FlightStats',content_settings=ContentSettings(content_type='application/CSV'))
but I keep getting errors stating path is not found.
Any help is appreciated. Thanks!
The file_path in create_blob_from_path is the path of a local file, which looks like "C:\xxx\xxx". The path you passed ('https://*****.blob.core.windows.net/containername/FlightStats') is a blob URL.
You could download your file to a byte array or stream, then use the create_blob_from_bytes or create_blob_from_stream method.
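A minimal sketch of the download step described above, assuming the CSV is reachable with a plain HTTP GET. The helper name fetch_to_stream and the URL in the comments are hypothetical; the upload call is shown only as a comment because it depends on the SDK in use.

```python
import io
import urllib.request


def fetch_to_stream(url, timeout=30):
    """Download the resource at `url` into an in-memory stream (no local file)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        data = response.read()
    # BytesIO gives a file-like object positioned at the start of the data.
    return io.BytesIO(data)


# Usage (hypothetical URL; replace with the real CSV download link):
#   stream = fetch_to_stream("https://example.com/FlightStats.csv")
# With the legacy SDK from the question, the stream could then be handed to:
#   block_blob_service.create_blob_from_stream('containername', 'blobname', stream)
# or its raw bytes (stream.getvalue()) to create_blob_from_bytes.
```

Keeping the payload in memory is fine for small CSV files; for very large downloads, streaming in chunks would be the safer choice.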
The other answer uses the so-called "Azure SDK for Python legacy".
If this is a fresh implementation, I recommend using a Gen2 Storage Account (instead of Gen1 or Blob storage).
For Gen2 storage account, see example here:
from azure.storage.filedatalake import DataLakeFileClient
data = b"abc"
file = DataLakeFileClient.from_connection_string("my_connection_string",
file_system_name="myfilesystem", file_path="myfile")
file.append_data(data, offset=0, length=len(data))
file.flush_data(len(data))
It's painful: if you're appending multiple times, you'll have to keep track of the offset on the client side.
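One way to handle that bookkeeping is a small wrapper that remembers the running offset. This is only a sketch: AppendTracker is a hypothetical helper, not part of the Azure SDK, and it assumes the wrapped client exposes append_data and flush_data with the arguments used in the example above.

```python
class AppendTracker:
    """Keeps the running offset for a client exposing append_data/flush_data.

    Hypothetical helper; `file_client` is assumed to behave like the
    DataLakeFileClient used in the example above.
    """

    def __init__(self, file_client, start_offset=0):
        self._client = file_client
        self._offset = start_offset

    def append(self, data):
        # Each chunk is written at the current end of the uncommitted data.
        self._client.append_data(data, offset=self._offset, length=len(data))
        self._offset += len(data)

    def flush(self):
        # Flushing takes the total length written so far.
        self._client.flush_data(self._offset)
        return self._offset
```

With this, repeated calls such as tracker.append(b"abc") followed by tracker.append(b"de") write at offsets 0 and 3, and tracker.flush() commits 5 bytes.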

Unable to use data from Google Cloud Storage in App Engine using Python 3

How can I read the data stored in my Cloud Storage bucket of my project and use it in my Python code that I am writing in App Engine?
I tried using:
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(source_blob_name)
But I am unable to figure out how to extract actual data from the code to get it in a usable form.
Any help would be appreciated.
Getting a file from a Google Cloud Storage bucket means that you are just getting an object. This concept abstracts the file itself away from your code. You either need to store the file locally before operating on it or, depending on the file type, pass the object to a stream reader or whichever method you need to read the file.
Here is a code example showing how to read a file from App Engine:
def read_file(self, filename):
    self.response.write('Reading the full file contents:\n')
    gcs_file = gcs.open(filename)
    contents = gcs_file.read()
    gcs_file.close()
    self.response.write(contents)
You have a couple of options.
content = blob.download_as_string() --> Returns the content of your Cloud Storage object; despite the name, the result is a bytes object, so decode it if you need text.
blob.download_to_file(file_obj) --> Writes the Cloud Storage object content into an already open file object.
blob.download_to_filename(filename) --> Saves the object in a file. On the App Engine Standard environment, you can store files in the /tmp/ directory.
Refer to this link for more information.
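As a sketch of the step after downloading, the bytes returned by the first option can be decoded and parsed in memory. This assumes the object is a UTF-8 encoded CSV file, which the question does not state; parse_csv_bytes is a hypothetical helper name.

```python
import csv
import io


def parse_csv_bytes(raw, encoding="utf-8"):
    """Decode downloaded bytes and return the CSV rows as lists of strings."""
    text = raw.decode(encoding)
    # StringIO lets csv.reader treat the decoded text like an open file.
    return list(csv.reader(io.StringIO(text)))


# Usage with the blob from the question (assumed to hold CSV data):
#   rows = parse_csv_bytes(blob.download_as_string())
```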

IFileProvider Azure File storage

I am thinking about implementing IFileProvider interface with Azure File Storage.
What I am trying to find in the docs is whether there is a way to send the whole file path to the Azure API, like rootDirectory/sub1/sub2/example.file, or whether it should be mapped to a recursive function that takes the path and traverses the directory structure on file storage.
I just want to make sure I am not missing something and reinventing the wheel for something that already exists.
[UPDATE]
I'm using Azure Storage Client for .NET. I would not like to mount anything.
My intention is to have several IFileProviders which I could switch between based on environment and other conditions.
So, for example, if my environment is Cloud then i would use IFileProvider implementation that uses Azure File Services through Azure Storage Client. Next, if i have environment MyServer then i would use servers local file system. Third option would be environment someOther with that particular implementation.
Now, for all of them, IFileProvider operates with path like root/sub1/sub2/sub3. For Azure File Storage, is there a way to send the whole path at once to get sub3 info/content or should the path be broken into individual directories and get reference/content for each step?
I hope that clears the question.
Now, for all of them, IFileProvider operates with path like root/sub1/sub2/sub3. For Azure File Storage, is there a way to send the whole path at once to get sub3 info/content or should the path be broken into individual directories and get reference/content for each step?
To access a specific subdirectory nested under several directories, you can use the GetDirectoryReference method to construct the CloudFileDirectory as follows:
var fileshare = storageAccount.CreateCloudFileClient().GetShareReference("myshare");
var rootDir = fileshare.GetRootDirectoryReference();
var dir = rootDir.GetDirectoryReference("2017-10-24/15/52");
var items=dir.ListFilesAndDirectories();
To access a specific file under the subdirectory, you can use the GetFileReference method to return the CloudFile instance as follows:
var file=rootDir.GetFileReference("2017-10-24/15/52/2017-10-13-2.png");

How to create a sub container in azure storage location

How can I create a sub container in the azure storage location?
Windows Azure doesn't provide the concept of hierarchical containers, but it does provide a mechanism to traverse a hierarchy by convention and API. All containers are stored at the same level. You can gain similar functionality by using naming conventions for your blob names.
For instance, you may create a container named "content" and create blobs with the following names in that container:
content/blue/images/logo.jpg
content/blue/images/icon-start.jpg
content/blue/images/icon-stop.jpg
content/red/images/logo.jpg
content/red/images/icon-start.jpg
content/red/images/icon-stop.jpg
Note that these blobs are a flat list against your "content" container. That said, using "/" as a conventional delimiter lets you traverse them in a hierarchical fashion.
protected IEnumerable<IListBlobItem> GetDirectoryList(string directoryName, string subDirectoryName)
{
    CloudStorageAccount account =
        CloudStorageAccount.FromConfigurationSetting("DataConnectionString");
    CloudBlobClient client =
        account.CreateCloudBlobClient();
    CloudBlobDirectory directory =
        client.GetBlobDirectoryReference(directoryName);
    CloudBlobDirectory subDirectory =
        directory.GetSubdirectory(subDirectoryName);
    return subDirectory.ListBlobs();
}
You can then call this as follows:
GetDirectoryList("content/blue", "images")
Note the use of the GetBlobDirectoryReference and GetSubdirectory methods and the CloudBlobDirectory type instead of CloudBlobContainer. These provide the traversal functionality you are likely looking for.
This should help you get started. Let me know if this doesn't answer your question:
[ Thanks to Neil Mackenzie for inspiration ]
Are you referring to blob storage? If so, the hierarchy is simply StorageAccount/Container/BlobName. There are no nested containers.
Having said that, you can use slashes in your blob name to simulate nested containers in the URI. See this article on MSDN for naming details.
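To illustrate the idea, here is a pure-Python sketch of how a flat list of blob names can be read as one level of a hierarchy using a prefix and a delimiter. It makes no Azure calls; list_level is a hypothetical helper, and the blob names in the usage note are made up for the example.

```python
def list_level(blob_names, prefix="", delimiter="/"):
    """Return (virtual_dirs, blobs) found directly under `prefix`.

    `blob_names` is a flat list of names, as stored in a single container.
    """
    dirs, blobs = set(), []
    for name in blob_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # A delimiter remains, so the name lives in a deeper "folder".
            dirs.add(prefix + head + delimiter)
        elif rest:
            # No delimiter left: the blob sits directly at this level.
            blobs.append(name)
    return sorted(dirs), blobs
```

For names such as "blue/images/logo.jpg", "blue/readme.txt" and "red/images/logo.jpg", calling list_level(names, "blue/") yields the virtual directory "blue/images/" and the blob "blue/readme.txt", even though nothing but flat names exists in the container.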
I agree with tobint's answer and want to add something, because I had the same need:
I had to upload my HTML games to Azure Storage while creating these directories:
Games\Beautyshop\index.html
Games\Beautyshop\assets\apple.png
Games\Beautyshop\assets\aromas.png
Games\Beautyshop\customfont.css
Games\Beautyshop\jquery.js
Following your recommendations, I tried to upload my content with the Azure Storage Explorer tool; you can download the tool and its source code at this URL: Azure Storage Explorer
First I tried to upload via the tool, but it does not allow hierarchical directory uploads, because you don't need them: How to create sub directory in a blob container
Finally, I debugged the Azure Storage Explorer source code and edited the Background_UploadBlobs method and the UploadFileList field in the StorageAccountViewModel.cs file. You can edit it however you want. I may have made spelling errors, sorry about that; this is only my recommendation.
If you are trying to upload files from the Azure portal:
To create a subfolder in a container, go to Advanced options while uploading a file and select upload to a folder, which will create a new folder in the container and upload the file into it.
Kotlin Code
val blobClient = blobContainerClient.getBlobClient("$subDirNameTimeStamp/$fileName$extension");
This creates a directory named after the timestamp, with your blob file inside it. Note the slash (/) in the code above: it nests the blob by creating a folder named after the string that precedes the slash.
Sample code
string myfolder = "<folderName>";
string myfilename = "<fileName>";
string fileName = String.Format("{0}/{1}.csv", myfolder, myfilename);
CloudBlockBlob blob = container.GetBlockBlobReference(fileName);

Resources