My Snowflake on Azure seems to be reloading files when the AWS one doesn't - azure

I have the files in S3 because Microsoft's storage is too complicated for me to deal with, but I am using the Azure version of Snowflake. Every time I run the load process it loads every file that I have. The AWS version of Snowflake doesn't do that: it keeps track of the file names that I've already loaded and doesn't load them again. What's going on?
Here is an example of one that I am loading through a procedure:
var stmt1 = snowflake.execute( { sqlText:
    `
    copy into dw_order_charges
    from @dw_order_charges
    file_format = load_format_pipe
    `
} );
Thanks, --sw
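For context, Snowflake keeps per-table load metadata for COPY INTO, so files it believes it has already loaded are skipped unless you add FORCE = TRUE. A minimal sketch for inspecting that metadata with the Python connector and the COPY_HISTORY table function (the connection parameters below are placeholders, not values from the post):
import snowflake.connector

# Placeholder credentials -- substitute your own account details.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="my_wh",
    database="my_db",
    schema="my_schema",
)

# COPY_HISTORY lists the files the table's load metadata already knows about;
# files that appear here are skipped by later COPY INTO runs.
cur = conn.cursor()
cur.execute("""
    select file_name, last_load_time, status
    from table(information_schema.copy_history(
        table_name => 'DW_ORDER_CHARGES',
        start_time => dateadd(day, -7, current_timestamp())))
""")
for file_name, last_load_time, status in cur:
    print(file_name, last_load_time, status)
If the files show up here but are still reloaded, check whether the table or stage is being recreated between runs, since recreating the table resets its load metadata.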

Related

Load Azure ML experiment run information from datastore

I have lots of run files created by running PyTorch estimator / ScriptRunStep experiments that are saved in an azureml blob storage container. Previously, I'd been viewing these runs in the Experiments tab of the ml.azure.com portal and associating tags with these runs to categorise and load the desired models.
However, a coworker recently deleted my workspace. I created a new one which is connected to the previously existing blob container, so the run files still exist and can be accessed from this new workspace, but they no longer show up in the Experiment viewer on ml.azure.com, and I can no longer see the tags I'd associated with the runs.
Is there any way to load these old run files into the Experiment viewer, or is it only possible to view runs created inside the current workspace?
Sample ScriptRunConfig code:
from azureml.core import ScriptRunConfig
from azureml.data.data_reference import DataReference

data_ref = DataReference(datastore=ds,
                         data_reference_name="<name>",
                         path_on_datastore="<path>")
args = ['--data_dir', str(data_ref),
        '--num_epochs', 30,
        '--lr', 0.01,
        '--classifier', 'int_ext']
src = ScriptRunConfig(source_directory='.',
                      arguments=args,
                      compute_target=compute_target,
                      environment=env,
                      script='train.py')
src.run_config.data_references = {data_ref.data_reference_name: data_ref.to_config()}
Sorry for your loss! First, I'd make absolutely sure that you can't recover the deleted workspace. It's definitely worthwhile to open a priority support ticket with Azure.
Another thing you might try is:
create a new workspace (which will create a new storage account for the new workspace's logs)
copy your old workspace's data into the new workspace's storage account (a sketch of the copy step follows below).
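For that copy step, here is a minimal sketch using the azure-storage-blob package; the connection strings and container names are placeholders, not values from the original post:
from azure.storage.blob import BlobServiceClient

# Placeholder connection strings and container names -- substitute your own.
src_service = BlobServiceClient.from_connection_string("<old-account-connection-string>")
dst_service = BlobServiceClient.from_connection_string("<new-account-connection-string>")

src_container = src_service.get_container_client("azureml-blobstore-<old-workspace-id>")
dst_container = dst_service.get_container_client("azureml-blobstore-<new-workspace-id>")

for blob in src_container.list_blobs():
    # Server-side copy: the destination account pulls straight from the source blob's URL.
    # If the source container isn't public, append a SAS token to src_blob.url.
    src_blob = src_container.get_blob_client(blob.name)
    dst_container.get_blob_client(blob.name).start_copy_from_url(src_blob.url)
azcopy can do the same bulk copy from the command line if you'd rather not script it.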

Azure Blob Using Python

I am accessing a website that allows me to download a CSV file. I would like to store the CSV file directly in the blob container. I know that one way is to download the file locally and then upload it, but I would like to skip the step of downloading the file locally. Is there a way I could achieve this?
I tried the following:
block_blob_service.create_blob_from_path(
    'containername',
    'blobname',
    'https://*****.blob.core.windows.net/containername/FlightStats',
    content_settings=ContentSettings(content_type='application/CSV'))
but I keep getting errors stating that the path is not found.
Any help is appreciated. Thanks!
The file_path argument in create_blob_from_path is the path of a local file, something like "C:\xxx\xxx". The path you passed ('https://*****.blob.core.windows.net/containername/FlightStats') is a blob URL, not a local path.
You could download your file into a byte array or stream, then use the create_blob_from_bytes or create_blob_from_stream method.
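A minimal sketch of that approach with the same legacy SDK the question uses (the account details, CSV URL, and blob name are placeholders):
import requests
from azure.storage.blob import BlockBlobService, ContentSettings

block_blob_service = BlockBlobService(account_name='myaccount', account_key='mykey')

# Download the CSV into memory instead of onto the local disk.
csv_bytes = requests.get('https://example.com/flightstats.csv').content

# Upload the in-memory bytes straight to the container.
block_blob_service.create_blob_from_bytes(
    'containername',
    'FlightStats.csv',
    csv_bytes,
    content_settings=ContentSettings(content_type='text/csv'))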
The other answer uses the so-called legacy Azure SDK for Python.
I recommend that if it's a fresh implementation, you use a Gen2 storage account (instead of Gen1 or plain Blob storage).
For a Gen2 storage account, see the example here:
from azure.storage.filedatalake import DataLakeFileClient

data = b"abc"
file = DataLakeFileClient.from_connection_string("my_connection_string",
                                                 file_system_name="myfilesystem",
                                                 file_path="myfile")
file.append_data(data, offset=0, length=len(data))
file.flush_data(len(data))
It's painful: if you're appending multiple times, you have to keep track of the offset on the client side.
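To illustrate, a sketch of that offset bookkeeping when appending several chunks (connection string and paths are placeholders):
from azure.storage.filedatalake import DataLakeFileClient

file = DataLakeFileClient.from_connection_string("my_connection_string",
                                                 file_system_name="myfilesystem",
                                                 file_path="myfile")
file.create_file()  # start from an empty file

chunks = [b"first,row\n", b"second,row\n", b"third,row\n"]
offset = 0
for chunk in chunks:
    # Each append has to say where in the file it starts, so the client tracks the offset.
    file.append_data(chunk, offset=offset, length=len(chunk))
    offset += len(chunk)

# flush_data commits everything written so far; its argument is the total length written.
file.flush_data(offset)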

Moving data from a database to Azure blob storage

I'm able to use dask.dataframe.read_sql_table to read the data, e.g. df = dd.read_sql_table(table='TABLE', uri=uri, index_col='field', npartitions=N)
What would be the next (best) steps for saving it as a Parquet file in Azure Blob storage?
From my small research there are a couple of options:
Save locally and use https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-blobs?toc=/azure/storage/blobs/toc.json (not great for big data)
I believe adlfs is for reading from blob storage
Use dask.dataframe.to_parquet and work out how to point it at the blob container
The intake project (not sure where to start)
$ pip install adlfs
dd.to_parquet(
    df=df,
    path='abfs://{BLOB}/{FILE_NAME}.parquet',
    storage_options={'account_name': 'ACCOUNT_NAME',
                     'account_key': 'ACCOUNT_KEY'},
)
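To sanity-check the write, you can read the dataset back through the same abfs:// protocol that adlfs registers (account details remain placeholders):
import dask.dataframe as dd

# Read the Parquet dataset back from Blob storage and peek at a few rows.
df_check = dd.read_parquet(
    'abfs://{BLOB}/{FILE_NAME}.parquet',
    storage_options={'account_name': 'ACCOUNT_NAME',
                     'account_key': 'ACCOUNT_KEY'},
)
print(df_check.head())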

Azure: Downloading from Blob Storage results in permissions error?

I’ve uploaded some files to Blob storage, and now I’m using the OnStart method to retrieve those files and run them. Right now I’m working locally.
Using the following code:
using (var fileStream = System.IO.File.OpenWrite(@"C:\testfolder"))
{
    blob.DownloadToStream(fileStream);
}
Results in an “Access to the path 'C:\testfolder' is denied.” error.
What do you think is causing this? And - will this be an issue once the project is actually pushed up to Azure? I can change permissions locally, but I'm hoping that once it's actually in a live worker role, it won't be an issue.
Any help would be awesome :)
Scratch that - it looks like C:\testfolder should specify the file name, not just the folder. I've changed it to C:\testfolder\test.txt and it works just fine :).

Running native code on Azure

I am trying to run a C executable on Azure. I have many worker roles and they continuously check a job queue. If there is a job in the queue, a worker role runs an instance of the C executable as a process, according to the command line arguments stored in a job class. The C executable normally creates some log files. I do not know how to access those created files. What is the logic behind it? Where are the created files stored? Can anyone explain this to me? I am new to Azure and C#.
One other problem is that all of the running instances of the C executable need to read a data file. How can I distribute that required file?
First, realize that in Windows Azure, your worker role is simply running inside a Windows 2008 Server environment (either SP2 or R2). When you deploy your app, you would deploy your C executable as well (or grab it from blob storage, but that's a bit more advanced). To find out where your app lives on disk, call Environment.GetEnvironmentVariable("RoleRoot") - that returns a path. You'd typically have your app sitting in a folder called AppRoot under the role root. You'd find your C executable there.
Next, you'll want your app to write its files to an output directory you specify on the command line. You can set up storage in your local VM with your role's properties: look at the Local Storage tab and configure a named local storage area.
Now you can get the path to that storage area, in code, and pass it as a command line argument:
var outputStorage = RoleEnvironment.GetLocalResource("MyLocalStorage");
var outputFile = Path.Combine(outputStorage.RootPath, "myoutput.txt");
var cmdline = String.Format("--output {0}", outputFile);
Here's an example of launching your myapp.exe process, with command line arguments:
var appRoot = Path.Combine(Environment.GetEnvironmentVariable("RoleRoot")
    + @"\", @"approot");

var myProcess = new Process()
{
    StartInfo = new ProcessStartInfo(Path.Combine(appRoot, @"myapp.exe"), cmdline)
    {
        CreateNoWindow = false,
        UseShellExecute = false,
        WorkingDirectory = appRoot
    }
};
myProcess.Start();
myProcess.WaitForExit();
Normally you'd set CreateNoWindow to true, but it's easier to debug if you can see the command shell window.
Last thing: Once your app is done creating the file, you'll want to either:
Process it and delete it (it's not in a durable place so eventually it'll disappear)
Change your storage to use a Cloud Drive (durable storage)
Copy your file to a blob (durable storage)
In production, you'll want to add exception-handling, and you can re-route stdout and stderr to be captured. But this sample code should be enough to get you started.
OOPS - one more 'one more thing': When adding your 'myapp.exe' to your project, be SURE to go to its Properties, and set 'Copy to Output Directory' to 'Copy Always' - otherwise your myapp.exe file won't end up in Windows Azure and you'll wonder why things don't work.
EDIT: Pushing results to a blob - a quick example
First, set up a storage account and add it to your role's Settings. Say you called it 'AzureStorage' - now set it up in code, get a reference to a blob container, get a reference to a blob within that container, and then upload the file to the blob:
CloudStorageAccount storageAccount = CloudStorageAccount.FromConfigurationSetting("AzureStorage");
CloudBlobClient blobClient = storageAccount.CreateCloudBlobClient();
CloudBlobContainer outputfiles = blobClient.GetContainerReference("outputfiles");
outputfiles.CreateIfNotExist();
var blobname = "myoutput.txt";
var blob = outputfiles.GetBlobReference(blobname);
blob.UploadFile(outputFile);
In Azure land you shouldn't write to the file system. You should write to SQL Azure, Table storage or, most likely in this case, Blob storage (basically, think of Blob storage as the old file system).
This is because:
You could have multiple instances running and you will end up having different files on different instances (which are just virtual machines)
Your instance could potentially be moved at any moment and you would lose the info on the file system as it's not part of your deployment package.
Using one of the three storage options will provide a central repository for all of your instances to access and it will be persisted over a redeployment.
