pyarrow ds.dataset fails with FileNotFoundError using Azure blob filesystem in Azure Functions but not locally

I have a couple of functions in Azure Functions which download data into my Data Lake Gen2 storage. They work by downloading CSVs, converting them to pyarrow tables and then saving them as individual parquet files in Azure storage, which works fine.
I have another function in the same app which is supposed to consolidate and repartition the data. That function raises a FileNotFoundError exception when trying to create a dataset.
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
import pyarrow.compute as pc
import os
from adlfs import AzureBlobFileSystem
abfs=AzureBlobFileSystem(connection_string=os.environ['Synblob'])
newds=ds.dataset(f"stdatalake/miso/dailyfiles/{filetype}", filesystem=abfs, partitioning="hive")
When running locally the code runs fine, but when running in the cloud I get:
Result: Failure Exception: FileNotFoundError: stdatalake/miso/dailyfiles/daprices/exante/date=20210926
Stack:
  File "/azure-functions-host/workers/python/3.8/LINUX/X64/azure_functions_worker/dispatcher.py", line 402, in _handle__invocation_request
    call_result = await self._loop.run_in_executor(
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/azure-functions-host/workers/python/3.8/LINUX/X64/azure_functions_worker/dispatcher.py", line 611, in _run_sync_func
    return ExtensionManager.get_sync_invocation_wrapper(context,
  File "/azure-functions-host/workers/python/3.8/LINUX/X64/azure_functions_worker/extension.py", line 215, in _raw_invocation_wrapper
    result = function(**args)
  File "/home/site/wwwroot/consolidate/__init__.py", line 47, in main
    newds=ds.dataset(f"stdatalake/miso/dailyfiles/{filetype}", filesystem=abfs, partitioning="hive")
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pyarrow/dataset.py", line 667, in dataset
    return _filesystem_dataset(source, **kwargs)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pyarrow/dataset.py", line 422, in _filesystem_dataset
    return factory.finish(schema)
  File "pyarrow/_dataset.pyx", line 1680, in pyarrow._dataset.DatasetFactory.finish
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/_fs.pyx", line 1179, in pyarrow._fs._cb_open_input_file
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pyarrow/fs.py", line 394, in open_input_file
    raise FileNotFoundError(path)
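A quick way to narrow this down is to list the path through the same adlfs filesystem object inside the function before calling ds.dataset — a minimal diagnostic sketch, assuming the Synblob connection string is also set in the Function App settings and using the daprices path from the traceback as an example:
import os
from adlfs import AzureBlobFileSystem

# Diagnostic only: check what the Function App actually sees through adlfs
# before handing the path to pyarrow.
abfs = AzureBlobFileSystem(connection_string=os.environ['Synblob'])
path = "stdatalake/miso/dailyfiles/daprices"
print(abfs.exists(path))                      # does the directory resolve at all?
print(abfs.ls(path)[:5])                      # first few hive partition directories
print(abfs.glob(f"{path}/**/*.parquet")[:5])  # first few parquet files under it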

Related

Pass a partitioned TabularDataset into ParallelRunStep with azureml sdkv1

I'm trying to pass a partitioned TabularDataset into a ParallelRunStep as input, but I get the error below and can't figure out why azureml's ParallelRunStep doesn't recognize the partitioned dataset:
UserInputNotPartitionedByGivenKeys: The input dataset 'partitioned_combined_scored_dataset_input' is not partitioned by 'model_name'.
Traceback (most recent call last):
File "/tmp/48a0ec47-b89c-41ff-89f8-3482d2823d20/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/master_role_process.py", line 111, in run
loop.run_until_complete(self.master_role.start())
File "/tmp/48a0ec47-b89c-41ff-89f8-3482d2823d20/prs_prod/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
return future.result()
File "/tmp/48a0ec47-b89c-41ff-89f8-3482d2823d20/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/master_role.py", line 303, in start
await self.wait_for_first_task()
File "/tmp/48a0ec47-b89c-41ff-89f8-3482d2823d20/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/master_role.py", line 288, in wait_for_first_task
await self.wait_for_input_init()
File "/tmp/48a0ec47-b89c-41ff-89f8-3482d2823d20/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/master_role.py", line 126, in wait_for_input_init
self.future_create_tasks.result()
File "/tmp/48a0ec47-b89c-41ff-89f8-3482d2823d20/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/task_producer.py", line 199, in create_tasks
raise exc
File "/tmp/48a0ec47-b89c-41ff-89f8-3482d2823d20/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/task_producer.py", line 190, in create_tasks
for task_group in self.get_task_groups(provider.get_tasks()):
File "/tmp/48a0ec47-b89c-41ff-89f8-3482d2823d20/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/task_producer.py", line 169, in get_task_groups
for index, task in enumerate(tasks):
File "/tmp/48a0ec47-b89c-41ff-89f8-3482d2823d20/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/partition_by_keys_provider.py", line 77, in get_tasks
raise UserInputNotPartitionedByGivenKeys(message=message, compliant_message=compliant_message)
UserInputNotPartitionedByGivenKeys: The input dataset 'partitioned_combined_scored_dataset_input' is not partitioned by 'model_name'.
ParallelRunConfig & ParallelRunStep
parallel_run_config = ParallelRunConfig(
    source_directory=source_dir_for_snapshot,
    entry_script="src/steps/script.py",
    partition_keys=["model_name"],
    error_threshold=10,
    allowed_failed_count=15,
    allowed_failed_percent=10,
    run_max_try=3,
    output_action="append_row",
    append_row_file_name="output_file.csv",
    environment=aml_run_config.environment,
    compute_target=aml_run_config.target,
    node_count=2
)
parallelrun_step = ParallelRunStep(
    name="Do Some Parallel Stuff on Each model_name",
    parallel_run_config=parallel_run_config,
    inputs=[partitioned_combined_scored_dataset],
    output=OutputFileDatasetConsumptionConfig(name='output_dataset'),
    arguments=["--score-id", score_id_pipeline_param,
               "--partitioned-combined-dataset", partitioned_combined_scored_dataset],
    allow_reuse=True
)
partitioned_combined_scored_dataset
partitioned_combined_scored_dataset = DatasetConsumptionConfig(
    name="partitioned_combined_scored_dataset_input",
    dataset=PipelineParameter(
        name="partitioned_combined_dataset",
        default_value=future_partitioned_dataset)
)
The partitioned dataset itself was previously created and uploaded using:
partitioned_dataset = TabularDatasetFactory.from_parquet_files(
    path=(Datastore.get(ws, ), f"{partitioned_combined_datasets_dir}/*.parquet"))\
    .partition_by(
        partition_keys=['model_name'],
        target=DataPath(Datastore(), 'some/path/to/partitioned')
    )
I know TabularDataset.partition_by() uploads to a GUID folder generated by AML, so according to the documentation some/path/to/partitioned actually becomes some/path/to/partitioned/XXXXXXXX/{model_name}/part0.parquet for each partition on model_name. We've accounted for this when defining the tabular dataset passed into the PipelineParameter for partitioned_combined_scored_dataset at runtime, using:
TabularDatasetFactory.from_parquet_files(path=(Datastore(), f"{partitioned_combined_dataset_dir}/*/*/*.parquet"))
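Since the error says the input dataset is not partitioned by 'model_name', one thing worth checking before submitting the pipeline is what partition metadata the re-created dataset actually reports. A rough sketch, assuming the same workspace (ws), directory variable and a hypothetical 'datastore_name' placeholder:
from azureml.core import Datastore
from azureml.data.dataset_factory import TabularDatasetFactory

# Hypothetical sanity check: the dataset object handed to the PipelineParameter
# should itself report 'model_name' as a partition key; an empty result here
# suggests the re-created dataset has lost the partition metadata.
reloaded = TabularDatasetFactory.from_parquet_files(
    path=(Datastore.get(ws, 'datastore_name'),  # placeholder datastore name
          f"{partitioned_combined_dataset_dir}/*/*/*.parquet"))
print(reloaded.partition_keys)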

Uploading a CSV file to S3 using boto3

I am trying to upload a CSV file to the S3 bucket "ebayunited" using boto3.
I fetched the JSON products from a dummy-data API,
converted them to CSV and saved the file in the same folder,
then used boto3 to upload it.
Code
import requests
import pandas as pd
import boto3 as bt
from API_keys import access_key, secret_access_key

client = bt.client(
    's3', aws_access_key_id=access_key,
    aws_secret_access_key=secret_access_key
)
r = requests.get("https://dummyjson.com/products")
json_products = pd.DataFrame(r.json()["products"])
file_s3 = json_products.to_csv("Orders.csv", index=False)
bucket = "ebayunited"
path = "Lysi Team/ebayapi/" + str(file_s3)
client.upload_file(str(file_s3), bucket, path)
and I get this error:
Traceback (most recent call last):
File "c:\Users\PC\Desktop\S3 Import\s3_import.py", line 18, in <module>
client.upload_file(str(file_s3), bucket, path)
File "C:\Users\PC\AppData\Local\Programs\Python\Python310\lib\site-packages\boto3\s3\inject.py", line 143, in upload_file
return transfer.upload_file(
File "C:\Users\PC\AppData\Local\Programs\Python\Python310\lib\site-packages\boto3\s3\transfer.py", line 288, in upload_file
future.result()
File "C:\Users\PC\AppData\Local\Programs\Python\Python310\lib\site-packages\s3transfer\futures.py", line 103, in result
return self._coordinator.result()
File "C:\Users\PC\AppData\Local\Programs\Python\Python310\lib\site-packages\s3transfer\futures.py", line 266, in result
raise self._exception
File "C:\Users\PC\AppData\Local\Programs\Python\Python310\lib\site-packages\s3transfer\tasks.py", line 269, in _main
self._submit(transfer_future=transfer_future, **kwargs)
File "C:\Users\PC\AppData\Local\Programs\Python\Python310\lib\site-packages\s3transfer\upload.py", line 585, in _submit
upload_input_manager.provide_transfer_size(transfer_future)
File "C:\Users\PC\AppData\Local\Programs\Python\Python310\lib\site-packages\s3transfer\upload.py", line 244, in provide_transfer_size
self._osutil.get_file_size(transfer_future.meta.call_args.fileobj)
File "C:\Users\PC\AppData\Local\Programs\Python\Python310\lib\site-packages\s3transfer\utils.py", line 247, in get_file_size
return os.path.getsize(filename)
File "C:\Users\PC\AppData\Local\Programs\Python\Python310\lib\genericpath.py", line 50, in getsize
return os.stat(filename).st_size
FileNotFoundError: [WinError 2] The system cannot find the file specified: 'None'
The upload_file(filename, bucket, key) command expects the name of a file to upload from your local disk.
Your program appears to be assuming that the to_csv() function returns the name of the resulting file, but the to_csv() documentation says:
If path_or_buf is None, returns the resulting csv format as a string. Otherwise returns None.
Therefore, you will need to pass the actual name of the file:
key = 'Lysi Team/ebayapi/test.csv' # Change desired S3 Key here
client.upload_file('test.csv', bucket, key)
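Applied to the code in the question, the fix would look something like this (a sketch that keeps the Orders.csv filename the question already writes to):
# to_csv() writes the file to disk and returns None when given a path,
# so pass the actual filename to upload_file() rather than its return value.
json_products.to_csv("Orders.csv", index=False)

bucket = "ebayunited"
key = "Lysi Team/ebayapi/Orders.csv"
client.upload_file("Orders.csv", bucket, key)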

Cannot create Jupyter Notebook in HDInsight 4.0

I'm using Azure HDInsight 4.0 (Spark 2.4). When I attempt to create a new Jupyter notebook (Spark, but I get a similar error for PySpark notebooks), I get the following error message:
Traceback (most recent call last):
  File "/usr/bin/anaconda/lib/python2.7/site-packages/notebook/base/handlers.py", line 457, in wrapper
    result = yield gen.maybe_future(method(self, *args, **kwargs))
  File "/usr/bin/anaconda/lib/python2.7/site-packages/tornado/gen.py", line 1015, in run
    value = future.result()
  File "/usr/bin/anaconda/lib/python2.7/site-packages/tornado/concurrent.py", line 237, in result
    raise_exc_info(self._exc_info)
  File "/usr/bin/anaconda/lib/python2.7/site-packages/tornado/gen.py", line 1021, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/bin/anaconda/lib/python2.7/site-packages/notebook/services/contents/handlers.py", line 216, in post
    yield self._new_untitled(path, type=type, ext=ext)
  File "/usr/bin/anaconda/lib/python2.7/site-packages/tornado/gen.py", line 1015, in run
    value = future.result()
  File "/usr/bin/anaconda/lib/python2.7/site-packages/tornado/concurrent.py", line 237, in result
    raise_exc_info(self._exc_info)
  File "/usr/bin/anaconda/lib/python2.7/site-packages/tornado/gen.py", line 285, in wrapper
    yielded = next(result)
  File "/usr/bin/anaconda/lib/python2.7/site-packages/notebook/services/contents/handlers.py", line 171, in _new_untitled
    model = yield gen.maybe_future(self.contents_manager.new_untitled(path=path, type=type, ext=ext))
  File "/usr/bin/anaconda/lib/python2.7/site-packages/notebook/services/contents/manager.py", line 338, in new_untitled
    return self.new(model, path)
  File "/usr/bin/anaconda/lib/python2.7/site-packages/notebook/services/contents/manager.py", line 364, in new
    model = self.save(model, path)
  File "/var/lib/.jupyter/jupyterazure/jupyterazure/httpfscontentsmanager.py", line 84, in save
    self.create_checkpoint(path)
  File "/usr/bin/anaconda/lib/python2.7/site-packages/notebook/services/contents/manager.py", line 459, in create_checkpoint
    return self.checkpoints.create_checkpoint(self, path)
  File "/usr/bin/anaconda/lib/python2.7/site-packages/notebook/services/contents/checkpoints.py", line 79, in create_checkpoint
    model = contents_mgr.get(path, content=True)
  File "/var/lib/.jupyter/jupyterazure/jupyterazure/httpfscontentsmanager.py", line 56, in get
    'metadata': {}})
  File "/var/lib/.jupyter/jupyterazure/jupyterazure/model.py", line 45, in create_model_from_blob
    nbformat.version_info[0])
  File "/usr/bin/anaconda/lib/python2.7/site-packages/nbformat/__init__.py", line 75, in reads
    nb = convert(nb, as_version)
  File "/usr/bin/anaconda/lib/python2.7/site-packages/nbformat/converter.py", line 54, in convert
    "version doesn't exist" % (to_version))
ValueError: Cannot convert notebook to v5 because that version doesn't exist
After this, a new notebook does appear on the home screen, but if I try to open it I get the following popup message:
An unknown error occurred while loading this notebook. This version can load notebook formats v4 or earlier. See the server log for details.
I can create a notebook just fine on an otherwise-identical HDI 3.6 cluster, but not on 4.0. (I need 4.0 because I need to use Spark 2.4.)
Has anyone experienced/resolved this before?
Recently, we have seen a couple of questions on the same issue. You may follow the steps below to resolve it.
Steps to resolve this issue:
Step 1: Connect to the headnode via SSH and edit the file /usr/bin/anaconda/lib/python2.7/site-packages/nbformat/_version.py, replacing the major version 5 with 4.
Change this to:
version_info = (4, 0, 3)
Step 2: Restart the Jupyter service via Ambari.
For more details, refer to HDInsight Cannot create Jupyter notebook.
Hope this helps. Do let us know if you have any further queries.

python aws s3: downloading files based on size

I can filter files in an s3 bucket based on file size, and I can download files, but I get an error trying to do both. This is Python 3.4.
import boto3
import re

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')
some_file = "whatever.txt"

for file in bucket.objects.all():
    if file.size < 1000 and re.search(".*txt$", file.key):
        print(file.key, file.size)
        bucket.download_file(file.key, file.key)

bucket.download_file(some_file, some_file)  # this works fine
The above for loop correctly finds the .txt files that are smaller than 1000 bytes, but the bucket.download_file(file.key, file.key) call gives me this:
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
However, the last line works fine. What's the difference?
FYI, I searched for the error and saw some posts about access permissions. This bucket does not have public access. I'm running from an EMR cluster and the secret key credentials are already defined, so I don't need to specify them in the script.
UPDATE: the full error looks like this:
Traceback (most recent call last):
File "./get_size.py", line 39, in <module>
bucket.download_file(file.key, file.key)
File "/home/hadoop/jobs/scripts/s3/python34/local/lib/python3.4/dist-packages/boto3-1.9.189-py3.4.egg/boto3/s3/inject.py", line 246, in bucket_download_file
ExtraArgs=ExtraArgs, Callback=Callback, Config=Config)
File "/home/hadoop/jobs/scripts/s3/python34/local/lib/python3.4/dist-packages/boto3-1.9.189-py3.4.egg/boto3/s3/inject.py", line 172, in download_file
extra_args=ExtraArgs, callback=Callback)
File "/home/hadoop/jobs/scripts/s3/python34/local/lib/python3.4/dist-packages/boto3-1.9.189-py3.4.egg/boto3/s3/transfer.py", line 307, in download_file
future.result()
File "/home/hadoop/jobs/scripts/s3/python34/local/lib/python3.4/dist-packages/s3transfer-0.2.1-py3.4.egg/s3transfer/futures.py", line 106, in result
File "/home/hadoop/jobs/scripts/s3/python34/local/lib/python3.4/dist-packages/s3transfer-0.2.1-py3.4.egg/s3transfer/futures.py", line 265, in result
File "/home/hadoop/jobs/scripts/s3/python34/local/lib/python3.4/dist-packages/s3transfer-0.2.1-py3.4.egg/s3transfer/tasks.py", line 255, in _main
File "/home/hadoop/jobs/scripts/s3/python34/local/lib/python3.4/dist-packages/s3transfer-0.2.1-py3.4.egg/s3transfer/download.py", line 345, in _submit
File "/home/hadoop/jobs/scripts/s3/python34/local/lib/python3.4/dist-packages/botocore-1.12.189-py3.4.egg/botocore/client.py", line 357, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/home/hadoop/jobs/scripts/s3/python34/local/lib/python3.4/dist-packages/botocore-1.12.189-py3.4.egg/botocore/client.py", line 661, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
Thanks for your input, but it turned out to be some unusual permissions on a few of the files in this bucket that I hadn't realized were there. After removing those files, the code above works on this bucket and also on other buckets.
I can confirm this part works:
bucket.download_file(file.key, file.key)
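For anyone who hits the same 403 and needs to find the offending objects, one option is to call HeadObject on each candidate key and log the failures — a minimal sketch, assuming the same default EMR credentials and the bucket name from the question:
import boto3
from botocore.exceptions import ClientError

bucket_name = 'my-bucket'  # bucket name from the question
s3_client = boto3.client('s3')

# Probe each small .txt object individually; a 403 here points at the
# specific keys whose permissions are blocking the download.
for obj in boto3.resource('s3').Bucket(bucket_name).objects.all():
    if obj.size < 1000 and obj.key.endswith('.txt'):
        try:
            s3_client.head_object(Bucket=bucket_name, Key=obj.key)
        except ClientError as e:
            print(obj.key, e.response['Error']['Code'])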

Getting a Key Error 'gs' when trying to write a dask dataframe to csv on google cloud storage

I have the following code where I'm 1) importing CSV files from a GCS bucket, 2) doing some ETL on them, and 3) converting the result to a dask dataframe before writing it out with to_csv. All goes to plan until the very end, when I get a KeyError: 'gs' upon writing the CSV back to a GCS bucket.
Here is my code - can anyone help me understand where the key error comes from?
def stage1_1ph_prod_master(data, context):
    from google.cloud import storage
    import pandas as pd
    import dask.dataframe as dd
    import io
    import numpy as np
    import datetime as dt

    source_bucket = 'sourcebucket'
    destination_path = 'gs://destination_bucket/ddf-*ph_master_static.csv'
    storage_client = storage.Client()
    source_bucket = storage_client.bucket(source_bucket)

    # load in the col names
    col_names = ["PPG_Code","PPG_Code_Name","SAP_Product_Name","CP_Sku_Code","UPC_Unit","UPC_Case","Category","Product_Category","Sub_Category","Brand","Sub_Brand","Variant","Size","Gender","Last_Updated_By","Last_Updated_On","Created_By","Created_On","Gross_Weight_Case_kg","Case_Height_mm",]
    df = pd.DataFrame(columns=col_names)

    for file in list(source_bucket.list_blobs()):
        file_path = "gs://{}/{}".format(file.bucket.name, file.name)
        df = df.append(pd.read_csv(file_path, header=None, skiprows=28, names=col_names, encoding='Latin_1'))

    ddf0 = dd.from_pandas(df, npartitions=1, sort=True)
    ddf0.to_csv(destination_path)  # KeyError happens here
Traceback (most recent call last):
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 383, in run_background_function
    _function_handler.invoke_user_function(event_object)
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 217, in invoke_user_function
    return call_user_function(request_or_event)
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 214, in call_user_function
    event_context.Context(**request_or_event.context))
  File "/user_code/main.py", line 43, in stage1_1ph_prod_master
    ddf0.to_csv(destination_path)
  File "/env/local/lib/python3.7/site-packages/dask/dataframe/core.py", line 1299, in to_csv
    return to_csv(self, filename, **kwargs)
  File "/env/local/lib/python3.7/site-packages/dask/dataframe/io/csv.py", line 741, in to_csv
    **(storage_options or {})
  File "/env/local/lib/python3.7/site-packages/dask/bytes/core.py", line 302, in open_files
    urlpath, mode, num=num, name_function=name_function, storage_options=kwargs
  File "/env/local/lib/python3.7/site-packages/dask/bytes/core.py", line 425, in get_fs_token_paths
    fs, fs_token = get_fs(protocol, options)
  File "/env/local/lib/python3.7/site-packages/dask/bytes/core.py", line 571, in get_fs
    cls = _filesystems[protocol]
KeyError: 'gs'
gcsfs and dask have recently changed to use the fsspec package. The former has been released, but for dask the change is in master only. So gcsfs no longer registers itself with dask's filesystem registry, because fsspec already knows about it, but the version of dask you are using does not yet know about fsspec.
In short, please downgrade gcsfs until we have a chance to release dask, or use dask from master.
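To confirm whether an environment has this mismatch, one quick check is to print both versions and see whether the dask code path from the traceback knows about the gs protocol — a rough sketch, assuming both packages import in the Cloud Function environment:
import dask
import gcsfs

print("dask:", dask.__version__)
print("gcsfs:", gcsfs.__version__)

# In the older dask shown in the traceback, registered protocols live in
# dask.bytes.core._filesystems; if 'gs' is missing there, to_csv('gs://...')
# raises the same KeyError.
from dask.bytes.core import _filesystems
print("gs registered:", "gs" in _filesystems)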
