Azureml TabularDataset to_pandas_dataframe() returns InvalidEncoding error

Azureml TabularDataset to_pandas_dataframe() returns InvalidEncoding error - azure-machine-learning-service

When I run:
datasetTabular = Dataset.get_by_name(ws, "<Redacted>")
datasetTabular.to_pandas_dataframe()
The following error is returned. What can I do to get past this?
ExecutionError Traceback (most recent call last) File C:\ProgramData\Anaconda3_2\envs\amlds\lib\site-packages\azureml\data\dataset_error_handling.py:101, in _try_execute(action, operation, dataset_info, **kwargs)
100 else:
--> 101 return action()
102 except Exception as e:
File C:\ProgramData\Anaconda3_2\envs\amlds\lib\site-packages\azureml\data\tabular_dataset.py:169, in TabularDataset.to_pandas_dataframe.<locals>.<lambda>()
168 dataflow = get_dataflow_for_execution(self._dataflow, 'to_pandas_dataframe', 'TabularDataset')
--> 169 df = _try_execute(lambda: dataflow.to_pandas_dataframe(on_error=on_error,
170 out_of_range_datetime=out_of_range_datetime),
171 'to_pandas_dataframe',
172 None if self.id is None else {'id': self.id, 'name': self.name, 'version': self.version})
173 fine_grain_timestamp = self._properties.get(_DATASET_PROP_TIMESTAMP_FINE, None)
File C:\ProgramData\Anaconda3_2\envs\amlds\lib\site-packages\azureml\dataprep\api\_loggerfactory.py:213, in track.<locals>.monitor.<locals>.wrapper(*args, **kwargs)
212 try:
--> 213 return func(*args, **kwargs)
214 except Exception as e:
File C:\ProgramData\Anaconda3_2\envs\amlds\lib\site-packages\azureml\dataprep\api\dataflow.py:697, in Dataflow.to_pandas_dataframe(self, extended_types, nulls_as_nan, on_error, out_of_range_datetime)
696 with tracer.start_as_current_span('Dataflow.to_pandas_dataframe', trace.get_current_span()) as span:
--> 697 return get_dataframe_reader().to_pandas_dataframe(self,
698 extended_types,
699 nulls_as_nan,
700 on_error,
701 out_of_range_datetime,
702 to_dprep_span_context(span.get_context()))
File C:\ProgramData\Anaconda3_2\envs\amlds\lib\site-packages\azureml\dataprep\api\_dataframereader.py:386, in _DataFrameReader.to_pandas_dataframe(self, dataflow, extended_types, nulls_as_nan, on_error, out_of_range_datetime, span_context)
384 if have_pyarrow() and not extended_types and not inconsistent_schema:
385 # if arrow is supported, and we didn't get inconsistent schema, and extended typed were not asked for - fallback to feather
--> 386 return clex_feather_to_pandas()
387 except _InconsistentSchemaError as e:
File C:\ProgramData\Anaconda3_2\envs\amlds\lib\site-packages\azureml\dataprep\api\_dataframereader.py:298, in
_DataFrameReader.to_pandas_dataframe.<locals>.clex_feather_to_pandas()
297 activity_data = dataflow_to_execute._dataflow_to_anonymous_activity_data(dataflow_to_execute)
--> 298 dataflow._engine_api.execute_anonymous_activity(
299 ExecuteAnonymousActivityMessageArguments(anonymous_activity=activity_data, span_context=span_context))
301 try:
File C:\ProgramData\Anaconda3_2\envs\amlds\lib\site-packages\azureml\dataprep\api\_aml_helper.py:38, in update_aml_env_vars.<locals>.decorator.<locals>.wrapper(op_code, message, cancellation_token)
37 engine_api_func().update_environment_variable(changed)
---> 38 return send_message_func(op_code, message, cancellation_token)
File C:\ProgramData\Anaconda3_2\envs\amlds\lib\site-packages\azureml\dataprep\api\engineapi\api.py:160, in EngineAPI.execute_anonymous_activity(self, message_args, cancellation_token)
158 #update_aml_env_vars(get_engine_api)
159 def execute_anonymous_activity(self, message_args: typedefinitions.ExecuteAnonymousActivityMessageArguments, cancellation_token: CancellationToken = None) -> None:
--> 160 response = self._message_channel.send_message('Engine.ExecuteActivity', message_args, cancellation_token)
161 return response
File C:\ProgramData\Anaconda3_2\envs\amlds\lib\site-packages\azureml\dataprep\api\engineapi\engine.py:291, in MultiThreadMessageChannel.send_message(self, op_code, message, cancellation_token)
290 cancel_on_error()
--> 291 raise_engine_error(response['error'])
292 else:
File C:\ProgramData\Anaconda3_2\envs\amlds\lib\site-packages\azureml\dataprep\api\errorhandlers.py:10, in raise_engine_error(error_response)
9 if 'ScriptExecution' in error_code:
---> 10 raise ExecutionError(error_response)
11 if 'Validation' in error_code:
ExecutionError: Error Code: ScriptExecution.StreamAccess.Validation Validation Error Code: InvalidEncoding Validation Target: TextFile Failed Step: 78059bb0-278f-4c7f-9c21-01a0cccf7b96 Error Message: ScriptExecutionException was caused by StreamAccessException. StreamAccessException was caused by ValidationException.
Unable to read file using Unicode (UTF-8). Attempted read range 0:777. Lines read in the range 0. Decoding error: Unable to translate bytes [8B] at index 1 from specified code page to Unicode.
Unable to translate bytes [8B] at index 1 from specified code page to Unicode. | session_id=295acf7e-4af9-42f1-b04a-79f3c5a0f98c
During handling of the above exception, another exception occurred:
UserErrorException Traceback (most recent call last) Input In [34], in <module>
1 # preview the first 3 rows of the dataset
2 #datasetTabular.take(3)
----> 3 datasetTabular.take(3).to_pandas_dataframe()
File C:\ProgramData\Anaconda3_2\envs\amlds\lib\site-packages\azureml\data\_loggerfactory.py:132, in track.<locals>.monitor.<locals>.wrapper(*args, **kwargs)
130 with _LoggerFactory.track_activity(logger, func.__name__, activity_type, custom_dimensions) as al:
131 try:
--> 132 return func(*args, **kwargs)
133 except Exception as e:
134 if hasattr(al, 'activity_info') and hasattr(e, 'error_code'):
File C:\ProgramData\Anaconda3_2\envs\amlds\lib\site-packages\azureml\data\tabular_dataset.py:169, in TabularDataset.to_pandas_dataframe(self, on_error, out_of_range_datetime)
158 """Load all records from the dataset into a pandas DataFrame.
159
160 :param on_error: How to handle any error values in the dataset, such as those produced by an error while (...)
166 :rtype: pandas.DataFrame
167 """
168 dataflow = get_dataflow_for_execution(self._dataflow, 'to_pandas_dataframe', 'TabularDataset')
--> 169 df = _try_execute(lambda: dataflow.to_pandas_dataframe(on_error=on_error,
170 out_of_range_datetime=out_of_range_datetime),
171 'to_pandas_dataframe',
172 None if self.id is None else {'id': self.id, 'name': self.name, 'version': self.version})
173 fine_grain_timestamp = self._properties.get(_DATASET_PROP_TIMESTAMP_FINE, None)
175 if fine_grain_timestamp is not None and df.empty is False:
File C:\ProgramData\Anaconda3_2\envs\amlds\lib\site-packages\azureml\data\dataset_error_handling.py:104, in _try_execute(action, operation, dataset_info, **kwargs)
102 except Exception as e:
103 message, is_dprep_exception = _construct_message_and_check_exception_type(e, dataset_info, operation)
--> 104 _dataprep_error_handler(e, message, is_dprep_exception)
File C:\ProgramData\Anaconda3_2\envs\amlds\lib\site-packages\azureml\data\dataset_error_handling.py:154, in _dataprep_error_handler(e, message, is_dprep_exception)
152 for item in user_exception_list:
153 if _contains(item, getattr(e, 'error_code', 'Unexpected')):
--> 154 raise UserErrorException(message, inner_exception=e)
156 raise AzureMLException(message, inner_exception=e)
UserErrorException: UserErrorException: Message: Execution failed with error message: ScriptExecutionException was caused by StreamAccessException. StreamAccessException was caused by ValidationException.
Unable to read file using Unicode (UTF-8). Attempted read range 0:777. Lines read in the range 0. Decoding error: [REDACTED]
Failed due to inner exception of type: DecoderFallbackException | session_id=295acf7e-4af9-42f1-b04a-79f3c5a0f98c ErrorCode: ScriptExecution.StreamAccess.Validation InnerException Error Code: ScriptExecution.StreamAccess.Validation Validation Error Code: InvalidEncoding Validation Target: TextFile Failed Step: 78059bb0-278f-4c7f-9c21-01a0cccf7b96 Error Message: ScriptExecutionException was caused by StreamAccessException. StreamAccessException was caused by ValidationException.
Unable to read file using Unicode (UTF-8). Attempted read range 0:777. Lines read in the range 0. Decoding error: Unable to translate bytes [8B] at index 1 from specified code page to Unicode.
Unable to translate bytes [8B] at index 1 from specified code page to Unicode. | session_id=295acf7e-4af9-42f1-b04a-79f3c5a0f98c ErrorResponse {
"error": {
"code": "UserError",
"message": "Execution failed with error message: ScriptExecutionException was caused by StreamAccessException.\r\n StreamAccessException was caused by ValidationException.\r\n Unable to read file using Unicode (UTF-8). Attempted read range 0:777. Lines read in the range 0. Decoding error: [REDACTED]\r\n Failed due to inner exception of type: DecoderFallbackException\r\n| session_id=295acf7e-4af9-42f1-b04a-79f3c5a0f98c ErrorCode: ScriptExecution.StreamAccess.Validation"
} }

This kind of error usually happens if the base input is not our supported OS version.
Unable to read file using Unicode (UTF-8) -> this is the key point in the error occurred
str_value = raw_data.decode('utf-8')
using the above code block convert the input and then perform the operation.

Since you're working on a collection of .json files I'd suggest using a FileDataset (if you want to work with the jsons) as you're currently doing.
If you'd prefer working with the data in tabular form, then I'd suggest doing some preprocessing to flatten the json files into a pandas dataframe before saving it as a dataset on AzureML. Then use the register_pandas_dataframe method from the DatasetFactory class to save this dataframe. This will ensure that when you fetch the Dataset from azure, the to_pandas_dataframe() method will work. Just be aware that some datatypes such as numpy arrays are not supported when using the register_pandas_dataframe() method.
The issue with creating a tabular set from json files and then converting this to a pandas dataframe once you've begun working with it (in a run or notebook), is that you're expecting azure to handle the flattening/processing.
Alternatively, you can also look at the from_json_lines method since it might suit your use case better.

Related

Brightway2 writing the imported database

I created several excel inventory files using the eco invent exchanges.
To run my LCA I successfully imported the database with 0 unlinked exchanges using:
imp = bw.ExcelImporter("Inventory_fuelcell.xlsx")
imp.apply_strategies()
imp.match_database("ecoinvent 3.6 cutoff", fields=('name','unit','location'))
imp.match_database("biosphere3", fields=('name','unit'))
imp.match_database(fields=('name', 'unit', 'location'))
imp.statistics()
But when I run imp.write_database()
I get the following error:
Writing activities to SQLite3 database:
0% [######## ] 100% | ETA: 00:00:00
---------------------------------------------------------------------------
InvalidExchange Traceback (most recent call last)
<ipython-input-41-1daab0bbe8d8> in <module>
----> 1 imp.write_database()
/opt/anaconda3/envs/Masterarbeit/lib/python3.7/site-packages/bw2io/importers/excel.py in write_database(self, **kwargs)
257 """Same as base ``write_database`` method, but ``activate_parameters`` is True by default."""
258 kwargs['activate_parameters'] = kwargs.get('activate_parameters', True)
--> 259 super(ExcelImporter, self).write_database(**kwargs)
260
261 def get_activity(self, sn, ws):
/opt/anaconda3/envs/Masterarbeit/lib/python3.7/site-packages/bw2io/importers/base_lci.py in write_database(self, data, delete_existing, backend, activate_parameters, **kwargs)
238
239 existing.update(data)
--> 240 db.write(existing)
241
242 if activate_parameters:
/opt/anaconda3/envs/Masterarbeit/lib/python3.7/site-packages/bw2data/project.py in writable_project(wrapped, instance, args, kwargs)
354 if projects.read_only:
355 raise ReadOnlyProject(READ_ONLY_PROJECT)
--> 356 return wrapped(*args, **kwargs)
/opt/anaconda3/envs/Masterarbeit/lib/python3.7/site-packages/bw2data/backends/peewee/database.py in write(self, data, process)
258 if data:
259 try:
--> 260 self._efficient_write_many_data(data)
261 except:
262 # Purge all data from database, then reraise
/opt/anaconda3/envs/Masterarbeit/lib/python3.7/site-packages/bw2data/backends/peewee/database.py in _efficient_write_many_data(self, data, indices)
203 for index, (key, ds) in enumerate(data.items()):
204 exchanges, activities = self._efficient_write_dataset(
--> 205 index, key, ds, exchanges, activities
206 )
207
/opt/anaconda3/envs/Masterarbeit/lib/python3.7/site-packages/bw2data/backends/peewee/database.py in _efficient_write_dataset(self, index, key, ds, exchanges, activities)
154 for exchange in ds.get('exchanges', []):
155 if 'input' not in exchange or 'amount' not in exchange:
--> 156 raise InvalidExchange
157 if 'type' not in exchange:
158 raise UntypedExchange
InvalidExchange:
I never had this problem before.
Is there a way to figure out where the invalid exchange is?
But even with the error if I look for databases it still shows up.
So it seems like the database was in fact imported.
Can anybody help me what could be wrong?

If we look through the error traceback, we can see the line raising the error:
154 for exchange in ds.get('exchanges', []):
155 if 'input' not in exchange or 'amount' not in exchange:
--> 156 raise InvalidExchange
This means that at least one exchange doesn't have an input or and amount. As all your exchanges are linked, they all have input values, so the amount must be missing. This could be due to a typo in the column field, or off by one errors, etc.
To find it, you could try:
for ds in imp.data:
for exc in ds['exchanges']:
if 'amount' not in exc:
print("Missing `amount` in exc:")
print("\t", exc)
print("Dataset", ds['name'], ds['location'])
elif 'input' not in exc:
# check just to make sure
print("Missing `input` in exc:")
print("\t", exc)
print("Dataset", ds['name'], ds['location'])

Cytoscape: How do you import ABC file types with py2cytoscape's cyrest api?

I have a file of the type:
A B 0.123
A C 0.84
B D 0.52
...
Where the data are tab separated, and the first and second columns are the nodes, and the third is the associated edge weight.
When trying to import this file into cytoscape using py2cytoscape, I'm receiving an error:
from py2cytoscape import cyrest
fileName="/Users/96v/Documents/lco/lcoAllAt25/lcoAll25/lcoAll25_top0.041pct_data/lcoAll25_top0.041pct.txt"
cyclient = cyrest.cyclient()
cyclient.network.import_file(dataTypeList='string,string,double',
afile=fileName,
delimiters='\t',
indexColumnSourceInteraction="0",
indexColumnTargetInteraction="1",
verbose=True)
'http://localhost:1234/v1/commands/network/import file'
TypeError Traceback (most recent call last)
in
----> 1 cyclient.network.import_file(dataTypeList='string,string,double', afile=fileName, delimiters='\t', indexColumnSourceInteraction="0", indexColumnTargetInteraction="1", defaultInteraction="Edge Attribute",verbose=True)
2
~/opt/anaconda3/lib/python3.8/site-packages/py2cytoscape/cyrest/network.py in import_file(self, dataTypeList, defaultInteraction, delimiters, delimitersForDataList, afile, firstRowAsColumnNames, indexColumnSourceInteraction, indexColumnTargetInteraction, indexColumnTypeInteraction, NetworkViewRendererList, RootNetworkList, startLoadRow, TargetColumnList, verbose)
464 afile,firstRowAsColumnNames,indexColumnSourceInteraction,indexColumnTargetInteraction,
465 indexColumnTypeInteraction,NetworkViewRendererList,RootNetworkList,startLoadRow,TargetColumnList])
--> 466 response=api(url=self.__url+"/import file", PARAMS=PARAMS, method="POST", verbose=verbose)
467 return response
468
~/opt/anaconda3/lib/python3.8/site-packages/py2cytoscape/cyrest/base.py in api(namespace, command, PARAMS, body, host, port, version, method, verbose, url, parse_params)
139 sys.stdout.flush()
140 r = requests.post(url = baseurl, json = PARAMS)
--> 141 verbose_=checkresponse(r, verbose=verbose)
142 if (verbose) or (verbose_):
143 verbose=True
~/opt/anaconda3/lib/python3.8/site-packages/py2cytoscape/cyrest/base.py in checkresponse(r, verbose)
43 if 200 <= status < 300:
44 if verbose:
---> 45 print("response status "+status)
46 sys.stdout.flush()
47 res=None
TypeError: can only concatenate str (not "int") to str
The edge weights aren't being recognized, yet the documentation isn't as verbose for this function.
Any help would be extremely appreciated!

After looking further at the GUI, I realized:
Columns are not 0 indexed.
Verbose has an error in it.
The below code works fine:
from py2cytoscape import cyrest
fileName="pathToFile"
cyclient = cyrest.cyclient()
collection = cyclient.network.import_file(dataTypeList='string,string,double',
afile=fileName,
delimiters='\t',
indexColumnSourceInteraction="1",
indexColumnTargetInteraction="2",
defaultInteraction="interacts with")

Unable to use equ.traineddata for tesseract (throes error) but hin,ben,eng works well

I have my tesseract installed at /usr/share/tesseract-ocr/ and is working fine wit hte tessdata directory at /usr/share/tesseract-ocr/4.0/tessdata. Because the equ.traineddata is not given with the original data, I have downoaded it from the officil documentation at managed to paste it at the /usr/share/tesseract-ocr/4.0/tessdata/equ.traineddata. Aong with it, I pasted hin,ben and a few more files too. When I use --l eng+hin+ben it works fine but with the equ it throws error. I used pytesseract too with a few configs such as:
# making a copy of tessdata dir in the home
cli_config = '--oem 1 --psm 12 --tessdata-dir ~/tessdata/ -l eng+equ+ben+hin'
ocr.image_to_string(image=img_path,config=cli_config)
and also
cli_config = '--oem 1 --psm 12` # tessdata is at default location too
ocr.image_to_string(image=img_path,config=cli_config,lang='eng+equ+hin+ben`)
but it keeps throwing me error ONLY FOR equ like:
TesseractError Traceback (most recent call last)
<ipython-input-30-8529ae8e51e8> in <module>
----> 1 ocr.image_to_string(image=img_path,config=cli_config,lang='equ')
~/anaconda3/envs/py36/lib/python3.6/site-packages/pytesseract/pytesseract.py in image_to_string(image, lang, config, nice, output_type, timeout)
356 Output.DICT: lambda: {'text': run_and_get_output(*args)},
357 Output.STRING: lambda: run_and_get_output(*args),
--> 358 }[output_type]()
359
360
~/anaconda3/envs/py36/lib/python3.6/site-packages/pytesseract/pytesseract.py in <lambda>()
355 Output.BYTES: lambda: run_and_get_output(*(args + [True])),
356 Output.DICT: lambda: {'text': run_and_get_output(*args)},
--> 357 Output.STRING: lambda: run_and_get_output(*args),
358 }[output_type]()
359
~/anaconda3/envs/py36/lib/python3.6/site-packages/pytesseract/pytesseract.py in run_and_get_output(image, extension, lang, config, nice, timeout, return_bytes)
264 }
265
--> 266 run_tesseract(**kwargs)
267 filename = kwargs['output_filename_base'] + extsep + extension
268 with open(filename, 'rb') as output_file:
~/anaconda3/envs/py36/lib/python3.6/site-packages/pytesseract/pytesseract.py in run_tesseract(input_filename, output_filename_base, extension, lang, config, nice, timeout)
240 with timeout_manager(proc, timeout) as error_string:
241 if proc.returncode:
--> 242 raise TesseractError(proc.returncode, get_errors(error_string))
243
244
TesseractError: (1, 'Error opening data file /home/deshwal/anaconda3/envs/py36/share/tessdata/equ.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'equ\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')
What could be the reason for this? How can use the equ.traineddata ?

equ is legacy language data. Thus, you'd need to use the appropriate oem value. Try tesseract --help-extra command to show usage.

Python3 code Uploading to S3 bucket with IO instead of String IO

I am trying to download the zip file in memory, expand it and upload it to S3.
import boto3
import io
import zipfile
import mimetypes
s3 = boto3.resource('s3')
service_zip = io.BytesIO()
service_bucket = s3.Bucket('services.mydomain.com')
build_bucket = s3.Bucket('servicesbuild.mydomain.com')
build_bucket.download_fileobj('servicesbuild.zip', service_zip)
with zipfile.ZipFile(service_zip) as myzip:
for nm in myzip.namelist():
obj = myzip.open(nm)
print(obj)
service_bucket.upload_fileobj(obj,nm,
ExtraArgs={'ContentType': mimetypes.guess_type(nm)[0]})
service_bucket.Object(nm).Acl().put(ACL='public-read')
Here is the error I get
<zipfile.ZipExtFile name='favicon.ico' mode='r' compress_type=deflate>
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-7-5941e5e45adc> in <module>
18 print(obj)
19 service_bucket.upload_fileobj(obj,nm,
---> 20 ExtraArgs={'ContentType': mimetypes.guess_type(nm)[0]})
21 service_bucket.Object(nm).Acl().put(ACL='public-read')
~/bitbucket/clguru/env/lib/python3.7/site-packages/boto3/s3/inject.py in bucket_upload_fileobj(self, Fileobj, Key, ExtraArgs, Callback, Config)
579 return self.meta.client.upload_fileobj(
580 Fileobj=Fileobj, Bucket=self.name, Key=Key, ExtraArgs=ExtraArgs,
--> 581 Callback=Callback, Config=Config)
582
583
~/bitbucket/clguru/env/lib/python3.7/site-packages/boto3/s3/inject.py in upload_fileobj(self, Fileobj, Bucket, Key, ExtraArgs, Callback, Config)
537 fileobj=Fileobj, bucket=Bucket, key=Key,
538 extra_args=ExtraArgs, subscribers=subscribers)
--> 539 return future.result()
540
541
~/bitbucket/clguru/env/lib/python3.7/site-packages/s3transfer/futures.py in result(self)
71 # however if a KeyboardInterrupt is raised we want want to exit
72 # out of this and propogate the exception.
---> 73 return self._coordinator.result()
74 except KeyboardInterrupt as e:
75 self.cancel()
~/bitbucket/clguru/env/lib/python3.7/site-packages/s3transfer/futures.py in result(self)
231 # final result.
232 if self._exception:
--> 233 raise self._exception
234 return self._result
235
~/bitbucket/clguru/env/lib/python3.7/site-packages/s3transfer/tasks.py in _main(self, transfer_future, **kwargs)
253 # Call the submit method to start submitting tasks to execute the
254 # transfer.
--> 255 self._submit(transfer_future=transfer_future, **kwargs)
256 except BaseException as e:
257 # If there was an exception raised during the submission of task
~/bitbucket/clguru/env/lib/python3.7/site-packages/s3transfer/upload.py in _submit(self, client, config, osutil, request_executor, transfer_future, bandwidth_limiter)
547 # Determine the size if it was not provided
548 if transfer_future.meta.size is None:
--> 549 upload_input_manager.provide_transfer_size(transfer_future)
550
551 # Do a multipart upload if needed, otherwise do a regular put object.
~/bitbucket/clguru/env/lib/python3.7/site-packages/s3transfer/upload.py in provide_transfer_size(self, transfer_future)
324 fileobj.seek(0, 2)
325 end_position = fileobj.tell()
--> 326 fileobj.seek(start_position)
327 transfer_future.meta.provide_transfer_size(
328 end_position - start_position)
/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/zipfile.py in seek(self, offset, whence)
1023 # Position is before the current position. Reset the ZipExtFile
1024
-> 1025 self._fileobj.seek(self._orig_compress_start)
1026 self._running_crc = self._orig_start_crc
1027 self._compress_left = self._orig_compress_size
/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/zipfile.py in seek(self, offset, whence)
702 def seek(self, offset, whence=0):
703 with self._lock:
--> 704 if self.writing():
705 raise ValueError("Can't reposition in the ZIP file while "
706 "there is an open writing handle on it. "
AttributeError: '_SharedFile' object has no attribute 'writing'
If I comment out the lines after print(obj) to see the validate the zip file content,
import boto3
import io
import zipfile
import mimetypes
s3 = boto3.resource('s3')
service_zip = io.BytesIO()
service_bucket = s3.Bucket('services.readspeech.com')
build_bucket = s3.Bucket('servicesbuild.readspeech.com')
build_bucket.download_fileobj('servicesbuild.zip', service_zip)
with zipfile.ZipFile(service_zip) as myzip:
for nm in myzip.namelist():
obj = myzip.open(nm)
print(obj)
# service_bucket.upload_fileobj(obj,nm,
# ExtraArgs={'ContentType': mimetypes.guess_type(nm)[0]})
# service_bucket.Object(nm).Acl().put(ACL='public-read')
I see the following:
<zipfile.ZipExtFile name='favicon.ico' mode='r' compress_type=deflate>
<zipfile.ZipExtFile name='styles/main.css' mode='r' compress_type=deflate>
<zipfile.ZipExtFile name='images/example3.png' mode='r' compress_type=deflate>
<zipfile.ZipExtFile name='images/example1.png' mode='r' compress_type=deflate>
<zipfile.ZipExtFile name='images/example2.png' mode='r' compress_type=deflate>
<zipfile.ZipExtFile name='index.html' mode='r' compress_type=deflate>

Appears the issue is with python 3.7. I downgraded to python 3.6 and everything is fine. There is a bug reported on python 3.7
The misprint in the file lib/zipfile.py in line 704 leads to AttributeError: '_SharedFile' object has no attribute 'writing'
"self.writing()" should be replaced by "self._writing()". I also think this code should be covered by tests.
attribute 'writing
So to resolve the issue, use python 3.6.
On osx you can go back to Python 3.6 with the following command.
brew switch python 3.6.4_4

Connect to S3 accelerate endpoint with boto3

I want to download a file into a Python file object from an S3 bucket that has acceleration activated. I came across a few resources suggesting whether to overwrite the endpoint_url to "s3-accelerate.amazonaws.com" and/or to use the use_accelerate_endpoint attribute.
I have tried both, and several variations but the same error was returned everytime. One of the scripts I tried is:
from botocore.config import Config
import boto3
from io import BytesIO
session = boto3.session.Session()
s3 = session.client(
service_name='s3',
aws_access_key_id=<MY_KEY_ID>,
aws_secret_access_key=<MY_KEY>,
region_name="us-west-2",
config=Config(s3={"use_accelerate_endpoint": True,
"addressing_style": "path"}))
input = BytesIO()
s3.download_fileobj(<MY_BUCKET>,<MY_KEY>, input)
Returns the following error:
---------------------------------------------------------------------------
ClientError Traceback (most recent call last)
<ipython-input-61-92b89b45f215> in <module>()
11 "addressing_style": "path"}))
12 input = BytesIO()
---> 13 s3.download_fileobj(bucket, filename, input)
14
15
~/Project/venv/lib/python3.5/site-packages/boto3/s3/inject.py in download_fileobj(self, Bucket, Key, Fileobj, ExtraArgs, Callback, Config)
568 bucket=Bucket, key=Key, fileobj=Fileobj,
569 extra_args=ExtraArgs, subscribers=subscribers)
--> 570 return future.result()
571
572
~/Project//venv/lib/python3.5/site-packages/s3transfer/futures.py in result(self)
71 # however if a KeyboardInterrupt is raised we want want to exit
72 # out of this and propogate the exception.
---> 73 return self._coordinator.result()
74 except KeyboardInterrupt as e:
75 self.cancel()
~/Project/venv/lib/python3.5/site-packages/s3transfer/futures.py in result(self)
231 # final result.
232 if self._exception:
--> 233 raise self._exception
234 return self._result
235
~/Project/venv/lib/python3.5/site-packages/s3transfer/tasks.py in _main(self, transfer_future, **kwargs)
253 # Call the submit method to start submitting tasks to execute the
254 # transfer.
--> 255 self._submit(transfer_future=transfer_future, **kwargs)
256 except BaseException as e:
257 # If there was an exception raised during the submission of task
~/Project/venv/lib/python3.5/site-packages/s3transfer/download.py in _submit(self, client, config, osutil, request_executor, io_executor, transfer_future)
347 Bucket=transfer_future.meta.call_args.bucket,
348 Key=transfer_future.meta.call_args.key,
--> 349 **transfer_future.meta.call_args.extra_args
350 )
351 transfer_future.meta.provide_transfer_size(
~/Project/venv/lib/python3.5/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
310 "%s() only accepts keyword arguments." % py_operation_name)
311 # The "self" in this scope is referring to the BaseClient.
--> 312 return self._make_api_call(operation_name, kwargs)
313
314 _api_call.__name__ = str(py_operation_name)
~/Project/venv/lib/python3.5/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
603 error_code = parsed_response.get("Error", {}).get("Code")
604 error_class = self.exceptions.from_code(error_code)
--> 605 raise error_class(parsed_response, operation_name)
606 else:
607 return parsed_response
ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
When I run the same script with "use_accelerate_endpoint": False it works fine.
However, it returned the same error when:
I overwrite the endpoint_url with "s3-accelerate.amazonaws.com"
I define "addressing_style": "virtual"
When running
s3.get_bucket_accelerate_configuration(Bucket=<MY_BUCKET>)
I get {..., 'Status': 'Enabled'} as expected.
Any idea what is wrong with that code and what I should change to properly query the accelerate endpoint of that bucket?
Using python3.5 with boto3==1.4.7, botocore==1.7.43 on Ubuntu 17.04.
EDIT:
I have also tried a similar script for uploads:
from botocore.config import Config
import boto3
from io import BytesIO
session = boto3.session.Session()
s3 = session.client(
service_name='s3',
aws_access_key_id=<MY_KEY_ID>,
aws_secret_access_key=<MY_KEY>,
region_name="us-west-2",
config=Config(s3={"use_accelerate_endpoint": True,
"addressing_style": "virtual"}))
output = BytesIO()
output.seek(0)
s3.upload_fileobj(output, <MY_BUCKET>,<MY_KEY>)
Which works without the use_accelerate_endpoint option (so my keys are fine), but returns this error when True:
ClientError: An error occurred (SignatureDoesNotMatch) when calling the PutObject operation: The request signature we calculated does not match the signature you provided. Check your key and signing method.
I have tried both addressing_style options here as well (virtual and path)

Using boto3==1.4.7 and botocore==1.7.43.
Here is one way to retrieve an object from a bucket with transfer acceleration enabled.
import boto3
from botocore.config import Config
from io import BytesIO
config = Config(s3={"use_accelerate_endpoint": True})
s3_resource = boto3.resource("s3",
aws_access_key_id=<MY_KEY_ID>,
aws_secret_access_key=<MY_KEY>,
region_name="us-west-2",
config=config)
s3_client = s3_resource.meta.client
file_object = BytesIO()
s3_client.download_fileobj(<MY_BUCKET>, <MY_KEY>, file_object)
Note that the client sends a HEAD request to the accelerated endpoint before a GET.
The canonical request of which looks somewhat like the following:
CanonicalRequest:
HEAD
/<MY_KEY>
host:<MY_BUCKET>.s3-accelerate.amazonaws.com
x-amz-content-sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
x-amz-date:20200520T204128Z
host;x-amz-content-sha256;x-amz-date
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
Some reasons why the HEAD request can fail include:
Object with given key doesn't exist or has strict access control enabled
Invalid credentials
Transfer acceleration isn't enabled

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Azureml TabularDataset to_pandas_dataframe() returns InvalidEncoding error - azure-machine-learning-service

This kind of error usually happens if the base input is not our supported OS version. Unable to read file using Unicode (UTF-8) -> this is the key point in the error occurred str_value = raw_data.decode('utf-8') using the above code block convert the input and then perform the operation.

Related

Brightway2 writing the imported database

Cytoscape: How do you import ABC file types with py2cytoscape's cyrest api?

Unable to use equ.traineddata for tesseract (throes error) but hin,ben,eng works well

Python3 code Uploading to S3 bucket with IO instead of String IO

Connect to S3 accelerate endpoint with boto3

Categories

Resources