Python AWS S3: downloading files based on size

I can filter files in an S3 bucket based on file size, and I can download files, but I get an error when I try to do both. This is Python 3.4.
import boto3
import re

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')
some_file = "whatever.txt"

for file in bucket.objects.all():
    if file.size < 1000 and re.search(".*txt$", file.key):
        print(file.key, file.size)
        bucket.download_file(file.key, file.key)

bucket.download_file(some_file, some_file)  # this works fine
The for loop above correctly finds .txt files smaller than 1000 bytes, but the bucket.download_file(file.key, file.key) call inside it gives me this:
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
However, the last line works fine. What's the difference?
FYI, I searched for the error and found mentions of access permissions. This bucket does not have public access. I'm running from an EMR cluster, and the secret key credentials are already configured, so I don't need to specify them in the script.
UPDATE: the full error looks like this:
Traceback (most recent call last):
File "./get_size.py", line 39, in <module>
bucket.download_file(file.key, file.key)
File "/home/hadoop/jobs/scripts/s3/python34/local/lib/python3.4/dist-packages/boto3-1.9.189-py3.4.egg/boto3/s3/inject.py", line 246, in bucket_download_file
ExtraArgs=ExtraArgs, Callback=Callback, Config=Config)
File "/home/hadoop/jobs/scripts/s3/python34/local/lib/python3.4/dist-packages/boto3-1.9.189-py3.4.egg/boto3/s3/inject.py", line 172, in download_file
extra_args=ExtraArgs, callback=Callback)
File "/home/hadoop/jobs/scripts/s3/python34/local/lib/python3.4/dist-packages/boto3-1.9.189-py3.4.egg/boto3/s3/transfer.py", line 307, in download_file
future.result()
File "/home/hadoop/jobs/scripts/s3/python34/local/lib/python3.4/dist-packages/s3transfer-0.2.1-py3.4.egg/s3transfer/futures.py", line 106, in result
File "/home/hadoop/jobs/scripts/s3/python34/local/lib/python3.4/dist-packages/s3transfer-0.2.1-py3.4.egg/s3transfer/futures.py", line 265, in result
File "/home/hadoop/jobs/scripts/s3/python34/local/lib/python3.4/dist-packages/s3transfer-0.2.1-py3.4.egg/s3transfer/tasks.py", line 255, in _main
File "/home/hadoop/jobs/scripts/s3/python34/local/lib/python3.4/dist-packages/s3transfer-0.2.1-py3.4.egg/s3transfer/download.py", line 345, in _submit
File "/home/hadoop/jobs/scripts/s3/python34/local/lib/python3.4/dist-packages/botocore-1.12.189-py3.4.egg/botocore/client.py", line 357, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/home/hadoop/jobs/scripts/s3/python34/local/lib/python3.4/dist-packages/botocore-1.12.189-py3.4.egg/botocore/client.py", line 661, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

Thanks for your input, but it turned out that a few files in this bucket had some odd permissions I hadn't noticed. After removing those files, the code above works on this bucket and on other buckets as well.
I can confirm this part works:
bucket.download_file(file.key, file.key)
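For completeness, here is a minimal sketch (same bucket and filter as above, with an added try/except that is not in the original code) that skips the objects the current credentials cannot read instead of aborting the whole loop:
import re

import boto3
from botocore.exceptions import ClientError

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')

for obj in bucket.objects.all():
    if obj.size < 1000 and re.search(".*txt$", obj.key):
        try:
            bucket.download_file(obj.key, obj.key)
        except ClientError as err:
            # a 403 Forbidden on HeadObject usually means this particular
            # object is not readable with the current credentials
            print("skipped", obj.key, err)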

Related

Uploading a CSV file to S3 using boto3

I am trying to upload a CSV file to the S3 bucket "ebayunited" using Boto3. I fetch JSON products from dummyjson.com, convert them to CSV, save the file in the same folder, and then upload it with Boto3.
Code
import requests
import pandas as pd
import boto3 as bt
from API_keys import access_key, secret_access_key

client = bt.client(
    's3',
    aws_access_key_id=access_key,
    aws_secret_access_key=secret_access_key
)

r = requests.get("https://dummyjson.com/products")
json_products = pd.DataFrame(r.json()["products"])
file_s3 = json_products.to_csv("Orders.csv", index=False)

bucket = "ebayunited"
path = "Lysi Team/ebayapi/" + str(file_s3)
client.upload_file(str(file_s3), bucket, path)
and I get this error:
Traceback (most recent call last):
File "c:\Users\PC\Desktop\S3 Import\s3_import.py", line 18, in <module>
client.upload_file(str(file_s3), bucket, path)
File "C:\Users\PC\AppData\Local\Programs\Python\Python310\lib\site-packages\boto3\s3\inject.py", line 143, in upload_file
return transfer.upload_file(
File "C:\Users\PC\AppData\Local\Programs\Python\Python310\lib\site-packages\boto3\s3\transfer.py", line 288, in upload_file
future.result()
File "C:\Users\PC\AppData\Local\Programs\Python\Python310\lib\site-packages\s3transfer\futures.py", line 103, in result
return self._coordinator.result()
File "C:\Users\PC\AppData\Local\Programs\Python\Python310\lib\site-packages\s3transfer\futures.py", line 266, in result
raise self._exception
File "C:\Users\PC\AppData\Local\Programs\Python\Python310\lib\site-packages\s3transfer\tasks.py", line 269, in _main
self._submit(transfer_future=transfer_future, **kwargs)
File "C:\Users\PC\AppData\Local\Programs\Python\Python310\lib\site-packages\s3transfer\upload.py", line 585, in _submit
upload_input_manager.provide_transfer_size(transfer_future)
File "C:\Users\PC\AppData\Local\Programs\Python\Python310\lib\site-packages\s3transfer\upload.py", line 244, in provide_transfer_size
self._osutil.get_file_size(transfer_future.meta.call_args.fileobj)
File "C:\Users\PC\AppData\Local\Programs\Python\Python310\lib\site-packages\s3transfer\utils.py", line 247, in get_file_size
return os.path.getsize(filename)
File "C:\Users\PC\AppData\Local\Programs\Python\Python310\lib\genericpath.py", line 50, in getsize
return os.stat(filename).st_size
FileNotFoundError: [WinError 2] The system cannot find the file specified: 'None'
The upload_file(filename, bucket, key) command expects the name of a file to upload from your local disk.
Your program appears to be assuming that the to_csv() function returns the name of the resulting file, but the to_csv() documentation says:
If path_or_buf is None, returns the resulting csv format as a string. Otherwise returns None.
Therefore, you will need to pass the actual name of the file:
key = 'Lysi Team/ebayapi/test.csv' # Change desired S3 Key here
client.upload_file('test.csv', bucket, key)
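Put together, a minimal sketch of the corrected upload (reusing client, bucket, and json_products from the question, and the Orders.csv file name the question already writes) would look like this:
csv_name = "Orders.csv"
json_products.to_csv(csv_name, index=False)   # writes the file to disk and returns None
key = "Lysi Team/ebayapi/" + csv_name         # S3 key to store the object under
client.upload_file(csv_name, bucket, key)     # upload_file(Filename, Bucket, Key)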

pyarrow ds.dataset fails with FileNotFoundError using Azure blob filesystem in azure functions but not locally

I have a couple of functions in Azure Functions that download data into my Data Lake Gen2 storage. They work by downloading CSVs, converting them to pyarrow tables, and then saving them as individual Parquet files in Azure storage, which works fine.
I have another function in the same app that is supposed to consolidate and repartition the data. That function raises a FileNotFoundError when trying to create a dataset.
import os

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
import pyarrow.compute as pc
from adlfs import AzureBlobFileSystem

abfs = AzureBlobFileSystem(connection_string=os.environ['Synblob'])
newds = ds.dataset(f"stdatalake/miso/dailyfiles/{filetype}", filesystem=abfs, partitioning="hive")
When running locally the code works fine, but when running in the cloud I get:
Result: Failure
Exception: FileNotFoundError: stdatalake/miso/dailyfiles/daprices/exante/date=20210926
Stack:
  File "/azure-functions-host/workers/python/3.8/LINUX/X64/azure_functions_worker/dispatcher.py", line 402, in _handle__invocation_request
    call_result = await self._loop.run_in_executor(
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/azure-functions-host/workers/python/3.8/LINUX/X64/azure_functions_worker/dispatcher.py", line 611, in _run_sync_func
    return ExtensionManager.get_sync_invocation_wrapper(context,
  File "/azure-functions-host/workers/python/3.8/LINUX/X64/azure_functions_worker/extension.py", line 215, in _raw_invocation_wrapper
    result = function(**args)
  File "/home/site/wwwroot/consolidate/__init__.py", line 47, in main
    newds=ds.dataset(f"stdatalake/miso/dailyfiles/{filetype}", filesystem=abfs, partitioning="hive")
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pyarrow/dataset.py", line 667, in dataset
    return _filesystem_dataset(source, **kwargs)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pyarrow/dataset.py", line 422, in _filesystem_dataset
    return factory.finish(schema)
  File "pyarrow/_dataset.pyx", line 1680, in pyarrow._dataset.DatasetFactory.finish
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/_fs.pyx", line 1179, in pyarrow._fs._cb_open_input_file
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pyarrow/fs.py", line 394, in open_input_file
    raise FileNotFoundError(path)
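One way to narrow this down is to query the same adlfs filesystem directly from inside the function app and compare what it can see with what works locally. This is only a debugging sketch, not a fix; the path is taken from the error message above.
import os
from adlfs import AzureBlobFileSystem

abfs = AzureBlobFileSystem(connection_string=os.environ['Synblob'])

# prefix taken from the FileNotFoundError above
path = "stdatalake/miso/dailyfiles/daprices"
print(abfs.exists(path))   # does the prefix resolve at all from inside the function app?
print(abfs.ls(path))       # which partition directories are actually visible?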

ValueError: Bucket names must start and end with a number or letter

I am trying to load a TensorFlow model in SavedModel format from my Google Cloud Storage bucket into my Cloud Function. I am following this tutorial: https://cloud.google.com/blog/products/ai-machine-learning/how-to-serve-deep-learning-models-using-tensorflow-2-0-with-cloud-functions
The Cloud Function deploys correctly. However, when I send an HTTP request to it, I get this error:
Traceback (most recent call last):
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker_v2.py", line 402, in run_http_function
    result = _function_handler.invoke_user_function(flask.request)
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker_v2.py", line 268, in invoke_user_function
    return call_user_function(request_or_event)
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker_v2.py", line 261, in call_user_function
    return self._user_function(request_or_event)
  File "/user_code/main.py", line 29, in predict
    download_blob('<terminatorbucket>', 'firstModel/saved_model.pb', '/tmp/model/saved_model.pb')
  File "/user_code/main.py", line 17, in download_blob
    bucket = storage_client.get_bucket(bucket_name)
  File "/env/local/lib/python3.7/site-packages/google/cloud/storage/client.py", line 356, in get_bucket
    bucket = self._bucket_arg_to_bucket(bucket_or_name)
  File "/env/local/lib/python3.7/site-packages/google/cloud/storage/client.py", line 225, in _bucket_arg_to_bucket
    bucket = Bucket(self, name=bucket_or_name)
  File "/env/local/lib/python3.7/site-packages/google/cloud/storage/bucket.py", line 581, in __init__
    name = _validate_name(name)
  File "/env/local/lib/python3.7/site-packages/google/cloud/storage/_helpers.py", line 67, in _validate_name
    raise ValueError("Bucket names must start and end with a number or letter.")
ValueError: Bucket names must start and end with a number or letter.
I am confused because my bucket's name is a string of about 20 letters without any punctuation or special characters.
This is some of the code I am running:
if model is None:
    download_blob('<terminatorbucket>', 'firstModel/saved_model.pb', '/tmp/model/saved_model.pb')
    download_blob('<terminatorbucket>', 'firstModel/assets/tokens.txt', '/tmp/model/assets/tokens.txt')
    download_blob('<terminatorbucket>', 'firstModel/variables/variables.index', '/tmp/model/variables/variables.index')
    download_blob('<terminatorbucket>', 'firstModel/variables/variables.data-00000-of-00001', '/tmp/model/variables/variables.data-00000-of-00001')
    model = tf.keras.models.load_model('/tmp/model')

def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)
The error message is complaining about the angle brackets in your bucket name, which are not numbers or letters. Make sure the bucket name you pass is exactly what you see in the Cloud console.
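In other words, the calls should pass the bare bucket name, for example (reusing download_blob from the question and assuming the bucket really is named terminatorbucket in the Cloud console):
# no angle brackets around the bucket name
download_blob('terminatorbucket', 'firstModel/saved_model.pb', '/tmp/model/saved_model.pb')
download_blob('terminatorbucket', 'firstModel/assets/tokens.txt', '/tmp/model/assets/tokens.txt')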

ReadTimeout error on writer.write(xlist) over an Alibaba ODPS connection. Any suggestions?

from odps import ODPS
from odps import options
import csv
import os
from datetime import timedelta, datetime

options.sql.use_odps2_extension = True
options.tunnel.use_instance_tunnel = True
options.connect_timeout = 60
options.read_timeout = 130
options.retry_times = 7
options.chunk_size = 8192 * 2

odps = ODPS('id', 'secret', 'project', endpoint='endpointUrl')
table = odps.get_table('eventTable')

def uploadFile(file):
    with table.open_writer(partition=None) as writer:
        with open(file, 'rt') as csvfile:
            rows = csv.reader(csvfile, delimiter='~')
            for final in rows:
                writer.write(final)
    # the with-block closes the writer automatically, so an explicit
    # writer.close() is not needed here

uploadFile('xyz.csv')
Assume I pass a number of files to uploadFile one by one from a directory. The goal is to connect to Alibaba Cloud from Python and migrate data into a MaxCompute table. When I run this code, the service stops either after working for a long time or during the night, and it reports a read timeout error at the line writer.write(final).
Error:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/urllib3/response.py", line 226, in _error_catcher
yield
File "/usr/lib/python3/dist-packages/urllib3/response.py", line 301, in read
data = self._fp.read(amt)
File "/usr/lib/python3.5/http/client.py", line 448, in read
n = self.readinto(b)
File "/usr/lib/python3.5/http/client.py", line 488, in readinto
n = self.fp.readinto(b)
File "/usr/lib/python3.5/socket.py", line 575, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/requests/models.py", line 660, in generate
for chunk in self.raw.stream(chunk_size, decode_content=True):
File "/usr/lib/python3/dist-packages/urllib3/response.py", line 344, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "/usr/lib/python3/dist-packages/urllib3/response.py", line 311, in read
flush_decoder = True
File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/lib/python3/dist-packages/urllib3/response.py", line 231, in _error_catcher
raise ReadTimeoutError(self._pool, None, 'Read timed out.')
requests.packages.urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='dt.odps.aliyun.com', port=80): Read timed out.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/dataUploader.py", line 66, in <module>
uploadFile('xyz.csv')
File "/dataUploader.py", line 53, in uploadFile
writer.write(final)
File "/usr/local/lib/python3.5/dist-packages/odps/models/table.py", line 643, in __exit__
self.close()
File "/usr/local/lib/python3.5/dist-packages/odps/models/table.py", line 631, in close
upload_session.commit(written_blocks)
File "/usr/local/lib/python3.5/dist-packages/odps/tunnel/tabletunnel.py", line 308, in commit
in self.get_block_list()])
File "/usr/local/lib/python3.5/dist-packages/odps/tunnel/tabletunnel.py", line 298, in get_block_list
self.reload()
File "/usr/local/lib/python3.5/dist-packages/odps/tunnel/tabletunnel.py", line 238, in reload
resp = self._client.get(url, params=params, headers=headers)
File "/usr/local/lib/python3.5/dist-packages/odps/rest.py", line 138, in get
return self.request(url, 'get', stream=stream, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/odps/rest.py", line 125, in request
proxies=self._proxy)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 608, in send
r.content
File "/usr/lib/python3/dist-packages/requests/models.py", line 737, in content
self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()
File "/usr/lib/python3/dist-packages/requests/models.py", line 667, in generate
raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='dt.odps.aliyun.com', port=80): Read timed out.
packet_write_wait: Connection to 122.121.122.121 port 22: Broken pipe
This is the error I got. Can you suggest what the problem is?
The read timeout is the timeout on waiting to read data: if the server does not send another byte within the configured number of seconds after the last one, a read timeout error is raised. In other words, the transfer could not complete within the specified timeout period.
Here the read timeout was set to 130 seconds, which can be too low if your file is very large.
Increase the timeout limit from 130 seconds to 500 seconds, i.e. change options.read_timeout=130 to options.read_timeout=500. That should resolve the problem. At the same time, reduce the retry count from 7 to 3, i.e. change options.retry_times=7 to options.retry_times=3.
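Applied to the options block from the question, the suggestion looks roughly like this (the values are the ones proposed above, not hard limits):
from odps import options

options.connect_timeout = 60
options.read_timeout = 500   # raised from 130 so large uploads have time to finish
options.retry_times = 3      # lowered from 7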
This error is usually caused by a network issue.
Run curl against your endpoint URL in a terminal. If it returns immediately with something like this:
<?xml version="1.0" encoding="UTF-8"?>
<Error>
<Code>NoSuchObject</Code>
<Message><![CDATA[Unknown http request location: /]]></Message>
<RequestId>5E5CC9526283FEC94F19DAAE</RequestId>
<HostId>localhost</HostId>
</Error>
then the endpoint URL is reachable. If it hangs, check whether you are using the right endpoint URL.
Since MaxCompute (ODPS) has both public and private endpoints, this can be confusing.
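If curl is not handy, the same reachability check can be done from Python with requests; this is only a sketch, and the URL below is a placeholder for whatever endpoint you pass to ODPS(...):
import requests

# replace with the endpoint URL you pass to ODPS(...)
resp = requests.get('http://service.odps.aliyun.com/api', timeout=10)
print(resp.status_code)
print(resp.text)   # any quick response, even an error XML, means the endpoint is reachable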

TypeError: can't pickle memoryview objects when running basic add.delay(1,2) test

Trying to run the most basic test of add.delay(1, 2) using Celery 4.1.0 with Python 3.6.4, I get the following error:
[2018-02-27 13:58:50,194: INFO/MainProcess] Received task: exb.tasks.test_tasks.add[52c3fb33-ce00-4165-ad18-15026eca55e9]
[2018-02-27 13:58:50,194: CRITICAL/MainProcess] Unrecoverable error: SystemError(' returned a result with an error set',)
Traceback (most recent call last):
  File "/opt/myapp/lib/python3.6/site-packages/kombu/messaging.py", line 624, in _receive_callback
    return on_m(message) if on_m else self.receive(decoded, message)
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 570, in on_task_received
    callbacks,
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/strategy.py", line 145, in task_message_handler
    handle(req)
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/worker.py", line 221, in _process_task_sem
    return self._quick_acquire(self._process_task, req)
  File "/opt/myapp/lib/python3.6/site-packages/kombu/async/semaphore.py", line 62, in acquire
    callback(*partial_args, **partial_kwargs)
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/worker.py", line 226, in _process_task
    req.execute_using_pool(self.pool)
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/request.py", line 531, in execute_using_pool
    correlation_id=task_id,
  File "/opt/myapp/lib/python3.6/site-packages/celery/concurrency/base.py", line 155, in apply_async
    **options)
  File "/opt/myapp/lib/python3.6/site-packages/billiard/pool.py", line 1486, in apply_async
    self._quick_put((TASK, (result._job, None, func, args, kwds)))
  File "/opt/myapp/lib/python3.6/site-packages/celery/concurrency/asynpool.py", line 813, in send_job
    body = dumps(tup, protocol=protocol)
TypeError: can't pickle memoryview objects

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/worker.py", line 203, in start
    self.blueprint.start(self)
  File "/opt/myapp/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/opt/myapp/lib/python3.6/site-packages/celery/bootsteps.py", line 370, in start
    return self.obj.start()
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 320, in start
    blueprint.start(self)
  File "/opt/myapp/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 596, in start
    c.loop(*c.loop_args())
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/loops.py", line 88, in asynloop
    next(loop)
  File "/opt/myapp/lib/python3.6/site-packages/kombu/async/hub.py", line 354, in create_loop
    cb(*cbargs)
  File "/opt/myapp/lib/python3.6/site-packages/kombu/transport/base.py", line 236, in on_readable
    reader(loop)
  File "/opt/myapp/lib/python3.6/site-packages/kombu/transport/base.py", line 218, in _read
    drain_events(timeout=0)
  File "/opt/myapp/lib/python3.6/site-packages/librabbitmq-2.0.0-py3.6-linux-x86_64.egg/librabbitmq/__init__.py", line 227, in drain_events
    self._basic_recv(timeout)
SystemError: returned a result with an error set
I cannot find any previous evidence of anyone hitting this error. I noticed on the Celery site that only Python 3.5 is mentioned as supported. Is that the issue, or is this something I am missing?
Any help would be much appreciated!
UPDATE: Tried with Python 3.5.5 and the problem persists. Tried with Django 4.0.2 and the problem persists.
UPDATE: Uninstalled librabbitmq and the problem stopped. This was seen after migration from Python 2.7.5, Django 1.7.7 to Python 3.6.4, Django 2.0.2.
After uninstalling librabbitmq, the problem was resolved.
