Writing CSV file into dataframe from FTPS server with python

I am trying to get a CSV file from an FTPS server. This is the code I am using:
file = r'filename.csv'
with ftplib.FTP() as ftp:
    with open(file, 'rb') as f:
        ftp.retrbinary(file, f.read)
df1= pd.read_csv(file)
df1.head()
with this particular error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-10-a2725f958d45> in <module>
4
5 with open(file, 'rb') as f:
----> 6 ftp.retrbinary(file, f.read)
7 df1= pd.read_csv(file) #delimiter = '|', encoding = 'latin1')
8 df1.head()
~\AppData\Local\Continuum\anaconda3\lib\ftplib.py in retrbinary(self, cmd, callback, blocksize, rest)
439 The response code.
440 """
--> 441 self.voidcmd('TYPE I')
442 with self.transfercmd(cmd, rest) as conn:
443 while 1:
~\AppData\Local\Continuum\anaconda3\lib\ftplib.py in voidcmd(self, cmd)
275 def voidcmd(self, cmd):
276 """Send a command and expect a response beginning with '2'."""
--> 277 self.putcmd(cmd)
278 return self.voidresp()
279
~\AppData\Local\Continuum\anaconda3\lib\ftplib.py in putcmd(self, line)
197 def putcmd(self, line):
198 if self.debugging: print('*cmd*', self.sanitize(line))
--> 199 self.putline(line)
200
201 # Internal: return one line from the server, stripping CRLF.
~\AppData\Local\Continuum\anaconda3\lib\ftplib.py in putline(self, line)
192 if self.debugging > 1:
193 print('*put*', self.sanitize(line))
--> 194 self.sock.sendall(line.encode(self.encoding))
195
196 # Internal: send one command to the server (through putline())
AttributeError: 'NoneType' object has no attribute 'sendall'
Any ideas as to why this isn't putting the requested file into a dataframe?

The documentation says that the cmd argument of the retrbinary method should be an appropriate RETR command, i.e. 'RETR filename', and that the callback function is called for each block of data received.
If you need to fetch the data, write it to a file, and then read that file, open the local file in 'wb' mode and try: ftp.retrbinary(f'RETR {file}', f.write)

Method name: retrbinary
Signature: retrbinary(cmd, callback, blocksize=8192, rest=None)
callback: The callback function is called for each block of data received from the FTP server. It can be used to process the received data, for example by writing each block to a file.
For example, you can use this:
fhandle = open(filename, 'wb')
ftp.retrbinary('RETR ' + filename, fhandle.write)
or
ftp.retrbinary('RETR %s' % FILE, open(FILE, 'wb').write)
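The two answers above can be combined into a minimal sketch. The traceback in the question ends in 'NoneType' object has no attribute 'sendall' because ftplib.FTP() was created without a host and never connected, so no socket exists yet. The sketch below assumes a server that speaks explicit FTPS (which is what ftplib.FTP_TLS supports); host, user, password and the helper names download_to_buffer and read_csv_from_ftps are placeholders, not part of the original post. It downloads into memory instead of a local file, which is a design choice rather than a requirement.

```python
import io
from ftplib import FTP_TLS


def download_to_buffer(ftp, filename):
    """Fetch `filename` over an already connected FTP session into memory."""
    buffer = io.BytesIO()
    # retrbinary expects a full RETR command and calls the callback per block.
    ftp.retrbinary(f"RETR {filename}", buffer.write)
    buffer.seek(0)  # rewind so the next reader starts at the first byte
    return buffer


def read_csv_from_ftps(host, user, password, filename):
    """Hypothetical helper: download a CSV over FTPS and parse it with pandas."""
    import pandas as pd  # imported here so download_to_buffer works without pandas

    with FTP_TLS(host) as ftp:  # passing a host connects immediately
        ftp.login(user, password)
        ftp.prot_p()  # encrypt the data channel as well as the control channel
        return pd.read_csv(download_to_buffer(ftp, filename))
```

Usage would look like df1 = read_csv_from_ftps('ftp.example.com', 'user', 'secret', 'filename.csv'), where all four arguments are made-up values.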

Related

Unable to use equ.traineddata for tesseract (throws error) but hin, ben, eng work well

I have Tesseract installed at /usr/share/tesseract-ocr/ and it works fine with the tessdata directory at /usr/share/tesseract-ocr/4.0/tessdata. Because equ.traineddata is not shipped with the original data, I downloaded it from the official documentation and placed it at /usr/share/tesseract-ocr/4.0/tessdata/equ.traineddata. Along with it, I added hin, ben and a few more files. When I use -l eng+hin+ben it works fine, but with equ it throws an error. I also used pytesseract with a few configs, such as:
# making a copy of tessdata dir in the home
cli_config = '--oem 1 --psm 12 --tessdata-dir ~/tessdata/ -l eng+equ+ben+hin'
ocr.image_to_string(image=img_path,config=cli_config)
and also
cli_config = '--oem 1 --psm 12' # tessdata is at default location too
ocr.image_to_string(image=img_path,config=cli_config,lang='eng+equ+hin+ben')
but it keeps throwing an error ONLY for equ, like:
TesseractError Traceback (most recent call last)
<ipython-input-30-8529ae8e51e8> in <module>
----> 1 ocr.image_to_string(image=img_path,config=cli_config,lang='equ')
~/anaconda3/envs/py36/lib/python3.6/site-packages/pytesseract/pytesseract.py in image_to_string(image, lang, config, nice, output_type, timeout)
356 Output.DICT: lambda: {'text': run_and_get_output(*args)},
357 Output.STRING: lambda: run_and_get_output(*args),
--> 358 }[output_type]()
359
360
~/anaconda3/envs/py36/lib/python3.6/site-packages/pytesseract/pytesseract.py in <lambda>()
355 Output.BYTES: lambda: run_and_get_output(*(args + [True])),
356 Output.DICT: lambda: {'text': run_and_get_output(*args)},
--> 357 Output.STRING: lambda: run_and_get_output(*args),
358 }[output_type]()
359
~/anaconda3/envs/py36/lib/python3.6/site-packages/pytesseract/pytesseract.py in run_and_get_output(image, extension, lang, config, nice, timeout, return_bytes)
264 }
265
--> 266 run_tesseract(**kwargs)
267 filename = kwargs['output_filename_base'] + extsep + extension
268 with open(filename, 'rb') as output_file:
~/anaconda3/envs/py36/lib/python3.6/site-packages/pytesseract/pytesseract.py in run_tesseract(input_filename, output_filename_base, extension, lang, config, nice, timeout)
240 with timeout_manager(proc, timeout) as error_string:
241 if proc.returncode:
--> 242 raise TesseractError(proc.returncode, get_errors(error_string))
243
244
TesseractError: (1, 'Error opening data file /home/deshwal/anaconda3/envs/py36/share/tessdata/equ.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'equ\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')
What could be the reason for this? How can I use equ.traineddata?
equ is legacy language data, so you need to use an appropriate --oem value: --oem 1 selects the LSTM engine only, which cannot load legacy-only data. Run tesseract --help-extra to see the available OCR engine modes.
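As a rough illustration of the answer above, the sketch below builds a pytesseract config string that selects the legacy engine (--oem 0) for equ. It assumes the installed equ.traineddata actually contains a legacy model; build_tesseract_config and ocr_equations are hypothetical helper names. The error message in the question also names the directory that was searched, so it may be worth checking that equ.traineddata is present there or passing --tessdata-dir explicitly. The path is expanded in Python because, as far as I can tell, pytesseract does not run the command through a shell that would expand "~".

```python
import os
import shlex


def build_tesseract_config(oem, psm, tessdata_dir=None):
    """Assemble a Tesseract CLI config string for pytesseract."""
    parts = ["--oem", str(oem), "--psm", str(psm)]
    if tessdata_dir is not None:
        # Expand "~" here; it is assumed that no shell will do it later.
        parts += ["--tessdata-dir", os.path.expanduser(tessdata_dir)]
    return " ".join(shlex.quote(part) for part in parts)


def ocr_equations(image_path, tessdata_dir=None):
    """Hypothetical helper: run OCR with the legacy engine and the equ data."""
    import pytesseract  # imported lazily; needs Tesseract installed to run

    config = build_tesseract_config(oem=0, psm=12, tessdata_dir=tessdata_dir)
    return pytesseract.image_to_string(image_path, lang="equ", config=config)
```

Mixing equ with LSTM-only languages in one call may need a different --oem value; tesseract --help-extra lists the modes.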

Tweet-Cleaning by removing b' and ASCII - can't find the problem

I am currently preprocessing tweets that were extracted via the Twitter API and saved as CSV. The CSV contains stray characters such as "b'" at the beginning of each tweet and escape sequences like aren\xe2\x80\x99t, where \xe2\x80\x99 stands for "'". I want to remove these characters but don't know how, although I have tried a couple of times. Can anyone help me? I read the file with pandas and Python 3. The column is called "text".
What I mean is the following:
b'RT #username: some text some text C\xe2\x80\xa6' OR
"b'RT #username: some text some text .A\xe2\x80\xa6'
Input 1:
df = pd.read_csv('Data/test.csv', encoding= 'utf8')
df['text'] = df['text'].str.replace('b[\s]+', ' ')
df['text'] = df['text'].str.replace('[^\x00-\x7F]+',' ')
df['text'] = df['text'].str.replace('[^\u0000-\uD7FF\uE000-\uFFFF]',' ')
Output 1: Nothing happens.
With the next snippet I tried to apply UTF-8 encoding. If I am right, this sometimes needs to be done for further processing.
Input 2:
df = pd.read_csv('Data/Result_w8_Pfizer_en_test.csv', encoding= 'utf8')
df.apply(lambda x: pd.lib.infer_dtype(x.values))
Output 2:
AttributeError Traceback (most recent call last)
<ipython-input-50-4c6bdb11d736> in <module>
25
26 df = pd.read_csv('Data/test.csv', encoding= 'utf8') # dtype=string
---> 27 df.apply(lambda x: pd.lib.infer_dtype(x.values))
28
29
~/conda/lib/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
6485 args=args,
6486 kwds=kwds)
-> 6487 return op.get_result()
6488
6489 def applymap(self, func):
~/conda/lib/python3.6/site-packages/pandas/core/apply.py in get_result(self)
149 return self.apply_raw()
150
--> 151 return self.apply_standard()
152
153 def apply_empty_result(self):
~/conda/lib/python3.6/site-packages/pandas/core/apply.py in apply_standard(self)
255
256 # compute the result using the series generator
--> 257 self.apply_series_generator()
258
259 # wrap results
~/conda/lib/python3.6/site-packages/pandas/core/apply.py in apply_series_generator(self)
284 try:
285 for i, v in enumerate(series_gen):
--> 286 results[i] = self.f(v)
287 keys.append(v.name)
288 except Exception as e:
<ipython-input-50-4c6bdb11d736> in <lambda>(x)
25
26 df = pd.read_csv('Data/test.csv', encoding= 'utf8')
---> 27 df.apply(lambda x: pd.lib.infer_dtype(x.values))
28
29
AttributeError: ("module 'pandas' has no attribute 'lib'", 'occurred at index date')
Here I did some research but couldn't find out the issue or how to solve it.
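One possible approach, sketched under the assumption that each cell holds the printed form of a Python bytes object (for example because str() was called on the tweet bytes before saving): parse the literal back into bytes with ast.literal_eval and decode it as UTF-8. The first regex in Input 1 looks for "b" followed by whitespace rather than "b'", and the \xe2 sequences in the file are literal backslash characters rather than non-ASCII characters, which would explain why nothing happens. decode_bytes_repr is a hypothetical helper name.

```python
import ast


def decode_bytes_repr(text):
    """Turn a string such as "b'caf\\xc3\\xa9'" back into readable text."""
    if isinstance(text, str) and text[:2] in ("b'", 'b"'):
        try:
            value = ast.literal_eval(text)  # safely parses the bytes literal
        except (ValueError, SyntaxError):
            return text  # not a well-formed literal, leave it untouched
        if isinstance(value, bytes):
            # Undecodable bytes become U+FFFD instead of raising.
            return value.decode("utf-8", errors="replace")
    return text
```

With pandas this could be applied as df['text'] = df['text'].apply(decode_bytes_repr). For the dtype check in Input 2, pandas.api.types.infer_dtype is the public location of the function that used to be reachable as pd.lib.infer_dtype.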

Python3 code Uploading to S3 bucket with IO instead of String IO

I am trying to download the zip file in memory, expand it and upload it to S3.
import boto3
import io
import zipfile
import mimetypes
s3 = boto3.resource('s3')
service_zip = io.BytesIO()
service_bucket = s3.Bucket('services.mydomain.com')
build_bucket = s3.Bucket('servicesbuild.mydomain.com')
build_bucket.download_fileobj('servicesbuild.zip', service_zip)
with zipfile.ZipFile(service_zip) as myzip:
    for nm in myzip.namelist():
        obj = myzip.open(nm)
        print(obj)
        service_bucket.upload_fileobj(obj,nm,
            ExtraArgs={'ContentType': mimetypes.guess_type(nm)[0]})
        service_bucket.Object(nm).Acl().put(ACL='public-read')
Here is the error I get
<zipfile.ZipExtFile name='favicon.ico' mode='r' compress_type=deflate>
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-7-5941e5e45adc> in <module>
18 print(obj)
19 service_bucket.upload_fileobj(obj,nm,
---> 20 ExtraArgs={'ContentType': mimetypes.guess_type(nm)[0]})
21 service_bucket.Object(nm).Acl().put(ACL='public-read')
~/bitbucket/clguru/env/lib/python3.7/site-packages/boto3/s3/inject.py in bucket_upload_fileobj(self, Fileobj, Key, ExtraArgs, Callback, Config)
579 return self.meta.client.upload_fileobj(
580 Fileobj=Fileobj, Bucket=self.name, Key=Key, ExtraArgs=ExtraArgs,
--> 581 Callback=Callback, Config=Config)
582
583
~/bitbucket/clguru/env/lib/python3.7/site-packages/boto3/s3/inject.py in upload_fileobj(self, Fileobj, Bucket, Key, ExtraArgs, Callback, Config)
537 fileobj=Fileobj, bucket=Bucket, key=Key,
538 extra_args=ExtraArgs, subscribers=subscribers)
--> 539 return future.result()
540
541
~/bitbucket/clguru/env/lib/python3.7/site-packages/s3transfer/futures.py in result(self)
71 # however if a KeyboardInterrupt is raised we want want to exit
72 # out of this and propogate the exception.
---> 73 return self._coordinator.result()
74 except KeyboardInterrupt as e:
75 self.cancel()
~/bitbucket/clguru/env/lib/python3.7/site-packages/s3transfer/futures.py in result(self)
231 # final result.
232 if self._exception:
--> 233 raise self._exception
234 return self._result
235
~/bitbucket/clguru/env/lib/python3.7/site-packages/s3transfer/tasks.py in _main(self, transfer_future, **kwargs)
253 # Call the submit method to start submitting tasks to execute the
254 # transfer.
--> 255 self._submit(transfer_future=transfer_future, **kwargs)
256 except BaseException as e:
257 # If there was an exception raised during the submission of task
~/bitbucket/clguru/env/lib/python3.7/site-packages/s3transfer/upload.py in _submit(self, client, config, osutil, request_executor, transfer_future, bandwidth_limiter)
547 # Determine the size if it was not provided
548 if transfer_future.meta.size is None:
--> 549 upload_input_manager.provide_transfer_size(transfer_future)
550
551 # Do a multipart upload if needed, otherwise do a regular put object.
~/bitbucket/clguru/env/lib/python3.7/site-packages/s3transfer/upload.py in provide_transfer_size(self, transfer_future)
324 fileobj.seek(0, 2)
325 end_position = fileobj.tell()
--> 326 fileobj.seek(start_position)
327 transfer_future.meta.provide_transfer_size(
328 end_position - start_position)
/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/zipfile.py in seek(self, offset, whence)
1023 # Position is before the current position. Reset the ZipExtFile
1024
-> 1025 self._fileobj.seek(self._orig_compress_start)
1026 self._running_crc = self._orig_start_crc
1027 self._compress_left = self._orig_compress_size
/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/zipfile.py in seek(self, offset, whence)
702 def seek(self, offset, whence=0):
703 with self._lock:
--> 704 if self.writing():
705 raise ValueError("Can't reposition in the ZIP file while "
706 "there is an open writing handle on it. "
AttributeError: '_SharedFile' object has no attribute 'writing'
If I comment out the lines after print(obj) to validate the zip file content,
import boto3
import io
import zipfile
import mimetypes
s3 = boto3.resource('s3')
service_zip = io.BytesIO()
service_bucket = s3.Bucket('services.readspeech.com')
build_bucket = s3.Bucket('servicesbuild.readspeech.com')
build_bucket.download_fileobj('servicesbuild.zip', service_zip)
with zipfile.ZipFile(service_zip) as myzip:
    for nm in myzip.namelist():
        obj = myzip.open(nm)
        print(obj)
        # service_bucket.upload_fileobj(obj,nm,
        #     ExtraArgs={'ContentType': mimetypes.guess_type(nm)[0]})
        # service_bucket.Object(nm).Acl().put(ACL='public-read')
I see the following:
<zipfile.ZipExtFile name='favicon.ico' mode='r' compress_type=deflate>
<zipfile.ZipExtFile name='styles/main.css' mode='r' compress_type=deflate>
<zipfile.ZipExtFile name='images/example3.png' mode='r' compress_type=deflate>
<zipfile.ZipExtFile name='images/example1.png' mode='r' compress_type=deflate>
<zipfile.ZipExtFile name='images/example2.png' mode='r' compress_type=deflate>
<zipfile.ZipExtFile name='index.html' mode='r' compress_type=deflate>
The issue appears to be with Python 3.7. I downgraded to Python 3.6 and everything works. There is a bug reported against Python 3.7:
The misprint in the file lib/zipfile.py in line 704 leads to AttributeError: '_SharedFile' object has no attribute 'writing'
"self.writing()" should be replaced by "self._writing()". I also think this code should be covered by tests.
So to resolve the issue, use Python 3.6.
On macOS you can go back to Python 3.6 with the following command:
brew switch python 3.6.4_4
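If downgrading is not an option, one workaround that should sidestep the failing seek call is to read each archive member fully into its own io.BytesIO before uploading, so the upload seeks on a plain in-memory buffer rather than on the ZipExtFile. This is only a sketch: iter_zip_members and upload_zip_members are hypothetical helper names, bucket is assumed to be a boto3 Bucket resource, and the fallback content type for unknown extensions is an addition that is not in the original code.

```python
import io
import mimetypes
import zipfile


def iter_zip_members(zip_buffer):
    """Yield (name, BytesIO, content_type) for each file in an in-memory zip."""
    with zipfile.ZipFile(zip_buffer) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            # guess_type returns None for unknown extensions, so fall back.
            content_type = mimetypes.guess_type(name)[0] or "application/octet-stream"
            # archive.read() decompresses the member into bytes; BytesIO is seekable.
            yield name, io.BytesIO(archive.read(name)), content_type


def upload_zip_members(zip_buffer, bucket):
    """Upload every member of the zip to `bucket`, keyed by its archive name."""
    for name, body, content_type in iter_zip_members(zip_buffer):
        bucket.upload_fileobj(body, name, ExtraArgs={"ContentType": content_type})
```

The trade-off is memory: each member is held uncompressed in RAM while it is uploaded, which is fine for small site assets but may not suit large archives.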

Connect to S3 accelerate endpoint with boto3

I want to download a file into a Python file object from an S3 bucket that has transfer acceleration activated. I came across a few resources suggesting either overriding the endpoint_url with "s3-accelerate.amazonaws.com" or using the use_accelerate_endpoint option, or both.
I have tried both, and several variations, but the same error was returned every time. One of the scripts I tried is:
from botocore.config import Config
import boto3
from io import BytesIO
session = boto3.session.Session()
s3 = session.client(
    service_name='s3',
    aws_access_key_id=<MY_KEY_ID>,
    aws_secret_access_key=<MY_KEY>,
    region_name="us-west-2",
    config=Config(s3={"use_accelerate_endpoint": True,
                      "addressing_style": "path"}))
input = BytesIO()
s3.download_fileobj(<MY_BUCKET>,<MY_KEY>, input)
Returns the following error:
---------------------------------------------------------------------------
ClientError Traceback (most recent call last)
<ipython-input-61-92b89b45f215> in <module>()
11 "addressing_style": "path"}))
12 input = BytesIO()
---> 13 s3.download_fileobj(bucket, filename, input)
14
15
~/Project/venv/lib/python3.5/site-packages/boto3/s3/inject.py in download_fileobj(self, Bucket, Key, Fileobj, ExtraArgs, Callback, Config)
568 bucket=Bucket, key=Key, fileobj=Fileobj,
569 extra_args=ExtraArgs, subscribers=subscribers)
--> 570 return future.result()
571
572
~/Project//venv/lib/python3.5/site-packages/s3transfer/futures.py in result(self)
71 # however if a KeyboardInterrupt is raised we want want to exit
72 # out of this and propogate the exception.
---> 73 return self._coordinator.result()
74 except KeyboardInterrupt as e:
75 self.cancel()
~/Project/venv/lib/python3.5/site-packages/s3transfer/futures.py in result(self)
231 # final result.
232 if self._exception:
--> 233 raise self._exception
234 return self._result
235
~/Project/venv/lib/python3.5/site-packages/s3transfer/tasks.py in _main(self, transfer_future, **kwargs)
253 # Call the submit method to start submitting tasks to execute the
254 # transfer.
--> 255 self._submit(transfer_future=transfer_future, **kwargs)
256 except BaseException as e:
257 # If there was an exception raised during the submission of task
~/Project/venv/lib/python3.5/site-packages/s3transfer/download.py in _submit(self, client, config, osutil, request_executor, io_executor, transfer_future)
347 Bucket=transfer_future.meta.call_args.bucket,
348 Key=transfer_future.meta.call_args.key,
--> 349 **transfer_future.meta.call_args.extra_args
350 )
351 transfer_future.meta.provide_transfer_size(
~/Project/venv/lib/python3.5/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
310 "%s() only accepts keyword arguments." % py_operation_name)
311 # The "self" in this scope is referring to the BaseClient.
--> 312 return self._make_api_call(operation_name, kwargs)
313
314 _api_call.__name__ = str(py_operation_name)
~/Project/venv/lib/python3.5/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
603 error_code = parsed_response.get("Error", {}).get("Code")
604 error_class = self.exceptions.from_code(error_code)
--> 605 raise error_class(parsed_response, operation_name)
606 else:
607 return parsed_response
ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
When I run the same script with "use_accelerate_endpoint": False it works fine.
However, it returned the same error when:
I override the endpoint_url with "s3-accelerate.amazonaws.com"
I define "addressing_style": "virtual"
When running
s3.get_bucket_accelerate_configuration(Bucket=<MY_BUCKET>)
I get {..., 'Status': 'Enabled'} as expected.
Any idea what is wrong with that code and what I should change to properly query the accelerate endpoint of that bucket?
Using python3.5 with boto3==1.4.7, botocore==1.7.43 on Ubuntu 17.04.
EDIT:
I have also tried a similar script for uploads:
from botocore.config import Config
import boto3
from io import BytesIO
session = boto3.session.Session()
s3 = session.client(
    service_name='s3',
    aws_access_key_id=<MY_KEY_ID>,
    aws_secret_access_key=<MY_KEY>,
    region_name="us-west-2",
    config=Config(s3={"use_accelerate_endpoint": True,
                      "addressing_style": "virtual"}))
output = BytesIO()
output.seek(0)
s3.upload_fileobj(output, <MY_BUCKET>,<MY_KEY>)
Which works without the use_accelerate_endpoint option (so my keys are fine), but returns this error when True:
ClientError: An error occurred (SignatureDoesNotMatch) when calling the PutObject operation: The request signature we calculated does not match the signature you provided. Check your key and signing method.
I have tried both addressing_style options here as well (virtual and path)
Using boto3==1.4.7 and botocore==1.7.43.
Here is one way to retrieve an object from a bucket with transfer acceleration enabled.
import boto3
from botocore.config import Config
from io import BytesIO
config = Config(s3={"use_accelerate_endpoint": True})
s3_resource = boto3.resource("s3",
                             aws_access_key_id=<MY_KEY_ID>,
                             aws_secret_access_key=<MY_KEY>,
                             region_name="us-west-2",
                             config=config)
s3_client = s3_resource.meta.client
file_object = BytesIO()
s3_client.download_fileobj(<MY_BUCKET>, <MY_KEY>, file_object)
Note that the client sends a HEAD request to the accelerated endpoint before the GET.
Its canonical request looks roughly like the following:
CanonicalRequest:
HEAD
/<MY_KEY>
host:<MY_BUCKET>.s3-accelerate.amazonaws.com
x-amz-content-sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
x-amz-date:20200520T204128Z
host;x-amz-content-sha256;x-amz-date
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
Some reasons why the HEAD request can fail include:
Object with given key doesn't exist or has strict access control enabled
Invalid credentials
Transfer acceleration isn't enabled
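Two of the checks above can be scripted before switching a client to the accelerate endpoint. The sketch below is only illustrative: accelerate_host and acceleration_enabled are hypothetical helper names, and s3_client is assumed to be a regular (non-accelerated) boto3 S3 client. The first helper reproduces the host value seen in the canonical request and rejects bucket names containing periods, which to my knowledge are not supported by transfer acceleration. The second wraps the get_bucket_accelerate_configuration call already used in the question.

```python
def accelerate_host(bucket):
    """Return the virtual-hosted accelerate host for `bucket`."""
    if "." in bucket:
        # Assumption: accelerate endpoints do not accept names with periods.
        raise ValueError(f"bucket name {bucket!r} contains a period")
    return f"{bucket}.s3-accelerate.amazonaws.com"


def acceleration_enabled(s3_client, bucket):
    """Check whether transfer acceleration is switched on for `bucket`."""
    response = s3_client.get_bucket_accelerate_configuration(Bucket=bucket)
    # 'Status' may be absent when acceleration was never configured.
    return response.get("Status") == "Enabled"
```

If both checks pass and the HEAD request still fails, the remaining candidates from the list are the object key and the credentials.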

Can't convert 'bytes' object to str implicitly

In [458]: type(obj_xml)
Out[458]: builtins.bytes
In [459]: with codecs.open( xmlOutFile, "+ab", "utf-8" ) as f:
.....:     f.write(obj_xml)
.....:
The error I am hitting:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-459-61a3d9d572a6> in <module>()
1 with codecs.open( xmlOutFile, "+ab", "utf-8" ) as f:
----> 2 f.write(obj_xml)
3
C:\Python3\lib\codecs.py in write(self, data)
698 def write(self, data):
699
--> 700 return self.writer.write(data)
701
702 def writelines(self, list):
C:\Python3\lib\codecs.py in write(self, object)
354 """ Writes the object's contents encoded to self.stream.
355 """
--> 356 data, consumed = self.encode(object, self.errors)
357 self.stream.write(data)
358
TypeError: Can't convert 'bytes' object to str implicitly
How do I go about writing the contents of obj_xml to the file?
codecs.open takes a Unicode string and encodes it to bytes when writing. You already have a bytes object, so just open the file in binary mode and write the object:
with open(xmlOutFile,'+ab') as f:
    f.write(obj_xml)
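For comparison, here is a small sketch of the two ways to get a bytes payload into a file: append it unchanged in binary mode, as the answer suggests, or decode it first and let a text-mode file handle the encoding. The helper names append_bytes and append_text are made up for this example, and the second variant assumes the payload really is UTF-8.

```python
def append_bytes(path, payload):
    """Append a bytes object to a file without any re-encoding."""
    with open(path, "ab") as f:
        f.write(payload)


def append_text(path, payload, encoding="utf-8"):
    """Decode the bytes first, then append through a text-mode file."""
    # newline="" keeps line endings exactly as they are in the payload.
    with open(path, "a", encoding=encoding, newline="") as f:
        f.write(payload.decode(encoding))
```

The binary variant is usually the safer choice when the bytes came from a serializer that already picked the encoding, since nothing is decoded and encoded a second time.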

Resources