AWS reading utf-8 file pycaption.detect_format returns None - python-3.x

Python version: 3.5 (python:3.5-slim-buster Docker image)
Module: pycaption
When reading a us-ascii encoded caption .srt file from an S3 bucket:
body = obj.get()['Body'].read()
print(pycaption.detect_format(body.decode()))
I get the desired response:
<class 'pycaption.srt.SRTReader'>
But when reading a utf-8 encoded .srt file from S3, pycaption can't detect the format; the response is:
None
I have tried:
body = obj.get()['Body'].read().decode('utf-8')
print(pycaption.detect_format(body))
But with no luck.

In the end the issue was the DOS newlines (CR/LF), which I converted to Unix newlines (LF).
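A minimal sketch of the whole flow, for anyone hitting the same issue (the bucket and key names are placeholders):

import boto3
import pycaption

# Placeholder bucket and key, for illustration only.
s3 = boto3.resource('s3')
obj = s3.Object('my-bucket', 'captions/file.srt')
body = obj.get()['Body'].read().decode('utf-8')

# detect_format returned None until the DOS line endings (CR/LF)
# were normalized to Unix newlines (LF).
body = body.replace('\r\n', '\n')
print(pycaption.detect_format(body))  # <class 'pycaption.srt.SRTReader'>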

Related

ß can not be read from XML-file with UTF-16 Encoding with Python

I have two XML files containing a "ß" ("scharfes S" in German), one starting with:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
and the other with:
<?xml version="1.0" encoding="utf-16" standalone="yes"?>
I used the following code to read the utf-8 file:
with open('file.xml', encoding='utf-8') as file:
    f = file.read()
xml = xmltodict.parse(f)
and this code for the utf-16 file:
with open('file.xml', encoding='utf-16') as file:
    f = file.read()
xml = xmltodict.parse(f)
For the UTF-16 file I get this error: UnicodeError: UTF-16 stream does not start with BOM.
Changing everything to:
import os

with open('file.xml', encoding='utf-16') as file:
    file.seek(1, os.SEEK_SET)
    f = file.read()
xml = xmltodict.parse(f)
where I tried different offsets (e.g. seek(1, ...), seek(2, ...), ...) doesn't help.
Then I checked the encoding with this vim alias:
alias vic="vim -c 'execute \"silent \!echo \" . &fileencoding | q'"
vic file.xml
> latin-1
(Therefore I replaced encoding='utf-16' with encoding='latin-1'.)
But now I get errors about the "ß" (e.g. when trying "utf-16-le"):
"'utf-16-le' codec can't decode bytes in position 12734-12735: illegal encoding"
Does someone know where the problem is here? Or, in general: how can I read XML files in Python with utf-8 or utf-16 encoding without getting BOM errors or errors about the character "ß"?
Thank you in advance!
If I create a UTF-16LE file:
$ echo 'Character is: ß' | iconv -t utf-16le >f.txt
and examine it with a hex dump:
$ xxd f.txt
00000000: 4300 6800 6100 7200 6100 6300 7400 6500 C.h.a.r.a.c.t.e.
00000010: 7200 2000 6900 7300 3a00 2000 df00 0a00 r. .i.s.:. .....
and then read it in Python:
>>> open('f.txt', encoding='utf-16LE').read()
'Character is: ß\n'
then I get the expected results.
Your file is not correctly encoded with the encoding that you're declaring.
can't decode bytes in position 12734-12735: illegal encoding
Create a much smaller sample file, or generate one as suggested above and look for differences.
If you find yourself messing with the file encoding manually when handling XML files, you're doing something wrong.
Fundamental rule: Never read XML files with open() in text mode.
Use an XML parser to load the file. The parser will sort out the encoding for you automatically. That's the whole point of having an XML declaration like <?xml version="1.0" encoding="utf-16"?> at the top of the file.
import xml.etree.ElementTree as ET
tree = ET.parse('file.xml')
If you want to use xmltodict, open the file in binary mode ('rb'):
with open('file.xml', 'rb') as f:
    xml = xmltodict.parse(f)
Here, xmltodict will give the file to an XML parser internally, which again will sort out the encoding for you.
If the above mangles characters or even throws errors, your XML file is broken. Fix the producer of the file. If you've edited the file manually, double check that your text editor's encoding settings match the XML declaration.
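A quick way to convince yourself of this (a sketch; the file name and content are made up): write a UTF-16 file with a matching XML declaration and let the parser detect the encoding on its own.

import xml.etree.ElementTree as ET

# Hypothetical sample: a UTF-16 document with a matching declaration.
# Python's 'utf-16' codec prepends the BOM automatically.
data = '<?xml version="1.0" encoding="utf-16"?><root>ß</root>'.encode('utf-16')
with open('sample.xml', 'wb') as f:
    f.write(data)

tree = ET.parse('sample.xml')  # no encoding argument needed
print(tree.getroot().text)     # ß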

How can I download an image from my local directory?

To be clear, I'm crawling files on my own PC.
I want to get an image from an HTML document on my PC.
I tried this:
n = 0
for i in soup.find_all('div', class_='c_img'):
    with open('FILE DIRECTORY', 'r', encoding='utf-8') as f:
        r = f.read()
    with open(str(n) + '.jpg', 'wb', encoding='utf-8') as f:
        f.write(r)
    n += 1
And I got:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xea in position 5: invalid continuation byte
So I tried encoding='utf-16'
But it threw UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 44-45: illegal encoding
How can I make it work? Thanks.
I believe the issue arises because you're attempting to decode a .jpg as utf-8 text.
You've posted only a small portion of your code, and I'm not sure what the rest does, but you should open the .jpg output file as 'wb' without specifying an encoding.
If your "FILE DIRECTORY" file contains the .jpg, open it with 'rb', again with no encoding.

How to gzip file with unicode encoding using linux cmd prompt?

I have a large TSV file (30 GB). I have to load all of that data into Google BigQuery. So I split the file into smaller chunks, gzipped them all, and moved them to Google Cloud Storage. After that I call the Google BigQuery API to load the data from GCS. But I am facing the following encoding error:
file_data.part_0022.gz: Error detected while parsing row starting at position: 0. Error: Bad character (ASCII 0) encountered. (error code: invalid)
I am using the following Unix commands in my Python code for the split and gzip tasks:
cmd = [
    "split",
    "-l",
    "300000",
    "-d",
    "-a",
    "4",
    "%s%s" % (<my-dir>, file_name),
    "%s/%s.part_" % (<my temp dir>, file_prefix),
]
code = subprocess.check_call(cmd)
cmd = 'gzip %s%s/%s.part*' % (<my temp dir>, file_prefix, file_prefix)
logging.info("Running shell command: %s" % cmd)
code = subprocess.Popen(cmd, shell=True)
code.communicate()
The files are split and gzipped successfully (file_data.part_0001.gz, file_data.part_0002.gz, etc.), but when I try to load these files into BigQuery it throws the above error. I understand it is an encoding issue.
Is there any way to fix the encoding during the split and gzip operations? Or do I need to use a Python file object to read line by line, do the Unicode encoding, and write to a new gzip file (the pythonic way)?
Reason:
Error: Bad character (ASCII 0) encountered
The NUL bytes (ASCII 0) strongly suggest the file is UTF-16 encoded: in UTF-16, every ASCII character is paired with a 0x00 byte, which BigQuery cannot decode as UTF-8.
The BigQuery service only supports the UTF-8 and latin1 text encodings, so the file is supposed to be UTF-8 encoded.
Solution (untested): try the -a or --ascii flag with the gzip command; BigQuery may then be able to decode it.
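As for the pythonic way the asker mentions, a sketch (this assumes the source file really is UTF-16 with a BOM; the file names are placeholders):

import gzip

# Re-encode a UTF-16 TSV to UTF-8 while gzipping it.
with open('file_data.tsv', encoding='utf-16') as src, \
        gzip.open('file_data.utf8.tsv.gz', 'wt', encoding='utf-8') as dst:
    for line in src:
        dst.write(line)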

Python reading from non ascii file

I have a text file which contains the following character:
ÿ
When I try and read the file in I've tried both:
with open (file, "r") as myfile:
AND
with codecs.open(file, encoding='utf-8') as myfile:
with success. However when I try to read the file in as a string using:
file_string = myfile.read()
or
file_string = myfile.readline()
I keep getting this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 11889: invalid start byte
Ideally I want it to ignore the character or substitute it with '' or whitespace.
I've come up with a solution: just use Python 2 instead of Python 3. I still can't seem to get it to work in Python 3, though.
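For what it's worth, Python 3 can do the ignore/substitute behavior the question asks for via the errors argument to open(). A sketch (the byte 0xff is "ÿ" in latin-1, so encoding='latin-1' may also decode the file losslessly):

# Replace undecodable bytes with U+FFFD instead of raising:
with open(file, encoding='utf-8', errors='replace') as myfile:
    file_string = myfile.read()

# Or drop them entirely:
with open(file, encoding='utf-8', errors='ignore') as myfile:
    file_string = myfile.read()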

Character encoding problems

I attempted to convert a file I wrote in Vim to UTF-8. Vim defaulted the encoding to us-ascii. I ran this command: recode UTF-8 [filename]. It reported no errors, but when I run file -i [filename] it still says the encoding is ASCII. Is this a known error or an expected result? Thanks in advance :-)
If your file contains only ASCII characters, there is no difference in the final file between the ASCII and UTF-8 encodings, because for ASCII characters the UTF-8 encoding is byte-for-byte identical to ASCII.
But if your file contains some non-ASCII character, you will see the difference.
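A quick Python check of that claim:

# ASCII-only text encodes to identical bytes under both codecs,
# so tools like `file` have nothing to distinguish them by.
assert 'hello'.encode('ascii') == 'hello'.encode('utf-8')

# Only non-ASCII characters produce different bytes:
print('ß'.encode('utf-8'))    # b'\xc3\x9f' (two bytes)
print('ß'.encode('latin-1'))  # b'\xdf' (one byte)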
Your "fileencodings" setting for vim may use "ascii" before "utf8", that's the list that vim try to detect the file encodings. So if the file can be read as "ascii", the later utf8 will not be tried anymore, although utf8 is also correct.
