Why can't I decode('utf-16') successfully in Python 3 (even though it works in Python 2)? - python-3.x

In Py2:
(chr(145) + chr(78)).decode('utf-16')
I got u'\u4e91'.
But in Py3:
(chr(145) + chr(78)).encode('utf-8').decode('utf-16')
I got an error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x4e in position 2: truncated data
Sometimes the two behave the same way, for example with (chr(93) + chr(78)), but sometimes they don't.
Why? And how can I do this correctly in Py3?

You have to use latin1 if you want to encode any byte transparently:
(chr(145) + chr(78)).encode('latin1').decode('utf-16')
#'云'
chr(145) gets encoded with 2 bytes in utf8 (as with all values above 127):
chr(145).encode('utf8')
# b'\xc2\x91'
while latin1 gives the single byte you wanted:
chr(145).encode('latin1')
# b'\x91'
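For what it's worth, here is a small sketch of what happens at the byte level (using the explicit little-endian codec so the result does not depend on platform byte order): UTF-16-LE reads pairs of bytes as one code unit, so the raw bytes 0x91 0x4E form U+4E91, while the detour through UTF-8 inflates chr(145) into two bytes and breaks the pairing.
# Py3 sketch: build the two raw bytes directly instead of round-tripping through UTF-8
raw = bytes([145, 78])                                          # b'\x91N'
print(raw.decode('utf-16-le'))                                  # '云' (U+4E91)
# UTF-8 turns chr(145) into two bytes, so the UTF-16 pair boundary is lost:
print((chr(145) + chr(78)).encode('utf-8'))                     # b'\xc2\x91N' -- 3 bytes, hence "truncated data"
# chr(93) is ASCII, so UTF-8 leaves it as a single byte and the round trip still works:
print((chr(93) + chr(78)).encode('utf-8').decode('utf-16-le'))  # '九' (U+4E5D)
That also explains why (chr(93) + chr(78)) happens to behave the same in both versions: every code point below 128 survives UTF-8 as a single byte.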

Related

Set locale for stdout in python3

I want my code to stop producing errors that depend on the locale set in the terminal. For example, this code:
import os
print(f"Locale {os.getenv('LC_ALL')}")
foo_bytes = b'\xce\x94, \xd0\x99, \xd7\xa7, \xe2\x80\x8e \xd9\x85, \xe0\xb9\x97, \xe3\x81\x82, \xe5\x8f\xb6, \xe8\x91\x89, and \xeb\xa7\x90.'
print(foo_bytes.decode("utf-8", "replace"))
This will print cleanly as long as my locale is en_US.UTF-8.
However, if I change my locale and run the same script:
export LC_ALL=en_US.iso885915
python3 locale_script.py
It will fail on:
Locale en_US.iso885915
Traceback (most recent call last):
File "locale_script.py", line 10, in <module>
print(foo_bytes.decode("utf-8", "replace"))
File "/usr/lib/python3.6/encodings/iso8859_15.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0394' in position 0: character maps to <undefined>
This could be avoided if I could set the terminal locale from within my script so that it uses "utf-8" as I require. I have tried setlocale, but it still ends up with the same error.
import locale
locale.setlocale(locale.LC_ALL, "C.UTF-8")
Any advice on what to do? I hope I can avoid having to re-encode all my strings:
foo = 'Δ, Й, ק, ‎ م, ๗, あ, 叶, 葉, and 말.'
print(foo.encode().decode(sys.stdout.encoding))
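A minimal sketch of one possible workaround, assuming the script is allowed to rebind sys.stdout: write through an explicit UTF-8 wrapper instead of the locale-derived codec, so print() no longer depends on LC_ALL.
import io
import sys

# rebind stdout to an explicit UTF-8 writer; the locale-selected codec
# (ISO-8859-15 in the failing case) is bypassed entirely
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8", errors="replace")

foo_bytes = b'\xce\x94, \xd0\x99, \xd7\xa7'
print(foo_bytes.decode("utf-8", "replace"))
On Python 3.7+ the same effect can be had with sys.stdout.reconfigure(encoding="utf-8"), or by exporting PYTHONIOENCODING=utf-8 before starting the interpreter. The terminal may still render mojibake if it genuinely uses an ISO-8859-15 charset, but the script will no longer raise.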

UnicodeDecodeError: invalid start byte in METADATA file at path:

I see that several Python package-related files have gibberish at their end.
Because of this, I am unable to perform several pip operations (even basic ones like "pip list").
(Usually I use conda, by the way.)
For example, when I run pip list, I get the following error:
ERROR: Exception:
Traceback (most recent call last):
File "C:\Users\shan_jaffry\Miniconda3\envs\SQL_version\lib\site-packages\pip\_internal\cli\base_command.py", line 173, in _main
status = self.run(options, args)
File "C:\Users\shan_jaffry\Miniconda3\envs\SQL_version\lib\site-packages\pip\_internal\commands\list.py", line 179, in run
self.output_package_listing(packages, options)
File "C:\Users\shan_jaffry\Miniconda3\envs\SQL_version\lib\site-packages\pip\_internal\commands\list.py", line 255, in output_package_listing
data, header = format_for_columns(packages, options)
File "C:\Users\shan_jaffry\Miniconda3\envs\SQL_version\lib\site-packages\pip\_internal\commands\list.py", line 307, in format_for_columns
row = [proj.raw_name, str(proj.version)]
File "C:\Users\shan_jaffry\Miniconda3\envs\SQL_version\lib\site-packages\pip\_internal\metadata\base.py", line 163, in raw_name
return self.metadata.get("Name", self.canonical_name)
File "C:\Users\shan_jaffry\Miniconda3\envs\SQL_version\lib\site-packages\pip\_internal\metadata\pkg_resources.py", line 96, in metadata
return get_metadata(self._dist)
File "C:\Users\shan_jaffry\Miniconda3\envs\SQL_version\lib\site-packages\pip\_internal\utils\packaging.py", line 48, in get_metadata
metadata = dist.get_metadata(metadata_name)
File "C:\Users\shan_jaffry\Miniconda3\envs\SQL_version\lib\site-packages\pip\_vendor\pkg_resources\__init__.py", line 1424, in get_metadata
return value.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfd in position 14097: invalid start byte in METADATA file at path: c:\users\shan_jaffry\miniconda3\envs\sql_version\lib\site-packages\hupper-1.10.2.dist-info\METADATA
I went into the METADATA file and found the following gibberish at the end. I found the same in several other files, i.e. the end of the file has gibberish appended and the actual content removed. Any help?
> 0.1 (2016-10-21)
> ================
> -
> - Initial rele9ýl·øA
I found that by manually going into the site-packages folder, removing the two folders hupper and hupper-1.10.2.dist-info, and then installing hupper again with "pip install hupper", the problem was solved.
The issue was that the hupper package (and hupper-1.10.2.dist-info) were corrupted, hence uninstalling and re-installing helped.
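Before deleting anything, a small sketch like the following (using the path from the traceback) can confirm that the METADATA file really contains bytes that are not valid UTF-8 and show where the corruption starts:
path = r"c:\users\shan_jaffry\miniconda3\envs\sql_version\lib\site-packages\hupper-1.10.2.dist-info\METADATA"
with open(path, "rb") as f:
    data = f.read()
try:
    data.decode("utf-8")
    print("METADATA decodes cleanly")
except UnicodeDecodeError as exc:
    # show a window of raw bytes around the first offending byte
    print(f"Corruption at byte {exc.start}: {data[max(exc.start - 20, 0):exc.start + 20]!r}")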

Mimicking bash wc functionalities using python

I have written a very simple Python programme, called wc.py, which mimics the behaviour of "bash wc" by counting the number of words, lines and bytes in a file. My programme is as follows:
import sys
path = sys.argv[1]
file = open(path)  # open the input file in text mode
w = 0
l = 0
b = 0
for currentLine in file:
    wordsInLine = currentLine.strip().split(' ')
    wordsInLine = [word for word in wordsInLine if word != '']
    w += len(wordsInLine)
    b += len(currentLine.encode('utf-8'))
    l += 1
#output
print(str(l) + ' ' + str(w) + ' ' + str(b))
To execute my programme, run the following command:
python3 wc.py [a file to read the data from]
As a result it shows:
[the number of lines in the file] [the number of words in the file] [the number of bytes in the file] [the file path]
The files I used to test my code are as follows:
file.txt, which contains the following data:
1
2
3
4
Executing "wc file.txt" returns
4 4 8
Executing "python3 wc.py file.txt" returns 4 4 8
Download "Annual enterprise survey: 2020 financial year (provisional) – CSV" from CSV file download
Executing "wc [fileName].csv" returns
37081 500273 5881081
Executing "python3 wc.py [fileName].csv" returns
37081 500273 5844000
and a [something].pdf file
Executing "wc [something].pdf" works.
Executing "python3 code.py" throws the following errors:
Traceback (most recent call last):
File "code.py", line 10, in <module>
for currentLine in file:
File "/usr/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbf in position 10: invalid start byte
As you can see, the output of python3 code.py [something].pdf and python3 code.py [something].csv is not the same as what wc returns. Could you help me find the reason for this erroneous behaviour in my code?
Regarding the CSV file, if you look at the difference between your result and that of wc:
5881081 - 5844000 = 37081 which is exactly the number of lines.
That is, every line has one additional character in the original file. That character is the carriage return \r, which gets lost in Python because you iterate over lines in text mode: universal newline handling translates \r\n into \n, so each line comes out one byte shorter. If you want a byte-correct result, you have to first identify the type of line breaks used in the file (and watch out for inconsistencies throughout the document).
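If a byte-for-byte match with wc is the goal, one option (a sketch, not a fix of the original script) is to count on the raw byte stream; that keeps the \r characters in the byte count and also sidesteps the UnicodeDecodeError on the PDF, because nothing is decoded at all.
import sys

# sketch: count lines, words and bytes on the raw bytes, roughly like wc does
path = sys.argv[1]
lines = words = size = 0
with open(path, "rb") as f:          # binary mode: no decoding, no newline translation
    for chunk in f:                  # still iterates per b'\n', but keeps every byte
        lines += chunk.count(b"\n")
        words += len(chunk.split())  # no-argument split() handles spaces, tabs and \r
        size += len(chunk)
print(lines, words, size)
Line and byte counts will then match wc exactly; word counts can still differ in corner cases, since wc has its own notion of what separates words.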

ClientError: An error occurred (InvalidTextEncoding) when calling the SelectObjectContent operation: UTF-8 encoding is required. reading gzip file

I am getting the above error in my code. I think encoding=latin-1 needs to be included as a parameter somewhere in select_object_content, but since I am new to this, I am not sure where to add it.
Can anyone help me in this?
Code:
client = boto3.client('s3', aws_access_key_id=aws_access_key_id, aws_secret_access_key=aws_secret_access_key, region_name=region_name)
resp = client.select_object_content(
    Bucket='mybucket',
    Key='path_to_file/file_name.gz',
    ExpressionType='SQL',
    Expression=query,
    InputSerialization={'CSV': {"FileHeaderInfo": "Use"}, 'CompressionType': compressionType},
    OutputSerialization={'CSV': {}},
)
Traceback:
ClientError Traceback (most recent call last)
C:\path/3649752754.py in <module>
78 Expression=SQL,
79 InputSerialization = {'CSV': {"FileHeaderInfo": "Use"}, 'CompressionType': compression},
---> 80 OutputSerialization = {'CSV': {}},
81 )
82
ClientError: An error occurred (InvalidTextEncoding) when calling the SelectObjectContent operation: UTF-8 encoding is required. The text encoding error was found near byte 90,112.
You need to save your CSV file with UTF-8 encoding, for example with Notepad++, or in Excel via Save As and choosing the UTF-8 CSV option from the dropdown.
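If the object is produced by a script rather than saved by hand, here is a hedged sketch of transcoding it before upload (file names are placeholders, and the source is assumed to be Latin-1 as the question suggests):
import gzip

# sketch: rewrite a Latin-1 CSV as UTF-8, keeping it gzip-compressed for S3 Select
with gzip.open("file_name.gz", "rt", encoding="latin-1") as src, \
     gzip.open("file_name_utf8.gz", "wt", encoding="utf-8") as dst:
    for line in src:
        dst.write(line)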

Is it possible to delete the file if UnicodeEncodeError occur? [duplicate]

This question already has an answer here:
How to catch all exceptions in Try/Catch Block Python?
(1 answer)
Closed 3 years ago.
My code below goes through each .m4v file in the list and converts it to a .wav file using FFmpeg, and it works. I use a Python 3 Jupyter environment.
for fpath in list:
    if (fpath.endswith(".m4v")):
        cdir=os.path.dirname(fpath)
        os.chdir(cdir)
        filename=os.path.basename(fpath)
        os.system("ffmpeg -i {0} temp_name.wav".format(filename))
        ofnamepath=os.path.splitext(fpath)[0]
        temp_name=os.path.join(cdir, "temp_name.wav")
        new_name = os.path.join(ofnamepath+'.wav')
        os.rename(temp_name,new_name)
        old_name=os.path.join(ofnamepath+'.m4v')
        os.remove(old_name)
However, for this particular dataset I get the following error:
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-10-bd3b17e409fa> in <module>()
      7 os.chdir(cdir)
      8 filename=os.path.basename(fpath)
----> 9 os.system("ffmpeg -i {0} temp_name.wav".format(filename))
     10 ofnamepath=os.path.splitext(fpath)[0]
     11 temp_name=os.path.join(cdir, "temp_name.wav")

UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-16: ordinal not in range(128)
Is it possible to add a check in the code, something like:
if 'UnicodeEncodeError: ascii codec can't encode' occurs,
delete that file and continue to the next file?
You can use a try and except block.
If an exception occurs inside a try block, execution jumps to the except block, and you can even specify which exception to catch.
Adding this to your code would look something like:
for fpath in list:
    if (fpath.endswith(".m4v")):
        cdir=os.path.dirname(fpath)
        os.chdir(cdir)
        filename=os.path.basename(fpath)
        try:
            os.system("ffmpeg -i {0} temp_name.wav".format(filename))
        except UnicodeEncodeError:
            print("Some failure message.. Continuing to next..")
            # os.remove(filename)
            continue  # This skips the rest of the current iteration and jumps to the top of the loop.
        ofnamepath=os.path.splitext(fpath)[0]
        temp_name=os.path.join(cdir, "temp_name.wav")
        new_name = os.path.join(ofnamepath+'.wav')
        os.rename(temp_name,new_name)
        old_name=os.path.join(ofnamepath+'.m4v')
        os.remove(old_name)
Uncomment the # os.remove(filename) to have your files deleted. Are you sure you want to permanently delete them?
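As a separate, hedged sketch (not part of the answer above): calling ffmpeg through subprocess.run with an argument list avoids building a shell command string, so file names containing spaces or quotes are passed safely; whether the UnicodeEncodeError itself disappears still depends on the interpreter's locale, so the try/except remains useful.
import subprocess

filename = "some video.m4v"  # placeholder: inside the loop this would be the current file
try:
    # argument list instead of a shell string; no quoting of the file name is needed
    subprocess.run(["ffmpeg", "-i", filename, "temp_name.wav"], check=True)
except (UnicodeEncodeError, subprocess.CalledProcessError) as exc:
    print(f"Skipping {filename!r}: {exc}")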
