Erratic encoding of byte to string on Python 3 on Ubuntu - python-3.5

I'm new to Python and am working on a sensor.
I'm building my code line by line and I'm having trouble with the bytes-to-string decoding part. With the same code, sometimes it works and sometimes it doesn't.
Here is the code:
import serial
import time
import os
port = serial.Serial("/dev/ttyUSB0", baudrate=9600, timeout=1, bytesize=8)
f_w = open('/home/myname/python_serial_output.txt','r+')
port.send_break()
while True:
    op = port.read(2)
    op_str = op.decode('utf-8')
    f_w.write(op_str)
    print(op_str)
It didn't work the first time round, but it worked the second time. Why?
Here is the error I get:
myname@Toshiba:~$ python3 serial_test.py
Traceback (most recent call last):
File "serial_test.py", line 13, in <module>
op_str = op.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 0: invalid start byte
myname@Toshiba:~$ python3 serial_test.py
Ex
pl
or
er
How do I get rid of this ambiguity so that it decodes successfully every time?

This may have happened because the bytes read from the port contained a non-ASCII byte (0x82 here) that is not valid UTF-8. When you ran your code again, the data happened to be plain ASCII, so decoding succeeded.
You can make the decode step tolerant by passing an error handler to decode(), or by decoding with whatever encoding the sensor actually uses.
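A minimal sketch of a more tolerant loop, assuming the same port settings as in the question (the error handler, the Latin-1 suggestion, and the append-mode output file are assumptions, not the asker's confirmed fix):
import serial

port = serial.Serial("/dev/ttyUSB0", baudrate=9600, timeout=1, bytesize=8)
with open('/home/myname/python_serial_output.txt', 'a') as f_w:
    while True:
        op = port.read(2)
        # errors='replace' turns bytes that are not valid UTF-8 into U+FFFD
        # instead of raising UnicodeDecodeError; if the sensor really sends an
        # 8-bit encoding, op.decode('latin-1') may be the better choice.
        op_str = op.decode('utf-8', errors='replace')
        f_w.write(op_str)
        print(op_str)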

Related

How to get python to tolerate UTF-8 encoding errors

I have a set of UTF-8 texts I have scraped from web pages. I am trying to extract keywords from these files like so:
import os
import json
from rake_nltk import Rake
rake_nltk_var = Rake()
directory = 'files'
results = {}
for filename in os.scandir(directory):
    if filename.is_file():
        with open("files/" + filename.name, encoding="utf-8", mode='r') as infile:
            text = infile.read()
        rake_nltk_var.extract_keywords_from_text(text)
        keyword_extracted = rake_nltk_var.get_ranked_phrases()
        results[filename.name] = keyword_extracted
with open("extracted-keywords.json", "w") as outfile:
    json.dump(results, outfile)
One of the files I've managed to process so far is throwing the following error on read:
Traceback (most recent call last):
File "extract-keywords.py", line 11, in <module>
text = infile.read()
File "c:\python36\lib\codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 66: invalid start byte
0x92 is a right single quotation mark, but the 66th char of the file is a "u" so IDK where this error is coming from. Regardless, is there some way to make the codec tolerate such encoding errors? For example, Perl simply substitutes a question mark for any character it can't decode. Is there some way to get Python to do the same? I have a lot of files and can't afford to stop and debug every encoding error they might contain.
I have a set of UTF-8 texts I have scraped from web pages
If they can't be read with the script you've shown, then these are not actually UTF-8 encoded files.
We would have to know about the code that wrote the files in the first place to tell the correct way to decode them. However, the ’ character is the 0x92 byte in code page 1252, so try using that encoding instead, i.e.:
with open("files/" + filename.name, encoding="cp1252") as infile:
    text = infile.read()
Ignoring decoding errors corrupts the data, so it's best to use the correct decoder when possible; try that first! However, regarding this part of the question:
Regardless, is there some way to make the codec tolerate such encoding errors? For example, Perl simply substitutes a question mark for any character it can't decode. Is there some way to get Python to do the same?
Yes, you can specify errors="replace"
>>> with open("/tmp/f.txt", "w", encoding="cp1252") as f:
... f.write('this is a right quote: \N{RIGHT SINGLE QUOTATION MARK}')
...
>>> with open("/tmp/f.txt", encoding="cp1252") as f:
... print(f.read()) # using correct encoding
...
this is a right quote: ’
>>> with open("/tmp/f.txt", encoding="utf-8", errors="replace") as f:
... print(f.read()) # using incorrect encoding and replacing errors
this is a right quote: �

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte while reading a text file

I am training a word2vec model using about 700 text files as my corpus. But when I start reading the files after the preprocessing step, I get the error mentioned in the title. The code is as follows:
class MyCorpus(object):
    def __iter__(self):
        for i in ceo_path:  # ceo_path contains the absolute paths of all text files
            file = open(i, 'r', encoding='utf-8')
            text = file.read()
            # ...
            # ... text preprocessing steps ...
            # ...
            yield final_text  # yields the preprocessed text
sentences = MyCorpus()
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt= '%H:%M:%S', level=logging.INFO)
# training the model
cores = multiprocessing.cpu_count()
w2v_model = Word2Vec(min_count=5,
                     iter=30,
                     window=3,
                     size=200,
                     sample=6e-5,
                     alpha=0.025,
                     min_alpha=0.0001,
                     negative=20,
                     workers=cores-1,
                     sg=1)
w2v_model.build_vocab(sentences)
w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)
w2v_model.save('ceo1.model')
The error that I am getting is:
Traceback (most recent call last):
File "C:/Users/name/PycharmProjects/prac2/hbs_word2vec.py", line 131, in <module>
w2v_model.build_vocab(sentences)
File "C:\Users\name\PycharmProjects\prac1\venv\lib\site-packages\gensim\models\base_any2vec.py", line 921, in build_vocab
total_words, corpus_count = self.vocabulary.scan_vocab(
File "C:\Users\name\PycharmProjects\prac1\venv\lib\site-packages\gensim\models\word2vec.py", line 1403, in scan_vocab
total_words, corpus_count = self._scan_vocab(sentences, progress_per, trim_rule)
File "C:\Users\name\PycharmProjects\prac1\venv\lib\site-packages\gensim\models\word2vec.py", line 1372, in _scan_vocab
for sentence_no, sentence in enumerate(sentences):
File "C:/Users/name/PycharmProjects/prac2/hbs_word2vec.py", line 65, in __iter__
text = file.read()
File "C:\Users\name\AppData\Local\Programs\Python\Python38-32\lib\codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
I am not able to understand the error as I am new to this. I was not getting this error when I read the text files without the __iter__ method, before I started sending the data in chunks as I do now.
It looks like one of your files doesn't have proper utf-8-encoded text.
(Your Word2Vec-related code probably isn't necessary for hitting the error, at all. You could probably trigger the same error with just: sentences_list = list(MyCorpus()).)
To find which file is the problem, two different possibilities might be:
- Change your MyCorpus class so that it prints the path of each file before it tries to read it.
- Add a Python try: ... except UnicodeDecodeError: ... statement around the read, and when the exception is caught, print the offending filename (see the sketch at the end of this answer).
Once you know the file involved, you may want to fix the file, or change the code to be able to handle the files you have:
- Maybe they're not really in utf-8 encoding, in which case you'd specify a different encoding.
- Maybe just one or a few have problems, and it'd be OK to just print their names for later investigation and skip them. (You could use the exception-handling approach above to do that.)
- Maybe those that aren't utf-8 are always in some other platform-specific encoding, so when utf-8 fails, you could try a second encoding.
Separately, once you solve the encoding issue, your iterable MyCorpus is not yet returning what the Word2Vec class expects.
It doesn't want full text plain strings. It needs those texts to already be broken up into individual word-tokens.
(Often, simply performing a .split() on a string is close-enough-to-real-tokenization to try as a starting point, but usually, projects use some more-sophisticated punctuation-aware tokenization.)
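Putting both points together, a rough sketch of the corpus class might look like this (it reuses ceo_path from the question; skipping undecodable files and plain .split() tokenization are suggestions, not the asker's confirmed approach):
class MyCorpus(object):
    def __iter__(self):
        for path in ceo_path:  # ceo_path: list of absolute file paths, as in the question
            try:
                with open(path, 'r', encoding='utf-8') as file:
                    text = file.read()
            except UnicodeDecodeError:
                print('Could not decode as utf-8, skipping:', path)
                continue
            # ... your text preprocessing steps would go here ...
            yield text.split()  # Word2Vec expects a list of word-tokens per text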

Read data with multi delimiter in pyspark

I have an input file which looks like this and has "|" as a multi-character delimiter:
162300111000000000106779"|"2005-11-16 14:12:32.860000000"|"1660320"|"0"|"2005-11-16 14:12:32.877000000"|""|""|""|""|""|""|""|"False"|"120600111000000000106776
I can read this type of record with a UDF as below:
inputDf = glueContext.sparkSession.read.option("delimiter", input_file_delimiter) \
    .csv("s3://" + landing_bucket_name + "/" + input_file_name)
udf = UserDefinedFunction(lambda x: re.sub('"', '', str(x)))
new_df = inputDf.select(*[udf(column).alias(column) for column in inputDf.columns])
But when I get an input file like this:
000/00"|"AE71501"|"Complaint for Attachment of Earnings Order"|"In accordance with section test of the Attachment of Test Act Test."|"Non-Test"|"Other non-test offences"|"N"|"Other Non-Test"|"Non-Test
I am getting the below exception while reading it, using the same UDF; my code fails at the exact same location where I have my UDF:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfa' in position 66: ordinal not in range(128)
Any help on the below will be great:
- Optimized code to read both types of files, considering "|" as the separator.
- How my existing UDF can handle the second type of input records.
This is likely caused by running in Python 2.x which has two separate types for string-like objects (unicode strings and non-unicode strings, which are nowadays simply byte sequences).
Spark will read in your data (which are bytes, as there is no such thing as plain text), and decode the lines as a sequence of Unicode strings. When you call str on a Unicode string that has a codepoint that is not in the ASCII range of codepoints, Python 2 will produce an error:
# python2.7
>>> unicode_str = u"ú"
>>> type(unicode_str)
<type 'unicode'>
>>> str(unicode_str)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfa' in position 0: ordinal not in range(128)
The recommended path is that you work with Unicode strings (which is the default string object in Python 3) all throughout your program, except at the point where you either read/receive data (where you should provide a suitable encoding scheme, so that you can decode the raw bytes) and at the point where you write/send data (again, where you use an encoding to encode the data as a series of bytes). This is called “the Unicode sandwich”.
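As a rough, self-contained illustration of that sandwich (the file names and encodings below are assumptions for the example, not part of the Spark job):
# decode at the input boundary: bytes in -> str inside the program
with open("incoming.txt", "rb") as f:
    text = f.read().decode("cp1252")    # str (Unicode) from here on

cleaned = text.replace('"', '')         # work with str everywhere inside

# encode at the output boundary: str -> bytes out
with open("outgoing.txt", "wb") as f:
    f.write(cleaned.encode("utf-8"))    # bytes again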
Many libraries, including Spark, already decode bytes and encode unicode strings for you. If you simply remove the call to str in your user defined function, your code will likely work:
#pyspark shell using Python 2.7
>>> spark.sparkContext.setLogLevel("OFF") # to hide the big Py4J traceback that is dumped to the console, without modifying the log4j.properties file
>>> from py4j.protocol import Py4JJavaError
>>> from pyspark.sql.types import *
>>> from pyspark.sql.functions import udf
>>> df = spark.read.csv("your_file.csv", sep="|")
>>> def strip_double_quotes_after_str_conversion(s):
... import re
... return re.sub('"', '', str(s))
...
>>> def strip_double_quotes_without_str_conversion(s):
... import re
... return re.sub('"', '', s)
...
>>> df.select(*[udf(strip_double_quotes_without_str_conversion, StringType())(column).alias(column) for column in df.columns]).show()
+------+-------+--------------------+--------------------+--------------------+----------------+---+--------------------+----+
| _c0| _c1| _c2| _c3| _c4| _c5|_c6| _c7| _c8|
+------+-------+--------------------+--------------------+--------------------+----------------+---+--------------------+----+
|037/02|TH68150|Aggravated vehicl...|Contrary to secti...|Theft of motor ve...|Vehicle offences| Y|Aggravated Vehicl...|37.2|
+------+-------+--------------------+--------------------+--------------------+----------------+---+--------------------+----+
>>> try:
... df.select(*[udf(strip_double_quotes_after_str_conversion, StringType())(column).alias(column) for column in df.columns]).show()
... except Py4JJavaError as e:
... print("That failed. Root cause: %s" % e.java_exception.getCause().getMessage().rsplit("\n", 2)[-2])
...
That failed. Root cause: UnicodeEncodeError: 'ascii' codec can't encode character u'\xfa' in position 78: ordinal not in range(128)
So, the solution to the experienced problem is simple: don’t use str in your UDF.
Note that Python 2.x will no longer be maintained as of January 1st 2020. You’d do well transitioning to Python 3.x before that. In fact, had you executed this in a Python 3 interpreter, you would not have experienced the issue at all.

python3 UnicodeEncodeError: 'charmap' codec can't encode characters in position 95-98: character maps to <undefined>

A month ago I came across this GitHub project: https://github.com/taraslayshchuk/es2csv
I installed this package via pip3 on Ubuntu Linux. When I wanted to use it, I ran into the problem that this package is meant for Python 2. I dived into the code and soon found the problem.
    for line in open(self.tmp_file, 'r'):
        timer += 1
        bar.update(timer)
        line_as_dict = json.loads(line)
        line_dict_utf8 = {k: v.encode('utf8') if isinstance(v, unicode) else v for k, v in line_as_dict.items()}
        csv_writer.writerow(line_dict_utf8)
    output_file.close()
    bar.finish()
else:
    print('There is no docs with selected field(s): %s.' % ','.join(self.opts.fields))
The code did a check for unicode, which is not necessary in Python 3. Therefore, I changed the code to the code below, and as a result the package worked properly under Ubuntu 16.
    for line in open(self.tmp_file, 'r'):
        timer += 1
        bar.update(timer)
        line_as_dict = json.loads(line)
        # line_dict_utf8 = {k: v.encode('utf8') if isinstance(v, unicode) else v for k, v in line_as_dict.items()}
        csv_writer.writerow(line_as_dict)
    output_file.close()
    bar.finish()
else:
    print('There is no docs with selected field(s): %s.' % ','.join(self.opts.fields))
But a month later, it was necessary to get the es2csv package working on a Windows 10 operating system. After making the exact same adjustments to es2csv under Windows 10, I received the following error message when I tried to run es2csv:
PS C:\> es2csv -u 192.168.230.151:9200 -i scrapy -o database.csv -q '*'
Found 218 results
Run query [#######################################################################################################################] [218/218] [100%] [0:00:00] [Time: 0:00:00] [ 2.3 Kidocs/s]
Write to csv [# ] [2/218] [ 0%] [0:00:00] [ETA: 0:00:00] [ 3.9 Kilines/s]
Traceback (most recent call last):
File "C:\Users\admin\AppData\Local\Programs\Python\Python36\Scripts\es2csv-script.py", line 11, in <module>
load_entry_point('es2csv==5.2.1', 'console_scripts', 'es2csv')()
File "c:\users\admin\appdata\local\programs\python\python36\lib\site-packages\es2csv.py", line 284, in main
es.write_to_csv()
File "c:\users\admin\appdata\local\programs\python\python36\lib\site-packages\es2csv.py", line 238, in write_to_csv
csv_writer.writerow(line_as_dict)
File "c:\users\admin\appdata\local\programs\python\python36\lib\csv.py", line 155, in writerow
return self.writer.writerow(self._dict_to_list(rowdict))
File "c:\users\admin\appdata\local\programs\python\python36\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 95-98: character maps to <undefined>
Does anyone have an idea how to fix this error message?
It's due to the default behaviour of open in Python 3. By default, Python 3 opens files in text mode, which means it has to decode or encode every character it reads or writes using some text encoding, such as UTF-8 or ASCII.
Python will use your locale to determine the most suitable encoding. On OS X and Linux, this is usually UTF-8. On Windows, it'll use an 8-bit character set, such as windows-1252, to match the behaviour of Notepad.
As an 8-bit character set only has a limited number of characters, it's very easy to end up trying to write a character the character set does not support. For example, writing a Hebrew character fails with Windows-1252, which is a Western European character set.
To resolve your problem, you simply need to override the automatic encoding selection in open and hardcode it to use UTF-8:
for line in open(self.tmp_file, 'r', encoding='utf-8'):
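Note that the traceback shows the cp1252 codec failing inside csv_writer.writerow, i.e. while writing, so the same override is likely needed wherever the CSV output file is opened; a hypothetical sketch (the attribute name is an assumption, not es2csv's actual code):
# hypothetical: open the CSV output with an explicit encoding as well
output_file = open(self.opts.output_file, 'w', encoding='utf-8', newline='')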

Python 3.2 TypeError - can't figure out what it means

I originally put this code through Python 2.7 but needed to move to Python 3.x because of work. I've been trying to figure out how to get this code to work in Python 3.2, with no luck.
import subprocess
cmd = subprocess.Popen('net use', shell=True, stdout=subprocess.PIPE)
for line in cmd.stdout:
    if 'no' in line:
        print(line)
I get this error:
if 'no' in (line):
TypeError: Type str doesn't support the buffer API
Can anyone provide me with an answer as to why this is and/or some documentation to read?
Much appreciated.
Python 3 uses the bytes type in a lot of places where the encoding is not clearly defined. The stdout of your subprocess is a file object working with bytes data. So you cannot check whether there is some string within a bytes object, e.g.:
>>> 'no' in b'some bytes string'
Traceback (most recent call last):
File "<pyshell#13>", line 1, in <module>
'no' in b'some bytes string'
TypeError: Type str doesn't support the buffer API
What you need to do instead is test whether the bytes string contains another bytes string:
>>> b'no' in b'some bytes string'
False
So, back to your problem, this should work:
if b'no' in line:
    print(line)
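Alternatively (a sketch, not part of the original answer), you could decode each line from bytes to str first and keep comparing strings; the choice of 'utf-8' with errors='replace' here is an assumption about the command's output encoding:
import subprocess

cmd = subprocess.Popen('net use', shell=True, stdout=subprocess.PIPE)
for raw_line in cmd.stdout:
    line = raw_line.decode('utf-8', errors='replace')  # bytes -> str
    if 'no' in line:
        print(line)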
