Stripping newlines off of read file doesn't work - python-3.x

I have a function that is supposed to read a file as bytes and strip off newline characters. When I call .strip(), I get TypeError: a bytes-like object is required, not 'str'. When I instead try to encode the data with .encode('utf-8') before stripping, I get AttributeError: 'bytes' object has no attribute 'encode'. I don't really know where to begin with this problem. Here's the code:
file = open(str(filename + ".data"), "rb")
file.seek(0)
array = file.readlines()
b = array[lineNumber].strip('\n\r')
The file is encrypted bytes that I'm trying to feed into a decryption function to get ascii.

A comment showed me that, because I was stripping bytes, I needed to pass bytes to .strip(): .strip(b'\n\r') instead of .strip('\n\r').
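A minimal sketch of that fix, assuming a small in-memory sample in place of the encrypted .data file (the sample bytes and the names sample, line_number and stripped are made up for illustration):

```python
import io

# Hypothetical stand-in for the encrypted file; real data would come from open(..., "rb").
sample = io.BytesIO(b"\x8f\x01abc\r\n\x10\x02def\n")

lines = sample.readlines()   # each element is a bytes object
line_number = 0

# Pass a bytes argument, because the line itself is bytes.
stripped = lines[line_number].strip(b"\r\n")

print(stripped)
```

Note that .strip(b"\r\n") removes those bytes from both ends, so if the ciphertext itself can start or end with 0x0A or 0x0D, it may be safer to use .rstrip(b"\r\n") or to avoid newline-delimited storage for binary data altogether.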

Related

generate a hex file by converting strings into hex from a text file in Python

I have a Python tool that generates a text file in which each line has a string. I want to generate a hex file using this text file. The file has lines like the following:
-5.139488050547036391e-01
3.181812818225058681e+00
475.465798764
abc[0]
abc[0]*abc[10]
I tried using binascii.hexlify(b'<String>'), which works when I enter the strings manually, but when I read them from the file like this:
with open("strings.txt", "r") as a_file:
    for line in a_file:
        if not line.strip():
            continue
        stripped_line = line.strip()
        hex_= binascii.hexlify(b'<'+ stripped_line +'>')
        print(hex_)
I get this error:
TypeError: can't concat str to bytes
How can I convert those strings of different types into hex and generate a .hex file?
To convert a string (a line in the file) to a bytes object, you have to encode it. In your case, that just means replacing this line of your code
hex_= binascii.hexlify(b'<'+ stripped_line +'>')
with this line
hex_= binascii.hexlify(stripped_line.encode())
Your code raised an error because you tried to concatenate b'<' (a bytes object) with stripped_line (a str object), which Python 3 does not allow.
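Put together, the corrected loop might look like the sketch below. It assumes a short in-memory list in place of strings.txt, and the name hex_lines is introduced here only for illustration:

```python
import binascii

# Hypothetical sample lines standing in for the contents of strings.txt.
lines = ["-5.139488050547036391e-01\n", "\n", "abc[0]\n"]

hex_lines = []
for line in lines:
    stripped_line = line.strip()
    if not stripped_line:
        continue  # skip blank lines
    # encode() turns the str into bytes (UTF-8 by default) so hexlify accepts it;
    # decode("ascii") turns the hex digits back into a str for writing to a text file.
    hex_lines.append(binascii.hexlify(stripped_line.encode()).decode("ascii"))

print(hex_lines)
```

If the angle brackets from the original attempt are actually wanted in the output, both sides of the concatenation have to be bytes, for example b'<' + stripped_line.encode() + b'>'.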

Reading NULL bytes using pandas.read_csv()

I have a file in CSV format which contains NULL bytes (may be 0x84) in each line. I need to read this file using the C engine of pd.read_csv().
These values cause an error while reading: 'utf-8' codec can't decode byte 0x84 in position 14.
Is there any way to fix this without changing the file?
Try these options and see if one of them helps:
Option 1:
Set the engine as python.
pd.read_csv(filename, engine='python')
Option 2:
Try utf-16 encoding, because the error could also mean the file is encoded in UTF-16. Or set the encoding to the one the file actually uses, for example:
encoding = "cp1252"
encoding = "ISO-8859-1"
Option 3:
Read the file as bytes
with open(filename, 'rb') as f:
    data = f.read()
The b in the mode specifier passed to open() states that the file shall be treated as binary, so the contents will remain bytes. No decoding attempt will happen this way.
Alternatively, you can use the open function from the codecs module to read in the file:
import codecs
with codecs.open(filename, 'r', encoding='utf-8', errors='ignore') as f:
    data = f.read()
This strips out (ignores) the undecodable bytes and returns the string without them. Only use this if you want to drop those bytes rather than convert them.
https://docs.python.org/3/howto/unicode.html#the-string-type
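To compare the effect of these choices, here is a small standard-library sketch. The sample row is invented, and cp1252 is only an assumed candidate encoding, not something the question confirms:

```python
import csv
import io

# Hypothetical CSV content containing byte 0x84, which is not valid UTF-8 on its own.
raw = b"id,name\n1,caf\x84\n"

# Choice A: decode with a single-byte codec; cp1252 maps 0x84 to U+201E.
text_cp1252 = raw.decode("cp1252")

# Choice B: keep UTF-8 but substitute U+FFFD for undecodable bytes instead of dropping them.
text_replaced = raw.decode("utf-8", errors="replace")

# The decoded text can then be parsed as CSV from an in-memory text buffer.
rows = list(csv.reader(io.StringIO(text_replaced)))
print(rows)
```

Choice A only gives meaningful characters if the guess about the encoding is right; choice B always succeeds but marks the affected positions with the replacement character.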

'charmap' codec can't encode characters in position XX

I have a simple script that is attempting to extract multiple json objects from a single file, and store them as a list:
import json
URL = r"C:\Users\Kenneth\Youtube_comment_parser\Testing.txt"
with open(URL, 'r', encoding="utf-8") as handle:
    json_data = [json.loads(line) for line in handle]
print(json_data) # Can't .encode() because it's a list
Even after specifying utf-8 encoding, I'm still running into a codec error. If possible, I would also like to change this object into a dictionary, but this is as far as I've got.
The exact error reads:
UnicodeEncodeError: 'charmap' codec can't encode characters in position
394-395: character maps to <undefined>
Thanks in advance.
I was able to solve this issue by removing the one Unicode character that was producing the "<undefined>" error, '\ufeff'; after that, the rest displayed nicely. This required me to iterate over the keys in the list of dictionaries and replace as necessary.
import json
URL = r"C:\Users\Kenneth\Youtube_comment_parser\Testing.txt"
json1_file = open(URL, encoding='utf-8')
json1_str = json1_file.read()
json1_str = [d.strip() for d in json1_str.splitlines()]
json1_data = [json.loads(i) for i in json1_str]
json1_data = [{key:value.replace(u'\ufeff', '') for
key, value in json1_data[index].items()} for
index in range(len(json1_data))]
print(json1_data[1]['text'].encode('utf-8'))
Still not sure why I have to open with utf-8 and then encode again with my print statement, but it produced the string nicely.
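On that last point: opening with encoding='utf-8' only controls how the file's bytes are decoded into str, while print has to encode the str again for the console, and that second step is where the 'charmap' error comes from. A minimal sketch, assuming the '\ufeff' is a byte order mark at the very start of the file (the sample bytes are invented, and cp1252 stands in for whatever code page the console uses):

```python
import json

# Hypothetical file content: a UTF-8 BOM followed by one JSON object per line.
raw = b'\xef\xbb\xbf{"text": "first"}\n{"text": "second"}\n'

# "utf-8-sig" decodes UTF-8 and drops a leading BOM (U+FEFF) if one is present.
decoded = raw.decode("utf-8-sig")
json_data = [json.loads(line) for line in decoded.splitlines() if line.strip()]

# Reproduce the output-side failure: a legacy code page cannot represent U+FEFF.
try:
    "\ufeff".encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

print(json_data, cp1252_ok)
```

The same codec name works with open(URL, encoding='utf-8-sig'). It would not help if the '\ufeff' characters sit in the middle of the data, in which case replacing them as in the answer above is still needed.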

Python 3, UnicodeEncodeError with decode set to ignore

This code makes an http call to a solr index.
query_uri = prop.solr_base_uri + "?q=" + query + "&wt=json&indent=true"
with urllib.request.urlopen(query_uri) as response:
    data = response.read()
    #data is bytes
    data_str=data.decode('utf-8', 'ignore')
    print(data_str)
The print statement throws:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2715' in position 149273: character maps to undefined
I thought decode('utf-8', 'ignore') was supposed to ignore non-UTF-8 characters and leave them out of the result? How is it that I get a UnicodeEncodeError in the print statement? How do I handle characters that can't be encoded? Thanks!
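The error is raised by print (and by any file.write()), which has to encode the text for its output stream. When no encoding is set explicitly, the stream uses the platform default, here a legacy 'charmap' code page that has no mapping for '\u2715'. The decode call succeeded; it is the later encoding step that fails.
The recommended approach is to set PYTHONIOENCODING=UTF-8 in your environment or encode each string before printing:
print(data_str.encode("utf-8"))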
For file writing, set the encoding for the file when you open it:
file = open("/temp/test.txt", "w", encoding="UTF-8")
file.write('\u2715')
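The difference between the two kinds of output stream can be reproduced without a real console. The sketch below uses in-memory streams, with cp1252 as an assumed stand-in for the console's code page; the names legacy_console, utf8_stream and written are illustrative only:

```python
import io

# Hypothetical stand-in for a Windows console: a text stream that encodes with cp1252.
legacy_console = io.TextIOWrapper(io.BytesIO(), encoding="cp1252")
try:
    print("\u2715", end="", file=legacy_console)
    legacy_console.flush()
    failed = False
except UnicodeEncodeError:
    failed = True  # cp1252 has no mapping for U+2715

# The same character written to a UTF-8 stream succeeds.
buffer = io.BytesIO()
utf8_stream = io.TextIOWrapper(buffer, encoding="utf-8")
print("\u2715", end="", file=utf8_stream)
utf8_stream.flush()
written = buffer.getvalue()

print(failed, written)
```

On Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") is another way to change the encoding of standard output from inside the script, provided the terminal can actually display UTF-8.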

How do I convert a Python 3 byte-string variable into a regular string? [duplicate]

I have read in an XML email attachment with
bytes_string=part.get_payload(decode=False)
The payload comes in as a byte string, as my variable name suggests.
I am trying to use the recommended Python 3 approach to turn this string into a usable string that I can manipulate.
The example shows:
str(b'abc','utf-8')
How can I apply the b (bytes) keyword argument to my variable bytes_string and use the recommended approach?
The way I tried doesn't work:
str(bbytes_string, 'utf-8')
You had it nearly right in the last line. You want
str(bytes_string, 'utf-8')
because the type of bytes_string is bytes, the same as the type of b'abc'.
Call decode() on a bytes instance to get the text which it encodes:
text = bytes_string.decode()
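Both spellings give the same result, as this short sketch shows. The payload here is an invented example rather than a real email attachment:

```python
# Hypothetical payload standing in for the value returned by part.get_payload(...).
bytes_string = b"<note>caf\xc3\xa9</note>"

text_a = str(bytes_string, "utf-8")     # constructor form from the question
text_b = bytes_string.decode("utf-8")   # equivalent method form

print(text_a)
```

The b prefix only exists in source code for writing bytes literals; a variable that already holds bytes is passed as it is.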
How to filter (skip) non-UTF8 characters from array?
To address this comment in #uname01's post and the OP, ignore the errors:
Code
>>> b'\x80abc'.decode("utf-8", errors="ignore")
'abc'
Details
From the docs, here are more examples using the same errors parameter:
>>> b'\x80abc'.decode("utf-8", "replace")
'\ufffdabc'
>>> b'\x80abc'.decode("utf-8", "backslashreplace")
'\\x80abc'
>>> b'\x80abc'.decode("utf-8", "strict")
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
invalid start byte
The errors argument specifies the response when the input string can’t be converted according to the encoding’s rules. Legal values for this argument are 'strict' (raise a UnicodeDecodeError exception), 'replace' (use U+FFFD, REPLACEMENT CHARACTER), or 'ignore' (just leave the character out of the Unicode result).
UPDATED:
To avoid the leading b and the surrounding quotes
How to convert bytes, as displayed, to a string, even in odd situations.
As your data may contain bytes that are not valid in the 'utf-8' encoding,
you can call str without any additional parameters and slice off the b'...' wrapper:
some_bad_bytes = b'\x02-\xdfI#)'
text = str( some_bad_bytes )[2:-1]
print(text)
Output: \x02-\xdfI#)
If you add the 'utf-8' parameter for these specific bytes, you should receive an error.
Note that text then holds the escaped representation of the bytes (a literal backslash followed by x02, and so on), not decoded characters.
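For comparison, here is a sketch of the repr-slicing approach next to two decoding alternatives that also never raise on these bytes. The names via_repr, via_escape and via_latin1 are illustrative:

```python
some_bad_bytes = b'\x02-\xdfI#)'

# repr-slicing approach from the answer: every escape stays as literal text.
via_repr = str(some_bad_bytes)[2:-1]

# Decoding approach: valid bytes are decoded, invalid ones become backslash escapes.
via_escape = some_bad_bytes.decode("utf-8", errors="backslashreplace")

# latin-1 maps every byte value to a code point, so decoding cannot fail.
via_latin1 = some_bad_bytes.decode("latin-1")

print(len(via_repr), len(via_escape), len(via_latin1))
```

Which one is appropriate depends on whether the goal is a printable dump of the bytes (the first two) or a lossless one-character-per-byte string (the third).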
