Python3 reading Japanese characters in a pickle file made in python2

I use the code below to read a pickle file made in python2
import pickle
with open('data.pkl', 'rb') as fin:
    data_df = pickle.load(fin, encoding='latin1')
Everything works well except the column containing Japanese characters.
For example, a string that should be "東京都" may become something like "æ±äº¬é".
I think Python 3 reads the Python 2 byte string as a str. How can I convert it back?
Here are some tests I did in Python 3:
>>> a='\xe6\x9d\xb1\xe4\xba\xac\xe9\x83\xbd'
>>> b=b'\xe6\x9d\xb1\xe4\xba\xac\xe9\x83\xbd'
>>> a
'æ\x9d±äº¬é\x83½'
>>> b
b'\xe6\x9d\xb1\xe4\xba\xac\xe9\x83\xbd'
>>> print(a)
æ±äº¬é
>>> print(b)
b'\xe6\x9d\xb1\xe4\xba\xac\xe9\x83\xbd'
>>> a.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
>>> b.decode('utf-8')
'東京都'
I think pickle.load reads the UTF-8 bytes as a str (like the a case above).
[EDIT]
The reason I set the pickle.load encoding to latin1 is that there is a column with datetime format. It causes an error if I set encoding='utf-8'.
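One possible way to convert such strings back, assuming they really are UTF-8 bytes that were decoded as latin1: encode the str with latin1 to recover the original bytes, then decode those bytes as UTF-8. The helper name fix_mojibake below is hypothetical, and this is only a sketch, not a tested fix for the actual pickle file.

```python
def fix_mojibake(value):
    """Undo a latin1 decode of UTF-8 bytes; return other values unchanged."""
    if not isinstance(value, str):
        return value  # numbers, datetimes, None, ... are left alone
    try:
        # latin1 maps code points 0-255 one-to-one onto bytes, so this
        # recovers the original byte string, which is then read as UTF-8.
        return value.encode('latin1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return value  # not latin1-decoded UTF-8, keep as is


a = '\xe6\x9d\xb1\xe4\xba\xac\xe9\x83\xbd'  # the "a" case from the question
fixed = fix_mojibake(a)
print(fixed)
```

With a pandas DataFrame, something like data_df['col'] = data_df['col'].map(fix_mojibake) could apply it to one column, where 'col' stands in for the real column name.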

Related

Python Windows path with escape character error

I have a Windows path stored in a variable called "a". When I try to print it or use it in the code, some characters in the string are changed.
>>> import re
>>> from pathlib import Path
>>>
>>>
>>> a = "E:\POC\testing\functionalities\logs\timer.logs"
>>> a
'E:\\POC\testing\x0cunctionalities\\logs\timer.logs'
>>>
>>> Path(a)
WindowsPath('E:/POC\testing\x0cunctionalities/logs\timer.logs')
>>> Path.absolute(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "c:\program files (x86)\python38-32\lib\pathlib.py", line 1159, in absolute
if self._closed:
AttributeError: 'str' object has no attribute '_closed'
>>>
>>> re.escape(a)
'E:\\\\POC\\\testing\\\x0cunctionalities\\\\logs\\\timer\\.logs'
>>>
>>> a.replace("\\", "/")
'E:/POC\testing\x0cunctionalities/logs\timer.logs'
>>> a.__repr__()
"'E:\\\\POC\\testing\\x0cunctionalities\\\\logs\\timer.logs'"
>>>
I'm able to handle all the special characters, but \f is somehow changed to \x0c.
One solution is adding r to the string, but my path is stored in a variable. How can I achieve that? I'm using Python 3.8.5 and Windows 10.
>>> a = r"E:\POC\testing\functionalities\logs\timer.logs"
>>> a
'E:\\POC\\testing\\functionalities\\logs\\timer.logs'
>>>
>>>
>>> a = "E:\POC\testing\functionalities\logs\timer.logs"
>>> a = r"" + a
>>> a
'E:\\POC\testing\x0cunctionalities\\logs\timer.logs'
>>>

Use a raw string or escape the backslashes:
a = r"E:\POC\testing\functionalities\logs\timer.logs"
or
a = "E:\\POC\\testing\\functionalities\\logs\\timer.logs"
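A small sketch to check that the two spellings give the same string, using the path from the question. It also shows a third option, forward slashes, which pathlib accepts for Windows paths. PureWindowsPath is used here only so the example runs on any OS.

```python
from pathlib import PureWindowsPath

raw = r"E:\POC\testing\functionalities\logs\timer.logs"
escaped = "E:\\POC\\testing\\functionalities\\logs\\timer.logs"
forward = "E:/POC/testing/functionalities/logs/timer.logs"

# Both literals produce exactly the same characters.
same_text = raw == escaped

# pathlib treats "/" and "\" as equivalent separators on Windows paths.
same_path = PureWindowsPath(forward) == PureWindowsPath(raw)

name = PureWindowsPath(raw).name
print(same_text, same_path, name)
```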

Based on your comment under @user8086906's post, couldn't you just do
a.replace('\\', '\')
? I see you tried a.replace("\\", "/") above. Could you explain what the desired behavior is? On my machine, the first snippet I posted works.
EDIT:
Thanks @Gopirengaraj C, I see what the issue is now. The problem is that \f is an escape sequence in Python string literals: it produces the "form feed" character, which is displayed as \x0c. I think a good way to get around this would then be to avoid replace and do something like this instead:
a = r'{0}'.format(a)
Let me know if that works.
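For what it's worth, the r prefix only changes how a literal in the source code is parsed. Once the string exists, \f has already become a form feed, which is why r"" + a in the question returns the same characters. If the broken literal really cannot be edited, a best-effort repair is to map the control characters back to the two-character text they came from. This is only a sketch: restore_backslashes and _ESCAPES are hypothetical names, and it cannot tell a genuine tab in the data from one produced by \t.

```python
# Control characters produced by common escape sequences, mapped back to
# the backslash text that was most likely typed in the literal.
_ESCAPES = {
    '\a': r'\a', '\b': r'\b', '\f': r'\f', '\n': r'\n',
    '\r': r'\r', '\t': r'\t', '\v': r'\v',
}


def restore_backslashes(path):
    """Replace each known control character with its escape text."""
    return ''.join(_ESCAPES.get(ch, ch) for ch in path)


# Same characters as the non-raw literal in the question
# (\P and \l are written as \\P and \\l to avoid invalid-escape warnings).
broken = "E:\\POC\testing\functionalities\\logs\timer.logs"

unchanged = (r"" + broken) == broken  # concatenation does not undo escapes
restored = restore_backslashes(broken)
print(restored)
```

Paths that come from user input, a file, or os functions never go through literal parsing, so they do not need this treatment.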

Split results in Python for CPU usage

I've been trying to get this to work for a few hours now. Nothing I try splits this text up. I only want the current CPU frequency from this:
>>> from __future__ import print_function
>>> from urllib.request import urlopen
>>> import json
>>> import subprocess
>>> import requests
>>> import random
>>> import sys
>>> import os
>>> import time
>>> import datetime
>>> import MySQLdb as my
>>> import psutil
>>> os.popen('vcgencmd measure_temp').readline()
"temp=52.0'C\n"
>>> cpu = psutil.cpu_freq()
>>> cpu = cpu.split('current=')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'scpufreq' object has no attribute 'split'
>>> psutil.cpu_freq()
scpufreq(current=600.0, min=600.0, max=1500.0)
>>> psutil.cpu_freq(percpu=True)
[scpufreq(current=600.0, min=600.0, max=1500.0)]
>>> cpu = psutil.cpu_freq(percpu=True)
>>> cpu.split('=')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'split'
>>> AttributeError: 'list' object has no attribute 'split'
File "<stdin>", line 1
AttributeError: 'list' object has no attribute 'split'
^
SyntaxError: invalid syntax
>>> AttributeError: 'list' object has no attribute 'split'
File "<stdin>", line 1
AttributeError: 'list' object has no attribute 'split'
^
SyntaxError: invalid syntax
>>> psutil.cpu_freq(percpu=True).readline()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'readline'
>>> cpu = psutil.cpu_freq()
Where am I going wrong with this?
OS: Raspbian Buster
Python: python3
PIP: pip3

It looks mostly like you're ignoring your error messages:
>>> cpu = psutil.cpu_freq()
>>> cpu = cpu.split('current=')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'scpufreq' object has no attribute 'split'
The return value from psutil.cpu_freq() isn't a string, so it doesn't have a split method. If you just print the value...
>>> cpu
scpufreq(current=700.0, min=700.0, max=800.0)
...you get some idea of what attributes it has, and indeed, we can access those values like this:
>>> cpu.current
700.0
>>> cpu.max
800.0
When you set percpu=True, you're getting back a list:
>>> psutil.cpu_freq(percpu=True)
[scpufreq(current=600.0, min=600.0, max=1500.0)]
And once again, a list isn't a string, so there's no split method. Since there's only a single CPU, you get back a 1-item list, so you can access values like this:
>>> cpu = psutil.cpu_freq(percpu=True)
>>> cpu[0].current
700.0
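To put the two cases side by side: the value from psutil.cpu_freq() is a named tuple, so you read fields by attribute, while the vcgencmd output in the question is a real string, so split does apply there. The sketch below uses a stand-in namedtuple with the same field names as the printed scpufreq, so it runs without psutil installed. The values are copied from the question.

```python
from collections import namedtuple

# Stand-in for the object psutil.cpu_freq() returns (same field names).
scpufreq = namedtuple('scpufreq', ['current', 'min', 'max'])
cpu = scpufreq(current=600.0, min=600.0, max=1500.0)

current_mhz = cpu.current   # attribute access, no string handling needed
as_dict = cpu._asdict()     # named tuples can also be turned into a dict

# With percpu=True the result is a list with one entry per CPU.
per_cpu = [cpu]
currents = [c.current for c in per_cpu]

# The temperature line, by contrast, is a str and can be split.
temp_line = "temp=52.0'C\n"
temp_c = float(temp_line.strip().split('=')[1].rstrip("'C"))

print(current_mhz, currents, temp_c)
```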

Issue with python 2 to python 3 TypeError: cannot use a string pattern on a bytes-like object

I have the following code and would like to make it compatible with both python 2.7 and python 3.6
from re import sub, findall
return sub(r' ', ' ', sub(r'(\s){2,}', ' ',sub(r'[^a-z|\s|,]|_|(x)\1{1,}', '', x.lower())))
I received the following error:
TypeError: cannot use a string pattern on a bytes-like object
I understand that Python 3 distinguishes between bytes and strings (unicode), but I am not sure how to proceed.
Thanks.
I tried the following, and it does not work:
return sub(rb' ', b' ', sub(rb'(\s){2,}', b' ',sub(rb'[^a-z|\s|,]|_|(x)\1{1,}', b'', x.lower())))

Have you tried using re.findall? For instance:
import re
respdata = b''  # placeholder: the data you are reading
content = re.findall(r'#findall from and too#', str(respdata)) # output in string
for contents in content:
    print(contents) # print results

The "string" you have must be a series of bytes, which you can convert to a real string using x.decode('utf-8'). You can see the problem with a simple example:
>>> import re
>>> s = bytes('hello', 'utf-8')
>>> s
b'hello'
>>> re.search(r'[he]', s)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/re.py", line 183, in search
return _compile(pattern, flags).search(string)
TypeError: cannot use a string pattern on a bytes-like object
>>> s.decode('utf-8')
'hello'
>>> re.search(r'[he]', s.decode('utf-8'))
<re.Match object; span=(0, 1), match='h'>
I'm assuming your bytes represent UTF-8 data, but if you're working with a different encoding then just pass its name to decode() instead.
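Applied to the code in the question, a sketch that should behave the same on Python 2.7 and 3.6 is to decode bytes input up front and keep the str patterns as they are. The function name clean and the utf-8 default are assumptions. On Python 2, bytes is an alias for str, so a plain str gets decoded to unicode there.

```python
from re import sub


def clean(x, encoding='utf-8'):
    """Lower-case x, drop unwanted characters, collapse whitespace runs."""
    if isinstance(x, bytes):
        x = x.decode(encoding)  # bytes -> text; text passes through as is
    return sub(r' ', ' ',
               sub(r'(\s){2,}', ' ',
                   sub(r'[^a-z|\s|,]|_|(x)\1{1,}', '', x.lower())))


from_bytes = clean(b'Hello,   World_42!')
from_text = clean(u'Hello,   World_42!')
print(from_bytes)
```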

How to serialize objects using pickle in python3

I am reading the book "How to Think Like a Computer Scientist: Learning with Python". I usually have no difficulty translating its examples from Python 2 to Python 3, but in chapter 11, Files & Exceptions, I encountered this snippet:
>>> import pickle
>>> f = open("test.pck", "w")
>>> pickle.dump(12.3, f)
>>> pickle.dump([1,2,3], f)
>>> f.close()
which, when I evaluate it using Python 3.5.2, gives this error:
Traceback (most recent call last):
File "/(myDirs)/files.py", line 3, in <module>
pickle.dump(3.14, f)
TypeError: write() argument must be str, not bytes
I am not a good docs reader, so if you can help me to solve this riddle I would be grateful.

You need to open the file in binary mode.
In line 2:
f = open("test.pck", "wb")
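For completeness, a sketch of the whole round trip with the values from the book's example. Reading needs binary mode as well ("rb"), and each pickle.load call returns one object, in the order they were dumped. The temporary directory is only there so the example does not leave a file in the working directory.

```python
import os
import pickle
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test.pck")

# Pickle data is bytes, so the file must be opened in binary mode.
with open(path, "wb") as f:
    pickle.dump(12.3, f)
    pickle.dump([1, 2, 3], f)

# Read the objects back in the same order they were written.
with open(path, "rb") as f:
    first = pickle.load(f)
    second = pickle.load(f)

print(first, second)
```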

AttributeError: 'str' object has no attribute 'dispersion_plot' NLTK

I am trying to create a dispersion_plot using NLTK. As far as I can tell, I am following the directions. When I run their example calling the example text that comes with NLTK it works. When I call my own text file, it has the above error.
mine:
>>> text11 = "Text_test.txt"
>>> text11.dispersion_plot(["semiosis", "dialectic", "essentially", "icon", "logo"])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'dispersion_plot'
Their example code:
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
Thankful for any advice/help!

Note that you have to make it into an NLTK Text object after tokenizing it. Also, your text11 variable as used in your code is the string "Text_test.txt", not the text inside the file called Text_test.txt.
Assuming that:
you have matplotlib and numpy installed, which are necessary for dispersion_plot to work;
your file is at /home/myfile.txt;
your file is simple text like the ones they use;
then this should do it:
import nltk
# from Ch. 3
f = open('/home/myfile.txt', 'r') # open the file ('rU' mode was removed in Python 3.11)
raw = f.read() # read the text
f.close() # close the file once it has been read
tokens = nltk.word_tokenize(raw) # tokenize it
mytext = nltk.Text(tokens) # turn text into a NLTK Text object
# from Ch. 1
mytext.dispersion_plot(["semiosis", "dialectic", "essentially", "icon", "logo"])
