python3 converting unreadable character to readable chracter - python-3.x

hi i have a text file and i am reading file and parsing datas,
but my file contains some text like
\u03a4\u03c1\u03b5\u03b9\u03c2 \u03bd\u03b5\u03ba\u03c1\u03bf\u03af \u03b1\u03c0\u03cc \u03c0\u03c4\u03ce\u03c3\u03b7 \u03bf\u03b2\u03af\u03b4\u03b1\u03c2 \u03c3\u03b5 \u03c3\u03c0\u03af\u03c4\u03b9 \u03c3\u03c4\u03bf \u03a3\u03b9\u03bd\u03ac
how can i convert a it readable text with python
i try to use these codes to solve but it doesn't work
def encodeDecode(self, data):
new_data = ''
for ch in data:
#let = ch.encode('utf-8').decode('utf-8')
#new_data += let
new_data += repr(ch)[1:2]
return new_data

There is no problem with your string,you have a unicode data.Just based on how you want to use it you can decode it custom or using python default encoding for example if you want to print it, since strings in python 3 are unicode you can just print it.
>>> s="""\u03a4\u03c1\u03b5\u03b9\u03c2 \u03bd\u03b5\u03ba\u03c1\u03bf\u03af \u03b1\u03c0\u03cc \u03c0\u03c4\u03ce\u03c3\u03b7 \u03bf\u03b2\u03af\u03b4\u03b1\u03c2 \u03c3\u03b5 \u03c3\u03c0\u03af\u03c4\u03b9 \u03c3\u03c4\u03bf \u03a3\u03b9\u03bd\u03ac """
>>>
>>> print s
Τρεις νεκροί από πτώση οβίδας σε σπίτι στο Σινά
>>>
But if you want to write your data in a file you need to use a proper encoding for your file.
You can do it with passing your encoding to open() function when you open a file for writing.

You could also convert it using Python's json module - this would also work in Python 2x
>>> f = open('input.txt', 'r')
>>> json_str = '"%s"' % f.read().replace('"', '\\"') # wrap the input string in double quotes
>>> print(json.loads(json_str))
Τρεις νεκροί από πτώση οβίδας σε σπίτι στο Σινά

Related

Text parsing with specific element Python

I have an aligment result with multiple sequences as text file. I want to split each result into new text file. Far now I can detect each sequence with '>', and split into files. However, new text files writen without line that contains '>'.
with open("result.txt",'r') as fo:
start=0
op= ' '
cntr=1
# print(fo.readlines())
for x in fo.readlines():
# print(x)
if (x[0]== '>'):
if (start==1):
with open(str(cntr)+'.txt','w') as opf:
opf.write(op)
opf.close()
op= ' '
cntr+=1
else:
start=1
else:
if (op==''):
op=x
else:
op= op + '\n' + x
fo.close()
print('completed')
>P51051.1 RecName: Full=Melatonin receptor type 1B; Short=Mel-1B-R; Short=Mel1b
receptor [Xenopus laevis]
Length=152
this is how I want to see as a beginning of each text file but they start as
receptor [Xenopus laevis]
Length=152
How can I include from the beginning?
You can do it like this:
with open("result.txt", encoding='utf-8') as fo:
for index, txt in enumerate(fo.read().split(">")):
if txt:
with open(f'{index}.txt', 'w') as opf:
opf.write(txt)
You should provide the encoding of the file e.g. utf-8, no need to specify read r, there is no need to close the file if you are using a context manager i.e. with and you just need to use read instead of readlines to get a string then call split on the string. I'm using enumerate to get a counter as well as enumerate objects. And f-string as it is a better way for string concatenation.

Python: How to Remove range of Characters \x91\x87\xf0\x9f\x91\x87 from File

I have this file with some lines that contain some unicode literals like:
"b'Who\xe2\x80\x99s he?\n\nA fan rushed the field to join the Cubs\xe2\x80\x99 celebration after Jake Arrieta\xe2\x80\x99s no-hitter."
I want to remove those xe2\x80\x99 like characters.
I can remove them if I declare a string that contains these characters but my solutions don't work when reading from a CSV file. I used pandas to read the file.
SOLUTIONS TRIED
1.Regex
2.Decoding and Encoding
3.Lambda
Regex Solution
line = "b'Who\xe2\x80\x99s he?\n\nA fan rushed the field to join the Cubs\xe2\x80\x99 celebration after Jake Arrieta\xe2\x80\x99s no-hitter."
code = (re.sub(r'[^\x00-\x7f]',r'', line))
print (code)
LAMBDA SOLUTION
stripped = lambda s: "".join(i for i in s if 31 < ord(i) < 127)
code2 = stripped(line)
print(code2)
ENCODING SOLUTION
code3 = (line.encode('ascii', 'ignore')).decode("utf-8")
print(code3)
HOW FILE WAS READ
df = pandas.read_csv('file.csv',encoding = "utf-8")
for index, row in df.iterrows():
print(stripped(row['text']))
print(re.sub(r'[^\x00-\x7f]',r'', row['text']))
print(row['text'].encode('ascii', 'ignore')).decode("utf-8"))
SUGGESTED METHOD
df = pandas.read_csv('file.csv',encoding = "utf-8")
for index, row in df.iterrows():
en = row['text'].encode()
print(type(en))
newline = en.decode('utf-8')
print(type(newline))
print(repr(newline))
print(newline.encode('ascii', 'ignore'))
print(newline.encode('ascii', 'replace'))
Your string is valid utf-8. Therefore it can be directly converted to a python string.
You can then encode it to ascii with str.encode(). It can ignore non-ascii characters with 'ignore'.
Also possible: 'replace'
line_raw = b'Who\xe2\x80\x99s he?'
line = line_raw.decode('utf-8')
print(repr(line))
print(line.encode('ascii', 'ignore'))
print(line.encode('ascii', 'replace'))
'Who’s he?'
b'Whos he?'
b'Who?s he?'
To come back to your original question, your 3rd method was correct. It was just in the wrong order.
code3 = line.decode("utf-8").encode('ascii', 'ignore')
print(code3)
To finally provide a working pandas example, here you go:
import pandas
df = pandas.read_csv('test.csv', encoding="utf-8")
for index, row in df.iterrows():
print(row['text'].encode('ascii', 'ignore'))
There is no need to do decode('utf-8'), because pandas does that for you.
Finally, if you have a python string that contains non-ascii characters, you can just strip them by doing
text = row['text'].encode('ascii', 'ignore').decode('ascii')
This converts the text to ascii bytes, strips all the characters that cannot be represented as ascii, and then converts back to text.
You should look up the difference between python3 strings and bytes, that should clear things up for you, I hope.

Parse string representation of binary loaded from CSV [duplicate]

I have used tweepy to store the text of tweets in a csv file using Python csv.writer(), but I had to encode the text in utf-8 before storing, otherwise tweepy throws a weird error.
Now, the text data is stored like this:
"b'Lorem Ipsum\xc2\xa0Assignment '"
I tried to decode this using this code (there is more data in other columns, text is in 3rd column):
with open('data.csv','rt',encoding='utf-8') as f:
reader = csv.reader(f,delimiter=',')
for row in reader:
print(row[3])
But, it doesn't decode the text. I cannot use .decode('utf-8') as the csv reader reads data as strings i.e. type(row[3]) is 'str' and I can't seem to convert it into bytes, the data gets encoded once more!
How can I decode the text data?
Edit: Here's a sample line from the csv file:
67783591545656656999,3415844,1450443669.0,b'Virginia School District Closes After Backlash Over Arabic Assignment: The Augusta County school district in\xe2\x80\xa6 | #abcde',52,18
Note: If the solution is in the encoding process, please note that I cannot afford to download the entire data again.
The easiest way is as below. Try it out.
import csv
from io import StringIO
byte_content = b"iam byte content"
content = byte_content.decode()
file = StringIO(content)
csv_data = csv.reader(file, delimiter=",")
If your input file really contains strings with Python syntax b prefixes on them, one way to workaround it (even though it's not really a valid format for csv data to contain) would be to use Python's ast.literal_eval() function as #Ry suggested — although I would use it in a slightly different manner, as shown below.
This will provide a safe way to parse strings in the file which are prefixed with a b indicating they are byte-strings. The rest will be passed through unchanged.
Note that this doesn't require reading the entire CSV file into memory.
import ast
import csv
def _parse_bytes(field):
"""Convert string represented in Python byte-string literal b'' syntax into
a decoded character string - otherwise return it unchanged.
"""
result = field
try:
result = ast.literal_eval(field)
finally:
return result.decode() if isinstance(result, bytes) else result
def my_csv_reader(filename, /, **kwargs):
with open(filename, 'r', newline='') as file:
for row in csv.reader(file, **kwargs):
yield [_parse_bytes(field) for field in row]
reader = my_csv_reader('bytes_data.csv', delimiter=',')
for row in reader:
print(row)
You can use ast.literal_eval to convert the incorrect fields back to bytes safely:
import ast
def _parse_bytes(bytes_repr):
result = ast.literal_eval(bytes_repr)
if not isinstance(result, bytes):
raise ValueError("Malformed bytes repr")
return result

Python: Write to file diacritical marks as escape character sequence

I read text line from input file and after cut i have strings:
-pokaż wszystko-
–ყველას გამოჩენა–
and I must write to other file somethink like this:
-poka\017C wszystko-
\2013\10E7\10D5\10D4\10DA\10D0\10E1 \10D2\10D0\10DB\10DD\10E9\10D4\10DC\10D0\2013
My python script start that:
file_input = open('input.txt', 'r', encoding='utf-8')
file_output = open('output.txt', 'w', encoding='utf-8')
Unfortunately, writing to a file is not what it expects.
I got tip why I have to change it, but cant figure out conversion:
Diacritic marks saved in UTF-8 ("-pokaż wszystko-"), it works correctly only if NLS_LANG = AMERICAN_AMERICA.AL32UTF8
If the output file has diacritics saved in escaping form ("-poka\017C wszystko-"), the script works correctly for any NLS_LANG settings
Python 3.6 solution...format characters outside the ASCII range:
#coding:utf8
s = ['-pokaż wszystko-','–ყველას გამოჩენა–']
def convert(s):
return ''.join(x if ord(x) < 128 else f'\\{ord(x):04X}' for x in s)
for t in s:
print(convert(t))
Output:
-poka\017C wszystko-
\2013\10E7\10D5\10D4\10DA\10D0\10E1 \10D2\10D0\10DB\10DD\10E9\10D4\10DC\10D0\2013
Note: I don't know if or how you want to handle Unicode characters outside the basic multilingual plane (BMP, > U+FFFF), but this code probably won't handle them. Need more information about your escape sequence requirements.

How to handle utf-8 text with Python 3?

I need to parse various text sources and then print / store it somewhere.
Every time a non ASCII character is encountered, I can't correctly print it as it gets converted to bytes, and I have no idea how to view the correct characters.
(I'm quite new to Python, I come from PHP where I never had any utf-8 issues)
The following is a code example:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import codecs
import feedparser
url = "http://feeds.bbci.co.uk/japanese/rss.xml"
feeds = feedparser.parse(url)
title = feeds['feed'].get('title').encode('utf-8')
print(title)
file = codecs.open("test.txt", "w", "utf-8")
file.write(str(title))
file.close()
I'd like to print and write in a file the RSS title (BBC Japanese - ホーム) but instead the result is this:
b'BBC Japanese - \xe3\x83\x9b\xe3\x83\xbc\xe3\x83\xa0'
Both on screen and file. Is there a proper way to do this ?
In python3 bytes and str are two different types - and str is used to represent any type of string (also unicode), when you encode() something, you convert it from it's str representation to it's bytes representation for a specific encoding.
In your case in order to the decoded strings, you just need to remove the encode('utf-8') part:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import codecs
import feedparser
url = "http://feeds.bbci.co.uk/japanese/rss.xml"
feeds = feedparser.parse(url)
title = feeds['feed'].get('title')
print(title)
file = codecs.open("test.txt", "w", encoding="utf-8")
file.write(title)
file.close()
The function print(A) in python3 will first convert the string A to bytes with its original encoding, and then print it through 'gbk' encoding.
So if you want to print A in utf-8, you first need to convert A with gbk as follow:
print(A.encode('gbk','ignore').decode('gbk'))
JSON data to Unicode support for Japanese characters
def jsonFileCreation (messageData, fileName):
with open(fileName, "w", encoding="utf-8") as outfile:
json.dump(messageData, outfile, indent=8, sort_keys=False,ensure_ascii=False)

Resources