Python 3 csv read does not recognize Cyrillic script

I have the following Python code:
import csv
from urllib import request
url_base = "https://translate.google.com"
url_params_list = "/#view=home&op=translate&sl=ru&tl=en&text="
with open('top5000russianlemmasraw.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        url = url_base + url_params_list + request.quote(row[0].encode('cp1251'))
        print(url)
The file top5000russianlemmasraw.csv is a list of words in Cyrillic script.
The problem with the code is that the Cyrillic script is imported as strings of question marks, e.g. '????', which then get converted to URL code as '%3F%3F%3F%3F' type strings. I am not sure how to get Python to import Cyrillic script so that it does not show up as a question mark. Would appreciate help on this.

The open() built-in defaults to the encoding returned by locale.getpreferredencoding(). You can override this with the encoding keyword parameter:
# ...
with open('top5000russianlemmasraw.csv', encoding='cp1251') as csv_file:
    # ...
Or, alternatively, you could open the file as bytes, read its contents, and decode them yourself:
with open('top5000russianlemmasraw.csv', 'rb') as csv_file:
    blob = csv_file.read()
text = blob.decode('cp1251')
# ...
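
For reference, here is a minimal end-to-end sketch of the first option, assuming the file really is cp1251-encoded. The helper name build_urls and the temporary sample file are made up for illustration. quote is imported from urllib.parse and receives the already decoded str, so it percent-encodes the text as UTF-8 by default (that this is what the target URL expects is an assumption).

```python
import csv
import os
import tempfile
from urllib.parse import quote

URL_BASE = "https://translate.google.com"
URL_PARAMS = "/#view=home&op=translate&sl=ru&tl=en&text="


def build_urls(path, encoding="cp1251"):
    """Return one URL per non-empty CSV row, using the first column as the text."""
    urls = []
    # Passing the encoding explicitly avoids depending on the locale default.
    with open(path, encoding=encoding, newline="") as csv_file:
        for row in csv.reader(csv_file, delimiter=","):
            if row:
                # row[0] is a str here, so no .encode() call is needed.
                urls.append(URL_BASE + URL_PARAMS + quote(row[0]))
    return urls


# Hypothetical sample data, written as cp1251 to mimic the original file.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    with open(sample_path, "w", encoding="cp1251", newline="") as sample:
        sample.write("привет,1\r\nмир,2\r\n")
    demo_urls = build_urls(sample_path)

for url in demo_urls:
    print(url)
```

If the wrong encoding is passed, the result is usually garbled text or a UnicodeDecodeError, so it is worth confirming how the file was saved.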

Related

Parse string representation of binary loaded from CSV [duplicate]

I have used tweepy to store the text of tweets in a csv file using Python csv.writer(), but I had to encode the text in utf-8 before storing, otherwise tweepy throws a weird error.
Now, the text data is stored like this:
"b'Lorem Ipsum\xc2\xa0Assignment '"
I tried to decode this using the following code (there is more data in other columns; the text is in the fourth column, index 3):
import csv
with open('data.csv', 'rt', encoding='utf-8') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        print(row[3])
But, it doesn't decode the text. I cannot use .decode('utf-8') as the csv reader reads data as strings i.e. type(row[3]) is 'str' and I can't seem to convert it into bytes, the data gets encoded once more!
How can I decode the text data?
Edit: Here's a sample line from the csv file:
67783591545656656999,3415844,1450443669.0,b'Virginia School District Closes After Backlash Over Arabic Assignment: The Augusta County school district in\xe2\x80\xa6 | #abcde',52,18
Note: If the solution is in the encoding process, please note that I cannot afford to download the entire data again.

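If the content is available as bytes, a simple approach is to decode it to str and wrap the result in StringIO so that csv.reader can read it: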
import csv
from io import StringIO
byte_content = b"iam byte content"
content = byte_content.decode()
file = StringIO(content)
csv_data = csv.reader(file, delimiter=",")

If your input file really contains strings with Python b prefixes on them, one way to work around it (even though it's not really a valid format for CSV data to contain) would be to use Python's ast.literal_eval() function, as @Ry suggested, although I would use it in a slightly different manner, as shown below.
This provides a safe way to parse strings in the file that are prefixed with a b indicating they are byte strings. The rest are passed through unchanged.
Note that this doesn't require reading the entire CSV file into memory.
import ast
import csv
def _parse_bytes(field):
    """Convert string represented in Python byte-string literal b'' syntax into
    a decoded character string - otherwise return it unchanged.
    """
    try:
        result = ast.literal_eval(field)
    except (ValueError, SyntaxError):
        return field  # not a Python literal, pass it through
    # Only byte-string literals are decoded; other literals (e.g. numbers) stay as text.
    return result.decode() if isinstance(result, bytes) else field
def my_csv_reader(filename, /, **kwargs):
    with open(filename, 'r', newline='') as file:
        for row in csv.reader(file, **kwargs):
            yield [_parse_bytes(field) for field in row]
reader = my_csv_reader('bytes_data.csv', delimiter=',')
for row in reader:
    print(row)

You can use ast.literal_eval to convert the incorrect fields back to bytes safely:
import ast
def _parse_bytes(bytes_repr):
    result = ast.literal_eval(bytes_repr)
    if not isinstance(result, bytes):
        raise ValueError("Malformed bytes repr")
    return result
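
As a rough check of the ast.literal_eval approach, the sketch below applies it to the sample line quoted in the question, read from an in-memory buffer instead of a file. The helper name decode_bytes_repr is hypothetical, and UTF-8 is assumed because the question says the text was encoded as utf-8 before being stored.

```python
import ast
import csv
import io

# The sample line from the question; the raw string keeps the backslashes literal,
# exactly as they would appear in the CSV file on disk.
SAMPLE_LINE = (
    r"67783591545656656999,3415844,1450443669.0,"
    r"b'Virginia School District Closes After Backlash Over Arabic Assignment: "
    r"The Augusta County school district in\xe2\x80\xa6 | #abcde',52,18"
)


def decode_bytes_repr(field, encoding="utf-8"):
    """Decode a field that looks like a bytes repr; return anything else unchanged."""
    if not field.startswith(("b'", 'b"')):
        return field
    try:
        value = ast.literal_eval(field)
    except (ValueError, SyntaxError):
        return field  # looked like a bytes repr but was malformed
    return value.decode(encoding) if isinstance(value, bytes) else field


rows = [
    [decode_bytes_repr(field) for field in row]
    for row in csv.reader(io.StringIO(SAMPLE_LINE), delimiter=",")
]
print(ascii(rows[0][3]))
```

Because only fields starting with a b prefix are evaluated, the numeric columns stay as plain strings.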

Extract numbers and text from csv file with Python3.X

I am trying to extract data from a csv file with python 3.6.
The data are both numbers and text (it's url addresses):
file_name = [-0.47, 39.63, http://example.com]
On multiple forums I found this kind of code:
data = numpy.genfromtxt(file_name, delimiter=',', skip_header=skiplines,)
But this works for numbers only, the url addresses are read as NaN.
If I add dtype:
data = numpy.genfromtxt(file_name, delimiter=',', skip_header=skiplines, dtype=None)
The url addresses are read correctly, but they got a "b" at the beginning of the address, such as:
b'http://example.com'
How can I remove that? How can I just have the simple string of text?
I also found this option:
file = open(file_path, "r")
csvReader = csv.reader(file)
for row in csvReader:
    variable = row[i]
    coordList.append(variable)
but it seems it has some issues with python3.
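
One possible way to get plain str values is to stay with the csv module in Python 3 and convert the numeric columns by hand. The sketch below assumes two numeric columns followed by a URL and one header line to skip; the column layout, the read_coords name, and the sample data are illustrative assumptions. (numpy.genfromtxt also accepts an encoding argument in recent NumPy versions, which may avoid the b prefix, but that is not shown here.)

```python
import csv
import io

# Hypothetical sample data: a header line, then latitude, longitude, URL.
SAMPLE = (
    "lat,lon,url\n"
    "-0.47,39.63,http://example.com\n"
    "1.25,40.10,http://example.org\n"
)


def read_coords(handle, skip_lines=1):
    """Return a list of (float, float, str) tuples from an open text handle."""
    reader = csv.reader(handle)
    for _ in range(skip_lines):
        next(reader, None)  # discard header lines
    rows = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        rows.append((float(row[0]), float(row[1]), row[2].strip()))
    return rows


coord_list = read_coords(io.StringIO(SAMPLE))
print(coord_list)
```

With a real file, the handle would come from open(file_path, "r", newline="", encoding=...) instead of io.StringIO.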

How to handle utf-8 text with Python 3?

I need to parse various text sources and then print / store it somewhere.
Every time a non ASCII character is encountered, I can't correctly print it as it gets converted to bytes, and I have no idea how to view the correct characters.
(I'm quite new to Python, I come from PHP where I never had any utf-8 issues)
The following is a code example:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import codecs
import feedparser
url = "http://feeds.bbci.co.uk/japanese/rss.xml"
feeds = feedparser.parse(url)
title = feeds['feed'].get('title').encode('utf-8')
print(title)
file = codecs.open("test.txt", "w", "utf-8")
file.write(str(title))
file.close()
I'd like to print and write in a file the RSS title (BBC Japanese - ホーム) but instead the result is this:
b'BBC Japanese - \xe3\x83\x9b\xe3\x83\xbc\xe3\x83\xa0'
This happens both on screen and in the file. Is there a proper way to do this?

In Python 3, bytes and str are two different types, and str is used to represent any kind of text (including Unicode). When you encode() something, you convert it from its str representation to its bytes representation in a specific encoding.
In your case, to get the decoded strings, you just need to remove the encode('utf-8') part:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import codecs
import feedparser
url = "http://feeds.bbci.co.uk/japanese/rss.xml"
feeds = feedparser.parse(url)
title = feeds['feed'].get('title')
print(title)
file = codecs.open("test.txt", "w", encoding="utf-8")
file.write(title)
file.close()
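
The difference between the two types can be sketched without the network, using the feed title from the question as a hard-coded string. The temporary file is only there to make the example self-contained.

```python
import os
import tempfile

title = "BBC Japanese - ホーム"   # str: a sequence of Unicode characters
encoded = title.encode("utf-8")   # bytes: the UTF-8 representation of that text

# str() on a bytes object gives its repr, which is where the b'...' output comes from.
print(str(encoded))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.txt")
    # Write the str itself and let open() handle the encoding.
    with open(path, "w", encoding="utf-8") as out_file:
        out_file.write(title)
    with open(path, encoding="utf-8") as in_file:
        round_trip = in_file.read()

print(round_trip == title)
```

The built-in open() with an encoding argument does the same job as codecs.open here.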

The function print(A) in Python 3 encodes the string A with the encoding of standard output (sys.stdout.encoding), which was 'gbk' on my console. If A contains characters that this encoding cannot represent, print raises a UnicodeEncodeError.
As a workaround, you can drop the characters that gbk cannot represent before printing:
print(A.encode('gbk','ignore').decode('gbk'))

To write JSON data that contains Japanese characters, open the file as UTF-8 and pass ensure_ascii=False:
import json
def jsonFileCreation(messageData, fileName):
    with open(fileName, "w", encoding="utf-8") as outfile:
        json.dump(messageData, outfile, indent=8, sort_keys=False, ensure_ascii=False)
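
The effect of ensure_ascii can be seen with json.dumps on a small dictionary. The message_data value below is a made-up example.

```python
import json

message_data = {"title": "BBC Japanese - ホーム"}  # hypothetical sample data

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(message_data)

# With ensure_ascii=False the characters are kept as they are.
readable = json.dumps(message_data, ensure_ascii=False)

print(escaped)
# Both forms load back to the same data.
print(json.loads(escaped) == json.loads(readable))
```

Both outputs are valid JSON; the second is simply easier to read in a UTF-8 file.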

python3 converting unreadable character to readable character

Hi, I have a text file that I am reading and parsing, but the file contains some text like this:
\u03a4\u03c1\u03b5\u03b9\u03c2 \u03bd\u03b5\u03ba\u03c1\u03bf\u03af \u03b1\u03c0\u03cc \u03c0\u03c4\u03ce\u03c3\u03b7 \u03bf\u03b2\u03af\u03b4\u03b1\u03c2 \u03c3\u03b5 \u03c3\u03c0\u03af\u03c4\u03b9 \u03c3\u03c4\u03bf \u03a3\u03b9\u03bd\u03ac
How can I convert it to readable text with Python?
I tried the following code, but it doesn't work:
def encodeDecode(self, data):
    new_data = ''
    for ch in data:
        #let = ch.encode('utf-8').decode('utf-8')
        #new_data += let
        new_data += repr(ch)[1:2]
    return new_data

There is no problem with your string; it is Unicode data. How you handle it depends on what you want to do with it. For example, if you want to print it, you can just print it, since strings in Python 3 are Unicode.
>>> s="""\u03a4\u03c1\u03b5\u03b9\u03c2 \u03bd\u03b5\u03ba\u03c1\u03bf\u03af \u03b1\u03c0\u03cc \u03c0\u03c4\u03ce\u03c3\u03b7 \u03bf\u03b2\u03af\u03b4\u03b1\u03c2 \u03c3\u03b5 \u03c3\u03c0\u03af\u03c4\u03b9 \u03c3\u03c4\u03bf \u03a3\u03b9\u03bd\u03ac """
>>> print(s)
Τρεις νεκροί από πτώση οβίδας σε σπίτι στο Σινά
But if you want to write your data to a file, you need to use a proper encoding for that file.
You can do this by passing the encoding to the open() function when you open the file for writing.

You could also convert it using Python's json module; this would also work in Python 2.x:
>>> import json
>>> with open('input.txt', 'r') as f:
...     raw = f.read().strip()  # a trailing newline is not allowed inside a JSON string
...
>>> json_str = '"%s"' % raw.replace('"', '\\"') # wrap the input string in double quotes
>>> print(json.loads(json_str))
Τρεις νεκροί από πτώση οβίδας σε σπίτι στο Σινά
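
If the file really contains literal backslash-u sequences (six characters each) and not the characters themselves, another option is a regular-expression substitution. This is a different technique from the json approach, shown only as a sketch: the unescape_unicode name is made up, and surrogate pairs for characters outside the Basic Multilingual Plane are not handled.

```python
import re

# Matches a literal backslash, the letter u, and exactly four hex digits.
_UNICODE_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def unescape_unicode(text):
    """Replace each literal \\uXXXX sequence with the character it names."""
    return _UNICODE_ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), text)


# Raw string: the backslashes are literal, as they would be when read from a file.
raw = r"\u03a4\u03c1\u03b5\u03b9\u03c2 \u03bd\u03b5\u03ba\u03c1\u03bf\u03af"
readable = unescape_unicode(raw)

print(len(raw), len(readable))
```

Text that contains no escape sequences passes through unchanged.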

How do I write a map object to a csv?

I have a map object x such that
list(x)
produces a list. I am presently trying:
def writethis(name, mapObject):
    import csv
    with open(name, 'wb') as f:
        csv.writer(f, delimiter = ',').writerows(list(mapObject))
but I either get an error saying the list produced isn't a valid sequence, or, if I have another function pass writethis a map object or list in the interpreter, a blank CSV file.
I'm not very familiar with the csv module. What's going on?

First, note that you open files for CSV writing differently in Python 3. You want
with open(name, 'w', newline='') as f:
instead. Second, writerows expects a sequence of sequences, and it seems like you're only passing it a single sequence. If you only have one row to write, then
import csv
def writethis(name, mapObject):
    with open(name, 'w', newline='') as f:
        csv.writer(f, delimiter = ',').writerow(list(mapObject))
should work.
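
The difference between writerow and writerows can be sketched with in-memory buffers. The helper names and sample data are illustrative. The last lines also show that a map object is exhausted after one pass, which is one possible cause of an empty file when the same map is consumed twice.

```python
import csv
import io


def write_single_row(handle, map_object):
    """Write all items of map_object as the cells of one CSV row."""
    csv.writer(handle, delimiter=",").writerow(list(map_object))


def write_rows(handle, map_object):
    """Write each item of map_object (itself a sequence) as its own CSV row."""
    csv.writer(handle, delimiter=",").writerows(map_object)


single_buffer = io.StringIO()
write_single_row(single_buffer, map(str.upper, ["a", "b", "c"]))

rows_buffer = io.StringIO()
write_rows(rows_buffer, map(lambda n: (n, n * n), range(3)))

# A map object can only be iterated once.
squares = map(lambda n: n * n, range(3))
first_pass = list(squares)
second_pass = list(squares)

print(repr(single_buffer.getvalue()))
print(repr(rows_buffer.getvalue()))
```

When writing to a real file, open it with open(name, 'w', newline='') as described above.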
