Python 3 encoding issue

I create a JSON file with VSC (or Notepad++). It contains an array of strings; one of the strings is "GRÜN". Then I read the file with Python 3:
import codecs
import json

with codecs.open(file, 'r', 'iso-8859-15') as infile:
    dictionary = json.load(infile)
If I print the array (inside "dictionary") to the console, I see: "GRÃ\x9cN".
How can I convert "GRÃ\x9cN" to "GRÜN" again?
I tried reading the JSON file with the "iso-8859-1" codec too, but the issue still occurred.
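The file was almost certainly saved as UTF-8 (the default in VSC and Notepad++), so decoding its bytes as ISO-8859-15 produces exactly this mojibake: "Ü" is stored as the two bytes 0xC3 0x9C, which come back as "Ã" and "\x9c". A minimal sketch of both the proper fix and a repair of an already-garbled string, assuming the file really is UTF-8:

import json

# Proper fix: read the file with the encoding it was written in.
with open(file, 'r', encoding='utf-8') as infile:
    dictionary = json.load(infile)

# Repairing a string that was already decoded wrongly: undo the bad step.
s = 'GRÃ\x9cN'
print(s.encode('iso-8859-15').decode('utf-8'))  # -> GRÜN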

Read .csv that contains commas

I have a .csv file that contains multiple text columns. These texts contain commas, which makes things messy when I try to read the file into Python.
When I tried:
import pandas as pd
directory = 'some directory'
dataset = pd.read_csv(directory)
I got the following error:
ParserError: Error tokenizing data. C error: Expected 3 fields in line 42, saw 5
After doing some research, I found the clevercsv package.
So, I ran:
import clevercsv as csv
dataset = csv.read_csv(directory)
Running this, I got the error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 4359705: character maps to <undefined>
To overcome this, I tried:
dataset = csv.read_csv(directory, encoding="utf8")
However, 10 hours later my computer was still working on reading it. So I expect that something went wrong there.
Furthermore, when I open the file in Excel, it splits the cells correctly. So I tried saving the .csv file as an .xlsx, planning to write it back out from Python as a .csv with an uncommon delimiter ('~'). However, when I save the .csv file as an .xlsx, the file gets smaller, which indicates that only part of the file is saved, and that is not what I want.
Lastly, I have tried the solutions given here and here, but neither seems to work for my problem.
Given that Excel reads in the file without problems, I do expect that it should be possible to read it into Python as well. Who can help me with this?
UPDATE:
When using dataset = pd.read_csv(directory, sep=',', error_bad_lines=False) I manage to read in the .csv, but many lines are skipped. Is there a better way to do this?
pandas should work: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Did you try something like dataset = pd.read_csv(directory, sep=',', header=None)?
Regards
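For what it's worth, a ParserError like "Expected 3 fields in line 42, saw 5" usually means the extra commas are not protected by quotes, or the quoting dialect is not what pandas expects. A sketch of options that often help, assuming a comma-separated file with double-quoted fields (the path is the placeholder from the question; encoding_errors and on_bad_lines require pandas >= 1.3):

import pandas as pd

dataset = pd.read_csv(
    'some directory',           # placeholder path from the question
    sep=',',
    quotechar='"',              # keeps commas inside quoted fields together
    engine='python',            # slower but more forgiving tokenizer
    encoding='utf-8',
    encoding_errors='replace',  # sidesteps the UnicodeDecodeError
    on_bad_lines='warn',        # report ragged rows instead of silently dropping them
)

on_bad_lines supersedes the deprecated error_bad_lines and at least tells you which rows were affected.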

How to read an Excel file with a .csv extension in Jupyter?

I have an Excel file with a .csv extension.
I want to read it in a Jupyter notebook.
My code is:
real_csv_data = pd.read_csv("/Users/xxx/Downloads/myfile.csv")
and I got this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
myfile is an Excel file with a .csv extension.
I tried the same with a .txt file and that worked fine.
Any idea?
OK, I added a parameter to pd.read_csv().
now it looks like:
real_csv_data = pd.read_csv("/Users/xxx/Downloads/myfile.csv", encoding="utf-16")
and now it works fine.
Also, if someone is interested, you can add
sep='\t'
for example, to get the data in a nice table.
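For context, the 0xff at position 0 in the original error is the telltale first byte of a UTF-16 byte-order mark (FF FE), which Excel writes when exporting as "Unicode Text" (a tab-separated format, hence sep='\t'). You can check the first bytes yourself before guessing at encodings; a small sketch using the path from the question:

# Peek at the start of the file to spot a byte-order mark.
with open("/Users/xxx/Downloads/myfile.csv", "rb") as f:
    print(f.read(4))  # b'\xff\xfe...' means UTF-16 little-endian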

Python 3, can I tell if a file opened in binary mode contains text?

I am upgrading some code from Python 2 to Python 3.
There is a function that opens and reads files. In Python 2 there was no need to specify binary or text mode, while in Python 3 I have to choose one.
The Python 2 code is:
with open(f_path, mode=open_mode) as fp:
    content = fp.read()
This is causing me problems as it is called by various other functions where I don't necessarily know the file type in advance. (Sometimes the data is written to a zip file, other times the data is returned via an HTTP endpoint).
I expect most data will be binary image files, though CSV and text files will also be present.
What would be the best way of opening a file of unknown type and detecting if it is binary or string data?
Is it possible for example to open a file in binary mode, then detect that it contains text and convert it (or alternatively generate an exception and open it in string mode instead)?
You might try the binaryornot library.
pip install binaryornot
Then in the code:
from binaryornot.check import is_binary
is_binary(f_path)
Here is their documentation:
https://pypi.org/project/binaryornot/
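Alternatively, without a third-party dependency, you can do exactly what the question suggests: open the file in binary mode and attempt a text decode, falling back to bytes on failure. A minimal sketch, assuming UTF-8 is the only text encoding you expect:

def read_any(f_path):
    # Read raw bytes first; the file may be an image, a zip member, etc.
    with open(f_path, 'rb') as fp:
        data = fp.read()
    try:
        # If the bytes decode cleanly, treat the file as text.
        return data.decode('utf-8')
    except UnicodeDecodeError:
        # Otherwise keep the binary content as-is.
        return data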

Result shows a file full of symbols rather than text when I loop over files

I was looping over some files to copy their content into a new file, but after I ran the code the new file contained lots of symbols rather than the text content of the files I looped over.
At first, when I ran the code without an encoding argument in the open() line, it showed an error message like:
UnicodeEncodeError: 'charmap' codec can't encode character '\x8b' in position 12: character maps to <undefined>
I tried various encodings like utf-8 and latin1 but nothing worked, and when I put errors='ignore' in the open() line, the result came out as described above.
import os
import glob

folder = os.path.join('R:', os.sep, 'Files')

def notes():
    for doc in glob.glob(folder + r'\*'):
        if doc.endswith('.pdf'):
            with open(doc, 'r') as f:
                x = f.readlines()
            with open('doc1.text', 'w+') as f1:
                for line in x:
                    f1.write(line)

notes()
If I understand your example correctly and you're trying to read PDF files, your problem is not one of encoding but of file format. PDF files don't store your text in a plain character encoding; PDF is a binary container format that needs a proper parser in order to extract the text. There are a couple of Python libraries that can read PDF files (such as PyPDF2); please refer to this thread for more information: How to extract text from a PDF file?
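As a concrete illustration, here is a minimal sketch of the loop using the pypdf library (the maintained successor to PyPDF2; install with pip install pypdf). The folder layout and output file name are taken from the question:

import glob
import os
from pypdf import PdfReader

folder = os.path.join('R:', os.sep, 'Files')

with open('doc1.txt', 'w', encoding='utf-8') as out:
    for doc in glob.glob(folder + r'\*.pdf'):
        reader = PdfReader(doc)
        for page in reader.pages:
            # extract_text() can come back empty for image-only pages.
            out.write(page.extract_text() or '')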

Arabic text replaced with escape sequences when creating CSV files using python

I am trying to create a CSV file that contains Arabic tweets collected with tweepy for a project I am doing. Gathering the data works fine; however, when I write to the CSV file, all Arabic results are escaped as \xXXXX sequences,
as follows:
b'#\xd8\xa7\xd9\x84\xd9\x8a\xd9\x88\xd9\x85_\xd8\xa7\xd9\x84\xd8\xb9\xd8\xa7\xd9\x84\xd9\x85\xd9\x8a_\xd9\x84\xd9\x84\xd8\xa7\xd8\xb9\xd8\xa7\xd9\x82\xd9\x87_2017 \xd8\xa7\xd9\x84\xd8\xa5\xd8\xb9\xd8\xa7\xd9\x82\xd8\xa9 \xd8\xa7\xd9\x84\xd8\xad\xd9\x82\xd9\x8a\xd9\x82\xd9\x8a\xd8\xa9 \xd8\xa7\xd8\xb9\xd8\xa7\xd9\x82\xd8\xa9 \xd8\xa7\xd9\x84\xd9\x81\xd9\x83\xd8\xb1 \xd9\x88\xd9\x84\xd9\x8a\xd8\xb3\xd8\xaa \xd8\xa7\xd8\xb9\xd8\xa7\xd9\x82\xd8\xa9
I looked at many previously asked questions, and all I could find were suggestions for Python 2 or answers similar to the one I am writing. When I was creating JSON files instead, I used ensure_ascii=False, but I couldn't find anything similar for CSV. Below is my code:
import codecs
import csv

with codecs.open('tweets.csv', 'a', encoding='utf-8') as file:
    fieldnames = ['tweet', 'country']
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    data = {'tweet': status.text, 'country': status.place.full_name}
    writer.writerow(data)
I tried adding .encode('utf-8') to status.text and status.place as well, but that also didn't work. Any suggestions?
You have to make sure the Arabic string you have is decoded from UTF-8 bytes before you write it. Assuming status.text is of type bytes, you should write text = status.text.decode('utf-8'). (You may have to do this for status.place.full_name too.) If it is of type str, it won't have a decode() method; a str object can be written as-is without producing escape sequences.
If you try to specify the encoding of a bytes object (like the one you presumably have) as 'utf-8', that won't work, because the text already consists of UTF-8 bytes. To get UTF-8 characters into the file you must call decode() on the bytes object; that way you write the UTF-8 characters rather than the UTF-8 bytes.
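A minimal sketch of that suggestion, handling both cases; status comes from tweepy as in the question, and the bytes-vs-str check is the only addition:

import csv

def to_text(value):
    # Decode only if we were handed bytes; str passes through unchanged.
    return value.decode('utf-8') if isinstance(value, bytes) else value

with open('tweets.csv', 'a', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['tweet', 'country'])
    writer.writerow({'tweet': to_text(status.text),
                     'country': to_text(status.place.full_name)})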
