How to read an Excel file with a .csv extension in Jupyter?

I have an Excel file with a .csv extension. I want to read it in a Jupyter notebook. My code is:
real_csv_data = pd.read_csv("/Users/xxx/Downloads/myfile.csv")
and I got this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
myfile is an Excel file with a .csv extension. I tried the same with a .txt file and it worked. Any ideas?

OK, I added a parameter to pd.read_csv(). Now it looks like:
real_csv_data = pd.read_csv("/Users/xxx/Downloads/myfile.csv", encoding="utf-16")
and now it works fine.
Also, if anyone is interested, you can add
sep = '\t'
for example, to get the data in a nice table.
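A minimal, self-contained sketch of the fix above (the file name and contents are made up for illustration): Excel's "Unicode Text" export writes UTF-16 with tab separators, which is why both `encoding="utf-16"` and `sep="\t"` are needed. The leading 0xff in the traceback is the first byte of the UTF-16 byte-order mark.

```python
import pandas as pd

# Simulate an Excel "Unicode Text" export: UTF-16 with tab separators.
# (File name and data are illustrative, not from the question.)
with open("myfile.csv", "w", encoding="utf-16") as f:
    f.write("name\tage\nAlice\t30\nBob\t25\n")

# Without encoding="utf-16", read_csv fails with a UnicodeDecodeError,
# because the file starts with the UTF-16 byte-order mark (0xFF 0xFE).
real_csv_data = pd.read_csv("myfile.csv", encoding="utf-16", sep="\t")
print(real_csv_data.shape)  # two rows, two columns
```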

Related

python 3 coding issue

I create a JSON file with VS Code (or Notepad++). It contains an array of strings; one of the strings is "GRÜN". Then I read the file with Python 3:
with codecs.open(file, 'r', "iso-8859-15") as infile:
    dictionary = json.load(infile)
If I print the array (inside "dictionary") to the console, I see: "GRÃ\x9cN"
How can I convert "GRÃ\x9cN" back to "GRÜN"?
I tried reading the JSON file with the "iso-8859-1" codec too, but the issue still occurred.
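A sketch of what likely happened and how to undo it (the file name is illustrative): VS Code and Notepad++ save UTF-8 by default, so decoding the file as iso-8859-15 turned the two UTF-8 bytes of "Ü" (0xC3 0x9C) into the two characters "Ã\x9c". Reading with the right encoding avoids the problem; if you already have the mojibake string, re-encoding with the wrong codec and decoding as UTF-8 reverses it.

```python
import json

# The file was written as UTF-8, so read it back as UTF-8.
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(["GRÜN"], f, ensure_ascii=False)

with open("data.json", "r", encoding="utf-8") as f:
    data = json.load(f)
print(data[0])  # GRÜN

# If you already have the mojibake string, reverse the wrong decode:
broken = "GRÃ\x9cN"  # UTF-8 bytes mistakenly decoded as iso-8859-15
fixed = broken.encode("iso-8859-15").decode("utf-8")
print(fixed)  # GRÜN
```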

Read .csv that contains commas

I have a .csv file that contains multiple columns with texts in it. These texts contain commas, which makes things messy when I try to read the file into Python.
When I tried:
import pandas as pd
directory = 'some directory'
dataset = pd.read_csv(directory)
I got the following error:
ParserError: Error tokenizing data. C error: Expected 3 fields in line 42, saw 5
After doing some research, I found the clevercsv package.
So, I ran:
import clevercsv as csv
dataset = csv.read_csv(directory)
Running this, I got the error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 4359705: character maps to <undefined>
To overcome this, I tried:
dataset = csv.read_csv(directory, encoding="utf8")
However, 10 hours later my computer was still working on reading it. So I expect that something went wrong there.
Furthermore, when I open the file in Excel, the cells are split correctly. So what I tried was to save the .csv file as a .xlsx and then save it back to .csv in Python with an uncommon delimiter ('~'). However, when I save my .csv file as a .xlsx file, the file gets smaller, which suggests that only part of the file is saved, and that is not what I want.
Lastly, I have tried the solutions given here and here. But neither seem to work for my problem.
Given that Excel reads in the file without problems, I do expect that it should be possible to read it into Python as well. Who can help me with this?
UPDATE:
When using dataset = pd.read_csv(directory, sep=',', error_bad_lines=False) I manage to read in the .csv, but many lines are skipped. Is there a better way to do this?
pandas should work: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Have you tried something like dataset = pd.read_csv(directory, sep=',', header=None)?
Regards
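For what it's worth, a small sketch with made-up data: both pandas and the stdlib csv module handle commas inside properly quoted fields, so the ParserError usually means some rows are unquoted or malformed. On newer pandas versions, `on_bad_lines="skip"` (the replacement for the deprecated `error_bad_lines=False`) skips such rows.

```python
import io
import pandas as pd

# Commas inside double-quoted fields are parsed as part of the field,
# not as separators. (Data is illustrative.)
raw = 'id,text,score\n1,"hello, world",5\n2,"a, b, and c",3\n'
dataset = pd.read_csv(io.StringIO(raw))
print(dataset.shape)           # 2 rows, 3 columns
print(dataset.loc[0, "text"])  # hello, world

# On newer pandas, skip malformed rows instead of failing:
bad = 'a,b,c\n1,2,3\n1,2,3,4,5\n'
ok = pd.read_csv(io.StringIO(bad), on_bad_lines="skip")
print(len(ok))  # 1
```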

result shows file full of some symbols rather than text when I loop files

I was looping over some files to copy their content into a new file, but after I ran the code, the new file was full of symbols rather than the text content of the files I looped over.
First, when I ran the code without the encoding argument in the open() call, it raised an error:
UnicodeEncodeError: 'charmap' codec can't encode character '\x8b' in position 12: character maps to <undefined>
I tried various encodings like utf-8 and latin-1, but nothing worked, and when I added errors='ignore' to the open() call, the result was as described above.
import os
import glob

folder = os.path.join('R:', os.sep, 'Files')

def notes():
    for doc in glob.glob(folder + r'\*'):
        if doc.endswith('.pdf'):
            with open(doc, 'r') as f:
                x = f.readlines()
            with open('doc1.text', 'w+') as f1:
                for line in x:
                    f1.write(line)

notes()
If I understand your example correctly and you're trying to read PDF files, your problem is not one of encoding but of file format. PDF files don't simply store your text in some character encoding; PDF is its own binary format that you need to parse in order to extract the text. There are a couple of Python libraries that can read PDF files (such as PyPDF2); please refer to this thread for more information: How to extract text from a PDF file?
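As a stdlib-only illustration of the point above (the helper and file names are my own, not from the answer): a PDF is a binary container, and you can recognize one by its "%PDF-" magic bytes rather than trusting the file extension, then route it to a PDF library instead of open(..., 'r').

```python
def looks_like_pdf(path):
    # Every PDF file starts with the magic bytes "%PDF-" (e.g. b"%PDF-1.7").
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"

# Illustrative files:
with open("doc.pdf", "wb") as f:
    f.write(b"%PDF-1.7 ...binary content...")
with open("note.txt", "w", encoding="utf-8") as f:
    f.write("plain text")

print(looks_like_pdf("doc.pdf"))   # True
print(looks_like_pdf("note.txt"))  # False
```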

appJar Internationalisation not working

I am trying to use appJar's internationalisation feature with English and Spanish. However, when I use the English config file, everything works fine, but when I use Spanish one, I get errors. When the file is encoded in ANSI, I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 0: invalid start byte
I have also tried to encode it in UTF-8, but again, I get an error:
configparser.MissingSectionHeaderError: File contains no section headers.
file: 'SPANISH.ini', line: 1
'\ufeff[TABBEDFRAME]\r\n'
The English config file is exactly the same as the Spanish one - the only difference being the translation.
I would be grateful for any help and suggestions.
It turns out that the Spanish config file seemed to be corrupted - when I copied the contents over to a new file, it worked fine.
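A hedged guess at the underlying cause, based on the '\ufeff' in the traceback: the original Spanish file was saved as UTF-8 with a byte-order mark, which configparser treats as part of the first line. Reading with encoding='utf-8-sig' strips the BOM (file name and contents below are illustrative):

```python
import configparser

# Simulate an INI file saved as "UTF-8 with BOM".
with open("SPANISH.ini", "w", encoding="utf-8-sig") as f:
    f.write("[TABBEDFRAME]\ntitulo = Configuración\n")

config = configparser.ConfigParser()
# With encoding="utf-8", the BOM '\ufeff' stays glued to "[TABBEDFRAME]"
# and configparser raises MissingSectionHeaderError; "utf-8-sig" strips it.
config.read("SPANISH.ini", encoding="utf-8-sig")
print(config["TABBEDFRAME"]["titulo"])  # Configuración
```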

how to save a string in 'utf-8' format in a .xlsx using pandas to_excel (to_csv is able to save it as .csv)

Since I am able to save strings with to_csv using 'utf-8' encoding, I expect to be able to do the same with to_excel. This is not an encoding issue on my side. None of the previous threads I saw discuss this issue.
I am using Python 2.7.12 on Windows 7 (Anaconda) and pandas 0.18.1.
I have two questions related to saving a pandas dataframe containing special characters (encoded as 'utf-8') as a .csv or .xlsx file.
For example:
import pandas as pd
# Create a Pandas dataframe from the data.
df = pd.DataFrame({'Data': ['1', 'as', '?%','ä']})
I can save the dataframe as a .csv file without any issue:
df.to_csv('test_csv.csv', sep=',', encoding='utf-8')
and it works. When importing the data I need to choose 'utf-8' in Excel and everything is fine.
Now if I try to save the same dataframe as an .xlsx then it doesn't work.
I have the following code:
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('pandas_simple.xlsx', engine='xlsxwriter', options={'encoding':'utf-8'})
# Convert the dataframe to an XlsxWriter Excel object.
df.to_excel(writer, sheet_name='Sheet1',encoding='utf-8')
writer.save()
and I got the following error message:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 7: ordinal not in range(128)
I am not 100% sure to use the correct option to set the encoding:
options={'encoding':'utf-8'}
and
encoding='utf-8'
since it is not clear to me how to proceed from the documentation.
Any idea how to have this working ?
Bonus question related to df.to_csv: is there a way to use some special character as a separator? For some reason, the code I am migrating from R to Python uses sep='¤'. I have tried to encode this special character in every possible way, but it always failed. Is it possible to do that?
Thanks a lot
Cheers
Fabien
If you are using xlsxwriter as the Excel writing engine then the encoding='utf-8' is ignored because the XlsxWriter module doesn't use it.
XlsxWriter requires that the string data is encoded as utf8. After that it handles the strings automatically.
So you will need to ensure that the string data you are writing is encoded as utf8 via Pandas: either when you read it or after the data is in the data frame.
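On the bonus question about sep='¤': in Python 3 any single character works as a CSV delimiter, in pandas and in the stdlib csv module alike; the failures under Python 2.7 are the usual byte/unicode mismatch, where the separator has to be passed as a properly encoded value. A stdlib sketch under Python 3 (data is illustrative):

```python
import csv

rows = [["Data", "extra"], ["ä", "?%"], ["1", "as"]]

# Any single character can be the delimiter in Python 3, including '¤'.
with open("test.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="¤")
    writer.writerows(rows)

with open("test.csv", "r", newline="", encoding="utf-8") as f:
    back = list(csv.reader(f, delimiter="¤"))
print(back)  # [['Data', 'extra'], ['ä', '?%'], ['1', 'as']]
```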
