read one value from read_csv error - python-3.x

tst = pd.read_csv('/Users/me/Desktop/stuff/Et2Load.csv', header=0,delimiter="\t", quoting=3)
print(tst.head(2)) # ok
#print(tst['date'][0])
I made up this file: one header line and two data lines, three columns.
id,date,coldata
0 1,August 18 2016,"With all this stuff going do...
1 2,August 19 2016,this is a great movie. The mu...
I cannot access a specific "cell". This line raises an error:
print(tst['date'][0])
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas/index.c:4154)
File "pandas/index.pyx", line 159, in pandas.index.IndexEngine.get_loc (pandas/index.c:4018)
File "pandas/hashtable.pyx", line 675, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12368)
File "pandas/hashtable.pyx", line 683, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12322)
KeyError: 'date'

Well, this is the big secret:
set_index('<name of the column that is going to be used as the id>')
It was difficult to find this one.
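A side note on the KeyError itself: the head() output in the question shows each row as one comma-joined string, which suggests the file is comma-separated while it was read with delimiter="\t". In that case pandas would see a single column named 'id,date,coldata', and 'date' would not be a valid key. A minimal sketch, assuming that is what happened and using made-up inline data in place of Et2Load.csv:

```python
import io

import pandas as pd

# Made-up stand-in for the file in the question (comma-separated).
raw = (
    "id,date,coldata\n"
    "1,August 18 2016,first review\n"
    "2,August 19 2016,second review\n"
)

# Reading with a tab delimiter leaves every row in one single column.
wrong = pd.read_csv(io.StringIO(raw), header=0, delimiter="\t", quoting=3)
print(list(wrong.columns))

# Reading with a comma delimiter yields the three expected columns.
tst = pd.read_csv(io.StringIO(raw), header=0, delimiter=",")
print(list(tst.columns))

# Now a single "cell" can be read by column label and row label.
first_date = tst["date"][0]
print(first_date)
```

If the real file contains quoted text with commas inside it, the quoting=3 argument from the question would probably have to go as well, since it tells the parser to ignore quotes.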

Related

Error in running Python code for PLSR modelling

I am trying to develop a model using PLSR (Partial Least Squares Regression) in Python 3 with the code provided at https://github.com/pgbrodrick/ensemblePLSR. Sample data is also provided.
I am using Python (3.7.3) with the modules scikit-learn (0.20.2) and pandas (0.23.3).
When I try to run the code, it gives me an error:
>>> python3 ensemble_plsr.py example_settings.txt
/usr/lib/python3/dist-packages/sklearn/externals/joblib.py:1: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import imp
n bad bands 57
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/pandas/core/indexes/base.py", line 3078, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: -1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "ensemble_plsr.py", line 173, in <module>
df.pop(col)
File "/usr/lib/python3/dist-packages/pandas/core/generic.py", line 760, in pop
result = self[item]
File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 2688, in __getitem__
return self._getitem_column(key)
File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 2695, in _getitem_column
return self._get_item_cache(key)
File "/usr/lib/python3/dist-packages/pandas/core/generic.py", line 2491, in _get_item_cache
values = self._data.get(item)
File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 4115, in get
loc = self.items.get_loc(item)
File "/usr/lib/python3/dist-packages/pandas/core/indexes/base.py", line 3080, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: -1
In short, line 173 of ensemble_plsr.py tries to remove a column that does not exist in the DataFrame, hence the KeyError during execution. Under the hood, the pop method looks up the column label on the DataFrame, and that lookup raises the error when the label is missing. There are different ways to resolve this, but the following will clear the error you are seeing:
Replace lines 172 & 173 in ensemble_plsr.py with the following:
for col in sf.get_setting('ignore columns'):
    if col != 'nothing_here':
        df.pop(col)
Replace line 16 in example_settings.txt with the following:
ignore columns(any other columns to remove) = nothing_here
Good news: you're done with this issue. Bad news: you're going to hit the next error down the line, but you're on your way!
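As an alternative to a sentinel value, the removal can be guarded against missing labels in general. This is only a sketch with a made-up DataFrame and ignore list (the names band_1, band_2, site and ignore_columns are hypothetical, not taken from ensemble_plsr.py); the label -1 mimics the key from the traceback:

```python
import pandas as pd

# Hypothetical data and ignore list; -1 stands in for a label that is not a column.
df = pd.DataFrame({"band_1": [0.1, 0.2], "band_2": [0.3, 0.4], "site": ["a", "b"]})
ignore_columns = ["site", -1]

# Option 1: pop only the labels that actually exist in the frame.
popped = df.copy()
for col in ignore_columns:
    if col in popped.columns:
        popped.pop(col)

# Option 2: drop in one call and let pandas skip the missing labels.
dropped = df.drop(columns=ignore_columns, errors="ignore")

print(list(popped.columns))
print(list(dropped.columns))
```

Either option leaves the settings file untouched, at the cost of silently ignoring a misspelled column name.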

Fuzzy String Matching With Pandas and FuzzyWuzzy; KeyError: 'name'

I have a data file that looks like this:
And I have another data file that has all the correct country names.
To match the two files, I am using the code below:
from fuzzywuzzy import process
import pandas as pd

names_array=[]
ratio_array=[]

def match_names(wrong_names,correct_names):
    for row in wrong_names:
        x=process.extractOne(row, correct_names)
        names_array.append(x[0])
        ratio_array.append(x[1])
    return names_array,ratio_array

#Wrong country names dataset
df=pd.read_csv("wrong-country-names.csv",encoding="ISO-8859-1")
wrong_names=df['name'].dropna().values

#Correct country names dataset
choices_df=pd.read_csv("country-names.csv",encoding="ISO-8859-1")
correct_names=choices_df['name'].values

name_match,ratio_match=match_names(wrong_names,correct_names)
df['correct_country_name']=pd.Series(name_match)
df['country_names_ratio']=pd.Series(ratio_match)
df.to_csv("string_matched_country_names.csv")
print(df[['name','correct_country_name','country_names_ratio']].head(10))
I get the below error:
runfile('C:/Users/Drashti Bhatt/Desktop/untitled0.py', wdir='C:/Users/Drashti Bhatt/Desktop')
Traceback (most recent call last):
File "<ipython-input-155-a1fd87d9f661>", line 1, in <module>
runfile('C:/Users/Drashti Bhatt/Desktop/untitled0.py', wdir='C:/Users/Drashti Bhatt/Desktop')
File "C:\Users\Drashti Bhatt\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
execfile(filename, namespace)
File "C:\Users\Drashti Bhatt\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/Drashti Bhatt/Desktop/untitled0.py", line 17, in <module>
wrong_names=df['name'].dropna().values
File "C:\Users\Drashti Bhatt\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2927, in __getitem__
indexer = self.columns.get_loc(key)
File "C:\Users\Drashti Bhatt\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2659, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'name'
Any help on this will be much appreciated! Thanks much!
Your code contains, for example, wrong_names=df['name'].dropna().values (the Traceback names it as the "offending" line).
Now look at the picture of your DataFrame: it does not contain a name column; it contains a Country column.
Go back to the Traceback. At the very end there is the error message KeyError: 'name'.
So you are attempting to access a non-existent column.
I also noticed another detail: the values attribute holds the underlying NumPy array, whereas process.extractOne requires "ordinary" Python lists (of strings, to perform the match).
So you should probably change the instruction above to:
wrong_names=df['Country'].dropna().values.tolist()
The same goes for the other instruction.
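A quick way to avoid this kind of KeyError is to print the column labels before indexing. The sketch below uses a made-up DataFrame in place of wrong-country-names.csv (the misspelled country names are invented for illustration) and leaves out the fuzzy matching itself:

```python
import pandas as pd

# Hypothetical stand-in for wrong-country-names.csv; the real data may differ.
df = pd.DataFrame({"Country": ["Untied States", None, "Germny"]})

# Inspect the labels first; this shows that there is no 'name' column.
print(list(df.columns))

# Pick the label that actually exists, then build a plain Python list of strings.
column = "name" if "name" in df.columns else "Country"
wrong_names = df[column].dropna().tolist()
print(wrong_names)
```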

Why can't I select one column from a DataFrame with print(df['column1']) this time?

I can select one column from a DataFrame; for example, code like print(df['201809']) works:
df = pd.read_csv('xxxx.csv', low_memory=False)
print(df.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 11 entries, 0 to 10
Data columns (total 4 columns):
BO_product2 11 non-null object
201808 11 non-null float64
201809 11 non-null float64
4 11 non-null float64
dtypes: float64(3), object(1)
memory usage: 440.0+ bytes
None
print(df['201809']) # works fine
0 1.634931e+06
1 2.653640e+08
2 7.475315e+07
3 9.710830e+06
4 3.023899e+08
5 1.087862e+08
6 2.031106e+08
7 3.556234e+08
8 5.830665e+06
9 8.766841e+08
10 7.544689e+07
Name: 201809, dtype: float64
However, print(df['4']) doesn't. Any tips or ideas?
PS: if I save the DataFrame to a local CSV file with df.to_csv('yy.csv'), print(a['4']) works after df = pd.read_csv('yy.csv').
print(df['4'])
Traceback (most recent call last):
File "C:\Users\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\indexes\base.py", line 3063, in get_loc
return self._engine.get_loc(key)
File "pandas\_libs\index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: '4'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "E:/Python/2.py", line 45, in <module>
he()
File "E:/Python/2.py", line 26, in he
print(a['4'])
File "C:\Users\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\frame.py", line 2685, in __getitem__
return self._getitem_column(key)
File "C:\Users\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\frame.py", line 2692, in _getitem_column
return self._get_item_cache(key)
File "C:\Users\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\generic.py", line 2486, in _get_item_cache
values = self._data.get(item)
File "C:\Users\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\internals.py", line 4115, in get
loc = self.items.get_loc(item)
File "C:\Users\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\indexes\base.py", line 3065, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas\_libs\index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: '4'
If you execute the code below:
[type(i) for i in df.columns]
#[str, str, str, int]
For columns whose label has type int, you should access the column as df[4] and not df['4'].
The reason it gets written as a string is probably the quoting option of to_csv. From the docs:
quoting : optional constant from csv module
defaults to csv.QUOTE_MINIMAL. If you have set a float_format then floats are converted to strings and thus csv.QUOTE_NONNUMERIC will treat them as non-numeric
Hope this helps.
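To make the int-versus-str distinction concrete, here is a small sketch with a made-up frame whose last column label is the integer 4. The data values are hypothetical. It also shows that a CSV round trip turns the label into the string '4', since read_csv reads header text back as strings, which would match the PS in the question:

```python
import io

import pandas as pd

# Hypothetical frame whose last column label is the integer 4, not the string '4'.
df = pd.DataFrame({"BO_product2": ["x", "y"], "201809": [1.5, 2.5], 4: [3.0, 4.0]})

print([type(label) for label in df.columns])
print(4 in df.columns, "4" in df.columns)

# Option 1: index with the integer label.
by_int = df[4]

# Option 2: convert every label to str so that df['4'] works.
renamed = df.copy()
renamed.columns = renamed.columns.map(str)
by_str = renamed["4"]

# A CSV round trip has the same effect on the labels.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))
```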

Python 3.4 Pandas: sort market data by Date

I am trying to set up Python (3.4) code to sort a time series by date.
In the Python shell, I key in the following:
>>> data = quandl.get("YAHOO/INDEX_GSPC", start_date="2017-01-01", end_date="2017-01-20")
>>> print(data)
So I can load the data. But when I try to sort with the command
>>> data = data.sort_values(by='Date')
I get the following list of error messages. I can't seem to understand the syntax for sorting by date from http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sort_values.html
Experts out there, many thanks for any advice.
Traceback (most recent call last):
File "C:\Python34\lib\site-packages\pandas\indexes\base.py", line 2134, in get_loc
return self._engine.get_loc(key)
File "pandas\index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas\index.c:4433)
File "pandas\index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas\index.c:4279)
File "pandas\src\hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13742)
File "pandas\src\hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13696)
KeyError: 'Date'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<pyshell#37>", line 1, in <module>
data = data.sort_values(by='Date')
File "C:\Python34\lib\site-packages\pandas\core\frame.py", line 3230, in sort_values
k = self.xs(by, axis=other_axis).values
File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 1770, in xs
return self[key]
File "C:\Python34\lib\site-packages\pandas\core\frame.py", line 2059, in __getitem__
return self._getitem_column(key)
File "C:\Python34\lib\site-packages\pandas\core\frame.py", line 2066, in _getitem_column
return self._get_item_cache(key)
File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 1386, in _get_item_cache
values = self._data.get(item)
File "C:\Python34\lib\site-packages\pandas\core\internals.py", line 3543, in get
loc = self.items.get_loc(item)
File "C:\Python34\lib\site-packages\pandas\indexes\base.py", line 2136, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas\index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas\index.c:4433)
File "pandas\index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas\index.c:4279)
File "pandas\src\hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13742)
File "pandas\src\hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13696)
KeyError: 'Date'
quandl.get loads a DataFrame with the date as the index.
So if you sort by the index, you're good to go:
data = data.sort_index()
Make sure you look at the error. You are getting a KeyError, which means that the column Date does not exist in your DataFrame. It's likely that the dates are stored in the index, which requires the sort_index method instead. The 'Date' name that you see in your DataFrame is the name of the index and not a column.
data.sort_index()
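The difference between an index named 'Date' and a column named 'Date' can be seen in a small sketch. The prices below are made up and stand in for the quandl result, on the assumption that quandl returns the dates as a named index as described above:

```python
import pandas as pd

# Hypothetical prices standing in for the quandl result; the dates live in the index.
data = pd.DataFrame(
    {"Close": [2270.75, 2257.83, 2268.90]},
    index=pd.to_datetime(["2017-01-04", "2017-01-03", "2017-01-06"]),
)
data.index.name = "Date"

# 'Date' is the name of the index, so it is not among the columns.
print("Date" in data.columns)

# Sort by the index directly.
by_index = data.sort_index()

# Or turn the index into a regular column first, then sort by that column.
by_column = data.reset_index().sort_values(by="Date")

print(by_index)
print(by_column)
```

Newer pandas releases may also accept an index level name in sort_values, so the KeyError from the question might not reproduce there.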

How can I read the csv file in pandas which is separated with ";"?

I started working with pandas in Python 3.4 a couple of days ago. I chose to work on the Book-Crossing data set.
The book information table is like this:
The Book rating table is like this:
I want to grab the "ISBN" and "Book-Title" columns from the book information table and merge them with the book rating table where the "ISBN" values match, and then write the results to another CSV file.
I used the code below:
udata = pd.read_csv('1', names = ('User_ID', 'ISBN', 'Book-Rating'), encoding="ISO-8859-1", sep=';', usecols=[0,1,2])
uitem = pd.read_csv('2', names = ('ISBN', 'Book-Title'), encoding="ISO-8859-1", sep=';', usecols=[0,1])
ratings = pd.merge(udata, uitem, on='ISBN')
ratings.to_csv('ratings.csv', index=False)
Unfortunately, it doesn't work and gives an error:
Traceback (most recent call last):
File "C:\Users\masoud\Desktop\Dataset\data2\a.py", line 2, in <module>
udata = pd.read_csv('2.csv', names = ('User_ID', 'ISBN', 'Book-Rating'),encoding="ISO-8859-1", sep=';', usecols=[0,1,2])
File "C:\WinPython-64bit-3.4.3.6\python-3.4.3.amd64\lib\site-packages\pandas\io\parsers.py", line 491, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\WinPython-64bit-3.4.3.6\python-3.4.3.amd64\lib\site-packages\pandas\io\parsers.py", line 278, in _read
return parser.read()
File "C:\WinPython-64bit-3.4.3.6\python-3.4.3.amd64\lib\site-packages\pandas\io\parsers.py", line 740, in read
ret = self._engine.read(nrows)
File "C:\WinPython-64bit-3.4.3.6\python-3.4.3.amd64\lib\site-packages\pandas\io\parsers.py", line 1187, in read
data = self._reader.read(nrows)
File "pandas\parser.pyx", line 758, in pandas.parser.TextReader.read (pandas\parser.c:7919)
File "pandas\parser.pyx", line 780, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:8175)
File "pandas\parser.pyx", line 833, in pandas.parser.TextReader._read_rows (pandas\parser.c:8868)
File "pandas\parser.pyx", line 820, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:8736)
File "pandas\parser.pyx", line 1732, in pandas.parser.raise_parser_error (pandas\parser.c:22105)
pandas.parser.CParserError: Error tokenizing data. C error: Expected 8 fields in line 6452, saw 9
I was wondering if anybody could help me fix the error?
In the first and second lines, change sep to ;.
sep=';'
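A minimal sketch of reading semicolon-separated data and merging on ISBN, using a few made-up rows in place of the Book-Crossing files (the sample values and the dtype argument are assumptions for illustration, not part of the original code):

```python
import io

import pandas as pd

# Hypothetical semicolon-separated samples; the real files have more columns and rows.
ratings_raw = '"276725";"034545104X";"0"\n"276726";"0155061224";"5"\n'
books_raw = '"034545104X";"Flesh Tones: A Novel"\n"0155061224";"Rites of Passage"\n'

# sep=';' splits on semicolons; dtype keeps ISBNs as text so leading zeros survive.
udata = pd.read_csv(io.StringIO(ratings_raw), sep=";", header=None,
                    names=["User_ID", "ISBN", "Book-Rating"], dtype={"ISBN": str})
uitem = pd.read_csv(io.StringIO(books_raw), sep=";", header=None,
                    names=["ISBN", "Book-Title"], dtype={"ISBN": str})

# Inner join on the shared ISBN column.
ratings = pd.merge(udata, uitem, on="ISBN")
print(ratings)
```

The "Expected 8 fields in line 6452, saw 9" message in the traceback points at a row with more separators than the parser expected, so the real file may still need that row inspected even with the right sep.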
