I have successfully read a csv file using pandas. When I am trying to print the a particular column from the data frame i am getting keyerror. Hereby i am sharing the code with the error.
import pandas as pd
reviews_new = pd.read_csv("D:\\aviva.csv")
reviews_new['review']
**
reviews_new['review']
Traceback (most recent call last):
File "<ipython-input-43-ed485b439a1c>", line 1, in <module>
reviews_new['review']
File "C:\Users\30216\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\frame.py", line 1997, in __getitem__
return self._getitem_column(key)
File "C:\Users\30216\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\frame.py", line 2004, in _getitem_column
return self._get_item_cache(key)
File "C:\Users\30216\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\generic.py", line 1350, in _get_item_cache
values = self._data.get(item)
File "C:\Users\30216\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\internals.py", line 3290, in get
loc = self.items.get_loc(item)
File "C:\Users\30216\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\indexes\base.py", line 1947, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas\index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas\index.c:4154)
File "pandas\index.pyx", line 159, in pandas.index.IndexEngine.get_loc (pandas\index.c:4018)
File "pandas\hashtable.pyx", line 675, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12368)
File "pandas\hashtable.pyx", line 683, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12322)
KeyError: 'review'
**
Can someone help me in this ?
I think first is best investigate, what are real columns names, if convert to list better are seen some whitespaces or similar:
print (reviews_new.columns.tolist())
I think there can be 2 problems (obviously):
1.whitespaces in columns names (maybe in data also)
Solutions are strip whitespaces in column names:
reviews_new.columns = reviews_new.columns.str.strip()
Or add parameter skipinitialspace to read_csv:
reviews_new = pd.read_csv("D:\\aviva.csv", skipinitialspace=True)
2.different separator as default ,
Solution is add parameter sep:
#sep is ;
reviews_new = pd.read_csv("D:\\aviva.csv", sep=';')
#sep is whitespace
reviews_new = pd.read_csv("D:\\aviva.csv", sep='\s+')
reviews_new = pd.read_csv("D:\\aviva.csv", delim_whitespace=True)
EDIT:
You get whitespace in column name, so need 1.solutions:
print (reviews_new.columns.tolist())
['Name', ' Date', ' review']
^ ^
import pandas as pd
df=pd.read_csv("file.txt", skipinitialspace=True)
df.head()
df['review']
dfObj['Hash Key'] = (dfObj['DEAL_ID'].map(str) +dfObj['COST_CODE'].map(str) +dfObj['TRADE_ID'].map(str)).apply(hash)
#for index, row in dfObj.iterrows():
# dfObj.loc[`enter code here`index,'hash'] = hashlib.md5(str(row[['COST_CODE','TRADE_ID']].values)).hexdigest()
print(dfObj['hash'])
Related
I am writing python script in which i am generating two different csv files and then reading these file by using pandas. I am able to read file1 with pandas but getting error while reading file2 which in same format(same column name) as file1 but different/same values. Please find the below error that i am getting and sample code that i am using.
Error:
Traceback (most recent call last):
File "MSReport.py", line 168, in <module>
fail = pd.read_csv('/home/cisapp/msLogFailure.csv', sep=',')
File "/home/cisapp/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/cisapp/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 448, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "/home/cisapp/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 880, in __init__
self._make_engine(self.engine)
File "/home/cisapp/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/home/cisapp/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1891, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 532, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
Code:
df = pd.read_csv(BASE_LOCATION+'/msLog_Success.csv', engine='python')
f_output = df.groupby('MSISDN').last()
#print(df)
print(f_output)
fail = pd.read_csv(BASE_LOCATION+'/msLogFailure.csv', engine='python')
fail = fail['MSISDN']
fail = fail.tolist()
for i in fail:
succ = f_output[f_output.MSISDN != i]
In above sample code there is no error while reading file df = pd.read_csv(BASE_LOCATION+'/msLog_Success.csv', engine='python') but while reading file fail = pd.read_csv(BASE_LOCATION+'/msLogFailure.csv', engine='python') i am facing the error as mentioned above. Please help to resolve.
Note: I am running code by using python3.
I faced the same problem and resolved. So you can check using below idea.
Check the delimitator and mention like below examples
pd.read_csv(BASE_LOCATION+'/msLog_Success.csv', encoding='utf-16', sep='\t')
pd.read_csv(BASE_LOCATION+'/msLog_Success.csv', delim_whitespace=True)
You can also add 'r' before file path.
Otherwise share the file image
Your sample of msLogFailure file looks OK - 6 column names and 6 data fields.
I looked for posts concerning just this error message and I found an advice to:
read the input file into a string variable,
read_csv from this string, e.g. pd.read_csv(io.StringIO(txt),...).
Maybe this will help.
Good day, I have 42Gb of data in a list of sequenced 2455 xCSV files.
I am trying to import the data sequentially using a loop into a pd.DataFrame for analysis.
I have tried it with 3 files and it works well.
from glob import glob
import pandas as pd
# Import data into DF
filenames = glob('Z:\PersonalFolders\AllData\*.csv')
df_trial = [pd.read_csv(f) for f in filenames]
df_trial
I am getting the following error. Copy pasted the traceback here. Please help
df_trial = [pd.read_csv(f) for f in filenames]
Traceback (most recent call last):
File "<ipython-input-23-0438182db491>", line 1, in <module>
df_trial = [pd.read_csv(f) for f in filenames]
File "<ipython-input-23-0438182db491>", line 1, in <listcomp>
df_trial = [pd.read_csv(f) for f in filenames]
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 454, in _read
data = parser.read(nrows)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1148, in read
df = DataFrame(col_dict, columns=columns, index=index)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\frame.py", line 435, in __init__
mgr = init_dict(data, index, columns, dtype=dtype)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 254, in init_dict
return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 74, in arrays_to_mgr
return create_block_manager_from_arrays(arrays, arr_names, axes)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 1670, in create_block_manager_from_arrays
blocks = form_blocks(arrays, names, axes)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 1726, in form_blocks
float_blocks = _multi_blockify(items_dict["FloatBlock"])
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 1820, in _multi_blockify
values, placement = _stack_arrays(list(tup_block), dtype)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 1848, in _stack_arrays
stacked = np.empty(shape, dtype=dtype)
MemoryError: Unable to allocate 107. MiB for an array with shape (124, 113012) and data type float64
There are a number of things you can do.
First, only process one dataframe at a time:
filenames = glob('Z:\PersonalFolders\AllData\*.csv')
for f in filenames:
df = pd.read_csv(f)
process(df)
Second, if that's not possible you can try to reduce the amount of memory used when loading the dataframes by a variety of means (smaller dtypes for numeric columns, omitting numeric columns, and more). See https://pythonspeed.com/articles/pandas-load-less-data/ for some starting points on these techniques.
Thanks to all.
I was able to process all 42GB of data using the nrows argument
filenames = glob('Z:\PersonalFolders\AllData\*.csv')
df_2019=[]
for filename in filenames:
df = pd.read_csv(filename, index_col=None, header=0, nrows = 1000)
df_2019.append(df)
frame = pd.concat(df_2019, axis=0, ignore_index=True)
I have thousands of files as you can see the year range below. Some of the dates of the files are missing so I want to skip over them. But when I tried the method below, and calling data_in, the variable doesn't exist. Any help would be truly appreciated. I am new to python. Thank you.
path = r'file path here'
DataYears = ['2012','2013','2014', '2015','2016','2017','2018','2019', '2020']
Years = np.float64(DataYears)
NumOfYr = Years.size
DataMonths = ['01','02','03','04','05','06','07','08','09','10','11','12']
daysofmonth=[31,28,31,30,31,30,31,31,30,31,30,31]
for yy in range(NumOfYr):
for mm in range (12):
try:
data_in = pd.read_csv(path+DataYears[yy]+DataMonths[mm]+'/*.dat', skiprows=4, header=None, engine='python')
print('Reached data_in') # EDIT
a=data_in[0] #EDIT
except IOError:
pass
#print("File not accessible")
EDIT: Error added
Traceback (most recent call last):
File "Directory/Documents/test.py", line 23, in <module>
data_in = pd.read_csv(path+'.'+DataYears[yy]+DataMonths[mm]+'/*.cod', skiprows=4, header=None, engine='python')
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 448, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 880, in __init__
self._make_engine(self.engine)
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1126, in _make_engine
self._engine = klass(self.f, **self.options)
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2269, in __init__
memory_map=self.memory_map,
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/common.py", line 431, in get_handle
f = open(path_or_buf, mode, errors="replace", newline="")
FileNotFoundError: [Errno 2] No such file or directory: 'Directory/Documents/201201/*.dat'
You can adapt the code below to get a list of your date folders:
import glob
# Gives you a list of your folders with the different dates
folder_names = glob.glob("Directory/Documents/")
print(folder_names)
Then with the list of folder, you can iterate through there contents. If you just want a list of all .dat files can do something like:
import glob
# Gives you a list of your folders with the different dates
file_names = glob.glob("Directory/Documents/*/*.dat")
print(file_names)
The code above searches the contents of your directories so you bypass your problem with missing dates. The prints are there so you can see the results of glob.glob().
I am trying to set up Python (3.4) code to sort a time-series by date.
In python shell, I key in the following
>>>data = quandl.get("YAHOO/INDEX_GSPC", start_date="2017-01-01", end_date="2017-01-20")
>>>print(data)
So, I can load in the data. But when I try to use sort by the command
>>>data = data.sort_values(by='Date')
I get the following list of errors messages. I can't seem to understand/get the syntax for date sort from http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sort_values.html
Experts out there......., many thanks for advice.
Traceback (most recent call last):
File "C:\Python34\lib\site-packages\pandas\indexes\base.py", line 2134, in get_loc
return self._engine.get_loc(key)
File "pandas\index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas\index.c:4433)
File "pandas\index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas\index.c:4279)
File "pandas\src\hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13742)
File "pandas\src\hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13696)
KeyError: 'Date'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<pyshell#37>", line 1, in <module>
data = data.sort_values(by='Date')
File "C:\Python34\lib\site-packages\pandas\core\frame.py", line 3230, in sort_values
k = self.xs(by, axis=other_axis).values
File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 1770, in xs
return self[key]
File "C:\Python34\lib\site-packages\pandas\core\frame.py", line 2059, in __getitem__
return self._getitem_column(key)
File "C:\Python34\lib\site-packages\pandas\core\frame.py", line 2066, in _getitem_column
return self._get_item_cache(key)
File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 1386, in _get_item_cache
values = self._data.get(item)
File "C:\Python34\lib\site-packages\pandas\core\internals.py", line 3543, in get
loc = self.items.get_loc(item)
File "C:\Python34\lib\site-packages\pandas\indexes\base.py", line 2136, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas\index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas\index.c:4433)
File "pandas\index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas\index.c:4279)
File "pandas\src\hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13742)
File "pandas\src\hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13696)
KeyError: 'Date'
quandl.get loads a DataFrame with the date as index.
So if you sort by index, you're good to go:
data = data.sort_index()
Make sure you look at the error. You are getting a KeyError which means that the column Date does not exist in your DataFrame. It's like that the dates are stored in the index which requires the sort_index method instead. The 'Date' name that you see in your DataFrame is the name of the index and not a column.
data.sort_index()
I started working with pandas in python 3.4 for couple of days. I chose to work on Book-Crossing data set.
The book information table is like this:
The Book rating table is like this:
I want to grab the "ISBN","Book-title" from the book information table and merge it with the book-rating table in which both match the "ISBN" and after that write the results in another csv file.
I used the code below:
udata = pd.read_csv('1', names = ('User_ID', 'ISBN', 'Book-Rating'), encoding="ISO-8859-1", sep=';', usecols=[0,1,2])
uitem = pd.read_csv('2', names = ('ISBN', 'Book-Title'), encoding="ISO-8859-1", sep=';', usecols=[0,1])
ratings = pd.merge(udata, uitem, on='ISBN')
ratings.to_csv('ratings.csv', index=False)
Unfortunately it doesn't work and it gives an error:
Traceback (most recent call last):
File "C:\Users\masoud\Desktop\Dataset\data2\a.py", line 2, in <module>
udata = pd.read_csv('2.csv', names = ('User_ID', 'ISBN', 'Book-Rating'),encoding="ISO-8859-1", sep=';', usecols=[0,1,2])
File "C:\WinPython-64bit-3.4.3.6\python-3.4.3.amd64\lib\site-packages\pandas\io\parsers.py", line 491, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\WinPython-64bit-3.4.3.6\python-3.4.3.amd64\lib\site-packages\pandas\io\parsers.py", line 278, in _read
return parser.read()
File "C:\WinPython-64bit-3.4.3.6\python-3.4.3.amd64\lib\site-packages\pandas\io\parsers.py", line 740, in read
ret = self._engine.read(nrows)
File "C:\WinPython-64bit-3.4.3.6\python-3.4.3.amd64\lib\site-packages\pandas\io\parsers.py", line 1187, in read
data = self._reader.read(nrows)
File "pandas\parser.pyx", line 758, in pandas.parser.TextReader.read (pandas\parser.c:7919)
File "pandas\parser.pyx", line 780, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:8175)
File "pandas\parser.pyx", line 833, in pandas.parser.TextReader._read_rows (pandas\parser.c:8868)
File "pandas\parser.pyx", line 820, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:8736)
File "pandas\parser.pyx", line 1732, in pandas.parser.raise_parser_error (pandas\parser.c:22105)
pandas.parser.CParserError: Error tokenizing data. C error: Expected 8 fields in line 6452, saw 9
I was wondering if anybody could fix the error?
In the first and second row, change sep to ;.
sep=';'