ImageDataBunch.from_df positional indexers are out-of-bounds

Scratching my head on this issue: I don't know how to identify the positional indexers. Am I even passing them?
I'm attempting this for my first Kaggle competition. I can read the CSV into a dataframe and make the needed edits, and am now trying to create the ImageDataBunch so training a CNN can begin. This error pops up no matter which method I try. Any advice would be appreciated.
data = ImageDataBunch.from_df(path, df, ds_tfms=tfms, size=24)
data.classes
Traceback:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-25-5588812820e8> in <module>
----> 1 data = ImageDataBunch.from_df(path, df, ds_tfms=tfms, size=24)
2 data.classes
/opt/conda/lib/python3.7/site-packages/fastai/vision/data.py in from_df(cls, path, df, folder, label_delim, valid_pct, seed, fn_col, label_col, suffix, **kwargs)
117 src = (ImageList.from_df(df, path=path, folder=folder, suffix=suffix, cols=fn_col)
118 .split_by_rand_pct(valid_pct, seed)
--> 119 .label_from_df(label_delim=label_delim, cols=label_col))
120 return cls.create_from_ll(src, **kwargs)
121
/opt/conda/lib/python3.7/site-packages/fastai/data_block.py in _inner(*args, **kwargs)
477 assert isinstance(fv, Callable)
478 def _inner(*args, **kwargs):
--> 479 self.train = ft(*args, from_item_lists=True, **kwargs)
480 assert isinstance(self.train, LabelList)
481 kwargs['label_cls'] = self.train.y.__class__
/opt/conda/lib/python3.7/site-packages/fastai/data_block.py in label_from_df(self, cols, label_cls, **kwargs)
283 def label_from_df(self, cols:IntsOrStrs=1, label_cls:Callable=None, **kwargs):
284 "Label `self.items` from the values in `cols` in `self.inner_df`."
--> 285 labels = self.inner_df.iloc[:,df_names_to_idx(cols, self.inner_df)]
286 assert labels.isna().sum().sum() == 0, f"You have NaN values in column(s) {cols} of your dataframe, please fix it."
287 if is_listy(cols) and len(cols) > 1 and (label_cls is None or label_cls == MultiCategoryList):
/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
1760 except (KeyError, IndexError, AttributeError):
1761 pass
-> 1762 return self._getitem_tuple(key)
1763 else:
1764 # we by definition only have the 0th axis
/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
2065 def _getitem_tuple(self, tup: Tuple):
2066
-> 2067 self._has_valid_tuple(tup)
2068 try:
2069 return self._getitem_lowerdim(tup)
/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
701 raise IndexingError("Too many indexers")
702 try:
--> 703 self._validate_key(k, i)
704 except ValueError:
705 raise ValueError(
/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _validate_key(self, key, axis)
2007 # check that the key does not exceed the maximum size of the index
2008 if len(arr) and (arr.max() >= len_axis or arr.min() < -len_axis):
-> 2009 raise IndexError("positional indexers are out-of-bounds")
2010 else:
2011 raise ValueError(f"Can only index by location with a [{self._valid_types}]")
IndexError: positional indexers are out-of-bounds

I faced this error while creating a DataBunch when my dataframe/CSV did not have a class label explicitly defined.
I created a dummy column which stored 1s for all rows in the dataframe, and that seemed to work. Also be sure to store your independent variable in the second column and the label (the dummy variable in this case) in the first column.
I believe this error happens if there's just one column in the Pandas DataFrame.
Thanks.
Code:
import pandas as pd
from fastai.text import TextLMDataBunch

# First column holds the dummy label, second the independent variable
df = pd.DataFrame(lines, columns=["dummy_value", "text"])
df.to_csv("./train.csv")
data_lm = TextLMDataBunch.from_csv(path, "train.csv", min_freq=1)
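If you already have a dataframe holding only the independent variable, a minimal sketch of adding the constant dummy label as the first column (the column names here are hypothetical):
import pandas as pd

# Hypothetical dataframe holding just the independent variable.
df = pd.DataFrame({"text": ["first example", "second example"]})
df.insert(0, "dummy_value", 1)  # constant dummy label as the first column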
Note: This is my first attempt at answering a StackOverflow question. Hope it helped!

This error also appears when your dataset is not correctly split between training and validation.
In the case of dataframes, ColSplitter assumes there is a column is_valid that indicates which rows are in the validation set.
If all rows have True, then the training set is empty, so fastai cannot index into it to prepare the first example, and it raises this error.
Example:
import numpy as np
import pandas as pd
from fastai.vision.all import *

data = pd.DataFrame({
    'fname': [f'{x}.png' for x in range(10)],
    'label': np.arange(10) % 2,
    'is_valid': True
})
blk = DataBlock(blocks=(ImageBlock, CategoryBlock),
                splitter=ColSplitter(),
                get_x=ColReader('fname'),
                get_y=ColReader('label'),
                item_tfms=Resize(224, method=ResizeMethod.Squish))
blk.summary(data)
Results in the error.
Solution
The solution is to check that your data can be split correctly into train and valid sets. In the above example, it suffices to have one row that is not in the validation set:
data.loc[0, 'is_valid'] = False
How to figure it out?
Work in a Jupyter notebook. After the error, type %debug in a cell to enter post-mortem debugging. Go to the frame of the setup function (fastai/data/core.py(273) setup()) by going up 5 frames.
This takes you to the line that is throwing the error.
You can then print(self.splits) and observe that the first split (the training set) is empty.
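As a pre-flight check, here is a minimal sketch (the helper name check_split_column is hypothetical) that catches this before the DataBlock is built:
import pandas as pd

def check_split_column(df: pd.DataFrame, col: str = "is_valid") -> None:
    """Fail fast if the split column would leave train or valid empty."""
    n_valid = int(df[col].sum())
    if n_valid == 0:
        raise ValueError("No rows marked is_valid=True: validation set would be empty.")
    if n_valid == len(df):
        raise ValueError("All rows marked is_valid=True: training set would be empty.")

check_split_column(data)  # raises for the all-True example above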

Related

No attribute 'set_values' or '_AtIndexer' error from set_values and at() in pandas on Python 3

I am new to pandas and still learning.
I am trying to add 2 to every value in a Series, label-wise. One method is this:
import numpy as np
import pandas as pd

numbers = pd.Series(np.random.randint(0, 1000, 10000))
for label, value in numbers.iteritems():
    numbers.set_values(label, value + 2)
numbers.head()
Output:
AttributeError: 'Series' object has no attribute 'set_values'
Upon research I found out that it was deprecated and that at is used instead.
So I used it like this:
for label, value in numbers.iteritems():
    numbers.at(label, value + 2)
numbers.head()
Output:
TypeError: '_AtIndexer' object is not callable
So when I use it like this with at[]:
for label, value in numbers.iteritems():
    numbers.at[label, value + 2]
numbers.head()
I get this output:
KeyError Traceback (most recent call last)
<ipython-input-43-b1f985a669d7> in <module>
1 for label, value in numbers.iteritems():
----> 2 numbers.at[label, value+2]
3
4 numbers.head()
C:\Users\Public\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
2078 return self.obj.loc[key]
2079
-> 2080 return super().__getitem__(key)
2081
2082 def __setitem__(self, key, value):
C:\Users\Public\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
2025
2026 key = self._convert_key(key)
-> 2027 return self.obj._get_value(*key, takeable=self._takeable)
2028
2029 def __setitem__(self, key, value):
C:\Users\Public\anaconda3\lib\site-packages\pandas\core\series.py in _get_value(self, label, takeable)
987
988 # Similar to Index.get_value, but we do not fall back to positional
--> 989 loc = self.index.get_loc(label)
990 return self.index._get_values_for_loc(self, loc, label)
991
C:\Users\Public\anaconda3\lib\site-packages\pandas\core\indexes\range.py in get_loc(self, key, method, tolerance)
356 except ValueError as err:
357 raise KeyError(key) from err
--> 358 raise KeyError(key)
359 return super().get_loc(key, method=method, tolerance=tolerance)
360
KeyError: (0, 10002)
What am I doing wrong and what can be fixed?
at is an accessor for the index given by its argument. When you pass label, value+2 to it, it sees this argument as a 2-tuple and looks for an index labeled literally (0, 10002) on the first iteration, which fails since your series has integer indices 0, 1, ..., not tuples.
So keep only label inside at and assign value + 2 to the looked-up entry:
import numpy as np
import pandas as pd

numbers = pd.Series(np.random.randint(0, 1000, 10000))
for label, value in numbers.iteritems():
    # look up the entry by label and set it
    numbers.at[label] = value + 2
(noting that this is equivalent to numbers += 2).

How do I pass the values to Catboost?

I'm trying to work with CatBoost and I've got a problem that I'm really stuck with right now. I have a dataframe with 28 columns, 2 of them categorical. The numerical columns contain both whole and fractional numbers, as well as some 0.00 values that represent actual zeros (like 1 - 1 = 0) rather than empty values.
I'm trying to run this:
train_cl = cb.Pool(data=ret_df.iloc[:580000, :-1], label=ret_df.iloc[:580000, -1], cat_features=cats)
evl_cl = cb.Pool(data=ret_df.iloc[580000:, :-1], label=ret_df.iloc[580000:, -1], cat_features=cats)
But I have this error
---------------------------------------------------------------------------
CatBoostError Traceback (most recent call last)
<ipython-input-112-a515b0ab357b> in <module>
1 train_cl = cb.Pool(data=ret_df.iloc[:580000, :-1], label=ret_df.iloc[:580000, -1], cat_features=cats)
----> 2 evl_cl = cb.Pool(data=ret_df.iloc[580000:, :-1], label=ret_df.iloc[580000:, -1], cat_features=cats)
~\AppData\Local\Programs\Python\Python36\lib\site-packages\catboost\core.py in __init__(self, data, label, cat_features, text_features, embedding_features, column_description, pairs, delimiter, has_header, ignore_csv_quoting, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names, thread_count, log_cout, log_cerr)
615 )
616
--> 617 self._init(data, label, cat_features, text_features, embedding_features, pairs, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names, thread_count)
618 super(Pool, self).__init__()
619
~\AppData\Local\Programs\Python\Python36\lib\site-packages\catboost\core.py in _init(self, data, label, cat_features, text_features, embedding_features, pairs, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names, thread_count)
1081 if label is not None:
1082 self._check_label_type(label)
-> 1083 self._check_label_empty(label)
1084 label = self._label_if_pandas_to_numpy(label)
1085 if len(np.shape(label)) == 1:
~\AppData\Local\Programs\Python\Python36\lib\site-packages\catboost\core.py in _check_label_empty(self, label)
723 """
724 if len(label) == 0:
--> 725 raise CatBoostError("Labels variable is empty.")
726
727 def _check_label_shape(self, label, samples_count):
CatBoostError: Labels variable is empty.
I've googled this problem but found nothing. My hypothesis is that there is a problem with the 0.00 values, but I do not know how to solve it because I literally can't replace these values with anything.
Please, help me!
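One hedged observation from the traceback rather than the 0.00 hypothesis: _check_label_empty raises precisely when len(label) == 0, which suggests ret_df has at most 580000 rows, so the slice ret_df.iloc[580000:, -1] passed as the eval label is empty. A minimal sketch of the check, assuming ret_df and cats as defined in the question:
import catboost as cb

split = 580_000
print(len(ret_df))  # if this is <= split, the eval slices below are empty
assert len(ret_df) > split, "eval label slice would be empty"

train_cl = cb.Pool(data=ret_df.iloc[:split, :-1], label=ret_df.iloc[:split, -1],
                   cat_features=cats)
evl_cl = cb.Pool(data=ret_df.iloc[split:, :-1], label=ret_df.iloc[split:, -1],
                 cat_features=cats)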

slice in xarray gives error 'float' object cannot be interpreted as an integer

I am trying to slice data by longitude using xarray.
The data is in a netCDF file I created from measurements I made.
The xarray.Dataset looks like this:
Dimensions:  (lat: 1321, lon: 1321)
Data variables:
    lon    (lon) float64 8.413 8.411 8.409 ... 4.904 4.905
    lat    (lat) float64 47.4 47.4 47.41 ... 52.37 52.37
    data   float64 ...  # dimension: 1321
My code is:
import xarray as xr
obs = xr.open_dataset('data.nc')
obs=obs['data'].sel(lon=slice(4.905, 8.413))
The error I get is TypeError: 'float' object cannot be interpreted as an integer.
I could not find out whether it is an error in my code or an error in xarray. I would expect such an error when using isel instead of sel. I could not find any solution on here or in the xarray documentation.
Full error message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-434-5b37e4c5d0c6> in <module>
----> 1 obs=obs['data'].sel(lon=slice(4.905, 8.413))
~/opt/anaconda3/lib/python3.7/site-packages/xarray/core/dataarray.py in sel(self, indexers, method, tolerance, drop, **indexers_kwargs)
1059 method=method,
1060 tolerance=tolerance,
-> 1061 **indexers_kwargs,
1062 )
1063 return self._from_temp_dataset(ds)
~/opt/anaconda3/lib/python3.7/site-packages/xarray/core/dataset.py in sel(self, indexers, method, tolerance, drop, **indexers_kwargs)
2066 self, indexers=indexers, method=method, tolerance=tolerance
2067 )
-> 2068 result = self.isel(indexers=pos_indexers, drop=drop)
2069 return result._overwrite_indexes(new_indexes)
2070
~/opt/anaconda3/lib/python3.7/site-packages/xarray/core/dataset.py in isel(self, indexers, drop, **indexers_kwargs)
1933 var_indexers = {k: v for k, v in indexers.items() if k in var_value.dims}
1934 if var_indexers:
-> 1935 var_value = var_value.isel(var_indexers)
1936 if drop and var_value.ndim == 0 and var_name in coord_names:
1937 coord_names.remove(var_name)
~/opt/anaconda3/lib/python3.7/site-packages/xarray/core/variable.py in isel(self, indexers, **indexers_kwargs)
1058
1059 key = tuple(indexers.get(dim, slice(None)) for dim in self.dims)
-> 1060 return self[key]
1061
1062 def squeeze(self, dim=None):
~/opt/anaconda3/lib/python3.7/site-packages/xarray/core/variable.py in __getitem__(self, key)
701 array `x.values` directly.
702 """
--> 703 dims, indexer, new_order = self._broadcast_indexes(key)
704 data = as_indexable(self._data)[indexer]
705 if new_order:
~/opt/anaconda3/lib/python3.7/site-packages/xarray/core/variable.py in _broadcast_indexes(self, key)
540
541 if all(isinstance(k, BASIC_INDEXING_TYPES) for k in key):
--> 542 return self._broadcast_indexes_basic(key)
543
544 self._validate_indexers(key)
~/opt/anaconda3/lib/python3.7/site-packages/xarray/core/variable.py in _broadcast_indexes_basic(self, key)
568 dim for k, dim in zip(key, self.dims) if not isinstance(k, integer_types)
569 )
--> 570 return dims, BasicIndexer(key), None
571
572 def _validate_indexers(self, key):
~/opt/anaconda3/lib/python3.7/site-packages/xarray/core/indexing.py in __init__(self, key)
369 k = int(k)
370 elif isinstance(k, slice):
--> 371 k = as_integer_slice(k)
372 else:
373 raise TypeError(
~/opt/anaconda3/lib/python3.7/site-packages/xarray/core/indexing.py in as_integer_slice(value)
344
345 def as_integer_slice(value):
--> 346 start = as_integer_or_none(value.start)
347 stop = as_integer_or_none(value.stop)
348 step = as_integer_or_none(value.step)
~/opt/anaconda3/lib/python3.7/site-packages/xarray/core/indexing.py in as_integer_or_none(value)
340
341 def as_integer_or_none(value):
--> 342 return None if value is None else operator.index(value)
343
344
I want to select the entire data because eventually I want to subtract the entire array from a bigger database with a wider grid. That bigger database is a netCDF file as well, and for it I managed to slice the longitude with the exact same code I am trying on this smaller dataset where I get the error. The only difference is that the bigger netCDF uses a float32 format; I don't suspect this could cause the error.
Any help is appreciated. Thank you.
I think I found the problem.
When I created the netCDF file for the observations, I made a mistake in the createDimension part when naming the lon and lat dimensions. Because of this, lat and lon showed up under 'Data variables' in the netCDF file, where they should show up under 'Coordinates'.
Wrong was something like:
# Specifying dimensions
f.createDimension('longitude', len(lon_list))
f.createDimension('latitude', len(lat_list))

# Building variables
longitude = f.createVariable('lon', float, ('lon',), zlib=True)
latitude = f.createVariable('lat', float, ('lat',), zlib=True)
data = f.createVariable('data', float, ('lat', 'lon'), zlib=True)
Correct was:
# Specifying dimensions
f.createDimension('lon', len(lon_list))
f.createDimension('lat', len(lat_list))

# Building variables
longitude = f.createVariable('lon', float, ('lon',), zlib=True)
latitude = f.createVariable('lat', float, ('lat',), zlib=True)
data = f.createVariable('data', float, ('lat', 'lon'), zlib=True)
This is a little late, but I just ran into a similar issue where I got a similarly undecipherable error when trying to slice by a variable.
I think the problem is that if you try to slice by a variable that isn't a coordinate, you get an error that isn't very informative. In that case
data = data.assign_coords({"lat": data.lat, "lon": data.lon})
would have fixed this without rewriting the netCDF file.
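A minimal sketch of that fix in context, assuming 'data.nc' as in the question and that lat/lon are stored along dimensions of the same name (set_coords is an equivalent alternative to assign_coords here):
import xarray as xr

obs = xr.open_dataset('data.nc')
# Promote the lat/lon data variables to dimension coordinates so that
# label-based selection with .sel() works.
obs = obs.set_coords(['lat', 'lon'])
# Note: the sample lon runs from 8.413 down to 4.905; for a descending
# index the slice bounds must also be given in descending order.
subset = obs['data'].sel(lon=slice(8.413, 4.905))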

Can not resolve exception: "ValueError: The index must be timezone aware when indexing with a date string with a UTC offset"

I have a time series whose index I have already converted to datetimes and made timezone-naive:
y.index = pd.to_datetime(y.index)
y.index = y.index.tz_localize(None)
When I try to slice rows using the following expression I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-677-f1b3153cb92b> in <module>
----> 1 y['2020-02-24 10-11-12':]
~\Anaconda3\envs\tf2\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
863 key = check_bool_indexer(self.index, key)
864
--> 865 return self._get_with(key)
866
867 def _get_with(self, key):
~\Anaconda3\envs\tf2\lib\site-packages\pandas\core\series.py in _get_with(self, key)
868 # other: fancy integer or otherwise
869 if isinstance(key, slice):
--> 870 return self._slice(key)
871 elif isinstance(key, ABCDataFrame):
872 raise TypeError(
~\Anaconda3\envs\tf2\lib\site-packages\pandas\core\series.py in _slice(self, slobj, axis, kind)
818
819 def _slice(self, slobj: slice, axis: int = 0, kind=None):
--> 820 slobj = self.index._convert_slice_indexer(slobj, kind=kind or "getitem")
821 return self._get_values(slobj)
822
~\Anaconda3\envs\tf2\lib\site-packages\pandas\core\indexes\base.py in _convert_slice_indexer(self, key, kind)
2943 indexer = key
2944 else:
-> 2945 indexer = self.slice_indexer(start, stop, step, kind=kind)
2946
2947 return indexer
~\Anaconda3\envs\tf2\lib\site-packages\pandas\core\indexes\datetimes.py in slice_indexer(self, start, end, step, kind)
806
807 try:
--> 808 return Index.slice_indexer(self, start, end, step, kind=kind)
809 except KeyError:
810 # For historical reasons DatetimeIndex by default supports
~\Anaconda3\envs\tf2\lib\site-packages\pandas\core\indexes\base.py in slice_indexer(self, start, end, step, kind)
4675 slice(1, 3)
4676 """
-> 4677 start_slice, end_slice = self.slice_locs(start, end, step=step, kind=kind)
4678
4679 # return a slice
~\Anaconda3\envs\tf2\lib\site-packages\pandas\core\indexes\base.py in slice_locs(self, start, end, step, kind)
4888 start_slice = None
4889 if start is not None:
-> 4890 start_slice = self.get_slice_bound(start, "left", kind)
4891 if start_slice is None:
4892 start_slice = 0
~\Anaconda3\envs\tf2\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, label, side, kind)
4800 # For datetime indices label may be a string that has to be converted
4801 # to datetime boundary according to its resolution.
-> 4802 label = self._maybe_cast_slice_bound(label, side, kind)
4803
4804 # we need to look up the label
~\Anaconda3\envs\tf2\lib\site-packages\pandas\core\indexes\datetimes.py in _maybe_cast_slice_bound(self, label, side, kind)
761 freq = getattr(self, "freqstr", getattr(self, "inferred_freq", None))
762 _, parsed, reso = parsing.parse_time_string(label, freq)
--> 763 lower, upper = self._parsed_string_to_bounds(reso, parsed)
764 # lower, upper form the half-open interval:
765 # [parsed, parsed + 1 freq)
~\Anaconda3\envs\tf2\lib\site-packages\pandas\core\indexes\datetimes.py in _parsed_string_to_bounds(self, reso, parsed)
569 if self.tz is None:
570 raise ValueError(
--> 571 "The index must be timezone aware when indexing "
572 "with a date string with a UTC offset"
573 )
ValueError: The index must be timezone aware when indexing with a date string with a UTC offset
I provide a limited part of this Series for reproducibility purposes (JSON format):
'{"1582539072500":1,"1582539073000":1,"1582539073500":1,"1582539074000":1,"1582539074500":1,"1582539075000":1,"1582539075500":1,"1582539076000":1,"1582539076500":1,"1582539077000":1,"1582539077500":1,"1582539078000":1,"1582539078500":1,"1582539080500":1,"1582539081000":1,"1582539081500":1,"1582539082000":1,"1582539082500":1,"1582539083000":1,"1582539083500":1,"1582539084000":1,"1582539084500":1,"1582539085000":1,"1582539085500":1,"1582539086000":1,"1582539086500":1,"1582539088500":1,"1582539089000":1,"1582539089500":1,"1582539090000":1,"1582539090500":1,"1582539091000":1,"1582539091500":1,"1582539092500":1,"1582539093000":1,"1582539093500":1,"1582539094000":1,"1582539094500":1,"1582539095000":1,"1582539095500":1,"1582539096000":1,"1582539097500":1,"1582539099500":1,"1582539101000":1,"1582539101500":1,"1582539104000":1,"1582539104500":1,"1582539105500":1,"1582539106000":1,"1582539109000":1}'
What causes the error, why do my actions not resolve it, and what should I do?
The error is "The index must be timezone aware when indexing with a date string with a UTC offset". However there may be a typo in your code. You have y['2020-02-24 10-11-12':] but there are hyphens in between the hours, minutes, and seconds. I reproduced the error you had and just replaced the time portion hyphens with colons and was able to get it to run on the sample data.
y['2020-02-24 10:11:12':] should work.
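A minimal sketch reproducing this on a hypothetical three-point slice of the question's sample data:
import pandas as pd

# Epoch-millisecond keys as in the question's JSON sample.
y = pd.Series(1, index=pd.to_datetime([1582539072500, 1582539073000, 1582539073500], unit="ms"))
y.index = y.index.tz_localize(None)

# y['2020-02-24 10-11-12':]        # ValueError: string parsed as carrying a UTC offset
print(y['2020-02-24 10:11:12':])   # colons parse as a plain naive timestamp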

KeyError: "['cut'] not found in axis"

I am working with thousands of lines of data, trying to narrow a search for certain grains. I have an 'Asset' column with about 20 different values, and for each of them I need the sum of the adjacent 'Load' column.
I would like to cut the unnecessary rows out of my data set. To start, I relabeled all of the extra assets as 'cut' (as shown in the example below) so that I could manage with one .drop command. Here is how it is coded:
df14['Asset'] = df14["Asset"].str.replace('BEANS', 'cut')
df14.drop("cut", axis=0)
set(df14['Asset'])
This is the error I have received:
KeyError Traceback (most recent call last)
<ipython-input-593-40006512df80> in <module>
----> 1 df14.drop("cut", axis=0)
2 set(df14['Asset'])
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in drop(self, labels, axis, index, columns, level, inplace, errors)
4100 level=level,
4101 inplace=inplace,
-> 4102 errors=errors,
4103 )
4104
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in drop(self, labels, axis, index, columns, level, inplace, errors)
3912 for axis, labels in axes.items():
3913 if labels is not None:
-> 3914 obj = obj._drop_axis(labels, axis, level=level, errors=errors)
3915
3916 if inplace:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in _drop_axis(self, labels, axis, level, errors)
3944 new_axis = axis.drop(labels, level=level, errors=errors)
3945 else:
-> 3946 new_axis = axis.drop(labels, errors=errors)
3947 result = self.reindex(**{axis_name: new_axis})
3948
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexes\base.py in drop(self, labels, errors)
5338 if mask.any():
5339 if errors != "ignore":
-> 5340 raise KeyError("{} not found in axis".format(labels[mask]))
5341 indexer = indexer[~mask]
5342 return self.delete(indexer)
KeyError: "['cut'] not found in axis"
I have tried several commands to remove these lines, like:
df14.drop(["cut"], inplace=True)
df14[~df14['Asset'].isin(to_drop)]
df14[df14['Asset'].str.contains('cut', na=True)]
And all of them yield the same result.
When I code
df14 = df14[~df14["Asset"].str.contains('BEANS')]
It does not remove the Load number, which is the next column over, from my final calculations.
Is it possible to remove all rows of data with a certain label so I can trim from 20 assets to 7 assets?
Thank you
DataFrame.drop works column- or row-wise: you give a column name to drop a column, or an index label to drop a row, and axis=0 means index-wise. Since you don't have an index label named "cut", it gives the error.
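A minimal sketch of those semantics, with hypothetical toy data:
import pandas as pd

df = pd.DataFrame({"Asset": ["CORN", "cut"], "Load": [10, 20]}, index=["a", "b"])
df.drop("b", axis=0)      # drops the row with index label "b"
df.drop("Load", axis=1)   # drops the "Load" column
# df.drop("cut", axis=0)  # KeyError: "cut" is a cell value, not an index label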
I recommend doing it by:
df = df.loc[df['Asset'] != 'cut']
I believe that df14.drop("cut", axis=0) is failing because it is looking for the label "cut" in the index of df14. You could potentially set the Asset column as the index (see the pandas documentation on drop for how), but I think a better solution might be something along the lines of
df14 = df14.query('Asset != "cut"')
I can't say whether this is the fastest solution; since I usually work with small-ish datasets, I've not had to worry about performance too much.
This should do the job.
Here you are basically selecting all rows other than 'cut':
df14 = df14.loc[df14['Asset'] != 'cut']
