Pandas - Lists as index values - python-3.x

I understand that Pandas Index values must be immutable/hashable. However, on running the following code - I was surprised to see that the list was allowed as an index value.
import pandas as pd
obj = pd.Series([4,7,5,23,3], index = [2,'A', 2+3j, (3,7), [1,2]], name = 'Trial')
print(obj)
print(type(obj))
Output:
2 4
A 7
(2+3j) 5
(3, 7) 23
[1, 2] 3
Name: Trial, dtype: int64
<class 'pandas.core.series.Series'>
In the past, I have experimented with putting an unhashable type in as an index value and, as expected, was not allowed to.
Is there something I am missing in this piece of code? Have there been changes to the library?
Of course, when trying to subscript the value at that index using obj[[1,2]], I get the 'unhashable type: list' error as expected. But as far as I recall, in the past I was never allowed to use a list as an index value in the first place.
Thanks in advance for any help/explanation.

In the past, I have experimented with putting an unhashable type in as an index value and, as expected, was not allowed to.
Are you sure?
import pandas as pd
import sys
print(pd.__version__) # 0.24.2
print(sys.version) # '2.7.18 (default, Jan 23 2023, 08:22:06) \n[GCC 12.2.0]'
obj = pd.Series([4,7,5,23,3], index = [2,'A', 2+3j, (3,7), [1,2]], name = 'Trial')
print(obj[[1,2]])
# Output (falls back to positional selection, which is arguably wrong, but it works)
A 7
(2+3j) 5
Name: Trial, dtype: int64
It also works for python==3.8.10 and pandas==1.0.5.
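The list appears to slip in because constructing an Index just stores the elements in an object-dtype array; hashability only seems to be enforced once a label is actually looked up. A minimal sketch illustrating this (exact behaviour may vary across pandas versions):
import pandas as pd
obj = pd.Series([4,7,5,23,3], index = [2,'A', 2+3j, (3,7), [1,2]], name = 'Trial')
print(obj.index.dtype)  # object - the list was stored as-is, without being hashed
print(obj[(3,7)])       # 23 - a hashable label still looks up fine
# obj[[1,2]] is treated as a list of labels, not as the single list label,
# and the lookup machinery then trips over the unhashable list in the index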

Related

Why Pandas stack operation fails depending on the column type?

I have an issue with an unstack / stack operation on pandas 1.1.5, with python 3.8.10.
Let's say that I have a pandas dataframe looking like this:
import datetime as dt
import pandas as pd
# Data
category_1 = ["cat1"] * 18
category_2 = ["CAT1", "CAT2", "CAT3", "CAT4", "CAT5", "CAT6"] * 3
dates = [dt.datetime(2022, 11, 1)] * 6 + [dt.datetime(2022, 11, 2)] * 6 + [dt.datetime(2022, 11, 3)] * 6
numbers = [50] * 18
# Dict
df_dict = {
    "category_1": category_1,
    "category_2": category_2,
    "dates": dates,
    "numbers": numbers,
}
df = pd.DataFrame(df_dict)
df = df.astype({"numbers": "Int64"}) # specific needs to handle NaN values with int column
df = df.set_index(["category_1", "category_2", "dates"])
df.head()
and I want to unstack on category 1 & 2 to manipulate the dates index (i.e. missing-date filling, other operations, whatever; not really relevant to the specific question).
df= df.unstack(["category_1", "category_2"])
df
If I want to stack the same columns back (even without any manipulation), I get the following error:
df.stack(["category_1", "category_2"])
IndexError: index 3 is out of bounds for axis 0 with size 3
I was able to solve this issue by removing the Int64 type forcing:
df = df.astype({"numbers": "int64"}) # No longer an issue to unstack / stack
I was expecting the same bug but didn't get one.
Can someone help me understand what the problem could be 'under the hood'?
Thanks.
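For what it's worth, a sketch of a round-trip workaround based on that observation, assuming the nullable Int64 dtype is still needed afterwards (note that astype("int64") fails if NaN values are already present):
# df here is the indexed frame from the question, before unstacking
tmp = df.astype({"numbers": "int64"})   # plain int64 while reshaping
tmp = tmp.unstack(["category_1", "category_2"])
# ... manipulate the dates index here ...
out = tmp.stack(["category_1", "category_2"]).astype({"numbers": "Int64"})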

Getting the location of indices missing from secondary DataFrame

Please examine the commented text in the code below in order to understand the problem.
import pandas as pd
import numpy as np
primary = pd.DataFrame(
    data=['little', 'mary', 'had', 'a', 'swan'],
    index=pd.DatetimeIndex(['2015-09-25 12:00:00',
                            '2015-09-25 13:00:00',
                            '2015-09-25 14:00:00',
                            '2015-09-25 15:00:00',
                            '2015-09-25 16:00:00']),
    columns=['some_nonsense'])
secondary = pd.DataFrame(
    data=['mommy', np.nan],
    index=pd.DatetimeIndex(['2015-09-25 14:00:00',
                            '2015-09-25 15:00:00']),
    columns=['copy_me'])
# 1. secondary dataframe values have already been computed
# 2. we want to assign them to the primary dataframe for available dates
# 3. once done, we want to return dataframe index locations for missing values
# 4. nan is one of the valid values the secondary dataframe can take
primary['copy_me'] = secondary['copy_me']
print (secondary)
print (primary)
# The values have been copied successfully
# But how to get the locations of missing indices?
# The expected result is as follows:
# If I know these values I could pass them to my computing function
missing_indices = np.array([0,1,4])
print('needed result: ', missing_indices)
If I understand correctly, this might help:
(~primary.index.isin(secondary.index)).nonzero()[0]
Breakdown:
Find which primary indices are present in secondary (primary.index.isin(secondary.index)).
Negate that condition (~).
Find the positions where the value is non-zero, i.e. True, using numpy.nonzero (.nonzero()[0]; the [0] is needed because it returns a tuple).
You can just check if primary.index is in secondary.index:
np.flatnonzero(~primary.index.isin(secondary.index))
# array([0, 1, 4], dtype=int32)
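As a quick sanity check against the expected result from the question:
missing_indices = np.flatnonzero(~primary.index.isin(secondary.index))
print(missing_indices)                 # [0 1 4]
print(primary.index[missing_indices])  # the timestamps whose values still need computing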

TypeError: unsupported operand type(s) for -: 'str' and 'str' | pandas reindex

I'm getting a lengthy error traceback, with the last line as stated in the title.
I'm trying to use the nearest method to fill missing values during reindexing.
Here's my code:
import pandas as pd
s1=pd.Series([1,2,3,4],index=list('aceg'))
print(s1.reindex(pd.Index(list('abdg')),method='nearest'))
I was trying to see whether the filling of missing info happens after reindexing or during reindexing, which might affect the result with method='nearest'.
Changing the method to ffill or bfill works fine.
It's not possible to do that with strings, because the distance between two strings doesn't mean much. For this use case, you can convert your one-character index to numbers with the ord function:
s1 = pd.Series([1,2,3,4], index=list('aceg'))
idx = pd.Index(list('gdba'))
s1.index = idx[s1.index.map(ord).reindex(idx.map(ord), method='nearest')[1]]
print(s1)
# Output:
a 1
b 2
d 3
g 4
dtype: int64
Details:
>>> s1.index.map(ord)
Int64Index([97, 99, 101, 103], dtype='int64')
>>> idx.map(ord)
Int64Index([103, 100, 98, 97], dtype='int64')
If you have a full string index instead of a one-character index, you can handle it with fuzzy matching and Levenshtein distance.
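As a minimal sketch of that idea, here is a version using the standard library's difflib as a stand-in for a dedicated Levenshtein package (the labels are made up for illustration):
import difflib
import pandas as pd
s1 = pd.Series([1, 2, 3, 4], index=['alpha', 'bravo', 'charlie', 'delta'])
targets = ['alpa', 'brav0', 'charly', 'dleta']  # misspelled labels to reindex on
# for each target label, pick the closest existing label and take its value
closest = [difflib.get_close_matches(t, s1.index, n=1, cutoff=0.0)[0] for t in targets]
s2 = pd.Series(s1.loc[closest].to_numpy(), index=targets)
print(s2)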

Pandas dataframe float index not self-consistent

I need/want to work with float indices in pandas, but I get a KeyError when running something like this:
import numpy as np
import pandas as pd

inds = [1.1, 2.2]
cols = [5.4, 6.7]
df = pd.DataFrame(np.random.randn(2, 2), index=inds, columns=cols)
df[df.index[0]]  # raises KeyError
I have seen some errors regarding precision, but shouldn't this work?
You get the KeyError because df[df.index[0]] would try to access a column with label 1.1 in this case - which does not exist here.
What you can do is use loc or iloc to access rows based on indices:
import numpy as np
import pandas as pd
inds = [1.1, 2.2]
cols = [5.4, 6.7]
df = pd.DataFrame(np.random.randn(2, 2), index=inds, columns=cols)
# to access e.g. the first row use
df.loc[df.index[0]]
# or more general
df.iloc[0]
# 5.4 1.531411
# 6.7 -0.341232
# Name: 1.1, dtype: float64
In principle, if you can, avoid equality comparisons for floating point numbers, for the reason you already came across: precision. The 1.1 displayed to you might be != 1.1 for the computer, simply because storing it exactly would theoretically require infinite precision. Most of the time it will work, though, because certain tolerance checks kick in, for example when the difference between the compared numbers is below a tiny threshold such as 10^-6.
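A quick illustration of the precision point, with values like those in the question, plus a tolerance-based comparison:
import numpy as np
print(1.1 + 2.2 == 3.3)            # False - the binary representations differ
print(1.1 + 2.2)                   # 3.3000000000000003
print(np.isclose(1.1 + 2.2, 3.3))  # True - equal within a small tolerance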

getting indices in numpy

Can someone find out what is wrong with the code below?
import numpy as np
data = np.recfromcsv("data.txt", delimiter=" ", names=['name', 'types', 'value'])
indices = np.where((data.name == 'david') * data.types.startswith('height'))
mean_value = np.mean(data.value[indices])
I want to calculate mean of weight and height for david and mark as follows:
david>> mean(weight_2005 and weight_2012), mean (height_2005 and height_2012)
mark>> mean(weight_2005 and weight_2012), mean (height_2005 and height_2012)
From the text (data.txt) file:
david weight_2005 50
david weight_2012 60
david height_2005 150
david height_2012 160
mark weight_2005 90
mark weight_2012 85
mark height_2005 160
mark height_2012 170
I am using Python 3.2 and numpy 1.8.
The above code produces a TypeError as follows:
TypeError: startswith first arg must be bytes or a tuple of bytes, not numpy.str_
With Python 3.2 and numpy 1.7, this line works:
indices = np.where((data.name == b'david') * data.types.startswith(b'height'))
data displays as:
rec.array([(b'david', b'weight_2005', 50),...],
dtype=[('name', 'S5'), ('types', 'S11'), ('value', '<i4')])
type(data.name[0]) is <class 'bytes'>.
b'height' works in Python 2.7 as well.
Another option is to convert all the data to unicode (Python 3 strings):
dtype=[('name','U5'), ('types', 'U11'), ('value', '<i4')]
dataU=data.astype(dtype=dtype)
indices = np.where((dataU.name == 'david') * dataU.types.startswith('height'))
or
data = np.recfromtxt('data.txt', delimiter=" ",
                     names=['name', 'types', 'value'], dtype=dtype)
It looks like recfromcsv does not take a dtype, but recfromtxt does.
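Putting it together, a sketch that computes the four means the question asks for, assuming the data.txt layout shown above:
import numpy as np
dtype = [('name', 'U5'), ('types', 'U11'), ('value', '<i4')]
data = np.recfromtxt('data.txt', delimiter=' ', dtype=dtype)
for person in ('david', 'mark'):
    for measure in ('weight', 'height'):
        # rows for this person whose 'types' field starts with the measure name
        mask = (data.name == person) & np.char.startswith(data.types, measure)
        print(person, measure, data.value[mask].mean())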
