Pandas MultiIndex in which one factor is an Enum - python-3.x

I'm having trouble working with a dataframe whose columns are a MultiIndex in which one of the iterables is an Enum. Consider the code:
from enum import Enum
import pandas as pd

MyEnum = Enum("MyEnum", "A B")
df = pd.DataFrame(columns=pd.MultiIndex.from_product(iterables=[MyEnum, [1, 2]]))
This raises
TypeError: 'values' is not ordered, please explicitly specify the categories order by passing in a categories argument.
This can be worked around by instead putting:
df = pd.DataFrame(columns=pd.MultiIndex.from_product(iterables=[
    pd.Series(MyEnum, dtype="category"),
    [1, 2]
]))
but then appending a row with
df.append({(MyEnum.A, 1): "abc", (MyEnum.B, 2): "xyz"}, ignore_index=True)
raises the same TypeError as before.
I've tried various variations on this theme, with no success. (No problems occur if the columns are a plain enum index rather than a MultiIndex.)
(Note that I can dodge this by using an IntEnum instead of an Enum. But then my columns simply appear as numbers; this is why I wanted to use an Enum in the first place, as opposed to ints.)
Many thanks!
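One workaround (a sketch, not necessarily the only or canonical fix): label the columns with the enum members' *names* instead of the members themselves. Strings sort naturally, so the "'values' is not ordered" TypeError never arises, and the labels stay readable. The row-adding step below uses `df.loc[len(df)] = ...` rather than the question's `df.append`, since `append` has since been removed from pandas.

```python
from enum import Enum
import pandas as pd

MyEnum = Enum("MyEnum", "A B")

# Use the enum names (plain strings) as the first level of the MultiIndex.
cols = pd.MultiIndex.from_product([[e.name for e in MyEnum], [1, 2]])
df = pd.DataFrame(columns=cols)

# Add a row keyed by (name, number) tuples; columns not mentioned
# are filled with NaN by alignment.
df.loc[len(df)] = pd.Series({(MyEnum.A.name, 1): "abc",
                             (MyEnum.B.name, 2): "xyz"})
```

Lookups can still go through the enum (`df[(MyEnum.A.name, 1)]`), so the enum remains the single source of truth for the labels.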

Related

df.sample(ignore_index=True) still returns index

How do I sample() a row from a Dataframe, without its index?
If there is a different approach, that's ok.
The pandas documentation offers an ignore_index parameter:
DataFrame.sample(n=None, frac=None, replace=False, weights=None,
random_state=None, axis=None, ignore_index=False)
Source
However, when I run:
df['col'].sample(ignore_index=True)
I'll still get the index and the value, e.g.:
1 value
Desired:
value
In pandas, all objects like Series and DataFrame have an index.
The ignore_index parameter generates a default RangeIndex; it does not remove the index.
If you need a scalar from the Series, select the first value:
out = df['col'].sample(1).iat[0]
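A minimal runnable illustration of that answer (the column values here are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({'col': [10, 20, 30, 40]})

# .sample(1) returns a one-element Series, which still carries an index;
# .iat[0] extracts the bare scalar from it.
out = df['col'].sample(1, random_state=0).iat[0]
print(out)  # just the value, no index line
```

`random_state` is only there to make the draw reproducible; drop it for a genuinely random sample.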

Why SettingWithCopyWarning is raised using .loc?

I have checked similar questions on SO about the SettingWithCopyWarning raised when using .loc, but I still don't understand why I get the warning in the following example.
It appears on line 3. I can make it disappear with .copy(), but I would like to understand why .loc specifically didn't help here.
Does making a conditional slice create a view, even with .loc?
df = pd.DataFrame( data=[0,1,2,3,4,5], columns=['A'])
df.loc[:,'B'] = df.loc[:,'A'].values
dfa = df.loc[df.loc[:,'A'] < 4,:] # here .copy() removes the error
dfa.loc[:,'C'] = [3,2,1,0]
Edit : pandas version is 1.2.4
dfa = df.loc[df.loc[:,'A'] < 4,:]
dfa is a slice of the df dataframe, still referencing the original dataframe: a view. .copy() creates a separate copy, not just a view of the first dataframe.
dfa.loc[:,'C'] = [3,2,1,0]
When dfa is a view, not a copy, you get the warning: A value is trying to be set on a copy of a slice from a DataFrame.
.loc selects the rows matching the condition you give it, but the result can still be a view, and you will keep getting the warning on assignment unless you make an explicit copy of the dataframe.
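Putting the answer together, this is the question's example with the `.copy()` fix applied:

```python
import pandas as pd

df = pd.DataFrame(data=[0, 1, 2, 3, 4, 5], columns=['A'])
df.loc[:, 'B'] = df.loc[:, 'A'].values

# Take an explicit copy of the filtered slice: dfa is now its own
# DataFrame, so adding a column cannot write through to df, and no
# SettingWithCopyWarning is raised.
dfa = df.loc[df['A'] < 4, :].copy()
dfa.loc[:, 'C'] = [3, 2, 1, 0]
```

The trade-off is a little extra memory for the copied slice in exchange for unambiguous ownership of the data you are mutating.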

how to use lambda function to select larger values from two python dataframes whilst comparing them by date?

I want to map through the rows of df1 and compare them with the values of df2, by month and day, across every year in df2, keeping only the values in df1 which are larger than those in df2, to add into a new column, 'New'. df1 and df2 are the same size, and both are indexed by 'Month' and 'Day'. What would be the best way to do this?
df1=pd.DataFrame({'Date':['2015-01-01','2015-01-02','2015-01-03','2015-01-04','2005-01-05'],'Values':[-5.6,-5.6,0,3.9,9.4]})
df1.Date=pd.to_datetime(df1.Date)
df1['Day']=pd.DatetimeIndex(df1['Date']).day
df1['Month']=pd.DatetimeIndex(df1['Date']).month
df1.set_index(['Month','Day'],inplace=True)
df1
df2 = pd.DataFrame({'Date':['2005-01-01','2005-01-02','2005-01-03','2005-01-04','2005-01-05'],'Values':[-13.3,-12.2,6.7,8.8,15.5]})
df2.Date=pd.to_datetime(df2.Date)
df2['Day']=pd.DatetimeIndex(df2['Date']).day
df2['Month']=pd.DatetimeIndex(df2['Date']).month
df2.set_index(['Month','Day'],inplace=True)
df2
df1 and df2
df2['New']=df2[df2['Values']<df1['Values']]
gives
ValueError: Can only compare identically-labeled Series objects
I have also tried
df2['New']=df2[df2['Values'].apply(lambda x: x < df1['Values'].values)]
The best way to handle your problem is to use numpy as a tool. NumPy has a function called np.where that helps a lot in cases like this.
This is how it works:
df1['new column that will contain the comparison results'] = np.where(condition, 'value if true', 'value if false')
First import the library:
import numpy as np
Using the condition provided by you:
df2['New'] = np.where(df2['Values'] > df1['Values'], df2['Values'], '')
So I think that solves your problem. You can change the value passed for the False condition to anything you want; this is only an example.
Tell us if it worked!
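A self-contained sketch of the np.where approach, using small made-up frames with identical (Month, Day) indexes so the elementwise comparison is well defined; here the condition keeps the df1 value when it is the larger one, matching the question's goal:

```python
import numpy as np
import pandas as pd

# Two frames indexed identically by (Month, Day); rows then align
# one-for-one, so an elementwise comparison is valid.
idx = pd.MultiIndex.from_tuples([(1, 1), (1, 2), (1, 3)],
                                names=['Month', 'Day'])
df1 = pd.DataFrame({'Values': [-5.6, 0.0, 9.4]}, index=idx)
df2 = pd.DataFrame({'Values': [-13.3, 6.7, 8.8]}, index=idx)

# Keep the df1 value where it exceeds df2's value, else NaN.
df2['New'] = np.where(df1['Values'] > df2['Values'],
                      df1['Values'], np.nan)
print(df2['New'].tolist())  # [-5.6, nan, 9.4]
```

Using np.nan (rather than '') for the False branch keeps the column numeric, which matters if you want to aggregate 'New' later.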
Let's try two possible solutions:
The first solution is to sort the index first:
df1.sort_index(inplace=True)
df2.sort_index(inplace=True)
Perform a simple test to see if it works!
df1 == df2
This may still raise some kind of error; if that happens, try sorting along the columns instead:
df1.sort_index(inplace=True, axis=1)
df2.sort_index(inplace=True, axis=1)
The second solution is to drop the indexes and reset them:
df1.reset_index(drop=True, inplace=True)
df2.reset_index(drop=True, inplace=True)
Perform the same test again:
df1 == df2
See if it works and tell us the result.

Data Frame Error: UndefinedVariableError: name is not defined

I'm working in a Jupyter notebook and am trying to create objects for the two different answers in a column, Yes and No, in order to see the similarities between all of the 'yes' responses, and the same for the 'no' responses as well.
When I use the following code, I get an error that states: UndefinedVariableError: name 'No' is not defined
df_yes=df.query('No-show == \"Yes\"')
df_no=df.query('No-show == \"No\"')
Since the same error occurs even when I only include df_yes, I figured it has to have something to do with the column name "No-show". So I tried it with different columns and, sure enough, it works.
So can someone enlighten me about what I'm doing wrong with this code block, so I won't do it again? Thanks!
Observe this example:
>>> import pandas as pd
>>> d = {'col1': ['Yes','No'], 'col2': ['No','No']}
>>> df = pd.DataFrame(data=d)
>>> df.query('col1 == \"Yes\"')
col1 col2
0 Yes No
>>> df.query('col2 == \"Yes\"')
Empty DataFrame
Columns: [col1, col2]
Index: []
>>>
Everything seems to work as expected. But, if I change col1 and col2 to col-1 and col-2, respectively:
>>> d = {'col-1': ['Yes','No'], 'col-2': ['No','No']}
>>> df = pd.DataFrame(data=d)
>>> df.query('col-1 == \"Yes\"')
...
pandas.core.computation.ops.UndefinedVariableError: name 'col' is not defined
As you can see, the problem is the minus (-) in your column name. As a matter of fact, you were even unluckier than that: the No in your error message refers to No-show, not to the value No in your columns.
So the best solution (and best practice in general) is to name your columns differently; think of them as variables, which cannot contain a minus in their name, at least in Python. For example, No_show. If this data frame is not created by you (e.g. you read your data from a csv file), it's common practice to rename the columns appropriately.
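For completeness, pandas also lets you keep the awkward name: since pandas 0.25, query() accepts backtick-quoted column names that are not valid Python identifiers. A small sketch of both options, on a tiny made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'No-show': ['Yes', 'No'], 'Age': [30, 40]})

# Option 1: backtick-quote the column name inside query()
# (works for names that are not valid Python identifiers).
df_yes = df.query('`No-show` == "Yes"')

# Option 2: rename the column, as suggested above, then query normally.
df = df.rename(columns={'No-show': 'No_show'})
df_no = df.query('No_show == "No"')
```

Renaming remains the more robust choice, since the original name will keep biting you in every query and attribute access.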

How to fill Pandas dataframe column with numpy array in for loop

I have a for loop generating three values: age (data type int64), dayofMonth (data type numpy.ndarray) and Gender (data type str). I would like to store these three values from each iteration in a pandas data frame with columns Age, Day and Gender. Can you suggest how to do that? I'm using Python 3.x.
I tried this code inside for loop
df = pd.DataFrame(columns=["Age", "Day", "Gender"])
for i in range(100):
    df.loc[i] = [age, day, gender]
I can't share sample data, but here is one example:
age = 38
day = array([[ 1],
       [ 3],
       [ 5],
       ...,
       [25],
       [26],
       [30]], dtype=int64)
gender = 'M'
But I'm getting the error message ValueError: setting an array element with a sequence.
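The ValueError comes from assigning a 2-D (n, 1) array into a single cell. One possible sketch of a fix (the values and loop below are stand-ins for the question's real data): flatten the array to a plain list so pandas stores it as a single object in the cell, and collect the rows in a list of dicts before building the frame once at the end, which is also much faster than growing a DataFrame with df.loc[i] inside the loop.

```python
import numpy as np
import pandas as pd

# Hypothetical per-iteration values, mirroring the question.
age, gender = 38, 'M'
day = np.array([[1], [3], [5], [25], [26], [30]], dtype=np.int64)

rows = []
for i in range(3):  # stand-in for the real loop
    # day.ravel().tolist() turns the (n, 1) array into a flat Python
    # list, which pandas can keep in one cell (object dtype) instead
    # of trying to broadcast it into a scalar slot.
    rows.append({'Age': age, 'Day': day.ravel().tolist(), 'Gender': gender})

df = pd.DataFrame(rows, columns=['Age', 'Day', 'Gender'])
```

Whether a list-valued 'Day' column is the right shape depends on what comes next; if each day should be its own row instead, explode the column afterwards with df.explode('Day').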
