Retrieve pandas dataframe column index - python-3.x

Say I have a pandas dataframe. I can access the columns either by their name or by their index.
Is there a simple way in which I can retrieve the column index given its name?

Use get_loc on the columns Index object to return the ordinal index value:
In [283]:
df = pd.DataFrame(columns=list('abcd'))
df
Out[283]:
Empty DataFrame
Columns: [a, b, c, d]
Index: []
In [288]:
df.columns.get_loc('b')
Out[288]:
1
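If you need the positions of several columns at once, here is a small sketch along the same lines; get_indexer is the vectorised counterpart of get_loc and returns a NumPy array of ordinal positions:
import pandas as pd

df = pd.DataFrame(columns=list('abcd'))

df.columns.get_loc('b')              # 1
df.columns.get_indexer(['b', 'd'])   # array([1, 3])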

What do you mean by index exactly?
I bet you are referring to index as a list index, right?
Because Pandas has another kind of index too.
From my first understanding, you can do the following:
my_df = pd.DataFrame(columns=['A', 'B', 'C'])
my_columns = my_df.columns.tolist()
print(my_columns)  # yields ['A', 'B', 'C'], so you can recover the index with:
my_columns.index('C')  # yields 2

Related

Convert lists present in each column to their respective datatypes

I have a sample dataframe as given below.
import numpy as np
import pandas as pd
data = {'ID': ['A', 'B', 'C', 'D'],
        'Age': [[20], [21], [19], [24]],
        'Sex': [['Male'], ['Male'], ['Female'], np.nan],
        'Interest': [['Dance','Music'], ['Dance','Sports'], ['Hiking','Surfing'], np.nan]}
df = pd.DataFrame(data)
df
Each of these columns contains lists. I want to remove the lists while preserving the datatypes of the values inside them, for all columns.
The final output should look something like what is shown below.
Any help is greatly appreciated. Thank you.
Option 1. You can use the .str column accessor to index the lists stored in the DataFrame values (or strings, or any other iterable):
# Replace columns containing length-1 lists with the only item in each list
df['Age'] = df['Age'].str[0]
df['Sex'] = df['Sex'].str[0]
# Pass the variable-length list into the join() string method
df['Interest'] = df['Interest'].apply(', '.join)
Option 2. explode Age and Sex, then apply ', '.join to Interest:
df = df.explode(['Age', 'Sex'])
df['Interest'] = df['Interest'].apply(', '.join)
Both options return:
df
ID Age Sex Interest
0 A 20 Male Dance, Music
1 B 21 Male Dance, Sports
2 C 19 Female Hiking, Surfing
EDIT
Option 3. If you have many columns that contain lists, possibly with missing values (np.nan), you can collect the list-column names and then loop over them as follows:
# Get columns which contain at least one python list
list_cols = [c for c in df
             if df[c].apply(lambda x: isinstance(x, list)).any()]
list_cols
['Age', 'Sex', 'Interest']
# Process each column
for c in list_cols:
    # If all (non-missing) lists in column c contain a single item:
    if (df[c].dropna().str.len() == 1).all():
        df[c] = df[c].str[0]
    else:
        # Join only actual lists so NaN entries pass through untouched
        df[c] = df[c].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)
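For reference, a minimal self-contained run of Option 3 against the question's sample data; with the isinstance guard, the NaN entries in row D pass through untouched:
import numpy as np
import pandas as pd

data = {'ID': ['A', 'B', 'C', 'D'],
        'Age': [[20], [21], [19], [24]],
        'Sex': [['Male'], ['Male'], ['Female'], np.nan],
        'Interest': [['Dance', 'Music'], ['Dance', 'Sports'],
                     ['Hiking', 'Surfing'], np.nan]}
df = pd.DataFrame(data)

list_cols = [c for c in df
             if df[c].apply(lambda x: isinstance(x, list)).any()]

for c in list_cols:
    if (df[c].dropna().str.len() == 1).all():
        df[c] = df[c].str[0]
    else:
        df[c] = df[c].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)

print(df)
#   ID  Age     Sex         Interest
# 0  A   20    Male     Dance, Music
# 1  B   21    Male    Dance, Sports
# 2  C   19  Female  Hiking, Surfing
# 3  D   24     NaN              NaN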

Replace values in Pandas DataFrame with Unique values from the same DataFrame

I have a pandas DataFrame which has values that are not correct
data = {'Model':['A', 'B', 'A', 'B', 'A'], 'Value':[20, 40, 20, 40, -1]}
df = pd.DataFrame(data)
df
Out[46]:
Model Value
0 A 20
1 B 40
2 A 20
3 B 40
4 A -1
I would like to replace the -1 with the unique value of Model A, which in this case is 20.
How do I go about it? I have tried the following.
In my case it is a large DataFrame with 2 million rows.
df2 = df[df.Value != -1]
pd.merge(df, df2, on='Model', how='left')
Out:
MemoryError: Unable to allocate 5.74 TiB for an array with shape (788568381621,) and data type int64
You don't need to merge, which creates all possible pairs of rows with the same Model. The following will do:
df['Value'] = df['Value'].mask(df['Value'] == -1).groupby(df['Model']).transform('first')
Or you can also use map:
s = (df[df['Value'] != -1].drop_duplicates('Model')
       .set_index('Model')['Value'])
df['Value'] = df['Model'].map(s)
Here's a quick solution (it works here because -1 is smaller than every valid value):
df['Value'] = df.groupby('Model')['Value'].transform('max')
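As a quick check, here is the mask/transform approach run end to end on the sample data; note the column comes back as float because masking introduces NaN:
import pandas as pd

df = pd.DataFrame({'Model': ['A', 'B', 'A', 'B', 'A'],
                   'Value': [20, 40, 20, 40, -1]})

# Hide the -1 placeholders, then broadcast each Model's first valid value
df['Value'] = (df['Value'].mask(df['Value'] == -1)
                          .groupby(df['Model'])
                          .transform('first'))

print(df)
#   Model  Value
# 0     A   20.0
# 1     B   40.0
# 2     A   20.0
# 3     B   40.0
# 4     A   20.0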

Pandas : change the index of the duplicates

I have 2 DataFrames, df0 and df1, with df1.shape[0] > df0.shape[0].
df0 and df1 have the exact same columns.
Most of the rows of df0 are in df1.
The indices of df0 and df1 are
df0.index = range(df0.shape[0])
df1.index = range(df1.shape[0])
I then created dft
dft = pd.concat([df0, df1], axis=0, sort=False)
and removed duplicated rows with
dft.drop_duplicates(subset='this_col_is_not_index', keep='first', inplace=True)
I have some duplicates on the index of dft. For example :
dft.loc[3].shape
returns
(2, 38)
My aim is to change the index of the second of those rows so that the index value 3 stays unique.
That second row should be re-indexed as dft.index.sort_values()[-1] + 1.
I would like to apply this operation on all duplicates.
References :
Python Pandas: Get index of rows which column matches certain value
Pandas: Get duplicated indexes
Redefining the Index in a Pandas DataFrame object
Add the parameter ignore_index=True to concat to avoid duplicated index values:
dft = pd.concat([df0, df1], axis=0, sort=False, ignore_index=True)
Use reset_index(drop=True) after dropping the duplicates:
dft = dft.reset_index(drop=True)
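A small sketch of both suggestions, using two hypothetical stand-in frames (the column name this_col_is_not_index is taken from the question):
import pandas as pd

df0 = pd.DataFrame({'this_col_is_not_index': ['x', 'y']})
df1 = pd.DataFrame({'this_col_is_not_index': ['y', 'z']})

# Option A: build a fresh RangeIndex at concat time
dft = pd.concat([df0, df1], axis=0, sort=False, ignore_index=True)

# Option B: concatenate, drop duplicates, then renumber
dft = pd.concat([df0, df1], axis=0, sort=False)
dft = dft.drop_duplicates(subset='this_col_is_not_index', keep='first')
dft = dft.reset_index(drop=True)

print(dft)
#   this_col_is_not_index
# 0                     x
# 1                     y
# 2                     z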

type of return value in itertuples and print column names of itertuples in pandas

I have a DataFrame as follows:
a b c d
0 0.140603 0.622511 0.936006 0.384274
1 0.246792 0.961605 0.866785 0.544677
2 0.710089 0.057486 0.531215 0.243285
I want to iterate the df with itertuples() and print the values and column names of each row. Currently I know the following method:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(3, 4), columns=['a', 'b', 'c', 'd'])
for item in df.itertuples():
    print(item)
And the output is:
Pandas(Index=0, a=0.55464273035498401, b=0.50784779485386233, c=0.55866384351761911, d=0.35969591433338755)
Pandas(Index=1, a=0.60682158587529356, b=0.37571390304543184, c=0.13566419305411737, d=0.55807909125502775)
Pandas(Index=2, a=0.73260693374584385, b=0.59246381839030349, c=0.92102184020347211, d=0.029942550647279687)
Question:
1) I thought the return value of each iteration would be a plain tuple (as the function name suggests), yet type(item) shows a Pandas(...) object. What is it really?
2) What is the best way to extract the values together with their column names ('a', 'b', 'c', 'd') as I loop through the rows?
It's a named tuple.
To access the values of the named tuple, either by label:
for item in df.itertuples():
    print(item.a, item.b)
or by position:
for item in df.itertuples():
    print(item[1], item[2])
With older Python/pandas versions, a DataFrame with more than 254 columns makes itertuples fall back to plain tuples, so only positional access is available. To still be able to access by label, restrict df to just the columns you need:
for item in df.loc[:, ['a', 'b']].itertuples():
    print(item.a, item.b)
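If you want the column names alongside the values inside the loop, one small sketch (assuming the frame is narrow enough that itertuples still yields named tuples) is to convert each row to a mapping with _asdict():
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(3, 4), columns=['a', 'b', 'c', 'd'])

for item in df.itertuples():
    row = item._asdict()          # {'Index': 0, 'a': ..., 'b': ..., ...}
    for col in df.columns:
        print(col, row[col])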

Python - Pandas Dataframe with Multiple Names per Column

Is there a way in pandas to give the same column of a pandas dataframe two names, so that I can index the column by only one of the two names? Here is a quick example illustrating my problem:
import pandas as pd
index=['a','b','c','d']
# The list of tuples here is really just to
# somehow visualize my problem below:
columns = [('A','B'), ('C','D'),('E','F')]
df = pd.DataFrame(index=index, columns=columns)
# I can index like that:
df[('A','B')]
# But I would like to be able to index like this:
df[('A',*)] #error
df[(*,'B')] #error
You can create a multi-index column:
df.columns = pd.MultiIndex.from_tuples(df.columns)
Then you can do:
df.loc[:, ("A", slice(None))]
Or: df.loc[:, (slice(None), "B")]
Here slice(None) selects all labels at that level, so (slice(None), "B") picks every column whose second level is "B", regardless of the first-level name; it is the tuple equivalent of :. You can also write it with pandas' index slicer, df.loc[:, pd.IndexSlice[:, "B"]], for the second case.
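For completeness, a short sketch that puts the pieces together; df.xs is an alternative way to select on a single column level (by default it drops the matched level from the result):
import pandas as pd

index = ['a', 'b', 'c', 'd']
columns = pd.MultiIndex.from_tuples([('A', 'B'), ('C', 'D'), ('E', 'F')])
df = pd.DataFrame(index=index, columns=columns)

df.loc[:, ('A', slice(None))]        # columns whose first level is 'A'
df.loc[:, pd.IndexSlice[:, 'B']]     # columns whose second level is 'B'
df.xs('B', axis=1, level=1)          # same selection via a cross-section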
