Add Multiindex Dataframe and corresponding Series - python-3.x

I am failing to add a multiindex dataframe and a corresponding series. E.g.,
df = pd.DataFrame({
'a': [0, 0, 1, 1], 'b': [0, 1, 0, 1],
'c': [1, 2, 3, 4], 'd':[1, 1, 1, 1]}).set_index(['a', 'b'])
# Dataframe might contain records that are not in the series and vice versa
s = df['d'].iloc[1:]
df + s
produces
ValueError: cannot join with no overlapping index names
Does anyone know how to resolve this? I can work around the issue by adding each column separately, using e.g.
df['d'] + s
But I would like to add the two in a single operation. Any help is much appreciated.

By default, + aligns the other operand along the columns, so adding a dataframe that shares column names works with +:
s = df.iloc[:, 1:]
df + s
# c d
#a b
#0 0 NaN 2
# 1 NaN 2
#1 0 NaN 2
# 1 NaN 2
In your case, you need to align along the index. With s being the Series from your question, you can explicitly specify axis=0 with the add method for that:
df.add(s, axis=0)
# c d
#a b
#0 0 NaN NaN
# 1 3.0 2.0
#1 0 4.0 2.0
# 1 5.0 2.0
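As a minimal sketch of the same idea, the snippet below repeats the axis=0 addition and then shows one way to avoid the NaN row. It assumes that a row missing from the Series should count as 0, and that only the rows of df need to be kept; both are assumptions, not something the question states. The names aligned, s_full and filled are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [0, 0, 1, 1], 'b': [0, 1, 0, 1],
    'c': [1, 2, 3, 4], 'd': [1, 1, 1, 1]}).set_index(['a', 'b'])
s = df['d'].iloc[1:]  # the Series from the question; it lacks row (0, 0)

# Align on the row index; rows missing from s become NaN.
aligned = df.add(s, axis=0)

# Assumption: a row missing from s should count as 0 instead of giving NaN.
s_full = s.reindex(df.index, fill_value=0)
filled = df.add(s_full, axis=0)
print(filled)
```

Reindexing the Series first keeps the addition itself a plain index-aligned add, which is why no extra arguments are needed on the second call.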

Related

Loop over columns with df.shift in Python

Let's say you have a dataframe like this:
df = pd.DataFrame({'A': [3, 1, 2, 3],
'B': [5, 6, 7, 8]})
df
A B
0 3 5
1 1 6
2 2 7
3 3 8
Now I want to shift each column by several offsets and run a calculation on each shifted column. I put the offsets I want in the index:
range_span = range(4)
result = pd.DataFrame(index=range_span)
Then I try to populate result with the following:
for c in df.columns:
    for i in range_span:
        result.iloc[i][c] = df[c].shift(i).max()
result
This only returns the index. I expected something like this:
You've got 3 critical issues:
issue #1
At this line
result.iloc[i][c] = df[c].shift(i).max()
This raises a warning that helps explain why result is empty:
...\pandas\core\indexing.py:670: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
According to the pandas documentation:
dfmi['one']['second'] = value
# becomes
dfmi.__getitem__('one').__setitem__('second', value)
Because result.iloc[i] returns a copy of that row, the assignment never reaches the original dataframe result. This is also why iloc did not complain when it was given a str index, as explained in #2.
Instead, pass the row and the column to a single iloc call (or to loc when using str labels), like this:
>>> df
A B C
0 1 10 100
1 2 20 200
2 3 30 300
>>> df.iloc[1, 2]
200
>>> df.iloc[[1, 2], [1, 2]]
B C
1 20 200
2 30 300
>>> df.iloc[1:3, 1:3]
B C
1 20 200
2 30 300
>>> df.iloc[:, 1:3]
B C
0 10 100
1 20 200
2 30 300
# ..and so on
issue #2
If you fix issue #1, you'll see the following error:
result.iloc[[i][c]] = df[c].shift(i).max()
TypeError: list indices must be integers or slices, not str
Also from the pandas documentation:
property DataFrame.iloc: Purely integer-location based indexing for selection by position.
In for c in df.columns: you are passing the column names A and B, which are str, not int. Use loc instead for str column labels.
The original code didn't raise a TypeError because of issue #1: c was passed as the argument of __setitem__() on the copied row.
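The difference between the two indexers can be sketched as follows. This is a minimal illustration on a made-up two-row frame, not the asker's data; col_pos is a name chosen here, and Index.get_loc is used to turn a column label into the integer position that iloc expects.

```python
import pandas as pd

result = pd.DataFrame(index=range(2), columns=['A', 'B'])

# Label-based: row label and column label in a single loc call.
result.loc[0, 'A'] = 5

# Position-based: translate the column label to its integer position first.
col_pos = result.columns.get_loc('B')
result.iloc[1, col_pos] = 7
print(result)
```

Both assignments go through one indexing call on result itself, so there is no intermediate copy for the value to get lost in.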
issue #3
iloc cannot enlarge a dataframe: it can only assign to rows and columns that already exist.
# using same df from #1
>>> df.iloc[1, 3] = 300
Traceback (most recent call last):
File "~\pandas\core\indexing.py", line 1394, in _has_valid_setitem_indexer
raise IndexError("iloc cannot enlarge its target object")
IndexError: iloc cannot enlarge its target object
An easier fix is to collect the values in a dict and convert it to a DataFrame once the manipulation is complete, or to create the DataFrame with a matching (or larger) size up front:
>>> df2 = pd.DataFrame(index=range(4), columns=range(3))
>>> df2
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
Combining all of this, a correct fix would be:
import pandas as pd
df = pd.DataFrame({'A': [3, 1, 2, 3],
                   'B': [5, 6, 7, 8]})
# Pre-size result so that every (index, column) cell already exists
result = pd.DataFrame(index=df.index, columns=df.columns)
for col in df.columns:
    for index in df.index:
        # A single loc call writes to result itself, not to a copy
        result.loc[index, col] = df[col].shift(index).max()
print(result)
Output:
A B
0 3 8
1 3 7
2 3 6
3 3 5
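The dict-based alternative mentioned under issue #3 could look roughly like this. It is a sketch rather than the answerer's code: data is a name introduced here, and range_span is taken from the question.

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 1, 2, 3],
                   'B': [5, 6, 7, 8]})
range_span = range(4)

# Collect one list of maxima per column, then build the DataFrame once.
data = {col: [df[col].shift(i).max() for i in range_span]
        for col in df.columns}
result = pd.DataFrame(data, index=range_span)
print(result)
```

Building the frame in one step avoids cell-by-cell assignment entirely, so none of the three issues above can occur.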

Drop a column in pandas if all values equal 1?

How do I drop columns in pandas where all values in that column are equal to a particular number? For instance, consider this dataframe:
df = pd.DataFrame({'A': [1, 1, 1, 1],
'B': [0, 1, 2, 3],
'C': [1, 1, 1, 1]})
print(df)
Output:
A B C
0 1 0 1
1 1 1 1
2 1 2 1
3 1 3 1
How would I drop the columns that contain only 1s so that the output is:
B
0 0
1 1
2 2
3 3
Use DataFrame.loc and test whether each column has at least one value not equal to 1, with DataFrame.ne and DataFrame.any:
df1 = df.loc[:, df.ne(1).any()]
Or test for values equal to 1 with DataFrame.eq and DataFrame.all, which is True for columns containing only 1s, and invert the mask with ~:
df1 = df.loc[:, ~df.eq(1).all()]
print (df1)
B
0 0
1 1
2 2
3 3
EDIT:
One consideration is what you want to happen if a column contains only NaN and 1.
If such a column should be dropped as well, replace the NaNs with 1 by DataFrame.fillna and use the same solution as before (filling with 0 would keep the column, because 0 is not equal to 1):
df1 = df.loc[:, df.fillna(1).ne(1).any()]
df1 = df.loc[:, ~df.fillna(1).eq(1).all()]
You can use any:
df.loc[:, df.ne(1).any()]
One consideration is what you want to happen if a column contains only NaN and 1.
If you want to drop the column under this condition too, you will need to either fillna with 1 or add a second condition combined with or (|).
df = pd.DataFrame({'A': [1, 1, 1, 1],
'B': [0, 1, 2, 3],
'C': [1, 1, 1, np.nan]})
print(df)
A B C
0 1 0 1.0
1 1 1 1.0
2 1 2 1.0
3 1 3 NaN
All of these leave the column of NaN and 1s in place:
df.loc[:, df.ne(1).any()]
df.loc[:, ~df.eq(1).all()]
So you can add an isna condition to drop that column as well:
df.loc[:, ~(df.eq(1) | df.isna()).all()]
Output:
B
0 0
1 1
2 2
3 3
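The two behaviours discussed above can be wrapped in one small function. This is only a sketch: drop_constant_columns and its ignore_nan flag are hypothetical names made up for illustration and are not part of pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1],
                   'B': [0, 1, 2, 3],
                   'C': [1, 1, 1, np.nan]})

def drop_constant_columns(frame, value, ignore_nan=True):
    """Drop columns whose entries all equal `value` (hypothetical helper)."""
    matches = frame.eq(value)
    if ignore_nan:
        # Treat NaN as matching, so a column of `value` and NaN is dropped too.
        matches = matches | frame.isna()
    return frame.loc[:, ~matches.all()]

print(drop_constant_columns(df, 1))
print(drop_constant_columns(df, 1, ignore_nan=False))
```

With ignore_nan=True only column B survives; with ignore_nan=False column C is kept because its NaN does not equal 1.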

Splitting dictionary/list into Separate Columns

I have a movie dataset saved for revenue prediction. However, the genres column of this dataset holds dictionaries, and some rows hold a list of two or more dictionaries. The DataFrame looks like this (this is not the actual dataframe, but it is similar):
df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, [{'c':4},{'d':3}], [{'c':5, 'd':6},{'c':7, 'd':8}]]})
This is the output:
a b
0 1 {'c': 1}
1 2 [{'c': 4}, {'d': 3}]
2 3 [{'c': 5, 'd': 6}, {'c': 7, 'd': 8}]
I need to split this column into separate columns.
How can I do that? I used the apply(pd.Series) method, and this is what I get as output:
0 1 c
0 NaN NaN 1.0
1 {'c': 4} {'d': 3} NaN
2 {'c': 5, 'd': 6} {'c': 5, 'd': 6} NaN
but I would like this, if possible:
a c d
0 1 1 NaN
1 2 4 3
2 3 5,7 6,8
I do not know whether it is possible to achieve what you want with apply(pd.Series), because you have mixed types in your 'b' column: dictionaries and lists of dictionaries.
However, this is how I would do it.
First, loop over your column to build a set of all the new column names, that is, the keys of the dictionaries.
Then you can use apply with a custom function to extract the value for each column.
Notice that the values extracted from lists are strings, which is needed because you want to join cases like your row #2 with a comma.
newcols = set()
for el in df['b']:
    if isinstance(el, dict):
        newcols.update(el.keys())
    elif isinstance(el, list):
        for i in el:
            newcols.update(i.keys())
def extractvalues(x, col):
    if isinstance(x['b'], dict):
        return x['b'].get(col, np.nan)
    elif isinstance(x['b'], list):
        return ','.join(str(i.get(col, '')) for i in x['b']).strip(',')
for nc in newcols:
    df[nc] = df.apply(lambda r: extractvalues(r, nc), axis=1)
df.drop('b', axis=1, inplace=True)
Your dataframe is now:
a c d
0 1 1 NaN
1 2 4 3
2 3 5,7 6,8

Manipulate values in pandas DataFrame columns based on matching IDs from another DataFrame

I have two dataframes like the following examples:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': ['20', '50', '100'], 'b': [1, np.nan, 1],
'c': [np.nan, 1, 1]})
df_id = pd.DataFrame({'b': ['50', '4954', '93920', '20'],
'c': ['123', '100', '6', np.nan]})
print(df)
a b c
0 20 1.0 NaN
1 50 NaN 1.0
2 100 1.0 1.0
print(df_id)
b c
0 50 123
1 4954 100
2 93920 6
3 20 NaN
For each identifier in df['a'], I want to null the value in df['b'] if there is no matching identifier in any row in df_id['b']. I want to do the same for column df['c'].
My desired result is as follows:
result = pd.DataFrame({'a': ['20', '50', '100'], 'b': [1, np.nan, np.nan],
'c': [np.nan, np.nan, 1]})
print(result)
a b c
0 20 1.0 NaN
1 50 NaN NaN # df_id['c'] did not contain '50'
2 100 NaN 1.0 # df_id['b'] did not contain '100'
My attempt to do this is here:
for i, letter in enumerate(['b','c']):
    df[letter] = (df.apply(lambda x: x[letter] if x['a']
        .isin(df_id[letter].tolist()) else np.nan, axis = 1))
The error I get:
AttributeError: ("'str' object has no attribute 'isin'", 'occurred at index 0')
This is in Python 3.5.2, Pandas version 20.1
You can solve your problem using this instead:
for letter in ['b', 'c']:  # dropped enumerate because it isn't needed here; keep it if the rest of your code uses i
    df[letter] = df.apply(lambda row: row[letter] if row['a'] in df_id[letter].tolist() else np.nan, axis=1)
Just replace isin with in.
The problem is that when you use apply on df with axis=1, x represents one row of df, so x['a'] is a single element (here a str).
However, isin is only available on Series and similar list-like pandas objects, not on str, which is what raises the error. So instead we use in to check whether that element is in the list.
Hope that was helpful. If you have any questions please ask.
Adapting a hard-to-find answer from Pandas New Column Calculation Based on Existing Columns Values:
for i, letter in enumerate(['b','c']):
    mask = df['a'].isin(df_id[letter])
    name = letter + '_new'
    # for some reason, df[letter] = df.loc[mask, letter] does not work
    df.loc[mask, name] = df.loc[mask, letter]
    df[letter] = df[name]
    del df[name]
This isn't pretty, but seems to work.
If you have a bigger Dataframe and performance is important to you, you can first build a mask df and then apply it to your dataframe.
First create the mask:
mask = df_id.apply(lambda x: df['a'].isin(x))
b c
0 True False
1 True False
2 False True
This can be applied to the original dataframe:
df.iloc[:,1:] = df.iloc[:,1:].mask(~mask, np.nan)
a b c
0 20 1.0 NaN
1 50 NaN NaN
2 100 NaN 1.0
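A more compact variant of the same masking idea is sketched below with Series.where, which keeps a value where the condition is True and writes NaN elsewhere. This is an alternative to the answers above, not a restatement of them, and it uses the example frames from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['20', '50', '100'], 'b': [1, np.nan, 1],
                   'c': [np.nan, 1, 1]})
df_id = pd.DataFrame({'b': ['50', '4954', '93920', '20'],
                      'c': ['123', '100', '6', np.nan]})

for letter in ['b', 'c']:
    # Keep the value where df['a'] appears in df_id[letter]; NaN elsewhere.
    df[letter] = df[letter].where(df['a'].isin(df_id[letter]))
print(df)
```

Each column is handled with one vectorized call, so no row-wise apply or temporary column is needed.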

Pandas Rows with missing values in multiple columns

I have a dataframe with columns age, date and location.
I would like to count how many rows are empty across ALL columns (not just some of them, but all at the same time). I have the following code, and each line works independently, but how do I say age AND date AND location isnull?
df['age'].isnull().sum()
df['date'].isnull().sum()
df['location'].isnull().sum()
I would like to return a dataframe after removing the rows with missing values in ALL these three columns, so something like the following lines but combined in one statement:
df.mask(row['location'].isnull())
df[np.isfinite(df['age'])]
df[np.isfinite(df['date'])]
You basically can use your approach, but drop the column indices:
df.isnull().sum().sum()
The first .sum() returns a per-column count of missing values, while the second .sum() adds those counts up to give the total number of missing values in the dataframe.
Similar to Vaishali's answer, you can use df.dropna(), which by default drops every row that contains at least one NaN or None, and returns your cleaned DataFrame.
In [45]: df = pd.DataFrame({'age': [1, 2, 3, np.NaN, 4, None], 'date': [1, 2, 3, 4, None, 5], 'location': ['a', 'b', 'c', None, 'e', 'f']})
In [46]: df
Out[46]:
age date location
0 1.0 1.0 a
1 2.0 2.0 b
2 3.0 3.0 c
3 NaN 4.0 None
4 4.0 NaN e
5 NaN 5.0 f
In [47]: df.isnull().sum().sum()
Out[47]: 4
In [48]: df.dropna()
Out[48]:
age date location
0 1.0 1.0 a
1 2.0 2.0 b
2 3.0 3.0 c
You can find the number of rows in which all values are NaN with
len(df) - len(df.dropna(how = 'all'))
and drop them with
df = df.dropna(how = 'all')
This drops only the rows in which every value is NaN.
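To restrict the check to the three columns named in the question, something along these lines should work. The sample data below is made up for illustration (it is not the asker's data), and cols, all_missing, n_all_missing and cleaned are names introduced here.

```python
import numpy as np
import pandas as pd

# Illustrative data: row 1 is missing in all three columns.
df = pd.DataFrame({'age': [1, np.nan, 3, np.nan],
                   'date': [1, np.nan, np.nan, 4],
                   'location': ['a', None, 'c', None]})
cols = ['age', 'date', 'location']

# True for rows where age AND date AND location are all missing.
all_missing = df[cols].isnull().all(axis=1)
n_all_missing = int(all_missing.sum())

# Drop only those rows; rows with some values present are kept.
cleaned = df.dropna(how='all', subset=cols)
print(n_all_missing)
print(cleaned)
```

The subset argument matters when the dataframe has other columns: without it, how='all' would only drop rows that are empty in every column of the frame.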
