Fill column values according to a match between the column names and the items of a list in another column - python-3.x

I would like to fill column values with "yes" or "no" depending on whether the column name (or a substring of it) matches an item of a list stored in another column of the same row. Is there a way to achieve this using pandas?
Out[5]:
  Insurance:retailers Insurance:buyers Insurance:sales                Types
0                                                        [retailers, sales]
1                                                                   [sales]
2                                                       [retailers, buyers]
I'm trying to achieve the following result:
Out[7]:
  Insurance:retailers Insurance:buyers Insurance:sales                Types
0                 yes               no             yes   [retailers, sales]
1                  no               no             yes              [sales]
2                 yes              yes              no  [retailers, buyers]
Any help would be much appreciated. Thank you.

I don't know much about pandas, but as I understand it, a pandas DataFrame can be constructed from a Python dictionary and vice versa.
So this should point you in the right direction:
data = {'Insurance:retailers': ['no', 'no', 'no'],
        'Insurance:buyers': ['no', 'no', 'no'],
        'Insurance:sales': ['no', 'no', 'no'],
        'Types': [['retailers', 'sales'], ['sales'], ['retailers', 'buyers']]}
# For each row, mark every column whose name contains one of the row's types.
for idx, entry in enumerate(data['Types']):
    for key in data:
        if key != 'Types' and any(element in key for element in entry):
            data[key][idx] = 'yes'
print(data)
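The same idea can be written directly against a DataFrame. This is a minimal sketch, assuming the three Insurance columns start out empty and that every column other than Types should be filled; the variable name target_cols is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Insurance:retailers": ["", "", ""],
    "Insurance:buyers": ["", "", ""],
    "Insurance:sales": ["", "", ""],
    "Types": [["retailers", "sales"], ["sales"], ["retailers", "buyers"]],
})

# Assumption: every column except "Types" should be filled.
target_cols = [c for c in df.columns if c != "Types"]

for col in target_cols:
    # "yes" when any item of the row's list is a substring of the column name.
    df[col] = df["Types"].apply(
        lambda items, col=col: "yes" if any(item in col for item in items) else "no"
    )

print(df)
```

The substring test (item in col) mirrors the dictionary version above, so it would also match partial names such as "sale".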

Related

Multilevel index of rows of a dataframe using pandas [duplicate]

I've spent hours browsing everywhere now to try to create a multiindex from dataframe in pandas. This is the dataframe I have (posting excel sheet mockup. I do have this in pandas dataframe):
And this is what I want:
I have tried
newmulti = currentDataFrame.set_index(['user_id','account_num'])
But it returns a dataframe, not a multiindex. Also, I could not figure out how to make 'user_id' level 0 and 'account_num' level 1. I think this must be trivial but I've read so many posts, tutorials, etc. and still could not figure it out. Partly because I'm a very visual person and most posts are not. Please help!
You could simply use groupby in this case, which will create the multi-index automatically when it sums the sales along the requested columns.
df.groupby(['user_id', 'account_num', 'dates']).sales.sum().to_frame()
You should also be able to simply do this:
df.set_index(['user_id', 'account_num', 'dates'])
Although you probably want to avoid any duplicates (e.g. two or more rows with identical user_id, account_num and date values but different sales figures) by summing them, which is why I recommended using groupby.
If you need the multi-index itself, you can simply access it via new_df.index, where new_df is the new dataframe created by either of the two operations above.
And user_id will be level 0 and account_num will be level 1.
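To make the difference between the two calls concrete, here is a small sketch. The data is made up for illustration; only the column names user_id, account_num, dates and sales come from the answer above.

```python
import pandas as pd

# Hypothetical data; the first two rows share the same key values.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "account_num": [10, 10, 20, 21],
    "dates": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03"],
    "sales": [100, 50, 70, 30],
})

# groupby sums duplicate keys, so the resulting MultiIndex is unique.
summed = df.groupby(["user_id", "account_num", "dates"]).sales.sum().to_frame()

# set_index keeps every row, including the duplicated key.
indexed = df.set_index(["user_id", "account_num", "dates"])

print(type(summed.index).__name__)
print(list(summed.index.names))
print(len(summed), len(indexed))
print(indexed.index.is_unique)
```

Both results are DataFrames whose .index attribute is the MultiIndex, with the levels in the order the column names were passed.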
For clarification of future users I would like to add the following:
As said by Alexander,
df.set_index(['user_id', 'account_num', 'dates'])
with a possible inplace=True does the job.
The type(df) gives
pandas.core.frame.DataFrame
whereas type(df.index) is indeed the expected
pandas.core.indexes.multi.MultiIndex
Use pd.MultiIndex.from_arrays
lvl0 = currentDataFrame.user_id.values
lvl1 = currentDataFrame.account_num.values
midx = pd.MultiIndex.from_arrays([lvl0, lvl1], names=['level 0', 'level 1'])
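Note that from_arrays only builds the index object; it still has to be attached to a frame. A minimal sketch, assuming a small made-up frame (only the names currentDataFrame, user_id and account_num come from the question; the sales column and the variable indexed are illustrative):

```python
import pandas as pd

# Hypothetical stand-in for the asker's frame.
currentDataFrame = pd.DataFrame({
    "user_id": [1, 1, 2],
    "account_num": [10, 11, 20],
    "sales": [5, 6, 7],
})

midx = pd.MultiIndex.from_arrays(
    [currentDataFrame["user_id"].to_numpy(),
     currentDataFrame["account_num"].to_numpy()],
    names=["user_id", "account_num"],
)

# Drop the two source columns and use the MultiIndex as the row index.
indexed = currentDataFrame.drop(columns=["user_id", "account_num"])
indexed.index = midx

print(indexed)
```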
There are two ways to do it. Neither gives exactly the layout you have shown, but both work.
Say you have the following df:
     A      B  C    D
0  nil    one  1  NaN
1  bar    one  5  5.0
2  foo    two  3  8.0
3  bar  three  2  1.0
4  foo    two  4  2.0
5  bar    two  6  NaN
1. Workaround 1:
df.set_index('A', append = True, drop = False).reorder_levels(order = [1,0]).sort_index()
This will return:
2. Workaround 2:
df.set_index(['A', 'B']).sort_index()
This will return:
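The two workarounds can be tried on the sample frame with the sketch below, which rebuilds the table from the values listed above. The names w1 and w2 are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["nil", "bar", "foo", "bar", "foo", "bar"],
    "B": ["one", "one", "two", "three", "two", "two"],
    "C": [1, 5, 3, 2, 4, 6],
    "D": [np.nan, 5.0, 8.0, 1.0, 2.0, np.nan],
})

# Workaround 1: keep column A and also add it as the outer index level,
# above the original row number.
w1 = (df.set_index("A", append=True, drop=False)
        .reorder_levels(order=[1, 0])
        .sort_index())

# Workaround 2: move A and B out of the columns and into the index.
w2 = df.set_index(["A", "B"]).sort_index()

print(w1)
print(w2)
```

In the first result A appears both as an index level and as a column; in the second only C and D remain as columns.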
The DataFrame returned by currentDataFrame.set_index(['user_id','account_num']) has its index set to ['user_id','account_num'].
newmulti.index will return the MultiIndex object.

How to get number of columns in a DataFrame row that are above threshold

I have a simple python 3.8 DataFrame with 8 columns (simply labeled 0, 1, 2, etc.) with approx. 3500 rows. I want a subset of this DataFrame where there are at least 2 columns in each row that are above 1. I would prefer not to have to check each column individually, but be able to check all columns. I know I can use the .any(1) to check all the columns, but I need there to be at least 2 columns that meet the threshold, not just one. Any help would be appreciated. Sample code below:
import pandas as pd
df = pd.DataFrame({0: [1, 1, 1, 1, 100],
                   1: [1, 3, 1, 1, 1],
                   2: [1, 3, 1, 1, 4],
                   3: [1, 1, 1, 1, 1],
                   4: [3, 4, 1, 1, 5],
                   5: [1, 1, 1, 1, 1]})
Easiest way I can think to sort/filter later would be to create another column at the end df[9] that houses the count:
df[9] = df.apply(lambda x: x.count() if x > 2, axis=1)
This code doesn't work, but I feel like it's close?
df[(df>1).sum(axis=1)>=2]
Explanation:
(df>1).sum(axis=1) gives, for each row, the number of columns whose value is greater than 1.
The >=2 comparison then keeps only the rows where that count is at least 2.
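The value of x in the lambda is a Series (one row), so it can be indexed with the boolean mask x > 2 and the remaining values counted (use x > 1 for the threshold in the question):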
df[9] = df.apply(lambda x: x[x > 2].count(), axis=1)
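Putting the pieces together on the sample data from the question, as a small sketch. The names counts and subset are illustrative, and the threshold of 1 and minimum of 2 columns are taken from the question text.

```python
import pandas as pd

df = pd.DataFrame({0: [1, 1, 1, 1, 100],
                   1: [1, 3, 1, 1, 1],
                   2: [1, 3, 1, 1, 4],
                   3: [1, 1, 1, 1, 1],
                   4: [3, 4, 1, 1, 5],
                   5: [1, 1, 1, 1, 1]})

# Boolean frame -> per-row count of values above the threshold.
counts = (df > 1).sum(axis=1)

# Keep rows where at least two columns exceed the threshold.
subset = df[counts >= 2]

print(counts.tolist())
print(subset)
```

Computing counts before adding any helper column keeps the count itself out of the comparison.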

Python 3: Groupby 3 DataFrame columns to check availability in a 4th column and add label 0 or 1 to 5th column

My first time posting on Stack Overflow. Please be kind.
I tried to find the exact solution for this problem but have failed to do so.
What I am attempting to do is group by the ProductID, Class and Material columns, check which values in a column are null and which are non-null, and assign 0 and 1 respectively in the column Level.
My Dataframe: https://i.stack.imgur.com/dRZcY.jpg
My Target Dataframe: https://i.stack.imgur.com/HWi5y.jpg
I am unable to get a label of 0's and 1's for the missing values in Material column. Please help!
Thanks in Advance!
Try this:
df['level'] = df[['ProductID', 'Class', 'Material']]\
    .apply(lambda x: 0 if x.isna().sum() > 0 else 1, axis=1)
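The same labelling can also be written without apply. This is a sketch on made-up rows, since the original data is only available as images; the column names are the ones used in the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the real data is only shown in the linked images.
df = pd.DataFrame({
    "ProductID": [1, 1, 2, 2],
    "Class": ["a", "a", "b", "b"],
    "Material": ["steel", np.nan, "wood", None],
})

# 1 when all three columns are present in the row, 0 when any is missing.
df["level"] = df[["ProductID", "Class", "Material"]].notna().all(axis=1).astype(int)

print(df)
```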

How to rank columns of a dataframe by index? pandas

Suppose you have the following Dataframe (It is much more complicated)
df4=pd.DataFrame(np.matrix([[1,5],[3,2],[4,3],[5,4],[2,1]]),index=['a','b','c','d','e'])
Which is already ranked, however, I would like to rank it by the row index to reach the desired dataframe as
df5=pd.DataFrame(np.matrix([['a','e'],['e','b'],['b','c'],['c','d'],['d','a']]))
Is there an easy way of doing so?
Thank you very much
Pass df4 as an indexer to the index of df4:
pd.DataFrame(df4.index[df4-1])
Note that I subtracted 1 from df4 since Pandas indexing is zero based, but your DataFrame appears to be 1 based.
The resulting output:
   0  1
0  a  e
1  c  b
2  d  c
3  e  d
4  b  a
I would like to rank the matrix based on index. I believe that the posted solution is not right for my question
I had to write a function to answer this question:
def Column_rank_based_list(f):
    r, c = f.shape
    B = np.array(range(r*c), dtype='a5').reshape(r, c)
    for j in range(c):
        for i in range(r):
            # place the label of row i at the position given by its rank
            B[f.iloc[i, j]-1, j] = f.index[i]
    return pd.DataFrame(B, columns=f.columns)
However, I am having difficulty because it is printing b before the entries.
For example for
df4=pd.DataFrame(np.matrix([[1,5],[3,2],[4,3],[5,4],[2,1]]),index=['a','b','c','d','e'])
you would obtain
Column_rank_based_list(df4)
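The b prefix most likely comes from dtype='a5', which is a bytes dtype, so in Python 3 the entries print as b'a', b'e' and so on. One way to avoid it is to skip the pre-allocated array altogether. The sketch below is a swapped-in approach based on numpy.argsort, not the function above; the names order and labels are illustrative.

```python
import numpy as np
import pandas as pd

df4 = pd.DataFrame([[1, 5], [3, 2], [4, 3], [5, 4], [2, 1]],
                   index=["a", "b", "c", "d", "e"])

# For each column, argsort returns the row positions ordered by rank value.
order = np.argsort(df4.to_numpy(), axis=0)

# Look up the index label for every position; the result keeps str labels.
labels = df4.index.to_numpy()
df5 = pd.DataFrame(labels[order], columns=df4.columns)

print(df5)
```

Because the labels stay ordinary Python strings, nothing is converted to bytes and no prefix appears.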
