I have a pandas DataFrame with 3 columns and want output like this

DataFrame with 3 columns:
a  b  c
1  2  4
1  2  4
1  2  4
Desired output:
a  b  c  a+b  a+c  b+c  a+b+c
1  2  4    3    5    6      7
1  2  4    3    5    6      7
1  2  4    3    5    6      7

Create all combinations of two or more columns, then assign the row-wise sum of each combination:
from itertools import chain, combinations

# powerset recipe: https://stackoverflow.com/a/5898031
comb = chain(*map(lambda x: combinations(df.columns, x), range(2, len(df.columns) + 1)))
for c in comb:
    df['+'.join(c)] = df.loc[:, c].sum(axis=1)
print(df)
a b c a+b a+c b+c a+b+c
0 1 2 4 3 5 6 7
1 1 2 4 3 5 6 7
2 1 2 4 3 5 6 7
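If there are many column combinations, assigning them one at a time inside the loop inserts into the frame repeatedly; here is a minimal sketch of the same powerset idea that collects all sums first and concatenates once (my own variant, assuming the same df as above):

import pandas as pd
from itertools import chain, combinations

comb = chain.from_iterable(combinations(df.columns, r) for r in range(2, len(df.columns) + 1))
# build every sum column in one dict, then attach them in a single concat
sums = {'+'.join(c): df[list(c)].sum(axis=1) for c in comb}
df = pd.concat([df, pd.DataFrame(sums)], axis=1)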

You should always post your own attempt when asking a question. That said, here it goes. This is the easiest but probably not the most elegant way to solve it; for a more elegant approach, follow jezrael's answer.
Make your pandas dataframe here:
import pandas as pd
df = pd.DataFrame({"a": [1, 1, 1], "b": [2, 2, 2], "c": [4, 4, 4]})
Now make your desired dataframe like this:
df["a+b"] = df["a"] + df["b"]
df["a+c"] = df["a"] + df["c"]
df["b+c"] = df["b"] + df["c"]
df["a" + "b" + "c"] = df["a"] + df["b"] + df["c"]
This gives you:
|    |   a |   b |   c |   a+b |   a+c |   b+c |   a+b+c |
|---:|----:|----:|----:|------:|------:|------:|------:|
| 0 | 1 | 2 | 4 | 3 | 5 | 6 | 7 |
| 1 | 1 | 2 | 4 | 3 | 5 | 6 | 7 |
| 2 | 1 | 2 | 4 | 3 | 5 | 6 | 7 |

Related

Compare columns with NaN or <NA> values in pandas

I have a dataframe with NaN and values, and I want to compare two columns in it, checking row by row whether each value is null or not null. For example:
If column a_1 has a null value and column a_2 has a not-null value, then for that particular row the result should be 1 in the new column a_12.
If the values in both a_1 (value 123) and a_2 (value 345) are not null, and the values are not equal, then the result should be 3 in column a_12.
Below is the code snippet I used for the comparison. For scenario 1, I am getting the result 3 instead of 1. Please guide me to get the correct output.
try:
    if (x[cols[0]] == x[cols[1]]) & (~np.isnan(x[cols[0]])) & (~np.isnan(x[cols[1]])):
        return 0
    elif (np.isnan(x[cols[0]])) & (np.isnan(x[cols[1]])):
        return 0
    elif (~np.isnan(x[cols[0]])) & (np.isnan(x[cols[1]])):
        return 1
    elif (np.isnan(x[cols[0]])) & (~np.isnan(x[cols[1]])):
        return 2
    elif (x[cols[0]] != x[cols[1]]) & (~np.isnan(x[cols[0]])) & (~np.isnan(x[cols[1]])):
        return 3
    else:
        pass
except Exception as exc:
    if (x[cols[0]] == x[cols[1]]) & (pd.notna(x[cols[0]])) & (pd.notna(x[cols[1]])):
        return 0
    elif (pd.isna(x[cols[0]])) & (pd.isna(x[cols[1]])):
        return 0
    elif (pd.notna(x[cols[0]])) & (pd.isna(x[cols[1]])):
        return 1
    elif (pd.isna(x[cols[0]])) & (pd.notna(x[cols[1]])):
        return 2
    elif (x[cols[0]] != x[cols[1]]) & (pd.notna(x[cols[0]])) & (pd.notna(x[cols[1]])):
        return 3
    else:
        pass
I have used pd.isna() and pd.notna() as well as np.isnan() and ~np.isnan(), because for some columns the second method (np.isnan()) works, while for other columns it just throws an error.
Please guide me to achieve the expected result.
Expected Output:
| a_1 | a_2 | result |
|-----------|---------|--------|
| gssfwe | gssfwe | 0 |
| <NA> | <NA> | 0 |
| fsfsfw | <NA> | 1 |
| <NA> | qweweqw | 2 |
| adsadgsgd | wwuwquq | 3 |
Output obtained with the above code:
| a_1 | a_2 | result |
|-----------|---------|--------|
| gssfwe | gssfwe | 0 |
| <NA> | <NA> | 0 |
| fsfsfw | <NA> | 3 |
| <NA> | qweweqw | 3 |
| adsadgsgd | wwuwquq | 3 |
Going by the logic in your code, you'd want to define a function and apply it across your DataFrame.
df = pd.DataFrame({'a_1': [1, 2, np.nan, np.nan, 1], 'a_2': [2, np.nan, 1, np.nan, 1]})
The categories you want map neatly to binary numbers, which you can use to write a short function like this:
def nan_check(row):
    x, y = row
    if x != y:
        # NaN != NaN evaluates True, but both notna() bits are then 0, so the result is 0b00 = 0
        return int(f'{int(pd.notna(y))}{int(pd.notna(x))}', base=2)
    return 0

df['flag'] = df.apply(nan_check, axis=1)
Output
a_1 a_2 flag
0 1.0 2.0 3
1 2.0 NaN 1
2 NaN 1.0 2
3 NaN NaN 0
4 1.0 1.0 0
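The same binary encoding can also be done without apply; here is a minimal vectorized sketch of the same idea (my own variant, not part of the original answer):

import numpy as np

# 0 when equal or both missing; otherwise notna(a_1) sets bit 0 and notna(a_2) sets bit 1
both_na = df['a_1'].isna() & df['a_2'].isna()
df['flag'] = np.where(df['a_1'].eq(df['a_2']) | both_na, 0,
                      df['a_1'].notna().astype(int) + 2 * df['a_2'].notna().astype(int))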
You can try np.select, but I think you need to rethink the condition and the expected output
Condition 1: if the column a_1 have null values, column a_2 have not null values, then for that particular row, the result should be 1 in the new column a_12.
Condition 2: If the values in both a_1 & a_2 is not null, and the values are not equal, then the result should be 3 in column a_12.
df['a_12'] = np.select(
    [df['a_1'].isna() & df['a_2'].notna(),
     df['a_1'].notna() & df['a_2'].notna() & df['a_1'].ne(df['a_2'])],
    [1, 3],
    default=0
)
print(df)
a_1 a_2 result a_12
0 gssfwe gssfwe 0 0
1 NaN NaN 0 0
2 fsfsfw NaN 1 0 # Shouldn't be Condition 1 since a_1 is not NaN
3 NaN qweweqw 2 1 # Condition 1
4 adsadgsgd wwuwquq 3 3

Pandas: find the max of several columns, subtract another column from it, and replace the value

I have a df like this:
A | B | C | D
14 | 5 | 10 | 5
4 | 7 | 15 | 6
100 | 220 | 6 | 7
For each row, I want to find the max value across columns A, B, and C, subtract column D from it, and replace the original value with the result.
Expected result:
A | B | C | D
9 | 5 | 10 | 5
4 | 7 | 9 | 6
100 | 213 | 6 | 7
So for the first row, it would select 14 (the max of 14, 5, 10), subtract column D from it (14 - 5 = 9), and replace the original value (14 becomes 9).
I know how to find the max value of A, B, C and subtract D from it, but I am stuck on the replacing part.
I thought of putting the result in another column E, then finding the max of A, B, C again and replacing it with column E, but that makes no sense since I would be attempting to assign a value to a function call. Is there any other way to do this?
# Example df
list_columns = ['A', 'B', 'C', 'D']
list_data = [[14, 5, 10, 5], [4, 7, 15, 6], [100, 220, 6, 7]]
df = pd.DataFrame(columns=list_columns, data=list_data)
# Calculate the max and subtract
df['e'] = df[['A', 'B', 'C']].max(axis=1) - df['D']
# To replace, maybe something like this. But this line makes no sense since it's backwards
df[['A', 'B', 'C']].max(axis=1) = df['D']
Use DataFrame.mask to replace only the maximal value, by comparing all values of the filtered columns against the row-wise maximums:
cols = ['A', 'B', 'C']
s = df[cols].max(axis=1)
df[cols] = df[cols].mask(df[cols].eq(s, axis=0), s - df['D'], axis=0)
print(df)
A B C D
0 9 5 10 5
1 4 7 9 6
2 100 213 6 7
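An alternative sketch using plain NumPy (my own variant, not part of the original answer): take the positional argmax of each row and subtract D in place. Note that argmax hits only the first maximum when there are ties, whereas the mask approach above rewrites every tied maximum.

import numpy as np

arr = df[cols].to_numpy()
rows = np.arange(len(arr))
jmax = arr.argmax(axis=1)              # column position of each row's max
arr[rows, jmax] -= df['D'].to_numpy()  # subtract D from that one cell per row
df[cols] = arr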

How to keep only the rows that cumulate 80% of a column within each group of a pandas dataframe?

I have a dataframe like this:
df_dict = dict(
    group = [1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3],
    model = ['model 2','model 4','model 7','model 6','model 5','model 2','model 5','model 7','model 3','model 4','model 3','model 6','model 5','model 1','model 2','model 7','model 4'],
    value = [10,4.6,2,1.3,1,10,3,3,2,0.9,4,2.7,2,1,1,1,0.9],
)
df = pd.DataFrame(df_dict)
For each group, I want to keep the models that together cover 80% of the group's total "value".
For this example, this is what the output should be:
| group | model   | value |
|-------|---------|-------|
| 1     | model 2 | 10    |
| 1     | model 4 | 4.6   |
| 1     | model 7 | 2     |
| 2     | model 2 | 10    |
| 2     | model 5 | 3     |
| 2     | model 7 | 3     |
| 3     | model 3 | 4     |
| 3     | model 6 | 2.7   |
| 3     | model 5 | 2     |
| 3     | model 1 | 1     |
| 3     | model 2 | 1     |
Let us try multiple groupby operations:
df = df.sort_values(['group', 'value'], ascending=[True, False])
g = df.groupby('group')['value']
df = df[df.index <= ((g.cumsum() / g.transform('sum')) > 0.8).groupby(df['group']).transform('idxmax')]
df
Out[120]:
group model value
0 1 model 2 10.0
1 1 model 4 4.6
2 1 model 7 2.0
5 2 model 2 10.0
6 2 model 5 3.0
7 2 model 7 3.0
10 3 model 3 4.0
11 3 model 6 2.7
12 3 model 5 2.0
13 3 model 1 1.0
14 3 model 2 1.0
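The chained version above is compact but dense; here is an equivalent, more explicit sketch of the same idea (my own spelling, assuming the df built above): sort each group descending, keep rows while the running total stays within 80% of the group sum, plus the first row that crosses the line.

def top80(g, frac=0.8):
    g = g.sort_values('value', ascending=False)
    keep = g['value'].cumsum() <= frac * g['value'].sum()
    # also keep the first row that crosses the threshold, matching the expected output
    keep |= keep.shift(fill_value=True) & ~keep
    return g[keep]

out = df.groupby('group', group_keys=False).apply(top80)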

Creating a new column based on other columns' values, with specific requirements, in a Python DataFrame

I want to create a new column in a Python DataFrame, with specific requirements derived from other columns. For example, my DataFrame df:
A | B
-----------
5 | 0
5 | 1
15 | 1
10 | 1
10 | 1
20 | 2
15 | 2
10 | 2
5 | 3
15 | 3
10 | 4
20 | 0
I want to create a new column C with the requirements below:
When the value of B = 0, then C = 0.
Rows with the same value in B get the same value in C. Within each run of equal B values, rows are classified as start, middle, and end; so for value 1 there is 1 start, 2 middle, and 1 end, while for value 3 there is 1 start, 0 middle, and 1 end. The calculation for each section, given a threshold = 10, looks like this for the rows where B = 1:
Start:
C.loc[2] = min(threshold, A.loc[1]) + A.loc[2]
Middle:
C.loc[3] = A.loc[3]
C.loc[4] = A.loc[4]
End:
C.loc[5] = min(threshold, A.loc[6])
The output value of C for the whole run is then the sum of the above calculations.
When the value of B is unique and not 0, for example when B = 4:
C.loc[10] = min(threshold, A.loc[9]) + min(threshold, A.loc[11])
I can solve points 0 and 3, but I'm struggling to solve point 2.
So, the final output will be:
A | B | C
--------------------
5 | 0 | 0
5 | 1 | 45
15 | 1 | 45
10 | 1 | 45
10 | 1 | 45
20 | 2 | 50
15 | 2 | 50
10 | 2 | 50
5 | 3 | 25
10 | 3 | 25
10 | 4 | 20
20 | 0 | 0
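No answer is given in the thread for this one; below is a minimal sketch of one reading of the rules (my own, using 0-based indexing): each run of consecutive equal non-zero B values sums its A values except the last row, plus min(threshold, A) of the row just before and just after the run, and that total is written to every row of the run.

import pandas as pd

df = pd.DataFrame({
    'A': [5, 5, 15, 10, 10, 20, 15, 10, 5, 15, 10, 20],
    'B': [0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 0],
})
threshold = 10

# label consecutive runs of equal B values
run_id = df['B'].ne(df['B'].shift()).cumsum()

c = pd.Series(0, index=df.index)
for _, idx in df.groupby(run_id).groups.items():
    first, last = idx[0], idx[-1]
    if df.loc[first, 'B'] == 0:
        continue  # rule: B == 0 gives C == 0
    total = df.loc[idx[:-1], 'A'].sum()  # every row of the run except the last
    if first - 1 in df.index:
        total += min(threshold, df.loc[first - 1, 'A'])  # capped neighbour before the run
    if last + 1 in df.index:
        total += min(threshold, df.loc[last + 1, 'A'])   # capped neighbour after the run
    c[idx] = total
df['C'] = c
# C per row: 0, 45, 45, 45, 45, 50, 50, 50, 25, 25, 20, 0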

Populating a pandas dataframe from an odd dictionary

I have a dictionary as follows:
{'header_1': ['body_1', 'body_3', 'body_2'],
'header_2': ['body_6', 'body_4', 'body_5'],
'header_4': ['body_7', 'body_8'],
'header_3': ['body_9'],
'header_9': ['body_10'],
'header_10': []}
I would like to come up with a dataframe like this:
+----+----------+--------+
| ID | header | body |
+----+----------+--------+
| 1 | header_1 | body_1 |
+----+----------+--------+
| 2 | header_1 | body_3 |
+----+----------+--------+
| 3 | header_1 | body_2 |
+----+----------+--------+
| 4 | header_2 | body_6 |
+----+----------+--------+
| 5 | header_2 | body_4 |
+----+----------+--------+
| 6 | header_2 | body_5 |
+----+----------+--------+
| 7 | header_4 | body_7 |
+----+----------+--------+
Where blank items (such as for the key header_10 in the dict above) would receive a value of None. I have tried a number of variations of df.loc, such as:
for header_name, body_list in all_unique.items():
    for body_name in body_list:
        metadata.loc[metadata.index[-1]] = [header_name, body_name]
To no avail. Surely there must be a quick way in Pandas to append rows and autoincrement the index? Something similar to the SQL INSERT INTO statement only using pythonic code?
Use a dict comprehension to add None for empty lists, then flatten into a list of tuples:
d = {'header_1': ['body_1', 'body_3', 'body_2'],
     'header_2': ['body_6', 'body_4', 'body_5'],
     'header_4': ['body_7', 'body_8'],
     'header_3': ['body_9'],
     'header_9': ['body_10'],
     'header_10': []}

d = {k: v if bool(v) else [None] for k, v in d.items()}
data = [(k, y) for k, v in d.items() for y in v]
df = pd.DataFrame(data, columns=['a', 'b'])
print(df)
a b
0 header_1 body_1
1 header_1 body_3
2 header_1 body_2
3 header_2 body_6
4 header_2 body_4
5 header_2 body_5
6 header_4 body_7
7 header_4 body_8
8 header_3 body_9
9 header_9 body_10
10 header_10 None
Another solution:
data = []
for k, v in d.items():
    if bool(v):
        for y in v:
            data.append((k, y))
    else:
        data.append((k, None))
df = pd.DataFrame(data, columns=['a', 'b'])
print(df)
a b
0 header_1 body_1
1 header_1 body_3
2 header_1 body_2
3 header_2 body_6
4 header_2 body_4
5 header_2 body_5
6 header_4 body_7
7 header_4 body_8
8 header_3 body_9
9 header_9 body_10
10 header_10 None
If the dataset is very big this solution will be slow, but it should still work.
# DataFrame.append was removed in pandas 2.0, so collect the per-key frames and concat once
frames = []
for key in d.keys():
    vals = d[key]
    # create a temp df with the data from a single key
    frames.append(pd.DataFrame({'header': [key] * len(vals), 'body': vals}))
df = pd.concat(frames, ignore_index=True)
This is another unnesting problem.
Borrow jezrael's setup for your d:
d = {k: v if bool(v) else [None] for k, v in d.items()}
First convert your dict into a DataFrame:
df = pd.Series(d).reset_index()
df.columns
Out[204]: Index(['index', 0], dtype='object')
Then use the unnesting function defined below:
yourdf = unnesting(df, [0])
yourdf
Out[208]:
0 index
0 body_1 header_1
0 body_3 header_1
0 body_2 header_1
1 body_6 header_2
1 body_4 header_2
1 body_5 header_2
2 body_7 header_4
2 body_8 header_4
3 body_9 header_3
4 body_10 header_9
5 None header_10
import numpy as np

def unnesting(df, explode):
    # repeat each index label once per element of the list column
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    # join the exploded columns back onto the untouched ones
    return df1.join(df.drop(columns=explode), how='left')
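Note that since pandas 0.25 this kind of unnesting is built in; a brief sketch using DataFrame.explode (empty lists become NaN rows on their own, so the [None] substitution is not strictly needed here):

df = pd.Series(d).rename_axis('header').reset_index(name='body')
df = df.explode('body', ignore_index=True)  # ignore_index requires pandas >= 1.1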
