Masking Data Unequal to Another Set of Data and Storing Results - python-3.x

Does anyone know how to amend the "changes" dataframe to only evaluate cells that are True? I want to send only the items that changed from df1 to df2 to the changes dataframe. As written, this replaces all cells, and I can't use "mask" by itself since the data is multidimensional. Thanks!
import pandas as pd
import numpy as np
df1=pd.DataFrame({'Col1' : ['blue', 2, 3, 4], 'Col2' : [90, 99, 3, 97], 'Col3' : [11, 12, 13, 14]})
df2=pd.DataFrame({'Col1' : ['blue', 2, 6], 'Col2' : [90, 99, 99], 'Col3' : [11, 12, 13]})
mask=df2.ne(df1)
#Line in question
changes=(df1.loc[mask.index].astype(str) + ' changed to: ***' + df2.loc[mask.index].astype(str)).fillna(df2.astype(str))
I want the output to look like:
Col1 Col2 Col3
0 blue 90 11
1 2 99 12
2 3 changed to: ***6 3 changed to: ***99.0 13
3 4 changed to: ***nan 97 changed to: ***nan 14 changed to: ***nan

IIUC, you can use where with the other parameter (see the docs):
df1.where(df1.eq(df2), changes)
Output:
Col1 Col2 Col3
0 blue 90 11
1 2 99 12
2 3 changed to: ***6 3 changed to: ***99.0 13
3 4 changed to: ***nan 97 changed to: ***nan 14 changed to: ***nan
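For completeness, a minimal end-to-end sketch against the df1/df2 from the question. reindex_like (a standard pandas method) lines the shorter df2 up with df1, so the unmatched row turns into the 'nan' strings shown in the desired output:
#annotate every cell, then keep the annotation only where the frames differ
changes = df1.astype(str) + ' changed to: ***' + df2.reindex_like(df1).astype(str)
print(df1.where(df1.eq(df2), changes))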

Similar approach to Scott Boston's method (credit to him!). You can use where's variant, mask:
df1.mask(df1.ne(df2), df2)
This tells pandas that, wherever df1.ne(df2) is True, the value should be filled in from df2; otherwise, the original value is kept.
Col1 Col2 Col3
0 blue 90.0 11.0
1 2 99.0 12.0
2 6 99.0 13.0
3 NaN NaN NaN
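Note that mask is simply where with the condition inverted: df1.mask(cond, other) is equivalent to df1.where(~cond, other). So the annotated output from above can be produced either way, reusing the changes frame (a sketch):
#mask fills where the condition is True; where fills where it is False
out_where = df1.where(df1.eq(df2), changes)
out_mask = df1.mask(df1.ne(df2), changes)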

Related

Pandas Min and Max Across Rows

I have a dataframe that looks like the one below. I want to get the min and max value per city, along with information about which products were ordered the min and max number of times for that city. Please help.
[Dataframe image]
db.min(axis=0) - min value for each column
db.min(axis=1) - min value for each row
Use DataFrame.min and DataFrame.max:
DataFrame.min(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
DataFrame.max(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
matrix = [(22, 16, 23),
(33, 50, 11),
(44, 34, 11),
(55, 35, 60),
(66, 36, 13)
]
dfObj = pd.DataFrame(matrix, index=list('abcde'), columns=list('xyz'))
   x   y   z
a  22  16  23
b  33  50  11
c  44  34  11
d  55  35  60
e  66  36  13
Get a series containing the minimum value of each row
minValuesObj = dfObj.min(axis=1)
print('minimum value in each row : ')
print(minValuesObj)
output
minimum value in each row :
a 16.0
b 11.0
c 11.0
d 35.0
e 13.0
dtype: float64
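The max counterpart is symmetric:
maxValuesObj = dfObj.max(axis=1)
print('maximum value in each row : ')
print(maxValuesObj)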
MMT Marathi, based on the answers provided by Danil and Sutharp777, you should be able to get to your answer. However, I see you have questions for them. I am not sure if you are looking for a column to be created that has the min/max value for each row.
Here's the full dataframe with the solution. I am merely compiling the answers they have already given.
import pandas as pd
d = [['20in Monitor',2,2,1,2,2,2,2,2,2],
['27in 4k Gaming Monitor',2,1,2,2,1,2,2,2,2],
['27in FHD Monitor',2,2,2,2,2,2,2,2,2],
['34in Ultrawide Monitor',2,1,2,2,2,2,2,2,2],
['AA Batteries (4-pack)',5,5,6,7,6,6,6,6,5],
['AAA Batteries (4-pack)',7,7,8,8,9,7,8,9,7],
['Apple Airpods Headphones',2,2,3,2,2,2,2,2,2],
['Bose SoundSport Headphones',2,2,2,2,3,2,2,3,2],
['Flatscreen TV',2,1,2,2,2,2,2,2,2]]
c = ['Product','Atlanta','Austin','Boston','Dallas','Los Angeles',
'New York City','Portland','San Francisco','Seattle']
df = pd.DataFrame(d,columns=c)
#numeric_only=True keeps the string 'Product' column out of the row-wise
#min/max (recent pandas raises on mixed types without it)
df['min_value'] = df.min(axis=1, numeric_only=True)
df['max_value'] = df.max(axis=1, numeric_only=True)
print (df)
The output of this will be:
Product Atlanta Austin ... Seattle min_value max_value
0 20in Monitor 2 2 ... 2 1 2
1 27in 4k Gaming Monitor 2 1 ... 2 1 2
2 27in FHD Monitor 2 2 ... 2 2 2
3 34in Ultrawide Monitor 2 1 ... 2 1 2
4 AA Batteries (4-pack) 5 5 ... 5 5 7
5 AAA Batteries (4-pack) 7 7 ... 7 7 9
6 Apple Airpods Headphones 2 2 ... 2 2 3
7 Bose SoundSport Headphones 2 2 ... 2 2 3
8 Flatscreen TV 2 1 ... 2 1 2
If you want the min and max of each column, then you can do this:
print ('min of each column :', df.min(axis=0).to_list()[1:])
print ('max of each column :', df.max(axis=0).to_list()[1:])
This will give you:
min of each column : [2, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2]
max of each column : [7, 7, 8, 8, 9, 7, 8, 9, 7, 7, 9]
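The question also asked which products hit the min and max for each city, which neither snippet above covers. A sketch using DataFrame.idxmin and DataFrame.idxmax (standard pandas methods), reusing df and the column list c from above:
by_product = df.set_index('Product')[c[1:]]
print(by_product.idxmin()) #product with the smallest count in each city
print(by_product.idxmax()) #product with the largest count in each city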

Get sum of group subset using pandas groupby

I have a dataframe as shown. Using python, I want to get the sum of 'Value' for each 'Id' group up to the first occurrence of 'Stage' 12.
df = pd.DataFrame({'Id':[1,1,1,2,2,2,2],
'Date': ['2020-04-23', '2020-04-25', '2020-04-28', '2020-04-20', '2020-05-01', '2020-05-05', '2020-05-12'],
'Stage': [11, 12, 15, 11, 14, 12, 12],
'Value': [5, 4, 6, 12, 2, 8, 3]})
Id Date Stage Value
1 2020-04-23 11 5
1 2020-04-25 12 4
1 2020-04-28 15 6
2 2020-04-20 11 12
2 2020-05-01 14 2
2 2020-05-05 12 8
2 2020-05-12 12 3
My desired output:
Id Value
1 9
2 22
Would be very thankful if someone could help.
Let us try using groupby transform with idxmax to filter the dataframe, then do another round of groupby:
idx = df['Stage'].eq(12).groupby(df['Id']).transform('idxmax')
output = df[df.index <= idx].groupby('Id')['Value'].sum().reset_index()
Detail
The transform with idxmax returns the index of the first match with 12, broadcast to every row of the group; we then filter the df to rows with an index at or below that, which keeps the data up to and including the first 12.
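A hedged alternative sketch with the same result on this data: flag everything after the first Stage 12 per Id with a shifted cumulative sum. Unlike idxmax (which, for a group containing no 12 at all, returns the group's first index and so would keep only that one row), this keeps every row of such a group:
#nonzero marks rows strictly after the first Stage==12 within each Id
seen = df['Stage'].eq(12).groupby(df['Id']).transform(lambda s: s.cumsum().shift(fill_value=0))
output = df[seen.eq(0)].groupby('Id')['Value'].sum().reset_index()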

Pandas: Group dataframe by condition "last value in column defines the group"

I've got a sorted dataframe (sorted by "customer_id" and "point_in_time") which looks like this:
import pandas as pd
import numpy as np
testing = pd.DataFrame({"customer_id": (1,1,1,2,2,2,2,2,3,3,3,3,4,4),
"point_in_time": (4,5,6,1,2,3,7,9,5,6,8,10,2,5),
"x": ("d", "a", "c", "ba", "cd", "d", "o", "a", "g", "f", "h", "d", "df", "b"),
"revenue": (np.nan, np.nan, 40, np.nan, np.nan, 23, np.nan, 10, np.nan, np.nan, np.nan, 40, np.nan, 100)})
testing
Now I want to group the dataframe by "customer_id" and the "revenue". But with regard to "revenue", a group should start after the last existing revenue and end with the next occurring revenue.
So the groups should look like this: [image]
If I had those groups I could easily do a
testing.groupby(["customer_id", "groups"])
I first tried to create those groups by first grouping by "customer_id" and applying a function to it in which I fill the missing values of "revenue":
def my_func(sub_df):
    sub_df["groups"] = sub_df["revenue"].fillna(method="bfill")
    sub_df.groupby("groups").apply(next_function)
testing.groupby(["customer_id"]).apply(my_func)
Unfortunately, this does not work if one customer has two revenues which are exactly the same. In this case after using fillna the group column of this customer will consist of only one value which does not allow additional grouping.
So how can this be done and what is the most efficient way to accomplish this task?
Thank you in advance!
Use Series.shift with Series.notna and Series.cumsum; lastly, if necessary, add 1:
testing["groups"] = testing['revenue'].shift().notna().cumsum() + 1
print (testing)
customer_id point_in_time x revenue groups
0 1 4 d NaN 1
1 1 5 a NaN 1
2 1 6 c 40.0 1
3 2 1 ba NaN 2
4 2 2 cd NaN 2
5 2 3 d 23.0 2
6 2 7 o NaN 3
7 2 9 a 10.0 3
8 3 5 g NaN 4
9 3 6 f NaN 4
10 3 8 h NaN 4
11 3 10 d 40.0 4
12 4 2 df NaN 5
13 4 5 b 100.0 5
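With the group labels in place, the grouping the asker wanted works directly. A sketch (what to aggregate is an assumption; the question only needed the groups themselves):
#e.g. the closing revenue and the row count per (customer, group)
result = testing.groupby(["customer_id", "groups"]).agg(
    revenue=("revenue", "last"),  #the revenue that closes each group
    n_rows=("x", "size"),
)
print(result)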

Binning with pd.cut beyond range (replacing NaN with "<min_val" or ">max_val")

import pandas as pd
df = pd.DataFrame({'days': [0, 31, 45, 35, 19, 70, 80]})
df['range'] = pd.cut(df.days, [0,30,60])
df
The code is reproduced here, where pd.cut is used to convert a numerical column to a categorical one. pd.cut assigns categories per the bin list passed, [0,30,60]. Here rows 0, 5 and 6 are categorized as NaN because they fall outside [0,30,60]. What I want is for 0 to be categorized as <0, and for 70 and 80 to be categorized as >60. If possible, I would also like dynamic text labels A, B, C, D, E depending on the number of categories created.
For the first part, adding -np.inf and np.inf to the bins will ensure that everything gets a bin:
In [5]: import numpy as np
   ...: df = pd.DataFrame({'days': [0,31,45,35,19,70,80]})
   ...: df['range'] = pd.cut(df.days, [-np.inf, 0, 30, 60, np.inf])
   ...: df
   ...:
Out[5]:
days range
0 0 (-inf, 0.0]
1 31 (30.0, 60.0]
2 45 (30.0, 60.0]
3 35 (30.0, 60.0]
4 19 (0.0, 30.0]
5 70 (60.0, inf]
6 80 (60.0, inf]
For the second, you can use .cat.codes to get the bin index and do some tweaking from there:
In [8]: df['range'].cat.codes.apply(lambda x: chr(x + ord('A')))
Out[8]:
0 A
1 C
2 C
3 C
4 B
5 D
6 D
dtype: object
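If you want the letter labels straight away, pd.cut also accepts a labels argument, so the codes step can be skipped. A sketch that generates one letter per interval so the labels scale with the bin list:
bins = [-np.inf, 0, 30, 60, np.inf]
labels = [chr(ord('A') + i) for i in range(len(bins) - 1)]  #['A', 'B', 'C', 'D']
df['range'] = pd.cut(df.days, bins, labels=labels)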

Better way to replace values in DataFrame from large dictionary

I have written some code that replaces values in a DataFrame with values from another frame using a dictionary, and it is working, but I am using this on some large files, where the dictionary can get very long: a few thousand pairs. When I then use this code it runs very slowly, and it has also gone out of memory on a few occasions.
I am fairly convinced that my method of doing this is far from optimal and that there must be a faster way. I have created a simple example that does what I want but that is slow for large amounts of data. I hope someone has a simpler way to do this.
import pandas as pd
#Frame with data where I want to replace the 'id' with the name from df2
df1 = pd.DataFrame({'id' : [1, 2, 3, 4, 5, 3, 5, 9], 'values' : [12, 32, 42, 51, 23, 14, 111, 134]})
#Frame containing names linked to ids
df2 = pd.DataFrame({'id' : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'name' : ['id1', 'id2', 'id3', 'id4', 'id5', 'id6', 'id7', 'id8', 'id9', 'id10']})
#My current "slow" way of doing this.
#Starts by creating a dictionary from df2
#Need to create dictionaries from the domain and banners tables to link ids
df2_dict = dict(zip(df2['id'], df2['name']))
#and then uses the dict to replace the ids with name in df1
df1.replace({'id' : df2_dict}, inplace=True)
I think you can use map with a Series converted by to_dict; you get NaN if the value does not exist in df2:
df1['id'] = df1.id.map(df2.set_index('id')['name'].to_dict())
print (df1)
id values
0 id1 12
1 id2 32
2 id3 42
3 id4 51
4 id5 23
5 id3 14
6 id5 111
7 id9 134
Or use replace; if the value does not exist in df2, the original value from df1 is kept:
df1['id'] = df1.id.replace(df2.set_index('id')['name'])
print (df1)
id values
0 id1 12
1 id2 32
2 id3 42
3 id4 51
4 id5 23
5 id3 14
6 id5 111
7 id9 134
Sample:
#Frame with data where I want to replace the 'id' with the name from df2
df1 = pd.DataFrame({'id' : [1, 2, 3, 4, 5, 3, 5, 9], 'values' : [12, 32, 42, 51, 23, 14, 111, 134]})
print (df1)
#Frame containing names linked to ids
df2 = pd.DataFrame({'id' : [1, 2, 3, 4, 6, 7, 8, 9, 10], 'name' : ['id1', 'id2', 'id3', 'id4', 'id6', 'id7', 'id8', 'id9', 'id10']})
print (df2)
df1['new_map'] = df1.id.map(df2.set_index('id')['name'].to_dict())
df1['new_replace'] = df1.id.replace(df2.set_index('id')['name'])
print (df1)
id values new_map new_replace
0 1 12 id1 id1
1 2 32 id2 id2
2 3 42 id3 id3
3 4 51 id4 id4
4 5 23 NaN 5
5 3 14 id3 id3
6 5 111 NaN 5
7 9 134 id9 id9
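A common middle ground (a sketch): keep map for speed, then fall back to the original ids where the lookup found no match, mimicking replace's keep-original behavior without scanning the dictionary per value:
s = df2.set_index('id')['name']
df1['id'] = df1['id'].map(s).fillna(df1['id'])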
