Get number of rows for all combinations of attribute levels in Pandas - python-3.x

I have a dataframe with a bunch of categorical variables; each row corresponds to a product. I wanted to find the number of rows for every combination of attribute levels and decided to run the following:
att1=list(frame_base.columns.values)
f1=att.groupby(att1,as_index=False).size().rename('counts').to_frame()
att1 is the list of all attributes. f1 does not seem to give the correct result, since f1.counts.sum() is not equal to the number of rows of the dataframe before the groupby. Why doesn't this work?

One possible problem is a row of all NaN values, because groupby drops NaN keys by default (see the sketch after the example below), but maybe there is also a typo - you need att instead of frame_base:
import numpy as np
import pandas as pd

att = pd.DataFrame({'A':[1,1,3,np.nan],
                    'B':[1,1,6,np.nan],
                    'C':[2,2,9,np.nan],
                    'D':[1,1,5,np.nan],
                    'E':[1,1,6,np.nan],
                    'F':[1,1,3,np.nan]})
print (att)
A B C D E F
0 1.0 1.0 2.0 1.0 1.0 1.0
1 1.0 1.0 2.0 1.0 1.0 1.0
2 3.0 6.0 9.0 5.0 6.0 3.0
3 NaN NaN NaN NaN NaN NaN
att1=list(att.columns.values)
f1=att.groupby(att1).size().reset_index(name='counts')
print (f1)
A B C D E F counts
0 1.0 1.0 2.0 1.0 1.0 1.0 2
1 3.0 6.0 9.0 5.0 6.0 3.0 1
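The NaN row is the culprit: because groupby drops NaN group keys by default, the counts above sum to 3 while the original frame has 4 rows. A minimal sketch (assuming pandas >= 1.1, which added the dropna argument) that keeps the NaN combination as well:
f1 = att.groupby(att1, dropna=False).size().reset_index(name='counts')
print (f1)
print (f1.counts.sum() == len(att))  # now True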

Related

Get value from grouped data frame maximum in another column

Return and assign the value of column c based on the maximum value of column b in a grouped data frame, grouped on column a.
I'm trying to return and assign (ideally in a different column) the value of column c, determined by the row with the maximum in column b (which are dates). The dataframe is grouped by column a, but I keep getting errors. The data frame is relatively large, with over 16,000 rows.
import numpy
import pandas as pd

df = pd.DataFrame({'a': ['a','a','a','b','b','b','c','c','c'],
                   'b': ['2008-11-01', '2022-07-01', '2017-02-01', '2017-02-01', '2018-02-01', '2008-11-01', '2014-11-01', '2008-11-01', '2022-07-01'],
                   'c': [numpy.NaN,6,8,2,1,numpy.NaN,6,numpy.NaN,7]})
df['b'] = pd.to_datetime(df['b'])
df
a b c
0 a 2008-11-01 NaN
1 a 2022-07-01 6.0
2 a 2017-02-01 8.0
3 b 2017-02-01 2.0
4 b 2018-02-01 1.0
5 b 2008-11-01 NaN
6 c 2014-11-01 6.0
7 c 2008-11-01 NaN
8 c 2022-07-01 7.0
I want the following result:
a b c d
0 a 2008-11-01 NaN 8.0
1 a 2022-07-01 6.0 8.0
2 a 2017-02-01 8.0 8.0
3 b 2017-02-01 2.0 1.0
4 b 2018-02-01 1.0 1.0
5 b 2008-11-01 NaN 1.0
6 c 2014-11-01 6.0 7.0
7 c 2008-11-01 NaN 7.0
8 c 2022-07-01 7.0 7.0
I've tried the following, but grouped series ('SeriesGroupBy') cannot be cast to another dtype:
df['d'] = df.groupby('a').b.astype(np.int64).max()[df.c].reset_index()
I've set column b to int before running the above code, but get an error saying none of the values in c are in the index.
None of [Index([....values in column c...\n dtype = 'object', name = 'a', length = 16280)] are in the [index]
I've tried expanding out the command, but that errors too (which I just realized relates to no alternative value being given in the where call - but I don't know what the correct value should be):
grouped = df.groupby('a')
df['d'] = grouped['c'].where(grouped['b'] == grouped['b'].max())
I've also tried using solutions provided here, here, and here.
Any help or direction would be appreciated.
I'm assuming the first three values should be 6.0 (because the maximum date for group a is 2022-07-01, whose c value is 6):
df["d"] = df["a"].map(
df.groupby("a").apply(lambda x: x.loc[x["b"].idxmax(), "c"])
)
print(df)
Prints:
a b c d
0 a 2008-11-01 NaN 6.0
1 a 2022-07-01 6.0 6.0
2 a 2017-02-01 8.0 6.0
3 b 2017-02-01 2.0 1.0
4 b 2018-02-01 1.0 1.0
5 b 2008-11-01 NaN 1.0
6 c 2014-11-01 6.0 7.0
7 c 2008-11-01 NaN 7.0
8 c 2022-07-01 7.0 7.0
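An alternative sketch without apply (my own variant, not from the original answer): take the index of the row with the latest date per group via idxmax, then map that row's c value back onto column a:
# index of the row holding the maximum date within each group
idx = df.groupby("a")["b"].idxmax()
df["d"] = df["a"].map(df.loc[idx].set_index("a")["c"])
print(df)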

How to count values greater than or equal to 0.5 that occur continuously for 5 or more rows in Python

I am trying to count values in column x that are greater than or equal to 0.5 and occur continuously for 5 or more rows. I also need to use the groupby function for my data.
I used the following, which works fine, but it cannot count continuous occurrences of a value; it just counts all values greater than or equal to 0.5:
data['points_greater_0.5'] = data[abs(data['x'])>=0.5].groupby(['y','z','n'])['x'].count()
But I want to count values greater than or equal to 0.5 only when they occur continuously 5 times or more.
As the source DataFrame I took:
x y z n
0 0.1 1.0 1.0 1.0
1 0.5 1.0 1.0 1.0
2 0.6 1.0 1.0 1.0
3 0.7 1.0 1.0 1.0
4 0.6 1.0 1.0 1.0
5 0.5 1.0 1.0 1.0
6 0.1 1.0 1.0 1.0
7 0.5 1.0 1.0 1.0
8 0.6 1.0 1.0 1.0
9 0.7 1.0 1.0 1.0
10 0.1 1.0 1.0 1.0
11 0.5 1.0 1.0 1.0
12 0.6 1.0 1.0 1.0
13 0.7 1.0 1.0 1.0
14 0.7 1.0 1.0 1.0
15 0.6 1.0 1.0 1.0
16 0.5 1.0 1.0 1.0
17 0.1 1.0 1.0 1.0
18 0.5 2.0 1.0 1.0
19 0.6 2.0 1.0 1.0
20 0.7 2.0 1.0 1.0
21 0.6 2.0 1.0 1.0
22 0.5 2.0 1.0 1.0
(one group for (y, z, n) == (1.0, 1.0, 1.0) and another for (2.0, 1.0, 1.0)).
Start from import itertools as it.
Then define the following function to get the count of your "wanted"
elements from the current group:
def getCnt(grp):
    return sum(filter(lambda x: x >= 5,
                      [ len(list(group))
                        for key, group in it.groupby(grp.x, lambda elem: elem >= 0.5)
                        if key ]))
Note that it contains it.groupby, i.e. the groupby function from itertools
(not the pandasonic version of it).
The difference is that the itertools version starts a new group on each change
of the grouping key (by default, the value of the source element).
Steps:
it.groupby(grp.x, lambda elem: elem >= 0.5) - create an iterator over the x
column of the current group, returning (key, group) pairs. The key states
whether the current run (in the itertools sense) consists of your "wanted"
elements (>= 0.5), and group yields those elements (a tiny illustration
follows after these steps).
[ len(list(group)) for key, group in … if key ] - get a list of the lengths
of the runs, excluding runs of "smaller" elements.
filter(lambda x: x >= 5, …) - filter the above list, keeping only counts of
runs with 5 or more members.
sum(…) - sum the above counts.
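For instance, a tiny illustration of how it.groupby splits a sequence into runs (my own example, not part of the original answer):
import itertools as it

xs = [0.1, 0.5, 0.6, 0.1, 0.7]
runs = [(key, len(list(group)))
        for key, group in it.groupby(xs, lambda e: e >= 0.5)]
print(runs)   # [(False, 1), (True, 2), (False, 1), (True, 1)]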
Then, to get your expected result, as a DataFrame, apply this function to
each group of rows, this time grouping with the pandasonic version of
groupby.
Then set the name of the resulting Series (it will be the column name
in the final result) and reset the index, to convert it to a DataFrame.
The code to do it is:
result = df.groupby(['y','z','n']).apply(getCnt).rename('Cnt').reset_index()
The result is:
y z n Cnt
0 1.0 1.0 1.0 11
1 2.0 1.0 1.0 5
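If you prefer to stay inside pandas, a vectorized sketch (my own variant, assuming the same DataFrame df) labels runs with shift/cumsum and gives the same counts:
def getCntRuns(grp, thresh=0.5, min_run=5):
    wanted = grp['x'] >= thresh
    # a new run id starts whenever the condition flips
    run_id = (wanted != wanted.shift()).cumsum()
    run_len = wanted.groupby(run_id).transform('size')
    # count only "wanted" rows that sit in runs of at least min_run
    return int((wanted & (run_len >= min_run)).sum())

result = df.groupby(['y','z','n']).apply(getCntRuns).rename('Cnt').reset_index()
print(result)   # same Cnt values: 11 and 5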

Perform arithmetic operations (mainly subtraction and division) on a pandas Series with null values

Simply put, when I subtract or divide with a null value, I want it to give back the non-null value (digit), e.g. 3/np.nan = 3 or 2-np.nan = 2.
Using np.nansum and np.nanprod I have handled addition and multiplication, but I don't know how to do the same for subtraction and division.
df = pd.DataFrame({"a":[1,2,3,4],"b":[1,2,np.nan,np.nan]})
df
Out[6]:
a b c=a-b d=a/b
0 1 1.0 0.0 1.0
1 2 2.0 0.0 1.0
2 3 NaN 3.0 3.0
3 4 NaN 4.0 4.0
The above is what I am actually looking for.
# Use a fill value of 0 (the additive identity) for the subtraction
df['c'] = df.a.sub(df.b, fill_value=0)
# Use a fill value of 1 (the multiplicative identity) for the division
df['d'] = df.a.div(df.b, fill_value=1)
IIUC using sub with fill_value
df.a.sub(df.b,fill_value=0)
Out[251]:
0 0.0
1 0.0
2 3.0
3 4.0
dtype: float64
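For completeness (my own check, not part of the original answers), the division counterpart with fill_value=1 reproduces the d column asked for in the question:
df.a.div(df.b, fill_value=1)
# 0    1.0
# 1    1.0
# 2    3.0
# 3    4.0
# dtype: float64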

Similar random variation for two columns in pandas

import random
import pandas as pd

data = pd.DataFrame(1.0, index=[1,2,3,4,5], columns=list('ABCD'))
data[['B', 'C']] = data[['B', 'C']].apply(lambda x: x + (-1)**random.randrange(2)*1)
I wanted to randomly vary columns B and C such that the variation is the same for both columns: if column B increases by one, column C must increase by one too, but for each row the value can increase or decrease randomly. The code above doesn't work. Then I tried this with a random seed:
data['B'] = data['B'].apply(lambda x: x + (-1)**random.randrange(2)*1)
data['C'] = data['C'].apply(lambda x: x + (-1)**random.randrange(2)*1)
Each row varies randomly, but the changes in columns B and C are not the same. How do I do this?
Expected output:
A B C D
1 1.0 1.0 1.0 1.0
2 1.0 2.0 2.0 1.0
3 1.0 2.0 2.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 0.0 0.0 1.0
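No answer is recorded here; as a sketch of one approach (my own suggestion, assuming numpy is acceptable), draw one random step per row and add the same step to both columns so they always move together:
import numpy as np

# one step per row; 0.0 is included because the expected output also has
# unchanged rows - use [-1.0, 1.0] for strictly +/-1 steps as in the original code
step = np.random.choice([-1.0, 0.0, 1.0], size=len(data))
data['B'] = data['B'] + step
data['C'] = data['C'] + step   # same draw, so B and C change identically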

How to fill NaN with user defined value in pandas dataframe

How to fill NaN with user defined value in pandas dataframe.
For text columns like A and B, user-defined text like 'Missing' should be imputed. For discrete numeric variables like C and D, the median value should be imputed. I have many columns like these; I would like to apply this rule to all variables in the dataframe.
DF
A B C D
A0A1 Railway 10 NaN
A1A1 Shipping NaN 1
NaN Shipping 3 2
B1A1 NaN 1 7
DF out:
A B C D
A0A1 Railway 10 2
A1A1 Shipping 3 1
Missing Shipping 3 2
B1A1 Missing 1 7
You can fillna by passing a dict:
df.fillna({'A':'Miss','B':"Your2",'C':df.C.median(),'D':df.D.mean()})
Out[373]:
A B C D
0 A0A1 Railway 10.0 3.333333
1 A1A1 Shipping 3.0 1.000000
2 Miss Shipping 3.0 2.000000
3 B1A1 Your2 1.0 7.000000
Fun way!
d = {np.dtype('O'): 'Missing'}
df.fillna(df.dtypes.map(d).fillna(df.median()))
A B C D
0 A0A1 Railway 10.0 2.0
1 A1A1 Shipping 3.0 1.0
2 Missing Shipping 3.0 2.0
3 B1A1 Missing 1.0 7.0
First fill numeric columns with their median and then fillna with 'Missing' for the non-numeric ones:
df = df.fillna(df.median()).fillna('Missing')
print (df)
A B C D
0 A0A1 Railway 10.0 2.0
1 A1A1 Shipping 3.0 1.0
2 Missing Shipping 3.0 2.0
3 B1A1 Missing 1.0 7.0
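One caveat from me (not in the original answers): on newer pandas versions (2.0 and later), DataFrame.median() raises a TypeError when object columns are present, so restrict it to the numeric columns explicitly:
df = df.fillna(df.median(numeric_only=True)).fillna('Missing')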
