How to find the count of a cell value and the number of last continuous occurrences with pandas - python-3.x

DataFrame 1 holds the codes for weeks 1-3 and DataFrame 2 holds the codes for week 4, i.e. the last week. How do I add two columns: first, the total number of weeks each code appears in, and second, the number of consecutive weeks it appears in, counting back from the last week?

It's easier to get more information about the week codes by turning them into columns with truth values.
>>> df1
week1 week2 week3
0 bk mm xz
1 ck ij bk
2 ka xz zz
3 xz dk bv
4 ij mn ka
5 ik ik xy
6 bv ka xx
7 xy xx ck
8 mn ck ik
>>> df2
week4
0 ka
1 bk
2 ck
3 xy
4 xz
5 zz
6 ik
7 xx
8 bv
Start by joining your two DataFrames to bring all the weeks' data together.
df = df1.join(df2)
Now gather all unique week codes and generate a list of codes for each week. We can then combine these in a new DataFrame.
import numpy as np

codes = np.unique(df)
weeks = df.T.apply(list, axis=1)
d = {c: [(c in w) for w in weeks] for c in codes}
newDf = pd.DataFrame(data=d, index=weeks.index)
>>> newDf
bk bv ck dk ij ik ka mm mn xx xy xz zz
week1 True True True False True True True False True False True True False
week2 False False True True True True True True True True False True False
week3 True True True False False True True False False True True True True
week4 True True True False False True True False False True True True True
Now we can get the total cell count easily, and the number of trailing continuous cells using a neat cumsum trick.
totals = newDf.sum()
continues = (newDf.cumsum() -
             newDf.cumsum().where(~newDf).ffill().fillna(0)).iloc[-1]
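To see why the cumsum trick works, here is a minimal sketch (on a made-up Boolean Series, not part of the original answer): cumsum counts every True seen so far, where(~s).ffill() freezes that count at the most recent False, and the difference is the length of the run of Trues ending at each position.
import pandas as pd

s = pd.Series([True, False, True, True, True])
total = s.cumsum()                                  # 1 1 2 3 4
at_last_false = total.where(~s).ffill().fillna(0)   # 0 1 1 1 1
run_length = total - at_last_false                  # 1 0 1 2 3
print(run_length.iloc[-1])                          # 3.0 -> trailing run of three Trues
Taking .iloc[-1] of that difference, as above, gives the number of consecutive weeks ending at the last week.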
Finally you can fill in your total columns in DataFrame #2, using the above as lookup tables.
df2['Total Number of cell'] = df2['week4'].apply(lambda code: totals[code])
df2['No. of time last continuous'] = df2['week4'].apply(
    lambda code: continues[code]).astype(int)
results in
>>> df2
week4 Total Number of cell No. of time last continuous
0 ka 4 4
1 bk 3 2
2 ck 4 4
3 xy 3 2
4 xz 4 4
5 zz 2 2
6 ik 4 4
7 xx 3 3
8 bv 3 2

Related

Python: Compare 2 pandas dataframe with unequal number of rows

I need to compare two pandas DataFrames with an unequal number of rows and generate a new DataFrame with True for matching records and False for non-matching and missing records.
df1:
date x y
0 2022-11-01 4 5
1 2022-11-02 12 5
2 2022-11-03 11 3
df2:
date x y
0 2022-11-01 4 5
1 2022-11-02 11 5
expected df_output:
date x y
0 True True True
1 False False False
2 False False False
Code:
df1 = pd.DataFrame({'date':['2022-11-01', '2022-11-02', '2022-11-03'],'x':[4,12,11],'y':[5,5,3]})
df2 = pd.DataFrame({'date':['2022-11-01', '2022-11-02'],'x':[4,11],'y':[5,5]})
df_output = pd.DataFrame(np.where(df1 == df2, True, False), columns=df1.columns)
print(df_output)
Error: ValueError: Can only compare identically-labeled DataFrame objects
You can use:
# cell to cell equality
# comparing by date
df3 = df1.eq(df1[['date']].merge(df2, on='date', how='left'))
# or to compare by index
# df3 = df1.eq(df2, axis=1)
# if you also want to turn a row to False if there is any False
df3 = (df3.T & df3.all(axis=1)).T
Output:
date x y
0 True True True
1 False False False
2 False False False
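For reference, a self-contained sketch putting the question's sample frames and the date-based comparison together (assuming the usual pandas import; not part of the original answer):
import pandas as pd

df1 = pd.DataFrame({'date': ['2022-11-01', '2022-11-02', '2022-11-03'],
                    'x': [4, 12, 11], 'y': [5, 5, 3]})
df2 = pd.DataFrame({'date': ['2022-11-01', '2022-11-02'],
                    'x': [4, 11], 'y': [5, 5]})

# align df2 to df1's rows by date, then compare cell by cell;
# missing dates become NaN and therefore compare as False
aligned = df1[['date']].merge(df2, on='date', how='left')
df3 = df1.eq(aligned)

# turn a whole row to False if any cell in it is False
df3 = (df3.T & df3.all(axis=1)).T
print(df3)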

pandas drop rows based on condition on groupby

I have a DataFrame with columns cell, value and kpi.
I am trying to group by the cell column and drop the NaN value rows where the group size is > 1.
How do I get the required output? How do I filter on a condition and drop rows within a groupby?
From your DataFrame, first we group by cell to get the size of each group:
>>> df_grouped = df.groupby(['cell'], as_index=False).size()
>>> df_grouped
cell size
0 A 3
1 B 1
2 D 3
Then, we merge the result with the original DataFrame like so:
>>> df_merged = pd.merge(df, df_grouped, on='cell', how='left')
>>> df_merged
cell value kpi size
0 A 5.0 thpt 3
1 A 6.0 ret 3
2 A NaN thpt 3
3 B NaN acc 1
4 D 8.0 int 3
5 D NaN ps 3
6 D NaN yret 3
To finish, we filter the DataFrame to get the expected result:
>>> df_filtered = df_merged[~((df_merged['value'].isna()) & (df_merged['size'] > 1))]
>>> df_filtered[['cell', 'value', 'kpi']]
cell value kpi
0 A 5.0 thpt
1 A 6.0 ret
3 B NaN acc
4 D 8.0 int
Use a boolean mask:
>>> df[df.groupby('cell').cumcount().eq(0) | df['value'].notna()]
cell value kpi
0 A crud thpt
1 A 6 ret
3 B NaN acc
4 D hi int
Details:
m1 = df.groupby('cell').cumcount().eq(0)
m2 = df['value'].notna()
df.assign(keep_at_least_one=m1, keep_notna=m2, keep_rows=m1|m2)
# Output:
cell value kpi keep_at_least_one keep_notna keep_rows
0 A crud thpt True True True
1 A 6 ret False True True
2 A NaN thpt False False False
3 B NaN acc True False True
4 D hi int True True True
5 D NaN ps False False False
6 D NaN yret False False False
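The question's DataFrame is not reproduced above, so here is a minimal sketch with made-up data of the same shape (columns cell, value, kpi) showing the boolean-mask approach end to end:
import numpy as np
import pandas as pd

# made-up sample data in the shape implied by the answers above
df = pd.DataFrame({'cell':  ['A', 'A', 'A', 'B', 'D', 'D', 'D'],
                   'value': [5.0, 6.0, np.nan, np.nan, 8.0, np.nan, np.nan],
                   'kpi':   ['thpt', 'ret', 'thpt', 'acc', 'int', 'ps', 'yret']})

# keep the first row of every cell group, plus every row with a non-null value
keep = df.groupby('cell').cumcount().eq(0) | df['value'].notna()
print(df[keep])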

Counting True or False

I have the following dataframe:
True_False
2018-01-02 True
2018-01-03 True
2018-01-04 False
2018-01-05 False
2018-01-08 False
... ...
2020-01-20 True
2020-01-21 True
2020-01-22 True
2020-01-23 True
2020-01-24 False
504 rows × 1 columns
I want to know how many successive True or False values there are in each run, not the overall total; the count must restart every time the value toggles between True and False. From those run lengths I eventually want to calculate the mean(), max() and min() number of days. Is it possible to do this in pandas?
Solution if all datetimes are consecutive:
You can create a helper Series for consecutive groups with Series.shift and Series.cumsum, then get the counts with GroupBy.size:
g = df['True_False'].ne(df['True_False'].shift()).cumsum()
s = df.groupby(['True_False',g]).size()
print (s)
True_False  True_False
False       2             3
            4             1
True        1             2
            3             4
dtype: int64
And last, aggregate min, max and mean per the first level of the MultiIndex:
print (s.groupby(level=0).agg(['mean','max','min']))
mean max min
True_False
False 2 3 1
True 3 4 2
If the datetimes are not consecutive, the first step is DataFrame.asfreq:
df = df.asfreq('d')
g = df['True_False'].ne(df['True_False'].shift()).cumsum()
s = df.groupby(['True_False',g]).size()
print (s.groupby(level=0).agg(['mean','max','min']))
mean max min
True_False
False 1.333333 2 1
True 3.000000 4 2
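For reference, a compact runnable sketch with a made-up ten-day sample (not the asker's 504-row data) chosen to reproduce the run lengths shown above:
import pandas as pd

idx = pd.date_range('2018-01-01', periods=10, freq='D')
df = pd.DataFrame({'True_False': [True, True, False, False, False,
                                  True, True, True, True, False]}, index=idx)

# label each run of identical consecutive values
g = df['True_False'].ne(df['True_False'].shift()).cumsum()

# size of every run, grouped by its value, then aggregated
runs = df.groupby(['True_False', g]).size()
print(runs.groupby(level=0).agg(['mean', 'max', 'min']))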

pandas dataframe groups: check whether the number of unique values of a column is one, but exclude empty strings

I have the following df,
id invoice_no
1 6636
1 6637
2 6639
2 6639
3
3
4 6635
4 6635
4 6635
The invoice_no values for id 3 are all empty strings or spaces. I want to use
df['same_invoice_no'] = df.groupby("id")["invoice_no"].transform('nunique') == 1
but also treat groups whose invoice_no values are spaces or empty strings as same_invoice_no = False; I am wondering how to do that. The result should look like:
id invoice_no same_invoice_no
1 6636 False
1 6637 False
2 6639 True
2 6639 True
3 False
3 False
4 6635 True
4 6635 True
4 6635 True
Empty strings count as a value for nunique (so a group of empty strings would evaluate to True), but NaNs don't. Replace empty strings with NumPy NaN:
import numpy as np

df.replace('', np.nan, inplace=True)
df['same_invoice_no'] = df.groupby("id")["invoice_no"].transform('nunique') == 1
id invoice_no same_invoice_no
0 1 6636.0 False
1 1 6637.0 False
2 2 6639.0 True
3 2 6639.0 True
4 3 NaN False
5 3 NaN False
6 4 6635.0 True
7 4 6635.0 True
8 4 6635.0 True
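A self-contained version of the same idea using the question's values, written as a sketch that avoids modifying df in place:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id':         [1, 1, 2, 2, 3, 3, 4, 4, 4],
                   'invoice_no': ['6636', '6637', '6639', '6639', '', '',
                                  '6635', '6635', '6635']})

# treat empty strings as missing so that nunique() ignores them
invoice = df['invoice_no'].replace('', np.nan)
df['same_invoice_no'] = invoice.groupby(df['id']).transform('nunique') == 1
print(df)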

How to delete the entire row if any of its values is 0 in pandas

In the example below I only want to retain rows 1 and 2.
I want to delete every row that has a 0 anywhere in its columns:
kt b tt mky depth
1 1 1 1 1 4
2 2 2 2 2 2
3 3 3 0 3 3
4 0 4 0 0 0
5 5 5 5 5 0
The output should look like below:
kt b tt mky depth
1 1 1 1 1 4
2 2 2 2 2 2
I have tried:
df.loc[(df!=0).any(axis=1)]
But it only drops a row if all of its columns are 0.
You are really close; you need DataFrame.all to check that all values in a row are True:
df = df.loc[(df!=0).all(axis=1)]
print (df)
kt b tt mky depth
1 1 1 1 1 4
2 2 2 2 2 2
Details:
print (df!=0)
kt b tt mky depth
1 True True True True True
2 True True True True True
3 True True False True True
4 False True False False False
5 True True True True False
print ((df!=0).all(axis=1))
1 True
2 True
3 False
4 False
5 False
dtype: bool
Alternative solution with any, checking for at least one True per row with the changed mask df == 0 and inverting it with ~:
df = df.loc[~(df==0).any(axis=1)]
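Both variants can be checked quickly against the question's numbers (a minimal sketch, not part of the original answer):
import pandas as pd

df = pd.DataFrame({'kt':    [1, 2, 3, 0, 5],
                   'b':     [1, 2, 3, 4, 5],
                   'tt':    [1, 2, 0, 0, 5],
                   'mky':   [1, 2, 3, 0, 5],
                   'depth': [4, 2, 3, 0, 0]},
                  index=[1, 2, 3, 4, 5])

kept_all = df.loc[(df != 0).all(axis=1)]    # keep rows where every value is non-zero
kept_any = df.loc[~(df == 0).any(axis=1)]   # same result via the inverted mask
print(kept_all.equals(kept_any))            # True
print(kept_all)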
