Select rows where specific columns hold negative values and set the next column's value to 0 in python3 - python-3.x

I have the below dataframe:
date B C D E
2019-07-01 00:00 0.400157 0.978738 2.240893 1.867558
2019-07-01 00:10 -0.950088 0.151357 -0.103219 0.410599
2019-07-01 00:20 1.454274 0.761038 0.121675 0.443863
2019-07-01 00:30 -1.494079 0.205158 0.313068 0.854096
Suppose an even-numbered column in a row contains a negative value (there may be multiple conditions, e.g. the value is negative or greater than 10); then I want to set the next odd-numbered column's value in that row to 0.
expected output
date B C D E
2019-07-01 00:00 0.400157 0.978738 2.240893 1.867558
2019-07-01 00:10 -0.950088 0 -0.103219 0
2019-07-01 00:20 1.454274 0.761038 0.121675 0.443863
2019-07-01 00:30 -1.494079 0 0.313068 0.854096
A one-liner solution would be best, or can we write a function for this?

This solution requires the date column to be set as the index:
df.set_index('date', inplace=True)
df[df.shift(axis=1) < 0] = 0
df.reset_index(inplace=True)
df.shift returns a new dataframe with all the columns shifted to the right (default behaviour; can be changed using the periods parameter). This enables you to compare a cell with one to its left.
Source: DataFrame.shift
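For reference, a minimal, self-contained sketch that rebuilds the sample frame from the question and applies the answer's one-liner; the small wrapper function is my own addition, since the question also asks for a function:
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "date": ["2019-07-01 00:00", "2019-07-01 00:10",
             "2019-07-01 00:20", "2019-07-01 00:30"],
    "B": [0.400157, -0.950088, 1.454274, -1.494079],
    "C": [0.978738, 0.151357, 0.761038, 0.205158],
    "D": [2.240893, -0.103219, 0.121675, 0.313068],
    "E": [1.867558, 0.410599, 0.443863, 0.854096],
})

def zero_next_column(frame, cond=lambda s: s < 0):
    # Wherever the column to the left satisfies cond (negative by default),
    # set the current column's value to 0; the date column is left untouched.
    out = frame.set_index("date")
    out[cond(out.shift(axis=1))] = 0
    return out.reset_index()

print(zero_next_column(df))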

Related

How to delete rows with the same value? Merge column with same prefix

Hi everyone, I have two questions that need help.
Question 2
I have a df with data as below:
ABC_x   Quantity silent   ABC_y   Quantity noirse
A       05                NaN     NaN
B       03                NaN     NaN
NaN     NaN               D       08
NaN     NaN               E       09
G       01                NaN     NaN
How do I merge the two columns ABC_x and ABC_y (same prefix ABC) into one column ABC, and likewise merge the data of the two Quantity columns into one column Quantity?
DF expected:
ABC   Quantity
A     05
B     03
D     08
E     09
G     01
Thank you for reading and helping me troubleshoot the problem. Have a nice day <3
I have tried but was unsuccessful.
Question 1
pandas has a function duplicated that gives you True for duplicates and False otherwise:
In [40]: df.duplicated(["Column A"])
Out[40]:
0 False
1 True
dtype: bool
You can use this for boolean indexing
In [43]: df.loc[df.duplicated(["Column A"]), "Column A"] = np.nan
In [44]: df
Out[44]:
Name Column A Column B Column C Column D Column E Column F
0 NameA ValueA ValueB ValueC Value_D001 Value_E01 Value_F3
1 NameA NaN ValueB ValueC Value_D002 Value_E06 Value_F4
and the same for the other columns.
Note
You can also pass multiple columns with
In [52]: df.loc[
...: df.duplicated(["Column A", "Column B", "Column C"]),
...: ["Column A", "Column B", "Column C"],
...: ] = np.nan
In [53]: df
Out[53]:
Name Column A Column B Column C Column D Column E Column F
0 NameA ValueA ValueB ValueC Value_D001 Value_E01 Value_F3
1 NameA NaN NaN NaN Value_D002 Value_E06 Value_F4
However, this would replace only where all three columns are duplicated at the same time.
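If you want each column treated independently (the "same for the other columns" case above), a simple loop over the columns does it; this sketch rebuilds the two-row frame shown in the session so it runs on its own:
import numpy as np
import pandas as pd

# Two-row frame as shown in the session above
df = pd.DataFrame({
    "Name": ["NameA", "NameA"],
    "Column A": ["ValueA", "ValueA"],
    "Column B": ["ValueB", "ValueB"],
    "Column C": ["ValueC", "ValueC"],
    "Column D": ["Value_D001", "Value_D002"],
    "Column E": ["Value_E01", "Value_E06"],
    "Column F": ["Value_F3", "Value_F4"],
})

# Blank out duplicates column by column, so each column is judged on its own
for col in ["Column A", "Column B", "Column C"]:
    df.loc[df.duplicated([col]), col] = np.nan

print(df)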
Question 2
pandas has a function fillna to replace NaN values. From your example I assume there is a value in either _x or _y. In this case you can use backfill to use _x if it is there and take _y otherwise:
In [76]: df[["ABC_x", "ABC_y"]].fillna(method="backfill", axis=1)
Out[76]:
ABC_x ABC_y
0 A NaN
1 B NaN
2 D D
3 E E
4 G NaN
Then do this for ABC as well as Quantity and use the first column only:
In [82]: pd.DataFrame({
"ABC": df[["ABC_x", "ABC_y"]].fillna(method="backfill", axis=1).iloc[:, 0],
"Quantity": df[["Quantity silent", "Quantity noirse"]].fillna(method="backfill", axis=1).iloc[:, 0].astype(int),
})
Out[82]:
ABC Quantity
0 A 5
1 B 3
2 D 8
3 E 9
4 G 1
The astype(int) at the end is just because NaN is not a valid integer, so pandas interprets the numbers as floats in the presence of NaN.
Question1
Where the column name contains 'Column', set duplicated values to NaN:
cond1 = df.columns.str.contains('Column')
df.loc[:, cond1].apply(lambda x: x.mask(x.duplicated()))
result:
Column A Column B Column C Column D Column E Column F
0 ValueA ValueB ValueC Value_D001 Value_E01 Value_F3
1 NaN NaN NaN Value_D002 Value_E06 Value_F4
Join the result back to the Name column.
Full code:
cond1 = df.columns.str.contains('Column')
df.loc[:, ~cond1].join(df.loc[:, cond1].apply(lambda x: x.mask(x.duplicated())))
Name Column A Column B Column C Column D Column E Column F
0 NameA ValueA ValueB ValueC Value_D001 Value_E01 Value_F3
1 NameA NaN NaN NaN Value_D002 Value_E06 Value_F4
Question2
df.set_axis(df.columns.str.split('[ _]').str[0], axis=1).groupby(level=0, axis=1).first()
result
ABC Quantity
0 A 05
1 B 03
2 D 08
3 E 09
4 G 01
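For completeness, a self-contained sketch that rebuilds the question's frame (my reconstruction of the table above) and applies the prefix-merge one-liner; note that groupby(..., axis=1) is deprecated in recent pandas versions, so a transposed equivalent is noted in a comment:
import numpy as np
import pandas as pd

# Reconstruction of the question's frame
df = pd.DataFrame({
    "ABC_x":           ["A", "B", np.nan, np.nan, "G"],
    "Quantity silent": ["05", "03", np.nan, np.nan, "01"],
    "ABC_y":           [np.nan, np.nan, "D", "E", np.nan],
    "Quantity noirse": [np.nan, np.nan, "08", "09", np.nan],
})

# Rename every column to its shared prefix, then take the first
# non-null value per prefix in each row
renamed = df.set_axis(df.columns.str.split("[ _]").str[0], axis=1)
out = renamed.groupby(level=0, axis=1).first()
# Equivalent without the deprecated axis=1 groupby:
# out = renamed.T.groupby(level=0).first().T
print(out)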

Calculating dynamic difference between two columns in pandas

I have a situation in which I need to calculate the difference between multiple columns and store the results in separate columns under separate headers. My dataset looks like below:
cat_1 cat_2 cat_3 cat_4 date_1 date_2 date_3 date_4
a b b c 2020-01-01 2020-01-01 2020-01-25 2020-01-10
b c d 2019-01-11 2020-01-01 2020-01-15 2020-01-10
a b d 2018-11-01 2019-01-01 2020-01-15 2020-01-10
a b c d 2015-01-01 2016-01-29 2018-01-25 2019-01-10
.. and so on
The order will follow a -> b -> c -> d, and the reverse is not true.
I want to store the following combinations in new columns as a number of days. There will be 4 combinations in total. Essentially, I want to calculate the difference between two dates and store it in days for each combination.
An example output for the first row of my dataset:
days_a-b days_a-c days_a-d days_b-c days_b-d days_c-d
0 9
355 364 -5
How to solve this one?
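One possible sketch (my own, not from the thread), assuming each category's first date in the row is used and that pairs follow the a -> b -> c -> d order:
import pandas as pd
from itertools import combinations

# Hypothetical sample shaped like the first row of the question's data
df = pd.DataFrame({
    "cat_1": ["a"], "cat_2": ["b"], "cat_3": ["b"], "cat_4": ["c"],
    "date_1": ["2020-01-01"], "date_2": ["2020-01-01"],
    "date_3": ["2020-01-25"], "date_4": ["2020-01-10"],
})

cat_cols = [c for c in df.columns if c.startswith("cat_")]
date_cols = [c for c in df.columns if c.startswith("date_")]
df[date_cols] = df[date_cols].apply(pd.to_datetime)

def pair_differences(row):
    # First date seen for each category in this row
    firsts = {}
    for cat_col, date_col in zip(cat_cols, date_cols):
        cat = row[cat_col]
        if pd.notna(cat) and cat not in firsts:
            firsts[cat] = row[date_col]
    # Difference in days for every ordered pair of categories present
    return pd.Series({f"days_{a}-{b}": (firsts[b] - firsts[a]).days
                      for a, b in combinations(sorted(firsts), 2)})

result = df.join(df.apply(pair_differences, axis=1))
print(result)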

Select top n columns based on another column

I have a database as the following:
And I would like to obtain a pandas dataframe filtered for the 2 rows per date, based on the top ones that have the highest population. The output should look like this:
I know that pandas offers a formula called nlargest:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nlargest.html
but I don't think it is usable for this use case. Is there any workaround?
Thanks so much in advance!
I have mimicked your dataframe below and provided a way forward to get the desired output; hope that helps.
Your Dataframe:
>>> df
Date country population
0 2019-12-31 A 100
1 2019-12-31 B 10
2 2019-12-31 C 1000
3 2020-01-01 A 200
4 2020-01-01 B 20
5 2020-01-01 C 3500
6 2020-01-01 D 12
7 2020-02-01 D 2000
8 2020-02-01 E 54
Your Desired Solution:
You can use the nlargest method along with set_index and groupby.
This is what you will get:
>>> df.set_index('country').groupby('Date')['population'].nlargest(2)
Date country
2019-12-31 C 1000
A 100
2020-01-01 C 3500
A 200
2020-02-01 D 2000
E 54
Name: population, dtype: int64
Now, restore the DataFrame to its original shape by resetting the index, which gives you the following:
>>> df.set_index('country').groupby('Date')['population'].nlargest(2).reset_index()
Date country population
0 2019-12-31 C 1000
1 2019-12-31 A 100
2 2020-01-01 C 3500
3 2020-01-01 A 200
4 2020-02-01 D 2000
5 2020-02-01 E 54
Another way around:
With groupby and apply, use reset_index with the parameters drop=True and level=:
>>> df.groupby('Date').apply(lambda p: p.nlargest(2, columns='population')).reset_index(level=[0,1], drop=True)
# df.groupby('Date').apply(lambda p: p.nlargest(2, columns='population')).reset_index(level=['Date',1], drop=True)
Date country population
0 2019-12-31 C 1000
1 2019-12-31 A 100
2 2020-01-01 C 3500
3 2020-01-01 A 200
4 2020-02-01 D 2000
5 2020-02-01 E 54
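As a further alternative (my own addition, not from the answer above), sorting by population and keeping the top two rows per date gives the same rows:
import pandas as pd

# The mimicked frame from the answer above
df = pd.DataFrame({
    "Date": ["2019-12-31", "2019-12-31", "2019-12-31", "2020-01-01",
             "2020-01-01", "2020-01-01", "2020-01-01", "2020-02-01", "2020-02-01"],
    "country": list("ABCABCDDE"),
    "population": [100, 10, 1000, 200, 20, 3500, 12, 2000, 54],
})

# Keep the two most populous rows per date
out = (df.sort_values("population", ascending=False)
         .groupby("Date")
         .head(2)
         .sort_values(["Date", "population"], ascending=[True, False])
         .reset_index(drop=True))
print(out)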

Pandas delete and shift cells in a column basis multiple conditions

I have a situation where I want to delete and shift cells in a pandas data frame based on some conditions. My data frame looks like this:
Value_1 ID_1 Value_2 ID_2 Value_3 ID_3
A 1 D 1 G 1
B 1 E 2 H 1
C 1 F 2 I 3
C 1 F 2 H 1
Now I want to apply the following conditions:
ID_2 and ID_3 should always be less than or equal to ID_1. If either of them is greater than ID_1, then that cell should be deleted and the next column's cell shifted into its place.
The output should look like the following :
Value_1 ID_1 Value_2 ID_2 Value_3 ID_3
A 1 D 1 G 1
B 1 H 1 blank nan
C 1 blank nan blank nan
C 1 H 1 blank nan
You can create a mask from the condition, here for values greater than ID_1, with DataFrame.gt:
cols1 = ['Value_2','Value_3']
cols2 = ['ID_2','ID_3']
m = df[cols2].gt(df['ID_1'], axis=0)
print (m)
ID_2 ID_3
0 False False
1 True False
2 True True
3 True False
Then set the matching values to missing with DataFrame.mask:
df[cols2] = df[cols2].mask(m)
df[cols1] = df[cols1].mask(m.to_numpy())
And last, use DataFrame.shift and set the new columns with Series.mask:
df1 = df[cols2].shift(-1, axis=1)
df['ID_2'] = df['ID_2'].mask(m['ID_2'], df1['ID_2'])
df['ID_3'] = df['ID_3'].mask(m['ID_2'])
df2 = df[cols1].shift(-1, axis=1)
df['Value_2'] = df['Value_2'].mask(m['ID_2'], df2['Value_2'])
df['Value_3'] = df['Value_3'].mask(m['ID_2'])
print (df)
Value_1 ID_1 Value_2 ID_2 Value_3 ID_3
0 A 1 D 1.0 G 1.0
1 B 1 H 1.0 NaN NaN
2 C 1 NaN NaN NaN NaN
3 C 1 H 1.0 NaN NaN
And finally, if necessary, replace the missing values with empty strings:
df[cols1] = df[cols1].fillna('')
print (df)
Value_1 ID_1 Value_2 ID_2 Value_3 ID_3
0 A 1 D 1.0 G 1.0
1 B 1 H 1.0 NaN
2 C 1 NaN NaN
3 C 1 H 1.0 NaN
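For convenience, the steps above collected into one runnable sketch; the sample frame is my reconstruction of the question's table:
import pandas as pd

# Reconstruction of the question's frame
df = pd.DataFrame({
    "Value_1": ["A", "B", "C", "C"],
    "ID_1":    [1, 1, 1, 1],
    "Value_2": ["D", "E", "F", "F"],
    "ID_2":    [1, 2, 2, 2],
    "Value_3": ["G", "H", "I", "H"],
    "ID_3":    [1, 1, 3, 1],
})

cols1 = ['Value_2', 'Value_3']
cols2 = ['ID_2', 'ID_3']

# Mask of ID columns greater than ID_1
m = df[cols2].gt(df['ID_1'], axis=0)

# Blank out the offending cells in both the ID and Value columns
df[cols2] = df[cols2].mask(m)
df[cols1] = df[cols1].mask(m.to_numpy())

# Shift the surviving third-column values into the second column
df1 = df[cols2].shift(-1, axis=1)
df['ID_2'] = df['ID_2'].mask(m['ID_2'], df1['ID_2'])
df['ID_3'] = df['ID_3'].mask(m['ID_2'])
df2 = df[cols1].shift(-1, axis=1)
df['Value_2'] = df['Value_2'].mask(m['ID_2'], df2['Value_2'])
df['Value_3'] = df['Value_3'].mask(m['ID_2'])

# Optionally replace missing Value cells with empty strings
df[cols1] = df[cols1].fillna('')
print(df)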

Function to calculate timespan for a certain event

I have a pandas dataframe which looks like this
timestamp phase
2019-07-01 07:10:00 a
2019-07-01 07:11:00 a
2019-07-01 07:12:00 b
2019-07-01 07:13:00 b
2019-07-01 07:17:00 a
2019-07-01 07:19:00 a
2019-07-01 07:20:00 c
I am working on a function that creates a dataframe with a duration for every phase, until it hits the next phase.
I already have a solution, but I have no clue how to write this as a user-defined function, as I am new to Python.
This is my "static" solution:
df['prev_phase'] = df["phase"].shift(1)
df['next_phase'] = df["phase"].shift(-1)
dfshift = df[df.next_phase != df.prev_phase]
dfshift["delta"] = (dfshift["timestamp"]-dfshift["timestamp"].shift()).fillna(0)
dfshift["helpcolumn"] = dfshift["phase"].shift(1)
dfshift2 = dfshift[dfshift.helpcolumn == dfshift["phase"]]
dfshift3 = dfshift2[["timestamp","phase","delta"]]
dfshift3["deltaminutes"] = dfshift3['delta'] / np.timedelta64(60, 's')
This gives me this as output (example):
timestamp phase delta deltam
2019-05-01 06:44:00 a 0 days 04:51:00 291.0
2019-05-01 07:25:00 b 0 days 00:40:00 40.0
2019-05-01 21:58:00 a 0 days 14:32:00 872.0
2019-05-01 22:07:00 c 0 days 00:08:00 8.0
I just need this in a function.
Thanks in advance
Edit for #Tom
timestamp phase
2019-05-05 08:58:00 a
2019-05-05 08:59:00 a
2019-05-05 09:00:00 b
2019-05-05 09:01:00 b
2019-05-05 09:02:00 b
2019-05-05 09:03:00 b
...
...
2019-05-05 09:38:00 b
2019-05-05 09:39:00 c
2019-05-05 09:40:00 c
2019-05-05 09:41:00 c
Those are the two columns + index.
df = pd.DataFrame({"timestamp": ["2019-07-01 07:10:00",
"2019-07-01 07:11:00",
"2019-07-01 07:12:00",
"2019-07-01 07:13:00",
"2019-07-01 07:17:00",
"2019-07-01 07:19:00",
"2019-07-01 07:20:00"],
"phase": ["a", "a", "b", "b", "a" ,"a", "c"]})
df["timestamp"] = pd.to_datetime(df["timestamp"])
# Create a 'phase_id' column to track when phase changes
df['phase_id'] = df['phase'].ne(df['phase'].shift()) + df.index
# Groupby new 'phase_id' variable and get time range for each phase
df_tdiff = df.groupby("phase_id").diff().reset_index()
df_tdiff.columns = ['phase_id', 'timediff']
# Merge this to old dataframe
df_new = pd.merge(df, df_tdiff, on=["phase_id"], how="left")
This then gives:
df_new
timestamp phase phase_id timediff
0 2019-07-01 07:10:00 a 1 00:01:00
1 2019-07-01 07:11:00 a 1 00:01:00
2 2019-07-01 07:12:00 b 3 00:01:00
3 2019-07-01 07:13:00 b 3 00:01:00
4 2019-07-01 07:17:00 a 5 00:02:00
5 2019-07-01 07:19:00 a 5 00:02:00
6 2019-07-01 07:20:00 c 7 NaT
Finally:
df_new = df_new.groupby("phase_id").first().reset_index(drop=True)
df_new
timestamp phase timediff
0 2019-07-01 07:10:00 a 00:01:00
1 2019-07-01 07:12:00 b 00:01:00
2 2019-07-01 07:17:00 a 00:02:00
3 2019-07-01 07:20:00 c NaT
Of course, if you need that all as a function (as originally requested), then:
def get_phase_timediff(df):
    # Create a 'phase_id' column to track when phase changes
    df['phase_id'] = df['phase'].ne(df['phase'].shift()) + df.index
    # Groupby new 'phase_id' variable and get time range for each phase
    df_tdiff = df.groupby("phase_id").diff().reset_index()
    df_tdiff.columns = ['phase_id', 'timediff']
    # Merge this to old dataframe
    df_new = pd.merge(df, df_tdiff, on=["phase_id"], how="left")
    # Groupby 'phase_id' again for final output
    df_new = df_new.groupby("phase_id").first().reset_index(drop=True)
    return df_new
