Get the difference between two dates when string value changes - python-3.x

I want to get the number of days between changes of the string values (i.e., the symbol column) in one column, grouped by their respective id. I want a separate datediff column like the one below.
id date symbol datediff
1 2022-08-26 a 0
1 2022-08-27 a 0
1 2022-08-28 a 0
2 2022-08-26 a 0
2 2022-08-27 a 0
2 2022-08-28 a 0
2 2022-08-29 b 3
3 2022-08-29 c 0
3 2022-08-30 b 1
For id = 1, datediff = 0 since the symbol stayed a. For id = 2, datediff = 3 since the symbol changed from a to b after 3 days. Hence, what I'm looking for is code that computes the difference in days whenever an id changes its symbol.
I am currently using this code:
df['date'] = pd.to_datetime(df['date'])
diff = ['0 days 00:00:00']
for i in df.index:
    if i > 0:  # cannot evaluate previous row from index 0
        if df['symbol'][i] != df['symbol'][i-1]:
            diff.append(df['date'][i] - df['date'][i-1])
        else:
            diff.append('0 days 00:00:00')
The output becomes:
id date symbol datediff
1 2022-08-26 a 0
1 2022-08-27 a 0
1 2022-08-28 a 0
2 2022-08-26 a 0
2 2022-08-27 a 0
2 2022-08-28 a 0
2 2022-08-29 b 1
3 2022-08-29 c 0
3 2022-08-30 b 1
It also computes the difference across two different ids, but I want the computation to be kept separate for each id.
I only see questions about date differences when numeric values change, not when a string changes. Thank you!

IIUC: my solution assumes that the symbols within one id end with at most a single change of symbol, if there is any (as in the example given in the question).
First use df.groupby on id and symbol and get the minimum date for each combination. Then, find the difference between the dates within each id. This gives the datediff. Finally, merge the findings with the original dataframe.
# Minimum date for each (id, symbol) combination
df1 = df.groupby(['id', 'symbol'], sort=False).agg({'date': 'min'}).reset_index()
# Day gap between successive symbols within each id; the first symbol gets 0
df1['datediff'] = df1.groupby('id')['date'].diff().dt.days.fillna(0).abs()
df1 = df1.drop(columns='date')
df_merge = pd.merge(df, df1, on=['id', 'symbol'])
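For reference, a minimal end-to-end sketch of this approach on the question's sample data (same logic as above, just self-contained):

import pandas as pd

df = pd.DataFrame({
    'id':     [1, 1, 1, 2, 2, 2, 2, 3, 3],
    'date':   ['2022-08-26', '2022-08-27', '2022-08-28',
               '2022-08-26', '2022-08-27', '2022-08-28', '2022-08-29',
               '2022-08-29', '2022-08-30'],
    'symbol': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'c', 'b'],
})
df['date'] = pd.to_datetime(df['date'])

# Earliest date per (id, symbol), day gaps within each id, then merge back
df1 = df.groupby(['id', 'symbol'], sort=False).agg({'date': 'min'}).reset_index()
df1['datediff'] = df1.groupby('id')['date'].diff().dt.days.fillna(0).abs()
df_merge = pd.merge(df, df1.drop(columns='date'), on=['id', 'symbol'])
print(df_merge)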

Related

pandas - show column name + sum in which the sum is higher than zero

I read my dataframe in with:
dataframe = pd.read_csv("testFile.txt", sep="\t", index_col=0)
I got a dataframe like this:
cell 17472131 17472132 17472133 17472134 17472135 17472136
cell_0 1 0 1 0 1 0
cell_1 0 0 0 0 1 0
cell_2 0 1 1 1 0 0
cell_3 1 0 0 0 1 0
With pandas I would like to get all the column names for which the column sum is > 1, together with that sum.
So I would like:
17472131 2
17472133 2
17472135 3
I figured out how to get the sums of each column with
dataframe.sum(axis=0)
but this also returns the columns with a sum lower than 2. Is there a way to only show the columns whose sum is above a threshold, e.g. 1?
One pretty neat way is to use a lambda function in loc:
df.set_index('cell').sum().loc[lambda x: x>1]
Output:
17472131 2
17472133 2
17472135 3
dtype: int64
Details: df.sum returns a pd.Series, and lambda x: x > 1 produces a boolean series, which loc uses for boolean indexing to select only the True parts of the pd.Series.
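The same result without the lambda, as a small equivalent sketch: hold the sums in a variable and boolean-index it directly. Note that since the question reads the file with index_col=0, 'cell' may already be the index, in which case the set_index call isn't needed at all:

sums = dataframe.sum()   # 'cell' is already the index after index_col=0
print(sums[sums > 1])    # keep only the columns whose sum exceeds 1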

How to extract data from data frame when value of column change

I want to extract part of the data frame when the flag value changes from 0 to 1.
logic1: when flag changes from 0 to 1, start saving data until flag changes back to 0 (also keep the points just before and just after the run of 1s).
logic2: when flag changes from 0 to 1, start saving data until flag changes back to 0 (don't keep the points before and after the run of 1s).
Only save data the first time flag changes from 0 to 1; if flag later changes from 0 to 1 again, do nothing.
df=pd.DataFrame({'value':[3,4,7,8,11,1,15,20,15,16,87],'flag':[0,0,0,1,1,1,0,0,1,1,0]})
Desired output for logic1:
df_out_1 = pd.DataFrame({'value': [7, 8, 11, 1, 15]})
Desired output for logic2:
df_out_2 = pd.DataFrame({'value': [8, 11, 1]})
The idea is to label consecutive runs of 1s and 0s in s, filter only the 1 runs, and pick the first of them by comparing against the minimal run label:
df = df.reset_index(drop=True)
# Label each consecutive run of equal flag values
s = df['flag'].ne(df['flag'].shift()).cumsum()
# Keep only the first run of 1s (smallest run label among flag == 1)
m = s.eq(s[df['flag'].eq(1)].min())
df2 = df.loc[m, ['value']]
print(df2)
value
3 8
4 11
5 1
And then, for logic1, include the neighboring rows by adding and subtracting 1 from the default RangeIndex:
df1 = df.loc[(df2.index + 1).union(df2.index - 1), ['value']]
print(df1)
value
2 7
3 8
4 11
5 1
6 15
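Putting the pieces together, a minimal self-contained sketch on the question's data (same logic as above):

import pandas as pd

df = pd.DataFrame({'value': [3, 4, 7, 8, 11, 1, 15, 20, 15, 16, 87],
                   'flag':  [0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0]})
df = df.reset_index(drop=True)

s = df['flag'].ne(df['flag'].shift()).cumsum()   # label runs of equal flags
m = s.eq(s[df['flag'].eq(1)].min())              # first run of 1s only

df_out_2 = df.loc[m, ['value']]                                               # logic2
df_out_1 = df.loc[(df_out_2.index + 1).union(df_out_2.index - 1), ['value']]  # logic1
print(df_out_1)
print(df_out_2)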

Complex group by using Pandas

I am facing a situation where I need to group a dataframe by a column 'ID' and also calculate the total time frame each particular ID took to complete. I only want to calculate the difference between Date_Open and Date_Closed for each ID, together with the ID count.
We only need to focus on the date open and date closed fields, so it needs to take the max closing date and the min open date and subtract the two.
The dataframe looks as follows:
ID Date_Open Date_Closed
1 01/01/2019 02/01/2019
1 07/01/2019 09/01/2019
2 10/01/2019 11/01/2019
2 13/01/2019 19/01/2019
3 10/01/2019 11/01/2019
The output should look like this :
ID Count_of_ID Total_Time_In_Days
1 2 8
2 2 9
3 1 1
How should I achieve this?
Using GroupBy with named aggregation and the min and max of the dates:
# Parse the day-first date strings
df[['Date_Open', 'Date_Closed']] = (
    df[['Date_Open', 'Date_Closed']].apply(lambda x: pd.to_datetime(x, format='%d/%m/%Y'))
)
# Row count plus earliest open and latest close per ID
dfg = df.groupby('ID').agg(
    Count_of_ID=('ID', 'size'),
    Date_Open=('Date_Open', 'min'),
    Date_Closed=('Date_Closed', 'max')
)
dfg['Total_Time_In_Days'] = dfg['Date_Closed'].sub(dfg['Date_Open']).dt.days
dfg = dfg.drop(columns=['Date_Closed', 'Date_Open']).reset_index()
ID Count_of_ID Total_Time_In_Days
0 1 2 8
1 2 2 9
2 3 1 1
Now we have Total_Time_In_Days as int:
print(dfg.dtypes)
ID int64
Count_of_ID int64
Total_Time_In_Days int64
dtype: object
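For reference, the sample frame used above can be built like this (a sketch of the question's data):

import pandas as pd

df = pd.DataFrame({
    'ID':          [1, 1, 2, 2, 3],
    'Date_Open':   ['01/01/2019', '07/01/2019', '10/01/2019', '13/01/2019', '10/01/2019'],
    'Date_Closed': ['02/01/2019', '09/01/2019', '11/01/2019', '19/01/2019', '11/01/2019'],
})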
This can also be used:
df['Date_Open'] = pd.to_datetime(df['Date_Open'], dayfirst=True)
df['Date_Closed'] = pd.to_datetime(df['Date_Closed'], dayfirst=True)
df_grouped = df.groupby(by='ID').count()
df_grouped['Total_Time_In_Days'] = df.groupby(by='ID')['Date_Closed'].max() - df.groupby(by='ID')['Date_Open'].min()
df_grouped = df_grouped.drop(columns=['Date_Open'])
df_grouped.columns=['Count', 'Total_Time_In_Days']
print(df_grouped)
Count Total_Time_In_Days
ID
1 2 8 days
2 2 9 days
3 1 1 days
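If you prefer plain integers instead of the Timedelta values shown above, one extra line converts them (a small sketch):

df_grouped['Total_Time_In_Days'] = df_grouped['Total_Time_In_Days'].dt.days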
I'd first create a column holding how much time passed from Date_Open to Date_Closed for each row of the dataframe, like this:
df['Total_Time_In_Days'] = df.Date_Closed - df.Date_Open
Then you can use groupby:
df.groupby('ID').agg({'ID': 'count', 'Total_Time_In_Days': 'sum'})
Note that this sums the per-row durations, so it matches the max-minus-min span above only when an ID's rows have no gaps between them (for ID 1 it gives 3 days rather than 8). If you need any help with the .agg function you can refer to its official documentation here.

Find a subset of rows (N rows) in a Pandas data frame having the same values at a subset of columns

I have a df which contains customer data without a primary key. The same customer might show up multiple times.
I have a field (df2['campaign']) that is an int and reflects how many times the customer shows up in the df. There are also many customer attributes.
In my example, going from top to bottom, for each row (i.e. customer) I would like to find all n rows (i.e. all n customers) whose values in the education and default columns are the same. Remember n is the int contained in df2['campaign'].
So, as shown below, for rows 0 and 1 I should search for 1 row each but find nothing, because there are no matching education-default combinations.
For row 2 I should search for 1 row (because campaign == 1) where the education-default values match, and find one at index 4.
df2.head()
job marital education default campaign housing loan contact
0 3 1 0 0 1 0 0 1
1 7 1 3 1 1 0 0 1
2 7 1 3 0 1 2 0 1
3 0 1 1 0 1 0 0 1
4 7 1 3 0 1 0 2 1
Use df2_sorted = df2.sort_values(['education', 'default'], ascending=[True, True]) (the older df.sort was removed from pandas; sort_values is the current method).
Then, if your data is not noisy, rows with matching education-default values become neighbors.
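If you need the matches explicitly rather than just adjacency after sorting, here is a small sketch of one way to collect them (the frame below reproduces df2.head() from the question): group on the two columns, then for each row look up the other rows that share its education-default pair and keep at most campaign of them.

import pandas as pd

df2 = pd.DataFrame({
    'job': [3, 7, 7, 0, 7], 'marital': [1, 1, 1, 1, 1],
    'education': [0, 3, 3, 1, 3], 'default': [0, 1, 0, 0, 0],
    'campaign': [1, 1, 1, 1, 1], 'housing': [0, 0, 2, 0, 0],
    'loan': [0, 0, 0, 0, 2], 'contact': [1, 1, 1, 1, 1],
})

# Map each (education, default) pair to the row indices that carry it
groups = df2.groupby(['education', 'default']).groups
for i, row in df2.iterrows():
    others = [j for j in groups[(row['education'], row['default'])] if j != i]
    print(i, others[:row['campaign']])   # rows 0, 1, 3 find nothing; row 2 finds 4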

Python 3.x - Merge pandas data frames

I am using Python for the Titanic disaster competition on Kaggle. The dataset (df) contains 3 attributes for each passenger - 'Gender' (1/0), 'Age' and 'Pclass' (1/2/3). I want to obtain the median age for each Gender-Pclass combination.
The end result should be a dataframe as -
Gender Class
1 1
0 2
1 3
0 1
1 2
0 3
Median age will be calculated later
I tried to create the data frame as follows -
unique_gender = pd.DataFrame(df.Gender.unique())
unique_class = pd.DataFrame(df.Class.unique())
reqd_df = pd.merge(unique_gender, unique_class, how = 'outer')
But the output obtained is -
0
0 3
1 1
2 2
3 0
can someone please help me get the desired output?
You want df.groupby(['Gender', 'Pclass'])['Age'].median() (per JohnE), using the column names given in the question.
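If you want the result as a flat dataframe rather than a multi-indexed series, add reset_index (a small sketch, assuming the columns are named Gender, Pclass and Age as described in the question):

median_ages = df.groupby(['Gender', 'Pclass'])['Age'].median().reset_index()
print(median_ages)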
