How to calculate time difference between the rows groupby rowname and extract only the most recent ones? - python-3.x

I want to calculate the number of days between rows using a groupby and extract only one row per group, the one with the latest date. I do not want all the rows that share the same id value; I only want the most recent one, with the number of days as a new column.
In [37]: df
Out[37]:
  id                time
0  A 2016-11-25 16:32:17
1  A 2016-11-27 16:36:04
2  A 2016-11-29 16:35:29
3  B 2016-11-25 16:35:24
4  B 2016-11-28 16:35:46
I want the output as
id no of days
0 A 4(approx)
1 B 3(approx)
So what I want is a single row per id (for A, the row at index 2, which has the most recent date and time), omitting the rest of the rows.

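If I understand correctly, you want the number of days between the earliest and the latest time for each id:
df['time'] = pd.to_datetime(df['time'])
df.groupby('id')['time'].apply(lambda x: (x.max() - x.min()).days)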
Out[1186]:
id
A 4
B 3
Name: time, dtype: int64
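If the result is needed as a two-column frame like the one in the question, a minimal sketch could look like the following. It uses a vectorized variant of the same idea (group maximum minus group minimum) rather than the lambda above. The sample data is copied from the question; the column name no_of_days is an assumption chosen for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B"],
    "time": [
        "2016-11-25 16:32:17",
        "2016-11-27 16:36:04",
        "2016-11-29 16:35:29",
        "2016-11-25 16:35:24",
        "2016-11-28 16:35:46",
    ],
})
df["time"] = pd.to_datetime(df["time"])

# Latest minus earliest timestamp per id, truncated to whole days.
grouped = df.groupby("id")["time"]
span = (grouped.max() - grouped.min()).dt.days

# "no_of_days" is a hypothetical column name, not taken from the question.
result = span.rename("no_of_days").reset_index()
print(result)
```

This yields one row per id, so no separate step is needed to drop the older rows.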

Related

Excel formula help - I have 14/28 day readings averaged per day which need to be recorded as daily readings

The dates on which the data was read are shown in the Date column on the left, and from those I worked out a value per day, shown in the ET/DAY column.
To analyze the data monthly and so on, I want daily data, so that every day between two reading dates gets the ET/DAY value for that period, inserted in the ET/Day column.
I have manually entered what I want in blue, but I need a formula to do this, as it would take too long to do it all manually.
You can do this with MATCH. With an ascending list, if you give the magic number 1 as the third parameter, it will show the last row number with a smaller or equal number.
Here it's matched 17 with 15 in column A and the result is 2, for row 2:
    A    B    C
1   10        =MATCH(17,A:A,1) = 2
2   15
3   20
So for your data, you can match the daily dates with the column on the left, and feed the row number into INDEX to look up from the ET/Day column.
   A            B       C  D            E
1  Date         ET/Day     Daily        ET/Day
2  05/Jan/1995             05/Jan/1995  =Index(B:B, Match(#D:D,A:A,1)) = #N/A
3  19/Jan/1995  3.00       06/Jan/1995  =Index(B:B, Match(#D:D,A:A,1)) = 3.00
4  02/Feb/1995  5.41       07/Jan/1995  =Index(B:B, Match(#D:D,A:A,1)) = 3.00
5                          ...          ...
6                          19/Jan/1995  =Index(B:B, Match(#D:D,A:A,1)) = 3.00
7                          20/Jan/1995  =Index(B:B, Match(#D:D,A:A,1)) = 5.41
There's an inconsistency here with the 05 Jan rate but it should be good after that.
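For readers who find the lookup rule easier to follow in code, here is a rough Python sketch of what MATCH with 1 as the third parameter does on an ascending list, combined with an INDEX-style lookup. The function names match_approx and lookup and the labels in column_b are hypothetical and only serve the illustration; this is not how Excel implements it.

```python
from bisect import bisect_right


def match_approx(value, ascending):
    """Rough analogue of Excel's MATCH(value, range, 1).

    Returns the 1-based position of the last item <= value,
    or None where Excel would show #N/A.
    """
    pos = bisect_right(ascending, value)
    return pos if pos > 0 else None


def lookup(value, keys, values):
    """Rough analogue of INDEX(values, MATCH(value, keys, 1))."""
    pos = match_approx(value, keys)
    return None if pos is None else values[pos - 1]


column_a = [10, 15, 20]            # the ascending list from the small example
column_b = ["low", "mid", "high"]  # hypothetical values to look up

print(match_approx(17, column_a))      # 2
print(lookup(17, column_a, column_b))  # mid
```

A value smaller than every entry in the list has no match, which corresponds to the #N/A case.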

Sorting data frames with time in hours days months format

I have a data frame with a time column that is not in sorted order, and I want to sort it in ascending order. Can someone suggest a direct function or code for sorting a data frame by time?
My input data frame:
Time       data1
1 month    43.391588
13 h       31.548372
14 months  41.956652
3.5 h      31.847388
Expected data frame:
Time       data1
3.5 h      31.847388
13 h       31.548372
1 month    43.391588
14 months  41.956652
First replace the units with multipliers using Series.replace, then evaluate each resulting expression to a number with pandas.eval, and finally sort by this new column with DataFrame.sort_values:
# a month is treated as 30 days, so every value ends up in hours
d = {' months': '*30*24', ' month': '*30*24', ' h': '*1'}
df['sort'] = df['Time'].replace(d, regex=True).map(pd.eval)
df = df.sort_values('sort')
print(df)
        Time      data1     sort
3      3.5 h  31.847388      3.5
1       13 h  31.548372     13.0
0    1 month  43.391588    720.0
2  14 months  41.956652  10080.0
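The same conversion can be sketched without evaluating strings, by splitting each entry into a number and a unit. This is only one possible variant, assuming the only units are "h", "month" and "months" as in the question and that a month counts as 30 days; the name HOURS_PER_UNIT is made up for this example.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Time": ["1 month", "13 h", "14 months", "3.5 h"],
    "data1": [43.391588, 31.548372, 41.956652, 31.847388],
})

# Hours per unit; treating a month as 30 days is an assumption.
HOURS_PER_UNIT = {"h": 1, "month": 30 * 24, "months": 30 * 24}

# Split e.g. "14 months" into the number ("14") and the unit ("months").
parts = df["Time"].str.extract(r"^\s*(?P<value>[\d.]+)\s*(?P<unit>\w+)\s*$")

# Convert everything to hours and sort on that helper column.
df["sort"] = parts["value"].astype(float) * parts["unit"].map(HOURS_PER_UNIT)
df = df.sort_values("sort")
print(df)
```

A unit that is missing from the mapping produces NaN in the helper column, which sort_values places last by default.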
First, check which data types you have in your dataframe; this will indicate how to proceed.
Use df.dtypes or, in your case, df.index.dtype.
The preferred option for sorting dataframes is df.sort_values().

How to write function to extract n+1 column in pandas

I have an Excel file with 200 columns. The first column is the number of visits, and the other columns hold the number of people for that number of visits.
Visits A B C D
2 10 0 30 40
3 5 6 0 1
4 2 3 1 0
I want to write a function that gives me multiple dataframes: the Visits column with A, the Visits column with B, and so on. (I want a function because the number of columns will increase in the future and I want to automate the process.) I also want to remove the rows where the value is 0.
Desired output:
dataframe 1:
visits A
dataframe 2:
Visits B
3 6
4 3
This is my first question, so sorry if it is not properly framed. Thank you for your help.
Use DataFrame.items:
for i, col in df.set_index('Visits').items():
    print(col[col.ne(0)].to_frame(i).reset_index())
You can also create a dict that stores each frame under its column name:
dfs = {i: col[col.ne(0)].to_frame(i).reset_index() for i, col in df.set_index('Visits').items()}
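Since the question asks for a function, the dict approach could be wrapped roughly as follows. This is a sketch that assumes the key column is called Visits, as in the sample table; the function name split_columns is made up for illustration.

```python
import pandas as pd


def split_columns(df, key="Visits"):
    """Return {column name: frame with the key column and that column}.

    Rows where the column value is 0 are dropped from each frame.
    """
    indexed = df.set_index(key)
    return {
        name: col[col.ne(0)].to_frame(name).reset_index()
        for name, col in indexed.items()
    }


# Sample data copied from the question.
df = pd.DataFrame({
    "Visits": [2, 3, 4],
    "A": [10, 5, 2],
    "B": [0, 6, 3],
    "C": [30, 0, 1],
    "D": [40, 1, 0],
})

dfs = split_columns(df)
print(dfs["B"])
```

Because the function iterates over whatever columns are present, new columns added to the file later are picked up without changes.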

How to find values in one column in another column with multiple values

I have an Excel sheet like this:
A B START DATE END DATE
1 10 01-jan-2016 02-jan-2016
2 11 01- jan-2051 02-feb-2061
3 1 04-mar-2016 07-mar-2016
4 1 08-mar-2016 10-mar-2016
5 5 01-mar-2016 03-dec-2016
6 5 03-nov-2016 31-dec-4712
I am new to Excel. I want to highlight or extract the rows where a value from column A also appears in column B, along with the start date and end date.
That is, the result should look like this:
A start_date end_date
1 04-mar-2016 07-mar-2016
1 08-mar-2016 10-mar-2016
5 01-mar-2016 03-dec-2016
5 03-nov-2016 31-dec-4712
Can anyone please suggest something?
In E2 enter:
=IF(COUNTIF(A:A,B2)>0,"X","")
and copy down. Then filter the table on column E to show only the rows marked X.
You can hide any unwanted columns after that.
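If the sheet is loaded into pandas instead, the same check can be sketched with isin. This is an illustration only, not part of the Excel answer: the frame below is typed in from the question, with the dates kept as plain strings exactly as shown there.

```python
import pandas as pd

# Data copied from the question; dates are left as strings.
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6],
    "B": [10, 11, 1, 1, 5, 5],
    "START DATE": ["01-jan-2016", "01- jan-2051", "04-mar-2016",
                   "08-mar-2016", "01-mar-2016", "03-nov-2016"],
    "END DATE": ["02-jan-2016", "02-feb-2061", "07-mar-2016",
                 "10-mar-2016", "03-dec-2016", "31-dec-4712"],
})

# Keep the rows whose B value occurs anywhere in column A,
# which mirrors COUNTIF(A:A, B2) > 0.
matched = df[df["B"].isin(df["A"])]
print(matched[["B", "START DATE", "END DATE"]])
```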

How to get the latest date with same ID in Excel

I want to get the record with the most recent date for each ID, as the same ID has several different dates. I need to pick the bold values. Below is sample data; the original data consists of 10000 records.
ID Date
5 25/02/2014
5 7/02/2014
5 6/12/2013
5 25/11/2013
5 4/11/2013
3 5/05/2013
3 19/02/2013
3 12/11/2012
1 7/03/2013
2 24/09/2012
2 7/09/2012
4 6/12/2013
4 19/04/2013
4 31/03/2013
4 26/08/2012
Assuming each ID and its date sit together as text in column A, what I would do is use this formula in column B and fill down:
=LEFT(A1,1)
in column C
=DATEVALUE(MID(A1,2,99))
then filter column B to a specific value of interest and sort by column C to order these values by date.
Edit: Even easier, do a two-level sort by B and then by C, newest to oldest. The first row for each value of B is then the newest.
Do you need a programmatic / formula only solution or can you use a workflow? If a workflow will work, then how about this:
Construct a pivot table of your data
Make the Rows Labels the ID
Make the Values Max of Date
The resulting table is your answer.
Row Labels Max of Date
1 07/03/13
2 24/09/12
3 05/05/13
4 06/12/13
5 25/02/14
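For comparison, the "Max of Date" pivot corresponds to a groupby in pandas. The sketch below assumes the dates are day-first (dd/mm/yyyy), which is consistent with values such as 25/02/2014 in the sample; the data is typed in from the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [5, 5, 5, 5, 5, 3, 3, 3, 1, 2, 2, 4, 4, 4, 4],
    "Date": [
        "25/02/2014", "7/02/2014", "6/12/2013", "25/11/2013", "4/11/2013",
        "5/05/2013", "19/02/2013", "12/11/2012",
        "7/03/2013",
        "24/09/2012", "7/09/2012",
        "6/12/2013", "19/04/2013", "31/03/2013", "26/08/2012",
    ],
})

# Parse as day/month/year (an assumption about the source format).
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# One row per ID holding its most recent date.
latest = df.groupby("ID", as_index=False)["Date"].max()
print(latest)
```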
