I have a dataframe that looks like this,
Date/Time Volt Current
2011-01-01 11:30:00 NaN NaN
2011-01-01 11:35:00 NaN NaN
2011-01-01 11:40:00 NaN NaN
...
2011-01-01 12:30:00 NaN NaN
2011-01-02 11:30:00 45 23
2011-01-02 11:35:00 31 34
2011-01-02 11:40:00 23 15
...
2011-01-02 12:30:00 13 1
2011-01-03 11:30:00 41 51
...
2011-01-03 12:25:00 14 5
2011-01-03 12:30:00 54 45
...
2011-01-04 11:30:00 45 -
2011-01-04 11:35:00 41 -
2011-01-04 11:40:00 - 4
...
2011-01-04 12:30:00 - 14
The dataframe has timestamps between 11:30:00 and 12:30:00 at 5-minute intervals. I am trying to figure out how to find the minimum value in the "Current" column for each day and copy that entire row. My expected output should be something like this,
Date/Time Volt Current
2011-01-01 NaN NaN
2011-01-02 12:30:00 13 1
2011-01-03 12:25:00 14 5
2011-01-04 11:40:00 NaN 4
For days with values in Current, the entire row holding the minimum value is copied.
For days where Current is all "NaN", a row is still copied with NaN.
Do note that some entries in Volt/Current are empty or contain a dash.
Is this possible?
Thank you.
Please try converting the dashes and NaN strings to numeric NaN first, then taking the row at each day's minimum Current (falling back to the first row for days where Current is all NaN). Note idxmin returns the row label, which works with .loc, whereas argmin returns a position:
df['Current'] = pd.to_numeric(df['Current'], errors='coerce')
df['Volt'] = pd.to_numeric(df['Volt'], errors='coerce')
df.groupby(df['Date/Time'].dt.date, group_keys=False).apply(lambda g: g.loc[[g['Current'].idxmin()]] if g['Current'].notna().any() else g.iloc[[0]])
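A runnable sketch of the idea on a few made-up rows (using pd.to_numeric and idxmin so the row labels line up; the fallback keeps one all-NaN row per all-NaN day, matching the expected output):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date/Time': pd.to_datetime([
        '2011-01-01 11:30:00', '2011-01-01 11:35:00',
        '2011-01-02 11:30:00', '2011-01-02 12:30:00',
        '2011-01-04 11:30:00', '2011-01-04 11:40:00']),
    'Volt':    [np.nan, np.nan, 45, 13, 45, '-'],
    'Current': [np.nan, np.nan, 23, 1, '-', 4],
})

# dashes and NaN both become numeric NaN
df['Volt'] = pd.to_numeric(df['Volt'], errors='coerce')
df['Current'] = pd.to_numeric(df['Current'], errors='coerce')

# per day: the row at Current's idxmin, or the first row if the day is all NaN
out = (df.groupby(df['Date/Time'].dt.date, group_keys=False)
         .apply(lambda g: g.loc[[g['Current'].idxmin()]]
                if g['Current'].notna().any() else g.iloc[[0]]))
print(out)
```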
How can I split the datetime values into year and month, creating one column per year (Year_2017, Year_2018, and so on) whose values are the month of the respective date?
Example data:
call time area age
2017-12-12 19:38:00 Rural 28
2018-01-12 22:05:00 Rural 50
2018-02-12 22:33:00 Rural 76
2019-01-12 22:37:00 Urban 45
2020-02-13 00:26:00 Urban 52
Required Output:
call time area age Year_2017 Year_2018
2017-12-12 19:38:00 Rural 28 jan jan
2018-01-12 22:05:00 Rural 50 Feb Feb
2018-02-12 22:33:00 Rural 76 mar mar
2019-01-12 22:37:00 Urban 45 Apr Apr
2020-02-13 00:26:00 Urban 52 may may
I think you need to generate the years and months from the call time datetimes, so the output is different from your required one:
Explanation - first generate a column of months by DataFrame.assign and Series.dt.strftime, then convert the years to an index level with append=True for a MultiIndex, so it is possible to reshape by Series.unstack; last, join back to the original:
df1 = (df.assign(m=df['call time'].dt.strftime('%b'))
         .set_index(df['call time'].dt.year, append=True)['m']
         .unstack()
         .add_prefix('Year_'))
print (df1)
call time Year_2017 Year_2018 Year_2019 Year_2020
0 Dec NaN NaN NaN
1 NaN Jan NaN NaN
2 NaN Feb NaN NaN
3 NaN NaN Jan NaN
4 NaN NaN NaN Feb
df = df.join(df1)
print (df)
call time area age Year_2017 Year_2018 Year_2019 Year_2020
0 2017-12-12 19:38:00 Rural 28 Dec NaN NaN NaN
1 2018-01-12 22:05:00 Rural 50 NaN Jan NaN NaN
2 2018-02-12 22:33:00 Rural 76 NaN Feb NaN NaN
3 2019-01-12 22:37:00 Urban 45 NaN NaN Jan NaN
4 2020-02-13 00:26:00 Urban 52 NaN NaN NaN Feb
I have a situation where month and day are swapped for a few dates in my dataframe. For example, here is the input:
df['work_date'].head(15)
0 2018-01-01
1 2018-02-01
2 2018-03-01
3 2018-04-01
4 2018-05-01
5 2018-06-01
6 2018-07-01
7 2018-08-01
8 2018-09-01
9 2018-10-01
10 2018-11-01
11 2018-12-01
12 2018-01-13
13 2018-01-14
14 2018-01-15
The dates are stored as strings. As you can see, the format is yyyy-dd-mm until the 12th of the month and then becomes yyyy-mm-dd. The dataframe contains 3 years' worth of data and this pattern repeats for all months of all years.
My expected output is to standardize the dates to the format yyyy-mm-dd, like below.
0 2018-01-01
1 2018-01-02
2 2018-01-03
3 2018-01-04
4 2018-01-05
5 2018-01-06
6 2018-01-07
7 2018-01-08
8 2018-01-09
9 2018-01-10
10 2018-01-11
11 2018-01-12
12 2018-01-13
13 2018-01-14
14 2018-01-15
Below is the code that I wrote, and it gets the job done. Basically, I split the date string and do some string manipulation. However, as you can see, it's not too pretty. I am checking to see if there is some more elegant solution than df.apply and loops.
def func(x):
    d = x.split('-')
    if int(d[1]) <= 12 and int(d[2]) <= 12:
        d = [d[0], d[2], d[1]]
        return '-'.join(d)
    return x

df['work_date'] = df['work_date'].apply(func)
You could just overwrite the column, relying on the fact that it is in order, each date appears once, and all days are included consecutively:
df['Date'] = pd.date_range(df['work_date'].min(), '2018-01-15', freq='1D')
# you can pass df['work_date'].min() OR df['work_date'].max() OR a string for either endpoint; it really depends on what format your minimum and maximum are in
df
Out[1]:
work_date Date
0 2018-01-01 2018-01-01
1 2018-02-01 2018-01-02
2 2018-03-01 2018-01-03
3 2018-04-01 2018-01-04
4 2018-05-01 2018-01-05
5 2018-06-01 2018-01-06
6 2018-07-01 2018-01-07
7 2018-08-01 2018-01-08
8 2018-09-01 2018-01-09
9 2018-10-01 2018-01-10
10 2018-11-01 2018-01-11
11 2018-12-01 2018-01-12
12 2018-01-13 2018-01-13
13 2018-01-14 2018-01-14
14 2018-01-15 2018-01-15
To make this more dynamic, you could also wrap it in try / except, swapping day and month in an endpoint string when the first attempt fails (note the second fallback must be nested; two sibling except ValueError clauses would leave the second one unreachable):
minn = df['work_date'].min()
maxx = df['work_date'].max()
try:
    df['Date'] = pd.date_range(minn, maxx, freq='1D')
except ValueError:
    try:
        s = maxx.split('-')
        df['Date'] = pd.date_range(minn, f'{s[0]}-{s[2]}-{s[1]}', freq='1D')
    except ValueError:
        s = minn.split('-')
        df['Date'] = pd.date_range(f'{s[0]}-{s[2]}-{s[1]}', maxx, freq='1D')
df
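An alternative sketch that does not rely on the dates being consecutive: parse each string against both formats with pd.to_datetime and prefer the swapped (yyyy-dd-mm) reading when it is valid, mirroring the swap rule in the question's func (this assumes only those two formats occur in the column):

```python
import pandas as pd

s = pd.Series(['2018-01-01', '2018-02-01', '2018-12-01', '2018-01-13'])

# read as yyyy-dd-mm; rows that cannot be that format (month > 12) become NaT
swapped = pd.to_datetime(s, format='%Y-%d-%m', errors='coerce')
# read as yyyy-mm-dd to fill those gaps
normal = pd.to_datetime(s, format='%Y-%m-%d', errors='coerce')

fixed = swapped.fillna(normal)
print(fixed.dt.strftime('%Y-%m-%d').tolist())
# → ['2018-01-01', '2018-01-02', '2018-01-12', '2018-01-13']
```

Ambiguous strings (both fields <= 12) are resolved in favour of the swapped reading, which is exactly what the question's condition does.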
I have df as shown below.
Date t_factor
2020-02-01 5
2020-02-03 -23
2020-02-06 14
2020-02-09 23
2020-02-10 -2
2020-02-11 23
2020-02-13 NaN
2020-02-20 29
From the above, I would like to replace the negative values in column t_factor with NaN.
Expected output:
Date t_factor
2020-02-01 5
2020-02-03 NaN
2020-02-06 14
2020-02-09 23
2020-02-10 NaN
2020-02-11 23
2020-02-13 NaN
2020-02-20 29
You can use pandas' clip implementation as well. It assigns values outside a boundary to the boundary value, so every negative integer here becomes -1; then chain this with a replace, as below:
df['t_factor'] = df['t_factor'].clip(-1).replace(-1, np.nan)
df
Output:
Date t_factor
0 2020-02-01 5.0
1 2020-02-03 NaN
2 2020-02-06 14.0
3 2020-02-09 23.0
4 2020-02-10 NaN
5 2020-02-11 23.0
6 2020-02-13 NaN
7 2020-02-20 29.0
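A self-contained version of the clip-then-replace idea on a tiny made-up series (note it assumes there is no legitimate value of exactly -1, and that no value falls strictly between -1 and 0, since those would survive the clip):

```python
import numpy as np
import pandas as pd

s = pd.Series([5.0, -23.0, 14.0, 23.0, -2.0, np.nan, 29.0])

# clip pulls every value below -1 up to -1, then replace maps -1 to NaN
out = s.clip(lower=-1).replace(-1, np.nan)
print(out.tolist())
```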
Use Series.mask:
df['t_factor'] = df['t_factor'].mask(df['t_factor'].lt(0))
Or use boolean indexing and assign np.nan:
df.loc[df['t_factor'].lt(0), 't_factor'] = np.nan
Result:
Date t_factor
0 2020-02-01 5.0
1 2020-02-03 NaN
2 2020-02-06 14.0
3 2020-02-09 23.0
4 2020-02-10 NaN
5 2020-02-11 23.0
6 2020-02-13 NaN
7 2020-02-20 29.0
Use pd.Series.where - by default it will replace values where the condition is False with NaN.
df["t_factor"] = df.t_factor.where(df.t_factor > 0)
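A quick self-contained check, on a tiny made-up series, that the mask, boolean-indexing, and where variants above all produce the same result (where with `> 0` would also drop exact zeros, which this data does not contain):

```python
import numpy as np
import pandas as pd

s = pd.Series([5, -23, 14, 23, -2, 23, np.nan, 29])

via_mask = s.mask(s.lt(0))          # hide values where condition is True

via_loc = s.copy()
via_loc.loc[via_loc.lt(0)] = np.nan  # assign NaN in place

via_where = s.where(s > 0)          # keep values where condition is True

print(via_mask.tolist())
```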
I have a dataset with datetimes in a CSV file, and I want to get the time difference row by row.
So I wrote code to get the time difference in minutes. Then I want to convert that difference into hours.
That means:
if the time difference is 30 minutes, in hours it is 0.5 h
if it is 120 min, then 2 h
But when I tried it, the result doesn't match my required format. I just divided the time difference by 60.
my code:
df1['time_diff'] = pd.to_datetime(df1["time"])
print(df1['time_diff'])
0 2019-08-09 06:15:00
1 2019-08-09 06:45:00
2 2019-08-09 07:45:00
3 2019-08-09 09:00:00
4 2019-08-09 09:25:00
5 2019-08-09 09:30:00
6 2019-08-09 11:00:00
7 2019-08-09 11:30:00
8 2019-08-09 13:30:00
9 2019-08-09 13:50:00
10 2019-08-09 15:00:00
11 2019-08-09 15:25:00
12 2019-08-09 16:25:00
13 2019-08-09 18:00:00
df1['delta'] = (df1['time_diff']-df1['time_diff'].shift()).fillna(0)
df1['t'] = df1['delta'].apply(lambda x: x / np.timedelta64(1,'m')).astype('int64')% (24*60)
then the result shows the differences in minutes.
After dividing by 60:
df1['t'] = df1['delta'].apply(lambda x: x / np.timedelta64(1,'m')).astype('int64') % (24*60) / 60
Comparing the two results, the first shows a 30 min difference, but when I try to convert it into hours it does not show 0.5 and just shows 1.
But it has to convert 30 min to 0.5 hr.
Expected output:
time_diff in min    expected time_diff in hour
0 0
30 0.5
60 1
75 1.25
25 0.4167
5 0.083
90 1.5
30 0.5
120 2
20 0.333
70 1.33
25 0.4167
60 1
95 1.583
Can anyone help me solve this?
I suggest using Series.dt.total_seconds, dividing by 60 for minutes and 3600 for hours:
df1['datetimes'] = pd.to_datetime(df1['date']+ ' ' + df1['time'], dayfirst=True)
df1['delta'] = df1['datetimes'].diff().fillna(pd.Timedelta(0))
td = df1['delta'].dt.total_seconds()
df1['time_diff in min'] = td.div(60).astype(int)
df1['time_diff in hour'] = td.div(3600)
print (df1)
datetimes delta time_diff in min time_diff in hour
0 2019-08-09 06:15:00 00:00:00 0 0.000000
1 2019-08-09 06:45:00 00:30:00 30 0.500000
2 2019-08-09 07:45:00 01:00:00 60 1.000000
3 2019-08-09 09:00:00 01:15:00 75 1.250000
4 2019-08-09 09:25:00 00:25:00 25 0.416667
5 2019-08-09 09:30:00 00:05:00 5 0.083333
6 2019-08-09 11:00:00 01:30:00 90 1.500000
7 2019-08-09 11:30:00 00:30:00 30 0.500000
8 2019-08-09 13:30:00 02:00:00 120 2.000000
9 2019-08-09 13:50:00 00:20:00 20 0.333333
10 2019-08-09 15:00:00 01:10:00 70 1.166667
11 2019-08-09 15:25:00 00:25:00 25 0.416667
12 2019-08-09 16:25:00 01:00:00 60 1.000000
13 2019-08-09 18:00:00 01:35:00 95 1.583333
I just started using pandas. I wanted to import an Excel file with 31 rows and 11 columns, but in the output only some columns are displayed, the middle columns are represented by "....", and in the first column 'EST' the first few elements are displayed with "00:00:00".
Code
import pandas as pd
df = pd.read_excel(r"C:\Users\daryl\PycharmProjects\pandas\Book1.xlsx")  # raw string avoids backslash escapes
print(df)
Output
C:\Users\daryl\AppData\Local\Programs\Python\Python37\python.exe "C:/Users/daryl/PycharmProjects/pandas/1. Introduction.py"
EST Temperature ... Events WindDirDegrees
0 2016-01-01 00:00:00 38 ... NaN 281
1 2016-02-01 00:00:00 36 ... NaN 275
2 2016-03-01 00:00:00 40 ... NaN 277
3 2016-04-01 00:00:00 25 ... NaN 345
4 2016-05-01 00:00:00 20 ... NaN 333
5 2016-06-01 00:00:00 33 ... NaN 259
6 2016-07-01 00:00:00 39 ... NaN 293
7 2016-08-01 00:00:00 39 ... NaN 79
8 2016-09-01 00:00:00 44 ... Rain 76
9 2016-10-01 00:00:00 50 ... Rain 109
10 2016-11-01 00:00:00 33 ... NaN 289
11 2016-12-01 00:00:00 35 ... NaN 235
12 1-13-2016 26 ... NaN 284
13 1-14-2016 30 ... NaN 266
14 1-15-2016 43 ... NaN 101
15 1-16-2016 47 ... Rain 340
16 1-17-2016 36 ... Fog-Snow 345
17 1-18-2016 25 ... Snow 293
18 1/19/2016 22 ... NaN 293
19 1-20-2016 32 ... NaN 302
20 1-21-2016 31 ... NaN 312
21 1-22-2016 26 ... Snow 34
22 1-23-2016 26 ... Fog-Snow 42
23 1-24-2016 28 ... Snow 327
24 1-25-2016 34 ... NaN 286
25 1-26-2016 43 ... NaN 244
26 1-27-2016 41 ... Rain 311
27 1-28-2016 37 ... NaN 234
28 1-29-2016 36 ... NaN 298
29 1-30-2016 34 ... NaN 257
30 1-31-2016 46 ... NaN 241
[31 rows x 11 columns]
Process finished with exit code 0
To answer your question about the display of only a few columns and "..." :
All of the columns have been properly ingested, but your screen / the console is not wide enough to output all of the columns at once in a "print" fashion. This is normal/expected behavior.
Pandas is not a spreadsheet visualization tool like Excel. Maybe someone can suggest a tool for visualizing dataframes in a spreadsheet format for Python, like in Excel. I think I've seen people visualizing spreadsheets in Spyder but I don't use that myself.
If you want to make sure all of the columns are there, try using list(df) or print(list(df)).
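If you do want the full frame printed, one option is to lift pandas' display limits (a standard pandas display setting; sketched here on a small made-up frame rather than your Excel file):

```python
import pandas as pd

df = pd.DataFrame({f'col{i}': range(3) for i in range(11)})

# show every column instead of eliding the middle ones with "..."
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 200)  # characters per printed line

print(df)
print(list(df))  # quick check that all 11 columns are there
```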
To answer your question about the EST format:
It looks like you have some data cleaning to do. This is typical work in data science. I am not sure how to best do this - I haven't worked much with dates/datetime yet. However, here is what I see:
The first few items have timestamps as well, likely formatted in HH:MM:SS
They are formatted YYYY-MM-DD
On index row 18, there are / instead of - in the date
The remaining rows are formatted M-DD-YYYY
There's an option in the read_excel / read_csv documentation that may take care of those automatically. It's called "parse_dates", and it takes a list of column names. If you turn that option on like pd.read_excel('file location', parse_dates=['EST']), that could turn on the date parser for the EST column and maybe solve your problem.
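A minimal sketch of parse_dates using read_csv on an in-memory CSV with made-up data (read_excel accepts the same parse_dates argument; genuinely mixed formats like "1-13-2016" vs "2016-01-01" may still need manual cleaning afterwards):

```python
import io
import pandas as pd

csv = io.StringIO("EST,Temperature\n2016-01-01,38\n2016-01-02,36\n")
df = pd.read_csv(csv, parse_dates=['EST'])

print(df.dtypes)  # EST is now datetime64 instead of object (plain strings)
```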
Hope this helps! This is my first answer; to anyone who sees it, feel free to edit and improve it.