Rearrange Dataframe with Datetime-Index to multiple columns - python-3.x

I have a pandas Dataframe with a Datetime-Index and just one column with a measured value at that time:
Index                  Value
2017-01-01 05:00:00    2.8
2017-01-01 05:15:00    3.2
I have data for several years now, one value every 15 minutes. I want to reorganize the df to this (I'm preparing the data to train a Neural Network, each line will be one input):
Index        0 days 05:00:00   0 days 05:15:00   ...   1 days 04:45:00
2017-01-01   2.8               3.2               ...   1.9
2017-01-02   ...
The fastest, most "pythonic" way I could find was this (with df being the original data and df_result the empty target df):
def order_data_by_days(row, df):
    # row.name is the date, each column is a time offset into that day
    for col in row.index:
        row[col] = df.loc[row.name + col].values[0]
    return row

# prepare empty target df
df_result = pd.DataFrame(index=days_array, columns=times_array)
# fill it row by row
df_result = df_result.apply(order_data_by_days, df=df, axis=1)
But this takes >20 seconds for 3.5 years of data! (~120k datapoints). Does anyone have any idea how I could do this a lot faster (I'm aiming at a couple of seconds).
If not, I would try to do the transformation in some other language before the import.

I found a solution, if anyone else has this issue:
Step 1: create target df_result with index (dates, e.g. 2018-01-01, 2018-01-02, ...) as datetime.date and columns (times, e.g. 0 days 05:00:00, 0 days 05:15:00, ..., 1 days 04:45:00) as timedelta.
Step 2: use a for-loop to go through all times. Filter the original df each time using the between_time function, and write the filtered df into the target df_result:
for j in range(len(times_array)):
    # convert the timedelta to a time-of-day string for between_time (own helper)
    this_time = get_str_from_timedelta(times_array[j], log)
    df_this_time = df.between_time(this_time, this_time)
    if df_result.empty:
        df_result = pd.DataFrame(index=df_this_time.index.date, columns=times_array)
    df_this_time.index = df_this_time.index.date
    # times past midnight belong to the previous "day" (a day runs 05:00-04:45)
    if times_array[j] >= timedelta(days=1):
        df_this_time.index = df_this_time.index - timedelta(days=1)
    df_result[times_array[j]] = df_this_time[pv]
Note that in my case I check whether a time actually belongs to the next day (timedelta(days=1)), since my "day" starts at 05:00 a.m. and lasts until 04:45 a.m. the following morning. The last if makes sure those values end up in the same row of df_result (even though, technically, the date index is then off by one day).
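For reference, a vectorized sketch of the same reshaping without a Python-level loop (assumptions: the original df has a single value column and a regular 15-minute DatetimeIndex, and a "day" runs from 05:00 to 04:45 the next morning):
import pandas as pd

# Shift the index back by 5 hours so each 05:00-04:45 "day" falls on one calendar date.
shifted = df.index - pd.Timedelta(hours=5)

tmp = pd.DataFrame({
    "date": shifted.normalize().date,          # which shifted day the sample belongs to
    "offset": df.index - shifted.normalize(),  # timedelta since the 05:00 day start
    "value": df.iloc[:, 0].to_numpy(),
})
# One row per date, one column per 15-minute offset.
df_result = tmp.pivot(index="date", columns="offset", values="value")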

Related

How to call a created function with pandas apply to all rows (axis=1) but only to some specific rows of a dataframe?

I have a function which sends automated messages to clients, and takes as input all the columns from a dataframe like the one below.
name     phone     status   date
name_1   phone_1   sending  today
name_2   phone_2   sending  yesterday
I iterate through the dataframe with a pandas apply (axis=1) and use the values in the columns of each row as inputs to my function. At the end, after sending, it changes the status to "sent". The thing is, I only want to send to the clients whose date reference is "today". Now, with pandas.apply(axis=1) this is perfectly doable, but in order to slice the clients with the "today" value, I need to:
create a new dataframe with today's value,
remove it from the original, and then
reappend it to the original.
I thought about running through the whole dataframe and ignoring the rows whose dates differ from "today", but if my dataframe keeps growing, I'm afraid the whole process will become slower.
I saw examples of this being done with mask, although usually people only use 1 column, and I need more than just the one. Is there any way to do this with pandas apply?
Thank you.
I think you can use .loc to filter the data and apply func to it.
In [13]: df = pd.DataFrame(np.random.rand(5,5))
In [14]: df
Out[14]:
0 1 2 3 4
0 0.085870 0.013683 0.221890 0.533393 0.622122
1 0.191646 0.331533 0.259235 0.847078 0.649680
2 0.334781 0.521263 0.402030 0.973504 0.903314
3 0.189793 0.251130 0.983956 0.536816 0.703726
4 0.902107 0.226398 0.596697 0.489761 0.535270
If we want to double the values of the rows where the value in the first column is > 0.3, first select those rows:
In [16]: df.loc[df[0] > 0.3]
Out[16]:
0 1 2 3 4
2 0.334781 0.521263 0.402030 0.973504 0.903314
4 0.902107 0.226398 0.596697 0.489761 0.535270
In [18]: df.loc[df[0] > 0.3] = df.loc[df[0] > 0.3].apply(lambda x: x*2, axis=1)
In [19]: df
Out[19]:
0 1 2 3 4
0 0.085870 0.013683 0.221890 0.533393 0.622122
1 0.191646 0.331533 0.259235 0.847078 0.649680
2 0.669563 1.042527 0.804061 1.947008 1.806628
3 0.189793 0.251130 0.983956 0.536816 0.703726
4 1.804213 0.452797 1.193394 0.979522 1.070540
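Applied to the question's dataframe, a minimal sketch might look like this (send_message is a hypothetical stand-in for the user's own sending function):
def send_and_mark(row):
    send_message(row['name'], row['phone'])  # hypothetical sender, not a real API
    return 'sent'

mask = df['date'] == 'today'
# apply only to the filtered rows and write the new status back in place
df.loc[mask, 'status'] = df.loc[mask].apply(send_and_mark, axis=1)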

Convert number into hours and minutes while reading CSV in Pandas

I have CSV file where the second column indicates a time point with the format HHMMSS.
ID;TIME
A;110500
B;090000
C;130200
This raises a few questions for me.
Does pandas have a data format to represent a time point with hours, minutes and seconds but without the day, month, ...?
How can I convert those fields to such a format?
In plain Python I would iterate over the fields, but I am sure pandas has a more efficient way.
If there is no time-of-day format without a date, I could add a day-month-year date to that time point.
Here is an MWE:
import pandas
import io
csv = io.StringIO('ID;TIME\nA;110500\nB;090000\nC;130200')
df = pandas.read_csv(csv, sep=';')
print(df)
Results in
ID TIME
0 A 110500
1 B 90000
2 C 130200
But what I want to see is
ID TIME
0 A 11:05:00
1 B 9:00:00
2 C 13:02:00
Or, even better, also cutting off the seconds:
ID TIME
0 A 11:05
1 B 9:00
2 C 13:02
You could use the date_parser parameter of read_csv together with the .time accessor:
df = pandas.read_csv(csv, sep=';',
                     parse_dates=[1],  # need to know the position of the TIME column
                     date_parser=lambda x: pandas.to_datetime(x, format='%H%M%S').time)
print(df)
ID TIME
0 A 11:05:00
1 B 09:00:00
2 C 13:02:00
But doing it after reading might be as good
df = (pandas.read_csv(csv, sep=';')
        .assign(TIME=lambda x: pandas.to_datetime(x['TIME'], format='%H%M%S').dt.time)
        # or lambda x: pandas.to_datetime(x['TIME'], format='%H%M%S').dt.strftime('%#H:%M')
     )
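For the HH:MM variant without seconds, one small sketch (assuming plain strings are acceptable for the result; %H keeps the leading zero, while %-H on POSIX or %#H on Windows would drop it):
csv = io.StringIO('ID;TIME\nA;110500\nB;090000\nC;130200')
df = pandas.read_csv(csv, sep=';', dtype={'TIME': str})  # keep TIME as text so 090000 keeps its leading zero
df['TIME'] = pandas.to_datetime(df['TIME'], format='%H%M%S').dt.strftime('%H:%M')
print(df)
#   ID   TIME
# 0  A  11:05
# 1  B  09:00
# 2  C  13:02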

Iterate over pandas dataframe while updating values

I've looked through a bunch of similar questions, but I cannot figure out how to actually apply the principles to my own case. I'm therefore trying to figure out a simple example I can work from - basically I need the idiots' guide before I can look at more complex examples.
Consider a dataframe that contains a list of names and times, and a known start time. I then want to update the dataframe with the finish time, which is calculated from starttime + Time
import pandas as pd
import datetime
df = pd.DataFrame({"Name": ["Kate","Sarah","Isabell","Connie","Elsa","Anne","Lin"],
"Time":[3, 6,1, 7, 23,3,4]})
starttime = datetime.datetime.strptime('2020-02-04 00:00:00', '%Y-%m-%d %H:%M:%S')
I know that for each case I can calculate the finish time using
finishtime = starttime + datetime.timedelta(minutes=df.iloc[0, 1])
what I can't figure out is how to use this while iterating over the df rows and updating a third column in the dataframe with the output.
I tried
df["FinishTime"] = np.nan
for row in df.itertuples():
df.at[row,"FinishTime"] = starttine + datetime.datetime.timedelta(minutes = row.Time)
but it gave a lot of errors I couldn't unravel. How am I meant to do this?
I am aware that the advice to iterating over a dataframe is don't - I'm not committed to iterating, I just need some way to calculate that final column and add it to the dataframe. My real data is about 200k lines.
Use pd.to_timedelta()
import datetime
starttime = datetime.datetime.strptime('2020-02-04 00:00:00', '%Y-%m-%d %H:%M:%S')
df = pd.DataFrame({"Name": ["Kate","Sarah","Isabell","Connie","Elsa","Anne","Lin"],
"Time":[3, 6,1, 7, 23,3,4]})
df.Time = pd.to_timedelta(df.Time, unit='m')
# df = df.assign(FinishTime = df.Time + starttime)
df['FinishTime'] = df.Time + starttime # as pointed out by Trenton McKinney, .assign() is only one way to create new columns
# creating with df['new_col'] has the benefit of not having to copy the full df
print(df)
Output
Name Time FinishTime
0 Kate 00:03:00 2020-02-04 00:03:00
1 Sarah 00:06:00 2020-02-04 00:06:00
2 Isabell 00:01:00 2020-02-04 00:01:00
3 Connie 00:07:00 2020-02-04 00:07:00
4 Elsa 00:23:00 2020-02-04 00:23:00
5 Anne 00:03:00 2020-02-04 00:03:00
6 Lin 00:04:00 2020-02-04 00:04:00
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_timedelta.html
Avoid looping in pandas at all cost
Maybe not at all cost, but pandas takes advantage of C implementations to improve performance by several orders of magnitude. There are many (many) common functions already implemented for our convenience.
Here is a great stackoverflow conversation about this very topic.
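For completeness, a minimal no-loop sketch that leaves the original Time column in minutes (assuming the question's df and starttime):
# Add the minute offset on the fly instead of converting the Time column in place.
df["FinishTime"] = starttime + pd.to_timedelta(df["Time"], unit="m")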

Add new rows to dataframe using existing rows from previous year

I'm creating a Pandas dataframe from an existing file and it ends up essentially like this.
import pandas as pd
import datetime
data = [[i, i+1] for i in range(14)]
index = pd.date_range(start=datetime.date(2019,1,1), end=datetime.date(2020,2,1), freq='MS')
columns = ['col1', 'col2']
df = pd.DataFrame(data, index, columns)
Notice that this doesn't go all the way up to the present -- often the file I'm pulling from is a month or two behind. What I then need to do is add on any missing months and fill them with the same value as the previous year.
So in this case I need to add another row that is
2020-03-01 2 3
It could be anywhere from 0-2 rows that need to be added to the end of the dataframe at a given point in time. What's the best way to do this?
Note: The data here is not real so please don't take advantage of the simple pattern of entries I gave above. It was just a quick way to fill two columns of a table as an example.
If I understand your problem, the following should help. It does assume that you always have data from 12 months earlier, however. You can define a new DataFrame which includes the months up to the most recent date.
# First create the new index. Get the most recent date and add an offset.
start, end = df.index[-1] + pd.DateOffset(), pd.Timestamp.now()
index_new = pd.date_range(start, end, freq='MS')
Create your DataFrame
# Get the data from the previous year.
data = df.loc[index_new - pd.DateOffset(years=1)].values
df_new = pd.DataFrame(data, index = index_new, columns=df.columns)
which looks like
col1 col2
2020-03-01 2 3
then just use:
pd.concat([df, df_new], axis=0)
Which gives
col1 col2
2019-01-01 0 1
2019-02-01 1 2
2019-03-01 2 3
... ... ...
2020-02-01 13 14
2020-03-01 2 3
Note
This also works for cases where the number of months missing is greater than 1.
Edit
Slightly different variation
# Create series with missing months added.
# Get the corresponding data 12 months prior.
s = pd.date_range(df.index[0], pd.Timestamp.now(), freq='MS')
fill = df.loc[s[~s.isin(df.index)] - pd.DateOffset(years=1)]
# Reindex the original dataframe
df = df.reindex(s)
# Find the dates to fill and replace with lagged data
df.iloc[-1 * fill.shape[0]:] = fill.values
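A small sketch wrapping that variation into a reusable helper (assumptions: the index is a month-start DatetimeIndex, data from 12 months earlier always exists, and the function name is made up):
import pandas as pd

def extend_with_last_year(df, upto=None):
    # Extend a month-start indexed frame up to `upto`, filling new months
    # with the values from the same month one year earlier.
    if upto is None:
        upto = pd.Timestamp.now()
    full = pd.date_range(df.index[0], upto, freq='MS')
    missing = full[~full.isin(df.index)]
    out = df.reindex(full)
    out.loc[missing] = df.loc[missing - pd.DateOffset(years=1)].values
    return out

df = extend_with_last_year(df)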

How to join two dataframes for which column time values are within a certain range and are not datetime or timestamp objects?

I have two dataframes as shown below:
    time  browncarbon  blackcarbon
181.7335     0.105270          NaN
181.3809     0.166545     0.001217
181.6197     0.071581          NaN
422 rows x 3 columns

   start       end    toc
179.9989  180.0002  155.0
180.0002  180.0016  152.0
180.0016  180.0030  151.0
1364 rows x 3 columns
The first dataframe has a time column with an instant every four minutes. The second dataframe has two time columns (start and end) spaced every two minutes. The time columns do not start and end at the same instants, but they cover the same day. How could I make another dataframe containing:
time browncarbon blackcarbon toc
422 rows X 4 columns
There is a related answer on Stack Overflow, however, that is applicable only when the time columns are datetime or timestamp objects. The link is: How to join two dataframes for which column values are within a certain range?
Addendum 1: The multiple start/end rows that fall within one of the time rows should still correspond to a single toc value, but that value should be the average of the matching toc rows, which is not the case presently.
Addendum 2: Merging two pandas dataframes with complex conditions
We create an artificial key column to do an outer merge and get the cartesian product (all combinations of rows). Then we keep only the rows where time falls inside the range, using .query.
note: I edited the value of one row so we can get a match (see row 0 in example dataframes on the bottom)
df1.assign(key=1).merge(df2.assign(key=1), on='key', how='outer')\
   .query('(time >= start) & (time <= end)')\
   .drop(['key', 'start', 'end'], axis=1)
output
time browncarbon blackcarbon toc
1 180.0008 0.10527 NaN 152.0
Example dataframes used:
df1:
time browncarbon blackcarbon
0 180.0008 0.105270 NaN
1 181.3809 0.166545 0.001217
2 181.6197 0.071581 NaN
df2:
start end toc
0 179.9989 180.0002 155.0
1 180.0002 180.0016 152.0
2 180.0016 180.0030 151.0
Since the start and end intervals are mutually exclusive, we may be able to add new columns to df2 containing all the integer values between floor(start) and floor(end). Then add a floor(time) column to df1 and take a left outer join of df1 and df2. That should do it, except that you may have to remove NaN values and extra columns if required. If you send me the csv files, I may be able to send you the script. I hope I answered your question.
Perhaps you could just convert your columns to Timestamps and then use the answer in the other question you linked
from pandas import Timestamp
from dateutil.relativedelta import relativedelta as rd

def to_timestamp(x):
    return Timestamp(2000, 1, 1) + rd(days=x)

df['start_time'] = df.start.apply(to_timestamp)
df['end_time'] = df.end.apply(to_timestamp)
Your 2nd data frame is too short, so it wouldn't reflect a meaningful merge. So I modified it a little:
df2 = pd.DataFrame({'start': [179.9989, 180.0002, 180.0016, 181.3, 181.5, 181.7],
                    'end': [180.0002, 180.0016, 180.003, 181.5, 185.7, 181.8],
                    'toc': [155.0, 152.0, 151.0, 150.0, 149.0, 148.0]})
df1['Rank'] = np.arange(len(df1))
new_df = pd.merge_asof(df1.sort_values('time'), df2,
                       left_on='time',
                       right_on='start')
gives you:
time browncarbon blackcarbon Rank start end toc
0 181.3809 0.166545 0.001217 1 181.3 181.5 150.0
1 181.6197 0.071581 NaN 2 181.5 185.7 149.0
2 181.7335 0.105270 NaN 0 181.7 181.8 148.0
from which you can drop the extra columns and sort_values on Rank. For example:
new_df.sort_values('Rank').drop(['Rank','start','end'], axis=1)
gives:
time browncarbon blackcarbon toc
2 181.7335 0.105270 NaN 148.0
0 181.3809 0.166545 0.001217 150.0
1 181.6197 0.071581 NaN 149.0
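For Addendum 1 (averaging the toc values of every interval that falls inside a given time window), a hedged sketch, assuming each time marks the start of a four-minute window and each two-minute interval belongs to the window containing its midpoint:
import pandas as pd

df1s = df1.sort_values('time')
df2s = df2.assign(mid=(df2['start'] + df2['end']) / 2).sort_values('mid')

# assign every interval midpoint to the last time that is <= the midpoint
matched = pd.merge_asof(df2s, df1s[['time']], left_on='mid', right_on='time')
toc_mean = matched.groupby('time', as_index=False)['toc'].mean()

# attach the averaged toc to the original 4-minute rows
result = df1s.merge(toc_mean, on='time', how='left')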
