I have a CSV file where the second column indicates a time point in the format HHMMSS.
ID;TIME
A;110500
B;090000
C;130200
This raises a few questions for me.
Does pandas have a data type to represent a time of day (hours, minutes and seconds) without day, month, and year?
How can I convert those fields to such a format?
In plain Python I would iterate over the fields, but I am sure pandas has a more efficient way.
If there is no time-of-day format without a date, I could attach a dummy day-month-year to each time point.
Here is an MWE:
import pandas
import io
csv = io.StringIO('ID;TIME\nA;110500\nB;090000\nC;130200')
df = pandas.read_csv(csv, sep=';')
print(df)
Results in:
  ID    TIME
0  A  110500
1  B   90000
2  C  130200
But what I want to see is:
  ID      TIME
0  A  11:05:00
1  B   9:00:00
2  C  13:02:00
Or, even better, with the seconds cut off:
  ID   TIME
0  A  11:05
1  B   9:00
2  C  13:02
You could use the date_parser parameter of read_csv together with the .time accessor (note that date_parser is deprecated in recent pandas versions):
df = pandas.read_csv(csv, sep=';',
                     parse_dates=[1],  # need to know the position of the TIME column
                     date_parser=lambda x: pandas.to_datetime(x, format='%H%M%S').time)
print(df)
ID TIME
0 A 11:05:00
1 B 09:00:00
2 C 13:02:00
But doing the conversion after reading might be just as good:
df = (pandas.read_csv(csv, sep=';', dtype={'TIME': str})  # read TIME as str to keep leading zeros
        .assign(TIME=lambda x: pandas.to_datetime(x['TIME'], format='%H%M%S').dt.time)
        # or, to drop the seconds:
        # .assign(TIME=lambda x: pandas.to_datetime(x['TIME'], format='%H%M%S').dt.strftime('%#H:%M'))  # '%#H' on Windows, '%-H' on Linux
     )
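On the first question: pandas has no dedicated time-of-day dtype, and .dt.time yields plain datetime.time objects stored as object dtype. If an arithmetic-friendly representation without a date is needed, a timedelta since midnight is one option. A minimal sketch (the dtype={'TIME': str} argument is there to keep the leading zero of 090000):

```python
import io
import pandas as pd

csv = io.StringIO('ID;TIME\nA;110500\nB;090000\nC;130200')
df = pd.read_csv(csv, sep=';', dtype={'TIME': str})

# Parse HHMMSS and subtract the (dummy) date part, leaving a timedelta since midnight.
parsed = pd.to_datetime(df['TIME'], format='%H%M%S')
df['TIME'] = parsed - parsed.dt.normalize()
print(df['TIME'])
```

Timedeltas sort, subtract, and aggregate correctly, which object-dtype time values do not.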
Related
I have a pandas DataFrame with a DatetimeIndex and just one column with a measured value at each time:
Index                Value
2017-01-01 05:00:00  2.8
2017-01-01 05:15:00  3.2
I have data for several years now, one value every 15 minutes. I want to reorganize the df into the following shape (I'm preparing the data to train a neural network; each row will be one input):
Index       0 days 05:00:00  0 days 05:15:00  ...  1 days 04:45:00
2017-01-01  2.8              3.2              ...  1.9
2017-01-02  ...              ...              ...  ...
The fastest, most "pythonic" way I could find was this (with df being the original data, df_result the empty target df):
def order_data_by_days(row, df):
    for col in row.index:
        row[col] = df.loc[row.name + col].values[0]
    return row

# prepare df
df_result = pd.DataFrame(index=days_array, columns=times_array)
# fill df
df_result = df_result.apply(order_data_by_days, df=df, axis=1)
But this takes >20 seconds for 3.5 years of data (~120k data points)! Does anyone have an idea how I could do this a lot faster? I'm aiming at a couple of seconds.
If not, I would try to do the transformation in some other language before the import.
I found a solution, if anyone else has this issue:
Step 1: create target df_result with index (dates, e.g. 2018-01-01, 2018-01-02, ...) as datetime.date and columns (times, e.g. 0 days 05:00:00, 0 days 05:15:00, ..., 1 days 04:45:00) as timedelta.
Step 2: use a for-loop to go through all times. In each iteration, filter the original df using the between_time function and write the filtered df into the target df_result:
for j in range(0, len(times_array)):
    this_time = get_str_from_timedelta(times_array[j], log)
    df_this_time = df.between_time(this_time, this_time)
    if df_result.empty:
        df_result = pd.DataFrame(index=df_this_time.index.date, columns=times_array)
    df_this_time.index = df_this_time.index.date
    if times_array[j] >= timedelta(days=1):
        df_this_time.index = df_this_time.index - timedelta(days=1)
    df_result[times_array[j]] = df_this_time[pv]
Note that in my case I checked whether the times actually fall on the next day (timedelta(days=1)), since my "day" starts at 05:00 a.m. and lasts until 04:45 a.m. the following day. The last if makes sure those values end up in the same row of df_result (even though, technically, the date index is wrong there).
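For reference, this reshape can also be done without any per-row loop by assigning each timestamp a day and a time-of-day offset and pivoting. A sketch on toy data, assuming the same 05:00-to-04:45 day definition (the variable names here are illustrative, not from the original code):

```python
import numpy as np
import pandas as pd

# Toy 15-minute series covering two "days" (real data: one value every 15 min for years).
idx = pd.date_range("2017-01-01 05:00", "2017-01-03 04:45", freq="15min")
df = pd.DataFrame({"value": np.arange(len(idx), dtype=float)}, index=idx)

# A "day" runs 05:00 -> 04:45 the next morning, so shift back 5 hours
# to decide which row each timestamp belongs to.
shifted = df.index - pd.Timedelta(hours=5)
day = shifted.normalize()          # midnight of the day each sample belongs to
offset = df.index - day            # 0 days 05:00:00 ... 1 days 04:45:00

result = (df.assign(day=day.date, offset=offset)
            .pivot(index="day", columns="offset", values="value"))
print(result.shape)                # one row per day, one column per 15-min slot
```

On ~120k points this is a couple of vectorized operations rather than 120k .loc lookups, so it should run in well under a second.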
I have run into a problem transforming a dataframe. I'm trying to widen a table grouped on a datetime column, but I can't seem to make it work. I have tried to transpose it and to pivot it, but I can't really get it into the shape I want.
Example table:
datetime value
2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7
What I want to achieve is:
index date 02:00 03:00
1 2022-04-29 5 6
2 2022-05-29 5 7
The real data has one data point for each hour from 00:00 to 20:00 for each day. So I guess a loop would be the way to go to generate the columns.
Does anyone know a way to solve this, or can nudge me in the right direction?
Thanks in advance!
From the details you have provided, I assume you are dealing with time-series data acquired at 02:00:00 and 03:00:00 on different dates. Please correct me if I am wrong.
First we replicate your DataFrame object.
import datetime as dt
from io import StringIO
import pandas as pd
data_str = """2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7"""
df = pd.read_csv(StringIO(data_str), sep=" ", header=None)
df.columns = ["date", "value"]
Now we find the unique days on which data was acquired:
unique_days = df["date"].apply(lambda x: dt.datetime.strptime(x[:-3], "%Y-%m-%dT%H:%M:%S.%f").date()).unique()
Here I trimmed the last three zeros from the date, because %f can only parse up to six fractional digits. We convert each string to a datetime object and take the unique dates.
Now we create a new empty df in the desired form:
new_df = pd.DataFrame(columns=["date", "02:00", "03:00"])
After this we can populate the values:
for day in unique_days:
    new_row_data = [day]  # a row of 3 elements, to be inserted into the empty df
    new_row_data.append(df.loc[df["date"] == f"{day}T02:00:00.000000000", "value"].values[0])  # value at 02:00 for that date
    new_row_data.append(df.loc[df["date"] == f"{day}T03:00:00.000000000", "value"].values[0])  # value at 03:00 the same day
    new_df.loc[len(new_df)] = new_row_data  # append as the last row
This should give you:
date 02:00 03:00
0 2022-04-29 5 6
1 2022-05-29 5 7
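Alternatively, the same result can be obtained without an explicit loop by splitting the parsed datetime into a date and a time part and pivoting. A sketch, assuming there really is exactly one value per (date, time) pair:

```python
from io import StringIO
import pandas as pd

data_str = """2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7"""

df = pd.read_csv(StringIO(data_str), sep=" ", header=None, names=["datetime", "value"])

# Parse the full ISO timestamp (pandas handles the nanosecond digits directly),
# then pivot: rows = date, columns = time of day.
dt = pd.to_datetime(df["datetime"])
wide = (df.assign(date=dt.dt.date, time=dt.dt.strftime("%H:%M"))
          .pivot(index="date", columns="time", values="value")
          .reset_index())
print(wide)
```

This also generates the 00:00-20:00 columns automatically, so no loop is needed when the real data has more time points.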
I have a question about handling a dataframe in pandas.
I really don't know what to do; could you look at this problem?
[df1]
This is the first dataframe, and I want to get a second dataframe like the one shown.
I computed the DATE(Week) and DATE(Month) index values using the resample method in pandas,
but I don't know how to merge the tables into the second one.
Please take a look at this question. Thank you so much.
What I understand from your question is that you want to map the DATE column to its nearest week end and month end. If that is the case, you need not create two separate DataFrames; there is an easier way to do it using DateOffsets.
# taking a sample from your data
>>> import pandas as pd
>>> from pandas.tseries.offsets import Week
>>> d = {'DATE': ['2019-01-14', '2019-01-16', '2019-02-19'], 'TX_COST': [156800, 157000, 150000]}
>>> df = pd.DataFrame(data=d)
>>> df
DATE TX_COST
0 2019-01-14 156800
1 2019-01-16 157000
2 2019-02-19 150000
# convert the DATE column to datetime format
>>> df['DATE'] = pd.to_datetime(df['DATE'])
# as per your requirement, set weekday=6 (Sunday) as the week-ending day
>>> df['WEEK'] = df['DATE'] + Week(weekday=6)
>>> df
DATE TX_COST WEEK
0 2019-01-14 156800 2019-01-20
1 2019-01-16 157000 2019-01-20
2 2019-02-19 150000 2019-02-24
# use a month-end offset to round the date up to the nearest month end
>>> df['MONTH'] = df['DATE'] + pd.offsets.MonthEnd()
>>> df
DATE TX_COST WEEK MONTH
0 2019-01-14 156800 2019-01-20 2019-01-31
1 2019-01-16 157000 2019-01-20 2019-01-31
2 2019-02-19 150000 2019-02-24 2019-02-28
This creates the DataFrame you require.
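One caveat worth knowing: these offsets roll forward, so a DATE that is already a Sunday or a month end would be pushed to the next week or month. If such dates should stay put, dt.to_period maps each date into its containing week/month instead; a sketch on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({'DATE': pd.to_datetime(['2019-01-14', '2019-01-16', '2019-02-19']),
                   'TX_COST': [156800, 157000, 150000]})

# to_period assigns each date to its containing period, so a date that is
# already a Sunday / month end keeps that same week / month.
df['WEEK'] = df['DATE'].dt.to_period('W-SUN').dt.end_time.dt.normalize()
df['MONTH'] = df['DATE'].dt.to_period('M').dt.end_time.dt.normalize()
print(df)
```

end_time lands on the last nanosecond of the period, so normalize() is used to snap it back to midnight of the closing day.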
I want to use a date/time as an index for a dataframe in Pandas.
However, daylight saving time is not properly addressed in the database, so the date/time values for the day in which daylight saving time ends have 25 hours and are represented as such:
2019102700
2019102701
...
2019102724
I am using the following code to convert those values to a DateTime object that I use as an index to a Pandas dataframe:
df.index = pd.to_datetime(df["date_time"], format="%Y%m%d%H")
However, that gives an error:
ValueError: unconverted data remains: 4
Presumably because the to_datetime function is not expecting the hour to be 24. Similarly, the day in which daylight saving time starts only has 23 hours.
One solution I thought of was storing the dates as strings, but that seems neither elegant nor efficient. Is there any way to solve the issue of handling daylight saving time when using to_datetime?
If you know the timezone, here's a way to calculate UTC timestamps. Parse only the date part, localize to the actual time zone the data "belongs" to, and convert that to UTC. Now you can parse the hour part and add it as a time delta - e.g.
import pandas as pd
df = pd.DataFrame({'date_time_str': ['2019102722','2019102723','2019102724',
'2019102800','2019102801','2019102802']})
df['date_time'] = (pd.to_datetime(df['date_time_str'].str[:-2], format='%Y%m%d')
.dt.tz_localize('Europe/Berlin')
.dt.tz_convert('UTC'))
df['date_time'] += pd.to_timedelta(df['date_time_str'].str[-2:].astype(int), unit='h')
# df['date_time']
# 0 2019-10-27 20:00:00+00:00
# 1 2019-10-27 21:00:00+00:00
# 2 2019-10-27 22:00:00+00:00
# 3 2019-10-27 23:00:00+00:00
# 4 2019-10-28 00:00:00+00:00
# 5 2019-10-28 01:00:00+00:00
# Name: date_time, dtype: datetime64[ns, UTC]
I'm not sure it is the most elegant or efficient solution, but I would roll the extra hour (labeled 24) over to hour 00 of the next day before parsing:
mask = df.date_time.str[-2:] == '24'
df.loc[mask, 'date_time'] = (pd.to_numeric(df.date_time[mask]) + 100 - 24).astype(str)
df.index = pd.to_datetime(df["date_time"], format="%Y%m%d%H")
Pick the first and the last index values, convert them to tz-aware datetimes, then generate a date_range, which handles 25-hour days, and assign it to your df index:
start = pd.to_datetime(df.index[0], format="%Y%m%d%H").tz_localize("Europe/Berlin")
end = pd.to_datetime(df.index[-1], format="%Y%m%d%H").tz_localize("Europe/Berlin")
index_ = pd.date_range(start, end, freq="h")  # the data is hourly
df = df.set_index(index_)
I would like to convert every day value in the dataframe into a day/Feb/2020 date.
Here the Date field contains only the day of the month.
Starting from the first dataframe, I want to convert the Date field like this.
My current approach is:
import datetime

y = []
for day in planned_ds.Date:
    x = datetime.datetime(2020, 5, day)
    print(x)
Is there an easy way to convert the whole day column of the dataframe to d/m/y format?
One way, assuming you have data like
df = pd.DataFrame([1, 2, 3, 4, 5], columns=["date"])
is to convert the values to dates and then shift them to start where you need:
pd.to_datetime(df["date"], unit="D") - pd.Timestamp(1970, 1, 1) + pd.Timestamp(2020, 1, 31)
This results in:
0 2020-02-01
1 2020-02-02
2 2020-02-03
3 2020-02-04
4 2020-02-05
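The same shift can be written a little more directly with pd.to_timedelta, avoiding the epoch subtraction; a sketch under the same assumption about the input:

```python
import pandas as pd

df = pd.DataFrame([1, 2, 3, 4, 5], columns=["date"])

# Anchor on the day before the target month, then add the day number as a timedelta:
# day 1 -> 2020-02-01, day 2 -> 2020-02-02, and so on.
df["date"] = pd.Timestamp("2020-01-31") + pd.to_timedelta(df["date"], unit="D")
print(df)
```

This keeps the whole conversion in one vectorized expression and makes the anchor date explicit.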