How can I loop over rows in my DataFrame, calculate a value and put that value in a new column with this lambda function - python-3.x

./test.csv looks like:
   price    datetime
1  100      2019-10-10
2  150      2019-11-10
...
import pandas as pd
import datetime as date
import datetime as time
from datetime import datetime
from datetime import timedelta
csv_df = pd.read_csv('./test.csv')
today = datetime.today()
csv_df['datetime'] = csv_df['expiration_date'].apply(lambda x: pd.to_datetime(x)) #convert `expiration_date` to datetime Series
def days_until_exp(expiration_date, today):
    diff = (expiration_date - today)
    return [diff]
csv_df['days_until_expiration'] = csv_df['datetime'].apply(lambda x: days_until_exp(csv_df['datetime'], today))
I am trying to iterate over a specific column in my DataFrame labeled csv_df['datetime'], which in each cell has just one value, a date, and do a calculation defined by diff.
Then I want the single value diff to be put into the new Series csv_df['days_until_expiration'].
The problem is, it's calculating values for every row (673 rows) and putting all of those values in a list in each row of csv_df['days_until_expiration']. I realize it may be due to the brackets around [diff], but without them I get an error.
In Excel, I would just do something like =SUM(datetime - price) and click and drag down the rows to populate a new column. However, I want to do this in Pandas, as it's part of a bigger application.

csv_df['datetime'] is a Series, so the x in apply is a single cell of that Series. You call apply with a lambda and days_until_exp(), but you never pass x to it; you pass the whole Series csv_df['datetime'] instead, so every row receives the full list of differences. That is why the result is wrong.
Anyway, without your sample data, I guess that you want the difference between csv_df['datetime'] and today(), or the sum of those differences. To do this you don't need apply; just do a direct vectorized operation on the Series.
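For completeness, a minimal sketch of the apply-based fix, assuming each cell is already a Timestamp: pass each cell x to the function instead of the whole Series:
csv_df['days_until_expiration'] = csv_df['datetime'].apply(lambda x: (x - today).days)  # x is one cell
But the vectorized version below is preferable.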
I made a two-column DataFrame as a sample:
csv_df:
    datetime  days_until_expiration
0 2019-09-01                    NaN
1 2019-09-02                    NaN
2 2019-09-03                    NaN
The following returns a Series of deltas between csv_df['datetime'] and today(), which I guess is what you want:
import datetime

td = datetime.datetime.today()
csv_df['days_until_expiration'] = (csv_df['datetime'] - td).dt.days
csv_df:
    datetime  days_until_expiration
0 2019-09-01                    115
1 2019-09-02                    116
2 2019-09-03                    117
Or, to find the sum of all deltas and assign that same value to every row of csv_df['days_until_expiration']:
csv_df['days_until_expiration'] = (csv_df['datetime'] - td).dt.days.sum()
csv_df:
    datetime  days_until_expiration
0 2019-09-01                    348
1 2019-09-02                    348
2 2019-09-03                    348

Related

Apply a custom rolling function with arguments on Pandas DataFrame

I have this df (here is the df.head()):
        date    colA
0 2018-01-05  0.6191
1 2018-01-20  0.5645
2 2018-01-25  0.5641
3 2018-01-27  0.5404
4 2018-01-30  0.4933
I would like to apply a function to every 3 rows recursively, meaning for rows: 1,2,3 then for rows: 2,3,4 then rows 3,4,5, etc.
This is what I wrote:
def my_rolling_func(df, val):
    p1 = (df['date'] - df['date'].min()).dt.days.tolist()[0], df[val].tolist()[0]
    p2 = (df['date'] - df['date'].min()).dt.days.tolist()[1], df[val].tolist()[1]
    p3 = (df['date'] - df['date'].min()).dt.days.tolist()[2], df[val].tolist()[2]
    return sum([i*j for i, j in [p1, p2, p3]])

df.rolling(3, center=False, axis=1).apply(my_rolling_func, args=('colA'))
But I get this error:
ValueError: Length of passed values is 1, index implies 494.
494 is the number of rows in my df.
I'm not sure why it says I passed a length of 1; I thought rolling generated slices of the df according to the window size I defined (3) and then applied the function to each of those subsets.
First, you specified the wrong axis: axis=1 means the window slides along the columns. You want the window to slide along the index, so you need axis=0. Secondly, you slightly misunderstand how rolling works: it applies your function to each column independently, so you cannot operate on both the date and colA columns at the same time inside your function.
I rewrote your code to make it work:
import pandas as pd
import numpy as np

df = pd.DataFrame({'date': pd.date_range('2018-01-05', '2018-01-30', freq='D'),
                   'A': np.random.random((26,))})
df = df.set_index('date')

def my_rolling_func(s):
    days = (s.index - s.index[0]).days  # day offsets relative to the start of the window
    return sum(s * days)

res = df.rolling(3, center=False, axis=0).apply(my_rolling_func, raw=False)  # raw=False so the function receives a Series with its index
print(res)
Out:
                   A
date
2018-01-05       NaN
2018-01-06       NaN
2018-01-07  1.123872
2018-01-08  1.121119
2018-01-09  1.782860
2018-01-10  0.900717
2018-01-11  0.999509
2018-01-12  1.755408
2018-01-13  2.344914
.....
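As an aside: with a strictly daily index the day offsets inside every window are just [0, 1, 2], so the same numbers can be computed without apply. A hedged sketch using numpy's sliding_window_view (available in numpy >= 1.20):
windows = np.lib.stride_tricks.sliding_window_view(df['A'].to_numpy(), 3)  # shape (n-2, 3)
res_fast = pd.Series(windows @ np.array([0, 1, 2]), index=df.index[2:])    # weighted sum per window
This reproduces the non-NaN rows of res above for regularly spaced dates.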

Widening long table grouped on date

I have run into a problem transforming a DataFrame. I'm trying to widen a table grouped on a datetime column, but I can't seem to make it work. I have tried to transpose it and to pivot it, but I can't really get it into the shape I want.
Example table:
datetime                       value
2022-04-29T02:00:00.000000000      5
2022-04-29T03:00:00.000000000      6
2022-05-29T02:00:00.000000000      5
2022-05-29T03:00:00.000000000      7
What I want to achieve is:
index  date        02:00  03:00
1      2022-04-29      5      6
2      2022-05-29      5      7
The real data has one data point from 00:00 to 20:00 for each day, so I guess a loop would be the way to go to generate the columns.
Does anyone know a way to solve this, or can nudge me in the right direction?
Thanks in advance!
From the details you have provided, I assume you are dealing with timeseries data and have readings from different dates acquired at 02:00:00 and 03:00:00. Please correct me if I am wrong.
First we replicate your DataFrame object.
import datetime as dt
from io import StringIO
import pandas as pd
data_str = """2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7"""
df = pd.read_csv(StringIO(data_str), sep=" ", header=None)
df.columns = ["date", "value"]
Now we compute the unique days on which you acquired data:
unique_days = df["date"].apply(lambda x: dt.datetime.strptime(x[:-3], "%Y-%m-%dT%H:%M:%S.%f").date()).unique()
Here I trimmed the last three zeros from the timestamp, because %f parses at most six fractional digits (microseconds) and the nanosecond part would otherwise fail to parse. We convert each string to a datetime object, take its date, and keep the unique values.
Now we create a new empty DataFrame in the desired form:
new_df = pd.DataFrame(columns=["date", "02:00", "03:00"])
after this we can populate the values:
for day in unique_days:
    new_row_data = [day]  # this builds a row of 3 elements, to be inserted into the empty df
    new_row_data.append(df.loc[df["date"] == f"{day}T02:00:00.000000000", "value"].values[0])  # the 02:00 value for that date
    new_row_data.append(df.loc[df["date"] == f"{day}T03:00:00.000000000", "value"].values[0])  # the 03:00 value for the same day
    new_df.loc[len(new_df)] = new_row_data  # insert the row at the last position
This should give you:
        date 02:00 03:00
0 2022-04-29     5     6
1 2022-05-29     5     7
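Since the question mentions pivoting, a hedged alternative without the explicit loop, assuming all timestamps parse cleanly with pd.to_datetime:
ts = pd.to_datetime(df["date"])
wide = (df.assign(day=ts.dt.date, time=ts.dt.strftime("%H:%M"))  # split timestamp into date and HH:MM parts
          .pivot(index="day", columns="time", values="value")    # one column per time of day
          .reset_index())
This also scales to the full 00:00-20:00 range without hand-writing one column per hour.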

Convert number into hours and minutes while reading CSV in Pandas

I have CSV file where the second column indicates a time point with the format HHMMSS.
ID;TIME
A;110500
B;090000
C;130200
This raises some questions for me.
Does pandas have a data format to represent a time point with hour, minutes and seconds, but without the day, month, etc.?
How can I convert the fields to such a format?
In plain Python I would iterate over the fields, but I am sure Pandas has a more efficient way.
If there is no time-of-day format without a date, I could add a day-month-year date to the time point.
Here is an MWE:
import pandas
import io
csv = io.StringIO('ID;TIME\nA;110500\nB;090000\nC;130200')
df = pandas.read_csv(csv, sep=';')
print(df)
Results in (note that TIME is parsed as an integer, so the leading zero of 090000 is lost):
  ID    TIME
0  A  110500
1  B   90000
2  C  130200
But what I want to see is
  ID      TIME
0  A  11:05:00
1  B   9:00:00
2  C  13:02:00
Or, even better, with the seconds cut off as well:
  ID   TIME
0  A  11:05
1  B   9:00
2  C  13:02
You could use the date_parser parameter of read_csv together with the time accessor:
df = pandas.read_csv(csv, sep=';',
                     parse_dates=[1],  # need to know the position of the TIME column
                     date_parser=lambda x: pandas.to_datetime(x, format='%H%M%S').time)
print(df)
  ID      TIME
0  A  11:05:00
1  B  09:00:00
2  C  13:02:00
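Note that date_parser is deprecated in pandas 2.0+; an untested sketch of the newer equivalent uses the date_format parameter and then strips the dummy date part:
df = pandas.read_csv(csv, sep=';', parse_dates=[1], date_format='%H%M%S')
df['TIME'] = df['TIME'].dt.time  # keep only the time of day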
But doing it after reading might be as good. Note that TIME must be read as a string (dtype={'TIME': str}), otherwise values like 090000 lose their leading zero and no longer match '%H%M%S':
df = (pandas.read_csv(csv, sep=';', dtype={'TIME': str})
        .assign(TIME=lambda x: pandas.to_datetime(x['TIME'], format='%H%M%S').dt.time)
        # or: lambda x: pandas.to_datetime(x['TIME'], format='%H%M%S').dt.strftime('%#H:%M')
        # ('%#H' drops the leading zero on Windows; use '%-H' on Linux/macOS)
     )
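On the first question: as far as I know, pandas has no dedicated time-of-day dtype; besides datetime.time objects as above, the closest native representation may be a timedelta since midnight. A hedged sketch, assuming TIME still holds the raw HHMMSS values and with TIME_td as a hypothetical column name:
t = df['TIME'].astype(str).str.zfill(6)  # re-pad leading zeros in case TIME came in as int
df['TIME_td'] = pandas.to_timedelta(t.str[:2] + ':' + t.str[2:4] + ':' + t.str[4:])
Timedeltas support sorting and arithmetic, which plain datetime.time objects do not.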

convert datetime to date python --> error: unhashable type: 'numpy.ndarray'

Pandas by default represents dates as datetime64[ns], so my columns have the format [2016-02-05 00:00:00], but I just want the date 2016-02-05, so I applied this code to a few columns:
df3a['MA'] = pd.to_datetime(df3a['MA'])
df3a['BA'] = pd.to_datetime(df3a['BA'])
df3a['FF'] = pd.to_datetime(df3a['FF'])
df3a['JJ'] = pd.to_datetime(df3a['JJ'])
.....
but it gives me this error: TypeError: unhashable type: 'numpy.ndarray'
My question is: why do I get this error, and how do I convert datetime to date for multiple columns (around 50)?
I will be grateful for your help.
One way to achieve what you'd like is with a DatetimeIndex. I first created an example DataFrame with 'date' and 'values' columns and from there tried to reproduce the error you got.
import pandas as pd
import numpy as np
# Example DataFrame with a DatetimeIndex (dti)
dti = pd.date_range('2020-12-01','2020-12-17') # dates from first of december up to date
values = np.random.choice(range(1, 101), len(dti)) # random values between 1 and 100
df = pd.DataFrame({'date':dti,'values':values}, index=range(len(dti)))
print(df.head())
        date  values
0 2020-12-01      85
1 2020-12-02     100
2 2020-12-03      96
3 2020-12-04      40
4 2020-12-05      27
In the example, the 'date' column already shows just the dates without the time, I guess because it is built from a DatetimeIndex.
What I haven't tested, but might work for you, is:
# Your dataframe
df3a['MA'] = pd.DatetimeIndex(df3a['MA'])
...
# automated transform for all columns (if all columns are datetimes!)
for label in df3a.columns:
    df3a[label] = pd.DatetimeIndex(df3a[label])
Use DataFrame.apply:
cols = ['MA', 'BA', 'FF', 'JJ']
df3a[cols] = df3a[cols].apply(pd.to_datetime)
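Since the goal was the date without the time component, a hedged follow-up (assuming the columns convert cleanly) strips the time with the .dt.date accessor in the same pass:
cols = ['MA', 'BA', 'FF', 'JJ']
df3a[cols] = df3a[cols].apply(lambda s: pd.to_datetime(s).dt.date)
Note that this yields object columns of datetime.date values, not datetime64 columns.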

Calculate average of time with fraction of seconds

I have 3 different columns of timestamps in a pandas DataFrame, two of which record fractions of a second while the third does not. I would like to calculate an average of these 3 columns.
I have already tried to compute the average using the mean function on the columns and consistently received NaN as the result.
import pandas as pd
data = [{'time1': '2018-07-22 04:34:10.8966', 'time2': '2017-07-22 04:34:10.8966', 'time3': '2018-07-27 00:10:04'}]
df = pd.DataFrame(data)
df['estimate'] = df[['time1', 'time2', 'time3']].mean(axis=1)
df
Expected: an average of the 3 timestamps.
Actual: while there is no error, it always evaluates to NaN, which is not what is desired.
As far as I know you can't do it directly on datetime values; you need to convert them, average, and then convert back:
import numpy as np
import pandas as pd

data = [{'time1': '2018-07-22 04:34:10.8966', 'time2': '2017-07-22 04:34:10.8966', 'time3': '2018-07-27 00:10:04'}]
df = pd.DataFrame(data).apply(pd.to_datetime)
# view the datetimes as nanosecond integers, average, and convert back
df['estimate'] = pd.to_datetime(df[['time1', 'time2', 'time3']].values.astype(np.int64).mean(axis=1))
Result:
time1 time2 time3 estimate
0 2018-07-22 04:34:10.896600 2017-07-22 04:34:10.896600 2018-07-27 00:10:04 2018-03-24 03:06:08.597733376
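In newer pandas versions Series.mean supports datetime values directly, so a hedged alternative (untested across versions) is a row-wise mean without the manual integer round-trip:
df['estimate'] = df[['time1', 'time2', 'time3']].apply(lambda row: row.mean(), axis=1)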
