applying a function rowwise inside mutate(dplyr) - string

I have the data below, where Duration captures the number of years each household has spent in the same house.
Input df:
House_ID Duration
H29937 30 YEAR
H2996 30 YEAR
H156 25 YEAR
H10007 5 MONTH
I am trying to get the duration in months with the query below: if the second part of the extracted string is YEAR, convert the number in Duration to months by multiplying it by 12; otherwise just take the numeric part of Duration.
info_df <- mutate(info_df,
                  residence_Months = ifelse(str_split(Duration, " ", 2)[[1]][2] == "YEAR",
                                            as.numeric(str_split(Duration, " ", 2)[[1]][1]) * 12,
                                            as.numeric(str_split(Duration, " ", 2)[[1]][1])))
Expected output df:
House_ID Duration Residence_Months
H29937   30 YEAR  360
H2996    30 YEAR  360
H156     25 YEAR  300
H10007   5 MONTH  5
However, the code above gives the same value, 360, for all rows.
I am not sure where the error is occurring. Can someone please help me with this?
Note: I have tried the rowwise() option, as pointed out in other posts, but to no avail.

The problem is that str_split() returns a list with one element per input string, so str_split(Duration, " ", 2)[[1]] always takes the split pieces of the first value ("30 YEAR"); that length-one result is then recycled down the whole column, which is why every row ends up as 360. Depending on your full data set, this may be better achieved with the lubridate package, but taking your example into account, you can do:
library(dplyr)
library(tidyr)
df <- tibble(House_ID = c("H29937", "H2996", "H156", "H10007"),
             Duration = c("30 YEAR", "30 YEAR", "25 YEAR", "5 MONTH"))

df %>%
  separate("Duration", c("duration", "unit")) %>%
  mutate(duration = as.integer(duration),
         Residence_Months = ifelse(unit == "YEAR", duration * 12, duration))
#> # A tibble: 4 x 4
#>   House_ID duration unit  Residence_Months
#>   <chr>       <int> <chr>            <dbl>
#> 1 H29937         30 YEAR               360
#> 2 H2996          30 YEAR               360
#> 3 H156           25 YEAR               300
#> 4 H10007          5 MONTH                5
Created on 2019-07-18 by the reprex package (v0.3.0)

Related

convert strings with 'hours' and 'mins' into minutes

I have a column in my dataframe df:
Time
2 hours 3 mins
5 hours 10 mins
1 hours 40 mins
10 mins
4 hours
6 hours 0 mins
I want to create a new column 'Minutes' in df that converts this column to minutes:
Minutes
123
310
100
10
240
360
Is there a python function to do this?
What I have tried is:
df['Minutes'] = pd.eval(
    df['Time'].replace(['hours?', 'mins'], ['*60+', ''], regex=True))
There is an ugly bug here: pd.eval only processes around 100 rows or fewer. So, after stripping the trailing +, pd.eval is called per row via Series.apply to work around it:
df['Minutes'] = (df['Time'].replace(['hours?', 'mins'], ['*60+', ''], regex=True)
                           .str.strip('+')
                           .apply(pd.eval))
print (df)
Time Minutes
0 2 hours 3 mins 123
1 5 hours 10 mins 310
2 1 hours 40 mins 100
3 10 mins 10
4 4 hours 240
5 6 hours 0 mins 360
# verify for 120 rows
df = pd.concat([df] * 20, ignore_index=True)
df['Minutes1'] = pd.eval(
    df['Time'].replace(['hours?', 'mins'], ['*60+', ''], regex=True).str.strip('+'))
print(df)
ValueError: unknown type object
Another solution with Series.str.extract and Series.add:
h = df['Time'].str.extract(r'(\d+)\s+hours', expand=False).astype(float).mul(60)
m = df['Time'].str.extract(r'(\d+)\s+mins', expand=False).astype(float)
df['Minutes'] = h.add(m, fill_value=0).astype(int)
print (df)
Time Minutes
0 2 hours 3 mins 123
1 5 hours 10 mins 310
2 1 hours 40 mins 100
3 10 mins 10
4 4 hours 240
5 6 hours 0 mins 360
jezrael's answer is excellent, but I spent quite some time working on this, so I figured I'd post it anyway.
You can use a regex to capture the 'hours' and 'mins' parts of your column, then assign the result back to a new column after doing the arithmetic to convert to minutes:
r = "(?:(\d+) hours ?)?(?:(\d+) mins)?"
hours = df.Time.str.extract(r)[0].astype(float).fillna(0) * 60
minutes = df.Time.str.extract(r)[1].astype(float).fillna(0)
df['minutes'] = hours + minutes
print(df)
Time minutes
0 2 hours 3 mins 123.0
1 5 hours 10 mins 310.0
2 1 hours 40 mins 100.0
3 10 mins 10.0
4 4 hours 240.0
5 6 hours 0 mins 360.0
I enjoy using https://regexr.com/ to test my regex
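For completeness, a third route is to let pandas parse the strings as timedeltas. This is just a sketch and assumes a pandas version whose Timedelta parser accepts compound strings like '2 hours 3 min' (it may not accept 'mins', hence the replace):
import pandas as pd

df = pd.DataFrame({'Time': ['2 hours 3 mins', '5 hours 10 mins', '1 hours 40 mins',
                            '10 mins', '4 hours', '6 hours 0 mins']})
# normalize 'mins' -> 'min' so the Timedelta parser understands it,
# then convert total seconds to minutes
df['Minutes'] = (pd.to_timedelta(df['Time'].str.replace('mins', 'min', regex=False))
                   .dt.total_seconds()
                   .div(60)
                   .astype(int))
print(df)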

Plot many plots for each unique value of one column

So, I am working with a dataframe of around 20 columns, but only three columns are really of importance.
Index  ID        Date                 Time_difference
1      01-40-50  2021-12-01 16:54:00  0 days 00:12:00
2      01-10     2021-10-11 13:28:00  2 days 00:26:00
3      03-48-58  2021-11-05 16:54:00  2 days 00:26:00
4      01-40-50  2021-12-06 19:34:00  7 days 00:26:00
5      03-48-58  2021-12-09 12:14:00  1 days 00:26:00
6      01-10     2021-08-06 19:34:00  0 days 00:26:00
7      03-48-58  2021-10-01 11:44:00  0 days 02:21:00
There are 90 unique IDs and a few thousand rows in total. What I want to do is:
Create a plot for each unique ID
Each plot with a y-axis of 'Time_difference' and an x-axis of 'Date'
Each plot with a trendline
Optimally, a plot that has the average of all other plots
Would appreciate any input as to how to start this! Thank you!
For future documentation, I solved it as follows:
First, convert the timedelta to a number of hours:
df['hour_difference'] = (df['time_difference'].dt.days * 24
                         + df['time_difference'].dt.seconds / 60 / 60)
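Equivalently, assuming time_difference really has a timedelta64 dtype, dt.total_seconds() avoids splitting days and seconds by hand:
df['hour_difference'] = df['time_difference'].dt.total_seconds() / 3600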
Then create a list of all unique entries of ID:
id_list = df['ID'].unique()
And last, the for-loop for the plotting:
import matplotlib.pyplot as plt

for i in id_list:
    df.loc[df['ID'] == i].plot(y=["hour_difference"], figsize=(15, 4))
    plt.title(i, fontsize=18)              # label the plot title
    plt.xlabel('Title name', fontsize=12)  # label the x-axis
    plt.ylabel('Title Name', fontsize=12)  # label the y-axis
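The wish list above also asked for a trendline and an "average" plot, which the loop does not produce. Here is a minimal sketch of one way to add them, assuming numpy is available and that a least-squares line over the row order is an acceptable trend:
import numpy as np
import matplotlib.pyplot as plt

for i in id_list:
    sub = df.loc[df['ID'] == i].reset_index(drop=True)
    x = np.arange(len(sub))                  # row order as the x variable
    y = sub['hour_difference'].to_numpy()
    plt.figure(figsize=(15, 4))
    plt.plot(x, y, marker='o', label=str(i))
    if len(sub) > 1:                         # polyfit needs at least two points
        slope, intercept = np.polyfit(x, y, 1)
        plt.plot(x, slope * x + intercept, '--', label='trend')
    plt.title(i, fontsize=18)
    plt.legend()
    plt.show()

# a rough stand-in for "the average of all other plots":
# one figure with the mean hour_difference per ID
df.groupby('ID')['hour_difference'].mean().plot(kind='bar', figsize=(15, 4))
plt.show()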

Get the last date before an nth date for each month in Python

I am using a CSV with an accumulative number that changes daily.
Day Accumulative Number
0 9/1/2020 100
1 11/1/2020 102
2 18/1/2020 98
3 11/2/2020 105
4 24/2/2020 95
5 6/3/2020 120
6 13/3/2020 100
I am now trying to find the best way to aggregate it and compare the monthly results before a specific date. I want to check the balance on the 11th of each month, but for some months there is no activity on that specific day. As a result, I am trying to get the latest day before the 12th of each month. So, the above would become:
Day Accumulative Number
0 11/1/2020 102
1 11/2/2020 105
2 6/3/2020 120
What I managed to do so far is to just get the latest day of each month:
dateparse = lambda x: pd.datetime.strptime(x, "%d/%m/%Y")
df = pd.read_csv("Accumulative.csv",quotechar="'", usecols=["Day","Accumulative Number"], index_col=False, parse_dates=["Day"], date_parser=dateparse, na_values=['.', '??'] )
df.index = df['Day']
grouped = df.groupby(pd.Grouper(freq='M')).sum()
print (df.groupby(df.index.month).apply(lambda x: x.iloc[-1]))
which returns:
Day Accumulative Number
1 2020-01-18 98
2 2020-02-24 95
3 2020-03-13 100
Is there a way to achieve this in pandas/Python, or do I have to use SQL logic in my script? Is there an easier way I am missing to get the "balance" as of the 11th day of each month?
You can use groupby with factorize:
n = 12
df = df.sort_values('Day')
m = df.groupby(df.Day.dt.strftime('%Y-%m')).Day.transform(lambda x: x.factorize()[0]) == n
df_sub = df[m].copy()
You can try filtering the dataframe to rows where the day is less than 12, then taking the last row of each group (grouped by month):
df['Day'] = pd.to_datetime(df['Day'],dayfirst=True)
(df[df['Day'].dt.day.lt(12)]
.groupby([df['Day'].dt.year,df['Day'].dt.month],sort=False).last()
.reset_index(drop=True))
Day Accumulative_Number
0 2020-01-11 102
1 2020-02-11 105
2 2020-03-06 120
I would try:
# convert to datetime type:
df['Day'] = pd.to_datetime(df['Day'], dayfirst=True)
# select day before the 12th
new_df = df[df['Day'].dt.day < 12]
# select the last day in each month
new_df.loc[~new_df['Day'].dt.to_period('M').duplicated(keep='last')]
Output:
Day Accumulative Number
1 2020-01-11 102
3 2020-02-11 105
5 2020-03-06 120
Here's another way, expanding the date range (this answer calls the dataframe df2 and assumes the value column has been renamed to 'Number'). Note that it reports the forward-filled balance as of the 11th itself, so March shows 2020-03-11 rather than the last actual activity date:
# set as datetime
df2['Day'] = pd.to_datetime(df2['Day'], dayfirst=True)
# set as index
df2 = df2.set_index('Day')
# make a list of all dates
dates = pd.date_range(start=df2.index.min(), end=df2.index.max(), freq='1D')
# add dates
df2 = df2.reindex(dates)
# replace NA with forward fill
df2['Number'] = df2['Number'].ffill()
# filter to get output
df2 = df2[df2.index.day == 11].reset_index().rename(columns={'index': 'Date'})
print(df2)
Date Number
0 2020-01-11 102.0
1 2020-02-11 105.0
2 2020-03-11 120.0
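Another idea, not shown above: pd.merge_asof can look up, for each month's target date (the 11th), the latest row on or before it. A sketch, assuming every month in the data has at least one row on or before the 11th (otherwise the previous month's value would be matched):
import pandas as pd

df = pd.DataFrame({'Day': ['9/1/2020', '11/1/2020', '18/1/2020', '11/2/2020',
                           '24/2/2020', '6/3/2020', '13/3/2020'],
                   'Accumulative Number': [100, 102, 98, 105, 95, 120, 100]})
df['Day'] = pd.to_datetime(df['Day'], dayfirst=True)
df = df.sort_values('Day')

# one target date per month present in the data: the 11th
targets = (df['Day'].dt.to_period('M').drop_duplicates()
             .dt.to_timestamp()        # first day of each month
           + pd.Timedelta(days=10)).to_frame('target')

# for each target, take the latest row on or before it (backward match)
out = pd.merge_asof(targets, df, left_on='target', right_on='Day')
print(out[['Day', 'Accumulative Number']])
#          Day  Accumulative Number
# 0 2020-01-11                  102
# 1 2020-02-11                  105
# 2 2020-03-06                  120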

segmenting or grouping a df based on parameters or differences within columns going down the dataframe rows?

I am trying to figure out whether, given a dataframe with multiple fields, there is a way to segment or group the dataframe into a new dataframe based on whether the values of specific columns are within x amount of each other.
I.D | Created_Time | Home_Longitude | Home_Latitude | Work_Longitude | Work_Latitude
Faa1 2019-02-23 20:01:13.362 -77.0364 38.8951 -72.0364 38.8951
Above is how the original df looks with multiple rows.
I want to create a new dataframe where all rows or I.D.s contain created times within x minutes of each other, homes within x miles of one another (using haversine), and workplaces within x miles of one another.
So basically, I am trying to filter this dataframe down to rows that are within x minutes of each other on order-created time, within x miles of one another on the home coordinates, and within x miles on the work coordinates.
I did this by calculating the distances (in miles) and times relative to the first row, and then filtering the data using the required distance and time conditions.
My logic: if n rows are each within x minutes/miles of the first row, then those n rows are within 2x minutes/miles of each other (by the triangle inequality). For a strict pairwise guarantee of x, see the sketch at the end of this answer.
Generate some dummy data
random co-ordinates
import numpy as np
import pandas as pd
from random import uniform

# Generate random lat-long points
def newpoint():
    return uniform(-180, 180), uniform(-90, 90)

home_points = (newpoint() for x in range(289))
work_points = (newpoint() for x in range(289))
df = pd.DataFrame(home_points, columns=['Home_Longitude', 'Home_Latitude'])
df[['Work_Longitude', 'Work_Latitude']] = pd.DataFrame(work_points)
# Insert `ID` column as a sequence of integers
df.insert(0, 'ID', range(289))
# Generate random datetimes, separated by 5-minute intervals
# (you can choose your own interval)
times = pd.date_range('2012-10-01', periods=289, freq='5min')
df.insert(1, 'Created_Time', times)
print(df.head())
ID Created_Time Home_Longitude Home_Latitude Work_Longitude Work_Latitude
0 0 2012-10-01 00:00:00 -48.885981 -39.412351 -68.756244 24.739860
1 1 2012-10-01 00:05:00 58.584893 59.851739 -119.978429 -87.687858
2 2 2012-10-01 00:10:00 -18.623484 85.435248 -14.204142 -3.693993
3 3 2012-10-01 00:15:00 -29.721788 71.671103 -69.833253 -12.446204
4 4 2012-10-01 00:20:00 168.257968 -13.247833 60.979050 -18.393925
Create a Python helper function with the haversine distance formula (vectorized, in km):
def haversine(lat1, lon1, lat2, lon2, to_radians=False, earth_radius=6371):
    """
    Slightly modified version of http://stackoverflow.com/a/29546836/2901002
    Calculate the great-circle distance between two points on the earth
    (specified in decimal degrees or in radians).
    All (lat, lon) coordinates must have numeric dtypes and be of equal length.
    """
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = np.sin((lat2 - lat1) / 2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0)**2
    return earth_radius * 2 * np.arcsin(np.sqrt(a))
Calculate distances (relative to the first row) in km using the haversine formula, then convert km to miles. The coordinates are in degrees, so to_radians=True is needed, and the function expects latitude before longitude:
df['Home_dist_miles'] = \
    haversine(df.Home_Latitude, df.Home_Longitude,
              df.loc[0, 'Home_Latitude'], df.loc[0, 'Home_Longitude'],
              to_radians=True) * 0.621371
df['Work_dist_miles'] = \
    haversine(df.Work_Latitude, df.Work_Longitude,
              df.loc[0, 'Work_Latitude'], df.loc[0, 'Work_Longitude'],
              to_radians=True) * 0.621371
Calculate time differences in minutes (relative to the first row). For the dummy data here, the time differences will be multiples of 5 minutes (in real data, they could be anything):
df['time'] = df['Created_Time'] - df.loc[0, 'Created_Time']
df['time_min'] = (df['time'].dt.days * 24 * 60 * 60 + df['time'].dt.seconds)/60
Apply filters (method 1) and then select any 2 rows that satisfy the conditions stated in the OP
home_filter = df['Home_dist_miles']<=12000 # within 12,000 miles
work_filter = df['Work_dist_miles']<=8000 # within 8,000 miles
time_filter = df['time_min']<=25 # within 25 minutes
df_filtered = df.loc[(home_filter) & (work_filter) & (time_filter)]
# Select any 2 rows that satisfy required conditions
df_any2rows = df_filtered.sample(n=2)
print(df_any2rows)
ID Created_Time Home_Longitude Home_Latitude Work_Longitude Work_Latitude Home_dist_miles Work_dist_miles time time_min
0 0 2012-10-01 00:00:00 -168.956448 -42.970705 -6.340945 -12.749469 0.000000 0.000000 00:00:00 0.0
4 4 2012-10-01 00:20:00 -73.120352 13.748187 -36.953587 23.528789 6259.078588 5939.425019 00:20:00 20.0
Apply filters (method 2) and then select any 2 rows that satisfy the conditions stated in the OP
multi_query = """Home_dist_miles<=12000 & \
Work_dist_miles<=8000 & \
time_min<=25"""
df_filtered = df.query(multi_query)
# Select any 2 rows that satisfy required conditions
df_any2rows = df_filtered.sample(n=2)
print(df_any2rows)
ID Created_Time Home_Longitude Home_Latitude Work_Longitude Work_Latitude Home_dist_miles Work_dist_miles time time_min
0 0 2012-10-01 00:00:00 -168.956448 -42.970705 -6.340945 -12.749469 0.000000 0.000000 00:00:00 0.0
4 4 2012-10-01 00:20:00 -73.120352 13.748187 -36.953587 23.528789 6259.078588 5939.425019 00:20:00 20.0
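As noted at the top of this answer, checking everything against the first row only guarantees pairwise separations within 2x. If a strict pairwise bound of x matters, here is a small sketch of a brute-force check, assuming the filtered group is small enough for an n-by-n matrix:
import numpy as np

def max_pairwise_miles(lat, lon, earth_radius=6371):
    """Largest great-circle distance (in miles) between any two points."""
    lat = np.radians(np.asarray(lat, dtype=float))
    lon = np.radians(np.asarray(lon, dtype=float))
    # broadcast to an n-by-n matrix of haversine half-angle terms
    a = (np.sin((lat[:, None] - lat) / 2) ** 2
         + np.cos(lat[:, None]) * np.cos(lat) * np.sin((lon[:, None] - lon) / 2) ** 2)
    return (2 * earth_radius * np.arcsin(np.sqrt(a))).max() * 0.621371

# e.g. verify the homes in df_filtered really are pairwise close enough
print(max_pairwise_miles(df_filtered.Home_Latitude, df_filtered.Home_Longitude))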

Convert strings containing date ranges to dates and find time difference in R

I have a column of data:
dates <- c("20140106_20140131", "20140106_20140331", "20140201_20140210",
"20140201_20140228", "20140211_20140220", "20140221_20140228",
"20140301_20140310", "20140301_20140331")
I want R to recognize these values as a "date to date" format.
Questions
How do I write R code to convert them to date format? e.g. the first value: "01/06/2014 to 01/31/2014".
How do I compute the duration of each date range?
# vector of your date ranges
dates <- c("20140106_20140131", "20140106_20140331", "20140201_20140210",
"20140201_20140228", "20140211_20140220", "20140221_20140228",
"20140301_20140310", "20140301_20140331")
library('stringr')
library('lubridate')
First create a data frame with two columns of dates.
date_frame <- data.frame(str_split_fixed(dates, "_", 2))
Then convert the dates from strings using lubridate's ymd() function.
date_frame$X1 <- ymd(date_frame$X1)
date_frame$X2 <- ymd(date_frame$X2)
Create a new column of time differences.
transform(date_frame, diff = X2 - X1)
          X1         X2    diff
1 2014-01-06 2014-01-31 25 days
2 2014-01-06 2014-03-31 84 days
3 2014-02-01 2014-02-10  9 days
4 2014-02-01 2014-02-28 27 days
5 2014-02-11 2014-02-20  9 days
6 2014-02-21 2014-02-28  7 days
7 2014-03-01 2014-03-10  9 days
8 2014-03-01 2014-03-31 30 days
