I have a column in my dataframe df:
Time
2 hours 3 mins
5 hours 10 mins
1 hours 40 mins
10 mins
4 hours
6 hours 0 mins
I want to create a new column 'Minutes' in df that converts this column to minutes:
Minutes
123
310
100
10
240
360
Is there a Python function to do this?
What I have tried is:
df['Minutes'] = pd.eval(
    df['Time'].replace(['hours?', 'mins'], ['*60+', ''], regex=True))
There is an ugly bug here: pd.eval can only process around 100 rows this way, so after stripping the trailing '+' (left over by rows with no minutes part), pd.eval is called per row via Series.apply to avoid it:
df['Minutes'] = (df['Time'].replace(['hours?', 'mins'], ['*60+', ''], regex=True)
                           .str.strip('+')
                           .apply(pd.eval))
print (df)
Time Minutes
0 2 hours 3 mins 123
1 5 hours 10 mins 310
2 1 hours 40 mins 100
3 10 mins 10
4 4 hours 240
5 6 hours 0 mins 360
#verify for 120 rows
df = pd.concat([df] * 20, ignore_index=True)
df['Minutes1'] = pd.eval(
    df['Time'].replace(['hours?', 'mins'], ['*60+', ''], regex=True).str.strip('+'))
print (df)
ValueError: unknown type object
Another solution with Series.str.extract and Series.add:
h = df['Time'].str.extract(r'(\d+)\s+hours', expand=False).astype(float).mul(60)
m = df['Time'].str.extract(r'(\d+)\s+mins', expand=False).astype(float)
df['Minutes'] = h.add(m, fill_value=0).astype(int)
print (df)
Time Minutes
0 2 hours 3 mins 123
1 5 hours 10 mins 310
2 1 hours 40 mins 100
3 10 mins 10
4 4 hours 240
5 6 hours 0 mins 360
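If pandas' timedelta parser accepts these strings, a simpler variant may also work. This is only a sketch, assuming pd.to_timedelta understands the 'hours'/'min' tokens; 'mins' is normalized to 'min' first since the plural form may not be recognized:
# sketch: parse the strings as timedeltas, then convert to whole minutes
td = pd.to_timedelta(df['Time'].str.replace('mins', 'min', regex=False))
df['Minutes'] = td.dt.total_seconds().div(60).astype(int)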
jezrael's answer is excellent, but I spent quite some time working on this, so I figured I'd post it anyway.
You can use a regex to capture the 'hours' and 'mins' parts of your column, then assign the result to a new column after doing the arithmetic that converts hours to minutes:
r = "(?:(\d+) hours ?)?(?:(\d+) mins)?"
hours = df.Time.str.extract(r)[0].astype(float).fillna(0) * 60
minutes = df.Time.str.extract(r)[1].astype(float).fillna(0)
df['minutes'] = hours + minutes
print(df)
Time minutes
0 2 hours 3 mins 123.0
1 5 hours 10 mins 310.0
2 1 hours 40 mins 100.0
3 10 mins 10.0
4 4 hours 240.0
5 6 hours 0 mins 360.0
I enjoy using https://regexr.com/ to test my regex
Below is my example dataframe
Date Indicator Value
0 2000-01-30 A 30
1 2000-01-31 A 40
2 2000-03-30 C 50
3 2000-02-27 B 60
4 2000-02-28 B 70
5 2000-03-31 C 90
6 2000-03-28 C 100
7 2001-01-30 A 30
8 2001-01-31 A 40
9 2001-03-30 C 50
10 2001-02-27 B 60
11 2001-02-28 B 70
12 2001-03-31 C 90
13 2001-03-28 C 100
Desired Output
Date Indicator Value
2000-01-31 A 40
2000-02-28 B 70
2000-03-31 C 90
2001-01-31 A 40
2001-02-28 B 70
2001-03-31 C 90
I want to write code that groups the data by month-year, keeps the entry with the latest date within each month-year, and drops the rest. The data runs until the year 2020.
I was only able to fetch the count by month-year. I could not come up with code that groups the data by month-year and indicator and gets the correct results.
Use Series.dt.to_period for monthly periods, aggregate the index of the maximal date per group with DataFrameGroupBy.idxmax, and then pass it to DataFrame.loc:
df['Date'] = pd.to_datetime(df['Date'])
print (df['Date'].dt.to_period('m'))
0 2000-01
1 2000-01
2 2000-03
3 2000-02
4 2000-02
5 2000-03
6 2000-03
7 2001-01
8 2001-01
9 2001-03
10 2001-02
11 2001-02
12 2001-03
13 2001-03
Name: Date, dtype: period[M]
df = df.loc[df.groupby(df['Date'].dt.to_period('m'))['Date'].idxmax()]
print (df)
Date Indicator Value
1 2000-01-31 A 40
4 2000-02-28 B 70
5 2000-03-31 C 90
8 2001-01-31 A 40
11 2001-02-28 B 70
12 2001-03-31 C 90
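An equivalent approach (a sketch, not part of the original answer): sort by date, then keep the last row in each monthly period with GroupBy.tail:
# sketch: after sorting by date, the last row per monthly period
# is the latest entry for that month-year
out = df.sort_values('Date')
out = out.groupby(out['Date'].dt.to_period('m'), sort=False).tail(1)
print (out)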
I have a multi-indexed DataFrame, the result of a groupby (by 'id' and 'date').
x y
id date
abc 3/1/1994 100 7
9/1/1994 90 8
3/1/1995 80 9
bka 5/1/1993 50 8
7/1/1993 40 9
I'd like to convert those dates into integer-like labels, such as:
x y
id date
abc day 0 100 7
day 1 90 8
day 2 80 9
bka day 0 50 8
day 1 40 9
I thought it would be simple, but I couldn't get there easily. Is there a simple way to do this?
Try this:
s = 'day ' + df.groupby(level=0).cumcount().astype(str)
df1 = df.set_index([s], append=True).droplevel(1)
x y
id
abc day 0 100 7
day 1 90 8
day 2 80 9
bka day 0 50 8
day 1 40 9
You can calculate the new level and create a new index:
lvl1 = 'day ' + df.groupby('id').cumcount().astype('str')
df.index = pd.MultiIndex.from_tuples(
    (x, y) for x, y in zip(df.index.get_level_values('id'), lvl1))
output:
x y
abc day 0 100 7
day 1 90 8
day 2 80 9
bka day 0 50 8
day 1 40 9
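A variant of the same idea (a sketch): build the index with MultiIndex.from_arrays so that both level names are preserved:
# sketch: rebuild the index from arrays, keeping the 'id' and 'date' names
lvl1 = 'day ' + df.groupby(level='id').cumcount().astype('str')
df.index = pd.MultiIndex.from_arrays(
    [df.index.get_level_values('id'), lvl1], names=['id', 'date'])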
I have 4 dataframes with weekly sales values for a year for 4 products. Some of the initial rows are 0 because there were no sales yet, and there are some other 0 values in between the weeks as well.
I want to remove those initial 0 rows while keeping the in-between 0s.
For example
Week Sales(prod 1)
1 0
2 0
3 100
4 120
5 55
6 0
7 60
Week Sales(prod 2)
1 0
2 0
3 0
4 120
5 0
6 30
7 60
I want to remove rows 1 and 2 from the first table and rows 1, 2 and 3 from the second.
A few assumptions based on your example dataframe:
- the DataFrame is created using pandas
- Week always starts at 1
- only the initial run of weeks with 0 sales is removed
Solution:
Python libraries required
- pandas, more_itertools
Example DataFrame (df):
Week Sales
1 0
2 0
3 0
4 120
5 0
6 30
7 60
Python Code:
import pandas as pd
import more_itertools as mit

filter_col = 'Sales'
filter_val = 0

# function which returns the index labels of the initial zero-sales weeks
def return_initial_week_index_with_zero_sales(df, filter_col, filter_val):
    index_wzs = [False]
    if df[filter_col].iloc[0] == filter_val:
        index_list = df[df[filter_col] == filter_val].index.tolist()
        # group consecutive index labels; the first group is the leading run
        index_wzs = [list(group) for group in mit.consecutive_groups(index_list)]
    return index_wzs[0]

# call the function above and remove the returned index labels from the dataframe
df = df.set_index('Week')
weeks_to_be_removed = return_initial_week_index_with_zero_sales(df, filter_col, filter_val)
if weeks_to_be_removed:
    print('Initial weeks with 0 sales are {}'.format(weeks_to_be_removed))
    df = df.drop(index=weeks_to_be_removed)
else:
    print('No initial week has 0 sales')
df.reset_index(inplace=True)
Result: df
Week Sales
4 120
5 0
6 30
7 60
I hope this helps; you can modify the function to suit your requirements.
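A much shorter alternative without more_itertools (a sketch): a cumulative maximum over the nonzero mask is False only for the leading run of zero-sales weeks:
# sketch: (Sales != 0).cummax() stays False until the first nonzero sale,
# so this drops only the leading zero rows and keeps the in-between zeros
df = df[df['Sales'].ne(0).cummax()].reset_index(drop=True)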
I have a file containing alert occurrence times. I want to sort them in ascending order. Can you please guide me on this?
Sample time format:
1 day, 19 hours
3 weeks
4 weeks, 1 day
2 minutes
1 month, 1 week
10 hours, 36 minutes
4 weeks, 1 day
4 weeks, 1 day
13 minutes
5 hours, 16 minutes
1 hour, 53 minutes
3 hours, 18 minutes
21 hours, 42 minutes
18 hours, 49 minutes
21 hours, 43 minutes
Maybe not super elegant, but straightforward in Python:
#!/usr/bin/env python
import operator

# 1 month = 28-31 days and 4 weeks = 28 days, so months are kept separate
time_in_seconds = {
    'week': 7 * 24 * 60 * 60,
    'day': 24 * 60 * 60,
    'hour': 60 * 60,
    'minute': 60,
    'second': 1,
}

if __name__ == '__main__':
    times = []
    with open('sample_time.txt', 'r') as f:
        for line in f.read().split('\n'):
            months = 0
            seconds = 0
            try:
                for pair in line.split(', '):
                    num, denum = pair.split(' ')
                    if denum.startswith('month'):
                        months += int(num)
                    else:
                        seconds += time_in_seconds[denum.rstrip('s')] * int(num)
                times.append([months, seconds, line])
            except (KeyError, ValueError):
                # skip blank or malformed lines
                pass
    sorted_times = sorted(times, key=operator.itemgetter(0, 1))
    for line in map(operator.itemgetter(2), sorted_times):
        print(line)
It assumes your file is called sample_time.txt.
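The same parsing can also be wrapped in a key function for sorted() (a sketch; it reuses the time_in_seconds dictionary above and assumes well-formed, non-empty lines):
# sketch: build a (months, seconds) sort key from one line
def sort_key(line):
    months, seconds = 0, 0
    for pair in line.split(', '):
        num, unit = pair.split(' ')
        if unit.startswith('month'):
            months += int(num)
        else:
            seconds += time_in_seconds[unit.rstrip('s')] * int(num)
    return (months, seconds)

with open('sample_time.txt') as f:
    for line in sorted(filter(None, f.read().splitlines()), key=sort_key):
        print(line)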
Given the following data frame:
import pandas as pd
df = pd.DataFrame(
{'A':[10,20,30,40,50,60],
'B':[1,2,1,4,5,4]
})
df
A B
0 10 1
1 20 2
2 30 1
3 40 4
4 50 5
5 60 4
I would like a new column 'C' whose values are equal to those in 'A' where the corresponding value in 'B' is less than 3, and 0 otherwise.
The desired result is as follows:
A B C
0 10 1 10
1 20 2 20
2 30 1 30
3 40 4 0
4 50 5 0
5 60 4 0
Thanks in advance!
Use np.where:
import numpy as np

df['C'] = np.where(df['B'] < 3, df['A'], 0)
>>> df
A B C
0 10 1 10
1 20 2 20
2 30 1 30
3 40 4 0
4 50 5 0
5 60 4 0
Here you can use the pandas where method directly on the column:
In [3]:
df['C'] = df['A'].where(df['B'] < 3, 0)
df
Out[3]:
A B C
0 10 1 10
1 20 2 20
2 30 1 30
3 40 4 0
4 50 5 0
5 60 4 0
Timings
In [4]:
%timeit df['A'].where(df['B'] < 3, 0)
%timeit np.where(df['B'] < 3, df['A'], 0)
1000 loops, best of 3: 1.4 ms per loop
1000 loops, best of 3: 407 µs per loop
np.where is faster here, but pandas where does more checking and has more options, so which one is better depends on the use case.
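For completeness, Series.mask is the inverse of where and reads naturally if you think of it as zeroing out the rows that fail the condition (a sketch of the same result):
# sketch: zero out 'A' wherever 'B' is NOT less than 3
df['C'] = df['A'].mask(df['B'] >= 3, 0)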