Convert a duration (string column) to seconds in PySpark

Is there an easy way to convert a string column holding a time duration to seconds in PySpark?
Is there a function that does this directly? I would like to avoid multiplying each part of the string by its number of seconds myself.
Input
id  | duration
--------------------
1   | 00 00:00:34
2   | 00 00:04:37
3   | 120 00:04:37
... | ...
NOTE:
Id 1 -> 0 days, 0 hours, 0 minutes, 34 seconds
Id 2 -> 0 days, 0 hours, 4 minutes, 37 seconds
Id 3 -> 120 days, 0 hours, 4 minutes, 37 seconds
Output
id  | duration
--------------------
1   | 34
2   | 277
3   | ...
... | ...

You can split the duration column into days, hours, minutes and seconds, then multiply each part by its number of seconds and add the results.
from pyspark.sql.functions import col, split

# df is the input DataFrame with columns "id" and "duration"
df.withColumn("duration", split("duration", "\\s+")) \
    .withColumn("time", split(col("duration").getItem(1), ":")) \
    .select(col("id"),
            ((col("duration").getItem(0).cast("int") * 86400) +   # days
             (col("time").getItem(0).cast("int") * 3600) +        # hours
             (col("time").getItem(1).cast("int") * 60) +          # minutes
             col("time").getItem(2).cast("int")                   # seconds
             ).cast("long").alias("duration")
            ).show()
+---+--------+
| id|duration|
+---+--------+
|  1|      34|
|  2|     277|
|  3|10368277|
+---+--------+
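The arithmetic in that answer can be checked without a Spark session. This is only a plain-Python sketch of the same calculation, assuming every value has the form "D HH:MM:SS"; the helper name duration_to_seconds is made up for illustration and is not a PySpark function.

```python
def duration_to_seconds(duration: str) -> int:
    """Convert 'D HH:MM:SS' (days, then a clock time) to total seconds."""
    days, clock = duration.split()
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return int(days) * 86400 + hours * 3600 + minutes * 60 + seconds


# The three sample rows from the question.
for sample in ["00 00:00:34", "00 00:04:37", "120 00:04:37"]:
    print(sample, "->", duration_to_seconds(sample))
```

For id 3 this gives 120 * 86400 + 4 * 60 + 37 = 10368277, which matches the last row of the Spark output.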

Related

How to convert multi-indexed datetime index into integer?

I have a multi-indexed DataFrame, the result of a groupby on 'id' and 'date'.
                x  y
id  date
abc 3/1/1994  100  7
    9/1/1994   90  8
    3/1/1995   80  9
bka 5/1/1993   50  8
    7/1/1993   40  9
I'd like to convert those dates into integer-like labels, such as
             x  y
id  date
abc day 0  100  7
    day 1   90  8
    day 2   80  9
bka day 0   50  8
    day 1   40  9
I thought it would be simple but I couldn't get there easily. Is there a simple way to work on this?
Try this:
s = 'day ' + df.groupby(level=0).cumcount().astype(str)
df1 = df.set_index([s], append=True).droplevel(1)
             x  y
id
abc day 0  100  7
    day 1   90  8
    day 2   80  9
bka day 0   50  8
    day 1   40  9
You can calculate the new level and create a new index:
lvl1 = 'day ' + df.groupby('id').cumcount().astype('str')
df.index = pd.MultiIndex.from_tuples((x,y) for x,y in zip(df.index.get_level_values('id'), lvl1) )
output:
             x  y
abc day 0  100  7
    day 1   90  8
    day 2   80  9
bka day 0   50  8
    day 1   40  9
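Both answers number the rows by their position inside each id group, so they assume the rows are already in date order. Here is a self-contained sketch of the same idea on the sample data; using MultiIndex.from_arrays and keeping the level names "id" and "date" are my own choices, not something either answer prescribes.

```python
import pandas as pd

# Rebuild the sample frame from the question.
df = pd.DataFrame(
    {"x": [100, 90, 80, 50, 40], "y": [7, 8, 9, 8, 9]},
    index=pd.MultiIndex.from_tuples(
        [("abc", "3/1/1994"), ("abc", "9/1/1994"), ("abc", "3/1/1995"),
         ("bka", "5/1/1993"), ("bka", "7/1/1993")],
        names=["id", "date"],
    ),
)

# cumcount() gives each row's position within its id group: 0, 1, 2, 0, 1.
day = "day " + df.groupby(level="id").cumcount().astype(str)

# Replace the date level with the new labels, keeping the id level as-is.
result = df.copy()
result.index = pd.MultiIndex.from_arrays(
    [df.index.get_level_values("id"), day.to_numpy()],
    names=["id", "date"],
)
print(result)
```

If the rows might not be sorted by date, they would need to be sorted within each id before calling cumcount().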

Selecting column on the basis of date

I have the following data set.
ID Date description V1 V2 V3
1 31-Jan-2013 Des1 10 20 30
1 31-Jan-2013 Des2 20 30 20
1 31-jan-2014 Des1 56 30 20
1 31-jan-2014 des2 30 40 60
2 31-dec-2013 Decc1 10 20 30
2 31-dec-2013 Decc2 20 30 20
2 31-dec-2014 Decc1 56 30 20
2 31-dec-2014 decc2 30 40 60
I want to extract only the rows from the latest year for each ID.
Expected output:
ID Date description V1 V2 V3
1 31-jan-2014 Des1 56 30 20
1 31-jan-2014 des2 30 40 60
2 31-dec-2014 Decc1 56 30 20
2 31-dec-2014 decc2 30 40 60
Can anybody help with how to achieve this in pandas?
Thanks
Anubhav
Maybe use groupby():
data_u.set_index(['ID', 'Date'],inplace=True)
data_u.sort_index(inplace=True)
data_u.groupby(data_u.index).index.agg(['count'])
This gives me the row count for each MultiIndex entry.
But I want to select the latest year for every ID. There are more than 500,000 records.
You could do the following:
df['Date'] = pd.to_datetime(df['Date'])
df[df.apply(lambda x : x['Date'] == df[(df['ID'] == x['ID'])]['Date'].max() , axis =1)]
Output
+---+----+------------+-------------+----+----+----+
| | ID | Date | description | V1 | V2 | V3 |
+---+----+------------+-------------+----+----+----+
| 2 | 1 | 2014-01-31 | Des1 | 56 | 30 | 20 |
| 3 | 1 | 2014-01-31 | des2 | 30 | 40 | 60 |
| 6 | 2 | 2014-12-31 | Decc1 | 56 | 30 | 20 |
| 7 | 2 | 2014-12-31 | decc2 | 30 | 40 | 60 |
+---+----+------------+-------------+----+----+----+
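The row-wise apply above re-filters the whole frame for every row, which may be slow with the 500,000+ records mentioned in the question. A possible vectorized variant is sketched below; the sample frame is a cut-down copy of the question's data (month names capitalized consistently, only V1 kept), and the names latest_date and latest_rows are made up for illustration.

```python
import pandas as pd

# Cut-down sample modelled on the question's data.
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 2, 2],
    "Date": ["31-Jan-2013", "31-Jan-2013", "31-Jan-2014", "31-Jan-2014",
             "31-Dec-2013", "31-Dec-2013", "31-Dec-2014", "31-Dec-2014"],
    "description": ["Des1", "Des2", "Des1", "des2",
                    "Decc1", "Decc2", "Decc1", "decc2"],
    "V1": [10, 20, 56, 30, 10, 20, 56, 30],
})
df["Date"] = pd.to_datetime(df["Date"], format="%d-%b-%Y")

# Broadcast each ID's latest date back onto its rows, then compare once.
latest_date = df.groupby("ID")["Date"].transform("max")
latest_rows = df[df["Date"] == latest_date]
print(latest_rows)
```

This keeps the rows with index 2, 3, 6 and 7, the same rows as the output above. If an ID can have several different dates within its latest year, comparing df["Date"].dt.year against its per-ID maximum instead would keep all of them.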

Looping to create a new column based on other column values in Python Dataframe [duplicate]

I want to create a new column in python dataframe based on other column values in multiple rows.
For example, my python dataframe df:
A | B
------------
10 | 1
20 | 1
30 | 1
10 | 1
10 | 2
15 | 3
10 | 3
I want to create a variable C based on the values of A, grouped by the value of B across multiple rows. When B has the same value in rows i, i+1, ..., the value of C in each of those rows is the sum of A over those rows. In this case, my output data frame will be:
A | B | C
--------------------
10 | 1 | 70
20 | 1 | 70
30 | 1 | 70
10 | 1 | 70
10 | 2 | 10
15 | 3 | 25
10 | 3 | 25
I have no idea what the best way to achieve this is. Can anyone help?
Thanks in advance
Recreate the data:
import pandas as pd
A = [10,20,30,10,10,15,10]
B = [1,1,1,1,2,3,3]
df = pd.DataFrame({'A':A, 'B':B})
df
A B
0 10 1
1 20 1
2 30 1
3 10 1
4 10 2
5 15 3
6 10 3
Then I'll create a lookup Series from the df:
lookup = df.groupby('B')['A'].sum()
lookup
A
B
1 70
2 10
3 25
Then I'll use that lookup on the df with apply:
df.loc[:,'C'] = df.apply(lambda row: lookup[lookup.index == row['B']].values[0], axis=1)
df
A B C
0 10 1 70
1 20 1 70
2 30 1 70
3 10 1 70
4 10 2 10
5 15 3 25
6 10 3 25
Use groupby() to group the rows on B, then transform('sum') to broadcast each group's sum of A back onto its rows.
df['C'] = df.groupby('B')['A'].transform('sum')

Create "leakage-free" Variables in Python?

I have a pandas data frame with several thousand observations and I would like to create "leakage-free" variables in Python. So I am looking for a way to calculate e.g. a group-specific mean of a variable without the single observation in row i.
For example:
| Group | Price | leakage-free Group Mean |
-------------------------------------------
| 1 | 20 | 25 |
| 1 | 40 | 15 |
| 1 | 10 | 30 |
| 2 | ... | ... |
I would like to do that with several variables and I would like to create mean, median and variance in such a way, so a computationally fast method might be good. If a group has only one row I would like to enter 0s in the leakage-free Variable.
As I am rather a beginner in Python, some piece of code might be very helpful. Thank You!!
With a one-liner:
import pandas as pd
df = pd.DataFrame({'Group': [1,1,1,2], 'Price':[20,40,10,30]})
df['lfgm'] = df.groupby('Group')['Price'].transform(lambda x: (x.sum()-x)/(len(x)-1)).fillna(0)
print(df)
Output:
Group Price lfgm
0 1 20 25.0
1 1 40 15.0
2 1 10 30.0
3 2 30 0.0
Update:
For median and variance (not one-liners unfortunately):
df = pd.DataFrame({'Group': [1,1,1,1,2], 'Price':[20,100,10,70,30]})
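def f(x):
    # For each row, compute the statistics over the other rows of its group
    for i in x.index:
        z = x.loc[x.index != i, 'Price']
        x.at[i, 'mean'] = z.mean()
        x.at[i, 'median'] = z.median()
        x.at[i, 'var'] = z.var()
    return x[['mean', 'median', 'var']]
# group_keys=False keeps the original row index so the join lines up
df = df.join(df.groupby('Group', group_keys=False).apply(f))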
print(df)
Output:
Group Price mean median var
0 1 20 60.000000 70.0 2100.000000
1 1 100 33.333333 20.0 1033.333333
2 1 10 63.333333 70.0 1633.333333
3 1 70 43.333333 20.0 2433.333333
4 2 30 NaN NaN NaN
Use:
grp = df.groupby('Group')
n = grp['Price'].transform('count')
mean = grp['Price'].transform('mean')
df['new_col'] = (mean*n - df['Price'])/(n-1)
print(df)
Group Price new_col
0 1 20 25.0
1 1 40 15.0
2 1 10 30.0
Note: This solution will be faster than the apply-based ones; you can check by timing each snippet with %%timeit.
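The same subtraction trick can probably be stretched to the variance by also tracking the sum of squares per group. The following is only a sketch under my own assumptions: the column names loo_mean and loo_var are invented, undefined cases are filled with 0 (the question asks for 0s in single-row groups; doing the same for the variance in two-row groups is my choice), and the sum-of-squares formula can lose precision on very large values. The median has no such shortcut.

```python
import pandas as pd

df = pd.DataFrame({"Group": [1, 1, 1, 1, 2], "Price": [20, 100, 10, 70, 30]})

grp = df.groupby("Group")["Price"]
n = grp.transform("count")
total = grp.transform("sum")
total_sq = (df["Price"] ** 2).groupby(df["Group"]).transform("sum")

# Quantities for each group after removing the current row.
m = n - 1                                # remaining observations
rest_sum = total - df["Price"]           # remaining sum
rest_sq = total_sq - df["Price"] ** 2    # remaining sum of squares

loo_mean = rest_sum / m
# Sample variance (ddof=1) from the remaining sum and sum of squares.
loo_var = (rest_sq - rest_sum ** 2 / m) / (m - 1)

# Fill the undefined cases with 0 (an assumption, see the note above).
df["loo_mean"] = loo_mean.where(m > 0, 0.0)
df["loo_var"] = loo_var.where(m > 1, 0.0)
print(df)
```

On the five-row sample from the update, the first row comes out as mean 60 and variance 2100, in line with the loop-based output.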

How to add all values in a column in MS Excel?

I have a worksheet with the following values:
--------------------
col1 | col2
--------------------
1 | 2 min 30 secs
2 | 1 min 24 secs
3 | 0 min 10 secs
4 | 1 min 3 secs
Now I would like to sum up all the values in col2. Sum will be: 4 min 67 secs.
Here is a working example. Simply format your time cells with the custom format "m \m\i\n s \s\e\c\s"
and sum them up:
http://sourcemonk.com/sum_of_formatted_time.xlsx
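Outside Excel, the expected total can be cross-checked with a few lines of Python. This is just a sketch that assumes every cell reads "<m> min <s> secs"; the helper to_seconds is hypothetical and has nothing to do with the linked workbook.

```python
import re


def to_seconds(text: str) -> int:
    """Parse strings such as '2 min 30 secs' into a number of seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*min\s+(\d+)\s*secs?\s*", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    mins, secs = map(int, match.groups())
    return mins * 60 + secs


values = ["2 min 30 secs", "1 min 24 secs", "0 min 10 secs", "1 min 3 secs"]
total = sum(to_seconds(v) for v in values)

# Carry whole minutes out of the seconds part.
minutes, seconds = divmod(total, 60)
print(f"{minutes} min {seconds} secs")
```

The total is 307 seconds, so the "4 min 67 secs" from the question is the same duration as 5 min 7 secs once the seconds are carried over.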
