Splitting a time-formatted object doesn't work with Python and pandas

I have this simple bit of code:
print(df['Duration'])
df['Duration'].str.split(':')
print(df['Duration'])
Here are the values I get for each print:
00:58:59
00:27:41
00:27:56
Name: Duration, dtype: object
Why is the split not working here? What am I missing?

str.split doesn't modify the column in place, so you need to assign the result back to something:
import pandas as pd
df = pd.DataFrame({'Duration':['00:58:59', '00:27:41', '00:27:56'], 'other':[10, 20, 30]})
df['Duration'] = df['Duration'].str.split(':')
print(df)
Prints:
Duration other
0 [00, 58, 59] 10
1 [00, 27, 41] 20
2 [00, 27, 56] 30
If you want to expand the split into separate DataFrame columns, you can try:
import pandas as pd
df = pd.DataFrame({'Duration':['00:58:59', '00:27:41', '00:27:56'], 'other':[10, 20, 30]})
df[['hours', 'minutes', 'seconds']] = df['Duration'].str.split(':', expand=True)
print(df)
Prints:
Duration other hours minutes seconds
0 00:58:59 10 00 58 59
1 00:27:41 20 00 27 41
2 00:27:56 30 00 27 56
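If the split parts are needed as numbers rather than strings (an assumption about the goal), a minimal follow-up sketch is:
import pandas as pd

df = pd.DataFrame({'Duration': ['00:58:59', '00:27:41', '00:27:56']})

# split into string parts, then cast each part to an integer
parts = df['Duration'].str.split(':', expand=True).astype(int)
df[['hours', 'minutes', 'seconds']] = parts

# alternatively, if the goal is duration arithmetic, parse the whole string
df['Duration_td'] = pd.to_timedelta(df['Duration'])
print(df)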

Related

How to split columns by date using Python

df.head(7) gives:
Month,ward1,ward2,...ward30
Apr-19, 20, 30, 45
May-19, 18, 25, 42
Jun-19, 25, 19, 35
Jul-19, 28, 22, 38
Aug-19, 24, 15, 40
Sep-19, 21, 14, 39
Oct-19, 15, 18, 41
and I want to split it into separate tables like:
Month, ward1
Apr-19, 20
May-19, 18
Jun-19, 25
Jul-19, 28
Aug-19, 24
Sep-19, 21
Oct-19, 15
Month,ward2
Apr-19, 30
May-19, 25
Jun-19, 19
Jul-19, 22
Aug-19, 15
Sep-19, 14
Oct-19, 18
Month, ward30
Apr-19, 45
May-19, 42
Jun-19, 35
Jul-19, 38
Aug-19, 40
Sep-19, 39
Oct-19, 41
How can I do this date-wise split in Python using pandas?
I have a dataframe df that contains a datetime column and 30 other columns, and I want to split it so each of those columns is paired with the date, but I am facing some difficulties.
Try using a dictionary comprehension to hold your separate dataframes:
indexed = df.set_index('Month')
dfs = {col: indexed[[col]] for col in indexed.columns}
print(dfs['ward1'])
ward1
Month
Apr-19 20
May-19 18
Jun-19 25
Jul-19 28
Aug-19 24
Sep-19 21
Oct-19 15
print(dfs['ward30'])
ward30
Month
Apr-19 45
May-19 42
Jun-19 35
Jul-19 38
Aug-19 40
Sep-19 39
Oct-19 41
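If Month should come back as an ordinary column (matching the "Month, ward1" layout asked for above), reset_index restores it:
# turn the Month index back into a regular column
print(dfs['ward1'].reset_index())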
One straightforward way would be to set the date column as the index and separate out every other column:
data.set_index('Month', inplace =True)
data_dict = {col: data[col] for col in data.columns}
You have to create new DataFrames:
data1 = pd.DataFrame()
data1['Month'] = df['Month']
data1['ward1'] = df['ward1']
data1.head()
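To cover all 30 ward columns without writing each one out by hand, the same idea can be looped; a sketch, assuming the columns really are named ward1 through ward30:
# one small DataFrame per ward column, each keeping the Month column
ward_frames = {col: df[['Month', col]] for col in df.columns if col.startswith('ward')}
print(ward_frames['ward30'])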

Converting a pandas dataframe to a dictionary of lists

I have the following problem. I have a pandas dataframe that was populated from a pandas.read_sql.
The dataframe looks like this:
PERSON GRADES
20 A 70
21 A 23
22 A 67
23 B 08
24 B 06
25 B 88
26 B 09
27 C 40
28 D 87
29 D 11
And I would need to convert it to a dictionary of lists like this:
{'A':[70,23,67], 'B':[08,06,88,09], 'C':[40], 'D':[87,11]}
I've tried for 2 hours and now I'm out of ideas. I think I'm missing something very simple.
With groupby and to_dict
df.groupby('PERSON').GRADES.apply(list).to_dict()
Out[286]: {'A': [70, 23, 67], 'B': [8, 6, 88, 9], 'C': [40], 'D': [87, 11]}
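One detail worth noting: the desired output writes 08, 06 and 09, which aren't valid integers; if zero-padded strings are actually wanted, the grades can be cast to strings first. A sketch, assuming two-digit grades:
# cast to zero-padded strings before grouping
df['GRADES'].astype(str).str.zfill(2).groupby(df['PERSON']).apply(list).to_dict()
# {'A': ['70', '23', '67'], 'B': ['08', '06', '88', '09'], 'C': ['40'], 'D': ['87', '11']}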

Pandas: Random integer between values in two columns

How can I create a new column containing a random integer between the values of two other columns in each row?
Example df:
import pandas as pd
import numpy as np
data = pd.DataFrame({'start': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'end': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})
data = data.iloc[:, [1, 0]]
Result:
Now I am trying something like this:
data['rand_between'] = data.apply(lambda x: np.random.randint(data.start, data.end))
or
data['rand_between'] = np.random.randint(data.start, data.end)
But it doesn't work of course because data.start is a Series not a number.
How can I use numpy.random with data from the columns as a vectorized operation?
You are close; you need to specify axis=1 so the data is processed by rows, and change data.start/end to x.start/end so you work with scalars:
data['rand_between'] = data.apply(lambda x: np.random.randint(x.start, x.end), axis=1)
Another possible solution:
data['rand_between'] = [np.random.randint(s, e) for s,e in zip(data['start'], data['end'])]
print (data)
start end rand_between
0 1 10 8
1 2 20 3
2 3 30 23
3 4 40 35
4 5 50 30
5 6 60 28
6 7 70 60
7 8 80 14
8 9 90 85
9 10 100 83
If you want to truly vectorize this, you can generate a random number between 0 and 1 per row and scale it with your start/end values:
(
    data['start'] + np.random.rand(len(data)) * (data['end'] - data['start'] + 1)
).astype('int')
Out:
0 1
1 18
2 18
3 35
4 22
5 27
6 35
7 23
8 33
9 81
dtype: int64
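With a recent NumPy, the Generator API also accepts array bounds directly, which avoids the per-row lambda entirely; a sketch, assuming NumPy >= 1.17:
import numpy as np

rng = np.random.default_rng()
# integers() broadcasts the low/high arrays: one draw per row in [start, end)
data['rand_between'] = rng.integers(data['start'], data['end'])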

Avoiding explicit for-loop in Python with pandas dataframe

I would like to find a better way to carry out the following process.
#import packages
import pandas as pd
I have defined a pandas dataframe.
# Create dataframe
data = {'name': ['Jason', 'Jason', 'Tina', 'Tina', 'Tina'],
'reports': [4, 24, 31, 2, 3],
'coverage': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data)
After the dataframe is created, I want to add an extra column to it. This column contains the rank based on the values in the coverage column for every name separately.
# Add column with ranks based on 'coverage' for every name separately.
df_end = pd.DataFrame()
for person_names in df.groupby('name').groups:
    one_name = df.groupby('name').get_group(person_names)
    one_name['coverageRank'] = one_name['coverage'].rank()
    df_end = df_end.append(one_name)
Is it possible to achieve this simple task in a simpler way? Maybe without using the for-loop?
I think you need DataFrameGroupBy.rank:
df['coverageRank'] = df.groupby('name')['coverage'].rank()
print (df)
coverage name reports coverageRank
0 25 Jason 4 1.0
1 94 Jason 24 2.0
2 57 Tina 31 1.0
3 62 Tina 2 2.0
4 70 Tina 3 3.0
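The groupby rank also takes the usual rank options if a different tie-handling or ordering is wanted, for example:
# dense ranks, highest coverage first (optional variation)
df['coverageRank'] = df.groupby('name')['coverage'].rank(method='dense', ascending=False)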

Pandas HDF limiting number of rows of CSV file

I have a 3 GB CSV file. I'm trying to save it in HDF format with pandas so I can load it faster.
import pandas as pd
import traceback
df_all = pd.read_csv('file_csv.csv', iterator=True, chunksize=20000)
for _i, df in enumerate(df_all):
    try:
        print('Saving %d chunk...' % _i, end='')
        df.to_hdf('file_csv.hdf',
                  'file_csv',
                  format='table',
                  data_columns=True)
        print('Done!')
    except:
        traceback.print_exc()
        print(df)
        print(df.info())
del df_all
The original CSV file has about 3 million rows, which is reflected by the output of this piece of code. The last line of output is: Saving 167 chunk...Done!
That means: 167 * 20000 = 3,340,000 rows.
My issue is:
df_hdf = pd.read_hdf('file_csv.hdf')
df_hdf.count()
=> 4613 rows
And:
item_info = pd.read_hdf('ItemInfo_train.hdf', where="item=1")
This returns nothing, even though I'm sure the "item" column has an entry equal to 1 in the original file.
What could be wrong?
Use append=True to tell to_hdf to append new chunks to the same file.
df.to_hdf('file_csv.hdf', ..., append=True)
Otherwise, each call overwrites the previous contents and only the last chunk remains saved in file_csv.hdf.
import os
import numpy as np
import pandas as pd
np.random.seed(2016)
df = pd.DataFrame(np.random.randint(10, size=(100, 2)), columns=list('AB'))
df.to_csv('file_csv.csv')
if os.path.exists('file_csv.hdf'): os.unlink('file_csv.hdf')
for i, df in enumerate(pd.read_csv('file_csv.csv', chunksize=50)):
    print('Saving {} chunk...'.format(i), end='')
    df.to_hdf('file_csv.hdf',
              'file_csv',
              format='table',
              data_columns=True,
              append=True)
    print('Done!')
print(df.loc[df['A']==1])
print('-'*80)
df_hdf = pd.read_hdf('file_csv.hdf', where="A=1")
print(df_hdf)
prints
Unnamed: 0 A B
22 22 1 7
30 30 1 7
41 41 1 9
44 44 1 0
19 69 1 3
29 79 1 1
31 81 1 5
34 84 1 6
