Handling dates with a mix of two- and four-digit years in Python - python-3.x

I have a DataFrame df with two date columns:
A B
5/4/2018 8/4/2018
24/5/15 26/5/15
21/7/16 22/7/16
3/7/2015 5/7/2015
1/7/2016 1/7/2016
I want to calculate the difference in days for each row.
For example:
A B C
5/4/2018 8/4/2018 3
24/5/15 26/5/15 2
I have tried to convert the columns to datetime using pd.to_datetime, but I am getting the error "ValueError: unconverted data remains: 18".
I tried the following code:
import datetime as dt
df['A'] = pd.to_datetime(df['A'], format = "%d/%m/%y").datetime.datetime.strftime("%Y-%m-%d")
df['B'] = pd.to_datetime(df['B'], format = "%d/%m/%y").datetime.datetime.strftime("%Y-%m-%d")
df['C'] = (df['B'] - df['A']).dt.days
Note: using Python 3.7.

Try:
df['A'] = pd.to_datetime(df['A'], dayfirst=True)
df['B'] = pd.to_datetime(df['B'], dayfirst=True)
df['C'] = (df['B'] - df['A']).dt.days
Output:
A B C
0 2018-04-05 2018-04-08 3
1 2015-05-24 2015-05-26 2
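For reference, a minimal runnable version of this fix, using the data from the question (note: recent pandas releases may require format='mixed' when the string formats are inconsistent):
import pandas as pd

df = pd.DataFrame({'A': ['5/4/2018', '24/5/15', '21/7/16', '3/7/2015', '1/7/2016'],
                   'B': ['8/4/2018', '26/5/15', '22/7/16', '5/7/2015', '1/7/2016']})

# Omitting an explicit format lets pandas infer it per value, so both
# two- and four-digit years parse; dayfirst=True keeps the day/month order.
df['A'] = pd.to_datetime(df['A'], dayfirst=True)
df['B'] = pd.to_datetime(df['B'], dayfirst=True)
df['C'] = (df['B'] - df['A']).dt.days
print(df)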

Related

Add an extra column to a pandas dataframe that is a copy of the index [duplicate]

I have a dataframe which I want to plot with matplotlib, but the index column is the time and I cannot plot it.
This is the dataframe (df3):
but when I try the following:
plt.plot(df3['magnetic_mag mean'], df3['YYYY-MO-DD HH-MI-SS_SSS'], label='FDI')
I'm getting an error obviously:
KeyError: 'YYYY-MO-DD HH-MI-SS_SSS'
So what I want to do is add a new extra column to my dataframe (named 'Time') which is just a copy of the index column.
How can I do it?
This is the entire code:
#Importing the csv file into df
df = pd.read_csv('university2.csv', sep=";", skiprows=1)
#Changing datetime
df['YYYY-MO-DD HH-MI-SS_SSS'] = pd.to_datetime(df['YYYY-MO-DD HH-MI-SS_SSS'],
format='%Y-%m-%d %H:%M:%S:%f')
#Set index from column
df = df.set_index('YYYY-MO-DD HH-MI-SS_SSS')
#Add Magnetic Magnitude Column
df['magnetic_mag'] = np.sqrt(df['MAGNETIC FIELD X (μT)']**2 + df['MAGNETIC FIELD Y (μT)']**2 + df['MAGNETIC FIELD Z (μT)']**2)
#Subtract Earth's Average Magnetic Field from 'magnetic_mag'
df['magnetic_mag'] = df['magnetic_mag'] - 30
#Copy interesting values
df2 = df[[ 'ATMOSPHERIC PRESSURE (hPa)',
'TEMPERATURE (C)', 'magnetic_mag']].copy()
#Hourly Average and Standard Deviation for interesting values
df3 = df2.resample('H').agg(['mean','std'])
df3.columns = [' '.join(col) for col in df3.columns]
df3.reset_index()
plt.plot(df3['magnetic_mag mean'], df3['YYYY-MO-DD HH-MI-SS_SSS'], label='FDI')
Thank you!!
I think you need reset_index:
df3 = df3.reset_index()
Possible solution, but I think inplace is not good practice:
df3.reset_index(inplace=True)
But if you need a new column, use:
df3['new'] = df3.index
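A tiny sketch of the difference between the two approaches, using a hypothetical stand-in for df3 with a datetime index:
import pandas as pd

df3 = pd.DataFrame({'magnetic_mag mean': [1.0, 2.0]},
                   index=pd.date_range('2015-01-01', periods=2, freq='H'))
df3['Time'] = df3.index      # copies the index into a new column; the index is kept
# df3 = df3.reset_index()    # alternatively, moves the index into a regular column
print(df3)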
I think you can use read_csv better:
df = pd.read_csv('university2.csv',
                 sep=";",
                 skiprows=1,
                 index_col='YYYY-MO-DD HH-MI-SS_SSS',
                 parse_dates=['YYYY-MO-DD HH-MI-SS_SSS']) #if it doesn't work, use pd.to_datetime
Note that parse_dates needs a list of column names here, not a bare string.
And then omit:
#Changing datetime
df['YYYY-MO-DD HH-MI-SS_SSS'] = pd.to_datetime(df['YYYY-MO-DD HH-MI-SS_SSS'],
format='%Y-%m-%d %H:%M:%S:%f')
#Set index from column
df = df.set_index('YYYY-MO-DD HH-MI-SS_SSS')
EDIT: If the MultiIndex or Index comes from a groupby operation, possible solutions are:
df = pd.DataFrame({'A':list('aaaabbbb'),
'B':list('ccddeeff'),
'C':range(8),
'D':range(4,12)})
print (df)
A B C D
0 a c 0 4
1 a c 1 5
2 a d 2 6
3 a d 3 7
4 b e 4 8
5 b e 5 9
6 b f 6 10
7 b f 7 11
df1 = df.groupby(['A','B']).sum()
print (df1)
C D
A B
a c 1 9
d 5 13
b e 9 17
f 13 21
Add parameter as_index=False:
df2 = df.groupby(['A','B'], as_index=False).sum()
print (df2)
A B C D
0 a c 1 9
1 a d 5 13
2 b e 9 17
3 b f 13 21
Or add reset_index:
df2 = df.groupby(['A','B']).sum().reset_index()
print (df2)
A B C D
0 a c 1 9
1 a d 5 13
2 b e 9 17
3 b f 13 21
You can directly access the index and plot it; the following is an example:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
#Get index on horizontal axis
plt.plot(df.index, df[0])
plt.show()
#Get index on vertical axis
plt.plot(df[0], df.index)
plt.show()
You can also use eval to achieve this:
In [2]: df = pd.DataFrame({'num': range(5), 'date': pd.date_range('2022-06-30', '2022-07-04')}, index=list('ABCDE'))
In [3]: df
Out[3]:
num date
A 0 2022-06-30
B 1 2022-07-01
C 2 2022-07-02
D 3 2022-07-03
E 4 2022-07-04
In [4]: df.eval('index_copy = index')
Out[4]:
num date index_copy
A 0 2022-06-30 A
B 1 2022-07-01 B
C 2 2022-07-02 C
D 3 2022-07-03 D
E 4 2022-07-04 E

Convert floats to ints in a column with numbers and NaNs

I'm working with Python 3.6 and Pandas 1.0.3.
I would like to convert the floats from column "A" to int. This column has some NaN values.
So I followed this post with the solution from @jezrael.
But I get the following error:
"TypeError: cannot safely cast non-equivalent float64 to int64"
This is my code:
import pandas as pd
import numpy as np
data = {'timestamp': [1588757760.0000, 1588757760.0161, 1588757764.7339, 1588757764.9234], 'A':[9087.6000, 9135.8000, np.nan, 9102.1000], 'B':[0.1648, 0.1649, '', 5.3379], 'C':['b', 'a', '', 'a']}
df = pd.DataFrame(data)
df['A'] = pd.to_numeric(df['A'], errors='coerce').astype('Int64')
print(df)
Did I miss something?
Your problem is that you have true float numbers, not integers stored in float form, so for safety reasons pandas will not convert them, because you would obtain different values.
You first need to explicitly round them to integers, and only then use the .astype() method:
df['A'] = pd.to_numeric(df['A'].round(), errors='coerce').astype('Int64')
Test:
print(df)
timestamp A B C
0 1.588758e+09 9088 0.1648 b
1 1.588758e+09 9136 0.1649 a
2 1.588758e+09 <NA>
3 1.588758e+09 9102 5.3379 a
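A compact sketch of why the rounding step matters, isolating just the problem column:
import numpy as np
import pandas as pd

s = pd.Series([9087.6, 9135.8, np.nan, 9102.1])
# s.astype('Int64') raises TypeError: cannot safely cast non-equivalent
# float64 to int64, because 9087.6 has no exact integer equivalent.
print(s.round().astype('Int64'))   # 9088, 9136, <NA>, 9102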
One way to do it is to convert NaN to an integer sentinel first:
df['A'] = df['A'].fillna(99999999).astype(np.int64, errors='ignore')
df['A'] = df['A'].replace(99999999, np.nan)
df
timestamp A B C
0 1.588758e+09 9087 0.1648 b
1 1.588758e+09 9135 0.1649 a
2 1.588758e+09 NaN
3 1.588758e+09 9102 5.3379 a

How to sum column values separated by semicolons in Python

I have a dataframe with the values as below:
df = pd.DataFrame({'Column4': ['NaN;NaN;1;4','4;8','nan']} )
print (df)
Column4
0 NaN;NaN;1;4
1 4;8
2 nan
I tried with the code below to get the sum.
df['Sum'] = df['Column4'].apply(lambda x: sum(map(int, x.split(';'))))
I am getting the error message as
ValueError: invalid literal for int() with base 10: 'NaN'
Use Series.str.split with expand=True to get a DataFrame, convert to floats and sum per row - pandas by default excludes missing values:
df['Sum'] = df['Column4'].str.split(';', expand=True).astype(float).sum(axis=1)
print (df)
Column4 Sum
0 NaN;NaN;1;4 5.0
1 4;8 12.0
2 nan 0.0
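To see what expand=True produces before the row sum, a sketch with the question's data:
print(df['Column4'].str.split(';', expand=True).astype(float))
#      0    1    2    3
# 0  NaN  NaN  1.0  4.0
# 1  4.0  8.0  NaN  NaN
# 2  NaN  NaN  NaN  NaN
sum(axis=1) then skips the NaN cells, giving 5.0, 12.0 and 0.0.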
Your solution should be changed:
f = lambda x: sum(int(y) for y in x.split(';') if not y in ('nan','NaN'))
df['Sum'] = df['Column4'].apply(f)
because if you convert to float, any row containing NaN alongside another numeric value sums to NaN:
df['Sum'] = df['Column4'].apply(lambda x: sum(map(float, x.split(';'))))
print (df)
Column4 Sum
0 NaN;NaN;1;4 NaN
1 4;8 12.0
2 nan NaN

Subtracting two clock times in pandas dataframe

I am trying to subtract two columns of a pandas data frame which contain normal clock times as strings, but somehow I am getting stuck.
I have tried converting each column to datetime using pd.to_datetime, but the subtraction still doesn't work.
import pandas as pd
df = pd.DataFrame()
df['A'] = ["12:30","5:30"]
df['B'] = ["19:30","9:30"]
df['A'] = pd.to_datetime(df['A']).dt.time
df['B'] = pd.to_datetime(df['B']).dt.time
df['time_diff'] = df['B'] - df['A']
I am expecting the actual time difference between two clock times.
You should use to_timedelta:
df['A'] = pd.to_timedelta(df['A']+':00')
df['B'] = pd.to_timedelta(df['B']+':00')
df['time_diff'] = df['B'] - df['A']
df
Out[21]:
A B time_diff
0 12:30:00 19:30:00 07:00:00
1 05:30:00 09:30:00 04:00:00
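If you then want the gap as plain numbers rather than timedeltas, dt.total_seconds converts it (a small follow-up sketch):
df['hours'] = df['time_diff'].dt.total_seconds() / 3600
print(df['hours'])
# 0    7.0
# 1    4.0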
I tried the following method; this also worked for me. Converting to seconds and dividing by 3600 gives the time in hours.
df = pd.DataFrame()
df['A'] = ["12:30","5:30"]
df['B'] = ["19:30","9:30"]
df['time_diff_minutes'] = (pd.to_datetime(df['B']) -
pd.to_datetime(df['A'])).astype('timedelta64[s]')/60
df['time_diff_hours'] = df['time_diff_minutes']/60
df
Out[161]:
A B time_diff_minutes time_diff_hours
0 12:30 19:30 420.0 7.0
1 5:30 9:30 240.0 4.0

Python: subtracting two date columns from a csv to get the number of weeks/months?

I have a csv with two columns representing a start date (st_dt) and an end date (end_dt). I have to subtract these columns to get the number of weeks. I tried iterating through the columns using pandas, but it seems my output is wrong.
st_dt end_dt
---------------------------------------
20100315 20100431
Use read_csv with parse_dates for datetimes, and then subtract to get the difference in days:
df = pd.read_csv(file, parse_dates=[0,1])
print (df)
st_dt end_dt
0 2010-03-15 2010-04-30
df['diff'] = (df['end_dt'] - df['st_dt']).dt.days
print (df)
st_dt end_dt diff
0 2010-03-15 2010-04-30 46
If some dates are invalid, like 20100431, use to_datetime with the parameter errors='coerce' to convert them to NaT:
df = pd.read_csv(file)
print (df)
st_dt end_dt
0 20100315 20100431
1 20100315 20100430
df['st_dt'] = pd.to_datetime(df['st_dt'], errors='coerce', format='%Y%m%d')
df['end_dt'] = pd.to_datetime(df['end_dt'], errors='coerce', format='%Y%m%d')
df['diff'] = (df['end_dt'] - df['st_dt']).dt.days
print (df)
st_dt end_dt diff
0 2010-03-15 NaT NaN
1 2010-03-15 2010-04-30 46.0
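Since the question asks for weeks, you can divide the day difference by 7 (a sketch continuing the example above; months have no fixed length, so they would need a separate convention):
df['weeks'] = df['diff'] / 7
print(df[['diff', 'weeks']])
#    diff     weeks
# 0   NaN       NaN
# 1  46.0  6.571429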
