Create a list of duplicate index entries in pandas dataframe - python-3.x

I am trying to identify which time stamps in my index have duplicates. I want to create a list of the time stamp strings. I would like to return a single timestamp for each of the time stamps that have duplicates if possible.
#required packages
import os
import pandas as pd
import numpy as np
import datetime
# create sample time series
header = ['A','B','C','D','E']
period = 5
cols = len(header)
dates = pd.date_range('1/1/2000', periods=period, freq='10min')
dates2 = pd.date_range('1/1/2022', periods=period, freq='10min')
df = pd.DataFrame(np.random.randn(period,cols),index=dates,columns=header)
df0 = pd.DataFrame(np.random.randn(period,cols),index=dates2,columns=header)
df1 = pd.concat([df]*3) #creates duplicate entries by copying the dataframe
df1 = pd.concat([df1, df0])
df2 = df1.sample(frac=1) #shuffles the dataframe
df3 = df1.sort_index() #sorts the dataframe by index
print(df2)
#print(df3)
# Identifying duplicated entries
df4 = df2.duplicated()
print(df4)
I would like to then use the list call out all the duplicate entries for each time stamp. From the code above, is there a good way to call the index that correlates to a bool type that is false?
Edit: added an extra dataframe to create some unique values and tripled the first data frame to create more than a single repeat.Also added more detail to the question.

IIUC:
df4[~df4]
Output:
2000-01-01 00:10:00 False
2000-01-01 00:00:00 False
2000-01-01 00:40:00 False
2000-01-01 00:30:00 False
2000-01-01 00:20:00 False
dtype: bool
List of timestamps,
df4[~df4].index.tolist()
Output:
[Timestamp('2000-01-01 00:10:00'),
Timestamp('2000-01-01 00:00:00'),
Timestamp('2000-01-01 00:40:00'),
Timestamp('2000-01-01 00:30:00'),
Timestamp('2000-01-01 00:20:00')]

In [46]: df2.drop_duplicates()
Out[46]:
A B C D E
2000-01-01 00:00:00 0.932587 -1.508587 -0.385396 -0.692379 2.083672
2000-01-01 00:40:00 0.237324 -0.321555 -0.448842 -0.983459 0.834747
2000-01-01 00:20:00 1.624815 -0.571193 1.951832 -0.642217 1.744168
2000-01-01 00:30:00 0.079106 -1.290473 2.635966 1.390648 0.206017
2000-01-01 00:10:00 0.760976 0.643825 -1.855477 -1.172241 0.532051
In [47]: df2.drop_duplicates().index.tolist()
Out[47]:
[Timestamp('2000-01-01 00:00:00'),
Timestamp('2000-01-01 00:40:00'),
Timestamp('2000-01-01 00:20:00'),
Timestamp('2000-01-01 00:30:00'),
Timestamp('2000-01-01 00:10:00')]

Related

Pandas [.dt] property vs to_datetime

The question is intended to build an understandable grasp on subtle differences between .dt and pd.to_datetime
I want understand which method is suited/preferred and if one can be used as a defacto and other differences that are there between the two
values = {'date_time': ['20190902093000','20190913093000','20190921200000'],
}
df = pd.DataFrame(values, columns = ['date_time'])
df['date_time'] = pd.to_datetime(df['date_time'], format='%Y%m%d%H%M%S')
>>> df
date_time
0 2019-09-02 09:30:00
1 2019-09-13 09:30:00
2 2019-09-21 20:00:00
Using .dt
df['date'] = df['date_time'].dt.date
>>> df
date_time date
0 2019-09-02 09:30:00 2019-09-02
1 2019-09-13 09:30:00 2019-09-13
2 2019-09-21 20:00:00 2019-09-21
>>> df.dtypes
date_time datetime64[ns]
date object
dtype: object
>>> df.date.values
array([datetime.date(2019, 9, 2), datetime.date(2019, 9, 13),
datetime.date(2019, 9, 21)], dtype=object)
Using .dt , even though the elements are individually datetime , is inferred as object in the 'DataFrame` , which sometimes is suited but mostly its causes a lot of problems down the line and an implicit conversion is inevitable
Using pd.to_datetime
df['date_to_datetime'] = pd.to_datetime(df['date'],format='%Y-%m-%d')
>>> df.dtypes
date_time datetime64[ns]
date object
date_to_datetime datetime64[ns]
>>> df.date_to_datetime.values
array(['2019-09-02T00:00:00.000000000', '2019-09-13T00:00:00.000000000',
'2019-09-21T00:00:00.000000000'], dtype='datetime64[ns]')
Using pd.to_datetime , natively returns a datetime64[ns] array and inferred the same in the DataFrame , which in my experience is consistent and widely used , when dealing with dates using pandas
I m aware of the fact a native Date dtype does not exist in pandas , and is wrapped around datetime64[ns]
The two concepts are quite different.
pandas.to_datetime() is a function that can take a variety of inputs and convert them to a pandas datetime index. For example:
dates = pd.to_datetime([1610290846000000000, '2020-01-11', 'Jan 12 2020 2pm'])
print(dates)
# DatetimeIndex(['2021-01-10 15:00:46', '2020-01-11 00:00:00',
# '2020-01-12 14:00:00'],
# dtype='datetime64[ns]', freq=None)
pandas.Series.dt is an interface on a pandas series that gives you convenient access to operations on data stored as a pandas datetime. For example:
x = pd.Series(dates)
print(x.dt.date)
# 0 2021-01-10
# 1 2020-01-11
# 2 2020-01-12
# dtype: object
print(x.dt.hour)
# 0 15
# 1 0
# 2 14
# dtype: int64

How to set datetime format for pandas dataframe column labels?

IPNI_RNC PATHID 2020-11-11 00:00:00 2020-11-12 00:00:00 2020-11-13 00:00:00 2020-11-14 00:00:00 2020-11-15 00:00:00 2020-11-16 00:00:00 2020-11-17 00:00:00 Last Day Violation Count
Above are the columns label after reading the excel file. There are 10 columns in df variable after reading the excel and 7 of the columns label are date.
My input data set is an excel file which changes everyday and I want to update it automatically. In excel, some columns label are date like 11-Nov-2020, 12-Nov-2020 but after reading the excel it becomes like 2020-11-11 00:00:00, 2020-11-12 00:00:00. I want to keep column labels as 11-Nov-2020, 12-Nov-2020 while reading excel by pd.read_excel if possible or I need to convert it later.
I am very new in python. Looking forward for your support
Thanks who have already came forward to cooperate me
You can of course use the standard python methods to parse the date values, but I would not recommend it, because this way you end up with python datetime objects and not with the pandas representation of dates. That means, it consumes more space, is probably not as efficient and you can't use the pandas methods to access e.g. the year. I'll show you, what I mean below.
In case you want to avoid the naming issue of your column names, you might want to try to prevent pandas to automatically assign the names and read the first line as data to fix it yourselfe automatically (see the section below about how you can do it).
The type conversion part:
# create a test setup with a small dataframe
import pandas as pd
from datetime import date, datetime, timedelta
df= pd.DataFrame(dict(id=range(10), date_string=[str(datetime.now()+ timedelta(days=d)) for d in range(10)]))
# test the python way:
df['date_val_python']= df['date_string'].map(lambda dt: str(dt))
# use the pandas way: (btw. if you want to explicitely
# specify the format, you can use the format= keyword)
df['date_val_pandas']= pd.to_datetime(df['date_string'])
df.dtypes
The output is:
id int64
date_string object
date_val_python object
date_val_pandas datetime64[ns]
dtype: object
As you can see date_val has type object, this is because it contains python objects of class datetime while date_val_pandas uses the internal datetime representation of pandas. You can now try:
df['date_val_pandas'].dt.year
# this will return a series with the year part of the date
df['date_val_python'].dt.year
# this will result in the following error:
AttributeError: Can only use .dt accessor with datetimelike values
See the pandas doc for to_datetime for more details.
The column naming part:
# read your dataframe as usual
df= pd.read_excel('c:/scratch/tmp/dates.xlsx')
rename_dict= dict()
for old_name in df.columns:
if hasattr(old_name, 'strftime'):
new_name= old_name.strftime('DD-MMM-YYYY')
rename_dict[old_name]= new_name
if len(rename_dict) > 0:
df.rename(columns=rename_dict, inplace=True)
This works, in case your column titles are stored as usual dates, which I suppose is true, because you get a time part after importing them.
strftime of the datetime module is the function you need:
If datetime is a datetime object, you can do
datetime.strftime("%d-%b-%Y")
Example:
>>> from datetime import datetime
>>> timestamp = 1528797322
>>> date_time = datetime.fromtimestamp(timestamp)
>>> print(date_time)
2018-06-12 11:55:22
>>> print(date_time.strftime("%d-%b-%Y"))
12-Jun-2018
In order to apply a function to certain dataframe columns, use:
datetime_cols_list = ['datetime_col1', 'datetime_col2', ...]
for col in dataframe.columns:
if col in datetime_cols_list:
dataframe[col] = dataframe[col].apply(lambda x: x.strftime("%d-%b-%Y"))
I am sure this can be done in multiple ways in pandas, this is just what came out the top of my head.
Example:
import pandas as pd
import numpy as np
np.random.seed(0)
# generate some random datetime values
rng = pd.date_range('2015-02-24', periods=5, freq='T')
other_dt_col = rng = pd.date_range('2016-02-24', periods=5, freq='T')
df = pd.DataFrame({ 'Date': rng, 'Date2': other_dt_col,'Val': np.random.randn(len(rng)) })
print (df)
# Output:
# Date Date2 Val
# 0 2016-02-24 00:00:00 2016-02-24 00:00:00 1.764052
# 1 2016-02-24 00:01:00 2016-02-24 00:01:00 0.400157
# 2 2016-02-24 00:02:00 2016-02-24 00:02:00 0.978738
# 3 2016-02-24 00:03:00 2016-02-24 00:03:00 2.240893
# 4 2016-02-24 00:04:00 2016-02-24 00:04:00 1.867558
datetime_cols_list = ['Date', 'Date2']
for col in df.columns:
if col in datetime_cols_list:
df[col] = df[col].apply(lambda x: x.strftime("%d-%b-%Y"))
print (df)
# Output:
# Date Date2 Val
# 0 24-Feb-2016 24-Feb-2016 1.764052
# 1 24-Feb-2016 24-Feb-2016 0.400157
# 2 24-Feb-2016 24-Feb-2016 0.978738
# 3 24-Feb-2016 24-Feb-2016 2.240893
# 4 24-Feb-2016 24-Feb-2016 1.867558

Subtracting two clock times in pandas dataframe

I am trying to subtract two columns of a pandas data frame which contain normal clock times as strings, but somehow I am getting struck.
I have tried converting each column to datetime using pandas.datetime, but still the subtraction doesn't work.
import pandas as pd
df = pd.DataFrame()
df['A'] = ["12:30","5:30"]
df['B'] = ["19:30","9:30"]
df['A'] = pd.to_datetime(df['A']).dt.time
df['B'] = pd.to_datetime(df['B']).dt.time
df['time_diff'] = df['B'] - df['A']
I am expecting the actual time difference between two clock times.
You should using to_timedelta
df['A'] = pd.to_timedelta(df['A']+':00')
df['B'] = pd.to_timedelta(df['B']+':00')
df['time_diff'] = df['B'] - df['A']
df
Out[21]:
A B time_diff
0 12:30:00 19:30:00 07:00:00
1 05:30:00 09:30:00 04:00:00
I tried the following method. This also worked for me. Divide by 3600 to get the time in hours.
df = pd.DataFrame()
df['A'] = ["12:30","5:30"]
df['B'] = ["19:30","9:30"]
df['time_diff_minutes'] = (pd.to_datetime(df['B']) -
pd.to_datetime(df['A'])).astype('timedelta64[s]')/60
df['time_diff_hours'] = df['time_diff_minutes']/60
df
Out[161]:
A B time_diff_minutes time_diff_hours
0 12:30 19:30 420.0 7.0
1 5:30 9:30 240.0 4.0

How to get all indexes which had a particular value in last row of a Pandas DataFrame?

For a sample DataFrame like,
>>> import pandas as pd
>>> index = pd.date_range(start='1/1/2018', periods=6, freq='15T')
>>> data = ['ON_PEAK', 'OFF_PEAK', 'ON_PEAK', 'ON_PEAK', 'OFF_PEAK', 'OFF_PEAK']
>>> df = pd.DataFrame(data, index=index, columns=['tou'])
>>> df
tou
2018-01-01 00:00:00 ON PEAK
2018-01-01 00:15:00 OFF PEAK
2018-01-01 00:30:00 ON PEAK
2018-01-01 00:45:00 ON PEAK
2018-01-01 01:00:00 OFF PEAK
2018-01-01 01:15:00 OFF PEAK
How to get all indexes for which tou value is not ON_PEAK but of row before them is ON_PEAK, i.e. the output would be:
['2018-01-01 00:15:00', '2018-01-01 01:00:00']
Or, if it's easier to get all rows with ON_PEAK and the first row next to them, i.e
['2018-01-01 00:00:00', '2018-01-01 00:15:00', '2018-01-01 00:30:00', '2018-01-01 00:45:00', '2018-01-01 01:00:00']
You need to find rows where tou is not ON_PEAK and the previous tou found using pandas.shift() is ON_PEAK. Note that positive values in shift give nth previous values and negative values gives nth next value in the dataframe.
df.loc[(df['tou']!='ON_PEAK') & (df['tou'].shift(1)=='ON_PEAK')]
Output:
tou
2018-01-01 00:15:00 OFF_PEAK
2018-01-01 01:00:00 OFF_PEAK

Pandas: How do I get the key (index) of the first and last row of a dataframe

I have a dataframe (df) with a datetime index and one field "myfield"
I want to find out the first and last datetime of the dataframe.
I can access the first and last dataframe element like this:
df.iloc[0]
df.iloc[-1]
for df.iloc[0] I get the result:
myfield myfieldcontent
Name: 2017-07-24 00:00:00, dtype: float
How can I access the datetime of the row?
You can use select index by [0] or [-1]:
df = pd.DataFrame({'myfield':[1,4,5]}, index=pd.date_range('2015-01-01', periods=3))
print (df)
myfield
2015-01-01 1
2015-01-02 4
2015-01-03 5
print (df.iloc[-1])
myfield 5
Name: 2015-01-03 00:00:00, dtype: int64
print (df.index[0])
2015-01-01 00:00:00
print (df.index[-1])
2015-01-03 00:00:00
If you are using pandas 1.1.4 or higher, you can use "name" attribute.
import pandas as pd
df = pd.DataFrame({'myfield': [1, 4, 5]}, index=pd.date_range('2015-01-01', periods=3))
df = df.reset_index()
print("Index value: ", df.iloc[-1].name) #pandas-series
#Convert to python datetime
print("Index datetime: ", df.iloc[-1].name.to_pydatetime())
jezrael's answer is perfect. Just to provide an alternative, if you insist on using loc then you should first reset_index.
import pandas as pd
df = pd.DataFrame({'myfield': [1, 4, 5]}, index=pd.date_range('2015-01-01', periods=3))
df = df.reset_index()
print df['index'].iloc[0]
print df['index'].iloc[-1]

Resources