Pandas: How do I get the key (index) of the first and last row of a dataframe - python-3.x

I have a dataframe (df) with a datetime index and one field "myfield"
I want to find out the first and last datetime of the dataframe.
I can access the first and last dataframe element like this:
df.iloc[0]
df.iloc[-1]
for df.iloc[0] I get the result:
myfield myfieldcontent
Name: 2017-07-24 00:00:00, dtype: float
How can I access the datetime of the row?

You can select the first or last value of the index with [0] or [-1]:
df = pd.DataFrame({'myfield':[1,4,5]}, index=pd.date_range('2015-01-01', periods=3))
print (df)
myfield
2015-01-01 1
2015-01-02 4
2015-01-03 5
print (df.iloc[-1])
myfield 5
Name: 2015-01-03 00:00:00, dtype: int64
print (df.index[0])
2015-01-01 00:00:00
print (df.index[-1])
2015-01-03 00:00:00
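A compact, self-contained version of the approach above, as a sketch. The variable names first_ts, last_ts, earliest and latest are illustrative, and the min()/max() variant is an addition that assumes you want the earliest and latest timestamps even when the index is not sorted.

```python
import pandas as pd

df = pd.DataFrame({'myfield': [1, 4, 5]},
                  index=pd.date_range('2015-01-01', periods=3))

# Positional access on the index itself returns the label (a pandas Timestamp)
first_ts = df.index[0]
last_ts = df.index[-1]

# If the index might be unsorted, min()/max() give the earliest/latest label
earliest, latest = df.index.min(), df.index.max()

print(first_ts, last_ts)
```

The row returned by df.iloc[0] is a Series, and its name attribute carries the same label as df.index[0].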

If you are using pandas 1.1.4 or higher, you can use the "name" attribute of the row. Do not reset the index first, otherwise the name is just the integer position.
import pandas as pd
df = pd.DataFrame({'myfield': [1, 4, 5]}, index=pd.date_range('2015-01-01', periods=3))
print("Index value: ", df.iloc[-1].name)  # pandas Timestamp
# Convert to python datetime
print("Index datetime: ", df.iloc[-1].name.to_pydatetime())

jezrael's answer is perfect. Just to provide an alternative: if you insist on using iloc, you should first call reset_index, so that the datetimes become a regular column named 'index'.
import pandas as pd
df = pd.DataFrame({'myfield': [1, 4, 5]}, index=pd.date_range('2015-01-01', periods=3))
df = df.reset_index()
print(df['index'].iloc[0])
print(df['index'].iloc[-1])

Related

Pandas [.dt] property vs to_datetime

The question is intended to build a clear understanding of the subtle differences between .dt and pd.to_datetime.
I want to understand which method is preferred, whether one can be used as the de facto choice, and what other differences exist between the two.
values = {'date_time': ['20190902093000','20190913093000','20190921200000'],
}
df = pd.DataFrame(values, columns = ['date_time'])
df['date_time'] = pd.to_datetime(df['date_time'], format='%Y%m%d%H%M%S')
>>> df
date_time
0 2019-09-02 09:30:00
1 2019-09-13 09:30:00
2 2019-09-21 20:00:00
Using .dt
df['date'] = df['date_time'].dt.date
>>> df
date_time date
0 2019-09-02 09:30:00 2019-09-02
1 2019-09-13 09:30:00 2019-09-13
2 2019-09-21 20:00:00 2019-09-21
>>> df.dtypes
date_time datetime64[ns]
date object
dtype: object
>>> df.date.values
array([datetime.date(2019, 9, 2), datetime.date(2019, 9, 13),
datetime.date(2019, 9, 21)], dtype=object)
With .dt.date, each element is a Python datetime.date object, so the column is stored with dtype object in the DataFrame. That is sometimes suitable, but it mostly causes problems down the line, and a later conversion is inevitable.
Using pd.to_datetime
df['date_to_datetime'] = pd.to_datetime(df['date'],format='%Y-%m-%d')
>>> df.dtypes
date_time datetime64[ns]
date object
date_to_datetime datetime64[ns]
>>> df.date_to_datetime.values
array(['2019-09-02T00:00:00.000000000', '2019-09-13T00:00:00.000000000',
'2019-09-21T00:00:00.000000000'], dtype='datetime64[ns]')
pd.to_datetime natively returns a datetime64[ns] array, and the DataFrame column keeps that dtype, which in my experience is consistent and widely used when dealing with dates in pandas.
I am aware that a native date dtype does not exist in pandas and that dates are represented as datetime64[ns].
The two concepts are quite different.
pandas.to_datetime() is a function that can take a variety of inputs and convert them to a pandas datetime index. For example:
dates = pd.to_datetime([1610290846000000000, '2020-01-11', 'Jan 12 2020 2pm'])
print(dates)
# DatetimeIndex(['2021-01-10 15:00:46', '2020-01-11 00:00:00',
# '2020-01-12 14:00:00'],
# dtype='datetime64[ns]', freq=None)
pandas.Series.dt is an interface on a pandas series that gives you convenient access to operations on data stored as a pandas datetime. For example:
x = pd.Series(dates)
print(x.dt.date)
# 0 2021-01-10
# 1 2020-01-11
# 2 2020-01-12
# dtype: object
print(x.dt.hour)
# 0 15
# 1 0
# 2 14
# dtype: int64
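If the goal is to drop the time of day but keep a datetime64 column, one option is Series.dt.normalize(), which sets the time to midnight without leaving the datetime dtype. The following is a minimal sketch that reuses the question's data; the column name date_midnight is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'date_time': ['20190902093000', '20190913093000', '20190921200000']})
df['date_time'] = pd.to_datetime(df['date_time'], format='%Y%m%d%H%M%S')

# .dt.date yields Python datetime.date objects, so the column dtype is object
df['date'] = df['date_time'].dt.date

# .dt.normalize() resets the time to 00:00:00 and the column stays datetime64
df['date_midnight'] = df['date_time'].dt.normalize()

print(df.dtypes)
```

This keeps the .dt accessor available on the new column, which is not the case for the object column produced by .dt.date.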

how to set datetime type index for weekly column in pandas dataframe

I have a data as given below:
date product price amount
201901 A 10 20
201902 A 10 20
201903 A 20 30
201904 C 40 50
This data is saved in test.txt file.
The date column holds weekly values, written as the year followed by the week number. I am trying to set the date column as the index with the following code:
import pandas as pd
import numpy as np
data=pd.read_csv("test.txt", sep="\t", parse_dates=['date'])
But it gives an error. How can I set the date column as an index with datetime type?
Use index_col parameter for setting index:
data=pd.read_csv("test.txt", sep="\t", index_col=[0])
EDIT: Using column name as index:
data=pd.read_csv("test.txt", sep="\t", index_col=['date'])
To convert the index from int to datetime, do this (note that '%Y%m' reads the last two digits as a month, not as a week number):
data.index = pd.to_datetime(data.index, format='%Y%m')
There might be simpler solutions than this. Using apply, I first converted your Year-WeekId values into full dates, and then simply used set_index to make date the index.
import pandas as pd
data ={
'date' : [201901,201902,201903,201904,201905],
'product' : ['A','A','A','C','C'],
'price' : [10,10,10,20,20],
'amount' : [20,20,30,50,60]
}
df = pd.DataFrame(data)
# str(x)+'1' appends a weekday to get Year-WeekId-Weekday; 1 represents Monday,
# so '2019021' means 2019, week 2, Monday.
# If you want, you can try other formats too
df['date'] = df['date'].apply(lambda x: pd.to_datetime(str(x)+'1',format='%Y%W%w'))
df.set_index(['date'],inplace=True)
df
Edit:
To see the datetime in Year-WeekID format, you can style the dataframe as follows. However, if you have already set date as the index, the following code won't work. Also remember that it only applies styling for display purposes; internally the values remain datetime objects.
df['date'] = df['date'].apply(lambda x: pd.to_datetime(str(x)+'1',format='%Y%W%w'))
style_format = {'date':'{:%Y%W}'}
df.style.format(style_format)
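The week-based parsing used in this answer can be checked with the standard library alone. The sketch below compares the %W convention (weeks start on Monday, and days before the first Monday of the year belong to week 0) with ISO 8601 week numbering. The question does not say which scheme the data follows, so that is an assumption to verify against your source. The helper names are hypothetical, and date.fromisocalendar requires Python 3.8 or later.

```python
from datetime import date, datetime


def monday_of_week(year_week):
    """Parse e.g. 201902 using %W, appending weekday 1 (Monday)."""
    return datetime.strptime(f"{year_week}1", "%Y%W%w").date()


def monday_of_iso_week(year_week):
    """Same input, interpreted as an ISO 8601 year and week number."""
    year, week = divmod(int(year_week), 100)
    return date.fromisocalendar(year, week, 1)


print(monday_of_week(201901))      # 2019-01-07
print(monday_of_iso_week(201901))  # 2018-12-31
```

The two conventions disagree by a week for 2019, because 1 January 2019 was a Tuesday: %W counts it as week 0, while ISO 8601 counts it as week 1.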
You can also use the date_parser parameter (deprecated since pandas 2.0 in favor of date_format):
import pandas as pd
from io import StringIO
from datetime import datetime
dateparse = lambda x: datetime.strptime(x, '%Y%m')
inputtxt = StringIO("""date product price amount
201901 A 10 20
201902 A 10 20
201903 A 20 30
201904 C 40 50""")
df = pd.read_csv(inputtxt, sep=r'\s+', parse_dates=['date'], date_parser=dateparse)
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 4 non-null datetime64[ns]
1 product 4 non-null object
2 price 4 non-null int64
3 amount 4 non-null int64
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 256.0+ bytes
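An approach that does not depend on date_parser is to read the column as text and convert it afterwards. The sketch below does that with the question's sample data and then sets the result as the index. It assumes the values are year plus %W-style week number, as in the earlier answer; the variable names raw and data are illustrative.

```python
from io import StringIO

import pandas as pd

raw = StringIO("""date product price amount
201901 A 10 20
201902 A 10 20
201903 A 20 30
201904 C 40 50""")

# Keep the date column as text so it can be parsed explicitly afterwards
data = pd.read_csv(raw, sep=r'\s+', dtype={'date': str})

# Append weekday 1 (Monday) and parse as year + week number + weekday
data['date'] = pd.to_datetime(data['date'] + '1', format='%Y%W%w')
data = data.set_index('date')

print(data.index)
```

If the values are really year plus month, replacing the conversion line with pd.to_datetime(data['date'], format='%Y%m') gives the month-based interpretation instead.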

How to set datetime format for pandas dataframe column labels?

IPNI_RNC PATHID 2020-11-11 00:00:00 2020-11-12 00:00:00 2020-11-13 00:00:00 2020-11-14 00:00:00 2020-11-15 00:00:00 2020-11-16 00:00:00 2020-11-17 00:00:00 Last Day Violation Count
Above are the column labels after reading the Excel file. There are 10 columns in df after reading the file, and 7 of the column labels are dates.
My input data set is an Excel file that changes every day, and I want to update it automatically. In Excel, some column labels are dates like 11-Nov-2020, 12-Nov-2020, but after reading the file they become 2020-11-11 00:00:00, 2020-11-12 00:00:00. I want to keep the column labels as 11-Nov-2020, 12-Nov-2020 while reading with pd.read_excel if possible; otherwise I need to convert them later.
I am very new to Python and look forward to your support.
Thanks to those who have already come forward to help me.
You can of course use the standard Python methods to parse the date values, but I would not recommend it, because you end up with Python datetime objects rather than the pandas representation of dates. That means it consumes more space, is probably less efficient, and you can't use the pandas methods to access e.g. the year. I'll show you what I mean below.
If you want to avoid the naming issue with your column names, you might try to prevent pandas from assigning the names automatically, read the first line as data, and fix the names yourself (see the section below on how to do it).
The type conversion part:
# create a test setup with a small dataframe
import pandas as pd
from datetime import date, datetime, timedelta
df= pd.DataFrame(dict(id=range(10), date_string=[str(datetime.now()+ timedelta(days=d)) for d in range(10)]))
# test the python way (here the values simply stay plain Python strings):
df['date_val_python']= df['date_string'].map(lambda dt: str(dt))
# use the pandas way: (btw. if you want to explicitly
# specify the format, you can use the format= keyword)
df['date_val_pandas']= pd.to_datetime(df['date_string'])
df.dtypes
The output is:
id int64
date_string object
date_val_python object
date_val_pandas datetime64[ns]
dtype: object
As you can see, date_val_python has type object, because it holds plain Python objects (in this example, strings), while date_val_pandas uses the internal datetime representation of pandas. You can now try:
df['date_val_pandas'].dt.year
# this will return a series with the year part of the date
df['date_val_python'].dt.year
# this will result in the following error:
# AttributeError: Can only use .dt accessor with datetimelike values
See the pandas doc for to_datetime for more details.
The column naming part:
# read your dataframe as usual
df= pd.read_excel('c:/scratch/tmp/dates.xlsx')
rename_dict= dict()
for old_name in df.columns:
    # only datetime-like column labels have a strftime method
    if hasattr(old_name, 'strftime'):
        new_name= old_name.strftime('%d-%b-%Y')
        rename_dict[old_name]= new_name
if len(rename_dict) > 0:
    df.rename(columns=rename_dict, inplace=True)
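The same renaming idea can be tried without an Excel file. The sketch below builds a small stand-in frame whose column labels mix strings and timestamps, roughly what the question describes after pd.read_excel, and passes a function to rename. The frame contents and the helper name format_label are made up for illustration, and the month abbreviation produced by %b depends on the active locale.

```python
import pandas as pd

# Stand-in for the frame returned by pd.read_excel: two labels are datetimes
df = pd.DataFrame(
    [[1, 2, 3, 4]],
    columns=['IPNI_RNC', pd.Timestamp('2020-11-11'),
             pd.Timestamp('2020-11-12'), 'Last Day'],
)


def format_label(label):
    # Only datetime-like labels have strftime; other labels pass through unchanged
    return label.strftime('%d-%b-%Y') if hasattr(label, 'strftime') else label


df = df.rename(columns=format_label)
print(list(df.columns))
```

Passing a callable to rename avoids building the intermediate dictionary, at the cost of being slightly less explicit about which labels change.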
This works if your column titles are stored as real dates, which I suppose is true, because you get a time part after importing them.
strftime of the datetime module is the function you need:
If dt is a datetime object, you can do
dt.strftime("%d-%b-%Y")
Example:
>>> from datetime import datetime
>>> timestamp = 1528797322
>>> date_time = datetime.fromtimestamp(timestamp)
>>> print(date_time)
2018-06-12 11:55:22
>>> print(date_time.strftime("%d-%b-%Y"))
12-Jun-2018
In order to apply a function to certain dataframe columns, use:
datetime_cols_list = ['datetime_col1', 'datetime_col2', ...]
for col in dataframe.columns:
    if col in datetime_cols_list:
        dataframe[col] = dataframe[col].apply(lambda x: x.strftime("%d-%b-%Y"))
I am sure this can be done in multiple ways in pandas; this is just what came off the top of my head.
Example:
import pandas as pd
import numpy as np
np.random.seed(0)
# generate some datetime values and random numbers
rng = pd.date_range('2015-02-24', periods=5, freq='min')
other_dt_col = pd.date_range('2016-02-24', periods=5, freq='min')
df = pd.DataFrame({ 'Date': rng, 'Date2': other_dt_col,'Val': np.random.randn(len(rng)) })
print (df)
# Output:
# Date Date2 Val
# 0 2015-02-24 00:00:00 2016-02-24 00:00:00 1.764052
# 1 2015-02-24 00:01:00 2016-02-24 00:01:00 0.400157
# 2 2015-02-24 00:02:00 2016-02-24 00:02:00 0.978738
# 3 2015-02-24 00:03:00 2016-02-24 00:03:00 2.240893
# 4 2015-02-24 00:04:00 2016-02-24 00:04:00 1.867558
datetime_cols_list = ['Date', 'Date2']
for col in df.columns:
    if col in datetime_cols_list:
        df[col] = df[col].apply(lambda x: x.strftime("%d-%b-%Y"))
print (df)
# Output:
# Date Date2 Val
# 0 24-Feb-2015 24-Feb-2016 1.764052
# 1 24-Feb-2015 24-Feb-2016 0.400157
# 2 24-Feb-2015 24-Feb-2016 0.978738
# 3 24-Feb-2015 24-Feb-2016 2.240893
# 4 24-Feb-2015 24-Feb-2016 1.867558
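As a variation on the loop above, the datetime columns can be found by dtype and formatted with the vectorized .dt.strftime accessor instead of apply. This is a sketch with a small made-up frame; datetime_cols is an illustrative name, and the month abbreviation from %b depends on the active locale.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.date_range('2015-02-24', periods=3, freq='D'),
    'Val': [1.0, 2.0, 3.0],
})

# Pick every datetime64 column by dtype instead of listing names by hand
datetime_cols = df.select_dtypes(include=['datetime64']).columns

for col in datetime_cols:
    # Vectorized formatting; the result is a column of strings
    df[col] = df[col].dt.strftime('%d-%b-%Y')

print(df)
```

After this step the columns hold strings, so any date arithmetic should happen before the formatting.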

Pandas Create DataFrame with ColumnNames from a list

Considering the following list made up of sub-lists as elements, I need to create a pandas dataframe
import pandas as pd
data = [['tom', 10], ['nick', 15], ['juli', 14]]
The desired output is as follows, with the first element of each sub-list becoming a column name in the dataframe.
tom nick juli
0 10 15 14
Is there a way by which this output can be achieved?
Best Regards.
Use dictionary comprehension and pass to DataFrame constructor:
print ({x[0]: x[1:] for x in data})
{'tom': [10], 'nick': [15], 'juli': [14]}
df = pd.DataFrame({x[0]: x[1:] for x in data})
print (df)
tom nick juli
0 10 15 14
You could also use dict + extended iterable unpacking:
import pandas as pd
data = [['tom', 10], ['nick', 15], ['juli', 14]]
result = pd.DataFrame(dict((column, values) for column, *values in data))
print(result)
Output
tom nick juli
0 10 15 14
We can also do:
pd.DataFrame(data).set_index(0).T
0 tom nick juli
1 10 15 14
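One more variant, as a sketch: when every sub-list is exactly a [name, value] pair, as in the question, dict(data) already produces the mapping, and wrapping it in a list yields a single row with the default index 0. This shortcut does not apply if the sub-lists carry more than one value.

```python
import pandas as pd

data = [['tom', 10], ['nick', 15], ['juli', 14]]

# dict(data) works because every sub-list is a [name, value] pair;
# wrapping it in a list makes it a single row
df = pd.DataFrame([dict(data)])
print(df)
```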

pandas timeseries select single date

Selecting a single date from a time series gives a KeyError.
Setup:
import pandas as pd
import numpy as np
ts = pd.DataFrame({'date': pd.date_range(start = '1/1/2017', periods = 5),
'observations': np.random.choice(range(0, 100), 5, replace = True)}).set_index('date')
Dataframe:
observations
date
2017-01-01 58
2017-01-02 88
2017-01-03 53
2017-01-04 4
2017-01-05 26
How do I select the number of observations for a single date?
ts['2017-01-01']
Returns: KeyError: '2017-01-01'
But...
ts['2017-01-01':'2017-01-01']
...seems to work just fine.
Any suggestions how to select/subset with a single date?
As @scnerd pointed out, when you do ts['2017-01-01'], pandas tries to find '2017-01-01' as a column name of the dataframe ts, which gives you a KeyError because none of the columns in ts has this name.
To look up a label in the index (in your example, 'date' is set as the index), you need to use the loc method, such as ts.loc['2017-01-01'], and you will get (the value differs from the question's table because the data is random):
observations 54
Name: 2017-01-01 00:00:00, dtype: int32
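A reproducible sketch of the lookup, using the fixed values from the question's table in place of random numbers. The .at call is an addition for fetching one scalar directly; it assumes the label is passed as a Timestamp.

```python
import pandas as pd

# Fixed values instead of random ones so the result is reproducible
ts = pd.DataFrame(
    {'observations': [58, 88, 53, 4, 26]},
    index=pd.date_range(start='2017-01-01', periods=5, name='date'),
)

# Label-based lookup on the index returns the whole row as a Series
row = ts.loc['2017-01-01']

# .at fetches a single cell by row label and column name
value = ts.at[pd.Timestamp('2017-01-01'), 'observations']

print(row['observations'], value)
```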
