How to slice a pandas.DatetimeIndex? - python-3.x

What is the best way to get dates between, say, '2019-01-08' and '2019-01-16', from the pandas.DatetimeIndex object dti as constructed below? Ideally, some concise syntax like dti['2019-01-08':'2019-01-16']?
import pandas as pd
dti = pd.bdate_range(start='2019-01-01', end='2019-02-15')
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
               '2019-01-07', '2019-01-08', '2019-01-09', '2019-01-10',
               '2019-01-11', '2019-01-14', '2019-01-15', '2019-01-16',
               '2019-01-17', '2019-01-18', '2019-01-21', '2019-01-22',
               '2019-01-23', '2019-01-24', '2019-01-25', '2019-01-28',
               '2019-01-29', '2019-01-30', '2019-01-31', '2019-02-01',
               '2019-02-04', '2019-02-05', '2019-02-06', '2019-02-07',
               '2019-02-08', '2019-02-11', '2019-02-12', '2019-02-13',
               '2019-02-14', '2019-02-15'],
              dtype='datetime64[ns]', freq='B')

You can do it with slice_indexer on a DatetimeIndex:
pandas.DatetimeIndex.slice_indexer(start, end, step, [...])
It returns a slice of the integer positions of the matching datetimes, which you can pass back to dti.
Example:
dti[dti.slice_indexer("2019-01-07", "2019-01-17")]
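Run end-to-end, the example above looks like this (slice_indexer returns a plain slice of integer positions, with both endpoints included when they fall on index dates):

```python
import pandas as pd

dti = pd.bdate_range(start='2019-01-01', end='2019-02-15')

# slice_indexer returns a slice object over integer positions; both bounds are inclusive
idx = dti.slice_indexer("2019-01-07", "2019-01-17")
print(idx)       # slice(4, 13, None)
print(dti[idx])  # DatetimeIndex from 2019-01-07 through 2019-01-17
```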

If you read the source code for the DatetimeIndex.__getitem__ method, the individual dates in a DatetimeIndex are stored in a DatetimeArray. To support slicing, you would need to get the integer positions of the start and stop dates in that array. I suggest filing a feature request with the pandas development team.
Meanwhile, you can monkey-patch it in:
from pandas.core.indexes.datetimes import DatetimeIndex

__old_getitem = DatetimeIndex.__getitem__

def __new_getitem(index, key):
    if isinstance(key, slice):
        _key = index.slice_indexer(key.start, key.stop, key.step)
    else:
        _key = key
    return __old_getitem(index, _key)

DatetimeIndex.__getitem__ = __new_getitem

# Now you can slice
dti['2019-01-08':'2019-01-16':4]

Related

How to create a dictionary of dates as keys with value pair as list of three temperatures in python

The function extracts the max, min and avg temperatures for all days in the list. I want to combine the data into a dictionary, with the dates as keys and the returned temperatures as values. I can't seem to get this to work, and may be going about it the wrong way. The end aim is to create a chart with the date and the three temperatures for each day. I was anticipating something like: my_dict: {date: [list of 3 temps], date2: [list of 3 temps2], ...}
lstdates = ['09-27', '09-28', '09-29', '09-30', '10-1']

def daily_normals(date):
    """Daily Normals.

    Args:
        date (str): A date string in the format '%m-%d'

    Returns:
        A list of tuples containing the daily normals, tmin, tavg, and tmax
    """
    sel = [func.min(meas.tobs), func.avg(meas.tobs), func.max(meas.tobs)]
    return session.query(*sel).filter(func.strftime("%m-%d", meas.date) == date).all()

lstdaynorm = []
my_dict = {}
for i in lstdates:
    print(i)
    dn = daily_normals(l)
    lstdaynorm.append(dn)
    my_dict.append(i, dn)
For starters, a dict object has no method called append, so my_dict.append(i, dn) raises an AttributeError. Also, your loop variable is i, but you called daily_normals(l). You should convert the tuple dn to a list and insert that list directly into the dict to achieve what you want:
lstdaynorm = []
my_dict = {}
for i in lstdates:
    dn = daily_normals(i)
    lstdaynorm.append(dn)
    my_dict[i] = list(dn[0][1:])  # extract elements of the tuple excluding the date, as a list
my_dict = dict(my_dict)
To put this in a dataframe:
import pandas as pd
df = pd.DataFrame.from_dict(my_dict, orient='index', columns=['tmin', 'tavg', 'tmax'])
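As a self-contained illustration of that dict-to-DataFrame step (with made-up temperature values standing in for the database query):

```python
import pandas as pd

# Hypothetical output of the loop above: date -> [tmin, tavg, tmax]
my_dict = {'09-27': [61.0, 70.1, 78.0], '09-28': [60.0, 69.5, 77.0]}

# Each key becomes a row label, each list becomes that row's values
df = pd.DataFrame.from_dict(my_dict, orient='index', columns=['tmin', 'tavg', 'tmax'])
print(df)
```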

Multiple fields using Pandas and Quandl

I am using Quandl to download daily NAV prices for a specific set of mutual fund schemes. However, it returns a pandas object (a Series indexed by date) instead of the bare value.
import quandl
import pandas as pd

quandl.ApiConfig.api_key = <Quandl Key>

list2 = [102505, 129221, 102142, 103197, 100614, 100474, 102913, 102921]

def get_nav(mf_code):
    df_main = pd.DataFrame()
    code = str(mf_code)
    df_main = quandl.get("AMFI/" + code, start_date='2019-04-05', end_date='2019-04-05')
    return df_main['Net Asset Value']

for each in list2:
    mf_code = each
    nav = get_nav(mf_code)
    print(nav)
Output for the above code :
Date
2019-04-05 29.8916
Name: Net Asset Value, dtype: float64
Date
2019-04-05 19.354
Name: Net Asset Value, dtype: float64
whereas,
I am looking to extract only the values i.e. 29.8916, 19.354, etc
Updated code:
def get_nav(mf_code):
    nav1 = []
    df_main = pd.DataFrame()
    code = str(mf_code)
    # try:
    df_main = quandl.get("AMFI/" + code, start_date='2019-04-05', end_date='2019-04-05')
    nav_value = df_main['Net Asset Value']
    if not nav_value.empty:
        nav1 = nav_value[0]
    print(nav1)
    # print(df_main.head())
    # except IndexError:
    #     nav_value = 0
    return nav1

# Use merged sheet for work
df_port = pd.read_excel(fp_out)
df_port['Current Price'] = df_port['Scheme_Code'].apply(lambda x: get_nav(x))
print(df_port['Current Price'].head())
df_port.to_excel(fp_out2)
By default, the Quandl time-series API returns a DataFrame with the date as index, even when there is only one row.
If you only need the value of the first row, you can use iloc:
if not nav.empty:
    print(nav.iloc[0])
or plain positional indexing:
if not nav.empty:
    print(nav[0])
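A minimal sketch of the difference, using a hand-built one-row Series in place of the Quandl response:

```python
import pandas as pd

# What quandl.get(...)['Net Asset Value'] effectively returns: a Series indexed by date
nav = pd.Series([29.8916], index=pd.to_datetime(['2019-04-05']), name='Net Asset Value')

print(nav)              # the whole Series, index, name and dtype included
if not nav.empty:
    print(nav.iloc[0])  # just the scalar value
```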

A groupby and aggregation function is giving an unexpected result

Dear Stack Overflow Community,
I have built a customized groupby function in order to group records by the given fields and aggregate them by summation. The function is written as below:
from itertools import groupby
from operator import itemgetter

def group_by_field(data, fields):
    groups = []
    agg_group = itemgetter(*[item for item in fields])
    sorted_data = sorted(data, key=agg_group)
    for key, group in groupby(sorted_data, agg_group):
        dictionary = dict(zip([item for item in fields], key))
        dictionary["items"] = sum(item["items"] for item in group)
        groups.append(dictionary)
    return groups
My data is read from a gzipped JSON file as below,
import gzip
import json

with gzip.open('data.json', 'rb') as f:
    data = json.load(f)
And the second argument of my function is a tuple of variables presented in this way,
('bnf_name',)
or
('bnf_name', 'post_code',)
The function works well in both cases, but when I provide only one variable the grouped value comes back reduced to a single character (the attached image shows the results of both cases).
I would like to know why the function behaves this way, and would welcome any suggestion that solves the problem.
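For what it's worth, the behavior almost certainly comes from itemgetter: called with a single key it returns the bare value rather than a 1-tuple, so dict(zip(fields, key)) pairs the field name with the characters of the string. A minimal sketch (with a made-up record) plus the usual guard that restores the expected behavior:

```python
from operator import itemgetter

row = {'bnf_name': 'Aspirin', 'post_code': 'AB1 2CD', 'items': 3}

# Two fields: itemgetter returns a tuple, so zip pairs field names with values
assert itemgetter('bnf_name', 'post_code')(row) == ('Aspirin', 'AB1 2CD')

# One field: itemgetter returns the bare string, so zip iterates its characters
key = itemgetter('bnf_name')(row)
assert key == 'Aspirin'
print(dict(zip(('bnf_name',), key)))  # {'bnf_name': 'A'}

# Guard to apply inside group_by_field before building the dictionary:
key = key if isinstance(key, tuple) else (key,)
print(dict(zip(('bnf_name',), key)))  # {'bnf_name': 'Aspirin'}
```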

Pandas dataframe column float inside string (i.e. "float") to int

I'm trying to clean some data in a pandas df and I want the 'volume' column to go from a float to an int.
EDIT: The main issue was that the dtype of the float variable I was looking at was actually str, so it first needed to be converted to float before being cast to int.
I deleted the two other solutions I was considering and left the one I used. The top block is the one with the errors, and the bottom one is the solution.
import pandas as pd
import numpy as np

# Call the df
t_df = pd.DataFrame(client.get_info())
# isolate only the 'symbol' column in t_df
tickers = t_df.loc[:, ['symbol']]

def tick_data(tickers):
    for i in tickers:
        tick_df = pd.DataFrame(client.get_ticker())
        tick = tick_df.loc[:, ['symbol', 'volume']]
        tick.iloc[:, ['volume']].astype(int)
        if tick['volume'].dtype != np.number:
            print('yes')
        else:
            print('no')
    return tick
Below is the revised code:
import pandas as pd

# Call the df
def ticker():
    t_df = pd.DataFrame(client.get_info())
    # isolate only the 'symbol' column in t_df
    tickers = t_df.loc[:, ['symbol']]
    for i in tickers:
        # pulls out market data for each symbol
        tickers = pd.DataFrame(client.get_ticker())
        # isolates the symbol and volume
        tickers = tickers.loc[:, ['symbol', 'volume']]
        # floats volume
        tickers['volume'] = tickers.loc[:, ['volume']].astype(float)
        # volume to int
        tickers['volume'] = tickers.loc[:, ['volume']].astype(int)
        # keeps only symbols with volume >= 20,000, returns only symbol
        tickers = tickers.loc[tickers['volume'] >= 20000, 'symbol']
    return tickers
You have a few issues here.
In your first example, iloc only accepts integer locations for the rows and columns in the DataFrame, which is generating your error. I.e.
tick.iloc[:,['volume']].astype(int)
doesn't work. If you want label-based indexing, use .loc:
tick.loc[:,['volume']].astype(int)
Alternately, use bracket-based indexing, which allows you to take a whole column directly without using slice syntax (:) on the rows:
tick['volume'].astype(int)
Next, astype(int) returns a new value; it does not modify in place. So what you want is
tick['volume'] = tick['volume'].astype(int)
As for your dtype-is-a-number check: comparing with == np.number (or is np.number) doesn't do what you want, since the actual dtype is a concrete subtype like np.int64, not np.number itself. Use np.issubdtype, or pd.api.types.is_numeric_dtype, i.e.:
if np.issubdtype(tick['volume'].dtype, np.number):
or:
if pd.api.types.is_numeric_dtype(tick['volume'].dtype):
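Putting those pieces together in a self-contained sketch (with made-up string volumes standing in for the API response):

```python
import numpy as np
import pandas as pd

tick = pd.DataFrame({'symbol': ['AAA', 'BBB'], 'volume': ['25000.0', '19000.5']})

# volume arrives as strings, so it is not numeric yet
assert not pd.api.types.is_numeric_dtype(tick['volume'])

# float first (the strings hold floats), then int; astype returns a new object,
# so the result must be assigned back
tick['volume'] = tick['volume'].astype(float).astype(int)

assert np.issubdtype(tick['volume'].dtype, np.number)
print(tick['volume'].tolist())  # [25000, 19000]
```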

Sort by datetime in python3

Looking for help on how to sort a python3 dictionary by a datetime value (as shown below, a value in the dictionary) using the timestamp format below.
datetime: "2018-05-08T14:06:54-04:00"
Any help would be appreciated, spent a bit of time on this and know that to create the object I can do:
format = "%Y-%m-%dT%H:%M:%S"
# Make strptime obj from string minus the crap at the end
strpTime = datetime.datetime.strptime(ts[:-6], format)
# Create string of the pieces I want from obj (%M is minutes; %m would be month)
convertedTime = strpTime.strftime("%B %d %Y, %-I:%M %p")
But I'm unsure how to go about comparing that to the other values where it accounts for both day and time correctly, and cleanly.
Again, any nudges in the right direction would be greatly appreciated!
Thanks ahead of time.
Datetime instances support the usual ordering operators (< etc), so you should order in the datetime domain directly, not with strings.
Use a callable to convert your strings to timezone-aware datetime instances:
from datetime import datetime
def key(s):
    fmt = "%Y-%m-%dT%H:%M:%S%z"
    s = ''.join(s.rsplit(':', 1))  # remove colon from offset
    return datetime.strptime(s, fmt)
This key func can be used to correctly sort values:
>>> data = {'s1': "2018-05-08T14:06:54-04:00", 's2': "2018-05-08T14:05:54-04:00"}
>>> sorted(data.values(), key=key)
['2018-05-08T14:05:54-04:00', '2018-05-08T14:06:54-04:00']
>>> sorted(data.items(), key=lambda item: key(item[1]))
[('s2', '2018-05-08T14:05:54-04:00'), ('s1', '2018-05-08T14:06:54-04:00')]
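As an aside (not required by the answer above), on Python 3.7+ datetime.fromisoformat parses this exact format directly, colon in the offset and all, so the rsplit workaround can be dropped there:

```python
from datetime import datetime

ts = "2018-05-08T14:06:54-04:00"
dt = datetime.fromisoformat(ts)  # Python 3.7+: handles the ±HH:MM offset directly
print(dt)
print(dt.utcoffset())  # a timedelta of -4 hours
```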
