Index by date ranges in Python Pandas - python-3.x

I am new to Python pandas. The script below pulls time series data from an Excel file and sets the dates as the index; I then want to perform various calculations on the data, referencing it by date. Script:
df = pd.read_excel("myfile.xls")
df = df.set_index(df.Date)
df = df.drop("Date",1)
df.index.name = None
df.head()
The output of that (to give you a sense of the data) is:
Px1 Px2 Px3 Px4 Px5 Px6 Px7
2015-08-12 19.850000 10.25 7.88 10.90 109.349998 106.650002 208.830002
2015-08-11 19.549999 10.16 7.81 10.88 109.419998 106.690002 208.660004
2015-08-10 19.260000 10.07 7.73 10.79 109.059998 105.989998 210.630005
2015-08-07 19.240000 10.08 7.69 10.92 109.199997 106.430000 207.919998
2015-08-06 19.250000 10.09 7.76 10.96 109.010002 106.010002 208.350006
When I try to retrieve data for a single date, like df.loc['20150806'], that works, but when I try to retrieve a slice like df.loc['20150806':'20150812'] I get back an empty DataFrame.
Again, the index is a DatetimeIndex with dtype='datetime64[ns]', length=1412, freq=None, tz=None.
Like I said, my ultimate goal is to be able to group the data by day, month, year and other periods, and perform calculations on it. I mention that for context, but I don't want to get into it here since I'm clearly stuck on something more basic - perhaps a misunderstanding of how to work with a DatetimeIndex.
Thank you.
EDIT: I meant to also include that I think the main indexing problem I described has something to do with freq=None, because when I tried simpler examples with contiguous date series, I did not have this problem.

df.loc['2015-08-12':'2015-08-10'] and df.loc['2015-08-10':'2015-08-12':-1] both work. df = df.sort_index() followed by slicing the way I originally tried also works. Thank you all - I was missing the forest for the trees there.
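To make the resolution concrete, here is a minimal sketch (my own toy data, assuming a descending DatetimeIndex like the one shown above): label slicing needs a monotonic index, so either slice in the index's own descending order or sort ascending first.
import numpy as np
import pandas as pd

# toy frame with a descending DatetimeIndex, standing in for the real data
idx = pd.to_datetime(["2015-08-12", "2015-08-11", "2015-08-10", "2015-08-07", "2015-08-06"])
df = pd.DataFrame({"Px1": np.arange(5, dtype=float)}, index=idx)

# slicing in the index's own (descending) order works as-is
print(df.loc["2015-08-12":"2015-08-10"])

# sorting ascending lets you slice the "natural" way
df = df.sort_index()
print(df.loc["2015-08-06":"2015-08-12"])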

Related

Pandas converts some numbers into zeros or other fixed values

I'm using Python with pandas in Google Colab for some data analysis. I was analyzing the data through plots and noticed some missing data. However, when I looked at the original Excel data before any Python work, there was no missing data in those places. Somehow the first four days of a month of hourly data are being turned into zeros, but only for some of the files and some of the time periods. The zeros are also followed by a period of other constant values.
I have four similar data files, and two of them seem to work just fine, but the other two get these zeros at the start of SOME (consecutive) months, while nothing is wrong with the original data. Is there some feature in pandas that could cause some numbers to turn into zeros or other constant values? The same code is used for all the files, which are all in the same format.
I thought it could be just a problem with using 'resample' during plotting, but even when I just print the values without 'resample', the values are still missing. I included a figure here to show what the data problem looks like.
Function to read the data:
def read_elec_data(data_file_name):
    df = pd.read_excel(data_file_name)  # Read the original data
    # Convert the time value (30.11.2018 0:00-1:00) into a pandas-compatible timestamp format (2018-11-30 0:00)
    new = df["Päivämäärä ja tunti"].str.split("-", n=1, expand=True)  # Split the time column by the delimiter into two new columns [0, 1]. The ending hour [1] can be ignored.
    time_data = new[0]
    time_data_fixed = pd.to_datetime(time_data)  # Convert the modified time data into datetime format
    df['Aika'] = time_data_fixed  # Add the new time column to the dataframe
    # Remove all columns except the new timestamp and energy consumption columns. Rename the consumption column after the building name
    building_name = df['Kohde'][0]
    df.drop(columns=["Päivämäärä ja tunti", "Tunti", 'Kohde', 'Mittarin nimi'], inplace=True)  # Remove everything except the new timestamp and energy consumption
    df = df.rename(columns={'Kulutus[kWh]': building_name})
    df = df.set_index('Aika')  # Set the timestamp as the index for the final DataFrame that will be used in the calculations
    return df
Calling of the function:
all_electricity_data_list = []
for buildingname in list_of_electricity_data:
    df = read_elec_data(buildingname)  # Use the file reading and modification function
    all_electricity_data_list.append(df)
all_electricity_data = pd.concat(all_electricity_data_list, axis=1)
Some numbers are converted to zeros or other constant values even though the original data is fine:
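There is no accepted answer in this thread, but two things may be worth checking (both are my assumptions, not confirmed by the post): whether pd.to_datetime parses the day-first dates ("30.11.2018 0:00") consistently without an explicit hint, and whether any per-building frame has duplicated or missing timestamps that would misalign the columns when pd.concat(..., axis=1) aligns them on the index. A small sketch of both checks, reusing the variable names from the code above:
# inside read_elec_data: make the day-first parsing explicit
time_data_fixed = pd.to_datetime(time_data, dayfirst=True)

# before concatenating: look for duplicated timestamps or gaps per building
for building_df in all_electricity_data_list:
    print(building_df.columns[0],
          "duplicated timestamps:", building_df.index.duplicated().sum(),
          "missing values:", building_df.isna().sum().sum())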

fast date based replacement of rows in Pandas

I am on a quest to find the fastest way to replace rows by index in pandas.
I want to fill np.nan into all rows selected by index (a DatetimeIndex).
I tested various types of selection, but obviously, the bottleneck is setting the rows equal to a value (np.nan in my case).
Naively, I want to do this:
df['2017-01-01':'2018-01-01'] = np.nan
I tried and tested a performance of various other methods, such as
df.loc['2017-01-01':'2018-01-01'] = np.nan
And also creating a mask with NumPy to speed it up
df['DateTime'] = df.index
st = pd.to_datetime('2017-01-01', format='%Y-%m-%d').to_datetime64()
en = pd.to_datetime('2018-01-01', format='%Y-%m-%d').to_datetime64()
ge_start = df['DateTime'] >= st
le_end = df['DateTime'] <= en
mask = (ge_start & le_end )
and then
df[mask] = np.nan
#or
df.where(~mask)
But with no big success. I have a DataFrame (that I unfortunately cannot share) of size roughly (200, 1500000), so fairly big, and the operation takes on the order of seconds of CPU time, which is way too much in my opinion.
Would appreciate any ideas!
EDIT: after going through "Modifying a subset of rows in a pandas dataframe" and "Why dataframe.values is very slow", and unifying the datatypes for the operation, the problem is solved with roughly a 20x speedup.
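For anyone landing here later, a minimal sketch of what the edit describes (my reconstruction under those assumptions, not the poster's actual code): when every column shares a single dtype the assignment touches one contiguous block, and translating the date range into integer positions once avoids repeated label lookups.
import numpy as np
import pandas as pd

# toy frame standing in for the real (roughly 200 x 1,500,000) data
idx = pd.date_range("2016-01-01", periods=1_000_000, freq="min")
df = pd.DataFrame(np.random.rand(len(idx), 20), index=idx)

# unify dtypes so the frame is backed by a single float block
df = df.astype("float64")

# translate the date range into integer positions once (requires a sorted index)
start = df.index.searchsorted(pd.Timestamp("2017-01-01"), side="left")
end = df.index.searchsorted(pd.Timestamp("2018-01-01"), side="right")

# positional assignment on the single block
df.iloc[start:end] = np.nan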

Pandas Dataframe filter results with Merge. Encoding Decoding Issues

I have read tons of posts here over the last few days. With my limited skills in Python, pandas and NumPy, I am not sure I have found the answer I am looking for, so would you please take a look at my situation and see what I can do with it? And I am sorry about the Chinese characters in the search results.
I am currently writing some quant analysis for personal use. I retrieved a csv file via tushare-pro, which is a 3825-rows dataframe.
df1 = pd.DataFrame(pd.read_csv('stock_stats_ts.csv'))
data1 = np.array(df1.loc[:,:])
returns
[[300826 'N测绘' '建筑工程' ... 41.16 16.85 40171.0]
[2770 '科迪乳业' '乳制品' ... 21.05 4.38 47133.0]
[2503 '搜于特' '服饰' ... 8.6 3.08 65664.0]
...
[2260 '*ST德奥' '家用电器' ... 23.08 3.03 24704.0]
[995 '*ST皇台' '白酒' ... 68.05 -35.24 10275.0]
[939 '*ST凯迪' '新型电力' ... 10.79 -74.92 79373.0]]
I then narrow it down to the columns I want, such as code/name/esp/pb/roe:
df2 = df1.loc[:,['code','name','esp','pb','npr']]
data2 = np.array(df2.loc[:,:])
returns
[[300826 'N测绘' 1.08 2.79 16.85]
[2770 '科迪乳业' 0.03 2.13 4.38]
[2503 '搜于特' 0.098 2.17 3.08]
...
[2260 '*ST德奥' 0.034 0.0 3.03]
[995 '*ST皇台' -0.079 0.0 -35.24]
[939 '*ST凯迪' -0.362 0.0 -74.92]]
and I also have a list of stock names which i desire from previous session
df3 = pd.DataFrame(pd.read_csv('candidates.csv'))
data3 = np.array(df3.loc[:,['candidates']])
returns
[['维维股份']
['ST正源']
['美克家居']
['*ST金山']
['大有能源']
['好当家']
['贵州茅台']
['通策医疗']
['杭州解百']
['耀皮玻璃']
['梅花生物']
['金牌厨柜']
['继峰股份']
['胜利股份']
['渝 开 发']
['云南白药']
['中原环保']
['兴蓉环境']
['华闻集团']
['粤 水 电']
['濮耐股份']
['*ST东南']
['洪涛股份']
['达实智能']
['千红制药']
['闽发铝业']
['史丹利']
['加加食品']
['张家港行']
['国联水产']]
What I am sure of is that all of my candidates appear in df2's 'name' column. What lines of code do I need to filter df2 based on the values I have in df3?
Thanks to #Rexhil Regmi and #nimrodm, my question was solved perfectly with pd.merge. However, all those Chinese characters are encoded in 'gbk', which is unreadable in MS Excel. Any hints on how to change them into 'utf8'?
You can use a merge for this problem of yours.
new_df = df3.merge(df1, left_on='candidates', right_on='name', how='left')
This should give you what you are looking for.
The best solution is to join the two tables (similar to an SQL JOIN). By default, pandas merge performs an inner join, meaning you select only the rows that have a matching entry in the candidates data frame.
So, for example result.csv is
name,value
first,10
second,20
third,30
And selected.csv is
candidates
first
third
Read both of these as DataFrames (no need to convert to a numpy array):
data = pd.read_csv('result.csv')
selected = pd.read_csv('selected.csv')
And join the two (the how parameter is optional since inner is the default value for merge)
data.merge(selected, how='inner', left_on='name',right_on='candidates')
name value candidates
0 first 10 first
1 third 30 third
This joins the two DataFrames, keeping rows where data.loc[j, 'name'] == selected.loc[k, 'candidates'].
Another option
Another approach is to directly select lines where name (in my example) is in a given list:
data[data['name'].isin(selected['candidates'])]
name value
0 first 10
2 third 30
This is probably inefficient unless, perhaps, the candidates list is very short.
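The encoding part of the question was not answered above. As a hedged suggestion (my own, not from the original answers, and the output filename is hypothetical): pandas writes CSV in whatever encoding you pass to to_csv, and Excel generally needs a byte-order mark to detect UTF-8, so 'utf-8-sig' is usually the safer choice.
# write the merged result so that Excel recognizes the Chinese characters
new_df.to_csv('filtered_candidates.csv', index=False, encoding='utf-8-sig')
# if the source file itself is GBK-encoded, pass that encoding when reading
df1 = pd.read_csv('stock_stats_ts.csv', encoding='gbk')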

Changing column datatype from Timestamp to datetime64

I have a database I'm reading from Excel into a pandas DataFrame, and the dates come in as Timestamp dtype, but I need them to be np.datetime64 so that I can make calculations.
I am aware that the function pd.to_datetime() and the astype('datetime64[ns]') method do work. However, I am unable to update my DataFrame to yield this datatype, for whatever reason, using the code mentioned above.
I have also tried creating an auxiliary DataFrame from the original one, with just the dates whose type I wish to update, converting it to np.datetime64 and plugging it back into the original DataFrame:
dfi = df['dates']
dfi = pd.to_datetime(dfi)
df['dates'] = dfi
But still it doesn't work. I have also tried updating values one by one:
arr_i = df.index
for i in range(len(arr_i)):
    df.at[arr_i[i], 'dates'].to_datetime64()
Edit
The root problem seems to be that the dtype of the column gets updated to np.datetime64, but somehow, when getting single values out of it, they still come back with dtype = Timestamp.
Does anyone have a suggestion of a workaround that is fairly fast?
Pandas tries to standardize all forms of datetimes by storing them as NumPy datetime64[ns] values when you assign them to a DataFrame. But when you try to access individual datetime64 values, they are returned as Timestamps.
There is a way to prevent this automatic conversion from happening however: Wrap the list of values in a Series of dtype object:
import numpy as np
import pandas as pd
# create some dates, merely for example
dates = pd.date_range('2000-1-1', periods=10)
# convert the dates to a *list* of datetime64s
arr = list(dates.to_numpy())
# wrap the values you wish to protect in a Series of dtype object.
ser = pd.Series(arr, dtype='object')
# assignment with `df['datetime64s'] = ser` would also work
df = pd.DataFrame({'timestamps': dates,
'datetime64s': ser})
df.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 10 entries, 0 to 9
# Data columns (total 2 columns):
# timestamps 10 non-null datetime64[ns]
# datetime64s 10 non-null object
# dtypes: datetime64[ns](1), object(1)
# memory usage: 240.0+ bytes
print(type(df['timestamps'][0]))
# <class 'pandas._libs.tslibs.timestamps.Timestamp'>
print(type(df['datetime64s'][0]))
# <class 'numpy.datetime64'>
But beware! Although with a little work you can circumvent Pandas' automatic conversion mechanism, it may not be wise to do this. First, converting a NumPy array to a list is usually a sign you are doing something wrong, since it is bad for performance. Using object arrays is a bad sign, since operations on object arrays are generally much, much slower than equivalent operations on arrays of native NumPy dtypes.
You may be looking at an XY problem -- it may be more fruitful to find a way to (1) work with Pandas Timestamps instead of trying to force Pandas to return NumPy datetime64s, or (2) work with datetime64 array-likes (e.g. Series or NumPy arrays) instead of handling values individually (which causes the coercion to Timestamps).
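To make option (2) concrete, here is a minimal sketch (my own illustration, not part of the original answer): keep the column as datetime64[ns] and operate on the whole array, so the scalar coercion to Timestamp never comes into play.
import pandas as pd

df = pd.DataFrame({'dates': pd.date_range('2000-01-01', periods=5)})

arr = df['dates'].to_numpy()   # NumPy array with dtype datetime64[ns]
elapsed = arr - arr[0]         # vectorised timedelta64 arithmetic, no Timestamps involved
print(arr.dtype)               # datetime64[ns]
print(elapsed)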

Round pandas timestamp series to seconds - then save to csv without ms/ns resolution

I have a DataFrame df with a pd.DatetimeIndex. The individual timestamps are changed from 2017-12-04 08:42:12.173645000 to 2017-12-04 08:42:12 using the excellent pandas rounding command:
df.index = df.index.round("S")
When stored to csv, this format is kept (which is exactly what I want). I also need a date-only column, and this is now easily created:
df = df.assign(DateTimeDay = df.index.round("D"))
When stored to a csv file using df.to_csv(), this writes out the entire timestamp (2017-12-04 00:00:00), except when it is the ONLY column to be saved. So, I add the following command before saving:
df["DateTimeDay"] = df["DateTimeDay"].dt.date
...and the csv-file looks nice again (2017-12-04)
Problem description
Now over to the question: I have two other columns with timestamps in the same format as above (but different, AND with a very few NaNs). I also want to round these to seconds (keeping NaNs as NaNs, of course), and then make sure that when written to csv they are not padded with zeros below the second resolution. Whatever I try, I am simply not able to do this.
Additional information:
print(df.dtypes)
print(df.index.dtype)
...all result in datetime64[ns]. If I convert them to an index:
df["TimeCol2"] = pd.DatetimeIndex(df["TimeCol2"]).round("s")
df["TimeCol3"] = pd.DatetimeIndex(df["TimeCol3"]).round("s")
...it works, but the csv-file still pads them with unwanted and unnecessary zeros.
Optimal solution: No conversion of the columns (like above) or use of element-wise apply unless they are quick (100+ million rows). My dream command would be like this:
df["TimeCol2"] = df["TimeCol2"].round("s") # Raises TypeError: an integer is required (got type str)
You can specify the date format for datetime dtypes when calling to_csv:
In[170]:
df = pd.DataFrame({'date':[pd.to_datetime('2017-12-04 07:05:06.767')]})
df
Out[170]:
date
0 2017-12-04 07:05:06.767
In[171]:
df.to_csv(date_format='%Y-%m-%d %H:%M:%S')
Out[171]: ',date\n0,2017-12-04 07:05:06\n'
If you want to round the values, you need to round prior to writing to csv:
In[173]:
df1 = df['date'].dt.round('s')
df1.to_csv(date_format='%Y-%m-%d %H:%M:%S')
Out[173]: '0,2017-12-04 07:05:07\n'
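Putting the two pieces together for the original question, here is a small sketch (the TimeCol2/TimeCol3 names are taken from the question, the toy data and output filename are my assumptions): .dt.round('s') works column by column and leaves NaT values untouched, and date_format controls how every datetime column is written out.
import pandas as pd

# toy data standing in for the real columns; pd.NaT plays the role of the missing values
df = pd.DataFrame({
    'TimeCol2': pd.to_datetime(['2017-12-04 08:42:12.173645', None]),
    'TimeCol3': pd.to_datetime(['2017-12-05 09:00:00.999999', '2017-12-06 10:30:45.5']),
})

for col in ['TimeCol2', 'TimeCol3']:
    df[col] = df[col].dt.round('s')   # NaT stays NaT

df.to_csv('rounded.csv', date_format='%Y-%m-%d %H:%M:%S')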
