How to filter time series if data exists at least every 6 hours? - python-3.x

I'd like to verify that there is data at least once every 6 hours per ID, and filter out the IDs that do not meet this criterion.
Essentially a filter: "if an ID's data is not present at least every 6h, drop that ID from the dataframe".
I tried to use the same method as for filtering to one entry per day, but I'm having trouble adapting the code.
# add day column from datetime index
df['1D'] = df.index.day
# reset index
daily = df.reset_index()
# count rows per ID per day; the result is a non-zero count per (day, ID)
a = daily.groupby(['1D', 'id']).size()
# filter by right join
filtered = a.merge(df, on='id', how='right')
I cannot figure out how to adapt this to the following 6-hour periods each day: 00:01-06:00, 06:01-12:00, 12:01-18:00, 18:01-24:00.

Group by ID, then integer-divide the hour by 6 and count the unique values. In your case the count should be greater than or equal to 4, because there are four 6-hour bins in 24 hours and each day has 4 unique bins, i.e.
Bins = 4
00:01-06:00
06:01-12:00
12:01-18:00
18:01-24:00
Code
mask = df.groupby('id')['date'].transform(lambda x: (x.dt.hour // 6).nunique() >= 4)
df = df[mask]
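As a quick sanity check, here is a minimal, self-contained sketch of this approach on hypothetical toy data (the column names 'id' and 'date' are taken from the snippet above):
import pandas as pd
# toy data: id 1 has a row in each of the four 6-hour bins, id 2 does not
df = pd.DataFrame({
    'id': [1, 1, 1, 1, 2],
    'date': pd.to_datetime(['2020-01-01 01:00', '2020-01-01 07:00',
                            '2020-01-01 13:00', '2020-01-01 19:00',
                            '2020-01-01 01:00']),
})
mask = df.groupby('id')['date'].transform(lambda x: (x.dt.hour // 6).nunique() >= 4)
print(df[mask])  # only the rows for id 1 remain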

I propose to use pivot_table with resample, which allows resampling to arbitrary frequencies. Please see the comments for further explanation.
import pandas as pd
from datetime import datetime

# build test data. I need a dummy column to use pivot_table later.
# Any column with numerical values will suffice.
data = [[datetime(2020, 1, 1, 1), 1, 1],
        [datetime(2020, 1, 1, 6), 1, 1],
        [datetime(2020, 1, 1, 12), 1, 1],
        [datetime(2020, 1, 1, 18), 1, 1],
        [datetime(2020, 1, 1, 1), 2, 1],
        ]
df = pd.DataFrame.from_records(data=data, columns=['date', 'id', 'dummy'])
df = df.set_index('date')
# We need a helper dataframe df_tmp.
# Transform id entries to columns. resample with 6h = 360 minutes = 360T.
# Take mean(); bins where an id has no observation end up as NaN
# WARNING: It will only work if you have at least one id with observations for every 6h.
df_tmp = pd.pivot_table(df, columns='id', index=df.index).resample('360T').mean()
# Drop the column MultiIndex level and drop all columns with NaN values
df_tmp.columns = df_tmp.columns.get_level_values(1)
df_tmp.dropna(axis=1, inplace=True)
# Keep only rows in the original dataframe whose id has data in every 6h bin
mask_id = df.id.isin(df_tmp.columns.to_list())
df = df[mask_id]

I kept your requirements on the timestamps, but I believe you actually want to use the commented-out lines in my solution.
import pandas as pd
period = pd.to_datetime(['2020-01-01 00:01:00', '2020-01-01 06:00:00'])
# period = pd.to_datetime(['2020-01-01 00:00:00', '2020-01-01 06:00:00'])
shift = pd.to_timedelta(['6H', '6H'])
id_with_data = set(df['ID'])
for k in range(4):  # four 6-hour periods in a day (00:01 --> 24:00)
    period_mask = (period[0] <= df.index) & (df.index <= period[1])
    # period_mask = (period[0] <= df.index) & (df.index < period[1])
    present_ids = set(df.loc[period_mask, 'ID'])
    id_with_data = id_with_data.intersection(present_ids)
    period += shift
df = df.loc[df['ID'].isin(list(id_with_data))]


How to merge as-of for date but left-join for another column?

I have the following dataframes:
assets = pd.DataFrame(columns = ['date','asset'], data = [[datetime.date(2022,10,21),'SPY'],[datetime.date(2022,10,21),'FTSE'], [datetime.date(2022,11,12),'SPY'],[datetime.date(2022,11,12),'FTSE']])
prices = pd.DataFrame(columns = ['date','asset', 'price'], data = [[datetime.date(2022,10,11),'SPY',10],[datetime.date(2022,10,11),'FTSE',5],[datetime.date(2022,11,8),'SPY',100],[datetime.date(2022,11,8),'FTSE',50]])
For each asset I want to get the as-of price (at the nearest date). How can I do so?
If I have only one asset it is easy:
assets_spy = assets#.loc[assets['asset']=='SPY']
prices_spy = prices#.loc[prices['asset']=='SPY']
assets_spy.index = pd.to_datetime(assets_spy['date'])
prices_spy.index = pd.to_datetime(prices_spy['date'])
merged = pd.merge_asof(assets_spy.sort_index(),
                       prices_spy.sort_index(),
                       direction='nearest', right_index=True, left_index=True)
but if I follow the same logic for multiple assets, it won't match.
The function pandas.merge_asof has an optional parameter named by that you can use to match on a column or list of columns before performing the merge operation. Therefore, you could adapt your code like this:
import pandas as pd
import datetime
assets = pd.DataFrame(columns = ['date', 'asset'],
                      data = [[datetime.date(2022, 10, 21), 'SPY'],
                              [datetime.date(2022, 10, 21), 'FTSE'],
                              [datetime.date(2022, 11, 12), 'SPY'],
                              [datetime.date(2022, 11, 12), 'FTSE']])
prices = pd.DataFrame(columns = ['date', 'asset', 'price'],
                      data = [[datetime.date(2022, 10, 11), 'SPY', 10],
                              [datetime.date(2022, 10, 11), 'FTSE', 5],
                              [datetime.date(2022, 11, 8), 'SPY', 100],
                              [datetime.date(2022, 11, 8), 'FTSE', 50]])
merged = pd.merge_asof(
    assets.astype({'date': 'datetime64[ns]'}).sort_values('date').convert_dtypes(),
    prices.astype({'date': 'datetime64[ns]'}).sort_values('date').convert_dtypes(),
    direction = 'nearest',
    on = 'date',
    by = 'asset',
)
merged
# Returns:
#
# date asset price
# 0 2022-10-21 SPY 10 <-- Merged using SPY price from '2022-10-11'
# 1 2022-10-21 FTSE 5 <-- Merged using FTSE price from '2022-10-11'
# 2 2022-11-12 SPY 100 <-- Merged using SPY price from '2022-11-08'
# 3 2022-11-12 FTSE 50 <-- Merged using FTSE price from '2022-11-08'

Multiply each row of an array with coefficients in list - Python

I am very new to Python and need help. This is the problem statement:
I want to calculate the value of each of the three houses by multiplying the rows of the array X (each row representing one house) with the coefficients in list c, so for the first house: Price = (66x3000) + (5x200) + (15x-50) + (2x5000) + (500x100) = 258,250
Do not use numpy
Print the price of the three houses
This is what I have so far:
# input values for three houses:
# - size [m^2],
# - size of the sauna [m^2],
# - distance to water [m],
# - number of indoor bathrooms,
# - proximity of neighbors [m]
X = [[66, 5, 15, 2, 500],
[21, 3, 50, 1, 100],
[120, 15, 5, 2, 1200]]
# coefficient values
c = [3000, 200 , -50, 5000, 100]
def predict(X, c):
    price = 0
    for i in range(len(X)):
        for j in range(len(X[i])):
            price += (c[j] * X[i][j])
        print(price)

predict(X, c)
The output is
258250
334350
827100
The program adds the value of the 2nd and 3rd houses to the previous result, rather than printing each house's value on its own. How can I fix this?
Many thanks!
Move the line
price = 0
into the outer for loop:
def predict(X, c):
    for i in range(len(X)):
        price = 0
        for j in range(len(X[i])):
            ...
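For completeness, here is a minimal runnable sketch of the corrected function, using the X and c from the question; each printed value is now the price of one house:
X = [[66, 5, 15, 2, 500],
     [21, 3, 50, 1, 100],
     [120, 15, 5, 2, 1200]]
c = [3000, 200, -50, 5000, 100]
def predict(X, c):
    for i in range(len(X)):
        price = 0  # reset the running total for every house
        for j in range(len(X[i])):
            price += c[j] * X[i][j]
        print(price)
predict(X, c)
# 258250
# 76100
# 492750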

compare index and column in data frame with dictionary

I have a dictionary:
d = {'A-A': 1, 'A-B':2, 'A-C':3, 'B-A':5, 'B-B':1, 'B-C':5, 'C-A':3,
'C-B':4, 'C-C': 9}
and a list:
L = ['A', 'B', 'C']
I have a DataFrame:
df =pd.DataFrame(columns = L, index=L)
I would like to fill each row in df with values from the dictionary based on the dictionary keys. For example:
A B C
A 1 2 3
B 5 1 5
C 3 4 9
I tried doing that by:
df.loc[L[0]]=[1,2,3]
df.loc[L[1]]=[5,1,5]
df.loc[L[2]] =[3,4,9]
Is there another way to do that, especially when there is a huge amount of data?
Thank you for your help.
Here is another way that I can think of:
import numpy as np
import pandas as pd
# given
d = {'A-A': 1, 'A-B':2, 'A-C':3, 'B-A':5, 'B-B':1, 'B-C':5, 'C-A':3,
'C-B':4, 'C-C': 9}
L = ['A', 'B', 'C']
# copy the dictionary values into a numpy array (relies on the dict preserving insertion order)
z = np.asarray(list(d.values()))
# reshape the array according to your DataFrame
z_new = np.reshape(z, (3, 3))
# copy it into your DataFrame
df = pd.DataFrame(z_new, columns = L, index=L)
This should do the trick, though it's probably not the best way:
for index in L:
    prefix = index + "-"
    df.loc[index] = [d.get(prefix + column, 0) for column in L]
Calculating the prefix separately beforehand is probably slower for a small list and probably faster for a large list.
Explanation
for index in L:
This iterates through all of the row names.
prefix = index + "-"
All of the keys for each row start with index + "-", e.g. "A-", "B-", etc.
df.loc[index] =
Set the contents of the entire row.
[ for column in L]
The same as your explicit list ([1, 2, 3]), just for an arbitrary number of items. This is called a "list comprehension".
d.get( , 0)
This is the same as d[ ] but returns 0 if it can't find anything.
prefix + column
Sticks the column name on the end, e.g. "A-" gives "A-A", "A-B", etc.
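Putting the pieces together, a self-contained sketch with the data from the question looks like this:
import pandas as pd
d = {'A-A': 1, 'A-B': 2, 'A-C': 3, 'B-A': 5, 'B-B': 1, 'B-C': 5,
     'C-A': 3, 'C-B': 4, 'C-C': 9}
L = ['A', 'B', 'C']
df = pd.DataFrame(columns=L, index=L)
for index in L:
    prefix = index + "-"
    df.loc[index] = [d.get(prefix + column, 0) for column in L]
print(df)
#    A  B  C
# A  1  2  3
# B  5  1  5
# C  3  4  9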

Using non-zero values from columns in function - pandas

I have the dataframe below and would like to calculate, within a function, the difference between columns 'animal1' and 'animal2' over their sum, while only taking into consideration the values that are bigger than 0 in each of the columns 'animal1' and 'animal2'.
How could I do this?
import pandas as pd
animal1 = pd.Series({'Cat': 4, 'Dog': 0,'Mouse': 2, 'Cow': 0,'Chicken': 3})
animal2 = pd.Series({'Cat': 2, 'Dog': 3,'Mouse': 0, 'Cow': 1,'Chicken': 2})
data = pd.DataFrame({'animal1':animal1, 'animal2':animal2})
def animals():
    data['anim_diff'] = (data['animal1'] - data['animal2']) / (data['animal1'] + ['animal2'])
    return data['anim_diff'].abs().idxmax()
print(data)
I believe you need to check that all values in a row are greater than 0 with DataFrame.gt, test this with DataFrame.all, and filter by boolean indexing:
def animals(data):
    data['anim_diff'] = (data['animal1'] - data['animal2']) / (data['animal1'] + data['animal2'])
    return data['anim_diff'].abs().idxmax()

df = data[data.gt(0).all(axis=1)].copy()
# alternative for not equal 0
# df = data[data.ne(0).all(axis=1)].copy()
print (df)
animal1 animal2
Cat 4 2
Chicken 3 2
print(animals(df))
Cat

Python Pandas Create Multiple dataframes by slicing data at certain locations

I am new to Python and data analysis using programming. I have a long csv and I would like to create DataFrames dynamically and plot them later on. Here is an example DataFrame similar to the data in my csv file:
df = pd.DataFrame(
{"a" : [4 ,5, 6, 'a', 1, 2, 'a', 4, 5, 'a'],
"b" : [7, 8, 9, 'b', 0.1, 0.2, 'b', 0.3, 0.4, 'b'],
"c" : [10, 11, 12, 'c', 10, 20, 'c', 30, 40, 'c']})
As seen, there are elements which are repeated in each column. So I would first need to find the indices of the repetitions and then use them to make subsets. Here is how I did this:
find_Repeat = df.groupby(['a'], group_keys=False).apply(
    lambda df: df if df.shape[0] > 1 else None)
repeat_idxs = find_Repeat.index[find_Repeat['a'] == 'a'].tolist()
If I print repeat_idxs, I would get
[3, 6, 9]
And this is an example of what I want to achieve in the end:
dfa_1 = df['a'][repeat_idxs[0]:repeat_idxs[1]]
dfa_2 = df['a'][repeat_idxs[1]:repeat_idxs[2]]
dfb_1 = df['b'][repeat_idxs[0]:repeat_idxs[1]]
dfb_2 = df['b'][repeat_idxs[1]:repeat_idxs[2]]
But this is neither efficient nor convenient, as I need to create many DataFrames like these for plotting later on. So I tried the following method:
dfNames = ['dfa_' + str(i) for i in range(len(repeat_idxs))]
dfs = dict()
for i, row in enumerate(repeat_idxs):
    dfName = dfNames[i]
    slices = df['a'].loc[row:row+1]
    dfs[dfName] = slices
If I print dfs, this is exactly what I want.
{'df_0': 3 a
4 1
Name: a, dtype: object, 'df_1': 6 a
7 4
Name: a, dtype: object, 'df_2': 9 a
Name: a, dtype: object}
However, when I read my csv and apply the above, I am not getting the desired result. I can find the repeated indices from the csv file but I am not able to slice the data properly. I am presuming that I am not reading the csv file correctly. I attached the csv file for further clarification.
Two options:
Loop over and slice
Detect the repeat row indices and then loop over to slice contiguous chunks of the dataframe, ignoring the repeat rows:
# detect rows for which all values are equal to the column names
repeat_idxs = df.index[(df == df.columns.values).all(axis=1)]
slices = []
start = 0
for i in repeat_idxs:
    slices.append(df.loc[start:i - 1])
    start = i + 1
The result is a list, slices, containing the chunks of your data in order.
Use pandas groupby
You could also do this in one line using pandas groupby if you prefer:
grouped = df[~(df == df.columns.values).all(axis=1)].groupby((df == df.columns.values).all(axis=1).cumsum())
And you can now iterate over the groups like so:
for i, group_df in grouped:
    # do something with group_df
    ...
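If you want something close to the dictionary of named DataFrames from the question, a small sketch (the key naming scheme is just an assumption) could collect the groups like this:
# build a dict of DataFrames, one per chunk between the repeat rows
dfs = {'df_' + str(i): group_df.reset_index(drop=True)
       for i, group_df in grouped}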
