Does this support/resistance feature use future data? (Python finance, python-3.x)

I built a support/resistance-like feature for my deep learning model (cryptocurrency prediction). The newly trained model's results are very good, and with the normal evaluation the results are also very good. But with an evaluation that only predicts on the last sample in the dataframe, the results are not nearly as good, even though the evaluation covers the same period with the same number of samples.
So I am concerned that the support/resistance feature somehow uses future data. I had this issue before, when I wrote the code for the first time, but I already fixed that error. Does anyone have any idea whether this code does indeed use future data, which would not be available in real life?
This problem only appeared with this new feature, so the rest of the code should be good.
The code works as follows: it retrieves the high and low per slice of 12 samples, then fills in the value of that high/low from the point itself until just before the next high/low.
Code:
def support_resistance(df, window=12):
    # find lows & highs per window.
    lows, highs = [], []
    n_windows = len(df) // window
    diff = len(df) - (n_windows * window)  # leftover samples, folded into the first window
    for index in range(n_windows):
        if index == 0:
            sliced = df.iloc[index * window: ((index + 1) * window) + diff]
        else:
            sliced = df.iloc[(index * window) + diff: ((index + 1) * window) + diff]
        high = sliced["high"].max()
        hi_idx = sliced.index[sliced["high"] == high].tolist()[0]
        highs.append([hi_idx, high])
        low = sliced["low"].min()
        lo_idx = sliced.index[sliced["low"] == low].tolist()[0]
        lows.append([lo_idx, low])
    # fill in highs.
    n = len(highs)
    filled = []
    for index in range(n):
        if index == 0:
            # this does fill in future data, but the first rows are always dropped
            # later because of other features, so this is not the problem.
            for i in range(0, highs[index][0]):
                filled.append([i, highs[index][1]])
        if index < n - 1:
            for i in range(highs[index][0], highs[index + 1][0]):
                filled.append([i, highs[index][1]])
        elif index == n - 1:
            for i in range(highs[index][0], len(df)):
                filled.append([i, highs[index][1]])
    highs = filled
    # fill in lows.
    n = len(lows)
    filled = []
    for index in range(n):
        if index == 0:
            # same caveat as for the highs above.
            for i in range(0, lows[index][0]):
                filled.append([i, lows[index][1]])
        if index < n - 1:
            for i in range(lows[index][0], lows[index + 1][0]):
                filled.append([i, lows[index][1]])
        elif index == n - 1:
            for i in range(lows[index][0], len(df)):
                filled.append([i, lows[index][1]])
    lows = filled
    # fill support & resistance into df.
    for idx, high in highs:
        df.at[idx, "resistance"] = high
    for idx, low in lows:
        df.at[idx, "support"] = low
    return df[["support", "resistance"]]
As a candlestick graph it would look like this. The purple points are the new highs and lows.
Does anyone see some kind of error which could make the feature use future data that is not available in real life, and which could therefore explain the differences between the evaluations?
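For comparison, a strictly backward-looking high/low feature (one that can never see ahead of the current row) can be sketched with pandas' rolling aggregations. This is only a reference sketch of a leak-free construction, not the feature above:
import pandas as pd

def causal_support_resistance(df, window=12):
    # rolling max/min over the trailing `window` samples only;
    # each row sees itself and earlier rows, never later ones.
    out = pd.DataFrame(index=df.index)
    out["resistance"] = df["high"].rolling(window, min_periods=1).max()
    out["support"] = df["low"].rolling(window, min_periods=1).min()
    return out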

Related

How can vectorization or an apply function yield the same results, but faster than a for loop, in this case?

My code has been running for days, as the dataframe has 10M rows. I used a for loop because, while referring to one column of the df, I am changing the values of multiple rows; hence I didn't use a function or vectorized operations. Can the following code be made faster while keeping the logic the same? Please note that the cosine_value of many rows gets updated multiple times in a single run.
As of now I am getting the intended results, but the runtime is huge. It would also be helpful to see how to use the apply function rather than a for loop in this case.
row = 0
for query in df['description']:
    # For a given query, find the top 1000 closest matches of description and their indexes
    if df['cosine_value'][row] < .45:
        query_vector = vectorizer.transform([query]).toarray().astype('float32')
        query_vector_dims = query_vector.shape[1]
        top_1000_results = index.search(query_vector, 1000)
        results = []
        for i, _id in enumerate(top_1000_results[1].tolist()[0]):
            results.append((df.iloc[_id]['description'], _id))
        arr = np.array(results)
        Result_indexes = arr[:, 1]
        result_text = arr[:, 0]
        Result_index = Result_indexes.astype(int)
        extracted_data = encoded_data[Result_index]
        # Calculate the cosine similarity of the query and its top 1000 matches,
        # keeping only the close ones
        cosine_similarities = cosine_similarity(query_vector, extracted_data)
        cosine_similarities_bool = np.array(cosine_similarities < .3)
        cosine_similarities_bool = cosine_similarities_bool.transpose()
        cosine_similarities_bool = np.squeeze(cosine_similarities_bool)
        Result_index = np.delete(Result_index, cosine_similarities_bool)
        # As selected in the step above, change the cosine_value of multiple rows
        j = 0
        for i in Result_index:
            if df.at[i, "cosine_value"] < cosine_similarities[0][j]:
                df.at[i, "description_id"] = description_id
                df.at[i, "cosine_value"] = cosine_similarities[0][j]
            j = j + 1
        description_id = description_id + 1
    row = row + 1

Generate High, Medium, Low categories from a skewed distribution

I have been working on a churn-prediction use case in Python using XGBoost. The model, trained on various parameters such as Age, Tenure, Last 6 months' income, etc., predicts whether an employee is likely to leave, looked up by employee ID.
Additionally, if the user wants to see why the ML system categorised an employee this way, they can see the features that contributed to the prediction, extracted from the model via the eli5 library.
To make this more explainable to the users, we created ranges for each feature:
Tenure (in days)
[0-100] = High Risk
[101-300] = Medium Risk
[301-800] = Low Risk
To define these ranges we analysed the distribution of each feature and manually defined the ranges for our use in the system. We looked at the impact of each feature on the target variable IsTerminated in the training data. The following is an example of the Tenure distribution.
Here the green bars represent the employees who were terminated or left, and the pink bars those who weren't.
The question is: as time passes and new data is added to the model, such features' risk ranges will change. In the case of Tenure, if an employee has a tenure of 780 days, after a month the tenure feature would show 810. Obviously we keep the upper end of "Low Risk" open ended. But the real problem is: how can we define the internal boundaries/ranges programmatically?
EDIT: Thanks for the clarification. I have changed the answer.
It is important to realize that you are trying to project a selection in a multi-dimensional space onto a 1D space. You will not always be able to see a clear separation like the one you got. There are also various ways to do this; here I made a simple example that could help your client interpret the model, although it does not represent the full complexity of the model, of course.
You did not provide any sample data, so I will generate some from the breast cancer dataset.
First let's import what we need:
from sklearn import datasets
from xgboost import XGBClassifier
import pandas as pd
import numpy as np
Now load the dataset and train a very simple XGBoost model:
cancer = datasets.load_breast_cancer()
X = cancer.data
y = cancer.target

xgb_model = XGBClassifier(n_estimators=5,
                          objective="binary:logistic",
                          random_state=42)
xgb_model.fit(X, y)
There are multiple ways to solve this.
One approach is to bin the probability given by the model: you decide which probabilities you consider to be "High Risk", "Medium Risk" and "Low Risk", and the intervals of the data can be classified accordingly. In this example I considered low to be 0 <= p <= 0.5, medium to be 0.5 < p <= 0.8, and high to be 0.8 < p <= 1.
First you have to calculate the probability for each prediction. I would suggest using a test set for this, to avoid bias from possible model overfitting.
y_prob = pd.DataFrame(xgb_model.predict_proba(X))[0]
df = pd.DataFrame(X, columns=cancer.feature_names)
# Stores the probability of a malignant cancer
df['probability'] = y_prob
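If you do follow the test-set suggestion, a sketch with sklearn's train_test_split could look like this (the rest of this example keeps using the full X for simplicity):
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
xgb_model.fit(X_train, y_train)
# probability of class 0 (malignant) on data the model has not seen
y_prob_test = pd.DataFrame(xgb_model.predict_proba(X_test))[0]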
Then you have to bin your data and calculate the average probability for each bin. I would suggest binning using np.histogram_bin_edges' automatic rules:
def calculate_mean_prob(feat):
    """Calculates the mean probability for a feature, binning its values."""
    # Bins from numpy's automatic rules, check the docs for details
    bins = np.histogram_bin_edges(df[feat], bins='auto')
    binned_values = pd.cut(df[feat], bins)
    return df['probability'].groupby(binned_values).mean()
Now you can classify each bin following what you would consider to be a low/medium/high probability:
def classify_probability(prob, medium=0.5, high=0.8, fillna_method='ffill'):
    """Classify the output of each bin into a risk group,
    according to the probability, following these rules:
        0 <= p <= medium: Low Risk
        medium < p <= high: Medium Risk
        high < p <= 1: High Risk
    If a bin has no entries, it will be filled using fillna with the
    method specified in fillna_method.
    """
    risk = pd.cut(prob, [0., medium, high, 1.0], include_lowest=True,
                  labels=['Low Risk', 'Medium Risk', 'High Risk'])
    risk.fillna(method=fillna_method, inplace=True)
    return risk
This returns the risk for each bin into which you divided your data. Since you will probably have multiple consecutive bins with the same label, you might want to merge consecutive pd.Interval bins. The code for that is shown below:
def sum_interval(i1, i2):
    if i2 is None:
        return None
    if i1.right == i2.left:
        return pd.Interval(i1.left, i2.right)
    return None

def sum_intervals(args):
    """Given a list of pd.Intervals,
    returns a list merging consecutive intervals."""
    result = list()
    current_interval = args[0]
    for next_interval in list(args[1:]) + [None]:
        # Try to merge the current interval with the next one
        # (the None is necessary for the last interval)
        sum_int = sum_interval(current_interval, next_interval)
        if sum_int is not None:
            # Update current_interval if the merge is possible
            current_interval = sum_int
        else:
            # Otherwise start a new interval
            result.append(current_interval)
            current_interval = next_interval
    if len(result) == 1:
        return result[0]
    return result
def combine_bins(df):
    # Group the bins by label
    grouped = df.groupby(df).apply(lambda x: sorted(list(x.index)))
    # Merge the intervals within each category, if consecutive
    merged_intervals = grouped.apply(sum_intervals)
    return merged_intervals
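For instance, sum_intervals merges touching intervals like this (the bounds here are made up purely for illustration):
ivs = [pd.Interval(0.0, 1.0), pd.Interval(1.0, 2.0), pd.Interval(5.0, 6.0)]
print(sum_intervals(ivs))
# [Interval(0.0, 2.0, closed='right'), Interval(5.0, 6.0, closed='right')]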
Now you can combine all the functions to calculate the bins for each feature:
def generate_risk_class(feature, medium=0.5, high=0.8):
    mean_prob = calculate_mean_prob(feature)
    classification = classify_probability(mean_prob, medium=medium, high=high)
    merged_bins = combine_bins(classification)
    return merged_bins
For example, generate_risk_class('worst radius') results in:
Low Risk (7.93, 17.3]
Medium Risk (17.3, 18.639]
High Risk (18.639, 36.04]
But for features which are not such good discriminators (or which do not separate high/low risk linearly), you will get more complicated regions. For example, generate_risk_class('mean symmetry') results in:
Low Risk [(0.114, 0.209], (0.241, 0.249], (0.272, 0.288]]
Medium Risk [(0.209, 0.225], (0.233, 0.241], (0.249, 0.264]]
High Risk [(0.225, 0.233], (0.264, 0.272], (0.288, 0.304]]

Analysis of eye-tracking data in Python (EyeLink)

I have data from eye tracking (an .edf file, from EyeLink by SR Research). I want to analyse it and get various measures such as fixations, saccades, durations, etc.
Is there an existing package for analysing eye-tracking data?
Thanks!
At least for importing the .edf file into a pandas DataFrame, you can use the following package by Niklas Wilming: https://github.com/nwilming/pyedfread/tree/master/pyedfread
This should already take care of saccades and fixations - have a look at the readme. Once they're in the dataframe, you can apply whatever analysis you want to them.
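A minimal usage sketch (the pread call and its three returned dataframes follow pyedfread's readme; verify against the version you install):
from pyedfread import edf

# samples: gaze data per timestamp; events: fixations/saccades; messages: trial messages
samples, events, messages = edf.pread('recording.edf')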
pyeparse seems to be another (though currently unmaintained, it seems) library that can be used for EyeLink data analysis.
Here is a short excerpt from their example:
import numpy as np
import matplotlib.pyplot as plt
import pyeparse as pp
fname = '../pyeparse/tests/data/test_raw.edf'
raw = pp.read_raw(fname)
# visualize initial calibration
raw.plot_calibration(title='5-Point Calibration')
# create heatmap
raw.plot_heatmap(start=3., stop=60.)
EDIT: After I posted my answer I found a nice list compiling lots of potential tools for eyelink edf data analysis: https://github.com/davebraze/FDBeye/wiki/Researcher-Contributed-Eye-Tracking-Tools
Hey, the question seems rather old, but maybe I can reactivate it, because I am currently facing the same situation.
To start, I recommend converting your .edf to an .asc file; this way it is easier to read and to get a first impression.
Many tools exist for this, but I used the SR Research EyeLink Developers Kit (here).
I don't know your setup, but the EyeLink 1000 itself detects saccades and fixations. In my case the .asc file looks like this:
SFIX L 10350642
10350642 864.3 542.7 2317.0
...
...
10350962 863.2 540.4 2354.0
EFIX L 10350642 10350962 322 863.1 541.2 2339
SSACC L 10350964
10350964 863.4 539.8 2359.0
...
...
10351004 683.4 511.2 2363.0
ESACC L 10350964 10351004 42 863.4 539.8 683.4 511.2 5.79 221
The first number corresponds to the timestamp, the second and third to the x-y coordinates, and the last is your pupil diameter (what the last numbers after ESACC are, I don't know).
SFIX -> start fixation
EFIX -> end fixation
SSACC -> start saccade
ESACC -> end saccade
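As a small illustration of pulling numbers out of such an event line, here is the EFIX sample from above (field meanings inferred from that sample - start, end, duration, average x/y and average pupil size - so verify them against your own files):
line = 'EFIX L 10350642 10350962 322 863.1 541.2 2339'
parts = line.split()
start, end, duration = int(parts[2]), int(parts[3]), int(parts[4])
avg_x, avg_y, avg_pupil = float(parts[5]), float(parts[6]), float(parts[7])
print(duration)  # 322, in milliseconds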
You can also check out PyGaze. I haven't worked with it, but when searching for a toolbox, this one always popped up.
EDIT
I found this toolbox here. It looks cool and works fine with the example data, but sadly it does not work with mine.
EDIT No. 2
Revisiting this question after working on my own eye-tracking data, I thought I might share a function I wrote to work with my data:
import csv
import numpy as np
import pandas as pd

def eyedata2pandasframe(directory):
    '''
    This function takes a path from which it tries to read an ASCII file
    containing eye-tracking data. It returns:
        eye_data: a pandas dataframe containing data from fixations AND saccades
        fix_data: a pandas dataframe containing only data from fixations
        sac_data: a pandas dataframe containing only data from saccades
        fix_timestamps: information about fixation onsets and offsets
        sac_timestamps: information about saccade onsets and offsets
        blink_timestamps: information about blink onsets and offsets
        trials: numpy array containing information about trial onsets
        sample_rate: the sampling rate read from the file header
    '''
    eye_data = []
    fix_data = []
    sac_data = []
    data_header = {0: 'TimeStamp', 1: 'X_Coord', 2: 'Y_Coord', 3: 'Diameter'}
    event_header = {0: 'Start', 1: 'End'}
    start_reading = False
    in_blink = False
    in_saccade = False
    fix_timestamps = []
    sac_timestamps = []
    blink_timestamps = []
    trials = []
    sample_rate_info = []
    sample_rate = 0
    # read the file and store the data depending on the messages.
    # we have the following structure:
    # a header -- every line starts with '**'
    # a bunch of messages containing information about calibration/validation and so on, all starting with 'MSG'
    # followed by:
    # START 10350638 LEFT SAMPLES EVENTS
    # PRESCALER 1
    # VPRESCALER 1
    # PUPIL AREA
    # EVENTS GAZE LEFT RATE 500.00 TRACKING CR FILTER 2
    # SAMPLES GAZE LEFT RATE 500.00 TRACKING CR FILTER 2
    # followed by the actual data:
    # normal data --> [TIMESTAMP]\t [X-Coords]\t [Y-Coords]\t [Diameter]
    # start of EVENTS [BLINKS FIXATIONS SACCADES] --> S[EVENTNAME] [EYE] [TIMESTAMP]
    # end of EVENTS --> E[EVENT] [EYE] [TIMESTAMP_START]\t [TIMESTAMP_END]\t [TIME OF EVENT]\t [X-Coords start]\t [Y-Coords start]\t [X-Coords end]\t [Y-Coords end]\t [?]\t [?]
    # trial messages --> MSG timestamp\t TRIAL [TRIALNUMBER]
    try:
        with open(directory) as f:
            csv_reader = csv.reader(f, delimiter='\t')
            for i, row in enumerate(csv_reader):
                if any('RATE' in item for item in row):
                    sample_rate_info = row
                if any('SYNCTIME' in item for item in row):  # only start reading after this message
                    start_reading = True
                elif any('SFIX' in item for item in row):
                    pass
                elif any('EFIX' in item for item in row):
                    fix_timestamps.append([row[0].split(' ')[4], row[1]])
                elif any('SSACC' in item for item in row):
                    in_saccade = True
                elif any('ESACC' in item for item in row):
                    sac_timestamps.append([row[0].split(' ')[3], row[1]])
                    in_saccade = False
                elif any('SBLINK' in item for item in row):  # stop reading here because the blinks contain NaN
                    in_blink = True
                elif any('EBLINK' in item for item in row):  # start reading again, the blink has ended
                    blink_timestamps.append([row[0].split(' ')[2], row[1]])
                    in_blink = False
                elif any('TRIAL' in item for item in row):
                    # the first element is 'MSG' and we don't need it; we split the second
                    # element to separate the timestamp and keep only it, as an integer
                    trials.append(int(row[1].split(' ')[0]))
                elif start_reading and not in_blink:
                    eye_data.append(row)
                    if in_saccade:
                        sac_data.append(row)
                    else:
                        fix_data.append(row)
        # drop the last data point, because it is the 'END' message
        eye_data.pop(-1)
        sac_data.pop(-1)
        fix_data.pop(-1)
        # convert every item in the lists into a float, subtract the start of the
        # first trial to set the start of the first video to t0=0,
        # then divide by 1000 to convert from milliseconds to seconds
        for row in eye_data:
            for i, item in enumerate(row):
                row[i] = float(item)
        for row in fix_data:
            for i, item in enumerate(row):
                row[i] = float(item)
        for row in sac_data:
            for i, item in enumerate(row):
                row[i] = float(item)
        for row in fix_timestamps:
            for i, item in enumerate(row):
                row[i] = (float(item) - trials[0]) / 1000
        for row in sac_timestamps:
            for i, item in enumerate(row):
                row[i] = (float(item) - trials[0]) / 1000
        for row in blink_timestamps:
            for i, item in enumerate(row):
                row[i] = (float(item) - trials[0]) / 1000
        sample_rate = float(sample_rate_info[4])
        # convert into pandas DataFrames for a better overview
        eye_data = pd.DataFrame(eye_data)
        fix_data = pd.DataFrame(fix_data)
        sac_data = pd.DataFrame(sac_data)
        fix_timestamps = pd.DataFrame(fix_timestamps)
        sac_timestamps = pd.DataFrame(sac_timestamps)
        trials = np.array(trials)
        blink_timestamps = pd.DataFrame(blink_timestamps)
        # rename the headers for an even better overview
        eye_data = eye_data.rename(columns=data_header)
        fix_data = fix_data.rename(columns=data_header)
        sac_data = sac_data.rename(columns=data_header)
        fix_timestamps = fix_timestamps.rename(columns=event_header)
        sac_timestamps = sac_timestamps.rename(columns=event_header)
        blink_timestamps = blink_timestamps.rename(columns=event_header)
        # subtract the first timestamp of trials to set the start of the first video to t0=0
        eye_data.TimeStamp -= trials[0]
        fix_data.TimeStamp -= trials[0]
        sac_data.TimeStamp -= trials[0]
        trials -= trials[0]
        trials = trials / 1000  # does not work with trials /= 1000 (integer array)
        # divide the TimeStamp to get time in seconds
        eye_data.TimeStamp /= 1000
        fix_data.TimeStamp /= 1000
        sac_data.TimeStamp /= 1000
        return eye_data, fix_data, sac_data, fix_timestamps, sac_timestamps, blink_timestamps, trials, sample_rate
    except Exception:
        print('Could not read ' + str(directory) + ' properly!!! Returned empty data')
        return eye_data, fix_data, sac_data, fix_timestamps, sac_timestamps, blink_timestamps, trials, sample_rate
Hope it helps you guys. You may need to change some parts of the code, like the indices where the strings are split to get the crucial information about event on/offsets. Or maybe you don't want to convert your timestamps into seconds, or don't want to set the onset of your first trial to 0; that is up to you.
Additionally, in my data we sent a message ('SYNCTIME') to know when we started measuring, and I had only ONE condition in my experiment, so there is only one 'TRIAL' message.
Cheers

How do I check if a data set is normal or not in Python?

So I'm creating a master program for machine learning from scratch in Python, and the first step I want to take is to check whether the data set is normally distributed or not.
PS: the data set can have many features or just a single feature.
It has to be implemented in Python 3.
Also, can normalizing the data be done by the function below?
# Find the min and max values for each column
def dataset_minmax(dataset):
    minmax = list()
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        value_min = min(col_values)
        value_max = max(col_values)
        minmax.append([value_min, value_max])
    return minmax

# Rescale dataset columns to the range 0-1
def normalize_dataset(dataset, minmax):
    for row in dataset:
        for i in range(len(row)):
            row[i] = (row[i] - minmax[i][0]) / (minmax[i][1] - minmax[i][0])
Thanks in advance!
Your question seems discordant: if your features are not coming from a normal distribution, you cannot "normalize" them in the sense of changing their distribution; min-max rescaling, as in your code, changes the range of the data, not its shape. If you mean to check whether they have a mean of 0 and an SD of 1, that is a different ball game.
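If what you actually want is a statistical check for normality per feature, a minimal sketch using scipy.stats could look like this (one common option among several; the significance threshold is a judgment call):
from scipy import stats
import numpy as np

def check_normality(dataset, alpha=0.05):
    # D'Agostino-Pearson normality test on each column (feature)
    data = np.asarray(dataset, dtype=float)
    for i in range(data.shape[1]):
        stat, p = stats.normaltest(data[:, i])
        print('feature', i, 'p =', round(p, 4),
              '-> looks normal' if p > alpha else '-> not normal')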

Applying Lambda to Recode (tricky) Strings to Numbers

I have a large data set of NFL scenarios, but for the sake of illustration, let me reduce it to a list of two observations, like this:
data = [[scenario1],[scenario2]]
Here is what the data set consists of:
data[0][0]
>>"It is second down and 3. The ball is on your opponent's 5 yardline. There is 3 seconds left in the fourth quarter. You are down by 3 points."
data[1][0]
>>"It is first down and 10. The ball is on your 20 yardline. There is 7 minutes left in the third quarter. You are down by 10 points."
I can't build any models with the data in string format like this, so I want to recode these scenarios into new columns (or features, if you will) as quantitative values. I thought I should first get the data frame squared away:
down = 0
yards = 0
yardline = 0
seconds = 0
quarter = 0
points = 0
data = [[scenario1, down, yards, yardline, seconds, quarter, points],
        [scenario2, down, yards, yardline, seconds, quarter, points]]
Now comes the tricky part: somehow I have to populate the new columns with the information from the scenario column. Tricky because, for instance, in the second sentence, if the word "opponent's" is present, we must calculate the yardline as 100 minus the yardline number. In the scenario1 variable above, it should be 100 - 5 = 95.
At first I thought I should just separate out all the numbers and throw away the words, but as pointed out above, some words are actually necessary to correctly assign the quantitative value. I have never written a lambda with this much subtlety. Or perhaps a lambda is not the right way to go? I'm open to any and all suggestions.
For reinforcement, here is what I want to see (from scenario1) if I entered:
data[0][1:]
>>2,3,95,3,4,-3
Thank you
A lambda is not the way you're gonna want to go here. Python's re module is your friend :)
from re import search

def getScenarioData(scenario):
    data = []
    ordinals_to_nums = {'first': 1, 'second': 2, 'third': 3, 'fourth': 4}
    numerals_to_nums = {
        'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4,
        'five': 5, 'six': 6, 'seven': 7, 'eight': 8, 'nine': 9
    }
    # Downs
    match = search(r'(first|second|third|fourth) down and', scenario)
    if match:
        raw_downs = match.group(1)
        downs = ordinals_to_nums[raw_downs]
        data.append(downs)
    # Yards
    match = search(r'down and (\S+)\.', scenario)
    if match:
        raw_yards = match.group(1)
        data.append(int(raw_yards))
    # Yardline
    match = search(r"(opponent's)? (\S+) yardline", scenario)
    if match:
        raw_yardline = match.groups()
        yardline = 100 - int(raw_yardline[1]) if raw_yardline[0] else int(raw_yardline[1])
        data.append(yardline)
    # Seconds
    match = search(r'(\S+) (seconds|minutes) left', scenario)
    if match:
        raw_secs = match.groups()
        multiplier = 1 if raw_secs[1] == 'seconds' else 60
        data.append(int(raw_secs[0]) * multiplier)
    # Quarter
    match = search(r'(\S+) quarter', scenario)
    if match:
        raw_quarter = match.group(1)
        quarter = ordinals_to_nums[raw_quarter]
        data.append(quarter)
    # Points
    match = search(r'(up|down) by (\S+) points', scenario)
    if match:
        raw_points = match.groups()
        if raw_points:
            polarity = 1 if raw_points[0] == 'up' else -1
            points = int(raw_points[1]) * polarity
        else:
            points = 0
        data.append(points)
    return data
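A quick check against the expected output from the question (assuming the scenario string exactly as quoted above):
scenario1 = ("It is second down and 3. The ball is on your opponent's 5 yardline. "
             "There is 3 seconds left in the fourth quarter. You are down by 3 points.")
print(getScenarioData(scenario1))  # [2, 3, 95, 3, 4, -3]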
Personally, I find storing your data like [[scenario, <scenario_data>], ...] a bit odd, but to add the data to each scenario:
for s in data:
    s.extend(getScenarioData(s[0]))
I would suggest using a list of dictionaries because using indexes like data[0][3] could get confusing a month or two from now:
def getScenarioData(scenario):
    # instead of data = []
    data = {'scenario': scenario}
    # instead of data.append(downs)
    data['downs'] = downs
    ...

scenarios = ['...', '...']
data = [getScenarioData(s) for s in scenarios]
EDIT: When you want to get a value from the dicts, use the get method to prevent raising a KeyError because get defaults to None if the key is not found:
for s in data:
    print(s.get('quarter'))
