How can we visualize multidimensional data after clustering? - python-3.x

I have a dataset of 100+ dimensions, and I used a precomputed correlation matrix as the distance metric.
`
from sklearn.cluster import AffinityPropagation

af = AffinityPropagation(affinity='precomputed').fit(my_distanceMetric_as_correlationMatrix)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
`
Now I can see the data assigned to different clusters, but I would like to visualize these clusters. Any help would be appreciated.

You can download the .whl file from http://www.lfd.uci.edu/~gohlke/pythonlibs/#cvxopt
(Ctrl-F for scikit-learn and choose the appropriate version.)
Place the downloaded file in your current working directory and install it with
pip install filename
In my case the filename is scikit_learn-0.18.1-cp27-cp27m-win_amd64.whl
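To actually plot the clusters, one option is to embed the precomputed distances in 2-D and color the points by their cluster labels. A minimal sketch, assuming corr is the same correlation matrix passed to AffinityPropagation above (the 1 - corr conversion is my assumption, since MDS expects dissimilarities rather than similarities):
`
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

# Convert the correlation (a similarity) into a dissimilarity for MDS.
dist = 1 - corr                      # 'corr' assumed to be your precomputed matrix
np.fill_diagonal(dist, 0)

# Embed the pairwise distances into two dimensions for plotting.
coords = MDS(n_components=2, dissimilarity='precomputed',
             random_state=0).fit_transform(dist)

# Color each point by the cluster label found by AffinityPropagation.
plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap='tab10', s=20)
plt.title('Affinity Propagation clusters (MDS projection)')
plt.show()
`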

Related

backtesting.py plotting function not working

I'm trying to learn backtesting.py. When I run the following sample code, it pops up these errors; could anyone help? I tried to uninstall the Bokeh package and reinstall an older version, but it doesn't work.
BokehDeprecationWarning: Passing lists of formats for DatetimeTickFormatter scales was deprecated in Bokeh 3.0. Configure a single string format for each scale
C:\Users\paul_\AppData\Local\Programs\Python\Python310\lib\site-packages\bokeh\models\formatters.py:399: UserWarning: DatetimeFormatter scales now only accept a single format. Using the first prodvided: '%d %b'
warnings.warn(f"DatetimeFormatter scales now only accept a single format. Using the first prodvided: {fmt[0]!r} ")
BokehDeprecationWarning: Passing lists of formats for DatetimeTickFormatter scales was deprecated in Bokeh 3.0. Configure a single string format for each scale
C:\Users\paul_\AppData\Local\Programs\Python\Python310\lib\site-packages\bokeh\models\formatters.py:399: UserWarning: DatetimeFormatter scales now only accept a single format. Using the first prodvided: '%m/%Y'
warnings.warn(f"DatetimeFormatter scales now only accept a single format. Using the first prodvided: {fmt[0]!r} ")
GridPlot(id='p11925', ...)
import bokeh
import datetime
import pandas_ta as ta
import pandas as pd
from backtesting import Backtest
from backtesting import Strategy
from backtesting.lib import crossover
from backtesting.test import GOOG

class RsiOscillator(Strategy):
    upper_bound = 70
    lower_bound = 30
    rsi_window = 14

    # Do as much initial computation as possible
    def init(self):
        self.rsi = self.I(ta.rsi, pd.Series(self.data.Close), self.rsi_window)

    # Step through bars one by one
    # Note that multiple buys are a thing here
    def next(self):
        if crossover(self.rsi, self.upper_bound):
            self.position.close()
        elif crossover(self.lower_bound, self.rsi):
            self.buy()

bt = Backtest(GOOG, RsiOscillator, cash=10_000, commission=.002)
stats = bt.run()
bt.plot()
An issue was opened for this in the GitHub repo:
https://github.com/kernc/backtesting.py/issues/803
A comment in the issue suggests downgrading bokeh to 2.4.3:
python3 -m pip install bokeh==2.4.3
This worked for me.
I had a similar issue using the Spyder IDE.
I found out I needed to call the following for the plot to show in Spyder:
backtesting.set_bokeh_output(notebook=False)
I updated Python to version 3.11 and downgraded bokeh to 2.4.3.
This worked for me.
Downgrading Bokeh didn't work for me.
But, after importing backtesting in Jupyter, I needed to do:
backtesting.set_bokeh_output(notebook=False)
The expected plot was then generated in a new interactive browser tab.
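Putting the workaround together with the sample code, a minimal sketch (assuming the bt object from the question; only the set_bokeh_output call comes from the answers above):
import backtesting

# Send the plot to a new browser tab instead of notebook/inline output
backtesting.set_bokeh_output(notebook=False)
bt.plot()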

How to download pipeline code from Pycaret AutoML into .py files?

PyCaret seems like a great AutoML tool. It works fast and simply, and I would like to download the generated pipeline code into .py files to double-check it and, if needed, customize some parts. Unfortunately, I don't know how to do this. Reading the documentation has not helped. Is it possible or not?
It is not possible to get the underlying code, since PyCaret takes care of this for you. But it is up to you as the user to decide the steps that you want your flow to take, e.g.:
# Setup experiment with user-defined options for preprocessing, etc.
setup(...)
# Create a model (uses training split only)
model = create_model("lr")
# Tune hyperparameters (user can pass a custom tuning grid if needed)
# Again, uses training split only
tuned = tune_model(model, ...)
# Finalize the model (so that the best hyperparameters are retrained on the entire dataset)
final = finalize_model(tuned)
# Any other steps you would like to do.
...
Finally, you can save the entire pipeline as a pkl file for later use:
# Saves the model + pipeline as a pkl file
save_model(final, "my_best_model")
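To double-check the saved pipeline later, you can load it back and run predictions through it. A minimal sketch (new_data is a hypothetical DataFrame of unseen rows):
from pycaret.classification import load_model, predict_model

# Restores the full preprocessing + model pipeline from my_best_model.pkl
loaded = load_model("my_best_model")
predictions = predict_model(loaded, data=new_data)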
You may get a partial (incomplete) answer with get_config("prep_pipe") in 2.6.10 or in 3.0.0rc1.
Just run setup() as in the examples, store the result as, say, clf1, and try clf1.pipeline; you should get output like Pipeline(...).
When working with pycaret=3.0.0rc4, you have two options.
Option 1:
get_config("pipeline")
Option 2:
lb = get_leaderboard()
lb.iloc[0]['Model']
Option 1 will give you the transformations done to the data whilst option 2 will give you the same plus the model and its parameters.
Here's some sample code (from a notebook, based on their documentation on the Binary Classification Tutorial (CLF101) - Level Beginner):
from pycaret.datasets import get_data
from pycaret.classification import *
dataset = get_data('credit')
data = dataset.sample(frac=0.95, random_state=786).reset_index(drop=True)
data_unseen = dataset.drop(data.index).reset_index(drop=True)
exp_clf101 = setup(data = data, target = 'default', session_id=123)
best = compare_models()
evaluate_model(best)
# OPTION 1
get_config("pipeline")
# OPTION 2
lb = get_leaderboard()
lb.iloc[0]['Model']

Splitting an image dataset

I have an image dataset (a folder of jpg images). I would like to split it randomly: 70% for train and 30% for test.
So I wrote this simple script:
from sklearn.model_selection import train_test_split
path = ".\dataset"
output_split=train_test_split(path,path,test_size=0.2)
But I don't find anything in an "output_split" folder.
So where is the output of the split (train and test) stored?
I recommend using tf.keras.utils.image_dataset_from_directory in your project. You can reference the documentation here:
https://www.tensorflow.org/api_docs/python/tf/keras/utils/image_dataset_from_directory
And this link for an example of the function's use:
https://keras.io/api/preprocessing/image/
Please remember to use the same seed for both training and testing/validation.
Please note that no changes will happen to your files on your local disk.
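A minimal sketch of that approach, assuming the jpgs live under dataset/ in one subfolder per class (the image size and batch size below are arbitrary placeholders):
import tensorflow as tf

# validation_split reserves 30% of the files for testing; using the same seed in
# both calls keeps the two subsets disjoint. Files are not moved on disk.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset",
    validation_split=0.3,
    subset="training",
    seed=123,
    image_size=(224, 224),
    batch_size=32,
)
test_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset",
    validation_split=0.3,
    subset="validation",
    seed=123,
    image_size=(224, 224),
    batch_size=32,
)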

Azure AutoML download metrics

I was wondering if there is a way to download the metrics for a model after a run has completed in AutoML in Azure? For example, I want to download the generated confusion matrix as a png file along with the other available metrics.
You can use AutoMLRun's get_output() method to do so -- check out this notebook example.
If you're using the UI to create AutoML runs, or need an output from a previously submitted run, you'll have to create a new AutoMLRun() instance using an Experiment object and the run_id, like below.
import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl.run import AutoMLRun
ws = Workspace.from_config()
experiment_name = 'YOUREXPERIMENTNAME'
experiment = Experiment(ws, experiment_name)
run_automl = AutoMLRun(experiment, run_id="YOUR RUN ID")
best_run, fitted_model = run_automl.get_output()
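While the rendered charts live in the UI, the underlying metric values can be pulled from the run itself and saved however you like. A minimal sketch (assuming best_run from the snippet above; azureml's Run.get_metrics() returns the logged metrics as a dict):
import json

metrics = best_run.get_metrics()
with open("automl_metrics.json", "w") as f:
    json.dump(metrics, f, indent=2, default=str)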
You cannot download the confusion matrix or other visualizations from AutoML. You can get a link to the UI from the run and view visualizations there. Why do you need this from the Python SDK?
Also, you can see visualizations through the RunDetails widget.

load csv and set parameters in jupyter notebook on Azure ML

I'm using a Python 3.4 Jupyter notebook to load a dataset that is stored in the cloud as a dataset in the Azure ML project environment. But using the default template created by Azure ML, I can't load the data due to a mixed-datatypes error.
from azureml import Workspace
import pandas as pd
ws = Workspace()
ds = ws.datasets['rossmann-train.csv']
df = ds.to_dataframe()
/home/nbuser/anaconda3_23/lib/python3.4/site-packages/IPython/kernel/main.py:6: DtypeWarning: Columns (7) have mixed types. Specify dtype option on import or set low_memory=False.
In my local environment I just import the dataset as follows:
df = pd.read_csv('train.csv',low_memory=False)
But I'm not sure how to do this in azure using the ds object.
df = pd.read_csv(ds)
and
pd.DataFrame.from_csv(ds)
raise the error:
OSError: Expected file path name or file-like object, got type
*edit: more info on the ds object:
In [1]: type(ds)
Out [1]: azureml.SourceDataset
In [2]: print (ds)
Out [2]: rossmann-train.csv
First of all, I am not sure from your question what the ds object is. But I'm pretty sure it is not a csv file, since, if it were, you'd have processed it yourself and you wouldn't be asking this question.
Now, I am not sure whether pandas has a native way of dealing with Azure, but this piece of documentation indicates that first you must download the data from Azure, using their package, and save it into your local file system.
But for that, they are assuming that the data you downloaded is already in the csv format. If not, use the appropriate reader (or parse it by hand) in order to tabulate the data for a pandas.DataFrame.
According to the docs on the azureml library, one workaround would be to import the file as text and then parse it into csv, but this seems unnecessary since the data is already recognised as being in csv structure.
text_data = ds.read_as_text()
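A minimal sketch of that workaround, assuming read_as_text() returns the raw csv contents as a single string; low_memory=False matches what you used locally to silence the mixed-dtype warning:
import io
import pandas as pd

text_data = ds.read_as_text()
# Wrap the string in a file-like object so pandas can parse it
df = pd.read_csv(io.StringIO(text_data), low_memory=False)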
