How to export-import MLflow experiments with the original experiment ID using mlflow-export-import? - mlflow

I am using https://github.com/mlflow/mlflow-export-import to export experiments from local mlruns/ to an RDS Postgres. However, for each new imported experiment, the experiment ID is not sequential; it is the sum of all runs and experiments imported before.
For example:
ID 0: Default, 0 runs
ID 1: Experiment 1, 88 runs
ID 90: Experiment 2, 86 runs
ID 177: Experiment 3, 1 run
ID 179: Experiment 4, 10 runs
Since experiment IDs cannot be set manually, is there a way to change the mlflow-export-import code to use the original experiment ID? Or, at least, to use incremental experiment IDs for each newly imported experiment?

Experiment IDs are auto-generated by MLflow. There is no API to set or change them. In open source MLflow they are monotonically increasing integers; in Databricks MLflow they are UUIDs.
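For illustration, here is a minimal sketch (the tracking URI and experiment name are hypothetical) showing that the client API only accepts an experiment name; the backend store assigns the ID at creation time, and there is no argument to set it:
import mlflow

# Point at the target tracking store (hypothetical connection string).
mlflow.set_tracking_uri("postgresql://user:pass@host:5432/mlflow")

# create_experiment() takes a name plus optional artifact location and tags;
# the backend assigns and returns the new experiment ID.
new_id = mlflow.create_experiment("imported-experiment-1")
print(new_id)  # e.g. "5" on open source MLflow, a UUID on Databricks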

Related

Submitting multiple runs to the same node on AzureML

I want to perform a hyperparameter search using AzureML. My models are small (around 1 GB), so I would like to run multiple models on the same GPU/node to save costs, but I do not know how to achieve this.
The way I currently submit jobs is the following (resulting in one training run per GPU/node):
experiment = Experiment(workspace, experiment_name)
config = ScriptRunConfig(source_directory="./src",
                         script="train.py",
                         compute_target="gpu_cluster",
                         environment="env_name",
                         arguments=["--args args"])
run = experiment.submit(config)
ScriptRunConfig can be provided with a distributed_job_config. I tried to use MpiConfiguration there, but when I do, the run fails with an MPI error that reads as if the cluster is configured to only allow one run per node:
Open RTE detected a bad parameter in hostfile: [...]
The max_slots parameter is less than the slots parameter:
slots = 3
max_slots = 1
[...] ORTE_ERROR_LOG: Bad Parameter in file util/hostfile/hostfile.c at line 407
Using HyperDriveConfig also defaults to submitting one run to one GPU, and additionally providing an MpiConfiguration leads to the same error as shown above.
I guess I could always rewrite my train script to train multiple models in parallel, such that each run wraps multiple trainings. I would like to avoid this option though, because logging and checkpoint writes become increasingly messy and it would require a large refactor of the training pipeline. Also, this functionality seems so basic that I hope there is a way to do it gracefully. Any ideas?
Use the Run.create_children method, which will start child runs that are "local" to the parent run and don't need authentication.
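As a rough sketch (the exact create_children signature is an assumption against your azureml-core version), the idea is to fan out several child runs from inside the script that is already running on the node:
import random

from azureml.core import Run

def train_one_model(config_index):
    # Placeholder for the real training code; returns a fake metric here.
    return random.random()

# The run this script was submitted as (the parent run on the node).
parent_run = Run.get_context()

# Start several child runs on the same node; they are "local" to the parent
# and reuse its authentication, so no extra credentials are needed.
children = parent_run.create_children(count=3)

for i, child in enumerate(children):
    metric = train_one_model(config_index=i)
    child.log("val_metric", metric)
    child.complete()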
For AmlCompute, max_concurrent_runs maps to the maximum number of nodes that will be used to run a hyperparameter tuning run.
So there would be one execution per node.
A single service is deployed, but you can load multiple model versions in init(); the score function then, depending on the request's parameters, uses a particular model version to score.
Or use the new ML Endpoints (Preview).
What are endpoints (preview) - Azure Machine Learning | Microsoft Docs
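A rough sketch of that multi-model scoring script (the model name, versions and request parameter are made up for illustration):
import json

import joblib
from azureml.core.model import Model

models = {}

def init():
    # Load every registered model version we want to serve (hypothetical name/versions).
    global models
    for version in (1, 2):
        path = Model.get_model_path("my_model", version=version)
        models[version] = joblib.load(path)

def run(raw_data):
    # Pick the model version from a request parameter, defaulting to version 1.
    payload = json.loads(raw_data)
    version = payload.get("model_version", 1)
    prediction = models[version].predict(payload["data"])
    return {"model_version": version, "prediction": prediction.tolist()}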

Airflow DAGs - reporting of the runtime for tracking purposes

I am trying to find a way to capture DAG stats, i.e. run time (start time, end time), status, dag id, task id, etc., for various DAGs and their tasks in a separate table.
I found the default logs that go to Elasticsearch/Kibana, but there is no simple way to pull the required logs from there back into the S3 table.
Building a separate process to load those logs into S3 would duplicate data, and there would also be too much data to scan and filter, as tons of other system-related logs are generated as well.
Adding a function to each DAG would mean modifying every DAG.
What other possibilities are there to get this done efficiently, or is there any other Airflow built-in feature that can be used?
You can try using the Ad Hoc Query feature available in Apache Airflow.
This option is available at Data Profiling -> Ad Hoc Query; select airflow_db.
If you wish to get DAG statistics such as start_date, end_date, etc., you can simply query in the below format:
select start_date,end_date from dag_run where dag_id = 'your_dag_name'
The above query returns the start_date and end_date of the DAG for all of its runs. If you wish to get details for a particular run, you can add another filter condition like below:
select start_date,end_date from dag_run where dag_id = 'your_dag_name' and execution_date = '2021-01-01 09:12:59.0000' -- this is a sample time
You can get this execution_date from the tree or graph views. You can also get other stats such as id, dag_id, execution_date, state, run_id and conf.
You can also refer to https://airflow.apache.org/docs/apache-airflow/1.10.1/profiling.html for more details.
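If you prefer to pull the same stats programmatically rather than through the UI, a minimal sketch (the connection string is a placeholder; the dag_run columns follow the Airflow metadata schema used above) could look like this:
from sqlalchemy import create_engine, text

# Connection string to the Airflow metadata database (hypothetical credentials).
engine = create_engine("postgresql://airflow:airflow@metadata-host:5432/airflow")

query = text("""
    select dag_id, run_id, state, start_date, end_date
    from dag_run
    where dag_id = :dag_id
""")

with engine.connect() as conn:
    rows = conn.execute(query, {"dag_id": "your_dag_name"}).fetchall()

for dag_id, run_id, state, start_date, end_date in rows:
    # From here the rows could be inserted into your own tracking table
    # (or written out to S3) instead of printed.
    print(dag_id, run_id, state, start_date, end_date)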
You did not mention whether you need this information in real time or in batches.
Since you do not want to use the ES logs either, you can try Airflow metrics, if that suits your needs.
However, pulling this information from the database is not efficient in any case, but it is still an option if you are not looking for real-time data collection.

Python Data saving performance

I've got a bottleneck with data and would appreciate some senior advice.
I have an API where I receive financial data that looks like this: GBPUSD 2020-01-01 00:00:01.001 1.30256 1.30250. My goal is to write this data directly into a database as fast as possible.
Inputs:
Python 3.8
PostgreSQL 12
Redis Queue (Linux)
SQLAlchemy
The incoming data structure, as shown above, comes in one dictionary: {symbol: {datetime: (price1, price2)}}. All of the data comes in as strings.
The API streams 29 symbols, so I can receive, for example, from 30 to 60+ values of different symbols in a single second.
How it works now:
I receive a new value in the dictionary;
All new values of each symbol, as they come in, are stored in one dict variable, data_dict;
Next I look that dictionary up by symbol key and last value, and send the data to the Redis Queue: data_dict[symbol][last_value].enqueue(save_record, args=(datetime, price1, price2)). Up to this point everything works fine and fast.
When it gets to the Redis worker, there is the save_record function:
from sqlalchemy import create_engine, MetaData, Table

def save_record(Datetime, price1, price2, Instr, adf):
    # Parameters
    # ----------
    # Datetime : 'string' : Datetime value
    # price1   : 'string' : Bid value
    # price2   : 'string' : Ask value
    # Instr    : 'string' : Symbol to save
    # adf      : 'string' : Credentials for the database engine
    # -------
    # result   : Execute the insert command against the database
    engine = create_engine(adf)
    meta = MetaData(bind=engine, reflect=True)
    table_obj = Table(Instr, meta)
    insert_state = table_obj.insert().values(Datetime=Datetime, price1=price1, price2=price2)
    with engine.connect() as conn:
        conn.execute(insert_state)
When I execute the last row of the function, it takes from 0.5 to 1 second to write that row into the database:
12:49:23 default: DT.save_record('2020-00-00 00:00:01.414538', 1.33085, 1.33107, 'USDCAD', 'postgresql cred') (job_id_1)
12:49:24 default: Job OK (job_id_1)
12:49:24 default: DT.save_record('2020-00-00 00:00:01.422541', 1.56182, 1.56213, 'EURCAD', 'postgresql cred') (job_id_2)
12:49:25 default: Job OK (job_id_2)
The queued jobs that insert each row directly into the database are the bottleneck, because I can insert only 1-2 values per second while I can receive over 60 values per second. If I run this saving, it starts to create a huge queue (the maximum I got was 17,000 records in the queue after 1 hour of listening to the API), and it does not stop growing.
I'm currently using only 1 queue and 17 workers. This makes my PC's CPU run at 100%.
So the question is how to optimize this process and not create a huge queue. Maybe save some sequence, for example to JSON, and then insert it into the DB, or store the incoming data in separate variables?
Sorry if something is unclear; ask and I'll answer.
--UPD--
So here's my little review of some experiments:
Move engine and meta out of the function:
Due to my architecture, the API application is located on Windows 10 and the Redis Queue on Linux. There was an issue with moving meta and engine out of the function: it returns a TypeError (it does not depend on the OS); there is a little info about it here.
Insert multiple rows in a batch:
This approach seemed to be the simplest and easiest, and it is! Basically, I just created a dictionary, data_dict = {'data_pack': []}, to begin storing the incoming values there. Then, once more than 20 values per symbol have been written, I send that batch to the Redis Queue, and it takes about 1.5 seconds to write it to the database. Then I delete the taken records from data_dict and the process continues. So thanks, Mike Organek, for the good advice.
This approach is quite enough for my needs; at the same time, I can say that this stack of tech provides really good flexibility!
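A minimal sketch of that batching idea (batch size, DSN and helper names are assumptions, not the asker's actual code): accumulate rows per symbol and hand the whole list to a single executemany-style insert.
from sqlalchemy import create_engine, MetaData, Table

BATCH_SIZE = 20
engine = create_engine("postgresql://user:pass@host:5432/ticks")  # hypothetical DSN
meta = MetaData(bind=engine, reflect=True)

def save_records(instr, rows):
    # rows is a list of dicts like {"Datetime": ..., "price1": ..., "price2": ...};
    # passing a list to execute() issues one multi-row (executemany) insert.
    table_obj = Table(instr, meta)
    with engine.connect() as conn:
        conn.execute(table_obj.insert(), rows)

buffer = {}  # symbol -> list of pending rows

def on_tick(symbol, dt, bid, ask):
    buffer.setdefault(symbol, []).append(
        {"Datetime": dt, "price1": bid, "price2": ask})
    if len(buffer[symbol]) >= BATCH_SIZE:
        # In the real setup this call would be enqueued to the Redis worker.
        save_records(symbol, buffer.pop(symbol))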
Every time you call save_record you re-create the engine and (reflected) meta objects, both of which are expensive operations. Running your sample code as-is gave me a throughput of
20 rows inserted in 4.9 seconds
Simply moving the engine = and meta = statements outside of the save_record function (and thereby only calling them once) improved throughput to
20 rows inserted in 0.3 seconds
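A sketch of that refactor (the DSN is a placeholder; otherwise it mirrors the code above, with the expensive objects created once at import time):
from sqlalchemy import create_engine, MetaData, Table

# Created once when the worker module is imported, not on every job.
engine = create_engine("postgresql://user:pass@host:5432/ticks")  # hypothetical DSN
meta = MetaData(bind=engine, reflect=True)

def save_record(Datetime, price1, price2, Instr):
    table_obj = Table(Instr, meta)
    insert_state = table_obj.insert().values(
        Datetime=Datetime, price1=price1, price2=price2)
    with engine.connect() as conn:
        conn.execute(insert_state)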
Additional note: It appears that you are storing the values for each symbol in a separate table, i.e. 'GBPUSD' data in a table named GBPUSD, 'EURCAD' data in a table named EURCAD, etc. That is a "red flag" suggesting bad database design. You should be storing all of the data in a single table with a column for the symbol.
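For instance, a single-table layout along those lines (table and column names are illustrative) might be declared as:
from sqlalchemy import (Column, DateTime, MetaData, Numeric, String, Table,
                        create_engine)

engine = create_engine("postgresql://user:pass@host:5432/ticks")  # hypothetical DSN
meta = MetaData()

# One table for every symbol, with the symbol as an ordinary indexed column.
quotes = Table(
    "quotes", meta,
    Column("symbol", String(10), nullable=False, index=True),
    Column("ts", DateTime, nullable=False),
    Column("bid", Numeric(12, 6), nullable=False),
    Column("ask", Numeric(12, 6), nullable=False),
)

meta.create_all(engine)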

Azure ML: How to train a model on multiple instances

I have an AML compute cluster with the min & max nodes set to 2. When I execute a pipeline, I expect the cluster to run the training on both instances in parallel, but the cluster status reports that only one node is busy and the other is idle.
Here's my code to submit the pipeline. As you can see, I'm resolving the cluster name and passing that to my Step1, which trains a model with Keras.
aml_compute = AmlCompute(ws, "cluster-name")
step1 = PythonScriptStep(name="train_step",
                         script_name="Train.py",
                         arguments=["--sourceDir", os.path.realpath(source_directory)],
                         compute_target=aml_compute,
                         source_directory=source_directory,
                         runconfig=run_config,
                         allow_reuse=False)
pipeline_run = Experiment(ws, 'MyExperiment').submit(pipeline1, regenerate_outputs=False)
Each Python script step runs on a single node even if you allocate multiple nodes in your cluster. I'm not sure whether training on different instances is possible off-the-shelf in AML, but there's definitely the possibility to use that single node more effectively (look into using all your cores, etc.).
Really great question. The TL;DR is that there isn't an easy way to do that right now. IMHO there are a few questions within your question; here's a stab at all of them.
Distributed training with Keras
I'm no Keras expert, but from their distributed training guide I'd want to know which kind of parallelism you are after: model parallelism or data parallelism?
For data parallelism, it looks like the tf.distribute API is the way to go. I would strongly recommend getting that working on a single, multi-GPU machine (local or Azure VM) without Azure ML before starting to use Pipelines.
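As a rough illustration of the data-parallel route (a generic MirroredStrategy example, not taken from the asker's project):
import tensorflow as tf

# MirroredStrategy replicates the model across all GPUs visible on one machine
# and splits each batch between them (synchronous data parallelism).
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Build and compile the model inside the strategy scope so its variables
    # are mirrored across the replicas.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Training proceeds as usual; the random tensors stand in for real data.
x_train = tf.random.normal((1024, 20))
y_train = tf.random.normal((1024, 1))
model.fit(x_train, y_train, batch_size=64, epochs=2)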
Distributed training with Azure ML
This Azure ML notebook shows how to use PyTorch with Horovod on AzureML. Seems not too tricky to change this to work with keras.
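For reference, the way a distributed job is usually wired up outside of a Pipeline (treat the exact parameter names as assumptions against your azureml-core version) looks roughly like this:
from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace
from azureml.core.runconfig import MpiConfiguration

ws = Workspace.from_config()

# One process per node across two nodes; the training script is expected to
# initialize Horovod (or a multi-worker tf.distribute strategy) itself.
distr_config = MpiConfiguration(process_count_per_node=1, node_count=2)

config = ScriptRunConfig(source_directory="./src",
                         script="train.py",
                         compute_target="gpu_cluster",
                         environment=Environment.get(ws, "env_name"),
                         distributed_job_config=distr_config)

run = Experiment(ws, "distributed-keras").submit(config)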
As for how to get distributed training to work inside of an Azure ML Pipeline, one spitball workaround would be to have the PythonScriptStep be a controller that would create a new compute cluster and submit the training script to it. I'm not too confident but I'll do some digging.
Multi-node PythonScriptSteps
This is possible (at least with pyspark). Below is a PythonScriptStep from a production pipeline of ours that can run on more than one node. It uses a Docker image with Spark pre-installed and a pyspark RunConfiguration; at run time one node acts as the primary orchestrator and the other as a secondary worker.
import os

from azureml.core import Environment, RunConfiguration
from azureml.pipeline.steps import PythonScriptStep

spark_env = Environment.from_pip_requirements(
    'spark_env',
    os.path.join(os.getcwd(), 'compute', 'spark-requirements.txt'))
spark_env.docker.enabled = True
spark_env.docker.base_image = 'microsoft/mmlspark:0.16'

spark_run_config = RunConfiguration(framework="pyspark")
spark_run_config.environment = spark_env
spark_run_config.node_count = 2

roll_step = PythonScriptStep(
    name='roll.py',
    script_name='roll.py',
    arguments=['--input_dir', joined_data,
               '--output_dir', rolled_data],
    compute_target=compute_target_spark,
    inputs=[joined_data],
    outputs=[rolled_data],
    runconfig=spark_run_config,
    source_directory=os.path.join(os.getcwd(), 'compute', 'roll'),
    allow_reuse=pipeline_reuse
)

Nodejs agenda job scheduler - How to group multiple jobs that will run in the same database scan into 1 batch job?

Problem
I have a purchase service that users can use to buy/rent digital assets like games, media, movies... When a purchase event happens, I create a job and schedule it to run at the calculated expiry date to remove the key for that asset.
Everything works. But it would be better if I could group the jobs that will run in the same Agenda db scan into one batch job that removes multiple keys.
This would reduce a significant amount of db read/write/delete operations on both the keys and agenda collections. It would also free up memory most of the time, since instead of storing 100+ jobs to run in a scan, it stores just one job that removes 100+ keys.
Research
The closest feature I found in the Agenda repo is unique(), which allows the user to find and modify an existing job that matches the fields defined in unique(). If it could concatenate new jobs onto the existing job, that would solve my case.
Implementation
Before diving in and modifying the package, I want to check whether people have already solved this problem and have some thoughts to share.
Another solution, without touching the package, is to create an in-memory dictionary that accumulates jobs for a specific db scan with this strategy:
dict = {}

// if a key expires at 1597202228, put it into the dict slot:
dict = {
    1597300000: [jobA]
}

// another key expires at 1597202238, put it into the same slot:
dict = {
    1597300000: [jobA, jobB]
}

// latch conditions to push a batch of jobs into agenda:
// - if dict_size == dict_allocated_memory, put the whole dict into the db.
// - if a batch_size == batch_limit, put that batch into the db and remove it from the dict.
// - if a batch is going to expire in the next db scan, put the batch (it may be empty or have only a few jobs) into the db and remove it from the dict.
