AzureML ParallelRunStep runs only on one node - azure-machine-learning-service

I have an inference pipeline with some PythonScriptSteps and a ParallelRunStep in the middle. Everything works fine except that all mini batches run on a single node during the ParallelRunStep, no matter how many nodes I put in the node_count config argument.
All the nodes appear to be up and running in the cluster, and according to the logs the init() function has run on them multiple times. Diving into the logs, I can see in sys/error/10.0.0.* that all the workers except the one that is actually working report:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/batch/tasks/shared/LS_root/jobs/virtualstage/azureml/c36eb050-adc9-4c34-8a33-5f6d42dcb19c/wd/tmp8_txakpm/bg.png'
bg.png happens to be a side input created in a previous PythonScriptStep that I'm passing to the ParallelRunStep:
bg_file = PipelineData('bg', datastore=data_store)
bg_file_ds = bg_file.as_dataset()
bg_file_named = bg_file_ds.as_named_input("bg")
bg_file_dw = bg_file_named.as_download()
...
parallelrun_step = ParallelRunStep(
    name='batch-inference',
    parallel_run_config=parallel_run_config,
    inputs=[frames_data_named.as_download()],
    arguments=["--bg_folder", bg_file_dw],
    side_inputs=[bg_file_dw],
    output=inference_frames_ds,
    allow_reuse=True
)
What's happening here? Why does the side input seem to be available on only one worker while it fails on the others?
BTW I found this similar but unresolved question.
Any help is much appreciated, thanks!

Apparently you need to specify a local mount path to use side_inputs on more than one node:
bg_file_named = bg_file_ds.as_named_input("bg")
bg_file_mnt = bg_file_named.as_mount(f"/tmp/{uuid.uuid4()}")
...
parallelrun_step = ParallelRunStep(
    name='batch-inference',
    parallel_run_config=parallel_run_config,
    inputs=[frames_data_named.as_download()],
    arguments=["--bg_folder", bg_file_mnt],
    side_inputs=[bg_file_mnt],
    output=inference_frames_ds,
    allow_reuse=True
)
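For completeness, here is a minimal sketch of how the ParallelRunStep entry script might consume the mounted side input; the argument parsing and file name are my own assumptions based on the pipeline definition above, not code from the original post:

import argparse
import os

bg_path = None

def init():
    global bg_path
    parser = argparse.ArgumentParser()
    # --bg_folder receives the local mount path of the side input on every node
    parser.add_argument("--bg_folder", type=str, required=True)
    args, _ = parser.parse_known_args()
    bg_path = os.path.join(args.bg_folder, "bg.png")

def run(mini_batch):
    # bg_path now resolves on every worker, not only the node that produced bg.png
    results = []
    for file_path in mini_batch:
        results.append(f"{file_path} processed with {bg_path}")
    return results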
Sources:
https://learn.microsoft.com/en-us/azure/machine-learning/how-to-debug-parallel-run-step
https://github.com/Azure/azure-sdk-for-python/issues/18355

Related

libtorch 1.6 gives me different results on each run

I traced a model and I am using it within libtorch, but for some bizarre reason, I notice that the results are different at each run :( I am totally baffled by this.
So, I load the model like so:
torch::jit::script::Module module_af;
module_af = torch::jit::load(af_model_path);
module_af.eval();
// std::cout << "Model Load ok\n";
filelog.get()->info("Model Load ok");
and I run the inference like so:
Mat img4encodingRGB = imread(align_filename, cv::COLOR_BGR2RGB);
auto img2encode = torch::from_blob(img4encodingRGB.data, {img4encodingRGB.rows, img4encodingRGB.cols, img4encodingRGB.channels()}, at::kByte);
img2encode = img2encode.to(at::kFloat).div(255).unsqueeze(0);
img2encode = img2encode.permute({ 0, 3, 1, 2 });
img2encode.sub_(0.5).div_(0.5);
I run the forward like so:
std::vector<torch::jit::IValue> arcface_inputs;
arcface_inputs.push_back(img2encode);
at::Tensor embeds0 = module_af.forward(arcface_inputs).toTensor();
std::cout << embeds0; // Gives different output on each run.
I am really baffled by this. The problem seems to be even worse - on two machines they produce identical results on consecutive runs, but on two other machines they don't. All packages are EXACTLY the same - libtorch 1.6 and the above code compiled using cmake.
It kind of reminds me of undefined behaviour, but I am totally lost because on two machines (a server and a VM) they produce identical results - but they don't on the other two.
I have triple checked everything to see if I am doing something stupid, but it does not seem like it - and hence my post.
Hope someone can point me to clues as to what I could be doing wrong :sob:
Example output, run 1:
Columns 1 to 10: -0.1005 -0.1768 -0.2082 0.1240 0.1185 0.3801 0.1378 0.1269 -0.3572 -1.1453
Run 2:
Columns 1 to 10: -0.1861 -0.3326 -0.3739 0.2302 0.1730 0.5391 0.1965 0.1972 -0.5481 -1.7317

Boto3 client in multiprocessing pool fails with "botocore.exceptions.NoCredentialsError: Unable to locate credentials"

I'm using boto3 to connect to s3, download objects and do some processing. I'm using a multiprocessing pool to do the above.
Following is a synopsis of the code I'm using:
session = None

def set_global_session():
    global session
    if not session:
        session = boto3.Session(region_name='us-east-1')

def function_to_be_sent_to_mp_pool():
    s3 = session.client('s3', region_name='us-east-1')
    list_of_b_n_o = list_of_buckets_and_objects
    for bucket, key in list_of_b_n_o:
        content = s3.get_object(Bucket=bucket, Key=key)
        data = json.loads(content['Body'].read().decode('utf-8'))
        write_processed_data_to_a_location()

def main():
    pool = mp.Pool(initializer=set_global_session, processes=40)
    pool.starmap(function_to_be_sent_to_mp_pool, list_of_b_n_o_i)
Now, with processes=40, everything works fine. With processes=64, still fine.
However, when I increase it to processes=128, I get the following error:
botocore.exceptions.NoCredentialsError: Unable to locate credentials
Our machine has the required IAM roles for accessing S3. Moreover, the weird thing is that for some processes it works fine, whereas for others it throws the credentials error. Why is this happening, and how can I resolve it?
Another weird thing is that I'm able to trigger two jobs in 2 separate terminal tabs (each of which has a separate SSH login shell to the machine). Each job spawns 64 processes, and that works fine as well, which means 128 processes are running simultaneously. But 80 processes in one login shell fails.
Follow up:
I tried creating separate sessions for separate processes in one approach. In the other, I directly created an S3 client using boto3.client. However, both of them throw the same error with 80 processes.
I also created separate clients with the following extra config:
Config(retries=dict(max_attempts=40), max_pool_connections=800)
This allowed me to use 80 processes at once, but anything > 80 fails with the same error.
Post follow up:
Can someone confirm if they've been able to use boto3 in multiprocessing with 128 processes?
This is actually a race condition on fetching the credentials. I'm not sure how credential fetching works under the hood, but I saw this question on Stack Overflow and this ticket on GitHub.
I was able to resolve this by keeping a random wait time for each of the processes. The following is the updated code which works for me:
client_config = Config(retries=dict(max_attempts=400), max_pool_connections=800)
time.sleep(random.randint(0, num_processes*10)/1000) # random sleep time in milliseconds
s3 = boto3.client('s3', region_name='us-east-1', config=client_config)
I tried keeping the range for the sleep time lower than num_processes*10, but that failed again with the same issue.
@DenisDmitriev, since you are getting the credentials and storing them explicitly, I think that resolves the race condition and hence the issue is resolved.
PS: the values for max_attempts and max_pool_connections don't follow any particular logic. I was plugging in several values until the race condition was figured out.
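To make the fix concrete, here is a minimal sketch of how that jittered client creation could be wired into a pool initializer; the helper names, item format, and num_processes value are illustrative and not from the original post:

import multiprocessing as mp
import random
import time

import boto3
from botocore.config import Config

num_processes = 128
s3 = None

def init_worker():
    global s3
    client_config = Config(retries=dict(max_attempts=400), max_pool_connections=800)
    # Stagger client creation so the workers don't all hit the credential/metadata
    # endpoint at the same instant
    time.sleep(random.randint(0, num_processes * 10) / 1000)
    s3 = boto3.client('s3', region_name='us-east-1', config=client_config)

def fetch(bucket_and_key):
    bucket, key = bucket_and_key
    return s3.get_object(Bucket=bucket, Key=key)['Body'].read()

if __name__ == '__main__':
    with mp.Pool(processes=num_processes, initializer=init_worker) as pool:
        bodies = pool.map(fetch, [('my-bucket', 'a.json'), ('my-bucket', 'b.json')])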
I suspect that AWS recently reduced throttling limits for metadata requests because I suddenly started running into the same issue. The solution that appears to work is to query credentials once before creating the pool and have the processes in the pool use them explicitly instead of making them query credentials again.
I am using fsspec with s3fs, and here's what my code for this looks like:
def get_aws_credentials():
    '''
    Retrieve current AWS credentials.
    '''
    import asyncio, s3fs

    fs = s3fs.S3FileSystem()
    # Try getting credentials
    num_attempts = 5
    for attempt in range(num_attempts):
        credentials = asyncio.run(fs.session.get_credentials())
        if credentials is not None:
            if attempt > 0:
                log.info('received credentials on attempt %s', 1 + attempt)
            return asyncio.run(credentials.get_frozen_credentials())
        time.sleep(15 * (random.random() + 0.5))
    raise RuntimeError('failed to request AWS credentials '
                       'after %d attempts' % num_attempts)

def process_parallel(fn_d, max_processes):
    # [...]
    c = get_aws_credentials()

    # Cache credentials
    import fsspec.config
    prev_s3_cfg = fsspec.config.conf.get('s3', {})
    try:
        fsspec.config.conf['s3'] = dict(prev_s3_cfg,
                                        key=c.access_key,
                                        secret=c.secret_key)
        num_processes = min(len(fn_d), max_processes)

        from concurrent.futures import ProcessPoolExecutor
        with ProcessPoolExecutor(max_workers=num_processes) as pool:
            for data in pool.map(process_file, fn_d, chunksize=10):
                yield data
    finally:
        fsspec.config.conf['s3'] = prev_s3_cfg
Raw boto3 code will look essentially the same, except instead of the whole fs.session and asyncio.run() song and dance, you'll work with boto3.Session itself and call its get_credentials() and get_frozen_credentials() methods directly.
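As a rough sketch of that raw boto3 variant (the helper names, the initializer pattern, and the (bucket, key) item format are my own illustration, not code from the answer above):

from concurrent.futures import ProcessPoolExecutor

import boto3

_frozen = None  # set once per worker process by the initializer

def get_frozen_aws_credentials():
    # Query the credential chain once, in the parent process
    creds = boto3.Session().get_credentials()
    return creds.get_frozen_credentials()  # ReadOnlyCredentials(access_key, secret_key, token)

def init_worker(frozen):
    global _frozen
    _frozen = frozen

def process_file(bucket_and_key):
    bucket, key = bucket_and_key
    # Build the client from the frozen credentials, so the worker never has to
    # query the metadata endpoint itself
    s3 = boto3.client(
        's3',
        aws_access_key_id=_frozen.access_key,
        aws_secret_access_key=_frozen.secret_key,
        aws_session_token=_frozen.token,
    )
    return s3.get_object(Bucket=bucket, Key=key)['Body'].read()

def process_parallel(items, max_processes):
    frozen = get_frozen_aws_credentials()
    with ProcessPoolExecutor(max_workers=max_processes,
                             initializer=init_worker,
                             initargs=(frozen,)) as pool:
        yield from pool.map(process_file, items, chunksize=10)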
I ran into the same problem in a multiprocessing setup. I suspect there is a client initialization problem when multiple processes are involved, so I suggest using a getter function to obtain the S3 client. It works for me.
g_s3_cli = None

def get_s3_client(refresh=False):
    global g_s3_cli
    if not g_s3_cli or refresh:
        g_s3_cli = boto3.client('s3')
    return g_s3_cli

How to use pystemd to control systemd timedated ntp service?

I'm working on a python app that needs to get the NTPSynchronized parameter from system-timedated. I'd also like to be able to start and stop the NTP service by using the SetNTP method.
To communicate with timedated over d-bus I have been using this as reference: https://www.freedesktop.org/wiki/Software/systemd/timedated/
I previously got this working with dbus-python, but have since learned that this library has been deprecated. I tried the dbus_next package, but that does not have support for Python 3.5, which I need.
I came across the pystemd package, but I am unsure if this can be used to do what I want. The only documentation I have been able to find is this example (https://github.com/facebookincubator/pystemd), but I can not figure out how to use this to work with system-timedated.
Here is the code I have that works with dbus-python:
import dbus
BUS_NAME = 'org.freedesktop.timedate1'
IFACE = 'org.freedesktop.timedate1'
bus = dbus.SystemBus()
timedate_obj = bus.get_object(BUS_NAME, '/org/freedesktop/timedate1')
# Get synchronization value
is_sync = timedate_obj.Get(BUS_NAME, 'NTPSynchronized', dbus_interface=dbus.PROPERTIES_IFACE)
# Turn off NTP
timedate_obj.SetNTP(False,False, dbus_interface=IFACE)
Here's what I have so far with pystemd, but I don't think I'm accessing it in the right way:
from pystemd.systemd1 import Unit
unit = Unit(b'systemd-timesyncd.service')
unit.load()
# Try to access properties
prop = unit.Properties
prop.NTPSynchronized
Running that I get:
AttributeError: 'SDInterface' object has no attribute 'NTPSynchronized'
I have a feeling that either the service I entered is wrong, or the way I'm accessing properties is wrong, or even both are wrong.
Any help or advice is appreciated.
Looking at the source code, it appears that the pystemd.systemd1 Unit object uses a default destination of "org.freedesktop.systemd1" plus the service name (https://github.com/facebookincubator/pystemd/blob/master/pystemd/systemd1/unit.py).
This is not what I want because I am trying to access "org.freedesktop.timedate1".
So instead I instantiated its base class SDObject from pystemd.base (https://github.com/facebookincubator/pystemd/blob/master/pystemd/base.py).
The following code allowed me to get the sync status of NTP
from pystemd.base import SDObject
obj = SDObject(
    destination=b'org.freedesktop.timedate1',
    path=b'/org/freedesktop/timedate1',
    bus=None,
    _autoload=False
)
obj.load()
is_sync = obj.Properties.Get('org.freedesktop.timedate1','NTPSynchronized')
print(is_sync)
Not sure if this is what the library author intended, but hey it works!
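The question also asked about SetNTP. Below is an untested sketch of how that call might look on the same object; I'm assuming pystemd attaches each introspected D-Bus interface as an attribute named after the last component of the interface name (the way Properties appears above), so the exact attribute name is a guess:

from pystemd.base import SDObject

obj = SDObject(
    destination=b'org.freedesktop.timedate1',
    path=b'/org/freedesktop/timedate1',
    bus=None,
    _autoload=False
)
obj.load()
# Assumption: the org.freedesktop.timedate1 interface is exposed as obj.timedate1,
# mirroring how org.freedesktop.DBus.Properties is exposed as obj.Properties
obj.timedate1.SetNTP(True, False)  # arguments are (use_ntp, interactive)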

using Threads with Jython/wlst

I'm working with WebLogic servers and want a class in my scripts that handles every connection separately with its own handler.
I'm using Jython 2.7 as the interpreter and include all the WebLogic libs needed.
The idea is to have a handler for every connection to the admin server I have and to control them separately.
For this I wrote a class with the functions needed (shutdown, getServers, etc.). The functions are working. My problem is that the script only keeps the last established connection active, not the connection held in my class itself.
I thought Processes would be the thing, and made something like this:
p1 = Process( target = IDIHandler.connect, args = None )
p2 = Process( target = IDEHandler.connect, args = None )
but it doesn't seem to work... The Process starts right away rather than when I call p1.start(), plus the same problem as above: only the last connected server is connected.
I have to admit I'm not even sure Process is the right approach for my problem. If you have other options for me to try, I'll try them :)
Thanks in advance

Unable to determine function entry point

Been using MS Bot Framework for a couple of months. Developing with the emulator in Node and using continuous integration to push to Azure.
Pushed last Wednesday and tested with no problems. Made two very inconsequential code changes, pushed on Friday, and now I'm getting:
Exception while executing function: Functions.messages. mscorlib: Unable to determine function entry point. I tried redeploying the older version, same thing.
Thoughts?
The function entrypoint is determined based on this logic. As you can see, the flow is:
If an explicit entry point is defined in function.json, use that
Otherwise, if there's a single exported function, use that
Otherwise, try to use a function named run or index (in that order)
I suspect you were in branch #2, and your minor change introduced new functions, so the runtime is now attempting to locate a function named run or index and that doesn't exist.
Can you please make sure your primary entry point function is named run or index and try again?
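For reference, option 1 above (an explicit entry point) is pinned in function.json roughly like this; the scriptFile path and binding details here are illustrative, not taken from the original project:

{
  "entryPoint": "run",
  "scriptFile": "index.js",
  "bindings": [
    { "type": "httpTrigger", "direction": "in", "name": "req" },
    { "type": "http", "direction": "out", "name": "res" }
  ]
}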
Turns out there was a short-lived bug in the Azure git integration and I deployed during the window this bug was live. It modified function.json and left it in an invalid state. Kudos to MS Support for staying with the issue and determining the root cause.
In my case, the root cause was having one named export and one default export.
To fix this, export only a single default from the entry-point file (index.js in my case).
Leaving this here as a trail in case someone faces the same thing.
1) Try stopping the service in Azure.
2) Then go to Kudu at https://[YourAzureSiteName].scm.azurewebsites.net/DebugConsole and run npm install.
3) Then restart the service in Azure.
You can use module.exports for example:
module.exports = async function queryDatabase() {
    const pg = require('pg');
    //...
    //...
}
If you are exporting multiple functions, for example function1 and function2, then adding this line to the end of the file resolves the issue:
exports.index = function1
