Slurm not optimally allocating multiple GPUs

We are using Slurm 20.02 with NVML autodetect, and on some 8-GPU nodes with NVLink, 4-GPU jobs get allocated by Slurm in a surprising way that appears sub-optimal.
On a system with 8 Nvidia A40 GPUs, 4 NVLink bridges, and two AMD EPYC 7302 CPUs, we have the following topology:
$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity
GPU0 X NV4 SYS SYS SYS SYS SYS SYS 12-15,44-47 3
GPU1 NV4 X SYS SYS SYS SYS SYS SYS 8-11,40-43 2
GPU2 SYS SYS X NV4 SYS SYS SYS SYS 4-7,36-39 1
GPU3 SYS SYS NV4 X SYS SYS SYS SYS 0-3,32-35 0
GPU4 SYS SYS SYS SYS X NV4 SYS SYS 28-31,60-63 7
GPU5 SYS SYS SYS SYS NV4 X SYS SYS 24-27,56-59 6
GPU6 SYS SYS SYS SYS SYS SYS X NV4 20-23,52-55 5
GPU7 SYS SYS SYS SYS SYS SYS NV4 X 16-19,48-51 4
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NV# = Connection traversing a bonded set of # NVLinks
We see Slurm allocate 4-GPU jobs in groups such as [0,1,2,4], [1,2,3,7], or [0,4,5,6] (using nvidia-smi numbering rather than the device minor numbers, which here correspond to the NUMA Affinity column in the table above), i.e., one pair of NVLinked GPUs plus two unlinked GPUs.
We were expecting groups such as [0,1,2,3] or [0,1,4,5], i.e., multiple pairs of NVLinked GPUs.
Some potentially relevant specs/settings:
# NVIDIA:
Driver Version: 460.32.03
CUDA Toolkit Version: 11.1
# slurm.conf:
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
AccountingStorageTRES=gres/gpu
JobAcctGatherType=jobacct_gather/linux
Questions:
Is this behavior expected?
Is there a way to force Slurm to allocate multiple pairs of NVLinked GPUs?
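One configuration knob worth knowing about: the gres.conf man page documents a Links field, normally populated automatically by NVML autodetect, which select/cons_tres uses to co-schedule better-connected GPUs. Purely to illustrate the format (this is a hypothetical sketch, not a confirmed fix for this allocation pattern; the node name, device-file-to-GPU mapping, and core ranges are placeholders to be checked against man gres.conf for your Slurm version), a hand-written gres.conf for such a node could look roughly like this:
# Hypothetical hand-written gres.conf (alternative to AutoDetect=nvml).
# Links is a comma-separated list with one entry per GPU index: the value is the
# number of NVLink connections to that peer, and -1 marks the device itself.
NodeName=gpunode01 Name=gpu Type=a40 File=/dev/nvidia0 Cores=12-15 Links=-1,4,0,0,0,0,0,0
NodeName=gpunode01 Name=gpu Type=a40 File=/dev/nvidia1 Cores=8-11  Links=4,-1,0,0,0,0,0,0
NodeName=gpunode01 Name=gpu Type=a40 File=/dev/nvidia2 Cores=4-7   Links=0,0,-1,4,0,0,0,0
NodeName=gpunode01 Name=gpu Type=a40 File=/dev/nvidia3 Cores=0-3   Links=0,0,4,-1,0,0,0,0
# ...and so on for /dev/nvidia4 through /dev/nvidia7
With AutoDetect these values should already be present, so a manual file mainly serves to verify or override what autodetect reports; whether cons_tres then prefers two NVLinked pairs for a 4-GPU request is exactly what the question above is asking.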

Related

Pymongo, Motor memory leak

Background: I use tornado + motor and noticed that memory usage keeps increasing.
I then wrote the test.py below. The db.tasks collection "size" is 12192854 (10+ MB). After one minute, MEM USAGE / LIMIT is 1.219GiB / 8GiB.
env:
python 3.7.5
motor 2.5.0 (2.1.0 before upgrade)
multidict 4.7.5
pymongo 3.12.0
Here is my code:
import os
import gc
import time
import logging
import asyncio
import uvloop
import pdb
import pymongo
import base64
from tornado.platform.asyncio import AsyncIOMainLoop
from guppy import hpy
from motor import motor_asyncio

mongo_auth = 'xxxxx='
runtime_mongos = arch_mongos = {
    "host": f"mongodb://{base64.b64decode(mongo_auth).decode()}#" + ','.join(
        [
            "1xxx:27024",
            "2xxx:27024",
            "3xxx:27024",
        ]),
    "readPreference": "secondaryPreferred"
}
table = motor_asyncio.AsyncIOMotorClient(**runtime_mongos)["db"]["tasks"]

async def get_data():
    return await table.find().sort([
        ("priority", pymongo.ASCENDING),
        ("start_uts", pymongo.ASCENDING),
    ]).to_list(None)

async def test():
    while True:
        a = await get_data()
        print(len(a))
        await asyncio.sleep(1)
        gc.collect()  # no use!

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(test())
Eventually I noticed that the Python process has a lot of threads, which gave me a clue about Motor's ThreadPoolExecutor.
The relevant code in motor 2.1:
if 'MOTOR_MAX_WORKERS' in os.environ:
    max_workers = int(os.environ['MOTOR_MAX_WORKERS'])
else:
    max_workers = tornado.process.cpu_count() * 5
_EXECUTOR = ThreadPoolExecutor(max_workers=max_workers)
I set MOTOR_MAX_WORKERS=1 and memory usage stays low.
I deploy my project in Docker, but the container's CPUs are not exclusive, so cpu_count() sees the whole host and the derived max_workers is far too large. I guess that is why the default was so irrational.
My fault...
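For reference, since the quoted Motor code sizes its thread pool at import time, the override has to be in the environment before motor is imported. A minimal sketch (assuming nothing imports motor earlier in the process):
import os
os.environ.setdefault("MOTOR_MAX_WORKERS", "1")  # must be set before motor is imported

from motor import motor_asyncio  # Motor reads MOTOR_MAX_WORKERS when creating its ThreadPoolExecutor
In Docker it is usually cleaner to set it on the image instead (e.g. ENV MOTOR_MAX_WORKERS=1 in the Dockerfile), so the value matches the CPU quota actually given to the container.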

Understand python multiprocessing child process memory usage with 'spawn' start method

I am trying to create a child process in Python 3.8.0 using the multiprocessing module without inheriting the parent's memory. I am using the spawn start method, mp.set_start_method('spawn'), for this, but the memory usage of the child process is almost the same as the parent's. Code snippets are below.
For testing I am using the code shared in How can I restrict the scope of a multiprocessing process?
memtest.py
import multiprocessing as mp
import numpy as np

def foo(x):
    import time
    time.sleep(60)

if __name__ == "__main__":
    mp.set_start_method('spawn')
    dont_inherit = np.ones((500, 100))
    for x in range(3):
        mp.Process(target=foo, args=(x,)).start()
run using python3 memtest.py
Memory usage from top (columns: VIRT RES SHR S %CPU %MEM TIME+ COMMAND):
449m 28m 14m S 0.0 0.2 0:00.44 python3 memtest.py
34904 10m 5816 S 0.0 0.1 0:00.03 /srv/env/bin/python3 -c from multiprocessing.resource_tracker import main;main(5)
252m 26m 13m S 0.0 0.2 0:00.26 /srv/env/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=6, pipe_handle=20) --multiprocessing-fork
252m 27m 13m S 0.0 0.2 0:00.21 /srv/env/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=6, pipe_handle=22) --multiprocessing-fork
252m 26m 13m S 0.0 0.2 0:00.23 /srv/env/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=6, pipe_handle=24) --multiprocessing-fork
I am using a virtualenv with Python 3.8.0 on Ubuntu 18.04.
$ python3 --version
Python 3.8.0
What is wrong with this approach to creating a child process? I need to create a lot of child processes that have to be lightweight; I initially figured mp's spawn approach would do this, but it doesn't seem to be working.
Short answer: spawn also copies global variables, so you should either:
create the processes first and only then create dont_inherit; I think this is more elegant, but it is probably not always possible; or
del dont_inherit in each subprocess. After the subprocess is created you have only one copy of dont_inherit in memory (at least on Linux, where copy-on-write works well), so removing the subprocess's reference to dont_inherit is cheap and fast.
Here is a longer story:
I am not sure what exactly ps measures, so I think it is better to look at total memory usage (e.g., using htop).
import multiprocessing as mp

ctx = mp.get_context('spawn')  # or fork, both work the same
q = ctx.Queue()

def proc(q):
    while True:
        msg = q.get()
        print("Q", msg)

longlist = [x for x in range(60_000_000)]
# additional 2.3 GB in RAM
p = ctx.Process(target=proc, args=(q,))
p.start()
# no change in memory usage
for i in range(len(longlist)):
    longlist[i] = longlist[i] + 1  # memory usage growing
# when the for loop has ended you have an additional 2.3 GB in RAM (now ~4.6 GB is used),
# because you have the original longlist in the subprocess
# and the modified longlist in the main process
Below is the same, but with del of the global variable in the subprocess:
import multiprocessing as mp

ctx = mp.get_context('spawn')  # or fork, both work the same
q = ctx.Queue()

def proc(q):
    global longlist
    del longlist
    while True:
        msg = q.get()
        print("Q", msg)

longlist = [x for x in range(60_000_000)]
# additional 2.3 GB in RAM
p = ctx.Process(target=proc, args=(q,))
p.start()
# no change in memory usage
for i in range(len(longlist)):
    longlist[i] = longlist[i] + 1  # no change in memory usage
# at this point total memory usage is still ~2.3 GB
rmrmg's answer is misleading.
Spawn will copy over global variables, yes, but it won't copy over memory that's protected by the __name__=='__main__' scope. Spawn essentially imports your current module, and when it does this import, the __name__=='__main__' block does not activate. This is the point of __name__=='__main__' (to protect execution code so that it is not run at import).
Now, regarding why your memory usage is similar across your processes: your dont_inherit array is 500*100 float64 values, which amounts to 8*500*100 = 400,000 bytes, or about 400 KB. Your subprocesses indeed don't have your dont_inherit object; the memory saved is just so small that you can't even detect it from running top.
In the future, you should try to access these kinds of objects directly so that you can confirm whether they're present or not. E.g.
import multiprocessing as mp
import numpy as np

def foo(x):
    global dont_inherit
    print(dont_inherit)

if __name__ == "__main__":
    mp.set_start_method('spawn')
    dont_inherit = np.ones((500, 100))
    for x in range(3):
        mp.Process(target=foo, args=(x,)).start()
If you run this, you'll see the print statements throw an error (a NameError), because nothing is there.
You can also make your dont_inherit variable larger by a couple of orders of magnitude so you can actually see the memory usage.
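For instance, a small variation of the test (a sketch: the array is enlarged to roughly 400 MB and the child's peak RSS is printed with the standard resource module, which is Linux-specific) shows both points at once, namely that the child has no dont_inherit and that its memory stays small:
import multiprocessing as mp
import resource
import numpy as np

def foo(x):
    # Under 'spawn' the child re-imports this module; the __main__ block below is not
    # executed on import, so dont_inherit never exists in the child.
    print("child has dont_inherit:", "dont_inherit" in globals())
    print("child peak RSS (KB):", resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

if __name__ == "__main__":
    mp.set_start_method('spawn')
    dont_inherit = np.ones((5000, 10000))  # ~400 MB of float64, big enough to see in top
    for x in range(3):
        mp.Process(target=foo, args=(x,)).start()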

Capture a terminal output in real time of an python module

The Python 'yfinance' module downloads the quotes of many financial securities into a pandas DataFrame and meanwhile displays a progress bar in the console, like this:
import yfinance as yf
Tickerlist = ["AAPL","GOOG","MSFT"]
quote = yf.download(tickers=Tickerlist,period='max',interval='1d',group_by='ticker')
I would like to capture the console progress bar in real time, and my attempt is this:
import sys
import subprocess

process = subprocess.Popen(["yf.download", "tickers=Tickerlist", "period='max'", "interval='1d'", "group_by='ticker'"], stdout=quote)
while True:
    out = process.stdout.read(1)
    sys.stdout.write(out)
    sys.stdout.flush()
I am making a big mess with subprocess. I need your help, thanks!
I have already seen all the links that deal with this topic, but I have not been able to solve my problem.
You need two Python files to do what you want: one is yf_download.py and the second is run.py.
The code looks like this, and you run it through run.py:
python run.py
yf_download.py
import sys
import yfinance as yf

Tickerlist = ["AAPL", "GOOG", "MSFT"]

def run(period):
    yf.download(tickers=Tickerlist, period=period, interval='1d', group_by='ticker')

if __name__ == '__main__':
    period = sys.argv[1]
    run(period)
run.py
import sys
import subprocess

process = subprocess.Popen(["python", "yf_download.py", "max"], stdout=subprocess.PIPE)
while True:
    out = process.stdout.read(1)
    if process.poll() is not None:
        break
    if out != b'':  # the pipe yields bytes, so compare against b''
        sys.stdout.buffer.write(out)
        sys.stdout.flush()
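One caveat, stated as an assumption rather than something verified against yfinance: many console progress bars are written to stderr rather than stdout, in which case the pipe above stays empty. A variant of run.py that merges the child's stderr into the same pipe covers both cases:
import sys
import subprocess

# Same idea as run.py above, but with the child's stderr redirected into the stdout pipe.
process = subprocess.Popen(
    ["python", "yf_download.py", "max"],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
)
while True:
    out = process.stdout.read(1)
    if process.poll() is not None:
        break
    if out != b'':  # the pipe yields bytes, not str
        sys.stdout.buffer.write(out)
        sys.stdout.flush()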

Why does PyQt5 QFileDialog.getExistingDirectory fail to see ~/.config/subdirectory?

Under Python 2.7 and PySide, I was able to point to subdirectories of ~/.config/. However, since moving to Python 3 and PyQt5, I can open ~/.config/ itself but not its subdirectories... (All of the directories have drwxr-xr-x permissions and no special chattr or ACL stuff happening.)
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Arch Linux (up-to-date)
# Python 3.6.5
# python-pyqt5 5.10.1-3
import os
import sys
from PyQt5.QtCore import *
from PyQt5.QtGui import *
from PyQt5.QtWidgets import *
app = QApplication(sys.argv)
# Succeeds. (Lists three files in the autostart directory.)
wd = os.path.expanduser("~/.config/autostart")
os.system("ls {0}".format(wd))
# Fails. Opens to ~/
x = QFileDialog.getExistingDirectory(caption="Choose presets...", directory=wd)
wd = os.path.expanduser("~/.config")
# Succeeds. Opens at ~/.config/
x = QFileDialog.getExistingDirectory(caption="Choose presets...", directory=wd)
# Succeeds. Opens at ~/Documents/Volunteer
wd = os.path.expanduser("~/Documents/Volunteer")
x = QFileDialog.getExistingDirectory(None, "Choose presets...", wd)
And, thanks to @ekhumoro, we have a winner! Telling QFileDialog not to use the native dialog did the trick. Specifically:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Arch Linux (up-to-date)
# Python 3.6.5
# python-pyqt5 5.10.1-3
import os
import sys
from PyQt5.QtCore import *
from PyQt5.QtGui import *
from PyQt5.QtWidgets import *
app = QApplication(sys.argv)
# Succeeds. (Lists three files in the autostart directory.)
wd = os.path.expanduser("~/.config/autostart")
os.system("ls {0}".format(wd))
# SUCCEEDS (where it previously failed)
x = QFileDialog.getExistingDirectory(caption="Choose presets...", directory=wd,
                                     options=QFileDialog.DontUseNativeDialog)

Python 3 BlockingScheduler killed without apparent reason

I am running a basic blocking scheduler and it is being killed for no apparent reason. In my console a "Killed" message appears, but that's all. Any idea how I could find out why it was killed? My function is as simple as the one below.
from apscheduler.schedulers.blocking import BlockingScheduler
import pandas as pd
import time

sched = BlockingScheduler()

@sched.scheduled_job('cron', day_of_week='mon,tue', hour=17, minute=45)
def scheduled_job():
    print("Start time: ", pd.datetime.now(), "\n")
    fct.start()  # fct is defined elsewhere in the real code
    time.sleep(100)
    fct.stop()
    print("End time: ", pd.datetime.now(), "\n\n")
    return

sched.start()
