Getting a list or count of all running threads - haskell

How can I do one of the following?
get a list of ThreadIds of all running forked threads (preferably with labels) from application code?
get a simple (maybe approximate, e.g. from the last major GC) count of running threads
get either of the above from gdb or similar
Things that don't work for me:
requiring a profiling build
maintaining some data structure to try to track this information myself
Things I thought might be promising:
write a custom eventlog hook to try to get this info: https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/runtime_control.html#hooks-to-change-rts-behaviour
create a hacky RTS binding somehow

GHC 9.6 introduced listThreads :: IO [ThreadId]
From the GHC User’s Guide:
GHC now provides a set of operations for introspecting on the threads of a program, GHC.Conc.listThreads, as well as operations for querying a thread’s label (GHC.Conc.threadLabel) and status (GHC.Conc.threadStatus).
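As a rough illustration of how these three operations fit together, here is a minimal sketch, assuming GHC 9.6 or later (base 4.18). The helper name threadSnapshot and the worker labels are made up for this example. The list is only a snapshot, and it may also contain threads created by the runtime itself, so treat the count as approximate.

```haskell
module Main (main) where

import Control.Concurrent (forkIO, threadDelay)
import Control.Monad (forM, forM_)
import GHC.Conc
  ( ThreadId, ThreadStatus
  , labelThread, listThreads, threadLabel, threadStatus
  )

-- | Hypothetical helper: snapshot every thread the RTS reports,
-- paired with its label (if one was set) and its current status.
threadSnapshot :: IO [(ThreadId, Maybe String, ThreadStatus)]
threadSnapshot = do
  tids <- listThreads
  forM tids $ \tid -> do
    label  <- threadLabel tid
    status <- threadStatus tid
    pure (tid, label, status)

main :: IO ()
main = do
  -- Fork two labelled workers that sleep for a second,
  -- so they are still alive when the snapshot is taken.
  forM_ ["worker-1", "worker-2"] $ \name -> do
    tid <- forkIO (threadDelay 1000000)
    labelThread tid name
  snapshot <- threadSnapshot
  putStrLn ("thread count: " ++ show (length snapshot))
  mapM_ print snapshot
```

Threads only carry a label if something called labelThread on them, which is why threadLabel returns a Maybe String.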

Related

Coding with Revit API: tips to reduce memory use?

I have a fairly general question. I am developing with the Revit API (in Python), and I sometimes notice that the Revit session gets slower during my tests and trials (the longer Revit stays open, the more it seems to happen). It hasn't reached the point of being a real problem, but it made me think about it anyway.
So, since I have no programming background, I am pretty sure that my code is filled with really 'unorthodox' things that could be far better.
Are there some basic 'tips and tricks' that I could follow (I mean, related to the Revit API) to speed up code execution? Or maybe I should say: to help reduce memory use?
For instance, I've read about the 'Dispose' method, notably when using Transactions (for instance here: http://thebuildingcoder.typepad.com/blog/2012/09/disposal-of-revit-api-objects.html ), but in the end it's not clear to me whether that is actually important to do (and, since I'm using Python, I don't know where that puts me in the discussion about using "using" or not).
Should I just 'Dispose' everything? ;)
Besides the 'Dispose' method, is there something else?
Thanks a lot,
Arnaud.
Basics:
Okay let's talk about a few important points here:
You're running scripts under IronPython, which is an implementation of Python written in C# that runs on the .NET runtime.
The .NET runtime uses a garbage collector to reclaim unused memory.
A garbage collector (GC) is a piece of the runtime that is executed at intervals to collect unused objects. It uses a series of techniques to group and categorize the target memory areas for later collection.
Your main program is paused by the runtime to allow the GC to collect memory. This means that the more time the GC needs to do its job at each interval, the slower your program gets, and you'll experience some lag.
Issue:
Now to the heart of this issue:
Python is an object-oriented programming language at heart, and IronPython creates objects (similar in concept to Elements in Revit) for everything: your variables, the methods of a class, functions, and everything else. All of these objects need to be collected when they're no longer used.
When Python is used as a scripting language for a program, there is generally one single Python engine that executes all user inputs.
However, Revit does not have a command prompt with an associated Python engine. So every time you run a script in Revit, a new engine is created that executes the program and dies at the end.
This dramatically increases the amount of unused memory for the GC to collect.
Solution:
I'm the creator and maintainer of pyRevit, and this issue was resolved in pyRevit v4.2.
The solution was to set LightweightScopes = true when creating the IronPython engine, which forces the engine to create smaller objects. This dramatically decreased the memory used by IronPython and increased the amount of time until the user experiences Revit performance degradation.
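I can't comment because of my low reputation, so I'm posting this as an answer. I use another way to reduce memory. It's less pretty than the LightweightScopes trick, but it works for a one-time cleanup after expensive operations:
import gc
my_object = list(range(10 ** 6))  # stand-in for some huge object
# [expensive operation that uses my_object]
del my_object  # or my_object = [] does the job for a list or dict
gc.collect()  # force a collection now instead of waiting for the next one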

What are the performance implications of interopping with other languages via system calls?

Suppose I'm writing a program in node.js (or perhaps another typical back-end scripting language). Suppose further I have a C function f (or a python function, or what have you) that does some pure data transformation.
If I want to use f in my node program, there are two approaches:
(1) Bind f via something like node-gyp that makes it callable from JavaScript land.
(2) Make f into a binary (or, in the case of a language like Python, a single f.py interface) that sits on the file system, and then call it from Node as if it were any other system command (so that one can take the output of the system call as a string, convert it into Node.js data, and then use it).
Question: What are the performance implications of choosing (2) over (1)?
This is important because if you are using a language like C to make some aspect of your application run significantly faster, then using (2) would seem pointless if it slowed things down past some threshold.
The cost of (1) is the cost of loading the native code, transferring arguments (FFI), calling the native code, and transferring the results back, with loading done only once.
The cost of (2) is always going to be the cost of starting up the process, running the process, and converting the results back from strings.
If the cost of f is high, you may never see a difference between (1) and (2). If the cost of f is low, then (2) will take longer because the process startup overhead will dominate.
However, even when f is complex (it might be a very large data-processing application in C), it's almost always faster to create a native binding as in (1). Avoiding process startup overhead is important, and it also reduces the total amount of memory needed to run your application.
Alternatively, there is a third option:
(3) Have the C code talk over a local network socket, accepting requests and responding with answers when the computation is done.
This has the benefit of scaling out to multiple nodes if you need it.
Benchmarking both for your use case is the only way to be sure, but method 1 is likely to be faster.
The startup cost of calling a binary and starting an interpreter for python/perl/blah would likely kill any performance gain you might get using their Foreign Function Interface (FFI). Startup cost is one of the reasons why Apache has mod_python, mod_perl and why FastCGI exists.
Another thing to consider is that you're adding another language to the mix, and this might hurt the team's productivity, i.e. now everyone needs to know two languages, two FFI methods, etc. If your app is in Node, keep it in Node and use Node to call native methods.

How to extend GHC's Thread State Object

I'd like to add two extra fields of type StgWord32 to the thread state object (TSO). Based on the information I found on the GHC-Wiki and from looking at the source code, I have extended the struct in /includes/rts/storage/TSO.h and changed the program that creates different offsets (creating DerivedConstants.h). The compiler, the rts, and a simple application re-compile, but at the end of the execution (in hs_exit_) the garbage collector complains:
internal error: scavenge_stack: weird activation record found on stack: 45
I guess it has to do with Cmm and/or the STG implementation details (the offsets are generated since the structs are not visible at the Cmm level; correct me if I'm wrong). Is the order of fields significant? Have I missed a file that should be changed?
I use a debug build of the compiler and RTS and a rather dated GHC 6.12.3 on a 64-bit architecture. Any hints to relevant documentation and comments on the difference between GHC 6 and 7 regarding TSO handling are welcome, too.
The error that you are getting comes from ghc/rts/sm/Scav.c, specifically at line 1917:
default:
barf("scavenge_stack: weird activation record found on stack: %d", (int)(info->i.type));
It looks like you need to also modify ClosureTypes.h, which you can find in ghc/includes/rts/storage. This file seems to contain the different kinds of headers that can appear in a heap object. I've also run into some strange bootstrapping errors, where if I try to rebuild using the stage-1 compiler, I get the error you mentioned, but if I do a clean build, then it compiles just fine.
A workaround that turned out to be good enough for me was to introduce a separate data structure for each Capability that holds the additional information for each lightweight thread. I used a HashTable (see rts/Hash.h and rts/Hash.c) mapping from thread id to the custom info struct. The entries are added when the threads are created from sparks (in scheduleActivateSpark).
Timing the creation, insertion, lookup and destruction of the entries and the table showed negligible overhead for small programs. The main overhead results from the actual usage of the information and should ideally be kept outside of the innermost scheduler loop. For the THREADED_RTS build, one needs to ensure that other Capabilities don't access tables that are not their own (or use a mutex if such access is required, which is a potential source of additional overhead).

Using callgrind/kcachegrind to get per-thread statistics

I'd like to be able to see how "expensive" each thread in my application is using callgrind. I profiled with the --separate-threads=yes option, which gives you a callgrind file for the whole app and then one per thread.
This is useful for viewing the profile of any given thread, but what I really want is just a sorted list of CPU time from each thread so I can see which threads are the biggest hogs.
Valgrind/Callgrind doesn't support this, and neither does kcachegrind, though I think it would be a good improvement. Maybe some answers could be found on their mailing lists.
A working but really tedious way could be to use the option --separate-threads=no and change your code so that each thread uses a different function name or class name. Depending on your code complexity, that could be the answer (e.g. computeData1(), computeData2(), ...).
Just open multiple profiles in kcachegrind at the same time.

Tracing pthread scheduling

What I want to do is create some kind of graph detailing the execution of (two) threads in Linux. I don't need to see what the threads do, just when they are scheduled and for how long, a time line basically.
I've spent the last few hours searching the internet for a way to trace the scheduling of pthreads. Unfortunately, the two projects I found require either kernel recompilation (LTTng) or glibc patching (NPTL Trace Tool), neither of which I can do (large, centrally managed system, on which I have no sudo rights).
Is there any other way to do something like this or will I have to resort to finding a laptop on which I can patch/recompile whatever I want?
Best regards
PS: I would have linked to both projects, but the site doesn't allow me (reputation < 10). The first search result on Google for the project names is the correct one though.
Superuser privileges are not needed to build an instrumented glibc / libpthread.so. The ptt_trace program that is part of NPTL Trace Tool will run your program using the instrumented library.
Maybe something like Intel's VTune?
There is also a tool called pthreadw (on SourceForge).
It's a wrapper library which intercepts calls to the usual functions of the pthread library and reports stats, like typical times spent on locks, condition variables, etc.
It is not currently able to export traces, only textual summary reports.
