Multi-threading availability in Python - python-3.x

I've read many posts and articles about how to use threading in conjunction with queue.Queue or multiprocessing.pool.ThreadPool or concurrent.futures.ThreadPoolExecutor, but none of them work for me. And by that I mean that there's no parallelism at all. The only way I get 100% CPU usage and real parallelism is with multiprocessing.Pool. I also read about GIL and CPython.
Why do I care if at least one approach works? Well, multiprocessing.Pool simply prevents nested parallelism (daemonic processes). I can't have an outer function run in a separate process and have that function start its own pool of processes.
So I have two questions to hopefully stop my never-ending search for an approach that works:
Is multi-threading really impossible in Python if I'm using the default Anaconda distribution and its python.exe? (I see a bunch of articles talking about multi-threading with GUIs and I/O operations...)
Is nested parallelism really impossible in Python?

Q : "Is multi-threading really impossible in Python...?"
Lexically, yes, there is multi-threaded code execution ( the code imports and runs thread-based tools ).
Nevertheless, as you have read in the GIL details, and given the as-is state of the CPython interpreter design ( valid since ever and still defended by Guido van Rossum himself as a by-design property in 2020-Q2 ), the central lock-acquisition of the GIL-lock singleton re-[SERIAL]-ises the actual code execution of all the thread-based Python tools. All the add-on setup costs get paid, all the GIL-related thread-switching overheads get paid ( they accrue at every switch interval, ~5 ms by default, see sys.getswitchinterval(), over the whole course of the code execution ), yet the resulting speedup stays << 1 for CPU-bound work: no acceleration can ever appear here. The only exceptions are the use-cases that happen to mask ( best as many times as possible, to justify all the other add-on costs ) some external I/O latency much longer than those switching overheads ( network transports, slow user interaction with a UI ), which is exactly what the articles mentioned above are about.
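A quick way to check this on your own machine ( a minimal sketch, CPython 3.x standard library only; absolute numbers will differ per box ): a purely CPU-bound function gains nothing from a thread-pool under the GIL, while a process-pool scales once its setup costs are amortised.

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def burn( n ):                                # purely CPU-bound work, no I/O latency to mask
    s = 0
    for i in range( n ):
        s += i * i
    return s

if __name__ == "__main__":
    work = [5_000_000] * 8
    for Executor in ( ThreadPoolExecutor, ProcessPoolExecutor ):
        t0 = time.perf_counter()
        with Executor( max_workers = 4 ) as ex:
            list( ex.map( burn, work ) )
        print( Executor.__name__, round( time.perf_counter() - t0, 2 ), "[s]" )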
Q : "Is nested parallelism really impossible in Python?"
Well, the ultimate question is not whether it is somehow possible, but whether it makes any sense to try to achieve it,
for which the plain answer ( still valid in 2020-Q2 ) is:
(performance-wise) No, sorry, it would never make sense to try, unless the sum of all the add-on costs at least starts to become justified ( which it does not seem to be anywhere near in 2020-Q2, and hardly will be, unless the Python ecosystem undergoes a total redesign, going straight against Guido's evangelisation ).
A performance-motivated architecture must balance all add-on costs well, so as not to fall into the trap of Amdahl's Law: never pay way more than you will ever receive back.
That simple to type.
So complex to achieve.
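For completeness, a minimal CPython 3.x reproduction of the nested-pool limitation the question describes ( standard library only; a sketch of the failure, not a recommendation ): the workers a multiprocessing.Pool spawns are daemonic, so the inner Pool fails with "daemonic processes are not allowed to have children".

import multiprocessing as mp

def inner( x ):
    return x * x

def outer( n ):
    # runs inside a daemonic Pool worker, so this nested Pool cannot be created
    with mp.Pool( 2 ) as p:
        return sum( p.map( inner, range( n ) ) )

if __name__ == "__main__":
    with mp.Pool( 2 ) as p:
        print( p.map( outer, [ 10, 20 ] ) )   # raises AssertionError from the worker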

Related

What are alternatives to Locks and critical sections in ZeroMQ (guide)

I am reading ZeroMQ - The Guide, currently on Chapter 4, for those of you who know it.
http://zguide.zeromq.org/page:all
I am working in Python with the pyzmq binding.
The author says that we should forget everything we know about concurrent programming, never use locks and critical sections, etc.
Right now I am doing a pet project for fun with ZeroMQ. I have a piece of data which is shared between some threads (don't worry, my threads don't pass sockets). They share a single database,
and my question is:
Should I put a lock around that piece of data, to avoid race conditions, like one normally would, in order to serialize access, or is this something to avoid when using ZeroMQ, because better alternatives exist?
I remember the author saying that one should always share data between threads using inproc:// or ipc:// ( for processes ), but I am not sure how that fits here.
A lot of thrilling FUN in doing this, indeed
Yes, Pieter HINTJENS does not advise any step outside of his Zen-of-Zero. Share nothing, lock nothing ... as you have noticed many times in his great book already.
The cornerstone of the problem is protecting the [TIME]-domain-consistent reflections of the Database-served pieces of data.
In a distributed system, the [TIME]-domain problem spans a non-singular landscape, so more problems spring out of the box.
If we know there is no life-critical technology connected to the system ( regulatory and safety validations happily do not apply here ), there are a few tricks to play the game with ZeroMQ and without any imperative locks.
Use the .setsockopt( ZMQ.CONFLATE, 1 ) and .setsockopt( ZMQ.IMMEDIATE, 1 ) infrastructure, together with .setsockopt( ZMQ.AFFINITY, ... ) and .setsockopt( ZMQ.TOS, ... ), on latency-tweaked hardware + O/S + kernel setups, end-to-end where possible.
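A minimal pyzmq sketch of those settings on the subscriber side ( the endpoint and the TOS value are illustrative placeholders; CONFLATE must be set before connecting and works with single-part messages only ):

import zmq

ctx = zmq.Context()
sub = ctx.socket( zmq.SUB )
sub.setsockopt( zmq.CONFLATE,  1 )     # keep only the latest message per peer
sub.setsockopt( zmq.IMMEDIATE, 1 )     # queue only towards completed connections
sub.setsockopt( zmq.AFFINITY,  1 )     # pin connection handling to the first I/O-thread ( bitmask )
sub.setsockopt( zmq.TOS,       0x10 )  # IP type-of-service hint ( value illustrative )
sub.setsockopt( zmq.SUBSCRIBE, b"" )
sub.connect( "tcp://127.0.0.1:5556" )  # placeholder endpoint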
A good point to notice is also that the Python threading model still runs under GIL-lock stepping, so the principal collision avoidance is already in place.
This is indeed a very hard pet-project ( ZeroMQ cannot and will not make the problems disappear ).
Either
co-locate the decision-making process(es), so as to have an almost-zero latency on taking decisions on updated data,
or
permit non-local decision taking, but equip it with some robust rules, based on the latency of [TIME]-domain-stamped event-notifications ( a latency that is principally impossible to compensate, at least until a Managed Quantum Entanglement API gets published and indeed works at our service ). Such rules give you controlled chances to moderate the DB data-reflection's consistency corner cases, where an "earlier"-served DB-READ gets delivered "near" or "after" a known DB-WRITE that has changed its value, both visible to the remote observer at almost the same time ( a minimal sketch of such a rule follows right below ).
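A minimal sketch of one such rule ( the names and the message shape are invented for illustration, and it assumes all stamps come from the same source-side clock ): stamp every served piece of data right at the source and let the remote decision-taker silently drop any reflection older than the one it already holds.

import time

last_seen = {}                          # key -> newest source-side stamp already applied
current   = {}                          # the locally held data-reflection

def stamp( key, value ):                # done at the DB-side, right when the "answer" is served
    return { "key": key, "val": value, "ts": time.time_ns() }

def apply_if_fresh( msg ):
    # an "earlier"-served DB-READ arriving "after" a newer DB-WRITE gets silently dropped
    if msg[ "ts" ] <= last_seen.get( msg[ "key" ], 0 ):
        return False
    last_seen[ msg[ "key" ] ] = msg[ "ts" ]
    current[ msg[ "key" ] ]   = msg[ "val" ]
    return True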
Database's own data-consistency is maintained by the DBMS-engine per se. No need to care here.
Let's imagine the Database-accesses being mediated "through" ZeroMQ communication tools. The risk is not in the performance scaling, where ZeroMQ enjoys almost-linear scaling; the problem is the said [TIME]-domain consistency of the served "answer" anywhere "behind" the DBMS-engine's perimeter, all the more once we get into distributed-system realms.
Why?
A DBMS-engine's "locally" data-consistent "answer" to a read-request, served by the DBMS-engine right at a moment of UTC:000,000,000.000 000 000 [s] will undergo a transport-class specific journey, along the intended distribution path, but -- due to principal reasons -- does not get delivered onto a "remote"-platform until UTC:000,000,000.??? ??? ??? [s] ( depending on the respective transport-class and intermediating platforms' workloads ).
Next, there may be and will be an additional principal inducted latency, caused from workloads of otherwise uncoordinated processes requests, principally concurrent in their respective appearance, that get later somehow aligned into a pure-serial queue, be it due to the ZeroMQ Context()'s job or someone else's one. Dynamics of queue-management and resources-(un-)availability during these phases add another layer of latency-uncertainties.
Put together, one may ( and ought ) fight as a herd of lions for any shaving-off of the latency costs alongside the { up | down }-the-paths, yet getting a minimum + predictable latency is more a wish, than any easily achievable low hanging fruit.
DB-READ-s may seem to be principally easy to serve, as they might appear as lock-free and un-orchestrated among themselves, yet the very first DB-WRITE-request may some of the already scheduled "answer"-s, that were not yet sent out onto the wire ( and each such piece of data ought have got updated / replaced by a DBMS-local [TIME]-domain " freshmost " piece of data --- As no one is willing to dog-fight, the less to further receive shots from a plane, that was already known to have been shot down a few moments ago, is she ... ? )
These were lectures already learnt during collecting pieces of distributed-system smart designs' experience ( Real Time Multiplayer / Massive Ops SIMs being the best of the best ).
inproc:// transport-class is the best tool for co-located decision taking, yet will not help you in this in Python-based ecosystems ( ref. GIL-lock-stepping has enforced a pure-serial execution and latency of GIL-release goes ~3~5 orders of magnitude above the almost-"direct"-RAM-read/write-s.
ipc:// transport-class sockets may span inter-process communications co-located on the same host, yet if one side goes in python, still the GIL-lock-stepping will "chop" your efforts of minimising the accumulated latencies as regular blocking will appear in GIL-interval steps and added latency-jitter is the worst thing a Real-Time distributed-system designer is dreaming about :o)

How non-trivial should a computation be to make it reasonable to get it sparked for a parallel execution in Haskell? [duplicate]

This question already has an answer here:
Why are GHC Sparks Fizzling?
According to the documentation for Control.Parallel, one should make sure that the computation being sparked is non-trivial so that creating the spark is cheaper than the computation itself.
This makes sense, but after listening to Bartosz Milewski talk about how cheap sparks are, I'm wondering how experienced Haskell programmers determine whether or not a computation is worthy of parallelism.
This subject is facts-based, not opinions-based.
Please take notice of a few facts on the actual overhead costs before reading:
".. creating spark doesn't immediately wakeup idle capability, see here. By default scheduling interval is 20ms, so when you create a spark, it will take up to 20 ms to turn it to a real thread. By that time the calling thread most likely will already evaluate the thunk, and the spark will be either GC'd or fizzled.
By contrast, forkIO will immediately wakeup idle capability if any. That is why explicit concurrency is more reliable than parallel strategies."
So, remember to add up to +20 ms, and/or the benchmarked cost of a forkIO-spawned functional block, to the add-on overheads cited below in any realistically achievable cost/benefit ( speedup ) formula.
This problem was solved by Dr. Gene AMDAHL many decades ago
More recent generations of CS students and practitioners just seem to have somehow forgotten the elementary process-scheduling logic ( i.e. the rules, not the art, of properly organising the flow of code execution over a system's restricted physical resources ).
A fair objection may and will come from the nature of functional languages, where lambda-calculus can, and often does, harness a space otherwise hidden from imperative languages, for smart, fine-grained parallelism derived right from the laws of the lambda- or pi-calculi.
Yet the core message holds, and has for more than 60 years.
A piece of quantitative, fair, records-of-evidence-based rationale is well enough here ( no magic, no hidden Art of whatever nature ):
Please first do your best to fully understand both the original formulation of Amdahl's Law and the recent criticism of it, together with the overhead-strict, resources-aware re-formulation of that original, generally valid, universal system-scheduling law.
The additions in the [ Criticism ] section were meant exactly to match what actually happens when someone comes to a piece of code with an idea to "re-organise" the computing graph and inject into the process flow some either "just"-[CONCURRENT] or true-[PARALLEL] computing syntax-constructors ( whatever the actual code-execution tools are ).
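For orientation, a compact restatement of both forms ( the symbols are introduced here just for illustration: p is the parallelisable fraction of the runtime, N the number of workers, o_s and o_j the normalised setup and re-collection add-on overheads benchmarked below ):

S(N) = \frac{1}{ (1 - p) + \frac{p}{N} }
S_{overhead}(N) = \frac{1}{ (1 - p) + o_{s} + o_{j} + \frac{p}{N} }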
Having got the overhead-strict Amdahl's Law theory, let's measure:
This part is easy and systematic: my students often rant, but going forward, each of them collects hands-on experience of what it actually takes ( in costs ) to go into any form of parallelised code execution.
1 ) create a NOP-function: a function that indeed does nothing except being run ( no obligation to pass any, the less any remarkable-in-volume, arguments, no attempt to allocate a single bit of memory during its ( empty )-code-execution, and no value returned "back" ). This is the idealised NOP-function payload, ready to be spawned / sparked / distributed into execution driven by the parallelism tool of choice.
Having the NOP-fun ready, let's benchmark the pure overhead of such NOP-fun code execution in multiple instances and measure the time it takes.
Being sure all such instances were indeed doing nothing "there", the lump sum of time spent between the two time-lines is, hooray, the pure overhead cost of going parallelised plus the process re-collection overhead cost.
So simple, so easy.
Different tools differ in how much cost a user programme will accrue, but both the metric and the methodology are crystal-clear.
CuT_start <- liftIO $ getCurrentTime -- a Code_Under_Test___START
-- I use a ZeroMQ Stopwatch() instance
-- having a better than [us] resolution
-- but haven't found it in Haskell binding
--CuT_TIMING_CRITICAL_SECTION_/\/\/\/\/\/\\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
<_CuT_a_Code_syntax-constructor_Under_Test_>
--CuT_TIMING_CRITICAL_SECTION_/\/\/\/\/\/\\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
CuT_end <- liftIO $ getCurrentTime -- a Code_Under_Test___END
liftIO $ print ( diffUTCTime CuT_end CuT_start )
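The same NOP-overhead measurement, sketched in Python for readers outside Haskell ( kept in Python to match the other examples on this page; a minimal sketch, numbers are machine-specific ):

import time
from multiprocessing import Pool

def nop( _ ):
    pass                                   # the idealised NOP-payload: no work, no allocation, no return value

if __name__ == "__main__":
    t0 = time.perf_counter()
    with Pool( 4 ) as p:
        p.map( nop, range( 1000 ) )        # spawn + distribute + re-collect 1000 empty jobs
    print( "pure add-on overhead:", round( time.perf_counter() - t0, 3 ), "[s]" )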
2 ) having registered the net cost of spawning / sparking the intended amount of jobs, one may move forward with:
- adding "remote" memory allocations ( literally till the swapping kills the O/S )
- adding "remote" CPU-bound processing ( again, as far as one smells the fatigue of the O/S kernel's scheduling efforts to create some yet feasible hardware-threads mapping )
- adding "remote" process testing as per the call-interface scaling ( volume of data with a need to pass from caller to callee ) dependencies
- adding "remote" process return value(s)
- adding "remote" process needs to access some shared resource
The final Go / No-Go decision:
All these add-on costs, collected and recorded above, just increase the real-world code-overhead costs that have to be entered into the recent, re-formulated Amdahl's Law.
If and only if
the overhead-strict, resources-aware speedup result is >> 1.00 does it make sense to go into parallelised code execution.
In all cases where
the "improved" speedup is actually <= 1.00, it would indeed be a very bad idea to pay more than one receives from such an "improvement".
( A reversed formulation is always possible: derive the minimum amount of processing that will at least justify the systematically benchmarked costs of using the respective type of parallelised-code syntax-constructor. )
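A tiny go / no-go calculator for that test, written in Python to match the other examples on this page ( a sketch of the overhead-strict form; plug in your own benchmarked numbers, expressed as fractions of the original serial runtime ):

def overhead_strict_speedup( p, N, o_setup, o_join ):
    # p       : parallelisable fraction of the original runtime ( 0.0 .. 1.0 )
    # N       : number of workers the parallel part gets spread over
    # o_setup : benchmarked spawn / spark / distribute add-on cost
    # o_join  : benchmarked re-collection add-on cost
    return 1.0 / ( ( 1.0 - p ) + o_setup + o_join + p / N )

print( overhead_strict_speedup( p = 0.95, N = 8, o_setup = 0.03, o_join = 0.02 ) )   # ~4.6x  -> Go
print( overhead_strict_speedup( p = 0.30, N = 8, o_setup = 0.20, o_join = 0.10 ) )   # ~0.96x -> No Go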
Q.E.D.

High memory/performance critical computing - Architectural approach opinions

I need an architectural opinion and approaches to the following problem:
INTRO:
We have a table of ~4M rows called Purchases. We also have a table
of ~5k rows called Categories. In addition, we have a table of
~4k SubCategories. We are using T-SQL to store data. At a user's
request ( during runtime ), the server receives a request with about 10-15 N
possibilities of parameters. Based on the parameters, we take
purchases, sort them by categories and subcategories, and do some
computing. Some of the "computing" process includes filtering,
sorting, rearranging fields of purchases, subtracting purchases from
each other, adding some other purchases to each other, finding savings,
etc... This process is user-specific, therefore every user WILL get
different data back, based on their roles.
Problem:
This process takes about 3-5 minutes and we want to cut it
down. Previously, this process was done in-memory, in the browser, via
web workers (JS). We moved away from that as the memory started to get
really large and most browsers started to fail on load. Then we moved
the service to the server (NodeJS), which processed the request on the fly
via child processes. Reason for child processes: the computing process
goes through a for loop about 5,000 times ( once for each category ) and does
the above-mentioned "computing". Via child processes we were able to
distribute the work across a number of child processes, which gave us somewhat
better results, if we ran on at least 16 cores ( 16 child processes ).
Current processing time is down to about 1.5-2 minutes, but we are
wanting to see if we have better options.
I understand it's hard to fully understand our goals without seeing any code, so to ask the question specifically: what are some ways of doing computation on semi-big data at runtime?
Some thoughts we had:
using SQL in-memory tables and doing computations in sql
using azure batch services
using bigger machines ( ~32-64 cores; this may be our best shot if we cannot get any other ideas, but of course cost increases drastically, and we accept the fact that cost will increase )
stepping into hadoop ecosystem ( or other big data ecosystems )
some other useful facts:
our purchases are about ~1GB ( becoming a little too large for in-memory computing )
We are thinking of doing pre-computing and caching on redis to have SOME data ready for client ( we are going to use their parameters set in their account to pre-compute every day, yet clients tend to change those parameters frequently, therefore we have to have some efficient way of handling data that is NOT cached and pre-computed )
If there is more information we can provide to better understand our dilemma, please comment and I will provide as much info as possible. There would be too much code to paste in here for one to fully understand the algorithms therefore I want to try delivering our problem with words if possible.
Never decide about technology before being sure about the workflow's critical path
That will never help you achieve a ( yet unknown ) target.
Not knowing the process's critical path, no one could calculate any speedup from whatever architecture one may aggressively "sell" you or just "recommend" you to follow as "geeky/nerdy/sexy/popular", whatever one likes to hear.
What would you get from such pre-mature decision?
Typically a mix of both the budgeting ( co$t$ ) and Project-Management ( sliding time-scales ) nightmares:
additional costs ( new technology also means new skills to learn, new training costs, new delays for the team to re-shape and re-adjust and grow into mature use of the new technology at performance levels better than those of the currently used tools, etc., etc. )
risks of choosing a "popular" brand which, on the other side, does not exhibit any of the superficial powers the marketing texts were promising ( but once the initial costs of entry have been paid, there is no other way than to bear the risk of never achieving the intended target, possibly due to overestimated performance benefits & underestimated costs of transformation & heavily underestimated costs of operations & maintenance )
What would you say if you could use a solution where "better options" remain your options:
you can start now, with the code you are currently using without a single line of code changed
you can start now on a gradual path of performance scaling that remains entirely under YOUR free will
you can avoid all risks of (mis)-investing into any extra-premium-cost "super-box", and rather stay on the safe side and re-use cheap, massively in-service-tested / fine-tuned / deployment-proven COTS hardware units ( common dual-CPU + a-few-GB machines, commonly used by the thousand in datacentres )
you can scale up to any level of performance you need, growing CPU-bound processing performance gradually from the start, hassle-free, up to some ~1k, ~2k, ~4k, ~8k CPUs as needed; yes, up to many thousands of CPUs that your current workers' code can immediately use to deliver the immediate benefit of the increased performance, leaving your teams free hands and more time for thorough work on possible design improvements and code refactoring towards even better performance envelopes, in case the current workflow, having been "passively" just smart-distributed to say ~1,000, later ~2,000 or ~5,000 CPU-cores ( still without a single SLOC changed ), does not suffice on its own
you can scale up, again gradually and on an as-needed basis, hassle-free, to ( almost ) any size of in-RAM capacity, be it ~8TB, ~16TB, ~32TB or ~64TB on Day 1, jumping to ~72TB or ~128TB next year if needed, all while keeping your budget ( almost ) linear and fully adjusted by your performance plans and actual customer-generated traffic
you can isolate and focus your R&D efforts not on (re)-learning "new"-platform(s), but purely into process (re)-design for further increasing the process performance ( be it using a strategy of pre-computing, where feasible, be it using smarter fully-in-RAM layouts for even faster ad-hoc calculations, that cannot be statically pre-computed )
What would business owners say to such ROI-aligned strategy?
If one makes the CEO + CFO "buy" any new toy, well, that is cool for hacking this today and that tomorrow, but such an approach will never make shareholders any happier than throwing ( their ) money into the river Nile.
If one can show the ultimately efficient Project plan, where most of the knowledge and skills stay focused on the business-aligned target while at the same time protecting the ROI, that would make your CEO + CFO, and I guarantee also all your shareholders, very happy, wouldn't it?
So, which way would you decide to go?
This topic isn't really new, but just in case... As far as my experience can tell, I would say your T-SQL DB might be your bottleneck here.
Have you measured the performance of your SQL queries? What do you compute on the SQL Server side? On the Node.js side?
A good start would be to measure the response time of your SQL queries, revamp your queries, work on indexes and dig into how your DB query engine works if needed. Sometimes a small tuning in the DB settings does the trick!
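As a starting point, a rough timing harness ( a sketch only: pyodbc, the connection string, and the query are placeholders you would replace; the same split can of course be measured directly from Node.js with any SQL client ):

import time
import pyodbc

# placeholders: fill in your own server / database / credentials
conn = pyodbc.connect( "DRIVER={ODBC Driver 17 for SQL Server};SERVER=...;DATABASE=...;Trusted_Connection=yes;" )
cur  = conn.cursor()

query = "SELECT ... FROM Purchases WHERE ..."     # the query under suspicion ( placeholder )

t0 = time.perf_counter()
cur.execute( query )
rows = cur.fetchall()
db_time = time.perf_counter() - t0

t0 = time.perf_counter()
# ... run the per-category "computing" step over rows here ...
compute_time = time.perf_counter() - t0

print( f"DB query + fetch: {db_time:.2f} [s], computation: {compute_time:.2f} [s]" )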

C# Algorithmic Stock Trading

We are working on algorithmic trading software in C#. We monitor the market price and then, based on certain conditions, we want to buy the stock.
User input can be taken from the GUI (WPF) and sent to the back-end for monitoring.
The back-end receives data continuously from the stock exchange and checks whether the user-entered price is met within certain limits and conditions. If all are satisfied, then we buy / sell the stock (in futures, FUT).
Now, I want to design my Back end service.
I need the Task Parallel Library or a custom thread pool where I create my tasks / threads / pool when the application starts (maybe incremental, or fixed at, say, 5000).
All will be in waiting state.
Once a user creates an algorithm, we will activate one thread from the pool, which monitors the price for each incoming string. If it matches, we buy / sell and the thread then goes back into the waiting state. (I don't want to create and destroy the threads / tasks, as that is time-consuming.)
So please, can you guys help me in this regard? Is the above approach good, or do we have any other approach?
I am stuck on this idea and not able to think outside the box on this.
The above approach is definitely not "good"
Given the idea above, the architecture is wrong in many cardinal aspects. If your Project aspires to survive in the 2017+ markets, try to learn from the mistakes already made in 2007-2016.
The percentages demonstrate the NBBO flutter for all U.S. stocks from 2007-01 ~ 2012-01. ( Lower values mean better NBBO stability. Higher values: instability. ) ( courtesy NANEX )
Financial Markets operate on nanosecond scales
Yes, a few inches of glass-fibre signal propagation transport delay decide on PROFIT or LOSS.
If you plan to trade in stock markets, your system will observe the HFT crowd doing the dirty practice of Quote Stuffing and vacuum-cleaning 'em right in front of your nose, at such scales that your single-machine multi-threaded execution will just move through thin air or fall into a gap already created many microseconds before your decision took place on your localhost CPU.
The rise of HFT from 2007-01 ~ 2012-01 ( courtesy NANEX ).
You may read more about the illusion of liquidity here.
See the expansion of Quotes against the level of Trades:
( courtesy NANEX )
Even if one decides to trade in a single instrument, on FX, the times are prohibitively short ( more than 20% of the ToB Bids are changed within less than 2 ms and do not arrive at your localhost before your trading algorithm could react accordingly ).
If your TAMARA measurements at your localhost are similar to this, simply forget trading in any HF/MF/LF-HFT instruments: you simply do not see the real market ( only the tip of the iceberg ), as the +20% of price-events happen in the very first column ( 1 .. 2 ms ), where you do not see a single event at all!
5000 threads is bad; don't ever do that. You'll degrade performance with context-switch losses much more than you'll gain from parallel execution. Traditionally, the number of threads for your application should equal the number of cores in your system, by default. There are other possible variants, but they probably aren't the best option for you.
So you can use a ThreadPool with some work-item method running an infinite loop there, which is very low-level, but you keep control over what is going on in your system. A callback function could update the UI so the user is notified about the trading results.
However, if you are saying that you can use the TPL, I suggest considering these two options for your case:
Use a collection of tasks running forever, checking for new trading requests. You still should tune the number of simultaneously running tasks, because you probably don't want them to fight each other for CPU time. As LongRunning tasks are created with a dedicated background thread, many of them will degrade your application's performance as well. Maybe in this approach you should introduce a strategy-pattern implementation for the algorithm being run inside the task.
Set up a TPL Dataflow process within your application. For such an approach you should encapsulate the info about the algorithm inside a DTO object and introduce a pipeline:
a BufferBlock for storing all the incoming requests. Maybe you can use a BroadcastBlock here, if you want to check the sell or buy options in parallel. You can link the block with a boolean predicate so that different blocks will process different types of requests.
an ActionBlock (maybe one block for each algorithm from the user) for processing the algorithmic check for the pattern on which you base the decision.
an ActionBlock for storing all the buy / sell requests for data that successfully passed the algorithm.
a BufferBlock for the UI reaction, with Reactive Extensions (an introductory book for Rx, if you aren't familiar with it)
This solution still has to be tuned with block-creation options, and instrumented so you can see exactly how your data flows across the trading algorithm, the speed of the decision-making, and the overall performance. You should properly examine the defaults for the TPL Dataflow blocks; you can find them in the official documentation. Another good place to start is Stephen Cleary's introductory blog posts (Part 1, Part 2, Part 3) and chapter 4 about this library in his book. A rough sketch of the shape of such a pipeline follows below.
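This is not TPL Dataflow itself, but the shape of that pipeline can be sketched with asyncio queues (Python is used only to keep this page's examples in one language; the stage names, the matching rule, and the message shape are all invented for illustration):

import asyncio

async def algo_worker(name, quotes, orders):
    # one task per user algorithm: pull a quote, apply the rule, emit an order
    while True:
        quote = await quotes.get()
        if quote["price"] <= quote["limit"]:          # placeholder "algorithmic check"
            await orders.put({"algo": name, "side": "BUY", "price": quote["price"]})
        quotes.task_done()

async def order_sink(orders):
    # terminal stage: hand accepted orders to the gateway / UI feed
    while True:
        order = await orders.get()
        print("order:", order)
        orders.task_done()

async def main():
    quotes, orders = asyncio.Queue(), asyncio.Queue()     # BufferBlock-like stages
    workers = [asyncio.create_task(algo_worker(f"algo-{i}", quotes, orders)) for i in range(4)]
    sink = asyncio.create_task(order_sink(orders))
    for price in (99.5, 100.2, 98.7):                     # a few incoming ticks
        await quotes.put({"price": price, "limit": 100.0})
    await quotes.join()                                   # wait until every tick was checked
    await orders.join()                                   # ...and every accepted order handled
    for t in workers + [sink]:
        t.cancel()
    await asyncio.gather(*workers, sink, return_exceptions=True)

asyncio.run(main())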
With C# 5.0, the natural approach is to use async methods running on top of the default thread pool.
This way, you are creating Tasks quite often, but the most prominent cost of that is in GC. And unless you have very high performance requirements, that cost should be acceptable.
I think you would be better with an event loop, and if you need to scale, you can always shard by stock.

What are the benefits of coroutines?

I've been learning some Lua for game development. I heard about coroutines in other languages but really came across them in Lua. I just don't really understand how useful they are; I heard a lot of talk about how they can be a way to do multi-threaded things, but aren't they run in order? So what benefit would there be over normal functions, which also run in order? I'm just not getting how different they are from functions, except that they can pause and let another run for a second. Seems like the use-case scenarios wouldn't be that huge to me.
Anyone care to shed some light as to why someone would benefit from them?
Especially insight from a game programming perspective would be nice^^
OK, think in terms of game development.
Let's say you're doing a cutscene or perhaps a tutorial. Either way, what you have is an ordered sequence of commands sent to some number of entities. An entity moves to a location, talks to a guy, then walks elsewhere. And so forth. Some commands cannot start until others have finished.
Now look back at how your game works. Every frame, it must process AI, collision tests, animation, rendering, and sound, among possibly other things. You can only think every frame. So how do you put this kind of code in, where you have to wait for some action to complete before doing the next one?
If you built a system in C++, what you would have is something that ran before the AI. It would have a sequence of commands to process. Some of those commands would be instantaneous, like "tell entity X to go here" or "spawn entity Y here." Others would have to wait, such as "tell entity Z to go here and don't process any more commands until it has gone there." The command processor would have to be called every frame, and it would have to understand complex conditions like "entity is at location" and so forth.
In Lua, it would look like this:
local entityX = game:GetEntity("entityX");
entityX:GoToLocation(locX);
local entityY = game:SpawnEntity("entityY", locY);
local entityZ = game:GetEntity("entityZ");
entityZ:GoToLocation(locZ);
repeat
coroutine.yield();
until (entityZ:isAtLocation(locZ));
return;
On the C++ side, you would resume this script once per frame until it is done. Once it returns, you know that the cutscene is over, so you can return control to the user.
Look at how simple that Lua logic is. It does exactly what it says it does. It's clear, obvious, and therefore very difficult to get wrong.
The power of coroutines is in being able to partially accomplish some task, wait for a condition to become true, then move on to the next task.
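For readers coming from outside Lua, the same wait-for-a-condition pattern can be sketched with a plain Python generator pumped once per frame (the Entity class and its methods are invented stand-ins for a real engine API):

class Entity:                          # a minimal stand-in for a game entity (illustrative)
    def __init__(self):
        self.pos, self.target = 0, 0
    def go_to(self, loc):
        self.target = loc
    def is_at(self, loc):
        return self.pos == loc
    def tick(self):                    # the engine moves the entity one step per frame
        if self.pos < self.target:
            self.pos += 1

def cutscene(entity, loc):
    entity.go_to(loc)
    while not entity.is_at(loc):       # hand control back to the engine until the move completes
        yield
    print("cutscene finished")

entity_z = Entity()
script = cutscene(entity_z, 3)
running = True
while running:                         # the per-frame pump, normally driven by the engine loop
    entity_z.tick()
    try:
        next(script)
    except StopIteration:
        running = False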
Coroutines in a game:
Easy to use, easy to screw up when used in many places.
Just be careful and don't use them in many places.
Don't make your Entire AI code dependent on Coroutines.
Coroutines are good for making a quick fix when a state is introduced which did not exist before.
This is exactly what Java does with sleep() and wait().
Both functions are the best ways to make it impossible to debug your game.
If I were you, I would completely avoid any code which has to use a Wait()-style function the way a coroutine does.
The OpenGL API is something you should take note of. It never uses a wait() function, but instead uses a clean state machine which knows exactly what state each object is in.
If you use coroutines, you end up with so many stateless pieces of code that it most surely will be overwhelming to debug.
Coroutines are good when you are making an application like a text editor... bank application... server... database, etc. (not a game).
Bad when you are making a game where anything can happen at any point of time; you need to have states.
So, in my view, coroutines are a bad way of programming and an excuse to write small stateless code.
But that's just me.
It's more like a religion. Some people believe in coroutines, some don't. The usecase, the implementation and the environment all together will result into a benefit or not.
Don't trust benchmarks which try to prove that coroutines on a multicore CPU are faster than a loop in a single thread: it would be a shame if they were slower!
If this later runs on some hardware where all cores are always under load, it will turn out to be slower. Oops...
So there is no benefit per se.
Sometimes it's convenient to use. But if you end up with tons of coroutines yielding and states that went out of scope you'll curse coroutines. But at least it isn't the coroutines framework, it's still you.
We use them on a project I am working on. The main benefit for us is that sometimes with asynchronous code, there are points where it is important that certain parts are run in order because of some dependencies. If you use coroutines, you can force one process to wait for another process to complete. They aren't the only way to do this, but they can be a lot simpler than some other methods.
I'm just not getting how different they are from functions except that
they can pause and let another run for a second.
That's a pretty important property. I worked on a game engine which used them for timing. For example, we had an engine that ran at 10 ticks a second, and you could WaitTicks(x) to wait x number of ticks, and in the user layer, you could run WaitFrames(x) to wait x frames.
Even professional native concurrency libraries use the same kind of yielding behaviour.
Lots of good examples for game developers. I'll give another in the application-extension space. Consider the scenario where the application has an engine that can run a user's routines in Lua while doing the core functionality in C. If the user needs to wait for the engine to get to a specific state (e.g. waiting for data to be received), you either have to:
multi-thread the C program to run Lua in a separate thread and add in locking and synchronization methods,
abend the Lua routine and retry from the beginning with a state passed to the function to skip anything, lest you rerun some code that should only be run once, or
yield the Lua routine and resume it once the state has been reached in C
The third option is the easiest for me to implement, avoiding the need to handle multi-threading on multiple platforms. It also allows the user's code to run unmodified, appearing as if the function they called took a long time.

Resources