Data acquisition and parallel analysis - python-3.x

With this example, I am able to start 10 processes and then continue to do "stuff".
import random
import time
import multiprocessing

if __name__ == '__main__':
    """Demonstration of GIL-friendly asynchronous development with Python's multiprocessing module"""

    def process(instance):
        total_time = random.uniform(0, 2)
        time.sleep(total_time)
        print('Process %s : completed in %s sec' % (instance, total_time))
        return instance

    for i in range(10):
        multiprocessing.Process(target=process, args=(i,)).start()

    for i in range(2):
        print("im doing stuff")
output:
>>
im doing stuff
im doing stuff
Process 8 : completed in 0.5390905372395016 sec
Process 6 : completed in 1.2313793332779521 sec
Process 2 : completed in 1.3439237625459899 sec
Process 0 : completed in 2.171809500083049 sec
Process 5 : completed in 2.6980031493633887 sec
Process 1 : completed in 3.3807358192422416 sec
Process 3 : completed in 4.597366303348297 sec
Process 7 : completed in 4.702447947943171 sec
Process 4 : completed in 4.8355495004170965 sec
Process 9 : completed in 4.9917788543156245 sec
I'd like to have a main while True loop that does data acquisition, starts a new process at each iteration (with the new data), checks whether any process has finished, and looks at its output.
How can I verify that a process has ended, and what is its return value? Edit: while other processes in the list are still executing.
To summarize my problem: how can I know which process in a list of processes has finished, while some are still executing or new ones are being added?
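The question above is not answered in this digest, but one common pattern for this kind of polling is a multiprocessing.Pool: each call to apply_async returns an AsyncResult whose ready() and get() tell you whether that job has finished and what it returned, while the main loop keeps acquiring data and submitting new work. A minimal sketch along those lines (acquire_data() is a hypothetical stand-in for the real acquisition, and the for loop stands in for while True):
import multiprocessing
import random
import time

def process(data):
    # Simulate some work on the acquired data and return a result.
    time.sleep(random.uniform(0, 2))
    return data

def acquire_data():
    # Hypothetical placeholder for the real data acquisition.
    return random.random()

if __name__ == '__main__':
    pending = []  # AsyncResult handles for jobs that are still running
    with multiprocessing.Pool() as pool:
        for iteration in range(20):  # stand-in for `while True`
            pending.append(pool.apply_async(process, (acquire_data(),)))

            # Non-blocking check: which submitted jobs have finished?
            still_running = []
            for result in pending:
                if result.ready():
                    print('finished, return value:', result.get())
                else:
                    still_running.append(result)
            pending = still_running

            time.sleep(0.5)  # do other "stuff" between iterations

        # Note: leaving the `with` block terminates the pool, so any jobs
        # still pending at that point are discarded in this sketch.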

Related

python threading event.wait() using same object in multiple threads

Where is the documentation for the Python 3 threading library's Event.wait() method that explains how one event can be used multiple times in different threads?
The example below shows that the same event can be used in multiple threads, each with a different wait() duration, probably because each has its own lock under the hood.
But this functionality is not documented in an obvious way on the threading page.
This works great, but it's not clear why it works or whether it will continue to work in future Python versions.
Are there ways this could break unexpectedly?
Can an inherited event work properly in multiple classes as long as it's used in separate threads?
import logging
import threading
import time

logging.basicConfig(level=logging.DEBUG,
                    format='[%(levelname)s] (%(threadName)-10s) %(message)s',)

def worker(i, dt, e):
    tStart = time.time()
    e.wait(dt)
    logging.debug('{0} tried to wait {1} seconds but really waited {2}'.format(i, dt, time.time() - tStart))

e = threading.Event()
maxThreads = 10
for i in range(maxThreads):
    dt = 1 + i  # (s)
    w = threading.Thread(target=worker, args=(i, dt, e))
    w.start()
output:
[DEBUG] (Thread-1 ) 0 tried to wait 1 seconds but really waited 1.0003676414489746
[DEBUG] (Thread-2 ) 1 tried to wait 2 seconds but really waited 2.00034761428833
[DEBUG] (Thread-3 ) 2 tried to wait 3 seconds but really waited 3.0001776218414307
[DEBUG] (Thread-4 ) 3 tried to wait 4 seconds but really waited 4.000180244445801
[DEBUG] (Thread-5 ) 4 tried to wait 5 seconds but really waited 5.000337362289429
[DEBUG] (Thread-6 ) 5 tried to wait 6 seconds but really waited 6.000308990478516
[DEBUG] (Thread-7 ) 6 tried to wait 7 seconds but really waited 7.000143051147461
[DEBUG] (Thread-8 ) 7 tried to wait 8 seconds but really waited 8.000152826309204
[DEBUG] (Thread-9 ) 8 tried to wait 9 seconds but really waited 9.00012469291687
[DEBUG] (Thread-10 ) 9 tried to wait 10 seconds but really waited 10.000144481658936
Since e is a threading.Event, you are passing it into each thread together with that thread's own dt value (all 10 threads are executed almost in parallel).
You can check it here:
import logging
import threading
import time

logging.basicConfig(level=logging.DEBUG,
                    format='[%(levelname)s] (%(threadName)s) %(message)s',)

def worker(i, dt, e):
    tStart = time.time()
    logging.info('Program will wait for {} time while trying to print the change from {} to {}'.format(dt, i, dt))
    e.wait(dt)
    logging.debug('{0} tried to wait {1} seconds but really waited {2}'.format(i, dt, time.time() - tStart))

e = threading.Event()
maxThreads = 10
for i in range(maxThreads):
    dt = 1 + i  # (s)
    w = threading.Thread(target=worker, args=(i, dt, e))
    w.start()
It's not the locking; it's just about the value passed into the thread's target.
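To make the point concrete, here is a small sketch (not from the original answer): wait(timeout) is just a per-call timeout on the shared event, returning False when the timeout expires and True as soon as the event is set, independently for every thread waiting on the same Event object:
import threading
import time

e = threading.Event()

def waiter(name, timeout):
    woke_because_set = e.wait(timeout)  # True if set() happened, False on timeout
    print('{}: wait({}) returned {}'.format(name, timeout, woke_because_set))

threading.Thread(target=waiter, args=('short', 0.5)).start()  # will time out
threading.Thread(target=waiter, args=('long', 5.0)).start()   # will be woken by set()

time.sleep(1)
e.set()  # the long waiter returns True right away; the short one already timed out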

How to use APScheduler correctly in FastAPI?

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
import uvicorn
import time
from loguru import logger
from apscheduler.schedulers.background import BackgroundScheduler

app = FastAPI()
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

test_list = ["1"] * 10

def check_list_len():
    global test_list
    while True:
        time.sleep(5)
        logger.info(f"check_list_len:{len(test_list)}")

@app.on_event('startup')
def init_data():
    scheduler = BackgroundScheduler()
    scheduler.add_job(check_list_len, 'cron', second='*/5')
    scheduler.start()

@app.get("/pop")
async def list_pop():
    global test_list
    test_list.pop(1)
    logger.info(f"current_list_len:{len(test_list)}")

if __name__ == '__main__':
    uvicorn.run(app="main3:app", host="0.0.0.0", port=80, reload=False, debug=False)
Above is my code. I want to pop elements from the list through a GET request and set up a periodic task that constantly checks the number of elements in the list, but when I run it, the following error keeps appearing:
Execution of job "check_list_len (trigger: cron[second='*/5'], next run at: 2021-11-25 09:48:50 CST)" skipped: maximum number of running instances reached (1)
2021-11-25 09:48:50.016 | INFO | main3:check_list_len:23 - check_list_len:10
Execution of job "check_list_len (trigger: cron[second='*/5'], next run at: 2021-11-25 09:48:55 CST)" skipped: maximum number of running instances reached (1)
2021-11-25 09:48:55.018 | INFO | main3:check_list_len:23 - check_list_len:10
INFO: 127.0.0.1:55961 - "GET /pop HTTP/1.1" 200 OK
2021-11-25 09:48:57.098 | INFO | main3:list_pop:35 - current_list_len:9
Execution of job "check_list_len (trigger: cron[second='*/5'], next run at: 2021-11-25 09:49:00 CST)" skipped: maximum number of running instances reached (1)
2021-11-25 09:49:00.022 | INFO | main3:check_list_len:23 - check_list_len:9
It looks like I started two scheduled tasks and only one succeeded, but I only started one task. How do I avoid this?
You're getting the behavior you asked for. You've configured APScheduler to run check_list_len every five seconds, but you've also written that function so it never terminates - it just sleeps for five seconds in an endless loop. Since the function never finishes, APScheduler doesn't run it again.
Remove the infinite loop inside your utility function when using APScheduler - it will call the function every five seconds for you:
def check_list_len():
    global test_list  # you really don't need this either, since you're not reassigning the variable
    logger.info(f"check_list_len:{len(test_list)}")

Laravel-Excel keeps browser busy for 140 seconds after completion of import: how do I correct it?

Using the import to models option, I am importing an XLS file with about 15,000 rows.
Using a microtime_float function, the script times the import and echoes how long it takes. The echo appears at 29.6 seconds, showing the import itself took less than 30 seconds. At that point, I can see the database has all 15k+ records as expected, so no issues there.
The problem is that the browser is kept busy, and at 1 min 22 s, 1 min 55 s and 2 min 26 s it prompts me to either wait or kill the process. I keep clicking wait, and it finally ends at 2 min 49 s.
This is a terrible user experience; how can I cut off this extra wait time?
It's a very basic setup: the route calls ImportController@import via HTTP GET, and the code is as follows:
public function import()
{
    ini_set('memory_limit', '1024M');
    $start = $this->microtime_float();
    Excel::import(new myImport, 'myfile.xls', null, \Maatwebsite\Excel\Excel::XLS);
    $end = $this->microtime_float();
    $t = $end - $start;
    return "Time: $t";
}
The class uses certain concerns as follows:
class myImport implements ToModel, WithBatchInserts, WithChunkReading, WithStartRow

DocumentDB performance issues

When running DocumentDB queries from C# code on my local computer, a simple DocumentDB query takes about 0.5 seconds on average. As another example, getting a reference to a document collection takes about 0.7 seconds on average. Is this to be expected? Below is my code for checking whether a collection exists; it is pretty straightforward, but is there any way to improve the poor performance?
// Create a new instance of the DocumentClient
var client = new DocumentClient(new Uri(EndpointUrl), AuthorizationKey);

// Get the database with the id=FamilyRegistry
var database = client.CreateDatabaseQuery().Where(db => db.Id == "FamilyRegistry").AsEnumerable().FirstOrDefault();

var stopWatch = new Stopwatch();
stopWatch.Start();

// Get the document collection with the id=FamilyCollection
var documentCollection = client.CreateDocumentCollectionQuery("dbs/" + database.Id)
    .Where(c => c.Id == "FamilyCollection").AsEnumerable().FirstOrDefault();

stopWatch.Stop();

// Get the elapsed time as a TimeSpan value.
var ts = stopWatch.Elapsed;

// Format and display the TimeSpan value.
var elapsedTime = String.Format("{0:00} seconds, {1:00} milliseconds",
    ts.Seconds,
    ts.Milliseconds);

Console.WriteLine("Time taken to get a document collection: " + elapsedTime);
Console.ReadKey();
Average output on local computer:
Time taken to get a document collection: 0 seconds, 752 milliseconds
In another piece of my code I'm doing 20 small document updates that are about 400 bytes each in JSON size and it still takes 12 seconds in total. I'm only running from my development environment but I was expecting better performance.
In short, this can be done end to end in ~9 milliseconds with DocumentDB. I'll walk through the changes required, and why/how they impact results below.
The very first query always takes longer in DocumentDB because it does some setup work (fetching physical addresses of DocumentDB partitions). The next couple requests take a little bit longer to warm the connection pools. The subsequent queries will be as fast as your network (the latency of reads in DocumentDB is very low due to SSD storage).
For example, if you modify your code above to measure 10 readings instead of just the first one, as shown below:
using (DocumentClient client = new DocumentClient(new Uri(EndpointUrl), AuthorizationKey))
{
    long totalRequests = 10;

    var database = client.CreateDatabaseQuery().Where(db => db.Id == "FamilyRegistry").AsEnumerable().FirstOrDefault();

    Stopwatch watch = new Stopwatch();
    for (int i = 0; i < totalRequests; i++)
    {
        watch.Start();
        var documentCollection = client.CreateDocumentCollectionQuery("dbs/" + database.Id)
            .Where(c => c.Id == "FamilyCollection").AsEnumerable().FirstOrDefault();
        Console.WriteLine("Finished read {0} in {1}ms ", i, watch.ElapsedMilliseconds);
        watch.Reset();
    }
}

Console.ReadKey();
I get the following results running from my desktop in Redmond against the Azure West US data center, i.e. about 50 milliseconds. These numbers may vary based on the network connectivity and distance of your client from the Azure DC hosting DocumentDB:
Finished read 0 in 217ms
Finished read 1 in 46ms
Finished read 2 in 51ms
Finished read 3 in 47ms
Finished read 4 in 46ms
Finished read 5 in 93ms
Finished read 6 in 48ms
Finished read 7 in 45ms
Finished read 8 in 45ms
Finished read 9 in 51ms
Next, I switch to Direct/TCP connectivity from the default of Gateway to improve the latency from two hops to one, i.e., change the initialization code to:
using (DocumentClient client = new DocumentClient(new Uri(EndpointUrl), AuthorizationKey, new ConnectionPolicy { ConnectionMode = ConnectionMode.Direct, ConnectionProtocol = Protocol.Tcp }))
Now the operation to find the collection by ID completes within 23 milliseconds:
Finished read 0 in 197ms
Finished read 1 in 117ms
Finished read 2 in 23ms
Finished read 3 in 23ms
Finished read 4 in 25ms
Finished read 5 in 23ms
Finished read 6 in 31ms
Finished read 7 in 23ms
Finished read 8 in 23ms
Finished read 9 in 23ms
How about when you run the same test from an Azure VM or Worker Role in the same Azure DC? The same operation completes in about 9 milliseconds!
Finished read 0 in 140ms
Finished read 1 in 10ms
Finished read 2 in 8ms
Finished read 3 in 9ms
Finished read 4 in 9ms
Finished read 5 in 9ms
Finished read 6 in 9ms
Finished read 7 in 9ms
Finished read 8 in 10ms
Finished read 9 in 8ms
So, to summarize:
For performance measurements, please allow for a few measurement samples to account for startup/initialization of the DocumentDB client.
Please use TCP/Direct connectivity for lowest latency.
When possible, run within the same Azure region.
If you follow these steps, you'll be able to get the best performance numbers with DocumentDB.

Computing eigenvalues in parallel for a large matrix

I am trying to compute the eigenvalues of a big matrix in MATLAB using the Parallel Computing Toolbox.
I first tried:
A = rand(10000,2000);
A = A*A';
matlabpool open 2
spmd
    C = codistributed(A);
    tic
    [V,D] = eig(C);
    time = gop(@max, toc) % Time for all labs in the pool to complete.
end
matlabpool close
The code starts its execution:
Starting matlabpool using the 'local' profile ... connected to 2 labs.
But, after a few minutes, I got the following error:
Error using distcompserialize
Out of Memory during serialization
Error in spmdlang.RemoteSpmdExecutor/initiateComputation (line 82)
fcns = distcompMakeByteBufferHandle( ...
Error in spmdlang.spmd_feval_impl (line 14)
blockExecutor.initiateComputation();
Error in spmd_feval (line 8)
spmdlang.spmd_feval_impl( varargin{:} );
I then tried to apply what I saw on tutorial videos from the parallel toolbox:
>> job = createParallelJob('configuration', 'local');
>> task = createTask(job, @eig, 1, {A});
>> submit(job);
>> waitForState(job, 'finished');
>> results = getAllOutputArguments(job)
>> destroy(job);
But after two hours of computation, I got:
results =
Empty cell array: 2-by-0
My computer has 2 GiB of memory and an Intel dual-core CPU (2 × 2 GHz).
My questions are the following:
1/ Looking at the first error, I guess my memory is not sufficient for this problem. Is there a way I can divide the input data so that my computer can handle this matrix?
2/ Why is the second result I get empty (after two hours of computation...)?
EDIT: @pm89
You were right, an error occurred during the execution:
job =
Parallel Job ID 3 Information
=============================
UserName : bigTree
State : finished
SubmitTime : Sun Jul 14 19:20:01 CEST 2013
StartTime : Sun Jul 14 19:20:22 CEST 2013
Running Duration : 0 days 0h 3m 16s
- Data Dependencies
FileDependencies : {}
PathDependencies : {}
- Associated Task(s)
Number Pending : 0
Number Running : 0
Number Finished : 2
TaskID of errors : [1 2]
- Scheduler Dependent (Parallel Job)
MaximumNumberOfWorkers : 2
MinimumNumberOfWorkers : 1
