I have this code, but the worker processes don't finish all the tasks:
workers = []
for batch_index, batched_payloads in batches:
    p = Process(target=self.process_batch, args=(batch_index, batched_payloads))
    workers.append(p)
    p.start()

for worker in workers:
    worker.join()
Hello, any help would be appreciated. I have 20,000 items that I split into batches of 5,000. With multiprocessing the batches run in parallel, but the program doesn't finish all the work; it exits before the inserts are done. The job is to insert data, and each batch only gets through 100 items or so before stopping (batch 1: 10..100, batch 2: 5000..5200, batch 3: 10000..11000, batch 4: 15000..15900).
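One way to see why the workers stop early is to inspect each child's exit code after the joins; a minimal diagnostic sketch added after the loop above:

for worker in workers:
    # exitcode 0 means a clean exit; a positive value means the child raised an
    # unhandled exception, and a negative value means it was killed by a signal
    # (for example the OOM killer during a memory-heavy insert).
    print(worker.name, "exit code:", worker.exitcode)

If the exit codes are non-zero, wrapping the body of process_batch in a try/except that logs the exception will usually show what stopped each batch.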
I have a Python Flask API that applies some SQL-based filtering on an object.
Steps of the API workflow:
receive a POST request (with arguments)
run multiple SQL read queries (against a postgres DB) depending on some of the posted arguments
apply some simple "pure python" rules on the SQL results to get a boolean result
store the boolean result and the associated posted arguments in the postgres DB
return the boolean result
Constraints of the API:
The API needs to return the boolean answer in under 150 ms
I can store the boolean result asynchronously in the DB to avoid waiting for the write query to complete before returning the boolean result (see the sketch after this list)
However, as explained, the boolean answer depends on the SQL read queries, so I cannot run those queries asynchronously
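For the asynchronous write mentioned in the second constraint, one option is to hand the INSERT off to a small thread pool so the request returns without waiting for it; a minimal sketch, assuming a hypothetical results table and reusing a SQLAlchemy engine like the one below:

from concurrent.futures import ThreadPoolExecutor

from sqlalchemy import text

write_pool = ThreadPoolExecutor(max_workers=4)

def store_result_async(engine, args, result):
    # Submit the write and return immediately; the request does not wait for it.
    def _write():
        with engine.begin() as conn:  # begin() commits on successful exit
            conn.execute(
                text("INSERT INTO results (args, result) VALUES (:args, :result)"),
                {"args": str(args), "result": result},
            )
    write_pool.submit(_write)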
Test made:
While running some tests, I saw that I can run the read queries in parallel. The test I did was:
Running the query below 2 times without multithreading => the code ran in roughly 10 seconds
from sqlalchemy import create_engine
import os
import time

engine = create_engine(
    os.getenv("POSTGRES_URL")
)

def run_query():
    with engine.connect() as conn:
        rs = conn.execute("""
            SELECT
                *
                , pg_sleep(5)
            FROM users
        """)
        for row in rs:
            print(row)

if __name__ == "__main__":
    start = time.time()
    for i in range(5):
        run_query()
    end = time.time() - start
Running the query using multithreading => the code ran in roughly 5 seconds
from sqlalchemy import create_engine
import os
import threading
import time

engine = create_engine(
    os.getenv("POSTGRES_URL")
)

def run_query():
    with engine.connect() as conn:
        rs = conn.execute("""
            SELECT
                *
                , pg_sleep(5)
            FROM users
        """)
        for row in rs:
            print(row)

if __name__ == "__main__":
    start = time.time()
    threads = []
    for i in range(5):
        t = threading.Thread(target=run_query)
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    end = time.time() - start
Question:
What is the bottleneck of the code? I'm sure there must be a maximum number of read queries that I can run in parallel in one API call, but I'm wondering what determines this limit.
Thank you very much for your help!
This scales well beyond the point that is sensible. With some tweaks to the built-in connection pool's pool_size, you could easily have 100 pg_sleep calls going simultaneously. But as soon as you change that to do real work rather than just sleeping, it would fall apart. You only have so many CPUs and so many disk drives, and that number is probably way less than 100.
You should start by looking at those read queries to see why they are slow and whether they can be made faster with indexes or something.
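For reference, the pool_size mentioned above is just a keyword argument when creating the engine; a minimal sketch (the numbers are illustrative; SQLAlchemy's defaults are pool_size=5 and max_overflow=10):

engine = create_engine(
    os.getenv("POSTGRES_URL"),
    pool_size=20,      # connections kept open in the pool
    max_overflow=10,   # extra connections allowed under burst load
)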
I have a requirement for some fairly custom, non-trivial synchronization that can be implemented with a fair ReentrantLock and a Phaser. It does not seem possible to implement it (without non-trivial customization) on fs2 and cats-effect.
Since it's required to wrap every blocking operation in a Blocker, here is the code:
private val l: ReentrantLock = new ReentrantLock(true)
private val c: Condition = l.newCondition
private val blocker: Blocker = //...

// F is declared at the class level
def lockedMutex(conditionPredicate: Int => Boolean): F[Unit] = blocker.blockOn {
  Sync[F].delay(l.lock()).bracket { _ =>
    Sync[F].delay {
      while (!conditionPredicate(2)) {
        c.await()
      }
    }
  }(_ => Sync[F].delay(l.unlock()))
}
QUESTION:
Is it guaranteed that the code containing c.await() will be executed on the same thread that acquires/releases the ReentrantLock?
This is a crucial point, since if it's not, an IllegalMonitorStateException will be thrown.
You really do not need to worry about threads when using something like cats-effect; rather, you can describe your problem at a higher level.
This should get you the behavior you want: it will run high-priority jobs until there are no more, and only then pick low-priority jobs. After finishing a low-priority job, each fiber will first check whether there are more high-priority jobs before trying to pick a low-priority one again:
import cats.effect.Async
import cats.effect.std.Queue
import cats.effect.syntax.all._
import cats.syntax.all._

import scala.concurrent.ExecutionContext

object HighLowPriorityRunner {
  final case class Config[F[_]](
      highPriorityJobs: Queue[F, F[Unit]],
      lowPriorityJobs: Queue[F, F[Unit]],
      customEC: Option[ExecutionContext]
  )

  def apply[F[_]](config: Config[F])
                 (implicit F: Async[F]): F[Unit] = {
    val processOneJob =
      config.highPriorityJobs.tryTake.flatMap {
        case Some(hpJob) => hpJob
        case None => config.lowPriorityJobs.tryTake.flatMap {
          case Some(lpJob) => lpJob
          case None => F.unit
        }
      }

    val loop: F[Unit] = processOneJob.start.foreverM

    config.customEC.fold(ifEmpty = loop)(ec => loop.evalOn(ec))
  }
}
You can use the customEC to provide your own ExecutionContext to control the number of real threads that are running your fibers under the hood.
The code can be used like this:
import cats.effect.{Async, IO, IOApp, Resource}
import cats.effect.std.Queue
import cats.effect.syntax.all._
import cats.syntax.all._

import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext
import scala.concurrent.duration._

object Main extends IOApp.Simple {
  override final val run: IO[Unit] =
    Resource.make(IO(Executors.newFixedThreadPool(2)))(ec => IO.blocking(ec.shutdown())).use { ec =>
      Program[IO](ExecutionContext.fromExecutor(ec))
    }
}
object Program {
  private def createJob[F[_]](id: Int)(implicit F: Async[F]): F[Unit] =
    F.delay(println(s"Starting job ${id} on thread ${Thread.currentThread.getName}")) *>
    F.delay(Thread.sleep(1.second.toMillis)) *> // Blocks the Fiber! - Only for testing, use F.sleep on real code.
    F.delay(println(s"Finished job ${id}!"))

  def apply[F[_]](customEC: ExecutionContext)(implicit F: Async[F]): F[Unit] = for {
    highPriorityJobs <- Queue.unbounded[F, F[Unit]]
    lowPriorityJobs <- Queue.unbounded[F, F[Unit]]
    runnerFiber <- HighLowPriorityRunner(HighLowPriorityRunner.Config(
      highPriorityJobs,
      lowPriorityJobs,
      Some(customEC)
    )).start
    _ <- List.range(0, 10).traverse_(id => highPriorityJobs.offer(createJob(id)))
    _ <- List.range(10, 15).traverse_(id => lowPriorityJobs.offer(createJob(id)))
    _ <- F.sleep(5.seconds)
    _ <- List.range(15, 20).traverse_(id => highPriorityJobs.offer(createJob(id)))
    _ <- runnerFiber.join.void
  } yield ()
}
Which should produce an output like this:
Starting job 0 on thread pool-1-thread-1
Starting job 1 on thread pool-1-thread-2
Finished job 0!
Finished job 1!
Starting job 2 on thread pool-1-thread-1
Starting job 3 on thread pool-1-thread-2
Finished job 2!
Finished job 3!
Starting job 4 on thread pool-1-thread-1
Starting job 5 on thread pool-1-thread-2
Finished job 4!
Finished job 5!
Starting job 6 on thread pool-1-thread-1
Starting job 7 on thread pool-1-thread-2
Finished job 6!
Finished job 7!
Starting job 8 on thread pool-1-thread-1
Starting job 9 on thread pool-1-thread-2
Finished job 8!
Finished job 9!
Starting job 10 on thread pool-1-thread-1
Starting job 11 on thread pool-1-thread-2
Finished job 10!
Finished job 11!
Starting job 15 on thread pool-1-thread-1
Starting job 16 on thread pool-1-thread-2
Finished job 15!
Finished job 16!
Starting job 17 on thread pool-1-thread-1
Starting job 18 on thread pool-1-thread-2
Finished job 17!
Finished job 18!
Starting job 19 on thread pool-1-thread-1
Starting job 12 on thread pool-1-thread-2
Finished job 19!
Starting job 13 on thread pool-1-thread-1
Finished job 12!
Starting job 14 on thread pool-1-thread-2
Finished job 13!
Finished job 14!
Thanks to Gavin Bisesi (#Daenyth) for refining my original idea into this!
Full code available here.
I have written a Python program that helps copy data from one Oracle DB to another. It uses threading to copy data concurrently over several threads.
import subprocess

def run_sqlplus(sqlplus_script):
    p = subprocess.Popen(['sqlplus', conn], stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    (stdout, stderr) = p.communicate(sqlplus_script.encode('utf-8'))
    stdout_lines = stdout.decode('utf-8').split("\n")
    return stdout_lines
The list specifies the queries to be executed
LST=["select COUNT(1) FROM a WHERE BUSINESS_DATE='30-Sep-2020' AND RUN_ID=1 AND PROCESSING_LOCATION='PAC';",
"select COUNT(1) FROM b WHERE BUSINESS_DATE='30-Sep-2020' AND RUN_ID=1 AND PROCESSING_LOCATION='PAC';",
"COPY FROM schema/pass#DB APPEND TABLE1 using select * FROM TABLE1 WHERE BUSINESS_DATE='30-Sep-2020' ;"
]
I have mapped the list (LST) to the ThreadPoolExecutor below:
import concurrent.futures

with concurrent.futures.ThreadPoolExecutor(max_threads) as executor:
    results = [executor.submit(execute_queries, i) for i in LST]
    for f in concurrent.futures.as_completed(results):
        print(f.result())
While running the code, it seems the COPY commands are running longer than expected. Is there any way I can monitor the progress of the threads in Python, or the SID/SERIAL# in the Oracle DB?
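On the Oracle side, long-running statements often show up in the v$session_longops view, so one option is to poll it from a separate connection while the copy threads run; a sketch reusing the run_sqlplus helper above (whether the SQL*Plus COPY command reports into this view depends on the statements it runs, so treat it as a starting point rather than a guaranteed progress bar):

MONITOR_SQL = """
SELECT sid, serial#, opname, sofar, totalwork,
       ROUND(sofar / NULLIF(totalwork, 0) * 100, 1) AS pct_done
  FROM v$session_longops
 WHERE totalwork > 0
   AND sofar < totalwork;
"""

def monitor_progress():
    # Run the monitoring query in its own sqlplus session and print the rows.
    for line in run_sqlplus(MONITOR_SQL):
        print(line)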
How can I schedule lots of APScheduler jobs (4,000+) concurrently? (I must schedule all these after certain user events.)
Iteratively calling add_job simply takes too long with many jobs. But when I try to use AsyncIOScheduler and the following async code, I don't get any performance increase either.
NOTE: my scheduler needs to connect to a SQL jobstore via SqlAlchemy
import asyncio
import datetime
import time

from apscheduler.jobstores.sqlalchemy import SQLAlchemyJobStore
from apscheduler.schedulers.asyncio import AsyncIOScheduler

scheduler = AsyncIOScheduler(jobstores={"default": SQLAlchemyJobStore(url="a valid db connection str")})
scheduler.start()

def schedule_jobs_quickly():
    # init lots of (fake) jobs
    jobs = []
    for i in range(3000):
        jobs.append(i)
    send_time = datetime.datetime.now() + datetime.timedelta(days=2)

    # try to schedule jobs concurrently
    start_time = time.time()
    asyncio.get_event_loop().run_until_complete(schedule_all_jobs(jobs, send_time))
    duration = time.time() - start_time
    print(f"Created {len(jobs)} jobs in {duration} seconds")

async def schedule_all_jobs(all_jobs, send_time):
    tasks = []
    for job in all_jobs:
        task = asyncio.ensure_future(schedule_job(job, send_time))
        tasks.append(task)
    await asyncio.gather(*tasks, return_exceptions=True)

async def schedule_job(job, send_time):
    scheduler.add_job(send_email_if_needed, trigger="date", run_date=send_time)
The result is very slow. How can I speed this up?
>>> schedule_jobs_quickly()
...
Created 3000 jobs in 401.9982771873474 seconds
For comparison, this is how long it took with a BackgroundScheduler() using the default memory jobstore:
Created 3000 jobs in 0.9155495166778564 seconds
So, it seems to be the database connections that are so expensive. Maybe there's a way to create multiple jobs using the same connection, instead of re-connecting for each add_job?
It's not the solution I was looking for, but I decided to give up on AsyncIOScheduler and instead schedule my many tasks in a separate thread so the rest of my program could continue without being held up by all of the DB connections. Example below.
import datetime
from threading import Thread

def schedule_jobs_quickly():
    # init lots of (fake) jobs
    jobs = []
    for i in range(3000):
        jobs.append(i)
    send_time = datetime.datetime.now() + datetime.timedelta(days=2)

    # schedule jobs in a new thread
    scheduler_thread = Thread(target=schedule_email_jobs, args=(jobs, send_time))
    scheduler_thread.start()

def schedule_email_jobs(jobs, send_time):
    for job in jobs:
        scheduler.add_job(send_email, trigger="date", run_date=send_time)

def send_email():
    # sends email
    pass
I have set up Airflow with a PostgreSQL database, and I am creating multiple DAGs.
# Airflow 1.x-style import paths
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

def subdag(parent_dag_name, child_dag_name, currentDate, batchId, category, subCategory, yearMonth, utilityType, REALTIME_HOME, args):
    dag_subdag = DAG(
        dag_id='%s.%s' % (parent_dag_name, child_dag_name),
        default_args=args,
        schedule_interval="@once",
    )

    # get site list to run bs reports
    site_list = getSiteListforProcessing(category, subCategory, utilityType, yearMonth)
    print(site_list)

    def update_status(siteId, **kwargs):
        createdDate = getCurrentTimestamp()
        print('N', siteId, batchId, yearMonth, utilityType, 'N')
        updateJobStatusLog('N', siteId, batchId, yearMonth, utilityType, 'P')

    def error_status(siteId, **kwargs):
        createdDate = getCurrentTimestamp()
        print('N', siteId, batchId, yearMonth, utilityType, 'N')

    BS_template = """
    echo "{{ params.date }}"
    java -cp xx.jar com.xGenerator {{params.siteId}} {{params.utilityType}} {{params.date}}
    """

    for index, siteid in enumerate(site_list):
        t1 = BashOperator(
            task_id='%s-task-%s' % (child_dag_name, index + 1),
            bash_command=BS_template,
            params={'date': currentDate, 'realtime_home': REALTIME_HOME, 'siteId': siteid, "utilityType": utilityType},
            default_args=args,
            dag=dag_subdag)
        t2 = PythonOperator(
            task_id='%s-updatetask-%s' % (child_dag_name, index + 1),
            dag=dag_subdag,
            python_callable=update_status,
            op_kwargs={'siteId': siteid})
        t2.set_upstream(t1)

    return dag_subdag
It creates the dynamic tasks, but regardless of the number of dynamic tasks, the last one always fails and logs the error:
"Cannot use more than 1 thread when using sqlite. Setting max_threads to 1"
E.g. if 4 tasks are created, 3 run; and if 2 tasks are created, 1 runs.
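If it helps with debugging: the "sqlite" in that message suggests the scheduler is still resolving the default SQLite metadata connection rather than the Postgres one. A small diagnostic sketch, assuming Airflow 1.x-style section names, to print what the running environment actually uses:

from airflow.configuration import conf

# These should show the Postgres connection string and something other than
# SequentialExecutor; a sqlite:// URL here means airflow.cfg (or the
# AIRFLOW__CORE__SQL_ALCHEMY_CONN environment variable) is being read from a
# different AIRFLOW_HOME than the one you configured.
print(conf.get("core", "sql_alchemy_conn"))
print(conf.get("core", "executor"))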