I have a Python Flask API that applies some SQL-based filtering on an object.
Steps of the API workflow:
receive a POST request (with arguments)
run multiple SQL read queries (against a postgres DB) depending on some of the posted arguments
apply some simple "pure python" rules on the SQL results to get a boolean result
store the boolean result and the associated posted arguments in the postgres DB
return the boolean result
Constraints of the API:
The API needs to return the boolean answer in under 150 ms
I can store the boolean result asynchronously in the DB to avoid waiting for the write query to complete before returning the boolean result
However, as explained, the boolean answer depends on the SQL read queries, so I cannot run those queries asynchronously
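For context, here is a rough sketch of the workflow and constraints described above. It is only an illustration under assumptions: the /filter endpoint, the placeholder read queries, the results table, and the all() rule are all made up and would be replaced by the real queries and pure-Python rules.
import os
from concurrent.futures import ThreadPoolExecutor

from flask import Flask, jsonify, request
from sqlalchemy import create_engine, text

app = Flask(__name__)
engine = create_engine(os.getenv("POSTGRES_URL"))
executor = ThreadPoolExecutor(max_workers=10)  # shared pool for parallel reads and the async write

# Placeholder read queries standing in for the real ones
READ_QUERIES = [
    "SELECT count(*) FROM users",
    "SELECT max(id) FROM users",
]

def run_read(query):
    with engine.connect() as conn:
        return conn.execute(text(query)).fetchall()

def store_result(args, result):
    # Fire-and-forget write; the response does not wait for it (hypothetical results table).
    with engine.begin() as conn:
        conn.execute(
            text("INSERT INTO results (args, result) VALUES (:args, :result)"),
            {"args": str(args), "result": result},
        )

@app.route("/filter", methods=["POST"])
def filter_endpoint():
    args = request.get_json()
    # Run the read queries in parallel threads, then apply the pure-Python rules.
    rows = [f.result() for f in [executor.submit(run_read, q) for q in READ_QUERIES]]
    result = all(bool(r) for r in rows)  # placeholder for the real rules
    executor.submit(store_result, args, result)  # store asynchronously
    return jsonify({"result": result})
In this sketch the reads block the response (the rules need their results), while the write is handed to the executor so the endpoint can return immediately.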
Tests made:
While making some tests, I saw that I can run read queries in parallel. The test I did was:
Running the query below 5 times without multithreading => the code ran in roughly 25 seconds (five sequential pg_sleep(5) calls)
from sqlalchemy import create_engine
import os
import time

engine = create_engine(
    os.getenv("POSTGRES_URL")
)

def run_query():
    with engine.connect() as conn:
        rs = conn.execute("""
            SELECT
                *
                , pg_sleep(5)
            FROM users
        """)
        for row in rs:
            print(row)

if __name__ == "__main__":
    start = time.time()
    for i in range(5):
        run_query()
    end = time.time() - start
Running the same query 5 times using multithreading => the code ran in roughly 5 seconds
from sqlalchemy import create_engine
import os
import threading
import time

engine = create_engine(
    os.getenv("POSTGRES_URL")
)

def run_query():
    with engine.connect() as conn:
        rs = conn.execute("""
            SELECT
                *
                , pg_sleep(5)
            FROM users
        """)
        for row in rs:
            print(row)

if __name__ == "__main__":
    start = time.time()
    threads = []
    for i in range(5):
        t = threading.Thread(target=run_query)
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    end = time.time() - start
Question:
What is the bottleneck of the code? I'm sure there must be a maximum number of read queries that I can run in parallel in one API call, but I'm wondering what determines that limit.
Thank you very much for your help!
This scales well beyond the point that is sensible. With some tweaks to the built-in connection pool's pool_size, you could easily have 100 pg_sleep calls going simultaneously. But as soon as you change that to do real work rather than just sleeping, it would fall apart. You only have so many CPUs and so many disk drives, and that number is probably way less than 100.
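For illustration, a minimal sketch of that tweak, reusing the engine and query from the question; the pool sizes and worker counts are arbitrary examples, not recommendations:
import os
from concurrent.futures import ThreadPoolExecutor

from sqlalchemy import create_engine, text

# pool_size / max_overflow control how many connections the built-in QueuePool
# will hand out; each concurrent query holds one connection for its whole duration.
engine = create_engine(
    os.getenv("POSTGRES_URL"),
    pool_size=20,
    max_overflow=10,
)

def run_query():
    with engine.connect() as conn:
        return conn.execute(text("SELECT *, pg_sleep(5) FROM users")).fetchall()

# Cap concurrency explicitly instead of spawning one thread per query.
with ThreadPoolExecutor(max_workers=20) as pool:
    results = [f.result() for f in [pool.submit(run_query) for _ in range(20)]]
On the server side, Postgres also caps concurrent sessions at its max_connections setting, and real queries start competing for CPU and I/O well before any of these limits are hit.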
You should start by looking at those read queries to see why they are slow and whether they can be made faster with indexes or something.
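As a starting point, something like this shows whether Postgres uses an index or falls back to a sequential scan for a given query (a sketch; the users table comes from the question, but the email filter is a made-up example):
import os

from sqlalchemy import create_engine, text

engine = create_engine(os.getenv("POSTGRES_URL"))

# EXPLAIN ANALYZE runs the query and prints the plan Postgres actually used.
# 'email' is a made-up column used only for illustration.
with engine.connect() as conn:
    plan = conn.execute(
        text("EXPLAIN ANALYZE SELECT * FROM users WHERE email = :email"),
        {"email": "someone@example.com"},
    )
    for line in plan:
        print(line[0])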
Related
I am trying to use a REST API to enrich data I have in a Spark dataframe. The REST API isn't built by me and requires a single input at a time (no batch option). Unfortunately the REST API latency is higher than I would like, so my Spark application seems to spend a lot of time waiting for the API as it iterates over each row. Although the REST API has high latency, it does have very high throughput/capacity, which does not seem to get fully used by my Spark application.
Since my application appears to be network bound, I was wondering if it would make sense to use threading to help improve its speed. Is Spark already capable of doing this internally? If using threads does make sense, is there an easy way to accomplish this? Has anybody successfully done this?
I’ve encountered the same problem when fetching data from a blob storage.
Below is a small self-contained dummy example that I think you can easily modify for your needs.
In the example you should be able to see that it takes a lot longer to construct df_slow than to construct df_fast.
It works by making each worker process a list of rows in parallel, instead of processing one row at a time sequentially.
You might be able to just swap the slowAdd function with your own Row transforming function. The slowAdd function simulates network latency by sleeping 0.1 seconds.
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import Row
# Just some dataframe with numbers
data = [(i,) for i in range(0, 1000)]
df = spark.createDataFrame(data, T.StructType([T.StructField("Data", T.IntegerType())]))
# Get an rdd that contains 'list of Rows' instead of 'Row'
standardRdd = df.rdd # contains [row1, row2, row3, ...]
number_of_partitions = 10
repartitionedRdd = standardRdd.repartition(number_of_partitions) # contains [row1, row2, row3, ...] but repartitioned to increase parallelism
glomRdd = repartitionedRdd.glom() # contains roughly [[row1, row2, row3, ..., row100], [row101, row102, row103, ...], ...]
# where the number of sublists corresponds to the number of partitions
# Define a transformation function with an artificial delay.
# Substitute this with your own transformation function.
import time
def slowAdd(r):
    d = r.asDict()
    d["Data"] = d["Data"] + 100
    time.sleep(0.1)
    return Row(**d)
# Define a function that maps the slowAdd function from 'list of Rows' to 'list of Rows' in parallel
import concurrent.futures
def slowAdd_with_thread_pool(list_of_rows):
    thread_pool = concurrent.futures.ThreadPoolExecutor(max_workers=100)
    return [result for result in thread_pool.map(slowAdd, list_of_rows)]
# Perform a fast mapping from 'list of Rows' to 'Rows'.
transformed_fast_rdd = glomRdd.flatMap(slowAdd_with_thread_pool)
# For reference, perform a slow mapping from 'Rows' to 'Rows'
transformed_slow_rdd = repartitionedRdd.map(slowAdd)
# Convert the RDDs back to dataframes
df_fast = spark.createDataFrame(transformed_fast_rdd)
#This sum operation will be fast (~100 threads sleeping in parallel on each worker)
df_fast.agg(F.sum(F.col("Data"))).show()
df_slow = spark.createDataFrame(transformed_slow_rdd)
#This sum operation will be slow (only 1 thread sleeping at a time on each worker)
df_slow.agg(F.sum(F.col("Data"))).show()
Hi, I'm trying to run tasks asynchronously, but I can't get it to work.
What I want to do is query a db (takes 30s) and process the result (takes 15s) while I make the next query.
My problem seems very simple to solve, but for some reason I can't get it to work.
Thank you very much for your help.
Here is my code so far
import asyncio
import pandas as pd

# engine is an existing SQLAlchemy engine

async def query_db(date):
    sql_query = f"SELECT * FROM tablename WHERE date='{date}'"
    df = pd.read_sql(sql_query, engine)
    df.to_csv(f"{date}_data.csv", index=False)

async def process_df(filepath):
    df = pd.read_csv(filepath)
    # Do processing stuff here and save the modified file

for dte in pd.date_range("2020-10-10", "2020-10-12", freq='D'):
    query_data_task = asyncio.create_task(query_db(dte))
    filepath = f"{dte}_data.csv"
    await query_data_task
    process_dataframe = asyncio.create_task(process_df(filepath))
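For reference, one way to get the overlap described above is to push the blocking pandas/SQL calls into worker threads. This is only a sketch under assumptions: Python 3.9+ (for asyncio.to_thread), and query_db / process_df rewritten as plain, non-async functions with the same bodies as above.
import asyncio

import pandas as pd

async def main():
    processing_tasks = []
    for dte in pd.date_range("2020-10-10", "2020-10-12", freq="D"):
        # Run the blocking DB query in a worker thread; the event loop stays free.
        await asyncio.to_thread(query_db, dte)
        filepath = f"{dte}_data.csv"
        # Start processing in another thread and immediately move on to the next query.
        processing_tasks.append(asyncio.create_task(asyncio.to_thread(process_df, filepath)))
    # Wait for any processing that is still running once all queries are done.
    await asyncio.gather(*processing_tasks)

asyncio.run(main())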
I'm trying to insert 150,000 generated rows into Cassandra using BATCH with the Python driver, and it takes approximately 30 seconds. What should I do to optimize it and insert the data faster?
Here is my code:
from cassandra.cluster import Cluster
from faker import Faker
import time
fake = Faker()
cluster = Cluster(['127.0.0.1'], port=9042)
session = cluster.connect()
session.default_timeout = 150
num = 0
def create_data():
    global num
    BATCH_SIZE = 1500
    BATCH_STMT = 'BEGIN BATCH'
    for i in range(BATCH_SIZE):
        BATCH_STMT += f" INSERT INTO tt(id, title) VALUES ('{num}', '{fake.name()}')"
        num += 1
    BATCH_STMT += ' APPLY BATCH;'
    prep_batch = session.prepare(BATCH_STMT)
    return prep_batch
tt = []
session.execute('USE ttest_2')
prep_batch = []
print("Start create data function!")
start = time.time()
for i in range(100):
    prep_batch.append(create_data())
end = time.time()
print("Time for create fake data: ", end - start)
start = time.time()
for i in range(100):
    session.execute(prep_batch[i])
    time.sleep(0.00000001)
end = time.time()
print("Time for execution insert into table: ", end - start)
The main problem is that you're using batches for inserting the data - in Cassandra, that's a bad practice (see the documentation for an explanation). Instead you need to prepare a query and insert the data one by one - this will allow the driver to route data to the specific node, decreasing the load on that node, and perform data insertion faster. Pseudo-code would look as follows (see the Python driver code for the exact syntax):
prep_statement = session.prepare("INSERT INTO tt(id, title) VALUES (?, ?)")
for your_loop:
    session.execute(prep_statement, [id, title])
Another problem is that you're using the synchronous API - this means that the driver waits until the insert happens and then fires the next one. To speed things up you need to use the asynchronous API instead (see the same doc for details). See the Developing applications with DataStax drivers guide for a list of best practices, etc.
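A minimal sketch of that approach, using the driver's execute_concurrent_with_args helper (which builds on the asynchronous API); the keyspace and table are the ones from the question, and the concurrency value is an arbitrary example:
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args
from faker import Faker

fake = Faker()
cluster = Cluster(['127.0.0.1'], port=9042)
session = cluster.connect('ttest_2')

# Prepare once; the driver can then route each insert to the right replica.
prep = session.prepare("INSERT INTO tt(id, title) VALUES (?, ?)")

# id is stored as text in the question's schema, hence str(i).
params = [(str(i), fake.name()) for i in range(150_000)]

# Run many inserts concurrently; 'concurrency' caps the number of in-flight requests.
results = execute_concurrent_with_args(session, prep, params, concurrency=100)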
But really, if you just want to load the database with data, I recommend not re-inventing the wheel, but instead either:
generate the data into a CSV file & load it into Cassandra using DSBulk, which is heavily optimized for loading data
use NoSQLBench to generate data & populate Cassandra - it's also heavily optimized for data generation & loading (not only into Cassandra).
How can I schedule lots of APScheduler jobs (4,000+) concurrently? (I must schedule all of these after certain user events.)
Iteratively calling add_job simply takes too long with many jobs. But when I try to use AsyncIOScheduler and the following async code, I don't get any performance increase either.
NOTE: my scheduler needs to connect to a SQL jobstore via SQLAlchemy
import asyncio
import datetime
import time

from apscheduler.jobstores.sqlalchemy import SQLAlchemyJobStore
from apscheduler.schedulers.asyncio import AsyncIOScheduler

scheduler = AsyncIOScheduler(jobstores={"default": SQLAlchemyJobStore(url="a valid db connection str")})
scheduler.start()

def schedule_jobs_quickly():
    # init lots of (fake) jobs
    jobs = []
    for i in range(3000):
        jobs.append(i)
    send_time = datetime.datetime.now() + datetime.timedelta(days=2)

    # try to schedule jobs concurrently
    start_time = time.time()
    asyncio.get_event_loop().run_until_complete(schedule_all_jobs(jobs, send_time))
    duration = time.time() - start_time
    print(f"Created {len(jobs)} jobs in {duration} seconds")

async def schedule_all_jobs(all_jobs, send_time):
    tasks = []
    for job in all_jobs:
        task = asyncio.ensure_future(schedule_job(job, send_time))
        tasks.append(task)
    await asyncio.gather(*tasks, return_exceptions=True)

async def schedule_job(job, send_time):
    scheduler.add_job(send_email_if_needed, trigger="date", run_date=send_time)
The result is very slow. How can I speed this up?
>>> schedule_jobs_quickly()
...
Created 3000 jobs in 401.9982771873474 seconds
For comparison, this is how long it took with a BackgroundScheduler() using the default memory jobstore:
Created 3000 jobs in 0.9155495166778564 seconds
So, it seems to be the database connections that are so expensive. Maybe there's a way to create multiple jobs using the same connection, instead of re-connecting for each add_job?
It's not the solution I was looking for, but I decided to give up on AsyncIOScheduler and instead schedule my many tasks in a separate thread so the rest of my program could continue without being held up by all of the DB connections. Example below.
import datetime
from threading import Thread

def schedule_jobs_quickly():
    # init lots of (fake) jobs
    jobs = []
    for i in range(3000):
        jobs.append(i)
    send_time = datetime.datetime.now() + datetime.timedelta(days=2)

    # schedule jobs in a new thread
    scheduler_thread = Thread(target=schedule_email_jobs, args=(jobs, send_time))
    scheduler_thread.start()

def schedule_email_jobs(jobs, send_time):
    for job in jobs:
        scheduler.add_job(send_email, trigger="date", run_date=send_time)

def send_email():
    pass  # sends email
I want to set up a mock database (as opposed to creating a test database, if possible) to check that the data is being properly queried and then converted into a Pandas dataframe. I have some experience with mock and unit testing and have set up previous tests successfully. However, I'm having difficulty applying it to mocking real-life objects like databases for testing.
Currently, I'm having trouble generating a result when my test is run. I believe that I'm not mocking the database object correctly, that I'm missing a step, or that my thought process is incorrect. I put my tests and the code to be tested in the same script to simplify things.
I've thoroughly read through the Python unittest and mock documentation, so I know what they do and how they work (for the most part).
I've read countless posts on mocking on Stack Overflow and outside of it as well. They were helpful in understanding general concepts and what can be done in the specific circumstances outlined, but I could not get it to work in my situation.
I've tried mocking various aspects of the function, including the database connection, the query, and the pd.read_sql(query, con) call, to no avail. I believe this is the closest I got.
My Most Recent Code for Testing
import pandas as pd
import pyodbc
import unittest
import pandas.util.testing as tm
from unittest import mock

# Function that I want to test
def p2ctt_data_frame():
    conn = pyodbc.connect(
        r'Driver={Microsoft Access Driver (*.mdb, *.accdb)};'
        r'DBQ=My\Path\To\Actual\Database\Access Database.accdb;'
    )

    query = 'select * from P2CTT_2016_Plus0HHs'
    # I want to make sure this dataframe object is created as intended
    df = pd.read_sql(query, conn)

    return df

class TestMockDatabase(unittest.TestCase):
    @mock.patch('directory1.script1.pyodbc.connect')  # Mocking connection
    def test_mock_database(self, mock_access_database):

        # The dataframe I expect as the output after the query is run on the 'mock database'
        expected_result = pd.DataFrame({
            'POSTAL_CODE': [
                'A0A0A1'
            ],
            'DA_ID': [
                1001001
            ],
            'GHHDS_DA': [
                100
            ]
        })

        # This is the line that I believe is wrong. I want to create a return value that mocks an Access table
        mock_access_database.connect().return_value = [('POSTAL_CODE', 'DA_ID', 'GHHDS_DA'), ('A0A0A1', 1001001, 100)]

        result = p2ctt_data_frame()  # Run original function on the mock database

        tm.assert_frame_equal(result, expected_result)

if __name__ == "__main__":
    unittest.main()
I expect the dataframe I defined and the result after running the test against the mock database object to be one and the same. This is not the case.
Currently, if I print out the result when trying to mock the database I get:
Empty DataFrame
Columns: []
Index: []
Furthermore, I get the following error after the test is run:
AssertionError: DataFrame are different;
DataFrame shape mismatch
[left]: (0, 0)
[right]: (1, 3)
I would break it up into a few separate tests: a functional test that the desired result will be produced, a test to make sure you can access the database and get expected results, and the final unit test on how to implement it. I would write each test in that order, completing the tests before the actual function. I found that if I can't figure out how to do something, I'll try it in a separate REPL or create a git branch to work on it, then go back to the main branch. More information can be found here: https://obeythetestinggoat.com/book/praise.harry.html
Comments for each test and the reasoning behind it are in the code.
import pandas as pd
import pyodbc

def p2ctt_data_frame(query='SELECT * FROM P2CTT_2016_Plus0HHs;'):  # set query as default
    with pyodbc.connect(
        r'Driver={Microsoft Access Driver (*.mdb, *.accdb)};'
        r'DBQ=My\Path\To\Actual\Database\Access Database.accdb;'
    ) as conn:  # use with so the connection is closed once completed
        df = pd.read_sql(query, conn)

    return df
Separate test file:
import pandas as pd
import pyodbc
import unittest
from unittest import mock

from directory1.script1 import p2ctt_data_frame  # module path matching the mock.patch targets below

class TestMockDatabase(unittest.TestCase):
    def test_p2ctt_data_frame_functional_test(self):  # Functional test on data I know will not change
        actual_df = p2ctt_data_frame(query='SELECT * FROM P2CTT_2016_Plus0HHs WHERE DA_ID = 1001001;')
        expected_df = pd.DataFrame({
            'POSTAL_CODE': [
                'A0A0A1'
            ],
            'DA_ID': [
                1001001
            ],
            'GHHDS_DA': [
                100
            ]
        })
        pd.testing.assert_frame_equal(actual_df, expected_df)

    def test_access_database_returns_values(self):  # integration test with the database to make sure it works
        with pyodbc.connect(
            r'Driver={Microsoft Access Driver (*.mdb, *.accdb)};'
            r'DBQ=My\Path\To\Actual\Database\Access Database.accdb;'
        ) as conn:
            with conn.cursor() as cursor:
                cursor.execute("SELECT TOP 1 * FROM P2CTT_2016_Plus0HHs WHERE DA_ID = 1001001;")
                result = cursor.fetchone()

        self.assertTrue(len(result) == 3)  # should be 3 columns by 1 row

        # Look for accuracy in the database
        info_from_db = []
        for data in result:  # add all data from the row to the list
            info_from_db.append(data)
        self.assertListEqual(  # All the information matches in the database
            ['A0A0A1', 1001001, 100], info_from_db
        )

    @mock.patch('directory1.script1.pd')  # testing pandas
    @mock.patch('directory1.script1.pyodbc.connect')  # Mocking connection so nothing is sent to the outside
    def test_pandas_read_sql_called(self, mock_access_database, mock_pd):  # unit test for the implementation of the function
        p2ctt_data_frame()
        self.assertTrue(mock_pd.read_sql.called)  # Make sure that pandas.read_sql has been called
        self.assertIn(
            'SELECT * FROM P2CTT_2016_Plus0HHs;', str(mock_pd.read_sql.call_args)
        )  # This is to make sure the proper query is sent to pandas. We don't need to unittest that pandas handles the
        # information correctly.
*I was not able to test this so there might be some bugs I need to fix