Optimize inserting data to Cassandra database through Python driver - cassandra

I am trying to insert 150,000 generated records into Cassandra using BATCH statements with the Python driver, and it takes approximately 30 seconds. What should I do to optimize this and insert the data faster?
Here is my code:
from cassandra.cluster import Cluster
from faker import Faker
import time

fake = Faker()
cluster = Cluster(['127.0.0.1'], port=9042)
session = cluster.connect()
session.default_timeout = 150

num = 0

def create_data():
    global num
    BATCH_SIZE = 1500
    BATCH_STMT = 'BEGIN BATCH'
    for i in range(BATCH_SIZE):
        BATCH_STMT += f" INSERT INTO tt(id, title) VALUES ('{num}', '{fake.name()}')"
        num += 1
    BATCH_STMT += ' APPLY BATCH;'
    prep_batch = session.prepare(BATCH_STMT)
    return prep_batch

tt = []
session.execute('USE ttest_2')
prep_batch = []

print("Start create data function!")
start = time.time()
for i in range(100):
    prep_batch.append(create_data())
end = time.time()
print("Time for create fake data: ", end - start)

start = time.time()
for i in range(100):
    session.execute(prep_batch[i])
    time.sleep(0.00000001)
end = time.time()
print("Time for execution insert into table: ", end - start)

The main problem is that you're using batches for inserting the data - in Cassandra that's a bad practice (see the documentation for an explanation). Instead you need to prepare a query and insert the data row by row - this lets the driver route each insert to the node that owns the data, decreasing the load on the coordinator and making the inserts faster. Pseudo-code would look as follows (see the Python driver docs for the exact syntax):
prep_statement = session.prepare("INSERT INTO tt(id, title) VALUES (?, ?)")
for your_loop:
    session.execute(prep_statement, [id, title])
Another problem is that you're using the synchronous API - the driver waits until one insert completes and only then fires the next one. To speed things up you need to use the asynchronous API instead (see the same doc for details). See the Developing applications with DataStax drivers guide for a list of best practices, etc.
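A minimal sketch of both ideas together, using the driver's cassandra.concurrent helper (the keyspace and table names are taken from the question; the text id column and the concurrency value are assumptions):
# Sketch only: assumes tt(id text, title text) as implied by the quoted values
# in the question; concurrency=100 is just a starting point to tune.
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args
from faker import Faker

fake = Faker()
cluster = Cluster(['127.0.0.1'], port=9042)
session = cluster.connect('ttest_2')

# Prepare once; the driver can then route every insert to the replica owning the key.
prep = session.prepare("INSERT INTO tt (id, title) VALUES (?, ?)")

# Build the parameter list and let the driver keep many inserts in flight at once.
params = [(str(i), fake.name()) for i in range(150000)]
results = execute_concurrent_with_args(session, prep, params, concurrency=100)

failed = [r for r in results if not r.success]
print(f"{len(failed)} inserts failed")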
But really, if you just want to load the database with data, I recommend not reinventing the wheel and instead either:
generate the data into a CSV file and load it into Cassandra using DSBulk, which is heavily optimized for loading data
use NoSQLBench to generate the data and populate Cassandra - it is also heavily optimized for data generation and loading (not only into Cassandra).

Related

Write Pandas dataframe data to CSV file

I am trying to write a pipeline to bring Oracle database table data to AWS.
It only takes a few ms to fill the dataframe, but when I try to write the dataframe to a CSV file it takes more than 2 minutes to write 10,000 rows. In addition, one of the columns has the cx_Oracle LOB datatype.
I assumed this was why writing the data takes so long, so I converted the data to categorical data. But then the operation takes more memory. Does anyone have any suggestions on how to optimize this process?
query = 'select * from tablename'
cursor.execute(query)
iter_idx = 0
while True:
    results = cursor.fetchmany()
    if not results:
        break
    iter_idx += 1
    df = pd.DataFrame(results)
    df.columns = field['source_field_names']
    rec_count = df.shape[0]
    t_rec_count += rec_count
    file = generate_micro_file()
    print('memory usage : \n', df.info(memory_usage='deep'))
    # sd = dd.from_pandas(df, npartitions=1)
    df.to_csv(file, encoding=str(encoding_type), header=False, index=False,
              escapechar='\\', chunksize=arraysize)
From the data access side, there is room for improvement by optimizing the fetching of rows across the network. Either by:
passing a large num_rows value to fetchmany(), see the cx_Oracle doc on [Cursor.fetchmany()](https://cx-oracle.readthedocs.io/en/latest/api_manual/cursor.html#Cursor.fetchmany),
or increasing the value of Cursor.arraysize; a rough sketch of both knobs is shown after these notes.
Your question didn't explain enough about your LOB usage. See the sample return_lobs_as_strings.py for optimizing fetches.
See the cx_Oracle documentation Tuning Fetch Performance.
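As a rough illustration of those two fetch-tuning knobs (the connection details, query and 10,000 batch size below are placeholders, not values from the question):
# Sketch only: DSN, query and batch size are placeholders.
import cx_Oracle
import pandas as pd

connection = cx_Oracle.connect(user="user", password="password", dsn="host/service")
with connection.cursor() as cursor:
    cursor.arraysize = 10000                # rows per network round trip (default is 100)
    cursor.execute("select * from tablename")
    columns = [d[0] for d in cursor.description]
    while True:
        rows = cursor.fetchmany(10000)      # explicit batch size; defaults to cursor.arraysize
        if not rows:
            break
        df = pd.DataFrame(rows, columns=columns)
        # ... write df out as in the original loop ...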
Is there a particular reason to spend the overhead of converting to a Pandas dataframe? Why not write directly using the csv module?
Maybe something like this:
import csv
# assumes an existing cx_Oracle connection object

with connection.cursor() as cursor:
    sql = "select * from all_objects where rownum <= 100000"
    cursor.arraysize = 10000
    with open("testwrite.csv", "w", encoding="utf-8") as outputfile:
        writer = csv.writer(outputfile, lineterminator="\n")
        results = cursor.execute(sql)
        writer.writerows(results)
You should benchmark and choose the best solution.

Python API: synchronous multithreaded DB queries

I have a Python Flask API that applies some SQL-based filtering on an object.
Steps of the API workflow:
receive a POST request (with arguments)
run multiple SQL read queries (against a postgres DB) depending on some of the posted arguments
apply some simple "pure python" rules on the SQL results to get a boolean result
store the boolean result and the associated posted arguments in the postgres DB
return the boolean result
Constraints of the API:
The API needs to return the boolean answer in under 150 ms
I can store the boolean result asynchronously in the DB to avoid waiting for the write query to complete before returning the boolean result
However, as explained, the boolean answer depends on the SQL read queries, so I cannot run those queries asynchronously
Test made:
While making some tests, I saw that I can make read queries in parallel. The test I did was:
Running the query below 2 times not using multithreading => the code ran in roughly 10 seconds
from sqlalchemy import create_engine
import os
import time

engine = create_engine(
    os.getenv("POSTGRES_URL")
)

def run_query():
    with engine.connect() as conn:
        rs = conn.execute(f"""
            SELECT
                *
                , pg_sleep(5)
            FROM users
        """)
        for row in rs:
            print(row)

if __name__ == "__main__":
    start = time.time()
    for i in range(5):
        run_query()
    end = time.time() - start
Running the query using multithreading => the code ran in roughly 5 seconds
from sqlalchemy import create_engine
import os
import threading
import time

engine = create_engine(
    os.getenv("POSTGRES_URL")
)

def run_query():
    with engine.connect() as conn:
        rs = conn.execute(f"""
            SELECT
                *
                , pg_sleep(5)
            FROM users
        """)
        for row in rs:
            print(row)

if __name__ == "__main__":
    start = time.time()
    threads = []
    for i in range(5):
        t = threading.Thread(target=run_query)
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    end = time.time() - start
Question:
What is the bottleneck of this code? I'm sure there must be a maximum number of read queries that I can run in parallel within one API call, but I'm wondering what determines that limit.
Thank you very much for your help!
This scales well beyond the point that is sensible. With some tweaks to the built-in connection pool's pool_size, you could easily have 100 pg_sleep calls going simultaneously. But as soon as you change that to do real work rather than just sleeping, it would fall apart. You only have so many CPUs and so many disk drives, and that number is probably way less than 100.
You should start by looking at those read queries to see why they are slow and whether they can be made faster with indexes or something.
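For the per-request parallelism itself, a minimal sketch of the pattern the test points at - a bounded thread pool per request plus a connection pool sized to match (the pool sizes, worker count and query list below are illustrative, not recommendations):
# Sketch only: pool sizes and the queries are placeholders.
import os
from concurrent.futures import ThreadPoolExecutor
from sqlalchemy import create_engine, text

engine = create_engine(
    os.getenv("POSTGRES_URL"),
    pool_size=10,       # connections kept open in the pool
    max_overflow=10,    # extra connections allowed under bursts
)

def run_query(sql):
    with engine.connect() as conn:
        return conn.execute(text(sql)).fetchall()

# The handful of read queries one POST request needs, run concurrently.
queries = ["SELECT 1", "SELECT 2", "SELECT 3"]  # placeholder read queries
with ThreadPoolExecutor(max_workers=len(queries)) as pool:
    results = list(pool.map(run_query, queries))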

how to avoid duplication in BigQuery by streaming insert

I made a function that inserts .CSV data into BigQuery every 5~6 seconds. I've been looking for a way to avoid duplicating the data in BigQuery after inserting. I want to remove data that has the same luid, but I have no idea how to remove it, so is it possible to check whether each row of the .CSV already exists in the BigQuery table before inserting?
I passed the row_ids parameter to avoid duplicate luids, but it doesn't seem to work well.
Could you give me any ideas? Thanks.
import csv
import time

import schedule
from google.cloud import bigquery

def stream_upload():
    # BigQuery
    client = bigquery.Client()
    project_id = 'test'
    dataset_name = 'test'
    table_name = "test"
    full_table_name = dataset_name + '.' + table_name

    json_rows = []
    with open('./test.csv', 'r') as f:
        for line in csv.DictReader(f):
            del line[None]
            line_json = dict(line)
            json_rows.append(line_json)

    errors = client.insert_rows_json(
        full_table_name, json_rows, row_ids=[row['luid'] for row in json_rows]
    )
    if errors == []:
        print("New rows have been added.")
    else:
        print("Encountered errors while inserting rows: {}".format(errors))
    print("end")

schedule.every(0.5).seconds.do(stream_upload)
while True:
    schedule.run_pending()
    time.sleep(0.1)
BigQuery doesn't have a native way to deal with this. You could either create a view on top of this table that performs the deduplication, or keep an external cache of luids and look up whether they have already been written to BigQuery before writing, updating the cache after writing new data. This could be as simple as a file cache, or you could use an additional database.
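A minimal sketch of the "external cache of luids" idea, reusing the insert_rows_json call from the question; the cache file name and function names are made up for illustration:
# Sketch only: CACHE_FILE and the helper names are illustrative.
import csv
import os

CACHE_FILE = "seen_luids.txt"

def load_seen_luids():
    if not os.path.exists(CACHE_FILE):
        return set()
    with open(CACHE_FILE) as f:
        return {line.strip() for line in f}

def stream_upload_dedup(client, full_table_name):
    seen = load_seen_luids()
    rows = []
    with open('./test.csv', 'r') as f:
        for r in csv.DictReader(f):
            r.pop(None, None)            # same cleanup as the original code
            if r['luid'] not in seen:
                rows.append(dict(r))
    if not rows:
        return
    errors = client.insert_rows_json(
        full_table_name, rows, row_ids=[r['luid'] for r in rows]
    )
    if not errors:
        # Only remember luids that BigQuery actually accepted.
        with open(CACHE_FILE, 'a') as f:
            f.writelines(r['luid'] + '\n' for r in rows)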

Reading large volume data from Teradata using Dask cluster/Teradatasql and sqlalchemy

I need to read a large volume of data (approx. 800M records) from Teradata. My code works fine for a million records, but for larger sets it takes a long time to build the metadata. Could someone please suggest how to make it faster? Below is the code snippet I am using for my application.
import pandas as pd
from dask import delayed
from dask.dataframe import from_delayed

def get_partitions(num_partitions):
    list_range = []
    initial_start = 0
    for i in range(num_partitions):
        amp_range = 3240 // num_partitions
        start = (i * amp_range + 1) * initial_start
        end = (i + 1) * amp_range
        list_range.append((start, end))
        initial_start = 1
    return list_range

@delayed
def load(query, start, end, connString):
    df = pd.read_sql(query.format(start, end), connString)
    engine.dispose()
    return df

connString = "teradatasql://{user}:{password}@{hostname}/?logmech={logmech}&encryptdata=true"

results = from_delayed([load(query, start, end, connString) for start, end in get_partitions(num_partitions)])
The build time is probably spent in finding out the metadata of your table. This is done by fetching the whole of the first partition and analysing it.
You would be better off either specifying it explicitly, if you know the dtypes upfront, e.g. {col: dtype, ...} for all the columns (a sketch of that option follows the snippet below), or generating it from a separate query that you limit to just as many rows as it takes to be sure you have the right types:
meta = load(query, 0, 10, connString).compute()
results = from_delayed(
    [
        load(query, start, end, connString)
        for start, end in get_partitions(num_partitions)
    ],
    meta=meta.iloc[:0, :]  # zero-length version of the table
)
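And a sketch of the explicit-meta variant, assuming the column names and dtypes are known upfront (the names and dtypes below are placeholders, not the real schema):
# Sketch only: column names and dtypes are placeholders for the real table schema.
import pandas as pd
from dask.dataframe import from_delayed

meta = pd.DataFrame({
    "id": pd.Series(dtype="int64"),
    "name": pd.Series(dtype="object"),
    "amount": pd.Series(dtype="float64"),
})

results = from_delayed(
    [load(query, start, end, connString) for start, end in get_partitions(num_partitions)],
    meta=meta,   # no extra metadata query needed
)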

Filtering Spark DataFrame on new column

Context: I have a dataset that is too large to fit in memory, on which I am training a Keras RNN. I am using PySpark on an AWS EMR cluster to train the model in batches that are small enough to be stored in memory. I was not able to implement the model as distributed using elephas, and I suspect this is related to my model being stateful. I'm not entirely sure though.
The dataframe has a row for every user and every day elapsed since the day of install, from 0 to 29. After querying the database I do a number of operations on the dataframe:
query = """WITH max_days_elapsed AS (
SELECT user_id,
max(days_elapsed) as max_de
FROM table
GROUP BY user_id
)
SELECT table.*
FROM table
LEFT OUTER JOIN max_days_elapsed USING (user_id)
WHERE max_de = 1
AND days_elapsed < 1"""
df = read_from_db(query) #this is just a custom function to query our database
#Create features vector column
assembler = VectorAssembler(inputCols=features_list, outputCol="features")
df_vectorized = assembler.transform(df)
#Split users into train and test and assign batch number
udf_randint = udf(lambda x: np.random.randint(0, x), IntegerType())
training_users, testing_users = df_vectorized.select("user_id").distinct().randomSplit([0.8,0.2],123)
training_users = training_users.withColumn("batch_number", udf_randint(lit(N_BATCHES)))
#Create and sort train and test dataframes
train = df_vectorized.join(training_users, ["user_id"], "inner").select(["user_id", "days_elapsed","batch_number","features", "kpi1", "kpi2", "kpi3"])
train = train.sort(["user_id", "days_elapsed"])
test = df_vectorized.join(testing_users, ["user_id"], "inner").select(["user_id","days_elapsed","features", "kpi1", "kpi2", "kpi3"])
test = test.sort(["user_id", "days_elapsed"])
The problem I am having is that I cannot seem to filter on batch_number without caching train first. I can filter on any of the columns that are in the original dataset in our database, but not on any column I have generated in PySpark after querying the database:
This: train.filter(train["days_elapsed"] == 0).select("days_elapsed").distinct().show() returns only 0.
But, all of these return all of the batch numbers between 0 and 9 without any filtering:
train.filter(train["batch_number"] == 0).select("batch_number").distinct().show()
train.filter(train.batch_number == 0).select("batch_number").distinct().show()
train.filter("batch_number = 0").select("batch_number").distinct().show()
train.filter(col("batch_number") == 0).select("batch_number").distinct().show()
This also does not work:
train.createOrReplaceTempView("train_table")
batch_df = spark.sql("SELECT * FROM train_table WHERE batch_number = 1")
batch_df.select("batch_number").distinct().show()
All of these work if I do train.cache() first. Is that absolutely necessary or is there a way to do this without caching?
Spark >= 2.3 (? - depending on the progress of SPARK-22629)
It should be possible to disable certain optimizations using the asNondeterministic method.
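A minimal sketch of that, reusing the names from the question's code (Spark >= 2.3 only):
# Sketch only: marks the UDF as non-deterministic so the optimizer will not
# collapse or re-evaluate it.
import numpy as np
from pyspark.sql.functions import udf, lit
from pyspark.sql.types import IntegerType

udf_randint = udf(lambda x: int(np.random.randint(0, x)), IntegerType()).asNondeterministic()
training_users = training_users.withColumn("batch_number", udf_randint(lit(N_BATCHES)))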
Spark < 2.3
Don't use a UDF to generate random numbers. First of all, to quote the docs:
The user-defined functions must be deterministic. Due to optimization, duplicate invocations may be eliminated or the function may even be invoked more times than it is present in the query.
Even if it weren't for the UDF, there are Spark subtleties that make it almost impossible to implement this correctly when processing single records.
Spark already provides rand:
Generates a random column with independent and identically distributed (i.i.d.) samples from U[0.0, 1.0].
and randn
Generates a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution.
which can be used to build more complex generator functions.
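For example, the batch assignment from the question could be rewritten without a UDF along these lines (a sketch, keeping the question's column and variable names):
# Sketch only: assigns a pseudo-random batch number in [0, N_BATCHES) per row.
from pyspark.sql.functions import rand, floor

training_users = training_users.withColumn(
    "batch_number",
    floor(rand(seed=123) * N_BATCHES).cast("int")
)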
Note:
There can be some other issues with your code but this makes it unacceptable from the beginning (Random numbers generation in PySpark, pyspark. Transformer that generates a random number generates always the same number).
