Python Data saving performance - python-3.x

I've got a data bottleneck and would appreciate some senior advice.
I have an API from which I receive financial data that looks like this: GBPUSD 2020-01-01 00:00:01.001 1.30256 1.30250. My goal is to write this data into the database as fast as possible.
Inputs:
Python 3.8
PostgreSQL 12
Redis Queue (Linux)
SQLAlchemy
The incoming data structure, as shown above, comes in one dictionary: {symbol: {datetime: (price1, price2)}}. All of the data arrives as strings.
The API streams 29 symbols, so I can receive, for example, 30 to 60+ values across different symbols in a single second.
How it works now:
I receive a new value in the dictionary;
All new values for each symbol are stored, as they arrive, in one dictionary variable, data_dict;
Next I look up that dictionary by symbol key and latest value, and send the data to the Redis Queue: data_dict[symbol][last_value].enqueue(save_record, args=(datetime, price1, price2)). Up to this point everything works fine and fast.
When it reaches the Redis worker, there is the save_record function:
"
def save_record(Datetime, price1, price2, Instr, adf):
# Parameters
#----------
# Datetime : 'string' : Datetime value
# price1 : 'string' : Bid Value
# price2 : 'string' : Ask Value
# Instr : 'string' : symbol to save
# adf : 'string' : Cred to DataBase engine
#-------
# result : : Execute save command to database
engine = create_engine(adf)
meta = MetaData(bind=engine,reflect=True)
table_obj = Table(Instr,meta)
insert_state = table_obj.insert().values(Datetime=Datetime,price1=price1,price2=price2)
with engine.connect() as conn:
conn.execute(insert_state)
When the last line of the function executes, it takes 0.5 to 1 second to write that row into the database:
12:49:23 default: DT.save_record('2020-00-00 00:00:01.414538', 1.33085, 1.33107, 'USDCAD', 'postgresql cred') (job_id_1)
12:49:24 default: Job OK (job_id_1)
12:49:24 default: DT.save_record('2020-00-00 00:00:01.422541', 1.56182, 1.56213, 'EURCAD', 'postgresql cred') (job_id_2)
12:49:25 default: Job OK (job_id_2)
The queued jobs that insert each row directly into the database are the bottleneck: I can insert only 1-2 values per second, while I can receive over 60 values per second. If I run this saving, it starts to build a huge queue (the most I got was 17,000 records in the queue after 1 hour of listening to the API), and that backlog never stops growing.
I'm currently using only 1 queue and 17 workers, which pushes my PC's CPU to 100%.
So the question is how to optimize this process and avoid building a huge queue. Maybe save some sequence, for example to JSON, and then insert it into the DB, or store incoming data in separate variables?
Sorry if something is unclear; ask and I'll answer.
--UPD--
So here's a short review of some experiments:
Move engine and meta out of the function:
Due to my architecture, the API application runs on Windows 10 and the Redis Queue on Linux. Moving meta and engine out of the function raised a TypeError (it does not depend on the OS); there is a little info about it here.
Insert multiple rows in a batch:
This approach seemed to be the simplest and easiest, and it is! Basically, I just created a dictionary, data_dict = {'data_pack': []}, to start collecting incoming values. Once more than 20 values per symbol have accumulated, I send that batch to the Redis Queue, and it takes about 1.5 seconds to write it to the database. Then I delete the sent records from data_dict and the process continues. So thanks to Mike Organek for the good advice.
This approach is quite enough for my purposes, and at the same time I can say that this tech stack gives you really good flexibility!
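For readers who want to see roughly what that batching looks like, here is a minimal sketch under this setup's assumptions (per-symbol tables, an RQ Queue object, and an engine/meta created once at module level, which, as noted above, may not work in every RQ deployment); the helper names and connection string are placeholders:

from sqlalchemy import MetaData, Table, create_engine

engine = create_engine("postgresql+psycopg2://user:pass@localhost/marketdata")  # placeholder creds
meta = MetaData(bind=engine, reflect=True)

BATCH_SIZE = 20
buffers = {}  # symbol -> list of pending row dicts

def on_tick(symbol, dt, price1, price2, queue):
    """Buffer one tick; enqueue a multi-row insert once BATCH_SIZE rows pile up."""
    rows = buffers.setdefault(symbol, [])
    rows.append({"Datetime": dt, "price1": price1, "price2": price2})
    if len(rows) >= BATCH_SIZE:
        queue.enqueue(save_batch, args=(symbol, rows[:]))  # the RQ job gets a copy
        rows.clear()

def save_batch(symbol, rows):
    """Insert a whole batch in a single executemany round trip."""
    table_obj = Table(symbol, meta)
    with engine.begin() as conn:
        conn.execute(table_obj.insert(), rows)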

Every time you call save_record you re-create the engine and (reflected) meta objects, both of which are expensive operations. Running your sample code as-is gave me a throughput of
20 rows inserted in 4.9 seconds
Simply moving the engine = and meta = statements outside of the save_record function (and thereby only calling them once) improved throughput to
20 rows inserted in 0.3 seconds
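A minimal sketch of what that change looks like, with a placeholder connection string and the same SQLAlchemy reflection style as the question:

# Sketch only: engine and reflected metadata are built once at import time,
# so each enqueued job only constructs and executes the INSERT.
from sqlalchemy import MetaData, Table, create_engine

engine = create_engine("postgresql+psycopg2://user:pass@localhost/marketdata")  # placeholder
meta = MetaData(bind=engine, reflect=True)  # reflected once, reused by every job

def save_record(Datetime, price1, price2, Instr):
    table_obj = Table(Instr, meta)  # table already reflected into meta
    insert_state = table_obj.insert().values(Datetime=Datetime, price1=price1, price2=price2)
    with engine.connect() as conn:
        conn.execute(insert_state)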
Additional note: It appears that you are storing the values for each symbol in a separate table, i.e. 'GBPUSD' data in a table named GBPUSD, 'EURCAD' data in a table named EURCAD, etc.. That is a "red flag" suggesting bad database design. You should be storing all of the data in a single table with a column for the symbol.

Related

How to use synchronous messages on rabbit queue?

I have a Node.js function that needs to be executed for each order in my application. In this function my app gets an order number from an Oracle database, processes the order, and then adds 1 to that number in the database (this needs to be the last thing in the function, because the order can fail, in which case the number will not be used).
If all received orders at time T are processed at the same time (asynchronously), then the same order number will be used for multiple orders, and I don't want that.
So I used RabbitMQ to try to remedy this situation, since it is a queue. The processes seem to finish in the order they should, but a second process does NOT wait for the first one to finish (ack) before it begins, so in the end I have the same problem of using the same order number multiple times.
Is there any way I can configure my queue to process one message at a time? To only start process n+1 when process n has been acknowledged?
This would be a life saver to me!
If the problem is to avoid duplicate order numbers, then use an Oracle sequence, or use an identity column when you insert into a table to generate the order number:
CREATE TABLE mytab (
    id   NUMBER GENERATED BY DEFAULT ON NULL AS IDENTITY (START WITH 1),
    data VARCHAR2(20));

INSERT INTO mytab (data) VALUES ('abc');
INSERT INTO mytab (data) VALUES ('def');

SELECT * FROM mytab;
This will give:
        ID DATA
---------- --------------------
         1 abc
         2 def
If the problem is that you want orders to be processed sequentially, then don't pull an order from the queue until the previous one is finished. This will limit your throughput, so you need to understand your requirements and make some architectural decisions.
Overall, it sounds like Oracle Advanced Queuing would be a good fit. See the node-oracledb documentation on AQ.
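If you do want strictly one-at-a-time consumption from the queue itself, the usual knob is the consumer prefetch count. The asker's worker is Node, but here is a rough, illustrative sketch in Python with pika (queue name and order-processing logic are placeholders): with prefetch_count=1 and manual acks, the broker only delivers message n+1 after message n has been acknowledged.

import pika

def process_order(body):
    # placeholder for the real order-processing logic
    print("processing", body)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="orders", durable=True)
channel.basic_qos(prefetch_count=1)  # at most one unacknowledged message per consumer

def handle_order(ch, method, properties, body):
    process_order(body)
    ch.basic_ack(delivery_tag=method.delivery_tag)  # only now is the next message delivered

channel.basic_consume(queue="orders", on_message_callback=handle_order)
channel.start_consuming()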

Unable to delete large number of rows from Spanner

I have a 3-node Spanner instance, and a single table that contains around 4 billion rows. The DDL looks like this:
CREATE TABLE predictions (
    name STRING(MAX),
    ...,
    model_version INT64,
) PRIMARY KEY (name, model_version)
I'd like to set up a job to periodically remove some old rows from this table using the Python Spanner client. The query I'd like to run is:
DELETE FROM predictions WHERE model_version <> ?
According to the docs, it sounds like I would need to execute this as a Partitioned DML statement. I am using the Python Spanner client as follows, but am experiencing timeouts (504 Deadline Exceeded errors) due to the large number of rows in my table.
# this always throws a "504 Deadline Exceeded" error
database.execute_partitioned_dml(
    "DELETE FROM predictions WHERE model_version <> @version",
    params={"model_version": 104},
    param_types={"model_version": Type(code=INT64)},
)
My first intuition was to see if there was some sort of timeout I could increase, but I don't see any timeout parameters in the source :/
I did notice there was a run_in_transaction method in the Spanner lib that contains a timeout parameter, so I decided to deviate from the partitioned DML approach to see if using this method worked. Here's what I ran:
def delete_old_rows(transaction, model_version):
    delete_dml = "DELETE FROM predictions WHERE model_version <> {}".format(model_version),
    dml_statements = [
        delete_dml,
    ]
    status, row_counts = transaction.batch_update(dml_statements)

database.run_in_transaction(delete_old_rows,
                            model_version=104,
                            timeout_secs=3600,
)
What's weird about this is the timeout_secs parameter appears to be ignored, because I still get a 504 Deadline Exceeded error within a minute or 2 of executing the above code, despite a timeout of one hour.
Anyways, I'm not too sure what to try next, or whether or not I'm missing something obvious that would allow me to run a delete query in a timely fashion on this huge Spanner table. The model_version column has pretty low cardinality (generally 2-3 unique model_version values in the entire table), so I'm not sure if that would factor into any recommendations. But if someone could offer some advice or suggestions, that would be awesome :) Thanks in advance
The reason that setting timeout_secs didn't help is that the argument is unfortunately not the timeout for the transaction; it's the retry timeout, used to set the deadline after which the transaction will no longer be retried.
We will update the docs for run_in_transaction to explain this better.
The root cause was that the total timeout for the Streaming RPC calls was set too low in the client libraries, being set to 120s for Streaming APIs (e.g. ExecuteStreamingSQL, used by partitioned DML calls).
This has been fixed in the client library source code, changing them to a 60-minute timeout (which is the maximum), and will be part of the next client library release.
As a workaround, in Java, you can configure the timeouts as part of the SpannerOptions when you connect your database. (I do not know how to set custom timeouts in Python, sorry)
final RetrySettings retrySettings =
    RetrySettings.newBuilder()
        .setInitialRpcTimeout(Duration.ofMinutes(60L))
        .setMaxRpcTimeout(Duration.ofMinutes(60L))
        .setMaxAttempts(1)
        .setTotalTimeout(Duration.ofMinutes(60L))
        .build();

SpannerOptions.Builder builder =
    SpannerOptions.newBuilder()
        .setProjectId("[PROJECT]");
builder
    .getSpannerStubSettingsBuilder()
    .applyToAllUnaryMethods(
        new ApiFunction<UnaryCallSettings.Builder<?, ?>, Void>() {
          @Override
          public Void apply(Builder<?, ?> input) {
            input.setRetrySettings(retrySettings);
            return null;
          }
        });
builder
    .getSpannerStubSettingsBuilder()
    .executeStreamingSqlSettings()
    .setRetrySettings(retrySettings);
builder
    .getSpannerStubSettingsBuilder()
    .streamingReadSettings()
    .setRetrySettings(retrySettings);

Spanner spanner = builder.build().getService();
The first suggestion is to try gcloud instead.
https://cloud.google.com/spanner/docs/modify-gcloud#modifying_data_using_dml
Another suggestion is to also pass a range of name, to limit the number of rows scanned. For example, you could add something like STARTS_WITH(name, 'a') to the WHERE clause so that each transaction touches only a small number of rows; to do this, though, you will need to know the domain of the name column values.
The last suggestion is to avoid using '<>' if possible, as it is generally pretty expensive to evaluate.
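As a rough sketch of the second suggestion, using the same Python client as the question: the instance/database names and the single-letter prefixes below are assumptions; you would derive real prefixes from the domain of your name values.

from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-database")  # hypothetical names

# break the big DELETE into one partitioned DML statement per name prefix
prefixes = [chr(c) for c in range(ord("a"), ord("z") + 1)]

for prefix in prefixes:
    row_count = database.execute_partitioned_dml(
        "DELETE FROM predictions "
        "WHERE STARTS_WITH(name, @prefix) AND model_version <> @version",
        params={"prefix": prefix, "version": 104},
        param_types={
            "prefix": spanner.param_types.STRING,
            "version": spanner.param_types.INT64,
        },
    )
    print(prefix, row_count)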

Every 'nth' document from a collection - MongoDB + NodeJS

I am looking for a method to return data at different resolutions that is stored in MongoDB. The most elegant solution I can envision is a query that returns every 'nth' (second, third, tenth, etc.) document from the collection.
I am storing data (say temperature) at a 5 second interval but want to look at different trends in the data.
To find the instantaneous trend, I look at the last 720 entries (1 hour). This part is easy.
If I want to look at a slightly longer trend, say 3 hours, I could retrieve the last 2160 entries; however, that is more time to pull from the server, and more time and memory to plot. When looking at larger trends, the small movements are noise, and I would be better off retrieving the same number of documents (720) but only every 3rd one, still giving me 3 hours of results with the same resources used, for a minor sacrifice in detail.
This only gets more extreme when I want to look at weeks (120,960 documents) or months (500,000+ documents).
My current code collects every single document (n = 1):
db.collection(collection).find().sort({$natural:-1}).limit(limit)
I could then loop through the returned array and remove every document when:
index % n != 0
This at least saves the client from dealing with all the data; however, this seems extremely inefficient, and I would rather the database handle this part.
Does anyone know a method to accomplish this?
Apparently, there is no built-in solution in Mongo to solve your problem.
The way forward would be to archive your data smartly, in fragments.
So you can store your data in a collection which will house no more than weekly or monthly data. A new month/week means storing your data in a different collection. That way you won't be doing a full table scan and won't be collecting every single document, as you mentioned in your problem. Your application code will decide which collection to query.
If I were in your shoes, I would use a different tool, as Mongo is more of a general-purpose database. Time-series data (storing something every 5 seconds) can be handled pretty well by a database like Cassandra, which handles frequent writes with ease, just as in your case.
Alternate fragmentation (update):
Always write your current data in collection "week0" and in the background run a weekly scheduler that moves the data from "week0" to history collections "week1","week2" and so on. Fragmentation logic depends on your requirements.
I think the $bucketAuto stage might help you with it.
You can do something like,
db.collection.aggregate([
  {
    $bucketAuto: {
      groupBy: "$_id", // here you'll put the variable you need, in your example 'temperature'
      buckets: 5 // this is the number of documents you want to return, so if you want a sample of 500 documents, you can put 500 here
    }
  }
])
Each document in the result for the above query would be something like this:
{
  "_id": {
    "max": 3,
    "min": 1
  },
  "count": 2
}
If you had grouped by temperature, then each document will have the minimum and maximum temperature found in that sample
You might have another problem. The docs state not to rely on natural ordering: "This ordering is an internal implementation feature, and you should not rely on any particular structure within it."
You can instead save the epoch seconds in each document and do your mod arithmetic on it as part of a query, with limit and sort.
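A small sketch of that idea with pymongo, assuming each document carries an integer epoch field in seconds, readings arrive every 5 seconds, and timestamps land on exact 5-second boundaries; the field and collection names are placeholders:

from pymongo import DESCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client["metrics"]["temperature"]

n = 3        # keep every 3rd reading -> 15-second resolution
limit = 720  # same number of points as the 1-hour view

# $mod keeps only documents whose epoch is a multiple of n * 5 seconds,
# so the database does the thinning instead of the client.
cursor = (
    coll.find({"epoch": {"$mod": [n * 5, 0]}})
        .sort("epoch", DESCENDING)
        .limit(limit)
)
docs = list(cursor)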

Cassandra approach of RDBMS nested insertions

I receive regularly two types of sets of data:
Network flows, thousands per second:
{
    'stamp'     : '2017-01-19 01:37:22',
    'host'      : '192.168.2.6',
    'ip_src'    : '10.29.78.3',
    'ip_dst'    : '8.8.4.4',
    'iface_in'  : 19,
    'iface_out' : 20,
    (... etc ...)
}
And interface tables, every hour:
[
    {
        'stamp'      : '2017-01-19 03:00:00',
        'host'       : '192.168.2.6',
        'iface_id'   : 19,
        'iface_name' : 'Fa0/0'
    },{
        'stamp'      : '2017-01-19 03:00:00',
        'host'       : '192.168.2.6',
        'iface_id'   : 20,
        'iface_name' : 'Fa0/1'
    },{
        'stamp'      : '2017-01-19 03:00:00',
        'host'       : '192.168.157.38',
        'iface_id'   : 20,
        'iface_name' : 'Gi0/3'
    }
]
I want to insert those flows into Cassandra, with interface names instead of IDs, based on the latest matching host/iface_id value. I cannot rely on a memory-only solution, otherwise I may lose up to one hour of flows every time I restart the application.
What I had in mind is to use two Cassandra tables: one that holds the flows, and one that holds the latest host/iface_id data. Then, when receiving a flow, I would use this data to fill in the interface name properly.
Ideally, I would like to let Cassandra take care of this. In my mind, it seems more efficient than pulling out interface names from the application side every time.
The thing is that I cannot figure out how to do that - and having never worked with NoSQL before, I am not even sure that this is the right approach... Could someone point me in the right direction?
Inserting data in the interface table and keeping only the latest version is quite trivial, but I cannot wrap my mind around the 'inserting interface name in flow record' part. In a traditional RDBMS I would use a nested query, but those don't seem to exist in Cassandra.
Reading your question, I assume the interface data received each hour is not too big, so we can keep it (a single row) in memory as well as in the Cassandra database. Every hour the in-memory data gets updated and a new row is inserted into the database. We can save the interface data with the table definition below:
create table interface_by_hour(
    year int,
    month int,
    day int,
    hour int,
    data text,   -- entire json string for one hour of interface data
    primary key((year, month, day, hour)));
A few insert statements:
insert into interface_by_hour (year,month,day,hour,data) values (2017,1,27,23,'{complete json.........}');
insert into interface_by_hour (year,month,day,hour,data) values (2017,1,27,00,'{complete json.........}');
insert into interface_by_hour (year,month,day,hour,data) values (2017,1,28,1,'{complete json.........}');
Keep every hour's interface data in this table and update it in memory as well. The benefit of having the data in memory is that you don't have to read it from the table thousands of times every second. If the application goes down, you can read the current/previous hour's data from the table using the query below and rebuild the in-memory cache.
cqlsh:mykeyspace> select * from interface_by_hour where year=2017 and month=1 and day=27 and hour=0;

 year | month | day | hour | data
------+-------+-----+------+--------------------------
 2017 |     1 |  27 |    0 | {complete json.........}
Now comes the flow data:
Since we have the current hour's interface data cached in memory, we can quickly map a host/interface ID to its interface name. Use the table below to save the flow data.
create table flow(
    iface_name text,
    createdon bigint,   -- timestamp in milliseconds
    host text,          -- this is optional; if you don't want it, don't use it as a column
    flowdata text,      -- entire json string
    primary key(iface_name, createdon, host));
The only issue I see with the above table is that it will not distribute data evenly across partitions; if you have too much flow data for one interface name, it will all be inserted into one partition.
I designed this table just to save the data; if you had specified how you are going to use the data, I would have given it some more thought.
Hope this helps.
Hi, as far as I can tell the interface data is not so write-heavy that it would need partitioning by time. It changes only once per hour, so it's not necessary to save data for every hour, just the latest version. I will also assume that you want to query this in some way; I'm not sure how, so I'll just propose something general for the interface table and will treat the flows as time-series data:
create table interface(
    iface_name text primary key,
    iface_id int,
    host text,
    stamp timestamp
);
insert into interface(iface_name, iface_id, host, stamp) values ('Fa0/0', 19, '192.168.2.6', '2017-01-19 03:00:00');
insert into interface(iface_name, iface_id, host, stamp) values ('Fa0/1', 20, '192.168.2.6', '2017-01-19 03:00:00');
insert into interface(iface_name, iface_id, host, stamp) values ('Gi0/3', 20, '192.168.157.38', '2017-01-19 03:00:00');
Usually this is an antipattern with Cassandra:
cqlsh:test> select * from interface;

 iface_name | host           | iface_id | stamp
------------+----------------+----------+---------------------------------
      Fa0/0 | 192.168.2.6    |       19 | 2017-01-19 02:00:00.000000+0000
      Gi0/3 | 192.168.157.38 |       20 | 2017-01-19 02:00:00.000000+0000
      Fa0/1 | 192.168.2.6    |       20 | 2017-01-19 02:00:00.000000+0000
But as far as I can see you don't have that many interfaces, so basically anything up to thousands will be okay here; in the worst case you might want to use the token function to get the data out of the partitions. The point is that this will save you a lot of space, and you don't need to save this by the hour.
I would simply keep this table in memory as well and then enrich the data as it comes in. If there are updates, update the in-memory cache, but also write them to Cassandra. If something fails, simply restore from the interface table and continue.
Basically your flow info would then become:
{
    'stamp'      : '2017-01-19 01:37:22',
    'host'       : '192.168.2.6',
    'ip_src'     : '10.29.78.3',
    'ip_dst'     : '8.8.4.4',
    'iface_in'   : 19,
    'iface_out'  : 20,
    'iface_name' : 'key put from in memory cache',
}
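For illustration, here is a minimal Python sketch of that enrichment step with the DataStax driver, using the interface and flow tables from this answer; the contact points, keyspace, bucket choice, and the decision to key the name on the outgoing interface are all assumptions to adjust for your case.

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("mykeyspace")

# (host, iface_id) -> iface_name, rebuilt from the interface table on startup
iface_cache = {}
for row in session.execute("SELECT iface_name, iface_id, host FROM interface"):
    iface_cache[(row.host, row.iface_id)] = row.iface_name

insert_flow = session.prepare(
    "INSERT INTO flow (time_with_minute, artificial_bucket, stamp, host, "
    "ip_src, ip_dst, iface_in, iface_out, iface_name) "
    "VALUES (?, ?, now(), ?, ?, ?, ?, ?, ?)"
)

def save_flow(flow, bucket):
    """Enrich one incoming flow dict with the cached interface name and insert it."""
    name = iface_cache.get((flow["host"], flow["iface_out"]), "unknown")
    session.execute(
        insert_flow,
        (
            flow["stamp"][:16],  # '2017-01-19 01:37:22' -> '2017-01-19 01:37'
            bucket,              # random bucket in a fixed range, e.g. 0-9
            flow["host"],
            flow["ip_src"],
            flow["ip_dst"],
            flow["iface_in"],
            flow["iface_out"],
            name,
        ),
    )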
This is how you will get the biggest performance. Saving flows is then just time-series data. Take into account that you are hitting the cluster with thousands of writes per second, and that when you partition by time you get at least 7,000 if not more columns every second (with the model I'm proposing here). Usually you will want to have up to 100,000 columns within a single partition, which means your partition goes over the ideal size within 20 seconds or even less, so I would suggest using random buckets (when inserting, just use some number in a defined range, let's say 10):
create table flow(
    time_with_minute text,
    artificial_bucket int,
    stamp timeuuid,
    host text,
    ip_src text,
    ip_dst text,
    iface_in int,
    iface_out int,
    iface_name text,
    primary key((time_with_minute, artificial_bucket), stamp)
);
When you want to fetch flows over time, you would simply use the parts of a timestamp plus make 10 queries, at the same time or one by one, to access all the data. There are various techniques here; you simply need to tell us more about your use case.
Inserting is then something like:
insert into flow(time_with_minute, artificial_bucket, stamp, host, ip_src, ip_dst, iface_in, iface_out, iface_name)
values ('2017-01-19 01:37', 1, now(), '192.168.2.6', '10.29.78.3', '8.8.4.4', 19, 20, 'Fa0/0');
I used now() just for the example; use https://github.com/apache/cassandra/blob/cassandra-2.1/src/java/org/apache/cassandra/utils/UUIDGen.java to generate a timeuuid with the time when you inserted the flow. Also, I inserted 1 into artificial_bucket; here you would insert a random number in a range of, say, at least 0-10. Some people, depending on the load, insert into multiple random buckets, even 60 or more. It all depends on how heavy the writes are. If you just partition by minute, every minute a group of nodes within the cluster will be hot and this will switch around. Having hot nodes is usually not a good idea.
With Cassandra you write the information that you need right away; you don't do any joins during the write or anything similar. Keep in memory the data you need to stamp the incoming records with, and just insert the denormalised data. Alternatively, you can model the solution in a relational way and just tell us how you would like to access the data; then we can go into details.

Cassandra - IN or TOKEN query for querying an entire partition?

I want to query a complete partition of my table.
My compound partition key consists of (id, date, hour_of_timestamp). id and date are strings, hour_of_timestamp is an integer.
I needed to add the hour_of_timestamp field to my partition key because of hotspots while ingesting the data.
Now I'm wondering what's the most efficient way to query a complete partition of my data?
According to this blog, using SELECT * from mytable WHERE id = 'x' AND date = '10-10-2016' AND hour_of_timestamp IN (0,1,...23); is causing a lot of overhead on the coordinator node.
Is it better to use the TOKEN function and query the partition with two tokens? Such as SELECT * from mytable WHERE TOKEN(id,date,hour_of_timestamp) >= TOKEN('x','10-10-2016',0) AND TOKEN(id,date,hour_of_timestamp) <= TOKEN('x','10-10-2016',23);
So my question is:
Should I use the IN or TOKEN query for querying an entire partition of my data? Or should I use 23 queries (one for each value of hour_of_timestamp) and let the driver do the rest?
I am using Cassandra 3.0.8 and the latest Datastax Java Driver to connect to a 6 node cluster.
You say:
Now I'm wondering what's the most efficient way to query a complete
partition of my data? According to this blog, using SELECT * from
mytable WHERE id = 'x' AND date = '10-10-2016' AND hour_of_timestamp
IN (0,1,...23); is causing a lot of overhead on the coordinator node.
but actually you'd query 24 partitions.
What you probably meant is that you had a design where a single partition was what now consists of 24 partitions, because you add the hour to avoid a hotspot during data ingestion. Noting that in both models (the old one with hotspots and this new one) the data is still ordered by timestamp, you have three choices:
Run 1 query at a time.
Run 2 queries the first time, then one at a time, to "prefetch" results.
Run 24 queries in parallel.
CASE 1
If you process data sequentially, the first choice is to run the query for the hour 0, process the data and, when finished, run the query for the hour 1 and so on... This is a straightforward implementation, and I don't think it deserves more than this.
CASE 2
If your queries take more time than your data processing, you could "prefetch" some data. So, the first time you could run 2 queries in parallel to get the data for both hours 0 and 1, and start processing data for hour 0. In the meantime, data for hour 1 arrives, so when you finish processing data for hour 0 you can prefetch data for hour 2 and start processing data for hour 1. And so on. In this way you can speed up data processing. Of course, depending on your timings (data processing and query times), you should optimize the number of "prefetch" queries.
Also note that the Java Driver does pagination for you automatically, and depending on the size of the retrieved partition, you may want to disable that feature to avoid blocking the data processing, or may want to fetch more data preemptively with something like this:
ResultSet rs = session.execute("your query");
for (Row row : rs) {
    if (rs.getAvailableWithoutFetching() == 100 && !rs.isFullyFetched())
        rs.fetchMoreResults(); // this is asynchronous
    // Process the row ...
}
where you could tune that rs.getAvailableWithoutFetching() == 100 to better suit your prefetch requirements.
You may also want to prefetch more than one partition the first time, so that you ensure your processing won't wait on any data fetching part.
CASE 3
If you need to process data from different partitions together, e.g. you need data for both hour 3 and hour 6, then you could try to group data by "dependency" (e.g. query both hour 3 and hour 6 in parallel).
If you need all of them, then you should run 24 queries in parallel and join them at the application level (you already know why you should avoid IN for multiple partitions); a sketch of this is shown below. Remember that your data is already ordered, so your application-level effort will be very small.
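As an illustration of this last case: the question uses the Java driver, but here is a rough sketch with the Python DataStax driver (contact points and keyspace are placeholders) that fires the 24 per-hour queries asynchronously and joins the results in the application.

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("mykeyspace")

query = session.prepare(
    "SELECT * FROM mytable WHERE id = ? AND date = ? AND hour_of_timestamp = ?"
)

# fire one async query per hour, then collect; rows within each hour stay ordered
futures = [
    session.execute_async(query, ("x", "10-10-2016", hour)) for hour in range(24)
]
rows = [row for future in futures for row in future.result()]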
