I configured my Hyperledger Fabric network with 2 peers under 1 org and 2 CouchDB instances, one per peer.
When I submit a transaction, it takes a while to complete, sometimes around 1 second. That is too long for me; it should take only a few ms.
I have a simulator that inserts around 30k samples into the blockchain, but it runs very slowly because a transaction sometimes takes 1 s, so with that amount of data the whole run takes a long time.
How can I solve this? Is Fabric able to handle more transactions than this?
What I have noticed, and it seems wrong to me:
Using Fauxton to look inside CouchDB, if I upload 300 samples to the blockchain, I see 300 blocks created. Could this be a problem? I know that a block should contain multiple transactions, but my blockchain does not seem to do this. How can I fix it?
I also noticed that I did not configure any endorsement policy. Should I configure one, and would it make things faster? How do I do this?
And finally: is it possible that CouchDB is slowing down the network? How do I disable it?
Two hidden complexities can impact performance:
The complexity of your query, per record type. It’s important to build a performance histogram based on object types.
Whether your data structure is pre-ordered to suit the hashing algorithm. If not, you’ll experience a bit of drag if your object size is large.
Related
I am currently developing a fabric chaincode.
I created a function in the chaincode that was I/O-bound (reading many values from the ledger).
I ran an experiment with it on two nodes: one uses an HDD and the other an SSD.
The ledger held 10,000 objects with 4k size keys. (That may be too few; when I tried 100,000 an error occurred, so I tested with 10,000.)
When I READ (GetState) a lot of values from the ledger, I expected the node using the SSD to read faster, but there was no difference.
My understanding is that LevelDB is a key-value store and is fast enough that there is no difference. (Sequential and random reads have similar execution times.)
I am wondering how to set up an experiment with LevelDB in which the HDD/SSD performance difference shows up.
If there is a way, I would appreciate any advice.
If you are querying data via chaincode, it is unlikely that you will be I/O bound at the disk level. A chaincode query passes through several hops, so that path is more likely to be your constraint.
I am working on a project to move my repository to Hazelcast.
I need to find documents by data range, store type and store IDs.
During my tests I got 90k throughput using one c3.large instance, but when I run the same test with more instances the per-instance throughput drops significantly (10 instances: 500k; 20 instances: 700k).
These numbers were the best I could get by tuning these properties:
hazelcast.query.predicate.parallel.evaluation
hazelcast.operation.generic.thread.count
hz:query
I tried changing the instance type to c3.2xlarge to get more processing power, but the numbers don't justify the price.
How can I optimize Hazelcast to be faster in this scenario?
My use case doesn't use map.get(key), only map.values(predicate).
Settings:
Hazelcast 3.7.1
Map as Data Structure;
Complex object using IdentifiedDataSerializable;
Map index configured;
Only 2000 documents on map;
Hazelcast embedded configured by Spring Boot Application (singleton);
All instances in same region.
Test
Gatling
New Relic as service monitor.
Any help is welcome. Thanks.
If your use case only uses map.values with a predicate, I would strongly suggest using OBJECT as the in-memory storage format. That way, no serialization is involved during query execution.
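To illustrate why the storage format matters for predicate queries, here is a toy model in Python. It is not Hazelcast code: `BinaryMap` and `ObjectMap` are hypothetical classes, and JSON stands in for the real serialization. It only counts how many values have to be deserialized to evaluate a predicate over 2000 entries.

```python
import json
from typing import Callable, Dict, List


class BinaryMap:
    """Toy stand-in for a map that keeps values in serialized form."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self.deserializations = 0

    def put(self, key: str, value: dict) -> None:
        self._data[key] = json.dumps(value)  # stored serialized

    def values(self, predicate: Callable[[dict], bool]) -> List[dict]:
        matches = []
        for blob in self._data.values():
            value = json.loads(blob)  # paid for every entry, match or not
            self.deserializations += 1
            if predicate(value):
                matches.append(value)
        return matches


class ObjectMap:
    """Toy stand-in for a map that keeps values as live objects."""

    def __init__(self) -> None:
        self._data: Dict[str, dict] = {}
        self.deserializations = 0  # never incremented: nothing to decode

    def put(self, key: str, value: dict) -> None:
        self._data[key] = value

    def values(self, predicate: Callable[[dict], bool]) -> List[dict]:
        return [v for v in self._data.values() if predicate(v)]


def load(target, count: int = 2000):
    """Fill a map with `count` small documents (field names are made up)."""
    for i in range(count):
        target.put(str(i), {"store_id": i, "store_type": "A" if i % 2 else "B"})
    return target


def wanted(doc: dict) -> bool:
    return doc["store_type"] == "A" and doc["store_id"] < 100


binary_map = load(BinaryMap())
object_map = load(ObjectMap())
print(len(binary_map.values(wanted)), binary_map.deserializations)
print(len(object_map.values(wanted)), object_map.deserializations)
```

Both maps return the same matches, but the serialized one decodes every entry on each query. Real behavior also depends on indexes and on how results are sent back to the caller, so measure before and after the change.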
On the other hand, it is normal to get very high numbers when you only have 1 member, because no data moves across the network. As a potential improvement, I would look at EC2 instances with higher network capacity. For example, c3.8xlarge has a 10 Gbit network, compared to the "High" network performance that comes with c3.2xlarge.
I can't promise how much of an increase you will get, but I would definitely try these changes first.
Posting here as I could not find any forum for the LMDB key-value store.
Is there a limit for sub-databases? What is a reasonable number of sub-databases concurrently open?
I would like to have ~200 databases which seems like a lot and clearly indicates my model is wrong.
I suppose I could remodel and embed the id of each db in the key itself and keep only one db, but then I have longer keys and I cannot drop a database if needed.
I'm interested, though, in whether LMDB already uses some sort of internal prefixes for keys.
Any suggestions on how to address this problem are appreciated.
Instead of calling mdb_dbi_open each time, keep your own map from database names to the database handles returned by mdb_dbi_open. Reuse these handles for the lifetime of your program. This lets you have multiple databases within an environment and avoids the overhead of mdb_dbi_open.
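A minimal sketch of such a handle map, written in Python for brevity. `HandleCache` and `open_db` are hypothetical names: `open_db` stands in for mdb_dbi_open (or whatever your binding exposes for it), and the demo uses a fake opener so the sketch runs without LMDB installed.

```python
from typing import Callable, Dict, Generic, List, TypeVar

H = TypeVar("H")


class HandleCache(Generic[H]):
    """Opens each named database once and reuses the handle afterwards."""

    def __init__(self, open_db: Callable[[str], H]) -> None:
        self._open_db = open_db  # assumption: wraps mdb_dbi_open
        self._handles: Dict[str, H] = {}

    def get(self, name: str) -> H:
        # Only the first request for a name reaches the opener.
        if name not in self._handles:
            self._handles[name] = self._open_db(name)
        return self._handles[name]

    def __len__(self) -> int:
        return len(self._handles)


# Demo with a fake opener that records how often it is called.
opened: List[str] = []


def fake_open(name: str) -> int:
    opened.append(name)
    return len(opened)  # pretend this integer is the database handle


cache = HandleCache(fake_open)
for _ in range(1000):
    cache.get("users")
    cache.get("orders")

print(len(opened), len(cache))
```

With ~200 databases the map stays tiny, and the environment still needs mdb_env_set_maxdbs set high enough before the handles are opened.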
The documentation for mdb_env_set_maxdbs says:
Currently a moderate number of slots are cheap but a huge number gets expensive: 7-120 words per transaction, and every mdb_dbi_open() does a linear search of the opened slots.
http://www.lmdb.tech/doc/group__mdb.html#gaa2fc2f1f37cb1115e733b62cab2fcdbc
The best way to know is to benchmark mdb_dbi_open and see whether its performance is acceptable.
Microsoft changed the architecture of Azure Storage to use, e.g., SSDs for journaling and a 10 Gbps network (instead of standard hard drives and a 1 Gbps network). See http://blogs.msdn.com/b/windowsazure/archive/2012/11/02/windows-azure-s-flat-network-storage-and-2012-scalability-targets.aspx
Here you can read that the storage is designed for "Up to 20,000 entities/messages/blobs per second".
My concern is that 20,000 entities (or rows in Table Storage) is actually not a lot.
We have a rather small solution with a table of 1,000,000,000 rows. At only 20,000 entities per second it will take more than half a day to read all rows.
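As a quick check of that estimate (plain arithmetic, assuming a sustained 20,000 entities per second and nothing else limiting the read):

```python
ROWS = 1_000_000_000          # rows in the table
ENTITIES_PER_SECOND = 20_000  # scalability target quoted from the blog post

seconds = ROWS / ENTITIES_PER_SECOND
hours = seconds / 3600

print(f"{seconds:,.0f} s = {hours:.1f} h")  # 50,000 s = 13.9 h
```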
I really hope that the 20,000 entities actually means that you can do up to 20,000 requests per second.
I'm pretty sure the 1st generation allowed up to 5,000 requests per second.
So my question is: are there any scenarios where the 1st generation Azure storage is actually more scalable than the second generation?
Are there any other reasons we should not upgrade (move our data to a new storage)? E.g. we tried to get ~100 rows per partition, because that was what gave us the best performance characteristic. Are there different characteristic for the 2nd generation? Or has there been any changes that might introduce bugs if we change?
You have to read more carefully. The exact quote from the mentioned post is:
Transactions – Up to 20,000 entities/messages/blobs per second
That is 20k transactions per second, which is what you correctly hope for. I certainly do not expect to upload 20k 1 MB files per second to blob storage, but I do expect to be able to execute 20k REST calls.
As for tables and table entities, you can combine them in batches. Given the volume you have, I expect you are already using batches. A single Entity Group Transaction is counted as a single transaction but may contain more than one entity. Rather than assessing whether the figure is low or high, note that you really need a good setup and enough bandwidth to utilize these 20k transactions per second.
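For write operations, grouping entities into batches can be sketched as below. This is only the grouping step, not an Azure SDK call: `batches` is a hypothetical helper, and it assumes the usual Entity Group Transaction constraints (all entities in one batch share a PartitionKey, at most 100 entities per batch; the payload size limit is not checked here).

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List

MAX_BATCH = 100  # assumed per-batch entity limit


def batches(entities: Iterable[dict], size: int = MAX_BATCH) -> Iterator[List[dict]]:
    """Yield lists of entities that share a PartitionKey, `size` at most."""
    by_partition: Dict[str, List[dict]] = defaultdict(list)
    for entity in entities:
        by_partition[entity["PartitionKey"]].append(entity)
    for rows in by_partition.values():
        for start in range(0, len(rows), size):
            yield rows[start:start + size]


# Demo data: 250 entities in partition "a", 30 in partition "b".
sample = [{"PartitionKey": "a", "RowKey": str(i)} for i in range(250)]
sample += [{"PartitionKey": "b", "RowKey": str(i)} for i in range(30)]

sizes = [len(batch) for batch in batches(sample)]
print(sizes)  # [100, 100, 50, 30]
```

Each yielded list would then be submitted as one batch request, so 280 entities become 4 requests instead of 280.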
Also, the first generation scalability target was around that 5k requests/sec you mention. I don't see a configuration/scenario where Gen 1 would be more scalable than Gen 2 storage.
Are there different characteristic for the 2nd generation?
The differences are outlined in the blog post you refer to.
As for your last concern:
Or has there been any changes that might introduce bugs if we change?
Rest assured there are no such changes. Azure Storage service behavior is defined in the REST API Reference. The API does not differ by storage service generation; it is versioned based on features.
Can CouchDB handle thousands of separate databases on the same machine?
Imagine you have a collection of BankTransactions. There are many thousands of records. (EDIT: not actually storing transactions--just think of a very large number of very small, frequently updating records. It's basically a join table from SQL-land.)
Each day you want a summary view of transactions that occurred only at your local bank branch. If all the records are in a single database, regenerating the view will process all of the transactions from all of the branches. This is a much bigger chunk of work, and unnecessary for the user who cares only about his particular subset of documents.
This makes it seem like each bank branch should be partitioned into its own database, in order for the views to be generated in smaller chunks, and independently of each other. But I've never heard of anyone doing this, and it seems like an anti-pattern (e.g. duplicating the same design document across thousands of different databases).
Is there a different way I should be modeling this problem? (Should the partitioning happen between separate machines, not separate databases on the same machine?) If not, can CouchDB handle the thousands of databases it will take to keep the partitions small?
(Thanks!)
[Warning, I'm assuming you're running this in some sort of production environment. Just go with the short answer if this is for a school or pet project.]
The short answer is "yes".
The longer answer is that there are some things you need to watch out for...
You're going to be playing whack-a-mole with a lot of system settings like max file descriptors.
You'll also be playing whack-a-mole with Erlang VM settings.
CouchDB has a "max open databases" option. Increase this or you're going to have pending requests piling up.
It's going to be a PITA to aggregate multiple databases to generate reports. You can do it by polling each database's _changes feed, modifying the data, and then throwing it back into a central/aggregating database. The tooling to make this easier is just not there yet in CouchDB's API. Almost, but not quite.
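A rough sketch of that polling loop, in Python. All names here are hypothetical: `fetch_changes(db, since)` is assumed to return the parsed JSON body of `GET /{db}/_changes?include_docs=true&since={since}`, and `sink(doc)` is assumed to write one document into the aggregating database. The demo uses in-memory fakes instead of HTTP.

```python
from typing import Callable, Dict, Iterable, List

FetchChanges = Callable[[str, str], dict]


def aggregate_once(dbs: Iterable[str], fetch_changes: FetchChanges,
                   checkpoints: Dict[str, str],
                   sink: Callable[[dict], None]) -> int:
    """Poll each database once and copy changed docs to the aggregate."""
    copied = 0
    for db in dbs:
        body = fetch_changes(db, checkpoints.get(db, "0"))
        for row in body["results"]:
            doc = row.get("doc")
            if doc is None or row.get("deleted"):
                continue  # deletions are skipped in this sketch
            out = {k: v for k, v in doc.items() if k != "_rev"}
            out["_id"] = f"{db}:{doc['_id']}"  # keep ids unique across dbs
            out["source_db"] = db
            sink(out)
            copied += 1
        checkpoints[db] = body["last_seq"]  # resume from here next time
    return copied


# Demo with a fake feed for one branch database.
FAKE_FEED = {
    "branch_a": {
        "results": [
            {"id": "t1", "doc": {"_id": "t1", "_rev": "1-x", "amount": 5}},
            {"id": "t2", "deleted": True,
             "doc": {"_id": "t2", "_rev": "2-y", "_deleted": True}},
        ],
        "last_seq": "2-abc",
    }
}

central: List[dict] = []
seen: Dict[str, str] = {}
count = aggregate_once(["branch_a"], lambda db, since: FAKE_FEED[db],
                       seen, central.append)
print(count, central, seen)
```

A real sink also has to handle updates to documents already present in the aggregate (it needs the current `_rev` there), which is part of why this gets tedious across thousands of databases.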
However, the biggest problem that you're going to run into if you try to do this is that CouchDB does not horizontally scale [well] by itself. If you add more CouchDB servers they're all going to have duplicates of the data. Sure, your max open dbs count will scale linearly with each node added, but other things like view build time won't (ex., they'll all need to do their own view builds).
By contrast, I've seen thousands of open databases on a BigCouch cluster. Anecdotally, that's because of Dynamo-style clustering: more nodes doing different things in parallel, versus walled-off CouchDB servers replicating to one another.
Cheers.
I know this question is old, but wanted to note that now with more recent versions of CouchDB (3.0+), partitioned databases are supported, which addresses this situation.
So you can have a single database for transactions, and partition them by bank branch. You can then query all transactions as you would before, or query just for those from a specific branch, and only the shards where that branch's data is stored will be accessed.
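As a small illustration of the conventions involved: in a partitioned database the document ID carries the partition as a prefix (`partition:key`), the database is created with `?partitioned=true`, and per-partition queries go through a `_partition` path. The helper names and the `bank`, `branch-042`, `reports` and `daily_summary` values below are made up for the example.

```python
def partitioned_id(partition: str, key: str) -> str:
    """Document ID for a partitioned database: '<partition>:<key>'."""
    return f"{partition}:{key}"


def create_db_path(db: str) -> str:
    """Path for the PUT request that creates a partitioned database."""
    return f"/{db}?partitioned=true"


def partition_all_docs_path(db: str, partition: str) -> str:
    """Path listing only the documents of one partition."""
    return f"/{db}/_partition/{partition}/_all_docs"


def partition_view_path(db: str, partition: str, ddoc: str, view: str) -> str:
    """Path querying a view restricted to one partition."""
    return f"/{db}/_partition/{partition}/_design/{ddoc}/_view/{view}"


print(partitioned_id("branch-042", "txn-0001"))
print(create_db_path("bank"))
print(partition_all_docs_path("bank", "branch-042"))
print(partition_view_path("bank", "branch-042", "reports", "daily_summary"))
```

Querying the same view without the `_partition` segment still works as a global query across all branches.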
Multiple databases are possible, but for most cases I think the aggregate database will actually give better performance to your branches. Keep in mind that you're only optimizing when a document is updated into the view; each document will only be parsed once per view.
For end-of-day polling in an aggregate database, the first branch will cause 100% of the new docs to be processed, and pay 100% of the delay. All other branches will pay 0%. So most branches benefit. For end-of-day polling in separate databases, all branches pay a portion of the penalty proportional to their volume, so most come out slightly behind.
For frequent view updates throughout the day, active branches prefer the aggregate and low-volume branches prefer separate. If one branch in 10 adds 99% of the documents, most of the update work will be done on other branches' polls, so 9 out of 10 prefer separate dbs.
If this latency matters, and assuming couch has some clock cycles going unused, you could write a 3-line loop/view/sleep shell script that queries the view so new documents are indexed before any user is waiting.
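The same loop/view/sleep idea, sketched in Python rather than shell. `warm_view` and `http_get` are hypothetical helpers; the sketch assumes that querying the view with `limit=0` is enough to trigger the index update while returning no rows.

```python
import time
from typing import Callable
from urllib.request import urlopen


def http_get(url: str) -> bytes:
    """Plain GET; the response body is read and discarded by the caller."""
    with urlopen(url) as resp:
        return resp.read()


def warm_view(view_url: str, interval_s: float, rounds: int,
              fetch: Callable[[str], object] = http_get,
              sleep: Callable[[float], None] = time.sleep) -> int:
    """Query the view `rounds` times, sleeping between queries.

    Returns how many queries succeeded.
    """
    succeeded = 0
    for _ in range(rounds):
        try:
            fetch(view_url + "?limit=0")  # rows are not needed, only the build
            succeeded += 1
        except OSError:
            pass  # server unreachable or HTTP error: try again next round
        sleep(interval_s)
    return succeeded
```

For example, `warm_view("http://localhost:5984/bank/_design/reports/_view/by_branch", 30, 120)` would poll a made-up view every 30 seconds for an hour; the URL is only a placeholder.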
I would add that having a large number of databases creates issues around compaction and replication. Not only do things like continuous replication need to be triggered on a per-database basis (meaning you will have to write custom logic to loop over all the databases), but they also spawn replication daemons per database. This can quickly become prohibitive.