How Hyperledger Fabric ledger size impacts performance? - hyperledger-fabric

The ledger size stored in /var/hyperledger/production is increasing day by day. For example after 3-4 years, the size can be in Terabytes. Is there any -
Performance issues when doing r/w transactions?
Way to cut some blocks and reduce the size of data and still maintain the validity of the chain?
How much ledger size (in GB or TB) that Hyperledger Fabric can tolerate the performance?
Any data archival or off chain strategy?

for your question,I try to find the answer in the source code,and a little bit of answer
1、for your first question:
in fabric v1.4.2 source code commom/ledger/blockfile_mgr.go,method getTransactionByIDis used as tx query,according to this method,final this method will call method getTxLoc, we can know when we query a tx,we need first to get block which the tx locate,we can find this block quick with index(I don’t have a deep understanding of the indexing mechanism,but I think it is helpful in performance)
2、for your second and third question:
in fabric v1.4.2 source code commom/ledger/blockfile_mgr.go,method addBlock is used as add block to blockfile(which located at /var/hyperledger/production/...blockfile_000000),we can find some usefull information in this method:
//Determine if we need to start a new file since the size of this block
//exceeds the amount of space left in the current file
//default value of maxBlockfileSize is 64 * 1024 * 1024
if currentOffset+totalBytesToAppend > mgr.conf.maxBlockfileSize {
mgr.moveToNextFile()
currentOffset = 0
}
It means when blockfile bigger than 64 * 1024 * 1024,It will create a new blockfile
3、for your last question
I have not found related information
hope these are helpful for you.

Related

Event Processing (17 Million Events/day) using Azure Stream Analytics HoppingWindow Function - 24H Window, 1 Minute Hops

We have a business problem that needs solving and would like some guidance from the community on the combination of products in Azure we could use to solve it.
The Problem:
I work for a business that produces online games. We would like to display the number of users playing a specific game in a 24 Hour Window, but we want the value to update every minute. Essentially the output that HoppingWindow(Duration(hour, 24), Hop(minute, 1)) function in Azure Stream Analytics will provide.
Currently, the amount of events are around 17 Million a day and the Stream Analytics Job seems to be struggling with the load. We tried the following so far;
Tests Done:
17 Million Events -> Event Hub (32 Partitions) -> ASA (42 Streaming Units) -> Table Storage
Failed: Stream Analytics Job never outputs on large timeframes (Stopped test at 8 Hours)
17 Million Events -> Event Hub (32 Partitions) -> FUNC -> Table Storage (With Proper Partition/Row Key)
Failed: Table storage does not support distinct count
17 Million Events -> Event Hub -> FUNC -> Cosmos DB
Tentative: Cosmos DB doesn't support distinct count, not natively anyways. Seems to be some hacks going around, but not sure that's the way to go.
Is there any known designs geared for processing 17 Million events a minute, every minute?
Edit: As per the comments, the code.
SELECT
GameId,
COUNT(DISTINCT UserId) AS ActiveCount,
DateAdd(hour, -24, System.TimeStamp()) AS StartWindowUtc,
System.TimeStamp() AS EndWindowUtc INTO [out]
FROM
[in] TIMESTAMP BY EventEnqueuedUtcTime
GROUP BY
HoppingWindow(Duration(hour, 24), Hop(minute, 1)),
GameId,
UserId
The expected output, note that in reality there will be 1440 records per GameId. One for each minute
To be clear, the problem is that generating the expected output on the larger timeframes, ie 24 Hours doesn't output or at the very least takes 8+ Hours to output. The smaller window sizes work, for example changing the above code to use HoppingWindow(Duration(minute, 10), Hop(minute, 5)).
The tests that followed assumed that ASA is not the answer to the problem and we tried different approaches. Which seemed to have caused a bit of confusion, sorry about that
The way ASA scales up at the moment is with 1 node vertically from 1 to 6SU, then horizontally with multiple nodes of 6SU above that threshold.
Obviously, to be able to scale horizontally a job needs to be parallelizable, which means the stream will be distributed across nodes according to the partition scheme.
Now if the input stream, the query and the output destination are aligned in partitions, then the job is called embarrassingly parallel and that's where you'll be able to reach maximum scale. Each pipeline, from entry to output, will be independent, and each node will only have to maintain in memory the data pertaining to its own state (those 24h of data). That's what we're looking for here: minimizing the local data store at each node.
With EH supporting 32 partitions on most SKUs, the maximum scale publicly available on ASA is 192SU (6*32). If partitions are balanced, that's means that each node will have the least amount of data to maintain in its own state store.
Then we need to minimize the payload itself (the size of an individual message), but from the look of the query that's already the case.
Could you try scaling up to 192SU and see what happens?
We are also working on multiple other features that could help on that scenario. Let me know if that could be of interest to you.

Azure Computer Vision enforces 4mb limit on Images in paid tier?

I am currently working with Azure Computer Vision (read and anlyze APIs).
The docs state that images must be 4mb on free tier or up to 50mb on paid.
https://westus.dev.cognitive.microsoft.com/docs/services/computer-vision-v3-1-preview-1/operations/5d986960601faab4bf452005
Repro
Pricing Tier S1
POST
https://{yourdomainy.cognitiveservices.azure.com/vision/v3.1/analyze?visualFeatures=Categories,Description,Tags
Headers
ocp-apim-subscription-key:{yourkey}
Content-Type:application/json
Body
{"url":"{imageurl}"}
(i have also tried posting a byte array of the image - same result) All images are between 3-8mb, so far less than the 50mb limit stated in docs.
Reponse
{\"code\":\"InvalidImageSize\",\"requestId\":\"177edee5-d17d-4fb7-a16f-da30644f77c4\",\"message\":\"Input image is too large.\"}
Any idea why this is happening?
Thanks
Baced on your question you are getting this message also for 3MB images, therefore less than 4MB.
In addition to the MB constraint there is also a dimension constraint. The max dimension is 10000 x 10000 pixels.
It may be that you are getting a InvalidImageSize due to the dimension constraint.
FWIW I just found this question and it did cause me some confusion too. I would just like to say that from my investigation you need to check each api individually for their size limites. It would seem that (as of api v3.1) many of the api's (Analyze and Describe) endpoints have a 4MB limit, with a couple of exceptions such as Read which call out 4MB limit on Free and 50MB on paid.
As the original post referred to Analyze endpoint in the example request I think this is likely the cause

Best way to store high frequency, periodic time-series data?

I have created an MVP for a nodejs project, following are some of the features that are relevant to the question I am about to ask:
1-The application has a list of IP addresses with CRUD actions.
2-The application will ping each IP address after every 5 seconds.
3- And display against each IP address it's status i.e alive or dead and the uptime if alive
I created a working MVP on nodejs with the help of the library net-ping, express, mongo and angular. Now I have a new feature request that is:
"to calculate the round trip time(latency) for each ping that is generated for each IP address and populate a bar chart or any type of chart that will display the RTT(latency) history(1 months-1 year) of every connection"
I need to store the response of each ping in the database, Assuming the best case that if each document that I will store is of size 0.5 kb, that will make 9.5MB data to be stored in each day,285MB in each month and 3.4GB in a year for a single IP address and I am going to have 100-200 IP addresses in my application.
What is the best solution (including those which are paid) that will suit the best for my requirements considering the app can scale more?
Time series data require special treatment from a database perspective as they introduce challenges to the traditional database management from capacity, query performance, read/write optimisation targets, etc.
I wouldn't recommend you store this data in a traditional RDBMS, or object/document database.
Best option is to use a specialised time-series database engine, like InfluxDB, that can support downsampling (aggregation) and raw data retention rules
So I changed The schema design for the Time-series data after reading this and that reduced the numbers in my calculation of size massively
previous Schema looked like this:
{
timestamp: ISODate("2013-10-10T23:06:37.000Z"),
type: "Latency",
value: 1000000
},
{
timestamp: ISODate("2013-10-10T23:06:38.000Z"),
type: "Latency",
value: 15000000
}
Size of each document: 0.22kb
number of document created in an hour= 720
size of data generated in an hour=0.22*720 = 158.4kb
size of data generated by one IP address in a day= 158 *24 = 3.7MB
Since every next time_Stamp is just the increment of 5 seconds from the previous one, the schema can be optimized to cut the redundant data.
The new schema looks like this :
{
timestamp_hour: ISODate("2013-10-10T23:06:00.000Z"),// will contain hours
type: “Latency”,
values: {//will contain data for all pings in the specific hour
0: 999999,
…
37: 1000000,
38: 1500000,
…
720: 2000000
}
}
Size of each document: 0.5kb
number of document created in an hour= 1
size of data generated in an hour= 0.5kb
size of data generated by one IP address in a day= 0.5 *24 = 12kb
So I Am assuming the size of the data will not be an issue anymore, and I although there is a debate for what type of storage should be used in such scenarios to ensure best performance but I am going to trust mongoDB in my case.

Liferay: huge DLFileRank table

I have a Liferay 6.2 server that has been running for years and is starting to take a lot of database space, despite limited actual content.
Table Size Number of rows
--------------------------------------
DLFileRank 5 GB 16 million
DLFileEntry 90 MB 60,000
JournalArticle 2 GB 100,000
The size of the DLFileRank table sounds to me as abnormally big (if it is totally normal please let me know).
While the file ranking feature of Liferay is nice to have, we would not really mind resetting it if it halves the size of the database.
Question: Would a DELETE * FROM DLFileRank be safe? (stop Liferay, run that SQL command, maybe set dl.file.rank.enabled=false in portal-ext.properties, start Liferay again)
Is there any better way to do it?
Bonus if there is a way to keep recent ranking data and throw away only the old data (not a strong requirement).
Wow. According to the documentation here (Ctrl-F rank), I'd not have expected the number of entries to be so high - did you configure those values differently?
Set the interval in minutes on how often CheckFileRankMessageListener
will run to check for and remove file ranks in excess of the maximum
number of file ranks to maintain per user per file. Defaults:
dl.file.rank.check.interval=15
Set this to true to enable file rank for document library files.
Defaults:
dl.file.rank.enabled=true
Set the maximum number of file ranks to maintain per user per file.
Defaults:
dl.file.rank.max.size=5
And according to the implementation of CheckFileRankMessageListener, it should be enough to just trigger DLFileRankLocalServiceUtil.checkFileRanks() yourself (e.g. through the scripting console). Why you accumulate that large number of files is beyond me...
As you might know, I can never be quoted by stating that direct database manipulation is the way to go - in fact I refuse thinking about the problem from that way.

Azure VM stats - Network In/Out - what are the measurements?

I feel perturbed, but I don't understand the measurement Azure uses for Network In/Out and a few other things.
On Azure portal -> my VM -> Metrics -> [Host] Network In/Out, it says that it is measured in bytes, but then it also draws graph over time. If it were plain, bytes, it should be cumulative and therefore grow indefinitely, but it isn't, therefore I am inclined to believe it is measured per second or something like that. But Azure docs claim that it is bytes and not bytes per second (link here)
Am I missing something obvious?
I am inclined to believe the data is in bytes per minute. At least for mine it appears that way. I set my graph for a 10 minute interval. Taking the mouse off the graph the total bytes show at the bottom. Hovering over the individual sample points (10 in total) they average between 31-34MB each. Adding them up you get close to the total for the graph interval 326MB. 10*32.5 is very close to the this total leading me to believe that each interval on the graph is a sum of the individual interval (1 minute). That is what I am seeing anyway. Terrible documentation from Microsoft. Why not just specify this in the (i) hover point on the individual graph?
#eddyP23 - if you add up all your points in your graph it appears you would come to the same conclusion. Each point is a sum of the interval (1 minute). I am not sure how else to read this.
If it were bytes per second the data total for the complete interval would be vastly larger. 10 minute interval
Sorry for the delay.
therefore I am inclined to believe it is measured per second or
something like that. But Azure docs claim that it is bytes and not
bytes per second
You can find the Network In here:
The Network In (bytes per second) used for monitor your VM's performance.

Resources