Averaging HAR files - performance-testing

When trying to analyze a website's performance, it is useful to run a monitoring session multiple times, since wait times and receive times can vary randomly.
I couldn't find a practical way to then average several HTTP Archive (HAR) files together into one.
Any recommendations?
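One way this could be approached, as a minimal sketch in Python: load each capture run, group entries by request URL, and average the HAR timing fields across runs. The file names are placeholders, and it assumes every run contains the same requests (each URL appearing once per run).

```python
import json
from collections import defaultdict
from statistics import mean

# Hypothetical capture files -- replace with your own monitoring runs.
RUNS = ["run1.har", "run2.har", "run3.har"]
TIMING_FIELDS = ("blocked", "dns", "connect", "send", "wait", "receive", "ssl")

def load_har(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Group entries from every run by request URL
# (assumes each URL appears once per run).
by_url = defaultdict(list)
for path in RUNS:
    for entry in load_har(path)["log"]["entries"]:
        by_url[entry["request"]["url"]].append(entry)

# Use the first run as a template and overwrite its timings with the averages.
averaged = load_har(RUNS[0])
for entry in averaged["log"]["entries"]:
    samples = by_url[entry["request"]["url"]]
    entry["time"] = mean(e["time"] for e in samples)
    for field in TIMING_FIELDS:
        values = [e["timings"].get(field, -1) for e in samples]
        values = [v for v in values if v >= 0]  # -1 means "not measured" in HAR
        if values:
            entry["timings"][field] = mean(values)

with open("averaged.har", "w", encoding="utf-8") as f:
    json.dump(averaged, f, indent=2)
```

The result is written back out as a regular HAR file, so it can still be opened in the usual HAR viewers.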

Related

Azure Face API - How to manage very large volumes (more than 30 million faces)

I am using LargeFaceGroup to store the faces. The use case I am dealing with has more than 30 million faces, and I need to run Face-Identify calls against these 30 million images as well.
The limitation of LargeFaceGroup is that it can only hold up to 1 million faces. If I use 30 LargeFaceGroups, I will have to make 30 Face-Identify calls to find a match across 30 million faces, i.e. 30 API transactions to find a match for a single face.
I have a few questions:
Is there a more efficient way to deal with large volumes?
How can I optimize API cost and time? (For example, I have found that we can pass up to 10 faceIds to Face-Identify, thus reducing the API transactions tenfold.)
Can I also detect/add/delete faces in batch, or will I have to make an API transaction for each individual face?
What is the search time for Face-Identify in a LargeFaceGroup? Is it dependent on the number of faces present in the LargeFaceGroup?
After a discussion with the Azure Face API product team, I got answers to these questions.
To handle large volumes, we should use PersonDirectory to store the faces. It can handle up to 75 million faces, and there is no training cost with the PersonDirectory data structure either.
As mentioned in the first point, training costs can be eliminated. Time can be optimized: you can request more than 10 TPS from Azure, and they will allow it. Other API calls such as Detect, Add-Face, and Delete-Face cannot be batched. (Some hacks, like stitching multiple images into one and then calling Detect on the result, can save API calls; check whether this is suitable for your use case.)
Instead, focus on not making redundant API calls, such as two Detect calls for the same image; save the faceId and make the subsequent calls within 24 hours. A sketch of batching faceIds into Face-Identify calls is shown after the documentation link below.
Apart from the Detect hack, you will have to call the API for each individual image/face.
I am not sure about the response time of an individual query, but when handling large volumes we care about the throughput of the API, and throughput can be increased from 10 TPS to a higher limit as desired.
Face API Doc - https://westus.dev.cognitive.microsoft.com/docs/services/face-v1-0-preview/operations/563879b61984550f30395239
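To illustrate the point about passing up to 10 faceIds per Face-Identify call, here is a rough Python sketch against the REST endpoint. The endpoint, key, and group id are placeholders, and it assumes an Identify call against a LargePersonGroup-style target; check the exact request shape against the documentation linked above.

```python
import requests

# Placeholders -- substitute your own resource endpoint, key, and group id.
ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
KEY = "<subscription-key>"
LARGE_PERSON_GROUP_ID = "my-group"  # hypothetical group id
HEADERS = {"Ocp-Apim-Subscription-Key": KEY, "Content-Type": "application/json"}

def identify_in_batches(face_ids, batch_size=10):
    """Send detected faceIds to Identify in batches of up to 10,
    turning N single-face calls into roughly N/10 calls."""
    results = []
    for i in range(0, len(face_ids), batch_size):
        batch = face_ids[i:i + batch_size]
        resp = requests.post(
            f"{ENDPOINT}/face/v1.0/identify",
            headers=HEADERS,
            json={
                "faceIds": batch,
                "largePersonGroupId": LARGE_PERSON_GROUP_ID,
                "maxNumOfCandidatesReturned": 1,
            },
            timeout=30,
        )
        resp.raise_for_status()
        results.extend(resp.json())
    return results
```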

dfs.FSNameSystem.BlockCapacity getting reduced eventually

I have a small application that I am running on an EMR cluster with 3 nodes. I have a few gigabytes of CSV data split across multiple files. The application reads the CSV files and converts them into '.orc' files. A small program sequentially and synchronously submits a limited number (fewer than ten) of files as input to the application.
My problem is that, after some time, the cluster eventually goes down without leaving any trace (or maybe I am looking in the wrong places). After trying various things, I observed in Ganglia that dfs.FSNameSystem.BlockCapacity keeps decreasing.
Is this because of the application, or is it the server configuration? Can someone please share if you have any experience with this?
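For context, a minimal sketch of the kind of CSV-to-ORC conversion job described above, assuming the application uses Spark on EMR; the bucket paths and options are illustrative placeholders, not the actual setup.

```python
from pyspark.sql import SparkSession

# Hypothetical S3 locations -- replace with the real input/output paths.
INPUT = "s3://my-bucket/incoming/*.csv"
OUTPUT = "s3://my-bucket/orc/"

spark = SparkSession.builder.appName("csv-to-orc").getOrCreate()

# Read the CSV input and write it back out in ORC format.
df = spark.read.option("header", "true").csv(INPUT)
df.write.mode("append").orc(OUTPUT)

spark.stop()
```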

Linux: huge files vs huge number of files

I am writing software in C, on Linux running on AWS, that has to handle 240 terabytes of data, in 72 million files.
The data will be spread across 24 or more nodes, so there will only be 10 terabytes on each node, and 3 million files per node.
Because I have to append data to each of these three million files every 60 seconds, the easiest and fastest thing to do would be to keep all of these files open at the same time.
I can't store the data in a database, because the performance in reading/writing the data will be too slow. I need to be able to read the data back very quickly.
My questions:
1) is it even possible to keep 3 million files open?
2) if it is possible, how much memory would it consume?
3) if it is possible, would performance be terrible?
4) if it is not possible, I will need to combine all of the individual files into a couple of dozen large files. Is there a maximum file size in Linux?
5) if it is not possible, what technique should I use to append data every 60 seconds, and keep track of it?
The following is a very coarse description of an architecture that can work for your problem, assuming that the per-machine limit on open file descriptors stops mattering once you have enough instances.
First, take a look at this:
https://aws.amazon.com/blogs/aws/amazon-elastic-file-system-shared-file-storage-for-amazon-ec2/
https://aws.amazon.com/efs/
EFS provides a shared storage that you can mount as a filesystem.
You can store ALL your files in a single EFS storage unit. Then you will need a set of N worker machines, each running at its full capacity of open file handles. You can then use a Redis queue to distribute the updates: each worker dequeues a batch of updates from Redis, opens the necessary files, and applies the updates.
Again: the maximum number of open file handles will not be a problem, because if you hit the limit you only need to add more worker machines until you reach the performance you need.
This is scalable, though I'm not sure if this is the cheapest way to solve your problem.
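A rough Python sketch of the worker loop described above, assuming EFS is mounted on every worker and updates arrive as JSON messages on a Redis list; the host name, queue name, and message format are all illustrative assumptions.

```python
import json
import redis

# Assumptions: EFS is mounted at /mnt/efs on every worker, producers push
# JSON updates like {"file": "metrics-000123.log", "data": "..."} onto a
# Redis list named "updates", and "my-redis-host" is a placeholder.
EFS_ROOT = "/mnt/efs"
QUEUE = "updates"

r = redis.Redis(host="my-redis-host", port=6379)

while True:
    # BLPOP blocks until an update is available and removes it from the queue.
    _, raw = r.blpop(QUEUE)
    update = json.loads(raw)

    # Append the new data to the target file on the shared EFS volume.
    with open(f"{EFS_ROOT}/{update['file']}", "a", encoding="utf-8") as f:
        f.write(update["data"] + "\n")
```

Adding more of these workers is what keeps the per-machine open-file limit from ever becoming the bottleneck.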

What's the best method for fetching huge files from the web server using C#?

Hi, I have a requirement to fetch files from a server and identify the unused files in a directory. The problem is that the files returned are huge, and CPU usage increases while I am fetching these large files; I would like to avoid that. Does anyone know how to avoid this situation? Please share, as it would be very helpful for me.
Thanks
You can split your large file on the server into several smaller pieces, fetch some metadata about the number of pieces, their sizes, etc., and then fetch the pieces one by one from your client C# code and join them in binary mode into the larger file.
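The same idea as a short sketch (in Python rather than C#, but the pattern translates directly); the metadata document, piece naming, and URLs are assumptions for illustration.

```python
import requests

# Assumptions: the server exposes a metadata document listing the pieces of a
# large file, e.g. {"pieces": ["big.dat.000", "big.dat.001", ...]}, and each
# piece is downloadable at BASE_URL/<piece>. All URLs are illustrative.
BASE_URL = "https://example.com/files"

meta = requests.get(f"{BASE_URL}/big.dat.meta.json", timeout=30).json()

with open("big.dat", "wb") as out:
    for piece in meta["pieces"]:
        # Stream each piece so the whole file is never held in memory,
        # then append it to the output in binary mode.
        with requests.get(f"{BASE_URL}/{piece}", stream=True, timeout=60) as resp:
            resp.raise_for_status()
            for chunk in resp.iter_content(chunk_size=1 << 20):
                out.write(chunk)
```

Streaming each piece in fixed-size chunks is what keeps memory and CPU usage flat regardless of the total file size.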

Solr for constantly updating index

I have a news site with 150,000 news articles. About 250 new articles are added to the database daily, at intervals of 5-15 minutes. I understand that Solr is optimized for millions of records and that my 150K won't be a problem for it. But I am worried that the frequent updates will be a problem, since the cache gets invalidated with every update. On my dev server, a cold load of a page takes 5-7 seconds (since every page runs a few MLT queries).
Will it help if I split my index into two: an archive index and a latest index? The archive index would be updated only once a day.
Can anyone suggest any ways to optimize my installation for a constantly updating index?
Thanks
My answer is: test it! Don't try to optimize yet if you don't know how it performs. As you said, 150K is not a lot; it should be quick to build an index of that size for your tests. After that, run a couple of MLT queries from different concurrent threads (to simulate users) while you index more documents, and see how it behaves.
One setting you should keep an eye on is auto-commit. Since you are indexing constantly, you can't commit on every document (you would bring Solr down). The value you choose for this setting lets you tune the latency of the system (how long it takes for new documents to show up in results) while keeping the system responsive.
Consider using mlt=true in the main query instead of issuing a MoreLikeThis query per result. You'll save the round trips, so it will be faster.
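To illustrate the mlt=true suggestion, a small sketch (using Python's requests) that enables the MoreLikeThis component on the main query; the Solr URL, core name, query, and field names are placeholders.

```python
import requests

# Placeholders -- substitute your own Solr host, core, query, and fields.
SOLR_SELECT = "http://localhost:8983/solr/articles/select"

params = {
    "q": "category:news",
    "rows": 10,
    "wt": "json",
    # Ask the MoreLikeThis component for similar documents in the same
    # request, instead of issuing one extra MLT query per article.
    "mlt": "true",
    "mlt.fl": "title,body",
    "mlt.count": 5,
}

resp = requests.get(SOLR_SELECT, params=params, timeout=10)
resp.raise_for_status()
data = resp.json()

docs = data["response"]["docs"]
similar = data.get("moreLikeThis", {})  # keyed by each result's unique key
```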
