I have created a panel with 6 LogQL queries (shown below), calculating basic stats on a given dataset (min, percentiles, etc.). I want to order the gauges so they have a logical ordering and are easy to visualize.
For instance, the panel below would be a lot easier to read if it presented each gauge in the following order:
min, p50, p75, p99, max, stddev.
Right now the order appears random and changes after each page reload. Is there any way I could get the ordering I am looking for?
I prepared a dataset and later learned that it is skewed.
Assume a plot of user_count vs. score, where user_count is the number of users with that particular score.
I have to split the total users into multiple samples of size 100 <= n <= 1000 in such a way that the standard deviation of each created sample is minimized.
How do I do that?
I have tried binning methods like custom binning, quantile binning, etc., but they have not helped: with manual binning some of my bins have a high SD.
Example:
I created 19 custom bins with intervals .05-.10, .10-.15, ..., .90-.95, >.95.
This gives me something like this:
The problem here is that bin q19 has a high SD.
So I am trying to figure out a way to automatically create an optimal number of bins with minimal standard deviations.
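Not something from the original post, but as one possible starting point, here is a greedy sketch in Python: sort the scores, grow a bin until its standard deviation would exceed a cap (once the minimum size is met) or the maximum size is reached. The `sd_cap`, `min_size` and `max_size` values are hypothetical tuning knobs.

```python
import numpy as np

def greedy_bins(scores, min_size=100, max_size=1000, sd_cap=0.02):
    """Greedy pass over the sorted scores: extend the current bin until adding
    the next score would push its SD above sd_cap (and the bin already has
    min_size members), or until max_size is reached."""
    scores = np.sort(np.asarray(scores, dtype=float))
    bins, current = [], []
    for s in scores:
        candidate = current + [s]
        if len(current) >= min_size and (np.std(candidate) > sd_cap
                                         or len(candidate) > max_size):
            bins.append(np.array(current))
            current = [s]
        else:
            current = candidate
    if current:
        if len(current) < min_size and bins:
            # merge a too-small tail into the previous bin (may exceed max_size)
            bins[-1] = np.concatenate([bins[-1], current])
        else:
            bins.append(np.array(current))
    return bins

# Example: skewed scores in [0, 1]
samples = greedy_bins(np.random.beta(8, 2, 50_000))
print([(len(b), round(b.std(), 4)) for b in samples[:5]])
```

Sweeping sd_cap (or replacing the greedy pass with a dynamic program over the sorted scores) lets you trade the number of bins against the worst-case SD.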
In my master's thesis I need to determine and calculate the number of cases for the median time to event, using the method of Brookmeyer & Crowley (1982). My question is: how can I determine the sample size according to Brookmeyer, i.e., the number of cases needed for the median time to event? How can I define the equation for N? I know how to calculate the confidence interval, but my problem is how to determine the case number theoretically.
Edit:
"Designing the trial with different characteristics: planning a single arm study without historical control. How can I determine the sample size N and what method is the best", this is my plan. Assuming "Median Time to event "PFS" ". I want to determine the sample size N and then calculate it, that's why I thought that I can clearly use or find a formula for N. I firmly assume that the survival time is exponentially distributed I want to see with it: 1- Sample size based on distributional assumptions? 2- No implementation available? How to derive p-value? Thanks for further help, best regards
I am trying to make a normal template matching search more efficient by first doing the search on downscaled representations of the image. Basically I do a double pyrDown -> quarter resolution.
For most images and templates this works beautifully, but for some others I get really bad matching results. It seems to be especially bad for thin fonts or low contrast.
Look at this example:
And this template:
At 100% resolution I get a matching probability of 99.9%
At 50% resolution I get 90%
At 25% resolution I get 87%
I don't really know why it's so bad for some images/templates. I tried to recreate and test this in Photoshop by hiding/showing the 25% downscaled template on top of the 25% downscaled image, and as you can see, it's not 100% congruent:
https://giphy.com/gifs/coWDjcvHysKgn95IFa
I need a way to get a higher matching probability for these cases at low resolution, because the search needs to be fast.
Any ideas on how to improve my algorithm?
Here are the original files:
https://www.dropbox.com/s/llbdj9bx5eprxbk/images.zip?dl=0
This is not unusual, and those scores seem perfectly fine. However, here are some ideas that might help you improve the situation:
You mentioned that it seems to be especially bad for thin fonts. This could be happening because some of the pixels in the lines are being smoothed out or distorted by the Gaussian filter that pyrDown applies. It could also be an indication that you have reduced the resolution too much. Unfortunately, the pyrDown function in OpenCV reduces the resolution by a fixed factor of 2, so it does not let you fine-tune the scale factor. Another thing you could try is the resize() function with interpolation set to INTER_LINEAR or INTER_CUBIC. resize() allows you to use any scale factor, so you have more control over the performance vs. accuracy trade-off.
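For illustration, a minimal OpenCV sketch of the resize() route (the file names and the 0.4 factor are hypothetical):

```python
import cv2

# resize() lets you pick any scale factor, unlike pyrDown's fixed factor of 2.
img = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)
tpl = cv2.imread("template.png", cv2.IMREAD_GRAYSCALE)

scale = 0.4  # tune for the speed vs. accuracy trade-off
small_img = cv2.resize(img, None, fx=scale, fy=scale, interpolation=cv2.INTER_LINEAR)
small_tpl = cv2.resize(tpl, None, fx=scale, fy=scale, interpolation=cv2.INTER_LINEAR)
# INTER_AREA is also worth trying when shrinking; INTER_CUBIC when scaling back up.

res = cv2.matchTemplate(small_img, small_tpl, cv2.TM_CCOEFF_NORMED)
_, score, _, loc = cv2.minMaxLoc(res)
```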
Use multiple templates of the same objects. If you come to a scene and can only achieve an 87% score, create a template out of that scene and add it to the database of templates to be used. Obviously, as the number of templates increases, so does the time it takes to complete the search.
The best way to deal with this scenario is to perform an exhaustive match on the highest level of the pyramid and then track it down to the lowest level using a reduced search space on the lower levels. By exhaustive I mean you search all rows and all columns across the entire top pyramid level image. You keep track of the locations (row, col) of the highest matches on that level (you are probably already doing that). Then you multiply those locations by a factor of 2 and perform a restricted search on the next lower level (e.g., a 5 x 5 shift centered on the rough location). You keep doing this until you are at the bottom level. This gives you the best overall accuracy and performance, and it is also the way most industrial computer vision packages do it.
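As a rough sketch of that coarse-to-fine strategy in Python/OpenCV (not the poster's actual code; the pyramid depth and the 5-pixel search radius are assumptions):

```python
import cv2

def coarse_to_fine_match(image, template, levels=2, radius=5):
    """Exhaustive match on the coarsest pyramid level, then refine the location
    in a small window on each finer level."""
    img_pyr, tpl_pyr = [image], [template]
    for _ in range(levels):                       # level 0 = full resolution
        img_pyr.append(cv2.pyrDown(img_pyr[-1]))
        tpl_pyr.append(cv2.pyrDown(tpl_pyr[-1]))

    # Exhaustive search on the top (coarsest) level.
    res = cv2.matchTemplate(img_pyr[-1], tpl_pyr[-1], cv2.TM_CCOEFF_NORMED)
    _, score, _, loc = cv2.minMaxLoc(res)         # loc is (x, y)

    # Track the match down the pyramid with a restricted search window.
    for lvl in range(levels - 1, -1, -1):
        img, tpl = img_pyr[lvl], tpl_pyr[lvl]
        th, tw = tpl.shape[:2]
        x, y = loc[0] * 2, loc[1] * 2             # scale location up to this level
        x0, y0 = max(x - radius, 0), max(y - radius, 0)
        x1 = min(x + tw + radius, img.shape[1])
        y1 = min(y + th + radius, img.shape[0])
        res = cv2.matchTemplate(img[y0:y1, x0:x1], tpl, cv2.TM_CCOEFF_NORMED)
        _, score, _, rloc = cv2.minMaxLoc(res)
        loc = (x0 + rloc[0], y0 + rloc[1])
    return loc, score

# Hypothetical usage:
# loc, score = coarse_to_fine_match(cv2.imread("scene.png", 0), cv2.imread("template.png", 0))
```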
I'm trying to come up with a good design for a nearest neighbor search application. This would be somewhat similar to this question:
Saving and incrementally updating nearest-neighbor model in R
In my case this would be in Python, but the main point is the same: when new data comes in, the model/index must be updated. I'm currently playing around with the scikit-learn neighbors module, but I'm not convinced it's a good fit.
The goal of the application:
A user comes in with a query and then the n (probably fixed to 5) nearest neighbors in the existing dataset are shown. For this step such a search structure from sklearn would help, but it would have to be regenerated when adding new records. Also, this is a first step that happens once per query and hence could be somewhat "slow", as in 2-3 seconds rather than "instantly".
Then the user can click on one of the records and see that record's nearest neighbors, and so forth. This means we are now within the existing dataset, and the NNs could be precomputed and stored in Redis (200k records for now, but this can be expanded to tens or hundreds of millions). This browsing should be very fast.
But here I would face the same problem of how to update the precomputed data without doing a full recomputation of the distance matrix, especially since there will be very few new records (around 100 per week).
Does such a tool, method or algorithm exist for updatable NN searching?
EDIT, April 3rd:
As is pointed out in many places, KDTree and BallTree aren't really suited to high-dimensional data. I've realized that for a proof of concept with a small dataset of 200k records and 512 dimensions, brute force isn't much slower at all (roughly 550 ms vs 750 ms).
However, for larger datasets in the millions and beyond, the question remains unsolved. I've looked at the datasketch LSH Forest, but it seems that in my case it simply isn't accurate enough, or I'm using it wrong. I will ask a separate question about this.
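To make the brute-force point concrete, here is a minimal sklearn sketch along those lines (the random 200k x 512 matrix stands in for the real embeddings; the Redis step is only indicated in a comment):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.standard_normal((200_000, 512)).astype(np.float32)

# Brute force needs no index build, so "updating" is just appending rows and refitting,
# which is cheap because fit() only stores the data.
nn = NearestNeighbors(n_neighbors=6, algorithm="brute", metric="cosine").fit(X)

# Precompute each record's 5 nearest neighbours in batches (column 0 is the record itself)
# and store them as id -> [neighbour ids], e.g. in Redis.
dist, idx = nn.kneighbors(X[:1000])
neighbours = {i: idx[i, 1:].tolist() for i in range(len(idx))}
```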
You should look into FAISS and its IVFPQ method.
What you can do there is create a new index for each batch of updates and merge it with the old one.
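A minimal FAISS sketch of the IVFPQ route, assuming 512-dimensional vectors (the nlist/m/nbits values are arbitrary). A trained IVFPQ index also accepts plain index.add() for new batches, which may already be enough for ~100 new records a week; merging separately built IVF indexes, as suggested above, is also possible via FAISS's merge utilities (see its docs).

```python
import numpy as np
import faiss

d, nlist, m, nbits = 512, 1024, 64, 8            # hypothetical parameters; m must divide d
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

xb = np.random.rand(200_000, d).astype("float32")
index.train(xb)                                  # train coarse quantizer + PQ codebooks once
index.add(xb)

xnew = np.random.rand(100, d).astype("float32")  # weekly batch: no full rebuild needed
index.add(xnew)

index.nprobe = 16                                # search breadth vs. accuracy trade-off
D, I = index.search(xb[:1], 5)                   # 5 nearest neighbours of the first vector
```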
You could try out Milvus, which supports adding vectors and near-real-time search.
Here are the benchmarks of Milvus.
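For reference, a minimal pymilvus 2.x sketch, assuming a running Milvus server and an existing collection named "records" with a 512-dimensional float vector field called "embedding" (all names are hypothetical, and the client API differs between Milvus versions):

```python
import numpy as np
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")
collection = Collection("records")               # pre-created, indexed collection

new_ids = list(range(200_000, 200_100))
new_vectors = np.random.rand(100, 512).tolist()
collection.insert([new_ids, new_vectors])        # incremental additions
collection.load()

hits = collection.search(
    data=[np.random.rand(512).tolist()],
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 16}},
    limit=5,
)
```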
nmslib supports adding new vectors. It's used by OpenSearch as part of their similarity search engine, and it's very fast.
One caveat:
While the HNSW algorithm allows incremental addition of points, it forbids deletion and modification of indexed points.
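A minimal nmslib/HNSW usage sketch with random data (the parameters are arbitrary; check the nmslib docs for how additions interact with an already built index, given the caveat above):

```python
import numpy as np
import nmslib

data = np.random.rand(200_000, 512).astype(np.float32)

index = nmslib.init(method="hnsw", space="cosinesimil")
index.addDataPointBatch(data)
index.createIndex({"M": 16, "efConstruction": 200}, print_progress=False)
index.setQueryTimeParams({"efSearch": 100})

ids, dists = index.knnQuery(data[0], k=5)        # 5 approximate nearest neighbours
```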
You can also look into solutions like Milvus or Vearch.
I am looking to use Cassandra for a nearby-search type of query: based on my lon/lat coordinates I want to retrieve the closest points. I do not need 100% accuracy, so I am comfortable using a bounding box instead of a circle (better performance too), but I can't find concrete instructions (hopefully with an example) on how to implement a bounding box.
From my experience, there's no easy way to have a generic geospatial index search on top of Cassandra. I believe you only have two options:
Geohashing: split your dataset into square/rectangular cells, for example by using the integer parts of lat/lon as grid indexes. When searching, load all points from the enclosing grid cell(s) and perform a full neighbour scan inside your application (see the sketch after this list).
This works well if you have an evenly distributed dataset, like the grid points in an NWP simulation that I have worked with.
It works really badly on datasets like "restaurants in the USA", where most of the points cluster around large cities: you get an unbalanced, high load on grid cells like the New York area and absolutely empty index buckets located somewhere in the Atlantic Ocean.
External indexes like ElasticSearch/Solr/Sphinx/etc.
All of them have geospatial indexing support out-of-the-box, no need to develop your own in your application layer.
You have to set up a separate indexing service and keep the Cassandra and index data in sync. There are some Cassandra/search integrations, like DSE (commercial) or stargate-core (I've never heard of anyone using it in production), or you can roll your own, but all of these require time and effort.
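To make option 1 concrete, here is a rough sketch with the Python cassandra-driver: one partition per grid cell keyed on the integer parts of lat/lon, and a bounding-box query that scans the covering cells and filters exactly in the application layer. The keyspace, table and column names are made up.

```python
from cassandra.cluster import Cluster

# Hypothetical schema:
# CREATE TABLE geo.points_by_cell (
#     cell_x int, cell_y int, point_id uuid, lat double, lon double,
#     PRIMARY KEY ((cell_x, cell_y), point_id)
# );

CELL_SIZE = 1.0  # degrees: integer parts of lat/lon define the grid

def cell_of(lat, lon):
    return int(lat // CELL_SIZE), int(lon // CELL_SIZE)

def bounding_box_query(session, lat_min, lat_max, lon_min, lon_max):
    """Read every grid cell overlapping the bounding box, then filter exactly."""
    x_min, y_min = cell_of(lat_min, lon_min)
    x_max, y_max = cell_of(lat_max, lon_max)
    stmt = session.prepare(
        "SELECT point_id, lat, lon FROM geo.points_by_cell "
        "WHERE cell_x = ? AND cell_y = ?")
    hits = []
    for x in range(x_min, x_max + 1):
        for y in range(y_min, y_max + 1):
            for row in session.execute(stmt, (x, y)):
                if lat_min <= row.lat <= lat_max and lon_min <= row.lon <= lon_max:
                    hits.append(row)
    return hits

session = Cluster(["127.0.0.1"]).connect()
nearby = bounding_box_query(session, 40.5, 40.9, -74.3, -73.7)
```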
This issue was touched on at the Euro Cassandra Summit in 2014.
RedHat: Scalable Geospatial Indexing with Cassandra
The presenter explains how he created a spatial index using user-defined types that is very well suited to querying geospatial data with a region- or bounding-box-based lookup.
The general idea is to break your data up into regions defined by bounding boxes. Each region then corresponds to a row key, which you can use to access any data associated with that region. If you have a location of interest, you query the keyspace for the regions that fall inside that area.