My Marqo index is 10x larger than the data itself. How can I reduce the size of the index? - marqo

I have indexed 10 GB of data into a Marqo index, and the index is now over 100 GB. Can anyone tell me why this might have occurred?
Here is an example of the data:
{"title": "some title", "article_text": "a long text field containing text in an article", "publish_date": 43132132132, "popularity": 4.221}
If I put the data into an inverted-index store like ES, the index is significantly smaller (around 20 GB). However, I want to use semantic search on the text, so ES won't work for my use case.
I also want to potentially use Marqo's GPT integration, which comes out of the box.

I think there are a couple of things to try, and it will depend a little on the structure of the data. The best thing to try is adjusting how the text is split; see https://marqo.pages.dev/0.0.10/Preprocessing/Text/ . For example, the default settings split the text by sentence and use a split length of 2. Changing this to 5, 10, or 20 should be fine and will reduce the storage roughly in proportion (going from 2 to 20 is a ~7-10x reduction).
import marqo

mq = marqo.Client(url="http://localhost:8882")  # assumes a local Marqo instance

settings = {
    "index_defaults": {
        "text_preprocessing": {
            "split_length": 4,
            "split_overlap": 0,
            "split_method": "sentence"
        },
    },
}
response = mq.create_index("my-multimodal-index", settings_dict=settings)
One thing to note is that the split size should fit within the context length of the model being used. Default text models mostly have a context length of 128 tokens, but many (e.g. BERT-based ones) can go to 512. Splitting at 10 or 20 sentences should be fine for the defaults, but the context length can also be adjusted by specifying custom parameters in the model selection (see https://marqo.pages.dev/0.0.10/Models-Reference/dense_retrieval/#generic-models). For a default model, the context length can be increased via:
settings = {
    "index_defaults": {
        "text_preprocessing": {
            "split_length": 5,
            "split_overlap": 0,
            "split_method": "sentence"
        },
        "model": "unique-model-alias",
        "model_properties": {
            "name": "all_datasets_v4_MiniLM-L6",
            "dimensions": 384,
            "tokens": 256,
            "type": "sbert"
        },
        "normalize_embeddings": True,
    },
}
response = mq.create_index("my-generic-model-index", settings_dict=settings)
which would double the context to 256 tokens. The mapping between tokens and words is not exact, but for English text it averages out to roughly 1-1.5 tokens per word.
The final option would be to use a model with a lower embedding dimension. However, the space savings from this are limited, and there may not be many suitable models to choose from.

Note that because Marqo uses semantic search, text data is inherently more expensive to store. However, there are a few options you can explore to reduce the size of your index:
Option 1: Use non_tensor_fields. Tensor fields are encoded into collections of vectors, which means they take up significantly more storage than regular inverted indexes. This is what lets Marqo apply semantic search, but it is also more costly on storage. Therefore, any text or attributes that you don't want to use semantic search on can be made non_tensor_fields, which means they are stored only for filtering and for lexical search. Assuming you only want to do semantic search on the article_text, you could do:
mq.index("your-index").add_documents([{"title": "some title", "article_text": "a long text field containing text in an article", "publish_date": 43132132132, "popularity": 4.221}], non_tensor_fields=["popularity", "publish_date", "title"])
Option 2: Adjust your chunking strategy - check the settings you used to create the index.
Marqo chunks based on the number of sentences in a text field.
If you use a larger split length, you get fewer vectors in the collection. In the example below the split length is 2 and the split overlap is 0.
index_settings = {
    "index_defaults": {
        "text_preprocessing": {
            "split_length": 2,
            "split_overlap": 0,
            "split_method": "sentence"
        }
    }
}
mq.create_index("my-first-index", settings_dict=index_settings)
If we changed this to a split length of 2 with an overlap of 1, we would effectively double the number of vectors for each article_text field, significantly increasing the storage size of the index.
You should check these settings and see if you can increase the split length (for example, 6 can work) and reduce the split overlap.
Note that this comes with some retrieval drawbacks: attention effectively tapers off over long inputs (and its cost grows quadratically with length), so shorter, overlapping chunks generally perform better when searching for specific information.
In the same vein, if you were to use image patching, you would also get multiple vectors per image.

Related

Recognizing license plate characters using template characters in Python

For a university project I have to recognize characters from a license plate. I have to do this using Python 3. I am not allowed to use OCR functions or functions that rely on deep learning or neural networks. I have reached the point where I am able to segment the characters from a license plate and transform them to a uniform format. A few examples of segmented characters are here.
The format of the segmented characters is very dependent on the input. However, I can easily convert them to uniform dimensions using OpenCV. Additionally, I have a set of template characters and numbers that I can use to predict what character / number it is.
I therefore need a metric to express the similarity between the segmented character and the reference image. In this way, I can say that the reference image with the highest similarity score matches the segmented character. I have tried the following ways to compute the similarity.
For these operations I have made sure that the reference characters and the segmented characters have the same dimensions.
A bitwise XOR operator.
Inverting the reference characters and comparing them pixel by pixel: if a pixel matches, increment the similarity score; if it does not, decrement it.
Hashing both the segmented character and the reference character using 'imagehash', then comparing the hashes to see which ones are most similar.
None of these methods succeeds in giving me an accurate prediction for all characters. Most characters are usually predicted correctly, but the program consistently confuses characters like 8-B, D-0, 7-Z, and P-R.
Does anybody have an idea how to predict the segmented characters? I.e. defining a better similarity score.
Edit: Unfortunately, cv2.matchTemplate and cv2.matchShapes are not allowed for this assignment...
The general procedure for comparing two images consists of extracting features from the two images and then comparing them. What you are actually doing in the first two methods is treating the value of every pixel as a feature. The similarity measure is therefore a distance computation in a very high-dimensional space. These methods are, however, sensitive to noise, so they require very big datasets to obtain acceptable results.
For this reason, usually one attempts to reduce the space dimensionality. I'm not familiar with the third method, but it seems to go in this direction.
A way to reduce the space dimensionality consists in defining some custom features meaningful for the problem you are facing.
A possibility for the character classification problem could be to define features that measure the response of the input image on strategic subshapes of the characters (an upper horizontal line, a lower one, a circle in the upper part of the image, a diagonal line, etc.).
You could define a minimal set of shapes that, combined together, can generate every character. Then you would retrieve one feature per shape by measuring the response of the original image on that particular shape (i.e., integrating the signal of the input image inside the shape). Finally, you would determine the class the image belongs to by taking the nearest reference point in this smaller feature space.
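As a rough illustration of that idea (not part of the original answer), here is a minimal sketch assuming binary character images already resized to a common 32x48 shape; the particular masks, sizes, and threshold are placeholders you would tune for real plates.

import numpy as np

H, W = 48, 32   # height and width of the normalized character images

def build_masks():
    # Boolean masks for a few "strategic subshapes" of a character.
    top = np.zeros((H, W), dtype=bool);    top[:H // 6, :] = True
    bottom = np.zeros((H, W), dtype=bool); bottom[-(H // 6):, :] = True
    left = np.zeros((H, W), dtype=bool);   left[:, :W // 6] = True
    right = np.zeros((H, W), dtype=bool);  right[:, -(W // 6):] = True
    centre = np.zeros((H, W), dtype=bool)
    centre[H // 3:2 * H // 3, W // 3:2 * W // 3] = True
    return {"top": top, "bottom": bottom, "left": left, "right": right, "centre": centre}

MASKS = build_masks()

def features(img):
    # Response of the image on each mask: the mean amount of "ink" inside it.
    img = (img > 127).astype(float)     # binarize a 0-255 grayscale image
    return np.array([img[m].mean() for m in MASKS.values()])

def classify(segmented, references):
    # references: dict mapping a character label to its reference image.
    f = features(segmented)
    return min(references, key=lambda ch: np.linalg.norm(f - features(references[ch])))

With only five masks this will still confuse similar glyphs, but adding a handful of targeted masks (e.g. a diagonal band to separate 7 from Z, or an upper loop to separate 8 from B) is exactly where this approach pays off.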

Getting less than 1 score while trying to check the cosine similarities of same document

I have used doc2vec to find the similarities between multiple documents, but when I check the very document I trained my model on, the score should be '1', right? Since the training document and the document to be predicted are the same. Sadly, I am getting a different score when computing the similarity. The code is attached below. Please tell me how to make this right, I can't find what is wrong here. Please help me... doc2vec - calculating document cosine similarity
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

df['Tagged_data'] = df['sent_to_word_tokenize_text'].apply(
    lambda x: [TaggedDocument(d, [i]) for i, d in enumerate(x)])
sadguru_model = Doc2Vec(df['Tagged_data'][0], vector_size=1000, window=500,
                        dm=1, min_count=1, workers=2, epochs=100)
test_doc = word_tokenize(' '.join([word for word in df['Sentence_Tokenized_Text'][0]]))
# Sadguru model document
index0 = sadguru_model.docvecs.most_similar(
    positive=[sadguru_model.infer_vector(test_doc)], topn=1)
# output: index0 = [(4014, 0.5270981788635254)]
Doc2Vec doesn't discover true, unique vectors for every input document. Rather, it progressively learns useful-approximation vectors, using an internal algorithm that itself makes use of a lot of random initialization and random sampling. As a result:
if your training data includes the same document (same words) twice, with different document-ids, they won't get identical vectors
re-inferring vectors on a trained model, with the exact same words as an in-training document, won't result in identical vectors to the same original document
For more info, see the Gensim FAQ questions 11 & 12.
If your data & parameters are sufficient, then you can expect that two identical documents should have "very close" vectors, and a re-inference of the same document-words creates a vector "very close" to the same document in the original training set. (There's no precise definition of "very close", but in a working model, such same-word documents will be closer to each other than other documents in the training set.)
So you should expect 'high' similarities approaching 1.0, but essentially never 1.0 exactly, unless you've made two identical vectors on purpose with a lot of special effort.
However, you're not even seeing that 'very close' result, because it looks like your training parameters (and probably, training corpus) are way out-of-whack compared to normal or best practices. Specifically:
A vector_size=1000 is only appropriate for gigantic datasets, of millions (ideally tens-of-millions) of documents. If you're using vectors larger than your data can fill with meaningful distinctions, your model's results will appear increasingly random - especially in the case of identical or very-similar documents, because now instead of the stochastic, iterative process gradually nudging them to the same 'neighborhood' of values, they could wind up all over the place.
A window=500 is unprecedented. The default is 5; sometimes values up to 20 are used, or occasionally giant values, but only if the documents themselves are tiny, such that the effective window is still just "the whole document of a manageable size". On a real-sized corpus with documents over 500 words, window=500 would be amazingly expensive to calculate and would likely result in far worse vectors than a more typical value.
A min_count=1 is almost always a bad idea. Words that appear only once, or a few times, don't have the variety of subtly-varying uses that are needed for Doc2Vec (& related algorithms like Word2Vec, FastText, etc) to learn meaningful representations. Instead, single/rare uses contribute weird nonrepresentative examples, and often just function as noise preventing other words with enough examples from being better-understood. Far more people should be increasing the value over 5, as their training data grows, than reducing it.
An epochs=100 is highly uncommon, mostly used if struggling to squeeze some results from insufficient data by intensively re-training on it. (The cases where that makes the most sense would also be those where, due to small data, you decrease the vector_size to below the default of 100.) For Doc2Vec, epochs of 10-20 is most common in published results.
Try a vector_size no larger than the square root of the count of unique documents you have, leave the min_count at its default (or at least 2), leave the window at its default (unless you specifically have very-small documents), and try epochs=20 (unless you have very few documents and find improvement with slightly more).
Then you'll likely find your self-similarity test to return some high value – perhaps 0.9 or more – rather than 0.52, but still not 1.0.
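As a concrete (hypothetical) illustration of those recommendations, a minimal sketch might look like the following, assuming tokenized_docs is a list of pre-tokenized documents (lists of words); the exact numbers should be tuned to your corpus size.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tagged = [TaggedDocument(words, [i]) for i, words in enumerate(tokenized_docs)]

model = Doc2Vec(
    tagged,
    vector_size=100,   # modest; no larger than ~sqrt(number of documents)
    window=5,          # the default; only enlarge for very short documents
    min_count=5,       # the default; raise it as the corpus grows
    epochs=20,
    workers=4,
)

# Re-infer the first training document and check self-similarity: expect a
# high value (often 0.9+), but essentially never exactly 1.0.
inferred = model.infer_vector(tagged[0].words)
print(model.dv.most_similar([inferred], topn=1))   # model.docvecs in Gensim 3.x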

Find most repeated phrase on huge text

I have huge text data. My entire database is in text format, in UTF-8.
I need a list of the most repeated phrases across my whole text data.
For example, my desired output is something like this:
{
'a': 423412341,
'this': 423412341,
'is': 322472341,
'this is': 222472341,
'this is a': 122472341,
'this is a my': 5235634
}
Processing and storing every phrase takes a huge amount of database space, for example in MySQL or MongoDB.
The question is: is there a more efficient database or algorithm for finding this result?
Solr, Elasticsearch, etc.?
I think a maximum of 10 words per phrase will be good enough for me.
I'd suggest combining ideas from two fields here: Streaming Algorithms, and the Apriori Algorithm from Market-Basket Analysis.
Let's start with the problem of finding the k most frequent single words without loading the entire corpus into memory. A very simple algorithm, Sampling (see Finding Frequent Items in Data Streams), can do so very easily. Moreover, it is very amenable to parallel implementation (described below). There is a plethora of work on top-k queries, including some on distributed versions (see, e.g., Efficient Top-K Query Calculation in Distributed Networks).
Now to the problem of the k most frequent phrases (of possibly multiple words). Clearly, the most frequent phrases of length l + 1 must contain the most frequent phrases of length l as a prefix, as appending a word to a phrase cannot increase its popularity. Hence, once you have the k most frequent single words, you can scan the corpus for only them (which is faster) to build the most frequent phrases of length 2. Using this, you can build the most frequent phrases of length 3, and so on. The stopping condition is when a phrase of length l + 1 does not evict any phrase of length l.
A Short Description of The Sampling Algorithm
This is a very simple algorithm which will, with high probability, find the top k items out of those having frequency at least f. It operates in two stages: the first finds candidate elements, and the second counts them.
In the first stage, randomly select ~ log(n) / f words from the corpus (note that this is much less than n). With high probability, all your desired words appear in the set of these words.
In the second stage, maintain a dictionary of the counts of these candidate elements; scan the corpus, and count the occurrences.
Output the top k of the items resulting from the second stage.
Note that the second stage is very amenable to parallel implementation. If you partition the text into different segments, and count the occurrences in each segment, you can easily combine the dictionaries at the end.
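A minimal sketch of those two stages (an illustration, not the original answer's code), assuming a corpus_words() generator that can be iterated twice and an illustrative choice of sample size:

import math
import random
from collections import Counter

def top_k_by_sampling(corpus_words, n, f, k, seed=0):
    # corpus_words: callable returning an iterator over the corpus words,
    # n: total number of words, f: minimum frequency of interest (0..1).
    rng = random.Random(seed)
    sample_size = math.log(max(n, 2)) / f

    # Stage 1: randomly select candidate words; with high probability every
    # word of frequency >= f appears among them.
    p = min(1.0, sample_size / n)
    candidates = {w for w in corpus_words() if rng.random() < p}

    # Stage 2: exact counts, but only for the candidates. This stage
    # parallelizes easily: count per segment, then sum the Counters.
    counts = Counter(w for w in corpus_words() if w in candidates)
    return counts.most_common(k)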
If you can store the data in Apache Solr, then the Luke Request Handler could be used to find the most common phrases. Example query:
http://127.0.0.1:8983/solr/admin/luke?fl=fulltext&numTerms=100
Additionally, the Terms Component may help find the most common individual words. Here is an article about Self Updating Solr Stopwords which uses the Terms Component to find the 100 most common indexed words and add them to the Stopwords file. Example query:
http://127.0.0.1:8983/solr/terms?terms.fl=fulltext&terms.limit=100
Have you considered using MapReduce?
Assuming you have access to a proper infrastructure, this seems to be a clear fit for it. You will need a tokenizer that splits lines into multi-word tokens of up to 10 words; I don't think that's a big deal. The outcome of the MR job will be token -> frequency pairs, which you can pass to another job to sort by frequency (one option). I would suggest reading up on Hadoop/MapReduce before considering other solutions. You may also use HBase to store any intermediary outputs.
Original paper on MapReduce by Google.
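For illustration only, here is a local sketch of the map/reduce shape described above; a real Hadoop Streaming job would put the mapper and reducer in separate scripts reading stdin and emitting "token \t count" lines.

from collections import Counter
from itertools import chain

MAX_N = 10   # longest phrase, in words

def mapper(line):
    # Emit (ngram, 1) pairs for every 1..MAX_N-word phrase in the line.
    words = line.split()
    for n in range(1, MAX_N + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n]), 1

def reducer(pairs):
    # Sum the counts per token; Hadoop would group the keys between stages.
    counts = Counter()
    for token, c in pairs:
        counts[token] += c
    return counts

lines = ["this is a test", "this is a my test"]
print(reducer(chain.from_iterable(mapper(l) for l in lines)).most_common(5))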
Tokenize the text into 1- to 10-word tokens and insert them into 10 SQL tables, one per token length. Make sure to use a hash index on the column with the string tokens. Then just call SELECT token, COUNT(*) FROM tablename GROUP BY token on each table, dump the results somewhere, and wait.
EDIT: that would be infeasible for large datasets; instead, for each N-gram, update the count by +1 or insert a new row into the table (in MySQL, INSERT ... ON DUPLICATE KEY UPDATE is the useful query here). You should definitely still use hash indexes, though.
After that, just sort by number of occurrences and merge the data from the 10 tables (you could do that in a single step, but that would put more strain on memory).
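As a rough sketch of the per-N-gram upsert, using sqlite3 purely as a local stand-in for MySQL (ON CONFLICT plays the role of INSERT ... ON DUPLICATE KEY UPDATE, and the primary-key index here is a B-tree rather than a hash index); a single table keyed by (n, token) replaces the 10 tables for brevity:

import sqlite3

conn = sqlite3.connect("ngrams.db")
conn.execute("""CREATE TABLE IF NOT EXISTS ngrams (
                    n INTEGER, token TEXT, count INTEGER,
                    PRIMARY KEY (n, token))""")

def add_line(line, max_n=10):
    words = line.split()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            conn.execute(
                """INSERT INTO ngrams (n, token, count) VALUES (?, ?, 1)
                   ON CONFLICT(n, token) DO UPDATE SET count = count + 1""",
                (n, " ".join(words[i:i + n])))
    conn.commit()

# The final "dump and sort" step per token length is then just:
#   SELECT token, count FROM ngrams WHERE n = ? ORDER BY count DESC LIMIT 100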
Be wary of heuristic methods like the one suggested by Ami Tavory: if you select the wrong parameters, you can get wrong results (the flaw of the sampling algorithm shows up on some classic terms or phrases - e.g. "habeas corpus" - neither "habeas" nor "corpus" will be selected as frequent by itself, but as a two-word phrase it may very well rank higher than some phrases you get by appending/prepending a word to a common word). There is surely no need to use them for tokens of shorter length; use them only when classic methods fail (take too much time or memory).
The top answer by Ami Tavory states:
Clearly, the most frequent phrases of length l + 1 must contain the most frequent phrases of length l as a prefix, as appending a word to a phrase cannot increase its popularity.
While it is true that appending a word to a phrase cannot increase its popularity, there is no reason to assume that the frequency of 2-grams is bounded by the frequency of 1-grams. To illustrate, consider the following corpus (constructed specifically to illustrate this point):
Here, a tricksy corpus will exist; a very strange, a sometimes cryptic corpus will dumbfound you maybe, perhaps a bit; in particular since my tricksy corpus will not match the pattern you expect from it; nor will it look like a fish, a boat, a sunflower, or a very handsome kitten. The tricksy corpus will surprise a user named Ami Tavory; this tricksy corpus will be fun to follow a year or a month or a minute from now.
Looking at the most frequent single words, we get:
1-Gram Frequency
------ ---------
a 12
will 6
corpus 5
tricksy 4
or 3
from 2
it 2
the 2
very 2
you 2
The method suggested by Ami Tavory would identify the top 1-gram, 'a', and narrow the search to 2-grams with the prefix 'a'. But looking at the corpus from before, the top 2-grams are:
2-Gram Frequency
------ ---------
corpus will 5
tricksy corpus 4
or a 3
a very 2
And moving on to 3-grams, there is only a single repeated 3-gram in the entire corpus, namely:
3-Gram Frequency
------ ---------
tricksy corpus will 4
To generalize: you can't use the top m-grams to extrapolate directly to top (m+1)-grams. What you can do is throw away the bottom m-grams, specifically the ones which do not repeat at all, and look at all the ones that do. That narrows the field a bit.
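A minimal sketch of that pruning idea (an illustration under the assumption that the corpus is available as one list of words; names and the min_count threshold are arbitrary):

from collections import Counter

def repeated_ngrams(words, max_n=10, min_count=2):
    surviving = {}      # n -> Counter of n-grams that occur >= min_count times
    prev = None         # set of surviving (n-1)-grams
    for n in range(1, max_n + 1):
        counts = Counter()
        for i in range(len(words) - n + 1):
            gram = tuple(words[i:i + n])
            # An n-gram can only repeat if its (n-1)-word prefix repeats,
            # so anything whose prefix was already discarded can be skipped.
            if prev is None or gram[:-1] in prev:
                counts[gram] += 1
        kept = Counter({g: c for g, c in counts.items() if c >= min_count})
        if not kept:
            break
        surviving[n] = kept
        prev = set(kept)
    return surviving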
This can be simplified greatly. You don't need a database at all. Just store the full text in a file. Then write a PHP script to open and read the file contents. Use the PHP regex function to extract matches. Keep the total in a global variable. Write the results to another file. That's it.

approximate histogram for streaming string values (card catalog algorithm?)

I have a large list (or stream) of UTF-8 strings sorted lexicographically. I would like to create a histogram with approximately equal values for the counts, varying the bin width as necessary to keep the counts even. In the literature, these are sometimes called equi-height, or equi-depth histograms.
I'm not looking to do the usual word-count bar chart; I'm looking for something more like an old-fashioned library card catalog where you have a set of drawers (bins), and one might hold SAM - SOLD, and the next bin SOLE-STE, while all of Y-ZZZ fits in a single bin. I want to calculate where to put the cutoffs for each bin.
Is there (A) a known algorithm for this, similar to approximate histograms for numeric values? or (B) suggestions on how to encode the strings in a way that a standard numeric histogram algorithm would work. The algorithm should not require prior knowledge of string population.
The best way I can think to do it so far is to simply wait until I have some reasonable amount of data, then form logical bins by:
number_of_strings / bin_count = number_of_strings_in_each_bin
Then, starting at 0, step forward by number_of_strings_in_each_bin to get the bin endpoints.
This has two weaknesses for my use-case. First, it requires two iterations over a potentially very large number of strings, one for the count, one to find the endpoints. More importantly, a good histogram implementation can give an estimate of where in a bin a value falls, and this would be really useful.
Thanks.
If you can't make any assumptions about the data, you are going to have to make a pass to determine the bin size.
This means that you have to either start with a bin size rather than bin number or live with a two-pass model. I'd just use linear interpolation to estimate positions between bins, then do a binary search from there.
Of course, if you can make some assumptions about the data, here are some that might help:
For example, you might not know the exact size, but you might know that the value will fall in some interval [a, b]. If you want at most n bins, make the bin size == a/n.
Alternatively, if you're not particular about exactly equal-sized bins, you could do it in one pass by sampling every m-th element and dumping it into an array, where m is something reasonable based on context.
Then, to find the bin endpoints, you'd take every (size / n / m)-th element in your array.
The solution I came up with addresses the lack of up-front information about the population by using reservoir sampling. Reservoir sampling lets you efficiently take a random sample of a given size, from a population of an unknown size. See Wikipedia for more details. Reservoir sampling provides a random sample regardless of whether the stream is ordered or not.
We make one pass through the data, gathering a sample. For the sample we have explicit information about the number of elements as well as their distribution.
For the histogram, I used a Guava RangeMap. I picked the endpoints of the ranges to provide an even number of results in each range (sample_size / number_of_bins). The Integer in the map merely stores the order of the ranges, from 1 to n. This allows me to estimate the proportion of records that fall within two values: If there are 100 equal sized bins, and the values fall in bin 25 and bin 75, then I can estimate that approximately 50% of the population falls between those values.
This approach has the advantage of working for any Comparable data type.
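For comparison, here is a minimal Python sketch of the same approach (the original solution used Java and Guava's RangeMap): reservoir-sample the stream, then take evenly spaced order statistics of the sorted sample as bin boundaries. The sample size of 10,000 is illustrative.

import random

def reservoir_sample(stream, k, seed=0):
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)        # classic Algorithm R
            if j < k:
                sample[j] = item
    return sample

def bin_boundaries(stream, num_bins, sample_size=10000):
    sample = sorted(reservoir_sample(stream, sample_size))
    step = len(sample) / num_bins
    # Upper endpoint of every bin except the last (which is open-ended).
    return [sample[int(round((b + 1) * step)) - 1] for b in range(num_bins - 1)]

# To estimate where a value falls, bisect it against the boundaries; the
# resulting bin index over num_bins approximates its percentile.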

How can i cluster document using k-means (Flann with python)?

I want to cluster documents based on similarity.
I have tried ssdeep (similarity hashing), which is very fast, but I was told that k-means is faster and FLANN is the fastest of all implementations, and more accurate, so I am trying FLANN with Python bindings, but I can't find any example of how to do it on text (it only supports arrays of numbers).
I am very, very new to this field (k-means, natural language processing). What I need is speed and accuracy.
My questions are:
Can we do document similarity grouping / clustering using k-means? (FLANN does not seem to allow any text input.)
Is FLANN the right choice? If not, please suggest a high-performance library that supports text/document clustering and has a Python wrapper/API.
Is k-means the right algorithm?
You need to represent your document as an array of numbers (aka, a vector). There are many ways to do this, depending on how sophisticated you want to be, but the simplest way is just to represent it as a vector of word counts.
So here's what you do:
Count up the number of times each word appears in the document.
Choose a set of "feature" words that will be included in your vector. This should exclude extremely common words (aka "stopwords") like "the", "a", etc.
Make a vector for each document based on the counts of the feature words.
Here's an example.
If your "documents" are single sentences, and they look like (one doc per line):
there is a dog who chased a cat
someone ate pizza for lunch
the dog and a cat walk down the street toward another dog
If my set of feature words are [dog, cat, street, pizza, lunch], then I can convert each document into a vector:
[1, 1, 0, 0, 0] // dog 1 time, cat 1 time
[0, 0, 0, 1, 1] // pizza 1 time, lunch 1 time
[2, 1, 1, 0, 0] // dog 2 times, cat 1 time, street 1 time
You can use these vectors in your k-means algorithm and it will hopefully group the first and third sentence together because they are similar, and make the second sentence a separate cluster since it is very different.
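A minimal sketch of that pipeline using scikit-learn (an assumed dependency; FLANN itself only does nearest-neighbour search, so the clustering step here uses sklearn's KMeans instead):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

docs = [
    "there is a dog who chased a cat",
    "someone ate pizza for lunch",
    "the dog and a cat walk down the street toward another dog",
]
features = ["dog", "cat", "street", "pizza", "lunch"]

vectors = CountVectorizer(vocabulary=features).fit_transform(docs)
print(vectors.toarray())   # should match the hand-built vectors above

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)              # sentences 1 and 3 should end up in the same cluster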
There is one big problem here:
K-means is designed for Euclidean distance.
The key problem is the mean function. The mean will reduce variance for Euclidean distance, but it might not do so for a different distance function. So in the worst case, k-means will no longer converge, but run in an infinite loop (although most implementations support stopping at a maximum number of iterations).
Furthermore, the mean is not very sensible for sparse data, and text vectors tend to be very sparse. Roughly speaking, the problem is that the mean of a large number of documents no longer looks like a real document; it becomes dissimilar to every real document and more similar to the other mean vectors. So the results degenerate to some extent.
For text vectors, you probably will want to use a different distance function such as cosine similarity.
And of course you first need to compute numeric vectors, for example by using relative term frequencies and normalizing them via TF-IDF.
There is a variation of the k-means idea known as k-medoids. It can work with arbitrary distance functions, and it avoids the whole "mean" thing by using the real document that is most central to the cluster (the "medoid"). But the known algorithms for this are much slower than k-means.
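As a hedged sketch of the TF-IDF plus cosine route: with L2-normalized vectors, Euclidean k-means behaves much like clustering by cosine similarity, which is a common workaround when a true k-medoids or spherical k-means implementation isn't at hand (scikit-learn is assumed here):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "there is a dog who chased a cat",
    "someone ate pizza for lunch",
    "the dog and a cat walk down the street toward another dog",
]

# TfidfVectorizer L2-normalizes each document vector by default (norm="l2"),
# so Euclidean distances between rows are monotone in cosine similarity.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

print(cosine_similarity(X))    # pairwise cosine similarities
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))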
