Redis strings or sets?

I want to store person/food pairs; a person eats a food within 5 minutes, so I need the key to expire. I see two ways to do this, strings or sets:
1. SETEX person:tommy:food:chicken 300 anything
   SETEX person:tommy:food:chip 300 anything
   ...
2. SADD person:tommy chicken chip
   EXPIRE person:tommy 300
Which is better? Or is there other way?

I would go with the second option. My reasons:
- Since there is no other information attached to entries, there is no reason to give each item its own key (this would be different if each item needed to store a hash of information, for example).
- You can find out how many food items are linked to a person (SCARD) without creating another key.
- You have all of the other set operations available to you if you need to manipulate or compare several sets.
- If you read the documentation on how Redis expires keys, you'll notice that Redis takes an active approach to expiring keys. With the second option's design, all food data associated with a person is guaranteed to be removed at the same time, when the key expires.
To some degree, this question seems analogous to using individual variables in a programming language, as opposed to an array:
a = 3, b = 5, c = 7
sum = a + b + c
versus
items = [3, 5, 7]
sum = sum(items)
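A minimal sketch of the second option with the redis-py client (the helper names `record_meal` and `foods_eaten` are illustrative, not part of any API):

```python
def record_meal(r, person, food, ttl=300):
    """SADD the food into the person's set and (re)set the 5-minute expiry."""
    key = f"person:{person}"
    r.sadd(key, food)     # SADD person:tommy chicken
    r.expire(key, ttl)    # EXPIRE person:tommy 300

def foods_eaten(r, person):
    """SMEMBERS: all foods currently associated with the person."""
    return r.smembers(f"person:{person}")

# Usage with redis-py:
#   import redis
#   r = redis.Redis(decode_responses=True)
#   record_meal(r, "tommy", "chicken")
#   record_meal(r, "tommy", "chip")
#   foods_eaten(r, "tommy")  # {"chicken", "chip"}
```

One thing to be aware of: the two options expire differently. With one set per person the TTL belongs to the key, so every `record_meal` pushes the whole set's expiry forward, whereas option 1 expires each food independently.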

Related

inverted index sets - querying key prefixes

I'm using Redis to build an inverted index system mapping words to the documents that contain them.
The setup is really simple: Redis sets, where the key of the set is i:word and the members of the set are the IDs of the documents that contain this word.
Let's say I have two sets: i:example and i:result.
The query "example result" will intersect i:example and i:result and return all the IDs that have both example and result as members.
But what I'm looking for is a way to perform (efficiently) a query like "ex res". The result set should contain at least all the IDs from the query "example result".
Solutions I have thought of:
Create prefix sets of size 2: p:ex contains {"example", "expertise", "ex", ...}. The lookup running time will not be a problem: O(1) to get the set and O(n) to check all elements in the set for words that start with the prefix (where n = set size), but I worry about the added space cost.
Use SCAN: but I'm not sure about the running time. A query like SCAN 0 MATCH ex* takes O(n), where n is the number of keys in the DB. I know Redis is fast, but it's probably not an efficient solution for a query like "ex machi cont".
The usual way to go about this is the first approach you mentioned, but usually you'd go with segments that are 3+ characters long. Note that you'll need a set for each segment, e.g. i:exa, i:exam, i:examp, i:exampl and of course i:example.
This will naturally take up space in your database (hence the suggestion to start at 3 rather than 2 characters). A possible tweak is to keep in the i:len(3) sets only references to the i:len(4+) sets instead of document ids. This will require more read operations but will give significant savings in terms of RAM.
You should explore v2.8.9's addition of lexicographical ranges for Sorted Sets. By calling ZRANGEBYLEX you can get ranges of members (e.g. all the words that start with ex). While this could be useful in this context by itself, consider that you can also use your Sorted Set's members creatively to encode a word and its document reference. This can help you get over the "loss" of the score (since all scores need to be the same for lexicographical ordering to work). For example, assuming the words "beg" and "bed" in docs 1 and 2:
ZADD index 0 "beg:1" 0 "bed:2"
Lastly, here's a little something to think about too - adding suffix searching (e.g., everything that ends with "ample"): https://redislabs.com/blog/how-to-use-redis-at-least-x1000-more-efficiently
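To make the ZRANGEBYLEX idea concrete, here's a pure-Python simulation of how equal-score members are ordered and queried by prefix; with redis-py the equivalent calls would be `r.zadd("index", {"beg:1": 0, "bed:2": 0})` and `r.zrangebylex("index", "[ex", "[ex\xff")` (the helper names below are illustrative):

```python
import bisect

def zadd_lex(index, word, doc_id):
    """Keep the list sorted: with all scores equal, Redis orders sorted-set
    members lexicographically, exactly like this sorted list."""
    bisect.insort(index, f"{word}:{doc_id}")

def prefix_lookup(index, prefix):
    """(word, doc_id) pairs whose word starts with prefix -- the analogue of
    ZRANGEBYLEX index [prefix [prefix\xff."""
    lo = bisect.bisect_left(index, prefix)
    hi = bisect.bisect_left(index, prefix + "\xff")
    return [tuple(member.rsplit(":", 1)) for member in index[lo:hi]]
```

The `word:doc_id` encoding is what lets a single lexicographical range return both the matching words and their document references.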

Algorithm of unique user identity

I'm writing service for anonymous commenting (plugin for social network).
I have to generate a pseudo-unique number for each user per thread.
So, each post has a unique number (for example, 6345) and each user has a unique id (9144024). Using this information I need to generate a unique index into an array of avatars.
Let's say, there is array with 312 images, it's static and all images are in the same order every time.
Now the algorithm looks like this:
(post id + user id) % number of images = index
(6345 + 9144024) % 312 = 33
And in the comment I show the image with index 33. The problem is that it's possible to find the user id from the image if someone figures out how images are assigned (the image list is always in the same order).
What is the best way to do this without storing per-post data in a database, for example?
You are looking for a kind of one-way function: computing the image id from the user id should be easy, but not the converse. The first thing that comes to my mind here is using hash functions: simply concatenate the user id and the post id, perhaps with some salt, then compute the SHA-1 hash of that, and take that modulo the number of images.
In this approach, I'd interpret the hash result as a single 160-bit integer. If you don't have a big-integer library at hand, you can do the modulo computation incrementally: start with the highest byte, and then in each step multiply the current value by 256 (2^8), add the next byte, and reduce the sum modulo 312. You could also simply take the lowest 32 or 64 bits and perform the modulo on that, although the result of that approach might be less evenly distributed than the one outlined above.
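A sketch of both variants in Python (the salt value is a made-up placeholder; 312 matches the avatar-array size from the question):

```python
import hashlib

NUM_IMAGES = 312
SALT = b"keep-this-secret"   # placeholder: without a private salt the mapping is guessable

def avatar_index(post_id, user_id):
    """One-way map from (post_id, user_id) to an image index via SHA-1."""
    digest = hashlib.sha1(SALT + f"{post_id}:{user_id}".encode()).digest()
    return int.from_bytes(digest, "big") % NUM_IMAGES   # 160-bit int mod 312

def avatar_index_incremental(post_id, user_id):
    """Same result without big integers: reduce modulo NUM_IMAGES byte by byte."""
    digest = hashlib.sha1(SALT + f"{post_id}:{user_id}".encode()).digest()
    value = 0
    for byte in digest:
        value = (value * 256 + byte) % NUM_IMAGES   # multiply by 2^8, add next byte
    return value
```

The incremental loop is just Horner's rule, so the two functions always agree.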

How to generate unique string?

I have a set of devices, and for each device I wish to generate a string of the following format: XXXXXXXX. Each X is either B, G, or R. An example is GRBRRBRB. This gives me 3^8 = 6561 keys to work with, which is enough as I doubt I'll have more devices.
I was thinking I could generate them all before hand, and dump them in a file or something, and just get the next key available from that, but I wonder if there is a better way to do this.
I know there are better ways to do it if I don't need guaranteed uniqueness, but I definitely need that, so I'm not sure what the best way is.
Treat it as a ternary representation of a number, where R=0, B=1, G=2. So when you're writing the nth ID, the first digit is R if n % 3 == 0, B if n % 3 == 1, G otherwise. The second digit is the same, except you're looking at (n / 3) % 3; then for the third digit look at (n / 3^2) % 3; etc.
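That encoding can be sketched in a few lines (assuming 8 characters, so IDs 0 through 3^8 - 1 = 6560; the function name is illustrative):

```python
def device_string(n):
    """Encode n (0 <= n < 3**8) as an 8-character ID over {R, B, G},
    reading n as a base-3 number with R=0, B=1, G=2."""
    assert 0 <= n < 3 ** 8
    digits = "RBG"
    chars = []
    for _ in range(8):
        chars.append(digits[n % 3])   # current ternary digit
        n //= 3                       # shift to the next digit
    return "".join(chars)
```

Since this is a bijection from counters to strings, handing out sequential n values guarantees uniqueness without storing any key list.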
If your devices have any unique sequential ID available, I would implement a deterministic algorithm for that 'unique string' retrieval.
I mean something like
GetDeviceRgbString(deviceid) { // deterministic algorithm returning appropriate value using device id to choose it }
Otherwise consider storing it somewhere (depends on your environment, you gave little data about it, but that may be file, database, ... ) and marking them as used, preferably keep the data about assigned device, you may need it once.
I'm going to assume you want them unique to identify them for some type of network (internet) activity. That being the case, I would have a web service that can take care of making sure each device is unique by handing out IDs. The software on each device would see if it has an ID (stored locally), and if not, request one from the web service.

constraint to avoid generating duplicates in this search task

I have to solve the following optimization problem:
Given a set of elements (E1,E2,E3,E4,E5,E6) create an arbitrary set of sequences e.g.
seq1:E1,E4,E3
seq2:E2
seq3:E6,E5
and given a function f that gives a value for every pair of elements e.g.
f(E1,E4) = 5
f(E4,E3) = 2
f(E6,E5) = 3
...
in addition it also gives a value for the pair of an element combined with some special element T, e.g.
f(T,E2) = 10
f(E2,T) = 3
f(E5,T) = 1
f(T,E6) = 2
f(T,E1) = 4
f(E3,T) = 2
...
The utility function that must be optimized is the following:
The utility of a sequence set is the sum of the utility of all sequences.
The utility of a sequence A1,A2,A3,...,AN is equal to
f(T,A1)+f(A1,A2)+f(A2,A3)+...+f(AN,T)
for our example set of sequences above this leads to
seq1: f(T,E1)+f(E1,E4)+f(E4,E3)+f(E3,T) = 4+5+2+2=13
seq2: f(T,E2)+f(E2,T) =10+3=13
seq3: f(T,E6)+f(E6,E5)+f(E5,T) =2+3+1=6
Utility(set) = 13+13+6=32
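The example computation above can be written down directly (a small Python sketch; the dictionary f holds only the pairs used in the example):

```python
# Pairwise values from the example, including the special element T.
f = {
    ("T", "E1"): 4, ("E1", "E4"): 5, ("E4", "E3"): 2, ("E3", "T"): 2,
    ("T", "E2"): 10, ("E2", "T"): 3,
    ("T", "E6"): 2, ("E6", "E5"): 3, ("E5", "T"): 1,
}

def sequence_utility(seq):
    """f(T, A1) + f(A1, A2) + ... + f(AN, T) for one sequence."""
    path = ["T"] + list(seq) + ["T"]
    return sum(f[pair] for pair in zip(path, path[1:]))

def set_utility(seqs):
    """Utility of a sequence set: the sum over all its sequences."""
    return sum(sequence_utility(s) for s in seqs)
```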
I am trying to solve a larger version of this problem (more like 1000 elements than 6) using A* and a heuristic: starting from zero sequences and stepwise adding elements, either to existing sequences or as a new sequence, until we obtain a set of sequences containing all elements.
The problem I run into is the fact that while generating possible solutions I end up with duplicates, for example in above example all the following combinations are generated:
seq1:E1,E4,E3
seq2:E2
seq3:E6,E5
+
seq1:E1,E4,E3
seq2:E6,E5
seq3:E2
+
seq1:E2
seq2:E1,E4,E3
seq3:E6,E5
+
seq1:E2
seq2:E6,E5
seq3:E1,E4,E3
+
seq1:E6,E5
seq2:E2
seq3:E1,E4,E3
+
seq1:E6,E5
seq2:E1,E4,E3
seq3:E2
which all have equal utility, since the order of the sequences does not matter.
These are all permutations of the 3 sequences. Since the number of sequences is arbitrary, there can be as many sequences as elements, and a factorial(!) number of duplicates gets generated...
One way to solve such a problem is to keep track of already visited states and not revisit them. However, since storing all visited states requires a huge amount of memory, and comparing two states can be quite expensive, I was wondering whether there is a way to avoid generating these duplicates in the first place.
THE QUESTION:
Is there a way to stepwise construct all these sequences, constraining the adding of elements so that only combinations of sequences are generated rather than all permutations (or at least limiting the number of duplicates)?
As an example, I only found a way to limit the number of 'duplicates' generated, by requiring that an element Ei always goes in a seqj with j <= i; therefore, with two elements E1, E2, only
seq1:E1
seq2:E2
would be generated, and not
seq1:E2
seq2:E1
I was wondering whether there was any such constraint that would prevent duplicates from being generated at all, without failing to generate all combinations of sets.
Well, it is simple: allow generation of only those sequence sets whose sequences are sorted by their first member. From the example above, only
seq1:E1,E4,E3
seq2:E2
seq3:E6,E5
would be correct. And you can guard this very easily: never allow an additional sequence whose first member is less than the first member of its predecessor.
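That guard can be checked (or enforced while expanding a search node) in a few lines; `is_canonical` and the element-to-index mapping are illustrative names:

```python
def is_canonical(seqs, index_of):
    """True iff the sequences' first members appear in strictly increasing
    element order -- exactly one representative per permutation class."""
    firsts = [index_of[s[0]] for s in seqs]
    return all(a < b for a, b in zip(firsts, firsts[1:]))
```

During generation, this means a new sequence may only be opened with an element whose index exceeds that of the previous sequence's first element, which prunes the duplicate branches before they are created.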

Matching Based on Arbitrary Categories and Similarity Measures

I have a customer database where customers have certain attributes and a customer type. The collection of attributes can vary (though they come from a finite set), and when I look at a new customer of unknown type with given attributes, I would like to determine which type s/he belongs to. For example, say I have these customers already in the DB:
Customer | Type | Attributes
---------|------|----------------
1        | A    | 44, 32, 5, 'X'
2        | A    | 3, 32, 66, 'A'
3        | B    | 6, 32, 'A', 'B'
4        | C    | 47, 31, 2, 'H'
5        | C    | 14, 32, 2, 'O'
6        | C    | 2, 'C'
7        | A    | 44
When I receive a new customer who has attributes, for example, 3,32,2, I would like to determine which type this customer belongs to, and the code should report its confidence (as percentage) of this match.
What is the best method to use here? Something statistical, or a method based on an affinity matrix of some kind, or a recommendation-engine-style approach based on Pearson correlation coefficients? Sample pseudocode would be most welcome, but any and all ideas are fine.
Thanks,
A standard way to solve this kind of problem is Naive Bayes classification.
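A minimal multinomial Naive Bayes sketch, treating every attribute value as a token and using Laplace smoothing (the function names are illustrative; the reported confidence is the normalized posterior probability):

```python
from collections import Counter, defaultdict
from math import log, exp

def train(records):
    """records: list of (type, attributes).  Returns class priors,
    per-class attribute counts, and the attribute vocabulary."""
    priors = Counter(t for t, _ in records)
    counts = defaultdict(Counter)
    vocab = set()
    for t, attrs in records:
        counts[t].update(attrs)
        vocab.update(attrs)
    return priors, counts, vocab

def classify(priors, counts, vocab, attrs):
    """Return (best_type, confidence) with Laplace smoothing (alpha = 1)."""
    total = sum(priors.values())
    scores = {}
    for t in priors:
        s = log(priors[t] / total)            # log prior
        n = sum(counts[t].values())
        for a in attrs:                       # log likelihood per attribute
            s += log((counts[t][a] + 1) / (n + len(vocab)))
        scores[t] = s
    # convert log-scores to normalized probabilities
    m = max(scores.values())
    probs = {t: exp(s - m) for t, s in scores.items()}
    z = sum(probs.values())
    best = max(probs, key=probs.get)
    return best, probs[best] / z
```

Trained on the seven customers above, `classify(priors, counts, vocab, [3, 32, 2])` returns a type together with its posterior probability. Note that types A and C compete closely here: 32 is common to both, 2 appears only in type-C rows, and 3 only in a type-A row, so the confidence value matters as much as the label.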
