Basket analysis with redundant items - market-basket-analysis

Assume I want to do a basket analysis on 1,000,000 baskets, and assume that, for some reason, 99% of them bought milk. How can we cope with that, since (almost) all baskets then contain the item milk?
When doing NLP you can specify "stop-words" like "the", "and", etc. Can you do that here as well, simply by saying that items occurring in more than e.g. 70% of the baskets should be removed, or...?
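A minimal Python sketch of the thresholding idea the question proposes (the 70% cutoff and the toy baskets below are just illustrative assumptions):
from collections import Counter

# Hypothetical toy data: each basket is a set of item names.
baskets = [{"milk", "bread"}, {"milk", "eggs", "beer"}, {"milk"}, {"bread", "eggs"}]

max_support = 0.70  # drop items that occur in more than 70% of the baskets

# Count in how many baskets each item occurs.
counts = Counter(item for basket in baskets for item in basket)
too_common = {item for item, c in counts.items() if c / len(baskets) > max_support}

# Remove the over-frequent items before running the actual basket analysis.
filtered = [basket - too_common for basket in baskets]
print(too_common)   # {'milk'} in this toy example
print(filtered)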


Find any values from a set/list within a string

Long time lurker, first time poster. I'm hoping to get some advice from the brilliant minds in this community. In the project I'm working on, the goal is to look at a user-provided string and determine whether the content of that string contains any (one or many) matches against a list of match criteria. For example:
User-provided string: "I like thing a and thing b"
Match List:
Match Criteria   Match Type                    Category
Foo              Exact (Case Insensitive)      Bar
Thing a          Contains (Case Insensitive)   Things
Thing b          Contains (Case Insensitive)   Stuff
In this case, it would return the following matches:
Thing a > Things
Thing b > Stuff
As of now, my approach is to iterate through the match criteria list and check each list item against the user-supplied string using the Match Type specified (Exact, Contains, Regular Expression), returning a list of the matches and then doing some stuff with that list. This approach works, even when matching ~100 rules and handling a 200-record batch, but it seems obvious that the performance will be pretty terrible if a large number of rules is introduced.
Is there a better way to do this that would be supported in Apex called by a trigger? I would love to learn a more sophisticated approach if there is one.
Thanks in advance!
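For illustration only, here is the gist of that iterate-and-check approach sketched in Python (the rule tuples and helper name below are hypothetical, not Apex):
import re

# Hypothetical rules: (criteria, match type, category), mirroring the table above.
rules = [
    ("Foo", "exact", "Bar"),
    ("thing a", "contains", "Things"),
    ("thing b", "contains", "Stuff"),
]

def find_matches(text, rules):
    """Return the categories whose rule matches the user-provided string."""
    lowered = text.lower()
    hits = []
    for criteria, match_type, category in rules:
        if match_type == "exact" and lowered == criteria.lower():
            hits.append(category)
        elif match_type == "contains" and criteria.lower() in lowered:
            hits.append(category)
        elif match_type == "regex" and re.search(criteria, text, re.IGNORECASE):
            hits.append(category)
    return hits

print(find_matches("I like thing a and thing b", rules))  # ['Things', 'Stuff']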
What do you need it for? Is it a pure Apex exercise, or is it "close" to certain standard sObjects? There are lots of built-in features around "fuzzy matching".
In no specific order...
Have you looked into "all things Einstein", from categorising leads to predicting how likely an opportunity is to close? It might not be the direction you expected to take, but who knows.
Obviously SOSL comes to mind, like what powers the global search. It automatically does some substitutions for you, like Mike -> Michael.
Matching rules, duplicate rules. You'd have a limit of, say, 5 active rules, but you could hook them up from Apex, including creative abuse of the system. "Dear Salesforce, let's pretend I'm making such and such Opportunity, can you find me similar Opportunities?" (Plot twist: you're not making an Oppty at all, you're creating an account for some venture capitalist looking for investments that match his preferences.) Give matching rules a go, if not for everything then at least for the more creative fuzzy matching. You really don't want to implement Soundex, Levenshtein etc. manually...
Tags? There's a somewhat forgotten feature from SF Classic; it creates a bunch of tables (AccountTag, ContactTag). This plus SOSL could be close to what you need.
Additionally if you need this for anything close to Knowledge Base:
Data Categories come to mind
KB supports synonyms, letting you define your (not very intuitive) "thing b => stuff" mapping.
and it should survive translations

Transportation problem to minimize the cost using genetic algorithm

I am new to genetic algorithms, and here is a simple part of what I am working on.
There are factories (1, 2, 3) and they can serve any of the following customers (A, B, C); the transportation costs are given in the table below. There are some fixed costs for A, B, C (2, 4, 1).
    A  B  C
1   5  2  3
2   2  4  6
3   8  5  5
How do I solve the transportation problem to minimize the cost using a genetic algorithm?
First of all, you should understand what a genetic algorithm is and why it is called that: we act like a single-cell organism, making crossovers and mutations to reach a better state.
So, you need to design your chromosome first. In your situation, let's pick one side, customers or factories. Let's take customers. Your solution will look like:
1 -> A
2 -> B
3 -> C
So, your example chromosome is "ABC". Then create another chromosome ("BCA", for example).
Now you need a fitness function which you wish to minimize/maximize.
This function will calculate your chromosomes' breeding chance. In your situation, that'll be the total cost.
Write a function that calculates the cost for a given factory and a given customer.
Now, what you're going to do is,
Pick 2 chromosomes weighted randomly. (The weights are calculated by the fitness function.)
Pick an index into the 2 chromosomes and create new chromosomes by swapping their parts at that index.
If the new chromosomes have invalid parts (such as "ABA" in your situation), make a fixing move (turn one of the "A"s into "C", for example). We call this a "mutation".
Add your new chromosome to the chromosome set if it wasn't there before.
Go back to the first step.
You'll do this for some number of iterations. You may end up with thousands of chromosomes. When you think "it's enough", stop the process and sort the chromosome set ascending/descending. The first chromosome will be your result.
I'm aware this makes the process time/chromosome dependent, and that you may or may not find an optimum (the fittest, in biological terms) chromosome if you do not run it long enough. But that's what a genetic algorithm is. Even your first and second runs may or may not produce the same result, and that's fine.
For your particular situation the possible chromosome set is very small, so I guarantee you will find an optimum in a second or two, because the entire chromosome set is ["ABC", "BCA", "CAB", "BAC", "CBA", "ACB"].
In summary, you need 3 pieces of information to apply a genetic algorithm:
How should my chromosome look? (And what is the initial chromosome set?)
What is my fitness function?
How to make cross-overs in my chromosomes?
There are some other things to care about this problem:
Without mutation, a genetic algorithm can get stuck at a local optimum. It can still be used for optimization problems with constraints.
Even if a chromosome has a very low chance of being picked for crossover, you shouldn't sort and truncate the chromosome set until the end of the iterations. Otherwise, you may get stuck at a local extremum or, worse, end up with an ordinary solution candidate instead of the global optimum.
To speed up the process, pick dissimilar initial chromosomes. Without a sufficient mutation rate, finding the global optimum can be a real pain.
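A minimal Python sketch of this recipe, using the cost table from the question and treating the fixed costs as simply added per customer (the population size, iteration count and repair rule below are illustrative assumptions):
import random

# Cost of serving customer A/B/C from factory 1/2/3, taken from the question.
cost = {1: {"A": 5, "B": 2, "C": 3},
        2: {"A": 2, "B": 4, "C": 6},
        3: {"A": 8, "B": 5, "C": 5}}
fixed = {"A": 2, "B": 4, "C": 1}   # fixed costs per customer (assumed additive)

def total_cost(chromosome):
    """Chromosome "ABC" means factory 1 serves A, factory 2 serves B, factory 3 serves C."""
    return sum(cost[i + 1][c] + fixed[c] for i, c in enumerate(chromosome))

def crossover(a, b):
    """Single-point crossover followed by a repair step for repeated customers."""
    point = random.randint(1, len(a) - 1)
    child = list(a[:point] + b[point:])
    missing = [c for c in "ABC" if c not in child]
    seen = set()
    for i, c in enumerate(child):      # repair: replace duplicates with missing customers
        if c in seen:
            child[i] = missing.pop()
        seen.add(c)
    return "".join(child)

population = ["ABC", "BCA"]            # small initial chromosome set
for _ in range(100):
    # Weighted random parent selection: lower cost -> higher breeding chance.
    weights = [1.0 / total_cost(ch) for ch in population]
    parent1, parent2 = random.choices(population, weights=weights, k=2)
    child = crossover(parent1, parent2)
    if child not in population:
        population.append(child)

best = min(population, key=total_cost)
print(best, total_cost(best))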
As mentioned in nejdetckenobi's answer, in this case the solution search space is very small, i.e. only 6 feasible solutions: ["ABC", "BCA", "CAB", "BAC", "CBA", "ACB"]. I assume this is only a simplified version of your problem and your actual problem contains more factories and customers (but with the numbers of factories and customers equal). In that case, you can make use of special mutation and crossover operators to avoid infeasible solutions with repeating customers, e.g. ["ABA", "CCB", etc.].
For mutation, I suggest using a swap mutation, i.e. randomly pick two customers and swap their corresponding factories (positions):
ABC mutates to ACB
ABC mutates to CBA
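A sketch of that swap mutation, assuming the same string chromosomes as above:
import random

def swap_mutation(chromosome):
    """Randomly pick two positions (customers) and swap their factories."""
    i, j = random.sample(range(len(chromosome)), 2)
    genes = list(chromosome)
    genes[i], genes[j] = genes[j], genes[i]
    return "".join(genes)

print(swap_mutation("ABC"))  # e.g. "ACB" or "CBA"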

How to quickly filter a list using regex?

Well... I have a trivial request: building an Entry that filters a list of entries on the fly (think of an editor auto-complete feature).
The request is to support a regex filter over the whole list and display only matching entries.
e.g.,
The list contains:
abc.efg.hij.entry
abc.ddd.hij.entry2
hij.some.value.entry
Typing in the Entry
Value : List
hij : abc.efg.hij.entry, abc.ddd.hij.entry2, hij.some.value.entry
ddd : abc.ddd.hij.entry2
dd*entry : abc.ddd.hij.entry2
val : hij.some.value.entry
Here is the code I'm using for filtering the list:
regex = re.compile(r"{0}".format(entry_value), re.IGNORECASE)
display_list = list(filter(regex.search, display_list))
The real-life list contains ~300K string entries (up to 100 chars each), and the performance of the above is very poor for a GUI response time.
I've profiled my real test case and it yields ~0.8 s per keystroke in the Entry.
Is there a faster way?
If you are doing regular expression pattern matching against a normal python list that contains 300,000 items, it's just naturally going to be slow. Also, if you are going to display 300,000 items in a listbox it's going to be slow to display all of those items.
Your best bet might be to pick a better data structure. For example, on my system I can run your filter against 300,000 items in about 250 ms, but a query against an in-memory sqlite database with 300,000 rows takes about half that time. In either case, it can add another second to fully update the display if the result is very large (for example, if all 300,000 match).
Of course, sqlite doesn't support regex out of the box, but you can translate some common patterns to SQL patterns (e.g. 'foo.*bar' could be translated to 'foo%bar'). For more information on sqlite and regex see How do I use regex in a SQLite query?
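A minimal sketch of that idea with Python's built-in sqlite3 module, using an in-memory table and a LIKE pattern in place of the regex (the table and column names are just placeholders):
import sqlite3

# Hypothetical data: ~300K strings in the real case, a handful here.
entries = ["abc.efg.hij.entry", "abc.ddd.hij.entry2", "hij.some.value.entry"]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE items (value TEXT)")
db.executemany("INSERT INTO items VALUES (?)", [(e,) for e in entries])

# 'dd*entry' as typed by the user roughly becomes '%dd%entry%' as a LIKE pattern.
pattern = "%dd%entry%"
rows = db.execute("SELECT value FROM items WHERE value LIKE ?", (pattern,)).fetchall()
print([r[0] for r in rows])   # ['abc.ddd.hij.entry2']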
Another strategy to employ would be to not search on every character typed. Wait until the user pauses in their typing. So, for example, if they type "Lorem", you don't need to search on "L" and then "Lo", and then "Lor", etc. Instead, schedule the search to happen in 100 ms, and with each keypress you can reschedule the search. This will prevent the searching from slowing down, while still giving the user what appears to be a fairly rapid result.
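A sketch of that debouncing idea in Tkinter; after/after_cancel do the (re)scheduling, and the widgets and list below are just placeholders:
import re
import tkinter as tk

root = tk.Tk()
entry = tk.Entry(root)
entry.pack()
listbox = tk.Listbox(root)
listbox.pack()

display_list = ["abc.efg.hij.entry", "abc.ddd.hij.entry2", "hij.some.value.entry"]
pending = None  # id of the currently scheduled search, if any

def do_search():
    try:
        regex = re.compile(entry.get(), re.IGNORECASE)
    except re.error:
        return  # ignore half-typed, invalid patterns
    listbox.delete(0, tk.END)
    for item in filter(regex.search, display_list):
        listbox.insert(tk.END, item)

def on_key(event):
    # Reschedule the search 100 ms into the future on every keypress,
    # so it only actually runs once the user pauses typing.
    global pending
    if pending is not None:
        root.after_cancel(pending)
    pending = root.after(100, do_search)

entry.bind("<KeyRelease>", on_key)
root.mainloop()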

Neo4j query for shortest path gets stuck (does not work) if I have 2-way relationships between graph nodes and the nodes are interrelated

I made a relation graph with two-way relationships, i.e. if A knows B then B also knows A. Every node has a unique Id and Name along with other properties. So my graph looks like this:
if I trigger a simple query
MATCH (p1:SearchableNode {name: "Ishaan"}), (p2:SearchableNode {name: "Garima"}),path = (p1)-[:NAVIGATE_TO*]-(p2) RETURN path
it does not give any response and consumes 100% of the machine's CPU and RAM.
UPDATED
After reading through posts and the comments on this one, I simplified the model and the relationships. Now it ends up as follows.
Each relationship has a different weight; to simplify, consider horizontal connections to have weight 1, vertical connections weight 1, and diagonal relations weight 1.5.
In my database there are more than 85,000 nodes and 0.3 million relationships.
The query with shortestPath does not end up with a result. It gets stuck processing and the CPU goes to 100%.
I'm afraid you won't be able to do much here. Your graph is very specific, with relations only to the closest nodes. That's unfortunate, because Neo4j is fine playing around the starting point +- a few relations away, not over the whole graph with each query.
It means that once you are 2 nodes away, the computational complexity rises to:
8 relationships per node
distance 2
8 + 8^2
In general, the top complexity for a distance n is
O(8 + 8^n) // in case all affected nodes have 8 connections
You say you have ~80,000 nodes. This means (correct me if I'm wrong) a longest distance of ~280 (from √80000). Let's suppose your nodes
(p1:SearchableNode {name: "Ishaan"}),
(p2:SearchableNode {name: "Garima"}),
are only 140 hops away. This creates a complexity of 8^140 ≈ 10^126; I'm not sure any computer in the world can handle this.
Sure, not all nodes have 8 connections, only those "in the middle"; in our example graph that would mean ~500,000 relationships. You have ~300,000, which is maybe 2 times fewer, so let's suppose the overall complexity for an average distance of 70 (out of 140, a very relaxed lower estimate), with nodes having 4 relationships on average (down from 8; 80,000 * 4 = 320,000), to be
O(4 + 4^70) ≈ 10^42
A single 1 GHz CPU working at roughly 1,000,000 operations per second would need:
10^42 == 10^36 * 1,000,000 -> 10^36 seconds
Let's suppose we have a cluster of 100 10 GHz CPU servers, 1000 GHz in total.
That's still 10^33 * 1,000,000,000 -> 10^33 seconds.
I would suggest keeping away from allShortestPaths and looking only for the first path available. Using Gremlin instead of Cypher it is possible to implement your own algorithms with some heuristics, so you can actually cut the time down to maybe seconds or less.
Example: using one direction only = down to 10^16 seconds.
An example heuristic: check the ids of the nodes; the higher the difference node2.id - node1.id, the higher the actual distance (considering the node creation order, i.e. nodes with similar ids were created close together). In that case you can either skip the query or jump a few relations ahead with something like MATCH n1-[:RELATED*..5]->q-[:RELATED*]->n2 (I forget the exact syntax for specifying the relation count), which should instantly skip to nodes 5 distances away that are closer to the n2 node = complexity down from 4^70 to 4^65. So if you can calculate the distance exactly from the node ids, you could even match ... [:RELATED*..65] ..., which cuts the complexity to 4^5, and that's a matter of milliseconds for a CPU.
It's possible I'm completely wrong here; it has been some time since I was in school, and it would be nice to ask a mathematician (graph theory) to confirm this.
Let's consider what your query is doing:
MATCH (p1:SearchableNode {name: "Ishaan"}),
(p2:SearchableNode {name: "Garima"}),
path = (p1)-[:NAVIGATE_TO*]-(p2)
RETURN path
If you run this query in the console with EXPLAIN in front of it, the DB will give you its plan for how it will answer. When I did this, the query compiler warned me:
If a part of a query contains multiple disconnected patterns, this
will build a cartesian product between all those parts. This may
produce a large amount of data and slow down query processing. While
occasionally intended, it may often be possible to reformulate the
query that avoids the use of this cross product, perhaps by adding a
relationship between the different parts or by using OPTIONAL MATCH
You have two issues going on with your query - first, you're assigning p1 and p2 independent of one another, possibly creating this cartesian product. The second issue is that because all of your links in your graph go both ways and you're asking for an undirected connection you're making the DB work twice as hard, because it could actually traverse what you're asking for either way. To make matters worse, because all of the links go both ways, you have many cycles in your graph, so as cypher explores the paths that it can take, many paths it will try will loop back around to where it started. This means that the query engine will spend a lot of time chasing its own tail.
You can probably immediately improve the query by doing this:
MATCH p=shortestPath((p1:SearchableNode {name:"Ishaan"})-[:NAVIGATE_TO*]->(p2:SearchableNode {name:"Garima"}))
RETURN p;
Two modifications here - p1 and p2 are bound to each other immediately, you don't separately match them. Second, notice the [:NAVIGATE_TO*]-> part, with that last arrow ->; we're matching the relationship ONE WAY ONLY. Since you have so many reflexive links in your graph, either way would work fine, but either way you choose you cut the work the DB has to do in half. :)
This may still perform not so great, because traversing that graph is still going to have a lot of cycles, which will send the DB chasing its tail trying to find the best path. In your modeling choice here, you usually shouldn't have relationships going both ways unless you need separate properties on each relationship. A relationship can be traversed in both directions, so it doesn't make sense to have two (one in each direction) unless the information that relationship is capturing is semantically different.
Often you'll find with query performance that you can do better by reformulating the query and thinking about it, but there's major interplay between graph modeling and overall performance. With the graph set up with so many bi-directional links, there will only be so much you can do to optimize path-finding.
MATCH (p1:SearchableNode {name: "Ishaan"}), (p2:SearchableNode {name: "Garima"}),path = (p1)-[:NAVIGATE_TO*]->(p2) RETURN path
Or:
MATCH (p1:SearchableNode {name: "Ishaan"}), (p2:SearchableNode {name: "Garima"}), (p1)-[path:NAVIGATE_TO*]->(p2) RETURN path

String matching algorithm (multi-token strings)

I have a dictionary which contains a large number of strings. Each string can have 1 to 4 tokens (words). Example:
Dictionary:
The Shawshank Redemption
The Godfather
Pulp Fiction
The Dark Knight
Fight Club
Now I have a paragraph and I need to figure out how many strings in the para are part of the dictionary.
For example, when the para below:
The Shawshank Redemption considered the greatest movie ever made according to the IMDB Top 250. For at least the year or two that I have occasionally been checking in on the IMDB Top 250 The Shawshank Redemption has been battling The Godfather for the top spot.
is run against the dictionary, I should be getting the ones in bold as the ones that are part of the dictionary.
How can I do this with the fewest dictionary calls?
Thanks
You might be better off using a Trie. A Trie is better suited to finding partial matches (i.e. as you search through the text of a paragraph) that are potentially what you're looking for, as opposed to making a bunch of calls to a dictionary that will mostly fail.
The reason why I think a Trie (or some variation) is appropriate is because it's built to do exactly what you're trying to do:
If you use this (or some modification that has the tokenized words at each node instead of letters), it would be the most efficient approach (at least that I know of) in terms of storage and retrieval. Storage, because instead of storing the word "The" a couple of thousand times for each dictionary entry that has that word in its title (as is the case with movie titles), it would be stored once in one of the nodes right under the root. The next word, "Shawshank", would be in a child node, and then "Redemption" would be in the next, for a total of 3 lookups; then you would move to the next phrase. If it fails, i.e. the phrase is only "The Shawshank Looper", you fail after the same 3 lookups and move on to the failed word, Looper (which, as it happens, would also be a child node under the root, and you get a hit). This solution works assuming you're reading a paragraph without mashed-up movie names.
Using a hash table, you're going to have to split all the words, check the first word, and then while there's no match, keep appending words and checking if THAT phrase is in the dictionary, until you get a hit, or you reach the end of the paragraph. So if you hit a paragraph with no movie titles, you would have as many lookups as there are words in the paragraph.
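A minimal word-level trie sketch of the idea from the answer above, assuming whitespace tokenization, case-insensitive matching and dictionary phrases of at most 4 tokens:
def build_trie(phrases):
    """Build a word-level trie; '$' marks the end of a complete dictionary phrase."""
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase.lower().split():
            node = node.setdefault(word, {})
        node["$"] = phrase
    return root

def find_phrases(text, trie):
    """Scan the paragraph, walking the trie from each word position."""
    words = text.lower().replace(".", " ").split()
    hits = []
    for start in range(len(words)):
        node = trie
        for word in words[start:start + 4]:    # dictionary phrases are 1-4 tokens
            if word not in node:
                break
            node = node[word]
            if "$" in node:
                hits.append(node["$"])
    return hits

dictionary = ["The Shawshank Redemption", "The Godfather", "Pulp Fiction",
              "The Dark Knight", "Fight Club"]
para = ("The Shawshank Redemption considered the greatest movie ever made "
        "has been battling The Godfather for the top spot.")
trie = build_trie(dictionary)
print(find_phrases(para, trie))   # ['The Shawshank Redemption', 'The Godfather']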
This is not a complete answer, more of an extended comment.
In the literature this is called the "multi-pattern matching problem". Since you mentioned that the set of patterns has millions of elements, trie-based solutions will most probably perform poorly.
As far as I know, in practice traditional string search is used with a lot of heuristics. DNA search, antivirus detection, etc. all need fast and reliable pattern matching, so there should be a decent amount of research done.
I can imagine how Rabin-Karp with rolling-hash functions and some filters (e.g. a Bloom filter) could be used to speed up the process. For example, instead of actually matching the substrings, you could first filter (e.g. with weak hashes) and then actually verify, thus reducing the number of verifications needed. Plus, this should reduce the work done against the original dictionary itself, as you would store its hashes, or other filters.
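A toy sketch of the filter-then-verify idea (not Rabin-Karp itself): a cheap, deliberately weak hash set is consulted first, and only hash hits are verified against the real dictionary; the names and data below are hypothetical:
# Toy filter-then-verify sketch: a weak hash acts as the filter,
# exact comparison against the dictionary is the verification step.
dictionary = {"the shawshank redemption", "the godfather", "pulp fiction"}
weak_hashes = {hash(phrase) & 0xFFFF for phrase in dictionary}   # deliberately small/weak

def find_titles(text, max_tokens=4):
    words = text.lower().split()
    hits = []
    for start in range(len(words)):
        for length in range(1, max_tokens + 1):
            if start + length > len(words):
                break
            candidate = " ".join(words[start:start + length])
            # Cheap filter first; most windows are rejected without touching the dictionary.
            if hash(candidate) & 0xFFFF in weak_hashes and candidate in dictionary:
                hits.append(candidate)
    return hits

print(find_titles("the shawshank redemption has been battling the godfather"))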
In Python:
import re
movies={1:'The Shawshank Redemption', 2:'The Godfather', 3:'Pretty Woman', 4:'Pulp Fiction'}
text = 'The Shawshank Redemption considered the greatest movie ever made according to the IMDB Top 250.For at least the year or two that I have occasionally been checking in on the IMDB Top 250 The Shawshank Redemption has been battling The Godfather for the top spot.'
repl_str = '(?P<title>' + '|'.join(['(?:%s)' % movie for movie in movies.values()]) + ')'
result = re.sub(repl_str, r'<b>\g<title></b>', text)  # raw string so \g<title> reaches re.sub intact
Basically it consists of forming one big substitution pattern out of your dict values.
I don't know whether regex and sub have a limitation in the size of the substitution instructions you give them though. You might want to check.

Resources