I currently have a list of 200 words from which I need to create semantically correct permutations. Unfortunately, permuting a list of that size will lead to something like a trillion permutations.
What I am planning to do is use the Microsoft Web N-gram service and a yield function to find n-grams within my permutations that have joint scores above a certain threshold. My hope is that by filtering on score, I will be left with only the semantically correct permutations.
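A minimal Python sketch of that plan, with a hypothetical joint_score() standing in for whatever the n-gram service would return:

    from itertools import permutations

    def joint_score(phrase):
        # Hypothetical stand-in for a call to the n-gram service;
        # imagine it returns the joint (log) probability of the phrase.
        raise NotImplementedError

    def plausible_phrases(words, threshold, length=3):
        # Lazily walk fixed-length permutations, yielding only those
        # whose joint n-gram score clears the threshold.
        for perm in permutations(words, length):
            phrase = " ".join(perm)
            if joint_score(phrase) >= threshold:
                yield phrase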
My question is regarding the Microsoft N-gram API: with a list of 200 words, there will be A LOT of permutations to go through using this method -- can someone give me a sense of whether the API will be able to handle that volume of requests?
Thanks!
There is no limit on the number of queries you can make. However, the terms of use disallow threaded access, and the server response is relatively slow (between 0.12 and 0.22 s per query), so you could make at most about 720k queries in a 24-hour period. I'm using PHP's file_get_contents(...); there may be a faster way.
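For a rough sense of that ceiling, here is a sequential, unthreaded fetch loop in Python rather than PHP; the endpoint URL is purely illustrative, not the real service address:

    from urllib.request import urlopen

    ENDPOINT = "http://example.com/ngram/score?phrase="  # illustrative URL only

    def score(phrase):
        # One request at a time (the terms of use disallow threaded access).
        with urlopen(ENDPOINT + phrase.replace(" ", "+")) as resp:
            return float(resp.read().decode())

    # Back-of-the-envelope daily ceiling at the fastest observed latency:
    print(86400 / 0.12)  # ~720,000 queries per 24 hours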
In my application I've chopped up a library such that portions are updated with n-gram data as needed. It does slow down my code but it is at least tenable.
http://kkava.com/vocab/?ngram=on&imp=on&v=
I'm using Azure Maps Search and I'm trying to retrieve all POIs (points of interest) in a location, but I can't find anything in the documentation about how to sort my results, for example by distance.
Has anyone else had this problem?
https://atlas.microsoft.com/search/poi/json?subscription-key=key&api-version=1.0&query=restaurant&lat=45&lon=9
I don't think the current Search POI API provides sorting as part of the API itself, so you'll have to do that in memory afterwards. The results are sorted by "score" (relevancy) by default.
There is no way to order results with the POI search, which I guess is what you're looking for here. As per the best practices, you could use a nearby search:
https://atlas.microsoft.com/search/address/json?subscription-key={subscription-key}&api-version=1&query=400%20Broad%20Street%2C%20Seattle%2C%20WA&countrySet=US
If you would like straight-line distances, you can loop through the results and calculate the distances using the haversine formula. If you are using the Azure Maps Web SDK, you can use the atlas.math.getDistanceTo function instead. Once you have calculated a distance to each point, you can sort accordingly.
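For the straight-line case, a minimal haversine sketch in plain Python (not the Azure Maps SDK); the result shape in the final comment is an assumption about the Search POI response:

    from math import asin, cos, radians, sin, sqrt

    def haversine_km(lat1, lon1, lat2, lon2):
        # Great-circle distance on a sphere with Earth's mean radius ~6371 km.
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = sin((lat2 - lat1) / 2) ** 2 \
            + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371 * asin(sqrt(a))

    # Assuming each result carries a position dict, sorting might look like:
    # results.sort(key=lambda r: haversine_km(45, 9, r["position"]["lat"],
    #                                         r["position"]["lon"]))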
If you want to get the driving distance to each point, there are two approaches you can take:
Use the Route Matrix API. This is fairly easy to use, would be less error prone than the second option below, and the response is easy enough to work with. The only negative with this approach is that you will need the S1 pricing tier to access this service, and each cell generates a transaction, which can get expensive fast.
Use the Route Directions API with a large number of waypoints that go from your origin to each destination and back (A->B->A->C...). This will be a bit more work to understand the results, and if any leg of the route is unrouteable for any reason, the whole route calculation will fail. However, this would be significantly cheaper than option one, as you can use the S0 pricing tier, which has free limits, and this would only generate one transaction in most cases (if you have a large number of locations, you might need to break them up and spread them across a few calls). Because this calculates the route from the origin to each destination and back, twice as many calculations are made as you need, which could make this slower than approach one. When parsing the response, you would look at the odd-indexed route legs, as those go from the origin to each destination (see the sketch below). In some scenarios it might be desirable to know the travel time from the destinations to the origin (i.e. how long it would take all employees to get to work), in which case the even-numbered legs are what you would want.
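Assuming the usual shape of the Route Directions response (routes[0].legs, each leg with a summary.travelTimeInSeconds; check the field names against the docs), extracting the outbound legs might look like this sketch:

    def outbound_travel_times(response):
        # Waypoints run A->B->A->C->A..., so the 1st, 3rd, 5th... legs
        # (slice [0::2] in zero-based terms) go origin -> destination,
        # and the legs between them return to the origin.
        legs = response["routes"][0]["legs"]  # field names assumed; check the docs
        return [leg["summary"]["travelTimeInSeconds"] for leg in legs[0::2]]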
Again, once you have the distance, or better yet, travel time, you can then sort the results accordingly.
This question has probably been asked many times on blogs and Q&A websites, but I couldn't find a concrete answer yet.
I am trying to build a recommendation system for customers using only their purchase history.
Let's say my application has n products.
Compute item similarities for all the n products based on their attributes (like country, type, price)
When a user needs a recommendation, loop over the products p previously purchased by user u and fetch the similar products (similarity was computed in the previous step)
If I am right, we call this content-based recommendation, as opposed to collaborative filtering, since it doesn't involve co-occurrence of items or user preferences for items.
My problem is multi-fold:
Is there any existing scalable ML platform that addresses content-based recommendation? (I am fine with adopting different technologies/languages.)
Is there a way to tweak Mahout to get this result?
Is classification a way to handle content based recommendation?
Is this something that a graph database is good at solving?
Note: I looked at Mahout for doing this at scale, since I am familiar with Java, Mahout apparently utilizes Hadoop for distributed processing, and it has the advantage of well-tested ML algorithms.
Your help is appreciated. Any examples would be really great. Thanks.
The so-called item-item recommenders are natural candidates for precomputing the similarities, because the attributes of the items rarely change. I would suggest you precompute the similarity between each pair of items, perhaps storing only the top K for each item, and if you have enough resources you could load the similarity matrix into main memory for real-time recommendation.
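A minimal sketch of that precomputation, assuming each product is already encoded as a numeric attribute vector and using scikit-learn's cosine similarity:

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    def top_k_similar(item_vectors, k=10):
        # item_vectors: (n_items, n_features) matrix of product attributes.
        sim = cosine_similarity(item_vectors)  # (n_items, n_items)
        np.fill_diagonal(sim, -1.0)            # never recommend an item to itself
        # For each item, the indices of its k most similar items, best first.
        return np.argsort(-sim, axis=1)[:, :k]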
Check out my answer to this question for a way to do this in Mahout: Does Mahout provide a way to determine similarity between content (for content-based recommendations)?
The example shows how to compute the textual similarity between the items and then load the precomputed values into main memory.
For performance comparison about different data structures to hold the values check out this question: Mahout precomputed Item-item similarity - slow recommendation
I hope it belongs here.
Can anyone please tell me whether there is any method to compare different search applications working in the same domain on the same dataset?
The problem is they are quite different: one is a web application which looks up a database where items are grouped in categories, and the other is a rich client which searches by keywords.
Are there any standard test guides for that purpose?
There are testing methods. You may use, e.g., precision/recall or the F-beta measure to compute an "efficiency" score.
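For reference, with TP, FP and FN counting the true positives, false positives and false negatives against your reference set, the standard definitions are:

    P = \frac{TP}{TP + FP}, \qquad
    R = \frac{TP}{TP + FN}, \qquad
    F_\beta = (1 + \beta^2) \cdot \frac{P \cdot R}{\beta^2 P + R}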
However, you need to build the reference set yourself. That means you will be measuring not the efficiency in the domain, but rather the efficiency relative to your own judgments, which makes it all the more important to ensure that your reference set is representative of the data you have.
In most cases, common-sense reasoning will also give you the result.
If you want to measure performance in terms of speed, you need to formulate a set of assumed queries against the search and query your search engine with them at a given rate. That's doable with any common load-testing tool.
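A minimal sketch of such a timed run in Python; the endpoint and query set are placeholders for your own:

    import time
    from urllib.request import urlopen

    URL = "http://localhost:8080/search?q="  # placeholder endpoint
    QUERIES = ["foo", "bar", "baz"]          # your set of assumed queries

    for q in QUERIES:
        start = time.perf_counter()
        urlopen(URL + q.replace(" ", "+")).read()
        print(f"{q}: {time.perf_counter() - start:.3f} s")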
I've just gotten into the AdWords API for an upcoming project. I actually need something quite simple, but I want to go about it in the most efficient way.
I need code to retrieve the Global Monthly Search Volume for multiple keywords (in the millions). After reading about BulkMutateJobService, in the Google documentation they say
If you want to perform a very large number of operations (up to 500,000) on your AdWords campaigns and child objects, use BulkMutateJobService
But later on in the page they give limits of
No more than 25 OperationStream objects are allowed.
No more than 10,000 operations are allowed per BulkMutateRequest.
No more than 100 request parts are allowed.
as well as a few others. See the source here: http://code.google.com/apis/adwords/docs/bulkjobs.html
Now, my questions:
What do these numbers mean? If I have 1 million words I need information on, do I only need to perform 2 requests with 500K words each?
Also, are there examples of code that does this task?
I only need Global Monthly Search Volume and CPC for each keyword. I've searched online, but to no avail: I haven't found any good example, or anything leaning in that direction, that utilizes BulkMutateJobService.
Any links, resources, code, advice you can offer? All is appreciated.
The BulkMutateJobService only allows for mutates, or changes, to the account. It does not provide the bulk retrieval of information.
You can fetch monthly search volume for keywords using the TargetingIdeaService. If you use it in STATS mode you can include up to 2500 keywords per request.
Estimated CPC values are obtained from the TrafficEstimatorService. You can request up to 500 keywords per request.
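The batching itself is straightforward. In this Python sketch, fetch_search_volume and fetch_cpc_estimates are hypothetical wrappers around the two services, not real client-library calls:

    def fetch_search_volume(batch):
        # Hypothetical wrapper around TargetingIdeaService in STATS mode.
        ...

    def fetch_cpc_estimates(batch):
        # Hypothetical wrapper around TrafficEstimatorService.
        ...

    def chunks(items, size):
        # Yield successive fixed-size batches from a keyword list.
        for i in range(0, len(items), size):
            yield items[i:i + size]

    keywords = [...]  # your list of ~1,000,000 keywords

    # 1,000,000 keywords -> 400 stats requests (2,500 each)
    # and 2,000 estimator requests (500 each).
    for batch in chunks(keywords, 2500):
        fetch_search_volume(batch)
    for batch in chunks(keywords, 500):
        fetch_cpc_estimates(batch)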
FYI, there is an official AdWords API Forum that you can ask questions on.
Consider the following search results:
Google for 'David' - 591 million hits in 0.28 sec
Google for 'John' - 785 million hits in 0.18 sec
OK. Pages are indexed, it only needs to look up the count and the first few items in the index table, so speed is understandable.
Now consider the following search with AND operation:
Google for 'David John' ('David' AND 'John') - 173 million hits in 0.25 sec
This makes me ticked ;) How on earth can search engines get the result of AND operations on gigantic datasets so fast? I see the following two ways to conduct the task and both are terrible:
You conduct the search for 'David', take the gigantic temp table, and conduct a search for 'John' on it. HOWEVER, the temp table is not indexed by 'John', so a brute-force search is needed. That just won't compute within 0.25 sec no matter what HW you have.
Indexing by all possible word combinations like 'David John'. Then we face a combinatorial explosion in the number of keys, and not even Google has the storage capacity to handle that.
And you can AND together as many search phrases as you want and still get answers in under 0.5 sec! How?
What Markus wrote about Google processing the query on many machines in parallel is correct.
In addition, there are information retrieval algorithms that make this job a little bit easier. The classic way to do it is to build an inverted index which consists of postings lists - a list for each term of all the documents that contain that term, in order.
When a query with two terms is searched, conceptually, you would take the postings lists for each of the two terms ('david' and 'john'), and walk along them, looking for documents that are in both lists. If both lists are ordered the same way, this can be done in O(N). Granted, N is still huge, which is why this will be done on hundreds of machines in parallel.
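Sketched in Python, the two-pointer walk over two sorted postings lists looks like this:

    def intersect(a, b):
        # a and b are postings lists sorted by document id.
        # Advance whichever pointer lags; O(len(a) + len(b)) overall.
        i = j = 0
        hits = []
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                hits.append(a[i])
                i += 1
                j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return hits

    print(intersect([1, 3, 5, 9], [2, 3, 9, 10]))  # -> [3, 9]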
Also, there may be additional tricks. For example, if the highest-ranked documents were placed higher on the lists, then maybe the algorithm could decide that it found the 10 best results without walking the entire lists. It would then guess at the remaining number of results (based on the size of the two lists).
I think you're approaching the problem from the wrong angle.
Google doesn't have tables/indices on a single machine. Instead, they partition their dataset heavily across their servers. Reports indicate that as many as 1000 physical machines are involved in every single query!
With that amount of computing power it's "simply" (used highly ironically) a matter of ensuring that every machine completes their work in fractions of a second.
Reading about Google technology and infrastructure is very inspiring and highly educational. I'd recommend reading up on BigTable, MapReduce and the Google File System.
Google has an archive of their publications available with lots of juicy information about their technologies. This thread on MetaFilter also provides some insight into the enormous amount of hardware needed to run a search engine.
I don't know how Google does it, but I can tell you how I did it when a client needed something similar:
It starts with an inverted index, as described by Avi. That's just a table listing, for every word in every document, the document id, the word, and a score for the word's relevance in that document. (Another approach is to index each appearance of the word individually along with its position, but that wasn't required in this case.)
From there, it's even simpler than Avi's description - there's no need to do a separate search for each term. Standard database summary operations can easily do that in a single pass:
    SELECT document_id, SUM(score) AS total_score, COUNT(score) AS matches
    FROM rev_index
    WHERE word IN ('david', 'john')
    GROUP BY document_id
    HAVING COUNT(score) = 2
    ORDER BY total_score DESC;
This will return the IDs of all documents which have scores for both 'David' and 'John' (i.e., both words appear), ordered by an approximation of relevance. It will take about the same time to execute regardless of how many or how few terms you're looking for, since IN performance is not much affected by the size of the target set and a simple count determines whether all terms were matched.
Note that this simplistic method just adds the 'David' score and the 'John' score together to determine overall relevance; it doesn't take the order/proximity/etc. of the names into account. Once again, I'm sure that Google does factor that into its scores, but my client didn't need it.
I did something similar to this years ago on a 16-bit machine. The dataset had an upper limit of around 110,000 records (it was a cemetery, so there was a finite limit on burials), so I set up a series of bitmaps, each containing 128K bits.
A search for "david" resulted in me setting the relevant bit in one of the bitmaps to signify that the record had the word "david" in it. I did the same for 'john' in a second bitmap.
Then all you need to do is a binary AND of the two bitmaps; the resulting bitmap tells you which record numbers had both 'david' and 'john' in them. A quick scan of the resulting bitmap gives you back the list of records that match both terms.
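As an illustration (not the original 16-bit code), the same trick in Python using arbitrary-precision integers as the bitmaps:

    NUM_RECORDS = 110_000

    def bitmap_for(term, index):
        # index maps a term to the record numbers containing it.
        bm = 0
        for rec in index.get(term, ()):
            bm |= 1 << rec  # set this record's bit
        return bm

    index = {"david": [2, 7, 40], "john": [7, 9, 40]}
    both = bitmap_for("david", index) & bitmap_for("john", index)
    print([i for i in range(NUM_RECORDS) if both >> i & 1])  # -> [7, 40]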
This technique wouldn't work for Google though, so consider this my $0.02 worth.