NLP - Improving Running Time and Recall of Fuzzy string matching - nlp

I have made a working algorithm but the running time is very horrible. Yes, I know from the start that it will be horrible but not that much. For just 200000 records, the program runs for more than an hour.
Basically what I am doing is:
for each searchfield in search fields
for each sample in samples
do a q-gram matching
if there are matches then return it
else
split the searchfield into uniwords
for each sample in samples
split sample into uniwords
for each uniword in samples
if the uniword is a known abbreviation
then search the dictionary for its full word or other known abbr
else do a jaro-winkler matching
average the distances of all the uniwords
if the average is above threshold then make it as a match and break
end for
if there is a match make a comment that it matched one of the samples partially
end else
end for
Yes, this code is very loop-happy. I am using brute-force because the recall is very important. So, I'm wondering how can I make it faster since I am not only running it for 200000 data for millions of data and the computers of the client are not high-end (1GB-2GB of Ram Pentium 4 or Dual-Core, the computer where I test this program is a Dual Core with 4GB of Ram). I came across TF/IDF but I do not know if it will be sufficient. And I wonder how can google make searches real time.
Thanks in advance!
Edit:
This program is a data filterer. From 200,000 dummy data (actual data is about 12M), I must filter data that is irrelevant to the samples (500 dummy samples, I still do not know how much the actual amount of samples).
With the given dummy data and samples, the running time is about 1 hour but after tinkering here and there, I have successfully lessen it to 10-15 minutes. I have lessen it by grouping the fields and samples that begin with the same character (discounting special and non-meaningful words e.g. the, a, an) and matching the fields to the sample with the same first character. I know there is a problem there. What if the field was misspelled at the first character? But I think the number of those are negligible. The samples are spelled correctly since it is always maintained.

what is your programing language? I guess using q=2 or 3 is sufficient. Also I suggested to come from uni gram to higher degrees.

Related

GPT-J and GPT-Neo generate too long sentences

I trained a GPT-J and GPT-Neo models (fine tuning) on my texts and am trying to generate new text. But very often the sentences are very long (sometimes 300 characters each), although in the dataset the sentences are of normal length (50-100 characters usually). I tried a lot of things, changed, adjusted the temperature, top_k, but still half of the results with long phrases and I neen more short.
What can you try?
Here are long examples of generated results:
The support system that they have built has allowed us as users who
are not code programmers or IT administrators some ability to create
our own custom solutions without needing much programming experience
ourselves from scratch!
All it requires are documents about your inventory process but
I've found them helpful as they make sure you do everything right for
maximum efficiency because their knowledge base keeps reminding me
there's new ways i can be doing some things wrong since upgrading my
license so even though its good at finding errors with documentation
like an auditor may bring up later downline someone else might benefit
if those files dont exist anymore after one year when upgrades renews
automatically!
With all GPT models you can specify the "max_length" parameter during generation. This will force the model to generate an amount of tokens equal to max_length. You could also play with num_return_sequences and use a helper function to choose the shortest sequence.
Example:
output = model.generate(input_ids, do_sample=True, top_k=50, max_length=100, top_p=0.95, num_return_sequences=1)
These large language models are trained on massive amounts of data, and fine-tuning them can take patience as they learn to adapt to what you're feeding it. Try different things - adjust your training data format, try different samples, use a pre-prompt during generation to guide the model, etc.. A model like GPT-J does a mind-numbingly large amount of calculations just to spit out a single word, so it is hard to predict what exactly is causing it to say one thing over another.

Test multiple algorithms in one experiment

Is there any way to test multiple algorithms rather than doing it once for each and every algorithm; then checking the result? There are a lot of times where I don’t really know which one to use, so I would like to test multiple and get the result (error rate) fairly quick in Azure Machine Learning Studio.
You could connect the scores of multiple algorithms with an 'Evaluate Model' button to evaluate algorithms against each other.
Hope this helps.
The module you are looking for, is the one called “Cross-Validate Model”. It basically splits whatever comes in from the input-port (dataset) into 10 pieces, then reserves the last piece as the “answer”; and trains the nine other subset models and returns a set of accuracy statistics measured towards the last subset. What you would look at is the column called “Mean absolute error” which is the average error for the trained models. You can connect whatever algorithm you want to one of the ports, and subsequently you will receive the result for that algorithm in particular after you “right-click” the port which gives the score.
After that you can assess which algorithm did the best. And as a pro-tip; you could use the Filter-based-feature selection to actually see which column had a significant impact on the result.
You can check section 6.2.4 of hands-on-lab at GitHub https://github.com/Azure-Readiness/hol-azure-machine-learning/blob/master/006-lab-model-evaluation.md which focuses on the evaluation of multiple algorithms etc.

How to predict when next event occurs based on previous events? [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 11 years ago.
Improve this question
Basically, I have a reasonably large list (a year's worth of data) of times that a single discrete event occurred (for my current project, a list of times that someone printed something). Based on this list, I would like to construct a statistical model of some sort that will predict the most likely time for the next event (the next print job) given all of the previous event times.
I've already read this, but the responses don't exactly help out with what I have in mind for my project. I did some additional research and found that a Hidden Markov Model would likely allow me to do so accurately, but I can't find a link on how to generate a Hidden Markov Model using just a list of times. I also found that using a Kalman filter on the list may be useful but basically, I'd like to get some more information about it from someone who's actually used them and knows their limitations and requirements before just trying something and hoping it works.
Thanks a bunch!
EDIT: So by Amit's suggestion in the comments, I also posted this to the Statistics StackExchange, CrossValidated. If you do know what I should do, please post either here or there
I'll admit it, I'm not a statistics kind of guy. But I've run into these kind of problems before. Really what we're talking about here is that you have some observed, discrete events and you want to figure out how likely it is you'll see them occur at any given point in time. The issue you've got is that you want to take discrete data and make continuous data out of it.
The term that comes to mind is density estimation. Specifically kernel density estimation. You can get some of the effects of kernel density estimation by simple binning (e.g. count the number events in a time interval such as every quarter hour or hour.) Kernel density estimation just has some nicer statistical properties than simple binning. (The produced data is often 'smoother'.)
That only takes care of one of your problems, though. The next problem is still the far more interesting one -- how do you take a time line of data (in this case, only printer data) and produced a prediction from it? First thing's first -- the way you've set up the problem may not be what you're looking for. While the miracle idea of having a limited source of data and predicting the next step of that source sounds attractive, it's far more practical to integrate more data sources to create an actual prediction. (e.g. maybe the printers get hit hard just after there's a lot of phone activity -- something that can be very hard to predict in some companies) The Netflix Challenge is a rather potent example of this point.
Of course, the problem with more data sources is that there's extra legwork to set up the systems that collect the data then.
Honestly, I'd consider this a domain-specific problem and take two approaches: Find time-independent patterns, and find time-dependent patterns.
An example time-dependent pattern would be that every week day at 4:30 Suzy prints out her end of the day report. This happens at specific times every day of the week. This kind of thing is easy to detect with fixed intervals. (Every day, every week day, every weekend day, every Tuesday, every 1st of the month, etc...) This is extremely simple to detect with predetermined intervals -- just create a curve of the estimated probability density function that's one week long and go back in time and average the curves (possibly a weighted average via a windowing function for better predictions).
If you want to get more sophisticated, find a way to automate the detection of such intervals. (Likely the data wouldn't be so overwhelming that you could just brute force this.)
An example time-independent pattern is that every time Mike in accounting prints out an invoice list sheet, he goes over to Johnathan who prints out a rather large batch of complete invoice reports a few hours later. This kind of thing is harder to detect because it's more free form. I recommend looking at various intervals of time (e.g. 30 seconds, 40 seconds, 50 seconds, 1 minute, 1.2 minutes, 1.5 minutes, 1.7 minutes, 2 minutes, 3 minutes, .... 1 hour, 2 hours, 3 hours, ....) and subsampling them via in a nice way (e.g. Lanczos resampling) to create a vector. Then use a vector-quantization style algorithm to categorize the "interesting" patterns. You'll need to think carefully about how you'll deal with certainty of the categories, though -- if your a resulting category has very little data in it, it probably isn't reliable. (Some vector quantization algorithms are better at this than others.)
Then, to create a prediction as to the likelihood of printing something in the future, look up the most recent activity intervals (30 seconds, 40 seconds, 50 seconds, 1 minute, and all the other intervals) via vector quantization and weight the outcomes based on their certainty to create a weighted average of predictions.
You'll want to find a good way to measure certainty of the time-dependent and time-independent outputs to create a final estimate.
This sort of thing is typical of predictive data compression schemes. I recommend you take a look at PAQ since it's got a lot of the concepts I've gone over here and can provide some very interesting insight. The source code is even available along with excellent documentation on the algorithms used.
You may want to take an entirely different approach from vector quantization and discretize the data and use something more like a PPM scheme. It can be very much simpler to implement and still effective.
I don't know what the time frame or scope of this project is, but this sort of thing can always be taken to the N-th degree. If it's got a deadline, I'd like to emphasize that you worry about getting something working first, and then make it work well. Something not optimal is better than nothing.
This kind of project is cool. This kind of project can get you a job if you wrap it up right. I'd recommend you do take your time, do it right, and post it up as function, open source, useful software. I highly recommend open source since you'll want to make a community that can contribute data source providers in more environments that you have access to, will to support, or time to support.
Best of luck!
I really don't see how a Markov model would be useful here. Markov models are typically employed when the event you're predicting is dependent on previous events. The canonical example, of course, is text, where a good Markov model can do a surprisingly good job of guessing what the next character or word will be.
But is there a pattern to when a user might print the next thing? That is, do you see a regular pattern of time between jobs? If so, then a Markov model will work. If not, then the Markov model will be a random guess.
In how to model it, think of the different time periods between jobs as letters in an alphabet. In fact, you could assign each time period a letter, something like:
A - 1 to 2 minutes
B - 2 to 5 minutes
C - 5 to 10 minutes
etc.
Then, go through the data and assign a letter to each time period between print jobs. When you're done, you have a text representation of your data, and that you can run through any of the Markov examples that do text prediction.
If you have an actual model that you think might be relevant for the problem domain, you should apply it. For example, it is likely that there are patterns related to day of week, time of day, and possibly date (holidays would presumably show lower usage).
Most raw statistical modelling techniques based on examining (say) time between adjacent events would have difficulty capturing these underlying influences.
I would build a statistical model for each of those known events (day of week, etc), and use that to predict future occurrences.
I think the predictive neural network would be a good approach for this task.
http://en.wikipedia.org/wiki/Predictive_analytics#Neural_networks
This method is also used for predicting f.x. weather forecasting, stock marked, sun spots.
There's a tutorial here if you want to know more about how it works.
http://www.obitko.com/tutorials/neural-network-prediction/
Think of a markov chain like a graph with vertex connect to each other with a weight or distance. Moving around this graph would eat up the sum of the weights or distance you travel. Here is an example with text generation: http://phpir.com/text-generation.
A Kalman filter is used to track a state vector, generally with continuous (or at least discretized continuous) dynamics. This is sort of the polar opposite of sporadic, discrete events, so unless you have an underlying model that includes this kind of state vector (and is either linear or almost linear), you probably don't want a Kalman filter.
It sounds like you don't have an underlying model, and are fishing around for one: you've got a nail, and are going through the toolbox trying out files, screwdrivers, and tape measures 8^)
My best advice: first, use what you know about the problem to build the model; then figure out how to solve the problem, based on the model.

Find UK PostCodes closest to other UK Post Codes by matching the Post Code String

Here is a question that has me awake for a number of days now. The only conclusion I came up so far is that Red Bull does not usually help coders.
I have a scenario in my application where I have a couple of jobs (1 to 50). The job has an address and I have the following properties of an address: Postcode, Latitude, and Longitude.
I have a table of workers also and they too have addresses. While the jobs or workers are created through screens, I use Google Map queries to make sure the provided Postcode is valid and is in UK so all the addresses are verified.
I am using a scheduler control to display some workers on y-axis and a timeline on x-axis. Every job has a date and can only move vertically on the scheduler on the job’s date. The user selects a number of jobs and they are displayed in a basket close to the scheduler. The user can then drag and drop job against workers. All this is manual so it works.
My task is to automate this so that the user does not do much except just verifying and allotting the jobs. Therefore, I have to automate the process.
Every worker has a property called WillingMaximumDistanceTravel which is an integer representing miles, the worker is willing to travel for a job.
Now here is the headache: I have over 1500 workers. I have a utility function that uses Newtonsoft’s Json Convert to de-serialize a stream of response from Google Maps. I need to feed it Postcode A and B.
I also plan to introduce a new table to DB to store the distance finds as Postcode A, Postcode B, and Distance. Therefore, if I find myself comparing the same postcodes again, I will just retrieve the result from DB instead and slowly and eventually, I would no longer require bothering Google anymore as this table would be very comprehensive.
I cannot use the simple Haversine formula, as Crow-fly path is not my requirement here. The pain in this is that it takes a lot of time to calculate. Some workers can travel over 10 miles while some vary from 15 to 80. I have to take the first job from the list and run it with every applicable worker o the system! I was wondering that the UK postcode has a pattern to it. If we sort a list of UK postcodes, can we rough-estimate, from the alphanumeric pattern, where will we hit a 100-mile mark, a 200-mile mark and so on?
If anyone is interested in the code, please drop a line and I will paste it.
(I work for Google, but I'm not speaking on behalf of Google. I have nothing to do with the maps API.)
I suspect this isn't a great situation for using the Google Maps API, simply because you're pushing so much data through. You really don't want to make that many requests, even if you could do so under the directions limits.
When I tackled something similar in a previous job, we bought into a locally-hosted maps API - but even that wasn't fast enough for this sort of work. We ended up precomputing the time to travel from the centroid of each postcode "area" (probably the wrong name for it, but the first part of the postcode followed by the first digit of the remainder, e.g. "SW1W 9" for "SW1W 9TQ") to every other area, storing the result in a giant table. I think we only did it for postcodes which were within 100 miles or something similar, to cut down on the amount of preprocessing.
Even then, a simple DB wasn't quite as fast as we wanted - so we stored the results in a giant file, with a single byte per source/destination pair. (We had a fixed sequence of source postcodes and target postcodes, so we didn't need to specify those.) At that point, computing a travel time consisted of:
Work out postcode areas (substring work)
Find the index of each postcode area within the sequence
Check if we'd loaded that part of the file (we lazy loaded for startup speed)
Load the row if necessary, and just access it otherwise
The bytes were on a sliding scale of accuracy, so for the first 60 minutes it was on a per-minute basis, then each extra value meant an extra 2 minutes, then 5 etc. (Those aren't the exact values, but it was something like that.)
When you've worked out "good candidates" you can ask an on-site API or the Google Maps API for more accurate directions for your exact postcodes, of course.
You want to look for a spatial-index or a space-filling-curve. A spatial index reduce the 2d problem to a 1d problem and recursivley subdivide the surface into smaller tiles but it is basically a reordering of the tiles. You can subdivide the surface either with an index or a string using 4 characters. The latter one can be useful to you because it let you query the string with all string operation hidden in the database engine. You want to look for Nick's spatial index quadtree hilbert-curve blog.

How do search engines conduct 'AND' operation?

Consider the following search results:
Google for 'David' - 591 millions hits in 0.28 sec
Google for 'John' - 785 millions hits in 0.18 sec
OK. Pages are indexed, it only needs to look up the count and the first few items in the index table, so speed is understandable.
Now consider the following search with AND operation:
Google for 'David John' ('David' AND 'John') - 173 millions hits in 0.25 sec
This makes me ticked ;) How on earth can search engines get the result of AND operations on gigantic datasets so fast? I see the following two ways to conduct the task and both are terrible:
You conduct the search of 'David'. Take the gigantic temp table and conduct a search of 'John' on it. HOWEVER, the temp table is not indexed by 'John', so brute force search is needed. That just won't compute within 0.25 sec no matter what HW you have.
Indexing by all possible word
combinations like 'David John'. Then
we face a combinatorial explosion on the number of keys and
not even Google has the storage
capacity to handle that.
And you can AND together as many search phrases as you want and you still get answers under a 0.5 sec! How?
What Markus wrote about Google processing the query on many machines in parallel is correct.
In addition, there are information retrieval algorithms that make this job a little bit easier. The classic way to do it is to build an inverted index which consists of postings lists - a list for each term of all the documents that contain that term, in order.
When a query with two terms is searched, conceptually, you would take the postings lists for each of the two terms ('david' and 'john'), and walk along them, looking for documents that are in both lists. If both lists are ordered the same way, this can be done in O(N). Granted, N is still huge, which is why this will be done on hundreds of machines in parallel.
Also, there may be additional tricks. For example, if the highest-ranked documents were placed higher on the lists, then maybe the algorithm could decide that it found the 10 best results without walking the entire lists. It would then guess at the remaining number of results (based on the size of the two lists).
I think you're approaching the problem from the wrong angle.
Google doesn't have a tables/indices on a single machine. Instead they partition their dataset heavily across their servers. Reports indicate that as many as 1000 physical machines are involved in every single query!
With that amount of computing power it's "simply" (used highly ironically) a matter of ensuring that every machine completes their work in fractions of a second.
Reading about Google technology and infrastructure is very inspiring and highly educational. I'd recommend reading up on BigTable, MapReduce and the Google File System.
Google have an archive of their publications available with lots of juicy information about their techologies. This thread on metafilter also provides some insight to the enourmous amount of hardware needed to run a search engine.
I don't know how google does it, but I can tell you how I did it when a client needed something similar:
It starts with an inverted index, as described by Avi. That's just a table listing, for every word in every document, the document id, the word, and a score for the word's relevance in that document. (Another approach is to index each appearance of the word individually along with its position, but that wasn't required in this case.)
From there, it's even simpler than Avi's description - there's no need to do a separate search for each term. Standard database summary operations can easily do that in a single pass:
SELECT document_id, sum(score) total_score, count(score) matches FROM rev_index
WHERE word IN ('david', 'john') GROUP BY document_id HAVING matches = 2
ORDER BY total_score DESC
This will return the IDs of all documents which have scores for both 'David' and 'John' (i.e., both words appear), ordered by some approximation of relevance and will take about the same time to execute regardless of how many or how few terms you're looking for, since IN performance is not affected much by the size of the target set and it's using a simple count to determine whether all terms were matched or not.
Note that this simplistic method just adds the 'David' score and the 'John' score together to determine overall relevance; it doesn't take the order/proximity/etc. of the names into account. Once again, I'm sure that google does factor that into their scores, but my client didn't need it.
I did something similar to this years ago on a 16 bit machine. The dataset had an upper limit of around 110,000 records (it was a cemetery, so finite limit on burials) so I setup a series of bitmaps each containing 128K bits.
The search for "david" resulting in me setting the relevant bit in one of the bitmaps to signify that the record had the word "david" in it. Did the same for 'john' in a second bitmap.
Then all you need to do is a binary 'and' of the two bitmaps, and the resulting bitmap tells you which record numbers had both 'david' and 'john' in them. Quick scan of the resulting bitmap gives you back the list of records that match both terms.
This technique wouldn't work for google though, so consider this my $0.02 worth.

Resources