We are building a text search solution and want a way to measure the precision and recall of the system every time we add new document types. From reading some of the posts here, it sounds like a machine learning based solution is the way to go. Can an expert comment on this? We would then look to add machine learning folks to our team.
The only way to get an F1-score requires knowing the correct class and rank of all samples returned by your evaluation queries, and you also need those evaluation queries in the first place.
Any machine learning approach will need a large amount of manual work to provide those samples and/or queries. So large that it won't save you any time.
Another bad aspect of such an evaluation is the intrinsic error that learning introduces. It grows with the size of the search engine's index and with the number of examples required. You will never get a good evaluation that way.
Forget machine learning for the evaluation of a search engine.
Build your test queries and samples by hand; over time the set will become big and reliable.
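For instance, a minimal sketch of such a hand-built evaluation, assuming a `search(query)` function that returns a ranked list of document IDs; the queries and relevance judgments below are made-up placeholders:

```python
# Hand-labelled gold set: query -> set of relevant document IDs (placeholders).
GOLD = {
    "laser printer toner": {"doc12", "doc31", "doc77"},
    "vacation policy": {"doc03", "doc44"},
}

def evaluate(search, gold, k=10):
    """Average precision@k, recall@k and F1 over the hand-built queries."""
    p_sum = r_sum = f_sum = 0.0
    for query, relevant in gold.items():
        retrieved = search(query)[:k]          # assumed: ranked list of doc IDs
        hits = len(set(retrieved) & relevant)
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant)
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        p_sum, r_sum, f_sum = p_sum + precision, r_sum + recall, f_sum + f1
    n = len(gold)
    return p_sum / n, r_sum / n, f_sum / n
```

Every new document type then just means adding a few labelled queries to the gold set and re-running.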
If you really want machine learning in your system, you should look at query pre-processing instead. Deriving some meta-information about the query by other means (you mention SVM, why not?) is generally good for performance, and since it doesn't change the ground truth, you can reuse the same samples for an end-to-end evaluation.
That's what I did a few years ago, but with a naive Bayes classifier for natural language analysis.
https://nlp.stanford.edu/projects/glove/
I'm trying to use GloVe for summarizing music reviews, but I'm wondering which version is the best for my project. Will "glove.840B.300d.zip" give me a more accurate text summarization since it used way more tokens? Or perhaps the Wikipedia 2014 + Gigaword 5 is more representative than Common Crawl? Thanks!
Unfortunately I don't think anyone can give you a better answer for this than:
"try several options, and see which one works the best"
I've seen work using the Wikipedia 2014 + Gigaword 100d vectors that produced state-of-the-art results for reading comprehension. Without experimentation, it's difficult to say conclusively which corpus is closer to your music review set, or what the impact of higher-dimensional word embeddings will be.
This is just random advice, but I guess I would suggest trying in this order:
100d from Wikipedia+Gigaword
300d from Wikipedia+Gigaword
300d from Common Crawl
You might as well start with the smaller dimensional embeddings while prototyping, and then you could experiment with larger embeddings to see if you get a performance enhancement.
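If it helps, here is a minimal sketch for swapping the GloVe files in and out while prototyping. The file names match the official downloads, but the whitespace tokenizer and the vector-averaging are simplifying assumptions of mine, not part of GloVe:

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe .txt file into {word: vector}."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors  # note: a few multi-word tokens in the 840B file need extra care

def embed(text, vectors, dim):
    """Average the vectors of known words; zero vector if none are known."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

# Try in the suggested order, measuring your summarizer each time:
# glove = load_glove("glove.6B.100d.txt")    # Wikipedia 2014 + Gigaword 5
# glove = load_glove("glove.6B.300d.txt")
# glove = load_glove("glove.840B.300d.txt")  # Common Crawl
```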
And in the spirit of promoting other groups' work, I would definitely say you should look at these ELMo vectors from AllenNLP:
http://allennlp.org/elmo
They look very promising!
What are some ways, including machine learning, that I can use in my projects to generate items related to one another: related apps, related websites, related products, etc.?
I've been brainstorming; these are my strategies so far...
One way I can think of is to show items from the same category. But that would be too broad.
A 2nd way improves on that: keep track of what people click next and promote that item, while keeping the bottom of the list randomized so other relevant items still show up and get clicked (rough sketch below).
A 3rd way is to use machine learning: provide training data somehow and learn from that.
I want something simple but smart, something that gets better with time.
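A rough sketch of the 2nd strategy, with made-up function names: count which item users click next and promote the most-clicked followers, keeping a randomized tail for exposure:

```python
import random
from collections import Counter, defaultdict

next_clicks = defaultdict(Counter)  # item -> counts of items clicked next

def record_transition(current_item, clicked_item):
    """Call whenever a user clicks clicked_item while viewing current_item."""
    next_clicks[current_item][clicked_item] += 1

def related(current_item, candidates, top_n=5, tail_n=5):
    """Most-clicked followers first, then a randomized tail of other candidates."""
    top = [item for item, _ in next_clicks[current_item].most_common(top_n)]
    rest = [c for c in candidates if c not in top and c != current_item]
    return top + random.sample(rest, min(tail_n, len(rest)))
```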
Collaborative filtering is designed to solve exactly this problem. The catch is that it only produces good results when you have a lot of data. I mean... A LOT. And it's not a really simple thing to use; then again, no machine learning technique is. There are some Node.js packages for CF available, but I have no idea how good they are.
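To make the idea concrete, here is a minimal item-based collaborative filtering sketch (Python for illustration, since I can't vouch for any particular Node.js package): cosine similarity between item columns of a made-up user-item interaction matrix:

```python
import numpy as np

# rows = users, cols = items; 1 = the user interacted with the item (toy data)
interactions = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
], dtype=np.float64)

def item_similarity(m):
    """Cosine similarity between item columns."""
    norms = np.linalg.norm(m, axis=0)
    norms[norms == 0] = 1.0                    # avoid division by zero
    normalized = m / norms
    return normalized.T @ normalized

def related_items(item_idx, m, top_n=2):
    """Indices of the items most similar to item_idx."""
    sims = item_similarity(m)[item_idx].copy()
    sims[item_idx] = -1.0                      # exclude the item itself
    return np.argsort(sims)[::-1][:top_n]

print(related_items(0, interactions))          # e.g. [1 2]
```

With real traffic the matrix is huge and sparse, which is where the "A LOT of data" caveat bites.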
I'm pretty new to search engines and a complete newbie at machine learning. But I wanted to know if there is a way to combine the functionality of search engines like Elasticsearch or Apache Solr with machine learning projects like Apache Mahout, H2O or PredictionIO.
For example, say you work on a travel website where you can search for a destination. You start typing "au", so the first suggestions are "AUstria", "AUstralia", "mAUritius", "mAUritania"... etc. This is typically what Elasticsearch can do.
But you know that this user has already travelled to Mauritania three times, so you want Mauritania to move to first place in the suggestions. And I guess that's typically what machine learning can do.
Are there bridges between these two types of technologies? Can machine learning efficiently improve the work of a search engine?
I'm open to all answers, regardless of the technologies used. If you have ever experienced this type of problem, my ears are wide open :-)
Thank you
Your question is very general in nature, so my answer will have to be the same.
Consider a recommender framework such as the correlated co-occurrence one in Apache Mahout. Unlike the vanilla Spark recommender, this implementation allows for multiple types of actions, such as having viewed a web site, having booked a trip there before, demographic information, etc.
You would then calculate the recommendations for each user at whatever interval, the recommendations being based on multiple criteria and on what other people similar to this user have done. Consider your 'items' in this case to be every destination in the world. So we now have every possible destination ranked for each user.
It is then a trivial extension to index Elasticsearch by user / the ordered list of that user's recommended destinations.
For example, we have a user who has visited Berlin, looked at several hotels in Vienna, and is from Romania. When the user types in "au", we would expect to see "Austria" come up in the results much higher than "Australia".
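A hedged sketch of one such bridge: wrap the plain suggestion query in a bool query that boosts each destination the recommender ranked highly for this user. The field names ("name", "name.raw") assume an edge-ngram-analyzed text field with a keyword sub-field, and "recommended" is the per-user output of whatever recommender you run offline:

```python
recommended = ["Austria", "Mauritania"]  # per-user recommender output (example)

body = {
    "query": {
        "bool": {
            # the usual suggestion match on what the user typed so far
            "must": {"match": {"name": "au"}},
            # optional per-destination boosts from the recommender
            "should": [
                {"term": {"name.raw": {"value": dest, "boost": 5.0}}}
                for dest in recommended
            ],
        }
    }
}
# es.search(index="destinations", body=body)  # elasticsearch-py client
```

Destinations that match the typed prefix and are also recommended score higher, so the personalized ones float to the top without changing the index itself.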
Per the comments and downvotes: you probably should have either A) asked a more specific programming question or B) asked this question on another forum such as Data Science Stack Exchange, fyi.
I'm currently confused about incremental software methodology.
What is the main difference between incremental development that adopts a plan-driven approach and incremental development that adopts an agile approach?
Can anyone explain the difference between those two, and whether my choice was good for the project?
Learning is at the core of the agile approaches. They embrace the fact that it is almost impossible to have enough information to make a detailed plan up front. Instead, implementing, or possibly just trying to implement, your first feature will trigger very valuable learning, both about your implementation and about the usage and actual needs in the field.
I'm not sure what "documentations are really important" actually means, but dividing the implementation along module boundaries will cause a number of unwanted effects:
you can only learn about the usage of the complete system after all modules are done, a.k.a. too late. That will create an unknown amount of remaining work after you thought you were done.
how do you know that the first module is done? Presumably based on some guesswork about what it should do, which might be right but most probably is at least slightly wrong, causing unknown late modifications
integration problems will likewise only show up after the third module was supposed to be finished
All three push the realization of problems, and an unknown amount of remaining work, to the end.
Agile focuses on driving out these learnings and this information by forcing early feedback: early integration (as soon as there is a skeleton for the three modules), and user feedback by implementing one user-level feature at a time, with a demo of each as soon as it is ready.
It is a strategy for minimizing risks in all software endeavours.
In my mind, you should have gone for an agile approach.
When a large system developed by Agile process requires a sudden large-scale change that affects most everything, what is the best way to go about it using Agile? Does the iterative part change at this point?
For example, what if a decision is made to make a centralized system a distributed one? Or choose another large pervasive example.
Arguably, large changes should have been planned for, but it's never a perfect world (which is one of the reasons Agile exists), so assume that suddenly a major change is introduced that shakes the foundation.
Edit to summarize solutions:
It's incremental all the way no matter how large or small the change may be.
"Does the iterative part change at this point?"
Never.
No matter how "pervasive" the change appears to be, you still have to work incrementally, in iterations you can manage.
You still have to prioritize the changes and make them in a way that will continue to pass unit tests and can be released when needed.
You may, for example, find that fixing 80% of the system is sufficient, and you may release. Or you may be required to fix 100% of the system before releasing.
You still work incrementally. In sprints. Irrespective of when you release.
Agile has no magic answers.
There are a number of approaches:
Plot a path of reasonably incremental changes that takes the system from one architecture to the other. If you have reasonably well-factored code, you should be ditching the code that is made redundant by the change and keeping the stuff that's independent of it.
Another approach, if things are really different, is to start a parallel development of the components for the new system.
Or start fresh and steal as much as you can from the old project.
It depends how BIG the change really is.