About "AUTOMATIC TEXT SUMMARIZER (lingustic based)" [closed] - text

I am having "AUTOMATIC TEXT SUMMARIZER (linguistic approach)" as my final year project. I have collected enough research papers and gone through them. Still I am not very clear about the 'how-to-go-for-it' thing. Basically I found "AUTOMATIC TEXT SUMMARIZER (statistical based)" and found that it is much easier compared to my project. My project guide told me not to opt this (statistical based) and to go for linguistic based.
Anyone who has ever worked upon or even heard of this sort of project would be knowing that summarizing any document means nothing but SCORING each sentence (by some approach involving some specific algos) and then selecting sentences having score more than threshold score. Now the most difficult part of this project is choosing the appropriate algorithm for scoring and later implementing it.
I have moderate programming skills and would like to code in JAVA (because there I'll get lots of APIs resulting in lesser overheads). Now I want to know that for my project, what should be my approach and algos used. Also how to implement them.
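To make the scoring-and-threshold idea above concrete, here is a minimal, illustrative Java sketch of an extractive summarizer that scores sentences by average word frequency and keeps those above a threshold. The class name, the sentence splitter, and the frequency heuristic are assumptions for the example only; a linguistic approach would replace the scoring function with its own algorithm (for instance one based on lexical chains).

    import java.util.*;

    // Minimal extractive summarizer sketch: score sentences, keep those above a threshold.
    // The frequency-based score is a placeholder; a linguistic approach would plug in
    // its own scoring function (e.g. based on lexical chains or parse information).
    public class NaiveSummarizer {

        public static List<String> summarize(String text, double threshold) {
            String[] sentences = text.split("(?<=[.!?])\\s+");

            // Count word frequencies over the whole document.
            Map<String, Integer> freq = new HashMap<>();
            for (String s : sentences) {
                for (String w : s.toLowerCase().split("\\W+")) {
                    if (!w.isEmpty()) freq.merge(w, 1, Integer::sum);
                }
            }

            // Score each sentence as the average frequency of its words,
            // then keep sentences whose score exceeds the threshold.
            List<String> summary = new ArrayList<>();
            for (String s : sentences) {
                double score = 0;
                int n = 0;
                for (String w : s.toLowerCase().split("\\W+")) {
                    if (!w.isEmpty()) { score += freq.get(w); n++; }
                }
                if (n > 0 && score / n > threshold) summary.add(s);
            }
            return summary;
        }

        public static void main(String[] args) {
            String doc = "Cats are small. Cats purr. Dogs bark loudly. Cats and dogs are pets.";
            System.out.println(summarize(doc, 2.0));
        }
    }

The point is only to show the score-then-select shape of the task; the real project work goes into whatever replaces the scoring loop.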

Using Lexical Chains for Text Summarization (Microsoft Research)
An analysis of different algorithms: DasMartins.2007
Most important part in the doc:
• Nenkova (2005) analyzes that no system could beat the baseline with statistical significance
• Striking result!
Note that there are two different nuances to the linguistic approach:
Linguistic rating system (all clear here)
Linguistic generation (rewrites sentences to build the summary)

Automatic summarization is a pretty complex area - try to get your Java skills in order first, as well as your understanding of statistical NLP, which uses machine learning. You can then work through building something of substance. Evaluate your solution and make sure you have concretely defined your measurement variables and how you went about your evaluation; otherwise, your project is doomed to failure. This is generally considered a high-risk project for final year undergraduate students, as they often fail to get the principles right, then implement them in a way that is not right either, and then their evaluation measures are ill-defined and don't clearly reflect their own work. My advice would be to focus on one area of summarization rather than many, since you can have single-document and multi-document summaries. The more varied you make your project, the less likely you are to receive a good mark. Keep it focused and in depth. Evaluate other people's work, then the process you decided to take and the outcomes of it.
Readings:
- The Jurafsky book on NLP has a section towards the back on summarization and QA.
- Advances in Text Summarization by Inderjeet Mani is really good.
Understand what things like term weighting, centroid-based summarization, log-likelihood ratio, coherence relations, sentence simplification, maximum marginal relevance, and redundancy are, and what a focused summary actually is.
You can attempt it using a supervised or an unsupervised approach as well as a hybrid.
The linguistic approach is the safer option; that is why you have been advised to take it.
Try attempting it linguistically first, then build the statistical part on top to hybridize your solution.
Use it as an exercise to learn the theory and practical implications of the algorithms and to build on your knowledge, as you will no doubt have to explain and defend your project to the judging panel.
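One of the concepts listed above, maximum marginal relevance (MMR), is small enough to sketch. Below is a rough, illustrative Java version that uses plain word-overlap (Jaccard) similarity; the helper names and the similarity measure are assumptions for the example, not anyone's published method.

    import java.util.*;

    // Illustrative maximum marginal relevance (MMR) selection:
    // repeatedly pick the sentence that is most relevant to the document
    // while being least similar to the sentences already selected.
    public class MmrSelector {

        // Jaccard word-overlap similarity between two word sets (a simplistic stand-in).
        static double similarity(Set<String> a, Set<String> b) {
            if (a.isEmpty() || b.isEmpty()) return 0;
            Set<String> inter = new HashSet<>(a);
            inter.retainAll(b);
            Set<String> union = new HashSet<>(a);
            union.addAll(b);
            return (double) inter.size() / union.size();
        }

        static Set<String> words(String s) {
            return new HashSet<>(Arrays.asList(s.toLowerCase().split("\\W+")));
        }

        public static List<String> select(List<String> sentences, String document,
                                          int k, double lambda) {
            Set<String> docWords = words(document);
            List<String> selected = new ArrayList<>();
            List<String> remaining = new ArrayList<>(sentences);

            while (selected.size() < k && !remaining.isEmpty()) {
                String best = null;
                double bestScore = Double.NEGATIVE_INFINITY;
                for (String s : remaining) {
                    double relevance = similarity(words(s), docWords);
                    double redundancy = 0;
                    for (String chosen : selected) {
                        redundancy = Math.max(redundancy, similarity(words(s), words(chosen)));
                    }
                    double mmr = lambda * relevance - (1 - lambda) * redundancy;
                    if (mmr > bestScore) { bestScore = mmr; best = s; }
                }
                selected.add(best);
                remaining.remove(best);
            }
            return selected;
        }
    }

The lambda parameter trades relevance off against redundancy; values around 0.5 to 0.7 are a reasonable starting point when experimenting.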

If you really have read those research papers and books, you probably know what is known. Now it is up to you to implement that knowledge in a Java application. Or you could expand human knowledge by doing some innovation or invention; if you do expand human knowledge, you will have become a true scientist.

Please make your question more specific, in these two main areas:
Project definition: What is the goal of your project?
Is the input unit a single document? A list of documents?
Do you intend your program to use machine learning?
What is the output?
How will you measure success?
Your background knowledge: You intend to use linguistic rather than statistical methods.
Do you have background in parsing natural language? In semantic representation?
I think some of these questions are tough. I am asking them because I spent too much time trying to answer similar questions in the course of my studies. Once you get these sorted out, I may be able to give you some pointers. Mani's "Automatic Summarization" looks like a good start, at least the introductory chapters.

The University of Sheffield did some work on automatic email summarising as part of the EU FASiL project a few years back.

Related

Is there a search engine that will give a direct answer? [closed]

I've been wondering about this for a while and I can't see why Google haven't tried it yet - or maybe they have and I just don't know about it.
Is there a search engine that you can type a question into which will give you a single answer rather than a list of results which you then have to trawl through yourself to find what you want to know?
For example, this is how I would design the system:
User’s input: “Where do you go to get your eyes tested?”
System output: “Opticians. Certainty: 95%”
This would be calculated as follows:
1. The input is parsed from natural language into a simple search string, probably something like "eye testing" in this case. The term "Where do you go" would also be interpreted by the system and used when comparing results.
2. The search string would be fed into a search engine.
3. The system would then compare the contents of the results to find matching words or phrases, taking note of what the question is asking (i.e. what, where, who, how, etc.).
4. Once a suitable answer is determined, the system displays it to the user along with a measure of how sure it is that the answer is correct.
Due to the dispersed nature of the Internet, a correct answer is likely to appear multiple times, especially for simple questions. For this particular example, it wouldn’t be too hard for the system to recognise that this word keeps cropping up in the results and that it is almost certainly the answer being searched for.
For more complicated questions, a lower certainty would be shown, and possibly multiple results with different levels of certainty. The user would also be offered the chance to see the sources which the system calculated the results from.
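As a toy illustration of the answer-voting idea sketched above (a candidate term that recurs across result snippets becomes the answer, and how often it recurs becomes the certainty), here is a hedged Java sketch. The snippets, the stopword list, and the tokenization are stand-ins; a real system would pull snippets from a search engine API and do proper linguistic analysis of the question.

    import java.util.*;

    // Toy "answer voting": count candidate terms across result snippets and
    // report the most frequent one with a rough certainty score.
    public class AnswerVoter {

        public static void main(String[] args) {
            // In a real system these snippets would come from a search engine API.
            List<String> snippets = Arrays.asList(
                "You can get your eyes tested at the opticians.",
                "Most opticians offer a free eye test.",
                "Book an appointment with your local opticians or optometrist.");

            Set<String> stopwords = new HashSet<>(Arrays.asList(
                "you", "can", "get", "your", "eyes", "tested", "at", "the",
                "a", "an", "or", "with", "most", "offer", "free", "eye",
                "test", "book", "appointment", "local"));

            Map<String, Integer> votes = new HashMap<>();
            for (String snippet : snippets) {
                for (String w : snippet.toLowerCase().split("\\W+")) {
                    if (!w.isEmpty() && !stopwords.contains(w)) {
                        votes.merge(w, 1, Integer::sum);
                    }
                }
            }

            // Pick the candidate with the most votes; certainty = votes / snippets.
            Map.Entry<String, Integer> best = Collections.max(
                votes.entrySet(), Map.Entry.comparingByValue());
            double certainty = 100.0 * best.getValue() / snippets.size();
            System.out.printf("%s. Certainty: %.0f%%%n", best.getKey(), certainty);
        }
    }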
The point of this system is that it simplifies searching. Many times when we use a search engine, we’re just looking for something really simple or trivial. Returning a long list of results doesn’t seem like the most efficient way of answering the question, even though the answer is almost certainly hidden away in those results.
Just take a look at the Google results for the above question to see my point:
http://www.google.co.uk/webhp?sourceid=chrome-instant&ie=UTF-8&ion=1&nord=1#sclient=psy&hl=en&safe=off&nord=1&site=webhp&source=hp&q=Where%20do%20you%20go%20to%20get%20your%20eyes%20tested%3F&aq=&aqi=&aql=&oq=&pbx=1&fp=72566eb257565894&fp=72566eb257565894&ion=1
The results given don't immediately answer the question - they need to be searched through by the user before the answer they really want is found. Search engines are great directories. They're really good for giving you more information about a subject, or telling you where to find a service, but they're not so good at answering direct questions.
There are many aspects that would have to be considered when creating the system – for example a website’s accuracy would have to be taken into account when calculating results.
Although the system should work well for simple questions, it may be quite a task to make it work for more complicated ones. For example, common misconceptions would need to be handled as a special case. If the system finds evidence that the user’s question has a common misconception as an answer, it should either point this out when providing the answer, or even simply disregard the most common answer in favour of the one provided by the website that points out that it is a common misconception. This would all have to be weighed up by comparing the accuracy and quality of conflicting sources.
It's an interesting question and would involve a lot of research, but surely it would be worth the time and effort? It wouldn't always be right, but it would make simple queries a lot quicker for the user.
Such a system is called an automatic Question Answering (QA) system, or a Natural Language search engine. It is not to be confused with a social Question Answering service, where answers are produced by humans. QA is a well studied area, as evidenced by almost a decade of TREC QA track publications, but it is one of the more difficult tasks in the field of natural language processing (NLP) because it requires a wide range of intelligence (parsing, search, information extraction, coreference, inference). This may explain why there are relatively few freely available online systems today, most of which are more like demos. Several include:
AnswerBus
START - MIT
QuALiM - Microsoft
TextMap - ISI
askEd!
Wolfram Alpha
Major search engines have shown interest in question answering technology. In an interview on Jun 1, 2011, Eric Schmidt said that Google's new strategy for search is to provide answers, not just links. "'We can literally compute the right answer,' said Schmidt, referencing advances in artificial intelligence technology" (source).
Matthew Goltzbach, head of products for Google Enterprise has stated that "Question answering is the future of enterprise search." Yahoo has also forecasted that the future of search involves users getting real-time answers instead of links. These big players are incrementally introducing QA technology as a supplement to other kinds of search results, as seen in Google's "short answers".
While IBM's Jeopardy-playing Watson has done much to popularize machines answering questions (or, in Jeopardy's case, answers), many real-world challenges remain in the general form of question answering.
See also the related question on open source QA frameworks.
Update:
2013/03/14: Google and Bing search execs discuss how search is evolving to conversational question answering (AllThingsD)
Wolfram Alpha
http://www.wolframalpha.com/
Wolfram Alpha (styled Wolfram|Alpha) is an answer engine developed by Wolfram Research. It is an online service that answers factual queries directly by computing the answer from structured data, rather than providing a list of documents or web pages that might contain the answer as a search engine would.[4] It was announced in March 2009 by Stephen Wolfram, and was released to the public on May 15, 2009.[1] It was voted the greatest computer innovation of 2009 by Popular Science.[5][6]
http://en.wikipedia.org/wiki/Wolfram_Alpha
Have you tried wolframalpha?
Have a look at this: http://www.wolframalpha.com/input/?i=who+is+the+president+of+brasil%3F
Ask Jeeves, now Ask.com, used to do this. Here is why hardly anyone does this anymore, except Wolfram:
Question Answering (QA) is far from a solved problem.
There exist strong question answering systems, but they require full parsing of both the question and the data and therefore require tremendous amounts of computing power and storage, even compared to Google scale, to get any coverage.
Most web data is too noisy to handle; you first have to detect if it's in a language you support (or translate it, as some researchers have done; search for "cross-lingual question answering"), then try to detect noise, then parse. You lose more coverage.
The internet changes at lightning pace. You lose even more coverage.
Users have gotten accustomed to keyword search, so that's much more economical.
Powerset, acquired by Microsoft, is also trying to do question answering. They call their product a "natural language search engine" where you can type in a question such as "Which US State has the highest income tax?" and search on the question instead of using keywords.

How to counter the "one true language" perspective? [closed]

How do you work with someone when they haven't been able to see that there is a range of other languages out there beyond "The One True Path"?
I mean someone who hasn't realised that the modern software professional has a range of tools in his toolbox. The person whose knee-jerk reaction is, for example, "We must do this in C++!" "Everything must be done in C++!"
What's the best approach to open people up to the fact that "not everything is a nail"? How may I introduce them to having a well-equipped toolbox, selecting the best tool for the job at hand?
As long as there are valid reasons for it to be done in C++, I don't see anything wrong with this monolithic approach.
Of course a good programmer must have many different tools in his/her toolbox, but these tools don't need to be new languages; it can simply be about learning new programming paradigms.
In my experience, learning many different languages doesn't actually make you much of a better programmer.
This is also true with finding the right language for the job. Yeah ok, if you're doing concurrency you might want a functional language rather than an Object Oriented language, but what are the gains of using another programming language?
At the end of the day; "Maintenance".
If it can be maintained without undue problems then the debate may well be moot and comes down to preference or at least company policy/adopted technology.
If that is satisfied then the debate becomes "Can it be built efficiently to be cost effective and not cause integration problems?"
Beyond that it's simply the screwdriver/build a house argument.
Give them a task which can be done much more easily in some other language/technology, and which is hard to do in the language/technology that they are suggesting for everything.
This way they will eventually search for alternatives as it gets harder and harder for them to accomplish the task using the language/technology that they know.
Lead by example, give them projects that play to their strengths, and encourage them to learn.
If they are given a task that is obviously better suited for some other technology and they choose to use a less effective language, don't accept the work. Tell them it's not an appropriate solution to the problem. Think of it as no different than them choosing COBOL to replace a shell script -- maybe it works, but it will be hard to maintain over time, take too long to develop, require expensive tools, etc.
You also need to take a hard look at the work they do and decide if it's really a big deal or not if it's done in C++. For example, if you have plenty of staff that knows that language and they finished the task in a decent amount of time, what's the harm? On the other hand, if the language they choose slows them down or will lead to long term maintenance problems they need to be aware of that.
There are plenty of good programmers who only know one language well. That fact in and of itself can't be used to determine if they are a valuable member of a team. I've known one-language guys who were out of this world, and some that I wouldn't have on a team if they worked for free.
Don't hire them.
Put them in charge of a team of COBOL programmers.
Ask them to produce a binary that outputs an infinite Fibonacci sequence.
Then show them the few lines (or 1 line, depending on the implementation) it takes in Haskell, and that it too can be compiled into a binary so there are better ways forward.
How may I introduce them to having a well-equipped toolbox, selecting the best tool for the job at hand?
I believe that the opposite of "one true language" is "polyglot programming", and I will then refer to another answer of mine:
Is polyglot programming important?
I actually doubt that anybody nowadays can realize a project in one and only one language (even though there might be exceptions). The easiest way to show them the usefulness of specific tools and languages is then to show them that they are already using several, e.g. SQL, build files, various XML dialects, etc.
Though I embrace the polyglot perspective, I also believe that in many areas "less is more". There is a balance to find between the number of languages/tools, the learning curve, and the overall productivity.
The challenge is to decide which small set of languages/tools fit nicely together in your domain and will push productivity and creativity to new limits.
Give them a screwdriver and tell them to build a house?

How to choose a Feature Selection Algorithm?

Is there a research paper/book that I can read which can tell me for the problem at hand what sort of feature selection algorithm would work best.
I am trying to simply identify Twitter messages as pos/neg (to begin with). I started out with frequency-based feature selection (having started with the NLTK book), but soon realised that for similar problems various individuals have chosen different algorithms.
Although I can try frequency-based selection, mutual information, information gain and various other algorithms, the list seems endless, and I was wondering whether there is a more efficient way than trial and error.
Any advice?
Have you tried the book I recommended upon your last question? It's freely available online and entirely about the task you are dealing with: Sentiment Analysis and Opinion Mining by Pang and Lee. Chapter 4 ("Extraction and Classification") is just what you need!
I did an NLP course last term, and it became pretty clear that sentiment analysis is something that nobody really knows how to do well (yet). Doing this with unsupervised learning is of course even harder.
There's quite a lot of research going on regarding this, some of it commercial and thus not open to the public. I can't point you to any research papers but the book we used for the course was this (google books preview). That said, the book covers a lot of material and might not be the quickest way to find a solution to this particular problem.
The only other thing I can point you towards is to try googling around, maybe in scholar.google.com for "sentiment analysis" or "opinion mining".
Have a look at the NLTK movie_reviews corpus. The reviews are already pos/neg categorized and might help you with training your classifier. Although the language you find in Twitter is probably very different from those.
As a last note, please post any successes (or failures for that matter) here. This issue will come up later for sure at some point.
Unfortunately, there is no silver bullet for anything when dealing with machine learning. It's usually referred to as the "No Free Lunch" theorem. Basically a number of algorithms work for a problem, and some do better on some problems and worse on others. Over all, they all perform about the same. The same feature set may cause one algorithm to perform better and another to perform worse for a given data set. For a different data set, the situation could be completely reversed.
Usually what I do is pick a few feature selection algorithms that have worked for others on similar tasks and then start with those. If the performance I get using my favorite classifiers is acceptable, scrounging for another half percentage point probably isn't worth my time. But if it's not acceptable, then it's time to re-evaluate my approach, or to look for more feature selection methods.
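For a sense of what "start with a few feature selection algorithms" can look like in practice, here is a toy Java sketch that ranks terms by a pointwise-mutual-information-style score against the positive class over a tiny hand-made corpus. The corpus, the smoothing, and the class/method names are illustrative assumptions, not a recommended setup.

    import java.util.*;

    // Toy example: rank terms by a pointwise-mutual-information-style score
    // against the positive class, as one simple feature selection heuristic.
    public class MiFeatureRanker {

        public static void main(String[] args) {
            // Tiny hypothetical labeled corpus: true = positive tweet, false = negative.
            Map<String, Boolean> corpus = new LinkedHashMap<>();
            corpus.put("love this phone", true);
            corpus.put("great battery love it", true);
            corpus.put("hate the screen", false);
            corpus.put("battery is terrible hate it", false);

            int total = corpus.size();
            long positives = corpus.values().stream().filter(b -> b).count();

            // Per term: [documents containing it, positive documents containing it].
            Map<String, int[]> counts = new HashMap<>();
            for (Map.Entry<String, Boolean> doc : corpus.entrySet()) {
                Set<String> terms = new HashSet<>(Arrays.asList(doc.getKey().split("\\s+")));
                for (String t : terms) {
                    int[] c = counts.computeIfAbsent(t, k -> new int[2]);
                    c[0]++;
                    if (doc.getValue()) c[1]++;
                }
            }

            // PMI(term, positive) = log[ P(term, pos) / (P(term) * P(pos)) ], with crude smoothing.
            Map<String, Double> pmi = new HashMap<>();
            for (Map.Entry<String, int[]> e : counts.entrySet()) {
                double pTermPos = (e.getValue()[1] + 1.0) / (total + 1.0);
                double pTerm = (e.getValue()[0] + 1.0) / (total + 1.0);
                double pPos = (positives + 1.0) / (total + 1.0);
                pmi.put(e.getKey(), Math.log(pTermPos / (pTerm * pPos)));
            }

            pmi.entrySet().stream()
               .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
               .forEach(e -> System.out.printf("%-10s %.3f%n", e.getKey(), e.getValue()));
        }
    }

Swapping the scoring formula for document frequency or information gain and re-running the same ranking is usually the quickest way to compare methods on your own data.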

How do I programmatically calculate Poker Odds? [closed]

I'm trying to write a simple game/utility to calculate poker odds. I know there's plenty of resources that talk about the formulas to do so, but I guess I'm having trouble translating that to code. Particularly, I'm interested in Texas Hold-em ...
I understand that there are several different approaches, one being that you can calculate the odds that you will draw some hand based on the cards you can see. The other approach is calculating the odds that you will win a certain hand. The second approach seems much more complex as you'd have to enter more data (how many players, etc.)
I'm not asking that you write it for me, but some nudges in the right direction would help :-)
Here are some links to articles, which could help as starting points: Poker Logic in C# and Fast, Texas Holdem Hand Evaluation and Analysis
"This code snippet will let you calculate poker probabilities the hard way, using C# and .NET."
The theoretical fundamentals are given in this Wikipedia article about Poker Probabilities and in this excellent statistical tutorial.
An example of a complete project written in Objective-C, Java, C/C++ or Python is found at SpecialKEval. Further links and reading can be found therein.
Monte Carlo simulation is a common approach to calculating the odds for poker hands. There are plenty of examples of implementing this kind of simulation for hold'em on the net.
http://www.codeproject.com/KB/game/MoreTexasHoldemAnalysis1.aspx
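A rough Java skeleton of that Monte Carlo approach is below: it estimates heads-up pre-flop equity by repeatedly dealing a random opponent hand and board and comparing the two hands. The evaluate method is deliberately a stub (it is not a correct poker ranking); you would plug in a real 7-card evaluator such as the ones linked elsewhere in this thread.

    import java.util.*;

    // Monte Carlo equity estimate skeleton for heads-up Texas hold'em.
    // The hand evaluator below is a stub; plug in a real 7-card evaluator.
    public class MonteCarloEquity {

        // Placeholder: return a comparable strength for 7 cards (2 hole + 5 board).
        static int evaluate(List<Integer> sevenCards) {
            // A real implementation would rank the best 5-card hand here.
            return Collections.max(sevenCards); // stub only, NOT a correct poker ranking
        }

        public static double estimateEquity(List<Integer> heroHole, int trials) {
            Random rng = new Random();
            double wins = 0;
            for (int t = 0; t < trials; t++) {
                // Build a deck of 52 cards (0..51) minus the hero's hole cards.
                List<Integer> deck = new ArrayList<>();
                for (int c = 0; c < 52; c++) if (!heroHole.contains(c)) deck.add(c);
                Collections.shuffle(deck, rng);

                List<Integer> villainHole = deck.subList(0, 2);
                List<Integer> board = deck.subList(2, 7);

                List<Integer> heroSeven = new ArrayList<>(heroHole);
                heroSeven.addAll(board);
                List<Integer> villainSeven = new ArrayList<>(villainHole);
                villainSeven.addAll(board);

                int hero = evaluate(heroSeven);
                int villain = evaluate(villainSeven);
                if (hero > villain) wins += 1;
                else if (hero == villain) wins += 0.5; // split pot
            }
            return wins / trials;
        }

        public static void main(String[] args) {
            // Cards encoded 0..51; 48 and 49 stand in for a hypothetical pair of aces.
            System.out.println(estimateEquity(Arrays.asList(48, 49), 100_000));
        }
    }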
Take a look at pokersource if you have reasonably strong C abilities. It's not simple, I'm afraid, but some of the things you're looking for are complex. The poker-eval program that uses the library will probably do much of what you want if you can get the input format correct (not easy either). Sites such as this one or this also use this library AFAIK.
Still, it could be worse, you could be wanting to calculate something tricky like Omaha Hi-lo...
Pokersource and the statistical articles are not bad suggestions. But this is really best done with a Monte Carlo simulation, a useful, simple, and powerful approach to this type of difficult problem.
It works equally well with Omaha Hi-lo as it does with Hold'em
Complete source code for Texas hold'em poker game evaluator can be found here:
http://www.advancedmcode.org/poker-predictor.html
It is built for MATLAB; the GUI is m-coded but the computational engine is C++.
It allows for odds and probability calculation. It can handle, on my 2.4 GHz laptop, a computation over 100,000 games with 10 players in 0.3 seconds.
An accurate real-time calculator :-)
Have a look here as well:
http://specialk-coding.blogspot.com/2010/04/texas-holdem-7-card-evaluator_23.html
Monte Carlo simulation is often slower than the good exact evaluators.
We may also be able to use combinatorics, calculating the odds from combinations and the number of ways each combination can appear. This way we don't have to iterate over all possible hands.
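For instance, a small combinatorial calculation needs no simulation at all. The sketch below computes the pre-flop probability of being dealt a pocket pair directly from combination counts; the class and method names are just for the example.

    // Combinatorial example: probability of being dealt a pocket pair in hold'em.
    public class PocketPairOdds {

        // n choose k for small arguments (exact with integer arithmetic).
        static long choose(int n, int k) {
            long result = 1;
            for (int i = 1; i <= k; i++) {
                result = result * (n - k + i) / i;
            }
            return result;
        }

        public static void main(String[] args) {
            long pairCombos = 13 * choose(4, 2);   // 13 ranks, pick 2 of the 4 suits
            long allCombos = choose(52, 2);        // any 2 cards from the deck
            System.out.printf("P(pocket pair) = %d/%d = %.4f%n",
                    pairCombos, allCombos, (double) pairCombos / allCombos);
        }
    }

This prints 78/1326, roughly 0.0588, without enumerating any hands.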

Agent-based modeling resources [closed]

I would like to know what kind of toolkits, languages, libraries exist for agent-based modeling and what are the pros/cons of them?
Some examples of what I am thinking of are
Swarm, Repast, and MASS.
I found a survey from June 2009 that answers your question:
Survey of Agent Based Modelling and Simulation Tools
Author: R. J. Allan
Abstract: Agent Based Modelling and Simulation is a computationally demanding technique based on discrete event simulation and having its origins in genetic algorithms. It is a powerful technique for simulating dynamic complex systems and observing "emergent" behaviour. The most common uses of ABMS are in social simulation and optimisation problems, such as traffic flow and supply chains. We will investigate other uses in computational science and engineering. ABMS has been adapted to run on novel architectures such as GPGPU (e.g. nVidia using CUDA). Argonne National Laboratory have a Web site on Exascale ABMS and have run models on the IBM BlueGene with funding from the SciDAC Programme. We plan to organise a workshop on ABMS methodologies and applications in summer of 2009. Keywords: agent based modelling, archaeology
http://epubs.cclrc.ac.uk/bitstream/3637/ABMS.pdf
I also recommend NetLogo. It is an IDE+environment+programming language based on Logo (which was based on Lisp) that lets you build multi-agent models extremely fast. I have found that I can reproduce (simulate) algorithms from research articles in a couple of hours, algorithms that would have taken weeks to implement with other libraries.
You can check some of my models at this page.
I got introduced to Dramatis at OSCON 2008; it is an agent-based framework for Ruby and Python. The author (Steven Parkes) has some references in his blog and is working on running a language-agnostic Actors discussion list.
This page at erights.org has a great set of references to, what I think are, the core papers that introduce and explore the Actors message passing model.
There is also a pretty good comparison page on Wikipedia:
http://en.wikipedia.org/wiki/Comparison_of_agent-based_modeling_software
On the modelling side, have a look at FAML, an agent-oriented modelling language. This is a pretty academic paper, but it may help depending on your interests: http://ieeexplore.ieee.org/xpl/freepre_abs_all.jsp?isnumber=4359463&arnumber=4967615
I know this is an old thread, but I thought it would not hurt to add some extra info. There is a great new website which is dedicated to agent-based modeling. The site contains links to papers, tutorials, tools, resources, and researchers working on agent-based modeling in a number of fields.
You should also have a look at MadKit and TurtleKit.
Old thread, but for completeness there are also AnyLogic and pyabm, which can be used for ABMs.
I have experience programming agent-based models in several environments/languages. My opinion is that if you want to implement a relatively simple model, use NetLogo. It's possible to use NetLogo for heavy-duty models as well (I've done this successfully), but at some point the flexibility of a programming language like Java/Python/C++ outweighs the convenience of the native methods available in NetLogo, especially when performance becomes a major issue.
Repast is becoming a bit bloated. If you are an experienced programmer, all you really need to start building an ABM is the ability to schedule events and draw random numbers. The rest (defining agents / environments and their behaviors) you can craft on your own. When it comes to managing the objects in your model, use the regular data structures you're used to (arrays / hashes / trees / etc.). To this end, I'm developing a very lightweight Java library called "ABMUtils" (on github) that implements a scheduler and wraps a random number generator. This is in the early development stage but I expect to flesh things out (keeping it simple) over the coming months.
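In that spirit, here is a minimal, hypothetical Java sketch of those two ingredients, a time-ordered event scheduler plus a random number generator, which is roughly all the scaffolding a simple ABM needs. The class names are illustrative and are not taken from the ABMUtils library mentioned above.

    import java.util.*;

    // Minimal ABM scaffolding: a time-ordered event queue plus a random number generator.
    // Agents schedule future actions; the loop pops events in time order and executes them.
    public class TinyScheduler {

        interface Event { void execute(double time); }

        static class Scheduled implements Comparable<Scheduled> {
            final double time;
            final Event event;
            Scheduled(double time, Event event) { this.time = time; this.event = event; }
            public int compareTo(Scheduled other) { return Double.compare(time, other.time); }
        }

        private final PriorityQueue<Scheduled> queue = new PriorityQueue<>();
        final Random rng = new Random(42); // fixed seed for reproducible runs

        void schedule(double time, Event event) { queue.add(new Scheduled(time, event)); }

        void run(double untilTime) {
            while (!queue.isEmpty() && queue.peek().time <= untilTime) {
                Scheduled next = queue.poll();
                next.event.execute(next.time);
            }
        }

        public static void main(String[] args) {
            TinyScheduler sim = new TinyScheduler();
            // A toy "agent" that re-schedules itself at random intervals.
            Event agent = new Event() {
                public void execute(double time) {
                    System.out.printf("agent acted at t=%.2f%n", time);
                    sim.schedule(time + sim.rng.nextDouble(), this);
                }
            };
            sim.schedule(0.0, agent);
            sim.run(5.0);
        }
    }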
If you are an evolutionary economist you can also check Laboratory for Simulation Development (LSD).
PHP and Java developers should take a look at KATO.

Resources