IR and QA - Beginner Project Scope - nlp

I have been brainstorming for an Undergraduate Project in Question Answering domain. A project that has components of IR and NLP.
The first thing that popped up, was of course factoid question answering, but that seemed to be an already conquered problem. #IBM Watson!
Non-factoid QA seems interesting, so I took it up. Now, we are in scope-it-out phase of the project description. So, from the ambitious goal - of answering any question put up by the user - I need to scope out our project.
So I took the following decisions:
It will be closed-domain - C++ Programming
The corpus will consist of just one website. (cplusplus or wikipedia) or just one document (the complete reference)
We will develop only one module of the entire QA architecture - Passage Retrieval or Answer Extraction.
Our mentor insists on implementing an already existing solution, to start with.
I am stuck at this point, to search for existing implementations. Here is one. But when I read through the environment requirements, it was staggering. There are a lot of libraries and tool kits, but I didn't find any non-factoid QA system, that was good to know at least on a very small scale.
Suggest a good scope for the project. I wish to continue working on this through my masters, so it what would be a good start? We have about 4 months for the project, and it is important not to end up doing a research project. It should have a tangible output.

For IR you have Lucene/Solr.
For machine learning and nlp lots of libraries are available, primarily in python and java, at least the user friendly ones.
Implementing Hoifung's system is pretty ambitious, I'd go for something simpler. Have you looked at his code at all?
Something you could find lots of stuff in is the BioNLP challenges from the last few years, but those are also relatively complicated tasks.
How about twitter movie review discovery? Ie based on X tweets, does this movie suck?

Related

Meta construction capabilities?

I am currently considering Orange as the base for a meta-learning assistant prototype I intend to develop, but before committing myself to a thorough exploration of the documentation and learning about python development (which would both be quite time consuming), I would appreciate some insight regarding the feasibility of such prototype within Orange framework.
The main aim of the prototype I intend to develop is to allow efficient use of data mining and machine learning algorithm by non experts. Concretely, I wish as a first step to be able to give the user a workflow answering his modelling need, that I elicit from his dataset and expression of his need. In order to perform this elicitation, I intend to run a process that implies designing and executing learning workflows on his data.
Is it possible from within the Orange framework (or else from an above "supervising" framework) to automatically define and execute learning workflows ?
Yes, it is.
We have actually experimented with a "recommendation system" that would suggest parts of the workflow to the user. It wasn't useful. Also, there have been various meta-learning projects in the past and I think that the general consensus is --- it doesn't work. ;)
But if you intend to try it, Orange is suitable platform for this.
#hoijui: Orange no longer has any other mailing list or forum, just this one. Developers follow Stack overflow and answer questions there.

Guidance on broadening horizons

Let me setup my question with some info. I'm not in college yet and strictly a hobby programmer. Probably a little more than 2 years ago I got started programming on mac. I started with very simplistic GUI examples with Cocoa and XCode. Long story short, I learned from the top down, first learning objective-c, then venturing into more "low-level" projects where I became better at basic C and even used a few C++ libraries in my existing projects.
What I'm saying is that I've never really done anything outside of an XCode project and occasional iPhone project. I've implemented lots of stuff, algorithms, math, etc. but all within that environment. I look at the world of programming and there is so much out there that's not necessarily a standalone application. It seems to me that the hardest thing is finding out where to start; how to setup the environment. I guess I'm wondering if anyone has any suggestions, projects, tutorials, maybe on setting up environments for different languages on different systems. Web programming, java applets? etc.
On the note of environments, I would be interested in knowing on a more basic level what makes a "development environment." To my basic knowledge, an "environment" combines the language, with the compiler that interprets that language, and contains libraries that provide an API for the language, where the compiled product runs on a certain system. This is my basic concept, but again, I'm here.
Sorry if this question... well... combines too many questions, but any input or guidance is welcome. Thanks in advance for any replies!
Not sure if I understood your question correctly or if this will help you, but here are my (relative newbie) thoughts and rambling:
I've done Java at uni in two different courses, one where we wrote the code in Notepad and then compiled it in command line, in some dubious DOS application, and then two years later when we worked in NetBeans and while NetBeans was a lot better and easier, I learned a lot and was a lot more careful when writing code after the Notepad experience (especially after waiting for several minutes for a compile only to see a message caused by a silly bug).
If you can choose between IDEs, I would read on different blogs, see what people prefer and why and make a choice. The problem is that most of the time, both at uni and at work, you can't choose and have to go with the teachers/managers choose, and make the best of it.
It seems to me that the hardest thing is finding out where to start; how to setup the environment.
I think it would be easiest if you found something that you want to do, and then take small steps and get bits done. I work as a desktop app developer and 3 years ago I set up a wordpress blog for a friend and imported posts and comments from a different blogging platform, with minimal knowledge about everything involved. I started with things that were already done by others and learned how to use them and then slowly tried to fill in the gaps - the comments part wasn't done then, so I had to learn about databases, how I could see them and then write the code that inserted in them, etc.
What I'm trying to say is that if you find something to do (and if you don't have ideas for projects, you can find several posts with ideas here, on SO) and then set goals towards doing that, even if you don't finish it, or your studying takes you in areas you hadn't expected, it will all be useful at some point.
I guess I'm wondering if anyone has any suggestions, projects, tutorials, maybe on setting up environments for different languages on different systems. Web programming, java applets? etc.
This is way too broad a question. If you're doing web programming, you need to set up a web programming environment. At a minimum, you would need an HTTP server. You'd probably also need a relational database. The rest of the web environment would be language dependent.
If you're doing GUI programmng, you would need access to the device or devices (iPhone, Android, etc.) that you want to write programs for.
To my basic knowledge, an "environment" combines the language, with the compiler that interprets that language, and contains libraries that provide an API for the language, where the compiled product runs on a certain system.
That gets you started, yes. You'd want an integrated development environment to write the code. Again, you'd probably need a relational or object oriented database. The rest of the development environment is language dependent.

Successful projects using agile methods? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 5 years ago.
Improve this question
I have been interested in agile methods of late and have found a lot of prescriptions and minute descriptions of a lot of practices.
Still, I remember my best projects as run-to-completion spikes followed by some debugging and minimal testing before going live.
I have been asking myself, did Flickr use agile methods? Does Facebook practice TDD? Was Gmail made in 25 minute spans followed by 5 minutes of daydream?
In other words, before I listen further to all the preaching and jump into the manuals, what evidence do I get that this is the way to be successful in a successful project in a successful company?
Of course, I am asking this because I want to read the answers, not because I want to dismiss an argument.
A related question is, how many non-Agile (Waterfall, "Big Design Up Front", etc) projects are successful? In my experience, not many. In fact, I just rolled off a two-phase project in which the first phase was traditional Waterfall and failed pretty significantly, but the second phase was iterative in nature and yielded substantially better results (on time, far fewer defects, end result was closer to client's actual needs than the original spec).
I've been doing Agile development for a few years now and, overall, have found it to be superior to the alternative. A few things I've noticed:
Agile != "no process". Agile is about having only as much process as you need and continually refining that process.
Agile requires discipline. You not only have to have a process, you have to follow it.
Agile won't turn a failing project into a success. It can help you identify that the project is failing sooner rather than later, and help you figure out why it's failing. It's about shortening the feedback loop so that you have a chance to get back on course before it's too late.
Microsoft Research recently posted an article in which they empirically evaluate some Agile methods. It's well worth a read and might provide some of the information you're looking for.
Here's a successful project of mine: http://www.sky.com
Went live after a few months.
Delivered new functionality to the CMS and servers behind the site weekly - with deployments typically every week or so.
All done with all the extreme programming disciplines.
Weekly demos to the customer to go with the weekly iterations.
Here's another agile project (also done strictly with XP), also a big success: http://showbiz.sky.com/
I've also worked on two other successful XP projects:
Banking A system for cleaning up and distributing fixed income data across investment bank sites in NY, London, Paris and Tokyo. I believe the whole project only had one production incident over the course of a few years.
Mobile Data A system for configuring mobile phones and PDAs for mobile networks and handset manufacturers. We built the core product incrementally over a number of years and co-ordinated the work over three sites across the world. All done using extreme programming. Customers were some of the largest companies in the mobile business. Our apps provided global support for some of these clients.
I really wouldn't go back to the old way of doing things - and neither would the customers that sponsored the projects I've mentioned above.
In most of the big companies (IBM for instance), the methodology is not always the same, Agile or Rational or Waterfall. That depends in a lot of the history of the projects and the experience of the current People and Project Managers.
If you plan to develop on something is always good to check on all the sides before deciding what suits best for your plan.
So the short answer is: It depends.
My product (the Sophos Email Appliance) is developed using agile methods. Industrial Extreme Programming, as espoused by Joshua Kerievsky, was used for the first several years of development. Recently I have started to move the team more towards Kanban, visualizing work flow and using pull-based scheduling instead of time-boxed iterations.
In other words, before I listen further to all the preaching and jump into the manuals, what evidence do I get that this is the way to be successful in a successful project in a successful company?
There is empirical evidence that most IT projects are not successful (where success means on time, on budget and fully functional here). Given this evidence, it seems reasonable to wonder if a deterministic approach (the waterfall) is well suited for software projects.
"The definition of insanity is doing the same thing over and over and expecting different results." --Albert EinsteinRita Mae Brown1
If a deterministic process produces failures over and over, we are likely not applying the right process for software development projects and Agile methods are an alternative. The theory behind these methods is that most software projects are not deterministic, they are creative (like in art) and complex (as defined by Ralph Stacey) projects and we can't predict everything. So, instead of trying to predict everything and then fighting against change, we should use an adaptive process. And this is what Agile methods are about.
Now, using an Agile method will never guarantee systematic success (and someone claiming the inverse is a liar) but they'll give you better control over the risks. And, if your project has to fail, it will at least fail fast.
Update: 1 Actually, this quotation seems to be misattributed to Albert Einstein. The earliest known occurrence, and probable origin, cites to Rita Mae Brown.
I believe Doublefine just produced Brutal Legend using Scrum.
From what I understand StackOverflow is a successful website built with agile practices and TDD.

Looking for a good exercise in building a website

I'd like to learn how to build a website, say using .Net (Monorail comes to mind). I'd like a pet project, something that:
Will take a fair yet reasonble amount of time
I can I can build on my own
Will be actually cool or useful,
Hasn't been done to death already (e.g. ... writing a blog engine is not what I'd consider as interesting, although it's technically challenging - it's been done to death and there are so many ready blog platforms today)
Any ideas, stackoverflow?
Have you considered offering your time to a local non-profit organization? You might review their existing mission, website, and other materials and approach them with an idea for something helpful that you could develop for free.
I find that if a project is "real" I'll put more effort into it than into a "toy" project on the side.
Hasn't been done to death already
(e.g. ... writing a blog engine is not
what I'd consider as interesting,
although it's technically challenging
- it's been done to death and there are so many ready blog platforms
today)
If this is just a learning exercise, why do you care if its been done to death? More than that, it seems like a blog platform involves a lot of the fundamental skills you'd need to learn anyway to get up to speed on ASP.NET.
You could also try writing a:
messageboard
web-based source-control system.
wiki engine
SO clone
Music/movie management system
Input two celebrities A and A', output a list of movies where A appears B, B appears with C, C appears with D, D appears with A'. See also: Kevin Bacon.
Start your own internet phenomenon. Lolcats, FML, NotAlwaysRight, GraphJam, Passive Agressive Notes, FSTDT, FailBlog, Sh*t Bricks, Keyboard Cat, and JapanWTF have already been done. Find a meme and run with it.
Searchable online taxonomy of species
Decentralized usernames (OpenID), avatars (Gravatar), status updates (Twitter), and currently playing music (Last.fm) have already been done. I predict the next big social network phenomenon will extend the phenomenon by decentralizing another staple of social-networking sites, probably a "current mood" or "signature" that follows you from site to site.
photo gallery engine
a website where people post great ideas for a website.
I'd say my answer would be the same as the one I gave to this previous SO question (albeit substitute .Net for PHP).

Effective strategies for studying frameworks/ libraries partially [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I remember the old effective approach of studying a new framework. It was always the best way to read a good book on the subject, say MFC. When I tried to skip a lot of material to speed up coding it turned out later that it would be quicker to read the whole book first. There was no good ways to study a framework in small parts. Or at least I did not see them then.
The last years a lot of new things happened: improved search results from Google, programming blogs, much more people involved in Internet discussions, a lot of open source frameworks.
Right now when we write software we much often depend on third-party (usually open source) frameworks/ libraries. And a lot of times we need to know only a small amount of their functionality to use them. It's just about finding the simplest way of using a small subset of the library without unnecessary pessimizations.
What do you do to study as less as possible of the framework and still use it effectively?
For example, suppose you need to index a set of documents with Lucene. And you need to highlight search snippets. You don't care about stemmers, storing the index in one file vs. multiple files, fuzzy queries and a lot of other stuff that is going to occupy your brain if you study Lucene in depth.
So what are your strategies, approaches, tricks to save your time?
I will enumerate what I would do, though I feel that my process can be improved.
Search "lucene tutorial", "lucene highlight example" and so on. Try to estimate trust score of unofficial articles ( blog posts ) based on publishing date, the number and the tone of the comments. If there is no a definite answer - collect new search keywords and links on the target.
Search for really quick tutorials/ newbie guides on official site
Estimate how valuable are javadocs for a newbie. (Read Lucene highlight package summary)
Search for simple examples that come with a library, related to what you need. ( Study "src/demo/org/apache/lucene/demo")
Ask about "simple Lucene search highlighting example" in Lucene mail list. You can get no answer or even get a bad reputation if you ask a silly question. And often you don't know whether you question is silly because you have not studied the framework in depth.
Ask it on Stackoverflow or other QA service "could you give me a working example of search keywords highlighting in Lucene". However this question is very specific and can gain no answers or a bad score.
Estimate how easy to get the answer from the framework code if it's open sourced.
What are your study/ search routes? Write them in priority order if possible.
I use a three phase technique for evaluating APIs.
1) Discovery - In this phase I search StackOverflow, CodeProject, Google and Newsgroups with as many different combination of search phrases as possible and add everything that might fit my needs into a huge list.
2) Filter/Sort - For each item I found in my gathering phase I try to find out if it suits my needs. To do this I jump right into the API documentation and make sure it has all of the features I need. The results of this go into a weighted list with the best solutions at the top and all of the cruft filtered out.
3) Prototype - I take the top few contenders and try to do a small implementation hitting all of the important features. Whatever fits the project best here wins. If for some reason an issue comes up with the best choice during implementation, it's possible to fall back on other implementations.
Of course, a huge number of factors go into choosing the best API for the project. Some important ones:
How much will this increase the size of my distribution?
How well does the API fit with the style of my existing code?
Does it have high quality/any documentation?
Is it used by a lot of people?
How active is the community?
How active is the development team?
How responsive is the development team to bug patch requests?
Will the development team accept my patches?
Can I extend it to fit my needs?
How expensive will it be to implement overall?
... And of course many more. It's all very project dependent.
As to saving time, I would say trying to save too much here will just come back to bite you later. The time put into selecting a good library is at least as important as the time spent implementing it. Also, think down the road, in six months would you rather be happily coding or would you rather be arguing with a xenophobic dev team :). Spending a couple of extra days now doing a thorough evaluation of your choices can save a lot of pain later.
The answer to your question depends on where you fall on the continuum of generality/specificity. Do you want to solve an immediate problem? Are you looking to develop a deep understanding of the library? Chances are you’re somewhere between those extremes. Jeff Atwood has a post about how programmers move between these levels, based on their need.
When first getting started, read something on the high-level design of the framework or library (or language, or whatever technology it is), preferably by one of the designers. Try to determine what problems they are trying to address, what the organizing principles behind the design are, and what the central features are. This will form the conceptual framework from which future understanding will hang.
Now jump in to it. Create something. Do not copy and paste somebody's code. Instead, when things don’t work, read the error messages in detail, and the help on those error messages, and figure out why that error occurred. It can be frustrating, when things don’t work, but it forces you to think, and that’s when you learn.
1) Search Google for my task
2) look at examples with a few different libraries, no need to tie myself down to Lucene for example, if I don't know what other options I have.
3) Look at the date of last update on the main page, if it hasn't been updated in 6-months leave (with some exceptions)
4) Search for sample task with library (don't read tutorials yet)
5) Can I understand what's going on without a tutorial? If yes continue if no start back at 1
6) Try to implement the task
7) Watch myself fail
8) Read a tutorial
9) Try to implement the task
10) Watch myself fail and ask on StackOverflow, or mail the authors, post on user group (if friendly looking)
11) If I could get the task done, I'll consider the framework worthy of study and read up the main tutorial for 2 hours (if it doesn't fit in 2 hours I just ignore what's left until I need it)
I have no recipe, in the sense of a set of steps I always follow, that's largely because everything I learn is different. Some things are radically new to me (Dojo for example, I have no fluency in Java script so that's a big task), some just enhancements of previous knowledge (Iknow EJB 2 well, so learning EJB 3 while on the surface is new with all its annotations, its building on concepts.)
My general strategy though is I'd describe as "Spiral and Park". I try to circle the landscape first, understand the general shape, I Park concepts that I don't get just yet, don't let it worry me. Then i go a little deeper into some areas, but again try not to get obsessed with one, Spiralling down into the subject. Hopefully I start to unpark and understand, but also need to park more things.
Initially I want answers to questions such as:
What's it for?
Why would I use this rather than that other thing I already know
What's possible? Any interesting sweet spots. "Eg. ooh look at that nice AJAX-driven update"
I do a great deal of skim reading.
Then I want to do more exploring on the hows. I start to look for gotchas and good advice. (Eg. in java: why is "wibble".equals(var) a useful construct?)
Specific techniques and information sources:
Most important: doing! As early as possible I want to work a tutorial or two. I probably have to get the first circuit of the spiral done, but then I want to touch and experiment.
Overview documents
Product documents
Forums and discussion groups, learning by answering questions is my favourite technique.
if at all possible I try to find gurus. I'm fortunate in having in my immediate colleagues a wealth of knowledge and experience.
Quick-start guides.
A quick look at the API documentation if available.
Reading sample codes.
Messing around YOU HAVE TO MESS AROUND (sorry for the caps).
If it's a small library/API with a small or no community you can always contact the developer himself and ask for help 'cause he'll probably be more than happy to help you; he's happy that one more person is using his API.
Mailing lists are a great resource as long as you do your homework first before asking questions.
Mailing list archives are invaluable for most of the questions I've had on CoreAudio related stuff.
I would never read javadoc. As there often is none. And when there is, most likely it isnt up to date. So one gets confused at the best.
Start with the simplest possible tutorial you find within some minutes.
Often the tutorial will lead you to further sources at the end, so then most of the time one is on a path that goes on and on, deeper and deeper.
It really depends on what the topic is and how much info is on it. Learning by example is a good way to start a topic brand new to you, especially if you're knowledgeable in other similar libraries or languages. You can take a topic you're familiar with, and say "I understand how to implement using X, lets see how it's done using Y".
So what are your strategies, approaches, tricks to save your time?
Well, I search. I generally never ask questions, preferring to research myself. If worse comes to worse I'll read the documentation. In some cases (say, when I was doing some work with SharpSVN) I had to look at the source, specifically the test cases, to get some information about how the API worked.
Generally, I have to be honest, most of my 'study' and 'learning' is by accident.
For example, just a few seconds ago, I discovered how to get the "Recent" folder in C#. I had no idea how to do that before seeing the question, considering it interesting, and then searching.
So for me the real 'trick' is that I hang around on forums, answer questions, and accidentally pick up knowledge. Then when it comes time for me to research something; chances are I know a bit about it, and searching is easier and I can focus on the implementation [typically implementing a test program first] and progressing from there.

Resources