Natural Language/Text Mining and Reddit/social news site - nlp

I think there is a wealth of natural language data associated with sites like reddit or digg or news.google.com.
I have done a little bit of research with text mining, but can't figure out how I could use those tools to parse something like Reddit.
What kind of applications can you come up with?

I have found in the past that the best way to mine data on sites like Reddit or Digg is to first use the developer API that they provide. Typically you have a focused interest in either a topic or trend, and the only way to get that data is through an established public interface. You can also parse feeds, and combine them both to uncover 90% of what you would want to know. If you want to do deep research on data not available through an API, then you should be prepared to spend a significant amount of time writing custom wrappers around a tool like cURL. If you have the budget you can also call them and ask if they offer paid research data on users.
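To make the API route a bit more concrete, here is a minimal Python sketch (using the requests library) that pulls recent post titles and self-text from a subreddit's public JSON listing. The subreddit name, User-Agent string, and exact field layout are assumptions based on Reddit's commonly documented listing format, not anything specific from the answer above; check the current API terms and rate limits before collecting data at scale.

```python
# Minimal sketch, not production code: fetch recent posts from a subreddit's
# public JSON listing so the titles/self-text can be fed into a text-mining step.
import requests

def fetch_posts(subreddit="programming", limit=25):
    url = f"https://www.reddit.com/r/{subreddit}/new.json"
    # Reddit asks API clients to send a descriptive User-Agent.
    headers = {"User-Agent": "research-script/0.1 (contact: you@example.com)"}
    resp = requests.get(url, params={"limit": limit}, headers=headers, timeout=10)
    resp.raise_for_status()
    children = resp.json()["data"]["children"]
    # Each child wraps one post; title and selftext are the text-mining payload.
    return [(c["data"]["title"], c["data"].get("selftext", "")) for c in children]

if __name__ == "__main__":
    for title, _body in fetch_posts():
        print(title)
```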

I'd start on the RSS, and after that I might use Nutch; what to actually do with the data is more your call.

These are good ideas. I can get the data, but what applications can be built around it?

Related

Stat Tracking of Customized data

I feel like I've been all over the web for this but haven't found anything like what I'm looking for; I might just be using the wrong keywords.
Basically I need a stat tracking system which allows me to track a variety of different types of events, each of them connected to a user that has custom properties (id, country, etc). Think Google Analytics with custom events and custom variables, but without the limit of 5 custom variables and with more flexibility around events.
The events are anything from impressions to bank statuses and item ownerships. Basically our database in a statistical format.
At the moment we simply use our MySQL database to gather the info we need, but as we are rapidly growing we cannot keep doing this; we also really need a system that can automate the creation of statistical reports.
So my question really comes down to this: are there any such services (preferably open source) out there, or possibly frameworks or databases created especially for this type of processing, from which we could easily build our own stat tracking software?
Long story short, I need a service, framework or database that eases the creation of custom statistics reports.
Any help would be much appreciated. Apologies if this is not supposed to be on Stack Overflow; I wasn't sure where else to ask this.
R is a free re-implementation of the S statistics language, and is very popular, especially for those who are looking for a heavy-duty statistics programming language.
SPSS is more desktop-software oriented; it was the tool we used in my statistics courses a decade ago, and would probably still be the king of desktop software analysis tools for traditional statistics work.
NumPy is a numerical analysis tool for Python. (A fellow stacker has pointed out a website that has a lot of comparisons of Python with R that might be useful in helping you decide between them: https://stackoverflow.com/questions/6489466/list-of-r-python-equivalents/6493539#6493539.) I'm not sure how well it fits with general statistics work, but it is very popular in a wide array of sciences; a short NumPy sketch follows this list of tools.
Tableau Software has some tools that a friend speaks very highly about -- I've just skimmed their literature, and it looks like they might make exploring data easier, but traditional analysis of statistics might not be their strongest point.
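To give a feel for the NumPy option mentioned above, here is a minimal sketch of the kind of descriptive statistics a custom report might compute. The event durations are made-up numbers purely for illustration.

```python
# Minimal sketch: basic descriptive statistics over a made-up list of event
# durations (in seconds), the sort of aggregate a custom stats report might need.
import numpy as np

durations = np.array([12.0, 7.5, 31.2, 5.1, 18.4, 22.9, 9.3])

print("count:", durations.size)
print("mean:", durations.mean())
print("median:", np.median(durations))
print("sample std dev:", durations.std(ddof=1))
print("95th percentile:", np.percentile(durations, 95))
```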

Effective strategies for studying frameworks/libraries partially [closed]

I remember the old, effective approach to studying a new framework: the best way was always to read a good book on the subject, say MFC. When I tried to skip a lot of material to speed up coding, it turned out later that it would have been quicker to read the whole book first. There were no good ways to study a framework in small parts, or at least I did not see them then.
In the last few years a lot of new things have happened: improved search results from Google, programming blogs, many more people involved in Internet discussions, and a lot of open source frameworks.
Right now, when we write software, we very often depend on third-party (usually open source) frameworks/libraries, and a lot of the time we need to know only a small part of their functionality to use them. It's about finding the simplest way of using a small subset of the library without unnecessary pessimizations.
What do you do to study as little of the framework as possible and still use it effectively?
For example, suppose you need to index a set of documents with Lucene. And you need to highlight search snippets. You don't care about stemmers, storing the index in one file vs. multiple files, fuzzy queries and a lot of other stuff that is going to occupy your brain if you study Lucene in depth.
So what are your strategies, approaches, tricks to save your time?
I will enumerate what I would do, though I feel that my process can be improved.
Search "lucene tutorial", "lucene highlight example" and so on. Try to estimate trust score of unofficial articles ( blog posts ) based on publishing date, the number and the tone of the comments. If there is no a definite answer - collect new search keywords and links on the target.
Search for really quick tutorials/ newbie guides on official site
Estimate how valuable are javadocs for a newbie. (Read Lucene highlight package summary)
Search for simple examples that come with a library, related to what you need. ( Study "src/demo/org/apache/lucene/demo")
Ask about "simple Lucene search highlighting example" in Lucene mail list. You can get no answer or even get a bad reputation if you ask a silly question. And often you don't know whether you question is silly because you have not studied the framework in depth.
Ask it on Stackoverflow or other QA service "could you give me a working example of search keywords highlighting in Lucene". However this question is very specific and can gain no answers or a bad score.
Estimate how easy to get the answer from the framework code if it's open sourced.
What are your study/ search routes? Write them in priority order if possible.
I use a three phase technique for evaluating APIs.
1) Discovery - In this phase I search StackOverflow, CodeProject, Google and Newsgroups with as many different combination of search phrases as possible and add everything that might fit my needs into a huge list.
2) Filter/Sort - For each item I found in the discovery phase I try to find out if it suits my needs. To do this I jump right into the API documentation and make sure it has all of the features I need. The results of this go into a weighted list with the best solutions at the top and all of the cruft filtered out.
3) Prototype - I take the top few contenders and try to do a small implementation hitting all of the important features. Whatever fits the project best here wins. If for some reason an issue comes up with the best choice during implementation, it's possible to fall back on other implementations.
Of course, a huge number of factors go into choosing the best API for the project. Some important ones:
How much will this increase the size of my distribution?
How well does the API fit with the style of my existing code?
Does it have high quality/any documentation?
Is it used by a lot of people?
How active is the community?
How active is the development team?
How responsive is the development team to bug patch requests?
Will the development team accept my patches?
Can I extend it to fit my needs?
How expensive will it be to implement overall?
... And of course many more. It's all very project dependent.
As to saving time, I would say trying to save too much here will just come back to bite you later. The time put into selecting a good library is at least as important as the time spent implementing it. Also, think down the road: in six months, would you rather be happily coding or arguing with a xenophobic dev team? :) Spending a couple of extra days now doing a thorough evaluation of your choices can save a lot of pain later.
The answer to your question depends on where you fall on the continuum of generality/specificity. Do you want to solve an immediate problem? Are you looking to develop a deep understanding of the library? Chances are you’re somewhere between those extremes. Jeff Atwood has a post about how programmers move between these levels, based on their need.
When first getting started, read something on the high-level design of the framework or library (or language, or whatever technology it is), preferably by one of the designers. Try to determine what problems they are trying to address, what the organizing principles behind the design are, and what the central features are. This will form the conceptual framework from which future understanding will hang.
Now jump in to it. Create something. Do not copy and paste somebody's code. Instead, when things don’t work, read the error messages in detail, and the help on those error messages, and figure out why that error occurred. It can be frustrating, when things don’t work, but it forces you to think, and that’s when you learn.
1) Search Google for my task
2) Look at examples with a few different libraries; no need to tie myself down to Lucene, for example, if I don't know what other options I have.
3) Look at the date of the last update on the main page; if it hasn't been updated in 6 months, leave (with some exceptions)
4) Search for a sample of my task done with the library (don't read tutorials yet)
5) Can I understand what's going on without a tutorial? If yes, continue; if no, start back at 1
6) Try to implement the task
7) Watch myself fail
8) Read a tutorial
9) Try to implement the task
10) Watch myself fail and ask on Stack Overflow, mail the authors, or post on the user group (if it looks friendly)
11) If I could get the task done, I'll consider the framework worthy of study and read the main tutorial for 2 hours (if it doesn't fit in 2 hours, I just ignore what's left until I need it)
I have no recipe, in the sense of a set of steps I always follow; that's largely because everything I learn is different. Some things are radically new to me (Dojo, for example; I have no fluency in JavaScript, so that's a big task), some are just enhancements of previous knowledge (I know EJB 2 well, so learning EJB 3, while on the surface new with all its annotations, is building on concepts I already have).
My general strategy is what I'd describe as "Spiral and Park". I try to circle the landscape first and understand the general shape; I park concepts that I don't get just yet and don't let them worry me. Then I go a little deeper into some areas, but again try not to get obsessed with one, spiralling down into the subject. Hopefully I start to unpark and understand, but I also need to park more things.
Initially I want answers to questions such as:
What's it for?
Why would I use this rather than that other thing I already know?
What's possible? Any interesting sweet spots? (E.g. "Ooh, look at that nice AJAX-driven update.")
I do a great deal of skim reading.
Then I want to do more exploring on the hows. I start to look for gotchas and good advice. (E.g. in Java: why is "wibble".equals(var) a useful construct? Because it avoids a NullPointerException when var is null.)
Specific techniques and information sources:
Most important: doing! As early as possible I want to work a tutorial or two. I probably have to get the first circuit of the spiral done, but then I want to touch and experiment.
Overview documents
Product documents
Forums and discussion groups, learning by answering questions is my favourite technique.
If at all possible I try to find gurus. I'm fortunate in having, among my immediate colleagues, a wealth of knowledge and experience.
Quick-start guides.
A quick look at the API documentation if available.
Reading sample codes.
Messing around. YOU HAVE TO MESS AROUND (sorry for the caps).
If it's a small library/API with a small community or none at all, you can always contact the developer himself and ask for help, because he'll probably be more than happy to help you; he'll be glad that one more person is using his API.
Mailing lists are a great resource as long as you do your homework first before asking questions.
Mailing list archives are invaluable for most of the questions I've had on CoreAudio related stuff.
I would never read the javadoc, as there often is none, and when there is, most likely it isn't up to date, so at best one just gets confused.
Start with the simplest possible tutorial you can find within a few minutes.
Often the tutorial will lead you to further sources at the end, so then most of the time one is on a path that goes on and on, deeper and deeper.
It really depends on what the topic is and how much info there is on it. Learning by example is a good way to start a topic brand new to you, especially if you're knowledgeable in other similar libraries or languages. You can take a topic you're familiar with and say, "I understand how to implement this using X; let's see how it's done using Y."
So what are your strategies, approaches, tricks to save your time?
Well, I search. I generally never ask questions, preferring to research things myself. If worse comes to worst, I'll read the documentation. In some cases (say, when I was doing some work with SharpSVN) I had to look at the source, specifically the test cases, to get some information about how the API worked.
Generally, I have to be honest, most of my 'study' and 'learning' is by accident.
For example, just a few seconds ago, I discovered how to get the "Recent" folder in C#. I had no idea how to do that before seeing the question, considering it interesting, and then searching.
So for me the real 'trick' is that I hang around on forums, answer questions, and accidentally pick up knowledge. Then when it comes time for me to research something, chances are I know a bit about it, and searching is easier, so I can focus on the implementation (typically implementing a test program first) and progress from there.

How do you find out what users really want? [closed]

I've read somewhere (I forget the source, sorry - I think the MS Office developer's blog?) that when you do a survey of users asking them what features they would like to see in your software/website, they will more often than not say that they want every little thing, whereas collected metrics show that in the end, most people don't use 99% of these features. The general message from the blog post was that you shouldn't ask people what they use; you should track it for yourself.
This leads to an unfortunate chicken-and-egg situation when trying to figure out what new feature to add next. Without the feature already in place, I can't measure how much it's actually being used. With finite (and severely stretched) resources, I also can't afford to add all the features and then remove the unused ones.
How do you find out what will be useful to your users? If a survey is the only option, do you have to structure your questions in certain ways (eg: don't show a list of possible features, since that would be leading them on)?
Contrary to popular belief, you don't ask them. Well, you don't listen to them when they tell you what they want. You watch them while they use what they have right now. If they don't have anything, you listen to them enough to give them a prototype, then you watch them use that. How a person actually uses software tells you a lot more than what they actually say they want. Watch what they do to find out what they really need.
Give them options and then have them arrange them in order of importance. As you said, the users are going to want everything, but this will allow you to tell what they want the most.
You tell them. Then both of you know.
(No, your users won't tell you what they want. That's work. If users wanted more work to do, they wouldn't be looking for software to do their work for them.)
An anecdote from a previous life:
We were planning for a new release and wanted to add some new features to the application. We got the users together and brainstormed what things they wanted to see in the system, placing each "feature" on a yellow sticky on a white board. We then grouped similar requests together and eliminated duplicates or near dups.
We then laid each sticky on a table with a cup in front of it. Each user got 10 pennies to "vote" on the features they wanted. They could put as many pennies in each cup as they wanted, up to all their pennies in one cup if they so desired. We then counted the number of pennies in each cup and chose to implement the top 5 vote getters, in order of votes.
It was surprising to see people that were passionate about a feature while brainstorming and categorizing turn around and not vote for that feature (or vote lightly for it).
Of course, a technique like this will only work if you have ready access to your user base (this was for an enterprise system we developed internally).
You ask them.
(No, you do not know what your users want better than they do. Yes, you will get a lot of stupid answers. Avoid multiple-choice surveys and instead opt for reviewing free-form answers. The information you collect will be invaluable.)
Of course — you could always allow your users to vote on which features they like most...
Users know what they don't want better than they know what they want.
We had brought in a team to do an Oracle eBusiness Suite implementation. They took an interesting approach that had worked very well for them in the past. But it was phenomenal in our environment.
We had cultural issues which meant none of the users were going to stick their necks out to say what they wanted. I had history with these users from the past: trying to get requirements out of them was like trying to get blood from a stone. But once you went live, the bitching would start.
Anyway, the implementation team installed Oracle eBusiness Suite straight out of the box and gave the users basic training. Then about every 4 weeks for the next 6 months they customized the base installation to accommodate the complaints.
I would recommend against showing them options; as you point out, if it's available, then people will want it just for the sake of having it. Often the users are not aware of the extra costs of developing a particular feature, and just want it because you mentioned the possibility of having it.
The other option is to show a list of all the features you could possibly add, attach a price to each one, and then ask users: would it be worth $X to have feature Y, or how much extra would you be willing to pay for feature Y?
Eat your own dog food
Try to use the application that you write yourself as much as possible. Then you will know how you can improve your application.
According to 37signals' book Getting Real, you don't do anything: you don't even record what they want, you just delete the mails after one read without taking any action.
When it comes time to implement or fix things, you'll remember the most important things your users want off the top of your head. Obviously this requires a bit of a user base.
You need to tie features to cost. Everyone wants features, but not every feature is worth paying for. Ask which features are most important, which would your users be willing to pay for? Develop features based on the priorities supplied by users and stop when they aren't willing to pay for any more. Get the product into their hands as quickly as possible so that you can get real feedback on what doesn't work and what needs to be added. When the users have access to real software, you get much better information. This works best when you are developing specifically for a particular customer. If you don't have access to real customers, consider seeding your product with people (can you say, public beta?) free in order to get better feedback.
Users don't know what features they want. You don't know what features they might be offered. "Features" don't mean anything except as they help them accomplish tasks and achieve goals. And that's where you should start, because they will have a very imperfect understanding of how they relate.
There is one thing they know, maybe, much better than you do. And that's how to get their jobs done.
As soon as computer/software concepts and terminology start to leak into the discussion between users and designers, you're off the rails.
So many times users will focus their requirements in terms of what's wrong with, or could be improved about, the software they currently use. Over time, even they lose the distinction between their jobs, and the software they use to do their jobs.
This is a very hard, critically important problem for you to solve.
The only way to know what the users "really" need is to "be" the user.
It's programming kung fu, black belt level.
"Be like water making its way through cracks. Do not be assertive, but adjust to the object, and you shall find a way round or through it. If nothing within you stays rigid, outward things will disclose themselves.
Empty your mind, be formless. Shapeless, like water. If you put water into a cup, it becomes the cup. You put water into a bottle and it becomes the bottle. You put it in a teapot it becomes the teapot. Now, water can flow or it can crash. Be water my friend."
When you be the water/customer, you'll know.
I think Bruce Lee would be a good programmer.
I'm very serious. This is the way I work. I can't do things I don't understand, so I have to understand before I do things. When I understand, and my customers know I understand, then I can do a good job. Without understanding there will be misunderstandings. You are the only person who knows when you have the correct level of understanding; you are also the person responsible for getting that knowledge.
The Oracle at Delphi
Pros: accuracy is superb
Cons: the messages need interpreting, which many people fail to do (often seeing what they want to see). Also requires supplication, which can get messy (contrary to popular opinion, your hecatomb need not be 100 of the same type of livestock).
Psychics
Pros: accurate to a point.
Cons: rare. Prone to mental instability, highly vulnerable to eldritch beings, and might attract unwanted attention from them. Also, it takes experience to sort through the mystery that is the human mind to get to desired information. And sometimes you still need to probe subjects while they're actually doing the thing they need help with, since users lie.
Plant a mole
Pros: New gadgets. New Poisons! Plans within plans within plans. Baby's a freak show. You might learn all sorts of fascinating things in addition to the information you need to help the user.
Cons: Expensive. Chances remain that the agent will turn on you, or fail to learn anything you couldn't learn more simply. If discovered, organization will likely turn or liquidate the asset, which represents a huge investment of resources. Organization might reciprocate.
Guess
Pros: Take a group of people with average to great imaginations and problem-solving skills, give them some booze, and inspire them with some quotes from Ghostbusters, Big Trouble in Little China, or The Big Lebowski. Who knows where it will go, but it'll be fun and they might produce something interesting/useful.
Cons: Chances of meeting user's needs are higher than you think, but not that good.
Ask the user
Pros: users feel empowered as part of the process.
Cons: until they have to decide on anything, at which point you are on your own. Unless the user is a very experienced user, in which case they probably have a good idea of what they want. There are only like 4 experienced users on the planet though, and nobody ever knows anyone who gets to do a job for them. They may be mythical beasts.
Pretend you care and ask the user (even though you don't really), and then observe them doing whatever key workflow/process/etc is involved and pay attention to what they do.
Pros: you trick the users into thinking their opinion matters, which empowers them but doesn't deliver any other baggage. Since users lie - not purposefully or maliciously, mind - you actually get to see them in action and get a better grasp of what the problem is, thus giving you a better foundation for building a solution. Also, you avoid the psychic route, and thus avoid a long and winding road that begins with promise but ends with you and the psychic being eaten by some monstrous, unspeakable thing that is not of this world. Observing the process is like totally Zen, which is good for your Developer Mystique.
Cons: No road trip to the Oracle (which would be EPIC). Spies are much sexier; chicks dig spies. Ghostbusters|Big Trouble in Little China|The Big Lebowski probably aren't involved. Feels more like work than the rest of the options.
Asking users about features will prompt them to talk to you about features.
If you want to find out what users really want then you are talking about understanding their goals and motivations. I've found the easiest way to start doing this is user interviews, not about features but about how users use your product and products like it, why they are using it and how it fits in with their life.
Once you build an understanding of what your users are trying to do with your product and why they want to do it you are in a position to make an informed judgment as to whether the features people requested are what they really need.
Ideally, I think your problem is about understanding users rather than just listening to their requests.
This is an old question with a lot of good answers already, but I thought I'd just add a little bit of personal experience for the sake of people who end up here in the future through a search like I did.
If your project does not need to gain an audience as quickly as possible in order to succeed (like a webapp does), and it's more of an internal project or a product to be sold to a fixed client or type of client, then I believe your best bet is to go the 37signals way: give your users the absolute minimum they need in order to accomplish the most basic tasks of the most basic cycle of work at first, then listen to what they say is objectively missing in order for them to do their work properly. Not what they want or would like it to have, but what they really need. And the only way you know for sure what you really need is when you don't have it.
I worked as the designer on the development team of an intranet-based "heart-of-the-company" app that followed that strategy, and the results were wonderful. First week: everyone was pissed. When it was over: 90%+ approval, and the app was still simple and beautiful. And most of the people who were not entirely satisfied seemed to understand why it couldn't be the way they wanted, and the main request of nearly everyone was, whatever we did, to keep the app simple.
Again, if you're working on a product or website that needs to attract people first, that might not be feasible or delay things a lot. But if you have some control or leeway over the userbase, I'd definitely recommend this approach.
You don't ask for features. You ask for problems. Pain points. Find out what they hate about their current solution. Find out what eats in to their time.
When you know what they don't like, then you build the solution to those problems.
When you solve real problems, then you're creating real products that people will gladly give you money for.
But what's also important is respecting them during your research phase. Surveys are still great for doing research, but if you ask them a dozen questions, they will hate you. You need to respect their time and use a survey tool that engages them and leaves a great impression.
It's a proven fact that users don't know what they want. What you need to ask them is what is wrong with what there is now: what problems are they having with your software? Why aren't they using x feature and y control? Why did interaction x work for them while interaction y made them try to gouge their eyes out?
Of course to be able to ask those questions, you need to do some field study and see what features are used, what patterns your users exhibit and analyze that data. That analysis will give you the base for much more specific questions which users are able to answer decisively and accurately.
If you're serious, you videotape them at their work, and then you break down what they are trying to accomplish and how your product can help them. This is part of a whole discipline called usability engineering. A good introduction to the technique is Jakob Nielsen's book Usability Engineering. Back before he became a shameless huckster, Jakob was a very good scientist and he learned a lot about cheap ways of figuring out what users need. Especially good if you're on a budget. What impressed me most was using paper prototypes; this is a great way to mock up software you haven't built yet and helps answer your question about what to build next. Until I saw this technique in action I couldn't believe how effective it could be.
P.S. One example of what happens if you just ask people: 90% of the feature requests for Microsoft Office 2007 were for features that were already in Microsoft Office 2003. In that case what users needed were better ways of finding what was already there. I wish I could find where I read about this... sorry not to have a reference.
I'm assuming based on your wording that you are building a product to sell, and not building something to order for a specific client.
In that context, I'd say that you should start by becoming a user yourself and building the features you need in the way that you want it. As you evolve the product, you'll need feedback from other users, but this at least this gets you started and breaks the chicken-egg cycle.
As for measuring actual usage of features, you can set up a discussion forum to get feedback on the features you added... you don't need anything too complicated if you are time-strapped.
I personally like the hands-off approach from customers. They give you high-level requirements and you provide the implementation. Your software team/company/division are supposed to be the experts. Sure, you will make some mistakes; if it's horrible the customer will pipe up and you will fix it, but generally having the implementation up to you and your developers is a fun dilemma to solve.
Research, research, research. Learn from others designs, then make your own kickass design. Not easy but then again they don't pay developers the big bucks for nothing.
That's a good question.
If you're building an FPS game, you really need to know for yourself what should be included, because 99% of your users will never contact you to say "I wish your game just had X". An experienced beta-testing team can help here.
If you're writing an accounting application, you need to understand the industry and what users are trying to accomplish when they use your product, and try and focus your feature set around those goals.
If you're writing a custom app for 100 users in one business, you could have a chat to the dozen or so most avid users of the software. They're the ones who know all the forms back-to-front, have discovered all the undocumented shortcut keys, and have also figured out how to circumvent many of your data validation rules.
Imagine you are them
Use Cases.
What will they do with that feature?
It works like this.
People take actions. We build software to help them take actions.
In order to take an action a person must make a decision. We build software to help them make decisions.
In order to make a decision to take an action, a person needs information. We build software to collect and present information.
Every feature must be an Action, a Decision or Information. And the connection had better be direct. Information that does not lead to a decision or an action isn't even "nice to have" -- it's junk.
Users say a lot of things. What do they do? What decisions do they make? What information do they need?
Edit
Note that not everyone is good at describing use cases. Some people have no vision and will simply tell you what they do today without understanding how they are creating business (or personal) value. They may not really know what decisions they're supposed to be making, and are vague on the information they need.
Other users know what value they create, and why, and can discuss use cases well. They can envision alternative ways to create value; they can articulate options for their actions. Decisions don't have a lot of alternative implementations (people make decisions, not software) and the information required doesn't change much, either.
Watch them.
Identify bottlenecks in their work.
Create something that solves that bottleneck in an elegant way.
Let them use it.
Repeat until everyone is happy.
Based on the principles:
Users know what they want, but they don't know what they really need.
You are NEVER going to get it right the first time.
It seems like a chicken-and-egg problem, much like computing PageRank: a page's PageRank depends on the PageRank of the other pages linking to it.
One way of computing PageRank is by iteration.
Iteration is the key!
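For readers who have not seen it, here is a minimal sketch of that iteration: a plain power-iteration PageRank in Python. The toy link graph and the 0.85 damping factor are illustrative choices only.

```python
# Minimal sketch of PageRank-style iteration on a tiny link graph.
# graph maps each page to the pages it links to.
def pagerank(graph, damping=0.85, iterations=50):
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start with a uniform guess
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in graph.items():
            if not outlinks:                    # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank                         # iterate toward the fixed point
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(links))
```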
A. Voting
Gather a biiiig list of features all users want (make them enumerate each feature they want).
Then have them review the list and allow them to vote on features. Say, give 'em 100 points to distribute across features. They can give more than 1 point to a feature.
B. Analysis
Analyze the business model, List the features that you think is needed.
This is needed because:
users sometimes don't get the big picture
you have this REALLY great idea that users won't think of in a bajillion years.
C. Implement
Analyze list from A and B, merge, remove a few, improve some. Implement.
D. Test
Test it on users. Hear their complaints. Look at
- features they use often
- stuff they get stuck on
- etc etc etc
E. Iterate
Usually, users do not know what they want, or whether they want anything at all. In our company, sales people go to existing and potential customers, show them our product, and explain to them why they desperately want it.
In my time at university we were taught something called "user-driven development". Here you really have to go to the customer, observe how the people there work and what tools they use, and try to find out what could make their lives easier. You then create a mock-up, go to the customer again, present it to the users, get their feedback, and then proceed to improve your mock-up. When everyone more or less agrees on the course of action, you do the implementation, regularly showing the customer what you have and trying to get corrective feedback as early as possible.
The important thing is not to talk to the managers who want the product, but to the users who will use the product. Otherwise the whole exercise will get you nothing.
P.S. Asking them directly "What do you want?" could be a dangerous question...
Babylon 5 - What do you want?
It's called Market Research.
No, this wasn't a dig at the guy; that's really what it is about. Sure, there's a bunch of techniques that UCD people use in the field to get user requirements, but they are exactly the same tools used by market researchers. Card sorting, priority lists, and so on are all market research terms.

Building a web search engine [closed]

I've always been interested in developing a web search engine. What's a good place to start? I've heard of Lucene, but I'm not a big Java guy. Any other good resources or open source projects?
I understand it's a huge undertaking, but that's part of the appeal. I'm not looking to create the next Google, just something I can use to search a subset of sites that I might be interested in.
There are several parts to a search engine. Broadly speaking, in a hopelessly general manner (folks, feel free to edit if you feel you can add better descriptions, links, etc):
The crawler. This is the part that goes through the web, grabs the pages, and stores information about them into some central data store. In addition to the text itself, you will want things like the time you accessed it, etc. The crawler needs to be smart enough to know how often to hit certain domains, to obey the robots.txt convention, etc.
The parser. This reads the data fetched by the crawler, parses it, saves whatever metadata it needs to, throws away junk, and possibly makes suggestions to the crawler on what to fetch next time around.
The indexer. Reads the stuff the parser parsed, and creates inverted indexes into the terms found on the webpages. It can be as smart as you want it to be -- apply NLP techniques to make indexes of concepts, cross-link things, throw in synonyms, etc.
The ranking engine. Given a few thousand URLs matching "apple", how do you decide which result is the best? Just the index doesn't give you that information. You need to analyze the text, the linking structure, and whatever other pieces you want to look at, and create some scores. This may be done completely on the fly (that's really hard), or based on some pre-computed notions of "experts" (see PageRank, etc).
The front end. Something needs to receive user queries, hit the central engine, and respond; this something needs to be smart about caching results, possibly mixing in results from other sources, etc. It has its own set of problems.
My advice -- choose which of these interests you the most, download Lucene or Xapian or any other open source project out there, pull out the bit that does one of the above tasks, and try to replace it. Hopefully, with something better :-).
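To make the indexer and ranking-engine pieces above a little more concrete, here is a minimal sketch of an in-memory inverted index with a naive term-frequency score. It is only illustrative: a real engine adds proper tokenization, stemming, index compression, link analysis, persistence, and much more.

```python
# Minimal sketch: an in-memory inverted index with crude term-frequency ranking.
from collections import defaultdict, Counter
import re

index = defaultdict(Counter)   # term -> Counter mapping doc_id to term frequency
documents = {}                 # doc_id -> original text

def add_document(doc_id, text):
    documents[doc_id] = text
    for term in re.findall(r"[a-z0-9]+", text.lower()):
        index[term][doc_id] += 1

def search(query, top_k=10):
    scores = Counter()
    for term in re.findall(r"[a-z0-9]+", query.lower()):
        for doc_id, tf in index[term].items():
            scores[doc_id] += tf               # crude score: sum of term frequencies
    return scores.most_common(top_k)

add_document(1, "Apple pie recipe with fresh apples")
add_document(2, "Apple releases a new phone")
print(search("apple recipe"))                  # doc 1 should outrank doc 2
```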
Some links that may prove useful:
"Agile web-crawler", a paper from Estonia (in English)
Sphinx Search engine, an indexing and search api. Designed for large DBs, but modular and open-ended.
"Information Retrieval, a textbook about IR from Manning et al. Good overview of how the indexes are built, various issues that come up, as well as some discussion of crawling, etc. Free online version (for now)!
Xapian is another option for you. I've heard it scales better than some implementations of Lucene.
Check out Nutch; it's written by the same guy that created Lucene (Doug Cutting).
It seems to me that the biggest part is the indexing of sites. Making bots to scour the internet and parse their contents.
A friend and I were talking about how amazing Google and other search engines have to be under the hood. Millions of results in under half a second? Crazy. I think that they might have preset search results for commonly searched items.
edit:
This site looks rather interesting.
I would start with an existing project, such as the open source search engine from Wikia.
[My understanding is that the Wikia Search project has ended. However I think getting involved with an existing open-source project is a good way to ease into an undertaking of this size.]
http://re.search.wikia.com/about/get_involved.html
If you're interested in learning about the theory behind information retrieval and some of the technical details behind implementing search engines, I can recommend the book Managing Gigabytes by Ian Witten, Alistair Moffat and Tim C. Bell. (Disclosure: Alistair Moffat was my university supervisor.) Although it's a bit dated now (the first edition came out in 1994 and the second in 1999 -- what's so hard about managing gigabytes now?), the underlying theory is still sound and it's a great introduction to both indexing and the use of compression in indexing and retrieval systems.
I'm interested in search engines too. I recommend both Apache Hadoop MapReduce and Apache Lucene. Speeding things up with a Hadoop cluster is the best way.
There are ports of Lucene. Zend have one freely available. Have a look at this quick tutorial: http://devzone.zend.com/node/view/id/91
Here's a slightly different approach, if you are not so much interested in the programming of it but more interested in the results: consider building it using Google Custom Search Engine API.
Advantages:
Google does all the heavy lifting for you
Familiar UI and behavior for your users
Can have something up and running in minutes
Lots of customization capabilities
Disadvantages:
You're not writing code, so no learning opportunity there
Everything you want to search must be public & in the Google index already
Your result is tied to Google
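If you end up going this route programmatically rather than embedding the search box, a minimal sketch against the Custom Search JSON API might look like the following. The API key and engine ID are placeholders you would create in Google's developer console, and you should confirm the current endpoint, quotas, and terms in Google's documentation before relying on it.

```python
# Minimal sketch: query a Google Custom Search Engine via its JSON API.
import requests

API_KEY = "your-api-key"        # placeholder
ENGINE_ID = "your-cse-id"       # placeholder (the "cx" value)

def cse_search(query):
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": ENGINE_ID, "q": query},
        timeout=10,
    )
    resp.raise_for_status()
    # Each result item carries a title, link and snippet, among other fields.
    return [(item["title"], item["link"]) for item in resp.json().get("items", [])]

for title, link in cse_search("inverted index"):
    print(title, "->", link)
```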

NLP: Building (small) corpora, or "Where to get lots of not-too-specialized English-language text files?"

Does anyone have a suggestion for where to find archives or collections of everyday English text for use in a small corpus? I have been using Gutenberg Project books for a working prototype, and would like to incorporate more contemporary language. A recent answer here pointed indirectly to a great archive of usenet movie reviews, which hadn't occurred to me, and is very good. For this particular program technical usenet archives or programming mailing lists would tilt the results and be hard to analyze, but any kind of general blog text, or chat transcripts, or anything that may have been useful to others, would be very helpful. Also, a partial or downloadable research corpus that isn't too marked-up, or some heuristic for finding an appropriate subset of wikipedia articles, or any other idea, is very appreciated.
(BTW, I am being a good citizen w/r/t downloading, using a deliberately slow script that is not demanding on servers hosting such material, in case you perceive a moral hazard in pointing me to something enormous.)
UPDATE: User S0rin points out that Wikipedia requests no crawling and provides this export tool instead. Project Gutenberg has a policy specified here; bottom line: try not to crawl, but if you need to, "Configure your robot to wait at least 2 seconds between requests."
UPDATE 2: The Wikipedia dumps are the way to go, thanks to the answerers who pointed them out. I ended up using the English version from here: http://download.wikimedia.org/enwiki/20090306/ , and a Spanish dump about half the size. They are some work to clean up, but well worth it, and they contain a lot of useful data in the links.
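As a rough illustration of what that cleanup involves, here is a sketch that streams pages out of a pages-articles XML dump and takes a crude first pass at stripping wiki markup. The file name is a placeholder, the regexes are deliberately simplistic, the wildcard element paths need Python 3.8 or later, and a dedicated tool such as WikiExtractor will do a far better job.

```python
# Rough sketch: stream pages out of a Wikipedia pages-articles dump (bz2-compressed
# XML) and strip the most common wiki markup. Real cleanup needs a proper tool.
import bz2
import re
import xml.etree.ElementTree as ET

def iter_pages(dump_path):
    with bz2.open(dump_path, "rb") as fh:
        for _event, elem in ET.iterparse(fh, events=("end",)):
            if elem.tag.endswith("}page"):
                title = elem.findtext("{*}title", default="")
                text = elem.findtext("{*}revision/{*}text", default="") or ""
                yield title, text
                elem.clear()                     # free the finished page subtree

def rough_clean(wikitext):
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)                 # simple templates
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # unwrap wiki links
    return re.sub(r"<[^>]+>", "", text)                            # leftover tags

for title, text in iter_pages("enwiki-pages-articles.xml.bz2"):    # placeholder name
    print(title, "|", rough_clean(text)[:80])
    break
```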
Use the Wikipedia dumps (needs lots of cleanup).
See if anything in nltk-data helps you (the corpora are usually quite small).
The Wacky people have some free corpora (tagged); you can spider your own corpus using their toolkit.
Europarl is free and the basis of pretty much every academic MT system (spoken language, translated).
The Reuters Corpora are free of charge, but only available on CD.
You can always get your own, but be warned: HTML pages often need heavy cleanup, so restrict yourself to RSS feeds.
If you do this commercially, the LDC might be a viable alternative.
Wikipedia sounds like the way to go. There is an experimental Wikipedia API that might be of use, but I have no clue how it works. So far I've only scraped Wikipedia with custom spiders or even wget.
Then you could search for pages that offer their full article text in RSS feeds. RSS, because no HTML tags get in your way.
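For the RSS route, a minimal sketch with the feedparser library might look like this. The feed URL is a placeholder, and the tag-stripping regex is deliberately crude; a proper HTML parser would do better on real feeds.

```python
# Minimal sketch: pull roughly-plain text out of an RSS/Atom feed with feedparser.
import re
import feedparser

FEED_URL = "https://example.com/feed.xml"      # placeholder: ideally a full-text feed

feed = feedparser.parse(FEED_URL)
for entry in feed.entries:
    # Full-text feeds usually put the article body in 'content';
    # otherwise fall back to the summary.
    if "content" in entry:
        html = entry.content[0].value
    else:
        html = entry.get("summary", "")
    text = re.sub(r"<[^>]+>", " ", html)       # crude tag stripping
    print(entry.get("title", ""), "|", text[:80])
```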
Scraping mailing lists and/or Usenet has several disadvantages: you'll be getting AOLbonics and Techspeak, and that will tilt your corpus badly.
The classical corpora are the Penn Treebank and the British National Corpus, but they are paid for. You can read the Corpora list archives, or even ask them about it. Perhaps you will find useful data using the Web as Corpus tools.
I actually have a small project in construction that allows linguistic processing on arbitrary web pages. It should be ready for use within the next few weeks, but so far it's not really meant to be a scraper. But I could write a module for it, I guess; the functionality is already there.
If you're willing to pay money, you should check out the data available at the Linguistic Data Consortium, such as the Penn Treebank.
Wikipedia seems to be the best way. Yes, you'd have to parse the output. But thanks to Wikipedia's categories you could easily get different types of articles and words. E.g. by parsing all the science categories you could get lots of science words. Details about places would be skewed towards geographic names, etc.
You've covered the obvious ones. The only other areas that I can think of to supplement:
1) News articles / blogs.
2) Magazines are posting a lot of free material online, and you can get a good cross section of topics.
Looking into the Wikipedia data I noticed that they had done some analysis on bodies of TV and movie scripts. I thought that might be interesting text but not readily accessible; it turns out it is everywhere, and it is structured and predictable enough that it should be possible to clean it up. This site, helpfully titled "A bunch of movie scripts and screenplays in one location on the 'net", would probably be useful to anyone who stumbles on this thread with a similar question.
You can get quotations content (in limited form) here:
http://quotationsbook.com/services/
This content also happens to be on Freebase.
