This question has been asked in various ways before, but I'm wondering if people who have experience with automatic search term suggestion could offer advice on the most useful and efficient approaches. Here's the scenario:
I'm just starting on a website for a book that is a dictionary of terms (roughly 1,000 entries, with 300-word explanations on average). Many of the terms are fairly obscure, and it is likely that many visitors to the site will not know how to spell them. The publisher wants to make full-text search available for every entry, so I'm hoping to implement a search engine with spelling correction. The main site will probably be done in a PHP framework (or possibly Django) with a MySQL database.
Can anyone with experience in this area give advice on the following:
With a set corpus of this nature, should I be using something like Lucene or Sphinx for the search engine?
As far as I can tell, neither of these has a built-in suggestion function. So it seems I will need to integrate one or more of the following. What are the advantages / disadvantages of:
Suggestion requests through Google's search API
A phonetic comparison algorithm like metaphone() in PHP
A spell checking system like Aspell
A simpler spelling script such as Peter Norvig's
A Levenshtein function
I'm concerned about the specificity of my corpus, and don't want Google to start suggesting things that have nothing to do with this book. I'm also not sure whether I should try to use both a metaphone comparison and a Levenshtein comparison, or some other combination of techniques to capture both typos and attempts at phonetic spelling.
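To make the Levenshtein option concrete: with only about 1,000 headwords it is cheap to compare a query against every entry, and the suggestions can then only ever come from the book itself. Here is a minimal sketch in Python (assuming the Django route; PHP has a built-in levenshtein() that could play the same role). The names `suggest`, `max_distance` and `limit`, and the sample headwords, are made up for illustration.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def suggest(query: str, headwords: List[str],
            max_distance: int = 2, limit: int = 5) -> List[str]:
    """Return the headwords closest to the query, nearest first."""
    query = query.lower()
    scored = [(levenshtein(query, word.lower()), word) for word in headwords]
    close = [(dist, word) for dist, word in scored if dist <= max_distance]
    return [word for dist, word in sorted(close)[:limit]]


# Hypothetical entries standing in for the book's headwords.
headwords = ["metaphor", "metonymy", "synecdoche", "litotes"]
print(suggest("metafor", headwords))
print(suggest("sinecdoche", headwords))
```

A phonetic key such as metaphone could be tried as a second pass for queries that find nothing within the distance limit; that part is not shown here.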
You might want to consider Apache Solr, which is a web service encapsulation of Lucene, and runs in a J2EE container like Tomcat. You'll get term suggestion, spell check, porting, stemming and much more. It's really very nice.
See here for a full listing of its features relating to queries.
There are Django and PHP libraries for Solr.
I wouldn't recommend using Google Suggest for such a specialised corpus anyway, and with Solr you won't need it.
Hope this helps.
I'm looking for ideas for implementing spellcheck / "did you mean" for Japanese (mostly).
The target for the spellcheck is search queries. The search engine is built on Solr, but the solution doesn't have to be tied to it.
So far I have found two main approaches:
edit distance against a dictionary (libraries like SymSpell)
statistical, based on queries rewritten by users
The first approach seems not very feasible for Kanji/Kana.
Also, its results as-is are quite noisy and it's complicated to build a lot of clean N-grams for contextual spellcheck (so 'hollow world' would be fixed as 'hello world').
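For illustration, the contextual idea can be sketched as follows: generate candidates by edit distance, then let bigram counts from the neighbouring words pick among them. This is only a toy in Python. The counts are invented, the function names are hypothetical, and it assumes the query is already split into tokens (for Japanese that would need a morphological analyzer first).

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def correct(tokens, unigrams, bigrams, max_distance=2):
    """Replace each token by the nearby vocabulary word that best fits its neighbours."""
    result = []
    for i, token in enumerate(tokens):
        left = result[-1] if result else "<s>"
        right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
        # Candidates: known words within max_distance edits, else keep the token.
        candidates = [w for w in unigrams
                      if levenshtein(token, w) <= max_distance] or [token]

        def score(word):
            context = bigrams[(left, word)] + bigrams[(word, right)]
            # Prefer context fit, then word frequency, then fewer edits.
            return (context, unigrams[word], -levenshtein(token, word))

        result.append(max(candidates, key=score))
    return result


# Invented counts; real ones would come from a query log or a corpus.
unigrams = Counter({"hello": 120, "hollow": 15, "world": 200, "tree": 40})
bigrams = Counter({("hello", "world"): 50, ("hollow", "tree"): 8})

print(correct(["hollow", "world"], unigrams, bigrams))
print(correct(["hollow", "tree"], unigrams, bigrams))
```

Note that a scheme like this will happily "correct" valid but rare words whenever the counts are sparse, which is exactly the noise problem described above.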
Any suggestions on how it could be done?
The second approach is complicated because it's difficult to detect rewritten queries, and since users rarely rewrite their queries (or rarely do it correctly), it's hard to gather such statistics.
The main articles/videos that I have found so far are quite high-level and too simple (for edit distance, they don't give approaches usable in the real world that reduce noise to a reasonable level, 95% or higher) or are focused on English only.
Any pointers to published papers are welcome :)
Thanks in advance.
I am planning to build a small social networking site. What is the best way to support keyword search over the content? I am looking for opinions, considering that the content may grow to a few TB in size.
thanks,
GL
You should definitely use Solr/Lucene to index the content. It gives you efficient keyword search, and it also makes faceted search very easy to implement if you have such a feature in mind.
Have you looked at Apache Lucene?
It's a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
I am looking for a tool that can analyze the emotion of short texts. I searched for a week and I couldn't find a good one that is publicly available. The ideal tool is one that takes a short text as input and guesses the emotion. It is preferably a standalone application or library.
I don't need a tool that has to be trained on texts. Although similar questions have been asked before, none of them got a satisfactory answer.
I searched the Internet and read some papers, but I can't find a tool that does what I want. So far I have found SentiStrength, but its accuracy is not good. I am using emotion dictionaries right now. I feel that some syntax parsing may be necessary, but it's too complex for me to build myself. Furthermore, other people have already researched this, and I don't want to reinvent the wheel. Does anyone know of such software, either publicly available or available for research? I need a tool that doesn't need training before use.
Thanks in advance.
I think that you will not find a more accurate program than SentiStrength (or SoCal) for this task - other than machine learning methods in a specific narrow domain. If you have a lot (>1000) of hand-coded data for a specific domain then you might like to try a generic machine learning approach based on your data. If not, then I would stop looking for anything better ;)
Identifying entities and extracting precise information, let alone sentiment, is a very challenging problem, especially with short texts, because of the lack of context. However, there are a few unsupervised approaches to extracting sentiment from text, mainly those proposed by Turney (2000). Look at that, and maybe you can adapt the method of extracting sentiment based on the adjectives in the short text to your use case. It is important to note, however, that this may require you to POS-tag your short texts efficiently.
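As I understand it, the core of Turney's approach is a "semantic orientation" score: a phrase is positive if it co-occurs more with a positive reference word (he used "excellent") than with a negative one ("poor"), measured as a difference of pointwise mutual information. A rough Python sketch of that one formula, with invented counts; the function name, the parameter names and the smoothing value are assumptions for illustration, and extracting the adjective phrases with a POS tagger is not shown.

```python
import math


def semantic_orientation(near_pos, near_neg, hits_pos, hits_neg, smoothing=0.01):
    """PMI(phrase, positive word) - PMI(phrase, negative word), in bits.

    near_pos / near_neg: how often the phrase occurs near the positive /
    negative reference word; hits_pos / hits_neg: how often each reference
    word occurs overall. The smoothing constant avoids division by zero.
    """
    return math.log2(((near_pos + smoothing) * hits_neg) /
                     ((near_neg + smoothing) * hits_pos))


# Invented co-occurrence counts for two adjective phrases.
print(semantic_orientation(near_pos=80, near_neg=10, hits_pos=1000, hits_neg=1000))
print(semantic_orientation(near_pos=5, near_neg=60, hits_pos=1000, hits_neg=1000))
```

A text would then be scored by averaging the orientation of its extracted phrases. The counts have to come from some large corpus or search index, so this is unsupervised but not resource-free.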
Maybe EmoLib could be of help.
I've always been interested in developing a web search engine. What's a good place to start? I've heard of Lucene, but I'm not a big Java guy. Any other good resources or open source projects?
I understand it's a huge undertaking, but that's part of the appeal. I'm not looking to create the next Google, just something I can use to search a subset of sites that I might be interested in.
There are several parts to a search engine. Broadly speaking, in a hopelessly general manner (folks, feel free to edit if you feel you can add better descriptions, links, etc):
The crawler. This is the part that goes through the web, grabs the pages, and stores information about them into some central data store. In addition to the text itself, you will want things like the time you accessed it, etc. The crawler needs to be smart enough to know how often to hit certain domains, to obey the robots.txt convention, etc.
The parser. This reads the data fetched by the crawler, parses it, saves whatever metadata it needs to, throws away junk, and possibly makes suggestions to the crawler on what to fetch next time around.
The indexer. Reads the stuff the parser parsed, and creates inverted indexes into the terms found on the webpages. It can be as smart as you want it to be -- apply NLP techniques to make indexes of concepts, cross-link things, throw in synonyms, etc.
The ranking engine. Given a few thousand URLs matching "apple", how do you decide which result is the best? The index alone doesn't give you that information. You need to analyze the text, the linking structure, and whatever other pieces you want to look at, and create some scores. This may be done completely on the fly (that's really hard), or based on some pre-computed notions of "experts" (see PageRank, etc).
The front end. Something needs to receive user queries, hit the central engine, and respond; this something needs to be smart about caching results, possibly mixing in results from other sources, etc. It has its own set of problems.
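To give a feel for the indexer and ranking parts, here is a toy inverted index in Python. It is only a sketch under simple assumptions: the pages are already fetched and stripped to plain text, the ranking is raw term frequency (nothing like PageRank), and the function names and example URLs are made up.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each term to a Counter of {url: occurrences on that page}."""
    index = defaultdict(Counter)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term][url] += 1
    return index


def search(index, query):
    """Return URLs containing every query term, highest term count first."""
    terms = tokenize(query)
    if not terms:
        return []
    matching = set.intersection(*(set(index.get(t, {})) for t in terms))

    def score(url):
        return sum(index[t][url] for t in terms)

    return sorted(matching, key=lambda url: (-score(url), url))


# Hypothetical crawled pages.
pages = {
    "http://example.com/a": "Apple pie made with apple slices",
    "http://example.com/b": "Apple tart",
    "http://example.com/c": "Pear tart",
}
index = build_index(pages)
print(search(index, "apple"))
print(search(index, "apple tart"))
```

Swapping the `score` function for something smarter (TF-IDF, link analysis) is where the ranking engine proper would begin.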
My advice -- choose which of these interests you the most, download Lucene or Xapian or any other open source project out there, pull out the bit that does one of the above tasks, and try to replace it. Hopefully, with something better :-).
Some links that may prove useful:
"Agile web-crawler", a paper from Estonia (in English)
Sphinx Search engine, an indexing and search API. Designed for large DBs, but modular and open-ended.
"Information Retrieval", a textbook about IR from Manning et al. Good overview of how the indexes are built, various issues that come up, as well as some discussion of crawling, etc. Free online version (for now)!
Xapian is another option for you. I've heard it scales better than some implementations of Lucene.
Check out Nutch; it's written by the same guy who created Lucene (Doug Cutting).
It seems to me that the biggest part is the indexing of sites: making bots that scour the internet and parse the contents of the pages they find.
A friend and I were talking about how amazing Google and other search engines have to be under the hood. Millions of results in under half a second? Crazy. I think that they might have preset search results for commonly searched items.
edit:
This site looks rather interesting.
I would start with an existing project, such as the open source search engine from Wikia.
[My understanding is that the Wikia Search project has ended. However I think getting involved with an existing open-source project is a good way to ease into an undertaking of this size.]
http://re.search.wikia.com/about/get_involved.html
If you're interested in learning about the theory behind information retrieval and some of the technical details behind implementing search engines, I can recommend the book Managing Gigabytes by Ian Witten, Alistair Moffat and Tim C. Bell. (Disclosure: Alistair Moffat was my university supervisor.) Although it's a bit dated now (the first edition came out in 1994 and the second in 1999 -- what's so hard about managing gigabytes now?), the underlying theory is still sound and it's a great introduction to both indexing and the use of compression in indexing and retrieval systems.
I'm interested in search engines too. I recommend both Apache Hadoop MapReduce and Apache Lucene. Running the work on a Hadoop cluster is the best way to speed it up.
There are ports of Lucene. Zend have one freely available. Have a look at this quick tutorial: http://devzone.zend.com/node/view/id/91
Here's a slightly different approach, if you are not so much interested in the programming of it but more interested in the results: consider building it using Google Custom Search Engine API.
Advantages:
Google does all the heavy lifting for you
Familiar UI and behavior for your users
Can have something up and running in minutes
Lots of customization capabilities
Disadvantages:
You're not writing code, so no learning opportunity there
Everything you want to search must be public & in the Google index already
Your result is tied to Google
Does anyone have a suggestion for where to find archives or collections of everyday English text for use in a small corpus? I have been using Gutenberg Project books for a working prototype, and would like to incorporate more contemporary language. A recent answer here pointed indirectly to a great archive of usenet movie reviews, which hadn't occurred to me, and is very good. For this particular program technical usenet archives or programming mailing lists would tilt the results and be hard to analyze, but any kind of general blog text, or chat transcripts, or anything that may have been useful to others, would be very helpful. Also, a partial or downloadable research corpus that isn't too marked-up, or some heuristic for finding an appropriate subset of wikipedia articles, or any other idea, is very appreciated.
(BTW, I am being a good citizen w/r/t downloading, using a deliberately slow script that is not demanding on servers hosting such material, in case you perceive a moral hazard in pointing me to something enormous.)
UPDATE: User S0rin points out that Wikipedia asks people not to crawl it and provides this export tool instead. Project Gutenberg has a policy specified here; the bottom line is to try not to crawl, but if you need to: "Configure your robot to wait at least 2 seconds between requests."
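For anyone writing a similar slow script, this is roughly the shape of it in Python: check robots.txt rules and pause between requests. It is a sketch, not the script I actually used. The function name, the user agent string, and the idea of passing in `fetch` and `sleep` (so it can be tried without touching a real server) are all assumptions for illustration.

```python
import time
from urllib import robotparser


def polite_fetch(urls, fetch, robots_lines, user_agent="corpus-builder",
                 min_delay=2.0, sleep=time.sleep):
    """Fetch the allowed URLs one by one, pausing min_delay seconds between requests."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_lines)  # lines of the site's robots.txt
    pages = {}
    first = True
    for url in urls:
        if not rules.can_fetch(user_agent, url):
            continue  # the site asked robots to stay out of this path
        if not first:
            sleep(min_delay)
        first = False
        pages[url] = fetch(url)
    return pages


# Dry run with a fake fetcher and a recorded sleep instead of real waiting.
robots = ["User-agent: *", "Disallow: /private/"]
urls = ["https://example.org/books/1",
        "https://example.org/private/notes",
        "https://example.org/books/2"]
waits = []
pages = polite_fetch(urls, fetch=lambda u: "text of " + u,
                     robots_lines=robots, sleep=waits.append)
print(sorted(pages), waits)
```

In real use `fetch` would be something like a wrapper around `urllib.request.urlopen`, and `robots_lines` would come from the site's own robots.txt.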
UPDATE 2: The Wikipedia dumps are the way to go; thanks to the answerers who pointed them out. I ended up using the English version from here: http://download.wikimedia.org/enwiki/20090306/ , and a Spanish dump about half the size. They take some work to clean up, but it is well worth it, and they contain a lot of useful data in the links.
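To give an idea of the cleanup involved, here is a rough Python sketch of streaming pages out of a dump and stripping the most common markup. It is not the code I used. The function names are made up, the regexes only handle simple, non-nested templates and links, and the sample XML is a hand-written stand-in for the real export format.

```python
import io
import re
import xml.etree.ElementTree as ET


def iter_pages(xml_file):
    """Yield (title, wikitext) for each <page>, without loading the whole dump."""
    title, text = None, ""
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            yield title, text
            elem.clear()  # release the memory held by this page
            title, text = None, ""


def rough_clean(wikitext):
    """Strip simple templates, link syntax and bold/italic quotes."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)                 # {{template}}
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # [[target|label]]
    text = re.sub(r"'{2,}", "", text)                              # ''italic'', '''bold'''
    return re.sub(r"\s+", " ", text).strip()


sample = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
<page><title>Apple</title><revision><text>'''Apple''' is a [[fruit]] of the [[Malus domestica|apple tree]].{{stub}}</text></revision></page>
</mediawiki>"""

for title, wikitext in iter_pages(io.BytesIO(sample.encode("utf-8"))):
    print(title, "->", rough_clean(wikitext))
```

The real dumps are compressed, so in practice the file object would come from something like `bz2.open(path, "rb")`, and tables, references and nested templates need more work than this.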
Use the Wikipedia dumps (needs lots of cleanup).
See if anything in nltk-data helps you (the corpora are usually quite small).
The Wacky people have some free corpora (tagged; you can spider your own corpus using their toolkit).
Europarl is free and the basis of pretty much every academic MT system (spoken language, translated).
The Reuters Corpora are free of charge, but only available on CD.
You can always get your own, but be warned: HTML pages often need heavy cleanup, so restrict yourself to RSS feeds.
If you do this commercially, the LDC might be a viable alternative.
Wikipedia sounds like the way to go. There is an experimental Wikipedia API that might be of use, but I have no clue how it works. So far I've only scraped Wikipedia with custom spiders or even wget.
Then you could search for pages that offer their full article text in RSS feeds. RSS, because no HTML tags get in your way.
Scraping mailing lists and/or Usenet has several disadvantages: you'll be getting AOLbonics and Techspeak, and that will tilt your corpus badly.
The classical corpora are the Penn Treebank and the British National Corpus, but they are paid for. You can read the Corpora list archives, or even ask them about it. Perhaps you will find useful data using the Web as Corpus tools.
I actually have a small project in construction, that allows linguistic processing on arbitrary web pages. It should be ready for use within the next few weeks, but it's so far not really meant to be a scraper. But I could write a module for it, I guess, the functionality is already there.
If you're willing to pay money, you should check out the data available at the Linguistic Data Consortium, such as the Penn Treebank.
Wikipedia seems to be the best way. Yes you'd have to parse the output. But thanks to wikipedia's categories you could easily get different types of articles and words. e.g. by parsing all the science categories you could get lots of science words. Details about places would be skewed towards geographic names, etc.
You've covered the obvious ones. The only other areas that I can think of to supplement them:
1) News articles / blogs.
2) Magazines are posting a lot of free material online, and you can get a good cross section of topics.
Looking into the Wikipedia data I noticed that they had done some analysis on bodies of TV and movie scripts. I thought that might be interesting text but not readily accessible -- it turns out it is everywhere, and it is structured and predictable enough that it should be possible to clean it up. This site, helpfully titled "A bunch of movie scripts and screenplays in one location on the 'net", would probably be useful to anyone who stumbles on this thread with a similar question.
You can get quotations content (in limited form) here:
http://quotationsbook.com/services/
This content also happens to be on Freebase.