I am a newbie when it comes to information extraction. For the past several days, I have read a lot of academic papers and ordered a book on NLP. I want to figure out how I can build a FlipDog.com-like system (hopefully not from scratch). They extract job openings from more than 60,000 company websites. How do I get started?
I am open to learning any programming language. Has anybody used Mallet/GATE/MinorThird or RoadRunner? Ideally, I want to be able to train a system with the data set particular to my domain and have it extract information based on that. Which platform would you recommend for this purpose?
Thanks!
The fastest way to extract job openings is to use dapper.net (a service for scraping data from websites). You can very easily teach Dapper to extract data using its visual editor. It works very well when your target websites contain tables.
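If Dapper doesn't fit and you end up scraping those table-heavy pages yourself, a minimal sketch with requests and BeautifulSoup could look like the following. The URL and the markup structure are hypothetical; you would adapt the selectors to each site.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical careers page with a plain HTML table of openings.
html = requests.get("https://example.com/careers").text
soup = BeautifulSoup(html, "html.parser")

jobs = []
for row in soup.select("table tr")[1:]:           # skip the header row
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) >= 2:                           # e.g., title, location, ...
        jobs.append({"title": cells[0], "location": cells[1]})

print(jobs)
```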
To learn information extraction, I suggest starting with LingPipe. It is a Java framework for information extraction, so you do not need to learn the architecture-specific features of a larger framework such as GATE or Apache UIMA. On the LingPipe website you will find a lot of tutorials that will help you learn various information extraction approaches. After that, I suggest learning GATE and UIMA.
If you want to build such a website, you also need to learn how to use web crawler frameworks (e.g., Apache Nutch), web search engines (Yahoo, Google, Bing), and information retrieval engines (such as Apache Lucene) to provide a search service on top of the extracted data.
Update:
For Python, the best place to start is http://www.nltk.org/
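As a quick taste of what NLTK gives you, here is a minimal sketch of named-entity extraction, which is the building block for pulling company names and locations out of a job posting. The sample sentence is made up, and the exact download resource names can vary between NLTK versions.

```python
import nltk

# One-time model downloads; resource names may differ by NLTK version.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("maxent_ne_chunker")
nltk.download("words")

text = "Acme Corp is hiring a Software Engineer in Boston."
tokens = nltk.word_tokenize(text)   # split into words
tagged = nltk.pos_tag(tokens)       # part-of-speech tags
tree = nltk.ne_chunk(tagged)        # group tokens into named entities

for subtree in tree.subtrees():
    if subtree.label() in ("ORGANIZATION", "GPE"):
        print(subtree.label(), " ".join(w for w, _ in subtree.leaves()))
```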
I have an e-commerce website, and I need to display products on the homepage based on user interest, like the advertising Facebook and Google show based on searches we do on the internet.
Is there any API from Facebook, Google, or any other website for fetching user interests?
I have wondered how Facebook, Google, and Booking.com track us collaboratively even though they are different companies.
Is there any common gateway or common interface where these big companies share user info, using cookies to track users across all of them as one centralized big-data model for behavioral advertising?
I am looking for answers from experts on this. I really need to build a system for tracking user interest based on their searches on Google and other websites.
What you need is called a recommender system: using a machine learning algorithm, you recommend products to people based on ratings and interests, drawing on previous data from the same user or from other users (just like the recommended videos on YouTube).
This topic is too big for me to explain step by step here, and you first need a good understanding of basic machine learning techniques such as classification, regression, etc.
So if you're interested, make sure to check out the Coursera course Machine Learning (Stanford University). It's taught by machine learning rockstar Andrew Ng, and it doesn't only teach you machine learning; it takes you from somebody with no idea about the subject to an expert (technically). The course uses MATLAB/Octave and has an entire section on recommender systems, which is exactly what you need. After you've finished, just implement what you learned in the language of your choice!
PS:
You can always look up tutorials online for implementing recommender systems, but you will waste a lot of time, because without understanding the theory (which you will master in the course I've recommended) you would have no idea what you're doing. The course can also be found easily on YouTube, but taking it for free on Coursera will help you more, because you'll get hands-on programming experience with the different subjects.
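To make the idea concrete, here is a minimal sketch of item-based collaborative filtering with NumPy. The rating matrix is made up, and a real system would use a proper library or the techniques from the course above; this only illustrates the core computation.

```python
import numpy as np

# Hypothetical user-item rating matrix: rows are users, columns are products,
# 0 means "not rated yet".
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
], dtype=float)

# Item-to-item cosine similarity between the rating columns.
norms = np.linalg.norm(ratings, axis=0)
norms[norms == 0] = 1.0                       # guard against division by zero
similarity = (ratings.T @ ratings) / np.outer(norms, norms)

def recommend(user_index, top_n=2):
    """Score each item by similarity to the items this user already rated."""
    user_ratings = ratings[user_index]
    scores = similarity @ user_ratings        # similarity-weighted rating sum
    scores[user_ratings > 0] = -np.inf        # never re-recommend rated items
    ranked = np.argsort(scores)[::-1]
    return [i for i in ranked if np.isfinite(scores[i])][:top_n]

print(recommend(0))                           # product indices for the homepage
```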
Google and Facebook have their own algorithms for tracking user activity, and they use them for showing ads on their websites.
I don't think those are available for general use.
I don't think you will be able to track their interests live from Facebook, Google, Amazon, or Twitter; they collect your interests from their own platforms.
They also manage large ad networks, so once you click an ad, that click is tracked. Google Play and iTunes also have access to your phone.
I am currently looking for a search product that can handle a large number of documents, a few different websites, and an LMS.
Where this differs is that we would like a heavy amount of relevancy weighting based on user roles. Everyone is auto-logged into all of our systems via SSO, much like on this site. We want heavy weighting to be applied to documents, website articles/knowledge-base entries, and classes in the LMS that match the user's selected role.
I personally have limited knowledge of Solr, which we use for some full-text searches. I have considered looking into Elasticsearch, Solr, the Google Search Appliance, and FAST.
Do any of these have innate features that will help me get to my end goal faster? My worry about Elasticsearch and Solr is the amount of development time. Our group has done limited search customization, so I'm also wondering about the dev time needed for the various solutions.
You must not have done much research yet, because FAST is gone (absorbed by Microsoft) and the Google Search Appliance is not very customizable.
Solr is more customizable than Elasticsearch, so that's my recommendation. For the rest, I would start by having a field for the role and then using it as a boost factor (a sketch follows below).
If the basic approach does not work, the Solr users mailing list may actually be a better place to follow up, as it allows a discussion to fine-tune the issue.
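To illustrate the boost-factor idea, here is a minimal sketch against Solr's HTTP API using Python's requests library. The core name, field name, and boost weight are all assumptions you would tune for your own schema.

```python
import requests

SOLR_URL = "http://localhost:8983/solr/docs/select"  # hypothetical core name

def search(query, user_role):
    params = {
        "q": query,
        "defType": "edismax",            # eDisMax supports boost queries
        "bq": f"role:{user_role}^5.0",   # boost docs tagged with the user's role
        "wt": "json",
    }
    resp = requests.get(SOLR_URL, params=params)
    resp.raise_for_status()
    return resp.json()["response"]["docs"]

# Documents matching the SSO-provided role float to the top.
for doc in search("onboarding checklist", user_role="manager"):
    print(doc.get("id"), doc.get("title"))
```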
(Updated)
For more packaged solutions that integrate with Solr and include crawling, you can look at:
Apache Nutch to add crawling specifically
Apache ManifoldCF - has a lot of connectors to other data sources
LucidWorks Fusion (commercial)
Flexile search platform (commercial)
Cloudera - a Big Data solution, but it integrates with Solr and may have a crawler. It does have a nice UI, Hue.
And probably more, if you dig a little bit.
I'm currently working on an online store, and I'm curious whether there are any "best practices" I should consider to attain subsecond (or close to it) search operations. I'm using Full-Text Search in SQL Server 2008, which I'm sure I could optimize in various ways. Right now, searches within Management Studio alone are taking roughly 2-3 seconds. Furthermore, I'm curious whether client- or server-side caching of some sort could be utilized. The database for the catalog contains millions of records.
Does anyone know how Amazon.com or Borders.com return search results so quickly? Are there any books or articles that discuss search optimization and architecture? This isn't to be confused with search-engine optimization; right now, I don't care about how visible the site is to the public.
Those websites use full-text search or IR libraries. Apache Lucene is an open-source framework that perfectly meets your needs. These information retrieval (IR) libraries use an inverted index to obtain better search performance, trading off index creation time. Also look at using facets and collaborative filtering (the suggestion lists you see on Amazon) with Taste.
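To see why an inverted index makes queries fast, here is a toy version in Python. Real libraries like Lucene add scoring, compression, and much more, but the core trade-off is the same: pay at index time so that a query only touches short posting lists instead of scanning every record.

```python
from collections import defaultdict

docs = {
    1: "red running shoes",
    2: "blue running jacket",
    3: "red rain jacket",
}

# Index once (the expensive part): term -> set of matching doc ids.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query):
    """AND-match: intersect posting lists, no document scan needed."""
    result = set(docs)
    for term in query.lower().split():
        result &= index.get(term, set())
    return sorted(result)

print(search("red jacket"))  # -> [3]
```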
www.acm.org/dl (ACM Digital Library)
computer.org
Search Engine Watch
Microsoft enterprise search whitepapers
Lucid Imagination
Autonomy
Endeca
All of these resources publish consumable information that is useful and generally neither too obscure nor too simplistic.
You can get the task done with MSSQL 2008, but you will need to spend more time on it than a question on Stack Overflow can get you (IMHO).
Note: It's fine to explore the implementation issues before you architect, but it's not always a good idea to bring those implementation details into the architecture.
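That said, here is a minimal sketch of one common optimization on the SQL Server side: querying the full-text index through CONTAINSTABLE, which returns ranked keys that you join back to the catalog table. The table and column names are hypothetical, and pyodbc plus an ODBC driver are assumed.

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=Catalog;Trusted_Connection=yes;"
)

# CONTAINSTABLE returns ranked matches from the full-text index;
# joining on the key column avoids scanning millions of rows.
sql = """
SELECT TOP 20 p.ProductId, p.Name, ft.[RANK]
FROM CONTAINSTABLE(Products, Name, ?) AS ft
JOIN Products AS p ON p.ProductId = ft.[KEY]
ORDER BY ft.[RANK] DESC
"""
for row in conn.cursor().execute(sql, '"running shoes"'):
    print(row.ProductId, row.Name, row.RANK)
```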
I'm just beginning a project that involves working with a few of Google's APIs (for .NET), specifically the Contacts List, Calendar and Gmail. While Google does provide a wealth of information through their code.google.com network, finding what I need to get started has thus far proven to be a monumental task. What I'm hoping to find is a "big picture" look at what Google makes available to developers as well as a few sample pieces of code to ease me off the starting block.
Does anyone know where I can find a handful of simple, useful examples of developing basic applications with Google's APIs (I've come across a few examples within code.google.com, but they're so rudimentary that they're not helpful)? Are there any resources (in print or online) that can spoon-feed a Google novice without burying them? Is there a special, hidden nook within code.google.com for Google beginners?
Any information anyone could share would be very helpful.
http://code.google.com/apis/ajax/playground/
I am working on a website that currently has a number of disparate search functions, for example:
A crawl 'through the front door' of the website
A search that communicates with a web-service
etc...
What would be the best way to tie these together, and provide what appears to be a unified search function?
I found the following list on Wikipedia:
Free and open source enterprise search software
Lucene and Solr
Xapian
Vendors of proprietary enterprise search software
AskMeNow
Autonomy Corporation
Concept Searching Limited
Coveo
Dieselpoint, Inc.
dtSearch Corp.
Endeca Technologies Inc.
Exalead
Expert System S.p.A.
Funnelback
Google Search Appliance
IBM
ISYS Search Software
Microsoft (includes Microsoft Search Server, Fast Search & Transfer):
Open Text Corporation
Oracle Corporation
Queplix Universal Search Appliance
SAP
TeraText
Vivísimo
X1 Technologies, Inc.
ZyLAB Technologies
Thanks for any advice regarding this.
Solr is an unbelievably flexible solution for search. Just in the last year, I coded two Solr-based websites and worked on a third, existing one; each worked in a very different way.
Solr simply eats XML requests to add something to the index and XML requests to search inside the index. It doesn't do crawling or text extraction for you, but most of the time those are easy to do. There are many existing add-ons for the Solr/Lucene stack, so maybe something already exists for your case.
I would avoid proprietary software unless you're sure Solr is insufficient. It's one of the nicest programs I've worked with: very flexible when you need it, and at the same time you can get started in minutes without reading long manuals.
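For example, adding a document really is just an XML POST. A minimal sketch in Python follows; the core name and field names are assumptions, since your schema will differ.

```python
import requests

SOLR_UPDATE = "http://localhost:8983/solr/site/update"  # hypothetical core

doc_xml = """
<add>
  <doc>
    <field name="id">page-42</field>
    <field name="title">Contact us</field>
    <field name="body">Email support@example.com for help.</field>
  </doc>
</add>
"""

# Post the document and commit in one request so it becomes searchable.
resp = requests.post(
    SOLR_UPDATE,
    params={"commit": "true"},
    data=doc_xml.encode("utf-8"),
    headers={"Content-Type": "text/xml"},
)
resp.raise_for_status()
```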
Note that no matter what search solution you use, a search setup is "disparate" by nature.
You will still have an indexer, and a search UI, or the "framework".
You WILL corner yourself by marrying a specific search technology. You actually want to keep the UI as separate from the search backend as possible. The backend may stop scaling, or there may be a better search engine out there tomorrow.
Switching search engines is very common, so never - ever - write your interface with a specific search engine in mind. Always abstract it, so the UI is not aware of the actual search technology used.
Keep it modular, and you will thank yourself later.
By using a standard web services interface, you can also allow 3rd parties to build stuff for you, and they won't have to "learn" whatever search engine you use on the backend.
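A minimal sketch of that abstraction in Python: the UI depends only on an interface, and each engine (Solr, the web service, whatever comes next) hides behind its own adapter. All names here are illustrative.

```python
from abc import ABC, abstractmethod

class SearchBackend(ABC):
    """The UI codes against this interface, never a concrete engine."""

    @abstractmethod
    def search(self, query: str, limit: int = 10) -> list[dict]:
        """Return result dicts with at least 'title' and 'url' keys."""

class InMemoryBackend(SearchBackend):
    """Stand-in engine; could be swapped for a Solr- or service-backed one."""

    def __init__(self, pages):
        self.pages = pages

    def search(self, query, limit=10):
        q = query.lower()
        return [p for p in self.pages if q in p["title"].lower()][:limit]

def render_results(backend: SearchBackend, query: str):
    # The UI only knows the interface, never the engine behind it.
    for hit in backend.search(query):
        print(hit["title"], "-", hit["url"])

render_results(InMemoryBackend([{"title": "Pricing", "url": "/pricing"}]),
               "pricing")
```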
Take a look at these similar questions:
Best text search engine for integrating with custom web app?
How do I implement Search Functionality in a website?
My personal recommendation: Solr.
All these companies offer different flavors of universal search. Smaller companies have carved out very functional and highly sought-after niches. For example, Queplix enables any search engine to work with structured data and enterprise applications by extracting the data, business objects, roles, and permissions from all indexed applications. It provides enterprise ranking criteria as well as data-compliance alerts.
Two other solutions that weren't as well known and/or available around the time the original question was asked:
Google Custom Search - especially since the option to disable the public URL was recently added
YaCy - you can join the network, or download it and roll your own independent servers
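If you go the Google Custom Search route, its JSON API is a plain HTTP GET. A minimal sketch in Python follows; the API key and engine ID are placeholders you create in the Google console.

```python
import requests

API_KEY = "YOUR_API_KEY"      # placeholder: create one in the Google console
ENGINE_ID = "YOUR_ENGINE_CX"  # placeholder: the custom engine's cx value

resp = requests.get(
    "https://www.googleapis.com/customsearch/v1",
    params={"key": API_KEY, "cx": ENGINE_ID, "q": "site search test"},
)
resp.raise_for_status()

# Each item carries a title, link, and snippet for rendering.
for item in resp.json().get("items", []):
    print(item["title"], "-", item["link"])
```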