I've been looking into hosted search solutions such as IndexTank and Google CSE. This is my first time integrating one into my website. The only advantage I can think of in using one is reduced load on my server's database.
What other advantages might there be?
In the case of IndexTank, you get the advantages of running your own search server without having to worry about operating it. It's free up to 100k documents, so if you're below that threshold the advantage is obvious.
If you're going to pay for the service, you should weigh the cost of your own time maintaining your service (which will probably be more significant than your server costs) against the cost of a hosted service.
Keep in mind that Google CSE is not hosted search: Google crawls your site, so you don't get real-time updates/deletes or custom sorting. What you do get is a very simple JavaScript snippet that adds search over whichever of your HTML pages Google can crawl.
Fewer queries hitting your database is not an insignificant benefit in itself, but there are further advantages to using a search engine.
Most search engines offer superior relevancy and often much better speed than in-database full-text search. A search engine such as ElasticSearch will also give you access to many advanced features and better ways of describing the information you or your users wish to retrieve.
You may host ElasticSearch yourself or use a hosted solution. Found Search is a great hosted ElasticSearch service.
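To make "advanced features" a bit more concrete: a full-text match query against ElasticSearch's HTTP API is a small JSON document. Here's a minimal sketch from Java (the local URL, the "articles" index, and the "content" field are assumptions, and the plumbing assumes the Java 11+ HttpClient):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EsSearchSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical local node, index name, and field name.
        String body = "{\"query\": {\"match\": {\"content\": \"hosted search\"}}}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/articles/_search"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // The response is JSON: matching documents ranked by relevance score.
        System.out.println(response.body());
    }
}
```

The same query works from any language that can speak HTTP, which is part of the appeal over database-specific full-text extensions.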
Google Custom Search provides additional capabilities, such as:
autocomplete - when restricting search by site(s)
synonym specification - if autocomplete discovery isn't robust enough
categories - after you enter a search term you can filter queries by topic
images - showing images in your search results
promotions - specifying links based on queries, e.g., think integrating FAQs or products with queries
Some of these functions are enabled in the mobile app custom search engine at https://www.google.com/cse/publicurl?cx=partner-pub-3989641269425886:5048880850#gsc.tab=0
One of the important things you lose with Google Custom Search is the ability to return more than 100 results. For instance, if you run a large site with thousands of relevant results, Google CSE will return no more than 100 of these results. That can be good or bad depending on your objectives.
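For what it's worth, that ceiling is visible in the Custom Search JSON API as well. Below is a minimal sketch of a paginated query from Java (the API key and engine id are placeholders you get from the Google developer console; as I understand it, with num=10 the last page the API will serve starts at 91, which is how the 100-result cap shows up):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CseSearchSketch {
    public static void main(String[] args) throws Exception {
        String apiKey = "YOUR_API_KEY";      // placeholder
        String engineId = "YOUR_ENGINE_ID";  // placeholder (the "cx" value)
        // "start" is 1-based; requesting past the 100th result yields an error.
        String url = "https://www.googleapis.com/customsearch/v1"
                + "?key=" + apiKey + "&cx=" + engineId
                + "&q=hosted%20search&start=91&num=10";
        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON results, at most 10 per page
    }
}
```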
I understand that the same work should not be repeated when Google CSE is already there, so what reasons are there to consider implementing a dedicated search engine for a public-facing website similar to SO (and why did Stack Overflow probably do just that)? The paid version of CSE (Google Site Search) already eliminates several of the drawbacks that would force a dedicated implementation. Cost may be one reason not to choose Google CSE, but what are the other reasons?
Another thing I want to ask: my site is similar in kind to Stack Overflow, so when Google indexes its content every now and then, won't that overload my database servers with lots of queries, perhaps at peak traffic times?
I'm also looking at the Google Custom Search API, but I need to clarify whether the 1,000 paid queries I get for $5 are valid only for one day, or whether they carry over as extra queries (beyond the free ones) to the next day and so on. Can anyone clarify this too?
This depends on the content of your site, the frequency of the updates, and the kind of search you want to provide.
For example, with StackOverflow, there'd probably be no way to search for questions of an individual user through Google, but it can be done with an internal search engine easily.
Similarly, Google can retire its APIs at any time; in fact, if past experience is any indication, Google has already done so with the Google Web Search API, leaving many non-profits whose projects were built on that API out in the cold, with no Google option for continuing their services. (Paying 100 USD/year for only 20,000 search queries per year may be fine for a posh blog, but it greatly limits what you can actually use the search API for.)
On the other hand, you probably already want Google to index all of your pages to get the organic search traffic, so Google CSE would probably use rather minimal resources on your server compared to running a complete in-house search engine.
Now that Google Site Search is gone, the best alternative search tool for loyal Google fans is Google Custom Search (CSE).
The Google Custom Search features I loved most:
It's free (with ads)
The ability to monetise those ads with your AdSense account
Tons of customization options, including removing the Google branding
The ability to link it with a Google Analytics account for highly comprehensive analytics reports
A powerful autocorrect feature that understands the real intent behind typos
Cons: lacks customer support…
Read More: https://www.techrbun.com/2019/05/google-custom-search-features.html
I posted source code on CodePlex and was surprised to find that it appeared on Google within 13 hours. Also, when I made some changes to my account on CodePlex, those changes were reflected on Google within a matter of minutes. How did that happen? Does Google give extra importance to sites like CodePlex, Stack Overflow, etc. so that their content appears in search results quickly? Are there special steps I can take to make Google crawl my site somewhat faster, if not this fast?
Google prefers some sites over others. There are a lot of magic rules involved; in the case of CodePlex and Stack Overflow we can even assume they have been manually put on some whitelist. Google then subscribes to the RSS feeds of these sites and crawls them whenever there is a new post.
Example: posts on my blog are included in the index within minutes, but if I don't post for weeks, Google just passes by every week or so.
Probably (you'd have to be an insider to know for sure), if Google finds enough changes from crawl to crawl, it narrows the window between crawls, until sites like popular blogs and news outlets are being crawled every few minutes.
For popular sites like stackoverflow.com, indexing occurs more often than normal; you can notice this by searching for a question that has just been asked.
It is not well known but Google relies on pigeons to rank its pages. Some pages have particularly tasty corn, which attracts the pigeons' attentions much more frequently than other pages.
Actually... popular sites have certain feeds that they share with Google. The site updates these feeds, and Google updates its index when a feed changes. For other sites that rank well, search engines crawl more often, provided there are changes. True, it's not public knowledge, and even for the popular sites there are no guarantees about when newly published data appears in the index.
Real-time search is one of the newest buzzwords and battlegrounds in the search engine wars. Google's and Bing's announced Twitter integrations are good examples of this new focus on super-fresh content.
Incorporating fresh content is a real technical challenge and a priority for companies like Google, since one has to crawl the documents, incorporate them into the index (which is spread across hundreds or thousands of machines), and then somehow determine whether the new content is relevant to a given query. Remember that brand-new documents and tweets aren't going to have many inbound links, which is the typical thing that boosts PageRank.
The best way to get Google/Yahoo/Bing to crawl your site more often is to have a site with frequently updated content that gets a decent amount of traffic. (All of these companies know how popular sites are and will devote more resources to indexing sites like stackoverflow, nytimes, and amazon.)
The other thing you can do is make sure your robots.txt isn't preventing spiders from crawling your site as much as you'd like, and submit a sitemap to Google/Bing-hoo so they have a list of your URLs; a minimal sketch of both follows. But be careful what you wish for: https://blog.stackoverflow.com/2009/06/the-perfect-web-spider-storm/
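As a sketch (example.com, the path, and the date are placeholders), a robots.txt that allows crawling and advertises a sitemap looks like this:

```
# robots.txt - allow everything, point crawlers at the sitemap
User-agent: *
Disallow:
Sitemap: https://www.example.com/sitemap.xml
```

And a minimal sitemap entry, using the standard sitemaps.org schema:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/questions/123</loc>
    <lastmod>2009-06-15</lastmod>
    <changefreq>hourly</changefreq>
  </url>
</urlset>
```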
Well, even my own blog appears in near real time (it's only PageRank 3, mind you), so it's not such a big deal, I think :)
For example, I just posted this and it appeared in Google within 37 minutes (maybe it was real-time, as I didn't check before):
http://www.google.com/search?q=rebol+cgi+hosting
I'm going to build a high-performance web service. It should use a database (or some other storage system), some processing language (scripting or otherwise), and a web-server daemon. The system should be distributed across a large number of servers so the service runs fast and reliably.
It should replicate data for reliability, and at the same time it must provide distributed-computing features in order to process large amounts of data (primarily, queries over large databases that cannot be executed on a single server with a suitable level of responsiveness). Caching techniques are out of scope.
Which cluster/cloud solutions should I consider?
There are plenty of Single-System-Image (SSI) systems, clustering file systems (which can be part of the design), projects like Hadoop, BigTable clones, and many others. Each has its pros and cons, and every "about" page says its solution is great :) If you've tried to deploy something that addresses this problem, share your experience!
UPD: It's not file hosting and not a game, but something rather interactive. You can take Stack Overflow as an example of the kind of web service: small pieces of data, semi-static content, intensive database operations.
Cross-Post on ServerFault
You really need a better definition of "big". Is "big" an aspiration, or do you have hard numbers that your marketing department reckons they'll have on board?
If you can do it using simple components, do so. The likes of Cassandra and Hadoop are neither easy to set up (especially the latter) nor easy to develop for; developers who can build such an application effectively will be very expensive and difficult to hire.
So I'd say: start off with your favourite "traditional" database and an appropriate high-availability solution, then wait until you get close to its limit (you can always measure where the limit is on your real application, once it's built and you have a performance test system).
Remember that Stack Overflow uses pretty conventional components, simply well tuned, with a small amount of commodity hardware. This is fine at its scale but would never work for, say, Facebook; the developers knew, though, that SO's audience was never going to reach Facebook levels.
EDIT:
When "traditional" techniques start failing, e.g. you reach the limit of what can be done on a single database instance, then you can consider sharding or doing functional partitioning into more instances (again with your choice of HA system).
The only time you're going to need one of these "NoSQL" systems (e.g. Cassandra) is if you have a homogeneous data store with very high write and availability requirements; even then you could probably still solve the problem by sharding conventional systems, as others (even Facebook) have done at times.
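To make "sharding conventional systems" concrete, here is a minimal sketch of application-level shard routing (the class and the idea of routing by user id are illustrative, not any particular product's API):

```java
import java.util.List;

// Route each user's rows to one of N conventional database instances by
// hashing a stable key. Note: naive modulo means adding an instance
// reshuffles existing keys; consistent hashing avoids that, at the cost
// of more machinery.
public class ShardRouter {
    private final List<String> jdbcUrls; // one entry per database instance

    public ShardRouter(List<String> jdbcUrls) {
        this.jdbcUrls = jdbcUrls;
    }

    public String urlFor(long userId) {
        int shard = (int) Math.floorMod(userId, (long) jdbcUrls.size());
        return jdbcUrls.get(shard);
    }
}
```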
It's hard to make specific recommendations since you've been a bit vague, but I would recommend Google App Engine for basically any web service. It's easy to use and built on Google's architecture, so it's fast and reliable.
I'd like to recommend Stratoscale Symphony. It's a private cloud service that covers everything you just mentioned; their Symphony products deliver the public-cloud experience in your enterprise data center. If that's what you're looking for, I suggest you give it a shot.
We're looking for a simple, open-source, web-based document management system for Linux. By document management I mean the ability to store a set of files (minimally doc, xls, and pdf) as a document; associate metadata with the document, like owner and version; update and delete documents; index and search content; and handle authentication with the ability to authorize at least read, and possibly write, access. If possible I would like to avoid implementations in Java or PHP, and since we already use MySQL, that would work especially well for metadata storage.
We have used Google Apps in the past, but the lack of support for PDF makes it a poor fit. Other downsides include their service losing some of our spreadsheets, no concept of a company owning information as opposed to individual accounts, and the fact that some of our information is sensitive and we prefer keeping it in-house (passwords, contracts, etc.).
MediaWiki was not a good fit either, as our documents really form a set as opposed to structured content (i.e., we are not looking for a content management system), and at least the version we had installed did not deal well with attachments.
Based on review of past questions I plan on looking into KnowledgeTree. Any other projects that we should consider?
I've been using KnowledgeTree now for a few months developing an ASP.Net application and I only have good things to say about it. Our product uses it for PDF storage/retrieval and it really couldn't be easier to deal with. The basic install gives you a simple environment with almost endless amounts of configuration for meta-data, document groups, and various security options. Also, the KnowledgeTree staff have been very helpful and have provided us with sample code when we have run into 'how are we going to do that?' moments.
I'll second the recommendation for KnowledgeTree. Have been using it for a couple years and have roughly 1K documents indexed. Sometime last year, I wrote a short script which monitors KT's transaction table (in MySQL) and notifies users of new or updated documents via Twitter, Identica, and/or Jabber. The Twitter/Identica feeds can then be monitored with a RSS reader.
Look for something that will index all your document formats and keep them searchable.
I solved this in my office using ColdFusion. It has the Verity search engine built in, which indexes files on your network (doc/xls/pdf, etc.) to make the text in them searchable (like Google).
An instant search engine for all my files, up to 150,000 or so, comes built into ColdFusion for free, so it suits my purpose. Something like this would let you store your files on the network however and wherever you like, and you could extract the rest of the information (owners, modification dates) through libraries available in Java/.NET.
I am sure you could replicate this with another language, though probably with a bit more effort. I am presently wishing I could use the Google Docs API as a WYSIWYG editor in my own in-house wiki; that would solve most of my problems, because everything would be intranet-based.
Try https://www.mayan-edms.com; it's written in Django and is database-agnostic.
You could also consider GroupDocs, which offers storage, conversion, and a few more features.
I want to implement search functionality for a website (assume it is similar to SO). I don't want to use Google search or anything like that.
My question is:
How do I implement this?
There are two methods I am aware of:
Search the application's database directly each time the user submits a query.
Index all the data I have and store it somewhere else and query from there (like what Google does).
Can anyone tell me which way to go? What are the pros and cons?
Or, better still: are there other ways to do this?
Use Lucene:
http://lucene.apache.org/java/docs/
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
It is available in Java and .NET. It is also available in PHP, in the form of a Zend Framework module.
Lucene does what you want (indexing of the searched items); you have to maintain a Lucene index, but it is much better than doing a database search in terms of performance. BTW, SO's search is powered by Lucene. :D
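For a sense of what maintaining a Lucene index involves, here is a minimal index-and-search round trip (a sketch only: the field names and sample text are made up, and the exact classes vary a little between versions; this is roughly the Lucene 8.x/9.x shape):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory index = new ByteBuffersDirectory(); // in-memory; use FSDirectory on disk

        // Index a "question" the way a Q&A site might.
        try (IndexWriter writer = new IndexWriter(index, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("title", "How do I implement site search?", Field.Store.YES));
            doc.add(new TextField("body", "Index your data and query the index, not the database.", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Search the index with a parsed free-text query.
        Query query = new QueryParser("body", analyzer).parse("index query");
        try (DirectoryReader reader = DirectoryReader.open(index)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("title"));
            }
        }
    }
}
```

In a real application you would rebuild or incrementally update the index whenever the underlying records change, so the index stays in sync with the database.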
It depends on how comprehensive your web site is and how much you want to do yourself.
If you are running a small website without further possibilities to add a custom search, let Google do the work (maybe add a sitemap) and use Google Custom Search.
If you run a medium-sized site with a SQL engine, use the full-text search features of your SQL engine.
If you run a heavier software stack like J2EE or .NET, use Lucene, a great, powerful search engine, or its .NET port, Lucene.Net.
If you want to abstract your search from your application and be able to query it in a language-neutral way with XML/HTTP and JSON APIs, have a look at Solr. Solr runs Lucene in the background but adds a nice web-service interface to it.
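As a sketch of that language-neutral interface (the core name "site" and the "body" field are assumptions), a Solr query is just an HTTP GET returning JSON:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SolrSearchSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical local Solr core named "site"; wt=json asks for JSON output.
        String url = "http://localhost:8983/solr/site/select?q=body%3Alucene&wt=json";
        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // ranked documents as JSON
    }
}
```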
You might want to have a look at Xapian and the Omega front end. It's essentially a toolkit on which you can build search functionality.
The best way to approach this will depend on how you construct your pages.
If they're frequently composed from a lot of different records (as I imagine Stack Overflow pages are), the indexing approach is likely to give better results unless you put a lot of work into effectively reconstructing the pages on the database side.
The disadvantage of the indexing approach is turnaround time. There are workarounds (like Google's sitemap support), but they're also complex to get right.
If you go down the database path, also be aware that modern search-engine systems function much better if they have link data to process, so finding a system that can understand links between 'pages' in the database will have a positive effect.
If you are on a Microsoft platform, you could use the Indexing Service. It integrates very easily with IIS websites.
It has all the basic features like full-text search and ranking, lets you exclude and include certain file types, and you can add your own meta information as well via meta tags in the HTML pages.
Do a Google search and you'll find tons!
This is somewhat orthogonal to your question, but I highly recommend the idea of a RESTful search. That is, to perform a search that has never been performed, the website POSTs a query to /searches/. To re-run a search, the website GETs /searches/{some id}.
There are some good documents to be found regarding this, for example here.
(That said, I like indexing where possible, though it is an optimization, and thus can be premature.)
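A minimal sketch of that exchange (the paths, query string, and the id 42 are all made up):

```
POST /searches HTTP/1.1
Content-Type: application/x-www-form-urlencoded

q=lucene+site+search

HTTP/1.1 201 Created
Location: /searches/42

GET /searches/42 HTTP/1.1
```

The nice side effect is that a search becomes an addressable resource: it can be bookmarked, cached, and re-run by id.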
If your application uses the Java EE stack and you are using Hibernate, you can use the Compass Framework to maintain a searchable index of your database. The Compass Framework uses Lucene under the hood.
The only catch is that you cannot replicate your search index, so you need to use a clustered database to hold the index tables, or use the newer grid-based index storage mechanisms that were added in Compass Framework 2.x.
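For flavor, Compass maps entities with annotations; a minimal sketch (the entity and fields are made up, assuming Compass 2.x-style annotations):

```java
import org.compass.annotations.Searchable;
import org.compass.annotations.SearchableId;
import org.compass.annotations.SearchableProperty;

// Compass mirrors this Hibernate entity into its Lucene-backed index,
// keeping the index in sync as entities are saved, updated, and deleted.
@Searchable
public class Question {

    @SearchableId
    private Long id;

    @SearchableProperty
    private String title;

    @SearchableProperty
    private String body;

    // getters/setters omitted for brevity
}
```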