I have a list of URLs and am trying to collect their "descriptions." By description I mean what comes up, for example, if you Googled the link. For example, http://stackoverflow.com">Google: http://stackoverflow.com shows the description as
A language-independent collaboratively
edited question and answer site for
programmers. Questions and answers
displayed by user votes and tags.
This the data I'm trying to accumulate for the URLs I have.
I tried parsing the URL's meta-descriptions, however most of them are lacking a meta-description (yet Google and other search engines manage to get a description somehow).
Any ideas? Should I just "google" each link and scrape the data? I have a feeling Google wouldn't like this...
Thanks guys.
Different search engines have different algorithms to get the description out of the page if/when they are lacking the description meta tag. Some ignore the tag even it it's there.
If you want the description Google has, the most accurate way to get it would be to scrape it. Otherwise, you could write your own or look around on the web for code that does it.
These are called snippets.
Google use proprietary (and possibly patented) methods to garner this information, so there is no simple answer.
As you suggest, they will use meta-description information if it is there. (How to set the meta-information to help Google.)
They will also honour requests from the page authors to NOT include snippets. (How to prevent Google from displaying snippets) You should probably respect this too (as well as robots.txt, of course.)
You may have some luck with existing auto-summary packages, such as OTS.
You may want to check AboutUs.org (i.e. http://www.aboutus.org/StackOverflow.com).
But, there's little chance that the site will have an aboutus page and not have a meta description.
Some info that might explain how google does this:
Webmasters/Site owners Help
Adding a URL to google
I am not familiar with Google APIs, but perhaps there is an official way to get such information.
Interesting. some sources are better than others.
For "audiotuts.com" google has a worse description than AboutUs.com.
Google
Nov 18th in General by Joel Falconer ·
1. Recently, an AUDIOTUTS reader asked me about creative process. While this
is a topic that can’t be made into a
...
AboutUs.com:
AUDIOTUTS is a blog/tutorial site for
musicians, producers and audio
junkies! It is the sister site of the
popular PSDTUTS, VECTORTUTS and
NETTUTS.
I hate problems like these... they should be trivial but they aren't!
If you can assume English content, you can first look for Meta Description, and if that doesn't work, you can look for the first two or three sentence-like word sequences.
A product I worked on looked for the first P or DIV that contained more than one sequence of > n "words" delimited by periods. It would use the two or three sentence-like sequences, up to x total words, as a summary paragraph. It wasn't 100% accurate, but good enough for the average case. The number of words was adjusted a few times to eliminate things like navigation elements.
Related
Sorry, this is a bad question. I don't even know what the title should be. I'm a total noob at making websites so this might be easy to find but I just don't know the terminology to search for. I cannot find anything about how to do this...
What I want to do is have something like references/variables that I can use in a block of text and it will automatically get replaced with whatever value should be there. Best way I can think of to describe it would be if I was using the site as a design doc for a game or something, I would be able to type in [Title] or something similar on any page and when it loads that text would be replaced with whatever my Title is. That way If I ever change titles, names, classes, races, places, items, etc... they would only have to be changed in 1 place and the change would be reflected everywhere.
I notice if I add a link to a page it will automatically use the Title of that page as the text of the link. That is almost exactly what I want. Except when I change the Title of the other page the text of the link remains as the original text. It doesn't get updated to the new Title and that is not at all what I want.
Also, I want to do this in Google Sites and as simply as possible. I don't really want to use a database. I was hoping Google Sites would have some kind of funcionality for this.
I don't believe this is possible (on Google Sites) and likely you need to consider a hosted solution.
Quoting the answer from this relevant post:
You should consider hosting your solution using Google's App Engine
instead of Google Sites. You can set it up so it uses PHP (see link
below), you can configure it to use your domain name and you get
enough CPU, disk and bandwidth allowance to serve around five million
page views for free each month, if you are serving more than that,
their prices are extremely competitive.
Google App Engine:
http://code.google.com/appengine/docs/whatisgoogleappengine.html How
to setup PHP using Google App Engine: http://blog.caucho.com/?p=187
Also I'm not sure how your PHP skills are but if you're unfamiliar with it then this should help to get you started.
We runnig site-s with EE 1.6.8... Not funny, but my boss like it...
So we implemented a search. Everything is fine but the search url is like this:
/search/results/0374c6c40f159934bc6795f031c4e52f10/
instead
/search/results/keyword
The developers said, that only a paid plugin can we put the keyword in the url.
OMG.
Is it true?
And another Q: after few hours the search url give no results back. It seems, that the session of the cookie expired or anything.
I have two ideas:
1. Our developers want to fool me
2. EE is so, it's not a cms just a cms like thing...
You are correct, the EE Search module uses session-based URLs for results. The reason being that search results are cached for performance, so those results need to expire after a short period of time (as new results might need to appear).
I assume what you want is bookmarkable search results. In this case, I suggest Super Search, or on the free, Google-powered end, the Google Search Results plugin.
Not 100% sure if it would work but in theory you could have www.example.com/search/results/keyword .
In your EE code you would put {exp:weblog:entries search:body="{segment_3}"}title:{title} etc..{/exp:channel:enties} as shown on http://expressionengine.com/legacy_docs/modules/weblog/parameters.html#par_search
The problem is when the keyword contains non [a-z][0-9] characters which is worth considering.
We offer EvoPost on our website for free http://www.eevolution.co.uk/index.php/addons/evopost which will enable you to capture the keywords from a HTTP POST variable e.g. search:body="{ep_txtboxname}"
Feel free to contact us through our website if you need any assistance with the product.
Thanks
Tim
EEvolution Developer
Is there any sort of documentation on exactly what parameters you can put in the url of Google viewer?
Originally, I thought it was just url,embedded,chrome, but I've recently come accross other funny ones like a,pagenumber, and a few others for authentication etc.
Any clues?
One I know is "chrome"
If you've got https://docs.google.com/viewer?........;chrome=true
then you see a fairly heavy UI version of that doc, however with "chrome=false" you get a compact version.
But indeed, I'd like a complete list myself!
I know this question is very old and perhaps you already solved your issue, but for anyone on the internet who might be looking for an answer...
I have been looking for this recently, following a guide I found on GitHub Gist
https://gist.github.com/tzmartin/1cf85dc3d975f94cfddc04bc0dd399be
More specifically, the option to embed a certain page of pdf using
<iframe src="https://docs.google.com/viewer?srcid=[put your file id here]&pid=explorer&efh=false&a=v&chrome=false&embedded=true" width="580px" height="480px"></iframe>
The best I could fing was this article (I suppose from a long time now)
https://weekly-geekly.github.io/articles/111647/index.html
HOWEVER, I tried modifying the attributes and the result was simply a redirect to
https://drive.google.com/file/d/[ID]/edit
https://drive.google.com/file/d/[ID]/preview or
https://drive.google.com/file/d/[ID]/view
AS OF MAY 2020, THIS SOLUTION PROBABLY DOESN'T WORK
I'm also on a quest to discover some of the parameters of the viewer.
the "chrome" parameter doesn't seem to do anything, though. Is this
supposed to be the same as embedded=true?
Parameters I know of:
url= (obviously)
embedded= (obviously)
hl= set language of UI (tooltips)
#:0.page.1 = jump to page 2 (page 1 is numbered 0) - this is unreliable and often requires a refresh after the first load,
defeating the purpose.
That said, when I use the Google Docs viewer on my site, "fit page to
screen" is the default view without any parameters. So maybe I'm
misunderstanding your question.
Source: For convenience, this is a full quote of the sole answer (it is from user k3david) to the crosspost of this question #Doc has posted to the Google support forum in 2011.
You can pass q=whatever to pass a search query to the viewer.
Looking through my search logs from time to time, I notice that by far the biggest user of my search engine is the google-bot. What gives? Is it looking for content that might not be directly accessible through navigation? If so, how does it know which words and phrases to look for (they're surprisingly relevant). Does it check the most popular keywords on the site? I know I seem to be answering my own question here, but this is really only working it out from first principles. I'd like to hear from someone who knows what they're talking about (i.e. not me).
If your search form's method is get instead of post, each search has its own url, and people might be posting those urls elsewhere. Or if you have a (possibly inadvertently) publicly accessible webstats page that listed those urls, that's another common way for search engines to stumble upon your internal search urls. A third way I've seen is sites that list recent searches on their pages, but this is more intentional. "MySQL Performance Blog" does this to an annoying extent, so any search of their site from google yields hundreds of pages of similar searches, even if none of them found what they were looking for.
Edit: Looks like it does on occasion, but only GET forms:
http://googlewebmastercentral.blogspot.com/2008/04/crawling-through-html-forms.html
Google will use words that occur on your site in search boxes to try to find pages that it can't otherwise.
Google says that for the past few months, it has been filling in forms
on a "small number" of "high-quality" web sites to get back
information. What words has it been entering into those forms? Words
automatically selected that occur on the site, with check boxes and
drop-down menus also being selected.
http://searchengineland.com/google-now-fills-out-forms-crawls-results-13760
Let's say I'm performing a google search for search term.
Sometimes, one of the suggestions will be to a URL like this: www.someothersearch.com/search+term/
How does "someothersearch.com" do this?
In general, a page will only be in Google if some other page links to it. Google is not going to go to someothersearch.com and submit "search term" into the form, it is likely a hidden or nonhidden link on someothesearch.com.
Why not? someothersearch.com presumably has its own index pages for terms searched previously; the Google spider is just indexing those index pages as well.
Just a guess. Maybe these sites support OpenSearch?
I misunderstood your question at first; What these sites are doing is rewriting their requests. How they know which terms people will search for is a bit of a mystery to me, but it probably relies on things like watching google.com/trends, scraping their own and other log files for referral from google that include the search term, buying lists of well ranking terms people might use AdSense for and instead trying to generate natural traffic for them... etc. Probably when they add new pages with these terms they're also adding them to their xml sitemap that Google will crawl.
Redacted:
I have added the Open-Search tag to your question; please follow it. You'll find this post on https://stackoverflow.com/questions/20830/firefox-and-ie7-users-here-is-your-stackoverflow-search-pluginlink textthe most informative; however I recommend you use image/png for your icon format.