Web Scraping, data keywords - search

I want to crawl a website and find some specific data on it. Do web scrapers support keyword filtering? For example, I want to extract all data containing the words "Java" or "PHP developer". Are there any web crawlers that support this?

Basically, there are no special keyword scrapers, but you can imitate one.
Case 1
Suppose the meta tags in the HTML head section contain keywords:
<html>
<head>
<meta name="keywords" content="java, php, python, linux">
</head>
</html>
1. Scrape only a part of the web page rather than the whole page, for example the first 1000 characters.
2. Check this part for the keywords with a regex, e.g. /java|php|linux|python/gi
3. If a keyword is found, mark/save the URL and use it for a thorough scrape later.
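The three steps above can be sketched in Python as follows. This is a minimal illustration, not a finished scraper: the function name `head_mentions_keywords` and the sample HTML string are made up for the example, and fetching the page is left out so that only the prefix check is shown.

```python
import re

# Same alternation as the regex in the answer; re.IGNORECASE plays the role of the /i flag.
KEYWORDS = re.compile(r"java|php|linux|python", re.IGNORECASE)


def head_mentions_keywords(html: str, limit: int = 1000) -> bool:
    """Return True if any keyword occurs in the first `limit` characters of the page."""
    return KEYWORDS.search(html[:limit]) is not None


# Hypothetical page source, modeled on the snippet in the answer.
sample = (
    '<html><head>'
    '<meta name="keywords" content="java, php, python, linux">'
    '</head></html>'
)

if head_mentions_keywords(sample):
    print("keyword found: save this URL for a thorough scrape later")
```

Note that the plain alternation also matches inside longer words such as "javascript"; wrapping it as `\b(?:java|php|linux|python)\b` avoids that if it turns out to be a problem.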
Case 2
The web pages have no meta tag with the keywords of interest. :-(
Just retrieve the page content as usual and use the regex (see above) to check for the keywords in the whole page text.
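A rough Python sketch of Case 2, using only the standard library. The class and function names (`TextExtractor`, `keywords_in_page`) are invented for this example, and two things go beyond the answer: the word boundaries (`\b`) in the regex and the skipping of script/style contents, both added so that "JavaScript" or inline code does not count as a hit for "java".

```python
import re
from html.parser import HTMLParser

# Assumption: whole-word matches only, so "JavaScript" is not a hit for "java".
KEYWORDS = re.compile(r"\b(?:java|php|linux|python)\b", re.IGNORECASE)


class TextExtractor(HTMLParser):
    """Collects page text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def keywords_in_page(html: str) -> set:
    """Return the set of keywords (lowercased) found in the page text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    return {match.group(0).lower() for match in KEYWORDS.finditer(text)}


page = (
    "<html><body><script>var java = 1;</script>"
    "<p>We hire PHP and Java developers.</p></body></html>"
)
print(sorted(keywords_in_page(page)))
```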

Related

Correctly set meta description

I am having an issue with the meta description tag on my personal website when using it like this:
<meta name="description" content="my content" />
A Google search for my homepage doesn't display the description I set, but some content from my page instead.
I have read articles about this, but I can't figure out why it is happening.
According to Google support, here is a good way to define a description for your site:
<meta name="Description" content="Author: A.N. Author,
Illustrator: P. Picture, Category: Books, Price: $17.99,
Length: 784 pages">
Is this format for the meta description content going to be a standard? Some kind of formatted object?
Or is the XHTML way I write the tag the problem?
Google announced a while back that it wouldn't use meta descriptions for ranking.
The meta description is sometimes used as the summary text in search results, but if Google thinks it has found a better summary on your page based on the user's search, it will use that instead.
In some situations this description is used as a part of the snippet shown in the search results
Your meta description should really match up to the page content very well, otherwise it definitely won't be shown. The example you have shown contains structured data - but there is no indication that there is a strong preference for this (i.e. it may not be relevant to all pages).
As always, the algorithms are mysterious and subject to change.

How to search the web including meta tags

All search engines find only the content within the body tag that matches the search query. I would like to search the web including meta tags, because I'm eager to know who is using what kind of meta tags.
In short, I would like to search the page source, that is, everything within the <html> and </html> tags, including meta tags and links (CSS, JS), not just the content within the body tag.
Thanks,
Google already does all of that. All words used in building your site are searchable. If you want to find exactly one word, put it in quotes ("word"), for example https://www.google.pl/?gws_rd=ssl#q=%22stackoverflow%22
How to search on Google: https://support.google.com/websearch/answer/134479?hl=en
I finally found what I was looking for: a search engine for source code on the web, so we can find out what kind of meta tags others are using:
nerdydata.com
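For pages whose source you already have, checking which meta tags they use can also be done locally. The following is a small sketch with Python's standard `html.parser`; the names `MetaCollector` and `list_meta_tags` and the sample source are assumptions made for the example, and it does not search the web the way a source-code search engine does.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Records the attributes of every <meta> tag in a page's source."""

    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also arrive here by default.
        if tag == "meta":
            self.metas.append(dict(attrs))


def list_meta_tags(source: str) -> list:
    """Return one dict of attributes per <meta> tag, in document order."""
    collector = MetaCollector()
    collector.feed(source)
    collector.close()
    return collector.metas


# Hypothetical page source for illustration.
source = (
    '<html><head>'
    '<meta name="keywords" content="java, php">'
    '<meta name="description" content="my content" />'
    '</head><body>text</body></html>'
)
for meta in list_meta_tags(source):
    print(meta)
```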

SEO search result indentation (google)

I want my website to have indentation in Google search results.
After looking at many websites for reference, I found this one: "www.traveloka.com"
Inside the website, I can't find any meta keywords.
But the website is well indented.
My questions are:
- Are meta keywords really needed for Google to indent my search result?
- If yes, why is www.traveloka.com well indented without meta keywords?
- If no, what matters then, besides having the pages link to each other via href?
UPDATE:
While doing SEO, I found this website:
chlooe.com
It reports SEO advice, which things to change, etc.
I'll follow the instructions there. Any thoughts?
If by indentation you mean ... it's called sublinks.
Meta tags are no longer important for most search engines. They now rank pages according to content, so use strong keywords in your site's content to get a better ranking.
Having a specific page title helps a lot too.
As for the meta tags, personally I like to leave them in, but they are no longer mandatory.
The Google site links are generated automatically by Google depending on your content.
Here are a few tips:
1) Have a sitemap.xml in your website. This will tell the crawlers which pages are available on your site. To generate a sitemap.xml, I use http://www.xml-sitemaps.com/
2) Submit that sitemap to google webmaster tools.
3) Use clean URLs. For example www.mydomain.com/contact, .../about-us, .../portfolio, ... etc. These help search engines separate the content and create sub links depending on the most important content.
4) Most important of all, get traffic on your website... no traffic = poor ranking.
This is not a full tutorial but just some tips. Search for "google sub links" to learn more.
Hope this helps
https://support.google.com/webmasters/answer/47334?hl=en
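As an alternative to an online generator for tip 1, a minimal sitemap.xml can also be produced with a few lines of Python. This sketch assumes the standard sitemap namespace from sitemaps.org and uses the example clean URLs from tip 3 with an assumed https:// scheme; the function name `build_sitemap` is invented for the example, and optional elements such as lastmod are left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls) -> str:
    """Return a minimal sitemap document listing the given absolute URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        url_element = ET.SubElement(urlset, "url")
        ET.SubElement(url_element, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Example pages; the domain is the placeholder used in the tips above.
pages = [
    "https://www.mydomain.com/contact",
    "https://www.mydomain.com/about-us",
    "https://www.mydomain.com/portfolio",
]
print(build_sitemap(pages))
```

The resulting text would be saved as sitemap.xml at the site root and then submitted as described in tip 2.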

How to index html, css, and javascript using solr

Users of CodePen submit html/css/javascript each time they save a pen. We're setting up Solr search and I'd like to know if any work has been done to properly tokenize html/css/js for optimal retrieval.
For example in javascript, we'd like code like
window.location = 'http://wufoo.com'
to produce a search hit on window, location and window.location.
Also, for html, we don't wish to strip out brackets on elements like <form> or <field>.
Before I go down the road of writing a custom field type, I'd like to know if anyone has already tackled this problem. Since we index each field individually, we'll need a separate tokenizer with rules specific for css, html, and javascript.
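To make the desired JavaScript behavior concrete, here is a small Python sketch of the tokenization rule described above: a dotted chain such as window.location yields each part plus the whole chain. It is only an illustration of the target output, not a Solr tokenizer or field type; the function name `js_tokens` and the regex are assumptions for the example.

```python
import re

# A JavaScript-style identifier, optionally followed by ".identifier" segments.
IDENT_CHAIN = re.compile(r"[A-Za-z_$][\w$]*(?:\.[A-Za-z_$][\w$]*)*")


def js_tokens(code: str) -> list:
    """Emit each identifier in a dotted chain, then the full chain itself."""
    tokens = []
    for match in IDENT_CHAIN.finditer(code):
        chain = match.group(0)
        parts = chain.split(".")
        tokens.extend(parts)
        if len(parts) > 1:
            tokens.append(chain)
    return tokens


print(js_tokens("window.location = 'http://wufoo.com'"))
```

This naive version also tokenizes the contents of string literals (here the URL), which may or may not be wanted for search.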

Can I redirect a query from default search box in SharePoint to a different search engine

I don't want the default results that SharePoint returns. I want the query term entered into the SharePoint search box to be redirected to a different search engine. Can I do that?
I have seen the FAST ESP web parts but could not figure out how they actually transfer the query to the FAST search engine.
Any help would be really appreciated!!
Add a Content Editor web part to the main search page with the following line of code:
<meta http-equiv="refresh" content="5;url=http://newsearchserver">
I don't have any code for passing the query string, but a quick search turned up the following link, which explains how to get the query string value.
http://blogs.edork.com/MikeGeyer/Lists/Posts/Post.aspx?List=c6444f02-e1a0-4a5e-b6f4-70bccdc80508&ID=36
From there you should be able to put it together and redirect to another server with the proper querystring.
Good Luck!
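The URL-building part of that redirect can be sketched as follows. This is illustrative Python rather than the page script itself, and the parameter names are assumptions: `k` as the parameter carrying the SharePoint query term, `q` as the parameter expected by the other search engine, and the example URLs are hypothetical. The same logic would need to be written in whatever script the search page actually runs.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_redirect_url(current_url: str, target_base: str,
                       source_param: str = "k", target_param: str = "q") -> str:
    """Copy the query term from the current URL onto the target search URL.

    `source_param` and `target_param` are assumed names; adjust them to match
    the two search engines involved.
    """
    query = parse_qs(urlsplit(current_url).query)
    term = query.get(source_param, [""])[0]
    return f"{target_base}?{urlencode({target_param: term})}"


# Hypothetical URLs for illustration.
print(build_redirect_url(
    "http://sharepoint/search/results.aspx?k=annual+report",
    "http://newsearchserver/search",
))
```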
You could try customizing the Google Search Appliance search box for SharePoint.
http://code.google.com/apis/searchappliance/documentation/connectors/200/connector_admin/searchbox_sharepoint.html
The source code is at:
http://code.google.com/p/google-enterprise-connector-sharepoint/downloads/list
