Does Google download the whole page by crawling and then scrape some data, such as the title and meta tags, to build its index?
What other data points does Google extract from the page?
From this old, classic paper by Google's founders:
System features include:
Hyperlinks (for calculating pagerank)
Anchor text
Visual presentation details such as font size of words
Full raw HTML of pages is available in a repository
Also see this for more about processing for information-retrieval purposes.
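To make the list above concrete, here is a minimal sketch of pulling a few of those data points (title, meta tags, hyperlinks with their anchor text) out of raw HTML. It uses only Python's standard `html.parser`; the class name `PageFeatures` and the function `extract_features` are made up for illustration, and this is in no way how Google's actual pipeline is implemented.

```python
from html.parser import HTMLParser


class PageFeatures(HTMLParser):
    """Collects the title, named meta tags, and (href, anchor text) pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}       # meta name -> content
        self.links = []      # (href, anchor text) pairs
        self._in_title = False
        self._href = None    # href of the <a> currently open, if any
        self._anchor = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""
        elif tag == "a" and attrs.get("href"):
            self._href = attrs["href"]
            self._anchor = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._anchor).strip()))
            self._href = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        if self._href is not None:
            self._anchor.append(data)


def extract_features(html):
    """Parse one HTML document and return the collected features."""
    parser = PageFeatures()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "meta": parser.meta,
        "links": parser.links,
    }
```

The links and anchor text are what a link-analysis step such as PageRank would consume; the raw HTML itself would be stored separately.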
I am trying to create an index and skillset that will allow me to:
Index PDFs, multi and single page, and all other types of files,
Extract the data and make it searchable,
Search for a term, say "Cat", and have the sections of text where the term appears returned, along with the page number and the document name / downloadable URL of the PDF / image where it was found. A bounding box would be nice but is not necessary.
I am struggling. I have tried the text extraction skill and the OCR skill, but the search term returns the whole extracted document (100 pages) as text in "content".
It's not making much sense to me, and the JFK example is outdated.
I have spent 4 days on this. It cannot be that difficult, but the documentation is not that helpful either.
I have tried to "build" an index and skillset using the portal tools, but I am getting a similar result.
Any help would be appreciated.
You might want to try the hOCR custom skill, available on GitHub from the Power Skills repository, if you prefer to use the hOCR format for bounding boxes, but [the OCR skill](https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-ocr#sample-text-and-layouttext-output)'s output already offers bounding boxes for content. Note that the Power Skills repo also has updated versions of most of the skills used in the JFK sample, including the image store that can help you make pictures of the pages available in your app.
The key to making it work is in the skillset definition.
The JFK skillset has its OCR skill output layoutText.
There is also a custom image store skill that uploads /document/normalized_images/*/data and keeps the resulting URI as imageStoreUri.
Another custom skill transforms the OCR layout results into the hOCR format.
Then a ShaperSkill aggregates that information under ocrImageMetadata.
In the case of JFK, that information then gets further aggregated under cryptonyms, because that's the main thing the JFK demo focuses on. There is also an output field mapping from /document/hocrDocument/metadata to metadata, which is indexed as well. The important point is that all the relevant information is mapped to indexed fields. As a consequence, that information becomes available in index query results.
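Once the per-page layout information is in your index (or in your app), getting "the line where the term appears, plus page and bounding box" is a small amount of client-side code. The sketch below assumes each page's layoutText has a `lines` list whose items carry `text` and `boundingBox`, as in the OCR skill's documented sample output, and that the pages arrive in document order so the list position can stand in for the page number. The function name `find_term` and the shape of the returned hits are made up for illustration.

```python
def find_term(doc_name, pages_layout, term):
    """Return one hit per OCR line that contains `term` (case-insensitive).

    pages_layout: list of layoutText-like dicts, one per page, in page order
    (an assumption; adapt this to however your skillset stores them).
    """
    needle = term.lower()
    hits = []
    for page_number, layout in enumerate(pages_layout, start=1):
        for line in layout.get("lines", []):
            if needle in line["text"].lower():
                hits.append({
                    "document": doc_name,
                    "page": page_number,
                    "text": line["text"],
                    "boundingBox": line["boundingBox"],
                })
    return hits
```

This returns only the matching lines rather than the whole 100-page "content", which is the behaviour the question is after.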
Google's search engine has a zero-tolerance policy against duplicate and spun content, but I am not sure how it deals with translated text. Any guesses on how it might detect translated content? The first thing that occurs to me is that they use their own Google Translate to back-translate the translated content into the source language, but if that's the case, do they have to try back-translating into all languages? Are there any specific similarity metrics for such a task? Thank you!
According to this video with a Google employee, auto-generated / machine-translated versions of webpages can count against your site as duplicate content. If you add some text of your own to the machine-translated version, you might be able to get around this 'Yes, it's duplicated content' flag, but we can't know how much original text needs to be added to a translation for the Google robots to flag the page as original content instead of duplicated content.
Your best bet would be to have an actual human translate the whole web page, or you could have a human translator augment or modify a machine-translated version of your webpage so that the human-edited translation is sufficiently different (what 'sufficiently' is, we don't know) from the machine-translated version.
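On the similarity-metric part of the question: nobody outside Google knows what they use, but one common, generic way to measure how close two texts are is Jaccard similarity over word shingles. The sketch below is only that generic metric, not Google's method; the function names and the default shingle size of 3 words are arbitrary choices. You could use it to get a rough feel for how far a human-edited translation has moved from the machine-translated one.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles in `text` (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        # Text shorter than one shingle: treat it as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of the two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty texts are trivially identical
    return len(sa & sb) / len(sa | sb)
```

A score near 1.0 means the two versions share almost all of their phrasing; what threshold a search engine would treat as "duplicate" is unknown.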
I have developed this website
thereelthing.com.sg/
However, three weeks after I updated the meta tags on my website, they still do not appear the same in Google search!
( http://thereelthing.com.sg/ )
Search result:
Clients | The reel Thing
http://thereelthing.com.sg/clients/
The Reel Thing, Video Company, thereelthing.com.sg , mandanemedia.com.
Search link (on page 3)!
https://www.google.com/
Search for: the reel thing sg
Do you know if there is any way I could get the Google search result to update faster?
Your site has a PageRank of 0, so Google won't be indexing it very often. Have you set up Google Webmaster Tools for it, and do you have an XML sitemap?
Your title tags don't explain much about each page and your meta descriptions are all identical and full of content that is not relevant (repeated domain names?).
I would not be surprised if Google decided to ignore them in most cases and make up their own text. They do that.
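If you don't have an XML sitemap yet, it is a very small file. Here is a minimal sketch that generates one following the sitemaps.org protocol (a `urlset` of `url` / `loc` entries), using only Python's standard library. The function name `build_sitemap` is made up, and the URL list is just the two pages mentioned in the question; you would list your real pages.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap([
    "http://thereelthing.com.sg/",
    "http://thereelthing.com.sg/clients/",
])
```

Save the result as sitemap.xml at the site root and submit it in Webmaster Tools. That does not guarantee faster re-indexing, but it tells the crawler which pages exist.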
When you update your meta tags, the page needs to be re-indexed. Try social bookmarking and blogs to get search engine robots to crawl and cache it again.
Build backlinks using the new meta keywords to get faster, more accurate results. The new meta tags will take effect once your web page is re-indexed.
I am working on a project to digitize approximately 1 million images for which metadata will be added to facilitate search.
Each image is, for example, a page in a dictionary. But not text. Just a static scanned image. OCR is not an option :(
My objective is to emulate the current search procedure, which consists of looking up the alphabetical entries until the correct page is found. In the absence of machine-readable text, I am looking at tagging each page with a dictionary range tag, for example (Apple-Canada). So if someone searches for "Banana", it should hit the (Apple-Canada) range tag.
Is this supported in SharePoint out of the box? If not, is there an add-on product which provides this functionality, or am I looking at building a customized extension?
Any help will be appreciated :)
Installing the IFilter for TIF files is done with a couple of clicks and gives you free OCR along the way. Very good for scanned pages.
On your question though: No, SharePoint does not have any kind of "range" tags or fields. The only vaguely similar thing to what you are requesting is the Thesaurus of the search. There you could define acronyms and synonyms for words and it would actually search for something else. So you could enter Banana but it would actually search for Apple. Some examples here: How to: Customize the Thesaurus in SharePoint Search and Search Server.
Other than that I can only think of a custom implemented search provider giving you the flexibility you need.
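For what it's worth, the lookup logic such a custom provider would need is simple. Below is a rough sketch, independent of SharePoint and not using any SharePoint API: each page is stored as a (first entry, last entry, page id) tuple, and a binary search over the first entries finds the page whose range covers the search word. The function names and the page ids are made up for illustration.

```python
from bisect import bisect_right


def build_index(pages):
    """pages: iterable of (first_entry, last_entry, page_id) tuples."""
    ordered = sorted(pages, key=lambda p: p[0].lower())
    starts = [p[0].lower() for p in ordered]
    return starts, ordered


def find_page(index, word):
    """Return the page_id whose range contains `word`, or None."""
    starts, ordered = index
    w = word.lower()
    # Rightmost page whose first entry is <= the search word.
    i = bisect_right(starts, w) - 1
    if i < 0:
        return None
    first, last, page_id = ordered[i]
    return page_id if w <= last.lower() else None


index = build_index([
    ("Apple", "Canada", "page-001"),
    ("Candle", "Eagle", "page-002"),
])
```

With this index, `find_page(index, "Banana")` lands on the (Apple-Canada) page. Comparison here is plain lowercase string ordering, which is an assumption; a real dictionary may need locale-aware collation.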
Google recently launched a search-by-image feature in image search, which means we can find other images by uploading an image in the Google search box. How is this possible?
http://images.google.com
Look at WP:Content-based image retrieval. One open-source implementation whose internal workings you can study is the GNU Image Finding Tool.
If you click on the "Learn more" link on the page you are referring to, you'll find this explanation:
How it works
Google uses computer vision techniques to match your image to other images in the Google Images index and additional image collections. From those matches, we try to generate an accurate "best guess" text description of your image, as well as find other images that have the same content as your search image. Your search results page can show results for that text description as well as related images.
Actually, the answer to this lies in image processing. Over the past decade, image processing and computer vision have advanced a great deal.
Search by image uses pixels: it compares the pixels and matches them against the image database it holds. It is quite similar to what text search does, but with pixels in place of text.
There are various operators, such as the Sobel operator, which help us focus on the important details of the picture being tested, so we can search on the basis of the image's important features.
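To give an idea of what the Sobel operator does, here is a small pure-Python sketch that computes the gradient magnitude of a grayscale image stored as a 2D list. It only illustrates edge detection as one kind of image feature; it is not a description of how Google's search by image works. The function name `sobel_magnitude` is made up, and border pixels are simply left at 0.

```python
import math

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(image):
    """Return per-pixel gradient magnitude; border pixels stay 0.0."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * pixel
                    gy += SOBEL_Y[dy][dx] * pixel
            out[y][x] = math.hypot(gx, gy)
    return out


# A tiny image with a vertical edge between the dark and bright halves.
image = [[0, 0, 0, 10, 10, 10] for _ in range(3)]
edges = sobel_magnitude(image)
```

The output is large only around the column where the brightness jumps and zero in the flat regions, which is the sense in which the operator "focuses on the important details".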