Adhering to Google's Structured Data Policy - googlebot

Google's Product structured data reference recommends including aggregateRating. The Common structured data errors documentation lists this error:
The content referred to by the structured data is hidden from the user.
My question is: how can I verify that Google understands my UI? I'm using five SVG star images to indicate the rating, but there are three flavors of star: empty, half and full. Do I need to add a content="rating" or title="rating" attribute, or something else? I'd like to understand how Google knows I'm showing the five stars at all. I could be using .png files, or the Unicode ★ character.
"aggregateRating": {
"#type": "AggregateRating",
"ratingValue": "88",
"bestRating": "100",
"ratingCount": "20"
}

Google doesn't document how they verify that the visible content matches the structured data. As long as you don't hide or omit the content, it should be fine.
For this and many other reasons, it makes sense to use accessible, semantic markup: if content is accessible to users with disabilities, it's typically also accessible to search engine bots.
In the case of ratings, you could use the meter element:
<meter min="0" max="5" value="3">★★★☆☆</meter>
(As meter is probably not widely supported, you might want to consider using WAI-ARIA in addition: example with img elements and aria-labelledby.)
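A sketch of such a fallback, with illustrative file names and IDs: a rating of 3 out of 5 rendered as img elements, grouped and labelled so that assistive technology (and bots) can read the rating as text.
<!-- The visible label doubles as the accessible name for the star group. -->
<p id="rating-label">Rated 3 out of 5 stars</p>
<div role="img" aria-labelledby="rating-label">
  <img src="star-full.svg" alt="">
  <img src="star-full.svg" alt="">
  <img src="star-full.svg" alt="">
  <img src="star-empty.svg" alt="">
  <img src="star-empty.svg" alt="">
</div>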

Related

Creating an Index and Skillset in Azure Cognitive Search

I am trying to create an index and skillset that will allow me to:
1. Index PDFs, multi- and single-page, as well as all other types of files,
2. Extract the data and make it searchable,
3. Search for a term, say "Cat", and have the sections of text where the term appears returned, as well as the page number and document name / downloadable URL of the PDF / image where it was found; a bounding box would be nice but not necessary.
I am struggling: I have tried the text extraction skill and the OCR skill, but a search for a term returns the whole extracted document (100 pages) as text in the "content" field.
It's not making much sense to me, and the JFK example is outdated.
I have spent four days on this; it cannot be that difficult, but the documentation is not that helpful either.
I have also tried to build an index and skillset using the portal tools, but I get a similar result.
Any help would be appreciated.
You might want to try the hOCR custom skill, available on GitHub in the Power Skills repository, if you prefer to use the hOCR format for bounding boxes, but [the OCR skill](https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-ocr#sample-text-and-layouttext-output)'s output already offers bounding boxes for content. Note that the Power Skills repo also has updated versions of most of the skills used in the JFK sample, including the image store skill, which can help you make pictures of the pages available in your app.
The key to making it work is in the skillset definition.
The JFK skillset has its OCR skill output layoutText.
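For orientation, such an OCR skill looks roughly like this (a sketch; the JFK sample's actual definition may differ in details such as the language code):
{
  "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
  "context": "/document/normalized_images/*",
  "defaultLanguageCode": "en",
  "inputs": [
    { "name": "image", "source": "/document/normalized_images/*" }
  ],
  "outputs": [
    { "name": "layoutText", "targetName": "layoutText" }
  ]
}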
There is also a custom image store skill that uploads /document/normalized_images/*/data and keeps the resulting URI as imageStoreUri.
Another custom skill transforms the OCR layout results into the HOCR format.
A ShaperSkill then aggregates that information under ocrImageMetadata, as sketched below.
In the case of JFK, that information gets further aggregated under cryptonyms, because cryptonyms are the main focus of the JFK demo. The image metadata is also covered by an output field mapping from /document/hocrDocument/metadata to a metadata field, which is indexed as well. The important point is that all the relevant information is mapped to indexed fields; as a consequence, it becomes available in index query results.
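A sketch of such a Shaper skill (again not the verbatim JFK definition; the input names follow the steps described above):
{
  "@odata.type": "#Microsoft.Skills.Util.ShaperSkill",
  "description": "Combine per-image OCR layout and the stored image URI",
  "context": "/document/normalized_images/*",
  "inputs": [
    { "name": "layoutText", "source": "/document/normalized_images/*/layoutText" },
    { "name": "imageStoreUri", "source": "/document/normalized_images/*/imageStoreUri" }
  ],
  "outputs": [
    { "name": "output", "targetName": "ocrImageMetadata" }
  ]
}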

Get value within TV’s html attribute in Modx

I'm new to MODX, so I don't know if this is possible or not.
My TV, in this case [[*myTV]], outputs the following:
<data value='www.mylink.com'>Description</data>
Is there a way to display only the data value on the front-end? In this case I just want to display the URL.
My recommendation would be to keep the data (in this case the URL) and the HTML separate; that should help your situation. If the TV contains only the URL itself, it becomes much easier to deal with the TV's output using output modifiers. For example, if [[*myTV]] contains www.mylink.com for a particular resource and you want the original output from your question, you could do something like:
[[*myTV:notempty=`<data value='[[*myTV]]'>Description</data>`]]
You can also nest TVs within output modifiers, so if, for example, you had a corresponding [[*description]] TV that describes the URL in [[*myTV]], you could use:
[[*myTV:notempty=`<data value='[[*myTV]]'>[[*description]]</data>`]]
TL;DR: storing the entire output in the TV and then extracting text from within that TV for output is much more difficult than storing the individual components of that output in separate TVs and bringing them together when needed.
The longer version: in any situation where you store both data and HTML in a TV (which is not advisable in the vast majority of cases), you'll likely find your data duplicated across the project, and if you decide to change the HTML at some point in the future, you then have to go into each and every TV field and change it. That is the opposite of what a CMS is supposed to do, i.e. make content management easier!
If you do find a use case for storing TVs along with their HTML formatting, that is a job best left to MODX chunks: you code the HTML implementation of your TVs in one spot within MODX and, instead of duplicating that code everywhere, you reference the chunk like so: [[$chunk]].
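As a sketch (the chunk name linkData is made up for illustration), a chunk named linkData could contain:
<data value='[[*myTV]]'>[[*description]]</data>
and every template that needs this markup would then just reference [[$linkData]], so the HTML lives in exactly one place.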

Is there any way to hide or obfuscate schema json-ld?

On my webpage I have a standard JSON-LD schema block that holds a LOT of data. Is there any way to prevent an average user from reading it in the console, or at least make it harder?
Remove spacing and new lines. It has to stay machine readable, which I think means you can't obfuscate the actual text or property names.
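For example, a minified Product snippet (the values here are illustrative) stays valid JSON-LD while being much harder to skim:
<script type="application/ld+json">{"@context":"https://schema.org","@type":"Product","name":"Example product","aggregateRating":{"@type":"AggregateRating","ratingValue":"88","bestRating":"100","ratingCount":"20"}}</script>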
I guess you could have it stored in another obfuscated format and have JavaScript generate the readable version. But then, anyone checking the rendered html will see it as it is. And it will limit the systems that can read it.
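A minimal sketch of that idea, assuming the payload is simply base64-encoded (the encoded value below is a placeholder):
<script>
// Decode a base64-encoded JSON-LD payload and inject it into the
// page as a normal structured data block.
var encoded = "BASE64_OF_YOUR_JSONLD"; // placeholder: produce the real value with btoa(jsonLdString)
var tag = document.createElement('script');
tag.type = 'application/ld+json';
tag.text = atob(encoded);
document.head.appendChild(tag);
</script>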
Another idea is to detect whether the visitor is a normal user and not serve them the structured data; they don't need it. But that's cloaking, and it may annoy Google.
Don’t mark up content that is not visible to readers of the page
One of Google's structured data quality guidelines is to give users the content you describe in your JSON-LD (so the idea of hiding this data, or making it harder to read for "normal users", does not make sense).
Don't mark up content that is not visible to readers of the page. For example, if the JSON-LD markup describes a performer, the HTML body should describe that same performer. (Google Quality guidelines)
https://developers.google.com/search/docs/guides/sd-policies
By the way, "normal/average users" won't inspect your HTML source code (And developers have nothing to do with this specific JSON-LD information either).
Protecting JavaScript
If you insist, read topics related to protecting JavaScript (an issue not specific to schema JSON-LD):
How can I obfuscate (protect) JavaScript?
How do I protect javascript files?
Protect your JavaScripts from "view source"

Does Google penalize pages containing (machine or human) translated content?

Google Search has a zero-tolerance policy against duplicate and spun content, but I am not sure how it deals with translated text. Any guesses on how it might detect translated content? The first thing that occurs to me is that they use their own Google Translate to back-translate the translated content into the source language, but if that's the case, do they have to try back-translating into all languages? Are there any specific similarity metrics for such a task? Thank you!
According to this video with a Google employee, auto-generated / machine-translated versions of webpages can count against your site as duplicate content. If you append some text of your own to the machine-translated version, you might be able to get around this "yes, it's duplicated content" flag, but we can't know how much original text needs to be added to a translation for Google's robots to flag the page as original rather than duplicated content.
Your best bet would be to have an actual human translate the whole page, or to have a human translator augment or modify a machine-translated version of your webpage so that the human-edited translation is sufficiently different (what "sufficiently" means, we don't know) from the machine-translated version.

Plone: creating and using document tags?

For an academic plone site I am creating, it is desirable to support document tags (see below).
There are multiple users for this site, and each user has a (long) list of publications that they alone can add / edit.
In its simplest form, a publication entry consists of a hyperlink or even just plain text. For instance:
A. Baynes, J. Watson and S. Holmes, "The role of observation and deduction in forensics", Applied Crime Solving, 221, 210-243 (1901). doi: 10.1032/acsolv2714
(The above is a fictitious article, but it has all the elements one expects in most citations.)
For those unfamiliar with DOI links, these are fixed text strings that can be resolved to the page for the article in question using dx.doi.org. Further, copyright / license terms often prohibit authors from providing the full PDF / HTML of their articles on their own websites; the articles often lie behind a paywall (usually accessible from most universities / major research labs). So running full-text searches on the article itself is NOT an option.
Returning to the problem definition: I am assuming that users will add their publications as links, but I want to give them the ability to specify a comma-separated list of words / phrases (or tags) that more closely identify what the article is about.
For the above article, an appropriate list of tags would be:
forensics, haemoglobin, degradation of evidence
After each user appends such tags to the article, I want to create a backend that will allow visitors to the site to simply be able to enter these tags in a search field and find all publications that pertain to, say, haemoglobin.
That search should pull all publications that list haemoglobin as a tag, for all users of the site.
I intentionally used haemoglobin as a tag to illustrate that relevant tags need not be (and usually aren't) part of the text specified in the title of the article.
Further, the Plone "Collections" feature is not an adequate solution to this problem. Collections are typically generated by the admin. That means that a) admin intervention for something like this is essential and b) tags are best defined by users, not the admin.
When adding any content type (File, Folder, Page, Link, Collection, ...) in Plone, you can apply any number of tags to the content. This is done in the "Categorization" tab when editing/creating the content.
Visitors/users can search the site by tags just like normal searches (using the search box or the /@@search URL).
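Under the hood, those tag searches query the Subject index of the portal catalog, so you can run the same query from code. A minimal sketch (assuming Plone 4 and an available context object, e.g. inside a browser view or a through-the-web script):
from Products.CMFCore.utils import getToolByName
# Query the portal catalog for everything tagged "haemoglobin".
# The Subject index holds the tags entered on the Categorization tab.
catalog = getToolByName(context, 'portal_catalog')
for brain in catalog(Subject='haemoglobin'):
    print brain.Title, brain.getURL()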
Moreover you can use "tag cloud" portlets to visualise the tags' frequencies. Check the followings to get an idea:
1. A tag cloud portlet that rotates tags in 3D using a Flash movie
2. TagCloud
Don't forget to check the Plone documentation, and especially the Plone user manual, to get acquainted with the way Plone works.
@user2751530
I would like to know whether you are still working on this specific project. I am currently developing a similar one using Plone v4, documentviewer v3 and an as-of-yet nonexistent frontend. I would like to discuss different approaches to the tagging-by-user problem; you can contact me through Skype (dawitt19) or Twitter (preferred) at @japhigu.
