I am adding Azure Search and trying to add skills for content enrichment.
I can see the Key Phrase Extraction and the Language Detection predefined skills but not the Text Split skill on the screen. Is there a reason why Text Split skill is not visible? Or is it something that can only be added via API?
The capabilities exposed throught the portal focus on core scenarios that customers want to perform so they do not include text splitting. If you want to split your text, you should do it by creating your own skillset programatically through the API, that will allow you to define the language and the size of a page.
Related
Is there a way to see qtty characters I use in the azure portal metrics? There is "SynthesizedCharacters" metric but I only view data when I use it from Speech Studio. I want to see this metric when I use cognitive sdk. Is it possible?
Thanks
Unfortunately, AFAIK there is no metric to track that from the Azure Portal. However, you can maintain the count locally at your end or the central location where you can query yourself --- add an additional logic to maintain the metrics in your code.
The character is counted based on the below conditions (that can be found here):
Text passed to the text-to-speech service in the SSML body of the
request
All markup within the text field of the request body in the SSML
format, except for and tags
Letters, punctuation, spaces, tabs, markup, and all white-space
characters
Every code point defined in Unicode
I am trying to create an index and skill that will allow me to
Index pdfs, multi and single page, and all other types of files,
Extract the Data and make it searchable,
Search for a term say "Cat" and have sections of text where the term appears to be returned, as well as the page number and document name / downloadable URL of the PDF/ image where it was found, a bounding box, would be nice but not necessary.
I am struggling, I have tried text extraction skill, OCR skill, but I am struggling in that the Search term returns the whole, extracted document (100 pages), as text in the file "content"
It's not making much sense to me, the JFK example is outdated.
I have spent 4 days on this, it cannot be that difficult, the documentation is not that helpful either.
I have tied to "build" and index and skillset using the portal tools, but getting a similar result.
any help would be appreciated.
You might want to try the hOCR custom skill, available on GitHub from the Power Skills repository if you prefer to use the hOCR format for bounding boxes, but [the OCR skill](https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-ocr#sample-text-and-layouttext-output's output) already offers bounding boxes for content. Note that the Power Skills repo also has updated versions of most of the skills used in the JFK sample, including the image store that can help you make pictures of the pages available in your app.
The key to making it work is in the skillset definition.
The JFK skillset has its OCR skill output layoutText.
There is also a custom image store skill that uploads /document/normalized_images/*/data and keeps the resulting URI as imageStoreUri.
Another custom skill transforms the OCR layout results into the HOCR format.
Then a ShaperSkill is aggregating that information under ocrImageMetadata.
In the case of JFK, that information then gets further aggregated under cryptonyms, because that's the main thing the JFK demo is focusing on, and the image metadata is also an output field mapping for /document/hocrDocument/metadata as metadata, which is also indexed. The important point is that all the relevant information is mapped to the indexed fields. As a consequence, the information therein becomes available from index query results.
Suppose I have a huge set of noisy phrases. For each one of them, I want to check if it is defined by some resources by using the google define feature. Once I type "define my_phrase" to the google search box, if the retrieved results contain the definition panel (e.g. https://www.google.com/#q=define+home+cooking), I put it into my phrase pool.
I'm wondering is this possible to do this task in a batch so that I don't have to type each of the phrase manually one by one? It would be great if this could be achieved from a unix terminal but windows is also welcome!
I heard of google-app-engine but I only have a rough idea and not sure if it could help.
Thanks!
as starting point, you may try and play with the Google Custom search following API reference - Xml results
https://developers.google.com/custom-search/docs/xml_results?hl=en&csw=1#XML_Results
Be aware of:
google TOS for this service
quantity courtesy limit
For an academic plone site I am creating, it is desirable to support document tags (see below).
There are multiple users for this site, and each user has a (long) list of publications that they alone can add / edit.
In its simplest form, a publication entry consists of a hyperlink or even just plain text. For instance:
A. Baynes, J. Watson and S. Holmes, "The role of observation and deduction in forensics", Applied Crime Solving, 221, 210-243 (1901). doi: 10.1032/acsolv2714
(The above is a fictitious article, but it has all the elements one expects in most citations.)
For those unfamiliar with DOI links, these are fixed text strings that can be resolved to the page for the article in question using dx.doi.org. Further, copyright / license terms often prohibit the authors from providing a full PDF / HTML for their articles on their websites. The articles often lie behind a paywall (usually accessible from most Universities / major research labs). So, running full text searches on the article itself is NOT an option.
Returning to the problem definition, I am assuming that the users will add their publications as links, but I want to give them the ability to specify a comma separated list of words / phrases (or tags) that more closely identify what the article is about.
For the above article, an appropriate list of tags would be:
forensics, haemoglobin, degradation of evidence
After each user appends such tags to the article, I want to create a backend that will allow visitors to the site to simply be able to enter these tags in a search field and find all publications that pertain to, say, haemoglobin.
That search should pull all publications that list haemoglobin as a tag, for all users of the site.
I intentionally used haemoglobin as a tag to illustrate that relevant tags need not be (and usually aren't) part of the text specified in the title of the article.
Further, the Plone "Collections" feature is not an adequate solution to this problem. Collections are typically generated by the admin. That means that a) admin intervention for something like this is essential and b) tags are best defined by users, not the admin.
When adding any content type (File, Folder, Page, Link, Collection, ...) in Plone, you can apply any number of tags to the content. This is done in the "Categorization" tab when editing/creating the content.
Visitors/Users can search the site based on tags like normal searches (using the search box or accessing the /##search URL).
Moreover you can use "tag cloud" portlets to visualise the tags' frequencies. Check the followings to get an idea:
1. A tag cloud portlet that rotates tags in 3D using a Flash movie
2. TagCloud
Don't forget to check Plone documentation, and specially Plone user manual to get yourself acquainted with the way Plone works.
#user2751530
I would like to know whether you are still working on this specific project - I am currently developing a similar one using plone v4, documentviewer v3 and an as of yet nonexistant frontend. I would like to discuss different approaches to the tagging-by-user problem, you can contact me through skype (dawitt19) or twitter (pref.) through #japhigu.
I am working on a project to digitize approximately 1 million images for which metadata will be added to facilitate search.
Each image is, for example, a page in a dictionary. But not text. Just a static scanned image. OCR is not an option :(
My objective is to emulate the current search procedure which consists of looking up the alphabetical entries till the correct page is found. In absence of machine readable text, I am looking at tagging each page with Dictionary range tag. For Example (Apple-Canada). So if someone searches for "Banana", it should hit the (Apple-Canada) range Tag.
Is this supported in SharePoint out of the box? If not, is there an addon product which provides this functionality or am I looking at building a customized extension?
Any help will be appreciated :)
Installing the IFilter for TIF files is done with a couple of clicks and gives you free OCR along the way. Very good for scanned pages.
On your question though: No, SharePoint does not have any kind of "range" tags or fields. The only vaguely similar thing to what you are requesting is the Thesaurus of the search. There you could define acronyms and synonyms for words and it would actually search for something else. So you could enter Banana but it would actually search for Apple. Some examples here: How to: Customize the Thesaurus in SharePoint Search and Search Server.
Other than that I can only think of a custom implemented search provider giving you the flexibility you need.