I am trying to create an index and skill that will allow me to
Index pdfs, multi and single page, and all other types of files,
Extract the Data and make it searchable,
Search for a term say "Cat" and have sections of text where the term appears to be returned, as well as the page number and document name / downloadable URL of the PDF/ image where it was found, a bounding box, would be nice but not necessary.
I am struggling, I have tried text extraction skill, OCR skill, but I am struggling in that the Search term returns the whole, extracted document (100 pages), as text in the file "content"
It's not making much sense to me, the JFK example is outdated.
I have spent 4 days on this, it cannot be that difficult, the documentation is not that helpful either.
I have tied to "build" and index and skillset using the portal tools, but getting a similar result.
any help would be appreciated.
You might want to try the hOCR custom skill, available on GitHub from the Power Skills repository if you prefer to use the hOCR format for bounding boxes, but [the OCR skill](https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-ocr#sample-text-and-layouttext-output's output) already offers bounding boxes for content. Note that the Power Skills repo also has updated versions of most of the skills used in the JFK sample, including the image store that can help you make pictures of the pages available in your app.
The key to making it work is in the skillset definition.
The JFK skillset has its OCR skill output layoutText.
There is also a custom image store skill that uploads /document/normalized_images/*/data and keeps the resulting URI as imageStoreUri.
Another custom skill transforms the OCR layout results into the HOCR format.
Then a ShaperSkill is aggregating that information under ocrImageMetadata.
In the case of JFK, that information then gets further aggregated under cryptonyms, because that's the main thing the JFK demo is focusing on, and the image metadata is also an output field mapping for /document/hocrDocument/metadata as metadata, which is also indexed. The important point is that all the relevant information is mapped to the indexed fields. As a consequence, the information therein becomes available from index query results.
Related
I need to produce facet block from two vocabularies in my site. I am using Views and a patched version of Views infinite Scroll to generate the search page, using my search index, and I have tweaked everything I could in the facet display settings to see if I could produce the requested results, to no avail.
I do not need keyword searches. I need to show all taxonomy terms in each facets at all times and to be able to select a single criteria at a time from each vocabulary. So, never more thane one selection at a time from each facet block.
Why are you using Solr to store data and generate your search page, if you do not need keyword search and are trying to go against the native working of solr Facets, I hear you say? For performance reasons, it is the reason why I am using Solr to store & serve the results, I have even gone as far as pushing renedered node to the index with the help of the somwhat obscure search_api_solr_view_modes module.
I could take two separate routes
Create a custom block, load all the taxonomy terms, alter the output of the term link to point to the view and provide the TID for the View. The active filter data could be obtained from the view arguments. I know how to do that but feel it is the wrong way to go about it, if I am working with Solr, I should be using a facet, not a custom block.
Build a custom Facet block that has this exact behaviour. After reading a lot of documentation, I git kind of dicouraged with the possibility of doing this simply without having to develop a Facet plugin, which is kind of out of my league.
Any advice is appreciated.
Here is a screenshot of the interface I have to produce.
http://imageshack.com/a/img834/9836/kr0i.png
Each taxonomy term has to be persistent, i.e., produce a link event if there are no nodes indexed under this term.
Selecting a term in one of the vocabularies will deselect previously selected terms
Clicking on the x next to a term will remove it form the active search criterias.
Have a look at this. https://drupal.org/project/ajax_facets This might get you to where you need to be. Sans you infinite scroll. There is a youtube video that goes with it. http://www.youtube.com/watch?v=pBj3OkXLyWs
I'd appreciate it if it works as I haven't tried it my self.
For an academic plone site I am creating, it is desirable to support document tags (see below).
There are multiple users for this site, and each user has a (long) list of publications that they alone can add / edit.
In its simplest form, a publication entry consists of a hyperlink or even just plain text. For instance:
A. Baynes, J. Watson and S. Holmes, "The role of observation and deduction in forensics", Applied Crime Solving, 221, 210-243 (1901). doi: 10.1032/acsolv2714
(The above is a fictitious article, but it has all the elements one expects in most citations.)
For those unfamiliar with DOI links, these are fixed text strings that can be resolved to the page for the article in question using dx.doi.org. Further, copyright / license terms often prohibit the authors from providing a full PDF / HTML for their articles on their websites. The articles often lie behind a paywall (usually accessible from most Universities / major research labs). So, running full text searches on the article itself is NOT an option.
Returning to the problem definition, I am assuming that the users will add their publications as links, but I want to give them the ability to specify a comma separated list of words / phrases (or tags) that more closely identify what the article is about.
For the above article, an appropriate list of tags would be:
forensics, haemoglobin, degradation of evidence
After each user appends such tags to the article, I want to create a backend that will allow visitors to the site to simply be able to enter these tags in a search field and find all publications that pertain to, say, haemoglobin.
That search should pull all publications that list haemoglobin as a tag, for all users of the site.
I intentionally used haemoglobin as a tag to illustrate that relevant tags need not be (and usually aren't) part of the text specified in the title of the article.
Further, the Plone "Collections" feature is not an adequate solution to this problem. Collections are typically generated by the admin. That means that a) admin intervention for something like this is essential and b) tags are best defined by users, not the admin.
When adding any content type (File, Folder, Page, Link, Collection, ...) in Plone, you can apply any number of tags to the content. This is done in the "Categorization" tab when editing/creating the content.
Visitors/Users can search the site based on tags like normal searches (using the search box or accessing the /##search URL).
Moreover you can use "tag cloud" portlets to visualise the tags' frequencies. Check the followings to get an idea:
1. A tag cloud portlet that rotates tags in 3D using a Flash movie
2. TagCloud
Don't forget to check Plone documentation, and specially Plone user manual to get yourself acquainted with the way Plone works.
#user2751530
I would like to know whether you are still working on this specific project - I am currently developing a similar one using plone v4, documentviewer v3 and an as of yet nonexistant frontend. I would like to discuss different approaches to the tagging-by-user problem, you can contact me through skype (dawitt19) or twitter (pref.) through #japhigu.
I am working on a project to digitize approximately 1 million images for which metadata will be added to facilitate search.
Each image is, for example, a page in a dictionary. But not text. Just a static scanned image. OCR is not an option :(
My objective is to emulate the current search procedure which consists of looking up the alphabetical entries till the correct page is found. In absence of machine readable text, I am looking at tagging each page with Dictionary range tag. For Example (Apple-Canada). So if someone searches for "Banana", it should hit the (Apple-Canada) range Tag.
Is this supported in SharePoint out of the box? If not, is there an addon product which provides this functionality or am I looking at building a customized extension?
Any help will be appreciated :)
Installing the IFilter for TIF files is done with a couple of clicks and gives you free OCR along the way. Very good for scanned pages.
On your question though: No, SharePoint does not have any kind of "range" tags or fields. The only vaguely similar thing to what you are requesting is the Thesaurus of the search. There you could define acronyms and synonyms for words and it would actually search for something else. So you could enter Banana but it would actually search for Apple. Some examples here: How to: Customize the Thesaurus in SharePoint Search and Search Server.
Other than that I can only think of a custom implemented search provider giving you the flexibility you need.
I've created a Google map that loads a KML file as an overlay. It is a map of trailheads for say hiking. What I'm trying to figure out now is how to create a search that will allow visitors to search within the KML's data and show the relevant trailhead/s as results on the Google Map. Is this possible? I have a google search that will let them search for an address, but this does NOT search within the KML file's data for a trailhead.
Ideally the visitor could input an address, say 12345 Main st., Chicago, IL, or something and it would display results that are within a specified vicinity, say ten miles, of that address (ie latitude, longitude).
I'm a little lost as to even where to begin.
thanks for your help!
Davis
I don't know how often your kml file updates, but i recommend storing all the kml data in a database as well to make this easier. Maybe every once in a while re-download the kml file and update the database.
Then its as simple as using the haversine formula and searching the database for nearby trails.
What you're describing sounds like a good job for Fusion Tables. Fusion Tables give you a nice way to store and edit the data (even collaboratively). In addition, there are geospatial columns/data fields you can add (aka, a "Location" column that can be address or lat/long coordinates). Put all the trail heads in your fusion table and you can map them. Let people enter an address or lat/long, and you can query the fusion table to show all trail heads within the user specified distance of that point. See the tutorials to get started.
You can use KML search tool to do this. It supports KML KMZ CSV and GPX. You can find the tool here
I have a search engine that searches albums.
For each music album, I have a page.
So, the work flow goes like this:
People search for music titles
The search engine displays a list of albums.
People click on an album to go to a details page.
I want google to index my front page and the details page. I want the details page to be highly ranked. How can I build a sitemap for this?
By the way, I have about 5 million albums (but I want the top 1000 ones to be highly ranked on google)
You would not use a sitemap for that many results. You would want each album to appear as a page with a unique URI to reference that page. That way the search engine can crawl your site by crawling links since search bots cannot submit form data. Each of those URIs should be simple, meaning limited to this part of the URI syntax:
scheme://authority_segment/path
Program your web application to remove and throw away any extraneous data, such as query string or parameters. If you do this you have to be sure that you are watching for URI poisoning or SQL injection even through means of character encoding.
How can I build a sitemap for this?
By pulling the addresses out of your database and creating a XML file with a high priority for some selected pages. Somehow I think that isn’t your real question …
If I wanted to automate building a site map for a site like this, I'd employ Python. I'd pretty much write everything from the ground up (except the data store access). The format is quite simple.
I'm not sure I quite understand your question...