How to show the contents of a file rather than its filename when searching with Solr

I have a lot of PDF files (with text inside), and I want to build a simple search engine that finds the sentences containing given keywords. After several hours of searching, I chose Solr as the tool.
I am new to Solr. I downloaded the latest Solr, 6.5.0, and set it up on Windows 7.
I used the following commands to create a collection called gettingstarted, and I can run searches by visiting http://localhost:8983/solr/gettingstarted/browse
bin\solr.cmd start
bin\solr.cmd create -c gettingstarted
java -Dauto -Dc=gettingstarted -Drecursive -jar example/exampledocs/post.jar *.pdf
However, the results only show the names of the files that contain the keyword, not the matching lines inside them. The following picture shows this case:
I also tried the bundled example called techproducts and, to my surprise, it shows the exact sentences that contain the keywords. The following picture shows this case:
So my question is whether I can do something to make the sentences containing the keywords show up in the first case as well. I don't know anything about Velocity, the config files, or even the underlying principles. I just want it to work and give detailed search results. I do not care about security issues or about how the results look (ugliness is OK).
This is my first day playing with Solr, so I may have made some mistakes in the description. Thanks for your patience. I need your help.

http://localhost:8983/solr/gettingstarted/browse
This is an example UI application (Solritas) that ships with Solr by default.
To query, you should use the /select request handler, which processes your query and returns the results.
http://localhost:8983/solr/gettingstarted/select?q=keyword
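Since the goal is to see the matching text rather than just the file name, one option is to switch on Solr's highlighting parameters on the same /select handler. The following is a minimal sketch of building such a URL. `hl`, `hl.fl`, `hl.snippets` and `hl.fragsize` are standard Solr query parameters, but the field name `content` is an assumption here and has to be a stored field in your schema for snippets to come back. The helper name `build_select_url` is made up for illustration.

```python
from urllib.parse import urlencode


def build_select_url(keyword, collection="gettingstarted",
                     host="http://localhost:8983", field="content"):
    """Build a /select URL that also asks Solr for highlighted snippets."""
    params = {
        "q": keyword,
        "wt": "json",
        "indent": "on",
        "hl": "true",        # turn highlighting on
        "hl.fl": field,      # field to take snippets from (assumed name)
        "hl.snippets": 3,    # up to 3 snippets per document
        "hl.fragsize": 200,  # approximate snippet length in characters
    }
    return f"{host}/solr/{collection}/select?{urlencode(params)}"


print(build_select_url("solr"))
```

If highlighting is available for that field, the response gains a separate "highlighting" section keyed by document id, next to the usual list of documents.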
Indexing PDFs:
When you index a PDF, all the text inside it goes into a field called content by default.
Example:
This assumes you have already created the gettingstarted collection.
Navigate to the example/exampledocs/ directory and run this command:
java -Dauto -Dc=gettingstarted -jar post.jar solr-word.pdf
If it indexed successfully, go to the admin UI and search for a keyword that appears in the PDF. The result should include the content field, whose value is the text inside the PDF.
Example query request URL:
http://localhost:8983/solr/gettingstarted/select?q=solr&wt=json&indent=on
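The JSON returned by /select can be post-processed to print only the sentences that contain the keyword. The sketch below works on an already-parsed response. The `response` / `docs` layout is Solr's standard JSON format, while the `content` field name, the naive sentence splitting, the helper name `matching_sentences` and the sample document are assumptions for illustration only.

```python
import re


def matching_sentences(solr_response, keyword, field="content"):
    """Return (doc id, sentence) pairs for sentences that mention the keyword."""
    hits = []
    needle = keyword.lower()
    for doc in solr_response.get("response", {}).get("docs", []):
        value = doc.get(field, "")
        # Multi-valued fields come back as lists; join them into one string.
        text = " ".join(value) if isinstance(value, list) else str(value)
        # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if needle in sentence.lower():
                hits.append((doc.get("id"), sentence.strip()))
    return hits


# Hypothetical response, shaped like Solr's JSON output.
sample = {
    "response": {
        "numFound": 1,
        "docs": [{
            "id": "solr-word.pdf",
            "content": ["This is a test of PDF and Word extraction in Solr. "
                        "It is only a test."],
        }],
    }
}

for doc_id, sentence in matching_sentences(sample, "solr"):
    print(doc_id, "->", sentence)
```

In practice the dictionary would come from parsing the body of the query URL above with `json.loads`.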

Related

AWS Textract cannot recognize the table on the second page of a PDF document

I need to extract table information from a billing copy using AWS Textract. It gives me almost perfect results every time, but for some PDF documents it does not return the table on the second page.
code examples used: AWS Official Documentation
image(JPEG) of first page is
image(JPEG) of second page is
AWS gives me the first 20 entries as CSV output. But for the second page, the CSV result is:
Most importantly, I see the same result with similar PDFs that have 21 entries, where one entry is on the second page. I have already used PyPDF2 to merge the PDF pages into one page, but that does not solve my problem. Are there any OpenCV tools I need to use?
Please suggest any possible solutions for this type of issue.

I want to add a new column that contains HTML files to the Solr index using Nutch 1.17

I want to add a new column that contains the raw HTML files. What configuration changes are required? I looked at the segment reader output, which has a content folder, but the output is a text file, and I want to index the HTML files in a column. How can I achieve this?
You may run into special-character issues with raw HTML when indexing in Solr. In any case, you should first customize the index-basic plugin in Nutch. Its class is BasicIndexingFilter.java. Update this class with the following:
String htmlcontent = parse.getData();
doc.add("htmlContent", StringUtil.cleanField(htmlcontent));
After this, you also have to add a field named "htmlContent" to the Solr schema. Hopefully that will solve your issue.
There may be other options for this task as well.
I found another option, suggested in the comments, that works best. Use the Nutch CLI:
bin/nutch index crawldb-path -dir segments-directory -addBinaryContent -base64
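With -base64 the raw page bytes reach Solr as a Base64 string, so a client reading the index has to decode them to get the HTML back. This is a small sketch under two assumptions: that the field is named `binaryContent` (check your own schema) and that the page is UTF-8 encoded. The helper name and sample document are hypothetical.

```python
import base64


def decode_html(doc, field="binaryContent", encoding="utf-8"):
    """Decode a Base64-encoded field of an indexed document back into HTML text."""
    raw = base64.b64decode(doc[field])
    # Undecodable bytes are replaced rather than raising, since pages vary in encoding.
    return raw.decode(encoding, errors="replace")


# Hypothetical document, shaped like one returned from the Solr index.
sample_doc = {
    "id": "http://example.com/",
    "binaryContent": base64.b64encode(b"<html><body>Hello</body></html>").decode("ascii"),
}

print(decode_html(sample_doc))
```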

Azure search adding documents to index approaches

I am not sure I will be able to describe this right, but I'll give it a go.
We are working on implementing Azure Search. At the core, we have searchable PDF documents whose text we want added to the index so that all of them are searchable.
The initial thought was to just submit each document to the index via the add document REST API. The thinking was that this would be the simplest and quickest path to getting the text of the document into the index. We also considered using an indexer: keep all the searchable PDF docs in a blob store and have the indexer crawl them every 10-15 minutes.
We also looked into (based on a recommendation) submitting a standalone JSON file containing the text from the PDF, either via the same add document API or by placing the file in a blob store. Within the JSON document we would need identifiers that give the index the location of the PDF, so that when the text is found via search we can make the result clickable and open the PDF.
It seems to me that pushing in the JSON file with the add document API would work: it gets indexed, and when it turns up in a search we can use the doc id to link back to the PDF and open it.
For those of you who have used Azure Search: how did you implement this?
If you're sure that only PDFs will live in this particular index, then the first approach is faster to implement, since the native indexer can both extract the content of the PDF documents and push it to the index.
Both approaches will work, but for the second one you would need to extract the PDF text yourself using an external tool.
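For the second approach, the add documents REST call takes a JSON body with a `value` array in which each document carries an `@search.action`. The sketch below only builds that body and sends nothing. The field names `id`, `content` and `pdfUrl`, the sample values and the helper name `build_upload_batch` are hypothetical and would have to match the fields defined in your own index.

```python
import json


def build_upload_batch(docs):
    """Wrap plain documents in the body shape expected by the add documents API."""
    return {"value": [{"@search.action": "upload", **doc} for doc in docs]}


batch = build_upload_batch([
    {
        "id": "invoice-001",                                   # document key (assumed name)
        "content": "Text extracted from the PDF...",           # searchable text
        "pdfUrl": "https://example.com/docs/invoice-001.pdf",  # link back to the PDF
    }
])

print(json.dumps(batch, indent=2))
```

The resulting JSON would be POSTed to the index's docs/index endpoint with the service's api-key header. Keeping the PDF location in its own field is what lets a search hit link back to the original file.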

Files of a custom type are not searchable in Alfresco using Advanced Search

I am seeing the behavior below in Alfresco. I have read a lot of the related Alfresco documentation but have not found a clear answer.
Here is what I did to search for a file.
1. Uploaded a file named "Test.txt" to a folder that has a single rule, which sets a custom type on uploaded documents.
2. When I select content in the "look for" option in Advanced Search, my test file appears in the search results,
as shown below.
Then I searched for it by the name property, selecting my custom type in the "look for" option in Advanced Search, and got 0 results.
But once I set any property on the Test.txt file, it becomes searchable by custom type in Advanced Search.
My question is: if I just upload a file, how can it become searchable by custom type in Advanced Search?
When is the index generated for uploaded files of a custom type?
I am using Alfresco 4.1 and Solr as search engine.
Thanks,
Fouad
SOLR indexes Alfresco every 15 seconds by default, so there's no reason why your uploaded file wouldn't be indexed right away.
Are you sure your rule actually works?
I'd suggest taking the file's nodeRef and using the Node Browser to look at its type, aspects and properties right after it enters the folder (and triggers the rule), and again after you change something by hand. That might clarify why it works in one case and not in the other.
Additionally, you could search for unindexed nodes and see if your file is there:
http://docs.alfresco.com/5.0/concepts/solr-index-fix.html

Searching by file title as well as file content in the media library

I managed to search the contents of text files using a custom search index, as described in the link below: https://docs.kentico.com/k8/custom-development/miscellaneous-custom-development-tasks/smart-search-api/creating-custom-smart-search-indexes
But it cannot search by file name. For example, if my search text is "Roman", the file "RomanRaj.txt" should show up in the results. Please help.
Try adding the file name to your search index by customizing the index content. See the documentation on this topic.
I'd suggest NOT creating a custom smart search index, but instead using attachments and searching those. Out of the box, Kentico lets you search attachments and their contents without writing any code.
