neo4j: fulltext index on external documents and attachments - search

This is the situation. Don't ask why, it is the way it is.
I created a simple fulltext index over the document contents by grabbing the content from the document server and adding the plain, unformatted content as a new property (raw_content) to each Document node in the Neo4j DB.
Then I created a fulltext index like this:
CALL db.index.fulltext.createNodeIndex('content', ['Document'], ['title', 'teaser', 'raw_content'])
So far so good. Search works very well.
Now I want to index the attachments. I've got attachment URLs for each document, which I can fetch by docid.
So, before I slide into any antipatterns, I'd like to ask the community how to do this. I've got two approaches in mind:
1. Similar to the way I index raw_content: is there a way to make Lucene fetch and parse the URLs that I give to it?
2. A batch job does all the parsing and adds the content to new fields like "attachment01_content", and so on.
Solution 1 would be preferred, but I did not find any documentation on it.
Solution 2 is ugly, especially because the Lucene ecosystem (Apache Tika) can already parse PDF, DOC and similar formats.
Any ideas on how to solve this?
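Since Neo4j's fulltext indexes only cover node properties, not external resources, solution 1 has no documented hook as far as I know, and a batch along the lines of solution 2 seems unavoidable. Here is a minimal sketch using Apache Tika and the Neo4j Java driver; the attachment_urls and attachments_content property names are assumptions for illustration:

import org.apache.tika.Tika;
import org.neo4j.driver.*;

import java.net.URL;
import java.util.Map;

public class AttachmentIndexer {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika(); // auto-detects and parses PDF, DOC, etc.
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "secret"));
             Session session = driver.session()) {

            // Hypothetical model: each Document node lists its attachment URLs.
            Result docs = session.run(
                "MATCH (d:Document) RETURN d.docid AS docid, d.attachment_urls AS urls");

            for (Record rec : docs.list()) {
                StringBuilder text = new StringBuilder();
                for (String url : rec.get("urls").asList(Value::asString)) {
                    text.append(tika.parseToString(new URL(url))).append('\n');
                }
                // Write the extracted text back onto the node.
                session.run(
                    "MATCH (d:Document {docid: $docid}) SET d.attachments_content = $text",
                    Map.of("docid", rec.get("docid").asString(), "text", text.toString()));
            }
        }
    }
}

Since fulltext indexes cannot be altered in place, drop and recreate the index with the extra property, e.g. CALL db.index.fulltext.createNodeIndex('content', ['Document'], ['title', 'teaser', 'raw_content', 'attachments_content']).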

Related

Getting thumbnails in OpenSearchServer search results

I need an alternative to Google Custom Search for a website I look after. It has to be something that will crawl a website, index it, allow fiddling with priorities, and then accept search queries via REST or something similar, returning XML or JSON etc. It needs to run on a Windows Server instance.
So, I'm up and running with http://www.opensearchserver.com/ and it seems to do the trick, but I can't, for the life of me, work out how to get thumbnail images into the results. I've searched the documentation and read everything I could, but can't find out how to do this (or how to get my head around it).
I'm crawling standard web pages, and they all have thumbnail metadata, which I assume should be parseable somehow and included in the JSON results?
Any pointers at all would be very helpful, thanks!
I figured this out; in case anyone else is struggling, here's how I did it. The answer is in the documentation, it's just not that simple.
Read: http://www.opensearchserver.com/documentation/faq/crawling/how_to_extract_specific_information_from_web_pages.md - it contains the method
Assume you set up a 'web crawler' index.
Assuming you're using a meta thumbnail like this:
<meta name="thumbnail" content="http://my_cdn.com/news/images/29637.jpg">
Go into Schema / Fields. Add a new field called 'thumbnail' with index no, store yes, vector no, analyser Text, copy of blank. Save that.
Now go to Schema / Parser list and edit the HTML parser. Go to 'Field mapping' and add a new regex mapping for the thumbnail in the HTML: we map from 'htmlSource' to 'thumbnail' with a matching regex.
My imperfect regex (that works though) is:
htmlSource -> linked in: thumbnail -> captured by:
(?s)<meta name="thumbnail" content="(.*?)">
Now SAVE this and go to Crawl / Manual crawl, enter a URL that has a thumbnail, and check whether the field appears in the list below once the page is read. If not, check your regex, and check that you actually saved the HTML parser changes.
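If in doubt, the regex can be sanity-checked outside OSS with a few lines of standalone Java (the sample HTML here is made up):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ThumbRegexTest {
    public static void main(String[] args) {
        String html = "<head>\n<meta name=\"thumbnail\" "
            + "content=\"http://my_cdn.com/news/images/29637.jpg\">\n</head>";
        // (?s) makes '.' match newlines too, as in the parser mapping.
        Pattern p = Pattern.compile("(?s)<meta name=\"thumbnail\" content=\"(.*?)\">");
        Matcher m = p.matcher(html);
        if (m.find()) {
            System.out.println(m.group(1)); // prints the thumbnail URL
        }
    }
}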
To get the thumb into your results, simply add the field name to the JSON you send with the query:
"returnedFields": [ "
"url",
"thumbnail"
],
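For completeness, a rough sketch of sending such a query from Java; the host, port, index name ('web_crawler'), and endpoint path are assumptions based on a default OSS install, so check them against your instance and the REST API docs:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class OssSearch {
    public static void main(String[] args) throws Exception {
        String body = "{ \"query\": \"news\", "
            + "\"returnedFields\": [\"url\", \"thumbnail\"] }";
        URL url = new URL(
            "http://localhost:9090/services/rest/index/web_crawler/search/field");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("POST");
        con.setRequestProperty("Content-Type", "application/json");
        con.setDoOutput(true);
        try (OutputStream os = con.getOutputStream()) {
            os.write(body.getBytes(StandardCharsets.UTF_8));
        }
        // Each result document in the JSON response should now carry a "thumbnail".
        System.out.println("HTTP " + con.getResponseCode());
    }
}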

How to parse a document using crawler4j

I want to parse all documents containing some text I enter as a "query", using crawler4j in Eclipse.
Any ideas?
Not really a "direct" answer, but I also played with crawling these last few days. I looked first at crawler4j, then stumbled on jsoup. I did not play much with the crawler, but jsoup turns out to be quite an easy tool for parsing, hence my suggestion. I guess the crawler is good if you really need to crawl a part of the web, but jsoup really shines as a parser, similar to jQuery in terms of selecting nodes and so on. So perhaps use the crawler to collect documents first, then parse them using jsoup. Here's a quick example:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Fetch the page and select all <li> elements from the parsed DOM.
Document doc = Jsoup.connect("http://example.com")
        .userAgent("Mozilla").timeout(5000).get();
Elements els = doc.select("li");
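To tie this back to the original question, here is a rough sketch of a crawler4j crawler that flags pages containing a query string, handing each page's HTML to jsoup; the seed URL, storage folder, and QUERY value are placeholders:

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import org.jsoup.Jsoup;

public class QueryCrawler extends WebCrawler {
    static final String QUERY = "some text"; // placeholder search term

    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            // Let jsoup strip the markup, then search the visible text.
            String text = Jsoup.parse(html.getHtml()).text();
            if (text.contains(QUERY)) {
                System.out.println("Match: " + page.getWebURL().getURL());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl"); // placeholder folder
        PageFetcher fetcher = new PageFetcher(config);
        RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
        CrawlController controller = new CrawlController(config, fetcher, robots);
        controller.addSeed("http://example.com/"); // placeholder seed
        controller.start(QueryCrawler.class, 2);   // run 2 crawler threads
    }
}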

Searching Single Pages with Dynamic Content

I have a slight problem I have been trying to address for a client I have been working with. We have four sets of single pages that load content from a database using PHP, based on the GET string that is provided. The generated pages are well optimized for SEO, have alt tags for images, and contain content that we need to be able to search with a search feature.
Now I had assumed (and everyone knows what assuming gets you) that these pages would, by default, be searchable by the built-in concrete5 search feature. But it doesn't work: if I search for a word that I know is definitely on one of these pages, even multiple times, no results are found.
How can I make concrete5 search these pages? If it's not doable by default or by a plugin, can someone please offer some advice on how to fix this? This is an important feature and must be completed.
EDIT: See my comment below. I still need some help or direction here, as CSE isn't much of an option.
EDIT 2: It may be viable for me to install a crawler and a custom search engine to address my problems. I was thinking of a spider. Any other suggestions on that, or other options, are much appreciated!
Unfortunately C5 doesn't provide a way to do this -- the only way to tap into the search index is with blocks. And even if you created a phony block just to pass content from the single_page through to the search index, there's no way to say that some content is from one URL while other content is from another URL (which you'd need to do, since your single_page controller is handling many different URLs).
I don't know of a way to achieve what you want to do (and it appears that nobody else does either -- http://www.concrete5.org/community/forums/customizing_c5/make-content-in-single-pages-searchable/ ), other than building your own internal search engine.
EDIT: I just did some digging, and thought that perhaps you could manually insert records into the PageSearchIndex table and specify the searchable content and the desired path there -- but this won't work because it relies on one cID (collection id, a.k.a. page id) per entry -- so you'd only be able to insert one record for the top-level single_page path.
I think the simplest solution here would be to create your own searching infrastructure for your single_pages (like some kind of function in the controller that would return an array of page paths and searchable content for each one), then override the search block and perform an additional search of your single_page -- then combine the results on the search results page there. Or just use google site search for your site, which will actually crawl the pages and hence find your various single_page urls: https://www.google.com/cse/
Best of luck.
I have not tested this, but maybe you can put a function getSearchableContent() in the single page's controller, like you do for blocks. This would return the string to be searched. It would look something like this:
function getSearchableContent() {
    // ... compose the search string depending on the queried content.
    return $searchstring;
}
But I don't know if this works for dynamic content. If not, I'd look into C5's search index core classes and try to extend them for your project.

Modx - Extend site_content - Add new table

Currently, we're running Revolution 2.2. On site_content, we have some tags that are run for crawling Twitter. I want to start tracking the number of results for each tag as results come in, to determine which tags don't return many results, etc.
So I was thinking that I should create a new table (twitter_data), and have a foreign key that will link it to the search tag ID, which is stored in site_content.
What is the best path to accomplish this? Should I create my table then run the reverse schema tool, outlined here?
http://rtfm.modx.com/display/revolution20/Reverse+Engineer+xPDO+Classes+from+Existing+Database+Table#ReverseEngineerxPDOClassesfromExistingDatabaseTable-CreatingaMySQLtable
I also found this, but not sure if this is what I should be looking into:
http://rtfm.modx.com/display/revolution20/Using+Custom+Database+Tables+in+your+3rd+Party+Components
Probably not -- if you can avoid modifying the core MODX schema, do so. An external table may be your best option, but it requires a fair bit of work.
Though if you can explain what you mean by 'tags' a little better [HTML tags? snippets? content tags? not sure what you mean], there may be other options. For example, one of our clients wanted to count page hits [and didn't want to use Google to do it], so all we did was create a template variable bound to each page they wanted to count and then update that variable with a plugin that fires on the onpageload or onpagerender event [I don't remember exactly which, or what it was called].
Basically, you may be able to do this by writing a plugin rather than trying to extend anything or add snippets/chunks.

List Schema - URL Syntax

I ran across this a couple of months ago and did not save the link anywhere, unfortunately.
Basically, there is a URL syntax to extract a SharePoint list's basic schema and export it to the browser in XML format. It gives the basic information for the fields and views of the list.
Resolution:
http://blogs.msdn.com/b/kaevans/archive/2009/05/01/getting-xml-data-from-a-sharepoint-list-the-easy-way.aspx
You just have to search with the right combination of words to get the result you need.
For everyone else's sake:
http://<PATH TO SITE>/_vti_bin/owssvr.dll?Cmd=ExportList&List={GUID}
The list GUID can be found by going to List Settings and pulling the GUID out of the URL.
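As an illustration, once the URL is known the XML can be fetched like any other resource. This sketch assumes the request is already authenticated (real SharePoint installs usually require NTLM/Kerberos, which is out of scope here); the site path and GUID are placeholders:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ExportListSchema {
    public static void main(String[] args) throws Exception {
        // Placeholder site and list GUID; copy the real GUID from List Settings.
        String url = "http://sharepoint.example.com/sites/mysite/_vti_bin/owssvr.dll"
            + "?Cmd=ExportList&List={12345678-1234-1234-1234-123456789012}";
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // the list schema as XML
            }
        }
    }
}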
You might want to take a look at this: http://www.dotnetmafia.com/blogs/dotnettipoftheday/archive/2010/01/21/introduction-to-querying-lists-with-rest-and-listdata-svc-in-sharepoint-2010.aspx
Not sure if it's what you are after.
