couchdb multi language documents


We are considering using CouchDB for our systems. Does anybody know whether CouchDB has a function or framework that supports handling different languages of the same document?

If your content falls outside UTF-8, then you probably want to attach it to documents as attachments. That way you can also easily keep multiple translations associated with the same document. For example, have a document that holds the metadata and then an attachment for each language. Keep the attachment names standardized (e.g., en_US or whatever) and you can easily check whether you have a given translation.
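As a rough sketch of the layout described above, here is how a document body with one inline attachment per locale could be assembled in Python. The document id, the metadata fields, and the helper names `build_document` and `has_translation` are made up for illustration. The `_attachments` structure with base64-encoded `data` is CouchDB's inline attachment format, and the body would be sent to the database with a normal document PUT.

```python
import base64
import json


def build_document(doc_id, metadata, translations):
    """Return a CouchDB document body with one inline attachment per locale."""
    attachments = {}
    for locale, text in translations.items():
        attachments[locale] = {
            "content_type": "text/plain; charset=utf-8",
            # Inline attachment data must be base64-encoded.
            "data": base64.b64encode(text.encode("utf-8")).decode("ascii"),
        }
    return {"_id": doc_id, **metadata, "_attachments": attachments}


def has_translation(doc, locale):
    """Check whether the document carries an attachment named after the locale."""
    return locale in doc.get("_attachments", {})


doc = build_document(
    "article-42",  # hypothetical document id
    {"type": "article", "author": "jane"},  # hypothetical metadata fields
    {"en_US": "Hello, world", "de_DE": "Hallo, Welt"},
)
print(json.dumps(doc, indent=2))
print(has_translation(doc, "en_US"), has_translation(doc, "fr_FR"))
```

Because the attachment names are the locale codes, checking for a translation is a simple key lookup on the document's attachment list.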

CouchDB documents are Unicode text using UTF-8, so all languages supported by Unicode are supported.

Wikipedia's CouchDB page helps in understanding CouchDB's concept of a document.

Related

How to create a partial search in Meteor that only returns permitted data

I'm trying to create a search feature in Meteor 1.8.1 that does the following:
returns partial matches, e.g. "fish" will find "fish", "fishcake" and "dogfish"
has server-side control of which documents are returned, so search results don't include documents that are not published to the user
is reasonably efficient
returns a limited number of results
This seems like it should be a common requirement, but I'm failing to find any solution.
MongoDB full text search will only return on whole words, so will only find "fish".
Easy search doesn't support server-side permissions, as far as I can tell.
I could try a regex solution but I think it would be expensive?
Thank you for any solutions!
Edit: From discussion it seems that Easy Search does support server-side filtering using a selector, and this would be the best solution. However, I can't get a selector working from the examples and documentation. For clarity, I've created a new question for that issue
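For reference, the regex idea mentioned above could look roughly like the following. This is only a sketch in Python rather than Meteor code: the field names `name` and `allowedUserIds` and both helper functions are assumptions, and `matches` is a local stand-in for the query the server would run. The selector uses MongoDB's `$regex` and `$options` operators. An unanchored, case-insensitive regex generally cannot use an index efficiently, which is the cost the question is worried about.

```python
import re


def build_search_query(term, user_id, limit=20):
    """Build a MongoDB-style selector and options for a permission-aware partial match."""
    selector = {
        # re.escape stops user input from being interpreted as regex syntax.
        "name": {"$regex": re.escape(term), "$options": "i"},
        # Hypothetical permission rule: the array field must contain the user id.
        "allowedUserIds": user_id,
    }
    options = {"limit": limit, "fields": {"name": 1}}
    return selector, options


def matches(selector, doc):
    """Local stand-in for the server-side query, for illustration only."""
    pattern = re.compile(selector["name"]["$regex"], re.IGNORECASE)
    allowed = selector["allowedUserIds"] in doc.get("allowedUserIds", [])
    return allowed and bool(pattern.search(doc["name"]))


docs = [
    {"name": "fish", "allowedUserIds": ["u1"]},
    {"name": "Fishcake", "allowedUserIds": ["u1", "u2"]},
    {"name": "dogfish", "allowedUserIds": ["u2"]},
    {"name": "chips", "allowedUserIds": ["u1"]},
]
selector, options = build_search_query("fish", "u1")
visible = [d["name"] for d in docs if matches(selector, d)][: options["limit"]]
print(visible)
```

Keeping the permission condition inside the selector means the filtering happens in the same query as the text match, so unpublished documents never reach the result list.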

The documentation explicitly states that for advanced use cases you may want to use elastic search and offers you a pluggable extension to ease the burden of integration.
https://matteodem.github.io/meteor-easy-search/docs/recipes/#advanced-search
You might wish that a search for cafe returns documents with the text café in them (special character). Or that your search string is split up by whitespace and those terms used to search across multiple fields.
You should consider using a search engine like ElasticSearch for your search if you have these usecases. ElasticSearch allows you to configure precisely how your fields are being searched. One way you can do that is by analyzing your data, so that searching itself is as fast as possible.

Solr - Enriching the TermsComponent answer

I'm using Solr 3.5.0 (with WebSphere Commerce). While a search is being typed, Commerce uses the suggestion tool to suggest (auto-complete) search terms based on the letters already typed in the search box.
Currently WebSphere Commerce is using Solr's TermsComponent. But one of my new requirements is to be able to enrich the list of suggested terms.
Do you know if there is any way to do that, for example by creating a plain-text dictionary or by using another Solr component?
Thanks for reading,
and for your help.
Regards,
Dekx.

I think a plain-text dictionary probably wouldn't be a usable data source (even if you could use it, searching linearly through a plain-text file would probably be too slow). If you create an index from your dictionary, you could probably incorporate it in the TermsComponent as a shard (see the TermsComponent documentation, under the heading "Distributed Search Support").
I don't believe TermsComponent supports searching multiple fields, so you'll want to make sure the same field name is used for the terms in the dictionary that you want to use (that is, if you are looking at the "name" field in the index, then create a "name" field in your indexed dictionary as well, rather than a "dictionaryentry" field)
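To make the shard idea concrete, a distributed terms request might be built along these lines. This is a sketch only: the host names, the core layout, and the `/terms` handler path are assumptions about the deployment, and `terms_url` is a made-up helper. The parameters themselves (`terms.fl`, `terms.prefix`, `terms.limit`, `shards`, `shards.qt`) are the ones the TermsComponent documentation describes.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def terms_url(base_url, field, prefix, shards, limit=10):
    """Build a TermsComponent request that spans several shards."""
    params = {
        "terms": "true",
        "terms.fl": field,          # same field name in the main index and the dictionary index
        "terms.prefix": prefix,     # letters typed so far
        "terms.limit": limit,
        "shards": ",".join(shards), # main index plus the dictionary index
        "shards.qt": "/terms",      # assumed request handler on each shard
    }
    return f"{base_url}/terms?{urlencode(params)}"


url = terms_url(
    "http://localhost:8983/solr",  # hypothetical main Solr instance
    "name",
    "fis",
    ["localhost:8983/solr", "localhost:8984/solr"],  # hypothetical shard addresses
)
query = parse_qs(urlsplit(url).query)
print(url)
print(query["shards"])
```

Both shards are queried on the same `name` field, which is why the dictionary index needs to reuse that field name.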
To my mind, though, I fail to understand what the value of this would be. Generally, the TermsComponent is intended to look at the terms available in the index on that field. "Enriching" it with more data would just be providing suggestions that it won't actually be able to find when searching. Of course, I don't really know about your search implementation, but in most cases, that would certainly be my thought.

Is there an ISO language code to indicate mixed-language text?

I've checked the ISO 639-1 specs but can't find any references. I can just make something up for my database, but was curious if there is a standard.
Update 2012-02-26: I ended up creating a special entry in my languages table with an asterisk (*) as the country code which I now use to represent entries in no single language.

ISO 639-2 and ISO 639-3 reserved the code mul for documents that contain multiple languages.
By the way, IETF BCP 47 is currently the most authoritative document on the use of language codes.

There is some precedent for using commas when storing multiple ISO language codes, but for a database entry, another table might be prudent when storing such metadata (or even the data itself along with its language code, if it can be partitioned at the paragraph or even sentence level) to represent such a one-to-many relationship.
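A minimal sketch of that one-to-many layout, using SQLite from Python. The table and column names and the `language_label` helper are invented for illustration. The fallback to `mul` for entries in several languages follows ISO 639-2, and `und` (the ISO 639-2 code for undetermined) is my own addition for entries with no recorded language.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entry (id INTEGER PRIMARY KEY, body TEXT NOT NULL);
    -- One row per (entry, language) pair instead of a comma-separated column.
    CREATE TABLE entry_language (
        entry_id INTEGER NOT NULL REFERENCES entry(id),
        language_code TEXT NOT NULL,
        PRIMARY KEY (entry_id, language_code)
    );
""")
conn.executemany(
    "INSERT INTO entry (id, body) VALUES (?, ?)",
    [(1, "Hello / Bonjour"), (2, "Hello")],
)
conn.executemany(
    "INSERT INTO entry_language (entry_id, language_code) VALUES (?, ?)",
    [(1, "eng"), (1, "fra"), (2, "eng")],
)


def language_label(conn, entry_id):
    """Return the single language code, 'mul' for several, 'und' for none."""
    codes = [row[0] for row in conn.execute(
        "SELECT language_code FROM entry_language "
        "WHERE entry_id = ? ORDER BY language_code",
        (entry_id,),
    )]
    if not codes:
        return "und"
    return codes[0] if len(codes) == 1 else "mul"


print(language_label(conn, 1), language_label(conn, 2))
```

With this layout the individual languages stay queryable, and a summary code is derived only when a single value is needed.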
If you are referring to collation, and you are dealing with multiple languages, then it would be best to convert the text to a Unicode charset that your database supports, and store multilingual text in that format.

Using Lucene like a relational database

I am just wondering if we could achieve some RDBMS capabilities in lucene.
Example:
1) I have 10,000 project documents (pdf files) which have to be indexed with their content to make them available for search.
2) Every document is related to a SINGLE PROJECT. The project can contain details like project name, number, start date, end date, location, type etc.
I have to search in the contents of the pdf files for a given keyword, but while displaying the results I want to display the project meta data as mentioned in point (2).
My idea is to associate a field called projectId with each pdf file while indexing. Once we get that from a search hit, we will run a second search to fetch the project meta data.
This way we could avoid duplicated data. Also, if we want to update the project meta data we will end up updating it in a SINGLE PLACE only. Otherwise, if we store this meta data with every indexed pdf document, we will end up updating all of the documents, which is not what I am looking for.
Please advise.

If I understand you correctly, you have two questions:
Can I store a project id in Lucene and use it for further searches? Yes, you can. This is a common practice.
Can I use this project id to search Lucene for project meta data? Yes, you can. I do not know if this is a good idea. It depends on the frequency of your meta data updates and your access pattern. If the meta data is relatively static, and you only access it by id, Lucene may be a good place to store it. Otherwise, you can use the project id as a primary key to a database table, which could be a better fit.

Sounds like a perfectly good thing to do. The only limitation you'll have (by storing a reference to the project in Lucene rather than the project data itself) is that you won't be able to query both the document text and project metadata at the same time. For example, "documentText:foo OR projectName:bar". If you have no such requirement, then it seems like storing an ID in Lucene which refers to a database row is a fine thing to do.
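The two-step lookup can be sketched as follows. This is plain Python, not Lucene: the `search` function is a naive substring match standing in for a full-text query, the `PROJECTS` dictionary stands in for a database table keyed by project id, and all names and sample data are invented.

```python
from dataclasses import dataclass


@dataclass
class IndexedPdf:
    path: str
    project_id: int  # the only project field stored alongside the text
    text: str


# Stand-in for the database table that owns the project metadata.
PROJECTS = {
    7: {"project_name": "Harbour Bridge", "location": "Oslo"},
    9: {"project_name": "Metro Line 4", "location": "Lyon"},
}

# Stand-in for the full-text index.
INDEX = [
    IndexedPdf("specs/bridge.pdf", 7, "Load calculations for the steel deck"),
    IndexedPdf("specs/tunnel.pdf", 9, "Ventilation and steel lining report"),
    IndexedPdf("specs/budget.pdf", 9, "Quarterly budget overview"),
]


def search(keyword):
    """Stand-in for a full-text query: case-insensitive substring match."""
    needle = keyword.lower()
    return [doc for doc in INDEX if needle in doc.text.lower()]


def search_with_metadata(keyword):
    """Step 1: search the text. Step 2: resolve each hit's project by id."""
    return [
        {"path": hit.path, **PROJECTS[hit.project_id]}
        for hit in search(keyword)
    ]


for row in search_with_metadata("steel"):
    print(row)
```

Renaming a project touches only the `PROJECTS` entry, but a query that mixes document text and project name cannot be answered by the index alone, which is the limitation noted above.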

I am not sure on your overall setup, but maybe Hibernate Search is for you. It would allow you to combine the benefits of a relational database with the power of a fulltext search engine like Lucene. The meta data could live in the database, maybe together with the original pdf documents, while the Lucene documents just contain the searchable data.

This is definitely possible. But always be aware of the fact that you're using Lucene for something that it was not intended for. In general, Lucene is designed for full-text search, not for mapping relational content. So the more complex your relational content becomes, the more you'll see a decrease in performance.
In particular, there are a few areas to keep a close eye on:
Storing the value of each field in your index will decrease performance. If you are not overly concerned with sub-second search results, or if your index is relatively small, then this may not be a problem.
Also, be aware that if you are not using the default ranking algorithm, and your custom algorithm requires information about the project in order to calculate the score for each document, this will have a dramatic impact on search performance, as well.
If you need a more powerful index that was designed for relational content, there are hierarchical indexing tools out there (one developed by Apache, called Jackrabbit) that are worth looking into.
As your project continues to grow, you might also check out Solr, also developed by Apache, which provides some added functionality, such as multi-faceted search.

You can use Lucene that way;
Pros:
Full-text search is easy to implement, which is not the case in an RDBMS.
Cons:
Referential integrity: you get it for free in an RDBMS, but in Lucene, you must implement it yourself.

Does StackOverflow use Lucene for tagged searches?

How has SO implemented the tagged search? Is it using Lucene or any other open-source search engine library for tagged searching?
What is the best way to search document (PDF, XML, HTML, MS Word) or database?

Searching tags is very different than searching text. A tagged search is searching for an association where questions are all associated with a particular tag. This can be implemented with a full-text engine where the tags are all appended in a single large entry, but a relational database will probably be best in this situation (assuming the tagged data is in a relational database to start with).
For searching other documents such as PDF, XLS, or HTML, you need a full-text engine like Lucene. You'll need a parser that can extract just the relevant text from each source (i.e., separate text from markup).
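To illustrate the relational approach to tags described above, here is a small SQLite sketch in Python. The schema, the sample rows, and the `questions_tagged` helper are invented for illustration and say nothing about how Stack Overflow actually stores its data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE question (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    -- Association table: one row per (question, tag) pair.
    CREATE TABLE question_tag (
        question_id INTEGER NOT NULL REFERENCES question(id),
        tag TEXT NOT NULL,
        PRIMARY KEY (question_id, tag)
    );
    CREATE INDEX idx_question_tag_tag ON question_tag(tag);
""")
conn.executemany(
    "INSERT INTO question (id, title) VALUES (?, ?)",
    [(1, "Using Lucene like a relational database"),
     (2, "Enriching the TermsComponent answer"),
     (3, "Multi language documents")],
)
conn.executemany(
    "INSERT INTO question_tag (question_id, tag) VALUES (?, ?)",
    [(1, "lucene"), (1, "rdbms"), (2, "solr"), (2, "lucene"), (3, "couchdb")],
)


def questions_tagged(conn, tag):
    """Return titles of all questions associated with the given tag."""
    rows = conn.execute(
        "SELECT q.title FROM question q "
        "JOIN question_tag t ON t.question_id = q.id "
        "WHERE t.tag = ? ORDER BY q.id",
        (tag,),
    )
    return [row[0] for row in rows]


print(questions_tagged(conn, "lucene"))
```

A tagged search here is an exact lookup on the association table, so no text analysis or relevance ranking is involved.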

So, yes, it is using Lucene.NET, though I'm not sure exactly how. The "best" way is a whole 'nother story.

The last time this was discussed (on the podcast) it was mentioned that Stackoverflow uses SQL Server's full-text search feature, not Lucene.

SO doesn't use Lucene.
If you want to index documents and are running Windows, then IFilters would be my first choice.
