Is there an ISO language code to indicate mixed-language text? - nlp

I've checked the ISO 639-1 specs but can't find any references. I can just make something up for my database, but was curious if there is a standard.
Update 2012-02-26: I ended up creating a special entry in my languages table with an asterisk (*) as the language code, which I now use to represent entries that are in no single language.

ISO 639-2 and ISO 639-3 reserved the code mul for documents that contain multiple languages.
By the way, IETF BCP 47 is currently the most authoritative document on the use of language codes.
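To see how mul behaves in practice, here is a minimal Java sketch (the class name is just for illustration): the BCP 47 machinery in java.util.Locale accepts it like any other language subtag, so it can sit in the same column as ordinary codes such as en or de.

    import java.util.Locale;

    public class MulTagDemo {
        public static void main(String[] args) {
            // "mul" is the ISO 639-2 / 639-3 (and BCP 47) code for "multiple languages".
            Locale multi = Locale.forLanguageTag("mul");
            System.out.println(multi.getLanguage());   // prints "mul"

            // It round-trips like any other language subtag, so it can be stored
            // alongside ordinary codes such as "en" or "de".
            System.out.println(multi.toLanguageTag()); // prints "mul"
        }
    }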

There is some precedent for storing multiple ISO language codes separated by commas, but for a database entry a separate table might be more prudent for such metadata (or even for the data itself along with its language code, if it can be partitioned at the paragraph or even sentence level) to represent the one-to-many relationship.
If you are referring to collation, and you are dealing with multiple languages, then it would be best to convert the text to a Unicode charset that your database supports, and store multilingual text in that format.
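As a rough sketch of that one-to-many shape (all names here are made up for illustration; Java 16+ is assumed for records): the entry as a whole is tagged mul, while each paragraph- or sentence-level segment carries its own code.

    import java.util.List;

    // Entry-level code ("mul") plus one language code per segment.
    record TextSegment(String languageCode, String text) {}

    record TextEntry(long id, String overallLanguageCode, List<TextSegment> segments) {}

    class MixedTextExample {
        public static void main(String[] args) {
            TextEntry entry = new TextEntry(
                42L,
                "mul", // the entry as a whole is in multiple languages
                List.of(
                    new TextSegment("en", "Good morning."),
                    new TextSegment("de", "Guten Morgen."),
                    new TextSegment("fr", "Bonjour.")
                )
            );
            entry.segments().forEach(s ->
                System.out.println(s.languageCode() + ": " + s.text()));
        }
    }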

Related

Search with multiple languages on Azure Search

I have a database that holds keywords for items, along with their localizations in different languages (around 30 languages are supported right now), if any exist for a given item. I want to be able to search these items using Azure Search, but I'm not sure how to set up the index architecture. Two solutions come to mind in this scenario:
Either I will
1) have a different index for each language, and use that language's analyzer for that index. Later on, when I want to search using this index, I will also need to detect the query language coming from the user, and then search on the index corresponding to that language.
or
2) have a single index with a lot of fields that correspond to the different localizations of the item. Azure Search has support on having language priorities when searching, so knowing the user's language may come in handy, but is not necessarily a must.
I'm kind of new to this stuff, so any pointers, links, ideas etc. will be of tremendous help, even if it doesn't answer the question directly.
Option 2 is what we recommend (having a single index with one field per language). You can set some static priorities by assigning field weights using a scoring profile. If you are able to detect the language used in a query, you can scope the search to just that language using the searchFields option.
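As a rough sketch of that scoping against the REST API (the service name, index name, per-language field names, and API version below are all placeholders; searchFields is the query parameter that restricts the search to one language's field):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class LanguageScopedSearch {
        public static void main(String[] args) throws Exception {
            // Hypothetical service and index; description_en, description_fr, ...
            // are the per-language fields of the single-index layout.
            String url = "https://my-service.search.windows.net/indexes/items/docs/search"
                       + "?api-version=2020-06-30";

            // Once the query language is detected as French, scope to the French field.
            String body = "{ \"search\": \"chaussures\", \"searchFields\": \"description_fr\" }";

            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                    .header("Content-Type", "application/json")
                    .header("api-key", System.getenv("SEARCH_API_KEY"))
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
        }
    }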

Mahout: Creating vectors from Text, How can we support foreign language?

http://mahout.apache.org/users/basics/creating-vectors-from-text.html
That page shows how to create vectors from text using Lucene.
Is there a way to support characters (languages) other than English?
Thanks
It takes a few steps to turn a document into a vector. Because you mentioned Apache Lucene and Mahout, I will briefly explain how you can obtain vectors using both. It's a little tedious, but you need the big picture to understand what is involved in creating vectors from languages other than English.
First, using Apache Lucene, you create index files from the text. In this step the text is passed through an Analyzer. The Analyzer breaks the text into pieces (technically, tokens) and does most of the important work, including removing stop words (the, but, a, an, ...), stemming, converting to lower case, etc. So, as you can see, to support a different language all you need to do is use (or build) an Analyzer for that language.
In Lucene, StandardAnalyzer is the most broadly applicable analyzer you can use; it handles non-English text such as Chinese, Japanese, and Korean.
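A minimal sketch of that first step, assuming a reasonably recent Lucene (5.x-or-later signatures); the index path and field name are arbitrary, and you would swap in a language-specific analyzer where appropriate:

    import java.nio.file.Paths;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class IndexNonEnglishText {
        public static void main(String[] args) throws Exception {
            // The analyzer is the piece you swap per language; StandardAnalyzer
            // is Unicode-aware, and Lucene's analyzers modules ship
            // language-specific alternatives.
            Analyzer analyzer = new StandardAnalyzer();

            try (Directory dir = FSDirectory.open(Paths.get("lucene-index"));
                 IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {

                Document doc = new Document();
                doc.add(new TextField("body", "这是一个测试文档", Field.Store.YES));
                writer.addDocument(doc);
            }
        }
    }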
Second, after you have the index files, the next step is to mine the text with Mahout. Whatever you plan to do with the text, you have to convert the index files to SequenceFile format, since Mahout can only read inputs as SequenceFiles. The way to do this is with Mahout's SequenceFilesFromLuceneStorage class.
Third, once you have the sequence files, you can convert them to vectors, for example with the SparseVectorsFromSequenceFiles class.
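As a rough sketch of that third step, assuming a Mahout release in which the seq2sparse driver lives at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles and exposes a main() entry point (the paths below are placeholders; check mahout seq2sparse --help for the exact options in your version):

    import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;

    public class SequenceFilesToVectors {
        public static void main(String[] args) throws Exception {
            // Equivalent to running "mahout seq2sparse" from the command line.
            // -a selects the analyzer used for tokenization, which is where a
            // non-English analyzer plugs in again.
            SparseVectorsFromSequenceFiles.main(new String[] {
                "-i", "sequence-files",  // input directory of SequenceFiles
                "-o", "vectors",         // output directory for the vectors
                "-a", "org.apache.lucene.analysis.standard.StandardAnalyzer"
            });
        }
    }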
Hope it helps.

Multilingual free-text search in an app with normalized data?

We have enums, free-text, and referenced fields etc. in our DB.
Each enum has its own translation, free-text could be in any language. We'd like to do efficient large-scale free-text searching and enum value based searching.
I know of solutions like Solr which are nice, but that would mean we'd have to index entire de-normalized records with all the text of all the languages in the system. This seems a bit excessive.
What are some recommended approaches for searching multilingual normalized data? Anyone tackle this before?
ETL. Extract, Transform, Load. In other words, get the data out of your existing databases, transform it (which is more than merely denormalizing it) and load it into SOLR. The SOLR db will be a lot smaller than the existing databases because there is no relational overhead. And SOLR search takes most of the load off of your existing database servers.
Take a good look at how to configure and use SOLR, and learn about SOLR cores. You may want to put some languages in separate cores, because that way you can make more effective use of SOLR's various stemming algorithms. Even with multilingual data you can still use bigrams (as used in Chinese language analysis).
Having multiple cores makes searching a bit more complex, since you can search either a single-language index or across all languages. But it is much more effective to group language data and apply language-specific stopwords, protected words, stemming, and other language analysis tools.
Normally you would include some key data in the index so that when you find a record via SOLR search, you can then reference directly into the source db. Also, you can have normalised and non-normalised data together, for instance an enum could be recorded in a normalised field in English as well as a non-normalised field in the same language as the free-text. A field can be duplicated in order to apply two different analysis and filtering treatments.
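A small SolrJ sketch of loading one denormalised record into a per-language core (SolrJ 6+ assumed; the core name and all field names are made up, and source_table/source_pk are the key data that lets you reference back into the source db after a hit):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class LoadIntoSolr {
        public static void main(String[] args) throws Exception {
            try (SolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/items_fr").build()) {

                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "orders:1234");
                doc.addField("source_table", "orders");          // key back into the source db
                doc.addField("source_pk", 1234);
                doc.addField("status_enum", "SHIPPED");          // normalised enum value
                doc.addField("status_label_fr", "Expédié");      // localised enum label
                doc.addField("free_text_fr", "Livraison rapide, très satisfait.");

                solr.add(doc);
                solr.commit();
            }
        }
    }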
It would be worth your while to trial this with a subset of your data in order to learn how SOLR works and how best to configure it.

couchdb multi language documents

We are considering using CouchDB for our systems now. Does anybody know of a function or framework in CouchDB that supports handling different language versions of the same document?
If your content doesn't fit neatly into the UTF-8 JSON document itself, you probably want to attach the translations to documents as attachments. That way you can easily keep multiple translations associated with the same document. For example, have a document that holds the metadata and then an attachment for each language. Keep the attachment names standardized (e.g., en_US or similar) and you can easily check whether you have a given translation.
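A minimal sketch of attaching one translation via CouchDB's plain HTTP attachment API (the database name, document id, revision, and the de_DE attachment name are placeholders; authentication is omitted):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;

    public class AttachTranslation {
        public static void main(String[] args) throws Exception {
            // PUT /{db}/{docid}/{attachment}?rev={current revision}
            String url = "http://localhost:5984/docs/page-42/de_DE?rev=1-abc123";
            String germanText = "Hallo Welt";

            HttpRequest put = HttpRequest.newBuilder(URI.create(url))
                    .header("Content-Type", "text/plain; charset=utf-8")
                    .PUT(HttpRequest.BodyPublishers.ofString(germanText, StandardCharsets.UTF_8))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(put, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }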
CouchDB documents are Unicode text using UTF-8, so all languages supported by Unicode are supported.
Wikipedia's article on CouchDB also helps with understanding CouchDB's concept of a document, and links out to further material.

UUIDs in CouchDB

I am wondering about the format in which UUIDs are represented by default in CouchDB. While RFC 4122 describes UUIDs like 550e8400-e29b-11d4-a716-446655440000, CouchDB uses unbroken strings of characters like 3069197232055d39bc5bc39348a36417. I've spent some time searching both their wiki and their documentation for what this actually is, without any result.
Do you know whether this is a non-RFC-conformant format that simply omits all the hyphens, or a completely different representation of the 128 bits?
The background is that I'm using Java UUIDs, which are formatted as described in the RFC. I can see that the CouchDB style is probably handier for building internal trees, but I want to be sure to use a consistent implementation.
Technically we don't use the RFC standard for UUIDs, as you've noticed. Version 4 UUIDs reserve something like four bits to specify the UUID version. We also don't format them with the hyphens that are generally seen in other implementations.
CouchDB UUIDs are 16 random bytes formatted as hex. Roughly speaking that's a v4 UUID, but not RFC compliant.
Regardless of the specifics, there's really not much of an issue in practice. You generally shouldn't try to interpret a UUID unless you're doing some sort of out-of-band analysis. CouchDB will never interpret UUIDs; we only rely on the properties of the randomness involved.
Bottom line would be to not worry about it and just treat them as strings after generation.
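For the Java side, a small sketch of going from an RFC 4122 UUID to a CouchDB-style hex string; since _id is just a string, either form works as long as you stay consistent:

    import java.util.UUID;

    public class CouchStyleIds {
        public static void main(String[] args) {
            // An RFC 4122 v4 UUID as java.util.UUID formats it (with hyphens).
            UUID uuid = UUID.randomUUID();
            System.out.println(uuid);

            // CouchDB-style: the same 128 bits as 32 hex characters, no hyphens.
            // Treat the result as an opaque string.
            String docId = uuid.toString().replace("-", "");
            System.out.println(docId);
        }
    }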
OK, I can provide a 2019 reference from the doc site: "it's in any case preferable to provide one's own uuids" -- https://docs.couchdb.org/en/latest/best-practices/documents.html?highlight=uuid
I ran slap-bang into this because the (hobby) db I'm attempting as my first attempt at programming anything deals with an application that does generate and use RFC 4122-compliant UUIDs, and I was chewing my nails worrying about stripping the "-" bits out and putting them back on retrieval.
Then it hit me that the UUID CouchDB uses as the doc _id is a string, not a number... doh. So I use the app's UUID, generated when it creates an object, as the document's _id. No randomly duplicated UUIDs.
