Has anyone in the NLP field heard of the term Zone Hashing? From what I hear, zone hashing is the process of iterating through a document and extracting sentences. An accumulation of sentences is then hashed, and the process continues for the next n sentences...
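To make that concrete, the process as I understand it would look roughly like this (a hypothetical sketch, not taken from any particular implementation):

import hashlib

def zone_hashes(sentences, n=5):
    """Hash each consecutive group of n sentences; return one digest per group."""
    digests = []
    for start in range(0, len(sentences), n):
        chunk = " ".join(sentences[start:start + n])
        digests.append(hashlib.sha1(chunk.encode("utf-8")).hexdigest())
    return digests

Two documents that share a block of n sentences would then share a digest, which is presumably what makes this useful for similarity/nearness checks.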
I haven't found any references to this on Google, so I'm wondering if it goes by a different name. It should be related to measuring text similarity/nearness.
Perhaps it refers to locality sensitive hashing?
As far as I know, "zone hashing" is not a well-established concept in NLP as a discipline. It is just a simple notion used in some NLP-related algorithms. The only one I know of that uses it is the Sphinx search server, and there "zone hashing" simply means "hashing of objects called zones", where a "zone" is described as follows:
Zones can be formally defined as follows. Everything between an opening and a matching closing tag is called a span, and the aggregate of all spans sharing the same tag name is called a zone. For instance, everything between the occurrences of <H1> and </H1> in the document field belongs to the H1 zone.
Zone indexing, enabled by the index_zones directive, is an optional extension of the HTML stripper. So it will also require that the stripper is enabled (with html_strip = 1). The value of index_zones should be a comma-separated list of those tag names and wildcards (ending with a star) that should be indexed as zones.
Zones can nest and overlap arbitrarily. The only requirement is that every opening tag has a matching closing tag. You can also have an arbitrary number of both zones (as in unique zone names, such as H1) and spans (all the occurrences of those H1 tags) in a document. Once indexed, zones can then be used for matching with the ZONE operator, see Section 5.3, "Extended query syntax".
And hashing of these structures is used in the traditional sense to speed up search and lookup. I am not aware of any "deeper" meaning.
Perhaps it refers to locality sensitive hashing?
Locality-sensitive hashing is a probabilistic method for multi-dimensional data; I do not see any deeper connection to zone hashing than the fact that both use hash functions.
Related
I have a database which holds keywords for items, and also their localizations in different languages (supporting around 30 different languages right now), if there are any for that item. I want to be able to search these items using Azure Search. However, I'm not sure about how to set up the index architecture. Two solutions come to my mind in this scenario:
Either I will
1) have a different index for each language, and use that language's analyzer for that index. Later on, when I want to search using this index, I will also need to detect the query language coming from the user, and then search on the index corresponding to that language.
or
2) have a single index with a lot of fields that correspond to the different localizations of the item. Azure Search supports assigning priorities to languages when searching, so knowing the user's language may come in handy, but is not necessarily a must.
I'm kind of new to this stuff, so any pointers, links, ideas etc. will be of tremendous help, even if it doesn't answer the question directly.
Option 2 is what we recommend (having a single index with one field per language). You can set some static priorities by assigning field weights using a scoring profile. If you are able to detect the language used in a query, you can scope the search to just that language using the searchFields option.
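As a rough sketch of what scoping by language could look like against the REST API (the service name, index name, field name, and api-version below are placeholders you would replace with your own):

import requests

SERVICE = "https://my-service.search.windows.net"  # placeholder service URL
INDEX = "items"                                     # placeholder index name
API_KEY = "<query-key>"

def search(query, language_field):
    # language_field is the per-language field to scope to, e.g. "description_fr"
    params = {
        "api-version": "2020-06-30",
        "search": query,
        "searchFields": language_field,
    }
    response = requests.get(
        f"{SERVICE}/indexes/{INDEX}/docs",
        params=params,
        headers={"api-key": API_KEY},
    )
    response.raise_for_status()
    return response.json()["value"]

If you cannot detect the query language, you can omit searchFields and let the scoring profile's field weights drive relevance instead.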
I have a lot of entities with 3 language columns: DescriptionNL, DescriptionFR and DescriptionDE (Description, Info, Article, ... all in 3 languages).
My idea was to create a fourth property, Description, which returns the right value according to Thread.CurrentThread.CurrentCulture.TwoLetterISOLanguageName.
But a drawback is that when you have a GetAll() method in your repository for a dropdown list or something else, you return all three values to the application layer, which means extra network traffic.
Adding a language parameter to the domain services to retrieve data is also "not done" according to DDD experts, the reason being that the language is part of the UI, not the Domain. So what is the best way to retrieve your models with the right description?
You are correct in stating that language has no bearing on a domain model. If you need to manipulate objects or data, you will need to use some canonical form of that data. This only applies to situations where the value has some kind of meaning in your domain. Anything that is there only for classification may not interest your model, but it may still be useful to use a canonical value.
The added benefit of a canonical value is that you know what the value represents, even across systems, as you can map between them.
A canonical approach that was used on one of my previous projects had data sets with descriptions in various languages, yet the keys were the same for each value. For instance, Mr is key 1, whereas Mrs is key 2. Now in French M. would be key 1 and Mme would be key 2. These values are your organisational values. Now let's assume you have System A and System B. In System A Mr is value 67 and in System B Mr is value 22. Now you can map to these values via your canonical values.
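As a tiny illustration of that mapping (the keys and values here are hypothetical, just to show the shape of it):

# Canonical keys with per-language descriptions, mirroring the Mr/Mrs example.
TITLES = {
    1: {"en": "Mr", "fr": "M."},
    2: {"en": "Mrs", "fr": "Mme"},
}

# Per-system values mapped back to the canonical keys.
SYSTEM_A_TO_CANONICAL = {67: 1}  # System A stores Mr as 67
SYSTEM_B_TO_CANONICAL = {22: 1}  # System B stores Mr as 22

def describe(system_value, system_mapping, language):
    """Translate a system-specific value to a description via its canonical key."""
    return TITLES[system_mapping[system_value]][language]

print(describe(67, SYSTEM_A_TO_CANONICAL, "fr"))  # prints "M."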
You wouldn't necessarily store these as entities in a repository but they should be in some read model that can be easily queried. The trip to the database should not be too big of a deal as you could cache the results along with a version number or expiry date.
I need to do large-scale anonymisation of database log-files.
Part of this will involve obscuring various field names (strings), as well as IP addresses.
1. Field Names
For example, we might have the string BusinessLogic.Categorisation.ExternalDeals. In the anonymised version, we would want it to be something like Jerrycan.Doorway.Fodmap (or something gibberish, but still "pronounceable")
The purpose is simply to obscure the original strings; however, we still want to be able to match up occurrences of those strings across different logfiles.
The requirements of the hash are:
Repeatable - that is, the same inputs passed in each time would always produce the same outputs. We need to be able to match up fields between different logfiles (all we're trying to prevent is somebody deriving the original string).
One-way - there is no way of reversing the outputs to produce the inputs.
Low chance of collision - it will mess up our analysis if two fields are mapped to the same output.
Human readable (or pronounceable) - somebody scanning through logfiles by hand should be able to make out fields, and visually match them up. Or if need be, read them over the phone.
Short strings - I do understand there's a tradeoff between this and available entropy, however, ideally a string like HumanReadable should map to something like LizzyNasbeth.
I had a look around, and I found https://github.com/zacharyvoase/humanhash (the output hash is a bit longer than what I want) and https://www.rfc-editor.org/rfc/rfc1751 (not really "pronounceable" - ideally, we'd want something that looks like an English-language word but isn't actually one - and, once again, a bit long).
What algorithms or approaches are there to this problem? Or any libraries or implementations you could recommend?
2. IP Addresses
For the IP addresses, we need a way to mask them (i.e. not possible for an outside observer to derive the original IP address), but still have it be repeatable across different logfiles (i.e. the same input always produces the same output).
Ideally, the output would still "look" like an IP address. For example, maybe 192.168.1.55 would map to 33.41.22.44 (or we can use alphabetical codes as well, if that's easier).
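To show the kind of mapping I mean, something like the following would satisfy the repeatable/one-way requirements (just a sketch; it assumes a secret key is acceptable, and unlike a prefix-preserving scheme such as Crypto-PAn it does not keep subnet structure):

import hashlib
import hmac

SECRET_KEY = b"per-deployment secret"  # hypothetical key, kept out of the logs

def mask_ip(ip):
    """Map an IPv4 address to a repeatable, non-reversible fake address."""
    digest = hmac.new(SECRET_KEY, ip.encode("utf-8"), hashlib.sha256).digest()
    return ".".join(str(octet) for octet in digest[:4])

print(mask_ip("192.168.1.55"))  # same fake address every run for the same key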
Any thoughts on how to do this?
You could use codenamize:
$ codenamize BusinessLogic -j "" -c
AbsorbedUpper
You can use this from the command line or as a Python library.
(Disclaimer, I wrote it).
I was discussing with a colleague, and he suggested one approach.
Take the field name - and pass it through a standard one-way hash (e.g. MD5).
Use the resulting digest as an index into a dictionary of English words (e.g. using mod).
That solves the issue of it always being repeatable - the same word hashed each time will always map to the same English word (assuming your dictionary list does not change).
If individual companies were worried about dictionary attacks (i.e. the field name "firstname" would always map to, say, "Paris"), then we could also use a company-specific keyfile to salt the hash. This means it would be repeatable for anonymised logfiles from that company (i.e. "firstname" might always map to "Toulouse" for them), but it would not be the same as for other companies that use other keyfiles.
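A minimal sketch of that idea (the word list and keyfile here are placeholders; a real word list would need to be large, or several words combined, to keep the collision risk acceptable):

import hashlib
import hmac

WORDS = ["Paris", "Toulouse", "Lyon", "Nantes", "Lille", "Brest"]  # placeholder list
COMPANY_KEY = b"contents of the company-specific keyfile"          # placeholder salt

def anonymise_field(field_name):
    """Repeatably map a field name to a dictionary word via a keyed hash."""
    digest = hmac.new(COMPANY_KEY, field_name.encode("utf-8"), hashlib.md5).digest()
    return WORDS[int.from_bytes(digest, "big") % len(WORDS)]

print(anonymise_field("firstname"))  # always the same word for the same keyfile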
I'm still very keen to see what other people can suggest, or whether they might have any thoughts on the above.
I have a requirement within my application to fuzzy match a string value inputted by the user, against a datastore.
I am basically attempting to find possible duplicates in the process in which data is added to the system.
I have looked at Metaphone, Double Metaphone, and Soundex, and the conclusion I have come to is that they are all well and good when dealing with a single-word input string; however, I am trying to match against an undefined number of words (they are actually place names).
I did consider splitting each of the words out of the string (removing any I define as noise words), then implementing some logic to determine which place names within my data store best match (based on the keys from whichever algorithm I choose). The advantage I see in this is that I could selectively tighten up or loosen the match criteria to suit the application; however, this does seem a little dirty to me.
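Roughly, the splitting approach I was considering would look something like this (just a sketch; the jellyfish library and the noise-word list are assumptions, and the score threshold is where the criteria could be tightened or loosened):

import re
import jellyfish  # assumed available; any Metaphone/Soundex implementation would do

NOISE_WORDS = {"the", "of", "on", "upon"}  # hypothetical noise words

def phonetic_keys(place_name):
    """Split a place name into tokens and return the set of phonetic codes."""
    tokens = [t for t in re.findall(r"[a-z]+", place_name.lower())
              if t not in NOISE_WORDS]
    return {jellyfish.metaphone(t) for t in tokens}

def match_score(query, candidate):
    """Fraction of the query's phonetic codes found in the candidate."""
    q, c = phonetic_keys(query), phonetic_keys(candidate)
    return len(q & c) / len(q) if q else 0.0

print(match_score("Stratford upon Avon", "Stratford-on-Avon"))  # 1.0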
So my question(s) are:
1: Am I approaching this problem in the right way? Yes, I understand it will be quite expensive; however (without going too deeply into the implementation), this information will be coming from a memcache database.
2: Are there any algorithms out there that already specialise in phonetically matching multiple words? If so, could you please provide me with some information on them and, if possible, their strengths and limitations.
You may want to look into a locality-sensitive hash such as the Nilsimsa hash. I have used Nilsimsa to "hash" craigslist posts across various cities to search for duplicates (note: I'm not a CL employee, just a personal project I was working on).
Most of these methods aren't as tunable as you may want (basically you get some loosely-defined "edit distance" metric), and they're not phonetic, just character-based.
I am doing a multilingual search, and I will use Lucene as the tool to do it.
I have the translated contents already; there will be 3 or 4 languages for each document.
For indexing and search, there could be four strategies. For each document's contents:
each language is indexed in a different index/directory.
each language is indexed in a different document but in the same index.
each language is indexed in a different field but in the same document.
all the languages are indexed in the same field in one document.
But I have not tested each of these ways yet, so could anyone experienced tell me which one is the better way to do multilingual search?
Thanks!
Although the question was asked a couple of years ago, it's still a great question.
There are a couple of aspects to consider when evaluating the different solution approaches:
are language specific analyzers used at indexing time?
is the query language always known (e.g. user selectable)?
does the query language always match one of the "content" languages?
should only content matching the query language be returned?
is relevancy important?
If (1.) and (5.) apply to your project, you should not consider any strategy that (re-)uses the same field for multiple languages in the same inverted index, as the term frequencies for the various languages all get mixed up (independent of whether you index your multilingual content as one document or as multiple documents). It might be interesting to know that adding "n" language-specific fields does not result in an "n"-times larger index, but for obvious reasons it comes with some overhead.
Single Field (Strategies 2 & 4)
+ only one field to query
+ scales well for additional languages
+ can distinguish/filter languages (if multiple documents, and extra language field)
- cannot distinguish/filter languages (if single document)
- cannot just display the queried language (if single document)
- "wrong" term frequencies (as all languages mixed up)
Multiple Fields (Strategy 3)
+ correct term frequencies
+ can easily restrict/filter queries for particular language(s)
+ facilitates Auto-Complete & Spellcheck / Did-You-Mean
- more fields to index
- more fields to query
Multiple Indices (Strategy 1)
+ correct term frequencies
+ can easily restrict/filter queries for particular language(s)
+ facilitates Auto-Complete & Spellcheck / Did-You-Mean
- each additional language requires its own index
Independent of a single- or multiple-fields approach, your solution might need to handle result collapsing for matches in the "wrong" language if you index your content as multiple documents. One approach could be to add a language field and filter on it.
Recommendation: The approach/strategy you choose depends on the project's requirements. Whenever possible I would opt for a multiple-fields or multiple-indices approach.
In short, it depends on your needs, but I would go with option 3 or 1.
1) would probably be the best way if there is no overlap / shared fields between the languages at all.
3) would be the way to go if there are several fields that need to be shared across languages, as this saves disk space and allows a larger part of the index to fit in the file system cache.
I would not recommend 2): this makes your search queries more complex and forces Lucene to consider more documents.
4) will make your search queries very complex, unless you want users to be able to search in any language without selecting it first.