How to compare String values in different languages in Java? - nlp

In my web application I am using two different languages, namely English and Arabic.
I have a search box in my web application: if we search by a name or part of a name, it retrieves the values from the DB after comparing the "Hometown" of the user.
Explanation:
For example, if a user belongs to the hometown "California" and he searches for a name, say "Victor", then my query first looks at the people who have the same hometown "California", and within that list it searches for the *name* "Victor", retrieving the users who have "California" as their hometown and "Victor" in their name or part of their name.
The problem is that if the hometown "California" is saved in English, the comparison works and the values are retrieved. But "California" may be saved as "كاليفورنيا" in Arabic. In that case the hometown comparison fails and nothing is retrieved.
I want my query to recognize that both are the same hometown and retrieve the values. Is that possible?
What alternative should I consider for this comparison logic? I am confused. Any suggestions, please?
EDIT:
*I have an idea: once the hometown is known, would it be possible to use Google Translate or a transliterator to convert the hometown to the other language (if it is in English then to Arabic, or if it is in Arabic then to English) and give the search results joining both? Any suggestion?*

The problem you encounter is that you want/need information in two or more languages and you want the user of your application to be able to use both languages. One possible approach is to keep multiple records per item, including a language code as part of the primary key. For instance, if your record is
id hometown name
001 California Victor
you could introduce a language code and store
id lang hometown name
001 en California Victor
001 ar كاليفورنيا Victor
then your search would match either "California" or "كاليفورنيا", giving you the id 001, which you can then use to load all translations of your data (or just the data in the current output language). This scheme can be used with any number of languages and has the added advantage that you don't need to prefill the table. You can add new translations for records as they become known.
(Caveat: I just repeated your Arabic string, I can't read it; also, I'm not sure 'ar' is the correct language code for Arabic, but you get the idea.)
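A minimal sketch of how the lookup could then work from Java, assuming a users table plus a hometown_translations table along the lines above (the table and column names are just placeholders, not from the question):

import java.sql.*;
import java.util.*;

public class HometownSearch {
    // Find users whose hometown matches the query in any language
    // and whose name contains the (partial) name that was typed in.
    public static List<String> findUsers(Connection conn, String hometown, String namePart)
            throws SQLException {
        String sql = "SELECT u.name FROM users u "
                   + "JOIN hometown_translations t ON t.hometown_id = u.hometown_id "
                   + "WHERE t.hometown = ? AND u.name LIKE ?";
        List<String> result = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, hometown);             // matches "California" or "كاليفورنيا"
            ps.setString(2, "%" + namePart + "%"); // partial name match
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    result.add(rs.getString(1));
                }
            }
        }
        return result;
    }
}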

Does the Arabic sound like "California"? If so, you will need to compare on a "sounds-like" basis, which will most likely require a phoneme conversion.

Transliterate all names into the same language (e.g. English) for searching, and use Levenshtein edit distance to compute the similarity between the phonetic representations of the names. This will be slow if you simply compare your query with every name, but if you pre-index all of the place names in your database into a Burkhard-Keller tree, they can be searched efficiently by edit distance from the query term.
This technique allows you to sort names by how closely they actually match. You're probably more likely to find a match this way than by using Metaphone or Double Metaphone, though it is more difficult to implement.
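For reference, a plain dynamic-programming Levenshtein distance in Java looks something like this (a BK-tree would call a function like this as its distance metric instead of comparing the query against every stored name):

public class Levenshtein {
    // Classic two-row dynamic-programming edit distance.
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}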

Your Google suggestion sounds like it might also be a good one, but you should play around with it, and be sure that you're happy with its accuracy. In testing how it worked going between Hebrew and English, I noticed that sometimes Google just leaves English place names in English letters when translating to Hebrew.

How about using some localization on the client side to display values? Or create a wrapper class for hometown that overrides equals(Object) in such a way that the instance for California returns true for both "California" and "كاليفورنيا" (sorry if I made a mistake here, I just copy-pasted from above).
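A rough sketch of that wrapper idea (the class and its alias set are hypothetical; note that equality based on overlapping alias sets is not strictly transitive, so in practice you might prefer a matches() helper over equals()):

import java.util.*;

public class Hometown {
    // All known spellings of the same place, e.g. "California" and "كاليفورنيا".
    private final Set<String> aliases = new HashSet<>();

    public Hometown(String... names) {
        for (String n : names) {
            aliases.add(n.trim().toLowerCase(Locale.ROOT));
        }
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Hometown)) return false;
        // Two hometowns are considered equal if they share at least one spelling.
        return !Collections.disjoint(aliases, ((Hometown) o).aliases);
    }

    @Override
    public int hashCode() {
        return 0; // every instance collides; fine for a demo, not for real hash-based maps
    }
}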

This sounds like a classic encoding problem. Whenever you transfer non-ASCII characters you need to make sure you're encoding them right. For Arabic and English I suspect you can use UTF-8 (but I don't know Arabic, so I may be wrong).
In your setup you will probably have the following points:
Browser <-> Servlet container <-> Database
                  |
              System.out
At any of the system interfaces where chars (16-bit) are converted to bytes (8-bit), you will need to make sure the encoding is correct.
Browser to Servlet container
When you do GET or POST requests from a web page, the browser will look at the HTTP headers from the server, especially Content-Type: text/html; charset=UTF-8, which, if present, will override the HTML meta header <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">.
On the servlet container side, HttpServletRequest.getParameter() uses an encoding that you most likely need to set in the server settings.
Example from Tomcat's server.xml:
<Connector port="8080" protocol="HTTP/1.1" URIEncoding="UTF-8"
maxThreads="2000"
connectionTimeout="20000"
redirectPort="8443" />
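Note that URIEncoding only covers the URI/query string; for POST bodies you typically also need to set the request encoding before the first getParameter() call, for example with a small filter (a sketch using the classic javax.servlet API):

import java.io.IOException;
import javax.servlet.*;

public class Utf8EncodingFilter implements Filter {
    // Forces UTF-8 on every request/response before any parameter is read.
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        req.setCharacterEncoding("UTF-8");
        res.setCharacterEncoding("UTF-8");
        chain.doFilter(req, res);
    }

    public void init(FilterConfig config) {
    }

    public void destroy() {
    }
}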
Servlet container to Database
The database needs to have the correct encodings, or sorting etc will not be right.
Example my.cnf for MySQL
[mysqld]
....
init_connect='SET collation_connection = utf8_general_ci'
init_connect='SET NAMES utf8'
default-character-set=utf8
character-set-server = utf8
collation-server = utf8_general_ci
[mysql]
....
default-character-set=utf8
Then the JDBC driver needs to be set to UTF-8.
Example JDBC connect string:
jdbc:mysql://localhost:3306/rimario?useUnicode=true&characterEncoding=utf-8
System.out
System.out.println() cannot be relied upon to verify things. First, it depends on the Java VM's default encoding, set using the system property -Dfile.encoding=UTF-8; secondly, the terminal in which you do the System.out needs to be set to and support UTF-8. Don't trust System.out!
Once a String in the VM holds the proper characters, it is not affected by encoding. In memory every char in a String is 16 bits, which (almost) covers all the characters that UTF-8 can encode. You can write the String to a file and inspect the file to know for certain whether your VM has the correct characters.
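For example, a quick check could write the suspect string to a file with an explicit charset and then inspect the bytes (the file name and string here are placeholders):

import java.io.*;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {
    public static void main(String[] args) throws IOException {
        String hometown = "كاليفورنيا"; // the Arabic string from the question
        // Write with an explicit charset rather than the platform default,
        // then open check.txt in a UTF-8-aware editor or hex viewer.
        try (Writer w = new OutputStreamWriter(
                new FileOutputStream("check.txt"), StandardCharsets.UTF_8)) {
            w.write(hometown);
        }
    }
}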

Related

Remove part of a string in each row of a large column of data in KNIME

I am stumped.
I have a column with some thousand rows of unique addresses of universities, pharma companies, etc. in a KNIME workflow.
Example:
55 Shattuck Street Boston Massachusetts 02115 US [NAT: US RES: US] for all designated states
What I need is to clean the data, so that each row looks nice and computable like this:
55 Shattuck Street Boston Massachusetts 02115 US.
My problem is I can't seem to get the system to remove everything after US. Does anyone know a suitable approach in KNIME?
You should be able to use either String Replacer or String Manipulation for this. The first one lets you use either a simple wildcard or a full regular expression pattern while the second one uses a Java-like syntax - the choice comes down to how many different variations on the input data you need to handle and which syntax you prefer.
If you just need to remove any text between square brackets, including the space before the opening bracket, then you can use String Replacer configured with a pattern that matches that bracketed block and everything after it.
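If you go the String Manipulation route, the underlying operation is just a regex replacement; in plain Java it would look roughly like this (the pattern is my own suggestion, not from the node documentation):

// Strip the space, the bracketed block and everything after it.
String row = "55 Shattuck Street Boston Massachusetts 02115 US [NAT: US RES: US] for all designated states";
String cleaned = row.replaceAll("\\s*\\[.*$", "");
// cleaned is now "55 Shattuck Street Boston Massachusetts 02115 US"

In the String Manipulation node itself, the equivalent expression would be something like regexReplace($yourcolumn$, "\\s*\\[.*$", ""), where $yourcolumn$ is a placeholder for your address column.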
Besides the nodes already mentioned by nekomatic, which will work perfectly for the given scenario, there's also a user-friendly regular expression tool in the Palladian nodes extension called Regex Extractor, which allows you to build your regexes with a live preview, as you might know from popular online regex testers.
For your scenario, you could e.g. set up a regex like this:
^(?<address>.*)(?:\s\[.*)
In prose, this means: capture all characters up to a space followed by an opening square bracket, and output them into a column named address.
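The same pattern works with plain Java named groups, if you want to test it outside KNIME:

import java.util.regex.*;

public class AddressRegexDemo {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("^(?<address>.*)(?:\\s\\[.*)");
        Matcher m = p.matcher(
            "55 Shattuck Street Boston Massachusetts 02115 US [NAT: US RES: US] for all designated states");
        if (m.find()) {
            // prints: 55 Shattuck Street Boston Massachusetts 02115 US
            System.out.println(m.group("address"));
        }
    }
}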
The Palladian extension is available here as a free plugin for KNIME Desktop and provides a variety of different tools for web, text, and geo data mining and classification.

textual syntax for domain models

We have domain models described in some XML format. Given the domain models, I want to generate tooling that helps the testers/domain experts express data in text (and a domain-specific test framework later). IDE support is mandatory (IDEA or Eclipse).
Say I have this pseudo model:
User
fn string 120 chars mandatory
ln string 120 chars mandatory
address not-mandatory
Address
street mandatory
city mandatory
A typical usage scenario:
user opens the IDE
creates a new file
when content assist is invoked, it should give options 'user', 'address', etc.
If I choose user, further Ctrl+Space should give 'fn', 'ln', 'address' as options.
I know this can be done by Xtext or JetBrains MPS, etc. But I want to understand which technology lends itself to the following requirements.
The models are fed to the system at run time (new, updates, deletes, etc.),
so I cannot have a static set of grammars. How can I structure it so that the model/property assist is resolved at run time, or at least the grammar (or part of it) is generated?
When I am working with one set of 'grammars', if I point my target server to a different version (which may have a different set of models), I want the editor to validate my existing files and flag errors.
I get the data files in XML, text or via server lookups.
It is very important for me to transform the models into some other format or interpret them in Java/Groovy.
For example, I may have the following data file:
user {
fn : Tom
ln : Jill
hobby : movies
}
But when I validate this file against a server which does not know the 'hobby' property, I want the editor to mark an error on that property.
I have plans to add much more functionality to this DSL/toolkit.
Any hints on which technology is more suitable?
Thanks.
I know this can be done by Xtext or JetBrains MPS, etc. But I want to understand which technology lends itself to the following requirements.
I think Xtext is good for your requirements under the condition that you have (or can create) an XML schema for your XML domain models.
The models are fed to the system at run time (new, updates, deletes, etc.), so I cannot have a static set of grammars. How can I structure it so that the model/property assist is resolved at run time, or at least the grammar (or part of it) is generated?
If I understand you correctly, you don't really need specific grammar rules for each XML data model but only cross references to the data model.
EMF has support for generating EMF Java classes from XSD files and Xtext can reference XML files conforming to the XSD schema if you add them to the Xtext index using your custom indexer (Xtext interface IDefaultResourceDescriptionStrategy).
So you can create a normal Xtext project with grammar etc. for your DSL and use cross references that refer to your XML domain model.
When I am working with one set of 'grammars', if I point my target server to a different version (which may have a different set of models), I want the editor to validate my existing files and flag errors.
I get the data files in XML, text or via server lookups.
EMF uses URIs to identify resources so if you generate an Ecore model like I described, it should be possible to import the XML domain models using http:// or file:// (or whatever, it's extensible) URIs, or something that you internally resolve to URIs.
It is very important for me to transform the models into some other format or interpret them in Java/Groovy.
Here you have the choice between making an interpreter, an Xbase inferrer or a generator (each of which can be implemented well using Xtend), depending on your requirements.
(Disclaimer: I am an employee at itemis, which is one of the main contributors to Xtext)

Possible security risks in localization messages

If a web application allows users to contribute translation messages in order to localize the application to a given language or locale, what are the potential security risks involved? [Apart from social engineering, which is an obvious one.]
These translation messages are usually a collection of key-value pairs in some kind of format depending on the language/library etc. For example, PHP array files as in many OSS PHP applications, gettext .po files for apps using gettext, YAML files in Rails, and many others.
Such translation data is then used to provide a new locale in the list of locales available for a site.
As soon as you relinquish control of the content, you are effectively allowing any "authorized" content provider to add whatever they want to your UI. Even if you prevent execution of potential code included in the content, you cannot prevent display of inappropriate text (or images) to users unless you screen that text at its entry point into your system.
One way to address this is via service contracts with the content providers that specify their obligations for content verification. Depending on who the providers are, this may be enough to make you comfortable with relinquishing control. Otherwise, there's pretty much no substitute for a human within the application owner's organization approving all submitted content before it is published.
To be honest this is kind of a strange question. I will assume that you have read and understand the OWASP top 10. I assume you know how to protect your own server from attack.
That being said, in my mind the most obvious attack against this translation system is persistent XSS, which would allow an attacker to deface every website using this dataset. Just saying "oh, we HTML-encode the values" isn't enough. If you are supplying these data sets to a third party, you can't expect all of them to sanitize the data properly. To make matters worse, XSS is an output problem: you can't HTML-encode the entire data set and expect it to be 100% safe, because you have no idea how the data is going to be used within the HTML document. The data may end up within a script tag or event handler, and then the protection from HTML encoding could be nullified entirely. I always chuckle when I see someone using strip_tags() to try and stop XSS; this is just the wrong approach.
In summation, there really isn't a 100% solution to the problem, but this will prevent most XSS:
$var=htmlspecialchars($var,ENT_QUOTES,"UTF-8");
$var=rtrim($var,"\\");
Obviously the rtrim() is used to help prevent XSS within a script tag. If the string ends with a backslash you can break out of a quoted string; backslashes are just as dangerous as quote marks.
I think it's safe to say that the HTML elements in the "new" (translated) string can only be those that were in the old string, with only a few specific attributes, such as title and alt, allowed to differ.
Example:
English string: <strong title="Just a test">Hover this message</strong>
Dutch translation: <strong title="Gewoon een test">Hang hier met de muis boven</strong> - will be marked as safe
Dutch translation: <strong onmouseover="window.location='something';">Hang hier met de muis boven</strong> - will be invalidated by the filter
You would have to write a rather strict filter though, and always verify that no attributes and no HTML elements were added or removed. Also, always be careful with " and '.
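A rough sketch of such a filter in Java (the class, regexes and rules here are my own illustration; a real implementation should use a proper HTML parser/whitelist library rather than regexes):

import java.util.*;
import java.util.regex.*;

public class TranslationHtmlFilter {
    private static final Pattern TAG = Pattern.compile("<\\s*([a-zA-Z][a-zA-Z0-9]*)");
    private static final Pattern EVENT_ATTR =
            Pattern.compile("\\bon\\w+\\s*=", Pattern.CASE_INSENSITIVE);

    // The translation may only use tags that already appear in the source string,
    // and no event-handler attributes (onmouseover=, onclick=, ...) at all.
    public static boolean isAcceptable(String source, String translation) {
        if (EVENT_ATTR.matcher(translation).find()) {
            return false;
        }
        return tagNames(source).containsAll(tagNames(translation));
    }

    private static Set<String> tagNames(String html) {
        Set<String> names = new HashSet<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            names.add(m.group(1).toLowerCase(Locale.ROOT));
        }
        return names;
    }
}

For the example strings above, the first Dutch translation would pass and the second one (with onmouseover) would be rejected.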

where can I find SVG or GeoJSON of countries that uses a 2 or 3 digit country code?

I was going to use the Polymaps.org library (combined with Protovis) to create a nice vector-based world map. However, their example (http://polymaps.org/ex/world.html) uses a GeoJSON from Thematic Mapping, but the countries are coded by name instead of by their 2-digit country codes.
When I pair up my data, I have problems with things like "Russia" vs "Republic of Russia". Does anybody know of a GeoJSON file for countries that uses the ISO 2- or 3-digit codes? It seems crazy to use the names.
Any other SVG type file would be useful too. I could create one, but I feel like it must exist out there and I just don't know how to find it.
It's not exactly what you want since it's not in GeoJSON format:
http://vis.stanford.edu/protovis/ex/countries.js
Maybe this is what you want? World Countries Information and IP geocoding RESTful web services API
Happy coding :-)

Do you HtmlEncode during input or output?

When do you call Microsoft.Security.Application.AntiXss.HtmlEncode? Do you do it when the user submits the information or do you do when you're displaying the information?
How about for basic stuff like First Name, Last Name, City, State, Zip?
You do it when you are displaying the information. Preserve the original as it was entered, convert it for display on a web page. Let's say you were displaying it in some other way, like exporting it into Excel. In that case, you'd want to export the preserved original.
Encode every single string.
You should only encode or escape your data at the last possible moment, whether that's directly before you put it in the database or when you display it on the screen. If you encode too soon, you run the risk of accidentally double-encoding (you'll often see &amp; on newbies' websites - myself included).
If you do want to encode sooner than that, then take measures to avoid the double encoding. Joel wrote an article about good uses for Hungarian notation, where he advocated using prefixes to indicate what is stored in the variable, e.g. "us" for unsafe string, "ss" for safe string.
usFirstName = getUserInput('firstName')
ssFirstName = cleanString(usFirstName);
Also note that it doesn't matter what the type of information is (city, zip code, etc) - leaving any of these unchecked is asking for trouble.
It depends on your situation. Where I work, the company did no HTML encoding for years, so when we started doing it, it would have been almost impossible to find every location within the system where user input could be displayed on the page.
Instead we chose to sanitize input on its way into the system, since there were fewer input points than output points. We sanitize immediately before inserting data into the DB. We don't use Microsoft's AntiXss library; we use a set of homebrew methods that whitelist ranges of HTML tags and characters depending on the type of input.
If you're designing the system from scratch, or you have a system that is small (or managed well) enough to encode output, follow Corey's suggestion. It's definitely the better way to do it.
Encoding is not a property of the data; it is a property of the transport mechanism. Therefore you should decode data when you receive it, and encode it appropriately before transmission. The transport mechanism determines what sort of encoding is necessary.
This principle holds true whether your transport mechanism is HTML, HTTP, smoke signals, etc. The trick is knowing how to do the various types of encoding manually, and when frameworks do the steps for you automagically. For instance, ASP.NET will encode text assigned to a System.Web.UI.WebControls.Button's Text, but not text assigned to a System.Web.UI.WebControls.Literal's Text. jQuery will encode content you set with .text(), but not content you set with .html().
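To illustrate the "encode at the output boundary" idea in Java terms (this is a hand-rolled sketch for illustration, not the AntiXss API): store the raw original, and run it through something like this only when it is written into an HTML context.

public final class HtmlOut {
    // Minimal HTML escaping for element content and quoted attribute values.
    public static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&#39;");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }
}

A different output target (an Excel export, JSON, a script block) would need its own encoding, which is exactly why the raw value is what you want to store.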
