LDAP - query phone numbers, excluding space characters - search

Suppose I had a SQL Server table of phone numbers and wanted to search for numbers containing "345", ignoring any spaces.
In SQL Server this is easy to do: you can wrap the column in REPLACE() to strip the spaces before doing the LIKE comparison.
My problem is, I want to do this in an LDAP query, and LDAP queries don't recognize the REPLACE function.
The phone numbers in our Active Directory often contain spaces (or other characters), but I can't work out how to write an LDAP query to simulate the SQL REPLACE() function.
So, right now, my query can search for, say, "345" within all phone numbers, but these three characters must appear in AD without any spaces between them.
Select sn,givenName,mobile,telephoneNumber
From 'LDAP://OurServerName/DC=global,DC=OurCompany,DC=net'
Where (mobile='*345*' OR telephoneNumber ='*345*' )
Does anyone know how to get LDAP to search while ignoring spaces?
I have found similar questions which suggest LDAP just copes with this out of the box... but from what I've seen, it doesn't...
(And yes, we want to avoid running a script against Active Directory to remove all spaces from all phone numbers.)
By the way, our company has fewer than 1,500 employees, so the performance of the query shouldn't be an issue.

Well, it is what it is.
telephoneNumber and mobile use the Telephone Number syntax (1.3.6.1.4.1.1466.115.121.1.50), and LDAP implementations SHOULD follow RFC 4517 and use telephoneNumberMatch and telephoneNumberSubstringsMatch, both of which implement "insignificant character handling" (i.e. ignoring spaces, "-", etc.) in accordance with E.123.
As far as I know, Active Directory does not honour this, so you will need to run a script or an LDIF export and parse the numbers yourself to obtain the desired results.
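If it comes to that, here is a minimal sketch of the script approach in Java/JNDI. It pulls every entry that has a phone number, strips spaces, dashes and parentheses on the client side, and does the "345" match in memory. The server name and base DN are taken from the question; the credentials are placeholders.
import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.directory.*;

public class PhoneSearch {
    public static void main(String[] args) throws Exception {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://OurServerName:389");            // server from the question
        env.put(Context.SECURITY_AUTHENTICATION, "simple");
        env.put(Context.SECURITY_PRINCIPAL, "user@global.OurCompany.net");    // placeholder credentials
        env.put(Context.SECURITY_CREDENTIALS, "password");

        DirContext ctx = new InitialDirContext(env);
        SearchControls sc = new SearchControls();
        sc.setSearchScope(SearchControls.SUBTREE_SCOPE);
        sc.setReturningAttributes(new String[] {"sn", "givenName", "mobile", "telephoneNumber"});

        // Fetch every entry that has a phone number, then normalise and match client-side.
        String filter = "(|(telephoneNumber=*)(mobile=*))";
        NamingEnumeration<SearchResult> results =
                ctx.search("DC=global,DC=OurCompany,DC=net", filter, sc);

        while (results.hasMore()) {
            SearchResult r = results.next();
            Attributes attrs = r.getAttributes();
            for (String attrName : new String[] {"telephoneNumber", "mobile"}) {
                Attribute a = attrs.get(attrName);
                if (a == null) continue;
                String raw = a.get().toString();
                String normalised = raw.replaceAll("[\\s\\-()]", ""); // drop spaces, dashes, parentheses
                if (normalised.contains("345")) {
                    System.out.println(r.getNameInNamespace() + " " + attrName + "=" + raw);
                }
            }
        }
        ctx.close();
    }
}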

Related

Remove part of a string in each row of a large column of data in KNIME

I am stumped.
I have a column with a few thousand rows of unique addresses for universities, pharma companies, etc. in a KNIME workflow.
Example:
55 Shattuck Street Boston Massachusetts 02115 US [NAT: US RES: US] for all designated states
What I need is to clean the data so each row looks nice and computable, like this:
55 Shattuck Street Boston Massachusetts 02115 US
My problem is I can't seem to get the system to remove everything after "US". Does anyone know a suitable approach in KNIME?
You should be able to use either String Replacer or String Manipulation for this. The first one lets you use either a simple wildcard or a full regular expression pattern while the second one uses a Java-like syntax - the choice comes down to how many different variations on the input data you need to handle and which syntax you prefer.
If you just need to remove any text between square brackets, including the space before the opening bracket, then you can use the String Replacer node with a pattern that matches that trailing bracketed part and an empty replacement text.
Besides the nodes already mentioned by nekomatic, which will work perfectly for the given scenario, there's also a user-friendly regular expression tool in the Palladian nodes extension called Regex Extractor, which lets you build your regexes with a live preview, as you might know from popular online regex testers.
For your scenario, you could e.g. set up a regex like this:
^(?<address>.*)(?:\s\[.*)
In prose, this means: capture all characters up to a space followed by an opening square bracket, and output them into a column named address.
The Palladian extension is available here as a free plugin for KNIME Desktop and provides a variety of different tools for web, text, and geo data mining and classification.
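If you want to sanity-check that pattern outside KNIME first, a quick throwaway Java test against the example row from the question could look like this:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AddressRegexTest {
    public static void main(String[] args) {
        String row = "55 Shattuck Street Boston Massachusetts 02115 US [NAT: US RES: US] for all designated states";
        // Same pattern as above: capture everything before " [" into the "address" group.
        Pattern p = Pattern.compile("^(?<address>.*)(?:\\s\\[.*)");
        Matcher m = p.matcher(row);
        if (m.matches()) {
            System.out.println(m.group("address")); // 55 Shattuck Street Boston Massachusetts 02115 US
        }
    }
}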

Solr custom wildcard

I am pretty new to Solr and I am looking for a way to port the search features of my web application, which currently uses a regular database, over to Solr indexes. My problem so far is that I have to customize the wildcard behaviour: for example, "?" should mean "0 or 1 characters" rather than exactly one arbitrary character as it does now, "+" should mean any whitespace, "#" should be any digit, and so on. Any good pointers?
Thanks!
There is no simple answer that I know of, I am afraid.
For "0 or 1 characters" you can replace the original query with an 'OR' query, e.g. mp? in your DB search use case becomes 'mp OR mp?' in Solr.
Whitespace is tokenized by default in the text field type, so you can look at using a whitespace tokenizer as part of your custom 'text' field. There are several examples; text_ws in the sample schema does only whitespace tokenizing. You'd want to read up on tokenizers.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
There is no digit equivalent - you can do term1* OR term2* OR term3* ... etc. You can also use function queries that support numerical functions. http://wiki.apache.org/solr/FunctionQuery
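To make the "0 or 1 characters" rewrite above concrete, here is a tiny hypothetical helper (plain Java, no Solr dependency) that expands a trailing '?' into the 'term OR term?' form before the query is sent to Solr:
public class WildcardRewriter {
    // Rewrites a trailing '?' (meant as "0 or 1 characters") into the
    // "term OR term?" form, since Solr's '?' matches exactly one character.
    public static String zeroOrOne(String userPattern) {
        if (userPattern.endsWith("?")) {
            String base = userPattern.substring(0, userPattern.length() - 1);
            return base + " OR " + userPattern;
        }
        return userPattern;
    }

    public static void main(String[] args) {
        System.out.println(zeroOrOne("mp?")); // prints: mp OR mp?
    }
}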
It looks like the best choice in this case is to use regular expressions in the search. More details can be found here: http://1opensourcelover.wordpress.com/2013/09/29/solr-regex-tutorial/
It's not exactly what I was looking for, as I will have to build my own Solr query on the back end, and I have a feeling that heavy use of regular expressions will create a little more overhead on my server. For the tests I did, though, it looks pretty fast.
I will leave the question open for a while; maybe someone can come up with a better answer.
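For reference, a regex query can be sent from Java via SolrJ roughly as below; the core URL, the field name "name" and the SolrJ 6+ client builder are assumptions, so adjust them to your setup:
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RegexSearch {
    public static void main(String[] args) throws Exception {
        // Assumed core URL; change to your Solr instance.
        SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
        // Regex query syntax: the pattern goes between slashes and matches whole terms,
        // so "mp.?" plays the role of "mp" followed by 0 or 1 characters.
        SolrQuery query = new SolrQuery("name:/mp.?/");
        QueryResponse response = client.query(query);
        System.out.println(response.getResults().getNumFound() + " matches");
        client.close();
    }
}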

Why restrict URI characters in the config file?

I am using clean URLs for search. If the user types a single quote, it says "disallowed URI character". I know how to enable a character so it can appear in the URL; what I want to know is which security vulnerabilities arise from allowing certain characters like braces, quotes and others.
An explanation or external references would both be welcome.
I am assuming you are talking about the "query string" part of the URL. If that is so, then your framework is probably disallowing those characters to prevent SQL injection style attacks: your code may end up using those query string values to construct a SQL query and, boom, your application is SQL injected.
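To illustrate why those characters are risky only when they reach the SQL layer unescaped, here is a hypothetical Java/JDBC comparison (the table and column names are made up); with the parameterized version it does not matter which characters the URL allows:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class SearchDao {
    // Vulnerable: a quote in 'term' ends the string literal and lets the
    // user inject arbitrary SQL, e.g. term = "x' OR '1'='1".
    ResultSet unsafeSearch(Connection con, String term) throws Exception {
        Statement st = con.createStatement();
        return st.executeQuery("SELECT * FROM items WHERE name = '" + term + "'");
    }

    // Safe: the driver sends the value separately from the SQL text,
    // so quotes and braces are just data, not SQL.
    ResultSet safeSearch(Connection con, String term) throws Exception {
        PreparedStatement ps = con.prepareStatement("SELECT * FROM items WHERE name = ?");
        ps.setString(1, term);
        return ps.executeQuery();
    }
}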

How to avoid Cross site scripting in ASP.NET

I have an ASP.NET form which has a textbox. The user can enter any characters, numbers, etc. I should not restrict the user by filtering out hazardous characters, but I do need to prevent cross-site scripting.
The user can enter any text, for example alert('hi') inside a script tag.
The data should be saved in the DB exactly as entered. It should also be returned and displayed as-is in a label on the form.
How can I achieve this without opening up cross-site scripting?
Well, I think you should consider some restriction on what users are allowed to enter. You don't want null bytes or non-printable characters, do you? Even if you accept more than alphanumeric values, you should decide which characters are allowed and exclude the rest using a simple regular expression (with start and end anchors, of course).
Then, the way to prevent XSS is to encode the value whenever you display it. There are a whole host of ways to do this, but using the AntiXSS class of the Microsoft Web Protection Library is the best if you ask me. You can encode the output based on whether you're rendering it within HTML elements, attributes, JavaScript, and so on.
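The answer above recommends the .NET AntiXSS library; purely to illustrate the encode-on-output idea in the Java used elsewhere on this page, a sketch with Apache Commons Text would look like this (store the raw value, encode only at render time):
import org.apache.commons.text.StringEscapeUtils;

public class OutputEncodingDemo {
    public static void main(String[] args) {
        // Stored as-is in the DB, exactly what the user typed.
        String stored = "<script>alert('hi')</script>";
        // Encoded only when rendered into HTML, so the browser shows the
        // text instead of executing it.
        String rendered = StringEscapeUtils.escapeHtml4(stored);
        System.out.println(rendered); // &lt;script&gt;alert('hi')&lt;/script&gt;
    }
}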

How to compare different language String values in JAVA?

In my web application I am using two different languages, namely English and Arabic.
I have a search box: if we search by a name or part of a name, it retrieves the values from the DB by comparing the "hometown" of the user.
Explanation:
For example, if a user belongs to the hometown "California" and searches for the name "Victor", my query will first find the people who have "California" as their hometown, and then search for "Victor" within that list, retrieving the users who have "California" as their hometown and "Victor" in their name or part of their name.
The problem is that if the hometown "California" is saved in English, the comparison works and the values are retrieved. But "California" may be saved as "كاليفورنيا" in Arabic. In that case the hometown comparison fails and it can't retrieve the values.
I want my query to recognise that both are the same hometown and retrieve the values. Is that possible?
What alternative should I consider for this comparison logic? I am confused. Any suggestions please?
EDIT:
I have an idea: once the hometown is obtained, would it be possible to use Google Translate (or a transliterator) to convert the hometown to the other language (if it is in English then to Arabic, or if it is in Arabic then to English) and return the search results from joining both? Any suggestions?
The problem you encounter is that you want or need information in two or more languages, and you want the user of your application to be able to use both languages. One possible approach is to keep multiple records per item and include a language code as part of the primary key. For instance, if your record is
id hometown name
001 California Victor
you could introduce a language code and store
id lang hometown name
001 en California Victor
001 ar كاليفورنيا Victor
then your search would match either "California" or "كاليفورنيا", giving you the id 001, which you can then use to load all translations of your data (or just the data in the current output language). This scheme can be used with any number of languages and has the added advantage that you don't need to prefill the table: you can add new translations for records as they become known.
(Caveat: I just repeated your Arabic string, I can't read it; also, I haven't checked that 'ar' is the right language code for Arabic, but you get the idea.)
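A minimal JDBC sketch of that lookup, assuming a hypothetical table user_hometown(id, lang, hometown, name) laid out like the records above:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class HometownSearch {
    // Finds ids whose hometown matches the search text in any language,
    // and whose name contains the searched name.
    static void search(Connection con, String hometown, String name) throws Exception {
        String sql = "SELECT DISTINCT id FROM user_hometown "
                   + "WHERE hometown = ? AND name LIKE ?";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, hometown);           // "California" or "كاليفورنيا"
            ps.setString(2, "%" + name + "%");   // e.g. "%Victor%"
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    String id = rs.getString("id");
                    // load all translations for this id, or only the current UI language
                    System.out.println("matched id: " + id);
                }
            }
        }
    }
}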
Does the Arabic sound like "California"? If so, you will need to compare on a "sounds-like" basis, which will most likely require a phoneme conversion.
Transliterate all names into the same language (e.g. English) for searching, and use Levenshtein edit distance to compute the similarity between the phonetic representations of the names. This will be slow if you simply compare your query with every name, but if you pre-index all of the place names in your database into a Burkhard-Keller tree, then they can be efficiently searched by edit distance from the query term.
This technique allows you to sort names by how close they actually match. You're probably more likely to find a match this way than using metaphone or double-metaphone, though this is more difficult to implement.
Your Google suggestion sounds like it might also be a good one, but you should play around with it, and be sure that you're happy with its accuracy. In testing how it worked going between Hebrew and English, I noticed that sometimes Google just leaves English place names in English letters when translating to Hebrew.
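If you implement the edit-distance step mentioned above yourself, a plain (unoptimised) Levenshtein distance in Java is short enough to sketch here; the BK-tree index would then be built on top of this function:
public class Levenshtein {
    // Classic dynamic-programming edit distance between two strings.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // deletion
                                            d[i][j - 1] + 1),  // insertion
                                   d[i - 1][j - 1] + cost);    // substitution
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("california", "kalifornia")); // 1
    }
}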
How about using some localization on the client side to display the values? Or create a wrapper class for hometown that overrides equals(Object) in such a way that the instance for California returns true for both "California" and "كاليفورنيا" (sorry if I made a mistake here, I just copy-pasted from above).
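A sketch of that wrapper idea, written with a matches() helper rather than overriding equals(Object) (a spelling-set equals is hard to keep symmetric); the known spellings are placeholders that would come from your own data:
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class Hometown {
    // All known spellings of the same place, in any language.
    private final Set<String> spellings;

    Hometown(String... spellings) {
        this.spellings = new HashSet<>(Arrays.asList(spellings));
    }

    // The hometown matches if the value is any one of its known spellings.
    boolean matches(String value) {
        return spellings.contains(value);
    }

    public static void main(String[] args) {
        Hometown california = new Hometown("California", "كاليفورنيا");
        System.out.println(california.matches("كاليفورنيا")); // true
    }
}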
This sounds like a classic encoding problem. Whenever you transfer non-ASCII characters you need to make sure you're encoding them right. For Arabic and English I suspect you can use UTF-8 (but I don't know Arabic, so I may be wrong).
In your setup you will probably have the following points:
Browser <-> Servlet container <-> Database
                   |
              System.out
At any of the system interfaces where chars (16-bit) are converted to bytes (8-bit), you will need to make sure the encoding is correct.
Browser to Servlet container
When you do GET or POST requests from a web page, the browser will look at the HTTP headers from the server, especially Content-Type: text/html; charset=UTF-8, which, if present, overrides the HTML meta header <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">.
On the servlet container side, HttpServletRequest.getParameter() will use an encoding that you most likely need to set in the server settings.
Example tomcat's server.xml
<Connector port="8080" protocol="HTTP/1.1" URIEncoding="UTF-8"
maxThreads="2000"
connectionTimeout="20000"
redirectPort="8443" />
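The URIEncoding attribute above covers parameters in the URL; for POST bodies, a common complement is to set the request encoding before the first getParameter() call, for example in a small servlet filter like this hypothetical one:
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class Utf8Filter implements Filter {
    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        // Must run before any call to request.getParameter().
        request.setCharacterEncoding("UTF-8");
        response.setCharacterEncoding("UTF-8");
        chain.doFilter(request, response);
    }

    @Override public void init(FilterConfig cfg) { }
    @Override public void destroy() { }
}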
Servlet container to Database
The database needs to have the correct encodings, or sorting etc will not be right.
Example my.cnf for MySQL
[mysqld]
....
init_connect='SET collation_connection = utf8_general_ci'
init_connect='SET NAMES utf8'
default-character-set=utf8
character-set-server = utf8
collation-server = utf8_general_ci
[mysql]
....
default-character-set=utf8
Then the JDBC-driver needs to be set for UTF-8.
Example JDBC connect string
jdbc:mysql://localhost:3306/rimario?useUnicode=true&characterEncoding=utf-8
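From Java, that connect string is simply passed to DriverManager; the user and password here are placeholders:
import java.sql.Connection;
import java.sql.DriverManager;

public class DbConnect {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://localhost:3306/rimario?useUnicode=true&characterEncoding=utf-8";
        try (Connection con = DriverManager.getConnection(url, "dbuser", "dbpass")) {
            System.out.println("connected: " + !con.isClosed());
        }
    }
}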
System.out
System.out.println() cannot be relied upon to verify things. First, it depends on the Java VM's default encoding, set with the system property -Dfile.encoding=UTF-8; second, the terminal in which you read the System.out output needs to be set to, and support, UTF-8. Don't trust System.out!
Once a String in the VM holds the proper characters, it is no longer affected by encoding. In memory every char in a String is 16 bits, which (almost) covers all the characters that UTF-8 can encode. You can write the string to a file and inspect the file to know for sure whether you got the correct characters into your VM.
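A minimal way to do that file check, writing with an explicit UTF-8 encoder so the file itself cannot be mis-encoded:
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {
    public static void main(String[] args) throws Exception {
        String value = "كاليفورنيا"; // the Arabic string from the question
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream("encoding-check.txt"), StandardCharsets.UTF_8)) {
            out.write(value);
        }
        // Open encoding-check.txt in an editor set to UTF-8; if the characters
        // are intact, the String inside the JVM was correct.
    }
}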
