How does stackoverflow display codes without compromising their security? [duplicate] - security

Is there a catchall function somewhere that works well for sanitizing user input for SQL injection and XSS attacks, while still allowing certain types of HTML tags?

It's a common misconception that user input can be filtered. PHP even has a (now deprecated) "feature", called magic-quotes, that builds on this idea. It's nonsense. Forget about filtering (or cleaning, or whatever people call it).
What you should do to avoid problems is quite simple: whenever you embed a piece of data within foreign code, you must treat it according to the formatting rules of that code. But you must understand that such rules can be too complicated to follow manually. For example, in SQL, the rules for strings, numbers and identifiers are all different. For your convenience, in most cases there is a dedicated tool for such an embedding. For example, when you need to use a PHP variable in an SQL query, you have to use a prepared statement, which will take care of all the proper formatting/treatment.
Another example is HTML: If you embed strings within HTML markup, you must escape it with htmlspecialchars. This means that every single echo or print statement should use htmlspecialchars.
A third example could be shell commands: If you are going to embed strings (such as arguments) to external commands, and call them with exec, then you must use escapeshellcmd and escapeshellarg.
Also, a very compelling example is JSON. The rules are so numerous and complicated that you would never be able to follow them all manually. That's why you should never ever create a JSON string manually, but always use a dedicated function, json_encode() that will correctly format every bit of data.
And so on and so forth ...
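The point about per-context rules can be sketched with one value formatted for three of the contexts named above. This is a minimal illustration, not a complete defence; the input string and the log file path are made up for the example, and the shell command is only built, never executed.

```php
<?php
// Hypothetical user-supplied value, used only for illustration.
$input = 'Tom & "Jerry" <script>alert(1)</script>';

// HTML context: escape markup characters and both kinds of quotes.
$html = '<p>' . htmlspecialchars($input, ENT_QUOTES, 'UTF-8') . '</p>';

// Shell context: quote the value so the shell treats it as one argument.
// The path is a placeholder and the command is not run here.
$command = 'grep -- ' . escapeshellarg($input) . ' /var/log/app.log';

// JSON context: let json_encode() build the string literal.
$json = json_encode(['comment' => $input]);

echo $html, PHP_EOL, $command, PHP_EOL, $json, PHP_EOL;
```

The same input yields three different, mutually incompatible representations, which is why no single "cleaning" step applied at input time can serve all of them.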
The only case where you need to actively filter data is if you're accepting preformatted input, for example if you let your users post HTML markup that you plan to display on the site. However, you would be wise to avoid this at all costs, since no matter how well you filter it, it will always be a potential security hole.

Do not try to prevent SQL injection by sanitizing input data.
Instead, do not allow data to be used in creating your SQL code. Use prepared statements (i.e. a template query with parameters) with bound variables. It is the only way to be guaranteed against SQL injection.
Please see my website http://bobby-tables.com/ for more about preventing SQL injection.
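A minimal sketch of a prepared statement with a bound value follows. It assumes the pdo_sqlite extension is available so that an in-memory database keeps the example self-contained; the users table and its rows are hypothetical.

```php
<?php
// Assumption: pdo_sqlite is installed; an in-memory database is used
// only so that the sketch runs without any external server.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Hypothetical table and rows, for illustration only.
$pdo->exec('CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)');
$pdo->exec("INSERT INTO users (name) VALUES ('alice'), ('bob')");

// Hostile input: concatenated into the SQL text, it would match every row.
$name = "' OR 1=1 --";

// The SQL text contains only a placeholder; the value is bound separately.
$stmt = $pdo->prepare('SELECT id, name FROM users WHERE name = ?');
$stmt->execute([$name]);
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

// A legitimate value goes through exactly the same code path.
$stmt->execute(['alice']);
$alice = $stmt->fetchAll(PDO::FETCH_ASSOC);

echo count($rows), ' row(s) for the hostile value, ', count($alice), ' for alice', PHP_EOL;
```

Because the value never becomes part of the SQL text, the hostile string is simply compared against the name column as data.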

No. You can't generically filter data without any context of what it's for. Sometimes you'd want to take a SQL query as input and sometimes you'd want to take HTML as input.
You need to filter input on a whitelist -- ensure that the data matches some specification of what you expect. Then you need to escape it before you use it, depending on the context in which you are using it.
The process of escaping data for SQL - to prevent SQL injection - is very different from the process of escaping data for (X)HTML, to prevent XSS.
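The two steps described above, validating against a specification and then escaping for the context of use, might look like the sketch below. The field names, the allowed range and the list of colours are assumptions made up for the example.

```php
<?php
// Hypothetical request values, standing in for $_GET or $_POST.
$request = ['age' => '42', 'colour' => 'green<script>'];

// Validate against a specification: an integer between 0 and 150.
$age = filter_var(
    $request['age'],
    FILTER_VALIDATE_INT,
    ['options' => ['min_range' => 0, 'max_range' => 150]]
);

// Validate against a whitelist of allowed values (strict comparison).
$allowedColours = ['red', 'green', 'blue'];
$colour = in_array($request['colour'], $allowedColours, true)
    ? $request['colour']
    : null;

if ($age === false || $colour === null) {
    // Reject the input instead of trying to repair it.
    $message = 'Invalid input';
} else {
    // Escape only at the point of use, for the context in use (HTML here).
    $message = 'Age ' . htmlspecialchars((string) $age, ENT_QUOTES, 'UTF-8')
        . ', colour ' . htmlspecialchars($colour, ENT_QUOTES, 'UTF-8');
}

echo $message, PHP_EOL;
```

Here the age passes validation but the colour does not, so the request is rejected as a whole.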

PHP has the new nice filter_input functions now, that for instance liberate you from finding 'the ultimate e-mail regex' now that there is a built-in FILTER_VALIDATE_EMAIL type
My own filter class (uses JavaScript to highlight faulty fields) can be initiated by either an ajax request or normal form post. (see the example below)
<?php
/**
* Pork Formvalidator. validates fields by regexes and can sanitize them. Uses PHP filter_var built-in functions and extra regexes
* @package pork
*/
/**
* Pork.FormValidator
* Validates arrays or properties by setting up simple arrays.
* Note that some of the regexes are for dutch input!
* Example:
*
* $validations = array('name' => 'anything','email' => 'email','alias' => 'anything','pwd'=>'anything','gsm' => 'phone','birthdate' => 'date');
* $required = array('name', 'email', 'alias', 'pwd');
* $sanitize = array('alias');
*
* $validator = new FormValidator($validations, $required, $sanitize);
*
* if($validator->validate($_POST))
* {
* $_POST = $validator->sanitize($_POST);
* // now do your saving, $_POST has been sanitized.
* die($validator->getScript()."<script type='text/javascript'>alert('saved changes');</script>");
* }
* else
* {
* die($validator->getScript());
* }
*
* To validate just one element:
* $validated = FormValidator::validateItem('blah@bla.', 'email');
*
* To sanitize just one element:
* $sanitized = FormValidator::sanitizeItem('<b>blah</b>', 'string');
*
* @package pork
* @author SchizoDuckie
* @copyright SchizoDuckie 2008
* @version 1.0
* @access public
*/
class FormValidator
{
public static $regexes = Array(
'date' => "^[0-9]{1,2}[-/][0-9]{1,2}[-/][0-9]{4}\$",
'amount' => "^[-]?[0-9]+\$",
'number' => "^[-]?[0-9,]+\$",
'alfanum' => "^[0-9a-zA-Z ,.-_\\s\?\!]+\$",
'not_empty' => "[a-z0-9A-Z]+",
'words' => "^[A-Za-z]+[A-Za-z \\s]*\$",
'phone' => "^[0-9]{10,11}\$",
'zipcode' => "^[1-9][0-9]{3}[a-zA-Z]{2}\$",
'plate' => "^([0-9a-zA-Z]{2}[-]){2}[0-9a-zA-Z]{2}\$",
'price' => "^[0-9.,]*(([.,][-])|([.,][0-9]{2}))?\$",
'2digitopt' => "^\d+(\,\d{2})?\$",
'2digitforce' => "^\d+\,\d\d\$",
'anything' => "^[\d\D]{1,}\$"
);
private $validations, $sanitations, $mandatories, $errors, $corrects, $fields;
public function __construct($validations=array(), $mandatories = array(), $sanitations = array())
{
$this->validations = $validations;
$this->sanitations = $sanitations;
$this->mandatories = $mandatories;
$this->errors = array();
$this->corrects = array();
}
/**
* Validates an array of items (if needed) and returns true or false
*
*/
public function validate($items)
{
$this->fields = $items;
$havefailures = false;
foreach($items as $key=>$val)
{
// Skip fields that are empty or have no validation rule, unless they are mandatory.
if((strlen($val) == 0 || !array_key_exists($key, $this->validations)) && array_search($key, $this->mandatories) === false)
{
$this->corrects[] = $key;
continue;
}
$result = self::validateItem($val, $this->validations[$key]);
if($result === false) {
$havefailures = true;
$this->addError($key, $this->validations[$key]);
}
else
{
$this->corrects[] = $key;
}
}
return(!$havefailures);
}
/**
*
* Adds the unvalidated class to those elements that are not validated. Removes it from those that are.
*/
public function getScript() {
// Start with an empty script so the appends below are always defined.
$output = '';
if(!empty($this->errors))
{
$errors = array();
foreach($this->errors as $key=>$val) { $errors[] = "'INPUT[name={$key}]'"; }
$output = '$$('.implode(',', $errors).').addClass("unvalidated");';
$output .= "new FormValidator().showMessage();";
}
if(!empty($this->corrects))
{
$corrects = array();
foreach($this->corrects as $key) { $corrects[] = "'INPUT[name={$key}]'"; }
$output .= '$$('.implode(',', $corrects).').removeClass("unvalidated");';
}
$output = "<script type='text/javascript'>{$output} </script>";
return($output);
}
/**
*
* Sanitizes an array of items according to the $this->sanitations
* sanitations will be standard of type string, but can also be specified.
* For ease of use, this syntax is accepted:
* $sanitations = array('fieldname', 'otherfieldname'=>'float');
*/
public function sanitize($items)
{
foreach($items as $key=>$val)
{
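if(array_search($key, $this->sanitations) === false && !array_key_exists($key, $this->sanitations)) continue;
// Use the sanitation type given for this field, defaulting to 'string'.
$type = array_key_exists($key, $this->sanitations) ? $this->sanitations[$key] : 'string';
$items[$key] = self::sanitizeItem($val, $type);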
}
return($items);
}
/**
*
* Adds an error to the errors array.
*/
private function addError($field, $type='string')
{
$this->errors[$field] = $type;
}
/**
*
* Sanitize a single var according to $type.
* Allows for static calling to allow simple sanitization
*/
public static function sanitizeItem($var, $type)
{
$flags = NULL;
switch($type)
{
case 'url':
$filter = FILTER_SANITIZE_URL;
break;
case 'int':
$filter = FILTER_SANITIZE_NUMBER_INT;
break;
case 'float':
$filter = FILTER_SANITIZE_NUMBER_FLOAT;
$flags = FILTER_FLAG_ALLOW_FRACTION | FILTER_FLAG_ALLOW_THOUSAND;
break;
case 'email':
$var = substr($var, 0, 254);
$filter = FILTER_SANITIZE_EMAIL;
break;
case 'string':
default:
// Note: FILTER_SANITIZE_STRING is deprecated as of PHP 8.1.
$filter = FILTER_SANITIZE_STRING;
$flags = FILTER_FLAG_NO_ENCODE_QUOTES;
break;
}
$output = filter_var($var, $filter, $flags);
return($output);
}
/**
*
* Validates a single var according to $type.
* Allows for static calling to allow simple validation.
*
*/
public static function validateItem($var, $type)
{
if(array_key_exists($type, self::$regexes))
{
$returnval = filter_var($var, FILTER_VALIDATE_REGEXP, array("options"=> array("regexp"=>'!'.self::$regexes[$type].'!i'))) !== false;
return($returnval);
}
$filter = false;
switch($type)
{
case 'email':
$var = substr($var, 0, 254);
$filter = FILTER_VALIDATE_EMAIL;
break;
case 'int':
$filter = FILTER_VALIDATE_INT;
break;
case 'boolean':
$filter = FILTER_VALIDATE_BOOLEAN;
break;
case 'ip':
$filter = FILTER_VALIDATE_IP;
break;
case 'url':
$filter = FILTER_VALIDATE_URL;
break;
}
// Parenthesized: unparenthesized nested ternaries are an error in PHP 8.
return ($filter === false) ? false : (filter_var($var, $filter) !== false);
}
}
Of course, keep in mind that you need to do your SQL query escaping too, depending on what type of database you are using (mysql_real_escape_string() is useless for an SQL Server, for instance). You probably want to handle this automatically at the appropriate application layer, like an ORM. Also, as mentioned above: for outputting to HTML, use the other dedicated PHP functions like htmlspecialchars ;)
For really allowing HTML input with like stripped classes and/or tags depend on one of the dedicated xss validation packages. DO NOT WRITE YOUR OWN REGEXES TO PARSE HTML!

No, there is not.
First of all, SQL injection is an input filtering problem, and XSS is an output escaping one - so you wouldn't even execute these two operations at the same time in the code lifecycle.
Basic rules of thumb
For SQL query, bind parameters
Use strip_tags() to filter out unwanted HTML
Escape all other output with htmlspecialchars() and be mindful of the 2nd and 3rd parameters here.
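A short sketch of the last two rules follows, using a made-up input string. It also shows a limitation worth knowing: strip_tags() drops tags that are not allowed, but it keeps the attributes of the tags that are.

```php
<?php
// Hypothetical user-supplied markup, for illustration.
$input = '<b onclick="alert(1)">bold</b> <i>it\'s</i>';

// strip_tags() removes tags that are not in the allow-list, but it leaves
// the attributes of allowed tags (here the onclick handler) untouched.
$stripped = strip_tags($input, '<b>');

// Second parameter: ENT_QUOTES also converts single quotes.
// Third parameter: the character encoding of the input.
$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

echo $stripped, PHP_EOL, $escaped, PHP_EOL;
```

Since the onclick attribute survives strip_tags(), the function on its own is not enough when some tags are allowed through.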

To address the XSS issue, take a look at HTML Purifier. It is fairly configurable and has a decent track record.
As for the SQL injection attacks, the solution is to use prepared statements. The PDO library and mysqli extension support these.

PHP 5.2 introduced the filter_var function.
It supports a great deal of SANITIZE, VALIDATE filters.
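As a rough illustration of the difference between the two families: VALIDATE filters return the (possibly converted) value or false, while SANITIZE filters return a modified copy. The input values below are made up for the example.

```php
<?php
// Validation: the value is returned on success, false on failure.
$email = filter_var('user@example.com', FILTER_VALIDATE_EMAIL);
$bad   = filter_var('not an email', FILTER_VALIDATE_EMAIL);
$int   = filter_var('123', FILTER_VALIDATE_INT);
$ip    = filter_var('192.168.0.1', FILTER_VALIDATE_IP);

// Sanitization: characters the filter does not allow are removed.
$digits = filter_var('abc123def', FILTER_SANITIZE_NUMBER_INT);

var_dump($email, $bad, $int, $ip, $digits);
```

Note that a sanitized value is not thereby safe for SQL or HTML; it still has to be bound or escaped at the point of use.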

Methods for sanitizing user input with PHP:
Use Modern Versions of MySQL and PHP.
Set charset explicitly:
$mysqli->set_charset("utf8");
$pdo = new PDO('mysql:host=localhost;dbname=testdb;charset=UTF8', $user, $password);
$pdo->exec("set names utf8");
$pdo = new PDO(
"mysql:host=$host;dbname=$db", $user, $pass,
array(
PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8"
)
);
mysql_set_charset('utf8') [deprecated in PHP 5.5.0, removed in PHP 7.0.0].
Use secure charsets:
Select utf8, latin1, ascii, etc.; don't use the vulnerable charsets big5, cp932, gb2312, gbk, sjis.
Use specialized functions:
MySQLi prepared statements:
$stmt = $mysqli->prepare('SELECT * FROM test WHERE name = ? LIMIT 1');
$param = "' OR 1=1 /*";
$stmt->bind_param('s', $param);
$stmt->execute();
PDO::quote() - places quotes around the input string (if required) and escapes special characters within the input string, using a quoting style appropriate to the underlying driver:
$pdo = new PDO('mysql:host=localhost;dbname=testdb;charset=UTF8', $user, $password); // explicitly set the character set
$pdo->setAttribute(PDO::ATTR_EMULATE_PREPARES, false); // disable emulated prepared statements, to prevent fallback to emulation for statements that MySQL can't prepare natively (to prevent injection)
$var = $pdo->quote("' OR 1=1 /*"); // not only escapes the literal, but also quotes it (in single-quote ' characters)
$stmt = $pdo->query("SELECT * FROM test WHERE name = $var LIMIT 1");
PDO prepared statements: compared with MySQLi prepared statements, PDO supports more database drivers and named parameters:
$pdo = new PDO('mysql:host=localhost;dbname=testdb;charset=UTF8', $user, $password); // explicitly set the character set
$pdo->setAttribute(PDO::ATTR_EMULATE_PREPARES, false); // disable emulated prepared statements, to prevent fallback to emulation for statements that MySQL can't prepare natively (to prevent injection)
$stmt = $pdo->prepare('SELECT * FROM test WHERE name = ? LIMIT 1');
$stmt->execute(["' OR 1=1 /*"]);
mysql_real_escape_string [deprecated in PHP 5.5.0, removed in PHP 7.0.0].
mysqli_real_escape_string escapes special characters in a string for use in an SQL statement, taking into account the current charset of the connection. Prepared statements are still recommended: they are not simply escaped strings, since a prepared statement comes with a complete query execution plan, including which tables and indexes it would use, so it is also the more optimized way.
Use single quotes (' ') around your variables inside your query.
Check the variable contains what you are expecting for:
If you are expecting an integer, use:
ctype_digit() - Check for numeric character(s);
$value = (int) $value;
$value = intval($value);
$var = filter_var('0755', FILTER_VALIDATE_INT, $options);
For Strings use:
is_string() - Find whether the type of a variable is string
filter_var() - Filters a variable with a specified filter (see the manual for more predefined filters):
$email = filter_var($email, FILTER_SANITIZE_EMAIL);
$newstr = filter_var($str, FILTER_SANITIZE_STRING);
filter_input() - Gets a specific external variable by name and optionally filters it:
$search_html = filter_input(INPUT_GET, 'search', FILTER_SANITIZE_SPECIAL_CHARS);
preg_match() — Perform a regular expression match;
Write Your own validation function.

One trick that can help in the specific circumstance where you have a page like /mypage?id=53 and you use the id in a WHERE clause is to ensure that id definitely is an integer, like so:
if (isset($_GET['id'])) {
$id = $_GET['id'];
settype($id, 'integer');
$result = mysql_query("SELECT * FROM mytable WHERE id = '$id'");
# now use the result
}
But of course that only cuts out one specific attack, so read all the other answers. (And yes I know that the code above isn't great, but it shows the specific defence.)

There's no catchall function, because there are multiple concerns to be addressed.
SQL Injection - Today, generally, every PHP project should be using prepared statements via PHP Data Objects (PDO) as a best practice, preventing an error from a stray quote as well as a full-featured solution against injection. It's also the most flexible & secure way to access your database.
Check out (The only proper) PDO tutorial for pretty much everything you need to know about PDO. (Sincere thanks to top SO contributor, @YourCommonSense, for this great resource on the subject.)
XSS - Sanitize data on the way in...
HTML Purifier has been around a long time and is still actively updated. You can use it to sanitize malicious input, while still allowing a generous & configurable whitelist of tags. Works great with many WYSIWYG editors, but it might be heavy for some use cases.
In other instances, where we don't want to accept HTML/Javascript at all, I've found this simple function useful (and has passed multiple audits against XSS):
/* Prevent XSS input */
function sanitizeXSS () {
$_GET = filter_input_array(INPUT_GET, FILTER_SANITIZE_STRING);
$_POST = filter_input_array(INPUT_POST, FILTER_SANITIZE_STRING);
$_REQUEST = (array)$_POST + (array)$_GET + (array)$_REQUEST;
}
XSS - Sanitize data on the way out... Unless you can guarantee the data was properly sanitized before you added it to your database, you'll need to sanitize it before displaying it to your user. We can leverage these useful PHP functions:
When you call echo or print to display user-supplied values, use htmlspecialchars unless the data was properly sanitized and is allowed to display HTML.
json_encode is a safe way to provide user-supplied values from PHP to Javascript
Do you call external shell commands using the exec() or system() functions, or the backtick operator? If so, in addition to SQL Injection & XSS you might have an additional concern to address: users running malicious commands on your server. You need to use escapeshellcmd if you'd like to escape the entire command OR escapeshellarg to escape individual arguments.
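For the json_encode point above, a minimal sketch of passing a user-supplied value from PHP into an inline script might look like this. The JSON_HEX_* flags are an addition not mentioned in the answer; they make json_encode() emit characters such as < and > as \u escapes, so the value cannot close the surrounding script element. The variable names are hypothetical.

```php
<?php
// Hypothetical hostile value that tries to break out of a <script> block.
$userValue = '</script><script>alert(1)</script>';

// Encode as a JavaScript-compatible literal, with HTML-significant
// characters written as \uXXXX escapes.
$js = json_encode(
    $userValue,
    JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT
);

// The literal can now be placed inside an inline script.
$page = '<script>var comment = ' . $js . ';</script>';

echo $page, PHP_EOL;
```

Decoding the literal gives back the original string unchanged, so nothing is lost; only the representation differs.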

What you are describing here is two separate issues:
Sanitizing / filtering of user input data.
Escaping output.
1) User input should always be assumed to be bad.
Using prepared statements, or/and filtering with mysql_real_escape_string is definitely a must.
PHP also has filter_input built in which is a good place to start.
2) This is a large topic, and it depends on the context of the data being output. For HTML there are solutions such as htmlpurifier out there.
as a rule of thumb, always escape anything you output.
Both issues are far too big to go into in a single post, but there are lots of posts which go into more detail:
Methods PHP output
Safer PHP output

If you're using PostgreSQL, the input from PHP can be escaped with pg_escape_literal()
$username = pg_escape_literal($_POST['username']);
From the documentation:
pg_escape_literal() escapes a literal for querying the PostgreSQL database. It returns an escaped literal in the PostgreSQL format.

You never sanitize input.
You always sanitize output.
The transforms you apply to data to make it safe for inclusion in an SQL statement are completely different from those you apply for inclusion in HTML are completely different from those you apply for inclusion in Javascript are completely different from those you apply for inclusion in LDIF are completely different from those you apply to inclusion in CSS are completely different from those you apply to inclusion in an Email....
By all means validate input - decide whether you should accept it for further processing or tell the user it is unacceptable. But don't apply any change to representation of the data until it is about to leave PHP land.
A long time ago someone tried to invent a one-size fits all mechanism for escaping data and we ended up with "magic_quotes" which didn't properly escape data for all output targets and resulted in different installation requiring different code to work.

The easiest way to avoid mistakes in sanitizing input and escaping data is to use a PHP framework like Symfony or Nette, or part of such a framework (templating engine, database layer, ORM).
A templating engine like Twig or Latte has output escaping on by default, so you don't have to check manually whether you have properly escaped your output for its context (the HTML or JavaScript part of a web page).
The framework sanitizes input automatically, and you shouldn't use the $_POST, $_GET or $_SESSION variables directly, but through mechanisms like routing, session handling, etc.
And for database (model) layer there are ORM frameworks like Doctrine or wrappers around PDO like Nette Database.
You can read more about it here - What is a software framework?

Just wanted to add that on the subject of output escaping, if you use php DOMDocument to make your html output it will automatically escape in the right context. An attribute (value="") and the inner text of a <span> are not equal.
To be safe against XSS read this:
OWASP XSS Prevention Cheat Sheet
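A small sketch of the DOMDocument approach follows, assuming the dom extension is enabled. The same made-up hostile string is placed in an attribute and in a text node, and the serializer is left to escape each one; the exact output is not shown here because attribute serialization can differ between libxml2 versions.

```php
<?php
// Assumption: the dom extension is available.
$doc = new DOMDocument('1.0', 'UTF-8');

// Hypothetical hostile value, for illustration.
$value = '"><script>alert(1)</script>';

$span = $doc->createElement('span');
$span->setAttribute('title', $value);              // attribute context
$span->appendChild($doc->createTextNode($value));  // text-node context
$doc->appendChild($span);

// Serialize just this element; escaping happens here, per context.
$html = $doc->saveHTML($span);
echo $html, PHP_EOL;
```

No escaping function is called by hand: the value is attached to the tree as data and only turned into markup at serialization time.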

The PHP filter extension has many of the functions needed for checking external user input, and it is designed to make data sanitization easier and quicker.
PHP filters can comfortably sanitize & validate the external input.

Related

xpages @IsMember function in dialog list item values

Based on this XPages adding @Formulas in dialogList, my dialogList1 takes values from two concatenated views: a and b.
There is another dialogList2, which is rendered depending if the dialogList1 value is null or not, whose values should be like this:
dialogList1.value is from a => dialogList2.choices should be only from b
dialogList1.value is from b => dialogList2.choices should be only from a
I tried:
// Contr.txt_particontractcv_1 - is the value bound by dialogList1
var dbname = session.getServerName() + "!!" + "mynsf.nsf";
//var a = @Unique(@DbColumn(dbname, "vwNumeCompanii", 0)).sort();
//var b = @Unique(@DbColumn(@DbName(),"vwA",0));
//return a.concat(b);
if ( @IsMember(Contr.txt_particontractcv_1,@Unique(@DbColumn(@DbName(),"vwA",0))))
{ return @Unique(@DbColumn(dbname, "vwNumeCompanii", 0)) }
else
{ return @Unique(@DbColumn(@DbName(),"vwA",0)) }
but the dialogList2 is taking values only from vwA ( from b ) ... I think I'm missing something. Thanks for your time.
Contr.txt_particontractcv_1 cannot be used in SSJS. Dot notation works in LotusScript but not SSJS or Java because Java's runtime is not proprietary and has not been extended that way. That is why Contr.getItemValueString("txt_particontractcv_1") is required.
Some SSJS global variables allow dot notation to be used, e.g. sessionScope. But that is because it is based on a Map, so sessionScope.myProperty can only map to sessionScope.get("myProperty"). The Domino Document class does not extend the Map interface (that's one of the enhancements of the OpenNTF Domino API), so dot notation doesn't know whether to use getItemValue(), getItemValueString(), getItemValueDateTimeArray() etc.
This is also why best practice for scoped variables is also to use e.g. sessionScope.get("myVar"). When it comes to moving to Java, you will not be able to use dot notation, you will have to use the relevant method. So working that way in SSJS fosters good habits for the future.
Yep, I just modified Contr.txt_particontractcv_1 to Contr.getItemValueString("txt_particontractcv_1") and now it works.

return codes for Jira workflow script validators

I'm writing a workflow validator in Groovy to link two issues based on a custom field value input at case creation. It is required that the mapping from custom field value to Jira issue link be unique. In other words, I need to ensure only one issue has a particular custom field value. If there is more than one issue that has the input custom field value, the validation should fail.
How or what do I return to cause a workflow validator to fail?
Example code:
// Set up jqlQueryParser object
jqlQueryParser = ComponentManager.getComponentInstanceOfType(JqlQueryParser.class) as JqlQueryParser
// Form the JQL query
query = jqlQueryParser.parseQuery('<my_jql_query>')
// Set up SearchService object used to query Jira
searchService = componentManager.getSearchService()
// Run the query to get all issues with Article number that match input
results = searchService.search(componentManager.getJiraAuthenticationContext().getUser(), query, PagerFilter.getUnlimitedFilter())
// Throw a FATAL level log statement because we should never have more than one case associated with a given KB article
if (results.getIssues().size() > 1) {
for (r in results.getIssues()) {
log.fatal('Custom field has more than one Jira issue associated with it. ' + r.getKey() + ' is one of the offending issues')
}
return "?????"
}
// Create link from new Improvement to parent issue
for (r in results) {
IssueLinkManager.createIssueLink(issue.getId(), r.getId(), 10201, 1, getJiraAuthenticationContext().getUser())
}
Try something like:
import com.opensymphony.workflow.InvalidInputException
invalidInputException = new InvalidInputException("Validation failure")
This is based on the Groovy Script Runner. If it doesn't work for you, I would recommend using some sort of framework to make scripting easier. I like using Groovy Script Runner, Jira Scripting Suite or the Behaviours Plugin. All of them make script writing easier and much more intuitive.

Cleaning user inputs in FuelPHP

I am quite new to the FuelPHP framework. Right now I'm implementing an "autocomplete" for a list of locations.
My code looks like this:
public function action_search($term=null){
$clean_query = Security::clean($term);
$data["locations"] = array();
if ($clean_query != "") {
$data["locations"] = Model_Orm_Location::query()
->where("title", "like", $clean_query."%")
->get();
}
$response = Response::forge(View::forge("location/search", $data));
$response->set_header("Content-Type","application/json");
return $response;
}
As you can see, I'm concatenating a LIKE statement and it sort of feels bad to me. Is this code safe against SQL injections ? If yes, then is it because:
Security::clean will remove all mess;
where() in the ORM query will do the filtering?
Looking at the implementation of Security::clean in the source code of core/class/security.php, in your case the applied filters depend on the configuration security.input_filter, which is empty by default. So no filter is applied.
But when you dig deep into the database abstraction, you will see, that when the query is compiled just before execution, the query builder will apply quote on the value that was supplied in the where condition, which will then apply escape on string values. The implementations of that escape method depend on the DBMS connection:
mysql_real_escape_string for mysql,
mysqli::real_escape_string for mysqli, and
PDO::quote for PDO.
This reflects today’s best practices. So, yes, this is safe against SQL injections.

Custom field name mapping in expressionengine

I am making some changes to a site using ExpressionEngine, there is a concept I can't seem to quite understand and am probably coding workarounds that have pre-provided methods.
Specifically with channels and members, the original creator has added several custom fields. I am from a traditional database background where each column in a table has a specific meaningful name. I am also used to extending proprietary data by adding a related table joined with a unique key, again, field names in the related table are meaningful.
However, in EE, when you add custom fields, it creates them in the table as field_id_x and puts an entry into another table telling you what these are.
Now this is all nice from a UI point of view for giving power to an administrator but this is a total headache when writing any code.
Ok, there's template tags but I tend not to use them and they are no good in a database query anyway.
Is there a simple way to do a query on, say, the members table and then address m_field_1 as what it's really called, in my case "addresslonglat"?
There are dozens of these fields in the table I am working on and at the moment I am addressing them with fixed names like "m_field_id_73" which means nothing.
Anybody know of an easy way to bring the data and its field names together easily?
Ideally i'd like to do the following:
$result = $this->EE->db->query("select * from exp_member_data where member_id = 123")->row();
echo $result->addresslonglat;
Rather than
echo $result->m_field_id_73;
This should work for you:
<?php
$fields = $data = array();
$member_fields = $this->EE->db->query("SELECT m_field_id, m_field_name FROM exp_member_fields");
foreach($member_fields->result_array() as $row)
{
$fields['m_field_id_'.$row['m_field_id']] = $row['m_field_name'];
}
$member_data = $this->EE->db->query("SELECT * FROM exp_member_data WHERE member_id = 1")->row();
foreach($member_data as $k => $v)
{
if($k != 'member_id')
{
$data[$fields[$k]] = $v;
}
}
print_r($data);
?>
Then just use your new $data array.

Pattern Matching for URL classification

As a part of a project, a few others and I are currently working on a URL classifier. What we are trying to implement is actually quite simple: we simply look at the URL, find relevant keywords occurring within it, and classify the page accordingly.
Eg : If the url is : http://cnnworld/sports/abcd, we would classify it under the category "sports"
To accomplish this, we have a database with mappings of the format : Keyword -> Category
Now what we are currently doing is, for each URL, we keep reading all the data items within the database, and using String.find() method to see if the keyword occurs within the URL. Once this is found, we stop.
But this approach has a few problems, the main ones being :
(i) Our database is very big and such repeated querying runs extremely slowly
(ii) A page may belong to more than one category and our approach does not handle such cases. Of course, one simple way to handle this would be to continue querying the database even once a category match is found, but this would only make things even slower.
I was thinking of alternatives and was wondering if the reverse could be done: parse the URL, find the words occurring within it, and then query the database for those words only.
A naive algorithm for this would run in O( n^2 ) - query the database for all substrings that occur within the url.
I was wondering if there was any better approach to accomplish this. Any ideas ?? Thank you in advance :)
In our commercial classifier we have a database of 4m keywords :) and we also search the body of the HTML, there are number of ways to solve this:
Use Aho-Corasick, we have used a modified algorithm specially to work with web content, for example treat: tab, space, \r, \n as space, as only one, so two spaces would be considered as one space, and also ignore lower/upper case.
Another option is to put all your keywords inside a tree (std::map for example) so the search becomes very fast, the downside is that this takes memory, and a lot, but if it's on a server, you wouldn't feel it.
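The in-memory lookup idea can be sketched as follows, written in PHP only to stay consistent with the other examples on this page. It assumes the keyword table fits in memory and that keywords are single lowercase tokens; the map contents and the function name classifyUrl are hypothetical.

```php
<?php
// Hypothetical keyword => categories map; in practice it would be
// loaded once from the database and kept in memory.
$keywordMap = [
    'sports'   => ['sports'],
    'football' => ['sports'],
    'finance'  => ['business'],
];

function classifyUrl(string $url, array $keywordMap): array
{
    // Decode, lowercase, and split on anything that is not a letter or digit.
    $tokens = preg_split(
        '/[^a-z0-9]+/',
        strtolower(rawurldecode($url)),
        -1,
        PREG_SPLIT_NO_EMPTY
    );

    $categories = [];
    foreach ($tokens as $token) {
        // One hash lookup per token instead of one scan per keyword.
        foreach ($keywordMap[$token] ?? [] as $category) {
            $categories[$category] = true; // array keys de-duplicate
        }
    }
    return array_keys($categories);
}

$single   = classifyUrl('http://cnnworld/sports/abcd', $keywordMap);
$multiple = classifyUrl('http://example.com/finance/football-news', $keywordMap);

print_r($single);
print_r($multiple);
```

The work per URL then depends on the number of tokens in the URL rather than on the size of the keyword table, and a URL can land in several categories at once. Keywords that span several tokens or appear as substrings would need something like the Aho-Corasick approach mentioned above.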
I think your suggestion of breaking apart the URL to find useful bits and then querying for just those items sounds like a decent way to go.
I tossed together some Java that might help illustrate code-wise what I think this would entail. The most valuable portions are probably the regexes, but I hope the general algorithm of it helps some as well:
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.List;
public class CategoryParser
{
/** The db field that keywords should be checked against */
private static final String DB_KEYWORD_FIELD_NAME = "keyword";
/** The db field that categories should be pulled from */
private static final String DB_CATEGORY_FIELD_NAME = "category";
/** The name of the table to query */
private static final String DB_TABLE_NAME = "KeywordCategoryMap";
/**
* This method takes a URL and from that text alone determines what categories that URL belongs in.
* @param url - String URL to categorize
* @return categories - A List<String> of categories the URL seemingly belongs in
*/
public static List<String> getCategoriesFromUrl(String url) {
// Clean the URL to remove useless bits and encoding artifacts
String normalizedUrl = normalizeURL(url);
// Break the url apart and get the good stuff
String[] keywords = tokenizeURL(normalizedUrl);
// Construct the query we can query the database with
String query = constructKeywordCategoryQuery(keywords);
System.out.println("Generated Query: " + query);
// At this point, you'd need to fire this query off to your database,
// and the results you'd get back should each be a valid category
// for your URL. This code is not provided because it's very implementation specific,
// and you already know how to deal with databases.
// Returning null to make this compile, even though you'd obviously want to return the
// actual List of Strings
return null;
}
/**
* Removes the protocol, if it exists, from the front and
* removes any random encoding characters
* Extend this to do other url cleaning/pre-processing
* @param url - The String URL to normalize
* @return normalizedUrl - The String URL that has no junk or surprises
*/
private static String normalizeURL(String url)
{
// Decode URL to remove any %20 type stuff
String normalizedUrl = url;
try {
// I've used a URLDecoder that's part of Java here,
// but this functionality exists in most modern languages
// and is universally called url decoding
normalizedUrl = URLDecoder.decode(url, "UTF-8");
}
catch(UnsupportedEncodingException uee)
{
System.err.println("Unable to Decode URL. Decoding skipped.");
uee.printStackTrace();
}
// Remove the protocol, http:// ftp:// or similar from the front
if (normalizedUrl.contains("://"))
{
normalizedUrl = normalizedUrl.split(":\\/\\/")[1];
}
// Room here to do more pre-processing
return normalizedUrl;
}
/**
* Takes apart the url into the pieces that make at least some sense
* This doesn't guarantee that each token is a potentially valid keyword, however
* because that would require actually iterating over them again, which might be
* seen as a waste.
* @param url - Url to be tokenized
* @return tokens - A String array of all the tokens
*/
private static String[] tokenizeURL(String url)
{
// I assume that we're going to use the whole URL to find tokens in
// If you want to just look in the GET parameters, or you want to ignore the domain
// or you want to use the domain as a token itself, that would have to be
// processed above the next line, and only the remaining parts split
String[] tokens = url.split("\\b|_");
// One could alternatively use a more complex regex to remove more invalid matches
// but this is subject to your (?:in)?ability to actually write the regex you want
// These next two get rid of tokens that are too short, also.
// Destroys anything that's not alphanumeric and things that are
// alphanumeric but only 1 character long
//String[] tokens = url.split("(?:[\\W_]+\\w)*[\\W_]+");
// Destroys anything that's not alphanumeric and things that are
// alphanumeric but only 1 or 2 characters long
//String[] tokens = url.split("(?:[\\W_]+\\w{1,2})*[\\W_]+");
return tokens;
}
private static String constructKeywordCategoryQuery(String[] keywords)
{
// This will hold our WHERE body, keyword OR keyword2 OR keyword3
StringBuilder whereItems = new StringBuilder();
// Potential query, if we find anything valid
String query = null;
// Iterate over every found token
for (String keyword : keywords)
{
// Reject invalid keywords
if (isKeywordValid(keyword))
{
// If we need an OR
if (whereItems.length() > 0)
{
whereItems.append(" OR ");
}
// Simply append this item to the query
// Yields something like "keyword='thisKeyword'"
whereItems.append(DB_KEYWORD_FIELD_NAME);
whereItems.append("='");
whereItems.append(keyword);
whereItems.append("'");
}
}
// If a valid keyword actually made it into the query
if (whereItems.length() > 0)
{
query = "SELECT DISTINCT(" + DB_CATEGORY_FIELD_NAME + ") FROM " + DB_TABLE_NAME
+ " WHERE " + whereItems.toString() + ";";
}
return query;
}
private static boolean isKeywordValid(String keyword)
{
// Keywords better be at least 2 characters long
return keyword.length() > 1
// And they better be only composed of letters and numbers
&& keyword.matches("\\w+")
// And they better not be *just* numbers
// && !keyword.matches("\\d+") // If you want this
;
}
// How this would be used
public static void main(String[] args)
{
List<String> soQuestionUrlClassifications = getCategoriesFromUrl("http://stackoverflow.com/questions/10046178/pattern-matching-for-url-classification");
List<String> googleQueryURLClassifications = getCategoriesFromUrl("https://www.google.com/search?sugexp=chrome,mod=18&sourceid=chrome&ie=UTF-8&q=spring+is+a+new+service+instance+created#hl=en&sugexp=ciatsh&gs_nf=1&gs_mss=spring%20is%20a%20new%20bean%20instance%20created&tok=lnAt2g0iy8CWkY65Te75sg&pq=spring%20is%20a%20new%20bean%20instance%20created&cp=6&gs_id=1l&xhr=t&q=urlencode&pf=p&safe=off&sclient=psy-ab&oq=url+en&gs_l=&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=2176d1af1be1f17d&biw=1680&bih=965");
}
}
The Generated Query for the SO link would look like:
SELECT DISTINCT(category) FROM KeywordCategoryMap WHERE keyword='stackoverflow' OR keyword='com' OR keyword='questions' OR keyword='10046178' OR keyword='pattern' OR keyword='matching' OR keyword='for' OR keyword='url' OR keyword='classification';
Plenty of room for optimization, but I imagine it to be much faster than checking the string for every possible keyword.
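The query above is assembled by string concatenation, which is only safe here because isKeywordValid restricts keywords to word characters. As a hedged alternative, the following minimal sketch generates one ? placeholder per keyword so the values can be bound through a PreparedStatement instead. The class and method names (KeywordQueryBuilder, validKeywords, buildParameterizedQuery) are made up for illustration; the table and column names are the ones that appear in the generated query above.

```java
import java.util.ArrayList;
import java.util.List;

class KeywordQueryBuilder {
    // Names taken from the generated query shown above.
    static final String TABLE = "KeywordCategoryMap";
    static final String KEYWORD_FIELD = "keyword";
    static final String CATEGORY_FIELD = "category";

    /** Keeps only tokens made of word characters that are at least 2 characters long. */
    static List<String> validKeywords(String[] tokens) {
        List<String> valid = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() > 1 && token.matches("\\w+")) {
                valid.add(token);
            }
        }
        return valid;
    }

    /** Returns SQL with one '?' per keyword, or null if there are no keywords. */
    static String buildParameterizedQuery(int keywordCount) {
        if (keywordCount == 0) {
            return null;
        }
        StringBuilder placeholders = new StringBuilder();
        for (int i = 0; i < keywordCount; i++) {
            if (i > 0) {
                placeholders.append(", ");
            }
            placeholders.append("?");
        }
        // keyword IN (?, ?, ...) is equivalent to the chain of ORs used above
        return "SELECT DISTINCT " + CATEGORY_FIELD + " FROM " + TABLE
                + " WHERE " + KEYWORD_FIELD + " IN (" + placeholders + ")";
    }

    public static void main(String[] args) {
        String[] tokens = {"stackoverflow", ".", "com", "/", "questions"};
        List<String> keywords = validKeywords(tokens);
        System.out.println(buildParameterizedQuery(keywords.size()));
        System.out.println("Values to bind: " + keywords);
        // With a real java.sql.PreparedStatement you would then call, for each index i:
        //   statement.setString(i + 1, keywords.get(i));
    }
}
```

The database driver then handles quoting of each bound value, so the query stays safe even if the keyword validation is loosened later.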
The Aho-Corasick algorithm is well suited to finding many keywords inside a string in a single traversal. Build a trie (the Aho-Corasick automaton) from your keywords, and store at the final node of each keyword a number that identifies that keyword.
Then you just run the URL string through the automaton. Whenever you reach a node holding a number (it works as a flag in this scenario), a keyword has matched. Look that number up in a hash map to find the corresponding category for further use.
I think this will help you.
For a visual walkthrough, go to this link: good animation of Aho-Corasick by Ivan
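To make the idea concrete, here is a minimal sketch of an Aho-Corasick matcher in Java. It is one possible implementation, not a reference one: the class name AhoCorasickCategorizer and the example keywords and categories are invented for illustration, and for brevity each terminal node stores the category name directly rather than a number that is looked up in a separate hash map.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class AhoCorasickCategorizer {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        Node fail;                                        // longest proper suffix that is also a trie prefix
        final List<String> categories = new ArrayList<>(); // categories of keywords ending here
    }

    private final Node root = new Node();

    AhoCorasickCategorizer(Map<String, String> keywordToCategory) {
        // Step 1: insert every keyword into the trie
        for (Map.Entry<String, String> entry : keywordToCategory.entrySet()) {
            Node node = root;
            for (char c : entry.getKey().toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            node.categories.add(entry.getValue());
        }
        buildFailureLinks();
    }

    // Step 2: breadth-first pass that sets failure links and merges outputs
    private void buildFailureLinks() {
        Queue<Node> queue = new ArrayDeque<>();
        root.fail = root;
        for (Node child : root.children.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node current = queue.remove();
            for (Map.Entry<Character, Node> entry : current.children.entrySet()) {
                char c = entry.getKey();
                Node child = entry.getValue();
                Node f = current.fail;
                while (f != root && !f.children.containsKey(c)) {
                    f = f.fail;
                }
                Node target = f.children.get(c);
                child.fail = (target != null) ? target : root;
                // Keywords that end at the failure target also end here
                child.categories.addAll(child.fail.categories);
                queue.add(child);
            }
        }
    }

    /** Scans the text once and returns the categories of all keywords found in it. */
    Set<String> categorize(String text) {
        Set<String> found = new LinkedHashSet<>();
        Node node = root;
        for (char c : text.toCharArray()) {
            while (node != root && !node.children.containsKey(c)) {
                node = node.fail;
            }
            node = node.children.getOrDefault(c, root);
            found.addAll(node.categories);
        }
        return found;
    }

    public static void main(String[] args) {
        Map<String, String> keywords = new HashMap<>();
        keywords.put("stackoverflow", "programming"); // hypothetical mappings
        keywords.put("pattern", "algorithms");
        keywords.put("url", "web");
        AhoCorasickCategorizer categorizer = new AhoCorasickCategorizer(keywords);
        System.out.println(categorizer.categorize(
                "stackoverflow.com/questions/10046178/pattern-matching-for-url-classification"));
    }
}
```

Note that this matches keywords as substrings, so a keyword such as "url" would also match inside "curl". If that matters, tokenize the URL first and feed the tokens through separately, or check the characters around each match.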
If you have (many) fewer categories than keywords, you could create a regex for each category, where it would match any of the keywords for that category. Then you'd run your URL against each category's regex. This would also address the issue of matching multiple categories.
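A rough sketch of that approach in Java follows. The class name RegexCategorizer and the sample categories are assumptions made for the example. Each category gets one precompiled pattern that is an alternation of its keywords, and the lookarounds treat anything other than a letter or digit (including underscores) as a separator, mirroring the tokenization used earlier.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class RegexCategorizer {
    private final Map<String, Pattern> categoryPatterns = new LinkedHashMap<>();

    RegexCategorizer(Map<String, List<String>> keywordsByCategory) {
        for (Map.Entry<String, List<String>> entry : keywordsByCategory.entrySet()) {
            if (entry.getValue().isEmpty()) {
                continue; // an empty alternation would match everywhere
            }
            // Quote each keyword so regex metacharacters are taken literally
            String alternation = entry.getValue().stream()
                    .map(Pattern::quote)
                    .collect(Collectors.joining("|"));
            // Only match whole tokens: no letter or digit directly before or after
            Pattern pattern = Pattern.compile(
                    "(?<![A-Za-z0-9])(?:" + alternation + ")(?![A-Za-z0-9])",
                    Pattern.CASE_INSENSITIVE);
            categoryPatterns.put(entry.getKey(), pattern);
        }
    }

    /** Returns every category whose pattern finds at least one keyword in the URL. */
    List<String> categorize(String url) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Pattern> entry : categoryPatterns.entrySet()) {
            if (entry.getValue().matcher(url).find()) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Map<String, List<String>> keywordsByCategory = new LinkedHashMap<>();
        keywordsByCategory.put("programming", List.of("stackoverflow", "regex")); // hypothetical data
        keywordsByCategory.put("web", List.of("url", "http"));
        RegexCategorizer categorizer = new RegexCategorizer(keywordsByCategory);
        System.out.println(categorizer.categorize(
                "stackoverflow.com/questions/10046178/pattern-matching-for-url-classification"));
    }
}
```

The cost is one regex scan of the URL per category, which is why this only makes sense when categories are far fewer than keywords.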

Resources