How does one extract URLs from a string? (Any language) - string

I realise this question has been asked many times across Stack Overflow and across the web; in fact, I have about 20 tabs open right now with apparent solutions to this problem.
The thing is every single answer says something along the lines of
You could use Regex, but it's not a good idea and doesn't reliably work, but I won't offer any alternatives.
So my question is this - Is there really no reliable, definitive way we can extract URLs from text?

Regular Expressions are extremely powerful tools. Like most powerful tools, they are seriously misunderstood, dangerous in the hands of many of their users, and the best answer to certain tasks. Matching known patterns in strings is what they exist for. Once you have a good URL pattern in hand it will work all the time in the context it was designed for. The reason everyone shies away from using them is that creating a good URL pattern for a specific context is difficult work. The pattern will vary by the execution environment (e.g., operating system for file: URLs), by the programming language and/or library in use, etc.
For the specific case of HTTP URLs, there is a clear definition that is mostly adhered to, and you can build a reliable regular expression from it with almost any language or library.
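As a rough illustration of the HTTP case, here is a minimal sketch in Python. The pattern, the name extract_http_urls and the choice of trailing punctuation to strip are my own assumptions for plain-text input; it is deliberately simpler than the full URI grammar and would need tuning for other contexts.

```python
import re

# Assumption: "URL" means an absolute http:// or https:// URL embedded in
# free text. This is an illustrative pattern, not a full RFC 3986 grammar.
URL_PATTERN = re.compile(
    r"""https?://          # scheme
        [^\s<>"'()]+       # everything up to whitespace or a common delimiter
    """,
    re.IGNORECASE | re.VERBOSE,
)

# Punctuation that usually belongs to the sentence rather than the URL.
TRAILING = ".,;:!?"

def extract_http_urls(text):
    """Return the http(s) URLs found in text, in order of appearance."""
    return [m.group(0).rstrip(TRAILING) for m in URL_PATTERN.finditer(text)]
```

Stripping trailing punctuation is a context decision: it suits prose, but it would be wrong for input where a URL may legitimately end in a full stop.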

If you want to extract URLs from any string, there is no other choice than using Regex.
In fact, the URI scheme is defined (see http://en.wikipedia.org/wiki/URI_scheme), and if you go through all its aspects, regex is very reliable.

Is there really no reliable, definitive way we can extract URLs from text?
Well, anything that arrives as free-form text needs careful handling of exceptional cases. That said, once you have that handling in place it should work fine.
A regexp built around the URI scheme may do the trick; it might look something like:
<a href="(?<url>http://.*?)".*>(?<text>.+?)<\/a>
That's a .NET regexp though, so you may need to modify it to work on your platform language.

Related

Tools for Domain Specific Language/Functions

Our users can enter questions that get answered by students. Our users need an extensible, flexible way to define the correct answers to these questions (which are stored as a simple string).
I would like to expose a library of domain specific functions that users can call on to describe the correct answer. Eg:
exact_match("puppy") // means the correct answer is the string 'puppy'
or
contains("yesterday") // means any answer with the word 'yesterday' is correct
The naive implementation would involve eval'ing user supplied strings in a sandboxed runtime (like a javascript vm or ruby vm). But I'd like to go further and only allow specific functions to be called. Any other scripting would be discarded. Such that:
puts("foo"); contains("yesterday")
would be illegal. Since we don't expose or allow puts().
How can I constrain the execution environment to only run a whitelist of functions? Or is there a different approach to build this kind of external-facing DSL instead of trying to constrain an existing language to a subset of functions?
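One possible shape of the "whitelist only" idea, sketched in Python: parse the rule into a syntax tree and accept it only if it is a single call to a known function with one string literal, never handing the text to eval. The names compile_rule and ALLOWED are hypothetical, and treating the rule text as Python expression syntax is an assumption made for the sake of the example.

```python
import ast

# Hypothetical whitelist; the two function names come from the examples above.
ALLOWED = {
    "exact_match": lambda expected: (lambda answer: answer == expected),
    "contains": lambda expected: (lambda answer: expected in answer),
}

def compile_rule(source):
    """Turn e.g. 'contains("yesterday")' into a predicate, without eval()."""
    # mode="eval" accepts one expression only, so 'a; b' is a SyntaxError.
    tree = ast.parse(source, mode="eval")
    call = tree.body
    if not (isinstance(call, ast.Call) and isinstance(call.func, ast.Name)):
        raise ValueError("rule must be a single function call")
    if call.func.id not in ALLOWED:
        raise ValueError("function not allowed: " + call.func.id)
    if call.keywords or len(call.args) != 1:
        raise ValueError("exactly one positional argument expected")
    arg = call.args[0]
    if not (isinstance(arg, ast.Constant) and isinstance(arg.value, str)):
        raise ValueError("argument must be a string literal")
    return ALLOWED[call.func.id](arg.value)
```

Nothing the user typed is ever executed here; the tree is only inspected, and the code that runs is the server's own.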
I would check out MPS by JetBrains if I were you; it's an open-source DSL creation tool. I have never used it myself, but from everything I have seen on it, it's very intuitive, and all of their other products are incredibly powerful.
Just because you're creating a DSL, that doesn't necessarily mean that you have to give the user the ability to enter the code in text.
The key to this is providing a list of method names, each with your own special keyword for it (the funCode tag in the example below).
Create a mapping from keyword to code, let the users define everything they need, and then use it. And I would actually build my own XML parser so that it's not hackable, or at least not hackable via a list of known zero-day exploits.
<strDefs>
<strDef><strNam>sickStr</strNam>
<strText>sick</strText><strNum>01</strNum></strDef>
<strDef><strNam>pupStr</strNam>
<strText>puppy</strText><strNum>02</strNum></strDef>
</strDefs>
<funDefs>
<funDef><funCode>pfContainsStr</funCode><funLabel>contains</funLabel>
<funNum>01</funNum></funDef>
<funDef><funCode>pfXact</funCode><funLabel>exact_match</funLabel>
<funNum>02</funNum></funDef>
</funDefs>
<queries>
<query><fun>01</fun><str>02</str>
</query>
</queries>
The above XML represents the idea and the structure of what to do, but the data would really be entered through a user interface, so the user is constrained. The user interface code that allows the data-entry of the above data should be running on your server, and they only interact with it. Any code that runs on their browser is hackable, because they can just save the page, edit the HTML (and/or JavaScript), and run that, which is their code now, not yours anymore.
You can't really open the door (Pandora's box) and allow just anyone to write just any code and have it evaluated / interpreted by the language parser, because some hacker is going to exploit it. You must lock down the strings, probably by having them enter them into your database in an earlier step, and each string gets its own token that YOU generate (a SQL Server primary key is very simple, usable, and secure), but give them a display representation so it's readable to them.
Then give them a list of methods / functions they can use, along with a token (a primary key can also serve here, perhaps with a kind of table prefix) and also a display representation (label).
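The token idea can be sketched in a few lines of Python. The tables below simply mirror the XML example (tokens 01 and 02); the dictionary names and run_query are hypothetical, and in practice the tables would live in the database rather than in code.

```python
# Hypothetical token tables mirroring the XML example: users only ever pick
# tokens, and only the server maps a token to a string or to real code.
STRINGS = {"01": "sick", "02": "puppy"}

def pf_contains_str(answer, expected):
    return expected in answer

def pf_xact(answer, expected):
    return answer == expected

# Whitelist: token -> (display label, implementation).
FUNCTIONS = {
    "01": ("contains", pf_contains_str),
    "02": ("exact_match", pf_xact),
}

def run_query(fun_token, str_token, answer):
    """Evaluate one query against a student's answer; unknown tokens are rejected."""
    if fun_token not in FUNCTIONS or str_token not in STRINGS:
        raise ValueError("unknown token")
    _label, func = FUNCTIONS[fun_token]
    return func(answer, STRINGS[str_token])
```

With the query from the XML (fun 01, str 02), run_query("01", "02", answer) checks whether the answer contains "puppy".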
If you have them put all of their labels into yet another table, you can have SQL make sure that all of their labels are unique to each other in the whole "language", and then you can allow them to try to define their expressions in the language they want to use. This has the advantage that foreign languages can be used, but you don't have to do anything terribly special.
An important piece would be the verify button, that would translate their expression into unique tokens and back again, checking that the round-trip was successful. If it wasn't successful, there's some kind of ambiguity, and you might be able to allow them an option to use the list of tokens as the source in that case.
If you heavily rely on set-based logic for the underlying foundation of the language and your tables, you should be able to produce a coherent DSL that works. Many DSL creation problems are ones of integrity, where there are underlying assumptions that are contradictory, unintentionally mutually exclusive, or nonsensical. Truth is an unshakeable foundation. Anything else has a lie somewhere -- that you're trying to build on.
Sudoku is illustrative here. When you screw up a Sudoku, you often don't know that you have done so, and you keep building on that false foundation, until you get to the completion of the puzzle, and one whole string of assumptions disagrees with a different string of assumptions. They can't both be true. But you can't tell where you went wrong because you're too far away from the mistake and can not work backwards (easily). All steps taken look correct. A DSL, a database schema, and code, are all this way. Baby steps, that are double- and even triple-checked, and hopefully "correct by inspection", are the best way to "grow" a DSL, slowly, piece-by-piece. The best way to not have flaws is to not add them in the first place.
You don't want bugs in your DSL. Keep it spartan. KISS - Keep it simple, Spartacus! And I have personally found that keeping it set-based, if not overtly, under the covers, accomplishes this very well.
Finally, to be able to think this way, I've studied languages for a long time, and have cultivated a curiosity about how languages have come to be. Books are a good quality source of information, as they have a higher quality level than the internet, which is nevertheless also an indispensable source. Some of my favorite languages: Forth, Factor, SETL, F#, C#, Visual FoxPro (especially for its embedded SQL), T-SQL, Common LISP, Clojure, and probably my favorite, Dylan, an INFIX Lisp without parentheses that Apple experimented with and abandoned, with a syntax that seems to me reminiscent of Pascal, which I sort of liked. The language list is actually much longer than that (and I haven't written code for many of them -- just studied them or their genesis), but that's enough for now.
One of my favorite books, and immensely interesting for the "people" side of it, is "Masterminds of Programming: Conversations with the Creators of Major Programming Languages" by Federico Biancuzzi and Chromatic (O'Reilly, Theory in Practice series).
By the way, don't let them compromise the integrity of your DSL -- require that it is expressible set-based, and things should go well (IMHO). I hope it works out well for you. Add a comment to my answer telling me how it worked out, if you think of it. And don't forget to choose my answer if you think it's the best! We work hard for the money! ;-)

Do Lisp apps and webapps need special input sanitizing?

EDIT 3 Quite a few new developments have happened since I asked this question. Basically I wasn't "seeing things" and webapps written in Clojure have been found to be vulnerable, which prompted changes in Clojure 1.5 and very heated discussion on the Clojure Google groups.
Here's a quote from someone on Hacker News about the changes in Clojure 1.5:
Another slightly interesting thing is the sudden enhancement to
read-eval and EDN[2]. That's mainly because of the rough weather
Ruby/Rubygems was in with the YAML-exploits, which caused a heated
discussion on how the Clojure reader should act by default.
Holes have been found and it's too late to really fix Clojure, so read-eval shall still ship by default set to true (because otherwise it would break too many things). And anyone parsing inputs in Clojure should not use the default read functions but the EDN ones.
So I certainly wasn't seeing things and it didn't take long (not even 18 months) for people to find ways to attack common Clojure webapp stacks.
EDIT 2 I didn't know it but my question is a dupe of the following question (which has been described as a 'killer question'): Lisp data security/validation
If anyone's interested in the answer(s) to this question, I'd suggest they open the above question and read the answers made there by Lisp gurus instead of the ones of the type "nothing to see here, move along, it's just like PHP or JavaScript".
EDIT: I'd like to know if, somehow, because it is Lisp, it would be "easier" for an attacker to transform "data" (i.e. "crafted user input with a malicious intent") into "code". For example, do I need to escape/replace all the parentheses in the user input before starting to "evaluate" / parse or whatever the data?
Original question
I'm still reading about Lisp and suddenly I was wondering, with this entire "code is data" / "data is code" thing, does Lisp need to perform input sanitizing in order to prevent attacks?
I was thinking specifically of webapps, say when a user does some HTTP POST.
What if the data he's sending contains things like:
This is some malicious (eval '(nasty-stuff (...)) or whatever.
(I'm no Lisp programmer, it's just an example of what I've got in mind, it's not meant to be actually mean code)
Is there anything special to keep in mind due to how Lisp works? For example if some dark-side hacker would know that some webserver is running on Clojure, can he exploit that fact and then inject "code between parentheses" that would then be evaluated on the webserver?
Is this a concern at all when receiving/parsing user data (and hence potentially crafted data) from Lisp?
I have written some webapps in Lisp (i.e. Common Lisp) and here are the things I've kept in mind:
if you use read, you should always set *read-eval* to nil for any untrusted data
if you are dealing with code generation - for example, HTML, JS, CSS or SQL generation - which is very common in Lisp-land, you shouldn't forget to use the sanitizing facilities provided by the corresponding libraries (not use raw input strings)
Basically, that's all. Moreover, since it's Lisp, it usually makes your system less prone to attack, because:
there are no standard attacks (as Lisp's use is relatively rare)
the system is rather secure in terms of defaults - this isn't unique, but many web-oriented languages (like PHP in the first place) suffer from insecurity by default, although it is mitigated by modern frameworks
You should always assume that injection attacks are possible until proven otherwise. Without knowing more about your specific Lisp environment and what you are comparing it with, it is impossible to answer whether you need "special" sanitization.
We know that machine code attacks are possible.
We know that SQL injection is possible.
We should assume that it is possible to hijack any Turing-complete system, whether it is hardware or software.
Note that the soft barrier between "code" and "data" is not unique to Lisp. Perl, once the workhorse of the web world, has eval. So does PHP. It looks like Java bytecode injection may be possible as well.
It really does boil down to: don't use READ and don't use EVAL. You need to know exactly what you are sending to either or both of those functions, as well as the contexts within which they are executed. If you do not call either of these, then you're fine.
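The same rule carries over to other languages with an eval. As a loose analogy only (this is Python, not Lisp, and parse_untrusted is a hypothetical name), the standard library's ast.literal_eval plays a role similar to a reader with evaluation switched off: it builds plain data from text and refuses anything that would have to be executed.

```python
import ast

def parse_untrusted(text):
    """Parse a literal (numbers, strings, lists, dicts, ...) without running code."""
    # literal_eval raises ValueError or SyntaxError for anything else,
    # including function calls such as __import__('os').getcwd().
    return ast.literal_eval(text)
```

Calling eval on the same text would happily run whatever call the sender put in it, which is exactly the READ/EVAL concern above.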

What is the purpose of case sensitivity in languages? [duplicate]

Possible Duplicates:
Is there any advantage of being a case-sensitive programming language?
Why are many languages case sensitive?
Something I have always wondered is why languages are designed to be case sensitive.
My pea brain can't fathom any possible reason why it is helpful.
But I'm sure there is one out there. And before anyone says it, having a variable called dog and Dog differentiated by case sensitivity is really really bad practice, right?
Any comments appreciated, along with perhaps any history on the matter! I'm insensitive about case sensitivity generally, but sensitive about sensitivity around case sensitivity so let's keep all answers and comments civil!
It's not necessarily bad practice to have two members which are only differentiated by case, in languages which support it. For example, here's a fairly common bit of C#:
private readonly string name;
public string Name { get { return name; } }
Personally I'm quite happy with case sensitivity - particularly as it allows code like the above, where the member variable and property follow conventions anyway, avoiding confusion.
Note that case-sensitivity has a culture aspect too... not all cultures will deem the same characters to be equivalent...
One of the biggest reasons for case-sensitivity in programming languages is readability. Things that mean the same should also look the same.
I found the following interesting example by M. Sandin in a related discussion:
I used to believe case sensitivity was a mistake, until I did this in the case insensitive language PL/SQL (syntax now entirely forgotten):
function IsValidUserLogin(user:string, password :string):bool begin
result = select * from USERS
where USER_NAME=user and PASSWORD=password;
return not is_empty(result);
end
This passed unnoticed for several months on a low-volume production system, and no harm came of it. But it is a nasty bug, sprung from case insensitivity, coding conventions, and the way humans read code. The lesson for me was that: Things that are the same should look the same.
Can you see the problem immediately? I couldn't...
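For anyone else who could not spot it: as I read the example, the parameter password and the column PASSWORD are the same name once case is ignored, so the condition compares the column with itself and any password is accepted. A small reproduction using Python's sqlite3 module, where identifiers are likewise case-insensitive; the table contents and function names are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE USERS (USER_NAME TEXT, PASSWORD TEXT)")
conn.execute("INSERT INTO USERS VALUES ('alice', 'correct-horse')")

def is_valid_login_buggy(user, password):
    # The bare identifier 'password' resolves to the PASSWORD column, not to
    # the argument, so PASSWORD = password holds for every non-NULL row.
    rows = conn.execute(
        "SELECT * FROM USERS WHERE USER_NAME = ? AND PASSWORD = password",
        (user,),
    ).fetchall()
    return len(rows) > 0

def is_valid_login_fixed(user, password):
    # Binding both values as parameters removes the name clash.
    rows = conn.execute(
        "SELECT * FROM USERS WHERE USER_NAME = ? AND PASSWORD = ?",
        (user, password),
    ).fetchall()
    return len(rows) > 0
```

The buggy version lets alice in with any password at all, while the fixed version only accepts the stored one.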
I like case sensitivity in order to differentiate between class and instance.
Form form = new Form();
If you can't do that, you end up with variables called myForm or form1 or f, which are not as clean and descriptive as plain old form.
Case sensitivity also means that you don't have references to form, FORM and Form which all mean the same thing. I find it difficult to read such code. I find it much easier to scan code where all references to the same variable look exactly the same.
Something I have always wondered, is why are languages designed to be case sensitive?
Ultimately, it's because it is easier to implement a case-sensitive comparison correctly; you just compare bytes/characters without any conversions. You can also do other things, like hashing, really easily.
Why is this an issue? Well, case-insensitivity is rather hard to add unless you're in a tiny domain of supported characters (notably, US-ASCII). Case conversion rules vary by locale (the Turkish rules are not the same as those in the rest of the world) and there's no guarantee that flipping a single bit will do the right thing, or that it is always the same bit and under the same preconditions. (IIRC, there's some really complex rules in some language for throwing away diacritics when converting vowels to upper case, and reintroducing them when converting to lower case. I forget exactly what the details are.)
If you're case sensitive, you just ignore all that; it's just simpler. (Mind you, you still ought to pay attention to Unicode normalization forms, but that's another story and it applies whatever case rules you're using.)
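A few lines of Python make the asymmetry concrete. The helper names are hypothetical; casefold and unicodedata.normalize are standard library, and the caseless version below is still locale-blind, so it does not apply the Turkish rules mentioned above.

```python
import unicodedata

def same_exact(a, b):
    # Case-sensitive: plain code-point comparison, no conversions.
    return a == b

def same_ignoring_case(a, b):
    # Caseless: normalize, then apply Unicode case folding. Folding is not a
    # one-to-one mapping, e.g. German sharp s folds to "ss".
    def norm(s):
        return unicodedata.normalize("NFC", s).casefold()
    return norm(a) == norm(b)

# Case mapping can change the length of a string:
sharp_s_upper = "ß".upper()   # two letters, "SS"
dotted_i_lower = "İ".lower()  # 'i' followed by a combining dot above
```

A compiler that wanted case-insensitive identifiers would have to pick and document rules like these; a case-sensitive one only needs same_exact.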
Imagine you have an object called dog, which has a method called Bark(). Also you have defined a class called Dog, which has a static method called Bark(). You write dog.Bark(). So what's it going to do? Call the object's method or the static method from the class? (in a language where :: doesn't exist)
I'm sure originally it was a performance consideration. Converting a string to upper or lower case for caseless comparison isn't an expensive operation exactly, but it's not free either, and on old systems it may have added complexity that the systems of the day weren't ready to handle.
And now, of course, languages like to be compatible with each other (VB for example can't distinguish between C# classes or functions that differ only in case), people are used to naming things the same text but with different cases (See Jon Skeet's answer - I do that a lot), and the value of caseless languages wasn't really enough to outweigh these two.
The reason you can't understand why case-sensitivity is a good idea, is because it is not. It is just one of the weird quirks of C (like 0-based arrays) that now seem "normal" because so many languages copied what C did.
C uses case-sensitivity in identifiers, but from a language design perspective that was a weird choice. Most languages that were designed from scratch (with no consideration given to being "like C" in any way) were made case-insensitive. This includes Fortran, Cobol, Lisp, and almost the entire Algol family of languages (Pascal, Modula-2, Oberon, Ada, etc.)
Scripting languages are a mixed bag. Many were made case-sensitive because the Unix filesystem was case-sensitive and they had to interact sensibly with it. C kind of grew up organically in the Unix environment, and probably picked up the case-sensitive philosophy from there.
Case-sensitive comparison is (from a naive point of view that ignores canonical equivalence) trivial (simply compare code points), but case-insensitive comparison is not well defined and extremely complex in all cases, and the rules are impossible to remember. Implementing it is possible, but will inadvertently lead to unexpected and surprising behavior. BTW, some languages like Fortran and Basic have always been case-insensitive.

Translating code comments written in another spoken language

I've just inherited some C code from a German programmer, and all of the comments are, naturally, in German. As I've forgotten most of my high school German, this is a slight problem.
Does anyone know of any translation tools that are code-aware; meaning it will only translate language within comments? The project has many files, being able to operate on all of them at once would also be fantastic.
I'm currently copying-and-pasting into Google Translate, and while this is less than ideal, it can at least get me some answers.
I would only know exactly how to do this in java, but I am sure there is a way to do this in C as well, as the tools exist:
Grab a parser that understands C source files (this one sounds ok, but I don't know much about C)
build a syntax tree. iterate over all nodes of the tree, replacing the text of all comment nodes with translated text.
write the tree back to a new source file (perhaps in a different directory).
Very broadly, this should be possible to do using Google translation's Ajax API and a regex function that can deal with callbacks - I don't think JS's built-in regex functions are up to the task but I'm sure there are libraries out there. You would have to build a regular expression that can isolate the comments, send each chunk to the API, and return the translated result in the callback function.
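The regex-with-callback approach could look roughly like this in Python. The translate argument is a stand-in for whatever translation service you call (none is assumed here), and translate_comments and TOKEN are hypothetical names. The pattern also matches string and character literals so that something like "// not a comment" inside a string is left alone; it does not try to handle preprocessor corner cases such as backslash-continued // comments.

```python
import re

# Matches either a string/char literal (group 1, left untouched) or a
# /* block */ or // line comment (group 2).
TOKEN = re.compile(
    r"""("(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')   # 1: string or char literal
      | (/\*.*?\*/|//[^\n]*)                    # 2: comment
    """,
    re.DOTALL | re.VERBOSE,
)

def translate_comments(source, translate):
    """Return C source with every comment body passed through translate()."""
    def replace(match):
        comment = match.group(2)
        if comment is None:              # a literal: keep as-is
            return match.group(0)
        if comment.startswith("//"):
            return "//" + translate(comment[2:])
        return "/*" + translate(comment[2:-2]) + "*/"
    return TOKEN.sub(replace, source)
```

To cover a whole project you would loop over the source files and write the results to a separate directory, keeping the originals for comparison.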

For what reasons do some programmers vehemently hate languages where whitespace matters (e.g. Python)? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
C++ is my first language, and as such I'm used to whitespace being ignored. However, I've been toying around with Python, and I don't find it too hard to get used to the whitespace rules. It seems, however, that a lot of programmers on the Internet can't get past the whitespace rules. From what I've seen, people's C++ programs tend to be formatted very consistently with respect to whitespace (or else it's pretty hard to read), so why do some people have such a problem with whitespace-based languages like Python?
It violates the Principle of Least Astonishment, because we have it ingrained in ourselves (whether for good or bad) that whitespace Does Not Matter in a programming language. Whitespace is one of those issues that has been left up to personal style.
I still have bad memories back from being a student of learning the hard way that 8 spaces is not equivalent to a tab in a Makefile... Ah, the sleep I lost...
The only valid reason I have come across is that refactoring using cut-and-paste (not copy) without refactoring tools (or syntax-aware cut-and-paste), can end up changing semantics if an easy mistake is made.
There are several different types of whitespace (spaces, tabs, weird Unicode characters, carriage returns, line breaks, etc.), they aren't necessarily visually distinct, and languages and editors may treat them capriciously. This isn't an argument against well-designed whitespace semantics, but many people are against all forms of it simply because of the possibility of poor design.
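To make the "not visually distinct" point concrete, here is a small Python 3 check. The two snippets look the same in an editor with a tab width of 8, yet the one mixing a tab with spaces is rejected. The function name indentation_problem is hypothetical; compile and TabError are built in.

```python
def indentation_problem(source):
    """Return the name of the indentation error Python reports, or None."""
    try:
        compile(source, "<example>", "exec")
    except IndentationError as exc:   # TabError is a subclass
        return type(exc).__name__
    return None

# Both bodies are indented to column 8 when a tab is shown as 8 columns.
spaces_only = "if True:\n        x = 1\n        y = 2\n"
mixed = "if True:\n\tx = 1\n        y = 2\n"
```

Python 3 refuses the mixed version with a TabError instead of guessing a tab width, which is one way a language can keep this ambiguity from silently changing a program's meaning.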
People hate it because it violates common sense. Not a single one of the replies I have read here decided that it was ok to simply forgo periods and other punctuation. In fact the grammar has been very good. If the nonsense about indentation actually carrying the meaning were true we would all just forget about using punctuation entirely.
No one learned that newlines terminate a sentence in a horizontal language like English; instead we learned to infer when a sentence ended, whether or not the punctuation was present.
The same is true for programming languages, especially for those of us who started out with a programming language that did use explicit block termination. You learn to infer where a block starts and stops over time, it does not mean that the spacing did that for you, the semantics of the language itself did.
Most literate people would have no problem understanding posts without punctuation. Having to rely on what is a representation of the absence of a character is not a good idea. Do any of you count from zero when you make your to-do list?
Alright, this is a very narrow perspective, but I haven't seen it mentioned elsewhere: keeping track of white space is a hassle if you are trying to autogenerate a script.
When I first encountered Python, I don't remember the details, but I had developed a Windows tool with a GUI that allowed novice users to configure several settings, and then press OK. The output of the tool was a script, which the user could copy to a Unix machine, and then execute it there to do something or other that was too complicated or tedious for them to do manually. Since nobody maintained the generated scripts, there was no reason they needed to look nice. So, keeping track of indentation seemed like an unnecessary burden from that perspective.
For most purposes, though, I find that Python is much easier than any other language.
Perhaps your C++ background (and thus who your peers are) is clouding your perception of this (i.e., selective sampling), but in my experience the reaction to Python's "white space is intent" meme is anywhere from ambivalence to absolute love. The reason a lot of people love it is that it forces people to format their code.
I can't say I've ever met anyone who "hates" it because hating it is much like hating the idea of well-formatted code.
Edit: let me put this in some perspective.
In the Java world there are two main methods of packaging and deploying Web apps: Ant and Maven.
Ant is basically an XML-based Make facility that has tasks for the common things you do. It's a blank slate, which is powerful, but it also means you have to write a lot of common things yourself and every installation is free to do things slightly differently. All of this is well-intentioned but can make it hard to figure out someone's Ant scripts.
Maven is far more fully featured. It has archetypes, which are basically project types. Depending on which archetype(s) you use, you won't have to write any tasks to start, stop, clean, build, etc., but you will have a mandated directory structure, which is quite deep.
The advantage of that is if you've seen one Maven Web app you've seen them all. You know the commands. You know the structure. That's extremely useful.
But you have people who absolutely hate Maven and I think it comes down to this: they don't like giving up control, even when it's ultimately in their interest to do so. Also, you'll find a certain brand of person who thinks that their use case is a justifiable exception. You see this personality trait a lot. For example, I think an old Joel post mentioned a story where someone wanted to use "enter" to go from the username to password form fields even though the convention was that enter executed the default action (usually "OK") so they had to write a custom dialog class for Windows for this.
Basically some people just don't like being told what to do and others are completely obstinate in their belief that they're right even when all evidence points to the contrary.
This probably explains why some supposedly hate Python's white space: they don't like being told how to format their code. They like the freedom of C/C++.
Because change is scary. And maybe, among certain developers, there are some faint memories of languages with capricious rules about whitespacing that were hard to remember and arbitrary, meant more for compiler convenience than expressiveness.
Most likely, not giving whitespace-significance a fair shake before dismissing it is the real reason. Ask someone to fix a bug in a reasonably complex but well-written Python program, then ask them to go fix a bug in a 20 year old system in C, VB or Cobol and ask them which they prefer.
As for me, I have as much trouble with whitespace in Python or Boo as I have with parentheses in Lisp. Which is to say, none.
They will have to get used to it. Initially I had a problem myself trying to read some examples, but after using the language for some time I started liking it.
I believe it is a habit that people have to overcome.
Some have developed habits (for example: deeply nested loops, unnecessarily large functions) that they perceive would be hard to support in a whitespace sensitive language.
Some have developed an aesthetic dislike for hanging indents.
Because they are used to languages like C and JavaScript where they can align items as they please.
When it comes to Python, you have to indent code based on its context:
def Print():
    ManyArgumentFunction(LongParam1,LongParam2,LongParam3,LongParam4...
In C, you could do:
void Print()
{
    ManyArgumentFunction(LongParam1,
                         LongParam2,
                         LongParam3,...
}
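For what it's worth, Python's rules are looser here than the comparison suggests: inside parentheses, brackets or braces, the indentation of continuation lines is not significant, so the C-style alignment works too. A minimal sketch, with made-up function names and string arguments standing in for the long parameters:

```python
def many_argument_function(*args):
    # Stand-in for the function being called; just returns its arguments.
    return args

def print_all():
    # Inside (), [] or {} continuation lines may be indented freely, so the
    # arguments can be aligned one per line as in the C version.
    return many_argument_function("LongParam1",
                                  "LongParam2",
                                  "LongParam3")
```

Only the first line of each statement has to sit at the block's indentation level.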
The only complaints I (also of C++ background) have heard about Python are from people who don't like using the "Replace Tabs with Space" option in their IDE.

Resources