How do you deal with strings that have structure?

Suppose I have an object representing a person, with getter and setter methods for the person's email address. The setter method definition might look something like this:
setEmailAddress(String emailAddress)
{
    this.emailAddress = emailAddress;
}
Calling person.setEmailAddress(0), then, would generate a type error, but calling person.setEmailAddress("asdf") would not - even though "asdf" is in no way a valid email address.
In my experience, so-called strings are almost never arbitrary sequences of characters, with no restriction on length or format. URIs come to mind - as do street addresses, as do phone numbers, as do first names ... you get the idea. Yet these data types are most often stored as "just strings".
Returning to my person object, suppose I modify setEmailAddress() like so
setEmailAddress(EmailAddress emailAddress)
// ...
where EmailAddress is a class ... whose constructor takes a string representation of an email address. Have I gained anything?
OK, so an email address is kind of a bad example. What about a URI class that takes a string representation of a URI as a constructor parameter, and provides methods for managing that URI - setting the path, fetching a query parameter, etc. The validity of the source string becomes important.
So I ask all of you, how do you deal with strings that have structure? And how do you make your structural expectations clear in your interfaces?
Thank you.

"Strings with structure" are a symptom of the common code smell "Primitive Obsession".
The remedy is to watch closely for duplication in code that validates or manipulates parts of these structures. At the first hint of duplication - but not before - extract a class that encapsulates the structure and locate validations and queries there.

Welcome to the world of programming!
I don't think your question is a symptom of an error on your part. Rather, it is a basic problem which appears in many guises throughout the programming world. Strings that have some structure and meaning are passed around between different subsystems of an application, and each subsystem can only do so much parsing and validation.
The problem of verifying an email address, for example, is quite tricky. The regular expressions various people offer for accepting an email address are generally either "too tight" (they reject valid addresses) or "too loose" (they accept illegal ones). The first Google hit for 'regex "email address"', for example, says:
The regular expression I receive the most feedback, not to mention "bug" reports on, is the one you'll find right on this site's home page:
\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b
This regular expression, I claim, matches any email address. Most of the feedback I get refutes that claim by showing one email address that this regex doesn't match.
The fact is that what is or isn't a valid email address is a complex problem, one that a given program might or might not want to solve. The problem of URLs is even worse, especially given the possibility of malicious URLs.
Ideally, you can have a library or system call which solves problems of this sort instead of doing anything yourself (Microsoft Windows calls a custom dialogue box to allow the user to select or create a file, since validating file names is another tricky problem). But you can't always count on having an appropriate system call for a given "meaningful string" either.
I would say that there is no generic solution to the problem of strings-with-structure. Rather, it is a basic problem that appears right when you design your application. In the process of gathering requirements for your application, you should determine what data the application will take in and how meaningful that data will be to the application. And this is where things get tricky, since you may notice the possibility that the app may grow in ways that your boss or customer might not have thought of - or the app may in fact grow in ways that none of you thought of. Thus the application needs to be a little more flexible than what seems like the minimum, BUT only a little. It should also not be so flexible that you get bogged down.
Now, if you decide that you need to validate/interpret etc a given string, putting that string into an object or a hash can be a good approach - this is one way I know to make sure your interface is clear. But the tricky thing is deciding just how much validation or interpretation you need.
Making these decisions is thus an art - there are no dogmatic answers that work here.

This is a pretty common problem falling under the title 'validation' - there are many ways to validate textual user input, one of the most common being Regular Expressions.
You might also consider using the built-in System.Net.Mail.MailAddress class for this, as it provides validation for email addresses.

Strings are strings. If you need your strings to be smarter than average strings then parsing them into a structural object like you describe would be a good idea. I would use a regex to do that.

Regular expressions are your friend when it comes to formatting strings. You could also store each part separately in a struct to avoid going through the trouble of using regular expressions every time you want to use them, e.g.
struct EMail
{
    String BeforeAt = "johndoe123";
    String AfterAt = "gmail.com";
}
struct URL
{
    String Protocol = "http";
    String Domain = "sub.example.com";
    String Path = "stuff/example.html";
}
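For URLs in particular, the splitting may not need to be hand-written at all. As a minimal sketch in Python (the ?page=2 query string is made up for illustration and is not part of the example above), the standard library's urllib.parse already hands back the parts as a structure:

```python
from urllib.parse import urlsplit, parse_qs

# Split a URL into named parts instead of slicing the raw string by hand.
parts = urlsplit("http://sub.example.com/stuff/example.html?page=2")

print(parts.scheme)            # http
print(parts.hostname)          # sub.example.com
print(parts.path)              # /stuff/example.html
print(parse_qs(parts.query))   # {'page': ['2']}
```

The same idea applies to other structured strings: parse once, then pass the structure around rather than the raw text.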

Well, if you want to do several different kinds of things with an EmailAddress object, those other actions do not have to check if it is a valid email address since the EmailAddress object is guaranteed to have a valid string. You could throw an exception in the constructor or use a factory method or whatever "One True Methodology" approach you're using.

Personally, I like the idea of strong typing, so if I were still working in such languages I'd go with the style of your second example. The only thing I'd change might be to use a more "cast-like" structure, like EmailAddressFromString(String), that generated a new EmailAddress object (or pitched a fit if the string wasn't right), as I'm a bit of a fan of application Hungarian notation.
This whole problem, incidentally, is covered pretty well by Joel in http://www.joelonsoftware.com/articles/Wrong.html if you're interested.

I agree with the calls to strongly type the object, but for those cases where you're parsing from a string to an object, the answer is simple: error handling.
There are two general ways to handle errors: exceptions and return conditions. Generally, if you expect to receive badly formed data, you should return an error message; if the bad input is not expected, I would throw an exception. For example, a caller might pass in an ill-formed email address, such as 'bob' instead of 'bob@gmail.com', and that is worth reporting as an error message. For null values, however, you might throw an exception, as you shouldn't try to form an email address out of null.
Returning to your question, I do think you gain something by encoding a structure into an object. Specifically, you only need to validate that the string represents a valid email address in one specific place, such as the constructor. Elsewhere, your code is free to assume that an EmailAddress object is valid, and you don't have to rely upon dodgy classes with names like 'EmailHelper' or some such.
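To make that concrete, here is a rough sketch in Python rather than the asker's language. The class name, the try_parse helper and the deliberately loose pattern are all illustrative assumptions; the pattern is nowhere near a complete email validator. The point is only that validation lives in one place:

```python
import re

# Deliberately loose, illustrative pattern: something@something.tld.
# It is NOT a full RFC 5322 validator.
_EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


class EmailAddress:
    """Hypothetical value object: every instance holds a validated address."""

    def __init__(self, text):
        if text is None:
            # A missing value is a programming error, not bad user input.
            raise TypeError("email address must be a string, not None")
        if not _EMAIL_PATTERN.fullmatch(text):
            raise ValueError(f"not a valid email address: {text!r}")
        self._text = text

    @classmethod
    def try_parse(cls, text):
        """Return an EmailAddress, or None when the text is malformed."""
        try:
            return cls(text)
        except ValueError:
            return None

    @property
    def domain(self):
        # Safe without re-checking: the constructor guaranteed exactly one "@".
        return self._text.split("@", 1)[1]

    def __str__(self):
        return self._text
```

Code that receives an EmailAddress can call domain without validating again, while code that reads raw user input can use try_parse and report the problem to the user.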

I personally do not think strong-typing the email address string as EmailAddress is necessary, in this case.
To create your email address you will, sooner or later, have to do something like:
EmailAddress(String email)
or a setter
SetEmailAddress(String email)
In both cases, you'll have to validate the email string input, which puts you back into your initial validation problem.
I would, as others pointed out, use regular expressions.
Having an EmailAddress class would be useful if you plan on having to perform specific operations on your stored information later on (say get domain name only, stuff like that).

Related

Naming confusion: 'isPackageOmitted' or 'packageIsOmitted'?

As for the property names 'isPackageOmitted' and 'packageIsOmitted', which should I choose?
Could some native speaker help me?

TLDR: don't put "is" into the middle of the name. It's hard to see quickly. Use isPackageOmitted.
The standard is to always prefix accessors, mutators and predicates with get, set and is.
So you would have methods like getPackage(), setPackage(), and isPackageOmitted().
Even though packageIsOmitted reads closer to normal English, it is a really good idea to follow the convention of prefixing. It makes it incredibly easy to instantly know that this is a method that retrieves a boolean value.
Compare anIncrediblyLongAnacondaIsAbleToEatSheep with isAnIncrediblyLongAnacondaAbleToEatSheep. The second one instantly tells you that this is a boolean value, while in the first example you have to carefully look through the whole name to figure it out.
Now, if this is just the property, it would probably be best to drop the "is" altogether. Does it really provide any new information? I'd say, generally, it would be best to have a property called packageOmitted and a method isPackageOmitted() to retrieve the property value.

packageIsOmitted is better since "Package is omitted." is an assertion which is either True xor False, whereas "Is Package omitted?" is a question whose answer is either Yes xor No.

How can you dynamically format a string with a user-provided template and slice of parameters in Go?

I have user-provided format strings, and for each, I have a corresponding slice. For instance, I might have Test string {{1}}: {{2}} and ["number 1", "The Bit Afterwards"]. I want to generate Test string number 1: The Bit Afterwards from this.
The format of the user-provided strings is not fixed, and can be changed if need be. However, I cannot guarantee their sanity or safety; neither can I guarantee that any given character will not be used in the string, so any tags (like {} in my example) must be escapable. I also cannot guarantee that the same number of slice values will exist as tags in the template - for example, I might quite reasonably have Test string {{1}} and ["number 1", "another parameter", "yet another parameter"].
How can I efficiently format these strings, in accordance with the input given? They are for use as strings only, and don't require HTML, SQL or any other sort of escaping.
Things I've already considered:
fmt.Sprintf - two issues: 1) using it with user-provided templates is not ideal; 2) Sprintf does not play nicely with a number of parameters that doesn't match its format string, adding %!(EXTRA type=value) to the end.
The text/template library. This would work fine in theory, but I don't want to have to make users type out {{index .arr n}} for each and every one of their tags; in this case, I only ever need slice indexes.
The valyala/fasttemplate library. This is pretty much exactly what I'm looking for, but for the fact that it doesn't currently support escaping the delimiters it uses for its tags, at the time of writing. I've opened an issue for this, but I would have thought that there's already a solution to this problem somewhere - it doesn't feel like it's that unique.
Just writing my own parser for it. This would work... but, as above, I can't be the first person to have come across this!
Any advice or suggestions would be greatly appreciated.

String Class Methods Most Commonly Used

Any programming language. I'm interested in knowing which top 5 methods of a string class are used most for data manipulation, or which top 5 methods one needs to know to be able to handle data manipulation. I know that probably all the methods get used at some point, but I'm interested to see the 5 most common methods people use.
Thanks for your time.

I'd say
String.Format()
String.Split()
String.IndexOf()
String.Substring()
String.ToUpper()

Adam's top 5 are pretty much mine. I might replace IndexOf() with Trim(); I use this EVERY TIME I get a value from the user. String.Compare() using the IgnoreCase values of the StringComparison enumeration would replace most of the uses I've seen for ToUpper().
Format, HEAVILY used in logs and other user messages (far more efficient for templated messages than a bunch of += statements or a StringBuilder()). Split and Substring, ditto, especially in file processing.

Adam's + KeithS. But don't forget the under the hood calls to
String.hashCode(),
String.equals(String rhs)
and its ilk.

Programming style question on how to code functions

So, I was just coding a bit today, and I realized that I don't have much consistency when it comes to a coding style when programming functions. One of my main concerns is whether or not it's proper to code it so that you check that the input of the user is valid OUTSIDE of the function, or just throw the values passed by the user into the function and check if the values are valid in there. Let me sketch an example:
I have a function that lists hosts based on an environment, and I want to be able to split the environment into chunks of hosts. So an example of the usage is this:
listhosts -e testenv -s 2 1
This will get all the hosts from the "testenv", split it up into two parts, and it is displaying part one.
In my code, I have a function that you pass a list, and it returns a list of lists based on your parameters for splitting. BUT, before I pass it a list, I first verify the parameters in my MAIN during the getops process, so in the main I check to make sure there are no negatives passed by the user, I make sure the user didn't request to split into, say, 4 parts but ask to display part 5 (which would not be valid), etc.
tl;dr: Would you check the validity of a user's input in the flow of your MAIN class, or would you do a check in your function itself, and either return a valid response in the case of valid input, or return NULL in the case of invalid input?
Obviously both methods work, I'm just interested to hear from experts as to which approach is better :) Thanks for any comments and suggestions you guys have! FYI, my example is coded in Python, but I'm still more interested in a general programming answer as opposed to a language-specific one!

Good question! My main advice is that you approach the problem systematically. If you are designing a function f, here is how I think about its specification:
What are the absolute requirements that a caller of f must meet? Those requirements are f's precondition.
What does f do for its caller? When f returns, what is the return value and what is the state of the machine? Under what circumstances does f throw an exception, and what exception is thrown? The answers to all these questions constitute f's postcondition.
The precondition and postcondition together constitute f's contract with callers.
Only a caller meeting the precondition gets to rely on the postcondition.
Finally, bearing directly on your question, what happens if f's caller doesn't meet the precondition? You have two choices:
You guarantee to halt the program, one hopes with an informative message. This is a checked run-time error.
Anything goes. Maybe there's a segfault, maybe memory is corrupted, maybe f silently returns a wrong answer. This is an unchecked run-time error.
Notice some items not on this list: raising an exception or returning an error code. If these behaviors are to be relied upon, they become part of f's contract.
Now I can rephrase your question:
What should a function do when its caller violates its contract?
In most kinds of applications, the function should halt the program with a checked run-time error. If the program is part of an application that needs to be reliable, either the application should provide an external mechanism for restarting an application that halts with a checked run-time error (common in Erlang code), or if restarting is difficult, all functions' contracts should be made very permissive so that "bad input" still meets the contract but promises always to raise an exception.
In every program, unchecked run-time errors should be rare. An unchecked run-time error is typically justified only on performance grounds, and even then only when code is performance-critical. Another source of unchecked run-time errors is programming in unsafe languages; for example, in C, there's no way to check whether memory pointed to has actually been initialized.
Another aspect of your question is
What kinds of contracts make the best designs?
The answer to this question varies more depending on the problem domain.
Because none of the work I do has to be high-availability or safety-critical, I use restrictive contracts and lots of checked run-time errors (typically assertion failures). When you are designing the interfaces and contracts of a big system, it is much easier if you keep the contracts simple, you keep the preconditions restrictive (tight), and you rely on checked run-time errors when arguments are "bad".
I have a function that you pass a list, and it returns a list of lists based on your parameters for splitting. BUT, before I pass it a list, I first verify the parameters in my MAIN during the getops process, so in the main I check to make sure there are no negatives passed by the user, I make sure the user didn't request to split into, say, 4 parts but ask to display part 5.
I think this is exactly the right way to solve this particular problem:
Your contract with the user is that the user can say anything, and if the user utters a nonsensical request, your program won't fall over— it will issue a sensible error message and then continue.
Your internal contract with your request-processing function is that you will pass it only sensible requests.
You therefore have a third function, outside the second, whose job it is to distinguish sense from nonsense and act accordingly—your request-processing function gets "sense", the user is told about "nonsense", and all contracts are met.
One of my main concerns is whether or not it's proper to code it so that you check that the input of the user is valid OUTSIDE of the function.
Yes. Almost always this is the best design. In fact, there's probably a design pattern somewhere with a fancy name. But if not, experienced programmers have seen this over and over again. One of two things happens:
parse / validate / reject with error message
parse / validate / process
This kind of design has one data type (request) and four functions. Since I'm writing tons of Haskell code this week, I'll give an example in Haskell:
data Request -- type of a request
parse :: UserInput -> Request -- has a somewhat permissive precondition
validate :: Request -> Maybe ErrorMessage -- has a very permissive precondition
process :: Request -> Result -- has a very restrictive precondition
Of course there are many other ways to do it. Failures could be detected at the parsing stage as well as the validation stage. "Valid request" could actually be represented by a different type than "unvalidated request". And so on.
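Since the question's code is in Python, the same shape might look roughly like the sketch below. The Request fields (parts, show), the error messages and the chunking rule are assumptions made for illustration, not the asker's actual code:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    parts: int   # how many chunks to split the host list into
    show: int    # which chunk to display, counting from 1


def parse(args):
    """Turn two raw argument strings into a Request; raises ValueError on non-integers."""
    parts, show = args
    return Request(int(parts), int(show))


def validate(request):
    """Return an error message for a nonsensical request, or None if it is fine."""
    if request.parts < 1:
        return "number of parts must be at least 1"
    if not 1 <= request.show <= request.parts:
        return f"part {request.show} does not exist; choose 1 to {request.parts}"
    return None


def process(hosts, request):
    """Restrictive precondition: the request has already passed validate()."""
    # Checked run-time error if a caller skips validation.
    assert validate(request) is None, "caller passed an unvalidated request"
    size = -(-len(hosts) // request.parts)  # ceiling division
    start = (request.show - 1) * size
    return hosts[start:start + size]
```

For `listhosts -e testenv -s 2 1`, main would call parse(["2", "1"]), print the message and exit if validate returns one, and otherwise call process. With this simple chunking rule the last chunks can come out empty when there are fewer hosts than parts.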

I'd do the check inside the function itself to make sure that the parameters I was expecting were indeed what I got.
Call it "defensive programming" or "programming by contract" or "assert checking parameters" or "encapsulation", but the idea is that the function should be responsible for checking its own pre- and post-conditions and making sure that no invariants are violated.
If you do it outside the function, you leave yourself open to the possibility that a client won't perform the checks. A method should not rely on others knowing how to use it properly.
If the contract fails you either throw an exception, if your language supports them, or return an error code of some kind.

Checking within the function adds complexity, so my personal policy is to do sanity checking as far up the stack as possible, and catch exceptions as they arise. I also make sure that my functions are documented so that other programmers know what the function expects of them. They may not always follow such expectations, but to be blunt, it is not my job to make their programs work.

It often makes sense to check the input in both places.
In the function you should validate the inputs and throw an exception if they are incorrect. This prevents invalid inputs causing the function to get halfway through and then throw an unexpected exception like "array index out of bounds" or similar. This will make debugging errors much simpler.
However throwing exceptions shouldn't be used as flow control and you wouldn't want to throw the raw exception straight to the user, so I would also add logic in the user interface to make sure I never call the function with invalid inputs. In your case this would be displaying a message on the console, but in other cases it might be showing a validation error in a GUI, possibly as you are typing.

"Code Complete" suggests an isolation strategy where one could draw a line between classes that validate all input and classes that treat their input as already validated. Anything allowed to pass the validation line is considered safe and can be passed to functions that don't do validation (they use asserts instead, so that errors in the external validation code can manifest themselves).

How to handle errors depends on the programming language; however, when writing a commandline application, the commandline really should validate that the input is reasonable. If the input is not reasonable, the appropriate behavior is to print a "Usage" message with an explanation of the requirements as well as to exit with a non-zero status code so that other programs know it failed (by testing the exit code).
Silent failure is the worst kind of failure, and that is what happens if you simply return incorrect results when given invalid arguments. If the failure is ever caught, then it will most likely be discovered very far away from the true point of failure (passing the invalid argument). Therefore, it is best, IMHO to throw an exception (or, where not possible, to return an error status code) when an argument is invalid, since it flags the error as soon as it occurs, making it much easier to identify and correct the true cause of failure.
I should also add that it is very important to be consistent in how you handle invalid inputs; you should either check and throw an exception on invalid input for all functions or do that for none of them, since if users of your interface discover that some functions throw on invalid input, they will begin to rely on this behavior and will be incredibly surprised when other functions simply return invalid results rather than complaining.

Implications of not including NULL in a language?

I know that NULL isn't necessary in a programming language, and I recently made the decision not to include NULL in my programming language. Declaration is done by initialization, so it is impossible to have an uninitialized variable. My hope is that this will eliminate the NullPointerException in favor of more meaningful exceptions or simply not having certain kinds of bugs.
Of course, since the language is implemented in C, there will be NULLs used under the covers.
My question is, besides using NULL as an error flag (this is handled with exceptions) or as an endpoint for data structures such as linked lists and binary trees (this is handled with discriminated unions) are there any other use-cases for NULL for which I should have a solution? Are there any really important implications of not having NULL which could cause me problems?

There's a recent article referenced on LtU by Tony Hoare titled Null References: The Billion Dollar Mistake which describes a method to allow the presence of NULLs in a programming language, but also eliminates the risk of referencing such a NULL reference. It seems so simple yet it's such a powerful idea.
Update: here's a link to the actual paper that I read, which talks about the implementation in Eiffel: http://docs.eiffel.com/book/papers/void-safety-how-eiffel-removes-null-pointer-dereferencing

Borrowing a page from Haskell's Maybe monad, how will you handle the case of a return value that may or may not exist? For instance, if you tried to allocate memory but none was available. Or maybe you've created an array to hold 50 foos, but none of the foos have been instantiated yet -- you need some way to be able to check for these kinds of things.
I guess you can use exceptions to cover all these cases, but does that mean that a programmer will have to wrap all of those in a try-catch block? That would be annoying at best. Or everything would have to return its own value plus a boolean indicating whether the value was valid, which is certainly not better.
FWIW, I'm not aware of any language that doesn't have some sort of notion of NULL -- you've got null in all the C-style languages and Java; Python has None; Scheme, Lisp, Smalltalk, Lua and Ruby all have nil; VB uses Nothing; and Haskell has a different kind of nothing.
That doesn't mean a language absolutely has to have some kind of null, but if all of the other big languages out there use it, surely there was some sound reasoning behind it.
On the other hand, if you're only making a lightweight DSL or some other non-general language, you could probably get by without null if none of your native data types require it.

The one that immediately comes to mind is pass-by-reference parameters. I'm primarily an Objective-C coder, so I'm used to seeing things kind of like this:
NSError *error;
[anObject doSomething:anArgumentObject error:&error];
// Error-handling code follows...
After this code executes, the error object has details about the error that was encountered, if any. But say I don't care if an error happens:
[anObject doSomething:anArgumentObject error:nil];
Since I don't pass in any actual value for the error object, I get no results back, and I don't really worry about parsing an error (since I don't care in the first place if it occurs).
You've already mentioned you're handling errors a different way, so this specific example doesn't really apply, but the point stands: what do you do when you pass something back by reference? Or does your language just not do that?

I think it's useful for a method to return NULL - for example, a search method supposed to return some object can return the found object, or NULL if it wasn't found.
I'm starting to learn Ruby, and Ruby has a very interesting concept for NULL; maybe you could consider implementing something similar. In Ruby, NULL is called nil, and it's an actual object just like any other object. It happens to be implemented as a global singleton object. Also in Ruby, there is an object false, and both nil and false evaluate to false in boolean expressions, while everything else evaluates to true (even 0, for example, evaluates to true).

In my mind there are two use cases for which NULL is generally used:
The variable in question doesn't have a value (Nothing)
We don't know the value of the variable in question (Unknown)
Both are common occurrences and, honestly, using NULL for both can cause confusion.
Worth noting is that some languages that don't support NULL do support the notion of Nothing/Unknown. Haskell, for instance, supports "Maybe a", which can contain either a value of type a or Nothing. Thus, commands can return (and accept) a type that they know will always have a value, or they can return/accept "Maybe a" to indicate that there may not be a value.
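As a rough illustration of the idea outside Haskell, here is a small sketch in Python. The Just and Nothing classes and the lookup and describe helpers are hypothetical names invented for this example; Python itself offers None, so this only imitates the shape of Maybe:

```python
from dataclasses import dataclass
from typing import Generic, TypeVar, Union

T = TypeVar("T")


@dataclass(frozen=True)
class Just(Generic[T]):
    """Wraps a value that is definitely present."""
    value: T


@dataclass(frozen=True)
class Nothing:
    """Marks the absence of a value."""


Maybe = Union[Just[T], Nothing]


def lookup(table: dict, key: str) -> Maybe[str]:
    """Return Just(value) when the key exists, otherwise Nothing()."""
    if key in table:
        return Just(table[key])
    return Nothing()


def describe(result: Maybe[str]) -> str:
    # The caller must unwrap explicitly; the inner value cannot be used by accident.
    if isinstance(result, Just):
        return f"found {result.value}"
    return "not found"
```

A function whose signature says str never has to consider the missing case; only functions that accept or return the Maybe type do.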

I prefer the concept of having non-nullable pointers be the default, with nullable pointers a possibility. You can almost do this with C++ through references (&) rather than pointers, but it can get quite gnarly and irksome in some cases.
A language can do without null in the Java/C sense. For instance, Haskell (like most other functional languages) has a "Maybe" type, which is effectively a construct that just provides the concept of an optional value.

It's not clear to me why you would want to eliminate the concept of 'null' from a language. What would you do if your app requires you to do some initialization 'lazily' - that is, you don't perform the operation until the data is needed? Ex:
public class ImLazy {
    public ImLazy() {
        //I can't initialize resources in my constructor, because I'm lazy.
        //Maybe I don't have a network connection available yet, or maybe I'm
        //just not motivated enough.
    }
    private ResourceObject lazyObject; //stays null until first requested
    public ResourceObject getLazyObject() { //initialize on first use, then return
        if (lazyObject == null) {
            lazyObject = new DatabaseNetworkResourceThatTakesForeverToLoad();
        }
        return lazyObject;
    }
    public boolean isObjectLoaded() { //just report whether the object is loaded
        return (lazyObject != null);
    }
}
In a case like this, how could we return a value for getLazyObject()? We could come up with one of two things:
-require the user to initialize lazyObject in the declaration. The user would then have to fill in some dummy object (UselessResourceObject), which requires them to write all of the same error-checking code (if (lazyObject.equals(UselessResourceObject)...) or:
-come up with some other value, which works the same as null, but has a different name
For any complex/OO language you need this functionality, or something like it, as far as I can see. It may be valuable to have a non-null reference type (for example, in a method signature, so that you don't have to do a null check in the method code), but the null functionality should be available for cases where you do use it.

Interesting discussion happening here.
If I was building a language, I really don't know if I would have the concept of null. I guess it depends on how I want the language to look. Case in point: I wrote a simple templating language whose main strength is nested tokens and ease of making a token a list of values. It doesn't have the concept of null, but then it doesn't really have the concept of any types other than string.
By comparison, the language it is built in, Icon, uses null extensively. Probably the best thing the language designers for Icon did with null is make it synonymous with an uninitialized variable (i.e. you can't tell the difference between a variable that doesn't exist and one that currently holds the value null). They then created two prefix operators to check for null and not-null.
In PHP, I sometimes use null as a 'third' boolean value. This is good in "black-box" type classes (e.g. ORM core) where a state can be True, False or I Don't Know. Null is used for the third value.
Of course, both of these languages do not have pointers in the same way C does, so null pointers do not exist.

We use nulls all the time in our application to represent the "nothing" case. For example, if you are asked to look up some data in the database given an id, and no record matches that id: return null. This is very handy because we can store nulls in our cache, which means we don't have to go back to the database if someone asks for that id again in a few seconds.
The cache itself has two different kinds of responses: null, meaning there was no such entry in the cache, or an entry object. The entry object might have a null value, which is the case when we cached a null db lookup.
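The two levels of "nothing" described here can be sketched as follows. This is in Python rather than our Java, and the LookupCache class, its load callback and the _MISSING sentinel are hypothetical names used only for illustration:

```python
_MISSING = object()  # sentinel meaning "the cache has no entry for this key"


class LookupCache:
    """Caches lookups, including lookups whose result was 'no such record'."""

    def __init__(self, load):
        self._load = load      # function that performs the real (slow) lookup
        self._entries = {}

    def get(self, key):
        value = self._entries.get(key, _MISSING)
        if value is _MISSING:
            # Not cached yet: do the real lookup, which may return None.
            value = self._load(key)
            self._entries[key] = value  # a None result is cached too
        return value
```

A second request for an id with no record is then answered from the cache, because a stored None is distinguishable from a key that was never looked up.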
Our app is written in Java, but even with unchecked exceptions, doing this with exceptions would be incredibly annoying.

If one accepts the propositions that powerful languages should have some sort of pointer or reference type (i.e. something which can hold a reference to data which does not exist at compile time), and some form of array type (or other means of having a collection of storage slots which are addressable sequentially via integer index), and that slots of the latter should be able to hold the former, and one accepts the possibility that one may have to read some slots of an array of pointers/references before sensible values exist for all of them, then there will be programs which, from a compiler's perspective, will read an array slot before a sensible value has been written to it (trying to ascertain in the general case whether an array slot could be read before it is written would be equivalent to the Halting Problem).
While it would be possible for a language to require that all array slots be initialized with some non-null reference before any of them could be read, in many situations there isn't really anything that could be stored which would be better than null: if an attempt is made to read an as-yet-unwritten array slot and dereference the (non)item contained there, that represents an error, and it would be better to have the system trap that condition than to access some arbitrary object whose sole purpose for existence is to give the array slots some non-null thing they can reference.
