Javascript's version of f-strings allows for string escaping through use of a somewhat funny API, e.g.
function escape(str) {
var div = document.createElement('div');
div.appendChild(document.createTextNode(str));
return div.innerHTML;
}
function escapes(template, ...expressions) {
return template.reduce((accumulator, part, i) => {
return accumulator + escape(expressions[i - 1]) + part
})
}
var name = "Bobby <img src=x onerr=alert(1)></img> Arson"
element.innerHTML = escapes`Hi, ${name}` # "Hi, Bobby <img src=x onerr=alert(1)></img> Arson"
Does Python f-strings allow for a similar mechanism? or do you need to bring your own string.Formatter? Would a more pythonic implementation wrap results into a class with an overriden __str__() method before interpolation?
When you're dealing with text that is going to be interpreted as code (e.g., text that the browser will parse as HTML or text that a database executes as SQL), you don't want to solve security issues by implementing your own escaping mechanism. You want to use the standard, widely tested tools to prevent them. This gives you much greater safety from attacks for several reasons:
The wide adoption means the tools are well tested and much less likely to contain bugs.
You know they have the best available approach to solving the problem.
They will help you avoid the common mistakes associated with generating the strings yourself.
HTML escaping
The standard tools for HTML escaping are templating engines, such as Jinja. The major advantage is that these are designed to escape text by default, rather than requiring you to remember to explicitly convert unsafe strings. (You do need to be cautious about bypassing or disabling, even temporarily, the escaping, though. I have seen my share of insecure attempts to insecurely construct JSON in templates, but the risk in templates is still lower than a system that requires explicit escaping everywhere.) Your example is pretty easy to implement with Jinja:
import jinja2
template_str = 'Hi, {{name}}'
name = "Bobby <img src=x onerr=alert(1)></img> Arson"
jinjaenv = jinja2.Environment(autoescape=jinja2.select_autoescape(['html', 'xml']))
template = jinjaenv.from_string(template_str)
print(template.render(name=name))
# Hi, Bobby <img src=x onerr=alert(1)></img> Arson
If you're generating HTML, though, chances are you're using a web framework such as Flask or Django. These frameworks include a templating engine and will require less set up than the above example.
MarkupSafe is a useful tool if you're trying to create your own template engine (Some Python templating engines use it internally, such as Jinja.), and you could potentially integrate it with a Formatter. But there's no reason to reinvent the wheel. Using a popular engine will result in much simpler, easier to follow, more recognizable code.
SQL injection
SQL injection is not solved through escaping. PHP has a nasty history that everyone has learned from. The lesson is use parameterized queries instead of trying to escape input. This prevents untrusted user data from ever being parsed as SQL code.
How you do this depends on exactly what libraries you're using for executing your queries, but for an example, doing so with SQLAlchemy's execute method looks like this:
session.execute(text('SELECT * FROM thing WHERE id = :thingid'), thingid=id)
Note that SQLAlchemy is not just escaping the text of id to ensure it does not contain attack code. It is actually differentiating between the SQL and the value for the database server. The database will parse the query text as a query, and then it will include the value separately after the query has been parsed. This makes it impossible for the value of id to trigger unintended side effects.
Note also that quoting issues are precluded by parameterized queries:
name = 'blah blah blah'
session.execute(text('SELECT * FROM thing WHERE name = :thingname'), thingname=name)
If you can't parameterize, whitelist in memory
Sometimes, it's not possible to parameterize something. Maybe you're trying to dynamically select a table name based on the input. In these cases, one thing you can do is have a collection of known valid and safe values. By validating that the input is one of these values and retrieving a known safe representation of it, you avoid sending user input into your query:
# This could also be loaded dynamically if needed.
valid_tables = {
# Keys are uppercased for look up
'TABLE1' : 'table1',
'TABLE2': 'Table2',
'TABLE3': 'TaBlE3',
...
}
def get_table_name(table_num):
table_name = 'TABLE' + table_num
try:
return valid_tables[table_name]
except KeyError:
raise 'Unknown table number: ' + table_num
def query_for_thing(session, table_num):
return session.execute(text('SELECT * FROM "{}"'.format(get_table_name(table_num))
The point is you never want to allow user input to go into your query as something other than a parameter.
Make sure that this whitelisting occurs in application memory. Do not perform the whitelisting in SQL itself. Whitelisting in the SQL is too late; by that time, the input has already been parsed as SQL, which would allow the attacks to be invoked before the whitelisting could take effect.
Make sure you understand your library
In the comments, you mentioned PySpark. Are you sure you're doing this right? If you create a data frame using just a simpler SELECT * FROM thing and then use PySpark filtering functions, are you sure it doesn't properly push those filters down to the query, precluding the need to format values into it unparameterized?
Make sure you understand how data is normally filtered and manipulated with your library, and check if that mechanism will use parameterized queries or otherwise be efficient enough under the hood.
With small data, just filter in memory
If your data isn't at least in the tens of thousands of records, then consider just loading it into memory and then filtering:
filter_name = 'blah blah blah'
results = session.execute(text('SELECT * FROM thing'))
filtered_results = [r for r in results if r.name == filter_name]
If this is fast enough and parameterizing the query is hard, then this approach avoids all the security headaches of trying to make the input safe. Test its performance with somewhat more data than you expect to see in prod. I would use at least double of the maximum you expect; an order of magnitude would be even safer if you can make it perform.
If you're stuck without parameterized query support, the last resort is very strict limits on inputs
If you're stuck with a client that doesn't support parameterized queries, first check if you can use a better client. SQL without parameterized queries is absurd, and it's an indication that the client you're using is very low quality and probably not well maintained; it may not even be widely used.
Doing the following is NOT recommended. I include it only as an absolute last resort. Don't do this if you have any other choice, and spend as much time as you can (even a couple of weeks of research, I dare say) trying to avoid resorting to this. It requires a very high level of diligence on the part of every team member involved, and most developers do not have that level of diligence.
If none of the above is a possibility, then the following approach may be all you can do:
Do not query on text strings coming from the user. There is no way to make this safe. No amount of quoting, escaping, or restricting is guaranteed. I don't know all the details, but I've read of the existence of Unicode abuses that can allow bypassing character restrictions and the like. It's just not worth it to try. The only text strings allowed should be whitelisted in application memory (as opposed to whitelisted via some SQL or database function). Note that even leveraging database level quoting functions (like PostgreSQL's quote_literal) or stored procedures can't help you here because the text has to be parsed as SQL to even reach those functions, which would allow the attacks to be invoked before the whitelisting could take effect.
For all other data types, parse them first and then have the language render them into an appropriate string. Doing so again means avoiding having user input parsed as SQL. This requires you to know the data type of the input, but that's reasonable since you'll need to know that to construct the query. In particular, the available operations with a particular column will be determined by that column's data types, and the operation and column type will determine what data types are valid for the input.
Here's an example for a date:
from datetime import datetime
def fetch_data(start_date, end_date):
# Check data types to prevent injections
if not isinstance(start_date, datetime):
raise ValueError('start_date must be a datetime')
if not isinstance(end_date, datetime):
raise ValueError('end_date must be a datetime')
# WARNING: Using format with SQL queries is bad practice, but we don't
# have a choice because [client lib] doesn't support parameterized queries.
# To mitigate this risk, we do not allow arbitrary strings as input.
# We tightly control the input's data type (to something other than text or binary) and the format used in the query.
session.execute(text(
"SELECT * FROM thing WHERE timestamp BETWEEN CAST('{start}' AS TIMESTAMP) AND CAST('{end}' AS TIMESTAMP)"
.format(
# Make the format used explicit
start=start_date.strftime('%Y-%m-%dT%H:%MZ'),
end=end_date.strftime('%Y-%m-%dT%H:%MZ')
)
))
user_input_start_date = '2019-05-01T00:00'
user_input_end_date = '2019-06-01T00:00'
parsed_start_date = datetime.strptime(user_input_start_date, "%Y-%m-%dT%H:%M")
parsed_end_date = datetime.strptime(user_input_end_date, "%Y-%m-%dT%H:%M")
data = fetch_data(parsed_start_date, parsed_end_date)
There's several details that you need to be aware of.
Notice that in the same function as the query, we're validating the data type. This is one of the rare exceptions in Python where you don't want to trust duck typing. This is a safety feature that ensures insecure data won't be passed into your function accidentally.
The format passed of the input when it's rendered into the SQL string is explicit. Again, this is about control and whitelisting. Don't leave it to any other library to decide what format the input will be rendered to; make sure you know exactly what the format is so that you can be certain that injections are impossible. I'm fairly certain that there's no injection possibility with the ISO 8601 date/time format, but I haven't confirmed that explicitly. You should confirm that.
The quoting of the values is manual. That's okay. And the reason it's okay is because you know what data types you're dealing with and you know exactly what the string will look like after it's formatted. This is by design: you're maintaining very strict, very tight control over the input's format to prevent injections. You know whether quotes need to be added or not based on that format.
Don't skip the comment about how bad this practice is. You have no idea who will read this code later and what knowledge or abilities they have. Competent developers who understand the security risks here will appreciate the warning; developers who weren't aware will be warned to use parameterized queries whenever available and to avoid carelessly including new conditions. If at all feasible, require that changes to these areas of code be reviewed by additional developers to further mitigate the risks.
This function should have full control over generating the query. It should not delegate its construction out to other functions. This is because the data type checking needs to be kept very, very close to the construction of the query to avoid mistakes.
The effect of this is a sort of looser whitelisting technique. You can't whitelist specific values, but you can whitelist the kinds of values you're working with and control the format they're delivered in. Forcing callers to parse the values into a known data type reduces the possibility of an attack getting through.
I'll also note that callering code is free to accept the user input in whatever format is convenient and to parse it using whatever tools you wish. That's one of the advantages of requiring a dedicated data type instead of strings for input: you don't lock callers into a particular string format, just the data type. For date/times in particular, you might consider some third party libraries.
Here's another example with a Decimal value instead:
from decimal import Decimal
def fetch_data(min_value, max_value):
# Check data types to prevent injections
if not isinstance(min_value, Decimal):
raise ValueError('min_value must be a Decimal')
if not isinstance(max_value, Decimal):
raise ValueError('max_value must be a Decimal')
# WARNING: Using format with SQL queries is bad practice, but we don't
# have a choice because [client lib] doesn't support parameterized queries.
# To mitigate this risk, we do not allow arbitrary strings as input.
# We tightly control the input's data type (to something other than text or binary) and the format used in the query.
session.execute(text(
"SELECT * FROM thing WHERE thing_value BETWEEN CAST('{minv}' AS NUMERIC(26, 16)) AND CAST('{maxv}' AS NUMERIC(26, 16))"
.format(
# Make the format used explicit
# Up to 16 decimal places. Maybe validate that at start of function?
minv='{:.16f}'.format(min_value),
maxv='{:.16f}'.format(max_value)
)
))
user_input_min = '78.887'
user_input_max = '89789.78878989'
parsed_min = Decimal(user_input_min)
parsed_max = Decimal(user_input_max)
data = fetch_data(parsed_min, parsed_max)
Everything is basically the same. Just a slightly different data type and format. You're free to use whatever data types your database supports, of course. For example, if your DB does not require specifying a scale and precision on the numeric type or would auto-cast a string or can handle the value unquoted, you can structure your query accordingly.
You do not need to bring your own formatter if you're using python 3.6 or newer. Python 3.6 introduced formatted string literals, see PEP 498: Formatted string literals.
Your example in python 3.6 or newer would look like this:
name = "Bobby <img src=x onerr=alert(1)></img> Arson"
print(f"Hi, {name}") # Hi, Bobby <img src=x onerr=alert(1)></img> Arson
The format specification that can be used with str.format() can also be used with formatted string literals.
This example,
my_dict = {'A': 21.3, 'B': 242.12, 'C': 3200.53}
for key, value in my_dict.items():
print(f"{key}{value:.>15.2f}")
will print the following:
A..........21.30
B.........242.12
C........3200.53
Additionally, since the string is evaluated at runtime, any valid python expression can be used, for example,
name = "Abby"
print(f"Hello, {name.upper()}!")
will print
Hello, ABBY!
Related
When generating JOOQ classes via JOOQ code gen, for each field, there will be a SQLDataType associated with it like below.
public final TableField<EventsRecord, LocalDateTime> CREATED_AT = createField(DSL.name("CREATED_AT"), SQLDataType.LOCALDATETIME(6).nullable(false), this, "");
What's the usage or purpose to have SQLDataType with each generated field? Since we already have a return type and client code is likely to use the this type to do the compile check.
Why we still need to know the actual SQLDataType in generated class/fields?
By client type, you probably mean the LocalDateTime type, i.e. the <T> type that you will find throughout the jOOQ API. Sure, that's the type you care about, but jOOQ, internally, will care about the org.jooq.DataType instead. Your example already gives away two ideas why this may be useful:
There's a precision of 6 fractional digits on LOCALDATETIME(6), which is used (among other things):
In CAST expressions. Try DSL.cast(inline("2000-01-01 00:00:00"), EVENTS.CREATED_AT),
In DDL statements. Try DSLContext.meta(EVENTS). You should see a CREATE TABLE statement with the appropriate data type
In the optimistic locking feature, to create modification timestamps with the right precision.
There's an indication whether the column is nullable, which is used (again among other things):
In DDL statements, see above
In the implicit join feature, to decide whether to produce an INNER JOIN or a LEFT JOIN
There are many other properties a DataType can have, which would be interesting for jOOQ at runtime including:
Custom data type bindings
Character set
Collation
Converters
Default value
Whether it is an identity
Besides, a String is not a String. For example, it could mean CHAR(2), CHAR(5), VARCHAR(100), CLOB, which are all quite different things in some dialects.
It would be a shame if your runtime meta model didn't have this information available.
I have user-provided format strings, and for each, I have a corresponding slice. For instance, I might have Test string {{1}}: {{2}} and ["number 1", "The Bit Afterwards"]. I want to generate Test string number 1: The Bit Afterwards from this.
The format of the user-provided strings is not fixed, and can be changed if need be. However, I cannot guarantee their sanity or safety; neither can I guarantee that any given character will not be used in the string, so any tags (like {} in my example) must be escapable. I also cannot guarantee that the same number of slice values will exist as tags in the template - for example, I might quite reasonably have Test string {{1}} and ["number 1", "another parameter", "yet another parameter"].
How can I efficiently format these strings, in accordance with the input given? They are for use as strings only, and don't require HTML, SQL or any other sort of escaping.
Things I've already considered:
fmt.Sprintf - two issues: 1) using it with user-provided templates is not ideal; 2) Sprintf does not play nicely with a number of parameters that doesn't match its format string, adding %!(EXTRA type=value) to the end.
The text/template library. This would work fine in theory, but I don't want to have to make users type out {{index .arr n}} for each and every one of their tags; in this case, I only ever need slice indexes.
The valyala/fasttemplate library. This is pretty much exactly what I'm looking for, but for the fact that it doesn't currently support escaping the delimiters it uses for its tags, at the time of writing. I've opened an issue for this, but I would have thought that there's already a solution to this problem somewhere - it doesn't feel like it's that unique.
Just writing my own parser for it. This would work... but, as above, I can't be the first person to have come across this!
Any advice or suggestions would be greatly appreciated.
I'm building a CRUD application that pulls data using Persistent and executes a number of fairly complicated queries, for instance using window functions. Since these aren't supported by either Persistent or Esqueleto, I need to use raw sql.
A good example is that I want to select rows in which the value does not deviate strongly from the previous value, so in pseudo-sql the condition is WHERE val - lag(val) <= x. I need to run this selection in SQL, rather than pulling all data and then filtering in Haskell, because otherwise I'd have way to much data to handle.
These queries return many columns. However, the RawSql instance maxes out at tuples with 8 elements. So now I am writing additional functions from9, to9, from10, to10 and so on. And after that, all these are converted using functions with type (Single a, Single b, ...) -> DesiredType. Even though this could be shortened using code generation, the approach is simply hacky and clearly doesn't feel like good Haskell. This concerns me because I think most of my queries will require rawSql.
Do you have suggestions on how to improve this? Currently, my main thought is to un-normalize the database and duplicate data, e.g. by including the lagged value as column, so that I can query the data with Esqueleto.
It's not uncommon to have a record like this:
TAddress = record
Address: string[50];
City : string[20];
State : string[2];
ZIP : string[5];
end;
Where it was nice to have hard-coded string sizes to ensure that the size of the string wouldn't exceed the database field size allotted for the data.
Given, however, that the ShortString type has been deprecated, what are Delphi developers doing to "solve" this problem? Declaring the record fields as string gets the job done, but doesn't protect the data from exceeding the proper length.
What is the best solution here?
If I had to keep data from exceeding the right length, I'd let the database code handle it as much as possible. Put size limits on the fields, and display data to the user in data-bound controls. A TDBEdit bound to a string field will enforce the length limit correctly. Set it up so the record gets populated directly from the dataset, and it will always have the right length.
Then all you need to worry about is data coming into the record from some outside source that is not part of your UI. For that, use the same process. Have the import code insert the data into the dataset, and let its length constraints do the validation for you. If it raises an exception, reject the import. If not, then you've got a valid dataset row that you can use to populate a record from.
The short string types in your question don't really protect the strings from exceeding the proper length. When you assign a longer value to these short strings, the value is silently truncated.
I'm not sure what database access method you are using but I rather imagine that it will do the same thing. Namely truncate any over-length strings to the maximum length. In which case there is nothing to do.
If your database access method throws an error when you give it an over long string then you would need to truncate before passing the value to the database.
If you have to truncate explicitly, then there are lots of places where you might choose to do so. My philosophy would be to truncate at the last possible moment. That's the point at which you are subject to the limit. Truncating anywhere else seems wrong. It means that a database limitation is spreading to parts of the code that are not obviously related to the database.
Of course, all this is based on the assumption that you want to carry on silently truncating. If you want to do provide user feedback in the event of truncation, then you will need to decide just where are the right points to action that feedback.
From my understanding, my answer should be "do not mix layers".
I suspect that the string length is specified at the database layer level (a column width), or at the business application layer (e.g. to validate a card number).
From the "pure Delphi code" point of view, you should not know that your string variable has a maximum length, unless you reach the persistence layer or even the business layer.
Using attributes could be an idea. But it may "pollute" the source code for the very same reason that it is mixing layers.
So what I recommend is to use a dedicated Data Modeling, in which you specify your data expectations. Then, at the Delphi var level, you just define a plain string. This is exactly how our mORMot framework implements data filtering and validation: at Model level, with some dedicated classes - convenient, extendable and clean.
If you're just porting from Delphi 7 to XE3, leave it be. Also, although "ShortString" may be deprecated, I'll eat my hat if they ever remove it completely, because there are a lot of bits of code that will never be able to be rebuilt without it. ShortString + Records is still the only practical way to specify a byte-oriented file-of-record data storage. Delphi will NEVER remove ShortString nor change its behaviour, it would be devastating to existing delphi code. So if you really must define records and limit their length, and you really don't want those records to support Unicode, then there is zero reason to stop using or stop writing ShortString code. That being said, I detest short-strings, and File-of-record, wish they would go away, and am glad they are marked deprecated.
That being said, I agree with mason and David entirely; I would say, Length checking, and validation are presentation/validation concerns, and Delphi's strong typing is NOT the right place or the right way to deal with them. If you need to put validation constraints on your classes, write helper classes that implement constraint-storage (EmployeeName is a string field and EmployeeName has the following length limit). In Edit controls for example, this is already a property. It seems to me that mapping DB Fields to visual fields, using the new Binding system would be much preferable to trying to express constraints statically in the code.
User input validation and storage are different and length limits should be set in your GUI controls not in your data structures.
You could for example use Array of UnicodeChar, if you wanted to have a Unicode wide but length limited string. You could even write your own LimitedString class using the new class helper methods in Delphi. But such approaches are not a maintainable and stable design.
If your SQL database has a field declared with VARCHAR(100) type, and you want to limit your user's input to 100 characters, you should do so at the GUI layer and forget about imposing truncation (data corruption, in fact) silently behind the scenes.
I had this problem - severely - upgrading from Delphi 6 to 2009 for what one program was/is doing it was imperative to be able to treat the old ASCII strings as individual ASCII characters.
The program outputs ASCII files (NO ANSI even) and has concepts such as over-punch on the last numeric digit to indicate negative. So the file format goes back a bit one could say!
After the first build in 2009 (10 year old code, well, you do don't you!) after sorting unit names etc there were literally hundreds of reported errors/ illegal assignments and data loss / conversion warnings...
No matter how good Delphi's back-room manipulation/magic with strings and chars I did not trust it enough. In the end to make sure everything was back as it was I re-declared them all as array of byte and then changed the code accordingly.
You haven't specified the delphi version, here's what works for me in delphi 2010:
Version1:
TTestRecordProp = record
private
FField20: string;
...
FFieldN: string
procedure SetField20(const Value: string);
public
property Field20: string read FField20 write SetField20;
...
property FieldN: string ...
end;
...
procedure TTestRecordProp.SetField20(const Value: string);
begin
if Length(Value) > 20 then
/// maybe raise an exception?
FField20 := Copy(FField20, 1, 20)
else
FField20 := Value;
end;
Version2:
TTestRecordEnsureLengths = record
Field20: string;
procedure EnsureLengths;
end;
...
procedure TTestRecordEnsureLengths.EnsureLengths;
begin
// for each string field, test it's length and truncate or raise exception
if Length(Field20) > 20 then
Field20 := Copy(Field20, 1, 20); // or raise exception...
end;
// You have to call .EnsureLength before push data to db...
Personally, I'd recommend replacing records with objects, then you can do more tricks.
Suppose I have an object representing a person, with getter and setter methods for the person's email address. The setter method definition might look something like this:
setEmailAddress(String emailAddress)
{
this.emailAddress = emailAddress;
}
Calling person.setEmailAddress(0), then, would generate a type error, but calling person.setEmailAddress("asdf") would not - even though "asdf" is in no way a valid email address.
In my experience, so-called strings are almost never arbitrary sequences of characters, with no restriction on length or format. URIs come to mind - as do street addresses, as do phone numbers, as do first names ... you get the idea. Yet these data types are most often stored as "just strings".
Returning to my person object, suppose I modify setEmailAddress() like so
setEmailAddress(EmailAddress emailAddress)
// ...
where EmailAddress is a class ... whose constructor takes a string representation of an email address. Have I gained anything?
OK, so an email address is kind of a bad example. What about a URI class that takes a string representation of a URI as a constructor parameter, and provides methods for managing that URI - setting the path, fetching a query parameter, etc. The validity of the source string becomes important.
So I ask all of you, how do you deal with strings that have structure? And how do you make your structural expectations clear in your interfaces?
Thank you.
"Strings with structure" are a symptom of the common code smell "Primitive Obsession".
The remedy is to watch closely for duplication in code that validates or manipulates parts of these structures. At the first hint of duplication - but not before - extract a class that encapsulates the structure and locate validations and queries there.
Welcome to the world of programming!
I don't think your question is a symptom of an error on your part. Rather it is a basic problem which appears in many guises throughout the programming world. Strings that have some structure and meaning are passed around between different subsystems of an application and each subsystem can only do much parsing and validation.
The problem of verifying an email address, for example, is quite tricky. The regular expressions various people offer accepting an email address, for example, are generally either "too tight" (don't accept everything) or "too loose" (accept illegal things). The first google hit for 'regex "email address"', for example says:
The regular expression I receive the
most feedback, not to mention "bug"
reports on, is the one you'll find
right on this site's home page:
\b[A-Z0-9._%+-]+#[A-Z0-9.-]+.[A-Z]{2,4}\b Analyze this regular expression with
RegexBuddy. This regular expression, I
claim, matches any email address. Most
of the feedback I get refutes that
claim by showing one email address
that this regex doesn't match.
The fact is the what is or isn't a valid email address is a complex problem, one that a given program might or might not want to solve. The problem of URLs is even worse, especially given the possibility of malicious URLS.
Ideally, you can have a library or system-call which solves problems of this sort instead of doing anything yourself (Microsoft windows calls a custom dialogue box to allow the user to select or create a file, since validating file names is another tricky problem). But you can't always count on having an appropriate system call for a given "meaningful string" either.
I would say that there no a generic solution to the problem of strings-with-structure. Rather, it is a basic problem that appears right when you design your application. In the process of gathering requirements for your application, you should determine what data the application will take in and how meaningful that data will be to the application. And this is where things get tricky, since you may notice the possibility that the app may grow in ways that your boss or customer might not have thought of - or the app may in fact grow in ways that none of you thought of. Thus the application needs to be a little more flexible than what seems like the minimum BUT only a little. It should also not be so flexible you get bogged down.
Now, if you decide that you need to validate/interpret etc a given string, putting that string into an object or a hash can be a good approach - this is one way I know to make sure your interface is clear. But the tricky thing is deciding just how much validation or interpretation you need.
Making these decisions is thus an art - there are no dogmatic answers that work here.
This is a pretty common problem falling under the title 'validation' - there are many ways to validate textual user input, one of the most common being Regular Expressions.
You might also consider using the built-in System.Net.MailAddress class for this, as it provides validation for email addresses.
Strings are strings. If you need your strings to be smarter than average strings then parsing them into a structural object like you describe would be a good idea. I would use a regex to do that.
Regular expressions are your friend when it comes to formatting strings. you could also store each part separately in a struct to avoid going through the trouble of using regular expressions every time you want to use them. e.g.
struct EMail
{
String BeforeAt = "johndoe123";
String AfterAt = "gmail.com";
}
Struct URL
{
String Protocol = "http";
String Domain = "sub.example.com";
String Path = "stuff/example.html";
}
Well, if you want to do several different kinds of things with an EmailAddress object, those other actions do not have to check if it is a valid email address since the EmailAddress object is guaranteed to have a valid string. You could throw an exception in the constructor or use a factory method or whatever "One True Methodology" approach you're using.
Personally, I like the idea of strong typing, so if I were still working in such languages I'd go with the style of your second example. The only thing I'd change might be to use a more "cast-like" structure, like EmailAddressFromString(String), that generated a new EmailAddress object (or pitched a fit if the string wasn't right), as I'm a bit of a fan of application Hungarian notation.
This whole problem, incidentally, is covered pretty well by Joel in http://www.joelonsoftware.com/articles/Wrong.html if you're interested.
I agree with the calls to strongly type the object, but for those cases where you're parsing from a string to an object, the answer is simple: error handling.
There are two general ways to handle errors: exceptions and return conditions. Generally if you expect to receive badly formed data, then you should return an error message. For cases where the input is not expected, then I would throw an exception. For example, you might pass in an ill formed email address, such as 'bob' instead of 'bob#gmail.com'. However, for null values, you might throw an exception, as you shouldn't try to form an email out of null.
Returning to your question, I do think you gain something by encoding a structure into an object. Specifically, you only need to validate that the string represents a valid email address in one specific place, such as the constructor. Elsewhere, your code is free to assume that an EmailAddress object is valid, and you don't have to rely upon dodgy classes with names like 'EmailHelper' or some such.
I personally do not think strong-typing the email address string as EmailAddress is necessary, in this case.
To create your email address you will, sooner or later, have to do something like:
EmailAddress(String email)
or a setter
SetEmailAddress(String email)
In both cases, you'll have to validate the email string input, which puts you back into your initial validation problem.
I would, as others pointed out, use regular expressions.
Having an EmailAddress class would be useful if you plan on having to perform specific operations on your stored information later on (say get domain name only, stuff like that).