Efficient conversion of int to string - string

I've seen several questions/answers here that suggest the best way to get a string representation of an integer in Objective-C is to use [NSString stringWithFormat:#"%d", x]. I'm afraid the C/C++ programmer in me is having a hard time believing I want to bring all the formatting code into play for such a simple task. That is, I assume stringWithFormat needs to parse through the format string looking for all the different type specifiers, field widths, and options that I could possibly use, then it has to be able to interpret that variable length list of parameters and use the format specifier to coerce x to the appropriate type, then go through a lengthy process of conversion, accounting for signed/unsigned values and negation along the way.
Needless to say in C/C++ I could simply use itoa(x) which does exactly one thing and does it extremely efficiently.
I'm not interested in arguing the relative merits of one language over another, but rather just asking the question: is the incredibly powerful [NSString stringWithFormat:#"%d", x] really the most efficient way to do this very, very simple task in Objective-C? Seems like I'm cracking a peanut with a sledge hammer.

You could use itoa() followed by any of +[NSString stringWithUTF8String:] -[NSString initWithBytes:length:encoding: or +[NSString stringWithCString:encoding:] if it makes you feel better, but I wouldn't worry about it unless you're sure this is a performance problem.

You could also use description method. It boxes it as NSNumber and converts to a NSString.
int intVariable = 1;
NSString* stringRepresentation = [#(intVariable) description];

Related

`Data.Text` vs `Data.Vector.Unboxed Char`

Is there any difference in how Data.Text and Data.Vector.Unboxed Char work internally? Why would I choose one over the other?
I always thought it was cool that Haskell defines String as [Char]. Is there a reason that something analagous wasn't done for Text and Vector Char?
There certainly would be an advantage to making them the same.... Text-y and Vector-y tools could be written to be used in both camps. Imagine Ropes of Ints, or Regexes on strings of poker cards.
Of course, I understand that there were probably historical reasons and I understand that most current libraries use Data.Text, not Vector Char, so there are many practical reasons to favor one over the other. But I am more interested in learning about the abstract qualities, not the current state that we happen to be in.... If the whole thing were rewritten tomorrow, would it be better to unify the two?
Edit, with more info-
To put stuff into perspective-
According to this page, http://www.haskell.org/haskellwiki/GHC/Memory_Footprint, GHC uses 16 bytes for each Char in your program!
Data.Text is not O(1) index'able, it is O(n).
Ropes (binary trees wrapped around text) can also hold strings.... They have better complexity for index/insert/delete, although depending on the number of nodes and balance of the tree, index could be close to that of Text.
This is my takeaway from this-
Text and Vector Char are different internally....
Use String if you don't care about performance.
If performance is important, default to using Text.
If fast indexing of chars is necessary, and you don't mind a lot of memory overhead (up to 16x), use Vector Char.
If you want to insert/delete a lot of data, use Ropes.
It's a fairly bad idea to think of Text as being a list of characters. Text is designed to be thought of as an opaque, user-readable blob of Unicode text. Character boundaries might be defined based on encoding, locale, language, time of month, phase of the moon, coin flips performed by a blinded participant, and migratory patterns of Venezuela's national bird whatever it may be. The same story happens with sorting, up-casing, reversing, etc.
Which is a long way of saying that Text is an abstract type representing human language and goes far out of its way to not behave just the same way as its implementation, be it a ByteString, a Vector UTF16CodePoint, or something totally unique (which is the case).
To clarify this distinction take note that there's no guarantee that unpack . pack witnesses an isomorphism, that the preferred ways of converting from Text to ByteString are in Data.Text.Encoding and are partial, and that there's a whole sophisticated plug-in module text-icu littered with complex ways of handling human language strings.
You absolutely should use Text if you're dealing with a human language string. You should also be really careful to treat it with care since human language strings are not easily amenable to computer processing. If your string is better thought of as a machine string, you probably should use ByteString.
The pedagogical advantages of type String = [Char] are high, but the practical advantages are quite low.
To add to what J. Abrahamson said, it's also worth making the distinction between iterating over runes (roughly character by character, but really could be ideograms too) as opposed to unitary logical unicode code points. Sometimes you need to know if you're looking at a code point that has been "decorated" by a previous code point.
In the case of the latter, you then have to make the distinction between code points that stand alone (such as letters, ideograms) and those that modify the text that follows (right-to-left code point, diacritics, etc).
Well implemented unicode libraries will typically abstract these details away and let you process the text in a more or less character-by-character fashion but you have to drop certain assumptions that come from thinking in terms of ASCII.
A byte is not a character. A logical unit of text isn't necessarily a "character". Not every code point stands alone, some decorate/annotate the following code point or even the rest of the byte stream until invalidated (right-to-left).
Unicode is hard. There is no one true encoding that will eliminate the difficulty of encapsulating the variety inherent in human language. Data.Text does a respectable job of it though.
To summarize:
The methods of processing are:
byte-by-byte - totally invalid for unicode, only applicable to latin-1/ASCII
code point by code point - works for processing unicode, but is lower-level than people realize
logical rune-by-rune - what you actually want
The types are:
String (aka [Char]) - has a limited scope. Best used for teaching Haskell or for legacy use-cases.
Text - the preferred way to handle "human" text.
Bytestring - for byte streams, raw data, binary etc.

Data.Text vs String

While the general opinion of the Haskell community seems to be that it's always better to use Text instead of String, the fact that still the APIs of most of maintained libraries are String-oriented confuses the hell out of me. On the other hand, there are notable projects, which consider String as a mistake altogether and provide a Prelude with all instances of String-oriented functions replaced with their Text-counterparts.
So are there any reasons for people to keep writing String-oriented APIs except backwards- and standard Prelude-compatibility and the "switch-making inertia"?
Are there possibly any other drawbacks to Text as compared to String?
Particularly, I'm interested in this because I'm designing a library and trying to decide which type to use to express error messages.
My unqualified guess is that most library writers don't want to add more dependencies than necessary. Since strings are part of literally every Haskell distribution (it's part of the language standard!), it is a lot easier to get adopted if you use strings and don't require your users to sort out Text distributions from hackage.
It's one of those "design mistakes" that you just have to live with unless you can convince most of the community to switch over night. Just look at how long it has taken to get Applicative to be a superclass of Monad – a relatively minor but much wanted change – and imagine how long it would take to replace all the String things with Text.
To answer your more specific question: I would go with String unless you get noticeable performance benefits by using Text. Error messages are usually rather small one-off things so it shouldn't be a big problem to use String.
On the other hand, if you are the kind of ideological purist that eschews pragmatism for idealism, go with Text.
* I put design mistakes in scare quotes because strings as a list-of-chars is a neat property that makes them easy to reason about and integrate with other existing list-operating functions.
If your API is targeted at processing large amounts of character oriented data and/or various encodings, then your API should use Text.
If your API is primarily for dealing with small one-off strings, then using the built-in String type should be fine.
Using String for large amounts of text will make applications using your API consume significantly more memory. Using it with foreign encodings could seriously complicate usage depending on how your API works.
String is quite expensive (at least 5N words where N is the number of Char in the String). A word is same number of bits as the processor architecture (ex. 32 bits or 64 bits):
http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.html
There are at least three reasons to use [Char] in small projects.
[Char] does not rely on any arcane staff, like foreign pointers, raw memory, raw arrays, etc that may work differently on different platforms or even be unavailable altogether
[Char] is the lingua franka in haskell. There are at least three 'efficient' ways to handle unicode data in haskell: utf8-bytestring, Data.Text.Text and Data.Vector.Unboxed.Vector Char, each requiring dealing with extra package.
by using [Char] one gains access to all power of [] monad, including many specific functions (alternative string packages do try to help with it, but still)
Personally, I consider utf16-based Data.Text one of the most questionable desicions of the haskell community, since utf16 combines flaws of both utf8 and utf32 encoding while having none of their benefits.
I wonder if Data.Text is always more efficient than Data.String???
"cons" for instance is O(1) for Strings and O(n) for Text. Append is O(n) for Strings and O(n+m) for strict Text's. Likewise,
let foo = "foo" ++ bigchunk
bar = "bar" ++ bigchunk
is more space efficient for Strings than for strict Texts.
Other issue not related to efficiency is pattern matching (perspicuous code) and lazyness (predictably per-character in Strings, somehow implementation dependent in lazy Text).
Text's are obviously good for static character sequences and for in-place modification. For other forms of structural editing, Data.String might have advantages.
I do not think there is a single technical reason for String to remain.
And I can see several ones for it to go.
Overall I would first argue that in the Text/String case there is only one best solution :
String performances are bad, everyone agrees on that
Text is not difficult to use. All functions commonly used on String are available on Text, plus some useful more in the context of strings (substitution, padding, encoding)
having two solutions creates unnecessary complexity unless all base functions are made polymorphic. Proof : there are SO questions on the subject of automatic conversions. So this is a problem.
So one solution is less complex than two, and the shortcomings of String will make it disappear eventually. The sooner the better !

Compile-time constraints for strings in F#, similar to Units of Measure - is it possible?

I'm developing a Web application using F#. Thinking of protecting user input strings from SQL, XSS, and other vulnerabilities.
In two words, I need some compile-time constraints that would allow me discriminate plain strings from those representing SQL, URL, XSS, XHTML, etc.
Many languages have it, e.g. Ruby’s native string-interpolation feature #{...}.
With F#, it seems that Units of Measure do very well, but they are only available for numeric types.
There are several solutions employing runtime UoM (link), however I think it's an overhead for my goal.
I've looked into FSharpPowerPack, and it seems quite possible to come up with something similar for strings:
[<MeasureAnnotatedAbbreviation>] type string<[<Measure>] 'u> = string
// Similarly to Core.LanguagePrimitives.IntrinsicFunctions.retype
[<NoDynamicInvocation>]
let inline retype (x:'T) : 'U = (# "" x : 'U #)
let StringWithMeasure (s: string) : string<'u> = retype s
[<Measure>] type plain
let fromPlain (s: string<plain>) : string =
// of course, this one should be implemented properly
// by invalidating special characters and then assigning a proper UoM
retype s
// Supposedly populated from user input
let userName:string<plain> = StringWithMeasure "John'); DROP TABLE Users; --"
// the following line does not compile
let sql1 = sprintf "SELECT * FROM Users WHERE name='%s';" userName
// the following line compiles fine
let sql2 = sprintf "SELECT * FROM Users WHERE name='%s';" (fromPlain userName)
Note: It's just a sample; don't suggest using SqlParameter. :-)
My questions are: Is there a decent library that does it? Is there any possibility to add syntax sugar?
Thanks.
Update 1: I need compile-time constraints, thanks Daniel.
Update 2: I'm trying to avoid any runtime overhead (tuples, structures, discriminated unions, etc).
A bit late (I'm sure there's a time format where there is only one bit different between February 23rd and November 30th), I believe these one-liners are compatible for your goal:
type string<[<Measure>] 'm> = string * int<'m>
type string<[<Measure>] 'm> = { Value : string }
type string<[<Measure>] 'm>(Value : string) = struct end
In theory it's possible to use 'units' to provide various kinds of compile-time checks on strings (is this string 'tainted' user input, or sanitized? is this filename relative or absolute? ...)
In practice, I've personally not found it to be too practical, as there are so many existing APIs that just use 'string' that you have to exercise a ton of care and manual conversions plumbing data from here to there.
I do think that 'strings' are a huge source of errors, and that type systems that deal with taintedness/canonicalization/etc on strings will be one of the next leaps in static typing for reducing errors, but I think that's like a 15-year horizon. I'd be interested in people trying an approach with F# UoM to see if they get any benefit, though!
The simplest solution to not being able to do
"hello"<unsafe_user_input>
would be to write a type which had some numeric type to wrap the string like
type mystring<'t>(s:string) =
let dummyint = 1<'t>
Then you have a compile time check on your strings
It's hard to tell what you're trying to do. You said you "need some runtime constraints" but you're hoping to solve this with units of measure, which are strictly compile-time. I think the easy solution is to create SafeXXXString classes (where XXX is Sql, Xml, etc.) that validate their input.
type SafeSqlString(sql) =
do
//check `sql` for injection, etc.
//raise exception if validation fails
member __.Sql = sql
It gives you run-time, not compile-time, safety. But it's simple, self-documenting, and doesn't require reading the F# compiler source to make it work.
But, to answer your question, I don't see any way to do this with units of measure. As far as syntactic sugar goes, you might be able to encapsulate it in a monad, but I think it will make it more clunky, not less.
You can use discriminated unions:
type ValidatedString = ValidatedString of string
type SmellyString = SmellyString of string
let validate (SmellyString s) =
if (* ... *) then Some(ValidatedString s) else None
You get a compile-time check, and adding two validated strings won't generate a validated string (which units of measure would allow).
If the added overhead of the reference types is too big, you can use structs instead.

Why is Yesod.Request.FileInfo (fileContentType) a Text?

I am just starting out with Haskell and Yesod so please forgive if I am missing something obvious.
I am noticing that fileContentType in Yesod.Request.FileInfo is a Text even though Yesod.Content has an explicit ContentType. I'm wondering, why is it not a ContentType instead and what is the cleanest conversion?
Thanks in advance!
This comes down to a larger issue. A lot of the HTTP spec is stated in terms of ASCII. The question is, how do we represent it. There are essentially three choices:
Create a special newtype Ascii around a ByteString. This is the most correct, but also very tedious since it involves a lot of wrapping/unwrapping. We tried this approach, and got a lot of negative feedback.
Use normal ByteStrings. This is efficient, and mostly correct, but allows people to enter non-ASCII binary data.
Use Text (or String). This is the most developer-friendly, but allows you to enter non-ASCII character data. It's a bit less efficient than (1) or (2) due to the encoding/decoding overhead.
Overall, we've been moving towards (3) for most operations, especially for things like sessions keys which are internal to Yesod. You could say that ContentType is an inconsistency and should be changed to Text, but I think it doesn't seem to bother anyone, is a bit more semantic, and a bit faster.
tl;dr: No good reason :)
You have confused the type of Content with the type of ContentType. The fileContent field of FileInfo should be ContentType - and it is, modulo the type alias - the type of fileContentType is Text which would be ContentTypeType. It might help to imaging the last world as a prefixing adjective, so ContentType == the type of Content.

String Class Methods Most Commonly Used

Any programming language. I'm interested in knowing what top 5 methods in a string class are used on data manipulations. Or what top 5 methods does one need to know to be able to handle data manipulation. I know probably all the methods together should be used, but I'm interested to see the 5 most common methods people use.
Thanks for your time.
I'd say
String.Format()
String.Split()
String.IndexOf()
String.Substring()
String.ToUpper()
Adam's top 5 are pretty much mine. I might replace IndexOf() with Trim(); I use this EVERY TIME I get a value from the user. String.Compare() using the IgnoreCase values of the StringComparison enumeration would replace most of the uses I've seen for ToUpper().
Format, HEAVILY used in logs and other user messages (far more efficient for templated messages than a bunch of += statements or a StringBuilder()). Split and Substring, ditto, especially in file processing.
Adam's + KeithS. But don't forget the under the hood calls to
String.hashCode(),
String.equals(String rhs)
and its ilk.

Resources