Testing for non-standard ASCII characters in Common Lisp - string

I need to test a string to see whether it contains any characters with codes above decimal 127 (extended ASCII codes) or below 32. Is there a really nice way to do this, or will I just have to iterate through the whole string and compare the char-codes of the characters? I am using the Common Lisp implementation CCL.

The portable way is, as you suggested yourself,
(defun string-standard-p (string &key (min 32) (max 127))
  (every (lambda (c) (<= min (char-code c) max)) string))
There may be an implementation-specific way, e.g., in CLISP, you can do
(defun string-encodable-p (string encoding)
  (every (lambda (c) (typep c encoding)) string))
(string-encodable-p "foo" charset:ascii)
==> T
although it will actually accept all ASCII characters, not just 32:127.
(I am sorry, I am not familiar with CCL).
However, I am pretty sure that you will not find a nicer solution than the one you suggested in your question.

Related

Why is a string not a list of characters in scheme/racket?

What I'm used to is that a string is just a list or array of characters, like in most C-like languages. However, in the Scheme implementations that I use, including Chicken and Racket, a string is not the same as a list of characters. Something like (car "mystring") just won't fly. Instead there are functions to convert from and to lists. Why was this choice made? In my opinion Haskell does it the best way: there is literally no difference between a list of chars and a string. I like this the most because it conveys the meaning of what is meant by a string in the clearest, simplest way. I'm not completely sure, but I'd guess that in the 'background' strings are lists or arrays of chars in almost any language. I'd especially expect a language like Scheme, with its focus on simplicity, to handle strings in this way, or at least make it so you can do with strings what you can do with lists, like take the car or cdr. What am I missing?
It looks like what you are really asking is, Why aren't there generic operations that work on both strings and lists?
Those do exist, in libraries like the generic collections library.
#lang racket/base
(require data/collection)
(first '(my list)) ; 'my
(first "mystring") ; #\m
Also, operations like map from this library can work with multiple different types of collections together.
#lang racket/base
(require data/collection)
(define (digit->char n)
  (first (string->immutable-string (number->string n))))
(first (map string (map digit->char '(1 2 3)) "abc"))
; "1a"
This doesn't mean that strings are lists, but it does mean that both strings and lists are sequences, and operations on sequences can work on both kinds of data types together.
According to the Racket documentation, strings are arrays of characters:
4.3 Strings
A string is a fixed-length array of characters.
An array, as the term is usually used in programming languages, and especially in C and C++, is a contiguous block of memory with the important property that it supports efficient random access. E.g., you can access the first element (x[0]) just as quickly as the nth (x[n-1]). Linked lists, the lists you encounter by default in Lisps, don't support efficient random access.
So, since strings are arrays in Racket, you'd expect there to be some counterpart to the x[i] notation (which isn't very Lispy). In Racket, you use string-ref and string-set!, which are documented on that same page. E.g.:
(string-ref "mystring" 1) ;=> #\y
(Now, there are also vector-ref and vector-set! procedures for more generalized vectors. In some Lisps, strings are also vectors, so you can use general vector and array manipulation functions on strings. I'm not much of a Racket user, so I'm not sure whether that applies in Racket as well.)
Technically, an "array" is usually a contiguous piece of memory, while a "list" is usually understood to be a singly or doubly linked list of independently allocated objects. In most common programming languages, including C and all Lisp and Scheme dialects that I know, string data is stored, for performance reasons, in an array in the sense that it occupies a contiguous piece of memory.
The confusion is that strings might still be colloquially referred to as lists, which is not correct when "list" is understood as the precise technical term "linked list".
Were a string truly stored as a list, in the way Scheme and Lisp generally store one, every single character would carry the overhead of being part of an object that contains the character and at least one pointer, namely the one to the next character.
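To make that overhead concrete, here is a rough sketch in Python (not Scheme, purely for illustration). It stores the same characters once as a contiguous byte array and once as a chain of (character, next) pairs standing in for cons cells. The variable names are made up for this example, and the byte counts come from sys.getsizeof, so they are specific to CPython and only indicative.

```python
import sys

text = "mystring" * 100  # 800 characters

# Contiguous storage: one object holding every character back to back.
array_bytes = sys.getsizeof(text.encode("ascii"))

# Linked storage: one (character, next) pair per character,
# roughly what a list of characters built from cons cells looks like.
cell = None
linked_bytes = 0
for ch in reversed(text):
    cell = (ch, cell)
    linked_bytes += sys.getsizeof(cell)

# The linked form needs far more memory for the same 800 characters.
print(array_bytes, linked_bytes)
```

Reaching the nth character of the linked form also means following n pointers, whereas the array form can index it directly.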

Why is the argument position of split and join in clojure.string mixed up?

I wanted to do this:
(-> string
    (str/split #"\s")
    (modification-1)
    (modification-2)
    …
    (modification-n)
    (str/join "\n"))
But no, split takes [s regex] and join takes [separator coll].
Is there any apparent reason for this madness (read: What is the design decision behind this)?
As of Clojure 1.5, you can also use one of the new threading macros.
clojure.core/as->
([expr name & forms])
Macro
Binds name to expr, evaluates the first form in the lexical context
of that binding, then binds name to that result, repeating for each
successive form, returning the result of the last form.
It's quite a new construct, so I'm not sure how to use it idiomatically yet, but I guess something like this would do:
(as-> "test test test" s
  (str/split s #" ")
  (modification-1 s)
  (modification-2 s)
  ...
  (modification-n s)
  (str/join "\n" s))
Edit
As for why the argument position is different, I'm in no place to say, but I think Arthur's suggestion makes sense:
Some functions clearly operate on collections (map, reduce, etc). These tend to consistently take the collection as the last argument, which means they work well with ->>
Some functions don't operate on collections and tend to take the most important argument (is that a thing?) as the first argument. For example, when using / we expect the numerator to come first. These functions work best with ->
The thing is - some functions are ambiguous. They might take a collection and produce a single value, or take a single value and produce a collection. str/split is one example (disregarding for the moment the additional confusion that a string could be thought of as either a single value or a collection). Concatenation/reducing operations will also do it - they will mess up your pipeline!
Consider, for instance:
(->> (range 1 5)
     (map inc)
     (reduce +)
     ;; at this point we have a single value and might want to...
     (- 4)
     (/ 2))
;; but we're threading in the last position
;; and unless we're very careful, we'll misread this arithmetic
In those cases, I think something like as-> is really helpful.
I think in general the guideline to use ->> when operating on collections and -> otherwise is sound - and it's just in these borderline/ambiguous cases, as-> can make the code a little neater, a little clearer.
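The difference between the two threading positions can be sketched outside Clojure as well. The Python below is only an illustration: thread_first and thread_last are hypothetical helpers written for this example (they are not part of any library), and each step is given as a tuple of a function plus its extra arguments. It replays the (range 1 5) pipeline above and shows how the arithmetic steps change meaning depending on where the running value is inserted.

```python
from functools import reduce
import operator

def thread_first(value, *steps):
    """Like ->: pass the running value as the FIRST argument of each step."""
    for fn, *args in steps:
        value = fn(value, *args)
    return value

def thread_last(value, *steps):
    """Like ->>: pass the running value as the LAST argument of each step."""
    for fn, *args in steps:
        value = fn(*args, value)
    return value

# Collection stage: map and reduce take the collection last, so thread-last fits.
total = thread_last(
    range(1, 5),
    (map, lambda n: n + 1),
    (reduce, operator.add),
)

# Arithmetic stage: the position of the threaded value changes the answer.
as_last = thread_last(total, (operator.sub, 4), (operator.truediv, 2))    # 2 / (4 - total)
as_first = thread_first(total, (operator.sub, 4), (operator.truediv, 2))  # (total - 4) / 2

print(total, as_last, as_first)
```

With thread-last the pipeline computes 2 / (4 - 14), while thread-first computes (14 - 4) / 2, which is probably what the reader expected.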
I also run into this sort of (minor) threading headache fairly regularly, and often create an anonymous function to make the ordering match:
(-> string
    (str/split #"\s")
    (modification-1)
    (modification-2)
    …
    (modification-n)
    (#(str/join "\n" %)))
My guess as to why is that some functions were intended to be used with thread-first ->, some with thread-last ->>, and for some threading was not a design goal, though this is just a guess.
You can use partial to fix the separator argument for str/join.
(-> string
    (str/split #"\s")
    (modification-1)
    (modification-2)
    ;;
    (modification-n)
    ((partial str/join "\n")))
There is nothing wrong with threading your threaded expression through another threading macro, like this:
(-> string
    (str/split #"\s")
    modification-1
    modification-2
    modification-n
    (->> (str/join "\n")))

How to generate a random alphanumeric string with Erlang?

I'm trying to generate a random alphanumeric ID with Erlang.
I naively tried crypto:strong_rand_bytes(Bytes) to generate a random binary and then used that binary like it was created with <<"my_unique_random_id">> - which didn't work because random bits are not necessarily a valid UTF-8 string, right?
Well, I looked for other options in the erlang docs and elsewhere, but I didn't find anything. Could someone point me to a solution?
It might depend on the randomness you need. Erlang's crypto module produces stronger random data than the random module (see also [erlang-questions] Yaws security alert - Yaws 1.93 and this question). If you want to use strong_rand_bytes to generate an ID maybe getting the base64 of it might be enough:
> base64:encode(crypto:strong_rand_bytes(Bytes)).
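You could turn this into a list if needed. Note that standard Base64 output can also contain +, / and =, so the result is not strictly alphanumeric.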
According to Generating random strings in Erlang it only takes a few lines of Erlang to generate a string of a specified length from a certain set of characters.
get_random_string(Length, AllowedChars) ->
    lists:foldl(fun(_, Acc) ->
                        [lists:nth(random:uniform(length(AllowedChars)),
                                   AllowedChars)]
                            ++ Acc
                end, [], lists:seq(1, Length)).
The blog post has a line-by-line explanation of the code. Look to the comments for a couple of optimization tips.
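For comparison, the same idea (pick Length characters independently from a set of allowed characters) can be sketched in Python. This is an illustration in a different language, not a translation anyone in this thread proposed; random_string and ALPHANUMERIC are names chosen for this example, and it draws from the standard library's secrets module so that the choices come from a cryptographically strong source.

```python
import secrets
import string

# Letters and digits only, so the result is strictly alphanumeric.
ALPHANUMERIC = string.ascii_letters + string.digits

def random_string(length: int, allowed: str = ALPHANUMERIC) -> str:
    """Pick `length` characters independently and uniformly from `allowed`."""
    return "".join(secrets.choice(allowed) for _ in range(length))

print(random_string(16))
```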
I have prepared a small module to do this.
It uses crypto:rand_uniform/2 rather than the obsolete random:uniform.
-module(cloud_rnd).
-export([rnd_chars/1, rnd_numbers/1, rnd_chars_numbers/1]).
rnd_chars(L) -> get_rnd(L, chars).
rnd_numbers(L) -> get_rnd(L, numbers).
rnd_chars_numbers(L) -> get_rnd(L, chars_numbers).
get_rnd(L, chars) -> gen_rnd(L, "abcdefghijklmnopqrstuvwxyz");
get_rnd(L, numbers) -> gen_rnd(L, "1234567890");
get_rnd(L, chars_numbers) -> gen_rnd(L, "abcdefghijklmnopqrstuvwxyz1234567890").
gen_rnd(Length, AllowedChars) ->
    MaxLength = length(AllowedChars),
    lists:foldl(
        %% crypto:rand_uniform(Lo, Hi) returns Lo =< N < Hi, hence the + 1;
        %% without it the last allowed character would never be picked.
        fun(_, Acc) -> [lists:nth(crypto:rand_uniform(1, MaxLength + 1), AllowedChars)] ++ Acc end,
        [], lists:seq(1, Length)
    ).
The problem with responses to the various "I need random strings" questions (in whatever language) is almost every solution uses a flawed specification, namely, string length. The questions themselves rarely reveal why the random strings are needed, but I will boldly assume they are to be used as identifiers which need to be unique.
There are two leading ways to get strictly unique strings: deterministically (which is not random) and store/compare (which is onerous). What to do? Give up the ghost. Go with probabilistic uniqueness instead. That is, accept that there is some (however small) risk that your strings won't be unique. This is where understanding collision probability and entropy is helpful.
So I'll rephrase my bold assumption as you need some number of identifiers with a small risk of repeat. As a concrete example, let's say you need 5 million Ids with a less than 1 in a trillion risk of repeat. So what length of string do you need? Well, that question is underspecified as it depends on the characters used. But more importantly, it's misguided. What you need is specification of the entropy of the strings, not their length.
This is where EntropyString can help.
Bits = entropy_string:bits(5.0e6, 1.0e12).
83.37013046707142
entropy_string:random_string(Bits).
<<"QDrjGQFGgGjJ4t9r2">>
There are other predefined character sets, and you can specify your own characters as well (though for efficiency reasons only sets whose size is a power of 2 are supported). And best of all, the risk of repeat in the specified number of strings is explicit. No more guessing with string length.
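The number of bits printed above can be reproduced with the usual birthday approximation: with n identifiers of b bits each, the chance of any repeat is roughly n^2 / 2^(b+1). The Python sketch below solves that for b. I have not checked how the library computes it internally, so treat the formula as an assumption that happens to match the printed value; entropy_bits and string_length are names invented for this example.

```python
import math

def entropy_bits(total: float, risk: float) -> float:
    """Bits needed so `total` random IDs repeat with probability about 1 in `risk`.

    Birthday approximation: 1 / risk ~ total**2 / 2**(bits + 1).
    """
    return 2 * math.log2(total) - 1 + math.log2(risk)

def string_length(bits: float, alphabet_size: int) -> int:
    """Characters needed to carry `bits` of entropy with a given alphabet size."""
    return math.ceil(bits / math.log2(alphabet_size))

bits = entropy_bits(5.0e6, 1.0e12)
print(bits)                     # about 83.37
print(string_length(bits, 32))  # 17
```

With a 32-character alphabet (5 bits per character) this comes to 17 characters, which is the length of the sample string shown above.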
randchar(N) ->
randchar(N, []).
randchar(0, Acc) ->
Acc;
randchar(N, Acc) ->
randchar(N - 1, [random:uniform(26) + 96 | Acc]).
You may use function uef_bin:random_latin_binary/2 from here:
https://github.com/DOBRO/uef-lib#uef_binrandom_latin_binary2
Bin = uef_bin:random_latin_binary(Length, any)
And then, if you need a string() type:
String = erlang:binary_to_list(Bin)

Convert CHAR to ASCII in LISP [closed]

I want to write a program in LISP to get a string from the user and return the string formed by adding 1 to each char-code of the string. For example:
input: "hello123"
output: "ifmmp234"
I thought maybe I should convert the characters one by one to ASCII and then do what I want to do.
Any help with this will be so much appreciated..
Thanks
This is the code I developed. It gives me NIL in the output however. Can you help me with this:
(defun esi (n)
(setf m 0)
(loop (when (< m (length n))
(return))
(code-char (+ 1 (char-code (char n m))))
(+ 1 m)))
Look at the functions char-code and code-char.
EDIT: Regarding your code sample:
It seems that the input to your function should be a string. Name it such, e.g. string.
That (setf m 0) is setting a free variable. In this context, I must assume that m is never defined anywhere, so the behaviour is undefined. You should use, for example, let instead to establish a local binding. Hint: most looping constructs also give ways to establish local bindings.
The only exit out of your loop is that (return). Since it does not get any parameters, it will always return nil. You need to accumulate the new string somewhere and finally return it.
Functions in Lisp mostly do not modify their arguments. (+ 1 m) does not modify m. It just returns a value that is one greater than m. Likewise, code-char does not modify its argument, but returns a new value that is the character corresponding to the argument. You need to bind or assign these values.
That finishing condition is wrong. It will either terminate directly or, if the input string is empty, never terminate.
There are quite a few ways of doing what you want. Let's start with a function that returns a character one code-point later (there are some boundary issues here, let's ignore that for now).
(defun next-codepoint (char) (code-char (1+ (char-code char))))
Now, this operates on characters. Happily, a string is, essentially, a sequence of characters. Sequence operations should, in general, send you in the direction of the MAP family.
So, we have:
(defun nextify-string (string) (map 'string #'next-codepoint string))
Taking what's happening step by step:
For each character in the input string, we do the following:
We convert the character to its character code.
We increment this code.
We convert it back to a character.
Then we assemble all of these into a return value.
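The same steps can be sketched in Python for readers who want to compare (an illustration only; the function names simply mirror the Lisp ones above). As in the Lisp version, boundary cases are ignored: the last valid code point has no successor, and chr would raise an error for it.

```python
def next_codepoint(ch: str) -> str:
    """Return the character whose code is one greater than that of `ch`."""
    return chr(ord(ch) + 1)

def nextify_string(s: str) -> str:
    """Apply next_codepoint to every character and reassemble the string."""
    return "".join(next_codepoint(ch) for ch in s)

print(nextify_string("hello123"))  # ifmmp234
```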

Why do programming languages use commas to separate function parameters?

It seems like all programming languages use commas (,) to separate function parameters.
Why don't they use just spaces instead?
Absolutely not. What about this function call:
function(a, b - c);
How would that look with a space instead of the comma?
function(a b - c);
Does that mean function(a, b - c); or function(a, b, -c);? The use of the comma presumably comes from mathematics, where commas have been used to separate function parameters for centuries.
First of all, your premise is false. There are languages that use space as a separator (Lisp, ML, Haskell, possibly others).
The reason that most languages don't is probably that a) f(x,y) is the notation most people are used to from mathematics and b) using spaces leads to lots of nested parentheses (also called "the lisp effect").
Lisp-like languages use: (f arg1 arg2 arg3) which is essentially what you're asking for.
ML-like languages use concatenation to apply curried arguments, so you would write f arg1 arg2 arg3.
Tcl uses space as a separator between words passed to commands. Where it has a composite argument, that has to be bracketed or otherwise quoted. Mind you, even there you will find the use of commas as separators – in expression syntax only – but that's because the notation is in common use outside of programming. Mathematics has written n-ary function applications that way for a very long time; computing (notably Fortran) just borrowed.
You don't have to look further than most of our natural languages to see that the comma is used to separate items in lists. So, using anything other than a comma for enumerating parameters would be unexpected for anyone learning a programming language for the first time.
There's a number of historical reasons already pointed out.
Also, it's because in most languages where , serves as the separator, whitespace sequences are largely ignored, or to be more exact, although they may separate tokens, they do not act as tokens themselves. This is more or less true for all languages deriving their syntax from C. A sequence of whitespace is much like the empty word, and having the empty word delimit anything probably is not the best of ideas.
Also, I think it is clearer and easier to read. Why have whitespace, which is invisible and essentially serves no purpose other than formatting, act as a really meaningful delimiter? It only introduces ambiguity. One example is the one provided by Carl.
A second would be f(a (b + c)). Now is that f(a(b+c)) or f(a, b+c)?
The creators of JavaScript had a very useful idea, similar to yours, which yields just the same problems. The idea was, that ENTER could also serve as ;, if the statement was complete. Observe:
function a() {
return "some really long string or expression or whatsoever";
}
function b() {
return
"some really long string or expression or whatsoever";
}
alert(a()); // "some really long string or expression or whatsoever"
alert(b()); // undefined, because 'return;' is a valid statement
As a matter of fact, I sometimes tend to use the latter notation in languages, that do not have this 'feature'. JavaScript forces a way to format my code upon me, because someone had the cool idea, of using ENTER instead of ;.
I think, there is a number of good reasons why some languages are the way they are. Especially in dynamic languages (as PHP), where there's no compile time check, where the compiler could warn you, that the way it resolved an ambiguity as given above, doesn't match the signature of the call you want to make. You'd have a lot of weird runtime errors and a really hard life.
There are languages, which allow this, but there's a number of reasons, why they do so. First and foremost, because a bunch of very clever people sat down and spent quite some time designing a language and then discovered, that its syntax makes the , obsolete most of the time, and thus took the decision to eliminate it.
This may sound a bit wise, but I gather it is for the same reason that most earth-planet languages use it (English, French, and those few others ;-) Also, it is intuitive to most.
Haskell doesn't use commas.
Example
multList :: [Int] -> Int -> [Int]
multList (x : xs) y = (x * y) : (multList xs y)
multList [] _ = []
The reason for using commas in C/C++ is that reading a long argument list without a separator can be difficult.
Try reading this
void foo(void * ptr point & * big list<pointers<point> > * t)
commas are useful like spaces are. In Latin nothing was written with spaces, periods, or lower case letters.
Try reading this
IAMTHEVERYMODELOFAWHATDOYOUWANTNOTHATSMYBUCKET
it's primarily to help you read things.
This is not true. Some languages don't use commas. Functions were mathematical concepts before they were programming constructs, so some languages keep the old notation. Then most of the newer ones were inspired by C (JavaScript, Java, C#, and PHP too; they share some formal rules, like the comma).
While some languages do use spaces, using a comma avoids ambiguous situations without the need for parentheses. A more interesting question might be why C uses the same character as a separator as is used for the "a then b" operator, given that the C character set has at least three other characters that do not appear in any context (dollar sign, commercial-at, and grave), and I know at least one of those (the dollar sign) dates back to the 40-character punchcard set.
It seems like all programming languages use commas (,) to separate function parameters.
In natural languages that include the comma in their script, that character is used to separate things. For instance, if you were to enumerate fruits, you'd write: "lemon, orange, strawberry, grape". That is, using commas.
Hence, using a comma to separate parameters in a function is more natural than using another character ( | for instance ).
Consider:
someFunction( name, age, location )
vs.
someFunction( name|age|location )
Why don't they use just spaces instead?
That's possible. Lisp does it.
The main reason is, space, is already used to separate tokens, and it's easier not to assign an extra functionality.
I have programmed in quite a few languages and while the comma does not rule supreme it is certainly in front. The comma is good because it is a visible character so that script can be compressed by removing spaces without breaking things. If you have space then you can have tabs and that can be a pain in the ... There are issues with new-lines and spaces at the end of a line. Give me a comma any day, you can see it and you know what it does. Spaces are for readability (generally) and commas are part of syntax. Mind you there are plenty of exceptions where a space is required or de rigueur. I also like curly brackets.
It is probably tradition. If they used space they could not pass expression as param e.g.
f(a-b c)
would be very different from
f(a -b c)
Some languages, like Boo, allow you to specify the type of parameters or leave it out, like so:
def MyFunction(obj1, obj2, title as String, count as Int):
...do stuff...
Meaning: obj1 and obj2 can be of any type (inherited from object), whereas title and count must be of type String and Int respectively. This would be hard to do using spaces as separators.
