General method to trim non-printable characters in Clojure - string

I encountered a bug where I couldn't match two seemingly 'identical' strings together. For example, the following two strings fail to match:
"sample" and "​sample".
To replicate the issue, one can run the following in Clojure.
(= "sample" "​sample") ; returns false
After an hour of frustrated debugging, I discovered that there was a zero-width space at the front of the second string! Removing it from this particular example via a backspace is trivial. However, I have a database of strings that I'm matching, and it seems that multiple strings have this issue. My question is: is there a general method to trim zero-width spaces in Clojure?
Some methods I've tried:
(count (clojure.string/trim "​abc")) ; returns 4
(count (clojure.string/replace "​abc" #"\s" "")) ; returns 4
The thread "Remove zero-width space characters from a JavaScript string" does provide a regular-expression solution that works in this example, i.e.
(count (clojure.string/replace "​abc" #"[\u200B-\u200D\uFEFF]" "")) ; returns 3
However, as stated in the post itself, there are many other Unicode characters that may be invisible. So I'm still interested in whether there's a more general method that doesn't rely on listing all possible invisible Unicode symbols.

I believe what you are referring to are so-called non-printable characters. Based on this answer for Java, you could pass the #"\p{C}" regular expression as the pattern to replace:
(defn remove-non-printable-characters [x]
  (clojure.string/replace x #"\p{C}" ""))
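However, this will also remove line breaks, e.g. \n, because they are control characters. To keep whitespace, we need a more complex regular expression that intersects \p{C} with the non-whitespace characters:
(defn remove-non-printable-characters [x]
  ;; && is character-class intersection; [^\s] excludes whitespace such as \n
  (clojure.string/replace x #"[\p{C}&&[^\s]]" ""))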
This function will remove non-printable characters. Let's test it:
(= "sample" "​sample")
;; => false
(= (remove-non-printable-characters "sample")
(remove-non-printable-characters "​sample"))
;; => true
(remove-non-printable-characters "sam\nple")
;; => "sam\nple"
The \p{C} pattern is discussed here.
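Before stripping anything, it can help to see exactly which code points a suspicious string contains. The following is a minimal sketch, not part of the original answer: `code-points` is a hypothetical helper name, and the escape `\u200B` is used so that the zero-width space is visible in the source.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: render each char of s as a U+XXXX code.
(defn code-points [s]
  (mapv #(format "U+%04X" (int %)) s))

(code-points "\u200Babc")
;; => ["U+200B" "U+0061" "U+0062" "U+0063"]

;; \p{C} covers control (Cc) and format (Cf) characters, among others;
;; intersecting with [^\s] keeps \n, \t and other whitespace.
(count (str/replace "\u200Babc" #"[\p{C}&&[^\s]]" ""))
;; => 3
```

Note that some invisible-looking characters fall outside category C. For example, the no-break space U+00A0 is a space separator (Zs), so it is not removed by this pattern and would need to be handled separately.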

The regex solution from @Rulle is very nice. The tupelo.chars namespace also has a collection of character classes and predicate functions that could be useful. They work in Clojure and ClojureScript, and also include the &nbsp; (non-breaking space) for browsers. In particular, check out the visible? predicate.
The tupelo.string namespace also has a number of helper & convenience functions for string processing.
(ns tst.demo.core
  (:use tupelo.core tupelo.test)
  (:require
    [tupelo.chars :as chars]
    [tupelo.string :as str]))
(def sss
  "Some multi-line
string.")
(dotest
  (println "result:")
  (println
    (str/join
      (filterv
        #(or (chars/visible? %)
             (chars/whitespace? %))
        sss))))
with result
result:
Some multi-line
string.
To use, make your project.clj look like:
:dependencies [
[org.clojure/clojure "1.10.2-alpha1"]
[prismatic/schema "1.1.12"]
[tupelo "20.07.01"]
]
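If pulling in a library is not an option, the same filter-by-predicate idea can be sketched with java.lang.Character alone. This is an illustration under assumptions, not tupelo's implementation: `invisible-types`, `keep-char?` and `remove-invisible` are hypothetical names, and the choice of Unicode categories to drop is a judgment call.

```clojure
;; Unicode general categories treated as "invisible" in this sketch.
;; Surrogates are deliberately left alone so that characters outside
;; the Basic Multilingual Plane are not broken apart.
(def invisible-types
  #{(int Character/CONTROL)
    (int Character/FORMAT)
    (int Character/PRIVATE_USE)
    (int Character/UNASSIGNED)})

(defn keep-char?
  "True for whitespace and for chars outside the invisible categories."
  [c]
  (or (Character/isWhitespace (char c))
      (not (contains? invisible-types (Character/getType (char c))))))

(defn remove-invisible [s]
  (apply str (filter keep-char? s)))

(remove-invisible "sam\u200Bple\nok")
;; => "sample\nok"
```

The whitespace check runs first, so line breaks and tabs survive even though they are control characters.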

Related

How to distinguish escaped characters from non-escaped e.g. "\x27" from "x27" in a string in Common Lisp?

While solving Advent of Code 2015 task 8 part 2, I ran into the problem of having to distinguish, in a string, an occurrence of "\x27" from a plain "x27".
But I don't see a way how I can do it. Because
(length "\x27") ;; is 3
(length "x27") ;; is also 3
(subseq "\x27" 0 1) ;; is "x"
(subseq "x27" 0 1) ;; is "x"
Neither print, prin1, princ made a difference.
;; nor does coerce
(coerce "\x27" 'list)
;; (#\x #\2 #\7)
So how then to distinguish in a string when "\x27" or any of such
hexadecimal representation occurs?
It turned out that one doesn't need to solve this to solve the task. However, I would still like to know whether there is a way to distinguish "\x" from "x" in Common Lisp.
The string literal "\x27" is read as the same as "x27", because \ is an escape character in string literals. If you want a string with the contents \x27, you need to write the literal as "\\x27" (i.e., escape the escape character). This has nothing to do with the strings themselves. If you read a string from a file containing \x27 (e.g., with read-line), then the four-character string \x27 results.
By the time that the Lisp reader gets to work, \x is the same as x. There may be some way to turn this off - I wouldn't be surprised - but the original text talks about Santa's file.
So, I created my own file, like this:
x27
\x27
And I read the data into special variables like this:
;; declare the special variables before assigning to them
(defvar x)
(defvar x-esc)
(defun read-line-crlf (stream)
  (string-right-trim '(#\Return) (read-line stream nil)))
(defun read-lines (filename)
  (with-open-file (stream filename)
    (setf x (read-line-crlf stream))
    (setf x-esc (read-line-crlf stream))))
The length of x is then 3, and the length of x-esc is 4. The returned string must be trimmed on Windows, or an external format declared, because otherwise SBCL will leave half of the CR-LF on the end of the read strings.

In DrRacket how do I check if a string has a certain amount of characters, as well how do I determine what the first character in a string is

Basically I have a problem, here is the information needed to solve the problem.
PigLatin. Pig Latin is a way of rearranging letters in English words for fun. For example, the sentence “pig latin is stupid” becomes “igpay atinlay isway upidstay”.
Vowels (‘a’, ‘e’, ‘i’, ‘o’, and ‘u’) are treated separately from the consonants (any letter that isn’t a vowel).
For simplicity, we will consider ‘y’ to always be a consonant. Although various forms of Pig Latin exist, we will use the following rules:
(1) Words of two letters or less simply have “way” added on the end. So “a” becomes “away”.
(2) In any word that starts with consonants, the consonants are moved to the end, and “ay” is added. If a word begins with more than two consonants, move only the first two letters. So “hello” becomes “ellohay”, and “string” becomes “ringstay”.
(3) Any word which begins with a vowel simply has “way” added on the end. So “explain” becomes “explainway”.
Write a function (pig-latin L) that consumes a non-empty (listof Str) and returns a Str containing the words in L converted to Pig Latin.
Each value in L should contain only lower case letters and have a length of at least 1.
I understand that I need to set three main conditions here, but I'm struggling with Racket and learning the proper syntax to write out my solutions. First, I need a condition that looks at a string and checks whether its length is 2 or less, to meet condition (1). For (2), I need to look at the first two characters in a string; I'm assuming I have to convert the string into a list of chars (string->list). For (3), I understand I just have to look at the first character in the string, so I basically have to repeat what I did for (2) but look only at the first character.
I don't know how to manipulate a list of chars, though. I also don't know how to check that string-length meets a criterion. Any assistance would be appreciated. I have barely any code for my problem since I am baffled about what to do here.
An example of the problem is
(pig-latin (list "this" "is" "a" "crazy" "exercise")) =>
"isthay isway away azycray exerciseway"
The best strategy to solve this problem is:
Check in the documentation all the available string procedures. We don't need to transform the input string to a list of chars to operate upon it, and you'll find that there are existing procedures that meet all of our needs.
Write helper procedures. In fact, we only need a procedure that tells us if a string contains a vowel at a given position; the problem states that only a-z characters are used so we can negate this procedure to also find consonants.
It's also important to identify the best order to write the conditions, for example: conditions 1 and 3 can be combined in a single case. This is my proposal:
(define (vowel-at-index? text index)
  (member (string-ref text index)
          '(#\a #\e #\i #\o #\u)))
(define (pigify text)
  ; cases 1 and 3
  (cond ((or (<= (string-length text) 2)
             (vowel-at-index? text 0))
         (string-append text "way"))
        ; case 2.1
        ((and (not (vowel-at-index? text 0))
              (vowel-at-index? text 1))
         (string-append (substring text 1)
                        (substring text 0 1)
                        "ay"))
        ; case 2.2
        (else
         (string-append (substring text 2)
                        (substring text 0 2)
                        "ay"))))
(define (pig-latin lst)
  (string-join (map pigify lst)))
For the final step, we only need to apply the pigify procedure to each element in the input, and that's what map does. It works as expected:
(pig-latin '("this" "is" "a" "crazy" "exercise"))
=> "isthay isway away azycray exerciseway"

clojure - avoid extra whitespace when combining string and variable

I'm writing a program that uses println-str to return commands of an assembly language. I need to use variables in my code, and I'm having an issue where the function returns extra whitespace where I don't want it:
(defn pushConstant [constant]
(println-str "#" constant "\r\nD=A\r\n#SP\r\nA=M\r\nM=D\r\n#SP\r\nM=M+1"))
Assuming that constant = 17, instead of getting
#17
D=A
#SP
A=M
M=D
#SP
M=M+1
I'm getting:
# 17
D=A
#SP
A=M
M=D
#SP
M=M+1
Which is problematic for my assembly code. I have this issue in so many cases like this. I'll be glad to hear advice on how to avoid this extra whitespace between the String and the variable.
Frankly, I'd implement that to look more like the following:
(defn pushConstant [constant]
  (->> [(str "#" constant)
        "D=A"
        "#SP"
        "A=M"
        "M=D"
        "#SP"
        "M=M+1"]
       (interpose "\r\n")
       (apply str)))
That way you don't have one big ugly format string, but break down your operations into small, readable pieces.
That said, the piece that makes a difference for you here is (str "#" constant), combining your # with the argument with no added whitespace.
Create the string using str which only concatenates (println interleaves spaces):
(defn pushConstant [constant]
(println-str (str "#" constant "\r\nD=A\r\n#SP\r\nA=M\r\nM=D\r\n#SP\r\nM=M+1")))
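The difference between the printing functions and str can be checked at the REPL. The sketch below uses print-str, which separates its arguments with spaces like println-str but does not append a newline, and clojure.string/join as an alternative to interpose. The name `push-constant` is only an illustrative kebab-case rename of the function above.

```clojure
(require '[clojure.string :as str])

(print-str "#" 17) ;; => "# 17"  (arguments separated by a space)
(str "#" 17)       ;; => "#17"   (plain concatenation)

;; Build each instruction separately, then join with CRLF.
(defn push-constant [constant]
  (str/join "\r\n" [(str "#" constant) "D=A" "#SP" "A=M" "M=D" "#SP" "M=M+1"]))

(push-constant 17)
;; => "#17\r\nD=A\r\n#SP\r\nA=M\r\nM=D\r\n#SP\r\nM=M+1"
```

Unlike println-str, this version returns the text without a trailing newline, so add one explicitly if the consumer expects it.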

Lisp - Displaying a String to List

I've been looking for a way to convert user input (read-line) to a list of atoms that I can manipulate more easily.
For example:
SendInput()
This is my input. Hopefully this works.
and I want to get back..
(This is my input. Hopefully this works.)
Eventually it'd be ideal to remove any periods, commas, quotes, etc. But for now I just want to store the user's input in a list (NOT AS A STRING).
So, for now I'm using
(setf stuff (coerce (read-line) 'list))
and that returns to me as...
(#\T #\h #\i #\s #\Space #\i #\s #\Space #\m #\y #\Space #\i #\n #\p #\u #\t #\. #\Space #\H #\o #\p #\e #\f #\u #\l #\l #\y #\Space #\t #\h #\i #\s #\Space #\w #\o #\r #\k #\s #\.)
So now i'm on the hunt for a function that can take that list and format it properly...
Any help would be greatly appreciated!
Rainer's answer is better in that it's a bit more lightweight (and general), but you could also use CL-PPCRE, if you already have it loaded (I know I always do).
You can use SPLIT directly on the string you get from READ-LINE, like so:
(cl-ppcre:split "[ .]+" (read-line))
(Now you have two problems)
What you want to do is to split a sequence of characters (a String) into a list of smaller strings or symbols.
Use some of the split sequence functions available from a Lisp library (see for example cl-utilities).
In LispWorks (which comes with a SPLIT-SEQUENCE function), I would for example write:
CL-USER 8 > (mapcar #'intern
                    (split-sequence '(#\space #\.)
                                    "This is my input. Hopefully this works."
                                    :coalesce-separators t))
(|This| |is| |my| |input| |Hopefully| |this| |works|)
Remember, to get symbols with case preserving names, they are surrounded by vertical bars. The vertical bars are not part of the symbol name - just like the double quotes are not part of a string - they are delimiters.
You can also print it:
CL-USER 19 > (princ (mapcar #'intern
                            (split-sequence '(#\space #\.)
                                            "This is my input. Hopefully this works."
                                            :coalesce-separators t)))
(This is my input Hopefully this works)
(|This| |is| |my| |input| |Hopefully| |this| |works|)
The form above prints the list. The first output is the data printed by PRINC, and the second output is the return value printed by the REPL.
If you don't want symbols, but strings:
CL-USER 9 > (split-sequence '(#\space #\.)
                            "This is my input. Hopefully this works."
                            :coalesce-separators t)
("This" "is" "my" "input" "Hopefully" "this" "works")

Make String from Sequence of Characters

This code does not work as I expected. Could you please explain why?
(defn make-str [s c]
(let [my-str (ref s)]
(dosync (alter my-str str c))))
(defn make-str-from-chars
"make a string from a sequence of characters"
([chars] make-str-from-chars chars "")
([chars result]
(if (== (count chars) 0) result
(recur (drop 1 chars) (make-str result (take 1 chars))))))
Thank you!
This is a very slow and incorrect way to create a string from a seq of characters. The main problem is that changes aren't propagated: ref creates a new reference to the existing string, but once the function exits, that reference is discarded.
The correct way to do this is:
(apply str seq)
for example,
user=> (apply str [\1 \2 \3 \4])
"1234"
If you want to make it more efficient, you can use Java's StringBuilder to collect all the data into a string. (Strings in Java are also immutable.)
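A minimal sketch of the StringBuilder approach mentioned above; `chars->string` is a hypothetical name. Since str already builds its result with a StringBuilder when given several arguments, (apply str chars) is usually enough, and this version is mainly illustrative.

```clojure
(defn chars->string
  "Collects a sequence of characters into a string via StringBuilder."
  [chars]
  (let [sb (StringBuilder.)]
    (doseq [c chars]
      (.append sb (char c)))   ; (char c) selects the append(char) overload
    (.toString sb)))

(chars->string [\1 \2 \3 \4])
;; => "1234"
```

The mutable builder never escapes the function, so callers still see an ordinary immutable string.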
You pass a sequence with one character in it to your make-str function, not the character itself. Using first instead of take should give you the desired effect.
Also there is no need to use references. In effect your use of them is a gross misuse of them. You already use an accumulator in your function, so you can use str directly.
(defn make-str-from-chars
  "make a string from a sequence of characters"
  ([chars] (make-str-from-chars chars ""))
  ([chars result]
   (if (zero? (count chars))
     result
     (recur (drop 1 chars) (str result (first chars))))))
Of course count is not very nice in this case, because it always has to walk the whole sequence to figure out its length. So you traverse the input sequence several times unnecessarily. One normally uses seq to identify when a sequence is exhausted. We can also use next instead of drop to save some overhead of creating unnecessary sequence objects. Be sure to capture the return value of seq to avoid overhead of object creations later on. We do this in the if-let.
(defn make-str-from-chars
  "make a string from a sequence of characters"
  ([chars] (make-str-from-chars chars ""))
  ([chars result]
   (if-let [chars (seq chars)]
     (recur (next chars) (str result (first chars)))
     result)))
Functions like this, which just return the accumulator upon fully consuming its input, cry for reduce.
(defn make-str-from-chars
  "make a string from a sequence of characters"
  [chars]
  (reduce str "" chars))
This is already nice and short, but in this particular case we can do even a little better by using apply. Then str can use the underlying StringBuilder to its full power.
(defn make-str-from-chars
  "make a string from a sequence of characters"
  [chars]
  (apply str chars))
Hope this helps.
You can also use clojure.string/join, as follows:
(require '[clojure.string :as str] )
(assert (= (vec "abcd") [\a \b \c \d] ))
(assert (= (str/join (vec "abcd")) "abcd" ))
There is an alternate form of clojure.string/join which accepts a separator. See:
http://clojuredocs.org/clojure_core/clojure.string/join
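For illustration, a short sketch of both arities of clojure.string/join, with the separator passed as the first argument:

```clojure
(require '[clojure.string :as str])

(str/join ", " ["a" "b" "c"])   ;; => "a, b, c"
(str/join "-" [\a \b \c])       ;; => "a-b-c"
(str/join [\a \b \c])           ;; => "abc"  (no separator)
```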

Resources