clojure: remove a set of strings from a sentence

I have a sentence "china beijing shanghai USA australia", and a set of words #{"USA" "australia"}
Now I am writing a function that takes a sentence and a set of words as input and removes those words from the sentence:
(defn remove-words-from-sentence [sentence words]
(for [w words] (-> sentence
(.replaceAll w "")))
Note: I want to replace exact word occurrences only. If words contains the letter "a", then not every "a" in the sentence should be replaced, only the standalone word "a".
But the above function doesn't work. Any help?

One way you can do it is by splitting the sentence into individual words, and having the words to be removed in a set, and filter out the words from the sentence.
(let [sentence (clojure.string/split (read-line) #" ")
words (set (clojure.string/split (read-line) #" "))]
(clojure.string/join " "
(filter (complement words)
sentence)))
user=> china beijing shanghai USA australia ;;input sentence
user=> china USA ;;input words
user=> "beijing shanghai australia" ;;output
EDIT:
Thumbnail brought to my attention that (filter (complement pred) coll) is equivalent to (remove pred coll). You can verify that by viewing the source code of remove
(source remove)
(defn remove
"Returns a lazy sequence of the items in coll for which
(pred item) returns false. pred must be free of side-effects."
{:added "1.0"
:static true}
[pred coll]
(filter (complement pred) coll))
nil
So one could just use remove instead
(let [sentence (clojure.string/split (read-line) #" ")
words (set (clojure.string/split (read-line) #" "))]
(clojure.string/join " " (remove words sentence)))
It's even more readable that way. You can read it as "remove words from sentence".
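The same idea can be wrapped in a function that takes the sentence and the set directly, as the question asks, instead of reading from standard input. This is only a sketch: the name remove-words is made up here, and splitting on #"\s+" (runs of whitespace) rather than a single space is an assumption about the input.

```clojure
(require '[clojure.string :as str])

(defn remove-words
  "Hypothetical helper: drops every whole word of `sentence` that is in the set `words`."
  [sentence words]
  (->> (str/split sentence #"\s+") ; tokenize on runs of whitespace
       (remove words)              ; a set acts as a membership predicate
       (str/join " ")))

(remove-words "china beijing shanghai USA australia" #{"USA" "australia"})
;; => "china beijing shanghai"
```

Because whole tokens are compared, a word such as "a" in the set removes only the standalone word "a", not the letter inside other words.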

for iterates over the seq given to it, producing another sequence. So, you're generating a list with elements representing each replacement separately but not combined.
What you want is first replacing the first word, then - on the result of that replacement - remove the second one, and so on. This is a typical case for reduce:
(defn remove-words-from-sentence
[sentence words]
(reduce #(.replace % %2 "") sentence words))
(Note that replace does the same as replaceAll but with literal replacements, not allowing a regular expression.)
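To see the difference between for and reduce concretely, here is a small sketch. The names words, sentence, with-for and with-reduce are made up for illustration, and a vector is used instead of a set so that the order is predictable. It uses clojure.string/replace, which treats a string match literally.

```clojure
(require '[clojure.string :as str])

(def words ["USA" "australia"])      ; a vector, so iteration order is fixed
(def sentence "china USA australia")

;; for: one result per word, each computed from the ORIGINAL sentence
(def with-for
  (for [w words] (str/replace sentence w "")))
;; => ("china  australia" "china USA ")

;; reduce: each replacement is applied to the result of the previous one
(def with-reduce
  (reduce (fn [s w] (str/replace s w "")) sentence words))
;; => "china  "
```

Note the leftover spaces in both results; plain substring replacement does nothing about the whitespace around a removed word.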
EDIT: This is only fixing what the OP was trying to do. It will probably produce unwanted results if e.g. one of the words is "eij" (since it will remove that portion of "Beijing"). One way to fix that would be to use (.replaceAll % (str "\\b\\Q" %2 "\\E\\b\\s*") "") to do the replacement; and then trim the result. A more reliable version might thus look like this:
(require '[clojure.string :as string])
(defn remove-words-from-sentence
[sentence words]
(let [pattern (->> (for [w words] (str "\\b\\Q" w "\\E\\b"))
(string/join "|")
(format "(%s)\\s*"))]
(.trim (.replaceAll sentence pattern ""))))
But it all depends on what OP wants.

user> (defn remove-words-from-sentence
[sentence & words]
(loop [sentence sentence
ws words]
(if-not (seq ws)
sentence
(recur
(clojure.string/replace sentence (first ws) "")
(rest ws)))))
#'user/remove-words-from-sentence
user> (remove-words-from-sentence "Hello, World" "World")
;=> "Hello, "
user> (remove-words-from-sentence "Hello, World" "ll" "o" "H")
;=> "e, Wrld"

The answers so far don't deal with the question's specified input types (a string and a set).
Since the question specifies the words as a set and the sentence as a string, the easiest solution is probably to use set operations, which is easy to understand too:
(require '[clojure.string :as str]
         '[clojure.set :as set])
(defn remove-words-from-sentence [sentence words]
  (str/join " " (set/difference (into #{} (str/split sentence #" ")) words)))
This works for the example, though a set keeps neither the original word order nor duplicate words:
(remove-words-from-sentence "china beijing shanghai USA australia" #{"USA" "australia"})
"beijing china shanghai"

Related

Numbering strings inside a vector in clojure

Given the following string:
(def text "this is the first sentence . And this is the second sentence")
I wanted to count the number of times a word like "this" appears in the text, by appending the count after each occurrence of the word. Like this:
["this: 1", "is" "the" "first" "sentence" "." "and" "this: 2" ...]
As a first step, I tokenized the string:
(def words (split text #" "))
Then I created a helper function to get the number of times "this" appears in the text:
(defn count-this [x] (count(re-seq #"this" text)))
Finally I tried to use the result of the count-this function inside this loop:
(for [x words]
(if (= x "this")
(str "this: "(apply str (take (count-this)(iterate inc 0))))
x))
Here is what I get:
("this: 01" "is" "the" "first" "sentence" "." "And" "this: 01" "is" ...)

This can be achieved fairly succinctly using reduce to thread a counter through your vector traversal, in addition to building the new strings as needed:
(def text "this is the first sentence. And this is the second sentence.")
(defn notate-occurences [word string]
(->
(reduce
(fn [[count string'] member]
(if (= member word)
(let [count' (inc count)]
[count' (conj string' (str member ": " count'))])
[count (conj string' member)]))
[0 []]
(clojure.string/split string #" "))
second))
(notate-occurences "this" text)
;; ["this: 1" "is" "the" "first" "sentence." "And" "this: 2" "is" "the" "second" "sentence."]

(defn split-by-word [word text]
(remove empty?
(flatten
(map #(if (number? %) (str word ": " (+ 1 %)) (clojure.string/split (clojure.string/trim %) #" "))
(butlast (interleave
(clojure.string/split (str text " ") (java.util.regex.Pattern/compile (str "\\b" word "\\b")))
(range)))))))

You need to keep some state as you are going along. reduce, loop/recur and iterate all do this. iterate just transitions from one state to another. Here is the transition function:
(defn transition [word]
(fn [[[head & tail] counted out]]
(let [[next-counted to-append] (if (= word head)
[(inc counted) (str head ": " (inc counted))]
[counted head])]
[tail next-counted (conj out to-append)])))
Then you can use iterate to exercise this function until there is no input left:
(require '[clojure.string :as s])
(let [in (s/split "this is the first sentence . And this is the second sentence" #" ")
step (transition "this")]
(->> (iterate step [in 0 []])
(drop-while (fn [[[head & _] _ _]]
head))
(map #(nth % 2))
first))
;; => ["this: 1" "is" "the" "first" "sentence" "." "And" "this: 2" "is" "the" "second" "sentence"]

The problem with that approach is that (apply str (take (count-this)(iterate inc 0))) is going to evaluate to the same thing every time.
To exert complete control over variables you generally want to use the loop form.
e.g.
(require '[clojure.string :as str])
(defn add-indexes [word phrase]
(let [words (str/split phrase #"\s+")]
(loop [src words
dest []
counter 1]
(if (seq src)
(if (= word (first src))
(recur (rest src) (conj dest (str word " " counter)) (inc counter))
(recur (rest src) (conj dest (first src)) counter))
dest))))
user=> (add-indexes "this" "this is the first sentence . And this is the second sentence")
["this 1" "is" "the" "first" "sentence" "." "And" "this 2" "is" "the" "second" "sentence"]
loop lets you specify the value of each loop variable on every pass, so you can decide whether to change it according to your own logic.
If you're willing to dip into Java and maybe do something that feels like cheating, this would work too.
(defn add-indexes2 [word phrase]
(let [count (java.util.concurrent.atomic.AtomicInteger. 1)]
(map #(if (= word %) (str % " " (.getAndIncrement count)) %)
(str/split phrase #"\s+"))))
user=> (add-indexes2 "this" "this is the first sentence . And this is the second sentence")
("this 1" "is" "the" "first" "sentence" "." "And" "this 2" "is" "the" "second" "sentence")
Using the mutable counter may not be pure, but on the other hand, it never escapes the context of the function, so its behavior cannot be changed by external forces.

Usually, you can find a simple way of composing your solution from existing Clojure functions in a very succinct way.
Here are two quite short solutions to your problem. First, if you don't need the result as a sequence, but replacements to the string are ok:
(require 'clojure.string)
(def text "this is the first sentence . And this is the second sentence")
(defn replace-token [ca token]
(swap! ca inc)
(str token ": " @ca))
(defn count-this [text]
(let [counter (atom 0)
replacer-fn (partial replace-token counter)]
(clojure.string/replace text #"this" replacer-fn)))
(count-this text)
; => "this: 1 is the first sentence . And this: 2 is the second sentence"
The above solution makes use of the fact that a function can be supplied to clojure.string/replace.
Second, if you need the result as a sequence, there is some overhead from tokenizing:
(defn count-seq [text]
(let [counter (atom 0)
replacer-fn (partial replace-token counter)
converter (fn [tokens] (map #(if (not= % "this")
%
(replacer-fn %))
tokens))]
(-> text
(clojure.string/split #" ")
(converter))))
(count-seq text)
; => ("this: 1" "is" "the" "first" "sentence" "." "And" "this: 2" "is" "the" "second" "sentence")
The loop-recur pattern is very common for beginning Clojurians, who come from non-functional languages. In most cases, there is a cleaner and more idiomatic solution using functional processing with map, reduce, and friends.
Like other answers have stated, the main issue in your original attempt is the binding of your counter. In fact, (iterate inc 0) is not bound to anything. Look at my examples above to think through the scope of the bound atom counter. As a reference, here is an example of using closures, which could also be used in this case with great success!
As a footnote to the above examples: for cleaner code, you should make a more general solution by extracting and reusing the common parts of the count-seq and count-this functions. Also, the local converter function could be extracted out of count-seq. replace-token is already general for all tokens, but consider how the whole solution could be extended to match words other than "this". These are left as exercises for the reader.
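One possible way to approach that exercise without any mutable state is sketched below. The name number-occurrences is made up for this example, and splitting on #"\s+" is an assumption about the input. It uses reductions to compute a running count for every token and then pairs each token with its count.

```clojure
(require '[clojure.string :as str])

(defn number-occurrences
  "Hypothetical helper: returns the tokens of `text` as a vector,
   with each occurrence of `word` suffixed by its running count."
  [word text]
  (let [tokens (str/split text #"\s+")
        ;; running number of matches seen up to and including each token;
        ;; `rest` drops the initial 0 that reductions emits first
        counts (rest (reductions (fn [n t] (if (= t word) (inc n) n))
                                 0
                                 tokens))]
    (mapv (fn [t n] (if (= t word) (str t ": " n) t))
          tokens
          counts)))

(number-occurrences "this" "this is the first sentence . And this is the second sentence")
;; => ["this: 1" "is" "the" "first" "sentence" "." "And" "this: 2" "is" "the" "second" "sentence"]
```

Since the word is a parameter and the count is derived from the token sequence itself, the function stays pure and works for any target word.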

`.split` method in clojure returns unexpected results

For part of a class project I need to read in a file representing a graph in Clojure. Here is a link to an example file. All the files I could possibly read in have this structure:
c Unknown number of lines
c That start with "c" and are just comments
c The rest of the lines are edges
e 2 1
e 3 1
e 4 1
e 4 2
e 4 3
e 5 1
e 5 2
The issue that I am having is trying to split a line based on spaces. In my REPL I have done
finalproject.core> (.split "e 1 2" " ")
#<String[] [Ljava.lang.String;@180f214>
I am not sure exactly what this means. I think it refers to the memory location of a String[], but I am not sure why it is displayed like that. If I insert a # in front of the split string, which I think denotes a regular expression, I receive an error:
finalproject.core> (.split "e 1 2" #" ")
ClassCastException java.util.regex.Pattern cannot be cast to java.lang.String
Currently my entire implementation of this module is the following, which I am pretty sure will work if I can use the split function properly.
(defn lineToEdge [line]
(cond (.startsWith line "e")
(let [split-line (.split line " ")
first-str (split-line 1)
second-str (split-line 2)]
((read-string first-str) (read-string second-str)))))
(defn readGraphFile [filename, numnodes]
(use 'clojure.java.io)
(let [edge-list
(with-open [rdr (reader filename)]
(doseq [line (line-seq rdr)]
(lineToEdge line)))]
(reduce add-edge (empty-graph numnodes) edge-list)))
I have not had a chance to test readGraphFile in any way but when I try to use lineToEdge with some dummy input I receive the error
finalproject.core> (lineToEdge "e 1 2")
ClassCastException [Ljava.lang.String; cannot be cast to clojure.lang.IFn
Suggestions as to where I went wrong?

In the following, your return value is an Array of type String.
finalproject.core> (.split "e 1 2" " ")
#<String[] [Ljava.lang.String;@180f214>
To use it more conveniently in Clojure, you can put it into a vector:
user=> (vec (.split "e 1 2" " "))
["e" "1" "2"]
You can also use the built in clojure.string namespace:
user=> (require '[clojure.string :as string])
nil
user=> (string/split "e 1 2" #" ")
["e" "1" "2"]
The source of your stack trace is here:
(let [split-line (.split line " ")
first-str (split-line 1)
second-str (split-line 2)] ...)
This gets a String array via .split, then attempts to call it as if it were a function. Perhaps you meant to use get here to access an element of the array by index? (get split-line 1) returns the element of split-line at index 1, and so on.
You'll see another problem here:
((read-string first-str) (read-string second-str))
If I am reading your code properly, this will end up calling a number as if it were a function, with another number as an argument. Perhaps you intend to return a pair of numbers?
[(read-string first-str) (read-string second-str)]
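Putting these fixes together, the line parser might look like the sketch below. The names line->edge and lines->edges are made up for this example, and returning nil for comment lines is an assumption about what the caller wants. Note also that doseq in the question's readGraphFile always returns nil, so edge-list would be nil there; a function that returns a sequence, such as keep, avoids that.

```clojure
(require '[clojure.string :as str])

(defn line->edge
  "Hypothetical helper: returns [a b] for an edge line such as \"e 2 1\",
   or nil for any other line (for example a comment starting with \"c\")."
  [line]
  (let [[tag a b] (str/split (str/trim line) #"\s+")]
    (when (= tag "e")
      [(Long/parseLong a) (Long/parseLong b)])))

(defn lines->edges
  "Parses a sequence of lines, keeping only the edges."
  [lines]
  (keep line->edge lines)) ; keep drops the nil results

(line->edge "e 2 1")
;; => [2 1]
(lines->edges ["c a comment" "e 2 1" "e 3 1"])
;; => ([2 1] [3 1])
```

When reading from a file with with-open and line-seq, the lazy result of lines->edges would need to be realized (for example with doall or vec) before the reader is closed.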

Automatic acronyms of strings in R

Long strings in plots aren't always attractive. What's the shortest way of making an acronym in R? E.g., "Hello world" to "HW", and preferably to have unique acronyms.
There's function abbreviate, but it just removes some letters from the phrase, instead of taking first letters of each word.

An easy way would be to use a combination of strsplit, substr, and make.unique.
Here's an example function that can be written:
makeInitials <- function(charVec) {
make.unique(vapply(strsplit(toupper(charVec), " "),
function(x) paste(substr(x, 1, 1), collapse = ""),
vector("character", 1L)))
}
Test it out:
X <- c("Hello World", "Home Work", "holidays with children", "Hello Europe")
makeInitials(X)
# [1] "HW" "HW.1" "HWC" "HE"
That said, I do think that abbreviate should suffice, if you use some of its arguments:
abbreviate(X, minlength=1)
# Hello World Home Work holidays with children Hello Europe
# "HlW" "HmW" "hwc" "HE"

Using a regex you can do the following. The pattern ((?<=\\s).|^.) matches any character preceded by whitespace, or the first character of the string. We then paste the resulting vectors together using the collapse argument to get an acronym built from the first letters. And as Ananda suggested, if you want unique results, pass the result through make.unique.
X <- c("Hello World", "Home Work", "holidays with children")
sapply(regmatches(X, gregexpr(pattern = "((?<=\\s).|^.)", text = X, perl = T)), paste, collapse = ".")
## [1] "H.W" "H.W" "h.w.c"
# If you want to make unique
make.unique(sapply(regmatches(X, gregexpr(pattern = "((?<=\\s).|^.)", text = X, perl = T)), paste, collapse = "."))
## [1] "H.W" "H.W.1" "h.w.c"

How to distribute strings in Emacs or Vim

In Emacs or Vim, what's a smooth way to join strings as in this example:
Transform from:
(alpha, beta, gamma) blah (123, 456, 789)
To:
(alpha=123, beta=456, gamma=789)
It would need to scale to:
many lines of these
many elements in the parentheses
I have recently found myself needing this kind of transformation often.
I use Evil in Emacs which is why a Vim answer would likely also help.
UPDATE:
The solutions were not as general as I had hoped. For example, I'd like the solution to also work when I have a list of strings and wish to distribute them into a large XML document. eg:
<item foo="" bar="barval1"/>
<item foo="" bar="barval2"/>
<item foo="" bar="barval3"/>
<item foo="" bar="barval4"/>
fooval1
fooval2
fooval3
fooval4
I formulated a solution and have added it as an answer.

%s/(\(\S\{-}\), \(\S\{-}\), \(\S\{-}\)).\{-}(\(\S\{-}\), \(\S\{-}\), \(\S\{-}\))/(\1=\4, \2=\5, \3=\6)
%s: global search and replace
\(\S\{-}\),: non-greedy match of non-whitespace characters up to the next comma, wrapped in \( \) so it can be backreferenced
\1=\4 : prints out the first match, an "=" sign, then the fourth match

For such text transformations I would go with awk.
This one-liner may help:
awk -F'\\(|\\)' '{split($2,t,",");split($4,v,",");printf "( "; for(x in t)s=s""sprintf("%s=%s, ", t[x],v[x]);sub(", $","",s);printf s")\n";s=""}' file
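A little test (awk does not guarantee the iteration order of for (x in t), which is why the fields in the last line come out reordered):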
kent$ cat test
(alpha, beta, gamma) blah (123, 456, 789)
(a, b, c) foo (1, 2, 3)
(x, y, z, m, n) bar (100, 200, 300, 400, 500)
kent$ awk -F'\\(|\\)' '{split($2,t,",");split($4,v,",");printf "( "; for(x in t)s=s""sprintf("%s=%s, ", t[x],v[x]);sub(", $","",s);printf s")\n";s=""}' test
( alpha=123, beta= 456, gamma= 789)
( a=1, b= 2, c= 3)
( m= 400, n= 500, x=100, y= 200, z= 300)

Emacs Lisp version of Prince Goulash's answer:
(require 'cl)
(defun split-and-trim (str separator)
(let ((strs (split-string str separator)))
(mapcar (lambda (s)
(replace-regexp-in-string "^\\s-+" "" s))
(mapcar (lambda (s)
(replace-regexp-in-string "\\s-+$" "" s)) strs))))
(defun my/merge-list (beg end)
(interactive "r")
(goto-char beg)
(let ((endmark (set-mark end))
(regexp "(\\([^)]+\\))[^(]+(\\([^)]+\\))"))
(while (re-search-forward regexp end t)
(let ((replace-start (match-beginning 0))
(replace-end (match-end 0))
(keys-str (match-string-no-properties 1))
(values-str (match-string-no-properties 2)))
(let* ((keys (split-and-trim keys-str ","))
(values (split-and-trim values-str ",")))
(while (> (length keys) (length values))
(setq values (append values '(""))))
(let* ((pairs (mapcar* (lambda (k v)
(format "%s=%s" k v)) keys values))
(transformed (format "(%s)" (mapconcat #'identity pairs ", "))))
(goto-char replace-start)
(delete-region replace-start replace-end)
(insert transformed)))))
(goto-char (marker-position endmark))))
For example, select a region such as the following:
(alpha, beta, gamma) blah (123, 456, 789)
(alpha, beta, gamma, delta) blah (123, 456, 789, aaa)
After M-x my/merge-list:
(alpha=123, beta=456, gamma=789)
(alpha=123, beta=456, gamma=789, delta=aaa)

This method I'm going to describe is a bit wacky, but it involves the minimum amount of Elisp code I could manage. It's only applicable if the lists to be joined can be interpreted as Lisp lists once the commas in them are removed. Numbers and sequences of alphabetic characters, as in your example, would be fine.
First, make sure that the Common Lisp library is loaded: M-:(require 'cl)RET.
Now, starting with the cursor at the start of the first list:
M-C-k ; kill-sexp
C-e ; move-end-of-line
M-C-b ; backward-sexp
M-C-k ; kill-sexp
C-a ; move-beginning-of-line
C-k ; kill-line
Now blah (or whatever) is the first entry in the kill ring, the second list is the second entry, and the first list is the third entry.
Type (, then M-: (eval-expression), take a deep breath, and type this:
(loop with (a b) = (mapcar (lambda (x) (car (read-from-string (remove ?, x))))
(subseq kill-ring 1 3))
for x in a for y in b do (insert (format "%s=%s, " y x)))
(I've broken it up for presentation purposes, but you can type it all on one line.)
Then finally type DEL DEL ) and you're done! You could turn it into a macro, if you wanted.

Here is a Vimscript solution. It is nowhere near as elegant as ash's answer, but it works with lists of any length.
function! ListMerge()
" Get line, remove text between lists, split lists at parentheses:
let curline = getline('.')
let curline = substitute(curline,')\zs.*\ze(','','g')
let curline = substitute(curline,'(','','g')
let lists = map(split(curline,')'),'split(v:val,",")')
" Return if we don't have two lists of equal length:
if len(lists) != 2 || len(lists[0]) != len(lists[1])
return
endif
" Loop over the lists, remove whitespace, build the replacement string:
let i=0
let string = '('
while i<len(lists[0])
let string .= substitute(lists[0][i],'^ *','','')
let string .= '='
let string .= substitute(lists[1][i],'^ *','','')
let string .= ', '
let i+=1
endwhile
" Add the concluding bracket:
let string = substitute(string,', $',')','')
" Replace the current line with the string:
execute "normal! S" . string
endfunction
You can then call this function on all lines like this:
:%call ListMerge()

My approach is to create one command to set a match-list, then use replace-regexp as the second command to distribute match-list, leveraging replace-regexp's existing \, facility.
Evaluate Elisp, such as in the .emacs file:
(defvar match-list nil
"A list of matches, as set through the set-match-list and consumed by the cycle-match-list function. ")
(defvar match-list-iter nil
"Iterator through the global match-list variable. ")
(defun reset-match-list-iter ()
"Set match-list-iter to the beginning of match-list and return it. "
(interactive)
(setq match-list-iter match-list))
(defun make-match-list (match-regexp use-regexp beg end)
"Set the match-list variable as described in the documentation for set-match-list. "
;; Starts at the beginning of region, searches forward and builds match-list.
;; For efficiency, matches are appended to the front of match-list and then reversed
;; at the end.
;;
;; Note that the behavior of re-search-backward is such that the same match-list
;; is not created by starting at the end of the region and searching backward.
(let ((match-list nil))
(save-excursion
(goto-char beg)
(while
(let ((old-pos (point)) (new-pos (re-search-forward match-regexp end t)))
(when (equal old-pos new-pos)
(error "re-search-forward makes no progress. old-pos=%s new-pos=%s end=%s match-regexp=%s"
old-pos new-pos end match-regexp))
new-pos)
(setq match-list
(cons (replace-regexp-in-string match-regexp
use-regexp
(match-string 0)
t)
match-list)))
(setq match-list (nreverse match-list)))))
(defun set-match-list (match-regexp use-regexp beg end)
"Set the match-list global variable to a list of regexp matches. MATCH-REGEXP
is used to find matches in the region from BEG to END, and USE-REGEXP is the
regexp to place in the match-list variable.
For example, if the region contains the text: {alpha,beta,gamma}
and MATCH-REGEXP is: \\([a-z]+\\),
and USE-REGEXP is: \\1
then match-list will become the list of strings: (\"alpha\" \"beta\")"
(interactive "sMatch regexp: \nsPlace in match-list: \nr")
(setq match-list (make-match-list match-regexp use-regexp beg end))
(reset-match-list-iter))
(defun cycle-match-list (&optional after-end-string)
"Return the next element of match-list.
If AFTER-END-STRING is nil, cycle back to the beginning of match-list.
Else return AFTER-END-STRING once the end of match-list is reached."
(let ((ret-elm (car match-list-iter)))
(unless ret-elm
(if after-end-string
(setq ret-elm after-end-string)
(reset-match-list-iter)
(setq ret-elm (car match-list-iter))))
(setq match-list-iter (cdr match-list-iter))
ret-elm))
(defadvice replace-regexp (before my-advice-replace-regexp activate)
"Advise replace-regexp to support match-list functionality. "
(reset-match-list-iter))
Then to solve the original problem:
M-x set-match-list
Match regexp: \([0-9]+\)[,)]
Place in match-list: \1
M-x replace-regexp
Replace regexp: \([a-z]+\)\([,)]\)
Replace regexp with: \1=\,(cycle-match-list)\2
And to solve the XML example:
[Select fooval strings.]
M-x set-match-list
Match regexp: .+
Place in match-list: \&
[Select XML tags.]
M-x replace-regexp
Replace regexp: foo=""
Replace regexp with: foo="\,(cycle-match-list)"

How to extract substrings from this string?

The string is
And I want to get substrings "11","1.1","282". Can anyone show me how to do this in R? Thanks!

I believe strsplit(x, " +")[[1]] will do it. (The regular expression " +" denotes one or more spaces. strsplit applies to character vectors and returns a list with the split version of each element in the vector, so [[1]] extracts the first, and only, component.)

> x = "11 1.1 282"
> res <- strsplit(x, " +")
> res
[[1]]
[1] "11" "1.1" "282"
>
