How to sum up the word frequencies after stemming in Racket? - nlp

As background I'm trying to make a NLP application in Racket and I arrived at the part where I have to stem the words (I also obtained their frequency).
I am using the (planet dyoo/porter-stemmer) package in order to stem, and as an example we can write:
(map (λ(x) (list (stem (first x)) (second x)))
'(("cryed" 1)
("racketeer" 2)
("crying" 3)
("playing" 4)
("racketing" 5)
("plays" 6)
("Racket" 7)))
Which produces: '(("cry" 1) ("racket" 2) ("cry" 3) ("plai" 4) ("racket" 5) ("plai" 6) ("racket" 7))
Now my goal is to sum up the frequency for each term, aka to arrive at: '(("cry" 4) ("racket" 14) ("plai" 10))
I came up with a way to do it, but I don't like my solution:
(define (frequency string)
(map (λ(x) (list (first x) (length x)))
(group-by (λ(x) x) (string-split string))))
(define (recalculate lst)
(frequency
(string-join
(flatten
(map (λ(x) (make-list (second x) (first x))) lst)))))
Basically I retype each word as many times as it's frequency, then make a single string containing all words and finally compute the frequency again. Is there a simpler(faster) way to achieve this?
I should perhaps add that the order doesn't matter ("plai" can come up before "cry" and so on). Also I'm looking for a simpler solution because I'm gonna have to use larger datasets and I want to make this faster (I'd also be glad even if the frequency function can be made more faster).

You could create an add-count procedure that takes a list of counts and a new count as arguments, and adds the count to the list if there are no similarly tagged counts already in the list, or combines the new count with an existing count.
#lang racket
(define (get-tag c) (first c))
(define (get-val c) (second c))
(define (add-count cs c)
(let* ((k (get-tag c))
(v (get-val c))
(old-count (assoc k cs)))
(if old-count
(cons (list k (+ v (get-val old-count)))
(remove old-count cs))
(cons c cs))))
Here get-tag and get-val are just convenience procedures to access the tag and value stored in a count. The assoc procedure is used to extract a copy of the first count in cs matching the new count c to be added. This count is stored in old-count, the value of which is used to create a new count which is added to the list after removing old-count from the original list cs.
With the add-count procedure defined, a procedure reduce-counts could be defined that goes through all of the counts and accumulates them to an empty list by using add-count. The resulting list will have the counts combined.
(define (reduce-counts cs (acc '()))
(if (null? cs)
acc
(reduce-counts (rest cs) (add-count acc (first cs)))))
Here is a test run:
reduce-counts.rkt> (define test-counts '(("cry" 1) ("racket" 2) ("cry" 3) ("play" 4) ("racket" 5) ("play" 6) ("racket" 7)))
reduce-counts.rkt> (reduce-counts test-counts)
'(("racket" 14) ("play" 10) ("cry" 4))
As an alternative approach you could use filter to collect counts with similar tags in a list, and combine those into a new count after summing the values. The combined counts can be collected in an accumulator before filtering the input to remove the tags which were just combined. This process can be repeated recursively until all counts have been combined, removed, and collected.
;;; An alternate solution
(define (combine-like-counts cs)
(list (get-tag (first cs))
(foldl (lambda (c x) (+ x (get-val c))) 0 cs)))
(define (reduce-counts cs (acc '()))
(if (null? cs)
acc
(let* ((k (get-tag (first cs)))
(k-tag? (lambda (c) (equal? k (get-tag c))))
(like (filter k-tag? cs))
(remaining (filter (negate k-tag?) cs)))
(reduce-counts remaining
(cons (combine-like-counts like) acc)))))
Here the combine-like-counts procedure assumes that all counts in the input list share the same tag, so a new count is formed by taking the tag and the sum of all values into a list.
The new reduce-counts procedure returns whatever has been placed in the accumulator when the input is the empty list, otherwise the tag of the first count is saved and used to create the k-tag? predicate, which is then used with filter to create a list of matching counts and a list of the remaining counts with all matching counts removed. The list of matching counts is combined into a single count with combine-like-counts and added to the accumulator, which is passed along with remaining recursively to reduce-counts.
This works as before, although the ordering has changed:
reduce-counts.rkt> (define test-counts '(("cry" 1) ("racket" 2) ("cry" 3) ("play" 4) ("racket" 5) ("play" 6) ("racket" 7)))
reduce-counts.rkt> (reduce-counts test-counts)
'(("play" 10) ("racket" 14) ("cry" 4))
I would suspect that these two implementations would have different performance characteristics depending on the particulars of their input data. My hunch is that the second would fare better for large input that contained large quantities of each tag, but the real answer would come from testing on some representative data samples.
If you are really concerned about performance for large amounts of data, you might consider converting the data to a hash table and using some of the built-in dictionary procedures to arrive at a similar solution.

Related

Racket - string->list returns strange results [duplicate]

I want to calculate the sum of digits of a number in Scheme. It should work like this:
>(sum-of-digits 123)
6
My idea is to transform the number 123 to string "123" and then transform it to a list '(1 2 3) and then use (apply + '(1 2 3)) to get 6.
but it's unfortunately not working like I imagined.
>(string->list(number->string 123))
'(#\1 #\2 #\3)
Apparently '(#\1 #\2 #\3) is not same as '(1 2 3)... because I'm using language racket under DrRacket, so I can not use the function like char->digit.
Can anyone help me fix this?
An alternative method would be to loop over the digits by using modulo. I'm not as used to scheme syntax, but thanks to #bearzk translating my Lisp here's a function that works for non-negative integers (and with a little work could encompass decimals and negative values):
(define (sum-of-digits x)
(if (= x 0) 0
(+ (modulo x 10)
(sum-of-digits (/ (- x (modulo x 10)) 10)))))
Something like this can do your digits thing arithmetically rather than string style:
(define (digits n)
(if (zero? n)
'()
(cons (remainder n 10) (digits2 (quotient n 10))))
Anyway, idk if its what you're doing but this question makes me think Project Euler. And if so, you're going to appreciate both of these functions in future problems.
Above is the hard part, this is the rest:
(foldr + (digits 12345) 0)
OR
(apply + (digits 1234))
EDIT - I got rid of intLength above, but in case you still want it.
(define (intLength x)
(define (intLengthP x c)
(if (zero? x)
c
(intLengthP (quotient x 10) (+ c 1))
)
)
(intLengthP x 0))
Those #\1, #\2 things are characters. I hate to RTFM you, but the Racket docs are really good here. If you highlight string->list in DrRacket and hit F1, you should get a browser window with a bunch of useful information.
So as not to keep you in the dark; I think I'd probably use the "string" function as the missing step in your solution:
(map string (list #\a #\b))
... produces
(list "a" "b")
A better idea would be to actually find the digits and sum them. 34%10 gives 4 and 3%10 gives 3. Sum is 3+4.
Here's an algorithm in F# (I'm sorry, I don't know Scheme):
let rec sumOfDigits n =
if n<10 then n
else (n%10) + sumOfDigits (n/10)
This works, it builds on your initial string->list solution, just does a conversion on the list of characters
(apply + (map (lambda (d) (- (char->integer d) (char->integer #\0)))
(string->list (number->string 123))))
The conversion function could factored out to make it a little more clear:
(define (digit->integer d)
(- (char->integer d) (char->integer #\0)))
(apply + (map digit->integer (string->list (number->string 123))))
(define (sum-of-digits num)
(if (< num 10)
num
(+ (remainder num 10) (sum-of-digits (/ (- num (remainder num 10)) 10)))))
recursive process.. terminates at n < 10 where sum-of-digits returns the input num itself.

Lisp: Add respective elements of a list of lists

Let's say I have a list:
((1 2 3) (8 4 7) (41 79 30) (0 8 5))
I want to do this:
(1+8+41+0 2+4+79+8 3+7+30+5) = (50 93 45)
I've found an ugly solution:
(defun nested+ (lst)
(let ((acc nil))
(dotimes (i (length (first lst)))
(push (apply #'+ (mapcar #'(lambda (a) (nth i a)) lst)) acc))
(reverse acc)))
It seems to work for my purposes, but I guess it is slow and un-lispy. What's the proper way?
One option is (apply #'mapcar #'+ list). Mapcar will consume as many lists as you give it and stop when it reaches the end of the shortest list.
The naive solution would be
(apply #'mapcar #'+ list)
However, as already pointed out e.g. here by stackoverflow and here by LispWorks, the call-arguments-limit of (in the worst case) 50 arguments applies to functions called by apply. And reduce is suggested instead.
Thus, I suggest:
(defun sum-all (lists)
(reduce #'(lambda (l1 l2) (mapcar #'+ l1 l2)) lists))
And indeed
(sum-all '((1 2 3) (8 4 7) (41 79 30) (0 8 5)))
;; (50 93 45)
Another option is to loop over your list of lists:
(defun sum-all (lists)
(loop
for list in lists
for result = (copy-list list) then (map-into result #'+ result list)
finally (return result)))
During the first iteration, the first list is copied. The resulting list is then used in successive iterations to hold the respective sums. At the end of the iteration, that result list is returned.

How to implement multithreading

Game rules
Consider a simple 2-player game played as follows: An even number of coins is laid out in a row. Taking turns, each player removes the coin on one of the ends of the row. The object is to have the highest value in coins when all coins have been taken.
Player one finds the sum of all of the even-numbered coins, and all the odd-numbered coins. If the sum of the odd numbered coins is higher, player one takes the leftmost coin; otherwise he takes the rightmost.
Player two has a choice, with an odd number of coins. So he tries taking a coin from both ends, to see which option would leave player 1 worse off.
The problem
I basically want to implement multitheading to this program. I am still very new to Clojure, and I couldn't really find any good material online, regarding multithreading, that could be applied to my program.
The code
(ns game.core
(:gen-class))
(defn vector-from-string [s]
(drop 1 (map read-string (clojure.string/split (clojure.string/trim-newline s) #" "))))
(defn string-from-file [f]
(slurp f))
(defn sum-of-evens [v]
(def evens (vector))
(loop [v v, index 1]
(when (seq v)
(if (even? index)
(def evens (conj evens (first v))))
(recur (rest v) (inc index))))
(reduce + evens))
(defn sum-of-odds [v]
(def odds (vector))
(loop [v v, index 1]
(when (seq v)
(if (odd? index)
(def odds (conj odds (first v))))
(recur (rest v) (inc index))))
(reduce + odds))
(defn player-two [v p1score p2score]
(if (not (empty? v))
(if (> (max (sum-of-odds (drop 1 v)) (sum-of-evens (drop 1 v))) (max (sum-of-odds (drop-last v)) (sum-of-evens (drop-last v))))
(player-one (drop-last v) p1score (+ p2score(last v)))
(player-one (drop 1 v) p1score (+ p2score (first v))))
(println "p1score" p1score "p2score" p2score)))
(defn player-one [v p1score p2score]
(if (not (empty? v))
(if (> (sum-of-odds v) (sum-of-evens v))
(player-two (drop 1 v) (+ p1score (first v)) p2score)
(player-two (drop-last v) (+ p1score (last v)) p2score))
(println "p1score" p1score "p2score" p2score)))
(defn -main [& args]
(let [v (vector-from-string (string-from-file "numbers.txt")) ]
(player-one v 0 0)))
So -main runs the player-one function first, and player-one calls player-two, and they both continue on until the end of the program. I would like to somehow implement multithreading to speed up the executing of this game with a higher amount of starting coins.
Your code is currently very unidiomatic.
A few remarks that hopefully help you getting into the right direction:
A def inside a defn (or def) is (almost) always wrong. You're thinking in terms of variable assignment and mutable variables here. This is not how Clojure works. Use variables in your recur instead, if you absolutely must, use a local atom (also almost always wrong, but less often wrong than def inside defn).
Your loops are unnecessarily complicated. You want to sum over the elements at even or odd indices? Use a combination of reduce, take-nth and rest:
(take-nth 2 [1 2 3])
;=> (1 3)
(take-nth 2 (rest [1 2 3 4]))
;=> (2 4)
The whole things looks like you're compiling this over and over again and then run the JVM with it. Am I right? The preferred way is to work at the REPL. How to access it, depends on which editing environment you use. There are many beginner-friendly REPLs out there. Gorilla REPL is one example.
Once you got your code and development workflow in better shape, you may want to explore functions like pmap and future for easy access to multi-threading. More advanced stuff involves a library called core.async, but that's probably not the ideal route for the beginner. You can also fall back to Java interop to create your threads. Again something that, while not really hard to do, requires a bit of experience with Clojure.
Hope that helps, even it is not a direct answer to your question.
First let's look at some issues in your example that will need to be addressed before parallelizing this code.
sum-of-evens is using def inside a function, which is almost always a mistake. This might seem to have the effect you want, but it's not the right way to achieve it. defs are typically used for namespace-level (at the same level as your function defns) values. We can refactor sum-of-evens to not rely on unintentionally side-effecty behavior via def:
(defn sum-of-evens [v]
(loop [v v
index 1
evens []]
(if (seq v)
(recur (rest v)
(inc index)
(if (even? index) ;; add a binding to loop, not a def
(conj evens (first v))
evens)) ;; pass unchanged value when necessary
(reduce + evens))))
But we can further simplify this function with keep-indexed:
(defn sum-of-evens [coll]
(->> coll
(keep-indexed (fn [i v] (when (even? (inc i))
v)))
(apply +)))
And when we do the same for sum-of-odds, we can see the functions are nearly identical except for the condition they use: odd? vs. even?. We can make another function that takes a predicate function:
(defn sum-by-index-pred [f coll]
(->> coll
(keep-indexed (fn [i v] (when (f i) v)))
(apply +)))
;; using partial application and function composition
(def sum-of-evens (partial sum-by-index-pred (comp even? inc)))
(def sum-of-odds (partial sum-by-index-pred (comp odd? inc)))
Looking at the implementation of player-one and player-two, they seem to be mutually recursive. I don't see how you could parallelize this to make it any faster because each turn is dependent on the previous turn's outcome; there's nothing to parallelize.
I'd suggest refactoring this so that your game rules and state are computed in one place, rather than mutually recursive functions.
(loop [scores (array-map :player-1 0 :player-2 0)
turns (cycle (keys scores))
vs (shuffle (range 100))]
(if (seq vs)
(let [higher-odds? (> (sum-of-odds vs) (sum-of-evens vs))
scores (if higher-odds?
(update scores (first turns) + (first vs))
(update scores (first turns) + (last vs)))
remain (if higher-odds?
(rest vs)
(butlast vs))]
(recur scores (rest turns) remain))
(prn scores)))
;; {:player-1 2624, :player-2 2326}
I'm not sure if this preserves your original game logic but it should be close, and it does generalize it for more than two players. Try adding :player-3 0 to the starting scores.

Scheme list of strings

my function in scheme looks like this
(define (func1 input)
(let kloop ((x 6))
(let ((act (string-copy (func2 input2))))
(if (eq? act "") (display "null") (display act))
(if (> x 0) (kloop (- x 1)))))))
func2 return some string which is stored in act. Now I have to create a list of all strings returned by this function. Here above, I am just displaying those strings. I tried different approaches, but nothing is working out. I tried using append and cons.
Please suggest.
Your last if is missing the else case, which is where one would expect the return value of the function to be.
You don't mention how you've tried to use append and cons, but a common pattern is to pass an accumulating parameter around in the loop:
(define (five input)
(let loop ((x 5) (outputs '()))
(if (> x 0)
(loop (- x 1) (cons input outputs))
outputs)))
> (five "yes")
'("yes" "yes" "yes" "yes" "yes")
You are calling func2 on input six times. Does it return a different value each time? If not, this works:
(define (func1 input)
(make-list 6 (func2 input)))
The question is a bit confusing, you should provide a sample of the expected output for a given input. And why the empty string is treated differently in your code? apparently the recursion should advance on the value of x, not the value of the string returned by func2. Also, why are you copying the string? seems unnecessary.
Assuming that the named let is used just for keeping track of the number of iterations, this solution seems aligned with your intent, as this will return a 6-element list of all strings returned by func2
(define (func1 input)
(let kloop ((x 6))
(if (zero? x)
'()
(cons (func2 input)
(kloop (- x 1))))))
But we can be smarter and use the named let to give a tail-recursive solution, which is more efficient:
(define (func1 input)
(let kloop ((x 6)
(acc '()))
(if (zero? x)
acc
(kloop (- x 1)
(cons (func2 input)
acc)))))

Scheme - list functions with filter

I am currently working on a homework assignment with MIT scheme, and have come across a few problems that are supposedly very short, though I'm a bit confused as to how to implement some of them.
One problem asks me to write a function that returns a list with all the integers removed. I did manage to solve that,
(define (f2a lst) (map (lambda(x) (remove number? x)) lst))
though I'm confused as to how I can rewrite it to not use remove, but rather use a filter.
*note: (f2a '(("a" 1 "b") (2 "c") (-1 "d") (-2))) returns '(("a" "b") ("c") ("d"))
The other two problems are ones to which I haven't found any solutions.
They ask me to write a function that returns a list with all positive odd and negative even integers removed. For example,
(f2b '(("a" 1 "b") (2 "c") (-1 "d") (-2)))
returns
(("a" "b") (2 "c") (-1 "d"))
I have some code down that is incorrect, but I feel shows how I have tried to approach solving this one:
(define (f2b lst)
(lambda(x)
(cond ((and (positive? x) (odd? x)) (filter x lst))
((and (negative? x) (even? x)) (filter x lst))
(else "this should never print"))))
The last problem simply asks for a function that returns a string consisting of all strings appended together in a list. (f2c '(("a" 1 "b") (2 "c") (-1 "d") (-2))) returns "abcd".
I almost managed to figure this one out, but got stuck when it kept returning strange values. This is the code I have:
(define (f2c lst)
(lambda(x)
(map (lambda (x) (filter string? x)) lst)
(list x))
(string-append (car lst) (cdr lst)))
In terms of higher-order syntax, I'm limited to map, filter, accumulate and sum. I am not asking for a direct answer, but rather some help for me to figure out what I need to do. What am I doing wrong with my code? Any assistance given with this is very much appreciated. Thank you.
The structure of the input and the desired output is identical in the first two problems; the only thing that differs is the predicate on when/when-not to remove an element. For the second case it would be:
(define (f2b lst)
(map (lambda (sublst)
(remove (lambda (x)
(and (number? x)
(or (and (positive? x) (odd? x))
(and (negative? x) (even? x)))))
sublst))
lst))
Since only the predicate differs you can generalize this as:
(define (f2x predicate)
(lambda (lst)
(map (lambda (sublst) (remove predicate sublst)) lst)))
(define f2a (f2x number?))
(define f2b (f2x (lambda (x)
(and (number? x)
(or (and (positive? x) (odd? x))
(and (negative? x) (even? x))))))
For your last problem, you can use the result of the first problem as:
(define (f2c lst)
(apply string-append (apply append (f2a list))))
Also, note that your syntax for f2b and f2a is incorrect. You are using
(define (func arg)
(lambda (x) ...))
which means that (func arg) returns a function which isn't what you want.

Resources