Chicken Scheme read-line taking too long

Is there a fast way to read and tokenize a large corpus? I am trying to read a moderately large text file and compiled CHICKEN seems to just hang (I killed the process after about 2 mins) whereas, say, Racket performs acceptably (around 20 sec). Is there anything I can do to get the same performance on CHICKEN? This is the code I am using to read the file. All suggestions are welcome.
(define *corpus*
  (call-with-input-file "largeish_file.txt"
    (lambda (input-file)
      (let loop ([line (read-line input-file)]
                 [tokens '()])
        (if (eof-object? line)
            tokens
            (loop (read-line input-file)
                  (append tokens (string-split line))))))))

Try running it with a larger initial heap:
./prog -:hi100M
The program does a lot of allocation, which means the heap needs to be resized a lot, which triggers a lot of major GCs (and those are expensive).
You can see the heap resizes happening when you enable debugging output:
./prog -:d
If you want to see the GC output, try:
./prog -:g

If you can afford to read the entire file into memory in one go, you can use something like the following code, which should be faster:
(let loop ((lines (with-input-from-file "largeish_file.txt"
                    read-lines)))
  (if (null? lines)
      '()
      (append (string-split (car lines))
              (loop (cdr lines)))))
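The big win is in how tokens are accumulated: the original copies the entire tokens accumulator on every line via append, which is quadratic in the total number of tokens, while the second version only appends each line's tokens to the front of the recursion's result. A rough Python sketch of the same asymptotic difference (Python lists rather than cons cells, so this is only an analogy):

```python
def tokens_quadratic(lines):
    # Mirrors (append tokens (string-split line)): the whole accumulator
    # is copied on every iteration, so total work grows as O(n^2).
    tokens = []
    for line in lines:
        tokens = tokens + line.split()
    return tokens

def tokens_linear(lines):
    # Extends the accumulator in place, amortized O(1) per token.
    tokens = []
    for line in lines:
        tokens.extend(line.split())
    return tokens

sample = ["one two three"] * 5
assert tokens_quadratic(sample) == tokens_linear(sample)
```

Both produce the same token list; only the amount of copying differs, which is also why the heavy allocation triggers so many major GCs in the first version.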
Here's some quick benchmark code:
(import (chicken io)
        (chicken string))

;; Warm-up
(with-input-from-file "largeish_file.txt" read-lines)

(time
 (with-output-to-file "a.out"
   (lambda ()
     (display
      (call-with-input-file "largeish_file.txt"
        (lambda (input-file)
          (let loop ([line (read-line input-file)]
                     [tokens '()])
            (if (eof-object? line)
                tokens
                (loop (read-line input-file)
                      (append tokens (string-split line)))))))))))

(time
 (with-output-to-file "b.out"
   (lambda ()
     (display
      (let loop ((lines (with-input-from-file "largeish_file.txt"
                          read-lines)))
        (if (null? lines)
            '()
            (append (string-split (car lines))
                    (loop (cdr lines)))))))))
And here are the results on my system:
$ csc bench.scm && ./bench
28.629s CPU time, 13.759s GC time (major), 68772/275 mutations (total/tracked), 4402/14196 GCs (major/minor), maximum live heap: 4.63 MiB
0.077s CPU time, 0.033s GC time (major), 68778/292 mutations (total/tracked), 10/356 GCs (major/minor), maximum live heap: 3.23 MiB
Just making sure we got the same result from both code snippets:
$ cmp a.out b.out && echo They contain the same data
They contain the same data
largeish_file.txt was generated by cat'ing a ~100KB syslog file until it got to ~10000 lines (mentioning this so that you get an idea of the profile of the input file):
$ ls -l largeish_file.txt
-rw-r--r-- 1 mario mario 587340 Aug 2 11:55 largeish_file.txt
$ wc -l largeish_file.txt
5790 largeish_file.txt
These results were obtained using CHICKEN 5.2.0 on a Debian system.

Related

unread-char behaviour deviating from spec?

The Common Lisp HyperSpec page for unread-char says both of the following things:
"unread-char is intended to be an efficient mechanism for allowing the Lisp reader and other
parsers to perform one-character lookahead in input-stream."
"It is an error to invoke unread-char twice consecutively on the same stream without an
intervening call to read-char (or some other input operation which implicitly reads characters)
on that stream."
I'm investigating how to add support for multiple-character lookahead for CL streams for a
parser I'm planning to write, and just to confirm the above, I ran the following code:
(defun unread-char-test (data)
  (with-input-from-string (stream data)
    (let ((stack nil))
      (loop
        for c = (read-char stream nil)
        while c
        do (push c stack))
      (loop
        for c = (pop stack)
        while c
        do (unread-char c stream)))
    (coerce
     (loop
       for c = (read-char stream nil)
       while c
       collect c)
     'string)))
(unread-char-test "hello")
==> "hello"
It doesn't throw an error (on SBCL or CCL; I haven't tested other implementations yet), but I don't see how there can possibly be any read operations (implicit or explicit) taking place on the stream between the consecutive calls to unread-char.
This behaviour is good news for multiple-character lookahead, as long as it is consistent, but why
isn't an error being thrown?
In response to user jkiiski's comment I did some more digging. I defined a function similar to the above but that takes the stream as an argument (for easier reuse):
(defun unread-char-test (stream)
  (let ((stack nil))
    (loop
      for c = (read-char stream nil)
      while c
      do (push c stack))
    (loop
      for c = (pop stack)
      while c
      do (unread-char c stream)))
  (coerce
   (loop
     for c = (read-char stream nil)
     while c
     collect c)
   'string))
I then ran the following in a second REPL:
(defun create-server (port)
  (usocket:with-socket-listener (listener "127.0.0.1" port)
    (usocket:with-server-socket (connection (usocket:socket-accept listener))
      (let ((stream (usocket:socket-stream connection)))
        (print "hello" stream)))))
(create-server 4000)
And the following in the first REPL:
(defun create-client (port)
  (usocket:with-client-socket (connection stream "127.0.0.1" port)
    (unread-char-test stream)))
(create-client 4000)
And it did throw the error I expected:
Two UNREAD-CHARs without intervening READ-CHAR on #<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302001813E2D>
[Condition of type SIMPLE-ERROR]
This suggests that jkiiski's assumption is correct. The original behaviour was also observed when the input was read from a text file, like so:
(with-open-file (stream "test.txt" :direction :output)
  (princ "hello" stream))

(with-open-file (stream "test.txt")
  (unread-char-test stream))
==> "hello"
I imagine that, when dealing with local file I/O, the implementation reads large chunks of a file into memory, and then read-char reads from the buffer. If correct, this also supports the assumption that the error described in the specification is not thrown by typical implementations when unreading from a stream whose contents are in-memory.
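If you would rather not depend on this implementation-specific leniency, you can build multi-character lookahead on top of single-character reads with an explicit pushback stack of your own. A minimal sketch in Python (the class and method names are my own invention, not part of any standard):

```python
class PeekStream:
    """Character stream with unbounded pushback, for n-character lookahead."""

    def __init__(self, text):
        self._it = iter(text)
        self._pushback = []          # our own stack; no stream limit applies

    def read_char(self):
        # Serve pushed-back characters first, then the underlying stream.
        if self._pushback:
            return self._pushback.pop()
        return next(self._it, None)  # None plays the role of EOF

    def unread_char(self, c):
        # Unlike CL's unread-char, consecutive calls are always fine here.
        self._pushback.append(c)

    def peek(self, n):
        # Look ahead up to n characters without consuming them.
        chars = []
        for _ in range(n):
            c = self.read_char()
            if c is None:
                break
            chars.append(c)
        for c in reversed(chars):
            self.unread_char(c)
        return ''.join(chars)

s = PeekStream("hello")
assert s.peek(3) == "hel"       # lookahead does not consume
assert s.read_char() == "h"     # the stream is still intact
```

Because the pushback buffer belongs to the parser rather than the stream, this works identically over files, strings, and sockets.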

Racket: Storing bytes then outputting them all at the end of program

My program currently writes bytes using write-byte throughout the program.
When there is an error, the program stops there, but I've realized that the bytes written before the error was encountered still end up in the output.
I was wondering if it is possible to hold on to all the bytes that I want to output until the successful ending of the program so that if the program encounters an error before the end of the program, it outputs nothing, and if no error is encountered, then I can output all the bytes that I wanted to write.
You can wrap your program in with-output-to-bytes to produce a bytestring value instead of writing directly to stdout:
(with-output-to-bytes
 (λ ()
   (write-bytes #"a")
   (write-bytes #"b")))
Internally, this is just a super simple wrapper around open-output-bytes and a parameterization of current-output-port, so if you want more fine-grained control, you can use those directly. For example, if you have a simple script and don’t want to wrap the whole program, you can mutate the current-output-port parameter globally:
(define stdout (current-output-port))
(define output (open-output-bytes))

(current-output-port output)

(void
 (begin
   (write-bytes #"a")
   (write-bytes #"b")))

(write-bytes (get-output-bytes output) stdout)
However, be careful: mutating current-output-port like that will affect everything that prints, including the output from expressions evaluated at a module level, which is why it is necessary to wrap the write-bytes invocations with void above.
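The same collect-then-commit idea is easy to express in any language. A small Python sketch (the helper name run_buffered is mine): output reaches the real port only if the program finishes without raising, which is exactly the behaviour asked for.

```python
import io

def run_buffered(program):
    # The program writes bytes to the buffer it is handed; if it raises,
    # the exception propagates and the buffer is simply discarded.
    buf = io.BytesIO()
    program(buf)
    return buf.getvalue()  # on success, the caller emits this in one go

def ok(out):
    out.write(b"a")
    out.write(b"b")

def bad(out):
    out.write(b"a")
    raise RuntimeError("boom")

assert run_buffered(ok) == b"ab"
try:
    run_buffered(bad)
except RuntimeError:
    pass  # nothing was emitted for the failed run
```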
One can add bytes to a list and print them together later:
(define lst '())
(set! lst (cons #"a" lst))
(set! lst (cons #"b" lst))
(println lst)
(for ((item (reverse lst)))
  (write-bytes item))
Output:
'(#"b" #"a")
ab
The list has to be reversed because cons adds each item to the head of the list.

How can I delete silence from the middle of .wav files in Audacity but not the edges?

I am trying to delete silence from an audio file using Audacity. There is a Nyquist plugin called Trim Silence which deletes silence from the start and end of a file, but not the middle. I would like to invert this, and delete silence from everywhere except the start and end.
I think the function below is the relevant part of the plugin. How should I change it to get a truncate-internal-silences function? (I don't know any Nyquist, or Lisp, so I'm struggling to understand what it currently does, let alone change it.)
Entirely different approaches also welcome - this is just my current best guess at how to edit my many audio files.
(defun trim-silence ()
  ;; Nyquist plug-ins cannot return 'no audio', so trap as error.
  (if (< (get '*selection* 'peak-level) threshold)
      (throw 'error (format nil "Error.~%All selected audio in the ~a selected track~%~
                          is below the silence threshold.~%~%~
                          Try setting the threshold to a~%~
                          lower (more negative) dB level."
                    (add-num-suffix (get '*track* 'index)))))
  (if (> len (* limit *sound-srate*)) ; max length in samples
      (throw 'error (format nil "Error.\nMax RAM usage by Trim Silence is set to ~a GB.~%This allows a maximum duration ~
                          for a ~a~%track at ~a Hz of ~a.~%Selected track is ~a.~%"
                    RAM-limit
                    (if (arrayp *track*) "stereo" "mono")
                    (round *sound-srate*)
                    (to-hhmmss limit)
                    (to-hhmmss (get-duration 1)))))
  (let* (;; ratio provides tighter trimming for short selections
         ;; while maintaining reasonable speed for long selections
         (ratio (max 10 (min 200 (round (/ len 100000.0)))))
         (my-srate (/ *sound-srate* ratio))
         (mysound (convert *track* ratio))
         (limits (get-clip-limits)) ; (list start, end) times of audio clips
         (clip-start (if (first limits)
                         (abs-to-relative-time (nth 0 limits)))) ; nil if invalid
         (clip-end (if (second limits)
                       (abs-to-relative-time (nth 1 limits))))) ; nil if invalid
    ;; loop through samples and mark start and end
    (setf result (find-sil mysound clip-start clip-end))
    (let ((start (if clip-start
                     (max clip-start
                          (- (/ (first result) my-srate) min-start-silence))
                     0))
          (end (if clip-end
                   (min (+ (- (get-duration 1) (/ (second result) my-srate))
                           min-end-silence)
                        clip-end)
                   (get '*selection* 'end))))
      ;; ensure at least 1 sample remains
      ;; This should never happen.
      (if (>= start end)
          (setq start (- end (/ *sound-srate*))))
      ;; trim
      (multichan-expand #'extract-abs start end (cue *track*)))))
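For orientation, the overall shape of the inverted task (find the first and last loud samples, keep everything outside them, and collapse silent runs between them) can be sketched independently of Nyquist. The Python sketch below works on a plain list of sample values; the function name, threshold convention, and max_run parameter are my own, and real audio processing would work on buffered chunks rather than a Python list:

```python
def truncate_internal_silence(samples, threshold, max_run):
    """Collapse internal runs of near-silent samples down to max_run,
    leaving leading and trailing silence untouched."""
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return list(samples)              # all silence: nothing to invert
    first, last = loud[0], loud[-1]
    out = list(samples[:first])           # keep leading silence verbatim
    run = 0
    for s in samples[first:last + 1]:
        if abs(s) < threshold:
            run += 1
            if run > max_run:
                continue                  # drop excess internal silence
        else:
            run = 0
        out.append(s)
    out.extend(samples[last + 1:])        # keep trailing silence verbatim
    return out

# Internal run of three silent samples collapses to one; edges survive.
assert truncate_internal_silence([0, 0, 1, 0, 0, 0, 1, 0], 0.5, 1) == [0, 0, 1, 0, 1, 0]
```

In the plugin's terms, first and last correspond to what find-sil locates, and the inversion is that the region between them is edited while the regions outside are preserved.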

How to exhaust a channel's values and then return the result (ClojureScript)?

Suppose that channel chan has the values 1 and 2 queued.
Goal: Make a function which takes chan and returns the vector [1 2]. Note that I am totally fine if this function has to block for some time before its value is returned.
Attempt:
(defn chan->vector
  [chan]
  (let [a (atom true) v []]
    (while (not-nil? @a)
      (go
        (reset! a (<! chan))
        (into v @a)
        (reset! a (<! chan))))
    v))
Result: My REPL freezes and eventually spits out a huge error. I have come to realize that this is because the (go ...) block is asynchronous and so returns immediately. Thus the atom in my (while ...) loop is never given a chance to be set to nil, and the loop can never terminate.
So how do I accomplish the desired result? In case it's relevant, I'm using ClojureScript and targetting nodejs.
You should use alts! from core.async for this task
(https://clojure.github.io/core.async/#clojure.core.async/alts!):
(def x (chan 10))

(go (>! x 1)
    (>! x 2)
    (>! x 3))

(defn read-all [from-chan]
  (<!! (go-loop [res []]
         (let [[v _] (alts! [from-chan] :default :complete)]
           (if (= v :complete)
             res
             (recur (conj res v)))))))
(read-all x)
;; output: [1 2 3]
(read-all x)
;; output: []
(go (>! x 10)
    (>! x 20)
    (>! x 30)
    (>! x 40))
(read-all x)
;; output: [10 20 30 40]
Inside the go-loop, (alts! [from-chan] :default :complete) tries to take a value from the channel; if no value is immediately available, it yields the default value :complete, which tells the loop to stop and return the accumulated values.
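The same take-if-immediately-available-else-bail pattern can be sketched with a plain Python queue from the standard library (not core.async, but the control flow is identical):

```python
import queue

def read_all(q):
    # Like alts! with :default - grab values while one is immediately
    # available, and stop as soon as the queue comes up empty.
    res = []
    while True:
        try:
            res.append(q.get_nowait())
        except queue.Empty:
            return res

q = queue.Queue()
for v in (1, 2, 3):
    q.put(v)
assert read_all(q) == [1, 2, 3]
assert read_all(q) == []   # a second drain finds nothing, as in the REPL session
```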
Update: since the blocking read (<!!) is absent in ClojureScript, you can rewrite it the following way:
(defn read-all [from-chan]
  (go-loop [res []]
    (let [[v _] (alts! [from-chan] :default :complete)]
      (if (= v :complete)
        res
        (recur (conj res v))))))
This version returns a channel, so you then read the single result value from it:
(go (let [res (<! (read-all x))]
      (println res)
      ;; do something else
      ))
You can use clojure.core.async/reduce:
;; demo setup
(def ch (async/chan 2))
(async/>!! ch :foo)
(async/>!! ch :bar)
;; background thread to print reduction result
(async/thread
  (prn (async/<!! (async/reduce conj [] ch))))
;; closing the channel…
(async/close! ch)
;; …terminates the reduction and the result gets printed out:
;; [:foo :bar]
clojure.core.async/reduce returns a channel that will produce a value if and when the original channel closes. Internally it uses a go block and will release control in between taking elements from the original channel.
If you want to produce a value after a certain amount of time passes whether or not the original channel closes, you can either wrap the original channel in a pass-through channel that closes itself after a timeout, or use a custom approach to the reduction step (perhaps the approach suggested by leetwinski).
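That timeout-bounded reduction can be sketched in Python with a standard-library queue, using None to stand in for a channel close (the function name and sentinel convention are mine):

```python
import queue
import time

def collect_until(q, deadline_s, sentinel=None):
    # Gather items until the producer sends `sentinel` (the 'close' signal)
    # or `deadline_s` elapses, whichever comes first.
    res = []
    end = time.monotonic() + deadline_s
    while True:
        remaining = end - time.monotonic()
        if remaining <= 0:
            return res
        try:
            item = q.get(timeout=remaining)
        except queue.Empty:
            return res
        if item is sentinel:
            return res
        res.append(item)

q = queue.Queue()
for v in (":foo", ":bar", None):   # None closes the 'channel'
    q.put(v)
assert collect_until(q, 1.0) == [":foo", ":bar"]
```

Whether you stop on close, on a deadline, or on both is exactly the design decision the paragraph above describes.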
Use into
Returns a channel containing the single (collection) result of the
items taken from the channel conjoined to the supplied collection. ch
must close before into produces a result.
Something like this should work (it prints the events from events-chan, provided events-chan is closed once it is done publishing events):
(go
  (println (<! (into [] events-chan))))
The source channel needs to end (close), otherwise you can't put all events into a collection.
Edit:
Re-reading your question, it is not very clear what you want to accomplish. Whatever you want to do, chan->vector needs to return a channel so that whoever calls it can wait for the result. In fact, chan->vector is exactly into:
;; chan->vector : Chan<Event> -> Chan<Vector[Event]>
(defn chan->vector [ch]
  (into [] ch))

(go
  (let [events (<! (chan->vector events-chan))]
    (println events))) ; Do whatever with the events vector
As I mentioned above, if the events chan never closes, then you have to do more thinking about how to consume the events. There is no magic solution. Do you want to batch the events by time intervals? By number of events? By a combination of those?
In summary, as mentioned above, chan->vector is into.
While possible in Clojure and many other languages, what you want to do is not possible in ClojureScript.
You want a function that blocks while listening to a channel. However, ClojureScript's version of core.async doesn't include the blocking operators. Why? Because ClojureScript doesn't block.
I couldn't find a reliable source to back that last sentence. There seems to be a lot of confusion around this topic on the web. However, I'm pretty sure of what I'm saying because ClojureScript ultimately becomes JavaScript, and that's how JavaScript works.
Indeed, JavaScript never blocks, neither on the browser nor in Node.js. Why? As far as I understand, it uses a single thread, so if it were to block, the user would be unable to do anything in the browser.
So it's impossible to do what you want. This is by design, because it could have disastrous UX effects. ClojureScript channels are like JavaScript events; in the same way you don't want an event listener to block the user interface while waiting for an event to happen, you also shouldn't want a channel to block while waiting for new values.
Instead, try using a callback function that gets called whenever a new value is delivered.

Clojure core.async: CPU hangs after timeout. Any way to properly kill the threads produced by a (go...) block?

Based on the core.async walkthrough example, I created the code below to handle some CPU-intensive jobs using multiple channels, with a timeout of 10 seconds. However, after the main thread returns, CPU usage remains around 700% (8-CPU machine). I have to manually run nrepl-close in Emacs to shut down the Java process.
Is there any proper way to kill the threads produced by a (go ...) block? I tried close! on each chan, but it doesn't work. I want to make sure the Java process's CPU usage goes back to 0 after the main thread returns.
(defn RETURNED-STR-FROM-SOME-CPU-INTENSE-JOB [] (do ... (str ...)))

(let [n 1000
      cs (repeatedly n chan)]
  (doseq [c cs]
    (go
      (>! c (RETURNED-STR-FROM-SOME-CPU-INTENSE-JOB))))
  (dotimes [i n]
    ;; wait up to 10 seconds for each job
    (let [[result source] (alts!! (conj cs (timeout 10000)))]
      (if (list-contains? cs source) ;; if the returned chan belongs to cs
        (prn "OK JOB FINISHED " result)
        (prn "JOB TIMEOUT"))))
  (doseq [i cs]
    (close! i)) ;; not enough to "kill" the go threads
  (prn "JOBS ARE DONE"))

;; Btw, list-contains? is used to check whether an element is in a list:
;; http://stackoverflow.com/questions/3249334/test-whether-a-list-contains-a-specific-value-in-clojure
(defn list-contains? [coll value]
  (let [s (seq coll)]
    (if s
      (if (= (first s) value) true (recur (rest s) value))
      false)))
In the REPL there seems to be no clean way yet.
I first tried a very dirty way, using the deprecated method Thread.stop:
(doseq [i @threadpool]
  (.stop i))
It seemed to work, as CPU usage dropped once the main thread returned to the REPL, but if I ran the program again in the REPL, it would just hang at the go block part!
Then I googled around and found this blog and it says
One final thing to note: we don't explicitly do any work to shutdown the go routines. Go routines will automatically stop operation when the main function exits. Thus, go routines are like daemon threads in the JVM (well, except for the "thread" part ...)
So I tried again, making my project into an uberjar and running it from a command console, and it turned out that CPU usage dropped immediately once the blinking cursor returned to the console!
Based on answer for another related question How to control number of threads in (go...), I've found a better way to properly kill all the threads started by (go...) block:
First, alter the executor var and supply a custom thread pool:
;; def, not defonce, so that the executor can be re-defined
;; Number of threads is fixed at 4
(def my-executor
  (java.util.concurrent.Executors/newFixedThreadPool
   4
   (conc/counted-thread-factory "my-async-dispatch-%d" true)))

(alter-var-root #'clojure.core.async.impl.dispatch/executor
                (constantly (delay (tp/thread-pool-executor my-executor))))
Then call the .shutdownNow and .awaitTermination methods after the (go...) blocks:
(.shutdownNow my-executor)
(while (not (.awaitTermination my-executor 10 java.util.concurrent.TimeUnit/SECONDS))
  (prn "...waiting 10 secs for executor pool to finish"))
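For comparison, the same own-the-pool-then-shut-it-down pattern looks like this in Python with concurrent.futures. Note that Python's shutdown(wait=True) waits for running tasks rather than interrupting them, so it is closer to shutdown plus awaitTermination than to shutdownNow:

```python
import concurrent.futures

def job(n):
    # Stand-in for a CPU-intensive task.
    return n * 2

# A fixed-size pool we own, so we can shut it down deterministically.
executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)
futures = [executor.submit(job, i) for i in range(8)]
results = [f.result() for f in futures]

# Reject new work and wait for in-flight tasks to finish.
executor.shutdown(wait=True)
assert results == [0, 2, 4, 6, 8, 10, 12, 14]
```

Owning the pool is the key point in both languages: you cannot cleanly shut down a shared global dispatcher, but you can shut down one you created yourself.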
[UPDATE]
The executor-shutdown approach above turned out not to be clean enough. The final solution for my case was to send a function that controls its own timeout into the go block, using the thunk-timeout function. Credit goes to this post. Example below:
(defn toSendToGo [args timeoutUnits]
  (let [result (atom nil)
        timeout? (atom false)]
    (try
      (thunk-timeout
       (fn [] (reset! result (myFunction args))) timeoutUnits)
      (catch java.util.concurrent.TimeoutException e
        (do (prn "!Time out after " timeoutUnits " seconds!!")
            (reset! timeout? true))))
    (if @timeout? (do sth))
    @result))

(let [c (chan)]
  (go (>! c (toSendToGo args timeoutUnits))))
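The per-task-timeout idea translates to Python as well; a sketch using a shared worker pool (the helper name with_timeout is mine). As with go blocks, the worker thread is not forcibly killed on timeout: only the waiting caller gives up, and the discarded task runs to completion in the background.

```python
import concurrent.futures

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def with_timeout(fn, seconds, *args):
    # Run fn in a worker thread and give up waiting after `seconds`,
    # returning None on timeout (roughly what thunk-timeout provides).
    future = _pool.submit(fn, *args)
    try:
        return future.result(timeout=seconds)
    except concurrent.futures.TimeoutError:
        return None

assert with_timeout(lambda x: x * 2, 1.0, 21) == 42
```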
(shutdown-agents)
Implementation-specific, JVM: both agents and channels use a global thread pool, and the termination function for agents iterates and closes all open threads in the VM. Empty the channels first: this action is immediate and non-reversible (especially if you are in a REPL).
