MapD slow on text fields

In benchmarking the following query:
SELECT * from test order by c0 asc limit 1
Out of ten tests we get:
Took: 7877 ms. (7.877 s)
Took: 15617 ms. (15.617 s)
Took: 8067 ms. (8.067 s)
Took: 15924 ms. (15.924 s)
Took: 8057 ms. (8.057 s)
Took: 15864 ms. (15.864 s)
Took: 15455 ms. (15.455 s)
Took: 15245 ms. (15.245 s)
Took: 7857 ms. (7.857 s)
Took: 15624 ms. (15.624 s)
c0 is a text field.
Why is this query so slow? And is there a way to fix this, or a workaround, so that a query that would take 0.1 s in any other database performs acceptably?

How many rows are in the table? What is the cardinality of c0? Is it possible to get access to your benchmark details and data?
MapD has no indexes, so with the SQL expressed like this the sort is fully executed over the c0 column, and we have not optimized for this particular style of query plan.
To get a more performant result for your query, try rewriting it as follows:
select * from test where c0 in (select c0 from test group by c0 order by c0 asc limit 1) limit 1;
A little long-winded, but I hope you will find better performance.
Using the online 400M Twitter dataset, the following query comes back in around 100 ms:
select * from tweets_new where county_state in (select county_state from tweets_new group by county_state order by county_state asc limit 1) limit 1;

Related

How to calculate space for number of records

I am trying to calculate the space required by a dataset using the formula below, but I am going wrong somewhere when I cross-check it against an existing dataset in the system. Please help me.
1st Dataset:
Record format . . . : VB
Record length . . . : 445
Block size . . . . : 32760
Number of records....: 51560
Using the formula below to calculate:
optimal block length (OBL) = 32760/record length = 32760/449 = 73
As there are two blocks on the track, hence (TOBL) = 2 * OBL = 73*2 = 146
Find number of physical records (PR) = Number of records/TOBL = 51560/146 = 354
Number of tracks = PR/2 = 354/2 = 177
But I see the below in the dataset information:
Current Allocation
Allocated tracks . : 100
Allocated extents . : 1
Current Utilization
Used tracks . . . . : 100
Used extents . . . : 1
2nd Dataset :
Record format . . . : VB
Record length . . . : 445
Block size . . . . : 27998
Number of Records....: 127,252
Using the formula below to calculate:
optimal block length (OBL) = 27998/record length = 27998/449 = 63
As there are two blocks on the track, hence (TOBL) = 2 * OBL = 63*2 = 126
Find number of physical records (PR) = Number of records/TOBL = 127252/126 = 1010
Number of tracks = PR/2 = 1010/2 = 505
Number of Cylinders = 505/15 = 34
But I see the below in the dataset information:
Current Allocation
Allocated cylinders : 69
Allocated extents . : 1
Current Utilization
Used cylinders . . : 69
Used extents . . . : 1
A few observations on your approach.
First, since you're dealing with variable-length records, it would be helpful to know the "average" record length, as that would allow a more accurate prediction of storage. Your approach assumes a worst-case scenario of all records being at maximum length, which is fine for planning purposes, but in reality you'll likely see that the actual allocation is lower if the average record length is below the maximum. For example, if records averaged 200 bytes (204 with the RDW), a 27,998-byte block would hold about 137 records rather than the worst-case 62.
The approach you are taking is reasonable, but consider that you can inform z/OS of the space requirements in blocks, records, or DASD geometry, or let DFSMS perform the calculation on your behalf. Refer to this article for some additional information on the options.
Back to your calculations:
Your Optimum Block Length (OBL) is really a records-per-block (RPB) number. Block size divided by maximum record length yields the number of records at full length that can be stored in a block. If your average record length is less, then you can store more records per block.
The assumption of two blocks per track may be true for your situation, but it depends on the actual device type that will be used for the underlying allocation. Here is a link to the geometries of the supported DASD devices.
Your assumption of two blocks per track is not correct for 3390s: two 32,760-byte blocks would need about 64 KB on a track, but as you can see the 3390s max out around 56 KB per track, so at that block size you would only get one block per track on the device.
Also, it looks like you did factor in the RDW by adding 4 bytes, but someone looking at the question might be confused if they are not familiar with V records on z/OS. In the case of your calculation, that would be 62 records per block at a block size of 27,998 (which is the "optimal" block size, since two such blocks fit comfortably on a track).
I'll use the following values:
MaximumRecordLength = RecordLength + 4 for RDW
TotalRecords = Total Records at Maximum Length (worst case)
BlockSize = modeled blocksize
RecordsPerBlock = number of records that can fit in a block (worst case)
BlocksNeeded = number of blocks needed to contain estimated records (worst case)
BlocksPerTrack = from IBM device geometry information
TracksNeeded = TotalRecords / RecordsPerBlock / BlocksPerTrack
CylindersNeeded = TracksNeeded / TracksPerCylinder (15 tracks per cylinder for most devices)
Example 1:
Total Records = 51,560
BlockSize = 32,760
BlocksPerTrack = 1 (from device table)
RecordsPerBlock: 32,760 / 449 = 72.96 (72)
Total Blocks = 51,560 / 72 = 716.11 (717)
Total Tracks = 717 * 1 = 717
Cylinders = 717 / 15 = 47.8 (48)
Example 2:
Total Records = 127,252
BlockSize = 27,998
BlocksPerTrack = 2 (from device table)
RecordsPerBlock: 27,998 / 449 = 62.35 (62)
Total Blocks = 127,252 / 62 = 2052.45 (2,053)
Total Tracks = 2,053 / 2 = 1,026.5 (1,027)
Cylinders = 1027 / 15 = 68.5 (69)
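To make the arithmetic above easy to reproduce, here is a minimal sketch of the worst-case estimate as code (Scala, purely illustrative; the blocks-per-track and tracks-per-cylinder values are inputs you look up in the IBM device tables, not something the code derives):
// Worst-case space estimate for a VB dataset, following the steps above.
def estimate(totalRecords: Long, recordLength: Int, blockSize: Int,
             blocksPerTrack: Int, tracksPerCylinder: Int = 15): (Long, Long) = {
  val maxRecordLength = recordLength + 4              // + 4 bytes for the RDW
  val recordsPerBlock = blockSize / maxRecordLength   // integer division: round down
  val blocksNeeded    = math.ceil(totalRecords.toDouble / recordsPerBlock).toLong
  val tracksNeeded    = math.ceil(blocksNeeded.toDouble / blocksPerTrack).toLong
  val cylindersNeeded = math.ceil(tracksNeeded.toDouble / tracksPerCylinder).toLong
  (tracksNeeded, cylindersNeeded)
}
println(estimate(51560L, 445, 32760, blocksPerTrack = 1))  // (717, 48), matching Example 1
println(estimate(127252L, 445, 27998, blocksPerTrack = 2)) // (1027, 69), matching Example 2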
Now, as to the actual allocation: it depends on how you allocated the space and on the size of the records. Assuming it was in JCL, you could use the RLSE subparameter of SPACE= (e.g. SPACE=(CYL,(69,10),RLSE)) to release space when the dataset is created and closed. This should release unused resources.
Given that the records are Variable the estimates are worst case and you would need to know more about the average record lengths to understand the actual allocation in terms of actual space used.
Final thought: all of the work you're doing can be overridden by your storage administrator through ACS routines. I believe that most people today would specify BLKSIZE=0 and let DFSMS do all of the hard work, because that component has more information about where a file will go, what the underlying devices are, and the most efficient way of doing the allocation. The days of hand-calculating disk geometry and allocation are more of a campfire story, unless your environment has not been administered to do these things for you.
Instead of trying to calculate tracks or cylinders, go for MBs or KBs. z/OS (DFSMS) will calculate for you how many tracks or cylinders are needed.
In JCL it is not straightforward, but also not too complicated once you get it.
There is a DD statement parameter called AVGREC=, which is the trigger. Let me do an example for your first case above:
//anydd DD DISP=(NEW,CATLG),
// DSN=your.new.data.set.name,
// RECFM=VB,LRECL=445,
// SPACE=(445,(51560,1000)),AVGREC=U
//* | | | |
//* V V V V
//* (1) (2) (3) (4)
Parameter AVGREC=U (4) tells the system three things:
Firstly, the first subparameter in SPACE= (1) shall be interpreted as an average record length. (Note that this value is completely independent of the value specified in LRECL=.)
Secondly, it tells the system, that the second (2), and third (3) SPACE= subparameter are the number of records of average length (1) that the data set shall be able to store.
Thirdly, it tells the system that numbers (2) and (3) are counts of single records (AVGREC=U). Alternatives are thousands (AVGREC=K) and millions (AVGREC=M) (strictly, multiples of 1,024 and 1,048,576).
So, this DD statement will allocate enough space to hold the estimated number of records. You don't have to care for track capacity, block capacity, device geometry, etc.
Given the number of records you expect and the (average) record length, you can easily calculate the number of kilobytes or megabytes you need. Unfortunately, you cannot directly specify KB or MB in JCL, but there is a way using AVGREC= as follows.
Your first data set will get 51560 records of (maximum) length 445, i.e. 22,944,200 bytes, or ~22,945 KB, or ~23 MB. The JCL for an allocation in KB looks like this:
//anydd DD DISP=(NEW,CATLG),
// DSN=your.new.data.set.name,
// RECFM=VB,LRECL=445,
// SPACE=(1,(22945,10000)),AVGREC=K
//* | | | |
//* V V V V
//* (1) (2) (3) (4)
You want the system to allocate primary space for 22945 (2) thousands (4) records of length 1 byte (1), which is 22,945 KB, and secondary space for 10000 (3) thousands (4) records of length 1 byte (1), i.e. 10,000 KB.
Now the same allocation, specifying MB:
//anydd DD DISP=(NEW,CATLG),
// DSN=your.new.data.set.name,
// RECFM=VB,LRECL=445,
// SPACE=(1,(23,10)),AVGREC=M
//* | | | |
//* V V V V
//* (1) (2)(3) (4)
You want the system to allocate primary space for 23 (2) millions (4) records of length 1 byte (1), which is 23 MB, and secondary space for 10 (3) millions (4) records of length 1 byte (1), i.e. 10 MB.
I rarely use anything other than the latter.
In ISPF, it is even easier: Data Set Allocation (3.2) allows KB and MB as space units (amongst all the old ones).
A useful and usually simpler alternative to using SPACE and AVGREC etc. is to simply use a DATACLAS for space, if your site has appropriately sized ones defined. If you look at ISMF Option 4 you can list the available DATACLASes and see what space values etc. they provide. You'd expect to see a number of ranges in size, some with or without Extended Format and/or Compression. Even if a DATACLAS overallocates a bit, it is likely the overallocated space will be released by the MGMTCLAS assigned to the dataset at close or during space management. And you do have the option to code DATACLAS AND SPACE, in which case any coded space (or other) value will override the DATACLAS, which helps with exceptions. It still depends on how your Storage Admins have coded the ACS routines, but generally users are allowed to specify a DATACLAS and it will be honored by the ACS routines.
For a basic dataset size calculation I just use LRECL times the expected max record count, divided by 1000 a couple of times to get a rough MB figure. Obviously variable records/blocks add 4 bytes each for the RDW and/or BDW, but unless the number of records is massive or DASD space is extremely tight, it shouldn't be significant enough to matter.
e.g.
=(51560*445)/1000/1000 shows as ~23MB
Also, don't expect your allocation to be exactly what you requested, because the minimum allocation on z/OS is 1 track, or ~56 KB. BLKSIZE also comes into effect by adding interblock gaps of ~32 bytes per block. With SDB (System Determined Blocksize), invoked by omitting BLKSIZE or coding BLKSIZE=0, the system will always try to provide half-track blocking as close to 28 KB as possible, i.e. two blocks per track, which is the most space-efficient. That does matter: a BLKSIZE of 80 bytes wastes ~80% of a track on interblock gaps. The BLKSIZE is also the unit of transfer when doing reads/writes to disk, so generally the larger the better, with some exceptions, such as a KSDS being randomly accessed by key, which might result in more data transfer than desired in an OLTP transaction.

Memory footprint of splitOn?

I wrote a file indexing program that should read thousands of text file lines as records and finally group those records by fingerprint. It uses Data.List.Split.splitOn to split the lines at tabs and retrieve the record fields. The program consumes 10-20 GB of memory.
Probably there is not much I can do to reduce that huge memory footprint, but I cannot explain why a function like splitOn (breakDelim) can consume that much memory:
Mon Dec 9 21:07 2019 Time and Allocation Profiling Report (Final)
group +RTS -p -RTS file1 file2 -o 2 -h
total time = 7.40 secs (7399 ticks @ 1000 us, 1 processor)
total alloc = 14,324,828,696 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
fileToPairs.linesIncludingEmptyLines ImageFileRecordParser ImageFileRecordParser.hs:35:7-47 25.0 33.8
breakDelim Data.List.Split.Internals src/Data/List/Split/Internals.hs:(151,1)-(156,36) 24.9 39.3
sortAndGroup Aggregations Aggregations.hs:6:1-85 12.9 1.7
fileToPairs ImageFileRecordParser ImageFileRecordParser.hs:(33,1)-(42,14) 8.2 10.7
matchDelim Data.List.Split.Internals src/Data/List/Split/Internals.hs:(73,1)-(77,23) 7.4 0.4
onSublist Data.List.Split.Internals src/Data/List/Split/Internals.hs:278:1-72 3.6 0.0
toHashesView ImageFileRecordStatistics ImageFileRecordStatistics.hs:(48,1)-(51,24) 3.0 6.3
main Main group.hs:(47,1)-(89,54) 2.9 0.4
numberOfUnique ImageFileRecord ImageFileRecord.hs:37:1-40 1.6 0.1
toHashesView.sortedLines ImageFileRecordStatistics ImageFileRecordStatistics.hs:50:7-30 1.4 0.1
imageFileRecordFromFields ImageFileRecordParser ImageFileRecordParser.hs:(11,1)-(30,5) 1.1 0.3
toHashView ImageFileRecord ImageFileRecord.hs:(67,1)-(69,23) 0.7 1.7
Or is type [Char] too memory inefficient (compared to Text), causing splitOn to take that much memory?
UPDATE 1 (+RTS -s suggestion of user HTNW)
23,446,268,504 bytes allocated in the heap
10,753,363,408 bytes copied during GC
1,456,588,656 bytes maximum residency (22 sample(s))
29,282,936 bytes maximum slop
3620 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 45646 colls, 0 par 4.055s 4.059s 0.0001s 0.0013s
Gen 1 22 colls, 0 par 4.034s 4.035s 0.1834s 1.1491s
INIT time 0.000s ( 0.000s elapsed)
MUT time 7.477s ( 7.475s elapsed)
GC time 8.089s ( 8.094s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.114s ( 0.114s elapsed)
Total time 15.687s ( 15.683s elapsed)
%GC time 51.6% (51.6% elapsed)
Alloc rate 3,135,625,407 bytes per MUT second
Productivity 48.4% of total user, 48.4% of total elapsed
The processed text files are smaller than usual (UTF-8 encoded, 37 MB). But still 3 GB of memory are used.
UPDATE 2 (critical part of the code)
Explanation: fileToPairs processes a text file. It returns a list of key-value pairs (key: fingerprint of record, value: record).
sortAndGroup associations = Map.fromListWith (++) [(k, [v]) | (k, v) <- associations]
main = do
CommandLineArguments{..} <- cmdArgs $ CommandLineArguments {
ignored_paths_file = def &= typFile,
files = def &= typ "FILES" &= args,
number_of_occurrences = def &= name "o",
minimum_number_of_occurrences = def &= name "l",
maximum_number_of_occurrences = def &= name "u",
number_of_hashes = def &= name "n",
having_record_errors = def &= name "e",
hashes = def
}
&= summary "Group image/video files"
&= program "group"
let ignoredPathsFilenameMaybe = ignored_paths_file
let filenames = files
let hashesMaybe = hashes
ignoredPaths <- case ignoredPathsFilenameMaybe of
Just ignoredPathsFilename -> ioToLines (readFile ignoredPathsFilename)
_ -> return []
recordPairs <- mapM (fileToPairs ignoredPaths) filenames
let allRecordPairs = concat recordPairs
let groupMap = sortAndGroup allRecordPairs
let statisticsPairs = map toPair (Map.toList groupMap) where toPair item = (fst item, imageFileRecordStatisticsFromRecords . snd $ item)
let filterArguments = FilterArguments {
numberOfOccurrencesMaybe = number_of_occurrences,
minimumNumberOfOccurrencesMaybe = minimum_number_of_occurrences,
maximumNumberOfOccurrencesMaybe = maximum_number_of_occurrences,
numberOfHashesMaybe = number_of_hashes,
havingRecordErrorsMaybe = having_record_errors
}
let filteredPairs = filterImageRecords filterArguments statisticsPairs
let filteredMap = Map.fromList filteredPairs
case hashesMaybe of
Just True -> mapM_ putStrLn (map toHashesView (map snd filteredPairs))
_ -> Char8.putStrLn (encodePretty filteredMap)
As I'm sure you're aware, there's not really enough information here for us to help you make your program more efficient. It might be worth posting some (complete, self-contained) code on the Code Review site for that.
However, I think I can answer your specific question about why splitOn allocates so much memory. In fact, there's nothing particularly special about splitOn or how it's been implemented. Many straightforward Haskell functions will allocate lots of memory, and this in itself doesn't indicate that they've been poorly written or are running inefficiently. In particular, splitOn's memory usage seems similar to other straightforward approaches to splitting a string based on delimiters.
The first thing to understand is that GHC compiled code works differently than other compiled code you're likely to have seen. If you know a lot of C and understand stack frames and heap allocation, or if you've studied some JVM implementations, you might reasonably expect that some of that understanding would translate to GHC executables, but you'd be mostly wrong.
A GHC program is more or less an engine for allocating heap objects, and -- with a few exceptions -- that's all it really does. Nearly every argument passed to a function or constructor (as well as the constructor application itself) allocates a heap object of at least 16 bytes, and often more. Take a simple function like:
fact :: Int -> Int
fact 0 = 1
fact n = n * fact (n-1)
With optimization turned off, it compiles to the following so-called "STG" form (simplified from the actual -O0 -ddump-stg output):
fact = \n -> case n of I# n' -> case n' of
0# -> I# 1#
_ -> let sat1 = let sat2 = let one = I#! 1# in n-one
in fact sat2;
in n*sat1
Everywhere you see a let, that's a heap allocation (16+ bytes), and there are presumably more hidden in the (-) and (*) calls. Compiling and running this program with:
main = print $ fact 1000000
gives:
113,343,544 bytes allocated in the heap
44,309,000 bytes copied during GC
25,059,648 bytes maximum residency (5 sample(s))
29,152 bytes maximum slop
23 MB total memory in use (0 MB lost due to fragmentation)
meaning that each iteration allocates over a hundred bytes on the heap, though it's literally just performing a comparison, a subtraction, a multiplication, and a recursive call.
This is what @HTNW meant in saying that total allocation in a GHC program is a measure of "work". A GHC program that isn't allocating probably isn't doing anything (again, with some rare exceptions), and a typical GHC program that is doing something will usually allocate at a relatively constant rate of several gigabytes per second when it's not garbage collecting. So, total allocation has more to do with total runtime than anything else, and it isn't a particularly good metric for assessing code efficiency. Maximum residency is also a poor measure of overall efficiency, though it can be helpful for assessing whether or not you have a space leak, if you find that it tends to grow linearly (or worse) with the size of the input where you expect the program should run in constant memory regardless of input size.
For most programs, the most important true efficiency metric in the +RTS -s output is probably the "productivity" rate at the bottom -- it's the amount of time the program spends not garbage collecting. And, admittedly, your program's productivity of 48% is pretty bad, which probably means that it is, technically speaking, allocating too much memory, but it's probably only allocating two or three times the amount it should be, so, at a guess, maybe it should "only" be allocating around 7-8 Gigs instead of 23 Gigs for this workload (and, consequently, running for about 5 seconds instead of 15 seconds).
With that in mind, if you consider the following simple breakDelim implementation:
breakDelim :: String -> [String]
breakDelim str = case break (=='\t') str of
(a,_:b) -> a : breakDelim b
(a,[]) -> [a]
and use it like so in a simple tab-to-comma delimited file converter:
main = interact (unlines . map (intercalate "," . breakDelim) . lines)
Then, unoptimized and run on a file with 10000 lines of 1000 3-character fields each, it allocates a whopping 17 Gigs:
17,227,289,776 bytes allocated in the heap
2,807,297,584 bytes copied during GC
127,416 bytes maximum residency (2391 sample(s))
32,608 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
and profiling it places a lot of blame on breakDelim:
COST CENTRE MODULE SRC %time %alloc
main Main Delim.hs:8:1-71 57.7 72.6
breakDelim Main Delim.hs:(4,1)-(6,16) 42.3 27.4
In this case, compiling with -O2 doesn't make much difference. The key efficiency metric, productivity, is only 46%. All these results seem to be in line with what you're seeing in your program.
The split package has a lot going for it, but looking through the code, it's pretty clear that little effort has been made to make it particularly efficient or fast, so it's no surprise that splitOn performs no better than my quick-and-dirty custom breakDelim function. And, as I said before, there's nothing special about splitOn that makes it unusually memory hungry -- my simple breakDelim has similar behavior.
With respect to inefficiencies of the String type, it can often be problematic. But, it can also participate in optimizations like list fusion in ways that Text can't. The utility above could be rewritten in simpler form as:
main = interact $ map (\c -> if c == '\t' then ',' else c)
which uses String but runs pretty fast (about a quarter as fast as a naive C getchar/putchar implementation) at 84% productivity, while allocating about 5 Gigs on the heap.
It's quite likely that if you just take your program and "convert it to Text", you'll find it's slower and more memory hungry than the original! While Text has the potential to be much more efficient than String, it's a complicated package, and the way in which Text objects behave with respect to allocation when they're sliced and diced (as when you're chopping a big Text file up into little Text fields) makes it more difficult to get right.
So, some take-home lessons:
Total allocation is a poor measure of efficiency. Most well written GHC programs can and should allocate several gigabytes per second of runtime.
Many innocuous Haskell functions will allocate lots of memory because of the way GHC compiled code works. This isn't necessarily a sign that there's something wrong with the function.
The split package provides a flexible framework for all manner of cool list splitting manipulations, but it was not designed with speed in mind, and it may not be the best method of processing a tab-delimited file.
The String data type has a potential for terrible inefficiency, but isn't always inefficient, and Text is a complicated package that won't be a plug-in replacement to fix your String performance woes.
Most importantly:
Unless your program is too slow for its intended purpose, its run-time statistics and the theoretical advantages of Text over String are largely irrelevant.

Why does looping over a large amount of data in another thread cause an overactive GC, and prevent some data from being freed?

I'm writing code that takes some lazy results produced by pmap, and draws them onto a BufferedImage. For three days now I've been trying to figure out why the drawing suddenly starts freezing and eventually halts about 1/3 of the way through.
I've finally narrowed it down to the fact that I'm looping over a large amount of data in another thread.
This is the best MCVE that I've come up with:
(ns mandelbrot-redo.irrelevant.write-image-mcve
(:import [java.awt.image BufferedImage]
(java.util.concurrent Executors Executor)))
(defn lazy-producer [width height]
(for [y (range height)
x (range width)]
[x y (+ x y)]))
; This works fine; finishing after about 5 seconds when width=5000
(defn sync-consumer [results width height]
(time
(doseq [[i result] (map vector (range) results)]
(when (zero? (rem i 1e6))
(println (str (double (/ i (* width height) 0.01)) "%")))
((fn boop [x] x) result)))) ; Data gets consumed here
; This gets to ~30%, then begins being interrupted by 1-4 second lags
(defn async-consumer [results width height]
(doto
(Thread. ^Runnable
(fn []
(sync-consumer results width height)
(println "Done...")))
(.start)))
(defn -main []
(let [width 5000
height (int (* width 2/3))]
(-> (lazy-producer width height)
(async-consumer width height))))
When -main is run with sync-consumer, it finishes after a few seconds. With async-consumer however, it gets to about 25%, then begins slowing to a crawl to the point where the last printed percentage is 30%. If I leave it, I get an OOME.
If I use an explicit Thread., or use a local thread pool in async-consumer, it hangs and crashes. If I use future however, it finishes fine, just like sync-consumer.
The only hint I've gotten is that when I run this in VisualVM, I see runaway allocation of Longs when using the async version.
In comparison, the sync version shows a peak of about 45 MB of Longs at once.
The CPU usage is quite different too: there are massive GC spikes, but it doesn't seem like the Longs are being disposed of.
I could use future for this, but I've been bitten by its exception swallowing behavior so many times, I'm hesitant.
Why is this happening? Why is running this in a new thread causing the GC to go crazy, while at the same time numbers aren't being freed?
Can anyone explain this behavior?
The sync version seems to be processing through the 16M+ results and will not hold onto the head of the results seq due to locals clearing. This means that as you go, values are created, processed, and GC'ed.
The async one closes over results in the fn and will hold the head, keeping all 16M+ values in memory, likely leading to GC thrashing?
I actually can't reproduce what you describe - both sync and async take about the same time for me as written above. (Clojure 1.9, Java 1.8).
I simplified your example, and get inconsistent results. I suspect that the manual Thread object is somehow being regarded (sometimes) as a daemon thread, so the JVM sometimes exits before it has completed:
(def N 5e3)
(def total-count (* N N))
(def report-fact (int (/ total-count 20)))
(defn lazy-producer []
(for [y (range N)
x (range N)]
[x y (+ x y)]))
(defn sync-consumer [results]
(println "sync-consumer: start")
(time
(doseq [[i result] (map vector (range) results)]
(when (zero? (rem i report-fact))
(println (str (Math/round (/ (* 100 i) total-count)) " %")))))
(println "sync-consumer: stop"))
(defn async-consumer [results]
; (spyx (count results))
(spyx (nth results 99))
(let [thread (Thread. (fn []
(println "thread start")
(sync-consumer results)
(println "thread done")
(flush)))]
; (.setDaemon thread false)
(.start thread)
(println "daemon? " (.isDaemon thread))
thread))
(dotest
(println "test - start")
(let [thread (async-consumer
(lazy-producer))]
(when true
(println "test - sleeping")
(Thread/sleep 5000))
; (.join thread)
)
(println "test - end"))
with results:
----------------------------------
Clojure 1.9.0 Java 10.0.1
----------------------------------
lein test tst.demo.core
test - start
(nth results 99) => [99 0 99]
daemon? false
test - sleeping
thread start
sync-consumer: start
0 %
5 %
10 %
15 %
20 %
25 %
30 %
35 %
40 %
45 %
50 %
55 %
test - end
Ran 2 tests containing 0 assertions.
0 failures, 0 errors.
60 %
lein test 54.58s user 1.37s system 372% cpu 15.028 total
If we uncomment the (.join thread) line, we get a complete run:
~/expr/demo > lein test
----------------------------------
Clojure 1.9.0 Java 10.0.1
----------------------------------
lein test tst.demo.core
test - start
(nth results 99) => [99 0 99]
daemon? false
test - sleeping
thread start
sync-consumer: start
0 %
5 %
10 %
15 %
20 %
25 %
30 %
35 %
40 %
45 %
50 %
55 %
60 %
65 %
70 %
75 %
80 %
85 %
90 %
95 %
"Elapsed time: 9388.313828 msecs"
sync-consumer: stop
thread done
test - end
Ran 2 tests containing 0 assertions.
0 failures, 0 errors.
lein test 72.52s user 1.69s system 374% cpu 19.823 total
It seems to exit early as if Clojure killed off the manual Thread object.
Perhaps you have found an (intermittent) bug.
Thanks to @amalloy and @Alex, I got it working.
I implemented the suggestions by @amalloy in the comments, and both variants work here and in my real case:
; Brittle since "once" is apparently more of an implementation detail of the language
(defn async-consumer [results width height]
(doto
(Thread. ^Runnable
(^:once fn* []
(sync-consumer results width height)
(println "Done...")))
(.start)))
; Arguably less brittle under the assumption that if they replace "once" with another mechanism,
; they'll update "delay".
(defn async-consumer [results width height]
(let [d (delay (sync-consumer results width height))]
(doto
(Thread. ^Runnable
(fn []
@d
(println "Done...")))
(.start))))
I also tried updating to 1.9.0. I thought that might fix it since @Alex says he's on 1.9.0 and can't reproduce this, and there's also this bug fix that seems related. Unfortunately, I didn't notice any difference.
It would be nice if there were an actual, solid mechanism for this. ^:once seems fine, but I don't want to use it just to have it potentially break later, and the use of delay seems like blatant abuse of the object just to make use of its inner (^:once fn* ...).
Oh well, at least it works now. Thanks guys.

job hangs when using pipe() with reduceByKey()

I've run into a situation:
When I use
val a = rdd.pipe("./my_cpp_program").persist()
a.count() // just use it to persist a
val b = a.map(s => (s, 1)).reduceByKey().count()
it's so fast,
but when I use
val b = rdd.pipe("./my_cpp_program").map(s => (s, 1)).reduceByKey().count()
it is so slow....
and there are many such log in my executors:
15/10/31 19:53:58 INFO collection.ExternalSorter: Thread 78 spilling in-memory map of 633.1 MB to disk (8 times so far)
15/10/31 19:54:14 INFO collection.ExternalSorter: Thread 74 spilling in-memory map of 633.1 MB to disk (8 times so far)
15/10/31 19:54:17 INFO collection.ExternalSorter: Thread 79 spilling in-memory map of 633.1 MB to disk (8 times so far)
15/10/31 19:54:29 INFO collection.ExternalSorter: Thread 77 spilling in-memory map of 633.1 MB to disk (8 times so far)
15/10/31 19:54:50 INFO collection.ExternalSorter: Thread 76 spilling in-memory map of 633.1 MB to disk (9 times so far)
You haven't passed a function to reduceByKey(). From the docs for reduceByKey:
When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
In this case, you want to pass the anonymous function (a, b) => a + b to aggregate values across your keys (can also be written as _ + _ using Scala's shortened underscore notation).
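Applied to the pipeline from the question, that would look like this (a sketch under the question's setup, with rdd and the program path as given there):
val b = rdd.pipe("./my_cpp_program")
  .map(s => (s, 1))
  .reduceByKey(_ + _) // (a, b) => a + b: sums the 1s for each distinct key
  .count()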
Since you are calling count() though (which essentially will count the number of unique keys after a reduceByKey()), it probably makes sense for you to just use distinct() instead. The implementation of distinct is actually very similar to what you're currently trying to do (mapping instead to (s, null) and then calling reduceByKey) but from a code readability standpoint, a distinct would better indicate what your end goal is. Something like this would work:
val b = rdd.pipe("./my_cpp_program").distinct().count()
Since you may actually also be interested in the counts for each unique key, there are other functions in the PairRDDFunctions class that can help with this. I would check out countByKey(), countByKeyApprox(), and countApproxDistinctByKey(). Each has different use cases but offers interesting solutions to its respective problem.
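For instance, a countByKey() version of the same pipeline (an illustrative sketch, not code from the question) returns the per-key counts directly as a Map on the driver, so it only suits key sets small enough to fit in driver memory:
val countsPerLine: scala.collection.Map[String, Long] =
  rdd.pipe("./my_cpp_program")
     .map(s => (s, 1)) // the value is ignored; countByKey counts occurrences of each key
     .countByKey()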

Why such a big difference in GC time between two implementations

I have two implementations of an averaging algorithm:
average(List) -> sum(List) / len(List).
sum([]) -> 0;
sum([Head | Tail]) -> Head + sum(Tail).
len([]) -> 0;
len([_ | Tail]) -> 1 + len(Tail).
average1(List) -> average_acc(List, 0,0).
average_acc([], Sum, Length) -> Sum / Length;
average_acc([H | T], Sum, Length) -> average_acc(T, Sum + H, Length + 1).
and the output for the GC trace events gc_start and gc_end (GC started and stopped), where every successive value for a process is the sum of the preceding value and the last GC time:
average: 5189
average: 14480
average: 15118
average1: 594
Why such a big difference?
PS. I use wall clock time.
You should not use wall clock time (the timestamp trace flag) to measure what GC takes because, even though GC is not rescheduled within the Erlang scheduler thread, the thread itself can be rescheduled by the underlying OS. You should use cpu_timestamp instead.
Your average/1 uses the sum/1 and len/1 implementations, neither of which is tail-recursive, so you allocate 2*N stack frames. That makes the big difference in performance compared to average1/1, which uses the tail-recursive average_acc/3. So average1/1 performs exactly like a loop in other languages.
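For comparison, here are the same two shapes sketched in Scala (an illustration only, not the poster's code); Scala's @tailrec annotation would reject the first version precisely because its recursive call is not in tail position:
// Body-recursive: each element adds a stack frame before any addition runs,
// just like sum/1 and len/1 above.
def sum(xs: List[Double]): Double = xs match {
  case Nil       => 0.0
  case h :: tail => h + sum(tail)
}
// Accumulator version, the analogue of average_acc/3: the recursive call is in
// tail position, so it compiles to a loop and uses constant stack space.
@annotation.tailrec
def averageAcc(xs: List[Double], sum: Double, len: Long): Double = xs match {
  case Nil       => sum / len
  case h :: tail => averageAcc(tail, sum + h, len + 1)
}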
Edit:
Edited: as Yola stated, he is using the timestamp flag for trace messages.
