Memory consumption in Python - lists, subscripting, and pointers - python-3.x

I'm trying to understand how much memory Python objects use.
In the following code, I check the memory of a numpy array vs list as well as a subscripted numpy array:
import sys, os, psutil, numpy as np
def size_of(obj):
    return f'{sys.getsizeof(obj) / 1000000:,.0f} MB'
def get_memory_usage():
    process = psutil.Process(os.getpid())
    return f'{process.memory_info().rss / 1000000:,.0f} MB'
# Numpy vs List
print(f'(1) Mem usage: {get_memory_usage()}')
ONE_HUNDRED_MIL_NP = np.random.randint(-128,127,int(10**8),dtype='int8')
print(f'(2) Mem usage: {get_memory_usage()}, ONE_HUNDRED_MIL_NP: {size_of(ONE_HUNDRED_MIL_NP)}')
ONE_HUNDRED_MIL_LIST = list(np.random.choice(127, int(10**8), replace=True).astype('int8'))
print(f'(3) Mem usage: {get_memory_usage()}, ONE_HUNDRED_MIL_LIST: {size_of(ONE_HUNDRED_MIL_LIST)}')
# Now try subscripting
FOURCOLS = np.random.randint(-128,127,size=(int(10**8),4),dtype='int8')
print(f'(4) Mem usage: {get_memory_usage()}, FOURCOLS: {size_of(FOURCOLS)}')
FOURCOLS_PERMUTED = FOURCOLS[np.random.randint(0,len(FOURCOLS),size=len(FOURCOLS),dtype='int32')]
print(f'(5) Mem usage: {get_memory_usage()}, FOURCOLS_PERMUTED: {size_of(FOURCOLS_PERMUTED)}')
This returns:
(1) Mem usage: 187 MB
(2) Mem usage: 287 MB, ONE_HUNDRED_MIL_NP: 100 MB
(3) Mem usage: 3,526 MB, ONE_HUNDRED_MIL_LIST: 900 MB
(4) Mem usage: 3,926 MB, FOURCOLS: 400 MB
(5) Mem usage: 4,326 MB, FOURCOLS_PERMUTED: 400 MB
Notes:
Output (2) makes sense. One int8 is 8 bits (one byte) and 100 million bytes is 100 MB
Output (3) I don't understand:
The first issue is that sys.getsizeof() shows the object takes up 900 MB, but psutil shows that the process now takes up 3,239 MB more memory (3526-287=3239). Where is this phantom memory usage coming from?
Where does the 900 MB come from? (From Python: Size of Reference?, I'm assuming that there's 100 MB of the numpy object plus 100 million pointers, which are 8 bytes each, so 100 MB + 800 MB = 900 MB?)
Output (4) Makes sense. 400 million int8s is 400 MB.
Output (5) I don't understand. Is a copy being made or references? If references, we're only referencing 100 million rows, right? How does this make 400 MB?
Thanks

ONE_HUNDRED_MIL_NP = np.random.randint(-128,127,int(10**8),dtype='int8')
This makes an array. ONE_HUNDRED_MIL_NP.nbytes is a good measure of the array size. An array has some basic info like shape, strides, and dtype, but the bulk of the space is a 1d data buffer that contains bytes, in this case one byte per element.
ONE_HUNDRED_MIL_LIST = list(np.random.choice(127, int(10**8), replace=True).astype('int8'))
This produces a list from that array. A list has a data buffer that contains references to objects elsewhere in memory. getsizeof just measures the size of that buffer, and says nothing about the objects. Here those objects are numpy.int8 objects, extracted from the array. They don't actually reference elements of the array, but rather are copies of those values.
A better way to get a list from an array is with arr.tolist().
FOURCOLS = np.random.randint(-128,127,size=(int(10**8),4),dtype='int8')
This is just another array. The 2d shape doesn't change how much memory it takes up: 10**8 rows of 4 one-byte elements is 400 MB.
FOURCOLS_PERMUTED = FOURCOLS[np.random.randint(0,len(FOURCOLS),size=len(FOURCOLS),dtype='int32')]
This is an example of advanced indexing. It creates a new array with its own data buffer (not a view with a shared buffer). Yes, you are just indexing the first dimension, but the data buffer stores all values, not references to rows of FOURCOLS.
A list of lists does store references (pointers) to the nested lists, so shuffling the outer list would just shuffle the references. C arrays of row pointers work the same way. But multidimensional numpy arrays use a different model. The data is one flat C array, and the multidimensionality is produced by the shape/strides iteration code.
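The view-versus-copy distinction can be checked directly on a small array. A minimal sketch, assuming NumPy is installed; the array rows is a tiny illustrative stand-in for FOURCOLS, not the original data:

```python
import numpy as np

# Small stand-in for FOURCOLS: 6 rows of 2 int8 values.
rows = np.arange(12, dtype=np.int8).reshape(6, 2)

basic = rows[1:4]                   # basic slicing: a view onto the same buffer
fancy = rows[np.array([3, 0, 3])]   # advanced (integer-array) indexing: a new array

print(np.shares_memory(rows, basic))  # the view shares the buffer with rows
print(np.shares_memory(rows, fancy))  # the indexed result has its own buffer
print(fancy.nbytes)                   # 3 rows x 2 columns x 1 byte
```

Because the result of advanced indexing stores values rather than row references, selecting 10**8 rows of 4 bytes each costs another 400 MB regardless of which rows are picked.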
So looking at your numbers:
(1) Mem usage: 187 MB
base usage.
(2) Mem usage: 287 MB, ONE_HUNDRED_MIL_NP: 100 MB
Adds 100 MB to the base.
(3) Mem usage: 3,526 MB, ONE_HUNDRED_MIL_LIST: 900 MB
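The 900 MB is roughly the memory used by the list's own data buffer: 10**8 eight-byte pointers account for 800 MB, and the remainder is most likely spare capacity, since lists over-allocate as they grow. The rest of the total usage increase is storage for those 10**8 np.int8 objects.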
(4) Mem usage: 3,926 MB, FOURCOLS: 400 MB
This shows another 400 MB memory usage.
(5) Mem usage: 4,326 MB, FOURCOLS_PERMUTED: 400 MB
And yet another 400.
Without the list creation in (3), memory usage should show an orderly increase by the array size.
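The gap between what getsizeof reports for a list and what the list really costs can be reproduced at small scale. A minimal sketch, assuming CPython (where sys.getsizeof on a list counts only the list header and its pointer buffer); the helper name shallow_and_deep_size and the sample data are illustrative, not from the question:

```python
import sys

def shallow_and_deep_size(items):
    """Return (size of the list itself, that size plus the sizes of its elements)."""
    shallow = sys.getsizeof(items)  # list header + pointer buffer only
    # Add each referenced object; assumes the elements are distinct objects.
    deep = shallow + sum(sys.getsizeof(x) for x in items)
    return shallow, deep

# Hypothetical stand-in for the 10**8-element list: distinct float objects.
values = [float(i) for i in range(100_000)]
shallow, deep = shallow_and_deep_size(values)
print(f"list buffer: {shallow:,} bytes, with elements: {deep:,} bytes")
```

The second number is the one that tracks the process's resident memory, which is why psutil reports far more than the 900 MB shown for the list alone.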

Related

generating huge amount of random numbers with python

I want to generate random numbers, uniformly between -1 and 1.
I know that using NumPy to generate an array of numbers is much better than generating them one by one in a for loop.
On the other hand, I only want to operate on these numbers once, so there's no reason to store them in an array.
My question is: what is the best solution to this? On one hand, a for loop is not time efficient, but I don't store unnecessary numbers; I generate them one by one and then throw them away. On the other hand, an array is not memory efficient: if I want to generate 10^10 numbers, I need to create an array of size 10^10, with horrible results.
I assume the best choice is to generate small arrays (10^3 or 10^4 elements) one at a time, but I want to know if there's a better solution to this problem (maybe a NumPy function that generates the numbers but creates something like an iterable that doesn't store them all in memory?)
Using NumPy to generate blocks of numbers is best, and you want to keep operations vectorised as much as possible.
A simple benchmark shows that somewhere between 4k and 64k is a reasonable block size:
from timeit import Timer
import numpy as np
for xp in range(20):
    size = 2**xp
    timer = Timer(
        f'rng.uniform(-1., 1., size={size})',
        'rng = np.random.default_rng()',
        globals=globals()
    )
    n, t = timer.autorange()
    t = min([t] + timer.repeat(3, n)) / n / size
    print(f'{size:8} = {1e-6/t:6.2f}M/s')
gives me
1 = 0.47M/s
2 = 0.95M/s
4 = 1.89M/s
8 = 3.80M/s
16 = 7.43M/s
32 = 14.26M/s
64 = 27.10M/s
128 = 48.60M/s
256 = 78.72M/s
512 = 119.07M/s
1024 = 158.71M/s
2048 = 191.51M/s
4096 = 218.71M/s
8192 = 233.25M/s
16384 = 241.23M/s
32768 = 245.35M/s
65536 = 248.75M/s
131072 = 250.53M/s
262144 = 252.62M/s
524288 = 253.99M/s
and working with numbers in a vectorised form is orders-of-magnitude faster.
For example, given a 64k array of values, a vectorised call of np.sum(x) takes 17µs, while the similar version that iterates in Python, sum(x), takes 3.5ms, i.e. 200 times slower. Once you've paid the price for getting the floats out into the non-vectorised Python world, going through another yield from doesn't make much difference, only taking 4.5ms. E.g., via the IPython %timeit magic:
def yield_from(it):
    yield from it
x = np.random.uniform(-1, 1, size=2**16)
%timeit np.sum(x)
%timeit sum(x)
%timeit sum(yield_from(x))
You could make a generator, as said in the comment by @Carcigenicate, and combine that with the speedup of generating entire arrays by using a yield from expression.
This would look something like this:
def random_numbers():
    while True:
        yield from np.random.random(1000) * 2 - 1
You can adjust the number of values generated at once to whatever you need; larger is faster but uses more memory.
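The two answers can be combined so that the consumer also stays vectorised: yield whole blocks instead of single floats, and reduce each block with NumPy calls. A minimal sketch, assuming NumPy is installed; the names uniform_blocks and mean_of_squares, the default block size of 2**16, and the mean-of-squares reduction are illustrative choices, not part of the original answers:

```python
import numpy as np

def uniform_blocks(total, block_size=2**16, seed=None):
    """Yield arrays of uniform values in [-1, 1) until `total` values were produced."""
    rng = np.random.default_rng(seed)
    remaining = total
    while remaining > 0:
        n = min(block_size, remaining)   # the last block may be shorter
        yield rng.uniform(-1.0, 1.0, size=n)
        remaining -= n

def mean_of_squares(total, block_size=2**16, seed=None):
    """Example consumer: only one block is alive at a time, work stays vectorised."""
    acc = 0.0
    for block in uniform_blocks(total, block_size, seed):
        acc += float(np.sum(block * block))
    return acc / total

print(mean_of_squares(1_000_000, seed=0))  # should be close to 1/3
```

Memory use is bounded by one block, while the per-value cost stays near the vectorised rate shown in the benchmark rather than the per-element Python rate.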

How to calculate space for number of records

I am trying to calculate the space required by a dataset using the formula below, but I am going wrong somewhere: the result does not match the existing dataset in the system. Please help me.
1st Dataset:
Record format . . . : VB
Record length . . . : 445
Block size . . . . : 32760
Number of records....: 51560
Using below formula to calculate
optimal block length (OBL) = 32760/record length = 32760/449 = 73
As there are two blocks on the track, hence (TOBL) = 2 * OBL = 73*2 = 146
Find number of physical records (PR) = Number of records/TOBL = 51560/146 = 354
Number of tracks = PR/2 = 354/2 = 177
But I see the following in the dataset information:
Current Allocation
Allocated tracks . : 100
Allocated extents . : 1
Current Utilization
Used tracks . . . . : 100
Used extents . . . : 1
2nd Dataset :
Record format . . . : VB
Record length . . . : 445
Block size . . . . : 27998
Number of Records....: 127,252
Using below formula to calculate
optimal block length (OBL) = 27998/record length = 27998/449 = 63
As there are two blocks on the track, hence (TOBL) = 2 * OBL = 63*2 = 126
Find number of physical records (PR) = Number of records/TOBL = 127252/126 = 1010
Number of tracks = PR/2 = 1010/2 = 505
Number of Cylinders = 505/15 = 34
But I see the following in the dataset information:
Current Allocation
Allocated cylinders : 69
Allocated extents . : 1
Current Utilization
Used cylinders . . : 69
Used extents . . . : 1
A few observations on your approach.
First, since you're dealing with records that are variable length, it would be helpful to know the "average" record length, as that would help to formulate a more accurate prediction of storage. Your approach assumes a worst-case scenario of all records being at maximum length, which is fine for planning purposes, but in reality you'll likely see that the actual allocation is lower if the average record length is lower than the maximum.
The approach you are taking is reasonable but consider that you can inform z/OS of the space requirements in blocks, records, DASD geometry or let DFSMS perform the calculation on your behalf. Refer to this article to get some additional information on options.
Back to your calculations:
Your Optimum Block Length (OBL) is really a records-per-block (RPB) number. Block size divided by maximum record length yields the number of full-length records that can be stored in a block. If your average record length is lower, you can store more records per block.
The assumption of two blocks per track may be true for your situation, but it depends on the actual device type that will be used for the underlying allocation. Here is a link to the geometries of some of the supported DASD devices.
For a block size of 32,760, the assumption of two blocks per track is not correct on 3390s: you would need about 64k for two such blocks on a track, but as you can see the 3390s max out at about 56k per track, so you would only get one block per track on that device.
Also, it looks like you did factor in the RDW by adding 4 bytes, but someone looking at the question might be confused if they are not familiar with V records on z/OS. In the case of your calculation that would be 62 records per block at 27998 (which is the "optimal block length", so two blocks can fit comfortably on a track).
I'll use the following values:
MaximumRecordLength = RecordLength + 4 for RDW
TotalRecords = Total Records at Maximum Length (worst case)
BlockSize = modeled blocksize
RecordsPerBlock = number of records that can fit in a block (worst case)
BlocksNeeded = number of blocks needed to contain estimated records (worst case)
BlocksPerTrack = from IBM device geometry information
TracksNeeded = BlocksNeeded / BlocksPerTrack (that is, TotalRecords / RecordsPerBlock / BlocksPerTrack)
Cylinders = TracksNeeded / device tracks per cylinder (15 for most devices)
Example 1:
Total Records = 51,560
BlockSize = 32,760
BlocksPerTrack = 1 (from device table)
RecordsPerBlock: 32,760 / 449 = 72.96 (72)
Total Blocks = 51,560 / 72 = 716.11 (717)
Total Tracks = 717 / 1 = 717
Cylinders = 717 / 15 = 47.8 (48)
Example 2:
Total Records = 127,252
BlockSize = 27,998
BlocksPerTrack = 2 (from device table)
RecordsPerBlock: 27,998 / 449 = 62.35 (62)
Total Blocks = 127,252 / 62 = 2052.45 (2,053)
Total Tracks = 2,053 / 2 = 1,026.5 (1,027)
Cylinders = 1027 / 15 = 68.5 (69)
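The two worked examples follow the same steps, so they can be written as one small function. A minimal sketch of that worst-case arithmetic, under the same assumptions as the examples (every record at maximum length, 4 bytes of RDW per record, blocks-per-track taken from the device table, block descriptor overhead ignored); the function and parameter names are illustrative:

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)

def estimate_space(total_records, lrecl, block_size, blocks_per_track,
                   tracks_per_cylinder=15, rdw_bytes=4):
    """Return (records per block, blocks, tracks, cylinders) for the worst case."""
    max_record_length = lrecl + rdw_bytes
    records_per_block = block_size // max_record_length   # partial records do not fit
    blocks_needed = ceil_div(total_records, records_per_block)
    tracks_needed = ceil_div(blocks_needed, blocks_per_track)
    cylinders = ceil_div(tracks_needed, tracks_per_cylinder)
    return records_per_block, blocks_needed, tracks_needed, cylinders

print(estimate_space(51_560, 445, 32_760, blocks_per_track=1))   # (72, 717, 717, 48)
print(estimate_space(127_252, 445, 27_998, blocks_per_track=2))  # (62, 2053, 1027, 69)
```

Rounding down for records per block and up for blocks, tracks and cylinders is what separates these figures from the estimates in the question.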
Now, as to the actual allocation: it depends on how you allocated the space and on the size of the records. Assuming it was allocated in JCL, you could use the RLSE subparameter of SPACE= to release space when the data set is created and closed. This should release unused resources.
Given that the records are Variable the estimates are worst case and you would need to know more about the average record lengths to understand the actual allocation in terms of actual space used.
Final thought, all of the work you're doing can be overridden by your storage administrator through ACS routines. I believe that most people today would specify a BLKSIZE=0 and let DFSMS do all of the hard work because that component has more information about where a file will go, what the underlying devices are and the most efficient way of doing the allocation. The days of disk geometry and allocation are more of a campfire story unless your environment has not been administered to do these things for you.
Instead of trying to calculate tracks or cylinders, go for MBs, or KBs. z/OS (DFSMS) will calculate for you, how many tracks or cylinders are needed.
In JCL it is not straightforward, but it is also not too complicated once you've got it.
There is a DD statement parameter called AVGREC=, which is the trigger. Let me do an example for your first case above:
//anydd DD DISP=(NEW,CATLG),
//         DSN=your.new.data.set.name,
//         RECFM=VB,LRECL=445,
//         SPACE=(445,(51560,1000)),AVGREC=U
//*                |     |    |            |
//*                V     V    V            V
//*               (1)   (2)  (3)          (4)
Parameter AVGREC=U (4) tells the system three things:
Firstly, the first subparameter in SPACE= (1) shall be interpreted as an average record length. (Note that this value is completely independent of the value specified in LRECL=.)
Secondly, it tells the system that the second (2) and third (3) SPACE= subparameters are the number of records of average length (1) that the data set shall be able to store.
Thirdly, it tells the system that numbers (2) and (3) are in records (AVGREC=U). Alternatives are thousands (AVGREC=K) and millions (AVGREC=M).
So, this DD statement will allocate enough space to hold the estimated number of records. You don't have to care for track capacity, block capacity, device geometry, etc.
Given the number of records you expect and the (average) record length, you can easily calculate the number of kilobytes or megabytes you need. Unfortunately, you cannot directly specify KB, or MB in JCL, but there is a way using AVGREC= as follows.
Your first data set will get 51560 records of (maximum) length 445, i.e. 22'944'200 bytes, or ~22'945 KB, or ~23 MB. The JCL for an allocation in KB looks like this:
//anydd DD DISP=(NEW,CATLG),
//         DSN=your.new.data.set.name,
//         RECFM=VB,LRECL=445,
//         SPACE=(1,(22945,10000)),AVGREC=K
//*               |    |     |            |
//*               V    V     V            V
//*              (1)  (2)   (3)          (4)
You want the system to allocate primary space for 22945 (2) thousands (4) records of length 1 byte (1), which is 22945 KB, and secondary space for 10'000 (3) thousands (4) records of length 1 byte (1), i.e. 10'000 KB.
Now the same allocation specifying MB:
//anydd DD DISP=(NEW,CATLG),
//         DSN=your.new.data.set.name,
//         RECFM=VB,LRECL=445,
//         SPACE=(1,(23,10)),AVGREC=M
//*               |  |  |           |
//*               V  V  V           V
//*              (1)(2)(3)         (4)
You want the system to allocate primary space for 23 (2) millions (4) records of length 1 byte (1), which is 23 MB, and secondary space for 10 (3) millions (4) records of length 1 byte (1), i.e. 10 MB.
I rarely use anything other than the latter.
In ISPF, it is even easier: Data Set Allocation (3.2) allows KB, and MB as space units (amongst all the old ones).
A useful and usually simpler alternative to using SPACE and AVGREC etc is to simply use a DATACLAS for space if your site has appropriate sized ones defined. If you look at ISMF Option 4 you can list available DATACLAS's and see what space values etc they provide. You'd expect to see a number of ranges in size, and some with or without Extended Format and/or Compression. Even if a DATACLAS overallocates a bit then it is likely the overallocated space will be released by the MGMTCLAS assigned to the dataset at close or during space management. And you do have an option to code DATACLAS AND SPACE in which case any coded space (or other) value will override the DATACLAS, which helps with exceptions. It still depends how your Storage Admin's have coded the ACS routines but generally Users are allowed to specify a DATACLAS and it will be honored by the ACS routines.
For basic dataset size calculation I just use LRECL times the expected maximum record count, divided by 1000 a couple of times, to get a rough MB figure. Obviously variable records/blocks add 4 bytes each for the RDW and/or BDW, but unless the number of records is massive or DASD space is extremely tight, it shouldn't be significant enough to matter.
e.g.
=(51560*445)/1000/1000 shows as ~23MB
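The rough sizing above is plain arithmetic, shown here as a small helper. A minimal sketch that uses decimal KB and MB, as the figures in these answers do, and rounds up so the request is never too small; the function name is illustrative, and how the system itself scales the AVGREC= units is not modelled here:

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)

def space_request(record_count, record_length):
    """Return (total bytes, KB rounded up, MB rounded up) in decimal units."""
    total_bytes = record_count * record_length
    return total_bytes, ceil_div(total_bytes, 1_000), ceil_div(total_bytes, 1_000_000)

print(space_request(51_560, 445))  # (22944200, 22945, 23)
```

The KB and MB results are the values that would go into the primary quantity of a SPACE=(1,(...)) request with the matching AVGREC= unit.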
Also, don't expect your allocation to be exactly what you requested, because the minimum allocation on z/OS is 1 track, or ~56k. The BLKSIZE also comes into effect by adding interblock gaps of ~32 bytes per block. With SDB (System Determined Blocksize), invoked by omitting BLKSIZE or coding BLKSIZE=0, the system will always try to provide half-track blocking as close to 28k as possible, i.e. two blocks per track, which is the most space efficient. That does matter: a BLKSIZE of 80 bytes wastes ~80% of a track with interblock gaps. The BLKSIZE is also the unit of transfer when doing reads/writes to disk, so generally the larger the better, with some exceptions such as KSDSs being randomly accessed by key, which might result in more data transfer than desired in an OLTP transaction.

PyTorch GPU memory management

In my code, I want to replace values in the tensor given values of some indices are zero, for example
target_mac_out[avail_actions[:, 1:] == 0] = -9999999
But, it returns OOM
RuntimeError: CUDA out of memory. Tried to allocate 166.00 MiB (GPU 0; 10.76 GiB total capacity; 9.45 GiB already allocated; 4.75 MiB free; 9.71 GiB reserved in total by PyTorch)
I think there is no memory allocation because it just visits the tensor of target_mac_out and check the value and replace a new value for some indices.
Am I understanding right?
It's hard to guess since we do not even know the sizes of the involved tensors, but your indexing avail_actions[:, 1:] == 0 creates a temporary tensor that does require memory allocation.
The expression avail_actions[:, 1:] == 0 creates a new tensor, and the whole line itself possibly creates another tensor before the old one is deleted once the operation finishes.
If speed is not a problem then you can just use a for loop, for example (assuming both tensors are 2-D and target_mac_out has one column fewer than avail_actions):
for i in range(target_mac_out.size(0)):
    for j in range(target_mac_out.size(1)):
        # the mask comes from avail_actions, shifted by one along dim 1
        if avail_actions[i, j + 1] == 0:
            target_mac_out[i, j] = -9999999

Memory footprint of splitOn?

I wrote a file indexing program that should read thousands of text file lines as records and finally group those records by fingerprint. It uses Data.List.Split.splitOn to split the lines at tabs and retrieve the record fields. The program consumes 10-20 GB of memory.
Probably there is not much I can do to reduce that huge memory footprint, but I cannot explain why a function like splitOn (breakDelim) can consume that much memory:
Mon Dec 9 21:07 2019 Time and Allocation Profiling Report (Final)
group +RTS -p -RTS file1 file2 -o 2 -h
total time = 7.40 secs (7399 ticks @ 1000 us, 1 processor)
total alloc = 14,324,828,696 bytes (excludes profiling overheads)
COST CENTRE                           MODULE                     SRC                                                %time  %alloc
fileToPairs.linesIncludingEmptyLines  ImageFileRecordParser      ImageFileRecordParser.hs:35:7-47                    25.0    33.8
breakDelim                            Data.List.Split.Internals  src/Data/List/Split/Internals.hs:(151,1)-(156,36)   24.9    39.3
sortAndGroup                          Aggregations               Aggregations.hs:6:1-85                              12.9     1.7
fileToPairs                           ImageFileRecordParser      ImageFileRecordParser.hs:(33,1)-(42,14)              8.2    10.7
matchDelim                            Data.List.Split.Internals  src/Data/List/Split/Internals.hs:(73,1)-(77,23)      7.4     0.4
onSublist                             Data.List.Split.Internals  src/Data/List/Split/Internals.hs:278:1-72            3.6     0.0
toHashesView                          ImageFileRecordStatistics  ImageFileRecordStatistics.hs:(48,1)-(51,24)          3.0     6.3
main                                  Main                       group.hs:(47,1)-(89,54)                              2.9     0.4
numberOfUnique                        ImageFileRecord            ImageFileRecord.hs:37:1-40                           1.6     0.1
toHashesView.sortedLines              ImageFileRecordStatistics  ImageFileRecordStatistics.hs:50:7-30                 1.4     0.1
imageFileRecordFromFields             ImageFileRecordParser      ImageFileRecordParser.hs:(11,1)-(30,5)               1.1     0.3
toHashView                            ImageFileRecord            ImageFileRecord.hs:(67,1)-(69,23)                    0.7     1.7
Or is type [Char] too memory inefficient (compared to Text), causing splitOn to take that much memory?
UPDATE 1 (+RTS -s suggestion of user HTNW)
23,446,268,504 bytes allocated in the heap
10,753,363,408 bytes copied during GC
1,456,588,656 bytes maximum residency (22 sample(s))
29,282,936 bytes maximum slop
3620 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 45646 colls, 0 par 4.055s 4.059s 0.0001s 0.0013s
Gen 1 22 colls, 0 par 4.034s 4.035s 0.1834s 1.1491s
INIT time 0.000s ( 0.000s elapsed)
MUT time 7.477s ( 7.475s elapsed)
GC time 8.089s ( 8.094s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.114s ( 0.114s elapsed)
Total time 15.687s ( 15.683s elapsed)
%GC time 51.6% (51.6% elapsed)
Alloc rate 3,135,625,407 bytes per MUT second
Productivity 48.4% of total user, 48.4% of total elapsed
The processed text files are smaller than usual (UTF-8 encoded, 37 MB). But still 3 GB of memory are used.
UPDATE 2 (critical part of the code)
Explanation: fileToPairs processes a text file. It returns a list of key-value pairs (key: fingerprint of record, value: record).
sortAndGroup associations = Map.fromListWith (++) [(k, [v]) | (k, v) <- associations]
main = do
  CommandLineArguments{..} <- cmdArgs $ CommandLineArguments {
      ignored_paths_file = def &= typFile,
      files = def &= typ "FILES" &= args,
      number_of_occurrences = def &= name "o",
      minimum_number_of_occurrences = def &= name "l",
      maximum_number_of_occurrences = def &= name "u",
      number_of_hashes = def &= name "n",
      having_record_errors = def &= name "e",
      hashes = def
      }
      &= summary "Group image/video files"
      &= program "group"
  let ignoredPathsFilenameMaybe = ignored_paths_file
  let filenames = files
  let hashesMaybe = hashes
  ignoredPaths <- case ignoredPathsFilenameMaybe of
    Just ignoredPathsFilename -> ioToLines (readFile ignoredPathsFilename)
    _ -> return []
  recordPairs <- mapM (fileToPairs ignoredPaths) filenames
  let allRecordPairs = concat recordPairs
  let groupMap = sortAndGroup allRecordPairs
  let statisticsPairs = map toPair (Map.toList groupMap) where toPair item = (fst item, imageFileRecordStatisticsFromRecords . snd $ item)
  let filterArguments = FilterArguments {
        numberOfOccurrencesMaybe = number_of_occurrences,
        minimumNumberOfOccurrencesMaybe = minimum_number_of_occurrences,
        maximumNumberOfOccurrencesMaybe = maximum_number_of_occurrences,
        numberOfHashesMaybe = number_of_hashes,
        havingRecordErrorsMaybe = having_record_errors
        }
  let filteredPairs = filterImageRecords filterArguments statisticsPairs
  let filteredMap = Map.fromList filteredPairs
  case hashesMaybe of
    Just True -> mapM_ putStrLn (map toHashesView (map snd filteredPairs))
    _ -> Char8.putStrLn (encodePretty filteredMap)
As I'm sure you're aware, there's not really enough information here for us to help you make your program more efficient. It might be worth posting some (complete, self-contained) code on the Code Review site for that.
However, I think I can answer your specific question about why splitOn allocates so much memory. In fact, there's nothing particularly special about splitOn or how it's been implemented. Many straightforward Haskell functions will allocate lots of memory, and this in itself doesn't indicate that they've been poorly written or are running inefficiently. In particular, splitOn's memory usage seems similar to other straightforward approaches to splitting a string based on delimiters.
The first thing to understand is that GHC compiled code works differently than other compiled code you're likely to have seen. If you know a lot of C and understand stack frames and heap allocation, or if you've studied some JVM implementations, you might reasonably expect that some of that understanding would translate to GHC executables, but you'd be mostly wrong.
A GHC program is more or less an engine for allocating heap objects, and -- with a few exceptions -- that's all it really does. Nearly every argument passed to a function or constructor (as well as the constructor application itself) allocates a heap object of at least 16 bytes, and often more. Take a simple function like:
fact :: Int -> Int
fact 0 = 1
fact n = n * fact (n-1)
With optimization turned off, it compiles to the following so-called "STG" form (simplified from the actual -O0 -ddump-stg output):
fact = \n -> case n of I# n' -> case n' of
  0# -> I# 1#
  _  -> let sat1 = let sat2 = let one = I#! 1# in n-one
                   in fact sat2;
        in n*sat1
Everywhere you see a let, that's a heap allocation (16+ bytes), and there are presumably more hidden in the (-) and (*) calls. Compiling and running this program with:
main = print $ fact 1000000
gives:
113,343,544 bytes allocated in the heap
44,309,000 bytes copied during GC
25,059,648 bytes maximum residency (5 sample(s))
29,152 bytes maximum slop
23 MB total memory in use (0 MB lost due to fragmentation)
meaning that each iteration allocates over a hundred bytes on the heap, though it's literally just performing a comparison, a subtraction, a multiplication, and a recursive call.
This is what @HTNW meant in saying that total allocation in a GHC program is a measure of "work". A GHC program that isn't allocating probably isn't doing anything (again, with some rare exceptions), and a typical GHC program that is doing something will usually allocate at a relatively constant rate of several gigabytes per second when it's not garbage collecting. So, total allocation has more to do with total runtime than anything else, and it isn't a particularly good metric for assessing code efficiency. Maximum residency is also a poor measure of overall efficiency, though it can be helpful for assessing whether or not you have a space leak, if you find that it tends to grow linearly (or worse) with the size of the input where you expect the program should run in constant memory regardless of input size.
For most programs, the most important true efficiency metric in the +RTS -s output is probably the "productivity" rate at the bottom -- it's the amount of time the program spends not garbage collecting. And, admittedly, your program's productivity of 48% is pretty bad, which probably means that it is, technically speaking, allocating too much memory, but it's probably only allocating two or three times the amount it should be, so, at a guess, maybe it should "only" be allocating around 7-8 Gigs instead of 23 Gigs for this workload (and, consequently, running for about 5 seconds instead of 15 seconds).
With that in mind, if you consider the following simple breakDelim implementation:
breakDelim :: String -> [String]
breakDelim str = case break (=='\t') str of
  (a,_:b) -> a : breakDelim b
  (a,[]) -> [a]
and use it like so in a simple tab-to-comma delimited file converter:
main = interact (unlines . map (intercalate "," . breakDelim) . lines)
Then, unoptimized and run on a file with 10000 lines of 1000 3-character fields each, it allocates a whopping 17 Gigs:
17,227,289,776 bytes allocated in the heap
2,807,297,584 bytes copied during GC
127,416 bytes maximum residency (2391 sample(s))
32,608 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
and profiling it places a lot of blame on breakDelim:
COST CENTRE  MODULE  SRC                    %time  %alloc
main         Main    Delim.hs:8:1-71         57.7    72.6
breakDelim   Main    Delim.hs:(4,1)-(6,16)   42.3    27.4
In this case, compiling with -O2 doesn't make much difference. The key efficiency metric, productivity, is only 46%. All these results seem to be in line with what you're seeing in your program.
The split package has a lot going for it, but looking through the code, it's pretty clear that little effort has been made to make it particularly efficient or fast, so it's no surprise that splitOn performs no better than my quick-and-dirty custom breakDelim function. And, as I said before, there's nothing special about splitOn that makes it unusually memory hungry -- my simple breakDelim has similar behavior.
With respect to inefficiencies of the String type, it can often be problematic. But, it can also participate in optimizations like list fusion in ways that Text can't. The utility above could be rewritten in simpler form as:
main = interact $ map (\c -> if c == '\t' then ',' else c)
which uses String but runs pretty fast (about a quarter as fast as a naive C getchar/putchar implementation) at 84% productivity, while allocating about 5 Gigs on the heap.
It's quite likely that if you just take your program and "convert it to Text", you'll find it's slower and more memory hungry than the original! While Text has the potential to be much more efficient than String, it's a complicated package, and the way in which Text objects behave with respect to allocation when they're sliced and diced (as when you're chopping a big Text file up into little Text fields) makes it more difficult to get right.
So, some take-home lessons:
Total allocation is a poor measure of efficiency. Most well written GHC programs can and should allocate several gigabytes per second of runtime.
Many innocuous Haskell functions will allocate lots of memory because of the way GHC compiled code works. This isn't necessarily a sign that there's something wrong with the function.
The split package provides a flexible framework for all manner of cool list splitting manipulations, but it was not designed with speed in mind, and it may not be the best method of processing a tab-delimited file.
The String data type has a potential for terrible inefficiency, but isn't always inefficient, and Text is a complicated package that won't be a plug-in replacement to fix your String performance woes.
Most importantly:
Unless your program is too slow for its intended purpose, its run-time statistics and the theoretical advantages of Text over String are largely irrelevant.

GHC optimizes allocations away

I'm trying to improve performance of this binary-trees benchmark from The Computer Language Benchmark Game. The idea is to build lots of binary trees to benchmark memory allocation. The Tree data definition looks like this:
data Tree = Nil | Node !Int !Tree !Tree
According to the problem statement, there's no need to store an Int in every node and other languages don't have it.
I use GHC 8.2.2 and get the following RTS report when I run the original code:
stack --resolver lts-10.3 --compiler ghc-8.2.2 ghc -- --make -O2 -threaded -rtsopts -funbox-strict-fields -XBangPatterns -fllvm -pgmlo opt-3.9 -pgmlc llc-3.9 binarytrees.hs -o binarytrees.ghc_run
./binarytrees.ghc_run +RTS -N4 -sstderr -K128M -H -RTS 21
...
19,551,302,672 bytes allocated in the heap
7,291,702,272 bytes copied during GC
255,946,744 bytes maximum residency (18 sample(s))
233,480 bytes maximum slop
635 MB total memory in use (0 MB lost due to fragmentation)
...
Total time 58.620s ( 39.281s elapsed)
So far so good. Let's remove this Int, which is actually never used. The definition becomes
data Tree = Nil | Node !Tree !Tree
In theory we are going to save about 25% of total memory (3 integers in every node instead of 4). Let's try it:
...
313,388,960 bytes allocated in the heap
640,488 bytes copied during GC
90,016 bytes maximum residency (2 sample(s))
57,872 bytes maximum slop
5 MB total memory in use (0 MB lost due to fragmentation)
...
Total time 9.596s ( 9.621s elapsed)
5MB total memory in use and almost zero GC? Why? Where did all the allocations go?
I believe the sudden drop in memory usage is caused by the Common Sub-expression Elimination (CSE) optimization. The original code was:
make i d = Node i (make d d2) (make d2 d2)
--                      ^           ^
--                      |           d2 != d
--                      d != d2
Since the expressions constructing the left and the right subtrees are different, the compiler is not able to eliminate any allocations.
If I remove the unused integer, the code looks like this
make d = Node (make (d - 1)) (make (d - 1))
--                     ^              ^
--                     |              |
--                     `--------------`----- identical
If I add the -fno-cse flag to GHC, the memory allocation is as high as expected, but the code is rather slow. I couldn't find a way to suppress this optimization locally so I decided to "outsmart" the compiler by adding extra unused arguments:
make' :: Int -> Int -> Tree
make' _ 0 = Node Nil Nil
make' !n d = Node (make' (n - 1) (d - 1)) (make' (n + 1) (d - 1))
The trick worked: the memory usage dropped by the expected 30%. But I wish there was a nicer way to tell the compiler what I want.
Thanks to @Carl for mentioning the CSE optimization.

Resources