How to calculate space for number of records - mainframe

I am trying to calculate the space required by a dataset using the formula below, but I am going wrong somewhere when I cross-check against an existing dataset in the system. Please help me.
1st Dataset:
Record format . . . : VB
Record length . . . : 445
Block size . . . . : 32760
Number of records....: 51560
Using the formula below to calculate:
optimal block length (OBL) = 32760/record length = 32760/449 = 73
As there are two blocks on the track, hence (TOBL) = 2 * OBL = 73*2 = 146
Find number of physical records (PR) = Number of records/TOBL = 51560/146 = 354
Number of tracks = PR/2 = 354/2 = 177
But I see the below in the dataset information:
Current Allocation
Allocated tracks . : 100
Allocated extents . : 1
Current Utilization
Used tracks . . . . : 100
Used extents . . . : 1
2nd Dataset :
Record format . . . : VB
Record length . . . : 445
Block size . . . . : 27998
Number of Records....: 127,252
Using the formula below to calculate:
optimal block length (OBL) = 27998/record length = 27998/449 = 63
As there are two blocks on the track, hence (TOBL) = 2 * OBL = 63*2 = 126
Find number of physical records (PR) = Number of records/TOBL = 127252/126 = 1010
Number of tracks = PR/2 = 1010/2 = 505
Number of Cylinders = 505/15 = 34
But I see the below in the dataset information:
Current Allocation
Allocated cylinders : 69
Allocated extents . : 1
Current Utilization
Used cylinders . . : 69
Used extents . . . : 1

A few observations on your approach.
First, since you're dealing with variable-length records, it would be helpful to know the average record length, as that allows a more accurate prediction of storage. Your approach assumes the worst case of every record being at maximum length, which is fine for planning purposes, but in reality the actual allocation will likely be lower if the average record length is below the maximum.
The approach you are taking is reasonable, but consider that you can inform z/OS of the space requirements in blocks, records, or DASD geometry, or let DFSMS perform the calculation on your behalf. Refer to this article for additional information on the options.
Back to your calculations:
Your Optimum Block Length (OBL) is really a records-per-block (RPB) number. Block size divided by maximum record length yields the number of records at full length that can be stored in a block. If your average record length is lower, you can store more records per block.
The assumption of two blocks per track may be true for your situation, but it depends on the actual device type used for the underlying allocation. Here is a link to the geometries of supported DASD devices.
That assumption is not correct for 3390s in your first case: you would need 64k to fit two 32,760-byte blocks on a track, but as you can see the 3390 maxes out at 56k, so you would only get one block per track on that device.
Also, it looks like you did factor in the RDW by adding 4 bytes, but someone reading the question might be confused if they are not familiar with V records on z/OS. In your second case that gives 62 records per block at a blocksize of 27,998 (which is the "optimal" block length, since two such blocks fit comfortably on a track).
I'll use the following values:
MaximumRecordLength = RecordLength + 4 for RDW
TotalRecords = Total Records at Maximum Length (worst case)
BlockSize = modeled blocksize
RecordsPerBlock = number of records that can fit in a block (worst case)
BlocksNeeded = number of blocks needed to contain estimated records (worst case)
BlocksPerTrack = from IBM device geometry information
TracksNeeded = TotalRecords / RecordsPerBlock / BlocksPerTrack
Cylinders = TracksNeeded / tracks per cylinder (15 for most devices)
Example 1:
Total Records = 51,560
BlockSize = 32,760
BlocksPerTrack = 1 (from device table)
RecordsPerBlock: 32,760 / 449 = 72.96 (72)
Total Blocks = 51,560 / 72 = 716.11 (717)
Total Tracks = 717 * 1 = 717
Cylinders = 717 / 15 = 47.8 (48)
Example 2:
Total Records = 127,252
BlockSize = 27,998
BlocksPerTrack = 2 (from device table)
RecordsPerBlock: 27,998 / 449 = 62.35 (62)
Total Blocks = 127,252 / 62 = 2052.45 (2,053)
Total Tracks = 2,053 / 2 = 1,026.5 (1,027)
Cylinders = 1027 / 15 = 68.5 (69)
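If it helps, here is the same worst-case arithmetic as a small C++ sketch (my own illustration of the steps above, not anything z/OS provides; the blocks-per-track and tracks-per-cylinder values are the 3390 numbers assumed in the examples):

#include <cmath>
#include <cstdio>

// Worst-case space estimate for a VB dataset, following the steps above.
void estimate(double records, double blksize, double blocksPerTrack) {
    const double maxRecLen = 445 + 4;                         // +4 for the RDW
    double recordsPerBlock = std::floor(blksize / maxRecLen);
    double blocks = std::ceil(records / recordsPerBlock);
    double tracks = std::ceil(blocks / blocksPerTrack);
    double cyls   = std::ceil(tracks / 15.0);                 // 15 tracks/cylinder
    std::printf("blocks=%.0f tracks=%.0f cylinders=%.0f\n", blocks, tracks, cyls);
}

int main() {
    estimate(51560, 32760, 1);    // example 1 -> 717 blocks/tracks, 48 cylinders
    estimate(127252, 27998, 2);   // example 2 -> 2053 blocks, 1027 tracks, 69 cylinders
}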
Now, as to the actual allocation: it depends on how you allocated the space and on the size of the records. Assuming it was in JCL, you could use the RLSE subparameter of SPACE= to release space when the dataset is created and closed. This should release unused resources.
Given that the records are variable, these estimates are worst case; you would need to know more about the average record lengths to understand the actual allocation in terms of space used.
Final thought: all of the work you're doing can be overridden by your storage administrator through ACS routines. I believe most people today would specify BLKSIZE=0 and let DFSMS do all of the hard work, because that component has more information about where a file will go, what the underlying devices are, and the most efficient way of doing the allocation. The days of hand-calculated disk geometry and allocation are more of a campfire story, unless your environment has not been administered to do these things for you.

Instead of trying to calculate tracks or cylinders, go for MBs or KBs. z/OS (DFSMS) will calculate for you how many tracks or cylinders are needed.
In JCL it is not straightforward, but it is also not too complicated once you've got it.
There is a DD statement parameter called AVGREC=, which is the trigger. Let me do an example for your first case above:
//anydd DD DISP=(NEW,CATLG),
// DSN=your.new.data.set.name,
// RECFM=VB,LRECL=445,
// SPACE=(445,(51560,1000)),AVGREC=U
//* | | | |
//* V V V V
//* (1) (2) (3) (4)
Parameter AVGREC=U (4) tells the system three things:
Firstly, the first subparameter in SPACE= (1) shall be interpreted as an average record length. (Note that this value is completely independent of the value specified in LRECL=.)
Secondly, it tells the system that the second (2) and third (3) SPACE= subparameters are the numbers of records of average length (1) that the data set shall be able to store.
Thirdly, it tells the system that numbers (2) and (3) are given in single records (AVGREC=U). The alternatives are thousands (AVGREC=K) and millions (AVGREC=M).
So, this DD statement will allocate enough space to hold the estimated number of records. You don't have to care for track capacity, block capacity, device geometry, etc.
Given the number of records you expect and the (average) record length, you can easily calculate the number of kilobytes or megabytes you need. Unfortunately, you cannot directly specify KB, or MB in JCL, but there is a way using AVGREC= as follows.
Your first data set will get 51560 records of (maximum) length 445, i.e. 22'944'200 bytes, or ~22'945 KB, or ~23 MB. The JCL for an allocation in KB looks like this:
//anydd DD DISP=(NEW,CATLG),
// DSN=your.new.data.set.name,
// RECFM=VB,LRECL=445,
// SPACE=(1,(22945,10000)),AVGREC=K
//* | | | |
//* V V V V
//* (1) (2) (3) (4)
You want the system to allocate primary space for 22945 (2) thousands (4) records of length 1 byte (1), which is 22945 KB, and secondary space for 10'000 (3) thousands (4) records of length 1 byte (1), i.e. 10'000 KB.
Now the same allocation specifying MB:
//anydd DD DISP=(NEW,CATLG),
// DSN=your.new.data.set.name,
// RECFM=VB,LRECL=445,
// SPACE=(1,(23,10)),AVGREC=M
//* | | | |
//* V V V V
//* (1) (2)(3) (4)
You want the system to allocate primary space for 23 (2) millions (4) records of length 1 byte (1), which is 23 MB, and secondary space for 10 (3) millions (4) records of length 1 byte (1), i.e. 10 MB.
I rarely use anything other than the latter.
In ISPF, it is even easier: Data Set Allocation (3.2) allows KB, and MB as space units (amongst all the old ones).

A useful and usually simpler alternative to using SPACE and AVGREC is to simply use a DATACLAS for space, if your site has appropriately sized ones defined. If you look at ISMF Option 4 you can list the available DATACLASes and see what space values they provide. You'd expect to see a number of size ranges, some with or without Extended Format and/or Compression. Even if a DATACLAS over-allocates a bit, the over-allocated space will likely be released by the MGMTCLAS assigned to the dataset at close or during space management. You also have the option to code both DATACLAS and SPACE, in which case any coded space (or other) value will override the DATACLAS, which helps with exceptions. It still depends on how your Storage Admins have coded the ACS routines, but generally users are allowed to specify a DATACLAS and it will be honored by the ACS routines.
For a basic dataset size calculation I just use LRECL times the expected maximum record count, divided by 1000 a couple of times, to get a rough MB figure. Variable records/blocks add 4 bytes each for the RDW and/or BDW, but unless the number of records is massive, or DASD is extremely tight for space, it shouldn't be significant enough to matter.
e.g.
=(51560*445)/1000/1000 shows as ~23MB
Also, don't expect your allocation to be exactly what you requested, because the minimum allocation on z/OS is 1 track, or ~56k. The BLKSIZE also comes into effect by adding inter-block gaps of ~32 bytes per block. With SDB (System Determined Blocksize), invoked by omitting BLKSIZE or coding BLKSIZE=0, the system will always try to provide half-track blocking as close to 28k as possible, so two blocks per track, which is the most space-efficient. That does matter: a BLKSIZE of 80 bytes wastes ~80% of a track with inter-block gaps. The BLKSIZE is also the unit of transfer when doing reads/writes to disk, so generally the larger the better, with some exceptions, such as a KSDS being randomly accessed by key, which might result in more data transfer than desired in an OLTP transaction.


Why does python lru_cache performs best when maxsize is a power-of-two?

Documentation says this:
If maxsize is set to None, the LRU feature is disabled and the cache can grow without bound. The LRU feature performs best when maxsize is a power-of-two.
Would anyone happen to know where this "power-of-two" comes from? I am guessing it has something to do with the implementation.
Where the size effect arises
The lru_cache() code exercises its underlying dictionary in an atypical way. While maintaining total constant size, cache misses delete the oldest item and insert a new item.
The power-of-two suggestion is an artifact of how this delete-and-insert pattern interacts with the underlying dictionary implementation.
How dictionaries work
Table sizes are a power of two.
Deleted keys are replaced with dummy entries.
New keys can sometimes reuse the dummy slot, sometimes not.
Repeated delete-and-inserts with different keys will fill up the table with dummy entries.
An O(N) resize operation runs when the table is two-thirds full.
Since the number of active entries remains constant, a resize operation doesn't actually change the table size.
The only effect of the resize is to clear the accumulated dummy entries.
Performance implications
A dict with 2**n entries has the most available space for dummy entries, so the O(n) resizes occur less often.
Also, sparse dictionaries have fewer hash collisions than mostly full dictionaries. Collisions degrade dictionary performance.
When it matters
The lru_cache() only updates the dictionary when there is a cache miss. Also, when there is a miss, the wrapped function is called. So, the effect of resizes only matters when there is a high proportion of misses and the wrapped function is very cheap.
Far more important than giving the maxsize a power-of-two is using the largest reasonable maxsize. Bigger caches have more cache hits — that's where the big wins come from.
Simulation
Once an lru_cache() is full and the first resize has occurred, the dictionary settles into a steady state and will never get larger. Here, we simulate what happens next as new dummy entries are added and periodic resizes clear them away.
steady_state_dict_size = 2 ** 7     # always a power of two

def simulate_lru_cache(lru_maxsize, events=1_000_000):
    'Count resize operations as dummy keys are added'
    resize_point = steady_state_dict_size * 2 // 3
    assert lru_maxsize < resize_point
    dummies = 0
    resizes = 0
    for i in range(events):
        dummies += 1
        filled = lru_maxsize + dummies
        if filled >= resize_point:
            dummies = 0
            resizes += 1
    work = resizes * lru_maxsize        # resizing is O(n)
    work_per_event = work / events
    print(lru_maxsize, '-->', resizes, work_per_event)
Here is an excerpt of the output:
for maxsize in range(42, 85):
    simulate_lru_cache(maxsize)
42 --> 23255 0.97671
43 --> 23809 1.023787
44 --> 24390 1.07316
45 --> 25000 1.125
46 --> 25641 1.179486
...
80 --> 200000 16.0
81 --> 250000 20.25
82 --> 333333 27.333306
83 --> 500000 41.5
84 --> 1000000 84.0
What this shows is that the cache does significantly less work when maxsize is as far away as possible from the resize_point.
History
The effect was minimal in Python 3.2, when dictionaries grew by 4 x active_entries when resizing.
The effect became catastrophic when the growth rate was lowered to 2 x active_entries.
Later a compromise was reached, setting the growth rate to 3 x used. That significantly mitigated the issue by giving us a larger steady-state size by default.
A power-of-two maxsize is still the optimum setting, giving the least work for a given steady-state dictionary size, but it no longer matters as much as it did in Python 3.2.
Hope this helps clear up your understanding. :-)
TL;DR - this is an optimization that doesn't have much effect at small lru_cache sizes, but (see Raymond's reply) has a larger effect as your lru_cache size gets bigger.
So this piqued my interest and I decided to see if this was actually true.
First I went and read the source for the LRU cache. The implementation for cpython is here: https://github.com/python/cpython/blob/master/Lib/functools.py#L723 and I didn't see anything that jumped out at me as something that would operate better based on powers of two.
So, I wrote a short python program to make LRU caches of various sizes and then exercise those caches several times. Here's the code:
from functools import lru_cache
from collections import defaultdict
from statistics import mean
import time

def run_test(i):
    # We create a new decorated perform_calc
    @lru_cache(maxsize=i)
    def perform_calc(input):
        return input * 3.1415

    # let's run the test 5 times (so that we exercise the caching)
    for j in range(5):
        # Calculate the value for a range larger than our largest cache
        for k in range(2000):
            perform_calc(k)

values = defaultdict(list)
for t in range(10):
    print(t)
    for i in range(1, 1025):
        start = time.perf_counter()
        run_test(i)
        elapsed = time.perf_counter() - start
        values[i].append(elapsed)

for k, v in values.items():
    print(f"{k}\t{mean(v)}")
I ran this on a macbook pro under light load with python 3.7.7.
Here's the results:
https://docs.google.com/spreadsheets/d/1LqZHbpEL_l704w-PjZvjJ7nzDI1lx8k39GRdm3YGS6c/preview?usp=sharing
The random spikes are probably due to GC pauses or system interrupts.
At this point I realized that my code always generated cache misses, and never cache hits. What happens if we run the same thing, but always hit the cache?
I replaced the inner loop with:
# let's run the test 5 times (so that we exercise the caching)
for j in range(5):
    # Only ever create cache hits
    for k in range(i):
        perform_calc(k)
The data for this is in the same spreadsheet as above, second tab.
Let's see:
Hmm, but we don't really care about most of these numbers. Also, we're not doing the same amount of work for each test, so the timing doesn't seem useful.
What if we run it for just 2^n, 2^n + 1, and 2^n - 1? Since this speeds things up, we'll average it out over 100 tests instead of just 10.
We'll also generate a large random list to run on, since that way we'll expect to have some cache hits and cache misses.
from functools import lru_cache
from collections import defaultdict
from statistics import mean
import time
import random

# Eight copies of 0..127, shuffled, so we get a mix of hits and misses.
rands = list(range(128)) * 8
random.shuffle(rands)

def run_test(i):
    # We create a new decorated perform_calc
    @lru_cache(maxsize=i)
    def perform_calc(input):
        return input * 3.1415

    # let's run the test 5 times (so that we exercise the caching)
    for j in range(5):
        for k in rands:
            perform_calc(k)

# Interesting numbers: 2^n - 1, 2^n, and 2^n + 1 for each power of two
sizes = [15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128, 129,
         255, 256, 257, 511, 512, 513, 1023, 1024, 1025]

values = defaultdict(list)
for t in range(100):
    print(t)
    for i in sizes:
        start = time.perf_counter()
        run_test(i)
        elapsed = time.perf_counter() - start
        values[i].append(elapsed)

for k, v in values.items():
    print(f"{k}\t{mean(v)}")
Data for this is in the third tab of the spreadsheet above.
Here's a graph of the average time per element / lru cache size:
Time, of course, decreases as our cache size gets larger, since we don't spend as much time performing calculations. The interesting thing is that there does seem to be a dip from 15 to 16, 17 and from 31 to 32, 33. Let's zoom in on the higher numbers:
Not only do we lose that pattern in the higher numbers, but we actually see that performance decreases for some of the powers of two (511 to 512, 513).
Edit: The note about power-of-two was added in 2012, but the algorithm for functools.lru_cache looks the same at that commit, so unfortunately that disproves my theory that the algorithm has changed and the docs are out of date.
Edit: Removed my hypotheses. The original author replied above - the problem with my code is that I was working with "small" caches - meaning that the O(n) resize on the dicts was not very expensive. It would be cool to experiment with very large lru_caches and lots of cache misses to see if we can get the effect to appear.

Best Precision for String in DocumentDB Indexing Policies

I'm writing indexing policies for my collection, and trying to figure out the right "Precision" for String in a Hash Index, i.e.
collection.IndexingPolicy.IncludedPaths.Add(
    new IncludedPath {
        Path = "/customId/?",
        Indexes = new Collection<Index> {
            new HashIndex(DataType.String) { Precision = 20 }
        }
    });
There will be around 10,000 different customId, so what is the right "Precision"? What if it gets more than 100,000,000 ids?
As Andrew Liu said in this thread: The indexing precision for a hash index indicates the number of bytes to hash the property value to.
And as we know, 1 byte = 8 bits, which can hold 2^8 = 256 values. 2 bytes can hold 2^16 = 65,536 values, and so forth. You can do a similar calculation to get the indexing precision based on the number of documents that you expect to contain the path for the property customId.
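As a back-of-envelope illustration (my own sketch, not part of any DocumentDB SDK), the smallest precision whose value space covers the expected number of distinct ids is:

#include <cmath>
#include <cstdio>

// Smallest number of bytes n such that 2^(8n) >= expectedIds.
int precisionBytes(double expectedIds) {
    return static_cast<int>(std::ceil(std::log2(expectedIds) / 8.0));
}

int main() {
    std::printf("%d\n", precisionBytes(10000.0));      // 2 (2^16 = 65,536)
    std::printf("%d\n", precisionBytes(100000000.0));  // 4 (2^32 ~ 4.3e9)
}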
Besides, you can refer to the Index precision section in this article and trade off index storage overhead against query performance when specifying the index precision.

how can I use multithreading to compute the intersection on a very large dataset

I have a file composed of 4 million sets. Every set contains 1 to n words. The size of the file is 120 MB.
set1 = {w11, w12,...,w1i}
set2 = {w21, w22,...,w2j}
...
setm = {wm1, wm2,...,wmk}
I want to compute the intersection between all the sets.
Set 1 ∩ {set1,...,setm}
Set 2 ∩ {set1,...,setm}
...
Set m ∩ {set1,...,setm}
Every operation takes around 1.2 seconds. Here is what I did:
Divide the 4 million sets into 6 chunks, each chunk containing 666,666 sets.
Then I do the following. Here I'll be creating 36 threads and computing the intersections between the chunks. It is too slow, and I've complicated the problem.
vector<thread> threads;
for (int i = 0; i < chunk.size(); i++)
{
    for (int j = 0; j < chunk.size(); j++)
    {
        threads.push_back(thread(&Transform::call_intersection, this,
                                 ref(chunk[i]), ref(tmp[j]), chunk(results)));
    }
}
for (auto &t : threads) { t.join(); }
Do you have an idea of how to divide the problem into sub-problems and then join all of them together at the end? Any good way to do this on Linux?
Sample
The first column represents the ID of the set and the rest of the columns represents the words.
m.06fl3b|hadji|barbarella catton|haji catton|haji cat|haji
m.06flgy|estadio neza 86
m.06fm8g|emd gp39dc
m.0md41|pavees|barbarella catton
m.06fmg|round
m.01012g|hadji|fannin county windom town|windom
m.0101b|affray
Example
m.06fl3b has an intersection with m.01012g and m.0md41. The output file will be as follows:
m.06fl3b m.01012g m.0md41
m.06flgy
m.06fm8g
m.0md41 m.06fl3b
m.06fmg
m.01012g m.06fl3b
m.0101b
Set intersection is associative and therefore amenable to parallel folding (which is one of many use cases of MapReduce). For each pair of sets ((1, 2), (3, 4), ...), you can compute the intersection of each pair, and put the results into a new collection of sets, which will have half the size. Repeat until you're left with only one set. The total number of intersection operations will be equal to the number of sets minus one.
Launching millions of threads is going to bog down your machine, however, so you will probably want to use a thread pool: Make a number of threads that is close to the amount of CPU cores you have available, and create a list of tasks, where each task is two sets that are to be intersected. Each thread repeatedly checks the task list and grabs the first available task (make sure that you access the task list in a thread-safe manner).
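To make the shape of that concrete, here is a rough C++ sketch of the pairwise fold with a bounded number of in-flight tasks (std::async batches stand in for a real thread pool; parallel_intersection and the types here are my own, not from the question):

#include <algorithm>
#include <future>
#include <iterator>
#include <set>
#include <string>
#include <thread>
#include <vector>

using Set = std::set<std::string>;

static Set intersect(const Set& a, const Set& b) {
    Set out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(out, out.begin()));
    return out;
}

// Each round halves the number of sets; total intersections = sets.size() - 1.
// At most `workers` tasks run at once, so the machine is not flooded with threads.
Set parallel_intersection(std::vector<Set> sets) {
    const size_t workers = std::max(1u, std::thread::hardware_concurrency());
    while (sets.size() > 1) {
        std::vector<Set> next;
        for (size_t base = 0; base + 1 < sets.size(); base += 2 * workers) {
            std::vector<std::future<Set>> batch;
            for (size_t i = base; i + 1 < sets.size() && i < base + 2 * workers; i += 2)
                batch.push_back(std::async(std::launch::async, intersect,
                                           std::cref(sets[i]), std::cref(sets[i + 1])));
            for (auto& f : batch) next.push_back(f.get());
        }
        if (sets.size() % 2)                    // odd one out advances unchanged
            next.push_back(std::move(sets.back()));
        sets = std::move(next);
    }
    return sets.front();
}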

how to differentiate two very long strings in c++?

I would like to solve the Levenshtein_distance problem where the length of the strings is huge.
Edit 2:
As Bobah said, the title was misleading, so I have updated the title of the question.
The initial title was: how to declare a 100000x100000 2-D integer array in c++?
The content was:
Is there any way to declare int x[100000][100000] in C++?
When I declare it globally, the compiler produces error: size of array ‘x’ is too large.
One method could be using map<pair<int, int>, int> mymap.
But allocating and deallocating takes more time. Is there any other way, like using vector<int> myvec?
For memory blocks that large, the best approach is dynamic allocation using the operating system's facilities for adding virtual memory to the process.
However, look how large a block you are trying to allocate:
40 000 000 000 bytes (100000 × 100000 × 4-byte ints)
I take my previous advice back. For a block that large, the best approach is to analyze the problem and figure out a way to use less memory.
Filling the edit distance matrix can be done one row at a time. Remembering the previous row is enough to compute the current row. This observation reduces space usage from quadratic to linear. Makes sense?
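For illustration, here is a C++ sketch of that row-at-a-time scheme (my own rendering of the textbook algorithm, not code from the question):

#include <algorithm>
#include <string>
#include <vector>

// Levenshtein distance keeping only the previous row:
// O(len(b)) space instead of an m x n matrix.
int levenshtein(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1), curr(b.size() + 1);
    for (size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<int>(j);
    for (size_t i = 1; i <= a.size(); ++i) {
        curr[0] = static_cast<int>(i);
        for (size_t j = 1; j <= b.size(); ++j) {
            curr[j] = std::min({prev[j] + 1,                          // deletion
                                curr[j - 1] + 1,                      // insertion
                                prev[j - 1] + (a[i - 1] != b[j - 1])  // substitution
                               });
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}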
Your question is very interesting, but the title is misleading.
This is what you need in terms of data model (x - first string, y - second string, * - distance matrix).
y <-- first string (scrolls from top down)
y
x x x x x x x x <- second string (scrolls from left to right)
y * * *
y * * *
y * * * <-- distance matrix (a donut) scrolls together with strings
and grows/shrinks when needed, as explained below
y
Have two relatively long (but still << N) character buffers and a relatively small (<< buffer size) rectangular (start from square) distance matrix.
Make the matrix a donut: a bi-dimensional ring buffer (you can use the one from Boost, or just std::deque).
When the string fragments currently covered by the matrix are a 100% match, shift both buffers by one, rotate the donut around both axes, and recalculate one new row/column in the distance matrix.
When the match is <100% and below a configured threshold, grow both dimensions of the donut without dropping any rows/columns, until either the match gets above the threshold or you reach the maximum donut size. When the match ratio hits the threshold from below, you need to scroll the donut, discarding the head of the x and y buffers and aligning them at the same time (only X needs moving by 1 when the distance matrix tells you that X[i] does not exist in Y, but X[i+1..i+m] matches Y[j..j+m-1]).
As a result you will have a simple yet very efficient heuristic diff engine with a deterministic, limited memory footprint, and all memory can be pre-allocated at startup, so no dynamic allocation will slow it down at runtime.
Apache v2 license, in case you decide to go for it.
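For a flavor of the storage layer only, here is a minimal 2-D ring-buffer ("donut") sketch of my own; the growing/shrinking and match-threshold logic described above is deliberately left out:

#include <vector>

// Minimal fixed-size 2-D ring buffer: rotating it is just moving the head
// indices, no data copying, which is what makes the scroll step cheap.
class Donut {
    std::vector<int> cells;
    size_t rows, cols, rowHead = 0, colHead = 0;
public:
    Donut(size_t r, size_t c) : cells(r * c, 0), rows(r), cols(c) {}
    int& at(size_t i, size_t j) {           // logical (i, j) -> physical cell
        return cells[((rowHead + i) % rows) * cols + (colHead + j) % cols];
    }
    void rotateRow() { rowHead = (rowHead + 1) % rows; }  // drop oldest row
    void rotateCol() { colHead = (colHead + 1) % cols; }  // drop oldest column
};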

scull driver from LDD - scull_read and scull_write

I am going through LDD from Rubini to learn driver programming. Currently, I am on the 3rd chapter, writing the character driver "scull". However, in the example code provided by the authors, I am not able to understand the following lines in the scull_read() and scull_write() methods:
item = (long)*f_pos / itemsize;
rest = (long)*f_pos % itemsize;
s_pos = rest / quantum;
q_pos = rest % quantum;
I have spent quite some time on it in vain (and am still working on it). Can someone please help me understand the functionality of the above code snippet?
Regards,
Roy
Suppose you have set the quantum area size to 4000 bytes in the scull driver and the qset array size to 10. In that case, the value of itemsize would be 40000. f_pos is the position from which a read/write should start, and it comes in as a parameter to the read/write function. Suppose a read request has come in and f_pos is 50000.
Now,
item = (long)*f_pos / itemsize; so item would be 50000/40000 = 1
rest = (long)*f_pos % itemsize; so rest would be 50000%40000 = 10000
s_pos = rest / quantum; so s_pos would be 10000/4000 = 2
q_pos = rest % quantum; so q_pos would be 10000%4000 = 2000
If you have read the description of the scull driver in chapter 3 carefully, you know that each scull device is a linked list of scull_qset nodes, and in our case each scull_qset points to an array of pointers, each of which points to a quantum area of 4000 bytes (we set the quantum area size to 4000 bytes and the array size to 10). So each scull_qset is an array of 10 pointers, each pointing to 4000 bytes, and one scull_qset therefore has a capacity of 40000 bytes.
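For reference, the structure behind this (as defined in LDD3) is simply:

struct scull_qset {
    void **data;              /* array of qset pointers, each to one quantum */
    struct scull_qset *next;  /* next node in the device's linked list */
};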
In our read request, f_pos is 50000, so obviously this position is not in the first scull_qset, which is proven by the calculation of item. As item is 1, it points to the second scull_qset (the value of item would be 0 for the first scull_qset; for more information see the scull_follow function definition).
The value of rest helps to find the position in the second scull_qset where the read should start. As each quantum area is 4000 bytes, s_pos tells which of the 10 pointers of the second scull_qset should be used, and q_pos tells at which location within the quantum area pointed to by s_pos the read should start.
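Putting the example numbers through the arithmetic (a standalone check, not driver code):

#include <cstdio>

int main() {
    const long quantum = 4000, qset = 10;   // sizes from the example above
    const long itemsize = quantum * qset;   // 40000 bytes per scull_qset
    const long f_pos = 50000;               // requested file position

    long item  = f_pos / itemsize;  // 1     -> second scull_qset in the list
    long rest  = f_pos % itemsize;  // 10000 -> offset within that qset
    long s_pos = rest / quantum;    // 2     -> third pointer in the data array
    long q_pos = rest % quantum;    // 2000  -> byte offset inside that quantum
    std::printf("%ld %ld %ld %ld\n", item, rest, s_pos, q_pos);
}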
