I am creating a tool which will estimated space should be given to the VSAM file based on number of records, record length and block size parameters.
While going through different sources on internet I got an article on IBM website about space calculation as follows but I didn't understood some information like, where does 33 come from in point 5, Also how 10% and 20% is being taken in CI and CA.
Device type. 3390
Unit of space allocation. Cylinders
Data control interval size. 1024 bytes
Physical block size (calculated by VSAM). 1024 bytes
Record size. 200 bytes
Free space definition – control interval. 20%
Free space definition – control area. 10%
Number of records to be loaded. 3000
You can calculate space for the data component as follows:
1. Number of bytes of free space (20% × 1024) = 204 (round down)
2. Number of loaded records per control interval (1024–10–204)/200 = 4.
3. Number of physical blocks per track = 33.
4. Number of control intervals per track = 33.
5. Maximum number of control intervals per control area (33 x 15) = 495.
6. Number of loaded control intervals per control area (495 - 10% x 495) = 446.
7. Number of loaded records per cylinder (4 x 446) = 1784.
8. Total space for data component (3000/1784) (rounded) = 2 cylinders.
The value (1024 – 10) is the control interval length minus 10 bytes for two RDFs and one CIDF. The 10. record size is 200 bytes. On an IBM 3380, 31 physical blocks with 1024 bytes can be stored on one track. The value (33 × 15) is the number of physical blocks per track multiplied by the number of data tracks per cylinder.
Free space (in percent) on CA and CI is determined by FREESPACE parameter in DEFINE CLUSTER IDCAMS command. Values in fomula above are only example. You should change it if definition of VSAM is different.
The maximum size on track for 3390 is 56664, but you must remeber about space used for intersector gaps. More sectors - more gaps, less available space for data. 33 is maximum number of blocks on 3390 track for block size between 1019 and 1086 (you can find calculation of that and reference table on bitsavers document "IBM 3390 Direct Storage Access Reference")
Related
I set 400 RU/s for my collection in CosmosDB. I want to estimate, maximum total of RU/s in 24 hours. please let me know, how can I calculate it?
is the calculation right?
400 * 60 * 60 * 24 = 34.560.000 RU in 24 hours
When talking about maximum RU available within a given time window, you're correct (except that would be total RU, not total RU/second - it would remain constant at 400/sec unless you adjusted your RU setting).
But... given that RUs are reserved on a per-second basis: What you consume within a given one-second window is really up to you and your specific operations, and if you don't consume all allocated RUs (400, in this case) in a given one-second window, it's gone (you're paying for reserved capacity). So yes, you're right about absolute maximum, but that might not match your real-world scenario. You could end up consuming far less than the maximum you've allocated (imagine idle-times for your app, when you're just not hitting the database much). Also note that RUs are distributed across partitions, so depending on your partitioning scheme, you could end up not using a portion of your available RUs.
Note that it's entirely possible to use more than your allocated 400 RU in a given one-second window, which then puts you into "RU Debt" (see my answer here for a detailed explanation).
I am trying to find all the cycles in my data using networkx.simple_cycles() function in python. As written in networkx documentation "This is a non-recursive, iterator/generator version of Johnson’s algorithm". Johnson's algorithm has a time complexity of: ((nodes+edges)*(cycles+1)) and space complexity of: (nodes+edges).
I have created a directed graph from transaction data with edges as From node--->To node. I am removing the nodes which have in-degree or out-degree of 0 as they will not contribute to cycles(elementary circuits).
My graph details after removing the in-degree and out-degree 0 nodes are as follows:
For 1.09 million records:
Number of nodes remaining-39278;
Number of edges-26324;
Number of cycles- Not able to get any output as the code keeps on running.
For 1.08 million records:
Number of nodes remaining-38664;
Number of edges-25710;
Number of cycles- 5612438.
For 1.05 million records:
Number of nodes remaining-36737;
Number of edges-23784;
Number of cycles- 69671.
For 1.01 million records:
Number of nodes remaining-34393;
Number of edges-21566;
Number of cycles- 3079.
For 1 million records:
Number of nodes remaining-33841;
Number of edges-21125;
Number of cycles- 3072.
I want to understand the relation of time and space complexity with system configuration. Given all these details about an algorithm, how can i find out what system configuration will be suitable for a particular size of data. Why am i not able to run this algorithm beyond 1.08 million data?
My system configuration:
AMD PRO A12-8800B (2.1 GHz, up to 3.4 GHz, 2MB Cache, 4 Cores) with AMD Radeon R7 Graphics.
8GB RAM.
I am trying to calculate the the partition size for each row in a table with arbitrary amount of columns and types using a formula from the Datastax Academy Data Modeling Course.
In order to do that I need to know the "size in bytes" for some common Cassandra data types. I tried to google this but I get a lot of suggestions so I am puzzled.
The data types I would like to know the byte size of are:
A single Cassandra TEXT character (I googled answers from 2 - 4 bytes)
A Cassandra DECIMAL
A Cassandra INT (I suppose it is 4 bytes)
A Cassandra BIGINT (I suppose it is 8 bytes)
A Cassandra BOOELAN (I suppose it is 1 byte, .. or is it a single bit)
Any other considerations would of course also be appreciated regarding data types sizes in Cassandra.
Adding more info since it seems confusing to understand that I am only trying to estimate the "worst scenario disk usage" the data would occupy with out any compressions and other optimizations done by Cassandra behinds the scenes.
I am following the Datastax Academy Course DS220 (see link at end) and implement the formula and will use the info from answers here as variables in that formula.
https://academy.datastax.com/courses/ds220-data-modeling/physical-partition-size
I think, from a pragmatic point of view, that it is wise to get a back-of-the-envelope estimate of worst case using the formulae in the ds220 course up-front at design time. The effect of compression often varies depending on algorithms and patterns in the data. From ds220 and http://cassandra.apache.org/doc/latest/cql/types.html:
uuid: 16 bytes
timeuuid: 16 bytes
timestamp: 8 bytes
bigint: 8 bytes
counter: 8 bytes
double: 8 bytes
time: 8 bytes
inet: 4 bytes (IPv4) or 16 bytes (IPV6)
date: 4 bytes
float: 4 bytes
int 4 bytes
smallint: 2 bytes
tinyint: 1 byte
boolean: 1 byte (hopefully.. no source for this)
ascii: equires an estimate of average # chars * 1 byte/char
text/varchar: requires an estimate of average # chars * (avg. # bytes/char for language)
map/list/set/blob: an estimate
hope it helps
The only reliable way to estimate the overhead associated to something is to actually perform measures. Really, you can't take the single data types and generalize something about them. If you have 4 bigints columns and you're supposing that your overhead is X, if you have 400 bigint columns your overhead won't probably be 100x. That's because Cassandra compresses (by default, and it's a settings tunable per column family) everything before storing data on disk.
Try to load some data, I mean production data, in the cluster, and then let's know your results and compression configuration. You'd find some surprises.
Know your data.
I cannot find documentation on the "compactionstats":
While using nodetool compactionstats, what do the numerical values on the completed and total columns mean?
My column family has a total data size of about 360 GB but my compaction status displays:
pending tasks: 7
compaction type keyspace column family completed total unit progress
Compaction Test Message 161257707087 2475323941809 bytes 6.51%
While I see the "completed" increasing slowly (also the progress;-).
But how is this "total" computed? Why is it 2.5 TB when I have only 360 GB of data?
You must have compression on. total is the total number of uncompressed bytes comprising the set of sstables that are being compacted together. If you grep the cassandra log file for lines containing Compacting you will find the sstables that are part of a compaction. If you sum these sizes and multiply by the inverse of your compression ratio for the column family you will get pretty close to the total. By default this can be a bit difficult to verify on a multi-core system because the number of simultaneous compactions defaults to the number of cores.
You can also verify this answer by looking at the code:
AbstractionCompactionIterable - getCompactionInfo() uses the bytesRead and totalBytes fields from that class. totalBytes is final and is computed in the constructor, by summing getLengthInBytes() from each file that is part of the compaction.
The scanners vary, but the length in bytes returned by CompressedRandomAccessReader is the uncompressed size of the file.
I am using cassandra in my app and it started eating up disk space much faster than I expected and much faster than defined in manual. Consider this most simple example:
CREATE TABLE sizer (
id ascii,
time timestamp,
value float,
PRIMARY KEY (id,time)
) WITH compression={'sstable_compression': ''}"
I am turning off compression on purpose to see how many bytes will each record take.
Then I insert few values, I run nodetool flush and then I check the size of data file on disk to see how much space did it take.
Results show huge waste of space. Each record take 67 bytes, I am not sure how that is possible.
My id is 13 bytes long at it is saved only once in data file, since it is always the same for testing purposes.
According to: http://www.datastax.com/documentation/cassandra/2.0/webhelp/index.html#cassandra/architecture/architecturePlanningUserData_t.html
Size should be:
timestamp should be 8 bytes
value as column name takes 6 bytes
column value float takes 4 bytes
column overhead 15 bytes
TOTAL: 33 bytes
For testing sake, my id is always same, so I have actually only 1 row if I understood correctly.
So, my questions is how do I end up on using 67 bytes instead of 33.
Datafile size is correct, I tried inserting 100, 1000 and 10000 records. Size is always 67 bytes.
There are 3 overheads discussed in the file. One is the column overhead, which you have accommodated for. The second is the row overhead. And also if you have replication_factor greater than 1 there's an over head for that as well.