Doubts about parseable output in atop - linux

I'm trying to analyze the parseable output from atop, but the man page section is not clear to me: I count more fields than the manual page explains.
"The first part of each output-line consists of the following six fields: label (the name of the label), host (the name of this machine), epoch (the time of this interval as number of seconds since 1-1-1970), date (date of this interval in format YYYY/MM/DD), time (time of this interval in format HH:MM:SS), and interval (number of seconds elapsed for this interval).
The subsequent fields of each output-line depend on the label:
PRM:
Subsequent fields: PID, name (between brackets), state, page size for this machine (in bytes), virtual memory size (Kbytes), resident memory size (Kbytes), shared text memory size (Kbytes), virtual memory growth (Kbytes), resident memory growth (Kbytes), number of minor page faults, and number of major page faults."
https://linux.die.net/man/1/atop
So that is the standard fields plus the PRM fields:
6 + 11 = 17
But I count 24 fields in total when I run:
atop -r FILE -P PRM
Sample output:
PRM hernan-Virtual-Machine 1591135517 2020/06/02 19:05:17 834 660 (cron) S 4096 38424 3288 44 38424 3288 247 1 3216 336 132 0 660 y 385
How should I read the output?
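A minimal sketch for checking the field count, assuming fields are separated by spaces and the process name is the only field that may contain spaces (it is wrapped in parentheses). The names of the first 17 fields follow the man page excerpt above; `parse_prm_line` and the `extra` key are hypothetical names, and anything beyond the documented fields is kept unlabeled rather than guessed.

```python
import re

COMMON_FIELDS = ["label", "host", "epoch", "date", "time", "interval"]
PRM_FIELDS = [
    "pid", "name", "state", "page_size", "vsize_kb", "rsize_kb",
    "shared_text_kb", "vgrowth_kb", "rgrowth_kb", "minor_faults", "major_faults",
]

def parse_prm_line(line):
    """Split one PRM line into the 17 documented fields plus any trailing extras."""
    # Split around the parenthesised name so a name containing spaces stays one field.
    match = re.match(r"^(.*?)\s+\((.*)\)\s+(.*)$", line)
    if match is None:
        raise ValueError("no parenthesised process name found")
    before, name, after = match.groups()
    values = before.split() + [name] + after.split()
    names = COMMON_FIELDS + PRM_FIELDS
    record = dict(zip(names, values))
    # Fields not described in the quoted man page text are kept as-is.
    record["extra"] = values[len(names):]
    return record

sample = ("PRM hernan-Virtual-Machine 1591135517 2020/06/02 19:05:17 834 660 "
          "(cron) S 4096 38424 3288 44 38424 3288 247 1 3216 336 132 0 660 y 385")
record = parse_prm_line(sample)
print(record["pid"], record["name"], record["major_faults"])
print(len(record["extra"]), "trailing fields not in the quoted text:", record["extra"])
```

On the sample line this leaves seven trailing values, which suggests the installed atop version emits more PRM fields than the quoted man page describes; the man page shipped with the installed version is the place to check their meaning.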

Related

numpy broadcasting on pandas dataframe gives memory error

I have two data frames. Dataframe A is of shape (1269345,5) and dataframe B is of shape (18583586, 3).
Dataframe A looks like:
Name. gender start_coordinate end_coordinate ID
Peter M 30 150 1
Hugo M 4500 6000 2
Jennie F 300 700 3
Dataframe (B) looks like
ID_sim. position string
1 89 aa
4 568 bb
5 938437 cc
I want to extract rows and make two data frames containing the rows for which the position column in dataframe B falls in an interval (specified by the start_coordinate and end_coordinate columns) in dataframe A. So the resulting dataframes would look like:
###Final dataframe A
Name. gender start_coordinate end_coordinate ID
Peter M 30 150 1
Jennie F 300 700 3
###Final dataframe B
ID_sim. position string
1 89 aa
4 568 bb
I tried using numpy broadcasting like this:
s, e = dfA[['start_coordinate', 'end_coordinate']].to_numpy().T
p = dfB['position'].to_numpy()[:, None]
dfB[((p >= s) & (p <= e)).any(1)]
But this gave me the following error:
MemoryError: Unable to allocate 2.72 TiB for an array with shape (18583586, 160711) and data type bool
I think it's because the intermediate numpy array becomes very large when I try broadcasting. How can I achieve my task without numpy broadcasting, considering that my dataframes are very large? Insights will be appreciated.
This is likely due to your system's overcommit mode.
It is 0 by default:
Heuristic overcommit handling. Obvious overcommits of address space
are refused. Used for a typical system. It ensures a seriously wild
allocation fails while allowing overcommit to reduce swap usage. The
root is allowed to allocate slightly more memory in this mode. This is
the default.
Run the command below to check your current overcommit mode:
$ cat /proc/sys/vm/overcommit_memory
0
In this case, you're allocating
> 156816 * 36 * 53806 / 1024.0**3
282.8939827680588
That is ~282 GB, and the kernel is saying: well, obviously there's no way I'm going to be able to commit that many physical pages to this, so it refuses the allocation.
If (as root) you run:
$ echo 1 > /proc/sys/vm/overcommit_memory
This will enable the "always overcommit" mode, and you'll find that indeed the system will allow you to make the allocation no matter how large it is (within 64-bit memory addressing at least).
I tested this myself on a machine with 32 GB of RAM. With overcommit mode 0 I also got a MemoryError, but after changing it to 1 it works:
>>> import numpy as np
>>> a = np.zeros((156816, 36, 53806), dtype='uint8')
>>> a.nbytes
303755101056
You can then go ahead and write to any location within the array, and the system will only allocate physical pages when you explicitly write to that page. So you can use this, with care, for sparse arrays.
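For the interval task in the question, the full (rows of B by rows of A) boolean matrix can also be avoided altogether, independent of the overcommit setting. A minimal sketch, assuming closed intervals with start_coordinate <= end_coordinate in every row, and using sorted arrays with `np.searchsorted` so that memory stays linear in the input size. The function names are hypothetical.

```python
import numpy as np

def positions_in_any_interval(starts, ends, positions):
    """Boolean mask over positions: True where a position lies in at least one [start, end]."""
    sorted_starts = np.sort(starts)
    sorted_ends = np.sort(ends)
    # Intervals that started at or before p, minus intervals that ended before p,
    # equals the number of intervals containing p (valid when start <= end everywhere).
    started = np.searchsorted(sorted_starts, positions, side="right")
    finished = np.searchsorted(sorted_ends, positions, side="left")
    return (started - finished) > 0

def intervals_with_any_position(starts, ends, positions):
    """Boolean mask over intervals: True where at least one position lies in [start, end]."""
    sorted_pos = np.sort(positions)
    upto_end = np.searchsorted(sorted_pos, ends, side="right")
    before_start = np.searchsorted(sorted_pos, starts, side="left")
    return (upto_end - before_start) > 0

# Sample rows from the question
starts = np.array([30, 4500, 300])
ends = np.array([150, 6000, 700])
positions = np.array([89, 568, 938437])

mask_b = positions_in_any_interval(starts, ends, positions)
mask_a = intervals_with_any_position(starts, ends, positions)
print(mask_b)
print(mask_a)
```

With the data frames from the question, the inputs would come from `dfA['start_coordinate'].to_numpy()`, `dfA['end_coordinate'].to_numpy()` and `dfB['position'].to_numpy()`, and the results would be `dfB[mask_b]` and `dfA[mask_a]`.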

Why divide a list's sys.getsizeof() by 8 or 4 (depending on the machine) after subtracting 64 or 36 from it?

I am trying to find the capacity of a list with a function. One step involves subtracting 64 (on my machine) from the list size, and the result then has to be divided by 8 to get the capacity. What does this capacity value mean?
I tried reading the Python docs about the sys.getsizeof() method, but they still didn't answer my doubts.
import sys

def disp(l1):
    # What does this line mean, especially the // 8 part?
    print("Capacity", (sys.getsizeof(l1) - 64) // 8)
    print("Length", len(l1))

mariya_list = []
mariya_list.append("Sugarisverysweetand it can be used for cooking sweets and also used in beverages ")
mariya_list.append("Choco")
mariya_list.append("bike")
disp(mariya_list)
print(mariya_list)
mariya_list.append("lemon")
print(mariya_list)
disp(mariya_list)
mariya_list.insert(1, "leomon Tea")
print(mariya_list)
disp(mariya_list)
Output:
Capacity 4
Length 1
['Choco']
['Choco', 'lemon']
Capacity 4
Length 2
['Choco', 'leomon Tea', 'lemon']
Capacity 4
Length 3
This is the output. I am unable to understand what capacity 4 means here. Why does it repeat the same value, four, even after subsequent additions of elements?
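A minimal sketch of what the expression measures, assuming CPython: `sys.getsizeof` of a list covers the list header plus an array of element pointers, not the elements themselves. Subtracting the size of an empty list and dividing by the pointer size (8 bytes on a 64-bit build, 4 on a 32-bit build) therefore gives the number of slots currently allocated. Appending may over-allocate, which is why the value can stay the same for several appends. `capacity` is a hypothetical helper, and the exact growth pattern is an implementation detail that differs between Python versions.

```python
import struct
import sys

POINTER_SIZE = struct.calcsize("P")   # 8 on 64-bit builds, 4 on 32-bit builds
EMPTY_LIST_SIZE = sys.getsizeof([])   # list header only; stands in for the hard-coded 64 or 36

def capacity(lst):
    """Number of element slots currently allocated for lst (CPython-specific)."""
    return (sys.getsizeof(lst) - EMPTY_LIST_SIZE) // POINTER_SIZE

items = []
history = []
for word in ["sugar", "Choco", "bike", "lemon", "tea"]:
    items.append(word)
    history.append((len(items), capacity(items)))
    print("Length", len(items), "Capacity", capacity(items))
```

Measuring the empty list at run time avoids hard-coding 64 or 36, which depend on the interpreter version and on whether the build is 32-bit or 64-bit.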

difference between counting packets and counting the total number of bytes in the packets

I'm reading perfbook. In chapter 5.2, the book gives some examples of statistical counters. These examples can solve the network packet count problem.
Quick Quiz 5.2: Network-packet counting problem. Suppose that you need
to collect statistics on the number of networking packets (or total
number of bytes) transmitted and/or received. Packets might be
transmitted or received by any CPU on the system. Suppose further that
this large machine is capable of handling a million packets per
second, and that there is a system-monitoring package that reads out
the count every five seconds. How would you implement this statistical
counter?
One Quick Quiz asks about the difference between counting packets and counting the total number of bytes in the packets.
I can't understand the answer. After reading it, I still don't know the difference.
In the example in the "To see this" paragraph, if the numbers 3 and 5 are changed to 1, what difference does it make?
Please help me understand it.
QuickQuiz5.26: What fundamental difference is there between counting
packets and counting the total number of bytes in the packets, given
that the packets vary in size?
Answer: When counting packets, the
counter is only incremented by the value one. On the other hand, when
counting bytes, the counter might be incremented by largish numbers.
Why does this matter? Because in the increment-by-one case, the value
returned will be exact in the sense that the counter must necessarily
have taken on that value at some point in time, even if it is
impossible to say precisely when that point occurred. In contrast,
when counting bytes, two different threads might return values that are
inconsistent with any global ordering of operations.
To see this, suppose that thread 0 adds the value three to its counter,
thread 1 adds the value five to its counter, and threads 2 and 3 sum the
counters. If the system is “weakly ordered” or if the compiler uses
aggressive optimizations, thread 2 might find the sum to be three and
thread 3 might find the sum to be five. The only possible global orders
of the sequence of values of the counter are 0,3,8 and 0,5,8, and
neither order is consistent with the results obtained.
If you missed this one, you are not alone. Michael Scott used this
question to stump Paul E. McKenney during Paul’s Ph.D. defense.
I may be wrong, but I presume the idea behind it is the following. Suppose there are 2 separate processes, each collecting its own counter, and the counters are summed up to get a total value. Now suppose that some events occur simultaneously in both processes: for example, a packet of size 10 comes to the first process and a packet of size 20 comes to the second at the same time, and after some period of time a packet of size 30 comes to the first process at the same time as a packet of size 60 comes to the second process. So here is the sequence of events:
          Time point #1   Time point #2
Process1: 10              30
Process2: 20              60
Now let's build a vector of possible total counter states after the time point #1 and #2 for a weakly ordered system, considering the previous total value was 0:
Time point#1
0 + 10 (process 1 wins) = 10
0 + 20 (process 2 wins) = 20
0 + 10 + 20 = 30
Time point#2
10 + 30 = 40 (process 1 wins)
10 + 60 = 70 (process 2 wins)
20 + 30 = 50 (process 1 wins)
20 + 60 = 80 (process 2 wins)
30 + 30 = 60 (process 1 wins)
30 + 60 = 90 (process 2 wins)
30 + 30 + 60 = 120
Now, presuming that there can be some period of time between time point #1 and time point #2, let's assess which values reflect the real state of the system. Apparently all states after time point #1 can be treated as valid, as there was some precise moment in time when the total received size was 10, 20 or 30 (we ignore the fact that the final value may not be the current one; at least it is a value that was current at some moment while the system was running). For the possible states after time point #2 the picture is slightly different. For example, the system has never been in the states 40, 70, 50 and 80, but we risk getting these values after the second collection.
Now let's take a look at the situation from the number of packets perspective. Our matrix of events is:
Time point#1 Time point#2
Process1: 1 1
Process2: 1 1
The possible total states:
Time point#1
0 + 1 (process 1 wins) = 1
0 + 1 (process 2 wins) = 1
0 + 1 + 1 = 2
Time point#2
1 + 1 (process 1 wins) = 2
1 + 1 (process 2 wins) = 2
2 + 1 (process 1 wins) = 3
2 + 1 (process 2 wins) = 3
2 + 2 = 4
In that case all possible values (1, 2, 3, 4) reflect a state in which the system definitely was at some point in time.
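One way to see what changes when the increments 3 and 5 from the book's example are replaced by 1 and 1 is to enumerate the cases. The sketch below models a reader as possibly seeing any subset of the per-thread increments (an assumption standing in for a weakly ordered system), then checks whether every pair of readings fits into some single global order of the increments. All function names are hypothetical.

```python
from itertools import combinations, permutations

def readable_sums(increments):
    """Sums a reader could compute if it sees any subset of the per-thread increments."""
    sums = set()
    for r in range(len(increments) + 1):
        for subset in combinations(increments, r):
            sums.add(sum(subset))
    return sums

def global_orders(increments):
    """Value sequences the total passes through, one per ordering of the increments."""
    orders = set()
    for order in permutations(increments):
        total, seq = 0, [0]
        for inc in order:
            total += inc
            seq.append(total)
        orders.add(tuple(seq))
    return orders

def inconsistent_pairs(increments):
    """Pairs of readings that never appear together in any single global order."""
    sums = sorted(readable_sums(increments))
    orders = global_orders(increments)
    return [(a, b) for a, b in combinations(sums, 2)
            if not any(a in seq and b in seq for seq in orders)]

print(inconsistent_pairs((3, 5)))
print(inconsistent_pairs((1, 1)))
```

With increments 3 and 5 the only offending pair is (3, 5): one reader can see 3 while another sees 5, yet no order 0,3,8 or 0,5,8 contains both. With increments 1 and 1 the list is empty, because every reading lies on the single sequence 0,1,2.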

Find the minimum number of tanks to hold the maximum quantity of wines, at each tank maximum possible capacity

I work in the wine reselling business, and we have this problem I've been trying to solve. We have 50 - 70 types of wine to be stored at any time, and around 500 tanks of various capacities. Each tank can only hold 1 type of wine. My job is to determine the minimum number of tanks to hold the maximum number of types of wine, each filled as close to its maximum capacity as possible, i.e. 100l of wine should not be stored in a 200l tank if 2 tanks of 60l and 40l also exist.
I've been doing the job by hand in Excel and want to try to automate the process, but macros and array formulas quickly get out of hand. I can write a simple program in C or Swift, but I'm stuck at finding a general algorithm. Any pointer on where I can start is much appreciated. A full solution and I will send you a bottle ;)
Edit: for clarification, I do know how many types of wine I have and their total quantity, i.e. Pinot at 700l, Merlot 2000l, etc. These change every week. The tanks however have many different capacities (40, 60, 80, 100, 200 liters etc.) and change at irregular intervals since they have to be taken out for cleaning and replaced. Simply using 70 tanks to hold 70 types is not possible.
Also, total quantity of wine never matches total tanks' capacity, and I need to use the minimum number of tanks to hold the maximum amount of wine. In case of insufficient capacity the amount of wine left over must be smallest possible (they'll spoil quickly). If there is left-over, the amount left over of each type must be proportional to their quantity.
A simplified example of the problem is this:
Wine:
----------
Merlot 100
Pinot 120
Tocai 230
Chardonnay 400
Total: 850L
Tanks:
----------
T1 10
T2 20
T3 60
T4 150
T5 80
T6 80
T7 90
T8 80
T9 50
T10 110
T11 50
T12 50
Total: 830L
This greedy-DP algorithm attempts to perform a proportional split: for example, if you have 700l Pinot, 2000l Merlot and tank capacities 40, 60, 80, 100, 200, that means a total capacity of 480.
700 / (700 + 2000) = 0.26
2000 / (700 + 2000) = 0.74
0.26 * 480 = 125
0.74 * 480 = 355
So we will attempt to store 125l of the Pinot and 355l of the Merlot, to make the storage proportional to the amounts we have.
Obviously this isn't fully possible, because you cannot mix wines, but we should be able to get close enough.
To store the Pinot, the closest would be to use tanks 1 (40l) and 3 (80l), then use the rest for the Merlot.
This can be implemented as a subset sum problem:
d[i] = true if we can make sum i and false otherwise
d[0] = true, false otherwise
sum_of_tanks = 0
for each tank i:
    sum_of_tanks += tank_capacities[i]
    for s = sum_of_tanks down to tank_capacities[i]:
        d[s] = d[s] OR d[s - tank_capacities[i]]
Compute the proportions then run this for each type of wine you have (removing the tanks already chosen, which you can find by using the d array, I can detail if you want). Look around d[computed_proportion] to find the closest sum possible to achieve for each wine type.
This should be fast enough for a few hundred tanks, which I'm guessing don't have capacities larger than a few thousand.
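The subset-sum step for a single wine can be sketched as follows. This is a minimal illustration, not a full solution: it assumes integer capacities, stores one way of reaching each sum so the chosen tanks can be recovered, and breaks ties toward the smaller total. `closest_tank_subset` is a hypothetical name.

```python
def closest_tank_subset(capacities, target):
    """Return (total, tank_indices) for the tank subset whose total capacity is closest to target."""
    # reach[s] = (previous_sum, tank_index) records one way to build the sum s.
    reach = {0: None}
    for i, cap in enumerate(capacities):
        for s in list(reach):          # snapshot, so each tank is used at most once
            if s + cap not in reach:
                reach[s + cap] = (s, i)
    best = min(reach, key=lambda s: (abs(s - target), s))
    # Walk back through the recorded steps to recover the tanks used.
    chosen, s = [], best
    while reach[s] is not None:
        s, i = reach[s]
        chosen.append(i)
    return best, sorted(chosen)

tanks = [40, 60, 80, 100, 200]
pinot_share = 700 / (700 + 2000) * sum(tanks)   # proportional share of the total capacity
total, chosen = closest_tank_subset(tanks, pinot_share)
print(round(pinot_share), total, [tanks[i] for i in chosen])
```

For the Pinot example this picks the 40l and 80l tanks (120l in total). Repeating the call for the next wine on the remaining tanks follows the procedure described above.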

Cassandra and wide row disk size estimate?

I am trying to estimate the amount of space required for each column in a Cassandra wide row, but the numbers that I get are wildly conflicting.
I have a pretty standard wide row table to store some time series data:
CREATE TABLE raw_data (
id uuid,
time timestamp,
data list<float>,
PRIMARY KEY (id, time)
);
In my case, I store 20 floats in the data list.
Datastax provides some formulas for estimating user data size.
regular_total_column_size = column_name_size + column_value_size + 15
row_size = key_size + 23
primary_key_index = number_of_rows * ( 32 + average_key_size )
For this table, we get the following values:
regular_total_column_size = 8 + 80 + 15 = 103 bytes
row_size = 16 + 23 = 39 bytes
primary_key_index = 276 * ( 32 + 16 ) = 13248 bytes
I'm mostly interested in how the row grows, so the 103 bytes per column is of interest. I counted all the samples in my database and ended up with 29,241,289 unique samples. Multiplying it out I get an estimated raw_data table size of 3GB.
In reality, I have 4GB of compressed data as measured by nodetool cfstats right after compaction. It reports a compression ratio of 0.117. It averages out to 137 bytes per sample, on disk, after compression. That seems very high, considering:
Only 88 bytes of that is user data.
It's 34 bytes more per sample than the estimate.
This is after deflate compression.
So, my question is: how do I accurately forecast how much disk space Cassandra wide rows consume, and how can I minimize the total disk space?
I'm running a single node with no replication for these tests.
This may be due to compaction strategies. With size-tiered compaction, the SSTables will build up to double the required space during compaction. For levelled compaction, around 10% extra space will be needed. Depending on the compaction strategy, you need to take into account the additional disk space used.
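The figures in the question can be reproduced with a few lines of arithmetic. This is only a check of the numbers quoted there, assuming "4GB" means 4 * 10**9 bytes (that reading gives the 137 bytes per sample mentioned in the question); the variable names are hypothetical.

```python
column_name_size = 8
column_value_size = 80            # 20 floats * 4 bytes each
samples = 29_241_289

# Formula quoted in the question
regular_total_column_size = column_name_size + column_value_size + 15
estimated_bytes = regular_total_column_size * samples

on_disk_bytes = 4e9               # assumption: "4GB" read as 4 * 10**9 bytes
bytes_per_sample_on_disk = on_disk_bytes / samples
overhead_per_sample = round(bytes_per_sample_on_disk) - regular_total_column_size

print(regular_total_column_size)          # 103
print(round(estimated_bytes / 1e9, 2))    # 3.01
print(round(bytes_per_sample_on_disk))    # 137
print(overhead_per_sample)                # 34
```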

Resources