Cassandra cluster: write speed decreased with multiple nodes

I'm testing Cassandra 4.0 with 3 nodes as a proof of concept. All the nodes are VMs with 8 GB RAM and 2 cores, and the VMs were created so that they do not share I/O.
I started the first node and, with 50 threads in the client, it took 7 seconds to insert 150,000 records (no batching), so the write speed was about 21k records/sec. Then I added a second node and started a second client with 50 threads (writing to a different table at the same time as the first client), which also inserted 150k records. It took 18 seconds for both clients to finish, so the write speed dropped to about 17k/sec. Finally, I added a third node; with the same 2 clients it took 27 seconds to insert 300k records, so the write speed dropped to about 11k/sec. Apparently the write speed decreased as more nodes were added.
I checked CPU usage and it is around 70~80%.
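As a quick check of the throughput figures above, here is a small Python sketch (not part of the original test client) that divides the total number of inserted records by the elapsed time of each run. The run labels are only illustrative; the record counts and timings are the ones quoted above.

```python
# (total records inserted, elapsed seconds) for each run described above
runs = {
    "1 node, 1 client": (150_000, 7),
    "2 nodes, 2 clients": (300_000, 18),
    "3 nodes, 2 clients": (300_000, 27),
}

# Aggregate write throughput in records per second
throughput = {name: records / seconds for name, (records, seconds) in runs.items()}

for name, rate in throughput.items():
    print(f"{name}: {rate / 1000:.1f}k records/sec")
```

This gives roughly 21.4k, 16.7k and 11.1k records/sec for the three runs.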
Here is the result from "nodetool status":
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.30.1.1 1.65 GiB 16 ? 4d379ca0-362b-4077-b650-c589088e86ed rack1
UN 10.30.1.2 3.1 GiB 16 ? b0d37f83-dfaf-45ae-9749-25f2d6746d0e rack1
UN 10.30.1.3 2.7 GiB 16 ? 8a48959b-a2a4-4543-abbf-257ddb7ca5b1 rack1
And result from "nodetool tpstats":
Pool Name Active Pending Completed Blocked All time blocked
RequestResponseStage 0 0 1615905 0 0
MutationStage 0 0 2090208 0 0
ReadStage 0 0 1466 0 0
CompactionExecutor 0 0 1239 0 0
MemtableReclaimMemory 0 0 6 0 0
PendingRangeCalculator 0 0 4 0 0
GossipStage 0 0 7695 0 0
SecondaryIndexManagement 0 0 0 0 0
HintsDispatcher 0 0 0 0 0
MemtablePostFlush 0 0 11 0 0
PerDiskMemtableFlushWriter_0 0 0 6 0 0
ValidationExecutor 0 0 0 0 0
Sampler 0 0 0 0 0
ViewBuildExecutor 0 0 0 0 0
MemtableFlushWriter 0 0 6 0 0
CacheCleanupExecutor 0 0 0 0 0
Native-Transport-Requests 0 0 2202576 0 0
Latencies waiting in queue (micros) per dropped message types
Message type Dropped 50% 95% 99% Max
READ_RSP 0 0.0 0.0 0.0 0.0
RANGE_REQ 0 0.0 0.0 0.0 0.0
PING_REQ 0 0.0 0.0 0.0 0.0
_SAMPLE 0 0.0 0.0 0.0 0.0
VALIDATION_RSP 0 0.0 0.0 0.0 0.0
SCHEMA_PULL_RSP 0 0.0 0.0 0.0 0.0
SYNC_RSP 0 0.0 0.0 0.0 0.0
SCHEMA_VERSION_REQ 0 0.0 0.0 0.0 0.0
HINT_RSP 0 0.0 0.0 0.0 0.0
BATCH_REMOVE_RSP 0 0.0 0.0 0.0 0.0
PAXOS_COMMIT_REQ 0 0.0 0.0 0.0 0.0
SNAPSHOT_RSP 0 0.0 0.0 0.0 0.0
COUNTER_MUTATION_REQ 0 0.0 0.0 0.0 0.0
GOSSIP_DIGEST_SYN 0 943.1270000000001 1955.666 2816.159 2816.159
PAXOS_PREPARE_REQ 0 0.0 0.0 0.0 0.0
PREPARE_MSG 0 0.0 0.0 0.0 0.0
PAXOS_COMMIT_RSP 0 0.0 0.0 0.0 0.0
HINT_REQ 0 0.0 0.0 0.0 0.0
BATCH_REMOVE_REQ 0 0.0 0.0 0.0 0.0
STATUS_RSP 0 0.0 0.0 0.0 0.0
READ_REPAIR_RSP 0 0.0 0.0 0.0 0.0
GOSSIP_DIGEST_ACK2 0 654.9490000000001 3379.391 4055.2690000000002 4055.2690000000002
CLEANUP_MSG 0 0.0 0.0 0.0 0.0
REQUEST_RSP 0 0.0 0.0 0.0 0.0
TRUNCATE_RSP 0 0.0 0.0 0.0 0.0
UNUSED_CUSTOM_VERB 0 0.0 0.0 0.0 0.0
REPLICATION_DONE_RSP 0 0.0 0.0 0.0 0.0
SNAPSHOT_REQ 0 0.0 0.0 0.0 0.0
ECHO_REQ 0 0.0 0.0 0.0 0.0
PREPARE_CONSISTENT_REQ 0 0.0 0.0 0.0 0.0
FAILURE_RSP 0 0.0 0.0 0.0 0.0
BATCH_STORE_RSP 0 0.0 0.0 0.0 0.0
SCHEMA_PUSH_RSP 0 0.0 0.0 0.0 0.0
MUTATION_RSP 0 2816.159 10090.808 17436.917 89970.66
FINALIZE_PROPOSE_MSG 0 0.0 0.0 0.0 0.0
ECHO_RSP 0 0.0 0.0 0.0 0.0
INTERNAL_RSP 0 0.0 0.0 0.0 0.0
FAILED_SESSION_MSG 0 0.0 0.0 0.0 0.0
_TRACE 0 0.0 0.0 0.0 0.0
SCHEMA_VERSION_RSP 0 0.0 0.0 0.0 0.0
FINALIZE_COMMIT_MSG 0 0.0 0.0 0.0 0.0
SNAPSHOT_MSG 0 0.0 0.0 0.0 0.0
PREPARE_CONSISTENT_RSP 0 0.0 0.0 0.0 0.0
PAXOS_PROPOSE_REQ 0 0.0 0.0 0.0 0.0
PAXOS_PREPARE_RSP 0 0.0 0.0 0.0 0.0
MUTATION_REQ 0 2346.799 10090.808 17436.917 74975.55
READ_REQ 0 0.0 0.0 0.0 0.0
PING_RSP 0 0.0 0.0 0.0 0.0
RANGE_RSP 0 0.0 0.0 0.0 0.0
VALIDATION_REQ 0 0.0 0.0 0.0 0.0
SYNC_REQ 0 0.0 0.0 0.0 0.0
_TEST_1 0 0.0 0.0 0.0 0.0
GOSSIP_SHUTDOWN 0 0.0 0.0 0.0 0.0
TRUNCATE_REQ 0 0.0 0.0 0.0 0.0
_TEST_2 0 0.0 0.0 0.0 0.0
GOSSIP_DIGEST_ACK 0 785.939 2346.799 14530.764000000001 14530.764000000001
SCHEMA_PUSH_REQ 0 0.0 0.0 0.0 0.0
FINALIZE_PROMISE_MSG 0 0.0 0.0 0.0 0.0
BATCH_STORE_REQ 0 0.0 0.0 0.0 0.0
COUNTER_MUTATION_RSP 0 0.0 0.0 0.0 0.0
REPAIR_RSP 0 0.0 0.0 0.0 0.0
STATUS_REQ 0 0.0 0.0 0.0 0.0
SCHEMA_PULL_REQ 0 0.0 0.0 0.0 0.0
READ_REPAIR_REQ 0 0.0 0.0 0.0 0.0
REPLICATION_DONE_REQ 0 0.0 0.0 0.0 0.0
PAXOS_PROPOSE_RSP 0 0.0 0.0 0.0 0.0
The table created with:
create keyspace example with replication = { 'class' : 'SimpleStrategy', 'replication_factor' : 2 };
create table example.tweet(timeline text, id UUID, text text, PRIMARY KEY(id));
And the client code:
package main
import (
	"fmt"
	"strconv"
	"github.com/gocql/gocql"
	"time"
	"sync"
)
const (
	gophers = 50
	entries = 3000
)
func main() {
	var wg sync.WaitGroup
	start_time := time.Now().UnixNano()
	for i := 0; i < gophers; i++ {
		wg.Add(1)
		// spin up a gopher
		go gopher(i, &wg)
	}
	wg.Wait()
	end_time := time.Now().UnixNano()
	total_time := (end_time - start_time) / 1000000
	fmt.Println("total spent time: ", strconv.FormatInt(total_time, 10))
}
func gopher(thread_id int, wg *sync.WaitGroup) {
	defer wg.Done()
	cluster := gocql.NewCluster("10.30.1.1", "10.30.1.2", "10.30.1.3")
	cluster.ConnectTimeout = time.Second * 30
	cluster.DisableInitialHostLookup = true
	cluster.Timeout = 25 * time.Second
	cluster.Consistency = gocql.LocalQuorum
	cluster.Keyspace = "example"
	session, err := cluster.CreateSession()
	if err != nil {
		panic(err)
	}
	defer session.Close()
	stmt := session.Query("INSERT INTO tweet (timeline, id, text) VALUES (?, ?, ?)")
	fmt.Println("StartTime: ", time.Now())
	for i := 0; i < entries; i++ {
		_ = stmt.Bind("me", gocql.TimeUUID(), "Hello"+strconv.Itoa(i)).Exec()
	}
	fmt.Println("EndTime:", time.Now())
}
Can anyone suggest what else I should look at?

If you are running all 3 VMs on the same physical host, that would invalidate your test, because the 3 VMs are competing for the same physical resources.
For the test to be valid, you should host each VM on a separate physical host. Cheers!

Related

Calculate mutual information in columns

My teacher set me a sample exercise: reduce dimensionality by writing a function that uses sklearn's mutual information. I am not very experienced with it, and although I have tried many approaches, none of them gives a reliable answer and I cannot find the mistake.
The data consists of 19 columns that I obtained with one-hot encoding; I named the DataFrame dummy. Whenever I run the code it produces no output, neither an error nor a result.
First, I am not sure what value to use for the threshold.
Second, how do I call the mutual information function from sklearn and iterate over every pair of columns, so that I can drop one column from each highly correlated pair?
Address_A Address_B Address_C Address_D Address_E Address_F Address_G Address_H DoW_0 DoW_1 DoW_2 DoW_3 DoW_4 DoW_5 DoW_6 Month_1 Month_11 Month_12 Month_2
0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
2 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
3 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
4 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
252199 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
252200 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
252201 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
252202 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
252203 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
from sklearn.metrics import mutual_info_score
def reduce_dimentionality(dummy, threshold):
    df_cols = dummy[['Address_A','Address_B','Address_C','Address_D','Address_E','Address_F','Address_G','Address_H',
                     'DoW_0','DoW_1','DoW_2','DoW_3','DoW_4','DoW_5','DoW_6','Month_1','Month_11','Month_12','Month_2']]
    to_remove = []
    for col_ix, Address_A in enumerate(df_cols):
        for address_B in df_cols:
            calc_MI=sklearn.metrics.mutual_info_score
            mu_info = calc_MI(dummy['Address_A'],dummy['Address_B'], bins=20)
            if mu_info <1:
                d=to_remove.append(Address_A)
    new_data_frame = pd.DataFrame.drop(d)
    return new_data_frame
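One way to approach the pairwise comparison is sketched below in plain Python, under stated assumptions. The helper names mutual_information and columns_to_drop, the toy columns and the threshold of 0.5 are all hypothetical. The sketch computes the same quantity that sklearn.metrics.mutual_info_score(labels_true, labels_pred) returns, namely mutual information in nats; note that this sklearn function takes no bins argument. For binary (one-hot) columns the mutual information can never exceed ln 2 ≈ 0.693, so a test such as mu_info < 1 is true for every pair; the threshold has to be chosen below that bound.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (count / n) * math.log(count * n / (px[x] * py[y]))
    return mi

def columns_to_drop(columns, threshold):
    """Given a dict of column name -> values, return the names of columns to drop.

    For every pair whose mutual information exceeds the threshold,
    the second column of the pair is marked for removal.
    """
    dropped = set()
    for a, b in combinations(columns, 2):
        if a in dropped or b in dropped:
            continue
        if mutual_information(columns[a], columns[b]) > threshold:
            dropped.add(b)
    return sorted(dropped)

# Toy data: "Address_A_copy" is a hypothetical duplicate of "Address_A"
data = {
    "Address_A": [1, 0, 0, 1, 0, 1],
    "Address_A_copy": [1, 0, 0, 1, 0, 1],
    "DoW_0": [0, 1, 0, 1, 0, 1],
}
to_remove = columns_to_drop(data, threshold=0.5)
print(to_remove)
```

With a pandas DataFrame, the inner call could be replaced by mutual_info_score(dummy[a], dummy[b]) and the result applied with dummy.drop(columns=to_remove). The function also has to be called and its result printed or assigned; defining it alone produces no output.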

How to handle correctly sparse features to avoid poor performance of classification neural network?

I'm trying to understand how neural networks work with sparse data. I have a very sparse dataset of about 40k rows for two classes. The dataset looks like this:
RA0 RA1 RA2 RA3 RA4 RA5 RA6 RA7 RA8 RA9 RB0 RB1 RB2 RB3 RB4 RB5 RB6 RB7 RB8 RB9
50 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
51 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
52 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
53 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
54 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
55 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
56 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
57 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
58 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
59 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
60 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
61 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
62 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
63 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
As you can see, some rows contain only 0s. The columns named RA are the features of class 0 and the columns named RB are the features of class 1, so the same dataset with the actual labels looks like this:
RA0 RA1 RA2 RA3 RA4 RA5 RA6 RA7 RA8 RA9 ... RB1 RB2 RB3 RB4 RB5 RB6 RB7 RB8 RB9 label
50 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
51 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
52 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
53 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
54 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
55 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
56 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
57 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
58 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
59 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
60 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
I built a simple neural network model using Keras, but the model isn't learning and accuracy rarely goes beyond 52% on the training dataset. I tried two variations of the same model:
Variation 1:
def build_nn(n_features,lr = 0.001):
    _input = Input(shape = (n_features,),name = 'input',sparse = True)
    x = Dense(12,kernel_initializer = 'he_uniform',activation = 'relu')(_input)
    x = Dropout(0.5)(x)
    x = Dense(8,kernel_initializer = 'he_uniform',activation = 'relu')(x)
    x = Dropout(0.5)(x)
    x = Dense(2,kernel_initializer = 'he_uniform',activation = 'softmax')(x)
    nn = Model(inputs = [_input],outputs = [x])
    nn.compile(loss='sparse_categorical_crossentropy',optimizer=Adam(lr = lr),metrics=['accuracy'])
    return nn
Variation 2:
def build_nn(feature_layer,lr = 0.001):
    feature_inputs = {}
    for feature in feature_layer:
        feature_inputs[feature.key] = Input(shape = (1,),name = feature.key)
    feature_layer = tf.keras.layers.DenseFeatures(feature_layer)
    feature_inputs_n = feature_layer(feature_inputs)
    x = Dense(12,kernel_initializer = 'he_uniform',activation = 'relu')(feature_inputs_n)
    x = Dropout(0.5)(x)
    x = Dense(8,kernel_initializer = 'he_uniform',activation = 'relu')(x)
    x = Dropout(0.5)(x)
    x = Dense(2,kernel_initializer = 'he_uniform',activation = 'softmax')(x)
    nn = Model(inputs = [v for v in feature_inputs.values()],outputs = [x])
    nn.compile(loss='sparse_categorical_crossentropy',optimizer=Adam(lr = lr),metrics=['accuracy'])
    return nn
The motivation for variation 2 is that the features are sparse and I thought this could affect the model's performance, so I followed this tensorflow guide.
Also, the labels are converted to categorical labels using the to_categorical function provided by the Keras API:
y_train2 = to_categorical(y_train)
y_test2 = to_categorical(y_test)
My questions are:
Is my model wrong (especially variation 2), or am I representing the sparse features incorrectly? How should these features be handled?
RA and RB are the features of two different classes, and since there are rows full of 0s, should I add a third class representing an unknown class, or remove the rows that contain only 0s?
Since RA and RB map to two different classes, should I build two separate models, one for the RA columns and class 0 and the other for the RB columns and class 1?
I'm also posting an image of the train/test model's accuracy:
I can also provide any other part of the code if needed.
EDIT:
I didn't include this part because I felt it wasn't related to what I was asking, but it seems I was wrong.
Each feature is an individual branch from a sklearn decision tree. The class that the decision tree predicts is an up or down move for the next candle in a trading environment (a candle is a price aggregation of an instrument over time that has an open, low, high and close price). The idea is to take those branches, evaluate them on the price time series, and check whether each branch's condition is met; if the branch is active, the value is 1.
For example, branch RA0 at index 55 is active, so the value is 1. The labels are calculated as np.sign(close - open). So the idea is that using multiple branches can improve the classification of the label, with a neural network that sees which branches are active and learns how much weight each one should carry.

The use of sparse_categorical_crossentropy is wrong here; the "sparse" in sparse_categorical_crossentropy refers to the label representation, not to the features. Since you are using one-hot encoded labels:
y_train2 = to_categorical(y_train)
y_test2 = to_categorical(y_test)
and a final layer of 2 nodes with activation = 'softmax' (which I take to mean that you have only 2 classes), you should switch to loss='categorical_crossentropy', irrespective of the sparsity of your features.
Other general remarks:
Remove dropout, which should never be used by default. Dropout helps against overfitting once overfitting has been detected; used uncritically (even worse, with such high rates), it is well known to prevent training altogether (i.e. something very similar to what you report here).
Remove kernel_initializer = 'he_uniform' from all layers, thus leaving the default glorot_uniform one (useful hint: default values are there for a reason, and it is not advisable to play with them unless you have a specific reason to do so and you know exactly what you are doing).
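To illustrate the point about label representation, here is a minimal pure-Python sketch. It is not Keras itself: the three functions are hypothetical re-implementations, named after the Keras losses and after to_categorical only for readability, and the probabilities are made up. It shows that the two losses compute the same value and differ only in the label format they expect: an integer class index for the sparse variant, a one-hot vector for the categorical one.

```python
import math

def to_one_hot(class_index, num_classes):
    """One-hot encode an integer label, mimicking what to_categorical does for one sample."""
    return [1.0 if k == class_index else 0.0 for k in range(num_classes)]

def categorical_crossentropy(one_hot, probs):
    """Cross-entropy for a one-hot label vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def sparse_categorical_crossentropy(class_index, probs):
    """Cross-entropy for an integer class label."""
    return -math.log(probs[class_index])

probs = [0.3, 0.7]  # hypothetical softmax output for one sample
label = 1           # integer class index

loss_sparse = sparse_categorical_crossentropy(label, probs)
loss_onehot = categorical_crossentropy(to_one_hot(label, 2), probs)
print(loss_sparse, loss_onehot)
```

Both calls give the same loss, so either pairing is consistent: one-hot labels with categorical_crossentropy, or the original integer labels (without to_categorical) with sparse_categorical_crossentropy. Mixing the two is the problem.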

How to output CoordinateMatrix in tabular format?

I need to produce an output table of a subset of movielens rating data. I have converted my dataframe to a CoordinateMatrix:
from pyspark.mllib.linalg.distributed import MatrixEntry, CoordinateMatrix
mat = CoordinateMatrix(ratings.map(
lambda r: MatrixEntry(r.user, r.product, r.rating)))
However, I can't see how I can print the output in a tabular format. I can print the entries:
mat.entries.collect()
Which outputs:
[MatrixEntry(1, 1, 5.0),
MatrixEntry(5, 6, 2.0),
MatrixEntry(6, 1, 4.0),
MatrixEntry(7, 6, 4.0),
MatrixEntry(8, 1, 4.0),
MatrixEntry(8, 4, 3.0),
MatrixEntry(9, 1, 5.0)]
However, I'm looking to output:
1 2 3 4 5 6 7 8 9
------------------------------------- ...
1 | 5
2 |
3 |
4 |
5 | 2
...
Update
The pandas equivalent is pivot_table, e.g.
import pandas as pd
import numpy as np
import os
import requests
import zipfile
np.set_printoptions(precision=4)
filename = 'ml-1m.zip'
if not os.path.exists(filename):
    r = requests.get('http://files.grouplens.org/datasets/movielens/ml-1m.zip', stream=True)
    if r.status_code == 200:
        with open(filename, 'wb') as f:
            for chunk in r:
                f.write(chunk)
    else:
        # raising a plain string is not valid; raise an exception instance instead
        raise RuntimeError('Could not save dataset')
zip_ref = zipfile.ZipFile('ml-1m.zip', 'r')
zip_ref.extractall('.')
zip_ref.close()
ratingsNames = ["userId", "movieId", "rating", "timestamp"]
ratings = pd.read_table("./ml-1m/ratings.dat", header=None, sep="::", names=ratingsNames, engine='python')
ratingsMatrix = ratings.pivot_table(columns=['movieId'], index =['userId'], values='rating', dropna = False)
ratingsMatrix = ratingsMatrix.fillna(0)
# we don't have space to print the full matrix, just show the first few cells
# (.ix was removed in pandas 1.0; .loc selects the same labels)
print(ratingsMatrix.loc[:9, :9])
Which outputs:
movieId 1 2 3 4 5 6 7 8 9
userId
1 5.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0
6 4.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0 4.0 0.0 0.0 0.0
8 4.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0
9 5.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
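For a matrix small enough to collect to the driver, one possible approach is to format the collected entries locally. The sketch below is plain Python and makes several assumptions: format_table is a hypothetical helper, indices are 1-based, and the entries are given as (row, column, value) tuples, which could be built with [(e.i, e.j, e.value) for e in mat.entries.collect()] since MatrixEntry exposes i, j and value.

```python
def format_table(entries, n_rows, n_cols, width=4):
    """Render (row, col, value) triples as a text table with blank cells for missing entries."""
    cells = {(i, j): v for i, j, v in entries}
    header = " " * width + "|" + "".join(f"{j:>{width}}" for j in range(1, n_cols + 1))
    lines = [header, "-" * len(header)]
    for i in range(1, n_rows + 1):
        row = "".join(
            f"{cells[(i, j)]:>{width}.1f}" if (i, j) in cells else " " * width
            for j in range(1, n_cols + 1)
        )
        lines.append(f"{i:>{width}}|{row}")
    return "\n".join(lines)

# The entries printed by mat.entries.collect() above, as plain tuples
entries = [(1, 1, 5.0), (5, 6, 2.0), (6, 1, 4.0), (7, 6, 4.0),
           (8, 1, 4.0), (8, 4, 3.0), (9, 1, 5.0)]
table = format_table(entries, n_rows=9, n_cols=9)
print(table)
```

This only works when the requested slice fits in driver memory; for the full rating matrix a distributed pivot would be needed instead.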

What is `MAIN` ? (ghc profiling)

I built a big old project, Pugs, with GHC 7.10.1 using stack build (I wrote my own stack.yaml). Then I ran stack build --library-profiling --executable-profiling followed by .stack-work/install/x86_64-osx/nightly-2015-06-26/7.10.1/bin/pugs -e 'my $i=0; for (1..100_000) { $i++ }; say $i' +RTS -pa, which produced the following pugs.prof file.
Fri Jul 10 00:10 2015 Time and Allocation Profiling Report (Final)
pugs +RTS -P -RTS -e my $i=0; for (1..10_000) { $i++ }; say $i
total time = 0.60 secs (604 ticks @ 1000 us, 1 processor)
total alloc = 426,495,472 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc ticks bytes
MAIN MAIN 92.2 90.6 557 386532168
CAF Pugs.Run 2.8 5.2 17 22191000
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc ticks bytes
MAIN MAIN 287 0 92.2 90.6 100.0 100.0 557 386532168
listAssocOp Pugs.Parser.Operator 841 24 0.0 0.0 0.0 0.0 0 768
nassocOp Pugs.Parser.Operator 840 24 0.0 0.0 0.0 0.0 0 768
lassocOp Pugs.Parser.Operator 839 24 0.0 0.0 0.0 0.0 0 768
rassocOp Pugs.Parser.Operator 838 24 0.0 0.0 0.0 0.0 0 768
postfixOp Pugs.Parser.Operator 837 24 0.0 0.0 0.0 0.0 0 768
termOp Pugs.Parser.Operator 824 24 0.0 0.5 0.7 1.2 0 2062768
insert Data.HashTable.ST.Basic 874 1 0.0 0.0 0.0 0.0 0 152
checkOverflow Data.HashTable.ST.Basic 890 1 0.0 0.0 0.0 0.0 0 80
readDelLoad Data.HashTable.ST.Basic 893 0 0.0 0.0 0.0 0.0 0 184
writeLoad Data.HashTable.ST.Basic 892 0 0.0 0.0 0.0 0.0 0 224
readLoad Data.HashTable.ST.Basic 891 0 0.0 0.0 0.0 0.0 0 184
_values Data.HashTable.ST.Basic 889 1 0.0 0.0 0.0 0.0 0 0
_keys Data.HashTable.ST.Basic 888 1 0.0 0.0 0.0 0.0 0 0
.. snip ..
MAIN accounts for 92.2% of the time, but I don't know what MAIN means. What does the MAIN label mean?

I was in the same spot a few days ago. What I deduced is that MAIN covers expressions without cost-centre annotations. Its share shrinks significantly if you add "-fprof-auto" and "-caf-all" (the latter is spelled "-fprof-cafs" in current GHC). Those options will also let you find a lot of interesting things happening in your code.

Haskell small CPU leak

I’m experiencing small CPU leaks using GHC 7.8.3 and Yesod 1.4.9.
When I run my site under time and stop it (Ctrl+C) after 1 minute of doing nothing (just running, no requests at all), it has consumed about 1 second of CPU time, which is approximately 1.7% of one CPU.
$ time mysite
^C
real 1m0.226s
user 0m1.024s
sys 0m0.060s
If I disable the idle garbage collector, it drops to 0.35 second (0.6% of CPU). Though it’s better, it still consumes CPU without doing anything.
$ time mysite +RTS -I0 # Disable idle GC
^C
real 1m0.519s
user 0m0.352s
sys 0m0.064s
$ time mysite +RTS -I0
^C
real 4m0.676s
user 0m0.888s
sys 0m0.468s
$ time mysite +RTS -I0
^C
real 7m28.282s
user 0m1.452s
sys 0m0.976s
Compared to a cat command waiting indefinitely for something on the standard input:
$ time cat
^C
real 1m1.349s
user 0m0.000s
sys 0m0.000s
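The percentages quoted above are user CPU time divided by elapsed wall-clock time. A minimal Python sketch of that arithmetic, using the figures from the time outputs above (the helper name cpu_percent is only illustrative):

```python
def cpu_percent(user_seconds, real_seconds):
    """User CPU time as a percentage of elapsed wall-clock time."""
    return 100.0 * user_seconds / real_seconds

# (user, real) in seconds, copied from the `time` outputs above
default_rts = cpu_percent(1.024, 60.226)        # idle GC enabled
idle_gc_off = cpu_percent(0.352, 60.519)        # +RTS -I0, 1 minute
long_run = cpu_percent(1.452, 7 * 60 + 28.282)  # +RTS -I0, 7m28s run

print(f"{default_rts:.2f}% {idle_gc_off:.2f}% {long_run:.2f}%")
```

This gives about 1.7%, 0.6% and 0.3% respectively; the idle cat process uses no measurable CPU time at all.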
Is there anything else in Haskell that consumes CPU in the background?
Is it a leak from Yesod?
Or is it something that I have done in my program? (I have only added handler functions; I don't do any parallel computation.)
Edit 2015-05-31 19:25
Here’s the execution with the -s flag:
$ time mysite +RTS -I0 -s
^C 23,138,184 bytes allocated in the heap
4,422,096 bytes copied during GC
2,319,960 bytes maximum residency (4 sample(s))
210,584 bytes maximum slop
6 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 30 colls, 0 par 0.00s 0.00s 0.0001s 0.0003s
Gen 1 4 colls, 0 par 0.03s 0.04s 0.0103s 0.0211s
TASKS: 5 (1 bound, 4 peak workers (4 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.86s (224.38s elapsed)
GC time 0.03s ( 0.05s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.90s (224.43s elapsed)
Alloc rate 26,778,662 bytes per MUT second
Productivity 96.9% of total user, 0.4% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
real 3m44.447s
user 0m0.896s
sys 0m0.320s
And with profiling:
$ time mysite +RTS -I0
^C 23,024,424 bytes allocated in the heap
19,367,640 bytes copied during GC
2,319,960 bytes maximum residency (94 sample(s))
211,312 bytes maximum slop
6 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 27 colls, 0 par 0.00s 0.00s 0.0002s 0.0005s
Gen 1 94 colls, 0 par 1.09s 1.04s 0.0111s 0.0218s
TASKS: 5 (1 bound, 4 peak workers (4 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.00s (201.66s elapsed)
GC time 1.07s ( 1.03s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.02s ( 0.02s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 2.09s (202.68s elapsed)
Alloc rate 23,115,591 bytes per MUT second
Productivity 47.7% of total user, 0.5% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
real 3m22.697s
user 0m2.088s
sys 0m0.060s
mysite.prof:
Sun May 31 19:16 2015 Time and Allocation Profiling Report (Final)
mysite +RTS -N -p -s -h -i0.1 -I0 -RTS
total time = 0.05 secs (49 ticks @ 1000 us, 1 processor)
total alloc = 17,590,528 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
MAIN MAIN 98.0 93.7
acquireSeedSystem.\.\ System.Random.MWC 2.0 0.0
toByteString Data.Serialize.Builder 0.0 3.9
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 5684 0 98.0 93.7 100.0 100.0
createSystemRandom System.Random.MWC 11396 0 0.0 0.0 2.0 0.3
withSystemRandom System.Random.MWC 11397 0 0.0 0.1 2.0 0.3
acquireSeedSystem System.Random.MWC 11399 0 0.0 0.0 2.0 0.2
acquireSeedSystem.\ System.Random.MWC 11401 1 0.0 0.2 2.0 0.2
acquireSeedSystem.\.\ System.Random.MWC 11403 1 2.0 0.0 2.0 0.0
sndS Data.Serialize.Put 11386 21 0.0 0.0 0.0 0.0
put Data.Serialize 11384 21 0.0 0.0 0.0 0.0
unPut Data.Serialize.Put 11383 21 0.0 0.0 0.0 0.0
toByteString Data.Serialize.Builder 11378 21 0.0 3.9 0.0 4.0
flush.\ Data.Serialize.Builder 11393 21 0.0 0.0 0.0 0.0
withSize Data.Serialize.Builder 11388 0 0.0 0.0 0.0 0.0
withSize.\ Data.Serialize.Builder 11389 21 0.0 0.0 0.0 0.0
runBuilder Data.Serialize.Builder 11390 21 0.0 0.0 0.0 0.0
runBuilder Data.Serialize.Builder 11382 21 0.0 0.0 0.0 0.0
unstream/resize Data.Text.Internal.Fusion 11372 174 0.0 0.1 0.0 0.1
CAF GHC.IO.Encoding 11322 0 0.0 0.0 0.0 0.0
CAF GHC.IO.FD 11319 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 11318 0 0.0 0.2 0.0 0.2
CAF GHC.Event.Thread 11304 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 11292 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 11288 0 0.0 0.0 0.0 0.0
CAF GHC.TopHandler 11284 0 0.0 0.0 0.0 0.0
CAF GHC.Event.Control 11271 0 0.0 0.0 0.0 0.0
CAF Main 11263 0 0.0 0.0 0.0 0.0
main Main 11368 1 0.0 0.0 0.0 0.0
CAF Application 11262 0 0.0 0.0 0.0 0.0
CAF Foundation 11261 0 0.0 0.0 0.0 0.0
CAF Model 11260 0 0.0 0.1 0.0 0.3
unstream/resize Data.Text.Internal.Fusion 11375 35 0.0 0.1 0.0 0.1
CAF Settings 11259 0 0.0 0.1 0.0 0.2
unstream/resize Data.Text.Internal.Fusion 11370 20 0.0 0.1 0.0 0.1
CAF Database.Persist.Postgresql 6229 0 0.0 0.3 0.0 0.9
unstream/resize Data.Text.Internal.Fusion 11373 93 0.0 0.6 0.0 0.6
CAF Database.PostgreSQL.Simple.Transaction 6224 0 0.0 0.0 0.0 0.0
CAF Database.PostgreSQL.Simple.TypeInfo.Static 6222 0 0.0 0.0 0.0 0.0
CAF Database.PostgreSQL.Simple.Internal 6219 0 0.0 0.0 0.0 0.0
CAF Yesod.Static 6210 0 0.0 0.0 0.0 0.0
CAF Crypto.Hash.Conduit 6193 0 0.0 0.0 0.0 0.0
CAF Yesod.Default.Config2 6192 0 0.0 0.0 0.0 0.0
unstream/resize Data.Text.Internal.Fusion 11371 1 0.0 0.0 0.0 0.0
CAF Yesod.Core.Internal.Util 6154 0 0.0 0.0 0.0 0.0
CAF Text.Libyaml 6121 0 0.0 0.0 0.0 0.0
CAF Data.Yaml 6120 0 0.0 0.0 0.0 0.0
CAF Data.Yaml.Internal 6119 0 0.0 0.0 0.0 0.0
unstream/resize Data.Text.Internal.Fusion 11369 1 0.0 0.0 0.0 0.0
CAF Database.Persist.Quasi 6055 0 0.0 0.0 0.0 0.0
unstream/resize Data.Text.Internal.Fusion 11376 1 0.0 0.0 0.0 0.0
CAF Database.Persist.Sql.Internal 6046 0 0.0 0.0 0.0 0.0
unstream/resize Data.Text.Internal.Fusion 11377 6 0.0 0.0 0.0 0.0
CAF Data.Pool 6036 0 0.0 0.0 0.0 0.0
CAF Network.HTTP.Client.TLS 6014 0 0.0 0.0 0.0 0.0
CAF System.X509.Unix 6010 0 0.0 0.0 0.0 0.0
CAF Crypto.Hash.MD5 5927 0 0.0 0.0 0.0 0.0
CAF Data.Serialize 5873 0 0.0 0.0 0.0 0.0
put Data.Serialize 11385 1 0.0 0.0 0.0 0.0
CAF Data.Serialize.Put 5872 0 0.0 0.0 0.0 0.0
withSize Data.Serialize.Builder 11387 1 0.0 0.0 0.0 0.0
CAF Data.Serialize.Builder 5870 0 0.0 0.0 0.0 0.0
flush Data.Serialize.Builder 11392 1 0.0 0.0 0.0 0.0
toByteString Data.Serialize.Builder 11391 0 0.0 0.0 0.0 0.0
defaultSize Data.Serialize.Builder 11379 1 0.0 0.0 0.0 0.0
defaultSize.overhead Data.Serialize.Builder 11381 1 0.0 0.0 0.0 0.0
defaultSize.k Data.Serialize.Builder 11380 1 0.0 0.0 0.0 0.0
CAF Crypto.Random.Entropy.Unix 5866 0 0.0 0.0 0.0 0.0
CAF Network.HTTP.Client.Manager 5861 0 0.0 0.0 0.0 0.0
unstream/resize Data.Text.Internal.Fusion 11374 3 0.0 0.0 0.0 0.0
CAF System.Random.MWC 5842 0 0.0 0.0 0.0 0.0
coff System.Random.MWC 11405 1 0.0 0.0 0.0 0.0
ioff System.Random.MWC 11404 1 0.0 0.0 0.0 0.0
acquireSeedSystem System.Random.MWC 11398 1 0.0 0.0 0.0 0.0
acquireSeedSystem.random System.Random.MWC 11402 1 0.0 0.0 0.0 0.0
acquireSeedSystem.nbytes System.Random.MWC 11400 1 0.0 0.0 0.0 0.0
createSystemRandom System.Random.MWC 11394 1 0.0 0.0 0.0 0.0
withSystemRandom System.Random.MWC 11395 1 0.0 0.0 0.0 0.0
CAF Data.Streaming.Network.Internal 5833 0 0.0 0.0 0.0 0.0
CAF Data.Scientific 5728 0 0.0 0.1 0.0 0.1
CAF Data.Text.Array 5722 0 0.0 0.0 0.0 0.0
CAF Data.Text.Internal 5718 0 0.0 0.0 0.0 0.0
Edit 2015-06-01 08:40
You can browse source code at the following repository → https://github.com/Zigazou/Ouep

I found a related bug in the Yesod bug tracker and ran my program like this:
myserver +RTS -I0 -RTS Development
Idle CPU usage is now down to almost nothing, compared to 14% or so before (on an ARM computer). The -I0 option (that's a capital I followed by a zero) turns off the idle garbage collection, whose interval defaults to 0.3 seconds, I think. I'm not sure about the implications for app responsiveness or memory usage, but for me at least this was definitely the culprit.
