Tablets are not splitting - accumulo

I am ingesting a large amount of data into my Accumulo database.
My tablet split threshold is 4G.
While my program runs I can see the tablets filling up, and when any of them grows larger than 4G a new tablet is created - but it always stays empty.
Eventually I see that there are >200 tablets (the initial number was 30), but all of the newly created ones are empty - verified with the following command:
hadoop fs -du -h /apps/accumulo/data/tables/3/
the results:
16.9 G /apps/accumulo/data/tables/3/default_tablet
16.4 G /apps/accumulo/data/tables/3/t-0000cr6
16.6 G /apps/accumulo/data/tables/3/t-0000cr7
16.3 G /apps/accumulo/data/tables/3/t-0000cr8
17.3 G /apps/accumulo/data/tables/3/t-0000cr9
17.2 G /apps/accumulo/data/tables/3/t-0000cra
18.4 G /apps/accumulo/data/tables/3/t-0000crb
16.9 G /apps/accumulo/data/tables/3/t-0000crc
16.5 G /apps/accumulo/data/tables/3/t-0000crd
17.4 G /apps/accumulo/data/tables/3/t-0000cre
16.4 G /apps/accumulo/data/tables/3/t-0000crf
16.5 G /apps/accumulo/data/tables/3/t-0000crg
16.3 G /apps/accumulo/data/tables/3/t-0000crh
17.6 G /apps/accumulo/data/tables/3/t-0000cri
16.9 G /apps/accumulo/data/tables/3/t-0000crj
16.8 G /apps/accumulo/data/tables/3/t-0000crk
17.1 G /apps/accumulo/data/tables/3/t-0000crl
17.4 G /apps/accumulo/data/tables/3/t-0000crm
17.2 G /apps/accumulo/data/tables/3/t-0000crn
17.1 G /apps/accumulo/data/tables/3/t-0000cro
17.4 G /apps/accumulo/data/tables/3/t-0000crp
19.8 G /apps/accumulo/data/tables/3/t-0000crq
17.0 G /apps/accumulo/data/tables/3/t-0000crr
16.6 G /apps/accumulo/data/tables/3/t-0000crs
16.7 G /apps/accumulo/data/tables/3/t-0000crt
16.7 G /apps/accumulo/data/tables/3/t-0000cru
17.7 G /apps/accumulo/data/tables/3/t-0000crv
16.7 G /apps/accumulo/data/tables/3/t-0000crw
16.7 G /apps/accumulo/data/tables/3/t-0000crx
16.2 G /apps/accumulo/data/tables/3/t-0000cry
0 /apps/accumulo/data/tables/3/t-000109c
0 /apps/accumulo/data/tables/3/t-000118l
0 /apps/accumulo/data/tables/3/t-00011bv
0 /apps/accumulo/data/tables/3/t-00011cs
0 /apps/accumulo/data/tables/3/t-00011nx
0 /apps/accumulo/data/tables/3/t-0001212
0 /apps/accumulo/data/tables/3/t-0001238
0 /apps/accumulo/data/tables/3/t-00012a3
0 /apps/accumulo/data/tables/3/t-00012gn
0 /apps/accumulo/data/tables/3/t-00012ku
0 /apps/accumulo/data/tables/3/t-00012nf
All the rest of the tablets are empty too.
This doesn't make sense to me and I am afraid it slows down the ingestion rate. Is this a known issue? Why aren't the tablets splitting as expected?

Accumulo tablets can refer to files that are outside of their directory in HDFS (as opposed to HBase in this regard). You can verify this by looking at the contents of the accumulo.metadata table, if you are brave :)
Compact the table and then re-run your check over the contents of HDFS. After the compactions finish, each tablet will refer only to files in its own directory.
The other (albeit unlikely) explanation would be that your data is so skewed that it only resides in the upper or lower half of a tablet's "key space", and thus only one daughter of each split contains data.
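For what it's worth, here is a rough sketch of those steps. In the Accumulo shell (mytable is a placeholder for your table name; the scan just shows which files each tablet currently references):
scan -t accumulo.metadata -c file
compact -t mytable -w
The -w flag makes compact wait until the compaction has finished. Then re-check HDFS usage as before:
hadoop fs -du -h /apps/accumulo/data/tables/3/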

Related

My matrix multiplication program takes quadruple time when thread count doubles

I wrote this simple program that multiplies matrices. I can specify how
many OS threads are used to run it with the environment variable
OMP_NUM_THREADS. It slows down a lot when the thread count gets
larger than the number of hardware threads my CPU has.
Here's the program.
static double a[DIMENSION][DIMENSION], b[DIMENSION][DIMENSION],
              c[DIMENSION][DIMENSION];

#pragma omp parallel for schedule(static)
for (unsigned i = 0; i < DIMENSION; i++)
    for (unsigned j = 0; j < DIMENSION; j++)
        for (unsigned k = 0; k < DIMENSION; k++)
            c[i][k] += a[i][j] * b[j][k];
My CPU is i7-8750H. It has 12 threads. When the matrices are large
enough, the program is fastest on around 11 threads. It is 4 times as
slow when the thread count reaches 17. Then run time stays about the
same as I increase the thread count.
Here are the results. The top row is DIMENSION. The left column is the
thread count. Times are in seconds. The column with * is when
compiled with -fno-loop-unroll-and-jam.
1024 2048 4096 4096* 8192
1 0.2473 3.39 33.80 35.94 272.39
2 0.1253 2.22 18.35 18.88 141.23
3 0.0891 1.50 12.64 13.41 100.31
4 0.0733 1.13 10.34 10.70 82.73
5 0.0641 0.95 8.20 8.90 62.57
6 0.0581 0.81 6.97 8.05 53.73
7 0.0497 0.70 6.11 7.03 95.39
8 0.0426 0.63 5.28 6.79 81.27
9 0.0390 0.56 4.67 6.10 77.27
10 0.0368 0.52 4.49 5.13 55.49
11 0.0389 0.48 4.40 4.70 60.63
12 0.0406 0.49 6.25 5.94 68.75
13 0.0504 0.63 6.81 8.06 114.53
14 0.0521 0.63 9.17 10.89 170.46
15 0.0505 0.68 11.46 14.08 230.30
16 0.0488 0.70 13.03 20.06 241.15
17 0.0469 0.75 20.67 20.97 245.84
18 0.0462 0.79 21.82 22.86 247.29
19 0.0465 0.68 24.04 22.91 249.92
20 0.0467 0.74 23.65 23.34 247.39
21 0.0458 1.01 22.93 24.93 248.62
22 0.0453 0.80 23.11 25.71 251.22
23 0.0451 1.16 20.24 25.35 255.27
24 0.0443 1.16 25.58 26.32 253.47
25 0.0463 1.05 26.04 25.74 255.05
26 0.0470 1.31 27.76 26.87 253.86
27 0.0461 1.52 28.69 26.74 256.55
28 0.0454 1.15 28.47 26.75 256.23
29 0.0456 1.27 27.05 26.52 256.95
30 0.0452 1.46 28.86 26.45 258.95
Code inside the loop compiles to this on gcc 9.3.1 with
-O3 -march=native -fopenmp. rax starts from 0 and increases by 64
each iteration. rdx points to c[i]. rsi points to b[j]. rdi
points to b[j+1].
vmovapd (%rsi,%rax), %ymm1
vmovapd 32(%rsi,%rax), %ymm0
vfmadd213pd (%rdx,%rax), %ymm3, %ymm1
vfmadd213pd 32(%rdx,%rax), %ymm3, %ymm0
vfmadd231pd (%rdi,%rax), %ymm2, %ymm1
vfmadd231pd 32(%rdi,%rax), %ymm2, %ymm0
vmovapd %ymm1, (%rdx,%rax)
vmovapd %ymm0, 32(%rdx,%rax)
I wonder why the run time increases so much when the thread count
increases.
My estimate says this shouldn't be the case when DIMENSION is 4096.
Here is what I thought before I remembered that the compiler does 2 j
iterations at a time. Each iteration of the j loop needs rows c[i] and b[j].
They are 64KB in total. My CPU has a 32KB L1 data cache and a 256KB L2
cache per 2 threads. The four rows the two hardware threads are working
with don't fit in L1 but fit in L2. So when j advances, c[i] is
read from L2. When the program is run on 24 OS threads, the number of
involuntary context switches is around 29371. Each thread gets
interrupted before it has a chance to finish one iteration of the j
loop. Since 8 matrix rows can fit in the L2 cache, the other software
thread's 2 rows are probably still in L2 when it resumes. So the
execution time shouldn't be much different from the 12 thread case.
However, measurements say it's 4 times as slow.
Now I have realized that 2 j iterations are done at a time. This way each
pass of the j loop works on 96KB of memory, so 4 of them can't fit in the
256KB L2 cache. To verify this is what slows the program down, I
compiled the program with -fno-loop-unroll-and-jam. I got
vmovapd ymm0, YMMWORD PTR [rcx+rax]
vfmadd213pd ymm0, ymm1, YMMWORD PTR [rdx+rax]
vmovapd YMMWORD PTR [rdx+rax], ymm0
The results are in the table (the 4096* column). They look much like the
results when 2 rows are done at a time, which makes me wonder even more. When DIMENSION is 4096, 4
software threads' 8 rows fit in the L2 cache when each thread works on 1
row at a time, but 12 rows don't fit in the L2 cache when each thread
works on 2 rows at a time. Why are the run times similar?
I thought maybe it's because the CPU warmed up when running with fewer
threads and has to slow down. I ran the tests multiple times, both in
the order of increasing thread count and decreasing thread count. They
yield similar results. And dmesg doesn't contain anything related to
thermal throttling or clock changes.
I tried separately changing 4096 columns to 4104 columns and setting
OMP_PROC_BIND=true OMP_PLACES=cores, and the results are similar.
This problem seems to come from either the CPU caches (due to the bad memory locality) or the OS scheduler (due to more threads than the hardware can simultaneously execute).
I cannot exactly reproduce the same effect on my i5-9600KF processor (with 6 cores and 6 threads) and with a matrix of size 4096x4096. However, similar effects occur.
Here are performance results (with GCC 9.3 using -O3 -march=native -fopenmp on Linux 5.6):
#threads | time (in seconds)
----------------------------
1 | 16.726885
2 | 9.062372
3 | 6.397651
4 | 5.494580
5 | 4.054391
6 | 5.724844 <-- maximum number of hardware threads
7 | 6.113844
8 | 7.351382
9 | 8.992128
10 | 10.789389
11 | 10.993626
12 | 11.099117
24 | 11.283873
48 | 11.412288
We can see that the computation time starts to grow significantly between 5 and 12 threads.
This problem is due to a lot more data being fetched from RAM. Indeed, 161.6 GiB are loaded from memory with 6 threads while 424.7 GiB are loaded with 12 threads! In both cases, 3.3 GiB are written to RAM. Because my memory throughput is roughly 40 GiB/s, the RAM accesses represent more than 96% of the overall execution time with 12 threads!
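As a rough sanity check of that figure using the numbers above: 424.7 GiB / (40 GiB/s) ≈ 10.6 s spent on RAM traffic, and 10.6 s out of the 11.1 s measured with 12 threads is about 96%.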
If we dig deeper, we can see that the number of L1 cache references and L1 cache misses are the same regardless of the number of threads used. Meanwhile, there are a lot more L3 cache misses (as well as more references). Here are the L3-cache statistics:
With 6 threads: 4.4 G loads
1.1 G load-misses (25% of all LL-cache hits)
With 12 threads: 6.1 G loads
4.5 G load-misses (74% of all LL-cache hits)
This means that the locality of the memory accesses is clearly worse with more threads. I guess this is because the compiler is not clever enough to do high-level cache-aware optimizations that could reduce RAM pressure (especially when the number of threads is high). You have to do the tiling yourself in order to improve the memory locality. You can find a good guide here.
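To make that concrete, here is a minimal sketch of a blocked (tiled) version of the same triple loop; the function name matmul_tiled and the block size BS are illustrative choices (BS must divide DIMENSION here, and the best value depends on your cache sizes):

#define DIMENSION 4096
#define BS 64   /* 3 tiles of 64x64 doubles is ~96KB per thread, sized with L2 in mind */

static double a[DIMENSION][DIMENSION], b[DIMENSION][DIMENSION],
              c[DIMENSION][DIMENSION];

void matmul_tiled(void)
{
    /* threads receive disjoint ranges of ii, so updates to c do not race */
    #pragma omp parallel for schedule(static)
    for (unsigned ii = 0; ii < DIMENSION; ii += BS)
        for (unsigned jj = 0; jj < DIMENSION; jj += BS)
            for (unsigned kk = 0; kk < DIMENSION; kk += BS)
                /* multiply one BS x BS tile so a, b and c stay cache-resident */
                for (unsigned i = ii; i < ii + BS; i++)
                    for (unsigned j = jj; j < jj + BS; j++)
                        for (unsigned k = kk; k < kk + BS; k++)
                            c[i][k] += a[i][j] * b[j][k];
}

Compiled with the same -O3 -march=native -fopenmp flags as above, this keeps each thread's working set small regardless of how the compiler unrolls the inner loop.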
Finally, note that using more threads than the hardware can simultaneously execute is generally not efficient. One problem is that the OS scheduler often places threads on cores poorly and frequently moves them. The usual way to fix that is to bind software threads to hardware threads using OMP_PROC_BIND=TRUE and to set the OMP_PLACES environment variable. Another problem is that the threads are executed using preemptive multitasking with shared resources (e.g. caches).
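As a sketch of that binding (the binary name ./matmul is just a placeholder), pinning one thread per core looks like:
OMP_NUM_THREADS=6 OMP_PROC_BIND=true OMP_PLACES=cores ./matmul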
PS: please note that BLAS libraries (e.g. OpenBLAS, BLIS, Intel MKL, etc.) are much more optimized than this code, as they already include clever optimizations such as manual vectorization for the target hardware, loop unrolling, multithreading, tiling and fast matrix transposition when needed. For a 4096x4096 matrix, they are about 10 times faster.

Pandas summing rows grouped by another column

I have attached a dataset:
Time podId Batt (avg) Temp (avg)
0 2019-10-07 9999 6.1 71.271053
1 2019-10-08 9999 6.0 71.208285
2 2019-10-09 9999 5.9 77.896628
3 2019-10-10 9999 5.8 78.709279
4 2019-10-11 9999 5.7 71.849283
59 2019-12-05 8888 5.5 76.548780
60 2019-12-06 8888 5.4 73.975295
61 2019-12-07 8888 5.3 76.209434
62 2019-12-08 8888 5.2 76.717481
63 2019-12-09 8888 5.1 70.433920
I imported it using batt2 = pd.read_csv('battV2.csv').
I need to determine when a battery change occurs, i.e. when Batt (avg) increases from the previous row. I am able to do this by using 'diff' in this manner: batt2['Vdiff']=batt2['Batt (avg)'].diff(-1)
Now for each podId I need to sum the Vdiff column between battery changes, i.e. between two negative Vdiff values.
Also I need to average Temp (avg) over the same range.
And count Time to determine the number of days between battery changes.
Thanks.
There are a couple of steps involved:
Import data
Be aware that I have changed your dataset a bit to provide a valid test case for your requirements (in your given dataset, Batt_avg never increases).
from io import StringIO
import pandas as pd
data = StringIO('''Time podId Batt_avg Temp_avg
0 2019-10-07 9999 6.1 71.271053
1 2019-10-08 9999 6.0 71.208285
2 2019-10-09 9999 5.9 77.896628
3 2019-10-10 9999 5.8 78.709279
4 2019-10-11 9999 5.7 71.849283
5 2019-10-12 9999 6.0 71.208285
6 2019-10-13 9999 5.9 77.896628
7 2019-10-14 9999 5.8 78.709279
8 2019-10-15 9999 5.7 71.849283
59 2019-12-05 8888 5.5 76.548780
60 2019-12-06 8888 5.4 73.975295
61 2019-12-07 8888 5.3 76.209434
62 2019-12-08 8888 5.2 76.717481
63 2019-12-09 8888 5.1 70.433920''')
df = pd.read_csv(data, delim_whitespace=True)
Determine changes in battery voltage
As you have already found out, you can do this with diff(). I am not certain that the code you have given with df.Batt_avg.diff(-1) satisfies your requirement of: "i.e. when Batt (avg) increases from previous row". Instead, for a given row, this shows how the value will change in the next row (multiplied by -1). If you need the negative change to the previous row, you can instead use -df.Batt_avg.diff().
df['Batt_avg_diff'] = df.Batt_avg.diff(-1)
Group data and apply the aggregation functions
You can express your grouping conditions as df.podId.diff().fillna(0.0) != 0 for the podIds and df.Batt_avg_diff.fillna(0.0) < 0 for the condition "between battery changes, i.e. between two negative Vdiff values" - either of these will trigger a new group. Use cumsum() on the triggers to create the groups. Then you can use groupby() to act on these groups and transform() to expand the results to the dimensions of the original dataframe.
df['group'] = ((df.podId.diff().fillna(0.0) != 0) | (df.Batt_avg_diff.fillna(0.0) < 0)).cumsum()
df['Batt_avg_diff_sum'] = df.Batt_avg_diff.groupby(df.group).transform('sum')
df['Temp_avg_mean'] = df.Temp_avg.groupby(df.group).transform('mean')
Datetime calculations
For the final step, you need to first convert the string to datetime to allow date operations. Then you can use groupby operations to get the max and min in each group, and take the delta.
df.Time = pd.to_datetime(df.Time)
df['Time_days'] = df.Time.groupby(df.group).transform('max') - df.Time.groupby(df.group).transform('min')
Note: if you do not need or want the aggregate data in the original dataframe, just apply the functions directly (without transform):
df_group = pd.DataFrame()
df_group['Batt_avg_diff_sum'] = df.Batt_avg_diff.groupby(df.group).sum()
df_group['Temp_avg_mean'] = df.Temp_avg.groupby(df.group).mean()
df_group['Time_days'] = df.Time.groupby(df.group).max() - df.Time.groupby(df.group).min()

PM3d and Impulses combined not scaling

I am new to gnuplot, but I think I have all the basics. I am trying to plot a 3d surface with some impulses. When I do each splot individually, they look great, but when I splot them together, the scale gets all messed up. Any thoughts? Autoscale is set in all cases.
1st splot:
splot "C:/data/file1.dat" matrix rowheaders columnheaders with pm3d
2nd splot:
splot "C:/Data/file2.dat" with impulses, "C:/Data/file2.dat" with points pt 7
Combined:
splot "C:/data/file1.dat" matrix rowheaders columnheaders with pm3d, \
"C:/Data/file2.dat" with impulses, \
"C:/Data/file2.dat" with points pt 7
See how the scale gets all messed up, and the first chart gets scrunched down to one corner? Both data sets have roughly the same ranges in data.
file1.dat
6 8 10 12 16 20 24
30 3.513999939 4.515999794 5.293000221 5.894999981 6.633999825 6.870999813 6.901000023
35 4.235000134 5.330999851 6.169000149 6.72300005 7.196000099 7.374000072 7.434000015
40 4.818999767 5.940999985 6.776000023 7.171000004 7.558000088 7.722000122 7.802999973
45 5.291999817 6.453999996 7.136000156 7.480999947 7.831999779 7.997000217 8.092000008
50 5.656000137 6.791999817 7.393000126 7.718999863 8.057999611 8.232999802 8.340000153
55 5.968999863 7.014999866 7.587999821 7.913000107 8.255000114 8.44299984 8.565999985
60 6.225999832 7.176000118 7.741000175 8.079999924 8.434000015 8.642000198 8.788000107
65 6.414000034 7.326000214 7.859000206 8.225999832 8.602000237 8.840000153 9.015000343
70 6.624000072 7.494999886 7.956999779 8.357000351 8.767000198 9.039999962 9.25
75 6.801000118 7.638999939 8.100999832 8.468000412 8.930000305 9.251999855 9.496999741
80 6.93599987 7.758999825 8.222000122 8.56799984 9.107999802 9.491000175 9.772000313
85 7.035999775 7.855000019 8.322999954 8.690999985 9.289999962 9.748999596 10.10700035
90 7.102000237 7.919000149 8.409999847 8.80300045 9.470999718 10.03199959 10.47500038
95 7.125 7.933000088 8.479000092 8.901000023 9.642999649 10.31599998 10.83600044
100 7.107999802 7.907999992 8.534000397 8.987000465 9.812000275 10.60000038 11.18799973
105 7.053999901 7.849999905 8.515999794 9.06000042 9.972999573 10.86600018 11.52400017
110 6.965000153 7.769999981 8.43500042 9.090999603 10.11800003 11.10400009 11.84200001
115 6.840000153 7.663000107 8.309000015 8.961000443 10.24100018 11.31099987 12.14299965
120 6.672999859 7.524000168 8.149999619 8.75399971 10.32299995 11.48900032 12.42500019
125 6.436999798 7.349999905 7.961999893 8.529000282 9.987000465 11.64599991 12.68999958
130 6.044000149 7.133999825 7.749000072 8.298000336 9.579000473 11.67500019 12.96199989
135 5.572000027 6.856999874 7.513000011 8.06499958 9.237999916 11.11900043 13.27099991
140 5.127999783 6.440000057 7.257999897 7.831999779 8.937999725 10.52499962 12.90999985
145 4.683000088 5.933000088 6.981999874 7.598999977 8.670000076 10.0170002 12.10299969
150 4.30700016 5.52699995 6.657999992 7.363999844 8.425999641 9.602999687 11.39599991
155 3.996999979 5.196000099 6.294000149 7.122000217 8.194000244 9.262000084 10.79100037
160 3.730999947 4.887000084 5.936999798 6.868999958 7.973999977 8.970999718 10.27600002
165 3.506999969 4.620999813 5.642000198 6.610000134 7.78000021 8.737999916 9.892000198
170 3.342999935 4.421999931 5.427999973 6.385000229 7.625 8.56499958 9.626999855
175 3.233999968 4.288000107 5.281000137 6.217000008 7.506999969 8.43900013 9.44299984
180 3.170000076 4.209000111 5.191999912 6.111000061 7.428999901 8.354000092 9.32199955
file2.dat
7.5 172.0 4.5
5.6 56.8 4.7
6.7 35.0 5.1
11.0 158.7 5.3
13.8 24.8 5.6
12.1 180.0 6.0
5.1 83.2 6.4
13.2 158.0 6.6
15.8 34.5 6.67
15.6 32.9 6.69
11.8 180.0 6.8
13.7 96.0 7.2
15.0 62.4 7.3
11.2 76.2 7.3
11.7 84.9 7.4
13.8 121.8 7.46
9.7 90.9 7.6
13.2 66.0 7.64
14.3 61.3 7.8
14.8 124.6 8.0
9.5 118.8 8.20
15.1 148.8 8.29
12.2 81.8 8.4
You can see in your first image that the spacing between x=10 and x=12 is as big as the spacing between x=12 and x=16, which gives a clue to what's going on: while the first plot looks like gnuplot is using the x coordinates 6,8,10,12,16,20,24, those are really only labels; numerically gnuplot uses the x coordinates 0,1,2,3,4,5,6. So when you then plot the second file on the same scale, its data points have x values between 5.1 and 15.8, so they show up at the side of the pm3d surface.
If you want gnuplot to use the first column and first row as actual coordinates, you have to use the nonuniform matrix format (see help matrix nonuniform). First, you need to change your data file file1.dat to start with the number 7, the number of columns. The beginning of the file should look like this:
7 6 8 10 12 16 20 24
30 3.513999939 4.515999794 5.293000221 5.894999981 6.633999825 6.870999813 6.901000023
35 4.235000134 5.330999851 6.169000149 6.72300005 7.196000099 7.374000072 7.434000015
Then you can plot the data as follows:
splot "file1.dat" nonuniform matrix w pm3d, \
"file2.dat" with impulses, \
"file2.dat" with points pt 7

read many lines with specific position

Thank you for the time you spent reading this; maybe it is a newbie question.
I have a file of 10081 lines. This is an example of the file (a Nordic seismic bulletin):
2016 1 8 0921 21.5 L -22.382 -67.835 148.9 OSC 18 0.3 4.7LOSC 1
2016 1 8 1515 43.7 L -20.762 -67.475 188.7 OSC 16 .30 3.7LOSC 1
2016 1 9 0529 35.9 L -18.811 -67.278 235.9 OSC 16 0.5 3.9LOSC 1
2016 110 313 55.6 L -22.032 -67.375 172.0 OSC 14 .30 3.0LOSC 1
2016 110 1021 36.5 L -16.923 -66.668 35.0 OSC 16 0.4 4.5LOSC 1
I tried the following code to extract some information from the file and save it in a separate file.
awk 'NR==1 {print substr($0,24,7), substr($0,32,7), substr($0,40,5)}' select.inp > lat_lon_depth.xyz
substr($0,24,7) means that I take 7 characters starting from the 24th position, which is
the latitude information (-22.382), and the same for the others (longitude from the 32nd position with 7 characters and depth from the 40th position with 5 characters).
So the question: is it possible to go through all the lines of the file and get all the latitude, longitude and depth values?
Thank you for your time.
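Since the NR==1 pattern is what restricts that command to the first line, a minimal sketch of the same extraction applied to every line (same fixed column positions and file names as above) would be:
awk '{print substr($0,24,7), substr($0,32,7), substr($0,40,5)}' select.inp > lat_lon_depth.xyz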

Using GHC's profiling stats/charts to identify trouble-areas / improve performance of Haskell code

TL;DR: Based on the Haskell code and its associated profiling data below, what conclusions can we draw that let us modify/improve it so we can narrow the performance gap vs. the same algorithm written in imperative languages (namely C++ / Python / C#, but the specific language isn't important)?
Background
I wrote the following piece of code as an answer to a question on a popular site which contains many questions of a programming and/or mathematical nature. (You've probably heard of this site, whose name is pronounced "oiler" by some, "yoolurr" by others.) Since the code below is a solution to one of the problems, I'm intentionally avoiding any mention of the site's name or any specific terms in the problem. That said, I'm talking about problem one hundred and three.
(In fact, I've seen many solutions in the site's forums from resident Haskell wizards :P)
Why did I choose to profile this code?
This was the first problem (on said site) in which I encountered a difference in performance (as measured by execution time) between Haskell code vs. C++/Python/C# code (when both use a similar algorithm). In fact, it was the case for all of the problems (thus far; I've done ~100 problems but not sequentially) that an optimized Haskell code was pretty much neck-and-neck with the fastest C++ solutions, ceteris paribus for the algorithm, of course.
However, the posts in the forum for this particular problem would indicate that the same algorithm in these other languages typically requires at most one or two seconds, with the longest taking 10-15 sec (assuming the same starting parameters; I'm ignoring the very naive algorithms that take 2-3 min+). In contrast, the Haskell code below required ~50 sec on my (decent) computer (with profiling disabled; with profiling enabled, it takes ~2 min, as you can see below; note: the exec time was identical when compiling with -fllvm). Specs: i5 2.4 GHz laptop, 8 GB RAM.
In an effort to learn Haskell in a way that it can become a viable substitute to the imperative languages, one of my aims in solving these problems is learning to write code that, to the extent possible, has performance that's on par with those imperative languages. In that context, I still consider the problem as yet unsolved by me (since there's nearly a ~25x difference in performance!)
What have I done so far?
In addition to the obvious step of streamlining the code itself (to the best of my ability), I've also performed the standard profiling exercises that are recommended in "Real World Haskell".
But I'm having a hard time drawing conclusions that tell me which pieces need to be modified. That's where I'm hoping folks might be able to help provide some guidance.
Description of the problem:
I'd refer you to the website of problem one hundred and three on the aforementioned site but here's a brief summary: the goal is to find a group of seven numbers such that any two disjoint subgroups (of that group) satisfy the following two properties (I'm trying to avoid using the 's-e-t' word for reasons mentioned above...):
no two subgroups sum to the same amount
the subgroup with more elements has a larger sum (in other words, the sum of the smallest four elements must be greater than the sum of the largest three elements).
In particular, we are trying to find the group of seven numbers with the smallest sum.
My (admittedly weak) observations
A warning: some of these comments may well be totally wrong, but I wanted to at least take a stab at interpreting the profiling data based on what I read in Real World Haskell and other profiling-related posts on SO.
There does indeed seem to be an efficiency issue, seeing as how one-third of the time is spent doing garbage collection (37.1%). The first table of figures shows that ~172 GB is allocated in the heap, which seems horrible... (Maybe there's a better structure / function to use for implementing the dynamic programming solution?)
Not surprisingly, the vast majority (83.1%) of time is spent checking rule 1: (i) 41.6% in the value sub-function, which determines values to fill in the dynamic programming ("DP") table, (ii) 29.1% in the table function, which generates the DP table and (iii) 12.4% in the rule1 function, which checks the resulting DP table to make sure that a given sum can only be calculated in one way (i.e., from one subgroup).
However, I did find it surprising that more time was spent in the value function relative to the table and rule1 functions, given that it's the only one of the three which doesn't construct an array or filter through a large number of elements (it's really only performing O(1) lookups and making comparisons between Int types, which you'd think would be relatively quick). So this is a potential problem area. That said, it's unlikely that the value function is driving the high heap allocation.
Frankly, I'm not sure what to make of the three charts.
Heap profile chart (i.e., the first chart below):
I'm honestly not sure what is represented by the red area marked as Pinned. It makes sense that the dynamic function has a "spiky" memory allocation because it's called every time the construct function generates a tuple that meets the first three criteria and, each time it's called, it creates a decently large DP array. Also, I'd think that the allocation of memory to store the tuples (generated by construct) wouldn't be flat over the course of the program.
Pending clarification of the "Pinned" red area, I'm not sure this one tells us anything useful.
Allocation by type and allocation by constructor:
I suspect that the ARR_WORDS (which represents a ByteString or unboxed Array according to the GHC docs) represents the low-level execution of the construction of the DP array (in the table function). But I'm not 100% sure.
I'm not sure what the FROZEN and STATIC pointer categories correspond to.
Like I said, I'm really not sure how to interpret the charts as nothing jumps out (to me) as unexpected.
The code and the profiling results
Without further ado, here's the code with comments explaining my algorithm. I've tried to make sure the code doesn't run off of the right-side of the code-box - but some of the comments do require scrolling (sorry).
{-# LANGUAGE NoImplicitPrelude #-}
{-# OPTIONS_GHC -Wall #-}

import CorePrelude
import Data.Array
import Data.List
import Data.Bool.HT ((?:))
import Control.Monad (guard)

main = print (minimum construct)

cap  = 55 :: Int
flr  = 20 :: Int
step = 1  :: Int

--we enumerate tuples that are potentially valid and then
--filter for valid ones; we perform the most computationally
--expensive step (i.e., rule 1) at the very end
construct :: [[Int]]
construct = {-# SCC "construct" #-} do
    a <- [flr..cap]       --1st: we construct potentially valid tuples while applying a
    b <- [a+step..cap]    --constraint on the upper bound of any element as implied by rule 2
    c <- [b+step..a+b-1]
    d <- [c+step..a+b-1]
    e <- [d+step..a+b-1]
    f <- [e+step..a+b-1]
    g <- [f+step..a+b-1]
    guard (a + b + c + d - e - f - g > 0)  --2nd: we screen for tuples that completely conform to rule 2
    let nn = [g,f,e,d,c,b,a]
    guard (sum nn < 285)                   --3rd: we screen for tuples of a certain size (a guess to speed things up)
    guard (rule1 nn)                       --4th: we screen for tuples that conform to rule 1
    return nn

rule1 :: [Int] -> Bool
rule1 nn = {-# SCC "rule1" #-}
    null . filter ((>1) . snd)                --confirm that there's only one subgroup that sums to any given sum
         . filter ((length nn==) . snd . fst) --the last column is how many subgroups sum to a given sum
         . assocs                             --run the dynamic programming algorithm and generate a table
         $ dynamic nn

dynamic :: [Int] -> Array (Int,Int) Int
dynamic ns = {-# SCC "dynamic" #-} table
  where
    (len, maxSum) = (length &&& sum) ns
    table = array ((0,0),(maxSum,len))
                  [ ((s,i),x) | s <- [0..maxSum], i <- [0..len], let x = value (s,i) ]
    elements = listArray (0,len) (0:ns)
    value (s,i)
        | i == 0 || s == 0 = 0
        | s == m           = table ! (s,i-1) + 1
        | s > m            = s <= sum (take i ns) ?:
                               (table ! (s,i-1) + table ! ((s-m),i-1), 0)
        | otherwise        = 0
      where
        m = elements ! i
Stats on heap allocation, garbage collection and time elapsed:
% ghc -O2 --make 103_specialsubset2.hs -rtsopts -prof -auto-all -caf-all -fforce-recomp
[1 of 1] Compiling Main ( 103_specialsubset2.hs, 103_specialsubset2.o )
Linking 103_specialsubset2 ...
% time ./103_specialsubset2.hs +RTS -p -sstderr
zsh: permission denied: ./103_specialsubset2.hs
./103_specialsubset2.hs +RTS -p -sstderr 0.00s user 0.00s system 86% cpu 0.002 total
% time ./103_specialsubset2 +RTS -p -sstderr
SOLUTION REDACTED
172,449,596,840 bytes allocated in the heap
21,738,677,624 bytes copied during GC
261,128 bytes maximum residency (74 sample(s))
55,464 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 327548 colls, 0 par 27.34s 41.64s 0.0001s 0.0092s
Gen 1 74 colls, 0 par 0.02s 0.02s 0.0003s 0.0013s
INIT time 0.00s ( 0.01s elapsed)
MUT time 53.91s ( 70.60s elapsed)
GC time 27.35s ( 41.66s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 81.26s (112.27s elapsed)
%GC time 33.7% (37.1% elapsed)
Alloc rate 3,199,123,974 bytes per MUT second
Productivity 66.3% of total user, 48.0% of total elapsed
./103_specialsubset2 +RTS -p -sstderr 81.26s user 30.90s system 99% cpu 1:52.29 total
Stats on time spent per cost-centre:
Wed Dec 17 23:21 2014 Time and Allocation Profiling Report (Final)
103_specialsubset2 +RTS -p -sstderr -RTS
total time = 15.56 secs (15565 ticks @ 1000 us, 1 processor)
total alloc = 118,221,354,488 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
dynamic.value Main 41.6 17.7
dynamic.table Main 29.1 37.8
construct Main 12.9 37.4
rule1 Main 12.4 7.0
dynamic.table.x Main 1.9 0.0
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 55 0 0.0 0.0 100.0 100.0
main Main 111 0 0.0 0.0 0.0 0.0
CAF:main1 Main 108 0 0.0 0.0 0.0 0.0
main Main 110 1 0.0 0.0 0.0 0.0
CAF:main2 Main 107 0 0.0 0.0 0.0 0.0
main Main 112 0 0.0 0.0 0.0 0.0
CAF:main3 Main 106 0 0.0 0.0 0.0 0.0
main Main 113 0 0.0 0.0 0.0 0.0
CAF:construct Main 105 0 0.0 0.0 100.0 100.0
construct Main 114 1 0.6 0.0 100.0 100.0
construct Main 115 1 12.9 37.4 99.4 100.0
rule1 Main 123 282235 0.6 0.0 86.5 62.6
rule1 Main 124 282235 12.4 7.0 85.9 62.6
dynamic Main 125 282235 0.2 0.0 73.5 55.6
dynamic.elements Main 133 282235 0.3 0.1 0.3 0.1
dynamic.len Main 129 282235 0.0 0.0 0.0 0.0
dynamic.table Main 128 282235 29.1 37.8 72.9 55.5
dynamic.table.x Main 130 133204473 1.9 0.0 43.8 17.7
dynamic.value Main 131 133204473 41.6 17.7 41.9 17.7
dynamic.value.m Main 132 132640003 0.3 0.0 0.3 0.0
dynamic.maxSum Main 127 282235 0.0 0.0 0.0 0.0
dynamic.(...) Main 126 282235 0.1 0.0 0.1 0.0
dynamic Main 122 282235 0.0 0.0 0.0 0.0
construct.nn Main 121 12683926 0.0 0.0 0.0 0.0
CAF:main4 Main 102 0 0.0 0.0 0.0 0.0
construct Main 116 0 0.0 0.0 0.0 0.0
construct Main 117 0 0.0 0.0 0.0 0.0
CAF:cap Main 101 0 0.0 0.0 0.0 0.0
cap Main 119 1 0.0 0.0 0.0 0.0
CAF:flr Main 100 0 0.0 0.0 0.0 0.0
flr Main 118 1 0.0 0.0 0.0 0.0
CAF:step_r1dD Main 99 0 0.0 0.0 0.0 0.0
step Main 120 1 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 96 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 93 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding 91 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 82 0 0.0 0.0 0.0 0.0
Heap profile:
Allocation by type:
Allocation by constructors:
There is a lot that can be said. In this answer I'll just comment on the nested list comprehensions in the construct function.
To get an idea of what's going on in construct, we'll isolate it and compare it to a nested-loop version that you would write in an imperative language. We've removed the rule1 guard to test only the generation of lists.
-- Lists.hs -- using list comprehensions
import Control.Monad

cap  = 55 :: Int
flr  = 20 :: Int
step = 1  :: Int

construct :: [[Int]]
construct = do
    a <- [flr..cap]
    b <- [a+step..cap]
    c <- [b+step..a+b-1]
    d <- [c+step..a+b-1]
    e <- [d+step..a+b-1]
    f <- [e+step..a+b-1]
    g <- [f+step..a+b-1]
    guard (a + b + c + d - e - f - g > 0)
    guard (a + b + c + d + e + f + g < 285)
    return [g,f,e,d,c,b,a]
    -- guard (rule1 nn)

main = do
    forM_ construct print
-- Loops.hs -- using imperative looping
import Control.Monad

loop a b f = go a
  where go i | i > b     = return ()
             | otherwise = do f i; go (i+1)

cap  = 55 :: Int
flr  = 20 :: Int
step = 1  :: Int

main =
    loop flr cap $ \a ->
    loop (a+step) cap $ \b ->
    loop (b+step) (a+b-1) $ \c ->
    loop (c+step) (a+b-1) $ \d ->
    loop (d+step) (a+b-1) $ \e ->
    loop (e+step) (a+b-1) $ \f ->
    loop (f+step) (a+b-1) $ \g ->
    if (a+b+c+d-e-f-g > 0) && (a+b+c+d+e+f+g < 285)
      then print [g,f,e,d,c,b,a]
      else return ()
Both programs were compiled with ghc -O2 -rtsopts and run with prog +RTS -s > out.
Here is a summary of the results:
                  Lists.hs     Loops.hs
Heap allocation   44,913 MB    2,740 MB
Max. Residency    44,312       44,312
%GC               5.8 %        1.7 %
Total Time        9.48 secs    1.43 secs
As you can see, the loop version, which is the way you would write this in a language like C, wins in every category.
The list comprehension version is cleaner and more composable, but also less performant than direct iteration.
