How to determine a formula for execution time given quantitative data (Excel, trendlines, Monte Carlo simulation)

Can I get your help with some maths and possibly Excel?
I have benchmarked my app while increasing the number of iterations and the number of obligors, recording the time taken in seconds, with the following result:
200 400 600 800 1000 1200 1400 1600 1800 2000
20000 15.627681 30.0968663 44.7592684 60.9037558 75.8267358 90.3718977 105.8749983 121.0030672 135.9191249 150.3331682
40000 31.7202111 62.3603882 97.2085204 128.8111731 156.2443206 186.6374271 218.324317 249.2699288 279.6008184 310.9970803
60000 47.0708635 92.4599437 138.874287 186.0576007 231.2181381 280.541207 322.9836878 371.3076757 413.4058622 459.6208335
80000 60.7346238 120.3216303 180.471169 241.668982 300.4283548 376.9639188 417.5231669 482.6288981 554.9740194 598.0394434
100000 76.7535915 150.7479245 227.5125656 304.3908046 382.5900043 451.6034296 526.0730786 609.0358776 679.0268121 779.6887277
120000 90.4174626 179.5511355 269.4099593 360.2934453 448.4387573 537.1406039 626.7325734 727.6132992 807.4767327 898.307638
How can I now come up with a function for T (time taken in seconds) expressed in terms of the number of obligors O and the number of iterations I?
Thanks

I'm not quite sure what data is involved, given how the question is presented.
Assuming you're looking for y = f(x): if you load the data into Excel, you can use the SLOPE and INTERCEPT functions on the data ranges to derive an expression of the form
y = mx+c
and thus a linear function.
If you want a quadratic or cubic, you can use LINEST with extra columns holding the data squared, cubed, etc. to get the quadratic/cubic coefficients, and thus derive an appropriate higher-order function.

I spoke to one of the quants here: the function is of the form T = K * N * O, where T is time, K is some constant, N is the number of iterations and O is the number of obligors.
Rearrange to K = T/(N * O), compute K for every sample point, take the average over all sample points, and use the standard deviation as the error estimate.
I did this for my data and got:
T = 3.81524E-06 * N * O (with 1.9% error), which is a pretty good approximation.
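The averaging procedure described above can be sketched in Python as follows. The variable names are illustrative. Because T = K * N * O is symmetric in N and O, it does not matter which axis of the table holds iterations and which holds obligors. The result should land close to the constant and relative error quoted in the answer.

```python
from statistics import mean, stdev

# Column headers and rows of the benchmark table from the question.
cols = [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]
table = {
    20000: [15.627681, 30.0968663, 44.7592684, 60.9037558, 75.8267358,
            90.3718977, 105.8749983, 121.0030672, 135.9191249, 150.3331682],
    40000: [31.7202111, 62.3603882, 97.2085204, 128.8111731, 156.2443206,
            186.6374271, 218.324317, 249.2699288, 279.6008184, 310.9970803],
    60000: [47.0708635, 92.4599437, 138.874287, 186.0576007, 231.2181381,
            280.541207, 322.9836878, 371.3076757, 413.4058622, 459.6208335],
    80000: [60.7346238, 120.3216303, 180.471169, 241.668982, 300.4283548,
            376.9639188, 417.5231669, 482.6288981, 554.9740194, 598.0394434],
    100000: [76.7535915, 150.7479245, 227.5125656, 304.3908046, 382.5900043,
             451.6034296, 526.0730786, 609.0358776, 679.0268121, 779.6887277],
    120000: [90.4174626, 179.5511355, 269.4099593, 360.2934453, 448.4387573,
             537.1406039, 626.7325734, 727.6132992, 807.4767327, 898.307638],
}

# K = T / (N * O) for every sample point.
ks = [t / (row * col)
      for row, times in table.items()
      for col, t in zip(cols, times)]

k_mean = mean(ks)
k_rel_sd = stdev(ks) / k_mean  # standard deviation relative to the mean
print(f"T = {k_mean:.5e} * N * O (relative std dev {k_rel_sd:.1%})")
```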

Create a chart in Excel, add a trendline, and select to have the equation displayed on the chart.

To clarify: You have tabular data below which you want to fit to some function f(O,I)=t?
200 400 600 800 1000 1200 1400 1600 1800 2000
20000 15.627681 30.0968663 44.7592684 60.9037558 75.8267358 90.3718977 105.8749983 121.0030672 135.9191249 150.3331682
40000 31.7202111 62.3603882 97.2085204 128.8111731 156.2443206 186.6374271 218.324317 249.2699288 279.6008184 310.9970803
60000 47.0708635 92.4599437 138.874287 186.0576007 231.2181381 280.541207 322.9836878 371.3076757 413.4058622 459.6208335
80000 60.7346238 120.3216303 180.471169 241.668982 300.4283548 376.9639188 417.5231669 482.6288981 554.9740194 598.0394434
100000 76.7535915 150.7479245 227.5125656 304.3908046 382.5900043 451.6034296 526.0730786 609.0358776 679.0268121 779.6887277
120000 90.4174626 179.5511355 269.4099593 360.2934453 448.4387573 537.1406039 626.7325734 727.6132992 807.4767327 898.307638
A rough guess looks like both O & I are linear. So f is in the form t = aO + bI + c. Plug in a few (O,I,t) and see what a,b,c should be.

Related

What is the simplest way to complete a function on every row of a large table?

I want to run a one-sided Fisher exact test on every row of a table with 3000+ rows, in a format matching the example below:
gene sample_alt sample_ref population_alt population_ref
One 4 556 770 37000
Two 5 555 771 36999
Three 6 554 772 36998
I would ideally like to make another column of the table equivalent to
[(4+556)!(4+770)!(770+37000)!(556+37000)!]/[4!(556!)770!(37000!)(4+556+770+37000)!]
for the first row of data, and so on and so forth for each row of the table.
I know how to do a fisher test in R for simple 2x2 tables, but I wouldn't know how I would apply the fisher.test() function to each row of a large table. I also can't use an excel formula because the numbers get so big with the factorials that they reach excel's digit limit and result in a #NUM error. What's the best way to simply complete this? Thanks in advance!
Beginning with a tab-delimited text file on the desktop (table.txt) in the same format as shown in the question:
# Install psych (used for phi()) if it is not already available
if(!require(psych)){install.packages("psych")}
multiFisher = function(file="Desktop/table.txt", saveit=TRUE,
                       outfile="Desktop/table.csv", progress=TRUE,
                       verbose=FALSE, digits=3, ... )
{
  require(psych)
  # Skip the header row and use short column names:
  # MD/WTD = sample alt/ref counts, MC/WTC = population alt/ref counts
  Data = read.table(file, skip=1, header=FALSE,
                    col.names=c("Gene", "MD", "WTD", "MC", "WTC"), ...)
  if(verbose){str(Data)}
  # Result columns, filled in row by row below
  Data$Fisher.p = NA
  Data$phi = NA
  Data$OR1 = format(0.123, nsmall=3)
  Data$OR2 = NA
  if(progress){cat("\n")}
  for(i in seq_len(nrow(Data))){
    # Build the 2x2 table for this row and run the one-sided test
    Matrix = matrix(c(Data$WTC[i],Data$MC[i],Data$WTD[i],Data$MD[i]), nrow=2)
    Fisher = fisher.test(Matrix, alternative = 'greater')
    Data$Fisher.p[i] = signif(Fisher$p.value, digits=digits)
    Data$phi[i] = phi(Matrix, digits=digits)
    # Sample odds ratio and its reciprocal
    OR1 = (Data$WTC[i]*Data$MD[i])/(Data$MC[i]*Data$WTD[i])
    OR2 = 1 / OR1
    Data$OR1[i] = format(signif(OR1, digits=digits), nsmall=3)
    Data$OR2[i] = signif(OR2, digits=digits)
    if(progress) {cat(".")}
  }
  if(progress){cat("\n"); cat("\n")}
  if(saveit){write.csv(Data, outfile)}
  return(Data)
}
multiFisher()
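The #NUM overflow mentioned in the question comes from evaluating the factorials directly. A minimal Python sketch of one way around it is to work with log-factorials via math.lgamma and only exponentiate the final ratio. The function names below are hypothetical. The one-sided p-value is taken as the hypergeometric probability of seeing at least the observed sample_alt count, which should correspond to alternative = 'greater' in the R call above, though that correspondence has not been checked against R here.

```python
from math import exp, lgamma

def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def fisher_greater(a, b, c, d):
    """One-sided Fisher p-value P(X >= a) for the 2x2 table [[a, b], [c, d]]."""
    row1 = a + b            # first row total
    col1 = a + c            # first column total
    n = a + b + c + d       # grand total
    log_denom = log_choose(n, row1)
    k_max = min(row1, col1)
    # Sum hypergeometric probabilities of tables at least as extreme as observed.
    return sum(
        exp(log_choose(col1, k) + log_choose(n - col1, row1 - k) - log_denom)
        for k in range(a, k_max + 1)
    )

# Example rows from the question: gene, sample_alt, sample_ref, population_alt, population_ref
rows = [("One", 4, 556, 770, 37000),
        ("Two", 5, 555, 771, 36999),
        ("Three", 6, 554, 772, 36998)]

for gene, s_alt, s_ref, p_alt, p_ref in rows:
    print(gene, fisher_greater(s_alt, s_ref, p_alt, p_ref))
```

The single-table probability written out in the question is the k = a term of this sum.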

Parsing heterogeneous data from a text file in Python

I am trying to parse raw data results from a text file into an organised tuple but having trouble getting it right.
My raw data from the textfile looks something like this:
Episode Cumulative Results
EpisodeXD0281119
Date collected21/10/2019
Time collected10:00
Real time PCR for M. tuberculosis (Xpert MTB/Rif Ultra):
PCR result Mycobacterium tuberculosis complex NOT detected
Bacterial Culture:
Bottle: Type FAN Aerobic Plus
Result No growth after 5 days
EpisodeST32423457
Date collected23/02/2019
Time collected09:00
Gram Stain:
Neutrophils Occasional
Gram positive bacilli Moderate (2+)
Gram negative bacilli Numerous (3+)
Gram negative cocci Moderate (2+)
EpisodeST23423457
Date collected23/02/2019
Time collected09:00
Bacterial Culture:
A heavy growth of
1) Klebsiella pneumoniae subsp pneumoniae (KLEPP)
ensure that this organism does not spread in the ward/unit.
A heavy growth of
2) Enterococcus species (ENCSP)
Antibiotic/Culture KLEPP ENCSP
Trimethoprim-sulfam R
Ampicillin / Amoxic R S
Amoxicillin-clavula R
Ciprofloxacin R
Cefuroxime (Parente R
Cefuroxime (Oral) R
Cefotaxime / Ceftri R
Ceftazidime R
Cefepime R
Gentamicin S
Piperacillin/tazoba R
Ertapenem R
Imipenem S
Meropenem R
S - Sensitive ; I - Intermediate ; R - Resistant ; SDD - Sensitive Dose Dependant
Comment for organism KLEPP:
** Please note: this is a carbapenem-RESISTANT organism. Although some
carbapenems may appear susceptible in vitro, these agents should NOT be used as
MONOTHERAPY in the treatment of this patient. **
Please isolate this patient and practice strict contact precautions. Please
inform Infection Prevention and Control as contact screening might be
indicated.
For further advice on the treatment of this isolate, please contact.
The currently available laboratory methods for performing colistin
susceptibility results are unreliable and may not predict clinical outcome.
Based on published data and clinical experience, colistin is a suitable
therapeutic alternative for carbapenem resistant Acinetobacter spp, as well as
carbapenem resistant Enterobacteriaceae. If colistin is clinically indicated,
please carefully assess clinical response.
EpisodeST234234057
Date collected23/02/2019
Time collected09:00
Authorised by xxxx on 27/02/2019 at 10:35
MIC by E-test:
Organism Klebsiella pneumoniae (KLEPN)
Antibiotic Meropenem
MIC corrected 4 ug/mL
MIC interpretation Resistant
Antibiotic Imipenem
MIC corrected 1 ug/mL
MIC interpretation Sensitive
Antibiotic Ertapenem
MIC corrected 2 ug/mL
MIC interpretation Resistant
EpisodeST23423493
Date collected18/02/2019
Time collected03:15
Potassium 4.4 mmol/L 3.5 - 5.1
EpisodeST45445293
Date collected18/02/2019
Time collected03:15
Creatinine 32 L umol/L 49 - 90
eGFR (MDRD formula) >60 mL/min/1.73 m2
Creatinine 28 L umol/L 49 - 90
eGFR (MDRD formula) >60 mL/min/1.73 m2
Essentially the pattern is that ALL information starts with a unique EPISODE NUMBER and follows with a DATE and TIME and then the result of whatever test. This is the pattern throughout.
What I am trying to parse into my tuple is the date, time, name of the test and the result - whatever it might be. I have the following code:
from collections import namedtuple
with open(filename) as f:
    data = f.read()
data = data.splitlines()
DS = namedtuple('DS', 'date time name value')
parsed = list()
idx_date = [i for i, r in enumerate(data) if r.strip().startswith('Date')]
for start, stop in zip(idx_date[:-1], idx_date[1:]):
    chunk = data[start:stop]
    date = time = name = value = None
    for row in chunk:
        if not row: continue
        row = row.strip()
        if row.startswith('Episode'): continue
        if row.startswith('Date'):
            _, date = row.split()
            date = date.replace('collected', '')
        elif row.startswith('Time'):
            _, time = row.split()
            time = time.replace('collected', '')
        else:
            name, value, *_ = row.split()
            print (name)
            parsed.append(DS(date, time, name, value))
print(parsed)
My problem is that I cannot find a way to parse the heterogeneous test RESULT into a form I can use later, for example for the tuple DS ('date time name value'):
DATE = 21/10/2019
TIME = 10:00
NAME = Real time PCR for M tuberculosis or Potassium
RESULT = Negative or 4.7
Any advice appreciated. I have hit a brick wall.
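One possible starting point, sketched under assumptions drawn only from the sample above: treat each line of the form "Episode" plus an ID as the start of a record, pull the date and time out with regular expressions, and keep every other line as a raw result line to be interpreted per test type later. The Record type, the regular expressions and the parse_report name are all hypothetical, and the episode ID pattern (two capital letters followed by digits) is a guess that may not hold for the full file.

```python
import re
from collections import namedtuple

Record = namedtuple("Record", "episode date time results")

# Assumed line shapes, inferred from the sample only.
EPISODE_RE = re.compile(r"^Episode\s*([A-Z]{2}\d+)$")
DATE_RE = re.compile(r"^Date collected\s*(\d{2}/\d{2}/\d{4})$")
TIME_RE = re.compile(r"^Time collected\s*(\d{2}:\d{2})$")

def parse_report(text):
    """Group lines by episode; unrecognised lines are kept as raw results."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = EPISODE_RE.match(line)
        if match:
            # A new episode starts: open a fresh record.
            current = {"episode": match.group(1), "date": None,
                       "time": None, "results": []}
            records.append(current)
        elif current is None:
            continue  # header lines before the first episode
        elif (match := DATE_RE.match(line)):
            current["date"] = match.group(1)
        elif (match := TIME_RE.match(line)):
            current["time"] = match.group(1)
        else:
            current["results"].append(line)
    return [Record(**r) for r in records]
```

Keeping the result lines unparsed sidesteps the heterogeneity at this stage. A second pass could then dispatch on the first result line (for example "Gram Stain:" or "MIC by E-test:") to a parser written for that test type.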

Calculate the average of Spearman correlation

I have 2 columns A and B which contain the Spearman's correlation values as follows:
0.127272727 -0.260606061
-0.090909091 -0.224242424
0.345454545 0.745454545
0.478787879 0.660606061
-0.345454545 -0.333333333
0.151515152 -0.127272727
0.478787879 0.660606061
-0.321212121 -0.284848485
0.284848485 0.515151515
0.36969697 -0.139393939
-0.284848485 0.272727273
How can I calculate the average of the correlation values in these 2 columns in Excel or Matlab? I found a close answer at this link: https://stats.stackexchange.com/questions/8019/averaging-correlation-values
The main point is that we cannot use a plain mean or average in this case, as explained in the link. They proposed a nice way to do that, but I don't know how to implement it in Excel or Matlab.
Following the second answer of the link you provided, which is the most general case, you can calculate the average Spearman's rho in Matlab as follows:
M = [0.127272727 -0.260606061;
-0.090909091 -0.224242424;
0.345454545 0.745454545;
0.478787879 0.660606061;
-0.345454545 -0.333333333;
0.151515152 -0.127272727;
0.478787879 0.660606061;
-0.321212121 -0.284848485;
0.284848485 0.515151515;
0.36969697 -0.139393939;
-0.284848485 0.272727273];
z = atanh(M);
meanRho = tanh(mean(z));
As you can see it gives mean values of
meanRho =
0.1165 0.1796
whereas the simple mean is quite close:
mean(M)
ans =
0.1085 0.1350
Edit: more information on Fisher's transformation here.
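The same Fisher z averaging can be sketched in Python using only the standard library. The mean_correlation name is illustrative. The result should agree with the meanRho values shown above to about four decimal places. In Excel, the equivalent steps would presumably be FISHER on each value, AVERAGE over the transformed column, then FISHERINV on the result.

```python
from math import atanh, tanh
from statistics import mean

# Columns A and B from the question.
col_a = [0.127272727, -0.090909091, 0.345454545, 0.478787879, -0.345454545,
         0.151515152, 0.478787879, -0.321212121, 0.284848485, 0.36969697,
         -0.284848485]
col_b = [-0.260606061, -0.224242424, 0.745454545, 0.660606061, -0.333333333,
         -0.127272727, 0.660606061, -0.284848485, 0.515151515, -0.139393939,
         0.272727273]

def mean_correlation(rhos):
    """Average correlations via Fisher's z: transform, average, back-transform."""
    return tanh(mean([atanh(r) for r in rhos]))

print(mean_correlation(col_a), mean_correlation(col_b))
```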
In MATLAB, define a matrix with these values and use mean function as follows:
%define a matrix M
M = [0.127272727 -0.260606061;
-0.090909091 -0.224242424;
0.345454545 0.745454545;
0.478787879 0.660606061;
-0.345454545 -0.333333333;
0.151515152 -0.127272727;
0.478787879 0.660606061;
-0.321212121 -0.284848485;
0.284848485 0.515151515;
0.36969697 -0.139393939;
-0.284848485 0.272727273];
%calculates the mean of each column
meanVals = mean(M);
Result
meanVals =
0.1085 0.1350
It is also possible to calculate the overall mean and the mean of each row as follows:
totalMean = mean(M(:)); %mean of all elements
rowMeans = mean(M,2); %mean of each row

Simple Granger causality test using statsmodels.tsa.stattools.grangercausalitytests

I have several time-series files (540 rows x 6 columns) on which I would like to run a simple Granger causality test using statsmodels.tsa.stattools.grangercausalitytests:
from statsmodels.tsa.stattools import grangercausalitytests
My pandas DataFrame (df) contains the data in the format shown below.
I tried to print the tests using the Open and Close columns with the following:
print(grangercausalitytests([df[Open], df[Close]], maxlag=15, addconst=True, verbose=True))
but it does not work. Is there a way to perform the Granger test on each column (Open, High, Low) paired with Close, i.e. Open and Close, High and Close, Low and Close?
Epochtime Open High Low Close Vol
1486094520, 808.11000, 808.11000, 808.11000, 808.11000, 100
1486094580, 809.45000, 809.45000, 809.45000, 809.45000, 100
1486094820, 809.99000, 809.99000, 809.99000, 809.99000, 100
1486095540, 811.45000, 811.45000, 811.45000, 811.45000, 100
1486095840, 811.30000, 811.30000, 811.01000, 811.01000, 300
1486095900, 810.76000, 810.76000, 810.76000, 810.76000, 100
1486096200, 812.00000, 812.00000, 812.00000, 812.00000, 100
It requires a 2-dimensional array; try this:
print(grangercausalitytests(df[['Open', 'Close']], maxlag=15, addconst=True, verbose=True))

JAGS Beginner - Receiving and Understanding Output

When using JAGS, how does one receive output from a model in the format:
Inference for Bugs model at "model.txt", fit using jags,
3 chains, each with 10000 iterations (first 5000 discarded)
n.sims = 15000 iterations saved
mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff
mu 9.950 0.288 9.390 9.755 9.951 10.146 10.505 1.001 11000
sd.obs 3.545 0.228 3.170 3.401 3.534 3.675 3.978 1.001 13000
deviance 820.611 3.460 818.595 819.132 819.961 821.366 825.871 1.001 15000
I assumed, as with BUGS, it would appear when the model completes however I only get something in the format:
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 1785
Unobserved stochastic nodes: 1843
Total graph size: 61542
Initializing model
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
Apologies for the basic question. If anyone can provide useful JAGS introductory material that would also be useful.
Kind regards.
If you only get the 'plus' signs, it means you only initialized the model. When JAGS actually runs, it typically prints '***' signs afterwards. So you are missing a line here (it would have been nice to see your code). For instance, if you use R2jags, you would write:
out <- jags(data = data, parameters.to.save = params, n.chains = 3, n.iter = 90000, n.burnin = 5000,
model.file = modFile)
# Optionally run further iterations on the fitted object
out.upd <- update(out, n.iter = 10000)
# Printing the fitted object shows the summary table (mu.vect, sd.vect, quantiles, Rhat, n.eff)
print(out)
