Reading a text file linewise - io

I have a text file which looks like this:
0.031 0.031 0.031 1.4998 0.9976 0.5668 0.9659
0.062 0.031 0.031 0.9620 0.7479 0.3674 0.4806
0.094 0.031 0.031 0.3549 0.0738 0.0054 0.3471
0.125 0.031 0.031 0.4270 0.3422 0.2180 0.1332
0.156 0.031 0.031 1.0766 0.9005 0.3868 0.4455
0.188 0.031 0.031 0.9285 0.6619 0.0161 0.6509
0.219 0.031 0.031 1.1200 0.6464 0.3230 0.8557
and so on...(32768 lines because it's a 32^3 grid.)
The first three columns are the x, y and z coordinates, the 4th column is the norm of the vector, and the last three columns are the components of a 3D vector. I need to calculate the curl and divergence of the vector field (columns 5, 6, 7) using the central difference method at each point. Since there are a lot of lines, I want to parallelize the input using MPI. But for this to happen, I need access to three lines at once, as the central difference method requires (for example)
My Fortran code for reading in data from the file is:
open(unit = 1,file = '32data.txt') !
do i = 1,32767
read(1,*) x(i),y(i),z(i),norm(i),xv(i),yv(i),zv(i)
end do
do i = 2,32767
dx = x(i) - x(i-1)
dy = y(i) - y(i-1)
dz = z(i) - z(i-1)
Fx(i) = (xv(i+1) - xv(i-1))/(2.0)*dx
Fy(i) = (yv(i+1) - yv(i-1))/(2.0)*dy
Fz(i) = (zv(i+1) - zv(i-1))/(2.0)*dz
div(i) = Fx(i)+Fy(i)+Fz(i)
end do
So I need to send chunks of lines to different processes. But there's a problem here. Let's say process 2 takes 4 lines and process 3 takes the next 6. To calculate the gradient for the first line in process 3, I need data from the last line in process 2, so I would have to send that line to process 3 as well. Is it possible to parallelize this in MPI, or should I use a different approach?
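One way to picture the overlap described above, without any MPI calls: give each chunk one extra "ghost" value on either side, so that the central difference at a chunk boundary has the neighbours it needs. The sketch below is plain Python on a 1D list. The function names and the chunk count are made up for illustration, and a real MPI code would exchange the ghost values between neighbouring ranks instead of slicing a shared list.

```python
def central_diff(values, h):
    """Central difference at interior points: (f[i+1] - f[i-1]) / (2*h)."""
    return [(values[i + 1] - values[i - 1]) / (2.0 * h)
            for i in range(1, len(values) - 1)]


def chunked_central_diff(values, h, n_chunks):
    """Split the interior points into n_chunks pieces, each with one ghost
    value on either side, and differentiate every piece independently."""
    n_interior = len(values) - 2
    base, extra = divmod(n_interior, n_chunks)
    result = []
    start = 1  # index of the first interior point owned by the current chunk
    for c in range(n_chunks):
        size = base + (1 if c < extra else 0)
        if size == 0:
            continue
        # Owned points are start .. start+size-1; add one ghost on each side.
        piece = values[start - 1:start + size + 1]
        result.extend(central_diff(piece, h))
        start += size
    return result


h = 0.031
f = [(i * h) ** 2 for i in range(12)]   # f(x) = x^2 on a uniform grid
serial = central_diff(f, h)
parallel_like = chunked_central_diff(f, h, n_chunks=3)
print(serial == parallel_like)
```

Each piece depends only on its own slice, so the pieces could be handed to separate processes. On the 32^3 grid the neighbours in y and z are not adjacent lines in the file, so the ghost region for those directions would have to be correspondingly larger.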

Related

Signal Processing: Sample rate vs sample period

Let's consider the following code:
f = 40; # Hz
tmin = -0.3;
tmax = 0.3;
t, sampling_period = linspace(start=tmin, stop=tmax, num=400, retstep=True); # here I am saying split in 400 regular interval the 0.6 units of times.
# Based on the above, I obtain the distance between two regular intervals of 0.0015 = 0.6/400 ==>
# sample period T = 0.0015
significant_digits = 2
rounded_sampling_period = round(sampling_period, significant_digits -
int(math.floor(math.log10(abs(sampling_period)))) - 1)
sampling_frequency = 1/rounded_sampling_period
print("sampling_period - regular intervals of T = ", sampling_period)
print("rounded_sampling_period ", rounded_sampling_period)
print("sampling_frequency or sample rate = 1/T ", sampling_frequency)
x = cos(2*pi*f*t); # signal sampling
plot(t, x)
I am getting as results:
sampling_period - regular intervals of T = 0.0015037593984962405
rounded_sampling_period 0.0015
sampling_frequency 666.6666666666666
What am I getting wrong about the difference between sample rate and sample period? 666.666 does not seem to make sense.
Thank you in advance for your help.
I believe you are mixing up two different concepts: the signal frequency/period and the sample frequency/period.
The signal period is the interval after which the signal repeats itself; the signal frequency is its reciprocal (40 Hz here).
The sample period is the spacing between consecutive samples (between points in the array in this case); the sample frequency is its reciprocal.
That is why you are getting 666.66: your signal is sampled at 400 points over a time span of 0.6 seconds (0.3 - (-0.3)), which gives a sample period of about 0.0015 s, and 1/0.0015 = 666.66 samples per second.
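A small sketch of the arithmetic, assuming the same tmin, tmax and num as in the question. numpy.linspace includes both endpoints by default, so 400 points span 399 intervals. The helper name below is made up for illustration.

```python
def sample_period(tmin, tmax, num):
    """Spacing of num evenly spaced points that include both endpoints
    (the step that numpy.linspace reports with retstep=True)."""
    return (tmax - tmin) / (num - 1)


signal_frequency = 40.0                      # Hz, a property of the cosine
signal_period = 1.0 / signal_frequency       # seconds per signal cycle

T = sample_period(-0.3, 0.3, 400)            # seconds between samples
fs = 1.0 / T                                 # samples per second

print("signal period            =", signal_period)
print("sample period T          =", T)
print("sample rate fs           =", fs)
print("samples per signal cycle =", fs / signal_frequency)
```

The unrounded rate comes out near 665 samples per second; 666.67 appears only after the period is rounded to 0.0015. Either way it is a property of how densely the signal is sampled, not of the 40 Hz signal itself.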

How can I plot multiple ROC curves together?

I want to find some good predictors (genes). This is my data, log transformed RNA-seq:
TRG CDK6 EGFR KIF2C CDC20
Sample 1 TRG12 11.39 10.62 9.75 10.34
Sample 2 TRG12 10.16 8.63 8.68 9.08
Sample 3 TRG12 9.29 10.24 9.89 10.11
Sample 4 TRG45 11.53 9.22 9.35 9.13
Sample 5 TRG45 8.35 10.62 10.25 10.01
Sample 6 TRG45 11.71 10.43 8.87 9.44
I have calculated confusion matrix for different models like below
1- I tested each of 23 genes individually with this code, and each gene that gave a p-value < 0.05 was kept as a good predictor. For example, for CDK6 I ran:
glm=glm(TRG ~ CDK6, data = df, family = binomial(link = 'logit'))
Finally I obtained five genes and I put them in this model:
final <- glm(TRG ~ CDK6 + CXCL8 + IL6 + ISG15 + PTGS2 , data = df, family = binomial(link = 'logit'))
I want a plot like this for ROC curve of each model but I don't know how to do that
Any help please?
I will give you an answer using the pROC package. Disclaimer: I am the author and maintainer of the package. There are alternative ways to do it.
The plot you are seeing was probably generated by the ggroc function of pROC. In order to generate such a plot from glm models, you need to 1) use the predict function to generate the predictions, 2) generate the ROC curves and store them in a list, preferably named to get a legend automatically, and 3) call ggroc.
glm.cdk6 <- glm(TRG ~ CDK6, data = df, family = binomial(link = 'logit'))
final <- glm(TRG ~ CDK6 + CXCL8 + IL6 + ISG15 + PTGS2 , data = df, family = binomial(link = 'logit'))
rocs <- list()
library(pROC)
rocs[["CDK6"]] <- roc(df$TRG, predict(glm.cdk6))
rocs[["final"]] <- roc(df$TRG, predict(final))
ggroc(rocs)

Displaying a rounded matrix

I want to display a vector with a predefined precision. For instance, let us consider the following vector,
v = [1.2346 2.0012 0.1230 0.0001 1.0000]
If I call,
mat2str(v, 1);
the output should be,
1.2 2.0 0.1 0.0 1.0
If I call,
mat2str(v, 2)
the output should be,
1.24 2.00 0.12 0.00 1.00
and so on.
I tried this code, but it resulted in an empty matrix:
function s = mat2str(mat, precision)
s = sprintf('%.%df ', precision, round(mat, precision));
end
mat2str(similarity, 3)
ans =
Empty string: 1-by-0
How can I display a vector with a predefined number of decimal places?
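The format specifier for sprintf already provides an easy way to do this by using * for the precision field and passing that value as an argument to sprintf. Because sprintf reapplies the format to the remaining arguments, each value needs its own precision argument in front of it. Your function (which I renamed to mat2prec) can therefore be written as follows:
function s = mat2prec(mat, precision)
% Interleave precision/value pairs so every %.*f consumes one of each
args = [repmat(precision, 1, numel(mat)); mat(:).'];
s = sprintf('%.*f ', args);
end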
This one works on my Matlab 2014b:
function s = mat2str(mat, precision)
printstring=strcat('%',num2str(precision),'.',num2str(precision),'f','\t');
s = sprintf(printstring, round(mat, precision));
end
function roundedmat2str(X,N)
NN = num2str(N); % Convert the precision to a string
for ii = size(X,1):-1:1
out(ii,:) = sprintf(['%.' NN 'f \t'],X(ii,:)); % create string
end
disp(out)
end
X=magic(3)+rand(3);N=2;
roundedmat2str(X,N)
8.69 1.03 6.77
3.32 5.44 7.80
4.95 9.38 2.19
X = X(:).';
roundedmat2str(X,N)
8.69 3.32 4.95 1.03 5.44 9.38 6.77 7.80 2.19
Note that sprintf and fprintf already do implicit rounding when setting the number of decimals.
Also: please don't use existing function names for your own functions or variables. Never call a sum sum, a mean mean, or a mat2str mat2str. Do things like total, average and roundedmat2str. This makes your code portable and also makes sure you don't error out when you're using your own function but expect the default and vice-versa.
I think this is what you wanted to do in the first place:
s = sprintf(sprintf('%%.%df ', precision), mat)
EDIT
In case you want to extend your question to matrices, you could use this slightly more complicated one-liner:
s = sprintf([repmat(sprintf('%%.%df ', precision), 1, size(mat, 2)) '\n'], mat')
One noticeable difference from the previous one-liner is that each row, including the last, ends with a newline.

Statsmodels GLM and OLS with formulas missing parameters

I am trying to run a general linear model using formulas on a data set that contains categorical variables. The results summary table appears to be leaving out one of the variables when I list the parameters.
I haven't been able to find docs specific to the GLM showing the output with categorical variables, but I have for OLS, and it looks like it should list each categorical variable separately. When I do it (with GLM or OLS) it leaves out one of the values for each category. For example:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
Data = pd.read_csv(root+'/Illisarvik/TestData.csv')
formula = 'Response~Day+Class+Var'
gm = sm.GLM.from_formula(formula=formula, data=Data,
family=sm.families.Gaussian()).fit()
ls = smf.ols(formula=formula,data=Data).fit()
print (Data)
print(gm.params)
print(ls.params)
Day Class Var Response
0 D A 0.533088 0.582931
1 D B 0.839837 0.075011
2 D C 1.454716 0.505442
3 D A 1.455503 0.188945
4 D B 1.163155 0.144176
5 N A 1.072238 0.918962
6 N B 0.815384 0.249160
7 N C 1.182626 0.520460
8 N A 1.448843 0.870644
9 N B 0.653531 0.460177
Intercept 0.625111
Day[T.N] 0.298084
Class[T.B] -0.439025
Class[T.C] -0.104725
Var -0.118662
dtype: float64
Intercept 0.625111
Day[T.N] 0.298084
Class[T.B] -0.439025
Class[T.C] -0.104725
Var -0.118662
dtype: float64
C:/Users/wesle/Dropbox/PhD_Work/Figures/SkeeterEtAlAnalysis.py:55: FutureWarning: sort is deprecated, use sort_values(inplace=True) for INPLACE sorting
P.sort()
Is there something wrong with my model? The same issue presents itself when I print the full summary table:
print(gm.summary())
print(ls.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: Response No. Observations: 10
Model: GLM Df Residuals: 5
Model Family: Gaussian Df Model: 4
Link Function: identity Scale: 0.0360609978309
Method: IRLS Log-Likelihood: 5.8891
Date: Sun, 05 Mar 2017 Deviance: 0.18030
Time: 23:26:48 Pearson chi2: 0.180
No. Iterations: 2
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.6251 0.280 2.236 0.025 0.077 1.173
Day[T.N] 0.2981 0.121 2.469 0.014 0.061 0.535
Class[T.B] -0.4390 0.146 -3.005 0.003 -0.725 -0.153
Class[T.C] -0.1047 0.170 -0.617 0.537 -0.438 0.228
Var -0.1187 0.222 -0.535 0.593 -0.553 0.316
==============================================================================
OLS Regression Results
==============================================================================
Dep. Variable: Response R-squared: 0.764
Model: OLS Adj. R-squared: 0.576
Method: Least Squares F-statistic: 4.055
Date: Sun, 05 Mar 2017 Prob (F-statistic): 0.0784
Time: 23:26:48 Log-Likelihood: 5.8891
No. Observations: 10 AIC: -1.778
Df Residuals: 5 BIC: -0.2652
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.6251 0.280 2.236 0.076 -0.094 1.344
Day[T.N] 0.2981 0.121 2.469 0.057 -0.012 0.608
Class[T.B] -0.4390 0.146 -3.005 0.030 -0.815 -0.064
Class[T.C] -0.1047 0.170 -0.617 0.564 -0.541 0.332
Var -0.1187 0.222 -0.535 0.615 -0.689 0.451
==============================================================================
Omnibus: 1.493 Durbin-Watson: 2.699
Prob(Omnibus): 0.474 Jarque-Bera (JB): 1.068
Skew: -0.674 Prob(JB): 0.586
Kurtosis: 2.136 Cond. No. 9.75
==============================================================================
This is a consequence of the way the linear model works.
For instance, the categorical variable Day has two levels, so as far as the linear model is concerned it can be represented by a single 'dummy' variable, which is set to 0 (zero) for the first level, namely D, and 1 for the second level, namely N. Statistically speaking, you can recover only the difference between the effects of the two levels of this categorical variable.
If you now consider Class, which has three levels, you have two dummy variables, which represent two differences between the three available levels of this categorical variable (here B versus A and C versus A).
As a matter of fact, it's perfectly possible to expand on this idea using orthogonal polynomials on the treatment means but that's something for another day.
The short answer is that there's nothing wrong, at least on this account, with your model.
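To make the dummy coding concrete, here is a hand-rolled sketch of treatment (reference-level) coding in plain Python. It is not how statsmodels or patsy implement it; the function name is made up, and the column labels merely imitate the names in the output above.

```python
def treatment_code(name, values):
    """Return (column_names, rows) with one 0/1 column per non-reference level.
    The first level in sorted order is the reference and gets no column."""
    levels = sorted(set(values))
    others = levels[1:]  # every level except the reference
    columns = ["%s[T.%s]" % (name, level) for level in others]
    rows = [[1 if v == level else 0 for level in others] for v in values]
    return columns, rows


# The Day and Class columns from the data set in the question
day = ["D", "D", "D", "D", "D", "N", "N", "N", "N", "N"]
cls = ["A", "B", "C", "A", "B", "A", "B", "C", "A", "B"]

day_cols, day_rows = treatment_code("Day", day)
cls_cols, cls_rows = treatment_code("Class", cls)

print(day_cols)   # one column for a two-level factor
print(cls_cols)   # two columns for a three-level factor
for label, row in zip(cls, cls_rows):
    print(label, row)
```

Rows for the reference level A are all zeros, so its effect is absorbed into the intercept; that is why A and D never appear as separate parameters.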

Multiple inputs for coxph

Is there a way to run coxph for multiple inputs? Here I have used the input hsa_let_7b_5p.
coxph(Surv(Time, Status)~ hsa_let_7b_5p, data=as.data.frame(test))
Call:
coxph(formula = Surv(Time, Status) ~ hsa_let_7b_5p, data = as.data.frame(test))
coef exp(coef) se(coef) z p
hsa_let_7b_5p 0.169 1.184 0.173 0.98 0.33
Likelihood ratio test=0.94 on 1 df, p=0.333
n= 91, number of events= 45
It's not too clear to me if this answers the question you meant or the question you asked, but you can add more regression variables to the right side of the formula (after the ~)
coxph(Surv(Time, Status)~ hsa_let_7b_5p + x + y, data=as.data.frame(test))
where x & y are the names of other variables (columns) in your data frame.
You may wish to read up on interactions and stratification at some point.
