Excel: negative R² for a trendline

I keep getting a negative R² when I add a trendline in Excel, as shown in the figure below.
Should I be concerned about the negative sign?
Here is the data:
x y
0.059 0.13
0.095 0.05
0.097 0.02
0.12 0.2
0.146 0.05
0.192 0.11
0.231 0.16
0.25 0.16
0.28 0.09
0.33 0.05
0.36 0.18
0.37 0.24
0.47 0.14
0.76 0.11
1.2 0.07
1.86 0.12

So, a negative R² is possible because of how that value is computed (it is not literally the square of a single number). For a properly specified model, R² lies between 0 and 1, and the interpretation is that R² × 100 percent of the variability in your data is explained by the model.
The interpretation of a negative value is that your trend line fits worse than a horizontal line at the mean of the data. This answer provides a much more thorough explanation.
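You can check this numerically outside Excel. A minimal sketch in Python, assuming the conventional definition R² = 1 - SS_res/SS_tot about the mean, which is exactly what goes negative when the intercept is forced:
import numpy as np

# The data from the question
x = np.array([0.059, 0.095, 0.097, 0.12, 0.146, 0.192, 0.231, 0.25,
              0.28, 0.33, 0.36, 0.37, 0.47, 0.76, 1.2, 1.86])
y = np.array([0.13, 0.05, 0.02, 0.2, 0.05, 0.11, 0.16, 0.16,
              0.09, 0.05, 0.18, 0.24, 0.14, 0.11, 0.07, 0.12])

# Least-squares slope with the intercept forced to 0.05, as in the question
intercept = 0.05
slope = np.sum(x * (y - intercept)) / np.sum(x * x)

ss_res = np.sum((y - (slope * x + intercept)) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
print(1 - ss_res / ss_tot)  # negative: the forced line fits worse than the mean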

When you made your trendline you selected the Set Intercept = option, with intercept = 0.05. In this case Excel returns an R^2 that doesn't have its customary meaning and can be negative; see here.
To fix the problem, unselect the Set Intercept option.
When I run the trendline with the option unselected I get
y = -0.0017x + 0.1182 (R^2 = 0.0002)
Hope that helps.

Related

How do I improve my NLP model to classify 4 different mental illnesses?

I have a CSV dataset containing 2 columns: one is the label, which gives the type of mental illness of the patient, and the other is that user's Reddit posts from a certain time period.
These are the total number of patients in each group of illness:
control: 3000
depression: 2118
bipolar: 1062
ptsd: 330
schizophrenia: 148
For starters I tried binary classification between my depression and bipolar patients. I used TF-IDF vectors and fed them into 2 different types of classifiers: MultinomialNB and SVM.
here is a sample of the code:
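(Both snippets assume the standard scikit-learn imports; these are the documented module paths, with x_train and y_train coming from my own train/test split:)
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier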
using MultinomialNB:
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
text_clf = text_clf.fit(x_train, y_train)
using SVM:
text_clf_svm = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42))])
text_clf_svm = text_clf_svm.fit(x_train, y_train)
these are my results:
              precision    recall  f1-score   support
     bipolar       0.00      0.00      0.00       304
  depression       0.68      1.00      0.81       650
    accuracy                           0.68       954
   macro avg       0.34      0.50      0.41       954
weighted avg       0.46      0.68      0.55       954
The problem is that the models simply predict every patient as belonging to the class with the larger sample; in this case, everyone is predicted to be a depressed patient. I have tried BERT as well, but I get the same accuracy. I have read papers that use the LIWC lexicon, whose categories include variables that characterize linguistic style as well as psychological aspects of language.
I don't know whether what I am doing is correct, or whether there is a better way to classify with NLP; if so, please enlighten me.
Thanks in advance to anybody who reads such a long post and shares their ideas!
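One standard first step against the imbalance described above is to weight classes inversely to their frequency. A minimal sketch, not the poster's code: class_weight='balanced' is a documented SGDClassifier parameter, and the pipeline mirrors the question's setup (x_train and y_train as in the question):
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

# 'balanced' reweights each class by n_samples / (n_classes * class_count),
# so errors on the rare bipolar class cost more during training
text_clf_weighted = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                          class_weight='balanced', random_state=42)),
])
text_clf_weighted.fit(x_train, y_train)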

How to calculate confidence intervals for crude survival rates?

Let's assume that we have a survfit object as follows.
fit = survfit(Surv(data$time_12m, data$status_12m) ~ data$group)
fit
Call: survfit(formula = Surv(data$time_12m, data$status_12m) ~ data$group)
                    n events median 0.95LCL 0.95UCL
data$group=HF   10000   3534     NA      NA      NA
data$group=IGT     70     20     NA      NA      NA
The fit object does not show CIs. How can I calculate confidence intervals for the survival rates? Which R packages and code should be used?

The print result of survfit gives confidence intervals by group for median survival time. I'm guessing the NAs for the median-time estimates occur because your groups do not have enough events to actually reach a median survival time. You should show the output of plot(fit) to see whether my guess is correct.
You might try to plot the KM curves, noting that the plot.survfit function does have a confidence interval option constructed around proportions:
plot(fit, conf.int=0.95, col=1:2)
Please read ?summary.survfit. It is one of the generic summary functions that package authors typically use to deliver parameter estimates and confidence intervals. There you will see that it is not "rates" that summary.survfit reports, but rather estimates of survival proportion. These estimates are either medians (in which case the estimate is on the time scale) or estimates at particular times (in which case the estimates are proportions).
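For example, to pull the survival proportion and its 95% CI at chosen times from the question's fit object (times= is a documented summary.survfit argument; pick times on the scale of your time variable):
summary(fit, times = c(3, 6, 12))  # proportion surviving, std. err., 95% CI at each time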
If you actually do want rates, then use a function designed for that sort of model, perhaps ?survreg. Compare what you get from survreg versus survfit on the supplied dataset ovarian:
> reg.fit <- survreg( Surv(futime, fustat)~rx, data=ovarian)
> summary(reg.fit)
Call:
survreg(formula = Surv(futime, fustat) ~ rx, data = ovarian)
              Value Std. Error     z       p
(Intercept)   6.265      0.778  8.05 8.3e-16
rx            0.559      0.529  1.06    0.29
Log(scale)   -0.121      0.251 -0.48    0.63
Scale= 0.886
Weibull distribution
Loglik(model)= -97.4 Loglik(intercept only)= -98
Chisq= 1.18 on 1 degrees of freedom, p= 0.28
Number of Newton-Raphson Iterations: 5
n= 26
#-------------
> fit <- survfit( Surv(futime, fustat)~rx, data=ovarian)
> summary(fit)
Call: survfit(formula = Surv(futime, fustat) ~ rx, data = ovarian)
                rx=1
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
   59     13       1    0.923  0.0739        0.789        1.000
  115     12       1    0.846  0.1001        0.671        1.000
  156     11       1    0.769  0.1169        0.571        1.000
  268     10       1    0.692  0.1280        0.482        0.995
  329      9       1    0.615  0.1349        0.400        0.946
  431      8       1    0.538  0.1383        0.326        0.891
  638      5       1    0.431  0.1467        0.221        0.840

                rx=2
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
  353     13       1    0.923  0.0739        0.789        1.000
  365     12       1    0.846  0.1001        0.671        1.000
  464      9       1    0.752  0.1256        0.542        1.000
  475      8       1    0.658  0.1407        0.433        1.000
  563      7       1    0.564  0.1488        0.336        0.946
It might have been easier if I had used "exponential" instead of "weibull" as the distribution type. Exponential fits estimate a single parameter and are more easily back-transformed to give estimates of rates.
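For instance, a sketch of that exponential alternative (assuming survreg's standard parameterization, where the linear predictor models the log of mean survival time):
library(survival)
exp.fit <- survreg(Surv(futime, fustat) ~ rx, data = ovarian, dist = "exponential")
# For the exponential model the hazard (event rate) is exp(-linear predictor)
lp <- predict(exp.fit, type = "lp")  # linear predictor for each subject
rates <- exp(-lp)                    # per-unit-time event rates
unique(rates)                        # one rate per rx group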
Note: I answered an earlier question about survfit, although the request was for survival times rather than for rates. Extract survival probabilities in Survfit by groups

Gnuplot - How to smoothly join ordered points?

I have a set of data in three columns:
1st column: order criterion between 0 and 1
2nd: x vals
3rd: y vals
As a data file example:
0.027 -29.3 -29.6
0.071 -26.0 -31.0
0.202 -14.0 -32.8
0.304 -3.4 -29.3
0.329 -0.5 -26.0
0.409 6.7 -14.0
0.458 11.7 -3.4
0.471 12.8 -0.5
0.495 12.5 6.7
0.588 18.8 11.7
0.600 20.4 12.8
0.618 20.8 12.5
0.674 20.9 18.8
0.754 22.1 20.4
0.810 27.0 20.8
0.874 24.7 20.9
0.892 9.4 22.1
0.911 -11.5 27.0
0.943 -23.7 24.7
0.962 -29.6 9.4
0.991 -31.0 -11.5
0.999 -32.8 -23.7
My goal is to plot the (x,y) points and a trend curve passing through each point, with the points ordered in ascending order of the first-column values.
I use the following script:
set terminal png small size 600,450
set output "my_data_mcsplines_joined_points.png"
set table "table_interpolation.dat"
plot 'my_data.dat' using 2:3 smooth mcsplines
unset table
plot 'my_data.dat' using 2:3:(sprintf("%'.3f", $1)) with labels point pt 7 offset char 1,1 notitle ,\
"table_interpolation.dat" with lines notitle
Here is the mcsplines result as an example:
[figure: points joined by mcsplines]
The resulting curve should have the shape of a spindle or a loop.
Whatever smooth option is used, Gnuplot seems unable to achieve this: unfortunately, most of the smooth options (mcsplines, csplines, ...) reorder the data monotonically in x.
How can I plot a trend curve passing through each point, with the points ordered in ascending order of the first-column values?
Thanks.
I cannot post an image in a comment, so I am placing it here. I don't think a 2D plot will be sufficient, based on this 3D scatterplot of the data in your question.
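One workaround (a sketch, not from the thread) is to treat the first column as a parameter t and smooth x(t) and y(t) separately, then plot the two interpolated tables against each other. It assumes gnuplot's default three-column table output and a Unix paste command:
set table "tx.dat"
plot 'my_data.dat' using 1:2 smooth mcsplines   # x as a function of the order parameter
unset table
set table "ty.dat"
plot 'my_data.dat' using 1:3 smooth mcsplines   # y as a function of the order parameter
unset table
# after paste, column 2 is x(t) from tx.dat and column 5 is y(t) from ty.dat
plot "< paste tx.dat ty.dat" using 2:5 with lines notitle, \
     'my_data.dat' using 2:3 with points pt 7 notitle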

Multiple plots with gnuplot by grouping columns

I have a data file with the schema "object parameter output1 output2 ... outputk". For example:
A 0.1 0.2 0.43 0.81 0.60
A 0.2 0.1 0.42 0.83 0.62
A 0.3 0.5 0.48 0.84 0.65
B 0.1 0.1 0.42 0.83 0.62
B 0.2 0.1 0.82 0.93 0.61
B 0.3 0.5 0.48 0.34 0.15
...
I want to create multiple plots, each plot corresponding to one object, with the x axis being the parameter and the series being the outputs. Currently I've written a Python script which dumps the rows for each object into different files and then calls gnuplot. Is there a more elegant way to plot this?
You are looking for this:
plot 'data.txt' using (strcol(1) eq "A" ? $2 : 1/0):4 with line
which plots column 4 against the parameter for object A only (rows for other objects evaluate to the invalid point 1/0 and are skipped).
If you would like to create plots for every object use:
do for [object in "A B"] {
reset
set title sprintf("Object %s",object)
plot 'data.txt' using (strcol(1) eq object ? $2 : 1/0):4 notitle with line
pause -1
}
Just press Enter for the next plot.
Of course you can export these plots to files, too.
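For example, a sketch of that export (it assumes the pngcairo terminal is available in your gnuplot build):
do for [object in "A B"] {
    set terminal pngcairo size 600,450
    set output sprintf("plot_%s.png", object)
    set title sprintf("Object %s", object)
    plot 'data.txt' using (strcol(1) eq object ? $2 : 1/0):4 notitle with lines
    unset output
}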

Referring to objects using variable strings in R

Edit: Thanks to those who have responded so far; I'm very much a beginner in R and have just taken on a large project for my MSc dissertation, so I am a bit overwhelmed with the initial processing. The data I'm using is as follows (from WMO publicly available rainfall data):
120 6272100 KHARTOUM 15.60 32.55 382 1899 1989 0.0
1899 0.03 0.03 0.03 0.03 0.03 1.03 13.03 12.03 9999 6.03 0.03 0.03
1900 0.03 0.03 0.03 0.03 0.03 23.03 80.03 47.03 23.03 8.03 0.03 0.03
1901 0.03 0.03 0.03 0.03 0.03 17.03 23.03 17.03 0.03 8.03 0.03 0.03
(...)
120 6272101 JEBEL AULIA 15.20 32.50 380 1920 1988 0.0
1920 0.03 0.03 0.03 0.00 0.03 6.90 20.00 108.80 47.30 1.00 0.01 0.03
1921 0.03 0.03 0.03 0.00 0.03 0.00 88.00 57.00 35.00 18.50 0.01 0.03
1922 0.03 0.03 0.03 0.00 0.03 0.00 87.50 102.30 10.40 15.20 0.01 0.03
(...)
There are ~100 observation stations that I'm interested in, each of which has a varying start and end date for rainfall measurements. They're formatted as above in a single data file, with stations separated by "120 (station number) (station name)".
I need first to separate this file by station, then to extract March, April, May and June for each year, then take a total of these months for each year. So far I'm messing around with loops (as below), but I understand this isn't the right way to go about it and would rather learn some better technique.
Thanks again for the help!
(Original question:)
I've got a large data set containing rainfall by season for ~100 years over 100+ locations. I'm trying to separate this data into more managable arrays, and in particular I want to retrieve the sum of the rainfall for March, April, May and June for each station for each year.
The following is a simplified version of my code so far:
a <- array(1, dim = c(10, 12))
for (i in 1:5) {
  # all data:
  assign(paste("station_", i, sep = ""), a)
  # March - June data:
  assign(paste("station_", i, "_mamj", sep = ""), a[, 4:7])
}
So this gives me station_(i)_mamj, which contains the data for the months I'm interested in for each station. Now I want to sum each row of this array and enter it in a new array called station_(i)_mamj_tot. Simple enough in theory, but I can't work out how to reference station_(i)_mamj so that the value of i varies with each iteration. Any help much appreciated!
This is totally begging for a data frame; then it's just this one-liner with power tools like ddply (amazingly powerful):
tot_mamj <- ddply(rain[rain$month %in% 3:6,-2], 'year', colwise(sum))
giving your aggregate of total for M/A/M/J, by year:
  year station_1 station_2 station_3 station_4 station_5 ...
1 1972  8.618960  5.697739 10.083192  9.264512 11.152378 ...
2 1973 18.571748 18.903280 11.832462 18.262272 10.509621 ...
3 1974 22.415201 22.670821 32.850745 31.634717 20.523778 ...
4 1975 16.773286 17.683704 18.259066 14.996550 19.007762 ...
...
Below is perfectly working code. We create a data frame whose column names are 'station_n', plus extra columns for year and month (a factor, or an integer if you're lazy; see the footnote). Now you can do arbitrary analysis by month or year (using plyr's split-apply-combine paradigm):
require(plyr) # for d*ply, summarise
#require(reshape) # for melt
# Parameterize everything here, it's crucial for testing/debugging
all_years <- c(1970:2011)
nYears <- length(all_years)
nStations <- 101
# We want station names as vector of chr (as opposed to simple indices)
station_names <- paste ('station_', 1:nStations, sep='')
rain <- data.frame(cbind(
  year = rep(all_years, 12),  # use the parameterized years from above
  month = 1:12
))
# Fill in NAs for all data
rain[,station_names] <- as.numeric(NA)
# Make 'month' a factor, to prevent any numerical funny stuff e.g accidentally 'aggregating' it
rain$month <- factor(rain$month)
# For convenience, store the row indices for all years, M/A/M/J
I.mamj <- which(rain$month %in% 3:6)
# Insert made-up seasonal data for M/A/M/J for testing... leave everything else NA intentionally
rain[I.mamj,station_names] <- c(3,5,9,6) * runif(4*nYears*nStations)
# Get our aggregate of MAMJ totals, by year
# The '-2' column index means: "exclude month, to prevent it also getting 'aggregated'"
excludeMonthCol = -2
tot_mamj <- ddply(rain[rain$month %in% 3:6, excludeMonthCol], 'year', colwise(sum))
# voila!!
# year station_1 station_2 station_3 station_4 station_5
# 1 1972 8.618960 5.697739 10.083192 9.264512 11.152378
# 2 1973 18.571748 18.903280 11.832462 18.262272 10.509621
# 3 1974 22.415201 22.670821 32.850745 31.634717 20.523778
# 4 1975 16.773286 17.683704 18.259066 14.996550 19.007762
As a footnote: before I converted month from numeric to factor, it was getting silently 'aggregated' (until I put in the '-2' exclude-column reference).
However, better still, when you make it a factor it will refuse point-blank to be aggregated and will throw an error (which is desirable for debugging):
ddply(rain[rain$month %in% 3:6, ], 'year', colwise(sum))
Error in Summary.factor(c(3L, 3L, 3L, 3L, 3L, 3L), na.rm = FALSE) :
sum not meaningful for factors
For your original question, use get():
i <- 10
var <- paste("test", i, sep = "_")
assign(var, 10)  # note: assign takes the name first, then the value
get(var)
As David said, this is probably not the best path to be taking, but it can be useful at times (and IMO the assign/get construct is far better than eval(parse))
Why are you using assign to create variables like station_1, station_2, station_3_mamj, and so on? It would be much easier and more intuitive to store them in a list, like stations[[1]], stations[[2]], stations_mamj[[3]], and such. Then each could be accessed using its index.
Since it looks like each piece of per-station data you're working with is a matrix of the same size, you could even deal with them as a three-dimensional matrix.
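A minimal sketch of that list-based approach, reusing the toy array a from the question (the names are for illustration only):
stations <- vector("list", 5)
stations_mamj <- vector("list", 5)
for (i in 1:5) {
  stations[[i]]      <- a          # all data for station i
  stations_mamj[[i]] <- a[, 4:7]   # March - June columns only
}
# per-year totals for every station at once, no get()/assign() needed
station_mamj_tot <- lapply(stations_mamj, rowSums)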
ETA: Incidentally, if you really want to solve the problem this way, you would do:
eval(parse(text=paste("station", i, "mamj", sep="_")))
But don't: using eval is almost always bad practice, and it will make it difficult to do even simple operations on your data.