Referring to objects using variable strings in R
Edit: Thanks to those who have responded so far; I'm very much a beginner in R and have just taken on a large project for my MSc dissertation, so I am a bit overwhelmed with the initial processing. The data I'm using is as follows (from WMO publicly available rainfall data):
120 6272100 KHARTOUM 15.60 32.55 382 1899 1989 0.0
1899 0.03 0.03 0.03 0.03 0.03 1.03 13.03 12.03 9999 6.03 0.03 0.03
1900 0.03 0.03 0.03 0.03 0.03 23.03 80.03 47.03 23.03 8.03 0.03 0.03
1901 0.03 0.03 0.03 0.03 0.03 17.03 23.03 17.03 0.03 8.03 0.03 0.03
(...)
120 6272101 JEBEL AULIA 15.20 32.50 380 1920 1988 0.0
1920 0.03 0.03 0.03 0.00 0.03 6.90 20.00 108.80 47.30 1.00 0.01 0.03
1921 0.03 0.03 0.03 0.00 0.03 0.00 88.00 57.00 35.00 18.50 0.01 0.03
1922 0.03 0.03 0.03 0.00 0.03 0.00 87.50 102.30 10.40 15.20 0.01 0.03
(...)
There are ~100 observation stations that I'm interested in, each of which has a varying start and end date for rainfall measurements. They're formatted as above in a single data file, with stations separated by "120 (station number) (station name)".
I need first to separate this file by station, then to extract March, April, May and June for each year, then take a total of these months for each year. So far I'm messing around with loops (as below), but I understand this isn't the right way to go about it and would rather learn some better technique.
Thanks again for the help!
(Original question:)
I've got a large data set containing rainfall by season for ~100 years over 100+ locations. I'm trying to separate this data into more manageable arrays, and in particular I want to retrieve the sum of the rainfall for March, April, May and June for each station for each year.
The following is a simplified version of my code so far:
a <- array(1, dim=c(10,12))
for (i in 1:5) {
  # all data:
  assign(paste("station_", i, sep=""), a)
  # March-June data:
  assign(paste("station_", i, "_mamj", sep=""), a[,4:7])
}
So this gives me station_(i)_mamj, which contains the data for the months I'm interested in for each station. Now I want to sum each row of this array and enter the result in a new array called station_(i)_mamj_tot. Simple enough in theory, but I can't work out how to reference station_(i)_mamj so that the value of i changes with each iteration. Any help much appreciated!
This is begging for a data frame; once the data is in one, the aggregation is a one-liner with power tools like plyr's ddply (amazingly powerful):
tot_mamj <- ddply(rain[rain$month %in% 3:6,-2], 'year', colwise(sum))
giving the aggregated M/A/M/J totals by year:
year station_1 station_2 station_3 station_4 station_5 ...
1 1972 8.618960 5.697739 10.083192 9.264512 11.152378 ...
2 1973 18.571748 18.903280 11.832462 18.262272 10.509621 ...
3 1974 22.415201 22.670821 32.850745 31.634717 20.523778 ...
4 1975 16.773286 17.683704 18.259066 14.996550 19.007762 ...
...
Below is fully working code. We create a data frame whose column names are 'station_n', plus extra columns for year and month (a factor, or an integer if you're lazy; see the footnote). You can then do arbitrary analysis by month or year, using plyr's split-apply-combine paradigm:
require(plyr)    # for ddply, colwise
#require(reshape) # for melt

# Parameterize everything here; it's crucial for testing/debugging
all_years <- 1970:2011
nYears    <- length(all_years)
nStations <- 101

# We want station names as a character vector (as opposed to simple indices)
station_names <- paste('station_', 1:nStations, sep='')

# One row for every (year, month) combination
rain <- data.frame(
  year  = rep(all_years, each = 12),
  month = rep(1:12, times = nYears)
)

# Fill in NAs for all station data
rain[, station_names] <- as.numeric(NA)

# Make 'month' a factor, to prevent any numerical funny stuff, e.g. accidentally 'aggregating' it
rain$month <- factor(rain$month)

# For convenience, store the row indices for all years, M/A/M/J
I.mamj <- which(rain$month %in% 3:6)

# Insert made-up seasonal data for M/A/M/J for testing... leave everything else NA intentionally
rain[I.mamj, station_names] <- c(3,5,9,6) * runif(4*nYears*nStations)

# Get our aggregate of MAMJ totals, by year
# The '-2' column index means: "exclude month, to prevent it also getting 'aggregated'"
excludeMonthCol <- -2
tot_mamj <- ddply(rain[rain$month %in% 3:6, excludeMonthCol], 'year', colwise(sum))
# voila!!
# year station_1 station_2 station_3 station_4 station_5
# 1 1972 8.618960 5.697739 10.083192 9.264512 11.152378
# 2 1973 18.571748 18.903280 11.832462 18.262272 10.509621
# 3 1974 22.415201 22.670821 32.850745 31.634717 20.523778
# 4 1975 16.773286 17.683704 18.259066 14.996550 19.007762
As a footnote: before I converted month from numeric to factor, it was silently getting 'aggregated' along with everything else (until I put in the '-2' exclude-column reference).
Better still, once you make month a factor it will refuse point-blank to be aggregated and will throw an error, which is desirable for debugging:
ddply(rain[rain$month %in% 3:6, ], 'year', colwise(sum))
Error in Summary.factor(c(3L, 3L, 3L, 3L, 3L, 3L), na.rm = FALSE) :
sum not meaningful for factors
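For comparison, the same aggregation can be done without plyr. Here is a minimal base-R sketch, assuming the same rain data frame and station_names vector built above:
# Keep only the March-June rows, dropping the month column so it can't be summed
mamj <- rain[rain$month %in% 3:6, c("year", station_names)]
# aggregate() sums every station column within each year
tot_mamj_base <- aggregate(mamj[, station_names], by = list(year = mamj$year), FUN = sum)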
For your original question, use get():
i <- 10
var <- paste("test", i, sep="_")
assign(var, 10)
get(var)
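Applied to the loop in the question, a sketch might look like the following (this assumes the station_(i)_mamj arrays from the question's loop already exist; rowSums() computes the per-row totals):
for (i in 1:5) {
  mamj <- get(paste("station_", i, "_mamj", sep=""))
  # build station_(i)_mamj_tot as the row totals of station_(i)_mamj
  assign(paste("station_", i, "_mamj_tot", sep=""), rowSums(mamj))
}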
As David said, this is probably not the best path to be taking, but it can be useful at times (and IMO the assign/get construct is far better than eval(parse)).
Why are you using assign to create variables like station_1, station_2, station_3_mamj and so on? It would be much easier and more intuitive to store them in a list, like stations[[1]], stations[[2]], stations_mamj[[3]], and so on. Then each one can be accessed by its index.
Since it looks like each piece of per-station data you're working with is a matrix of the same size, you could even deal with them as a single three-dimensional array.
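For example, here is a minimal sketch of the list approach, reusing the made-up array a and the 4:7 column range from the question's own loop:
stations      <- list()
stations_mamj <- list()
for (i in 1:5) {
  stations[[i]]      <- a          # all data for station i
  stations_mamj[[i]] <- a[, 4:7]   # March-June columns
}
# Row totals for every station at once, with no need for get()/assign():
stations_mamj_tot <- lapply(stations_mamj, rowSums)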
ETA: Incidentally, if you really want to solve the problem this way, you would do:
eval(parse(text=paste("station", i, "mamj", sep="_")))
But don't: using eval is almost always bad practice, and it will make it difficult to do even simple operations on your data.