Meaning of NER training values using spaCy - python-3.x

Please explain the meaning of the columns in the output when training a spaCy NER model:
E # LOSS TOK2VEC LOSS NER ENTS_F ENTS_P ENTS_R SCORE
--- ------ ------------ -------- ------ ------ ------ ------
0 0 0.00 78.11 26.82 22.88 32.41 0.27
26 200 82.40 3935.97 94.44 94.44 94.44 0.94
59 400 50.37 2338.60 94.91 94.91 94.91 0.95
98 600 66.31 2646.82 92.13 92.13 92.13 0.92
146 800 85.11 3097.20 94.91 94.91 94.91 0.95
205 1000 92.20 3472.80 94.91 94.91 94.91 0.95
271 1200 124.10 3604.98 94.91 94.91 94.91 0.95
I know that ENTS_F, ENTS_P, and ENTS_R represent the F-score, precision, and recall, respectively, and that SCORE is the overall model score.
What is the formula for SCORE?
Where can I see the documentation about these columns?
What do the # and E columns stand for?
Please point me to the relevant docs; I couldn't find proper documentation about these columns apart from here.

# refers to iterations (or batches), and E refers to epochs.
The SCORE is calculated as a weighted average of the other metrics, with the weights taken from the [training.score_weights] section of your config file. This is documented here.
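For intuition, here is a minimal sketch of that weighted average (not spaCy's internal code; the weights below assume the default NER-only weighting of ents_f: 1.0, ents_p: 0.0, ents_r: 0.0):
def weighted_score(scores, weights):
    # Metrics in the training log are percentages; spaCy's internal scores are
    # on a 0-1 scale, so divide by 100 here for illustration.
    total = sum(weights.values())
    return sum(scores[name] / 100.0 * w / total for name, w in weights.items())

# Row "26 200" of the table: 94.44 * 1.0 -> 0.9444, printed as 0.94.
print(weighted_score(
    {"ents_f": 94.44, "ents_p": 94.44, "ents_r": 94.44},
    {"ents_f": 1.0, "ents_p": 0.0, "ents_r": 0.0},
))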

Related

Using rdrobust to calculate 2sls and getting error message stating "should be set within the range of x"

I am trying to estimate a two-stage least squares (2SLS) model in Stata. My dataset looks like the following:
income bmi health_index asian black q_o_l age aide
100 19 99 1 0 87 23 1
0 21 87 1 0 76 29 0
1002 23 56 0 1 12 47 1
2200 24 67 1 0 73 43 0
2076 21 78 1 0 12 73 1
I am trying to use rdrobust to estimate the treatment effect. I used the following code:
rdrobust q_o_l aide health_index bmi income asian black age, c(10)
I varied the income variable with multiple polynomial forms and used multiple bandwidths. I keep getting the same error message stating:
c() should be set within the range of aide
I am assuming that this has to do with the bandwidth. How can I correct it?
You have two issues with the syntax. You wrote:
rdrobust q_o_l aide health_index bmi income asian black age, c(10)
This will ignore the variables health_index through age, since you can only have one running variable (covariates you want to adjust for need to go in the covs() option). It will then try to use a cutoff of 10 for aide, because the second variable listed is always the running variable. Since aide is binary, Stata complains that the cutoff is outside its range.
It's not obvious to me what makes sense in your setting, but here's an example demonstrating the problem and the two remedies:
. use "http://fmwww.bc.edu/repec/bocode/r/rdrobust_senate.dta", clear
. rdrobust vote margin, c(0) covs(state year class termshouse termssenate population)
Covariate-adjusted sharp RD estimates using local polynomial regression.
Cutoff c = 0 | Left of c Right of c Number of obs = 1108
-------------------+---------------------- BW type = mserd
Number of obs | 491 617 Kernel = Triangular
Eff. Number of obs | 309 279 VCE method = NN
Order est. (p) | 1 1
Order bias (q) | 2 2
BW est. (h) | 17.669 17.669
BW bias (b) | 28.587 28.587
rho (h/b) | 0.618 0.618
Outcome: vote. Running variable: margin.
--------------------------------------------------------------------------------
Method | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------------+------------------------------------------------------------
Conventional | 6.8862 1.3971 4.9291 0.000 4.14804 9.62438
Robust | - - 4.2540 0.000 3.78697 10.258
--------------------------------------------------------------------------------
Covariate-adjusted estimates. Additional covariates included: 6
. sum margin
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
margin | 1,390 7.171159 34.32488 -100 100
. rdrobust vote margin state year class termshouse termssenate population, c(7) // margin rang
> es from -100 to 100
Sharp RD estimates using local polynomial regression.
Cutoff c = 7 | Left of c Right of c Number of obs = 1297
-------------------+---------------------- BW type = mserd
Number of obs | 744 553 Kernel = Triangular
Eff. Number of obs | 334 215 VCE method = NN
Order est. (p) | 1 1
Order bias (q) | 2 2
BW est. (h) | 14.423 14.423
BW bias (b) | 24.252 24.252
rho (h/b) | 0.595 0.595
Outcome: vote. Running variable: margin.
--------------------------------------------------------------------------------
Method | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------------+------------------------------------------------------------
Conventional | .1531 1.7487 0.0875 0.930 -3.27434 3.58053
Robust | - - -0.0718 0.943 -4.25518 3.95464
--------------------------------------------------------------------------------
. rdrobust vote margin state year class termshouse termssenate population, c(-100) // nonsensical
> cutoff for margin
c() should be set within the range of margin
r(125);
end of do-file
r(125);
You might also find this answer interesting.

How do I improve my NLP model to classify 4 different mental illnesses?

I have a CSV dataset containing 2 columns: one is the label, which indicates the patient's type of mental illness, and the other is the corresponding Reddit posts from that user over a certain time period.
These are the total number of patients in each group of illness:
control: 3000
depression: 2118
bipolar: 1062
ptsd: 330
schizophrenia: 148
For starters, I tried binary classification between my depression and bipolar patients. I used TF-IDF vectors and fed them into 2 different types of classifiers: MultinomialNB and an SVM.
Here is a sample of the code:
Using MultinomialNB:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])
text_clf = text_clf.fit(x_train, y_train)
Using SVM (a linear SVM trained with SGD):
from sklearn.linear_model import SGDClassifier
text_clf_svm = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42))])
text_clf_svm = text_clf_svm.fit(x_train, y_train)
These are my results:
precision recall f1-score support
bipolar 0.00 0.00 0.00 304
depression 0.68 1.00 0.81 650
accuracy 0.68 954
macro avg 0.34 0.50 0.41 954
weighted avg 0.46 0.68 0.55 954
The problem is that the models simply predict every patient as belonging to the larger class; in this case, everyone is predicted as a depression patient. I have tried BERT as well, but I get the same accuracy. I have read papers that use the LIWC lexicon, whose categories include variables characterizing linguistic style as well as psychological aspects of language.
I don't know whether what I am doing is correct, or whether there is a better way to do this classification with NLP; if so, please enlighten me.
Thanks in advance to anybody who reads such a long post and shares their ideas!
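As a minimal sketch of one common first step for the imbalance described above (assuming the same x_train/y_train split as in the question and a held-out x_test/y_test): reweight the classes inside the same pipeline and inspect per-class metrics.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report

# Same TF-IDF + linear SVM pipeline as above, but with class weights set
# inversely proportional to class frequencies so the minority class
# ("bipolar") is not ignored during training.
text_clf_balanced = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                              class_weight='balanced', random_state=42)),
])
text_clf_balanced.fit(x_train, y_train)
print(classification_report(y_test, text_clf_balanced.predict(x_test)))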

How to calculate confidence intervals for crude survival rates?

Let's assume that we have a survfit object as follows.
fit = survfit(Surv(data$time_12m, data$status_12m) ~ data$group)
fit
Call: survfit(formula = Surv(data$time_12m, data$status_12m) ~ data$group)
n events median 0.95LCL 0.95UCL
data$group=HF 10000 3534 NA NA NA
data$group=IGT 70 20 NA NA NA
The fit object does not show CIs. How can I calculate confidence intervals for the survival rates? Which R packages and code should be used?
The printed result of survfit gives confidence intervals by group for the median survival time. I'm guessing the NAs for the median time estimates occur because your groups do not have enough events to actually reach a median survival. You should show the output of plot(fit) to see whether my guess is correct.
You might try to plot the KM curves, noting that the plot.survfit function does have a confidence interval option constructed around proportions:
plot(fit, conf.int=0.95, col=1:2)
Please read ?summary.survfit. It is the class-specific summary method that package authors typically use to deliver parameter estimates and confidence intervals. There you will see that it is not "rates" that are summarized by summary.survfit, but rather estimates of survival proportions. These can either be medians (in which case the estimate is on the time scale) or estimates at particular times (in which case the estimates are proportions).
If you actually do want rates, then use a function designed for that sort of model, perhaps starting with ?survreg. Compare what you get from survreg versus survfit on the supplied dataset ovarian:
> reg.fit <- survreg( Surv(futime, fustat)~rx, data=ovarian)
> summary(reg.fit)
Call:
survreg(formula = Surv(futime, fustat) ~ rx, data = ovarian)
Value Std. Error z p
(Intercept) 6.265 0.778 8.05 8.3e-16
rx 0.559 0.529 1.06 0.29
Log(scale) -0.121 0.251 -0.48 0.63
Scale= 0.886
Weibull distribution
Loglik(model)= -97.4 Loglik(intercept only)= -98
Chisq= 1.18 on 1 degrees of freedom, p= 0.28
Number of Newton-Raphson Iterations: 5
n= 26
#-------------
> fit <- survfit( Surv(futime, fustat)~rx, data=ovarian)
> summary(fit)
Call: survfit(formula = Surv(futime, fustat) ~ rx, data = ovarian)
rx=1
time n.risk n.event survival std.err lower 95% CI upper 95% CI
59 13 1 0.923 0.0739 0.789 1.000
115 12 1 0.846 0.1001 0.671 1.000
156 11 1 0.769 0.1169 0.571 1.000
268 10 1 0.692 0.1280 0.482 0.995
329 9 1 0.615 0.1349 0.400 0.946
431 8 1 0.538 0.1383 0.326 0.891
638 5 1 0.431 0.1467 0.221 0.840
rx=2
time n.risk n.event survival std.err lower 95% CI upper 95% CI
353 13 1 0.923 0.0739 0.789 1.000
365 12 1 0.846 0.1001 0.671 1.000
464 9 1 0.752 0.1256 0.542 1.000
475 8 1 0.658 0.1407 0.433 1.000
563 7 1 0.564 0.1488 0.336 0.946
It might have been easier if I had used "exponential" instead of "weibull" as the distribution type. Exponential fits estimate a single parameter and are more easily back-transformed to give estimates of rates.
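For reference, a sketch of that back-transformation (assuming survreg's accelerated-failure-time parameterization, where with dist = "exponential" the linear predictor is the log of the mean survival time):
estimated rate(x) = exp(-(Intercept + coef * x))
That is, the estimated event rate is the exponential of the negative linear predictor, and the estimated mean survival time is its reciprocal.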
Note: I answered an earlier question about survfit, although the request was for survival times rather than for rates. Extract survival probabilities in Survfit by groups

Decimal Point Normalization in Python

I am trying to apply normalization to my data, and I have tried the conventional scaling techniques using the sklearn packages readily available for this kind of requirement. However, I am looking to implement something called decimal scaling.
I read about it in this research paper, and it looks like a technique that can improve the results of a neural network regression. As per my understanding, this is what needs to be done:
Suppose the range of attribute X is −4856 to 28. The maximum absolute value of X is 4856.
To normalize by decimal scaling I will need to divide each value by 10000 (c = 4). In this case, −4856 becomes −0.4856 while 28 becomes 0.0028.
So for all values: new value = old value / 10^c, where c is the number of digits in the maximum absolute value of the attribute (chosen so that every scaled value has absolute value less than 1).
How can I reproduce this as a function in Python so as to normalize all the features(column by column) in my data set?
Input:
A B C
30 90 75
56 168 140
28 84 70
369 1107 922.5
485 1455 1212.5
4856 14568 12140
40 120 100
56 168 140
45 135 112.5
78 234 195
899 2697 2247.5
Output:
A B C
0.003 0.0009 0.0075
0.0056 0.00168 0.014
0.0028 0.00084 0.007
0.0369 0.01107 0.09225
0.0485 0.01455 0.12125
0.4856 0.14568 1.214
0.004 0.0012 0.01
0.0056 0.00168 0.014
0.0045 0.00135 0.01125
0.0078 0.00234 0.0195
0.0899 0.02697 0.22475
Thank you guys for asking questions that led me to think about my problem more clearly and break it into steps. I have arrived at a solution. Here's what my solution looks like:
def Dec_scale(df):
    # For each column, divide by 10**q, where q is the number of digits in
    # the column's largest absolute value.
    for x in df:
        p = df[x].abs().max()
        q = len(str(int(p)))
        df[x] = df[x] / 10**q
I hope this solution looks agreeable!
import math

def decimal_scaling(df):
    # Divide each column by 10**d, where d is the number of digits in that
    # column's maximum absolute value, i.e. d = int(log10(max_abs)) + 1.
    max_values = df.abs().max()
    divisors = [10 ** (int(math.log10(m)) + 1) for m in max_values]
    return df / divisors
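For reference, a quick usage sketch of the decimal_scaling function above with the sample input from the question:
import pandas as pd

df = pd.DataFrame({
    'A': [30, 56, 28, 369, 485, 4856, 40, 56, 45, 78, 899],
    'B': [90, 168, 84, 1107, 1455, 14568, 120, 168, 135, 234, 2697],
    'C': [75, 140, 70, 922.5, 1212.5, 12140, 100, 140, 112.5, 195, 2247.5],
})
# Each column is divided by 10**d, where d is the number of digits in that
# column's largest absolute value (e.g. column A by 10**4, since its max is 4856).
print(decimal_scaling(df))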

Treatment factor variable omitted in stata regression

I'm running a basic difference-in-differences regression model with year and county fixed effects with the following code:
xtreg ln_murder_rate i.treated##i.after_1980 i.year ln_deprivation ln_foreign_born young_population manufacturing low_skill_sector unemployment ln_median_income [weight = mean_population], fe cluster(fips) robust
i.treated is a dichotomous measure of whether a county received the treatment over the lifetime of the study, and after_1980 marks the post-treatment period. However, when I run this regression, the estimate for my treatment variable is omitted, so I can't really interpret the results. Below is the output. I would love some guidance on what to check so that I can get an estimate for the treated variable prior to treatment.
xtreg ln_murder_rate i.treated##i.after_1980 i.year ln_deprivation ln_foreign_bo
> rn young_population manufacturing low_skill_sector unemployment ln_median_income
> [weight = mean_population], fe cluster(fips) robust
(analytic weights assumed)
note: 1.treated omitted because of collinearity
note: 2000.year omitted because of collinearity
Fixed-effects (within) regression Number of obs = 15,221
Group variable: fips Number of groups = 3,117
R-sq: Obs per group:
within = 0.2269 min = 1
between = 0.1093 avg = 4.9
overall = 0.0649 max = 5
F(12,3116) = 89.46
corr(u_i, Xb) = 0.0502 Prob > F = 0.0000
(Std. Err. adjusted for 3,117 clusters in fips)
---------------------------------------------------------------------------------
| Robust
ln_murder_rate | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
1.treated | 0 (omitted)
1.after_1980 | .2012816 .1105839 1.82 0.069 -.0155431 .4181063
|
treated#|
after_1980 |
1 1 | .0469658 .0857318 0.55 0.584 -.1211307 .2150622
|
year |
1970 | .4026329 .0610974 6.59 0.000 .2828376 .5224282
1980 | .6235034 .0839568 7.43 0.000 .4588872 .7881196
1990 | .4040176 .0525122 7.69 0.000 .3010555 .5069797
2000 | 0 (omitted)
|
ln_deprivation | .3500093 .119083 2.94 0.003 .1165202 .5834983
ln_foreign_born | .0179036 .0616842 0.29 0.772 -.1030421 .1388494
young_populat~n | .0030727 .0081619 0.38 0.707 -.0129306 .0190761
manufacturing | -.0242317 .0073166 -3.31 0.001 -.0385776 -.0098858
low_skill_sec~r | -.0084896 .0088702 -0.96 0.339 -.0258816 .0089025
unemployment | .0335105 .027627 1.21 0.225 -.0206585 .0876796
ln_median_inc~e | -.2423776 .1496396 -1.62 0.105 -.5357799 .0510246
_cons | 2.751071 1.53976 1.79 0.074 -.2679753 5.770118
----------------+----------------------------------------------------------------
sigma_u | .71424066
sigma_e | .62213091
rho | .56859936 (fraction of variance due to u_i)
---------------------------------------------------------------------------------
This is borderline off-topic since this is essentially a statistical question.
The variable treated is dropped because it is time-invariant and you are doing a fixed effects regression, which transforms the data by subtracting the average for each panel for each covariate and outcome. Treated observations all have treated set to one, so when you subtract the average of treated for each panel, which is also one, you get a zero. Similarly for control observations, except they all have treated set to zero. The result is that the treated column is all zeros and Stata drops it because otherwise the matrix is not invertible since there is no variation.
The parameter you care about is treated#after_1980, which is the DID effect and is reported in your output. The fact that treated is dropped is not concerning.
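For intuition, here is a tiny numeric illustration of that within transformation (a made-up two-county, two-year panel, not your data):
import pandas as pd

# 'treated' is constant within each county (fips), so subtracting the
# county-level mean leaves nothing but zeros: no within variation to estimate.
panel = pd.DataFrame({
    'fips':    [1, 1, 2, 2],
    'year':    [1970, 1980, 1970, 1980],
    'treated': [1, 1, 0, 0],
})
within = panel['treated'] - panel.groupby('fips')['treated'].transform('mean')
print(within.tolist())   # [0.0, 0.0, 0.0, 0.0]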
