Resolving an error when using the CRF++ toolkit - nlp

To all those who have experience with the CRF++ toolkit (refer: http://crfpp.sourceforge.net/):
Below is the error message that pops up when trying to execute the CRF++ training program:
CRF++: Yet Another CRF Tool Kit
Copyright (C) 2005-2009 Taku Kudo, All rights reserved.
encoder.cpp(280) [feature_index.open(templfile, trainfile)] feature_index.cpp(86) [max_size == size] inconsistent column size: 21 20 train.data
I'm not sure how to interpret the error message.
There are 20 features in my training file and the 21st token is the class value.
I have created the CRF++ template file as per the instructions on the site.

It looks like a training data format issue; make sure the number of columns is consistent across all sentences.

I got this error today and found that the CRF++ toolkit uses the tab character (\t) as the default column separator, whereas my training data file used a single space, which led to the error.

Some points to check:
1. Check that there is an empty line after each sentence.
2. Check that your column values do not contain any spaces.

The error says that the number of columns is not the same across all rows: your maximum column count is 21, and it should be consistent throughout the training file, but crf_learn found 20 columns somewhere in train.data. Locate that row and remove or repair it.
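As a quick way to locate the offending row, here is a minimal Python sketch (assuming the training file is named train.data, as in the error message). It flags every non-empty line whose whitespace-separated column count differs from that of the first data line:

expected = None
with open('train.data', encoding='utf-8') as f:
    for lineno, line in enumerate(f, start=1):
        line = line.rstrip('\n')
        if not line.strip():            # blank line = sentence boundary
            continue
        ncols = len(line.split())       # CRF++ columns are whitespace-separated
        if expected is None:
            expected = ncols            # the first data line sets the reference
        elif ncols != expected:
            print(f'line {lineno}: {ncols} columns, expected {expected}')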


Table values moved after being added to old QGIS version: behaves normally on current version

I am teaching a man the basics of QGIS for a project he needs to do at his work. He has very little computer knowledge and would like to standardise the work as much as possible (specific workarounds would complicate it too much for him). His QGIS version is 3.16 "Hannover", and as this is a work laptop he does not have permission to download a newer version.
We have been having problems with one specific table. The first few rows are below, written exactly as they are originally.
Baum-Nr. Baumart BHD Alter Y X Biotopbaum Klassifizierung Bemerkungen
1 Buche 86 120 49.1356 11.0488 A Altbaum Freistellen !!!
2 Kiefer 45 100 49.13561 11.04883 Hlb,Bs,Th Höhlenbaum
3 Kiefer 32 100 49.13571 11.04579 Hlb,Sw,Th Höhlenbaum
4 Kiefer 74 120 49.13513 11.0495 A Altbaum
After adding it from Excel to QGIS through "add vector layer", the header "Klassifizierung" becomes one of the coordinates, and I believe one of the columns is switched (unfortunately, I can't remember the specifics. This is a small side job and I haven't had time to look into it for days. I should have taken a photo, but this isn't possible anymore). We have attempted to copy the column into a new Excel document and transfer it to QGIS again, and this time the headers were shoved one cell to the right, such that "Y" was placed over "X" and "Biotopbaum" over "Klassifizierung", for example.
I could not find a way to fix the import problem on his laptop. He e-mailed me the problematic table and I opened it successfully in my QGIS 3.26 "Buenos Aires".
I believe this may be a problem with his QGis version, but it is curious that we only encountered it with this one table. All other tables we have worked with have the same headers and the same kind of data on their respective rows.
Is this a known problem, or have other people faced similar situations? Could someone explain what could be causing it? Would there be a way to fix it such that we can successfully import the table without having to edit it in QGis? This is not a solution the man would accept.
Thank you in advance.
Remove the commas in the Biotopbaum field or replace them with a less common delimiter. In fact, remove all punctuation from the headers (e.g., for "Baum-Nr.", remove the period ".").
Also, save the table in CSV format and try the import again.
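If hand-editing in Excel is error-prone, a minimal pandas sketch along these lines could do the same cleanup (the file names are hypothetical):

import pandas as pd

df = pd.read_excel('baeume.xlsx')   # hypothetical workbook name
# Replace commas inside Biotopbaum so they cannot collide with the CSV delimiter.
df['Biotopbaum'] = df['Biotopbaum'].str.replace(',', '/', regex=False)
# Strip periods from the headers, e.g. 'Baum-Nr.' -> 'Baum-Nr'.
df.columns = [c.replace('.', '') for c in df.columns]
df.to_csv('baeume.csv', index=False)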

ADF copy activity failed while extracting data from DB2- Issue found for few records having special characters

I am doing a full extract from a table ABC. In the copy activity, I have given the query
select * from ABC
but I am facing an issue for a few rows (they contain special characters: Japanese and Korean):
Error code 2200
Failure type User configuration issue
Details Failure happened on 'Source' side. ErrorCode=DB2DriverRunFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Error thrown from driver. Sql code: '-343',Source=Microsoft.DataTransfer.ClientLibrary.Db2Connector,''Type=Microsoft.HostIntegration.DrdaClient.DrdaException,Message=HISMPCB0001 In BasePrimitiveConverter an exception has occurred. Exception description: Output buffer is smaller than required size 12 SQLSTATE=HY000 SQLCODE=-343,Source=Microsoft.HostIntegration.Drda.Requester,'
The character which is causing the issue is '轎ᆃ '
In the error description, it states that a BasePrimitiveConverter exception has occurred, and the exception description indicates that the output buffer is smaller than the required size. So, please try converting the column to a type that can hold such data, like GRAPHIC, in DB2. Refer to the following link to understand more.
https://bytes.com/topic/db2/answers/488983-storing-some-japanese-data
Referring to these links, I understand that this error might be due to the datatype of the source column or the encoding used. Try working with the different encoding options available in your source dataset. Here is a similar problem with a different source, but the same symptom of not being able to retrieve special characters:
https://learn.microsoft.com/en-us/answers/questions/467456/failure-happened-source-side-in-copy-activity-for.html
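If you need to pin down which values overflow a single-byte buffer, a small Python check can help (a sketch that uses cp1252 as a stand-in for the column's actual single-byte code page; the sample values are placeholders):

rows = ['ordinary text', '轎ᆃ']      # placeholders; use your extracted values
for i, value in enumerate(rows):
    try:
        value.encode('cp1252')        # fails for characters outside the code page
    except UnicodeEncodeError as err:
        print(f'row {i}: needs a double-byte type such as GRAPHIC: {err}')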

Syntax error when indicating "equalslopes" in proc logistic model equation

I've been trying to run a PROC LOGISTIC stepwise regression model with an ordinal outcome. Because several of my variables violate the proportional odds assumption, I am specifying both EQUALSLOPES and UNEQUALSLOPES in my code; however, I continue to get this syntax error.
ERROR 22-322: Syntax error, expecting one of the following: ;, ABSFCONV, AGGREGATE, ALPHA, BACKWARD, BEST, BINWIDTH, BUILDRULE, CL,
CLODDS, CLPARM, CODING, CONVERGE, CORRB, COVB, CT, CTABLE, DETAILS, DSCALE, EXPB, FAST, FCONV, FIRTH, GCONV,
HIERARCHY, INCLUDE, INFLUENCE, IPLOTS, ITPRINT, L, LACKFIT, LINK, MAXITER, MAXSTEP, NOCHECK, NODUMMYPRINT, NOFIT,
NOINT, NOLOGSCALE, OFFSET, OUTROC, PARMLABEL, PCORR, PEVENT, PL, PLCL, PLCONV, PLRL, PPROB, PSCALE, RIDGING,
RISKLIMITS, ROCEPS, RSQUARE, SCALE, SELECTION, SEQUENTIAL, SINGULAR, SLE, SLENTRY, SLS, SLSTAY, START, STB, STEPWISE,
STOP, STOPRES, TECHNIQUE, UNEQUALSLOPES, WALDCL, WALDRL, XCONV.
ERROR 76-322: Syntax error, statement will be ignored.
I am sure I am entering my code correctly; even when I do not name any specific variables for either the EQUALSLOPES or the UNEQUALSLOPES option, I continue to receive the same message. Here is my code below.
proc logistic data=upper_limit;
class Profitrank dealer_state_id dlr_zip RUCA2 area_description/ param = ref;
model profitrank = num_booked num_approved num_apps bk_conversion_rt appr_rt num_new num_used Percentage_of_New Average_Vehicle_Age average_dti
average_pti average_revolving_balance average_revolving_balance_rate average_public_records average_inquiries average_inquiry_6_months
average_open_rev_trades average_term average_major_derog average_minor_derog average_open_install average_scorecard average_fico_score average_app_risk_score
average_LTV average_consumer_rate average_truist_rate Dealer_state_ID RUCA2/link=clogit selection=stepwise SLE=0.10 SLS=0.10 EQUALSLOPES=(appr_rt average_pti average_public_records
average_open_rev_trades average_major_derog average_minor_derog average_open_install average_fico_score average_app_risk_score RUCA2) unequalslopes details maxiter=500;
run;
Any help with this would be greatly appreciated!
You need to use options that work with the version of SAS you are using. The EQUALSLOPES option was added in SAS/STAT 14.1, which ships with SAS 9.4M3, released in 2015. Notice that the list of valid options in your error message includes UNEQUALSLOPES but not EQUALSLOPES, which is the sign of an older release.
Note that SAS is a subscription license, which means you have already paid for the right to use the newest version. Get someone at your company to install the newest version of SAS. You might also need a newer version of Enterprise Guide to take advantage of the new features.

Excel in SSIS: How to import a column that may have more than 255 characters when DT_NTEXT causes failures?

OK, so my latest project requires loading an Excel 2007 spreadsheet into a SQL Server table. I'm working in SSIS 2008R2. Based on some stuff I found on the internet, I opened the Excel source in Advanced editor and changed the datatype of the long column to DT_NTEXT, so that it wouldn't truncate it. Then I made the database column VARCHAR(MAX). This runs correctly in debug mode on my laptop.
Then I deployed it to the development server and attempted to load the same test file. It failed with the following error messages:
Error: Code: 0xC0208265
Source: Main Data Flow Task Get Main Data [1]
Description: Failed to retrieve long data for column "DESCR".
End Error
Error: Code: 0xC020901C
Source: Main Data Flow Task Get Main Data [1]
Description: There was an error with output column "DESCR" (72) on output "Excel Source Output" (9). The column status returned was: "DBSTATUS_UNAVAILABLE".
End Error
Error: Code: 0xC0209029
Source: Main Data Flow Task Get Main Data [1]
Description: SSIS Error Code DTS_E_INDUCEDTRANSFORMFAILUREONERROR. The "output column "DESCR" (72)" failed because error code 0xC0209071 occurred, and the error row disposition on "output column "DESCR" (72)" specifies failure on error. An error occurred on the specified object of the specified component. There may be error messages posted before this with more information about the failure.
End Error
Searching for information about the error, I found about a million sites offering the same three suggested solutions:
Add 'IMEX=1' to the extended properties of the connection string.
It was already there.
Change the TypeGuessRows key in the registry.
This was set to zero on the server, which I understand to mean that it should look at the entire file. Nevertheless, I changed it to 8 to match my laptop. The same error occurred when I ran it again. Then I changed it to 1,763, which is more than the number of rows in the spreadsheet. It still gave the same error. So, I put it back to zero. (There's a 1,900-character value in the first row of my test file, so it shouldn't really matter how many it checks, in this case.)
Change the datatype to DT_WSTR(4000) in the source.
The column is supposed to have up to 10,000 characters, so I'm not sure this would be a good idea even if it worked. However, I tried it anyway. This time it gave me a truncation error. I changed the truncation error disposition to "ignore failure" and it loaded the data, but truncated the value to 255 characters. I have verified that the length is 4000 and doesn't get changed when I save the file, but it's still truncating at 255 characters.
I have no idea what else to look at. Any help would be appreciated.
UPDATE 1/29: The package, without any changes, works correctly when running on the pre-production server. It still fails when running on the development server. Both servers have the same version of SSIS (including minor version numbers) as well as the same versions of Windows, Access and Excel. I do not know how to explain this, nor do I know how to tell if it would work in production.
I created a new package with similar non-functional requirements (Excel 2007 file, SSIS 2008, SQL Server 2008 R2, VARCHAR(MAX) target column) and it worked just fine after deployment into the database server. My package:
Metadata at the Excel Source component's output (checked using Advanced Editor): DT_NTEXT
Derived Column component between source and destination to cast to non-unicode from unicode using (DT_TEXT,1252)
Metadata at the OLE DB Destination component's input (checked using Advanced Editor): DT_TEXT
Target Column data type: VARCHAR(MAX)
I do not explicitly use the extended property IMEX in the connection.
I executed the package by right-clicking on it at the database server, and it loaded a file with a few thousand characters per record into the table without truncation. Hope this helps.
I have faced this issue while importing an Excel file with a field containing more than 255 characters. I solved the issue using Python.
Simply import the Excel file into a pandas data frame, then calculate the length of each of those string values per row.
Then sort the dataframe by that length in descending order. This lets SSIS allocate maximum space for the field, because the driver only samples the first few rows (per the TypeGuessRows setting) when deciding the column's type and size:
import os
import pandas as pd

f = 'Dirty/data.xlsx'  # hypothetical input path; adjust to your file
# Load the sheet (skipping the first row) and drop the leading index column.
df = pd.read_excel(f, sheet_name=0, skiprows=1)
df = df.drop(df.columns[[0]], axis=1)
# Sort longest value first, so the sampled rows contain the widest values.
df['length'] = df['Item Description'].str.len()
df.sort_values('length', ascending=False, inplace=True)
df = df.drop(columns=['length'])  # drop the helper column before saving
with pd.ExcelWriter('Clean/Cleaned_' + os.path.basename(f)) as writer:
    df.to_excel(writer, sheet_name='Billing', index=False)

Can't Input Tab Delimited file to Stanford Classifier

I'm having a problem inputting tab delimited files into the stanford classifier.
Although I was able to successfully walk through all the included stanford tutorials, including the newsgroup tutorial, when I try to input my own training and test data it doesn't load properly.
At first I thought the problem was that I was saving the data into a tab delimited file using an Excel spreadsheet and it was some kind of encoding issue.
But then I got exactly the same results when I did the following. First I literally typed the demo data below into gedit, making sure to use a tab between the politics/sports class and the ensuing text:
politics Obama today announced a new immigration policy.
sports The NBA all-star game was last weekend.
politics Both parties are eyeing the next midterm elections.
politics Congress votes tomorrow on electoral reforms.
sports The Lakers lost again last night, 102-100.
politics The Supreme Court will rule on gay marriage this spring.
sports The Red Sox report to spring training in two weeks.
sports Messi set a world record for goals in a calendar year in 2012.
politics The Senate will vote on a new budget proposal next week.
politics The President declared on Friday that he will veto any budget that doesn't include revenue increases.
I saved that as myproject/demo-train.txt and a similar file as myproject/demo-test.txt.
I then ran the following:
java -mx1800m -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier
-trainFile myproject/demo-train.txt -testFile myproject/demo-test.txt
The good news: this actually ran without throwing any errors.
The bad news: since it doesn't extract any features, it can't actually estimate a real model and the probability defaults to 1/n for each item, where n is the number of classes.
So then I ran the same command but with two basic options specified:
java -mx1800m -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier
-trainFile myproject/demo-train.txt -testFile myproject/demo-test.txt -2.useSplitWords =2.splitWordsRegexp "\s+"
That yielded:
Exception in thread "main" java.lang.RuntimeException: Training dataset could not be processed
at edu.stanford.nlp.classify.ColumnDataClassifier.readDataset(ColumnDataClassifier.java:402)
at edu.stanford.nlp.classify.ColumnDataClassifier.readTrainingExamples(ColumnDataClassifier.java:317)
at edu.stanford.nlp.classify.ColumnDataClassifier.trainClassifier(ColumnDataClassifier.java:1652)
at edu.stanford.nlp.classify.ColumnDataClassifier.main(ColumnDataClassifier.java:1628)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
at edu.stanford.nlp.classify.ColumnDataClassifier.makeDatum(ColumnDataClassifier.java:670)
at edu.stanford.nlp.classify.ColumnDataClassifier.makeDatumFromLine(ColumnDataClassifier.java:267)
at edu.stanford.nlp.classify.ColumnDataClassifier.makeDatum(ColumnDataClassifier.java:396)
... 3 more
These are exactly the same results I get when I used the real data I saved from Excel.
Even more though, I don't know how to make sense of the ArrayIndexOutOfBoundsException. When I used readline in python to print out the raw strings for both the demo files I created and the tutorial files that worked, nothing about the formatting seemed different. So I don't know why this exception would be raised with one set of files but not the other.
Finally, one other quirk. At one point I thought maybe line breaks were the problem. So I deleted all line breaks from the demo files while preserving tab breaks and ran the same command:
java -mx1800m -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier
-trainFile myproject/demo-train.txt -testFile myproject/demo-test.txt -2.useSplitWords =2.splitWordsRegexp "\s+"
Surprisingly, this time no java exceptions are thrown. But again, it's worthless: it treats the entire file as one observation, and can't properly fit a model as a result.
I've spent 8 hours on this now and have exhausted everything I can think of. I'm new to Java but I don't think that should be an issue here -- according to Stanford's API documentation for ColumnDataClassifier, all that's required is a tab delimited file.
Any help would be MUCH appreciated.
One last note: I've run these same commands with the same files on both Windows and Ubuntu, and the results are the same in each.
Use a properties file. The example Stanford classifier properties file looks like this:
trainFile=20news-bydate-devtrain-stanford-classifier.txt
testFile=20news-bydate-devtest-stanford-classifier.txt
2.useSplitWords=true
2.splitWordsTokenizerRegexp=[\\p{L}][\\p{L}0-9]*|(?:\\$ ?)?[0-9]+(?:\\.[0-9]{2})?%?|\\s+|[\\x80-\\uFFFD]|.
2.splitWordsIgnoreRegexp=\\s+
The number 2 at the start of lines 3, 4 and 5 signifies the column in your tsv file. So in your case you would use
trainFile=20news-bydate-devtrain-stanford-classifier.txt
testFile=20news-bydate-devtest-stanford-classifier.txt
1.useSplitWords=true
1.splitWordsTokenizerRegexp=[\\p{L}][\\p{L}0-9]*|(?:\\$ ?)?[0-9]+(?:\\.[0-9]{2})?%?|\\s+|[\\x80-\\uFFFD]|.
1.splitWordsIgnoreRegexp=\\s+
or if you want to run with command line arguments
java -mx1800m -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier -trainFile myproject/demo-train.txt -testFile myproject/demo-test.txt -1.useSplitWords =1.splitWordsRegexp "\s+"
I've faced the same error as you.
Pay attention to the tabs in the text you are classifying.
Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
This means that at some point the classifier expects an array of 3 elements after it splits the string on tabs.
I ran a method that counts the number of tabs in each line; any line that does not have exactly two of them is the source of the error.
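A minimal Python sketch of that check (the file name is from the question above; set EXPECTED_TABS to the highest column index your properties or flags reference):

EXPECTED_TABS = 1   # two-column data (class TAB text); use 2 for three columns
with open('myproject/demo-train.txt', encoding='utf-8') as f:
    for lineno, line in enumerate(f, start=1):
        tabs = line.rstrip('\n').count('\t')
        if tabs != EXPECTED_TABS:
            print(f'line {lineno}: {tabs} tab(s), expected {EXPECTED_TABS}')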
