WEKA linear regression conversion issue - Excel

I have converted an Excel file to CSV and opened the CSV file in WEKA to classify the data using linear regression, but it doesn't let me select the 'linear regression' option under the 'functions' branch. This is my format:
@RELATION book
@ATTRIBUTE bookID STRING
@ATTRIBUTE author STRING
@ATTRIBUTE genre STRING
@ATTRIBUTE publisher STRING
@ATTRIBUTE yearPublished NUMERIC
@ATTRIBUTE rating NUMERIC
@DATA
book1, suzzane-collins, horror, scholastic, 2008, 4011425
book2, jay-rowling, fantasy, scholastic, 2004, 1560433
book3, harper-lee, comedy, harper-classics, 2006, 2708232
book4, jane-austen, romance, modern-library, 2008, 1560433
book5, stephenie-meyer, romance, little-brown, 2006, 40114255
book6, john-lewis, thriller, harper-collins, 2002, 352728
book7, margarte, mystery, grand-central, 1964, 780522
book8, George-orwell, humour, nal, 2003, 1679178
book9, markus-zusak, legend, grand-central, 2006, 780522
book10, shel-silverstein, folklore, harper-collins, 1964, 592994

For linear regression, your attributes have to be NUMERIC. If you want to run the regression on only the last two attributes, you will have to remove the others (by checking those attributes and clicking Remove) in the "Preprocess" tab in Weka so that it only uses the right ones. You can check this example to see what you are doing wrong; it explains how to run basic linear regression in WEKA from scratch.
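As a hedged illustration, the cleaned-up ARFF keeping only the two numeric attributes might look like this (the STRING attributes removed, data rows truncated):

@RELATION book
@ATTRIBUTE yearPublished NUMERIC
@ATTRIBUTE rating NUMERIC
@DATA
2008, 4011425
2004, 1560433
...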


Mutate: character string is not in a standard unambiguous format

I have a column titled started_at which is formatted like this: 4/12/2021 18:25. When I try to run my code, I get the following message:
Error in mutate():
! Problem while computing day_of_week = wday(start_time, label = TRUE).
Caused by error in as.POSIXlt.character():
! character string is not in a standard unambiguous format
This is the code I am trying to run:
cyclistic_trips_merge_v2 %>%
  mutate(day_of_week = wday(start_time, label = TRUE)) %>% # creates weekday field using wday()
  group_by(usertype, day_of_week) %>% # groups by usertype and weekday
  summarise(number_of_rides = n(), # calculates the number of rides
            average_duration = mean(ride_length)) %>% # calculates the average duration
  arrange(usertype, day_of_week)
I am new to R, and this is a capstone project. I usually get stuck and then figure my way around things with web searches, but right now I am stumped. The date/time mentioned above is stored as a string. I believe that is the problem, but what do I need to convert it to, and how can I do that? Can anyone please help? Losing my mind.
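One likely fix, sketched below on the assumptions that the timestamps are month/day/year (as 4/12/2021 18:25 suggests) and that start_time holds the started_at strings shown above: parse the column into a proper date-time with lubridate before calling wday().

library(dplyr)
library(lubridate)

cyclistic_trips_merge_v2 <- cyclistic_trips_merge_v2 %>%
  # mdy_hm() turns "4/12/2021 18:25" into a POSIXct date-time;
  # use dmy_hm() instead if the dates are day/month/year
  mutate(start_time = mdy_hm(start_time))

# wday() now works, because start_time is a date-time rather than a string
cyclistic_trips_merge_v2 %>%
  mutate(day_of_week = wday(start_time, label = TRUE))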

Python spaCy custom NER – how to prepare multi-word entities?

Please help :)
I'm preparing a custom Named Entity Recognition model using a blank spaCy model. I use only one entity: Brand (we can call it 'ORG', as in Organisation). I have short texts with ORGs and have prepared the data like this (but I can change it):
train_data = [
    ('First text in string with Name I want', {'entities': [(START, END, 'ORG')]}),
    ('Second text with Name and Name2', {'entities': [(START, END, 'ORG'), (START2, END2, 'ORG')]})
]
START and END are the start and end indexes of the brand name in the text, of course.
This is working well, but...
The problem I have is how to prepare entities for brands that consist of two (or more) words.
Let's say Brand Name is the full name of a company. How do I prepare the entity?
Consider the tuple itself for a single text:
text = 'Third text with Brand Name'
company = 'Brand Name'
Can I treat company as one word?
('Third text with Brand Name', {'entities': [(16, 26, 'ORG')]})
Or as two separate brands, 'Brand' and 'Name'? (That will not be useful in my case when using the model later.)
('Third text with Brand Name', {'entities': [(16, 21, 'ORG'), (22, 26, 'ORG')]})
Or should I use a different labeling format, e.g. BIO?
So Brand would be B-ORG and Name would be I-ORG?
If so, can I prepare it like this for spaCy:
('Third text with Brand Name', {'entities': [(16, 21, 'B-ORG'), (22, 26, 'I-ORG')]})
or should I change the format of train_data, because I also need the 'O' from BIO?
How? Like this?
('Third text with Brand Name', {'entities': ['O', 'O', 'O', 'B-ORG', 'I-ORG']})
The question is about the format of train_data for 'Third text with Brand Name': how to label the entity. If I have the format, I will handle the code. :)
The same question applies to entities of three or more words. :)
You can just provide the start and end offsets for the whole entity. You describe this as "treating it as one word", but the character offsets don't have any direct relation to tokenization - they won't affect tokenizer output.
You will get an error if the start and end of your entity don't match token boundaries, but it doesn't matter if the entity is one token or many.
I recommend you take a look at the training data section in the spaCy docs. Your specific question isn't answered explicitly, but that's only because multi-token entities don't require special treatment. The examples include multi-token entities.
Regarding BIO tagging, for details on how to use it with spaCy you can see the docs for spacy convert.
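As a quick check, here is a minimal sketch (using a blank English pipeline; the sentence and offsets come from the question) that verifies the offsets line up with token boundaries:

import spacy

nlp = spacy.blank("en")
doc = nlp("Third text with Brand Name")

# One annotation spanning the whole multi-word entity: (16, 26) covers "Brand Name"
span = doc.char_span(16, 26, label="ORG")

# char_span() returns None when the offsets don't align with token boundaries,
# which is the same condition that causes an error during training
print(span, span.label_)  # Brand Name ORG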

How to extract relationships from a text

I am new to NLP and need guidance on how to solve this problem.
I am working on a filtering technique where I need to mark data in a database as either correct or incorrect. I am given a structured data set, with columns and rows.
However, the filtering conditions are given to me in a text file.
An example filtering text file could be the following:
Values in the column ID which are bigger than 99
Values in the column Cash which are smaller than 10000
Values in the column EndDate that are smaller than values in StartDate
Values in the column Name that contain numeric characters
Any value that matches those conditions should be marked as bad.
However, I want to extract those conditions and append them to the program that I've made so far.
For instance, for the conditions above, I would like to produce:
`if ID>99`
`if Cash<10000`
`if EndDate < StartDate`
`if Name LIKE %[1-9]%`
How can I achieve the above result using the Stanford NLP? (or any other NLP library).
This doesn't look like a machine learning problem; it's a simple parsing task. You have a simple syntax, from which you can easily extract the salient features:
column name
relationship
target value or target column
The resulting "action rule" is simply removing the "syntactic sugar" words and converting the relationship -- and possibly the target value -- to its symbolic form.
Enumerate all of your critical words for each position in a lexicon. Then use basic string manipulation operators in your chosen implementation language to find the three needed fields.
EXAMPLE
Given the data above, your lexicons might be like this:
column_trigger = "Values in the column"
relation_dict = {
"are bigger than" : ">",
"are smaller than" : "<",
"contain" : "LIKE",
...
}
value_desc = {
"numeric characters" : "%[1-9]%",
...
}
From here, use these items in standard parsing. If you're not familiar with that, please look up the basics of a simple sentence grammar in your favourite programming language, with rules such as
SENTENCE => SUBJ VERB OBJ
Does that get you going?
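A minimal sketch of that lexicon-driven approach in Python (3.9+ for removesuffix); the lexicon entries and the filler-word stripping are assumptions layered on the answer above, not a definitive implementation:

# Longer phrases come first so they match before their substrings
RELATIONS = {
    "that are smaller than values in": "<",
    "are bigger than": ">",
    "are smaller than": "<",
    "that contain": "LIKE",
}
VALUES = {"numeric characters": "%[1-9]%"}

def parse_condition(sentence):
    # Strip the fixed trigger phrase, then split on the relation phrase
    rest = sentence.replace("Values in the column ", "", 1)
    for phrase, symbol in RELATIONS.items():
        if phrase in rest:
            column, target = (part.strip() for part in rest.split(phrase, 1))
            column = column.removesuffix(" which")  # drop the filler word
            target = VALUES.get(target, target)     # map descriptions to symbols
            return f"if {column} {symbol} {target}"
    raise ValueError(f"no known relation in: {sentence!r}")

for line in [
    "Values in the column ID which are bigger than 99",
    "Values in the column Cash which are smaller than 10000",
    "Values in the column EndDate that are smaller than values in StartDate",
    "Values in the column Name that contain numeric characters",
]:
    print(parse_condition(line))
# if ID > 99
# if Cash < 10000
# if EndDate < StartDate
# if Name LIKE %[1-9]%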

Unable to determine structure as arff in WEKA

I tried the solutions proposed online: saving the file in ANSI, deleting the first line, changing the attributes to numeric instead of real as follows, and even adding a '}' symbol at line 29, but I still get the following error in WEKA when I try to import the ARFF file.
Error Message:
Unable to determine structure as arff(Reason:java.io.IOException: } expected at end of enumeration, read Token[EOL],line 29)
ARFF file:
@relation Pilot
@attribute Gender? { Male (Lelaki),Female (Perempuan) }
@attribute Age? numeric
@attribute 1# numeric
@attribute 2# numeric
@attribute 22#{Nothing_to_Carry,Need_to_carry_many_things}
@attribute 14# numeric
@attribute 3# numeric
@attribute 18# numeric
@attribute 17# numeric
@attribute 4# numeric
@attribute 5# numeric
@attribute 15# numeric
@attribute 16# numeric
@attribute 19# {No,Yes}
@attribute 20# {Yes,No}
@attribute 6# numeric
@attribute 7# numeric
@attribute 8# numeric
@attribute 9# numeric
@attribute 11# numeric
@attribute 10# numeric
@attribute 12# numeric
@attribute 13# numeric
@attribute 21#{No,Yes}
@attribute Physical_Disability{Partially_Visually_Impaired,Blind}
@attribute 23#{Yes,Don't_know,No}
@attribute 24#{No,Don't know,Yes}
@attribute 25#{Yes,No}
}
@data
Male,36,2,3,Nothing_to_Carry,1,3,2,3,3,2,1,2,No,Yes,3,5,5,4,5,4,3,3,No,Partially_Visually_Impaired,Yes,No,Yes
Female,44,3,3,Nothing_to_Carry,3,4,3,3,4,3,1,1,No,Yes,4,4,3,2,3,3,4,4,No,Partially_Visually_Impaired,Yes,No,Yes
Male,34,3,4,Nothing_to_Carry,3,3,2,1,4,3,2,1,No,Yes,1,4,3,1,5,3,4,5,No,Blind,Yes,Don't know,Yes
Male,56,1,3,Nothing_to_Carry,3,4,4,4,4,3,3,3,No,Yes,1,5,5,5,3,3,5,1,Yes,Blind,Don't know,Yes,Yes
Male,54,5,5,Nothing_to_Carry,1,1,1,5,5,5,1,5,No,Yes,1,5,5,1,5,1,1,5,Yes,Blind,Yes,No,Yes
Female,39,1,1,Nothing_to_Carry,1,2,1,5,3,5,5,5,Yes,Yes,3,3,5,1,1,5,5,5,Yes,Blind,Yes,Yes,Yes
Male,49,2,3,Nothing_to_Carry,2,2,3,4,4,4,3,3,No,Yes,1,3,3,4,3,3,4,4,No,Partially_Visually_Impaired,No,No,Yes
Male,68,5,4,Nothing_to_Carry,4,4,2,5,2,3,3,3,No,No,1,2,3,1,3,3,3,4,No,Blind,Yes,Don't know,No
Male,44,1,1,Nothing_to_Carry,1,3,3,3,3,3,3,1,No,Yes,1,5,4,4,3,4,2,2,Yes,Blind,Yes,Yes,Yes
Male,45,1,1,Nothing_to_Carry,1,2,1,1,1,1,3,1,No,Yes,5,5,1,5,5,5,5,5,No,Partially_Visually_Impaired,No,No,Yes
Male,59,3,4,Nothing_to_Carry,4,3,3,3,3,3,3,3,No,No,2,1,3,1,4,3,4,2,No,Blind,Yes,Yes,No
Male,38,3,3,Nothing_to_Carry,4,4,3,4,4,3,3,3,No,Yes,4,2,4,1,2,3,3,3,No,Partially_Visually_Impaired,Yes,No,Yes
Male,29,4,2,Nothing_to_Carry,4,4,4,4,3,4,4,3,Yes,Yes,4,3,3,3,3,3,4,3,No,Blind,Yes,No,Yes
}
Please advise... Thank you.
ARFF doesn't need a closing brace at the end of the attribute section or the data section, so remove them.
Attribute names must start with an alphabetic character.
If a nominal value contains a space, it must be quoted; here the values of the Gender and 24# attributes need to be quoted, e.g. 'Male (Lelaki)'.
Please check whether a space needs to be given between the attribute name and the attribute datatype, even for nominal values.
Also make sure that each line of data consists of a number of values equal to the number of attributes specified in the attribute section.
If the above points fail to remove the error, please check the ARFF file format details at http://www.cs.waikato.ac.nz/ml/weka/arff.html
Next time consider editing your question instead of writing a question in an answer.
The problem is that you should change Don't_know and Don't know in attributes 23# and 24# to Dont_know, meaning that you have to avoid punctuation and spaces.
Also, you need to change either @attribute Gender? { Male(Lelaki),Female(Perempuan) } to @attribute Gender? { Male,Female },
or your data should read Male(Lelaki),36,2,3,Nothing_to_Carry,1,3,... instead of Male,36,2,3,Nothing_to_Carry,1,3,...
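Putting those fixes together, a corrected header might start like this (a sketch only; the renamed attributes q1, q22, q23 are illustrative choices, not required names):

@relation Pilot
@attribute Gender { 'Male (Lelaki)','Female (Perempuan)' }
@attribute Age numeric
% 1# renamed to q1 so the name starts with a letter
@attribute q1 numeric
@attribute q22 {Nothing_to_Carry,Need_to_carry_many_things}
@attribute q23 {Yes,Dont_know,No}
...
@data
'Male (Lelaki)',36,2,3,Nothing_to_Carry,...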
Try to open your .csv dataset from the Weka software. It takes a few steps:
1. Install the Weka software and go to the "Experimenter" tab.
2. Then go to the "Analyze" tab.
3. Then select the .csv data set by opening it through the "File" option.
4. Then click the "Open Explorer" button.
5. Then save the opened output in .arff format under a name.
6. After that, close the Weka software.
7. Then open the Weka software again and go to the "Explorer" tab.
8. Then open the previously saved .arff dataset from there.
After that, this type of error doesn't come up, and you can do preprocessing, classification, and association rules easily with the Weka software.

Exporting results

I'm sure this is an issue anyone who uses Stata for publications or reports has run into:
How do you conveniently export your output to something that can be parsed by a scripting language or Excel?
There are a few ado files that do this for specific commands. For example:
findit tabout
findit outreg2
But what about exporting the output of the table command? Or the results of an anova?
I would love to hear about how Stata users address this problem for either specific commands or in general.
After experimenting with this for a while, I've found a solution that works for me.
There are a variety of ados that handle exporting specific functions. I've made use of outreg2 for regressions and tabout for summary statistics.
For simpler commands, it's easy to write your own programs that save results automatically to plain text in a standard format. Here are a few I wrote. Note that these both display results (to be saved to a log file) and export them to text files; if you wanted to just save to text, you could get rid of the di's and qui the sum, tab, etc. commands:
cap program drop sumout
program define sumout
    di ""
    di ""
    di "Summary of `1'"
    di ""
    sum `1', d
    qui matrix X = (r(mean), r(sd), r(p50), r(min), r(max))
    qui matrix colnames X = mean sd median min max
    qui mat2txt, matrix(X) saving("`2'") replace
end

cap program drop tab2_chi_out
program define tab2_chi_out
    di ""
    di ""
    di "Tabulation of `1' and `2'"
    di ""
    tab `1' `2', chi2
    qui matrix X = (r(p), r(chi2))
    qui matrix colnames X = chi2p chi2
    qui mat2txt, matrix(X) saving("`3'") replace
end

cap program drop oneway_out
program define oneway_out
    di ""
    di ""
    di "Oneway anova with dv = `1' and iv = `2'"
    di ""
    oneway `1' `2'
    qui matrix X = (r(F), r(df_r), r(df_m), Ftail(r(df_m), r(df_r), r(F)))
    qui matrix colnames X = anova_between_groups_F within_groups_df between_groups_df P
    qui mat2txt, matrix(X) saving("`3'") replace
end

cap program drop anova_out
program define anova_out
    di ""
    di ""
    di "Anova command: anova `1'"
    di ""
    anova `1'
    qui matrix X = (e(F), e(df_r), e(df_m), Ftail(e(df_m), e(df_r), e(F)), e(r2_a))
    qui matrix colnames X = anova_between_groups_F within_groups_df between_groups_df P RsquaredAdj
    qui mat2txt, matrix(X) saving("`2'") replace
end
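Usage is then a one-liner per table. The variable and file names below are hypothetical, and note that mat2txt is a user-written command from SSC (ssc install mat2txt):

* Hypothetical calls matching the argument order of the programs above;
* the final argument is the output file handed to mat2txt
sumout price "price_summary.txt"
tab2_chi_out foreign rep78 "tab_foreign_rep78.txt"
oneway_out price foreign "anova_price_foreign.txt"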
The question is then how to get the output into Excel and format it. I found that the best way to import the text output files from Stata into Excel is to concatenate them into one big text file and then import that single file using the Import Text File... feature in Excel.
I concatenate the files by placing this Ruby code in the output folder and then running it from my do-file with qui shell cd path/to/output/folder/ && ruby table.rb:
output = ""
Dir.new(".").entries.each do |file|
next if file =~/\A\./ || file == "table.rb" || file == "out.txt"
if file =~ /.*xml/
system "rm #{file}"
next
end
contents = File.open(file, "rb").read
output << "\n\n#{file}\n\n" << contents
end
File.open("out.txt", 'w') {|f| f.write(output)}
Once I import out.txt into its own sheet in Excel, I use a bunch of Excel's built-in functions to pull the data together into nice, pretty tables.
I use a combination of vlookup, offset, match, iferror, and hidden columns with cell numbers and filenames to do this. Each source .txt file's name is included in out.txt just above that file's contents, which lets you look up the contents of the file using these functions and then reference specific cells using vlookup and offset.
This Excel business is actually the most complicated part of this system and there's really no good way to explain it without showing you the file, though hopefully you can get enough of an idea to figure it out for yourself. If not, feel free to contact me through http://maxmasnick.com and I can get you more info.
I have found that the estout package is the most developed and has good documentation.
This is an old question and a lot has happened since it was posted.
Stata now has several built-in commands and functions that allow anyone to
export customized output fairly easily:
putexcel (see the sketch below)
putexcel with advanced syntax
putdocx
putpdf
There are also equivalent Mata functions / classes, which offer greater flexibility:
_docx*()
Pdf*()
xl()
From my experience, there are no 100% general solutions. Community-contributed commands such as estout are now mature enough to handle most basic operations. That said, if you have something that deviates even slightly from the template, you will have to program it yourself.
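By way of illustration, a minimal putexcel sketch of the built-in route (the dataset, file name, and cell layout are arbitrary choices, not from the original answer):

sysuse auto, clear
regress price mpg weight

* Point putexcel at a workbook, then write a title and the full results matrix
putexcel set myresults.xlsx, replace
putexcel A1 = "Regression of price on mpg and weight"
putexcel A2 = matrix(r(table)), names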
Most tutorials throw in several packages, where it would indeed be nice to have only one that exports everything, which is what Max suggests above with his interesting method.
I personally use tabout for summary statistics and frequencies, estout for regression output, and am trying out mkcorr for correlation matrices.
It's been a while, but I believe you can issue a log command to capture the output.
log using c:\data\anova_analysis.log, text
[commands]
log close
I use estpost (part of the estout package) to tabulate results from non-estimation commands. You can then store them and export them easily.
Here's an example:
estpost corr varA varB varC varD, matrix
est store corrs
esttab corrs using corrs.rtf, replace
You can then add options to change formatting, etc.
You can use asdoc, which is available on SSC. To download it:
ssc install asdoc
asdoc works well with almost all Stata commands. Specifically, it produces publication-quality tables for:
summarize command - to report summary statistics
cor or pwcorr command - to report correlations
tabstat - for flexible tables of descriptive statistics
tabulate - for one-way, two-way, three-way tabulations
regress - for detailed, nested, and wide regression tables
table - flexible tables
and many more. You can explore more about asdoc here:
https://fintechprofessor.com/2018/01/31/asdoc/
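A minimal sketch of asdoc in action (the dataset and output file name are arbitrary): prefix any supported command with asdoc.

sysuse auto, clear

* Prefix a command with asdoc to send its output to a Word file
asdoc summarize price mpg weight, save(descriptives.doc) replace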
