From the following links I have come up with an approach. I want to ask whether I am doing it right or whether I am on the wrong track. If I am on the wrong track, please guide me.
Links
Using libsvm for text classification c#
How to use libsvm for text classification?
My approach
First, calculate the word counts in each training set (the table below shows counts for the +ve and -ve sets).
Then create a mapping (index) for each word.
e.g.
Sample word counts from the training set:
|-----|-----------|
|     |  counts   |
|-----|-----|-----|
|text | +ve | -ve |
|-----|-----|-----|
|this |  3  |  3  |
|forum|  1  |  0  |
|is   | 10  | 12  |
|good | 10  |  5  |
|-----|-----|-----|
Positive training data:
this forum is good
So the training instance would be:
+1 1:3 2:1 3:10 4:10
This is all just what I gathered from the links above.
Please help me.
You're doing it right.
I don't know why your label is called "+1" - it should be a simple integer (referring to the document class "+ve") - but all in all it's the way to go.
For document classification you may want to take a look at liblinear, which is specifically designed for handling a large number of features.
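To make the mapping concrete, here is a minimal sketch in Python of how such feature vectors can be written in the libsvm/liblinear input format. The helper names (build_vocab_index, to_libsvm_line) are illustrative, not part of libsvm, and the sketch uses per-document term counts as feature values with an arbitrary (alphabetical) index assignment; any consistent assignment works.

from collections import Counter

# Build a word -> feature-index mapping from the training documents (indices start at 1).
def build_vocab_index(documents):
    vocab = sorted({word for doc in documents for word in doc.split()})
    return {word: i + 1 for i, word in enumerate(vocab)}

# Turn one document into a libsvm-format line such as "+1 1:3 2:1 ...".
def to_libsvm_line(label, document, vocab_index):
    counts = Counter(document.split())
    pairs = sorted((vocab_index[w], c) for w, c in counts.items() if w in vocab_index)
    return str(label) + " " + " ".join(f"{i}:{c}" for i, c in pairs)

train_docs = ["this forum is good"]
vocab_index = build_vocab_index(train_docs)
print(to_libsvm_line("+1", train_docs[0], vocab_index))  # -> "+1 1:1 2:1 3:1 4:1"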
You can also use libshorttext from here:
libshortText
It is available in Python.
Related
Regular expressions seem like a steep learning curve for me. I have a dataframe that contains text (up to 300,000 rows). The text, contained in the outcome column of a dummy file named foo_df.csv, is a mixture of English words, acronyms and Māori words. foo_df.csv looks like this:
outcome
0 I want to go to DHB
1 Self Determination and Self-Management Rangatiratanga
2 mental health wellness and AOD counselling
3 Kai on my table
4 Fishing
5 Support with Oranga Tamariki Advocacy
6 Housing pathway with WINZ
7 Deal with personal matters
8 Referral to Owaraika Health services
The result I desire is in the form of the table below, which has Abbreviation and Māori_word columns:
|   | outcome                                                | Abbreviation | Māori_word      |
|---|--------------------------------------------------------|--------------|-----------------|
| 0 | I want to go to DHB                                    | DHB          |                 |
| 1 | Self Determination and Self-Management Rangatiratanga  |              | Rangatiratanga  |
| 2 | mental health wellness and AOD counselling             | AOD          |                 |
| 3 | Kai on my table                                        |              | Kai             |
| 4 | Fishing                                                |              |                 |
| 5 | Support with Oranga Tamariki Advocacy                  |              | Oranga Tamariki |
| 6 | Housing pathway with WINZ                              | WINZ         |                 |
| 7 | Deal with personal matters                             |              |                 |
| 8 | Referral to Owaraika Health services                   |              | Owaraika        |
The approach I am using is to extract the acronyms using a regular expression and to extract the Māori words using the nltk module.
I have been able to extract the acronyms using a regular expression with this code:
pattern = '(\\b[A-Z](?:[\\.&]?[A-Z]){1,7}\\b)'
foo_df['Abbreviation'] = foo_df.outcome.str.extract(pattern)
I have been able to extract non-English words from a sentence using the code below:
import nltk
nltk.download('words')
from nltk.corpus import words
words = set(nltk.corpus.words.words())
sent = "Self Determination and Self-Management Rangatiratanga"
" ".join(w for w in nltk.wordpunct_tokenize(sent) \
if not w.lower() in words or not w.isalpha())
However, I got an error TypeError: expected string or bytes-like object when I tried to iterate the above code over a dataframe. The iteration I tried is below:
def no_english(text):
    words = set(nltk.corpus.words.words())
    " ".join(w for w in nltk.wordpunct_tokenize(text['outcome'])
             if not w.lower() in words or not w.isalpha())

foo_df['Māori_word'] = foo_df.apply(no_english, axis = 1)
print(foo_df)
Any help in python3 will be appreciated. Thanks.
You can't magically tell if a word is English/Māori/abbreviation with a simple short regex. Actually, it is quite likely that some words can be found in multiple categories, so the task itself is not binary (or trinary in this case).
What you want to do is natural language processing; here are some examples of libraries for language detection in Python. What you'll get is a probability that the input is in a given language. This is usually run on full texts, but you could apply it to single words.
Another approach is to use Māori and abbreviation dictionaries (i.e. exhaustive/curated lists of words), craft a function that tells whether a word appears in one of them, and assume English otherwise.
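As a rough sketch of that dictionary approach (the maori_words set below is a tiny illustrative stand-in for a real Māori lexicon, and the column names follow the question), something along these lines could work; the isinstance guard also avoids the "expected string or bytes-like object" error, which typically comes from NaN cells:

import re
import pandas as pd

# Tiny illustrative word list; in practice use a proper Māori lexicon.
maori_words = {"rangatiratanga", "kai", "oranga", "tamariki", "owaraika"}
abbreviation_pattern = re.compile(r'\b[A-Z](?:[.&]?[A-Z]){1,7}\b')

def extract_abbreviation(text):
    # Return the first all-caps abbreviation found, or an empty string.
    match = abbreviation_pattern.search(text) if isinstance(text, str) else None
    return match.group(0) if match else ""

def extract_maori(text):
    # Keep only tokens that appear in the Māori word list.
    if not isinstance(text, str):      # guards against NaN (float) cells
        return ""
    tokens = re.findall(r"[A-Za-zāēīōū-]+", text)
    return " ".join(t for t in tokens if t.lower() in maori_words)

foo_df = pd.read_csv("foo_df.csv")
foo_df["Abbreviation"] = foo_df["outcome"].apply(extract_abbreviation)
foo_df["Māori_word"] = foo_df["outcome"].apply(extract_maori)
print(foo_df)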
I am trying to summarise some text using Gensim in python and want exactly 3 sentences in my summary. There doesn't seem to be an option to do this so I have done the following workaround:
with open('speeches//' + speech, "r") as myfile:
    speech = myfile.read()

sentences = speech.count('.')
x = gensim.summarization.summarize(speech, ratio=3.0/sentences)
However, this code only gives me two sentences. Furthermore, when I incrementally increase the 3 towards 5, nothing changes.
Any help would be most appreciated.
You may not be able to use ratio for this. If you give ratio=0.3 and you have 10 sentences (assuming each sentence has the same word count), your output will have 3 sentences, 6 sentences for 20, and so on.
As per gensim doc
ratio (float, optional) – Number between 0 and 1 that determines the proportion of the number of sentences of the original text to be chosen for the summary.
Instead you might want to try using word_count, e.g. summarize(speech, word_count=60).
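A minimal sketch of that idea, assuming the gensim 3.x summarization API (it was removed in gensim 4.x) and using the average sentence length to turn "3 sentences" into a word budget - an approximation, not a guarantee of exactly three sentences; speech_file below is a hypothetical filename variable:

from gensim.summarization import summarize

with open('speeches/' + speech_file, 'r') as myfile:
    speech = myfile.read()

# Rough word budget for ~3 sentences: three times the average sentence length.
sentence_count = max(speech.count('.'), 1)
average_sentence_words = len(speech.split()) / sentence_count
summary = summarize(speech, word_count=int(3 * average_sentence_words))
print(summary)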
This question is a bit old; if you found a better solution, please share.
For part of my job we make a comprehensive list of all files a user has in their drive. These users have to decide per file whether to archive it or not (indicated by Y or N). As a service to these users, we manually fill this in for them.
We export these files to a long list in Excel, which displays each file as X:\4. Economics\10. xxxxxxxx\04. xxxxxxxxx\04. xxxxxxxxxx\filexyz.pdf
I'd argue that we can easily automate this, as standard naming conventions make it easy to decide which files to keep and which to delete. A file with the string "CAB" in the filename should, for example, be kept. However, I have no idea how or where to start. Can someone point me in the right direction?
I would suggest the following general steps
Get the raw data
You can read the Excel file into a pandas dataframe in Python. Ideally you will have a raw dataframe that looks something like this:
   Filename                           Keep
0  X:\4. Economics ...\filexyz.pdf    0
1  X:\4. Economics ...\fileabc.pdf    1
2  X:\3. Finance ...\filetef.pdf      1
3  X:\3. Finance ...\file123.pdf      0
4  G:\2. Philosophy ..\file285.pdf    0
....
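A minimal sketch of this step, assuming the export lives in a file called files.xlsx with columns named "Filename" and "Archive" holding the Y/N decision (all of these names are placeholders for whatever your export actually uses):

import pandas as pd

# Hypothetical file and column names; adjust to match your actual export.
df = pd.read_excel("files.xlsx")                   # columns assumed: "Filename", "Archive"
df["Keep"] = (df["Archive"] == "Y").astype(int)    # map the manual Y/N decision to 1/0
print(df.head())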
Preprocess/clean
This part is more up to you; for example, you could remove all special characters and numbers. That would leave only letters, as follows:
   Filename                       Keep
0  "X Economics filexyz pdf"      0
1  "X Economics fileabc pdf"      1
2  "X Finance filetef pdf"        1
3  "X Finance file123 pdf"        0
4  "G Philosophy file285 pdf"     0
....
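One possible cleaning step (just a sketch; the exact rules are up to you) that keeps only letters and collapses runs of whitespace:

import re

def clean_filename(path):
    # Keep letters only, replace everything else with spaces, then collapse spaces.
    letters_only = re.sub(r"[^A-Za-z]+", " ", path)
    return re.sub(r"\s+", " ", letters_only).strip()

df["Cleaned"] = df["Filename"].apply(clean_filename)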
Vectorize your strings
For an algorithm to understand your text data, you typically vectorize it. This means you turn the strings into numbers that the algorithm can process. An easy way to do this is with tf-idf and scikit-learn. After this your dataframe might look something like this:
   Filename                           Keep
0  [0.6461, 0.3816 ... 0.01, 0.38]    0
1  [0., 0.4816 ... 0.25, 0.31]        1
2  [0.61, 0.1663 ... 0.11, 0.35]      1
....
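A sketch of that vectorization with scikit-learn's TfidfVectorizer, continuing from the hypothetical dataframe above (in practice you would keep the sparse matrix rather than store vectors in the dataframe):

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["Cleaned"])   # sparse matrix, one row per file
y = df["Keep"]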
Train a classifier
Now that you have nice numbers for the algorithms to work with, you can train a classifier with scikit-learn. Simply search for "scikit learn classification example" and you will find plenty.
Once you have a trained classifier, you can compare its predictions against the true labels on test data that it has not seen before. That way you get a feeling for its accuracy.
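A sketch of training and evaluating a simple classifier on a held-out test split; logistic regression is just one reasonable first choice, not the only option:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))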
Hopefully that is enough to get you started!
In response to #j.jerrod.taylor's answer, let me rephrase my question to clear any misunderstanding.
I'm new to Data Mining and am learning about how to handle noisy data by smoothing my data using the Equal-width/Distance Binning method via "Bin Boundaries". Assume the dataset 1,2,2,3,5,6,6,7,7,8,9. I want to perform:
distance binning with 3 bins, and
Smooth values by Bin Boundaries based on values binned in #1.
Based on definition in (Han,Kamber,Pei, 2012, Data Mining Concepts and Techniques, Section 3.2.2 Noisy Data):
In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.
Interval width = (max-min)/k = (9-1)/3 = 2.7
Bin intervals = [1,3.7),[3.7,6.4),[6.4,9.1]
original Bin1: 1,2,2,3 | Bin boundaries: (1,3) | Smooth values by Bin Boundaries: 1,1,1,3
original Bin2: 5,6,6 | Bin boundaries: (5,6) | Smooth values by Bin Boundaries: 5,6,6
original Bin3: 7,7,8,9 | Bin boundaries: (7,9) | Smooth values by Bin Boundaries: 7,7,8,9
Question:
- Where does 8 belong in Bin3 when smoothing by bin boundaries, since it is +1 from 7 and -1 from 9?
If this is a problem, then you are calculating your bin widths incorrectly. For example, creating a histogram is an example of data binning.
You can read this response on Cross Validated. But in general, if you're trying to bin integers, your boundaries will be doubles.
For example, if you want everything between 2 and 6 to be in one bin, your actual boundaries will be 1.5 to 6.5. Since all of your data are integers, there is no chance of anything being left unclassified.
Edit: I also have the same book, though it seems I have a different edition, because the section on Data Discretization is in Chapter 2 instead of Chapter 3 as you pointed out. Based on your question, it seems you don't fully understand the concept yet.
The following is an excerpt from page 88, Chapter 2, on Data Preprocessing. I'm using the second edition of the text.
For example, attribute values can be discretized by applying equal-width or equal-frequency binning, and then replacing each bin value by the bin mean or median, as in smoothing by bin means or smoothing by bin medians, respectively.
8 doesn't belong anywhere other than in bin 3. This gives you two options. You can either take the mean/median of all of the numbers that fall in bin 3, or you can use bin 3 as a category.
Building on your example, we can take the mean of the 4 numbers in bin 3. This gives us 7.75. We would now use 7.75 for the four numbers in that bin instead of 7, 7, 8 and 9.
The second option would be to use the bin number. For example, everything in bin 3 would get a category label of 3, everything in bin 2 would get a label of 2, etc.
UPDATE WITH CORRECT ANSWER:
My class finally covered this topic, and the answer to my own question is that 8 can be assigned to either 7 or 9. This scenario is described as "tie-breaking", where the value is equidistant from both boundaries. It is acceptable as long as all such values are consistently tied to the same boundary.
Here is a real example of an NIH analysis paper that explains using "tie breaking" when equidistant values are encountered: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3807594/
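For completeness, here is a small sketch in Python of the whole procedure on the dataset from the question; the tie-break rule used here (ties go to the lower boundary) is just one consistent choice:

data = [1, 2, 2, 3, 5, 6, 6, 7, 7, 8, 9]
k = 3
width = (max(data) - min(data)) / k              # (9 - 1) / 3 ≈ 2.67

# Assign each value to an equal-width bin (0, 1 or 2).
def bin_index(x):
    i = int((x - min(data)) / width)
    return min(i, k - 1)                         # keep the maximum value in the last bin

bins = {i: [x for x in data if bin_index(x) == i] for i in range(k)}

# Smooth by bin boundaries, breaking ties toward the lower boundary.
smoothed = {}
for i, values in bins.items():
    lo, hi = min(values), max(values)
    smoothed[i] = [lo if (x - lo) <= (hi - x) else hi for x in values]

print(smoothed)   # the last bin (the question's Bin3) becomes [7, 7, 7, 9]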
After I've done a 2D triangulation, some triangles have the same color and I want to recombine them into like-colored graphics paths for drawing. I find that if I just draw the triangles one by one, some graphics renderers show seams between the triangles (at least if anti-aliasing and/or transparency is involved).
So how do I take a set of (non-overlapping) triangles and produce a graphics path, which may contain holes and disjoint polygons?
Blindly adding the triangles to a graphics path actually works pretty well for filling (though not for stroking, of course), but it doesn't feel right to export those extra interior points.
Think of each triangle as an outline comprised of three vectors going in a counter-clockwise chain.
<--^
| /
|/
V
So for all the triangles in your shape, take the union of their outline vectors. If two outline vectors in the union are identical but go in opposite directions, they cancel each other out and are removed from the union.
For example, for two triangles that are side by side the union is 6 vectors
<--^^
| //|
|// |
VV-->
which reduces to 4 vectors, because the two diagonal vectors in the middle are identical but run in opposite directions and therefore cancel:
<--^
|  |
|  |
V-->
You'll find this works for larger aggregations of triangles. Just connect the resulting vectors tail to head to get closed paths. Some of the closed paths may run clockwise, and these are holes.
<-----<-----<-----^
|                 |
|                 |
V     ^----->     ^
|     |     |     |
|     |     |     |
V     <-----V     ^
|                 |
|                 |
V----->----->----->
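A rough Python sketch of this edge-cancellation idea, with vertices as coordinate tuples and triangles given in counter-clockwise order (the function name and the final chaining step, which assumes each vertex starts at most one boundary edge, are illustrative rather than a complete robust implementation):

def boundary_paths(triangles):
    # triangles: list of (a, b, c) vertex tuples in counter-clockwise order.
    # Returns closed paths as vertex lists; paths that come out clockwise are holes.
    edges = set()
    for a, b, c in triangles:
        for edge in ((a, b), (b, c), (c, a)):
            reverse = (edge[1], edge[0])
            if reverse in edges:
                edges.remove(reverse)   # identical edge in the opposite direction: cancel
            else:
                edges.add(edge)

    # Chain the surviving edges tail to head into closed paths.
    next_vertex = dict(edges)           # assumes at most one outgoing boundary edge per vertex
    paths = []
    while next_vertex:
        start, current = next(iter(next_vertex.items()))
        path = [start]
        while current != start:
            path.append(current)
            current = next_vertex.pop(current)
        del next_vertex[start]
        paths.append(path)
    return paths

# Two triangles sharing the diagonal edge, as in the example above.
tris = [((0, 0), (1, 0), (0, 1)), ((1, 0), (1, 1), (0, 1))]
print(boundary_paths(tris))             # one square outline, no interior diagonal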