Classify each ball of the match into one of four categories – Dots, Runs between Wickets, Boundaries, and Wickets – based on the masked commentary of the match. The total runs in the over are also provided as extra information, which can be used to classify better. I am unable to move forward after exploratory data analysis. Any help on how to unmask the commentary?
Variables:
Match_ID
Over – over and ball identifier
Commentary – textual commentary
Over_Run_Total – total runs in the over
Target – one of the four classes: Dots, RunsbetweenWickets, Boundaries, Wickets
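A rough baseline sketch, assuming the data sits in a pandas DataFrame with the columns above (balls.csv is a hypothetical file name): even with masked tokens, a bag-of-words classifier can be trained on the commentary directly, with Over_Run_Total as an extra numeric feature.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

df = pd.read_csv('balls.csv')  # hypothetical file containing the variables listed above

# TF-IDF over the (masked) commentary tokens, plus the over total passed through as-is.
features = ColumnTransformer([
    ('text', TfidfVectorizer(ngram_range=(1, 2)), 'Commentary'),
    ('num', 'passthrough', ['Over_Run_Total']),
])
clf = Pipeline([('features', features),
                ('model', LogisticRegression(max_iter=1000))])

X_train, X_test, y_train, y_test = train_test_split(
    df[['Commentary', 'Over_Run_Total']], df['Target'],
    stratify=df['Target'], random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))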
It's been a while now that I've been stuck on an apparently "simple" problem. My goal is to build the envelope of a set of lines that are "attached" to a curve. Let's say a curve like this:
For the above example I would expect the envelope of lines (whose directions are depicted by arrows and are orthogonal to the edges of the red curve) to be an arc of a circle.
I thought of doing this in two computationally separate ways:
Intersection of consecutive lines: In an ideal smooth world, the envelope of the attached lines is a curve to which the red lines are all tangent. Now, coming back to the discrete world, I try to obtain the envelope curve by intersecting consecutive lines (for example, intersecting the first line with the second line would give the first vertex of the envelope).
Evolute of the red curve: Again, in an ideal smooth world, one can think of such an envelope as the evolute of the red curve (see Evolute - wikipedia). Therefore, all I had to do in addition to the current information was to compute the curvature and then build the evolute (naturally I had to use a discrete version of curvature, whose definition you can find here: Discrete Curvature - wikipedia).
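A minimal sketch of this evolute idea, using one common discretisation in which the centre of the osculating circle at each vertex is approximated by the circumcentre of the triple of consecutive vertices (written in Python/NumPy; translating it to MATLAB is straightforward):

import numpy as np

def discrete_evolute(x, y):
    # Approximate the evolute by the circumcentres of consecutive vertex triples
    # (each circumcentre is the centre of the osculating circle through P_{i-1}, P_i, P_{i+1}).
    pts = np.column_stack([x, y])
    centres = []
    for i in range(1, len(pts) - 1):
        a, b, c = pts[i - 1], pts[i], pts[i + 1]
        d = 2.0 * (a[0] * (b[1] - c[1]) + b[0] * (c[1] - a[1]) + c[0] * (a[1] - b[1]))
        if abs(d) < 1e-12:
            continue  # (nearly) collinear points: infinite osculating radius, skip
        ux = ((a @ a) * (b[1] - c[1]) + (b @ b) * (c[1] - a[1]) + (c @ c) * (a[1] - b[1])) / d
        uy = ((a @ a) * (c[0] - b[0]) + (b @ b) * (a[0] - c[0]) + (c @ c) * (b[0] - a[0])) / d
        centres.append([ux, uy])
    return np.array(centres)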
Doing any of the above approaches I would get the following result:
However, finding the "correct arc" is heavily dependent on the accuracy of the initial data, i.e. the red curve. As soon as the red curve has some "noise" in its vertices, the envelope is heavily distorted. Here I add a picture (where the red curve is visually intact, but not actually, yet the envelope is distorted):
My Question: How can I rectify this? I believe there should be a numerical approach to solve this issue as I badly need this envelope to be correctly built. I'm a mathematician and am not fully aware of the numerical tricks that might exist in dealing with cases like this.
However, I believe that this should be a standard question in the computer graphics community, though I could not find anything properly relevant after searching for months.
It would be great if the solutions were in MATLAB. Please let me know if you want me to be more precise about any part of this description.
For the line intersection method: yes, because the lines are nearly parallel, any small error in the data defining a line will produce a dramatic error in the intersection points.
I suggest the following (a rough code sketch of these steps follows the list):
Calculate all lines.
Calculate all intersection points of the adjacent lines.
Calculate the distances between all adjacent intersection points.
Sequence-plot the distances, and identify all distances which are more than, perhaps, 2 standard deviations from the trend line of the distances.
If the data is not "too bad", then I think the identified distances will mostly come in pairs, i.e., there is one "bad" intersection line causing two "bad" distances.
Exclude the "bad" lines and reprocess the remaining intersection points.
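A minimal sketch of steps 1-4 (in Python/NumPy rather than the requested MATLAB, but the translation is direct; x and y are assumed to hold the vertices of the red curve, and for simplicity the outlier test uses the mean of the distances rather than a fitted trend line):

import numpy as np

def envelope_by_intersections(x, y):
    pts = np.column_stack([x, y])                       # vertices of the red curve
    tang = np.diff(pts, axis=0)
    tang = tang / np.linalg.norm(tang, axis=1, keepdims=True)
    nrml = np.column_stack([-tang[:, 1], tang[:, 0]])   # directions orthogonal to each edge
    base = 0.5 * (pts[:-1] + pts[1:])                   # each line passes through an edge midpoint

    # Step 2: intersect each line with the next one, solving base_i + s*n_i = base_{i+1} + t*n_{i+1}.
    inter = []
    for i in range(len(base) - 1):
        A = np.column_stack([nrml[i], -nrml[i + 1]])
        if abs(np.linalg.det(A)) < 1e-12:
            continue                                    # (near-)parallel lines: no reliable intersection
        s, _ = np.linalg.solve(A, base[i + 1] - base[i])
        inter.append(base[i] + s * nrml[i])
    inter = np.array(inter)

    # Steps 3-4: distances between adjacent intersection points, flag those beyond 2 standard deviations.
    dists = np.linalg.norm(np.diff(inter, axis=0), axis=1)
    bad = np.abs(dists - dists.mean()) > 2 * dists.std()
    return inter, dists, bad

The lines flagged by bad are then the candidates to exclude before reprocessing, as in the last step above.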
The above assumes the data is sampled more densely where the base curve is curvier.
If the intersection-point distances seem to form two trend lines, especially if they look like two diverging, or two converging, trend lines, then group the intersection lines accordingly, plot two envelopes, and take the average of the two envelopes as "the envelope". (Or perhaps even more trend lines, if there is a regular error in the data.)
But, if there are signs of regular data errors, then a contextual assessment and analysis of the data source and how it was generated/gathered/measured might be required to correctly determine which data should be excluded.
I have just started a project in NLP. Suppose I have, for each word, a graph that shows the polarity distribution of sentiments for that word across different sentences. I want to know what I can use to recognize the sentiment of new words. I would also be happy to hear about any other uses you have in mind.
I apologize for any possible errors in my writing. Thanks a lot.
Assuming you've got some words that have been hand-labeled with positive/negative sentiments, but then you encounter some new words that aren't labeled:
If you encounter the new words totally alone, outside of any context, there's not much you can do. (Maybe you could go out and try to find extra texts with those new words, such as via dictionaries or the web, then use those larger texts in the next approach.)
If you encounter the new words inside texts that also include some of your hand-labeled words, you could try guessing that the new words are most like the words you already know that are closest-to, or used-in-the-same-places. This would leverage what's called "the distributional hypothesis" – words with similar distributions have similar meanings – that underlies a lot of computer natural-language analysis, including word2vec.
One simple thing to try along these lines: across all your texts, for every unknown word U, tally up the counts of all neighboring words within N positions. (N could be 1, or larger.) From that, pick the top 5 words occurring most often near the unknown word, look up your prior labels for them, and average those labels together (perhaps weighted by the number of occurrences).
You'll then have a number for the new word.
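A minimal sketch of that tally-and-average idea, where texts is assumed to be a list of token lists and labels a dict mapping your hand-labeled words to sentiment scores (both hypothetical names):

from collections import Counter

def guess_sentiment(unknown, texts, labels, window=2, top_k=5):
    # Count every labeled word appearing within `window` positions of the unknown word.
    neighbor_counts = Counter()
    for tokens in texts:
        for i, tok in enumerate(tokens):
            if tok != unknown:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] in labels:
                    neighbor_counts[tokens[j]] += 1
    # Average the labels of the top-k most frequent labeled neighbours, weighted by count.
    top = neighbor_counts.most_common(top_k)
    if not top:
        return None                     # no labeled neighbours seen: no guess possible
    total = sum(count for _, count in top)
    return sum(labels[word] * count for word, count in top) / total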
Alternatively, you could train a word2vec set-of-word-vectors on all of your texts, including the unknown & known words. Then, ask that model for the N most-similar neighbors of your unknown word. (Again, N could be small or large.) Then, from among those neighbors with known labels, average the labels together (again perhaps weighted by similarity), to get a number for the previously unknown word.
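A sketch of that word2vec variant with gensim (again with texts and labels as hypothetical names; the parameter is called vector_size in gensim 4.x and size in older releases):

from gensim.models import Word2Vec

model = Word2Vec(texts, vector_size=100, window=5, min_count=2)   # texts: list of token lists
neighbors = model.wv.most_similar('unknownword', topn=20)          # the word must be in the model's vocabulary
scored = [(w, sim) for w, sim in neighbors if w in labels]
if scored:
    guess = sum(labels[w] * sim for w, sim in scored) / sum(sim for _, sim in scored)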
I wouldn't particularly expect either of these techniques to work very well. The idea that individual words can have a specific sentiment is somewhat weak, given the way that in actual language their meaning is heavily modified, or even reversed, by the surrounding grammar/context. But in each case these simple calculate-from-neighbors techniques are probably better than random guesses.
If your real aim is to calculate the overall sentiment of longer texts, like sentences, paragraphs, reviews, etc., then you should discard your labels for individual words and acquire/create labels for full texts, and apply real text-classification techniques to those larger texts. A simple word-by-word approach won't do very well compared to other techniques – as long as those techniques have plenty of labeled training data.
As I understand it, currently, if we have a multi-type point pattern we can determine dependencies between points of various marks using functions like Jmulti, Gmulti, etc.
Now, if each point is associated with multiple marks (say, as a data frame where each column is a mark variable) then how do we find dependency between points of different mark variables? Note that in this case, a point could have two different marks but have the same spatial coordinate.
I think in this case, the number of points having the same coordinates but different marks is in some sense a measure of dependency between the point patterns of different mark variables, but I am not sure if there are methods to do this analysis in spatstat.
Thanks for your clarification.
This is discussed in Chapter 15 of the spatstat book.
However I think you may be confusing two different things: (1) a point pattern in which each point carries several different mark variables, so that the marks for the pattern are represented by a data frame with one row for each point and one column for each mark variable; and (2) a marked point pattern in which there may be several points that have the same spatial coordinate but different mark values.
An example of (1) is the finpines dataset in spatstat in which each tree location is marked by the tree's height and diameter. An example of (2) would be a spatial pattern of road accidents in which each vehicle is represented by a point, so that two-vehicle accidents are represented by two points at the same location, perhaps with different labels.
To deal with (1), you could use functions like Kmulti, Gmulti, Jmulti. These functions always compare two groups of points, identified by the arguments I and J, which can be logical vectors. You can define any two subsets of your point pattern as the subsets I and J. For example, in the finpines data you could define I <- with(marks(finpines), height > 10 * diameter), which would select all the trees whose height in metres is greater than 10 times the diameter in cm, and similarly make another, different rule for J.
Other ways of investigating dependence in marked point patterns include the mark correlation function markcorr, nearest neighbour correlation nncorr, the conditional moments Emark, Vmark and other tools described in Chapter 15.
Finally a caution that summary functions do not "determine" dependence; they are only measures of correlation.
Since genomic sequences vary greatly in length, I have been trying to work on using denoising autoencoders to get a compact representation for any given sequence. My expected input is a sequence of nucleotides (letters - A, G, T, C), for example, "AAAAGGAATTTCTCTGGGG....".
For images, adding noise is easy since the space is continuous. But in a discrete scenario such as this, what would be a good strategy for adding noise to my input?
My first thought is to randomly replace some of the nucleotides with "N", which means that the nucleotide at that position could not be identified accurately during sequencing. But changing even one nucleotide leads to a completely different sequence altogether, unlike images, where adding a small amount of noise doesn't change how the image looks visually. Please let me know if this is right or if there's a better way that I am not aware of.
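A minimal sketch of that masking idea, assuming each input sequence is a plain string over A, C, G, T (in a one-hot encoding, the injected 'N' could simply become an all-zeros column):

import random

def add_n_noise(seq, p=0.05):
    # Independently replace each nucleotide with 'N' with probability p.
    return ''.join('N' if random.random() < p else base for base in seq)

noisy = add_n_noise("AAAAGGAATTTCTCTGGGG")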
I'm not sure if this will help you or further complicate your issue, but in biology people normally use FASTQ files to store biological sequences and their corresponding Phred quality scores. A Phred quality score is a measure of the quality of the identification of the nucleobases generated by automated DNA sequencing.
For example, if Phred assigns a quality score of 30 to a base, the chances that this base is called incorrectly are 1 in 1000.
[Table relating Phred quality scores to base-call accuracy; public domain image from Wikipedia]
So you can add noise to the Phred quality scores (which encode the estimated probability that each base call is wrong) without changing the sequence itself.
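For instance, a rough sketch of jittering the quality string of a FASTQ record, assuming the standard Phred+33 ASCII encoding (the cap of 41 is a typical Illumina maximum, not a requirement):

import random

def jitter_quality(qual_string, sigma=3):
    # Decode Phred+33 characters, add Gaussian jitter, clamp to [0, 41], re-encode.
    noisy = []
    for ch in qual_string:
        q = ord(ch) - 33
        q = min(41, max(0, int(round(q + random.gauss(0, sigma)))))
        noisy.append(chr(q + 33))
    return ''.join(noisy)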
Also see this paragraph about current work done on compressing FASTQ files.
I am new to word/paragraph embedding and am trying to understand it via Doc2Vec in gensim. I would like to seek advice on whether my understanding is incorrect. My understanding is that Doc2Vec is potentially able to return documents that have semantically similar content. As a test, I tried the following and have the following questions.
Question 1: I noted that every run of training with the exact same parameters and examples will result in a model that produces very different results from previous runs (e.g., different vectors and a different ranking of similar documents every time). Why is this so non-deterministic? As such, can this be reliably used for any practical work?
Question 2: Why am I not getting the tag IDs of the top similar documents instead?
Results: [('day',0.477),('2016',0.386)....
Question 2 answer: The problem was due to model.most_similar; I should use model.docvecs.most_similar instead.
Please advise if I have misunderstood anything.
Data prep
I created multiple documents, each with one sentence. I deliberately made them distinctly different semantically.
A: It is a fine summer weather, with the birds singing and sun shining bright.
B: It is a lovely day indeed, if only i had a degree in appreciating.
C: 2016-2017 Degree in Earth Science Earthly University
D: 2009-2010 Dip in Life and Nature Life College
Query: Degree in Philosophy from Thinking University from 2009 to 2010
Training
I trained the documents (tokens as words, running index as tag)
import gensim
from gensim.models.doc2vec import TaggedDocument

tdlist=[]
docstring=['It is a fine summer weather, with the birds singing and sun shining bright.',
           'It is a lovely day indeed, if only i had a degree in appreciating.',
           '2016-2017 Degree in Earth Science Earthly University',
           '2009-2010 Dip in Life and Nature Life College']
counter=1
for para in docstring:
    tokens=tokenize(para)  # my own tokenizer; this will also strip punctuation
    td=TaggedDocument(gensim.utils.to_unicode(str.encode(' '.join(tokens))).split(), str(counter))
    tdlist.append(td)
    counter=counter+1

model=gensim.models.Doc2Vec(tdlist, dm=0, alpha=0.025, size=20, min_alpha=0.025, min_count=0)
for epoch in range(200):
    model.train(tdlist, total_examples=model.corpus_count, epochs=model.iter)
Inference
I then attempted to infer the query. Although there are many missing words in the vocabulary for the query, I would expect the closest document-similarity results to be C and D. But the results only gave me a list of 'words' followed by a similarity score. I am unsure if my understanding is wrong. Below is my code extract.
mydocvector=model.infer_vector(['Degree', 'in', 'Philosophy', 'from', 'Thinking', 'University', 'from', '2009', 'to', '2010'])
print(model.docvecs.most_similar(positive=[mydocvector]))
Doc2Vec doesn't work well on toy-sized datasets - few documents, few total words, few words per document. You'll absolutely want more documents than vector dimensions (size), and ideally tens-of-thousands of documents or more.
The second argument to TaggedDocument should be a list of tags. By supplying a single string-of-an-int, each of its elements (characters) will be seen as a tag. (With just documents 1 to 4 this won't yet hurt, but as soon as you have document 10, Doc2Vec will see it as tags '1' and '0', unless you supply it as ['10'], a single-element list.)
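For example, the tagging line from the question could become (only the tags argument changes):

td = TaggedDocument(words=tokens, tags=[str(counter)])   # tags is a list, e.g. ['10'], not the string '10'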
Yes, to find most-similar documents you use model.docvecs.most_similar() rather than model.most_similar() (which only operates on learned words, if any).
You are using dm=0 mode, which is a pretty good starting idea – it's fast and often a top performer. But note that this mode doesn't also train word-vectors. So anything you ask of the model's word-vectors, like model['summer'] or model.most_similar('sun'), will be nonsense results based on randomly-initialized but never-trained words. (If you need words trained too, either add dbow_words=1 to the dm=0 mode, or use a dm=1 mode. But for pure doc-vectors, dm=0 is a pretty good choice.)
There's no need to call train() in a loop - or indeed at all, given the line above it. The form you've used to instantiate Doc2Vec, with an actual corpus tdlist as the first argument, already triggers model setup and training, using the default number of iter passes (5) and the supplied alpha and min_alpha. Now, for Doc2Vec training you often want more passes (10 to 20 are common, though smaller datasets might benefit from even more). And for any training, for proper gradient descent, you want the effective learning rate alpha to gradually decline to a negligible value, such as the default 0.0001 (rather than a forced same-as-starting value).
The only situation where you'd usually call train() explicitly is if you instantiate the model without a corpus. In that case, you'd need to both call model.build_vocab(tdlist) (to let the model initialize with a discovered vocabulary), and then some form of train() - but you'd still need only one call to train, supplying the desired number of passes. (Allowing the default model.iter 5 passes, inside an outer loop of 200 iterations, means a total of 1000 passes over the data... and all at the same fixed alpha, which is not proper gradient-descent.)
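In code, the two usual patterns look roughly like this (sticking with the older gensim API that the question uses, where size and iter are the parameter names; in gensim 4.x they are vector_size and epochs):

# Pattern 1: corpus supplied at construction - vocabulary building and training happen automatically.
model = gensim.models.Doc2Vec(tdlist, dm=0, size=20, iter=20, min_count=2,
                              alpha=0.025, min_alpha=0.0001)

# Pattern 2: construct without a corpus, then build the vocabulary and call train() exactly once.
model = gensim.models.Doc2Vec(dm=0, size=20, iter=20, min_count=2)
model.build_vocab(tdlist)
model.train(tdlist, total_examples=model.corpus_count, epochs=model.iter)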
When you have a beefier dataset, you may find results improve with a higher min_count. Usually words that appear only a few times can't contribute much meaning, and thus only serve as noise slowing training and interfering with other vectors becoming more expressive. (Don't assume "more words must equal better results".) Throwing out the singletons, or more, usually helps.
Regarding inference, almost none of the words in your inference text are in the training set. (I only see 'Degree', 'in', and 'University' repeated.) So, in addition to all the issues above, inferring a good vector for the example text would be hard. With a richer training set, you'd likely get better results. It also often helps to increase the optional steps parameter of infer_vector() far above its default of 5.
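For example (reusing the query tokens from the question; 50 is just an illustrative value):

mydocvector = model.infer_vector(['Degree', 'in', 'Philosophy', 'from', 'Thinking',
                                  'University', 'from', '2009', 'to', '2010'], steps=50)
print(model.docvecs.most_similar(positive=[mydocvector]))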