Statistics, machine learning and data mining

I am currently learning data mining and I have the following questions.
What is the relationship between machine learning and data mining?
I found that many data mining techniques are associated with statistics, while I "hear" that data mining has a lot to do with machine learning. So my question is: is machine learning closely related to statistics?
If they are not closely related, is there a division separating data mining that focuses on statistical techniques from data mining that focuses on machine learning skills? I ask because I found that the statistics departments of some graduate schools offer data mining courses.

Data mining is the process of extracting useful information from data, such as patterns, trends, customer/user behavior, likes/dislikes, etc. It involves the use of algorithms related to Artificial Intelligence and statistics.
Wikipedia's definition of Data Mining is:
Data Mining (the analysis step of the Knowledge Discovery in Databases
process,[1] or KDD), a relatively young and interdisciplinary field of
computer science,[2][3] is the process of discovering new patterns
from large data sets involving methods from statistics and artificial
intelligence but also database management. In contrast to for example
machine learning, the emphasis lies on the discovery of previously
unknown patterns as opposed to generalizing known patterns to new
data.
Machine Learning involves making computers "learn" those behaviors and trends, and act accordingly. For example, in credit card fraud detection, the computer "learns" the behavior of a customer, and if something strange occurs (such as a transaction involving a very high amount), it flags that transaction as potential fraud.
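As a toy illustration of that flagging idea (the threshold and transaction amounts below are made up, not part of any real system):

```python
# A minimal sketch of the fraud-flagging idea: learn a customer's typical
# transaction amounts, then flag anything far outside that behavior.
# The 3-standard-deviation threshold is an arbitrary illustrative choice.
from statistics import mean, stdev

def flag_suspicious(history, new_amount, z_threshold=3.0):
    """Return True if new_amount deviates strongly from past behavior."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return new_amount != mu
    return abs(new_amount - mu) / sigma > z_threshold

history = [25.0, 40.0, 31.5, 28.0, 35.0, 22.0]
print(flag_suspicious(history, 30.0))    # a typical amount: not flagged
print(flag_suspicious(history, 5000.0))  # unusually large: flagged
```

Real systems use far richer models of behavior, but the statistical core is the same: estimate what is normal, then measure deviation from it.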
Wikipedia's definition of machine learning is:
Machine learning, a branch of artificial intelligence, is a scientific
discipline concerned with the design and development of algorithms
that allow computers to evolve behaviors based on empirical data, such
as from sensor data or databases. Machine Learning is concerned with
the development of algorithms allowing the machine to learn via
inductive inference based on observing data that represents incomplete
information about statistical phenomenon. Classification which is also
referred to as pattern recognition, is an important task in Machine
Learning, by which machines “learn” to automatically recognize complex
patterns, to distinguish between exemplars based on their different
patterns, and to make intelligent decisions.
Machine learning uses Data Mining to learn patterns, behaviors, trends, etc., because Data Mining is the way of extracting this information from a set of data. Data Mining and Machine Learning both use statistics to make decisions. So yes, statistics is involved and is very important in both Data Mining and Machine Learning.

There tends to be a lot of overlap between what different people call machine learning, data mining and statistics. The very definitions of the terms would depend on whom you ask.
Here is a nice overview, with lots of great links.

Although Data Mining and Machine Learning overlap, we can distinguish between them simply as follows:
Data mining searches for patterns in order to predict and/or describe huge amounts of data;
Machine Learning goes further and uses these patterns to learn.
And both are based on statistics.

A comprehensive answer was already given by @SpeedBirdNine. As a side note:
Data-mining and Machine-learning are mainly based on the old but ingenious ideas of statisticians. (Inferential statistics, decision theories, etc.)
Classic Statistics + today's powerful computers = DM & ML
Since we are living in the era of big data, the barrier statisticians used to face, namely the absence of enough data, is no longer an issue. Therefore, in many cases (but not all, of course), it is safe to say that data mining/machine learning is the new statistics! (The infinity symbol ∞ they used to have in their equations, meaning that if n, the sample size, goes to infinity then everything's behavior becomes predictable, is no longer a compromise with reality!)
Regarding your last question: in my opinion, in any meaningful research you either need to apply statistical methods to big data, which is when DM/ML comes in handy, or you need to apply a DM/ML method that is itself designed on the basis of classical statistics. These are the two directions every DM/ML research project involves, and statistics is excluded from neither, let alone when the goal is to come up with a novel DM/ML algorithm to analyze/cluster/classify big data.


Is my Statistical Treatment of Data Correct?

I am aware that consulting a statistician is not free and is something I cannot afford, so I am trying my luck here. For the problem at hand, I have already finished gathering data for my research and am now calculating the results. However, I am stuck on what I should use for my statistical treatment of the data.
For background, I am using ISO 25010 to test my software's quality and user acceptance. The questionnaire consists of a number of questions for each cluster (functionality, reliability, usability, efficiency, maintainability, and portability). I have also used a Likert scale of the agreement type. The hypothesis of my research states: "There is no significant difference in the user acceptance results in terms of [clusters]". So far, for calculating the results, I have used descriptive statistics: the mean (for each question), the average mean (the average of the means for each cluster), and the mode.
I feel that the results I currently have might be lacking when the final defense comes. As far as I know, using a combination of statistical methods is acceptable and gives a stronger foundation for your results.
Based on the background of my research, what other statistical methods should I use?
I am thinking of sample standard deviation, but I don't know if I should compute it by questions or by cluster.
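For instance, here is a minimal sketch of computing the mean and sample standard deviation per cluster (the cluster names and scores below are made up, not my real data):

```python
from statistics import mean, stdev

# Made-up Likert scores (1-5), one aggregated score per respondent per cluster.
# In my real data I would first average each respondent's answers within a cluster.
responses = {
    "functionality": [4, 5, 4, 3, 5, 4],
    "usability":     [3, 4, 4, 4, 3, 5],
}

# Per-cluster mean and sample standard deviation.
for cluster, scores in responses.items():
    print(f"{cluster}: mean={mean(scores):.2f}, sd={stdev(scores):.2f}")
```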
Sorry, statistics is not really my forte.
Thank you in advance for your answers.

Are transformer-based language models overfitting on the paraphrase identification task? What tools overcome this?

I've been working on a sentence transformation task that involves paraphrase identification as a critical step: if we are confident enough that the state of the program (a sentence repeatedly modified) has become a paraphrase of a target sentence, stop transforming. The overall goal is actually to study potential reasoning in predictive models that can generate language prior to a target sentence. The approach is just one specific way of reaching that goal. Nevertheless, I've become interested in the paraphrase identification task itself, as it's received some boost from language models recently.
The problem I run into is when I manipulate sentences from examples or datasets. For example, in this HuggingFace example, if I negate either sequence or change the subject to Bloomberg, I still get a majority "is paraphrase" prediction. I started going through many examples in the MSRPC training set and negating one sentence in a positive example or making one sentence in a negative example a paraphrase of the other, especially when doing so would be a few word edit. I found to my surprise that various language models, like bert-base-cased-finetuned-mrpc and textattack/roberta-base-MRPC, don't change their confidences much on these sorts of changes. It's surprising as these models claim an f1 score of 0.918+. The dataset is clearly missing a focus on negative examples and small perturbative examples.
My question is: are there datasets, techniques, or models that deal well with small edits? I know this is an extremely generic question, much more so than is typically asked on StackOverflow, but my concern is finding practical tools. If there is only a theoretical technique, it might not be suitable, as I'm in the category of "available tools define your approach" rather than vice versa. So I hope the community has a recommendation on this.
Short answer to the question: yes, they are overfitting. Most of the important NLP data sets are not actually well-crafted enough to test what they claim to test, and instead test the ability of the model to find subtle (and not-so-subtle) patterns in the data.
The best tool I know for creating data sets that help deal with this is Checklist. The corresponding paper, "Beyond Accuracy: Behavioral Testing of NLP models with CheckList", is very readable and goes into depth on this type of issue. They have a very relevant table, but first we need some terms:
We prompt users to evaluate each capability with
three different test types (when possible): Minimum Functionality tests, Invariance, and Directional Expectation tests... A Minimum Functionality test (MFT), is a collection of simple examples (and labels) to check a
behavior within a capability. MFTs are similar to
creating small and focused testing datasets, and are
particularly useful for detecting when models use
shortcuts to handle complex inputs without actually
mastering the capability.
...An Invariance test (INV) is when we apply
label-preserving perturbations to inputs and expect
the model prediction to remain the same.
A Directional Expectation test (DIR) is similar,
except that the label is expected to change in a certain way. For example, we expect that sentiment
will not become more positive if we add “You are
lame.” to the end of tweets directed at an airline
(Figure 1C).
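These test types can be illustrated with a toy stub in place of a real model (the predict function below is a placeholder, not an actual classifier):

```python
# Toy illustration of CheckList-style INV and DIR tests with a stub model;
# predict() is a placeholder standing in for a real sentiment classifier.
def predict(text):
    return "negative" if "lame" in text.lower() else "positive"

def invariance_test(model, text, perturbed):
    """INV: a label-preserving perturbation should not change the prediction."""
    return model(text) == model(perturbed)

def directional_test(model, text, suffix):
    """DIR: appending an insult should not make the prediction more positive."""
    rank = {"negative": 0, "positive": 1}
    return rank[model(text + " " + suffix)] <= rank[model(text)]

print(invariance_test(predict, "Great flight to Boston.", "Great flight to Denver."))  # True
print(directional_test(predict, "Great flight.", "You are lame."))                     # True
```

In practice you would swap `predict` for the model under test and run the perturbations CheckList generates for you.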
I haven't been actively involved in NLG for long, so this answer will be a bit more anecdotal than SO's algorithms would like. Starting with the fact that in my corner of Europe, the general sentiment toward peer review requirements for any kind of NLG project is higher by several orders of magnitude compared to other sciences, and likely not without reason (or a tensor thereof).
This makes funding a bigger challenge, so wherever you are, I wish you luck on that front. I'm not sure of how big of a deal this site is in the niche, but [Ehud Reiter's Blog][1] is where I would start looking into your tooling ideas.
Maybe even reach out to them/him personally, because I can't think of another source that has an academic background and a strong propensity for practical applications of NLG, at least based on the kind of content they've been putting out over the years.
Your background, environment/funding, and the seniority level/control you have over the project will eventually compose your decision vector for you. It's just how it goes on the bleeding edge of anything. What I will add, though, is not to limit yourself to a single language or technology in this phase, for the precise reasons you've mentioned. I'd recommend the same in terms of potential open-source involvement, but if your profile information is accurate, that probably won't happen, no matter what you do and accomplish.
But yeah, in the grand scheme of things, your question is far from too broad, in my view. It identifies a rather unmistakable problem pattern that not all branches of science approach as lackadaisically as NLG-adjacent fields seem to right now. In that regard, it's not broad enough, and it will need to be promulgated far and wide before community-driven tooling gives you serious options on a micro level.
Blasphemy, sure, but the odds are already stacked against you. As for the question potentially being too broad, I'd posit it is not broad enough, so long as we collectively remain in an "oh, I was waiting for you to start doing something about it" phase.
P.S. Blasphemous as this might sound to a 2021 data scientist, I'd eliminate any Rust and ECMAScript alternatives before looking into Python, due to performance reasons. Even accounting for the ridicule this would receive, you already have a data set there, as a mental exercise at least.
[1]: https://ehudreiter.com/2016/12/18/nlg-vs-templates/

Best evaluation method for real-time machine translation?

I'm aware that there are many different methods like BLEU, NIST, METEOR etc. They all have their pros and cons, and their effectiveness differs from corpus to corpus. I'm interested in real-time translation, so that two people could have a conversation by typing out a couple sentences at a time and having it immediately translated.
What kind of corpus would this count as? Would the text be considered too short for proper evaluation by most conventional methods? Would the fact that the speaker is constantly switching make the context more difficult?
What you are asking for belongs to the domain of Confidence Estimation, nowadays (within the Machine Translation (MT) community) better known as Quality Estimation, i.e. "assigning a score to MT output without access to a reference translation".
For MT evaluation (using BLEU, NIST or METEOR) you need:
1. A hypothesis translation (MT output)
2. A reference translation (from a test set)
In your case (real-time translation), you do not have (2). So you will have to estimate the performance of your system, based on features of your source sentence and your hypothesis translation, and on the knowledge you have about the MT process.
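For illustration, a few sentence-level features in that spirit can be sketched as follows (these are rough stand-ins, not the exact feature set of the 17-feature baseline):

```python
# Illustrative sentence-level Quality Estimation features computed from the
# source sentence and the MT hypothesis alone, with no reference translation.
def qe_features(source, hypothesis):
    src_tokens = source.split()
    hyp_tokens = hypothesis.split()
    return {
        "src_length": len(src_tokens),
        "hyp_length": len(hyp_tokens),
        # Very long or very short outputs relative to the source are suspect.
        "length_ratio": len(hyp_tokens) / max(len(src_tokens), 1),
        # Heavy word repetition in the hypothesis can signal degenerate output.
        "hyp_type_token_ratio": len(set(hyp_tokens)) / max(len(hyp_tokens), 1),
    }

print(qe_features("das ist ein Test", "this is a test"))
```

A QE system then trains a regressor from such features (plus model-internal ones) to predicted quality scores.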
A baseline system with 17 features is described in:
Specia, L., Turchi, M., Cancedda, N., Dymetman, M., & Cristianini, N. (2009b). Estimating the sentence level quality of machine translation systems. 13th Conference of the European Association for Machine Translation, (pp. 28-37)
Which you can find here
Quality Estimation is an active research topic. The most recent advances can be followed on the websites of the WMT Conferences. Look for the Quality Estimation shared tasks, for example http://www.statmt.org/wmt17/quality-estimation-task.html
Your corpus would be a chat, or a type of question answering.
If you have many sentence suggestions available, then you could try https://gitlab.com/Bachstelze/translation-metric/tree/master/
It is a vector space model approach on the sentence level, so you don't have to train a language-specific system, and the switching between speakers shouldn't be a problem as long as the sentences don't get too short.

Reconstructing now-famous 17-year-old's Markov-chain-based information-retrieval algorithm "Apodora"

While we were all twiddling our thumbs, a 17-year-old Canadian boy has apparently found an information retrieval algorithm that:
a) performs with twice the precision of the current, widely-used vector space model
b) is 'fairly accurate' at identifying similar words.
c) makes microsearch more accurate
Here is a good interview.
Unfortunately, there's no published paper I can find yet, but, from the snatches I remember from the graphical models and machine learning classes I took a few years ago, I think we should be able to reconstruct it from his submission abstract and what he says about it in interviews.
From interview:
Some searches find words that appear in similar contexts. That’s
pretty good, but that’s following the relationships to the first
degree. My algorithm tries to follow connections further. Connections
that are close are deemed more valuable. In theory, it follows
connections to an infinite degree.
And the abstract puts it in context:
A novel information retrieval algorithm called "Apodora" is introduced,
using limiting powers of Markov chain-like matrices to determine
models for the documents and making contextual statistical inferences
about the semantics of words. The system is implemented and compared
to the vector space model. Especially when the query is short, the
novel algorithm gives results with approximately twice the precision
and has interesting applications to microsearch.
I feel like someone who knows about Markov-chain-like matrices or information retrieval would immediately be able to tell what he's doing.
So: what is he doing?
From the use of words like 'context' and the fact that he's introduced a second-order level of statistical dependency, I suspect he is doing something related to the LDA-HMM method outlined in the paper: Griffiths, T., Steyvers, M., Blei, D., & Tenenbaum, J. (2005). Integrating topics and syntax. Advances in Neural Information Processing Systems. There are some inherent limits to the resolution of the search due to model averaging. However, I'm envious of someone doing stuff like this at 17, and I hope to heck he's done something independent and at least incrementally better. Even a different direction on the same topic would be pretty cool.
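Whatever the exact model, the "limiting powers of Markov chain-like matrices" in the abstract presumably refers to the fact that powers of a row-stochastic matrix converge to a limit whose rows are the stationary distribution. A toy sketch of that idea (the transition matrix is made up for illustration):

```python
# Powers of a row-stochastic matrix converge to a limit whose rows all equal
# the stationary distribution; such a limit could serve as a context-weighted
# relevance model over words or documents, following connections "to an
# infinite degree" as the interview puts it.
def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def limiting_power(p, steps=10):
    """Repeatedly square p, returning p raised to the power 2**steps."""
    result = p
    for _ in range(steps):
        result = matmul(result, result)
    return result

# Hypothetical 2-state transition matrix between two word contexts.
P = [[0.9, 0.1],
     [0.5, 0.5]]
limit = limiting_power(P)
print([round(x, 3) for x in limit[0]])  # stationary distribution ~ [0.833, 0.167]
```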

Word characteristics tags

I want to build a riddle AI chatbot for my AI class.
So I figured the input to the chatbot would be something like:
"It is blue, and it is up, but it is not the ceiling"
Translation:
<Object X>
<blue>
<up>
<!ceiling>
</Object X>
(Answer: sky?)
So the input is a set of characteristics (existing/not existing in the object), and the output is the matched, most likely object.
The domain will be limited to a number of objects, and I could input all the attributes myself, but I was thinking:
How could I programmatically build a database of characteristics for a word?
Is there such a database available? How could I tag a word, and how could I programmatically find all its attributes? I was thinking of crawling Wikipedia, or some forum, but I can't see that building any reliable word-tag database.
Any ideas on how I could achieve such a thing? Any ideas on some literature on the subject?
Thank you
This sounds like a basic classification problem. You're essentially asking: given N features (color=blue, location=up, etc.), which of M classifications is the most likely? There are many algorithms for accomplishing this (Naive Bayes, Maximum Entropy, Support Vector Machines), but you'll have to investigate which is the most accurate and easiest to implement. The biggest challenge is typically acquiring accurate training data, but if you're willing to restrict it to a list of manually entered examples, that should simplify your implementation.
Your example suggests that whatever algorithm you choose will have to support sparse data. In other words, if you've trained the system on 300 features, it won't require you to enter all 300 features in order to get an answer. It'll also make your training and testing files smaller, because you'll be able to omit features that are irrelevant for certain objects, e.g.
sky | color:blue,location:up
tree | has_bark:true,has_leaves:true,is_an_organism:true
cat | has_fur:true,eats_mice:true,is_an_animal:true,is_an_organism:true
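A toy sketch of that sparse matching idea (the scoring here is plain feature overlap, standing in for a real classifier such as Naive Bayes; the objects and features are illustrative):

```python
# Each object stores only the features relevant to it (sparse data), and a
# query scores candidates by how many stated features they satisfy, minus
# any features the riddle explicitly negates.
objects = {
    "sky":  {"color": "blue", "location": "up"},
    "tree": {"has_bark": True, "has_leaves": True, "is_an_organism": True},
    "cat":  {"has_fur": True, "eats_mice": True, "is_an_animal": True},
}

def best_match(positive, negative):
    """positive: required feature values; negative: feature names to penalize."""
    def score(feats):
        s = sum(1 for k, v in positive.items() if feats.get(k) == v)
        s -= sum(1 for k in negative if k in feats)
        return s
    return max(objects, key=lambda name: score(objects[name]))

# "It is blue, and it is up, but it is not the ceiling"
print(best_match({"color": "blue", "location": "up"}, {"is_ceiling"}))  # sky
```

A probabilistic classifier would replace the overlap count with likelihoods, which handles noisy or partially matching riddles more gracefully.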
It might not be terribly helpful, since it's proprietary, but a commercial application that's similar to what you're trying to accomplish is the website 20q.net, albeit the system asks the questions instead of the user. It's interesting in that it's trained "online" based on user input.
Wikipedia certainly has a lot of data, but you'll probably find extracting that data for your program will be very difficult. Cyc's data is more normalized, but its API has a huge learning curve. Another option is the semantic dictionary project Wordnet. It has reasonably intuitive APIs for nearly every programming language, as well as an extensive hypernym/hyponym model for thousands of words (e.g. cat is a type of feline/mammal/animal/organism/thing).
The Cyc project has very similar aims: I believe it contains both inference engines to perform the AI, and databases of facts about commonsense knowledge (like the colour of the sky).
