What happened to sklearn.datasets.load_boston?

While I was coding a Boston housing model using sklearn.datasets.load_boston, I got an error saying that the dataset was deprecated due to 'ethical' issues. What are those issues? I looked online and could not find anything.
Here's the full error:
DEPRECATED: load_boston is deprecated in 1.0 and will be removed in 1.2.
The Boston housing prices dataset has an ethical problem. You can refer to the documentation of this function for further details.
The scikit-learn maintainers therefore strongly discourage the use of this dataset unless the purpose of the code is to study and educate about ethical issues in data science and machine learning.
In this special case, you can fetch the dataset from the original source:

It is exactly as the error says. You can check https://scikit-learn.org/1.1/modules/generated/sklearn.datasets.load_boston.html for further details.
As I understand it, there are two problems with the data:
Racism: There is a great article by M. Carlisle, also cited in the scikit-learn documentation, which focuses on the main issues with the Boston housing dataset; he found that house prices were affected by the racial makeup of the neighbourhood.
No suitable goal: "the goal of the research that led to the creation of this dataset was to study the impact of air quality but it did not give adequate demonstration of the validity of this assumption."
However, you can get the data from the source:
http://lib.stat.cmu.edu/datasets/boston
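For reference, the full deprecation notice shows how to fetch and reassemble the data with pandas; a sketch along those lines (each record spans two physical lines in the raw file):

import pandas as pd
import numpy as np

data_url = "http://lib.stat.cmu.edu/datasets/boston"
# Skip the 22-line header; each observation spans two physical rows
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# Stitch each pair of half-rows back into the 13 feature columns
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
# The target (median house value, MEDV) is the third field of every second row
target = raw_df.values[1::2, 2]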
I hope these help.

Yes, this dataset was removed in scikit-learn version 1.2. If you want to use it, you can install an earlier version of scikit-learn:
pip install scikit-learn==1.1.3
This will still show the deprecation warning, and you should use the dataset only for educational purposes.

Related

Where to raise a typo in the PyTorch Documentation?

I have found a typo in the official PyTorch Documentation. Where can I raise the flag so that it is rectified?
From the PyTorch Contribution Guide, in the section on Documentation:
Improving Documentation & Tutorials
We aim to produce high quality documentation and tutorials. On rare
occasions that content includes typos or bugs. If you find something
you can fix, send us a pull request for consideration.

Algorithmic details behind Deep Feature Synthesis and Featuretools?

In order to use them properly, it is important to understand the algorithmic/mathematical basis of Deep Feature Synthesis and Featuretools. Are there papers, patents, or comparisons with other tools?
You can find the peer-reviewed paper on Deep Feature Synthesis (the algorithm used in Featuretools) here: https://dai.lids.mit.edu/wp-content/uploads/2017/10/DSAA_DSM_2015.pdf
The implementation has changed since publication, but the core ideas have not. Refer to the documentation or source on GitHub for the latest details.
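As a rough illustration, here is a minimal sketch of running DFS on Featuretools' bundled demo data; note that keyword names vary slightly across versions (target_entity before 1.0, target_dataframe_name after):

import featuretools as ft

# Demo entity set with related customers, sessions, and transactions tables
es = ft.demo.load_mock_customer(return_entityset=True)

# Deep Feature Synthesis stacks aggregation and transform primitives
# across the table relationships, up to max_depth levels deep
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",  # "target_entity" in featuretools < 1.0
    agg_primitives=["mean", "sum", "count"],
    trans_primitives=["month"],
    max_depth=2,
)
print(feature_defs)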

Is there a way to reduce the size of the Spacy installation?

I'm using spaCy in a project and noticed that my Docker images are pretty big. A bit of research led me to find out that the spaCy installation itself (in /usr/local/lib/python3.6/site-packages/spacy) accounts for 267 MB, so I was wondering if there's anything that can be done to reduce that footprint?
As it happens, spaCy 2.2 was released yesterday (Oct 2nd, 2019).
One of the headline features of 2.2 is "Smaller disk foot-print, better language resource handling", so upgrading to spaCy 2.2 may be one way to reduce the size of a spaCy installation.
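For example (assuming one of the small pretrained models is enough for your pipeline), something along these lines keeps the footprint down:

pip install -U spacy
# the small English model is a few tens of MB, unlike the md/lg models
python -m spacy download en_core_web_sm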
(Although this doesn't solve your specific problem, I believe it does answer the question.)

How to get "Universal dependencies, enhanced" in response from Stanford coreNLP?

I am playing around with the Stanford coreNLP parser and I am having a small issue that I assume is just something stupid I'm missing due to my lack of experience. I am currently using the node.js stanford-corenlp wrapper module with the latest full Java version of Stanford CoreNLP.
My current results return something similar to the "Collapsed Dependencies with CC processed" data here: http://nlp.stanford.edu/software/example.xml
I am trying to figure out how I can get the dependencies titled "Universal dependencies, enhanced" as shown here: http://nlp.stanford.edu:8080/parser/index.jsp
If anyone can shed some light on even just what direction I need to research, it would be extremely helpful. Currently Google has not been helping much with the specific "Enhanced" results, and I am just trying to find out what I need to pass, call, or include in my annotators to get the results shown at the link above. Thanks for your time!
Extra (enhanced) dependencies can be enabled in the depparse annotator by using its 'depparse.extradependencies' option.
According to http://nlp.stanford.edu/software/corenlp.shtml it is set to NONE by default, and can be set to SUBJ_ONLY or MAXIMAL.
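For instance, a minimal properties sketch (the Java-side keys are the same however you pass properties through the node.js wrapper):

annotators = tokenize,ssplit,pos,depparse
# MAXIMAL yields the extra (enhanced) dependencies; SUBJ_ONLY is the lighter option
depparse.extradependencies = MAXIMAL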

Generating questions from text (NLP)

What approaches are there to generating questions from a sentence? Let's say I have the sentence "Jim's dog was very hairy and smelled like wet newspaper". Which toolkit is capable of generating a question like "What did Jim's dog smell like?" or "How hairy was Jim's dog?"
Thanks!
Unfortunately there isn't one, exactly. There is some code written as part of Michael Heilman's PhD dissertation at CMU; perhaps you'll find it and its corresponding papers interesting.
If it helps, the topic you want information on is called "question generation". This is pretty much the opposite of what Watson does, even though "here is an answer, generate the corresponding question" is exactly how Jeopardy! is played; Watson itself is a "question answering" system.
In addition to the link to Michael Heilman's PhD provided by dmn, I recommend checking out the following papers:
Automatic Question Generation and Answer Judging: A Q&A Game for Language Learning (Yushi Xu, Anna Goldie, Stephanie Seneff)
Automatic Question Generation from Sentences (Husam Ali, Yllias Chali, Sadid A. Hasan)
As of 2022, Haystack provides a comprehensive suite of tools for question generation and answering, using the latest Transformer models and transfer learning.
From their website,
Haystack is an open-source framework for building search systems that work intelligently over large document collections. Recent advances in NLP have enabled the application of question answering, retrieval and summarization to real world settings and Haystack is designed to be the bridge between research and industry.
NLP for Search: Pick components that perform retrieval, question answering, reranking and much more
Latest models: Utilize all transformer based models (BERT, RoBERTa, MiniLM, DPR) and smoothly switch when new ones get published
Flexible databases: Load data into and query from a range of databases such as Elasticsearch, Milvus, FAISS, SQL and more
Scalability: Scale your system to handle millions of documents and deploy them via REST API
Domain adaptation: All tooling you need to annotate examples, collect user-feedback, evaluate components and finetune models.
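For a flavour of the API, here is a minimal sketch assuming Haystack 1.x, whose QuestionGenerator node wraps a T5 model fine-tuned for end-to-end question generation:

from haystack.nodes import QuestionGenerator

# Downloads a default question-generation model on first use
question_generator = QuestionGenerator()

text = "Jim's dog was very hairy and smelled like wet newspaper."
questions = question_generator.generate(text)
print(questions)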
Based on my personal experience, I was 95% successful in generating questions and answers during my internship for training purposes. I have a sample web user interface and the code to demonstrate: My Web App and Code.
Huge shoutout to the developers on the Slack channel for helping noobs in AI like me! Implementing and deploying an NLP model would not have been this easy without Haystack. I believe this is the only tool out there where one can develop and deploy this easily.
Disclaimer: I do not work for deepset.ai or Haystack, am just a fan of haystack.
As of 2019, question generation from text has become feasible, and there are several research papers on the task.
The current state-of-the-art question generation model uses language modeling with different pretraining objectives. The research paper, code implementation, and pre-trained model are available on the Papers with Code website.
This model can be used to fine-tune on your own dataset (instructions for finetuning are given here).
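If you prefer to work with the models directly, here is a hedged sketch using the Hugging Face transformers pipeline; the checkpoint named below is just one example of a community question-generation model, not necessarily the one behind the links above:

from transformers import pipeline

# Example community checkpoint fine-tuned for end-to-end question
# generation; swap in whichever model the paper/code above provides
qg = pipeline("text2text-generation", model="valhalla/t5-base-e2e-qg")

sentence = "Jim's dog was very hairy and smelled like wet newspaper."
print(qg(sentence))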
I would suggest checking out this link for more solutions. I hope it helps.
