Looking for a dataset with US and other addresses - nlp

I'm looking for a dataset that contains addresses within sentences to train NER, preferably US addresses. I can't find such a dataset. Do you know of any?

How about the Australian Address Data: https://geoscape.com.au/data/g-naf/

Related

Neuroimage MRI scan CNN model preparation

I would like to know a couple of things to clear up my confusion. I want to work on a medical neuroimage MRI scan dataset from the ADNI database.
Each Alzheimer's Disease (AD) MRI scan has multiple slices.
Do I have to separate each scan's slices and label each of them as AD, or combine all the slices into one image scan and label that for classification?
Most medical neuroimages are in DICOM, NIfTI (.nii), etc. format. Is it mandatory to convert them to PNG or JPG for a CNN model, or can I keep them in NIfTI/.nii format?
I have read several existing papers on neuroimaging for Alzheimer's disease but did not find answers to the above questions. I even emailed the authors of one paper; in reply, they said they cannot help with this as they are very busy, and offered their sincere apologies.
It would be very helpful if anyone could answer these questions and clear up my confusion.
Thank you.
You can train with NIfTI using, for example, TorchIO. There's no need to separate each slice; you can use the 3D image as is.
You can find some examples in the documentation.
Disclaimer: I'm the main developer of TorchIO.
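If it helps, here's a minimal sketch of that idea, assuming TorchIO and PyTorch are installed; the file names and the 0/1 diagnosis labels are placeholders for your ADNI data, not anything TorchIO prescribes:

```python
# Minimal sketch: loading whole NIfTI volumes with TorchIO for a 3D CNN.
import torch
import torchio as tio

# One Subject per scan; the whole 3D volume stays together (no slicing).
subjects = [
    tio.Subject(mri=tio.ScalarImage('scan_ad.nii.gz'), diagnosis=1),  # hypothetical AD scan
    tio.Subject(mri=tio.ScalarImage('scan_cn.nii.gz'), diagnosis=0),  # hypothetical control
]

transform = tio.Compose([
    tio.ToCanonical(),                         # consistent orientation
    tio.Resample(1),                           # resample to 1 mm isotropic spacing
    tio.CropOrPad((128, 128, 128)),            # uniform shape so volumes batch together
    tio.RescaleIntensity(out_min_max=(0, 1)),  # normalize intensities
])

dataset = tio.SubjectsDataset(subjects, transform=transform)
loader = torch.utils.data.DataLoader(dataset, batch_size=2)

for batch in loader:
    volumes = batch['mri'][tio.DATA]  # tensor of shape (batch, 1, 128, 128, 128)
    labels = batch['diagnosis']
    # feed `volumes` to a 3D network built from torch.nn.Conv3d layers
```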

Cluster similar words using word2vec

I have various restaurant labels, and I also have some words that are unrelated to restaurants, like below:
vegan
vegetarian
pizza
burger
transportation
coffee
Bookstores
Oil and Lube
I have a mix of around 500 such labels. I want to know whether there is a way to pick out the labels that are related to food choices and leave out words like 'Oil and Lube' and 'transportation'.
I tried using word2vec, but some of the labels have more than one word and I could not figure out the right way to handle them.
The brute-force approach is to tag them manually, but I want to know whether there is a way to use NLP or Word2Vec to cluster all related labels together.
Word2Vec could help with this, but key factors to consider are:
How are your word-vectors trained? Off-the-shelf vectors (like the popular GoogleNews vectors trained on a large corpus of news stories) are unlikely to closely match the senses of these words in your domain, or to include multi-word tokens like 'oil_and_lube'. But if you have a good training corpus from your own domain, with multi-word tokens from a controlled vocabulary (like oil_and_lube) used in context, you might get quite good vectors for exactly the tokens you need.
The similarity of word-vectors isn't strictly 'synonymity' but often reflects other forms of close relation, including oppositeness and other ways words can be interchangeable or used in similar contexts. So whether the word-vector similarity values provide a good threshold cutoff for your particular "related to food" test is something you'd have to try out and tinker with. (For example, whether words that are drop-in replacements for each other are closest, or words that are common in the same topics are closest, can be influenced by whether the window parameter is smaller or larger. So you could find that tuning Word2Vec training parameters improves the resulting vectors for your specific needs.)
Making more recommendations for how to proceed would require more details on the training data you have available – where do these labels come from? What format are they in? How much data do you have? – and on your ultimate goals – why is it important to distinguish between restaurant and non-restaurant labels?
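To make the multi-word issue concrete, here is one hedged sketch using pretrained vectors from gensim's downloader ('glove-wiki-gigaword-100' is just one available model): average the per-word vectors of each label and compare against a hand-picked set of food "seed" words. The seed list, example labels, and any cutoff you choose are assumptions to tune, not a fixed recipe:

```python
import numpy as np
import gensim.downloader as api

wv = api.load('glove-wiki-gigaword-100')  # pretrained word vectors
seeds = ['food', 'restaurant', 'pizza', 'coffee']
seed_vec = np.mean([wv[w] for w in seeds], axis=0)

def label_vector(label):
    """Average the vectors of the in-vocabulary words in a multi-word label."""
    words = [w for w in label.lower().split() if w in wv]
    return np.mean([wv[w] for w in words], axis=0) if words else None

for label in ['vegan', 'oil and lube', 'Bookstores', 'burger']:
    vec = label_vector(label)
    if vec is not None:
        sim = float(np.dot(vec, seed_vec) /
                    (np.linalg.norm(vec) * np.linalg.norm(seed_vec)))
        # higher = more food-related; pick a cutoff by inspecting known pairs
        print(label, round(sim, 3))
```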
OK, thank you for the details.
In order to train word2vec you should take the following facts into account:
1. You need a huge and varied text dataset. Review your training set and make sure it contains the useful data you need in order to obtain what you want.
2. Set one sentence/phrase per line.
3. For preprocessing, delete punctuation and lowercase all strings.
4. Do NOT lemmatize or stem; it will make the text less complex!
5. Try different settings:
5.1 Algorithm: I used word2vec and I can say continuous bag-of-words (CBOW) provided better results, on different training sets, than skip-gram.
5.2 Number of layers: 200 layers provide good results.
5.3 Vector size: a vector length of 300 is OK.
Now run the training algorithm. Then use the obtained model to perform different tasks. For example, in your case, for synonymy, you can compare two words (i.e. their vectors) with cosine similarity. From my experience, cosine provides satisfactory results: the similarity between two words is given by a value between 0 and 1. Synonyms have high cosine values; you must find the threshold between words which are synonyms and those that are not.
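Here's a minimal sketch of that recipe, assuming gensim is installed; 'corpus.txt' is a hypothetical file with one lowercased, punctuation-free sentence per line from your own domain:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

model = Word2Vec(
    sentences=LineSentence('corpus.txt'),
    sg=0,             # 0 = CBOW, which worked better than skip-gram above
    vector_size=300,  # vector length suggested in 5.3
    window=5,
    min_count=5,
)

# Cosine similarity between two labels; higher values suggest relatedness.
# Choose the synonym/non-synonym threshold empirically, as described above.
print(model.wv.similarity('pizza', 'burger'))
```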

NLP Aspect Mining approach

I'm trying to implement an aspect miner based on consumer reviews on Amazon for durables (washing machines, refrigerators). The idea is to output sentiment polarity for aspects instead of for the entire sentence. For example, the review 'Food was good but service was bad' must output 'food' as positive and 'service' as negative. I read through Richard Socher's paper on the RNTN model for fine-grained sentiment classification, but I guess I'll need to manually tag sentiment for phrases in a different domain and create my own treebank for better accuracy.
Here's an alternate approach I'd thought of. Could someone please validate it or guide me with feedback?
Break the approach into two subtasks: 1) identify aspects, 2) identify sentiment.
Identify aspects
Use a POS tagger to identify all nouns. This should shortlist potentially all aspects in the reviews (see the sketch after these steps).
Use word2vec on these nouns to determine similar nouns and reduce the dataset size.
Identify sentiments
Train a CNN or dense-net model on reviews with ratings 1, 2, 4, and 5 (ignoring 3, as we need data that has polarity).
Break down the test-set reviews into phrases (e.g. 'Food was good') and then score them using the above model.
Find the aspects identified in the first subtask and tag them to their respective phrases.
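For the POS-tagging step, a minimal sketch with spaCy (one reasonable tagger choice; the review text is illustrative, and the en_core_web_sm model is assumed to be installed):

```python
import spacy

nlp = spacy.load('en_core_web_sm')
review = "Food was good but service was bad"

# Nouns serve as candidate aspects, per step 1 of the approach above.
doc = nlp(review)
aspects = [token.text.lower() for token in doc if token.pos_ == 'NOUN']
print(aspects)  # likely ['food', 'service']
```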
I don't know how to answer this question fully, but I have a few suggestions:
Take a look at multitask learning in the neural-network literature and try an end-to-end neural network for multiple tasks.
Use pretrained word vectors like word2vec or GloVe as inputs.
Don't rely on POS taggers when you use internet data.
Find a way to represent named entities and out-of-vocabulary words in your design.
Don't ignore rating 3!
You should annotate some data periodically.

Correcting the names in NLP

I have a dataset where a lot of names are written incorrectly, like 'man1sh' instead of 'manish', or 'vikas' written as 'v1kas'.
How can one correct these names with NLP?
Any help is appreciated.
Try deep-neural-network-based spell correction: https://medium.com/@majortal/deep-spelling-9ffef96a24f6. This method is the state of the art at the moment. Here is the code: https://github.com/MajorTal/DeepSpell, and someone has already made an improvement over it: https://hackernoon.com/improving-deepspell-code-bdaab1c5fb7e. I am not able to find the paper, but there is also a published paper that does character-level deep neural networks for edit distance, with good results and a public dataset.
For the above methods, as for all machine learning solutions, you need data for training. If you don't have data for your case, then the old simple edit-distance methods (http://norvig.com/spell-correct.html) are the only way.
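As a hedged illustration of the simple approach, here's a sketch that snaps a noisy name to the closest entry in a known list of valid names; it uses Python's difflib similarity ratio as a stand-in for edit distance, and the name list is illustrative:

```python
from difflib import get_close_matches

valid_names = ['manish', 'vikas', 'anita', 'rahul']  # your own name gazetteer

def correct_name(noisy, cutoff=0.6):
    # get_close_matches ranks candidates by a character-overlap ratio
    matches = get_close_matches(noisy.lower(), valid_names, n=1, cutoff=cutoff)
    return matches[0] if matches else noisy  # leave unmatched names as-is

print(correct_name('man1sh'))  # -> manish
print(correct_name('v1kas'))   # -> vikas
```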

Need training data for categories like Sports, Entertainment, Health, etc. and all the subcategories

I am experimenting with classification algorithms in ML and am looking for a corpus to train my model to distinguish among different categories like sports, weather, technology, football, cricket, etc.
I need some pointers on where I can find datasets with these categories.
Another option for me is to crawl Wikipedia to get data for the 30+ categories, but I wanted some brainstorming and opinions on whether there is a better way to do this.
Edit
Train: train the model using the bag-of-words approach for these categories.
Test: classify new/unknown websites into these predefined categories depending on the content of the webpage.
The UCI machine learning repository contains a searchable archive of datasets for supervised learning.
You might get better answers if you provide more specific information about what inputs and outputs your ideal dataset would have.
Edit:
It looks like dmoz has a dump that you can download.
A dataset of newsgroup messages, classified by subject
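As a hedged sketch of the bag-of-words plan, here is a pipeline using scikit-learn's built-in 20 Newsgroups loader (presumably the kind of newsgroup dataset meant above), with a count vectorizer feeding a simple Naive Bayes classifier; the three category choices are just examples:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

cats = ['rec.sport.baseball', 'sci.med', 'sci.space']
train = fetch_20newsgroups(subset='train', categories=cats)
test = fetch_20newsgroups(subset='test', categories=cats)

# Bag-of-words counts feeding a simple Naive Bayes classifier.
clf = make_pipeline(CountVectorizer(stop_words='english'), MultinomialNB())
clf.fit(train.data, train.target)
print(clf.score(test.data, test.target))  # held-out accuracy
```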
