Sample data with too many dimensions in SVM

I am working with training and test data consisting of Google search snippets.
The training data consists of 10,060 snippets, one snippet per line; each snippet is a list of words/terms plus a class label at the end.
There are 8 class labels:
Business, Computers, Culture-Arts-Entertainment, Education-Science, Engineering, Health, Politics-Society, Sports
The following are some of the lines in the dataset:
manufacture manufacturer directory directory china taiwan products manufacturers directory- taiwan china products manufacturer direcory exporter directory supplier directory suppliers business
empmag electronics manufacturing procurement homepage electronics manufacturing procurement magazine procrement power products production essentials data management business
dfma truecost paper true cost overseas manufacture product design costs manufacturing products china manufacturing redesigned product china save business
As you can see, every sample must be represented with the same number of dimensions to use SVM.
I am thinking of using 1 to indicate that a word occurs in a given row and 0 otherwise, so each row becomes a 0/1 vector. However, there will be too many dimensions.
My question: are there other ways to preprocess the data in order to perform SVM efficiently?
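For reference, a minimal sketch of that 0/1 encoding with scikit-learn, assuming the snippets have already been loaded into a list of strings (here called snippets, with the class labels stripped off); the shape of the resulting matrix shows the dimensionality concern directly:

```python
from sklearn.feature_extraction.text import CountVectorizer

# snippets: list of strings, one search snippet per entry (hypothetical name, assumed already loaded)
vectorizer = CountVectorizer(binary=True)   # 1 if a word occurs in the snippet, 0 otherwise
X = vectorizer.fit_transform(snippets)      # sparse 0/1 matrix: rows = snippets, columns = vocabulary terms
print(X.shape)                              # (number of snippets, vocabulary size)
```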

You should look into term weighting and feature selection before performing text classification with SVM.
The default approach would be:
1. Apply tfc term weighting, which is based on the inverse document frequency multiplied by the term frequency in the current document.
2. Apply Information Gain-based feature selection.
3. Transform your documents on the basis of 1. and 2.
4. Perform text classification with SVM.
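A minimal sketch of steps 1, 2 and 4 with scikit-learn, assuming the snippets and labels are already loaded as lists; TfidfVectorizer is used as a stand-in for tfc weighting, Information Gain is approximated by mutual information (mutual_info_classif), and k=2000 is an arbitrary choice:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# texts: list of snippet strings, labels: list of class labels (both assumed already loaded)
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),                          # step 1: tf-idf style term weighting
    ("select", SelectKBest(mutual_info_classif, k=2000)),  # step 2: keep the 2000 highest-scoring terms
    ("svm", LinearSVC()),                                   # step 4: linear SVM on the reduced vectors
])
pipeline.fit(texts, labels)
```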
I recommend the following publications for further reading. In these publications you will find the typical approaches used for SVM-based text classification in the research community:
T. Joachims. Text categorization with Support Vector Machines: Learning with many relevant features. In Machine Learning: ECML-98, Lecture Notes in Computer Science, vol. 1398, Springer, Berlin, Heidelberg, 1998.
Y. Yang and J. Pedersen. A comparative study on feature selection in text categorization. In International Conference on Machine Learning (ICML), 1997.
G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523, 1988.

Related

Subset Text Similarity Score - How to detect a piece of small text that is very similar to a subset of a much bigger text

There are many ways to detect similarity between two texts, like the Jaccard index, TF-IDF cosine similarity, or sentence embeddings. However, all of them refer to the use case of two texts that are compared in full.
Here, for lack of a better name, I call it a Subset Text Similarity Score: the aim is to detect/calculate a score indicating whether a small text is an extract from a bigger text.
For example, here is a big text (from a news article):
Google Stadia, the company's cloud gaming service, will shut down on January 18 after the game failed to gain the traction with users the company had hoped for.
The cloud gaming service debuted through a closed beta in October 2018 and publicly launched in November 2019.
In spite of the fact that users are about to lose access to all of their titles and save on Stadia, many publishers share ways to keep playing their games on other platforms, reports The Verge.
Moreover, Google is also refunding all Stadia hardware purchased through the Google Store as well as all the games and add-on content purchased from the Stadia store.
The objective of the subset text similarity score is to detect whether a small text is a subset (an extract) of the bigger text above. The small text may have its sentences in a different order than the bigger text it is compared to.
Example small text:
On Stadia, users are will lose access to all of their titles and saves. all Stadia hardware purchased through the Google Store will be refunded.
For the small text above, the subset similarity score should be very high.
Is there a package or NLP method that can do this?
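One possible sketch, not a standard package: split both texts into sentences, score each sentence of the small text against its best-matching sentence in the big text using the TF-IDF cosine similarity already mentioned, and average. Sentence order then no longer matters. The function name and the naive period-based splitting are ad hoc choices:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def subset_similarity(small_text, big_text):
    # Naive sentence splitting on periods; a proper sentence tokenizer would be better.
    small_sents = [s.strip() for s in small_text.split(".") if s.strip()]
    big_sents = [s.strip() for s in big_text.split(".") if s.strip()]
    vec = TfidfVectorizer().fit(small_sents + big_sents)
    sims = cosine_similarity(vec.transform(small_sents), vec.transform(big_sents))
    # Best match in the big text for each small-text sentence, averaged over the small text.
    return sims.max(axis=1).mean()
```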

Selecting the most informative categorical features for a multi-class ML classification model

I have a dataset of all the software installed by a large group of users. I have to classify the users into one of 4 categories based on which software they installed (each user can install up to 30 pieces of software).
The categories are highly imbalanced: one category contains nearly 45% of the users in the training dataset, another 35%, a third only 15%, and the fourth only 5%. Let's say that these 4 categories roughly correspond to 4 different IT job types (e.g. "Software Engineer", "DevOp", "Analyst", etc.).
"Software" is a feature with high cardinality (above 1000), so naive one-hot encoding does not seem appropriate.
I would like to identify a subset of informative values/levels for this variable. For instance, a software like an anti-virus program probably does not differentiate well between these categories as all or most users would have it installed. A specialised tool (e.g. IDE) might differentiate better, i.e. its frequency of occurrence might be different within each IT job category.
How can I identify such "informative" features using Python? Do I use sklearn.feature_selection.chi2, or sklearn.feature_selection.mutual_info_classif, or some other method?
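Both scorers can be tried directly on a multi-hot encoding of the installed-software lists. A rough sketch, where installed (a list of software-name lists, one per user) and job (the category labels) are hypothetical names for data assumed to be loaded already:

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_selection import chi2, mutual_info_classif

# installed: list of software-name lists, one per user; job: list of category labels (assumed loaded)
mlb = MultiLabelBinarizer()
X = mlb.fit_transform(installed)                # users x software titles, 0/1 indicator matrix
chi2_scores, p_values = chi2(X, job)            # chi-squared score per software title
mi_scores = mutual_info_classif(X, job, discrete_features=True)

# Software titles ranked by how informative they are, highest first
top_by_chi2 = [mlb.classes_[i] for i in np.argsort(chi2_scores)[::-1][:20]]
top_by_mi = [mlb.classes_[i] for i in np.argsort(mi_scores)[::-1][:20]]
```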

Statistical method to compare urban vs rural matched siblings

I am writing a study protocol for my master's thesis. The study seeks to compare the rates of non-communicable diseases (NCDs) and risk factors and to determine the effects of rural-to-urban migration. Sibling pairs will be identified from a rural area. One of the siblings should have participated in the rural NCD survey which is currently ongoing in the area. The other sibling should have left the area and reported moving to a city. Data will be collected by completing a questionnaire on demographics, family history, medical history, diet, alcohol consumption, smoking, and physical activity. This will be done for both the rural and the urban sibling, together with data on the amount of time spent in urban areas.
The outcomes, which are binary (whether one has the condition or not), are: 1. diabetic, 2. hypertensive, 3. obese.
What statistical method can I use to compare the outcomes (stated above) between the two groups, considering that the siblings were matched (one urban sibling for every rural sibling)?
What statistical methods can also be used to explore associations between the amount of time spent in urban residence and the outcomes?
Given that your main aim is to compare counts from two nominal distributions, a chi-square test seems to be the method of choice for your first question. However, it should be mentioned that a chi-square test is somewhat "the smallest" test for detecting differences between samples. If you are studying medicine (or a related field), a chi-square test is fine because it is also frequently applied by researchers in this field. If you are studying psychology or sociology (or related fields), I'd advise highlighting the limitations of the test in the discussion section, since it mostly tests your distributions against randomly expected distributions.
Regarding your second question, a logistic regression would be applicable, since it allows binomially distributed variables both as independent variables (predictors) and as dependent variables. However, if you have other interval-scaled variables (e.g. age, weight, etc.), you could also use t-tests or ANOVAs to investigate differences in these variables with respect to the presence of specific diseases (i.e. diabetic or not).
Overall, this strongly depends on what you mean by "association". Classically, "association" refers to correlation or linear regression (for which you need interval-scaled variables on "both sides"), but given your data structure, the aforementioned methods are a better fit.
How you actually calculate these tests depends on the statistics software used.
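As an illustration only, a sketch in Python with made-up numbers (the counts and years below are hypothetical): a chi-square test on a 2x2 table of residence vs. outcome, and a logistic regression of an outcome on years spent in urban residence:

```python
import numpy as np
from scipy.stats import chi2_contingency
import statsmodels.api as sm

# Hypothetical 2x2 table: rows = rural/urban sibling, columns = diabetic yes/no
table = np.array([[30, 70],
                  [45, 55]])
chi2_stat, p_value, dof, expected = chi2_contingency(table)

# Hypothetical data: years spent in urban residence and a binary outcome per person
years_urban = np.array([0, 2, 5, 8, 12, 15, 1, 3, 7, 10])
diabetic = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 1])
model = sm.Logit(diabetic, sm.add_constant(years_urban)).fit()
print(model.summary())   # coefficient and p-value for the association with years in urban residence
```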

Text classification

I have a trivial understanding of NLP so please keep things basic.
I would like to run some PDFs at work through a keyword extractor/classifier and build a taxonomy - in the hope of delivering some business intelligence.
For example, given a few thousand PDFs to mine, I would like to determine the markets they apply to (we serve about 5 major industries, each having several minor industries). Each industry and sub-industry has a specific market, and in most cases those deal with OEMs, which in turn deal with models, which further subdivide into component parts, etc.
I would love to crunch these PDFs into a semi-structured output (more of a graph, actually) like:
Aerospace
Manufacturing
Repair
PT Support
M250
C20
C18
Distribution
Can text classifiers do that? Is this too specific? How do you train a system like this so that it knows C18 is a "model" from "manufacturer" Rolls Royce in the M250 series, and that "PT Support" is a sub-component?
I could build this data manually, but it would take forever...
Is there a way I could use a text classification framework and build something more efficiently than with regex and Python?
Just looking for ideas at this point... I watched a few tutorials on R and Python libraries, but they didn't sound quite like what I am looking for.
OK, let's break your problem into smaller sub-problems first. I would break the task down as follows:
1. Read the PDFs and extract the data and metadata from them - take a look at the Apache Tika library.
2. Any classifier needs training data to be effective - create training data for the text classifier.
3. Then apply any suitable classification algorithm.
You can also have a look at the Carrot2 clustering engine; it will automatically analyse the data and group the PDFs into different categories.
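A rough sketch of steps 1-3 in Python, assuming the tika-python bindings for Apache Tika are installed and the training PDFs are organised one folder per label (the pdfs/ layout here is made up for illustration):

```python
from pathlib import Path
from tika import parser                      # Python bindings for Apache Tika
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Step 1: extract raw text from a PDF
def pdf_to_text(path):
    parsed = parser.from_file(str(path))
    return parsed.get("content") or ""

# Step 2: build training data from a hypothetical folder layout, e.g. pdfs/Aerospace/*.pdf
texts, labels = [], []
for pdf in Path("pdfs").glob("*/*.pdf"):
    texts.append(pdf_to_text(pdf))
    labels.append(pdf.parent.name)           # folder name is the label

# Step 3: a simple text classifier
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)
```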

How to classify text when pre-defined categories are not available

I have a problem and I am not sure which algorithm to apply.
I am thinking of applying clustering in case two, but I have no idea about case one:
I have 0.5 million credit card activity documents. Each document is well defined and contains one transaction per line: the date, the amount, the retailer name, and a short 5-20 word description of the retailer.
Sample:
2004-11-47,$500,Amazon,An online retailer providing goods and services including books, hardware, music, etc.
Questions:
1. How would you classify each entry given no pre-defined categories?
2. How would you do this if you were given pre-defined categories such as "restaurant", "entertainment", etc.?
1) How would you classify each entry given no pre-defined categories?
You wouldn't. Instead, you'd use some dimensionality reduction algorithm on the data's features to project them into 2-D, make a guess at the number of "natural" clusters, then run a clustering algorithm.
2) How would you do this if you were given pre-defined categories such as "restaurant", "entertainment", etc.?
You'd manually label a bunch of them, then train a classifier on that data and see how well it works with the usual machinery of accuracy/F1, cross-validation, etc. Or you'd check whether a clustering algorithm picks up these categories well, but then you would still need some labeled data.
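A rough sketch of that unsupervised route, assuming the retailer descriptions have already been parsed out of the transaction lines into a list of strings called descriptions (the cluster count of 8 is just a guess one would revise after looking at the plot):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# descriptions: list of retailer-description strings (hypothetical name, assumed already extracted)
X = TfidfVectorizer().fit_transform(descriptions)

# Project to 2-D to eyeball the number of "natural" clusters
X_2d = TruncatedSVD(n_components=2).fit_transform(X)
plt.scatter(X_2d[:, 0], X_2d[:, 1], s=2)
plt.show()

# Then run a clustering algorithm on the full features with the guessed cluster count
clusters = KMeans(n_clusters=8).fit_predict(X)
```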
