I was wondering if and how multilevel modeling techniques such as hierarchical linear models (HLM), hierarchical generalized linear models (HGLM), structural equation modeling (SEM), and multilevel SEM can be conducted in Julia. Are there packages available for such analyses? (Equivalents in Julia to lme4, nlme, and lavaan in R.)
I was also wondering how to get Julia's output into documents. Jupyter can obviously create markdown documents, but I was wondering about functionality for creating more complex documents, similar to how knitr integrates R with LaTeX.
I'm not familiar with nlme and lavaan unfortunately, but the lme4 equivalent in Julia is MixedModels.jl, which is also developed by Doug Bates, one of the main lme4 developers.
Related
!pip install transformers
from transformers import InputExample, InputFeatures
What are InputExample and InputFeatures here?
Thanks.
Check out the documentation.
Processors
This library includes processors for several traditional tasks. These
processors can be used to process a dataset into examples that can be
fed to a model.
And
class transformers.InputExample
A single training/test example for simple sequence classification.
As well as
class transformers.InputFeatures
A single set of features of data. Property names are the same names as
the corresponding inputs to a model.
So basically, InputExample is just a raw input, and InputFeatures is the (numerical) feature representation of that input that the model uses.
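To make that concrete, here is a minimal sketch (the checkpoint name, guid, label, and max_length are just illustrative assumptions): wrap a raw sentence in an InputExample, then tokenize it into the numeric fields that an InputFeatures object holds.

from transformers import AutoTokenizer, InputExample, InputFeatures

# Raw text plus a label, wrapped as an InputExample
example = InputExample(guid="train-0", text_a="The movie was great!", label=1)

# Tokenize the raw text into ids the model understands
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoding = tokenizer(example.text_a, padding="max_length", truncation=True, max_length=32)

# The numerical representation the model actually consumes
features = InputFeatures(
    input_ids=encoding["input_ids"],
    attention_mask=encoding["attention_mask"],
    token_type_ids=encoding["token_type_ids"],
    label=example.label,
)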
I couldn't find any tutorial explicitly explaining this, but you can check out Chapter 4 (From text to features) in this tutorial, where it is nicely explained with an example.
In my experience, the transformers library has an absolute ton of classes and structures, so going too deep into the technical implementation makes it easy to get lost. For starters, I would recommend getting an idea of the broader picture by getting some example projects to work, as well as checking out their 🤗 Course.
I am building an NLP pipeline and I am trying to get my head around the optimal structure. My understanding at the moment is the following:
Step 1 - Text pre-processing [a. lowercasing, b. stopword removal, c. stemming, d. lemmatisation] (see the sketch after this list)
Step 2 - Feature extraction
Step 3 - Classification, using different types of classifiers (LinearSVC, etc.)
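For reference, a minimal sketch of Step 1 with NLTK (the stopword list, stemmer, and lemmatizer here are just example choices, not a fixed recipe):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = text.lower().split()                                         # a. lowercasing (whitespace tokenizing)
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]   # b. stopword removal
    # c./d. in practice you usually pick ONE of stemming or lemmatisation
    return [lemmatizer.lemmatize(t) for t in tokens]                      # or: stemmer.stem(t)

print(preprocess("The cats were running quickly across the gardens"))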
From what I read online, there are several approaches with regard to feature extraction, but there isn't a solid example/answer.
a. Is there a solid strategy for feature extraction?
I read online that you can do [a. vectorising using scikit-learn, b. TF-IDF],
but I also read that you can use part-of-speech tags, word2vec or other embeddings, and named entity recognition.
b. What is the optimal process/structure of using these?
c. On the text pre-processing, I am doing the processing on a text column of a DataFrame, and the last modified version of it is what I use as the input to my classifier. If you do feature extraction, do you do that in the same column, or do you create a new one and only send the features from that column to the classifier?
Thanks so much in advance
The preprocessing pipeline depends mainly on the problem you are trying to solve. TF-IDF, word embeddings, etc. each have their own restrictions and advantages.
You need to understand the problem and also the data associated with it. To make the best use of the data, you need to implement the proper pipeline.
Specifically for text-related problems, you will find word embeddings to be very useful. TF-IDF is useful when the problem calls for emphasising lower-frequency words. Word embeddings, on the other hand, convert the text to an N-dimensional vector which can be compared for similarity with other vectors. This can bring a sense of association into your data, so the model can learn the best features possible.
In simple cases, a bag-of-words representation over the tokenized texts can be enough.
So, you need to discover the best approach for your problem. If you are solving a problem which closely resembles well-known NLP problems like IMDB review classification or sentiment analysis on Twitter data, then you can find a number of approaches on the internet.
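As a minimal illustration of the Step 2 / Step 3 split with scikit-learn (the texts, labels, and parameter choices below are made-up placeholders): TF-IDF handles feature extraction and LinearSVC handles classification.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = ["great product, works well", "terrible, broke after a day",
         "absolutely loved it", "waste of money"]
labels = [1, 0, 1, 0]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),  # feature extraction
    ("svm", LinearSVC()),                                              # classification
])
clf.fit(texts, labels)
print(clf.predict(["it works great"]))

With a pipeline like this, you pass the (preprocessed) text column directly; the TF-IDF features live inside the pipeline, so they don't need to be stored in a separate DataFrame column.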
I can't seem to find a way to visualize my RF model, obtained using Spark MLlib's RandomForestModel. The model, printed as a string, is just a bunch of nested IF statements. It seems natural to want to visualize it like is possible in R. I am using the Spark Python and Java APIs, and I am open to using anything that will produce an R-like visualization of my RF model.
There is a library out there to help with this, EurekaTrees. Basically, it just takes the debug string, builds a tree, and then displays it as a webpage using d3.js.
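For reference, the debug string such a tool consumes comes straight off the model. A rough PySpark sketch (the data path and forest parameters are placeholders):

from pyspark import SparkContext
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils

sc = SparkContext(appName="rf-debug-string")
data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")  # placeholder path

model = RandomForest.trainClassifier(
    data, numClasses=2, categoricalFeaturesInfo={}, numTrees=3, maxDepth=4)

# The nested-IF string that a tree visualizer parses
print(model.toDebugString())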
From Databricks (Oct 2015):
"The plots listed above as Scala-only will soon be available in Python notebooks as well. There are also other machine learning model visualizations on the way. Stay tuned for Decision Tree and Machine Learning Pipeline visualizations!"
In other words: can one replace eigenvectors with pattern matching and graph traversal, and emulate dimensionality reduction?
I mean that, given a semantic graph of English words, I want to compute something similar to:
king - man = queen
Which means that I can subtract a subgraph from a graph and score the resulting subgraph given a metric.
I don't expect this to be a single Neo4j or Gremlin query. I'm interested in the underlying mechanics involved in reasoning both globally and locally over a graph database.
I think it's important to remember the difference between using a graph database as a storage solution and using machine learning to extract connected subgraphs as vectors, representing features that are then used to train an ML model proper.
The difference is that you can structure your data in a way that makes it easier to find patterns suitable for creating a machine learning model. It's certainly a good idea to use Neo4j to do this, but it's not something that comes out of the box. I've created a plugin for Neo4j that will extract hierarchical pattern matches from text using a genetic algorithm that I thought up. You can take a look here: http://www.kennybastani.com/2014/08/using-graph-database-for-deep-learning-text-classification.html
You can then use the resulting data to construct a word2vec model.
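A minimal gensim sketch of the vector arithmetic involved (the toy corpus below is a stand-in for whatever sentences or paths you extract from the graph):

from gensim.models import Word2Vec

sentences = [["king", "rules", "the", "kingdom"],
             ["queen", "rules", "the", "kingdom"],
             ["king", "is", "a", "man"],
             ["queen", "is", "a", "woman"]]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=200)

# vector("king") - vector("man") + vector("woman") should land near "queen"
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))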
It appears that the simplest, most naive way to do basic sentiment analysis is with a Bayesian classifier (confirmed by what I'm finding here on SO). Any counter-arguments or other suggestions?
A Bayesian classifier with a bag of words representation is the simplest statistical method. You can get significantly better results by moving to more advanced classifiers and feature representation, at the cost of more complexity.
Statistical methods aren't the only game in town. Rule-based methods that have more understanding of the structure of the text are the other main option. From what I have seen, these don't actually perform as well as statistical methods.
I recommend Chapter 16, "Text Categorization", of Manning and Schütze's Foundations of Statistical Natural Language Processing.
I can't think of a simpler, more naive way to do Sentiment Analysis, but you might consider using a Support Vector Machine instead of Naive Bayes (in some machine learning toolkits, this can be a drop-in replacement). Have a look at "Thumbs up? Sentiment Classification using Machine Learning Techniques" by Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan which was one of the earliest papers on these techniques, and gives a good table of accuracy results on a family of related techniques, none of which are any more complicated (from a client perspective) than any of the others.
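A minimal scikit-learn sketch of the two options, a bag-of-words naive Bayes baseline with LinearSVC as the drop-in swap (the reviews and labels are made-up placeholders):

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

reviews = ["loved every minute of it", "utterly boring and slow",
           "a delightful surprise", "not worth watching"]
labels = ["pos", "neg", "pos", "neg"]

nb = make_pipeline(CountVectorizer(), MultinomialNB())   # naive Bayes baseline
svm = make_pipeline(CountVectorizer(), LinearSVC())      # drop-in SVM replacement

for clf in (nb, svm):
    clf.fit(reviews, labels)
    print(clf.predict(["surprisingly delightful film"]))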
Building upon the answer provided by Ken above, there is another paper
"Sentiment analysis using support vector machines with diverse information sources" by Tony and Niger,
which looks at assigning more features than just the bag of words used by Pang and Lee. Here, they leverage WordNet to determine the semantic differentiation of adjectives, and the proximity of the sentiment to the topic in the text, as additional features for the SVM. They show better results than previous attempts to classify text based on sentiment.