I am working with a panel dataset on which I need to run a regression. I only started my PhD this semester, alongside the econometrics courses, so I am still new to many statistical applications and regression methods.
I want to run a simple regression of the form Y = x1 + x2 + x3, etc. I have already browsed through some literature and found that for panel data it is common to use a fixed effects regression. Also, my Y variable only takes non-negative values, so I was thinking in the direction of a Tobit model?
I'm doing some research concerning analyst coverage in the financial sector. My dependent variable is the coverage of analysts on a certain firm, so per observation I have 1 analyst and 1 firm, together with different characteristics of the firm (market cap, betas, etc.). All this data is monthly. As coverage cannot become negative (only 0), I was thinking of a Tobit model?
Do you have any ideas what would be a good regression method? Or do you know some good sources of information (e-books or printed books; through the university I have access to almost anything concerning my field of work)? I do have to learn these things for future research.
Fixed effects regression will be wrong. Your data are correlated across months, at least. In SAS/STAT you would use proc glimmix. SAS/ETS may have other procs which can do tobit links. Maybe proc qlim? For a first-year grad student this is pretty advanced. Suggest you get some help from a more senior colleague.
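For reference, the Tobit idea raised in the question can be written down as follows. This is only a sketch of the standard censored-at-zero specification; the firm-level effect $\alpha_i$ and the normal error are assumptions added here for illustration, not something the question or the answer specifies.

```latex
% Latent coverage y*_{it} for firm i in month t (assumed notation)
y^{*}_{it} = x_{it}'\beta + \alpha_i + \varepsilon_{it},
\qquad \varepsilon_{it} \sim \mathcal{N}(0, \sigma^2)

% Observed coverage is the latent value censored at zero
y_{it} = \max\left(0,\; y^{*}_{it}\right)
```

Here $x_{it}$ would hold the firm characteristics (market cap, betas, etc.), and $\alpha_i$ is one way to let the monthly observations of the same firm be correlated.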
I am in the process of creating a custom dataset to benchmark the accuracy of the 'bert-large-uncased-whole-word-masking-finetuned-squad' model for my domain, to understand if I need to fine-tune further, etc.
When looking at the different Question Answering datasets on the Hugging Face site (squad, adversarial_qa, etc.), I see that the answer is commonly formatted as a dictionary with keys: text (the answer text) and answer_start (char index where the answer starts).
I'm trying to understand:
The intuition behind how the model uses the answer_start when calculating the loss, accuracy, etc.
If I need to go through the process of adding this to my custom dataset (easier to run model evaluation code, etc?)
If so, is there a programmatic way to do this to avoid manual effort?
Any help or direction would be greatly appreciated!
Code example to show format:
import datasets
ds = datasets.load_dataset('squad')
train = ds['train']
print('Example: \n')
print(train['answers'][0])
Your question is a bit broad to give you a specific answer, but I will try my best to point you in some directions.
The intuition behind how the model uses the answer_start when calculating the loss, accuracy, etc.
There are different types of QA tasks/datasets. The ones you mentioned (SQuAD and adversarial_qa) belong to the field of extractive question answering. There, a model must select a span from a given context that answers the given question. For example:
context = 'Second, Democrats have always elevated their minority floor leader to the speakership upon reclaiming majority status. Republicans have not always followed this leadership succession pattern. In 1919, for instance, Republicans bypassed James R. Mann, R-IL, who had been minority leader for eight years, and elected Frederick Gillett, R-MA, to be Speaker. Mann "had angered many Republicans by objecting to their private bills on the floor;" also he was a protégé of autocratic Speaker Joseph Cannon, R-IL (1903–1911), and many Members "suspected that he would try to re-centralize power in his hands if elected Speaker." More recently, although Robert H. Michel was the Minority Leader in 1994 when the Republicans regained control of the House in the 1994 midterm elections, he had already announced his retirement and had little or no involvement in the campaign, including the Contract with America which was unveiled six weeks before voting day.'
question='How did Republicans feel about Mann in 1919?'
answer='angered' #-> starting at character 365
A simple approach that is often used today is a linear layer that predicts the answer start and answer end from the last hidden state of a transformer encoder. The last hidden state holds one vector for each input token (tokens != words), and the linear layer is trained to assign high probabilities to the tokens that could be the start and end of the answer span. To train a model with your data, the loss function needs to know which tokens should get a high probability (i.e. the start and end tokens of the answer).
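To make the link between answer_start and the training targets more concrete, here is a minimal sketch of how a character span can be turned into start and end token indices. It assumes whitespace tokenization purely for illustration; real models use subword tokenizers and their offset mappings, and the function name char_span_to_token_span is made up for this example.

```python
import re

def char_span_to_token_span(context, answer, answer_start):
    """Map a character-level answer span to (start_token, end_token) indices.

    Assumption: tokens are whitespace-separated chunks. Real QA pipelines
    use the subword tokenizer's offsets instead.
    """
    answer_end = answer_start + len(answer)  # exclusive character index
    # Each token is stored with its character start and end offsets.
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", context)]
    # A token belongs to the answer if its character range overlaps the answer range.
    covered = [i for i, (_, s, e) in enumerate(tokens) if s < answer_end and e > answer_start]
    if not covered:
        return None
    return covered[0], covered[-1]

context = "Mann had angered many Republicans by objecting to their bills."
answer = "angered many Republicans"
start_tok, end_tok = char_span_to_token_span(context, answer, context.find(answer))
print(start_tok, end_tok)
```

The two indices are what the start and end classifiers would be trained to predict for this example.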
If I need to go through the process of adding this to my custom dataset (easier to run model evaluation code, etc?)
You should go through this process; otherwise, how should someone know where the answer starts in your context? They can of course infer it programmatically, but what if your answer string appears twice in the context? Providing an answer start position avoids confusion and allows your users to use the dataset right away with one of the many extractive question answering scripts that are already available out there.
If so, is there a programmatic way to do this to avoid manual effort?
You could simply loop through your dataset and use str.find:
context.find(answer)
Output:
365
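Looping over a whole dataset could look roughly like the sketch below. The field names (context, question, answer) and the needs_review flag are assumptions for illustration; the flag marks answers that are missing from the context or that occur more than once, since str.find only returns the first match.

```python
def add_answer_start(examples):
    """Return copies of the examples with an 'answer_start' field added."""
    annotated = []
    for ex in examples:
        context, answer = ex["context"], ex["answer"]
        # str.find returns the first match, or -1 if the answer is absent.
        start = context.find(answer) if answer else -1
        occurrences = context.count(answer) if answer else 0
        annotated.append({
            **ex,
            "answer_start": start,
            # Anything other than exactly one match should be checked by hand.
            "needs_review": occurrences != 1,
        })
    return annotated

examples = [
    {"context": "Paris is the capital of France.", "question": "What is the capital of France?", "answer": "Paris"},
    {"context": "The cat sat. The cat slept.", "question": "Which animal slept?", "answer": "cat"},
    {"context": "Water boils at 100 degrees.", "question": "What does water turn into?", "answer": "steam"},
]
annotated = add_answer_start(examples)
for row in annotated:
    print(row["answer"], row["answer_start"], row["needs_review"])
```

Only the rows flagged for review would then need manual attention.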
I am working with data which contains marks and other features of students, and I am trying to predict whether they will get a high salary or not using scikit-learn in Python. I ran into a problem: since a student does not take all the subjects, his/her score in a subject is -1 if he/she has not taken that subject (a student can take multiple subjects).
Below is a snapshot taken from the data file:
[Image: snapshot of the data file]
I am trying to find a way to interpret the -1 in a way that doesn't alter the data much.
My Approach:
Take the percentile marks for each student and then take the average of all percentiles for each student, giving a single number per student, which is a lot easier to work with. But this method may lose some information about the distribution of marks.
Fill the -1 value with the average of marks for all the students in that subject, but this will not work if the data is biased towards one subject.
Is there any better way to deal with this kind of data?
Your "-1" values amount to missing data, so you are asking how to approach a classification task with missing data. There are many existing discussions on this topic.
A couple important considerations that come to mind:
One option is to "impute" the missing values, which is what you're describing with using "average marks." This approach often requires the assumption that the data is "missing at random" which in your case is unlikely to be true: for example, a bad student is more likely to not take a difficult subject, so missing values tell you something.
Regression models (like logistic regression) will in general require some type of imputation. But there are other models, like decision trees or random forests, that can handle missing data without imputation.
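One way to impute while still keeping the information that a value was missing is to add an indicator column per subject. The sketch below does this in plain Python; the function name and the toy marks are made up for illustration, and mean imputation is just one possible choice, not a recommendation for this particular dataset.

```python
def impute_with_indicator(rows, missing=-1):
    """Replace missing marks with the subject mean and append 0/1 'was missing' flags."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        # The mean is computed over the students who actually took the subject.
        observed = [r[j] for r in rows if r[j] != missing]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    out = []
    for r in rows:
        values = [means[j] if r[j] == missing else float(r[j]) for j in range(n_cols)]
        flags = [1 if r[j] == missing else 0 for j in range(n_cols)]
        out.append(values + flags)
    return out

# Two subjects per student; -1 means the subject was not taken.
marks = [[80, -1], [60, 70], [-1, 90]]
features = impute_with_indicator(marks)
for row in features:
    print(row)
```

A classifier trained on these features can then use the flags to pick up on the fact that not taking a subject may itself be predictive.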
Suppose you have a set of transcribed customer service calls between customers and human agents, where on average each call's length is 7 minutes. Customers will mostly call because of issues they have with the product. Let's assume that a human can assign one label per axis per call:
Axis 1: What was the problem from the customer's perspective?
Axis 2: What was the problem from the agent's perspective?
Axis 3: Could the agent resolve the customer's issue?
Based on the manually labeled texts you want to train a text classifier that shall predict a label for each call for each of the three axes. But the labeling of recordings takes time and costs money. On the other hand you need a certain amount of training data to get good prediction results.
Given the above assumptions, how many manually labeled training texts would you start with? And how do you know that you need more labeled training texts?
Maybe you've worked on a similar task before and can give some advice.
UPDATE (2018-01-19): There's no right or wrong answer to my question. Ok, ideally, somebody worked on exactly the same task, but that's very unlikely. I'll leave the question open for one more week and then accept the best answer.
This would be tricky to answer but I will try my best based on my experience.
In the past, I have performed text classification on 3 datasets; the number in the bracket indicates how big my dataset was: restaurant reviews (50K sentences), reddit comments (250k sentences) and developer comments from issue tracking systems (10k sentences). Each of them had multiple labels as well.
In each of the three cases, including the one with 10k sentences, I achieved an F1 score of more than 80%. I am stressing this dataset specifically because some people told me that it was too small.
So, in your case, assuming you have at least 1000 instances (calls that include a conversation between customer and agent) of 7 minutes on average, this should be a decent start. If the results are not satisfying, you have the following options:
1) Use different models (MNB, Random Forest, Decision Tree, and so on, in addition to whatever you are using).
2) If point 1 gives more or less similar results, check the ratio of instances of all the classes you have (for each of the 3 axes you are talking about here). If the classes are not well balanced, get more data, or try out the different balancing techniques if you cannot get more data.
3) Another way would be to classify at the sentence level rather than at the message or conversation level. This generates more data, with individual labels for sentences rather than for the message or the conversation itself.
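The class-ratio check from point 2 can be done with a few lines of Python. This is only a sketch; the label values for axis 3 below are invented for illustration.

```python
from collections import Counter

def class_ratios(labels):
    """Return the share of each class among the given labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Hypothetical labels for axis 3 (could the agent resolve the issue?).
axis3 = ["resolved", "resolved", "unresolved", "resolved"]
ratios = class_ratios(axis3)
print(ratios)
```

Running this once per axis shows quickly whether some classes are rare enough to need more labeled calls or a balancing technique.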
PySpark's mllib package provides the ALS.train() and ALS.trainImplicit() methods for training a recommendation model on explicit and implicit data respectively.
I want to train a model on implicit data. More specifically item purchase data. Since it is very rare in my case that a user will purchase an item more than once, the "ratings" or "preference" is always 1. So my dataset looks like:
u1, i1, 1
u1, i2, 1
u2, i2, 1
u2, i3, 1
...
un, im, 1
where u represents a user and i an item.
I do have a lot of features for users (demographics, location, etc.) as well as item features. But I cannot incorporate user or item features in the ALS.train or ALS.trainImplicit methods of pyspark.mllib.recommendation.
Alternatively, I have considered using fastFM or libfm. Both are packages for Factorization Machines that implement an ALS solver and frame recommendation as a regression/classification problem. With those I can include the user, item and more features in the training data as X. However, the predicted variable y will only be a vector of ones (I don't have explicit ratings, only purchases).
How do I get around this issue?
MF in Spark is a simple collaborative filtering implementation based on user-item events (implicit) or ratings (explicit). You can introduce contextual information into a 2D (user-item) recommender by pre-filtering or post-filtering the data. For example, suppose you have demographic information (M/F) and a kNN recommender (it could equally be MF, it doesn't matter). With pre-filtering, you first select only the records that share the same context, and then you run kNN on them. For MF it works the same way: two models have to be trained, one for F and one for M. Then, when generating recommendations, the first step is to select the right model. Both techniques are well described in "Recommender Systems Handbook".
For modeling context, FM is a good way to go. This post may be useful for you: How to use Python's FastFM library (factorization machines) for recommendation tasks? It shows how negative examples are introduced for implicit user feedback. Also pay attention to ranking prediction, which is usually the right way to go for recommendations.
Another option is to introduce your own heuristic, e.g. by boosting the final score. Maybe you have some knowledge, a business goal, or something else that can add value for you or the users.
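As a rough illustration of introducing negative examples for purchase-only data, the sketch below samples items a user has not bought and labels them 0. The function name, the one-negative-per-positive ratio and the uniform sampling are all assumptions for this example; an unobserved purchase is not necessarily a dislike, so this is a heuristic rather than ground truth.

```python
import random

def sample_negatives(purchases, all_items, n_per_positive=1, seed=0):
    """Return (user, item, label) rows: purchases get 1, sampled non-purchases get 0."""
    rng = random.Random(seed)
    by_user = {}
    for user, item in purchases:
        by_user.setdefault(user, set()).add(item)
    rows = [(u, i, 1) for u, i in purchases]
    for user, bought in by_user.items():
        # Only items the user has not purchased can serve as negatives.
        candidates = sorted(set(all_items) - bought)
        k = min(len(candidates), n_per_positive * len(bought))
        rows.extend((user, item, 0) for item in rng.sample(candidates, k))
    return rows

purchases = [("u1", "i1"), ("u1", "i2"), ("u2", "i2"), ("u2", "i3")]
items = ["i1", "i2", "i3", "i4", "i5"]
rows = sample_negatives(purchases, items)
for row in rows:
    print(row)
```

The resulting rows give y both ones and zeros, so user and item features can be attached to each row and fed to a factorization machine as a classification or ranking problem.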
I seek the most effective and simple way to classify 800k+ scholarly articles as either relevant (1) or irrelevant (0) in relation to a defined conceptual space (here: learning as it relates to work).
Data is: title & abstract (mean=1300 characters)
Any approaches may be used or even combined, including supervised machine learning and/or establishing features that give rise to some threshold values for inclusion, among others.
Approaches could draw on the key terms that describe the conceptual space, though a simple frequency count alone is too unreliable. Potential avenues might involve latent semantic analysis, n-grams, etc.
Generating training data may be realistic for up to 1% of the corpus, though this already means manually coding 8,000 articles (1=relevant, 0=irrelevant). Would that be enough?
Specific ideas and some brief reasoning are much appreciated so I can make an informed decision on how to proceed. Many thanks!
Several Ideas:
Run LDA and get the document-topic and topic-word distributions (say, 20 topics, depending on how many different topics your dataset covers). Label the top r% of documents with the highest weight on the relevant topic as relevant and the bottom nr% as non-relevant. Then train a classifier over those labelled documents.
Just use bag of words and retrieve the top r nearest neighbours to your query (your conceptual space) as relevant and the bottom nr percent as not relevant, and train a classifier over them.
If you had the citations you could run label propagation over the network graph by labelling very few papers.
Don't forget to make the title words different from your abstract words by changing the title words to title_word1 so that any classifier can put more weights on them.
Cluster the articles into, say, 100 clusters and then manually label those clusters. Choose the number of clusters based on the coverage of different topics in your corpus. You can also use hierarchical clustering for this.
If it is the case that the number of relevant documents is way less than non-relevant ones, then the best way to go is to find the nearest neighbours to your conceptual space (e.g. using information retrieval implemented in Lucene). Then you can manually go down in your ranked results until you feel the documents are not relevant anymore.
Most of these methods are bootstrapping or weakly supervised approaches to text classification, about which you can find more literature.
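The title-prefixing idea from the list above can be sketched in a few lines. The function name and the lowercase whitespace tokenization are assumptions for illustration only.

```python
def combine_title_and_abstract(title, abstract, prefix="title_"):
    """Prefix title words so a bag-of-words model treats them as separate features."""
    title_tokens = [prefix + w for w in title.lower().split()]
    abstract_tokens = abstract.lower().split()
    return " ".join(title_tokens + abstract_tokens)

doc = combine_title_and_abstract("Workplace Learning", "We study learning at work.")
print(doc)
```

After this step, "learning" in the title and "learning" in the abstract become two different vocabulary entries, so a classifier can weight them differently.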