Compare two web pages (A/B testing) - Two sample proportion test - statistics

I have two versions of my web page (the original and one with some change), and I'm monitoring a bunch of variables on each. What I'm able to extract from my website monitoring experiment is as follows:
Original solution: Visitors, body link click-visitors, most popular click-visitors, share-visitors.
Solution with some change: Visitors, body link click-visitors, most popular click-visitors, share-visitors.
I was thinking about a simple two-sample proportion test: take each of the monitored variables and compute the proportion test for the original and the changed solution.
I don't know whether that tells me anything about the overall result, i.e. whether the original solution is better than the solution with some change or not.
Is there something better that I can use for this purpose? I'd appreciate any advice.

It sounds to me like you're confusing two things: the business metric of interest and the test for statistical significance. The former is some business measurement that you would like to measure. This could be sales, conversion, subscription rate, or many others. See e.g. this paper for a good discussion on the perils of using the wrong metric. Statistical significance is a test that tells you whether the number of measurements you've seen so far is enough to substantiate a claim that the difference between the two experiences is very unlikely to be random. See e.g. this paper for a good discussion.
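For the per-metric significance check the question asks about, a two-sample proportion z-test could look like the minimal sketch below. The visitor and click counts are made up for illustration and the function name is hypothetical; it uses the pooled standard error and a normal approximation, which assumes reasonably large counts.

```python
import math


def two_proportion_ztest(successes_a, visitors_a, successes_b, visitors_b):
    """Two-sided z-test for the difference between two proportions (pooled SE)."""
    p_a = successes_a / visitors_a
    p_b = successes_b / visitors_b
    pooled = (successes_a + successes_b) / (visitors_a + visitors_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: P(|Z| >= |z|)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts for the original page and the changed page
original = {"visitors": 10000, "body_link": 520, "most_popular": 310, "share": 95}
variant = {"visitors": 10000, "body_link": 585, "most_popular": 300, "share": 120}

for metric in ("body_link", "most_popular", "share"):
    z, p = two_proportion_ztest(
        original[metric], original["visitors"],
        variant[metric], variant["visitors"],
    )
    print(f"{metric}: z = {z:.2f}, p = {p:.3f}")
```

Running one test per monitored variable raises the chance of a false positive somewhere, so a stricter threshold (for example, alpha divided by the number of metrics) may be worth considering. It also does not say which metric matters for the business; that choice still has to be made separately.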

Related

Is my Statistical Treatment of Data Correct?

I am aware that consulting a statistician is not free and it is something I cannot afford, so I am trying my luck here. For the problem at hand, I've already finished data gathering for my research and am now calculating the results. However, I am stuck on what I should use for my statistical treatment of data.
For background, I am using ISO 25010 to test my software quality and user acceptance. The questionnaire consists of a number of questions for each cluster (functionality, reliability, usability, efficiency, maintainability, and portability). I've also used a Likert scale (agreement type). The hypothesis of my research says "There is no significant difference in the user acceptance results in terms of [clusters]". As of now, I've used descriptive statistics for calculating the results: the mean (for each question), the average mean (the average of the means for each cluster), and the mode.
I feel that the result I currently have might be lacking when the final defense comes. As far as I know, using a combination of statistical methods is okay to give a stronger foundation for your result.
Based on the background of my research, what other statistical methods should I use?
I am thinking of sample standard deviation, but I don't know if I should compute it by questions or by cluster.
Sorry, statistics is not really my forte.
Thank you in advance for your answers

Differences in Differences Parallel Trends

I want to measure whether the impact of a company's headquarter country on my dependent variable (goodwill paid) is stronger during recessions. After some research, I found out that a differences-in-differences analysis could solve my problem. However, on the internet they always show a diagram (see example under: https://www.google.com/url?sa=i&url=https%3A%2F%2Fwww.publichealth.columbia.edu%2Fresearch%2Fpopulation-health-methods%2Fdifference-difference-estimation&psig=AOvVaw1yMN6knTtOEahZ9vstJpnV&ust=1676208292554000&source=images&cd=vfe&ved=0CAwQjRxqFwoTCLjbrNDIjf0CFQAAAAAdAAAAABAE ) with the "treatment" and "parallel trends". So two lines that increase or decrease in the same way until the treatment, and then one line increases/decreases more than the other.
My question now is what is my treatment and what is my control variable in my example? The treatment cannot be recessions because otherwise I just have the treatment group after the treatment and the control group before the recessions. If you think another statistical test may be better, I would be happy to consider that.
Furthermore, I just want to make sure that I created my model correctly: Goodwill Paid = B0 + B1*Recession + B2*Country + B3*(Recession*Country)
Would that tell me whether the impact of the country is stronger during recessions?
Thanks a lot for your help.
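As a rough illustration of what the interaction model above estimates, here is a minimal sketch assuming both Recession and Country are coded as 0/1 dummies (the binary Country coding, the function name, and all the numbers are assumptions, not from the question). With two dummies and their interaction the model is saturated, so the OLS coefficients are just differences between the four cell means and no regression library is needed.

```python
from statistics import mean


def interaction_coefficients(rows):
    """rows: iterable of (recession, country, goodwill) with 0/1 dummies.

    Returns (b0, b1, b2, b3) for
    goodwill = b0 + b1*recession + b2*country + b3*recession*country.
    In this saturated case the OLS fit reproduces the four cell means.
    """
    cells = {}
    for recession, country, goodwill in rows:
        cells.setdefault((recession, country), []).append(goodwill)
    m = {key: mean(values) for key, values in cells.items()}
    b0 = m[(0, 0)]                       # baseline: no recession, country = 0
    b1 = m[(1, 0)] - m[(0, 0)]           # recession effect when country = 0
    b2 = m[(0, 1)] - m[(0, 0)]           # country effect outside recessions
    b3 = (m[(1, 1)] - m[(1, 0)]) - b2    # change in the country effect in recessions
    return b0, b1, b2, b3


# Made-up observations: (recession, country, goodwill paid)
rows = [
    (0, 0, 10), (0, 0, 12),
    (1, 0, 8), (1, 0, 10),
    (0, 1, 14), (0, 1, 16),
    (1, 1, 17), (1, 1, 19),
]
print(interaction_coefficients(rows))
```

Here b3 is the difference-in-differences term: how much the country gap changes during recessions. Whether it differs from zero by more than noise would still need a standard error from a proper regression fit.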

Are transformer-based language models overfitting on the paraphrase identification task? What tools overcome this?

I've been working on a sentence transformation task that involves paraphrase identification as a critical step: if we are confident enough that the state of the program (a sentence repeatedly modified) has become a paraphrase of a target sentence, stop transforming. The overall goal is actually to study potential reasoning in predictive models that can generate language prior to a target sentence. The approach is just one specific way of reaching that goal. Nevertheless, I've become interested in the paraphrase identification task itself, as it's received some boost from language models recently.
The problem I run into is when I manipulate sentences from examples or datasets. For example, in this HuggingFace example, if I negate either sequence or change the subject to Bloomberg, I still get a majority "is paraphrase" prediction. I started going through many examples in the MSRPC training set and negating one sentence in a positive example or making one sentence in a negative example a paraphrase of the other, especially when doing so would be a few word edit. I found to my surprise that various language models, like bert-base-cased-finetuned-mrpc and textattack/roberta-base-MRPC, don't change their confidences much on these sorts of changes. It's surprising as these models claim an f1 score of 0.918+. The dataset is clearly missing a focus on negative examples and small perturbative examples.
My question is, are there datasets, techniques, or models that deal well when given small edits? I know that this is an extremely generic question, much more than is typically asked on StackOverflow, but my concern is in finding practical tools. If there is a theoretical technique, then it might not be suitable as I'm in the category of "available tools define your approach" rather than vice-versa. So I hope that the community would have a recommendation on this.
Short answer to the question: yes, they are overfitting. Most of the important NLP data sets are not actually well-crafted enough to test what they claim to test, and instead test the ability of the model to find subtle (and not-so-subtle) patterns in the data.
The best tool I know for creating data sets that help deal with this is CheckList. The corresponding paper, "Beyond Accuracy: Behavioral Testing of NLP models with CheckList", is very readable and goes into depth on this type of issue. They have a very relevant table, but it needs some terms defined first:
We prompt users to evaluate each capability with three different test types (when possible): Minimum Functionality tests, Invariance, and Directional Expectation tests... A Minimum Functionality test (MFT), is a collection of simple examples (and labels) to check a behavior within a capability. MFTs are similar to creating small and focused testing datasets, and are particularly useful for detecting when models use shortcuts to handle complex inputs without actually mastering the capability.
...An Invariance test (INV) is when we apply label-preserving perturbations to inputs and expect the model prediction to remain the same.
A Directional Expectation test (DIR) is similar, except that the label is expected to change in a certain way. For example, we expect that sentiment will not become more positive if we add “You are lame.” to the end of tweets directed at an airline (Figure 1C).
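To make the directional idea concrete for paraphrase identification, here is a minimal sketch that does not use the CheckList library itself. The scoring function is a toy word-overlap stand-in for a real model (an assumption for illustration, as are the function names, the crude negation rule, and the 0.3 threshold); it fails the test in much the same way the question describes.

```python
def overlap_paraphrase_score(a, b):
    """Toy stand-in for a model: Jaccard word overlap, not a real classifier."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)


def negate(sentence):
    """Crude perturbation: insert 'not' after the first is/was/are/were."""
    words = sentence.split()
    for i, word in enumerate(words):
        if word.lower() in {"is", "was", "are", "were"}:
            return " ".join(words[: i + 1] + ["not"] + words[i + 1:])
    return sentence


def directional_failures(score, pairs, perturb, min_drop=0.3):
    """DIR-style test: negating one side of a paraphrase pair should
    lower the paraphrase score by at least min_drop."""
    failures = []
    for a, b in pairs:
        before = score(a, b)
        after = score(a, perturb(b))
        if before - after < min_drop:
            failures.append((a, perturb(b), before, after))
    return failures


pairs = [
    ("The company was profitable last year",
     "Last year the company was profitable"),
]
failures = directional_failures(overlap_paraphrase_score, pairs, negate)
for a, b, before, after in failures:
    print(f"FAIL: {a!r} vs {b!r}: {before:.2f} -> {after:.2f}")
```

To test a real model, `overlap_paraphrase_score` would be swapped for a function that returns the model's "is paraphrase" probability for a sentence pair. An invariance test has the same shape, except the perturbation preserves the label and the score is expected to stay put.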
I haven't been actively involved in NLG for long, so this answer will be a bit more anecdotal than SO's algorithms would like. Starting with the fact that in my corner of Europe, the general sentiment toward peer review requirements for any kind of NLG project are higher by several orders of magnitude compared to other sciences - and likely not without reason or tensor thereof.
This makes funding a bigger challenge, so wherever you are, I wish you luck on that front. I'm not sure of how big of a deal this site is in the niche, but [Ehud Reiter's Blog][1] is where I would start looking into your tooling ideas.
Maybe even reach out to them/him personally, because I can't think of another source that has an academic background and a strong propensity for practical applications of NLG, at least based on the kind of content they've been putting out over the years.
Your background, environment/funding, and seniority level/control you have over the project will eventually compose your vector decision for you. It's just how it goes on the bleeding edge of anything. What I will add, though, is not to limit yourself to a single language or technology in this phase because of those precise reasons you've mentioned. I'd recommend the same in terms of potential open source involvement, but if your profile information is accurate, that probably won't happen, no matter what you do and accomplish.
But yeah, in the grand scheme of things, your question is far from too broad, in my view. It identifies a rather unmistakable problem pattern that not all branches of science are as lackadaisical to approach as NLG-adjacent fields seem to be right now. In that regard, it's not broad enough and will need to be promulgated far and wide before community-driven tooling will give you serious options on a micro level.
Blasphemy, sure, but the performance is already stacked against you. As for the question potentially being too broad, I'd posit it is not broad enough, so long as we collectively remain in an "oh, I was waiting for you to start doing something about it" phase.
P.S. I'd eliminate any Rust and ECMAScript alternatives prior to looking into Python, due to performance reasons, blasphemous as this might sound to a 2021 data scientist.
[1]: https://ehudreiter.com/2016/12/18/nlg-vs-templates/

Open source projects for email scrubbing generating structured data from unstructured source?

I don't know where to start on this one, so hopefully you can clear up my question. I have a project where email will be searched for specific words/patterns and the results stored in a structured manner, something like what TripIt does.
The article states that they developed a DataMapper
The DataMapper is responsible for taking inbound email messages
addressed to plans [at] tripit.com and transforming them from the
semi-structured format you see in your mail reader into a highly
structured XML document.
There is a comment that also states
If you're looking to build this yourself, reading a little bit about
Wrappers and Wrapper Induction might be helpful
I Googled and read about wrapper induction, but the definition was just too broad and didn't help me understand how one would go about solving such a problem.
Is there some open source project out there that does similar things?
There are a couple of different ways and things you can do to accomplish this.
The first part, which involves getting access to the email content, I'll not answer here. Basically, I'll assume that you have access to the text of the emails; if you don't, there are libraries that let you connect Java to a mailbox, such as Apache Camel (http://camel.apache.org/mail.html).
So now you've got the email. Then what?
A handy thing that could help is that LingPipe (http://alias-i.com/lingpipe/) has an entity recognizer that you can populate with your own terms. Specifically, look at some of their extraction tutorials and their dictionary extractor (http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html). Inside the LingPipe dictionary extractor (http://alias-i.com/lingpipe/docs/api/com/aliasi/dict/ExactDictionaryChunker.html) you'd simply import the terms you're interested in and use that to associate labels with an email.
You might also find the following question helpful: Dictionary-Based Named Entity Recognition with zero edit distance: LingPipe, Lucene or what?
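To show the idea behind an exact dictionary extractor without relying on LingPipe's actual API, here is a minimal Python sketch. The terms, labels, and function names are hypothetical; it just compiles the dictionary into one case-insensitive regular expression and reports each hit with its label and offset.

```python
import re


def build_dictionary_matcher(dictionary):
    """dictionary maps phrase -> label. Returns a function that finds
    exact (case-insensitive) occurrences of the phrases in a text."""
    # Longest phrases first, so a longer term wins over a shorter prefix.
    phrases = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    lookup = {phrase.lower(): label for phrase, label in dictionary.items()}

    def match(text):
        return [
            (m.group(0), lookup[m.group(0).lower()], m.start())
            for m in pattern.finditer(text)
        ]

    return match


# Hypothetical terms of interest
terms = {
    "San Francisco": "CITY",
    "New York": "CITY",
    "United Airlines": "AIRLINE",
}
match = build_dictionary_matcher(terms)
hits = match("Your United Airlines flight from San Francisco to New York is confirmed.")
for text, label, offset in hits:
    print(label, text, offset)
```

This only handles exact spellings with single spaces; a real dictionary chunker would typically also deal with tokenization and overlapping entries.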
Really a very broad question, but I can try to give you some general ideas, which might be enough to get started. Basically, it sounds like you're talking about an elaborate parsing problem - scanning through the text and looking to apply meaning to specific chunks. Depending on what exactly you're looking for, you might get some good mileage out of a few regular expressions to start - things like phone numbers, email addresses, and dates have fairly standard structures that should be matchable. Other data points might benefit from some indicator words - the phrase "departing from" might indicate that what follows is an address. The natural language processing community also has a large tool set available for text processing - check out things like parts of speech taggers and semantic analyzers if they're appropriate to what you're trying to do.
Armed with those techniques, you can follow a basic iterative development process: For each data point in your expected output structure, define some simple rules for how to capture it. Then, run the application over a batch of test data and see which samples didn't capture that datum. Look at the samples and revise your rules to catch those samples. Repeat until the extractor reaches an acceptable level of accuracy.
Depending on the specifics of your problem, there may be machine learning techniques that can automate much of that process for you.
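The iterative process described above could be sketched roughly as follows. The rules, field names, and sample emails are all made up for illustration; the point is the loop of running the rules over a batch and listing which samples each rule missed, so you know what to look at next.

```python
import re

# One simple rule per output field (hypothetical starting rules)
RULES = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    # Indicator words: whatever follows "departing from" up to punctuation
    "departure": re.compile(r"departing from\s+([\w ]+?)(?=[.,\n]|$)", re.IGNORECASE),
}


def extract(text):
    """Apply every rule; keep the captured group if there is one,
    otherwise the whole match."""
    record = {}
    for field, pattern in RULES.items():
        m = pattern.search(text)
        if m:
            record[field] = m.group(m.lastindex or 0)
    return record


def missing_report(samples):
    """For each field, list the indexes of samples where no rule matched."""
    report = {field: [] for field in RULES}
    for i, text in enumerate(samples):
        record = extract(text)
        for field in RULES:
            if field not in record:
                report[field].append(i)
    return report


samples = [
    "Contact: jane@example.com. Departing from Boston Logan, on 2024-03-01.",
    "Flight leaves 03/02/2024, departing from Denver.",
]
print(extract(samples[0]))
print(missing_report(samples))
```

In this batch the second sample shows up as missing a date, which points at the next revision: the date rule does not yet cover the slash-separated format.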

Voting economy: balancing credits properly

Many websites today (including stackoverflow) and games allow people to perform voting, give feedback, enable additional features etc, according to a score: eg. reputation, or MMORPG credits.
As a programmer who will probably need to implement a community-based website in the near future, I am interested in knowing about the existence of basic algorithms and decisions to be made so that everything is balanced. For example, was the fact that one upvote grants 10 reputation and one downvote grants -2 arbitrary, or properly weighted? How does one decide the price of a given item and the rewards in an MMORPG so that everything is balanced? I guess that WoW designers relied on their experience, but I am also sure that this experience can be found written down somewhere. Although this is a social problem, the pricing of a given feature and the reward for a given task are technical/mathematical ones, as you need to give a value to each feature according to some mathematical criteria (although not easy to devise, I guess).
Of course, this question could bring us far in terms of theory of economics, but I am sort of hoping that there are well defined and known simplified patterns and rules for this issue. I just don't know the keywords to query for.
Probably the most important thing to point out here is that this is a social problem not a technical one.
By that I mean that you could use the exact same system as SO on an MMORPG and it would flop or have really undesirable side effects. Whether a system works or not depends on the community you drop it into and the intended purpose. It can also depend on some luck whether people latch onto it or not. You may get early negative behaviour that sets the tone for future negativity and discourages positive involvement. Or it could go completely the other way.
There is no magic formula that made the vote/rep weighting what it is on SO other than long discussions about how to encourage certain behaviour and then some testing and fine-tuning. For example, a downvote costs 1 rep and is -2 rep to the recipient. The guiding principle was that downvotes should cost. After that, it was trial and error.
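If you expect to fine-tune the weights like this, it may help to keep them in one table rather than scattered through the code. A minimal sketch, using the numbers mentioned in this thread (+10 for an upvote, -2 to the recipient and -1 to the voter for a downvote) with hypothetical names and votes:

```python
from collections import defaultdict

# Weights from the discussion above; kept in one place so they are easy to tune
WEIGHTS = {
    "up": {"recipient": 10, "voter": 0},
    "down": {"recipient": -2, "voter": -1},
}


def apply_votes(votes, weights=WEIGHTS):
    """votes: iterable of (voter, recipient, kind). Returns reputation per user."""
    reputation = defaultdict(int)
    for voter, recipient, kind in votes:
        reputation[recipient] += weights[kind]["recipient"]
        reputation[voter] += weights[kind]["voter"]
    return dict(reputation)


# Made-up votes
votes = [
    ("alice", "bob", "up"),
    ("carol", "bob", "down"),
    ("bob", "alice", "up"),
]
print(apply_votes(votes))
```

Replaying recorded votes under a different `weights` table is then a cheap way to see how a proposed change would have shifted scores, though it cannot tell you how behaviour would change in response.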
You might want to read The Value of Downvoting, or, How Hacker News Gets It Wrong and Vote Fraud for some of Jeff's and Joel's thoughts on that subject. Joel's Tech Talk on Stackoverflow at Google is also enlightening.
Voting is actually a very difficult problem. There are so many models of voting, and they all produce different results. For example, choosing your one favorite candidate versus ranking candidates produces a different result. Choosing your LEAST favorite candidate produces a different result. Organizing choices into good/bad produces different results.
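A small made-up example shows how much the counting rule matters. In the sketch below, the same eleven ranked ballots produce three different winners depending on whether you count favorites (plurality), rank positions (Borda count), or least favorites (fewest last-place votes). The ballots and function names are hypothetical.

```python
from collections import Counter

# Hypothetical ballots: (ranking from most to least preferred, number of voters)
BALLOTS = [
    (("A", "B", "C"), 2),
    (("A", "C", "B"), 3),
    (("B", "C", "A"), 4),
    (("C", "B", "A"), 2),
]


def plurality_winner(ballots):
    """Each voter names one favorite; most first places wins."""
    tally = Counter()
    for ranking, n in ballots:
        tally[ranking[0]] += n
    return max(tally, key=tally.get)


def borda_winner(ballots):
    """Ranked voting: last place scores 0, next 1, and so on."""
    tally = Counter()
    for ranking, n in ballots:
        for points, candidate in enumerate(reversed(ranking)):
            tally[candidate] += points * n
    return max(tally, key=tally.get)


def fewest_last_place_winner(ballots):
    """Each voter names their least favorite; fewest such votes wins."""
    last = Counter({c: 0 for ranking, _ in ballots for c in ranking})
    for ranking, n in ballots:
        last[ranking[-1]] += n
    return min(last, key=last.get)


print(plurality_winner(BALLOTS))          # A
print(borda_winner(BALLOTS))              # B
print(fewest_last_place_winner(BALLOTS))  # C
```

None of these rules handles ties here; the example is built so that each rule has a clear winner.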
Balancing then becomes something that can be done by asking the community. It's very difficult to balance games of that magnitude, simply because even your most exhaustive tests won't cover all of the cases. Having a properly established forum where users can give their opinions, as well as having testers who watch out for balancing issues, is probably the best way to go.
Oh, and if you want an abstract about the voting problem I mentioned, it's here:
http://www.cs.rochester.edu/~lane/computational-politics.html
