Crowdsourcing reliability measurements - spam/fraud detection - statistics

I'd like to collect a kind of geographical information from website users: for a given set of data, they will mark a checkbox indicating whether a place does or does not have a given property. Are there any tools/frameworks for detecting fraud or spam submissions based on the whole collected data set (and possibly other info)? I'd like to get filtered, more reliable data.

Not sure if that's exactly what you're asking for, but here are some tips from my experience using Amazon Mechanical Turk:
There are several academic papers dealing with such problems. Here is a good one.
In addition, based on the following general recommendations, I've created a custom procedure which worked on my data:
a. Include an open question, and filter out cases where it wasn't answered. It's harder to answer such a question automatically, and it might also be more time-consuming, thus less attractive, for a fraudster.
b. If possible, don't use a binary scale (i.e. a checkbox), but some grade (e.g. 1-4 or 1-6). This would give you more data to work with.
c. If available, filter out cases where the time spent filling in your form was too short (this is especially useful if you include that open question).
d. If you have multiple inputs per user, check for repetitive answers, and for users who consistently give far-from-average answers.
If each user submits only a single "form", consider putting more than a single element/question in it, so you'll get multiple submissions per user.
e. If you have only a single submission per user or user ID, your options are more limited. I can suggest filtering out outliers (e.g. data points farther than 3 standard deviations from the average), provided you have enough data.
f. After all the filtering, check the agreement or disagreement in your data (e.g. by checking what proportion of your data points fall within x standard deviations from the average). In case of agreement, use the average; in case of disagreement, collect some more data. A rough sketch of points e and f follows below.
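Here is a minimal sketch of that outlier filter and agreement check in Python; the thresholds (3 standard deviations, 80% within 1 standard deviation) and the example ratings are illustrative assumptions, not prescriptions:

```python
# Hedged sketch: drop ratings more than `sigma` standard deviations from the mean,
# then check how tightly the remaining ratings agree. Thresholds are illustrative.
import statistics

def filter_and_check(ratings, sigma=3.0, agreement_sigma=1.0, agreement_share=0.8):
    mean = statistics.mean(ratings)
    stdev = statistics.pstdev(ratings)
    if stdev == 0:
        return ratings, True, mean                # everyone agrees exactly

    kept = [r for r in ratings if abs(r - mean) <= sigma * stdev]

    # Agreement check: what share of the kept ratings fall within
    # `agreement_sigma` standard deviations of their own mean?
    kept_mean = statistics.mean(kept)
    kept_stdev = statistics.pstdev(kept) or 1e-9
    close = sum(abs(r - kept_mean) <= agreement_sigma * kept_stdev for r in kept)
    agrees = close / len(kept) >= agreement_share

    return kept, agrees, kept_mean                # trust kept_mean only if agrees is True

ratings = [3, 4, 3, 4, 4, 1, 3, 6, 4]             # example submissions on a 1-6 scale
kept, agrees, avg = filter_and_check(ratings)
print(kept, agrees, round(avg, 2))
```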
Hope it helps,

Related

Difference-in-Differences Parallel Trends

I want to measure whether the impact of a company's headquarter country on my dependent variable (goodwill paid) is stronger during recessions. After some research, I found that a difference-in-differences analysis could solve my problem. However, the examples online always show a diagram (see for example: https://www.publichealth.columbia.edu/research/population-health-methods/difference-difference-estimation ) with the "treatment" and "parallel trends": two lines that increase or decrease in the same way until the treatment, after which one line increases/decreases more than the other.
My question now is: what are my treatment and control groups in my example? The treatment cannot be the recessions, because then I would just have the treatment group after the treatment and the control group before the recessions. If you think another statistical test would be better, I would be happy to consider that.
Furthermore, I just want to make sure that I have specified my model correctly: Goodwill Paid = B0 + B1*Recession + B2*Country + B3*(Recession x Country)
Would that tell me whether the impact of the country is stronger during recessions?
Thanks a lot for your help.
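In case it helps to see the specification concretely, here is a minimal sketch of how such an interaction model could be estimated with statsmodels; the column names (goodwill_paid, recession, country) and the data file are assumptions for illustration, not taken from the post:

```python
# Hedged sketch: OLS with a recession x country interaction via statsmodels' formula API.
# Column names and the CSV file are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

deals = pd.read_csv("deals.csv")   # hypothetical deal-level data

# recession: 1 if the deal closed during a recession, 0 otherwise.
# country: categorical headquarters country of the acquirer.
model = smf.ols("goodwill_paid ~ recession * C(country)", data=deals).fit()

# The recession:C(country) interaction coefficients indicate whether the country
# effect on goodwill paid differs between recession and non-recession periods.
print(model.summary())
```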

Optimiser for Excel spreadsheet

I'm a mechanical engineer, and I have developed a pretty cool spreadsheet that I use to size some steel members for lifting beams. The setback is that I need to do some trial and error in selecting the member until I get one that is as close to the allowable limits as possible.
What I'm hoping to improve on is to develop a function so that, based on a length and weight that I enter, the program runs a loop and automatically selects the best member size(s) from a list of the members and their physical properties. Is this possible?
Yeah, depending on the complexity, a simple search through the parameters (less than, more than, etc.) might bring you the answer. You can do it quite easily with the pandas library: just load the Excel sheet as a pandas DataFrame (pandas.read_excel()), which will then allow you to perform the searches on that DataFrame object.
If you want to run an optimization algorithm, you should look into SciPy's optimize module to get what you're looking for based on the input data (it handles unconstrained and constrained functions).
Of course, the question you've stated is quite general, so I can only point you in a direction. More info would help.
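To make the pandas suggestion concrete, here is a minimal sketch; the workbook name, the column names and the capacity check are illustrative assumptions, not taken from the actual spreadsheet:

```python
# Hedged sketch: pick the lightest members whose capacity meets the demand.
# Workbook, column names and the demand formula are illustrative assumptions.
import pandas as pd

members = pd.read_excel("steel_members.xlsx")   # hypothetical table of section properties

span_m = 6.0      # user-entered length
load_kN = 50.0    # user-entered lifted weight

# Illustrative demand: simply supported beam with a midspan point load.
required_moment_kNm = load_kN * span_m / 4.0

# Keep only members with enough capacity, then list the lightest few.
adequate = members[members["moment_capacity_kNm"] >= required_moment_kNm]
best = adequate.sort_values("mass_kg_per_m").head(3)
print(best[["section", "mass_kg_per_m", "moment_capacity_kNm"]])
```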

Need ideas for rewarding the users of a wiki

I need ideas as to how best to reward users of a wiki so that they stay motivated to keep contributing in a constructive manner. Articles can be upvoted, so the thought is to reward contributors based on how much they have contributed to a specific article as well as how many upvotes it has gotten. The idea so far is to award points to those who wrote the article, plus points based on the number of bytes the user has generated by editing articles.
The immediate problem I see is how to correctly assess when to give points in situations such as when a user edits parts of an article that have already been edited before, or when a user edits only to correct misspellings (for example, a user who edits a single word). Should that give points? I don't see how the backend would distinguish between a user correcting a spelling mistake and a user farming points by making small changes here and there.
There is also the issue of how to manage the byte-contribution point system when a user's contribution has been overridden by a later edit: should they get to keep the points from the bytes they contributed now that their original piece of text is gone?
The intention is to make users feel rewarded for their work without making the reward system too competitive (making them focus more on generating points than on producing content of value).
Given that the major concern appears to be avoiding low-value edits, you could cap the points per day and the rewarded edits per article. For example, instead of a user earning points for multiple edits to a page one word at a time, you make it so they are only rewarded for editing a page once a day. Additional edits would give them no additional points but would still be accepted. It doesn't have to be at the page level either; you could use paragraphs or whatever level of granularity works for the content. The most important thing is to track all of this over time and do spot checks on whether the top users are indeed the ones that have contributed value according to whatever metric you decide is important.
Users always try to game any points system, so whatever you choose, I would make sure to track enough information that you can change your algorithm in the future and understand how it will behave.
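As an illustration of the once-per-page-per-day cap, here is a minimal sketch; the point value, the in-memory storage and the page-level granularity are assumptions for the example, not a recommendation:

```python
# Hedged sketch: award points for an edit, capped to one reward per user per page per day.
# The point value and the in-memory storage are illustrative assumptions.
from datetime import date

POINTS_PER_REWARDED_EDIT = 10

rewarded_today = set()   # (user_id, page_id, day) combinations already rewarded
scores = {}              # user_id -> accumulated points

def award_edit(user_id, page_id, today=None):
    """Return the points granted for this edit (0 if the daily cap was already hit)."""
    today = today or date.today()
    key = (user_id, page_id, today)
    if key in rewarded_today:
        return 0         # the edit is still accepted, it just earns nothing further today
    rewarded_today.add(key)
    scores[user_id] = scores.get(user_id, 0) + POINTS_PER_REWARDED_EDIT
    return POINTS_PER_REWARDED_EDIT

# Example: repeat edits to the same page on the same day earn points only once.
print(award_edit("alice", "page-42"))   # 10
print(award_edit("alice", "page-42"))   # 0
print(award_edit("alice", "page-7"))    # 10
```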

Solr: how to manage irrelevant results when not sorting by relevance?

Case in point: say we have a search query that returns 2000 results ranging from very relevant to hardly relevant at all. When this is sorted by relevance this is fine, as the most relevant results are listed on the first page.
However, when sorting by another field (e.g. user rating) the results on the first page are full of hardly-relevant results, which is a problem for our client. Somehow we need to only show the 'relevant' results with highest ratings.
I can only think of a few solutions, all of which have problems:
1 - Filter out listings on the Solr side if the relevancy score is under a threshold. I'm not sure how to do this, and from what I've read it isn't a good idea anyway; e.g. if a query returns only 10 listings I would want to display them all rather than filter any out. It seems impossible to determine a threshold that would work across the board. If anyone can show me otherwise, please do!
2 - Filter out listings on the application side based on score. This I can do without a problem, except that now I can't implement pagination, because I have no way to determine the total number of filtered results without returning the whole set, which would affect performance/bandwidth, etc. It also shares the problems of the first point.
3 - Create a sort of 'combined' sort that aggregates a score from relevancy and user rating, which the results will then be sorted on. Firstly, I'm not sure if this is even possible, and secondly it would be confusing for the user if the results aren't actually listed in order of rating.
How has this been solved before? I'm open to any ideas!
Thanks
If they're not relevant, they should be excluded from the result set. Since you want to order by a dedicated field (i.e. user rating), you'll have to tweak how you decide which documents to include in the result at all.
In any case you'll have to define "what is relevant enough", since scores aren't really comparable between queries and don't say anything in absolute terms about how relevant a document is.
You'll have to decide why the documents that are included aren't relevant and exclude them based on those criteria, and then either use the rating as a way to boost documents further up (if you want the search to appear organic / ordered by relevance), or just exclude the irrelevant ones and sort by user rating. But remember that making user ratings feel relevant, as an experience for the user, is usually a harder problem than just ordering by the average of the votes.
Usually the client can choose between different ordering options, by relevance or by rating for example. But you are right that ordering by rating alone is probably not useful enough. What you could do is take the rating into account in the relevance scoring, for example by multiplying an "organic" score with the rating transformed into a small boost. In Solr you can do this with Function Queries. It is not hard science, and some magic is involved; much is common sense. And it requires some very good evaluation and testing to see what works best.
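For example, a multiplicative rating boost with the eDisMax boost parameter might look roughly like this; the rating field, the core name and the log(sum(rating,10)) shaping are assumptions for illustration:

```python
# Hedged sketch: query Solr with an eDisMax multiplicative boost derived from the rating.
# Field name "rating" and the core URL are assumptions; tune the boost function to your data.
import requests

params = {
    "q": "pizza",                      # the user's query
    "defType": "edismax",
    "boost": "log(sum(rating,10))",    # ~1.0 at rating 0, growing slowly with the rating
    "fl": "id,score,rating",
    "rows": 20,
}
resp = requests.get("http://localhost:8983/solr/listings/select", params=params)
print(resp.json()["response"]["numFound"])
```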
Alternatively, if you do not want to treat it as a retrieval problem, you can apply faceting and let users do filtering of the results by rating. Let users help themselves. But I can imagine this does not work in all domains.
Engineers can define what relevancy is. Content-similarity scoring is not the only thing that constitutes relevancy. Many Information Retrieval researchers and engineers agree that contextual information should be used in addition to content similarity. This opens up a plethora of possibilities for defining a retrieval model. For example, Learning to Rank (LTR) approaches have become popular, where features are learnt from search logs to deliver more relevant documents to users given their profiles and prior search behaviour. Solr offers this as a module.

How do you measure if an interface change improved or reduced usability?

For an ecommerce website how do you measure if a change to your site actually improved usability? What kind of measurements should you gather and how would you set up a framework for making this testing part of development?
Multivariate testing and reporting is a great way to actually measure these kinds of things.
It allows you to test what combination of page elements has the greatest conversion rate, providing continual improvement of your site design and usability.
Google Website Optimizer has support for this.
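If you end up comparing just two variants yourself, a simple significance check on conversion counts can look roughly like this (a sketch with made-up numbers, not a full multivariate setup):

```python
# Hedged sketch: is variant B's conversion rate different from variant A's?
# Visitor and conversion counts are made-up example numbers.
from scipy.stats import chi2_contingency

conversions_a, visitors_a = 310, 12_000   # old design
conversions_b, visitors_b = 370, 12_100   # new design

table = [
    [conversions_a, visitors_a - conversions_a],
    [conversions_b, visitors_b - conversions_b],
]
chi2, p_value, _, _ = chi2_contingency(table)
print(f"A: {conversions_a / visitors_a:.2%}  B: {conversions_b / visitors_b:.2%}  p={p_value:.3f}")
```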
Use methods similar to those you used to identify the usability problems in the first place: usability testing. Typically you identify your use cases and then run a lab study evaluating how users go about accomplishing certain goals. Lab testing typically works well with 8-10 people.
A more informative methodology we have adopted to understand our users is anonymous data collection (you may need user permission; make your privacy policies clear, etc.). This simply means recording which buttons/navigation menus users click on and how users delete something (e.g. when changing quantity, do more users enter 0 and update the quantity, or hit X?). This is a bit more complex to set up: you have to develop an infrastructure to hold this data (which is really just counters, e.g. "Times clicked X: 138838383, Times entered 0: 390393") and allow data points to be created as needed to plug into the design.
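A minimal sketch of such counters (the event names are made up, and a real setup would persist them server-side rather than in memory):

```python
# Hedged sketch: in-memory counters for anonymous interaction events.
# Event names are illustrative; real data would be persisted and aggregated server-side.
from collections import Counter

ui_events = Counter()

def record(event_name):
    """Increment the counter for a UI event, creating it on first use."""
    ui_events[event_name] += 1

# Example events fired from the front end:
record("cart.remove_via_x")
record("cart.quantity_set_to_zero")
record("cart.quantity_set_to_zero")

print(ui_events.most_common())   # e.g. [('cart.quantity_set_to_zero', 2), ('cart.remove_via_x', 1)]
```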
To push the measurement of an improvement of a UI change up the stream from end-user (where the data gathering could take a while) to design or implementation, some simple heuristics can be used:
Is the number of actions it takes to perform a scenario less? (If yes, then it has improved). Measurement: # of steps reduced / added.
Does the change reduce the number of kinds of input devices needed (even if the number of steps is the same)? By this I mean: if you take something that relied on both the mouse and the keyboard and change it to rely only on the mouse or only on the keyboard, then you have improved usability. Measurement: change in # of devices used.
Does the change make different parts of the website consistent? E.g. if one part of the e-commerce site loses changes made while you are not logged on and another part does not, this is inconsistent. Changing them so that they have the same behavior improves usability (preferably towards the more fault-tolerant behavior, please!). Measurement: make a graph (really a flow chart) mapping the ways a particular action can be done. Improvement is a reduction in the # of edges in the graph.
And so on... find some general UI tips, figure out some metrics like the above, and you can approximate usability improvement.
Once you have these design-level approximations of usability improvement and then gather longer-term data, you can see whether the design-level improvements have any predictive ability for the end-user reaction (like: over the last 10 projects, we've seen scenarios get 1% quicker on average for each action removed, with a range of 0.25% and a standard deviation of 0.32%).
The first way can be fully subjective or partly quantified: user complaints and positive feedback. The problem with this is that you may have some strong biases when filtering that feedback, so you had better make it as quantitative as possible. Having a ticketing system to file every report from the users and gathering statistics about each version of the interface might be useful. Just get your statistics right.
The second way is to measure the difference via a questionnaire about the interface taken by end-users. Answers to each question should be a set of discrete values, and then again you can gather statistics for each version of the interface.
The latter way may be much harder to set up (designing a questionnaire, and possibly the controlled environment for it as well as the guidelines for interpreting the results, is a craft in itself), but the former makes it unpleasantly easy to mess up the measurements. For example, you have to consider that the number of tickets you get for each version depends on how long that version has been in use, and that all time ranges are not equal (e.g. a whole class of critical issues may never be discovered before the third or fourth week of usage, or users might tend not to file tickets in the first days of use even if they find issues, etc.).
Torial stole my answer. Still, if there is a measure of how long it takes to do a certain task, and the time is reduced while the task is still completed, then that's a good thing.
Also, if there is a way to record the number of cancels, then that would work too.
