Gauging expected membership length without new sign-ups overly affecting the value - statistics

My question isn't language specific!
I'm trying to find a metric to help understand how long members of a site stay members. Not surprisingly, if the site is very successful and many new people sign up, the average account age drops. The average also drops if many people cancel, although more slowly.
I've thought about using an offset, for example only including people who signed up more than a year ago, but this creates a weird bias that ignores people who sign up and cancel within the year.
Another thought was to count only cancellations, but this could give perverse results in the case where 1000 members have been around for a decade and none have cancelled, but 10 users signed up and cancelled the next day.
It seems unintuitive to use the average, since a surge of new sign-ups (a good thing) shows up as a bad thing in terms of average account length.
Any ideas for ways to measure 'expected' account age without having too much noise from new sign-ups?

Why not actually measure account age, if that's what you want?
In pseudo-code:
from datetime import date

def account_age(account):
    # Age of a live account runs up to today; a cancelled one stops at cancellation.
    if account.current:
        return (date.today() - account.signup_date).days
    return (account.cancel_date - account.signup_date).days
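If a single headline number is still wanted, one option (a sketch; accounts is whatever collection of account records your data layer exposes) is to average this measured age over every account, active or cancelled, so new sign-ups shift the figure gradually instead of being excluded:

def average_account_age(accounts):
    # Mean age in days across all accounts, cancelled ones included.
    return sum(account_age(a) for a in accounts) / len(accounts)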

Related

Store or train GPT model to "remember context"

Is there a way to train an LLM to store a specific context? For example, I have a long story I want to ask questions about, but I don't want to put the whole story in every prompt. How can I make the LLM "remember the story"?
Given that GPT-3 models have no parameter that enables memorization of past conversations, the only way at the moment to "memorize" past conversations is to include them in the prompt.
If we take a look at the following example:
You are a friendly support person. The customer will ask you questions, and you will provide polite responses
Q: My phone won't start. What do I do? <-- This is a past question
A: Try plugging your phone into the charger for an hour and then turn it on. The most common cause for a phone not starting is that the battery is dead.
Q: I've tried that. What else can I try? <-- This is a past question
A: Hold the button in for 15 seconds. It may need a reset.
Q: I did that. It worked, but the screen is blank. <-- This is a current question
A:
Rule to follow:
Include prompt-completion pairs in the prompt with the oldest conversations at the top.
Problem you will face:
You will hit a token limit at some point (if you chat long enough). Each GPT-3 model has a maximum number of tokens you can pass to it. In the case of text-davinci-003, it is 4096 tokens. When you hit this limit, the OpenAI API will start to throw errors. When this happens, you need to reduce the number of past prompt-completion pairs (e.g., include only the most recent 4 past prompt-completion pairs).
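A minimal sketch of that trimming rule (the helper below is illustrative and independent of any particular OpenAI client; it only assembles the prompt text):

MAX_PAIRS = 4  # illustrative cutoff: keep only the most recent 4 past pairs

def build_prompt(instructions, history, question, max_pairs=MAX_PAIRS):
    # history: list of (question, answer) tuples, oldest first
    lines = [instructions]
    for q, a in history[-max_pairs:]:       # drop the oldest pairs first
        lines += ["Q: " + q, "A: " + a]
    lines += ["Q: " + question, "A:"]       # the model completes from here
    return "\n".join(lines)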
Pros:
By including past prompt-completion pairs in the prompt, we are able to give GPT-3 models the context of the conversation.
Cons:
What if a user asks a question that relates to a conversation that occurred more than 4 prompt-completion pairs ago?
Including past prompt-completion pairs in the prompt will cost (a lot of) money!

Is a brute force attack a viable option in this event ticketing scheme

I plan on creating a ticket "pass" platform. Basically, imagine you come to a specific city and buy a "pass" for several days, which gets you things like free entrance to museums and other attractions.
Now, the main question that has bothered me for several days is: how will museum staff VALIDATE whether the pass is valid? I see platforms like EventBrite etc. using barcodes/QR codes, but that isn't quite a viable solution, because we'd need to get a good camera phone for every museum to scan the code, and that's over budget. So I was thinking of something like a simple 6-letter code, e.g. GHY-AGF. There are 26^6 ≈ 308 million combinations, which is a tough nut to crack.
I've asked a question on the StackExchange security site about this, and the main concern was brute forcing. However, someone could only mount this kind of attack if they were able to perform pass look-ups. The only people who will be able to do this are:
1) The museum staff (for which there will be a secure user/pass app, and rate limits of no more than 1000 look-ups per day)
2) Actual customers checking the validity of their pass; this will be protected with Google ReCaptcha v3, which doesn't sacrifice user experience the way v1 did. Rate limits and IP bans will also be applied.
Is brute force STILL a viable attack if I put these 2 measures in place? Also, is there something else I'm missing in terms of security when using this approach?
By the way, using a 6-character string as a unique "pass" has many advantages in terms of portability. For example, you could print "blank" passes, with instructions on how to obtain a code. After paying, the user is given a code like GAS-GFS, which they can easily write on the pass with a pen. This is not possible with a QR code/barcode. Also, staff can check validity in less than 10 seconds by typing the code into a web app or sending an SMS. If you're aware of any other portable system like this that may be more secure, let me know.
Brute forcing is a function of sparseness: how many codes are valid at any given time, out of how large a space? For example, if out of your 308M possibilities, 10M are valid (for a given museum), then I only need ~30 guesses to hit a collision. If only 1000 are valid, then I need more like 300k guesses. If those values are valid indefinitely, I should expect to hit one in less than a year at 1000 guesses per day. Whether anyone would actually do that depends on how much the passes are worth.
This whole question is around orders of magnitude. You want as many as you can get away with. 7 characters would be better than 6 (exactly 26x better). 8 would be better than that. It depends on how devoted your attackers are and how big the window is.
But that's how you think about the problem to choose your space.
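As a back-of-the-envelope check of the numbers above (a sketch assuming codes are guessed uniformly at random over the whole keyspace):

def expected_guesses(code_length, valid_codes, alphabet=26):
    # Expected guesses before hitting some valid code.
    return alphabet ** code_length / valid_codes

for length in (6, 7, 8):
    print(length, round(expected_guesses(length, valid_codes=1000)))
# 6 -> ~309,000 guesses; 7 -> ~8 million; 8 -> ~209 million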
What's much more important is making sure that codes can't be reused, and are limited to a single venue. In all problems like this, reconciliation (i.e. keeping track of what's been issued and what's been used) is more important than brute-force protection. Posting a number online and having everyone use it is dramatically simpler than making millions of guesses.

Potential security vulnerabilities in a ticketing implementation

I'm trying to brainstorm potential security vulnerabilities for this scenario. (Btw, I asked a related question several days ago, but from the answers I realized that explaining the EXACT scenario is crucial, because many of the answers were (a bit) irrelevant as a result.) I've also included the vulnerabilities I've identified so far, and how to mitigate them, so feedback on them would be appreciated. So here we go:
a) The whole system will be a "ticketing" system, but not with ordinary tickets; it's a "pass" system. Meaning: a customer orders a "pass" ticket, which gives him access to certain perks at certain places (like free entrance to museums) for a SPECIFIC period of time. In other words, it's a ticket that EXPIRES after 1-7 days (never more than 7 days).
b) The "flow" of the user is:
User goes to the website, orders a ticket for a specific period of time, which gives him perks at certain locations (museums etc.)
After a successful order, the website prints a 6-letter-long string (an ID). Example: GFZ-GFY. There are 26^6 (~308 million) potential combinations. Of course, these IDs will be stored in a secure database.
User then goes to the museum (or other venue) and shows the 6-letter-long string. The employee checks its validity with a web app or by sending an SMS to a number, getting the validity status immediately (in both cases, the code is queried against the database to check the ticket's validity).
So far, I've identified 2 potential issues:
a) Brute-force attacks
There will be 2 "attack surfaces" under which this can occur:
A museum employee will have gated access to the web app (to verify ticket validity). The way I mitigate this is by limiting the number of look-ups to 1,000 a day per user account.
A user will be able to check the status of his order. I'll mitigate this in several ways: first, the URL will not be "public" and will be available only to users who purchased the ticket. Second, I'll implement ReCaptcha v3 and IP bans after more than 10 unsuccessful requests per hour.
The number of "active" tickets at any one time is expected to be 5000 at its peak; normal would be something like 500-1000. Considering there are hundreds of millions of combinations, it would take significant effort for an attacker to brute-force their way through this.
The second (and easier) approach an attacker could take is simply buying a ticket and re-publishing it, or publishing it online for anyone to use. The way I'll mitigate this is by:
After a museum checks the validity of the pass, if they check it again, there will be a notification saying: This pass has been checked at this place at this time: [time-date].
While I do plan on re-using codes, I'll make sure there is a minimum period of 90 days between uses. Maybe there's some vulnerability in doing this that I'm not aware of. The code MAY or MAY NOT be used again after 90 days have passed since its "expiration" date; all I'm saying is that it will be released back into the "pool" of potential (300+ million) codes that could be used. Maybe this is not such a good idea?
The customer will be given (sent to an address, or instructed to pick up) a blank card-like "ticket" on which the code will be written (or he'll have to write the code on it with a pen). This makes an attack harder, since the attacker would need access BOTH to a valid code and to a printer that could print such cards on the same material.
Do you see any other potential attack that could be done? Is there anything I'm missing at my current mitigation approaches?
I would also plan for the case that your database is compromised, e.g. through SQL injection. Instead of storing the codes in plain text, you can use any decent password-hash function and store only the hash of the code. The verification procedure is the same as with passwords.
Since there is no user id and the code itself must be retrievable with a database query, you cannot use salting. In this case we can use a key derivation function (KDF), which takes some time to calculate a hash and therefore makes brute-forcing harder. The missing salt leads to the next point:
I would feel more comfortable using longer codes. If I read the table correctly, the probability of finding a valid code with your setup (~28 bits / 3000 active codes) is about 0.001 because of the birthday problem. Codes with 9 characters are probably a good compromise: since they are case-insensitive they can still be typed quickly, but they allow for about 5E12 combinations. Such codes could be left in the database, so one can tell that a ticket has expired; there is no need to re-use them. Brute-forcing 3 million hashes is no big obstacle even with a KDF; brute-forcing against 5E12 combinations is much more of a problem.
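To illustrate the unsalted-KDF idea, here is a sketch using PBKDF2 from Python's standard library (the iteration count and code normalization are assumptions, not recommendations):

import hashlib

ITERATIONS = 200_000  # illustrative work factor; tune to your hardware

def code_hash(code):
    # Slow, unsalted hash of a pass code (no salt, since the code itself is the lookup key).
    normalized = code.replace("-", "").upper()
    digest = hashlib.pbkdf2_hmac("sha256", normalized.encode(), b"", ITERATIONS)
    return digest.hex()

# Store code_hash(code) when a pass is issued; on validation, hash the presented
# code the same way and look the result up in the database.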
You seem to have spent a decent amount of time considering this and you have identified your largest potential attack surfaces.
Martinstoeckli identifies what I would consider the next greatest issues by raising the potential of SQL injection and brute force. In my experience the best mitigation for SQL injection is simply to make sure that all input is properly sanitized. The brute-force issue can never be fully solved, and a 6-character key is somewhat easy to crack. Including the use of a KDF seems like a good idea, but you also have to consider what the performance impact on your database will be. With an estimated 500-1000 users/keys I don't think it would be a huge concern.
I would recommend not reusing keys, because depending on the specifics of how they are stored, that could lead to hash collision attacks over time.
After those issues I would actually recommend looking into the specifics of how you are hosting this application. Is it hosted on a physical server that you have access to or is it a VM sitting somewhere in the cloud? Each of those is going to have its own risks.

How long will it take to audit 29k lines of Drupal code?

A client is asking how long it will take to audit the security of his Drupal module, which is 29k lines long. Does anyone know at least what ballpark I should give him? His main concerns are file encryption and user permissions.
Nope, not a damn clue :-)
However, whatever value you choose, may I suggest one thing?
Monitor your progress! Tell your client that your initial estimate is (for example) twenty-nine working days but that it depends on a great many factors outside your control.
Tell them you plan to mitigate risks of budget overrun by providing a daily snapshot of progress:
- current number of lines audited in total [a].
- days spent [b].
- current "run rate" (number of lines per day, average) [c = a/b].
- number of lines yet to be audited [d = 29,000 - a].
- estimated days to completion [e = d / c].
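In code, the daily snapshot amounts to no more than this (a trivial sketch; the sample figures are invented):

TOTAL_LINES = 29_000

def snapshot(lines_audited, days_spent):
    run_rate = lines_audited / days_spent       # c = a / b
    remaining = TOTAL_LINES - lines_audited     # d = 29,000 - a
    days_left = remaining / run_rate            # e = d / c
    return run_rate, remaining, days_left

print(snapshot(lines_audited=3_500, days_spent=4))  # -> (875.0, 25500, ~29 days left)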
Allow them to pull the plug at any time if the run rate is well below what you estimated.
This basic project management/reporting should give them the confidence that you know what you're doing, and will minimise their exposure considerably, to the point where they'll feel a lot more comfortable about taking you on.
Just on that last bullet point above, you may want to consider giving them a range (say +/-5% of the estimate), but don't get too clever about working out best and worst case based on your best and worst days to date. The power of averaging is that it gives you a "best" guess without having to fiddle too much with figures.
Typical estimates I've seen are that you can expect a developer to review 100-150 lines of code per hour. This is a very rough estimate, and it will vary greatly depending upon the nature of the code and the thoroughness of the review. Also, if you can review code for 8 hours a day, 5 days a week, straight, you're inhuman and amazing; for the rest of us, we need a change of activity to clear the brain.

1 vs 1 vote: calculate ratings (Flickchart.com)

Instead of rating items with grades from 1 to 10, I would like to have 1 vs 1 "fights". Two items are displayed beside each other and you pick the one which you like more. Based on these "fight" results, an algorithm should calculate ratings for each item.
You can see this approach on Flickchart.com where movies are rated using this approach.
It looks like this:
As you can see, items are pushed upwards if they win a "fight". The ranking is always changing based on the "fight" results. But this can't be based only on the win rate (here 54%), since it's harder to win against "Titanic" than against "25th Hour", say.
There are a few things which are quite unclear for me:
- How are the ratings calculated? How do you decide which film takes first place in the ranking? You have to consider how often an item wins and how good the beaten items are.
- How to choose which items have a "fight"?
Of course, you can't tell me exactly how Flickchart does all this. But maybe you can tell me how it could be done. Thanks in advance!
This might not be exactly what Flickchart is doing, but you could use a variant of the Elo algorithm used in chess (and other sports), since these are essentially fights/games that items win or lose.
Basically, all movies start off with 0 wins/losses, and every time they get a win they gain a certain number of points. You usually have an average around 20 (but any number will do), and winning against a movie with the same rating as yourself will give exactly those 20. Winning against a bad movie will give maybe around 10 points, while winning against a better movie might give you 30 points. The other way around: losing to a good movie costs you only 10 points, but if you lose to a bad movie, you lose 30 points.
The specifics of the algorithm are in the Wikipedia link.
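That update might look something like this (a sketch of the standard Elo formula; K is chosen so an even match moves about 20 points, roughly matching the description above, and the starting ratings are illustrative):

K = 40

def elo_update(winner, loser, k=K):
    expected = 1 / (1 + 10 ** ((loser - winner) / 400))   # winner's expected score
    delta = k * (1 - expected)      # beating a stronger movie earns more points
    return winner + delta, loser - delta

print(elo_update(1500, 1500))   # even match: (1520.0, 1480.0)
print(elo_update(1600, 1400))   # beating a weaker movie earns only ~9.6 points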
How are the ratings calculated? How do you decide which film takes first place in the ranking? You have to consider how often an item wins and how good the beaten items are.
What you want is a weighted rating, also called a Bayesian estimate.
I think IMDB's Top 250 movies list is a better starting point for making a ranking website. Some movies have 300,000+ votes while others have fewer than 50,000. IMDB uses a Bayesian estimate to rank movies against one another without unfairly weighting popular movies. The algorithm is given at the bottom of the page:
weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C

where:
R = average rating for the movie (mean)
v = number of votes for the movie
m = minimum votes required to be listed in the Top 250 (currently 3000)
C = the mean vote across the whole report (currently 6.9)

For the Top 250, only votes from regular voters are considered.
I don't know how IMDB chose 3000 as their minimum vote. They could have chosen 1000 or 10000, and the list would have been more or less the same. Maybe they're using "average number of votes after 6 weeks in the box office" or maybe they're using trial and error.
In any case, it doesn't really matter. The formula above is pretty much the standard for normalizing votes on ranking websites, and I'm almost certain Flickchart uses something similar in the background.
The formula works so well because it "pulls" ratings toward the mean, so ratings above the mean are slightly decreased, ratings below the mean are slightly increased. However, the strength of the pull is inversely proportional to the number of votes a movie has. So movies with few votes are pulled more aggressively toward the mean than movies with lots of votes. Here are two data points to demonstrate the property:
Rank  Movie                       Votes     Avg Rating  Weighted Rating
----  --------------------------  --------  ----------  ---------------
219   La Strada                   15,000+   8.2         8.0
221   Pirates of the Caribbean 2  210,000+  8.0         8.0
Both movies' ratings are pulled down, but the pull on La Strada is more dramatic since it has fewer votes, and its rating is therefore less representative than the rating for PotC.
For your specific case, you have two items in a "fight". You should probably design your table as follows:
Items
-----
ItemID (pk)
FightsWon (int)
FightsEngaged (int)
The average rating is FightsWon / FightsEngaged. The weighted rating is calculated using the formula above.
When a user chooses a winner in a fight, increase the winning item's FightsWon field by 1, and increase both items' FightsEngaged fields by 1.
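A sketch of how the weighted rating could then be computed from that table (m and c play the same roles as in the IMDB formula; the values here are illustrative):

def weighted_rating(fights_won, fights_engaged, m=20, c=0.5):
    # m = minimum fights before the raw average dominates, c = overall mean win rate
    if fights_engaged == 0:
        return c
    r = fights_won / fights_engaged            # raw average (win rate)
    v = fights_engaged                         # number of "votes"
    return (v / (v + m)) * r + (m / (v + m)) * c

print(weighted_rating(3, 4))      # ~0.54 despite a 75% raw win rate
print(weighted_rating(300, 400))  # ~0.74, much closer to its raw 75%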
Hope this helps!
- Juliet
I've been toying with the problem of ranking items by means of pair-wise comparison for some time myself, and wanted to take the time to describe the ideas I came up with so far.
For now I'm simply sorting by <fights won> / <total fights>, highest first. This works fine if you're the only one voting, or if there are a lot of people voting. Otherwise it can quickly become inaccurate.
One problem here is how to choose which two items should fight. One thing that does seem to work well (subjectively) is to let the item that has had the fewest fights so far fight against a random item. This leads to a relatively uniform number of fights across items (-> accuracy), at the cost of possibly being boring for the voter(s). They will often be comparing the newest item against something else, which is kind of boring. To alleviate that, you can choose the n items with the lowest fight count and choose one of those randomly as the first contender.
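A sketch of that pairing heuristic (the pool size n is arbitrary):

import random

def pick_fight(items, fight_count, n=5):
    # Pick one of the n least-fought items, then a random opponent.
    least_fought = sorted(items, key=lambda i: fight_count[i])[:n]
    first = random.choice(least_fought)
    second = random.choice([i for i in items if i != first])
    return first, second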
You mentioned that you want to make victories against strong opponents count more than against weak ones. As mentioned in other posts above, rating systems used for chess and the like (Elo, Glicko) may work. Personally I would love to use Microsoft's TrueSkill, as it seems to be the most accurate and also provides a good way to pick two items to pit against each other -- the ones with the highest draw-probability as calculated by TrueSkill. But alas, my math understanding is not good enough to really understand and implement the details of the system, and it may be subject to licensing fees anyway...
Collective Choice: Competitive Ranking Systems has a nice overview of a few different rating systems if you need more information/inspiration.
Other than rating systems, you could also try various simple ladder systems. One example:
Randomize the list of items, so they are ranked 1 to n
Pick two items at random and let them fight
If the winner is ranked above the loser: Do nothing
If the loser is ranked above the winner:
If the loser is directly above the winner: Swap them
Else: Move the winner up the ladder x% toward the loser of the fight.
Goto 2
This is relatively unstable in the beginning, but should improve over time. It never ceases to fluctuate though.
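A sketch of that ladder update (index 0 is the top of the ladder; the 50% move fraction stands in for the x% above):

def apply_fight(ranking, winner, loser, move_fraction=0.5):
    wi, li = ranking.index(winner), ranking.index(loser)
    if wi < li:                          # winner already ranked above loser: do nothing
        return ranking
    if wi - li == 1:                     # loser directly above winner: swap them
        ranking[wi], ranking[li] = ranking[li], ranking[wi]
    else:                                # move winner partway up toward the loser
        new_pos = wi - int((wi - li) * move_fraction)
        ranking.insert(new_pos, ranking.pop(wi))
    return ranking

ladder = ["E", "D", "C", "B", "A"]
print(apply_fight(ladder, winner="A", loser="D"))   # -> ['E', 'D', 'C', 'A', 'B']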
Hope I could help at least a little.
As for Flickchart, I've been playing around with it a little bit, and I think the rating system is pretty unsophisticated. In pseudo-code, my guess is that it looks something like this:
if rank(loser) == null and rank(winner) == null
    insert loser at position estimated from global rank
    insert winner at position estimated from global rank
else if rank(winner) == null or rank(winner) < rank(loser)
    then advance winner to loser's position and demote loser and all following by 1
Why do I think this? First, I'm completely convinced that their Bayesian priors are not based on a careful mining of my previous choices. They seem to have no way to guess that because I like Return of the Jedi, I probably also like The Empire Strikes Back. In fact, they can't figure out that because I've seen Home Alone 2, I may have seen Home Alone 1. After hundreds of ratings, the choice hasn't come up.
Second of all, if you look at the above code you might find a little bug, which you will definitely notice on the site. You may notice that sometimes you will make a choice and the winner will slide by one. This seems to only happen when the loser wasn't previously added. My guess is that what is happening is that the loser is being added higher than the winner.
Other than that, you will notice that rankings do not change at all unless a lower ranked movie beats a higher ranked movie directly. I don't think any real scores are being kept: the site seems to be entirely memoryless except for the ordinal rank of each movie and your most recent rating.
Or you might want to use a variant of PageRank; see Prof. Wilf's cool description.
After having thought things through, the best solution for this film ranking is as follows.
Required data:
The number of votes taken on each pairing of films.
And also a sorted version of this data grouped like in radix sort
How many times each film was voted for in each pairing of films
Optional data:
How many times each film has been involved in a vote for each user
How to select a vote for a user:
Pick out a vote selection from the sorted list in the lowest used radix group (randomly)
Optional: use the user's personal voting stats to filter out films they've been asked to vote on too many times, possibly moving onto higher radix buckets if there's nothing suitable.
How to calculate the ranking score for a film:
Start the score at 0
Go through each other film in the system
Add voteswon / votestaken versus this film to the score
If no votes have been taken between these two films, add 0.5 instead (This is of course assuming you want new films to start out as average in the rankings)
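A sketch of that scoring rule (wins_for is an illustrative dict keyed by ordered film pairs):

def ranking_score(film, films, wins_for):
    # wins_for[(a, b)] = number of votes cast for a when paired against b
    score = 0.0
    for other in films:
        if other == film:
            continue
        won = wins_for.get((film, other), 0)
        lost = wins_for.get((other, film), 0)
        total = won + lost
        score += won / total if total else 0.5   # unseen pairing counts as average
    return score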
Note: The optional stuff is just there to stop the user getting bored, but may be useful for other statistics also, especially if you include how many times they voted for that film over another.
Making sure that newly added films have statistics collected on them ASAP, and that their votes are very evenly distributed across all existing films, is vital to keeping stats correct for the rest of the films. It may be worth staggering the entry of a bunch of new films into the system to avoid temporary glitches in the rankings (though these would be neither immediate nor severe).
===THIS IS THE ORIGINAL ANSWER===
The problem is actually very easy. I am assuming here that you want to order films by preference, i.e. the #1 ranked film is the one most likely to be chosen in a vote. If you make it so that in each vote you choose two films completely at random, you can calculate this with simple maths.
Firstly each selection of two films to vote on is equally likely, so results from each vote can just be added together for a score (saves multiplying by 1/nC2 on everything). And obviously the probability of someone voting for one specific film against another specific film is just votesforthisfilm / numberofvotes.
So to calculate the score for one film, you just sum votesforthisfilm / numberofvotes for every film it can be matched against.
There is a little trouble here if you add a new film which hasn't had a considerable number of votes against all the other films, so you probably want to leave it out of the rankings until a number of votes has built up.
===WHAT FOLLOWS IS MOSTLY WRONG AND IS MAINLY HERE FOR HISTORICAL CONTEXT===
This scoring method is derived from a Markov chain of your voting system, assuming that all possible vote questions were equally likely. [This first sentence is wrong because making all vote questions have to be equally likely in the Markov chain to get meaningful results] Of course, this is not the case, and actually you can fix this as well, since you know how likely each vote question was, it's just the number of votes that have been done on that question! [The probability of getting a particular vote question is actually irrelevant so this doesn't help] In this way, using the same graph but with the edges weighted by votes done...
Probability of getting each film given that it was included in the vote is the same as probability of getting each film and it being in the vote divided by the probability it was included in the vote. This comes to sumoverallvotes((votesforthisfilm / numberofvotes) * numberofvotes) / totalnumberofvotes divided by sumoverallvotes(numberofvotes) / totalnumberofvotes. With much cancelling this comes to votesforthisfilmoverallvotes / numberofvotesinvolvingthisfilm. Which is really simple!
http://en.wikipedia.org/wiki/Maximize_Affirmed_Majorities?
(Or the BestThing voting algorithm, originally called the VeryBlindDate voting algorithm)
I believe this kind of 1 vs. 1 scenario might be a type of conjoint analysis called Discrete Choice. I see these fairly often in web surveys for market research. The customer is generally asked to choose between two or more different sets of features, picking the one they prefer most. Unfortunately it is fairly complicated (for a non-statistics guy like myself), so you may have difficulty understanding it.
I heartily recommend the book Programming Collective Intelligence for all sorts of interesting algorithms and data analysis along these lines.
