Is usage analysis based on HyperLogLog compliant with GDPR? - hyperloglog

Context: we have telemetry system for our service and would like to track retention, how many users use various features, etc.
There are two options to deal with user identifiable information and be GDPR compliant:
Support deleting user information based on request
Keep data for less than 30 days
Option #1 is hard to implement (for telemetry system). Option #2 doesn't allow answering questions such as "what is 6-month retention for feature X?".
One idea how to get answers for above question is to calculate HyperLogLog blobs per feature every week/day and store them separately forever. This will allow moving forward to merge/dcount/calculate retention based on these blobs.
Assuming that any user identifiable information is gone after 30 days (after user account gets deleted), will HyperLogLog blobs still allow to track users or not (i.e. to answer whether a particular user used feature X two years ago)?
If it allows then it is not compliant (doesn't mean that it is compliant if it doesn't allow).

In general HLLs are not GDPR compliant. This issue was somewhat addressed in a recent Google paper (see Section 8: 'Mitigation strategies').
The hash function used in HLL are usually not cryptographically secure (usually MurmurHash), hence even with salting you might still be able to answer the question "is a user part of a HLL data structure or not" and that's a no no.
Afaik you would be in compliance if you keep HLLs around for longer than 30 days iff you apply a salted crypto hash prior to HLL aggregation (i.e. a salted SHA-2 or BLAKE2b, BLAKE3) and you destroy the salt after each <30 day period. This would allow you to keep <30 day intervals. You would not be able to merge HLLs over several intervals but only over 28 day chunks, but that can still be super valuable dependent on your business needs.

Related

Is a brute force attack a viable option in this event ticketing scheme

I plan on creating an ticket "pass" platform. Basically, imagine you come to a specific city, you buy a "pass" for several days (for which you get things like free entrance to museums and other attractions).
Now, the main question that bothered me for several days is: How will museum staff VALIDATE if the pass is valid? I see platforms like EventBrite etc. using barcodes/QR codes, but that is not quite a viable solution because we'll need to get a good camera phone for every museum to scan the code and that's over-budget. So I was thinking of something like a simple 6-letter code, for eg: GHY-AGF. There are 26^6 = 308 million combinations, which is a tough nut to crack.
I've asked a question on the StackExchange security site about this, and the main concern was the brute forcing. However, I imagine someone doing this kind of attack if: they had access of doing pass lookup. The only people that will be able to do this are:
1) The museum staff (for which there will be a secure user/pass app, and rate limits of no more than 1000 look-ups per day)
b) Actual customers to check the validity of their pass, and this will be protected with Google ReCaptcha v3, which doesn't sacrifice user experience like with v1. Also rate limits and IP bans will be applied
Is a brute force STILL a viable attack if I implement these 2 measures in place? Also, is there something else I'm missing in terms of security, when using this approach?
By the way, Using a max. 6-character-long string as a unique "pass" has many advantages portable-wise, for eg. you could print "blank" passes, where the user will be give instructions on how to obtain it. After they pay, they'll be given a code like: GAS-GFS, which they can easily write with a pen on the pass. This is not possible with a QR/barcode. Also, the staff can check the validity in less than 10 seconds, by typing it in a web-app, or sending an SMS to check if it's valid. If you're aware of any other portable system like this, that may be more secure, let me know.
Brute forcing is a function of sparseness. How many codes at any given time are valid out of how large a space? For example, if out of your 308M possibilities, 10M are valid (for a given museum), then I only need ~30 guesses to hit a collision. If only 1000 are valid, then I need more like 300k guesses. If those values are valid indefinitely, I should expect to hit one in less than a year at 1000/day. It depends on how much they're worth to figure out if that's something anyone would do.
This whole question is around orders of magnitude. You want as many as you can get away with. 7 characters would be better than 6 (exactly 26x better). 8 would be better than that. It depends on how devoted your attackers are and how big the window is.
But that's how you think about the problem to choose your space.
What's much more important is making sure that codes can't be reused, and are limited to a single venue. In all problems like this, reconciliation (i.e. keeping track of what's been issued and what's been used) is more important than brute-force protection. Posting a number online and having everyone use it is dramatically simpler than making millions of guesses.

Potential security vulnerabilities in a ticketing implementation

I'm trying to brainstorm potential security vulnerabilities for this scenario (btw, I've asked a related question several days ago, however, from the answers, I've realized that explaining the EXACT scenario is extremely crucial, because many of the answers were (a bit) irrelevant due to this. I've also included vulnerabilities I've identified so far, and how to mitigate them, so feedback on them would be appreciated. So here we go:
a) The whole system will be a "ticketing" system, but not an ordinary ticket, but a "pass" system. Meaning: A customer goes and orders a "pass" ticket, which gives him access to certain perks at certain places (like free entrance to museums) for a SPECIFIC period of time. Meaning, it's a ticket that EXPIRES after 1-7 days (but no more than 7 days).
b) The "flow" of the user is:
User goes to the website, orders a ticket for a specific period of time, which gives him perks at certain locations (museums etc.)
After a successful order, the website prints a 6-letter-long string (an ID). Example: GFZ-GFY. There are 26^6 (~308 million) potential combinations. Of course, these IDs will be stored in a secure database.
User then goes to the museum (or other venue) and shows the 6-letter-long string. The employee checks its validity with a web-app or sending an SMS to a number, getting the validity status immediately (in both cases, the code will query against the database to check for the ticket validity).
So far, I've identified 2 potential issues:
a) Brute-force attacks
There will be 2 "attack surfaces" under which this can occur:
A museum employee will have a gated access to the web-app (to verify ticket validity). The way I mitigate this is limiting the # of look-ups to 1,000 a day per-user-account.
A user will be able to check the status of his order. I'll mitigate this in several ways: first, the URL not be "public", and available only to users who purchased the ticket. Second, I'll implement ReCaptcha v3, IP bans on more than 10 unsuccessful requests per hour.
The # of "active" tickets at a time is expected to be 5000 (at its peak), normal would be something like 500-1000, so considering there are hundreds of millions of combinations, it would take a significant effort for an attacker to brute-force the way through this.
The second (and easier) approach an attacker could take is simply buying a ticket and re-publishing it, or publishing it online for anyone to use. The way I'll mitigate this is by:
After a museum checks the validity of the pass, if they check it again, there will be a notification saying: This pass has been checked at this place at this time: [time-date].
While I do plan on re-using the same code, I'll make sure there is a period of minimum 90 days between periods. Maybe there's some vulnerability of doing this that I'm not aware of. The code MAY or MAY not be used again after 90 days passed after its "expiration" date. All I'm saying is that it will be released in the "pool" of potential (300+ million) codes that could be used. Maybe this is not such a good idea?
The customer will be given (sent to an address, or instructed to pick-up), a blank card-like "ticket" where the code will be written on it (or he'll have to write the code with a pen on the ticket). This will make an attack harder to do, since the attacker would now need to have access BOTH to the code + a printer that could print such cards with the same material.
Do you see any other potential attack that could be done? Is there anything I'm missing at my current mitigation approaches?
I would also plan for the case that your database is compromised, e.g. through SQL-injection. Instead of storing the codes plain text, you can use any decent password-hash function and store only the hash of the code. The verification procedure is the same as with passwords.
If there is no user id, the code must be retrievable with a database query, then you cannot use salting. In this case we can use a key derivation function (KDF), which needs some time to calculate a hash to make brute-forcing harder. The missing salt leads to the next point:
I would feel more comfortable using longer codes. If I read the table correctly, the probability to find a valid code with your setup (~28bit / 3000 active codes) is about 0.001 because of the birthday problem. Codes with 9 characters are probably a good compromise, since they are case insensitive they can still be typed fast, but allow for 5E12 combinations. Such codes could be left in the database, so one can tell that a ticket has expired, there is no need to re-use them. Brute-forcing 3 million hashes is no big obstacle even with a KDF, brute-forcing with 5E12 combinations is much more of a problem.
You seem to have spent a decent amount of time considering this and you have identified your largest potential attack surfaces.
Martinstoeckli identifies what I would have considered the next greatest issues by raising the potential of SQL injection and brute force. In my experience the best mitigation to SQL injection is just going to be to make sure that all input is properly sanitized. The brute force issue can never be fully solved and a 6 character key is somewhat easy to crack. Including the use of a KDF would seem like a good idea but you also have to consider what the performance impact on your database will be. With an estimated 500-1000 users/keys i don't think that it would be a huge concern.
I would recommend not reusing keys because depending on how they are stored that could lead to hash collision attacks after a time depending on the specifics of how they are stored.
After those issues I would actually recommend looking into the specifics of how you are hosting this application. Is it hosted on a physical server that you have access to or is it a VM sitting somewhere in the cloud? Each of those is going to have its own risks.

I m facing issue in Coding this rotation part in cics

How can we rotate CUSTOMER NUMBER values in CICS?
For eg. If customer number is c52063
How can i get onto next value ie, c52064(say) in CICS?
This is a very broad question, essentially you're asking what persistence mechanisms are available in CICS.
Please understand there is a big difference between...
what is technically possible
what is allowed in your shop
what is likely to provide a robust and maintainable solution given your requirements
These are three very different things. Some of us answering questions here on StackOverflow have life experiences that make us reticent about answering questions regarding what is technically possible absent any mention of what is allowed in your shop or what the actual business requirement that is being solved.
Mainframes have been around for over half a century, and many shops have standard solutions to technical problems. Sometimes the solution is "don't do that, and here's what we do instead." Working against the recommendations of your technical staff, or your shop standards, is career limiting.
A couple of options, not intended to be an exhaustive list...
SELECT and UPDATE the value in a DBMS (such as DB2). You must code your SELECT SQL with FOR UPDATE.
READ and REWRITE the value in a VSAM file. You must code your READ with the UPDATE option.
In either case you are holding a lock on the resource until you hit either an explicit (EXEC CICS SYNCPOINT) or implicit (end of transaction) syncpoint or rollback (EXEC CICS SYNCPOINT ROLLBACK or abend condition). Holding such a lock means all other instances of your transaction will wait until the syncpoint or rollback has occurred.
If you know for certain your application will be limited to a single CICS region... Other options would include having a transaction initiated as part of region initialization processing that would obtain and populate a shared resource such as a temporary storage queue with a name known to your application with the last known customer number. This initialization transaction would have to obtain the highest used customer number from somewhere, probably a DBMS or a VSAM file. Applications would have to be coded to ENQ and DEQ their access to the temporary storage queue. You could do this without using a temporary storage queue but with shared memory and storing the address of that memory in the CICS CWA for your region. Again, ENQ and DEQ logic would have to be coded in the applications.
You could use a named counter as defined by your CICS Systems Programmer. Be certain to read and understand the recovery requirements for your application as documented in the IBM Knowledge Center.
Again, this is not an exhaustive list, it is just to give an overview of some of the options available. Talk to your technical staff, they likely have either a standard solution as employed by your shop or a preference based on their experience and your requirements.

Best Practices / Patterns for Enterprise Protection/Remediation of SSNs (Social Security Numbers)

I am interested in hearing about enterprise solutions for SSN handling. (I looked pretty hard for any pre-existing post on SO, including reviewing the terriffic SO automated "Related Questions" list, and did not find anything, so hopefully this is not a repeat.)
First, I think it is important to enumerate the reasons systems/databases use SSNs: (note—these are reasons for de facto current state—I understand that many of them are not good reasons)
Required for Interaction with External Entities. This is the most valid case—where external entities your system interfaces with require an SSN. This would typically be government, tax and financial.
SSN is used to ensure system-wide uniqueness.
SSN has become the default foreign key used internally within the enterprise, to perform cross-system joins.
SSN is used for user authentication (e.g., log-on)
The enterprise solution that seems optimum to me is to create a single SSN repository that is accessed by all applications needing to look up SSN info. This repository substitutes a globally unique, random 9-digit number (ASN) for the true SSN. I see many benefits to this approach. First of all, it is obviously highly backwards-compatible—all your systems "just" have to go through a major, synchronized, one-time data-cleansing exercise, where they replace the real SSN with the alternate ASN. Also, it is centralized, so it minimizes the scope for inspection and compliance. (Obviously, as a negative, it also creates a single point of failure.)
This approach would solve issues 2 and 3, without ever requiring lookups to get the real SSN.
For issue #1, authorized systems could provide an ASN, and be returned the real SSN. This would of course be done over secure connections, and the requesting systems would never persist the full SSN. Also, if the requesting system only needs the last 4 digits of the SSN, then that is all that would ever be passed.
Issue #4 could be handled the same way as issue #1, though obviously the best thing would be to move away from having users supply an SSN for log-on.
There are a couple of papers on this:
UC Berkely
Oracle Vault
I have found a trove of great information at the Securosis site/blog. In particular, this white paper does a great job of summarizing, comparing and contrasting database encryption and tokenization. It is more focused on the credit card (PCI) industry, but it is also helpful for my SSN purpose.
It should be noted that SSNs are PII, but are not private. SSNs are public information that be easily acquired from numerous sources even online. That said if SSNs are the basis of your DB primary key you have a severe security problem in your logic. If this problem is evident at a large enterprise then I would stop what you are doing and recommend a massive data migration RIGHT NOW.
As far as protection goes SSNs are PII that is both unique and small in payload, so I would protect that form of data no differently than a password for one time authentication. The last four of a SSNs is frequently used for verification or non-unique identification as it is highly unique when coupled with another data attribute and is not PII on its own. That said the last four of a SSN can be replicated in your DB for open alternative use.
I have come across a company, Voltage, that supplies a product which performs "format preserving encryption" (FPE). This substitutes an arbitrary, reversibly encrypted 9-digit number for the real SSN (in the example of SSN). Just in the early stages of looking into their technical marketing collateral...

Storing partial credit card numbers

Possible Duplicates:
Best practices for taking and storing credit card information with PHP
Storing credit card details
Storing Credit Card Information
I need to store credit card numbers within an e-commerce site. I don't intend on storing the whole credit card number, as this would be highly risky. I would like to store at least the first five digits so I can later identify the financial institution that issued the card. Ideally, I would like to store as much of the credit number as I safely can, to aid any future cross-referencing etc.
How many digits, and which particular digits, can I safely store?
For example, I imagine this would not be safe enough:
5555 5555 555* 4444
Because you could calculate the missing digit.
Similarly, this would be safe, but not be as useful:
5555 5*** **** ****
Is there a well accepted pattern for storing partial credit numbers?
The Payment Card Data Security Standard states that if you are handling cardholder data, then you are subject to the constraints of the PCI DSS (which is very comprehensive and a challenge to comply with). If you want to store part of a card number, and don't want to have to deal with the Standard, then you need to make sure that a) you store NO MORE THAN the first 6 and last 4 digits; b) you don't ever store, process or transmit more than this. That means that the truncation has to be carried out before the data enters your control.
Given that you are talking about an ecommerce site, I think you'll have to deal with the PCI DSS sooner or later (since if you're not taking full PANs, you can't process transactions). Realistically, then, you should avoid storing more than the first 6 and last 4 digits of a PAN; the Standard then does not 'care' about this data, and you can store it in whatever form you see fit. If you store, say, the first 7 digits, then Requirement 3 of the Standard kicks in (and you start having to really understand key management in encryption).
I hope that this is of use.
March 2013 Edit:
A very pertinent resource is the PCI Security Standards Council, an organisation founded in 2006 by five of the biggest global Credit Card brands (AmEx, Visa, MasterCard, JCB International and Discovery) and which is the de facto authority on Security matters for the Payment Card Industry (PCI).
This organization publishes in particular the PCI Data Security Standard, currently in its version 2.0 edition which covers issues such as the management of complete or partial credit card numbers. This document if freely available but requires a simple registration and acknowledgment of license terms.
The following is the original, c. 2009 answer, mostly correct but apocryphal.
A common practice (whether legal or not I do not know) is to store the last 4 digits, as this may be used to help the customer confirm which of his/her credit cards were used for a particular transaction.
Without significantly improving the odds of a malicious person guessing the complete number, one can store the first 4 digits which are representative of the financial institution which issued the card, as mentioned in the question.
Do NOT, save many more digits than these 8 digits because otherwise, given the LUHN-10 checksum, you may provide enough info to make guessing the complete number more plausible (if still relatively hard, even with insight from the series used by a given issuer, in a given time period, but one should be careful...)
To make this whole thing safer, technically and legally, you may consider only storing such info if the customer explicitly allows it. You should also consider masking this info with a simple hash for storing in the database.
Also, what you can / should store following a particular transaction, is the transaction ID supplied by the Credit Card Processor, at the time the transacton is submitted. This ID is the key that allows locating most (all?) of the info you would even need, would there be any issue with a particular transaction. This type of info can typically be queried from a secure web site maintained by the Processing company, along with some aggregate reports which may include a grouping by card-type (Amex, Visa...) if that is why you are thinking of storing the first four.
If you don't need to store the whole credit card number, why do you need to store it at all? If you want to save the financial institution that issued the card, why don't you store the financial institution that issued the card?
Your specific question is answered in sec 3.3 of the PCI/DSS document.
First six and last four are max for display. Customer (paper?) receipts are more restrictive. Those with a legitimiate need to know can see full card data.
My recommendation is to contact your merchant provider and see what options are available to you. A number of the modern transaction gateways have "vault" features where sensitive information is stored at the provider and you simply reference customers by a token number when you want to bill them or check account information.
Along the same lines use of transaction specific tokens can be used to reference needed data stored on the providers system.
However I can't stress enough the importance of reading and understanding PCI DSS. Simply punting secure storage does not magically obsolve you from being subject to PCI compliance requirements!! This is only possible when your system never touches full card data.
The accepted pattern is don't store them at all.
In certain jurisdictions you may be breaking the law by storing them or any part of them.
You could instead, store a one-way (and therefore unrecoverable) hash of the credit card number.
The credit card companies have a standard for this. You'll probably find it buried somewhere in the terms of service of your payment processor that you will obey this standard. It answers you questions. You can find the standard here
Here in Canada, the usual way is to store the first 4 digit ( to identify the financial institution) and the 4 last digit to identify the credit card.
But be sure that you didn't break any laws.

Resources