Transform data in Azure Pipeline to make it anonymous

In my new job at a community hall in the Netherlands, we work with databases that contain privacy-sensitive data (e.g. citizen service numbers). They also recently started working with Azure, which I'm getting familiar with as we speak. So this might be a beginner's question, but I hope someone can point me in the right direction: is there a way to retrieve data through a direct connection with a database and make it 'anonymous', for example by hashing or by using a key file of some sort, somewhere in the pipeline?
I know that the pipelines are JSON files and that it's possible to do some transformations. I'm curious about the possibilities for doing this in Azure!
** EDIT **
To be more clear: I want to write a piece of code, preferably in the pipeline, that does something like this:
citizen service number of person 1:
102541220
# generate a key/hash somewhere in the pipeline while loading the data into Azure
anonymous citizen service number, specific to person 1:
0x10325476
Later, I want to add columns to this database, for example the value of the house this person lives in. I want to be able to 'couple' the tables by using the anonymous citizen service number for person 1:
0x10325476
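In Azure Data Factory you would typically do this kind of transformation in an activity the pipeline calls (for example an Azure Function); the sketch below is not any specific Azure API, just an illustration of the hashing idea. A keyed hash such as HMAC-SHA256 produces a deterministic pseudonym, so the same citizen service number always maps to the same anonymous value and later tables can still be coupled on it, while the original number can't be recovered without the key (important, since a nine-digit number could easily be brute-forced if hashed without a key). The key name and keeping it in Key Vault are assumptions:

```typescript
import { createHmac } from "crypto";

// Secret key: in Azure you would normally keep this in Key Vault, not in code.
// PSEUDONYMISATION_KEY is just an assumed name for this sketch.
const key = process.env.PSEUDONYMISATION_KEY ?? "replace-me";

// Deterministically pseudonymise a citizen service number (BSN).
// The same input always yields the same output, so tables written by
// different pipeline runs can still be coupled on the hashed value,
// but the original number cannot be read back without the key.
function pseudonymiseBsn(bsn: string): string {
  return createHmac("sha256", key).update(bsn).digest("hex");
}

// Example: person 1
console.log(pseudonymiseBsn("102541220"));
// -> a stable hex string for this BSN, as long as the key stays the same
```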

It sounds like you'd be interested in Azure SQL Database dynamic data masking.
SQL Database dynamic data masking limits sensitive data exposure by masking it to non-privileged users.

Dynamic data masking helps prevent unauthorized access to sensitive data by enabling customers to designate how much of the sensitive data to reveal with minimal impact on the application layer. It’s a policy-based security feature that hides the sensitive data in the result set of a query over designated database fields, while the data in the database is not changed.

For example, a service representative at a call center may identify callers by several digits of their credit card number, but those data items should not be fully exposed to the service representative. A masking rule can be defined that masks all but the last four digits of any credit card number in the result set of any query. As another example, an appropriate data mask can be defined to protect personally identifiable information (PII) data, so that a developer can query production environments for troubleshooting purposes without violating compliance regulations.
https://learn.microsoft.com/en-us/azure/sql-database/sql-database-dynamic-data-masking-get-started
This won't anonymise the data irreversibly: it can still be re-personalised by anyone who has the appropriate permissions in SQL Server.
It will, however, allow you to do joins inside SQL Server without exposing the personal data back out.

Related

Encrypt all user data in my web application

This is not a typical Stack Overflow question, as it is quite specific and bound to my current project. Given my project (GitHub link), I would like to encrypt or handle all user data in a way that prevents me, as the service provider, from viewing the data of specific users. This would probably not be feasible in a typical web app with a relational SQL database. I am using Redis, with data that is basically structured as follows:
Users can view their data filtered by two dimensions: a time range and a domain. These are further grouped by another dimension, which is split across multiple charts. So there is data for countries, top landing pages, etc. (it's a web analytics app). Internally, of course, I also need to have the user baked in as a dimension in the key that holds the data for a chart, and of course there is some indexing stuff going on.
Now here is the idea: I could hash the access key for these single charts; I am only doing direct key access anyway and no scanning (filtering over keys). Furthermore, I would only save the hashed username in the database, so the username becomes the missing piece of information, which I don't have, that is needed to retrieve the payloads.
This would leave me with the cleartext payloads, which represent specific charts given by specific user selections (yes, I only save the user data in an aggregated form, by the way), but I would have no reasonable way to map a single chart to a specific user or domain. Given that I have ~70 users integrated at the moment, it would not be feasible to try to manually map data points to specific users (but I could still see all the domains a "user" uses).
Of course this relies on the username being somewhat of a secret, and I would only save the hashed username to the database and only handle the cleartext username in RAM. I can still greet the user, since the cleartext username is saved in a cookie :-)
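A minimal sketch of that key scheme, assuming Node with the redis client (the key layout and function names are mine, not from the project):

```typescript
import { createHash } from "crypto";
import { createClient } from "redis";

// Derive the storage key from the (somewhat secret) username plus the chart's
// dimensions, so the stored key no longer reveals whose data it is.
// SHA-256 here is just one way to "hash the access key".
function chartKey(username: string, domain: string, range: string, chart: string): string {
  return (
    "chart:" +
    createHash("sha256").update(`${username}:${domain}:${range}:${chart}`).digest("hex")
  );
}

async function main() {
  const redis = createClient();
  await redis.connect();

  // Only the hashed key touches Redis; the cleartext username lives in RAM
  // (and in the user's cookie), never in the database.
  await redis.set(
    chartKey("joe", "example.com", "last-30-days", "countries"),
    JSON.stringify({ NL: 120, DE: 87 })
  );

  await redis.quit();
}

main().catch(console.error);
```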
Since usernames are too short and have almost no entropy, I could of course brute-force my own database in order to regain the missing links and access to all the data individual users have. But before doing that, the more obvious way to "cheat" would be to just run different software (without the hashing) on the server while still stating that everything is encrypted. So my point is that the presented solution is good enough for a hosted service.
Does this sound plausible? Would such an approach add an additional layer of security or be meaningless because it is too easy to circumvent?
In my opinion, I could compare this with locking a bicycle with a very cheap lock. Even if the lock is easily breakable, it has a strong symbolic meaning: someone who breaks the lock is doing something worse than stealing a bicycle that has no lock at all. So even if it is not possible to fully protect user data from a hosting provider, it is possible to make the work of accessing it more "dirty" and thus socially and legally less acceptable. Does this make sense? :-)
So my question is: security by obscurity or sound approach?
Cheers!

Multi-tenancy Architecture in a graph DB

I would like to share my thoughts with you and try to get some advice. I would like to define my application with the best architecture possible. Any comment would be highly appreciated. Here we go...
My technologies: NestJS (Node), Neo4j/ArangoDB (graph DB), Nginx as proxy (microservices approach).
My business case: a SaaS application. Many customers with many users, one database per customer, and the same code (just one instance) of our codebase.
We have a set of data models which will be the same for all customers, but the relations between them will differ. As per my research, a graph DB is the best match for such operations, so I'm planning to create a separate instance/database for each customer; otherwise too many relations will make it harder to scale.
Problem: from my point of view, the problem can be seen from two different angles.
I need to allow multiple users to connect to different databases at the same time with the same code (just one installation). In a NestJS app, how can I change the database configuration on each API request? Should I save the DB URI in a table and, based on the user/customer type, fetch that DB URI? Then there are other concerns, such as: does it affect latency, and if a request fails, is there any possibility that it fetches data from the wrong DB? (A sketch of this per-request lookup follows after the next point.)
How can we create sub-graphs in Neo4j/ArangoDB, so we can fetch a sub-graph based on the customer?
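For the first point, a hedged sketch of the "save the DB URI in a table and fetch it per customer" idea, using a request-scoped NestJS provider (the lookup table, the tenantId property on the request, and ArangoDB as the store are all assumptions):

```typescript
import { Inject, Injectable, Scope } from "@nestjs/common";
import { REQUEST } from "@nestjs/core";
import { Request } from "express";
import { Database } from "arangojs";

// Hypothetical lookup: tenant id -> database URI/name. In practice this
// could live in a small registry database instead of a constant.
const TENANT_DATABASES: Record<string, { url: string; name: string }> = {
  "customer-a": { url: "http://arango-a:8529", name: "customer_a" },
  "customer-b": { url: "http://arango-b:8529", name: "customer_b" },
};

// Request-scoped provider: NestJS creates one instance per incoming request,
// so every request gets a connection bound to the caller's own database.
@Injectable({ scope: Scope.REQUEST })
export class TenantDatabaseService {
  readonly db: Database;

  constructor(@Inject(REQUEST) request: Request) {
    // Assumes an auth guard has already put a validated tenant id on the
    // request (e.g. from a JWT claim).
    const tenantId = (request as Request & { tenantId?: string }).tenantId;
    const config = tenantId ? TENANT_DATABASES[tenantId] : undefined;
    if (!config) {
      throw new Error(`Unknown tenant: ${tenantId}`);
    }
    this.db = new Database({ url: config.url, databaseName: config.name });
  }
}
```

The latency cost is mostly the per-request connection setup (caching one Database instance per tenant would avoid it), and because the tenant id is resolved inside the request scope, a request cannot silently fall through to another tenant's database.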
On the other hand, I found a couple of interesting links:
https://neo4j.com/developer/multi-tenancy-worked-example/
https://www.arangodb.com/enterprise-server/oneshard/
https://dzone.com/articles/multitenant-graph-applications
Could someone provide me with additional info?
Thanks for your time
Best regards
With ArangoDB, a solution that works is:
Use a single database for all customers
Use Foxx microservices in that database to provide access to all data
Enforce a tenantId value on every call to Foxx
Use dedicated collections for each tenant in that database
Set up a web server (e.g. Node.js) in front of ArangoDB that serves data to all tenants
Only allow connections to Foxx from that front end web server
Each tenant will need a few collections, depending on your solution; try to keep that number as low as possible.
This model works pretty well, and you're able to migrate customers between instances / regions as their data is portable, because it's in collections.
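As a rough illustration of that setup (the endpoint path, the header, and the collection-naming convention are assumptions, not anything ArangoDB prescribes): the front-end web server authenticates the caller, derives the tenantId, and forwards it to the Foxx service, which only ever touches that tenant's collections.

```typescript
import express from "express";

const app = express();

app.get("/api/orders", async (req, res) => {
  // In reality the tenant would come from a validated session or JWT;
  // a header is used here purely for illustration.
  const tenantId = req.header("x-tenant-id");
  if (!tenantId) {
    return res.status(401).send("unknown tenant");
  }

  // Only this front end may reach ArangoDB. The Foxx service (mounted at
  // /tenant-api in this sketch) uses tenantId to pick the tenant's own
  // collections, e.g. orders_<tenantId>.
  const response = await fetch(
    `http://arangodb:8529/_db/app/tenant-api/orders?tenantId=${encodeURIComponent(tenantId)}`,
    { headers: { authorization: "Basic " + Buffer.from("svc:password").toString("base64") } }
  );

  res.status(response.status).json(await response.json());
});

app.listen(3000);
```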

DDD/Event sourcing, getting data from another microservice?

I wonder if you can help. I am writing an order system and currently have implemented an order microservice which takes care of placing an order. I am using DDD with event sourcing and CQRS.
The order service itself takes in commands that produce events; the order service also listens to its own events to create a read model (the idea here is to use CQRS, so commands for writes and queries for reads).
After implementing the above, I ran into a problem, and it's probably just that I am not fully understanding the correct way of doing this.
An order actually has dependencies, meaning an order needs a customer and a product (or products). So I will have two additional microservices, for customers and for products.
To keep things simple, I would like to concentrate on the customer (although I have exactly the same issue with products, my thinking is that if I fix the customer issue then the other one is automatically fixed as well).
So back to the problem at hand. To create an order, the order needs a customer (and products). I currently have the customerId on the client, so when sending a command down to the order service, I can pass in the customerId.
I would like to save the name and address of the customer with the order. How do I get the name and address of the customerId from the Customer Service in the Order Service ?
I suppose to summarize, when data from one service needs data from another service, how am I able to get this data.
Would it be a case of the order service creating an event for receiving a customer record? This is going to introduce a lot of complexity (more events) into the system.
The microservices are NOT coupled so the order service can't just call into the read model of the customer.
Is anybody able to help me with this?
If you are using DDD, first of all, please read about bounded contexts. Forget microservices; they are just an implementation strategy.
Now back to your problem. Publish these events from the Customer aggregate (in your case, the Customer microservice): CustomerRegistered, CustomerInfoUpdated, CustomerAccountRemoved, CustomerAddressChanged, etc. Then subscribe your Order service (again, in your case the application service inside the Order microservice) to listen to all the above events. Okay, not all of them, just what the order needs.
Now, you may have a question: what if the majority (or some) of my customers never place orders? My order service will be full of unnecessary data. Is this a good approach?
Well, the answer might vary. I would say that disk space is cheaper than memory, and from a performance perspective a local database query is faster than a network call. If your database host (or your server) is limited, then you should not go with microservices. Moreover, I could come up with some business ideas for this unused customer data, e.g. list all customers who never ordered anything and send them some offers to grow my business. Just kidding. Don't feel bothered by unused data in microservices.
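A minimal sketch of that subscription (the event names follow the answer above; the in-memory map stands in for whatever broker and store you actually use): the Order service keeps its own slim copy of exactly the customer data it needs, so placing an order never has to call the Customer service.

```typescript
// Slim, order-specific copy of customer data: only what placing an order needs.
interface CustomerSnapshot {
  customerId: string;
  name: string;
  address: string;
}

// Placeholder for the Order service's own storage (its lookup/read model).
const customersById = new Map<string, CustomerSnapshot>();

// Events published by the Customer service (names taken from the answer above).
type CustomerEvent =
  | { type: "CustomerRegistered"; customerId: string; name: string; address: string }
  | { type: "CustomerAddressChanged"; customerId: string; address: string };

// Subscriber inside the Order service: it copies the fields it cares about.
function onCustomerEvent(event: CustomerEvent): void {
  if (event.type === "CustomerRegistered") {
    customersById.set(event.customerId, {
      customerId: event.customerId,
      name: event.name,
      address: event.address,
    });
  } else if (event.type === "CustomerAddressChanged") {
    const existing = customersById.get(event.customerId);
    if (existing) {
      existing.address = event.address;
    }
  }
}

// When a PlaceOrder command arrives, the name and address are already local.
function placeOrder(customerId: string, productIds: string[]) {
  const customer = customersById.get(customerId);
  if (!customer) {
    throw new Error("Unknown customer: the registration event has not arrived yet");
  }
  return {
    customerId,
    customerName: customer.name,
    shippingAddress: customer.address,
    productIds,
  };
}
```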
My suggestion would be to gather the required data on the front-end and pass it along. The relevant customer details that you want to denormalize into the order would be a value object. The same goes for the product data (e.g. id, description) related to the order line.
It isn't impossible to have the systems interact to retrieve data, but that does couple them at a lower level than seems necessary.
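The "gather it on the front-end and pass it along" variant could look like this: the command carries the relevant customer details as a value object, which is then denormalised into the order (all type names here are illustrative only).

```typescript
// Value object: the customer details captured at the moment the order is placed.
// It is compared by value and deliberately never updated afterwards.
interface CustomerDetails {
  readonly customerId: string;
  readonly name: string;
  readonly address: string;
}

// The client already has this data, so the command simply carries it along.
interface PlaceOrderCommand {
  orderId: string;
  customer: CustomerDetails; // denormalised into the order
  lines: { productId: string; description: string; quantity: number }[];
}

function handlePlaceOrder(command: PlaceOrderCommand) {
  // The order stores its own copy; no call to the Customer service is needed.
  return {
    type: "OrderPlaced",
    orderId: command.orderId,
    customer: command.customer,
    lines: command.lines,
  };
}
```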
When data from one service needs data from another service, how am I able to get this data?
You copy it.
So somewhere in your design there needs to be a message that carries the data from where it is to where it needs to be.
That could mean that the order service is subscribing to events that are published by the customer service, and storing a copy of the information that it needs. Or it could be that the order service queries some API that has direct access to the data stored by the customer service.
Queries for the additional data that you need could be synchronous or asynchronous - maybe the work can be deferred until you have all of the data you need.
Another possibility is that you redesign your system so that the business capability you need is with the data, either moving the capability or moving the data. Why does ordering need customer data? Can the customer service do the work instead? Should ordering own the data?
There's a certain amount of complexity that is inherent in your decision to distribute the work across multiple services. The decision to distribute your system involves weighing various trade-offs.

Kibana Dashboard (ELK) user-based (scripted/dynamic) dashboards

In my use case, we have a number of clients who would like to access a (personalised) Kibana dashboard (pre-made in Kibana). However, we wouldn't like different clients to see other clients' data (for obvious reasons!).
The problem is, Kibana "saves" dashboards as a URL, e.g.:
hxxp://myserver:8080/#/dashboard/Dash-1?embed&_g=(refreshInterval:(display:Off,pause:!f,section:0,value:0),time:(from:now-2y,mode:quick,to:now))&_a=(filters:!(),panels:!((col:1,id:UK-Log-Map,row:3,size_x:5,size_y:6,type:visualization),(col:1,id:Total-logs,row:1,size_x:12,size_y:2,type:visualization),(col:6,id:Logs-by-week,row:3,size_x:7,size_y:3,type:visualization),(col:6,id:Log-histogram,row:6,size_x:7,size_y:3,type:visualization)),query:(query_string:(analyze_wildcard:!t,query:'Name:Joe')),title:'Dash')
would represent a dashboard with 4 elements for "Joe" (filtered in the query - last part of URL).
Changing "joe" to any other client (i.e. "dave") would show their data, thus causing a security hole. What would be the best way to secure the data whilst providing the dashboards for each user?
I have full control over most of the tech used for this, so anything can be considered. I.e. libraries, proxies, RESTful services etc. This just needs a way forward!
Another user has tried to achieve this with encrypted URLs (js), but this seems a little hacky to me. There must be a cleaner way?
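One cleaner direction, along the proxy line already mentioned, is to never let the browser decide the filter at all: a small server in front of Elasticsearch injects the per-user filter from the authenticated session before the query reaches the data. The sketch below only shows that filter-injection idea for direct Elasticsearch searches; the index name, field name and header are assumptions about your setup, and pointing Kibana itself through such a proxy (or using a security plugin) is a separate step.

```typescript
import express from "express";

const app = express();
app.use(express.json());

// Stand-in for your real authentication: map the session to the value the
// documents are tagged with (here, the Name field used in the dashboard query).
function userFromRequest(req: express.Request): string | undefined {
  return req.header("x-authenticated-user");
}

// The browser never talks to Elasticsearch directly. Its search body is
// wrapped in a filter on the user's own name, so changing "joe" to "dave"
// client-side has no effect.
app.post("/es/logs/_search", async (req, res) => {
  const user = userFromRequest(req);
  if (!user) {
    return res.status(401).end();
  }

  const secured = {
    query: {
      bool: {
        must: [req.body.query ?? { match_all: {} }],
        filter: [{ term: { "Name.keyword": user } }],
      },
    },
  };

  const esResponse = await fetch("http://elasticsearch:9200/logs/_search", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ ...req.body, ...secured }),
  });

  res.status(esResponse.status).json(await esResponse.json());
});

app.listen(8080);
```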

Securing the data accessed by Neo4j queries

I wish to implement security on the data contained in a Neo4j database down to the level of individual nodes and/or relationships.
Most data will be available to all users but some data will be restricted by user role. I can add either properties or labels to the data that I wish to restrict.
I want to allow users to run custom cypher queries against the data but hide any data that the user isn't authorised to see.
If I have to do something from the outside then not only do I have to filter the results returned but I also have to parse and either restrict or modify all queries that are run against the data to prevent a user from writing a query which acted on data that they aren't allowed to view.
The ideal solution would be if there is a low-level hook that allows intercepting the reads of nodes and relationships BEFORE a cypher query acts on those records. The interceptor would perform the security checks and if they fail then it would behave as though the node or relationship didn't exist at all. i.e. the same cypher query would have different results depending on who ran it. And this would apply to all possible queries e.g. count(n) not just those that returned the nodes/relationships.
Can something like this be done? If it's not supported already, is there a suitable place in the code that I could add such a security filter or would it require many code changes?
Thanks, Damon
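For what it's worth, the "filter from the outside" approach described in the question might look roughly like this with the JavaScript neo4j-driver: run the query, then drop every record that touches a node carrying a label the caller's role may not see. As the question already points out, this post-filtering does not correct aggregates such as count(n), which is exactly why a hook inside the database would be preferable; the role-to-label mapping and names below are assumptions.

```typescript
import neo4j from "neo4j-driver";

// Assumed mapping from application role to labels that role may not see.
const hiddenLabelsByRole: Record<string, string[]> = {
  standard: ["Restricted"],
  admin: [],
};

const driver = neo4j.driver(
  "bolt://localhost:7687",
  neo4j.auth.basic("neo4j", "password")
);

// Duck-type check for node values returned by the driver.
function isNode(value: unknown): value is { labels: string[] } {
  return !!value && Array.isArray((value as { labels?: unknown }).labels);
}

// Run a user-supplied Cypher query and filter out records containing nodes
// the caller is not allowed to see. Aggregates computed inside the query
// (count(n) etc.) are NOT corrected by this.
async function runFiltered(cypher: string, role: string) {
  const hidden = new Set(hiddenLabelsByRole[role] ?? []);
  const session = driver.session();
  try {
    const result = await session.run(cypher);
    return result.records.filter((record) =>
      Object.values(record.toObject()).every(
        (value) => !isNode(value) || value.labels.every((label) => !hidden.has(label))
      )
    );
  } finally {
    await session.close();
  }
}
```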
As Chris stated, it's certainly not trivial on the database level, but if you're looking for a solution on the application level, you might have a look at Structr, a framework on top of and tightly integrated with Neo4j.
It provides node-level security based on ACLs, with users, groups, and different access levels. The security in Structr is implemented on the lowest level possible, e.g. we only instantiate objects if the querying user has the appropriate access rights.
All higher access levels like REST API and UI see only the records available in the user's context.
[1] http://structr.org, https://github.com/structr/structr
