Attachments management - domain-driven-design

I'm building a system where a Meeting can have zero or more Attachments.
To avoid loading the whole attachment binary each time I load a Meeting, I have an AttachmentRef(size, mimetype, reference, name, hash).
This reference is created via a factory that guesses the mimetype, computes the hash and size, and ensures that everything is saved alongside the binary content: AttachmentsFactory.create(name, byte[]): AttachmentRef.
Then, when the user wants to retrieve an attachment, the reference has to be dereferenced. The Attachment is more or less the same as the reference except that it has the binary content: Attachment(size, mimetype, name, content). (It will be implemented as a composition of the reference and a byte[].)
My question is about the retrieval of this attachment. I have two main possibilities, and I would like to know which one looks best in a "DDD" design.
1 - Dumb reference, smart service
AttachmentService {
    dereference(ref): Attachment {
        // Get the binary, recompute and verify the hash, and return an Attachment
    }
}
attachmentService.dereference(ref)
2 - Smart reference, dumb service
AttachmentService {
    read(ref): byte[] {
        // Just return the content for the ref
    }
}
AttachmentReference {
    dereference(attachmentService) {
        content = attachmentService.read(this)
        // recompute and verify the hash
        return new Attachment(this, content)
    }
}
ref.dereference(attachmentService)

This is actually a pretty good example of interaction between Bounded Contexts.
If your Meeting is in one BC and your content is in a Content BC, then you could very well have the physically attached content (byte[]) represented in your Meeting as a Value Object, as you have done with your reference.
The attached content may be represented as a ContentItem or some such in your Content BC, and there it would be an Aggregate Root.
The retrieval of the actual content would typically occur in the integration/application layer. There is no need to have that in the Meeting BC, as it wouldn't do much there, I assume.
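For what it's worth, a minimal sketch of that application-layer retrieval in plain JavaScript (the contentStore adapter, SHA-256 and the other names here are illustrative assumptions, not part of the original design):

// Sketch only: an application-layer service sitting between the Meeting BC and the Content BC.
const crypto = require('crypto');

class AttachmentAppService {
    constructor(contentStore) {
        this.contentStore = contentStore; // adapter that reads binaries from the Content BC
    }

    async dereference(ref) {
        const content = await this.contentStore.read(ref.reference); // the raw bytes (Buffer)
        const hash = crypto.createHash('sha256').update(content).digest('hex'); // assuming SHA-256
        if (hash !== ref.hash) {
            throw new Error('Attachment content does not match its stored hash');
        }
        // Attachment = reference + binary content, as described in the question
        return { ...ref, content };
    }
}

Whether this lives behind a standalone service (option 1) or behind ref.dereference(service) (option 2) is then mostly a matter of taste; the integrity check itself stays out of the Meeting aggregate either way.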

Related

User Segmentation Engine using MongoDB

I have an analytics system that tracks customers and their attributes as well as their behavior in the form of events. It is implemented using Node.js and MongoDB (with Mongoose).
Now I need to implement a segmentation feature that allows grouping stored users into segments based on certain conditions, for example something like purchases > 3 AND country = 'Netherlands'.
In the frontend this would be presented as a visual condition builder (screenshot not included here).
An important requirement here is that the segments get updated in realtime and not just periodically. This basically means that every time a user's attributes change or he triggers a new event, I have to check again which segments he belongs to.
My current approach is to store the conditions for the segments as MongoDB queries, that I can then execute on the user collection in order to determine which users belong to a certain segment.
For example a segment to filter out all users that are using Gmail would look like this:
{
    _id: '591638bf833f8c843e4fef24',
    name: 'Gmail Users',
    condition: { 'email': { $regex: '.*gmail.*' } }
}
When a user matches the condition I would then store that he belongs to the 'Gmail Users' segment directly on the user's document:
{
    username: 'john.doe',
    email: 'john.doe@gmail.com',
    segments: ['591638bf833f8c843e4fef24']
}
However by doing this, I would have to execute all queries for all segments every time a user's data changes, so I can check if he is part of the segment or not. This feels a bit complicated and cumbersome from a performance point of view.
Can you think of any alternative way to approach this? Maybe use a rule-engine and do the processing in the application and not on the database?
Unfortunately I don't know a better approach but you can optimize this solution a little bit.
I would do the same:
Store the segment conditions in a collection
Once you find a matching user, store the segment id in the user's document (segments)
An important requirement here is that the segments get updated in realtime and not just periodically.
You have no choice: you need to run the segmentation query every time a segment changes.
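For that case, a sketch of the full re-run in mongo shell syntax (segment stands for the stored segment doc, and the users collection name is assumed from the question):

// When a segment's condition changes, re-evaluate it against the whole collection:
// add the segment id to users that now match...
db.users.updateMany(segment.condition, { $addToSet: { segments: segment._id } });
// ...and remove it from users that no longer match.
db.users.updateMany(
    { $nor: [segment.condition], segments: segment._id },
    { $pull: { segments: segment._id } }
);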
I would have to execute all queries for all segments every time a user's data changes
This is where I would change your solution, actually just optimise it a little bit:
You don't need to run the segmentation queries on the whole collection. If you put the user's id into the query with an $and, MongoDB will fetch the user first and only then check the rest of the segmentation conditions. You need to make sure MongoDB uses the user's _id as an index; you can use .explain() to check it or .hint() to force it. Unfortunately you still need to run N+1 queries if you have N segments (the +1 is for the user update).
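A sketch of that per-user re-check, reusing the Gmail segment from the question (mongo shell syntax; userId is the _id of the user that was just updated):

db.users.find({
    $and: [
        { _id: userId },                    // resolved via the _id index first
        { email: { $regex: '.*gmail.*' } }  // the segment's stored condition
    ]
}).explain('executionStats')                // confirm the _id index is actually used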
I would fetch every segment and store them in a cache (Redis). If someone changes a segment I would update the cache as well (or just invalidate the cache and let the next query handle the rest, depending on the implementation). The point is that I would have every segment available without hitting the database, and when a user record is updated I would go through every segment with Node.js, validate the user against the conditions, and update the user's segments array in the original update query, so it would not require any extra database operation.
I know it could be a pain in the ass implementing something like this but it doesn't overload the database ...
Update
Let me give you some technical details about my second suggestion:
(This is just pseudo code!)
Segment cache
module.exports = function() {
    return new Promise(function(resolve, reject) {
        Redis.get('cache:segments', function(err, segments) {
            // handle error
            // Segments are cached
            if (segments) {
                segments = JSON.parse(segments);
                return resolve(segments);
            }
            // Fetch segments and save them to the cache
            Segments.find().exec(function(err, segments) {
                // handle error
                // Save to the cache, but set 60 seconds as an expiration
                Redis.set('cache:segments', JSON.stringify(segments), 'EX', 60, function(err) {
                    // handle error
                    return resolve(segments);
                });
            });
        });
    });
};
User update
// ...
let user = yield User.findOne({ _id: ObjectId(req.body.userId) });
// etc ...
// fetch segments from the cache or from the database
let segments = yield segmentCache();
let userSegments = [];
segments.forEach(function(segment) {
    if (checkSegment(user, segment)) {
        userSegments.push(segment._id);
    }
});
// Override the user's segments with userSegments
This is where the magic happens: somehow you need to define the conditions in a way that lets you use them in an if statement.
Hint: Lodash has functions for this: _.gt, _.gte, _.eq ...
Check segments
module.exports = function(user, segment) {
    // The user belongs to the segment only if every condition field matches.
    // (Still pseudo code: operators like $regex or $gt need their own handling.)
    return Object.keys(segment.condition).every(function(key) {
        return user[key] === segment.condition[key];
    });
};
You are already storing an entire segment "query" in a document in the segments collection; why not include a field in the same document which enumerates which fields in the users document impact membership in that particular segment?
Since the action of changing user data knows which fields are being changed, it can fetch only the segments which are computed using those fields, significantly reducing the number of segmentation "queries" you have to re-run.
Note that a change in user's data may add them to a segment they are not currently a member of, so checking only the segments currently stored in the user is not sufficient.
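A sketch of that idea (Mongoose-style; the fields array is an addition for illustration): each segment lists the user fields it depends on, and an update only re-runs the segments whose fields overlap the changed ones.

// Segment doc, extended with the fields that influence membership:
// {
//     _id: '591638bf833f8c843e4fef24',
//     name: 'Gmail Users',
//     condition: { email: { $regex: '.*gmail.*' } },
//     fields: ['email']
// }

function segmentsToRecheck(changedFields) {
    // only segments that depend on at least one changed field need re-running
    return Segments.find({ fields: { $in: changedFields } }).exec();
}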

Couchdb super slow view, 100% cpu usage

There is one account doc. This doc has ~1k seats. For each seat, we emit a doc. Naturally, you'd expect this to be slow. The map function runs like this:
function(doc) {
    if (doc.type == 'account') {
        doc.seats.map(function(seat) {
            emit(seat.userID, doc);
        });
    }
}
However deleting doc.seats, then emitting the much smaller doc didn't seem to help.
function(doc) {
    if (doc.type == 'account') {
        doc.seats.map(function(seat) {
            delete doc.seats;
            emit(seat.userID, doc);
        });
    }
}
Does anyone understand why deleting the seats doesn't speed this up? The only way we could speed it up was by not emitting the doc object, and just emitting an id.
function(doc) {
    if (doc.type == 'account') {
        doc.seats.map(function(seat) {
            emit(seat.userID, doc._id);
        });
    }
}
Is this a problem with looping over a doc's array in a couch view map?
tldr;
Use a permanent view if you care about performance
doc is immutable from a view. You can't even add to it without making a copy.
It's almost always better to emit the _id and use include_docs than it is to emit an entire doc as your value.
explanation
Here are a couple of points about your question, using your example document, which contains an array called seats with 1K entries.
Emitting the entire doc here is a bad idea. If this is a permanent view (which you should always use if performance is at all an issue), you've taken one copy of doc, and then made 1000 copies and indexed them by the seat.userID. This isn't efficient. It's worse as a temporary view because then it's generated on the fly, in memory, each time the view is called.
AFAIK the doc is totally immutable as accessed via a view, so the way you're attempting to delete the seats field doesn't work. Therefore deleting doc.seats shouldn't provide any performance gain, since you're still going to complete the loop and create 1000 copies of the original doc. You can, however, make a deep copy of doc that doesn't have seats included and pass that through emit.
For example:
function(doc) {
    var doc_without_seats = JSON.parse(JSON.stringify(doc));
    doc_without_seats['seats'] = null;
    doc.seats.map(function(seat) {
        emit(seat.userID, doc_without_seats);
    });
}
You're certainly on the right track with emitting doc._id instead of the doc. The index you're building in this case is, at largest, 1/1000th the size. If you still need to access the entire document, you can pass the option include_docs=true to the view when you query it. This keeps the whole doc from being copied into the index.
Another potential optimization would be to just emit the things you'll want to reference when looking something up by seat.userID. If that's still large and unwieldy, use the include_docs method.
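A sketch of that combination: emit a tiny value and let include_docs do the join at query time.

function(doc) {
    if (doc.type == 'account') {
        doc.seats.forEach(function(seat) {
            // keep the index small: no copy of the account doc in the value
            emit(seat.userID, null);
        });
    }
}
// Querying with ?key="<some userID>"&include_docs=true then returns the emitting
// account doc alongside each row without it ever being copied into the index.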

What is the best way to safely read user input?

Let's consider a REST endpoint which receives a JSON object. One of the JSON fields is a String, so I want to validate that no malicious text is received.
@ValidateRequest
public interface RestService {
    @POST
    @Consumes(APPLICATION_JSON)
    @Path("endpoint")
    void postData (@Valid @NotNull Data data);
}

public class Data {
    @ValidString
    private String s;
    // get, set methods
}
I'm using the bean validation framework via @ValidString to delegate the validation to the ESAPI library.
@Override
public boolean isValid (String value, ConstraintValidatorContext context) {
    return ESAPI.validator().isValidInput(
        "String validation",
        value,
        this.constraint.type(),
        this.constraint.maxLength(),
        this.constraint.nullable(),
        true);
}
This method canonicalizes the value (i.e. removes encoding) and then validates it against a regular expression provided in the ESAPI config. The regex is not that important to the question, but it mostly whitelists 'safe' characters.
All good so far. However, on a few occasions I need to accept less safe characters like %, ", <, > etc., because the incoming text comes from an end user's free-text input field.
Is there a known pattern for this kind of String sanitization? What kind of text can cause server-side problems if SQL queries are considered safe (e.g. using bind variables)? What if the user wants to store <script>alert("Hello")</script> as his description, which at some point will be sent back to the client? Do I store that in the DB? Is that a client-side concern?
When dealing with text coming from the user, best practice is to whitelist only known character sets, as you stated. But that is not the whole solution, since there are times when that will not work; again, as you pointed out, sometimes "dangerous" characters are part of the valid character set.
When this happens you need to be very vigilant in how you handle the data. What I, as well as the commenters, recommend is to keep the original data from the user in its original state as long as possible. Dealing with the data securely then means using the proper functions for the target domain/output.
SQL
When putting free-format strings into a SQL database, best practice is to use prepared statements (in Java this is the PreparedStatement object) or an ORM that automatically parameterizes the data.
To read more on SQL injection attacks and other forms of injection attacks (XML, LDAP, etc.), I recommend OWASP's Top 10 - A1 Injection.
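(The question is Java, where this means PreparedStatement or JPA parameter binding. Purely to illustrate the same keep-SQL-and-data-separate idea, here is a sketch using the node-postgres client; the client choice and the descriptions table are assumptions, not part of the original stack.)

const { Pool } = require('pg');
const pool = new Pool();

async function saveDescription(userId, description) {
    // the driver binds $1/$2 as data, so the free text can never change the SQL itself
    await pool.query(
        'INSERT INTO descriptions (user_id, body) VALUES ($1, $2)',
        [userId, description]
    );
}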
XSS
You also mentioned what to do when outputting this data to the client. In this case you want to make sure you HTML-encode the output for the proper context, aka contextual output encoding. ESAPI has an Encoder class/interface for this. The important thing to note is in which context (HTML body, HTML attribute, JavaScript, URL, etc.) the data will be output. Each area is encoded differently.
Take for example the input: <script>alert('Hello World');<script>
Sample Encoding Outputs:
HTML: &lt;script&gt;alert(&#x27;Hello World&#x27;);&lt;script&gt;
JavaScript: \u003cscript\u003ealert(\u0027Hello World\u0027);\u003cscript\u003e
URL: %3Cscript%3Ealert%28%27Hello%20World%27%29%3B%3Cscript%3E
Form URL: %3Cscript%3Ealert%28%27Hello+World%27%29%3B%3Cscript%3E
CSS: \00003Cscript\00003Ealert\000028\000027Hello\000020World\000027\000029\00003B\00003Cscript\00003E
XML: &lt;script&gt;alert(&apos;Hello World&apos;);&lt;script&gt;
For more reading on XSS look at OWASP Top 10 - A3 Cross-Site Scripting (XSS)

Referencing external doc in CouchDB view

I am scraping a 90K record database using JSON-RPC and I am trying to put in some basic error checking. I want to start by scraping the database twice using two different settings and adding a prefix to the second scrape. This way I can check to ensure that the two settings are not producing different records (due to dropped updates, etc.). I wanted to implement the comparison using a view which compares each document from the first scrape with its twin produced by the second scrape and then emits the names of records with a difference between them.
However, I cannot quite figure out how to pull in another doc in the view; everything I have read only discusses external docs using the emit() function, which is too late to permit me to compare it. In the example below, the lookup() function would grab the referenced document.
Is this just not possible?
function(doc) {
    if (doc._id.slice(0,1) !== '$' && doc._id.slice(0,1) !== "_") {
        var otherDoc = lookup('$test' + doc._id);
        if (otherDoc) {
            var keys = doc.value.keys();
            var same = true;
            keys.forEach(function(key) {
                if ((key.slice(0,1) !== '_') && (key.slice(0,1) !== '$') && (key !== 'expires')) {
                    if (!Object.equal(otherDoc[key], doc[key])) {
                        same = false;
                    }
                }
            });
            if (!same) {
                emit(doc._id, 1);
            }
        }
    }
}
Context
You are correct that this is not possible in CouchDB. The whole point of the map function is that it must be idempotent, otherwise you lose all the other nice benefits of a pre-calculated index.
This is why you cannot access external resources in the map function, whether they be other records or the clock. Any time you run a map you must always get the same result if you put the same record into it. Since there are no relationships between records in CouchDB, you cannot promise that this is possible.
Solution
However, you can still achieve your end goal, just by different means. Some possibilities...
Assuming there is some meaningful numeric value in each doc, you could use a view to take the sum of all those values and group them by which import you did ({key: <batch id>, value: <meaningful number>}). Then compare the two numbers in your client or the browser to see if they match.
A brute force approach would be to use a view to pair the docs that should match. Each doc is on a different row, but they're grouped by a common field. Then iterate through the entire index comparing the pairs. This would certainly be the quickest to code and doesn't depend on your application or data.
Implement a validation function to enforce a schema on your data. Just be warned that this will reduce your write throughput since each written record will be piped out of Erlang and into the JS engine. Also, this is only applicable if you're worried about properly formed records instead of their precise content, which might not be the case.
Instead of your different batch jobs creating different docs, have them place them into the same doc. The structure might look like this: { "_id": "something meaningful", "batch_one": { ..data.. }, "batch_two": { ..data.. } } Then your validation function could compare them or you could create a view that indexes all the docs that don't match. All depends on where in your pipeline you want to do the error checking and correction.
Personally I like the last option better, but only if you don't plan to use the database as-is in production, i.e. you wouldn't want to carry around all that extra data in each record.
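A minimal sketch of a view for that last option, assuming the batch_one / batch_two structure above (note that a plain JSON.stringify comparison is sensitive to key order):

function(doc) {
    if (doc.batch_one && doc.batch_two) {
        // crude deep comparison; only mismatching docs end up in the index
        if (JSON.stringify(doc.batch_one) !== JSON.stringify(doc.batch_two)) {
            emit(doc._id, null);
        }
    }
}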
Hope that helps.
Cheers.

CouchDB: Single document vs "joining" documents together

I'm trying to decide the best approach for a CouchApp (no middleware). Since there are similarities to my idea, let's assume we have a stackoverflow page stored in a CouchDB. In essence it consists of the actual question on top, answers and comments. Those are basically three layers.
There are two ways of storing it: either within a single document containing a suitable JSON representation of the data, or storing each part of the entry within a separate document and combining them later through a view (similar to this: http://www.cmlenz.net/archives/2007/10/couchdb-joins).
Now, both approaches may be fine, yet both have massive downsides from my current point of view. Storing a busy document (many changes through multiple users are expected) as a single entity would cause conflicts to happen. If user A stores his/her changes to the document, user B would receive a conflict error once he/she is finished typing his/her update. I can imagine it's possible to fix this without the user's knowledge by re-downloading the document before retrying.
But what if the document is rather big? I expect them to become rather bloated over time, which would put quite a noticeable delay on the save process, especially if the retry process has to happen multiple times due to many users updating a document at the same time.
Another problem I see is editing. Every user should be allowed to edit his/her contributions. Now, if they're stored within one document it might be hard to write a solid auth handler.
Ok, now let's look at the multiple documents approach. Question, answers and comments would be stored within their own documents. Advantage: only the actual owner of a document can cause conflicts, something that won't happen too often. Being rather small elements of the whole, re-downloading wouldn't take much time. Furthermore the auth routine should be quite easy to realize.
Now here's the downside. The single document is really easy to query and display. Having a lot of unsorted snippets lying around seems like a messy thing, since I didn't really get the actual view to present me with a 100% ready-to-use JSON object containing the entire item in an ordered and structured format.
I hope I've been able to communicate the actual problem. I'm trying to decide which solution would be more suitable for me, and which problems are easier to overcome. I imagine the first solution to be the prettier one in terms of storage and querying, yet the second one the more practical one, solvable through better key management within the view (I'm not entirely into the principle of keys yet).
Thank you very much for your help in advance :)
Go with your second option. It's much easier than having to deal with the conflicts. Here are some example docs how I might structure the data:
{
    _id: 12345,
    type: 'question',
    slug: 'couchdb-single-document-vs-joining-documents-together',
    markdown: 'Im trying to decide the best approach for a CouchApp (no middleware). Since there are similarities to...',
    user: 'roman-geber',
    date: 1322150148041,
    'jquery.couch.attachPrevRev' : true
}
{
    _id: 23456,
    type: 'answer',
    question: 12345,
    markdown: 'Go with your second option...',
    user: 'ryan-ramage',
    votes: 100,
    date: 1322151148041,
    'jquery.couch.attachPrevRev' : true
}
{
    _id: 45678,
    type: 'comment',
    question: 12345,
    answer: 23456,
    markdown: 'I really like what you have said, but...',
    user: 'somedude',
    date: 1322151158041,
    'jquery.couch.attachPrevRev' : true
}
To store revisions of each one, I would store the old versions as attachments on the doc being edited. If you use the jquery client for couchdb, you get it for free by adding the jquery.couch.attachPrevRev = true. See Versioning docs in CouchDB by jchris
Create a view like this
fullQuestion : {
    map : function(doc) {
        if (doc.type == 'question') emit([doc._id, null, null], null);
        if (doc.type == 'answer') emit([doc.question, doc._id, null], null);
        if (doc.type == 'comment') emit([doc.question, doc.answer, doc._id], null);
    }
}
And query the view like this
http://localhost:5984/so/_design/app/_view/fullQuestion?startkey=['12345']&endkey=['12345',{},{}]&include_docs=true
(Note: I have not url encoded this query, but it is more readable)
This will get you all of the related documents for the question that you will need to build the page. The only thing is that they will not be sorted by date. You can sort them on the client side (in javascript).
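For example, assuming you keep the date field from the sample docs on every doc:

// sort the fetched rows oldest-to-newest before rendering
results.rows.sort(function(a, b) {
    return a.doc.date - b.doc.date;
});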
EDIT: Here is an alternative option for the view and query
Based on your domain, you know some facts. You know an answer can't exist before a question existed, and a comment on an answer can't exist before an answer existed. So let's make a view that might make it faster to create the display page, respecting the order of things:
fullQuestion : {
    map : function(doc) {
        if (doc.type == 'question') emit([doc._id, doc.date], null);
        if (doc.type == 'answer') emit([doc.question, doc.date], null);
        if (doc.type == 'comment') emit([doc.question, doc.date], null);
    }
}
This will keep all the related docs together, and keep them ordered by date. Here is a sample query
http://localhost:5984/so/_design/app/_view/fullQuestion?startkey=['12345']&endkey=['12345',{}]&include_docs=true
This will get back all the docs you will need, ordered from oldest to newest. You can now zip through the results, knowing that the parent objects will be before the child ones, like this:
function addAnswer(doc) {
    $('.answers').append(answerTemplate(doc));
}
function addCommentToAnswer(doc) {
    $('#' + doc.answer).append(commentTemplate(doc));
}
$.each(results.rows, function(i, row) {
    if (row.doc.type == 'question') displayQuestionInfo(row.doc);
    if (row.doc.type == 'answer') addAnswer(row.doc);
    if (row.doc.type == 'comment') addCommentToAnswer(row.doc);
});
So then you don't have to perform any client-side sorting.
Hope this helps.
