After loading our initial facts into the cube, we load a second file that adds measures to the existing facts (so no new facts are created by the second file). We use a Handler to do this.
When the second file is removed from the filesystem, we would like to remove just the relevant measures from the facts.
Is there a way for us to plug into the Directory/File Watcher mechanism to accomplish this?
You could extend
CSVSource.onFileAction(IFileWatcher watcher, Collection<String> added, Collection<String> modified, Collection<String> deleted)
by calling super.onFileAction(...), which will process the added and modified files as before, and by adding extra logic to handle the deleted files.
This can be done by updating the facts to which a deleted file contributed, identified through a field holding the path of their source file. Such a field can be filled automatically by adding the FILEPATH metadata in your LoadInstructions.csv file:
Format,FilePattern,FilePath,MetaData
FormatName,formatRegex.csv,someFolder,FILEPATH=N/A
and having a field like:
<field name="FILEPATH" type="string" indexation="dictionary" nullable="true" defaultValue="N/A" />
If we understand correctly, and to simplify the use case, your dataset has two measures A and B. For the same records, one file brings measure 'A' and another file brings measure 'B', and you want to freely update or delete the data for measure A or B independently.
There are several ways you can achieve this.
First, you could decouple the measures: instead of records that carry both the A and B fields, you would have two records with a generic "value" field and a "measure type" field to distinguish between the two measure types. This design is flexible because you can later introduce a new measure 'C', itself fed from another file.
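As a rough illustration of that decoupled layout (the field names below, such as trade_id and measure_type, are invented for the example and not taken from your model), the same logical record becomes two rows, shown here as Python dictionaries:

# Coupled design: one record carries both measures.
coupled_record = {"trade_id": "T1", "A": 100.0, "B": 42.0}

# Decoupled design: one record per measure, with a generic value field.
decoupled_records = [
    {"trade_id": "T1", "measure_type": "A", "value": 100.0},
    {"trade_id": "T1", "measure_type": "B", "value": 42.0},
]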
The most elegant option is probably to use the ActivePivot Distributed Architecture with Polymorphic Distribution. You would set up two independent cubes, one holding only the 'A' measure and another holding the 'B' measure, then join the cubes together with polymorphic distribution: ActivePivot will merge them on the fly and present both measures as if they belonged to the same (virtual) cube.
Finally, the quick and dirty solution: configure your measures as 'nullable' fields in ActivePivot. This way, when you want to erase measure 'A', you actually write 'null' to the 'A' field of your records.
Related
I have created a webpage for doctors where doctors, patients, and diagnoses are stored in the database. There is an option to delete any of these entries (e.g. delete a doctor instance), but in order to avoid data loss, I want to archive the data instead of deleting it entirely. I have found 2 options to do this:
1 - Keep an archive collection (e.g. doctors-archive, patients-archive) and move deleted entries there from the original collection.
2 - Keep an isDeleted attribute inside the original collection, so that when an entry is deleted, isDeleted becomes true, and entries with isDeleted=true are not returned when fetching.
Both of these options have drawbacks. The 1st option makes it hard to keep relations: patients and doctors are related, and if one of them is deleted the relation is lost. The second option makes the original collection too heavy, as data is never removed from it.
Is there a better option than these 2 to store archived data? If not, which of these options is better?
It's a very good discussion. I had the same question in one of my projects, and after long research I found a solution that fit my needs well.
First of all, I found that deletion has a bad reputation: developers avoid it to prevent data loss, often wrongly.
Furthermore, deleting the data this way (the first option) has other disadvantages:
It leads to more data structures in your system, because you create new models & collections (e.g., a doctors-archive model next to the doctors model).
It also has a significant impact on the performance of the system: the deletion requires 3 (and maybe more) queries:
find(): to select the object to be deleted.
insert(): to create an archive instance for the object to be deleted.
delete(): to delete the object.
Note that we cannot use the findOneAndDelete() method, since we have to create the archive instance before deleting!
By contrast, the second option can be done with 1 query: findAndModify().
But the best solution is neither the first one nor the second one: it is a combination of both, and consists of cleaning your data after a specific period (every 4 months, for example).
In other words, when deleting an instance you apply the 2nd solution (i.e., isDeleted = true), but after 4 months you delete the instance from the collection and keep a copy in the archive (also called a backup). This prevents the original collection from becoming too heavy.
Note that you can also use a separate backup database, to keep the archived data out of the original database.
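A minimal sketch of this combined approach using pymongo (the database, collection, and field names here are assumptions for illustration, not from the original post):

from datetime import datetime, timedelta
from pymongo import MongoClient

db = MongoClient()["clinic"]  # assumed database name

# Soft delete (option 2): a single query flags the entry.
def soft_delete_doctor(doctor_id):
    db.doctors.find_one_and_update(
        {"_id": doctor_id},
        {"$set": {"isDeleted": True, "deletedAt": datetime.utcnow()}},
    )

# Periodic cleanup job (e.g. every 4 months): move old soft-deleted
# entries to the archive collection, then remove them from the original.
def archive_old_deletions(max_age_days=120):
    cutoff = datetime.utcnow() - timedelta(days=max_age_days)
    old = list(db.doctors.find({"isDeleted": True, "deletedAt": {"$lt": cutoff}}))
    if old:
        db.doctors_archive.insert_many(old)
        db.doctors.delete_many({"_id": {"$in": [d["_id"] for d in old]}})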
"Both of these options have their drawbacks - 1st option makes it hard to keep relations, as the patient and doctor have relations and if one of them is deleted the relation will be lost. The second option makes the original collection too heavy as the data will never be removed from it." => What kind of relationship are you talking about?
If you still need a document after it is archived, there has to be a middle ground; and by the way, in those scenarios the data isn't lost, you still have the power to do a $lookup.
Based on my experience, option 2 has way too many negatives: it often leads to range index scans, and on top of that you carry an extra maintenance burden for the indexes too, because isDeleted becomes part of every index you create.
TL;DR: definitely option 1, and if you need to query archived content then use $lookup.
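For example, assuming archived patients live in a patients_archive collection with a doctorId reference (names invented for illustration), a $lookup could join a doctor to its archived patients like this with pymongo:

from pymongo import MongoClient

db = MongoClient()["clinic"]  # assumed database name

def doctor_with_archived_patients(doctor_id):
    # Join the doctor document with its archived patients via $lookup.
    pipeline = [
        {"$match": {"_id": doctor_id}},
        {"$lookup": {
            "from": "patients_archive",   # assumed archive collection
            "localField": "_id",
            "foreignField": "doctorId",   # assumed reference field
            "as": "archivedPatients",
        }},
    ]
    return list(db.doctors.aggregate(pipeline))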
There are several techniques you can use to manage data growth in MongoDB (a small sketch follows the list):
Using capped collections
Using TTL indexes
Using one collection per month (rename the old one and keep new data only in the new one)
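As an illustration, a TTL index and a capped collection can be set up like this with pymongo (collection and field names are assumptions):

from pymongo import MongoClient

db = MongoClient()["clinic"]  # assumed database name

# TTL index: documents expire 120 days after their deletedAt timestamp.
db.doctors_archive.create_index("deletedAt", expireAfterSeconds=120 * 24 * 3600)

# Capped collection: fixed maximum size, oldest documents are dropped automatically.
db.create_collection("audit_log", capped=True, size=50 * 1024 * 1024)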
I have two Aggregates, 'notebook' and 'note'.
When I apply the rule 'reference other aggregates only by their IDs', I think I have two options:
Notebook(List<NoteId>, [other properties])
Note([other properties])
or
Notebook([other properties])
Note(NotebookId, [other properties])
With the first option, I need two DB calls to show all notes of a notebook (one to get the list and the second to load the notes).
So my current favorite is the second option. Now I have a few options in mind for saving the order of the notes, each of which has some disadvantages.
What is a good approach to solving my problem? Or is the first option better and are the two DB calls negligible?
Can anybody help?
Big THX
It looks like the order of the Notes is important, at least relative to the Notebook, so maybe it should be part of the domain. If so, I would suggest storing it together with the Note, or using some other information of the Note to provide an ordering when a list is loaded.
If not, why is the order relevant? I mean, the two entities have related but separate lifecycles, or at least so it looks: one aggregate, the Notebook, has a list that only references the other, the Note; hence no direct interaction is planned. But, assuming the domain is correctly modelled (there's not enough information to say much about it), somewhere you need an ordered list of Notes. The only way to have it as you need it is to store that information (or use some that is already stored), otherwise the hypothesis (order is relevant) is not valid anymore.
Update after the additional info about the number of Notes and their size:
It looks like your domain is organized in this way:
a root entity, the Notebook, which stores the order of each Note together with only its ID: any change to the order is applied here, not on the Note
another root entity, the Note, with its own lifecycle and its own 'actions' (operations that trigger a change in the entity)
Whenever you load the Notebook, you must also load the Notes and their order to show them correctly ordered. On the other side, when you change the order, this structure lets you have a single action (or operation) on the Notebook, for example changeOrder(NoteId), that updates the order of the given Note and, if needed, changes the order of all the others. The trick here is that when you persist the Notebook you work only with the IDs of the Notes, so you don't have to load the whole entity, just a part of it, update it and save it again. So how big the Note entity is does not matter, because you don't use all of it. Hence, at every change you could trigger an update of all the pairs (NoteId, order) for that Notebook; you can't do it differently. To support this you need a single function in the repository that loads the IDs of the Notes and their order and saves them again; that should not be too expensive.
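A rough sketch of such a changeOrder operation on the Notebook aggregate, in plain Python with invented names, just to make the idea concrete:

class Notebook:
    def __init__(self, notebook_id, note_ids):
        self.id = notebook_id
        self.note_ids = list(note_ids)  # ordered list of Note IDs only

    def change_order(self, note_id, new_position):
        # Move one Note ID; the positions of the others shift accordingly.
        self.note_ids.remove(note_id)
        self.note_ids.insert(new_position, note_id)

# The repository then only needs to persist the (note_id, order) pairs:
def order_pairs(notebook):
    return [(note_id, index) for index, note_id in enumerate(notebook.note_ids)]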
On the other hand, all the actions that operate directly on the Note have to load it, hence you load it entirely. But in this case loading everything and saving everything is required, because you are changing the Note itself.
Anyway, how you persist the order is entirely delegated to the persistence layer, which is built on top of the domain. I mean, the domain just has a Notebook and a set of Notes with order 1, 2, 3, etc.
Even if I don't think this needs such a complex solution, you could use a totally different way to store the order: for example, steps of 100 (so 100, 200, 300, etc.). Each new Note is put in the middle of the two existing ones and is the only one that has to be saved each time. Every once in a while you run a job, or something else, that just normalizes all the values, restoring the 100 steps (or whatever you use to persist the order). As I said, this looks like an overcomplicated solution to the problem, but it also shows that the entities of the domain can be totally different from the persistence ones.
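A small sketch of this gap-based ordering (steps of 100) and the occasional normalization job, again in plain Python with invented names:

STEP = 100

def insert_between(order_before, order_after):
    # The new Note goes in the middle of its two neighbours; only it is saved.
    return (order_before + order_after) // 2

def normalize(orders):
    # Periodic job: restore the 100-step spacing while keeping the current ordering.
    ranked = sorted(orders.items(), key=lambda item: item[1])
    return {note_id: (i + 1) * STEP for i, (note_id, _) in enumerate(ranked)}

orders = {"n1": 100, "n2": 200}
orders["n3"] = insert_between(orders["n1"], orders["n2"])  # 150
orders = normalize(orders)  # {"n1": 100, "n3": 200, "n2": 300}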
Assume I have the following two sets of data. I'm attempting to associate products on hand with their rolled-up tallies. For a roll-up tally you may have products made of multiple categories, with a primary and an alternate category. In a relational database I would load the second set of data into a temporary table and use a stored procedure to iterate through the rollup data, decrementing the quantities until they were zero or I had matched the tallies. I'm trying to implement a solution in Spark/PySpark and I'm not entirely sure where to start. I've attached a possible output solution that I'm trying to achieve, though I recognize there are multiple outputs that would work. (A sketch of the core allocation logic follows the sample data below.)
#Rolled Up Quantities#
owner,category,alternate_category,quantity
ABC,1,4,50
ABC,2,3,25
ABC,3,2,15
ABC,4,1,10
#Actual Stock On Hand#
owner,category,product_id,quantity
ABC,1,123,30
ABC,2,456,20
ABC,3,789,20
ABC,4,012,30
#Possible Solution#
owner,category,product_id,quantity
ABC,1,123,30
ABC,1,012,20
ABC,2,456,20
ABC,2,789,5
ABC,3,789,15
ABC,4,012,10
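For what it's worth, here is a plain-Python sketch of the greedy allocation described in the question: fill from the primary category first, then the alternate, decrementing stock as you go. It reproduces the assignment shown above, but it is only the core logic, not a Spark implementation.

rollup = [  # owner, category, alternate_category, quantity
    ("ABC", 1, 4, 50), ("ABC", 2, 3, 25), ("ABC", 3, 2, 15), ("ABC", 4, 1, 10),
]
stock = {  # (owner, category) -> list of [product_id, remaining_quantity]
    ("ABC", 1): [["123", 30]], ("ABC", 2): [["456", 20]],
    ("ABC", 3): [["789", 20]], ("ABC", 4): [["012", 30]],
}

result = []
for owner, category, alternate, qty in rollup:
    for cat in (category, alternate):  # primary category first, then the alternate
        for product in stock.get((owner, cat), []):
            if qty == 0:
                break
            take = min(qty, product[1])
            if take > 0:
                result.append((owner, category, product[0], take))
                product[1] -= take
                qty -= take

for row in result:
    print(row)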
The scenario is that I have a Projects list and there are a bunch of different SPFieldUser fields associated to it. I have another list representing the Project's Logbook (it contains a bunch of data about different milestones of the project). The relationship is like this: (1 project list item : 1 logbook list).
I have to store some metadata in a logbook's list item that points to a specific user stored in the Project's list item. For that I have to create a choice field which represents the different SPFieldUser fields from the project's list.
The question is: what is the optimal way of representing such a structure?
I can just hard-code a choice option for every SPFieldUser in the Projects list, but then, when I have to reference this data in code, I'll have to somehow transform the choice's value into the internal name of the associated project field.
I can also create a lookup of those fields; this way, accessing it is easy. I can show the Title to the user and have the internal name stored in the lookup.
I was also thinking about defining some kind of custom FieldType, but I feel like it would require far more work than any of the other methods.
So which method do I choose? Can someone probably suggest a better way?
Let's check your options one by one, in terms of effort and scalability.
1. Hard-coding option: high effort [not recommended at all]
- The column needs to be updated whenever a new user joins or a user leaves the company.
- Once the format of the data is specified it is difficult to change (e.g. FirstName+LastName or EmpId).
2. Lookup column (OOTB option, highly recommended): very low effort
- Configurable (please check whether you can still change the format of the user data once it is added as a lookup column).
3. Custom field type: will take coding effort.
My recommendation is the 2nd (OOTB) option. If you find flaws in it, let us know and we can look for a solution.
I'm trying to create an XML schema to describe some aspects of hospitals. A hospital may have 24-hour coverage of: emergency services, operating room, pharmacist, etc. The entire list is relatively short - around 10. Coverage may apply to more than one of these services.
My question is how best to represent this. I'm thinking along the lines of:
<coverage>
  <emergencyServices/>
  <operatingRoom/>
</coverage>
Basically, the services are optional and, if they exist, the coverage is offered by the hospital.
Alternatively, I could have:
<coverage>
  <emergencyServices>true</emergencyServices>
  <operatingRoom>true</operatingRoom>
  <pharmacist>false</pharmacist>
</coverage>
In this case, I require all the elements, but a value of false means that the coverage isn't offered.
There are probably other approaches.
What's the best practice for something like this? And, if I use the first option, what type should the elements be in the schema?
Best practice here really depends on the consumer.
The short and simple rule is that markup is for structure, and content is for data. So having them contain xs:boolean values is generally the best course.
Now, on to the options:
Having separate untyped elements is simple and clear, but some processing systems may have difficulty reading them, because some XML-relational mappers may not see any data in the elements to put into relational tables. If they had values instead, like <emergencyServices>true</emergencyServices>, then the relational table would have a value to hold.
In either case, if you have fixed element names and your consumer uses a system that maps the XML to a database, then every time you add a service a schema change will have to be made.
There are several other ways; each has trade-offs:
Using an element of type xs:string with an enumeration, and allowing multiple occurrences. Then you could have <coverage>emergencyServices</coverage><coverage>operatingRoom</coverage>. This makes adding to the list simpler, but allows duplicates. This scheme does not require database schema changes on the consumer side.
You could use attributes on the <coverage> element. They would have an xs:boolean type, but would still require a schema change whenever a service is added. Of course, this evokes the attribute vs. element argument.
One good resource is Chapter 11 of Effective XML. At least this should be read before making a final decision.