Do you store data in the Delta Lake Silver layer in a normalized format or do you derive it?

I am currently setting up a data lake trying to follow the principles of Delta Lake (landing in bronze, cleaning and merging into silver, and then, if needed, presenting the final view in gold) and have a question about what should be stored in Silver.
For example, suppose the data in bronze comes from a REST API and is stored as the JSON it arrives in, with this format:
id (Int)
name (String)
fields (Array of Strings)
An example looks like:
{
  "id": 12345,
  "name": "Test",
  "fields": ["Hello", "this", "is", "a", "test"]
}
In the end I want to present this as two tables. One would be the base table and look like:
TABLE 1
| id | name |
| -------- | -------------- |
| 12345 | Test |
And another would look like:
TABLE 2
| id | field_value |
| -------- | -------------- |
| 12345 | Hello |
| 12345 | this |
| 12345 | is |
| 12345 | a |
| 12345 | test |
My question is, should I pre-process the data in Spark and store the data in silver in separate folders like this:
-- root
---table 1
----file1.parquet
----etc.parquet
---table 2
----file1.parquet
----etc.parquet
Or should I store it all in silver under one folder and then derive those two tables later using T-SQL and functions like OPENJSON?
Thank you for your help or insight!

I do not think there is one right answer to your question, but here is a stab at it, based on your explicit example and this reference: https://k21academy.com/microsoft-azure/data-engineer/delta-lake/
My question is, should I pre-process the data in Spark and store the
data in silver in separate folders like this: ...
Yes, I would, as JSON takes more time to process. On a current project I use JSON for the RAW area if the data arrives in that format, and in the REFined area we store arrays where needed, as opposed to JSON structs. But this is because we use a data hub approach based on the Distributed Data Mesh described on Martin Fowler's site. We have a BUSiness area where we model the data according to a semantic model.
But for every expert there is an equal and opposite expert. Some would say do it on the fly, like SAP HANA's on-the-fly ETL.
For datasets handed to data scientists, or for ad hoc analysis, the second approach is fine, and the data could stay in the Bronze zone. That said, GDPR requirements could mean refining the data into the Silver zone with the GDPR-sensitive parts removed.
In short, it depends on your use case.
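The pre-processing option can be sketched in plain Python to show the shape of the split. This is only an illustration, assuming records shaped like the example in the question; the function name `normalize` is made up here, and in Spark the same step would typically be written with `explode` on the `fields` column before writing each result to its own folder.

```python
import json


def normalize(records):
    """Split raw records into a base table and a child table of field values."""
    base_rows = []
    field_rows = []
    for rec in records:
        # One row per record for TABLE 1.
        base_rows.append({"id": rec["id"], "name": rec["name"]})
        # One row per array element for TABLE 2.
        for value in rec.get("fields", []):
            field_rows.append({"id": rec["id"], "field_value": value})
    return base_rows, field_rows


# Sample record taken from the question.
raw = '{"id": 12345, "name": "Test", "fields": ["Hello", "this", "is", "a", "test"]}'
table_1, table_2 = normalize([json.loads(raw)])
```

Here `table_1` holds the id/name rows and `table_2` holds one row per array element, matching the two target tables in the question.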

Related

Is Ambiguous entity identified with input context?

My question looks similar to this one, but it is different because it concerns the combination of entities and context.
Let me show an example; here duplicated synonyms are:
| Entity | Value | Synonyms |
|-------------+-------------+--------------|
| whether | whether | fine, cloudy |
| granularity | granularity | fine, coarse |
And I have the same training phrase in different intents with different input contexts, as follows:
Intent-a:
input context: A
training phrase: is it <fine>?
synonym <fine> is for #whether
Intent-b:
input context: B
training phrase: is it <fine>?
synonym <fine> is for #granularity
When a user says "is it fine?" under context 'A', is fine identified as #whether? Or is the input context not considered during intent detection, so that it is fifty-fifty which intent is detected?
Context is considered when identifying intents. Therefore, if you have properly defined the contexts for the scenario above, no ambiguity arises.
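As a toy model of that behaviour (not the actual matching algorithm or API of any platform), the sketch below filters candidate intents by active input context before comparing training phrases. The `Intent` class, its fields, and `detect_intent` are all hypothetical names invented for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    name: str
    input_context: str
    training_phrase: str
    entity_for_fine: str  # which entity the synonym "fine" resolves to


# The two intents from the question.
INTENTS = [
    Intent("Intent-a", "A", "is it fine?", "whether"),
    Intent("Intent-b", "B", "is it fine?", "granularity"),
]


def detect_intent(utterance, active_contexts):
    """Return intents whose phrase matches and whose input context is active."""
    normalized = utterance.strip().lower()
    return [
        intent
        for intent in INTENTS
        if intent.training_phrase == normalized
        and intent.input_context in active_contexts
    ]


matches_a = detect_intent("Is it fine?", {"A"})
matches_b = detect_intent("Is it fine?", {"B"})
```

With only context A active, the identical phrase in Intent-b is never a candidate, so "fine" resolves to a single entity.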

Mapping UML Class Diagram to Python Code

I've been asked to document a piece of code using UML diagrams. The code models a situation like the following: a driver can be assigned to one or more routes. Each route has an upstream and a downstream direction. For each route the driver can drive in the upstream direction and/or in the downstream direction.
A simplified pseudo-code version of the Driver class is the following:
class Driver:
HashMap<Route, Direction> upstream;
HashMap<Route, Direction> downstream;
HashMap<Route, Direction> assignedTo;
where the assignedTo map is actually a property returning a hashmap composed of the routes where the driver is assigned to both the upstream and downstream directions (think of it as a view on the other two hashmaps)
So far I've come up with the following UML representation.
-----------                              ---------
|  CLASS  |        (assignedTo)          | CLASS |
| DRIVER  |------------------------------| ROUTE |
-----------  *           |            *  ---------
                         |
                   -------------
                   |   CLASS   |
                   | DIRECTION |
                   -------------
                     ^       ^
                     |       |
            ------------   --------------
            |  CLASS   |   |   CLASS    |
            | UPSTREAM |   | DOWNSTREAM |
            ------------   --------------
However, I'm a little puzzled by the fact that in the UML I'm using inheritance while the code uses no inheritance. What do you think?
I've changed it a little, but here is another sample of mine. I am not sure I understand the pseudo-code correctly, but the case where a driver is assigned to both directions could be a problem. In my opinion, my sample diagram would also be easier to implement.
Regarding inheritance, the answer depends on what this UML is for: to show how to implement the code, or to explain the concepts. If the latter, there is no problem with using inheritance.
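For comparison, the pseudo-code from the question could be written in Python without any inheritance by modelling the direction as an enum. This is a sketch under assumptions: `Route` is reduced to a name, `assign` is a hypothetical helper, and `assigned_to` is read here as a merged view over the two collections (a route held in both appears with both directions), which is only one reading of the question.

```python
from dataclasses import dataclass, field
from enum import Enum


class Direction(Enum):
    UPSTREAM = "upstream"
    DOWNSTREAM = "downstream"


@dataclass(frozen=True)
class Route:
    name: str


@dataclass
class Driver:
    name: str
    upstream: set = field(default_factory=set)
    downstream: set = field(default_factory=set)

    def assign(self, route, direction):
        """Record that this driver drives `route` in `direction`."""
        if direction is Direction.UPSTREAM:
            self.upstream.add(route)
        else:
            self.downstream.add(route)

    @property
    def assigned_to(self):
        """Derived view: each assigned route mapped to its set of directions."""
        view = {}
        for route in self.upstream:
            view.setdefault(route, set()).add(Direction.UPSTREAM)
        for route in self.downstream:
            view.setdefault(route, set()).add(Direction.DOWNSTREAM)
        return view


driver = Driver("driver-1")
r1, r2 = Route("R1"), Route("R2")
driver.assign(r1, Direction.UPSTREAM)
driver.assign(r1, Direction.DOWNSTREAM)
driver.assign(r2, Direction.UPSTREAM)
```

In a diagram meant to document this implementation, Direction would then be an enumeration rather than a class hierarchy.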

Multi dimensional Scenario Outlines in Specflow

I'm creating a Scenario Outline similar to the following one (it is a simplified version but gives a good indication of my problem):
Given I have a valid operator such as 'MyOperatorName'
When I provide a valid phone number for the operator
And I provide an '<amount>' that is of the following '<type>'
And I send a request
Then the following validation message will be displayed: 'The Format of Amount is not valid'
And the following Status Code will be received: 'AmountFormatIsInvalid'
Examples:
| type | description | amount |
| Negative | An amount that is negative | -1.0 |
| Zero | An amount that is equal to zero | 0 |
| ......... | .......... | .... |
The Examples table provides the test data that I need, but I would like to add another Examples table with just the names of the operators (instead of MyOperatorName), in order to replicate the tests for different operators:
Examples:
| operator |
| op_numb_1 |
| op_numb_2 |
| op_numb_3 |
This would avoid repeating the same scenario outline three times. I know that this is not possible, but I'm wondering what the best approach is to avoid having three scenario outlines in the feature that are pretty much the same apart from the operator name.
I know that I can reuse the same step definitions, but I'm trying to understand whether there is a best practice to prevent cluttering the feature with scenarios that are too similar.
Glad you know this isn't possible...
So what options are there?
Seems like there are 5:
a. Make a table with every option (the cross product)
Examples:
| type | description | amount | operator |
| Negative | An amount that is negative | -1.0 | op_numb_1 |
| Zero | An amount that is equal to zero | 0 | op_numb_1 |
| Negative | An amount that is negative | -1.0 | op_numb_2 |
| Zero | An amount that is equal to zero | 0 | op_numb_2 |
| ......... | .......... | .... | ... |
b. Repeat the scenario for each operator, with a table of input rows
- but you said you didn't want to do this.
c. Repeat the scenario for each input row, with a table of operators
- I like this option, because each rule is a separate test. If you really, really want to ensure that every different implementation of your "operator" strategy passes and fails in the same validation scenarios, then why not write each validation scenario as a single Scenario Outline: e.g.
Scenario Outline: Operators should fail on Negative inputs
Given I have a valid operator such as '<operator>'
When I provide a valid phone number for the operator
And I send a request with the amount "-1.0"
Then the following validation message will be displayed: 'The Format of Amount is not valid'
And the following Status Code will be received: 'AmountFormatIsInvalid'
Examples:
| operator |
| op_numb_1 |
| op_numb_2 |
| op_numb_3 |
Scenario Outline: Operators should fail on Zero inputs
...etc...
d. Rethink how you are using SpecFlow - if you only need KEY examples to illustrate your features (as described in Specification by Example by Gojko Adzic), then you are overdoing it by checking every combination. If, however, you are using SpecFlow to automate your full suite of integration tests, then your scenarios could be appropriate... but you might want to think about e.
e. Write integration / unit tests based on the idea that your "operator" validation logic is applied only in one place. If the validation is the same on each operator, why not test it once, and then have all the operators inherit from or include in their composition the same validator class?
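Option e can be sketched as follows, in Python rather than the C# a SpecFlow project would use. Everything here is hypothetical: the `AmountValidator` and `Operator` classes, the rule that an amount must be greater than zero, and the "OK" result are assumptions made for illustration; only the operator names, the sample amounts, and the status code come from the question.

```python
from decimal import Decimal, InvalidOperation
from itertools import product


class AmountValidator:
    """Single home for the amount rule (assumed here: amount must be > 0)."""

    def validate(self, amount):
        try:
            return "OK" if Decimal(amount) > 0 else "AmountFormatIsInvalid"
        except InvalidOperation:
            # Not parseable as a number at all.
            return "AmountFormatIsInvalid"


class Operator:
    """Each operator composes the shared validator instead of reimplementing it."""

    def __init__(self, name, validator):
        self.name = name
        self.validator = validator

    def send_request(self, amount):
        return self.validator.validate(amount)


validator = AmountValidator()
operators = [Operator(n, validator) for n in ("op_numb_1", "op_numb_2", "op_numb_3")]
invalid_amounts = ["-1.0", "0"]

# Cross product of operators and inputs, generated instead of written by hand.
results = {
    (op.name, amount): op.send_request(amount)
    for op, amount in product(operators, invalid_amounts)
}
```

Because the rule lives in one class, testing `AmountValidator` once covers the validation cases, and per-operator scenarios only need to show that each operator is wired to it.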

SpecFlow/Cucumber/Gherkin - Using tables in a scenario outline

Hopefully I can explain my issue clearly enough for others to understand. Imagine I have the following two hypothetical scenarios:
Scenario: Filter sweets by king size and nut content
Given I am on the "Sweet/List" Page
When I filter sweets by
| Field | Value |
| Filter.KingSize | True |
| Filter.ContainsNuts | False |
Then I should see :
| Value |
| Yorkie King Size |
| Mars King Size |
Scenario: Filter sweets by make
Given I am on the "Sweet/List" Page
When I filter sweets by
| Field | Value |
| Filter.Make | Haribo |
Then I should see :
| Value |
| Starmix |
These scenarios are useful because I can add as many When rows of Field/Value and Then Value entries as I like without changing the associated compiled test steps. However, copy/pasting scenarios for different filter tests will become repetitive and take up a lot of code - something I would like to avoid. Ideally I would like to create a scenario outline and keep the dynamic nature I have with the tests above. However, when I try to do that I run into a problem defining the Examples table: I can't add new rows as I see fit, because each new row would be a new test instance. Currently I have this:
Scenario Outline: Filter Sweets
Given I am on the <page> Page
When I filter sweets by
| Field | Value |
| <filter> | <value> |
Then I should see :
| Output |
| <output> |
Examples:
| page | filter | value | output |
| Sweet/List | Filter.Make | Haribo | Starmix |
So my problem is being able to dynamically add rows to my filter and expected data when using a scenario outline. Is anyone aware of a way around this? Should I be approaching this from a different angle?
A workaround could be something like :
Then I should see :
| Output |
| <x> |
| <y> |
| <z> |
Examples:
| x | y | z |
But that's not very dynamic... I'm hoping for a better solution. :)
I don't think what you're asking for is possible with SpecFlow, Gherkin, and out-of-the-box Cucumber. I can't speak for the authors, but I bet it is purposely not meant to be used this way, because it goes against the overall "flow" of writing and implementing these specs. Among many things, the specs are meant to be readable to non-programmers, to give the programmer a guide to implement code that matches the specs, to serve for integration testing, and to give a certain amount of flexibility when refactoring.
I think this is one of the situations where the pain you're feeling is a sign that there's a problem, but it may not be the one you think. You said:
"However copy/pasting scenarios for different filter tests will become repetitive and take up alot of code - something I would like to avoid. "
First, I'd disagree that explaining yourself in writing is "repetitive," at least any more than it's repetitive to use specific words like "the, apple, car, etc." over and over again. The issue is: Are these words properly explaining what you're doing? If they are, and explaining your situation requires you to write out multiple scenarios, then that's just what it requires. Communication requires words, and sometimes the same ones.
In fact, what you call "repetitive" is one of the benefits of using Gherkin and a tool like Cucumber or SpecFlow. If you're able to use that sentence over and over and over and over, it means you're not having to write the test code over and over and over and over.
Second, are you sure you're writing a spec for the right thing? I ask only because if the number of scenarios gets out-of-hand, to the point where you have so many that a human can't follow what you write, it's possible that your spec isn't targeted at the right thing.
A possible example of this could be how you're testing the filtering and the pagination in this scenario. Yes, you want your specs to cover full features and your site will have pagination on the same page as your filtering, but at what cost? It takes experience and practice to know when giving up on the supposed "ideal" of no-mocking, full-integration tests will yield better results.
Third, don't think that specs are meant to be perfect coverage for every possible scenario. The scenarios are basically snapshots of state, which means that there are some features that could cover an infinitely-large set of scenarios, which is impossible. So what do you do? Write features that tell the story as best you can. Even let the story drive the development. However, details that don't translate to your specs or other cases are best left to straight-up TDD, done in addition to the specs.
In your example, it seems that you basically are telling a story about a site that lets a user create a dynamic search against sweets and candy. They enter one of a large set of possible search criteria, click a button, and get results. Just stick to that, writing only enough specs to fulfill the story. If you're not satisfied with your coverage, clean it up with more specs or unit tests.
Anyway, that's just my thoughts, hope it helps.
Technically, I think you could try calling steps from within a step definition (see "Calling Steps from Step Definitions").
For example I think you could rewrite the
Then I should see :
| Output |
| <output> |
To be a custom step like
I should have output that contains <output>
Where output is a comma separated list of expected values. In the custom step you could break the comma separated list into an array and iterate over it calling
Then "I should see #{iterated_value}"
You could use a similar technique to pass in lists of filters and filter values. Your example row for the king size test might look like
| page | filter | value | output |
| Sweet/List | Filter.KingSize, Filter.ContainsNuts | True, False | Yorkie King Size, Mars King Size |
Or maybe
| page | filter-value-pairs | output |
| Sweet/List | Filter.KingSize:True, Filter.ContainsNuts:False | Yorkie King Size, Mars King Size |
That being said, you should perhaps take Darren's words to heart. I'm not really sure that this method would help the ultimate goal of having scenarios that are readable by non-developers.
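The comma-separated cells suggested above would need to be split inside the custom step. A minimal sketch of that parsing in Python (the answer's own snippet is Ruby; the helper names `parse_list` and `parse_pairs` are hypothetical, and the `:` separator is taken from the second example row):

```python
def parse_list(cell):
    """Split a comma-separated table cell into trimmed, non-empty values."""
    return [item.strip() for item in cell.split(",") if item.strip()]


def parse_pairs(cell):
    """Split 'Field:Value, Field:Value' into a dict of filter -> value."""
    pairs = {}
    for item in parse_list(cell):
        key, _, value = item.partition(":")
        pairs[key.strip()] = value.strip()
    return pairs


filters = parse_pairs("Filter.KingSize:True, Filter.ContainsNuts:False")
expected = parse_list("Yorkie King Size, Mars King Size")
```

The step definition would then apply each entry in `filters` and check each value in `expected`. Note that this simple split breaks if a sweet's name itself contains a comma.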

DDD: Help me further understand Value Objects and Entities

There are several questions on this, and reading them isn't helping me. In Eric Evans's DDD book, he uses the example of an address being a value type in certain situations. For a mail order company, the address is a value type because it doesn't really matter whether the address is shared or who else lives there, simply that the package arrives at the address.
This makes sense to me until I start thinking about how this would be designed. Given the diagram on page 99, he has it like this:
+------------+
|Customer |
+------------+
|customerId |
|name |
|street |
|city |
|state |
+------------+
This changes to:
+------------+
|Customer | (entity)
+------------+
|customerId |
|name |
|address |
+------------+
+------------+
|Address | (value object)
+------------+
|street |
|city |
|state |
+------------+
If these were tables, Address would have its own Id in order to have a relationship with the customer, turning it into an entity.
Is the idea that in a relational database these would stay in the same table, such as in the first example, and that you'd use features of the ORM to abstract address as a value object (such as nHibernate's component features)?
I realize that a couple of pages later he talks about denormalization, I'm just trying to make sure I'm understanding the concept correctly.
When Eric Evans talks about "entities have identity, Value Objects do not", he's not talking about an ID column in the database - he's talking about identity as a concept.
VOs have no conceptual identity. That doesn't mean that they shouldn't have persistence identity. Don't let persistence implementation cloud your understanding of Entities vs VOs.
You can either create a separate table for Address or keep it in the same table as Customer.
Is the idea that in a relational
database these would stay in the same
table, such as in the first example,
and that you'd use features of the ORM
to abstract address as a value object
(such as nHibernate's component
features)?
Yes, generally, that is the idea.
Alternatively (if your ORM doesn't support Value Objects directly), you can let the VO tables have an ID, but hide that within your domain model.
I personally don't give a damn about having an ID on value objects, as long as they override equality comparison properly (because value objects differ by their value, not their identity).
Mapping value objects to the database is a technical concern. Sometimes you just need to sacrifice a bit of domain-model purity (e.g. marking properties virtual so the ORM can get underneath them), or make your infrastructure smarter, for example by using NHibernate components.
Yes, generally Address would stay in the same table. Address would be mapped something like this:
+-----------------+
|Customer |
+-----------------+
|customerId |
|name |
|address_street |
|address_city |
|address_state |
+-----------------+
If Address were an entity, then it would be in a separate table, as you said. If two Customers were linked to the same Address entity, then changing an attribute of that Address would affect both Customers. With a VO implementation, however, a change would only affect one or the other.
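To make the distinction concrete, here is a small Python sketch (Python stands in for whatever language the domain model is written in). The class layout and the `to_row` helper are assumptions for illustration; only the attribute and column names come from the diagrams above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Address:
    """Value object: immutable, compared by its values, no ID of its own."""
    street: str
    city: str
    state: str


class Customer:
    """Entity: compared by identity (customer_id), not by attribute values."""

    def __init__(self, customer_id, name, address):
        self.customer_id = customer_id
        self.name = name
        self.address = address

    def __eq__(self, other):
        return isinstance(other, Customer) and self.customer_id == other.customer_id

    def __hash__(self):
        return hash(self.customer_id)

    def to_row(self):
        """Flatten the embedded value object into columns of the Customer table."""
        return {
            "customerId": self.customer_id,
            "name": self.name,
            "address_street": self.address.street,
            "address_city": self.address.city,
            "address_state": self.address.state,
        }


home = Address("1 Main St", "Springfield", "IL")
same_home = Address("1 Main St", "Springfield", "IL")
alice = Customer(1, "Alice", home)
alice_renamed = Customer(1, "Alice B.", same_home)
```

Two `Address` instances with the same values are interchangeable, while two `Customer` objects are the same customer only if their IDs match, whatever their other attributes say.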
