Is there a standard pattern for invoking related pipelines in Kiba ETL? [closed]

I'm working on an ETL pipeline with Kiba which imports into multiple, related models in my Rails app. For example, I have records which have many images. There might also be collections which contain many records.
My data will come from various sources, including HTTP APIs and CSV files. I would like to make the pipeline as modular and reusable as possible, so that for each new type of source I only have to create the source, and the rest of the pipeline definition stays the same.
Given multiple models in the destination, and possibly several API calls to get the data from the source, what's the standard pattern for this in Kiba?
I could create one pipeline where the destination is 'the application', with responsibility for all of these models, but this feels like the wrong approach: the destination would then be responsible for saving data across different Rails models, uploading images, etc.
Should I create one master pipeline which triggers more specific ones, passing in a specific type of data (e.g. image URLs for import)? Or is there a better approach than this?
Thanks.

Kiba author here!
It is natural and common to look for some form of genericity, modularity and reusability in data pipelines. I would say, though, that as with regular code, it can be hard at first to figure out the correct way to get there (it will depend quite a bit on your exact situation).
This is why my recommendation would be instead to:
Start simple (on one specific job)
Very important: make sure to implement end-to-end automated tests (use webmock or similar to stub out API requests and make the tests completely isolated; create tests that push 1 row from source to destination). This will make it easy to refactor things later.
Once you have that (1 pipeline with tests), you can start implementing a second one, and refactor to extract interesting patterns as reusable bits, and iterate from there
Depending on your exact situation, maybe you will extract specific components, or maybe you will end up extracting a whole generic job, or generic families of jobs etc.
This approach works well even as you get more experience working with Kiba (this is how I gradually extracted the components that you will find in kiba-common and kiba-pro, too).
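To make this concrete, here is a minimal sketch of what that first job plus an end-to-end test could look like. The class names, the API endpoint and the minitest/webmock choices are illustrative assumptions on my part, not something Kiba prescribes:

    # import_records_job.rb (illustrative)
    require 'kiba'
    require 'json'
    require 'net/http'

    # A source reading rows from an HTTP API (one of the source types mentioned above).
    class ApiSource
      def initialize(url)
        @url = url
      end

      def each
        JSON.parse(Net::HTTP.get(URI(@url))).each { |row| yield row }
      end
    end

    # A destination appending rows to an array supplied by the caller; handy for
    # tests, whereas a real destination would save ActiveRecord models instead.
    class ArrayDestination
      def initialize(rows)
        @rows = rows
      end

      def write(row)
        @rows << row
      end
    end

    module ImportRecordsJob
      module_function

      def setup(source_url, rows)
        Kiba.parse do
          source ApiSource, source_url
          transform { |row| { title: row.fetch('title') } }
          destination ArrayDestination, rows
        end
      end
    end

    # import_records_job_test.rb - end-to-end test: 1 row, API stubbed with webmock
    require 'minitest/autorun'
    require 'webmock/minitest'

    class ImportRecordsJobTest < Minitest::Test
      def test_one_row_travels_from_source_to_destination
        stub_request(:get, 'https://api.example.com/records')
          .to_return(body: [{ 'title' => 'First record' }].to_json)

        rows = []
        job = ImportRecordsJob.setup('https://api.example.com/records', rows)
        Kiba.run(job)

        assert_equal [{ title: 'First record' }], rows
      end
    end

Once a second pipeline appears, the reusable bits (here the transform and destination) are the kind of thing you would extract and share between jobs.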

Related

Agile-methodology in Project and Query-Driven methodology in Cassandra? [closed]

We want to start a new project. Our DB will be Cassandra; and we do our project in a scrum team, based on agile.
My question is about change, which is one of the most important issues, and which agile is supposed to handle:
Agile software development teams embrace change, accepting the idea that requirements will evolve throughout a project. Agilists understand that because requirements evolve over time, any early investment in detailed documentation will only be wasted.
But we have:
Changes to just one of these query requirements will frequently warrant a data model change for maximum efficiency.
from the article Basic Rules of Cassandra Data Modeling.
How can we reconcile these two rules? The first embraces change, but the second requires us to know, up front, every query our project will answer. New requirements lead to new queries, which will change our DB model, and that will affect quality (throughput).
The first rule does not suggest you accept changes easily, just that you accept that changes to requirements will be a fact of life. I.e., you need to decide how to deal with that, rather than try to ignore it or require sign-off on final requirements up front.
I'd suggest you make it part of your 'definition of done' (what you agree a piece of code must meet to be considered complete within a sprint) to include the requirements for changes to your DB code. This may mean changes to this code get higher estimates so you can complete the work in the sprint. In this way you are open to change, and have a plan to make sure it doesn't disrupt your work.
Consider the ways in which you can reduce the impact of a database change.
One good way to do this will be to have automated regression tests that cover the functionality that relies on the database. It will also be useful to have the database schema built regularly as a part of continuous integration. That then helps to remove the fear of refactoring the data model and encourages you to make the changes as often as necessary.
The work cycle then becomes:
Developer commits new code and new data model
Continuous integration tears down the test database
Continuous integration creates a new database based on the new data model
Continuous integration adds in some appropriate dummy data
Continuous integration runs a suite of regression tests to ensure nothing has been broken by the changes.
Team continues working with the confidence that nothing is broken
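As a rough sketch of the teardown and rebuild steps of that cycle, a schema-rebuild task might look like the following in Ruby with the cassandra-driver gem (the keyspace name, the db/schema/*.cql layout and the environment variable are invented for illustration; any language or migration tool can fill the same role):

    # Rakefile (sketch)
    require 'cassandra' # cassandra-driver gem

    namespace :test_db do
      desc 'Tear down and rebuild the test keyspace from versioned schema files'
      task :rebuild do
        cluster = Cassandra.cluster(hosts: [ENV.fetch('CASSANDRA_HOST', '127.0.0.1')])
        session = cluster.connect

        # Drop and recreate the keyspace so every CI run starts clean.
        session.execute('DROP KEYSPACE IF EXISTS myapp_test')
        session.execute("CREATE KEYSPACE myapp_test WITH replication = " \
                        "{'class': 'SimpleStrategy', 'replication_factor': 1}")
        session = cluster.connect('myapp_test') # reconnect inside the new keyspace

        # Apply the schema kept under version control, one CQL statement at a time.
        Dir.glob('db/schema/*.cql').sort.each do |file|
          File.read(file).split(';').map(&:strip).reject(&:empty?).each do |cql|
            session.execute(cql)
          end
        end
      end
    end

CI then loads the dummy data and runs the regression suite against the freshly built keyspace.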
You may feel that writing automated tests and configuring continuous integration is a big commitment of time and resources. But think of the payoff in terms of how easily you can accept change during the project and in the future.
This kind of up-front investment in order to make change easier is a cornerstone of the agile approach.

Adding records to VSAM DATASET [closed]

I have some confusion regarding VSAM as I am new to it. Do correct me where I am wrong and answer my queries.
A cluster contains control areas, and a control area contains control intervals. One control interval contains one dataset. Now, when defining a cluster, we specify a data component and an index component; the name we give the data component creates a dataset, and the name of the index generates a key. My queries are as follows:
1) If I have to add a new record to that dataset, what is the procedure?
2) What is the procedure for creating a new dataset in a control area?
3) How do I access a dataset and a particular record after they are created?
I tried to find a simple example but was unable to, so kindly explain with one.
One thing that is going to help you is the IBM Redbook VSAM Demystified: http://www.redbooks.ibm.com/abstracts/sg246105.html which, these days, you can even get on your smartphone, amongst several other ways.
However, your current understanding is a bit astray so you'll need to drop all of that understanding first.
There are three main types of VSAM file, and you'll probably only come across two of them as a beginner: KSDS and ESDS.
KSDS is a Key Sequenced Data Set (an indexed file) and ESDS is an Entry Sequenced Data Set (a sequential file but not a "flat" file).
When you write a COBOL program, there is little difference between using an ESDS and a flat/PS/QSAM file, and not even that much difference when using a KSDS.
Rather than providing an example, I'll refer you to the Enterprise COBOL Programming Guide for your release of COBOL. It is Chapter 10 you want, up to and including the section on handling errors, and the publication can be found here: http://www-01.ibm.com/support/docview.wss?uid=swg27036733. You can also use the Language Reference for the details of what you can use with VSAM once you have a better understanding of what it is to COBOL.
As a beginning programmer, you don't have to worry about what the structure of a VSAM dataset is. However, you've had some exposure to the topic, and taken a wrong turn.
VSAM datasets themselves can only exist on disk (what we often refer to as DASD). They can be backed-up to non-DASD, but are only directly usable on DASD.
They consist of Control Areas (CAs), which you can regard as just being a lump of DASD; almost exclusively, that lump of DASD will be one Cylinder (30 Tracks on a 3390, which these days is very likely an emulated 3390). You won't need to know much more about CAs. CAs are more of a conceptual thing than an actual physical thing.
Control Intervals (CI) are where any data (including index data) is. CIs live in CAs.
Records, the things you will have in the FILE SECTION under an FD in a COBOL program, will live in CIs.
Your COBOL program needs to know nothing about the structure of a VSAM dataset. COBOL uses VSAM Access Method Services (AMS) to do all VSAM file accesses; as far as your COBOL program is concerned, it is an "indexed" file with a little bit on the SELECT statement to say that it is a VSAM file. Or it is a sequential file with a little... you know by now.

Tired of web development, looking for extreme ways to avoid boilerplate code - a meta-framework? [closed]

Having written quite a few apps with modern web frameworks (Symfony2, Laravel, Rails, expressjs, angularjs...) I can't help but think that about 90% of the time I spend developing is spent on writing CRUDs. Then I spend 10% of the time doing the interesting part: basically defining how models should behave.
This is demeaning; I want to reverse the ratio.
The above mentioned frameworks almost all go out of their way to make the boilerplate tasks easier, but still, it is not as easy as I feel it should be, and developer brain time should be devoted to application logic, not to writing boilerplate code.
I'll try to illustrate the problem with a Rails example:
Scaffold a CRUD (e.g. rails generate scaffold user name:string email:string password:string)
Change a few lines in the scaffolded views (maybe the User needs a Role, but scaffold doesn't know how to handle it)
... do other things ...
Realize you wanted to use twitter bootstrap after all, add the most magical gems I can find to help me and...
Re-scaffold my User CRUD
Re-do the various edits I performed on the views, now that they've been overridden by scaffold
...
And this will go on and on for a while.
It seems to me that most magic tools such as rails generate will only help you with initial setup. After that, you're on your own. It's not as DRY as it seems.
I'll be even more extreme: I, as a developer, should be able to almost build an entire project without worrying about the UI (and without delegating the task to someone else).
If in a project I need Users with Roles, I would like to be able to write just, say, a .json file containing something along the lines of:
{
  "Schema": {
    "User": {
      "name": "string",
      "email": "email (unique)",
      "password": "string",
      "role": "Role"
    },
    "Role": {
      "name": "string (unique)"
    }
  }
}
I would then let the framework create the database tables, corresponding views and controllers. If I want to use bootstrap for the UI, there would be a setting to toggle this in the master .json file. And at no point would I edit a single view. If later I add a field to User or want to change the UI style, I just edit the master .json file.
This would open the way for a creative branch of UX design. I could, for instance, assign an importance flag to each field of the User model and let a clever designer write a plugin that designs forms whose layout is optimized by the relative importance of fields. This is not my job, and it is not UX designers' job to rewrite the same thing 100 times over for different projects; they should write "recipes" that work on general, well-specified cases.
I have a feeling that the MVC pattern, with all its virtues, has decoupled the view from the model too much. How many times have you had to write validation code twice: once server-side and once client-side, because you wanted richer feedback? There is much information to be gotten from the model, and you should be able to get client-side validation just by setting a property on the model telling the framework to do so.
You may say that Rails scaffold is close to what I'm imagining, sure. But scaffolding is only good at the beginning of the project. Even with scaffolding, I need to rewrite many things to, say, only allow an Admin to change a User's role.
The ideal framework would provide me with a simple hook to define whether the action of changing the Role field on a User is allowed or not, and the UI would automagically display the error to the user if I return it properly from the hook.
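Purely as an illustration of that idea (no existing framework that I know of provides exactly this), such a hook could be as small as:

    # Hypothetical hook the imagined meta-framework would call before persisting
    # a changed field. Returning a string makes the generated UI display it as an
    # error; returning nil allows the change.
    class UserHooks
      def can_change_field?(current_user, _record, field, _new_value)
        if field == :role && !current_user.admin?
          'Only an administrator may change a user role.'
        end
      end
    end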
Writing a User system should only take a few minutes. Even with things like devise for Rails or FOSUserBundle for Symfony2 it takes a huge, unnecessary amount of configuring and tuning.
The meta-framework I have in mind could be, in theory, used to generate the boilerplate code for any modern web framework.
I would like my development workflow to look like this:
Create app.json file defining the models and their relationships, + the hooks I want to specialize
Run a command like generate_app app.json --ui=bootstrap --forms=some_spectacular_plugin --framework=rails4
Implement the hooks I need
Done.
The resulting app would then update itself whenever app.json changes, and adding entities, fields, custom logic should never be harder than writing a few lines of JSON and implementing the interesting parts of the logic in whatever the target language is.
I strongly believe that the vast number of frameworks out there address the wrong question: how to write little bits of unconnected code more efficiently. Frameworks should ask: what is an application and how can I describe it?
Do you know of any projects that would be going in this direction? Any pointers to literature on this?
I have been confronted a couple of times with similar development tiredness: boilerplate over and over. For me, boilerplate is everything that does not bring any added business value (static: project setup; contextual: CRUD (backend, frontend), drop-down lists, sub-use-cases for assignment, etc.).
The approach you present is command-line generation of Rails scaffold artifacts, which has the following pitfalls: it is incomplete and brittle to maintain.
Generation consists of producing the same information (redundant info) across different types of artifacts (DB storage, presentation layer, persistence layer, etc.).
Furthermore, consecutive generations override your changes.
To solve this inconvenience, I see only two solutions:
Do not use a generator, but rather a generic framework that manages persistence and presentation aspects in one central place. In Java there is Openxava, which is designed for that. It works with annotations (persistence and presentation), and it also answers your validation question with stereotypes.
Use an updatable-generated-code generator.
Minuteproject has an updatable-generated-code feature: you can modify the generated parts, and the next generation keeps your modifications.
Meanwhile, neither of those solutions matches your technology target (Rails).
Minuteproject can be extended to generate for any text-based language (Java, Ruby, Python, C, JS, ...).
Minuteproject generates from DB structures, query statements, and transfer-object definitions (see the productivity facet for analysts).
Remark: Minuteproject also provides a reverse-engineering solution for Openxava.
Regarding the sequence you propose:
For "create model, relationships + hooks", I would rather consider the DB model as the central place and add the hooks (not yet present) during model enrichment (Minuteproject proposes a dedicated step to enrich the model with conventions...).
I would rather go for a reverse-engineering solution than forward engineering, for the following reasons:
Correct DB storage is too crucial to be generated:
Forward engineering cannot generate views, stored procedures, functions, etc.
Forward engineering may not tune your persistence model correctly (tablespaces).
Generate by picking your technology target.
Implement the hooks in updatable-generated-code sections, so that at the next generation (when the model structure has changed and new hooks are to be implemented), your previous hook implementations are kept.
I have created some to-be-implemented hooks for Openxava (http://minuteproject.blogspot.be/2013/08/transient-definition-as-productivity.html).
But the Minuteproject approach is technology-agnostic, so artifacts could just as well be generated for other frameworks.

How to document software algorithms? [closed]

I'm working on a large project for a university assignment, we're developing an application that is used by a business to compile quotes for their various services.
I need to document the algorithms in a way that the client can sign off on, to make sure the way we calculate the prices is correct.
So far I've tried using a large flow chart with decision diamonds, as in information systems modelling, but it's proving to be overkill for even simple algorithms.
Can anybody please suggest some ways to do this? It needs to be as little like software code as possible, and enough for the client to see how we decide what prices are quoted.
Maybe you should then use pseudocode.
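For example, a single pricing rule written as pseudocode (the services, rates and discount below are invented) reads almost like plain English while still being precise enough to sign off on:

    # Pseudocode for quoting one service (illustrative numbers only)
    def quote_price(service, hours, returning_customer)
      base = case service
             when :consulting then 120 * hours   # 120 per hour
             when :support    then  80 * hours   #  80 per hour
             end

      # Returning customers booking more than 100 hours get 10% off.
      if returning_customer && hours > 100
        base * 0.9
      else
        base
      end
    end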
Create two documents.
First: The business process model (BPM) that shows the sequence of steps required to be done. This should be annotated with the details for each step.
Second: Create a spreadsheet with each input data item defined so that business can see that you understand the type of field for entry of each data point and the rules for each data point. If the calculation uses a table for the step, then that is where you define the input lookup value from the table. So for each step you know where the data is coming from and then going to. Your spreadsheet can include the link to the BPM so they can walk through each data point in the BPM and see where it is coming from/going to.
You can prepare screen designs to show the users how your system will actually behave.
Well, the usual way to document algorithms is writing papers.
If your clients have studied business, I'm sure they are familiar with reading formulas.
Would data flow diagrams help? Put pseudocode or math in the bubbles. I've had some success combining data flow models and entity relationship diagrams, but it's non-standard.
What about a Nassi-Shneiderman diagram? It's a diagram type from structured programming, and I think it's good for showing decision flows.
http://en.wikipedia.org/wiki/Nassi%E2%80%93Shneiderman_diagram
You could create an algorithm test screen to display and comment on the various steps through the calculations.

Building a code asset library [closed]

I have been thinking about setting up some sort of library for all our internally developed software at my organisation. I would like collect any ideas the good SO folk may have on this topic.
I figure, what is the point of instilling in developers the benefits of writing reusable code, if on the next project the first thing they do is File -> New, due to a lack of knowledge of what code is already out there to be reused?
As an added benefit, I think that just having a library like this would encourage developers to think more in terms of reusability when writing code.
I would like to keep this library as simple as possible, perhaps my only two requirements being:
Search facility
Usable for many types of components: assemblies, web services, etc
I see the basic information required on each asset/component to be:
Name & version
Description / purpose
Dependencies
Would you record any more information?
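For concreteness, one catalogue entry might look something like the sketch below (shown as a Ruby hash purely for illustration; the same fields would work as a wiki template or a YAML file, and the owner and tags fields are extra ideas rather than requirements):

    # Hypothetical catalogue entry; every name and value here is invented.
    {
      name:         'Billing.PdfRenderer',
      version:      '2.3.1',
      description:  'Renders invoices as PDFs; used by the billing and CRM apps.',
      type:         'assembly',                 # assembly, web service, gem, ...
      dependencies: ['Core.Logging >= 1.4'],
      owner:        'jane.doe@example.com',     # a contact to ask questions of
      tags:         %w[pdf invoicing billing]   # free-form tags for searching
    }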
What would be the best platform for this, i.e., a wiki, a forum, etc.?
What would make a software library like this successful vs unsuccessful?
All ideas are greatly appreciated.
Thanks
Edit:
Found these similar questions after posting:
How do you ensure code is reused correctly?
How do you foster the use of shared components in your organization?
Sounds like there is no central repository of code available at your organization. Depending on what you do, this could be because of compartmentalization of knowledge due to security restrictions, the fact that external vendor code is included in some or all of the solutions, or because your company has not yet seen the benefits of getting people to reuse and refactor code and of evangelizing such a repository.
The solutions I have seen work at multiple corporations share a multi-pronged approach:
Buy-in at some level from management. Usually it's a CTO/CIO the idea resonates with; they say it's a good thing and don't give any money to fund it, but they won't stand in your way, provided they are aware that someone is going to champion the idea before that person starts soliciting code and consolidating it somewhere.
Some list of projects and the available collateral, in English. I have seen this on wikis, on SharePoint lists, and in text files within a source repository. All of them share the common attribute of some sort of front-end search server that allows full-text search over the description of a solution.
Some common share or repository for the binaries and/or code. Oftentimes a large org has different authentication/authorization methods for many different environments, and it might not be practical (or logistically possible) to share a single source repository - don't get hung up on that aspect - just try to get to the point where there is a well-known share/directory/repository that works for your org.
Always make sure there is someone listed as a contact - no one ever takes code and runs it in production without at least talking to the previous owner of it - and if you don't have a person they can start asking questions of right away, then they might just go ahead and hit File -> New.
Unsuccessful attributes I've seen?
Requiring N submissions per engineer per time period = lots of crap starts making its way in.
No method of rating/feedback. If there is no means to favorite/rate/give some indicator that allows the cream to rise to the top, you don't go back to search it often, because you aren't able to benefit from everyone else's slogging through the code that wasn't really very good.
Lack of a feedback/email link that sends questions directly to the author.
Lack of ability to categorize organically. Every time there is some super-rigid, predetermined hierarchy or category list, everything ends up in "other". If you use tags or similar, you can avoid that.
Requiring a rigid-format design document to accompany submissions - no one can ever agree on the "centralized" format of a design doc, and no one ever submits when this is required.
Just my thinking.
