How to use generators with kedro? - python-3.x

Thanks to David Beazley's slides on generators, I'm quite taken with using generators for data processing in order to keep memory consumption minimal. Now I'm working on my first Kedro project, and my question is how I can use generators in Kedro. When I have a node that yields a generator and then run it with kedro run --node=example_node, I get the following error:
DataSetError: Failed while saving data to data set MemoryDataSet().
can't pickle generator objects
Am I supposed to always load all my data into memory when working with kedro?

Hi @ilja, to do this you may need to change the type of assignment operation that MemoryDataSet applies.
In your catalog, declare your datasets explicitly and change the copy_mode to either copy or assign. I think assign may be your best bet here...
https://kedro.readthedocs.io/en/stable/kedro.io.MemoryDataSet.html
I hope this works, but am not 100% sure.
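For illustration, a minimal sketch of that via the Python API rather than catalog.yml (the dataset name is a placeholder, and this assumes a Kedro version where MemoryDataSet is exposed from kedro.io; in YAML you would set copy_mode: assign on the dataset entry):

    # Sketch only: declare the node's output as a MemoryDataSet whose
    # copy_mode is "assign", so the generator object is passed through
    # instead of being deep-copied (which triggers the pickling error).
    from kedro.io import DataCatalog, MemoryDataSet

    catalog = DataCatalog(
        data_sets={
            # "example_node_output" is a placeholder for your dataset name
            "example_node_output": MemoryDataSet(copy_mode="assign"),
        }
    )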

Related

How to deal with (Apache Beam) high IO bottlenecks?

Let's say, to cite a simple example, that I have a very simple Beam pipeline which just reads from a file and dumps the data into an output file. Now let's consider that the input file is huge (some GBs in size, the type of file you typically can't open in a text editor). Since the direct-runner implementation is quite simple (it reads the whole input set into memory), it won't be able to read and output those huge files (unless you assign an impractically high amount of memory to the Java VM process); so my question is: how do production runners like Flink/Spark/Cloud Dataflow deal with this "huge dataset problem", assuming they don't just try to fit the whole file(s)/dataset into memory?
I'd expect the production runners' implementations to work "in parts or batches" (reading/processing/outputting in parts) to avoid trying to fit huge datasets into memory at any specific point in time. Can somebody please share their feedback regarding how production runners deal with this "huge data" situation?
Generalizing, please notice this applies to other input/output mechanisms too. For example, if my input is a PCollection coming from a huge database table (broadly speaking, huge in both row size and number of rows), does the internal implementation of the production runner somehow divide the given input SQL statement into many internally generated sub-statements, each taking a smaller subset (for example, by internally generating a count(-) statement followed by N statements, each taking count(-)/N elements)? (The direct-runner won't do this and will just pass the given query 1:1 to the DB.) Or is it my responsibility as a developer to "iterate in batches" and divide the problem? And if this is indeed the case, what are the best practices here, i.e. having one pipeline for this or many? And if only one, should I somehow parametrise the pipeline to read/write in batches, or iterate over a simple pipeline and manage the necessary metadata externally to the pipeline?
Thanks in advance, any feedback would be greatly appreciated!
EDIT (reflecting David's feedback):
David, your feedback is highly valuable and definitely touches the point I'm interested in. Having a work-discovery phase for splitting a source and a read phase to concurrently read the split partitions is definitely what I was interested in hearing, so thanks for pointing me in the right direction. I have a couple of small follow-up questions if you don't mind:
1 - The article points out under the section "Generic enumerator-reader communication mechanism" the following:
"The SplitEnumerator and SourceReader are both user implemented class.
It is not rare that the implementation require some communication
between these two components. In order to facilitate such use cases [....]"
So my question here would be: is that "splitting + reading" behaviour triggered by some user-provided (i.e. developer-provided) implementation (specifically SplitEnumerator and SourceReader), or can I benefit from it out of the box without any custom code?
2 - Probably just delving deeper into the question above: if I have a batch/bounded workload (let's say I'm using Apache Flink) and I'm interested in processing a "huge file" as described in the original post, will the pipeline work "out of the box" (doing the behind-the-scenes "work preparation phase" splits and the parallel reads), or would that require some custom code implemented by the developer?
Thanks in advance for all your valuable feedback!
Note that when the inputs are bounded and known in advance (i.e., a batch workload as opposed to streaming), this is more straightforward.
In Flink, which is designed with streaming in mind, this is done by separating "work discovery" from "reading". A single SplitEnumerator runs once and enumerates the chunks to be read (the splits/partitions), and assigns them to parallel readers. In the batch case a split is defined by a range of offsets, while in the streaming case, the end offset for each split is set to LONG_MAX.
This is described in more detail in FLIP-27: Refactor Source Interface.
Just to provide some closure to this question: the justification for this question was to know whether Apache Beam, when coupled with a production runner (like Flink, Spark or Google Cloud Dataflow), offers out-of-the-box mechanisms for splitting work, i.e. reading/writing/manipulating huge files (or data sources in general). The comment provided by David Anderson above proved of great value in hinting at how Apache Flink deals with these workflows.
At this point I've implemented solutions using huge files (for testing possible IO bottlenecks) with a "Beam on Flink" based pipeline, and I can confirm that Flink will create an execution plan which includes splitting sources and dividing work in such a way that no memory problem arises. There can of course be conditions under which stability/"IO performance" is compromised, but at least I can confirm that the workflows carried out behind the pipeline abstraction use the filesystem when carrying out tasks, avoiding fitting all data in memory and thus avoiding trivial memory errors. Conclusion: yes, "Beam on Flink" (and likely Spark and Dataflow too) does offer proper work preparation, work splitting and filesystem usage so that available volatile memory is used in an efficient way.
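For concreteness, a minimal sketch of a pipeline with that shape (reading a huge file and writing the result back out; paths and the transform are placeholders). Note there is nothing about splitting in the code itself; on a production runner the text source is split behind the scenes:

    # Sketch of the "huge file in, huge file out" pipeline shape discussed
    # above. Paths and the Map transform are placeholders; splitting of the
    # input file is left entirely to the runner.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions()  # runner-specific (e.g. Flink) options go here

    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("/data/huge_input.txt")
            | "Transform" >> beam.Map(str.upper)  # stand-in for real processing
            | "Write" >> beam.io.WriteToText("/data/output/result")
        )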
Update about data sources: regarding DBs as data sources, Flink won't (and can't; it is not trivial) optimize/split/distribute work related to DB data sources in the same way it optimizes reading from the filesystem. There are still approaches to reading huge amounts of data (records) from a DB, but the implementation details need to be addressed by the developer instead of being the responsibility of the framework. I've found this article (https://nl.devoteam.com/expert-view/querying-jdbc-database-in-parallel-with-google-dataflow-apache-beam/) very helpful in addressing the point of reading massive amounts of records from a DB in Beam (the article uses a Cloud Dataflow runner, but I used Flink and it worked just fine), splitting queries and distributing the processing.
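Roughly, the pattern in that article boils down to splitting the key space yourself and letting the runner distribute the per-range queries. A hedged sketch (the fetch_range helper, the table and the id bounds are hypothetical; the real version would issue something like SELECT ... WHERE id BETWEEN lo AND hi against your database):

    # Sketch: split a big table read into id ranges so the runner can
    # distribute the per-range queries. fetch_range() is a hypothetical
    # helper standing in for a real "SELECT ... WHERE id BETWEEN lo AND hi".
    import apache_beam as beam

    def make_ranges(min_id, max_id, n_chunks):
        """Yield (lo, hi) id ranges covering [min_id, max_id]."""
        step = (max_id - min_id) // n_chunks + 1
        for lo in range(min_id, max_id + 1, step):
            yield (lo, min(lo + step - 1, max_id))

    def fetch_range(id_range):
        """Hypothetical: query the DB for one id range and yield its rows."""
        lo, hi = id_range
        # rows = db.execute("SELECT * FROM t WHERE id BETWEEN %s AND %s", (lo, hi))
        rows = []  # placeholder for the actual query results
        yield from rows

    with beam.Pipeline() as p:
        (
            p
            | "Ranges" >> beam.Create(list(make_ranges(1, 10_000_000, 200)))
            | "Shuffle" >> beam.Reshuffle()  # let the runner redistribute the ranges
            | "ReadChunk" >> beam.FlatMap(fetch_range)
            # ... downstream processing / writes ...
        )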

Getting started at TradingView and referencing balances

I may be new to TradingView, but their Pine Script programming language seems to be the best I've ever seen for automated trading. They seem to really want me to succeed, but I cannot find where it tells me how to access certain balances. I am trying to write code where I do not reinvest the extra I make, so I have to be able to reference the available amount. I have not quite finished the manual yet, but I do not see what variable or function allows me to do that, or at least not where I would expect it.
Have a look at strategy.equity. There are quite a few built-in variables for strategy values. You can inspect them from the refman by searching on "strategy".
You can also calculate your own metrics using a technique like this one if you don't find what you need in the built-ins.
And welcome to Pine! This is the best place to start your journey:
https://www.tradingview.com/?solution=43000561836

Writing DRY Spock tests

Assume that I have a Spock specification that given a city and state tests for the correct zip code. Assume I have a text file of cities and states that is used to drive the tests in the where clause.
Now assume that I want to split the tests so that I can run them for "Virginia" or "Maryland". The approach that I have taken is to create a new VirginiaSpec and a new MarylandSpec, and in each spec I modify the where clause.
This works, but seems inefficient because every other part of the VirginiaSpec and MarylandSpec is exactly the same. In addition, if the logic changes, then I would need to change it in every spec that I have.
So what I am looking for is an approach that allows me to have one StateSpec in which the where clause can be parameterized.
I realize I have not included a code example, however if my question is not clear, then I can provide one. Thanks for your help.
-Dan
You have a couple of options. You could put the basic setup and structure, and even the test itself, in a base test class, then extend that class with your VirginiaSpec and MarylandSpec. Your spec classes would be very small, probably just defining a constant that is the state for the spec.
But that seems needless. If both the cities and states are in this file, you could just read in the file in the where section of your test.
https://snekse.github.io/test-often-and-prosper-slides/#/42
If you cannot get the WHERE section working, you could always read in your file during the setupSpec and store the data in some kind of data structure then loop through it.
Spock: Reading Test Data from CSV File
But in general, using the Where label is going to be the right answer.

Optimiser for an Excel spreadsheet

I'm a mechanical engineer, and I have developed a pretty cool spreadsheet that I use to size some steel members for lifting beams. The setback is that I need to do some trial and error in the selection of the member until I get one that gets as close to the allowable limits as possible.
What I'm hoping to do is develop a function so that, based upon a length and a weight variable that I enter, the program runs a loop and automatically selects the best member size(s) from a list of the members and their physical properties. Is this possible?
Yeah, depending on the complexity, a simple search through the parameters (less than, more than, etc.) might bring you the answer. You can do it quite easily with the pandas library. Just load up the Excel file as a pandas DataFrame (pandas.read_excel()), which will then allow you to perform the searches on that DataFrame object (see the sketch below).
If you want to run some optimization algorithm, you should look into SciPy's optimize module to get what you're looking for based on the input data (it handles unconstrained and constrained functions).
Of course, the question you've stated is quite general, so I've only pointed out the direction. More info would be better.
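To make the simple-search route concrete, a sketch under assumed column names (capacity_kNm, mass_kg_per_m) and a made-up demand calculation; you'd swap in your real member table and design checks:

    # Sketch: filter the member table down to sections that satisfy the
    # design check for a given load case, then take the lightest one.
    # Column names and the required_capacity formula are placeholders.
    import pandas as pd

    def best_member(xlsx_path, span_m, load_kN):
        members = pd.read_excel(xlsx_path)        # one row per member size
        required_capacity = load_kN * span_m / 4  # placeholder demand calc

        ok = members[members["capacity_kNm"] >= required_capacity]
        if ok.empty:
            raise ValueError("No member in the list satisfies the demand")

        # "Best" here means the lightest adequate section; adjust as needed.
        return ok.sort_values("mass_kg_per_m").iloc[0]

    # print(best_member("members.xlsx", span_m=6.0, load_kN=50.0))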

How to test a program processing large amounts of data stored in an unpredictable format

What I have to do
I'm trying to manipulate some rather large amounts of data stored in Excel files (one of the workbooks has as many as 150 spreadsheets). The result of these manipulations may yield approximately 800,000 rows in a database table.
The problem
The data stored in the spreadsheets has an unpredictable format. The company that generated these spreadsheets had no fixed/documented format for exporting these files, and sometimes erroneous data appears. For example, most of the years are represented like "2009", but there are cases where a year is represented as "20". Another example: the data is not really normalized in these files, so I use separators to split the values of certain cells. Sometimes these separators change.
There are things like these that I couldn't predict, and I discovered them only after running an already evolved version of my program over a pretty large part of the available data.
The question
How can one test the correctness of a program in such a situation? Or rather, how to achieve a pretty stable version of the product without running it over the whole available data?
Shall I take a defensive approach and throw exceptions whenever some kind of unexpected issue arises, and have the main loop of the program catch and log them and continue with the available data? This would yield some processed data, but it means that on a subsequent iteration of the program I have to check what's already inside the database from previous iterations (which I don't really like).
What's your opinion? How would you tackle this problem?
If there is no specification for what the format of the data is, then anything is acceptable.
If not, then there is either an explicit or implicit specification of the data. I would try to nail this down right now. If you can't get an explicit enough definition of the data to write your program so that it can be expected to run without error, then I would say you are taking a very large risk of causing some serious damage, depending on how this data is being used.
You should write your program so that it either throws an exception or logs an error whenever it runs across data that does not meet the specification. Then, run the program on PART of the available data until it runs without exception. This can be viewed as a training set for the development of your program. Then, use some of the saved data as a TEST set. This will give you an estimate of how many exceptions/errors your program will generate in production.
Overfitting is a common machine learning concept, but it is useful for other tasks such as this one: program development. It is surprising to me how developers can write a bunch of unit tests, code their application to perform well on them, and then expect similar or bug-free performance in production.
If you're not willing to take all these steps (i.e. run your code on essentially all of the data -- since the test set is also making use of the data) then I would say the task is too large to do.
As an aside, rather than creating a definition of a format that is very strange and peculiar in order to account for all the "errors" in the current data, you might want to create a new, normalized (in the sense that these things are simplified away) specification for the data, and then write a "faulty document patcher" that can be run on faulty documents to fix the data.
If the application generating the data is still in production, then you might need to go to the developers of this application to get buy-in on the new spec. Once you have that, you can start logging bugs against their application, so hopefully the faulty document patcher can eventually be retired.
More likely, I'm guessing that the software developers are long gone, no one understands the code anymore, if it is even running at all.
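A small sketch of that training/test idea in code terms (check_row and the expected field count are placeholders for whatever your specification actually requires):

    # Sketch of the "training set / test set" idea for the import program.
    # check_row() stands in for your actual data specification.
    import logging

    class SpecViolation(Exception):
        pass

    def check_row(row, expected_fields=12):
        """Hypothetical spec check: raise SpecViolation for off-spec rows."""
        if len(row) != expected_fields:
            raise SpecViolation(
                f"expected {expected_fields} fields, got {len(row)}: {row!r}"
            )
        return row

    def error_rate(rows):
        """Fraction of rows that violate the current spec."""
        failures = 0
        for row in rows:
            try:
                check_row(row)
            except SpecViolation as exc:
                logging.warning("%s", exc)
                failures += 1
        return failures / max(len(rows), 1)

    # Tighten the program until error_rate(training_rows) == 0, then
    # error_rate(test_rows) estimates how it will behave in production.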
How can one test the correctness of a program in such a situation? Or rather, how to achieve a pretty stable version of the product without running it over the whole available data?
For every single data type, I would set reasonable constraints on the values that it is allowed to take.
If a cell violates these constraints, then throw an exception containing the piece of data it failed on and its data type. If a piece of data violates its constraints, you can modify the source to include the additional constraints required for that piece of data, and a conversion method to make it uniform.
To give an example using the date you gave: initially a date would have the constraint that it could only be four digits. When the program came across the "20", it would throw an exception.
Then you could go and allow two-digit dates, and add a method to convert the two-digit dates into four-digit ones to allow further processing.
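In code terms, that progression might look like the sketch below (the century cutoff used to expand a two-digit year is an assumption you'd have to confirm against the real data):

    # Sketch: start with a strict four-digit constraint, then relax it by
    # adding an explicit conversion for the two-digit years that occur.
    class ConstraintError(ValueError):
        pass

    def parse_year(raw):
        if len(raw) == 4 and raw.isdigit():
            return int(raw)
        if len(raw) == 2 and raw.isdigit():  # added after hitting "20"
            return 2000 + int(raw)           # assumed century; confirm with the data
        raise ConstraintError(f"year constraint violated: {raw!r}")

    assert parse_year("2009") == 2009
    assert parse_year("20") == 2020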
One question is, will you run your program more than once? From your question it sounds possible you only want to run it once, and then you will then work with the data in the database.
In which case you can be very defensive - throw exceptions whenever unexpected data appears. Run the program repeatedly on ever-larger sets of the data. Initially, solve any exceptions by altering the code, as it's a good rule of thumb that the exceptions you find first are going to be common. You might want to empty the output database between runs.
Later on, you will be finding rare exceptions that might only occur a couple of times in the input. Just solve these by hand and insert the corresponding rows in the database yourself. Or write another small program that reads your exception information and inserts the new rows, rather than running your whole big program again.
Typically for this sort of thing I do as @MarkJ suggested, and I encode the whole thing in unit tests.
So I compose a small datafile that at first contains only a few rows of normal data. That's unit test number 1.
Then I take a quick visual scan of some of the data to spot any obvious exceptions. Unit tests 2 through n.
Finally, I write parser code until it passes all unit tests, and throws and logs exceptions for all unmanaged data.
I then use these oddball bits of data to make new unit tests, and improve the parser until it can pass those too.
Sometimes, though, accommodating a really strange bit of data adds more parser complexity than it's worth; in that case I'll just log the exception, dump the data, and move on. This is a matter of professional judgment.
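As a sketch of how those tests can be laid out (parse_row and the sample rows are placeholders; each oddball row discovered in the data becomes a new test case):

    # Sketch of growing the suite row by row: test 1 covers normal data,
    # tests 2..n cover oddballs spotted in the real files. parse_row() and
    # the sample rows are placeholders for the real parser and real data.
    import unittest

    def parse_row(line, sep=";"):
        year, value = line.split(sep)
        return {"year": int(year), "value": value.strip()}

    class ParserTests(unittest.TestCase):
        def test_normal_row(self):                # unit test number 1
            self.assertEqual(parse_row("2009;widget")["year"], 2009)

        def test_unexpected_separator(self):      # unit tests 2..n: oddballs
            with self.assertRaises(ValueError):
                parse_row("2009|widget")          # should fail loudly, not guess

    if __name__ == "__main__":
        unittest.main()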
How about processing every piece of data (so you don't have to check for dupes)? The pieces that pass go into the database. The exceptions go into an exception file. The user can open the exception file and make corrections/modifications to the data, and then run your program on the exception file.
This will isolate unhandled data for the user to correct and prevent you from processing the same data twice (or more).
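A sketch of that round trip (process_row, store_row and the column layout are placeholders): rows that pass go to the database, failures go to an exceptions CSV that the user can correct and feed back through the same function.

    # Sketch of the exception-file round trip. process_row() and store_row()
    # are placeholders for the real validation and database insert.
    import csv

    def process_row(row):
        """Placeholder: raise ValueError if the row can't be handled."""
        if len(row["year"]) != 4:
            raise ValueError("unhandled year format")
        return row

    def store_row(row):
        """Placeholder for the real database insert."""
        pass

    def run(rows, exception_path="exceptions.csv"):
        with open(exception_path, "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=["year", "value", "error"])
            writer.writeheader()
            for row in rows:
                try:
                    store_row(process_row(row))
                except ValueError as exc:
                    writer.writerow({**row, "error": str(exc)})

    # After the user fixes exceptions.csv, point run() at those rows again.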
