I want to use the model to predict scores in map lambda function in PySpark.
def inference(user_embed, item_embed):
feats = user_embed + item_embed
dnn_model = load_model("best_model.h5")
infer = dnn_model.predict(np.array([feats]), verbose=0, steps=1)
return infer
iu_score = iu.map(lambda x: Row(userid=x.userid, entryid=x.entryid, score = inference(x.user_embed, x.item_embed)))
The running is extremely slow and it stuck at the final stage quickly after code start running.
[Stage 119:==================================================>(4048 + 2) / 4050]
In HTOP monitor, only 2 of 80 cores are in full work load, others core seems not working.
So what should I do to making the model predicting in parallel ? The iu is 300 million so the efficiency if important for me.
Thanks.
I have turn verbose=1, and the predict log appears, but it seems that the prediction is just one by one , instead of predict in parallel.
During the response I researched a little bit and found this question interesting.
First, if efficiency is really important, invest a little time on recoding the whole thing without Keres. You still can use the high-level API for tensorflow (Models) and with a little effort to extract the parameters and assign them to the new model. Regardless it is unclear from all the massive implementations in the framework of wrappers (is TensorFlow not a rich enough framework?), you will most likely meet problems with backward compatibility when upgrading. Really not recommended for production.
Having said that, can you inspect what is the problem exactly, for instance - are you using GPUs? maybe they are overloaded? Can you wrap the whole thing to not exceed some capacity and use a prioritizing system? You can use a simple queue if not there are no priorities. You can also check if you really terminate tensorflow's sessions or the same machine runs many models that interfere with the others. There are many issues that can be the reason for this phenomena, it will be great to have more details.
Regarding the parallel computation - you didn't implement anything that really opens a thread or a process for this models, so I suspect that pyspark just can't handle the whole thing by its own. Maybe the implementation (honestly I didn't read the whole pyspark documentation) is assuming that the dispatched functions runs fast enough and doesn't distributed as it should. PySpark is simply a sophisticated implementation of map-reduce principles. The dispatched functions plays the role of a mapping function in a single step, which can be problematic for your case. Although it is passed as a lambda expression, you should inspect more carefully which are the instances that are slow, and on which machines they are running.
I strongly recommend you do as follows:
Go to Tensorflow deplot official docs and read how to really deploy a model. There is a protocol for communicating with the deployed models called RPC and also a restful API. Then, using your pyspark you can wrap the calls and connect with the served model. You can create a pool of how many models you want, manage it in pyspark, distribute the computations over a network, and from here the sky and the cpus/gpus/tpus are the limits (I'm still skeptical about the sky).
It will be great to get an update from you about the results :) You made me curious.
I hope you the best with this issue, great question.
Related
Let's say to cite a simple example, that I have a very simple beam pipeline which just reads from a file and dumps the data into an output file. Now let's consider that the input file is huge (some GBs in size, the type of file you can't typically open in a text editor). Since the direct-runner implementation is quite simple (it reads the whole input set into memory), it won't be able to read and output those huge files (unless you assign an impractically high amount of memory to the java vm process); so my question is: "How do production runners like flink/spark/cloud dataflow" deal with this 'huge dataset problem'? - assuming they would not just try to put the whole file(s)/dataset into memory?" -.
I'd expect production runner's implementation need to work "in parts or batches" (like reading/processing/outputting in parts) to avoid trying to fit huge datasets into memory at any specific point in time. Can somebody please share their feedback regarding how production runners deal with this "huge data" situation?
Generalizing, please notice this applies for other input/output mechanisms too, for example if my input is a PCollection coming from huge database table (broadly speaking huge in both row-size and amount), does the internal implementation of the production's runner somehow divides the given input SQL statement into many internally generated sub statements each taking smaller subsets (for example by internally generating a count(-) statement, followed by N statements, each taking (count(-)/N) elements? the direct-runner won't do this and will just pass the given query 1:1 to the DB), or is my responsibility as a developer to "iterate in batches" and divide the problem, and if this is indeed the case, what are the best practices here, ie: having one pipeline for this or many?, and if only one then somehow parametrise the pipeline to read/write in batches? or iterate over a simple pipeline and manage necessary metadata externally to the pipeline?
thanks in advance, any feedback would be greatly appreciated!
EDIT (reflecting David's feedback):
David your feedback is highly valuable and definitely touches the point i'm interested in. Having a work discovery phase for splitting a source and and read phase to concurrently read the split-partitions is definitely what I was interested in hearing, so thanks for pointing me in the right direction. I have a couple of small follow up questions if you don't mind:
1 - The article points out under the section "Generic enumerator-reader communication mechanism" the following:
"The SplitEnumerator and SourceReader are both user implemented class.
It is not rare that the implementation require some communication
between these two components. In order to facilitate such use cases [....]"
So my question here would be, is that "splitting + reading behaviour" triggered by some user (ie. developer) provided implementation (specifically SplitEnumerator and SourceReader), or can I benefit from that out of the box without any custom code?.
2 - Probably just delving deeper into the question above; if I have a batch/bounded workload (let's say I'm using apache flink), and I'm interested in processing a "huge file" like described in the original post, will the pipeline work "out of the box" (doing the behind the scenes "work preparation phase" splits and the parallel reads), or would that require some custom code implemented by the developer?
thank's in advance for all your valuable feedback!
Note that when the inputs are bounded and known in advance (i.e., a batch workload as opposed to streaming), this is more straightforward.
In Flink, which is designed with streaming in mind, this is done by separating "work discovery" from "reading". A single SplitEnumerator runs once and enumerates the chunks to be read (the splits/partitions), and assigns them to parallel readers. In the batch case a split is defined by a range of offsets, while in the streaming case, the end offset for each split is set to LONG_MAX.
This is described in more detail in FLIP-27: Refactor Source Interface.
Just to provide some closure to this question, the justification for this question was to know if apache beam - when coupled with a production runner-(like flink or spark or google cloud dataflow), offered out of the box mechanisms for -splitting work a.k.a reading/writing manipulating - huge files (or datasources in general). The comment provided by David Anderson above proved of great value in hintintg at how Apache flink deals with this workflows.
At this point I've implemented solutions using huge files (for testing possible IO bottlenecks) with a "beam on flink" based pipeline, and I can confirm, that flink will create an excecution plan which includes splitting sources, and dividing work in such a way that no memory problem arises. Now, there can be of course conditions under which stability/"IO performance" is compromised, but at least I can confirm that the workflows carried out behind the pipeline abstraction, uses the filesystem when carriying out tasks, avoiding fitting all data in memory and thus avoiding trivial memory errors. Conclusion: yes "beam on flink" (and likely spark and dataflow too) do offer proper work preparation, work splitting and filesystem usage so that available volatile memory is used in an efficient way.
Update about datasources: Regarding DBs as datasources, Flink won't (and can't - it is not trivial) optimize/split/distribute work related to DB datasources in the same way it optimizes reading from the filesystem. There are still approaches to read huge amount of data (records) from a DB though, but the implementation details need to be addressed by the developer instead of being responsibility of the framework. I've found this article (https://nl.devoteam.com/expert-view/querying-jdbc-database-in-parallel-with-google-dataflow-apache-beam/) very helpful in addressing the point of reading massive amounts of records from a DB in beam (the article uses a cloud dataflow runner, but I used Flink and it worked just fine), splitting queries and distributing the processing.
Kiba is a very small library, and it is my understanding that most of its value is derived from enforcing a modular architecture of small independent transformations.
However, it seems to me that the model of a series of serial transformations does not fit most of the ETL problems we face. To explain the issue, let me give a contrived example:
A source yields hashes with the following structure
{ spend: 3, cost: 7, people: 8, hours: 2 ... }
Our prefered output is a list of hashes where some of the keys might be the same as those from the source, though the values might differ
{ spend: 8, cost: 10, amount: 2 }
Now, calculating the resulting spend requires a series of transformations: ConvertCurrency, MultiplyByPeople etc. etc. And so does calculating the cost: ConvertCurrencyDifferently, MultiplyByOriginalSpend.. Notice that the cost calculations depend on the original (non transformed) spend value.
The most natural pattern would be to calculate the spend and cost in two independent pipelines, and merge the final output. A map-reduce pattern if you will. We could even benefit from running the pipelines in parallel.
However in my case it is not really a question of performance (as the transformations are very fast). The issue is that since Kiba applies all transforms as a set of serial steps, the cost calculations will be affected by the spend calculations, and we will end up with the wrong result.
Does kiba have a way of solving this issue? The only thing I can think of is to make sure that the destination names are not the same as the source names, e.g. something like 'originSpend' and 'finalSpend'. It still bothers me however that my spend calculation pipeline will have to make sure to pass on the full set of keys for each step, rather than just passing the key relevant to it, and then merging in the Cost keys in the end. Or perhaps one can define two independent kiba jobs, and have a master job call the two and merge their result in the end? What is the most kiba-idiomatic solution to this?
Splitting an ETL pipeline into multiple parallel paths seem to be a key feature of most ETL tools, so I'm surprised that it doesn't seem to be something kiba supports?
I think I lack extra details to be able to properly answer your main question. I will get in touch via email for this round, and will maybe comment here later for public visibility.
Splitting an ETL pipeline into multiple parallel paths seem to be a key feature of most ETL tools, so I'm surprised that it doesn't seem to be something kiba supports?
The main focus of Kiba ETL today is: components reuse, lower maintenance cost, modularity and ability to have a strong data & process quality.
Parallelisation is supported to some extent though, via different patterns.
Using Kiba Pro parallel transform to run sister jobs
If your main input is something that you can manage to "partition" with a low volume of items (e.g. database id ranges, or a list of files), you can use Kiba Pro parallel transform like this:
source ... # something that generate list of work items
parallel_transform(max_threads: 10) do |group_items|
Kiba.run(...)
end
This works well if there is no output at all, or not much output, coming to the destinations of the sister jobs.
This works with threads but one can also "fork" here for extra performance.
Using process partitioning
In a similar fashion, one can structure their jobs in a way where each process will only process a subset of the input data.
This way one can start say 4 processes (via cron jobs, or monitored via a parent tool), and pass a SHARD_NUMBER=1,2,3,4, which is then used by the source for input-load partitioning.
But!
I'm pretty sure your problem, as you said, is more about workflow control & declarations & ability to express what you need to be done, rather than performance.
I'll reach out and we'll discuss that.
I have a task which consists of 3 concurrent self-defined (recursive to each other) processes. I need somehow to make it execute on computer, but any attempt to convert a requirement to program code with just my brain fails since first iteration produces 3^3 entities with 27^2 cross-relations, but it needs to implement at least several iterations to try if program even works at all.
So I decided to give up on trying to understand the whole system and formalized the problem and now want to map it to hardware to generate an algorithm and run. Language doesn't matter (maybe even directly to machine/assembly one?).
I never did anything like that before, so all topics I searched through like algorithm synthesis, software and hardware co-design, etc. mention hardware model as the second half (in addition to problem model) of solution generation, but I never seen one. The whole work supposed to look like this:
I don't know yet what level hardware model described at, so can't decide how problem model must be formalized to fit hardware model layer.
For example, target system may contain CPU and GPGPU, let's say target solution having 2 concurrent processes. System must decide which process to run on CPU and which on GPGPU. The highest level solution may come from comparing computational intensity of processes with target hardware, which is ~300 for CPUs and ~50 for GPGPUs.
But a normal model gotta be much more complete with at least cache hierarchy, memory access batch size, etc.
Another example is implementing k-ary trees. A synthesized algorithm could address parents and children with computing k * i + c / ( i - 1 ) / k or store direct pointers - depending on computations per memory latency ratio.
Where can I get a hardware model or data to use? Any hardware would suffice for now - to just see how it can look like - later would be awesome to get models of modern processors, GPGPUs and common heterogeneous clusters.
Do manufacturers supply such kinds of models? Description of how their systems work in any formal language.
I'm not pretty sure if it might be the case for you, but as you're mentioning modeling, I just thought about Modelica. It's used to model physical systems and combined with a simulation environment, you can run some simulations on it.
I have a software which proceeds one picture and give me some results for that picture and a database which contains a lot of pictures.
I would like to build a distributed architecture in order to process these pictures on multiple servers in order to gain time.
I heard about Spark and searched about it, but I'm not sure that this solution is good for me. Nevertheless, I don't want to miss something.
Indeed, in all the example I found for Spark, it's always dealing with tasks/jobs that can be split in smaller tasks/jobs.
For example, a text can be split in mutliple smaller texts and so, the wordcount can be easy processed.
However, when I use my software, I need to give a whole picture and not just parts of it.
So, is it possible to give Spark a task which contains 10 pictures (for example), and then Spark splits it in smaller tasks (1 task = 1 picture) and sends each picture to a worker?
And if it's possible, is this very efficient? I actually heard about Celery and I'm wondering if this kind of solution is better for my case.
Thank you for your help! :)
I think it depends on what you mean by "lot of pictures" and how often you will get "lot of pictures" to process. If you have tens of thousands of pictures and you will get them frequently, then Spark will definitely a good solution.
From an architecture and requirements viewpoint I think either Spark or Storm will fit the bill. My main concern would be wether the overhead is justified. This talk for instance is about realtime image processing with Spark:
https://www.youtube.com/watch?v=I6qmEcGNgDo
You could so look at this quora thread:
https://www.quora.com/Has-anyone-developed-computer-vision-image-processing-algorithms-on-Twitter-Storm
Hy,
Currently I am developing a program that takes 2 values from an amq queue and performs a series of mathematical calculations on them. A topic has been created on the amq server to which my program subscribes and receive messages via callbacks (listeners).
Now whenever a message arrives the two values are taken out of and added to the SynchronizedDescriptiveStatistics object. After each addition to the list of values the whole sequence of calculations is performed all over again (this is part of the requirement actually).
The problem I am facing right now is that since I am using listeners, sometimes a single or more messages are received in the middle of calculations. Although SynchronizedDescriptiveStatistics takes care of all the thread related issues it self but it adds all the waiting values in its list of numbers at once when it comes out of lock or something. While my problem was to add one value then perform calcls on it then second value and on and on.
The solution I came up with is to use job queues in my program (not amq queues). In this way whenever calcs are over the program would look for further jobs in the queue and goes on accordingly.
Since I am also looking for efficiency and speed I thought the Disruptor framework might be good for this problem and it is optimized for threaded situations. But I am not sure if its worth the trouble of implementing Disruptor in to my application because regular standard queue might be enough for what I am trying to do.
Let me also tell you that the data on which the calcs need to be performed is a lot and it will keep on coming and the whole calcs will need to be performed all over again for each addition of a single value in a continuous fashion. So keeping in mind the efficiency and the huge volume of data what do you think will be useful in the long run.
Waiting for a reply. . .
Regards.
I'll give our typical answer to this question: test first, and make your decision based on your results.
Although you talk about efficiency, you don't specifically say that performance is a fundamental requirement. If you have an idea of your performance requirements, you could mock up a simple prototype using queues versus a basic implementation of the Disruptor, and take measurements of the performance of both.
If one comes off substantially better than the other, that's your answer. If, however, one is much more effort to implement, especially if it's also not giving you the efficiency you require, or you don't have any hard performance requirements, then that suggests that solution is not the right one.
Measure first, and decide based on your results.