It looks like all the Spark examples found on the web are built as a single long function (usually in main).
But it often makes sense to break the long call chain into smaller functions, in order to:
Improve readability
Share code between solution paths
A typical signature would look like this (this is Java code, but a similar signature would appear in any language):
private static Dataset<Row> myFiltering(Dataset<Row> data) {
    return data.filter(functions.col("firstName").isNotNull())
               .filter(functions.col("lastName").isNotNull());
}
The problem here is that there is no safety for the content of the Row: nothing enforces which fields it contains, so calling the function becomes not only a matter of matching the signature but also the content of the Row, which obviously may (and in my case does) cause errors.
What is the best practice you enforce in large-scale development environments? Do you leave the code as one long function? Do you suffer every time you change field names?
Yes, you should split your method into smaller methods. Be aware, though, that many small functions are not very readable either ;)
My rules:
Split a chain of transformations if you can name it - if it has a name, it is some kind of subfunction.
If there are many functions in the same domain, extract a new class or even a package.
KISS: don't split if your call chain is short and already describes a single subfunction, even if some lines could be extracted - reading many, many tiny custom functions is not more readable.
Extract any larger filter or map - anything that doesn't fit in 1-3 lines - into a method and use a method reference (see the sketch after this list).
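For example, a minimal sketch of rules 1 and 4 in Java, reusing the column names from the question (the class and method names are invented for illustration):

import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public final class PersonFilters {

    // Rule 1: the transformation chain gets a descriptive name.
    public static Dataset<Row> keepRowsWithFullName(Dataset<Row> data) {
        // Rule 4: the predicate lives in its own method, used as a method reference.
        return data.filter((FilterFunction<Row>) PersonFilters::hasFullName);
    }

    private static boolean hasFullName(Row row) {
        return !row.isNullAt(row.fieldIndex("firstName"))
            && !row.isNullAt(row.fieldIndex("lastName"));
    }

    private PersonFilters() {}
}

If the lack of enforcement over the Row's fields is the bigger worry, one commonly used option is to convert to a typed Dataset<T> early (e.g. with Encoders.bean on a small POJO), so that field access goes through compiled getters rather than string column names.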
Marked as Community Wiki, because this is only my point of view and it's off-topic for StackOverflow. If someone else has any other suggestions, please share them :)
How do I decide between creating several WebJobs with 1 function each and bundling several functions into one or only a few WebJobs?
Thanks
There is no straight answer to your question. Sorry.
Usually you group functions by workflow or role. For example, if you have a workflow that contains a function that resizes an image, then a function that applies a watermark, and another one that replicates the images, it makes sense to put all the functions together, because they are related: you are more likely to change all of them when you modify the flow.
On the other hand, you might argue that functions should be separated. Unless you change the input/output, there is no reason to modify more than one function. However, if you need to change more than one function, you will end up editing more projects.
As you see, both arguments have pros/cons and there is really no right answer.
Try to experiment and see which approach works better for your solution.
PS: The only guideline that I can give is: if the functions are really small (a few lines of code), probably it is easier to put them in the same webjob because there is quite some overhead in maintaining multiple assemblies.
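The WebJobs runtime is .NET, but the grouping trade-off itself is language-neutral. As a sketch of the "group by workflow" option (invented names, written in Java only for consistency with the rest of this page):

// The three image steps change together whenever the flow changes,
// so they are bundled into a single job class.
public class ImageWorkflowJob {

    public void process(byte[] original) {
        byte[] resized = resize(original);
        byte[] watermarked = applyWatermark(resized);
        replicate(watermarked);
    }

    private byte[] resize(byte[] image) { /* ... */ return image; }

    private byte[] applyWatermark(byte[] image) { /* ... */ return image; }

    private void replicate(byte[] image) { /* ... */ }
}

Splitting each step into its own job instead buys independent changes at the cost of touching several projects whenever the workflow as a whole changes.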
I am using Jena and Java, and am reading a CSV file. For each line of the file there is a subject resource. Two subject resources, on adjacent lines, might share the same value of a field in the line (e.g. both lines have the same process ID). In this case, I need to combine the two subject resources, as each one represents a sub-process in production (for example).
My question is: how can I reference those two resources dynamically so that I can combine them? My idea was that when I find that they share the same property, I store them in an array of subject resources. Is that the right approach?
This question would be a lot easier to answer if you could show some sample data. As it is, I think you're focusing on the wrong bit of the question. If you can decide clearly what it means to have two rows in your CSV with an identical process ID, and then decide how you're going to encode that meaning in your RDF model, then the question of how to write the code - as an array or whatever - will be much clearer.
For example (and I'm going to make up some data here - as I said, it would be easier if you show an actual example), suppose your CSV contains:
processId,startTime,endTime
123,15:22:00,15:23:00
123,16:22:00,16:25:00
So process 123 apparently has two start and end time pairs. If you model this naively in RDF, you'll end up with a confusing model:
process:process123
    a :Process ;
    process:start "15:22:00"^^xsd:time ;
    process:end "15:23:00"^^xsd:time ;
    process:start "16:22:00"^^xsd:time ;
    process:end "16:25:00"^^xsd:time ;
    .
which would suggest that one process had two start times (and two end times), which looks nonsensical. However, it might be that in reality you have a single process with multiple episodes, suggesting one way to model it, or a periodic process which occurs at different times, or, as you suggested, sub-processes of a parent process. Or something else entirely (I'm only guessing; I don't know your domain). Once you are clear what the data means, you can produce a suitable RDF model. For example, an episodic process might be:
process:process123
    a :Process ;
    process:episode [
        a process:Episode ;
        process:start "15:22:00"^^xsd:time ;
        process:end "15:23:00"^^xsd:time ;
    ] ;
    process:episode [
        a process:Episode ;
        process:start "16:22:00"^^xsd:time ;
        process:end "16:25:00"^^xsd:time ;
    ]
    .
Once the modelling is clear in your mind, I think you can see that the question of how to produce the desired RDF triples from Java code - and whether or not you need an array - is much clearer. Equally importantly, you can think in terms of the JUnit tests you would write to test whether your code is behaving correctly.
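If it helps to see the Java side, here is a minimal Jena sketch of the episodic model above; the namespace URI and the helper method are invented for illustration:

import org.apache.jena.datatypes.xsd.XSDDatatype;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDF;

public class EpisodicProcessExample {
    // Invented namespace - substitute whatever vocabulary your model defines.
    static final String NS = "http://example.org/process#";

    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        model.setNsPrefix("process", NS);
        Property episode = model.createProperty(NS, "episode");
        Property start = model.createProperty(NS, "start");
        Property end = model.createProperty(NS, "end");

        // One resource per distinct processId, reused for every CSV row
        // that carries the same id.
        Resource process123 = model.createResource(NS + "process123")
                .addProperty(RDF.type, model.createResource(NS + "Process"));

        addEpisode(model, process123, episode, start, end, "15:22:00", "15:23:00");
        addEpisode(model, process123, episode, start, end, "16:22:00", "16:25:00");

        model.write(System.out, "TURTLE");
    }

    static void addEpisode(Model model, Resource process, Property episode,
                           Property start, Property end, String from, String to) {
        Resource ep = model.createResource() // a blank node, as in the Turtle above
                .addProperty(RDF.type, model.createResource(NS + "Episode"))
                .addProperty(start, model.createTypedLiteral(from, XSDDatatype.XSDtime))
                .addProperty(end, model.createTypedLiteral(to, XSDDatatype.XSDtime));
        process.addProperty(episode, ep);
    }
}

Note that the natural bookkeeping structure while reading the CSV is a Map<String, Resource> keyed on the process ID rather than an array: each new row looks up (or creates) its process resource and attaches one more episode to it.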
I have a method which takes more than 7 parameters, and the types of the parameters are all different.
My question is: is that fine, or should I replace all the parameters with a single class (which would contain all the parameters as instance variables)?
7 is way too much. Replace with a class. With my VS custom theme and font settings, Intellisense wouldn't fit on the screen when there is a method with so many parameters :-) I find it more readable and easier to understand when working with classes.
Of course those are just my 2 cents and it's subjective. I've seen people writing methods with many many parameters.
Well, there are places where I'd consider that okay - but they're few and far between. I would generally consider using a "parameter class" instead.
Note that it doesn't have to be an "all or nothing" approach - would it make more sense to encapsulate, say, 4 of the parameters together? Would that allow the new class to be used in other methods?
Another thing to consider is whether the method might be doing too much - does the functionality of the method definitely feel right as a single cohesive unit?
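To make the parameter-class idea concrete, here is a minimal sketch with invented booking-domain names, grouping four of the seven parameters as suggested above:

import java.time.LocalDate;

// Hypothetical example: four of the seven parameters form a natural group,
// so they become a small immutable class.
final class Stay {
    final int roomNumber;
    final LocalDate checkIn;
    final LocalDate checkOut;
    final boolean smoking;

    Stay(int roomNumber, LocalDate checkIn, LocalDate checkOut, boolean smoking) {
        this.roomNumber = roomNumber;
        this.checkIn = checkIn;
        this.checkOut = checkOut;
        this.smoking = smoking;
    }
}

class BookingService {
    // Before: createBooking(name, email, roomNumber, checkIn, checkOut, smoking, deposit)
    // After: four parameters instead of seven, and Stay is reusable by
    // other methods (cancelBooking, extendStay, ...).
    void createBooking(String customerName, String email, Stay stay, double deposit) {
        // ...
    }
}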
There are Microsoft APIs with even more parameters; anyway, I agree with @Darin: a class can be a good solution, clear and efficient, and it avoids having to pass parameters in exactly the right order... for example, if you change the order of one parameter without using refactoring tools, you're in a mess...
Consider also whether that class can be used in other parts of the program...
I somewhat agree with all the given answers. However, it might be a good idea to break down the method into smaller parts, as Jon suggests. Usually a situation like this means that the method is doing too much. Typically methods should have one or two parameters at most. While you would achieve the same by replacing all your parameters with a class parameter, you might just be hiding the bigger issue. Is there any chance you can post the method or describe what it's doing?
As a rule of thumb, if you can't fit the entire method, including its declaration, onto one screen, it's too big. Programmers normally recommend breaking methods down to make them reusable, but it also makes perfect sense to break them down to increase readability.
I'm looking for a generic charting solution (ideally not a hosted one) that provides the following features:
Charting a tuple of values where the values are:
1) A service identifier (e.g. CPU usage)
2) A client identifier within that service (e.g. server IP)
3) A value
4) A timestamp with millisecond/second resolution.
Optional:
I'd also like to extend the concept of a client identifier: taking the above example further, I'd like to store statistics for each core separately, so another identifier would be Core 1/Core 2...
Now, to make sure I'm clearly stating my problem: I don't want a utility that collects these statistics. I'd like something that stores them, but even this is not mandatory - I can always store them in MySQL or the like.
What I'm looking for is something that takes values such as these and charts them nicely, in a multitude of ways (timelines, motion, and the usual ones [pie, bar...]). Essentially, a nice visualization package that allows me to make use of all this data. I'd be collecting data from multiple services and multiple applications, and the datapoints will be of varying resolution. Some of the data will include multiple layers of nesting, some none. (For example, CPU would go down to server IP and CPU#, whereas memory would only be server IP, but would include a different identifier, i.e. free/used/cached, as the "secondary" identifier. Something like average request latency might not have a secondary identifier at all, as in the case of ping.) What I'm trying to get across is that having multiple layers of identifiers would be great. To add one final example of where multiple identifiers would be useful: adding an extra identifier on top of IP/CPU#, namely the process name. I think the advantages of that are obvious.
For some applications, we might collect data at a very narrow scope, focusing on every aspect, in other cases, it might be a more general statistic. When stuff goes wrong, both come in useful, the first to quickly say "something just went wrong", and the second to say "why?".
Further, it would be nice if the charting application threw out "bad" values; that is, if for some reason our monitoring program started reporting 300% CPU usage on a single core for 10 seconds, it'd be nice if the charts themselves didn't reflect it in the long run. Some sort of smoothing, maybe? This could obviously be done at the data layer, though, so it's not a requirement at all.
Finally, comparing two points in time, or comparing two different client identifiers of the same service etc without too much effort would be great.
I'm not partial to any specific language, although I'd prefer something in (one of the following) PHP, Python, C/C++, C#, as these are languages I'm familiar with. It doesn't have to be open source, it doesn't have to be a library, I'm open to using whatever fits my purpose the best.
More of a P.S. than a requirement: I'd like pretty charts that are easy for non-technical people to understand and act upon (and that they enjoy looking at!).
I'm open to clarifying, and, in advance, thanks for your time!
I am pretty sure that protovis meets all your requirements. But it has a bit of a learning curve. You are meant to learn by examples, and there are plenty to work from. It makes some pretty nice graphs by default. Every value can be a function, so you can do things like get rid of your "Bad" values.
I realise this is a hugely subjective topic, but here is my current take:
When calling methods which do not form part of the .NET BCL, named parameters should always be used, as the method signatures may well change, especially during the development cycle of my own applications.
Although they might appear more verbose, they are also far clearer.
Is the above a reasonable approach to calling methods or have I overlooked something fundamental?
I think it's over the top, personally. (I also believe it's more correct and clearer to call them named arguments - the parameters always have names, as that's part of the declaration.)
In many cases the meaning is crystal clear from the method name - particularly if there are only a couple of arguments. Why bother to add the name? Why does the fact that the name may change affect this?
I would suggest using named arguments:
When optional parameters are also being used
When there are multiple parameters of the same type, so they can be confused (e.g. a dialog box's title and content text)
When you're using null for an argument, and it's not clear what it's doing
To put it another way, how often did not having named arguments cause you confusion before C# 4? In my case it's certainly a non-zero number of times - the situations above do happen - but it's not something that regularly got in my way. Optional parameters make it more reasonable for a method (or particularly a constructor) to have potentially many parameters, and that leads to named arguments being more useful (and indeed necessary if you want to specify "late" optional parameters having skipped earlier ones).