How do I create an array of resources using Jena? - resources

I am using Jena with Java and reading a CSV file. Each line of the file has a subject resource. Two subject resources, on adjacent lines, might share the same value of a field in the line (e.g. both lines have the same process ID). In this case, I need to combine the two subject resources, as each one represents, for example, a sub-process in production.
My question is: how can I reference those two resources dynamically so that I can combine them? My idea is that when I find they share the same property value, I store them in an array of subject resources. Is that the right approach?

This question would be a lot easier to answer if you could show some sample data. As it is, I think you're focusing on the wrong part of the question. If you can decide clearly what it means to have two rows in your CSV with the same process ID, and then decide how you're going to encode that meaning in your RDF model, the question of how to write the code - as an array or whatever - will be much clearer.
For example (and I'm going to make up some data here - as I said, it would be easier if you showed an actual example), suppose your CSV contains:
processId,startTime,endTime
123,15:22:00,15:23:00
123,16:22:00,16:25:00
So process 123 apparently has two start/end time pairs. If you model this naively in RDF, you'll end up with a confusing model:
process:process123
    a :Process;
    process:start "15:22:00"^^xsd:time;
    process:end "15:23:00"^^xsd:time;
    process:start "16:22:00"^^xsd:time;
    process:end "16:25:00"^^xsd:time;
    .
which would suggest that one process had two start times (and two end times), which looks nonsensical. However, it might be that in reality you have a single process with multiple episodes, suggesting one way to model it, or a periodic process which occurs at different times, or, as you suggested, sub-processes of a parent process. Or something else entirely (I'm only guessing; I don't know your domain). Once you are clear about what the data means, you can produce a suitable RDF model. For example, an episodic process might be:
process:process123
    a :Process;
    process:episode [
        a process:Episode;
        process:start "15:22:00"^^xsd:time;
        process:end "15:23:00"^^xsd:time;
    ];
    process:episode [
        a process:Episode;
        process:start "16:22:00"^^xsd:time;
        process:end "16:25:00"^^xsd:time;
    ]
    .
Once the modelling is clear in your mind, I think you'll see that the question of how to produce the desired RDF triples from Java code - and whether or not you need an array - is much clearer. Equally importantly, you can think in terms of the JUnit tests you would write to check whether your code is behaving correctly.
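To make the Java side concrete, here is a minimal sketch of how the episode model above might be built with the Jena API (assuming a recent Apache Jena version; the namespace, property names, and CSV rows are made up to match the example and are not from the original question). Because Jena returns the same resource for the same URI within a model, rows that share a process ID simply accumulate on one subject resource - no array of resources is needed.

import org.apache.jena.rdf.model.*;
import org.apache.jena.datatypes.xsd.XSDDatatype;
import org.apache.jena.vocabulary.RDF;

public class CsvToEpisodes {
    // Illustrative namespace only; substitute your own vocabulary.
    static final String NS = "http://example.com/process#";

    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        model.setNsPrefix("process", NS);

        Property episode = model.createProperty(NS, "episode");
        Property start   = model.createProperty(NS, "start");
        Property end     = model.createProperty(NS, "end");
        Resource Process = model.createResource(NS + "Process");
        Resource Episode = model.createResource(NS + "Episode");

        // Pretend these rows came from the CSV parser: {processId, startTime, endTime}.
        String[][] rows = {
            {"123", "15:22:00", "15:23:00"},
            {"123", "16:22:00", "16:25:00"}
        };

        for (String[] row : rows) {
            // createResource with the same URI always refers to the same subject,
            // so adjacent rows with the same process ID merge automatically.
            Resource proc = model.createResource(NS + "process" + row[0])
                                 .addProperty(RDF.type, Process);

            Resource ep = model.createResource()   // blank node for the episode
                .addProperty(RDF.type, Episode)
                .addProperty(start, model.createTypedLiteral(row[1], XSDDatatype.XSDtime))
                .addProperty(end,   model.createTypedLiteral(row[2], XSDDatatype.XSDtime));
            proc.addProperty(episode, ep);
        }

        model.write(System.out, "TURTLE");
    }
}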

Related

Do I use foreach for 2 different inspection checks in activity diagram?

I am new to drawing activity diagrams and currently I am trying to draw one based on a given description.
I am in doubt about a particular section, as I am unsure whether it should be 'split'.
Under the "Employee", the given description is as follows:
The employee enters details about physical damage and cleanliness of the machine. For cleanliness, there must be a statement to indicate that the problem is no longer an issue.
As such, I used a foreach to describe that there should be two checks - physical damage and cleanliness (see the diagram in the link) - before moving on to the next activity under the System, where the system records the checks.
Thus, am I on the right track? Thank you in advance for any replies.
Your example is not valid UML. To make it proper, you need to enclose the fork/join in an expansion region, like so:
A fork/join does not accept any semantic labels; it just splits the control flow into several parallel flows which join at the end.
However, this still seems odd, since you would probably have some control over the different inspections being entered. So I'd guess there's a decision which loops through multiple inspection entries. Personally, I use regions only for handling interrupts. Activity diagrams are nice up to a certain level, but sometimes a tabular text (as suggested by Cockburn) is just easier to write and read. Graphical programming is not the ultimate answer (unlike 42).
First, the 'NO' branch of the decision node must lead somewhere (to the end?).
Second, it differs depending on whether you want to show the process for ONE or MULTIPLE inspections. The most logical way is to draw the diagram for a single inspection, because you wrote 'inspection' without an 's'! If you want to represent more than one inspection, you can use decision and merge nodes to represent a loop that stops when there are no more inspections.

Should Spark transformation be broken to functions

It looks like all the Spark examples found on the web are built as one single long function (usually in main).
But it often makes sense to break the long call chain into functions, in order to:
Improve readability
Share code between solution paths
A typical signature would look like this (this is Java code, but a similar signature would appear in any language):
private static Dataset<Row> myFiltering(Dataset<Row> data) {
    return data
        .filter(functions.col("firstName").isNotNull())
        .filter(functions.col("lastName").isNotNull());
}
The problem here is that there is no safety for the content of the Row: nothing enforces which fields it contains, so calling the function correctly is not only a matter of matching the signature but also of matching the content of the Row, which obviously may (and in my case does) cause errors.
What is the best practice you enforce in large-scale development environments? Do you leave the code as one long function? Do you suffer every time you change field names?
Yes, you should split your method into smaller methods. Be aware, though, that too many small functions are also not very readable ;)
My rules:
Split a chain of transformations if you can name it - if it has a name, it means it is some kind of sub-function.
If there are many functions of the same domain, extract a new class or even a package.
KISS: don't split if your call chain is short and already describes some sub-function, even if some lines could be extracted - reading many, many custom functions is not more readable.
Any larger filters or maps - anything that doesn't fit in 1-3 lines - I suggest extracting to a method and using a method reference (see the sketch after this list).
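As a minimal sketch of that last rule (the class, method, and column names here are made up, not from the question), an extracted, named transformation can also check up front that the columns it relies on actually exist, which partly addresses the "no enforcement on the fields" concern:

import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

public final class NameFilters {

    // Named transformation: keeps only rows where both name columns are set.
    static Dataset<Row> withCompleteNames(Dataset<Row> data) {
        requireColumns(data, "firstName", "lastName");
        return data
            .filter(functions.col("firstName").isNotNull())
            .filter(functions.col("lastName").isNotNull());
    }

    // Fail fast with a clear message instead of an analysis error deep inside
    // the plan when a field name changes.
    private static void requireColumns(Dataset<Row> data, String... cols) {
        for (String col : cols) {
            if (!Arrays.asList(data.columns()).contains(col)) {
                throw new IllegalArgumentException("Missing column: " + col);
            }
        }
    }
}

// Usage, keeping the pipeline readable as a chain of named steps:
//   Dataset<Row> cleaned = NameFilters.withCompleteNames(rawPeople);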
Marked as Community Wiki, because this is only my point of view and it's off-topic for Stack Overflow. If someone else has other suggestions, please share them :)

matching common strings between two data sets

I am working on a website conversion. I have a dump of the database backend as an SQL file. I also have a scrape of the website from wget.
What I'm wanting to do is map database tables and columns to directories, pages, and sections of pages in the scrape. I'd like to automate this.
Is there some tool or script out there that could pull strings from one source and look for them in the other? Ideally, it would return a set of results saying something like:
string "piece of website content here" on line 453 in table.sql matches string in website.com/subdirectory/certain_page.asp on line 56.
I don't want to do line comparisons because lines from the database dump (INSERT INTO table VALUES (...) ) aren't going to match lines in the page where it actually populates (<div id='left_column'><div id='left_content'>...</div></div>).
I realize this is a computationally intensive task, but even letting it run over the weekend is fine.
I've found similar questions, but I don't have enough CS background to know if they are identical to my problem or not. SO kindly suggested this question, but it appears to be dealing with a known set of needles to match against the haystack. In my case, I need to compare haystack to haystack, and see matching straws of hay.
Is there a command-line script or command out there, or is this something I need to build? If I build it, should I use the Aho–Corasick algorithm, as suggested in the other question?
So your two questions are 1) Is there already a solution that will do what you want, and 2) Should you use the Aho-Corasick algorithm.
The first answer is that I doubt you'll find a ready-built tool that will meet your needs.
The second answer is that, since you don't care about performance and have a limited CS background, you should use whatever algorithm you find simplest to implement.
I will go one step further and propose an architecture.
First, you need to be able to parse the .sql files in a meaningful way, with a parser that goes line by line and returns table name, column name, and value. A StreamReader is probably best for this.
Second, you need a parser for your web pages that goes element by element and returns each text node along with the names of its parent elements all the way up to the html element, plus the containing filename. An XmlTextReader or a similar streaming XML parser, such as SAXON, is probably best, as long as it will operate on invalid XML.
You would need to tie these two parsers together with a mutual search algorithm of some sort. You will have to customize it to suit your needs. Aho-Corasick will apparently get you the best performance if you can pull it off. A naive algorithm is easy to implement, though, and here's how:
Assuming you have your two parsers that loop through each field (on the one hand) and each text node (on the other hand), pick one of the two parsers and have it go through each of the strings in its data source, calling the other parser to search the other data source for all possible matches, and logging the ones it finds.
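A very rough sketch of that naive approach in Java, assuming the two parsers have already been reduced to lists of (location, string) pairs (all class, record, and field names here are made up for illustration):

import java.util.List;

public class NaiveCrossMatcher {

    // A value extracted from either source, plus where it came from,
    // e.g. "table.sql line 453" or "subdirectory/certain_page.asp line 56".
    record Extracted(String location, String text) {}

    // For every database value, scan every page text node for a containment
    // match and report it. Quadratic, but acceptable for an over-the-weekend run.
    static void crossMatch(List<Extracted> dbValues, List<Extracted> pageTexts) {
        for (Extracted value : dbValues) {
            if (value.text().isBlank()) {
                continue; // skip empty strings, which would match everything
            }
            for (Extracted node : pageTexts) {
                if (node.text().contains(value.text())) {
                    System.out.printf("string \"%s\" at %s matches %s%n",
                            value.text(), value.location(), node.location());
                }
            }
        }
    }
}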
This cannot work, at least not reliably. Best case: you would fit every piece of data to its counterpart in your HTML files, but you would have many false positives - for example, user names that are also ordinary words.
Furthermore, text is often manipulated before it is displayed; sites often capitalize titles or truncate text for previews, etc.
AFAIK there is no such tool, and in my opinion there cannot exist one that solves your problem adequately.
Your best choice is to get the source code the site uses/used and analyze it. If that fails or is not possible, you have to analyze the database manually. Get as much content as possible from the URLs and try to fit the pieces of the puzzle together.

Generic graphing and charting solutions

I'm looking for a generic charting solution (ideally not a hosted one) that provides the following features:
Charting a tuple of values where the values are:
1) A service identifier (e.g. CPU usage)
2) A client identifier within that service (e.g. server IP)
3) A value
4) A timestamp with millisecond/second resolution.
Optional:
I'd also like to extend the concept of a client identifier: taking the above example further, I'd like to store statistics for each core separately, so another identifier would be Core 1/Core 2...
Now, to make sure I'm clearly stating my problem: I don't want a utility that collects these statistics. I'd like something that stores them, but even that is not mandatory; I can always store them in MySQL or the like.
What I'm looking for is something that takes values such as these and charts them nicely, in a multitude of ways (timelines, motion charts, and the usual ones [pie, bar..]). Essentially, a nice visualization package that allows me to make use of all this data. I'd be collecting data from multiple services and multiple applications, and the data points will be of varying resolution. Some of the data will include multiple layers of nesting, some none. (For example, CPU would go down to server IP and CPU#, whereas memory would only go down to server IP but would include a different identifier, i.e. free/used/cached, as the "secondary" identifier. Something like average request latency might not have a secondary identifier at all, as in the case of ping.) What I'm trying to get across is that having multiple layers of identifiers would be great. To add one final example of where multiple identifiers would be great: adding an extra identifier on top of IP/CPU#, namely the process name. I think the advantages of that are obvious.
For some applications, we might collect data at a very narrow scope, focusing on every aspect, in other cases, it might be a more general statistic. When stuff goes wrong, both come in useful, the first to quickly say "something just went wrong", and the second to say "why?".
Further, it would be nice if the charting application threw out "bad" values; that is, if for some reason our monitoring program started to report 300% CPU usage on a single core for 10 seconds, it'd be nice if the charts themselves didn't reflect it in the long run. Some sort of smoothing, maybe? This could obviously be done at the data layer, though, so it's not a requirement at all.
Finally, comparing two points in time, or comparing two different client identifiers of the same service etc without too much effort would be great.
I'm not partial to any specific language, although I'd prefer something in (one of the following) PHP, Python, C/C++, C#, as these are languages I'm familiar with. It doesn't have to be open source, it doesn't have to be a library, I'm open to using whatever fits my purpose the best.
More of a P.S. than a requirement: I'd like to have pretty charts that are easy for non-technical people to understand, act upon, and enjoy looking at!
I'm happy to clarify anything, and thanks in advance for your time!
I am pretty sure that Protovis meets all your requirements, but it has a bit of a learning curve. You are meant to learn by example, and there are plenty of examples to work from. It makes some pretty nice graphs by default. Every value can be a function, so you can do things like get rid of your "bad" values.

How to test a program processing large amounts of data stored in an unpredictable format

What I have to do
I'm trying to manipulate some rather large amounts of data stored in Excel files (one of the workbooks has as many as 150 worksheets). The result of these manipulations may yield approximately 800,000 rows in a database table.
The problem
Data stored in the spreadsheets has an unpredictable format. The company that generated these spreadsheets had no fixed or documented format for exporting these files, and sometimes erroneous data appears. For example, most of the years are represented like "2009", but there are cases where a year is represented as "20". Another example: the data is not really normalized in these files, so I use separators to split the values of certain cells. Sometimes these separators change.
There are things like these that I couldn't predict, and I only discovered them after running an already evolved version of my program over a pretty large part of the available data.
The question
How can one test the correctness of a program in such a situation? Or rather, how can I achieve a pretty stable version of the product without running it over all the available data?
Should I take a defensive approach and throw exceptions whenever some kind of unexpected issue arises, so that the main loop of the program catches and logs them and continues with the remaining data? This would yield some processed data, but it means that on a subsequent run of the program I would have to check what's already in the database from previous runs (which I don't really like).
What's your opinion? How would you tackle this problem?
If there is no specification for what the format of the data is, then anything is acceptable.
If not, then there is either an explicit or an implicit specification of the data. I would try to nail this down right now. If you can't get an explicit enough definition of the data to write your program so that it can be expected to run without error, then I would say you are taking a very large risk of causing some serious damage, depending on how this data is being used.
You should write your program so that it either throws an exception or logs an error whenever it runs across data that does not meet the specification. Then, run the program on PART of the available data until it runs without exceptions. This can be viewed as a training set for the development of your program. Then, use some of the reserved data as a TEST set. This will give you an estimate of how many exceptions/errors your program will generate in production.
Overfitting is a common machine learning concept, but it is useful for other tasks too, such as this one - program development. It surprises me how developers can write a bunch of unit tests, code their application to perform well on them, and then expect similar or bug-free performance in production.
If you're not willing to take all these steps (i.e. run your code on essentially all of the data, since the test set also makes use of the data), then I would say the task is too large to do.
As an aside, rather than creating a definition of a format that is very strange and peculiar in order to account for all the "errors" in the current data, you might want to create a new, normalized (in the sense that these things are simplified away) specification for the data, and then write a "faulty document patcher" that can be run on faulty documents to fix the data.
If the application generating the data is still in production, then you might need to go to the developers of this application to get buy-in on the new spec. Once you have that, you can start logging bugs against their application, so hopefully the faulty document patcher can eventually be retired.
More likely, I'm guessing that the software developers are long gone, no one understands the code anymore, if it is even running at all.
How can one test the correctness of a program in such a situation? Or rather, how can I achieve a pretty stable version of the product without running it over all the available data?
For every single data type, I would set reasonable constraints on the values it is allowed to take.
If a cell violates these constraints, throw an exception containing the piece of data it failed on and its data type. When a piece of data violates its constraints, you can modify the source to include the additional constraints required for that piece of data, plus a conversion method to make it uniform.
To take the year example you gave: initially, a year would have the constraint that it must be exactly four digits. When the program came across the "20", it would throw an exception.
Then you could go back and allow two-digit years, and add a method to convert a two-digit year into a four-digit one so that processing can continue.
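A minimal sketch of that idea, using a hypothetical YearParser (all names are made up, not from the original post): it rejects anything outside the known constraints and normalizes the two-digit case once you have decided what that case means.

public final class YearParser {

    // Thrown (and later logged by the main loop) when a cell violates its constraints.
    public static class ConstraintViolation extends RuntimeException {
        public ConstraintViolation(String message) {
            super(message);
        }
    }

    // Initial rule: a year must be exactly four digits.
    // Relaxed rule, added after hitting "20" in the data: allow two digits and
    // convert them, assuming they mean 20xx (an assumption the data owner
    // would need to confirm).
    public static int parseYear(String cell) {
        if (cell == null || !cell.matches("\\d{2}|\\d{4}")) {
            throw new ConstraintViolation("Year is not 2 or 4 digits: " + cell);
        }
        int value = Integer.parseInt(cell);
        return cell.length() == 2 ? 2000 + value : value;
    }
}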
One question is: will you run your program more than once? From your question it sounds as if you might only want to run it once and then work with the data in the database.
In that case you can be very defensive - throw exceptions whenever unexpected data appears. Run the program repeatedly on ever-larger sets of the data. Initially, solve any exceptions by altering the code, as it's a good rule of thumb that the exceptions you find first are going to be common. You might want to empty the output database between runs.
Later on, you will be finding rare exceptions that might only occur a couple of times in the input. Just solve these by hand and insert the corresponding rows in the database yourself. Or write another small program that reads your exception information and inserts the new rows, rather than running your whole big program again.
Typically, for this sort of thing I do as #MarkJ suggested, and encode the whole thing in unit tests.
So I compose a small datafile that at first contains only a few rows of normal data. That's unit test number 1.
Then I take a quick visual scan of some of the data to spot any obvious exceptions. Unit tests 2 through n.
Finally, I write parser code until it passes all the unit tests and throws and logs exceptions for all unmanaged data.
I then use these oddball bits of data to make new unit tests, and improve the parser until it can pass those too.
Sometimes, though, accommodating a really strange bit of data adds more parser complexity than it's worth; in that case I'll just log the exception, dump the data, and move on. This is a matter of professional judgment.
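As a hedged sketch of that workflow (reusing the hypothetical YearParser from the earlier sketch and assuming JUnit 4 is on the classpath), each oddball value discovered in the data becomes another small test case:

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class YearParserTest {

    // Unit test 1: a few rows of normal data.
    @Test
    public void parsesNormalFourDigitYear() {
        assertEquals(2009, YearParser.parseYear("2009"));
    }

    // Unit tests 2..n: oddballs spotted by scanning the data.
    @Test
    public void convertsTwoDigitYear() {
        assertEquals(2020, YearParser.parseYear("20"));
    }

    // Unmanaged data should fail loudly so the main loop can log it.
    @Test(expected = YearParser.ConstraintViolation.class)
    public void rejectsGarbage() {
        YearParser.parseYear("two thousand nine");
    }
}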
How about processing every piece of data (so you don't have to check for duplicates)? The pieces that pass go into the database; the exceptions go into an exception file. The user can open the exception file, make corrections or modifications to the data, and then run your program on the exception file.
This will isolate unhandled data for the user to correct and prevent you from processing the same data twice (or more).
