A device on a car will NOT send a TRIP ID when the trip starts, but will send one when the trip ends. How do I apply the corresponding TRIP IDs to the corresponding records?
09:30,25,DEVICE_1
10:30,55,DEVICE_1
10:25,0,DEVICE_1,TRIP_ID_0
11:30,45,DEVICE_1
10:30,55,DEVICE_2
10:30,55,DEVICE_3
11:30,45,DEVICE_3
12:30,0,DEVICE_3,TRIP_ID_3
10:30,55,DEVICE_4
11:30,45,DEVICE_4
11:30,45,DEVICE_2
12:30,0,DEVICE_2,TRIP_ID_2
12:30,0,DEVICE_4,TRIP_ID_4
10:30,55,DEVICE_5
11:30,45,DEVICE_5
12:30,0,DEVICE_5,TRIP_ID_5
12:30,0,DEVICE_1,TRIP_ID_1
So the above should become this:
09:30,25,DEVICE_1,TRIP_ID_0
10:25,0,DEVICE_1,TRIP_ID_0
10:30,55,DEVICE_1,TRIP_ID_1
11:30,45,DEVICE_1,TRIP_ID_1
12:30,0,DEVICE_1,TRIP_ID_1
10:30,55,DEVICE_2,TRIP_ID_2
11:30,45,DEVICE_2,TRIP_ID_2
12:30,0,DEVICE_2,TRIP_ID_2
10:30,55,DEVICE_3,TRIP_ID_3
11:30,45,DEVICE_3,TRIP_ID_3
12:30,0,DEVICE_3,TRIP_ID_3
10:30,55,DEVICE_4,TRIP_ID_4
11:30,45,DEVICE_4,TRIP_ID_4
12:30,0,DEVICE_4,TRIP_ID_4
10:30,55,DEVICE_5,TRIP_ID_5
11:30,45,DEVICE_5,TRIP_ID_5
12:30,0,DEVICE_5,TRIP_ID_5
An interesting problem. Had to fix one bug!
You will need to convert this to spark.sql, as I tried it in Oracle, but the WITH clause is supported in spark.sql. Also, instead of using date strings (it is quite late here), I just used numbers to represent time, so you will need to adjust that.
Here is the SQL that you can adapt:
with X as (
    -- only the trip-end records carry a TRIP_ID
    select device, time_asc, trip_id from trips where trip_id is not null
)
select Y.TRIP_ID, Y.DEVICE, Y.TIME_ASC
from (
    select T1.TIME_ASC, T1.DEVICE, X.TRIP_ID, X.TIME_ASC as TIME_ASC_COMPARE,
           -- rank the candidate trip ends for each record, earliest end time first
           rank() over (partition by T1.TIME_ASC, T1.DEVICE order by X.TIME_ASC) as RANK_VAL
    from trips T1, X
    where T1.DEVICE = X.DEVICE
      and T1.TIME_ASC <= X.TIME_ASC
) Y
where RANK_VAL = 1
order by TRIP_ID, TIME_ASC
Get rid of the ORDER BY; it is only there to show the result.
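If you want to run it on Spark, here is a minimal PySpark sketch of the same idea, under the assumption that the data is loaded into a DataFrame with columns time_asc, device and trip_id (a few sample rows from the question are used below; column names and time handling are yours to adjust):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A few of the sample rows from the question; trip_id is only set on trip-end records.
trips_df = spark.createDataFrame(
    [("09:30", 25, "DEVICE_1", None),
     ("10:25", 0, "DEVICE_1", "TRIP_ID_0"),
     ("10:30", 55, "DEVICE_1", None),
     ("11:30", 45, "DEVICE_1", None),
     ("12:30", 0, "DEVICE_1", "TRIP_ID_1")],
    ["time_asc", "value", "device", "trip_id"],
)
trips_df.createOrReplaceTempView("trips")

result = spark.sql("""
    WITH X AS (SELECT device, time_asc, trip_id FROM trips WHERE trip_id IS NOT NULL)
    SELECT Y.trip_id, Y.device, Y.time_asc FROM (
        SELECT T1.time_asc, T1.device, X.trip_id,
               RANK() OVER (PARTITION BY T1.time_asc, T1.device ORDER BY X.time_asc) AS rank_val
        FROM trips T1 JOIN X ON T1.device = X.device AND T1.time_asc <= X.time_asc
    ) Y
    WHERE rank_val = 1
""")
result.show()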
In my test, with this data as input:
('1','A',null);
('2','A','TRIP_01');
('5','A',null);
('6','A',null);
('7','A',null);
('23','A','TRIP_02');
('56','A',null);
('60','A','TRIP_04');
('8','B',null);
('10','B','TRIP_03');
('1','E',null);
('2','E','TRIP_05');
Ignore the quotes (that is just the format my export produced); the query returns the following, which I think will meet your needs - again, excuse the formatting:
('TRIP_01','A','1');
('TRIP_01','A','2');
('TRIP_02','A','5');
('TRIP_02','A','6');
('TRIP_02','A','7');
('TRIP_02','A','23');
('TRIP_03','B','8');
('TRIP_03','B','10');
('TRIP_04','A','56');
('TRIP_04','A','60');
('TRIP_05','E','1');
('TRIP_05','E','2');
I am wondering how well Spark handles this in terms of under-the-hood performance. This took some effort late at night, so some appreciation is sought. It was enjoyable as well.
I am trying to compare the values in two many-to-many fields... this almost works:
author_confirm = Author.objects.filter(id=self.object.update_reader_id).values_list('author_confirm').order_by('pk')
author = Author.objects.filter(id=self.object.update_reader_id).values_list('author').order_by('pk')
authors_are_equal = list(author) == list(author_confirm)
authors_are_not_equal = list(author) != list(author_confirm)
This works in that it gets me the values, but it doesn't seem to be cooperating with the order_by. I currently have both fields with identical values, but their PKs are transposed, so it tells me these fields are not identical - which is technically correct - but I see the problem is that the PKs are not listed in order. Is there a way to do this without a through table?
I am using UUIDs as the primary key. I'm not sure if this is relevant or not, but nonetheless I can't seem to get the values in an ordered way.
Thanks in advance for any ideas.
You should order by author__pk and author_confirm__pk; otherwise you are only ordering by the Author object itself, which we already know: that is self.object.update_reader_id, hence the problem:
author_confirm = (
    Author.objects.filter(id=self.object.update_reader_id)
    .values_list('author_confirm')
    .order_by('author_confirm__pk')
)
author = (
    Author.objects.filter(id=self.object.update_reader_id)
    .values_list('author')
    .order_by('author__pk')
)
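With both querysets ordered by the related primary key, the comparison from the question should then line up (reusing the same names; values_list with a single field returns 1-tuples, which is fine as long as both sides are built the same way):

authors_are_equal = list(author) == list(author_confirm)
authors_are_not_equal = not authors_are_equal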
I have a collection which holds documents, with each document having a data observation and the time that the data was captured.
e.g.
{
    _key: ....,
    "data": 26,
    "timecaptured": 1643488638.946702
}
where timecaptured for now is a utc timestamp.
What I want to do is get the duration between consecutive observations. With SQL I could do this with LAG, for example, but with ArangoDB and AQL I am struggling to see how to do this at the database level. So, effectively, I want the difference in timestamps between two documents in time order. I have a lot of data and I don't really want to pull it all into pandas.
Any help really appreciated.
Although the solution provided by CodeManX works, I prefer a different one:
FOR d IN docs
    SORT d.timecaptured
    WINDOW { preceding: 1 } AGGREGATE s = SUM(d.timecaptured), cnt = COUNT(1)
    LET timediff = cnt == 1 ? null : d.timecaptured - (s - d.timecaptured)
    RETURN timediff
We simply calculate the sum of the previous and the current document's timecaptured, and by subtracting the current document's timecaptured we recover the timecaptured of the previous document. From there we can easily calculate the requested difference.
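As a toy illustration of that arithmetic (plain Python, with made-up integer timestamps):

# two-document sliding window: previous and current timecaptured (made-up values)
window = [1643488600, 1643488638]
s = sum(window)            # what AGGREGATE s = SUM(d.timecaptured) yields for this window
current = window[-1]
previous = s - current     # recovers the predecessor's timecaptured
print(current - previous)  # 38, the requested difference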
I only use the COUNT to return null for the first document (which has no predecessor). If you are fine with having a difference of zero for the first document, you can simply remove it.
However, neither approach is very straightforward or obvious. I have put adding an APPEND aggregate function, usable in WINDOW and COLLECT operations, on my TODO list.
The WINDOW operation doesn't give you direct access to the documents in the sliding window, but here is a rather clever workaround:
FOR doc IN collection
    SORT doc.timecaptured
    WINDOW { preceding: 1 }
    AGGREGATE d = UNIQUE(KEEP(doc, "_key", "timecaptured"))
    LET timediff = doc.timecaptured - d[0].timecaptured
    RETURN MERGE(doc, {timediff})
The UNIQUE() function is available for window aggregations and can be used to get at the desired data (previous document). Aggregating full documents might be inefficient, so a projection should do, but remember that UNIQUE() will remove duplicate values. A document _key is unique within a collection, so we can add it to the projection to make sure that UNIQUE() doesn't remove anything.
The time difference is calculated by subtracting the previous document's timecaptured value from the current document's. For the first record, d[0] is actually equal to the current document and the difference ends up being 0, which I think is sensible. You could also write d[-1].timecaptured - d[0].timecaptured to achieve the same. d[1].timecaptured - d[0].timecaptured, on the other hand, will give you the inverted timestamp for the first record, because d[1] is null (there is no previous document) and evaluates to 0.
There is one risk: UNIQUE() may alter the order of the documents. You could use a subquery to sort by timecaptured again:
LET timediff = doc.timecaptured - (
    FOR dd IN d SORT dd.timecaptured LIMIT 1 RETURN dd.timecaptured
)[0]
But it's not great for performance to use a subquery. Instead, you can use the aggregation variable d to access both documents and calculate the absolute value of the subtraction so that the order doesn't matter:
LET timediff = ABS(d[-1].timecaptured - d[0].timecaptured)
I'm using a Python pandas DataFrame for data analysis of some logs.
I have a CSV with something like:
number_items event_type ... ... ... session_id ... ... ...
My problem is that within a session there are different types of events, and only one of them has a value for number_items. And number_items is what interests me.
So what I want to see is how each parameter of each event influences number_items.
So, what I want to do is:
1. Copy the number_items of the event that has it (always the last one in the session) to all the other events of the session.
2. Separate each event_type into a different DataFrame (to avoid a lot of nulls that exist only because the attribute doesn't correspond to the event) and analyse it.
I'm stuck on the first part.
I tried something like this:
currentSession = '0'
currentItems = 0
for index, row in reversed(df.iterrows()):
    if row['session_id'] == currentSession:
        row['number_items'] = currentItems
    else:
        currentSession = row['session_id']
        currentItems = row['number_items']
Obviously it's not working; I just wanted to show the idea.
I'm kind of new in Python, so I would appreciate some help.
Thanks
edit: data sample here
For security reasons, I kept only the relevant information.
The rows you get back from iterrows are copies, so they don't overwrite your original DataFrame. Use another form of iteration that references the original DataFrame.
See here: Updating value in iterrow for pandas
(Also, I'm not entirely sure what it is you are trying to do, but instinctively it seems very inefficient - I suspect there are natural pandas methods which might do what you are trying to achieve in one or two lines; look up the where() method.)
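For example, here is a minimal sketch of that kind of one-or-two-line approach, assuming df has the columns from your sample (session_id, number_items, event_type) and that the populated number_items row is always the last event of each session, as you describe:

import pandas as pd

df = pd.read_csv("logs.csv")  # hypothetical filename standing in for your log CSV

# Broadcast the last non-null number_items of each session to every row of that session.
df['number_items'] = (
    df.groupby('session_id')['number_items']
      .transform('last')
)

Splitting by event type afterwards is then just a matter of df[df['event_type'] == some_type] or a groupby('event_type').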
I used code generated by the Slick code generator.
My table has more than 22 columns, hence it uses an HList.
It generates one type and one function:
type AccountRow
def AccountRow(uuid: java.util.UUID, providerid: String, email: Option[String], ...):AccountRow
How do I write compiled insert code from generated code?
I tried this:
val insertAccountQueryCompiled = {
  def q(uuid: Rep[UUID], providerId: Rep[String], email: Rep[Option[String]], ...) = Account += AccountRow(uuid, providerId, email, ...)
  Compiled(q _)
}
I need to convert Rep[T] to T for AccountRow function to work. How do I do that?
Thank you
TL;DR: Not possible.
Explanation
There are two levels of abstraction in Slick: Querys and DBIOActions.
When you're dealing with Querys, you work with your schema definitions, rows and Reps; it's basically very constrained, as it's the level of abstraction closest to the actual DB you're using. A Rep refers to a hypothetical value in the database, not to a value in your program.
Then you have DBIOActions, which are the next level up: not just the definition of a query, but the execution of it. You usually get DBIOActions when getting information out of a query, for example with the result method, or (ta-da!) when inserting rows.
Inserts and updates are not queries, so what you're trying to do is not possible. You're mixing DBIOAction territory (the += method) with Query territory (the Rep types). The only way to get a Rep inside a DBIOAction is to execute a Query, obtain a DBIOAction, and then compose both actions using flatMap or a for comprehension (which is the same thing).
I have a large collection of documents and each is valid for a range of days. The range could be from 1 week up to 1 year. I want to be able to get all the documents that are valid on a specific day.
How would I do that?
As an example say I have the following two documents:
doc1 = {
    // 1 year ago to today
    start_at: "2012-03-22T00:00:00Z",
    end_at: "2013-03-22T00:00:00Z"
}
doc2 = {
    // 2 months ago to today
    start_at: "2013-01-22T00:00:00Z",
    end_at: "2013-03-22T00:00:00Z"
}
And a map function:
(doc) ->
    emit([doc.start_at, doc.end_at], null)
So for a date of 6 months ago I would get only doc1, for a date of 1 week ago I would get both documents, and for a date of tomorrow I would receive no documents.
Note that actual resolution needs to be down to the second of the request being made and there are lots of documents, so strategies of emitting a key for every valid second would not be appropriate.
You could call emit for each day in your range, and then you can easily pick out the documents available for a specific day.
function(doc) {
    var day = new Date(doc.start_at),
        end = new Date(doc.end_at).getTime();
    do {
        emit(day);
        day = new Date(day.getFullYear(), day.getMonth(), day.getDate() + 1);
    } while (day.getTime() <= end);
}
Even though you will have lots of documents, if you leave out the value part (2nd param) of your emit, the index will be as small as it could possibly be.
If you need to get more sophisticated, you could try out couchdb-lucene. You can index date fields as date objects and execute range queries with multiple fields in 1 request.
You can translate the problem into the computational-geometry problem of point location. Placing documents in a two-dimensional plane as [x, y] = [start_at, end_at], the query for the documents valid at a given date is the set of points in the rectangle bounded by left = -infinity, right = date (start_at < date) and bottom = date, top = infinity (end_at > date).
Unfortunately, the CouchDB team underrates the power of computational geometry and does not support multidimensional queries. There is the GeoCouch extension, which allows you to do this kind of query as easily as:
http://localhost:5984/places/_design/main/_spatial/points?bbox=0,0,180,90
on a view emitting a spatial value:
emit({ type: "Point", coordinates: [doc.start_at, doc.end_at] }, doc);
The problem is the data type. You get floats in the ranges [-180.0, 180.0] / [-90.0, 90.0], and you need at least an int (UNIX time format). If GeoCouch works for you with ranges bigger than 180.0, and the precision of floating-point operations designed for geographical calculations is sufficient for dates with a precision of seconds, your problem is solved :) I am sure that, with a few tricks and hacks, you could solve this problem efficiently in geo software. If not GeoCouch, then perhaps Elasticsearch (which also supports multidimensional queries), which is easy to use with CouchDB via its River plugin system.