I have successfully done this in Python using cKDTree and multiprocessing across 20+ processes, but it hits a wall when I try to scale it up to millions of rows. I'm wondering how this could be done in Hive or Spark.
One dataset contains all customers and their GPS coordinates. The other contains all fire hydrants and their GPS coordinates. Like so:
tblCustomer (100MM rows)
+-------------+-------------------+--------------------+
| customer_id | customer_latitude | customer_longitude |
+-------------+-------------------+--------------------+
| 123 | 42.123456 | -56.123456 |
+-------------+-------------------+--------------------+
| 456 | 44.123456 | -55.123456 |
+-------------+-------------------+--------------------+
tblFireHydrants (50MM rows)
+----------------+------------------+-------------------+
| firehydrant_id | hydrant_latitude | hydrant_longitude |
+----------------+------------------+-------------------+
| 123456 | 42.987654 | -55.984657 |
+----------------+------------------+-------------------+
| 456233 | 45.569841 | -55.978946 |
+----------------+------------------+-------------------+
The goal is to count how many fire hydrants are within radius r (meters) of each customer_id.
After repeating this for several distances, the final result will look like this:
+-------------+----------------------+----------------------+-----------------------+
| customer_id | hydrants_within_100m | hydrants_within_500m | hydrants_within_1000m |
+-------------+----------------------+----------------------+-----------------------+
| 123456 | 0 | 1 | 6 |
+-------------+----------------------+----------------------+-----------------------+
| 456233 | 1 | 1 | 9 |
+-------------+----------------------+----------------------+-----------------------+
I could start from scratch and try to use KD-trees in Scala, or use the geospatial UDFs for Hive/MapReduce that I noticed, which might work. I'm not really sure.
Looking for any suggestion of where to start, hopefully not starting from scratch.
If I understand correctly, you shouldn't use kNN-queries.
kNN queries return the closest 'k' hydrants. The problem is that you have to set 'k' quite high to be sure that you don't miss any hydrants within the required distance. However, a high 'k' may return lots of hydrants that are not within that distance.
I suggest using window queries or, if available, range queries.
Range queries just return everything within a given distance. This would be ideal, but the feature is often not supported by kd-tree implementations.
Window queries would be a good alternative. A window query typically returns all data within a specified axis-aligned rectangle. You have to make the window large enough to ensure that it returns all possible hydrants in the required distance. Afterwards, you would have to filter out all hydrants that are in the rectangle, but exceed your maximum distance, i.e. hydrants that are near the corners of the query window.
Range and window queries should be much more efficient than kNN queries, even though window queries may return additional results.
To put it another way, a naive kNN algorithm can be thought of as a series of window queries. The algorithm starts with a small window and then enlarges it until it finds 'k' elements. In the best case, the algorithm performs just one window query and finds the right number of results (not too few, not too many), but in the typical case it has to perform many such queries.
I don't know how cKDTree works, but I would strongly assume that window/range queries are much cheaper than kNN queries.
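The window-then-filter idea can be sketched in plain Python as follows. This is only an illustration, not a scalable implementation: the function names are made up for the example, the window test is a linear scan where a real system would use a spatial index, the Earth is treated as a sphere, and the window ignores the poles and the antimeridian.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def count_within(customer, hydrants, radius_m):
    """Count hydrants within radius_m of customer, both as (lat, lon) tuples."""
    lat, lon = customer
    # Half-widths of the query window in degrees, padded by 1% so the
    # rectangle is conservatively larger than the circle it must contain.
    dlat = 1.01 * math.degrees(radius_m / EARTH_RADIUS_M)
    dlon = dlat / max(math.cos(math.radians(lat)), 1e-12)
    count = 0
    for h_lat, h_lon in hydrants:
        # Step 1: cheap window test (axis-aligned rectangle).
        if abs(h_lat - lat) > dlat or abs(h_lon - lon) > dlon:
            continue
        # Step 2: exact distance test drops hits near the window corners.
        if haversine_m(lat, lon, h_lat, h_lon) <= radius_m:
            count += 1
    return count


customer = (42.0, -56.0)
hydrants = [
    (42.0, -56.0),      # same spot
    (42.001, -56.0),    # roughly 111 m north
    (42.004, -56.0),    # roughly 445 m north
    (42.02, -56.0),     # roughly 2.2 km north, outside every window used here
    (42.008, -55.989),  # inside the 1000 m window, but near its corner
]
counts = {r: count_within(customer, hydrants, r) for r in (100, 500, 1000)}
```

Running the three radii against the same data gives one column of the desired result table per radius; the corner hydrant passes the window test for 1000 m but is rejected by the distance filter.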
We are creating a dashboard that shows the number of exceptions on a given system over a time window, specifically the last 24 hours. The graph looks like this:
If you look closely though, the last bar is a day ago, and not today (See time generated on the last bar in the graph - 12/08/2022, but today is 12/09/2022).
This also occurs on some of our other graphs. Is there some way to get the dashboard to understand that this is a timeline, and always show the rightmost entry as "Now"? This is quite misleading: we ended up spending a bit of time trying to figure out why the issue wasn't solved. It turns out it was.
This happens because no exceptions have occurred since the last error (yesterday).
The Kusto query looks like this:
AppExceptions
| order by TimeGenerated desc
| where Properties.MS_FunctionName in ("WebhookFlow")
| summarize Exceptions=count() by bin(TimeGenerated, 15m)
| render columnchart with (title="Webhook exceptions")
Use the make-series operator, which emits a value for every step in the range, including steps with no exceptions:
AppExceptions
| where Properties.MS_FunctionName in ("WebhookFlow")
| make-series Exceptions=count() on TimeGenerated from bin(ago(1d), 15m) to now() step 15m
| render columnchart with (title="Webhook exceptions")
The bin function doesn't create empty bins for the 15m intervals where no exceptions occurred. A solution could be to make a projection that contains at least one element per 15m, so that bin can pick it up. The next problem is that count() no longer yields the number of exceptions, so you might want to use a different aggregator, like sum(x).
As an example you could do something like this:
AppExceptions
| order by TimeGenerated desc
| where Properties.MS_FunctionName in ("WebhookFlow")
| project TimeGenerated, Count = 1
| union (
range x from 1 to 1 step 1
| mv-expand TimeGenerated=range(now()-24h, now(), 15m) to typeof(datetime)
| extend Count=0
)
| summarize Exceptions=sum(Count) by bin(TimeGenerated, 15m)
| render columnchart with (title="Webhook exceptions")
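For readers who want to see the zero-filling idea outside Kusto, here is a minimal Python sketch of the same logic: generate every 15-minute bin in the window first, then attach the counts, defaulting to 0. The function names and the sample timestamps are invented for the illustration, and the bins are aligned to an arbitrary fixed origin rather than to whatever alignment Kusto's bin() uses.

```python
from collections import Counter
from datetime import datetime, timedelta

ORIGIN = datetime(1970, 1, 1)  # arbitrary fixed origin for bin alignment


def floor_to_bin(ts, width):
    """Round a naive datetime down to the start of its bin."""
    return ORIGIN + ((ts - ORIGIN) // width) * width


def binned_counts(timestamps, start, end, width=timedelta(minutes=15)):
    """Return (bin_start, count) for every bin from start to end, empty or not."""
    counts = Counter(
        floor_to_bin(ts, width) for ts in timestamps if start <= ts <= end
    )
    series = []
    current = floor_to_bin(start, width)
    while current <= end:
        # Bins without any exception still appear, with a count of 0.
        series.append((current, counts.get(current, 0)))
        current += width
    return series


now = datetime(2022, 12, 9, 12, 0)
exceptions = [datetime(2022, 12, 9, 11, 5), datetime(2022, 12, 9, 11, 10)]
series = binned_counts(exceptions, now - timedelta(hours=1), now)
```

Because the bins are generated from the time window rather than from the data, the last entry is always the bin containing "now", even when nothing happened in it.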
I hope I've described the job I need to do in the correct terms. Essentially, I need to 'compress' a series of values so that all the values are closer to the mean, but their values should be reduced (or increased) relative to their distance from the mean...
The dataframe looks like this:
>>> df[['population', 'postalCode']].show(10)
+----------+----------+
|population|postalCode|
+----------+----------+
| 1464| 96028|
| 465| 96015|
| 366| 96016|
| 5490| 96101|
| 183| 96068|
| 569| 96009|
| 366| 96054|
| 90| 96119|
| 557| 96006|
| 233| 96116|
+----------+----------+
only showing top 10 rows
>>> df.describe().show()
+-------+------------------+------------------+
|summary| population| postalCode|
+-------+------------------+------------------+
| count| 1082| 1082|
| mean|23348.511090573014| 93458.60813308688|
| stddev|21825.045923603615|1883.6307236060127|
+-------+------------------+------------------+
The population mean is about right for my purposes, but I need the variance around it to be smaller...
Hope that makes sense, any help performing this job either in pyspark or node.js greatly appreciated.
The general idea is to:
1. Translate the mean to zero.
2. Rescale to the new standard deviation.
3. Translate to the desired mean (in this case, the original mean).
In pseudo-code, if your values are stored in the variable x:
x.scaled = new.mean + (x - mean(x)) * new.SD/sd(x)
Or, for the specific case of, say, SD=1000 and no change to the mean:
x.scaled = mean(x) + (x - mean(x)) * 1000/sd(x)
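A minimal plain-Python sketch of that formula, using the ten population values shown in the question as sample data. The function name rescale is made up for the example, and it uses the sample standard deviation, which I believe is what Spark's describe() reports; swap in statistics.pstdev if you want the population version.

```python
import statistics


def rescale(values, new_sd, new_mean=None):
    """Shift and scale values so they have the given SD and mean.

    If new_mean is None, the original mean is kept.
    Raises ZeroDivisionError if all values are identical (SD of 0).
    """
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    if new_mean is None:
        new_mean = mean
    # Center on zero, rescale to the new SD, then move to the target mean.
    return [new_mean + (x - mean) * new_sd / sd for x in values]


population = [1464, 465, 366, 5490, 183, 569, 366, 90, 557, 233]
scaled = rescale(population, new_sd=1000)
```

The transformation is linear, so the ordering of the values and their relative distances from the mean are preserved; only the spread changes.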
My data looks something like this:
given dataframe
just with many more rows (and different timestamps).
I now want to transform the table so that I have one row per timestamp, with the columns ordered by JI value and each JI column containing X, Y, Z, RX, RY, RZ and confidence. So it should look like this:
timestamp | JI0 | JI1 | JI2 | ... | JI21 | phase
timestamp1 | (confidence|X|Y|Z|RX|RY|RZ) of JI0 | ... | (confidence|X|Y|Z|RX|RY|RZ) of JI21 | phasevalue
timestamp2 |
The ID value is not needed for now. I already tried to get this result by using pivot (and pivot_table), but I was not able to get all values for one joint after another, but rather all (actually not even all for some reason) values from RX for every Joint, then all values from RY for every joint and so on.
The code:
df.pivot_table(index='timestamp', columns='JI', values=['confidence','X','Y','Z','RX','RY','RZ'])
The result using the code from above: result
I hope my question is understandable, otherwise I am of course willing to answer any questions.
Using Apache Spark 2.0.2, I have a table stored as Parquet which contains about 23 million rows and about 300 columns. I have a column called total_price stored as double. If I execute:
select sum(total_price) from my_table;
+-----------------+
| total_price |
+-----------------+
| 9.3923769592E8|
+-----------------+
This number, 9.3923769592E8, looks wrong to me.
But if I execute:
select year, sum(total_price) from my_table group by year;
+-------+------------------------+
| year| total_price|
+-------+------------------------+
| 2017| 44510982.10004025 |
| 2016| 293320440.63992333 |
| 2015| 311512575.890131 |
| 2014| 289885757.2799143 |
| 2013| 5192.319 |
| 2012| 2747.7000000000007|
+-------+------------------------+
My assumption is that on the first query the double data type has an overflow or something like it.
Why am I getting results with so many decimals after the dot if the values are stored as #.##?
How can I fix the error of the first query?
The value you get looks just fine: 9.3923769592E8 is roughly 939,237,695, which is more or less what you would expect based on the numbers aggregated by year.
Regarding the extra decimals, remember that only some numbers are exactly representable in floating-point arithmetic, and commonly used types like Scala Double or Float are not suitable for use cases where exact values are necessary (accounting, for example). For applications like this you should use DecimalType.
I would also recommend reading What Every Computer Scientist Should Know About Floating-Point Arithmetic and Is floating point math broken?
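Both points can be checked quickly in plain Python, whose float is the same 64-bit double. The yearly totals below are copied from the question; the 0.10 prices are made-up values used only to show the rounding effect, and decimal.Decimal stands in here for what DecimalType gives you in Spark.

```python
from decimal import Decimal

# Per-year sums copied from the question's second query.
yearly_totals = [
    44510982.10004025,
    293320440.63992333,
    311512575.890131,
    289885757.2799143,
    5192.319,
    2747.7000000000007,
]
float_total = sum(yearly_totals)
grand_total = 9.3923769592e8  # result of the first query, as displayed

# The two queries agree to within the precision that was displayed.
totals_agree = abs(float_total - grand_total) < 1.0

# Hypothetical two-decimal prices: 0.10 has no exact binary representation,
# so the double sum picks up a tiny error, while the decimal sum stays exact.
float_sum = sum([0.10, 0.10, 0.10])
decimal_sum = sum(Decimal("0.10") for _ in range(3))

print(totals_agree)
print(float_sum == 0.30)
print(decimal_sum == Decimal("0.30"))
```

The scientific notation in the first query is only a display format for a large double, not a sign of overflow.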
Say I have two tables. attrsTable:
file | attribute | value
------------------------
A | xdim | 5
A | ydim | 6
B | xdim | 7
B | ydim | 3
B | zdim | 2
C | xdim | 1
C | ydim | 7
sizeTable:
file | size
-----------
A | 17
B | 23
C | 34
I have these tables related via the 'file' field. I want a PowerPivot measure within attrsTable whose calculation uses size. For example, let's say I want (xdim+ydim)/size for each of A, B, C. The calculations would be:
A: (5+6)/17
B: (7+3)/23
C: (1+7)/34
I want the measure to be generic enough so I can use slicers later on to slice by file or attribute. How do I accomplish this?
I tried:
dimPerSize := CALCULATE([value]/SUM(sizeTable[size])) # Calculates 0
dimPerSize := CALCULATE([value]/SUM(RELATED(sizeTable[size]))) # Produces an error
Any idea what I'm doing wrong? I'm probably missing some fundamental concepts here of how to use DAX with relationships.
Hi Redstreet,
taking a step back from your solution and the one proposed by Jacob, I think it might be useful to create another table that would aggregate all the calculations (especially given you probably have more than 2 tables with file-specific attributes).
So I have created one more table that contains (only) unique file names, and thus the relationships could be visualized this way:
It's much simpler to add necessary measures (no need for calculated columns). I have actually tested 2 scenarios:
1) create simple SUM measures for both Attribute Value and File Size. Then divide those two measures and job done :-).
2) use SUMX functions to have a bit more universal solution. Then the final formula for DimPerSize calculation could look like this:
=DIVIDE(
SUMX(DISTINCT(fileTable[file]),[Sum of AttrValue]),
SUMX(DISTINCT(fileTable[file]),[Sum of FileSize]),
BLANK()
)
With [Sum of AttrValue] being:
=SUM(attrsTable[value])
And Sum of FileSize being:
=SUM(sizeTable[size])
This worked perfectly fine, even though SUMX in both cases goes over all instances of a given file name. So for file B it also includes zdim in the calculation (if you need to filter this out, use a simple CALCULATE / FILTER combination). For the file size, I am using SUMX as well, even though it's not really needed since the table contains only 1 record for each file name. If there were 2 instances, you would use SUMX or AVERAGEX depending on the desired outcome.
This is the link to my source file in Excel (2010).
Hope this helps.
You seem to have the concept of relationships right, but you aren't on the right track with CALCULATE(), both in terms of its structure and because you can't simply use 'naked' numerical columns in a measure: they need to be wrapped in an aggregation of some kind.
Your desired approach is correct in that once you get a simple version of the thing running, you will be able to slice and dice it over any of your related dimensions.
Best practice is probably to build this up using several measures:
[xdim] = CALCULATE(SUM('attrstable'[value]), 'attrstable'[attribute] = "xdim")
[ydim] = CALCULATE(SUM('attrstable'[value]), 'attrstable'[attribute] = "ydim")
[dimPerSize] = ([xdim] + [ydim]) / VALUES('sizeTable'[size])
But depending on exactly how your pivot is set up, this is likely to also throw an error because it will try and use the whole 'size' column in your totals. There are two main strategies for dealing with this:
Use an 'iterative' formula such as SUMX() or AVERAGEX() to iterate individually over the 'file' field and then add up or average the results for the total, e.g.
[ItdimPerSize] = AVERAGEX(VALUES('sizeTable'[file]), [dimPerSize])
Depending on the maths you want to use, you might find that this doesn't produce the average you need, and that you should use SUMX() instead and divide by the number of cases, i.e. COUNTROWS(VALUES('sizeTable'[file])).
You might decide that the totals are irrelevant and simply introduce an error handling element that will make them blank e.g.
[NtdimPerSize] = IF(HASONEVALUE('sizeTable'[file]),[dimPerSize],BLANK())
NB, all of this assumes that when you create your pivot you are 'dragging in' the file field from 'sizeTable'.