Using Apache Spark 2.0.2, I have a table stored as Parquet which contains about 23 million rows and about 300 columns. There is a column called total_price stored as a double. If I execute:
select sum(total_price) from my_table;
+-----------------+
| total_price |
+-----------------+
| 9.3923769592E8|
+-----------------+
So this number, 9.3923769592E8, is wrong.
But if I execute:
select year, sum(total_price) from my_table group by year;
+-------+------------------------+
| year| total_price|
+-------+------------------------+
| 2017| 44510982.10004025 |
| 2016| 293320440.63992333 |
| 2015| 311512575.890131 |
| 2014| 289885757.2799143 |
| 2013| 5192.319 |
| 2012| 2747.7000000000007|
+-------+------------------------+
My assumption is that in the first query the double data type overflows, or something like that.
Why am I getting a result with so many decimal places when the values are stored as #.##?
How can I fix the error in the first query?
The value you get looks just fine - 9.3923769592E8 is roughly 939,237,695, which is more or less what you would expect based on the numbers aggregated by year.
Regarding the values you get, you have to remember that only some numbers are representable in floating-point arithmetic, and commonly used types, like Scala's Double or Float, are not suitable for use cases where exact values are necessary (accounting, for example). For applications like this you should use DecimalType.
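For example, a minimal PySpark sketch (assuming a SparkSession named spark; the path and the decimal(38,2) precision/scale are my assumptions, adjust them to your data):
from pyspark.sql import functions as F
# Cast the double column to an exact decimal type before aggregating.
df = spark.read.parquet("/path/to/my_table")  # hypothetical path
df.select(F.sum(F.col("total_price").cast("decimal(38,2)")).alias("total_price")).show()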
I would also recommend reading What Every Computer Scientist Should Know About Floating-Point Arithmetic and Is floating point math broken?
We are creating a dashboard that shows us the number of exceptions on a given system over a time window - specifically, the last 24 hours. The graph looks like this:
If you look closely, though, the last bar is a day ago, not today (see the TimeGenerated on the last bar in the graph: 12/08/2022, but today is 12/09/2022).
This is also occurring on some of our other graphs. Is there some way to get the dashboard to understand that this is a timeline and always show the right-most entry as "Now"? This is quite misleading - we ended up spending a bit of time trying to figure out why the issue wasn't solved. Turns out it was.
The reason it is happening is that no exceptions have occurred since the last error (yesterday).
The Kusto query looks like this:
AppExceptions
| order by TimeGenerated desc
| where Properties.MS_FunctionName in ("WebhookFlow")
| summarize Exceptions=count() by bin(TimeGenerated, 15m)
| render columnchart with (title="Webhook exceptions")
Use the make-series operator, which creates the empty bins for you:
AppExceptions
| where Properties.MS_FunctionName in ("WebhookFlow")
| make-series Exceptions=count() on TimeGenerated from bin(ago(1d), 15m) to now() step 15m
| render columnchart with (title="Webhook exceptions")
The bin() function doesn't create empty bins for the 15m intervals in which no exceptions occurred. A solution could be to make a projection that contains at least one element per 15m interval, so that bin() can pick it up. The next problem is that count() would then also count these injected placeholder rows, so you might want to use a different aggregator, like sum(x).
As an example you could do something like this:
AppExceptions
| order by TimeGenerated desc
| where Properties.MS_FunctionName in ("WebhookFlow")
| project TimeGenerated, Count = 1
| union (
range x from 1 to 1 step 1
| mv-expand TimeGenerated=range(now()-24h, now(), 15m) to typeof(datetime)
| extend Count=0
)
| summarize Exceptions=sum(Count) by bin(TimeGenerated, 15m)
| render columnchart with (title="Webhook exceptions")
I hope I've described the job I need to do in the correct terms. Essentially, I need to 'compress' a series of values so that all the values are closer to the mean, but their values should be reduced (or increased) relative to their distance from the mean...
The dataframe looks like this:
>>> df[['population', 'postalCode']].show(10)
+----------+----------+
|population|postalCode|
+----------+----------+
| 1464| 96028|
| 465| 96015|
| 366| 96016|
| 5490| 96101|
| 183| 96068|
| 569| 96009|
| 366| 96054|
| 90| 96119|
| 557| 96006|
| 233| 96116|
+----------+----------+
only showing top 10 rows
>>> df.describe().show()
+-------+------------------+------------------+
|summary| population| postalCode|
+-------+------------------+------------------+
| count| 1082| 1082|
| mean|23348.511090573014| 93458.60813308688|
| stddev|21825.045923603615|1883.6307236060127|
+-------+------------------+------------------+
The population mean is about right for my purposes, but I need the variance around it to be smaller...
Hope that makes sense; any help performing this job either in pyspark or node.js is greatly appreciated.
The general idea is to:
translate the mean to zero;
rescale to the new standard deviation;
translate to the desired mean (in this case, the original mean).
In pseudo-code, if your values are stored in the variable x:
x.scaled = new.mean + (x - mean(x)) * new.SD/sd(x)
Or, for the specific case of, say, SD=1000 and no change to the mean:
x.scaled = mean(x) + (x - mean(x)) * 1000/sd(x)
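In PySpark that could look like the following sketch (new_sd and the population column are assumptions based on the sample data above):
from pyspark.sql import functions as F
new_sd = 1000.0  # the desired standard deviation
# Compute the current mean and standard deviation of the column.
stats = df.agg(F.mean("population").alias("mu"),
               F.stddev("population").alias("sigma")).first()
# Center on zero, rescale to the new SD, then shift back to the original mean.
df_scaled = df.withColumn(
    "population_scaled",
    stats["mu"] + (F.col("population") - stats["mu"]) * new_sd / stats["sigma"])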
I am new to Stata and I assume this is a beginner question, yet I have just spent the last hour searching the internet for an answer to no avail!
I am using World Bank GDP data (imported from a csv file) and the data is in string format. When I destring, the decimal point in the GDP values gets ignored and the number simply comes out as one big integer.
destring yr*, replace ignore("..")
Here is a sample of my data:
yr2016
205276172134.901
..
13397100000
When I run the command I posted, it transforms to:
yr2016
2.053e+14
1.340e+10
As you can see, the .901 was tacked onto the end of the number instead of being treated as a decimal point.
I have tried:
set dp period
But it didn't work.
You just need to set the format of the converted variable:
clear
set obs 1
generate string = "205276172134.901"
destring string, generate(numeric)
list
+------------------------------+
| string numeric |
|------------------------------|
1. | 205276172134.901 2.053e+11 |
+------------------------------+
format numeric %18.0g
list
+-------------------------------------+
| string numeric |
|-------------------------------------|
1. | 205276172134.901 205276172134.901 |
+-------------------------------------+
Type help format for more information.
The problem is that the ignore() option removes every instance of . in the string variable; Stata does not search for the two-character sequence .. as a unit. There is no need to use the ignore() option in this case. Try destring var, replace force and allow Stata to set the rows containing .. to missing.
My data looks something like this:
[screenshot of the given dataframe]
just with many more rows (and different timestamps).
I now want to transform the table so that I have one row per timestamp, the columns are sorted by the JI value, and each JI value contains X, Y, Z, RX, RY, RZ and confidence. So it should look like this:
timestamp | JI0 | JI1 | JI2 | ... | JI21 | phase
timestamp1 | (confidence|X|Y|Z|RX|RY|RZ) of JI0 | ... | (confidence|X|Y|Z|RX|RY|RZ) of JI21 | phasevalue
timestamp2 |
The ID value is not needed for now. I already tried to get this result using pivot (and pivot_table), but I was not able to get all the values for one joint after another; instead I got all the values of RX for every joint, then all the values of RY for every joint, and so on (and for some reason not even all of them).
The code:
df.pivot_table(index='timestamp', columns='JI', values=['confidence','X','Y','Z','RX','RY','RZ'])
The result using the code from above: [screenshot of the pivoted result]
I hope my question is understandable; if not, I am of course willing to answer any questions.
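For reference, one way to group the resulting columns per joint rather than per value (a sketch, not verified against the original data) is to swap the levels of the column MultiIndex after pivoting and sort:
# Pivot as before, then make JI the outer column level, so that all columns
# of one joint (confidence, X, Y, Z, RX, RY, RZ) end up next to each other.
wide = df.pivot_table(index='timestamp', columns='JI',
                      values=['confidence', 'X', 'Y', 'Z', 'RX', 'RY', 'RZ'])
wide = wide.swaplevel(axis=1).sort_index(axis=1, level=0)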
I have successfully done this in python using cKDTree and multiprocessing over 20+ threads, but it really hits a wall when trying to scale it up to millions of rows. I'm wondering how someone could do this in Hive or Spark.
One dataset contains all customers and their GPS coords. The other dataset contains all fire hydrants and their GPS coords. Like so:
tblCustomer (100MM rows)
+-------------+-------------------+--------------------+
| customer_id | customer_latitude | customer_longitude |
+-------------+-------------------+--------------------+
| 123 | 42.123456 | -56.123456 |
+-------------+-------------------+--------------------+
| 456 | 44.123456 | -55.123456 |
+-------------+-------------------+--------------------+
tblFireHydrants (50MM rows)
+----------------+------------------+-------------------+
| firehydrant_id | hydrant_latitude | hydrant_longitude |
+----------------+------------------+-------------------+
| 123456 | 42.987654 | -55.984657 |
+----------------+------------------+-------------------+
| 456233 | 45.569841 | -55.978946 |
+----------------+------------------+-------------------+
The goal is to query how many fire hydrants are within radius r (meters) of each customer_id.
After repeating it a few times for various distances, the final result will look like this:
+-------------+----------------------+----------------------+-----------------------+
| customer_id | hydrants_within_100m | hydrants_within_500m | hydrants_within_1000m |
+-------------+----------------------+----------------------+-----------------------+
| 123456 | 0 | 1 | 6 |
+-------------+----------------------+----------------------+-----------------------+
| 456233 | 1 | 1 | 9 |
+-------------+----------------------+----------------------+-----------------------+
I could start from scratch and try to use KD-trees in Scala, or I noticed there are some geospatial UDFs for Hive/MapReduce that might work. I'm not really sure.
Looking for any suggestion of where to start, hopefully not starting from scratch.
If I understand correctly, you shouldn't use kNN-queries.
kNN queries return the closest 'k' hydrants. The problem is that you have to set 'k' quite high to be sure that you don't miss any hydrants within the required distance. However, using a high 'k' may return lots of hydrants that are not within that distance.
I suggest using window queries or, if available, range queries.
Range queries simply return everything within a given range. This would be ideal, but the feature is often not supported in kd-trees.
Window queries would be a good alternative. A window query typically returns all data within a specified axis-aligned rectangle. You have to make the window large enough to ensure that it returns all possible hydrants in the required distance. Afterwards, you would have to filter out all hydrants that are in the rectangle, but exceed your maximum distance, i.e. hydrants that are near the corners of the query window.
Range and window queries should be much more efficient than kNN queries, even if they may return additional results (in the case of window queries).
To put it another way, a naive kNN algorithm can be thought of as a series of window queries. The algorithm starts with a small window and makes it larger until it finds 'k' elements. In the best case, it performs just one window query and finds the right number of results (not too few, not too many), but in the typical case it has to perform many such queries.
I don't know the internals of cKDTree, but I would strongly assume that window/range queries are much cheaper than kNN queries.
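For what it's worth, SciPy's cKDTree does expose a range query via query_ball_point. A minimal sketch (the meters-to-degrees conversion is a crude approximation and my own assumption; for real use, project the coordinates to a metric CRS first):
import numpy as np
from scipy.spatial import cKDTree
# Hypothetical (latitude, longitude) arrays; in practice these would come
# from tblCustomer and tblFireHydrants.
customers = np.array([[42.123456, -56.123456], [44.123456, -55.123456]])
hydrants = np.array([[42.987654, -55.984657], [45.569841, -55.978946]])
tree = cKDTree(hydrants)
# Rough conversion of a 500 m radius to degrees (~111,320 m per degree of
# latitude); this ignores longitude distortion and is only illustrative.
radius_deg = 500.0 / 111320.0
# query_ball_point is a range query: for each customer it returns the indices
# of all hydrants within the radius.
matches = tree.query_ball_point(customers, r=radius_deg)
hydrants_within_500m = [len(m) for m in matches]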