View Collation with Couchbase - collation

We are using Couchbase as our NoSQL store and loving it for its capabilities. There is, however, an issue we are running into when creating associations via view collation, which can be thought of as akin to a join operation. While our data sets are confidential, I am illustrating the problem with this model. The volume of data is considerable, so it cannot be processed in memory.
Let's say we have data on ice creams, zip codes, and the average temperature of the day. One type of document contains a zip-code-to-ice-cream mapping, and the other holds transaction data for an ice cream being sold in a particular zip code. The problem is to determine the set of top ice creams sold, by the temperature of a given day.
We crunch this corpus with a view that emits two kinds of output: one is a zip-code-to-temperature mapping, while the other represents an ice-cream sale in a zip code:
Key                  Value
[zip1]               temp1
[zip1,ice_cream1]    1
[zip2,ice_cream2]    1
The view collation here is the mechanism for creating an association between the ice-cream sale, the zip code, and the average temperature, i.e. a join. We have a constraint that the temperature lookup happens only once in 24 hours, when the zip code is first seen, and that is the valid average temperature to use for that day. E.g. if a lookup happened at 12:00 pm on Jan 1st, the next lookup does not happen until 12:00 pm on Jan 2nd. However, the average temperature accepted in the first lookup is valid only for Jan 1st, and the one from the second lookup only for Jan 2nd, including the first half of that day.
Now things get complicated when I want to do the same query with a time component involved, concretely associating the average temperature of a day with the ice creams that were sold on that day in that zip code, e.g. x vanilla ice creams were sold when the average temperature for that day was 70 F:
Key                        Value
[y,m,d,zip1]               temp1
[y,m,d,zip2,ice_cream2]    1
[y,m,d2,zip1,ice_cream1]   1
This has an interesting impact on the queries. Say I query for the last 1 day: I cannot make any associations between the ice cream and the temperature before the first lookup of the day happens, since that is when the two keys align. The net effect is that I lose the ice-cream counts for that day that were recorded before the temperature lookup happened. I was wondering whether any of you have faced similar issues, and whether you are aware of a pattern or solution that avoids losing those counts.

First, welcome to StackOverflow, and thank you for the great question.
I understand the specific issue that you are having, but what I don't understand is the scale of your data - so please forgive me if I appear to be leading down the wrong path with what I am about to suggest. We can work back and forth on this answer depending on how it might suit your specific needs.
First, you have discovered that CB does not support joins in its queries. I am going to suggest that this is not really an issue when CB is used properly. The conceptual model for how Couchbase should be used to filter out data is as follows:
1. Create the CB view to be as precise as possible.
2. Select records as precisely as possible from CB using the view.
3. Fine-filter records as necessary in the data-access layer (also perform any joins) before sending them on to the rest of the application.
From your description, it sounds to me as though you are trying to be too clever with your CB view query. I would suggest one of two courses of action:
1. Manually look up the value that you want, when this happens, with a second view query.
2. Look up more records than you need, then fine-filter afterward (step 3 above).
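To make the fine-filtering step (step 3) concrete, here is a minimal sketch, in Python, of the kind of join the data-access layer can perform after two separate view queries. Everything in it is an illustrative assumption: the row shapes simply mirror the keys above, and the function name is made up; it is not Couchbase API code.

from collections import defaultdict

# Hypothetical results of two view queries, already reduced per day/zip.
# Temperature rows: key = [y, m, d, zip], value = average temperature.
temp_rows = [
    ((2013, 1, 1, "zip1"), 70),
    ((2013, 1, 2, "zip1"), 65),
]
# Sale rows: key = [y, m, d, zip, ice_cream], value = number of units sold.
sale_rows = [
    ((2013, 1, 1, "zip1", "vanilla"), 12),
    ((2013, 1, 2, "zip1", "vanilla"), 7),
    ((2013, 1, 2, "zip1", "chocolate"), 3),
]

def join_sales_with_temperature(temp_rows, sale_rows):
    """Associate each day's sales with that day's average temperature.

    Because the join is on the calendar day rather than on the lookup time,
    sales recorded before the day's first temperature lookup are still
    counted once the temperature row for that (day, zip) exists."""
    temp_by_day_zip = {(y, m, d, z): t for (y, m, d, z), t in temp_rows}
    totals = defaultdict(int)  # (temperature, ice_cream) -> units sold
    for (y, m, d, z, ice_cream), count in sale_rows:
        temp = temp_by_day_zip.get((y, m, d, z))
        if temp is None:
            continue  # no temperature recorded at all for that day/zip
        totals[(temp, ice_cream)] += count
    return dict(totals)

print(join_sales_with_temperature(temp_rows, sale_rows))
# {(70, 'vanilla'): 12, (65, 'vanilla'): 7, (65, 'chocolate'): 3}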

Related

Calculating number of times one "team" beats another in Excel

(I use the term "teams" generically here because the entirety of this question rests on ranking, and it seemed to be the most intuitive language to describe my problem.)
In a league of 30 teams, each day only 8 teams play. The results for those teams are ranked ordinally from 1 to 8 for the day. This continues "forever", so that additional results must be recorded every day.
Example after 4 days:
I want to calculate a single number to describe the relationship between two teams. For instance, given the example, the value (in a 2d table) that describes the relationship of Ace to Get is 1. Ace beat Get twice and Get beat Ace once (2-1).
I have been messing with SUMPRODUCT, MATCH, and INDEX to get values, which I could calculate using many extra tables, but I may need to add "teams" on the fly, and I do not know how large the pool of teams will become. Because of this, I was hoping to be able to use a single formula in the 2d relationship table. The results of that table, looking at just day 1 and day 2 of the previous example, are:
Is there a direct formula I can use to calculate the results to populate that table?
You can try the following formula:
=IF($A11<>B$10;
SUMPRODUCT(
IF(MMULT(($B$1:$I$1)*($B$2:$I$3=$A11);ROW($1:$8)^0)
<MMULT(($B$1:$I$1)*($B$2:$I$3=B$10);ROW($1:$8)^0);
1;
-1)
*(((MMULT(--($B$2:$I$3<>$A11);ROW($1:$8)^0)=8)
+(MMULT(--($B$2:$I$3<>B$10);ROW($1:$8)^0)=8))
=0));
"")
Copy right and down.
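Since the worksheet screenshots are not reproduced here, a locale-independent way to sanity-check the logic is to compute the same head-to-head value outside Excel. Below is a minimal Python sketch under assumed data: each day is a list of the eight teams that played, ordered from rank 1 to rank 8, and the value for a pair is (times the first beat the second) minus (times the second beat the first), counted only on days when both teams played.

from itertools import combinations
from collections import defaultdict

# Hypothetical daily results: each inner list is one day's ranking,
# position 0 = rank 1 (best) ... position 7 = rank 8 (worst).
days = [
    ["Ace", "Get", "Bob", "Cap", "Dan", "Eel", "Fox", "Hat"],
    ["Get", "Ace", "Cap", "Bob", "Eel", "Dan", "Hat", "Fox"],
    ["Ace", "Cap", "Get", "Dan", "Bob", "Fox", "Eel", "Hat"],
]

def head_to_head(days):
    """Return {(team_a, team_b): wins of a over b minus wins of b over a}."""
    score = defaultdict(int)
    for ranking in days:
        rank = {team: pos for pos, team in enumerate(ranking)}
        # Only pairs of teams that both played that day contribute.
        for a, b in combinations(sorted(ranking), 2):
            if rank[a] < rank[b]:
                score[(a, b)] += 1
                score[(b, a)] -= 1
            else:
                score[(a, b)] -= 1
                score[(b, a)] += 1
    return score

print(head_to_head(days)[("Ace", "Get")])  # 1  (Ace beat Get twice, lost once)

New teams can be added freely: a pair only accumulates a score on days when both teams appear in a ranking, which is what the two MMULT(...)=8 terms in the formula above appear to check.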

Dynamic Percentile Analysis Across Multiple Categories - PowerPivot / DAX

I've spent a lot of time trying to find a solution to the following issue, but I haven't been able to. There are similar threads to this issue both here and on other forums, but they don't seem to be applicable. Please let me know if I'm going against any best practices for posting on this forum.
I would like to be able to dynamically (and hopefully in as simple a way as possible) create measures (ideally NOT via calculated columns) in Power Pivot to carry out percentile analysis (e.g., the value associated with the top quartile, top quintile, third decile, etc.) on different subsets of my data (in a pivot table). For example, I might want to create the percentile based on the yearly sales associated with a shop (although the records I have are monthly, or for some other time period).
Here is what this data could look like as an example, as well as what the results would be on this data (I did this in a rough-and-ready way using Excel). I know that there is a way to do this using calculated columns, but I want to try to do it using measures (e.g., maybe using a combination of SUMX, PERCENTILEX, TOPN?).
In case you're not able to view the picture of my data, it is structured as follows (etc., please see the screenshot, thank you):

Shop ID | Value | Metric | Period (e.g., mm/yy) | Franchised or Co-Owned | Year | Quarter
1       | 50    | ...    | ...                  | ...                    | ...  | ...
2       | 70    | ...    | ...                  | ...                    | ...  | ...
3       | 90    | ...    | ...                  | ...                    | ...  | ...
Additional explanation of the data:
Shop ID could have many entries
Value is the value for each metric - the records are based around having a value for each metric, for each shop ID, for each month (or other time period)
Metric could be things like sales, EBITDA, car count, etc.
Period is typically a month
Shop status could be "Co-Owned" or "Franchised"
Year and Quarter are derived from the period
I want to be able to get percentile values for sales in a given period (e.g., total yearly sales for a given year, total quarterly sales, etc.) for whatever slicers I have applied to the current pivot table.
Super grateful for any help!
Thanks,
Louis
OK, I think I found an answer. Something like this formula might work:
PERCENTILEX.INC(ALLSELECTED(Facts[ID]),SUMX(ALLSELECTED(Facts[Period]),[Sum Values]),[Percentile Definition])
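For reference, here is a small Python sketch (outside Power Pivot) of the calculation such a measure is aiming at: sum each shop's values over the selected year, then take an inclusive percentile of those per-shop totals. The record layout, column order, and the 0.75 cut-off are illustrative assumptions, not the actual model.

from collections import defaultdict
from statistics import quantiles

# Hypothetical monthly records: (shop_id, year, metric, value)
records = [
    (1, 2014, "sales", 50), (1, 2014, "sales", 30),
    (2, 2014, "sales", 70), (2, 2014, "sales", 40),
    (3, 2014, "sales", 90), (3, 2014, "sales", 55),
]

def percentile_of_shop_totals(records, year, metric, pct):
    """Inclusive percentile of per-shop yearly totals for one metric
    (mirrors PERCENTILE.INC / PERCENTILEX.INC semantics)."""
    totals = defaultdict(int)  # shop_id -> total value for the year
    for shop, y, m, value in records:
        if y == year and m == metric:
            totals[shop] += value
    cuts = quantiles(totals.values(), n=100, method="inclusive")
    return cuts[round(pct * 100) - 1]  # e.g. pct=0.75 -> 75th percentile

print(percentile_of_shop_totals(records, 2014, "sales", 0.75))
# 127.5: shop totals are 80, 110, 145; their inclusive 75th percentile is 127.5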

Using QDigest over a date range

I need to keep a 28-day history for some dashboard data. Essentially I have an event/action that is recorded through our BI system. I want to count the number of events and the distinct users who perform that event over the past 1 day, 7 days and 28 days. I also use grouping sets (CUBE) to get the fully segmented data by country/browser/platform etc.
The old way was to do this by keeping a 28-day history per user, for all segments. So if a user accessed the site from mobile and desktop every day for all 28 days they would have 54 rows in the DB. This ends up being a large table, and it is time consuming even to calculate approx_distinct, let alone an exact distinct. But the issue is that I also wish to calculate approx_percentiles.
So I started investigating the use of HyperLogLog: https://prestodb.io/docs/current/functions/hyperloglog.html
This works great; it is much more efficient to store the sketches daily rather than the entire list of unique users per day. As I am using approx_distinct, the values are close enough and it works.
I then noticed a similar function for medians, qdigest: https://prestodb.io/docs/current/functions/qdigest.html
Unfortunately the documentation on that page is not nearly as good as on the previous one, so it took me a while to figure it out. It works great for calculating daily medians, but it does not work if I want to calculate the median actions per user over the longer time periods. The HyperLogLog examples demonstrate how to calculate approx_distinct users over a time period, but the qdigest docs do not give such an example.
When I try something similar to the HLL date-range example with qdigest, I get results close to the 1-day results.
Because you need medians of values that are aggregated (summed) across multiple days on a per-user basis, you will need to perform that aggregation before inserting the values into the qdigest in order for this to work for 7- and 28-day per-user counts. In other words, the units of the data need to be consistent: if daily values are being inserted into the qdigest, you cannot use that qdigest for 7- or 28-day per-user counts of the events.
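The same point can be illustrated outside Presto. The sketch below (plain Python with a standard median instead of a qdigest, and made-up events) shows why the units matter: values built from daily per-user counts keep answering a per-day question even when several days are combined, whereas the 7- or 28-day question needs per-user totals over the whole window computed before anything goes into the digest.

from collections import defaultdict
from statistics import median

# Hypothetical raw events: (day, user_id); each tuple is one event.
events = [
    (1, "a"), (1, "a"), (1, "b"),
    (2, "a"), (2, "b"), (2, "b"), (2, "b"),
    (3, "a"),
]

# Daily units: counts per user per DAY. Feeding these into a digest (or
# merging the daily digests) still describes "events per user per day".
daily_counts = defaultdict(int)          # (day, user) -> events that day
for day, user in events:
    daily_counts[(day, user)] += 1
print(median(daily_counts.values()))     # median events per user per day -> 1

# Window units: sum per user across the whole window FIRST, then take the
# median of those per-user totals; these are the values to insert.
window_counts = defaultdict(int)         # user -> events over the window
for day, user in events:
    window_counts[user] += 1
print(median(window_counts.values()))    # median events per user per window -> 4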

"IF" function for analysis of hospital lab frequency

I work for a hospital that is part of a larger network. We were recently asked by our corporate overlords to address the use of a specific laboratory test. In general, this test should only be performed daily, which should be considered to correspond to a 24-hour period from the last draw. Sometimes, however, based on when people arrive at the hospital (e.g. 7 pm), and in the interest of bundling labs into a single draw, it may be drawn sooner to coincide with routine testing, i.e. at 5 am. It would never otherwise be necessary to repeat the test within a short (8-hour) window, particularly on the same day.
We have been asked to validate whether we are adhering to this general practice, as testing any more frequently than that, say within 12 h of a previous test, has no real clinical value and thus adds unnecessary cost.
To address this issue I was given a dataset that, among other items, includes all instances in which the lab was performed, including collection date and time.
Please see the HIPAA-safe example below (to be clear, there is no real data and the identifiers are not real); the actual dataset has over 4,174 entries corresponding to 1,328 unique persons. Everyone had at least one test performed; not everyone had more than one.
I THINK what I want to do is an IF formula that reads the preceding cell to 1) check whether it is the same person and 2) if so, subtract the time stamps to display the relevant difference in time, which I can then filter, build a histogram from, etc. Does this seem like a reasonable approach? Is there a preferable method to facilitate the analysis? Do any other forms of analysis come to mind?
=IF(B2=B1, D2-D1, "n/a")
example data set with formula:
"Do any other forms of analysis come to mind?"
By the looks of it you should consider taking the values under "Results" into account, assuming there is a band that might be considered 'normal' readings. The "one in 24 hours is sufficient" rule of thumb may well be appropriate for a series of values within the 'normal' band, but not so much so if readings are close to a 'danger level'.
That is, in some cases a higher-than-standard frequency of monitoring may be in the patient's interest, even if it is not hospital policy. So it may be worth separating the "less than 24 hours interval" readings into those where the higher frequency provided information of little value (e.g. readings remaining within the 'normal' band) and those that crossed into or out of the band and/or showed large changes in value. This, though, may be more a matter of statistical analysis than programming, and it depends on whether any action might be taken as a result of such "extra" readings.
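If it helps, here is the same check expressed outside Excel as a short Python sketch: it walks the rows sorted by patient and collection time, computes the interval to the previous draw for the same patient (the equivalent of =IF(B2=B1, D2-D1, "n/a")), and flags repeats inside a threshold window. The column layout, identifiers, values, and the 12-hour threshold are made-up assumptions.

from datetime import datetime

# Hypothetical rows, already sorted by patient then collection time:
# (patient_id, collection_datetime, result_value)
rows = [
    ("MRN001", datetime(2023, 1, 1, 5, 0), 1.8),
    ("MRN001", datetime(2023, 1, 1, 19, 0), 2.1),   # 14 h after the last draw
    ("MRN001", datetime(2023, 1, 2, 5, 0), 2.0),    # 10 h later -> flagged
    ("MRN002", datetime(2023, 1, 1, 6, 30), 1.2),
]

def flag_short_intervals(rows, threshold_hours=12):
    """Compare each draw with the previous draw for the same patient and
    flag intervals shorter than the threshold."""
    flagged = []
    for prev, curr in zip(rows, rows[1:]):
        if prev[0] != curr[0]:
            continue                      # different patient -> "n/a"
        hours = (curr[1] - prev[1]).total_seconds() / 3600
        if hours < threshold_hours:
            flagged.append((curr[0], curr[1], round(hours, 1)))
    return flagged

print(flag_short_intervals(rows))
# [('MRN001', datetime.datetime(2023, 1, 2, 5, 0), 10.0)]

The histogram, and the split between "extra" draws whose results stayed inside the normal band and those that crossed it, can then be built from the flagged list.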

Find average of calculated field in Cognos 10.1

I have a calculated field a/b which makes sense at the week level, where a is a last-of-period metric and b is a sum-of-the-period metric. I need to find avg(a/b) over the weeks that fall within a month, not end_of_month(a)/sum(b) for the month. I made my a/b metric with its regular aggregate set to Calculated, and then a monthly average metric with its regular aggregate set to Average, but it doesn't work. The report is a crosstab report. How can I solve this?
Edit: a is end-on-hand inventory, b is sales, and a/b is weeks of supply. Both a and b are spread along the product/location/time dimensions. For a, I've set its regular aggregate for time to Last.
Your form of expression gives me an error, so I tried average((total([a] for [week]))/(total([b] for [week]))), which is error-free but doesn't give the correct result. I used total([a]) because it still has to sum along the other dimensions except time. Any ideas?
I was also trying an alternative approach: get the individual weeks of supply and then derive a new metric as (first week wos + ... + fifth week wos)/5. But when I try to put in a case statement I get the warning "Relational query objects are being used in conjunction with Dimensionally-modeled relational objects", and the metric gives garbage values. How can I apply a case expression involving a relational item in the query items of a measure dimension?
You have to be clearer about what you are trying to achieve.
Also, the header says Framework Manager, while you are talking about a report.
My best guess is that you need to use the FOR option when you aggregate the values:
avg((max([a] for [week]))/(sum([b] for [week])))
You might need to use another (more sophisticated) summary function.
For more details about FOR and AT, look here:
Using the AT and FOR Options with Relational Summary Functions
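To illustrate the arithmetic the question is after (this is plain Python with made-up weekly numbers, not Cognos expression syntax): the monthly figure should be the average of the weekly a/b ratios, which is not the same as taking the last a of the month over the month's total b.

# Hypothetical weekly data for one month:
# a = end-on-hand inventory (last value of the week), b = weekly sales.
weeks = [
    {"a": 120, "b": 40},
    {"a": 100, "b": 50},
    {"a":  90, "b": 30},
    {"a": 110, "b": 55},
]

# What the question asks for at the month level: the average of the
# weekly weeks-of-supply ratios.
avg_of_weekly_wos = sum(w["a"] / w["b"] for w in weeks) / len(weeks)

# What a month-level calculated aggregate gives instead:
# the last week's a divided by the month's total b.
month_level_wos = weeks[-1]["a"] / sum(w["b"] for w in weeks)

print(round(avg_of_weekly_wos, 3))   # 2.5
print(round(month_level_wos, 3))     # 0.629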
