All documents in the Solr index have an "added" field containing the ISO 8601 timestamp at which the document was added to Solr.
<result name="response" numFound="34587104" start="0">
  <doc>
    <date name="added">2013-03-04T01:00:26Z</date>
    <str name="text">Hello, world!</str>
    <str name="id">93416604d274d28a44e14a9535bb9e6e1db3d851</str>
    <str name="_version_">1428536769315340290</str>
  </doc>
</result>
Assuming that no documents are removed, how might I get a count of how many documents exist in the index per day? For instance, in order to know how many documents were in the index on 2013-03-05 I could query q=added:[* TO 2013-03-05T00:00:00Z]. However, I need to know how many documents were in the index for each day from one month ago until today.
One solution might be to query how many documents were in the index on the date one month ago, then facet on how many documents were added each day, and add those to a cumulative count. Pseudocode:
initial_count = q=added:[* TO NOW/MONTH-1MONTH]
running_total = initial_count

daily_added_array = facet.range=added
                    & f.added.facet.range.start=NOW/MONTH-1MONTH
                    & f.added.facet.range.end=NOW/DAY-1DAY
                    & f.added.facet.range.gap=+1DAY

foreach (daily_added_array as day) {
    running_total += day;
    printf(running_total);
}
However, this method seems extremely fragile and error-prone.
Is there a way to get the cumulative number of documents in the index per day?
I don't think there is a better way than faceting to pull out the daily counts, and using date math is preferable to any attempts to calculate the specific date strings, so I think you already have it right on those aspects.
About the only improvement I can see is to query *:* and grab the hit count off of that; then you can use the daily counts to generate your running totals backwards by subtraction, rather than forwards by addition. This should perform a little better than your method, since *:* requires no filtering work or score calculation at all for Solr. It also gives you one less date-math expression to write :)
Speaking of which, I suspect NOW/MONTH-1MONTH isn't what you meant. That says "NOW, rounded down to the start of the current month, minus one month", which will be anywhere from 28 to 61 days ago depending on what day NOW is (consider NOW being March 1 vs. Dec 31: you get Feb 1 or Nov 1 respectively). That won't correspond to your stated requirement:
I need to know how many documents were in the index for each day from
one month ago until today
I think you probably want NOW/DAY-1MONTH. Also, it seems you are excluding today's documents with the upper bound of your facets... is that desired? (If so, my method still works, but you have to extend the upper bound of your facets to NOW/DAY+1DAY and just ignore the overall document total when generating your running-total list, still working backwards.)
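For illustration, here is a minimal sketch of that backwards-subtraction approach in Python. The Solr URL and core name are placeholders, and it assumes the requests library and Solr's JSON response format:

import requests

# Placeholder endpoint; substitute your own host and core.
SOLR_SELECT = "http://localhost:8983/solr/mycore/select"

params = {
    "q": "*:*",                # total hit count, no filtering or scoring needed
    "rows": 0,
    "wt": "json",
    "facet": "true",
    "facet.range": "added",
    "f.added.facet.range.start": "NOW/DAY-1MONTH",
    "f.added.facet.range.end": "NOW/DAY+1DAY",
    "f.added.facet.range.gap": "+1DAY",
}
data = requests.get(SOLR_SELECT, params=params).json()

total = data["response"]["numFound"]
# Range facet counts come back as a flat list: [date1, count1, date2, count2, ...]
flat = data["facet_counts"]["facet_ranges"]["added"]["counts"]
days = list(zip(flat[0::2], flat[1::2]))

# Walk backwards: the index size at the end of each day is the current total
# minus everything that was added after that day.
running = total
for day, added in reversed(days):
    print(day, running)
    running -= added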
Depending on your Solr version, you may be able to use Grouping (aka Field Collapsing) together with the group.func parameter.
http://wiki.apache.org/solr/FieldCollapsing
Set rows=0&group=true&group.field=added&group.func=rint(div(ms(added),mul(24,mul(60,mul(60,1000)))))
The group.func expression converts the added date to milliseconds, divides by the number of milliseconds in a day, and rounds to the nearest day.
The number of groups returned is what you want. You can filter it by the last month, etc., as you like.
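As a rough illustration of what that expression computes, in Python (the timestamp is just the example value from the question):

from datetime import datetime, timezone

MS_PER_DAY = 24 * 60 * 60 * 1000

added = datetime(2013, 3, 4, 1, 0, 26, tzinfo=timezone.utc)
ms = added.timestamp() * 1000           # same idea as ms(added)
day_bucket = round(ms / MS_PER_DAY)     # mirrors rint(div(...)): rounds to the nearest whole day
print(day_bucket)                       # documents added around the same day share this value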
(I use the term "teams" generically here because the entirety of this question rests on ranking, and it seemed to be the most intuitive language to describe my problem.)
In a league of 30 teams, each day only 8 teams play. The results for those teams are ranked ordinally from 1 to 8 for the day. This continues "forever", so that additional results must be recorded every day.
Example after 4 days:
I want to calculate a single number to describe the relationship between two teams. For instance, given the example, the value (in a 2d table) that describes the relationship of Ace to Get is 1. Ace beat Get twice and Get beat Ace once (2-1).
I have been messing with SUMPRODUCT, MATCH, and INDEX to get the values, which I could calculate using many extra tables, but I may need to add "teams" on the fly, and I do not know how large the pool of teams will become. Because of this, I was hoping to be able to use a single formula in the 2d relationship table. The results of that table, looking at just day 1 and day 2 given the previous example, are:
Is there a direct formula I can use to calculate the results to populate that table?
You can try the following formula:
=IF($A11<>B$10;
SUMPRODUCT(
IF(MMULT(($B$1:$I$1)*($B$2:$I$3=$A11);ROW($1:$8)^0)
<MMULT(($B$1:$I$1)*($B$2:$I$3=B$10);ROW($1:$8)^0);
1;
-1)
*(((MMULT(--($B$2:$I$3<>$A11);ROW($1:$8)^0)=8)
+(MMULT(--($B$2:$I$3<>B$10);ROW($1:$8)^0)=8))
=0));
"")
Copy right and down.
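If it helps to see the logic outside of Excel, here is a small Python sketch of the same head-to-head calculation; the team names and daily rankings are placeholders, not the data from the question:

# Each day's results, listed best to worst (rank 1 first). Placeholder data.
days = [
    ["Ace", "Get", "Bow", "Cat", "Dig", "Elk", "Fox", "Hat"],
    ["Get", "Ace", "Cat", "Bow", "Elk", "Dig", "Hat", "Fox"],
    ["Ace", "Bow", "Get", "Cat", "Fox", "Dig", "Elk", "Hat"],
]

def head_to_head(team_a, team_b):
    """Wins minus losses for team_a against team_b, counting only the days
    on which both teams played."""
    score = 0
    for ranks in days:
        if team_a in ranks and team_b in ranks:
            score += 1 if ranks.index(team_a) < ranks.index(team_b) else -1
    return score

print(head_to_head("Ace", "Get"))   # 1 with this placeholder data (2 wins, 1 loss)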
I am required to do some summary statistics on the attached table as an example.
Some of the questions to answer include:
1) How many countries have a valid time series (i.e., at least one value for a given indicator name over the period 2010-2015)?
e.g.: count how many countries have a valid time series for the indicator "Number of completed applications".
2) For a given country and indicator, what is the number of years with a valid time series?
e.g.: for the indicator "Number of completed applications" and the country Canada? (Answer: 2 --> 2014, 2015)
Alternatively, if the table looked like this instead (which is a typical CSV format), what approach could be taken to answer the two summary-statistics questions above?
I have tried a SUMPRODUCT formula on the pivoted table. Is there a better way than this method?
=SUMPRODUCT(N((B2:B14>0)+(C2:C14>0)+(D2:D14>0)+(E2:E14>0)+(F2:F14>0)+(G2:G14>0)+(H2:H14>0)+(I2:I14>0)+(J2:J14>0)>0))
But what about when it is a flat table?
So, an example of countifs() and also sumifs():
From Nevsky -- Thanks a lot for the example! I took the liberty of modifying it a bit, as follows:
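Separately from that worksheet example, the same two checks for the flat (CSV-style) layout can be sketched in Python with pandas; the file name and the column names Country, Indicator, Year and Value are assumptions about the layout:

import pandas as pd

# Assumed flat layout: one row per country / indicator / year observation.
df = pd.read_csv("indicators.csv")   # columns: Country, Indicator, Year, Value

valid = df[df["Year"].between(2010, 2015) & df["Value"].notna()]
indicator = "Number of completed applications"

# 1) Countries with at least one value for the indicator over 2010-2015.
n_countries = valid.loc[valid["Indicator"] == indicator, "Country"].nunique()

# 2) Number of years with a value for a given country and indicator.
n_years = valid.loc[
    (valid["Indicator"] == indicator) & (valid["Country"] == "Canada"), "Year"
].nunique()

print(n_countries, n_years)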
I need to keep a 28-day history for some dashboard data. Essentially I have an event/action that is recorded through our BI system. I want to count the number of events and the distinct users who perform that event over the past 1 day, 7 days and 28 days. I also use grouping sets (cube) to get the fully segmented data by country/browser/platform etc.
The old way was to do this by keeping a 28-day history per user, for all segments. So if a user accessed the site from mobile and desktop every day for all 28 days they would have 54 rows in the DB. This ends up being a large table and is time-consuming even to calculate approx_distinct, let alone distinct. But the issue is that I also wish to calculate approx_percentiles.
So I started investigating the use of HyperLogLog https://prestodb.io/docs/current/functions/hyperloglog.html
This works great; it's much more efficient to store the sketches daily rather than the entire list of unique users per day. As I am using approx_distinct the values are close enough and it works.
I then noticed a similar structure for medians: qdigest.
https://prestodb.io/docs/current/functions/qdigest.html
Unfortunately the documentation on this page is not nearly as good as on the previous pages, so it took me a while to figure it out. This works great for calculating daily medians, but it does not work if I want to calculate the median actions per user over a longer time period. The HyperLogLog examples demonstrate how to calculate approx_distinct users over a time period, but the qdigest docs do not give such an example.
When I try something similar to the HLL date-range example with qdigest, I get results that look like the 1-day results.
Because you need medians that are aggregated (summed) across multiple days on a per-user basis, you'll need to perform that aggregation prior to insertion into the qdigest in order for this to work for 7- and 28-day per-user counts. In other words, the units of the data need to be consistent: if daily values are being inserted into the qdigest, you can't use that qdigest for 7- or 28-day per-user counts of the events.
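To see why that matters, here is a small Python sketch with made-up numbers: for a 7- or 28-day window you want the median of per-user totals over the window, which is not what you get from values that were inserted one day at a time:

from statistics import median
from collections import defaultdict

# Made-up daily event counts: one dict per day, mapping user -> events that day.
daily_counts = [
    {"u1": 2, "u2": 10},
    {"u1": 3, "u2": 12},
    {"u1": 1, "u2": 11},
]

# A digest built from per-user *daily* values can only answer questions about
# daily counts, so its median is a median of daily counts.
daily_values = [n for day in daily_counts for n in day.values()]
print(median(daily_values))        # 6.5 -- median of daily counts

# For the multi-day window, sum per user across the window first,
# then take the median of those totals.
totals = defaultdict(int)
for day in daily_counts:
    for user, n in day.items():
        totals[user] += n
print(median(totals.values()))     # 19.5 -- median of per-user window totals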
We are using Couchbase as our NoSQL store and loving it for its capabilities.
There is, however, an issue that we are running into with creating associations via view collation. This can be thought of as akin to a join operation.
While our data sets are confidential, I am illustrating the problem with this model.
The volume of data is considerable, so it cannot be processed in memory. Let's say we have data on ice-creams, zip codes and the average temperature of the day.
One type of document contains a zipcode to icecream mapping
and the other one has transaction data of an ice-cream being sold in a particular zip.
The problem is to be able to determine a set of top ice-creams sold by the temperature of a given day.
We crunch this corpus with a view that emits two outputs: one is a zipcode-to-temperature mapping, while the other represents an ice-cream sale in a zip code:
Key Value
[zip1] temp1
[zip1,ice_cream1] 1
[zip2,ice_cream2] 1
The view collation here is the mechanism that creates an association between the ice-cream sale, the zip and the average temperature, i.e. a join.
We have a constraint that the temperature lookup happens only once in 24 hours, when the zip is first seen, and that is the valid average temperature to use for that day. E.g., if a lookup happens at 12:00 pm on Jan 1st, the next lookup does not happen until 12:00 pm on Jan 2nd. However, the average temperature accepted in the first lookup is valid only for Jan 1st, and the one from the second lookup only for Jan 2nd, including the first half of that day.
Now things get complicated when I want to do the same query with a time component involved, concretely associating the average temperature of a day with the ice-creams that were sold on that day in that zip, e.g. x vanilla ice-creams were sold when the average temperature for that day was 70 F:
Key Value
[y,m,d,zip1] temp1
[y,m,d,zip2,ice_cream2 ] 1
[y,m,d2,zip1,ice_cream1] 1
This has an interesting impact on the queries: say I query for the last 1 day; I cannot make any association between the ice-creams and the temperature before the first lookup happens, since that is when the two keys align. The net effect is that I lose the ice-cream counts for that day from before the temperature lookup happens. I was wondering whether any of you have faced similar issues, and whether you are aware of a pattern or solution so as not to lose those counts.
First, welcome to StackOverflow, and thank you for the great question.
I understand the specific issue that you are having, but what I don't understand is the scale of your data - so please forgive me if I appear to be leading you down the wrong path with what I am about to suggest. We can work back and forth on this answer depending on how it suits your specific needs.
First, you have discovered that CB does not support joins in its queries. I am going to suggest that this is not really an issue when CB is used properly. The conceptual model for how Couchbase should be used to filter data is as follows:
1) Create a CB view that is as precise as possible.
2) Select records as precisely as possible from CB using that view.
3) Fine-filter records as necessary in the data-access layer (and perform any joins) before sending them on to the rest of the application.
From your description, it sounds to me as though you are trying to be too clever with your CB view query. I would suggest one of two courses of action:
1) Manually look up the value that you want, when this happens, with a second view query.
2) Look up more records than you need, then fine-filter afterward (step 3 above).
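As a minimal sketch of option 2, here is the application-layer join in Python; the rows below are placeholders shaped like the keys in your example, standing in for the results of two separate view queries restricted by [y, m, d, ...] key ranges:

from collections import defaultdict

# Placeholder view rows (key, value); in practice these come from two view queries.
temp_rows = [
    ([2013, 1, 1, "zip1"], 70),
]
sale_rows = [
    ([2013, 1, 1, "zip1", "vanilla"], 1),
    ([2013, 1, 1, "zip2", "chocolate"], 1),   # zip2's temperature not looked up yet
]

# Join in the data-access layer instead of relying on view collation.
temp_by_day_zip = {tuple(key): temp for key, temp in temp_rows}

counts = defaultdict(int)
for (y, m, d, zipcode, flavor), n in sale_rows:
    temp = temp_by_day_zip.get((y, m, d, zipcode))  # None until that day's lookup happens
    counts[(flavor, temp)] += n                     # the count is never dropped; the temperature
                                                    # can be back-filled once it becomes known

print(dict(counts))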
I have a calculated field a/b which makes sense at week level, where a is a last-of-period metric and b is a sum-over-period metric. I need to find avg(a/b) for the weeks that fall within a month, not end_of_month(a)/sum(b) for the month. I made my a/b metric with its regular aggregate set to Calculated, and a monthly average metric with its regular aggregate set to Average, but it doesn't work. The report is a crosstab report. How can I solve this?
Edit: a is end-on-hand inventory, b is sales, and a/b is weeks of supply. Both a and b are spread along product/location/time dimensions. For a, I've set its regular aggregate for time to Last.
Your expression gives me an error, so I tried average((total([a] for [week]))/(total([b] for [week]))), which is error-free but doesn't give the correct result. I used total([a]) because it still has to sum along the other dimensions except time. Any ideas?
I was also trying an alternative way: get the individual weeks of supply and then derive a new metric as (first week WOS + ... + fifth week WOS)/5. But when I try to use a case statement I get the warning "Relational query objects are being used in conjunction with Dimensionally-modeled relational objects", and the metric gives a garbage value. How can I apply a case statement involving a relational item in the query items of a measure dimension?
You have to be clearer about what you are trying to achieve.
Also, the header says Framework Manager, while you are talking about a report.
My best guess is that you need to use the for expression when you aggregate the values:
avg((max([a] for [week]))/(sum([b] for [week])))
You might need to use another (more sophisticated) summary function.
For more details about the FOR and AT options, look here:
Using the AT and FOR Options with Relational Summary Functions
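To make the difference between the two aggregations concrete, here is a small Python sketch with made-up weekly values for one month (a = end-of-week inventory, b = weekly sales):

# Made-up weekly values for one month.
weeks = [
    {"a": 100, "b": 20},
    {"a": 90,  "b": 30},
    {"a": 120, "b": 40},
    {"a": 80,  "b": 20},
]

# What the question asks for: the average of the weekly a/b ratios.
avg_of_weekly_ratios = sum(w["a"] / w["b"] for w in weeks) / len(weeks)

# What a plain monthly rollup gives: last a over summed b.
monthly_ratio = weeks[-1]["a"] / sum(w["b"] for w in weeks)

print(avg_of_weekly_ratios)   # 3.75
print(monthly_ratio)          # ~0.73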