Using QDigest over a date range - presto

I need to keep a 28 day history for some dashboard data. Essentially I have an event/action that is recorded through our BI system. I want to count the number of events and the distinct users who do that event for the past 1 day, 7 days and 28 days. I also use grouping sets (cube) to get the fully segmented data by country/browser/platform etc.
The old way was to keep a 28 day history per user, for all segments. So if a user accessed the site from mobile and desktop every day for all 28 days, they would have 56 rows in the DB. This ends up being a large table, and it is time consuming to compute even approx_distinct rather than an exact distinct count. But the issue is that I also wish to calculate approximate percentiles with approx_percentile.
So I started investigating the use of HyperLogLog https://prestodb.io/docs/current/functions/hyperloglog.html
This works great; it's much more efficient to store the sketches daily rather than the entire list of unique users per day. As I am using approx_distinct, the values are close enough and it works.
I then noticed a similar function for medians. Qdigest.
https://prestodb.io/docs/current/functions/qdigest.html
Unfortunately the documentation is not nearly as good on this page as it is on previous pages, so it took me a while to figure it out. This works great for calculating daily medians. But it does not work if I want to calculate the median actions per user over the longer time period. The examples in HyperLogLog demonstrate how to calculate approx_distinct users over a time period but the Qdigest docs do not give such an example.
When I try something similar to the HLL date-range example with Qdigest, the results I get are similar to the 1 day results.

Because you're in need of medians that are aggregated (summed) across multiple days on a per user basis, you'll need to perform that aggregation prior to insertion into the qdigest in order for this to work for 7- and 28-day per-user counts. In other words, the units of the data need to be consistent, and if daily values are being inserted into qdigest, you can't use that qdigest for 7- or 28-day per-user counts of the events.
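To illustrate the point about consistent units outside of Presto, here is a minimal Python sketch. The data and function names are made up for illustration, and statistics.median stands in for the approximate quantile that a qdigest would return. It only shows why a median over daily per-user values differs from a median over per-user totals for the whole window.

```python
from collections import defaultdict
from statistics import median

# Hypothetical data: (day, user_id, number of events that user did on that day).
daily_counts = [
    (1, "a", 2), (2, "a", 2), (3, "a", 2),
    (1, "b", 1), (2, "b", 1),
    (3, "c", 5),
]

def median_of_daily_values(rows):
    """Roughly what merging daily digests yields: a median over per-user-per-day values."""
    return median(count for _, _, count in rows)

def median_of_window_totals(rows, first_day, last_day):
    """What is wanted: sum per user over the window first, then take the median."""
    totals = defaultdict(int)
    for day, user, count in rows:
        if first_day <= day <= last_day:
            totals[user] += count
    return median(totals.values())

print(median_of_daily_values(daily_counts))          # median of daily values
print(median_of_window_totals(daily_counts, 1, 3))   # median of 3-day per-user totals
```

The two numbers differ because the second one is built from per-user sums over the window, which is the value that would have to be inserted into a separate digest for each window length.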

Related

Calculate minutes in specific period

I have a spreadsheet which has employee working times, listed as Sat-In and Sat-Out for a specific date. The employee shift spans several hours and each "In-Out" period is recorded as a separate line, which means the time between a Sat-Out and the next Sat-In is a break for the employee. I also calculate the time, in minutes, of each "sitting" period.
What I can't seem to figure out is how to add a formula which takes the data and further refines it in this manner:
1. I have a core period of 1030-1530, as an example, which is the busy time and requires the maximum employee coverage. The shifts of employees generally spans this core, but in some cases their shift may start or end in the core.
2. I want to calculate how many minutes the employee worked within the core only. I can obviously do this manually using the data, but a formula would be preferred, if possible.
3. As an example, if a person sat-in at 1445 and sat-out at 1545, the core time calculation would be 45 minutes (1445-1530).
I've attached a snapshot of the data to help my explanation.
FYI - the information is pulled from a database as JSON data and converted to Excel. I'm not very familiar with JavaScript, but if someone knows a way to do it programmatically, I'm willing to give it a try and learn.
Thanks!
![excel]: https://photos.app.goo.gl/dRSTE72CXNa18RzP8
In the example below I've used: =MAX(0,MIN($O$2,H2)-MAX($O$1,G2)), formatted as [mm].
In Excel, units are days, so if you want to calculate the number of minutes between two timestamps, you need to subtract one from the other and multiply the difference by 24*60 (the number of minutes in one day).
E.g. You start working at 09:07 (cell B2), and finish at 18:07 (cell B3), having a 45-minutes break. Then the time you worked in minutes, is:
=(B3-B2)*24*60-45
Make sure the cell formatting is correct (general), you'll get : 495.
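Since the question also asks about a programmatic route, here is a minimal Python sketch of the same clamping idea as the MAX/MIN formula above: take the later of the two start times, the earlier of the two end times, and never go below zero. The function and helper names are hypothetical, and it assumes all four times fall on the same day.

```python
from datetime import datetime

def parse_hhmm(text):
    """Parse a time such as '14:45' (the date part is irrelevant here)."""
    return datetime.strptime(text, "%H:%M")

def core_minutes(sat_in, sat_out, core_start, core_end):
    """Minutes of the [sat_in, sat_out] period that fall inside the core period."""
    start = max(sat_in, core_start)   # later of the two starts
    end = min(sat_out, core_end)      # earlier of the two ends
    return max(0, int((end - start).total_seconds() // 60))

core_start, core_end = parse_hhmm("10:30"), parse_hhmm("15:30")
print(core_minutes(parse_hhmm("14:45"), parse_hhmm("15:45"), core_start, core_end))
```

For the example from the question (sat-in 1445, sat-out 1545, core 1030-1530) this gives 45 minutes; a sitting period entirely outside the core gives 0.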

Series Design in Influxdb for showing number of repeat customers

Consider an analytics use case where you need to find repeat customers for a date range. Repeat customers for a date range are defined as customers who used the service within 3*(given date range interval) before the start of the range and also used the service within the given date range.
For example, the repeat customers for this week are all customers who used the service in the 3 weeks before the start of this week and also used it this week.
I am using influxdb. I haven't decided the series yet, I am looking for inputs into how I can define a series such that I can do available operations in influxdb to obtain this analytics.
Data available to me is the timestamp at which user used the facility, user_id , service_category, service_instance_id, and a json dump of further details about service.
Maybe my thought process is limited; I need some help on how to approach this, and any input is welcome.
So I thought about this and came to a decent solution. I have to save the last time a user visited along with the entry. So at least one reference will be there for any time period if user is repeating for that time period.
This is similar to a linkedlist except that we have direct access to time based filtering of nodes.
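A minimal Python sketch of that idea, independent of InfluxDB. The tuple layout and function name are assumptions for illustration: each entry carries the user's previous visit time, and a user counts as a repeat customer when a visit inside the range has a previous visit inside the lookback window of 3 times the range length.

```python
from datetime import datetime

def repeat_customers(events, start, end):
    """events: iterable of (timestamp, user_id, previous_visit or None).

    Returns the users with a visit in [start, end) whose previous visit
    falls in [start - 3 * (end - start), start).
    """
    lookback_start = start - 3 * (end - start)
    return {
        user
        for ts, user, prev in events
        if start <= ts < end
        and prev is not None
        and lookback_start <= prev < start
    }

events = [
    (datetime(2020, 1, 10), "u1", None),
    (datetime(2020, 1, 23), "u1", datetime(2020, 1, 10)),   # previous visit in lookback
    (datetime(2019, 12, 1), "u2", None),
    (datetime(2020, 1, 24), "u2", datetime(2019, 12, 1)),   # previous visit too old
    (datetime(2020, 1, 25), "u3", None),
    (datetime(2020, 1, 26), "u3", datetime(2020, 1, 25)),   # only visits inside the range
]
print(repeat_customers(events, datetime(2020, 1, 22), datetime(2020, 1, 29)))
```

Only the first visit inside the range can match, because its previous visit is the user's latest visit before the range started; that single reference is enough to decide whether the user is repeating.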

Access DB: Daily cumulative count and daily number on assignment

I have a table that tracks the dispatching of personnel. The table has the employee name and the date the person went out and the date they returned.
The table has hundreds of entries from 1988 to current.
In Excel I track the cumulative count per day (of the year) of how many people have been sent out, and I also track the number of people out on any given day. The table lists the Month & Day in the first column (every day of the year, including leap days) and the years on the first row. There is data for every date (a zero is entered until the first person is sent out that year, then starts counting up as there are more dispatches, or in the case of the number of people out each day, it will show zero if no one is out that day or if there were, say 5 people out, it would show "5" for that day). I then use the data in Excel to construct a graph that shows the number of dispatches on the y axis and the day of the year on the x axis (along with the current year’s number, the average number and the max over the 27 year history). Currently I just track this manually (I just keep a running count of each and enter it in manually in Excel.) I would like to build a query of my Access data that would return the same information that I could import into my Excel spreadsheet. One query that would show the day & month in the first column and the years along the top row and for each day show a cumulative count for that year of how many people have been sent out. Another query that has the day & month in the first column and the years along the top and a count of how many people were out for that particular day for that particular year. There shouldn't be any gaps (every day has data, even if it is "0"). I would then import those queries into Excel to replace my manual tracking that I am doing now.
I know how to construct the Excel stuff (I have that running already), and how to import info from Access to Excel, what I need to know is how to construct these 2 Access queries.
Any help/ideas on how to construct those 2 queries would be greatly appreciated!
I'd recommend that you migrate this app to a web based solution that uses a real database - SQL Server or MySQL, not Access.
"Desk drawer software" is what I call homegrown apps that someone creates for themselves to perform some small task that eventually become integral to running a business and grow out of hand. Your truck factor is 1: if anything happened to you, no one would know how to do this function. The software may not be backed up or checked into a source code management system. There's no QA. There's no way to migrate new features to production: if you alter the app, then that is what you have.
I'd recommend a web app to mitigate all the risks I've described:
You have to deploy a web app to a server, which takes it off your desktop and puts it in a central place where anyone who's authorized can access it.
Separates database from display issues.
Makes you think about how to archive historical data. Partitioning by year makes sense.
Likely you'll put this in a source code management system like Subversion or Git.
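Whichever database ends up holding the data, the two figures the question asks for can be sketched in a few lines of Python. This is not an Access query, only an illustration of the logic under stated assumptions: the function name is made up, a person counts as "out" on both the departure and the return date, and a dispatch is counted in the cumulative total of the year in which the person left.

```python
from datetime import date, timedelta

def daily_counts(dispatches, year):
    """dispatches: list of (date_out, date_returned) pairs.

    Returns {day: (cumulative_dispatches_this_year, people_out_that_day)}
    for every day of the given year, so there are no gaps.
    """
    result = {}
    cumulative = 0
    day = date(year, 1, 1)
    while day.year == year:
        cumulative += sum(1 for out, _ in dispatches if out == day)
        out_today = sum(1 for out, back in dispatches if out <= day <= back)
        result[day] = (cumulative, out_today)
        day += timedelta(days=1)
    return result

dispatches = [
    (date(2021, 1, 2), date(2021, 1, 4)),
    (date(2021, 1, 3), date(2021, 1, 3)),
]
counts = daily_counts(dispatches, 2021)
print(counts[date(2021, 1, 3)])
```

Every day of the year gets an entry, including days with zero dispatches, which matches the "no gaps" requirement in the question.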

View Collation with Couchbase

We are using couchbase as our nosql store and loving it for its capabilities.
There is however an issue that we are running in with creating associations
via view collation. This can be thought of akin to a join operation.
While our data sets are confidential I am illustrating the problem with this model.
The volume of data is considerable, so it cannot be processed in memory. Let's say we have data on ice-creams, zip-code and average temperature of the day.
One type of document contains a zipcode to icecream mapping
and the other one has transaction data of an ice-cream being sold in a particular zip.
The problem is to be able to determine a set of top ice-creams sold by the temperature of a given day.
We crunch this corpus with a view to emit two outputs, one is a zipcode to temperature mapping , while the other
represents an ice-cream sale in a zip code:
Key Value
[zip1] temp1
[zip1,ice_cream1] 1
[zip2,ice_cream2] 1
The view collation here is a mechanism to create an association between the ice_cream sale, the zip and the average temperature ie a join.
We have a constraint that the temperature lookup happens only once in 24 hours when the zip is first seen and that is the valid
avg temperature to use for that day. eg lookup happened at 12:00 pm on Jan 1st, next lookup does not happen till 12:00 pm Jan 2nd.
However the avg temperature that is accepted in the 1st lookup is valid only for Jan 1st and that on the 2nd lookup only for Jan 2
including the first half of the day.
Now things get complicated when I want to do the same query with a time component involved, concretely associating the average temperature of a
day with the ice-creams that were sold on that day in that zip, e.g. x vanilla ice-creams were sold when the average temperature for that day was 70 F
Key Value
[y,m,d,zip1] temp1
[y,m,d,zip2,ice_cream2] 1
[y,m,d2,zip1,ice_cream1] 1
This has an interesting impact on the queries, say I query for the last 1 day I cannot make any associations between the ice-cream and temperature before the
first lookup happens, since that is when the two keys align. The net effect being that I lose the ice-cream counts for that day before that temperature lookup
happens. I was wondering if any of you have faced similar issues and if you are aware of a pattern or solution so as not to lose those counts.
First, welcome to StackOverflow, and thank you for the great question.
I understand the specific issue that you are having, but what I don't understand is the scale of your data - so please forgive me if I appear to be leading down the wrong path with what I am about to suggest. We can work back and forth on this answer depending on how it might suit your specific needs.
First, you have discovered that CB does not support joins in its queries. I am going to suggest that this is not really an issue if CB is used properly. The conceptual model for how Couchbase should be used to filter out data is as follows:
1. Create the CB view to be as precise as possible.
2. Select records as precisely as possible from CB using the view.
3. Fine-filter records as necessary in the data-access layer (also perform any joins) before sending them on to the rest of the application.
From your description, it sounds to me as though you are trying to be too clever with your CB view query. I would suggest one of two courses of action:
Manually look-up the value that you want when this happens with a second view query.
Look up more records than you need, then fine-filter afterward (step 3 above).
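As a rough illustration of doing the join in the data-access layer, here is a minimal Python sketch. It assumes the view rows for the queried date range have already been fetched as (key, value) pairs shaped like the keys in the question and that this selection fits in memory; the function name is made up and no Couchbase client calls are shown.

```python
from collections import defaultdict

def sales_by_temperature(rows):
    """Join sale rows with temperature rows on (y, m, d, zip).

    rows: iterable of (key, value) where key is either
      [y, m, d, zip]            -> value is the day's average temperature
      [y, m, d, zip, ice_cream] -> value is 1 per sale
    Sales with no temperature for that day and zip are kept under None.
    """
    temps = {}
    sales = defaultdict(int)
    for key, value in rows:
        if len(key) == 4:
            temps[tuple(key)] = value
        else:
            sales[tuple(key)] += value

    joined = defaultdict(int)
    for (y, m, d, zip_code, ice_cream), count in sales.items():
        temp = temps.get((y, m, d, zip_code))
        joined[(temp, ice_cream)] += count
    return dict(joined)

rows = [
    ([2020, 1, 1, "zip1", "vanilla"], 1),   # sold before the temperature lookup
    ([2020, 1, 1, "zip1"], 70),
    ([2020, 1, 1, "zip1", "vanilla"], 1),
    ([2020, 1, 1, "zip2", "chocolate"], 1), # no temperature row for zip2
]
print(sales_by_temperature(rows))
```

Because all rows are collected before the join, a sale recorded earlier in the day than the temperature lookup is still matched to that day's temperature, and sales without any temperature are kept under None rather than dropped.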

Attendance Calculations / Period Calendar

This is a multi-tiered project. Let me give a quick overview. I have attendance data: card/timestamp punches. I would like to have a pivot table with slicers in Excel. Ideally you'd be able to choose a department / last name / associate number, and also a period of time. Ideally the period would come from a table with the company period/week, and maybe default to last week.
I can get at timecard data in two ways:
(1) generate a CSV that automatically performs the timecard math, to figure out how many hours someone worked and it is smart enough to understand 3rd shift workers. The format of that CSV is:
Last Name, First Name, Personnel Type, Associate Number, Facility, Department, TimeIn, TimeOut, Total Hours
The problem with this method is that I would have to manually append the information to the CSV tables. Or come up with some autoIT script.
(2) Get at the raw data via sql/odbc. This way the math is not done. It is just all of the associates' timestamps. I would have to figure out the daily hours myself and figure out a 3rd shift formula too. It is not a set schedule; many people swing shifts and others get called in a lot.
Lastly, I would like to be able to filter the dates by using our company fiscal calendar. I have a spreadsheet that goes from 2000 to 2093, with every day listed and its corresponding year/period/week.
Example period info spreadsheet:
date       Year  Period  week  WeekTotal  Period Total
12/3/2007  2008  1       1     2008.1.1   2008.1
12/4/2007  2008  1       1     2008.1.1   2008.1
I know there is a lot going on here, but what would be the best way to approach this project?
I have not been able to post any script, but the last time I tried this I used two options:
1. A PHP conversion where the time was stored as numbers (which makes calculations easier).
2. Storing the time in the tables in separate columns or fields for hours, mins, and seconds. This eased the input, but I still had to calculate the output in PHP, especially for totals, averages and differences.
Hope it helps a bit
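For the raw-timestamp route, here is a minimal Python sketch of the hours calculation. It assumes the punches have already been paired into (associate, TimeIn, TimeOut) with full date-and-time values, and that a shift is attributed to the fiscal week of its TimeIn date; the function name and the dictionary layout of the calendar are made up for illustration. With full timestamps, a 3rd shift that crosses midnight needs no special formula.

```python
from collections import defaultdict
from datetime import date, datetime

def hours_by_week(punches, fiscal_calendar):
    """Total hours per (associate, fiscal week).

    punches: iterable of (associate_number, time_in, time_out) datetimes.
    fiscal_calendar: {calendar date: week label such as '2008.1.1'}.
    """
    totals = defaultdict(float)
    for associate, time_in, time_out in punches:
        week = fiscal_calendar[time_in.date()]          # attribute shift to its start date
        totals[(associate, week)] += (time_out - time_in).total_seconds() / 3600
    return dict(totals)

fiscal_calendar = {
    date(2007, 12, 3): "2008.1.1",
    date(2007, 12, 4): "2008.1.1",
}
punches = [
    ("A1", datetime(2007, 12, 3, 22, 0), datetime(2007, 12, 4, 6, 30)),  # crosses midnight
    ("A1", datetime(2007, 12, 4, 22, 0), datetime(2007, 12, 5, 6, 0)),
]
print(hours_by_week(punches, fiscal_calendar))
```

The resulting per-week totals could then be loaded into the pivot table, with the fiscal calendar spreadsheet supplying the date-to-week lookup.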
