What is the best way to store daily prices on Apache Druid?

I'd like to store daily prices in Apache Druid in a way that makes it possible to group them by day, week, and month. I was thinking of defining the open and close prices as metrics, not dimensions, but for that I would have to use first and last aggregators to group the open and close prices, respectively. The documentation says:
(Double/Float/Long) First and Last aggregator cannot be used in ingestion spec, and should only be specified as part of queries.
So it seems I cannot load the data that way in the first place. I would like to know the best way to load daily price data so that all of Druid's capabilities are available when grouping by day, week, or month.
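Note that the quoted restriction is about ingestion; the first/last aggregations are still available at query time, where Druid SQL exposes them as EARLIEST and LATEST. Below is a minimal, unverified sketch of a weekly rollup through the SQL API, assuming a hypothetical daily_prices datasource with open, close, high, and low columns and a router listening on localhost:8888:

    # Hedged sketch: weekly OHLC rollup via Druid's SQL endpoint.
    # Datasource/column names and the URL are assumptions, not from the question.
    import requests  # third-party HTTP client

    DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

    query = """
    SELECT
      TIME_FLOOR(__time, 'P1W') AS week_start,
      EARLIEST("open")  AS weekly_open,   -- first open in the week
      LATEST("close")   AS weekly_close,  -- last close in the week
      MAX("high")       AS weekly_high,
      MIN("low")        AS weekly_low
    FROM "daily_prices"
    GROUP BY TIME_FLOOR(__time, 'P1W')
    ORDER BY week_start
    """

    resp = requests.post(DRUID_SQL_URL, json={"query": query})
    resp.raise_for_status()
    for row in resp.json():  # default result format is a JSON array of objects
        print(row)

Swapping 'P1W' for 'P1D' or 'P1M' in TIME_FLOOR gives the daily and monthly groupings, so the raw daily rows can stay as plain columns at ingestion.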

Related

spark partition strategy comparison between date=dd-mm-yyyy vs yyyy={xxxx}/mm={mm}/dd={xx}

How do I choose a partition strategy for dates in Spark? I have a date column in a data frame in the 2020-02-19 format. Should I specify the date as the partition column while writing, or create multiple columns (dd, mm, yyyy) from the date in the table and specify yyyy, mm, dd as the partition columns?
What kinds of issues will come up with each partition strategy?
There is no actual gain in breaking the data into one partition (date=yyyy-mm-dd) versus multiple partitions (year=yyyy/month=mm/day=dd): if you have to process the last 10 days, both will deliver the same amount of data in the same time. The biggest difference is the way you query and the way you will maintain your data.
With one single partition, your life is easy when writing queries for a specific day: "I need to run something for 3 days ago," or "I need to query a date range from the 1st of Jan to the 1st of May." Having one partition with the full date makes that much easier.
Having multiple partitions makes monthly analysis easy; querying a whole month or a whole year is simple. But you lose the convenience of querying the data over an arbitrary date range.
Beyond those trade-offs, from a performance perspective neither layout creates any overhead: both solutions deliver the data at the same speed, because neither breaks the data into smaller files. I prefer a single partition on the day, because it is easier to maintain from my point of view.
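For reference, a hedged PySpark sketch of the two layouts (paths and column names are made up for illustration):

    # Sketch only: compares the two partition layouts discussed above.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("partition-layouts").getOrCreate()
    df = spark.read.parquet("/data/events_raw")  # assumes a "date" column like 2020-02-19

    # Layout 1: a single partition column -- easy date-range predicates.
    df.write.partitionBy("date").parquet("/data/events_by_date")

    # Layout 2: derived year/month/day columns -- easy month/year rollups.
    (df.withColumn("year", F.year("date"))
       .withColumn("month", F.month("date"))
       .withColumn("day", F.dayofmonth("date"))
       .write.partitionBy("year", "month", "day")
       .parquet("/data/events_by_ymd"))

    # Both prune the same files for "the last 10 days"; only the predicate shape differs.
    last10_a = spark.read.parquet("/data/events_by_date").where(
        "date BETWEEN '2020-02-10' AND '2020-02-19'")
    last10_b = spark.read.parquet("/data/events_by_ymd").where(
        "year = 2020 AND month = 2 AND day BETWEEN 10 AND 19")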

NetSuite Saved Search: How can I create an average daily number of transactions created by a certain individual?

I'm not sure whether this is possible, but can a saved search display the average number of daily transactions created by an individual? For example, I'm trying to build a productivity KPI and would like to see how many Purchase Requisitions or Purchase Orders each individual creates per day.
Is this possible within NetSuite? I tried averaging transactions by "created by", but the formula field will only average non-text values.
Any help would be much appreciated.

Using QDigest over a date range

I need to keep a 28 day history for some dashboard data. Essentially I have an event/action that is recorded through our BI system. I want to count the number of events and the distinct users who do that event for the past 1 day, 7 days and 28 days. I also use grouping sets (cube) to get the fully segmented data by country/browser/platform etc.
The old way was to keep a 28-day history per user, for all segments. So if a user accessed the site from mobile and desktop every day for all 28 days, they would have 54 rows in the DB. This ends up being a large table, and it is time-consuming to calculate even approx_distinct, let alone distinct. On top of that, I also wish to calculate approx_percentiles.
So I started investigating the use of HyperLogLog: https://prestodb.io/docs/current/functions/hyperloglog.html
This works great; it is much more efficient to store the sketches daily than the entire list of unique users per day. As I am using approx_distinct, the values are close enough, and it works.
I then noticed a similar function for medians, qdigest: https://prestodb.io/docs/current/functions/qdigest.html
Unfortunately the documentation on that page is not nearly as good as on the previous ones, so it took me a while to figure it out. qdigest works great for calculating daily medians, but it does not work when I want the median actions per user over a longer time period. The HyperLogLog examples demonstrate how to calculate approx_distinct users over a time period, but the qdigest docs give no such example.
When I try something analogous to the HLL date-range example with qdigest, I get results that look like the 1-day results.
Because you need medians that are aggregated (summed) across multiple days on a per-user basis, you'll need to perform that aggregation prior to insertion into the qdigest for this to work for 7- and 28-day per-user counts. In other words, the units of the data need to be consistent: if daily values are inserted into a qdigest, you can't use that qdigest for 7- or 28-day per-user counts of the events.
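A small self-contained illustration of that point, in plain Python with made-up counts rather than Presto:

    # Why merged daily digests look like 1-day results: the median of per-user
    # 7-day totals cannot be recovered from distributions of daily values.
    import random
    from collections import defaultdict
    from statistics import median

    random.seed(1)
    users = [f"u{i}" for i in range(200)]
    daily_counts = {day: {u: random.randint(0, 5) for u in users} for day in range(7)}

    # Wrong: pool the seven daily value-distributions, then take a median.
    pooled = [c for day in daily_counts.values() for c in day.values()]
    print("median of pooled daily values:", median(pooled))  # ~ a 1-day number

    # Right: sum per user across the window first, then take the median.
    weekly_totals = defaultdict(int)
    for day in daily_counts.values():
        for u, c in day.items():
            weekly_totals[u] += c
    print("median of 7-day totals:", median(weekly_totals.values()))  # ~ 7x larger

In Presto terms: keep per-user daily counts, SUM them per user over the 7- or 28-day window, and only then feed the sums to qdigest_agg; merging the daily qdigests themselves can only ever describe the distribution of 1-day values.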

Series Design in InfluxDB for showing number of repeat customers

Consider an analytics task where you need to find repeat customers for a date range. Repeat customers for a date range are defined as customers who used the service during the interval 3*(given date-range length) before the start of the range and also used the service within the given range.
For example, the repeat customers for this week are all customers who used the service in the 3 weeks before the start of this week and also used it this week.
I am using InfluxDB. I haven't decided on the series yet; I am looking for input on how to define a series such that the operations available in InfluxDB can produce this analytics.
The data available to me is the timestamp at which the user used the facility, user_id, service_category, service_instance_id, and a JSON dump of further details about the service.
Maybe my thought process is limited; I need some intervention on how to approach this, and any input is welcome.
So I thought about this and came to a decent solution: save the last time the user visited along with each entry. That way, at least one reference will exist in any time period for which the user counts as repeating.
This is similar to a linked list, except that we get direct, time-based filtering of the nodes.
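A plain-Python sketch of that scheme (the Influx points would carry the same fields; all names here are made up):

    # Each event stores the timestamp of the user's previous visit, so a single
    # row is enough to test "repeat" for any window, as in the answer above.
    from datetime import datetime, timedelta

    last_seen = {}   # user_id -> datetime of the previous visit
    points = []      # stand-in for the Influx measurement

    def record_event(user_id, ts):
        points.append({"time": ts, "user_id": user_id,
                       "last_visit": last_seen.get(user_id)})
        last_seen[user_id] = ts

    def repeat_customers(start, end):
        """Users active in [start, end) whose previous visit fell in the
        3x-length window immediately before start."""
        lookback = start - 3 * (end - start)
        return {p["user_id"] for p in points
                if start <= p["time"] < end
                and p["last_visit"] is not None
                and lookback <= p["last_visit"] < start}

    # Example: repeat customers for the week starting 2020-03-02.
    wk = datetime(2020, 3, 2)
    record_event("alice", wk - timedelta(days=10))  # prior visit in the lookback
    record_event("alice", wk + timedelta(days=1))   # visit inside the week
    print(repeat_customers(wk, wk + timedelta(days=7)))  # {'alice'}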

Access DB: Daily cumulative count and daily number on assignment

I have a table that tracks the dispatching of personnel. The table has the employee name, the date the person went out, and the date they returned.
The table has hundreds of entries from 1988 to current.
In Excel I track the cumulative count per day of the year of how many people have been sent out, and I also track the number of people out on any given day. The table lists the month & day in the first column (every day of the year, including leap days) and the years on the first row. There is data for every date: a zero is entered until the first person is sent out that year, then the count runs up as there are more dispatches; for the number of people out each day, it shows zero if no one is out that day or, if there were, say, 5 people out, it shows "5" for that day. I then use the data in Excel to construct a graph that shows the number of dispatches on the y axis and the day of the year on the x axis (along with the current year's number, the average, and the max over the 27-year history). Currently I track this manually (I keep a running count of each and enter it into Excel by hand).
I would like to build queries of my Access data that return the same information for import into my Excel spreadsheet: one query that shows the day & month in the first column and the years along the top row and, for each day, a cumulative count for that year of how many people have been sent out; and a second query with the day & month in the first column, the years along the top, and a count of how many people were out on that particular day of that particular year. There shouldn't be any gaps (every day has data, even if it is "0"). I would then import those queries into Excel to replace the manual tracking I am doing now.
I know how to construct the Excel side (I have that running already) and how to import data from Access to Excel; what I need to know is how to construct these two Access queries.
Any help/ideas on how to construct those 2 queries would be greatly appreciated!
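For the crosstabs themselves, here is an unverified sketch, assuming a Dispatches table with EmpName, DateOut, and DateIn fields plus a Calendar helper table holding one row per CalDate (Access needs it to produce the gapless zero days). It is written as Python strings run through pyodbc, but the TRANSFORM statements are ordinary Access SQL you could paste into a query's SQL view:

    # Hedged sketch: table, field, and file names are assumptions.
    import pyodbc

    conn = pyodbc.connect(
        r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
        r"DBQ=C:\data\dispatch.accdb")

    # Query 1: cumulative dispatches per day of year, pivoted by year.
    CUMULATIVE = """
    TRANSFORM Count(D.EmpName)
    SELECT Format(C.CalDate, 'mm/dd') AS MonthDay
    FROM Calendar AS C LEFT JOIN Dispatches AS D
      ON (D.DateOut <= C.CalDate AND Year(D.DateOut) = Year(C.CalDate))
    GROUP BY Format(C.CalDate, 'mm/dd')
    PIVOT Year(C.CalDate);
    """

    # Query 2: head-count of people out on each day, pivoted by year.
    OUT_ON_DAY = """
    TRANSFORM Count(D.EmpName)
    SELECT Format(C.CalDate, 'mm/dd') AS MonthDay
    FROM Calendar AS C LEFT JOIN Dispatches AS D
      ON (C.CalDate >= D.DateOut AND C.CalDate <= D.DateIn)
    GROUP BY Format(C.CalDate, 'mm/dd')
    PIVOT Year(C.CalDate);
    """

    for sql in (CUMULATIVE, OUT_ON_DAY):
        for row in conn.cursor().execute(sql).fetchall():
            print(row)

The LEFT JOINs with inequality conditions only work from SQL view, not the query designer, and Count() over the unmatched NULLs returns the required zeros.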
I'd recommend that you migrate this app to a web-based solution that uses a full database - SQL Server or MySQL, not Access.
"Desk drawer software" is what I call homegrown apps that someone creates for themselves to perform some small task and that eventually become integral to running a business and grow out of hand. Your truck factor is 1: if anything happened to you, no one would know how to perform this function. The software may not be backed up or checked into a source code management system. There's no QA. There's no way to migrate new features to production: if you alter the app, then whatever you have at that moment is production.
I'd recommend a web app to mitigate all the risks I've described:
You have to deploy a web app to a server, which takes it off your desktop and puts it in a central place where anyone who's authorized can access it.
Separates database from display issues.
Makes you think about how to archive historical data. Partitioning by year makes sense.
Likely you'll put this in a source code management system like Subversion or Git.
