I’m curious to find out whether, in terms of tidy data principles, a column containing “date and time” (1/1/21 11:31) would be considered a single variable or two separate ones?
Timestamp:
An important difference is that DATETIME represents a date (as found in a calendar) and a time (as can be observed on a wall clock), while TIMESTAMP represents a well-defined point in time.
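To make the distinction concrete, here is a small JavaScript illustration (my own sketch, not from the question): the same wall-clock reading names different instants depending on the zone, while a timestamp pins down exactly one instant.

    // The same wall-clock datetime means different instants in different zones:
    const utc = Date.parse("2021-01-01T11:31:00Z");      // one well-defined instant (UTC)
    const est = Date.parse("2021-01-01T11:31:00-05:00"); // same wall clock, different instant
    console.log((est - utc) / 3600000); // 5 (hours apart)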
Like the title says, I'm trying to sum elapsed time based on two criteria in Excel (2016). Specifically the elapsed time of a phone call based on user and call type, examples below.
The actual data is similar to this Dummy Data Set, where there would be a list of users (Full Name), the duration of a given call (Duration), and the type of call (Call Type).
The expected output would be in this format; however, the main focus of my issue falls under the Total Time column, and it's worth noting that the red times shown were done by hand, as these are the cause of this post.
As for the logic I'm using at the moment to try to get these results, it's =SUMIFS(C2:C16, B2:B16, G2, D2:D16, H2&"*"), where C2:C16 is the Duration column, B2:B16 is the Full Name column, G2 is "Generic User", D2:D16 is Call Type, and H2&"*" evaluates to Local*. Worth noting is the wildcard: there are two sub-types each of Internal and Local calls, but I'm just trying to get all Local and all Internal calls without distinction of sub-type.
Additional notes on the cell formats: the Duration and Total Time columns are set to [h]:mm:ss, but I've also tried other time and number formats, which didn't change the outcome.
So at this point I'm just spinning my wheels. I've Googled and looked into different ways to go about this, such as different cell formats and cell operations/logic, and I'm just unsure where the issue may actually be, so any suggestions or help would be very appreciated.
Edit 1: Including a link for the file
In J2 (this is for your reference only):
=SUMPRODUCT((B$2:B$16=G$2)*(C$2:C$16)*ISNUMBER(FIND(H2,D$2:D$16)))
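For reference on why this works: (B$2:B$16=G$2) produces an array of TRUE/FALSE values that coerce to 1/0, ISNUMBER(FIND(H2,D$2:D$16)) acts as a case-sensitive "contains" test on the call type (so it catches both Local sub-types without a wildcard), and multiplying these by the durations in C$2:C$16 means SUMPRODUCT sums only the matching rows.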
I have two date formats for validStartDtTm (the name of my valid axis) stored in one collection. I want to harmonize all of the dates into one uniform format, and wanted to know whether there are any best practices for bi-temporal date formats in MarkLogic.
Current formats are as follows:
2019-04-09T10:54:37.861434Z - generated by front end users and stored without transformation
2019-04-09T10:54:37.8614534-04:00 - ingested from the back end with the format 'DD/MM/YYYY HH:MM:SS' and transformed using xdmp.parseDateTime
Thanks!
These are both valid xs:dateTime values so they don't need harmonization per se, but each indicates a different time zone, so you should check to be sure those are what is intended.
Both times are in the same format. The "Z" trailing the first time indicates "Zulu" time, i.e. UTC, the equivalent of 2019-04-09T10:54:37.861434+00:00. The "-04:00" trailing the second time indicates the time is behind UTC by 4 hours and 0 minutes.
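If you do decide to normalize everything to UTC at ingest, something along these lines should work in MarkLogic server-side JavaScript (a sketch; fn.adjustDateTimeToTimezone is the standard XQuery fn:adjust-dateTime-to-timezone built-in exposed to JavaScript):

    // Normalize an incoming xs:dateTime to UTC (zero offset).
    const dt  = xs.dateTime("2019-04-09T10:54:37.8614534-04:00");
    const utc = fn.adjustDateTimeToTimezone(dt, xs.dayTimeDuration("PT0H"));
    // utc => 2019-04-09T14:54:37.8614534Z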
I'm trying to rank some data in Spotfire, and I'm having a bit of trouble writing a formula to calculate it. Here's a breakdown of what I am working with.
Group: the test group
SNP: what SNP I am looking at
Count: how many counts I get for the specific SNP
What I'd like to do is rank the average # of counts that are present for each SNP, within the group. Thus, I could then see, within a group, which SNP ranks #1, #2, etc.
Thanks!
TL;DR Disclaimer: You can do this, though if you are changing your cross table frequently, it may become a giant hassle. Make sure to double-check that logic is what you'd expect after any modification. Proceed with caution.
The basis of the Custom Expression you seem to be looking for is as follows:
Max(DenseRank(Count() OVER (Intersect([Group],[SNP])),"desc",[Group]))
This gives the total count of rows instead of the average; I was uncertain whether "Count" was supposed to be a column or not. If you really do want to turn it into an average, make sure to adjust accordingly.
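For example, if Count is a column and you want to rank by average count rather than row count, a variant along these lines (an untested sketch with the same structure as above) should be close:

Max(DenseRank(Avg([Count]) OVER (Intersect([Group],[SNP])),"desc",[Group]))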
If all you have is the Group and the SNP nested on the left, you're done and good to go.
First Issue: When you want to filter it down, it gives you the dense rank of only those in the filtered set. In some cases this is good, and what you're looking for; in others, it isn't. If you want it to hold fast to its value, regardless of filtering, you can use the same logic, but throw it in a calculated column instead of in the custom expression. Then, in your cross table aggregation, take the max of the calculated column's value.
Calculated Column:
DenseRank(Count() OVER (Intersect([Group],[SNP])),"desc",[Group])
Second Issue: You want to pivot by something other than Group and SNP. Perhaps, for example, by date? If you throw the Date across the top, it's going to show the same numbers for every month -- the overall numbers. This is not particularly helpful.
To a certain extent, Spotfire's Custom Expressions can handle this modification. As long as you only have a single column across the top, you could use the following:
Max(DenseRank(Count() OVER (Intersect([${Axis.Columns.ShortDisplayName}],[Group],[SNP])),"desc",[Group],[${Axis.Columns.ShortDisplayName}]))
That would automatically pull in the column from the top, and show you the ranking for each individual process date.
However, if you start nesting, using hierarchies, renaming your columns, or having multiple aggregations and throwing (Column Names) across the top, you're going to have to pay a great deal of attention to your custom expression. You'll need to do some form of string replacement around the Axis.Column, or use full expressions instead of short display names, get rid of nests, etc.
Any layer of complexity will require this sort of analysis, so if your end-users have access to modify the pivot table... honestly, I probably wouldn't give them this column.
Third Issue: I don't know if this is an issue, exactly, but you said "Average Counts" -- average per day? Per month? When averaging, you will need to decide whether, for example, a month means the total number of days in the month or only the number of days that particular payor had data. However you decide to aggregate it, make sure you're doing it on the right level.
For the record, I liked the premise of this question; it's something I'd thought would be useful before, but never took the time to try to implement, since sorting a column or limiting a table to only show the top 10 values is much simpler.
My Node.js app needs to index several gigabytes of timestamped CSV data, in such a way that it can quickly get the row count for any combination of values, either for each minute in a day (1440 queries) or for each hour in a couple of months (also 1440). Let's say in half a second.
The column values will not be read, only the row counts per interval for a given permutation. Reducing time to whole minutes is OK. There are rather few possible values per column, between 2 and 10, and some depend on other columns. It's fine to do preprocessing and store the counts in whatever format is suitable for this single task - but what format would that be?
Storing actual values is probably a bad idea, with millions of rows and little variation.
It might be feasible to generate a short code for each combination and match with regex, but since these codes would have to be duplicated each minute, I'm not sure it's a good approach.
Or it could use an embedded database like SQLite, NeDB or TingoDB, but I'm not entirely convinced, since they don't have native enum-like types and may or may not be made for this kind of counting. But maybe it would work just fine?
This must be a common problem with an idiomatic solution, but I haven't figured out what it might be called. Knowing what to call this and how to think about it would be very helpful!
Will answer with my own findings for now, but I'm still interested to know more theory about this problem.
NeDB was not a good solution here, as it saved my values as plain JSON under the hood, repeating key names for each row and adding unique IDs. It wasted lots of space and would surely have been too slow, even if just because of disk I/O.
SQLite might be better at compressing and indexing data, but I have yet to try it. Will update with my results if I do.
Instead I went with the other approach I mentioned: assign a unique letter to each column value we come across and get a short string representing a permutation. Then for each minute, add these strings as keys iff they occur, with the number of occurrences as values. We can later use our dictionary to create a regex that matches any set of combinations, and run it over this small index very quickly.
This was easy enough to implement, but would of course have been trickier if I had had more possible column values than the roughly 70 I found.
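For reference, a minimal sketch of this approach in Node.js (all names are illustrative; it assumes the distinct values fit in a 62-character alphabet, so the ~70 values I found would need a slightly larger one):

    // 1) Assign each distinct column value a single-character code.
    const ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
    const dict = new Map(); // value -> code character
    const codeFor = (value) => {
      if (!dict.has(value)) dict.set(value, ALPHABET[dict.size]);
      return dict.get(value);
    };

    // 2) Index: minute bucket -> (permutation code -> count).
    const index = new Map();
    function addRow(timestampMs, columns) {
      const minute = Math.floor(timestampMs / 60000); // reduce to whole minutes
      const key = columns.map(codeFor).join("");      // e.g. "aBc"
      const bucket = index.get(minute) ?? new Map();
      bucket.set(key, (bucket.get(key) ?? 0) + 1);
      index.set(minute, bucket);
    }

    // 3) Query: turn the wanted value sets into a regex and sum matching counts.
    function countInRange(fromMinute, toMinute, valueSets) {
      // valueSets: one array of acceptable values per column.
      const re = new RegExp(
        "^" + valueSets.map((vs) => "[" + vs.map(codeFor).join("") + "]").join("") + "$"
      );
      let total = 0;
      for (let m = fromMinute; m <= toMinute; m++) {
        for (const [key, count] of index.get(m) ?? []) {
          if (re.test(key)) total += count;
        }
      }
      return total;
    }

The per-minute counts also aggregate trivially into hourly buckets for the longer-range queries.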
I am using Amazon CloudSearch to store a large set of places.
Each place has an opening time and a closing time for each day of the week.
I need to retrieve places by the current time. How would you suggest modeling the index?
I am thinking of solving the problem by creating 7 text indexes in which I specify, for each day of the week, the valid hours.
For example, if a place is open from 9:00 to 13:00, in the index "monday" I will write the string "9-10-11-12". Then, filtering by bq=monday:'10' or bq=monday:'16', I will get only the places that are open at the specified time.
Any other ideas? My solution seems to work, but would you suggest another approach?
First, I wouldn't use multiple indexes.
You could use your approach, but just make the time in hours from the start of the week. So, Monday would be 0-23, Tuesday 24-47, etc. Or you could just have 7 fields, "monday_hours", "tuesday_hours", …
You could also use uints instead of strings. Not better, just different; it might be worth benchmarking.
With uints you can use range queries. Suppose the document contains the fields "open" and "close", and you want to know if it's open between 10 and 12:
&bq=(and open:..12 close:10..)
One issue remaining is that CloudSearch's range searches are inclusive of endpoints. So I think this will show a false positive if the store opens at twelve. Technically, the ranges overlap, but not usefully. To fix that, I'd do two things. First, I wouldn't go by hours, I'd use minute-of-the-day as the value in the field (0 to 1439). Then add one to the starting range, and subtract one from the end.
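As a sketch of the arithmetic (field names as in the example above; JavaScript just for illustration):

    // Encode times as minute-of-day (0..1439) and tighten the inclusive bounds.
    const minuteOfDay = (h, m) => h * 60 + m;
    const from = minuteOfDay(10, 0) + 1; // add one to the starting range
    const to   = minuteOfDay(12, 0) - 1; // subtract one from the end
    const bq   = `(and open:..${to} close:${from}..)`;
    // => "(and open:..719 close:601..)"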
Using uints will perform differently from using text fields. I'd definitely benchmark them to see which one works better for you.