Where can I find the default features returned by dfs? [featuretools]

Quick question:
Is there a doc or resource where I can find the default features output by featuretools dfs?
For example, if I use trans_primitives=["time_since_previous"], it seems to output the time between transactions and also the time since the first transaction.
It would be great if I could find the default output from the transform primitives somewhere, and also the different options for each argument.
Thanks

The features returned by DFS will vary based on the entity set and the primitives used. The default primitives used by DFS, as defined in the API Reference, are listed below:
agg_primitives: [
"sum",
"std",
"max",
"skew",
"min",
"mean",
"count",
"percent_true",
"num_unique",
"mode",
]
trans_primitives: [
"day",
"year",
"month",
"weekday",
"haversine",
"num_words",
"num_characters",
]
Yes, time_since_previous does output the time between transactions, with different options available for the unit parameter:
unit (str) – Defines the unit of time to count from. Defaults to Seconds. Acceptable values: years, months, days, hours, minutes, seconds, milliseconds, nanoseconds
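For reference, here is a minimal sketch (my own, not from the docs) of passing the unit parameter by instantiating the primitive instead of using its string name; the entity set es and the "transactions" dataframe are assumptions, and on older featuretools versions the target argument is target_entity rather than target_dataframe_name:

import featuretools as ft
from featuretools.primitives import TimeSincePrevious

# es is assumed to be an existing EntitySet containing a "transactions" dataframe
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="transactions",  # target_entity on older versions
    agg_primitives=[],                      # drop the default aggregation primitives
    trans_primitives=[TimeSincePrevious(unit="days")],
)
print(feature_defs)  # lists every feature definition that was generated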
Let me know if this helps.

Related

Filter on partition key before iterating over array (Cosmos DB)

I have a CosmosDbQuery that works fine but is a bit slow and expensive:
SELECT c.actionType as actionType, count(1) as count
FROM c in t.processList
WHERE c.processTimestamp > #from
GROUP BY c.actionType
To optimise my query, I would like to first have a WHERE clause on my parent partition key, e.g. parent.minute > x, before iterating over the processList. After that WHERE there is no need for the c.processTimestamp > #from filter.
"id": "b6fd10cc-3a0b-4666-bf55-f22436a5f8d9",
"Name": "xxx",
"Age": 1,
"minute": 202302021026,
"processList": [
{
"processTimestamp": "2023-02-01T10:28:48.3004825Z",
"actionType": "Action1",
"oldValue": "2/1/2023 10:28:41 AM",
"newValue": "2/1/2023 10:28:48 AM"
},
{
"processTimestamp": "2023-02-01T10:28:48.3004825Z",
"actionType": "Action2",
"oldValue": "2/1/2023 10:28:48 AM",
"newValue": "2/1/2023 10:28:48 AM"
}]
}
I have tried subqueries and joins but I could not get it to work:
SELECT c.actionType as actionType, count(1) as count
FROM (SELECT * FROM C WHERE c.minute > 9) in t.processList
WHERE c.processTimestamp > #from
GROUP BY c.actionType
My desired result would be:
[
{
"actionType": "action1",
"count": 85351
},
{
"actionType": "action2",
"count": 2354
}
]
A few comments here.
As noted in my comment, GROUP BY with subqueries is unsupported, as documented here.
Using a Date/Time value as a partition key is typically an anti-pattern for Cosmos DB. This query may be slow and expensive because, at large scale, using time as a partition key means that most queries hit the same partition due to data recency (newer data gets more requests than older data). This is bad for writes as well, for the same reason.
When this happens, it is typical to increase the throughput. However, this often does little to help and in some cases can even make things worse. Also, because throughput is evenly distributed across all partitions, this results in wasted, unused throughput on partition keys for older dates.
Two things to consider. First, make your partition key a combination of two properties to increase cardinality. In an IoT scenario this would typically be deviceId_dateTime (hierarchical partition keys, in preview now, are a better way to do this). This helps especially with writes, where data is always written with the current dateTime.
Second, on the read path for queries, you might explore implementing a materialized view using Change Feed into a second container. This moves the read throughput off the container used for ingestion and can result in more efficient throughput usage. However, you should measure this yourself to be sure.
If your container is small and will always stay that way (< 10K RU/s and 50 GB), then this advice will not apply. However, such a design will not scale.
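To make the materialized-view suggestion a bit more concrete, here is a rough sketch (my own assumption, not part of the answer) using the change feed pull model in the azure-cosmos Python SDK; the account, database and container names are placeholders, and a production setup would more likely use an Azure Functions change feed trigger with checkpointing:

from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
db = client.get_database_client("<database>")
source = db.get_container_client("<ingestion-container>")
view = db.get_container_client("<view-container>")  # partitioned to suit the read/query pattern

# Read the change feed from the beginning and project each document into the view container
for doc in source.query_items_change_feed(is_start_from_beginning=True):
    view.upsert_item(doc)  # reshape/aggregate here as needed before writing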
Like Mark said, GROUP BY is not supported on a subquery. I tried to fix it with LINQ, but GROUP BY is not supported in LINQ either, so I changed my code to use a JOIN instead of looping over the array with the IN keyword:
SELECT pl.actionType as actionType, count(1) as count
FROM c
JOIN pl IN c.processList
WHERE c.minute > #from
GROUP BY pl.actionType
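In case it is useful, here is a rough sketch (my addition, not from the answer) of running that query with the azure-cosmos Python SDK, using a @from query parameter in place of the #from placeholder; the account, database and container names are placeholders:

from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("<database>").get_container_client("<container>")

query = (
    "SELECT pl.actionType AS actionType, COUNT(1) AS count "
    "FROM c JOIN pl IN c.processList "
    "WHERE c.minute > @from "
    "GROUP BY pl.actionType"
)
results = container.query_items(
    query=query,
    parameters=[{"name": "@from", "value": 202302021026}],  # illustrative 'minute' cutoff
    enable_cross_partition_query=True,
)
for row in results:
    print(row["actionType"], row["count"])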

Get Last Value of a Time Series with Azure TimeSeries Insights

How can I query the last (most recent) event, along with its timestamp, within a time series?
The approach described here does not work for me, as I cannot guarantee that the most recent event is within a fixed time window. The event might have been received hours or days ago in my case.
The LAST() function returns the last events, and the Get Series API should preserve the actual event timestamps according to the documentation, but I am a bit confused about the results I am getting back from this API. I get multiple results (sometimes not even sorted by timestamp) and have to work out the latest value on my own.
I also noticed that the query result does not actually reflect the latest ingested value. The latest ingested value is only contained in the result set if I ingest it multiple times.
Is there any more straightforward or reliable way to get the last value of a time series with Azure Time Series Insights?
The most reliable way to get the last known value, at the moment, is to use the AggregateSeries API.
You can use the last() aggregation in a variable calculating the last event property and the last timestamp property. You must provide a search span in the query, so you will still have to "guess" when the latest value could have occurred.
Some options are to always have a larger search span than what you may need (e.g. if a sensor sends data every day, you may input a search span of a week to be safe) or use the Availability API to get the time range and distribution of the entire data set across all TSIDs and use that as the search span. Keep in mind that having large search spans will affect performance of the query.
Here's an example of a last-known-value (LKV) query:
"aggregateSeries": {
"searchSpan": {
"from": "2020-02-01T00:00:00.000Z",
"to": "2020-02-07T00:00:00.000Z"
},
"timeSeriesId": [
"motionsensor"
],
"interval": "P30D",
"inlineVariables": {
"LastValue": {
"kind": "aggregate",
"aggregation": {
"tsx": "last($event['motion_detected'].Bool)"
}
},
"LastTimestamp": {
"kind": "aggregate",
"aggregation": {
"tsx": "last($event.$ts)"
}
}
},
"projectedVariables": [
"LastValue",
"LastTimestamp"
]
}
}
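As a rough sketch (my assumption, not part of the answer), the body above can be posted to the Time Series Insights Query API with a plain HTTP call; the environment FQDN and the AAD bearer token are placeholders you have to supply, and the api-version may differ for your environment:

import requests

url = "https://<environmentFqdn>/timeseries/query?api-version=2020-07-31"
body = {
    "aggregateSeries": {
        "searchSpan": {"from": "2020-02-01T00:00:00.000Z", "to": "2020-02-07T00:00:00.000Z"},
        "timeSeriesId": ["motionsensor"],
        "interval": "P30D",
        "inlineVariables": {
            "LastValue": {"kind": "aggregate",
                          "aggregation": {"tsx": "last($event['motion_detected'].Bool)"}},
            "LastTimestamp": {"kind": "aggregate",
                              "aggregation": {"tsx": "last($event.$ts)"}},
        },
        "projectedVariables": ["LastValue", "LastTimestamp"],
    }
}
response = requests.post(url, json=body, headers={"Authorization": "Bearer <aad-token>"})
print(response.json())  # the projected variables hold the last value and its timestamp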

MongoDB mapReduce performance

I have a mapReduce function that I use to prepare data for my web app to be used in real time.
It works fine, but it doesn't meet my performance requirements.
My aim (and I know it's not meant to be used this way) is to run it when the web app user requests it (more or less in real time).
I use mapReduce because the transformation of the data needs a lot of if/else conditions due to functional requirements.
My subset of initial data to be transformed is about 100k rich documents (< 1 kB each).
The result is stored in a collection (in Replace mode) that is then used by the web app.
Processing currently takes about 6-9 seconds, and CPU and RAM usage are very low.
The acceptable waiting time for my users should be less than 5 seconds.
So, to make use of the idle CPU, I tried to divide my initial input data into subsets and run the mapReduce on each subset in a different thread (20k documents per thread).
For that I had to change Replace mode to Merge mode to be able to collect the results into the same collection.
But it didn't help. It consumes more CPU, but the total execution time is more or less the same.
Setting "nonAtomic" to true in my mapReduce calls didn't help either.
I read somewhere that there are (at least) 2 issues with running it this way:
My threads are not running in parallel for the inserts, because each insert locks the output collection.
My threads are not running in parallel during processing, because the JS engine used by MongoDB is not thread safe.
Are these points correct? And do you know of any better solutions?
PS: My mapReduce doesn't group data; it only transforms it based on functional conditions (a lot of them). All emitted documents are unique (so reduce effectively has nothing to do).
EDIT:
Here is an example:
My input objects are product groups, i.e.:
{
_id : "1",
products : [
{code : "P1", name : "P1", price : 22.1 ...., competitors : [{code : "c1", price : 22.2},{code : "c2", price : 21.9}]},
{code : "P2", name : "P2", price : 22.1 ...., competitors : [{code : "c1", price : 22.2},{code : "c2", price : 21.9}]},
]
}
Users should be able to dynamically define functional groups based on some criteria applied to each product, and define a pricing strategy for each of them.
As a simple example of functional grouping, they could define 4 groups like this:
Cheap products (whose price is less than 20)
Products that are sold by both competitors "C1" and "C2"
Products that are sold only by the competitor "C3"
Products that are sold by the competitor "C4" and are not in promo
...
All these groups are defined based on properties of the Product object, and because a product can possibly fit more than one group, the first matching group is the one used (if it fits the first group, it must not appear in any other one).
Once the group criteria are defined, users can define for each group a strategy to calculate a new price for each product based on some conditions (these use Product properties BUT ALSO properties of other products in the same array of the original input object).
The result is a collection of separate products, each with its functional group, its new price and some other calculated stats and values.
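For illustration only, here is a rough pymongo sketch of the threaded, Merge-mode setup described above (collection names, the chunking key and the group logic are all assumptions, and the mapReduce command itself is deprecated in newer MongoDB versions):

from concurrent.futures import ThreadPoolExecutor
from bson.code import Code
from pymongo import MongoClient

db = MongoClient()["mydb"]

# First-match group assignment, heavily simplified
map_js = Code("""
function () {
    this.products.forEach(function (p) {
        var group = (p.price < 20) ? "cheap" : "other";    // real logic has many if/else branches
        emit(p.code, { group: group, newPrice: p.price }); // keys are unique, so reduce never combines
    });
}
""")
reduce_js = Code("function (key, values) { return values[0]; }")

def run_chunk(lo, hi):
    # One mapReduce per subset, merged into a shared output collection
    return db.command({
        "mapReduce": "product_groups",
        "map": map_js,
        "reduce": reduce_js,
        "query": {"chunk": {"$gte": lo, "$lt": hi}},           # assumes a precomputed chunk field
        "out": {"merge": "priced_products", "nonAtomic": True},
    })

with ThreadPoolExecutor(max_workers=5) as pool:
    list(pool.map(lambda bounds: run_chunk(*bounds), [(i * 20000, (i + 1) * 20000) for i in range(5)]))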

Azure Data Factory: process previous date data

I have been trying to find a way to dynamically set the start and end properties of the pipeline. The reason for this is to process a time series of files from 5 days before the current day of the pipeline execution.
I tried to set this in the pipeline JSON:
"start": "Date.AddDays(SliceStart, -5)"
"end": "Date.AddDays(SliceEnd, -5)"
and when publishing through VS2015, I get the error below:
Unable to convert 'Date.AddDays(SliceEnd, -5)' to a DateTime value.
Please use ISO8601 DateTime format such as \"2014-10-01T13:00:00Z\"
for UTC time, or \"2014-10-01T05:00:00-8:00\" for Pacific Standard
Time. If the timezone designator is omitted, the system will consider
it denotes UTC time by default. Hence, \"2014-10-01\" will be
converted to \"2014-10-01T00:00:00Z\" automatically..
","code":"InputIsMalformedDetailed"
What other ways are there to do this?
Rather than trying to set this dynamically at the pipeline level, which won't work, you need to deal with it when you provision the time slices against the dataset and in the activity.
Use the JSON attribute called Offset within the availability block for the dataset and within the scheduler block for the activity.
This will take the time slice start value configured by the interval and frequency and offset it by the given value in days/hours/minutes etc.
For example (in the dataset):
// etc....
},
"availability": {
"frequency": "Day",
"interval": 1,
"style": "StartOfInterval",
"offset": "-5.00:00:00" //minus 5 days
}
//etc....
You'll need to configure this in both places, otherwise the activity will fail validation at deployment time.
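For completeness, the matching block in the activity would look something like this (a sketch based on my reading of the ADF v1 scheduler attributes, so treat the exact values as illustrative):
// etc....
"scheduler": {
"frequency": "Day",
"interval": 1,
"style": "StartOfInterval",
"offset": "-5.00:00:00" //minus 5 days, matching the dataset availability
}
//etc....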
Check out this Microsoft article for details on all the attributes you can use to configure more complex time slice scenarios.
https://learn.microsoft.com/en-us/azure/data-factory/data-factory-create-pipelines
Hope this helps.

Azure Data Factory - performing full IDL for the first slice

I'm working on a Data Factory POC to replace an existing data integration solution that loads data from one system to another. The existing solution extracts all data available up to the present point in time, and then on subsequent runs extracts new/updated data that changed since the last time it ran. Basically, IDL (initial data load) first and then updates.
Data Factory works somewhat similarly and extracts data in slices. However, I need the first slice to include all the data from the beginning of time. I could say that the pipeline start time is "the beginning of time", but that would create too many slices.
For example, I want it to run daily and grab daily increments, but I want to extract data for the last 10 years first. I don't want 3650 slices created just to catch up. I want the first slice to have the WindowStart parameter overridden and set to some predetermined point in the past, and then subsequent slices to use the normal WindowStart-WindowEnd time interval.
Is there a way to accomplish that?
Thanks!
How about creating two pipelines: one as a "run once" that transfers all the initial data, then clone it so you copy all the datasets and linked service references in the pipeline. Then add the schedule to it, and a SQL query that uses the date variables to fetch only new data. You'll need something like this in the second pipeline:
"source":
{
"type": "SqlSource",
"SqlReaderQuery": "$$Text.Format('SELECT * FROM yourTable WHERE createdDate > \\'{0:yyyyMMdd-HH}\\'', SliceStart)"
},
"sink":
{
...
}
Hope that makes sense.
