Cannot convert column when using FeatureTools to normalize_entity with timestamps - featuretools

I'm attempting to use FeatureTools to normalize a table for feature synthesis. My table is similar to the one in Max-Kanter's response to How to apply Deep Feature Synthesis to a single table. I'm hitting an exception and would appreciate some help working around it.
The exception originates in featuretools.entityset.entity.entityset_convert_variable_type, which doesn't seem to handle time types.
What is the nature of the exception, and can I work around it?
The Table, df:
PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show
12345 | 5642903 | F | 2016-04-29 | 2016-04-29 | 62 | JARDIM DA | 0 | 1 | 0 | 0 | 0 | 0 | No
67890 | 3902943 | M | 2016-03-18 | 2016-04-29 | 44 | Other Nbh | 1 | 1 | 0 | 0 | 0 | 0 | Yes
...
My Code:
import featuretools as ft

appointment_entity_set = ft.EntitySet('appointments')
appointment_entity_set.entity_from_dataframe(
    dataframe=df, entity_id='appointments',
    index='AppointmentID', time_index='AppointmentDay')

# error generated here
appointment_entity_set.normalize_entity(base_entity_id='appointments',
                                        new_entity_id='patients',
                                        index='PatientId')
ScheduledDay and AppointmentDay are type pandas._libs.tslib.Timestamp as is the case in Max-Kanter's response.
The Exception:
~/.virtualenvs/trane/lib/python3.6/site-packages/featuretools/entityset/entity.py in entityset_convert_variable_type(self, column_id, new_type, **kwargs)
474 df = self.df
--> 475 if df[column_id].empty:
476 return
477 if new_type == vtypes.Numeric:
Exception: Cannot convert column first_appointments_time to <class 'featuretools.variable_types.variable.DatetimeTimeIndex'>
featuretools==0.1.21
This dataset is from the Kaggle Show or No Show competition

The error that’s showing up seems to be a problem with the way the AppointmentDay variable is being read by pandas. We actually have an example Kaggle kernel with that dataset. There, we needed to use pandas.read_csv with parse_dates:
data = pd.read_csv("data/KaggleV2-May-2016.csv", parse_dates=['AppointmentDay', 'ScheduledDay'])
That returns pandas Series whose values are of type numpy.datetime64, which should load into Featuretools fine.
Also, can you make sure you have the latest version of Featuretools from pip? There is a set_trace call in that stack trace that isn't in the latest release.
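For reference, here is a minimal sketch of the fix end to end, reusing the entity-set code from the question (the CSV path is the one from the kernel snippet above and may differ on your machine):

import featuretools as ft
import pandas as pd

# parse both date columns up front so they arrive as numpy.datetime64
df = pd.read_csv("data/KaggleV2-May-2016.csv",
                 parse_dates=['AppointmentDay', 'ScheduledDay'])

appointment_entity_set = ft.EntitySet('appointments')
appointment_entity_set.entity_from_dataframe(
    dataframe=df, entity_id='appointments',
    index='AppointmentID', time_index='AppointmentDay')

# with proper datetime dtypes, normalize_entity should no longer raise
appointment_entity_set.normalize_entity(base_entity_id='appointments',
                                        new_entity_id='patients',
                                        index='PatientId')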

Related

Possible corner case: pandas.read_csv

Why are all dots stripped from strings that consist of numbers and dots, but only when engine='python', even though dtype is defined?
The unexpected behaviour occurs when processing a CSV file that:
- contains strings consisting solely of numbers and single dots spread throughout the string, and
- is read with the read_csv parameters engine='python' and thousands='.'
Sample test code:
import pandas as pd # version 1.5.2
import io
data = """a;b;c\n0000.7995;16.000;0\n3.03.001.00514;0;4.000\n4923.600.041;23.000;131"""
df1 = pd.read_csv(io.StringIO(data), sep=';', dtype={'a': str}, thousands='.', engine='c')
df2 = pd.read_csv(io.StringIO(data), sep=';', dtype={'a': str}, thousands='.', engine='python')
df1 out: col a as desired and expected
| | a | b | c |
|---:|:---------------|------:|-----:|
| 0 | 0000.7995 | 16000 | 0 |
| 1 | 3.03.001.00514 | 0 | 4000 |
| 2 | 4923.600.041 | 23000 | 131 |
df2 out: col a not expected
| | a | b | c |
|---:|------------:|------:|-----:|
| 0 | 00007995 | 16000 | 0 |
| 1 | 30300100514 | 0 | 4000 |
| 2 | 4923600041 | 23000 | 131 |
Even though dtype={'a': str}, it seems that engine='python' handles it differently from engine='c'. dtype={'a': object} yields the same result.
I have spent quite some time getting to know all of the settings of pandas read_csv, and I can't see any other option I could set to alter this behaviour.
Is there anything I missed or is this behaviour 'normal'?
Looks like a bug (it wasn't reported yet, so I filed it). I was only able to create a clumsy workaround:
df = pd.read_csv(io.StringIO(data), sep=';', dtype=str, engine='python')
int_columns = ['b', 'c']
# strip the literal '.' thousands separators, then convert to int
df[int_columns] = df[int_columns].apply(lambda x: x.str.replace('.', '', regex=False)).astype(int)
| a | b | c |
|:---------------|------:|-----:|
| 0000.7995 | 16000 | 0 |
| 3.03.001.00514 | 0 | 4000 |
| 4923.600.041 | 23000 | 131 |
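To sanity-check the workaround, you can compare its result against the c-engine read from the question (a small sketch using the sample data above; the final comparison should print True if the c engine really produces the first table shown):

import io
import pandas as pd

data = "a;b;c\n0000.7995;16.000;0\n3.03.001.00514;0;4.000\n4923.600.041;23.000;131"

# workaround: read everything as str with the python engine,
# then strip the literal '.' thousands separators from the numeric columns only
df = pd.read_csv(io.StringIO(data), sep=';', dtype=str, engine='python')
int_columns = ['b', 'c']
df[int_columns] = df[int_columns].apply(lambda x: x.str.replace('.', '', regex=False)).astype(int)

# reference: the c engine honours dtype={'a': str} together with thousands='.'
expected = pd.read_csv(io.StringIO(data), sep=';', dtype={'a': str}, thousands='.', engine='c')

print(df.equals(expected))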

How to simultaneously group/apply two Spark DataFrames?

/* My question is language-agnostic I think, but I'm using PySpark if it matters. */
Situation
I currently have two Spark DataFrames:
One with per-minute data (1440 rows per person and day) of a person's heart rate per minute:
| Person | date | time | heartrate |
|--------+------------+-------+-----------|
| 1 | 2018-01-01 | 00:00 | 70 |
| 1 | 2018-01-01 | 00:01 | 72 |
| ... | ... | ... | ... |
| 4 | 2018-10-03 | 11:32 | 123 |
| ... | ... | ... | ... |
And another DataFrame with daily metadata (1 row per person and day), including the results of a clustering of days, i.e. which cluster day X of person Y fell into:
| Person | date | cluster | max_heartrate |
|--------+------------+---------+----------------|
| 1 | 2018-01-01 | 1 | 180 |
| 1 | 2018-01-02 | 4 | 166 |
| ... | ... | ... | ... |
| 4 | 2018-10-03 | 1 | 147 |
| ... | ... | ... | ... |
(Note that clustering is done separately per person, so cluster 1 for person 1 has nothing to do with person 2's cluster 1.)
Goal
I now want to compute, say, the mean heart rate per cluster and per person, that is, each person gets different means. If I have three clusters, I am looking for this DF:
| Person | cluster | mean_heartrate |
|--------+---------+----------------|
| 1 | 1 | 123 |
| 1 | 2 | 89 |
| 1 | 3 | 81 |
| 2 | 1 | 80 |
| ... | ... | ... |
How do I best do this? Conceptually, I want to group these two DataFrames per person and send two DF chunks into an apply function. In there (i.e. per person), I'd group and aggregate the daily DF per day, then join the daily DF's cluster IDs, then compute the per-cluster mean values.
But grouping/applying multiple DFs doesn't work, right?
Ideas
I have two ideas and am not sure which, if any, make sense:
Join the daily DF to the per-minute DF before grouping, which would result in highly redundant data (i.e. the cluster ID replicated for each minute). In my "real" application, I will probably have per-person data too (e.g. height/weight), which would be a completely constant column then, i.e. even more memory wasted. Maybe that's the only/best/accepted way to do it?
Before applying, transform the DF into a DF that can hold complex structures, e.g. like this:
| Person | dataframe | key | column | value |
|--------+------------+------------------+-----------+-------|
| 1 | heartrates | 2018-01-01 00:00 | heartrate | 70 |
| 1 | heartrates | 2018-01-01 00:01 | heartrate | 72 |
| ... | ... | ... | ... | ... |
| 1 | clusters | 2018-01-01 | cluster | 1 |
| ... | ... | ... | ... | ... |
or maybe even
| Person | JSON |
|--------+--------|
| 1 | { ...} |
| 2 | { ...} |
| ... | ... |
What's the best practice here?
But grouping/applying multiple DFs doesn't work, right?
No, AFAIK this does not work, neither in PySpark nor in pandas.
Join the daily DF to the per-minute DF before grouping...
This is the way to go in my opinion. You don't need to merge all redundant columns, only those required for your groupby operation. There is no way to avoid redundancy for the groupby columns, as they are needed for the groupby operation itself.
In pandas it is possible to provide an extra groupby column as a pandas Series, but it has to have exactly the same length as the dataframe being grouped. However, in order to create that groupby column, you would need a merge anyway.
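A minimal PySpark sketch of that join-then-group approach, assuming the column names from the question (minute_df and daily_df are hypothetical handles for the per-minute and the daily DataFrame):

from pyspark.sql import functions as F

# join only the column needed for grouping (cluster) onto the per-minute data
joined = minute_df.join(
    daily_df.select("Person", "date", "cluster"),
    on=["Person", "date"],
    how="inner",
)

# mean heart rate per person and cluster
mean_heartrate = (
    joined
    .groupBy("Person", "cluster")
    .agg(F.avg("heartrate").alias("mean_heartrate"))
)
mean_heartrate.show()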
Before applying, transform the DF into a DF that can hold complex structures
Performance- and memory-wise, I would not go with this solution unless you have multiple groupby operations that would benefit from the more complex data structure. In fact, you would need to put in some effort just to create that data structure in the first place.

How to loop through dataset to create a dataset of summary

I have just started learning and using Spark, and I'm currently facing a problem. Any suggestion or hint will be greatly appreciated.
Basically I have a dataset that contains all kinds of events from different users, like AppLaunch, GameStart, GameEnd, etc., and I want to create a summary of each user's actions for each time he/she starts the app.
For example: I have the following dataset:
UserId | Event Type | Time | GameType | Event Id|
11111 | AppLauch | 11:01:53| null | 101 |
11111 | GameStart | 11:01:59| Puzzle | 102 |
11111 | GameEnd | 11:05:31| Puzzle | 103 |
11111 | GameStart | 11:05:58| Word | 104 |
11111 | GameEnd | 11:09:13| Word | 105 |
11111 | AppEnd | 11:09:24| null | 106 |
11111 | AppLauch | 12:03:43| null | 107 |
22222 | AppLauch | 12:03:52| null | 108 |
22222 | GameStart | 12:03:59| Puzzle | 109 |
11111 | GameStart | 12:04:01| Puzzle | 110 |
22222 | GameEnd | 12:06:11| Puzzle | 111 |
11111 | GameEnd | 12:06:13| Puzzle | 112 |
11111 | AppEnd | 12:06:23| null | 113 |
22222 | AppEnd | 12:06:33| null | 114 |
And what I want is a dataset similar to this:
EventId | UserId| Event Type | Time | FirstGamePlayed| LastGamePlayed|
101 |11111 | AppLauch | 11:01:53| Puzzle | Word |
107 |11111 | AppLauch | 12:03:43| Puzzle | Puzzle |
108 |22222 | AppLauch | 12:03:52| Puzzle | Puzzle |
I only need to know the first game played and the last game played, even if more games were played in one app launch.
My initial idea is to group them by user Id and a window of time (AppLaunch to AppEnd), and then find a way to scan through the dataset: if there is a GameStart event that falls into any window, it will be the FirstGamePlayed, and the last GameStart event before the time of AppEnd will be the LastGamePlayed. But I didn't find a way to achieve this.
Any hint/suggestion will be nice.
Thanks
I think this can be solved using window functions followed by an aggregation, like this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"..." column syntax outside the shell

df
  // enumerate AppLaunches per user (running count of AppLauch events, ordered by time)
  .withColumn("AppLauchNr", sum(when($"EventType" === "AppLauch", 1)).over(Window.partitionBy($"UserId").orderBy($"Time".asc)))
  // get first/last game per AppLaunch
  .withColumn("firstGamePlayed", first($"GameType", true).over(Window.partitionBy($"UserId", $"AppLauchNr").orderBy($"Time".asc)))
  .withColumn("lastGamePlayed", first($"GameType", true).over(Window.partitionBy($"UserId", $"AppLauchNr").orderBy($"Time".desc)))
  // now aggregate per user and app launch (AppLauchNr is only unique within a user)
  .groupBy($"UserId", $"AppLauchNr")
  .agg(
    min($"EventId").as("EventId"),
    lit("AppLauch").as("EventType"), // this is always AppLauch
    min($"Time").as("Time"),
    first($"firstGamePlayed", true).as("firstGamePlayed"),
    first($"lastGamePlayed", true).as("lastGamePlayed")
  )
  .drop($"AppLauchNr")
First and last game played can also be determined using orderBy().groupBy() instead of window functions, but I'm still not sure whether Spark preserves the ordering during aggregation (this is not mentioned in the docs; see e.g. Spark DataFrame: does groupBy after orderBy maintain that order? and the discussion in https://issues.apache.org/jira/browse/SPARK-16207):
df
  .withColumn("AppLauchNr", sum(when($"EventType" === "AppLauch", 1)).over(Window.partitionBy($"UserId").orderBy($"Time".asc)))
  .orderBy($"UserId", $"AppLauchNr", $"Time")
  .groupBy($"UserId", $"AppLauchNr")
  .agg(
    first($"EventId").as("EventId"),
    first($"EventType").as("EventType"),
    first($"Time").as("Time"),
    first($"GameType", true).as("firstGamePlayed"),
    last($"GameType", true).as("lastGamePlayed")
  )
  .drop($"AppLauchNr")

PySpark getting distinct values over a wide range of columns

I have data with a large number of custom columns, the content of which I poorly understand. The columns are named evar1 to evar250. What I'd like to get is a single table with all distinct values, a count of how often they occur, and the name of the column they come from.
------------------------------------------------
| columnname | value | count |
|------------|-----------------------|---------|
| evar1 | en-GB | 7654321 |
| evar1 | en-US | 1234567 |
| evar2 | www.myclient.com | 123 |
| evar2 | app.myclient.com | 456 |
| ...
The best way I can think of doing this feels terrible, as I believe I have to read the data once per column (there are actually about 400 such columns):
import pyspark.sql.functions as fn

i = 1
df_evars = None
while i <= 30:
    colname = "evar" + str(i)
    df_temp = df.groupBy(colname).agg(fn.count("*").alias("rows"))\
        .withColumn("colName", fn.lit(colname))
    if df_evars:
        df_evars = df_evars.union(df_temp)
    else:
        df_evars = df_temp
    i += 1
display(df_evars)
Am I missing a better solution?
Update
This has been marked as a duplicate but the two responses IMO only solve part of my question.
I am looking at potentially very wide tables with a potentially large number of values. I need a simple way to get three columns that show the source column, the value, and the count of that value in the source column.
The first of the responses only gives me an approximation of the number of distinct values, which is pretty useless to me.
The second response seems less relevant than the first. To clarify, source data like this:
-----------------------
| evar1 | evar2 | ... |
|---------------|-----|
| A | A | ... |
| B | A | ... |
| B | B | ... |
| B | B | ... |
| ...
Should result in the output
--------------------------------
| columnname | value | count |
|------------|-------|---------|
| evar1 | A | 1 |
| evar1 | B | 3 |
| evar2 | A | 2 |
| evar2 | B | 2 |
| ...
Using melt borrowed from here:
from pyspark.sql.functions import col

melt(
    df.select([col(c).cast("string") for c in df.columns]),
    id_vars=[], value_vars=df.columns
).groupBy("variable", "value").count()
Adapted from the answer by user6910411.
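For completeness, a sketch of the kind of melt helper that answer defines (the usual explode-based implementation; the argument names match the call above):

from pyspark.sql.functions import array, col, explode, lit, struct

def melt(df, id_vars, value_vars, var_name="variable", value_name="value"):
    """Unpivot a wide DataFrame into long format, analogous to pandas.melt."""
    # one (column name, column value) struct per value column
    vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars
    ))
    # explode the array so every (column, value) pair becomes its own row
    tmp = df.withColumn("_vars_and_vals", explode(vars_and_vals))
    cols = list(id_vars) + [
        col("_vars_and_vals")[x].alias(x) for x in [var_name, value_name]
    ]
    return tmp.select(*cols)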

Impact ranking in Excel

I have a table with name, volume processed, and accuracy percentage. I want to calculate who has the most impact overall on my business based on those two data points. I tried SUMPRODUCT, giving 50-50 weighting to both data points, but I didn't get a good result. Is there any other method I can use to determine the impact/ranking?
+------+--------+----------+
| Name | Volume | Accuracy |
+------+--------+----------+
| ABC | 251 | 52.99% |
| DEF | 240 | 70.00% |
| FGH | 230 | 74.35% |
| IJK | 137 | 84.67% |
| LMN | 136 | 56.62% |
| OPQ | 135 | 75.56% |
| RST | 128 | 60.16% |
| UVW | 121 | 70.25% |
| XYZ | 120 | 68.33% |
| AJK | 115 | 35.00% |
| LOP | 113 | 100.00% |
+------+--------+----------+
I'm afraid nobody here will be able to tell you what the business rules are for your business.
One approach might be to multiply volume by accuracy, then rank it, either with a formula or with conditional formatting color scales.
Depending on your business logic, you may want to put more weight on either volume or accuracy, so you may want to raise one value to the power of 2 before multiplying it with the other.
The technique is "use a formula to calculate the weighting". What that formula is will depend on your business logic. The following screenshot shows a weighting formula with simple multiplication, conditionally formatted with color scales and a rank formula in the next column.
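For illustration, here is the same scoring idea in pandas with the sample data from the question (squaring Accuracy to give it extra weight is just one possible choice; in Excel this corresponds to a product formula plus RANK):

import pandas as pd

# sample data from the question, with accuracy as a fraction
df = pd.DataFrame({
    "Name": ["ABC", "DEF", "FGH", "IJK", "LMN", "OPQ", "RST", "UVW", "XYZ", "AJK", "LOP"],
    "Volume": [251, 240, 230, 137, 136, 135, 128, 121, 120, 115, 113],
    "Accuracy": [0.5299, 0.70, 0.7435, 0.8467, 0.5662, 0.7556, 0.6016, 0.7025, 0.6833, 0.35, 1.00],
})

# impact score: volume weighted by (squared) accuracy, then ranked
df["Impact"] = df["Volume"] * df["Accuracy"] ** 2
df["Rank"] = df["Impact"].rank(ascending=False).astype(int)
print(df.sort_values("Rank"))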
