How does featuretools DiffDatetime work within dfs? - featuretools

I've got the following dataset:
where:
customer id represents a unique customer
each customer has multiple invoices
each invoice is marked by a unique identifier (Invoice)
each invoice has multiple items (rows)
I want to determine the time difference between invoices for a customer. In other words, the time between one invoice and the next. Is this possible, and how should I do it with DiffDatetime?
Here is how I am setting up the entities:
es = ft.EntitySet(id="data")
es = es.add_dataframe(
    dataframe=df,
    dataframe_name="items",
    index="items",
    make_index=True,
    time_index="InvoiceDate",
)
es.normalize_dataframe(
    base_dataframe_name="items",
    new_dataframe_name="invoices",
    index="Invoice",
    copy_columns=["Customer ID"],
)
es.normalize_dataframe(
    base_dataframe_name="invoices",
    new_dataframe_name="customers",
    index="Customer ID",
)
I tried:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="invoices",
    agg_primitives=[],
    trans_primitives=["diff_datetime"],
    verbose=True,
)
I also tried changing the target dataframe to "invoices" or "customers", but none of those worked.
The df that I am trying to work on looks like this:
es["invoices"].head()
And what I want can be done with pandas like this:
es["invoices"].groupby("Customer ID")["first_items_time"].diff()
which returns:
489434 NaT
489435 0 days 00:01:00
489436 NaT
489437 NaT
489438 NaT
...
581582 0 days 00:01:00
581583 8 days 01:05:00
581584 0 days 00:02:00
581585 10 days 20:41:00
581586 14 days 02:27:00
Name: first_items_time, Length: 40505, dtype: timedelta64[ns]

Thank you for your question.
You can use the groupby_trans_primitives argument in the call to dfs.
Here is an example:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="invoices",
    agg_primitives=[],
    groupby_trans_primitives=["diff_datetime"],
    return_types="all",
    verbose=True,
)
The return_types argument is required since DiffDatetime returns a Feature with Timedelta logical type. Without specifying return_types="all", DeepFeatureSynthesis will only return Features with numeric, categorical, and boolean data types.
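If you then need the new feature as a plain number for modeling, the Timedelta column can be converted afterwards with pandas. A minimal sketch, assuming the generated feature follows the usual naming scheme (something like "DIFF_DATETIME(first_items_time) by Customer ID" - check feature_defs for the exact name in your Featuretools version):
# add a numeric copy (in seconds) of every DiffDatetime feature in the matrix
diff_cols = [c for c in feature_matrix.columns if c.startswith("DIFF_DATETIME")]
for c in diff_cols:
    feature_matrix[c + " (seconds)"] = feature_matrix[c].dt.total_seconds()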

Related

pandas groupby performance / combine 2 functions

I am learning Python and trying to understand best practices for data queries.
Here is some dummy data (customer sales) to test with:
import pandas as pd
df = pd.DataFrame({'Name': ['tom', 'bob', 'bob', 'jack', 'jack', 'jack'],
                   'Amount': [3, 2, 5, 1, 10, 100],
                   'Date': ["01.02.2022", "02.02.2022", "03.02.2022", "01.02.2022", "03.02.2022", "05.02.2022"]})
df.Date = pd.to_datetime(df.Date, format='%d.%m.%Y')
I want to investigate 2 kinds of queries:
1. How long has a person been our customer?
2. What is the period between the first and last purchase?
How can I run the first query without writing loops manually?
What I have done so far for the second query is this:
result = df.groupby("Name").max() - df.groupby("Name").min()
Is it possible to combine these two groupby queries into one to improve the performance?
P.S. I am trying to understand pandas and the key concepts of how to optimize queries. Different approaches and explanations are highly appreciated.
You can use GroupBy.agg with a custom function to get the difference between the max and min date.
df.groupby('Name')['Date'].agg(lambda x: x.max()-x.min())
As the column already has a datetime type, this will nicely yield Timedelta objects, which by default are shown as strings in the form 'x days'.
You can also save the GroupBy object in a variable and reuse it. This way, computation of the groups occurs only once:
g = df.groupby("Name")['Date']
g.max() - g.min()
output:
Name
bob 1 days
jack 4 days
tom 0 days
Name: Date, dtype: timedelta64[ns]
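For the first question (how long someone has been a customer), the same grouped Series can be reused. A minimal sketch, assuming the span from first to last purchase is the tenure you want, converted to whole days:
# difference between last and first purchase date, as an integer number of days
tenure_days = (g.max() - g.min()).dt.days
print(tenure_days)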

Pandas DataFrame: Groupby.First - Index limitation?

I have the below data frame t:
import pandas as pd
t = pd.DataFrame(
    data=(['AFG', 'Afghanistan', 38928341],
          ['CHE', 'Switzerland', 8654618],
          ['SMR', 'San Marino', 33938]),
    columns=['iso_code', 'location', 'population'])
g = t.groupby('location')
g.size()
I can see that each group has only one record, which is expected.
However, if I run the code below, it does not raise any error:
g.first(10)
It shows
population
location
Afghanistan 38928341
San Marino 33938
Switzerland 8654618
My understanding is that first(n) for a group returns the nth record of that group, but each of my location groups has only one record - so how did pandas give me that record?
Thanks
I think you're looking for g.nth(10).
g.first(10) is NOT doing what you think it is. The first (optional) parameter of first is numeric_only and takes a boolean, so you're actually running g.first(numeric_only=True) as bool(10) evaluates to True.
After reading the comments from mozway and Henry Ecker/sammywemmy I finally got it.
t = pd.DataFrame(
    data=(['AFG', 'Afghanistan', 38928341, 'A1'],
          ['CHE', 'Switzerland', 8654618, 'C1'],
          ['SMR', 'San Marino', 33938, 'S1'],
          ['AFG', 'Afghanistan', 38928342, 'A2'],
          ['AFG', 'Afghanistan', 38928343, 'A3']),
    columns=['iso_code', 'location', 'population', 'code'])
g = t.groupby('location')
Then
g.nth(0)
g.nth(1)
g.first(True)
g.first(False)
g.first(min_count=2)
show the difference.
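As a rough guide to what those calls do on the extended frame (a sketch - the exact output shape varies a bit across pandas versions):
# nth selects a row by position within each group; groups that are too short simply drop out
g.nth(1)              # the second row of each group (only Afghanistan has one here)
# first() returns the first non-null value of each column per group
g.first(True)         # first(numeric_only=True): numeric columns only
g.first(False)        # first(numeric_only=False): all columns
g.first(min_count=2)  # NaN wherever a group has fewer than 2 non-null values in a column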

efficient cumulative pivot in pyspark

Is there a more efficient/idiomatic way of rewriting this query:
(spark.table('registry_data')
    .withColumn('age_days', datediff(lit(today), col('date')))
    .withColumn('timeframe',
                when(col('age_days') < 7, '1w')
                .when(col('age_days') < 30, '1m')
                .when(col('age_days') < 92, '3m')
                .when(col('age_days') < 183, '6m')
                .when(col('age_days') < 365, '1y')
                .otherwise('1y+'))
    .groupby('make', 'model')
    .pivot('timeframe')
    .agg(countDistinct('id').alias('count'))
    .fillna(0)
    .withColumn('1y+', col('1y+') + col('1y') + col('6m') + col('3m') + col('1m') + col('1w'))
    .withColumn('1y', col('1y') + col('6m') + col('3m') + col('1m') + col('1w'))
    .withColumn('6m', col('6m') + col('3m') + col('1m') + col('1w'))
    .withColumn('3m', col('3m') + col('1m') + col('1w'))
    .withColumn('1m', col('1m') + col('1w')))
The gist of the query is, for every make/model combination, to return the number of entries seen within each of a set of time periods counted back from today. The period counts are cumulative, i.e. an entry registered within the last 7 days is counted towards 1 week, 1 month, 3 months, etc.
If you want to use a cumulative sum instead of summing each column separately, you can replace the code from .groupby onwards and use window functions:
from pyspark.sql.window import Window
import pyspark.sql.functions as F
(spark.table('registry_data')
    .withColumn('age_days', datediff(lit(today), col('date')))
    .withColumn('timeframe',
                when(col('age_days') < 7, '1w')
                .when(col('age_days') < 30, '1m')
                .when(col('age_days') < 92, '3m')
                .when(col('age_days') < 183, '6m')
                .when(col('age_days') < 365, '1y')
                .otherwise('1y+'))
    .groupBy('make', 'model', 'timeframe')
    .agg(F.countDistinct('id').alias('count'),
         F.max('age_days').alias('max_days'))  # for the orderBy clause
    .withColumn('cumsum',
                F.sum('count').over(Window.partitionBy('make', 'model')
                                          .orderBy('max_days')
                                          .rowsBetween(Window.unboundedPreceding, 0)))
    .groupBy('make', 'model').pivot('timeframe').agg(F.first('cumsum'))
    .fillna(0))
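To make the window step concrete, here is a minimal, self-contained sketch (the make/model/timeframe rows are made-up values, not taken from registry_data) showing how the sum over rowsBetween(Window.unboundedPreceding, 0) accumulates counts from the youngest bucket outward:
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# one make/model with three timeframe buckets (hypothetical values)
toy = spark.createDataFrame(
    [('ford', 'f150', '1w', 6, 3),
     ('ford', 'f150', '1m', 25, 2),
     ('ford', 'f150', '1y', 300, 5)],
    ['make', 'model', 'timeframe', 'max_days', 'count'])

w = (Window.partitionBy('make', 'model')
     .orderBy('max_days')
     .rowsBetween(Window.unboundedPreceding, 0))

# cumsum accumulates from the youngest bucket outward: 3, 3+2=5, 3+2+5=10
toy.withColumn('cumsum', F.sum('count').over(w)).show()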

How to graph Binance API Orderbook with Pandas-matplotlib?

The data comes in 3 columns after orderbook = pd.DataFrame(orderbook_data):
timestamp bids asks
UNIX timestamp [bidprice, bidvolume] [askprice, askvolume]
Each list has 100 values, and the timestamp is the same for all of them.
The problem is that I don't know how to access/index the values inside the [price, volume] list in each row of each column.
I know that by running bids = orderbook["bids"]
I get a list of 100 [bidprice, bidvolume] lists.
I'm looking to avoid writing a loop... there has to be a way to just plot the data.
I hope someone can understand my problem. I just want to plot price on x and volume on y. The goal is to make it live.
As you didn't present your input file, I prepared it on my own:
timestamp;bids
1579082401;[123.12, 300]
1579082461;[135.40, 220]
1579082736;[130.76, 20]
1579082801;[123.12, 180]
To read it I used:
orderbook = pd.read_csv('Input.csv', sep=';')
orderbook.timestamp = pd.to_datetime(orderbook.timestamp, unit='s')
Its content is:
timestamp bids
0 2020-01-15 10:00:01 [123.12, 300]
1 2020-01-15 10:01:01 [135.40, 220]
2 2020-01-15 10:05:36 [130.76, 20]
3 2020-01-15 10:06:41 [123.12, 180]
Now:
timestamp has been converted to the native pandas datetime type,
but bids is of object type (actually, a string),
and, as I suppose, this is the same when read from your input file.
And now the main task: the first step is to extract both numbers from bids,
convert them to float and int, and save them in respective columns:
orderbook = orderbook.join(orderbook.bids.str.extract(
r'\[(?P<bidprice>\d+\.\d+), (?P<bidvolume>\d+)]'))
orderbook.bidprice = orderbook.bidprice.astype(float)
orderbook.bidvolume = orderbook.bidvolume.astype(int)
Now orderbook contains:
timestamp bids bidprice bidvolume
0 2020-01-15 10:00:01 [123.12, 300] 123.12 300
1 2020-01-15 10:01:01 [135.40, 220] 135.40 220
2 2020-01-15 10:05:36 [130.76, 20] 130.76 20
3 2020-01-15 10:06:41 [123.12, 180] 123.12 180
and you can generate e.g. a scatter plot, calling:
orderbook.plot.scatter('bidprice', 'bidvolume');
or another plotting function.
Another possibility
Or maybe your orderbook_data is a dictionary? Something like:
orderbook_data = {
'timestamp': [1579082401, 1579082461, 1579082736, 1579082801],
'bids': [[123.12, 300], [135.40, 220], [130.76, 20], [123.12, 180]] }
In this case, when you create a DataFrame from it, the column types
are initially:
timestamp - int64,
bids - also object, but this time each cell contains a plain
Python list.
Then you can also convert the timestamp column to datetime just like above.
But to split bids (a column of lists) into 2 separate columns,
you should run:
orderbook[['bidprice', 'bidvolume']] = pd.DataFrame(orderbook.bids.tolist())
Then you have 2 new columns with the respective components of the source column, and you can create your graphics just like above.
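Putting the dictionary variant together end to end, a minimal runnable sketch (the sample values are the illustrative ones from above, not real Binance data):
import pandas as pd
import matplotlib.pyplot as plt

orderbook_data = {
    'timestamp': [1579082401, 1579082461, 1579082736, 1579082801],
    'bids': [[123.12, 300], [135.40, 220], [130.76, 20], [123.12, 180]]}

orderbook = pd.DataFrame(orderbook_data)
orderbook.timestamp = pd.to_datetime(orderbook.timestamp, unit='s')
# split the column of [price, volume] lists into two numeric columns
orderbook[['bidprice', 'bidvolume']] = pd.DataFrame(orderbook.bids.tolist())
orderbook.plot.scatter('bidprice', 'bidvolume')
plt.show()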

Index by date ranges in Python Pandas

I am new to using Python pandas, and have the below script to pull in time series data from an Excel file, set the dates as the index, and then perform various calculations on the data, referencing it by date. Script:
df = pd.read_excel("myfile.xls")
df = df.set_index(df.Date)
df = df.drop("Date", axis=1)
df.index.name = None
df.head()
The output of that (to give you a sense of the data) is:
Px1 Px2 Px3 Px4 Px5 Px6 Px7
2015-08-12 19.850000 10.25 7.88 10.90 109.349998 106.650002 208.830002
2015-08-11 19.549999 10.16 7.81 10.88 109.419998 106.690002 208.660004
2015-08-10 19.260000 10.07 7.73 10.79 109.059998 105.989998 210.630005
2015-08-07 19.240000 10.08 7.69 10.92 109.199997 106.430000 207.919998
2015-08-06 19.250000 10.09 7.76 10.96 109.010002 106.010002 208.350006
When I try to retrieve data for a single date, like df.loc['20150806'], that works, but when I try to retrieve a slice like df.loc['20150806':'20150812'], it returns an empty DataFrame.
Again, the index is a DatetimeIndex with dtype='datetime64[ns]', length=1412, freq=None, tz=None.
Like I said, my ultimate goal is to be able to group the data by day, month, year, different periods, etc., and perform calculations on the data. I want to give that context, but don't even want to get into that here, since I'm clearly stuck on something more basic - perhaps a misunderstanding of how to operate with a DatetimeIndex.
Thank you.
EDIT: Meant to also include that I think the main problem I referenced with indexing has something to do with freq=0, because when I tried simpler examples with contiguous date series, I did not have this problem.
df.loc['2015-08-12':'2015-08-10'] and df.loc['2015-08-10':'2015-08-12':-1] both work. df = df.sort_index() followed by slicing the way I was originally trying also works. Thank you all - I was missing the forest for the trees there, I think.
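In other words, once the index is sorted in ascending order, the original slice works. A minimal sketch:
# sort the DatetimeIndex ascending, then the forward slice no longer comes back empty
df = df.sort_index()
print(df.loc['2015-08-06':'2015-08-12'])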
