Do you know if it is possible to make groupers by hour?
I know that it can be done by day:
context="{'group_by': 'my_datetime:day'}"
I mean odoo filters like this:
<filter name="booking_group" string="Group by Booking" context="{'group_by': 'booking_id'}"/>
No, it is not possible. The implemented values are 'day', 'week', 'month', 'quarter' or 'year' (see <path_to_v12>/odoo/models.py, lines 1878 to 1899):
1878     @api.model
1879     def read_group(self, domain, fields, groupby, offset=0, limit=None, orderby=False, lazy=True):
1880         """
1881         Get the list of records in list view grouped by the given ``groupby`` fields
1882
1883         :param domain: list specifying search criteria [['field_name', 'operator', 'value'], ...]
1884         :param list fields: list of fields present in the list view specified on the object
1885         :param list groupby: list of groupby descriptions by which the records will be grouped.
1886                 A groupby description is either a field (then it will be grouped by that field)
1887                 or a string 'field:groupby_function'. Right now, the only functions supported
1888                 are 'day', 'week', 'month', 'quarter' or 'year', and they only make sense for
1889                 date/datetime fields.
1890         :param int offset: optional number of records to skip
1891         :param int limit: optional max number of records to return
1892         :param list orderby: optional ``order by`` specification, for
1893                              overriding the natural sort ordering of the
1894                              groups, see also :py:meth:`~osv.osv.osv.search`
1895                              (supported only for many2one fields currently)
1896         :param bool lazy: if true, the results are only grouped by the first groupby and the
1897                           remaining groupbys are put in the __context key. If false, all the groupbys are
1898                           done in one call.
1899         :return: list of dictionaries(one dictionary for each record) containing:
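So a filter on the my_datetime field from your question can only use one of those intervals, e.g. grouping by week (the filter name and label below are just placeholders):

<filter name="datetime_group" string="Group by Week" context="{'group_by': 'my_datetime:week'}"/>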
I've got the following dataset:
where:
customer id represents a unique customer
each customer has multiple invoices
each invoice is marked by a unique identifier (Invoice)
each invoice has multiple items (rows)
I want to determine the time difference between invoices for a customer; in other words, the time between one invoice and the next. Is this possible, and how should I do it with DiffDatetime?
Here is how I am setting up the entities:
import featuretools as ft

es = ft.EntitySet(id="data")
es = es.add_dataframe(
    dataframe=df,
    dataframe_name="items",
    index="items",
    make_index=True,
    time_index="InvoiceDate",
)
es.normalize_dataframe(
    base_dataframe_name="items",
    new_dataframe_name="invoices",
    index="Invoice",
    copy_columns=["Customer ID"],
)
es.normalize_dataframe(
    base_dataframe_name="invoices",
    new_dataframe_name="customers",
    index="Customer ID",
)
I tried:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="invoices",
    agg_primitives=[],
    trans_primitives=["diff_datetime"],
    verbose=True,
)
I also tried changing the target dataframe to the other dataframes (items, customers), but none of those worked.
The df that I am trying to work on looks like this:
es["invoices"].head()
And what I want can be done with pandas like this:
es["invoices"].groupby("Customer ID")["first_items_time"].diff()
which returns:
489434 NaT
489435 0 days 00:01:00
489436 NaT
489437 NaT
489438 NaT
...
581582 0 days 00:01:00
581583 8 days 01:05:00
581584 0 days 00:02:00
581585 10 days 20:41:00
581586 14 days 02:27:00
Name: first_items_time, Length: 40505, dtype: timedelta64[ns]
Thank you for your question.
You can use the groupby_trans_primitives argument in the call to dfs.
Here is an example:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="invoices",
    agg_primitives=[],
    groupby_trans_primitives=["diff_datetime"],
    return_types="all",
    verbose=True,
)
The return_types argument is required since DiffDatetime returns a Feature with Timedelta logical type. Without specifying return_types="all", DeepFeatureSynthesis will only return Features with numeric, categorical, and boolean data types.
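As a quick sanity check (my addition, not part of the original answer; the exact feature name may differ), you can print the definitions and pull out the timedelta-typed columns of the result:

print(feature_defs)
# The grouped diff shows up as a timedelta column in the feature matrix
print(feature_matrix.select_dtypes(include="timedelta64").head())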
I'm trying to make a table from a list of data using pandas.
Originally I wanted to make a function I could pass dynamic variables to, so I could continuously add new rows from the data list.
It works up until the point where the row-adding part begins. The column headers are added, but the data is not: it either keeps the value of only the last column or adds nothing.
My rough attempt was:
for title in titles:
    for x in data:
        table = {
            title: data[x]
        }
df = pd.DataFrame(table, columns=titles, index=[0])
The columns list:
titles = ['timestamp', 'source', 'tracepoint']
The data list:
data = ['first', 'second', 'third',
        'first', 'second', 'third',
        'first', 'second', 'third']
How can I make something like this?
timestamp, source, tracepoint
first, second, third
first, second, third
first, second, third
If you just want to initialize a pandas DataFrame, you can use the DataFrame constructor.
You can also append a row using a dict.
Pandas provides other useful functions, such as concatenation between DataFrames and inserting/deleting columns. If you need them, please check the pandas docs.
import pandas as pd

# initialization by the DataFrame constructor
titles = ['timestamp', 'source', 'tracepoint']
data = [['first', 'second', 'third'],
        ['first', 'second', 'third'],
        ['first', 'second', 'third']]
df = pd.DataFrame(data, columns=titles)
print('---initialization---')
print(df)

# append a row from a dict
new_row = {
    'timestamp': '2020/11/01',
    'source': 'xxx',
    'tracepoint': 'yyy'
}
df = df.append(new_row, ignore_index=True)
print('---append result---')
print(df)
Output:
---initialization---
timestamp source tracepoint
0 first second third
1 first second third
2 first second third
---append result---
timestamp source tracepoint
0 first second third
1 first second third
2 first second third
3 2020/11/01 xxx yyy
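Note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. On current pandas, the same append step can be written with pd.concat (a sketch reusing df and new_row from above):

# Wrap the dict in a one-row DataFrame and concatenate it
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)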
The data comes in 3 columns after orderbook = pd.DataFrame(orderbook_data):
timestamp bids asks
UNIX timestamp [bidprice, bidvolume] [askprice, askvolume]
Each list has 100 values, and the timestamp is the same for all of them.
The problem is that I don't know how to access/index the values inside each row's list [price, volume] of each column.
I know that by running ---> bids = orderbook["bids"]
I get the list of 100 lists ---> [bidprice, bidvolume]
I'm looking to avoid doing a loop... there has to be a way to just plot the data.
I hope someone can understand my problem. I just want to plot price on x and volume on y. The goal is to make it live.
As you didn't present your input file, I prepared one on my own:
timestamp;bids
1579082401;[123.12, 300]
1579082461;[135.40, 220]
1579082736;[130.76, 20]
1579082801;[123.12, 180]
To read it I used:
orderbook = pd.read_csv('Input.csv', sep=';')
orderbook.timestamp = pd.to_datetime(orderbook.timestamp, unit='s')
Its content is:
timestamp bids
0 2020-01-15 10:00:01 [123.12, 300]
1 2020-01-15 10:01:01 [135.40, 220]
2 2020-01-15 10:05:36 [130.76, 20]
3 2020-01-15 10:06:41 [123.12, 180]
Now:
timestamp has been converted to the native pandasonic datetime type,
but bids is of object type (actually, a string),
and, as I suppose, this is the same when read from your input file.
And now the main task: the first step is to extract both numbers from bids, convert them to float and int, and save them in respective columns:
# The named capture groups become new bidprice / bidvolume columns
orderbook = orderbook.join(orderbook.bids.str.extract(
    r'\[(?P<bidprice>\d+\.\d+), (?P<bidvolume>\d+)]'))
orderbook.bidprice = orderbook.bidprice.astype(float)
orderbook.bidvolume = orderbook.bidvolume.astype(int)
Now orderbook contains:
timestamp bids bidprice bidvolume
0 2020-01-15 10:00:01 [123.12, 300] 123.12 300
1 2020-01-15 10:01:01 [135.40, 220] 135.40 220
2 2020-01-15 10:05:36 [130.76, 20] 130.76 20
3 2020-01-15 10:06:41 [123.12, 180] 123.12 180
and you can generate e.g. a scatter plot by calling:
orderbook.plot.scatter('bidprice', 'bidvolume');
or another plotting function.
Another possibility
Or maybe your orderbook_data is a dictionary? Something like:
orderbook_data = {
'timestamp': [1579082401, 1579082461, 1579082736, 1579082801],
'bids': [[123.12, 300], [135.40, 220], [130.76, 20], [123.12, 180]] }
In this case, when you create a DataFrame from it, the column types are initially:
timestamp - int64,
bids - also object, but this time each cell contains a plain pythonic list.
Then you can also convert the timestamp column to datetime just like above.
But to split bids (a column of lists) into 2 separate columns,
you should run:
orderbook[['bidprice', 'bidvolume']] = pd.DataFrame(orderbook.bids.tolist())
Then you have 2 new columns with the respective components of the source column, and you can create your graphics just like above.
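Putting the dictionary variant together, here is a minimal end-to-end sketch built only from the steps above:

import pandas as pd

orderbook_data = {
    'timestamp': [1579082401, 1579082461, 1579082736, 1579082801],
    'bids': [[123.12, 300], [135.40, 220], [130.76, 20], [123.12, 180]]}
orderbook = pd.DataFrame(orderbook_data)
orderbook.timestamp = pd.to_datetime(orderbook.timestamp, unit='s')
# Split each [price, volume] list into two numeric columns
orderbook[['bidprice', 'bidvolume']] = pd.DataFrame(orderbook.bids.tolist())
orderbook.plot.scatter('bidprice', 'bidvolume')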
I have some entries from users and how many interactions each user had on my website.
I have 340k rows and 70+ columns, and I want to use Vaex, but I'm having problems doing simple things like dropping duplicates.
Could someone help me with how to do it?
import pandas as pd

df = pd.DataFrame({'user': ['Bob', 'Bob', 'Alice', 'Alice', 'Alice', 'Ralph', 'Ralph'],
                   'date': ['2013-12-05', '2014-02-05', '2013-11-07', '2014-04-22', '2014-04-30', '2014-04-20', '2014-05-29'],
                   'interaction_num': ['1', '2', '1', '2', '3', '1', '2']})
I want to have the same result as the pandas drop_duplicates(keep="last") call:
df.drop_duplicates('user', keep='last', inplace=True)
The expected result using Vaex should be:
user date interaction_num
1 Bob 2014-02-05 2
4 Alice 2014-04-30 3
6 Ralph 2014-05-29 2
It seems there is no drop_duplicates in vaex yet, but we should expect this functionality at some point.
In the meantime, there is an attempt from the creator of vaex.
The code, adapted from https://github.com/vaexio/vaex/pull/1623/files, works for me:
import vaex

def drop_duplicates(df, columns=None):
    """Return a :class:`DataFrame` object with no duplicates in the given columns.

    .. warning:: The resulting dataframe will be in memory, use with caution.

    :param columns: Column or list of columns to remove duplicates by, defaults to all columns.
    :return: :class:`DataFrame` object with duplicates filtered away.
    """
    if columns is None:
        columns = df.get_column_names()
    if type(columns) is str:
        columns = [columns]
    return df.groupby(columns, agg={'__hidden_count': vaex.agg.count()}).drop('__hidden_count')
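A usage sketch (my addition, using vaex.from_pandas on a slice of the question's example frame). Note that the groupby keeps only the grouped columns, so deduplicating by 'user' alone returns just the user column; dedupe on all columns (the default) to keep everything:

import pandas as pd
import vaex

pdf = pd.DataFrame({'user': ['Bob', 'Bob', 'Alice'],
                    'date': ['2013-12-05', '2014-02-05', '2013-11-07'],
                    'interaction_num': ['1', '2', '1']})
vdf = vaex.from_pandas(pdf)
unique_rows = drop_duplicates(vdf)           # dedupe on all columns
unique_users = drop_duplicates(vdf, 'user')  # returns only the user column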
Is there a way we can split a JSON containing array of strings into computed columns?
The credit_cards column is of jsonb type, with the sample data below:
[{"bank": "HDFC Bank", "cvv": "8253", "expiry": "2020-05-31T14:22:34.61426Z", "name": "18a81ea99250bf236a5e27a762a32d62", "number": "c4ca4238acf96a36"}, {"bank": "HDFC Bank", "cvv": "9214", "expiry": "2020-05-30T21:44:55.173339Z", "name": "6725a156df733ec2dd33b94f06ee2e06", "number": "c81e728dacf96a36"}, {"bank": "HDFC Bank", "cvv": "1161", "expiry": "2020-05-31T07:59:28.458905Z", "name": "eb102765424d07d8b713211c14e837b4", "number": "eccbc87eacf96a36"}]
I tried this, but it looks like it's not supported in a computed column:
alter table users_combined add column projected_number STRING[] AS (json_array_elements(credit_cards)->>'number') STORED;
ERROR: json_array_elements(): generator functions are not allowed in computed column
Another alternative which worked was:
alter table users_combined add column projected_number STRING[] AS (ARRAY[credit_cards->0->>'number',credit_cards->1->>'number',credit_cards->2->>'number']) STORED;
However, this has the problem that the user has to specify the indices of the credit_cards array. If we have more than 3 credit cards, then we'll have to alter the column with new indices.
So is there a way to create Computed Column without having to specify the indices?
There is no way to do this. But inverted indexes are a way to get the same lookup capabilities that I think you're going for here.
If you create an inverted index on the table, then you can search for rows that have a given number attribute efficiently:
demo@127.0.0.1:26257/defaultdb> create table users_combined (credit_cards jsonb);
CREATE TABLE

Time: 3ms total (execution 3ms / network 0ms)

demo@127.0.0.1:26257/defaultdb> create inverted index on users_combined(credit_cards);
CREATE INDEX

Time: 53ms total (execution 4ms / network 49ms)

demo@127.0.0.1:26257/defaultdb> explain select * from users_combined where credit_cards->'number' = '"c4ca4238acf96a36"';
                                          info
----------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • index join
  │ table: users_combined@primary
  │
  └── • scan
        table: users_combined@users_combined_credit_cards_idx
        spans: [/'{"number": "c4ca4238acf96a36"}' - /'{"number": "c4ca4238acf96a36"}']
(10 rows)
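If there can be arbitrarily many cards per row, a JSONB containment query avoids enumerating indices and can also be served by the same inverted index (this uses the standard @> containment operator; the number value is from the question's sample data):

SELECT * FROM users_combined
WHERE credit_cards @> '[{"number": "c4ca4238acf96a36"}]';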