The following defines an EntitySet. I've declared did as Index on the transaction table tx but it registers as Id, not Index. Why is that?
The objective is to remove the warning below.
Under what circumstances would an Index assignment be overridden as Id (primary vs. external key?), and is the fact that did registers as Id related to the warning?
One uid can have several dids in the tx table.
import featuretools as ft

es = ft.EntitySet(id="the_entity_set")

# hse
es = es.entity_from_dataframe(entity_id="hse",
                              dataframe=hse,
                              index="uid",
                              variable_types={"Gender": ft.variable_types.Categorical,
                                              "Income": ft.variable_types.Numeric,
                                              "dob": ft.variable_types.Datetime})

# types
es = es.entity_from_dataframe(entity_id="types",
                              dataframe=types,
                              index="type_id",
                              variable_types={"type": ft.variable_types.Categorical})

# files
es = es.entity_from_dataframe(entity_id="files",
                              dataframe=files,
                              index="file_id",
                              variable_types={"file": ft.variable_types.Categorical})

# uid_donations
es = es.entity_from_dataframe(entity_id="uid_txlup",
                              dataframe=uid_txlup,
                              index="did",
                              variable_types={"uid": ft.variable_types.Categorical})

# transactions
es = es.entity_from_dataframe(entity_id="tx",
                              dataframe=tx,
                              index="did",
                              time_index="dt",
                              variable_types={"file_id": ft.variable_types.Categorical,
                                              "type_id": ft.variable_types.Categorical,
                                              "amt": ft.variable_types.Numeric})

rels = [
    ft.Relationship(es["files"]["file_id"], es["tx"]["file_id"]),
    ft.Relationship(es["types"]["type_id"], es["tx"]["type_id"]),
    ft.Relationship(es["hse"]["uid"], es["uid_txlup"]["uid"]),
    ft.Relationship(es["uid_txlup"]["did"], es["tx"]["did"])
]
es.add_relationships(rels)
This is what the EntitySet looks like
Entityset: the_entity_set
  Entities:
    hse [Rows: 100, Columns: 4]
    types [Rows: 8, Columns: 2]
    files [Rows: 2, Columns: 2]
    uid_txlup [Rows: 336, Columns: 2]
    tx [Rows: 336, Columns: 5]
  Relationships:
    tx.file_id -> files.file_id
    tx.type_id -> types.type_id
    uid_txlup.uid -> hse.uid
    tx.did -> uid_txlup.did
es.entities
[Entity: hse
   Variables:
     uid (dtype: index)
     Gender (dtype: categorical)
     Income (dtype: numeric)
     dob (dtype: datetime)
   Shape:
     (Rows: 100, Columns: 4),
 Entity: types
   Variables:
     type_id (dtype: index)
     type (dtype: categorical)
   Shape:
     (Rows: 8, Columns: 2),
 Entity: files
   Variables:
     file_id (dtype: index)
     file (dtype: categorical)
   Shape:
     (Rows: 2, Columns: 2),
 Entity: uid_txlup
   Variables:
     did (dtype: index)
     uid (dtype: categorical)
   Shape:
     (Rows: 336, Columns: 2),
 Entity: tx
   Variables:
     did (dtype: id)    ### <<< external key ???
     dt (dtype: datetime)
     file_id (dtype: categorical)
     type_id (dtype: categorical)
     amt (dtype: numeric)
   Shape:
     (Rows: 336, Columns: 5)]
Why does did show up as Id and not Index in the listing above?
Here is the warning:
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity="hse",
                                      agg_primitives=["sum", "mode", "percent_true"],
                                      where_primitives=["count", "avg_time_between"],
                                      max_depth=2)
feature_defs
.../anaconda3/lib/python3.6/site-packages/featuretools-0.2.1-py3.6.egg/featuretools/entityset/entityset.py:432: FutureWarning: 'did' is both an index level and a column label.
Defaulting to column, but this will raise an ambiguity error in a future version
end_entity_id=child_eid)
A relationship in your entity set will always be between an Index variable in the parent entity and an Id variable in the child entity. Therefore, Featuretools automatically converts the variable in the child entity to an Id type when you add the relationship, regardless of what you specify.
It is possible for a variable to be both an Index and an Id if there is a one-to-one relationship between two entities. In that case, you should join the two entities into one.
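For reference, here is a minimal sketch of that join for the entity set above, assuming hse, tx and uid_txlup are the dataframes from the question (the files and types entities and their relationships are omitted for brevity). Folding the one-to-one lookup table into tx leaves did as the Index of a single entity, which should also avoid the index/column ambiguity that the FutureWarning points at.

import featuretools as ft

# Fold the one-to-one lookup table into the transaction table so that
# uid becomes an ordinary column of tx and the uid_txlup entity disappears.
tx_merged = tx.merge(uid_txlup, on="did", how="left")

es = ft.EntitySet(id="the_entity_set")
es = es.entity_from_dataframe(entity_id="hse",
                              dataframe=hse,
                              index="uid",
                              variable_types={"Gender": ft.variable_types.Categorical,
                                              "Income": ft.variable_types.Numeric,
                                              "dob": ft.variable_types.Datetime})
es = es.entity_from_dataframe(entity_id="tx",
                              dataframe=tx_merged,
                              index="did",
                              time_index="dt",
                              variable_types={"uid": ft.variable_types.Categorical,
                                              "file_id": ft.variable_types.Categorical,
                                              "type_id": ft.variable_types.Categorical,
                                              "amt": ft.variable_types.Numeric})

# hse is the parent (uid is its Index); tx is the child, so tx.uid becomes an Id.
es.add_relationships([ft.Relationship(es["hse"]["uid"], es["tx"]["uid"])])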
Related
I have the following dataframes:
assets = pd.DataFrame(columns = ['date', 'asset'],
                      data = [[datetime.date(2022, 10, 21), 'SPY'],
                              [datetime.date(2022, 10, 21), 'FTSE'],
                              [datetime.date(2022, 11, 12), 'SPY'],
                              [datetime.date(2022, 11, 12), 'FTSE']])
prices = pd.DataFrame(columns = ['date', 'asset', 'price'],
                      data = [[datetime.date(2022, 10, 11), 'SPY', 10],
                              [datetime.date(2022, 10, 11), 'FTSE', 5],
                              [datetime.date(2022, 11, 8), 'SPY', 100],
                              [datetime.date(2022, 11, 8), 'FTSE', 50]])
For each asset I want to get the as-of price (at the nearest date). How can I do that?
If I have only one asset it is easy:
assets_spy = assets  # .loc[assets['asset']=='SPY']
prices_spy = prices  # .loc[prices['asset']=='SPY']
assets_spy.index = pd.to_datetime(assets_spy['date'])
prices_spy.index = pd.to_datetime(prices_spy['date'])
merged = pd.merge_asof(assets_spy.sort_index(),
                       prices_spy.sort_index(),
                       direction='nearest', right_index=True, left_index=True)
but if I follow the same logic for multiple assets, it won't match.
The function pandas.merge_asof has an optional parameter named by that you can use to match on a column (or list of columns) before performing the merge operation. Therefore, you could adapt your code like this:
import pandas as pd
import datetime

assets = pd.DataFrame(columns = ['date', 'asset'],
                      data = [[datetime.date(2022, 10, 21), 'SPY'],
                              [datetime.date(2022, 10, 21), 'FTSE'],
                              [datetime.date(2022, 11, 12), 'SPY'],
                              [datetime.date(2022, 11, 12), 'FTSE']])
prices = pd.DataFrame(columns = ['date', 'asset', 'price'],
                      data = [[datetime.date(2022, 10, 11), 'SPY', 10],
                              [datetime.date(2022, 10, 11), 'FTSE', 5],
                              [datetime.date(2022, 11, 8), 'SPY', 100],
                              [datetime.date(2022, 11, 8), 'FTSE', 50]])

merged = pd.merge_asof(
    assets.astype({'date': 'datetime64[ns]'}).sort_values('date').convert_dtypes(),
    prices.astype({'date': 'datetime64[ns]'}).sort_values('date').convert_dtypes(),
    direction = 'nearest',
    on = 'date',
    by = 'asset',
)
merged
# Returns:
#
#         date asset  price
# 0 2022-10-21   SPY     10  <-- Merged using SPY price from '2022-10-11'
# 1 2022-10-21  FTSE      5  <-- Merged using FTSE price from '2022-10-11'
# 2 2022-11-12   SPY    100  <-- Merged using SPY price from '2022-11-08'
# 3 2022-11-12  FTSE     50  <-- Merged using FTSE price from '2022-11-08'
cars = {"Austin 7": 1922, "Lancia Lambda": 1922, "Bugatti Type 35": 1924, "Hanomag 2": 1925,
"Ford Model A": 1927, "Cadillac V16": 1930}
for i in range:
print("car: {}, year: {}".format(_, cars[_]))
cars = {"Austin 7": 1922, "Lancia Lambda": 1922, "Bugatti Type 35": 1924, "Hanomag 2": 1925, "Ford Model A": 1927, "Cadillac V16": 1930}
list_keys = cars.keys()
for key in list_keys:
print("car: {}, year: {}".format(key, cars[key]))
What you are missing is that you can iterate over the keys of a dictionary using the syntax: for key in dict:.
cars = {"Austin 7": 1922, "Lancia Lambda": 1922, "Bugatti Type 35": 1924, "Hanomag 2": 1925, "Ford Model A": 1927, "Cadillac V16": 1930}
for car in cars:
print("car: {}, year: {}".format(car, cars[car]))
Alternatives exist such as iterating over the items of the dictionary:
for car, year in cars.items():
    print(f"car: {car}, year: {year}")
cars.items() returns a view of (key, value) tuples. Each tuple is unpacked into the variables car and year, which are then printed using an f-string.
Or you can do it in one line using a Python f-string, a list comprehension, and argument unpacking:
print(*[f'car: {car}, year: {cars[car]}' for car in cars], sep='\n')
My ADLA solution is being transitioned to Spark. I'm trying to find the right replacement for the U-SQL REDUCE expression to enable:
Read logical partition and store information in a list/dictionary/vector or other data structure in memory
Apply logic that requires multiple iterations
Output results as additional columns together with the original data (the original rows might be partially eliminated or duplicated)
Example of possible task:
Input dataset has sales and return transactions with their IDs and attributes
The solution is supposed to find the most likely sale for each return
A return transaction must happen after the sales transaction and be as similar to the sales transaction as possible (best available match)
A return transaction must be linked to exactly one sales transaction; a sales transaction could be linked to one or no return transaction - the link is supposed to be captured in the new column LinkedTransactionId
The solution could probably be achieved with the groupByKey command, but I'm failing to identify how to apply the logic across multiple rows. All the examples I've managed to find are some variation of an in-line function (usually an aggregate - e.g. .map(t => (t._1, t._2.sum))) which doesn't require information about individual records from the same partition.
Can anyone share an example of a similar solution or point me in the right direction?
Here is one possible solution - feedback, suggestions for a different approach, or examples of iterative Spark/Scala solutions are greatly appreciated:
The example reads Sales and Credit transactions for each customer (CustomerId) and processes each customer as a separate partition (outer mapPartitions loop)
Each Credit is mapped to the sale with the closest score (i.e. smallest score difference), using the foreach inner loop inside each partition
The mutable map trnMap prevents double-assignment of transactions and captures the updates from the process
Results are output through an iterator into the final dataset dfOut2
Note: in this particular case the same result could have been achieved using window functions without an iterative solution (a sketch of that approach is included after the code below), but the purpose is to test the iterative logic itself
import org.apache.spark.sql.SparkSession
import org.apache.spark._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.api.java.JavaRDD

case class Person(name: String, var age: Int)

case class SalesTransaction(
  CustomerId : Int,
  TransactionId : Int,
  Score : Int,
  Revenue : Double,
  Type : String,
  Credited : Double = 0.0,
  LinkedTransactionId : Int = 0,
  IsProcessed : Boolean = false
)

case class TransactionScore(
  TransactionId : Int,
  Score : Int
)

case class TransactionPair(
  SalesId : Int,
  CreditId : Int,
  ScoreDiff : Int
)

object ExampleDataFramePartition {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("Example Combiner")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()

    import spark.implicits._

    val df = Seq(
      (1, 1, 123, "Sales", 100),
      (1, 2, 122, "Credit", 100),
      (1, 3, 99, "Sales", 70),
      (1, 4, 101, "Sales", 77),
      (1, 5, 102, "Credit", 75),
      (1, 6, 98, "Sales", 71),
      (2, 7, 200, "Sales", 55),
      (2, 8, 220, "Sales", 55),
      (2, 9, 200, "Credit", 50),
      (2, 10, 205, "Sales", 50)
    ).toDF("CustomerId", "TransactionId", "TransactionAttributesScore", "TransactionType", "Revenue")
      .withColumn("Revenue", $"Revenue".cast(DoubleType))
      .repartition(2, $"CustomerId")

    df.show()

    val dfOut2 = df.mapPartitions(p => {
      println(p)
      // lookup of all transactions in this partition plus working buffers
      val trnMap = scala.collection.mutable.Map[Int, SalesTransaction]()
      val trnSales = scala.collection.mutable.ArrayBuffer.empty[TransactionScore]
      val trnCredits = scala.collection.mutable.ArrayBuffer.empty[TransactionScore]
      val trnPairs = scala.collection.mutable.ArrayBuffer.empty[TransactionPair]

      p.foreach(row => {
        val trnKey: Int = row.getAs[Int]("TransactionId")
        val trnValue: SalesTransaction = new SalesTransaction(row.getAs("CustomerId")
          , trnKey
          , row.getAs("TransactionAttributesScore")
          , row.getAs("Revenue")
          , row.getAs("TransactionType")
        )
        trnMap += (trnKey -> trnValue)
        if (trnValue.Type == "Sales") {
          trnSales += new TransactionScore(trnKey, trnValue.Score)
        } else {
          trnCredits += new TransactionScore(trnKey, trnValue.Score)
        }
      })

      if (trnCredits.size > 0 && trnSales.size > 0) {
        // define transaction pairs; arguments follow the field order (SalesId, CreditId)
        trnCredits.foreach(cr => {
          trnSales.foreach(sl => {
            trnPairs += new TransactionPair(sl.TransactionId, cr.TransactionId, math.abs(cr.Score - sl.Score))
          })
        })
      }

      // process pairs from the smallest score difference up, linking each side at most once
      trnPairs.sortBy(t => t.ScoreDiff)
        .foreach(t => {
          if (!trnMap(t.CreditId).IsProcessed && !trnMap(t.SalesId).IsProcessed) {
            trnMap(t.SalesId) = new SalesTransaction(trnMap(t.SalesId).CustomerId
              , trnMap(t.SalesId).TransactionId
              , trnMap(t.SalesId).Score
              , trnMap(t.SalesId).Revenue
              , trnMap(t.SalesId).Type
              , math.min(trnMap(t.CreditId).Revenue, trnMap(t.SalesId).Revenue)
              , t.CreditId
              , true
            )
            trnMap(t.CreditId) = new SalesTransaction(trnMap(t.CreditId).CustomerId
              , trnMap(t.CreditId).TransactionId
              , trnMap(t.CreditId).Score
              , trnMap(t.CreditId).Revenue
              , trnMap(t.CreditId).Type
              , math.min(trnMap(t.CreditId).Revenue, trnMap(t.SalesId).Revenue)
              , t.SalesId
              , true
            )
          }
        })

      trnMap.map(m => m._2).toIterator
    })

    dfOut2.show()
    spark.stop()
  }
}
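As a rough illustration of the window-function alternative mentioned in the note above, here is a minimal sketch (written in PySpark rather than Scala for brevity). It pairs each Credit with the closest-score Sale of the same customer via a ranking window, but it does not enforce the one-to-one constraint that the mapPartitions version handles, so treat it as a starting point only.

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("WindowPairingSketch").getOrCreate()

# Same shape of data as the Scala example above (a small subset).
df = spark.createDataFrame(
    [(1, 1, 123, "Sales", 100.0), (1, 2, 122, "Credit", 100.0),
     (1, 3,  99, "Sales",  70.0), (1, 5, 102, "Credit",  75.0)],
    ["CustomerId", "TransactionId", "TransactionAttributesScore", "TransactionType", "Revenue"])

sales = (df.filter(F.col("TransactionType") == "Sales")
           .select("CustomerId",
                   F.col("TransactionId").alias("SalesId"),
                   F.col("TransactionAttributesScore").alias("SalesScore"),
                   F.col("Revenue").alias("SalesRevenue")))
credits = (df.filter(F.col("TransactionType") == "Credit")
             .select("CustomerId",
                     F.col("TransactionId").alias("CreditId"),
                     F.col("TransactionAttributesScore").alias("CreditScore"),
                     F.col("Revenue").alias("CreditRevenue")))

# Rank every candidate sale per credit by absolute score difference and keep the best one.
w = Window.partitionBy("CustomerId", "CreditId").orderBy(F.abs(F.col("CreditScore") - F.col("SalesScore")))
pairs = (credits.join(sales, "CustomerId")
                .withColumn("rank", F.row_number().over(w))
                .filter(F.col("rank") == 1)
                .select("CustomerId", "CreditId",
                        F.col("SalesId").alias("LinkedTransactionId"),
                        F.least("CreditRevenue", "SalesRevenue").alias("Credited")))
pairs.show()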
I have a standard list of objects, where each object is defined as
class MyRecord(object):
    def __init__(self, name, date, category, memo):
        self.name = name
        self.date = date
        self.category = category
        self.memo = memo.strip().split()
When I create an object, the input memo is usually a long sentence, for example: "Hello world this is a new funny-memo", which the __init__ method then turns into the list ['Hello', 'world', 'this', 'is', 'a', 'new', 'funny-memo'].
Given, let's say, 10000 such records in the list (with different memos), I want to group them (as fast as possible) in the following way:
'Hello' : [all the records whose memo contains the word 'Hello']
'world' : [all the records whose memo contains the word 'world']
'is' : [all the records whose memo contains the word 'is']
I know how to use group-by to group the records by, for example, name, date, or category (since each is a single value), but I'm having trouble grouping them in the way described above.
If you want to group them really fast, then you should do it once and never recalculate. To achieve this, you can take an approach similar to caching and group the objects as they are created:
class MyRecord():
    __groups = dict()

    def __init__(self, name, date, category, memo):
        self.name = name
        self.date = date
        self.category = category
        self.memo = memo.strip().split()
        for word in self.memo:
            self.__groups.setdefault(word, set()).add(self)

    @classmethod
    def get_groups(cls):
        return cls.__groups
records = list()
for line in [
        'Hello world this is a new funny-memo',
        'Hello world this was a new funny-memo',
        'Hey world this is a new funny-memo']:
    records.append(MyRecord(1, 1, 1, line))

print({key: len(val) for key, val in MyRecord.get_groups().items()})
Output:
{'Hello': 2, 'world': 3, 'this': 3, 'is': 2, 'a': 3, 'new': 3, 'funny-memo': 3, 'was': 1, 'Hey': 1}
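If the records already exist as a plain list (as in the question) and you would rather group them after the fact, here is a minimal sketch using a dictionary of lists; the helper name group_by_memo_word is hypothetical:

from collections import defaultdict

def group_by_memo_word(records):
    """Map each word to the list of records whose memo contains that word."""
    groups = defaultdict(list)
    for record in records:
        for word in record.memo:  # memo is already a list of words
            groups[word].append(record)
    return groups

groups = group_by_memo_word(records)
print({word: len(recs) for word, recs in groups.items()})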
I'm trying to write a dictionary into an existing SQL database, but without success; it gives me:
sqlite3.InterfaceError: Error binding parameter 0 - probably unsupported type.
Based on my minimal example, does anybody have some useful hints? (Python 3)
Command to create the empty db3 anywhere on your machine:
CREATE TABLE "testTable" (
sID INTEGER NOT NULL UNIQUE PRIMARY KEY,
colA REAL,
colB TEXT,
colC INTEGER);
And the code for putting my dictionary into the database looks like:
import sqlite3

def main():
    path = '***anywhere***/test.db3'
    data = {'sID': [1, 2, 3],
            'colA': [0.3, 0.4, 0.5],
            'colB': ['A', 'B', 'C'],
            'colC': [4, 5, 6]}
    db = sqlite3.connect(path)
    c = db.cursor()
    writeDict2Table(c, 'testTable', data)
    db.commit()
    db.close()
    return

def writeDict2Table(cursor, tablename, dictionary):
    qmarks = ', '.join('?' * len(dictionary))
    cols = ', '.join(dictionary.keys())
    values = tuple(dictionary.values())
    query = "INSERT INTO %s (%s) VALUES (%s)" % (tablename, cols, qmarks)
    cursor.execute(query, values)
    return

if __name__ == "__main__":
    main()
I already had a look at
Python : How to insert a dictionary to a sqlite database?
but unfortunately I did not succeed.
You must not use a dictionary with question marks as parameter markers, because there is no guarantee about the order of the values.
To handle multiple rows, you must use executemany().
And executemany() expects each item to contain the values for one row, so you have to rearrange the data:
>>> print(*zip(data['sID'], data['colA'], data['colB'], data['colC']), sep='\n')
(1, 0.3, 'A', 4)
(2, 0.4, 'B', 5)
(3, 0.5, 'C', 6)
cursor.executemany(query, zip(data['sID'], data['colA'], data['colB'], data['colC']))
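Putting it together, here is a minimal sketch of writeDict2Table rewritten around executemany(), assuming every value in the dictionary is a list with one entry per row:

def writeDict2Table(cursor, tablename, dictionary):
    cols = ', '.join(dictionary.keys())
    qmarks = ', '.join('?' * len(dictionary))
    query = "INSERT INTO %s (%s) VALUES (%s)" % (tablename, cols, qmarks)
    # zip(*values) turns the per-column lists into per-row tuples: (1, 0.3, 'A', 4), ...
    cursor.executemany(query, zip(*dictionary.values()))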