When combining features and then aggregating them, featuretools returns some variables that don't make sense. How can this be avoided? - featuretools

I've got a dataset that contains invoices, with a unique identifier, and customers with a unique identifier. Each customer can have 1 or more invoices.
I set up the entity set as follows:
import featuretools as ft
from woodwork.logical_types import Categorical, NaturalLanguage

es = ft.EntitySet(id="data")
es = es.add_dataframe(
    dataframe=df,
    dataframe_name="data",
    index="rows",
    make_index=True,
    time_index="invoice_date",
    logical_types={
        "customer_id": Categorical,
        "description": NaturalLanguage,
    },
)
es.normalize_dataframe(
    base_dataframe_name="data",
    new_dataframe_name="invoices",
    index="invoice",
    copy_columns=["customer_id"],
)
es.normalize_dataframe(
    base_dataframe_name="invoices",
    new_dataframe_name="customers",
    index="customer_id",
)
So customers is the parent of invoices, which in turn is the parent of the full dataset.
Now, I want to combine the variables price and quantity at the level of the full dataframe to obtain price * quantity, which works fine. But after aggregating, I see combinations of variables that don't make human sense (or maybe it is me who does not understand them).
I set up the DFS call as follows:
date_primitives = ["month", "weekday"]
text_primitives = ["num_words"]
trans_primitives = date_primitives + text_primitives + ["multiply_numeric"]
agg_primitives = ["mean"]

feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=agg_primitives,
    trans_primitives=trans_primitives,
    primitive_options={
        "multiply_numeric": {
            "include_columns": {
                "data": ["price", "quantity"],
            },
        },
    },
    max_depth=3,
)
and the result of dfs contains the following features:
[<Feature: MEAN(data.price)>,
<Feature: MEAN(data.quantity)>,
<Feature: MONTH(first_invoices_time)>,
<Feature: WEEKDAY(first_invoices_time)>,
<Feature: MEAN(invoices.MEAN(data.price))>,
<Feature: MEAN(invoices.MEAN(data.quantity))>,
<Feature: MEAN(data.NUM_WORDS(description))>,
<Feature: MEAN(data.price * quantity)>,
<Feature: MEAN(data.price) * MEAN(data.quantity)>,
<Feature: MEAN(invoices.MEAN(data.NUM_WORDS(description)))>,
<Feature: MEAN(invoices.MEAN(data.price * quantity))>,
<Feature: MEAN(invoices.MEAN(data.price) * MEAN(data.quantity))>,
<Feature: MEAN(data.price * quantity) * MEAN(data.price)>,
<Feature: MEAN(data.price * quantity) * MEAN(data.quantity)>,
<Feature: MEAN(data.price * quantity) * MEAN(invoices.MEAN(data.price))>,
<Feature: MEAN(data.price * quantity) * MEAN(invoices.MEAN(data.quantity))>,
<Feature: MEAN(data.price) * MEAN(invoices.MEAN(data.price))>,
<Feature: MEAN(data.price) * MEAN(invoices.MEAN(data.quantity))>,
<Feature: MEAN(data.quantity) * MEAN(invoices.MEAN(data.price))>,
<Feature: MEAN(data.quantity) * MEAN(invoices.MEAN(data.quantity))>,
<Feature: MEAN(invoices.MEAN(data.price)) * MEAN(invoices.MEAN(data.quantity))>]
Of those features, MEAN(data.price * quantity) makes sense to me, and variations of it such as MEAN(invoices.MEAN(data.price * quantity)) also make sense. But features like MEAN(data.quantity) * MEAN(invoices.MEAN(data.price)) and MEAN(invoices.MEAN(data.price)) * MEAN(invoices.MEAN(data.quantity)) don't make sense to me.
I was wondering if they could be excluded from the output? I tried reducing the depth, but that would prevent the text primitive from executing. So I'm not sure what else to try.
Thank you!

Thank you for your question.
You can use the drop_contains argument of dfs. It drops any feature whose name contains one of the specified strings.
A sample call to dfs would be:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=agg_primitives,
    trans_primitives=trans_primitives,
    primitive_options={
        "multiply_numeric": {
            "include_columns": {
                "data": ["price", "quantity"],
            },
        },
    },
    drop_contains=[") * MEAN("],
    max_depth=3,
)
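If matching on substrings of feature names feels indirect, another option (a minimal sketch, assuming the entity set and primitive lists defined above) is to ask dfs for the feature definitions only, filter them in plain Python, and then compute the matrix:

features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=agg_primitives,
    trans_primitives=trans_primitives,
    primitive_options={
        "multiply_numeric": {
            "include_columns": {
                "data": ["price", "quantity"],
            },
        },
    },
    max_depth=3,
    features_only=True,  # build definitions only; nothing is computed yet
)

# Keep everything except products of two already-aggregated means.
kept = [f for f in features if ") * MEAN(" not in f.get_name()]

feature_matrix = ft.calculate_feature_matrix(features=kept, entityset=es)

Since get_name() returns exactly the strings shown in the feature list above, any Python predicate over those names works here.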

Related

Using binomial distribution with Spark to calculate an expected value and a variance ("number of times getting a six when throwing a die three times")

I'm training myself to solve classical statistics exercises with Spark, and its MLlib module when needed, in order to face any situation later.
Spark is dedicated to computations on matrices, but let's say that at some point I have a side calculation to do in my program, and I would like to solve it with Spark without adding other APIs.
The calculation is simple: find the expected value and the variance of the number of sixes obtained when a die is thrown three times.
Currently, I solve it with the help of the Apache Commons Math API and a bit of Spark:
/** Among imports: */
import org.apache.commons.math3.distribution.*;

/**
 * E8.5: Three dice are thrown. What is the probability of getting a six?
 * Follows a binomial distribution B(3, 1/6).
 * By hand: F(0) = 125/216, F(1) = 125/216 + 75/216, F(2) = (125 + 75 + 15)/216,
 * F(3) = (125 + 75 + 15 + 1)/216 = 1
 */
@Test
@DisplayName("Three dice are thrown. Probability of getting a six? Characteristics of the B(3, 1/6) distribution")
public void troisDésDonnentUnSix() {
    int n = 3;                     // Number of dice throws.
    double v[] = {0, 1, 2, 3};     // Possible values of the random variable.
    double f[] = new double[n+1];  // Cumulative frequencies.
    double p[] = new double[n+1];  // Probabilities.

    // Compute the probabilities and cumulative frequencies.
    BinomialDistribution loiBinomale = new BinomialDistribution(n, 1.0/6.0);

    for(int i=0; i <= n; i++) {
        p[i] = loiBinomale.probability(i);
        f[i] = loiBinomale.cumulativeProbability(i);
    }

    LOGGER.info("P(x) = {}", p);
    LOGGER.info("F(x) = {}", f);

    Dataset<Row> ds = fromDouble(v, p);
    LOGGER.info("E(X) = {}, V(X) = {}", esperance(ds), variance(ds));
}
where the fromDouble(v, p) method creates a Dataset from a list of random-variable values (column x) and their associated probabilities (column Px):
/**
 * Return a Dataset from a series of values with their probabilities.
 * @param valeurs Values.
 * @param probabilites Probabilities.
 * @return Dataset with a column x (double) containing the values<br>
 * and a column Px (double) containing the probabilities.
 */
protected Dataset<Row> fromDouble(double[] valeurs, double[] probabilites) {
    StructType schema = new StructType()
        .add("x", DoubleType, false)
        .add("Px", DoubleType, false);

    List<Row> rows = new ArrayList<>();

    for(int index=0; index < valeurs.length; index++) {
        rows.add(RowFactory.create(valeurs[index], probabilites[index]));
    }

    return this.session.createDataFrame(rows, schema);
}
And the esperance (= expected value) and variance methods called do these calculations:
/**
 * Compute the expected value over a Dataset of values and probabilities.
 * @param ds Dataset with columns:<br>
 * x : value<br>
 * Px : probability<br>
 * @return expected value.
 */
protected double esperance(Dataset<Row> ds) {
    return ds.agg(sum(col("x").multiply(col("Px")))).first().getDouble(0);
}

/**
 * Compute the variance over a Dataset of values and probabilities.
 * @param ds Dataset with columns:<br>
 * x : value<br>
 * Px : probability<br>
 * @return variance.
 */
protected double variance(Dataset<Row> ds) {
    Column variation = col("x").minus(esperance(ds));                // x - E(X)
    Column variationCarre = variation.multiply(variation);           // (x - E(X))^2
    Column termeCalculVariance = col("Px").multiply(variationCarre); // Px * (x - E(X))^2
    return ds.agg(sum(termeCalculVariance)).first().getDouble(0);
}
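In formula form, the two methods above compute the standard moments of a discrete distribution:

$$E(X) = \sum_i x_i \, P(x_i), \qquad V(X) = \sum_i P(x_i)\,\bigl(x_i - E(X)\bigr)^2$$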
LOGGER output:
P(x) = [0.5787037037037037, 0.34722222222222215, 0.06944444444444445, 0.0046296296296296285]
F(x) = [0.5787037037037035, 0.9259259259259259, 0.9953703703703703, 1.0]
E(X) = 0.49999999999999994, V(X) = 0.41666666666666663
It works (I had calculated E(X) = 53/108 by hand, while the program finds 54/108 = 0.5, so I might be wrong), but it's not perfect.
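For reference, the closed-form moments of a binomial distribution B(n, p) confirm the program's output:

$$E(X) = np = 3 \cdot \tfrac{1}{6} = \tfrac{1}{2}, \qquad V(X) = np\,(1 - p) = 3 \cdot \tfrac{1}{6} \cdot \tfrac{5}{6} = \tfrac{5}{12} \approx 0.4167$$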
Is there a more elegant way to solve this problem using Spark (and Spark MLlib, if needed)?

Why can't I use the ACOS function?

I have a problem with this formula. Excel returns a #NAME? error.
=ACOS(COS(RADIANS(90-B3)) * COS(RADIANS(90-$G$3)) + SIN(RADIANS(90-B3)) * SIN(RADIANS(90-$G$3)) * COS(RADIANS(C3-$G$4))) * 6371

Ball to Ball Collision resolution

I was going through some collision detection tutorials on YouTube. In one of the tutorials, the author used the following code to resolve a collision between two balls:
/**
 * Rotates coordinate system for velocities
 *
 * Takes velocities and alters them as if the coordinate system they're on was rotated
 *
 * @param Object | velocity | The velocity of an individual particle
 * @param Float | angle | The angle of collision between two objects in radians
 * @return Object | The altered x and y velocities after the coordinate system has been rotated
 */
function rotate(velocity, angle) {
    const rotatedVelocities = {
        x: velocity.x * Math.cos(angle) - velocity.y * Math.sin(angle),
        y: velocity.x * Math.sin(angle) + velocity.y * Math.cos(angle)
    };
    return rotatedVelocities;
}

/**
 * Swaps out two colliding particles' x and y velocities after running through
 * an elastic collision reaction equation
 *
 * @param Object | particle | A particle object with x and y coordinates, plus velocity
 * @param Object | otherParticle | A particle object with x and y coordinates, plus velocity
 * @return Null | Does not return a value
 */
function resolveCollision(particle, otherParticle) {
    const xVelocityDiff = particle.velocity.x - otherParticle.velocity.x;
    const yVelocityDiff = particle.velocity.y - otherParticle.velocity.y;

    const xDist = otherParticle.x - particle.x;
    const yDist = otherParticle.y - particle.y;

    // Prevent accidental overlap of particles
    if (xVelocityDiff * xDist + yVelocityDiff * yDist >= 0) {
        // Grab angle between the two colliding particles
        const angle = -Math.atan2(otherParticle.y - particle.y, otherParticle.x - particle.x);

        // Store mass in var for better readability in collision equation
        const m1 = particle.mass;
        const m2 = otherParticle.mass;

        // Velocity before equation
        const u1 = rotate(particle.velocity, angle);
        const u2 = rotate(otherParticle.velocity, angle);

        // Velocity after 1d collision equation
        const v1 = { x: u1.x * (m1 - m2) / (m1 + m2) + u2.x * 2 * m2 / (m1 + m2), y: u1.y };
        const v2 = { x: u2.x * (m1 - m2) / (m1 + m2) + u1.x * 2 * m2 / (m1 + m2), y: u2.y };

        // Final velocity after rotating axis back to original location
        const vFinal1 = rotate(v1, -angle);
        const vFinal2 = rotate(v2, -angle);

        // Swap particle velocities for realistic bounce effect
        particle.velocity.x = vFinal1.x;
        particle.velocity.y = vFinal1.y;
        otherParticle.velocity.x = vFinal2.x;
        otherParticle.velocity.y = vFinal2.y;
    }
}
I've mostly understood this code. However, I'm unable to understand how this if condition determines whether the balls have overlapped or not.
if (xVelocityDiff * xDist + yVelocityDiff * yDist >= 0)
Can somebody please explain?
By taking the differences of positions and velocities, you view everything in the frame of otherParticle. In that frame, otherParticle is standing still at the origin and particle is moving with velocityDiff. (The original answer illustrated this with a diagram of particle heading toward otherParticle at the origin.)
The term xVelocityDiff * xDist + yVelocityDiff * yDist is the dot product of the two vectors. Since dist points from particle toward otherParticle, the dot product is positive if velocityDiff points somewhat in the same direction as dist, i.e. if particle is closing in on otherParticle as in the diagram, and that is exactly when the collision must be resolved. If the dot product is negative, particle is already moving away from otherParticle and you don't need to do anything.
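A minimal sketch of the same test in Python (the particle dicts and the moving_toward helper are hypothetical, not the tutorial's code):

# Hypothetical particles: position (x, y) and velocity (vx, vy).
def moving_toward(p, q):
    # Relative velocity of p as seen from q.
    dvx = p["vx"] - q["vx"]
    dvy = p["vy"] - q["vy"]
    # Vector pointing from p to q.
    dx = q["x"] - p["x"]
    dy = q["y"] - p["y"]
    # Non-negative dot product: the relative velocity has a component
    # along the line toward q, so the two are closing in.
    return dvx * dx + dvy * dy >= 0

p = {"x": 0.0, "y": 0.0, "vx": 1.0, "vy": 0.0}
q = {"x": 2.0, "y": 0.0, "vx": 0.0, "vy": 0.0}
print(moving_toward(p, q))  # True: p is heading straight at q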

How to calculate a Google Maps circle radius: JS to C#

I know how to use JavaScript to calculate the radius with the code below:
var center = new google.maps.LatLng(3.2987599, 102.6872022);
var latLng = new google.maps.LatLng(3.0987599, 101.6872022);
var distanceInMetres = google.maps.geometry.spherical.computeDistanceBetween(center, latLng);
But how do I convert google.maps.geometry.spherical.computeDistanceBetween into a C# function?
Distance between 2 points: (lat1,lon1) to (lat2,lon2)
distance = acos(
cos(lat1 * (PI()/180)) *
cos(lon1 * (PI()/180)) *
cos(lat2 * (PI()/180)) *
cos(lon2 * (PI()/180))
+
cos(lat1 * (PI()/180)) *
sin(lon1 * (PI()/180)) *
cos(lat2 * (PI()/180)) *
sin(lon2 * (PI()/180))
+
sin(lat1 * (PI()/180)) *
sin(lat2 * (PI()/180))
) * 3959
3959 is the Earth's radius in miles. Replace this value with the radius in km (or any other unit) to get results in that unit.
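For reference, the expanded expression above simplifies, via the angle-difference identity, to the usual spherical law of cosines:

$$d = R \, \arccos\bigl(\cos\varphi_1 \cos\varphi_2 \cos(\lambda_1 - \lambda_2) + \sin\varphi_1 \sin\varphi_2\bigr)$$

where $\varphi$ is latitude, $\lambda$ is longitude (both in radians), and $R$ is the Earth's radius.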
You can verify your implementation by comparing to this worked example:
I have written a C# solution to calculate the distance, converting
var distanceInMetres = google.maps.geometry.spherical.computeDistanceBetween(center, latLng);
into C#. Below is the code I am using. 6371 is the radius of the Earth in km.
// Calculate the distance on Earth between 2 coordinate points
double e = lat * (Math.PI / 180);  // lat1 in radians
double f = lng * (Math.PI / 180);  // lon1 in radians
double g = lat2 * (Math.PI / 180); // lat2 in radians
double h = lng2 * (Math.PI / 180); // lon2 in radians
// Spherical law of cosines, expanded as in the formula above.
double i =
    (Math.Cos(e) * Math.Cos(g) * Math.Cos(f) * Math.Cos(h)
   + Math.Cos(e) * Math.Sin(f) * Math.Cos(g) * Math.Sin(h)
   + Math.Sin(e) * Math.Sin(g));
double j = Math.Acos(i); // Central angle in radians
double k = (6371 * j);   // Distance in km
The distance between 2 lat/long points can also be calculated with the haversine formula, which is described here: http://en.wikipedia.org/wiki/Haversine_formula
There is also another question here on Stack Overflow about more or less the same issue: Calculate distance between two latitude-longitude points? (Haversine formula)
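For completeness, a minimal haversine sketch in Python (the haversine_km helper is hypothetical; it assumes a result in kilometres and reuses the coordinates from the question):

import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Convert latitudes and deltas to radians.
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # 6371 is the Earth's mean radius in km, as in the C# code above.
    return 2 * 6371 * math.asin(math.sqrt(a))

print(haversine_km(3.2987599, 102.6872022, 3.0987599, 101.6872022))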

Cron: setting alternate seconds

I posted a question the other day about setting alternate minutes in cron, and I was given a lovely simple answer.
0-59/2 * * * * first_script
1-59/2 * * * * second_script
This worked brilliantly; however, I have since realized that I need my scripts to run more often than every minute.
I know cron doesn't support seconds, but you can bluff it by using sleep, like so:
* * * * * /foo/bar/your_script
* * * * * sleep 15; /foo/bar/your_script
* * * * * sleep 30; /foo/bar/your_script
* * * * * sleep 45; /foo/bar/your_script
So I need to combine both of these so that, for instance, the two scripts alternate every 15 seconds.
Any ideas?
I ended up with the following crontab to run my scripts at intervals shorter than 1 minute: script1 fires at :00 and :30 past each minute and script2 at :15 and :45, so the two alternate every 15 seconds.
* * * * * /usr/bin/php -q /path/to/file/script1.php
* * * * * sleep 15; /usr/bin/php -q /path/to/file/script2.php
* * * * * sleep 30; /usr/bin/php -q /path/to/file/script1.php
* * * * * sleep 45; /usr/bin/php -q /path/to/file/script2.php
