Does DFS take default primitives if we don't specify any? - featuretools

If we don't specify the list of primitives to use in DFS, will it use all possible primitives?
If I give only the agg_primitives list and not trans_primitives, will it use the aggregation list I provided and, for transformations, all the default primitives? Or will it skip transformation primitives entirely and use only the aggregations?

DFS in Featuretools does use a set of default primitives if you do not specify them.
The default aggregation primitives are ['sum', 'std', 'max', 'skew', 'min', 'mean', 'count', 'percent_true', 'n_unique', 'mode'].
The default transformation primitives are ['day', 'year', 'month', 'weekday', 'haversine', 'num_words', 'num_characters'].
If you provide a value for one but not the other, the default list is used for the one you left unspecified. If you do not want any primitives of a category to be used, pass an empty list.
You can find this information in the Featuretools documentation here.
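A minimal sketch of both behaviors, assuming an existing EntitySet es with a "customers" entity (all names here are hypothetical):
import featuretools as ft

# Only agg_primitives is given, so the default trans_primitives
# ('day', 'year', 'month', ...) still apply.
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity="customers",
                                      agg_primitives=["sum", "mean"])

# Passing an empty list disables a category entirely:
# aggregation primitives only, no transformation primitives at all.
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity="customers",
                                      agg_primitives=["sum", "mean"],
                                      trans_primitives=[])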

Related

Including only certain features when running deep feature synthesis?

For example one of my entities has two sets of IDs.
One that is continuous (which apparently is necessary to create the EntitySet), and one to use as a foreign key when merging with my other table.
This results in featuretools including the ID in the set of features to aggregate. SUM(ID) isn't a feature I am interested in though.
Is there a way to include only certain features when running deep feature synthesis?
There are three ways to exclude features when calling ft.dfs:
Use the ignore_variables parameter to specify variables in an entity that should not be used to create features. It is a dictionary mapping an entity id to a list of variable names to ignore.
Use drop_contains to drop features that contain any of the strings listed in this parameter.
Use drop_exact to drop features that exactly match any of the strings listed in this parameter.
Here is an example usage of all three in an ft.dfs call:
ft.dfs(target_entity="customers",
       ignore_variables={
           "transactions": ["amount"],
           "customers": ["age", "gender", "date_of_birth"]
       },  # ignore these variables
       drop_contains=["customers.SUM("],  # drop features that contain these strings
       drop_exact=["STD(transactions.quantity)"],  # drop features named exactly this
       ...
       )
These 3 parameters are all documented here.
The final thing to check if you are getting features you don't want is the variable types of the variables in your entity set. If you are seeing the sum of an ID variable, that must mean Featuretools thinks the ID variable is a numeric value. If you tell Featuretools it is an ID, it will not apply a numeric aggregation to it.
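A hedged sketch of setting the variable type when building the entity set; the entity, dataframe, and column names are hypothetical, and this assumes the same Featuretools API generation as the ft.dfs call above:
import featuretools as ft
from featuretools import variable_types as vtypes

# Declaring "customer_id" as an Id keeps numeric aggregations
# such as SUM() from being applied to it.
es = es.entity_from_dataframe(entity_id="transactions",
                              dataframe=transactions_df,
                              index="transaction_id",
                              variable_types={"customer_id": vtypes.Id})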

Apache Spark Transformations: groupByKey vs reduceByKey vs aggregateByKey

These three Apache Spark transformations are a little confusing. Is there any way I can determine when to use which one and when to avoid one?
I think the official guide explains it well enough.
I will highlight the differences, assuming you have an RDD of type (K, V); a short PySpark sketch follows the list:
1. if you need to keep the values, then use groupByKey
2. if you don't need to keep the values, but you need to get some aggregated info about each group (the items of the original RDD which have the same K), you have two choices: reduceByKey or aggregateByKey (reduceByKey is a particular case of aggregateByKey)
2.1 if you can provide an operation which takes (V, V) as input and returns V, so that all the values of the group can be reduced to a single value of the same type, then use reduceByKey. As a result you will have an RDD of the same (K, V) type.
2.2 if you cannot provide this aggregation operation, then use aggregateByKey. That happens when you reduce the values to another type, so you will have (K, V2) as a result.
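A minimal PySpark sketch of all three transformations, assuming a SparkContext sc (data and names are illustrative):
rdd = sc.parallelize([("a", 1), ("a", 4), ("b", 2)])

# groupByKey: keep the values themselves.
grouped = rdd.groupByKey().mapValues(list)      # ("a", [1, 4]), ("b", [2])

# reduceByKey: (V, V) -> V, the result stays (K, V).
sums = rdd.reduceByKey(lambda x, y: x + y)      # ("a", 5), ("b", 2)

# aggregateByKey: the result may have another type (K, V2);
# here V2 is a (sum, count) pair.
sum_count = rdd.aggregateByKey(
    (0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),    # fold one value into the accumulator
    lambda a, b: (a[0] + b[0], a[1] + b[1]))    # merge accumulators across partitions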
In addition to Hlib's answer, I would like to add a few more points.
groupByKey() just groups your dataset based on a key.
reduceByKey() is something like grouping plus aggregation. We can say reduceByKey() is equivalent to dataset.group(...).reduce(...).
aggregateByKey() is logically the same as reduceByKey(), but it lets you return the result in a different type. In other words, it lets you have an input of type x and an aggregate result of type y. For example, (1,2),(1,4) as input and (1,"six") as output.

How to assign unique contiguous numbers to elements in a Spark RDD

I have a dataset of (user, product, review), and want to feed it into mllib's ALS algorithm.
The algorithm needs users and products to be numbers, while mine are String usernames and String SKUs.
Right now, I get the distinct users and SKUs, then assign numeric IDs to them outside of Spark.
I was wondering whether there was a better way of doing this. The one approach I've thought of is to write a custom RDD that essentially enumerates 1 through n, then call zip on the two RDDs.
Starting with Spark 1.0 there are two methods you can use to solve this easily:
RDD.zipWithIndex is just like Seq.zipWithIndex: it adds contiguous (Long) numbers. This needs to count the elements in each partition first, so your input will be evaluated twice. Cache your input RDD if you want to use this.
RDD.zipWithUniqueId also gives you unique Long IDs, but they are not guaranteed to be contiguous. (They will only be contiguous if each partition has the same number of elements.) The upside is that this does not need to know anything about the input, so it will not cause double-evaluation.
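To illustrate both methods in PySpark (SparkContext sc assumed, data made up):
users = sc.parallelize(["alice", "bob", "carol"]).cache()  # cache to avoid double evaluation

with_index = users.zipWithIndex()      # contiguous: ("alice", 0), ("bob", 1), ("carol", 2)
with_unique = users.zipWithUniqueId()  # unique Long IDs, not necessarily contiguous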
For a similar example use case, I just hashed the string values. See http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/
def nnHash(tag: String) = tag.hashCode & 0x7FFFFF
val tagHashes = postIDTags.map(_._2).distinct.map(tag => (nnHash(tag), tag))
It sounds like you're already doing something like this, although hashing can be easier to manage.
Matei suggested here an approach to emulating zipWithIndex on an RDD, which amounts to assigning IDs within each partiition that are going to be globally unique: https://groups.google.com/forum/#!topic/spark-users/WxXvcn2gl1E
Another easy option, if you are using DataFrames and are just concerned about uniqueness, is to use the function MonotonicallyIncreasingID:
import org.apache.spark.sql.functions.monotonicallyIncreasingId
val newDf = df.withColumn("uniqueIdColumn", monotonicallyIncreasingId)
Edit: MonotonicallyIncreasingID was deprecated and removed since Spark 2.0; it is now known as monotonically_increasing_id.
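For reference, the Spark 2.x+ spelling of the same snippet in PySpark (DataFrame df assumed):
from pyspark.sql.functions import monotonically_increasing_id

newDf = df.withColumn("uniqueIdColumn", monotonically_increasing_id())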
monotonically_increasing_id() appears to be the answer, but unfortunately it won't work for ALS, since it produces 64-bit numbers and ALS expects 32-bit ones (see my comment below radek1st's answer for details).
The solution I found is to use zipWithIndex(), as mentioned in Darabos' answer. Here's how to implement it:
If you already have a single-column DataFrame with your distinct users called userids, you can create a lookup table (LUT) as follows:
# PySpark code
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

user_als_id_LUT = sqlContext.createDataFrame(
    userids.rdd.map(lambda x: x[0]).zipWithIndex(),
    StructType([StructField("userid", StringType(), True),
                StructField("user_als_id", IntegerType(), True)]))
Now you can:
Use this LUT to get ALS-friendly integer IDs to provide to ALS (a join sketch follows after this list)
Use this LUT to do a reverse-lookup when you need to go back from ALS ID to the original ID
Do the same for items, obviously.
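A hedged sketch of both lookups, assuming a ratings_df DataFrame with a "userid" column and an ALS output als_results with a "user_als_id" column (both names are hypothetical):
# Forward lookup: attach the ALS-friendly integer IDs.
als_ready = ratings_df.join(user_als_id_LUT, on="userid")

# Reverse lookup: map ALS output back to the original string IDs.
restored = als_results.join(user_als_id_LUT, on="user_als_id")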
People have already recommended monotonically_increasing_id(), and mentioned the problem that it creates Longs, not Ints.
However, in my experience (caveat: Spark 1.6), if you use it on a single executor (repartition to 1 beforehand), there is no executor prefix used, and the number can be safely cast to Int. Obviously, you need to have fewer than Integer.MAX_VALUE rows.
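A sketch of that single-partition trick in PySpark (DataFrame df assumed; as noted above, this behavior was only observed on Spark 1.6):
from pyspark.sql.functions import monotonically_increasing_id

# One partition means no executor prefix, so the IDs fit in an Int.
ids = df.repartition(1).withColumn("id", monotonically_increasing_id().cast("int"))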

UML metamodel: derived, derived union and subsetting

If you have ever worked with the metamodel of UML, you probably know the concepts of unions and subsets. As far as I understand it:
Attributes and associations of an element/class marked as "derived union" cannot be used directly. In more specific sub-classes, you can possibly find subsets of them that can be used, as long as they are not marked as derived unions themselves.
"derived" (without union) attributes and associations have also subsets in more specific classes, but unlike above you can use them directly without having to look for subsets in more specific classes
My questions:
Does this make sense or am I on the wrong track here?
What is the meaning of the "/" (slash) you can find in front of some attributes/associations? Does it mean that they have subsets in child-classes?
E.g. /general : Classifier[*]
A union property is a property that consists of multiple other properties. You can only understand the union when you combine all of its subsets. A list is almost by definition a union.
Almost, because it might be uninitialized.
A derived union is a property requiring a specific collection of subsets. I would not talk about accessing them directly, but about how directly you can understand them. You need all the information before you can make an interpretation.
The difference between the two is that a derived union requires a specific subset, while a union might have a subset, and might have different subsets in different contexts. A very simple example is the fields on a form: all required fields show the definition of a derived union; all other fields are part of the complete union.
Derived unions can contain derived unions in their subsets. This directs the creation of classes and their instances; it does not make them impossible.
All derived features require other features to be known. Temperature can be read directly, but to know whether someone has a fever requires more knowledge, like the time of day, the place where the information was collected, etc.
The slash indicates that the property is derived.

Tuples in .NET 4.0: when should I use them?

I have come across Tuples in .NET 4.0. I have seen a few examples on MSDN; however, it's still not clear to me what their purpose is and when to use them.
Is the idea that if I want to create a collection of mixed types, I should use a tuple?
Any clear examples out there I can relate to?
When did you last use them?
Thanks for any suggestions.
Tuples are just used during the coding process by a developer. If you want to return two pieces of information instead of one, then you can use a Tuple for fast coding, but I recommend you make yourself a type that contains both properties, with appropriate naming and documentation.
Tuples are not used to mix types the way you imagine. Tuples are used to make compositions of other types, e.g. a type that holds both an int and a string can be represented by a tuple: Tuple<int,string>.
Tuples exist in a lot of sizes, not only two.
I don't recommend using tuples in your final code, since their meaning is not clear.
Tuples can be used as multipart keys for dictionaries or grouping statements because they implement structural equality. Avoid using them to move data around, because they have poor language support in C# and named values (classes, structs) are better (simpler) than ordered values (tuples).
