Iterate cluster centers for K means in Python - scikit-learn

I have 4 columns of data. For these Xs, I need to pick 3 cluster centers randomly and find the clustering with the least SSE. Why do the centers and inertia (SSE) turn out to be the same even with varying random states and the init='random' parameter?
from sklearn.cluster import KMeans

Xvar = stud.iloc[:, 1:5]  # the 4 feature columns
#X1 = np.random.randint(22, 99, size=(3, 4))
kmeans1 = KMeans(n_clusters=3, init='random', random_state=101)
kmeans1.fit(Xvar)
kmeans1.labels_
kmeans1.cluster_centers_
kmeans1.inertia_

On data this simple, many different initial seeds will converge to the same result.
Plus, the default of n_init is 10 (if I remember correctly), so each call to fit runs ten random initializations and keeps the best one; even with different random states, the best of those ten runs often ends up being the same solution, with identical centers and inertia.
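To actually see the effect of the random state, force a single initialization per fit with n_init=1. A minimal sketch, using a synthetic array as a stand-in for Xvar:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the 4 feature columns in Xvar.
rng = np.random.RandomState(0)
X = rng.randint(22, 99, size=(60, 4)).astype(float)

# n_init=1 means each fit performs exactly one random initialization, so
# different random_state values can now land in different local optima
# and report different centers / inertia.
for seed in (1, 42, 101):
    km = KMeans(n_clusters=3, init='random', n_init=1, random_state=seed).fit(X)
    print(seed, km.inertia_)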

Related

Strange performance issue Spark LSH MinHash approxSimilarityJoin

I'm joining 2 datasets using Apache Spark ML LSH's approxSimilarityJoin method, but I'm seeing some strange behaviour.
After the (inner) join the dataset is a bit skewed, and every time one or more tasks take an inordinate amount of time to complete.
The median is 6 ms per task (I'm running it on a smaller source dataset to test), but one task takes 10 min. It's hardly using any CPU cycles; it does join data, but very, very slowly.
The next slowest task runs in 14 s, has 4x more records and actually spills to disk.
The join itself is an inner join between the two datasets on pos & hashValue (minhash), in accordance with the MinHash specification, plus a UDF to calculate the Jaccard distance between the matched pairs.
Explode the hashtables:
modelDataset.select(
struct(col("*")).as(inputName), posexplode(col($(outputCol))).as(explodeCols))
Jaccard distance function:
override protected[ml] def keyDistance(x: Vector, y: Vector): Double = {
val xSet = x.toSparse.indices.toSet
val ySet = y.toSparse.indices.toSet
val intersectionSize = xSet.intersect(ySet).size.toDouble
val unionSize = xSet.size + ySet.size - intersectionSize
assert(unionSize > 0, "The union of two input sets must have at least 1 elements")
1 - intersectionSize / unionSize
}
Join of the processed datasets:
// Do a hash join on where the exploded hash values are equal.
val joinedDataset = explodedA.join(explodedB, explodeCols)
.drop(explodeCols: _*).distinct()
// Add a new column to store the distance of the two rows.
val distUDF = udf((x: Vector, y: Vector) => keyDistance(x, y), DataTypes.DoubleType)
val joinedDatasetWithDist = joinedDataset.select(col("*"),
distUDF(col(s"$leftColName.${$(inputCol)}"), col(s"$rightColName.${$(inputCol)}")).as(distCol)
)
// Filter the joined datasets where the distance are smaller than the threshold.
joinedDatasetWithDist.filter(col(distCol) < threshold)
I've tried combinations of caching, repartitioning and even enabling spark.speculation, all to no avail.
The data consists of shingled address text that has to be matched:
53536, Evansville, WI => 53, 35, 36, ev, va, an, ns, vi, il, ll, le, wi
This will have a short distance to records where there is a typo in the city or zip, which gives pretty accurate results, but may also be the cause of the join skew.
My questions are:
What may cause this discrepancy? (One task taking very, very long even though it has fewer records)
How can I prevent this skew in MinHash without losing accuracy?
Is there a better way to do this at scale? (I can't Jaro-Winkler / Levenshtein compare millions of records with all records in the location dataset)
It might be a bit late, but I will post my answer here anyway to help others out. I recently had similar issues with matching misspelled company names (All executors dead MinHash LSH PySpark approxSimilarityJoin self-join on EMR cluster). Someone helped me out by suggesting NGrams to reduce the data skew, and it helped a lot. You could also try using e.g. 3-grams or 4-grams.
I don't know how dirty the data is, but you could try to make use of states. That already reduces the number of possible matches substantially.
What really helped me improve the accuracy of the matches was to postprocess the connected components (the groups of connected matches produced by MinHashLSH) by running a label propagation algorithm on each component. This also lets you increase N (of the NGrams), which mitigates the skew, set the Jaccard distance threshold in approxSimilarityJoin less tightly, and rely on the label propagation postprocessing to clean up the matches.
Finally, I am currently looking into using skipgrams for the matching. I found that in some cases it works better and reduces the data skew somewhat.
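To make the NGram suggestion concrete, here is a minimal PySpark sketch that shingles addresses into character 3-grams and feeds them to MinHashLSH. The toy data, column names, and parameter values (numFeatures, numHashTables, the 0.6 threshold) are illustrative assumptions, not the poster's actual pipeline:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, NGram, HashingTF, MinHashLSH

spark = SparkSession.builder.getOrCreate()

# Toy stand-ins for the two address datasets.
dfA = spark.createDataFrame([(0, "53536, Evansville, WI"),
                             (1, "1100 Main St, Madison, WI")], ["id", "address"])
dfB = spark.createDataFrame([(0, "53535, Evansvile, WI"),
                             (1, "1100 Main Street, Madison, WI")], ["id", "address"])

pipeline = Pipeline(stages=[
    # Lowercase, keep only alphanumeric characters, one token per character.
    RegexTokenizer(inputCol="address", outputCol="chars",
                   pattern="[a-z0-9]", gaps=False),
    # Character 3-grams instead of 2-grams: fewer shared shingles, less join skew.
    NGram(n=3, inputCol="chars", outputCol="ngrams"),
    # Hash the n-grams into sparse count vectors for MinHash.
    HashingTF(inputCol="ngrams", outputCol="features", numFeatures=1 << 18),
    MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5),
])

model = pipeline.fit(dfA)
matches = model.stages[-1].approxSimilarityJoin(
    model.transform(dfA), model.transform(dfB), threshold=0.6, distCol="jaccard")
matches.select("datasetA.id", "datasetB.id", "jaccard").show()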

Spark Geolocated Points Clustering

I have a dataset of points of interest on the maps like the following:
ID latitude longitude
1 48.860294 2.338629
2 48.858093 2.294694
3 48.8581965 2.2937403
4 48.8529717 2.3477134
...
The goal is to find those clusters of points that are very close to each other (distance less than 100m).
So the output I expect for this dataset would be:
(2, 3)
The point 2 and 3 are very close to each other with a distance less than 100m, while the others are far away so they should be ignored.
Since the dataset is huge with all the points of interest in the world, I need to do it with Spark with some parallel processing.
What approach should I take for this case?
I actually solved this problem using the following 2 approaches:
DBSCAN algorithm implemented as Spark job with partitioning
https://github.com/irvingc/dbscan-on-spark
GeoSpark with spacial distance join
https://github.com/DataSystemsLab/GeoSpark
Both of them are based on Spark, so they work well with data at large scale.
However, I found that dbscan-on-spark consumes a lot of memory, so I ended up using GeoSpark with the distance join.
I would love to do a cross join here, however that probably won't work since your data is huge.
One approach is to partition the data region-wise. That means you can change the input data to
ID latitude longitude latitiude_int longitude_int group_unique_id
1 48.860294 2.338629 48 2 48_2
2 48.858093 2.294694 48 2 48_2
3 48.8581965 2.2937403 48 2 48_2
4 48.8529717 2.3477134 48 2 48_2
The assumption here is that if the integer part of the lat/long changes, the points are more than 100 m apart.
Now you can partition the data w.r.t group_unique_id and then do a cross join per partition.
This will probably reduce the workload.
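A minimal PySpark sketch of that grouping idea (toy data from the question; the haversine helper and the 100 m filter are my additions). Note that two points just across a cell border land in different groups, so in practice you would also check neighbouring cells:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data from the question.
points = spark.createDataFrame([
    (1, 48.860294, 2.338629),
    (2, 48.858093, 2.294694),
    (3, 48.8581965, 2.2937403),
    (4, 48.8529717, 2.3477134),
], ["id", "latitude", "longitude"])

# Coarse cell key built from the integer parts of lat/long.
cells = points.withColumn(
    "group_unique_id",
    F.concat_ws("_", F.floor("latitude").cast("string"),
                     F.floor("longitude").cast("string")))

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) column pairs."""
    r = 6371000.0
    dlat = F.radians(lat2 - lat1)
    dlon = F.radians(lon2 - lon1)
    h = (F.pow(F.sin(dlat / 2), 2)
         + F.cos(F.radians(lat1)) * F.cos(F.radians(lat2)) * F.pow(F.sin(dlon / 2), 2))
    return 2 * r * F.asin(F.sqrt(h))

a, b = cells.alias("a"), cells.alias("b")
pairs = (a.join(b, "group_unique_id")             # cross join only within a cell
           .where(F.col("a.id") < F.col("b.id"))  # keep each pair once
           .withColumn("dist_m", haversine_m(F.col("a.latitude"), F.col("a.longitude"),
                                             F.col("b.latitude"), F.col("b.longitude")))
           .where(F.col("dist_m") < 100))
pairs.select(F.col("a.id").alias("id_a"), F.col("b.id").alias("id_b"), "dist_m").show()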

Can Apache K Means WSSSE increase with some K?

I am trying to see if there is a point in the "elbow graph" that would help me choose K for the K-means algorithm.
However, I notice that the WSSSE sometimes increases as K increases. I was under the assumption that WSSSE always decreases as K increases. I attach a picture showing this along with the PySpark code.
The only thing that is guaranteed is that once you reach k == n you will get a WSSSE of zero, because each point then lies exactly on a cluster centroid, bringing the SSE of every point (and therefore the WSSSE) to zero. The reason your curve is not monotonically decreasing is that k-means uses random initialization of the cluster centroids (seeds), and the optimization result is non-deterministic with respect to the initial centroid placement (the underlying problem is NP-hard). You can therefore end up in a worse local optimum on one run than on another. Here is another thread on this topic.
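If you want a less noisy elbow curve, fixing the seed and giving k-means|| more initialization steps helps. A minimal PySpark sketch on synthetic data, assuming a reasonably recent Spark version where model.summary.trainingCost is available (otherwise use computeCost on the training data):

from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.getOrCreate()

# Synthetic stand-in for the real feature DataFrame.
df = spark.range(500).selectExpr("rand(7) * 10 AS x", "rand(13) * 10 AS y")
features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)

# A fixed seed plus more k-means|| init steps makes the elbow curve much less
# likely to bump upwards, although each fit is still only a local optimum.
for k in range(2, 10):
    model = KMeans(k=k, seed=1, initSteps=5, maxIter=50).fit(features)
    print(k, model.summary.trainingCost)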

Random Forest: Running out of memory

I'm using scikit-learn's Random Forest to fit training data (~30 MB) and my laptop keeps crashing because the application runs out of memory. The test data is a few times bigger than the training data. I'm using a MacBook Air, 2 GHz, 8 GB memory.
What are some of the ways to deal with this?
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn import cross_validation  # sklearn.model_selection in newer versions

rf = RandomForestClassifier(n_estimators=100, n_jobs=4)
print "20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rf, X_train_a, y_train, cv=20, scoring='roc_auc'))
Your best choice is to tune the arguments.
n_jobs=4
This makes the computer run four train-test cycles simultaneously. Different Python jobs run in separate processes, so the full dataset is copied for each of them. Try reducing n_jobs to 2 or 1 to save memory; n_jobs=4 uses roughly four times the memory of n_jobs=1.
cv=20
This splits the data into 20 pieces and runs 20 train-test iterations, so each training set is 19/20 of the original data. You can quite safely reduce it to 10, though your accuracy estimate might get slightly worse. It won't save much memory, but it makes the runtime shorter.
n_estimators = 100
Reducing this will save little memory, but it will make the algorithm run faster as the random forest will contain fewer trees.
To sum up, I'd recommend reducing n_jobs to 2 to save memory (roughly a 2x increase in runtime). To compensate for the runtime, I'd suggest changing cv to 10 (roughly a 2x saving in runtime). If that does not help, change n_jobs to 1 and also reduce the number of estimators to 50 (about another 2x faster processing).
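Put together, the adjusted call might look like the sketch below. Synthetic data stands in for the asker's X_train_a / y_train, and it uses the current model_selection module rather than the old cross_validation one:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the asker's X_train_a / y_train.
X_train_a, y_train = make_classification(n_samples=5000, n_features=40, random_state=0)

# n_jobs=2 halves the number of simultaneous data copies vs n_jobs=4;
# cv=10 and n_estimators=50 mostly buy the lost runtime back.
rf = RandomForestClassifier(n_estimators=50, n_jobs=2)
scores = cross_val_score(rf, X_train_a, y_train, cv=10, scoring='roc_auc')
print("10 Fold CV Score:", np.mean(scores))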
I was dealing with a ~4 MB dataset, and a Random Forest from scikit-learn with default hyper-parameters was ~50 MB (more than 10 times the size of the data). Setting max_depth=6 reduced the memory consumption 66-fold, and the performance of the shallow Random Forest on my dataset actually improved!
I wrote this experiment up in a blog post.
From my experience, in regression tasks the memory usage can grow even more, so it is important to control the tree depth. The depth can be controlled directly with max_depth or by tuning min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_features, or max_leaf_nodes.
The memory footprint of a Random Forest can of course also be controlled with the number of trees in the ensemble.
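A quick way to see the effect described above is to compare the pickled size of a fitted forest with and without a depth limit. This is a sketch on synthetic data; the exact numbers (and the 66x figure) depend entirely on the dataset:

import pickle
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=20000, n_features=30, random_state=0)

for depth in (None, 6):
    rf = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=0).fit(X, y)
    # The pickled size is a reasonable proxy for the in-memory size of the trees.
    size_mb = len(pickle.dumps(rf)) / 1e6
    print(f"max_depth={depth}: {size_mb:.1f} MB")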

Creating a measure that combines a percentage with a low decimal number?

I'm working on a project in Tableau (which uses functions very similar to Excel's, if that helps) where I need a single measure derived from two different measurements: one is a low decimal number (2.95 on the high end, 0.00667 on the low end) and the other is a percentage (ranging from 29.8 to 100 percent).
Put another way, I have two tables detailing bus punctuality: one is for high-frequency routes and measured in Excess Waiting Time (EWT, in minutes), the other for low-frequency routes and measured as percent on time. I have a map of all the routes and want to colour the lines based on how punctual each route is (thinner lines for routes with a low EWT or a high percentage on time; thicker lines for routes with a high EWT or a low percentage on time). In preparation for this, I've combined both tables and zeroed out the non-existent values.
I thought I'd do something like log(EWT + PercentOnTime), but I'm realizing that might not give the value I want (especially because I ultimately need the inverse of one or the other, since a low EWT is favourable and a high % on time is favourable).
Any idea how I'd do this? Thanks!
If you are combining/comparing the metrics in an even manner and the data is relatively linear then all you need to do is normalise them.
Take the expected EWT range (e.g. 0.00667 to 2.95). An EWT of 2 would then be
(2 - 0.00667)/(2.95 - 0.00667) = 0.67723, but because EWT is semantically the inverse of punctuality, we use 1 - 0.67723 = 0.32277.
If you do the same for the punctuality percentage range:
E.g. 80%
(80 - 29.8)/(100 - 29.8) = 0.7151
You can compare these metrics because they are normalised (between 0 and 1; multiply by 100 to get percentages), provided you assume the underlying metrics, (1 - normalised EWT) and on-time percentage (OTP), are analogous.
Thus you can combine these into a single table. You will want to ignore all zeroed values, as those actually indicate that you have no data at that point.
In Tableau you'll have to use an IF statement, something like:
IF [OTP] > 0 THEN ([OTP] - 29.8) / (100 - 29.8) ELSE 1 - (([EWT] - 0.00667) / (2.95 - 0.00667)) END
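For reference, the same normalisation expressed outside Tableau as a small Python sketch (range constants taken from the question; it assumes a zeroed OTP means "fall back to EWT"):

def punctuality_score(ewt, otp,
                      ewt_min=0.00667, ewt_max=2.95,
                      otp_min=29.8, otp_max=100.0):
    """Return a 0-1 score where 1 is most punctual.

    otp > 0 means the route is measured by percent on time;
    otherwise fall back to Excess Waiting Time (inverted, since
    a low EWT is good).
    """
    if otp > 0:
        return (otp - otp_min) / (otp_max - otp_min)
    return 1 - (ewt - ewt_min) / (ewt_max - ewt_min)

print(punctuality_score(ewt=0.0, otp=80))   # ~0.715
print(punctuality_score(ewt=2.0, otp=0.0))  # ~0.323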
Hope this helps.
