I have a task of dividing product prices into three groups {high, avg, low}. I have tried to implement it with K-means using the sklearn package. The data is a pandas DataFrame of float64 type:
dfcl
Out[173]:
price
product_option_id
10012|0 372.15
10048|0 11.30
10049|0 12.26
10050|0 6.20
10051|0 5.90
10052|0 9.00
10053|0 11.10
10054|0 9.30
10055|0 4.20
10056|0 5.60
import pandas as pd
import sklearn.cluster

# Convert DataFrame to matrix (DataFrame.as_matrix() is deprecated/removed in newer pandas)
mat = dfcl.values
# Using sklearn
km = sklearn.cluster.KMeans(n_clusters=3)
km.fit(mat)
# Get cluster assignment labels
labels = km.labels_
# Format results as a DataFrame
results = pd.DataFrame(data=labels, columns=['cluster'], index=dfcl.index)
I got results, but they look very unbalanced between the groups:
print('Total features -', len(results))
print('Cluster 0 -',len(results.loc[results['cluster'] == 0]))
print('Cluster 1 -',len(results.loc[results['cluster'] == 1]))
print('Cluster 2 -',len(results.loc[results['cluster'] == 2]))
Total features - 5222
Cluster 0 - 4470
Cluster 1 - 733
Cluster 2 - 19
By the way, when I re-run the fit, the data sometimes swaps heavily between clusters. Is there any way to solve the problem of such unbalanced groups and keep the cluster labels stable across re-runs? I have also tried normalizing the data using preprocessing.MinMaxScaler() and it did not help.
Maybe there are some clustering algorithms that can do what I want, or some other hacks? For example, a re-run gives:
Total features - 5222
Cluster 0 - 733
Cluster 1 - 4470
Cluster 2 - 19
Probably your data distribution is already skewed. K-means minimizes squared errors; it does not care about balanced clusters.
Furthermore, k-means does not produce "low" or "high" - you need to assign such semantics yourself. You cannot assume that cluster 2 is "high".
It may be worth looking at a histogram of the data and then defining thresholds for "low" and "high" as you see fit.
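For instance, a minimal quantile-based sketch (assuming the dfcl frame from the question; the cut points are placeholders you would tune from the histogram):

import pandas as pd

# Hypothetical cut points: here the 1/3 and 2/3 quantiles, but they could just as
# well be chosen by eye from the histogram
low_cut, high_cut = dfcl['price'].quantile([1/3, 2/3])

dfcl['price_group'] = pd.cut(
    dfcl['price'],
    bins=[-float('inf'), low_cut, high_cut, float('inf')],
    labels=['low', 'avg', 'high']
)
print(dfcl['price_group'].value_counts())

Unlike k-means labels, thresholds like these give roughly balanced groups and stay stable when you re-run the computation.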
I want to study a population of 47532 individuals with 16230 features. Thus I created a matrix with 16230 rows and 47532 columns.
>>> import numpy as np
>>> import scipy.cluster.hierarchy as hcluster
>>> from scipy.spatial import distance
>>> from sklearn.cluster import AgglomerativeClustering
>>> matrix.shape
(16230, 47532)
# remove all duplicate vectors in order to not waste computation time
>>> uniq_vectors, row_index = np.unique(matrix, return_index=True, axis=0)
>>> uniq_vectors.shape
(22957, 16230)
# compute distance between each observations
>>> distance_matrix = distance.pdist(uniq_vectors, metric='jaccard')
>>> distance_matrix_2d = distance.squareform(distance_matrix, force='tomatrix')
>>> distance_matrix_2d.shape
(22957, 22957)
# Perform linkage
>>> linkage = hcluster.linkage(distance_matrix, method='complete')
So now I can use scikit-learn to perform the clustering:
>>> model = AgglomerativeClustering(n_clusters=40, affinity='precomputed', linkage='complete')
>>> cluster_label = model.fit_predict(distance_matrix_2d)
How can I predict future observations using this model?
Indeed, AgglomerativeClustering does not have a predict method, and it would take far too long to recompute the distances for 16230 x (47532 + 1).
Is it possible to compute the distance between new observations and all the pre-computed clusters?
Indeed, pdist from scipy computes the full n x n distance matrix; in my case I would like to compute the distance from one observation o against the n samples, i.e. o x n.
Thanks for your insight.
The answer is simple: you cannot. Hierarchical clustering is not designed to predict cluster labels for new observations; it just links data points according to their distances and does not define "regions" for each cluster.
I believe there are two options for you at this stage:
For new data points, find the nearest observation in your data set (using the same distance function as during training) and assign it the same cluster label; a sketch of this is shown after this list. It requires a bit more coding and is obviously a bit of a hack, and keep in mind that the results might not make a lot of sense, since you would be extrapolating cluster labels with a different methodology than the training procedure.
Use another clustering algorithm! It seems you are using hierarchical clustering while your use case does not match the model. KMeans could be a good choice, as it can explicitly assign new data points to the closest cluster.
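A rough sketch of the first option (new_observation is an illustrative name; uniq_vectors and cluster_label come from your code above), which only needs a 1 x n distance computation:

import numpy as np
from scipy.spatial.distance import cdist

def predict_cluster(new_observation, train_vectors, train_labels):
    # 1 x n distances from the single new observation to every training vector,
    # using the same Jaccard metric as during training
    dists = cdist(new_observation.reshape(1, -1), train_vectors, metric='jaccard')
    return train_labels[np.argmin(dists)]

# e.g. predicted = predict_cluster(new_observation, uniq_vectors, cluster_label)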
I am attempting to calculate some moving averages in Spark, but am running into issues with skewed partitions. Here is the simple calculation I'm trying to perform:
Getting the base data
# Imports (pyspark.sql functions and Window are used below)
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Variables
one_min = 60
one_hour = 60*one_min
one_day = 24*one_hour
seven_days = 7*one_day
thirty_days = 30*one_day

# Column variables
target_col = "target"
partition_col = "partition_col"
df_base = (
spark
.sql("SELECT * FROM {base}".format(base=base_table))
)
df_product1 = (
df_base
.where(F.col("product_id") == F.lit("1"))
.select(
F.col(target_col).astype("double").alias(target_col),
F.unix_timestamp("txn_timestamp").alias("window_time"),
"transaction_id",
partition_col
)
)
df_product1.persist()
Calculating running averages
window_lengths = {
"1day": one_day,
"7day": seven_days,
"30day": thirty_days
}
# Create window specs for each type
part_windows = {
time: Window.partitionBy(F.col(partition_col))
.orderBy(F.col("window_time").asc())
.rangeBetween(-secs, -one_min)
for (time, secs) in window_lengths.items()
}
cols = [
# Note: not using `avg` as I will be smoothing this at some point
(F.sum(target_col).over(win)/F.count("*").over(win)).alias(
"{time}_avg_target".format(time=time)
)
for time, win in part_windows.items()
]
sample_df = (
df_product1
.repartition(2000, partition_col)
.sortWithinPartitions(F.col("window_time").asc())
.select(
"*",
*cols
)
)
Now, I can collect a limited subset of these data (say just 100 rows), but if I try to run the full query and, for example, aggregate the running averages, Spark gets stuck on some particularly large partitions. The vast majority of the partitions have fewer than 1 million records in them: only about 50 of them have more than 1M records and only about 150 have more than 500K.
However, a small handful have more than 2.5M records (~10), and 3 of them have more than 5M. These partitions have run for more than 12 hours and failed to complete. The skew in these partitions is a natural part of the data, reflecting higher activity for those distinct values of the partitioning column. I have no control over the definition of the values of this partitioning column.
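For reference, per-key record counts like the ones quoted above can be obtained with a simple group-by, e.g. (a sketch reusing df_product1 and partition_col from the code above):

from pyspark.sql import functions as F

per_key_counts = (
    df_product1
    .groupBy(partition_col)
    .agg(F.count("*").alias("n_records"))
    .orderBy(F.col("n_records").desc())
)
per_key_counts.show(20, truncate=False)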
I am using a SparkSession with dynamic allocation enabled, with 32G of RAM and 4 cores per executor and a minimum of 4 executors. I have attempted to bump this up to 96G with 8 cores per executor and a minimum of 10 executors, but the job still does not complete.
This seems like a calculation that shouldn't take 13 hours to complete. The df_product1 DataFrame contains just shy of 300M records.
If there is other information that would be helpful in resolving this problem, please comment below.
Context: I have a dataset that is too large to fit in memory, on which I am training a Keras RNN. I am using PySpark on an AWS EMR cluster to train the model in batches that are small enough to fit in memory. I was not able to implement the model as distributed using elephas, and I suspect this is related to my model being stateful; I'm not entirely sure, though.
The dataframe has a row for every user and every day elapsed since the day of install, from 0 to 29. After querying the database I perform a number of operations on the dataframe:
query = """WITH max_days_elapsed AS (
SELECT user_id,
max(days_elapsed) as max_de
FROM table
GROUP BY user_id
)
SELECT table.*
FROM table
LEFT OUTER JOIN max_days_elapsed USING (user_id)
WHERE max_de = 1
AND days_elapsed < 1"""
df = read_from_db(query) #this is just a custom function to query our database

#Imports used below
import numpy as np
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import udf, lit, col
from pyspark.sql.types import IntegerType

#Create features vector column
assembler = VectorAssembler(inputCols=features_list, outputCol="features")
df_vectorized = assembler.transform(df)

#Split users into train and test and assign batch number
udf_randint = udf(lambda x: np.random.randint(0, x), IntegerType())
training_users, testing_users = df_vectorized.select("user_id").distinct().randomSplit([0.8, 0.2], 123)
training_users = training_users.withColumn("batch_number", udf_randint(lit(N_BATCHES)))
#Create and sort train and test dataframes
train = df_vectorized.join(training_users, ["user_id"], "inner").select(["user_id", "days_elapsed","batch_number","features", "kpi1", "kpi2", "kpi3"])
train = train.sort(["user_id", "days_elapsed"])
test = df_vectorized.join(testing_users, ["user_id"], "inner").select(["user_id","days_elapsed","features", "kpi1", "kpi2", "kpi3"])
test = test.sort(["user_id", "days_elapsed"])
The problem I am having is that I cannot filter on batch_number without caching train first. I can filter on any of the columns that are in the original dataset in our database, but not on any column I have generated in PySpark after querying the database:
This: train.filter(train["days_elapsed"] == 0).select("days_elapsed").distinct().show() returns only 0.
But all of these return all of the batch numbers between 0 and 9, without any filtering:
train.filter(train["batch_number"] == 0).select("batch_number").distinct().show()
train.filter(train.batch_number == 0).select("batch_number").distinct().show()
train.filter("batch_number = 0").select("batch_number").distinct().show()
train.filter(col("batch_number") == 0).select("batch_number").distinct().show()
This also does not work:
train.createOrReplaceTempView("train_table")
batch_df = spark.sql("SELECT * FROM train_table WHERE batch_number = 1")
batch_df.select("batch_number").distinct().show()
All of these work if I do train.cache() first. Is that absolutely necessary or is there a way to do this without caching?
Spark >= 2.3 (? - depending on the progress of SPARK-22629)
It should be possible to disable certain optimizations using the asNondeterministic method.
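A minimal sketch of what that might look like, reusing the udf_randint idea from the question:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
import numpy as np

# Mark the UDF as non-deterministic so the optimizer will not duplicate
# or eliminate its invocations (available from Spark 2.3)
udf_randint = udf(
    lambda x: int(np.random.randint(0, x)), IntegerType()
).asNondeterministic()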
Spark < 2.3
Don't use a UDF to generate random numbers. First of all, to quote the docs:
The user-defined functions must be deterministic. Due to optimization, duplicate invocations may be eliminated or the function may even be invoked more times than it is present in the query.
Even if it wasn't for the UDF, there are Spark subtleties which make it almost impossible to implement this correctly when processing single records.
Spark already provides rand:
Generates a random column with independent and identically distributed (i.i.d.) samples from U[0.0, 1.0].
and randn
Generates a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution.
which can be used to build more complex generator functions.
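For example, the batch number could be derived from rand directly instead of a UDF (a sketch reusing the question's training_users and N_BATCHES):

from pyspark.sql import functions as F

# floor(rand * N_BATCHES) yields an integer batch number in [0, N_BATCHES - 1],
# generated by Spark's own random expression rather than a Python UDF
training_users = training_users.withColumn(
    "batch_number",
    (F.rand(seed=123) * N_BATCHES).cast("int")
)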
Note:
There may be other issues with your code, but this one makes it broken from the start (see Random numbers generation in PySpark and pyspark. Transformer that generates a random number generates always the same number).
Hello, I am using KMeans to build a topic classifier. My idea is to take several Facebook comments from different users so that I have a collection of documents.
My list of documents looks as follows:
list=["comment1","comment2",...,"commentN"]
Then I used tf-idf to vectorize every comment and assign it to a specific cluster.
The output of my program is the following:
# tfidf_vectorizer and tf_vectorizer are assumed to be defined earlier,
# e.g. sklearn's TfidfVectorizer and CountVectorizer limited to 2000 features
tfidf = tfidf_vectorizer.fit_transform(list)
tf = tf_vectorizer.fit_transform(list)
print("size of tf", tf.shape)
print("size of tfidf", tfidf.shape)
#Creating clusters from data
kmeans = KMeans(n_clusters=8, random_state=0).fit(tf)
print("printing labels",kmeans.labels_)
#Printing the number of clusters
print("Number of clusters",set(kmeans.labels_))
print("dimensions of matrix labels",(kmeans.labels_).shape)
#Predicting new labels
y_pred = kmeans.predict(tf)
print("dimensions of predict matrix",y_pred.shape)
My output looks as follows:
size of tf (202450, 2000)
size of tfidf (202450, 2000)
printing labels [1 1 1 ..., 1 1 1]
Number of clusters {0, 1, 2, 3, 4, 5, 6, 7}
dimensions of matrix labels (202450,)
dimensions of predict matrix (202450,)
C:\Program Files\Anaconda3\lib\site-packages\sklearn\utils\validation.py:420: DataConversionWarning: Data with input dtype int64 was converted to float64.
warnings.warn(msg, DataConversionWarning)
Now the problem is that I would like to find a way to make sense of these clusters, i.e. class 0 is about sports, class 1 is about politics, and so on. I would appreciate any recommendation on how to interpret the clusters, or at least a way to get all the comments that belong to a specific cluster so I can then interpret the result. Thanks for the support.
There are multiple approaches.
The easiest approach is to look at the centroid of each cluster; it is a good summary of the words that dominate the cluster.
The second approach is to take the sub-matrix of tf-idf rows assigned to each cluster; you can then run PCA on that sub-matrix to extract factors and better understand the composition of each cluster.
Sorry, I do not use scikit-learn, so I cannot help you with code, but a sketch of the centroid approach is shown below.
Hope that helps.
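A rough sketch of the centroid approach in scikit-learn, assuming the kmeans and tf_vectorizer objects from the question (get_feature_names_out may be get_feature_names on older scikit-learn versions):

import numpy as np

terms = tf_vectorizer.get_feature_names_out()

# Top 10 terms per cluster, ranked by centroid weight
order = np.argsort(kmeans.cluster_centers_, axis=1)[:, ::-1]
for cluster_id in range(kmeans.n_clusters):
    top_terms = [terms[i] for i in order[cluster_id, :10]]
    print("Cluster", cluster_id, ":", ", ".join(top_terms))

# All comments assigned to a given cluster, e.g. cluster 0
# ("list" is the list of comments from the question)
cluster_0_comments = [doc for doc, label in zip(list, kmeans.labels_) if label == 0]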
I'm trying to take samples from two dataframes while keeping the ratio of their counts. For example:
df1.count() = 10
df2.count() = 1000
noOfSamples = 10
I want to sample the data in such a way that I get 10 samples of size 101 each (1 from df1 and 100 from df2).
Now, while doing so:
var newSample = df1.sample(true, df1.count() / noOfSamples)
println(newSample.count())
What does the fraction here imply? Can it be greater than 1? I checked this and this but wasn't able to comprehend it fully.
Also, is there any way we can specify the number of rows to be sampled?
The fraction parameter represents the approximate fraction of the dataset that will be returned. For instance, if you set it to 0.1, 10% (1/10) of the rows will be returned. For your case, I believe you want to do the following:
val newSample = df1.sample(true, 1D*noOfSamples/df1.count)
However, you may notice that newSample.count will return a different number each time you run it, because the fraction acts as a threshold for a randomly generated value (as you can see here), so the resulting dataset size can vary. A workaround can be:
val newSample = df1.sample(true, 2D*noOfSamples/df1.count).limit(df1.count/noOfSamples)
Some scalability observations
You may note that doing a df1.count might be expensive as it evaluates the whole DataFrame, and you'll lose one of the benefits of sampling in the first place.
Therefore depending on the context of your application, you may want to use an already known number of total samples, or an approximation.
val newSample = df1.sample(true, 1D*noOfSamples/knownNoOfSamples)
Or, assuming your DataFrame is huge, I would still use a fraction and use limit to force the number of samples.
val guessedFraction = 0.1
val newSample = df1.sample(true, guessedFraction).limit(noOfSamples)
As for your questions:
can it be greater than 1?
No. It represents a fraction between 0 and 1. If you set it to 1 it will bring 100% of the rows, so it wouldn't make sense to set it to a number larger than 1.
Also is there anyway we can specify the number of rows to be sampled?
You can specify a fraction that yields more rows than you need and then use limit, as I show in the second example. Maybe there is another way, but this is the approach I use.
To answer your question "is there any way we can specify the number of rows to be sampled?":
I recently needed to sample a certain number of rows from a spark data frame. I followed the process below:
Convert the spark data frame to an RDD.
Example: df_test.rdd
The RDD has a method called takeSample which allows you to specify the number of samples you need, along with a seed.
Example: df_test.rdd.takeSample(withReplacement, Number of Samples, Seed)
Convert the RDD back to a spark data frame using sqlContext.createDataFrame().
The above process combined into a single step:
The data frame (or population) I needed to sample from has around 8,000 records:
df_grp_1
test1 = sqlContext.createDataFrame(df_grp_1.rdd.takeSample(False,125,seed=115))
test1 data frame will have 125 sampled records.
To answer whether the fraction can be greater than 1: yes, it can be, if we sample with replacement. If a value greater than 1 is provided with replacement set to false, the following exception will occur:
java.lang.IllegalArgumentException: requirement failed: Upper bound (2.0) must be <= 1.0.
I too find the lack of sample-by-count functionality disturbing. If you are not picky about creating a temp view, I find the code below useful (df is your dataframe, count is the sample size):
val tableName = s"table_to_sample_${System.currentTimeMillis}"
df.createOrReplaceTempView(tableName)
val sampled = sqlContext.sql(s"select *, rand() as random from ${tableName} order by random limit ${count}")
sqlContext.dropTempTable(tableName)
sampled.drop("random")
It returns an exact count as long as your current row count is at least as large as your sample size.
The code below works if you want to do a random 70% / 30% split of a data frame df:
val Array(trainingDF, testDF) = df.randomSplit(Array(0.7, 0.3), seed = 12345)
I use this function for random sampling when an exact number of records is desired:
def row_count_sample(df, row_count, with_replacement=False, random_seed=113170):
    # Over-sample slightly, since DataFrame.sample() does not guarantee an exact
    # record count: it can return more or fewer rows than requested
    ratio = 1.08 * float(row_count) / df.count()
    if ratio > 1.0:
        ratio = 1.0
    result_df = (df
                 .sample(with_replacement, ratio, random_seed)
                 .limit(row_count)  # since we oversampled, enforce the exact row count here
                 )
    return result_df
Maybe you want to try the code below:
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))