Bug/Error with KMeans (and BisectingKMeans) clustering - apache-spark

I'm working on data where I'm required to work with clusters.
I know the Spark framework won't let me have one single cluster; the minimum number of clusters is two.
I created some dummy random data to test my program, and my program is displaying wrong results because my KMeans function is generating ONE cluster! How come? I don't understand. Is it because my data is random? I have not specified anything on my k-means. This is the part of the code that handles the K-Means:
BisectingKMeans kmeans = new BisectingKMeans();
BisectingKMeansModel model = kmeans.fit(dataset); // trains the k-means on the dataset to create a model
Vector[] clusterCenters = model.clusterCenters();
dataset.show(false);
for (Vector v : clusterCenters) {
    System.out.println(v); // prints each cluster center
}
The output is the following:
+----+----+------+
|File|Size|Volume|
+----+----+------+
|F1 |13 |1689 |
|F2 |18 |1906 |
|F3 |16 |1829 |
|F4 |14 |1726 |
|F5 |10 |1524 |
|F6 |16 |1844 |
|F7 |15 |1752 |
|F8 |12 |1610 |
|F9 |10 |1510 |
|F10 |11 |1554 |
|F11 |12 |1632 |
|F12 |13 |1663 |
|F13 |18 |1901 |
|F14 |13 |1686 |
|F15 |18 |1910 |
|F16 |19 |1986 |
|F17 |11 |1585 |
|F18 |10 |1500 |
|F19 |13 |1665 |
|F20 |13 |1664 |
+----+----+------+
only showing top 20 rows
[-1.7541523789077474E-16,2.0655699373151038E-15] //only one cluster center!!! why??
Why does this happen? What do I need to fix to solve this? Having only one cluster ruins my program.

On random data, a single cluster can in fact be the correct output of bisecting k-means.
With bisecting k-means you only specify a maximum number of clusters; the algorithm can stop early if splitting does not improve the result. In your case, splitting the data into two clusters apparently did not improve the quality, so the bisection was not accepted.
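If you still want to see more than one cluster on data like this, you can set k explicitly, or fall back to plain k-means. Here is a minimal PySpark sketch, assuming the Size and Volume columns and the dataset variable from the question (the Java API has the same setK setter):
from pyspark.ml.clustering import BisectingKMeans, KMeans
from pyspark.ml.feature import VectorAssembler

# Assemble the numeric columns into the feature vector the estimators expect.
assembler = VectorAssembler(inputCols=["Size", "Volume"], outputCol="features")
features = assembler.transform(dataset)

# For bisecting k-means, k is only an upper bound on the number of leaf clusters.
bkm = BisectingKMeans(k=2, seed=1)
print(bkm.fit(features).clusterCenters())

# Plain k-means returns k centers (given at least k distinct points),
# so it is the safer choice when you really need at least two clusters.
km = KMeans(k=2, seed=1)
print(km.fit(features).clusterCenters())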

Related

How to filter or delete the row in spark dataframe by a specific number?

I want to manipulate a Spark dataframe. For example, there is a dataframe with two columns.
+---+-----+
|key|value|
+---+-----+
|1  |Bob  |
|2  |Bob  |
|3  |Alice|
|4  |Alice|
|5  |Alice|
............
There are two kinds of names in the value column, and there are more rows with Alice than with Bob. What I want to do is delete some of the rows containing Alice so that the number of Alice rows equals the number of Bob rows. The rows should be deleted randomly, but I found no API supporting such a manipulation. What should I do to delete rows down to a specific number?
Perhaps you can use a Spark window function with row_number() and subsequent filtering, something like this:
>>> df.show(truncate=False)
+---+-----+
|key|value|
+---+-----+
|1 |Bob |
|2 |Bob |
|3 |Alice|
|4 |Alice|
|5 |Alice|
+---+-----+
>>> from pyspark.sql import Window
>>> from pyspark.sql.functions import *
>>> window = Window.orderBy("value").partitionBy("value")
>>> df2 = df.withColumn("seq",row_number().over(window))
>>> df2.show(truncate=False)
+---+-----+---+
|key|value|seq|
+---+-----+---+
|1 |Bob |1 |
|2 |Bob |2 |
|3 |Alice|1 |
|4 |Alice|2 |
|5 |Alice|3 |
+---+-----+---+
>>> N = 2
>>> df3 = df2.where("seq <= %d" % N).drop("seq")
>>> df3.show(truncate=False)
+---+-----+
|key|value|
+---+-----+
|1 |Bob |
|2 |Bob |
|3 |Alice|
|4 |Alice|
+---+-----+
Here's your pseudo code:
Count "Bob"
[repartition the data]/[group by] (partitionBy/groupBy)
[use iteration to cut off data at "Bob's" count] (mapPartitions/mapGroups)
You must remember that technically Spark does not guarantee ordering within a dataset, so adding new data can change the order of the rows. You could therefore treat the order as random and just cut at the count when you're done. This should be faster than creating a window. If you really felt compelled, you could write your own random probability function that returns a fraction of each partition (a sketch of that idea follows below).
You can also use a window for this: partitionBy("value").orderBy("value"), then use row_number() and where to filter the partitions down to Bob's count.
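A minimal sketch of that "fraction of each group" idea using the built-in sampleBy, which gives an approximate rather than exact cut; the sample data simply mirrors the question:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "Bob"), (2, "Bob"), (3, "Alice"), (4, "Alice"), (5, "Alice")],
    ["key", "value"])

# Count each name, then keep all Bob rows and a random fraction of Alice rows
# so that the expected counts match.
counts = dict(df.groupBy("value").count().collect())
fractions = {"Bob": 1.0, "Alice": counts["Bob"] / counts["Alice"]}

balanced = df.sampleBy("value", fractions, seed=42)
balanced.show()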

Understanding repartition in Pyspark

My question is a bit conceptual and related to how things work under the hood. I have written the code below related to repartition.
df_cube = spark.createDataFrame([ ("Sachin" , "M"), ("Dipti", "F") , ("Roshani", "F"), ("Tushar", "M"), ("Satyendra", "M") ], ["Name" , "Gender"] )
data = df_cube.union(df_cube).repartition("Gender")
data.show()
The above code gives me below output.
+---------+------+
| Name|Gender|
+---------+------+
| Dipti| F|
| Sachin| M|
| Roshani| F|
| Tushar| M|
|Satyendra| M|
| Dipti| F|
| Sachin| M|
| Roshani| F|
| Tushar| M|
|Satyendra| M|
+---------+------+
After that I repartition by Name and Gender both and I get below output.
df_name = data.repartition(7, "Name", "Gender")
df_name.show()
+---------+------+
| Name|Gender|
+---------+------+
| Tushar| M|
| Tushar| M|
|Satyendra| M|
|Satyendra| M|
| Sachin| M|
| Dipti| F|
| Sachin| M|
| Dipti| F|
| Roshani| F|
| Roshani| F|
+---------+------+
My main question is: how can I figure out the ordering of the rows when I call repartition on one column versus on two columns, as depicted above? How does Spark rearrange the rows? On my local machine it shows two partitions by default; is there a way to view which rows go into which partition, and after repartitioning, to view which partition holds which rows? Please answer both questions, if possible, in a verbose manner.
how can I figure out the ordering of the rows when I call out repartition
No, you can't, and in fact it should be irrelevant. Ordering is for when you query the data back, i.e. select ... from ... where ... order by; you can even make it a "view" for users. In short, ordering cannot be guaranteed because Spark distributes and parallelizes the data.
is there a way to view what rows goes into which partition and after repartitioning
Yes, you can experiment with the following snippet which will show what is inside each partition.
for i, part in enumerate(df.rdd.glom().collect()):
    print({i: part})
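Another sketch, using the built-in spark_partition_id() function to tag each row with the partition it landed in (data and df_name are the DataFrames from the question):
from pyspark.sql.functions import spark_partition_id

# Partition assignment after repartition("Gender") ...
data.withColumn("partition_id", spark_partition_id()).show()

# ... and after repartition(7, "Name", "Gender").
df_name.withColumn("partition_id", spark_partition_id()).show()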

How random number behaves in spark

What is the use of rand(INT), and what happens when we multiply that result by another number, e.g. rand(INT) * 10 or rand(INT) * 12?
What benefit do we get from multiplying rand(INT) by some number?
It seems that it produces numbers spread over the range given by the multiplier. In my case that multiplier is 19.
If we give 19, will our random numbers never go beyond 19? Is that so?
Also, it is mentioned that rand(seed) produces output in a deterministic way. What does that mean, and why do the values in my dataset actually vary?
scala> val df = hc.sql("""select 8 , rand(123456), rand(123456)*19, floor(rand(123456)*19) as random_number ,rand(1) from v-vb.omega """)
df: org.apache.spark.sql.DataFrame = [_c0: int, _c1: double, _c2: double, random_number: bigint, _c4: double]
scala> df.show(100,false)
+---+--------------------+-------------------+-------------+--------------------+
|_c0|_c1 |_c2 |random_number|_c4 |
+---+--------------------+-------------------+-------------+--------------------+
|8 |0.4953081723589211 |9.4108552748195 |9 |0.13385709732307427 |
|8 |0.8134447122366524 |15.455449532496395 |15 |0.5897562959687032 |
|8 |0.37061329883387006 |7.0416526778435315 |7 |0.01540012100242305 |
|8 |0.039605242829950704|0.7524996137690634 |0 |0.22569943461197162 |
|8 |0.6789785261613072 |12.900591997064836 |12 |0.9207602095112212 |
|8 |0.9696879210080743 |18.424070499153412 |18 |0.6222816020094926 |
|8 |0.773136636564404 |14.689596094723676 |14 |0.1029837279488438 |
|8 |0.7990411192888535 |15.181781266488215 |15 |0.6678762139023474 |
|8 |0.8089896546054375 |15.370803437503312 |15 |0.06748208566157787 |
|8 |0.0895536147884225 |1.7015186809800276 |1 |0.5215181769983375 |
|8 |0.7385237395885733 |14.031951052182894 |14 |0.8895645473999635 |
|8 |0.059503458368902584|1.130565709009149 |1 |0.321509746474296 |
|8 |0.14662556746599353 |2.785885781853877 |2 |0.28823975483307396 |
|8 |0.28066416785509374 |5.332619189246781 |5 |0.45999786693699807 |
|8 |0.5563531082651644 |10.570709057038123 |10 |0.17320175842535657 |
|8 |0.6619862377687996 |12.577738517607193 |12 |1.152006730106292E-4|
|8 |0.9090845495301373 |17.272606441072607 |17 |0.7500451351287378 |
In Spark, rand(seed) and randn(seed) are not deterministic, which is an unresolved bug. A corresponding note was added to the source code via JIRA SPARK-13380:
/**
 * Generate a random column with independent and identically distributed (i.i.d.) samples
 * from U[0.0, 1.0].
 *
 * @note The function is non-deterministic in general case.
 *
 * @group normal_funcs
 * @since 1.4.0
 */
def rand(seed: Long): Column = withExpr { Rand(seed) }
Deterministic means that the function result is always the same for the same argument (different invocations of the function with the same argument produce the same result).
In your dataset it is definitely not deterministic, because different random numbers are produced for the same seed argument. If the documentation states it should be deterministic, then that is either a bug in the documentation or a bug in the implementation.
The other question is why rand(123456)*19 never goes beyond 19. This is because rand values are >= 0 and < 1, so multiplying by 19 keeps the result below 19. This, I believe, is as per the specification.
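A small PySpark sketch that makes the scaling concrete, assuming an active SparkSession named spark:
from pyspark.sql.functions import rand, floor

df = spark.range(5)
df.select(
    rand(123456).alias("u"),                   # uniform value in [0.0, 1.0)
    (rand(123456) * 19).alias("scaled"),       # uniform value in [0.0, 19.0)
    floor(rand(123456) * 19).alias("bucket"),  # integer between 0 and 18
).show()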

Difference between describe() and summary() in Apache Spark

What is the difference between summary() and describe() ?
It seems that they both serve the same purpose. I didn't manage to find any differences (if any).
If we pass any args, these functions work for different purposes:
The .describe() function takes cols: String* (columns of the df) as optional args.
The .summary() function takes statistics: String* (count, mean, stddev, etc.) as optional args.
Example:
scala> val df_des=Seq((1,"a"),(2,"b"),(3,"c")).toDF("id","name")
scala> df_des.describe().show(false) //without args
//Result:
//+-------+---+----+
//|summary|id |name|
//+-------+---+----+
//|count |3 |3 |
//|mean |2.0|null|
//|stddev |1.0|null|
//|min |1 |a |
//|max |3 |c |
//+-------+---+----+
scala> df_des.summary().show(false) //without args
//+-------+---+----+
//|summary|id |name|
//+-------+---+----+
//|count |3 |3 |
//|mean |2.0|null|
//|stddev |1.0|null|
//|min |1 |a |
//|25% |1 |null|
//|50% |2 |null|
//|75% |3 |null|
//|max |3 |c |
//+-------+---+----+
scala> df_des.describe("id").show(false) //describe on the id column only
//+-------+---+
//|summary|id |
//+-------+---+
//|count |3 |
//|mean |2.0|
//|stddev |1.0|
//|min |1 |
//|max |3 |
//+-------+---+
scala> df_des.summary("count").show(false) //get count summary only
//+-------+---+----+
//|summary|id |name|
//+-------+---+----+
//|count |3 |3 |
//+-------+---+----+
The first operation to perform after importing data is to get some sense of what it looks like. For numerical columns, knowing the descriptive summary statistics can help a lot in understanding the distribution of your data. The function describe returns a DataFrame containing information such as number of non-null entries (count), mean, standard deviation, and minimum and maximum value for each numerical column.
https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html
Hope it helps.
Both have the same functionality; only the API syntax is different. Hope this helps.
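For completeness, the same calls exist in PySpark; a quick sketch, assuming an active SparkSession named spark:
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "name"])

df.describe("id").show()                  # count, mean, stddev, min, max of "id"
df.summary().show()                       # adds the 25%, 50%, 75% percentiles
df.summary("count", "min", "max").show()  # only the requested statistics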

Divide spark dataframe into chunks using row values as separators

In my PySpark code I have a DataFrame populated with data coming from a sensor; each row has a timestamp, an event_description and an event_value.
Each sensor event is composed of measurements defined by an id and a value. The only guarantee I have is that all the "phases" related to a single event are included between two EV_SEP rows (unsorted).
Inside each event "block" there is an event label which is the value associated to EV_CODE.
+-------------------------+------------+-------------+
| timestamp | event_id | event_value |
+-------------------------+------------+-------------+
| 2017-01-01 00:00:12.540 | EV_SEP | ----- |
+-------------------------+------------+-------------+
| 2017-01-01 00:00:14.201 | EV_2 | 10 |
+-------------------------+------------+-------------+
| 2017-01-01 00:00:13.331 | EV_1 | 11 |
+-------------------------+------------+-------------+
| 2017-01-01 00:00:15.203 | EV_CODE | ABC |
+-------------------------+------------+-------------+
| 2017-01-01 00:00:16.670 | EV_SEP | ----- |
+-------------------------+------------+-------------+
I would like to create a new column containing that label, so that I know that all the events are associated to that label:
+-------------------------+----------+-------------+------------+
| timestamp | event_id | event_value | event_code |
+-------------------------+----------+-------------+------------+
| 2017-01-01 00:00:12.540 | EV_SEP | ----- | ABC |
+-------------------------+----------+-------------+------------+
| 2017-01-01 00:00:14.201 | EV_2 | 10 | ABC |
+-------------------------+----------+-------------+------------+
| 2017-01-01 00:00:13.331 | EV_1 | 11 | ABC |
+-------------------------+----------+-------------+------------+
| 2017-01-01 00:00:15.203 | EV_CODE | ABC | ABC |
+-------------------------+----------+-------------+------------+
| 2017-01-01 00:00:16.670 | EV_SEP | ----- | ABC |
+-------------------------+----------+-------------+------------+
With pandas I can easily get the indexes of the EV_SEP rows, split the table into blocks, take the EV_CODE from each block and create an event_code column with such value.
A possible solution would be:
Sort the DataFrame by timestamp
Convert the DataFrame to an RDD and call zipWithIndex
Get the indexes of the rows containing EV_SEP
Calculate the block ranges (start_index, end_index)
Process the single "chunks" (filtering on indexes) to extract EV_CODE
Finally, create the wanted column
Is there any better way to solve this problem?
from pyspark.sql import Window
from pyspark.sql import functions as f
Sample data:
df.show()
+-----------------------+--------+-----------+
|timestamp |event_id|event_value|
+-----------------------+--------+-----------+
|2017-01-01 00:00:12.540|EV_SEP |null |
|2017-01-01 00:00:14.201|EV_2 |10 |
|2017-01-01 00:00:13.331|EV_1 |11 |
|2017-01-01 00:00:15.203|EV_CODE |ABC |
|2017-01-01 00:00:16.670|EV_SEP |null |
|2017-01-01 00:00:20.201|EV_2 |10 |
|2017-01-01 00:00:24.203|EV_CODE |DEF |
|2017-01-01 00:00:31.670|EV_SEP |null |
+-----------------------+--------+-----------+
Add index:
df_idx = df.filter(df['event_id'] == 'EV_SEP') \
    .withColumn('idx', f.row_number().over(Window.partitionBy().orderBy(df['timestamp'])))
df_block = df.filter(df['event_id'] != 'EV_SEP').withColumn('idx', f.lit(0))
'Spread' index:
df = df_idx.union(df_block).withColumn('idx', f.max('idx').over(
    Window.partitionBy().orderBy('timestamp').rowsBetween(Window.unboundedPreceding, Window.currentRow))).cache()
Add EV_CODE:
df_code = df.filter(df['event_id'] == 'EV_CODE').withColumnRenamed('event_value', 'event_code')
df = df.join(df_code, on=[df['idx'] == df_code['idx']]) \
    .select(df['timestamp'], df['event_id'], df['event_value'], df_code['event_code'])
Finally:
+-----------------------+--------+-----------+----------+
|timestamp |event_id|event_value|event_code|
+-----------------------+--------+-----------+----------+
|2017-01-01 00:00:12.540|EV_SEP |null |ABC |
|2017-01-01 00:00:13.331|EV_1 |11 |ABC |
|2017-01-01 00:00:14.201|EV_2 |10 |ABC |
|2017-01-01 00:00:15.203|EV_CODE |ABC |ABC |
|2017-01-01 00:00:16.670|EV_SEP |null |DEF |
|2017-01-01 00:00:20.201|EV_2 |10 |DEF |
|2017-01-01 00:00:24.203|EV_CODE |DEF |DEF |
+-----------------------+--------+-----------+----------+
Creating a new Hadoop InputFormat would be a more computationally efficient way to accomplish your goal here (although it is arguably the same amount of, or more, gymnastics in terms of code). You can specify alternative Hadoop input formats with sc.hadoopFile in the Python API, but you must take care of the conversion from the Java format to Python, and you can then specify the format. The converters available in PySpark are relatively few, but this reference proposes using the Avro converter as an example. You might also simply find it convenient to let your custom Hadoop input format output plain text, which you then parse further in Python, to avoid implementing a converter at all.
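For concreteness, the PySpark side of that wiring might look roughly like the sketch below; the input format class named here is hypothetical, and it is exactly the piece described in the next paragraph that you would still have to write in Java or Scala:
# com.example.EventBlockInputFormat is a hypothetical custom input format;
# it must be implemented separately and placed on the Spark classpath.
rdd = sc.hadoopFile(
    "hdfs:///data/sensor_events",
    inputFormatClass="com.example.EventBlockInputFormat",
    keyClass="org.apache.hadoop.io.LongWritable",
    valueClass="org.apache.hadoop.io.Text",
)

# Each value is one EV_SEP-delimited block of text emitted by the format,
# which can then be parsed further in Python.
blocks = rdd.values().map(lambda block: block.splitlines())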
Once you have that in place, you would create a special input format (in Java or Scala, using the Hadoop APIs) that treats the special sequences of rows containing EV_SEP as record delimiters instead of the newline character. You could do this quite simply by collecting rows as they are read into an accumulator (a simple ArrayList would do as a proof of concept) and then emitting the accumulated list of records when you find two EV_SEP rows in a row.
I would point out that using TextInputFormat as a basis for such a design might be tempting, but that the input format will split such files arbitrarily at newline characters and you will need to implement custom logic to properly support splitting the files. Alternatively, you can avoid the problem by simply not implementing file splitting. This is a simple modification to the partitioner.
If you do need to split files, the basic idea is:
Pick a split offset by evenly dividing the file into parts
Seek to the offset
Seek back character-by-character from the offset to where the delimiter sequence is found (in this case, two rows in a row with type EV_SEP).
Detecting these sequences for the edge case around file splitting would be a challenge. I would suggest establishing the largest byte-width of rows and reading sliding-window chunks of an appropriate width (basically 2x the size of the rows) backwards from your starting point, then matching against those windows using a precompiled Java regex and Matcher. This is similar to how Sequence Files find their sync marks, but uses a regex to detect the sequence instead of strict equality.
As a side note, given some of the other background you mention here, I would be concerned that sorting the DataFrame by timestamp could alter the contents of events that happen in the same time period across different files.

Resources