What is the use of rand(INT), and what happens when we multiply its result by another number, e.g. rand(INT) * 10 or rand(INT) * 12?
What benefit do we get from multiplying rand(INT) by some number?
It seems to produce numbers distributed according to the number we multiply by. In my case that number is 19.
If we use 19, do all our random numbers then never go beyond 19? Is that so?
Also, it is mentioned that rand(seed) produces output in a deterministic way. What does that mean, and why do the values in my dataset actually vary?
scala> val df = hc.sql("""select 8 , rand(123456), rand(123456)*19, floor(rand(123456)*19) as random_number ,rand(1) from v-vb.omega """)
df: org.apache.spark.sql.DataFrame = [_c0: int, _c1: double, _c2: double, random_number: bigint, _c4: double]
scala> df.show(100,false)
+---+--------------------+-------------------+-------------+--------------------+
|_c0|_c1 |_c2 |random_number|_c4 |
+---+--------------------+-------------------+-------------+--------------------+
|8 |0.4953081723589211 |9.4108552748195 |9 |0.13385709732307427 |
|8 |0.8134447122366524 |15.455449532496395 |15 |0.5897562959687032 |
|8 |0.37061329883387006 |7.0416526778435315 |7 |0.01540012100242305 |
|8 |0.039605242829950704|0.7524996137690634 |0 |0.22569943461197162 |
|8 |0.6789785261613072 |12.900591997064836 |12 |0.9207602095112212 |
|8 |0.9696879210080743 |18.424070499153412 |18 |0.6222816020094926 |
|8 |0.773136636564404 |14.689596094723676 |14 |0.1029837279488438 |
|8 |0.7990411192888535 |15.181781266488215 |15 |0.6678762139023474 |
|8 |0.8089896546054375 |15.370803437503312 |15 |0.06748208566157787 |
|8 |0.0895536147884225 |1.7015186809800276 |1 |0.5215181769983375 |
|8 |0.7385237395885733 |14.031951052182894 |14 |0.8895645473999635 |
|8 |0.059503458368902584|1.130565709009149 |1 |0.321509746474296 |
|8 |0.14662556746599353 |2.785885781853877 |2 |0.28823975483307396 |
|8 |0.28066416785509374 |5.332619189246781 |5 |0.45999786693699807 |
|8 |0.5563531082651644 |10.570709057038123 |10 |0.17320175842535657 |
|8 |0.6619862377687996 |12.577738517607193 |12 |1.152006730106292E-4|
|8 |0.9090845495301373 |17.272606441072607 |17 |0.7500451351287378 |
In Spark, rand(seed) and randn(seed) are not deterministic, which is an unresolved bug. A corresponding note was added to the source code via JIRA SPARK-13380:
/**
* Generate a random column with independent and identically distributed (i.i.d.) samples
* from U[0.0, 1.0].
*
* @note The function is non-deterministic in general case.
*
* @group normal_funcs
* @since 1.4.0
*/
def rand(seed: Long): Column = withExpr { Rand(seed) }
Deterministic means that the function's result is always the same for the same argument (different invocations of the function with the same argument produce the same result).
In your dataset it is definitely not deterministic, because different random numbers are produced with the same seed argument. If the documentation states it should be deterministic, then it is either a bug in the documentation or a bug in the function's implementation.
Another question is why rand(123456)*19 never goes beyond 19. This is because rand values are greater than or equal to 0 and less than 1, so multiplying them by 19 keeps the result below 19. This I believe is as per the specification.
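To see the scaling concretely, here is a minimal PySpark sketch (not from the original post; the column name is made up) reproducing the floor(rand(seed) * 19) pattern from the query above:

from pyspark.sql import SparkSession
from pyspark.sql.functions import rand, floor

spark = SparkSession.builder.getOrCreate()

# rand(seed) yields doubles in [0.0, 1.0); multiplying by 19 scales them to [0.0, 19.0),
# and floor() truncates them to the integers 0..18
df = spark.range(5).withColumn("random_number", floor(rand(123456) * 19))
df.show()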
This is what the dataframe looks like:
+---+-----------------------------------------+-----+
|eco|eco_name |count|
+---+-----------------------------------------+-----+
|B63|Sicilian, Richter-Rauzer Attack |5 |
|D86|Grunfeld, Exchange |3 |
|C99|Ruy Lopez, Closed, Chigorin, 12...cd |5 |
|A44|Old Benoni Defense |3 |
|C46|Three Knights |1 |
|C08|French, Tarrasch, Open, 4.ed ed |13 |
|E59|Nimzo-Indian, 4.e3, Main line |2 |
|A20|English |2 |
|B20|Sicilian |4 |
|B37|Sicilian, Accelerated Fianchetto |2 |
|A33|English, Symmetrical |8 |
|C77|Ruy Lopez |8 |
|B43|Sicilian, Kan, 5.Nc3 |10 |
|A04|Reti Opening |6 |
|A59|Benko Gambit |1 |
|A54|Old Indian, Ukrainian Variation, 4.Nf3 |3 |
|D30|Queen's Gambit Declined |19 |
|C01|French, Exchange |3 |
|D75|Neo-Grunfeld, 6.cd Nxd5, 7.O-O c5, 8.dxc5|1 |
|E74|King's Indian, Averbakh, 6...c5 |2 |
+---+-----------------------------------------+-----+
Schema:
root
|-- eco: string (nullable = true)
|-- eco_name: string (nullable = true)
|-- count: long (nullable = false)
I want to filter it so that only the two rows with the minimum and maximum counts remain.
The output dataframe should look something like:
+---+-----------------------------------------+--------------------+
|eco|eco_name |number_of_occurences|
+---+-----------------------------------------+--------------------+
|D30|Queen's Gambit Declined |19 |
|C46|Three Knights |1 |
+---+-----------------------------------------+--------------------+
I'm a beginner, I'm really sorry if this is a stupid question.
No need to apologize, since this is the place to learn! One solution is to use a Window with rank to find the min/max rows:
from pyspark.sql import Window
from pyspark.sql import functions as func

df = spark.createDataFrame(
    [('a', 1), ('b', 1), ('c', 2), ('d', 3)],
    schema=['col1', 'col2']
)
df.show(10, False)
+----+----+
|col1|col2|
+----+----+
|a |1 |
|b |1 |
|c |2 |
|d |3 |
+----+----+
Just use filtering to find the min/max count row after the ranking:
df\
.withColumn('min_row', func.rank().over(Window.orderBy(func.asc('col2'))))\
.withColumn('max_row', func.rank().over(Window.orderBy(func.desc('col2'))))\
.filter((func.col('min_row') == 1) | (func.col('max_row') == 1))\
.show(100, False)
+----+----+-------+-------+
|col1|col2|min_row|max_row|
+----+----+-------+-------+
|d |3 |4 |1 |
|a |1 |1 |3 |
|b |1 |1 |3 |
+----+----+-------+-------+
Please note that if several rows tie for the min/max count, rank gives them all rank 1, so they will all be kept (as rows a and b are above).
You can use the row_number function twice to order records by count, ascending and descending.
SELECT eco, eco_name, count
FROM (SELECT *,
row_number() over (order by count asc) as rna,
row_number() over (order by count desc) as rnd
FROM df)
WHERE rna = 1 or rnd = 1;
Note there's a tie for count = 1. If you care about it, add a secondary sort to control which record is selected, or use rank instead to select all tied rows.
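The same logic can also be written with the DataFrame API. This is only a sketch, assuming the question's df with its eco, eco_name and count columns:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

ranked = (df
    .withColumn("rna", F.row_number().over(Window.orderBy(F.asc("count"))))
    .withColumn("rnd", F.row_number().over(Window.orderBy(F.desc("count")))))

# keep the first row of each ordering, i.e. the min-count and max-count records
ranked.filter((F.col("rna") == 1) | (F.col("rnd") == 1)).select("eco", "eco_name", "count").show(truncate=False)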
I want to manipulate a Spark dataframe. For example, there is a dataframe with two columns.
+--------------------+--------------------+
| key| value|
+--------------------+--------------------+
|1 |Bob |
|2 |Bob |
|3 |Alice |
|4 |Alice |
|5 |Alice |
............
There are two kinds of names in the value column, and there are more Alice rows than Bob rows. What I want to do is delete some of the rows containing Alice so that the number of Alice rows matches the number of Bob rows. The rows should be deleted randomly, but I found no API supporting such a manipulation. What should I do to delete rows down to a specific number?
Perhaps you can use a Spark window function with row_number and subsequent filtering, something like this:
>>> df.show(truncate=False)
+---+-----+
|key|value|
+---+-----+
|1 |Bob |
|2 |Bob |
|3 |Alice|
|4 |Alice|
|5 |Alice|
+---+-----+
>>> from pyspark.sql import Window
>>> from pyspark.sql.functions import *
>>> window = Window.orderBy("value").partitionBy("value")
>>> df2 = df.withColumn("seq",row_number().over(window))
>>> df2.show(truncate=False)
+---+-----+---+
|key|value|seq|
+---+-----+---+
|1 |Bob |1 |
|2 |Bob |2 |
|3 |Alice|1 |
|4 |Alice|2 |
|5 |Alice|3 |
+---+-----+---+
>>> N = 2
>>> df3 = df2.where("seq <= %d" % N).drop("seq")
>>> df3.show(truncate=False)
+---+-----+
|key|value|
+---+-----+
|1 |Bob |
|2 |Bob |
|3 |Alice|
|4 |Alice|
+---+-----+
>>>
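If the cutoff should come from the data itself rather than being hardcoded, N can be derived first. A small sketch, assuming the same df and df2 as above:

>>> from pyspark.sql.functions import col
>>> N = df.filter(col("value") == "Bob").count()   # use the number of Bob rows as the cutoff
>>> df3 = df2.where(col("seq") <= N).drop("seq")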
Here's your pseudocode:
Count "BOB"
[repartition the data]/[group by] (partitionBy/groupBy)
[use iteration to cut off data at "BOB's" count] (mapPartitions/mapGroups)
You must remember that technically Spark does not guarantee ordering of a dataset, so adding new data can randomly change the order of the data. So you could consider this random and just cut at the count when you're done. This should be faster than creating a window. If you really felt compelled, you could create your own random probability function to return a fraction of each partition (see the sketch after this answer).
You can also use a window for this, partitionBy("value").orderBy("value"), and use row_number and where to filter the partitions down to "BOB's" count.
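As a rough alternative to the window approach, stratified sampling with sampleBy can approximately equalize the two groups. A sketch, assuming the question's df with only 'Alice' and 'Bob' in the value column:

# count each name once, then keep all Bob rows and roughly a matching share of Alice rows
counts = {row["value"]: row["count"] for row in df.groupBy("value").count().collect()}
fraction = counts["Bob"] / counts["Alice"]

# sampleBy is approximate, so the groups end up roughly (not exactly) equal in size
balanced = df.sampleBy("value", fractions={"Alice": fraction, "Bob": 1.0}, seed=42)
balanced.show(truncate=False)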
What is the difference between summary() and describe() ?
It seems that they both serve the same purpose. I didn't manage to find any differences (if any).
If we pass any args, then these functions work for different purposes:
The .describe() function takes cols: String* (columns in the df) as optional args.
The .summary() function takes statistics: String* (count, mean, stddev, etc.) as optional args.
Example:
scala> val df_des=Seq((1,"a"),(2,"b"),(3,"c")).toDF("id","name")
scala> df_des.describe().show(false) //without args
//Result:
//+-------+---+----+
//|summary|id |name|
//+-------+---+----+
//|count |3 |3 |
//|mean |2.0|null|
//|stddev |1.0|null|
//|min |1 |a |
//|max |3 |c |
//+-------+---+----+
scala> df_des.summary().show(false) //without args
//+-------+---+----+
//|summary|id |name|
//+-------+---+----+
//|count |3 |3 |
//|mean |2.0|null|
//|stddev |1.0|null|
//|min |1 |a |
//|25% |1 |null|
//|50% |2 |null|
//|75% |3 |null|
//|max |3 |c |
//+-------+---+----+
scala> df_des.describe("id").show(false) //describe on id column only
//+-------+---+
//|summary|id |
//+-------+---+
//|count |3 |
//|mean |2.0|
//|stddev |1.0|
//|min |1 |
//|max |3 |
//+-------+---+
scala> df_des.summary("count").show(false) //get count summary only
//+-------+---+----+
//|summary|id |name|
//+-------+---+----+
//|count |3 |3 |
//+-------+---+----+
The first operation to perform after importing data is to get some sense of what it looks like. For numerical columns, knowing the descriptive summary statistics can help a lot in understanding the distribution of your data. The function describe returns a DataFrame containing information such as number of non-null entries (count), mean, standard deviation, and minimum and maximum value for each numerical column.
https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html
Hope it helps.
Both have the same basic functionality, but the API syntax is just different. Hope this helps.
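The same distinction applies in PySpark, where describe() takes column names and summary() takes statistic names. A minimal sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "name"])

df.describe("id").show()                  # restrict describe() to the chosen columns
df.summary("count", "min", "max").show()  # restrict summary() to the chosen statistics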
I'm working on data where I'm required to work with clusters.
I know the Spark framework won't let me have one single cluster; the minimum number of clusters is two.
I created some dummy random data to test my program, and my program is displaying wrong results because my KMeans function is generating ONE cluster! How come? I don't understand. Is it because my data is random? I have not specified anything for my k-means. This is the part of the code that handles the K-Means:
BisectingKMeans kmeans = new BisectingKMeans();
BisectingKMeansModel model = kmeans.fit(dataset); // trains the k-means on the dataset to create a model
Vector[] clusterCenters = model.clusterCenters();
dataset.show(false);
for (Vector v : clusterCenters) {
    System.out.println(v);
}
The output is the following:
+----+----+------+
|File|Size|Volume|
+----+----+------+
|F1 |13 |1689 |
|F2 |18 |1906 |
|F3 |16 |1829 |
|F4 |14 |1726 |
|F5 |10 |1524 |
|F6 |16 |1844 |
|F7 |15 |1752 |
|F8 |12 |1610 |
|F9 |10 |1510 |
|F10 |11 |1554 |
|F11 |12 |1632 |
|F12 |13 |1663 |
|F13 |18 |1901 |
|F14 |13 |1686 |
|F15 |18 |1910 |
|F16 |19 |1986 |
|F17 |11 |1585 |
|F18 |10 |1500 |
|F19 |13 |1665 |
|F20 |13 |1664 |
+----+----+------+
only showing top 20 rows
[-1.7541523789077474E-16,2.0655699373151038E-15] //only one cluster center!!! why??
Why does this happen? What do I need to fix to solve this? Having only one cluster ruins my program
On random data, the correct output of bisecting k-means is often a single cluster only.
With bisecting k-means you only give a maximum number of clusters. It can stop early if the results do not improve. In your case, splitting the data into two clusters apparently did not improve the quality, so the bisection was not accepted.
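For reference, a minimal PySpark sketch (the rows and column names are made up to mirror the question) that requests at most two clusters and prints whatever centers the model actually produces:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import BisectingKMeans

spark = SparkSession.builder.getOrCreate()

# toy rows shaped like the question's File/Size/Volume frame
df = spark.createDataFrame(
    [("F1", 13, 1689), ("F2", 18, 1906), ("F3", 16, 1829), ("F5", 10, 1524)],
    ["File", "Size", "Volume"])

# Spark ML expects the inputs assembled into a single 'features' vector column
features = VectorAssembler(inputCols=["Size", "Volume"], outputCol="features").transform(df)

# k is only an upper bound; bisecting k-means may stop early and return fewer clusters
model = BisectingKMeans(k=2, seed=1).fit(features)
for center in model.clusterCenters():
    print(center)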
I have a PySpark dataframe which has a column containing strings. I want to split this column into words.
Code:
>>> sentenceData = sqlContext.read.load('file://sample1.csv', format='com.databricks.spark.csv', header='true', inferSchema='true')
>>> sentenceData.show(truncate=False)
+---+---------------------------+
|key|desc |
+---+---------------------------+
|1 |Virat is good batsman |
|2 |sachin was good |
|3 |but modi sucks big big time|
|4 |I love the formulas |
+---+---------------------------+
Expected Output
---------------
>>> sentenceData.show(truncate=False)
+---+-------------------------------------+
|key|desc |
+---+-------------------------------------+
|1 |[Virat,is,good,batsman] |
|2 |[sachin,was,good] |
|3 |.... |
|4 |... |
+---+-------------------------------------+
How can I achieve this?
Use the split function:
from pyspark.sql.functions import split

sentenceData = sentenceData.withColumn("desc", split("desc", r"\s+"))