+----+----------+-----+
|type|date      |count|
+----+----------+-----+
|Typ1|2022-02-14|2    |
|Typ1|2022-02-12|0    |
|Typ2|2022-02-10|1    |
|Typ2|2022-02-01|1    |
|Typ2|2022-01-20|1    |
|Typ2|2022-01-15|1    |
|Typ2|2022-01-05|1    |
+----+----------+-----+
This table is appended to whenever a new row arrives. I now need to maintain a second table that stores the counts for the last 2, 5, 10, 30, and 180 days. For reference, this might be the new table's structure:
+-------+-----+
|Days   |count|
+-------+-----+
|Last2  |2    |
|Last5  |3    |
|Last10 |3    |
|Last30 |6    |
|Last180|7    |
+-------+-----+
I tried doing a groupBy and count, but running this every day is expensive in processing time because Table-1 may contain millions of rows, and re-running the groupBy every time does not feel like a good solution to me.
What could be the solution to this problem?
If you intend to keep only the 5 keys mentioned in the example, you should populate/update the new table from your application at the same time a new record is inserted into the first table. A better solution could be to use Spark with the spark-cassandra-connector to read the data from the first table and compute the aggregates you want in memory at runtime.
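For the Spark route, all five windows can be computed in a single pass with conditional aggregation, rather than one groupBy per window. Below is a minimal sketch, assuming the first table is available as a DataFrame named events with columns date (a date type) and count; the names are illustrative, and pivoting the resulting columns into the row layout shown above is straightforward.
import org.apache.spark.sql.functions._

val events = spark.table("events") // hypothetical registered name of the first table

// One conditional sum per window, all evaluated in a single scan of the table.
val windows = Seq(2, 5, 10, 30, 180)
val aggs = windows.map { d =>
  sum(when(col("date") >= date_sub(current_date(), d), col("count")).otherwise(0)).as(s"Last$d")
}
val rolling = events.agg(aggs.head, aggs.tail: _*)
rolling.show(false)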
Recently I was asked in an interview about the algorithm behind Spark's df.show() function.
How will Spark decide from which executor(s) the records will be fetched?
Without undermining @thebluephantom's and @Hristo Iliev's answers (each gives some insight into what's happening under the hood), I also wanted to add my answer to this list.
I came to the same conclusion(s), albeit by observing the behaviour of the underlying partitions.
Partitions have an index associated with them. This is seen in the code below.
(Taken from the original Spark source code.)
trait Partition extends Serializable {
  def index: Int
  // ...
}
So amongst partitions, there is an order.
And as already mentioned in other answers, df.show() is the same as df.show(20), i.e. the top 20 rows. So the underlying partition indexes determine which partition (and hence executor) those rows come from.
The partition indexes are assigned at the time of read, or (re-assigned) during a shuffle.
Here is some code to see this behaviour:
val df = Seq((5,5), (6,6), (7,7), (8,8), (1,1), (2,2), (3,3), (4,4)).toDF("col1", "col2")
// above sequence is defined out of order - to make behaviour visible
// see partition structure
df.rdd.glom().collect()
/* Array(Array([5,5]), Array([6,6]), Array([7,7]), Array([8,8]), Array([1,1]), Array([2,2]), Array([3,3]), Array([4,4])) */
df.show(4, false)
/*
+----+----+
|col1|col2|
+----+----+
|5 |5 |
|6 |6 |
|7 |7 |
|8 |8 |
+----+----+
only showing top 4 rows
*/
In the above code, we see 8 partitions (each inner Array is a partition). This is because Spark defaulted to 8 partitions when the dataframe was created on this machine (in local mode the default parallelism is the number of available cores).
Now let's repartition the dataframe.
// Now let's repartition df
val df2 = df.repartition(2)
// let's see the partition structure
df2.rdd.glom().collect()
/* Array(Array([5,5], [6,6], [7,7], [8,8], [1,1], [2,2], [3,3], [4,4]), Array()) */
// let's see the output
df2.show(4, false)
/*
+----+----+
|col1|col2|
+----+----+
|5 |5 |
|6 |6 |
|7 |7 |
|8 |8 |
+----+----+
only showing top 4 rows
*/
In the above code, the top 4 rows came from the first partition (which actually holds all elements of the original data). Also note the skew in partition sizes, since no partitioning column was specified.
Now let's try to create 3 partitions:
val df3 = df.repartition(3)
// let's see the partition structure
df3.rdd.glom().collect()
/*
Array(Array([8,8], [1,1], [2,2]), Array([5,5], [6,6]), Array([7,7], [3,3], [4,4]))
*/
// and let's see the top 4 rows this time
df3.show(4, false)
/*
+----+----+
|col1|col2|
+----+----+
|8 |8 |
|1 |1 |
|2 |2 |
|5 |5 |
+----+----+
only showing top 4 rows
*/
From the above code, we observe that Spark went to the first partition and tried to get 4 rows. Since only 3 were available, it grabbed those, then moved on to the next partition to get one more row. Thus, the order you see from show(4, false) is due to the underlying data partitioning and the index ordering amongst partitions.
This example uses show(4), but this behaviour can be extended to show() or show(20).
It's simple.
In Spark 2+, show() calls showString() to format the data as a string and then prints it. showString() calls getRows() to get the top rows of the dataset as a collection of strings. getRows() calls take() to fetch the top rows and then transforms them into strings. take() simply wraps head(), and head() calls limit() to build a limit query and executes it. limit() adds a Limit(n) node on top of the logical plan, which is really a GlobalLimit(n, LocalLimit(n)). Both GlobalLimit and LocalLimit are subclasses of OrderPreservingUnaryNode that override its maxRows (in GlobalLimit) or maxRowsPerPartition (in LocalLimit) methods. The logical plan now looks like:
GlobalLimit n
+- LocalLimit n
+- ...
This goes through analysis and optimisation by Catalyst, where limits are removed if something further down the tree produces fewer rows than the limit, and it ends up as CollectLimitExec(m) (where m <= n) in the execution strategy, so the physical plan looks like:
CollectLimit m
+- ...
CollectLimitExec executes its child plan, then checks how many partitions the RDD has. If none, it returns an empty dataset. If one, it runs mapPartitionsInternal(_.take(m)) to take the first m elements. If more than one, it applies take(m) on each partition in the RDD using mapPartitionsInternal(_.take(m)), builds a shuffle RDD that collects the results in a single partition, then again applies take(m).
In other words, it depends (because optimisation phase), but in the general case it takes the top rows of the concatenation of the top rows of each partition and so involves all executors holding a part of the dataset.
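You can observe the CollectLimit node yourself by printing the physical plan. A minimal sketch (the dataframe is made up, and the exact plan text varies by Spark version):
val df = spark.range(100).toDF("id")
df.limit(20).explain()
// expected to print something like:
// == Physical Plan ==
// CollectLimit 20
// +- ...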
OK, perhaps not so simple.
A poor question, as it's not what you would use in production.
It is a smart action that looks at what you have in terms of transformations.
show() is in fact show(20). If you call plain show, it looks at the first and then consecutive partitions to get 20 rows. An order by is also optimized. A count, by contrast, does need complete processing.
Many Google posts on this, by the way.
What is the use of rand(INT), and what happens when we multiply that result by another number, as in rand(INT) * 10 or rand(INT) * 12?
What benefit do we get from multiplying rand(INT) by some number?
It seems that it produces numbers distributed according to the multiplier we use; in my case it is 19.
If we use 19, will our random numbers never go beyond 19? Is that so?
Also, it is mentioned that rand(seed) produces output in a deterministic way. What does that mean, and why do the values in my dataset actually vary?
scala> val df = hc.sql("""select 8 , rand(123456), rand(123456)*19, floor(rand(123456)*19) as random_number ,rand(1) from v-vb.omega """)
df: org.apache.spark.sql.DataFrame = [_c0: int, _c1: double, _c2: double, random_number: bigint, _c4: double]
scala> df.show(100,false)
+---+--------------------+-------------------+-------------+--------------------+
|_c0|_c1 |_c2 |random_number|_c4 |
+---+--------------------+-------------------+-------------+--------------------+
|8 |0.4953081723589211 |9.4108552748195 |9 |0.13385709732307427 |
|8 |0.8134447122366524 |15.455449532496395 |15 |0.5897562959687032 |
|8 |0.37061329883387006 |7.0416526778435315 |7 |0.01540012100242305 |
|8 |0.039605242829950704|0.7524996137690634 |0 |0.22569943461197162 |
|8 |0.6789785261613072 |12.900591997064836 |12 |0.9207602095112212 |
|8 |0.9696879210080743 |18.424070499153412 |18 |0.6222816020094926 |
|8 |0.773136636564404 |14.689596094723676 |14 |0.1029837279488438 |
|8 |0.7990411192888535 |15.181781266488215 |15 |0.6678762139023474 |
|8 |0.8089896546054375 |15.370803437503312 |15 |0.06748208566157787 |
|8 |0.0895536147884225 |1.7015186809800276 |1 |0.5215181769983375 |
|8 |0.7385237395885733 |14.031951052182894 |14 |0.8895645473999635 |
|8 |0.059503458368902584|1.130565709009149 |1 |0.321509746474296 |
|8 |0.14662556746599353 |2.785885781853877 |2 |0.28823975483307396 |
|8 |0.28066416785509374 |5.332619189246781 |5 |0.45999786693699807 |
|8 |0.5563531082651644 |10.570709057038123 |10 |0.17320175842535657 |
|8 |0.6619862377687996 |12.577738517607193 |12 |1.152006730106292E-4|
|8 |0.9090845495301373 |17.272606441072607 |17 |0.7500451351287378 |
In Spark, rand(seed) and randn(seed) are not deterministic, which is an unresolved bug. A corresponding note was added to the source code via JIRA SPARK-13380:
/**
 * Generate a random column with independent and identically distributed (i.i.d.) samples
 * from U[0.0, 1.0].
 *
 * @note The function is non-deterministic in general case.
 *
 * @group normal_funcs
 * @since 1.4.0
 */
def rand(seed: Long): Column = withExpr { Rand(seed) }
Deterministic means that the function result is always the same for the same argument (different invocations of the function with the same argument produce the same result).
In your dataset it is definitely not deterministic, because different random numbers are produced with the same seed argument. If the documentation states it should be deterministic, then it is a bug either in the documentation or in the function's implementation.
The other question is why rand(123456) * 19 never goes beyond 19. This is because rand values lie in [0.0, 1.0), so multiplying by 19 gives values below 19, and floor(rand(seed) * 19) yields integers from 0 to 18. This I believe is as per the specification.
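To illustrate the scaling, here is a minimal sketch (the column names are illustrative, and each rand(...) call below is an independent expression, so the three columns are not derived from one another):
import org.apache.spark.sql.functions._

val df = spark.range(5).select(
  rand(123456).as("r"),                 // uniform values in [0.0, 1.0)
  (rand(123456) * 19).as("scaled"),     // uniform values in [0.0, 19.0)
  floor(rand(123456) * 19).as("bucket") // integers from 0 to 18
)
df.show(false)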
I'm working on a program that flags data as OutOfRange based on the values present in certain columns.
I have three columns: Age, Height, and Weight. I want to create a fourth column called OutOfRange and assign it a value of 0 (false) or 1 (true) if the values in those three columns exceed a specific threshold.
If Age is lower than 18 or higher than 60, that row will be assigned a value of 1 (0 otherwise). If Height is lower than 5, that row will be assigned a value of 1 (0 otherwise), and so on.
Is it possible to create a column and then add/overwrite values to that column? It would be awesome if I could do that with Spark. I know SQL, so if there is anything I can do with the dataset.SQL() function, please let me know.
Given a dataframe as
+---+------+------+
|Age|Height|Weight|
+---+------+------+
|20 |3 |70 |
|17 |6 |80 |
|30 |5 |60 |
|61 |7 |90 |
+---+------+------+
You can apply the when function to implement the logic explained in the question, as
import org.apache.spark.sql.functions._
df.withColumn("OutOfRange", when(col("Age") < 18 || col("Age") > 60 || col("Height") < 5, 1).otherwise(0))
which would result in the following dataframe:
+---+------+------+----------+
|Age|Height|Weight|OutOfRange|
+---+------+------+----------+
|20 |3 |70 |1 |
|17 |6 |80 |1 |
|30 |5 |60 |0 |
|61 |7 |90 |1 |
+---+------+------+----------+
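Since the question mentions SQL, the same logic can also be expressed through Spark SQL. A minimal sketch (the view name is illustrative):
df.createOrReplaceTempView("people")
val result = spark.sql("""
  SELECT Age, Height, Weight,
         CASE WHEN Age < 18 OR Age > 60 OR Height < 5 THEN 1 ELSE 0 END AS OutOfRange
  FROM people
""")
result.show(false)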
Is it possible to create a column and then add/overwrite values to that column? It would be awesome if I could do that with Spark. I know SQL, so if there is anything I can do with the dataset.SQL() function, please let me know.
This is not possible without recreating the Dataset altogether, since Datasets are inherently immutable.
However, you can save the Dataset as a Hive table, which will allow you to do what you want. Saving the Dataset as a Hive table writes the contents of your Dataset to disk under the default spark-warehouse directory.
df.write.mode("overwrite").saveAsTable("my_table")
// Add a row
spark.sql("insert into my_table (Age, Height, Weight, OutOfRange) values (20, 30, 70, 1)")
// Update a row
spark.sql("update my_table set OutOfRange = 1 where Age > 30")
// ...
Hive support must be enabled on the SparkSession at the time of instantiation in order to do this.
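For reference, a minimal sketch of enabling Hive support when the session is built (the app name is illustrative):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("my-app")     // illustrative
  .enableHiveSupport()   // needed for saveAsTable and SQL statements against Hive tables
  .getOrCreate()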
I'm working on data where I'm required to work with clusters.
I know the Spark framework won't let me have one single cluster; the minimum number of clusters is two.
I created some dummy random data to test my program, and it is displaying wrong results because my KMeans function is generating ONE cluster! Why? I don't understand. Is it because my data is random? I have not specified anything for my k-means. This is the part of the code that handles the K-Means:
BisectingKMeans kmeans = new BisectingKMeans();
BisectingKMeansModel model = kmeans.fit(dataset); // trains bisecting k-means on the dataset to create a model
Vector[] clusterCenters = model.clusterCenters();
dataset.show(false);
for (Vector v : clusterCenters) {
    System.out.println(v);
}
The output is the following:
+----+----+------+
|File|Size|Volume|
+----+----+------+
|F1 |13 |1689 |
|F2 |18 |1906 |
|F3 |16 |1829 |
|F4 |14 |1726 |
|F5 |10 |1524 |
|F6 |16 |1844 |
|F7 |15 |1752 |
|F8 |12 |1610 |
|F9 |10 |1510 |
|F10 |11 |1554 |
|F11 |12 |1632 |
|F12 |13 |1663 |
|F13 |18 |1901 |
|F14 |13 |1686 |
|F15 |18 |1910 |
|F16 |19 |1986 |
|F17 |11 |1585 |
|F18 |10 |1500 |
|F19 |13 |1665 |
|F20 |13 |1664 |
+----+----+------+
only showing top 20 rows
[-1.7541523789077474E-16,2.0655699373151038E-15] // only one cluster center! Why?
Why does this happen? What do I need to fix to solve this? Having only one cluster ruins my program.
On random data, the correct output of bisecting k-means often is a single cluster only.
With bisecting k-means you only specify a maximum number of clusters, and it can stop early if the results do not improve. In your case, splitting the data into two clusters apparently did not improve the quality, so the bisection was not accepted.
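For reference, a minimal Scala sketch of how k acts only as an upper bound (the column names come from the question's output; the VectorAssembler step is an assumption about how the features column was built):
import org.apache.spark.ml.clustering.BisectingKMeans
import org.apache.spark.ml.feature.VectorAssembler

// Assemble the numeric columns into the "features" vector the estimator expects.
val assembled = new VectorAssembler()
  .setInputCols(Array("Size", "Volume"))
  .setOutputCol("features")
  .transform(dataset)

val kmeans = new BisectingKMeans()
  .setK(4)      // an upper bound: bisection stops early if a split does not improve quality
  .setSeed(1L)
val model = kmeans.fit(assembled)
model.clusterCenters.foreach(println)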