I have an RDD and I want to find distinct values for multiple columns.
Example:
Row(col1=a, col2=b, col3=1), Row(col1=b, col2=2, col3=10), Row(col1=a1, col2=4, col3=10)
I would like to end up with a map:
col1=[a,b,a1]
col2=[b,2,4]
col3=[1,10]
Can a DataFrame help compute this faster or more simply?
Update:
My solution with RDD was:
def to_uniq_vals(row):
    # emit (column, value) pairs for every field of the Row
    return [(k, v) for k, v in row.asDict().items()]

rdd.flatMap(to_uniq_vals).distinct().collect()
Thanks
I hope I understand your question correctly; you can try the following:
import org.apache.spark.sql.{functions => F}
val df = Seq(("a", 1, 1), ("b", 2, 10), ("a1", 4, 10)).toDF
df.select(F.collect_set("_1"), F.collect_set("_2"), F.collect_set("_3")).show
Results:
+---------------+---------------+---------------+
|collect_set(_1)|collect_set(_2)|collect_set(_3)|
+---------------+---------------+---------------+
| [a1, b, a]| [1, 2, 4]| [1, 10]|
+---------------+---------------+---------------+
The code above should be more efficient than the proposed column-by-column select distinct, for several reasons:
Fewer worker-to-driver round trips.
De-duplication is done locally on each worker before the inter-worker de-duplication.
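Since the original question uses PySpark, a rough equivalent of the same idea (assuming the RDD of Rows has been converted to a DataFrame, e.g. df = rdd.toDF(), with columns named col1, col2, col3 as in the question) would be:
from pyspark.sql import functions as F

# collect the distinct values of each column in a single pass over the data
df.select(
    F.collect_set("col1").alias("col1"),
    F.collect_set("col2").alias("col2"),
    F.collect_set("col3").alias("col3"),
).show()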
Hope it helps!
You can use dropDuplicates and then select the same columns. It might not be the most efficient way, but it is still a decent one:
df.dropDuplicates("col1","col2", .... "colN").select("col1","col2", .... "colN").toJSON
** Works well using Scala
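A rough PySpark counterpart of the same call (the column names here are placeholders) would be:
# list the remaining columns in both calls as needed
df.dropDuplicates(["col1", "col2"]).select("col1", "col2").toJSON()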
Related
I have a dataframe like below
df.show(2,False)
col1
----------
[[1,2],[3,4]]
I want to add some static value to each inner array, like this:
col2
----------
[[1,2,"Value"],[3,4,"value]]
Please suggest a way to achieve this.
Explode the array, then use the concat function to append the value to each inner array, and finally use collect_list to recreate the nested array.
from pyspark.sql.functions import *

(df
 # tag each row with its partition id so the rows can be regrouped afterwards
 .withColumn("spark_parti_id", spark_partition_id())
 # explode the outer array: one row per inner array
 .withColumn("col2", explode(col("col1")))
 # append the static value (here 2) to every inner array
 .withColumn("col2", concat(col("col2"), array(lit(2))))
 # collect the inner arrays back into a single nested array
 .groupBy("spark_parti_id")
 .agg(collect_list(col("col2")).alias("col2"))
 .drop("spark_parti_id")
 .show(10, False))
#+----------------------+
#|col2 |
#+----------------------+
#|[[1, 2, 2], [3, 4, 2]]|
#+----------------------+
I want to add a column of random values to a dataframe (which has an id for each row) for something I am testing. I am struggling to get reproducible results across Spark sessions, i.e. the same random value for each row id. I am able to reproduce the results by using
from pyspark.sql.functions import rand
new_df = my_df.withColumn("rand_index", rand(seed = 7))
but it only works when I run it in the same Spark session. I do not get the same results once I relaunch Spark and run my script.
I also tried defining a UDF, to see if I could generate random integers within an interval using Python's random module with random.seed set:
import random
from pyspark.sql.types import LongType

random.seed(7)
spark.udf.register("getRandVals", lambda x, y: random.randint(x, y), LongType())
but to no avail.
Is there a way to ensure reproducible random number generation across Spark sessions, such that a row id gets the same random value? I would really appreciate some guidance :)
Thanks for the help!
I suspect that you are getting the same values for the seed, but in a different order, depending on your partitioning, which is influenced by the data distribution when reading from disk and by how much data there is at a given time. But I am not privy to your actual code.
The rand function generates the same random data (what would be the point of the seed otherwise?), and each partition gets a slice of it. If you look at the output you should be able to spot the pattern.
Here is an example with two dataframes of different cardinality. You can see that the seed gives the same results, or a superset of them. So ordering and partitioning play a role, in my opinion.
from pyspark.sql.functions import col, rand
df1 = spark.range(1, 5).select(col("id").cast("double"))
df1 = df1.withColumn("rand_index", rand(seed = 7))
df1.show()
df1.rdd.getNumPartitions()
print('Partitioning distribution: '+ str(df1.rdd.glom().map(len).collect()))
returns:
+---+-------------------+
| id| rand_index|
+---+-------------------+
|1.0|0.06498948189958098|
|2.0|0.41371264720975787|
|3.0|0.12030715258495939|
|4.0| 0.2731073068483362|
+---+-------------------+
8 partitions & Partitioning distribution: [0, 1, 0, 1, 0, 1, 0, 1]
The same again with more data:
...
df1 = spark.range(1, 10).select(col("id").cast("double"))
...
returns:
+---+-------------------+
| id| rand_index|
+---+-------------------+
|1.0| 0.9147159860432812|
|2.0|0.06498948189958098|
|3.0| 0.7069655052310547|
|4.0|0.41371264720975787|
|5.0| 0.1982919638208397|
|6.0|0.12030715258495939|
|7.0|0.44292918521277047|
|8.0| 0.2731073068483362|
|9.0| 0.7784518091224375|
+---+-------------------+
8 partitions & Partitioning distribution: [1, 1, 1, 1, 1, 1, 1, 2]
You can see the 4 common random values, whether within a Spark session or across sessions.
I know it's a bit late, but have you considered using a hash of the IDs, dates, etc., which is deterministic, instead of the built-in random functions? I'm encountering a similar issue, but I believe my problem can be solved using, for example, xxhash64, which is a built-in PySpark hash function. You can then use the last few digits, or normalize if you know the total range of the hash values, which I couldn't find in its documentation.
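For illustration, a minimal sketch of that idea (the column name id, the salt string, and the normalization constant are assumptions for the example; xxhash64 is available from Spark 3.0) could look like this:
from pyspark.sql import functions as F

# Deterministic pseudo-random value per id: the same id (and salt) always
# hashes to the same number, regardless of partitioning or Spark session.
# Dividing the absolute 64-bit hash by 2.0**63 maps it roughly into [0, 1).
df_with_rand = my_df.withColumn(
    "rand_index",
    F.abs(F.xxhash64(F.col("id"), F.lit("my-salt"))) / F.lit(2.0 ** 63)
)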
The following code can be used to filter out rows that contain the value 1 in any column other than ID. Imagine there are a lot of columns.
import org.apache.spark.sql.functions.{col, when}

val df = sc.parallelize(Seq(
  ("r1", 1, 1),
  ("r2", 6, 4),
  ("r3", 4, 1),
  ("r4", 1, 2)
)).toDF("ID", "a", "b")

// count, per row, how many of the non-ID columns equal 1
val ones = df.schema.map(c => c.name).drop(1).map(x => when(col(x) === 1, 1).otherwise(0)).reduce(_ + _)
df.withColumn("ones", ones).where($"ones" === 0).show
The downside here is that it should ideally stop when the first such condition is met, i.e. at the first matching column. OK, we all know that.
But I cannot find an elegant method to achieve this without presumably using a UDF or very specific logic; the map will process all columns.
Can a fold(Left) therefore be used that terminates when the first occurrence is found? Or some other approach? Maybe I am overlooking something.
My first idea was to use logical expressions and hope for short-circuiting, but it seems Spark is not doing this:
df
  .withColumn("ones", df.columns.tail.map(x => when(col(x) === 1, true).otherwise(false)).reduceLeft(_ or _))
  .where(!$"ones")
  .show()
But I'm not sure whether Spark supports short-circuiting; I think not (https://issues.apache.org/jira/browse/SPARK-18712).
So alternatively you can apply a custom function to your rows, using the short-circuiting exists on Scala's Seq:
df
  .map { r => (r.getString(0), r.toSeq.tail.exists(c => c.asInstanceOf[Int] == 1)) }
  .toDF("ID", "ones")
  .show()
This approach is similar to a UDF, so I'm not sure if that's acceptable for you.
I intend to implement the Apriori algorithm according to the YAFIM article with PySpark. Its processing workflow consists of two phases:
Phase 1: the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item, and collecting those items that satisfy the minimum support.
(Phase 1 diagram: http://bayanbox.ir/download/2410687383784490939/phase1.png)
Phase 2: in this phase, we iteratively use the k-frequent itemsets to generate the (k+1)-frequent itemsets.
(Phase 2 diagram: http://bayanbox.ir/view/6380775039559120652/phase2.png)
To implement the first phase I have written the following code:
from operator import add

transactions = sc.textFile("/FileStore/tables/wo7gkiza1500361138466/T10I4D100K.dat").cache()
minSupport = 0.05 * transactions.count()
items = transactions.flatMap(lambda line: line.split(" "))
itemCount = items.map(lambda item: (item, 1)).reduceByKey(add)
# keep only items whose count exceeds the minimum support
l1 = itemCount.filter(lambda ic: ic[1] > minSupport)
l1.take(5)
output: [(u'', 100000), (u'494', 5102), (u'829', 6810), (u'368', 7828), (u'766', 6265)]
My problem is that I have no idea how to implement the second phase, and especially how to generate the candidate itemsets.
For example, suppose we have the following RDD (frequent 3-itemsets):
([1, 4, 5], 7), ([1, 4, 6], 6), ...
We want to find the candidate 4-itemsets, so if the first two items of two 3-itemsets are the same, we combine them to obtain a 4-itemset, as follows:
[1, 4, 5, 6], ...
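For illustration, a minimal RDD sketch of that prefix-join rule (the names are hypothetical; it only generates candidates and omits the Apriori prune and support-counting steps) might look like this:
# freq_k: RDD of (itemset, count) pairs where each itemset is a sorted tuple,
# e.g. ((1, 4, 5), 7), ((1, 4, 6), 6), ...
def candidates_k_plus_1(freq_k):
    # key every k-itemset by its (k-1)-item prefix
    prefixed = freq_k.map(lambda kv: (kv[0][:-1], kv[0][-1]))
    # self-join on the prefix, keep each pair once, and build the (k+1)-itemset
    return (prefixed.join(prefixed)
                    .filter(lambda kv: kv[1][0] < kv[1][1])
                    .map(lambda kv: kv[0] + (kv[1][0], kv[1][1]))
                    .distinct())

# applied to the two 3-itemsets above, this yields the candidate (1, 4, 5, 6)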
In Spark one often performs a filter operation before using a map, to make sure that the map is possible. See the example below:
bc_ids = sc.broadcast(ids)
new_ids = users.filter(lambda x: x.id in bc_ids.value).map(lambda x: bc_ids.value[x.id])
If you want to know how many users you filtered out, how can you do this efficiently? I would prefer not to use:
count_before = users.count()
new_ids = users.filter(lambda x: x.id in bc_ids.value).map(lambda x: bc_ids.value[x.id])
count_after = new_ids.count()
This question is related to 1, but in contrast it is not about Spark SQL.
In Spark one often performs a filter operation before using a map, to make sure that the map is possible.
The reason to perform filter() before map() is to process only necessary data.
Answer to your question
val base = sc.parallelize(Seq(1, 2, 3, 4, 5, 6, 7))
println(base.filter(_ == 7).count())
println(base.filter(_ != 7).count())
The first line gives you the count of the filtered result, and the second line gives you how many values were filtered out. If you are working against cached and partitioned data, this can be done efficiently.
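In PySpark terms, the same idea (reusing the hypothetical names from the question) would be roughly:
# cache the source RDD so the extra count() does not recompute it from scratch
users_cached = users.cache()
kept = users_cached.filter(lambda x: x.id in bc_ids.value)
new_ids = kept.map(lambda x: bc_ids.value[x.id])
# number of users removed by the filter
filtered_out = users_cached.count() - kept.count()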