Randomly shuffle column in Spark RDD or dataframe - apache-spark

Is there any way I can shuffle a column of an RDD or dataframe such that the entries in that column appear in random order? I'm not sure which APIs I could use to accomplish such a task.

What about selecting the column to shuffle, ordering it by rand, and zipping it back to the existing dataframe by index?
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{col, rand}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

def addIndex(df: DataFrame) = spark.createDataFrame(
  // Add index
  df.rdd.zipWithIndex.map { case (r, i) => Row.fromSeq(r.toSeq :+ i) },
  // Create schema
  StructType(df.schema.fields :+ StructField("_index", LongType, false))
)
case class Entry(name: String, salary: Double)

val r1 = Entry("Max", 2001.21)
val r2 = Entry("Zhang", 3111.32)
val r3 = Entry("Bob", 1919.21)
val r4 = Entry("Paul", 3001.5)

val df = addIndex(spark.createDataFrame(Seq(r1, r2, r3, r4)))

val df_shuffled = addIndex(df
  .select(col("salary").as("salary_shuffled"))
  .orderBy(rand))

df.join(df_shuffled, Seq("_index"))
  .drop("_index")
  .show(false)
+-----+-------+---------------+
|name |salary |salary_shuffled|
+-----+-------+---------------+
|Max |2001.21|3001.5 |
|Zhang|3111.32|3111.32 |
|Paul |3001.5 |2001.21 |
|Bob |1919.21|1919.21 |
+-----+-------+---------------+

If you don't need a global shuffle across your data, you can shuffle within partitions using the mapPartitions method.
import scala.util.Random

rdd.mapPartitions(Random.shuffle(_))
For a PairRDD (RDDs of type RDD[(K, V)]), if you are interested in shuffling the key-value mappings (mapping an arbitrary key to an arbitrary value):
pairRDD.mapPartitions(iterator => {
  val (keySequence, valueSequence) = iterator.toSeq.unzip
  val shuffledValueSequence = Random.shuffle(valueSequence)
  keySequence.zip(shuffledValueSequence).toIterator
}, preservesPartitioning = true)
The preservesPartitioning flag at the end tells Spark that the keys are not changed by this operation, so downstream operations such as reduceByKey can be optimized (avoiding extra shuffles).

While one cannot shuffle a single column directly, it is possible to permute the records in an RDD via RandomRDDs. https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/random/RandomRDDs.html
A potential approach to having only a single column permuted might be (a rough sketch follows the steps):
use mapPartitions to do some setup/teardown on each worker task
pull all of the partition's records into memory, i.e. iterator.toList; make sure you have many (small) partitions of data to avoid an OOME
using the Row object, rewrite all records back out as the original except for the given column
within the mapPartitions, create an in-memory sorted list
for the desired column, drop its values into a separate collection and randomly sample that collection to replace each record's entry
return the result as list.toIterator from mapPartitions
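A rough PySpark sketch of that per-partition approach (the example DataFrame, the column name salary, and the helper shuffle_column_in_partition are illustrative placeholders, not from the original post); note that it only permutes the column within each partition, not globally:
import random

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("shuffle-column-sketch").getOrCreate()

df = spark.createDataFrame(
    [("Max", 2001.21), ("Zhang", 3111.32), ("Bob", 1919.21), ("Paul", 3001.5)],
    ["name", "salary"],
)

columns = df.columns   # plain list, safe to capture in the closure
target = "salary"      # the column to permute

def shuffle_column_in_partition(rows):
    rows = list(rows)                        # pull the partition into memory (keep partitions small)
    values = [r[target] for r in rows]       # collect the target column's values
    random.shuffle(values)                   # permute them
    for r, v in zip(rows, values):
        d = r.asDict()
        d[target] = v                        # rewrite the record with a sampled value
        yield tuple(d[c] for c in columns)   # keep the original column order

shuffled = spark.createDataFrame(df.rdd.mapPartitions(shuffle_column_in_partition), df.schema)
shuffled.show()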

You can add one additional, randomly generated column and then sort the records based on that random column. This way, you randomly shuffle your target column (a short sketch is below).
You do not need to hold all the data in memory, which could easily cause an OOM: Spark takes care of sorting and of memory limits by spilling to disk if necessary.
If you don't want the extra column, you can drop it after sorting.
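A minimal PySpark sketch of this idea applied to a single selected column (the DataFrame and the column name salary are placeholders for illustration); to put the shuffled column back next to the other columns, join it to the original dataframe by a generated index as shown in the earlier answers:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local").appName("rand-sort-sketch").getOrCreate()

df = spark.createDataFrame(
    [("Max", 2001.21), ("Zhang", 3111.32), ("Bob", 1919.21), ("Paul", 3001.5)],
    ["name", "salary"],
)

# Add a randomly generated key column, let Spark sort by it (spilling to disk
# if necessary), then drop the helper column again.
shuffled_salaries = (
    df.select("salary")
      .withColumn("_rand", F.rand())
      .orderBy("_rand")
      .drop("_rand")
)
shuffled_salaries.show()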

In case someone is looking for a PySpark equivalent of Sascha Vetter's post, you can find it below:
import numpy as np

from pyspark.sql import Row, SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def add_index_to_row(row, index):
    print(index)
    row_dict = row.asDict()
    row_dict["index"] = index
    return Row(**row_dict)

def add_index_to_df(df):
    df_with_index = df.rdd.zipWithIndex().map(lambda x: add_index_to_row(x[0], x[1]))
    new_schema = StructType(df.schema.fields + [StructField("index", IntegerType(), True)])
    return spark.createDataFrame(df_with_index, new_schema)

def shuffle_single_column(df, column_name):
    df_cols = df.columns
    # select the desired column and shuffle it (i.e. order it by a column of random numbers)
    shuffled_col = df.select(column_name).orderBy(F.rand())
    # add an explicit index to the shuffled column
    shuffled_col_index = add_index_to_df(shuffled_col)
    # add an explicit index to the original dataframe
    df_index = add_index_to_df(df)
    # drop the desired column from df, join it with the shuffled column on the created index,
    # and finally drop the index column
    df_shuffled = df_index.drop(column_name).join(shuffled_col_index, "index").drop("index")
    # reorder columns so that the shuffled column comes back to its initial position
    # instead of the last position
    df_shuffled = df_shuffled.select(df_cols)
    return df_shuffled

# initialize a random array
z = np.random.randint(20, size=(10, 3)).tolist()
# create the pyspark dataframe
example_df = sc.parallelize(z).toDF(("a", "b", "c"))
# shuffle one column of the dataframe
example_df_shuffled = shuffle_single_column(df=example_df, column_name="a")

Related

Pyspark - Index from monotonically_increasing_id changes after list aggregation

I'm creating an index using the monotonically_increasing_id() function in Pyspark 3.1.1.
I'm aware of the specific characteristics of that function, but they don't explain my issue.
After creating the index I do a simple aggregation applying the collect_list() function on the created index.
If I compare the results, the index changes in certain cases, specifically at the upper end of the long range when the input data is not too small.
Full example code:
import random
import string

from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder\
    .appName("test")\
    .master("local")\
    .config('spark.sql.shuffle.partitions', '8')\
    .getOrCreate()

# Create random input data of around length 100000:
input_data = []
ii = 0
while ii <= 100000:
    L = random.randint(1, 3)
    B = ''.join(random.choices(string.ascii_uppercase, k=5))
    for i in range(L):
        C = random.randint(1, 100)
        input_data.append((B,))
        ii += 1

# Create Spark DataFrame:
input_rdd = spark.sparkContext.parallelize(tuple(input_data))
schema = StructType([StructField("B", StringType())])
dg = spark.createDataFrame(input_rdd, schema=schema)

# Create id and aggregate:
dg = dg.sort("B").withColumn("ID0", f.monotonically_increasing_id())
dg2 = dg.groupBy("B").agg(f.collect_list("ID0"))
Output:
dg.sort('B', ascending=False).show(10, truncate=False)
dg2.sort('B', ascending=False).show(5, truncate=False)
This of course creates different data with every run, but if the length is large enough (the problem already appears, to a small extent, at 10000 but not at 1000), it should appear every time. Here's an example result:
+-----+-----------+
|B |ID0 |
+-----+-----------+
|ZZZVB|60129554616|
|ZZZVB|60129554617|
|ZZZVB|60129554615|
|ZZZUH|60129554614|
|ZZZRW|60129554612|
|ZZZRW|60129554613|
|ZZZNH|60129554611|
|ZZZNH|60129554609|
|ZZZNH|60129554610|
|ZZZJH|60129554606|
+-----+-----------+
only showing top 10 rows
+-----+---------------------------------------+
|B |collect_list(ID0) |
+-----+---------------------------------------+
|ZZZVB|[60129554742, 60129554743, 60129554744]|
|ZZZUH|[60129554741] |
|ZZZRW|[60129554739, 60129554740] |
|ZZZNH|[60129554736, 60129554737, 60129554738]|
|ZZZJH|[60129554733, 60129554734, 60129554735]|
+-----+---------------------------------------+
only showing top 5 rows
The entry ZZZVB has the three IDs 60129554615, 60129554616, and 60129554617 before aggregation, but after aggregation the numbers have changed to 60129554742, 60129554743, 60129554744.
Why? I can't imagine this is supposed to happen. Isn't the result of monotonically_increasing_id() a simple long that keeps its value after having been created?
EDIT: As expected, a workaround is to coalesce(1) the DataFrame before creating the id (sketched below).
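A rough sketch of that workaround, assuming collapsing to a single partition is acceptable for your data size (it sacrifices parallelism for this stage):
# With a single partition the generated ids form one contiguous 0..n-1 sequence,
# so they no longer depend on how the data happens to be partitioned each time
# the plan is re-evaluated.
dg = dg.sort("B").coalesce(1).withColumn("ID0", f.monotonically_increasing_id())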
dg and dg2 are two different dataframes, each with its own DAG. These DAGs are executed independently of each other when an action on one of the dataframes is called. So each time show() is called, the DAG of the respective dataframe is evaluated, and during that evaluation f.monotonically_increasing_id() is called.
To prevent f.monotonically_increasing_id() from being called twice, you could add a cache after the withColumn transformation:
dg = dg.sort("B").withColumn("ID0", f.monotonically_increasing_id()).cache()
With the cache, the result of the first evaluation of f.monotonically_increasing_id() is cached and reused when evaluating the second dataframe.

A quick way to get the mean of each position in large RDD

I have a large RDD (more than 1,000,000 lines), where each line has four elements A, B, C, D in a tuple. A head scan of the RDD looks like
[(492,3440,4215,794),
(6507,6163,2196,1332),
(7561,124,8558,3975),
(423,1190,2619,9823)]
Now I want to find the mean of each position in this RDD. For example for the data above I need an output list has values:
(492+6507+7561+423)/4
(3440+6163+124+1190)/4
(4215+2196+8558+2619)/4
(794+1332+3975+9823)/4
which is:
[(3745.75,2729.25,4397.0,3981.0)]
Since the RDD is very large, it is not convenient to calculate the sum of each position and then divide by the length of the RDD. Is there any quick way for me to get the output? Thank you very much.
I don't think there is anything faster than calculating the mean (or sum) for each column.
If you are using the DataFrame API, you can simply aggregate multiple columns:
import os

from pyspark.sql import functions as f
from pyspark.sql import SparkSession

# start local spark session
spark = SparkSession.builder.getOrCreate()

# load as rdd
def localpath(path):
    return 'file://' + os.path.join(os.path.abspath(os.path.curdir), path)

rdd = spark.sparkContext.textFile(localpath('myPosts/'))

# create data frame from rdd
# (each text line needs to be parsed into a tuple/list of numbers first,
# e.g. with a map, before createDataFrame can infer a schema)
df = spark.createDataFrame(rdd)

means_df = df.agg(*[f.avg(c) for c in df.columns])
means_dict = means_df.first().asDict()
print(means_dict)
Note that the dictionary keys will be the default Spark column names ('0', '1', ...). If you want more descriptive column names, you can pass them as an argument to the createDataFrame command.
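If you would rather stay on the RDD API, a single pass with aggregate() can compute the per-position sums and the row count at once; a minimal sketch using the sample data from the question (names like seq_op and comb_op are just illustrative):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([
    (492, 3440, 4215, 794),
    (6507, 6163, 2196, 1332),
    (7561, 124, 8558, 3975),
    (423, 1190, 2619, 9823),
])

# accumulator = (per-position sums, row count)
zero = ((0.0, 0.0, 0.0, 0.0), 0)

def seq_op(acc, row):
    sums, n = acc
    return tuple(s + x for s, x in zip(sums, row)), n + 1

def comb_op(acc1, acc2):
    return tuple(a + b for a, b in zip(acc1[0], acc2[0])), acc1[1] + acc2[1]

sums, count = rdd.aggregate(zero, seq_op, comb_op)
print(tuple(s / count for s in sums))   # (3745.75, 2729.25, 4397.0, 3981.0)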

Taking value from one dataframe and passing that value into loop of SqlContext

Looking to do something like this:
I have a dataframe that is one column of IDs, called ID_LIST. I would like to loop through ID_LIST with foreach, issue a Spark SQL call for each id, and return the results to another dataframe.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val id_list = sqlContext.sql("select distinct id from item_orc")
id_list.registerTempTable("ID_LIST")
id_list.foreach(i => println(i))
id_list println output:
[123]
[234]
[345]
[456]
Trying to now loop through ID_LIST and run a Spark SQL call for each:
id_list.foreach(i => {
  val items = sqlContext.sql("select * from another_items_orc where id = " + i)
  items.foreach(println)
})
First: I'm not sure how to pull the individual value out; I'm getting this error:
org.apache.spark.sql.AnalysisException: cannot recognize input near '[' '123' ']' in expression specification; line 1 pos 61
Second: how can I alter my code to output the result to a dataframe I can use later?
Thanks, any help is appreciated!
Answer To First Question
When you perform the foreach, Spark converts the dataframe into an RDD of type Row. When you println on that RDD it prints the Row, the first row being "[123]": the brackets are just how a Row is rendered. The elements in the row are accessed by position. If you wanted to print just 123, 234, etc., try
id_list.foreach(i => println(i(0)))
Or you can use native primitive access
id_list.foreach(i => println(i.getString(0))) //For Strings
Seriously, read the Spark documentation for Row. This will transform your code to:
id_list.foreach(i => {
  val items = sqlContext.sql("select * from another_items_orc where id = " + i.getString(0))
  items.foreach(i => println(i.getString(0)))
})
Answer to Second Question
I have a sneaking suspicion about what you actually are trying to do but I'll answer your question as I have interpreted it.
Let's create an empty dataframe and, in a loop over the distinct items from the first dataframe, union everything to it.
import org.apache.spark.sql.types.{StringType, StructType}
import org.apache.spark.sql.Row

// Create the empty dataframe. The schema should reflect the columns
// of the dataframe that you will be adding to it.
val schema = new StructType()
  .add("col1", StringType, true)
var df = ss.createDataFrame(ss.sparkContext.emptyRDD[Row], schema) // ss is your SparkSession

// Loop over, select, and union to the empty df.
// Collect the ids to the driver first: sqlContext is not usable inside an
// executor-side foreach, and the var df can only be updated on the driver.
id_list.collect().foreach { i =>
  val items = sqlContext.sql("select * from another_items_orc where id = " + i.getString(0))
  df = df.union(items)
}
df.show()
You now have the dataframe df that you can use later.
NOTE: An easier thing to do would probably be to join the two dataframes on the matching columns.
import sqlContext.implicits.StringToColumn
val bar = id_list.join(another_items_orc, $"distinct_id" === $"id", "inner").select("id")
bar.show()

What is the most efficient way to repartition a data frame with the following constraints?

I have a time series dataframe stored in a single partition
+-------------+------+----+-------+
| TimeStamp| X| Y| Z|
+-------------+------+----+-------+
|1448949705421|-35888|4969|3491754|
|1448949705423|-35081|2795|3489177|
|1448949705425|-35976|5830|3488618|
|1448949705426|-36927|4729|3491807|
|1448949705428|-36416|6246|3490364|
|1448949705429|-36073|7067|3491556|
|1448949705431|-38553|3714|3489545|
|1448949705433|-39008|3034|3490230|
|1448949705434|-35295|4005|3489426|
|1448949705436|-36397|5076|3490941|
+-------------+------+----+-------+
I want to repartition this dataframe into 10 partitions, such that the first partition has roughly the first 1/10 rows, the second partition has roughly the second 1/10 rows, and so on.
One way I can think of is:
var df = ???

// add index to df
val rdd = df.rdd.zipWithIndex().map(indexedRow =>
  Row.fromSeq(indexedRow._2.toLong +: indexedRow._1.toSeq))
val newstructure = StructType(Seq(StructField("rn", LongType, true)).++(df.schema.fields))
val dfWithIndex = sqlContext.createDataFrame(rdd, newstructure)

// create a group number using the index
val udfToInt = udf[Int, Double](_.toInt)
val dfWithGrp = dfWithIndex.withColumn("group", udfToInt($"rn" / (df.count / 10)))

// repartition by the "group" column
val partitionedDF = dfWithGrp.repartition(10, $"group")
Another way I can think of is by using a partitioner:
//After creating a group number
val grpIndex = dfWithGrp.schema.fields.size - 1
val partitionedRDD = dfWithGrp.rdd.map(r => (r.getInt(grpIndex), r))
.partitionBy(new HashPartitioner(10))
.values
But neither seems efficient, because we need to add an index first and then create a group number from it. Is there a way to do this without adding an extra group column?

get specific row from spark dataframe

Is there any alternative to df[100, c("column")] in Scala Spark data frames? I want to select a specific row from a column of a Spark data frame, for example the 100th row in the equivalent R code above.
Firstly, you must understand that DataFrames are distributed, which means you can't access them in a typical procedural way; you must do an analysis first. Although you are asking about Scala, I suggest you read the PySpark documentation, because it has more examples than the other language documentations.
However, continuing with my explanation, I would use some methods of the RDD API, because all DataFrames have an RDD as an attribute. Please see my example below, and notice how I take the 2nd record.
df = sqlContext.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["letter", "name"])

myIndex = 1
values = (df.rdd.zipWithIndex()
            .filter(lambda rowAndIndex: rowAndIndex[1] == myIndex)
            .map(lambda rowAndIndex: tuple(rowAndIndex[0]))
            .collect())

print(values[0])
# ('b', 2)
Hopefully, someone gives another solution with fewer steps.
This is how I achieved the same thing in Scala. I am not sure whether it is more efficient than the accepted answer, but it requires less coding.
val parquetFileDF = sqlContext.read.parquet("myParquetFile.parquet")
val myRow7th = parquetFileDF.rdd.take(7).last
In PySpark, if your dataset is small (can fit into memory of driver), you can do
df.collect()[n]
where df is the DataFrame object, and n is the Row of interest. After getting said Row, you can do row.myColumn or row["myColumn"] to get the contents, as spelled out in the API docs.
The getrows() function below should get the specific rows you want.
For completeness, I have written down the full code in order to reproduce the output.
# Create SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName('scratch').getOrCreate()

# Create the dataframe
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["letter", "name"])

# Function to get rows at `rownums`
def getrows(df, rownums=None):
    return df.rdd.zipWithIndex().filter(lambda x: x[1] in rownums).map(lambda x: x[0])

# Get rows at positions 0 and 2.
getrows(df, rownums=[0, 2]).collect()
# Output:
#> [Row(letter='a', name=1), Row(letter='c', name=3)]
This Works for me in PySpark
df.select("column").collect()[0][0]
There is a Scala way (if you have enough memory on the working machine):
val arr = df.select("column").rdd.collect
println(arr(100))
If the dataframe schema is unknown and you know the actual type of the "column" field (for example, double), then you can get arr as follows:
val arr = df.select($"column".cast("Double")).as[Double].rdd.collect
You can simply do that by using the single line of code below:
val arr = df.select("column").collect()(99)
When you want to fetch the max value of a date column from a dataframe, just the value without the Row object wrapper, you can refer to the code below.
from pyspark.sql.functions import max

table = "mytable"
max_date = df.select(max('date_col')).first()[0]
This gives 2020-06-26 instead of Row(max(date_col)=datetime.date(2020, 6, 26)).
Following is a Java/Spark way to do it: 1) add a monotonically increasing id column, 2) select the row by its id, 3) drop the column.
import static org.apache.spark.sql.functions.*;
..
ds = ds.withColumn("rownum", monotonically_increasing_id());
ds = ds.filter(col("rownum").equalTo(99));
ds = ds.drop("rownum");
N.B. monotonically_increasing_id starts from 0.
