Pyspark - Index from monotonically_increasing_id changes after list aggregation - apache-spark

I'm creating an index using the monotonically_increasing_id() function in Pyspark 3.1.1.
I'm aware of the specific characteristics of that function, but they don't explain my issue.
After creating the index I do a simple aggregation applying the collect_list() function on the created index.
If I compare the results the index changes in certain cases, that is specifically on the upper end of the long-range when the input data is not too small.
Full example code:
import random
import string
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType
spark = SparkSession.builder\
.appName("test")\
.master("local")\
.config('spark.sql.shuffle.partitions', '8')\
.getOrCreate()
# Create random input data of around length 100000:
input_data = []
ii = 0
while ii <= 100000:
L = random.randint(1, 3)
B = ''.join(random.choices(string.ascii_uppercase, k=5))
for i in range(L):
C = random.randint(1,100)
input_data.append((B,))
ii += 1
# Create Spark DataFrame:
input_rdd = sc.parallelize(tuple(input_data))
schema = StructType([StructField("B", StringType())])
dg = spark.createDataFrame(input_rdd, schema=schema)
# Create id and aggregate:
dg = dg.sort("B").withColumn("ID0", f.monotonically_increasing_id())
dg2 = dg.groupBy("B").agg(f.collect_list("ID0"))
Output:
dg.sort('B', ascending=False).show(10, truncate=False)
dg2.sort('B', ascending=False).show(5, truncate=False)
This of course creates different data with every run, but if the length is large enough (problem appears slightly at 10000, but not at 1000), it should appear everytime. Here's an example result:
+-----+-----------+
|B |ID0 |
+-----+-----------+
|ZZZVB|60129554616|
|ZZZVB|60129554617|
|ZZZVB|60129554615|
|ZZZUH|60129554614|
|ZZZRW|60129554612|
|ZZZRW|60129554613|
|ZZZNH|60129554611|
|ZZZNH|60129554609|
|ZZZNH|60129554610|
|ZZZJH|60129554606|
+-----+-----------+
only showing top 10 rows
+-----+---------------------------------------+
|B |collect_list(ID0) |
+-----+---------------------------------------+
|ZZZVB|[60129554742, 60129554743, 60129554744]|
|ZZZUH|[60129554741] |
|ZZZRW|[60129554739, 60129554740] |
|ZZZNH|[60129554736, 60129554737, 60129554738]|
|ZZZJH|[60129554733, 60129554734, 60129554735]|
+-----+---------------------------------------+
only showing top 5 rows
The entry ZZZVB has the three IDs 60129554615, 60129554616, and 60129554617 before aggregation, but after aggregation the numbers have changed to 60129554742, 60129554743, 60129554744.
Why? I can't imagine this is supposed to happen. Isn't the result of monotonically_increasing_id() a simple long that keeps its value after having been created?
EDIT: As expected a workaround is to coalesce(1) the DataFrame before creating the id.

dg and df2 are two different dataframes, each with their own DAG. These DAGs are executed independently from each other when an action on one of the dataframes is called. So each time show() is called, the DAG of the respective dataframe is evaluated and during that evaluation, f.monotonically_increasing_id() is called.
To prevent f.monotonically_increasing_id() being called twice, you could add a cache after the withColumn transformation:
dg = dg.sort("B").withColumn("ID0", f.monotonically_increasing_id()).cache()
With the cache, the result of the first evaluation of f.monotonically_increasing_id() is cached and reused when evaluating the second dataframe.

Related

Can I get metadata of files reading by Spark

Let's suppose we have 2 files, file#1 created at 12:55 and file#2 created at 12:58. While reading these two files I want to add a new column "creation_time". Rows belong to file#1 have 12:55 in "creation_time" column and Rows belong to file#2 have 12:58 in "creation_time".
new_data = spark.read.option("header", "true").csv("s3://bucket7838-1/input")
I'm using above code snippet to read the files in "input" directory.
Use input_file_name() function to get the filename and then use hdfs file api to get the file timestamp finally join both dataframes on filename.
Example:
from pyspark.sql.types import *
from pyspark.sql.functions import *
URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration
fs = FileSystem.get(URI("hdfs://<namenode_address>:8020"), Configuration())
status = fs.listStatus(Path('<hdfs_directory>'))
filestatus_df=spark.createDataFrame([[str(i.getPath()),i.getModificationTime()/1000] for i in status],["filename","modified_time"]).\
withColumn("modified_time",to_timestamp(col("modified_time")))
input_df=spark.read.csv("<hdfs_directory>").\
withColumn("filename",input_file_name())
#join both dataframes on filename to get filetimestamp
df=input_df.join(filestatus_df,['filename'],"left")
Here are the steps
Use sparkcontext.wholeTextFiles("/path/to/folder/containing/all/files")
The above returns an RDD where key is the path of the file, and value is the content of the file
rdd.map(lambda x:x[1]) - this give you an rdd with only file contents
rdd.map(lambda x: customeFunctionToProcessFileContent(x))
since map function works in parallel, any operations you do, would be faster and not sequential - as long as your tasks don't depend on each other, which is the main criteria for parallelism
import os
import time
import pyspark
from pyspark.sql.functions import udf
from pyspark.sql.types import *
# reading all the files to create PairRDD
input_rdd = sc.wholeTextFiles("file:///home/user/datatest/*",2)
#convert RDD to DF
input_df=spark.createDataFrame(input_rdd)
input_df.show(truncate=False)
'''
+---------------------------------------+------------+
|_1 |_2 |
+---------------------------------------+------------+
|file:/home/user/datatest/test.txt |1,2,3 1,2,3|
|file:/home/user/datatest/test.txt1 |4,5,6 6,7,6|
+---------------------------------------+------------+
'''
input_df.select("_2").take(2)
#[Row(_2=u'1,2,3\n1,2,3\n'), Row(_2=u'4,5,6\n6,7,6\n')]
# function to get a creation time of a file
def time_convesion(filename):
return time.ctime(os.path.getmtime(filename.split(":")[1]))
#udf registration
time_convesion_udf = udf(time_convesion, StringType())
#udf apply over the DF
final_df = input_df.withColumn("created_time", time_convesion_udf(input_df['_1']))
final_df.show(2,truncate=False)
'''
+---------------------------------------+------------+------------------------+
|_1 |_2 |created_time |
+---------------------------------------+------------+------------------------+
|file:/home/user/datatest/test.txt |1,2,3 1,2,3|Sat Jul 11 18:31:03 2020|
|file:/home/user/datatest/test.txt1 |4,5,6 6,7,6|Sat Jul 11 18:32:43 2020|
+---------------------------------------+------------+------------------------+
'''
# proceed with the next steps for the implementation
The above works with default partition though. So you might not get input files count equal to output file count(as output is number of partitions).
You can re-partition the RDD based on count or any other unique value based on your data, so you end up with output files count equal to input count. This approach will have only parallelism but will not have the performance achieved with optimal number of partitions

PySpark merge 2 column values by index into new list [duplicate]

I have a Pandas dataframe. I have tried to join two columns containing string values into a list first and then using zip, I joined each element of the list with '_'. My data set is like below:
df['column_1']: 'abc, def, ghi'
df['column_2']: '1.0, 2.0, 3.0'
I wanted to join these two columns in a third column like below for each row of my dataframe.
df['column_3']: [abc_1.0, def_2.0, ghi_3.0]
I have successfully done so in python using the code below but the dataframe is quite large and it takes a very long time to run it for the whole dataframe. I want to do the same thing in PySpark for efficiency. I have read the data in spark dataframe successfully but I'm having a hard time determining how to replicate Pandas functions with PySpark equivalent functions. How can I get my desired result in PySpark?
df['column_3'] = df['column_2']
for index, row in df.iterrows():
while index < 3:
if isinstance(row['column_1'], str):
row['column_1'] = list(row['column_1'].split(','))
row['column_2'] = list(row['column_2'].split(','))
row['column_3'] = ['_'.join(map(str, i)) for i in zip(list(row['column_1']), list(row['column_2']))]
I have converted the two columns to arrays in PySpark by using the below code
from pyspark.sql.types import ArrayType, IntegerType, StringType
from pyspark.sql.functions import col, split
crash.withColumn("column_1",
split(col("column_1"), ",\s*").cast(ArrayType(StringType())).alias("column_1")
)
crash.withColumn("column_2",
split(col("column_2"), ",\s*").cast(ArrayType(StringType())).alias("column_2")
)
Now all I need is to zip each element of the arrays in the two columns using '_'. How can I use zip with this? Any help is appreciated.
A Spark SQL equivalent of Python's would be pyspark.sql.functions.arrays_zip:
pyspark.sql.functions.arrays_zip(*cols)
Collection function: Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.
So if you already have two arrays:
from pyspark.sql.functions import split
df = (spark
.createDataFrame([('abc, def, ghi', '1.0, 2.0, 3.0')])
.toDF("column_1", "column_2")
.withColumn("column_1", split("column_1", "\s*,\s*"))
.withColumn("column_2", split("column_2", "\s*,\s*")))
You can just apply it on the result
from pyspark.sql.functions import arrays_zip
df_zipped = df.withColumn(
"zipped", arrays_zip("column_1", "column_2")
)
df_zipped.select("zipped").show(truncate=False)
+------------------------------------+
|zipped |
+------------------------------------+
|[[abc, 1.0], [def, 2.0], [ghi, 3.0]]|
+------------------------------------+
Now to combine the results you can transform (How to use transform higher-order function?, TypeError: Column is not iterable - How to iterate over ArrayType()?):
df_zipped_concat = df_zipped.withColumn(
"zipped_concat",
expr("transform(zipped, x -> concat_ws('_', x.column_1, x.column_2))")
)
df_zipped_concat.select("zipped_concat").show(truncate=False)
+---------------------------+
|zipped_concat |
+---------------------------+
|[abc_1.0, def_2.0, ghi_3.0]|
+---------------------------+
Note:
Higher order functions transform and arrays_zip has been introduced in Apache Spark 2.4.
For Spark 2.4+, this can be done using only zip_with function to zip a concatenate on the same time:
df.withColumn("column_3", expr("zip_with(column_1, column_2, (x, y) -> concat(x, '_', y))"))
The higher-order function takes 2 arrays to merge, element-wise, using a lambda function (x, y) -> concat(x, '_', y).
You can also UDF to zip the split array columns,
df = spark.createDataFrame([('abc,def,ghi','1.0,2.0,3.0')], ['col1','col2'])
+-----------+-----------+
|col1 |col2 |
+-----------+-----------+
|abc,def,ghi|1.0,2.0,3.0|
+-----------+-----------+ ## Hope this is how your dataframe is
from pyspark.sql import functions as F
from pyspark.sql.types import *
def concat_udf(*args):
return ['_'.join(x) for x in zip(*args)]
udf1 = F.udf(concat_udf,ArrayType(StringType()))
df = df.withColumn('col3',udf1(F.split(df.col1,','),F.split(df.col2,',')))
df.show(1,False)
+-----------+-----------+---------------------------+
|col1 |col2 |col3 |
+-----------+-----------+---------------------------+
|abc,def,ghi|1.0,2.0,3.0|[abc_1.0, def_2.0, ghi_3.0]|
+-----------+-----------+---------------------------+
For Spark 3.1+, they now provide pyspark.sql.functions.zip_with() with Python lambda function, therefore it can be done like this:
import pyspark.sql.functions as F
df = df.withColumn("column_3", F.zip_with("column_1", "column_2", lambda x,y: F.concat_ws("_", x, y)))

Why does a distinct count after a MergeSort join on a Spark dataframe of Vectors give a non-deterministic results

A spark inner SortMerge Join on using a column containing a ml Vector appears to give non-deterministic and inaccurate result on larger datasets.
I was using the approxNearestNeighbors method of BucketRandomLSH projection of Spark v2.4.3, and discovered it gives different numbers of results for large data sets.
This problem only appears when executing a SortMerge join; a broadcast join gives the same result every time.
I tracked the problem down to the join on the LSH hash keys. A reproducible example is below...
import org.apache.spark.sql.functions._
import org.apache.spark.ml.linalg.Vectors
import scala.util.Random
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
//create a large dataframe containing an Array (length two) of ml Vectors of length one.
val df = Seq.fill(30000)(
(java.util.UUID.randomUUID.toString,
Seq(
Vectors.dense(Random.nextInt(10).toDouble),
Vectors.dense(Random.nextInt(10).toDouble))
)
).toDF
//ensure it's caches
df.cache.count()
//positional explode the vector column
val dfExploded = df.select(col("*"), posexplode(col("_2")))
// now self join on the exploded 'col' and 'pos' fields
dfExploded.join(dfExploded, Seq("pos","col")).drop("pos","col").distinct.count
Different result each time...
scala>dfExploded.join(dfExploded,Seq("pos","col")).drop("pos","col").distinct.count
res139: Long = 139663581
scala>dfExploded.join(dfExploded,Seq("pos","col")).drop("pos","col").distinct.count
res140: Long = 156349630

A quick way to get the mean of each position in large RDD

I have a large RDD (more than 1,000,000 lines), while each line has four elements A,B,C,D in a tuple. A head scan of the RDD looks like
[(492,3440,4215,794),
(6507,6163,2196,1332),
(7561,124,8558,3975),
(423,1190,2619,9823)]
Now I want to find the mean of each position in this RDD. For example for the data above I need an output list has values:
(492+6507+7561+423)/4
(3440+6163+124+1190)/4
(4215+2196+8558+2619)/4
(794+1332+3975+9823)/4
which is:
[(3745.75,2729.25,4397.0,3981.0)]
Since the RDD is very large, it is not convenient to calculate the sum of each position and then divide by the length of RDD. Are there any quick way for me to get the output? Thank you very much.
I don't think there is anything faster than calculating the mean (or sum) for each column
If you are using the DataFrame API you can simply aggregate multiple columns:
import os
import time
from pyspark.sql import functions as f
from pyspark.sql import SparkSession
# start local spark session
spark = SparkSession.builder.getOrCreate()
# load as rdd
def localpath(path):
return 'file://' + os.path.join(os.path.abspath(os.path.curdir), path)
rdd = spark._sc.textFile(localpath('myPosts/'))
# create data frame from rdd
df = spark.createDataFrame(rdd)
means_df = df.agg(*[f.avg(c) for c in df.columns])
means_dict = means_df.first().asDict()
print(means_dict)
Note that the dictionary keys will be the default spark column names ('0', '1', ...). If you want more speaking column names you can give them as an argument to the createDataFrame command

Randomly shuffle column in Spark RDD or dataframe

Is there anyway I can shuffle a column of an RDD or dataframe such that the entries in that column appear in random order? I'm not sure which APIs I could use to accomplish such a task.
What about selecting the column to shuffle, orderBy(rand) the column and zip it by index to the existing dataframe?
import org.apache.spark.sql.functions.rand
def addIndex(df: DataFrame) = spark.createDataFrame(
// Add index
df.rdd.zipWithIndex.map{case (r, i) => Row.fromSeq(r.toSeq :+ i)},
// Create schema
StructType(df.schema.fields :+ StructField("_index", LongType, false))
)
case class Entry(name: String, salary: Double)
val r1 = Entry("Max", 2001.21)
val r2 = Entry("Zhang", 3111.32)
val r3 = Entry("Bob", 1919.21)
val r4 = Entry("Paul", 3001.5)
val df = addIndex(spark.createDataFrame(Seq(r1, r2, r3, r4)))
val df_shuffled = addIndex(df
.select(col("salary").as("salary_shuffled"))
.orderBy(rand))
df.join(df_shuffled, Seq("_index"))
.drop("_index")
.show(false)
+-----+-------+---------------+
|name |salary |salary_shuffled|
+-----+-------+---------------+
|Max |2001.21|3001.5 |
|Zhang|3111.32|3111.32 |
|Paul |3001.5 |2001.21 |
|Bob |1919.21|1919.21 |
+-----+-------+---------------+
If you don't need a global shuffle across your data, you can shuffle within partitions using the mapPartitions method.
rdd.mapPartitions(Random.shuffle(_));
For a PairRDD (RDDs of type RDD[(K, V)]), if you are interested in shuffling the key-value mappings (mapping an arbitrary key to an arbitrary value):
pairRDD.mapPartitions(iterator => {
val (keySequence, valueSequence) = iterator.toSeq.unzip
val shuffledValueSequence = Random.shuffle(valueSequence)
keySequence.zip(shuffledValueSequence).toIterator
}, true)
The boolean flag at the end denotes that partitioning is preserved (keys are not changed) for this operation so that downstream operations e.g. reduceByKey can be optimized (avoid shuffles).
While one can not not just shuffle a single column directly - it is possible to permute the records in an RDD via RandomRDDs. https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/random/RandomRDDs.html
A potential approach to having only a single column permuted might be:
use mapPartitions to do some setup/teardown on each Worker task
suck all of the records into memory. i.e. iterator.toList. Make sure you have many (/small) partitions of data to avoid OOME
using the Row object rewrite all back out as original except for the given column
within the mapPartitions create an in-memory sorted list
for the desired column drop its values in a separate collection and randomly sample the collection for replacing each record's entry
return the result as list.toIterator from the mapPartitions
You can add one additional column random generated, and then sort the record based on this random generated column. By this way, you are randomly shuffle your destined column.
In this way, you do not need to have all data in memory, which can easily cause OOM. Spark will take care of sorting and memory limitation issue by spill to disk if necessary.
If you don't want the extra column, you can remove it after sorting.
In case someone is looking for a PySpark equivalent of Sascha Vetter's post, you can find it below:
from pyspark.sql.functions import rand
from pyspark.sql import Row
from pyspark.sql.types import *
def add_index_to_row(row, index):
print(index)
row_dict = row.asDict()
row_dict["index"] = index
return Row(**row_dict)
def add_index_to_df(df):
df_with_index = df.rdd.zipWithIndex().map(lambda x: add_index_to_row(x[0], x[1]))
new_schema = StructType(df.schema.fields + [StructField("index", IntegerType(), True)])
return spark.createDataFrame(df_with_index, new_schema)
def shuffle_single_column(df, column_name):
df_cols = df.columns
# select the desired column and shuffle it (i.e. order it by column with random numbers)
shuffled_col = df.select(column_name).orderBy(F.rand())
# add explicit index to the shuffled column
shuffled_col_index = add_index_to_df(shuffled_col)
# add explicit index to the original dataframe
df_index = add_index_to_df(df)
# drop the desired column from df, join it with the shuffled column on created index and finally drop the index column
df_shuffled = df_index.drop(column_name).join(shuffled_col_index, "index").drop("index")
# reorder columns so that the shuffled column comes back to its initial position instead of the last position
df_shuffled = df_shuffled.select(df_cols)
return df_shuffled
# initialize random array
z = np.random.randint(20, size=(10, 3)).tolist()
# create the pyspark dataframe
example_df = sc.parallelize(z).toDF(("a","b","c"))
# shuffle one column of the dataframe
example_df_shuffled = shuffle_single_column(df = example_df, column_name = "a")

Resources