A quick way to get the mean of each position in large RDD - python-3.x

I have a large RDD (more than 1,000,000 lines), while each line has four elements A,B,C,D in a tuple. A head scan of the RDD looks like
[(492,3440,4215,794),
(6507,6163,2196,1332),
(7561,124,8558,3975),
(423,1190,2619,9823)]
Now I want to find the mean of each position in this RDD. For example for the data above I need an output list has values:
(492+6507+7561+423)/4
(3440+6163+124+1190)/4
(4215+2196+8558+2619)/4
(794+1332+3975+9823)/4
which is:
[(3745.75,2729.25,4397.0,3981.0)]
Since the RDD is very large, it is not convenient to calculate the sum of each position and then divide by the length of RDD. Are there any quick way for me to get the output? Thank you very much.

I don't think there is anything faster than calculating the mean (or sum) for each column
If you are using the DataFrame API you can simply aggregate multiple columns:
import os
import time
from pyspark.sql import functions as f
from pyspark.sql import SparkSession
# start local spark session
spark = SparkSession.builder.getOrCreate()
# load as rdd
def localpath(path):
return 'file://' + os.path.join(os.path.abspath(os.path.curdir), path)
rdd = spark._sc.textFile(localpath('myPosts/'))
# create data frame from rdd
df = spark.createDataFrame(rdd)
means_df = df.agg(*[f.avg(c) for c in df.columns])
means_dict = means_df.first().asDict()
print(means_dict)
Note that the dictionary keys will be the default spark column names ('0', '1', ...). If you want more speaking column names you can give them as an argument to the createDataFrame command

Related

Dataframe's intersection with itself is smaller than the original DataFrame

I have a PySpark DataFrame with approx. 150 million rows (with a single string column).
Here's the minimal code block demonstrating what happens in my application:
df = spark.read.csv("source.csv")
df = df.distinct() # No duplicates!
df1 = df
intersection = df.intersect(df1)
print(df.count(), intersection.count())
# This prints:
# 156203100, 156184232
How could this possibly be? Is df.intersect(df1).count() not an accurate count?
Additional information:
df is deterministic (no sampling or random values)
unfortunately, I do not have access to the the actual content of the DataFrame, so I can not look at how the content actually differs.

Pyspark - Index from monotonically_increasing_id changes after list aggregation

I'm creating an index using the monotonically_increasing_id() function in Pyspark 3.1.1.
I'm aware of the specific characteristics of that function, but they don't explain my issue.
After creating the index I do a simple aggregation applying the collect_list() function on the created index.
If I compare the results the index changes in certain cases, that is specifically on the upper end of the long-range when the input data is not too small.
Full example code:
import random
import string
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType
spark = SparkSession.builder\
.appName("test")\
.master("local")\
.config('spark.sql.shuffle.partitions', '8')\
.getOrCreate()
# Create random input data of around length 100000:
input_data = []
ii = 0
while ii <= 100000:
L = random.randint(1, 3)
B = ''.join(random.choices(string.ascii_uppercase, k=5))
for i in range(L):
C = random.randint(1,100)
input_data.append((B,))
ii += 1
# Create Spark DataFrame:
input_rdd = sc.parallelize(tuple(input_data))
schema = StructType([StructField("B", StringType())])
dg = spark.createDataFrame(input_rdd, schema=schema)
# Create id and aggregate:
dg = dg.sort("B").withColumn("ID0", f.monotonically_increasing_id())
dg2 = dg.groupBy("B").agg(f.collect_list("ID0"))
Output:
dg.sort('B', ascending=False).show(10, truncate=False)
dg2.sort('B', ascending=False).show(5, truncate=False)
This of course creates different data with every run, but if the length is large enough (problem appears slightly at 10000, but not at 1000), it should appear everytime. Here's an example result:
+-----+-----------+
|B |ID0 |
+-----+-----------+
|ZZZVB|60129554616|
|ZZZVB|60129554617|
|ZZZVB|60129554615|
|ZZZUH|60129554614|
|ZZZRW|60129554612|
|ZZZRW|60129554613|
|ZZZNH|60129554611|
|ZZZNH|60129554609|
|ZZZNH|60129554610|
|ZZZJH|60129554606|
+-----+-----------+
only showing top 10 rows
+-----+---------------------------------------+
|B |collect_list(ID0) |
+-----+---------------------------------------+
|ZZZVB|[60129554742, 60129554743, 60129554744]|
|ZZZUH|[60129554741] |
|ZZZRW|[60129554739, 60129554740] |
|ZZZNH|[60129554736, 60129554737, 60129554738]|
|ZZZJH|[60129554733, 60129554734, 60129554735]|
+-----+---------------------------------------+
only showing top 5 rows
The entry ZZZVB has the three IDs 60129554615, 60129554616, and 60129554617 before aggregation, but after aggregation the numbers have changed to 60129554742, 60129554743, 60129554744.
Why? I can't imagine this is supposed to happen. Isn't the result of monotonically_increasing_id() a simple long that keeps its value after having been created?
EDIT: As expected a workaround is to coalesce(1) the DataFrame before creating the id.
dg and df2 are two different dataframes, each with their own DAG. These DAGs are executed independently from each other when an action on one of the dataframes is called. So each time show() is called, the DAG of the respective dataframe is evaluated and during that evaluation, f.monotonically_increasing_id() is called.
To prevent f.monotonically_increasing_id() being called twice, you could add a cache after the withColumn transformation:
dg = dg.sort("B").withColumn("ID0", f.monotonically_increasing_id()).cache()
With the cache, the result of the first evaluation of f.monotonically_increasing_id() is cached and reused when evaluating the second dataframe.

Partitioning of Data Frame in Pyspark using Custom Partitioner

Looking for some info on using custom partitioner in Pyspark. I have a dataframe holding country data for various countries. So if I do repartition on country column, it will distribute my data into n partitions and keeping similar country data to specific partitions. This is creating a skew partition data when I see using glom() method.
Some countries like USA and CHN has huge amount of data in particular dataframe. I want to repartition my dataframe such that if the countries are USA and CHN then it will further split into some 10 partitions else keep the partitions same for other countries like IND, THA, AUS etc. Can we extend partitioner class in Pyspark code.
I have read this in below link that we can extend scala partitioner class in scala Spark application and can modify the partitioner class to use custom logic to repartition our data on base of requirements. Like the one I have.. please help to achieve this solution in Pyspark.. See the link below What is an efficient way to partition by column but maintain a fixed partition count?
I am using Spark version 2.3.0.2 and below is my Dataframe structure:
datadf= spark.sql("""
SELECT
ID_NUMBER ,SENDER_NAME ,SENDER_ADDRESS ,REGION_CODE ,COUNTRY_CODE
from udb.sometable
""");
The incoming data has data for six countries, like AUS, IND, THA, RUS, CHN and USA.
CHN and USA has skew data.
so if I do repartition on COUNTRY_CODE, two partitions contains a lot data whereas others are fine. I checked this using glom() method.
newdf = datadf.repartition("COUNTRY_CODE")
from pyspark.sql import SparkSession
from pyspark.sql import HiveContext, DataFrameWriter, DataFrame
newDF = datadf.repartitionByRange(3,"COUNTRY_CODE","USA")
I was trying repartition my data into 3 more partitions for country USA and CHN only and would like to keep the other countries data into single partition.
This is what I am expecting
AUS- one partition
IND- one partition
THA- one partition
RUS- one partition
CHN- three partition
USA- three partition
Traceback (most recent call last): File "", line 1, in
File
"/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line
1182, in getattr
"'%s' object has no attribute '%s'" % (self.class.name, name)) AttributeError: 'DataFrame' object has no attribute
'repartitionByRange'
Try something like this with hashing:
newDf = oldDf.repartition(N, $"col1", $"coln")
or for ranging approach:
newDF = oldDF.repartitionByRange(N, $"col1", $"coln")
There is no custom partitioning for DF's just yet.
In your case I would go for hashing, but there are no guarantees.
But if your data is skew you may need some extra work, like 2 columns for partitioning being the simplest approach.
E.g. an existing or new column - in this case a column that applies a grouping against a given country, e.g. 1 .. N, and the partition on two cols.
For countries with many grouping you get N synthetic sub divisions; for others with low cardinality, only with 1 such group number. Not too hard. Both partitioning can take more than 1 col.
In my view uniform number filling of partitions takes a lot of effort and not really attainable, but a next best approach as in this here can suffice well enough. Amounts to custom partitioning to an extent.
Otherwise, using .withColumn on a DF you can simulate custom partitioning with those rules and filling of a new DF column and then apply the repartitionByRange. Also not so hard.
There is no custom partitioner in Structured API, so in order to use custom partitioner, you'll need to drop down to RDD API. Simple 3 steps as follows:
Convert Structured API to RDD API
dataRDD = dataDF.rdd
Apply custom partitioner in RDD API
import random
# Extract key from Row object
dataRDD = dataRDD.map(lambda r: (r[0], r))
def partitioner(key):
if key == "CHN":
return random.randint(1, 10)
elif key == "USA":
return random.randint(11, 20)
else:
# distinctCountryDict is a dict mapping distinct countries to distinct integers
# these distinct integers should not overlap with range(1, 20)
return distinctCountryDict[key]
numPartitions = 100
dataRDD = dataRDD.partitionBy(numPartitions, partitioner)
# Remove key extracted previously
dataRDD = dataRDD.map(lambda r: r[1])
Convert RDD API back to Structured API
dataDF = dataRDD.toDF()
This way, you get the best of both worlds, Spark types and optimized physical plan in Structured API, as well as custom partitioner in low-level RDD API. And we only drop down to low-level API only when it's absolutely necessary.
There is no direct way to apply user defined partitioner on PySpark, the short cut is to create a new column with a UDF, assigning each record with a partition ID based on the business logic. And use the new column for partitioning, that way the data gets spread evenly.
numPartitions= 3
df = df.withColumn("Hash#", udf_country_hash(df['Country']))
df = df.withColumn("Partition#", df["Hash#"] % numPartitions)
df.repartition(numPartitions, "Partition#")
Please check the online version of code #
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/8963851468310921/2231943684776180/5846184720595634/latest.html
In my experience converting DataFrame to RDD and back to DataFrame is a costly operation, better to avoid it.

Appending data to an empty dataframe

I am creating an empty dataframe and later trying to append another data frame to that. In fact I want to append many dataframes to the initially empty dataframe dynamically depending on number of RDDs coming.
the union() function works fine if I assign the value to another a third dataframe.
val df3=df1.union(df2)
But I want to keep appending to the initial dataframe (empty) I created because I want to store all the RDDs in one dataframe. The below code however does not show right counts. It seems that it simply did not append
df1.union(df2)
df1.count() // this shows 0 although df2 has some data and that is shown if I assign to third datafram.
If I do the below (I get reassignment error since df1 is val. And if I change it to var type, I get kafka multithreading not safe error.
df1=d1.union(df2)
Any idea how to add all the dynamically created dataframes to one initially created data frame?
Not sure if this is what you are looking for!
# Import pyspark functions
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Define your schema
field = [StructField("Col1",StringType(), True), StructField("Col2", IntegerType(), True)]
schema = StructType(field)
# Your empty data frame
df = spark.createDataFrame(sc.emptyRDD(), schema)
l = []
for i in range(5):
# Build and append to the list dynamically
l = l + [([str(i), i])]
# Create a temporary data frame similar to your original schema
temp_df = spark.createDataFrame(l, schema)
# Do the union with the original data frame
df = df.union(temp_df)
df.show()
DataFrames and other distributed data structures are immutable, therefore methods which operate on them always return new object. There is no appending, no modification in place, and no ALTER TABLE equivalent.
And if I change it to var type, I get kafka multithreading not safe error.
Without actual code is impossible to give you a definitive answer, but it is unlikely related to union code.
There is a number of known Spark bugs cause by incorrect internal implementation (SPARK-19185, SPARK-23623 to enumerate just a few).

Pyspark Pair RDD from Text File

I have a local text file kv_pair.log formatted such as that key value pairs are comma delimited and records begin and terminate with a new line:
"A"="foo","B"="bar","C"="baz"
"A"="oof","B"="rab","C"="zab"
"A"="aaa","B"="bbb","C"="zzz"
I am trying to read this to a Pair RDD using pySpark as follows:
from pyspark import SparkContext
sc=sparkContext()
# Read raw text to RDD
lines=sc.textFile('kv_pair.log')
# How to turn this into a Pair RDD?
pairs=lines.map(lambda x: (x.replace('"', '').split(",")))
print type(pairs)
print pairs.take(2)
I feel I am close! The output of above is:
[[u'A=foo', u'B=bar', u'C=baz'], [u'A=oof', u'B=rab', u'C=zab']]
So it looks like pairs is a list of records, which contains a list of the kv pairs as strings.
How can I use pySpark to transform this into a Pair RDD such as that the keys and values are properly separated?
Ultimate goal is to transform this Pair RDD into a DataFrame to perform SQL operations - but one step at a time, please help transforming this into a Pair RDD.
You can use flatMap with a custom function as lambda can't be used for multiple statements
def tranfrm(x):
lst = x.replace('"', '').split(",")
return [(x.split("=")[0], x.split("=")[1]) for x in lst]
pairs = lines.map(tranfrm)
This is really bad practice for a parser, but I believe your example could be done with something like this:
from pyspark import SparkContext
from pyspark.sql import Row
sc=sparkContext()
# Read raw text to RDD
lines=sc.textFile('kv_pair.log')
# How to turn this into a Pair RDD?
pairs=lines.map(lambda x: (x.replace('"', '').split(",")))\
.map(lambda r: Row(A=r[0].split('=')[1], B=r[1].split('=')[1], C=r[2].split('=')[1] )
print type(pairs)
print pairs.take(2)

Resources