Appending data to an empty dataframe - apache-spark

I am creating an empty dataframe and later trying to append another data frame to that. In fact I want to append many dataframes to the initially empty dataframe dynamically depending on number of RDDs coming.
the union() function works fine if I assign the value to another a third dataframe.
val df3=df1.union(df2)
But I want to keep appending to the initial dataframe (empty) I created because I want to store all the RDDs in one dataframe. The below code however does not show right counts. It seems that it simply did not append
df1.union(df2)
df1.count() // this shows 0 although df2 has some data and that is shown if I assign to third datafram.
If I do the below (I get reassignment error since df1 is val. And if I change it to var type, I get kafka multithreading not safe error.
df1=d1.union(df2)
Any idea how to add all the dynamically created dataframes to one initially created data frame?

Not sure if this is what you are looking for!
# Import pyspark functions
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Define your schema
field = [StructField("Col1",StringType(), True), StructField("Col2", IntegerType(), True)]
schema = StructType(field)
# Your empty data frame
df = spark.createDataFrame(sc.emptyRDD(), schema)
l = []
for i in range(5):
# Build and append to the list dynamically
l = l + [([str(i), i])]
# Create a temporary data frame similar to your original schema
temp_df = spark.createDataFrame(l, schema)
# Do the union with the original data frame
df = df.union(temp_df)
df.show()

DataFrames and other distributed data structures are immutable, therefore methods which operate on them always return new object. There is no appending, no modification in place, and no ALTER TABLE equivalent.
And if I change it to var type, I get kafka multithreading not safe error.
Without actual code is impossible to give you a definitive answer, but it is unlikely related to union code.
There is a number of known Spark bugs cause by incorrect internal implementation (SPARK-19185, SPARK-23623 to enumerate just a few).

Related

How to join efficiently 2 Spark dataframes partitioned by some column, when that column is one of multiple join keys?

I am currently facing some issues in Spark 3.0.2 to efficiently join 2 Spark dataframes when
The 2 Spark DataFrames are partitioned by some key id;
id is part of the join key, but it is not the only one.
My intuition is telling me that the query optimizer is, in this case, not choosing the optimal path. I will illustrate my issue through a minimal example (note that this particular example does not really require a join, it's just for illustrative purposes).
Let's start from the simple case: the 2 dataframes are partitioned by id, and we join by id only:
from pyspark.sql import SparkSession, Row, Window
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
# Make up some test dataframe
df = spark.createDataFrame([Row(id=i // 10, order=i % 10, value=i) for i in range(10000)])
# Create the left side of the join (repartitioned by id)
df2 = df.repartition(50, 'id')
# Create the right side of the join (also repartitioned by id)
df3 = df2.select('id', F.col('order').alias('order_alias'), F.lit(0).alias('dummy'))
# Perform the join
joined_df = df2.join(df3, on='id')
joined_df.foreach(lambda x: None)
This results in the following efficient plan:
This plan is efficient: it recognizes that the 2 dataframes are already partitioned by the join key and avoids to re-shuffle them. The 2 dataframes are not only repartitioned, but also colocated.
What happens if there is an additional join key? It results in an inefficient plan:
joined_df = df2.join(df3, on=[df2.id==df3.id, df2.order==df3.order_alias])
joined_df.foreach(lambda x: None)
The plan is inefficient since it is repartitioning the 2 dataframes to do the join. This does not make sense to me. Intuitively, we could use the existing partitions: all keys to be joined will be found in the same partition as before, there is just one additional condition to apply! So I thought: perhaps we could phrase the 2nd condition as a filter?
joined_df.foreach(lambda x: None)
joined_df = df2.join(df3, on='id')
joined_df_filtered = joined_df.filter(df2.order==df3.order_alias)
This however results in the same inefficient plan, since Spark query optimizer will just merge the 2nd filter with the join.
So, I finally thought that maybe I could force Spark to process the join as I want by adding a dummy cache step, by trying the following:
from pyspark import StorageLevel
joined_df = df2.join(df3, on='id')
# Note that this storage level will not cache anything, it's just to suggest to Spark that I need this intermediate result
joined_df.persist(StorageLevel(False, False, False, False))
# Do the filtering after "persisting" the join
joined_df_filtered = joined_df.filter(df2.order==df3.order_alias)
joined_df_filtered.foreach(lambda x: None)
This results in an efficient plan! It is in fact much faster than the previous ones.
The workaround of "persisting" the first join to force Spark to use a more efficient processing plan is "good enough" for my use case, but I still have a few questions:
Am I missing something in my intuition that Spark should actually be reusing partitions when the partition key is part of the join key, instead of re-shuffling?
Is this expected behavior of the query optimizer? Should a ticket be filed for it?
Is there a better way to force the desired processing plan than adding the "persist" step? It seems more like an indirect workaround than a direct solution.

Caching in spark before diverging the flow

I have a basic question regarding working with Spark DataFrame.
Consider the following piece of pseudo code:
val df1 = // Lazy Read from csv and create dataframe
val df2 = // Filter df1 on some condition
val df3 = // Group by on df2 on certain columns
val df4 = // Join df3 with some other df
val subdf1 = // All records from df4 where id < 0
val subdf2 = // All records from df4 where id > 0
* Then some more operations on subdf1 and subdf2 which won't trigger spark evaluation yet*
// Write out subdf1
// Write out subdf2
Suppose I start of with main dataframe df1(which I lazy read from the CSV), do some operations on this dataframe (filter, groupby, join) and then comes a point where I split this datframe based on a condition (for eg, id > 0 and id < 0). Then I further proceed to operate on these sub dataframes(let us name these subdf1, subdf2) and ultimately write out both the sub dataframes.
Notice that the write function is the only command that triggers the spark evaluation and rest of the functions(filter, groupby, join) result in lazy evaluations.
Now when I write out subdf1, I am clear that lazy evaluation kicks in and all the statements are evaluated starting from reading of CSV to create df1.
My question comes when we start writing out subdf2. Does spark understand the divergence in code at df4 and store this dataframe when command for writing out subdf1 was encountered? Or will it again start from the first line of creating df1 and re-evaluate all the intermediary dataframes?
If so, is it a good idea to cache the dataframe df4(Assuming I have sufficient memory)?
I'm using scala spark if that matters.
Any help would be appreciated.
No, Spark cannot infer that from your code. It will start all over again. To confirm this, you can do subdf1.explain() and subdf2.explain() and you should see that both dataframes have query plans that start right from the beginning where df1 was read.
So you're right that you should cache df4 to avoid redoing all the computations starting from df1, if you have enough memory. And of course, remember to unpersist by doing df4.unpersist() at the end if you no longer need df4 for any further computations.

A quick way to get the mean of each position in large RDD

I have a large RDD (more than 1,000,000 lines), while each line has four elements A,B,C,D in a tuple. A head scan of the RDD looks like
[(492,3440,4215,794),
(6507,6163,2196,1332),
(7561,124,8558,3975),
(423,1190,2619,9823)]
Now I want to find the mean of each position in this RDD. For example for the data above I need an output list has values:
(492+6507+7561+423)/4
(3440+6163+124+1190)/4
(4215+2196+8558+2619)/4
(794+1332+3975+9823)/4
which is:
[(3745.75,2729.25,4397.0,3981.0)]
Since the RDD is very large, it is not convenient to calculate the sum of each position and then divide by the length of RDD. Are there any quick way for me to get the output? Thank you very much.
I don't think there is anything faster than calculating the mean (or sum) for each column
If you are using the DataFrame API you can simply aggregate multiple columns:
import os
import time
from pyspark.sql import functions as f
from pyspark.sql import SparkSession
# start local spark session
spark = SparkSession.builder.getOrCreate()
# load as rdd
def localpath(path):
return 'file://' + os.path.join(os.path.abspath(os.path.curdir), path)
rdd = spark._sc.textFile(localpath('myPosts/'))
# create data frame from rdd
df = spark.createDataFrame(rdd)
means_df = df.agg(*[f.avg(c) for c in df.columns])
means_dict = means_df.first().asDict()
print(means_dict)
Note that the dictionary keys will be the default spark column names ('0', '1', ...). If you want more speaking column names you can give them as an argument to the createDataFrame command

Spark infer schema with limit during a read.csv

I'd like to infer a Spark.DataFrame schema from a directory of CSV files using a small subset of the rows (say limit(100)).
However, setting inferSchema to True means that the Input Size / Records for the FileScanRDD seems to always be equal to the number of rows in all the CSV files.
Is there a way to make the FileScan more selective, such that Spark looks at fewer rows when inferring a schema?
Note: setting the samplingRatio option to be < 1.0 does not have the desired behaviour, though it is clear that inferSchema uses only the sampled subset of rows.
You could read a subset of your input data into a dataSet of String.
The CSV method allows you to pass this as a parameter.
Here is a simple example (I'll leave reading the sample of rows from the input file to you):
val data = List("1,2,hello", "2,3,what's up?")
val csvRDD = sc.parallelize(data)
val df = spark.read.option("inferSchema","true").csv(csvRDD.toDS)
df.schema
When run in spark-shell, the final line from the above prints (I reformatted it for readability):
res4: org.apache.spark.sql.types.StructType =
StructType(
StructField(_c0,IntegerType,true),
StructField(_c1,IntegerType,true),
StructField(_c2,StringType,true)
)
Which is the correct Schema for my limited input data set.
Assuming you are only interested in the schema, here is a possible approach based on cipri.l's post in this link
import org.apache.spark.sql.execution.datasources.csv.{CSVOptions, TextInputCSVDataSource}
def inferSchemaFromSample(sparkSession: SparkSession, fileLocation: String, sampleSize: Int, isFirstRowHeader: Boolean): StructType = {
// Build a Dataset composed of the first sampleSize lines from the input files as plain text strings
val dataSample: Array[String] = sparkSession.read.textFile(fileLocation).head(sampleSize)
import sparkSession.implicits._
val sampleDS: Dataset[String] = sparkSession.createDataset(dataSample)
// Provide information about the CSV files' structure
val firstLine = dataSample.head
val extraOptions = Map("inferSchema" -> "true", "header" -> isFirstRowHeader.toString)
val csvOptions: CSVOptions = new CSVOptions(extraOptions, sparkSession.sessionState.conf.sessionLocalTimeZone)
// Infer the CSV schema based on the sample data
val schema = TextInputCSVDataSource.inferFromDataset(sparkSession, sampleDS, Some(firstLine), csvOptions)
schema
}
Unlike GMc's answer from above, this approach tries to directly infer the schema the same way the DataFrameReader.csv() does in the background (but without going through the effort of building an additional Dataset with that schema, that we would then only use to retrieve the schema back from it)
The schema is inferred based on a Dataset[String] containing only the first sampleSize lines from the input files as plain text strings.
When trying to retrieve samples from data, Spark has only 2 types of methods:
Methods that retrieve a given percentage of the data. This operation takes random samples from all partitions. It benefits from higher parallelism, but it must read all the input files.
Methods that retrieve a specific number of rows. This operation must collect the data on the driver, but it could read a single partition (if the required row count is low enough)
Since you mentioned you want to use a specific small number of rows and since you want to avoid touching all the data, I provided a solution based on option 2
PS: The DataFrameReader.textFile method accepts paths to files, folders and it also has a varargs variant, so you could pass in one or more files or folders.

How should I convert an RDD of org.apache.spark.ml.linalg.Vector to Dataset?

I'm struggling to understand how the conversion among RDDs, DataSets and DataFrames works.
I'm pretty new to Spark, and I get stuck every time I need to pass from a data model to another (especially from RDDs to Datasets and Dataframes).
Could anyone explain me the right way to do it?
As an example, now I have a RDD[org.apache.spark.ml.linalg.Vector] and I need to pass it to my machine learning algorithm, for example a KMeans (Spark DataSet MLlib). So, I need to convert it to Dataset with a single column named "features" which should contain Vector typed rows. How should I do this?
All you need is an Encoder. Imports
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.ml.linalg
RDD:
val rdd = sc.parallelize(Seq(
linalg.Vectors.dense(1.0, 2.0), linalg.Vectors.sparse(2, Array(), Array())
))
Conversion:
val ds = spark.createDataset(rdd)(ExpressionEncoder(): Encoder[linalg.Vector])
.toDF("features")
ds.show
// +---------+
// | features|
// +---------+
// |[1.0,2.0]|
// |(2,[],[])|
// +---------+
ds.printSchema
// root
// |-- features: vector (nullable = true)
To convert a RDD to a dataframe, the easiest way is to use toDF() in Scala. To use this function, it is necessary to import implicits which is done using the SparkSession object. It can be done as follows:
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val df = rdd.toDF("features")
toDF() takes an RDD of tuples. When the RDD is built up of common Scala objects they will be implicitly converted, i.e. there is no need to do anything, and when the RDD has multiple columns there is no need to do anything either, the RDD already contains a tuple. However, in this special case you need to first convert RDD[org.apache.spark.ml.linalg.Vector] to RDD[(org.apache.spark.ml.linalg.Vector)]. Therefore, it is necessary to do a convertion to tuple as follows:
val df = rdd.map(Tuple1(_)).toDF("features")
The above will convert the RDD to a dataframe with a single column called features.
To convert to a dataset the easiest way is to use a case class. Make sure the case class is defined outside the Main object. First convert the RDD to a dataframe, then do the following:
case class A(features: org.apache.spark.ml.linalg.Vector)
val ds = df.as[A]
To show all possible convertions, to access the underlying RDD from a dataframe or dataset can be done using .rdd:
val rdd = df.rdd
Instead of converting back and forth between RDDs and dataframes/datasets it's usually easier to do all the computations using the dataframe API. If there is no suitable function to do what you want, usually it's possible to define an UDF, user defined function. See for example here: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-udfs.html

Resources