How to use Pyspark's csv reader on every element of Pyspark RDD? (without "reference SparkContext from a broadcast variable") - python-3.x

I want to use Pyspark to read in hundreds of csv files, create a single dataframe that is (roughly) the concatenation of all the csvs. Since each csv can fit in memory, but not more than one or two at a time, this seems like a good fit for Pyspark. My strategy is not working, and I think it is because I want to make a Pyspark dataframe in the kernel function of my map function resulting in an error:
# initiate spark session and other variables
sc = SparkSession.builder.master("local").appName("Test").config(
"spark.driver.bindAddress", "127.0.0.1").getOrCreate()
file_path_list = [path1, path2] ## list of string path variables
# make an rdd object so i can use .map:
rdd = sc.sparkContext.parallelize(file_path_list)
# make a kernel function for my future .map() application
def kernel_f(path):
df = sc.read.options(delimiter=",", header=True).csv(path)
return df
# apply .map
rdd2 = rdd.map(kernel_f)
# see first dataframe (so excited)
rdd2.take(2)[0].show(3)
this throws an error:
PicklingError: Could not serialize object: RuntimeError: It appears
that you are attempting to reference SparkContext from a broadcast
variable, action, or transformation. SparkContext can only be used on
the driver, not in code that it run on workers. For more information,
see SPARK-5063.
My next step (supposing no error had appeared) was to use a reduce step to concatenate all the members (dataframes with same schema) of that rdd2
It seems related to this post but I don't understand the answer.
Questions:
I think this means is that since my kernel_f calls sc. methods, it is against the rules. Is that right?
I (think I) could use plain-old python (not pyspark) function map to apply the kernel_f to my file_path_list, then use plain-old functools.reduce to concatenate all these into a single pyspark dataframe, but that doesn't seem to leverage pyspark much. Does this seem like a good route?
Can you teach me a good, ideally a "tied-for-best" way to do this?

I don't have a definitive answer but just comments that might help. First off, I think the easiest way to do this is read the CSVs with a wildcard like shown here
A Spark cluster is composed of the scheduler and the workers. You use the SparkSession to pass work to the scheduler. It seems they don't allow workers sending work to the scheduler, which seems like it can be an anti-pattern in a lot of use cases.
The design pattern is also weird here because you are not actually passing a DataFrame back. Spark operations are lazy unlike Pandas so that read is not happening immediately. I feel like if it worked, it would pass a DAG back, not data.
It doesn't sound good because you want loading of files to be lazy. Given you can't use spark to read on a worker, you'd have to use Pandas/Python which evaluate immediately. You will run out of memory trying this even more.
Speaking of memory, Spark lets you perform out-of-memory computation but there are limits to how big can be out-of-memory relative to the memory available. You will inevitably run into errors if you really don't have enough memory by a considerable margin.
I think you should use the wildcard as shown above.

Related

Creating Spark RDD or Dataframe from an External Source

I am building a substantial application in Java that uses Spark and Json. I anticipate that the application will process large tables, and I want to use Spark SQL to execute queries against those tables. I am trying to use a streaming architecture so that data flows directly from an external source into Spark RDDs and dataframes. I'm having two difficulties in building my application.
First, I want to use either JavaSparkContext or SparkSession to parallelize the data. Both have a method that accepts a Java List as input. But, for streaming, I don't want to create a list in memory. I'd rather supply either a Java Stream or an Iterator. I figured out how to wrap those two objects so that they look like a List, but it cannot compute the size of the list until after the data has been read. Sometimes this works, but sometimes Spark calls the size method before the entire input data has been read, which causes an unsupported operation exception.
Is there a way to create an RDD or a dataframe directly from a Java Stream or Iterator?
For my second issue, Spark can create a dataframe directly from JSON, which would be my preferred method. But, the DataFrameReader class has methods for this operation that require a string to specify a path. The nature of the path is not documented, but I assume that it represents a path in the file system or possibly a URL or URI (the documentation doesn't say how Spark resolves the path). For testing, I'd prefer to supply the JSON as a string, and in the production, I'd like the user to specify where the data resides. As a result of this limitation, I'm having to roll my own JSON deserialization, and it's not working because of issues related to parallelization of Spark tasks.
Can Spark read JSON from an InputStream or some similar object?
These two issues seem to really limit the adaptability of Spark. I sometimes feel that I'm trying to fill an oil tanker with a garden hose.
Any advice would be welcome.
Thanks for the suggestion. After a lot of work, I was able to adapt the example at github.com/spirom/spark-data-sources. It is not straightforward, and because the DataSourceV2 API is still evolving, my solution may break in a future iteration. The details are too intricate to post here, so if you are interested, please contact me directly.

Lazy loading of partitioned parquet in Apache Spark

As I understand it, Apache Spark uses lazy evaluation. So for example code like the following that consists only of transformations will do no actual processing:
val transformed_df = df.filter("some_field = 10").select("some_other_field", "yet_another_field")
Only when we do an "action" will any processing actually occur:
transformed_df.show()
I had been under the impression that load operations are also lazy in spark. (See How spark loads the data into memory.)
However, my experiences with spark have not borne this out. When I do something like the following,
val df = spark.read.parquet("/path/to/parquet/")
execution seems to depend greatly on the size of the data in the path. In other words, it's not strictly lazy. This is inconvenient if the data is partitioned and I only need to look at a fraction of the partitions.
For example:
df.filter("partitioned_field = 10").show()
If the data is partitioned in storage on "partitioned_field", I would have expected spark to wait until show() is called, and then read only data under "/path/to/parquet/partitioned_field=10/". But again, this doesn't seem to be the case. Spark appears to perform at least some operations on all of the data as soon as read or load is called.
I could get around this by only loading /path/to/parquet/partitioned_field=10/ in the first place, but this is much less elegant than just calling "read" and filtering on the partitioned field, and it's harder to generalize.
Is there a more elegant preferred way to lazily load partitions of parquet data?
(To clarify, I am using Spark 2.4.3)
I think I've stumbled on an answer to my question while learning about a key distinction that is often overlooked when talking about lazy evaluation in spark.
Data is lazily evaluated, but schemas are not. So if we are reading parquet, which is a structured data type, spark does have to at least determine the schema of any files it's reading as soon as read() or load() is called. So calling read() on a large number of files will take longer than on a small number of files.
Given that partitions are part of the schema, it's less surprising to me now that spark has to look at all of the files in the path to determine the schema before filtering on a partition field.
It would be convenient for my purposes if spark were to wait until schema evaluation was strictly necessary and was able to filter on partition fields prior to determining the rest of the schema, but it sounds like this is not the case. I believe Dataset objects always must have a schema, so I'm not sure there's a way around this problem without significant changes to Spark.
In conclusion, it seems like my only option currently is to pass in a list of paths for the partitions that I need rather than the base path if I want to avoid evaluating the schema over the entire data repository.

Pyspark Checkpointing in Dataproc(StackOverFlowError)

I faced stackoveroverflow error when persisting dataset with pyspark. I am casting whole dataframe into doubletype and then persisting to calculate statistics, and I read that checkpointing is a solution to stackoverflow. However, I am having trouble with implementing it in dataproc.
I am working with pyspark, and when I checkpoint the dataframe and checkedpointed with df.isCheckpointed(), it returns false. However, when I debug it, df.rdd.is_checkpointed says True. Is there any issue with the package / am I doing something wrong?
I thought localCheckpoint is more appropriate for my purpose(https://spark.apache.org/docs/2.3.1/api/java/org/apache/spark/rdd/RDD.html#localCheckpoint()), as my problem was simply DAG depth was too deep, but I couldn't find any use case. Also, if I just checkpointed RDD says it is checkpointed(as in first question), but if I tried localcheckpoint, it says it is not. Has anyone tried this function?
After I tried with local standalone mode, I tried it with dataproc. I tried both hdfs and google cloud storage, but either way the storage was empty, but rdd says it is checkpointed.
Can anyone help me with this? Thanks in advance!
If you're using localCheckpoint, it will write to the local disk of executors, not to HDFS/GCS: https://spark.apache.org/docs/2.3.1/api/java/org/apache/spark/rdd/RDD.html#localCheckpoint--.
Also note that there's an eager (checkpoint immediately) and non-eager (checkpoint when the RDD is actually materialized) mode to checkpointing. That may affect what those methods return. The code is often the best documentation: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1615.
In general, please attach sample code (repros) to questions like this -- that way we can answer your question more directly.
Create a temp dir using your SparkSession object:
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
dataframe_name = # Any Spark Dataframe of type <pyspark.sql.dataframe.DataFrame>
at this point of time, the dataframe_name would be a DAG, which you can store as a checkpoint like,
dataframe_checkpoint = dataframe_name.checkpoint()
dataframe_checkpoint is also a spark dataframe of type <pyspark.sql.dataframe.DataFrame> but instead of the DAG, it stores the result of the query
Use checkpoints if:
the computation takes a long time
the computing chain is too long
depends too many RDDs

Spark on localhost

For testing purposes, while I donĀ“t have production cluster, I am using spark locally:
print('Setting SparkContext...')
sconf = SparkConf()
sconf.setAppName('myLocalApp')
sconf.setMaster('local[*]')
sc = SparkContext(conf=sconf)
print('Setting SparkContext...OK!')
Also, I am using a very very small dataset, consisting of only 20 rows in a postgresql database ( ~2kb)
Also(!), my code is quite simple as well, only grouping 20 rows by a key and applying a trivial map operation
params = [object1, object2]
rdd = df.rdd.keyBy(lambda x: (x.a, x.b, x.c)) \
.groupByKey() \
.mapValues(lambda value: self.__data_interpolation(value, params))
def __data_interpolation(self, data, params):
# TODO: only for testing
return data
What bothers me is that the whole execution takes about 5 minutes!!
Inspecting the Spark UI, I see that most of the time was spent in Stage 6: byKey method. (Stage 7, collect() method was also slow...)
Some info:
These numbers make no sense to me... Why do I need 22 tasks, executing for 54 sec, to process less than 1 kb of data
Can it be a network issue, trying to figure out the ip address of localhost?
I don't know... Any clues?
It appears the main reason for the slower performance in your code snippet is due to the use of groupByKey(). The issue with groupByKey is that it ends up shuffling all of the key-value pairs resulting in a lot of data unnecessarily being transferred. A good reference to explain this issue is Avoid GroupByKey.
To work around this issue, you can:
Try using reduceByKey which should be faster (more info is also included in the above Avoid GroupByKey link).
Use DataFrames (instead of RDDs) as DFs include performance optimizations (and the DF GroupBy statement is faster than the RDD version). As well, as you're using Python, you can avoid the Python-to-JVM issues with PySpark RDDs. More information on this can be seen in PySpark Internals
By the way, reviewing the Spark UI diagram above, the #22 refers to the task # within the DAG (not the number of tasks executed).
HTH!
I suppose the "postgresql" is the key to solve that puzzle.
keyBy is probably the first operation that really uses the data so it's execution time is bigger as it needs to get the data from external database. You can verify it by adding at the beginning:
df.cache()
df.count() # to fill the cache
df.rdd.keyBy....
If I am right, you need to optimize the database. It may be:
Network issue (slow network to DB server)
Complicated (and slow) SQL on this database (try it using postgre shell)
Some authorization difficulties on DB server
Problem with JDBC driver you use
From what I have seen happening in my system while running spark:
When we run a spark job it internally creates map and reduce tasks and runs them. In your case, to run the data you have, it created 22 such tasks. I bigger the size of data the number may be big.
Hope this helps.

Parallel writing from same DataFrame in Spark

Let's say I have a DataFrame in Spark and I need to write the results of it to two databases, where one stores the original data frame but the other stores a slightly modified version (e.g. drops some columns). Since both operations can take a few moments, is it possible/advisable to run these operations in parallel or will that cause problems because Spark is working on the same object in parallel?
import java.util.concurrent.Executors
import scala.concurrent._
implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(10))
def write1(){
//your save statement for first dataframe
}
def write2(){
//your save statement for second dataframe
}
def writeAllTables() {
Future{ write1()}
Future{ write2()}
}
Let me ask you, do you really need to do it? If you are not sure, then, you most probably don't.
So, lets assume below scenario similar to one you explained:
val df1 = spark.read.csv('someFile.csv') // Original Dataframe
val df2 = df1.withColumn("newColumn", concat(col("oldColumn"), lit(" is blah!"))) // Modified Dataframe, FYI, this df2 is a different object
df1.write('db_loc1') // Write to DB1, already parallelised & uses spark resources optimally
df2.write('db_loc2') // Write to DB2, already parallelised & uses spark resources optimally
Spark scheduler divides the first DataFrame df1 into partitions and writes them in parallel in db_loc1.
It picks up the second DataFrame df2 and again breaks it into partitions and writes these partitions in parallel in db_loc2.
By default, the degree of parallelisation per write is speculated in order to optimally use available cluster resources.
Small writes might not be repartitioned as mostly the write time is low and repartitioning will only increase overhead. In a extraordinary case where you have a lot of small writes, it might make a good case for trying to parallelise these writes. But, the best way to do so is to redesign your code to run one spark job per DataFrame instead of trying to parallelise DataFrame.write() call in same driver program.
Large writes will probably use all available resources in parallel during the single DataFrame write itself. Hence, if spark allowed issuing another write operation for a different DataFrame at the same time, it would only delay both operations as now they are racing with each other for resources. Not to mention, there may be some performance slowdown due to increased overhead because of sheer increase in number of tasks that spark now needs to manage and track.
Also, You can read this answer to and learn more about this

Resources