Recently I came across a requirement where I tried to replace Python's re with PySpark's regexp_extract; the reason for the change was that I expected Spark to be faster. After comparing the processing speed of the two, I concluded that re is actually faster than PySpark's regexp_extract. Is there any specific reason why regexp_extract is slower?
Thanks in advance.
Probably more context is needed to give a specific answer, but here is what I can infer from what you said:
I would think it depends on the size of the data and how the partitions are set up in Spark. Since Spark parallelizes the work, regular Python functions will probably be faster on data that isn't huge, while Spark pays off on large amounts of data where parallelization really helps.
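To make the comparison concrete, here is a minimal sketch of the two approaches being compared; the data, column name and pattern are made up for illustration. On small, single-machine data the plain re loop avoids Spark's job-scheduling and serialization overhead, which is typically why it wins.

import re
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract

spark = SparkSession.builder.master("local[*]").appName("regex-compare").getOrCreate()

data = [("order_123",), ("order_456",)]  # tiny toy data
pattern = r"order_(\d+)"

# Plain Python: runs in a single process, no scheduling or serialization overhead
py_result = [re.search(pattern, s).group(1) for (s,) in data]

# PySpark: distributed, but every job carries planning and JVM round-trip overhead
df = spark.createDataFrame(data, ["raw"])
spark_result = df.select(regexp_extract("raw", pattern, 1).alias("id")).collect()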
I want to use PySpark to read in hundreds of CSV files and create a single dataframe that is (roughly) the concatenation of all the CSVs. Since each CSV fits in memory, but not more than one or two at a time, this seems like a good fit for PySpark. My strategy is not working, and I think it is because I try to make a PySpark dataframe inside the kernel function of my map call, which results in an error:
from pyspark.sql import SparkSession

# initiate spark session and other variables
sc = SparkSession.builder.master("local").appName("Test").config(
    "spark.driver.bindAddress", "127.0.0.1").getOrCreate()

file_path_list = [path1, path2]  # list of string path variables

# make an rdd object so I can use .map:
rdd = sc.sparkContext.parallelize(file_path_list)

# make a kernel function for my future .map() application
def kernel_f(path):
    df = sc.read.options(delimiter=",", header=True).csv(path)
    return df

# apply .map
rdd2 = rdd.map(kernel_f)

# see first dataframe (so excited)
rdd2.take(2)[0].show(3)
This throws an error:
PicklingError: Could not serialize object: RuntimeError: It appears
that you are attempting to reference SparkContext from a broadcast
variable, action, or transformation. SparkContext can only be used on
the driver, not in code that it run on workers. For more information,
see SPARK-5063.
My next step (supposing no error had appeared) was to use a reduce step to concatenate all the members (dataframes with the same schema) of that rdd2.
It seems related to this post but I don't understand the answer.
Questions:
I think this means that since my kernel_f calls sc. methods, it is against the rules. Is that right?
I (think I) could use plain-old Python's map (not PySpark) to apply kernel_f to my file_path_list, then use plain-old functools.reduce to concatenate all of these into a single PySpark dataframe, but that doesn't seem to leverage PySpark much. Does this seem like a good route?
Can you teach me a good, ideally a "tied-for-best" way to do this?
I don't have a definitive answer, just comments that might help. First off, I think the easiest way to do this is to read the CSVs with a wildcard, as shown here.
A Spark cluster is composed of the scheduler and the workers. You use the SparkSession to pass work to the scheduler. It seems workers are not allowed to send work back to the scheduler, presumably because that would be an anti-pattern in a lot of use cases.
The design pattern is also odd here because you are not actually passing a DataFrame back. Spark operations are lazy, unlike Pandas, so the read does not happen immediately. I suspect that if it worked, it would pass a DAG back, not data.
Your plain-Python map/functools.reduce idea doesn't sound good, because you want the loading of files to be lazy. Given that you can't use Spark to read on a worker, you would have to use Pandas/Python, which evaluates immediately, so you would be even more likely to run out of memory.
Speaking of memory, Spark lets you perform out-of-memory computation, but there are limits to how much larger than the available memory your data can be. You will inevitably run into errors if you really don't have enough memory by a considerable margin.
I think you should use the wildcard as shown above, for example:
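Here is a minimal sketch of the wildcard read (the directory path is a placeholder); Spark parallelizes the file reads itself and returns one dataframe, so no map/reduce over paths is needed.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("Test").getOrCreate()

# one call reads every matching CSV into a single dataframe
df = spark.read.options(delimiter=",", header=True).csv("/path/to/csv_dir/*.csv")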
I need to cache a dataframe in PySpark (2.4.4), and the in-memory caching is slow.
I benchmarked Pandas against Spark caching by reading the same CSV file; Pandas was 3-4 times faster.
Thanks in advance.
You are comparing apples and oranges. Pandas is a single-machine, single-core data analysis library, whereas PySpark is a distributed (cluster computing) data analysis engine. That means you will never outperform Pandas reading a small file on a single machine with PySpark, due to the overhead (distributed architecture, JVM, ...). It also means that PySpark will outperform Pandas as soon as your file exceeds a certain size.
You as a developer have to choose the solution that best fits your requirements. If Pandas is faster for your project and you don't expect a huge increase in data in the future, use Pandas. Otherwise use PySpark, or Dask, or ...
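As a rough sketch of what such a comparison usually looks like (the file name is a placeholder): note that cache() is lazy, so an action such as count() is needed before any caching cost is actually paid, and on a small file the JVM and serialization overhead dominates.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("cache-compare").getOrCreate()

# Pandas: reads the whole file eagerly into memory on one core
pdf = pd.read_csv("data.csv")

# Spark: read and cache are lazy; count() materializes the cache
sdf = spark.read.csv("data.csv", header=True, inferSchema=True)
sdf.cache()
sdf.count()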
I know that you can convert a spark dataframe df into a pandas dataframe with
df.toPandas()
However, this is taking very long, so I found out about the Koalas package in Databricks, which could let me work with the data through a pandas-like API (for instance, being able to use scikit-learn) without materializing a Pandas dataframe. I already have the Spark dataframe, but I cannot find a way to convert it into a Koalas one.
To go straight from a pyspark dataframe (I am assuming that is what you are working with) to a koalas dataframe you can use:
koalas_df = ks.DataFrame(your_pyspark_df)
Here I've imported koalas as ks.
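For reference, a minimal sketch of the import this assumes (Koalas ships as the databricks.koalas package; on Spark 3.2+ the equivalent API lives in pyspark.pandas):

import databricks.koalas as ks

koalas_df = ks.DataFrame(your_pyspark_df)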
Well, first of all, you have to understand why toPandas() takes so long:
A Spark dataframe is distributed across different nodes, and when you run toPandas() it pulls the distributed dataframe back to the driver node (that's why it takes a long time).
You are then able to use Pandas or scikit-learn on the single (driver) node for faster analysis and modeling, because it's like modeling on your own PC.
Koalas is the pandas API on Spark. When you convert to a Koalas dataframe, it is still distributed, so it will not shuffle data between different nodes, and you can use pandas-like syntax for distributed dataframe transformations.
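As a small illustration (the column names are hypothetical), pandas-style syntax on a Koalas dataframe is translated into Spark operations, so the data stays distributed and is only collected when you ask for a result.

import databricks.koalas as ks

kdf = ks.DataFrame(your_pyspark_df)  # still backed by Spark, nothing is collected yet
# pandas-like syntax, executed as distributed Spark jobs under the hood
summary = kdf.groupby("some_column")["some_value"].mean()
print(summary.head())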
I'm running PySpark on a single server with multiple CPUs. All other operations (reading, joining, filtering, custom UDFs) execute quickly, except for writing to disk. The dataframe I'm trying to save is around ~400 GB with 200 partitions.
sc.getConf().getAll()
The driver memory is 16g, and the working directory has enough space (> 10 TB).
I'm trying to save using the following command:
df.repartition(1).write.csv("out.csv")
Wondering if anyone has run into the same issue. Also will changing any of the config parameters before pyspark is invoked help solve the issue?
Edits (a few clarifications):
When I say the other operations were executed quickly, I mean there was always an action after each transformation, in my case row counts, so all of those operations really did run fast. I still haven't figured out why writing takes such a ridiculous amount of time.
One of my colleagues pointed out that the disks in our server might have a limit on concurrent writes, which might be slowing things down; I'm still investigating this. I'm interested in knowing whether others are seeing slow write times on a Spark cluster too. I have confirmation from one user regarding this on an AWS cluster.
All other operations (reading, joining, filtering, custom UDFs)
These are fast because they are transformations - they don't do anything until the data actually has to be saved.
The dataframe I'm trying to save is of size around ~400 gb
(...)
I'm trying to save using the following command:
df.repartition(1).write.csv("out.csv")
That just cannot work well. Even ignoring the part where you use a single machine, saving 400 GB with a single thread (!) is hopeless. Even if it succeeds, it is no better than using a plain bash script.
Spark aside, sequential writes of 400 GB will take a substantial amount of time, even on an average-sized disk. And given the multiple shuffles (join, repartition), the data will be written to disk multiple times.
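A sketch of what usually works better (the output paths are placeholders): keep many partitions so each task writes its own part-file in parallel, and only reduce the partition count moderately if fewer, larger files are wanted; merge outside Spark if a single file is truly required.

# let each of the 200 partitions write its own part-file in parallel
df.write.csv("/path/to/out_dir", header=True, mode="overwrite")

# if fewer, larger files are wanted, coalesce moderately instead of repartition(1)
df.coalesce(20).write.csv("/path/to/out_dir_coalesced", header=True, mode="overwrite")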
After a lot of trial and error, I realized that the issue was due to the method I used to read the file from disk. I was using the built-in read.csv function, and when I switched over to the read function in the databricks-csv package the problem went away. I'm now able to write files to disk in a reasonable time. It's really strange; maybe it's a bug in 2.1.1, or the databricks-csv package is really well optimized.
1. The read.csv method
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("model") \
    .config("spark.worker.dir", "xxxx") \
    .getOrCreate()

df = spark.read.load("file.csv", format="csv", header=True)
df.write.csv("file_after_processing.csv")
2. Using the databricks-csv package
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # sc is the existing SparkContext
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('file.csv')
df.write.format('com.databricks.spark.csv').save('file_after_processing.csv')
For the purposes of comparison, suppose we have a table "T" with two columns "A" and "B", and a hiveContext operating on some HDFS database. We make a data frame from T.
In theory, which of the following is faster:
sqlContext.sql("SELECT A,SUM(B) FROM T GROUP BY A")
or
df.groupBy("A").sum("B")
where "df" is a dataframe referring to T. For these simple kinds of aggregate operations, is there any reason why one should prefer one method over the other?
No, these should boil down to the same execution plan. Underneath, the Spark SQL engine uses the same optimization engine, the Catalyst optimizer. You can always check this yourself by looking at the Spark UI, or even by calling explain on the resulting DataFrames.
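For instance, a minimal sketch (assuming T is already registered as a table or view, as in the question) that lets you compare the physical plans yourself:

sql_df = sqlContext.sql("SELECT A, SUM(B) FROM T GROUP BY A")
api_df = df.groupBy("A").sum("B")

# both should print essentially the same physical plan produced by Catalyst
sql_df.explain()
api_df.explain()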
The Spark developers have put great effort into optimisation. The performance of DataFrame Scala code and DataFrame SQL is indistinguishable. Even for DataFrame Python, the difference only appears when collecting data to the driver.
It opens up a new world:
It doesn't have to be one vs. the other;
we can just choose whichever way we are comfortable with.
See the performance comparison published by Databricks.