I have a huge table which I am loading from Redshift into a CSV file on S3 using a Databricks (DBx) notebookA. This notebook runs on clusterA.
I have another notebookB, which reads the data from the CSV files on S3 into a dataframe. notebookB runs on clusterB.
Now, I want to access this dataframe in a third notebookC, which is on clusterC.
How can I do this?
registerTempTable is session-specific.
createGlobalTempView can be accessed across notebooks, but not across notebooks on different clusters.
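For reference, a minimal sketch of the createGlobalTempView approach, which works across notebooks on the same cluster (the view name is hypothetical):

# NotebookB on clusterB: register the DataFrame as a global temp view.
df.createGlobalTempView("people_view")  # hypothetical view name

# Any other notebook attached to the SAME cluster can read it back:
df2 = spark.table("global_temp.people_view")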
Related
This question is almost a replica of the requirement here: Writing files to local system with Spark in Cluster mode
but my query has a twist. The page above writes files from HDFS directly to the local filesystem using Spark, but only after converting the data to an RDD.
I'm looking for options that work with just the DataFrame; converting huge data to an RDD takes a toll on resource utilisation.
You can use the syntax below to write a dataframe directly to the HDFS filesystem.
df.write.format("csv").save("/path/in/hdfs")
Refer to the Spark docs for more details: https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#generic-loadsave-functions
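For the cross-cluster requirement specifically, the simplest pattern is to persist the dataframe to storage both clusters can reach and rebuild it on the other side. A minimal sketch, assuming a shared S3 path (the path is hypothetical):

# NotebookB on clusterB: persist the DataFrame to shared storage.
df.write.mode("overwrite").parquet("s3a://shared-bucket/staging/people")  # hypothetical path

# NotebookC on clusterC: rebuild the DataFrame from the same path.
df = spark.read.parquet("s3a://shared-bucket/staging/people")

Parquet is used here rather than CSV so the schema travels with the data.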
You can use S3 Select with Spark on Amazon EMR and with Databricks, but only for CSV and JSON files. I am guessing that S3 Select isn't offered for columnar file formats because it wouldn't help that much.
Let's say we have a data lake of people with first_name, last_name and country columns.
If the data is stored as CSV files and you run a query like peopleDF.select("first_name").distinct().count(), then S3 transfers all the data for all the columns to the EC2 cluster to run the computation. This is really inefficient because we don't need any of the last_name and country data to run this query.
If the data is stored as CSV files and you run the query with S3 Select, then S3 only transfers the data in the first_name column to run the query.
spark
  .read
  .format("s3select")
  .schema(...)
  .options(...)
  .load("s3://bucket/filename")
  .select("first_name")
  .distinct()
  .count()
If the data is stored in a Parquet data lake and peopleDF.select("first_name").distinct().count() is run, then S3 only transfers the data in the first_name column to the EC2 cluster. Parquet is a columnar file format, and this is one of its main advantages.
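For comparison, a sketch of the same query against a Parquet data lake, where the column pruning comes from the file format itself rather than from S3 Select (the path is hypothetical):

# Parquet is columnar, so the reader fetches only the first_name column.
peopleDF = spark.read.parquet("s3a://bucket/people")  # hypothetical path
peopleDF.select("first_name").distinct().count()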
So based on my understanding, S3 Select wouldn't help speed up an analysis on a Parquet data lake because columnar file formats offer the S3 Select optimization out of the box.
I am not sure, because a coworker is certain I am wrong, and because S3 Select does support the Parquet file format. Can you please confirm that columnar file formats already provide the main optimization offered by S3 Select?
This is an interesting question. I don't have any real numbers, though I did write the S3 Select binding code in the hadoop-aws module. Amazon EMR has some numbers, as does Databricks.
For CSV I/O: yes, S3 Select will give a speedup when you filter the source data aggressively, e.g. scanning many GB of data but getting little back. Why? Although the read is slower, you save on the limited bandwidth to your VM.
For Parquet, though, the workers split a large file into parts and schedule the work across them (assuming a splittable compression format like Snappy is used), so more than one worker can work on the same file. They only read a fraction of the data (== the bandwidth benefit is smaller), but they do seek around in that file (== you need to optimise the seek policy, or you pay the cost of aborting and reopening HTTP connections).
I'm not convinced that Parquet reads done on the S3 side (via S3 Select) can beat a Spark cluster, provided there's enough capacity in the cluster and you've tuned your S3 client settings (for s3a this means: seek policy, thread pool size, HTTP pool size) for performance too.
Like I said though: I'm not sure. Numbers are welcome.
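For what it's worth, here is a minimal sketch of the s3a client tuning mentioned above, set via the spark.hadoop. config prefix; the values are illustrative examples, not recommendations:

from pyspark.sql import SparkSession

# Illustrative s3a tuning; benchmark against your own workload.
spark = (SparkSession.builder
    .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")  # seek policy for columnar reads
    .config("spark.hadoop.fs.s3a.threads.max", "64")                     # thread pool size
    .config("spark.hadoop.fs.s3a.connection.maximum", "128")             # HTTP connection pool size
    .getOrCreate())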
I came across this Spark package for S3 Select on Parquet [1].
[1] https://github.com/minio/spark-select
We are currently running a Spark job on Databricks which processes our data lake in S3.
Once the processing is done, we export the result to an S3 bucket using a normal
df.write.save(...)
The issue is that when we write the dataframe to S3, the file names are controlled by Spark, but as per our agreement we need to rename these files to meaningful names.
Since S3 doesn't have a rename feature, we are currently using boto3 to copy the files over under the expected names.
This process is very complex and doesn't scale as more clients come on board.
Is there a better solution for renaming the files Spark exports to S3?
It's not possible to do this directly in Spark's save.
Spark writes through the Hadoop file formats, which require the data to be partitioned - that's why you get part- files. If the data is small enough to fit into memory, one workaround is to convert it to a pandas dataframe and save it as CSV from there:
df_pd = df.toPandas()   # collects the whole DataFrame onto the driver
df_pd.to_csv("path")    # writes a single CSV file with a name you control
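For completeness, a minimal sketch of the boto3 copy-and-delete rename described in the question (bucket and key names are hypothetical):

import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"  # hypothetical bucket
# Find the part- file Spark wrote under the output prefix.
resp = s3.list_objects_v2(Bucket=bucket, Prefix="exports/run1/")
part_key = next(o["Key"] for o in resp["Contents"] if "part-" in o["Key"])
# S3 has no rename: copy to the agreed name, then delete the original.
s3.copy_object(Bucket=bucket, Key="exports/run1/report.csv",
               CopySource={"Bucket": bucket, "Key": part_key})
s3.delete_object(Bucket=bucket, Key=part_key)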
How can I load a bunch of files from an S3 bucket into a single PySpark dataframe? I'm running on an EMR instance. If the files are local, I can use the SparkContext textFile method. But when the files are on S3, how can I use boto3 to load multiple files of various types (CSV, JSON, ...) into a single dataframe for processing?
Spark natively reads from S3 using the Hadoop APIs, not boto3. And textFile reads into an RDD, not a DataFrame. Also, do not try to load two different formats into a single dataframe, as you won't be able to parse them consistently.
I would suggest using:
csvDf = spark.read.csv("s3a://path/to/files/*.csv")
jsonDf = spark.read.json("s3a://path/to/files/*.json")
And from there, you can filter and join the dataframes using SparkSQL.
Note: each JSON file needs to contain one JSON object per line (JSON Lines); for multi-line JSON, set the multiLine option on the reader.
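From there, a sketch of the filter-and-join step (the column names are hypothetical):

# Hypothetical join key and filter column.
combined = csvDf.join(jsonDf, on="id", how="inner")
combined.filter(combined["amount"] > 0).show()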
I am trying to understand which of the two options below would be the better choice, especially in a Spark environment:
Loading the parquet file directly into a dataframe and accessing the data (a 1 TB table)
Using a database to store and access the data.
I am working on a data pipeline design and trying to understand which of the above two options will result in a more optimized solution.
Loading the Parquet file directly into a dataframe and accessing the data is more scalable than reading from an RDBMS like Oracle through a JDBC connector. I handle more than 10 TB of data, though I prefer the ORC format for better performance. I suggest reading the data directly from files. The reason is data locality: if you run your Spark executors on the same hosts where the HDFS data nodes are located, they can read data into memory efficiently, without network overhead. See https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-data-locality.html and How does Apache Spark know about HDFS data nodes? for more details.
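To make the comparison concrete, a minimal sketch of both approaches (paths, URLs, and table names are hypothetical):

# Option 1: read the columnar files directly; scales with the cluster.
df_files = spark.read.parquet("hdfs:///warehouse/events")  # hypothetical path

# Option 2: read the same data through JDBC; throughput is bounded by the database.
df_jdbc = (spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")  # hypothetical URL
    .option("dbtable", "events")                            # hypothetical table
    .option("partitionColumn", "id")   # parallelise the read to reduce the bottleneck
    .option("lowerBound", 0)
    .option("upperBound", 1000000)
    .option("numPartitions", 16)
    .load())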