Reading Files from S3 Bucket to PySpark Dataframe Boto3 - apache-spark

How can I load a bunch of files from an S3 bucket into a single PySpark DataFrame? I'm running on an EMR instance. If the files were local, I could use the SparkContext textFile method. But when the files are on S3, how can I use boto3 to load multiple files of various types (CSV, JSON, ...) into a single DataFrame for processing?

Spark natively reads from S3 using the Hadoop APIs, not Boto3. Also, textFile returns an RDD, not a DataFrame. And don't try to load two different formats into a single DataFrame, as you won't be able to parse them consistently.
I would suggest using:
csvDf = spark.read.csv("s3a://path/to/files/*.csv")
jsonDf = spark.read.json("s3a://path/to/files/*.json")
From there, you can filter and join the DataFrames using Spark SQL.
Note: the JSON files need to contain one JSON object per line (the JSON Lines format that spark.read.json expects by default).
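For example, a minimal sketch of the filter-and-join step via temp views (the column names id, status, and extra_field are made up for illustration):
csvDf.createOrReplaceTempView("csv_data")
jsonDf.createOrReplaceTempView("json_data")

# Join the CSV- and JSON-backed DataFrames on a shared key and filter rows
joined = spark.sql("""
    SELECT c.*, j.extra_field
    FROM csv_data c
    JOIN json_data j ON c.id = j.id
    WHERE c.status = 'active'
""")
joined.show()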

Related

Automatically update Glue Data Catalog after writing parquet to S3 using Pyspark

Is there a way to automatically update Glue Data Catalog (without having a separate crawler) after writing parquet files to S3 from PySpark?
The awswrangler Python package can do it by specifying database and table in s3.to_parquet(). Is there an equivalent way to do it in PySpark?
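For reference, the awswrangler call the question refers to looks roughly like this; note it operates on a pandas DataFrame rather than a Spark one, and the bucket, database, and table names are placeholders:
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})

# Write parquet to S3 and create/update the Glue Data Catalog table in one call
wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/my-table/",   # hypothetical S3 location
    dataset=True,                      # required for catalog integration
    database="my_database",            # Glue database (placeholder)
    table="my_table",                  # Glue table (placeholder)
    mode="overwrite",
)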

AWS Glue append to parquet file

I am currently in the process of designing an AWS-backed data lake.
What I have right now:
XML files uploaded to S3
AWS Glue crawler builds the catalogue
AWS Glue ETL job transforms the data and saves it in parquet format.
Each time the ETL job transforms the data, it creates new parquet files. I assume the most efficient way to store my data would be a single parquet file. Is that the case? If so, how do I achieve it?
Auto generated job code: https://gist.github.com/jkornata/b36c3fa18ae04820c7461adb52dcc1a1
You can do that with 'overwrite' mode. Glue itself doesn't support 'overwrite', but you can convert the DynamicFrame object to a Spark DataFrame and write it using Spark instead of Glue:
dropnullfields3.toDF() \
    .write \
    .mode("overwrite") \
    .format("parquet") \
    .save("s3://output-bucket/[nameOfyourFile].parquet")
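If a single parquet file per run really is the goal, one option (an addition beyond the answer above) is to coalesce to one partition before the write; Spark will still create a directory, but it will contain only one part file. A sketch, reusing the dropnullfields3 DynamicFrame and a placeholder path:
(dropnullfields3.toDF()
    .coalesce(1)                # funnels everything through one task: only sensible for modest data volumes
    .write
    .mode("overwrite")
    .format("parquet")
    .save("s3://output-bucket/[nameOfyourFile]/"))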

Renaming Exported files from Spark Job

We are currently running Spark jobs on Databricks that process our data lake, which lives in S3.
Once the processing is done, we export our results to an S3 bucket using a normal
df.write()
The issue is that when we write a DataFrame to S3, the file names are controlled by Spark, but as per our agreement we need to rename these files to meaningful names.
Since S3 doesn't have a rename feature, we are currently using boto3 to copy the files over under the expected names.
This process is complex and doesn't scale as more clients come on board.
Is there a better solution for renaming files exported from Spark to S3?
It's not possible to do this directly in Spark's save.
Spark uses the Hadoop file output format, which requires the data to be partitioned - that's why you get part- files. If the data is small enough to fit in memory, one workaround is to convert it to a pandas DataFrame and save it as CSV from there.
df_pd = df.toPandas()
df_pd.to_csv("path")
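For completeness, the boto3 copy the question describes might look roughly like this (the delete of the original is an assumption beyond the copy the question mentions, and bucket and key names are placeholders):
import boto3

s3 = boto3.client("s3")
bucket = "my-output-bucket"                  # placeholder bucket
old_key = "exports/part-00000-abc123.csv"    # Spark-generated name (placeholder)
new_key = "exports/daily_report.csv"         # meaningful name (placeholder)

# S3 has no rename: copy the object to the new key, then delete the original
s3.copy_object(Bucket=bucket,
               CopySource={"Bucket": bucket, "Key": old_key},
               Key=new_key)
s3.delete_object(Bucket=bucket, Key=old_key)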

Spark DataFrame write as JSON works differently from Spark DataFrame write as Parquet

I have tried converting a CSV file to both JSON and Parquet using Spark. I am using AWS EMR 5.10 with Spark version 2.2.
Below is the sample code I have written for the conversion to JSON:
val data1 = sql("select * from csv_table")
data1.write.json("s3://sparktest/jsonout/")
When the above piece of code is executed, the data is written directly to the S3 directory with file names like "part-000".
Next I tried the Parquet conversion:
val data1 = sql("select * from csv_table")
data1.write.parquet("s3://sparktest/parquetout/")
When executed, the above code first creates a temporary directory named "_temporary" in the S3 output location and later moves the files out of that temporary location.
Why does the DataFrame write operation behave differently in the two cases?

Saving dataframe as a single file to AWS S3 through Apache-Spark

How do I save a DataFrame as a single file to AWS S3? I tried both repartition and coalesce, but it didn't work: the DataFrame is saved as multiple part files in a folder, and I need it saved as a single file (myfile.csv).
val s3path="s3a://***:***#***/myfile.csv"
df.coalesce(1).write.format("com.databricks.spark.csv").save(s3path)
Thanks
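One common workaround, sketched here in PySpark and relying on Spark's internal JVM gateway (spark._jvm / spark._jsc), so treat it as an assumption rather than a supported API: write a single part file with coalesce(1), then rename it to the desired key via the Hadoop FileSystem API. Paths are placeholders:
tmp_path = "s3a://my-bucket/tmp_export/"     # hypothetical temporary prefix
final_path = "s3a://my-bucket/myfile.csv"    # desired single-file key

df.coalesce(1).write.mode("overwrite").option("header", "true").csv(tmp_path)

# Locate the single part file Spark produced and rename it to the target key
jvm = spark._jvm
hadoop_conf = spark._jsc.hadoopConfiguration()
src_dir = jvm.org.apache.hadoop.fs.Path(tmp_path)
fs = src_dir.getFileSystem(hadoop_conf)
part_file = [s.getPath() for s in fs.listStatus(src_dir)
             if s.getPath().getName().startswith("part-")][0]
fs.rename(part_file, jvm.org.apache.hadoop.fs.Path(final_path))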
