AWS Glue performance when writing - apache-spark

After performing joins and aggregations I want the output to be in a single file, partitioned by some column.
When I use repartition(1) the job takes 1 hour; if I remove repartition(1) the job takes 30 minutes, but there are multiple files per partition (refer to the example below).
So is there a way to write the data into one file?
...
...
df = df.repartition(1)
glueContext.write_dynamic_frame.from_options(
    frame = df,
    connection_type = "s3",
    connection_options = {
        "path": "s3://s3path",
        "partitionKeys": ["choice"]
    },
    format = "csv",
    transformation_ctx = "datasink2")
Is there any other way to increase the write performance? Does changing the format help? And how can I achieve parallelism while still getting a single file as output?
S3 storage example
**if repartition(1)** // what I want but takes more time
choice=0/part-00-001
..
..
choice=500/part-00-001
**if removed** // takes less time but multiple files are present
choice=0/part-00-001
....
choice=0/part-00-0032
..
..
choice=500/part-00-001
....
choice=500/part-00-0032

Instead of using df.repartition(1), use df.repartition("choice"):
df = df.repartition("choice")
glueContext.write_dynamic_frame.from_options(
    frame = df,
    connection_type = "s3",
    connection_options = {
        "path": "s3://s3path",
        "partitionKeys": ["choice"]
    },
    format = "csv",
    transformation_ctx = "datasink2")

If the goal is to have one single file, use coalesce instead of repartition; it avoids a full shuffle of the data.
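A minimal sketch of what the coalesce variant could look like, reusing the placeholder path and names from the question; the DynamicFrame conversion is an assumption about how df reaches the Glue writer:
from awsglue.dynamicframe import DynamicFrame

# coalesce(1) merges existing partitions without a full shuffle, unlike
# repartition(1), so it is usually cheaper when reducing the file count.
df = df.coalesce(1)

# Assumption: df is a Spark DataFrame here, so convert it back to a
# DynamicFrame before handing it to the Glue writer.
dyf = DynamicFrame.fromDF(df, glueContext, "dyf")

glueContext.write_dynamic_frame.from_options(
    frame = dyf,
    connection_type = "s3",
    connection_options = {
        "path": "s3://s3path",        # placeholder path from the question
        "partitionKeys": ["choice"]   # still one folder per choice value
    },
    format = "csv",
    transformation_ctx = "datasink2")
With a single Spark partition and partitionKeys set, each choice=<value>/ folder ends up with one output file, at the cost of doing all of the writing in one task.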

Related

Does PySpark run operations out of sequence due to optimization?

I'm confused about the result my code is giving me. Here is the code I wrote:
def update_cassandra(df: DataFrame, aggr: str):
    aggr_map_dict = {
        'Giornaliera': 'day',
        'Settimanale': 'week',
        'Bi-Settimanale': 'bi_week',
        'Mensile': 'month'
    }

    max_min_dates = df.agg(F.max(df['data']), F.min(df['data'])).collect()[0]
    upper_date = max_min_dates[0]
    lower_date = max_min_dates[1]

    df = df.select('data', 'punto_di_interesse', 'id_telco', 'presenze', 'presenze_uniche',
                   'presenze_00_06', 'presenze_06_08', 'presenze_08_10', 'presenze_10_12',
                   'presenze_12_14', 'presenze_14_16', 'presenze_16_18', 'presenze_18_20',
                   'presenze_20_22', 'presenze_22_24')

    print('contenuto del csv')
    display(df.where(F.col('punto_di_interesse') == 'CC - Neapolis'))

    telco_day_aggr = read_from_cassandra_dev(f'telco_{aggr_map_dict[aggr]}_aggr') \
        .where(F.col('data').between(lower_date, upper_date))
    if telco_day_aggr.count() == 0:
        telco_day_aggr = create_empty_df()

    print('telco_day_aggr as is')
    display(telco_day_aggr.where(F.col('punto_di_interesse') == 'CC - Neapolis'))

    union_df = df.union(telco_day_aggr)
    print('unione del AS-IS e del csv')
    display(union_df.where(F.col('punto_di_interesse') == 'CC - Neapolis'))

    output_df = (union_df.groupBy('data', 'punto_di_interesse', 'id_telco')
                 .agg(
                     F.sum('presenze').alias('presenze'),
                     F.sum('presenze_uniche').alias('presenze_uniche'),
                     F.sum('presenze_00_06').alias('presenze_00_06'),
                     F.sum('presenze_06_08').alias('presenze_06_08'),
                     F.sum('presenze_08_10').alias('presenze_08_10'),
                     F.sum('presenze_10_12').alias('presenze_10_12'),
                     F.sum('presenze_12_14').alias('presenze_12_14'),
                     F.sum('presenze_14_16').alias('presenze_14_16'),
                     F.sum('presenze_16_18').alias('presenze_16_18'),
                     F.sum('presenze_18_20').alias('presenze_18_20'),
                     F.sum('presenze_20_22').alias('presenze_20_22'),
                     F.sum('presenze_22_24').alias('presenze_22_24')
                 ))
    return output_df

aggregate_df = aggregate_table(df_daily, 'Giornaliera')
write_on_cassandra_dev(aggregate_df, 'telco_day_aggr')
What I expect to achieve is a sort of upsert for Cassandra, because of the way the Cassandra drivers work. So the operations in my head are like this:
read the CSV from blob storage and store it in a dataframe (the df variable, input of the method)
using the max and min dates of this CSV file, query the table in Cassandra and save the result in another variable
concatenate the two dataframes
sum up with the groupBy
write the new dataframe to Cassandra, overwriting the existing rows with the new ones
It seems to me that, somehow, what is in the dataframe "df" gets written before I can read "telco_day_aggr", and that the union and groupBy parts have no effect. In other words, my Cassandra table ends up containing only the content of df.
I can provide additional information if needed.
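A minimal sketch of one way to pin down the ordering (the function and helper names are taken from the question as written and are assumed to exist); caching and counting the result forces the read-union-aggregate part to run before the overwrite starts:
# Spark is lazy: the Cassandra read inside the aggregation function only runs
# when the final write triggers the job, i.e. while the target table is being
# overwritten. Materializing the result first keeps the read and the write
# from overlapping.
aggregate_df = aggregate_table(df_daily, 'Giornaliera').cache()
aggregate_df.count()  # action: runs the read + union + groupBy now

write_on_cassandra_dev(aggregate_df, 'telco_day_aggr')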

AWS Glue / Spark - AnalysisException Cannot resolve column name after Filter -> Left Join

I am trying to do filter and left join operations on some CSV files from AWS Glue 2.0 with PySpark. Sometimes, if the filter removes all the data or if the input CSV is empty, my job crashes with:
AnalysisException: 'Cannot resolve column name "col_a" among ();'
- I have seen this exception occurring for other people in a number of other questions, but I think my issue is losing the header information when the rows are removed -> is this a DynamicFrame feature (I could not find anything about it in the AWS Glue docs)?
- I realise that I could do the filters after all the joins, but I wanted to avoid this because it seems like it might be more expensive, and because ideally I would like the job not to crash if the input data was an empty CSV.
- Any suggestions greatly appreciated :)
Here is a mock of the PySpark code (please note that in the real thing I would like to chain together many joins, transforms and filters):
DataSource0 = glueContext.create_dynamic_frame.from_catalog(database = "db_name", table_name = "table_1", transformation_ctx = "DataSource0")
DataSource1 = glueContext.create_dynamic_frame.from_catalog(database = "db_name", table_name = "table_2", transformation_ctx = "DataSource1")

Table_1_Renames = ApplyMapping.apply(frame = DataSource0, mappings = [("col_a", "string", "col_a", "string"), ("col_b", "string", "col_xyz", "string")], transformation_ctx = "Transform0")
Table_2_Renames = ApplyMapping.apply(frame = DataSource1, mappings = [("col0", "string", "col0_renamed", "string"), ("col1", "string", "col1_renamed", "string")], transformation_ctx = "Transform1")

Table_1_Filter = Filter.apply(frame = Table_1_Renames, f = lambda row: bool(re.match("KeepValue", row["col_b"])), transformation_ctx = "Table_1_Filter")

Table_1_Filter_DF = Table_1_Filter.toDF()
Table_2_Renames_DF = Table_2_Renames.toDF()

# If the original data was empty, or the filter removes all rows of the data, we get:
# AnalysisException: 'Cannot resolve column name "col_a" among ();'
LeftJoin_1 = DynamicFrame.fromDF(
    Table_1_Filter_DF.join(
        Table_2_Renames_DF,
        Table_1_Filter_DF['col_a'] == Table_2_Renames_DF['col0_renamed'],
        "left"),
    glueContext, "LeftJoin_1")
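One possible guard, sketched under the assumption that the expected column names are known up front (this is not from the thread, and the filter below uses the renamed col_xyz column): a DynamicFrame infers its schema from the data, so an empty input reports no columns at all, while a DataFrame with an explicit schema keeps the join condition resolvable even when every row is filtered out.
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

table_1_df = Table_1_Renames.toDF()

if len(table_1_df.columns) == 0:
    # Empty input: rebuild an empty DataFrame with the expected columns.
    expected_schema = StructType([
        StructField("col_a", StringType(), True),
        StructField("col_xyz", StringType(), True),
    ])
    table_1_df = glueContext.spark_session.createDataFrame([], expected_schema)

# Filtering a DataFrame keeps its schema even when every row is removed.
table_1_filtered_df = table_1_df.where(F.col("col_xyz").rlike("^KeepValue"))

table_2_df = Table_2_Renames.toDF()
LeftJoin_1_DF = table_1_filtered_df.join(
    table_2_df,
    table_1_filtered_df["col_a"] == table_2_df["col0_renamed"],
    "left")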

Slightly more complicated Spark word count

I have a large list of file locations. I want to read parquet from those locations, group by some column, do a count, and reduce by key.
var commaDelim = spark.sparkContext.textFile("s3://some_location")
var locs = commaDelim.flatMap(l => l.split(","))
locs.map(loc => spark.read.parquet(loc).groupBy("col").count ...
Not sure how to turn the count dataframe into a format that can be reduced by key.
Pass the list of files directly to the parquet function, like below:
val locs = Seq(file1, file2, file3, ...)
spark.read.parquet(locs: _*)
  .select("col")
  .as[String]
  .flatMap(value => value.split("\\s+"))
  .groupBy($"value")
  .agg(count("*").as("count"))
  .show(false)
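For comparison, a hedged PySpark sketch of the same idea, assuming the comma-delimited file from the question contains the parquet paths: read all paths in a single parquet() call and aggregate once, instead of producing one counted DataFrame per path that then has to be reduced by key.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumption: the object at this placeholder location holds a comma-separated
# list of parquet paths, as in the question.
locs = (spark.sparkContext.textFile("s3://some_location")
        .flatMap(lambda line: line.split(","))
        .collect())

# One read over all paths, then a single groupBy/count.
counts = (spark.read.parquet(*locs)
          .groupBy("col")
          .count())

counts.show(truncate=False)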

Splitting a misshapen csv file using pyspark RDD. EMR. YARN memory exception errors

I have been working on this code for a while. Below I have listed the code and most of the cluster attributes I am using on EMR. The purpose of the code is to split some CSV files in two at a certain line number, based on some basic iteration (I have included a simple split in the code below).
I frequently get the error "Container killed by YARN for exceeding memory limits" and have followed these design principles (link below) to resolve it, but I just don't know why this would run into memory problems. I have over 22 GB for YARN overhead, and the files are in the MB to single-digit GB range.
I sometimes use r5a.12xlarge instances, to no avail. I really don't see any kind of memory leak in this code. It also seems very slow; I was only able to process something like 20 GB of output to S3 in 16 hours. Is this a good way to parallelize this split operation? Is there a memory leak? What gives?
https://aws.amazon.com/premiumsupport/knowledge-center/emr-spark-yarn-memory-limit/
[
  {
    "Classification": "spark",
    "Properties": {
      "spark.maximizeResourceAllocation": "true"
    }
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.yarn.executor.memoryOverheadFactor": ".2"
    }
  },
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Configurations": [],
        "Properties": {
          "PYSPARK_PYTHON": "python36"
        },
        "Classification": "export"
      }
    ],
    "Properties": {}
  }
]
# Imports added for completeness; S3Url is assumed to be defined or imported elsewhere.
import argparse
import os
from functools import partial
from io import StringIO
from typing import Iterator, List, Tuple, Union

import boto3
import pandas
from pyspark import SparkContext


def writetxt(txt: Union[List[str], pandas.DataFrame], path: str) -> None:
    s3 = boto3.resource('s3')
    s3path = S3Url(path)
    object = s3.Object(s3path.bucket, s3path.key)
    if isinstance(txt, pandas.DataFrame):
        csv_buffer = StringIO()
        txt.to_csv(csv_buffer)
        object.put(Body=csv_buffer.getvalue())
    else:
        object.put(Body='\n'.join(txt).encode())


def main(
    x: Iterator[Tuple[str, str]],
    output_files: str
) -> None:
    filename, content = x
    filename = os.path.basename(S3Url(filename).key)
    content = content.splitlines()
    # Split the csv file
    columnAttributes, csvData = content[:100], content[100:]
    writetxt(csvData, os.path.join(output_files, 'data.csv', filename))
    writetxt(columnAttributes, os.path.join(output_files, 'attr.csv', filename))


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Split some misshapen csv files.')
    parser.add_argument('input_files', type=str,
                        help='The location of the input files.')
    parser.add_argument('output_files', type=str,
                        help='The location to put the output files.')
    parser.add_argument('--nb_partitions', type=int, default=4)
    args = parser.parse_args()

    # creating the context
    sc = SparkContext(appName="Broadcom Preprocessing")

    # We use minPartitions because otherwise small files get put in the same partition
    # together by default, which we have a lot of.
    # We use foreachPartition to reduce the number of function calls, which slow down Spark.
    distFiles = sc.wholeTextFiles(args.input_files, minPartitions=args.nb_partitions) \
        .foreach(partial(main, output_files=args.output_files))
I think your memory issues are because you're doing the actual data-splitting with Python code. Spark processes run in the JVM, but when you call custom Python code, the related data must be serialized over to a Python process (on each worker node) in order to execute. This adds a lot of overhead. I believe you can accomplish what you're trying to do entirely with Spark operations - meaning the final program will run entirely in the JVM-based Spark processes.
Try something like this:
from pyspark.sql.types import IntegerType
from pyspark.sql.window import Window
from pyspark.sql.functions import *

input_path = "..."
split_num = 100

# load filenames & contents
filesDF = spark.createDataFrame(sc.wholeTextFiles(input_path), ['filename', 'contents'])

# break into individual lines & number them
linesDF = filesDF.select(
    "filename",
    row_number().over(Window.partitionBy("filename").orderBy("filename")).alias("line_number"),
    explode(split(col("contents"), "\n")).alias("contents"))

# split into headers & body
headersDF = linesDF.where(col("line_number") == lit(1))
bodyDF = linesDF.where(col("line_number") > lit(1))

# split the body in two at split_num
splitLinesDF = bodyDF.withColumn("split", when(col("line_number") < lit(split_num), 0).otherwise(1))
split_0_DF = splitLinesDF.where(col("split") == lit(0)).select("filename", "line_number", "contents").union(headersDF).orderBy("filename", "line_number")
split_1_DF = splitLinesDF.where(col("split") == lit(1)).select("filename", "line_number", "contents").union(headersDF).orderBy("filename", "line_number")

# collapse all lines back down into a file
firstDF = split_0_DF.groupBy("filename").agg(concat_ws("\n", collect_list(col("contents"))).alias("contents"))
secondDF = split_1_DF.groupBy("filename").agg(concat_ws("\n", collect_list(col("contents"))).alias("contents"))

# pandas UDF for more memory-efficient transfer of data from Spark to Python
@pandas_udf(returnType=IntegerType())
def writeFile(filename, contents):
    <save to S3 here>

# write each row to a file
firstDF.select(writeFile(col("filename"), col("contents")))
secondDF.select(writeFile(col("filename"), col("contents")))
Finally, you will need to use some custom Python code to save each split file to S3 (or you could just code everything in Scala/Java). It is far, far more efficient to do this via pandas UDFs than to pass a standard Python function to .foreach(...): internally, Spark serializes the data to Arrow format in chunks (one per partition), which is very efficient.
Additionally, it looks like you're trying to put the entire object into S3 in a single request. If the data is too large for this, it will fail. You should check out the S3 streaming upload functionality.
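As a rough illustration of that last point (a sketch only; the bucket, key, and helper name are placeholders, not from the thread), boto3's managed transfer falls back to a multipart upload for large payloads instead of issuing one huge PUT request:
import io
import boto3

def upload_contents(bucket: str, key: str, contents: str) -> None:
    # upload_fileobj uses boto3's managed transfer, which switches to a
    # multipart upload automatically once the payload crosses the size threshold.
    s3 = boto3.client("s3")
    s3.upload_fileobj(io.BytesIO(contents.encode("utf-8")), bucket, key)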

How to Include the Value of Partitioned Column in a Spark data frame or Spark SQL Temp Table in AWS Glue?

I am using Python 3 and Glue 1.0 for this code.
I have partitioned data in S3. The data is partitioned on the year, month, day and extra_field_name columns.
When I load the data into a data frame, I get all the columns in its schema except the partition ones.
Here is the code and output
glueContext.create_dynamic_frame_from_options(
    connection_type = "s3",
    connection_options = {"paths": path_list, "recurse": True, 'groupFiles': 'inPartition'},
    format = "parquet").toDF().registerTempTable(final_arguement_list["read_table_" + str(i+1)])
The path_list variable contains a list of path strings that need to be loaded into a data frame.
I am printing the schema using the command below:
glueContext.create_dynamic_frame_from_options(
    connection_type = "s3",
    connection_options = {"paths": path_list, "recurse": True},
    format = "parquet").toDF().printSchema()
The schema that I get in the CloudWatch logs does not contain any of the partition columns.
Please note that I have already tried loading the data by providing paths only down to the year, month, day and extra_field_name levels separately, but I still get only those columns which are present in the parquet files themselves.
I was able to do this with the additional step of having a crawler crawl the directory on S3, and then using the table from the Glue Catalog as the source for Glue ETL.
Once you have a crawler over the location s3://path/to/source/data/, year, month and day will automatically be treated as partition columns. Then you could try the following in your Glue ETL script:
data_dyf = glueContext.create_dynamic_frame.from_catalog(
    database = db_name,
    table_name = tbl_name,
    push_down_predicate = "(year=='2018' and month=='05')"
)
You can find more details in the AWS Glue documentation on pushdown predicates.
As a workaround, I created duplicate columns in the data frame itself, named year_2, month_2, day_2 and extra_field_name_2, as copies of year, month, day and extra_field_name.
During the data ingestion phase, I partitioned the data frame on year, month, day and extra_field_name and stored it in S3, which retains the values of year_2, month_2, day_2 and extra_field_name_2 in the parquet files themselves.
While performing data manipulation, I load the data into a dynamic frame by providing the list of paths in the following manner:
['s3://path/to/source/data/year=2018/month=1/day=4/', 's3://path/to/source/data/year=2018/month=1/day=5/', 's3://path/to/source/data/year=2018/month=1/day=6/']
This gives me year_2, month_2, day_2 and extra_field_name_2 in the dynamic frame, which I can then use for further data manipulation.
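A short sketch of what that ingestion step could look like, assuming a Spark DataFrame df and the placeholder S3 path from the question:
from pyspark.sql import functions as F

# Keep copies of the partition columns so their values survive inside the
# parquet files after the partitioned write.
df_with_copies = (df
    .withColumn("year_2", F.col("year"))
    .withColumn("month_2", F.col("month"))
    .withColumn("day_2", F.col("day"))
    .withColumn("extra_field_name_2", F.col("extra_field_name")))

(df_with_copies.write
    .partitionBy("year", "month", "day", "extra_field_name")
    .mode("append")
    .parquet("s3://path/to/source/data/"))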
Try passing the basePath to the connection_options argument:
glueContext.create_dynamic_frame_from_options(
    connection_type = "s3",
    connection_options = {
        "paths": path_list,
        "recurse": True,
        "basePath": "s3://path/to/source/data/"
    },
    format = "parquet").toDF().printSchema()
This way, partition discovery will discover the partitions that are above your paths. According to the documentation, these options will be passed to the Spark SQL DataSource.
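For reference, a hedged sketch of the equivalent in plain Spark (reusing the placeholder paths from the question, and assuming spark is the Glue job's SparkSession), since the basePath option is what drives partition discovery there as well:
# With basePath set, partition discovery adds year/month/day/extra_field_name
# back into the schema even though only leaf directories are listed.
df = (spark.read
      .option("basePath", "s3://path/to/source/data/")
      .parquet("s3://path/to/source/data/year=2018/month=1/day=4/",
               "s3://path/to/source/data/year=2018/month=1/day=5/"))
df.printSchema()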
Edit: given that your experiment shows it doesn't work, have you considered passing the top-level directory and filtering from there for the dates of interest? The reader will only read the relevant Hive partitions, as the filter gets "pushed down" to the file system.
(glueContext.create_dynamic_frame_from_options(
    connection_type = "s3",
    connection_options = {
        "paths": ["s3://path/to/source/data/"],
        "recurse": True,
    },
    format = "parquet")
    .toDF()
    .filter(
        (col("year") == 2018)
        & (col("month") == 1)
        & (col("day").between(4, 6))
    )
    .printSchema())
