How do I know if a file has been dumped in badRecordsPath? [Azure Databricks - Spark] - apache-spark

I want to use badRecordsPath in Spark on Azure Databricks to get a count of the corrupt records associated with a job, but there is no simple way to know:
if a file has been written
in which partition the file has been written
I thought maybe I could check whether the last partition was created in the last 60 seconds, with some code like this:
import time
from datetime import datetime, timedelta, timezone

df = spark.read.format('csv').option("badRecordsPath", corrupt_record_path)  # the actual .load(<path>) call is omitted here

# dictionary mapping each partition name to its timestamp
partition_dict = {}
for i in dbutils.fs.ls(corrupt_record_path):
    partition_dict[i.name[:-1]] = time.mktime(
        datetime.strptime(i.name[:-1], "%Y%m%dT%H%M%S").timetuple())

# timestamp of one minute ago, truncated to whole seconds
submit_timestamp_utc_minus_minute = datetime.now(timezone.utc) - timedelta(seconds=60)
submit_timestamp_utc_minus_minute = time.mktime(
    datetime.strptime(submit_timestamp_utc_minus_minute.strftime("%Y%m%dT%H%M%S"),
                      "%Y%m%dT%H%M%S").timetuple())

# compare the latest partition to check whether it is more recent than 60 seconds ago
latest_partition = max(partition_dict, key=lambda k: partition_dict[k])
if partition_dict[latest_partition] > submit_timestamp_utc_minus_minute:
    corrupt_dataframe = spark.read.format('json').load(
        corrupt_record_path + latest_partition + '/bad_records')
    corrupt_records_count = corrupt_dataframe.count()
else:
    corrupt_records_count = 0
But I see two issues:
it is a lot of overhead (OK, the code could also be better written, but still)
I'm not even sure when the partition name is defined in the reading job. Is it at the beginning of the job or at the end? If it is at the beginning, then the 60-second check is not relevant at all.
As a side note, I cannot use a PERMISSIVE read with corrupt_records_column, as I don't want to cache the dataframe (see my other question here).
Any suggestion or observation would be much appreciated!
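Edit: for reference, another workaround I am considering is to snapshot the bad-records partitions before triggering the read and diff the listing after the action completes, instead of relying on timestamps. This is only an untested sketch; it assumes that corrupt_record_path already exists and that source_path is a placeholder for the input CSV location:
# hypothetical sketch, not tested
before = {p.name.rstrip('/') for p in dbutils.fs.ls(corrupt_record_path)}

df = spark.read.format('csv').option("badRecordsPath", corrupt_record_path).load(source_path)
df.count()  # force the read so that bad records, if any, are written out

# any partition that appears only now was written by this job
new_partitions = {p.name.rstrip('/') for p in dbutils.fs.ls(corrupt_record_path)} - before

corrupt_records_count = 0
for partition in new_partitions:
    corrupt_records_count += spark.read.format('json').load(
        corrupt_record_path + partition + '/bad_records').count()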

Related

Error While fetching columns from join condition in pyspark [duplicate]

I am trying to build, for each of my users, a vector containing the average number of records per hour of day. Hence the vector has to have 24 dimensions.
My original DataFrame has userID and hour columns, and I am starting by doing a groupBy and counting the number of records per user per hour as follows:
val hourFreqDF = df.groupBy("userID", "hour").agg(count("*") as "hfreq")
Now, in order to generate a vector per user, I am doing the following, based on the first suggestion in this answer.
val hours = (0 to 23 map { n => s"$n" } toArray)
val assembler = new VectorAssembler()
.setInputCols(hours)
.setOutputCol("hourlyConnections")
val exprs = hours.map(c => avg(when($"hour" === c, $"hfreq").otherwise(lit(0))).alias(c))
val transformed = assembler.transform(hourFreqDF.groupBy($"userID")
.agg(exprs.head, exprs.tail: _*))
When I run this example, I get the following warning:
Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
I presume this is because the expression is too long?
My question is: can I safely ignore this warning?
You can safely ignore it if you are not interested in seeing the SQL schema logs. Otherwise, you might want to set the property to a higher value, but that might affect the performance of your job:
spark.debug.maxToStringFields=100
Default value is: DEFAULT_MAX_TO_STRING_FIELDS = 25
The performance overhead of creating and logging strings
for wide schemas can be large. To limit the impact, we bound the
number of fields to include by default. This can be overridden by
setting the 'spark.debug.maxToStringFields' conf in SparkEnv.
Taken from: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L90
This config, along with many others, has been moved to SQLConf: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
This can be set either in the config file or via the command line in Spark, using:
spark.conf.set("spark.sql.debug.maxToStringFields", 1000)
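If you prefer to set it once at session creation time, a minimal PySpark sketch (assuming Spark 3.x, where the setting lives under spark.sql.debug.maxToStringFields; the application name is just a placeholder):
from pyspark.sql import SparkSession

# raise the truncation limit for plan/schema string representations
spark = (
    SparkSession.builder
    .appName("wide-schema-job")  # hypothetical application name
    .config("spark.sql.debug.maxToStringFields", 1000)
    .getOrCreate()
)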

Parquet data to AWS Redshift slow

I want to insert data from S3 parquet files to Redshift.
The parquet files come from a process that reads JSON files, flattens them out, and stores them as parquet. To do this we use pandas dataframes.
To do so, I tried two different things. The first one:
COPY schema.table
FROM 's3://parquet/provider/A/2020/11/10/11/'
IAM_ROLE 'arn:aws:iam::XXXX'
FORMAT AS PARQUET;
It returned:
Invalid operation: Spectrum Scan Error
error: Spectrum Scan Error
code: 15001
context: Unmatched number of columns between table and file. Table columns: 54, Data columns: 41
I understand the error but I don't have an easy option to fix it.
If we have to do a reload from 2 months ago, the file will only have, for example, 40 columns, because at that time we needed just that data, but the table has since grown to 50 columns.
So we need something automatic, or at least a way to specify the columns.
Then I tried another option, which is to do a SELECT with AWS Redshift Spectrum. We know how many columns the table has using the system tables, and we know the structure of the file by loading it again into a Pandas dataframe. Then I can combine both to get the same identical structure and do the insert.
It works fine but it is slow.
The select looks like:
SELECT fields
FROM schema.table
WHERE partition_0 = 'A'
AND partition_1 = '2020'
AND partition_2 = '11'
AND partition_3 = '10'
AND partition_4 = '11';
The partitions are already added as I checked using:
select *
from SVV_EXTERNAL_PARTITIONS
where tablename = 'table'
and schemaname = 'schema'
and values = '["A","2020","11","10","11"]'
limit 1;
I have around 170 files per hour, in both JSON and parquet. The process lists all files in the S3 JSON path, processes them, and stores them in the S3 parquet path.
I don't know how to improve the execution time, as the INSERT from parquet takes 2 minutes per partition_0 value. I tried the SELECT alone to make sure it is not an INSERT issue, and it takes 1:50 minutes. So the issue is reading data from S3.
If I try to select a non-existent value for partition_0 it again takes around 2 minutes, so there is some kind of problem accessing the data. I don't know if the partition_0 naming and the others are recognised as Hive partitioning format.
Edit:
AWS Glue Crawler table specification
Edit: Add SVL_S3QUERY_SUMMARY results
step:1
starttime: 2020-12-13 07:13:16.267437
endtime: 2020-12-13 07:13:19.644975
elapsed: 3377538
aborted: 0
external_table_name: S3 Scan schema_table
file_format: Parquet
is_partitioned: t
is_rrscan: f
is_nested: f
s3_scanned_rows: 1132
s3_scanned_bytes: 4131968
s3query_returned_rows: 1132
s3query_returned_bytes: 346923
files: 169
files_max: 34
files_avg: 28
splits: 169
splits_max: 34
splits_avg: 28
total_split_size: 3181587
max_split_size: 30811
avg_split_size: 18825
total_retries: 0
max_retries: 0
max_request_duration: 360496
avg_request_duration: 172371
max_request_parallelism: 10
avg_request_parallelism: 8.4
total_slowdown_count: 0
max_slowdown_count: 0
Add query checks
Query: 37005074 (SELECT in localhost using pycharm)
Query: 37005081 (INSERT in AIRFLOW AWS ECS service)
STL_QUERY shows that both queries take around 2 min
select * from STL_QUERY where query=37005081 OR query=37005074 order by query asc;
Query: 37005074 2020-12-14 07:44:57.164336,2020-12-14 07:46:36.094645,0,0,24
Query: 37005081 2020-12-14 07:45:04.551428,2020-12-14 07:46:44.834257,0,0,3
STL_WLM_QUERY shows no queue time; it is all exec time
select * from STL_WLM_QUERY where query=37005081 OR query=37005074;
Query: 37005074 Queue time 0 Exec time: 98924036 est_peak_mem:0
Query: 37005081 Queue time 0 Exec time: 100279214 est_peak_mem:2097152
SVL_S3QUERY_SUMMARY shows that the query takes 3-4 seconds in S3
select * from SVL_S3QUERY_SUMMARY where query=37005081 OR query=37005074 order by endtime desc;
Query: 37005074 2020-12-14 07:46:33.179352,2020-12-14 07:46:36.091295
Query: 37005081 2020-12-14 07:46:41.869487,2020-12-14 07:46:44.807106
stl_return: comparing the min start to the max end for each query gives 3-4 seconds, as SVL_S3QUERY_SUMMARY says
select * from stl_return where query=37005081 OR query=37005074 order by query asc;
Query:37005074 2020-12-14 07:46:33.175320 2020-12-14 07:46:36.091295
Query:37005081 2020-12-14 07:46:44.817680 2020-12-14 07:46:44.832649
I don't understand why SVL_S3QUERY_SUMMARY shows just 3-4 seconds to run the query in Spectrum, while STL_WLM_QUERY says the execution time is around 2 minutes, as I also see in my localhost and production environments... Nor do I know how to improve it, because stl_return shows that the query returns little data.
EXPLAIN
XN Partition Loop (cost=0.00..400000022.50 rows=10000000000 width=19608)
-> XN Seq Scan PartitionInfo of parquet.table (cost=0.00..22.50 rows=1 width=0)
Filter: (((partition_0)::text = 'A'::text) AND ((partition_1)::text = '2020'::text) AND ((partition_2)::text = '12'::text) AND ((partition_3)::text = '10'::text) AND ((partition_4)::text = '12'::text))
-> XN S3 Query Scan parquet (cost=0.00..200000000.00 rows=10000000000 width=19608)
" -> S3 Seq Scan parquet.table location:""s3://parquet"" format:PARQUET (cost=0.00..100000000.00 rows=10000000000 width=19608)"
svl_query_report
select * from svl_query_report where query=37005074 order by segment, step, elapsed_time, rows;
Just like in your other question you need to change your keypaths on your objects. It is not enough to just have "A" in the keypath - it needs to be "partition_0=A". This is how Spectrum knows that the object is or isn't in the partition.
Also you need to make sure that your objects are of reasonable size or it will be slow if you need to scan many of them. It takes time to open each object and if you have many small objects the time to open them can be longer than the time to scan them. This is only an issue if you need to scan many many files.
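To illustrate that keypath convention, a minimal sketch of writing one of the hourly pandas dataframes with Hive-style key names (assumes pandas with pyarrow and s3fs; the bucket, prefix, field names, and partition values are placeholders loosely based on the question):
import pandas as pd

# toy stand-in for the flattened JSON data
df = pd.DataFrame({"field_a": [1, 2], "field_b": ["x", "y"]})

# Spectrum can only prune partitions when the S3 keys embed "column=value" pairs
df.to_parquet(
    "s3://parquet/provider/"
    "partition_0=A/partition_1=2020/partition_2=11/partition_3=10/partition_4=11/"
    "data.parquet"
)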

Reading large volume data from Teradata using Dask cluster/Teradatasql and sqlalchemy

I need to read a large volume of data (approx. 800M records) from Teradata; my code works fine for a million records. For larger sets it takes a long time to build the metadata. Could someone please suggest how to make it faster? Below is the code snippet I am using for my application.
import pandas as pd
from dask import delayed
from dask.dataframe import from_delayed

def get_partitions(num_partitions):
    list_range = []
    initial_start = 0
    for i in range(num_partitions):
        amp_range = 3240 // num_partitions
        start = (i * amp_range + 1) * initial_start
        end = (i + 1) * amp_range
        list_range.append((start, end))
        initial_start = 1
    return list_range

@delayed
def load(query, start, end, connString):
    df = pd.read_sql(query.format(start, end), connString)
    # engine.dispose()  # 'engine' is not defined in this snippet
    return df

connString = "teradatasql://{user}:{password}@{hostname}/?logmech={logmech}&encryptdata=true"

results = from_delayed([load(query, start, end, connString)
                        for start, end in get_partitions(num_partitions)])
The build time is probably spent finding out the metadata of your table. This is done by fetching the whole of the first partition and analysing it.
You would be better off either specifying it explicitly, if you know the dtypes upfront, e.g. {col: dtype, ...} for all the columns, or generating it from a separate query that you limit to just as many rows as it takes to be sure you have the right types:
import dask

# compute a small sample once, just to learn the column names and dtypes
(meta,) = dask.compute(load(query, 0, 10, connString))

results = from_delayed(
    [
        load(query, start, end, connString)
        for start, end in get_partitions(num_partitions)
    ],
    meta=meta.loc[:0, :]  # zero-length version of the table
)
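For the first option (declaring the dtypes upfront), a minimal sketch; the column names and dtypes here are placeholders you would replace with the actual Teradata schema:
# hypothetical column names/dtypes; replace with the real table schema
meta = {"customer_id": "int64", "amount": "float64", "created_at": "datetime64[ns]"}

results = from_delayed(
    [load(query, start, end, connString)
     for start, end in get_partitions(num_partitions)],
    meta=meta,
)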

Why does reading a small subset of the rows with a Parquet Dataset take the same time as reading the whole file?

I'm developing a program to analyze historical prices of some assets. The data is structured and analyzed as a pandas dataframe. The columns are the dates and the rows are the assets. Previously I was using the transpose of this, but the current format gives me better reading times. I saved this data in a parquet file, and now I want to read an interval of dates, from A to B for example, and a small set of assets, analyze it, and then repeat the same process with the same assets but for the interval from B + 1 to C.
The problem is that even if I request a single row, the parquet read takes the same time as reading the whole file. Is there a way to improve this behaviour? It would be good if, once it filters the rows, it saved where the blocks are in memory to speed up the next reads. Do I have to write a new file with only the filtered assets?
I tried writing the parquet file with a small number of row groups and a smaller data page size to avoid reading the whole file, but this didn't give me good results in terms of time.
Another question I have is the following: why does reading the complete parquet file using a Parquet Dataset with use_legacy_dataset = False take more time than reading the same parquet dataset with use_legacy_dataset = True?
Code example:
import pandas as pd
import numpy as np
import time
import pyarrow.parquet as pq
# generating the small data for the example; the file weighs about 150 MB here,
# the real data is 2 GB
dates = pd.bdate_range('2019-01-01', '2020-03-01')
assets = list(range(1000, 50000))
historical_prices = pd.DataFrame(np.random.rand(len(assets), len(dates)), assets, dates)
historical_prices.columns = historical_prices.columns.strftime('%Y-%m-%d')
# name of the index
historical_prices.index.name = 'assets'
# writing the parquet file using the latest version; the commented-out options are the things that I tested
historical_prices.to_parquet(
'historical_prices.parquet',
version='2.0',
data_page_version='2.0',
writer_engine_version='2.0',
# row_group_size=100,
# compression=None
# use_dictionary=False,
# data_page_size=1000,
# use_byte_stream_split=True,
# flavor='spark',
)
# reading the complete parquet dataset
start_time = time.time()
historical_prices_dataset = pq.ParquetDataset(
'historical_prices.parquet',
use_legacy_dataset=False
)
historical_prices_dataset.read_pandas().to_pandas()
print(time.time() - start_time)
# Reading only one asset of the parquet dataset
start_time = time.time()
filters = [('assets', '=', assets[0])]
historical_prices_dataset = pq.ParquetDataset(
'historical_prices.parquet',
filters=filters,
use_legacy_dataset=False
)
historical_prices_dataset.read_pandas().to_pandas()
print(time.time() - start_time)
# this is what I want to do, read by intervals.
num_intervals = 5
for i in range(num_intervals):
    start = int(i * len(dates) / num_intervals)
    end = int((i + 1) * len(dates) / num_intervals)
    interval = list(dates[start:end].strftime('%Y-%m-%d'))
    historical_prices_dataset.read_pandas(columns=interval).to_pandas()
    # here goes some analyzing process that can't be done in parallel, because the results
    # of every interval are used in the next interval
print(time.time() - start_time)
Previously I was using the transpose of this, but the current format gives me better reading times.
Parquet supports individual column reads. So if you have 10 columns of 10k rows and you want 5 columns then you'll read 50k cells. If you have 10k columns of 10 rows and you want 5 columns then you'll read 50 cells. So presumably this is why the transpose gave you better reading time. I don't think I have enough details here. Parquet also supports reading individual row groups, more on that later.
You have roughly 49,000 assets and 300 dates. I'd expect you to get better performance with assets as columns but 49,000 is a lot of columns to have. It's possible that either you are having to read too much column metadata or you are dealing with CPU overhead from keeping track of so many columns.
It is a bit odd to have date values or asset ids as columns. A far more typical layout would be to have three columns: "date", "asset id", & "price".
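A minimal sketch of reshaping the example's wide frame into that long layout, reusing the variable names from the question's example code:
# turn the wide (asset x date) frame into a long "assets / date / price" table
long_prices = (
    historical_prices
    .reset_index()  # 'assets' becomes a regular column
    .melt(id_vars="assets", var_name="date", value_name="price")
)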
The problem is that even if I request a single row, the parquet read takes the same time as reading the whole file
Yes, if you have a single row group. Parquet does not support partial row group reads. I believe this is due to the fact that the columns are compressed. However, I do not get the same results you are getting. The middle time in your example (the single asset read) is typically ~60-70% of the time of the first read. So it is faster. Possibly just because there is less conversion to do to get to pandas or maybe there is some optimization I'm not aware of.
The problem is that even if I request a single row, the parquet read takes the same time as reading the whole file. Is there a way to improve this behaviour? It would be good if, once it filters the rows, it saved where the blocks are in memory to speed up the next reads. Do I have to write a new file with only the filtered assets?
Row groups may be your answer. See the next section.
I tried writing the parquet file with a small number of row groups and a smaller data page size to avoid reading the whole file, but this didn't give me good results in terms of time.
This is probably what you are after (or you can use multiple files). Parquet supports reading just one row group out of a whole file. However, 100 is too small of a number for row_group_size. Each row group creates some amount of metadata in the file and has some overhead for processing. If I change that to 10,000 for example then the middle read is twice as fast (and now only 30-40% of the full table read).
Another question I have is the following: why does reading the complete parquet file using a Parquet Dataset with use_legacy_dataset = False take more time than reading the same parquet dataset with use_legacy_dataset = True?
The new datasets API is pretty new (introduced in 1.0.0, which was released in July). It's possible there is just a bit more overhead. You are not doing anything that would take advantage of the new datasets API (e.g. using scan, non-parquet datasets, or new filesystems). So while use_legacy_dataset=False shouldn't be faster, it should not be any slower either. They should take roughly the same amount of time.
It sounds like you have many assets (tens of thousands) and you want to read a few of them. You also want to chunk the read into smaller reads (which you are using the date for).
First, instead of using the date at all, I would recommend using dataset.scan (https://arrow.apache.org/docs/python/dataset.html). This will allow you to process your data one row group at a time.
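The exact scan API depends on the pyarrow version; a rough sketch using the newer scanner interface could look like the following, where the column names come from the question's example data and process() is a placeholder for the per-chunk analysis:
import pyarrow.dataset as ds

dataset = ds.dataset("historical_prices.parquet", format="parquet")
scanner = dataset.scanner(
    columns=["assets", "2019-01-02"],             # the columns you actually need
    filter=ds.field("assets").isin(assets[:10]),  # only the assets of interest
)
for batch in scanner.to_batches():  # one record batch at a time
    process(batch)                  # placeholder for the analysis step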
Second, is there any way you can group your asset ids? If each asset ID has only a single row you can ignore this. However, if you have (for example) 500 rows for each asset ID (or 1 row for each asset ID/date pair) can you write your file so that it looks something like this...
asset_id date price
A 1 ?
A 2 ?
A 3 ?
B 1 ?
B 2 ?
B 3 ?
If you do this AND you set the row group size to something reasonable (try 10k or 100k and then refine from there) then you should be able to get it so that you are only reading 1 or 2 row groups per asset ID.
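Continuing the hypothetical long layout sketched above, a sketch of writing it so that each asset's rows land in only a few row groups (the row_group_size value is just a starting point to tune):
# sort so each asset's rows are contiguous, then write with a moderate row-group size
(
    long_prices
    .sort_values(["assets", "date"])
    .to_parquet("historical_prices_long.parquet", row_group_size=100_000)
)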
I found another approach that gives me better times for my specific cases; of course, this is not a very general solution. It uses some non-pyarrow functions, but it does what I thought pyarrow's filters would do when we read the same rows multiple times. When the number of row groups to read grows, the parquet dataset gives better performance.
import pandas as pd
import numpy as np
import time
import pyarrow.parquet as pq
from typing import Dict, Any, List
class PriceGroupReader:
    def __init__(self, filename: str, assets: List[int]):
        self.price_file = pq.ParquetFile(filename)
        self.assets = assets
        self.valid_groups = self._get_valid_row_groups()

    def _get_valid_row_groups(self):
        """
        I didn't find a parquet function to do this row-group search, so I did it manually.
        Note: the assets index is sorted, so this can probably be improved a lot.
        """
        start_time = time.time()
        assets = pd.Index(self.assets)
        valid_row_groups = []
        index_position = self.price_file.schema.names.index("assets")
        for i in range(self.price_file.num_row_groups):
            row_group = self.price_file.metadata.row_group(i)
            statistics = row_group.column(index_position).statistics
            if np.any((statistics.min <= assets) & (assets <= statistics.max)):
                valid_row_groups.append(i)
        print("getting the row groups: {}".format(time.time() - start_time))
        return valid_row_groups

    def read_valid_row_groups(self, dates: List[str]):
        row_groups = []
        for row_group_pos in self.valid_groups:
            df = self.price_file.read_row_group(row_group_pos, columns=dates, use_pandas_metadata=True).to_pandas()
            df = df.loc[df.index.isin(self.assets)]
            row_groups.append(df)
        df = pd.concat(row_groups)
        """
        # This is another way to read the groups, but I think it can consume more memory; it is probably faster.
        df = self.price_file.read_row_groups(self.valid_groups, columns=dates, use_pandas_metadata=True).to_pandas()
        df = df.loc[df.index.isin(self.assets)]
        """
        return df

def write_prices(assets: List[int], dates: List[str]):
    historical_prices = pd.DataFrame(np.random.rand(len(assets), len(dates)), assets, dates)
    # name of the index
    historical_prices.index.name = 'assets'
    # writing the parquet file using the latest version; the commented-out options are the things that I tested
    historical_prices.to_parquet(
        'historical_prices.parquet',
        version='2.0',
        data_page_version='2.0',
        writer_engine_version='2.0',
        row_group_size=4000,
        # compression=None
        # use_dictionary=False,
        # data_page_size=1000,
        # use_byte_stream_split=True,
        # flavor='spark',
    )
# generating the small data for the example; the file weighs about 150 MB, the real data is 2 GB
total_dates = list(pd.bdate_range('2019-01-01', '2020-03-01').strftime('%Y-%m-%d'))
total_assets = list(range(1000, 50000))
write_prices(total_assets, total_dates)
# selecting a subset of the whole assets
valid_assets = total_assets[:3000]
# read the price file for the example
price_group_reader = PriceGroupReader('historical_prices.parquet', valid_assets)
# reading all the dates, only as an example
start_time = time.time()
price_group_reader.read_valid_row_groups(total_dates)
print("complete reading: {}".format(time.time() - start_time))
# this is what I want to do, read by intervals.
num_intervals = 5
start_time = time.time()
for i in range(num_intervals):
    start = int(i * len(total_dates) / num_intervals)
    end = int((i + 1) * len(total_dates) / num_intervals)
    interval = list(total_dates[start:end])
    df = price_group_reader.read_valid_row_groups(interval)
    # print(df)
print("interval reading: {}".format(time.time() - start_time))
filters = [('assets', 'in', valid_assets)]
price_dataset = pq.ParquetDataset(
'historical_prices.parquet',
filters=filters,
use_legacy_dataset=False
)
start_time = time.time()
price_dataset.read_pandas(columns=total_dates).to_pandas()
print("complete reading with parquet dataset: {}".format(time.time() - start_time))
start_time = time.time()
for i in range(num_intervals):
    start = int(i * len(total_dates) / num_intervals)
    end = int((i + 1) * len(total_dates) / num_intervals)
    interval = list(total_dates[start:end])
    df = price_dataset.read_pandas(columns=interval).to_pandas()
print("interval reading with parquet dataset: {}".format(time.time() - start_time))

How to expire state of dropDuplicates in structured streaming to avoid OOM?

I want to count the unique accesses for each day using Spark Structured Streaming, so I use the following code:
.dropDuplicates("uuid")
and on the next day the state maintained for today should be dropped so that I can get the right count of unique accesses for the next day and avoid OOM. The Spark documentation suggests using dropDuplicates with a watermark, for example:
.withWatermark("timestamp", "1 day")
.dropDuplicates("uuid", "timestamp")
but the watermark column must be specified in dropDuplicates. In that case uuid and timestamp are used as a combined key to deduplicate elements with the same uuid and timestamp, which is not what I expected.
So is there a perfect solution?
After a few days of effort I finally found the way myself.
While studying the source code of watermark and dropDuplicates, I discovered that besides an eventTime column, a watermark also supports a window column, so we can use the following code:
.select(
window($"timestamp", "1 day"),
$"timestamp",
$"uuid"
)
.withWatermark("window", "1 day")
.dropDuplicates("uuid", "window")
Since all events in the same day have the same window, this will produce the same results as deduplicating on uuid only. Hope this can help someone.
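For reference, a PySpark sketch of the same idea, assuming a streaming DataFrame named events with timestamp and uuid columns (both names are placeholders taken from the question):
from pyspark.sql.functions import window

deduped = (
    events  # hypothetical streaming DataFrame with 'timestamp' and 'uuid' columns
    .select(window("timestamp", "1 day").alias("window"), "timestamp", "uuid")
    .withWatermark("window", "1 day")
    .dropDuplicates(["uuid", "window"])
)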
Below is a modification of the procedure proposed in the Spark documentation. The trick is to manipulate the event time, i.e. put the event time into buckets. The assumption is that the event time is provided in milliseconds.
// removes all duplicates that are in 15 minutes tumbling window.
// doesn't remove duplicates that are in different 15 minutes windows !!!!
public static Dataset<Row> removeDuplicates(Dataset<Row> df) {
    // converts time in 15 minute buckets
    // timestamp - (timestamp % (15 * 60))
    Column bucketCol = functions.to_timestamp(
            col("event_time").divide(1000).minus((col("event_time").divide(1000)).mod(15 * 60)));
    df = df.withColumn("bucket", bucketCol);

    String windowDuration = "15 minutes";
    df = df.withWatermark("bucket", windowDuration)
           .dropDuplicates("uuid", "bucket");
    return df.drop("bucket");
}
I found that using the window function directly didn't work for me, so I chose to use window.start or window.end instead.
.select(
window($"timestamp", "1 day")("start").as("window"),
$"timestamp",
$"uuid"
)
.withWatermark("window", "1 day")
.dropDuplicates("uuid", "window")

Resources