Use pyspark to partition 100 rows from csv file - apache-spark

I'm trying to group 100 rows of a large csv file (100M+ rows) to send to a Lambda function.
I can use SparkContext to have a workaround like this:
csv_file_rdd = sc.textFile(csv_file).collect()
count = 0
buffer = []
while count < len(csv_file_rdd):
buffer.append(csv_file_rdd[count])
count += 1
if count % 100 == 0 or count == len(csv_file_rdd):
# Send buffer to process
print("Send:", buffer)
# Clear buffer
buffer = []
but there must be a more elegant solution. I've tried using SparkSession and mapPartition but I haven't been able to make it work.

I suppose that your current data is not partitioned in any way (I mean its only one file), so iterating over it sequencially is a must. I suggest to load it as a data frame spark.read.csv(csv_file) then repartition as in this question and save to disk. Once it's saved you'll have a big number of files containing the specified number of records (100 in your case), taht can be used by other program to send to a Lambda (probably with a Pool of workers). See this post to get an idea. Probably is a naive idea but get's the job done.

Related

AWS Glue dynamic frames not getting updated

We are currrently facing an issue where we cannot insert more than 600K records in oracle db using AWS glue. We are getting connection reset error and DBA's are currently looking into it. As a temporary solution we thought of adding data in chunks by splitting a dataframe into multiple dataframe and looping this list of dataframe to add data. We are sure that splitting algorithm works fine and here is the code we use
def split_by_row_index(df, num_partitions=10):
# Let's assume you don't have a row_id column that has the row order
t = df.withColumn('_row_id', monotonically_increasing_id())
# Using ntile() because monotonically_increasing_id is discontinuous across partitions
t = t.withColumn('_partition', ntile(num_partitions).over(Window.orderBy(t._row_id)))
return [t.filter(t._partition == i + 1) for i in range(num_partitions)]
Here each DF have unique data but somehow when we convert this df in dynamic frame in loop it is we are getting common data in each dynamic frame. here is small snippet for this example
df_trns_details_list = split_by_row_index(df_trns_details, int(df_trns_details.count() / 100000))
trnsDetails1 = DynamicFrame.fromDF(df_trns_details_list[0], glueContext, "trnsDetails1")
trnsDetails2 = DynamicFrame.fromDF(df_trns_details_list[1], glueContext, "trnsDetails2")
print(df_trns_details_list[0].count())# counts are same
print(trnsDetails1.count())
print('-------------------------------')
print(df_trns_details_list[1].count()) # counts are same
print(trnsDetails2.count())
print('-------------------------------')
subDf1 = trnsDetails1.toDF().select(col("id"), col("details_id"))
subDf2 = trnsDetails2.toDF().select(col("id"), col("details_id"))
common = subDf1.intersect(subDf2)
# ------------------ common data exists----------------
print(common.count())
subDf3 = df_trns_details_list[0].select(col("id"), col("details_id"))
subDf4 = df_trns_details_list[1].select(col("id"), col("details_id"))
#------------------0 common data----------------
common1 = subDf3.intersect(subDf4)
print(common1.count())
here Id and details_id combination will be unique
We used this logic in multiple areas where it worked not sure why this is happening.
We are also quite new to Python and AWS Glue so any suggestion to improve it also welcomed. Thanks

Reading large volume data from Teradata using Dask cluster/Teradatasql and sqlalchemy

I need to read large volume data(app. 800M records) from teradata, my code is working fine for a million record. for larger sets its taking time to build metadata. Could someone please suggest how to make it faster. Below is the code snippet which I am using for my application.
def get_partitions(num_partitions):
list_range =[]
initial_start=0
for i in range(num_partitions):
amp_range = 3240//num_partitions
start = (i*amp_range+1)*initial_start
end = (i+1)*amp_range
list_range.append((start,end))
initial_start = 1
return list_range
#delayed
def load(query,start,end,connString):
df = pd.read_sql(query.format(start, end),connString)
engine.dispose()
return df
connString = "teradatasql://{user}:{password}#{hostname}/?logmech={logmech}&encryptdata=true"
results = from_delayed([load(query,start, end,connString) for start,end in get_partitions(num_partitions)])
The build time is probably taken in finding out the metadata of your table. This is done by fetching the whole of the first partition and analysing it.
You would be better off either specifying it explcitly, if you know the dtypes upfront, e.g., {col: dtype, ...} for all the columns, or generating it from a separate query that you limit to just as many rows as it takes to be sure you have the right types:
meta = dask.compute(load(query, 0,10 ,connString))
results = from_delayed(
[
load(query,start, end,connString) for start,end in
get_partitions(num_partitions)
],
mete=meta.loc[:0, :] # zero-length version of table
)

Why reading an small subset of the rows with Parquet Dataset take the same time than reading the whole file?

I'm developing a program to analyze some historical prices of some assets. The data is structured and analyzed as a pandas dataframe. The columns are the dates and the rows are the assets. Previously I was using the transpose of this, but this format gave me better reading time. I saved this data in a parquet file and now I want to read an interval of dates from A to B for example and an small set of assets, analyze it and then repeat the same process with the same assets but in the interval from B + 1 to C.
The problem is that even if I use a unique row, the parquet read take the same time that if I read the whole file. Is there a way to improve this behaviour?, It would be good that, once it filter the rows, it saves where the blocks in memory are to speed up the nexts reads. Do I have to write a new file with the assets filtered?.
I tried writing the parquet file with a small number of row groups and smaller data page size to avoid the complete reading, but this doesn't gave me a good results in terms of time.
Other question that I have is the follwing. Why if we read the complete parquet file using a Parquet Dataset and use_legacy_dataset = False, it takes more time than reading the same parquet dataset with use_legacy_dataset = True?
Code example:
import pandas as pd
import numpy as np
import time
import pyarrow.parquet as pq
# generating the small data for the example, the file weight like 150MB for this example, the real data
# has 2 GB
dates = pd.bdate_range('2019-01-01', '2020-03-01')
assets = list(range(1000, 50000))
historical_prices = pd.DataFrame(np.random.rand(len(assets), len(dates)), assets, dates)
historical_prices.columns = historical_prices.columns.strftime('%Y-%m-%d')
# name of the index
historical_prices.index.name = 'assets'
# writing the parquet file using the lastest version, in the comments are the thigns that I tested
historical_prices.to_parquet(
'historical_prices.parquet',
version='2.0',
data_page_version='2.0',
writer_engine_version='2.0',
# row_group_size=100,
# compression=None
# use_dictionary=False,
# data_page_size=1000,
# use_byte_stream_split=True,
# flavor='spark',
)
# reading the complete parquet dataset
start_time = time.time()
historical_prices_dataset = pq.ParquetDataset(
'historical_prices.parquet',
use_legacy_dataset=False
)
historical_prices_dataset.read_pandas().to_pandas()
print(time.time() - start_time)
# Reading only one asset of the parquet dataset
start_time = time.time()
filters = [('assets', '=', assets[0])]
historical_prices_dataset = pq.ParquetDataset(
'historical_prices.parquet',
filters=filters,
use_legacy_dataset=False
)
historical_prices_dataset.read_pandas().to_pandas()
print(time.time() - start_time)
# this is what I want to do, read by intervals.
num_intervals = 5
for i in range(num_intervals):
start = int(i * len(dates) / num_intervals)
end = int((i + 1) * len(dates) / num_intervals)
interval = list(dates[start:end].strftime('%Y-%m-%d'))
historical_prices_dataset.read_pandas(columns=interval).to_pandas()
# Here goes some analyzing process that can't be done in parallel due that the results of every interval
# are used in the next interval
print(time.time() - start_time)
I was using the transpose of this, but this format gave me better reading time.
Parquet supports individual column reads. So if you have 10 columns of 10k rows and you want 5 columns then you'll read 50k cells. If you have 10k columns of 10 rows and you want 5 columns then you'll read 50 cells. So presumably this is why the transpose gave you better reading time. I don't think I have enough details here. Parquet also supports reading individual row groups, more on that later.
You have roughly 49,000 assets and 300 dates. I'd expect you to get better performance with assets as columns but 49,000 is a lot of columns to have. It's possible that either you are having to read too much column metadata or you are dealing with CPU overhead from keeping track of so many columns.
It is a bit odd to have date values or asset ids as columns. A far more typical layout would be to have three columns: "date", "asset id", & "price".
The problem is that even if I use a unique row, the parquet read take the same time that if I read the whole file
Yes, if you have a single row group. Parquet does not support partial row group reads. I believe this is due to the fact that the columns are compressed. However, I do not get the same results you are getting. The middle time in your example (the single asset read) is typically ~60-70% of the time of the first read. So it is faster. Possibly just because there is less conversion to do to get to pandas or maybe there is some optimization I'm not aware of.
The problem is that even if I use a unique row, the parquet read take the same time that if I read the whole file. Is there a way to improve this behaviour?, It would be good that, once it filter the rows, it saves where the blocks in memory are to speed up the nexts reads. Do I have to write a new file with the assets filtered?.
Row groups may be your answer. See the next section.
I tried writing the parquet file with a small number of row groups and smaller data page size to avoid the complete reading, but this doesn't gave me a good results in terms of time.
This is probably what you are after (or you can use multiple files). Parquet supports reading just one row group out of a whole file. However, 100 is too small of a number for row_group_size. Each row group creates some amount of metadata in the file and has some overhead for processing. If I change that to 10,000 for example then the middle read is twice as fast (and now only 30-40% of the full table read).
Other question that I have is the follwing. Why if we read the complete parquet file using a Parquet Dataset and use_legacy_dataset = False, it takes more time than reading the same parquet dataset with use_legacy_dataset = True?
This new datasets API is pretty new (new as of 1.0.0 which released in July). It's possible there is just a bit more overhead. You are not doing anything that would take advantage of the new datasets API (e.g. using scan or non-parquet datasets or new filesystems). So while use_legacy_datasets shouldn't be faster it should not be any slower either. They should take roughly the same amount of time.
It sounds like you have many assets (tens of thousands) and you want to read a few of them. You also want to chunk the read into smaller reads (which you are using the date for).
First, instead of using the date at all, I would recommend using dataset.scan (https://arrow.apache.org/docs/python/dataset.html). This will allow you to process your data one row group at a time.
Second, is there any way you can group your asset ids? If each asset ID has only a single row you can ignore this. However, if you have (for example) 500 rows for each asset ID (or 1 row for each asset ID/date pair) can you write your file so that it looks something like this...
asset_id date price
A 1 ?
A 2 ?
A 3 ?
B 1 ?
B 2 ?
B 3 ?
If you do this AND you set the row group size to something reasonable (try 10k or 100k and then refine from there) then you should be able to get it so that you are only reading 1 or 2 row groups per asset ID.
I found another approach that give me better times for my specific cases, of course, this is a not very general solution. It has some not pyarrow's functions, but it do what I thought the filters of pyarrow do when we read multiple times the same rows. When the number of row groups to read grow, the parquet dataset gave better performance.
import pandas as pd
import numpy as np
import time
import pyarrow.parquet as pq
from typing import Dict, Any, List
class PriceGroupReader:
def __init__(self, filename: str, assets: List[int]):
self.price_file = pq.ParquetFile(filename)
self.assets = assets
self.valid_groups = self._get_valid_row_groups()
def _get_valid_row_groups(self):
"""
I don't fine a parquet function to make this row group search, so I did this manual search.
Note: The assets index is sorted, so probably this can be improved a lot.
"""
start_time = time.time()
assets = pd.Index(self.assets)
valid_row_groups = []
index_position = self.price_file.schema.names.index("assets")
for i in range(self.price_file.num_row_groups):
row_group = self.price_file.metadata.row_group(i)
statistics = row_group.column(index_position).statistics
if np.any((statistics.min <= assets) & (assets <= statistics.max)):
valid_row_groups.append(i)
print("getting the row groups: {}".format(time.time() - start_time))
return valid_row_groups
def read_valid_row_groups(self, dates: List[str]):
row_groups = []
for row_group_pos in self.valid_groups:
df = self.price_file.read_row_group(row_group_pos, columns=dates, use_pandas_metadata=True).to_pandas()
df = df.loc[df.index.isin(self.assets)]
row_groups.append(df)
df = pd.concat(row_groups)
"""
# This is another way to read the groups but I think it can consume more memory, probably is faster.
df = self.price_file.read_row_groups(self.valid_groups, columns=dates, use_pandas_metadata=True).to_pandas()
df = df.loc[df.index.isin(self.assets)]
"""
return df
def write_prices(assets: List[int], dates: List[str]):
historical_prices = pd.DataFrame(np.random.rand(len(assets), len(dates)), assets, dates)
# name of the index
historical_prices.index.name = 'assets'
# writing the parquet file using the lastest version, in the comments are the thigns that I tested
historical_prices.to_parquet(
'historical_prices.parquet',
version='2.0',
data_page_version='2.0',
writer_engine_version='2.0',
row_group_size=4000,
# compression=None
# use_dictionary=False,
# data_page_size=1000,
# use_byte_stream_split=True,
# flavor='spark',
)
# generating the small data for the example, the file weight like 150MB, the real data weight 2 GB
total_dates = list(pd.bdate_range('2019-01-01', '2020-03-01').strftime('%Y-%m-%d'))
total_assets = list(range(1000, 50000))
write_prices(total_assets, total_dates)
# selecting a subset of the whole assets
valid_assets = total_assets[:3000]
# read the price file for the example
price_group_reader = PriceGroupReader('historical_prices.parquet', valid_assets)
# reading all the dates, only as an example
start_time = time.time()
price_group_reader.read_valid_row_groups(total_dates)
print("complete reading: {}".format(time.time() - start_time))
# this is what I want to do, read by intervals.
num_intervals = 5
start_time = time.time()
for i in range(num_intervals):
start = int(i * len(total_dates) / num_intervals)
end = int((i + 1) * len(total_dates) / num_intervals)
interval = list(total_dates[start:end])
df = price_group_reader.read_valid_row_groups(interval)
# print(df)
print("interval reading: {}".format(time.time() - start_time))
filters = [('assets', 'in', valid_assets)]
price_dataset = pq.ParquetDataset(
'historical_prices.parquet',
filters=filters,
use_legacy_dataset=False
)
start_time = time.time()
price_dataset.read_pandas(columns=total_dates).to_pandas()
print("complete reading with parquet dataset: {}".format(time.time() - start_time))
start_time = time.time()
for i in range(num_intervals):
start = int(i * len(total_dates) / num_intervals)
end = int((i + 1) * len(total_dates) / num_intervals)
interval = list(total_dates[start:end])
df = price_dataset.read_pandas(columns=interval).to_pandas()
print("interval reading with parquet dataset: {}".format(time.time() - start_time))

How can I repartition RDD by key and then pack it to shards?

I have many files containing millions of rows in format:
id, created_date, some_value_a, some_value_b, some_value_c
This way of repartitioning was super slow and created for me over million of small ~500b files:
rdd_df = rdd.toDF(["id", "created_time", "a", "b", "c"])
rdd_df.write.partitionBy("id").csv("output")
I would like to achieve output files, where each file contains like 10000 unique IDs and all their rows.
How could I achieve something like this?
You can repartition by adding a Random Salt key.
val totRows = rdd_df.count
val maxRowsForAnId = rdd_df.groupBy("id").count().agg(max("count"))
val numParts1 = totRows/maxRowsForAnId
val totalUniqueIds = rdd_df.select("id").distinct.count
val numParts2 = totRows/(10000*totalUniqueIds)
val numPart = numParts1.min(numParts2)
rdd_df
.repartition(numPart,col("id"),rand)
.csv("output")
The main concept is each partition will be written as 1 file. SO you would have bring your required rows in to 1 partition by repartition(numPart,col("id"),rand).
The first 4-5 operations is just to calculate how many partitions we need to achieve almost 10000 ids per file.
Calculate assuming 10000 ids per partition
Corner case : if a single id has too many rows and doesn't fit in the above calculated partition size.
Hence we calculate no of paritition according to the largest count of ID present
Take min of the 2 noOfPartitons
rand is necessary so, that we can bring multiple IDs in a single partition
NOTE : Although this will give you larger files and each file will contain a set of unique ids for sure. But this involves shuffling , due to which your operation actually might be slower than the code you have mentioned in question.
You would need something like this:
rdd_df.repartition(*number of partitions you want*).write.csv("output", header = True)
or honestly - just let the job decide the number partitions instead of repartitioning. In theory, that should be faster:
rdd_df.write.csv("output", header = True)

Apache Spark write to multiple outputs [different parquet schemas] without caching

I want to transform my input data (XML files) and produce 3 different outputs.
Each output will be in parquet format and will have a different schema/number of columns.
Currently in my solution, the data is stored in RDD[Row], where each Row belongs to one of three types and has a different number of fields. What I'm doing now is caching the RDD, then filtering it (using the field telling me about the record type) and saving the data using the following method:
var resultDF_1 = sqlContext.createDataFrame(filtered_data_1, schema_1)
resultDF_1.write.parquet(output_path_1)
...
// the same for filtered_data_2 and filtered_data_3
Is there any way to do it better, for example do not cache entire data in memory?
In MapReduce we have MultipleOutputs class and we can do it this way:
MultipleOutputs.addNamedOutput(job, "data_type_1", DataType1OutputFormat.class, Void.class, Group.class);
MultipleOutputs.addNamedOutput(job, "data_type_2", DataType2OutputFormat.class, Void.class, Group.class);
MultipleOutputs.addNamedOutput(job, "data_type_3", DataType3OutputFormat.class, Void.class, Group.class);
...
MultipleOutputs<Void, Group> mos = new MultipleOutputs<>(context);
mos.write("data_type_1", null, myRecordGroup1, filePath1);
mos.write("data_type_2", null, myRecordGroup2, filePath2);
...
We had exactly this problem, to re-iterate: we read 1000s of datasets into one RDD, all of different schemas (we used a nested Map[String, Any]) and wanted to write those 1000s of datasets to different Parquet partitions in their respective schemas. All in a single embarrassingly parallel Spark Stage.
Our initial approach indeed did the hacky thing of caching, but this meant (a) 1000 passes of the cached data (b) hitting a lot of memory issues!
For a long time now I've wanted to bypass the Spark's provided .parquet methods and go to lower level underlying libraries, and wrap that in a nice functional signature. Finally recently we did exactly this!
The code is too much to copy and paste all of it here, so I will just paste the main crux of the code to explain how it works. We intend on making this code Open Source in the next year or two.
val successFiles: List[String] = successFilePaths(tableKeyToSchema, tableKeyToOutputKey, tableKeyToOutputKeyNprs)
// MUST happen first
info("Deleting success files")
successFiles.foreach(S3Utils.deleteObject(bucket, _))
if (saveMode == SaveMode.Overwrite) {
info("Deleting past files as in Overwrite mode")
parDeleteDirContents(bucket, allDirectories(tableKeyToOutputKey, tableKeyToOutputKeyNprs, partitions, continuallyRunStartTime))
} else {
info("Not deleting past files as in Append mode")
}
rdd.mapPartitionsWithIndex {
case (index, records) =>
records.toList.groupBy(_._1).mapValues(_.map(_._2)).foreach {
case (regularKey: RegularKey, data: List[NotProcessableRecord Either UntypedStruct]) =>
val (nprs: List[NotProcessableRecord], successes: List[UntypedStruct]) =
Foldable[List].partitionEither(data)(identity)
val filename = s"part-by-partition-index-$index.snappy.parquet"
Parquet.writeUntypedStruct(
data = successes,
schema = toMessageType(tableKeyToSchema(regularKey.tableKey)),
fsMode = fs,
path = s3 / bucket / tableKeyToOutputKey(regularKey.tableKey) / regularKey.partition.pathSuffix /?
continuallyRunStartTime.map(hourMinutePathSuffix) / filename
)
Parquet.writeNPRs(
nprs = nprs,
fsMode = fs,
path = s3 / bucket / tableKeyToOutputKeyNprs(regularKey.tableKey) / regularKey.partition.pathSuffix /?
continuallyRunStartTime.map(hourMinutePathSuffix) / filename
)
} pipe Iterator.single
}.count() // Just some action to force execution
info("Writing _SUCCESS files")
successFiles.foreach(S3Utils.uploadFileContent(bucket, "", _))
Of course this code cannot be copy and pasted as many methods and values are not provided. The key points are:
We hand crank the deleting of _SUCCESS files and previous files when overwriting
Each spark partition will result in one-or-many output files (many when multiple data schemas are in the same partition)
We hand crank the writing of _SUCCESS files
Notes:
UntypedStruct is our nested representation of arbitrary schema. It's a little bit like Row in Spark but much better, as it's based on Map[String, Any].
NotProcessableRecord are essentially just dead letters
Parquet.writeUntypedStruct is the crux of the logic of writing a parquet file, so we'll explain this in more detail. Firstly
val toMessageType: StructType => MessageType = new org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter().convert
Should be self explanatory. Next fsMode contains within it the com.amazonaws.auth.AWSCredentials, then inside writeUntypedStruct we use that to construct org.apache.hadoop.conf.Configuration setting fs.s3a.access.key and fs.s3a.secret.key.
writeUntypedStruct basically just calls out to:
def writeRaw(
data: List[UntypedStruct],
schema: MessageType,
config: Configuration,
path: Path,
compression: CompressionCodecName = CompressionCodecName.SNAPPY
): Unit =
Using.resource(
ExampleParquetWriter.builder(path)
.withType(schema)
.withConf(config)
.withCompressionCodec(compression)
.withValidation(true)
.build()
)(writer => data.foreach(data => writer.write(transpose(data, new SimpleGroup(schema)))))
where SimpleGroup comes from org.apache.parquet.example.data.simple, and ExampleParquetWriter extends ParquetWriter<Group>. The method transpose is a very tedious self writing recursion through the UntypedStruct populating a Group (some ugly Java mutable low level thing).
Credit must go to https://github.com/davidainslie for figuring out how these underlying libraries work, and labouring out the code, which like I said, we intend on making Open Source soon!
AFAIK, there is no way to split one RDD into multiple RDD per se. This is just how the way Spark's DAG works: only child RDDs pulling data from parent RDDs.
We can, however, have multiple child RDDs reading from the same parent RDD. To avoid recomputing the parent RDD, there is no other way but to cache it. I assume that you want to avoid caching because you're afraid of insufficient memory. We can avoid Out Of Memory (OOM) issue by persisting the RDD to MEMORY_AND_DISK so that large RDD will spill to disk if and when needed.
Let's begin with your original data:
val allDataRDD = sc.parallelize(Seq(Row(1,1,1),Row(2,2,2),Row(3,3,3)))
We can persist this in memory first, but allow it to spill over to disk in case of insufficient memory:
allDataRDD.persist(StorageLevel.MEMORY_AND_DISK)
We then create the 3 RDD outputs:
filtered_data_1 = allDataRDD.filter(_.get(1)==1) // //
filtered_data_2 = allDataRDD.filter(_.get(2)==1) // use your own filter funcs here
filtered_data_3 = allDataRDD.filter(_.get(3)==1) // //
We then write the outputs:
var resultDF_1 = sqlContext.createDataFrame(filtered_data_1, schema_1)
resultDF_1.write.parquet(output_path_1)
var resultDF_2 = sqlContext.createDataFrame(filtered_data_2, schema_2)
resultDF_2.write.parquet(output_path_2)
var resultDF_3 = sqlContext.createDataFrame(filtered_data_3, schema_3)
resultDF_3.write.parquet(output_path_3)
If you truly really want to avoid multiple passes, there is a workaround using a custom partitioner. You can repartition your data into 3 partitions and each partition will have its own task and hence its own output file/part. The caveat is that parallelism will be heavily reduced to 3 threads/tasks, and there's also the risk of >2GB of data stored in a single partition (Spark has a 2GB limit per partition). I am not providing detailed code for this method because I don't think it can write parquet files with different schema.

Resources