Why does reading a small subset of the rows with Parquet Dataset take the same time as reading the whole file? - python-3.x

I'm developing a program to analyze historical prices of some assets. The data is structured and analyzed as a pandas DataFrame. The columns are the dates and the rows are the assets. Previously I was using the transpose of this, but this format gave me better reading time. I saved this data in a Parquet file, and now I want to read an interval of dates, for example from A to B, for a small set of assets, analyze it, and then repeat the same process with the same assets for the interval from B + 1 to C.
The problem is that even if I select a single row, the Parquet read takes the same time as reading the whole file. Is there a way to improve this behaviour? It would be good if, once the rows are filtered, the reader remembered where the relevant blocks are in order to speed up the next reads. Do I have to write a new file with the assets filtered?
I tried writing the Parquet file with a small number of row groups and a smaller data page size to avoid reading the whole file, but this didn't give me good results in terms of time.
Another question I have is the following: why does reading the complete Parquet file using a ParquetDataset with use_legacy_dataset=False take more time than reading the same dataset with use_legacy_dataset=True?
Code example:
import pandas as pd
import numpy as np
import time
import pyarrow.parquet as pq
# generating a small data set for the example; the file weighs about 150 MB here, while the real data
# is 2 GB
dates = pd.bdate_range('2019-01-01', '2020-03-01')
assets = list(range(1000, 50000))
historical_prices = pd.DataFrame(np.random.rand(len(assets), len(dates)), assets, dates)
historical_prices.columns = historical_prices.columns.strftime('%Y-%m-%d')
# name of the index
historical_prices.index.name = 'assets'
# writing the parquet file using the latest version; the commented-out options are the things that I tested
historical_prices.to_parquet(
    'historical_prices.parquet',
    version='2.0',
    data_page_version='2.0',
    writer_engine_version='2.0',
    # row_group_size=100,
    # compression=None
    # use_dictionary=False,
    # data_page_size=1000,
    # use_byte_stream_split=True,
    # flavor='spark',
)
# reading the complete parquet dataset
start_time = time.time()
historical_prices_dataset = pq.ParquetDataset(
    'historical_prices.parquet',
    use_legacy_dataset=False
)
historical_prices_dataset.read_pandas().to_pandas()
print(time.time() - start_time)
# Reading only one asset of the parquet dataset
start_time = time.time()
filters = [('assets', '=', assets[0])]
historical_prices_dataset = pq.ParquetDataset(
    'historical_prices.parquet',
    filters=filters,
    use_legacy_dataset=False
)
historical_prices_dataset.read_pandas().to_pandas()
print(time.time() - start_time)
# this is what I want to do, read by intervals.
num_intervals = 5
for i in range(num_intervals):
    start = int(i * len(dates) / num_intervals)
    end = int((i + 1) * len(dates) / num_intervals)
    interval = list(dates[start:end].strftime('%Y-%m-%d'))
    historical_prices_dataset.read_pandas(columns=interval).to_pandas()
    # Here goes some analysis that can't be done in parallel, because the results of every interval
    # are used in the next interval
print(time.time() - start_time)

"I was using the transpose of this, but this format gave me better reading time."
Parquet supports reading individual columns. So if you have 10 columns of 10k rows and you want 5 columns, you'll read 50k cells. If you have 10k columns of 10 rows and you want 5 columns, you'll read 50 cells. Presumably this is why the transpose gave you better reading time, although I don't think I have enough details here to be sure. Parquet also supports reading individual row groups; more on that later.
You have roughly 49,000 assets and 300 dates. I'd expect you to get better performance with assets as columns but 49,000 is a lot of columns to have. It's possible that either you are having to read too much column metadata or you are dealing with CPU overhead from keeping track of so many columns.
It is a bit odd to have date values or asset ids as columns. A far more typical layout would be to have three columns: "date", "asset id", & "price".
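To make the three-column layout concrete, here is a minimal sketch of reshaping a wide frame like the one in the question into long form. The tiny table and the column names asset_id, date and price are assumptions for illustration only, not something taken from the real data.

```python
import numpy as np
import pandas as pd

# Small stand-in for the wide table in the question: rows are assets, columns are dates.
dates = pd.bdate_range("2019-01-01", periods=4).strftime("%Y-%m-%d")
wide = pd.DataFrame(
    np.random.rand(3, len(dates)),
    index=pd.Index([1000, 1001, 1002], name="assets"),
    columns=dates,
)

# Reshape to one row per (asset, date) pair with three columns,
# sorted so that all rows of one asset sit next to each other.
long = (
    wide.reset_index()
    .melt(id_vars="assets", var_name="date", value_name="price")
    .rename(columns={"assets": "asset_id"})
    .sort_values(["asset_id", "date"], ignore_index=True)
)
```

With 3 assets and 4 dates this gives 12 rows. Whether the long layout actually reads faster for your access pattern is something you would have to measure.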
"The problem is that even if I use a unique row, the parquet read takes the same time as reading the whole file."
Yes, if you have a single row group. Parquet does not support partial row group reads. I believe this is due to the fact that the columns are compressed. However, I do not get the same results you are getting. The middle time in your example (the single asset read) is typically ~60-70% of the time of the first read. So it is faster. Possibly just because there is less conversion to do to get to pandas or maybe there is some optimization I'm not aware of.
"The problem is that even if I use a unique row, the parquet read takes the same time as reading the whole file. Is there a way to improve this behaviour? It would be good if, once it filters the rows, it saved where the blocks in memory are to speed up the next reads. Do I have to write a new file with the assets filtered?"
Row groups may be your answer. See the next section.
"I tried writing the parquet file with a small number of row groups and smaller data page size to avoid the complete reading, but this didn't give me good results in terms of time."
This is probably what you are after (or you can use multiple files). Parquet supports reading just one row group out of a whole file. However, 100 is too small of a number for row_group_size. Each row group creates some amount of metadata in the file and has some overhead for processing. If I change that to 10,000 for example then the middle read is twice as fast (and now only 30-40% of the full table read).
"Another question that I have is the following. Why, if we read the complete parquet file using a Parquet Dataset and use_legacy_dataset = False, does it take more time than reading the same parquet dataset with use_legacy_dataset = True?"
The new datasets API is pretty new (new as of 1.0.0, which was released in July). It's possible there is just a bit more overhead. You are not doing anything that would take advantage of the new datasets API (e.g. using scan, non-Parquet datasets, or new filesystems). So while use_legacy_dataset=True shouldn't be faster, it should not be any slower either. They should take roughly the same amount of time.
It sounds like you have many assets (tens of thousands) and you want to read a few of them. You also want to chunk the read into smaller reads (which you are using the date for).
First, instead of using the date at all, I would recommend using dataset.scan (https://arrow.apache.org/docs/python/dataset.html). This will allow you to process your data one row group at a time.
Second, is there any way you can group your asset ids? If each asset ID has only a single row you can ignore this. However, if you have (for example) 500 rows for each asset ID (or 1 row for each asset ID/date pair) can you write your file so that it looks something like this...
asset_id  date  price
A         1     ?
A         2     ?
A         3     ?
B         1     ?
B         2     ?
B         3     ?
If you do this AND you set the row group size to something reasonable (try 10k or 100k and then refine from there) then you should be able to get it so that you are only reading 1 or 2 row groups per asset ID.

I found another approach that gives me better times for my specific case; of course, it is not a very general solution. It uses some functions that are not part of pyarrow, but it does what I thought the pyarrow filters would do when the same rows are read multiple times. When the number of row groups to read grows, the ParquetDataset gives better performance.
import pandas as pd
import numpy as np
import time
import pyarrow.parquet as pq
from typing import Dict, Any, List
class PriceGroupReader:
    def __init__(self, filename: str, assets: List[int]):
        self.price_file = pq.ParquetFile(filename)
        self.assets = assets
        self.valid_groups = self._get_valid_row_groups()

    def _get_valid_row_groups(self):
        """
        I didn't find a parquet function to do this row group search, so I did this manual search.
        Note: The assets index is sorted, so probably this can be improved a lot.
        """
        start_time = time.time()
        assets = pd.Index(self.assets)
        valid_row_groups = []
        index_position = self.price_file.schema.names.index("assets")
        for i in range(self.price_file.num_row_groups):
            row_group = self.price_file.metadata.row_group(i)
            statistics = row_group.column(index_position).statistics
            if np.any((statistics.min <= assets) & (assets <= statistics.max)):
                valid_row_groups.append(i)
        print("getting the row groups: {}".format(time.time() - start_time))
        return valid_row_groups

    def read_valid_row_groups(self, dates: List[str]):
        row_groups = []
        for row_group_pos in self.valid_groups:
            df = self.price_file.read_row_group(row_group_pos, columns=dates, use_pandas_metadata=True).to_pandas()
            df = df.loc[df.index.isin(self.assets)]
            row_groups.append(df)
        df = pd.concat(row_groups)
        """
        # This is another way to read the groups but I think it can consume more memory, probably is faster.
        df = self.price_file.read_row_groups(self.valid_groups, columns=dates, use_pandas_metadata=True).to_pandas()
        df = df.loc[df.index.isin(self.assets)]
        """
        return df
def write_prices(assets: List[int], dates: List[str]):
    historical_prices = pd.DataFrame(np.random.rand(len(assets), len(dates)), assets, dates)
    # name of the index
    historical_prices.index.name = 'assets'
    # writing the parquet file using the latest version; the commented-out options are the things that I tested
    historical_prices.to_parquet(
        'historical_prices.parquet',
        version='2.0',
        data_page_version='2.0',
        writer_engine_version='2.0',
        row_group_size=4000,
        # compression=None
        # use_dictionary=False,
        # data_page_size=1000,
        # use_byte_stream_split=True,
        # flavor='spark',
    )

# generating a small data set for the example; the file weighs about 150 MB, while the real data weighs 2 GB
total_dates = list(pd.bdate_range('2019-01-01', '2020-03-01').strftime('%Y-%m-%d'))
total_assets = list(range(1000, 50000))
write_prices(total_assets, total_dates)
# selecting a subset of the whole assets
valid_assets = total_assets[:3000]
# read the price file for the example
price_group_reader = PriceGroupReader('historical_prices.parquet', valid_assets)
# reading all the dates, only as an example
start_time = time.time()
price_group_reader.read_valid_row_groups(total_dates)
print("complete reading: {}".format(time.time() - start_time))
# this is what I want to do, read by intervals.
num_intervals = 5
start_time = time.time()
for i in range(num_intervals):
    start = int(i * len(total_dates) / num_intervals)
    end = int((i + 1) * len(total_dates) / num_intervals)
    interval = list(total_dates[start:end])
    df = price_group_reader.read_valid_row_groups(interval)
    # print(df)
print("interval reading: {}".format(time.time() - start_time))
filters = [('assets', 'in', valid_assets)]
price_dataset = pq.ParquetDataset(
    'historical_prices.parquet',
    filters=filters,
    use_legacy_dataset=False
)
start_time = time.time()
price_dataset.read_pandas(columns=total_dates).to_pandas()
print("complete reading with parquet dataset: {}".format(time.time() - start_time))
start_time = time.time()
for i in range(num_intervals):
    start = int(i * len(total_dates) / num_intervals)
    end = int((i + 1) * len(total_dates) / num_intervals)
    interval = list(total_dates[start:end])
    df = price_dataset.read_pandas(columns=interval).to_pandas()
print("interval reading with parquet dataset: {}".format(time.time() - start_time))

Related

Pandas converts some numbers into zeros or other fixed values

I'm using Python with Pandas through Google Colab for some data analysis. I was analyzing some data through plots and noticed some missing data. However, when I looked at the original Excel data before any Python work, there was no missing data in these places. Somehow it's turning the first four days of a month of hourly data into zeros or other constant values, but only for some of the files and some of the time periods. Following the zeros is also a period of other constant values.
I have four similar data files and two of them seem to be working just fine, but the other two get these zeros at the start of SOME (consecutive) months, while nothing is wrong with the original data. Is there some feature in Pandas that could cause some numbers to turn into zeros or other constant values? The same code is used for all the different files, which are all in the same format.
I thought it could be just a problem with using 'resample' during plotting, but even when I just print the values without 'resample', the values are still missing. I included a figure here to show what the data problem looks like.
Function to read the data:
import pandas as pd

def read_elec_data(data_file_name):
    df = pd.read_excel(data_file_name)  # Read the original data
    # Convert the time value (30.11.2018 0:00-1:00) into a Pandas-compatible timestamp format (2018-11-30 0:00)
    new = df["Päivämäärä ja tunti"].str.split("-", n=1, expand=True)  # Split the time column by the delimiter and make it into two new columns [0, 1]. The ending hour [1] can be ignored.
    time_data = new[0]
    time_data_fixed = pd.to_datetime(time_data)  # Convert the modified time data into datetime format
    df['Aika'] = time_data_fixed  # Add the new time column to the dataframe
    # Remove all columns except the new timestamp and energy consumption columns. Rename the consumption according to the building name
    building_name = df['Kohde'][0]
    df.drop(columns=["Päivämäärä ja tunti", "Tunti", 'Kohde', 'Mittarin nimi'], inplace=True)  # Remove everything except the new timestamp and energy consumption
    df = df.rename(columns={'Kulutus[kWh]': building_name})
    df = df.set_index('Aika')  # Set the timestamp as the index for the final DataFrame that will be utilized in the calculations
    return df
Calling of the function:
all_electricity_data_list = []
for buildingname in list_of_electricity_data:
    df = read_elec_data(buildingname)  # Use the file reading and modification function
    all_electricity_data_list.append(df)
all_electricity_data = pd.concat(all_electricity_data_list, axis=1)
Some numbers are converted to zeros or other constant values even though the original data is fine:

Write Pandas dataframe data to CSV file

I am trying to write a pipeline to bring Oracle database table data to AWS.
It only takes a few ms to fill the DataFrame, but when I try to write the DataFrame to a CSV file it takes more than 2 minutes to write 10,000 rows. In addition, one of the columns has the cx_Oracle LOB datatype.
I thought this was why writing the data takes so long, so I converted the data to categorical data. But then the operation takes more memory. Does anyone have any suggestions on how to optimize this process?
query = 'select * from tablename'
cursor.execute(query)
iter_idx = 0
while True:
    results = cursor.fetchmany()
    if not results:
        break
    iter_idx += 1
    df = pd.DataFrame(results)
    df.columns = field['source_field_names']
    rec_count = df.shape[0]
    t_rec_count += rec_count
    file = generate_micro_file()
    print('memory usage : \n', df.info(memory_usage='deep'))
    # sd = dd.from_pandas(df, npartitions=1)
    df.to_csv(file, encoding=str(encoding_type), header=False, index=False, escapechar='\\', chunksize=arraysize)
code output:
From the data access side, there is room for improvement by optimizing the fetching of rows across the network, either by:
- passing a large num_rows value to fetchmany(); see the cx_Oracle documentation on Cursor.fetchmany() (https://cx-oracle.readthedocs.io/en/latest/api_manual/cursor.html#Cursor.fetchmany),
- or increasing the value of Cursor.arraysize.
Your question didn't explain enough about your LOB usage. See the sample return_lobs_as_strings.py for optimizing fetches.
See the cx_Oracle documentation Tuning Fetch Performance.
Is there a particular reason to spend the overhead of converting to a Pandas dataframe? Why not write directly using the csv module?
Maybe something like this:
import csv

with connection.cursor() as cursor:
    sql = "select * from all_objects where rownum <= 100000"
    cursor.arraysize = 10000
    with open("testwrite.csv", "w", encoding="utf-8") as outputfile:
        writer = csv.writer(outputfile, lineterminator="\n")
        results = cursor.execute(sql)
        writer.writerows(results)
You should benchmark and choose the best solution.
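If you want something you can benchmark without an Oracle instance at hand, the same pattern (set the cursor's arraysize, fetch in batches, hand each batch to csv.writer) can be sketched with the standard library alone. Note the assumptions: sqlite3 stands in for the Oracle connection, the table contents are invented, and dump_query_to_csv is a hypothetical helper name, not part of cx_Oracle.

```python
import csv
import io
import sqlite3

# Stand-in database: sqlite3 replaces the Oracle connection purely for illustration.
connection = sqlite3.connect(":memory:")
connection.execute("create table tablename (id integer, name text)")
connection.executemany(
    "insert into tablename values (?, ?)",
    [(i, f"row-{i}") for i in range(25_000)],
)


def dump_query_to_csv(connection, sql, outputfile, batch_size=10_000):
    """Stream query results to a CSV file object in batches; return the row count."""
    cursor = connection.cursor()
    cursor.arraysize = batch_size  # fetchmany() uses this as its default batch size
    cursor.execute(sql)
    writer = csv.writer(outputfile, lineterminator="\n")
    total = 0
    while True:
        rows = cursor.fetchmany()
        if not rows:
            break
        writer.writerows(rows)
        total += len(rows)
    cursor.close()
    return total


buffer = io.StringIO()
row_count = dump_query_to_csv(connection, "select * from tablename", buffer)
```

No DataFrame is built at any point, so memory use is bounded by one batch of rows. LOB columns would still need the handling described in the cx_Oracle samples mentioned above.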

Use pyspark to partition 100 rows from csv file

I'm trying to group 100 rows of a large csv file (100M+ rows) to send to a Lambda function.
I can use SparkContext to have a workaround like this:
csv_file_rdd = sc.textFile(csv_file).collect()
count = 0
buffer = []
while count < len(csv_file_rdd):
    buffer.append(csv_file_rdd[count])
    count += 1
    if count % 100 == 0 or count == len(csv_file_rdd):
        # Send buffer to process
        print("Send:", buffer)
        # Clear buffer
        buffer = []
but there must be a more elegant solution. I've tried using SparkSession and mapPartition but I haven't been able to make it work.
I suppose that your current data is not partitioned in any way (I mean it's only one file), so iterating over it sequentially is a must. I suggest loading it as a data frame with spark.read.csv(csv_file), then repartitioning as in this question and saving to disk. Once it's saved, you'll have a large number of files, each containing the specified number of records (100 in your case), that can be used by another program to send to a Lambda (probably with a pool of workers). See this post to get an idea. It is probably a naive idea, but it gets the job done.
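Whichever way the rows are distributed, the grouping step itself can be written as a small generator. This is only a sketch in plain Python: batches_of is a hypothetical helper, and the generated lines stand in for the CSV rows.

```python
from itertools import islice


def batches_of(rows, size=100):
    """Yield lists of at most `size` items from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Stand-in for the CSV lines; any iterator works, including a file object.
lines = (f"line-{i}" for i in range(250))
batch_sizes = [len(batch) for batch in batches_of(lines)]
```

Since the helper takes an iterator and returns an iterable, it should also be usable as the function passed to RDD.mapPartitions, for example sc.textFile(csv_file).mapPartitions(batches_of). In that case batches never span partition boundaries, so the last batch of each partition may hold fewer than 100 rows.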

Reading large volume data from Teradata using Dask cluster/Teradatasql and sqlalchemy

I need to read a large volume of data (approx. 800M records) from Teradata. My code works fine for a million records, but for larger sets it takes a long time to build the metadata. Could someone please suggest how to make it faster? Below is the code snippet I am using in my application.
def get_partitions(num_partitions):
    list_range = []
    initial_start = 0
    for i in range(num_partitions):
        amp_range = 3240 // num_partitions
        start = (i * amp_range + 1) * initial_start
        end = (i + 1) * amp_range
        list_range.append((start, end))
        initial_start = 1
    return list_range

@delayed
def load(query, start, end, connString):
    df = pd.read_sql(query.format(start, end), connString)
    engine.dispose()
    return df

connString = "teradatasql://{user}:{password}@{hostname}/?logmech={logmech}&encryptdata=true"
results = from_delayed([load(query, start, end, connString) for start, end in get_partitions(num_partitions)])
The build time is probably taken in finding out the metadata of your table. This is done by fetching the whole of the first partition and analysing it.
You would be better off either specifying it explicitly, if you know the dtypes upfront, e.g., {col: dtype, ...} for all the columns, or generating it from a separate query that you limit to just as many rows as it takes to be sure you have the right types:
(meta,) = dask.compute(load(query, 0, 10, connString))  # dask.compute returns a tuple
results = from_delayed(
    [
        load(query, start, end, connString) for start, end in
        get_partitions(num_partitions)
    ],
    meta=meta.iloc[:0, :]  # zero-length version of table
)
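If the dtypes are known upfront, the zero-length frame can be built without touching the database at all. A minimal sketch follows; the column names and dtypes are hypothetical placeholders, so substitute the real schema of your table.

```python
import pandas as pd

# Hypothetical column names and dtypes; replace with the real table's schema.
dtypes = {"customer_id": "int64", "amount": "float64", "created": "datetime64[ns]"}

# A zero-row frame that carries only the column names and dtypes.
meta = pd.DataFrame({name: pd.Series(dtype=dtype) for name, dtype in dtypes.items()})
```

A frame like this can then be passed as the meta argument of from_delayed, which should spare Dask from computing the first partition just to infer the types.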

Time Optimization for pandas dataframe reconstruction (random to fixed sampling)

I am very new to python and pandas and my limited experience has led me to come up with an inefficient solution making my code too slow.
I have some data corresponding to stock market prices.
Sampling is random at nanosecond level.
What I am trying to achieve is transform to a new data-set with fixed sampling rate.
I am transforming my data-set as follows:
I'm setting a time_delta as a static time step of 0.5 seconds
I'm dropping records corresponding to the same nanosecond
I'm generating timestamps from my start_time to the calculated end_time
I'm iterating through my original dataframe copying (and duplicating when needed) the last known record in my time_delta for each step to a new dataframe.
I believe that my issue is probably that I am appending the records one-by-one to my new dataframe, however I haven't been able to figure out a way utilizing pandas built-ins to optimize my code.
Runtime is currently ~4min for a day's data (turning ~30K samples to 57600) when executing on Google Colab.
I've also tested locally without any improvement.
# ====================================================================
# Rate Re-Definition
# ====================================================================
SAMPLES_PER_SECOND = 2
dt = 1000000000 / SAMPLES_PER_SECOND # Time delta in nanoseconds
SECONDS_IN_WORK_DAY = 28800 # 60 seconds * 60 minutes * 8 hours
TOTAL_SAMPLES = SECONDS_IN_WORK_DAY * SAMPLES_PER_SECOND
SAMPLING_PERIOD = dt * TOTAL_SAMPLES
start_of_day_timestamp = ceil_to_minute(df['TimeStamp'].iloc[0])
end_of_day_timestamp = start_of_day_timestamp + SAMPLING_PERIOD
fixed_timestamps = np.arange(start_of_day_timestamp,
                             end_of_day_timestamp,
                             dt,
                             dtype=np.uint64
                             )
# ====================================================================
# Drop records corresponding to the same timestamps
# ====================================================================
df1 = df.drop_duplicates(subset='TimeStamp', keep="last")
# ====================================================================
# Construct new dataframe
# ====================================================================
df2 = df1.iloc[0:1]
index_bounds_limit = df1.shape[0] - 1
index = 0
for i in tqdm(range(1, TOTAL_SAMPLES), desc="Constructing fixed sampling rate records... "):
    while index < index_bounds_limit and df1['TimeStamp'].iloc[index] < fixed_timestamps[i]:
        index += 1
    df2 = df2.append(df1.iloc[index], ignore_index=True)
df2['TimeStamp'] = fixed_timestamps
I need to reduce the time as much as possible (while maintaining readability/maintainability, no need to use "hacks").
I would appreciate any help and pointers towards the right direction.
Thanks in advance
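One possible direction, offered only as a sketch: the row lookup done by the while-loop can be expressed as a single np.searchsorted call, which avoids appending rows one by one. The arrays below are tiny made-up stand-ins for the de-duplicated, sorted TimeStamp column and one value column. The lookup mirrors the loop for every grid point after the first (the loop always uses the first record for the first grid point).

```python
import numpy as np

# Stand-ins for the de-duplicated, sorted timestamps (nanoseconds) and one value column.
timestamps = np.array([5, 12, 12_000, 30_000, 31_000], dtype=np.uint64)
prices = np.array([1.0, 1.1, 1.2, 1.3, 1.4])

# Fixed sampling grid, built the same way as fixed_timestamps in the question.
fixed_timestamps = np.arange(0, 50_000, 10_000, dtype=np.uint64)

# Position of the first record at or after each grid point, capped at the last record.
# This is what the while-loop finds; for "last record at or before the grid point",
# side="right" minus 1 would be the variant to try instead.
positions = np.searchsorted(timestamps, fixed_timestamps, side="left")
positions = np.minimum(positions, len(timestamps) - 1)

resampled_prices = prices[positions]
```

With a DataFrame, the same positions array could be used as df1.iloc[positions] followed by assigning the fixed timestamps, so the whole reconstruction becomes a few vectorized steps. I have not timed this against your data.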

Resources