I've got an OHLCV financial dataset I'm working with that's approx 1.5M rows. It's for a single security back to 2000 at a 1-minute resolution. The dataset had all the zero-volume timestamps removed, so it has gaps in the trading day, which I don't want for my purposes.
Raw data (df1) looks like:
Timestamp, Open, High, Low, Close, Volume
To fill in all the zero-volume timestamps I created an empty trading calendar (df2) using pandas_market_calendars (which is an amazing time saver) and then used df3 = merge_asof(df2, df1, on='Timestamp', direction='nearest') to fill all the timestamps. This is the behaviour I wanted for the price data (OHLC) but not for volume. I need all the 'filled' timestamps to show zero volume, so I figured a lambda function would suit (as below) to check whether each timestamp was in the original dataframe (df1) or not.
tss = df1.Timestamp.to_numpy()
df3['Adj_Volume'] = df3.apply(lambda x: x['Volume'] if x['Timestamp'] in tss else 0, axis=1)
I ran this for an hour, then two hours, then five hours, and it still hadn't finished. To try and work out what's going on, I then used tqdm (progress_apply) and it estimates that it'll take 100 hours to complete! I'm running the conda distribution of Jupyter Notebook on a 2014 MacBook Air (1.7 GHz, 8 GB RAM), which isn't a supercomputer, but 100 hours seems wacky.
If I cut tss and df3 down to a single year (~50k rows), it'll run in ~5 minutes. However, this doesn't scale linearly to the full dataset: 100 hours vs ~100 minutes (5 minutes x 20 years, 2000-2019). Slicing the dataframe up into years in a Python-level loop and then joining them again afterwards feels clunky, but I can't think of another way.
Is there a smarter way to do this which takes advantage of vectorised operations that can be run on the entire dataset in a single operation?
Cheers
Maybe you could try the np.where function together with the .isin method?
import numpy as np
df3['Volume'] = np.where(df3['Timestamp'].isin(tss), df3['Volume'], 0)
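For completeness, a minimal end-to-end sketch of the vectorised approach, assuming df1 holds the original OHLCV data with a datetime64 'Timestamp' column and df2 is the full trading-calendar index of minute timestamps:

import numpy as np
import pandas as pd

# merge_asof requires both frames to be sorted on the merge key.
df1 = df1.sort_values('Timestamp')
df2 = df2.sort_values('Timestamp')

# Nearest-match merge fills OHLC prices for every calendar timestamp.
df3 = pd.merge_asof(df2, df1, on='Timestamp', direction='nearest')

# Vectorised membership test: timestamps absent from the original data get zero volume.
df3['Volume'] = np.where(df3['Timestamp'].isin(df1['Timestamp']), df3['Volume'], 0)

.isin builds a hash table internally, so the whole column is checked in one pass instead of scanning the numpy array once per row as the lambda does, which is why it finishes in seconds rather than hours.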
I have a dataset where I have split the time into weekdays and hours (first table). I have to count the number of trips (demand) starting from PULocationID to DOLocationID for every hour of every day. I did this by grouping and using the size method (table 2). I also sorted these entries by time (ascending) before doing this. How can I use the default Python methods from numpy and pandas (so that I can still use this method to scale up for more data) to generate a 2D feature map for every hour and for all weekdays, filled in with the demand (trip count)?
The code snippet and dataframe information
Additional information: there are 265 PULocationIDs and 265 DOLocationIDs, so the 2D map should be a (265, 265) matrix (np array or df) for one hour on one weekday, filled with demand values. For example, if the first map is the demand at 8 AM, the second map should be the demand at 9 AM, and so on; after 23 it should start the next day. Any help or suggestions would be greatly appreciated.
Thank you!
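One possible approach, sketched under the assumption that the dataframe has 'weekday', 'hour', 'PULocationID' and 'DOLocationID' columns with IDs running 1-265 (adjust the names to match yours): group by (weekday, hour, PULocationID, DOLocationID), then pivot each (weekday, hour) slice into a full (265, 265) matrix.

import numpy as np
import pandas as pd

ids = range(1, 266)  # assumed ID range 1..265

# Demand = number of trips per (weekday, hour, pickup, dropoff) combination.
counts = (df.groupby(['weekday', 'hour', 'PULocationID', 'DOLocationID'])
            .size()
            .rename('demand')
            .reset_index())

feature_maps = {}
for (day, hour), grp in counts.groupby(['weekday', 'hour']):
    # Pivot to a (265, 265) demand matrix; pairs with no trips become 0.
    mat = (grp.pivot(index='PULocationID', columns='DOLocationID', values='demand')
              .reindex(index=ids, columns=ids)
              .fillna(0))
    feature_maps[(day, hour)] = mat.to_numpy()

This gives a dict keyed by (weekday, hour); np.stack(list(feature_maps.values())) would turn it into a single (n_maps, 265, 265) array if a tensor is more convenient.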
I currently have some code that computes the overall time taken to run the count operation on a dataframe. I have another implementation which measures the time taken to run count on a sampled version of this dataframe.
sampled_df = df.sample(withReplacement=False, fraction=0.1)
sampled_df.count()
I then extrapolate the overall count from the sampled count. But I do not see an overall decrease in the time taken to calculate this sampled count compared to doing a count on the whole dataset. Both seem to take around 40 seconds. Is there a reason this happens? Also, is there an improvement in terms of memory when using a sampled count over a count on the whole dataframe?
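For reference, the extrapolation being described is just a scale-up by the sampling fraction, something like:

fraction = 0.1
sampled_df = df.sample(withReplacement=False, fraction=fraction)
# Estimate the full count by scaling the sampled count back up.
estimated_total = int(sampled_df.count() / fraction)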
You can use countApprox. This lets you choose how long you're willing to wait for an approximate count / confidence interval.
Sample still needs to access all partitions to create a sample that is uniform. You aren't really saving any time by using a sample.
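A rough sketch of what that looks like in PySpark; countApprox lives on the RDD API and the timeout is given in milliseconds:

# Wait at most 1 second for an approximate count at 95% confidence.
approx_count = df.rdd.countApprox(timeout=1000, confidence=0.95)

Because countApprox can return a partial-result estimate once the timeout expires, it avoids waiting on every partition the way count() or sample().count() do.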
So I have a very large set of data (4 million rows+) with journey times between two location nodes for two separate years (2015 and 2024). These are stored in dat files in a format of:
Node A, Node B, Journey Time (s)
123, 124, 51.4
So I have one long file of over 4 million rows for each year. I need to interpolate journey times for a year between the two for which I have data. I've tried Power Query in Excel as well as Power BI Desktop but have had no reasonable solution beyond cutting the files into smaller < 1 million row pieces so that Excel can manage.
Any ideas?
What type of output are you looking for? PowerBI can easily handle this amount of data, but it depends on what you expect your result to be. If you're looking for the average % change in node-to-node travel time between the two years, then PowerBI could be utilised, as it is great at aggregating and comparing large datasets.
However, if you want an output of every single node-to-node delta between those two years, i.e. a 4M-row output, then PowerBI will calculate this, but then what do you do with it... a 4M-row table?
If you're looking to export a result of >150K rows (the PowerBI limit) or >1M rows (the Excel limit), then I would use Python for that (as mentioned above).
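If Python is an option, a pandas sketch of a straight linear interpolation between the two years could look like the following; the file names, column names, delimiter, and target year of 2020 are all assumptions to adapt:

import pandas as pd

cols = ['node_a', 'node_b', 'time_s']
t2015 = pd.read_csv('journeys_2015.dat', sep=r'\s+', names=cols)
t2024 = pd.read_csv('journeys_2024.dat', sep=r'\s+', names=cols)

# Pair up the two years on the node-to-node key.
merged = t2015.merge(t2024, on=['node_a', 'node_b'], suffixes=('_2015', '_2024'))

# Linear interpolation for a year between the two observed years.
target_year = 2020
frac = (target_year - 2015) / (2024 - 2015)
merged['time_interp'] = merged['time_s_2015'] + frac * (merged['time_s_2024'] - merged['time_s_2015'])

merged[['node_a', 'node_b', 'time_interp']].to_csv('journeys_interpolated.csv', index=False)

Pandas holds 4M-row frames comfortably in memory on an ordinary machine, so there is no need to split the files the way Excel requires.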
I'm just getting started playing around with Python and Pandas, with ~10 hours total invested so far. I have a dataframe of daily stock data and I've resampled it weekly. The problem lies in weeks where Friday is a holiday: I get NaN in my dataset. Is there a way to accommodate this scenario? (Same issue when I resample monthly, where the final day is a weekend.)
sample = 'W-FRI'
for i in range(tickerCount):
    datalist.append(yf.download(stock_list[i], start, end))
    datalist[i]['High'] = datalist[i]['High'].resample(sample).max()
    datalist[i]['Low'] = datalist[i]['Low'].resample(sample).min()
    datalist[i]['Open'] = datalist[i]['Open'].resample(sample).first()
    datalist[i]['Close'] = datalist[i]['Close'].resample(sample).last()
    datalist[i] = datalist[i].asfreq(sample, method='pad')
As you can see, the week of Good Friday could not be sampled properly. I know it's possible to remove these rows from the dataframe:
datalist[i] = datalist[i][datalist[i]['High'].notna()]
But ideally I would like to grab the last day of data for the specified resampled period (in this case, use Thursday's data). I've looked at this answer.
Is there a way to accomplish this?
EDIT:
@ElliottCollins had the idea to use .ffill() to fill the holiday Friday with the previous day's data (from Thursday). This also fills every Saturday and Sunday with the previous day's data. Unfortunately, when I do this and then resample W-FRI, my Open values are incorrect; they become the previous Friday's open rather than Monday's open.
EDIT 2
I just realized that if I set the index again after all this, I'm able to resample as desired. I'll post the solution below.
Thanks to @ElliottCollins for the tip about forward-filling data.
datalist[i] = datalist[i].ffill()
This also fills weekends, which I don't want, so I need to create a column from the index:
datalist[i] = datalist[i].reset_index()
And then remove weekends
datalist[i] = datalist[i][datalist[i]['Date'].dt.dayofweek < 5]
And I need the Date column to be reset as the index for transformations later on, so
datalist[i] = datalist[i].set_index('Date')
And I was able to effectively get the data I needed
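For anyone landing here later, the same weekly bars can also be built in a single pass with resample(...).agg(...) on the raw daily data; this is only a sketch, and the Volume handling is an assumption rather than part of the original question:

import yfinance as yf

daily = yf.download(stock_list[i], start, end)

# Aggregate the daily columns into Friday-labelled weekly bars in one step.
weekly = daily.resample('W-FRI').agg({
    'Open': 'first',   # first trading day of the week (usually Monday)
    'High': 'max',
    'Low': 'min',
    'Close': 'last',   # last trading day, e.g. Thursday before Good Friday
    'Volume': 'sum',   # assumption: weekly volume as the sum of daily volume
})

Because the aggregation only looks at the rows that actually exist in each week, a holiday Friday simply means 'last' picks up Thursday's close, with no forward-filling of weekends needed.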
The flow of my program is something like this:
1. Read 4 billion rows (~700GB) of data from a parquet file into a data frame. Partition size used is 2296
2. Clean it and filter out 2.5 billion rows
3. Transform the remaining 1.5 billion rows using a pipeline model and then a trained model. The model is a logistic regression that predicts 0 or 1, and 30% of the data is filtered out of the transformed data frame.
4. The above data frame is Left outer joined with another dataset of ~1 TB (also read from a parquet file.) Partition size is 4000
5. Join it with another dataset of around 100 MB like
joined_data = data1.join(broadcast(small_dataset_100MB), data1.field == small_dataset_100MB.field, "left_outer")
6. The above dataframe is then exploded by a factor of ~2000:
exploded_data = joined_data.withColumn('field', explode('field_list'))
7. An aggregate is performed:
aggregate = exploded_data.groupBy(*cols_to_select)\
    .agg(F.countDistinct(exploded_data.field1).alias('distincts'), F.count("*").alias('count_all'))
There are a total of 10 columns in the cols_to_select list.
8. And finally an action, aggregate.count() is performed.
The problem is, the third-last count stage (200 tasks) gets stuck at task 199 forever. In spite of allocating 4 cores and 56 executors, the count uses only one core and one executor to run the job. I tried breaking the data down from 4 billion rows to 700 million rows (one sixth of it), and it took four hours. I would really appreciate some help with how to speed this process up. Thanks
The operation was getting stuck at the final task because of skewed data being joined to a huge dataset. The key joining the two dataframes was heavily skewed. The problem was solved for now by removing the skewed data from the dataframe. If you must include the skewed data, you can use iterative broadcast joins (https://github.com/godatadriven/iterative-broadcast-join). See this informative video for more details: https://www.youtube.com/watch?v=6zg7NTw-kTQ
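As a rough illustration of the "find and remove the skewed keys" step, one might first measure the key distribution and then filter; the join-column name ('field') and the 10M-row threshold below are assumptions:

from pyspark.sql import functions as F

# Count rows per join key to see how skewed the distribution is.
key_counts = data1.groupBy('field').count().orderBy(F.desc('count'))
key_counts.show(20)

# Assumption: treat any key with more than 10 million rows as skewed and exclude it.
skewed_keys = [row['field'] for row in key_counts.filter(F.col('count') > 10000000).collect()]
data1_no_skew = data1.filter(~F.col('field').isin(skewed_keys))

If the skewed keys have to stay in the result, salting the key or the iterative broadcast join linked above are the usual alternatives.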