Another problem with: PerformanceWarning: DataFrame is highly fragmented - python-3.x

Since I am still learning Python, I am running into some optimisation problems here.
I keep getting the warning
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
and the code takes quite a while to run for what I am doing now.
Here is my code:
import numpy as np
import pandas as pd

def Monte_Carlo_for_Tracking_Error(N, S, K, Ru, Rd, r, I, a):
    ldv = []
    lhp = []
    lsp = []
    lod = []
    Tracking_Error_df = pd.DataFrame()
    # Go through different time steps of rebalancing
    for y in range(1, I + 1):
        i = 0
        # do the same step a times
        while i < a:
            Sample_Stock_Prices = []
            Sample_Hedging_Portfolio = []
            Hedging_Portfolio_Value = np.zeros(N)  # Initialize hedging portfolio
            New_Path = Portfolio_specification(N, S, K, Ru, Rd, r)  # Get a new sample path
            Sample_Stock_Prices.append(New_Path[0])
            Sample_Hedging_Portfolio.append(Changing_Rebalancing_Rythm(New_Path, y))
            Call_Option_Value = []
            Call_Option_Value.append(New_Path[1])
            Differences = np.zeros(N)
            for x in range(N):
                Hedging_Portfolio_Value[x] = Sample_Stock_Prices[0][x] * Sample_Hedging_Portfolio[0][x]
            for z in range(N):
                Differences[z] = Call_Option_Value[0][z] - Hedging_Portfolio_Value[z]
            lhp.append(Hedging_Portfolio_Value)
            lsp.append(np.asarray(Sample_Stock_Prices))
            ldv.append(np.asarray(Sample_Hedging_Portfolio))
            lod.append(np.asarray(Differences))
            # Inserting one column at a time is what triggers the PerformanceWarning
            Tracking_Error_df[f'Index{i + (y - 1) * 200}'] = Differences
            i = i + 1
    return (Tracking_Error_df, lod, lsp, lhp, ldv)
Code starts to give me warnings when I try to run:
Simulation=MCTE(100,100,104,1.05,0.95,0,10,200)
Small part of the warning:
C:\Users\xxx\AppData\Local\Temp\ipykernel_1560\440260239.py:30: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
Tracking_Error_df[f'Index{i+(y-1)*200}']=Differences
(the same warning is repeated for every new column that is inserted)
I am using Jupyter Notebook for this. If somebody could help me optimise it, I would appreciate it.
I have tested the code and it works; I am just expecting a more performance-oriented version of it.
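The warning already points at the usual fix: instead of inserting one column into Tracking_Error_df on every iteration, collect the columns first and build the frame once (or join them with a single pd.concat(axis=1)). A minimal, runnable sketch of that pattern, not the full Monte Carlo function; N, runs and columns are stand-in names introduced for illustration:

import numpy as np
import pandas as pd

# Each iteration produces one column of N differences; collecting them in a
# dict and constructing the DataFrame once avoids the repeated frame.insert
# that the PerformanceWarning complains about.
N, runs = 100, 2000
columns = {}
for i in range(runs):
    Differences = np.random.randn(N)      # stand-in for the real Differences array
    columns[f'Index{i}'] = Differences
Tracking_Error_df = pd.DataFrame(columns)  # built in one go, no fragmentation

# Equivalent alternative with pd.concat, as the warning suggests:
# Tracking_Error_df = pd.concat(
#     {name: pd.Series(col) for name, col in columns.items()}, axis=1)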

Related

Pyspark - Loop n times - Each loop gets gradually slower

So basically I want to loop n times through my DataFrame and apply a function in each loop
(perform a join).
My test DataFrame has about 1000 rows, and in each iteration exactly one column is added.
The first three loops run instantly, and from then on it gets really, really slow.
The 10th loop, for example, needs more than 10 minutes.
I don't understand why this happens, because my DataFrame won't grow larger in terms of rows.
If I call my function with n=20, for example, the join runs instantly.
But when I loop iteratively 20 times, it soon gets stuck.
Do you have any idea what could be causing this problem?
Example code from "Evaluating Spark DataFrame in loop slows down with every iteration, all work done by controller":
import time
from pyspark import SparkContext

sc = SparkContext()

def push_and_pop(rdd):
    # two transformations: moves the head element to the tail
    first = rdd.first()
    return rdd.filter(
        lambda obj: obj != first
    ).union(
        sc.parallelize([first])
    )

def serialize_and_deserialize(rdd):
    # perform a collect() action to evaluate the rdd and create a new instance
    return sc.parallelize(rdd.collect())

def do_test(serialize=False):
    rdd = sc.parallelize(range(1000))
    for i in xrange(25):
        t0 = time.time()
        rdd = push_and_pop(rdd)
        if serialize:
            rdd = serialize_and_deserialize(rdd)
        print "%.3f" % (time.time() - t0)

do_test()
I have fixed this issue by converting the DataFrame to an RDD and back to a DataFrame every n iterations.
The code runs fast now, but I don't understand exactly what the reason for that is. The explain plan seems to grow very quickly over the iterations if I don't do the conversion.
This workaround is also mentioned in the book "High Performance Spark":
While the Catalyst optimizer is quite powerful, one of the cases where
it currently runs into challenges is with very large query plans.
These query plans tend to be the result of iterative algorithms, like
graph algorithms or machine learning algorithms. One simple workaround
for this is converting the data to an RDD and back to
DataFrame/Dataset at the end of each iteration
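A minimal sketch of that workaround in the DataFrame API, assuming a SparkSession named spark; the add-one-column step is only a stand-in for the real per-iteration join:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)  # stand-in for the ~1000-row test DataFrame

for i in range(20):
    # stand-in for the per-iteration join that adds exactly one column
    df = df.withColumn(f"col_{i}", F.lit(i))
    if (i + 1) % 5 == 0:
        # round-trip through an RDD to truncate the accumulated query plan,
        # as described in the quote above
        df = spark.createDataFrame(df.rdd, df.schema)

On newer Spark versions, df.checkpoint() (with a checkpoint directory set via sc.setCheckpointDir) achieves the same plan truncation without the RDD round trip.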

How to avoid the for loop in this Python script

I am doing a data analysis task. With this Python script I can get my desired results, but it is very slow, maybe due to the for loop. I have to handle millions of data rows; is there a way to make this script faster?
import numpy as np
import pandas as pd

result = []  # collects the per-window summary frames
# df is the input frame with columns ts, A, B, C, D, E, F

df = df.sort_values(by='ts')
df = df.set_index(pd.DatetimeIndex(df['ts']))
df = df.rename(columns={'ts': 'Time'})
x2 = df.groupby(pd.Grouper(freq='1D', base=30, label='right'))
for name, df1 in x2:
    df1_split = np.array_split(df1, 2)
    df_first = df1_split[0]
    df_second = df1_split[1]
    length_1 = len(df_first)
    length_2 = len(df_second)
    if len(df_first) >= 5000:
        df_first_diff_max = abs(df_first['A'].diff(periods=1)).max()
        if df_first_diff_max <= 10:
            time_first = df_first['Time'].values[0]
            time_first = pd.DataFrame([time_first], columns=['start_time'])
            time_first['End_Time'] = df_first['Time'].values[-1]
            time_first['flag'] = 1
            time_first['mean_B'] = np.mean(df_first['B'])
            time_first['mean_C'] = np.mean(df_first['C'])
            time_first['mean_D'] = np.mean(df_first['D'])
            time_first['E'] = df_first['E'].values[0]
            time_first['F'] = df_first['F'].values[0]
            result.append(time_first)
    if len(df_second) >= 5000:
        df_second_diff_max = abs(df_second['A'].diff(periods=1)).max()
        if df_second_diff_max <= 10:
            print('2')
            time_first = df_second['Time'].values[0]
            time_first = pd.DataFrame([time_first], columns=['start_time'])
            time_first['End_Time'] = df_second['Time'].values[-1]
            time_first['flag'] = 2
            time_first['mean_B'] = np.mean(df_second['B'])
            time_first['mean_C'] = np.mean(df_second['C'])
            time_first['mean_D'] = np.mean(df_second['D'])
            time_first['E'] = df_second['E'].values[0]
            time_first['F'] = df_second['F'].values[0]
            result.append(time_first)
final = pd.concat(result)
If you want to handle millions of rows, maybe you should try Hadoop or Spark, if you have enough resources.
I think that analysing that amount of data on a single node is a bit crazy.
If you are willing to try something different with Pandas, you could try using vectorization. Here is a link to a quick overview of the time to iterate over a set of data. It looks like Numpy has the most efficient vectorization method, but the internal Pandas one might work for you as well.
https://towardsdatascience.com/how-to-make-your-pandas-loop-71-803-times-faster-805030df4f06
The Pandas Built-In Function: iterrows() — 321 times faster
The apply() Method — 811 times faster
Pandas Vectorization — 9280 times faster
Numpy Vectorization — 71,803 times faster
(All according to timing the operations on a dataframe with 65 columns and 1140 rows)
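To make the gap concrete, here is a small, generic illustration of loop versus vectorized code; the frame and the computation are stand-ins, not the question's exact logic:

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.random.randn(1_000_000).cumsum(),
                   "B": np.random.randn(1_000_000)})

# Row-by-row Python loop (slow):
# total = sum(abs(row.A) + row.B for _, row in df.iterrows())

# Pandas vectorization: operate on whole columns at once
total_pd = (df["A"].abs() + df["B"]).sum()

# NumPy vectorization: the same on the underlying arrays
total_np = (np.abs(df["A"].to_numpy()) + df["B"].to_numpy()).sum()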

How to reduce the time taken to convert a dask dataframe to a pandas dataframe

I have a function to read large CSV files using a dask dataframe and then convert them to a pandas dataframe, which takes quite a lot of time. The code is:
import os
from glob import glob

import dask.dataframe as dd

def t_createdd(Path):
    dataframe = dd.read_csv(Path, sep=chr(1), encoding="utf-16")
    return dataframe

# Get the latest file
Array_EXT = "Export_GTT_Tea2Array_*.csv"
array_csv_files = sorted([file
                          for path, subdir, files in os.walk(PATH)
                          for file in glob(os.path.join(path, Array_EXT))])
latest_Tea2Array = array_csv_files[(len(array_csv_files) - (58 + 25)):
                                   (len(array_csv_files) - 58)]

Tea2Array_latest = t_createdd(latest_Tea2Array)

# keep only the required columns
Tea2Array = Tea2Array_latest[['Parameter_Id', 'Reading_Id', 'X', 'Value']]

P1MI3 = Tea2Array.loc[Tea2Array['parameter_id'] == 168566]
P1MI3 = P1MI3.compute()

P1MJC_main = Tea2Array.loc[Tea2Array['parameter_id'] == 168577]
P1MJC_old = P1MJC_main.compute()
P1MI3=P1MI3.compute() and P1MJC_old=P1MJC_main.compute() take around 10 and 11 minutes respectively to execute. Is there any way to reduce the time?
I would encourage you to consider, with reference to the Dask documentation, why you would expect the process to be any faster than using Pandas alone.
Consider:
file access may happen from several threads, but you only have one disc interface as a bottleneck, and it likely performs much better reading sequentially than trying to read several files in parallel
reading CSVs is CPU-heavy, and needs the python GIL. The multiple threads will not actually be running in parallel
when you compute, you materialise the whole dataframe. It is true that you appear to be selecting a single row in each case, but Dask has no way to know in which file/part it is.
you call compute twice, but could have combined them: Dask works hard to evict data from memory which is not currently needed by any computation, so you do double the work. By calling compute on both outputs, you would halve the time.
Further remarks:
obviously you would do much better if you knew which partition contained what
you can get around the GIL using processes, e.g., Dask's distributed scheduler
if you only need certain columns, do not bother to load everything and then subselect; include those columns right in the read_csv call, saving a lot of time and memory (true for pandas or Dask).
To compute both lazy things at once:
dask.compute(P1MI3, P1MJC_main)
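Putting the last two points together, a minimal sketch; it assumes the column really is named Parameter_Id as in the column selection above, and usecols plus the single dask.compute call are the only changes:

import dask
import dask.dataframe as dd

# read only the columns that are actually needed
Tea2Array = dd.read_csv(latest_Tea2Array, sep=chr(1), encoding="utf-16",
                        usecols=['Parameter_Id', 'Reading_Id', 'X', 'Value'])

P1MI3_lazy = Tea2Array.loc[Tea2Array['Parameter_Id'] == 168566]
P1MJC_lazy = Tea2Array.loc[Tea2Array['Parameter_Id'] == 168577]

# one pass over the files materialises both results
P1MI3, P1MJC_old = dask.compute(P1MI3_lazy, P1MJC_lazy)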

Featuretools Deep Feature Synthesis (DFS) extremely high overhead

The execution of both ft.dfs(...) and ft.calculate_feature_matrix(...) on some time series to extract the day month and year from a very small dataframe (<1k rows) takes about 800ms. When I compute no features at all, it still takes about 750ms. What is causing this overhead and how can I reduce it?
I've tested different combinations of features, as well as testing it on a bunch of small dataframes, and the execution time is pretty constant at 700-800ms.
I've also tested it on much larger dataframes with >1 million rows. The execution time without any actual features (primitives), at around 80-90 seconds, is pretty comparable to that with all the date features. So it seems like the computation time depends on the number of rows but not on the features?
I'm running with n_jobs=1 to avoid any weirdness with parallelism. It seems to me like featuretools is doing some configuration or setup for the dask back-end every time, and that is causing all of the overhead.
import featuretools as ft

es = ft.EntitySet(id="testing")
es = es.entity_from_dataframe(
    entity_id="time_series",
    make_index=True,
    dataframe=df_series[[
        "date",
        "flag_1",
        "flag_2",
        "flag_3",
        "flag_4"
    ]],
    variable_types={},
    index="id",
    time_index="date"
)

print(len(data))

features = ft.dfs(entityset=es, target_entity="sales", agg_primitives=[], trans_primitives=[])
The actual output seems to be correct; I am just surprised that Featuretools would take 800 ms to compute nothing on a small dataframe. Is the solution simply to avoid small dataframes and compute everything with a custom primitive on a large dataframe to mitigate the overhead? Or is there a smarter/more correct way of using ft.dfs(...) or ft.calculate_feature_matrix(...)?
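One way to see where the time goes is to separate feature definition from feature computation; a rough diagnostic sketch reusing the entityset es built above (it uses the entity id "time_series" as the target, since no "sales" entity is defined in the snippet):

import time
import featuretools as ft

t0 = time.perf_counter()
# features_only=True returns the feature definitions without computing anything
feature_defs = ft.dfs(entityset=es, target_entity="time_series",
                      agg_primitives=[], trans_primitives=[],
                      features_only=True)
t1 = time.perf_counter()
feature_matrix = ft.calculate_feature_matrix(feature_defs, entityset=es, n_jobs=1)
t2 = time.perf_counter()

print(f"define features: {t1 - t0:.3f}s, calculate matrix: {t2 - t1:.3f}s")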

How can I achieve server side filtering with the join in spark dataframe api

This is a part of my Spark app. The first part gets all the articles from the last hour, the second part grabs all of these articles' comments, and the third part adds the comments to the articles.
The problem is that the articles.map(lambda x:(x.id,x.id)).join(axes) part is too slow; it takes around 1 minute. I would like to get this down to 10 seconds or even less, but I don't know how. Thanks for your reply.
articles = sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="articles", keyspace=source).load() \
.map(lambda x:x).filter(lambda x:x.created_at!=None).filter(lambda x:x.created_at>=datetime.now()-timedelta(hours=1) and x.created_at<=datetime.now()-timedelta(hours=0)).cache()
axes = sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="axes", keyspace=source).load().map(lambda x:(x.article,x))
speed_rdd = articles.map(lambda x:(x.id,x.id)).join(axes)
EDIT
This is my new code, which I changed according to your suggestions. It is already twice as fast as before, so thanks for that ;). There is one more improvement I would like to make in the last part of my code, the axes part, which is still too slow and needs 38 seconds for 30 million rows:
range_expr = col("created_at").between(
datetime.now()-timedelta(hours=timespan),
datetime.now()-timedelta(hours=time_delta(timespan))
)
article_ids = sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="article_by_created_at", keyspace=source).load().where(range_expr).select('article','created_at').persist()
axes = sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="axes", keyspace=source).load()
I tried the following (which should replace the last axes part of my code), and this is also the solution I would like to use, but it doesn't seem to work properly:
in_expr = col("article").isin(article_ids.collect())
axes = sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="axes", keyspace=source).load().where(in_expr)
I always get this error message:
in_expr = col("article").isin(article_ids.collect())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'Column' object is not callable
Thanks for your help.
1) Predicate Pushdown is automatically detected by the Spark-Cassandra connector, as long as the filtering is possible in Cassandra (using primary key for filtering or secondary index): https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md#pushing-down-clauses-to-cassandra
2) For more efficient joins, you can call the method repartitionByCassandraReplica. Unfortunately this method may not be available for PySpark, only for Scala/Java API. Read the doc here: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#performing-efficient-joins-with-cassandra-tables-since-12
3) Another hint is to try to debug and understand how the connector is creating Spark partitions. There are some examples and caveats mentioned in the docs: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/16_partitioning.md
As mentioned before, if you want to achieve reasonable performance, don't convert your data to an RDD. Not only does this make optimizations like predicate pushdown impossible, it also introduces a huge overhead from moving data out of the JVM into Python.
Instead you should use SQL expressions / the DataFrame API in a way similar to this:
from pyspark.sql.functions import col, expr, current_timestamp

range_expr = col("created_at").between(
    current_timestamp() - expr("INTERVAL 1 HOUR"),
    current_timestamp())

articles = (sqlContext.read.format("org.apache.spark.sql.cassandra")
    .options(...).load()
    .where(col("created_at").isNotNull())  # This is not really required
    .where(range_expr))
It should also be possible to formulate the predicate expression using standard Python utilities, as you've done before:
import datetime

range_expr = col("created_at").between(
    datetime.datetime.now() - datetime.timedelta(hours=1),
    datetime.datetime.now()
)
The subsequent join should also be performed without moving data out of the DataFrame:
axes = (sqlContext.read.format("org.apache.spark.sql.cassandra")
    .options(...)
    .load())

articles.join(axes, ["id"])
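For the isin part of the edit, a join usually scales better than collecting the article ids to the driver. A minimal sketch using the question's own names (article_ids, axes), assuming a Spark version where the "leftsemi" join type is available:

# instead of: axes.where(col("article").isin(article_ids.collect()))
# keep the filtering on the executors with a semi join
axes_filtered = axes.join(
    article_ids.select("article").distinct(),
    on="article",
    how="leftsemi"
)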
