Brightway2: difference in calculation time between "aggregate LCI" ecoinvent version and "unit" ecoinvent version - brightway

I am curious as to what explains the significant difference in calculation time for a random process using the "aggregate LCI" (or "system", as it is sometimes called) and the "unit" version of ecoinvent 3.4 with Brightway2.
Intuitively, I expected faster calculation times with the aggregate LCI version. But it turns out that using the unit version of ecoinvent is about 20 times faster.
What is the reason for that? The following code (10 iterations) gives 76 seconds for the aggregate LCI version and 3.7 seconds for the unit version.
import brightway2 as bw
import timeit

def lca_road():
    # eidb is the imported ecoinvent 3.4 database (unit or aggregated LCI version)
    lca = bw.LCA({eidb.random(): 1},
                 ("IPCC 2013", "climate change", "GWP 100a"))
    lca.lci()
    lca.lcia()
    lca.score

timeit.timeit(lca_road, number=10)
Therefore, are there benefits in using the aggregate LCI version of ecoinvent? Or am I missing something?

It takes much longer to build the biosphere matrix for the aggregated version, as it has many more numbers. I wouldn't ever use the aggregated version myself, but I can imagine that the sparse matrix fill rate goes up from around 2% to close to 100%. This easily explains the time difference: matrix construction now dominates, and solving the matrix equation is less than 50% of the total calculation time. If you insist on using the aggregated results, then split off the relevant activities into a new database.
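For what it's worth, you can check the fill rates yourself. A minimal sketch (not from the answer above; it assumes bw and eidb are set up as in the question, and that lca.technosphere_matrix / lca.biosphere_matrix from bw2calc are scipy.sparse matrices after lci()):
import brightway2 as bw

lca = bw.LCA({eidb.random(): 1}, ("IPCC 2013", "climate change", "GWP 100a"))
lca.lci()
# Compare how dense the matrices are; with the aggregated system the biosphere
# matrix should be far denser than with the unit version.
for name, matrix in [("technosphere", lca.technosphere_matrix),
                     ("biosphere", lca.biosphere_matrix)]:
    rows, cols = matrix.shape
    print(f"{name}: {rows} x {cols}, fill rate {matrix.nnz / (rows * cols):.1%}")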

Related

Pyspark FP growth implementation running slow

I am using the pyspark.ml.fpm (FP Growth) implementation of association rule mining on Spark v2.3.
The Spark UI shows that the tasks at the end run very slowly. This seems to be a common problem and might be related to data skew.
Is this the real reason? Is there any solution for this?
I don't want to change the minSupport or minConfidence thresholds because that would affect my results. Removing the columns isn't a solution either.
I was facing a similar issue. One solution you might try is setting a threshold on the number of products in a transaction. If there are a couple of transactions that have way more products than the average, the tree computed by FP Growth blows up. This causes the runtime to increase significantly and makes the risk of memory errors much higher.
Hence, doing outlier removal on transactions with a disproportionate number of products might do the trick.
Hope this helps you out a bit :)
Late answer, but I also had an issue with long FPGrowth wait times, and the above answer really helped. I implemented it as follows to filter out anything that's more than one standard deviation above the mean (this is after the transactions have been grouped):
from pyspark.sql.functions import col, size
from pyspark.sql.functions import mean as _mean, stddev as _stddev

def clean_transactions(df):
    # Add a column with the number of items in each basket
    transactions_init = df.withColumn("basket_size", size("basket"))
    print('---collecting stats')
    df_stats = transactions_init.select(
        _mean(col('basket_size')).alias('mean'),
        _stddev(col('basket_size')).alias('std')
    ).collect()
    mean = df_stats[0]['mean']
    std = df_stats[0]['std']
    max_ct = mean + std
    print("--filtering out outliers")
    # Keep only transactions within one standard deviation above the mean basket size
    transactions_cleaned = transactions_init.filter(transactions_init.basket_size <= max_ct)
    return transactions_cleaned
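A possible way to wire this into the FP-Growth pipeline (a sketch only; the "basket" column name matches the function above, but the minSupport/minConfidence values are illustrative assumptions, not from the answers):
from pyspark.ml.fpm import FPGrowth

# transactions: the grouped DataFrame with one basket of products per row
transactions_cleaned = clean_transactions(transactions)
fp = FPGrowth(itemsCol="basket", minSupport=0.01, minConfidence=0.2)
model = fp.fit(transactions_cleaned)
model.associationRules.show()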

Featuretools Deep Feature Synthesis (DFS) extremely high overhead

The execution of both ft.dfs(...) and ft.calculate_feature_matrix(...) on some time series, to extract the day, month, and year from a very small dataframe (<1k rows), takes about 800 ms. When I compute no features at all, it still takes about 750 ms. What is causing this overhead and how can I reduce it?
I've tested different combinations of features as well as running it on a bunch of small dataframes, and the execution time is pretty constant at 700-800 ms.
I've also tested it on much larger dataframes with >1 million rows. The execution time without any actual features (primitives) is pretty comparable to that with all the date features, at around 80-90 seconds. So it seems like the computation time depends on the number of rows but not on the features?
I'm running with n_jobs=1 to avoid any weirdness with parallelism. It seems to me like featuretools is doing some configuration or setup for the dask back-end every time, and that is causing all of the overhead.
es = ft.EntitySet(id="testing")
es = es.entity_from_dataframe(
    entity_id="time_series",
    make_index=True,
    dataframe=df_series[[
        "date",
        "flag_1",
        "flag_2",
        "flag_3",
        "flag_4"
    ]],
    variable_types={},
    index="id",
    time_index="date"
)
print(len(df_series))
# target_entity must match the entity defined above
features = ft.dfs(entityset=es, target_entity="time_series", agg_primitives=[], trans_primitives=[])
The actual output seems to be correct; I am just surprised that Featuretools would take 800 ms to compute nothing on a small dataframe. Is the solution simply to avoid small dataframes and compute everything with a custom primitive on a large dataframe to mitigate the overhead? Or is there a smarter/more correct way of using ft.dfs(...) or ft.calculate_feature_matrix(...)?

Faster way to count values greater than 0 in Spark DataFrame?

I have a Spark DataFrame where all fields are integer type. I need to count how many individual cells are greater than 0.
I am running locally and have a DataFrame with 17,000 rows and 450 columns.
I have tried two methods, both yielding slow results:
Version 1:
(for (c <- df.columns) yield df.where(s"$c > 0").count).sum
Version 2:
df.columns.map(c => df.filter(df(c) > 0).count)
This calculation takes 80 seconds of wall clock time. With Python Pandas, it takes a fraction of a second. I am aware that for small data sets and local operation, Python may perform better, but this seems extreme.
Trying to make a Spark-to-Spark comparison, I find that running MLlib's PCA algorithm on the same data (converted to a RowMatrix) takes less than 2 seconds!
Is there a more efficient implementation I should be using?
If not, how is the seemingly much more complex PCA calculation so much faster?
What to do
import org.apache.spark.sql.functions.{col, count, when}
df.select(df.columns map (c => count(when(col(c) > 0, 1)) as c): _*)
Why
Both of your attempts create a number of jobs proportional to the number of columns. Computing the execution plan and scheduling a job are expensive on their own and add significant overhead depending on the amount of data.
Furthermore, the data might be loaded from disk and/or parsed each time a job is executed, unless the data is fully cached with a significant memory safety margin that ensures the cached data will not be evicted.
This means that in the worst-case scenario the nested-loop-like structure you use can be roughly quadratic in the number of columns.
The code shown above handles all columns at the same time, requiring only a single data scan.
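Since the question also benchmarks against Pandas, here is a rough PySpark sketch of the same single-scan idea (not part of the original answer); the final sum over the collected row is just one way to reduce the per-column counts to a single total:
from pyspark.sql import functions as F

# One job, one scan: count cells > 0 per column, then sum across columns.
counts = df.select(
    *[F.count(F.when(F.col(c) > 0, 1)).alias(c) for c in df.columns]
).first()
total = sum(counts)  # Row is iterable, so this adds up the per-column counts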
The problem with your approach is that the file is scanned for every column (unless you have cached it in memory). The fastest way with a single FileScan should be:
import org.apache.spark.sql.functions.{explode, array}
import spark.implicits._  // for the $"cell" column syntax

val cnt: Long = df
  .select(
    explode(
      array(df.columns.head, df.columns.tail: _*)
    ).as("cell")
  )
  .where($"cell" > 0).count
Still, I think it will be slower than with Pandas, as Spark has a certain overhead due to the parallelization engine.

oracle: Is there a way to check which sql_ids were downgraded to serial or a lesser degree over a period of time

I would like to know if there is a way to check which sql_ids were downgraded to either serial or a lesser degree in an Oracle 4-node RAC data warehouse, version 11.2.0.3. I want to write a script to check the queries that are downgraded.
SELECT NAME, inst_id, VALUE FROM GV$SYSSTAT
WHERE UPPER (NAME) LIKE '%PARALLEL OPERATIONS%'
OR UPPER (NAME) LIKE '%PARALLELIZED%' OR UPPER (NAME) LIKE '%PX%'
NAME VALUE
queries parallelized 56083
DML statements parallelized 6
DDL statements parallelized 160
DFO trees parallelized 56249
Parallel operations not downgraded 56128
Parallel operations downgraded to serial 951
Parallel operations downgraded 75 to 99 pct 0
Parallel operations downgraded 50 to 75 pct 0
Parallel operations downgraded 25 to 50 pct 119
Parallel operations downgraded 1 to 25 pct 2
Does it ever refresh? What conclusion can be drawn from the above output? Is it for a day? A month? An hour? Since startup?
This information is stored as part of Real-Time SQL Monitoring. But it requires licensing the Diagnostics and Tuning packs, and it only stores data for a short period of time.
Oracle 12c can supposedly store SQL Monitoring data for longer periods of time. If you don't have Oracle 12c, or if you don't have those options licensed, you'll need to create your own monitoring tool.
Real-Time SQL Monitoring of Parallel Downgrades
select /*+ parallel(1000) */ * from dba_objects;
select sql_id, sql_text, px_servers_requested, px_servers_allocated
from v$sql_monitor
where px_servers_requested <> px_servers_allocated;
SQL_ID SQL_TEXT PX_SERVERS_REQUESTED PX_SERVERS_ALLOCATED
6gtf8np006p9g select /*+ parallel ... 3000 64
Creating a (Simple) Historical Monitoring Tool
Simplicity is the key here. Real-Time SQL Monitoring is deceptively simple and you could easily spend weeks trying to recreate even a tiny portion of it. Keep in mind that you only need to sample a very small amount of all activity to get enough information to troubleshoot. For example, just store the results of GV$SESSION or GV$SQL_MONITOR (if you have the license) every minute. If the query doesn't show up from sampling every minute then it's not a performance issue and can be ignored.
For example, create a table with create table downgrade_check(sql_id varchar2(100), total number), and create a DBMS_SCHEDULER job that runs insert into downgrade_check select sql_id, count(*) total from gv$session where sql_id is not null group by sql_id;. Note that the count from GV$SESSION will rarely be exactly the same as the DOP.
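If you prefer to drive the sampling from outside the database instead of with DBMS_SCHEDULER, a minimal sketch in Python is below (not from the original answer; the cx_Oracle connection details and the one-minute interval are illustrative assumptions):
import time
import cx_Oracle

SAMPLE_SQL = """
    insert into downgrade_check
    select sql_id, count(*) total
    from gv$session
    where sql_id is not null
    group by sql_id
"""

# Hypothetical credentials/DSN; replace with your own.
with cx_Oracle.connect(user="monitor", password="secret", dsn="dwh") as conn:
    cursor = conn.cursor()
    while True:
        cursor.execute(SAMPLE_SQL)  # store one sample of active sql_ids
        conn.commit()
        time.sleep(60)              # sample once a minute, as suggested above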
Other Questions
V$SYSSTAT is updated pretty frequently (every few seconds?), and represents the total number of events since the instance started.
It's difficult to draw many conclusions from those numbers. From my experience, having only 2% of your statements downgraded is a good sign. You likely have good (usually default) settings and not too many parallel jobs running at once.
However, some parallel queries run for seconds and some run for weeks. If the wrong job is downgraded, even a single downgrade can be disastrous. Storing some historical session information (or using DBA_HIST_ACTIVE_SESSION_HISTORY) may help you find out whether your critical jobs were affected.

SQL Query by code or codename? Which is supposed to be more optimized?

I was checking through my reports written using SQL queries to see if I could further optimize my code when I suddenly wondered:
"Would the system be faster if I filter by code instead of codename?"
Query #1:
SELECT owneridname, scheduledstart
FROM dbo.FilteredAppointment
WHERE statecode IN (1, 3)
Query #2:
SELECT owneridname, scheduledstart
FROM dbo.FilteredAppointment
WHERE statecode IN ('1', '3')
Query #3:
SELECT owneridname, scheduledstart
FROM dbo.FilteredAppointment
WHERE statecodename IN ('Completed', 'Scheduled')
My initial thoughts:
The codename is usually fetched from the StringMap table, while the code resides in the base table. So it should be faster if I filter the query by codes instead. Looking at the three queries above, I would expect the query by string to be slower.
But the results show otherwise: Query #3 is the fastest of the three.
Since it's querying against the same view, it should not show a vast difference in resource usage.
Again I was wrong. Query #3 was more than 50% less expensive than Query #1 and #2.
Results for the 3 queries in terms of "Est Subtree cost"
Query #1: Est Subtree cost: 0.325311
Query #2: Est Subtree cost: 0.325311
Query #3: Est Subtree cost: 0.190786
Can anyone explain why it behaves in this manner?
First of all, when it comes to performance comparisons, I always refer to Eric Lippert's excellent Which is Faster? blog post.
I believe you're viewing your test results incorrectly. Since you're not actually running tests but looking at estimates, the cost shown is a percentage of the query as a whole. It cannot be directly compared to the percentage costs from another estimate.
For example, let's say the units of the cost are a percentage of the CPU required. Let's also assume that, since they are less complicated, Query #1 and Query #2 only require 100 CPU cycles, but Query #3 requires 1000 CPU cycles. If we apply the math of the percentages, this is the result:
Query 1: 33 Cycles
Query 2: 33 Cycles
Query 3: 190 Cycles
Even though Query 3 is a smaller percentage, it takes more actual resources.
I'm also going to guess that the performance difference is entirely negligible, and therefore you should be looking at other issues like readability / maintainability.
WHERE statecodename IN ('Completed', 'Scheduled') is definitely more readable, but much less maintainable (what if someone changes the text values to 'Complete' and 'Schedule'?). Therefore I would suggest this instead:
WHERE statecode IN (1, 3) -- ('Completed', 'Scheduled')
And stop preemptively worrying about performance.
"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%." - Donald Knuth
