Lighthouse/PSI Score Variability Between Environments - pagespeed-insights

We're seeing a big discrepancy (around 30 points) in the performance score when we use Lighthouse or PSI to test a site in a test environment versus the prod environment. The discrepancy is consistent across many tests.
We're trying to determine the effect of various changes we make to the site on the score. Our understanding is that the tools try to consider network-independent aspects of the site, but our results would indicate that the network plays a big part in the performance score.
The code, db, media, etc. are the same across the two environments. We removed all third-party scripts and are still seeing a big discrepancy. Removing third-party scripts increased the score on our test environment but made no change on the prod environment - in fact, some tests showed even lower scores, which didn't make much sense to us.
There are definitely differences in the network setup between the two environments and we're trying to sort through where to focus our efforts. I've found info that would explain varying scores between Lighthouse and PSI but can't find much about how network, hosting, etc. play a part in the score variance.
Is there any information on how much different versions of IIS, or anti-virus software, could affect the score?
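In case it helps to quantify the spread before digging into the network differences, here is a minimal Python sketch that samples the score for each environment several times and compares the medians. It assumes the public PageSpeed Insights v5 API (endpoint, parameters and JSON paths as I understand that API), and the environment URLs are placeholders.

    # Minimal sketch: sample the PSI performance score several times per environment
    # and compare medians. Endpoint/parameters/JSON paths assume the public PSI v5 API;
    # the URLs below are placeholders for your own test and prod environments.
    import statistics
    import requests

    PSI_ENDPOINT = "https://www.googleapis.com/pagespeedonline/v5/runPagespeed"

    def psi_performance_score(url, strategy="mobile"):
        resp = requests.get(PSI_ENDPOINT, params={
            "url": url,
            "strategy": strategy,
            "category": "performance",
        })
        resp.raise_for_status()
        data = resp.json()
        # Lighthouse reports the category score as 0..1; scale to the familiar 0..100.
        return data["lighthouseResult"]["categories"]["performance"]["score"] * 100

    def median_score(url, runs=5):
        return statistics.median(psi_performance_score(url) for _ in range(runs))

    if __name__ == "__main__":
        for env, url in [("test", "https://test.example.com/"),
                         ("prod", "https://www.example.com/")]:
            print(env, median_score(url))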

Related

Independent vs Dependent Scenarios in feature files (Cucumber Java-Maven)?

We have 30,000 scenarios and are trying to use Cucumber with Maven (no TestNG) on a really big project. Dependent scenarios stop us from picking out just one part of the test suite or one test case from the manual test plan; on the other hand, independent scenarios significantly increase test execution time (if you run a full regression it is almost a waste of time).
Is the answer somewhere in between?
e.g.
Use independent scenarios where we can, and split the dependent ones by functionality into separate feature files based on those functionalities?
What is the best practice for writing feature files for big projects?
Dependent vs. independent?
Functional feature files vs. user story (US) feature files?
That is a question for you and your team; you have to get together and decide what the best solution is for your project.
Someone here can give you their point of view, but you know best what your project needs.
In general, it is not a good idea to have dependent tests: they are harder to maintain, and if a dependency breaks then all of the tests that rely on it fail and produce false negatives. If execution time is the important factor for your automated regression testing, then the middle ground is probably where your solution lies.

Is it possible to use Benchmark.NET to "fail" a CI build if performance has regressed too much?

I have unit tests. If one of them fails, my build fails.
I would like to apply the same principle to performance. I have a series of microbenchmarks for several hot paths through a library. Empirically, slowdowns in these areas have a disproportionate effect on the library's overall performance.
It would be nice if there were some way to have some concept of a "performance build" that can fail in the event of a too-significant performance regression.
I had considered hard-coding thresholds that must not be exceeded. Something like:
Assert.IsTrue(hotPathTestResult.TotalTime <= threshold)
but pegging that to an absolute value is hardware and environment-dependent, and therefore brittle.
Has anyone implemented something like this? What does Microsoft do for Kestrel?
I would not do this via unit-tests -- it's the wrong place.
Do this in a build/test script. You gain more flexibility and can do a lot more of the things that may be necessary.
A rough outline would be:
1. build
2. run unit tests
3. run integration tests
4. run benchmarks
5. upload benchmark results to a results store (commercial product, e.g. Power BI)
6. check current results against previous results
7. upload artefacts / deploy packages
At step 6, if there is a regression, you can let the build fail with a non-zero exit code.
BenchmarkDotNet can export results as JSON, etc., so you can take advantage of that.
The point is how to determine whether a regression occurred. Especially on CI builds (with containers and the like) there may be different hardware on different benchmark runs, so the results are not 1:1 comparable, and you have to take this into account.
Personally I don't let the script fail on a possible regression; instead it sends a notification, so I can manually check whether it's a true regression or just noise caused by different hardware.
A regression is simply detected if the current results are worse than the median of the last 5 results. Of course this is a rough method, but it's an effective one, and you can tune it to your needs.
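As a rough illustration of steps 5 and 6, here is a Python sketch that compares the current exported report against the median of the last five stored reports. The history folder layout, the 10% tolerance and the JSON field names are my assumptions; check the field names against BenchmarkDotNet's actual JSON export before relying on them.

    # Sketch: fail (non-zero exit) if the current benchmark means are worse than the
    # median of the last 5 stored runs. Field names are assumed from BenchmarkDotNet's
    # JSON exporter -- verify against your own export; the tolerance is a tunable guess.
    import json
    import statistics
    import sys
    from pathlib import Path

    HISTORY_DIR = Path("benchmark-history")   # one exported JSON report per previous run
    TOLERANCE = 1.10                          # allow 10% run-to-run noise

    def mean_times(report_path):
        """Map benchmark full name -> mean time for one exported report."""
        report = json.loads(Path(report_path).read_text())
        return {b["FullName"]: b["Statistics"]["Mean"] for b in report["Benchmarks"]}

    def check(current_report):
        history = sorted(HISTORY_DIR.glob("*.json"))[-5:]
        if not history:
            return True                       # nothing to compare against yet
        baselines = [mean_times(p) for p in history]
        ok = True
        for name, current in mean_times(current_report).items():
            previous = [b[name] for b in baselines if name in b]
            if previous and current > statistics.median(previous) * TOLERANCE:
                print(f"possible regression: {name}: {current:.0f} "
                      f"vs median {statistics.median(previous):.0f}")
                ok = False
        return ok

    if __name__ == "__main__":
        sys.exit(0 if check(sys.argv[1]) else 1)  # non-zero exit code fails the build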

Organizing Scientific Data and Code - Experiments, Models, Simulation, Implementation

I am working on a robotics research project, and would like to know: Does anyone have suggestions for best practices when organizing scientific data and code? Does anyone know of existing scientific libraries with source that I could examine?
Here are the elements of our 'suite':
Experiments - Two types:
Gathering data from existing, 'natural' system.
Data from running behaviors on robotic system.
Models
Description of dynamical system - dynamics, kinematics, etc.
Parameters for said system, some of which are derived from type 1 experiments
Simulation - trying to simulate natural behaviors, simulating behaviors on robots
Implementation - code for controlling the robots. Granted this is a large undertaking and has a large infrastructure of its own.
Some design aspects of our 'suite':
Would be good if simulation environment allowed for 'rapid prototyping' (scripts / interactive prompt for simple hacks, quick data inspection, etc - definitely something hard to incorporate) - Currently satisfied through scripting language (Python, MATLAB)
Multiple programming languages
Distributed, collaborative setup - Will be using Git
Unit tests have not yet been incorporated, but will hopefully be later on
Cross Platform (unfortunately) - I am used to Linux, but my team members use Windows, and some of our tools are wed to that platform
I saw this post; the books look interesting and I have ordered "Writing Scientific Software", but I feel it will focus primarily on the implementation of the simulation code and less on the overall organization.
The situation you describe is very similar to what we have in our surface dynamics lab.
Some of the work involves keeping measurement data, which is analysed in real time or saved for later analysis. Other work involves running simulations and analysing their results.
The data management scheme, which the lab leader picked up at Cambridge while studying there, is centred around a main server which holds the personal files of all lab members. Each member accesses the files from his workstation by mounting the appropriate server folder over NFS. This has its merits and faults: it is easier to back up everything, but it is problematic when processing large amounts of data over the network. For this reason I am an exception in the lab, since the simulation I work with generates a large amount of data. That data is saved on my workstation, and only the code used to generate it (the source code of the simulation and the configuration files) is saved on the server.
I also keep my code in an online SVN service, since I cannot log into the lab server from home. This is a mandatory practice, which stems from the need to be able to reproduce older results on demand and to trace changes to the code if some obscure bug appears. Hence the need to keep older versions and configuration files.
We also employ low-tech methods, such as lab notebooks, to record results, modifications, etc.
This content can sometimes be fairly abstract (there is no point describing every changed line in the code - you have diff for that; just the purpose of the change, perhaps some notes about the implementation, and its date).
Work is done mostly with Matlab. Again I am an exception, as I prefer Python. I also use C for the data-generating simulation. Testing is mostly of convergence, since my project right now is concerned with comparing against computational models. I just generate results with different configurations, each saved in its own respective folder (which I track in my lab logbook). This has the benefit of letting me control and interface with the data exactly as I want to, instead of conforming to someone else's ideas and formats.
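For what it's worth, here is a minimal Python sketch of that "one folder per configuration" habit. The folder layout and file names are just one possible convention, and it assumes the code lives in a Git repository (as the question mentions); swap in the equivalent SVN command if needed.

    # Sketch: create a timestamped results folder per run and store the configuration
    # and code version next to the output, so every result stays reproducible.
    # The layout and file names are only a convention, not anything standard.
    import json
    import shutil
    import subprocess
    from datetime import datetime
    from pathlib import Path

    def new_run_dir(config_path, results_root="results"):
        stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        run_dir = Path(results_root) / stamp
        run_dir.mkdir(parents=True)
        shutil.copy(config_path, run_dir / "config.json")   # exact configuration used
        commit = subprocess.run(["git", "rev-parse", "HEAD"],
                                capture_output=True, text=True).stdout.strip()
        (run_dir / "provenance.json").write_text(json.dumps({
            "commit": commit,                                # code version that produced the data
            "started": stamp,
        }, indent=2))
        return run_dir

    # usage: run_dir = new_run_dir("config.json"); the simulation then writes into run_dir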

How to keep track of performance testing

I'm currently doing performance and load testing of a complex many-tier system investigating the effect of different changes, but I'm having problems keeping track of everything:
There are many copies of different assemblies
Originally released assemblies
Officially released hotfixes
Assemblies that I've built containing further additional fixes
Assemblies that I've built containing additional diagnostic logging or tracing
There are many database patches, some of the above assemblies depend on certain database patches being applied
Many different logging levels exist, in different tiers (Application logging, Application performance statistics, SQL server profiling)
There are many different scenarios; sometimes it is useful to test only one scenario, other times I need to test combinations of different scenarios.
Load may be split across multiple machines or only a single machine
The data present in the database can change, for example some tests might be done with generated data, and then later with data taken from a live system.
There is a massive amount of potential performance data to be collected after each test, for example:
Many different types of application specific logging
SQL Profiler traces
Event logs
DMVs
Perfmon counters
The database(s) are several GB in size, so where I would have used backups to revert to a previous state, I tend instead to apply changes to whatever database is left after the last test, which causes me to quickly lose track of things.
I collect as much information as I can about each test I do (the scenario tested, which patches are applied, what data is in the database), but I still find myself having to repeat tests because of inconsistent results. For example, I just did a test which I believed to be an exact duplicate of a test I ran a few months ago, only with updated data in the database. I know for a fact that the new data should cause a performance degradation, yet the results show the opposite!
At the same time I find myself spending a disproportionate amount of time recording all these details.
One thing I considered was using scripting to automate the collection of performance data etc., but I wasn't sure this was such a good idea - not only is it time spent developing scripts instead of testing, but bugs in my scripts could cause me to lose track of things even quicker.
I'm after some advice / hints on how to better manage the test environment, in particular how to strike a balance between collecting everything and actually getting some testing done, without risking missing something important.
Scripting the collection of the test parameters + environment is a very good idea to check out. If you're testing across several days, and the scripting takes a day, it's time well spent. If after a day you see it won't finish soon, reevaluate and possibly stop pursuing this direction.
But you owe it to yourself to try it.
I would tend to agree with orip: scripting at least part of your workload is likely to save you time. You might consider taking a moment to ask which tasks are the most time-consuming in terms of your labour and how amenable they are to automation. Scripts are especially good at collecting and summarizing data - much better than people, typically. If the performance data requires a lot of interpretation on your part, you may have problems.
An advantage of scripting some of these tasks is that you can then check the scripts in alongside the source / patches / branches, and you may find you benefit from giving your system's complexity an organizational structure, rather than struggling to chase it as you do now.
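As a sketch of what "collecting the test parameters + environment" might look like in practice, the snippet below writes a small JSON snapshot before each run. The specific items recorded (assembly hashes, DB patch level, logging levels, scenario name) are placeholders for whatever actually varies in your setup.

    # Sketch: snapshot the environment into a JSON file before each test run, so every
    # result can be traced back to the configuration that produced it.
    # The recorded fields are placeholders; capture whatever actually varies for you.
    import hashlib
    import json
    import platform
    from datetime import datetime
    from pathlib import Path

    def file_hash(path):
        return hashlib.sha256(Path(path).read_bytes()).hexdigest()[:12]

    def snapshot(assemblies, db_patch_level, logging_levels, scenario,
                 out="run-metadata.json"):
        Path(out).write_text(json.dumps({
            "timestamp": datetime.now().isoformat(),
            "host": platform.node(),
            "scenario": scenario,
            "db_patch_level": db_patch_level,
            "logging_levels": logging_levels,
            "assemblies": {a: file_hash(a) for a in assemblies},  # which build is this?
        }, indent=2))

    # usage (hypothetical values):
    # snapshot(["bin/MyService.dll"], db_patch_level="patch-42",
    #          logging_levels={"app": "INFO", "sql": "off"}, scenario="peak-load")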
If you can get away with testing against only a few set configurations, that will keep the administration simple. It may also make it easier to put one on each of several virtual machines, which can be quickly redeployed to give clean baselines.
If you genuinely need the complexity you describe, I'd recommend building a simple database to let you query the multivariate results you have. Having a column for each of the important factors will allow you to query for answers to questions like "which testing config had the lowest variance in latency?" and "which test database surfaced the most bugs?". I use sqlite3 (usually through the Python wrapper or the Firefox plug-in) for this kind of lightweight collection, because it keeps the maintenance overhead relatively low and allows you to avoid perturbing the system under test too much, even if you need to run on the same box.
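A rough sketch of that kind of lightweight sqlite3 store, with one column per factor; the column names below are only examples drawn from the factors mentioned in the question.

    # Sketch: a lightweight sqlite3 results store with one column per test factor,
    # so multivariate questions can be answered with plain SQL.
    # Column names are examples only; use one column per factor you actually vary.
    import sqlite3

    conn = sqlite3.connect("perf_results.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS test_runs (
            run_id      INTEGER PRIMARY KEY,
            started     TEXT,
            scenario    TEXT,
            build       TEXT,     -- which assemblies / hotfix set
            db_patch    TEXT,
            data_set    TEXT,     -- generated data vs. copy of live data
            machines    INTEGER,  -- how the load was split
            latency_ms  REAL,
            latency_var REAL
        )
    """)

    # Example question: which configuration had the lowest variance in latency?
    best = conn.execute("""
        SELECT scenario, build, db_patch, MIN(latency_var)
        FROM test_runs
        GROUP BY scenario, build, db_patch
        ORDER BY MIN(latency_var)
        LIMIT 1
    """).fetchone()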
Scripting the tests will make them quicker to execute and permit results to be gathered in an already-ordered way, but it sounds like your system may be too complex to make this easy to do.

How do you evaluate reliability in software?

We are currently setting up the evaluation criteria for a trade study we will be conducting.
One of the criteria we selected is reliability (and/or robustness - are these the same?).
How do you assess that software is reliable without being able to afford much time evaluating it?
Edit: Along the lines of the response given by KenG, to narrow the focus of the question:
You can choose among 50 existing software solutions. You need to assess how reliable they are without being able to test them (at least initially). What tangible metrics or other indicators can you use to evaluate that reliability?
Reliability and robustness are two different attributes of a system:
Reliability
The IEEE defines it as ". . . the ability of a system or component to perform its required functions under stated conditions for a specified period of time."
Robustness
A system is robust if it continues to operate despite abnormalities in input, calculations, etc.
So a reliable system performs its functions as it was designed to, within constraints; a robust system continues to operate if the unexpected/unanticipated occurs.
If you have access to any history of the software you're evaluating, some idea of reliability can be inferred from reported defects, number of 'patch' releases over time, even churn in the code base.
Does the product have automated test processes? Test coverage can be another indication of confidence.
Some projects using agile methods may not fit these criteria well - frequent releases and a lot of refactoring are expected
Check with current users of the software/product for real world information.
It depends on what type of software you're evaluating. A website's main (and maybe only) criteria for reliability might be its uptime. NASA will have a whole different definition for reliability of its software. Your definition will probably be somewhere in between.
If you don't have a lot of time to evaluate reliability, it is absolutely critical that you automate your measurement process. You can use continuous integration tools to make sure that you only ever have to manually find a bug once.
I recommend that you or someone in your company read Continuous Integration: Improving Software Quality and Reducing Risk. I think it will help lead you to your own definition of software reliability.
Talk to people already using it. You can test yourself for reliability, but it's difficult, expensive, and can be very unreliable depending on what you're testing, especially if you're short on time. Most companies will be willing to put you in contact with current clients if it will help sell you their software and they will be able to give you a real-world idea of how the software handles.
As with anything, if you don't have the time to assess something yourself, then you have to rely on the judgement of others.
Reliability is one of three aspects of something's effectiveness. The other two are maintainability and availability.
An interesting paper, http://www.barringer1.com/pdf/ARMandC.pdf, discusses this in more detail, but generally:
Reliability is based on the probability that a system will break; i.e., the more likely it is to break, the less reliable it is. In systems other than software it is often measured as Mean Time Between Failures (MTBF), a common metric for things like hard disks (10,000 hrs MTBF). In software, I guess you could measure it as the mean time between critical system failures, between application crashes, between unrecoverable errors, or between errors of any kind that impede or adversely affect normal system productivity.
Maintainability is a measure of how long/how expensive (in man-hours and/or other resources) it is to fix the system when it does break. In software, you could add to this how long/how expensive it is to enhance or extend the software (if that is an ongoing requirement).
Availability is a combination of the first two. It tells a planner: if I had 100 of these things running for ten years, after accounting for the failures and for how long each failed unit was unavailable while being fixed or repaired, how many of the 100, on average, would be up and running at any one time - 20%, or 98%?
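In symbols, the usual steady-state form of that calculation is

    \text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}

so, for example, units that fail on average every 980 hours and take 20 hours to repair are available 980 / (980 + 20) = 98% of the time, i.e. about 98 of those 100 units up at any one moment.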
Well, the keyword 'reliable' can lead to different answers... When thinking of reliability, I think of two aspects:
always giving the right answer (or the best answer)
always giving the same answer
Either way, I think it boils down to some repeatable tests. If the application in question is not built with a strong suite of unit and acceptance tests, you can still come up with a set of manual or automated tests to perform repeatedly.
The fact that the tests always return the same results will show that aspect #2 is taken care of. For aspect #1 it really is up to the test writers: come up with good tests that would expose bugs or imperfections.
I can't be more specific without knowing what the application is about, sorry. For instance, a messaging system would be reliable if messages were always delivered, never lost, never contain errors, etc etc... a calculator's definition of reliability would be much different.
My advice is to follow SRE methodology around SLI, SLO and SLA, best summarized in free ebooks:
Site Reliability Engineering, which provides an introduction to the principles
The Site Reliability Workbook which comes with concrete examples
Looking at reliability more from a tooling perspective, you need:
monitoring infrastructure (I recommend Prometheus)
alerting (I recommend Prometheus AlertManager, OpsGenie or PagerDuty)
SLO computation tooling for instance slo-exporter
You will have to go into the process by understanding and fully accepting that you will be making a compromise, which could have negative effects if reliability is a key criterion and you don't have (or are unwilling to commit) the resources to appropriately evaluate based on that.
Having said that - determine what the key requirements are that make software reliability critical, then devise tests to evaluate based on those requirements.
Robustness and reliability cross in their relationship to each other, but are not necessarily the same.
If you have a data server that cannot handle more than 10 connections and you expect 100000 connections - it is not robust. It will be unreliable if it dies at > 10 connections. If that same server can handle the number of required connections but intermittently dies, you could say that it is still not robust and not reliable.
My suggestion is that you consult with an experienced QA person who is knowledgeable in the field for the study you will conduct. That person will be able to help you devise tests for key areas -hopefully within your resource constraints. I'd recommend a neutral 3rd party (rather than the software writer or vendor) to help you decide on the key features you'll need to test to make your determination.
If you can't test it, you'll have to rely on the reputation of the developer(s) along with how well they followed the same practices on this application as their other tested apps. Example: Microsoft does not do a very good job with the version 1 of their applications, but 3 & 4 are usually pretty good (Windows ME was version 0.0001).
Depending on the type of service you are evaluating, you might get reliability metrics or SLIs - service level indicators - metrics capturing how well the service/product is doing. For example: process 99% of requests in under 1 sec.
Based on the SLIs you might set up service level agreements - a contract between you and the software provider covering which SLOs (service level objectives) you want, and the consequences if they are not delivered.
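As a small illustration of turning an SLI like "process 99% of requests in under 1 sec" into a concrete check, here is a Python sketch. In practice the latencies would come from your monitoring system (e.g. Prometheus) rather than an in-memory list, and the sample numbers below are made up.

    # Sketch: evaluate an SLI ("fraction of requests served within 1 s") against a
    # 99% SLO target and report how much error budget remains.
    # In real use the latencies would come from monitoring, not an in-memory list.

    def sli_fast_requests(latencies_s, threshold_s=1.0):
        return sum(1 for t in latencies_s if t <= threshold_s) / len(latencies_s)

    def error_budget_remaining(sli, slo=0.99):
        budget = 1.0 - slo                  # allowed fraction of slow/failed requests
        burned = 1.0 - sli                  # fraction that was actually slow/failed
        return (budget - burned) / budget   # 1.0 = untouched, <= 0 = SLO violated

    latencies = [0.2, 0.4, 1.3, 0.1, 0.8, 0.3, 0.9, 0.5, 0.2, 0.6]  # made-up sample
    sli = sli_fast_requests(latencies)
    print(f"SLI: {sli:.1%}, error budget remaining: {error_budget_remaining(sli):.0%}")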
