Does Google PSI "trailing thirty days" testing still occur? - pagespeed-insights

I noticed that this Google PSI FAQ, written for a previous, now-deprecated version of the test, says that changes made to the website do not affect the PSI score immediately.
"The speed data shown in PSI is not updated in real-time. The reported metrics reflect the user experience over the trailing thirty days and are updated on a daily basis."
Does this part of the FAQ still apply today? I've noticed that if I reduce the number of DOM elements, the "Avoid an excessive DOM size" complaint in Google PSI immediately shows the correct new count of DOM elements but scores still remain in the same range.

The part you are referring to is "field data", which is indeed still calculated on a trailing 30 day period.
However, when you run your website through PageSpeed Insights, the page is tested without any cache and the results are calculated afresh each time you run it (this is known as "Lab Data").
Think of field data as "real world" information, based on visitors to your site and their experiences. It is a far more accurate representation of what is really happening when people visit your site.
Think of the "lab data" as a synthetic test and a diagnostic tool. It tries to simulate a slower CPU and a 4G connection, but it is still a simulation, designed to give you feedback on potential problems. It does have the advantage of updating instantly when you make changes, though.
For this reason your "field data" will always lag behind your "lab data" when you make changes to the site.
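To make the lag concrete, here is a minimal sketch. It assumes the field figure behaves like a plain trailing 30-day average of a daily metric, which is a simplification (the aggregation PSI actually uses may differ). The daily values and the helper name `trailing_mean` are made up for illustration.

```python
def trailing_mean(daily_values, window=30):
    """Return, for each day, the mean of the last `window` daily values."""
    means = []
    for day in range(len(daily_values)):
        start = max(0, day - window + 1)
        recent = daily_values[start:day + 1]
        means.append(sum(recent) / len(recent))
    return means

# Hypothetical daily LCP in seconds: 4.0 s for 30 days, then a fix ships
# and every later day measures 2.0 s.
daily_lcp = [4.0] * 30 + [2.0] * 30
field = trailing_mean(daily_lcp)

print(field[29])  # last day before the fix
print(field[30])  # first day after the fix: barely moved
print(field[44])  # 15 days of fixed data: halfway there
print(field[59])  # 30 days of fixed data: fully caught up
```

A lab run on day 30 would already report the new value, while the trailing figure only reaches it once the whole window contains post-fix days.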
Also bear in mind that some items in the report are purely diagnostic. In your example, "excessive DOM size" has no direct scoring implications. It is there to explain why you might be getting slow render times and/or a large Cumulative Layout Shift, since more DOM elements mean more rendering time and more chance of a reflow.
See this answer I gave on the new scoring model PSI uses.


Will chunk/bundle optimisations help my website if First Input Delay (FID) is already low?

According to Core Web Vitals, there are only 3 core vitals for measuring the user experience of any website: LCP (Largest Contentful Paint), FID (First Input Delay) and CLS (Cumulative Layout Shift). According to PageSpeed Insights and the CrUX dashboard, the FID of my website is within good limits, i.e. 90% of users have an input delay of less than 100 ms.
Will chunk optimisations (splitting, lazy loading) bring any benefit to the user experience of people landing on my website?
I understand that they will affect TBT (Total Blocking Time) and TTI (Time to Interactive), but that shouldn't matter if my FID is already very low. Is my understanding correct?
I work on several large sites and we measure FID and TBT across thousands of pages. My work on this shows there is little correlation between TBT and FID. I have lots of pages reporting TBT of 2s or more but then are in the 90% score for FID. So I would NOT spend money or time optimizing TBT, what I would do instead is optimize for something that you can correlate to a business metric. For instance, add some user timings to measure how fast a CTA button appears and when it becomes interactive. This is a metric that is useful.
Being in the green on the core web vitals report (for one or all metrics) is great, but it doesn't mean that you should not try to improve performance further. In fact, if all your competitors have better FID / CLS / LCP / etc. than you, you will be at a disadvantage. Generally speaking, I think the web vitals report can be used as a guide to continuously prioritise changes to try and improve performance.
While it's impossible to predict the improvements without looking at a current report and the codebase, it seems fair to expect code-splitting to improve FID and LCP, and lazy-loading to help with LCP. Both improvements would benefit users.
Note that TBT and FID are pretty similar.

Is the score displayed by PageSpeed Insights taken from the “Lab Data” or the “Field Data”?

I tested a random web link and got 64. However, the Lab Data and Field Data seem quite different. I think it's because the web page owner just modified it.
Is the score “64” reflecting the Lab Data or Field Data?
Short Answer
It is the lab data score.
Longer Answer
The score you see there is the "lab data" score, it is the score for this synthetic test you just ran. It will change every time you run Page Speed Insights.
"Field Data" will not contribute towards your score in Page Speed Insights and is purely for diagnostics.
The "Field Data" is calculated over a rolling 30 days so is useful to see if there are issues that automated tests do not pick up, but useless if you have just done a major update to fix a site issue (for 30 days at least).
Additionally, CLS in "Field Data" is calculated for the whole time someone is on the page (until the page's "unload" event), whereas in the PSI "Lab Data" it is only calculated on the above-the-fold content. That is sometimes another reason for disparity between results.

Why can't I target the complement of my goal in Optimizely?

Optimizely's Sample Size calculator shows that a higher baseline conversion rate leads to a smaller required sample size for an A/B test. So, instead of maximizing my conversion goal, I'd like to minimize the opposite, i.e. not reaching the goal.
For every goal with a conversion rate less than 50%, its complement would be higher than 50% and would thus require a smaller sample size if targeted.
An example: instead of measuring all users that visit payment-success.html, I'd rather measure all users that don't visit it, and try to minimize that. This would usually require a much smaller sample size, if my reasoning is correct!
Optimizely only lets me target pageviews as goals, not not-pageviewing.
I realize I'm probably missing or misunderstanding something important here, but if so, what is it?
Statistically there's nothing wrong with your approach, but unfortunately it won't have the desired effect of lowering the duration.
While you'll reduce the margin of error, you'll proportionately decrease the lift, causing you to take the same amount of time to reach confidence.
Since the lift is calculated as a percentage of the baseline conversion rate, the same change in conversion rate of a larger baseline will produce a smaller lift.
Say your real conversion rate is 10% and the test winds up increasing it to 12%. The inverse conversion rate would be 90%, which gets lowered to 88%. In both cases it's a change of 2 percentage points, but 2 points is a much greater change relative to 10% (it's a 20% lift) than it is relative to 90% (only a -2.22% lift).
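The arithmetic above can be checked with a short sketch. The sample-size function uses a textbook normal-approximation formula for comparing two proportions. That is an assumption for illustration and not necessarily what Optimizely's calculator does; the function names, the 5% significance level and the 80% power are also illustrative choices.

```python
from statistics import NormalDist

def relative_lift(baseline, variant):
    """Lift expressed as a fraction of the baseline conversion rate."""
    return (variant - baseline) / baseline

def sample_size_per_arm(p1, p2, alpha=0.05, power=0.8):
    """Approximate visitors per variation for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

goal = (0.10, 0.12)     # conversion rate: baseline, variation
inverse = (0.90, 0.88)  # non-conversion rate: baseline, variation

print(f"{relative_lift(*goal):+.2%}")     # +20.00%
print(f"{relative_lift(*inverse):+.2%}")  # -2.22%

# Both lines print the same figure: p * (1 - p) is symmetric around 50%,
# and the absolute difference between the arms is unchanged.
print(round(sample_size_per_arm(*goal)))
print(round(sample_size_per_arm(*inverse)))
```

Under this approximation, the smaller relative lift exactly offsets the benefit of the higher baseline, so targeting the complement does not shorten the test.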
Practically, you run a much larger risk of incorrectly bucketing people into the goal with the inverse. You know that someone who hits the success page should be counted toward the goal. I'm pretty sure what you're suggesting would cause every pageview that wasn't on the success page after the user saw the experiment to count as a goal.
Say you're testing the home page. Person A and B both land on the home page and view the experiment.
Person A visits 1 other page and leaves
Person B visits 1 other page and buys something
If your goal was set up on the success page, only person B would trigger the goal. If the goal was set up on all other pages, both people would trigger the goal. That's obviously not the inverse.
In fact, if there are any pages necessary to reach the success page after the user first sees the experiment (so unless you're testing the final step of checkout), everyone will trigger the inverse pageview goal (whether they hit the success page or not).
Optimizely pageview goals aren't just for pages included in the URL Targeting of your experiment. They're counted for anyone who's seen the experiment and at any point afterward hit that page.
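The Person A / Person B example can be sketched as follows. This is a toy model of the behaviour described above (a goal fires if any page viewed after the experiment matches it), not Optimizely's actual implementation; the page paths and function names are hypothetical.

```python
SUCCESS_PAGE = "/payment-success.html"

def triggers_goal(pages_after_experiment, goal_matches):
    """A pageview goal fires if ANY page viewed after the experiment matches."""
    return any(goal_matches(page) for page in pages_after_experiment)

def is_success(page):
    return page == SUCCESS_PAGE

def is_not_success(page):
    return page != SUCCESS_PAGE

# Pages viewed after seeing the experiment on the home page.
person_a = ["/products"]                # visits 1 other page and leaves
person_b = ["/products", SUCCESS_PAGE]  # visits 1 other page and buys

for name, pages in [("A", person_a), ("B", person_b)]:
    print(name, triggers_goal(pages, is_success), triggers_goal(pages, is_not_success))
# A False True
# B True True
```

Because both people view at least one non-success page, the "not success" goal fires for both of them, so it is not the complement of the success goal.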
Just to answer whether this is possible (not addressing whether your setup will result in the same outcome), you're right that Optimizely pageview goals don't allow for "not", but you can probably use the Regex match type to achieve what you want (see 'URL Match Type' in point 3 here). In this case it would look like this, taken from this answer here (which also explains the complexity involved with "not" matching in Regex, suggesting why Optimizely hasn't built "not" pageviews into the product).
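The pattern from the linked answer is not reproduced here. As a separate illustration of the general technique, "not matching" is usually written with a negative lookahead. The sketch below uses Python's `re` module; whether Optimizely's Regex match type accepts lookaheads is an assumption you would need to verify, and the example URLs are hypothetical.

```python
import re

# Negative lookahead: match a URL only if "payment-success.html"
# does not appear anywhere in it.
NOT_SUCCESS = re.compile(r"^(?!.*payment-success\.html).*$")

urls = [
    "https://example.com/cart",
    "https://example.com/payment-success.html",
    "https://example.com/payment-success.html?order=1",
]

for url in urls:
    print(url, bool(NOT_SUCCESS.match(url)))
```

Only the first URL matches. Note that this only solves the pattern-matching part; it does not change how the goal is counted across a visitor's pageviews.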
Hopefully that helps you get to where you want.

Will different website A/B tests interfere with either test's results?

I have a question about running an A/B test against different pages on a website and if I should worry about them interfering with either test's results. Not that it matters, but I'm using Visual Website Optimizer to do the testing.
For example, if I have two A/B tests running on different pages in the order placement flow, should I worry about the tests having an effect on one another's conversion rate for the same conversion goal? Say I have two tests running on a website, one against the product detail page and another on the shopping cart. Ultimately I want to know if a variation of either page affects the order placement conversion rate. I'm not sure if I should be concerned about the different tests' results interfering with one another if they are run at the same time.
My gut is telling me we don't have to worry about it, as the visitors on each page will be distributed across each variation of the other page. So the product detail page version A visitors will be distributed across the A and B variations of the cart, therefore the influence of the product detail page's variation A on order conversion will still be measured correctly even though the visitor sees different versions of the cart from the other test. Of course, I may be completely wrong, and hopefully someone with a statistics background can answer this question more precisely.
The only issue I can think of is that a combination of one page's variation and another page's variation might work together better than other combinations. But this seems unlikely.
I'm not sure if I'm explaining the issue clearly enough, so please let me know if my question makes sense. I searched the web and Stack Overflow for an answer, but I'm not having any luck finding anything.
I understand your problem. There is no quick answer to it, and it depends on the types of test you are running. There are times when A/B tests on different pages influence each other, especially if they are within the same sequence of actions, e.g. checkout.
A simple example: on your first page, variation A says "Click here to view pricing" and variation B says "Click here to get $500 cash". You may find that click-through on B is higher and declare that one successful. Once users click, on the following page they are asked to enter their credit card details, with the variations being a "Pay" button that is either green or red. In a situation like this, people from variation A might have a better chance of actually entering their CC details and converting, as opposed to people from variation B, who may feel cheated.
I have noticed that when websites are in their early stages and are trying to get a feel for what customers respond to, drastic changes are made, and multivariate tests are more important. When there is some stability and traffic, however, the changes tend to be very subtle, the overall message and flow stay the same, and A/B tests become micro refinements. In those cases, there might be less value in multi-page cross testing (does the background colour on page one mean anything three pages down the process? Probably not!).
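One way to check for this kind of interaction is to break the results down by combination of variations instead of looking at each test in isolation. The sketch below uses entirely made-up counts, chosen so that the interaction is visible; the dictionary layout and function names are illustrative only.

```python
# Hypothetical counts: (first-page variation, button colour) -> (visitors, orders)
results = {
    ("A", "green"): (1000, 50),
    ("A", "red"):   (1000, 40),
    ("B", "green"): (1000, 20),
    ("B", "red"):   (1000, 30),
}

def rate(cells):
    """Order conversion rate pooled over a list of (visitors, orders) cells."""
    visitors = sum(v for v, _ in cells)
    orders = sum(o for _, o in cells)
    return orders / visitors

def marginal(index, value):
    """Rate for one variation, pooled over the other test's variations."""
    return rate([cell for key, cell in results.items() if key[index] == value])

# Looking at the button test alone, the colours appear identical...
print(marginal(1, "green"), marginal(1, "red"))  # 0.035 0.035

# ...but per combination, green wins after page A and red wins after page B.
for key, cell in results.items():
    print(key, rate([cell]))
```

If the per-combination rates all point the same way as the pooled rates, the two tests are probably not interfering much. With real data you would also need enough traffic in every cell for the differences to be meaningful.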
Hope this answer helps!

How do you measure if an interface change improved or reduced usability?

For an ecommerce website how do you measure if a change to your site actually improved usability? What kind of measurements should you gather and how would you set up a framework for making this testing part of development?
Multivariate testing and reporting is a great way to actually measure these kinds of things.
It allows you to test what combination of page elements has the greatest conversion rate, providing continual improvement on your site design and usability.
Google Website Optimizer has support for this.
Use methods similar to those you used to identify the usability problems to begin with: usability testing. Typically you identify your use cases and then run a lab study evaluating how users go about accomplishing certain goals. Lab testing typically works well with 8-10 people.
The more informative methodology we have adopted to understand our users is anonymous data collection (you may need user permission, so make your privacy policies clear, etc.). This simply means recording which buttons and navigation menus users click on, or how users delete something (e.g. when changing quantity, are more users entering 0 and updating the quantity, or hitting X?). This is a bit more complex to set up; you have to develop an infrastructure to hold this data (which is actually just counters, e.g. "Times clicked x: 138838383, Times entered 0: 390393") and allow data points to be created as needed to plug into the design.
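As a rough sketch of the "just counters" idea, the following keeps an in-memory tally per event name and stores no user identifier. The event names are hypothetical, and a real setup would persist the counts somewhere instead of holding them in a process.

```python
from collections import Counter

interaction_counts = Counter()

def record(event_name):
    """Increment the anonymous counter for one UI event."""
    interaction_counts[event_name] += 1

# Hypothetical events emitted by the cart page.
for event in ["remove_via_x", "remove_via_zero_quantity", "remove_via_x"]:
    record(event)

total = sum(interaction_counts.values())
for name, count in interaction_counts.most_common():
    print(f"{name}: {count} ({count / total:.0%})")
# remove_via_x: 2 (67%)
# remove_via_zero_quantity: 1 (33%)
```

New data points are created simply by recording a new event name, which matches the requirement that counters can be added as the design needs them.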
To push the measurement of an improvement of a UI change up the stream from end-user (where the data gathering could take a while) to design or implementation, some simple heuristics can be used:
Is the number of actions it takes to perform a scenario less? (If yes, then it has improved). Measurement: # of steps reduced / added.
Does the change reduce the number of kinds of input devices to use (even if # of steps is the same)? By this, I mean if you take something that relied on both the mouse and keyboard and change it to rely only on the mouse or only on the keyboard, then you have improved usability. Measurement: Change in # of devices used.
Does the change make different parts of the website consistent? E.g. if one part of the e-commerce site loses changes made while you are not logged on and another part does not, this is inconsistent. Changing it so that they have the same behavior improves usability (preferably the more fault-tolerant behavior, please!). Measurement: Make a graph (flow chart really) mapping the ways a particular action could be done. Improvement is a reduction in the # of edges on the graph.
And so on... find some general UI tips, figure out some metrics like the above, and you can approximate usability improvement.
Once you have these design approximations of user improvement, and then gather longer term data, you can see if there is any predictive ability for the design-level usability improvements to the end-user reaction (like: Over the last 10 projects, we've seen an average of 1% quicker scenarios for each action removed, with a range of 0.25% and standard dev of 0.32%).
The first way can be fully subjective or partly quantified: user complaints and positive feedback. The problem with this is that you may have some strong biases when it comes to filtering that feedback, so you had better make it as quantitative as possible. Having a ticketing system to file every report from the users and gathering statistics about each version of the interface might be useful. Just get your statistics right.
The second way is to measure the difference in a questionnaire taken about the interface by end-users. Answers to each question should be a set of discrete values and then again you can gather statistics for each version of the interface.
The latter way may be much harder to set up (designing a questionnaire, and possibly the controlled environment for it as well as the guidelines to interpret the results, is a craft in itself), but the former makes it unpleasantly easy to mess up the measurements. For example, you have to consider the fact that the number of tickets you get for each version depends on how long it has been in use, and that not all time ranges are equal (e.g. a whole class of critical issues may never be discovered before the third or fourth week of usage, or users might tend not to file tickets in the first days of use, even if they find issues, etc.).
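A minimal sketch of the first correction, normalizing ticket counts by how long and by how many people each version was used. All numbers and the function name are hypothetical, and this does not address the second caveat (issues that only surface after several weeks).

```python
def tickets_per_1000_user_days(ticket_count, active_users, days_in_use):
    """Ticket rate normalized by exposure, so versions are comparable."""
    return 1000 * ticket_count / (active_users * days_in_use)

# Hypothetical: the old interface ran for 60 days, the new one for 14 days.
old_rate = tickets_per_1000_user_days(ticket_count=120, active_users=500, days_in_use=60)
new_rate = tickets_per_1000_user_days(ticket_count=45, active_users=500, days_in_use=14)

print(old_rate)            # 4.0
print(round(new_rate, 2))  # 6.43
```

In this made-up case the new version has fewer tickets in absolute terms (45 versus 120) but a higher rate per user-day, which is the kind of mistake raw counts invite.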
Torial stole my answer. That said, if there is a measure of how long it takes to do a certain task, and the time is reduced while the task is still completed, then that's a good thing.
Also, if there is a way to record the number of cancels, then that would work too.
