Which solar geometry should be assumed by pvlib? - geometry

I am trying to calibrate my PV panel's efficiency against reference GHI and DHI values measured by our national weather service, 12 km away from my PV site. At the reference site, I calibrate the Ineichen model by trimming its Linke turbidity parameter (LT) so that the measured and computed irradiances match. I assume the LT factor is valid within a 12 km radius on a clear-sky day with low aerosol optical depth, so I can port the calibrated LT to my location and height. At the same time, I measure my local PV's DC power (under MPPT) as the difference between exposure to global normal irradiance and exposure to diffuse irradiance only. My panel is small, so shading it from the direct beam is easy. Having computed DNI = (GHI - DHI)/cos(zenith) with the Ineichen model at my site, I obtain the PV efficiency from the measured DC power at the given surface temperature (also measured). So far, everything looks fine. But I am getting a different optimum LT parameter for each of the two reference matches (GHI and DHI)!
Is the Ineichen model not exact enough for calibration purposes? No, the reason lies elsewhere:
When using a single compromise LT value, both the GHI and DHI outputs come out roughly equally higher (about +3%) than their measured counterparts. This naturally raises the question: which extraterrestrial irradiance value (GXI) does the numerical model use? The error stems from the eccentricity of the Earth's orbit (0.017), which causes about a 3.4% variation in GXI over the year, correlating well with the "compromise" error I observe. A comment by the authors in pvlib confirms that pvlib applies circular solar geometry. In my experience that is precise enough for calculating the solar azimuth and zenith from the timestamp and position; a typical error in the computed solar angles is about 0.5%.
On the other hand, the large 3% error in the extraterrestrial irradiance could easily be fixed if an accurate Sun-Earth distance were calculated with the elliptical orbit model. This is even easier than calculating the solar angles with the circular model!
Currently, I use the following workaround: trim LT so that the two model outputs (GHI and DHI) reach an equal relative error, port LT to the local site, and then correct the computed DNI there by its (known) relative error.

Which extraterrestrial irradiance value (GXI) does the numerical model use? ... A comment by the authors in pvlib confirms that pvlib applies circular solar geometry.
I'm not sure what you mean by this. Can you provide a link to this comment? In any case, several models are available for extraterrestrial irradiance via pvlib.irradiance.get_extra_radiation (docs, v0.9.1 source code). Here's what the default model looks like across a year:
import pandas as pd
import pvlib

# Daily timestamps covering one year
times = pd.date_range('2022-01-01', '2022-12-31', freq='D')

# Extraterrestrial DNI for each day, using the function's default model
dni_et = pvlib.irradiance.get_extra_radiation(times)
dni_et.plot()
There are also functions for calculating earth-sun distance, e.g. pvlib.solarposition.nrel_earthsun_distance.
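For example, a minimal sketch using that function (the (1/r)^2 factor shown is the standard eccentricity scaling of the solar constant, computed here from the returned distance):

import pandas as pd
import pvlib

times = pd.date_range('2022-01-01', '2022-12-31', freq='D', tz='UTC')

# Earth-sun distance in astronomical units (NREL SPA algorithm)
r = pvlib.solarposition.nrel_earthsun_distance(times)

# (1 AU / r)^2 is the eccentricity correction that scales the solar constant
# to the extraterrestrial irradiance on a given day
correction = 1 / r**2
print(correction.min(), correction.max())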

Related

what type of model should I use?

I am trying to assess the influence of sex (nominal), altitude (nominal), and latitude (nominal) on corrected wing size (continuous; the residual of wing size regressed on body mass) in an animal species. I treated altitude as a nominal factor because this particular species is mainly distributed at the extremes (low and high) of steep elevational gradients in my study area. I also treated latitude as a nominal fixed factor because I sampled individuals at only three main latitudinal levels (north, center, and south).
It has been suggested that I use a linear mixed model for this analysis, with sex, altitude, latitude, sex:latitude, sex:altitude, and altitude:latitude as fixed factors and collection site (nominal) as the random effect, the latter because of the clustered distribution of the collection sites.
However, I noticed that although the corrected wing size follows a normal distribution, it violates the assumption of homoscedasticity across some altitudinal/latitudinal groups. I tried a non-parametric equivalent of factorial ANOVA (ARTool), but I cannot make it run because it does not allow missing data and it requires assessing all possible fixed factors and their interactions. I would appreciate any advice on what type of model I can use given this design, and on what software/package I can use to perform the analysis.
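For reference, the suggested specification would look roughly like the following in Python's statsmodels (a sketch with hypothetical column names and random placeholder data; it does not address the heteroscedasticity issue):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data frame standing in for the real dataset
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "wing_resid": rng.normal(size=120),
    "sex": rng.choice(["F", "M"], 120),
    "altitude": rng.choice(["low", "high"], 120),
    "latitude": rng.choice(["north", "center", "south"], 120),
    "site": rng.choice(["site_%d" % i for i in range(10)], 120),
})

# All main effects and the two-way interactions as fixed effects,
# collection site as a random intercept
model = smf.mixedlm("wing_resid ~ sex * altitude + sex * latitude + altitude * latitude",
                    data=df, groups=df["site"])
print(model.fit().summary())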
Thanks in advance for your kind attention.
Regards,

outlier detection using 2D spatial information

I have a list of sensor measurements for air quality with geo-coordinates, and I would like to implement outlier detection. The list of sensors is relatively small (~50).
The air quality can gradually change with the distance, but abrupt local spikes are likely outliers. If one sensor in the group of closely located sensors shows a higher value it could be an outlier. If the same higher value is shown by more distant sensors it might be OK.
Of course, I can ignore coordinates and do simple outlier detection assuming the normal distribution, but I was hoping to do something more sophisticated. What would be a good statistical way to model this and implement outlier detection?
The statement above ("If one sensor in the group of closely located sensors shows a higher value it could be an outlier. If the same higher value is shown by more distant sensors it might be OK.") indicates that sensors that are closer to each other tend to have values that are more alike.
Tobler’s first law of geography - “everything is related to everything else, but near things are more related than distant things”
You can quantify an answer to this question. The focus should not be on the locations and values of individual outlier sensors; instead, use global spatial autocorrelation to measure the degree to which sensors that are near each other tend to be more alike.
As a start, you will first need to define neighbors for each sensor.
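As a minimal sketch of that idea (my own illustration, using binary k-nearest-neighbour weights and a hand-rolled global Moran's I rather than a dedicated spatial statistics package):

import numpy as np
from scipy.spatial import cKDTree

def global_morans_i(coords, values, k=5):
    # Global Moran's I with binary k-nearest-neighbour weights
    n = len(values)
    tree = cKDTree(coords)
    _, idx = tree.query(coords, k=k + 1)      # k+1: the nearest point is the point itself
    w = np.zeros((n, n))
    for i, neighbours in enumerate(idx):
        w[i, neighbours[1:]] = 1.0            # skip self
    z = values - values.mean()
    return n * np.sum(w * np.outer(z, z)) / (w.sum() * np.sum(z ** 2))

coords = np.random.rand(50, 2)                # hypothetical sensor coordinates
values = np.random.rand(50)                   # hypothetical sensor readings
print(global_morans_i(coords, values))        # near +1: nearby sensors are alike; near 0: no spatial structure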
I'd calculate a cost function consisting of two costs:
1: cost_neighbors: the deviation of the sensor value from an expected value, where the expected value is computed from the neighbours' values weighted by their (inverse) distance.
2: cost_previous_step: how much the sensor's value changed compared to the previous time step; a large change in value leads to a large cost.
Here is a small Python function describing how to calculate the costs (the expected value is normalized by the sum of the inverse-distance weights so that it stays on the same scale as the readings):

def total_cost(value, prev_value, neighbour_values, distances, a=1.0, b=1.0):
    weights = [1.0 / d for d in distances]                              # inverse-distance weights
    expected = sum(w * v for w, v in zip(weights, neighbour_values)) / sum(weights)
    cost_neighbors = abs(expected - value)
    cost_previous_step = abs(value - prev_value)
    return a * cost_neighbors + b * cost_previous_step
a and b are parameters that can be tuned to give each cost more or less impact. The total cost is then used to determine whether a sensor value is an outlier: the larger it is, the more likely the value is an outlier.
To figure out the performance and the weights, you can plot the costs of some labeled data points for which you know whether or not they are outliers.
cost_neighbors
| X
| X X
|
|o o
|o o o
|___o_____________ cost_previous_step
X= outlier
o= non-outlier
You can now either set the threshold by hand, or create a small dataset with the labels and costs and apply any sort of classifier (e.g. an SVM).
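For instance, a small sketch with scikit-learn (the cost values and labels below are made up purely for illustration):

import numpy as np
from sklearn.svm import SVC

# Hypothetical labelled costs: columns are [cost_neighbors, cost_previous_step]
costs = np.array([[0.1, 0.2], [0.2, 0.1], [0.3, 0.2], [0.9, 0.8], [1.1, 0.7]])
labels = np.array([0, 0, 0, 1, 1])            # 1 = known outlier, 0 = normal

clf = SVC(kernel="linear").fit(costs, labels)
print(clf.predict([[0.95, 0.75]]))            # a point near the outlier cluster gets flagged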
If you use Python, an easy way to find neighbors and their distances is scipy.spatial.cKDTree.
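A quick sketch of that lookup (random coordinates stand in for the real sensor positions):

import numpy as np
from scipy.spatial import cKDTree

coords = np.random.rand(50, 2)                # hypothetical sensor coordinates
tree = cKDTree(coords)

# k=6 because the closest "neighbour" of each point is the point itself
dists, idx = tree.query(coords, k=6)
neighbour_dists, neighbour_idx = dists[:, 1:], idx[:, 1:]   # drop self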

Impulse response analysis

I ran an impulse response analysis on a value-weighted stock index and a few other variables in Python and got the following results:
I am not sure how to interpret these results.
Can anyone please help me out?
You might want to check the book "New Introduction to Multiple Time Series Analysis" by Helmut Lütkepohl (2005) for a fairly dense treatment of the theory behind the method.
In the meantime, a simple way to interpret your plots is this. Say your variables are VW, SP500, oil, uts, prod, cpi, n3 and usd. They are all part of the same system; what the impulse response analysis does is assess how much a shock to one variable impacts another, independently of the remaining variables. It is therefore a pairwise shock from one variable to another. Your first plot is VW -> VW, which is essentially an autocorrelation plot. Now look at the other plots: apparently SP500 exerts the largest impact on VW (you can see a peak in the blue line reaching 0.25). The y-axis is given in standard deviations and the x-axis in lag periods, so in your example a shock to SP500 causes a 0.25 change in VW at whatever lag is on your x-axis (I can't read it from your figure). Similarly, you can see n3 negatively impacting VW at a given period.
There is an interesting link that you probably already know, showing an example of using Python's statsmodels VAR for impulse response analysis.
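The basic workflow looks roughly like this (a sketch with random placeholder data and a hypothetical lag order; the fitted VAR exposes the impulse responses through its irf method):

import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

# Hypothetical multivariate return series with your variables as columns
data = pd.DataFrame(np.random.randn(200, 3), columns=['VW', 'SP500', 'oil'])

results = VAR(data).fit(maxlags=4, ic='aic')  # lag order chosen by AIC
irf = results.irf(10)                         # responses over 10 periods
irf.plot(orth=True)                           # orthogonalised impulse responses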
I used this method to assess how one variable impacts another in a plant-water-atmosphere system; there are some explanations there, along with interpretations of similar plots. Take a look:
Use of remote sensing indicators to assess effects of drought and human-induced land degradation on ecosystem health in Northeastern Brazil
Good luck!

Scale before PCA

I'm using PCA from scikit-learn and I'm getting some results which I'm trying to interpret, so I ran into a question: should I subtract the mean (or perform standardization) before using PCA, or is this somehow embedded in the sklearn implementation?
Moreover, which of the two should I perform, and why is this step needed?
I will try to explain it with an example. Suppose you have a dataset with many features about housing, and your goal is to classify whether a purchase is good or bad (a binary classification). The dataset includes some categorical variables (e.g. location of the house, condition, access to public transportation) and some float or integer variables (e.g. market price, number of bedrooms). The first thing you may do is encode the categorical variables. For instance, if you have 100 locations in your dataset, the common way is to encode them from 0 to 99. You may even end up one-hot encoding these variables (i.e. a column of 1s and 0s for each location), depending on the classifier you plan to use.
Now, if you use the price in million dollars, the price feature will have a much higher variance and thus a higher standard deviation. Remember that we use the squared difference from the mean to calculate the variance: a larger scale creates larger values, and the square of a large value grows faster. But that does not mean the price carries significantly more information than, for instance, the location. In this example, however, PCA would give a very high weight to the price feature, and the weights of the categorical features would almost drop to 0.
If you normalize your features, you get a fair comparison of the explained variance in the dataset. So it is good practice to normalize the mean and scale the features before using PCA.
Before PCA, you should:
1. Mean-normalize (always).
2. Scale the features (if required).
Note: please remember that steps 1 and 2 are not technically the same.
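A small sketch of why this matters (made-up data; note that scikit-learn's PCA centers the data internally but does not scale it):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * [1.0, 1.0, 1000.0]   # third feature on a much larger scale

# Without scaling, the large-scale feature dominates the first component
print(PCA().fit(X).explained_variance_ratio_)

# After centering and scaling to unit variance, the features contribute comparably
print(PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_)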
This is a really non-technical answer, but my method is to try both and then see which one accounts for more variation on PC1 and PC2. However, if the attributes are on different scales (e.g. cm vs. feet vs. inches), then you should definitely scale to unit variance. In every case, you should center the data.
Here's the iris dataset with centering only and with centering + scaling. In this case, centering alone led to higher explained variance, so I would go with that one. I got the data from sklearn.datasets' load_iris. Then again, with centering only, PC1 carries most of the weight, so I wouldn't consider patterns found in PC2 significant. On the other hand, with centering + scaling the weight is split between PC1 and PC2, so both axes should be considered.
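Roughly what that comparison looks like in code (a sketch; the exact ratios depend on the dataset and library version):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Centering only (PCA centers internally; no scaling applied)
print(PCA(n_components=2).fit(X).explained_variance_ratio_)

# Centering + scaling to unit variance
print(PCA(n_components=2).fit(StandardScaler().fit_transform(X)).explained_variance_ratio_)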

How to properly clamp beckmann distribution

I am trying to implement a Microfacet BRDF shading model (similar to the Cook-Torrance model) and I am having some trouble with the Beckmann Distribution defined in this paper: https://www.cs.cornell.edu/~srm/publications/EGSR07-btdf.pdf
In that distribution, M is a microfacet normal, N is the macrosurface normal, and αb is a "hardness" (roughness) parameter in [0, 1].
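For reference, the Beckmann distribution in that paper has the form

D(M) = χ+(M · N) / (π αb² cos⁴θm) · exp(−tan²θm / αb²)

where θm is the angle between M and N, and χ+(x) is 1 for x > 0 and 0 otherwise.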
My issue is that this distribution often returns obscenely large values, especially when αb is very small.
For instance, the Beckmann distribution is used to calculate the probability of generating a microfacet normal M via this equation:
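In the cited paper, that density is

p(M) = D(M) |M · N|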
A probability has to be in the range [0, 1], so how is it possible to get a value within this range from the function above if the Beckmann distribution gives me values of 1,000,000,000+?
So is there a proper way to clamp the distribution? Or am I misunderstanding it, or the probability function? I tried simply clamping the value to 1 when it exceeded 1, but that didn't really give me the results I was looking for.
I was having the same question you did.
If you read
http://blog.selfshadow.com/publications/s2012-shading-course/hoffman/s2012_pbs_physics_math_notes.pdf
and
http://blog.selfshadow.com/publications/s2012-shading-course/hoffman/s2012_pbs_physics_math_notebook.pdf
You'll notice it's perfectly normal. To quote from the links:
"The Beckmann Αb parameter is equal to the RMS (root mean square) microfacet slope. Therefore its valid range is from 0 (non-inclusive –0 corresponds to a perfect mirror or Dirac delta and causes divide by 0 errors in the Beckmann formulation) and up to arbitrarily high values. There is no special significance to a value of 1 –this just means that the RMS slope is 1/1 or 45°.(...)"
Also another quote:
"The statistical distribution of microfacet orientations is defined via the microfacet normal distribution function D(m). Unlike F (), the value of D() is not restricted to lie between 0 and 1—although values must be non-negative, they can be arbitrarily large (indicating a very high concentration of microfacets with normals pointing in a particular direction). (...)"
You should google Self Shadow's Physically Based Shading courses, which are full of useful material (there is one blog post for each year: 2010, 2011, 2012 and 2013).
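To see concretely why large values of D are not a problem, here's a small numerical sketch (assuming the Beckmann form quoted above): the sampling density D(M)·(M·N) still integrates to about 1 over the hemisphere even though D itself peaks in the thousands for a small αb.

import numpy as np

def beckmann_d(cos_theta, alpha_b):
    # Beckmann NDF; the clamp avoids division by zero at grazing angles
    cos_theta = np.clip(cos_theta, 1e-8, 1.0)
    tan2 = (1.0 - cos_theta**2) / cos_theta**2
    return np.exp(-tan2 / alpha_b**2) / (np.pi * alpha_b**2 * cos_theta**4)

alpha_b = 0.01                                   # small "hardness" -> huge peak value
theta = np.linspace(0.0, np.pi / 2, 200000)
density = beckmann_d(np.cos(theta), alpha_b) * np.cos(theta) * np.sin(theta)

print(beckmann_d(1.0, alpha_b))                              # peak of D: roughly 3183
print(2 * np.pi * np.sum(density) * (theta[1] - theta[0]))   # hemisphere integral of D*cos: ~1.0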

Resources