What is the power law exponent of unstructured p2p networks?

I have found contradictory literature on this topic. Some papers suggest that the power-law exponent is close to 2 (between 2.1 and 2.3), while others report a higher value (around 3). Please provide references to any studies related to this topic.
Here are some links and quotes:
Search in Power-Law Networks
A number of large distributed systems, [...] display a power-law distribution in their node degree. This distribution reflects the existence of a few nodes with very high degree and many with low degree, a feature not found in standard random graphs
Modeling Peer-to-peer Network Topologies Through “small-world” Models And Power Laws

The real problem here is that large-scale p2p networks don't really exist in academia: it's incredibly difficult to scale a real p2p network, and there are no great p2p simulators for lookup algorithms that help measure these details.
I've recently started using jxta-sim, a p2p simulator built on top of PlanetSim.
jxta-sim link - http://jxta.dsg.cs.tcd.ie/

Given that it's an empirical fit, I'd say that it depends on the network (what drives it, how it grows, etc.), and that the variation in reported values should be taken as a genuine range rather than as errors in measurement.
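Since the exponent comes from an empirical fit, one way to see how much it can move is to fit it on whatever degree sequence you have. Below is a minimal sketch (not from this thread), assuming the Python powerlaw package (Alstott et al.) and networkx are available; the preferential-attachment graph is just a stand-in for a measured overlay.

# Minimal sketch: estimate a power-law exponent from a degree sequence.
# The graph below is a stand-in, not data from a real p2p overlay.
import networkx as nx
import powerlaw

# Preferential attachment is known to give an exponent near 3.
G = nx.barabasi_albert_graph(n=10_000, m=3, seed=42)
degrees = [d for _, d in G.degree() if d > 0]

fit = powerlaw.Fit(degrees, discrete=True)   # MLE estimate of alpha and xmin
print("estimated exponent (alpha):", fit.power_law.alpha)
print("fitted xmin:", fit.power_law.xmin)

# Check whether "power law" is even the best label for the tail.
R, p = fit.distribution_compare('power_law', 'lognormal')
print("loglikelihood ratio:", R, "p-value:", p)

The likelihood-ratio comparison is worth running because many empirical degree distributions are fit about as well by a lognormal as by a power law, which is another reason reported exponents vary.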

Related

How to understand the process of transforming research data into a certain distribution (not a statistical distribution)?

Statistics is not my major and English is not my native language. I have been applying for data analysis and data science jobs in industry, but I do not know how to describe my research process below in a concise and professional way. I would highly appreciate your help.
Background: I simulate the properties of materials using research packages such as LAMMPS. The simulated data are just the coordinates of atoms. Below is my data analysis.
Step 1: Clean the data to make sure it is complete and that each atom ID is unique and consistent across time moments (timesteps).
Step 2: Calculate the distances from each center atom to its neighboring atoms to find the target species (a configuration formed by several target atoms, such as Al-O-H, Si-O-H, Al-O-H2, H3-O).
Step 3: Count the number of each species as a function of space and/or time and plot the species distribution as a function of space and/or time, as well as the lifetime distribution of the species (a rough code sketch of steps 2-3 appears after this question).
NOTE: such a distribution is different from a statistical distribution, such as the normal or binomial distribution.
Step 4: Based on the above distributions, explore and interpret the correlations between species.
After the above steps, I study the underlying mechanism based on the materials themselves and their local environment.
Could anyone point out how to understand the above steps in statistical terms, data-analytic terms, or others?
I sincerely appreciate your time and help.
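Not an answer to the phrasing question, but as a rough illustration of steps 2-3 in code: the cutoff distance, element labels, and frame format below are assumptions made for the sketch, not taken from the post.

# Rough sketch of steps 2-3: find neighbours within a cutoff and count a
# target "species" per timestep. Cutoff, labels, and frame format are
# illustrative assumptions.
import numpy as np
from scipy.spatial import cKDTree

CUTOFF = 1.2  # Angstrom; hypothetical O-H bonding cutoff

def count_oh_species(frames):
    """frames: list of (elements, coords) per timestep, where
    elements is an array of strings like 'O', 'H' and coords is (N, 3)."""
    counts = []
    for elements, coords in frames:
        o_idx = np.where(elements == "O")[0]
        h_idx = np.where(elements == "H")[0]
        tree = cKDTree(coords[h_idx])
        # For each O atom, count H atoms within the cutoff distance.
        n_h = tree.query_ball_point(coords[o_idx], r=CUTOFF, return_length=True)
        # Example species: an O with exactly two H neighbours ("H2O-like").
        counts.append(int(np.sum(n_h == 2)))
    return counts  # species count as a function of time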

Analyzing networks which are not scale-free

I am trying to analyze a graph constructed with networkx that has around 7,000 nodes. When I plot the degree distribution, there are nodes that lie far from the fitted power law, as shown in the attached plot. This means the network is not scale-free (to my understanding). I am trying to analyze this network using various parameters such as degree, clustering coefficient, betweenness centrality, and many others. Is analyzing such networks with these parameters acceptable? I have tried to find examples of analyses of networks that are not scale-free, but no luck so far. Any suggestions and pointers to such examples would be really great. In addition, some differences in network characteristics between scale-free and non-scale-free networks would be very helpful. Thanks in advance.
1. What type of model did you construct? Did you use data from a file?
2. What do you want to check?
Models such as Watts-Strogatz (https://en.wikipedia.org/wiki/Watts%E2%80%93Strogatz_model) are also not scale-free:
'They do not account for the formation of hubs. Formally, the degree distribution of ER graphs converges to a Poisson distribution, rather than a power law observed in many real-world, scale-free networks.[3]'
WS is a 'small-world' network, characterized by a high clustering coefficient. Why do you think you can't analyze it?
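All of the quantities mentioned in the question are defined for any graph, scale-free or not. Here is a minimal sketch on a Watts-Strogatz graph; the parameters are illustrative, and betweenness is approximated by node sampling because the exact computation is slow at 7,000 nodes.

# Minimal sketch: compute the metrics from the question on a non-scale-free
# (Watts-Strogatz) graph. Parameters are illustrative, not from the post.
import networkx as nx
from collections import Counter

G = nx.watts_strogatz_graph(n=7000, k=6, p=0.1, seed=1)

degree_hist = Counter(d for _, d in G.degree())    # degree distribution
avg_clustering = nx.average_clustering(G)          # high for small-world graphs
# Approximate betweenness with a 200-node sample to keep it fast.
betweenness = nx.betweenness_centrality(G, k=200, seed=1)

print("degree histogram (lowest degrees):", sorted(degree_hist.items())[:10])
print("average clustering:", avg_clustering)
print("max betweenness:", max(betweenness.values()))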

Simple Binary Text Classification

I am looking for the most effective and simple way to classify 800k+ scholarly articles as either relevant (1) or irrelevant (0) in relation to a defined conceptual space (here: learning as it relates to work).
The data are the title and abstract of each article (mean length = 1,300 characters).
Any approach may be used, or several combined, including supervised machine learning and/or establishing features that give rise to threshold values for inclusion, among others.
Approaches could draw on the key terms that describe the conceptual space, though a simple frequency count alone is too unreliable. Potential avenues might involve latent semantic analysis, n-grams, ..
Generating training data may be realistic for up to 1% of the corpus, though this already means manually coding 8,000 articles (1 = relevant, 0 = irrelevant). Would that be enough?
Specific ideas and some brief reasoning are much appreciated so I can make an informed decision on how to proceed. Many thanks!
Several ideas:
Run LDA and get document-topic and topic-word distributions (say, 20 topics, depending on your dataset's coverage of different topics). Assign the top r% of documents with the highest weight on the relevant topic(s) as relevant and the bottom nr% as non-relevant, then train a classifier over those labelled documents.
Just use a bag-of-words representation, retrieve the top r nearest neighbours to your query (your conceptual space) as relevant and the bottom nr% as not relevant, and train a classifier over them (a sketch of this appears after this list).
If you had the citations, you could run label propagation over the citation graph after labelling only a very few papers.
Don't forget to distinguish the title words from your abstract words, e.g., by changing title tokens to title_word1, so that a classifier can put more weight on them.
Cluster the articles into, say, 100 clusters and then manually label those clusters. Choose the number of clusters based on the coverage of different topics in your corpus. You can also use hierarchical clustering for this.
If the number of relevant documents is far smaller than the number of non-relevant ones, then the best way to go is to find the nearest neighbours to your conceptual space (e.g. using information retrieval as implemented in Lucene). Then you can manually go down the ranked results until you feel the documents are no longer relevant.
Most of these methods are bootstrapping or weakly supervised approaches to text classification, about which you can find more in the literature.
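Here is a concrete sketch of the second idea (weakly labelling by nearest neighbours to the conceptual space, then training a classifier) using scikit-learn; the query string, the 10%/40% thresholds, and the load_corpus() helper are placeholders, not part of the original suggestion.

# Sketch of idea 2: weakly label documents by cosine similarity to the
# conceptual-space query, then train a classifier on the extremes.
# Query terms, thresholds, and load_corpus() are illustrative placeholders.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity

docs = load_corpus()  # hypothetical helper: returns the ~800k title+abstract strings
query = "workplace learning professional development training on the job"

vec = TfidfVectorizer(max_features=50_000, stop_words="english")
X = vec.fit_transform(docs)
q = vec.transform([query])

sims = cosine_similarity(X, q).ravel()
order = np.argsort(sims)
n_top = max(1, int(0.10 * len(docs)))       # top r% -> pseudo-relevant
n_bottom = max(1, int(0.40 * len(docs)))    # bottom nr% -> pseudo-irrelevant
top, bottom = order[-n_top:], order[:n_bottom]

idx = np.concatenate([top, bottom])
y = np.concatenate([np.ones(len(top)), np.zeros(len(bottom))])

clf = LogisticRegression(max_iter=1000).fit(X[idx], y)
labels = clf.predict(X)                     # apply to the full corpus

The labelled extremes could feed any classifier; logistic regression over TF-IDF is just the simplest baseline to start from.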

Calculate expected color temperature of daylight

I have a location (latitude/longitude) and a timestamp (year/month/day/hour/minute).
Assuming clear skies, is there an algorithm to loosely estimate the color temperature of sunlight at that time and place?
If I know what the weather was at that time, is there a suggested way to adjust the color temperature for the amount of cloud cover?
I suggest taking a look at this paper, which has a nice practical implementation for CG applications:
A Practical Analytic Model for Daylight, by A. J. Preetham, Peter Shirley, and Brian Smits
Abstract
Sunlight and skylight are rarely rendered correctly in computer graphics. A major reason for this is high computational expense. Another is that precise atmospheric data is rarely available. We present an inexpensive analytic model that approximates full spectrum daylight for various atmospheric conditions. These conditions are parameterized using terms that users can either measure or estimate. We also present an inexpensive analytic model that approximates the effects of atmosphere (aerial perspective). These models are fielded in a number of conditions and intermediate results verified against standard literature from atmospheric science. Our goal is to achieve as much accuracy as possible without sacrificing usability.
Both compressed postscript and pdf files of the paper are available.
Example code is available.
Link-only answers are discouraged, but I can post neither a sufficient portion of the article nor a complete C++ code snippet here, as both are far too big. Following the link, you can find both right now.
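If you only need a very rough number rather than a full sky model, the sketch below (which is not the Preetham model and is not taken from the paper) estimates the sun's elevation from the location and timestamp and interpolates a correlated color temperature from a few anchor points. Every numeric anchor and the cloud-cover blend are illustrative assumptions.

# NOT the Preetham model: a back-of-envelope sketch that maps sun elevation
# to an approximate CCT by interpolation. Anchor values and the cloud-cover
# adjustment are illustrative assumptions, not measured data.
import math
from datetime import datetime, timezone

import numpy as np

def sun_elevation_deg(lat_deg, lon_deg, when_utc):
    """Rough solar elevation in degrees; ignores the equation of time."""
    n = when_utc.timetuple().tm_yday
    decl = math.radians(-23.44) * math.cos(2 * math.pi * (n + 10) / 365.0)
    solar_hour = (when_utc.hour + when_utc.minute / 60.0 + lon_deg / 15.0) % 24
    hour_angle = math.radians(15.0 * (solar_hour - 12.0))
    lat = math.radians(lat_deg)
    sin_elev = (math.sin(lat) * math.sin(decl)
                + math.cos(lat) * math.cos(decl) * math.cos(hour_angle))
    return math.degrees(math.asin(sin_elev))

# Illustrative anchors: low sun is warm, high sun approaches noon daylight.
_ELEV = [-5.0, 0.0, 10.0, 30.0, 60.0, 90.0]
_CCT = [2000.0, 2500.0, 3500.0, 5000.0, 5800.0, 6000.0]

def daylight_cct(lat_deg, lon_deg, when_utc, cloud_fraction=0.0):
    elev = sun_elevation_deg(lat_deg, lon_deg, when_utc)
    cct_clear = float(np.interp(elev, _ELEV, _CCT))
    # Overcast skies skew bluer; blend toward ~6500 K with the cloud fraction.
    return (1.0 - cloud_fraction) * cct_clear + cloud_fraction * 6500.0

print(daylight_cct(52.5, 13.4, datetime(2023, 6, 21, 12, 0, tzinfo=timezone.utc)))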

How do I measure the distribution of an attribute of a given population?

I have a catalog of 900 applications.
I need to determine how their reliability is distributed as a whole (i.e., is it normally distributed?).
I can measure the reliability of an individual application.
How can I determine the reliability of the group as a whole without measuring each one?
That's a pretty open-ended question! Overall, distribution fitting can be quite challenging and works best with large samples (100's or even 1000's). It's generally better to pick a modeling distribution based on known characteristics of the process you're attempting to model than to try purely empirical fitting.
If you're going to go empirical, for a start you could take a random sample, measure the reliability scores (whatever you're using for that) of your sample, sort them, and plot them vs normal quantiles. If they fall along a relatively straight line the normal distribution is a plausible model, and you can estimate sample mean and variance to parameterize it. You can apply the same idea of plotting vs quantiles from other proposed distributions to see if they are plausible as well.
Watch out for behavior in the tails, in particular. Pretty much by definition the tails occur rarely and may be under-represented in your sample. Like all things statistical, the larger the sample size you can draw on the better your results will be.
I'd also add that my prior belief is that a normal distribution wouldn't be a great fit. Your reliability scores probably fall on a bounded range and tend to cluster towards one side or the other of that range. If they tend towards the high end, I'd predict that they get lopped off at the end of the range and have a long tail to the low side, and vice versa if they tend towards the low end.
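Here is a minimal sketch of the quantile-plot check described above, assuming you already have the reliability scores of a random sample in an array; the beta-distributed sample is only a stand-in for your measurements.

# Sketch of the suggested check: plot sampled reliability scores against
# normal quantiles. `scores` is a placeholder for the measured sample.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
scores = rng.beta(8, 2, size=120)   # placeholder sample; bounded and skewed

# probplot sorts the data, computes normal quantiles, and draws the fit line.
stats.probplot(scores, dist="norm", plot=plt)
plt.ylabel("ordered reliability scores")
plt.show()

If the points fall along the red line, the normal model is plausible; systematic curvature (as you would expect for bounded, skewed scores) argues for a different distribution.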
