I am trying to analyze a graph constructed with networkx that has around 7000 nodes. When I plot the degree distribution, there are nodes that are far away from the fitted power law, as shown in the attached plot. To my understanding, this means the network is not scale-free. I am analyzing this network using various measures such as degree, clustering coefficient, betweenness centrality, and many others. Is analyzing such networks with these measures acceptable? I have tried to find examples of analyses of networks that are not scale-free, but with no luck so far. Any suggestions and pointers to such examples would be really great. In addition, some notes on the differences in network characteristics between scale-free and non-scale-free networks would be very helpful. Thanks in advance.
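For reference, here is roughly how I am checking the degree distribution against a power law. This is a simplified sketch: it assumes the separate powerlaw package is installed, and the Barabási-Albert graph is only a placeholder for my real 7000-node graph.

```python
import networkx as nx
import powerlaw  # pip install powerlaw; a separate package, not part of networkx

# Placeholder graph; in my case G is loaded from data instead.
G = nx.barabasi_albert_graph(7000, 3)

degrees = [d for _, d in G.degree() if d > 0]

# Fit a power law to the degree sequence.
fit = powerlaw.Fit(degrees, discrete=True)
print("estimated exponent alpha:", fit.power_law.alpha)
print("xmin used for the fit:", fit.power_law.xmin)

# Compare the power law against a log-normal alternative:
# R > 0 favours the power law, R < 0 the log-normal; p is the significance.
R, p = fit.distribution_compare('power_law', 'lognormal')
print("loglikelihood ratio:", R, "p-value:", p)
```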
1. What type of model did you construct? Did you use data from a file?
2. What do you want to check?
Models such as Watts-Strogatz (https://en.wikipedia.org/wiki/Watts%E2%80%93Strogatz_model) are also not scale-free:
'They do not account for the formation of hubs. Formally, the degree distribution of ER graphs converges to a Poisson distribution, rather than a power law observed in many real-world, scale-free networks.[3]'
WS is a 'small-world' network, characterized by a high clustering coefficient. Why do you think you can't analyze it?
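As a sketch (with arbitrary parameters), the same descriptive measures you list work perfectly well on a Watts-Strogatz graph generated with networkx:

```python
import networkx as nx

# A Watts-Strogatz small-world graph: not scale-free, but perfectly analyzable
# with the usual descriptive measures. Parameters here are arbitrary examples.
G = nx.watts_strogatz_graph(n=7000, k=10, p=0.1)

degrees = dict(G.degree())
clustering = nx.average_clustering(G)
# Exact betweenness on 7000 nodes is slow; k=200 sampled pivots approximate it.
betweenness = nx.betweenness_centrality(G, k=200, seed=42)

print("mean degree:", sum(degrees.values()) / len(degrees))
print("average clustering coefficient:", clustering)
print("max (approx.) betweenness:", max(betweenness.values()))
```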
I'm using a tutorial (https://www.tidytextmining.com/nasa.html?q=correlation%20ne#networks-of-keywords) to learn about tidy text mining. I am hoping someone might be able to help with two questions:
1. In this tutorial, the correlation cutoff used to build the graph is 0.15. Is this best practice? I can't find any literature to help choose a cutoff.
2. In the graph attached from the tutorial, how is the centrality of the clusters chosen? Are more important words closer to the centre?
Thanks very much
I am not aware of any literature on a correlation threshold to use for this kind of network analysis; this will (I believe) depend on your particular dataset and how language is used in your context. This is a heuristic decision. Given what a correlation coefficient measures, I would expect 0.15 to be on the low side of what you might use.
The graph is represented visually in a two-dimensional plot via the layout argument of ggraph. You can read more about that here, but the very high-level takeaways are that there are a lot of options, that they have a big impact on what your graph looks like, and that often it is not clear which choice is best.
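If it helps, one heuristic is to sweep the cutoff and look at how the resulting graph changes (number of nodes, edges, and connected components). The tutorial is in R/ggraph, but the idea can be sketched in Python with networkx; pair_correlations below is a made-up stand-in for the pairwise correlation output in the tutorial.

```python
import networkx as nx

# Made-up stand-in for the tutorial's pairwise correlations:
# (word_a, word_b, correlation) tuples.
pair_correlations = [("ocean", "sea", 0.42), ("ocean", "data", 0.08),
                     ("sea", "salinity", 0.31), ("data", "set", 0.18)]

for cutoff in (0.10, 0.15, 0.25, 0.40):
    G = nx.Graph()
    G.add_weighted_edges_from((a, b, c) for a, b, c in pair_correlations
                              if c >= cutoff)
    print(f"cutoff {cutoff:.2f}: {G.number_of_nodes()} nodes, "
          f"{G.number_of_edges()} edges, "
          f"{nx.number_connected_components(G)} components")
```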
Statistics is not my major and English is not my native language. I am trying to apply for data analysis or data science jobs in industry. However, I do not know how to describe my research process below in a concise and professional way. I would highly appreciate it if you could help me with this.
Background: I simulate the properties of materials using different research packages, such as LAMMPS. The simulated data are only the coordinates of atoms. Below are my data analysis steps.
Step 1: Clean the data to make sure they are complete and that each atom ID is unique and not exchanged between different time moments (timesteps).
Step 2: Calculate the distances from each center atom to its neighboring atoms to find the target species (a configuration formed by several target atoms, such as Al-O-H, Si-O-H, Al-O-H2, H3-O); a rough code sketch of this step appears below.
Step 3: Count the number of each species as a function of space and/or time, and plot the species distributions as functions of space and/or time, as well as the lifetime distribution of each species.
NOTE: such a distribution is different from a statistical distribution, such as the Normal or Binomial distribution.
Step 4: Based on the above distributions, explore and interpret the correlations between species.
After the above steps, I study the underlying mechanism based on the materials themselves and their local environment.
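For concreteness, step 2 looks roughly like this in code. This is a simplified sketch: the positions, species labels, and the bonding cutoff below are placeholders, not my actual LAMMPS data.

```python
import numpy as np
from collections import Counter
from scipy.spatial import cKDTree

# Placeholder data: atom positions (in Å) and element labels for one timestep,
# standing in for coordinates parsed from a LAMMPS dump file.
rng = np.random.RandomState(0)
coords = rng.rand(1000, 3) * 20.0
species = rng.choice(["Al", "Si", "O", "H"], size=1000)

CUTOFF = 1.2  # assumed bonding cutoff in Å; adjust to the system being studied
tree = cKDTree(coords)

# Step 2: for every O atom, collect neighbours within the cutoff and record
# the resulting configuration label (e.g. "H-H-O", "Al-H-O", ...).
configurations = []
for i in np.where(species == "O")[0]:
    neighbours = [j for j in tree.query_ball_point(coords[i], CUTOFF) if j != i]
    label = "-".join(sorted(species[j] for j in neighbours) + ["O"])
    configurations.append(label)

# Step 3 (at a single timestep): count how often each species occurs.
print(Counter(configurations).most_common(5))
```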
Could anyone point out how to describe the above steps in statistical or data-analytic terms (or otherwise)?
I sincerely appreciate your time and help.
Despite going through lots of similar questions related to this, I still could not understand why some algorithms are susceptible to feature scaling while others are not.
So far I have found that SVM and K-means are susceptible to feature scaling, while Linear Regression and Decision Trees are not. Can somebody please explain why, either in general or in relation to these 4 algorithms?
As I am a beginner, please explain this in layman's terms.
One reason I can think of off-hand is that SVM and K-means, at least in a basic configuration, use an L2 distance metric. An L1 or L2 distance between two points will give different results if you double delta-x or delta-y, for example.
With Linear Regression, you fit a linear transform that best describes the data, effectively transforming the coordinate system before taking a measurement. Since the optimal model is the same no matter the coordinate system of the data, pretty much by definition your result will be invariant to any linear transform, including feature scaling.
With Decision Trees, you typically look for rules of the form x < N, where the only detail that matters is how many items pass or fail the given threshold test; this is what you feed into your entropy function. Because this rule format does not depend on the scale of a dimension (there is no continuous distance metric), we again have invariance.
Somewhat different reasons for each, but I hope that helps.
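A quick sketch of these claims on synthetic data (the data set and model settings below are arbitrary, just for illustration): rescaling the features changes the k-means partition, but leaves the linear regression and decision tree predictions untouched.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import adjusted_rand_score

rng = np.random.RandomState(0)
X = np.c_[rng.randn(200), rng.randn(200) * 1000]   # feature 2 dwarfs feature 1
y = 3 * X[:, 0] + 0.001 * X[:, 1] + rng.randn(200) * 0.1
X_scaled = StandardScaler().fit_transform(X)

# K-means: the partition changes once the features are rescaled, because the
# Euclidean distance was dominated by the large-scale feature (ARI well below
# 1 means the two partitions genuinely disagree).
km_raw = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
km_scl = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
print("k-means partition agreement (ARI):", adjusted_rand_score(km_raw, km_scl))

# Linear regression: predictions are unaffected by the (affine) rescaling.
lr_raw = LinearRegression().fit(X, y).predict(X)
lr_scl = LinearRegression().fit(X_scaled, y).predict(X_scaled)
print("linear regression predictions match:", np.allclose(lr_raw, lr_scl))

# Decision tree: the thresholds move with the scale, but each split separates
# the same samples, so the predictions are unchanged as well.
dt_raw = DecisionTreeRegressor(random_state=0).fit(X, y).predict(X)
dt_scl = DecisionTreeRegressor(random_state=0).fit(X_scaled, y).predict(X_scaled)
print("decision tree predictions match:", np.allclose(dt_raw, dt_scl))
```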
I have read through the scikit-learn documentation and Googled to no avail. I have 2000 data sets, clustered as the picture shows. Some of the clusters, as shown, are wrong (here the red cluster). I need a metric or method to validate all 2000 cluster sets. Almost every metric in scikit-learn requires the ground-truth class labels, which I do not think I have, or can have for that matter. I have the hourly traffic flow for 30 days and I am clustering it using k-means; the lines are the cluster centers. What should I do? Am I even on the right track? The horizontal axis is the hour, 0 to 23, and the vertical axis is the traffic flow, so the data points represent the traffic flow in each hour over the 30 days, and k=3.
To my knowledge, scikit-learn has no methods for internal evaluation apart from the silhouette coefficient; for such problems we can implement the Davies-Bouldin index and the Dunn index ourselves. The article here provides good metrics for k-means:
http://www.iaeng.org/publication/IMECS2012/IMECS2012_pp471-476.pdf
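As a side note, if your scikit-learn version is recent enough (0.20 or later, if I remember correctly), the Davies-Bouldin index is available directly; a minimal sketch on made-up data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Synthetic stand-in for one of the 2000 traffic-flow data sets.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
# Lower Davies-Bouldin values indicate more compact, better-separated clusters.
print("Davies-Bouldin index:", davies_bouldin_score(X, labels))
```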
Both the silhouette coefficient and the Calinski-Harabasz index are implemented in scikit-learn nowadays and will help you evaluate your clustering results when there is no ground truth.
More details here:
http://scikit-learn.org/stable/modules/clustering.html
And here:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_samples.html#sklearn.metrics.silhouette_samples
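A minimal sketch on made-up data standing in for one traffic-flow data set, comparing a few candidate values of k (note that older scikit-learn releases spell the second function calinski_harabaz_score):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, calinski_harabasz_score

# Synthetic stand-in for one hourly traffic-flow data set (30 days x 24 hours).
X, _ = make_blobs(n_samples=30, n_features=24, centers=3, random_state=0)

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: silhouette={silhouette_score(X, labels):.3f}, "
          f"Calinski-Harabasz={calinski_harabasz_score(X, labels):.1f}")
```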
Did you look at Agglomerative Clustering, and in particular the subsection "Varying the metric":
http://scikit-learn.org/stable/modules/clustering.html#varying-the-metric
To me it seems very similar to what you are trying to do.
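A rough sketch of that idea on made-up waveform data (the data and parameters below are placeholders; also note that recent scikit-learn versions use the keyword metric where older ones used affinity):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Waveform-like stand-in data: 30 days of 24 hourly traffic values each.
rng = np.random.RandomState(0)
X = np.vstack([np.sin(np.linspace(0, 2 * np.pi, 24)) * s + rng.randn(24) * 0.1
               for s in rng.choice([1.0, 2.0, 3.0], size=30)])

# An L1 ("manhattan") metric is often more robust to a few outlying hours than
# Euclidean distance. scikit-learn >= 1.2 uses `metric`; older releases use
# `affinity` for the same argument.
model = AgglomerativeClustering(n_clusters=3, metric="manhattan",
                                linkage="average")
labels = model.fit_predict(X)
print("cluster sizes:", np.bincount(labels))
```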
I have found contradictory literature on this topic. Some papers suggest that the power-law exponent is close to 2 (between 2.1 and 2.3), but other papers show a higher value (around 3). Kindly provide references to any studies related to this topic.
Here are some links and quotes:
Search in Power-Law Networks
A number of large distributed systems, [...] display a power-law distribution in their node degree. This distribution reflects the existence of a few nodes with very high degree and many with low degree, a feature not found in standard random graphs
Modeling Peer-to-peer Network Topologies Through “small-world” Models And Power Laws
The real problem here is that large scale p2p networks don't really exist in academia. It's incredibly difficult to scale a real p2p network. There are no great p2p simulators for lookup algorithms which help measure these details.
I've recently started using jxta-sim which is a p2p simulator built on top of planet sim.
jxta sim link - http://jxta.dsg.cs.tcd.ie/
Given that it's an empirical fit, I'd say that it depends on the network (what drives it, how it grows, etc.) and the variation in reported values should be taken as a range (rather than as errors in measurement).