How to correctly define dimensions for 1D structured points legacy vtk format? - vtk

I am just at the beginning of learning how to make VTK files. I wanted to start really simple, so I made a 1D structured points VTK file. Below is an extremely simple attempt:
# vtk DataFile Version 2.0
Vis Project Data
ASCII
DATASET STRUCTURED_POINTS
DIMENSIONS 2 1 1
ORIGIN 0 0 0
SPACING 1 1 1
POINT_DATA 2
SCALARS phi float 1
LOOKUP_TABLE default
0.1
0.2
However, whenever I try to load the file in ParaView I get the error "Incorrect Dimensionality".
This is probably a really stupid question but I will be forever grateful for an answer.
Thanks!

Related

Fsolve, replicating code from matlab, answer not matching because of Fsolve requirements

It would be very helpful if I could get some help; I have been stuck on this problem for a long time. TIA
My function returns two arrays, one 6 by 6 and one 6 by 1, and the starting point for the solver has size 6 by 1. Since the output has to be scalar or 1D, I have converted the two output arrays into one of size (42,1) and reshaped the starting point to (42,1); to increase it from (6,1) to (42,1) I tried adding extra zeros and also repeating the original starting point 7 times.
Because of this, I am not getting the desired result from fsolve. I was porting the code from MATLAB to Python; in MATLAB, fsolve gave the right answer and worked fine with the output as two arrays of (6,6) and (6,1) and the starting point as (6,1).
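For reference, SciPy's fsolve expects the residual function to return a 1-D array with the same number of entries as the starting point, so a square 6-unknown system is normally set up something like the sketch below (the system itself is a made-up placeholder, not the original MATLAB one):
import numpy as np
from scipy.optimize import fsolve

def residuals(x):
    # Hypothetical square system: 6 unknowns, 6 residuals.
    # fsolve requires len(residuals(x)) == len(x0), which is why returning
    # 42 values for 6 genuine unknowns causes trouble.
    A = np.arange(1, 37, dtype=float).reshape(6, 6) + 10 * np.eye(6)
    b = np.ones(6)
    return A @ x - b

x0 = np.zeros(6)            # 1-D starting point of length 6
solution = fsolve(residuals, x0)
print(solution)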

Classify filenames (exported to Excel) based on names/type

For part of my job, we make a comprehensive list of all files a user has in their drive. The users have to decide, per file, whether to archive it or not (indicated by Y or N). As a service to these users, we fill this in for them manually.
We export these files to a long list in Excel, which displays each file as X:\4. Economics\10. xxxxxxxx\04. xxxxxxxxx\04. xxxxxxxxxx\filexyz.pdf
I'd argue that we can easily automate this, as standard naming conventions make it easy to decide which files to keep and which to delete. A file with the string "CAB" in the filename should for example be kept. However, I have no idea how and where to start. Can someone point me in the right direction?
I would suggest the following general steps
Get the raw data
You can read the Excel file into a pandas dataframe in Python (a sketch of this step follows the example below). Ideally you will have a raw dataframe that looks something like this
Filename Keep
0 X:\4. Economics ...\filexyz.pdf 0
1 X:\4. Economics ...\fileabc.pdf 1
2 X:\3. Finance ...\filetef.pdf 1
3 X:\3. Finance ...\file123.pdf 0
4 G:\2. Philosophy ..\file285.pdf 0
....
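For instance, a minimal sketch of that first step; the workbook name files.xlsx and the column names Filename and Keep are assumptions, not something taken from your actual export:
import pandas as pd

# Assumed file and column names; adjust to the actual export.
df = pd.read_excel("files.xlsx", usecols=["Filename", "Keep"])
print(df.head())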
Preprocess/clean
This part is more up to you; for example, you could remove all special characters and numbers. This would leave letters as follows (see the sketch after this example)
Filename Keep
0 "X Economics filexyz pdf" 0
1 "X Economics fileabc pdf" 1
2 "X Finance filetef pdf" 1
3 "X Finance file123 pdf" 0
4 "G Philosophy file285 pdf" 0
....
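One possible way to do that cleaning with a regular expression, keeping only letters as in the description above (tweak the pattern if you prefer to keep digits; the Filename_clean column name is just something I made up):
import re

def clean(path):
    # Replace every run of non-letter characters with a single space.
    return re.sub(r"[^A-Za-z]+", " ", path).strip()

df["Filename_clean"] = df["Filename"].apply(clean)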
Vectorize your strings
For an algorithm to understand your text data, you typically vectorize it, i.e. turn the strings into numbers that the algorithm can process. An easy way to do this is with tf-idf and scikit-learn (see the sketch after the example below). After this, your dataframe might look something like this
Filename Keep
0 [0.6461, 0.3816 ... 0.01, 0.38] 0
1 [0., 0.4816 ... 0.25, 0.31] 1
2 [0.61, 0.1663 ... 0.11, 0.35] 1
....
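A minimal sketch of the vectorization step with scikit-learn's TfidfVectorizer, using the column names from the hypothetical dataframe above:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["Filename_clean"])   # sparse matrix, one row per file
y = df["Keep"]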
Train a classifier
Now that you have nice numbers for the algorithms to work with, you can train a classifier with scikit-learn. Simply search for "scikit learn classification example" and you will find plenty.
Once you have a trained classifier, you can compare its predictions on test data that it has not seen before. That way you get a feeling for accuracy.
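For example, a minimal sketch with a logistic regression (any scikit-learn classifier would do; X and y are the vectorized filenames and Keep labels from the previous step):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hold part of the data back so the classifier is scored on files it has not seen.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))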
Hopefully that is enough to get you started!

Smooth values using bin boundaries: Where do you put a value that sits right between the lower and upper boundary?

In response to #j.jerrod.taylor's answer, let me rephrase my question to clear any misunderstanding.
I'm new to Data Mining and am learning about how to handle noisy data by smoothing my data using the Equal-width/Distance Binning method via "Bin Boundaries". Assume the dataset 1,2,2,3,5,6,6,7,7,8,9. I want to perform:
distance binning with 3 bins, and
Smooth values by Bin Boundaries based on values binned in #1.
Based on the definition in (Han, Kamber, Pei, 2012, Data Mining: Concepts and Techniques, Section 3.2.2 Noisy Data):
In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.
Interval width = (max-min)/k = (9-1)/3 ≈ 2.7
Bin intervals = [1,3.7),[3.7,6.4),[6.4,9.1]
original Bin1: 1,2,2,3 | Bin boundaries: (1,3) | Smooth values by Bin Boundaries: 1,1,1,3
original Bin2: 5,6,6 | Bin boundaries: (5,6) | Smooth values by Bin Boundaries: 5,6,6
original Bin3: 7,7,8,9 | Bin boundaries: (7,9) | Smooth values by Bin Boundaries: 7,7,8,9
Question:
- Where does 8 go in Bin3 when smoothing by bin boundaries, since it is +1 from 7 and -1 from 9?
If this is a problem, then you are calculating your bin widths incorrectly. Creating a histogram, for example, is a form of data binning.
You can read this response on Cross Validated. But in general, if you're trying to bin integers, your boundaries will be doubles.
For example, if you want everything between 2 and 6 to be in one bin, your actual boundaries will be 1.5 and 6.5. Since all of your data are integers, no value can fall exactly on a boundary, so there is no chance for anything to go unclassified.
Edit: I also have the same book, though it seems I have a different version, because the section on data discretization is in chapter 2 instead of chapter 3 as you pointed out. Based on your question, it seems like you don't really understand the concept yet.
The following is an excerpt from page 88 of Chapter 2 on Data Preprocessing. I'm using the second edition of the text.
For example, attribute values can be discretized by applying equal-width or equal-frequency binning, and then replacing each bin value by the bin mean or median, as in smoothing by bin means or smoothing by bin medians, respectively.
8 doesn't belong anywhere other than in bin 3. This gives you two options. You can either take the mean/median of all of the numbers that fall in bin 3, or you can use bin 3 as a category.
Building on your example, we can take the mean of the 4 numbers in bin 3, which gives us 7.75. We would now use 7.75 for the four numbers in that bin instead of 7, 7, 8, and 9.
The second option would be to use the bin number. For example, everything in bin 3 would get a category label of 3, everything in bin 2 would get a label of 2, etc.
UPDATE WITH CORRECT ANSWER:
My class finally covered this topic, and the answer to my own question is that 8 can belong to either 7 or 9. This scenario is called "tie-breaking", where the value is equidistant from either boundary. It is acceptable for all such values to be consistently tied to the same boundary.
Here is a real example of an NIH analysis paper that explains using "tie breaking" when equidistant values are encountered: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3807594/
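For concreteness, here is a small sketch of both smoothing variants on the example data, using the three equal-width bins from the question. The tie for 8 is broken toward the lower boundary here, but either consistent choice is acceptable:
# The three equal-width bins from the question.
bins = [[1, 2, 2, 3], [5, 6, 6], [7, 7, 8, 9]]

# Smoothing by bin means: every value becomes its bin's mean.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]
# -> [[2.0, 2.0, 2.0, 2.0], [5.67, 5.67, 5.67], [7.75, 7.75, 7.75, 7.75]]  (5.67 rounded)

# Smoothing by bin boundaries: every value becomes the closer of the bin's min/max.
def to_boundary(value, lo, hi):
    # Ties (equidistant from both boundaries) go to the lower boundary here;
    # the convention just has to be applied consistently.
    return lo if value - lo <= hi - value else hi

by_boundaries = [[to_boundary(v, min(b), max(b)) for v in b] for b in bins]
# -> [[1, 1, 1, 3], [5, 6, 6], [7, 7, 7, 9]]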

A strange warning from proc phreg (Survival Analysis) in SAS

I have been trying to fit a Cox regression on a small dataset, but I have come across a strange problem. Although the model runs well, I am unable to get an output from it. Instead, the log reads
WARNING: The OUTPUT data set has no observations due to the presence of time-dependent explanatory variables.
It's true that I have a time-dependent variable on the RHS, but this shouldn't be a problem, I think; many analyses use this kind of variable. Could you please help me understand why this happens and how I can get past it? There is plenty of information to be gained from this statement and it would be really helpful to me. Here is my dataset and the code I have been using so far.
data surv;
input time event fin;
cards;
2 0 1
3 1 1
4 1 1
1 1 0
5 1 0
6 0 1
7 0 0
8 1 1
9 0 0
10 1 0
;
proc phreg data=surv;
model time*event(0)=fin ft;
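/* ft below is the time-dependent covariate built with a programming statement */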
ft=fin*log(time);
output out=b;
run;
Wasn't sure whether I should post it here or in stats stack.exchange but in any case, I would really appreciate some help. Thank you.
SAS is just telling you that you have a time-dependent variable (it doesn't stop the code from running). You are violating the proportional hazards assumption of the Cox PH model, but the test is robust enough to handle it. There is really no "correct" answer here. You can perform some transformations and run the model after each transformation; whichever model returns the lowest AIC would be your best model. Check out this presentation. This lecture also has some good information. If, however, the PH assumption is not important, you should switch to a parametric model. I hope this is what you were looking for, or at least close to it.

Histogram in logarithmic scale in gnuplot

I have to plot a histogram on a logarithmic scale on both axes using gnuplot, and I need the bins to be equally spaced in log10. Using a logarithmic scale on the y axis isn't a problem; the main problem is creating the bins on the x axis. For example, using 10 bins per decade in log10, the first bins would be [1], [2], [3], ..., [10 - 19], [20 - 29], ..., [100 - 190], and so on. I've searched on the net but couldn't find any practical solution. If doing this in gnuplot is too complicated, could you suggest some other software/language to do it?
As someone asked I will explain more specifically what I need to do. I have a (huge) list like this:
1 14000000
2 7000000
3 6500000
.
.
.
.
6600 1
8900 1
15000 1
19000 1
It shows, for example, that 14 million IP addresses have sent 1 packet, 7 million have sent 2 packets, ..., 1 IP address has sent 6600 packets, ..., and 1 IP address has sent 19000 packets. As you can see, the values on both axes are pretty high, so I cannot plot it without a logarithmic scale.
The first thing I tried, because I needed to do it fast, was to plot this list as it is with gnuplot, setting logscale on both axes and using boxes. The result is understandable but not really appropriate: the boxes become thinner and thinner going right along the x axis because, obviously, there are more points in 10-100 than in 1-10! So it becomes a real mess after the second decade.
I tried plotting a histogram with both axes logarithmically scaled, and gnuplot threw the error
Log scale on X is incompatible with histogram plots.
So it appears that gnuplot does not support a log scale on the x axis with histogram plots.
Plotting on a log-log scale in gnuplot is perfectly doable, contrary to the other post in this thread.
One can set the log-log scale in gnuplot with the command set logscale.
Then, the assumption is that we have a file with strictly positive values on both the x axis and the y axis. For example, the following is a valid file:
1 0.5
2 0.2
3 0.15
4 0.05
After setting the log-log scale, one can plot the file with the command
plot "file.txt" w p
where of course file.txt is the name of the file. This command will generate the output with points.
Note also that plotting boxes is tricky and probably not recommended. One first has to restrict the x range with a command of the form set xrange [1:4] and only then plot with boxes; otherwise, when the x range is undefined, an error is returned. I assume that in this case plot requires (for the relevant x values) some boxes to have size log(0), which is of course undefined, hence the error.
Hope this is clear and that it also helps others.
Have you tried Matplotlib with Python? Matplotlib is a really nice plotting library and when used with Python's simple syntax, you can plot things quite easily:
import matplotlib.pyplot as plot

figure = plot.figure()
axis = figure.add_subplot(1, 1, 1)
axis.set_yscale('log')
# Rest of plotting code
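Building on that, here is a sketch of a histogram whose bins are equally spaced in log10, with both axes logarithmic. The file name list.txt is an assumption; the two columns are the packet count and the number of IP addresses from the question:
import numpy as np
import matplotlib.pyplot as plt

# Two-column data as in the question: packets sent, number of IP addresses.
packets, n_addresses = np.loadtxt("list.txt", unpack=True)

# Bin edges equally spaced in log10 between the smallest and largest packet count.
edges = np.logspace(np.log10(packets.min()), np.log10(packets.max()), 41)

fig, ax = plt.subplots()
# Each (packets, n_addresses) pair contributes n_addresses counts to its bin.
ax.hist(packets, bins=edges, weights=n_addresses)
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('packets sent')
ax.set_ylabel('number of IP addresses')
plt.show()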
