170! approaches the limit of a floating point double: 171! will overflow.
However, 170! is over 300 digits long.
There is, therefore, no way that 170! can be represented precisely in floating point.
Yet Excel returns the correct answer for 170! / 169!.
Why is this? I'd expect some error to creep in, but it returns an integral value. Does Excel somehow know how to optimise this calculation?
If you find the closest doubles to 170! and 169!, they are
double oneseventy = 5818033100654137.0 * 256;
double onesixtynine = 8761273375102700.0;
times the same power of two. The closest double to the quotient of these is exactly 170.0.
Also, Excel may compute 170! by multiplying 169! by 170.
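For anyone who wants to check the quotient claim, here is a quick Python sketch (Python floats are IEEE-754 doubles); the true quotient is roughly 170 plus 8e-15, well within half an ulp (about 1.4e-14) of 170, so the correctly rounded result is exactly 170.0:

oneseventy = 5818033100654137.0 * 256
onesixtynine = 8761273375102700.0

# the division is correctly rounded, and 170.0 is the closest double to the true quotient
print(oneseventy / onesixtynine == 170.0)   # True
print(oneseventy / onesixtynine)            # 170.0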
William Kahan has a paper called "How Futile are Mindless Assessments of Roundoff in Floating-Point Computation?" where he discusses some of the insanity that goes on in Excel. It may be that Excel is not computing 170 exactly, but rather it's hiding an ulp of reality from you.
tmyklebu's answer is already perfect, but I wanted to know more.
What if the implementation of n! were something as trivial as return double(n)*(n-1)! ...
Here is a Smalltalk snippet; you could translate it into many other languages, that's not the point:
(2 to: 170) count: [:n |
    | num den |
    den := (2 to: n - 1) inject: 1.0 into: [:p :e | p*e].
    num := n*den.
    num / den ~= n].
And the answer is 12
So you have not been particularly lucky: thanks to the good properties of the round-to-nearest-even rounding mode, out of these 169 numbers only 12 don't behave as expected.
Which ones? Replace count: by select: and you get:
#(24 47 59 61 81 96 101 104 105 114 122 146)
If I had Excel handy, I would ask it to evaluate 146!/145!.
Curiously (only apparently curiously), a less naive solution that computes the exact factorial with large-integer arithmetic, then converts it to the nearest float, does not perform better!
(2 to: 170) reject: [:n |
    n factorial asFloat / (n-1) factorial asFloat = n]
leads to:
#(24 31 34 40 41 45 46 57 61 70 75 78 79 86 88 92 93 111 115 116 117 119 122 124 141 144 147 164)
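For comparison, here is a rough Python translation of the two Smalltalk experiments above; on a platform with IEEE-754 doubles and round-to-nearest it should reproduce the same two lists (math.factorial handles the exact large-integer version):

from math import factorial

# naive: build (n-1)! by repeated float multiplication, then form n! = n * (n-1)!
naive = []
for n in range(2, 171):
    den = 1.0
    for e in range(2, n):
        den *= e
    num = n * den
    if num / den != n:
        naive.append(n)

# exact: compute both factorials with large-integer arithmetic, then round each to a float
exact = [n for n in range(2, 171)
         if float(factorial(n)) / float(factorial(n - 1)) != n]

print(naive)   # failures of the naive float product
print(exact)   # failures of the exact-then-rounded version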
This might sound bonkers, but I'm looking to see if there are any ideas on how to do this.
I have N categories (say 7) across which a set number of people (say 1000) has to be allocated. I know from historical data the minimum and maximum for each category. The historical data is limited (say 15 samples), so it looks like this; if I had a larger sample I would try to fit a distribution for each category from all the samples, but there isn't enough data for that.
-Year 1: [78 97 300 358 132 35 0]
-Year 2: [24 74 346 300 148 84 22]
-.
-.
-Year 15:[25 85 382 302 146 52 8]
The min and max for each category over these 15 years of data is:
Min: [25 74 252 278 112 27 0 ]
Max: [132 141 382 360 177 84 22]
I am trying to scale this up using simulation: allocating the 1000 people across the categories, each within its min and max limits, and repeating the allocation. The only condition is that the allocations across the seven categories in each simulation have to sum to 1000.
Any ideas would be greatly appreciated!
The distribution you want is called the multinomial distribution. You can use the RandMultinomial function in SAS/IML to produce random samples from the multinomial distribution. To use it, you need to know the probability that an individual falls into each category. If this probability has not changed over time, the best estimate of it is the average proportion in each category.
Thus, I would recommend using ALL the data to estimate the probability, not just max and min:
proc iml;
X = {...}; /* X is a 15 x 7 matrix of counts, each row is a year */
mean = mean(X);
p = mean / sum(mean);
/* simulate new counts by using the multinomial distribution */
numSamples = 10;
SampleSize = 1000;
Y = randmultinomial(numSamples, SampleSize, p);
print Y;
Now, if you insist on using only the min/max, you could use the midrange to estimate the most likely value and use that to estimate the probability, as follows:
Min = {25 74 252 278 112 27 0};
Max = {132 141 382 360 177 84 22};
/* use midrange to estimate probabilities */
midrange = (Min + Max)/2;
p = midrange / sum(midrange);
/* now use RandMultinomial, as before */
If you use the second method, there is no guarantee that the simulated values will stay within the Min/Max limits, although in practice many of the samples will.
Personally, I advocate the first method, which uses the average count. Or you can use a time-weighted count, if you think recent observations are more relevant than observations from 15 years ago.
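For anyone working outside SAS, here is a rough NumPy sketch of the first method; the array X below shows only Year 1 and Year 2 from the question, and you would fill in all 15 rows:

import numpy as np

# yearly counts, one row per year (only the first two years shown)
X = np.array([[78, 97, 300, 358, 132, 35,  0],
              [24, 74, 346, 300, 148, 84, 22]])

means = X.mean(axis=0)
p = means / means.sum()                  # estimated category probabilities

rng = np.random.default_rng()
Y = rng.multinomial(1000, p, size=10)    # 10 simulated allocations of 1000 people
print(Y)                                 # every row sums to exactly 1000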
When using SHA256, it has always returned characters from the standard English alphabet (lowercase) and the Arabic numerals (0-9), so the character set returned was [a-z] U [0-9].
The reason this confuses me is that I've heard SHA256 should have 2^256 different results; since each bit is "random", each byte should be represented by a completely random value, not one that fits into a restricted set of 36 characters (26 letters and 10 numerals).
Basically, I want to know whether my SHA256 is behaving properly and, if it is, why it is like this. I am using the standard sha256sum utility that comes with Linux.
Yes, your assumption is correct. SHA256 generates a total of 32 bytes (= 256 bits), each byte having an arbitrary value between 0 and 255 (inclusive).
But here lies the problem: most of those byte values do not correspond to printable ASCII characters (ASCII only covers 0-127), and some of those that do are invisible (space, tab, linefeed, and several control characters).
To "render" the SHA256 hash as text, the bytes are therefore encoded in hexadecimal, which uses only the sixteen characters 0-9 and a-f. A single byte is represented by 2 characters: 00 = 0, 7f = 127, ff = 255.
The SHA256 hash of the empty string is e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855, or if each byte is converted to decimal:
227 176 196 66 152 252 28 20 154 251 244 200 153 111 185 36 39 174 65 228 100 155 147 76 164 149 153 27 120 82 184 85
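A small illustration with Python's hashlib, in case it helps: the raw digest is 32 arbitrary bytes, and hexdigest() is the 64-character [0-9a-f] rendering that sha256sum prints:

import hashlib

h = hashlib.sha256(b"")        # SHA256 of the empty string
print(len(h.digest()))         # 32 raw bytes, each with a value 0-255
print(list(h.digest()))        # the decimal values listed above
print(h.hexdigest())           # e3b0c442...7852b855, only digits and a-f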
I am trying to apply normalization to my data, and I have tried the conventional scaling techniques using the sklearn packages readily available for this kind of requirement. However, I am looking to implement something called decimal scaling.
I read about it in this research paper, and it looks like a technique that can improve the results of a neural network regression. As per my understanding, this is what needs to be done -
Suppose the range of attribute X is −4856 to 28. The maximum absolute value of X is 4856.
To normalize by decimal scaling I will need to divide each value by 10000 (c = 4). In this case, −4856 becomes −0.4856 while 28 becomes 0.0028.
So for all values: new value = old value / 10^c
How can I reproduce this as a function in Python so as to normalize all the features (column by column) in my data set?
Input:
A B C
30 90 75
56 168 140
28 84 70
369 1107 922.5
485 1455 1212.5
4856 14568 12140
40 120 100
56 168 140
45 135 112.5
78 234 195
899 2697 2247.5
Output:
A B C
0.003 0.0009 0.0075
0.0056 0.00168 0.014
0.0028 0.00084 0.007
0.0369 0.01107 0.09225
0.0485 0.01455 0.12125
0.4856 0.14568 1.214
0.004 0.0012 0.01
0.0056 0.00168 0.014
0.0045 0.00135 0.01125
0.0078 0.00234 0.0195
0.0899 0.02697 0.22475
Thank you guys for asking questions that led me to think about my problem more clearly and break it into steps. I have arrived at a solution; here's what it looks like:
def Dec_scale(df):
    for x in df:
        # largest absolute value in the column
        p = df[x].abs().max()
        # number of digits in its integer part gives the scaling exponent
        q = len(str(int(p)))
        df[x] = df[x] / 10**q
I hope this solution looks agreeable!
import math
import numpy as np

def decimal_scaling(df):
    # largest absolute value in each column
    max_values = df.abs().max()
    # number of digits in the integer part of each column maximum
    digits = [int(math.log10(v)) + 1 for v in max_values]
    # divide each column by the matching power of ten
    divisors = np.array([10**d for d in digits])
    return df / divisors
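A quick usage sketch on part of the sample data above, assuming pandas is available and using the decimal_scaling function from the previous answer; column A has maximum 4856, so it is divided by 10**4 and 30 becomes 0.003, as in the expected output:

import pandas as pd

df = pd.DataFrame({"A": [30, 56, 28, 4856],
                   "B": [90, 168, 84, 14568]})
print(decimal_scaling(df))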
Using integers ONLY (no floating point), is there a way to determine which of two fractions is greater?
For example, say we have these two fractions:
1000/51 = 19(.60) && 1000/52 = 19(.23)
If we were to use floating-point numbers, obviously the first fraction is greater; however, both fractions equal 19 if we use integers only. How might one find out which is greater without using floating-point math?
I have tried to get the remainder using the % operator, but that does not seem to work in all cases.
1/2 can be thought of as one apple shared between two people, so each person gets 0.5 apple.
So 1000/51 can be seen as 1000 apples shared among 51 people.
1000/51 > 1000/52, because the apples are the same, but we are sharing them among more people.
That is a simple example; here is a more complex one: which is greater, 1213/109 or 1245/115?
1245 is greater than 1213, and 115 is greater than 109. The differences are 1245 - 1213 = 32 and 115 - 109 = 6, and because the second denominator is larger, 1245/115 always lies between 1213/109 and the difference fraction 32/6, so it is enough to compare 1213/109 to 32/6.
32/6 is less than 6, while 6 * 109 = 654 < 1213 means 1213/109 is greater than 6; therefore 1213/109 > 32/6, and so 1213/109 > 1245/115.
In schematic form:
1213/109 ? 1245/115
1213/109 ? 32/6       # differences: 1245 - 1213 = 32, 115 - 109 = 6
1213 > 109 * 6        # so 1213/109 > 6 > 32/6
# therefore
1213/109 > 1245/115
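Another integer-only approach, offered here only as an alternative to the difference trick above: for positive denominators, a/b > c/d exactly when a*d > c*b, so a single cross-multiplication settles the comparison. A minimal Python sketch (in a fixed-width language such as C you would also have to watch for overflow in the products):

def fraction_greater(a, b, c, d):
    # compare a/b with c/d using only integer arithmetic; assumes b > 0 and d > 0
    return a * d > c * b

print(fraction_greater(1000, 51, 1000, 52))    # True:  1000/51 > 1000/52
print(fraction_greater(1213, 109, 1245, 115))  # True:  1213/109 > 1245/115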
I have a text file consisting of data separated into tab-delimited columns. There are many ways to read data from the file into Python, but I am specifically trying to use a method similar to the one outlined below. When using a context manager like with open(...) as ..., I've seen that the general idea is to keep all of the subsequent code indented within the with statement. Yet when defining a function, the return statement is usually placed at the same indentation as the first line of code within the function (excluding cases with awkward if-else branching). In this case, both approaches work. Is one method considered correct or generally preferred over the other?
import numpy as np

def read_in(fpath, contents=[], row_limit=np.inf):
    """
    fpath is the file location + file name + '.txt'
    contents is the initial data that the file data will be appended to
    row_limit is the maximum number of rows to be read (in case one would like to not read in every row).
    """
    nrows = 0
    with open(fpath, 'r') as f:
        for row in f:
            if nrows < row_limit:
                contents.append(row.split())
                nrows += 1
            else:
                break
        # return contents
    return contents
Below is a snippet of the text-file I am using for this example.
1996 02 08 05 17 49 263 70 184 247 126 0 -6.0 1.6e+14 2.7e+28 249
1996 02 12 05 47 26 91 53 160 100 211 236 2.0 1.3e+15 1.6e+29 92
1996 02 17 02 06 31 279 73 317 257 378 532 9.9 3.3e+14 1.6e+29 274
1996 02 17 05 18 59 86 36 171 64 279 819 27.9 NaN NaN 88
1996 02 19 05 15 48 98 30 266 129 403 946 36.7 NaN NaN 94
1996 03 02 04 11 53 88 36 108 95 120 177 1.0 1.5e+14 8.7e+27 86
1996 03 03 04 12 30 99 26 186 141 232 215 2.3 1.6e+14 2.8e+28 99
And below is a sample call.
fpath = "/Users/.../sample_data.txt"
data_in = read_in(fpath)
for i in range(len(data_in)):
print(data_in[i])
(I realize that it's better to read data in chunks of pre-defined size, but the number of characters per row of data varies, so I'm instead trying to give the user control over the number of rows read in; one could read in a subset of the rows at a time, append them to contents, and keep passing contents back into read_in - possibly in a loop - if the file is large enough. That said, I'd love to know if I'm wrong about this approach as well, though this isn't my main question.)
If your function needs to do some other things after it is done with the file, you usually do them outside the with block, so essentially you need to return outside the with block too.
However, if the purpose of your function is just to read in a file, you can return within the with block or outside it. I believe neither method is preferred in this case.
I don't really understand your second question.
You can also put the return inside the with context.
On exiting the context, the cleanup is done; that is the power of with: you do not need to check every possible exit path. Note that the exit handler is also called when an exception is raised inside the with block.
But if the file is empty (for example), you should still return something, so in that case your code is clearer and follows the principle of a single exit path if you return outside the block. If you have to handle reaching the end of the file without finding something important, I would put the normal return inside the with context and handle the special case after it.
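For illustration, a minimal sketch of that suggestion (first_line is a made-up helper, not the OP's function): the return sits inside the with block, the file is still closed on the way out, and the empty-file case is handled after the block:

def first_line(fpath):
    with open(fpath, 'r') as f:
        for row in f:
            return row.split()   # leaving the with block closes the file
    return []                    # reached only if the file had no rows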