Create row ids for dataframes based on contents of rows [duplicate]

Is there a method that converts a string of text such as 'you' to a number other than
y = tuple('you')
for k in y:
    k = ord(k)
which only converts one character at a time?

To convert a string to a number (and back), you should always work with bytes first. Since you are using Python 3, strings are Unicode strings and may contain characters whose ord() value is higher than 255. A bytes object, however, is a sequence of values in the range 0–255, so you should always convert between those two types first.
So basically, you are looking for a way to convert a bytes string (which is basically a list of bytes, a list of numbers 0–255) into a single number, and the inverse. You can use int.to_bytes and int.from_bytes for that:
import math

def convertToNumber(s):
    return int.from_bytes(s.encode(), 'little')

def convertFromNumber(n):
    return n.to_bytes(math.ceil(n.bit_length() / 8), 'little').decode()

>>> convertToNumber('foo bar baz')
147948829660780569073512294
>>> x = _
>>> convertFromNumber(x)
'foo bar baz'
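To see what the little-endian conversion actually computes, the arithmetic can be spelled out by hand. This is only an illustrative sketch: the names to_number and by_hand are made up for this example, and UTF-8 (the default of str.encode()) is assumed as the encoding.

```python
def to_number(s):
    # Same idea as convertToNumber above, with the encoding made explicit.
    return int.from_bytes(s.encode("utf-8"), "little")

data = "you".encode("utf-8")           # the bytes 121, 111, 117

# Little-endian: the first byte is the least significant base-256 digit.
by_hand = sum(b * 256**i for i, b in enumerate(data))

print(to_number("you") == by_hand)     # True
print(by_hand)                         # 7696249
```

In other words, the result is 121 + 111*256 + 117*256**2, which is why the number grows quickly with the length of the string.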

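Treat the string as a base-255 number.
from functools import reduce

y = 'you'
# Reverse the bytes to make reconstructing the string more efficient.
# Iterating over a bytes object in Python 3 already yields integers.
# (UTF-8 never produces the byte value 255, so base 255 is sufficient here.)
digits = reversed(y.encode())
n = reduce(lambda acc, b: acc * 255 + b, digits)
new_y = bytearray()
while n > 0:
    n, b = divmod(n, 255)
    new_y.append(b)
assert y == new_y.decode()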
(Note this is essentially the same as poke's answer, but written explicitly rather than using available methods for converting between a byte string and an integer.)

You don't need to convert the string into a tuple; a string is already iterable.
In your loop, k is overwritten on every iteration, so nothing is kept. Collect the items using something like a list comprehension:
>>> text = 'you'
>>> [ord(ch) for ch in text]
[121, 111, 117]
To get the text back, use chr, and join the characters using str.join:
>>> numbers = [ord(ch) for ch in text]
>>> ''.join(chr(n) for n in numbers)
'you'

Though there are a number of ways to do this, I prefer hashing because it has the following nice properties:
it ensures that the numbers you get behave, for practical purposes, like uniformly distributed random numbers
it ensures that even a small change in your input string leads to a significant difference in the output integer
it is a one-way process, i.e., you cannot practically tell which string was the input from the integer output (so there is no inverse function)
import hashlib
# there are a number of hashing functions to pick from; they produce digests of different lengths and security levels.
hashing_func = hashlib.md5
# the lambda does three things:
# 1. hash the given string using the chosen algorithm
# 2. retrieve its hex digest
# 3. convert the hex digest to an integer
str2int = lambda s : int(hashing_func(s.encode()).hexdigest(), 16)
To see that the resulting integers are uniformly distributed, we first need a random string generator
import string
import numpy as np
# candidate characters
letters = string.ascii_letters
# total number of candidates
L = len(letters)
# control the seed or prng for reproducible results
prng = np.random.RandomState(1234)
# define the string prng of length 10
prng_string = lambda : "".join([letters[k] for k in prng.randint(0, L, size=(10))])
Now we generate a sufficient number of random strings and obtain the corresponding integers
ss = [prng_string() for x in range(50000)]
vv = np.array([str2int(s) for s in ss])
Let us check the randomness by comparing the theoretical mean and standard deviation of a uniform distribution with those we observed.
for max_num in [256, 512, 1024, 4096]:
    # the remainders are small, so a fixed-width integer dtype is safe here
    ints = (vv % max_num).astype(np.int64)
    print("distribution comparisons for max_num = {:4d} \n\t[theoretical] {:7.2f} +/- {:8.3f} | [observed] {:7.2f} +/- {:8.3f}".format(
        max_num, max_num/2., np.sqrt(max_num**2/12), np.mean(ints), np.std(ints)))
The results below indicate that the numbers you get are very close to uniformly distributed.
distribution comparisons for max_num = 256
    [theoretical] 128.00 +/- 73.901 | [observed] 127.21 +/- 73.755
distribution comparisons for max_num = 512
    [theoretical] 256.00 +/- 147.802 | [observed] 254.90 +/- 147.557
distribution comparisons for max_num = 1024
    [theoretical] 512.00 +/- 295.603 | [observed] 512.02 +/- 296.519
distribution comparisons for max_num = 4096
    [theoretical] 2048.00 +/- 1182.413 | [observed] 2048.67 +/- 1181.422
It is worth pointing out that the other posted answers may not have these properties.
For example, @poke's convertToNumber solution will give
distribution comparisons for max_num = 256
    [theoretical] 128.00 +/- 73.901 | [observed] 93.48 +/- 17.663
distribution comparisons for max_num = 512
    [theoretical] 256.00 +/- 147.802 | [observed] 220.71 +/- 129.261
distribution comparisons for max_num = 1024
    [theoretical] 512.00 +/- 295.603 | [observed] 477.67 +/- 277.651
distribution comparisons for max_num = 4096
    [theoretical] 2048.00 +/- 1182.413 | [observed] 1816.51 +/- 1059.643
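If the goal is a compact, repeatable identifier rather than a very large integer, the same idea can be applied with a shorter digest. The following is a minimal sketch and not part of the answer above: stable_id is a hypothetical helper name, and the choice of blake2b with an 8-byte digest is an assumption made here so that the values fit in 64 bits.

```python
import hashlib

def stable_id(s, num_bytes=8):
    """Map a string to an integer in [0, 2**(8*num_bytes)) via a hash digest."""
    digest = hashlib.blake2b(s.encode("utf-8"), digest_size=num_bytes).digest()
    return int.from_bytes(digest, "big")

ids = [stable_id(s) for s in ("you", "you", "yov")]
print(ids[0] == ids[1])   # True: the same input always gives the same id
print(ids[0] != ids[2])   # True: a one-character change gives a different id
```

A shorter digest makes collisions more likely than with the full MD5 value, so the digest size is a trade-off between compactness and the number of distinct strings you expect to handle.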

I was trying to find a way to convert a numpy character array into a unique numeric array in order to do some other stuff. I have implemented the following functions, including the answers by @poke and @falsetrue (these methods gave me some trouble when the strings were too large). I have also added the hash method (a hash is a fixed-size integer that identifies a particular value).
import numpy as np
def str_to_num(x):
    """Converts a string into a unique concatenated UNICODE representation
    Args:
        x (string): input string
    Raises:
        ValueError: x must be a string
    """
    if isinstance(x, str):
        x = [str(ord(c)) for c in x]
        x = int(''.join(x))
    else:
        raise ValueError('x must be a string.')
    return x

def chr_to_num(x):
    return int.from_bytes(x.encode(), 'little')

def char_arr_to_num(arr, type = 'hash'):
    """Converts a character array into a unique hash representation.
    Args:
        arr (np.array): numpy character array.
    """
    if type == 'unicode':
        vec_fun = np.vectorize(str_to_num)
    elif type == 'byte':
        vec_fun = np.vectorize(chr_to_num)
    elif type == 'hash':
        vec_fun = np.vectorize(hash)
    out = np.apply_along_axis(vec_fun, 0, arr)
    out = out.astype(float)
    return out

a = np.array([['x', 'y', 'w'], ['x', 'z','p'], ['y', 'z', 'w'], ['x', 'w','y'], ['w', 'z', 'q']])
char_arr_to_num(a, type = 'unicode')
char_arr_to_num(a, type = 'byte')
char_arr_to_num(a, type = 'hash')
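One caveat with the 'hash' variant: Python's built-in hash() is salted per process for strings (unless PYTHONHASHSEED is fixed), so the numbers differ from run to run. If all that is needed is one distinct number per distinct value, an alternative is to let np.unique assign integer codes. This is a sketch under that assumption about the use case, not something the answer above does; the variable names are illustrative.

```python
import numpy as np

a = np.array([['x', 'y', 'w'], ['x', 'z', 'p'], ['y', 'z', 'w'],
              ['x', 'w', 'y'], ['w', 'z', 'q']])

# uniques holds the sorted distinct values; inverse gives, for each element,
# the index of its value inside uniques.
uniques, inverse = np.unique(a, return_inverse=True)
codes = inverse.reshape(a.shape)   # reshape in case the inverse comes back flat

print(uniques)
print(codes)
```

The codes are small, reproducible integers, and uniques[codes] recovers the original array, which the hash-based variants cannot do.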

Related

Function to Convert Square Matrix to Upper Hessenberg with Similarity Transformations

I am attempting to translate a MATLAB function to Python from Timothy Sauer,
Numerical Analysis Second Edition, page 546, Program 12.8. The original function
receives a square matrix and returns a matrix with the same eigenvalues but in
Upper Hessenberg form. The original function creates Householder reflectors to produce zeros in the
offdiagonals of the matrix and performs similarity transformations on the original matrix to
get it to upper hessenberg form.
My Python translation succeeds only in obtaining the eigenvalues for 3x3 matrices
but not for 4x4 matrices. Would anyone know the cause of the error? I pasted my code with success and failing cases below. Thank you.
import numpy as np
import math
norm = lambda v:math.sqrt(np.sum(v**2))
def upper_hessenberg(A):
    '''
    Translated from Timothy Sauer, Numerical Analysis Second Edition, page 546, Program 12.8
    Input: Square Matrix, A
    Output: B, a Similar Matrix with Same Eigenvalues as A except in Upper Hessenberg form
            V, a matrix containing the reflectors used to produce zeros in the off diagonals
    '''
    rows, columns = A.shape
    B = A[:,:].astype(float) #will store the similar matrix
    V = np.zeros(shape=(rows,columns),dtype=float) #will store the reflectors
    for column in range(columns-2): #start from the 1st column end at the third to last column
        row = column
        x = B[row+1: ,column] #decapitate the column
        reflection_of_x = np.zeros(len(x)) #first entry is the norm, followed by 0s
        if abs(norm(x)) <= np.finfo(float).eps: #if there are already 0s in the offdiagonals skip this column
            continue
        reflection_of_x[0] = norm(x)
        v = reflection_of_x - x # v, (the difference vector) represents the line connecting the original column to the reflection of the column (see Timothy Sauer Num Analysis 2nd Edition Figure 4.11 Householder reflector)
        v = v/norm(v) #normalize to length of 1 (unit vector)
        V[:len(v), column] = v #save the reflector in an upper triangular matrix called V
        #verify with x-2*(x @ v * v) should equal a vector with all zeros except the leading entry
        column_projections = np.outer(v , v @ B[row+1:, column:]) #project each col onto difference vector
        B[row+1:, column:] = B[row+1:, column:] - (2 * column_projections)
        row_projections = np.outer(v, B[row:, column + 1:] @ v).T #project each row onto difference vector
        B[row:, column + 1:] = B[row:, column + 1:] - (2 * row_projections)
    return V, B
# Algorithm succeeds only with 3x3 matrices
eigvectors = np.array([
[1,3,2],
[4,5,6],
[7,8,9],
])
eigvalues = np.array([
[4,0,0],
[0,3,0],
[0,0,2]
])
M = eigvectors @ eigvalues @ np.linalg.inv(eigvectors)
print("The expected eigvals :", np.linalg.eigvals(M))
V,B = upper_hessenberg(M)
print("For 3x3 matrices, The function successfully produces these eigvals",np.linalg.eigvals(B))
#But with 4x4 matrices it fails
eigvectors = np.array([
[1,3,2,4],
[4,5,6,2],
[7,8,9,5],
[5,2,7,8]
])
eigvalues = np.array([
[4,0,0,0],
[0,3,0,0],
[0,0,2,0],
[0,0,0,1]
])
M = eigvectors @ eigvalues @ np.linalg.inv(eigvectors)
print("The expected eigvals :", np.linalg.eigvals(M))
V,B = upper_hessenberg(M)
print("For 4x4 matrices, The function fails to obtain correct eigvals",np.linalg.eigvals(B))

Your error is that you try to be too efficient. While the last rows are indeed increasingly reduced with leading zeros, this is not the case for the last columns. So in row_projections you need to remove the limiter row: and use B[:, column + 1:] instead, both where the projection is computed and where it is subtracted.
You are using the unstable variant of the "improved" Householder reflector. The older version would use the larger of x_refl - x and x_refl + x by setting reflection_of_x[0] = -np.sign(x[0])*norm(x) (or remove all minus signs there).
The stable variant of the improved reflector would use the binomial trick in the normalization of x_refl - x if this difference becomes too small.
x_refl - x = [ norm(x) - x[0], -x[1:] ]
           = [ norm(x[1:])^2 / (norm(x) + x[0]), -x[1:] ]

(x_refl - x) / norm(x_refl - x)

      [ norm(x[1:]), -(norm(x) + x[0]) * (x[1:] / norm(x[1:])) ]
    = ----------------------------------------------------------
                sqrt(2 * norm(x) * (norm(x) + x[0]))
While the parts may have wildly different scales, no catastrophic cancellation happens for x[0]>0.
See the discussion of the same algorithm from Golub/van Loan, 4th ed., for further details and opinions, and the code from that book.
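Putting the first two points together, a corrected routine might look like the sketch below. It is neither the book's Program 12.8 nor the asker's exact code: the name upper_hessenberg_fixed is made up here, the reflectors are not stored, and the sign choice -sign(x[0])*norm(x) from the second point is assumed.

```python
import numpy as np

def upper_hessenberg_fixed(A):
    """Return a matrix similar to A in upper Hessenberg form (Householder reflectors)."""
    B = np.array(A, dtype=float)
    n = B.shape[0]
    for k in range(n - 2):
        x = B[k + 1:, k]
        norm_x = np.linalg.norm(x)
        if norm_x <= np.finfo(float).eps:
            continue  # column is already zero below the diagonal
        # w = x - x_refl with x_refl = -sign(x[0]) * norm(x) * e1, so the
        # first component is a sum of same-signed terms (no cancellation).
        w = x.copy()
        w[0] += np.copysign(norm_x, x[0])
        v = w / np.linalg.norm(w)
        # Left multiplication by H = I - 2 v v^T: only rows k+1.. change.
        B[k + 1:, k:] -= 2.0 * np.outer(v, v @ B[k + 1:, k:])
        # Right multiplication by H: ALL rows change, columns k+1.. only.
        B[:, k + 1:] -= 2.0 * np.outer(B[:, k + 1:] @ v, v)
    return B

P = np.array([[1, 3, 2, 4], [4, 5, 6, 2], [7, 8, 9, 5], [5, 2, 7, 8]], dtype=float)
M = P @ np.diag([4.0, 3.0, 2.0, 1.0]) @ np.linalg.inv(P)
H = upper_hessenberg_fixed(M)
print(np.sort(np.linalg.eigvals(H).real))   # should be close to 1, 2, 3, 4
print(np.round(np.tril(H, -2), 12))         # entries below the subdiagonal, close to 0
```

The only structural difference from the question's loop is that the right-hand update uses every row of B; the sign change affects numerical stability, not the eigenvalues.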

integer and floating result from multiplication in python [duplicate]

In the same function, I have tried to use integer, float, and rounding, but I could not get this result. What did I do wrong?
The goal is:
10*12.3 = 123
3*12.3= 36.9
my code:
def multi(n1, n2):
    x = n1*n2
    return x
I have tried int(n1*n2), but I got 123 and 36. Then I tried float(n1*n2) and I got 123.0 and 36.9. What did I do wrong and how can I fix it?

You are always multiplying an integer by a float, which always produces a float.
If you want the number that your function returns to be a float with one decimal place, you can use round(num, 1).
def multi(n1, n2):
    x = n1*n2
    return round(x, 1)
print(multi(10, 12.3)) # outputs '123.0'
print(multi(3, 12.3)) # outputs '36.9'
To drop the .0 you could use an if statement, although I don't see much use for it, since calculations with floats give the same results as with integers (when the fractional part is .0)
def multi(n1, n2):
    x = n1 * n2
    return round(x, 1)
output = []
output.append(multi(10, 12.3)) # 123.0
output.append(multi(3, 12.3)) # 36.9
for index, item in enumerate(output):
    if int(item) == float(item):
        output[index] = int(item)
print(output) # prints [123, 36.9]
This should help you, but it shouldn't matter all that much to you

The number is not the representation of the number. For example, all these representations are 123:
123
12.3E1
123.0
123.0000000000000000000
My advice is to do them as floating point and either use output formatting to get them all in a consistent format:
>>> for i in (12.3 * 10, 42., 36.9 / 10):
...     print(f"{i:8.2f}")
...
  123.00
   42.00
    3.69
or string manipulation to remove useless suffixes:
>>> import re
>>> x = 12.3 * 10
>>> print(x)
123.0
>>> print(re.sub(r"\.0*$", "", str(x)))
123
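The same "format at output time" idea can be wrapped in a small helper that avoids regular expressions. This is a sketch; format_number is a hypothetical name, and it assumes the value has already been rounded to the precision you want.

```python
def format_number(x):
    """Return '123' for 123.0 and '36.9' for 36.9."""
    x = float(x)
    # float.is_integer() is True when there is no fractional part.
    return str(int(x)) if x.is_integer() else str(x)

print(format_number(10 * 12.3))            # 123
print(format_number(round(3 * 12.3, 1)))   # 36.9
```

The value itself stays a float throughout; only the text shown to the user changes.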

How to return floating values using floor division

In Python 3, I want to return the units place of an integer value, then the tens, then the hundreds and so on. Suppose I have the integer 456: first I want to return 6, then 5, then 4. Is there any way to do this? I tried floor division and a for loop, but it didn't work.

If you look at the list of basic operators from the documentation:

| Operator | Description | Example |
| --- | --- | --- |
| % | Modulus: divides the left-hand operand by the right-hand operand and returns the remainder | b % a = 1 |
| // | Floor division: the division of operands where the result is the quotient in which the digits after the decimal point are removed. But if one of the operands is negative, the result is floored, i.e., rounded away from zero (towards negative infinity) | 9//2 = 4 and 9.0//2.0 = 4.0, -11//3 = -4, -11.0//3 = -4.0 |

With that knowledge, you can get what you want as follows:
In [1]: a = 456
In [2]: a % 10
Out[2]: 6
In [3]: (a % 100) // 10
Out[3]: 5
In [4]: a // 100
Out[4]: 4
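The three expressions above follow one pattern, which can be written for any place value. A minimal sketch, assuming a non-negative integer; digit_at is a name chosen here for illustration:

```python
def digit_at(n, place):
    """Return the digit of a non-negative integer n at the given place (0 = units)."""
    # Floor-divide to shift the wanted digit into the units place, then take it with % 10.
    return (n // 10**place) % 10

print([digit_at(456, k) for k in range(3)])   # [6, 5, 4]
```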

Write a generator if you want to retrieve the digits at different places in your code as needed.
If you are not very familiar with Python's generators, have a quick look at https://www.programiz.com/python-programming/generator.
» Here get_digits() is a generator.
def get_digits(n):
    # Yield the digits of n from the units place upwards.
    while True:
        yield n % 10
        n = n // 10
        if not n:
            break
digit = get_digits(1729)
print(next(digit)) # 9
print(next(digit)) # 2
print(next(digit)) # 7
print(next(digit)) # 1
» If you wish to iterate over digits, you can also do so as follows.
for digit in get_digits(74831965):
    print(digit)
# 5
# 6
# 9
# 1
# 3
# 8
# 4
# 7
» Quick overview about its usage (On Python3's Interactive terminal).
>>> def letter(name):
...     for ch in name:
...         yield ch
...
>>>
>>> char = letter("RISHIKESH")
>>>
>>> next(char)
'R'
>>>
>>> "Second letter in my name is: " + next(char)
'Second letter in my name is: I'
>>>
>>> "3rd one: " + next(char)
'3rd one: S'
>>>
>>> next(char)
'H'
>>>

list slices equal to the last digit

I am working on a school project in which I need to split a string into slices whose length equals the last digit of the string. Each slice must have exactly that length; if a slice is shorter, I need to pad it with trailing zeroes.
Example: "132567093" should be ['132', '567', '090']
When I try, I get ['132', '567', '09']
This is the code I have so far:
n = input()
num = int(n)
lastdigit = num % 10
s = n[:-1]
print([s[idx:idx+lastdigit] for idx,val in enumerate(s) if idx%lastdigit == 0])

Another way of doing this is to use the zip clustering idiom with a fill value
from itertools import zip_longest as zipl
*digits, length = '132567093'
length = int(length)
print([''.join(t) for t in zipl(*[iter(digits)]*length, fillvalue='0')])
# ['132', '567', '090']
# To sum them together
print(sum(int(''.join(t)) for t in zipl(*[iter(digits)]*length, fillvalue='0')))
# 789
You can also use string formatting:
s='132567093'
length = int(s[-1])
digits = s[:-1]
format_str = "{{:0<{}}}".format(length) # {:0<3}
print([format_str.format(digits[i:i+length]) for i in range(0, len(digits), length)])
# ['132', '567', '090']
print(sum(int(format_str.format(digits[i:i+length])) for i in range(0, len(digits), length)))
# 789

To zero-pad on the left, there's a str.zfill method.
By reversing the string, applying zero padding and reversing again, you can achieve this
print([s[idx:idx+lastdigit][::-1].zfill(lastdigit)[::-1] for idx,val in enumerate(s) if idx%lastdigit== 0])
outputs:
['132', '567', '090']
Without zfill, you can compute the number of characters to add. The formula is ugly, but it avoids creating 3 intermediate strings per slice:
[s[idx:idx+lastdigit]+"0"*(max(0,lastdigit-len(s)+idx)) for idx,val in enumerate(s) if idx%lastdigit== 0]
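Right-padding can also be done directly with str.ljust, which takes a width and a fill character. The sketch below is an alternative to the formulas above, not a restatement of them; chunks_by_last_digit is a hypothetical name, and a non-zero last digit is assumed (a step of 0 would make range() fail).

```python
def chunks_by_last_digit(n):
    """Split n[:-1] into pieces of length int(n[-1]), right-padding the last with '0'."""
    size = int(n[-1])
    body = n[:-1]
    # Step through the body in strides of `size`; ljust pads only a short final piece.
    return [body[i:i + size].ljust(size, "0") for i in range(0, len(body), size)]

print(chunks_by_last_digit("132567093"))   # ['132', '567', '090']
```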

Optimising a fibonacci sequence generator python

I am trying to create a program which creates a Fibonacci sequence up to the value of the sequence being 200. I have the basic set up down where I can compute the sequence but I wish to display it in a certain way and I have forgotten how to achieve this.
I wish to write the numbers to an array which I have defined as empty initially, compute the numbers and assign them to the array and print said array. In my code below the computation is ok but when printed to screen, the array shows the value 233 which is above 200 and not what I'm looking for. I wish to print all the values under 200 which I've stored in an array.
Is there a better way to initially define the array for what I want and what is the correct way to print the array at the end with all elements below 200?
Code follows:
#This program calculates the fibonacci sequence up to the value of 200
import numpy as np
x = np.empty(14, float) #Ideally creates an empty array to deposit the fibonacci numbers in
f = 0.0 #Dummy variable to be edited in the while loop
#Here the first two values of the sequence are defined alongside a counter starting at i = 1
x[0] = 0.0
x[1] = 1.0
i = 1
#While loop which computes the values and writes them to the array x
while f <= 200:
    f = x[i]+x[i-1] #calculates the sequence element
    i += 1 #Increases the iteration counter by 1 for each loop
    x[i] = f #set the array element equal to the calculated sequence number
print(x)
For reference here is a quick terminal output, Ideally I wish to remove the last element:
[ 0. 1. 1. 2. 3. 5. 8. 13. 21. 34. 55. 89.
144. 233.]

There are a number of stylistic points here. Firstly, you should probably use integers, rather than floats. Secondly, you should simply append each number to a list, rather than pre-define an array of a particular size.
Here's an interactive session:
>>> a = [0, 1]
>>> while True:
...     b = a[-1] + a[-2]
...     if b <= 200:
...         a.append(b)
...     else:
...         break
...
>>> a
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144]

Here is a way without using indices:
a = 0
x = [a]
b = 1
while b <= 200:
    x.append(b)
    a, b = b, a+b
print(x)
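The limit check can also be separated from the sequence itself by combining a generator with itertools.takewhile. This is a sketch of that variant; the names fibonacci and below are chosen here for illustration.

```python
from itertools import takewhile

def fibonacci():
    """Yield the Fibonacci numbers 0, 1, 1, 2, ... without end."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

# takewhile stops at the first value that fails the test, so 233 is never stored.
below = list(takewhile(lambda v: v <= 200, fibonacci()))
print(below)   # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144]
```

If a NumPy array is needed afterwards, np.array(below) converts the list once the length is known, so no size has to be guessed in advance.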
