Feature engineering for a string

Suppose we have a data set consisting of 6-character strings (all lowercase letters), e.g. "olmido", and a corresponding binary value.
For example, "olmido" has a value of 1 and "lgoead" has a value of 0. For new 6-character strings (all lowercase letters) I want to predict which value they have (i.e. 1 or 0).
My question now is: what would be a good method to convert the strings into numeric features so that you can train machine learning models on them? So far I have simply split the strings into letters and converted them into numbers, so I have 6 features. But with this variant I still don't get satisfying results from my machine learning model.
With my variant the order of the letters does not matter (so "olmido", for example, is treated the same as "loimod"), but the order of the letters should play a big role. How can I best take this into account?

Your question looks to me like it can be solved with character n-grams. You said you have only 6 features, because you only consider character uni-grams. Since you said the order of the characters plays an important role in your classifier, you should use character bi-grams or even tri-grams as features.
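For instance, a minimal sketch of building such character n-gram features with scikit-learn's CountVectorizer, using the two example strings from the question (the classifier choice is just illustrative):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

words = ["olmido", "lgoead"]   # the two example strings from the question
labels = [1, 0]                # their binary values

# analyzer='char' builds n-grams over characters, so letter order is captured
vectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 3))
X = vectorizer.fit_transform(words)

clf = LogisticRegression().fit(X, labels)
print(clf.predict(vectorizer.transform(["loimod"])))   # a new 6-letter string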

I'm not exactly sure of the use case here, but I assume you would want to predict based on sub-sequences of alphabets.
If it's a full-string match and you do not have memory constraints, using a dictionary should suffice. If it's a partial string match, have a look at the Aho-Corasick methodology, which lets you do substring matches.
A more probabilistic approach is to use a sequence learning algorithm such as a Conditional Random Field (CRF). Looking at this as a sequence learning problem, the snippet below learns the left-side alphabet features and right-side alphabet features per alphabet in a word. I have added a DEPENDENCY_CHAIN_LENGTH parameter that can be used to control how many dependencies you want to learn per alphabet. So if you want the model to learn only the immediate left and immediate right alphabet dependencies, you can set this to 1. I have set this to 3 for the snippet below.
During prediction, a label is predicted for each (encoded) alphabet (and its dependencies to left and right). I have averaged the prediction for each alphabet and aggregated it into a single output per word.
Please do a pip install sklearn_crfsuite to install crfsuite if not already installed.
import sklearn_crfsuite
import statistics

DEPENDENCY_CHAIN_LENGTH = 3

def translate_to_features(word, i):
    alphabet = word[i]
    features = {
        'bias': 1.0,
        'alphabet.lower()': alphabet.lower(),
        'alphabet.isupper()': alphabet.isupper(),
        'alphabet.isdigit()': alphabet.isdigit(),
    }
    j = 1
    # Builds dependency towards the left side characters up to
    # DEPENDENCY_CHAIN_LENGTH characters
    while i - j >= 0 and j <= DEPENDENCY_CHAIN_LENGTH:
        position = (i - j)
        alphabet1 = word[position]
        features.update({
            '-' + str(position) + ':alphabet.lower()': alphabet1.lower(),
            '-' + str(position) + ':alphabet.isupper()': alphabet1.isupper(),
            '-' + str(position) + ':alphabet.isdigit()': alphabet1.isdigit(),
        })
        j = j + 1
    else:
        features['BOW'] = True
    j = 1
    # Builds dependency towards the right side characters up to
    # DEPENDENCY_CHAIN_LENGTH characters
    while i + j < len(word) and j <= DEPENDENCY_CHAIN_LENGTH:
        position = (i + j)
        alphabet1 = word[position]
        features.update({
            '+' + str(position) + ':alphabet.lower()': alphabet1.lower(),
            '+' + str(position) + ':alphabet.isupper()': alphabet1.isupper(),
            '+' + str(position) + ':alphabet.isdigit()': alphabet1.isdigit(),
        })
        j = j + 1
    else:
        features['EOW'] = True
    return features
raw_training_data = {"Titles": "1",
                     "itTels": "0",
                     }
print("Learning dataset with labels : {}".format(raw_training_data))
raw_testing_data = ["titles", "ittsle"]
X_train = []
Y_train = []
print("Feature encoding in progress ... ")
# Prepare encoded features from words
for word in raw_training_data.keys():
    word_tr = []
    word_lr = []
    word_length = len(word)
    if word_length < DEPENDENCY_CHAIN_LENGTH:
        raise Exception("Dependency chain cannot have length greater than a word")
    for i in range(0, len(word)):
        word_tr.append(translate_to_features(word, i))
        word_lr.append(raw_training_data[word])
    X_train.append(word_tr)
    Y_train.append(word_lr)
print("Feature encoding completed")
# Training snippet
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=1,
    all_possible_transitions=True
)
print("Training in progress")
crf.fit(X_train, Y_train)
print("Training completed")
print("Beginning predictions")
# Prediction snippet
for word in raw_testing_data:
    # Encode into features
    word_enc = []
    for i in range(0, len(word)):
        word_enc.append(translate_to_features(word, i))
    # Predict using the encoded features
    pred_values = crf.predict_marginals_single(word_enc)
    # Aggregate scores across spans per label
    label_scores = {}
    for span_prediction in pred_values:
        for label in span_prediction.keys():
            if label in label_scores:
                label_scores[label].append(span_prediction[label])
            else:
                label_scores[label] = [span_prediction[label]]
    # Print aggregated score
    print("Predicted label for the word '{}' is :".format(word))
    for label in label_scores:
        print("\tLabel {} Score {}".format(label, statistics.mean(label_scores[label])))
print("Predictions completed")
This produces the output:
Learning dataset with labels : {'Titles': '1', 'itTels': '0'}
Feature encoding in progress ...
Feature encoding completed
Training in progress
Training completed
Beginning predictions
Predicted label for the word 'titles' is :
Label 1 Score 0.6821365857513837
Label 0 Score 0.3178634142486163
Predicted label for the word 'ittsle' is :
Label 1 Score 0.36701890171374996
Label 0 Score 0.6329810982862499
Predictions completed

Related

word2vec cosine similarity greater than 1 arabic text

I have trained my word2vec model from gensim and I am getting the nearest neighbors for some words in the corpus. Here are the similarity scores:
top neighbors for الاحتلال:
الاحتلال: 1.0000001192092896
الاختلال: 0.9541053175926208
الاهتلال: 0.872565507888794
الاحثلال: 0.8386293649673462
الاكتلال: 0.8209128379821777
It is odd to get a similarity greater than 1. I cannot apply any stemming to my text because the text includes many OCR spelling mistakes (I got the text from OCR-ed documents). How can I fix the issue?
Note: I am using model.similarity(t1, t2)
This is how I trained my Word2Vec Model:
documents = list()
tokenize = lambda x: gensim.utils.simple_preprocess(x)
t1 = time.time()
docs = read_files(TEXT_DIRS, nb_docs=5000)
t2 = time.time()
print('Reading docs took: {:.3f} mins'.format((t2 - t1) / 60))
print('Number of documents: %i' % len(docs))
# Training the model
model = gensim.models.Word2Vec(docs, size=EMBEDDING_SIZE, min_count=5)
if not os.path.exists(MODEL_DIR):
    os.makedirs(MODEL_DIR)
model.save(os.path.join(MODEL_DIR, 'word2vec'))
weights = model.wv.vectors
index_words = model.wv.index2word
vocab_size = weights.shape[0]
embedding_dim = weights.shape[1]
print('Shape of weights:', weights.shape)
print('Vocabulary size: %i' % vocab_size)
print('Embedding size: %i' % embedding_dim)
Below is the read_files function I defined:
def read_files(text_directories, nb_docs):
    """
    Read in text files
    """
    documents = list()
    tokenize = lambda x: gensim.utils.simple_preprocess(x)
    print('started reading ...')
    for path in text_directories:
        count = 0
        # Read in all files in directory
        if os.path.isdir(path):
            all_files = os.listdir(path)
            for filename in all_files:
                if filename.endswith('.txt') and filename[0].isdigit():
                    count += 1
                    with open('%s/%s' % (path, filename), encoding='utf-8') as f:
                        doc = f.read()
                        doc = clean_text_arabic_style(doc)
                        doc = clean_doc(doc)
                        documents.append(tokenize(doc))
                    if count % 100 == 0:
                        print('processed {} files so far from {}'.format(count, path))
                if count >= nb_docs and count <= nb_docs + 200:
                    print('REACHED END')
                    break
        if count >= nb_docs and count <= nb_docs:
            print('REACHED END')
            break
    return documents
I tried this thread, but it doesn't help me because my text is Arabic and contains many misspellings.
Update
I tried the following: (getting the similarity between the exact same word)
print(model.similarity('الاحتلال','الاحتلال'))
and it gave me the following result:
1.0000001
Definitionally, the cosine-similarity measure should max at 1.0.
But in practice, floating-point number representations in computers have tiny imprecisions in the deep-decimals. And, especially when a number of calculations happen in a row (as with the calculation of this cosine-distance), those will sometimes lead to slight deviations from what the expected maximum or exactly-right answer "should" be.
(Similarly: sometimes calculations that, mathematically, should result in the exact same answer no matter how they are reordered/regrouped deviate slightly when done in different orders.)
But, as these representational errors are typically "very small", they're usually not of practical concern. (They are especially small in the range of numbers around -1.0 to 1.0, but can become quite large when dealing with giant numbers.)
In your original case, the deviation is just 0.000000119209289. In the word-to-itself case, the deviation is just 0.0000001. That is, about one-ten-millionth off. (Your other sub-1.0 values have similar tiny deviations from perfect calculation, but they aren't noticeable.)
In most cases, you should just ignore it.
If you find it distracting to you or your users in numerical displays/logging, simply choosing to display all such values to a limited number of after-the-decimal-point digits – say 4 or even 5 or 6 – will hide those noisy digits. For example, using a Python 3 format-string:
sim = model.similarity('الاحتلال','الاحتلال')
print(f"{sim:.6}")
(Libraries like numpy that work with large arrays of such floats can even set a global default for display precision, see numpy.set_printoptions, though that shouldn't affect the raw Python floats you're examining.)
If for some reason you absolutely need the values to be capped at 1.0, you could add extra code to do that. But, it's usually a better idea to choose your tests & printouts to be robust to, & oblivious with regard to, such tiny deviations from perfect math.
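For example, a one-line way to cap a similarity at 1.0, if you really need it (sim as in the snippet above):

sim = min(model.similarity('الاحتلال', 'الاحتلال'), 1.0)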

Fuzzy matching string in python pyspark for dataframe

I am doing fuzzy similarity matching between all rows in the 'name' column using Python PySpark in a Jupyter notebook. The expected output is a new column with the similar string together with the score for each string. My question is quite similar to this question; it's just that that question is in R and uses 2 datasets (mine has only 1). As I'm quite new to Python, I'm quite confused about how to do it.
I have also used a simple piece of code with a similar function, however I'm not so sure how to run it on the dataframe.
Here is the code:
import numpy as np

def levenshtein_ratio_and_distance(s, t, ratio_calc=False):
    """ levenshtein_ratio_and_distance:
        Calculates levenshtein distance between two strings.
        If ratio_calc = True, the function computes the
        levenshtein distance ratio of similarity between two strings
        For all i and j, distance[i,j] will contain the Levenshtein
        distance between the first i characters of s and the
        first j characters of t
    """
    # Initialize matrix of zeros
    rows = len(s)+1
    cols = len(t)+1
    distance = np.zeros((rows,cols),dtype = int)
    # Populate matrix of zeros with the indices of each character of both strings
    for i in range(1, rows):
        for k in range(1,cols):
            distance[i][0] = i
            distance[0][k] = k
    # Iterate over the matrix to compute the cost of deletions, insertions and/or substitutions
    for col in range(1, cols):
        for row in range(1, rows):
            if s[row-1] == t[col-1]:
                cost = 0  # If the characters are the same in the two strings in a given position [i,j] then the cost is 0
            else:
                # In order to align the results with those of the Python Levenshtein package, if we choose to calculate the ratio
                # the cost of a substitution is 2. If we calculate just distance, then the cost of a substitution is 1.
                if ratio_calc == True:
                    cost = 2
                else:
                    cost = 1
            distance[row][col] = min(distance[row-1][col] + 1,      # Cost of deletions
                                     distance[row][col-1] + 1,      # Cost of insertions
                                     distance[row-1][col-1] + cost) # Cost of substitutions
    if ratio_calc == True:
        # Computation of the Levenshtein Distance Ratio
        Ratio = ((len(s)+len(t)) - distance[row][col]) / (len(s)+len(t))
        return Ratio
    else:
        # print(distance)  # Uncomment if you want to see the matrix showing how the algorithm computes the cost of deletions,
        # insertions and/or substitutions
        # This is the minimum number of edits needed to convert string a to string b
        return "The strings are {} edits away".format(distance[row][col])

# Example for simple strings
Str1 = "Apple Inc."
Str2 = "Jo Inc"
Distance = levenshtein_ratio_and_distance(Str1, Str2)
print(Distance)
Ratio = levenshtein_ratio_and_distance(Str1, Str2, ratio_calc=True)
print(Ratio)
However, the code above is only applicable to strings. What if I want to run it on the dataframe as the input instead of a string? For example, the input data is (saying that the dataset name is customer):
name
1 Ace Co
2 Ace Co.
11 Baes
4 Bayes Inc.
8 Bayes
12 Bays
10 Bcy
15 asd
13 asd
The expected outcome is:
name b_name dist
Ace Co Ace Co. 0.64762
Baes Bayes Inc., Bayes,Bays, Bcy 0.80000,0.86667,0.70000,0.97778
asd asdf 0.08333
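A rough sketch (not from the original thread) of applying the ratio function above to a single-column PySpark DataFrame is a self cross join plus a UDF; the sample rows follow the question's example, while the session setup and column aliases are assumptions:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
customer = spark.createDataFrame([("Ace Co",), ("Ace Co.",), ("Baes",), ("Bayes Inc.",)], ["name"])

# wrap the pure-Python ratio function above in a UDF
ratio_udf = F.udf(lambda a, b: float(levenshtein_ratio_and_distance(a, b, ratio_calc=True)),
                  DoubleType())

# compare every name against every other name and keep the similarity score
pairs = (customer.alias("a")
         .crossJoin(customer.alias("b"))
         .filter(F.col("a.name") != F.col("b.name"))
         .select(F.col("a.name").alias("name"),
                 F.col("b.name").alias("b_name"),
                 ratio_udf(F.col("a.name"), F.col("b.name")).alias("dist")))
pairs.orderBy(F.desc("dist")).show()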

LSTM getting caught up in loop

I recently implemented a name-generating RNN "from scratch" which was doing OK but far from perfect. So I thought about trying my luck with PyTorch's LSTM class to see if it makes a difference. Indeed it does, and the output looks way better for the first 7-8 characters. But then the network gets caught in a loop and outputs things like "laulaulaulau" or "rourourourou" (it is supposed to generate French names).
Is this an often-occurring problem? If so, do you know a way to fix it? I'm concerned about the fact that the network doesn't produce EOS tokens...
This is an issue which has already been asked here: Why does my keras LSTM model get stuck in an infinite loop?
but not really answered, hence my post.
Here is the model:
class pytorchLSTM(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(pytorchLSTM, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(input_size, hidden_size)
        self.output_layer = nn.Linear(hidden_size, input_size)
        self.tanh = nn.Tanh()
        self.softmax = nn.LogSoftmax(dim=2)

    def forward(self, input, hidden):
        out, hidden = self.lstm(input, hidden)
        out = self.tanh(out)
        out = self.output_layer(out)
        out = self.softmax(out)
        return out, hidden
The input and target are two sequences of one-hot encoded vectors, respectively, with a start-of-sequence and end-of-sequence vector at the start and the end. They represent the characters inside of a name taken from the name list (database).
I use a <SOS> and <EOS> token on each name from the database. Here are the functions I use:
def inputTensor(line):
    # tensor starts with <start of sequence> token.
    tensor = torch.zeros(len(line)+1, 1, n_letters)
    tensor[0][0][n_letters - 2] = 1
    for li in range(len(line)):
        letter = line[li]
        tensor[li+1][0][all_letters.find(letter)] = 1
    return tensor

# LongTensor of second letter to end (EOS) for target
def targetTensor(line):
    letter_indexes = [all_letters.find(line[li]) for li in range(len(line))]
    letter_indexes.append(n_letters - 1)  # EOS
    return torch.LongTensor(letter_indexes)
The training loop:
def train_lstm(model):
    start = time.time()
    criterion = nn.NLLLoss()
    optimizer = torch.optim.Adam(model.parameters())
    n_iters = 20000
    print_every = 1000
    plot_every = 500
    all_losses = []
    total_loss = 0
    for iter in range(1, n_iters+1):
        line = randomChoice(category_line)
        input_line_tensor = inputTensor(line)
        target_line_tensor = targetTensor(line).unsqueeze(-1)
        optimizer.zero_grad()
        loss = 0
        output, hidden = model(input_line_tensor)
        for i in range(input_line_tensor.size(0)):
            l = criterion(output[i], target_line_tensor[i])
            loss += l
        loss.backward()
        optimizer.step()
The sampling function:
def sample():
    max_length = 20
    input = torch.zeros(1, 1, n_letters)
    input[0][0][n_letters - 2] = 1
    output_name = ""
    hidden = (torch.zeros(2, 1, lstm.hidden_size), torch.zeros(2, 1, lstm.hidden_size))
    for i in range(max_length):
        output, hidden = lstm(input)
        output = output[-1][:][:]
        l = torch.multinomial(torch.exp(output[0]), num_samples=1).item()
        if l == n_letters - 1:
            break
        else:
            letter = all_letters[l]
            output_name += letter
            input = inputTensor(letter)
    return output_name
The typical sampled output looks something like that :
Laurayeerauerararauo
Leayealouododauodouo
Courouauurourourodau
Do you know how I can improve that ?
I found the explanation:
When using instances of the LSTM class as part of an RNN, the default input dimensions are (seq_length, batch_dim, input_size). To be able to interpret the output of the LSTM as a probability (over the set of inputs) I needed to pass it to a Linear layer before the Softmax call, which is where the problem happens: Linear instances expect the input to be in the format (batch_dim, seq_length, input_size).
To fix this, one needs to pass batch_first=True as an argument to the LSTM upon creation, and then feed the RNN with an input of the form (batch_dim, seq_length, input_size).
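A minimal sketch of that fix (the sizes here are arbitrary placeholders, not from the original model):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
linear = nn.Linear(20, 10)

x = torch.randn(1, 6, 10)          # (batch_dim, seq_length, input_size)
out, hidden = lstm(x)              # out: (batch_dim, seq_length, hidden_size)
logits = linear(out)               # Linear applied to the batch-first output
log_probs = nn.LogSoftmax(dim=2)(logits)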
Some tips to improve the network in the order of importance (and ease of implementing):
1. Training data
If you want your generated samples to look real, you have to give some real data to the network. Find a set of names, split those into letters and transform into indices. This step alone would give way more realistic names.
2. Separate start and end tokens.
I would go with <SON> (Start Of Name) and <EON> (End Of Name). In this configuration the neural network can learn combinations of letters leading to <EON> and combinations of letters coming after <SON>. At the moment it's trying to fit two different concepts into this one custom token.
3. Unsupervised Pretraining
You may want to give your letters some semantic meaning instead of one-hot encoded vectors; check word2vec for a basic approach.
Basically, each letter would be represented by an N-dimensional vector (say 50 dimensions) and would be closer in space if the letter occurs more often next to another letter (a closer to k than x).
A simple way to implement that would be to take some text dataset and try to predict the next letter at each timestep. Each letter would be represented by a random vector at the beginning; through backpropagation the letter representations would be updated to reflect their similarity.
Check pytorch embedding tutorial for more info.
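A minimal sketch of a learned letter embedding in PyTorch (n_letters and all_letters as defined in the question; 50 is just the example dimension mentioned above):

import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=n_letters, embedding_dim=50)

# feed letter indices instead of one-hot vectors:
# shape (seq_length, 1) of indices -> (seq_length, 1, 50) of vectors
indices = torch.tensor([[all_letters.find(c)] for c in "laura"])
vectors = embedding(indices)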
4. Different architecture
You may want to check Andrej Karpathy's idea for generating baby names. It is simply described here.
Essentially, after training, you feed your model with random letters (say 10) and tell it to predict the next letter.
You remove the last letter from the random seed and put the predicted one in its place. Iterate until <EON> is output.

updating centroids in k-means Python

I'm implementing the K-means algorithm in Python and I got stuck in the part in which we are supposed to update the centroids. I have created something that works, but it's really not Pythonic. I know it can be written better and would love some suggestions, for example how to improve the histogram that counts how many points are assigned to each centroid.
Here is my code:
def updateCentroids(centroids, pixelList):
    k = len(centroids)
    centoidsCount = [0]*k            # counts how many pixels were classified to each centroid
    centroidsSum = np.zeros([k, 3])  # sum value of centroids
    for pixel in pixelList:
        index = 0
        # find which centroid the pixel equals
        for centroid in centroids:
            if np.array_equal(pixel.classification, centroid):
                centoidsCount[index] += 1
                centroidsSum[index] += pixel.point
                break
            index += 1
    index = 0
    for centroid in centroidsSum:
        centroids[index] = centroid/centoidsCount[index]
        index += 1
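Not from the original thread, but one possible more Pythonic sketch of the same update, assuming the same pixel.classification and pixel.point attributes:

import numpy as np

def update_centroids(centroids, pixel_list):
    k = len(centroids)
    counts = np.zeros(k)
    sums = np.zeros((k, 3))
    for pixel in pixel_list:
        # index of the centroid this pixel is currently assigned to
        idx = next(i for i, c in enumerate(centroids)
                   if np.array_equal(pixel.classification, c))
        counts[idx] += 1
        sums[idx] += pixel.point
    counts[counts == 0] = 1          # guard against empty clusters
    return sums / counts[:, None]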

Euler beam, solving differential equation in python

I must solve the Euler-Bernoulli differential beam equation, which is:
w''''(x) = q(x)
with boundary conditions:
w(0) = w(l) = 0
and
w''(0) = w''(l) = 0
The beam is as shown in the picture below:
[beam diagram]
The continuous force q is 2 N/mm.
I have to use the shooting method and the scipy.integrate.odeint() function.
I can't even manage to start, as I do not understand how to write the differential equation as a system of equations.
Can someone who understands solving differential equations with boundary conditions in Python please help!
Thanks :)
The shooting method
To solve the fourth order ODE BVP with scipy.integrate.odeint() using the shooting method you need to:
1.) Separate the 4th order ODE into 4 first order ODEs by substituting:
u0 = w
u1 = u0' = w'      # 1
u2 = u1' = w''     # 2
u3 = u2' = w'''    # 3
u3' = w'''' = q    # 4
2.) Create a function to carry out the derivation logic and connect that function to integrate.odeint() like this:
def calc(u, x, q):
    return [u[1], u[2], u[3], q]

w = integrate.odeint(calc, [w(0), guess, w''(0), guess], xList, args=(q,))
Explanation:
We are sending the boundary value conditions to odeint() for x=0 ([w(0), w'(0) ,w''(0), w'''(0)]) which calls the function calc which returns the derivatives to be added to the current state of w. Note that we are guessing the initial boundary conditions for w'(0) and w'''(0) while entering the known w(0)=0 and w''(0)=0.
Addition of derivatives to the current state of w occurs like this:
# the current w(x) value is the previous value plus the current change of w in dx.
w(x) = w(x-dx) + dw/dx
# others are calculated the same
dw(x)/dx = dw(x-dx)/dx + d^2w(x)/dx^2
# etc.
This is why we are returning values [u[1], u[2], u[3] , q] instead of [u[0], u[1], u[2] , u[3]] from the calc function, because u[1] is the first derivative so we add it to w, etc.
3.) Now we are able to set up our shooting method. We will be sending different initial boundary values for w'(0) and w'''(0) to odeint() and then check the end result of the returned w(x) profile to determine how close w(L) and w''(L) got to 0 (the known boundary conditions).
The program for the shooting method:
import numpy as np
from scipy.integrate import odeint

# a function to return the derivatives of w
def returnDerivatives(u, x, q):
    return [u[1], u[2], u[3], q]

# a shooting function which takes in two variables and returns a w(x) profile for x=[0,L]
def shoot(u2, u4):
    # the number of x points to calculate integration -> determines the size of dx
    # bigger number means more x's -> better precision -> longer execution time
    xSteps = 1001
    # length of the beam
    L = 1.0  # 1m
    xSpace = np.linspace(0, L, xSteps)
    q = 0.02  # constant [N/m]
    # integrate and return the profile of w(x) and its derivatives, from x=0 to x=L
    return odeint(returnDerivatives, [0, u2, 0, u4], xSpace, args=(q,))

# the tolerance for our results.
tolerance = 0.01
# how many numbers to consider for u2 and u4 (the guessed boundary conditions)
u2_u4_maxNumbers = 1327  # bigger number, better precision, slower program
# you can also divide into separate variables like u2_maxNum and u4_maxNum

# these are already tested numbers (the best results are somewhere in here)
u2Numbers = np.linspace(-0.1, 0.1, u2_u4_maxNumbers)
# the same as above
u4Numbers = np.linspace(-0.5, 0.5, u2_u4_maxNumbers)

# result list for extracted values of each w(x) profile => [u2Best, u4Best, w(L), w''(L)]
# which will help us determine if the w(x) profile is inside tolerance
resultList = []
# result list for each U (or w(x) profile) => [w(x), w'(x), w''(x), w'''(x)]
resultW = []

# start generating numbers for u2 and u4 and send them to odeint()
for u2 in u2Numbers:
    for u4 in u4Numbers:
        U = []
        U = shoot(u2, u4)
        # get only the last row of the profile to determine if it passes the tolerance check
        result = U[len(U)-1]
        # only check w(L) == 0 and w''(L) == 0, as those are the known boundary conditions
        if (abs(result[0]) < tolerance) and (abs(result[2]) < tolerance):
            # if the result passed the tolerance check, extract some values from the
            # last row of the w(x) profile which we will need later for comparisons
            resultList.append([u2, u4, result[0], result[2]])
            # add the w(x) profile to the list of profiles that passed the tolerance
            # Note: the order of resultList is the same as the order of resultW
            resultW.append(U)

# go through the resultList (list of extracted values from the last row of each w(x) profile)
for i in range(len(resultList)):
    x = resultList[i]
    # both boundary conditions are 0 for both w(L) and w''(L) so we will simply add
    # the two absolute values to determine how much the sum differs from 0
    y = abs(x[2]) + abs(x[3])
    # if we've just started, set the least difference to the current one
    if i == 0:
        minNum = y  # remember the smallest difference to 0
        index = 0   # remember index of best profile
    elif y < minNum:
        # current sum of absolute values is smaller
        minNum = y
        index = i

# print out the integral for w(x) over the beam
sum = 0
for i in resultW[index]:
    sum = sum + i[0]
print("The integral of w(x) over the beam is:")
print(sum/1001)  # sum/xSteps
This outputs:
The integral of w(x) over the beam is:
0.000135085272117
To print out the best profile for w(x) that we found:
print(resultW[index])
which outputs something like:
# w(x) w'(x) w''(x) w'''(x)
[[ 0.00000000e+00 7.54147813e-04 0.00000000e+00 -9.80392157e-03]
[ 7.54144825e-07 7.54142917e-04 -9.79392157e-06 -9.78392157e-03]
[ 1.50828005e-06 7.54128237e-04 -1.95678431e-05 -9.76392157e-03]
...,
[ -4.48774290e-05 -8.14851572e-04 1.75726275e-04 1.01560784e-02]
[ -4.56921910e-05 -8.14670764e-04 1.85892353e-04 1.01760784e-02]
[ -4.65067671e-05 -8.14479780e-04 1.96078431e-04 1.01960784e-02]]
To double check the results from above we will also solve the ODE using the numerical method.
The numerical method
To solve the problem using the numerical method we first need to solve the differential equations. We will get four constants which we need to find with the help of the boundary conditions. The boundary conditions will be used to form a system of equations to help find the necessary constants.
For example:
w’’’’(x) = q(x);
means that we have this:
d^4(w(x))/dx^4 = q(x)
Since q(x) is constant after integrating we have:
d^3(w(x))/dx^3 = q(x)*x + C
After integrating again:
d^2(w(x))/dx^2 = q(x)*0.5*x^2 + C*x + D
After another integration:
dw(x)/dx = q(x)/6*x^3 + C*0.5*x^2 + D*x + E
And finally the last integration yields:
w(x) = q(x)/24*x^4 + C/6*x^3 + D*0.5*x^2 + E*x + F
Then we take a look at the boundary conditions (now we have expressions from above for w''(x) and w(x)) with which we make a system of equations to solve the constants.
w''(0) => 0 = q(x)*0.5*0^2 + C*0 + D
w''(L) => 0 = q(x)*0.5*L^2 + C*L + D
This gives us the constants:
D = 0 # from the first equation
C = - 0.01 * L # from the second (after inserting D=0)
After repeating the same for w(0)=0 and w(L)=0 we obtain:
F = 0 # from first
E = 0.01/12.0 * L^3 # from second
Now, after we have solved the equation and found all of the integration constants we can make the program for the numerical method.
The program for the numerical method
We will make a FOR loop to go through the entire beam for every dx at a time and sum up (integrate) w(x).
L = 1.0 # in meters
step = 1001.0 # how many steps to take (dx)
q = 0.02 # constant [N/m]
integralOfW = 0.0  # instead of w(0) enter the boundary condition value for w(0)
result = []
for i in range(int(L*step)):
    x = i/step
    w = (q/24.0*pow(x,4) - 0.02/12.0*pow(x,3) + 0.01/12*pow(L,3)*x)/step  # current w fragment
    # add up fragments of w for integral calculation
    integralOfW += w
    # add current value of w(x) to result list for plotting
    result.append(w*step)
print("The integral of w(x) over the beam is:")
print(integralOfW)
which outputs:
The integral of w(x) over the beam is:
0.00016666652805511192
Now to compare the two methods
Result comparison between the shooting method and the numerical method
The integral of w(x) over the beam:
Shooting method -> 0.000135085272117
Numerical method -> 0.00016666652805511192
That's a pretty good match. Now let's check the plots (not reproduced here):
From the plots it's even more obvious that we have a good match and that the results of the shooting method are correct.
To get even better results for the shooting method increase xSteps and u2_u4_maxNumbers to bigger numbers and you can also narrow down the u2Numbers and u4Numbers to the same set size but a smaller interval (around the best results from previous program runs). Keep in mind that setting xSteps and u2_u4_maxNumbers too high will cause your program to run for a very long time.
You need to transform the ODE into a first order system. Setting u0 = w, one possible and commonly used system is
u0'=u1,
u1'=u2,
u2'=u3,
u3'=q(x)
This can be implemented as
def ODEfunc(u,x): return [ u[1], u[2], u[3], q(x) ]
Then make a function that shoots with experimental initial conditions and returns the components of the second boundary condition
def shoot(u01, u03): return odeint(ODEfunc, [0, u01, 0, u03], [0, l])[-1,[0,2]]
Now you have a function of two variables with two components and you need to solve this 2x2 system with the usual methods. As the system is linear, the shooting function is linear as well and you only need to find the coefficients and solve the resulting linear system.
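Because the system is linear, a minimal sketch of "finding the coefficients and solving the resulting linear system" could look like this (it reuses shoot from above and assumes q(x) and l are already defined):

import numpy as np

b0 = shoot(0.0, 0.0)                 # offset of the affine shooting map
b1 = shoot(1.0, 0.0) - b0            # response to a unit change in u01
b2 = shoot(0.0, 1.0) - b0            # response to a unit change in u03
A = np.column_stack([b1, b2])
u01, u03 = np.linalg.solve(A, -b0)   # initial slopes giving w(l) = w''(l) = 0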
