Multiply each row of an array with coefficients in list - Python - python-3.x

I am very new to Python and need help. This is the problem statement:
I want to calculate the value of each of the three houses by multiplying the rows of the array X (each row representing one house) with the coefficients in list c. So for the first house: Price = (66 x 3000) + (5 x 200) + (15 x -50) + (2 x 5000) + (500 x 100) = 258,250
Do not use numpy
Print the price of the three houses
This is what I have so far:
# input values for three houses:
# - size [m^2],
# - size of the sauna [m^2],
# - distance to water [m],
# - number of indoor bathrooms,
# - proximity of neighbors [m]
X = [[66, 5, 15, 2, 500],
     [21, 3, 50, 1, 100],
     [120, 15, 5, 2, 1200]]

# coefficient values
c = [3000, 200, -50, 5000, 100]

def predict(X, c):
    price = 0
    for i in range(len(X)):
        for j in range(len(X[i])):
            price += c[j] * X[i][j]
        print(price)

predict(X, c)
The output is
258250
334350
827100.
The program adds the value of the 2nd and 3rd house to the previous result, rather than printing each house's value on its own. How can I fix this?
Many thanks!

Move the line
price = 0
into the outer for loop:
def predict(X, c):
    for i in range(len(X)):
        price = 0
        for j in range(len(X[i])):
            ...
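For reference, the corrected function then looks like this (the print stays inside the outer loop, as in your code, so each house gets its own line):
def predict(X, c):
    for i in range(len(X)):
        price = 0                       # reset the running total for each house
        for j in range(len(X[i])):
            price += c[j] * X[i][j]     # feature value times its coefficient
        print(price)

predict(X, c)   # prints 258250, 76100, 492750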

Related

Python optimization of time-series data re-indexing based on multiple-parameter, multi-variable input and single-value output

I am trying to optimize a function that maximizes the correlation between two (pandas) time series (X and Y). This is done using three parameters (a, b, c) and a third time series (Z). The Z array is used to reindex the values in the X array (based on the parameters a, b, c) in such a way as to maximize the correlation of the reindexed X array (Xnew) with the Y array.
Below is some pseudo-code to demonstrate what I am trying to do. I have attempted this using LMfit and scipy.optimize, but I am not sure how to make this task work in those packages. For example, in LMfit, if I try to minimize the MyOpt function (which passes back a single value of the correlation metric), it complains that I have more parameters than outputs. However, if I pass back the time series of the correlation metric (diff), the parameter values remain fixed at their input values.
I know the reindexing function I am using works, because rather crude methods similar to the code below give significant changes in the mean (diff) metric passed back.
My knowledge of these optimization packages is not up to scratch for this job, so if anyone has a suggestion on how to tackle this, I would be grateful.
import numpy as np
import pandas as pd

def GetNewIndex(Z, a, b, c):
    old_index = np.arange(0, len(Z))
    index_adj = some_func(a, b, c)
    new_index = old_index + index_adj
    max_old = np.max(old_index)
    new_index[new_index > max_old] = max_old
    new_index[new_index < 0] = 0
    return new_index

def MyOpt(params, X, Y, Z):
    a = params['A']
    b = params['B']
    c = params['C']
    # estimate lag (in samples) based on ambient RH
    new_index = GetNewIndex(Z, a, b, c)
    # assign old values to new locations and convert back to a pandas Series
    Xnew = np.take(X.values, new_index)
    Xnew = pd.Series(Xnew, index=X.index)
    cc = Y.rolling(1201, center=True).corr(Xnew)
    cc = cc.interpolate(limit_direction='both', limit_area=None)
    diff = 1 - np.abs(cc)
    return np.mean(diff)
# ==================================================
X = some long pandas time series data
Y = some long pandas time series data
Z = some long pandas time series data

As = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
Bs = [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
Cs = [5, 6, 5, 6, 5, 6, 5, 6, 5, 6, 5, 6]

outs = []
for A, B, C in zip(As, Bs, Cs):
    params = {'A': A, 'B': B, 'C': C}
    out = MyOpt(params, X, Y, Z)
    outs.append(out)
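For concreteness, a scipy.optimize version of the attempt might look roughly like the sketch below. It assumes the scalar-returning MyOpt above is the objective to minimize, wraps it so it accepts a flat parameter vector, and uses a derivative-free method (Nelder-Mead) since the reindexing step is not smooth; the helper name objective and the starting guess x0 are illustrative, not part of the question.
from scipy.optimize import minimize

def objective(theta, X, Y, Z):
    # unpack the flat parameter vector into the dict MyOpt expects
    a, b, c = theta
    return MyOpt({'A': a, 'B': b, 'C': c}, X, Y, Z)

x0 = [1.0, 0.0, 5.0]   # illustrative starting guess for (a, b, c)
res = minimize(objective, x0, args=(X, Y, Z), method='Nelder-Mead')
print(res.x, res.fun)  # best (a, b, c) found and its mean diff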

Foobar Lucky Triple

I am trying to solve the following problem:
Write a function solution(l) that takes a list of positive integers l and counts the number of "lucky triples" of (li, lj, lk) where the list indices meet the requirement i < j < k. The length of l is between 2 and 2000 inclusive. A "lucky triple" is a tuple (x, y, z) where x divides y and y divides z, such as (1, 2, 4). The elements of l are between 1 and 999999 inclusive. The solution fits within a signed 32-bit integer. Some of the lists are purposely generated without any access codes to throw off spies, so if no triples are found, return 0.
For example, [1, 2, 3, 4, 5, 6] has the triples: [1, 2, 4], [1, 2, 6], [1, 3, 6], making the solution 3 total.
My solution only passes the first two tests; I am trying to understand what is wrong with my approach rather than get the actual solution. Below is my function for reference:
def my_solution(l):
    from itertools import combinations
    if 2 < len(l) <= 2000:
        l = list(combinations(l, 3))
        l = [value for value in l if value[1] % value[0] == 0 and value[2] % value[1] == 0]
        # l = [value for value in l if (value[1]/value[0]).is_integer() and (value[2]/value[1]).is_integer()]
        if len(l) < 0xffffffff:
            l = len(l)
            return l
    else:
        return 0
Do a nested iteration over the list: for each index, compare it against every later index and check whether the earlier value divides the later one. Each such pair can serve as the first-and-middle of a triple (counted for the later index) or as the middle-and-last (counted for the earlier index). All you need to do is track, for every index, how many pairs it participates in on each side; the number of triples with that element in the middle is the product of the two counts.
For example:
def my_solution(l):
    row1, row2 = [[0] * len(l) for i in range(2)]  # tracks which indices pass the modulus test
    for i1, first in enumerate(l):
        for i2 in range(i1 + 1, len(l)):  # iterate the remaining portion of the list
            middle = l[i2]
            if not middle % first:  # check for matches
                row1[i2] += 1  # increment the index in the tracker lists..
                row2[i1] += 1  # for each matching pair
    # the final answer is the sum of the products for each pair of counts
    result = sum([row1[i] * row2[i] for i in range(len(l))])
    return result
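A quick sanity check against the example from the question:
print(my_solution([1, 2, 3, 4, 5, 6]))   # 3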

How to filter time series if data exists at least data every 6 hours?

I'd like to verify that there is data at least once every 6 hours per ID, and filter out the IDs that do not meet this criterion.
Essentially a filter: "if an ID does not have data at least every 6 hours, drop that ID from the dataframe".
I tried to use the same method I use for filtering to one sample per day, but I'm having trouble adapting the code.
# add day column from the datetime index
df['1D'] = df.index.day
# reset index
daily = df.reset_index()
# count per ID per day; the result is the per-ID count of rows
a = daily.groupby(['1D', 'id']).size()
# filter by right join
filtered = a.merge(df, on='id', how='right')
I cannot figure out how to adapt this for the following 6hr periods each day: 00:01-06:00, 06:01-12:00, 12:01-18:00, 18:01-24:00.
Group by ID, integer-divide the hour by 6, and count the unique bins. In your case the count should be greater than or equal to 4, because there are four 6-hour bins in 24 hours, i.e. each day has 4 unique bins:
Bins = 4
00:01-06:00
06:01-12:00
12:01-18:00
18:01-24:00
Code
mask = df.groupby('id')['date'].transform(lambda x: (x.dt.hour // 6).nunique() >= 4)
df = df[mask]
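If the requirement is that every calendar day an ID appears on must contain all four bins (rather than the ID touching four bins overall), a variant along these lines could work; this is my own sketch, not part of the answer above:
def has_all_bins(dates):
    # for each calendar day, count the distinct 6-hour bins that contain data
    bins_per_day = dates.dt.hour.floordiv(6).groupby(dates.dt.date).nunique()
    return (bins_per_day >= 4).all()   # scalar result is broadcast by transform

mask = df.groupby('id')['date'].transform(has_all_bins)
df = df[mask]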
I propose using pivot_table with resample, which allows changing to arbitrary frequencies. Please see the comments for further explanation.
from datetime import datetime
import pandas as pd

# Build test data. I need a dummy column to use pivot_table later;
# any column with numerical values will suffice.
data = [[datetime(2020, 1, 1, 1), 1, 1],
        [datetime(2020, 1, 1, 6), 1, 1],
        [datetime(2020, 1, 1, 12), 1, 1],
        [datetime(2020, 1, 1, 18), 1, 1],
        [datetime(2020, 1, 1, 1), 2, 1]]
df = pd.DataFrame.from_records(data=data, columns=['date', 'id', 'dummy'])
df = df.set_index('date')

# We need a helper dataframe df_tmp.
# Transform id entries to columns and resample with 6h = 360 minutes = '360T'.
# Take mean() because empty bins will then produce NaN values.
# WARNING: it will only work if you have at least one id with observations for every 6h.
df_tmp = pd.pivot_table(df, columns='id', index=df.index).resample('360T').mean()

# Drop the column MultiIndex and drop all columns with NaN values
df_tmp.columns = df_tmp.columns.get_level_values(1)
df_tmp.dropna(axis=1, inplace=True)

# Keep only the rows of the original dataframe whose id survived the filter
mask_id = df.id.isin(df_tmp.columns.to_list())
df = df[mask_id]
I kept your requirements on the timestamps, but I believe you will want to use the commented lines in my solution instead.
import pandas as pd

period = pd.to_datetime(['2020-01-01 00:01:00', '2020-01-01 06:00:00'])
# period = pd.to_datetime(['2020-01-01 00:00:00', '2020-01-01 06:00:00'])
shift = pd.to_timedelta(['6H', '6H'])

id_with_data = set(df['ID'])
for k in range(4):  # the four 6-hour windows of a day (00:01 --> 24:00)
    period_mask = (period[0] <= df.index) & (df.index <= period[1])
    # period_mask = (period[0] <= df.index) & (df.index < period[1])
    present_ids = set(df.loc[period_mask, 'ID'])
    id_with_data = id_with_data.intersection(present_ids)
    period += shift

df = df.loc[df['ID'].isin(list(id_with_data))]

Filter array by tolerance from the last value

I'm using Python 3.7.
I have an array like this:
L1 = [1,2,3,-10,8,12,300,17]
Now I want to filter the values (the -10 and the 300 are not okay).
The values in the array may vary, but they always count up or down gradually.
Does Python 3 have a built-in function for that?
The result should look like this:
L1 = [1,2,3,8,12,17]
Thank you!
Edit from comments:
I want to keep each element only if it is within a certain distance (tolerance: 10, for example) of the one before it.
Your array is a list. You can use built-in functions:
L1 = [1, 2, 3, -10, 8, 12, 300, 17]
min_val = min(L1)  # -10
max_val = max(L1)  # 300
p = list(filter(lambda x: min_val < x < max_val, L1))  # all x that are not -10 or 300
print(p)  # [1, 2, 3, 8, 12, 17]
Docs:
min()
max()
filter()
If you want an incremental filter instead, go through your list of data points and decide whether to keep each one:
delta = 10
result = []
last = L1[0]  # take the first one as the last value, then check the remaining list L1[1:]
for elem in L1[1:]:
    if last - delta < elem < last + delta:
        result.append(last)
        last = elem

# after the loop, keep the final element if it is within delta of the last kept value
if elem - delta < result[-1] < elem + delta:
    result.append(elem)

print(result)  # [1, 2, 3, 8, 12, 17]

grouping coordinates within a distance to each other

I have written this code, which works but takes a very long time (~8 hours) to finish.
I'm wondering if it can be optimized to run faster.
The aim is to group a lot of items ((x, y, z) coordinates) based on their distance from one another. For example:
I would like to group them within a distance of +-0.5 in x, +-0.5 in y and +-0.5 in z; the output for the data below would then be [(0,3),(1),(2,4)...].
x y z
0 1000.1 20.2 93.1
1 647.7 91.7 87.7
2 941.2 44.3 50.6
3 1000.3 20.3 92.9
4 941.6 44.1 50.6
...
What I have done (and which works) is described below.
It compares the first row of the data_frame with the 2nd, 3rd, 4th, ... until the end, and for each row, if the difference in x is < 0.5, the difference in y is < 0.5 and the difference in z is < 0.5, then the index is added to a list, group. If it isn't, it compares the next row, until reaching the end of the loop.
After each loop completes, the indices which matched (stored in group) are added to another list, groups, as a set, and are then removed from the original list, a; then the next a[0] is compared, and so on.
groups = []
group = []
data = [(x, y, z), (x, y, z), (etc)]  # > 50,000 entries
data_frame = pd.DataFrame(data, columns=['x', 'y', 'z'])
a = list(i for i in range(len(data_frame)))
threshold = 0.5

for j in range(len(a) - 1):
    if len(a) > 0:
        group.append(a[0])
        for ii in range(a[0], len(data_frame) - 1):
            if ((data_frame.loc[a[0], 'x'] - data_frame.loc[ii, 'x']) < threshold) and ((data_frame.loc[a[0], 'y'] - data_frame.loc[ii, 'y']) < threshold) and ((data_frame.loc[a[0], 'z'] - data_frame.loc[ii, 'z']) < threshold):
                group.append(ii)
            else:
                continue
        groups.append(set(group))
        for iii in group:
            if iii in a:
                a.remove(iii)
            else:
                continue
        group = []
    else:
        break
which returns something like this, for example:
groups = [{0}, {1, 69}, {2, 70}, {3, 67}, {4}, {5}, {6}, {7, 9}, {8}, {10}, {11}, {12}, {13}, {14, 73}, {15}, {16}, {17, 21, 74}, {18, 20}, {19}, {22, 23}]
I have made many edits to this question as it was not very clear. Hopefully it makes sense now.
Below is an attempt using better logic (O(N log N)) which is much faster but doesn't return the correct answer. I have used the same +-0.5 for x, y, z.
Edit:
test_list = [(i,x,y,z), ... , (i,x,y,z)]
df3 = sorted(test_list, key=lambda x: x[1])
result = []
while df3:
    if len(df3) > 1:  # added this because it was crashing at the end of the loop
        a = df3.pop(0)
        alist = [a[0]]
        while ((abs(a[1] - df3[0][1]) < 0.5) and (abs(a[2] - df3[0][2]) < 0.5) and (abs(a[3] - df3[0][3]) < 0.5)):
            alist.append(df3.pop(0)[0])
            if df3:
                continue
            else:
                break
        result.append(alist)
    else:
        result.append(a[0])
        break
Since you are comparing each data point with every other one, your implementation has a worst-case time complexity of O(N^2). A better way is to do a sort first.
import random

df = [i for i in range(100)]
random.shuffle(df)
df2 = [(i, x) for i, x in enumerate(df)]
df3 = sorted(df2, key=lambda x: x[1])
df3
# [(31, 0), (24, 1), (83, 2)......
Assume now that you want to group numbers within +5/-5 of each other into one list. You can then slice the numbers into lists based on a condition.
result = []
while df3:
    a = df3.pop(0)
    alist = [a[0]]
    while a[1] + 5 >= df3[0][1]:
        alist.append(df3.pop(0)[0])
        if df3:
            continue
        else:
            break
    result.append(alist)
result
# [[31, 24, 83, 58, 82, 35], [0, 65, 77, 41, 67, 56].......
Sorting takes O(N log N) and the grouping pass is basically linear, so this is much faster than the quadratic approach.
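For the original three-dimensional data, a scipy-based sketch (my own suggestion, not part of the answer above) avoids the full pairwise scan: a cKDTree returns all index pairs whose coordinates differ by at most 0.5 in every axis (Chebyshev metric, p=inf), and a small union-find merges those pairs into groups. Note this groups by connected components, which can be slightly more permissive than anchoring every comparison to the first point of a group.
import numpy as np
from scipy.spatial import cKDTree

pts = np.array([[1000.1, 20.2, 93.1],
                [647.7, 91.7, 87.7],
                [941.2, 44.3, 50.6],
                [1000.3, 20.3, 92.9],
                [941.6, 44.1, 50.6]])

tree = cKDTree(pts)
pairs = tree.query_pairs(r=0.5, p=np.inf)   # index pairs within 0.5 in x, y and z

# merge the pairs into groups with a simple union-find
parent = list(range(len(pts)))
def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i
for i, j in pairs:
    parent[find(i)] = find(j)

groups = {}
for i in range(len(pts)):
    groups.setdefault(find(i), set()).add(i)
print(list(groups.values()))   # [{0, 3}, {1}, {2, 4}]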
