Averaging n elements along the 1st axis of a 4D array with numpy

I have a 4D array containing daily time-series of gridded data for different years with shape (year, day, x-coordinate, y-coordinate). The actual shape of my array is (19, 133, 288, 620), so I have 19 years of data with 133 days per year over a 288 x 620 grid. I want to take the weekly average of each grid cell over the period of record. The shape of the weekly averaged array should be (19, 19, 288, 620), or (year, week, x-coordinate, y-coordinate). I would like to use numpy to achieve this.
Here I construct some dummy data to work with and an array of what the solution should be:
import numpy as np
a1 = np.arange(1, 10).reshape(3, 3)
a1days = np.repeat(a1[np.newaxis, ...], 7, axis=0)
b1 = np.arange(10, 19).reshape(3, 3)
b1days = np.repeat(b1[np.newaxis, ...], 7, axis=0)
c1year = np.concatenate((a1days, b1days), axis=0)
a2 = np.arange(19, 28).reshape(3, 3)
a2days = np.repeat(a2[np.newaxis, ...], 7, axis=0)
b2 = np.arange(29, 38).reshape(3, 3)
b2days = np.repeat(b2[np.newaxis, ...], 7, axis=0)
c2year = np.concatenate((a2days, b2days), axis=0)
dummy_data = np.concatenate((c1year, c2year), axis=0).reshape(2, 14, 3, 3)
solution = np.concatenate((a1, b1, a2, b2), axis=0).reshape(2, 2, 3, 3)
The shape of the dummy_data is (2, 14, 3, 3). Per the dummy data, I have two years of data, 14 days per year, over a 3 X 3 grid. I want to return the weekly average of the grid for both years, resulting in a solution with shape (2, 2, 3, 3).

You can reshape and take mean:
week_mean = dummy_data.reshape(2,-1,7,3,3).mean(axis=2)
# in your case .reshape(year, -1, 7, x_coord, y_coord)
# check:
(dummy_data.reshape(2,2,7,3,3).mean(axis=2) == solution).all()
# True
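As a side note (not in the original answer): the reshape only works because the number of days is an exact multiple of 7 (133 = 19 × 7). If it were not, a common workaround is to trim the trailing days first. A small sketch with hypothetical sizes:

```python
import numpy as np

# hypothetical data: 2 years x 16 days (not a multiple of 7) on a 3x3 grid
data = np.arange(2 * 16 * 3 * 3, dtype=float).reshape(2, 16, 3, 3)

n_weeks = data.shape[1] // 7              # complete weeks only
trimmed = data[:, :n_weeks * 7]           # drop the leftover days
weekly = trimmed.reshape(2, n_weeks, 7, 3, 3).mean(axis=2)
# weekly.shape -> (2, 2, 3, 3)
```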

Related

How to structure vlookup calculations in python?

I am having trouble understanding how to structure, in Python, an Excel model that is basically a bunch of top-down customer calculations.
Excel calcs
The Excel model consists of many worksheets that look up values from different worksheets and perform calculations at the customer level.
Each customer starts off with an amount, a starting year, an end year, and a starting state.
example of calculation in excel:
customer 1:
amount 100, at starting state B, and starting year 3.
Multiplied by matrix (worksheet1):
The matrix consists of 10 3D arrays with 5 states (A-E). Each of the 10 arrays represents a year (1-10).
I multiply the amount 100 by the matrix at that year and get an array: [800, 650, 400, 300, 840].
I then take this array and do another vlookup calculation from another worksheet.
Example: limits (worksheet2), which consists of years and limit %.
Year | Limit%
  1  | 0.32
  2  | 0.23
  3  | 0.11
  4  | 0.21
I vlook-up the customer's year (year 3 in this case) and then multiply [800, 650, 400, 300, 840] * 0.11.
I then need to do a few more vlookup calculations like the one above.
After that, I need to multiply the result by the matrix at year 4, then do the vlookup calcs for year 4 like I did for year 3, and continue until year 10 is reached.
It is very difficult to understand what the data looks like from your description. However, I would suggest creating a pd.DataFrame and pd.Series of the constant data, with the identifier as the index. Then you can use .loc[] to retrieve the relevant row and use this data. If you feed the result back into the function until the remaining-year counter n reaches 0, you can then return the final value.
A simple example that might help you start:
matrix1 has columns as states ("A" and "B"), rows as years (1, 2, 3, 4, 5).
matrix2 has rows as year (1, 2, 3, 4, 5), and the data is the "limit%".
Code:
import numpy as np
import pandas as pd

# matrix1 = pd.DataFrame({'A': {1: 1, 2: 2, 3: 3, 4: 4, 5: 5},
#                         'B': {1: 2, 2: 3, 3: 4, 4: 5, 5: 6}})
matrix1 = pd.DataFrame(data=np.random.rand(5, 5, 5).tolist(),
                       columns=['A', 'B', 'C', 'D', 'E'],
                       index=[1, 2, 3, 4, 5])
matrix2 = pd.Series({1: 0.32, 2: 0.23, 3: 0.11, 4: 0.21, 5: 0.2})
def calculation(amount, year, starting_state, n, calcs=None):
    """
    This is the function that calculates everything.
    :input amount: initial amount as int
    :input year: starting year as int
    :input starting_state: state in ["A", "B"] as str
    :input n: (end_year - starting_year) where end_year <= 5 and n > 0 as int
    :input calcs: dictionary of calculations, starting as empty
    :return: dictionary of the result for each year
    """
    # build the empty dict here: a mutable default argument ({}) would be
    # shared between separate calls
    if calcs is None:
        calcs = {}
    # when there are no more years remaining
    if n == 0:
        # return amount  (the original version returned just the final amount)
        return calcs
    # multiply by matrix1 (each cell holds a list, so convert to an array);
    # plain * rather than *= so earlier entries in calcs are not mutated in place
    amount = amount * np.asarray(matrix1.loc[year, starting_state])
    # multiply by matrix2
    amount = amount * matrix2.loc[year]
    # more calculations ...
    # add this year's result to the dictionary
    calcs[year] = amount
    # pass the new data back into the function
    return calculation(amount, year + 1, starting_state, n - 1, calcs)

calculation(10, 2, "A", 5 - 2)
# Out: {2: array([...]), 3: array([...]), 4: array([...])}
Running through the iterations (these numbers use the commented-out scalar matrix1 and return the plain amount rather than calcs):
# 10 * 2 = 20
# 20 * 0.23 = 4.6
# recurse with amount=4.6, year=3, "A", 3-1
# 4.6 * 3 = 13.8
# 13.8 * 0.11 = 1.518
# recurse with amount=1.518, year=4, "A", 2-1
# 1.518 * 4 = 6.072
# 6.072 * 0.21 = 1.27512
# recurse with amount=1.27512, year=5, "A", 1-1
# as n is now 0
# return 1.27512
If you have an initial dataframe with the input data, you can then append the end result to the data:
people = pd.DataFrame({"amount": [10, 11, 12, 13],
                       "starting_year": [2, 2, 1, 3],
                       "end_year": [5, 5, 4, 5],
                       "state": ["A", "A", "B", "A"]})
people["output"] = people.apply(lambda x: calculation(
    x["amount"], x["starting_year"],
    x["state"], x["end_year"] - x["starting_year"]), axis=1)
people
#Out (values shown for the commented-out scalar matrix1, returning the plain amount):
# amount starting_year end_year state output
#0 10 2 5 A 1.275120
#1 11 2 5 A 1.402632
#2 12 1 4 B 2.331648
#3 13 3 5 A 3.603600
EDIT
The changes made to reflect your additions:
matrix1 is now built from a 3-dimensional array. This changes the creation of the dataframe (obviously), and the multiplication by matrix1 needs to convert the returned list to an np.array so that it can be multiplied.
There is now an additional calcs input to the function, defaulting to an empty dictionary when no argument is passed, which is what you want at the start. In the last line of the function before the return, calcs[year] = amount appends the year and the array for that year to the dictionary. This means that the output of running for people is now a column of dictionaries. If you want to expand this into one column per year, you can add a line afterwards: people = pd.concat([people, people["output"].apply(pd.Series)], axis=1).
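To illustrate that concat trick with made-up output dictionaries (not the real calculation results):

```python
import pandas as pd

people = pd.DataFrame({"amount": [10, 11],
                       "output": [{3: 1.5, 4: 2.0}, {3: 1.8, 4: 2.4}]})
# expand the dict column into one column per year
people = pd.concat([people, people["output"].apply(pd.Series)], axis=1)
# people now has extra columns 3 and 4 holding the per-year values
```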

How to extract rows from a numpy array using single indices and ranges of indices

Imagine I have a 2D NumPy array X with 100 columns and 100 rows. How can I extract the following rows and columns using indexing?
rows= 1, 5, 15, 16 to 35, 45, 55 to 75
columns = 1, 2, 10 to 30, 42, 50
I know in MATLAB, we can do this using
X([1,5,15,16:35,45,55:75],[1,2,10:30,42,50]).
How can we do this in Python?
You can use np.r_ to build the index arrays, and np.ix_ to select every listed row crossed with every listed column:
rows = np.r_[1, 5, 15, 16:35, 45, 55:75]
cols = np.r_[1, 2, 10:30, 42, 50]
X[np.ix_(rows, cols)]
Note that plain X[rows, cols] would pair the two index arrays element-wise (and fails here, since they have different lengths); np.ix_ gives the MATLAB-style sub-grid. Also, in Python, 16:35 does not include 35, so use 16:36 if you want row 35 as well.
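For a MATLAB-style sub-grid (every listed row crossed with every listed column), np.ix_ combines with np.r_ like this; a small self-contained check with arbitrary sizes:

```python
import numpy as np

X = np.arange(100).reshape(10, 10)
rows = np.r_[1, 3, 5:8]            # rows 1, 3, 5, 6, 7
cols = np.r_[0, 2:5]               # columns 0, 2, 3, 4
sub = X[np.ix_(rows, cols)]        # full 5 x 4 sub-grid
# sub[0, 0] is X[1, 0], i.e. 10
```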

Skipping colors in colormap when value is NaN in Matplotlib

I'd like to plot a scatter graph of values (y-axis) against dates (x-axis), and change the color of each point according to its month of the year, i.e. each month has a unique color. I'd ideally like to take the colors from a colormap, e.g. viridis.
I have some example data as follows:
# example data
import datetime
import numpy as np
import matplotlib.pyplot as plt

mydata = np.array([0.85405058, 0.78228784, np.nan, 0.72828138,
                   0.73757833, 0.69303712, 0.5730553, 0.57644895])
dates_list = [datetime.date(2017, 1, 5), datetime.date(2017, 5, 22),
              datetime.date(2017, 6, 14), datetime.date(2017, 8, 17),
              datetime.date(2017, 9, 27), datetime.date(2017, 10, 6),
              datetime.date(2017, 11, 23), datetime.date(2017, 12, 28)]
months_list = [1, 5, 6, 8, 9, 10, 11, 12]
# plot data
fig, ax = plt.subplots(1, 1)
ax.scatter(dates_list, mydata, s=30, c=months_list, cmap='viridis')
If I do this, then it works if there are no missing values, but if there are NaNs (which my data has), then the colors get shifted out of alignment and are no longer unique to a particular month. Can anyone point me in the right direction as to how to assign a unique color to each month?
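One way to keep the month-to-color mapping stable (a sketch, not from the original thread) is to drop the NaN points before plotting and pin the color scale with vmin/vmax so each month always maps to the same color:

```python
import datetime
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the example runs anywhere
import matplotlib.pyplot as plt

mydata = np.array([0.85405058, 0.78228784, np.nan, 0.72828138,
                   0.73757833, 0.69303712, 0.5730553, 0.57644895])
dates_list = [datetime.date(2017, 1, 5), datetime.date(2017, 5, 22),
              datetime.date(2017, 6, 14), datetime.date(2017, 8, 17),
              datetime.date(2017, 9, 27), datetime.date(2017, 10, 6),
              datetime.date(2017, 11, 23), datetime.date(2017, 12, 28)]
months = np.array([1, 5, 6, 8, 9, 10, 11, 12])

valid = ~np.isnan(mydata)                      # mask of non-missing points
dates_valid = [d for d, v in zip(dates_list, valid) if v]

fig, ax = plt.subplots(1, 1)
# vmin/vmax pin the mapping: month 1 -> start of viridis, month 12 -> end,
# so dropping points never shifts the colors of the remaining months
ax.scatter(dates_valid, mydata[valid], s=30,
           c=months[valid], cmap='viridis', vmin=1, vmax=12)
```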

Pyspark Columnsimilarities interpretation

I was learning how to use columnSimilarities. Can someone explain the matrix generated by the algorithm?
Let's say, in this code:
from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)])
# Convert to RowMatrix
mat = RowMatrix(rows)
# Calculate exact and approximate similarities
exact = mat.columnSimilarities()
approx = mat.columnSimilarities(0.05)
# Output
exact.entries.collect()
[MatrixEntry(0, 2, 0.991935352214),
MatrixEntry(1, 2, 0.998441152599),
MatrixEntry(0, 1, 0.997463284056)]
How can I know which rows are most similar given the matrix? Does (0, 2, 0.991935352214) mean that rows 0 and 2 have a similarity of 0.991935352214? I know that 0 and 2 are i and j, the row and column indices of the matrix.
Thank you.
how can I know which row is most similar given the matrix?

It is columnSimilarities, not rowSimilarities, so it is just not the thing you're looking for: MatrixEntry(0, 2, 0.991935352214) means the cosine similarity between column 0 and column 2 is 0.991935352214; it says nothing about rows.
You could apply it to a transposed matrix, but you really don't want to. The algorithms used here are designed for thin (tall-and-narrow) and optimally sparse data. They just won't scale for wide data.
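For intuition, the entries can be reproduced with plain numpy (a sketch independent of Spark): each MatrixEntry(i, j, v) is the cosine similarity between columns i and j of the input matrix.

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]], dtype=float)

def cosine(u, v):
    # cosine similarity between two vectors
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

sim_0_2 = cosine(m[:, 0], m[:, 2])
# matches MatrixEntry(0, 2, 0.991935352214) up to floating-point precision
```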

Is there a way to color values in a dataframe if they belong to a certain range (Python-Pandas)?

I have a data frame with values from 0 to 10. I would like to color the values 1 through 5 red rather than black. Is it possible to do that with a pandas DataFrame? I am using a Jupyter notebook.
You can change the style of cells -
df = pd.DataFrame({'v1': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
dft = df.style.applymap(lambda x: 'color: red' if x >= 1 and x <=5 else 'color: black')
dft
You can find more information about applying styles here - http://pandas.pydata.org/pandas-docs/stable/style.html
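As a quick sanity check on the condition itself, written as a standalone function (separate from the Styler rendering):

```python
def style(x):
    # red for values in the inclusive range [1, 5], black otherwise
    return 'color: red' if 1 <= x <= 5 else 'color: black'
```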
