tensorflow timeseries different lengths - python-3.x

I'm trying to get a time series into TensorFlow to train an LSTM. I have 4 files, but I'm not sure how to get them working together. The biggest problem is that my first dataset has 1 data point per year, while 2 others have monthly data that should be used as correlated predictors for the first set. The 4th dataset just has some metadata like species and coordinates. Should I put them together somehow, and if so, how? Any pointer in the right direction would be nice.
I already looked at the time-series documentation of TensorFlow and also tried to follow this guide: https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/
but I struggle with aligning the yearly and monthly data. I manage the data in R but run TensorFlow in Python; I'm more familiar with R in general.
Thank you all for being here!
Header samples of the Data structure:
File1.csv:
years | noaa-tree-2657 |noaa-tree-2658 |noaa-tree-2659 |noaa-tree-2662
1901 | 1.676948 | 1.305594 | 0.6756204 | 0.7149572
1902 | 1.562344 | 0.899884 | 0.5102933 | 0.6351094
1903 | 1.687270 | 1.354678 | 0.9899198 | 0.6158589
File2.csv:
noaa-tree-2657 |noaa-tree-2658 |noaa-tree-2659 |noaa-tree-2662 |noaa-tree-2664
1 6.41 | 1.85 | 0.33 | 8.61 | 6.07
2 10.45 | 3.20 | 0.38 | 8.58 | 5.30
3 10.81 | 4.30 | 1.50 | 9.34 | 8.50
File3.csv:
noaa-tree-2657 |noaa-tree-2658 |noaa-tree-2659 |noaa-tree-2662 |noaa-tree-2664
1 -0.3 | 11.0 | 10.1 | -22.4 | -15.1
2 -2.9 | 10.2 | 8.8 | -14.5 | -13.3
3 1.0 | 14.3 | 14.7 | -13.8 | -12.7
File4.csv:
noaa-tree-2657 |noaa-tree-2658 |noaa-tree-2659 |noaa-tree-2662 |noaa-tree-2664
1 QUPR | PSME | PSME | PCGL | THOC
2 280.28 | 249.65 | 250.08 | 298 | 280.72
3 39.1 | 31.45 | 32.72 | 56.55 | 48.47
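One way to line the two frequencies up (a sketch only, assuming the monthly files can be given explicit year/month columns; the miniature frames and the precip_ column names below are hypothetical) is to aggregate each monthly predictor to one row per year and join it onto the annual target before building LSTM windows:

```python
import pandas as pd

# Hypothetical miniature stand-ins for File1 (yearly target) and File2 (monthly predictor)
yearly = pd.DataFrame({"years": [1901, 1902],
                       "noaa-tree-2657": [1.676948, 1.562344]})
monthly = pd.DataFrame({"year": [1901, 1901, 1902, 1902],
                        "month": [1, 2, 1, 2],
                        "noaa-tree-2657": [6.41, 10.45, 7.0, 9.0]})

# collapse the monthly series to one row per year (mean here; min/max/sum also work)
monthly_agg = (monthly.groupby("year")["noaa-tree-2657"]
                      .agg(["mean", "max"])
                      .add_prefix("precip_"))

# join the aggregated predictors onto the yearly target; each year is now one feature row
merged = yearly.merge(monthly_agg, left_on="years", right_index=True)
```

The opposite direction also works: repeat each yearly value twelve times to keep monthly resolution. Which is better depends on whether the within-year monthly variation itself carries signal for the prediction.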

Related

Calculating the difference in value between columns

I have a dataframe with YYYYMM columns that contain monthly totals on the row level:
| yearM | feature  | 201902 | 201903 | 201904 | 201905 | ... | 202009 |
|-------|----------|--------|--------|--------|--------|
| 0     | feature1 | NaN    | NaN    | 9.0    | 32.0   |
| 1     | feature2 | 1.0    | 1.0    | 1.0    | 4.0    |
| 2     | feature3 | NaN    | 1.0    | 4.0    | 8.0    |
| 3     | feature4 | 9.0    | 15.0   | 19.0   | 24.0   |
| 4     | feature5 | 33.0   | 67.0   | 99.0   | 121.0  |
| 5     | feature6 | 12.0   | 15.0   | 17.0   | 19.0   |
| 6     | feature7 | 1.0    | 8.0    | 15.0   | 20.0   |
| 7     | feature8 | NaN    | NaN    | 1.0    | 9.0    |
I would like to convert the totals to the monthly change. The feature column should be excluded as I need to keep the feature names. The yearM in the index is a result of pivoting a dataframe to get the YYYYMM on the column level.
This is what the output would look like:
| yearM | feature  | 201902 | 201903 | 201904 | 201905 | ... | 202009 |
|-------|----------|--------|--------|--------|--------|
| 0     | feature1 | NaN    | 0.0    | 9.0    | 23.0   |
| 1     | feature2 | 1.0    | 0.0    | 0.0    | 3.0    |
| 2     | feature3 | NaN    | 1.0    | 3.0    | 5.0    |
| 3     | feature4 | 9.0    | 6.0    | 4.0    | 5.0    |
| 4     | feature5 | 33.0   | 34.0   | 32.0   | 22.0   |
| 5     | feature6 | 12.0   | 3.0    | 2.0    | 2.0    |
| 6     | feature7 | 1.0    | 7.0    | 7.0    | 5.0    |
| 7     | feature8 | NaN    | 0.0    | 1.0    | 8.0    |
The row level values now represent the change compared to the previous month instead of having the total for the month.
I know that I should start by filling the NaN rows in the starting column 201902 with 0:
df['201902'] = df['201902'].fillna(0)
I could also calculate them one by one on a copy (so each subtraction uses the original totals rather than an already-differenced column), with something similar to this:
orig = df.copy()
df['201903'] = orig['201903'].fillna(0) - orig['201902'].fillna(0)
df['201904'] = orig['201904'].fillna(0) - orig['201903'].fillna(0)
...
...
Hopefully there's a smarter solution though
Use iloc or drop to exclude the feature column, then diff with axis=1 for row-wise differences:
monthly_change = df.iloc[:, 1:].fillna(0).diff(axis=1)
# or
# monthly_change = df.drop(['feature'], axis=1).fillna(0).diff(axis=1)
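Note that diff(axis=1) leaves the first YYYYMM column as NaN, so to reproduce the expected output you can restore the first month's totals afterwards. A small self-contained sketch (the miniature dataframe is hypothetical, mirroring two rows of the example):

```python
import pandas as pd

# Hypothetical miniature version of the pivoted dataframe
df = pd.DataFrame({
    "feature": ["feature2", "feature4"],
    "201902": [1.0, 9.0],
    "201903": [1.0, 15.0],
    "201904": [1.0, 19.0],
})

months = df.columns[1:]                     # every YYYYMM column
change = df[months].fillna(0).diff(axis=1)  # month-over-month change
change[months[0]] = df[months[0]]           # diff leaves the first month NaN; keep its totals
df[months] = change
```

After this, each cell holds the change versus the previous month, and the first month keeps its original totals, matching the desired output.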

Pandas - Copying to Index and Then Sorting

I am working on a large dataset of stock data. I've been able to create a multi-indexed dataframe, but now I can't configure it the way I want it.
Basically, I am trying to make an index called 'DATE' and then sort each smaller set against the index.
Right now it looks like this:
+------------+----------+-------+-------------+-------+
| DATE | AAPL | | GE | |
+------------+----------+-------+-------------+-------+
| DATE | date | close | date | close |
| 05-31-2019 | 05/31/19 | 203 | 04-31-2019 | 9.3 |
| 05-30-2019 | 05/30/19 | 202 | 04-30-2019 | 9.3 |
| 05-29-2019 | 05/29/19 | 4 | 04-29-2019 | 9.6 |
| | | | | |
| ... | | | | |
| | | | | |
| NaN | NaN | NaN | 01/30/1970 | 0.77 |
| NaN | NaN | NaN | 01/29/1970 | 0.78 |
| NaN | NaN | NaN | 01/28/1970 | 0.76 |
+------------+----------+-------+-------------+-------+
Where DATE is the index.
And I want it to look like this:
+------------+----------+-------+----------+-------+
| DATE | AAPL | | GE | |
+------------+----------+-------+----------+-------+
| DATE | date | close | date | close |
| 05-31-2019 | 05/31/19 | 203 | NaN | NaN |
| 05-30-2019 | 05/30/19 | 202 | NaN | NaN |
| 05-29-2019 | 05/29/19 | 4 | NaN | NaN |
| | | | | |
| ... | | | | |
| | | | | |
| 01/30/1970 | NaN | NaN |01/30/1970| 0.77 |
| 01/29/1970 | NaN | NaN |01/29/1970| 0.78 |
| 01/28/1970 | NaN | NaN |01/28/1970| 0.76 |
+------------+----------+-------+----------+-------+
Where the index (DATE) has taken all of the unique values, and then all of the rows within stock symbols have moved to match the index where 'date' = 'DATE'.
I've made many attempts at this, but I can't figure out either step: how to make the index a list of all of the unique 'date' values, or how to realign each symbol's data to match the new index.
A lot of my troubles (I suspect) have to do with the fact that I am using a multi-index for this, which makes everything more difficult as Pandas needs to know what level to be using.
I made the initial Index using this code:
df['DATE','DATE'] = df.xs(('AAPL', 'date'), level=('symbol', 'numbers'), axis=1)
df.set_index('DATE', inplace=True)
I tried to make one that kept adding unique values to the column, like this:
for f in filename_wo_ext:
    data = df.xs([f, 'date'], level=['symbol', 'numbers'], axis=1)
    df.append(data, ignore_index=True)
    df['DATE', 'DATE'] = data

pd.concat([pd.DataFrame([df], columns=['DATE']) for f in filename_wo_ext], ignore_index=True)
But that didn't cycle and append in the for loop that I wanted it to, it just made a column based on the last symbol.
Then in terms of sorting the symbol frame to match the index, I still haven't been able to figure that out.
Thank you so much!
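One common pattern for this (a hedged sketch, not necessarily the only way; the miniature aapl/ge frames below are hypothetical stand-ins for the parsed CSVs) is to index each symbol's frame by its own dates and let pd.concat align them into one shared DATE index:

```python
import pandas as pd

# Hypothetical per-symbol frames standing in for the parsed CSV files
aapl = pd.DataFrame({"date": ["2019-05-31", "2019-05-30"], "close": [203.0, 202.0]})
ge = pd.DataFrame({"date": ["2019-05-30", "1970-01-30"], "close": [9.3, 0.77]})

frames = {}
for symbol, frame in [("AAPL", aapl), ("GE", ge)]:
    f = frame.copy()
    f["date"] = pd.to_datetime(f["date"])
    # index each frame on its own dates so concat can align rows across symbols
    frames[symbol] = f.set_index(f["date"].rename("DATE"))

# outer join on the shared DATE index; the dict keys become the top column level (symbol)
combined = pd.concat(frames, axis=1).sort_index(ascending=False)
```

pd.concat performs an outer join on the row index, so each symbol's rows land on the matching DATE and every other symbol shows NaN there, which is exactly the target layout above.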

Looking to create weighted average of partitioned columns in Excel

Horrible title, but I couldn't find a way to describe what I'm trying to do concisely. This question was posed to me by a friend, and I'm usually competent in Excel, but in this case I am totally stumped.
Suppose I have the following data:
| A | B | C | D | E | F | G | H |
---------------------------------------------------------------------
1 | 0.50 | 0.50 | 1 | | | 0.30 | 0.30 | |
2 | 0.25 | 0.75 | 2 | | | 0.40 | 0.70 | |
3 | 1.00 | 1.75 | 8 | | | 0.30 | 1.00 | |
4 | 0.75 | 2.50 | 2 | | | 0.50 | 1.50 | |
5 | 1.25 | 3.75 | 3 | | | 1.75 | 3.25 | |
6 | 0.50 | 4.25 | 1 | | | 0.25 | 3.50 | |
7 | 1.00 | 5.25 | 0 | | | 0.50 | 4.00 | |
8 | 0.25 | 5.50 | 2 | | | 0.30 | 4.30 | |
9 | 0.25 | 5.75 | 9 | | | 0.25 | 4.55 | |
10 | 0.75 | 6.50 | 4 | | | 0.70 | 5.25 | |
11 | | | | | | 1.00 | 6.25 | |
12 | | | | | | 0.25 | 0.25 | |
Column A represents the distance traveled while the measurement in column C was collected. Column B represents the total distance traveled so far. So C1 represents some value produced during the process from distance 0 to 0.5, C2 represents the value from distance 0.5 to 0.75, and C3 represents the value from 0.75 to 1.75, etc...
Column F represents a PLANNED second iteration of the same process, but with different measurement intervals. What I need is a way to PREDICT column H, based on a WEIGHTED AVERAGE of values from column C, based on where the intervals in column F intersect with the intervals in column A. For example, since F2 represents the measurement taken from distance 0.30 to 0.70 (an interval of 0.4, split 50/50 across the measurements in C1 and C2), H2 would be equal to C1*0.5 + C2*0.5 = 1.5.
Another example: H3 represents the expected measurement from an interval between 0.7 and 1.0, which is split between C2 (from 0.7 to 0.75 = 0.05) and C3 (from 0.75 to 1.0 = 0.25). So H3 = (0.05/0.30)*C2 + (0.25/0.30)*C3 ≈ 16.7%*2 + 83.3%*8 = 7.0.
I'm looking for a way to do this in an Excel spreadsheet without using VBA or breaking it down into something like a Python script to process externally, but so far I'm not finding any way to do it.
Any ideas for accomplishing this entirely within Excel without any special add-ins/scripts installed ?
It's not pretty, but I think the following should work for all except H1 (which would need an added zero row):
=(MAX(0,INDEX(B:B,MATCH(G2,B:B,1))-G1)*INDEX(C:C,MATCH(G2,B:B,1)) +
(G2-INDEX(B:B,MATCH(G2,B:B,1)))*INDEX(C:C,MATCH(G2,B:B,1)+1)) /
MAX(G2-G1,G2-INDEX(B:B,MATCH(G2,B:B,1)))
It matches the values in B and C and weights them accordingly.
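For readers who just want to verify the weighting logic against the worked examples, here is an equivalent sketch in Python (outside the asker's Excel-only constraint; the function and variable names are hypothetical):

```python
def weighted_overlap(src_bounds, src_values, start, end):
    """Average src_values weighted by how much [start, end] overlaps each source interval."""
    total = 0.0
    prev = 0.0  # source intervals run from the previous cumulative bound to the next
    for bound, value in zip(src_bounds, src_values):
        overlap = max(0.0, min(end, bound) - max(start, prev))
        total += overlap * value
        prev = bound
    return total / (end - start)

B = [0.50, 0.75, 1.75, 2.50, 3.75, 4.25, 5.25, 5.50, 5.75, 6.50]  # column B (cumulative)
C = [1, 2, 8, 2, 3, 1, 0, 2, 9, 4]                                # column C (measurements)

h2 = weighted_overlap(B, C, 0.30, 0.70)  # planned interval behind F2/G2
h3 = weighted_overlap(B, C, 0.70, 1.00)  # planned interval behind F3/G3
```

Both worked examples from the question fall out of this overlap weighting: H2 = 1.5 and H3 = 7.0.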

Calculating median with three conditions to aggregate a large amount of data

Looking for some help with aggregating more than 60,000 data points (a fish telemetry study). I need to calculate the median of acceleration values by individual fish, date, and hour. For example, I want the median for a fish moving from 2:00-2:59 AM on June 1.
+--------+----------+-------+-------+------+-------+------+-------+-----------+-------------+
| Date | Time | Month | Diel | ID | Accel | TL | Temp | TempGroup | Behav_group |
+--------+----------+-------+-------+------+-------+------+-------+-----------+-------------+
| 6/1/10 | 01:25:00 | 6 | night | 2084 | 0.94 | 67.5 | 22.81 | High | Non-angled |
| 6/1/10 | 01:36:00 | 6 | night | 2084 | 0.75 | 67.5 | 22.81 | High | Non-angled |
| 6/1/10 | 02:06:00 | 6 | night | 2084 | 0.75 | 67.5 | 22.65 | High | Non-angled |
| 6/1/10 | 02:09:00 | 6 | night | 2084 | 0.57 | 67.5 | 22.65 | High | Non-angled |
| 6/1/10 | 03:36:00 | 6 | night | 2084 | 0.75 | 67.5 | 22.59 | High | Non-angled |
| 6/1/10 | 03:43:00 | 6 | night | 2084 | 0.57 | 67.5 | 22.59 | High | Non-angled |
| 6/1/10 | 03:49:00 | 6 | night | 2084 | 0.57 | 67.5 | 22.59 | High | Non-angled |
| 6/1/10 | 03:51:00 | 6 | night | 2084 | 0.57 | 67.5 | 22.59 | High | Non-angled |
+--------+----------+-------+-------+------+-------+------+-------+-----------+-------------+
I suggest the following pivot-based approach:
1. Add a column (say hr) to your data, containing something like =HOUR(B2) copied down to suit.
2. Pivot your data with ID, Date, hr and Time for ROWS and Sum of Accel for VALUES.
3. Copy the pivot table (in Tabular format, without Grand Totals) and Paste Special, Values.
4. On the copy, apply Subtotal with At each change in: hr, Use function: Average, Add subtotal to: Sum of Accel.
5. Select the Sum of Accel column and replace SUBTOTAL(1, with MEDIAN(. Change the Average labels to Median if required.
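If the data ever moves out of Excel, the same three-condition median is a one-liner with a pandas groupby (a sketch; the miniature dataframe below is a hypothetical excerpt of the sample rows):

```python
import pandas as pd

# Hypothetical excerpt of the telemetry table above
df = pd.DataFrame({
    "Date": ["6/1/10"] * 6,
    "Time": ["02:06:00", "02:09:00", "03:36:00", "03:43:00", "03:49:00", "03:51:00"],
    "ID": [2084] * 6,
    "Accel": [0.75, 0.57, 0.75, 0.57, 0.57, 0.57],
})

# derive the hour from the time stamp, then take the median per fish, date, and hour
df["Hour"] = pd.to_datetime(df["Time"], format="%H:%M:%S").dt.hour
medians = df.groupby(["ID", "Date", "Hour"])["Accel"].median()
```

Each entry of the resulting series is the median acceleration for one (fish, date, hour) combination, which scales fine to 60,000+ rows.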

Create Line-Chart with different X-Values

I have a certain number of measurements. Each in the following form:
Table A:
| Time [s] | Value |
| 0.5 | 2.0 |
| 50.3 | 33.7 |
| 100.0 | 25.5 |
Table B:
| Time [s] | Value |
| 1.3 | 12.7 |
| 27.8 | 25.0 |
| 97.5 | 20.0 |
| 100.0 | 7.1 |
Table C:
...
The time always runs from 0.0 seconds to 100.0 seconds, but the measurement points differ between tables, as the example shows.
I now want to display the different measurements in one chart, one line per table, with Time on the X-axis.
Is something like this possible in Excel?
Solved my problem by using a Scatter graph (with straight lines) instead of a Line graph, since a Scatter chart lets each series keep its own X-values.
