pandas summation per year - python-3.x

python3, pandas version 0.23.4
Let's say we have a pandas DataFrame as follows
np.random.seed(45)
df = pd.DataFrame({'A': np.random.randint(0, 10, 20)}, index = pd.to_datetime(dd).sort_values(ascending=False))
Now, I would like to total the data in column A with respect to each year. I could do:
gf_perYear = gf.groupby(by= gf.index.year)
gf_perYear.sum()
A
2012 11
2013 8
2014 15
2015 44
2016 13
2017 11
However, I am wondering if there would be a way that would allow me to get the result posted in a new column right by the last day if each year, as shown below:
A sum_per_year
2017-12-15 3 11
2017-11-27 0
2017-07-24 5
2017-06-28 3
2016-11-07 4 13
2016-06-03 9
2015-12-18 8 44
2015-10-16 1
2015-09-18 5
2015-07-15 9
2015-04-09 6
2015-03-18 8
2015-02-18 7
2014-10-21 8 15
2014-09-16 5
2014-01-29 2
2013-01-04 8 8
2012-12-28 1 11
2012-08-21 6
2012-03-02 4

You can using transform
gf_perYear = gf.groupby(by= gf.index.year)
gf['new'] = gf_perYear.transform('sum')

Related

How to find again the index after pivoting dataframe?

I created a dataframe form a csv file containing data on number of deaths by year (running from 1946 to 2021) and month (within year):
dataD = pd.read_csv('MY_FILE.csv', sep=',')
First rows (out of 902...) of output are :
dataD
Year Month Deaths
0 2021 2 55500
1 2021 1 65400
2 2020 12 62800
3 2020 11 64700
4 2020 10 56900
As expected, the dataframe contains an index numbered 0,1,2, ... and so on.
Now, I pivot this dataframe in order to have only 1 row by year and months in column, using the following code:
dataDW = dataD.pivot(index='Year', columns='Month', values='Deaths')
The first rows of the result are now:
Month 1 2 3 4 5 6 7 8 9 10 11 12
Year
1946 70900.0 53958.0 57287.0 45376.0 42591.0 37721.0 37587.0 34880.0 35188.0 37842.0 42954.0 49596.0
1947 60453.0 56891.0 56442.0 45121.0 42605.0 37894.0 38364.0 36763.0 35768.0 40488.0 41361.0 46007.0
1948 46161.0 45412.0 51983.0 43829.0 42003.0 37084.0 39069.0 35272.0 35314.0 39588.0 43596.0 53899.0
1949 87861.0 58592.0 52772.0 44154.0 41896.0 39141.0 40042.0 37372.0 36267.0 40534.0 47049.0 47918.0
1950 51927.0 47749.0 50439.0 47248.0 45515.0 40095.0 39798.0 38124.0 37075.0 42232.0 44418.0 49860.0
My question is:
What do I have to change in the previous pivoting code in order to find again the index 0,1,2,..etc. when I output the pivoted file? I think I need to specify index=*** in order to make the pivot instruction run. But afterwards, I would like to recover an index "as usual" (if I can say), exactly like in my first file dataD.
Any possibility?
You can reset_index() after pivoting:
dataDW = dataD.pivot(index='Year', columns='Month', values='Deaths').reset_index()
This would give you the following:
Month Year 1 2 3 4 5 6 7 8 9 10 11 12
0 1946 70900.0 53958.0 57287.0 45376.0 42591.0 37721.0 37587.0 34880.0 35188.0 37842.0 42954.0 49596.0
1 1947 60453.0 56891.0 56442.0 45121.0 42605.0 37894.0 38364.0 36763.0 35768.0 40488.0 41361.0 46007.0
2 1948 46161.0 45412.0 51983.0 43829.0 42003.0 37084.0 39069.0 35272.0 35314.0 39588.0 43596.0 53899.0
3 1949 87861.0 58592.0 52772.0 44154.0 41896.0 39141.0 40042.0 37372.0 36267.0 40534.0 47049.0 47918.0
4 1950 51927.0 47749.0 50439.0 47248.0 45515.0 40095.0 39798.0 38124.0 37075.0 42232.0 44418.0 49860.0
Note that the "Month" here might look like the index name but is actually df.columns.name. You can unset it if preferred:
df.columns.name = None
Which then gives you:
Year 1 2 3 4 5 6 7 8 9 10 11 12
0 1946 70900.0 53958.0 57287.0 45376.0 42591.0 37721.0 37587.0 34880.0 35188.0 37842.0 42954.0 49596.0
1 1947 60453.0 56891.0 56442.0 45121.0 42605.0 37894.0 38364.0 36763.0 35768.0 40488.0 41361.0 46007.0
2 1948 46161.0 45412.0 51983.0 43829.0 42003.0 37084.0 39069.0 35272.0 35314.0 39588.0 43596.0 53899.0
3 1949 87861.0 58592.0 52772.0 44154.0 41896.0 39141.0 40042.0 37372.0 36267.0 40534.0 47049.0 47918.0
4 1950 51927.0 47749.0 50439.0 47248.0 45515.0 40095.0 39798.0 38124.0 37075.0 42232.0 44418.0 49860.0

Excel: HLOOKUP() where blank cells are skipped

I am trying to create an HLOOKUP() style formula that, if it finds a matching heading where the reported value of the row it's on except if it is blank it skips it and looks for the next column with the same heading in the same row.
An example of the data table is as follows:
Heading 1 Heading 2 Heading 1 Heading 4 Heading 5 Heading 1
Sample 1 1 7 13 19
Sample 2 8 14 20 2
Sample 3 9 15 21 3
Sample 4 4 10 16 22
Sample 5 5 11 17 23
Sample 6 12 6 18 24
As you can see, the data under headings 2, 4 and 5 are all in single columns, but the heading 1 values are split between three columns.
I need the final data set to look like this:
Heading 1 Heading 2 Heading 4 Heading 5
Sample 1 1 7 13 19
Sample 2 2 8 14 20
Sample 3 3 9 15 21
Sample 4 4 10 16 22
Sample 5 5 11 17 23
Sample 6 6 12 18 24
I have done some research online and have found a formula that I thought was meant to work as a VLOOKUP(), I can't quite work out what it's doing and when I try it on a transposed version of my data set it doesn't quite do what I expect. I Have been trying to get it work in and also convert it to work in the opposite orientation. The formula is as follows:
{=INDEX($B$3:$G$8,SMALL(IF(INDEX($A$3:$G$8,,MATCH(B$11,$B$2:$G$2,0))<>"",IF($A$3:$A$8=$A12,ROW($A$3:$G$8)-ROW($A3)+$I12)),1),MATCH(B$11,$B$2:$G$2,0))}
This formula is from https://www.mrexcel.com/forum/excel-questions/689238-vlookup-match-but-ignore-blank-cells.html
Running the formula on a transposed version of my data set results in the following:
**Transposed data set**
Sample 1 Sample 2 Sample 3 Sample 4 Sample 5 Sample 6
Heading 1 1 4 5
Heading 2 7 8 9 10 11 12
Heading 1 6
Heading 4 13 14 15 16 17 18
Heading 5 19 20 21 22 23 24
Heading 1 2 3
**Result**
Sample 1 Sample 2 Sample 3 Sample 4 Sample 5 Sample 6
Heading 1 1 0 3 0 5 0 1
Heading 2 7 8 9 10 11 12 2
Heading 4 13 14 15 16 17 18 3
Heading 5 19 20 21 22 23 24 4
**Expected result**
Sample 1 Sample 2 Sample 3 Sample 4 Sample 5 Sample 6
Heading 1 1 2 3 4 5 6
Heading 2 7 8 9 10 11 12
Heading 4 13 14 15 16 17 18
Heading 5 19 20 21 22 23 24
I think that I am probably over complicating this and that there must be a simpler solution to the problem. Any help that anyone can give me would be great. Let me
Thanks!
This is maybe faaar to simple, but why don't you simply add the values of the ´Heading 1´ columns? The empty values are treated as value 0, and by the end you'll have the values you are looking for :-)

How to filter with a lead lag window on a spark dataframe?

The filter function select all the rows in a Spark dataframe which satisfy certain conditions. How would I do window-filter where a set of rows above and below the row, which satisfies the filter condition, are selected? For example, I have the following dataframe myDF:
A B
1 1
2 12
3 13
4 14
5 10
6 17
7 34
8 12
9 1
10 7
11 1
Now I want to write something like myDF.orderBy($"A").myWindowFilter("B" === 12, 2) which will give me the following dataframe (2 is the lag/lead width):
A B
1 1
2 12
3 13
4 14
6 17
7 34
8 12
9 1
10 7
How do I implement such a function myWindowFilter in Scala/Spark?

Combining and reading data from Excel (.xlsx) into Matlab

There are two parts of my query:
1) I have multiple .xlsx files stored in a folder, a total of 1 year's worth (~ 365 .xlsx files). They are named according to date: ' A_ddmmmyyyy.xlsx' (e.g. A_01Jan2016.xlsx). Each .xlsx has 5 columns of data: Date, Quantity, Latitude, Longitude, Measurement. The problem is, each .xlsx file consists about 400,000 rows of data and although I have scripts in Excel to merge them, the inherent row restriction in Excel prevents me from merging all the data together.
(i) Is there a way to read recursively the data from each .xlsx sheet into MATLAB, and specifying the variable name (i.e. Date, Quantity etc) for each column(variable) within MATLAB (there are no column headings in the .xlsx files)?
(ii) How can I merge the data for each column from each .xlsx together?
Thank you
Jefferson
Let's go by parts
First I do not recommend to join all your files data in one column, there is no need to have this information all together you can work separately with this, using for example datastore
working in matlab in mya directory:
>> pwd
ans =
/home/anquegi/learn/matlab/stackoverflow
I have a folder with a folder that have two sample excel files:
>> ls
20_hz.jpg big_data_store_analysis.m excel_files octave-workspace sample-file.log
40_hz.jpg chirp_signals.m NewCode.m sample.csv
>> ls excel_files/
A_01Jan2016.xlsx A_02Jan2016.xlsx
the content of each file is :
Date Quantity Latitude Longitude Measurement
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
7 7 7 7 7
8 8 8 8 8
9 9 9 9 9
10 10 10 10 10
11 11 11 11 11
12 12 12 12 12
13 13 13 13 13
14 14 14 14 14
15 15 15 15 15
16 16 16 16 16
17 17 17 17 17
18 18 18 18 18
19 19 19 19 19
20 20 20 20 20
21 21 21 21 21
22 22 22 22 22
Only to who how it will work.
Reading the data:
>> ssds = spreadsheetDatastore('./excel_files')
ssds =
SpreadsheetDatastore with properties:
Files: {
'/home/anquegi/learn/matlab/stackoverflow/excel_files/A_01Jan2016.xlsx';
'/home/anquegi/learn/matlab/stackoverflow/excel_files/A_02Jan2016.xlsx'
}
Sheets: ''
Range: ''
Sheet Format Properties:
NumHeaderLines: 0
ReadVariableNames: true
VariableNames: {'Date', 'Quantity', 'Latitude' ... and 2 more}
VariableTypes: {'double', 'double', 'double' ... and 2 more}
Properties that control the table returned by preview, read, readall:
SelectedVariableNames: {'Date', 'Quantity', 'Latitude' ... and 2 more}
SelectedVariableTypes: {'double', 'double', 'double' ... and 2 more}
ReadSize: 'file'
Now you have all your data in tables let's see a preview
>> data = preview(ssds)
data =
Date Quantity Latitude Longitude Measurement
____ ________ ________ _________ ___________
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
7 7 7 7 7
8 8 8 8 8
The preview is a good point to get sample data to work.
You do not need to merge you can work throught all the elements:
>> ssds.VariableNames
ans =
'Date' 'Quantity' 'Latitude' 'Longitude' 'Measurement'
>> ssds.VariableTypes
ans =
'double' 'double' 'double' 'double' 'double'
% let's get all the Latitude elements that have Date equal 1, in this case the tow files are the same, so we wil get two elements with value 1
>> reset(ssds)
accum = [];
while hasdata(ssds)
T = read(ssds);
accum(end +1) = T(T.Date == 1,:).Latitude;
end
>> accum
accum =
1 1
So you need to work with datastore and tables, is a bit tricky but very useful, you also would like to control the readsize and other variables in datastore objects. but this is a good way working with large data files in matlab
For older versions of matlab you can use a more traditional approximation:
folder='./excel_files';
filetype='*.xlsx';
f=fullfile(folder,filetype);
d=dir(f);
for k=1:numel(d);
data{k}=xlsread(fullfile(folder,d(k).name));
end
Now you have the data stored in data
folder='./excel_files';
filetype='*.xlsx';
f=fullfile(folder,filetype);
d=dir(f);
for k=1:numel(d);
data{k}=xlsread(fullfile(folder,d(k).name));
end
data
data =
[22x5 double] [22x5 double]
data{1}
ans =
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
7 7 7 7 7
8 8 8 8 8
9 9 9 9 9
10 10 10 10 10
11 11 11 11 11
12 12 12 12 12
13 13 13 13 13
14 14 14 14 14
15 15 15 15 15
16 16 16 16 16
17 17 17 17 17
18 18 18 18 18
19 19 19 19 19
20 20 20 20 20
21 21 21 21 21
22 22 22 22 22
But be carefull with a lot of large file

Excel - calculating the median without removing duplicates

I have a table that looks like this:
ID Total
3 3
3 3
3 3
4 11
4 11
4 11
4 11
4 11
4 11
6 9
6 9
7 13
7 13
7 13
7 13
7 13
7 13
7 13
7 13
7 13
7 13
7 13
7 13
7 13
I would like to calculate the median of column B (Total), excluding duplicate combinations of columns A and B. This could be achieved by constructing a table as below, and calculating the median from that table.
ID Total
3 3
4 11
6 9
7 13
Is there any way of obtaining the median without having to go through this process of manually deleting duplicates?
=MEDIAN(IF(FREQUENCY(MATCH(A2:A25&"|"&B2:B25,A2:A25&"|"&B2:B25,0),ROW(A2:A25)-MIN(ROW(A2:A25))+1),B2:B25))
There is a way with two additional columns. The first column is concatenation of ID and Total, the second counts occurences of each individual combination. Then you just do the median on those rows where the combination occurs for the first time.

Resources