I currently have a spreadsheet of data that I am cleaning to import into a database for further analysis. Currently the format is this:
Country | Year | GDP
--------------------
USA | 1950 | 5
USA | 1951 | 6
...
GBR | 1950 | 4
GBR | 1951 | 5
And so on for many countries. What I want to do is transpose this data so that it is a table of Country by Year and each cell is a coordinate GDP(Country, Year). i.e.:
Country | 1950 | 1951 | ...
-------------------------
USA | 5 | 6 ...
GBR | 4 | 5
Is there an easy way to do such a transposition? I realize it does not work because each country is iterated over 'n' years so a classic transposition is unavailable, but the nice thing is the table is uniform in that each country has rows from 1950-2011. My workflow includes Excel, R and SQLite. Is there a way to structure an sql script to import the rows in this manner? I usually use a csv-to-sql converter tool but I want the db table structured in the manner of the second table.
The underlying reason for this task is that I am gathering health data from WHO (formatted like the second table) and economic indicators from Penn (formatted as in the first table), and going to look at correlations between the two, thus I want all tables to have the same schema, as to ensure the relational aspect of the database is intuitive. I tell you this because I figure someone might have an idea/workaround that might make my original request unnecessary/extraneous.
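A minimal sketch of the long-to-wide reshaping, assuming Python with pandas is acceptable alongside the tools listed (it is not part of the workflow described in the question). `DataFrame.pivot` produces one row per country and one column per year, and `to_sql` writes the result to SQLite. The table name `gdp_wide` and the in-memory database are placeholders.

```python
import sqlite3
import pandas as pd

# Sample rows in the long format from the question
long_df = pd.DataFrame({
    "Country": ["USA", "USA", "GBR", "GBR"],
    "Year": [1950, 1951, 1950, 1951],
    "GDP": [5, 6, 4, 5],
})

# One row per country, one column per year
wide_df = long_df.pivot(index="Country", columns="Year", values="GDP")
wide_df.columns = [str(year) for year in wide_df.columns]  # use text column names

# Write to SQLite; ":memory:" and "gdp_wide" are placeholder names
conn = sqlite3.connect(":memory:")
wide_df.to_sql("gdp_wide", conn)
stored = pd.read_sql("SELECT * FROM gdp_wide", conn)
print(stored)
```

Keeping both sources in the long layout and joining on Country and Year is a common alternative, so the same `pivot` call could instead be skipped for the Penn data and the WHO table reshaped the other way.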
Related
I do wonder how it is possible to make sliding windows in Pandas.
I have a dataframe with three columns.
Country | Number | DayOfTheYear
===================================
No | 50 | 0
No | 20 | 1
No | 37 | 2
I would love to see 14 day chunks for every country and day combination.
The country thing can be ignored for the moment, since I can filter those manually in some way. But imagine there is only one country: is there a smart way to get some sort of summed-up sliding window, resulting in something like the following?
Country | Sum | DatesOftheYear
===================================
No | 504 | 0-13
No | 207 | 1-14
No | 337 | 2-15
I would also accept it if the windows were disjoint, being only 0-13, 14-27, etc.
But I just cannot get it to work with Pandas. I know an old SQL solution, but does anybody have a nice idea for Pandas?
If you want a rolling window over your dataframe, you can simply use the .rolling method of pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html
In your case : df["Number"].rolling(14).sum()
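A short sketch of both variants asked for in the question, using made-up numbers (1 to 28) rather than the asker's data. The column name `RollingSum` is an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["No"] * 28,
    "Number": range(1, 29),        # made-up values 1..28
    "DayOfTheYear": range(28),
})

# Sliding window: sum of the current row and the 13 rows before it.
# The first 13 results are NaN because the window is not yet full.
df["RollingSum"] = df["Number"].rolling(14).sum()

# Disjoint windows: integer-divide the day by 14 to get a block id
# (days 0-13 -> block 0, days 14-27 -> block 1), then sum per block.
block_sums = df.groupby(["Country", df["DayOfTheYear"] // 14])["Number"].sum()

print(df["RollingSum"].iloc[13])   # sum for days 0-13
print(block_sums)
```

Note that `rolling` labels each sum with the last row of its window, so the value for days 0-13 sits on the row for day 13. With several countries, the rolling sum could be computed per group with `df.groupby("Country")["Number"].rolling(14).sum()`.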
I'm trying to create a forecasting process using hierarchical time series. My problem is that I can't find a way to create a for loop that hierarchically extracts daily time series from a pandas dataframe grouping the sum of quantities by date. The resulting daily time series should be passed to a function inside the loop, and the results stored in some other object.
Dataset
The initial dataset is a table that represents the daily sales data of 3 hierarchical levels: city, shop, product. The initial table has this structure:
+============+============+============+============+==========+
| Id_Level_1 | Id_Level_2 | Id_Level_3 | Date | Quantity |
+============+============+============+============+==========+
| Rome | Shop1 | Prod1 | 01/01/2015 | 50 |
+------------+------------+------------+------------+----------+
| Rome | Shop1 | Prod1 | 02/01/2015 | 25 |
+------------+------------+------------+------------+----------+
| Rome | Shop1 | Prod1 | 03/01/2015 | 73 |
+------------+------------+------------+------------+----------+
| Rome | Shop1 | Prod1 | 04/01/2015 | 62 |
+------------+------------+------------+------------+----------+
| ... | ... | ... | ... | ... |
+------------+------------+------------+------------+----------+
| Milan | Shop3 | Prod9 | 31/12/2018 | 185 |
+------------+------------+------------+------------+----------+
| Milan | Shop3 | Prod9 | 31/12/2018 | 147 |
+------------+------------+------------+------------+----------+
| Milan | Shop3 | Prod9 | 31/12/2018 | 206 |
+------------+------------+------------+------------+----------+
Each City (Id_Level_1) has many Shops (Id_Level_2), and each one has some Products (Id_Level_3). Each shop has a different mix of products (maybe shop1 and shop3 have product7, which is not available in other shops). All data are daily and the measure of interest is the quantity.
Hierarchical Index (MultiIndex)
I need to create a tree structure (hierarchical structure) to extract a time series for each "node" of the structure. I call a "node" a combination of the hierarchical keys, i.e. "Rome" and "Milan" are nodes of level 1, while "Rome|Shop1" and "Milan|Shop9" are nodes of level 2. In particular, I need this on level 3, because each product (Id_Level_3) has different sales in each shop of each city. Here is the strict hierarchy.
Nodes of level 3 are "Rome, Shop1, Prod1", "Rome, Shop1, Prod2", "Rome, Shop2, Prod1", and so on. The key of the nodes is logically the concatenation of the ids.
For each node, the time series is composed of two columns: Date and Quantity.
# MultiIndex dataframe
Liv_Labels = ['Id_Level_1', 'Id_Level_2', 'Id_Level_3', 'Date']
df.set_index(Liv_Labels, drop=False, inplace=True)
Then I need to extract the aggregated time series in order, while keeping the hierarchical nodes.
Level 0:
Level_0 = df.groupby(level=['Date'])['Quantity'].sum()
Level 1:
# Node Level 1 "Rome"
Level_1['Rome'] = df.loc[idx[['Rome'],:,:]].groupby(level=['Date'])['Quantity'].sum()
# Node Level 1 "Milan"
Level_1['Milan'] = df.loc[idx[['Milan'],:,:]].groupby(level=['Date'])['Quantity'].sum()
Level 2:
# Node Level 2 "Rome, Shop1"
Level_2['Rome', 'Shop1'] = df.loc[idx[['Rome'],['Shop1'],:]].groupby(level=['Date'])['Quantity'].sum()
... repeat for each level 2 node ...
# Node Level 2 "Milan, Shop9"
Level_2['Milan', 'Shop9'] = df.loc[idx[['Milan'],['Shop9'],:]].groupby(level=['Date'])['Quantity'].sum()
Attempts
I already tried creating dictionaries and multiindex, but my problem is that I can't get a proper "node" use inside the loop. I can't even extract the unique level nodes keys, so I can't collect a specific node time series.
# Get level labels
Liv_Num = 3  # number of hierarchy levels (Id_Level_1 .. Id_Level_3)
Level_Labels = ['Id_Level_'+str(n) for n in range(1, Liv_Num+1)]+['Date']
# Initialize dictionary
TimeSeries = {}
# Get Level 0 time series
TimeSeries["Level_0"] = df.groupby(level=['Date'])['Quantity'].sum()
# Get the other levels' time series, from 1 to Liv_Num
for i in range(1, Liv_Num+1):
    TimeSeries["Level_"+str(i)] = df.groupby(level=Level_Labels[0:i]+['Date'])['Quantity'].sum()
Desired result
I would like a loop that cycles through my dataset and performs these actions:
Creates a structure of all the unique node keys
Extracts the node time series grouped by Date and Quantity
Stores the time series in a structure for later use
Thanks in advance for any suggestion! Best regards.
FR
I'm currently working on a switch dataset that I polled from an sql database where each port on the respective switch has a data frame which has a time series. So to access this time series information for each specific port I represented the switches by their IP addresses and the various number of ports on the switch, and to make sure I don't re-query what I already queried before I used the .unique() method to get unique queries of each.
I set my index to be the IP and Port indices and accessed the port information like so:
def yield_df(df):
    for ip in df.index.get_level_values('ip').unique():
        for port in df.loc[ip].index.get_level_values('port').unique():
            yield df.loc[ip].loc[port]
Then I cycled the port data frames with a for loop like so:
for port_df in yield_df(adb_df):
    ...  # work with each port's data frame here
I'm sure there are faster ways to carry out these procedures in pandas, but I hope this helps you start solving your problem.
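As an alternative sketch for the hierarchical question, `groupby` on the first one, two or three id columns yields every node key together with its rows, so no explicit lookup of unique keys is needed. The sample rows are made up, the id columns are kept as ordinary columns rather than a MultiIndex, and the dictionary name `time_series` is an illustrative choice.

```python
import pandas as pd

# Small made-up sample with the question's column names
df = pd.DataFrame({
    "Id_Level_1": ["Rome", "Rome", "Rome", "Milan"],
    "Id_Level_2": ["Shop1", "Shop1", "Shop2", "Shop3"],
    "Id_Level_3": ["Prod1", "Prod2", "Prod1", "Prod9"],
    "Date": pd.to_datetime(["2015-01-01"] * 4),
    "Quantity": [50, 25, 73, 62],
})
levels = ["Id_Level_1", "Id_Level_2", "Id_Level_3"]

# Level 0: total quantity per date, stored under the empty key ()
time_series = {(): df.groupby("Date")["Quantity"].sum()}

# Levels 1..3: group by the first `depth` id columns; each group is one node
for depth in range(1, len(levels) + 1):
    for key, node_df in df.groupby(levels[:depth]):
        # Depending on the pandas version, a one-column key may be a scalar
        key = key if isinstance(key, tuple) else (key,)
        time_series[key] = node_df.groupby("Date")["Quantity"].sum()

print(time_series[("Rome", "Shop1")])
```

Each entry maps a node key such as `("Rome",)` or `("Rome", "Shop1", "Prod1")` to its daily series, so a later loop over `time_series.items()` could pass every series to a forecasting function of your choice.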
In my theoretical data set, I have a list which shows the date-time of a sale, and the employee who completed the transaction.
I know how to do grouping in order to show how many sales each employee has per day, but I'm wondering if there's a way to count how many grouped days have more than 0 sales.
For example, here's the original data set:
Employee | Order Time
A | 8/12 8:00
B | 8/12 9:00
A | 8/12 10:00
A | 8/12 14:00
B | 8/13 10:00
B | 8/13 11:00
A | 8/13 15:00
A | 8/14 12:00
Here's the pivot table that I have created:
Employee | 8/12 | 8/13 | 8/14
A | 3 | 1 | 1
B | 1 | 2 | 0
And here's what I want to know:
Employee | Working Days
A | 3
B | 2
Split your Order Time column (assumed to be B) into two, say with Text to Columns and Space as the delimiter (might need a little adjustment). Then pivot (using the Data Model) as shown:
and sum the results (outside the PT) such as with:
=SUM(F3:H3)
copied down to suit.
Columns F:G may then be hidden.
I fully support @Andrea's comment (a correction) on the above:
I think this could have been made simpler. Remove "Time" from the values of the pivot table, then move "Order" from columns to values and use a distinct count, as in the example. It should count Employee per date, making the sum unnecessary. If you scale this up, say to 50 dates, the =SUM() range has to be adjusted each time.
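If the order data can be loaded into pandas rather than pivoted in Excel (an assumption; the answers above stay in Excel), the same distinct count of dates per employee can be sketched as follows. The year 2023 is made up, since the question gives only month and day, and the `Order Date` column is an illustrative helper.

```python
import pandas as pd

orders = pd.DataFrame({
    "Employee": ["A", "B", "A", "A", "B", "B", "A", "A"],
    "Order Time": pd.to_datetime([
        "2023-08-12 08:00", "2023-08-12 09:00", "2023-08-12 10:00",
        "2023-08-12 14:00", "2023-08-13 10:00", "2023-08-13 11:00",
        "2023-08-13 15:00", "2023-08-14 12:00",
    ]),  # the year is an assumption
})

# Drop the time of day, then count distinct dates per employee
orders["Order Date"] = orders["Order Time"].dt.normalize()
working_days = orders.groupby("Employee")["Order Date"].nunique()
print(working_days)
```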
I have a table that more or less looks like this:
Team_Name | Total_Errors | Total_Volume
_______________________________________
Sam | 3 | 1350
Sam | 5 | 1100
Jamie | 7 | 1600
Mark | 3 | 1220
Jamie | 10 | 2100
Mark | 5 | 1300
Sam | 5 | 1100
Jamie | 3 | 1900
Just with a lot more rows. I want to create a formula that calculates the average Total_Errors for just the rows where Team_Name is "Jamie" or "Sam".
How do I do this?
Something like Average(If(June(Team_Name)="Jamie","Sam"......?
(the table name is June)
thanks in advance
You can use Sum/Count:
=(SUMIF(A1:A8,"Jamie",B1:B8)+SUMIF(A1:A8,"Sam",B1:B8))/(COUNTIF(A1:A8,"Jamie")+COUNTIF(A1:A8,"Sam"))
I would go with a simple pivot table that uses June as a data source.
Put your Team_Name field in Rows and Total_Errors in Values. Change the field settings of Total_Errors to Average, and set how many decimal places you want to see.
You can then apply whatever filters /Slicers you want and get your desired result.
Here's a screenshot (it's on a Mac, but you'll get the idea)
Assuming the data is located at A1:C9, enter this formula at F5. Note that the criteria range used by the formula is located at E2:E4 (see picture below):
=DAVERAGE($A$1:$C$9,$B$1,$E$2:$E$4)
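The arithmetic behind these formulas can be checked with a small pandas sketch on the eight sample rows from the question. The variable names are illustrative, and this is a cross-check rather than a replacement for the Excel formulas.

```python
import pandas as pd

june = pd.DataFrame({
    "Team_Name": ["Sam", "Sam", "Jamie", "Mark", "Jamie", "Mark", "Sam", "Jamie"],
    "Total_Errors": [3, 5, 7, 3, 10, 5, 5, 3],
    "Total_Volume": [1350, 1100, 1600, 1220, 2100, 1300, 1100, 1900],
})

# Keep only the rows for Jamie and Sam, then average their errors
mask = june["Team_Name"].isin(["Jamie", "Sam"])
average_errors = june.loc[mask, "Total_Errors"].mean()
print(average_errors)  # 33 errors over 6 rows -> 5.5
```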
Slightly awkward requirements, so I apologise if the explanation isn't overly clear.
I have two tables, with very similar data (though not identical), which I'd like to merge together and total up as follows.
Both Tables Contain the following headings
Invoice, Date, Account, No., Description, Blank, Credit, Debit, Total
However, they are for slightly different things (support and commission to be exact). Both tables contain multiple rows of data for various customers, but some customers may only be in one table or the other.
I've used pivot tables for each table individually to show the sum totals for each customer (so I have a table of every customers total support value, and a separate table for every customers total commission). Similarly to above though, customers may be in one pivot table but not the other.
What I would like is a single table to show every customer from both tables (if they are in both tables, I only want one record), with the total support (showing 0 if the customer isn't in the table), the total commission (again, 0 if the customer isn't in that table), and ideally the total overall (although this is a simple sum of the other two, so can be added in after if required).
As an example, if the relevant columns in two tables were;
Support                    Commission
Account | Total            Account | Total
-----------------          -----------------
A       | 25.00            A       | 5.00
A       | 25.00            C       | -10.00
A       | 45.00            C       | 10.00
B       | 10.00            C       | 30.00
B       | -5.00            C       | 25.00
C       | 5.00             D       | 25.00
C       | 10.00            D       | -5.00
C       | 10.00            E       | 15.00
E       | 25.00
I'm trying to end up with a table that looks like;
Account | Support Total | Commission Total | Overall Total
----------------------------------------------------------------
A | 95.00 | 5.00 | 100.00
B | 5.00 | 0.00 | 5.00
C | 25.00 | 55.00 | 80.00
D | 0.00 | 20.00 | 20.00
E | 25.00 | 15.00 | 50.00
This isn't something I'd want to do manually, as my actual tables have 2000+ rows in them.
Any help would be greatly appreciated. (I've been messing around with various Excel features for a long time now and I've run out of ideas)
Use multiple consolidation ranges (e.g. further details here - but you can stop short of creating the Table).
Ensure your separate sources have the same column labels:
N.B. 25+15 = 40 :)
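For comparison, the same consolidation can be sketched in pandas, assuming the two tables can be exported from Excel (an assumption; the answer above stays in Excel). Each table is totalled per account, the two results are aligned on the union of accounts, and missing accounts are filled with 0. The data is the example from the question.

```python
import pandas as pd

support = pd.DataFrame({
    "Account": ["A", "A", "A", "B", "B", "C", "C", "C", "E"],
    "Total": [25.0, 25.0, 45.0, 10.0, -5.0, 5.0, 10.0, 10.0, 25.0],
})
commission = pd.DataFrame({
    "Account": ["A", "C", "C", "C", "C", "D", "D", "E"],
    "Total": [5.0, -10.0, 10.0, 30.0, 25.0, 25.0, -5.0, 15.0],
})

# Total per account in each table, then align on the union of accounts
summary = pd.concat(
    [
        support.groupby("Account")["Total"].sum().rename("Support Total"),
        commission.groupby("Account")["Total"].sum().rename("Commission Total"),
    ],
    axis=1,
).fillna(0).sort_index()

summary["Overall Total"] = summary["Support Total"] + summary["Commission Total"]
print(summary)
```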