So I have a very large set of data (4 million rows+) with journey times between two location nodes for two separate years (2015 and 2024). These are stored in dat files in a format of:
Node A    Node B    Journey Time (s)
123       124       51.4
So I have one long file of over 4 million rows for each year. I need to interpolate journey times for a year between the two for which I have data. I've tried Power Query in Excel as well as Power BI Desktop but have had no reasonable solution beyond cutting the files into smaller < 1 million row pieces so that Excel can manage.
Any ideas?
What type of output are you looking for? PowerBI can easily handle this amount of data, but it depends what you expect your result to be. If you're looking for the average % change in node to node travel time between the two years, then PowerBI could be utilised as it is great at aggregating and comparing large datasets.
However, if you are wanting an output of every single node to node delta between those two years i.e. 4M row output, then PowerBI will calculate this, but then what do you do with it.... a 4M long table?
If you're looking to have an exported result >150K rows (PowerBI limit) or >1M rows (Excel limit), then I would use Python for that (as mentioned above)
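A minimal Python sketch of the interpolation itself, using only the standard library. It assumes the .dat files are whitespace-delimited with the three columns shown in the question, and it assumes linear interpolation between the two years is acceptable; the question does not confirm either. The function names and file names are hypothetical.

```python
def read_times(lines, delimiter=None):
    """Map (node_a, node_b) -> journey time in seconds.

    `lines` is any iterable of text lines, e.g. an open file.
    delimiter=None splits on whitespace (an assumption about the .dat layout).
    """
    times = {}
    for line in lines:
        parts = line.split(delimiter)
        if len(parts) < 3:
            continue  # blank or malformed line
        try:
            seconds = float(parts[2])
        except ValueError:
            continue  # header row or other non-numeric line
        times[(parts[0].strip(), parts[1].strip())] = seconds
    return times


def interpolate(times_start, times_end, year_start, year_end, year_target):
    """Linearly interpolate journey times for node pairs present in both years."""
    weight = (year_target - year_start) / (year_end - year_start)
    return {
        pair: t0 + weight * (times_end[pair] - t0)
        for pair, t0 in times_start.items()
        if pair in times_end  # pairs missing from either year are skipped
    }


# Usage with hypothetical file names:
# with open("times_2015.dat") as f15, open("times_2024.dat") as f24:
#     result = interpolate(read_times(f15), read_times(f24), 2015, 2024, 2020)
```

Two dictionaries of around 4 million entries each should fit in memory on a typical desktop, so there is no need to cut the files into pieces first.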
Related
I'm new to programming and data analysis in general, and need some help with a large dataset file (43 GB). It is a list of high-frequency trades for a stock containing two columns I'm interested in: Time (in UTC format including milliseconds, e.g. 2019-01-01T00:06:41.033529796Z) and price. I have managed to open the file using delimiter software and split it into 509 files, each of which fits in an Excel sheet.
I now need to compare the price change during 5 minute intervals based on the prices in this file.
My first problem is that Excel doesn't recognise the time format.
Secondly, I need to understand how, perhaps using the =FLOOR formula, to split the list of trade times into 5-minute intervals and find the difference in the corresponding prices.
I have tried making excel recognise the UTC format with no success. Any help would be appreciated!
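Outside Excel, the timestamp parsing and the 5-minute bucketing could be sketched in Python roughly as follows. This assumes "price change" means the last price minus the first price inside each interval, which the question does not spell out, and that each row is available as a (timestamp, price) pair. The function names are hypothetical.

```python
from datetime import datetime, timezone

NS_PER_S = 1_000_000_000


def parse_utc_ns(stamp):
    """Nanoseconds since the Unix epoch for e.g. 2019-01-01T00:06:41.033529796Z."""
    main, _, frac = stamp.rstrip("Z").partition(".")
    dt = datetime.strptime(main, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
    nanos = int(frac.ljust(9, "0")[:9]) if frac else 0  # pad/trim to 9 digits
    return int(dt.timestamp()) * NS_PER_S + nanos


def five_minute_changes(rows, interval_s=300):
    """Return (interval start, first price, last price, change) per interval.

    `rows` is an iterable of (timestamp string, price); it need not be sorted.
    """
    width = interval_s * NS_PER_S
    buckets = {}  # interval start (ns) -> [first_ns, first_price, last_ns, last_price]
    for stamp, price in rows:
        ns = parse_utc_ns(stamp)
        start = ns - ns % width  # floor to the interval boundary
        b = buckets.get(start)
        if b is None:
            buckets[start] = [ns, price, ns, price]
            continue
        if ns < b[0]:
            b[0], b[1] = ns, price
        if ns >= b[2]:
            b[2], b[3] = ns, price
    result = []
    for start in sorted(buckets):
        _, first_price, _, last_price = buckets[start]
        label = datetime.fromtimestamp(start // NS_PER_S, tz=timezone.utc)
        result.append((label.strftime("%Y-%m-%dT%H:%M:%SZ"),
                       first_price, last_price, last_price - first_price))
    return result
```

Because only one small record per interval is kept, the rows can be streamed from the 43 GB file line by line without splitting it first.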
I was trying to plot some reports for Covid-19 cases around the globe, using Excel and Power BI. It is definitely easier and fancier to do in Power BI, but I need an Excel file or calculation that makes sense, similar to the PBI one. What I actually want is to calculate the daily increase in new cases (as a %) and also the death rate per day, or total deaths by day, and so on.
I did some calculations here using Pivot tables (% of column total, plus one calculated field to get the death rate %), but I'm not sure how to do the daily increase/decrease. Does anyone have an idea for additional calculations?
This is copied from PBI (calculations), which I want to reproduce in Excel, but I am not sure if I can calculate it properly (last 2 pictures).
The data source from the input data is here:
https://www.ecdc.europa.eu/sites/default/files/documents/COVID-19-geographic-disbtribution-worldwide.xlsx
You need an extra column for the result you want (e.g. daily increase/decrease), then you can plot either the waterfall chart, or using techniques similar to
https://www.extendoffice.com/documents/excel/5945-excel-chart-display-percentage-change.html
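The logic of that extra column can be sketched in a few lines of Python. This assumes the input is one country's daily new-case counts in date order; the function name is hypothetical, and the percentage is left empty when the previous day is zero because the ratio is undefined there.

```python
def daily_change(values):
    """Return (value, change vs previous day, % change vs previous day) per day."""
    rows = []
    prev = None
    for v in values:
        if prev is None:
            rows.append((v, None, None))  # first day has nothing to compare with
        else:
            diff = v - prev
            pct = diff / prev * 100 if prev else None  # avoid dividing by zero
            rows.append((v, diff, pct))
        prev = v
    return rows
```

In Excel terms this corresponds to a helper column that subtracts the previous row from the current one, and a second column that divides that difference by the previous row.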
I wanted to see if anyone can assist. I have been using power pivot a fair amount over the last 2 years. I have created a report within it which uses 9 columns measuring 3 metrics. (Essentially sales over the last 2 years with a variance).
The Pivot table pulling through uses no sub-totals and is in tabular form pulling from only the one source file which is an excel workbook.
The issue is that when I pulled in the 6th to 9th columns, the query got notably slower. It now takes up to 15 minutes simply to add a filter.
Can anyone assist in making this data refresh at a normal rate? I have tried making the source file smaller, making it now so that there is only one source file and not 3 however this seems to have a minimal effect.
I have other reports with much larger amounts of data and more connections where I am not encountering this problem at all so this is a little beyond my comprehension.
I have been downloading and organizing hourly water quality data into Excel for many different states, and have organized them by year. I have done data prep for them to make sure there are no zeros and that every day of the year (DOY) has 24 values, but the time series plots were too noisy, so I now need the daily average values instead.
Each site's annual data differs in how many days are available, and sometimes whole months are missing because nothing was recorded.
So my question is, how can I develop a code to give me the average daily value linked to a specific DOY that I can apply to many different Excel sheets. The data appears like this:
And the files are saved like CA1_2012 (California Site 1 hourly data from 2012)
I know there is a lot on this topic but I have been trying everything and I can't get a code that works!
You can get the sum of the second column, grouped by the values in the first column, in MATLAB using accumarray:
[m,~,n] = unique(data(:,1));
sumdata = [m, accumarray(n,data(:,2))];
for mean I would suggest grpstats:
avgdata = grpstats(data, DOY, {'mean'});
or as @gnovice suggested:
avgdata = accumarray(DOY, data, [], @mean);
You can also get what you want by using Pivot Table in excel and group data by DOY and get the mean value for them in the table. (No coding required).
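For comparison, the same group-by-DOY average can be sketched in plain Python. This assumes each hourly record has already been reduced to a (DOY, value) pair; the function name is hypothetical. Days with missing hours are averaged over whatever values exist, and days with no data simply do not appear in the result.

```python
from collections import defaultdict


def daily_means(records):
    """Average the values for each day of year.

    `records` is an iterable of (doy, value) pairs, in any order.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for doy, value in records:
        sums[doy] += value
        counts[doy] += 1
    return {doy: sums[doy] / counts[doy] for doy in sorted(sums)}
```

The same function can be applied to every sheet or file in a loop, since it makes no assumption about how many days a given site and year contains.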
I made a little test machine that accidentally created a 'big' data set:
6 columns with +/- 550.000 rows.
The end result I am looking for is a graph with 6 lines, with the 1 - 550.000 measurements on the horizontal axis and the values in the rows on the vertical axis (capped at 200 or so). The data is a resistance measurement that should be between 0 - 30 or very big (broken); the software writes 'inf' in these cases.
My skill is limited to excel, so what have I done until now:
Imported in Excel. The measurements are valuable between 0 - 30 and inf is not good for a graph, so I did: if(cell>200){200}else{keep cell value}.
Now making a graph is a time-consuming exercise and Excel does not like it; the result is not good.
So I would like to take the average value of 60 measurements to reduce the rows to below 10.000. So =AVERAGE(H1:H60)
But I cannot get this to work.
Questions:
How do I reduce this data set and get a good graph?
Should I switch to other software that is more applicable?
FYI: I already changed the software of the testing device to take the average value of a bunch of measurements the next time... But I cannot repeat this test.
Download link of data set comma separated file 17MB
I think you are on the right track, however my guess is that you only want to get an average every 60 rows and are unsure how to do this.
Using MOD(Number, Divisor) inside an IF statement lets you specify that the average is calculated only once every x rows.
Assuming you have one header row above your data table, so that the data starts in row 2, you are looking for something along the lines of the following, entered in row 61 and filled down:
=IF(MOD(ROW(A61),60) = 1,AVERAGE(H2:H61),"")
Once you have this you can filter your average column to non-blank values and use this to create your graph.
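If switching tools is an option, the capping and the 60-row averaging can be sketched in Python as well. This assumes one column has already been read into a list of floats (Python's float("inf") parses the 'inf' text the test software writes); the function name and defaults are hypothetical.

```python
def block_means(values, block=60, cap=200.0):
    """Cap each value at `cap`, then average consecutive blocks of `block` rows.

    The final block may be shorter than `block` and is averaged over its own length.
    """
    capped = [min(v, cap) for v in values]  # inf and other large values become cap
    return [
        sum(capped[i:i + block]) / len(capped[i:i + block])
        for i in range(0, len(capped), block)
    ]
```

With 550.000 rows and a block size of 60, this gives a little over 9.000 points per column, which is below the 10.000 rows asked for.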