I have been wondering whether it is possible to aggregate and summarize values in OpenRefine in the same way as it is done in Python and R. Example:
Table of medical appointments with 300k records
Id-patient | Age | Id-appointment | value
The result of aggregating and summarizing by patient would be:
Id-patient | last-age | mean-value
I hope this is clear enough. If that function exists in OpenRefine, it would be of great help.
The answer is "yes but"... It's possible, but a bit complicated. Let's take an example.
Id-patient,Age,Id-appointment,score
1,25,1-1,456
1,26,2-1,895
1,27,3-1,872
1,28,4-1,12
1,29,5-1,87
2,45,1-2,542
2,46,2-2,524
2,52,3-2,78
2,89,4-2,45
2,90,5-2,371
In order to do aggregate calculations per patient, you must first transform each patient into a record. To do this, move the "Id-patient" column to the beginning and use "Blank down" (the ids must be sorted beforehand with "Sort..." and "Reorder rows permanently").
After that, you can perform calculations on all the values of each record, considered as an array.
All this will be clearer with a screencast :
The formulas used in the demo are:
GREL:
sort(row.record.cells.Age.value)[-1]
GREL:
sum(row.record.cells.score.value) / length(row.record.cells.score.value)
Python/Jython:
def avg(l):
    return sum(l, 0.0) / len(l)

return avg([x for x in row['record']['cells']['score']['value']])
As you can see, you can do a lot of things with OpenRefine, especially using Python/Jython. BUT calculation is not its main purpose. OpenRefine is designed primarily to explore, clean and enrich data. It's not spreadsheet software. You could do the same much more easily with Pivot Tables in Excel, just as you can clean up some messy data with Excel even though it's not the best tool for that.
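For comparison, here is a rough sketch of the same per-patient summary in plain Python, using the sample data above. The helper name summarize_by_patient is made up for this illustration, and "last age" is assumed to mean the highest age, as in the GREL formula with sort(...)[-1].

```python
import csv
import io
from collections import defaultdict

# Sample data copied from the example above.
CSV_TEXT = """Id-patient,Age,Id-appointment,score
1,25,1-1,456
1,26,2-1,895
1,27,3-1,872
1,28,4-1,12
1,29,5-1,87
2,45,1-2,542
2,46,2-2,524
2,52,3-2,78
2,89,4-2,45
2,90,5-2,371
"""


def summarize_by_patient(csv_text):
    """Return {patient_id: (last_age, mean_score)}.

    Hypothetical helper: "last age" is assumed to be the highest age
    seen for the patient, mirroring sort(...)[-1] in the GREL formula.
    """
    ages = defaultdict(list)
    scores = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        pid = row["Id-patient"]
        ages[pid].append(int(row["Age"]))
        scores[pid].append(float(row["score"]))
    return {
        pid: (max(ages[pid]), sum(scores[pid]) / len(scores[pid]))
        for pid in ages
    }


summary = summarize_by_patient(CSV_TEXT)
for pid, (last_age, mean_score) in summary.items():
    print(pid, last_age, mean_score)
```

With a real 300k-row file you would read from the file instead of a string, but the grouping logic stays the same.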
I've recently been hired as an intern, and a challenge my area has come across is how to highlight the closest available medical appointment.
Of course, I know that in Excel such a thing would be pretty simple, with an array formula like {=MIN(IF( range > current_date_and_time ; range ))}, and as such, in Power BI it should be just as simple.
However, the Power BI table the area is dealing with shows the entire agenda, and another column indicates whether each slot has been reserved or not (with a 1 or a 0). So I'm wondering how to incorporate this condition into the Excel formula and, in the end, how to get the closest AVAILABLE appointment in the agenda.
Eventually, the idea is to apply this "filter" for each doctor, and then aggregate by area.
I also know that good manners dictate I show you the data, but this is work related, so I can't do that.
Thanks beforehand, and sorry for the trouble
The problem has been solved, though by means of a different answer. THANKS
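Since the data could not be shared, the following is only a sketch of the logic described in the question, written in Python with made-up data: keep the slots that are unreserved (flag 0) and later than the current time, then take the earliest one per doctor. The function name, the tuple layout and the doctor names are all assumptions for illustration.

```python
from datetime import datetime


def next_available_by_doctor(slots, now):
    """Map each doctor to their earliest unreserved slot after `now`.

    slots: iterable of (doctor, start, reserved), where reserved is 1 or 0.
    Doctors with no free future slot are left out of the result.
    """
    nearest = {}
    for doctor, start, reserved in slots:
        if reserved == 0 and start > now:
            if doctor not in nearest or start < nearest[doctor]:
                nearest[doctor] = start
    return nearest


# Hypothetical agenda, invented for this example.
agenda = [
    ("Dr. A", datetime(2024, 1, 10, 9, 0), 1),   # reserved, skipped
    ("Dr. A", datetime(2024, 1, 10, 11, 0), 0),
    ("Dr. A", datetime(2024, 1, 10, 10, 0), 0),
    ("Dr. B", datetime(2024, 1, 9, 15, 0), 0),   # in the past, skipped
    ("Dr. B", datetime(2024, 1, 12, 9, 0), 0),
]
now = datetime(2024, 1, 10, 8, 30)

nearest = next_available_by_doctor(agenda, now)
for doctor, start in nearest.items():
    print(doctor, start)
```

Aggregating by area would then be one more grouping step over the per-doctor results.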
I'm not able to take the means for a large dataset given that the amount of attributes is irregular.
I have posted a simplified case for the problem. It explains the problem very well.
An idea that I came up with: make a filter to condition on a single attribute. However, I still don't see a way to do this efficiently (other than doing it all by hand).
see excel file:
All help is much appreciated.
I'm basically looking for a function/method to achieve taking means of all different attributes conditioned on each person for a large dataset without doing it by hand.
You can use AVERAGEIFS() inside an IF:
=IF(OR(A2<>A1,B2<>B1),AVERAGEIFS(C:C,A:A,A2,B:B,B2),"")
The first part of the IF tests whether the row starts a new group, i.e. whether the person or the attribute has changed from the row above. If so, it uses AVERAGEIFS() to return the average of that group; otherwise it returns a blank.
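To make the logic of that formula explicit, here is a small Python sketch of the same idea. It assumes, as the formula does, that columns A, B and C hold person, attribute and value and that rows of the same group sit next to each other; the function name and the sample rows are invented for illustration.

```python
from collections import defaultdict


def group_means_on_first_row(rows):
    """Mimic =IF(OR(A2<>A1,B2<>B1),AVERAGEIFS(...),"") for every row.

    rows: list of (person, attribute, value) with groups contiguous.
    Returns a list parallel to rows: the group mean on the first row
    of each (person, attribute) group and "" on the other rows.
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for person, attribute, value in rows:
        entry = totals[(person, attribute)]
        entry[0] += value
        entry[1] += 1

    result = []
    previous_key = None
    for person, attribute, _ in rows:
        key = (person, attribute)
        if key != previous_key:          # person or attribute changed
            total, count = totals[key]
            result.append(total / count)
        else:
            result.append("")
        previous_key = key
    return result


# Hypothetical data, not taken from the asker's file.
rows = [
    ("Ann", "height", 1.0),
    ("Ann", "height", 3.0),
    ("Ann", "weight", 10.0),
    ("Bob", "height", 4.0),
    ("Bob", "height", 6.0),
    ("Bob", "height", 8.0),
]
means = group_means_on_first_row(rows)
print(means)
```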
What you want to do can be accomplished very simply with a pivot table.
Simply select one of the cells inside the range of data you want to process (see the video for general use of a pivot table: https://www.youtube.com/watch?v=iCiayB6GrpQ ),
go to the Insert tab and insert a pivot table.
Once you have it, check people, attribute, and values. Then drag people and attribute into Rows, drag value into the Values window, open the drop-down list and change it from Sum of value to Average, and you should be done. https://i.stack.imgur.com/nYEzw.png
I'm trying to rank some data in spotfire, and I'm having a bit of trouble writing a formula to calculate it. Here's a breakdown of what I am working with.
Group: the test group
SNP: what SNP I am looking at
Count: how many counts I get for the specific SNP
What I'd like to do is rank the average # of counts that are present for each SNP, within the group. Thus, I could then see, within a group, which SNP ranks #1, #2, etc.
Thanks!
TL;DR Disclaimer: You can do this, though if you are changing your cross table frequently, it may become a giant hassle. Make sure to double-check that logic is what you'd expect after any modification. Proceed with caution.
The basis of the Custom Expression you seem to be looking for is as follows:
Max(DenseRank(Count() OVER (Intersect([Group],[SNP])),"desc",[Group]))
This gives the total count of rows instead of the average; I was uncertain if "Count" was supposed to be a column or not. If you really do want to turn it into an average, make sure to adjust accordingly.
If all you have is the Group and the SNP nested on the left, you're done and good to go.
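Outside Spotfire, the ranking logic can be sketched in Python as below. Unlike the expression above, this sketch ranks by the average of a Count column, which is an assumption about what the question meant; the function name and the sample rows are made up.

```python
from collections import defaultdict


def dense_rank_by_group(rows):
    """Dense-rank SNPs by average count within each group.

    rows: iterable of (group, snp, count).
    Returns {(group, snp): rank}, where rank 1 is the highest average
    in that group and ties share the same rank.
    """
    sums = defaultdict(lambda: [0.0, 0])  # (group, snp) -> [sum, n]
    for group, snp, count in rows:
        entry = sums[(group, snp)]
        entry[0] += count
        entry[1] += 1
    means = {key: total / n for key, (total, n) in sums.items()}

    ranks = {}
    for group in {g for g, _ in means}:
        # Distinct averages in this group, highest first.
        distinct = sorted(
            {m for (g, _), m in means.items() if g == group}, reverse=True
        )
        position = {m: i + 1 for i, m in enumerate(distinct)}
        for (g, snp), m in means.items():
            if g == group:
                ranks[(g, snp)] = position[m]
    return ranks


# Hypothetical data for illustration only.
rows = [
    ("A", "rs1", 10), ("A", "rs1", 20),
    ("A", "rs2", 30),
    ("A", "rs3", 15),
    ("B", "rs1", 5),
    ("B", "rs2", 7),
]
ranks = dense_rank_by_group(rows)
for key in sorted(ranks):
    print(key, ranks[key])
```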
First issue, when you want to filter it down, it gives you the dense rank of only those in the filtered set. In some cases this is good, and what you're looking for; in others, it isn't. If you want it to hold fast to its value, regardless of filtering, you can use the same logic, but throw it in a Calculated column, instead of in the custom expression. Then, in your CrossTable Aggregation, get the max of the Calculated Column value.
Calculated Column:
DenseRank(Count() OVER (Intersect([Group],[SNP])),"desc",[Group])
Second Issue: You want to pivot by something other than Group and SNP. Perhaps, for example, by date? If you throw the Date across the top, it's going to show the same numbers for every month -- the overall numbers. This is not particularly helpful.
To a certain extent, Spotfire's Custom Expressions can handle this modification. If you switch between using a single column, you could use the following:
Max(DenseRank(Count() OVER (Intersect([${Axis.Columns.ShortDisplayName}],[Group],[SNP])),"desc",[Group],[${Axis.Columns.ShortDisplayName}]))
That would automatically pull in the column from the top, and show you the ranking for each individual process date.
However, if you start nesting, using hierarchies, renaming your columns, or having multiple aggregations and throwing (Column Names) across the top, you're going to have to pay a great deal of attention to your custom expression. You'll need to do some form of string replacement around the Axis.Columns, or use expressions instead of short names, and get rid of nests, etc.
Any layer of complexity will require this sort of analysis, so if your end-users have access to modify the pivot table... honestly, I probably wouldn't give them this column.
Third Issue: I don't know if this is an issue, exactly, but you said "Average Counts" -- Average per day? Per Month? When averaging, you will need to decide if, for example, a month is the total number of days in month or the number of days that particular payor had data. However you decide to aggregate it, make sure you're doing it on the right level.
For the record, I liked the premise of this question; it's something I'd thought would be useful before, but never took the time to try to implement, since sorting a column or limiting a table to only show the top 10 values is much simpler
I am working with two datasets in csv form (movielens latest-small dataset). Given below are the fields of both.
rating.csv
user_id movie_id rating
movie.csv
movie_id movie_name
What I want is to combine them into a single .csv with the following fields
user_id movie_id movie_name rating
So that the common column movie_id maps with corresponding movie_name.
Could that be done using Excel? If not, how can I do it?
I just need it as a dataset for my recommender engine, so any simple solution is welcome, as the end result is all that matters. I have some experience in Java, so a Java solution would be easy for me to understand and implement.
If there is some way using Excel then that would be the best. I have tried searching online and found some VLOOKUP method but couldn't clearly get it.
Also I tried some online merging tools but they just attached the sheets one after the another not mapping the column. So I have no problem using online tools too.
This is the method with a VLOOKUP formula within Excel:
The formula takes 4 arguments:
1. The value you want to look up
2. The range of data you are looking in
3. The column within (2) that contains the answer you want
4. Whether to match (1) approximately (TRUE) or exactly (FALSE)
See here for documentation on the function.
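If a script is acceptable, an exact-match lookup of this kind amounts to a dictionary lookup. Below is a minimal Python sketch using only the standard library; the column names come from the question, while the sample rows and the function name are invented. Note that, unlike VLOOKUP, which returns #N/A, this sketch leaves movie_name blank when a movie_id is not found.

```python
import csv
import io

# Hypothetical sample contents standing in for rating.csv and movie.csv.
RATINGS = """user_id,movie_id,rating
1,10,4.0
1,20,3.5
2,10,5.0
2,30,2.0
"""
MOVIES = """movie_id,movie_name
10,Movie A
20,Movie B
"""


def join_ratings_with_movies(ratings_text, movies_text):
    """Attach movie_name to each rating row via an exact match on movie_id."""
    names = {
        row["movie_id"]: row["movie_name"]
        for row in csv.DictReader(io.StringIO(movies_text))
    }
    merged = []
    for row in csv.DictReader(io.StringIO(ratings_text)):
        merged.append({
            "user_id": row["user_id"],
            "movie_id": row["movie_id"],
            "movie_name": names.get(row["movie_id"], ""),  # blank if missing
            "rating": row["rating"],
        })
    return merged


merged = join_ratings_with_movies(RATINGS, MOVIES)

# Write the combined table as CSV text.
out = io.StringIO()
writer = csv.DictWriter(
    out, fieldnames=["user_id", "movie_id", "movie_name", "rating"]
)
writer.writeheader()
writer.writerows(merged)
print(out.getvalue())
```

For the real files, open rating.csv and movie.csv with open() instead of wrapping strings in io.StringIO.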
Check out this tool - https://github.com/DataFoxCo/gocsv - it's based off of csvkit but has a ton of additional features. One of our engineers custom built it - and open sourced it to help solve some of these data issues we deal with every day :)
It will do a vlookup essentially of any sized csv in merely seconds using the join command:
gocsv join --columns 'movie_id','movie_id' --left rating.csv movie.csv > combineddata.csv
then if you still want to reorder the columns, you can do that too:
gocsv select --columns 'user_id','movie_id','movie_name','rating' combineddata.csv > combineddata-final.csv
I split the commands out up top to help explain its use - the documentation has all the examples on it also but ultimately I would really recommend pipelining it and doing it in one command like this:
cat rating.csv \
| gocsv join --left --columns 'movie_id','movie_id' movie.csv \
| gocsv select --columns 'user_id','movie_id','movie_name','rating' > combineddata.csv
I'd like to sort a time series of exam performance by one of three categories:
Ideally, a function would sort the scores by "difficulty" while still preserving chronological order. I'd like to do this without filters etc. Something like this is very close, but not quite there. Do I need to use dynamic ranges? Or can I just define data ranges in the table dialog with VLOOKUP or INDEX/MATCH?
I'm thinking a bar graph would be the easiest way to illustrate the data, but I'm open to suggestions. New scores are added every day, with varying difficulties.
Here is the spreadsheet if anyone would like to look it over.
EDIT:
The output visualization could be, for example, a clustered bar graph, but with only one label per category. The idea is that I'd like to preserve chronological order without necessarily having to mark it on the graph.
Would there, for instance, be a quick-and easy and formula-driven way to put these 14 and 17 values for "score" all together under one label? I feel like 17 bar graphs clustered too closely would be hard to read.
I realize this is more of a formatting than a formula issue, but I appreciate input with regards to both.
I would recommend you add a Table over the data in the workbook. One for verbal and one for math. The upside is that it will automatically grow with your data as you add new rows. This is very helpful because charts and other things will automatically refer to the new data. Add one with CTRL+T or Insert->Table on the Ribbon.
Once you have the Table, you can easily do the sorting bit by adding a two column sort onto the Table. This menu is accessible by right clicking in the Table and doing Sort->Custom Sort. Again, the Table is nice here because it will only sort the data within it (not the whole sheet) and will remember your settings. This lets you add new data and simply do Data->Reapply to get it to sort again. Your sort on Difficulty is going to be alphabetic unless you add a number at the front. Here is the sorting step:
With this done, you can create a quick chart based on that data. For the "implicit chronology" you can simply plot score vs. difficulty for all of them since they are sorted.
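The two-column sort described above (difficulty first, then date) can be sketched in Python as follows. The difficulty labels and sample scores are hypothetical, since the spreadsheet is not reproduced here; the explicit order mapping plays the role of the "number at the front" that avoids an alphabetic sort.

```python
from datetime import date

# Hypothetical difficulty labels with an explicit, non-alphabetic order.
DIFFICULTY_ORDER = {"Easy": 0, "Medium": 1, "Hard": 2}


def sort_scores(rows):
    """Sort (day, difficulty, score) rows by difficulty, then by date.

    Within each difficulty the rows stay in chronological order, so the
    chronology is implicit in the position of each score.
    """
    return sorted(rows, key=lambda r: (DIFFICULTY_ORDER[r[1]], r[0]))


# Made-up scores for illustration.
rows = [
    (date(2024, 1, 1), "Hard", 70),
    (date(2024, 1, 2), "Easy", 90),
    (date(2024, 1, 3), "Medium", 80),
    (date(2024, 1, 4), "Easy", 95),
    (date(2024, 1, 5), "Hard", 75),
]
ordered = sort_scores(rows)
for day, difficulty, score in ordered:
    print(difficulty, day.isoformat(), score)
```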
To get closer to that matrix style display, you can easily create a PivotTable based on this Table and let it do the organizing by date/difficulty. Here is the result of that. I am using Average as the aggregation function since it appears that no dates have more than 1 score. If they did, it would be a better choice than Sum.