Excel - combining two CSV files into one with a common column

I am working with two datasets in CSV form (the MovieLens latest-small dataset). The fields of both are given below.
rating.csv
user_id movie_id rating
movie.csv
movie_id movie_name
What I want is to combine them into a single .csv with the following fields:
user_id movie_id movie_name rating
so that the common column movie_id is mapped to the corresponding movie_name.
Could that be done using Excel? If not, how can I do it?
I just need it as a dataset for my recommender engine, so any simple solution is welcome, as the end result is all that matters. I have some experience in Java, so a Java solution would be easy for me to understand and implement.
If there is some way using Excel, that would be best. I have tried searching online and found a VLOOKUP method, but I couldn't follow it clearly.
I also tried some online merging tools, but they just appended the sheets one after another without mapping the column. I have no problem using online tools either.

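This is the method with a VLOOKUP formula within Excel.
The formula takes 4 arguments:
1. The value you want to look up
2. The range of data you are looking in
3. The column within (2) that contains the answer you want
4. Whether to match on (1) approximately; FALSE means exact match
For example, assuming the movie list sits on a sheet named movie (a made-up name) with movie_id in column A and movie_name in column B, =VLOOKUP(B2, movie!$A:$B, 2, FALSE) returns the movie_name for the movie_id in cell B2.
See here for documentation on the function.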
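If Excel turns out to be awkward for a file of this size, the same exact-match lookup can be scripted. Below is a minimal Python sketch, assuming both files are comma-separated and have header rows named exactly as in the question (user_id, movie_id, rating and movie_id, movie_name). The function name and the sample rows are made up for illustration.

```python
import csv
import io


def merge_ratings_with_names(ratings_file, movies_file, out_file):
    """Left-join ratings onto movies by movie_id, like an exact-match VLOOKUP."""
    # Build the lookup table: movie_id -> movie_name (the VLOOKUP "range").
    names = {row["movie_id"]: row["movie_name"]
             for row in csv.DictReader(movies_file)}

    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["user_id", "movie_id", "movie_name", "rating"])
    for row in csv.DictReader(ratings_file):
        # Ids with no match get an empty name (Excel would show #N/A).
        name = names.get(row["movie_id"], "")
        writer.writerow([row["user_id"], row["movie_id"], name, row["rating"]])


# Tiny in-memory demo with made-up rows. For real files, pass
# open("rating.csv", newline="") and so on instead of StringIO objects.
ratings = io.StringIO("user_id,movie_id,rating\n1,10,4.0\n2,20,3.5\n3,99,5.0\n")
movies = io.StringIO("movie_id,movie_name\n10,Toy Story\n20,Heat\n")
merged = io.StringIO()
merge_ratings_with_names(ratings, movies, merged)
print(merged.getvalue())
```

Every rating row is kept, so this behaves like a left join: a movie_id missing from movie.csv simply ends up with a blank movie_name.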

Check out this tool - https://github.com/DataFoxCo/gocsv - it's based off of csvkit but has a ton of additional features. One of our engineers custom built it - and open sourced it to help solve some of these data issues we deal with every day :)
It will essentially do a VLOOKUP on a CSV of any size in mere seconds using the join command:
gocsv join --columns 'movie_id','movie_id' --left rating.csv movie.csv > combineddata.csv
Then, if you still want to reorder the columns, you can do that too:
gocsv select --columns 'user_id','movie_id','movie_name','rating' combineddata.csv > combineddata-final.csv
I split the commands out above to help explain their use (the documentation has all the examples too), but ultimately I would recommend pipelining them into one command like this:
cat rating.csv \
| gocsv join --left --columns 'movie_id','movie_id' movie.csv \
| gocsv select --columns 'user_id','movie_id','movie_name','rating' > combineddata.csv

Related

I need to sort an array in excel following the order of another one

Hello everyone and thanks a lot in advance for any help.
I'm not very good using Excel. I want to know if there is a simple way in which I can sort a small two-column data matrix following the order of a column that contains all the bird species in Colombia. I study birds and I usually do avifauna characterization studies. I've always had this problem of not being able to order efficiently using the taxonomic order of species. I have always done it by hand and it takes me a really long time.
This is the file with the example that illustrates my problem: https://docs.google.com/spreadsheets/d/1089VD4ylJiW9Xw9xRFI0ehraAc_t-qSa/edit?usp=sharing&ouid=112790797352647984659&rtpof=true&sd=true
There are two worksheets in this file, one called "Species" and the other "Data". I need to know if it is possible to sort the Data array following the order of the Species column.
I have tried creating custom lists, but a custom order does not allow me to enter more than 100 entries. I have also tried the Sort and Sortby commands without any success.
Again, thanks a lot for any help.
Regards.
On the Data worksheet, add a helper column:
C2: =MATCH(A2,Species!$A$1:$A$2000,0)
and fill down.
Then sort by the helper column.
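The same helper-column idea can be sketched outside Excel. This is a minimal Python illustration, assuming the data rows are (species, value) pairs and the reference list holds the taxonomic order. The function name and the three sample species are placeholders, not taken from the linked file.

```python
def sort_by_reference(rows, reference):
    """Sort (species, value) rows by each species' position in the reference list."""
    # Same role as the MATCH helper column: first position of each name.
    position = {}
    for index, name in enumerate(reference):
        position.setdefault(name, index)
    # Names missing from the reference (MATCH would give #N/A) go to the end.
    return sorted(rows, key=lambda row: position.get(row[0], len(reference)))


# Made-up sample standing in for the "Species" and "Data" worksheets.
species = ["Tinamus tao", "Crypturellus soui", "Anhima cornuta"]
data = [("Anhima cornuta", 3), ("Tinamus tao", 7), ("Crypturellus soui", 1)]
ordered = sort_by_reference(data, species)
print(ordered)
```

Because Python's sort is stable, rows that share a species keep their original relative order.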

Grouping Rows by variables in KableExtra

This is my first question here, so I am uncertain how to word things. However, I am looking into the kableExtra package for creating different tables than the ones I currently know in gt. This is my table output from gt.
Now what I am trying to do is group my kable table in a similar way to this. This data set is exactly what you see, with the addition of a column I have named "x" that includes Prevalence, Abundance, and Intensity for each different row. Is there a way to have a similar output for a kable table? The difficulty that I am having is because this table in gt is so lengthy it doesn't fit very well in a document. Thank you for any help.

Create a search option in Power BI dashboard based on keywords table

I have two tables:
1. One with the complete data, including a keywords column in which the keywords are comma-separated (around 25 keywords).
2. One with the unique keywords extracted from the keywords column (a single column with one keyword per row).
The task is, based on a keyword in the second table, to find the observations that have similar keywords and display them on the report.
Looks something like this:
This is a filter, which is not fulfilling my task.
(or)
I am backing this idea: https://ideas.powerbi.com/ideas/idea/?ideaid=a586deac-c465-48da-978b-30ac2a4a3245. If someone can provide any solution related to this, it would be helpful :).
I'm not sure what you are trying to achieve. If you just want to filter a visualization by selecting one of the keywords, create a measure that returns 0 or 1 (which can then be used as a filter on the visualization), using SELECTEDVALUE to grab the slicer selection and PATHCONTAINS to test it (you need to replace the commas ", " with pipes "|").
https://dax.guide/pathcontains/
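To make the logic of that measure easier to follow, here is a rough Python sketch of what it computes. This is not DAX: the function names are invented for illustration, and keeping every row when nothing is selected is an assumption about the desired behaviour. The comparison here is exact and case-sensitive, which may differ from DAX's own rules.

```python
def to_path(keywords_cell):
    """Turn 'a, b, c' into the pipe-delimited form 'a|b|c'."""
    return "|".join(part.strip() for part in keywords_cell.split(","))


def path_contains(path, item):
    """True if item is one of the pipe-delimited entries of path."""
    return item in path.split("|")


def keyword_flag(keywords_cell, selected):
    """Return 1 if the row should be shown for the selected keyword, else 0."""
    if selected is None:
        # Assumption: with no slicer selection, keep every row.
        return 1
    return int(path_contains(to_path(keywords_cell), selected))


print(keyword_flag("sql, power bi, dax", "dax"))  # whole keyword present
print(keyword_flag("sql, power bi", "power"))     # only part of a keyword
```

The point of converting to a delimited path is that whole keywords are compared, so selecting "power" does not match a row that only contains "power bi".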

Separating values that are combined in one string

I would like to solve this either in Excel or in SPSS:
I have categorical data (each number representing a medical diagnosis) that are combined into single cells. In other words, a row (patient) has multiple diagnoses. However, I would like to know the frequencies of each diagnosis. What is the best way to go about this? (See picture for reference)
For SPSS:
First just creating some sample data to demonstrate on:
data list free/e_cerv_dis_state (a20).
begin data
"{1/2/3/6}" "{1/2/4}" "{2/4/5}" "{1/5/6}" "{4}" "{4/5/6}" "{1/2/3/4/5/6}"
end data.
Now the following code will create a separate variable for each possible diagnosis, and will put a 1 in it if the diagnosis exists in the original variable.
do repeat vr=diag1 to diag9/vl=1 to 9.
compute vr=char.index(e_cerv_dis_state, string(vl, f1) ) > 0.
end repeat.
freq diag1 to diag6.
Note this will only work for up to 9 diagnoses. If you have more than that the solution will have to be adapted to multiple digits.
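For comparison, splitting on the "/" separator instead of searching for single characters avoids the nine-code limit. Here is a minimal Python sketch, assuming every cell has the form "{a/b/c}" as in the sample data above. The function name is made up for illustration.

```python
from collections import Counter


def diagnosis_frequencies(cells):
    """Count in how many cells (patients) each diagnosis code appears."""
    counts = Counter()
    for cell in cells:
        # "{1/2/3}" -> ["1", "2", "3"]; codes stay strings, so "12" is not "1".
        codes = cell.strip("{} ").split("/")
        # A set counts each diagnosis at most once per patient.
        counts.update({code for code in codes if code})
    return counts


sample = ["{1/2/3/6}", "{1/2/4}", "{2/4/5}", "{1/5/6}",
          "{4}", "{4/5/6}", "{1/2/3/4/5/6}"]
freq = diagnosis_frequencies(sample)
for code in sorted(freq, key=int):
    print(code, freq[code])
```

Since codes are compared as whole strings, a two-digit code such as 12 is counted separately from 1 and 2.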
Assuming that the number of columns is fairly regular, I would suggest using Text to Columns and then using COUNTIF on the resulting cells to count the value you want. However, there is a more robust and reproducible solution that involves SQL. You can download the free SQL Server Express edition here: https://www.microsoft.com/en-gb/sql-server/sql-server-downloads
Then you can import your table of data. Here's how to do that: How to import an Excel file into SQL Server?
Then you can use the friendlier SQL database to get the answers you want. For example, you can use a SELECT statement like the one below. Note that it only counts rows whose value is exactly '6', so the combined strings need to be split into one diagnosis per row first.
SELECT COUNT(e_cerv_dis_state)
FROM diagnoses -- placeholder: use the name of your imported table
WHERE e_cerv_dis_state = '6'
It would also be possible to use a CASE WHEN statement to add-in the names of the diagnoses.

Is it possible to make summarized table on openrefine?

I have been wondering whether it is possible to create an aggregation and summary of values in OpenRefine in the same way as it is done in Python and R. Example:
Table of medical appointments with 300k records
Id-patient | Age | Id-appointment | value
The result of aggregating and summarizing by patient would be:
Id-patient | last-age | mean-value
I hope that is clear enough. If this works in OpenRefine, it would be of great help.
The answer is "yes but"... It's possible, but a bit complicated. Let's take an example.
Id-patient,Age,Id-appointment,score
1,25,1-1,456
1,26,2-1,895
1,27,3-1,872
1,28,4-1,12
1,29,5-1,87
2,45,1-2,542
2,46,2-2,524
2,52,3-2,78
2,89,4-2,45
2,90,5-2,371
In order to do aggregate calculations per patient, you must first transform each patient into a record. To do this, move the "Id-patient" column to the beginning and use "Blank down" (the ids must be sorted beforehand with "Sort..." and "Reorder rows permanently").
After that, you can perform calculations on all the values of each record, considered as an array.
All this will be clearer with a screencast :
The formulas used in the demo are:
GREL:
sort(row.record.cells.Age.value)[-1]
GREL:
sum(row.record.cells.score.value) / length(row.record.cells.score.value)
Python/Jython:
def avg(l):
    return sum(l, 0.0) / len(l)
return avg([x for x in row['record']['cells']['score']['value']])
As you can see, you can do a lot of things with OpenRefine, especially using Python/Jython. BUT calculation is not its main purpose. OpenRefine is designed primarily to explore, clean and enrich data. It's not spreadsheet software. You could do the same much more easily with Pivot Tables in Excel, just as you can clean up some messy data with Excel even if it's not the best tool for that.
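Since the question mentions Python, here is roughly what the same summary looks like as a plain script. This is a minimal sketch using the sample table from this answer. It treats "last-age" as the highest age per patient, matching the sort(...)[-1] formula, and the function name is made up.

```python
import csv
import io
from collections import defaultdict

SAMPLE = """Id-patient,Age,Id-appointment,score
1,25,1-1,456
1,26,2-1,895
1,27,3-1,872
1,28,4-1,12
1,29,5-1,87
2,45,1-2,542
2,46,2-2,524
2,52,3-2,78
2,89,4-2,45
2,90,5-2,371
"""


def summarize(csv_text):
    """Return {patient_id: (highest_age, mean_score)} for the given CSV text."""
    ages = defaultdict(list)
    scores = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        ages[row["Id-patient"]].append(int(row["Age"]))
        scores[row["Id-patient"]].append(float(row["score"]))
    return {pid: (max(ages[pid]), sum(scores[pid]) / len(scores[pid]))
            for pid in ages}


summary = summarize(SAMPLE)
for pid, (last_age, mean_score) in summary.items():
    print(pid, last_age, mean_score)
```

Grouping by key replaces the record step: no prior sorting or blank down is needed, because each row is appended to the list for its own patient id.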
