I want to compare two datasets to determine which variables they share - data-processing

I have two RNAseq read-outs for two groups and would like to compare them. These data appear as a gene and a value. I would like to determine which genes are shared between the two datasets but they are very large and doing this manually will take a long time.
Thanks!
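A minimal sketch of one way to do this in pandas, assuming each read-out can be loaded as a table with a gene column and a value column. The column names `gene` and `value` and the sample rows are hypothetical; real files would typically be loaded with `pd.read_csv`.

```python
import pandas as pd

# Hypothetical read-outs: one row per gene, with a measured value.
group_a = pd.DataFrame({"gene": ["BRCA1", "TP53", "MYC", "EGFR"],
                        "value": [12.1, 8.4, 30.2, 5.5]})
group_b = pd.DataFrame({"gene": ["TP53", "EGFR", "KRAS"],
                        "value": [7.9, 6.1, 14.0]})

# An inner merge keeps only the genes present in both tables and
# places the two values side by side.
shared = group_a.merge(group_b, on="gene", how="inner", suffixes=("_a", "_b"))
shared_genes = sorted(shared["gene"])

# Genes found only in the first read-out, via a plain set difference.
only_a = sorted(set(group_a["gene"]) - set(group_b["gene"]))

print(shared)
print(only_a)
```

If only the gene names matter and not the values, a set intersection of the two gene columns gives the same list of shared genes.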

Related

Find matching entries and display attached row

I have used the "Statistics by categories" tool in QGIS to create Excel documents. Those documents contain tables (within the same sheet) with three kinds of information: IDs, which represent buffers around specific places; function-labels, which describe the function of the buildings within each buffer; and a count for each function-label within each buffer (ID).
I did this for two counties which happen to have overlapping buffers, which means I had to combine those tables to get a complete picture of how many buildings of certain functions are within each buffer.
For this, I used a Pivot-Table in Excel. So far so good.
[Image: excerpt from Excel showing the original tables used for the Pivot-Table, the Pivot-Table, and a column containing the relevant IDs]
Now, my original tables as well as my Pivot-Table contain IDs (buffer of specific places) that are irrelevant and make the list really long.
Additionally, I have a list / table containing the IDs of buffers that are relevant and will be used for further processing.
How can I either filter my Pivot-Table (or original sources) or rather match entries with the list of IDs that are of relevance, so that I get a list of the needed IDs AND the attached information belonging to the ID (function-label and count per function-label)?
I came across a VLOOKUP post on Stack Overflow, but it stated that VLOOKUP would ignore repeated values. As my IDs are repeated for each function-label, this would be a problem.
I tried to find a filter function in the Pivot-Table but was only able to hand-select the IDs that I want displayed in the Pivot-Table. As that's roughly 180 IDs, I don't want to do it manually. I've also seen videos that used the filter or match function, but these only showed matching numbers between two columns of numbers, not the corresponding data attached to them.
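Outside Excel, the same filtering can be sketched in pandas. This is only an illustration under assumptions: the column names `ID`, `function_label`, and `count` and the sample rows are hypothetical, and the tables would first have to be exported or read from the workbook.

```python
import pandas as pd

# Hypothetical long table: one row per (ID, function label) pair.
counts = pd.DataFrame({
    "ID": [1, 1, 2, 3, 3],
    "function_label": ["residential", "retail", "residential", "school", "retail"],
    "count": [10, 2, 7, 1, 4],
})

# Hypothetical list of the IDs that are relevant for further processing.
relevant_ids = pd.Series([1, 3], name="ID")

# isin keeps every row whose ID is in the list, so repeated IDs
# (one row per function label) are all retained.
filtered = counts[counts["ID"].isin(relevant_ids)]

# Optional: rebuild a pivot-style view from the filtered rows only.
pivot = filtered.pivot_table(index="ID", columns="function_label",
                             values="count", aggfunc="sum", fill_value=0)

print(filtered)
print(pivot)
```

Because the filter is applied row by row, repeated IDs are not a problem here, unlike a lookup that returns a single match per ID.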

combining two dataframes sorted by common columns in pandas

I've been stuck on this for a while - I have tried merging, concatenating, joins, but can't get what I want.
Given the two dataframes:
I want to combine them to get:
First I want them aligned by Start Location. If the names can be aligned, that is a bonus, but not critical. Although this example looks like the 2nd table is aligning to the 1st, it could be either way, depending on the Start Location. The two dataframes are appropriately sorted beforehand. I can't even figure this out with two dataframes, but ultimately I want to be able to combine several together.

Pandas dataframe to array for further use

I've got a dataframe which contains a CSV of selling KPIs (quantity, article number and the corresponding date).
I need to split the dataframe into multiple dataframes, each containing the data for one article number (e.g. frame1 = 123, frame2 = 345, and so on).
How can I dynamically split it like this for further use in sklearn's KMeans (to match different article numbers and their selling KPIs)?
thanks a lot
You can group by the article number using groupby:
grouped = df.groupby('article_number')
You can then access the individual groups using
grouped.groups
which maps each article number to the row labels of its group (grouped.get_group(123) returns the rows for one article number as a dataframe), or directly apply aggregation functions like grouped['quantity'].sum() to get the respective values for each group.
Also refer to the User Guide.
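A minimal sketch of the dynamic split, assuming the column names `article_number`, `quantity`, and `date` (the question does not give the actual names, and the sample rows are made up):

```python
import pandas as pd

# Hypothetical selling KPIs: quantity, article number, and date.
df = pd.DataFrame({
    "article_number": [123, 345, 123, 345, 678],
    "quantity": [5, 2, 3, 7, 1],
    "date": pd.to_datetime(["2021-01-01", "2021-01-01", "2021-01-02",
                            "2021-01-02", "2021-01-03"]),
})

# Iterating over a groupby yields (key, sub-dataframe) pairs, so a dict
# comprehension gives one dataframe per article number.
frames = {article: group for article, group in df.groupby("article_number")}

# Aggregating per group instead of splitting: total quantity per article.
totals = df.groupby("article_number")["quantity"].sum()

print(frames[123])
print(totals)
```

A dict keyed by article number avoids creating variables like frame1, frame2 by hand, and each value can be passed on for further processing.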

randomly split DataFrame by group?

I have a DataFrame where multiple rows share group_id values (very large number of groups).
Is there an elegant way to randomly split this data into training and test data in a way that the training and test sets do not share group_id?
The best process I can come up with right now is
- create a random row mask, e.g. msk = np.random.rand(len(df)) < 0.8
- apply it to the DataFrame
- check test file for rows that share group_id with training set and move these rows to training set.
This is clearly non-elegant and has multiple issues (including the possibility that the test data ends up empty). I feel like there must be a better way, is there?
Thanks
Oh, there is an easy way!
- create a list / array of unique group_ids
- create a random mask for this list
- use the mask to split the DataFrame
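The steps above can be sketched as follows. The sample data, the 0.8 training fraction, and the fixed seed are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: several rows share each group_id.
df = pd.DataFrame({
    "group_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "x": range(10),
})

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# 1. Unique group ids.
unique_ids = df["group_id"].unique()

# 2. Random mask over the ids (not over the rows); roughly 80% go to training.
mask = rng.random(len(unique_ids)) < 0.8
train_ids = unique_ids[mask]

# 3. Split the rows by whether their group id was selected for training.
in_train = df["group_id"].isin(train_ids)
train = df[in_train]
test = df[~in_train]

print(sorted(train["group_id"].unique()), sorted(test["group_id"].unique()))
```

Since the mask is drawn per group, no group_id can end up in both sets. With few groups the test set can still come out empty, so the result may need a check. scikit-learn's GroupShuffleSplit is another option for the same kind of split.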

How do I filter an Excel pivot table based on the combination of two OLAP dimensions?

I've been tasked with building some ad-hoc reports in Excel that are sourced from an SSAS OLAP cube. I don't have the ability to alter the design of the cube's dimensions currently. I've been receiving repeated requests to filter results based upon the combination of two different dimensions and their attributes.
For example:
One dimension lists locations with their hierarchies. Another dimension contains codes for the various insurance companies we work with. I'm given a list of combinations of these, concatenated with a hyphen separating them, and they are supposed to be the only combinations within the report. For example, I get things like "001-AB5". Unfortunately, there are duplicates of the codes, so I can't just pull the code, seeing that AB5 means different things for different locations, which I can't do anything about at this time either.
For some of the smaller data sets, I've used PowerPivot and just created a calculated column, and added a relationship to the list in another sheet. The issue is that now they want the drill-through actions that have been set up for the cube. Is it possible to create something like a calculated dimension in Excel (or some other means) that would be the concatenation of these without using PowerPivot?
