groupby select value only if match - python-3.x

I got my data sorted correctly, but now I'm trying to find a way to group by the "first non-empty string value". Is there a way to do this without changing the rest of the data? first() was close, but not quite what I needed:
grouped = sortedvals.groupby(['name']).first().reset_index()
This doesn't work if the first value is empty, i.e. '',2 (my goal is to return 2), but it does work for everything else.

Use the replace function to replace blank values with np.nan:
import numpy as np
grouped = sortedvals.replace('',np.nan).groupby(['name']).first().reset_index()
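For example, a minimal sketch with made-up data (the column names name and val are placeholders, not from the original question) shows the difference: groupby().first() skips NaN values but treats an empty string as a regular value.
import numpy as np
import pandas as pd
# hypothetical data: the first value for group 'a' is an empty string
sortedvals = pd.DataFrame({'name': ['a', 'a', 'b'], 'val': ['', 2, 5]})
# without the replace, first() would return '' for group 'a'; with it, '' becomes NaN and is skipped
grouped = sortedvals.replace('', np.nan).groupby(['name']).first().reset_index()
print(grouped)  # 'a' -> 2, 'b' -> 5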

Related

Python: (partial) matching elements of a list to DataFrame columns, returning entry of a different column

I am a beginner in python and have encountered the following problem: I have a long list of strings (I took 3 now for the example):
ENSEMBL_IDs = ['ENSG00000040608',
               'ENSG00000070371',
               'ENSG00000070413']
which are partial matches of the data in column 0 of my DataFrame genes_df (first 3 entries shown):
genes_list = (['ENSG00000040608.28', 'RTN4R'],
              ['ENSG00000070371.91', 'CLTCL1'],
              ['ENSG00000070413.17', 'DGCR2'])
genes_df = pd.DataFrame(genes_list)
The task I want to perform is conceptually not that difficult: I want to compare each element of ENSEMBL_IDs to genes_df.iloc[:,0] (which are partial matches: each element of ENSEMBL_IDs is contained within column 0 of genes_df, as outlined above). If the element of ENSEMBL_IDs matches the element in genes_df.iloc[:,0] (which it does, apart from the extra numbers after the period ".XX"), I want to return the "corresponding" value stored in the second column of genes_df: the actual gene name, 'RTN4R' for example.
I want to store these in a list. So, in the end, I would be left with a list like follows:
genenames = ['RTN4R', 'CLTCL1', 'DGCR2']
Some info that might be helpful: all of the entries in ENSEMBL_IDs are unique, and all of them are for sure contained in column 0 of genes_df.
I think I am looking for something along the lines of:
genenames = []
for i in ENSEMBL_IDs:
    if i in genes_df.iloc[:,0]:
        genenames.append(...)  # the corresponding value in genes_df.iloc[:,1]
I am sorry if the question has been asked before; I kept looking and was not able to find a solution that was applicable to my problem.
Thank you for your help!
Thanks also for the edit, English is not my first language, so the improvements were insightful.
You can get rid of the part after the dot (with str.extract or str.replace) before matching the values with isin:
m = genes_df[0].str.extract('([^.]+)', expand=False).isin(ENSEMBL_IDs)
# or
m = genes_df[0].str.replace(r'\..*$', '', regex=True).isin(ENSEMBL_IDs)
out = genes_df.loc[m, 1].tolist()
Or use a regex with str.match:
pattern = '|'.join(ENSEMBL_IDs)
m = genes_df[0].str.match(pattern)
out = genes_df.loc[m, 1].tolist()
Output: ['RTN4R', 'CLTCL1', 'DGCR2']
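For reference, a small end-to-end run with the sample data from the question, using the str.extract variant:
import pandas as pd

ENSEMBL_IDs = ['ENSG00000040608', 'ENSG00000070371', 'ENSG00000070413']
genes_list = (['ENSG00000040608.28', 'RTN4R'],
              ['ENSG00000070371.91', 'CLTCL1'],
              ['ENSG00000070413.17', 'DGCR2'])
genes_df = pd.DataFrame(genes_list)

# strip the version suffix after the dot, then keep rows whose bare ID is in the list
m = genes_df[0].str.extract('([^.]+)', expand=False).isin(ENSEMBL_IDs)
genenames = genes_df.loc[m, 1].tolist()
print(genenames)  # ['RTN4R', 'CLTCL1', 'DGCR2']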

Is there a pandas function that can create a dataframe of the mean, median, and mode of selected columns?

My attempt:
# Compute the mean, median and variance for the variables sph, acous and dur. Compare their level of variability.
sad_mean = dat_songs[['spch', 'acous', 'dur']].mean()
sad_mode = dat_songs[['spch', 'acous', 'dur']].mode()
sad_median = dat_songs[['spch', 'acous', 'dur']].median()
sad_mmm = pd.DataFrame({'mean':sad_mean, 'median':sad_median, 'mode':sad_mode})
sad_mmm
Which outputs this
First of all, the median column is not right at all, and I want to know how to fix that too.
Secondly, I feel like I have seen a quicker or shorter way to do this with a simple pandas function.
My data head for reference
Simply try dat_songs.describe(). Descriptive statistics will be shown for all the numerical columns.
For selected columns:
dat_songs[['spch', 'acous', 'dur']].describe()
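If you specifically want the mean/median/mode frame the question was building, one possible sketch uses agg (this assumes the three columns are numeric; mode() can return several values per column, so only the first mode is kept here):
cols = ['spch', 'acous', 'dur']
sad_mmm = dat_songs[cols].agg(['mean', 'median'])
# mode() returns a DataFrame because of possible ties, so take its first row
sad_mmm.loc['mode'] = dat_songs[cols].mode().iloc[0]
sad_mmm = sad_mmm.T  # one row per variable, one column per statistic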

How do I select random rows without using df.sample()?

If I wanted to select rows randomly from a data frame without using df.sample(), would something like
import random
peopleCount = people.iloc[[random.randint(1, 101)], :]
work? Or am I approaching this the wrong way?
You could select N rows by using take and permutations:
import numpy as np
peopleCount = people.take(np.random.permutation(len(people))[:N])
though df.sample() would be my primary choice.
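If you would rather stay with the standard-library random module from the question, a small sketch along those lines (assuming you want N distinct rows) could be:
import random

N = 5  # however many rows you want
positions = random.sample(range(len(people)), N)  # N distinct row positions, no replacement
peopleCount = people.iloc[positions]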

I'm trying to clean my data but it returns the wrong column

I am trying to take one of my imported data sets, df19, and clean information out of it to create a second variable noneu19 where, you guessed it, EU countries are removed from the Destination column.
Here is what I ran:
noneu19=df19
noneu19["Destination"] = noneu19[~noneu19["Destination"].apply(str).str.contains('UK')]
noneu19["Destination"] = noneu19[~noneu19["Destination"].apply(str).str.contains('SWEDEN')]
noneu19["Destination"] = noneu19[~noneu19["Destination"].apply(str).str.contains('SPAIN')]
...
set(noneu19["Destination"])
(The ... replaces the 25 other lines)
What it returns is the data indexed by a completely separate column, 'Location', for some reason.
If I do set(df19['Destination']) it returns the list that I am trying to clean, so it is not a problem with the original data set. Is there an easier/cleaner/better way to do this, or a way to troubleshoot why it is returning the wrong column?
Thanks
You can create a list with all the countries in the EU, such as
EU = ['SPAIN', 'ITALY'..., 'EU_COUNTRY']
then use the isin function like this:
noneu19 = df19.loc[~df19["Destination"].isin(EU)].copy()
The function isin will check if an element of that very column is contained in the list you pass as the argument.
Approaching the problem this way, you will have more readable and maintainable code.

Can a comparator function be made from two conditions connected by an 'and' in python (For sorting)?

I have a list of type:
ans=[(a,[b,c]),(x,[y,z]),(p,[q,r])]
I need to sort the list by using the following condition :
if (ans[j][1][1]>ans[j+1][1][1]) or (ans[j][1][1]==ans[j+1][1][1] and ans[j][1][0]<ans[j+1][1][0]):
# do something (like swap(ans[j],ans[j+1]))
I was able to implement this using bubble sort, but I want a faster sorting method.
Is there a way to sort my list using the sort() or sorted() functions (with a comparator or something similar) while respecting my condition?
You can create a key function that returns a tuple; tuples are compared from left to right until one of the elements is "larger" than the other. Your input/output example is quite lacking, but I believe this will result in what you want:
def my_compare(x):
    return x[1][1], x[1][0]
ans.sort(key=my_compare)
# ans = sorted(ans, key=my_compare)
Essentially this will first compare the x[1][1] value of both ans[j] and ans[j+1], and if it's the same then it will compare the x[1][0] value. You can rearrange and add more comparators as you wish if this doesn't match your use case perfectly.
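If you need to reproduce the bubble-sort condition exactly, you can also write a real comparator and wrap it with functools.cmp_to_key. The sketch below (assuming the [1][1] and [1][0] values are comparable numbers) sorts ascending on [1][1] and, for ties, descending on [1][0], which is what the swap condition implies; with numeric values the same ordering can be written as key=lambda t: (t[1][1], -t[1][0]).
from functools import cmp_to_key

def compare(a, b):
    # a should come before b when a[1][1] is smaller,
    # or when the [1][1] values tie and a[1][0] is larger
    if a[1][1] != b[1][1]:
        return -1 if a[1][1] < b[1][1] else 1
    if a[1][0] != b[1][0]:
        return -1 if a[1][0] > b[1][0] else 1
    return 0

ans.sort(key=cmp_to_key(compare))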
