How to create a list and filter out rows from another dataframe? - python-3.x

I know this question has been asked before, but none of the solutions appear to work; they all give me the same result. I am looking for insight into what I am doing wrong.
T18_x2 and Tryp18_50 are large dataframes with different data (except for two columns). Specifically, each dataframe contains a column named 'Gene' that holds the same style of string (e.g. HSP90A_HUMAN). I would like to make a list from the Gene column in T18_x2 and use it to filter the rows in Tryp18_50 that have the same string in their "Gene" column.
My issue is that the output is simply an empty dataframe. I suspect the problem is the list (y2), because it contains duplicates of the strings in the column. I am not sure why this is happening either.
Any help would be greatly appreciated.
input:
y2 = T18_x2['Gene'].astype(str).values.tolist()
T18 = Tryp18_50[Tryp18_50['Gene'].isin(y2)]
T18
output: an empty dataframe
I have also tried:
T18 = Tryp18_50[pd.notna(Tryp18_50['Gene']) & Tryp18_50['Gene'].astype(str).str.contains('|'.join(y2))]
with the same output: again an empty dataframe.

My mistake: I had two "Gene" columns in the first dataframe.
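For anyone who hits the same empty-result symptom, here is a minimal sketch (with made-up data; the Abundance column is a hypothetical stand-in) of why a duplicated column name breaks the isin approach:

import pandas as pd

# With two 'Gene' columns, T18_x2['Gene'] returns a DataFrame, not a
# Series, so .values.tolist() yields nested lists of duplicated strings
# and isin() then matches nothing.
T18_x2 = pd.DataFrame([['HSP90A_HUMAN', 'HSP90A_HUMAN', 1.2]],
                      columns=['Gene', 'Gene', 'Abundance'])
print(T18_x2['Gene'].astype(str).values.tolist())
# [['HSP90A_HUMAN', 'HSP90A_HUMAN']]

# Keep only the first occurrence of each duplicated column name:
T18_x2 = T18_x2.loc[:, ~T18_x2.columns.duplicated()]
y2 = T18_x2['Gene'].astype(str).values.tolist()  # a flat list of strings again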


How do I drop complete rows (including all the values in them) that contain a certain value in my Pandas dataframe?

I'm trying to write a python script that finds unique values (names) and reports the frequency of their occurrence, making use of Pandas library. There's a total of around 90 unique names, which I've anonymised in the head of the dataframe pasted below.
,1,2,3,4,5
0,monday09-01-2022,tuesday10-01-2022,wednesday11-01-2022,thursday12-01-2022,friday13-01-2022
1,Anonymous 1,Anonymous 1,Anonymous 1,Anonymous 1,
2,Anonymous 2,Anonymous 4,Anonymous 5,Anonymous 5,Anonymous 5
3,Anonymous 3,Anonymous 3,,Anonymous 6,Anonymous 3
4,,,,,
I'm trying to drop any row (the full row) that matches the regex "^monday.*", meaning the word "monday" followed by any number of other characters. I want to drop/deselect every cell/value within that row.
To achieve this goal, I've tried using the line of code below (and many other approaches I found on SO).
df = df[df[1].str.contains("^monday.*", case = True, regex=True) == False]
To clarify, I'm trying to search the values of column "1" for matches of "^monday.*" and then deselect the rows (and all values in those rows) that match the regex. I've successfully removed "monday09-01-2022", "tuesday10-01-2022", etc., but I'm also losing random names that are not in the matching rows.
Any help would be very much appreciated! Thank you!
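A likely cause, judging by the empty cells in the sample (an educated guess, not a confirmed diagnosis): str.contains returns NaN for missing values, and NaN == False evaluates to False, so every row whose column "1" is empty is dropped together with the "monday" rows, taking the names in its other columns with it. Passing na=False treats missing cells as non-matches; a minimal sketch:

import pandas as pd

df = pd.DataFrame({1: ["monday09-01-2022", "Anonymous 1", None, "Anonymous 2"],
                   2: ["tuesday10-01-2022", "Anonymous 1", "Anonymous 4", None]})

# na=False maps missing cells to "no match", so those rows survive the
# filter instead of being dropped by the NaN == False comparison:
df = df[~df[1].str.contains("^monday", case=True, regex=True, na=False)]
print(df)  # keeps every row except the "monday..." one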

split a column based on a delimiter and then unpivot the result while preserving other columns

I need to split a column into multiple rows and then unpivot them while preserving one or more other columns. How can I achieve this in Python 3?
See the example below:
import numpy as np
import pandas as pd

data = np.array(['a0', 'a1,a2', 'a2,a3'])
pk = np.array([1, 2, 3])
df = pd.DataFrame({'data': data, 'PK': pk})
df
df['data'].apply(lambda x: pd.Series(str(x).split(","))).stack()
What I need is:
data  pk
a0    1
a1    2
a2    2
a2    3
a3    3
Is there any way to achieve this without merging and resetting indexes, as mentioned here?
Convert column data into list and explode the data frame
Data
import numpy as np
import pandas as pd
from pyspark.sql import functions as F

data = np.array(['a0', 'a1,a2', 'a2,a3'])
pk = np.array([1, 2, 3])
df = pd.DataFrame({'data': data, 'PK': pk})
df = spark.createDataFrame(df)  # assumes an active SparkSession bound to the name `spark`
Solution
df.withColumn('data', F.explode(F.split(F.col('data'), ','))).show()
explode is the keyword to search for (thanks to wwnde for pointing it out), and this can be done easily in plain pandas with existing libraries.
The first step is converting the delimited column to a list:
df = df.assign(Data=df.data.str.split(","))
and then explode
df.explode('Data')
If you are reading from Excel and pandas detects a list of numbers as int, or you need to do the explode multiple times, the same pattern applies; see the runnable sketch below.
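Putting the pandas-only steps together into a runnable sketch (same toy data as the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({'data': np.array(['a0', 'a1,a2', 'a2,a3']),
                   'PK': np.array([1, 2, 3])})

# Split the delimited strings into lists, then emit one row per element;
# explode repeats the other columns (here PK) without any merge or reset_index.
out = df.assign(data=df['data'].str.split(',')).explode('data')
print(out)
#   data  PK
# 0   a0   1
# 1   a1   2
# 1   a2   2
# 2   a2   3
# 2   a3   3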

Pandas: get first datetime-in and last datetime-out in one row

First of all, thanks in advance; there are always answers here, so we learn a lot from the experts. I'm a noob using pandas (it's super handy for what I've tried and achieved so far).
I have this data, handed to me like this (I don't have access to the origin), sometimes 20k rows or more. The 'in' and 'out' columns may have one or more entries per date, so an 'in' may be followed by an 'out' or by another 'in', which leaves me a blank cell; that's the problem (see first image).
I want to filter the first datetime-in into one column and the last datetime-out into another, with the two in one row (see second image); the data comes in a csv file. I am doing this particular work manually with LibreOffice Calc (yeap).
So far I have tried locating and relocating, merging, grouping... nothing works for me, so I feel frustrated. Would you please lend me a hand? Here is a minimal sample of the file.
By the way, English is not my first language. Thanks so much!
First:
out_column = df["out"].tolist()
This gives you all the out dates as a list; we will need that later.
in_column = df["in"].tolist()  # "in" is a Python keyword, so I suggest renaming that column
I treat NaT as NaN (null) in this case.
Now we have to find what rows to keep, which we do by going through the in column and only keeping the rows after a NaN (and the first one):
filtered_df = []
tracker = False
for index, element in enumerate(in_column):
    if index == 0 or tracker is True:
        filtered_df.append(True)
        tracker = False
        continue
    if pd.isna(element):  # catches NaT/NaN, which `element is None` would miss
        tracker = True
    filtered_df.append(False)
Then you filter your df by this Boolean List:
df = df[filtered_df]
Now you fix up your out column by removing the null values:
out_column = [x for x in out_column if pd.notna(x)]  # `null` is not a Python name; drop NaT/NaN like this
Last but not least you overwrite your old out column with the new one:
df["out"] = out_column

Remove rows from data frame for which a column equals one of the values in a vector

I have a data frame with two columns, x and y.
Now I want to remove all rows where column x is equal to either 1 or 3.
How can I do that?
Setting rm <- c(1, 3) and then df <- df[!df$x == rm, ] does not work.
df <- data.frame(x = c(1, 2, 3, 4, 4, 4, 4, 2, 2, 3, 3), y = 1:11)
rm <- c(1, 3)
df <- df[!df$x == rm, ]
Found an answer, so just in case anybody checks this question later on:
df <- df[!df$x %in% rm, ]
The == version fails because rm is recycled element-wise along df$x (1 is compared at the odd positions and 3 at the even ones), so it tests positional equality rather than membership; %in% tests membership.

Adding zero columns to a dataframe

I have a strange problem which I am not able to figure out. I have a dataframe subset that looks like this.
In the dataframe, I add "zero" columns using the following code:
subset['IRNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))
subset['IPNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))
and I get a result similar to this.
Now when I do similar things to another dataframe, I get zero columns with a mix of NaN and zero rows, as shown below. This is really strange.
subset['IRNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))
subset['IPNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))
I don't understand why I sometimes get zeros and other times either NaNs or a mix of NaNs and zeros. Please help if you can.
Thanks
I believe you need assign with a dictionary to set the new column names:
subset = subset.assign(**dict.fromkeys(['IRNotional','IPNotional'], 0))
#you can define each column separately
#subset = subset.assign(**{'IRNotional': 0, 'IPNotional': 1})
Or simpler:
subset['IRNotional'] = 0
subset['IPNotional'] = 0
Now when I do similar things to another dataframe, I get zero columns with a mix of NaN and zero rows, as shown below. This is really strange.
I think the problem is different index values, so it is necessary to create matching indices; otherwise, the non-matching indices produce NaN:
subset['IPNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)), index=subset.index)
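A minimal sketch with made-up data showing the alignment effect:

import numpy as np
import pandas as pd

# A subset whose index no longer starts at 0, e.g. after filtering:
subset = pd.DataFrame({'A': [10, 20, 30]}, index=[5, 6, 7])

# The zeros frame gets a fresh RangeIndex (0, 1, 2); assignment aligns
# on index labels, nothing matches, and the new column is all NaN:
subset['IRNotional'] = pd.DataFrame(np.zeros(shape=(len(subset), 1)))

# Passing the subset's own index (or simply assigning a scalar) fixes it:
subset['IPNotional'] = pd.DataFrame(np.zeros(shape=(len(subset), 1)), index=subset.index)
subset['Simplest'] = 0
print(subset)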
