How to use extractall in Pandas and get a new column with the extracted strings? - python-3.x

I have a data frame of 15 columns read from a CSV file. I am trying to extract one part of the text of a column and create a new column containing that information on each row. Each row of 'phospho' should have only one match for my extractall pattern. Now, when I try to add the result to my data frame, I get the error:
TypeError: incompatible index of inserted column with frame index
The dataset has two columns with names and 6 columns with values (like 65.98, for example).
Ex:
accession sequence modification phospho CON_1 CON_2 CON_3 LIF1 LIF2 LIF3
P18767 [R].GAAQNIIPASTGAAK.[A] 1xTMT6plex[K15];1xTMT6plex[N-Term] 1xPhospho [S3(98.3)]
Here is the code:
a = pmap1['phospho'].str.extractall(r'([STEHRYD]\d*)')
pmap1['phosphosites'] = a
Thanks!

I created pmap1 using the following sample data:
import pandas as pd

pmap1 = pd.DataFrame(data=[['S34T44X', 1], ['E23H78Y', 2],
                           ['R49Y81Z', 3], ['D20U23X', 4]],
                     columns=['phospho', 'nn'])
When you extract all matches:
a = pmap1['phospho'].str.extractall(r'([STEHRYD]\d*)')
the result is:
           0
  match
0 0      S34
  1      T44
1 0      E23
  1      H78
  2        Y
2 0      R49
  1      Y81
3 0      D20
Note that:
The result is of DataFrame type (with a single column named 0).
It contains eight rows, so it is not clear into which row particular
matches should be inserted.
The index is actually a MultiIndex with 2 levels:
The first (unnamed) level is the index of the source row,
The second level (named match) contains the number of the
match within the current row.
E.g. in the row with index 0, 2 matches were found:
S34 - No 0,
T44 - No 1.
So you cannot directly save a as a new column of pmap1:
pmap1 has an "ordinary" index, while a has a MultiIndex
that is incompatible with it.
This is exactly what the error message says.
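You can verify this by inspecting the index of a (repr from a recent pandas version; the exact formatting may differ in yours):
a.index
# MultiIndex([(0, 0),
#             (0, 1),
#             (1, 0),
#             (1, 1),
#             (1, 2),
#             (2, 0),
#             (2, 1),
#             (3, 0)],
#            names=[None, 'match'])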
If you want to somehow "add" a to pmap1, you can e.g. "break" each match
out into a separate column the following way:
a2 = a.unstack()
Gives the result:
        0
match   0    1    2
0     S34  T44  NaN
1     E23  H78    Y
2     R49  Y81  NaN
3     D20  NaN  NaN
where the columns are a MultiIndex, so to drop its first
level, run:
a2.columns = a2.columns.droplevel()
The result is:
match 0 1 2
0 S34 T44 NaN
1 E23 H78 Y
2 R49 Y81 NaN
3 D20 NaN NaN
Then you can perform the actual join, executing:
pmap1.join(a2)
The result is:
phospho nn 0 1 2
0 S34T44X 1 S34 T44 NaN
1 E23H78Y 2 E23 H78 Y
2 R49Y81Z 3 R49 Y81 NaN
3 D20U23X 4 D20 NaN NaN
If you are unhappy about numbers as column names, you can change them as
you wish.
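For example, a hypothetical renaming (any labels you prefer will do):
a2.columns = [f'phosphosite_{i + 1}' for i in a2.columns]
# columns become: phosphosite_1, phosphosite_2, phosphosite_3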
If you are unhappy about NaN values for "missing" matches
(for rows where fewer matches have been found than in other rows),
add .fillna('') to the last instruction.
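That is:
pmap1.join(a2).fillna('')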
Edit
There is a shorter solution:
After you have created a, you can do all the rest of the processing
with a single instruction:
pmap1.join(a[0].unstack()).fillna('')
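Side note: the question states that each row of phospho should contain exactly one match. If that assumption really holds, you can skip unstacking entirely and just drop the match level (a sketch; with multiple matches per row, as in the sample data above, the assignment would fail on duplicate index labels):
pmap1['phosphosites'] = a[0].droplevel('match')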

Related

Is there a way to convert my column of incrementing integers separated by zeros to the number of intervals encountered so far in a pandas dataframe?

I'm working in pandas and I have a column in my dataframe filled with 0s and incrementing integers starting at one. I would like to add another column of integers, but that column would be a counter of how many intervals separated by zero we have encountered so far. For example, my data would look like
Index
1
2
3
0
1
2
0
1
and I would like it to look like
Index IntervalCount
1 1
2 1
3 1
0 1
1 2
2 2
0 2
1 2
Is it possible to do this with a vectorized operation, or do I have to do it iteratively? Note, it's not important that it be a new column; it could also overwrite the old one.
You can use the cumsum function:
df["IntervalCount"] = (df["Index"] == 1).cumsum()

Fixing Pandas NaN when making a new column?

I have two Pandas DataFrames:
id volume
1 100
2 200
3 300
and
id 2020-07-01 2020-07-02 ...
1 12 14
2 5 1
3 7 8
I am trying to make a new column in the first table based on the values in the second table.
df['Total_Change'] = df2.iloc[:, 0] - df2.iloc[:, -1]
df['Change_MoM'] = df2.iloc[:, -2] - df2.iloc[:, -1]
This works, but the values are all shifted down in the table by one, so that the first value is NaN and the last value is lost. My result is:
id volume Total_Change Change_MoM
1 100 NaN NaN
2 200 -2 -2
3 300 4 4
Why is this happening? I've already double checked that the df2.iloc statements are grabbing the correct values, but I don't understand why my first table is shifting the values down a row. I've also tried shifting the table up one, but that left an NaN at the bottom.
The two tables are the same size. To be clear, I want to know how to prevent the NaN from occurring in the first place, not how to replace it with some other value.
Both dfs have different indexes; a quick fix is to add reset_index():
df=df.reset_index(drop=True)
df2=df2.reset_index(drop=True)
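To see why the shift happened in the first place, here is a minimal sketch; it assumes df2 used id as its index (e.g. via set_index('id')) while df kept the default 0-based index, since pandas aligns on index labels when assigning:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'volume': [100, 200, 300]})   # default index 0, 1, 2
df2 = pd.DataFrame({'2020-07-01': [12, 5, 7], '2020-07-02': [14, 1, 8]},
                   index=[1, 2, 3])                               # index 1, 2, 3
df['Total_Change'] = df2.iloc[:, 0] - df2.iloc[:, -1]  # Series indexed 1..3
print(df)
#    id  volume  Total_Change
# 0   1     100           NaN   <- no label 0 in the assigned Series
# 1   2     200          -2.0
# 2   3     300           4.0   <- the value for label 3 is silently dropped
After reset_index(drop=True) on both frames the indexes coincide, so the subtraction and the assignment line up row by row.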

Renaming columns in dataframe w.r.t another specific column

BACKGROUND: Large excel mapping file with about 100 columns and 200 rows converted to .csv. Then stored as dataframe. General format of df as below.
The table starts with a named column (e.g. Sales), and the following two columns need to be renamed. This pattern repeats for all columns in the excel file.
Essentially: Link the subsequent 2 columns to the "parent" one preceding them.
Sales Unnamed: 2 Unnamed: 3 Validation Unnamed: 5 Unnamed: 6
0 Commented No comment Commented No comment
1 x x
2 x x
3 x x
APPROACH FOR SOLUTION: I assume it would be possible to begin with an index (e.g. index of Sales column 1 = x) and then rename the following two columns as (x+1) and (x+2).
Then take in the text for the next named column (e.g. Validation) and so on.
I know the rename() function for dataframes.
BUT, I am not sure how to apply it iteratively to change the column titles.
EXPECTED OUTPUT: Unnamed 2 & 3 changed to Sales_Commented and Sales_No_Comment, respectively.
Similarly Unnamed 5 & 6 change to Validation_Commented and Validation_No_Comment.
Again, repeated for all 100 columns of file.
EDIT: Due to the large number of cols in the file, creating a manual list to store column names is not a viable solution. I have already seen this elsewhere on SO. Also, the amount of columns and departments (Sales, Validation) changes in different excel files with the mapping. So a dynamic solution is required.
Sales Sales_Commented Sales_No_Comment Validation Validation_Commented Validation_No_Comment
0 Commented No comment Commented No comment
1 x x
2 x
3 x x x
As a python novice, I considered a possible approach for the solution using the limited knowledge I have, but not sure what this would look like as a workable code.
I would appreciate all help and guidance.
1. You need to make a list with the column names that you want.
2. Make it a dict with the old column names as the keys and the new column names as the values.
3. Use df.rename(columns=your_dictionary); a minimal sketch follows.
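For illustration, steps 2 and 3 on their own look like this (with hypothetical column names):
import pandas as pd

df = pd.DataFrame(columns=['Sales', 'Unnamed: 2', 'Unnamed: 3'])
mapping = {'Unnamed: 2': 'Sales_Commented', 'Unnamed: 3': 'Sales_No_Comment'}
df = df.rename(columns=mapping)
print(list(df.columns))  # ['Sales', 'Sales_Commented', 'Sales_No_Comment']
The full solution below builds that dictionary automatically from the first data row: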
import numpy as np
import pandas as pd
df = pd.read_excel("name of the excel file",sheet_name = "name of sheet")
print(df.head())
Output>>>
Sales Unnamed: 2 Unnamed: 3 Validation Unnamed: 5 Unnamed: 6 Unnamed: 7
0 NaN Commented No comment NaN Comment No comment Extra
1 1.0 2 1 1.0 1 1 1
2 3.0 1 1 1.0 1 1 1
3 4.0 3 4 5.0 5 6 6
4 5.0 1 1 1.0 21 3 6
# get new names based on the values in the first data row
new_column_names = []
counter = 0
for col_name in df.columns:
    if col_name[:7].strip() == "Unnamed":
        new_column_names.append(base_name + "_" + df.iloc[0, counter].replace(" ", "_"))
    else:
        base_name = col_name
        new_column_names.append(base_name)
    counter += 1
# map old column names to new ones as a dict
dictionary = dict(zip(df.columns.tolist(), new_column_names))
# rename columns
df = df.rename(columns=dictionary)
# drop the first row (it held the sub-labels) and reset the index
df = df.iloc[1:].reset_index(drop=True)
print(df.head())
Output>>
Sales Sales_Commented Sales_No_comment Validation Validation_Comment Validation_No_comment Validation_Extra
0 1.0 2 1 1.0 1 1 1
1 3.0 1 1 1.0 1 1 1
2 4.0 3 4 5.0 5 6 6
3 5.0 1 1 1.0 21 3 6

Pandas groupby value and return observation count to dataset

I have a dataset like the following:
id value
a 0
a 0
a 0
a 0
a 1
a 2
a 2
a 2
b 0
b 0
b 1
b 2
b 2
I want to groupby the "id" column and grab the number of observations in the "value" column, and return a new column in the original dataset that counts the number of times the "value" observation occurs within each id.
An example of the output I'm looking for is represented in column "output":
id value output
a 0 4
a 0 4
a 0 4
a 0 4
a 1 1
a 2 3
a 2 3
a 2 3
b 0 2
b 0 2
b 1 1
b 2 2
b 2 2
When grouping on id "a", there are 4 observations of 0, which is provided in the column "output" for each row that contains id of "a" and value of 0.
I have tried applications of groupby and apply, to no avail. Any suggestions would be very helpful. Thank you.
Update: I figured out a solution for anyone who also faces this problem, and it works well.
grouped = df.groupby(['id','value'])
df['output'] = grouped['value'].transform('count')
This will return the count of observations under each bucket and return that count to each observation that meets that criteria, as shown in the "output" column above.
Group by id and value, then count value:
data['output'] = data.groupby(['id', 'value'])['id'].transform('count')
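A runnable sketch with the question's data, to confirm both variants produce the expected output column:
import pandas as pd

data = pd.DataFrame({'id': ['a'] * 8 + ['b'] * 5,
                     'value': [0, 0, 0, 0, 1, 2, 2, 2, 0, 0, 1, 2, 2]})
data['output'] = data.groupby(['id', 'value'])['id'].transform('count')
print(data.head())
#   id  value  output
# 0  a      0       4
# 1  a      0       4
# 2  a      0       4
# 3  a      0       4
# 4  a      1       1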

Trying to ignore zero values in Excel column

In an Excel spreadsheet I am working on, there are two columns of interest: column B and column E. In column B there are some 0 values, and these are getting carried over to column E by the loop I am running with respect to column D. I want to write a Python script that ignores these 0s and picks the next-highest value, based on their frequencies, into column E.
12NC ModifiedSOCwrt12NC SOC
0 232270463903 0 0
1 232270463903 0 0
2 232270463903 0 0
3 232270463903 0 0
4 232270463903 0 RC0603FR-0738KL
5 232270463903 0 RC0603FR-0738KL
6 232270463903 0 RC0603FR-0738KL
I want to run a loop which picks non-zero values from SOC (column B) and carries them over to ModifiedSOCwrt12NC (column E) based on unique values in column D.
For example, Column B has values = [0, RCK2] in multiple rows which are based on unique values in column D. So the current loop picks the maximum occurrences of values in column B and fills it into column E. If there is a tie between occurrences of 0 and RCK2, it picks 0 as per the ASCII standard (which I don't want to happen). I want the code to pick RCK2 and fill those in column E.
Since your data is not accessible, I have created test data similar to the one below. We can read it in pandas:
import pandas as pd
df = pd.read_excel("ExcelTemplate.xlsx")
df
Index SOC Index2 12NC
0 YXGMY 0 ZJIZX 23445
1 NQHQC 0 JKJKT 23445
2 MWTLY 0 EFCYD 23445
3 RPQFE AC VLOJZ 23445
4 GPLUQ AC AKKKG 23445
5 WGYYM AC DSMLO 23445
6 XGTAQ 0 ZHGWS 45667
7 AMWDT 0 YROLO 45667
The following code will do the summarization:
1. Summarize the data on 12NC and SOC and take the count.
2. Sort by 12NC, count and SOC, with the highest count first.
3. Take the first value of SOC for each 12NC.
4. Merge with the original data to create column E.
5. Export back to Excel.
df1 = df.groupby(['12NC', 'SOC'])['Index'].count().reset_index()
df = df.merge(
    df1[df1['SOC'] != 0]
        .sort_values(by=['12NC', 'Index', 'SOC'], ascending=[True, False, True])
        .drop_duplicates(subset=['12NC'], keep='first')[['12NC', 'SOC']]
        .rename(index=str, columns={'SOC': 'ModifiedSOCwrt12NC'}),
    on=['12NC'], how='left')
df.to_excel("ExcelTemplate_modifies.xlsx", index=False)
