Move a Range of cells based on Cell Content - excel

I am trying to use openpyxl to clean up some data ripped from a series of PDFs. The data is coming from a pandas dataframe, so the first cell on each row is the index, and based on that I want to move the range of data in a specific manner. The basic idea is to take data looking like this:
1 abc def ghi
2 jkl mno pqr
1 stu vwx yza
2 bcd efg hij
...
to look like this:
1 abc def ghi jkl mno pqr
2 null null null null
1 stu vwx yza bcd efg hij
2 null null null null
...
The code I'm currently running isn't throwing any errors, but it's also just not doing anything to the sheet. I'm not sure what I'm missing here.
excel_file = load_workbook("C:\\Users\\Jake\\Documents\\Shared-VB\\upload 2.5.23\\Excel Result.xlsx")
sh = excel_file.active
for i in range(1, sh.max_row + 1):
    if sh.cell(row=i, column=1).value == "2":
        sh.move_range("B{i}:D{i}", rows=-1, cols=3)
excel_file.save("C:\\Users\\Jake\\Documents\\Shared-VB\\upload 2.5.23\\Final Result.xlsx")
UPDATE:
It looks like comparing the value as a string was "the" issue.
I changed it to:
for i in range(1, sh.max_row + 1):
    if sh.cell(row=i, column=1).value == 4:
        sh.move_range("B{i}:D{i}", rows=-1, cols=3)
which gave me the error:
Traceback (most recent call last):
File "C:\Users\Jake\Documents\Work Projects\Python\Contract Extraction\contract_extraction_xlm_1.py", line 23, in <module>
sh.move_range("B{i}:D{i}", rows=-1, cols=3)
File "C:\Users\Jake\anaconda3\lib\site-packages\openpyxl\worksheet\worksheet.py", line 772, in move_range
cell_range = CellRange(cell_range)
File "C:\Users\Jake\anaconda3\lib\site-packages\openpyxl\worksheet\cell_range.py", line 53, in __init__
min_col, min_row, max_col, max_row = range_boundaries(range_string)
File "openpyxl\utils\cell.py", line 135, in openpyxl.utils.cell.range_boundaries
ValueError: B{i}:D{i} is not a valid coordinate or range
[Finished in 923ms]
So now the issue is that I am not sure how to make the coordinates context-sensitive based on the row that was checked in the for loop.

Per Charlie Clark: ws.move_range() takes either a range string or a CellRange() object, which is easier to parametrise. Your issue is the range string: it should either be an f-string or use the .format() method. All this is easier if you stick with the API.
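A minimal sketch of the corrected loop (the shortened file names are placeholders for the full paths above, and the integer comparison assumes the pandas index was written to the sheet as numbers):

from openpyxl import load_workbook
from openpyxl.worksheet.cell_range import CellRange

excel_file = load_workbook("Excel Result.xlsx")  # placeholder for the full path
sh = excel_file.active

for i in range(1, sh.max_row + 1):
    if sh.cell(row=i, column=1).value == 2:  # integer, not the string "2"
        # f-string, so {i} is replaced by the current row number
        sh.move_range(f"B{i}:D{i}", rows=-1, cols=3)
        # equivalently, a CellRange object is easier to parametrise:
        # sh.move_range(CellRange(min_col=2, min_row=i, max_col=4, max_row=i),
        #               rows=-1, cols=3)

excel_file.save("Final Result.xlsx")  # placeholder for the full path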

Related

Split a dataframe column into different columns

I am using the code below to split a dataframe column whose cells look like the following examples. In this case, sometimes one of the sections (Description, Consequences, Probable causes, Recommended actions) is missing from a cell.
Column cell 1
Description
abc.
Consequences
def.
Probable causes
ghi.
Recommended actions
jkl
Column cell 2
Description
mno.
Probable causes
pqr
Recommended actions
stu
and so on.
How do I split it to get:
         Description  Consequences  Probable causes  Recommended actions
1st row  abc          def           ghi              jkl
2nd row  mno          -             pqr              stu
3rd row  -            vwx           yza              bcd
4th row  efg          hij           -                klm
5th row  nop          qrs           tuv              -
and so on.
I am using:
pattern = r"^Description\n*(?P<Description>.*?)\n+Consequences\n*(?P<Consequences>.*?)\n+Probable causes\n*(?P<Cause>.*?)\n+Recommended actions\n*(?P<Actions>.*)"
df = result[0].str.extract(pattern).fillna("")
But with this I am only getting rows which have all 4 fields available. What modification do I have to make to also get the rows with missing values?
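One workable approach (a sketch, not from the original thread; it assumes each section header sits on its own line, and cells stands in for the question's result[0]) is to split each cell on the known header names instead of matching one monolithic regex, so a missing section simply falls back to a placeholder:

import re
import pandas as pd

HEADERS = ["Description", "Consequences", "Probable causes", "Recommended actions"]
# the capturing group keeps the matched header in the split result
HEADER_RE = re.compile(
    r"^(Description|Consequences|Probable causes|Recommended actions)\n",
    flags=re.MULTILINE)

def parse_cell(text):
    # parts alternates: [prefix, header, body, header, body, ...]
    parts = HEADER_RE.split(text)
    found = dict(zip(parts[1::2], (body.strip() for body in parts[2::2])))
    return pd.Series({h: found.get(h, "-") for h in HEADERS})

cells = pd.Series([
    "Description\nabc.\nConsequences\ndef.\nProbable causes\nghi.\nRecommended actions\njkl",
    "Description\nmno.\nProbable causes\npqr\nRecommended actions\nstu",
])
print(cells.apply(parse_cell))  # the second row gets "-" for Consequences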

How to filter out a first value of a column and then second value and then third and so on till nth using python

I have an Excel file as below which has multiple pieces of information in it:
Name list company
x [xyz,mno,pqr] xyz
y [abc,rst,hij] abc
x [xyz,mno,pqr] uvw
y [abc,rst,hij] def
x [xyz,mno,pqr] mno
y [abc,rst,hij] rst
From this Excel file I want to filter the Name column, take the first value, run some checks, then filter the second value, and so on. I will explain with an example for the first value:
Suppose I have filtered "x" from the Name column; then I have 3 rows. From the list column I need to check whether all three of "xyz", "mno" and "pqr" are present in the company column or not. Here "xyz" and "mno" are present in the first and third rows of the company column, but "pqr" is not present in any row, so in the output I want "pqr", as shown below:
Name list company Output
x [xyz,mno,pqr] xyz pqr
y [abc,rst,hij] abc hij
x [xyz,mno,pqr] uvw pqr
y [abc,rst,hij] def hij
x [xyz,mno,pqr] mno pqr
y [abc,rst,hij] rst hij
It looks very complex to me and I am unable to come up with any code or solution. Your help will really be appreciated.
As per the suggestion, I have used the code below:
import pandas as pd
import numpy as np

frame = pd.read_excel("Book2.xlsx")
frame_Liste = frame.Liste.values.tolist()
frame_company = frame.company.values.tolist()
frame_col3 = []
for items in frame_Liste:
    frame_col3.append(list(set(items) - set(frame_company)))
frame["output"] = frame_col3
frame.to_excel("df.xlsx", index=False)
However, I am getting output but the output is wrong and weird. I am showing you the output below:
Link for Output I got from the above code
Next time, please be more specific about your problem, i.e. that you have the data in a sheet. Also, please take care with your notation, since "[]" means a list in Python; that was confusing. Now that you have written that your data is in an Excel sheet, the problem is clear:
import pandas as pd

frame = pd.read_excel("path")
frame_Liste_as_String = frame.Liste.tolist()
# split each comma-separated string into a list
frame_Liste = [x.split(',') for x in frame_Liste_as_String]
frame_Company = frame.Company.tolist()
frame_col3 = []
for items in frame_Liste:
    # set difference: entries of the row's list missing from the company column
    frame_col3.append(list(set(items) - set(frame_Company)))
frame["col3"] = frame_col3
Name Liste Company col3
0 x xyz,mno,pqr xyz [pqr]
1 y abc,rst,hij abc [hij]
2 x xyz,mno,pqr uvw [pqr]
3 y abc,rst,hij def [hij]
4 x xyz,mno,pqr mno [pqr]
5 y abc,rst,hij rst [hij]
That should solve your problem
In case the data in your Liste column really is wrapped in brackets, change the split line to:
frame_Liste=[x.strip('][').split(',') for x in frame_Liste_as_String]
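For example, "[xyz,mno,pqr]".strip('][').split(',') yields ['xyz', 'mno', 'pqr']: str.strip removes the bracket characters from both ends before the comma split.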

How to separate specific strings from a text and add them as column names?

This is a lookalike example of the data I have, but with far fewer lines.
So imagine I have a txt file like this:
'''
Useless information 1
Useless information 2
Useless information 3
Measurement:
Len. (cm) :length of the object
Hei. (cm) :height of the object
Tp. :type of the object
~A DATA
10 5 2
8 7 2
5 6 1
9 9 1
'''
and I would like to put the values below '~A DATA' into a DataFrame. I already managed to get the DataFrame without column names (although it got a little messy, as there are nonsense lines in my code), as you can see:
import pandas as pd

with open(r'C:\Users\Lucas\Desktop\...\text.txt') as file:
    for line in file:
        if line.startswith('~A'):
            measures = line.split()[len(line):]
            break
    df = pd.read_csv(file, names=measures, sep='~A', engine='python')

newdf = df[0].str.split(expand=True)
newdf
0 1 2
0 10 5 2
1 8 7 2
2 5 6 1
3 9 9 1
Now, I would like to put 'Len', 'Hei' and 'Tp' from the text as column names on the DataFrame: just the measurement codes, without the descriptions that follow them. How can I do that to get a df like this?
Len Hei Tp
0 10 5 2
1 8 7 2
2 5 6 1
3 9 9 1
One solution would be to take every line below the string 'Measurement' (or beginning with the 'Len...' line) up to every line above the string '~A' (or ending with the 'Tp' line), and then split each of those lines. But I don't know how to do that.
Solution 1: If you want to scrape the column names from the text file itself, then you need to know from which line the column-name information starts, read the file line by line, and process only those lines that you know contain column names.
To answer the specific question you asked, assume the variable line contains one of those strings, say line = 'Len. (cm) :length of the object'. You could do regex-based splitting, where you split on any character that is not a digit or a letter.
import re
splited_line = re.split(r"[^a-zA-Z0-9]", line) #add other characters which you don't want
print(splited_line)
This results in
['Len', '', '', 'cm', '', '', 'length', 'of', 'the', 'object']
(consecutive delimiters yield empty strings, and the spaces inside the trailing description are delimiters too). Further, to get the column name you pick the first element of the list, splited_line[0].
Solution 2: If you already know the column names, you could just do
df.columns = ['Len','Hei','Tp']
Here is the complete solution to what you are looking for:
In [34]: f = open('text.txt', "rb")
    ...: flag = False
    ...: column_names = []
    ...: for line in f:
    ...:     splited_line = re.split(r"[^a-zA-Z0-9~]", line.decode('utf-8'))
    ...:     if splited_line[0] == "Measurement":
    ...:         flag = True
    ...:         continue
    ...:     elif splited_line[0] == "~A":
    ...:         flag = False
    ...:     if flag == True:
    ...:         column_names.append(splited_line[0])
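For the sample file, column_names ends up as ['Len', 'Hei', 'Tp'], so (assuming the newdf built in the question) the names can then be attached with:

newdf.columns = column_names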

How to select and display multiple columns using loc in python

Assume I have a data frame (ds) like this:
ID Name Age
1 xxc 34
2 sfg 23
3 hdg 18
I want to display the columns Name and Age as well.
Currently through this line of code
def item(id):
    return ds.loc[ds['ID'] == id]['Name'].tolist()[0]
I am able to get only the Name column value. How do I get the Age column value too?
Please note I want to retain the same code, i.e. the return statement.
Any solutions please?
list(ds.loc[ds['ID'] == 1,['Name','Age']].iloc[0])
Out[365]: ['xxc', 34]
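Folded back into the asker's function (a sketch that keeps the single return statement):

def item(id):
    # first matching row, both columns, as a plain list: ['xxc', 34] for id=1
    return list(ds.loc[ds['ID'] == id, ['Name', 'Age']].iloc[0])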

Pandas: rare data form disrupts normal statistical analysis

I'm having an issue analysing some bioinformatics data (in pandas), where a rare but valid form of data messes up the statistical analysis. This is what the data (in the dataframe group_PrESTs) normally looks like:
PrEST ID Gene pepCN1 pepCN2 pepCN3
HPRR1 CD38 5298 10158 NaN
HPRR2 EGFR 79749 85793 117274
HPRR6 EPS8 68076 62305 66599
HPRR6 EPS8 NaN NaN 141828
Here is some of the code that works on this data (PrEST_stats is another dataframe that collects the statistics):
PrEST_stats['R1'] = group_PrESTs['pepCN1'].median()
PrEST_stats['R2'] = group_PrESTs['pepCN2'].median()
PrEST_stats['R3'] = group_PrESTs['pepCN3'].median()
PrEST_stats['CN'] = PrEST_stats[['R1', 'R2', 'R3']].median(axis=1)
PrEST_stats['CN'] = PrEST_stats['CN'].round()
PrEST_stats['STD'] = PrEST_stats[['R1', 'R2', 'R3']].std(axis=1, ddof=1)
PrEST_stats['CV'] = PrEST_stats['STD'] / \
    PrEST_stats[['R1', 'R2', 'R3']].mean(axis=1) * 100
What it does is essentially this:
1. Calculate the median of the columns pepCN1, pepCN2 and pepCN3, respectively, for each gene
2. Calculate the median of the results of (1)
3. Calculate the standard deviation and coefficient of variation of the results from (1)
And this works just fine, in most cases:
The above data for gene CD38 would give two median values (R1 and R2) identical to their origins in pepCN1 & pepCN2 (since there's only one row of gene CD38).
Gene EPS8 would give R1 and R2 in a similar manner, but assign another value for R3 based on the median of the two values in the two rows for column pepCN3.
In both of these cases, the statistics would then be calculated in the correct way. The fact that the statistics are calculated after the first round of "reducing" the data (i.e. calculating the first median(s)) is intended: the three data columns represent technical replicates, and should be handled individually before "merging" them together in a final statistical value.
The problem arises in the rare cases where the data looks like this:
PrEST ID Gene pepCN1 pepCN2 pepCN3
HPRR9 PTK2B 4972 NaN NaN
HPRR9 PTK2B 17095 NaN NaN
The script will here reduce the two pepCN1 values to a single median, disregarding the fact that there are no values (i.e. no data from replicates 2 and 3) in the other data columns with which to calculate statistics. The script will function and give the correct CN value (the median), but the standard deviation and coefficient of variation will be left out (i.e. show as NaN).
In cases like this, I want the script to somehow see that reducing the data column to one value (the first median) is not the way to go. Essentially, I want it to skip calculating the first median (here: R1) and just calculate statistics on the two pepCN1 rows. Is there a way to do this? Thanks in advance!
[EDIT: new problems]
Ok, so now the code looks like this:
indexer = PrESTs.groupby('PrEST ID').median().count(1) == 1
one_replicate = PrESTs.loc[PrESTs['PrEST ID'].isin(indexer[indexer].index)]
multiple_replicates = PrESTs.loc[~PrESTs['PrEST ID'].isin(indexer[indexer].index)]
all_replicates = {0: one_replicate, 1: multiple_replicates}
all_replicates = {0: one_replicate, 1: multiple_replicates}
# Calculations (PrESTs)
PrEST_stats_1 = pd.DataFrame()
PrEST_stats_2 = pd.DataFrame()
all_stats = {0: PrEST_stats_1, 1: PrEST_stats_2}
for n in range(2):
    current_replicate = all_replicates[n].groupby(['PrEST ID', 'Gene names'])
    current_stats = all_stats[n]
    if n == 1:
        current_stats['R1'] = current_replicate['pepCN1'].median()
        current_stats['R2'] = current_replicate['pepCN2'].median()
        current_stats['R3'] = current_replicate['pepCN3'].median()
    else:
        current_stats['R1'] = current_replicate['pepCN1']  # PROBLEM (not with .median())
        current_stats['R2'] = current_replicate['pepCN2']  # PROBLEM (not with .median())
        current_stats['R3'] = current_replicate['pepCN3']  # PROBLEM (not with .median())
    current_stats['CN'] = current_stats[['R1', 'R2', 'R3']].median(axis=1)
    current_stats['CN'] = current_stats['CN'].round()
    current_stats['STD'] = current_stats[['R1', 'R2', 'R3']].std(axis=1, ddof=1)
    current_stats['CV'] = current_stats['STD'] / \
        current_stats[['R1', 'R2', 'R3']].mean(axis=1) * 100
    current_stats['STD'] = current_stats['STD'].round()
    current_stats['CV'] = current_stats['CV'].round(1)
PrEST_stats = PrEST_stats_1.append(PrEST_stats_2)
... and I have a new problem. Dividing the two cases into two new DataFrames works just fine, and what I want to do now is handle them slightly differently in the for loop above. I have checked the lines marked # PROBLEM by adding .median() there as well, which gives the same results I previously got; i.e. the rest of the code works, just NOT when I try to leave the data as it is! This is the error I get:
Traceback (most recent call last):
File "/Users/erikfas/Dropbox/Jobb/Data - QE/QEtest.py", line 110, in <module>
current_stats['R1'] = current_replicate['pepCN1']
File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/core/frame.py", line 1863, in __setitem__
self._set_item(key, value)
File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/core/frame.py", line 1938, in _set_item
self._ensure_valid_index(value)
File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/core/frame.py", line 1915, in _ensure_valid_index
raise ValueError('Cannot set a frame with no defined index '
ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series
I've tried to find out what's wrong here, but I'm drawing blanks. Is it that I have to do SOMETHING to the data instead of .median(), rather than just nothing? Or something else?
[EDIT]: Changed some lines in the above code (in the else statement):
temp.append(current_replicate['pepCN1'])
temp.append(current_replicate['pepCN2'])
temp.append(current_replicate['pepCN3'])
current_stats = pd.concat(temp)
... where temp is an empty list, but then I get the following error:
File "/Users/erikfas/Dropbox/Jobb/Data - QE/QEtest.py", line 119, in <module>
temp2 = pd.concat(temp)
File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/tools/merge.py", line 926, in concat
verify_integrity=verify_integrity)
File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/tools/merge.py", line 986, in __init__
if not 0 <= axis <= sample.ndim:
File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/core/groupby.py", line 295, in __getattr__
return self._make_wrapper(attr)
File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/core/groupby.py", line 310, in _make_wrapper
raise AttributeError(msg)
AttributeError: Cannot access attribute 'ndim' of 'SeriesGroupBy' objects, try using the 'apply' method
Is this not something I can do with group by objects?
Try this
In [28]: df
Out[28]:
id gene p1 p2 p3
0 HPRR1 CD38 5298 10158 NaN
1 HPRR2 EGFR 79749 85793 117274
2 HPRR6 EPS8 68076 62305 66599
3 HPRR6 EPS8 NaN NaN 141828
4 HPRR9 PTK2B 4972 NaN NaN
5 HPRR9 PTK2B 17095 NaN NaN
[6 rows x 5 columns]
Groupby the id field (I gather that's where you want valid medians). Figure out whether there are any invalid medians in that group (e.g. they come up NaN after combining the group).
In [53]: df.groupby('id').median().count(1)
Out[53]:
id
HPRR1 2
HPRR2 3
HPRR6 3
HPRR9 1
dtype: int64
You want to remove groups that only have 1 valid value, ok!
In [54]: df.groupby('id').median().count(1) == 1
Out[54]:
id
HPRR1 False
HPRR2 False
HPRR6 False
HPRR9 True
dtype: bool
In [30]: indexers = df.groupby('id').median().count(1) == 1
Take them out from the original data (then rerun) or fill or whatever.
In [67]: df.loc[~df.id.isin(indexers[indexers].index)]
Out[67]:
id gene p1 p2 p3
0 HPRR1 CD38 5298 10158 NaN
1 HPRR2 EGFR 79749 85793 117274
2 HPRR6 EPS8 68076 62305 66599
3 HPRR6 EPS8 NaN NaN 141828
[4 rows x 5 columns]
For your overall calculations you can do something like this. This is much preferred to appending to an initially empty DataFrame.
results = []
for r in range(2):
    # do the calcs from above to generate, say, df1 and df2
    results.append(df1)
    results.append(df2)

# concatenate the rows!
final_result = pd.concat(results)
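Putting the pieces together on the sample frame above (a sketch, not the answerer's exact code; CV is omitted but follows the same pattern as STD): the rare single-replicate groups skip the first median and compute their statistics directly on the raw values of the one populated column.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id':   ['HPRR1', 'HPRR2', 'HPRR6', 'HPRR6', 'HPRR9', 'HPRR9'],
    'gene': ['CD38', 'EGFR', 'EPS8', 'EPS8', 'PTK2B', 'PTK2B'],
    'p1':   [5298, 79749, 68076, np.nan, 4972, 17095],
    'p2':   [10158, 85793, 62305, np.nan, np.nan, np.nan],
    'p3':   [np.nan, 117274, 66599, 141828, np.nan, np.nan],
})

# groups whose per-column medians leave only one valid value
indexers = df.groupby('id')[['p1', 'p2', 'p3']].median().count(1) == 1
single = df[df.id.isin(indexers[indexers].index)]
multi = df[~df.id.isin(indexers[indexers].index)]

results = []

# normal case: median per replicate column, then stats across those medians
meds = multi.groupby(['id', 'gene'])[['p1', 'p2', 'p3']].median()
results.append(pd.DataFrame({'CN': meds.median(axis=1).round(),
                             'STD': meds.std(axis=1, ddof=1)}))

# rare case: stack() drops the all-NaN columns, leaving only the raw
# replicate values of the populated column; stats are computed on those
raw = single.set_index(['id', 'gene'])[['p1', 'p2', 'p3']].stack()
grouped = raw.groupby(level=[0, 1])
results.append(pd.DataFrame({'CN': grouped.median().round(),
                             'STD': grouped.std(ddof=1)}))

final_result = pd.concat(results)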
