selecting all cells between two strings in a column - python-3.x

I previously posted this question as "using “.between” for string values not working in python", but I was not clear enough and could not edit it, so I am reposting it with more clarity here.
I have a DataFrame. In [0, 61] I have a string, and in [0, 69] I have another string. I want to slice all the data in the cells [0, 62:68] between these two, merge it, and paste the result into [1, 61]. Afterwards, [0, 62:68] will be blank, but that is not important.
However, I have several hundred documents and want to write a script that runs on all of them. The strings in [0, 61] and [0, 69] are always present in every document, but at different locations in that column. So I tried:
For_Paste = df[0][df[0].between('DESCRIPTION OF WORK / STATEMENT OF WORK', 'ADDITIONAL REQUIREMENTS / SUPPORTING DOCUMENTATION', inclusive = False)]
But the output I get is: Series([], Name: 0, dtype: object)
I was expecting a list or array with the desired data that I could merge and paste. Thanks.

Series.between() compares values, not row positions; for strings that is a lexicographic comparison (and since the start string sorts after the end string alphabetically, nothing can lie between them), which is why you get an empty Series. If you want to select the rows strictly between two index labels (say idx_start and idx_end, excluding both of those rows) in column col of the DataFrame df, you will want to use
df.loc[idx_start + 1 : idx_end - 1, col]
Note that .loc slicing includes both endpoints, hence the - 1.
To find the first index matching a string s, use
idx = df.index[df[col] == s][0]
So for your case, to return a Series of the rows between these two indices, try the following:
start_string = 'DESCRIPTION OF WORK / STATEMENT OF WORK'
end_string = 'ADDITIONAL REQUIREMENTS / SUPPORTING DOCUMENTATION'
idx_start = df.index[df[0] == start_string][0]
idx_end = df.index[df[0] == end_string][0]
For_Paste = df.loc[idx_start + 1 : idx_end - 1, 0]   # .loc includes both endpoints
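Putting the pieces together for the original task (merging the slice and pasting it into [1, 61]), here is a minimal sketch; the merge_between name, the space used as join separator, and the toy rows at the end are assumptions for illustration only:
import pandas as pd

START = 'DESCRIPTION OF WORK / STATEMENT OF WORK'
END = 'ADDITIONAL REQUIREMENTS / SUPPORTING DOCUMENTATION'

def merge_between(df):
    # First row in column 0 that holds each marker string.
    idx_start = df.index[df[0] == START][0]
    idx_end = df.index[df[0] == END][0]
    # .loc slicing includes both endpoints, so step in by one on each side.
    between = df.loc[idx_start + 1 : idx_end - 1, 0]
    # Merge the cells into one string and paste it into column 1,
    # on the same row as the start marker.
    df.loc[idx_start, 1] = ' '.join(between.dropna().astype(str))
    return df

# Tiny illustration with made-up rows:
doc = pd.DataFrame({0: ['header', START, 'line one', 'line two', END, 'footer']})
print(merge_between(doc).loc[1, 1])   # line one line two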

Related

Extract multiple text string to form new row

I am trying to convert arrays stored in the rows of a CSV into multiple rows. Currently the data looks like this:
test = result['properties.techniques'].dropna()
print(test)
['T1078','T1036']
['T1036']
I can add the following line to extract the individual items -
test = result['properties.techniques'].dropna()
techniques = result['properties.techniques'].str.extract(r"(T[0-9]{4})").dropna()[0]
print(techniques)
T1078
T1036
This, however, only extracts one string per row.
How do I ensure that every item ends up in its own row?
Using .explode():
techniques = result.explode("properties.techniques").reset_index(drop=True)
print(techniques)
Output:
properties.techniques
0 T1078
1 T1036
2 T1036
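For reference, a self-contained sketch of the same idea; it assumes the column really holds Python lists (if the CSV stored them as strings such as "['T1078','T1036']", they would need to be parsed first, e.g. with ast.literal_eval):
import pandas as pd

# Hand-built column of lists mirroring the data shown above.
result = pd.DataFrame({'properties.techniques': [['T1078', 'T1036'], ['T1036']]})

# explode() turns every element of a list-valued cell into its own row.
techniques = result.explode('properties.techniques').reset_index(drop=True)
print(techniques)
#   properties.techniques
# 0                 T1078
# 1                 T1036
# 2                 T1036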

Python: (partial) matching elements of a list to DataFrame columns, returning entry of a different column

I am a beginner in Python and have encountered the following problem: I have a long list of strings (I show 3 here as an example):
ENSEMBL_IDs = ['ENSG00000040608',
'ENSG00000070371',
'ENSG00000070413']
which are partial matches of the data in column 0 of my DataFrame genes_df (first 3 entries shown):
genes_list = (['ENSG00000040608.28', 'RTN4R'],
['ENSG00000070371.91', 'CLTCL1'],
['ENSG00000070413.17', 'DGCR2'])
genes_df = pd.DataFrame(genes_list)
The task I want to perform is conceptually not that difficult: I want to compare each element of ENSEMBL_IDs to genes_df.iloc[:,0] (these are partial matches: each element of ENSEMBL_IDs is contained within column 0 of genes_df, as outlined above). If the element of ENSEMBL_IDs matches the element in genes_df.iloc[:,0] (which it does, apart from the extra numbers after the period ".XX"), I want to return the "corresponding" value stored in the second column of the genes_df DataFrame: the actual gene name, 'RTN4R' for example.
I want to store these in a list. So, in the end, I would be left with a list like follows:
`genenames = ['RTN4R', 'CLTCL1', 'DGCR2']`
Some info that might be helpful: all of the entries in ENSEMBL_IDs are unique, and all of them are for sure contained in column 0 of genes_df.
I think I am looking for something along the lines of:
`genenames = []
for i in ENSEMBL_IDs:
    if i in genes_df.iloc[:, 0]:
        genenames.append( ... )  # the corresponding value in genes_df.iloc[:, 1]`
I am sorry if the question has been asked before; I kept looking and was not able to find a solution that was applicable to my problem.
Thank you for your help!
Thanks also for the edit, English is not my first language, so the improvements were insightful.
You can get rid of the part after the dot (with str.extract or str.replace) before matching the values with isin:
m = genes_df[0].str.extract('([^.]+)', expand=False).isin(ENSEMBL_IDs)
# or
m = genes_df[0].str.replace(r'\..*$', '', regex=True).isin(ENSEMBL_IDs)
out = genes_df.loc[m, 1].tolist()
Or use a regex with str.match:
pattern = '|'.join(ENSEMBL_IDs)
m = genes_df[0].str.match(pattern)
out = genes_df.loc[m, 1].tolist()
Output: ['RTN4R', 'CLTCL1', 'DGCR2']
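A minimal runnable version of the first approach, using the sample data from the question:
import pandas as pd

ENSEMBL_IDs = ['ENSG00000040608', 'ENSG00000070371', 'ENSG00000070413']
genes_df = pd.DataFrame([['ENSG00000040608.28', 'RTN4R'],
                         ['ENSG00000070371.91', 'CLTCL1'],
                         ['ENSG00000070413.17', 'DGCR2']])

# Drop the version suffix after the dot, then keep only the matching rows.
m = genes_df[0].str.replace(r'\..*$', '', regex=True).isin(ENSEMBL_IDs)
genenames = genes_df.loc[m, 1].tolist()
print(genenames)   # ['RTN4R', 'CLTCL1', 'DGCR2']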

How to delete a certain row if it contains a particular string in it using python?

I have to create a clean list in which rows whose names contain 'Trust' or 'Trustee' are deleted.
I'm using the following code, but I'm not getting the desired result:
df_clean = df[~df['Row Labels'].str.contains('trusteeship')]
e.g. if 'Row Labels' contains a row with ABC Trust or XYTrusteeshipZ, then the whole row should be deleted.
df_clean = df[~df['Row Labels'].str.contains('Trust')]
df_clean = df[~df['Row Labels'].str.lower().str.contains('trust')]
You can pass the case=False parameter to ignore upper/lowercase differences:
df_clean = df[~df['Row Labels'].str.contains('trust', case=False)]
Or first convert the values to lowercase, as mentioned by #anon01 in the comments:
df_clean = df[~df['Row Labels'].str.lower().str.contains('trust')]
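A small self-contained demo of the case-insensitive filter; the sample names are made up, and the na=False argument (a standard str.contains option) is added only to guard against missing labels:
import pandas as pd

# Made-up sample names mirroring the examples in the question.
df = pd.DataFrame({'Row Labels': ['ABC Trust', 'XYTrusteeshipZ', 'Acme Ltd']})

# Matching 'trust' case-insensitively also catches 'Trustee' and 'Trusteeship'.
df_clean = df[~df['Row Labels'].str.contains('trust', case=False, na=False)]
print(df_clean)   # only the 'Acme Ltd' row remains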

MATLAB: Count string occurrences in table columns

I'm trying to find the number of words in this table:
Download Table here: http://www.mediafire.com/file/m81vtdo6bdd7bw8/Table_RandomInfoMiddle.mat/file
Words are indicated by the "Type" criterion being "letters". The key thing to notice is that not everything in the table is a word, and that a "<missing>" entry still registers as a word. In other words, I need to determine the number of words by counting only the "letters" entries, excluding those that are missing.
Here is my attempt (so far unsuccessful; notice the two mentions of "Problem area"):
for col = 1:size(Table_RandomInfoMiddle, 2)
    column_name = sprintf('Words count for column %d', col);
    MiddleWordsType_table.(column_name) = nnz(ismember(Table_RandomInfoMiddle(:,col).Variables, {'letters'}));
    MiddleWordsExclusionType_table.(column_name) = nnz(ismember(Table_RandomInfoMiddle(:,col).Variables, {'<missing>'})); %Problem area
end
%Call data from table
MiddleWordsType = table2array(MiddleWordsType_table);
MiddleWordsExclusionType = table2array(MiddleWordsExclusionType_table); %Problem area
%Take out zeros where "Type" was
MiddleWordsTotal_Nr = MiddleWordsType(MiddleWordsType~=0);
MiddleWordsExclusionTotal_Nr = MiddleWordsExclusionType(MiddleWordsExclusionType~=0);
%Final answer
FinalMiddleWordsTotal_Nr = MiddleWordsTotal_Nr-MiddleWordsExclusionTotal_Nr;
Any help will be appreciated. Thank you!
You can get the unique values from column 1 when column 2 satisfies some condition using
MiddleWordsType = numel( unique( ...
Table_RandomInfoMiddle{ismember(Table_RandomInfoMiddle{:,2}, 'letters'), 1} ) );
<missing> is a keyword in a categorical array, not literally the string "<missing>". That's why it appears blue and italicised in the workspace. If you want to check specifically for missing values, you can use this instead of ismember:
ismissing( Table_RandomInfoMiddle{:,1} )

Converting Excel functions into R

I have two excel functions that I am trying to convert into R:
numberShares
=IF(AND(N213="BOH",N212="BOH")=TRUE,P212,IF(AND(N213="BOH",N212="Sell")=TRUE,ROUNDDOWN(Q212/C213,0),0))
marketValue
=IF(AND(N212="BOH",N213="BOH")=TRUE,C213*P212,IF(AND(N212="Sell",N213="Sell")=TRUE,Q212,IF(AND(N212="BOH",N213="Sell")=TRUE,P212*C213,IF(AND(N212="Sell",N213="BOH")=TRUE,Q212))))
The cells that are referenced include:
c = closing price of a stock
n = position values of either "buy or hold" or "sell"
p = number of Shares
q = market value, assuming $10,000 initial equity (number of shares*closing price)
and the tops of the two output columns that I am trying to recreate look like this:
[screenshot: output columns]
So far, in R I have constructed a data frame with the necessary four columns:
[screenshot: the data.frame]
I just don't know how to write the functions that will populate the number of shares and market value columns. For loops? ifelse?
Again, thank you!!
Convert the AND()'s to the infix "&", the "=" to "==", and the IF's to ifelse(), and you are halfway there. The problem will be in converting your cell references to array or matrix references, and for that task we would need a better description of the data layout:
numberShares <-
  ifelse( N213 == "BOH" & N212 == "BOH",
          # Perhaps PosVal[213] == "BOH" & PosVal[212] == "BOH"
          # ... and very possibly the 213 should be 213:240 and the 212 should be 212:239
          P212,
          ifelse( N213 == "BOH" & N212 == "Sell",
                  trunc(Q212 / C213),  # Excel's ROUNDDOWN(x, 0) truncates toward zero, so trunc() rather than round()
                  0))
(You seem to be returning incommensurate values, which seems pretty questionable.) Assuming this is correct despite my misgivings, applying the same substitutions to the marketValue formula gives the structure below (note that your last IF is missing an else-consequent, which becomes NA here):
marketValue <-
  ifelse( N212 == "BOH"  & N213 == "BOH",  C213 * P212,
  ifelse( N212 == "Sell" & N213 == "Sell", Q212,
  ifelse( N212 == "BOH"  & N213 == "Sell", P212 * C213,
  ifelse( N212 == "Sell" & N213 == "BOH",  Q212,
          NA))))
(Your testing for AND(., .) = TRUE is, I believe, unnecessary in Excel and certainly unnecessary in R.)
