MATLAB: Count string occurrences in table columns

I'm trying to find the number of words in this table:
Download Table here: http://www.mediafire.com/file/m81vtdo6bdd7bw8/Table_RandomInfoMiddle.mat/file
Words are indicated by the "Type" criterion being "letters". The key thing to notice is that not everything in the table is a word, and that the entry "" registers as a word. In other words, I need to determine the number of words by counting only the "letters" entries, except when the value is missing.
Here is my attempt (so far unsuccessful; note the two lines marked "Problem area"):
for col = 1:size(Table_RandomInfoMiddle,2)
    column_name = sprintf('Words count for column %d', col);
    MiddleWordsType_table.(column_name) = nnz(ismember(Table_RandomInfoMiddle(:,col).Variables, {'letters'}));
    MiddleWordsExclusionType_table.(column_name) = nnz(ismember(Table_RandomInfoMiddle(:,col).Variables, {'<missing>'})); % Problem area
end
%Call data from table
MiddleWordsType = table2array(MiddleWordsType_table);
MiddleWordsExclusionType = table2array(MiddleWordsExclusionType_table); %Problem area
%Take out zeros where "Type" was
MiddleWordsTotal_Nr = MiddleWordsType(MiddleWordsType~=0);
MiddleWordsExclusionTotal_Nr = MiddleWordsExclusionType(MiddleWordsExclusionType~=0);
%Final answer
FinalMiddleWordsTotal_Nr = MiddleWordsTotal_Nr-MiddleWordsExclusionTotal_Nr;
Any help will be appreciated. Thank you!

You can get the unique values from column 1 when column 2 satisfies some condition using
MiddleWordsType = numel( unique( ...
Table_RandomInfoMiddle{ismember(Table_RandomInfoMiddle{:,2}, 'letters'), 1} ) );
<missing> is a keyword in a categorical array, not literally the string "<missing>". That's why it appears blue and italicised in the workspace. If you want to check specifically for missing values, you can use this instead of ismember:
ismissing( Table_RandomInfoMiddle{:,1} )

Related

Python: (partial) matching elements of a list to DataFrame columns, returning entry of a different column

I am a beginner in Python and have encountered the following problem: I have a long list of strings (I'll use 3 of them for the example):
ENSEMBL_IDs = ['ENSG00000040608',
               'ENSG00000070371',
               'ENSG00000070413']
which are partial matches of the data in column 0 of my DataFrame genes_df (first 3 entries shown):
genes_list = (['ENSG00000040608.28', 'RTN4R'],
              ['ENSG00000070371.91', 'CLTCL1'],
              ['ENSG00000070413.17', 'DGCR2'])
genes_df = pd.DataFrame(genes_list)
The task I want to perform is conceptually not that difficult: I want to compare each element of ENSEMBL_IDs to genes_df.iloc[:,0] (these are partial matches: each element of ENSEMBL_IDs is contained within column 0 of genes_df, as outlined above). If an element of ENSEMBL_IDs matches the element in genes_df.iloc[:,0] (which it does, apart from the extra digits after the period ".XX"), I want to return the corresponding value stored in the second column (genes_df.iloc[:,1]): the actual gene name, e.g. 'RTN4R'.
I want to store these in a list, so in the end I would be left with a list like the following:
genenames = ['RTN4R', 'CLTCL1', 'DGCR2']
Some info that might be helpful: all of the entries in ENSEMBL_IDs are unique, and all of them are for sure contained in column 0 of genes_df.
I think I am looking for something along the lines of:
genenames = []
for i in ENSEMBL_IDs:
    if i in genes_df.iloc[:, 0]:
        genenames.append(...)  # corresponding value in genes_df.iloc[:, 1]
I am sorry if the question has been asked before; I kept looking and was not able to find a solution that was applicable to my problem.
Thank you for your help!
Thanks also for the edit, English is not my first language, so the improvements were insightful.
You can get rid of the part after the dot (with str.extract or str.replace) before matching the values with isin:
m = genes_df[0].str.extract('([^.]+)', expand=False).isin(ENSEMBL_IDs)
# or
m = genes_df[0].str.replace(r'\..*$', '', regex=True).isin(ENSEMBL_IDs)
out = genes_df.loc[m, 1].tolist()
Or use a regex with str.match:
pattern = '|'.join(ENSEMBL_IDs)
m = genes_df[0].str.match(pattern)
out = genes_df.loc[m, 1].tolist()
Output: ['RTN4R', 'CLTCL1', 'DGCR2']
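For reference, here is a minimal end-to-end sketch of the first approach on the sample data from the question (a sketch only; variable names follow the question, and the regex simply keeps everything before the first dot):
import pandas as pd

ENSEMBL_IDs = ['ENSG00000040608', 'ENSG00000070371', 'ENSG00000070413']
genes_list = (['ENSG00000040608.28', 'RTN4R'],
              ['ENSG00000070371.91', 'CLTCL1'],
              ['ENSG00000070413.17', 'DGCR2'])
genes_df = pd.DataFrame(genes_list)

# Strip the version suffix after the dot, then keep rows whose bare ID is in the list
m = genes_df[0].str.extract('([^.]+)', expand=False).isin(ENSEMBL_IDs)
genenames = genes_df.loc[m, 1].tolist()
print(genenames)  # ['RTN4R', 'CLTCL1', 'DGCR2']
Note that the result follows the DataFrame's row order rather than the order of ENSEMBL_IDs; in this example the two happen to coincide.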

separating strings and creating multiple rows per ID, but matching with another column

Here is my data
mydata = data.frame(patient = c("1"),
                    health_pred = c("diabetes,bp"),
                    health_label = c("diabetes,bp,hypertension"))
I would like to split up the strings in both health_pred and health_label so that each new row has only one health issue. If health_pred and health_label have an issue that matches, I want it in the same row; if there is no match (e.g. hypertension), I would like the column with the missing value to just contain "[]". This is the output I am hoping for:
mydata1 = data.frame(patient = c("1","1","1"),
                     health_pred = c("diabetes","bp","[]"),
                     health_label = c("diabetes","bp","hypertension"))
I'm not even really sure where to get started with this. Any help would be really appreciated!

How to delete a certain row if it contains a particular string in it using python?

I have to create a clean list wherein names with 'Trust' or 'Trustee' in rows get deleted.
I'm using the following code, but I'm not getting the desired result:
df_clean = df[~df['Row Labels'].str.contains('trusteeship')]
E.g. if 'Row Labels' contains a row with ABC Trust or XYTrusteeshipZ, then the whole row should get deleted.
df_clean = df[~df['Row Labels'].str.contains('Trust')]
df_clean = df[~df['Row Labels'].str.lower().str.contains('trust')]
You can pass the case=False parameter to ignore upper/lowercase differences:
df_clean = df[~df['Row Labels'].str.contains('trust', case=False)]
Or first convert the values to lowercase, as mentioned by #anon01 in the comments:
df_clean = df[~df['Row Labels'].str.lower().str.contains('trust')]
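As a self-contained sketch of the case-insensitive filter, with a made-up three-row DataFrame (the column name 'Row Labels' follows the question):
import pandas as pd

df = pd.DataFrame({'Row Labels': ['ABC Trust', 'XYTrusteeshipZ', 'Plain Company']})

# Drop every row whose 'Row Labels' contains 'trust' in any letter case
df_clean = df[~df['Row Labels'].str.contains('trust', case=False)]
print(df_clean)  # only the 'Plain Company' row remains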

Split string mixed input into range of single values

I have an input which allows multiple IDs.
They can be entered like this:
[ 1000, 1001, 1050-1060, 1100 ]
Out of this input string I want to get all the single IDs.
I already found this to split after each ',', so the part with 1000, 1001 already works:
DATA: itab TYPE TABLE OF string.
SPLIT l_bukrs_string AT ';' INTO TABLE itab.
My problem is the self-built range. Any idea how I could combine this with the case above to split 1050-1060 into single values?
I want to get 1050 | 1051 | 1052 | ... | 1060 out of it.
Appreciate every hint :) Thank you so much!
The easiest solution would be to use a real range/select-option for user (?) input instead. Then you would use that range to select every value from the database table.
If you cannot use a real range/select-option, then you could convert the string to one as shown below.
DATA: bukrs_string  TYPE string,
      split_bukrs   TYPE TABLE OF string,
      bukrs         TYPE bukrs,
      bukrs_between TYPE TABLE OF bukrs,
      bukrs_range   TYPE RANGE OF bukrs,
      bukrs_rline   LIKE LINE OF bukrs_range,
      bukrs_table   TYPE TABLE OF bukrs.

FIELD-SYMBOLS: <string>     TYPE string,
               <bukrs>      TYPE bukrs,
               <bukrs_from> TYPE bukrs,
               <bukrs_to>   TYPE bukrs.

bukrs_string = '1000, 1001, 1050-1060, 1100'.
CONDENSE bukrs_string NO-GAPS.
SPLIT bukrs_string AT ',' INTO TABLE split_bukrs.

LOOP AT split_bukrs ASSIGNING <string>.
  bukrs_rline-sign = 'I'.

  IF <string> CA '-'.
    SPLIT <string> AT '-' INTO TABLE bukrs_between.
    bukrs_rline-option = 'BT'.
    READ TABLE bukrs_between INDEX 1 ASSIGNING <bukrs_from>.
    bukrs_rline-low = <bukrs_from>.
    READ TABLE bukrs_between INDEX 2 ASSIGNING <bukrs_to>.
    bukrs_rline-high = <bukrs_to>.
  ELSE.
    bukrs_rline-option = 'EQ'.
    bukrs = <string>.
    bukrs_rline-low = bukrs.
  ENDIF.

  APPEND bukrs_rline TO bukrs_range.
  CLEAR bukrs_rline.
ENDLOOP.

SELECT bukrs
  FROM t001
  INTO TABLE bukrs_table
  WHERE bukrs IN bukrs_range.
Before you split the string, you condense it to remove all spaces. Then you loop over the resulting parts and check whether each one contains a '-'. If it does, you split it again and create a BETWEEN entry in your range (consider whether you want an additional check that the second number is actually the higher one). If there is no '-', you just create an EQUAL entry.
After you have your real range, you use it to select from the database. This is because not every bukrs in that range has to exist. You may only have 1000, 1050, 1055 and 1060, for example.
Edit: The reason there is no command, function module or class to convert a range into individual values is that what needs to be done changes heavily depending on what data the range is for and whether/how many values need to be verified.
If you have an integer range, then all you need to do is take the from-value and add 1 to it until you reach the to-value. What about a range of binary floating point numbers? What about a range of colours? What about your range of company codes, where not all of them necessarily exist? That's why the conversion has to be done manually.
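For illustration only, the same expand-the-intervals idea sketched in Python rather than ABAP (assuming a plain comma-separated input string like the one in the question, with no validation of the interval bounds):
def expand_ids(input_string):
    # Expand '1000, 1001, 1050-1060' into a list of individual integer IDs
    ids = []
    for part in input_string.replace(' ', '').split(','):
        if '-' in part:
            low, high = part.split('-')
            ids.extend(range(int(low), int(high) + 1))  # inclusive interval
        else:
            ids.append(int(part))
    return ids

print(expand_ids('1000, 1001, 1050-1060, 1100'))
# [1000, 1001, 1050, 1051, ..., 1059, 1060, 1100]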
Provided you are given a string containing a list of mixed values, both single BUKRS values and dash-separated intervals, with the list items separated by a comma plus a space, then:
DATA: input   TYPE string VALUE '1000, 1001, 1050-1060, 1100, 1300-1340',
      itab    TYPE TABLE OF char10,
      r_bukrs TYPE RANGE OF bukrs.

SPLIT input AT `, ` INTO TABLE itab.

r_bukrs = VALUE #( FOR GROUPS bukrs OF <bukrs> IN itab WHERE ( table_line+4(1) NE '-' ) GROUP BY <bukrs> WITHOUT MEMBERS ( sign = 'I' option = 'EQ' low = bukrs ) ).

DATA(ranges) = VALUE ddtest_ttyp_char( FOR GROUPS bukrs OF <bukrs> IN itab WHERE ( table_line+4(1) EQ '-' ) GROUP BY <bukrs> WITHOUT MEMBERS ( bukrs ) ).

LOOP AT ranges ASSIGNING FIELD-SYMBOL(<range>).
  r_bukrs = VALUE #( BASE r_bukrs FOR j = CONV i( <range>(4) ) UNTIL j = CONV i( <range>+5(4) ) + 1 ( sign = 'I' option = 'EQ' low = j ) ).
ENDLOOP.
The first table expression fills r_bukrs with the unique single values from the split table (everything without a dash).
The second table expression fills the ranges table with the dash intervals found in the input, 1050-1060 and 1300-1340 in this case.
In the loop over the ranges table, <range>(4) is the left end of the interval and <range>+5(4) is the right end, e.g. 1300 and 1340 respectively for the last interval.

selecting all cells between two strings in a column

I posted a question previously as "using “.between” for string values not working in python", but I was not clear enough and could not edit it, so I am reposting with more clarity here.
I have a DataFrame. In [0,61] I have a string. In [0,69] I have another string. I want to slice all the data in cells [0,62:68] between these two, merge it, and paste the result into [1,61]. Afterwards, [0,62:68] will be blank, but that is not important.
However, I have several hundred documents, and I want to write a script that runs on all of them. The strings in [0,61] and [0,69] are always present in all the documents, but at different locations in that column. So I tried:
For_Paste = df[0][df[0].between('DESCRIPTION OF WORK / STATEMENT OF WORK', 'ADDITIONAL REQUIREMENTS / SUPPORTING DOCUMENTATION', inclusive = False)]
But the output I get is: Series([], Name: 0, dtype: object)
I was expecting a list or array with the desired data that I could merge and paste. Thanks.
If you want to select the rows between two indices (say idx_start and idx_end, excluding these two rows) in column col of the dataframe df, you will want to use
df.loc[idx_start + 1 : idx_end - 1, col]  # .loc slicing is inclusive on both ends
To find the first index matching a string s, use
idx = df.index[df[col] == s][0]
So for your case, to return a Series of the rows between these two indices, try the following:
start_string = 'DESCRIPTION OF WORK / STATEMENT OF WORK'
end_string = 'ADDITIONAL REQUIREMENTS / SUPPORTING DOCUMENTATION'
idx_start = df.index[df[0] == start_string][0]
idx_end = df.index[df[0] == end_string][0]
For_Paste = df.loc[idx_start + 1 : idx_end - 1, 0]
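A small self-contained sketch of the whole lookup, with a made-up single-column DataFrame standing in for one document (the two marker strings follow the question):
import pandas as pd

# Hypothetical stand-in for one document: one text column (column 0)
df = pd.DataFrame({0: ['header',
                       'DESCRIPTION OF WORK / STATEMENT OF WORK',
                       'line A', 'line B', 'line C',
                       'ADDITIONAL REQUIREMENTS / SUPPORTING DOCUMENTATION',
                       'footer']})

start_string = 'DESCRIPTION OF WORK / STATEMENT OF WORK'
end_string = 'ADDITIONAL REQUIREMENTS / SUPPORTING DOCUMENTATION'
idx_start = df.index[df[0] == start_string][0]
idx_end = df.index[df[0] == end_string][0]

For_Paste = df.loc[idx_start + 1 : idx_end - 1, 0]
merged = ' '.join(For_Paste)  # 'line A line B line C'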
