Extract multiple text string to form new row - python-3.x

I am trying to convert arrays within a row of a CSV file into multiple rows. Currently the data looks like this:
test = result['properties.techniques'].dropna()
print(test)
['T1078','T1036']
['T1036']
I can add the following line to extract the individual items:
test = result['properties.techniques'].dropna()
techniques = result['properties.techniques'].str.extract(r"(T[0-9]{4})").dropna()[0]
print(techniques)
T1078
T1036
However, this extracts only one string per row.
How do I ensure that every item ends up in its own row?

Use .explode(), which turns each element of a list-like cell into its own row:
techniques = result.explode("properties.techniques").reset_index(drop=True)
print(techniques)
Output:
properties.techniques
0 T1078
1 T1036
2 T1036
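If the column was read from a CSV, each cell is likely a string such as "['T1078','T1036']" rather than a real list, and .explode() would leave it unchanged. A minimal sketch for that case, assuming the values are strings (the sample frame below is a hypothetical stand-in for the real data), uses .str.findall to collect every match per row before exploding:

```python
import pandas as pd

# Hypothetical stand-in for the CSV data: each cell is a string, not a list.
result = pd.DataFrame(
    {"properties.techniques": ["['T1078','T1036']", "['T1036']", None]}
)

# findall returns all regex matches per row as a list;
# explode then gives each match its own row.
techniques = (
    result["properties.techniques"]
    .dropna()
    .str.findall(r"T[0-9]{4}")
    .explode()
    .reset_index(drop=True)
)
print(techniques)
```

Unlike .str.extract, which keeps only the first match per row, .str.findall keeps all of them.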

Related

separating strings and creating multiple rows per ID, but matching with another column

Here is my data
mydata = data.frame (patient =c("1"),
health_pred = c("diabetes,bp"),
health_label = c("diabetes,bp,hypertension"))
I would like to split up the strings in both health_pred and health_label so that each new row has only one health issue. If health_pred and health_label share an issue, I want it in the same row; if there is no match (i.e. hypertension), I would like the column with the missing value to contain "[]". This is the output I am hoping for:
mydata1 = data.frame (patient =c("1","1","1"),
health_pred = c("diabetes","bp","[]"),
health_label = c("diabetes","bp","hypertension"))
I'm not even really sure where to get started with this. Any help would be really appreciated!

How to delete a certain row if it contains a particular string in it using python?

I have to create a clean list in which rows whose names contain 'Trust' or 'Trustee' are deleted.
I'm using the following code, but I'm not getting the desired result:
df_clean = df[~df['Row Labels'].str.contains('trusteeship')]
E.g., if 'Row Labels' contains a row with ABC Trust or XYTrusteeshipZ, then the whole row should be deleted.
df_clean = df[~df['Row Labels'].str.contains('Trust')]
df_clean = df[~df['Row Labels'].str.lower().str.contains('trust')]
You can pass the case=False parameter to ignore upper/lowercase differences:
df_clean = df[~df['Row Labels'].str.contains('trust', case=False)]
Or first convert the values to lowercase, as @anon01 mentioned in the comments:
df_clean = df[~df['Row Labels'].str.lower().str.contains('trust')]
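The original attempt fails because str.contains is case-sensitive by default and 'trusteeship' is a longer pattern than 'Trust', so "ABC Trust" never matches. A small runnable sketch, using made-up sample labels, shows the case-insensitive version. The na=False argument is an addition not in the answer above; it treats missing labels as non-matches so the mask stays boolean:

```python
import pandas as pd

# Made-up sample data for illustration.
df = pd.DataFrame(
    {"Row Labels": ["ABC Trust", "XYTrusteeshipZ", "John Smith", None, "Acme Ltd"]}
)

# case=False ignores letter case; na=False marks missing labels as "no match"
# so the mask contains only True/False and can be negated with ~.
mask = df["Row Labels"].str.contains("trust", case=False, na=False)
df_clean = df[~mask]
print(df_clean)
```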

MATLAB: Count string occurrences in table columns

I'm trying to find the number of words in this table:
Download Table here: http://www.mediafire.com/file/m81vtdo6bdd7bw8/Table_RandomInfoMiddle.mat/file
Words are indicated by the "Type" criterion being "letters". The key thing to notice is that not everything in the table is a word, and that the entry "<missing>" registers as a word. In other words, I need to determine the number of words by counting only "letters", except where the entry is "<missing>".
Here is my attempt (so far unsuccessful; note the two mentions of "Problem area"):
for col=1:size(Table_RandomInfoMiddle,2)
column_name = sprintf('Words count for column %d',col);
MiddleWordsType_table.(column_name) = nnz(ismember(Table_RandomInfoMiddle(:,col).Variables,{'letters'}));
MiddleWordsExclusionType_table.(column_name) = nnz(ismember(Table_RandomInfoMiddle(:,col).Variables,{'<missing>'})); %Problem area
end
%Call data from table
MiddleWordsType = table2array(MiddleWordsType_table);
MiddleWordsExclusionType = table2array(MiddleWordsExclusionType_table); %Problem area
%Take out zeros where "Type" was
MiddleWordsTotal_Nr = MiddleWordsType(MiddleWordsType~=0);
MiddleWordsExclusionTotal_Nr = MiddleWordsExclusionType(MiddleWordsExclusionType~=0);
%Final answer
FinalMiddleWordsTotal_Nr = MiddleWordsTotal_Nr-MiddleWordsExclusionTotal_Nr;
Any help will be appreciated. Thank you!
You can get the unique values from column 1 when column 2 satisfies some condition using
MiddleWordsType = numel( unique( ...
Table_RandomInfoMiddle{ismember(Table_RandomInfoMiddle{:,2}, 'letters'), 1} ) );
<missing> is a keyword in a categorical array, not literally the string "<missing>". That's why it appears blue and italicised in the workspace. If you want to check specifically for missing values, you can use this instead of ismember:
ismissing( Table_RandomInfoMiddle{:,1} )

How to replace the values on top of existing values using python

I have two datasets (sample_set and third_set). Using sample_set, I want to replace the values in third_set only in the rows where 'Assessment' equals 'Invalid Data'.
sample_data['Volume_new'] = third_set['Assessment'].replace(['Invalid Data'], mean_value)
The code above replaced the values in sample_set, but the remaining values show as null.
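The question does not show the data, so the following is only one reading of it. It assumes third_set has a numeric column (called Volume here, a hypothetical name) and that mean_value is computed from sample_set. The sketch keeps the replacement inside third_set and uses a boolean mask, so rows that are not 'Invalid Data' keep their original values. Assigning a Series from one DataFrame into another aligns on the index, which is a possible source of the nulls.

```python
import pandas as pd

# Hypothetical data: the "Volume" column and its values are assumptions.
third_set = pd.DataFrame(
    {
        "Assessment": ["Valid", "Invalid Data", "Valid", "Invalid Data"],
        "Volume": [10.0, 999.0, 30.0, -1.0],
    }
)
sample_set = pd.DataFrame({"Volume": [10.0, 20.0, 30.0]})
mean_value = sample_set["Volume"].mean()

# Replace Volume only where Assessment is 'Invalid Data'; other rows are unchanged.
invalid = third_set["Assessment"] == "Invalid Data"
third_set["Volume_new"] = third_set["Volume"].mask(invalid, mean_value)
print(third_set)
```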

selecting all cells between two string in a column

I previously posted this question as "using “.between” for string values not working in python", but I was not clear enough and could not edit it, so I am reposting it here with more detail.
I have a DataFrame. In [0,61] I have a string. In [0,69] I have another string. I want to slice all the data in cells [0,62:68] between these two, merge it, and paste the result into [1,61]. Subsequently, [0,62:68] will be blank, but that is not important.
However, I have several hundred documents, and I want to write a script that runs on all of them. The strings in [0,61] and [0,69] are always present in all the documents, but at different locations in that column. So I tried using:
For_Paste = df[0][df[0].between('DESCRIPTION OF WORK / STATEMENT OF WORK', 'ADDITIONAL REQUIREMENTS / SUPPORTING DOCUMENTATION', inclusive = False)]
But the output I get is: Series([], Name: 0, dtype: object)
I was expecting a list or array with the desired data that I could merge and paste. Thanks.
If you want to select the rows between two index labels (say idx_start and idx_end, excluding these two rows) in column col of the dataframe df, note that .loc includes both endpoints of a label slice, so with the default integer index you will want to use
df.loc[idx_start + 1 : idx_end - 1, col]
To find the first index matching a string s, use
idx = df.index[df[col] == s][0]
So for your case, to return a Series of the rows between these two indices, try the following:
start_string = 'DESCRIPTION OF WORK / STATEMENT OF WORK'
end_string = 'ADDITIONAL REQUIREMENTS / SUPPORTING DOCUMENTATION'
idx_start = df.index[df[0] == start_string][0]
idx_end = df.index[df[0] == end_string][0]
For_Paste = df.loc[idx_start + 1 : idx_end - 1, 0]
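The .between attempt returns an empty Series because it compares the cell values themselves (alphabetically, for strings) rather than their positions, and no string sorts after 'DESCRIPTION...' and before 'ADDITIONAL...'. A self-contained sketch of the index-based approach, using a made-up six-row frame with the default integer index, might look like this. Since .loc label slices include both endpoints, the slice stops at idx_end - 1 to leave out the end marker:

```python
import pandas as pd

start_string = "DESCRIPTION OF WORK / STATEMENT OF WORK"
end_string = "ADDITIONAL REQUIREMENTS / SUPPORTING DOCUMENTATION"

# Made-up document: only the two marker strings come from the question.
df = pd.DataFrame(
    {0: ["HEADER", start_string, "line one", "line two", end_string, "FOOTER"]}
)

# First index label at which each marker occurs.
idx_start = df.index[df[0] == start_string][0]
idx_end = df.index[df[0] == end_string][0]

# Rows strictly between the markers (assumes the default integer index).
For_Paste = df.loc[idx_start + 1 : idx_end - 1, 0]

# Merge the selected cells into a single string.
merged = " ".join(For_Paste.astype(str))
print(merged)
```

Because the markers are looked up per document, the same code works when they sit at different rows in each file.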
