How to replace the values on top of existing values using python - python-3.x

I have two datasets (sample_set and third_set). Using sample_set, I want to replace the values in third_set only for the rows where 'Assessment' equals 'Invalid Data'.
sample_data['Volume_new'] = third_set['Assessment'].replace(['Invalid Data'], mean_value)
The code above replaced those values, but the remaining values now show null.
Data of Sample_set
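A conditional replacement like this is usually done with a boolean mask and .loc rather than Series.replace, so that unmatched rows keep their existing values instead of becoming null. A minimal sketch, assuming hypothetical 'Assessment' and 'Volume' columns and a made-up mean_value (the question's real data is not shown):

```python
import pandas as pd

# Hypothetical data mirroring the question: third_set has an 'Assessment'
# column and a numeric 'Volume' column; mean_value is the replacement value.
third_set = pd.DataFrame({
    "Assessment": ["Invalid Data", "OK", "Invalid Data", "OK"],
    "Volume": [None, 12.5, None, 8.0],
})
mean_value = 10.0

# Start the new column from the existing values, then overwrite only the
# rows where Assessment equals 'Invalid Data'; other rows are untouched.
mask = third_set["Assessment"] == "Invalid Data"
third_set["Volume_new"] = third_set["Volume"]
third_set.loc[mask, "Volume_new"] = mean_value
```

This avoids the issue in the question, where assigning the result of replace() over the whole column left every non-matching row null.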


Extract multiple text string to form new row

I am trying to convert arrays within rows of a CSV into multiple rows. Currently the data looks like this:
test = result['properties.techniques'].dropna()
print(test)
['T1078','T1036']
['T1036']
I can add the following line to extract the individual items:
test = result['properties.techniques'].dropna()
techniques = result['properties.techniques'].str.extract(r"(T[0-9]{4})").dropna()[0]
print(techniques )
T1078
T1036
This, however, only extracts one string per row.
How do I ensure that all the data is converted into new rows?
Using .explode():
techniques = result.explode("properties.techniques").reset_index(drop=True)
print(techniques)
Output:
properties.techniques
0 T1078
1 T1036
2 T1036
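A minimal, self-contained reproduction of the answer above, assuming the column holds actual Python lists (if the values are strings like "['T1078','T1036']", they would need to be parsed first, e.g. with ast.literal_eval):

```python
import pandas as pd

# Toy frame: each row of 'properties.techniques' holds a list of IDs.
result = pd.DataFrame({
    "properties.techniques": [["T1078", "T1036"], ["T1036"]]
})

# explode() turns each list element into its own row; reset_index(drop=True)
# discards the duplicated original index.
techniques = result.explode("properties.techniques").reset_index(drop=True)
```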

How to delete a certain row if it contains a particular string in it using python?

I have to create a clean list in which rows whose names contain 'Trust' or 'Trustee' get deleted.
I'm using the following code, but I'm not getting the desired result:
df_clean = df[~df['Row Labels'].str.contains('trusteeship')]
e.g. if 'Row Labels' contains a row with ABC Trust or XYTrusteeshipZ, then the whole row should be deleted.
df_clean = df[~df['Row Labels'].str.contains('Trust')]
df_clean = df[~df['Row Labels'].str.lower().str.contains('trust')]
You can pass the case=False parameter to make the match case-insensitive:
df_clean = df[~df['Row Labels'].str.contains('trust', case=False)]
Or first convert the values to lowercase, as #anon01 mentioned in the comments:
df_clean = df[~df['Row Labels'].str.lower().str.contains('trust')]
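The answer above, run end-to-end on a toy frame (the example names come from the question; "Acme Ltd" is invented as a row that should survive the filter):

```python
import pandas as pd

# Hypothetical data matching the question's examples.
df = pd.DataFrame({"Row Labels": ["ABC Trust", "XYTrusteeshipZ", "Acme Ltd"]})

# case=False makes the substring match case-insensitive, so 'Trust',
# 'trust', and 'TRUST' are all caught; ~ negates the mask to keep the rest.
df_clean = df[~df["Row Labels"].str.contains("trust", case=False)]
```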

MATLAB: Count string occurrences in table columns

I'm trying to find the number of words in this table:
Download Table here: http://www.mediafire.com/file/m81vtdo6bdd7bw8/Table_RandomInfoMiddle.mat/file
Words are indicated by the "Type" criterion being "letters". The key thing to notice is that not everything in the table is a word, and that the entry "<missing>" registers as a word. In other words, I need to determine the number of words by counting only the "letters" entries, excluding those that are "<missing>".
Here is my attempt (so far unsuccessful; note the two lines marked "Problem area"):
for col = 1:size(Table_RandomInfoMiddle, 2)
    column_name = sprintf('Words count for column %d', col);
    MiddleWordsType_table.(column_name) = nnz(ismember(Table_RandomInfoMiddle(:,col).Variables, {'letters'}));
    MiddleWordsExclusionType_table.(column_name) = nnz(ismember(Table_RandomInfoMiddle(:,col).Variables, {'<missing>'})); % Problem area
end
%Call data from table
MiddleWordsType = table2array(MiddleWordsType_table);
MiddleWordsExclusionType = table2array(MiddleWordsExclusionType_table); %Problem area
%Take out zeros where "Type" was
MiddleWordsTotal_Nr = MiddleWordsType(MiddleWordsType~=0);
MiddleWordsExclusionTotal_Nr = MiddleWordsExclusionType(MiddleWordsExclusionType~=0);
%Final answer
FinalMiddleWordsTotal_Nr = MiddleWordsTotal_Nr-MiddleWordsExclusionTotal_Nr;
Any help will be appreciated. Thank you!
You can get the unique values from column 1 when column 2 satisfies some condition using
MiddleWordsType = numel( unique( ...
Table_RandomInfoMiddle{ismember(Table_RandomInfoMiddle{:,2}, 'letters'), 1} ) );
<missing> is a keyword in a categorical array, not literally the string "<missing>". That's why it appears blue and italicised in the workspace. If you want to check specifically for missing values, you can use this instead of ismember:
ismissing( Table_RandomInfoMiddle{:,1} )

transform columns from column list within select method that's attached to join method

I have two data frames with the same schema. I'm using the outer join method on both data frames and I'm using the select and coalesce methods to select and transform all columns. I want to iterate over the column list within the select method without explicitly defining each column within the coalesce method. It would be great to know if there's a solution without using a UDF. The two tables that are being joined are songs and staging_songs within the code snippets below.
Instead of explicitly defining each column like so:
updated_songs = songs.join(staging_songs, songs.song_id == staging_songs.song_id, how='full').select(
    f.coalesce(staging_songs.song_id, songs.song_id),
    f.coalesce(staging_songs.artist_name, songs.artist_name),
    f.coalesce(staging_songs.song_name, songs.song_name)
)
Doing something along the lines of:
# column names to iterate over in select method
songs_columns = songs.columns
updated_songs = songs.join(staging_songs, songs.song_id == staging_songs.song_id, how='full').select(
    # using a for loop like this raises a SyntaxError
    for col in songs_columns:
        f.coalesce(staging_songs.col, songs.col))
Try this:
updated_songs = songs.join(staging_songs, songs["song_id"] == staging_songs["song_id"], how='full').select(
    *[f.coalesce(staging_songs[col], songs[col]).alias(col) for col in songs_columns]
)
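For readers without a Spark session at hand, the same per-column coalesce idea can be sketched in plain pandas with combine_first. This is an analogue of the technique, not the PySpark answer itself, and all the frame contents below are made up:

```python
import pandas as pd

# Hypothetical frames sharing the schema from the question.
songs = pd.DataFrame({
    "song_id": [1, 2],
    "artist_name": ["Old A", "Old B"],
    "song_name": ["Song A", "Song B"],
})
staging_songs = pd.DataFrame({
    "song_id": [2, 3],
    "artist_name": ["New B", "New C"],
    "song_name": [None, "Song C"],
})

# Index both frames on the join key, then take staging values where they
# are non-null and fall back to the existing values: a per-column coalesce
# over the full outer join of the two keys.
updated_songs = (
    staging_songs.set_index("song_id")
    .combine_first(songs.set_index("song_id"))
    .reset_index()
)
```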

selecting all cells between two string in a column

I previously posted this question as "using ".between" for string values not working in python", but it was not clear enough and I could not edit it, so I am reposting it with more clarity here.
I have a DataFrame. In [0,61] I have a string. In [0,69] I have a string. I want to slice all the data in cells [0,62:68] between these two, merge it, and paste the result into [1,61]. Afterwards, [0,62:68] will be blank, but that is not important.
However, I have several hundred documents, and I want to write a script that runs on all of them. The strings in [0,61] and [0,69] are always present in all the documents, but at different positions in that column. So I tried using:
For_Paste = df[0][df[0].between('DESCRIPTION OF WORK / STATEMENT OF WORK', 'ADDITIONAL REQUIREMENTS / SUPPORTING DOCUMENTATION', inclusive = False)]
But the output I get is: Series([], Name: 0, dtype: object)
I was expecting a list or array with the desired data that I could merge and paste. Thanks.
If you want to select the rows between two indices (say idx_start and idx_end, excluding those two rows) in column col of the dataframe df, you can use
df.loc[idx_start + 1 : idx_end - 1, col]
Note that .loc slicing is inclusive at both ends, hence the +1 and -1.
To find the first index matching a string s, use
idx = df.index[df[col] == s][0]
So for your case, to return a Series of the rows between these two indices, try the following:
start_string = 'DESCRIPTION OF WORK / STATEMENT OF WORK'
end_string = 'ADDITIONAL REQUIREMENTS / SUPPORTING DOCUMENTATION'
idx_start = df.index[df[0] == start_string][0]
idx_end = df.index[df[0] == end_string][0]
For_Paste = df.loc[idx_start + 1 : idx_end - 1, 0]
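Put together, the approach runs end-to-end on a toy frame. The two marker strings come from the question; the surrounding rows are invented for illustration:

```python
import pandas as pd

# Hypothetical single-column frame: the two marker strings surround
# the rows we want to slice out and merge.
df = pd.DataFrame({0: [
    "header",
    "DESCRIPTION OF WORK / STATEMENT OF WORK",
    "line one",
    "line two",
    "ADDITIONAL REQUIREMENTS / SUPPORTING DOCUMENTATION",
    "footer",
]})

start_string = "DESCRIPTION OF WORK / STATEMENT OF WORK"
end_string = "ADDITIONAL REQUIREMENTS / SUPPORTING DOCUMENTATION"

# Find the positions of the two markers, then slice the rows strictly
# between them (.loc is inclusive at both ends, hence the +1 / -1).
idx_start = df.index[df[0] == start_string][0]
idx_end = df.index[df[0] == end_string][0]
For_Paste = df.loc[idx_start + 1 : idx_end - 1, 0]

# Merge the extracted rows into a single string for pasting.
merged = " ".join(For_Paste)
```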
