Count the number of labels on IOB corpus with Pandas - python-3.x

From my IOB corpus such as:
mention Tag
170
171 467 O
172
173 Vincennes B-LOCATION
174 . O
175
176 Confirmation O
177 des O
178 privilèges O
179 de O
180 la O
181 ville B-ORGANISATION
182 de I-ORGANISATION
183 Tournai I-ORGANISATION
184 1 O
185 ( O
186 cf O
187 . O
188 infra O
189 , O
I try to make simple statistics like total number of annotated mentions, total by labels etc.
After loading my dataset with pandas I got this:
df = pd.Series(data['Tag'].value_counts(), name="Total").to_frame().reset_index()
df.columns = ['Label', 'Total']
df
Output :
Label Total
0 O 438528
1 36235
2 B-LOCATION 378
3 I-LOCATION 259
4 I-PERSON 234
5 I-INSTALLATION 156
6 I-ORGANISATION 150
7 B-PERSON 144
8 B-TITLE 94
9 I-TITLE 89
10 B-ORGANISATION 68
11 B-INSTALLATION 62
12 I-EVENT 8
13 B-EVENT 2
First of all, How I could get a similar representation above but by regrouping the IOB prefixes such as (example):
Label, Total
PERSON, 300
LOCATION, 154
ORGANISATION, 67
etc.
and secondly how to exclude the "O" and empty strings labels from my output, I tested with .mask() and .where() on my Series but it fails.
Thank you for your leads.

remove B-, I- parts, groupby, sum
df['label'] = df.label.str[2:]
df.groupby(['label']).sum()
For the second part, just return data in which the length of the label column string is greater than 2
df.loc[df.label.str.len()>2]

Related

Find duplicated Rows based on selected columns condition in a pandas Dataframe

I have an extensive base converted into a dataframe where it is difficult to manually identify the following
The dataframe has columns with the names from_bus and to_bus, which are unique identifiers regardless of the order, for example for element 0:
L_ABAN_MACA_0_1 the associated ordered pair (109,140) is the same as (140,109).
name
from_bus
to_bus
x_ohm_per_km
0
L_ABAN_MACA_0_1
109
140
0.444450
1
L_AGOY_BAÑO_1_1
69
66
0.476683
2
L_AGOY_BAÑO_1_2
69
66
0.476683
3
L_ALAN_INGA_1_1
189
188
0.452790
4
L_ALAN_INGA_1_2
188
189
0.500450
So I want to identify the duplicate ordered pairs and replace them with a single one, whose column value x_ohn_per_km is defined as the sum of the duplicated values, as follows:
name
from_bus
to_bus
x_ohm_per_km
0
L_ABAN_MACA_0_1
109
140
0.444450
1
L_AGOY_BAÑO_1_1
69
66
0.953366
3
L_ALAN_INGA_1_1
189
188
0.953240
Let us try groupby on from_bus and to_bus after sorting the values in these columns along axis=1 then agg to aggregate the result, optionally reindex to conform the order of columns:
c = ['from_bus', 'to_bus']
df[c] = np.sort(df[c], axis=1)
df.groupby(c, sort=False, as_index=False)\
.agg({'name': 'first', 'x_ohm_per_km': 'sum'})\
.reindex(df.columns, axis=1)
Alternative approach:
d = {**dict.fromkeys(df, 'first'), 'x_ohm_per_km': 'sum'}
df.groupby([*np.sort(df[c], axis=1).T], sort=False, as_index=False).agg(d)
name from_bus to_bus x_ohm_per_km
0 L_ABAN_MACA_0_1 109 140 0.444450
1 L_AGOY_BAÑO_1_1 66 69 0.953366
2 L_ALAN_INGA_1_1 188 189 0.953240

filtering and transposing the dataframe in python3

I made a csv file using pandas and trying to use it as input for the next step. when I open the file using pandas it will look like this example:
example:
Unnamed: 0 Class_Name Probe_Name small_example1.csv small_example2.csv small_example3.csv
0 0 Endogenous CCNO 196 32 18
1 1 Endogenous MYC 962 974 1114
2 2 Endogenous CD79A 390 115 178
3 3 Endogenous FSTL3 67 101 529
4 4 Endogenous VCAN 943 735 9226
I want to make a plot, to do so, I have to change the data structure.
1- I want to remove Unnamed column
2- then I want to make a data frame for a heatmap. to do so I want to use these columns "probe_name", "small_example1.csv", "small_example2.csv" and "small_example3.csv"
3- I also want to transpose the data frame.
here is the expected output:
Probe_Name CCNO MYC CD79A FSTL3 VCAN
small_example1.csv 196 962 390 67 943
small_example1.csv 32 974 115 101 735
small_example1.csv 18 1114 178 529 9226
I tied to do that using the following code:
df = pd.read_csv('myfile.csv')
result = df.transpose()
but it does not return what I want to get. do you know how to fix it?
df.drop(['Unnamed: 0','Class_Name'],axis=1).set_index('Probe_Name').T
Result:
Probe_Name CCNO MYC CD79A FSTL3 VCAN
small_example1.csv 196 962 390 67 943
small_example2.csv 32 974 115 101 735
small_example3.csv 18 1114 178 529 9226
Here's a suggestion:
Changes 1 & 2 can be tackled in one go:
df = df.loc[:, ["Probe_Name", "small_example1.csv", "small_example2.csv", "small_example3.csv"]] # This only retains the specified columns
In order for change 3 (transposing) to work as desired, the column Probe_Name needs to be set as your index:
df = df.set_index("Probe_Name", drop=True)
df = df.transpose()

Get unique values of a column in between a timeperiod in pandas after groupby

I have a requirement where I need to find all the unique values of a merchant_store_id of the user on the same stampcard in between a specific time period. I had group by stampcard id and userid to get the data frame based on the condition. Now I need to find the unique merchant_store_id of the this dataframe in interval of 10mins from that entry.
My approach is I would loop in that groupby dataframe and then I would find the all indexes in that dataframe of that group and then I would create a new dataframe from time of index to index + 60mins and then find the unique merchant_store_id's in it. If the unique merchant_store_id is >1 , I would append that dataframe from that time to a final dataframe. Problem with the approach is it works fine for small data, but for data of size 20,000 rows it shows memory error on linux and keeps on running on windows. Below is my code
fi_df = pd.DataFrame()
for i in df.groupby(["stamp_card_id", "merchant_id", "user_id"]):
user_df = i[1]
if len(user_df)>1:
# get list of unique indexes in that groupby df
index = user_df.index.values
for ind in index:
fdf = user_df[ind:ind+np.timedelta64(1, 'h')]
if len(fdf.merchant_store_id.unique())>1:
fi_df=fi_df.append(fdf)
fi_df.drop_duplicates(keep="first").to_csv(csv_export_path)
Sample Data after group by is:
((117, 209, 'oZOfOgAgnO'), stamp_card_id stamp_time stamps_record_id user_id \
0 117 2018-10-14 16:48:03 1756 oZOfOgAgnO
1 117 2018-10-14 16:54:03 1759 oZOfOgAgnO
2 117 2018-10-14 16:58:03 1760 oZOfOgAgnO
3 117 2018-10-14 17:48:03 1763 oZOfOgAgnO
4 117 2018-10-14 18:48:03 1765 oZOfOgAgnO
5 117 2018-10-14 19:48:03 1767 oZOfOgAgnO
6 117 2018-10-14 20:48:03 1769 oZOfOgAgnO
7 117 2018-10-14 21:48:03 1771 oZOfOgAgnO
8 117 2018-10-15 22:48:03 1773 oZOfOgAgnO
9 117 2018-10-15 23:08:03 1774 oZOfOgAgnO
10 117 2018-10-15 23:34:03 1777 oZOfOgAgnO
merchant_id merchant_store_id
0 209 662
1 209 662
2 209 662
3 209 662
4 209 662
5 209 662
6 209 663
7 209 664
8 209 662
9 209 664
10 209 663 )
I have tried the resampling method also, but then i get the data in respective of the time, where the condition of user hitting multiple merchant_store_id is neglected at end time of the hours.
Any help would be appreciated. Thanks
if those are datetimes you can filter with the following:
filtered_set = set(df[df["stamp_time"]>=x][df["stamp_time"]<=y]["col of interest"])
df[df["stamp_time"]>=x] filters the df
adding [df["stamp_time"]<=y] filters the filtered df
["merchant_store_id"] captures just the specified column (series)
and finally set() returns the unique list (set)
Specific to your code:
x = datetime(lowerbound) #pseudo-code
y = datetime(upperbound) #pseudo-code
filtered_set = set(fi_df[fi_df["stamp_time"]>=x][fi_df["stamp_time"]<=y]["col of interest"])

Using Strings in dot notation Matlab

I wish to access data within a table with dot notation that includes strings. I have a list of strings representing columns of interest within the table. How can I access data using those strings? I wish to create a loop going through the list of strings.
For example for table T I have columns {a b c d e}. I have a 1x3 cell cols={b d e}.
Can I retrieve data using cols in the format (or equivalent) T.cols(1) to to give me the same result as T.b?
You can fetch directly the data using {curly braces} and strings as column indices as you would for a cell array.
For example, lets's create a dummy table (modified from the docs):
clc;
close all;
LastName = {'Smith';'Johnson';'Williams';'Jones';'Brown'};
a = [38;43;38;40;49];
b = [71;69;64;67;64];
c = [176;163;131;133;119];
d = [124 93; 109 77; 125 83; 117 75; 122 80];
T = table(a,b,c,d,...
'RowNames',LastName)
The table looks like this:
T =
a b c d
__ __ ___ __________
Smith 38 71 176 124 93
Johnson 43 69 163 109 77
Williams 38 64 131 125 83
Jones 40 67 133 117 75
Brown 49 64 119 122 80
Now select columns of interest and get the data:
%// Select columns of interest
cols = {'a' 'c'};
%// Fetch data
T{:,cols}
ans =
38 176
43 163
38 131
40 133
49 119
Yay!

How can I plot an iteration coordinate(x,y) data with only label in GNUplot?

I have the set of data like this
0 268 195
1 353 199
2 318 209
3 268 232
4 370 238
5 326 253
6 246 265
7 372 284
8 313 290
9 258 297
0 268 196
1 353 199
2 318 209
3 268 233
4 370 238
5 325 253
6 246 265
7 372 284
8 313 290
9 258 297
I would like to use first column for label and second and third for (x,y) plot, however, I would like to plot only one time label without iteration. How should I do?
Thank you for help.
Do you want something like:
plot 'datafile' u 2:3:1 with labels
... I'm not really sure what you mean by "I would like to plot only one time label without iteration" ...
It looks to me like you want to only take 1 unique label. E.g. only one label that is 0, and only 1 label that is 1 etc. For simplicity, I'll take the first with a small python script:
#test.py
import sys
seen = set()
with open(sys.argv[1]) as f:
for line in f:
num,rest = line.split(None,1)
if num not in seen:
seen.add(num)
sys.stdout.write(line)
Now we can plot our file in gnuplot:
plot '< python test.py yourdatafile' u 2:3:1 w labels
Here's a version of test.py which will average the positions of all the labels with the same "value".
import sys
from collections import defaultdict
d = defaultdict(list)
with open(sys.argv[1]) as f:
for line in f:
num,x,y = map(int,line.split())
d[num].append((x,y))
#now average
for k,v in d.items():
x,y = zip(*v)
avg_x = float(sum(x))/len(x)
avg_y = float(sum(y))/len(y)
print k,avg_x,avg_y

Resources