Using strings in dot notation in MATLAB

I wish to access data within a table using dot notation driven by strings. I have a list of strings naming the columns of interest within the table, and I wish to loop through that list. How can I access the data using those strings?
For example, table T has columns {a b c d e}, and I have a 1x3 cell cols = {'b' 'd' 'e'}.
Can I retrieve data using cols in the format (or equivalent) T.cols(1), to give me the same result as T.b?

You can fetch the data directly using {curly braces} with strings as column indices, just as you would for a cell array. (For the loop, see the dynamic-field-name sketch after the example.)
For example, let's create a dummy table (modified from the docs):
clc;
close all;
LastName = {'Smith';'Johnson';'Williams';'Jones';'Brown'};
a = [38;43;38;40;49];
b = [71;69;64;67;64];
c = [176;163;131;133;119];
d = [124 93; 109 77; 125 83; 117 75; 122 80];
T = table(a,b,c,d,...
    'RowNames',LastName)
The table looks like this:

T =

                a     b     c          d
                __    __    ___    __________

    Smith       38    71    176    124     93
    Johnson     43    69    163    109     77
    Williams    38    64    131    125     83
    Jones       40    67    133    117     75
    Brown       49    64    119    122     80
Now select columns of interest and get the data:
%// Select columns of interest
cols = {'a' 'c'};
%// Fetch data
T{:,cols}
ans =

    38   176
    43   163
    38   131
    40   133
    49   119
Yay!
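Since the question also asks about a loop over the list of strings, MATLAB's dynamic field names cover that case too; a minimal sketch reusing cols from above (colData is just an illustrative variable name):

%// T.(cols{k}) is the same as writing T.a, T.c, ... by hand
for k = 1:numel(cols)
    colData = T.(cols{k})   %// one column's data per iteration
end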

Related

How to check if a column's value is found in a list column, with Spark SQL?

I have a Delta table A as shown below.

point   cluster   points_in_cluster
37      1         [37, 32]
45      2         [45, 67, 84]
67      2         [45, 67, 84]
84      2         [45, 67, 84]
32      1         [37, 32]
Also I have a table B as shown below.

id    point
101   37
102   67
103   84
I want a query like the following. Here IN obviously doesn't work against an array column, so what would be the right syntax?
select b.id, a.point
from A a, B b
where b.point in a.points_in_cluster
As a result I should have a table like the following:

id    point
101   37
101   32
102   45
102   67
102   84
103   45
103   67
103   84
Based on your data sample, I'd do an equi-join on the point column and then an explode on points_in_cluster:
from pyspark.sql import functions as F

# assuming A is df_A and B is df_B
df_A.join(
    df_B,
    on="point"
).select(
    "id",
    F.explode("points_in_cluster").alias("point")
)
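To sanity-check this locally, a sketch assuming an existing SparkSession named spark (the rows mirror tables A and B above):

df_A = spark.createDataFrame(
    [(37, 1, [37, 32]), (45, 2, [45, 67, 84]), (67, 2, [45, 67, 84]),
     (84, 2, [45, 67, 84]), (32, 1, [37, 32])],
    "point INT, cluster INT, points_in_cluster ARRAY<INT>")
df_B = spark.createDataFrame(
    [(101, 37), (102, 67), (103, 84)],
    "id INT, point INT")

# same equi-join + explode as above
df_A.join(df_B, on="point").select(
    "id", F.explode("points_in_cluster").alias("point")
).show()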
Otherwise, use array_contains:
select b.id, a.point
from A a, B b
where array_contains(a.points_in_cluster, b.point)
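The array_contains condition also translates to the DataFrame API; a sketch under the same df_A/df_B naming assumption, using F.expr to keep the alias-qualified column references unambiguous:

from pyspark.sql import functions as F

result = df_B.alias("b").join(
    df_A.alias("a"),
    F.expr("array_contains(a.points_in_cluster, b.point)")
).select("b.id", "a.point")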

Count the number of labels in an IOB corpus with Pandas

From my IOB corpus such as:
     mention        Tag
170
171  467            O
172
173  Vincennes      B-LOCATION
174  .              O
175
176  Confirmation   O
177  des            O
178  privilèges     O
179  de             O
180  la             O
181  ville          B-ORGANISATION
182  de             I-ORGANISATION
183  Tournai        I-ORGANISATION
184  1              O
185  (              O
186  cf             O
187  .              O
188  infra          O
189  ,              O
I am trying to produce simple statistics such as the total number of annotated mentions, totals by label, etc.
After loading my dataset with pandas I got this:
df = pd.Series(data['Tag'].value_counts(), name="Total").to_frame().reset_index()
df.columns = ['Label', 'Total']
df
Output:
Label Total
0 O 438528
1 36235
2 B-LOCATION 378
3 I-LOCATION 259
4 I-PERSON 234
5 I-INSTALLATION 156
6 I-ORGANISATION 150
7 B-PERSON 144
8 B-TITLE 94
9 I-TITLE 89
10 B-ORGANISATION 68
11 B-INSTALLATION 62
12 I-EVENT 8
13 B-EVENT 2
First of all, how could I get a similar representation to the one above, but with the B-/I- prefixes grouped together, for example:
Label, Total
PERSON, 300
LOCATION, 154
ORGANISATION, 67
etc.
and secondly, how can I exclude the "O" and empty-string labels from my output? I tested .mask() and .where() on my Series, but it fails.
Thank you for your leads.
Remove the B-/I- prefixes, then groupby and sum (note the column is named 'Label' in your frame):

df['Label'] = df['Label'].str[2:]
df.groupby(['Label']).sum()

For the second part, just keep the rows whose label string is longer than 2 characters, which drops both "O" and the empty labels:

df.loc[df['Label'].str.len() > 2]
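Putting both steps together in the right order (filter first, while the prefixes are still present), a self-contained sketch with a toy frame shaped like the value_counts output above:

import pandas as pd

# toy frame mirroring the Label/Total output above
df = pd.DataFrame({
    'Label': ['O', '', 'B-LOCATION', 'I-LOCATION', 'I-PERSON', 'B-PERSON'],
    'Total': [438528, 36235, 378, 259, 234, 144],
})

# drop "O" and the empty labels first
df = df.loc[df['Label'].str.len() > 2]

# strip the "B-"/"I-" prefixes and sum the counts per entity type
df['Label'] = df['Label'].str[2:]
print(df.groupby('Label', as_index=False)['Total'].sum())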

Find duplicated Rows based on selected columns condition in a pandas Dataframe

I have a large database converted into a dataframe, in which it is difficult to identify the following manually.
The dataframe has columns named from_bus and to_bus, which together form a unique identifier regardless of order; for example, for element 0 (L_ABAN_MACA_0_1) the ordered pair (109, 140) is the same as (140, 109).
   name             from_bus  to_bus  x_ohm_per_km
0  L_ABAN_MACA_0_1       109     140      0.444450
1  L_AGOY_BAÑO_1_1        69      66      0.476683
2  L_AGOY_BAÑO_1_2        69      66      0.476683
3  L_ALAN_INGA_1_1       189     188      0.452790
4  L_ALAN_INGA_1_2       188     189      0.500450
So I want to identify the duplicate ordered pairs and replace each group with a single row whose x_ohm_per_km value is the sum of the duplicated values, as follows:
   name             from_bus  to_bus  x_ohm_per_km
0  L_ABAN_MACA_0_1       109     140      0.444450
1  L_AGOY_BAÑO_1_1        69      66      0.953366
3  L_ALAN_INGA_1_1       189     188      0.953240
Let us try groupby on from_bus and to_bus after sorting the values in these columns along axis=1, then agg to aggregate the result, and optionally reindex to restore the original column order:
import numpy as np

c = ['from_bus', 'to_bus']
# sort each (from_bus, to_bus) pair so (140, 109) becomes (109, 140)
df[c] = np.sort(df[c], axis=1)
df.groupby(c, sort=False, as_index=False)\
  .agg({'name': 'first', 'x_ohm_per_km': 'sum'})\
  .reindex(df.columns, axis=1)
Alternative approach: aggregate every column with 'first' except x_ohm_per_km, which is summed, grouping directly on the sorted pairs:

d = {**dict.fromkeys(df, 'first'), 'x_ohm_per_km': 'sum'}
df.groupby([*np.sort(df[c], axis=1).T], sort=False, as_index=False).agg(d)
              name  from_bus  to_bus  x_ohm_per_km
0  L_ABAN_MACA_0_1       109     140      0.444450
1  L_AGOY_BAÑO_1_1        66      69      0.953366
2  L_ALAN_INGA_1_1       188     189      0.953240
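For reference, the first approach end to end on the question's sample rows (a runnable sketch; the data is copied from the table above):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['L_ABAN_MACA_0_1', 'L_AGOY_BAÑO_1_1', 'L_AGOY_BAÑO_1_2',
             'L_ALAN_INGA_1_1', 'L_ALAN_INGA_1_2'],
    'from_bus': [109, 69, 69, 189, 188],
    'to_bus': [140, 66, 66, 188, 189],
    'x_ohm_per_km': [0.444450, 0.476683, 0.476683, 0.452790, 0.500450],
})

c = ['from_bus', 'to_bus']
df[c] = np.sort(df[c], axis=1)            # make each pair order-insensitive
out = (df.groupby(c, sort=False, as_index=False)
         .agg({'name': 'first', 'x_ohm_per_km': 'sum'})
         .reindex(df.columns, axis=1))
print(out)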

Get the names of Top 'n' columns based on a threshold for values across a row

Let's say, that I have the following data:
In [1]: df
Out[1]:
  Student_Name  Maths  Physics  Chemistry  Biology  English
0     John Doe     90       87         81       65       70
1     Jane Doe     82       84         75       73       77
2     Mary Lim     40       65         55       60       70
3     Lisa Ray     55       52         77       62       90
I want to add a column to this dataframe which tells me the students' top 'n' subjects that are above a threshold, where the subject names are available in the column names. Let's assume n=3 and threshold=80.
The output would look like the following:
In [3]: df
Out[3]:
  Student_Name  Maths  Physics  Chemistry  Biology  English             Top_3_above_80
0     John Doe     90       87         81       65       70  Maths, Physics, Chemistry
1     Jane Doe     82       84         75       73       77             Physics, Maths
2     Mary Lim     40       65         55       60       70                        nan
3     Lisa Ray     55       52         77       62       90                    English
I tried to use the solution written by @jezrael for this question, where they use numpy.argsort to get the positions of sorted values for the top 'n' columns, but I am unable to set a threshold value below which nothing should be considered.
The idea is to first replace non-matching values with missing values using DataFrame.where, then apply the numpy.argsort solution. Counting the Trues per row gives the number of matches, and numpy.where replaces the non-matching positions with empty strings.
Last, the values are joined in a list comprehension, and rows with no match at all are set back to missing values:
import numpy as np
import pandas as pd

df1 = df.iloc[:, 1:]
m = df1 > 80
count = m.sum(axis=1)
# column names per row, ordered by descending value (NaNs sort last)
arr = df1.columns.values[np.argsort(-df1.where(m), axis=1)]
# keep only as many names per row as there were values above the threshold
m = np.arange(arr.shape[1]) < count.to_numpy()[:, None]
a = np.where(m, arr, '')
L = [', '.join(x).strip(', ') for x in a]
df['Top_3_above_80'] = pd.Series(L, index=df.index)[count > 0]
print (df)
  Student_Name  Maths  Physics  Chemistry  Biology  English \
0     John Doe     90       87         81       65       70
1     Jane Doe     82       84         75       73       77
2     Mary Lim     40       65         55       60       70
3     Lisa Ray     55       52         77       62       90

              Top_3_above_80
0  Maths, Physics, Chemistry
1             Physics, Maths
2                        NaN
3                    English
If performance is not important, use Series.nlargest per row, but it is really slow for a large DataFrame:
df1 = df.iloc[:, 1:]
m = df1 > 80
count = m.sum(axis=1)
df['Top_3_above_80'] = (df1.where(m)
                           .apply(lambda x: ', '.join(x.nlargest(3).index), axis=1)[count > 0])
print (df)
  Student_Name  Maths  Physics  Chemistry  Biology  English \
0     John Doe     90       87         81       65       70
1     Jane Doe     82       84         75       73       77
2     Mary Lim     40       65         55       60       70
3     Lisa Ray     55       52         77       62       90

              Top_3_above_80
0  Maths, Physics, Chemistry
1             Physics, Maths
2                        NaN
3                    English
Performance:
#4k rows
df = pd.concat([df] * 1000, ignore_index=True)
#print (df)

def f1(df):
    df1 = df.iloc[:, 1:]
    m = df1 > 80
    count = m.sum(axis=1)
    arr = df1.columns.values[np.argsort(-df1.where(m), axis=1)]
    m = np.arange(arr.shape[1]) < count.to_numpy()[:, None]
    a = np.where(m, arr, '')
    L = [', '.join(x).strip(', ') for x in a]
    df['Top_3_above_80'] = pd.Series(L, index=df.index)[count > 0]
    return df

def f2(df):
    df1 = df.iloc[:, 1:]
    m = df1 > 80
    count = m.sum(axis=1)
    df['Top_3_above_80'] = (df1.where(m)
                               .apply(lambda x: ', '.join(x.nlargest(3).index), axis=1)[count > 0])
    return df
In [210]: %timeit (f1(df.copy()))
19.3 ms ± 272 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [211]: %timeit (f2(df.copy()))
2.43 s ± 61.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
An alternative:
res = []
tmp = df.set_index('Student_Name').T
for col in list(tmp):
    top3 = tmp[col].nlargest(3)   # compute nlargest once per student
    res.append(top3[top3 > 80].index.tolist())
res = [x if len(x) > 0 else np.nan for x in res]
df['Top_3_above_80'] = res
Output:
  Student_Name  Maths  Physics  Chemistry  Biology  English               Top_3_above_80
0     John Doe     90       87         81       65       70  [Maths, Physics, Chemistry]
1     Jane Doe     82       84         75       73       77             [Physics, Maths]
2     Mary Lim     40       65         55       60       70                          NaN
3     Lisa Ray     55       52         77       62       90                    [English]
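If you want n and the threshold as explicit parameters, a hedged wrapper around the nlargest idea (the function name and signature are illustrative, not from the answers above):

import numpy as np
import pandas as pd

def top_n_above(df, n=3, threshold=80, id_col='Student_Name'):
    # hide values at or below the threshold, then pick each row's top n
    scores = df.drop(columns=[id_col])
    masked = scores.where(scores > threshold)   # non-matches become NaN

    def row_top(x):
        top = x.nlargest(n)                     # nlargest ignores NaN
        return ', '.join(top.index) if len(top) else np.nan

    return masked.apply(row_top, axis=1)

df['Top_3_above_80'] = top_n_above(df, n=3, threshold=80)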

Combining data tables

I have two data tables, similar to the ones below:
table 1

index   value
a       6352
a       67
a       43
b       7765
b       53
c       243
c       7
c       543

table 2

index   value
a       425
a       6
b       532
b       125
b       89
b       664
c       314
I would like to combine the data into one table, as shown below, using the index values. The order is important: within each index, the values from table 1 must come first.
index   value
a       6352
a       67
a       43
a       425
a       6
b       7765
b       53
b       532
b       125
b       89
b       664
c       243
c       7
c       543
c       314
I tried to do it using VBA, but I'm sadly a complete novice, and I was wondering if someone has any pointers on how to approach writing the code?
Copy the values of the second table (without the headers) underneath the values of the first table, then select the two resulting columns and sort them by index. Excel's sort keeps rows with equal keys in their original order, so each index's table 1 values stay first.
Hope it works!
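If a scripted route ever becomes an option, the same combine-then-stable-sort idea is a few lines of pandas (a sketch with hypothetical frame names, not the VBA the question asks about):

import pandas as pd

table1 = pd.DataFrame({'index': list('aaabbccc'),
                       'value': [6352, 67, 43, 7765, 53, 243, 7, 543]})
table2 = pd.DataFrame({'index': list('aabbbbc'),
                       'value': [425, 6, 532, 125, 89, 664, 314]})

# stack table2 under table1, then stable-sort by index so that
# within each index the table1 rows keep coming first
combined = (pd.concat([table1, table2], ignore_index=True)
              .sort_values('index', kind='stable', ignore_index=True))
print(combined)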
