How to filter a 2 level list based on values within sublists? - python-3.x

Treat the following list as a table, where data[0] contains the column headers.
data = [
['S1', 'S2 ', 'ELEMENT', 'C1', 'C2'],
['X' , 'X' , 'GRT' , 1, 4 ],
['' , 'X' , 'OIP' , 3, 2 ],
['' , 'X' , 'LKJ' , 2, 7 ],
['X' , '' , 'UBC' , 1, 0 ]
]
I'm trying to filter the list based on the values in "column S1" and "column S2".
I want to get:
a new list "S1" containing the sublists that have an "X" in "column S1"
a new list "S2" containing the sublists that have an "X" in "column S2"
Like this:
S1 = [
['ELEMENT', 'C1', 'C2'],
['GRT', 1, 4 ],
['UBC', 1, 0 ]
]
S2 = [
['ELEMENT', 'C1', 'C2'],
['GRT', 1, 4 ],
['OIP', 3, 2 ],
['LKJ', 2, 7 ]
]
Below is the code I have so far: I make a copy of the source list data and then remove every sublist that doesn't have "X" in "column S1". I get the correct content in the new list S1,
but I don't know why the source list data is also being modified, so I cannot use it to build the new list S2.
S1 = data
for sublist in S1[1:]:
    if sublist[0] != "X":
        S1.remove(sublist)
S2 = data
for sublist in S2[1:]:
    if sublist[1] != "X":
        S2.remove(sublist)
>>> data
[['S1', 'S2 ', 'ELEMENT', 'C1', 'C2'], ['X', 'X', 'GRT', 1, 4], ['X', '', 'UBC', 1, 0]]
>>> S1
[['S1', 'S2 ', 'ELEMENT', 'C1', 'C2'], ['X', 'X', 'GRT', 1, 4], ['X', '', 'UBC', 1, 0]]
>>>
What would be a better way to get lists S1 and S2? Thanks.

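The problem is that assigning a list to a new name does not make a copy: S1 = data only makes S1 another name for the same list object, so removing items through S1 also removes them from data.
You might be able to make your solution work by copying the list first:
S1 = data[:] # slicing makes a (shallow) copy
S2 = data[:]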
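To make the difference concrete, here is a minimal sketch on a shortened table. The names alias and shallow are only illustrative and are not part of the original code.

```python
data = [['S1', 'S2'], ['X', ''], ['', 'X']]

alias = data       # a second name for the same list object
shallow = data[:]  # a new outer list (the sublists themselves are still shared)

shallow.remove(['', 'X'])        # only the copy shrinks
print(len(data), len(shallow))   # 3 2

alias.remove(['X', ''])          # removing through the alias changes data too
print(alias is data, len(data))  # True 2
```

Because the slice only copies the outer list, removing rows from the copy is safe, but mutating a row in place would still be visible through both lists.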
Here's a generic solution:
def split_from_columns(ls, i_columns=(), indicator='X'):
    for i in i_columns:
        yield [
            [v for k, v in enumerate(sl) if k not in i_columns]
            for j, sl in enumerate(ls)
            if j == 0 or sl[i] == indicator
        ]
Usage:
>>> S1, S2 = split_from_columns(data, i_columns=(0, 1))
>>> S1
[['ELEMENT', 'C1', 'C2'], ['GRT', 1, 4], ['UBC', 1, 0]]
>>> S2
[['ELEMENT', 'C1', 'C2'], ['GRT', 1, 4], ['OIP', 3, 2], ['LKJ', 2, 7]]
The if j == 0 part makes sure we always copy the header. You can change i_columns to adjust where the indicator columns are.
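If you only need one filtered table at a time, a simpler variant is possible. The sketch below is not from the original answer: filter_by_flag and its parameters are hypothetical names, and it assumes the indicator columns are the leading columns of every row.

```python
def filter_by_flag(table, flag_col, n_flag_cols=2, indicator='X'):
    """Keep the header plus the rows flagged in flag_col, dropping the flag columns."""
    header, *rows = table
    kept = [row for row in rows if row[flag_col] == indicator]
    # Slicing each row builds new lists, so the source table is left untouched.
    return [row[n_flag_cols:] for row in [header] + kept]

data = [
    ['S1', 'S2 ', 'ELEMENT', 'C1', 'C2'],
    ['X', 'X', 'GRT', 1, 4],
    ['', 'X', 'OIP', 3, 2],
    ['', 'X', 'LKJ', 2, 7],
    ['X', '', 'UBC', 1, 0],
]

S1 = filter_by_flag(data, 0)
S2 = filter_by_flag(data, 1)
print(S1)  # [['ELEMENT', 'C1', 'C2'], ['GRT', 1, 4], ['UBC', 1, 0]]
print(S2)  # [['ELEMENT', 'C1', 'C2'], ['GRT', 1, 4], ['OIP', 3, 2], ['LKJ', 2, 7]]
```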

Related

How to aggregate string length sequence base on an indicator sequence

I have a dictionary with two keys and their values are lists of strings.
I want to calculate the string lengths of one list based on an indicator in the other list.
It's difficult to frame the question in words, so let's look at an example.
Here is an example dictionary:
thisdict ={
'brand': ['Ford','bmw','toyota','benz','audi','subaru','ferrari','volvo','saab'],
'type': ['O','B','O','B','I','I','O','B','B']
}
Now, I want to add an item to the dictionary that holds the cumulative string length of the "brand" sequence, based on the "type" sequence.
Here are the criteria:
If type = 'O', set the string length to 0 for that index.
If type = 'B', set the string length to the length of the corresponding string.
If type = 'I', things get more complicated: look back along the sequence and sum up the string lengths until you reach the first 'B'.
Here is an example output:
thisdict ={
"brand": ['Ford','bmw','toyota','benz','audi','subaru','ferrari','volvo','saab'],
'type': ['O','B','O','B','I','I','O','B','B'],
'cumulative-length':[0,3,0,4,8,14,0,5,4]
}
where 8=len(benz)+len(audi) and 14=len(benz)+len(audi)+len(subaru)
Note that in the real data I'm working on, the sequence can be one "B" followed by an arbitrary number of "I"s, i.e. ['B','I','I','I','I','I','I',...,'O'], so I'm looking for a solution that is robust in such situations.
Thanks
You can use the zip function to tie the brand and type together. Then just keep a running total as you loop through the paired values. This solution supports a series of any length and brand strings of any length. I am assuming that len(thisdict['brand']) == len(thisdict['type']).
thisdict = {
'brand': ['Ford','bmw','toyota','benz','audi','subaru','ferrari','volvo','saab'],
'type': ['O','B','O','B','I','I','O','B','B']
}
lengths = []
running_total = 0
for b, t in zip(thisdict['brand'], thisdict['type']):
    if t == 'O':
        lengths.append(0)
    elif t == 'B':
        running_total = len(b)
        lengths.append(running_total)
    elif t == 'I':
        running_total += len(b)
        lengths.append(running_total)
print(lengths)
# [0, 3, 0, 4, 8, 14, 0, 5, 4]
Generating random data
import random
import string
def get_random_brand_and_type():
    n = random.randint(1,8)
    b = ''.join(random.choice(string.ascii_uppercase) for _ in range(n))
    t = random.choice(['B', 'I', 'O'])
    return b, t
thisdict = {
    'brand': [],
    'type': []
}
for i in range(random.randint(1,20)):
    b, t = get_random_brand_and_type()
    thisdict['brand'].append(b)
    thisdict['type'].append(t)
yields the following result:
{'type': ['B', 'B', 'O', 'I', 'B', 'O', 'O', 'I', 'O'],
'brand': ['O', 'BSYMLFN', 'OF', 'SO', 'KPQGRW', 'DLCWW', 'VLU', 'ZQE', 'GEUHERHE']}
[1, 7, 0, 9, 6, 0, 0, 9, 0]
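The same loop can be wrapped in a function and checked against a long run of 'I's, which is the case the question worries about. This is only a sketch: cumulative_lengths is a hypothetical name, and it assumes every run of 'I's is preceded by a 'B', as in the question's data.

```python
def cumulative_lengths(brands, types):
    """Return the running string length per item, restarting at every 'B'."""
    lengths = []
    running_total = 0
    for brand, kind in zip(brands, types):
        if kind == 'B':
            running_total = len(brand)   # a 'B' starts a new run
            lengths.append(running_total)
        elif kind == 'I':
            running_total += len(brand)  # an 'I' extends the current run
            lengths.append(running_total)
        else:
            lengths.append(0)            # 'O' (or anything else) contributes 0
    return lengths

brands = ['benz', 'audi', 'subaru', 'saab', 'volvo', 'Ford']
types = ['B', 'I', 'I', 'I', 'I', 'O']
result = cumulative_lengths(brands, types)
print(result)  # [4, 8, 14, 18, 23, 0]
```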

Sort 2d dict by the values

I have a 2d dict like this:
{'John' : {'a' : 9, 'b' : 2, 'c': 5}, 'Smith' : {'d' : 1, 'r' : 3, 'f': 4}}
And I want to print/save them, sorted, like this:
John a 9
John c 5
Smith f 4
Smith r 3
John b 2
Smith d 1
so that they are sorted by their inner value. Neither key is known beforehand.
Is there a way to do this?
Thanks!
One possibility is to expand the dictionary and afterwards perform sorting:
two_dimensional_dictionary = {'John' : {'a' : 9, 'b' : 2, 'c': 5}, 'Smith' : {'d' : 1, 'r' : 3, 'f': 4}}
values = [(first_key, second_key, value)
          for first_key, inner_dict in two_dimensional_dictionary.items()
          for second_key, value in inner_dict.items()]
print(sorted(values, key=lambda x: x[-1], reverse=True))
Output:
[('John', 'a', 9), ('John', 'c', 5), ('Smith', 'f', 4), ('Smith', 'r', 3), ('John', 'b', 2), ('Smith', 'd', 1)]
We start by building a list of three-element tuples, where the first element is the key of the outer dictionary and the second and third elements are the key and value from the inner dictionary.
d = {'John' : {'a' : 9, 'b' : 2, 'c': 5}, 'Smith' : {'d' : 1, 'r' : 3, 'f': 4}}  # avoid naming this "dict", which would shadow the built-in
tuples = []
for key, value in d.items():
    for k, v in value.items():
        tuples.append((key, k, v))
#[('John', 'a', 9), ('John', 'b', 2), ('John', 'c', 5),
#('Smith', 'd', 1), ('Smith', 'r', 3), ('Smith', 'f', 4)]
We then sort this list on the last element in descending order and print the result as follows.
sorted_tuples = sorted(tuples, key=lambda x:x[-1], reverse=True)
for a,b,c in sorted_tuples:
    print(a, b, c)
#John a 9
#John c 5
#Smith f 4
#Smith r 3
#John b 2
#Smith d 1
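Since the question also mentions saving the result, the flatten-then-sort idea can be sketched with a generator and a key that breaks ties predictably. The names scores, flatten, rows and report are illustrative, and negating the value assumes the inner values are numeric.

```python
scores = {'John': {'a': 9, 'b': 2, 'c': 5}, 'Smith': {'d': 1, 'r': 3, 'f': 4}}

def flatten(nested):
    """Yield (outer_key, inner_key, value) for every entry of a dict of dicts."""
    for outer_key, inner in nested.items():
        for inner_key, value in inner.items():
            yield outer_key, inner_key, value

# Largest value first; equal values fall back to alphabetical order of the keys.
rows = sorted(flatten(scores), key=lambda row: (-row[2], row[0], row[1]))

# One "outer inner value" line per entry, ready to print or write to a file.
report = '\n'.join(f'{outer} {inner} {value}' for outer, inner, value in rows)
print(report)
```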

Python loop to create a two column layout

I'm trying to build a two-column layout in a loop with Python and dash-bootstrap-components. The issue is Python-related; I don't quite understand how to achieve this.
I'm looping through a list of values. The layout should be multiple rows, each with two columns.
(Code shortened for brevity)
figs=[]
figs.append(dict(data=data, layout=layout)) # dash
body = dbc.Container(
    [
        dbc.Row(
            [
                dbc.Col(
                    [
                        html.H4('ES'+str(i)),
                        dcc.Graph(figure=figs[i])
                    ],
                    md=6
                ),
                dbc.Col(
                    [
                        html.H4('ES'+str(i+1)),
                        dcc.Graph(figure=figs[i+1]) # <- how to increment i here? This syntax 'figs[i+1]' throws an error.
                    ]
                )
            ]
        )
        for i, value in enumerate(figs)
    ]
)
I need to display a graph figs[i] in column one, then increment the index to display the next graph in column two. figs[i+1] is not working and I'm not sure how to nest a for loop or do a while loop in this code. I've attached an image showing that the code works when using the same figs[i] for the two columns.
UPDATE: Thanks to erkandem's answer below, I was able to arrive at a solution, which is posted here:
figs.append(dict(data=data, layout=layout))
body_py = [0] * len(figs)
for i, value in enumerate(figs):
    left = i
    right = i + 1 if i+1 < len(figs) else 0
    body_py[left] = figs[left]
    body_py[right] = figs[right]
body = dbc.Container(
    [
        dbc.Row(
            [
                dbc.Col(
                    [
                        html.H4('ES '+str(i)),
                        dcc.Graph(figure=body_py[i])
                    ],
                    md=6,
                )
                for i, value in enumerate(body_py)
            ]
        )
    ]
)
app.layout = html.Div([body])
The core issue boils down to indexing with (i + i + 1) instead of i + 1: row i holds figures 2*i and 2*i + 1.
figs = [0, 1, 2, 3, 4, 5, 6, 7]
body = [0] * (len(figs) // 2)
if len(figs) % 2 != 0:
    print('append a dummy figure')
for i in range(len(body)):
    # compute the row label and the indices of the two figures in this row
    row = str(i)
    left = i + i
    right = (i + i + 1)
    # print([row, left, right])  # uncomment to trace the indices
    # build the row
    body[i] = {row: [dict(h4=str(left), graph=figs[left]), dict(h4=str(right), graph=figs[right])]}
for row in body:
    print(row)
which prints:
{'0': [{'h4': '0', 'graph': 0}, {'h4': '1', 'graph': 1}]}
{'1': [{'h4': '2', 'graph': 2}, {'h4': '3', 'graph': 3}]}
{'2': [{'h4': '4', 'graph': 4}, {'h4': '5', 'graph': 5}]}
{'3': [{'h4': '6', 'graph': 6}, {'h4': '7', 'graph': 7}]}
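The same pairing can be written by stepping through the list two items at a time, which also covers an odd number of figures. This is a sketch in plain Python rather than Dash code: pair_rows and filler are hypothetical names, and each resulting pair would correspond to one row with two columns.

```python
def pair_rows(items, filler=None):
    """Group items into rows of two, padding the last row with filler if needed."""
    rows = []
    for start in range(0, len(items), 2):
        left = items[start]
        right = items[start + 1] if start + 1 < len(items) else filler
        rows.append((left, right))
    return rows

figs = ['fig0', 'fig1', 'fig2', 'fig3', 'fig4']
rows = pair_rows(figs)
print(rows)  # [('fig0', 'fig1'), ('fig2', 'fig3'), ('fig4', None)]
```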

Dataframe to Dictionary [duplicate]

I have a DataFrame with four columns. I want to convert this DataFrame to a Python dictionary, with the elements of the first column as keys and the elements of the other columns in the same row as values.
DataFrame:
  ID  A  B  C
0  p  1  3  2
1  q  4  3  2
2  r  4  0  9
Output should be like this:
Dictionary:
{'p': [1,3,2], 'q': [4,3,2], 'r': [4,0,9]}
The to_dict() method sets the column names as dictionary keys so you'll need to reshape your DataFrame slightly. Setting the 'ID' column as the index and then transposing the DataFrame is one way to achieve this.
to_dict() also accepts an 'orient' argument which you'll need in order to output a list of values for each column. Otherwise, a dictionary of the form {index: value} will be returned for each column.
These steps can be done with the following line:
>>> df.set_index('ID').T.to_dict('list')
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
In case a different dictionary format is needed, here are examples of the possible orient arguments. Consider the following simple DataFrame:
>>> df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})
>>> df
        a      b
0     red  0.500
1  yellow  0.250
2    blue  0.125
Then the options are as follows.
dict - the default: column names are keys, values are dictionaries of index:data pairs
>>> df.to_dict('dict')
{'a': {0: 'red', 1: 'yellow', 2: 'blue'},
'b': {0: 0.5, 1: 0.25, 2: 0.125}}
list - keys are column names, values are lists of column data
>>> df.to_dict('list')
{'a': ['red', 'yellow', 'blue'],
'b': [0.5, 0.25, 0.125]}
series - like 'list', but values are Series
>>> df.to_dict('series')
{'a': 0 red
1 yellow
2 blue
Name: a, dtype: object,
'b': 0 0.500
1 0.250
2 0.125
Name: b, dtype: float64}
split - splits columns/data/index as keys with values being column names, data values by row and index labels respectively
>>> df.to_dict('split')
{'columns': ['a', 'b'],
'data': [['red', 0.5], ['yellow', 0.25], ['blue', 0.125]],
'index': [0, 1, 2]}
records - each row becomes a dictionary where key is column name and value is the data in the cell
>>> df.to_dict('records')
[{'a': 'red', 'b': 0.5},
{'a': 'yellow', 'b': 0.25},
{'a': 'blue', 'b': 0.125}]
index - like 'records', but a dictionary of dictionaries with keys as index labels (rather than a list)
>>> df.to_dict('index')
{0: {'a': 'red', 'b': 0.5},
1: {'a': 'yellow', 'b': 0.25},
2: {'a': 'blue', 'b': 0.125}}
If you need a dictionary like:
{'red': 0.5, 'yellow': 0.25, 'blue': 0.125}
from a two-column DataFrame like:
        a      b
0     red  0.500
1  yellow  0.250
2    blue  0.125
the simplest way is:
dict(df.values)
working snippet below:
import pandas as pd
df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})
dict(df.values)
Follow these steps:
Suppose your dataframe is as follows:
>>> df
   A  B  C ID
0  1  3  2  p
1  4  3  2  q
2  4  0  9  r
1. Use set_index to set the ID column as the DataFrame index.
df.set_index("ID", drop=True, inplace=True)
2. Use the orient="index" parameter to have the index as dictionary keys.
dictionary = df.to_dict(orient="index")
The results will be as follows:
>>> dictionary
{'p': {'A': 1, 'B': 3, 'C': 2}, 'q': {'A': 4, 'B': 3, 'C': 2}, 'r': {'A': 4, 'B': 0, 'C': 9}}
3. If you need each row as a list, run the following code, choosing the column order you want:
column_order = ["A", "B", "C"] # Determine your preferred order of columns
d = {} # Initialize the new dictionary as an empty dictionary
for k in dictionary:
    d[k] = [dictionary[k][column_name] for column_name in column_order]
Try using zip:
df = pd.read_csv("file")
d = dict([(i, [a, b, c]) for i, a, b, c in zip(df.ID, df.A, df.B, df.C)])
print(d)
Output:
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
If you don't mind the dictionary values being tuples, you can use itertuples:
>>> {x[0]: x[1:] for x in df.itertuples(index=False)}
{'p': (1, 3, 2), 'q': (4, 3, 2), 'r': (4, 0, 9)}
For my use (node names with xy positions) I found @user4179775's answer to be the most helpful / intuitive:
import pandas as pd
df = pd.read_csv('glycolysis_nodes_xy.tsv', sep='\t')
df.head()
    nodes    x    y
0  c00033  146  958
1  c00031  601  195
...
xy_dict_list=dict([(i,[a,b]) for i, a,b in zip(df.nodes, df.x,df.y)])
xy_dict_list
{'c00022': [483, 868],
'c00024': [146, 868],
... }
xy_dict_tuples=dict([(i,(a,b)) for i, a,b in zip(df.nodes, df.x,df.y)])
xy_dict_tuples
{'c00022': (483, 868),
'c00024': (146, 868),
... }
Addendum
I later returned to this issue, for other, but related, work. Here is an approach that more closely mirrors the [excellent] accepted answer.
node_df = pd.read_csv('node_prop-glycolysis_tca-from_pg.tsv', sep='\t')
node_df.head()
   node  kegg_id kegg_cid        name  wt  vis
0    22       22   c00022    pyruvate   1    1
1    24       24   c00024  acetyl-CoA   1    1
...
Convert Pandas dataframe to a [list], {dict}, {dict of {dict}}, ...
Per accepted answer:
node_df.set_index('kegg_cid').T.to_dict('list')
{'c00022': [22, 22, 'pyruvate', 1, 1],
'c00024': [24, 24, 'acetyl-CoA', 1, 1],
... }
node_df.set_index('kegg_cid').T.to_dict('dict')
{'c00022': {'kegg_id': 22, 'name': 'pyruvate', 'node': 22, 'vis': 1, 'wt': 1},
'c00024': {'kegg_id': 24, 'name': 'acetyl-CoA', 'node': 24, 'vis': 1, 'wt': 1},
... }
In my case, I wanted to do the same thing but with selected columns from the Pandas dataframe, so I needed to slice the columns. There are two approaches.
Directly:
(see: Convert pandas to dictionary defining the columns used fo the key values)
node_df.set_index('kegg_cid')[['name', 'wt', 'vis']].T.to_dict('dict')
{'c00022': {'name': 'pyruvate', 'vis': 1, 'wt': 1},
'c00024': {'name': 'acetyl-CoA', 'vis': 1, 'wt': 1},
... }
"Indirectly:" first, slice the desired columns/data from the Pandas dataframe (again, two approaches),
node_df_sliced = node_df[['kegg_cid', 'name', 'wt', 'vis']]
or
node_df_sliced2 = node_df.loc[:, ['kegg_cid', 'name', 'wt', 'vis']]
which can then be used to create a dictionary of dictionaries:
node_df_sliced.set_index('kegg_cid').T.to_dict('dict')
{'c00022': {'name': 'pyruvate', 'vis': 1, 'wt': 1},
'c00024': {'name': 'acetyl-CoA', 'vis': 1, 'wt': 1},
... }
Most of the answers do not deal with the situation where an ID can occur multiple times in the DataFrame. If IDs can be duplicated in the DataFrame df, you will want to store the values for each ID in a list of lists, grouped by ID:
{k: [g['A'].tolist(), g['B'].tolist(), g['C'].tolist()] for k,g in df.groupby('ID')}
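A small sketch of what this comprehension produces when an ID repeats. The sample rows below are made up for illustration and are not from the question; note that the values are grouped per column, not per row.

```python
import pandas as pd

# Hypothetical sample data in which the ID 'p' appears twice.
df = pd.DataFrame(
    [['p', 1, 3, 2], ['q', 4, 3, 2], ['p', 7, 8, 9]],
    columns=['ID', 'A', 'B', 'C'],
)

grouped = {k: [g['A'].tolist(), g['B'].tolist(), g['C'].tolist()]
           for k, g in df.groupby('ID')}
print(grouped)
# {'p': [[1, 7], [3, 8], [2, 9]], 'q': [[4], [3], [2]]}
```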
A dictionary comprehension with the iterrows() method can also be used to get the desired output.
result = {row.ID: [row.A, row.B, row.C] for (index, row) in df.iterrows()}
df = pd.DataFrame([['p',1,3,2], ['q',4,3,2], ['r',4,0,9]], columns=['ID','A','B','C'])
my_dict = {k:list(v) for k,v in zip(df['ID'], df.drop(columns='ID').values)}
print(my_dict)
with output
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
With this method, the column names of the DataFrame become the keys and each column's values (as a list) become the values.
data_dict = dict()
for col in dataframe.columns:
    data_dict[col] = dataframe[col].values.tolist()
DataFrame.to_dict() converts DataFrame to dictionary.
Example
>>> df = pd.DataFrame(
...     {'col1': [1, 2], 'col2': [0.5, 0.75]}, index=['a', 'b'])
>>> df
   col1  col2
a     1  0.50
b     2  0.75
>>> df.to_dict()
{'col1': {'a': 1, 'b': 2}, 'col2': {'a': 0.5, 'b': 0.75}}
See the pandas DataFrame.to_dict documentation for details.

Vector combination and array sorting

I have two column vectors
A = [8, 2, 2, 1]
B = ['John', 'Peter', 'Paul', 'Evans']
How do I combine them to have an array of
C =
8 'John'
2 'Peter'
2 'Paul'
1 'Evans'
And how do I sort C in ascending order such that I have
C =
1 'Evans'
2 'Paul'
2 'Peter'
8 'John'
I just migrated to Python from MATLAB and I am having difficulty with this.
Here's an approach using np.argsort with kind='mergesort' to get the sorting indices; mergesort is stable, so equal values (the two 2s here) keep their original relative order. We can then stack the input arrays as columns and index the result with those indices to get the desired output, like so -
In [172]: A
Out[172]: array([8, 2, 2, 1])
In [173]: B
Out[173]:
array(['John', 'Peter', 'Paul', 'Evans'],
dtype='|S5')
In [174]: sidx = A.argsort(kind='mergesort')
In [175]: np.column_stack((A,B))[sidx]
Out[175]:
array([['1', 'Evans'],
['2', 'Peter'],
['2', 'Paul'],
['8', 'John']],
dtype='|S21')
If you would like to keep the int type for the first column in the output array, you could create an object dtype array, like so -
arr = np.empty((len(A),2),dtype=object)
arr[:,0] = A
arr[:,1] = B
out = arr[sidx] # sidx is same as from previous approach
Result -
In [189]: out
Out[189]:
array([[1, 'Evans'],
[2, 'Peter'],
[2, 'Paul'],
[8, 'John']], dtype=object)
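Use np.column_stack() to combine the arrays. Be aware that np.sort() sorts along the last axis by default, i.e. within each row, so the np.sort(...)[::-1] call shown here does not reorder rows by value; it matches the desired output only because reversing this particular input already gives ascending order: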
In [9]: np.column_stack((A, B))
Out[9]:
array([['8', 'John'],
['2', 'Peter'],
['2', 'Paul'],
['1', 'Evans']],
dtype='<U5')
In [10]: np.sort(np.column_stack((A, B)))[::-1]
Out[10]:
array([['1', 'Evans'],
['2', 'Paul'],
['2', 'Peter'],
['8', 'John']],
dtype='<U5')
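If NumPy arrays are not required, a plain-Python alternative (not one of the original answers) is to zip the two lists and sort the pairs. Tuples compare element by element, so ties in the number fall back to alphabetical order of the name, which matches the ordering asked for in the question.

```python
A = [8, 2, 2, 1]
B = ['John', 'Peter', 'Paul', 'Evans']

# Pair each number with its name, then sort by number and, on ties, by name.
C = sorted(zip(A, B))
print(C)  # [(1, 'Evans'), (2, 'Paul'), (2, 'Peter'), (8, 'John')]

# Split the sorted pairs back into two parallel lists if needed.
A_sorted, B_sorted = map(list, zip(*C))
print(A_sorted)  # [1, 2, 2, 8]
print(B_sorted)  # ['Evans', 'Paul', 'Peter', 'John']
```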
