Vector combination and array sorting - python-3.x

I have two column vectors
A = [8, 2, 2, 1]
B = ['John', 'Peter', 'Paul', 'Evans']
How do I combine them to have an array of
C =
8 'John'
2 'Peter'
2 'Paul'
1 'Evans'
And how do I sort C in ascending order such that I have
C =
1 'Evans'
2 'Paul'
2 'Peter'
8 'John'
I just migrated to Python from MATLAB and am having difficulty with this.

Here's an approach using np.argsort with kind='mergesort', which is stable and therefore preserves the original order of equal elements, to get the sorted indices. We can then stack the input arrays as columns and index with those indices for the desired output, like so -
In [172]: A
Out[172]: array([8, 2, 2, 1])
In [173]: B
Out[173]:
array(['John', 'Peter', 'Paul', 'Evans'],
dtype='|S5')
In [174]: sidx = A.argsort(kind='mergesort')
In [175]: np.column_stack((A,B))[sidx]
Out[175]:
array([['1', 'Evans'],
['2', 'Peter'],
['2', 'Paul'],
['8', 'John']],
dtype='|S21')
If you would like to keep the int type for the first column in the output array, you could create an object dtype array, like so -
arr = np.empty((len(A),2),dtype=object)
arr[:,0] = A
arr[:,1] = B
out = arr[sidx] # sidx is same as from previous approach
Result -
In [189]: out
Out[189]:
array([[1, 'Evans'],
[2, 'Peter'],
[2, 'Paul'],
[8, 'John']], dtype=object)
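If NumPy is not required at all, the same result can be had with Python's built-in sorted, which is also stable; a minimal sketch using plain lists rather than arrays:

```python
A = [8, 2, 2, 1]
B = ['John', 'Peter', 'Paul', 'Evans']

# sorted() is stable, so rows with equal keys keep their original order
C = sorted(zip(A, B), key=lambda row: row[0])
print(C)  # [(1, 'Evans'), (2, 'Peter'), (2, 'Paul'), (8, 'John')]
```

The key=lambda is what keeps 'Peter' before 'Paul'; comparing whole tuples would break ties alphabetically instead.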

Use np.column_stack() and np.sort(). One caution: np.sort sorts within each row here, so the [::-1] below merely reverses the row order; that happens to give the desired result for this input, but it is not a general row sort.
In [9]: np.column_stack((A, B))
Out[9]:
array([['8', 'John'],
['2', 'Peter'],
['2', 'Paul'],
['1', 'Evans']],
dtype='<U5')
In [10]: np.sort(np.column_stack((A, B)))[::-1]
Out[10]:
array([['1', 'Evans'],
['2', 'Paul'],
['2', 'Peter'],
['8', 'John']],
dtype='<U5')
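A more robust variant (an editorial sketch, not part of the original answer): since np.sort on a 2-D array sorts within each row rather than reordering rows, it is safer to reorder whole rows with argsort on the numeric key. Note the stable sort keeps 'Peter' before 'Paul':

```python
import numpy as np

A = [8, 2, 2, 1]
B = ['John', 'Peter', 'Paul', 'Evans']

C = np.column_stack((A, B))
# Reorder whole rows by the numeric key so names stay paired with values
C_sorted = C[np.argsort(np.asarray(A), kind='stable')]
print(C_sorted.tolist())
# [['1', 'Evans'], ['2', 'Peter'], ['2', 'Paul'], ['8', 'John']]
```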


How to combine two arrays of different types into a list

I have two numpy arrays as following
A = [1,2,3,1,2,3,1,2,3] #integers
B = ['xx','xx','xx','yy','yy','yy','zz','zz''zz'] #strings
that I want to combine and store as a list such as:
AB_list = [[1,'xx'],[2,'xx'],[3,'xx'],[1,'yy'],[2,'yy'],[3,'yy'],[1,'zz'],[2,'zz'],[3,'zz'],]
Could anyone help?
Something like this, using a list comprehension and the zip iterator, should work:
A = np.array([1,2,3,1,2,3,1,2,3]) #integers
B = np.array(['xx','xx','xx','yy','yy','yy','zz','zz','zz'])
[ [a,b] for a,b in zip(A,B) ]
Out[29]:
[[1, 'xx'],
[2, 'xx'],
[3, 'xx'],
[1, 'yy'],
[2, 'yy'],
[3, 'yy'],
[1, 'zz'],
[2, 'zz'],
[3, 'zz']]
First, you are missing a comma in your list B:
B = ['xx','xx','xx','yy','yy','yy','zz','zz','zz']
After fixing it, you can use column_stack to get the desired result:
import numpy as np
A = [1,2,3,1,2,3,1,2,3]
B = ['xx','xx','xx','yy','yy','yy','zz','zz','zz']
np.column_stack((A, B))
Output:
array([['1', 'xx'],
['2', 'xx'],
['3', 'xx'],
['1', 'yy'],
['2', 'yy'],
['3', 'yy'],
['1', 'zz'],
['2', 'zz'],
['3', 'zz']], dtype='<U21')
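Note that np.column_stack upcasts everything to strings (the '<U21' dtype above). If the integers should stay integers inside a NumPy array, one option, sketched here, is an object-dtype array:

```python
import numpy as np

A = [1, 2, 3, 1, 2, 3, 1, 2, 3]
B = ['xx', 'xx', 'xx', 'yy', 'yy', 'yy', 'zz', 'zz', 'zz']

# object dtype preserves the original Python type of each cell
arr = np.empty((len(A), 2), dtype=object)
arr[:, 0] = A
arr[:, 1] = B
print(arr.tolist()[:3])  # [[1, 'xx'], [2, 'xx'], [3, 'xx']]
```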

Convert a dataframe into matrix form

I have a dataset that looks like this:
state VDM MDM OM
AP 1 2 5
GOA 1 2 1
GU 1 2 4
KA 1 5 1
MA 1 4 4
I have tried this code:
aMat=df1000.as_matrix()
print(aMat)
here df1000 is the dataset.
But the above code gives this output:
[['AP' 1 2 5]
['GOA' 1 2 1]
['GU' 1 2 4]
['KA' 1 5 1]
['MA' 1 4 4]]
I want to create a 2d list or matrix which looks like this:
[[1, 2, 5], [1, 2, 1], [1, 2, 4], [1, 5, 1], [1, 4, 4]]
You can use df.iloc[]:
df.iloc[:,1:].to_numpy()
array([[1, 2, 5],
[1, 2, 1],
[1, 2, 4],
[1, 5, 1],
[1, 4, 4]], dtype=int64)
Or for a string matrix:
df.astype(str).iloc[:,1:].to_numpy()
array([['1', '2', '5'],
['1', '2', '1'],
['1', '2', '4'],
['1', '5', '1'],
['1', '4', '4']], dtype=object)
Note why we are not using as_matrix()
".as_matrix will be removed in a future version. Use .values instead."
Select all columns except the first with DataFrame.iloc, convert the integer values to strings with DataFrame.astype, and finally convert to a NumPy array with to_numpy or DataFrame.values:
#pandas 0.24+
aMat=df1000.iloc[:, 1:].astype(str).to_numpy()
#pandas below
aMat=df1000.iloc[:, 1:].astype(str).values
Or remove the first column with DataFrame.drop:
#pandas 0.24+
aMat=df1000.drop('state', axis=1).astype(str).to_numpy()
#pandas below
aMat=df1000.drop('state', axis=1).astype(str).values
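Putting the pieces together as a runnable sketch; the df1000 below is a reconstruction of the questioner's data (assumed, not the real file), and the string conversion is skipped since the desired output uses ints:

```python
import pandas as pd

# Reconstruction of the question's dataset
df1000 = pd.DataFrame({
    'state': ['AP', 'GOA', 'GU', 'KA', 'MA'],
    'VDM': [1, 1, 1, 1, 1],
    'MDM': [2, 2, 2, 5, 4],
    'OM': [5, 1, 4, 1, 4],
})

# Drop the non-numeric column, then convert to a plain 2-D list
mat = df1000.drop('state', axis=1).to_numpy().tolist()
print(mat)  # [[1, 2, 5], [1, 2, 1], [1, 2, 4], [1, 5, 1], [1, 4, 4]]
```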

(Python) Finding all possible partitions of a list of lists subject to a size limit for a partition

Suppose that I have a list k = [[1,1,1],[2,2],[3],[4]], with size limit c = 4.
Then I would like to find all possible partitions of k subject to c. Ideally, the result should be:
[ {[[1,1,1],[3]], [[2,2], [4]]}, {[[1,1,1],[4]], [[2,2], [3]]}, {[[1,1,1]], [[2,2], [3], [4]]}, ..., {[[1,1,1]], [[2,2]], [[3]], [[4]]} ]
where I used set notation { } in the above example (the actual case uses [ ]) to make it clearer what a partition is; each partition contains groups of lists grouped together.
I implemented the following algorithm but my results do not tally:
def num_item(l):
    flat_l = [item for sublist in l for item in sublist]
    return len(flat_l)

def get_all_possible_partitions(lst, c):
    p_opt = []
    for l in lst:
        p_temp = [l]
        lst_copy = lst.copy()
        lst_copy.remove(l)
        iterations = 0
        while num_item(p_temp) <= c and iterations <= len(lst_copy):
            for l_ in lst_copy:
                iterations += 1
                if num_item(p_temp + [l_]) <= c:
                    p_temp += [l_]
        p_opt += [p_temp]
    return p_opt
Running get_all_possible_partitions(k, 4), I obtain:
[[[1, 1, 1], [3]], [[2, 2], [3], [4]], [[3], [1, 1, 1]], [[4], [1, 1, 1]]]
I understand that it does not remove duplicates or exhaust the possible combinations, which is where I am stuck.
Some insight would be great! P.S. I did not manage to find similar questions :/
I think this does what you want (explanations in comments):
# Main function
def get_all_possible_partitions(lst, c):
    yield from _get_all_possible_partitions_rec(lst, c, [False] * len(lst), [])

# Produces partitions recursively
def _get_all_possible_partitions_rec(lst, c, picked, partition):
    # If all elements have been picked it is a complete partition
    if all(picked):
        yield tuple(partition)
    else:
        # Get all possible subsets of unpicked elements
        for subset in _get_all_possible_subsets_rec(lst, c, picked, [], 0):
            # Add the subset to the partition
            partition.append(subset)
            # Generate all partitions that complete the current one
            yield from _get_all_possible_partitions_rec(lst, c, picked, partition)
            # Remove the subset from the partition
            partition.pop()

# Produces all possible subsets of unpicked elements
def _get_all_possible_subsets_rec(lst, c, picked, current, idx):
    # If we have gone over all elements, finish
    if idx >= len(lst): return
    # If the current element is available and fits in the subset
    if not picked[idx] and len(lst[idx]) <= c:
        # Mark it as picked
        picked[idx] = True
        # Add it to the subset
        current.append(lst[idx])
        # Generate the subset
        yield tuple(current)
        # Generate all possible subsets extending this one
        yield from _get_all_possible_subsets_rec(lst, c - len(lst[idx]), picked, current, idx + 1)
        # Remove current element
        current.pop()
        # Unmark as picked
        picked[idx] = False
    # Only allow skip if it is not the first available element
    if len(current) > 0 or picked[idx]:
        # Get all subsets resulting from skipping current element
        yield from _get_all_possible_subsets_rec(lst, c, picked, current, idx + 1)

# Test
k = [[1, 1, 1], [2, 2], [3], [4]]
c = 4
partitions = list(get_all_possible_partitions(k, c))
print(*partitions, sep='\n')
Output:
(([1, 1, 1],), ([2, 2],), ([3],), ([4],))
(([1, 1, 1],), ([2, 2],), ([3], [4]))
(([1, 1, 1],), ([2, 2], [3]), ([4],))
(([1, 1, 1],), ([2, 2], [3], [4]))
(([1, 1, 1],), ([2, 2], [4]), ([3],))
(([1, 1, 1], [3]), ([2, 2],), ([4],))
(([1, 1, 1], [3]), ([2, 2], [4]))
(([1, 1, 1], [4]), ([2, 2],), ([3],))
(([1, 1, 1], [4]), ([2, 2], [3]))
If all elements in the list are unique, then you can use bits.
Assume k = [a,b,c], whose length is 3; then there are 2^3 - 1 = 7 non-empty subsets:
If you use bits to represent a, b, c, there will be
001 -> [c]
010 -> [b]
011 -> [b, c]
100 -> [a]
101 -> [a,c]
110 -> [a,b]
111 -> [a,b,c]
So the key to solving this question, enumerating these subsets, is obvious now.
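The bitmask idea can be sketched as follows (my own illustration of the hint above, with strings standing in for a, b, c): each integer from 1 to 2^n - 1 selects one non-empty subset.

```python
k = ['a', 'b', 'c']
n = len(k)

# Bit i of the mask decides whether k[i] is in the subset
subsets = [[k[i] for i in range(n) if mask & (1 << i)]
           for mask in range(1, 2 ** n)]
print(len(subsets))  # 7 subsets, matching 2^3 - 1
```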
Note: This answer is actually for a closed linked question.
If you only want to return the bipartitions of the list, you can utilize more_itertools.set_partitions:
>>> from more_itertools import set_partitions
>>>
>>> def get_bipartions(lst):
...     half_list_len = len(lst) // 2
...     if len(lst) % 2 == 0:
...         return list(
...             map(tuple, [
...                 p
...                 for p in set_partitions(lst, k=2)
...                 if half_list_len == len(p[0])
...             ]))
...     else:
...         return list(
...             map(tuple, [
...                 p
...                 for p in set_partitions(lst, k=2)
...                 if abs(half_list_len - len(p[0])) < 1
...             ]))
...
>>> get_bipartions(['A', 'B', 'C'])
[(['A'], ['B', 'C']), (['B'], ['A', 'C'])]
>>> get_bipartions(['A', 'B', 'C', 'D'])
[(['A', 'B'], ['C', 'D']), (['B', 'C'], ['A', 'D']), (['A', 'C'], ['B', 'D'])]
>>> get_bipartions(['A', 'B', 'C', 'D', 'E'])
[(['A', 'B'], ['C', 'D', 'E']), (['B', 'C'], ['A', 'D', 'E']), (['A', 'C'], ['B', 'D', 'E']), (['C', 'D'], ['A', 'B', 'E']), (['B', 'D'], ['A', 'C', 'E']), (['A', 'D'], ['B', 'C', 'E'])]

Dataframe to Dictionary [duplicate]

I have a DataFrame with four columns. I want to convert this DataFrame to a Python dictionary, where the elements of the first column are the keys and the elements of the other columns in the same row are the values.
DataFrame:
ID A B C
0 p 1 3 2
1 q 4 3 2
2 r 4 0 9
Output should be like this:
Dictionary:
{'p': [1,3,2], 'q': [4,3,2], 'r': [4,0,9]}
The to_dict() method sets the column names as dictionary keys so you'll need to reshape your DataFrame slightly. Setting the 'ID' column as the index and then transposing the DataFrame is one way to achieve this.
to_dict() also accepts an 'orient' argument which you'll need in order to output a list of values for each column. Otherwise, a dictionary of the form {index: value} will be returned for each column.
These steps can be done with the following line:
>>> df.set_index('ID').T.to_dict('list')
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
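Self-contained, with the question's DataFrame reconstructed from the example above:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['p', 'q', 'r'],
                   'A': [1, 4, 4],
                   'B': [3, 3, 0],
                   'C': [2, 2, 9]})

# Make ID the index, transpose, then emit one list per column of the transpose
d = df.set_index('ID').T.to_dict('list')
print(d)  # {'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
```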
In case a different dictionary format is needed, here are examples of the possible orient arguments. Consider the following simple DataFrame:
>>> df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})
>>> df
a b
0 red 0.500
1 yellow 0.250
2 blue 0.125
Then the options are as follows.
dict - the default: column names are keys, values are dictionaries of index:data pairs
>>> df.to_dict('dict')
{'a': {0: 'red', 1: 'yellow', 2: 'blue'},
'b': {0: 0.5, 1: 0.25, 2: 0.125}}
list - keys are column names, values are lists of column data
>>> df.to_dict('list')
{'a': ['red', 'yellow', 'blue'],
'b': [0.5, 0.25, 0.125]}
series - like 'list', but values are Series
>>> df.to_dict('series')
{'a': 0 red
1 yellow
2 blue
Name: a, dtype: object,
'b': 0 0.500
1 0.250
2 0.125
Name: b, dtype: float64}
split - splits columns/data/index as keys with values being column names, data values by row and index labels respectively
>>> df.to_dict('split')
{'columns': ['a', 'b'],
'data': [['red', 0.5], ['yellow', 0.25], ['blue', 0.125]],
'index': [0, 1, 2]}
records - each row becomes a dictionary where key is column name and value is the data in the cell
>>> df.to_dict('records')
[{'a': 'red', 'b': 0.5},
{'a': 'yellow', 'b': 0.25},
{'a': 'blue', 'b': 0.125}]
index - like 'records', but a dictionary of dictionaries with keys as index labels (rather than a list)
>>> df.to_dict('index')
{0: {'a': 'red', 'b': 0.5},
1: {'a': 'yellow', 'b': 0.25},
2: {'a': 'blue', 'b': 0.125}}
Should a dictionary like:
{'red': '0.500', 'yellow': '0.250', 'blue': '0.125'}
be required out of a dataframe like:
a b
0 red 0.500
1 yellow 0.250
2 blue 0.125
the simplest way would be to do:
dict(df.values)
working snippet below:
import pandas as pd
df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})
dict(df.values)
Follow these steps:
Suppose your dataframe is as follows:
>>> df
A B C ID
0 1 3 2 p
1 4 3 2 q
2 4 0 9 r
1. Use set_index to set the ID column as the dataframe index.
df.set_index("ID", drop=True, inplace=True)
2. Use the orient="index" parameter to have the index as dictionary keys.
dictionary = df.to_dict(orient="index")
The results will be as follows:
>>> dictionary
{'q': {'A': 4, 'B': 3, 'C': 2}, 'p': {'A': 1, 'B': 3, 'C': 2}, 'r': {'A': 4, 'B': 0, 'C': 9}}
3. If you need each sample as a list, run the following code, determining the column order first:
column_order = ["A", "B", "C"]  # your preferred order of columns
d = {}  # initialize the new dictionary as an empty dictionary
for k in dictionary:
    d[k] = [dictionary[k][column_name] for column_name in column_order]
Try using zip:
df = pd.read_csv("file")
d = dict([(i, [a, b, c]) for i, a, b, c in zip(df.ID, df.A, df.B, df.C)])
print(d)
Output:
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
If you don't mind the dictionary values being tuples, you can use itertuples:
>>> {x[0]: x[1:] for x in df.itertuples(index=False)}
{'p': (1, 3, 2), 'q': (4, 3, 2), 'r': (4, 0, 9)}
For my use (node names with xy positions) I found @user4179775's answer to be the most helpful / intuitive:
import pandas as pd
df = pd.read_csv('glycolysis_nodes_xy.tsv', sep='\t')
df.head()
nodes x y
0 c00033 146 958
1 c00031 601 195
...
xy_dict_list=dict([(i,[a,b]) for i, a,b in zip(df.nodes, df.x,df.y)])
xy_dict_list
{'c00022': [483, 868],
'c00024': [146, 868],
... }
xy_dict_tuples=dict([(i,(a,b)) for i, a,b in zip(df.nodes, df.x,df.y)])
xy_dict_tuples
{'c00022': (483, 868),
'c00024': (146, 868),
... }
Addendum
I later returned to this issue, for other, but related, work. Here is an approach that more closely mirrors the [excellent] accepted answer.
node_df = pd.read_csv('node_prop-glycolysis_tca-from_pg.tsv', sep='\t')
node_df.head()
node kegg_id kegg_cid name wt vis
0 22 22 c00022 pyruvate 1 1
1 24 24 c00024 acetyl-CoA 1 1
...
Convert Pandas dataframe to a [list], {dict}, {dict of {dict}}, ...
Per accepted answer:
node_df.set_index('kegg_cid').T.to_dict('list')
{'c00022': [22, 22, 'pyruvate', 1, 1],
'c00024': [24, 24, 'acetyl-CoA', 1, 1],
... }
node_df.set_index('kegg_cid').T.to_dict('dict')
{'c00022': {'kegg_id': 22, 'name': 'pyruvate', 'node': 22, 'vis': 1, 'wt': 1},
'c00024': {'kegg_id': 24, 'name': 'acetyl-CoA', 'node': 24, 'vis': 1, 'wt': 1},
... }
In my case, I wanted to do the same thing but with selected columns from the Pandas dataframe, so I needed to slice the columns. There are two approaches.
Directly:
(see: Convert pandas to dictionary defining the columns used for the key values)
node_df.set_index('kegg_cid')[['name', 'wt', 'vis']].T.to_dict('dict')
{'c00022': {'name': 'pyruvate', 'vis': 1, 'wt': 1},
'c00024': {'name': 'acetyl-CoA', 'vis': 1, 'wt': 1},
... }
"Indirectly:" first, slice the desired columns/data from the Pandas dataframe (again, two approaches),
node_df_sliced = node_df[['kegg_cid', 'name', 'wt', 'vis']]
or
node_df_sliced2 = node_df.loc[:, ['kegg_cid', 'name', 'wt', 'vis']]
that can then be used to create a dictionary of dictionaries:
node_df_sliced.set_index('kegg_cid').T.to_dict('dict')
{'c00022': {'name': 'pyruvate', 'vis': 1, 'wt': 1},
'c00024': {'name': 'acetyl-CoA', 'vis': 1, 'wt': 1},
... }
Most of the answers do not deal with the situation where ID can occur multiple times in the dataframe. If ID can be duplicated in the DataFrame df, you will want to use a list to store the values (i.e., a list of lists), grouped by ID:
{k: [g['A'].tolist(), g['B'].tolist(), g['C'].tolist()] for k,g in df.groupby('ID')}
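A runnable sketch with a hypothetical frame where 'p' occurs twice, to show the grouped shape:

```python
import pandas as pd

# Hypothetical data: ID 'p' is duplicated
df = pd.DataFrame({'ID': ['p', 'p', 'q'],
                   'A': [1, 7, 4], 'B': [3, 8, 3], 'C': [2, 9, 2]})

d = {k: [g['A'].tolist(), g['B'].tolist(), g['C'].tolist()]
     for k, g in df.groupby('ID')}
print(d)  # {'p': [[1, 7], [3, 8], [2, 9]], 'q': [[4], [3], [2]]}
```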
A dictionary comprehension with the iterrows() method can also be used to get the desired output.
result = {row.ID: [row.A, row.B, row.C] for (index, row) in df.iterrows()}
df = pd.DataFrame([['p',1,3,2], ['q',4,3,2], ['r',4,0,9]], columns=['ID','A','B','C'])
my_dict = {k:list(v) for k,v in zip(df['ID'], df.drop(columns='ID').values)}
print(my_dict)
with output
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
With this method, the columns of the dataframe will be the keys and the series of the dataframe will be the values:
data_dict = dict()
for col in dataframe.columns:
    data_dict[col] = dataframe[col].values.tolist()
DataFrame.to_dict() converts a DataFrame to a dictionary.
Example
>>> df = pd.DataFrame(
{'col1': [1, 2], 'col2': [0.5, 0.75]}, index=['a', 'b'])
>>> df
col1 col2
a 1 0.50
b 2 0.75
>>> df.to_dict()
{'col1': {'a': 1, 'b': 2}, 'col2': {'a': 0.5, 'b': 0.75}}
See the documentation for details.

Pandas: Reading an CSV file with the intention of creating a 3D array

First time posting here.
So my question is about how to read a CSV file in Pandas with the intention of creating a 2D array with a matrix within each element.
So for instance take this example CSV file
1,1,1;2,2,2;3,3,3
1,1,1;2,2,2;3,3,3
1,1,1;2,2,2;3,3,3
Where each new line represents a separate matrix
and each semicolon represents a separate row within each matrix
and each comma represents a separate element within each row
So from this I would like to get to this type of array:
[
[[1,1,1],[2,2,2],[3,3,3]],
[[1,1,1],[2,2,2],[3,3,3]],
[[1,1,1],[2,2,2],[3,3,3]]
]
Currently, when I use pandas.read_csv() on something like this, it will not read the semicolon as a separator, so something like 1;2 is read as a string.
Thanks!
You can use read_csv with the parameters sep=';' and header=None (if there is no header in the csv). Then you need to apply str.split, because string functions work only on Series (the columns of df):
import pandas as pd
import io
temp=u"""1,1,1;2,2,2;3,3,3
1,1,1;2,2,2;3,3,3
1,1,1;2,2,2;3,3,3"""
#after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", header=None)
print (df)
0 1 2
0 1,1,1 2,2,2 3,3,3
1 1,1,1 2,2,2 3,3,3
2 1,1,1 2,2,2 3,3,3
print (df.apply(lambda x: x.str.split(',')))
0 1 2
0 [1, 1, 1] [2, 2, 2] [3, 3, 3]
1 [1, 1, 1] [2, 2, 2] [3, 3, 3]
2 [1, 1, 1] [2, 2, 2] [3, 3, 3]
print (df.apply(lambda x: x.str.split(',')).values.tolist())
[[['1', '1', '1'], ['2', '2', '2'], ['3', '3', '3']],
[['1', '1', '1'], ['2', '2', '2'], ['3', '3', '3']],
[['1', '1', '1'], ['2', '2', '2'], ['3', '3', '3']]]
But if you need lists of ints:
import pandas as pd
import io
temp=u"""1,1,1;2,2,2;3,3,3
1,1,1;2,2,2;3,3,3
1,1,1;2,2,2;3,3,3"""
#after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", header=None)
print (df)
0 1 2
0 1,1,1 2,2,2 3,3,3
1 1,1,1 2,2,2 3,3,3
2 1,1,1 2,2,2 3,3,3
for col in df.columns:
    df[col] = df[col].str.split(',')
    # if needed, convert the string numbers to int
    df[col] = [[int(y) for y in x] for x in df[col]]
print (df.values.tolist())
[[[1, 1, 1], [2, 2, 2], [3, 3, 3]],
[[1, 1, 1], [2, 2, 2], [3, 3, 3]],
[[1, 1, 1], [2, 2, 2], [3, 3, 3]]]
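For comparison, the same parsing can be done without pandas at all; a pure-Python sketch that splits each line on ';' and each cell on ',':

```python
import io

# Stand-in for an open file handle
raw = io.StringIO("1,1,1;2,2,2;3,3,3\n"
                  "1,1,1;2,2,2;3,3,3\n"
                  "1,1,1;2,2,2;3,3,3\n")

# line -> matrix, ';' group -> row, ',' item -> int element
arr = [[[int(v) for v in row.split(',')] for row in line.strip().split(';')]
       for line in raw if line.strip()]
print(arr)
```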
