I need to concatenate the uid from the uid column to each of the uids in the list in the friends column, as shown in the following example:
Given a pandas.DataFrame object A:
uid friends
0 1 [10, 2, 1, 5]
1 2 [1, 2]
2 3 [5, 4]
3 4 [10, 5]
4 5 [1, 2, 5]
the desired output is:
uid friends in_edges
0 1 [10, 2, 1, 5] [(1, 10), (1, 2), (1, 1), (1, 5)]
1 2 [1, 2] [(2, 1), (2, 2)]
2 3 [5, 4] [(3, 5), (3, 4)]
3 4 [10, 5] [(4, 10), (4, 5)]
4 5 [1, 2, 5] [(5, 1), (5, 2), (5, 5)]
I use the following code to achieve this outcome:
import numpy as np
import pandas as pd
A = pd.DataFrame(dict(uid=[1, 2, 3, 4, 5], friends=[[10, 2, 1, 5], [1, 2], [5, 4], [10, 5], [1, 2, 5]]))
A.loc[:, 'in_edges'] = A.loc[:, 'uid'].apply(lambda uid: [(uid, f) for f in A.loc[A.loc[:, 'uid']==uid, 'friends'].values[0]])
but the A.loc[A.loc[:, 'uid']==uid, 'friends'] part looks rather cumbersome to me, so I wondered: is there an easier way to accomplish this task?
Thanks in advance.
You can use .apply() with the axis=1 parameter:
df["in_edges"] = df[["uid", "friends"]].apply(
lambda x: [(x["uid"], f) for f in x["friends"]], axis=1
)
print(df)
Prints:
uid friends in_edges
0 1 [10, 2, 1, 5] [(1, 10), (1, 2), (1, 1), (1, 5)]
1 2 [1, 2] [(2, 1), (2, 2)]
2 3 [5, 4] [(3, 5), (3, 4)]
3 4 [10, 5] [(4, 10), (4, 5)]
4 5 [1, 2, 5] [(5, 1), (5, 2), (5, 5)]
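As a side note, for this row-wise pairing you don't strictly need apply at all; a plain list comprehension over zip is a common (and usually faster) alternative, sketched here with the same column names:
A["in_edges"] = [[(uid, f) for f in friends] for uid, friends in zip(A["uid"], A["friends"])]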
Why not try itertools.product?
import itertools
A['in_edges'] = A.apply(lambda x: [*itertools.product([x['uid']], x['friends'])], axis=1)
A
Out[50]:
uid friends in_edges
0 1 [10, 2, 1, 5] [(1, 10), (1, 2), (1, 1), (1, 5)]
1 2 [1, 2] [(2, 1), (2, 2)]
2 3 [5, 4] [(3, 5), (3, 4)]
3 4 [10, 5] [(4, 10), (4, 5)]
4 5 [1, 2, 5] [(5, 1), (5, 2), (5, 5)]
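If you later need one edge per row instead of a list per row, DataFrame.explode (available since pandas 0.25) can flatten the new column; a quick sketch:
edge_rows = A.explode('in_edges')   # one (uid, friend) tuple per row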
I'm working on an ML project for which I'm using numpy arrays instead of pandas for faster computation.
When I intend to bootstrap, I wish to subset columns from a numpy ndarray.
My numpy array looks like this:
np_arr =
[(187., 14.45 , 20.22, 94.49)
(284., 10.44 , 15.46, 66.62)
(415., 11.13 , 22.44, 71.49)]
And I want to index columns 1 and 3.
I have my columns stored in a list as ix = [1, 3].
However, when I try to do np_arr[:, ix] I get an error saying too many indices for array.
I also realised that when I print np_arr.shape I only get (3,), whereas I probably want (3, 4).
Could you please tell me how to fix my issue?
Thanks!
Edit:
I'm creating my numpy object from my pandas dataframe like this:
def _to_numpy(self, data):
    v = data.reset_index()
    np_res = np.rec.fromrecords(v, names=v.columns.tolist())
    return np_res
The reason for your issue is that your np_arr is a 1-D array. In general, when dealing with 2-D numpy arrays, we create them like this:
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
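With a 2-D array like this, the column selection from the question works directly; for example, with your ix list:
ix = [1, 3]
a[:, ix]
# array([[ 2,  4],
#        [ 6,  8],
#        [10, 12]])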
You have created a record array (also called a structured array). The result is a 1d array with named columns (fields).
To illustrate:
In [426]: df = pd.DataFrame(np.arange(12).reshape(4,3), columns=['A','B','C'])
In [427]: df
Out[427]:
A B C
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
In [428]: arr = df.to_records()
In [429]: arr
Out[429]:
rec.array([(0, 0, 1, 2), (1, 3, 4, 5), (2, 6, 7, 8), (3, 9, 10, 11)],
dtype=[('index', '<i8'), ('A', '<i8'), ('B', '<i8'), ('C', '<i8')])
In [430]: arr['A']
Out[430]: array([0, 3, 6, 9])
In [431]: arr.shape
Out[431]: (4,)
to_records has an index parameter to eliminate the index field.
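A quick sketch of that parameter (the exact integer width in the dtype depends on the platform):
df.to_records(index=False)
# rec.array([(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, 11)],
#           dtype=[('A', '<i8'), ('B', '<i8'), ('C', '<i8')])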
Or with your method:
In [432]: arr = np.rec.fromrecords(df, names=df.columns.tolist())
In [433]: arr
Out[433]:
rec.array([(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, 11)],
dtype=[('A', '<i8'), ('B', '<i8'), ('C', '<i8')])
In [434]: arr['A'] # arr.A also works
Out[434]: array([0, 3, 6, 9])
In [435]: arr.shape
Out[435]: (4,)
And multifield access:
In [436]: arr[['A','C']]
Out[436]:
rec.array([(0, 2), (3, 5), (6, 8), (9, 11)],
dtype={'names':['A','C'], 'formats':['<i8','<i8'], 'offsets':[0,16], 'itemsize':24})
Note that the str display of this array
In [437]: print(arr)
[(0, 1, 2) (3, 4, 5) (6, 7, 8) (9, 10, 11)]
shows a list of tuples, just as your np_arr. Each tuple is a 'record'. The repr display shows the dtype as well.
You can't have it both ways: either access columns by name, or make a regular numpy array and access columns by number. The named/record access makes the most sense when columns are a mix of dtypes - string, int, float. If they are all float and you want to do calculations across columns, it's better to use the numeric dtype.
In [438]: arr = df.to_numpy()
In [439]: arr
Out[439]:
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11]])
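With this regular numeric array, positional column indexing behaves the way the question expects; for example, selecting the first and third columns (the analogue of your ix list, since this frame only has three columns):
In [440]: arr[:, [0, 2]]
Out[440]:
array([[ 0,  2],
       [ 3,  5],
       [ 6,  8],
       [ 9, 11]])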
Let's say I have the following unweighted (all edge weights = 1), undirected, unlabeled, connected graph, and I want to find all unique paths up to a given maximum length. Also, nodes cannot appear twice in a path. I cannot find a routine that does this in networkx at the moment.
Does anyone know if any such routine exists?
Or what could be a good solution for this problem?
import networkx as nx
G = nx.Graph()
G.add_nodes_from([1, 2, 3, 4, 5, 6, 7, 8, 9])
G.add_edges_from([(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (2, 4), (6, 9), (8, 9), (9, 6)])
The example graph looks like this (image omitted).
Let's say I require max length = 2; I would like this output:
[1 2]
[2 3]
[2 4]
[3 4]
[4 5]
[5 6]
[6 7]
[7 8]
[8 9]
[6 9]
[1 2 3]
[1 2 4]
[2 3 4]
[2 4 5]
[3 4 5]
[4 5 6]
[5 6 7]
[5 6 9]
[7 6 9]
[6 7 8]
[7 8 9]
[6 9 8]
EDIT: I'm looking for a better solution than using itertools to generate all combinations of required_max_path_length - 1 nodes and then checking connectivity within each group with G.has_edge(node_1, node_2), which seems like a really bad solution.
So now I'm doing the following, thanks to @user3483203, and it yields the expected output. The itertools usage can be avoided, but I don't mind it in my specific case.
I still feel like this would scale worse than some alternative on larger graphs, though; I will change the accepted answer if someone finds a better solution.
import networkx as nx
import itertools
required_max_path_length = 2 # (inferior or equal to)
G = nx.Graph()
G.add_nodes_from([1, 2, 3, 4, 5, 6, 7, 8, 9])
G.add_edges_from([(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (2, 4), (6, 9), (8, 9), (9, 6)])
all_paths = []
nodes_combs = itertools.combinations(G.nodes, 2)
for source, target in nodes_combs:
    paths = nx.all_simple_paths(G, source=source, target=target, cutoff=required_max_path_length)
    for path in paths:
        if path not in all_paths and path[::-1] not in all_paths:
            all_paths.append(path)
for path in all_paths:
    print(path)
In case you want the paths as lists of edges you can do:
for path in map(nx.utils.pairwise, all_paths):
    print(list(path))
And you will get:
[(1, 2)]
[(1, 2), (2, 3)]
[(1, 2), (2, 4)]
[(2, 3)]
[(2, 3), (3, 4)]
[(2, 4)]
[(2, 4), (4, 5)]
[(3, 4)]
[(3, 4), (4, 5)]
[(4, 5)]
[(4, 5), (5, 6)]
[(5, 6)]
[(5, 6), (6, 7)]
[(5, 6), (6, 9)]
[(6, 7)]
[(6, 7), (7, 8)]
[(6, 9), (9, 8)]
[(6, 9)]
[(7, 8)]
[(7, 6), (6, 9)]
[(7, 8), (8, 9)]
[(8, 9)]
The following code should solve your task, but it outputs more paths than you have given (e.g. [1,2] and [2,1]):
def find_all_simple_paths(graph, cutoff):
    if cutoff == 0:
        return [[node] for node in graph]
    else:
        all_paths = []
        current_paths = [[node] for node in graph]
        # If you want to include paths of length 0
        # all_paths.extend(current_paths)
        for _ in range(min(cutoff, len(graph))):
            next_paths = []
            for path in current_paths:
                for neighbor in graph.neighbors(path[-1]):
                    if neighbor not in path:
                        new_path = path[:] + [neighbor]
                        next_paths.append(new_path)
                        all_paths.append(new_path)
            current_paths = next_paths
        return all_paths

find_all_simple_paths(G, 2)
Output
[[1, 2],
[2, 1],
[2, 3],
[2, 4],
[3, 2],
[3, 4],
[4, 3],
[4, 5],
[4, 2],
[5, 4],
[5, 6],
[6, 5],
[6, 7],
[6, 9],
[7, 6],
[7, 8],
[8, 7],
[8, 9],
[9, 6],
[9, 8],
[1, 2, 3],
[1, 2, 4],
[2, 3, 4],
[2, 4, 3],
[2, 4, 5],
[3, 2, 1],
[3, 2, 4],
[3, 4, 5],
[3, 4, 2],
[4, 3, 2],
[4, 5, 6],
[4, 2, 1],
[4, 2, 3],
[5, 4, 3],
[5, 4, 2],
[5, 6, 7],
[5, 6, 9],
[6, 5, 4],
[6, 7, 8],
[6, 9, 8],
[7, 6, 5],
[7, 6, 9],
[7, 8, 9],
[8, 7, 6],
[8, 9, 6],
[9, 6, 5],
[9, 6, 7],
[9, 8, 7]]
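If you want only one orientation of each path, as in the expected output above, a minimal post-filter is to keep the lexicographically smaller direction of each pair (each path and its reverse both appear, so exactly one of the two survives):
paths = find_all_simple_paths(G, 2)
unique_paths = [p for p in paths if p <= p[::-1]]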
I am working on weighted graphs and I would like to assign a random weight to each edge of the graph, such that
weight of edge(a, a) = 0
weight of edge(a, b) = weight of edge(b, a) = K
where K is some random number. This goes on for all the edges of the graph.
For that, I am using the random.randint() method. I am actually using the logic of sums: if the sum of both edge tuples is the same, then assign some random integer.
Here is my code,
nodelist = list(range(1, num_nodes + 1))
edgelist = []
for i in nodelist:
    for j in nodelist:
        if i == j:
            edgelist.append((i, j, 0))
        if (i != j and sum((i, j)) == sum((j, i))):
            rand = random.randint(5, 25)
            edgelist.append((i, j, rand))
print(edgelist)
Actual result,
[(1, 1, 0), (1, 2, 18), (1, 3, 6), (2, 1, 13), (2, 2, 0), (2, 3, 21), (3, 1, 20), (3, 2, 17), (3, 3, 0)]
Expected result,
[(1, 1, 0), (1, 2, K), (1, 3, H), (2, 1, K), (2, 2, 0), (2, 3, P), (3, 1, H), (3, 2, P), (3, 3, 0)]
where, K, H, P are some random integers.
If the ordering of the result is not important, the following code gives the desired output:
import random
num_nodes = 3
nodelist = list(range(1, num_nodes + 1))
edgelist = []
for i in nodelist:
    for j in nodelist:
        if j > i:
            break
        if i == j:
            edgelist.append((i, j, 0))
        else:
            rand = random.randint(5, 25)
            edgelist.append((i, j, rand))
            edgelist.append((j, i, rand))
print(edgelist)
# [(1, 1, 0), (2, 1, 7), (1, 2, 7), (2, 2, 0), (3, 1, 18), (1, 3, 18), (3, 2, 13), (2, 3, 13), (3, 3, 0)]
In case you need the edges sorted, simply use:
print(sorted(edgelist))
# [(1, 1, 0), (1, 2, 20), (1, 3, 16), (2, 1, 20), (2, 2, 0), (2, 3, 23), (3, 1, 16), (3, 2, 23), (3, 3, 0)]
Just a little change in your code will do the trick.
Here is the solution I found to obtain your expected output:
import random

num_nodes = 3
nodelist = list(range(1, num_nodes + 1))
edgelist = []
for i in nodelist:
    for j in nodelist:
        if i == j:
            edgelist.append((i, j, 0))
        elif i < j:
            # one weight per unordered pair, appended in both directions
            rand = random.randint(5, 25)
            edgelist.append((i, j, rand))
            edgelist.append((j, i, rand))
print(sorted(edgelist))
This code outputs:
[(1, 1, 0), (1, 2, 15), (1, 3, 15), (2, 1, 15), (2, 2, 0), (2, 3, 21), (3, 1, 15), (3, 2, 21), (3, 3, 0)]
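Alternatively, you can draw one weight per unordered pair up front and then build the full edge list in its natural order; a sketch (weights is an illustrative helper dict, not part of the original post):

import random

num_nodes = 3
weights = {(i, j): random.randint(5, 25)
           for i in range(1, num_nodes + 1)
           for j in range(i + 1, num_nodes + 1)}
edgelist = [(i, j, 0 if i == j else weights[min(i, j), max(i, j)])
            for i in range(1, num_nodes + 1)
            for j in range(1, num_nodes + 1)]
print(edgelist)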
So I figured out something interesting. Say the matrix below shows the edges of a complete graph on 5 nodes:
[1, 1] [1, 2] [1, 3] [1, 4] [1, 5]
[2, 1] [2, 2] [2, 3] [2, 4] [2, 5]
[3, 1] [3, 2] [3, 3] [3, 4] [3, 5]
[4, 1] [4, 2] [4, 3] [4, 4] [4, 5]
[5, 1] [5, 2] [5, 3] [5, 4] [5, 5]
Now, moving to the right of the principal diagonal, we have pairs whose first element is less than the second. We just have to target those and append a new random weight to each.
Here is my code,
nodelist = list(range(1, num_nodes + 1))
edgelist = []
for i in nodelist:
    for j in nodelist:
        edgelist.append([i, j])
p = 0
eff_edgelist = []
while p < len(edgelist):
    if edgelist[p][0] <= edgelist[p][1]:
        eff_edgelist.append(edgelist[p])
    p += 1
for i in eff_edgelist:
    if i[0] == i[1]:
        i.append(0)
    else:
        i.append(random.randint(5, 50))
eff_edgelist = [tuple(i) for i in eff_edgelist]
# build the graph from the weighted edge list (this step was implied in the post)
G = nx.Graph()
G.add_weighted_edges_from(eff_edgelist)
for i in list(G.edges(data=True)):
    print([i])
and the result,
[(1, 1, {'weight': 0})]
[(1, 2, {'weight': 12})]
[(1, 3, {'weight': 37})]
[(1, 4, {'weight': 38})]
[(1, 5, {'weight': 6})]
[(2, 2, {'weight': 0})]
[(2, 3, {'weight': 12})]
[(2, 4, {'weight': 40})]
[(2, 5, {'weight': 8})]
[(3, 3, {'weight': 0})]
[(3, 4, {'weight': 15})]
[(3, 5, {'weight': 38})]
[(4, 4, {'weight': 0})]
[(4, 5, {'weight': 41})]
[(5, 5, {'weight': 0})]
and if you check print(G[2][1]), the output will be {'weight': 12},
which means weight of edge(a, b) = weight of edge(b, a).
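For what it's worth, the same symmetric weighting can be built more compactly with itertools.combinations, which visits each unordered pair exactly once; a sketch (assuming num_nodes = 5 as above):

import itertools
import random
import networkx as nx

num_nodes = 5
G = nx.Graph()
# self-loops carry weight 0
G.add_weighted_edges_from((i, i, 0) for i in range(1, num_nodes + 1))
# one random weight per unordered pair; an undirected Graph stores it for both directions
G.add_weighted_edges_from((i, j, random.randint(5, 50))
                          for i, j in itertools.combinations(range(1, num_nodes + 1), 2))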
I am having a small problem dealing with a Python Spark RDD. My RDD looks like
old_rdd = [( A1, Vector(V1)), (A2, Vector(V2)), (A3, Vector(V3)), ....].
I want to use flatMap so as to get a new RDD like:
new_rdd = [((A1, A2), (V1, V2)), ((A1, A3), (V1, V3))] and so on.
The problem is that flatMap flattens the tuples, producing something like [(A1, V1, A2, V2), ...]. Do you have any alternative suggestions, with or without flatMap()? Thank you in advance.
This is related to Explicit sort in Cartesian transformation in Scala Spark. However, I will suppose that you have already cleaned up the RDD of duplicates. I will also assume that the ids have some simple pattern to parse and then identify, and for simplicity I will think in terms of lists instead of Vectors:
old_rdd = sc.parallelize([(1, [1, -2]), (2, [5, 7]), (3, [8, 23]), (4, [-1, 90])])
# It will provide all the permutations, but combinations are a subset of the permutations, so we need to filter.
combined_rdd = old_rdd.cartesian(old_rdd)
# Python 3 lambdas cannot unpack tuples, so index into the (s1, s2) pair instead
combinations = combined_rdd.filter(lambda pair: pair[0][0] < pair[1][0])
combinations.collect()
# The output will be...
# -----------------------------
# [((1, [1, -2]), (2, [5, 7])),
# ((1, [1, -2]), (3, [8, 23])),
# ((1, [1, -2]), (4, [-1, 90])),
# ((2, [5, 7]), (3, [8, 23])),
# ((2, [5, 7]), (4, [-1, 90])),
# ((3, [8, 23]), (4, [-1, 90]))]
# Now we need to set the tuple as you want
combinations = combinations.map(lambda pair: ((pair[0][0], pair[1][0]), (pair[0][1], pair[1][1]))).collect()
# The output will be...
# ----------------------
# [((1, 2), ([1, -2], [5, 7])),
# ((1, 3), ([1, -2], [8, 23])),
# ((1, 4), ([1, -2], [-1, 90])),
# ((2, 3), ([5, 7], [8, 23])),
# ((2, 4), ([5, 7], [-1, 90])),
# ((3, 4), ([8, 23], [-1, 90]))]