How to flatten dependency graph? - apache-spark

I am new to Apache Spark.
Can I get a snippet of how to implement 'flattening' for a dependency graph?
i.e. let's say I have:
nodes: A, B, C
edges: (A,B), (B,C)
It would result in a new graph:
nodes: A, B, C
edges: (A,B), (A,C), (B,C)

1) Presuming each node is in its own row
A
B
C
2) As a first step, do a CROSS JOIN of the node list with itself.
A A
A B
A C
B A
B B
B C
C A
C B
C C
3) As a second step, filter out all the rows where the node name is repeated.
A B
A C
B A
B C
C A
C B
4) After that, derive another field from the two fields that tells you the edge.
A B AB
A C AC
B A BA
B C BC
C A CA
C B CB
You would need to convert this into Scala/Python syntax, though. Hope this helps.
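The steps above can be sketched in plain Python, with sets standing in for Spark rows (the same cross join and filter translate directly to DataFrame operations):

```python
nodes = {"A", "B", "C"}

# Step 2: CROSS JOIN the node set with itself
pairs = {(x, y) for x in nodes for y in nodes}

# Step 3: filter out rows where the node name is repeated
pairs = {(x, y) for (x, y) in pairs if x != y}

# Step 4: derive an extra field naming the edge
edges = {(x, y, x + y) for (x, y) in pairs}
```

Note that the cross join produces every ordered pair of distinct nodes; in this three-node example that happens to coincide with the flattened graph, but for longer dependency chains you would instead iterate a self-join on the edge list until no new edges appear.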

Related

Need help inputting data in Excel columns from other columns

I have data in columns A to E, as seen below in the table; some of the values can be NA. In column F, I want to input data from columns A to E such that if data in A exists, use that; otherwise, if data in B exists, use that; and so on through column E. If none of them have any values, return NA. I would like to automate this so that I only specify the order somewhere, for example A, B, C, D, E or A, C, E, D, B, and the values in F update according to the reference table.
Reference : C - B - A - E - D
a  b  c   d  e  f
3  4  3   2  2  7
1  7  NA  1  4  2
4  2  2   4  2  2
Use FILTER() with the @ operator (implicit intersection) to take just the first match:
=@FILTER(A2:E2,A2:E2<>"","NA")
For a dynamic array approach (spills results automatically), try:
=BYROW(A2:E7,LAMBDA(x,INDEX(FILTER(x,x<>"","NA"),1,1)))
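If a scripted route is acceptable, the same "first non-missing value in a given column order" logic is a one-liner in pandas (the column names and sample values below are taken from the question; the priority order is the question's reference order):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [3, 1, 4], "b": [4, 7, 2], "c": [3, np.nan, 2],
    "d": [2, 1, 4], "e": [2, 4, 2],
})

order = ["c", "b", "a", "e", "d"]  # reference order: C - B - A - E - D
# Back-fill across the reordered columns, then take the first column:
# this picks the first non-missing value per row in priority order
df["f"] = df[order].bfill(axis=1).iloc[:, 0]
```

Changing `order` is all that is needed to switch the priority, which mirrors the "just specify the order" requirement.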

I have a list like this x= [['a','b','c','d']] and I want to print all the elements in it like a b c d How do I do it?

I have a list like this
x = [['a','b','c','d']]
and I want to print all the elements in it like
a
b
c
d
x = [['a','b','c','d']]
print('\n'.join(map(str, x[0])))
result will be
a
b
c
d

Need a function for pandas dataframe that identifies equal strings and then assigns to new columns

I made a pandas df from parts of 2 others:
Here is the pseudocode for what I want to do.
4-column pandas dataframe, values in all columns are single words.
cols A B C D and I want this: cols A B C D E F
In pseudocode:
(for every s in A;
if s equals any string (not substring) in D;
write Yes to E (new column) else write No to E;
if str in B (same row as s) equals str in C (same row as string found in D) write yes to F (new column)
else write No to F)
The following code works but now I need a function to do what is described above:
cols = [1,2,3,5]
df3.drop(df3.columns[cols],axis=1, inplace=True)
df4.drop(df4.columns[1],axis=1, inplace=True)
listi = [df4]
listi.append(df3)
df5 = pd.concat(listi, axis = 1)
It should be: i) if x['A'] == x['D'] and ii) if x['B'] == x['C']; and I also need to add column G, which is the string found in C, or if the string is not found.
Here is a small sample data set and expected outcome:
A B C D
cats cat cat cats
went be have had
tried try enter entering
entering enter try tried
Expected outcome
A B C D E F G
cats cat cat cats yes yes cat
went be have had no no tried
try entering entering yes no try
entering enter try tried yes no entering
Column G is the word found in C if the word is found else
From what I understood, you can apply a lambda to your DataFrame:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
Anyway, I built a small example:
import pandas as pd
df = pd.DataFrame([['a','a','a','b'],['b','c','b','b'],['d','d','d','g'],['j','j','c','d']],[1,2,3,4], columns=['A','B','C','D'])
df
# A B C D
#1 a a a b
#2 b c b b
#3 d d d g
#4 j j c d
df['E']=df.apply(lambda x: 'yes' if x['A'] == x['B'] else 'no', axis=1)
df['F']=df.apply(lambda x: 'yes' if x['C'] == x['D'] else 'no', axis=1)
df
# A B C D E F
#1 a a a b yes no
#2 b c b b no yes
#3 d d d g yes no
#4 j j c d yes no
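The question's own pseudocode (membership of A anywhere in column D, plus the cross-row B-vs-C comparison) can be sketched with a merge instead of per-row apply. Column names follow the question and D values are assumed unique; this follows the pseudocode rather than the sample outcome table:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["cats", "went", "tried", "entering"],
    "B": ["cat", "be", "try", "enter"],
    "C": ["cat", "have", "enter", "try"],
    "D": ["cats", "had", "entering", "tried"],
})

# Lookup table: each D value paired with the C from its own row
lookup = df[["D", "C"]].rename(columns={"D": "A", "C": "C_match"})

out = df.merge(lookup, on="A", how="left")  # NaN where A is not in D
out["E"] = out["C_match"].notna().map({True: "yes", False: "no"})
out["F"] = (out["B"] == out["C_match"]).map({True: "yes", False: "no"})
out["G"] = out["C_match"].fillna("")  # blank when the string is not found
```

The merge finds, for each value of A, the row of the frame whose D equals it, so `C_match` is "the string in C on the same row as the string found in D".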

pandas advanced splitting by comma

There have been a lot of posts concerning splitting a single column into multiples, but I couldn't find an answer to a slight modification to the idea of splitting.
When you use str.split, it splits the string regardless of order. You can make it slightly more elaborate, for example by sorting the split results alphabetically.
e.g. dataframe (df)
row
0 a, e, c, b
1 b, d, a
2 a, b, c, d, e
3 d, f
foo = df['row'].str.split(',')
will split based on the comma and return:
0 1 2 3
0 a e c b
....
However, that doesn't align the results by their unique value. Even if you sort the split strings, it will still only result in this:
   0  1  2  3  4  5
0  a  b  c  e
1  a  b  d
...
whereas I want it to look like this:
   0  1  2  3  4  5
0  a  b  c     e
1  a  b     d
2  a  b  c  d  e
...
I know I'm missing something. Do I need to add the columns first and then map the split values to the correct column? What if you don't know all of the unique values? Still learning pandas syntax so any pointers in the right direction would be appreciated.
Using get_dummies:
s = df.row.str.get_dummies(sep=', ')
s.mul(s.columns)
Out[239]:
   a  b  c  d  e  f
0  a  b  c     e
1  a  b     d
2  a  b  c  d  e
3           d     f
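Putting the answer together as a runnable snippet, with the sample frame from the question and the comma-plus-space separator its data uses:

```python
import pandas as pd

df = pd.DataFrame({"row": ["a, e, c, b", "b, d, a", "a, b, c, d, e", "d, f"]})

# One 0/1 indicator column per unique value...
s = df["row"].str.get_dummies(sep=", ")
# ...then multiply by the column names: 1 -> the value, 0 -> ""
aligned = s.mul(s.columns)
```

Because get_dummies creates a column for every unique value it sees, you do not need to know the set of values in advance.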

Excel Formula comparing two columns

Below is a sample of the data I have. I want to match the data in columns A and B. If column B does not match column A, I want to insert a row and copy the value from column A to B. For example, "4" is missing in column B, so I want to insert a blank row and add "4" to column B so it matches column A. I have a large data set, so I am trying to find a different way instead of checking for duplicate values in the two columns and manually adding one row at a time. Thanks!
A    B    C    D
3    3    Y    B
4    5    G    B
5    6    B    G
6    8    P    G
7    9    Y    P
8    11   G    Y
9    12   B    Y
10
11
12
I would move columns B, C, D to separate columns, say E, F, G, then use INDEX/MATCH against columns A and E to identify which records are missing.
For col C: =IFERROR(INDEX(F:F,MATCH(A1,E:E,0)),"N/A")
For col D: =IFERROR(INDEX(G:G,MATCH(A1,E:E,0)),"N/A")
Following this you can filter for C="N/A" to identify cases where a B value is missing for an A value, and edit them manually. Since you want A and B to match, column B is now unnecessary; the final result after removing column B (shifting C→B, D→C):
A B C
3 Y B
4 N/A N/A
5 G B
6 B G
7 N/A N/A
Hope this helps!
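For a large sheet, the same alignment can also be done outside Excel. A pandas sketch (column letters used as names, values taken from the sample) left-joins B/C/D onto A and copies A into the missing B cells:

```python
import pandas as pd

a = pd.DataFrame({"A": range(3, 13)})
bcd = pd.DataFrame({"B": [3, 5, 6, 8, 9, 11, 12],
                    "C": ["Y", "G", "B", "P", "Y", "G", "B"],
                    "D": ["B", "B", "G", "G", "P", "Y", "Y"]})

# Left join keeps every A value; unmatched rows get NaN in B/C/D
out = a.merge(bcd, left_on="A", right_on="B", how="left")
out["B"] = out["B"].fillna(out["A"]).astype(int)  # copy A into missing B
```

The rows where C and D are NaN correspond exactly to the "N/A" rows the INDEX/MATCH approach flags.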
