How to create a Chord diagram out of this dataset format? - python-3.x

I have a dataset which consists of passes made and received by a player with every teammate. A sample dataset looks like this:
            ter Stegen  Pique  Rakitic  Busquets  Coutinho  Suarez  Messi  Lenglet  Alba  Roberto  Vidal
ter Stegen           0      8        0         2         0       1      1        4     3        5      5
Pique               12      0        2        20         0       0      1        9     2       10      5
Rakitic              3      3        0        13         5       2      6        4     8        2      5
Busquets             1      1        9         0         0       0      8        4     8        7     12
Coutinho             0      0        2         1         0       4      6        0     3        0      1
Suarez               0      0        2         1         2       0      1        0     5        3      3
Messi                0      2        5         1         3       4      0        0     4        3      4
Lenglet              4      6        8         8         1       0      0        0     4        0      4
Alba                 1      1        8         4         5       8      5        6     0        1      4
Roberto              4     11        5         4         0       4      6        1     0        0      8
Vidal                1     10        5         8         3       2      7        5     7        6      0
How do I visualize this as a chord diagram that shows the flow of passes from every player to every other player? I've tried using HoloViews and Plotly, but I can't work out how to handle data in this format. Any help would be appreciated.
Here's the entire code:
import pandas as pd
import holoviews as hv
from holoviews import opts, dim
from bokeh.plotting import show, output_file
import numpy as np

pd.set_option("display.max_columns", 11)
hv.extension('bokeh')
hv.output(size=200)

df = pd.read_csv(r"C:\Users\ADMIN\Desktop\Abhishek\BarLiv.csv")
df = df.set_index("0")
df.index.name = None
# print(df)

# Declare a gridded HoloViews dataset and call dframe to flatten it
players = list(df.columns)
data = hv.Dataset((players, players, df), ['source', 'target']).dframe()
# print(players)

# Now create your Chord diagram from the flattened data
chord = hv.Chord(data)
chord.opts(
    node_color='index', edge_color='source', label_index='index',
    cmap='Category10', edge_cmap='Category10', width=500, height=500)

output_file('chordtest.html')
show(hv.render(chord))
Edit 1: Here's what I'm getting after implementing @philippjfr's solution

HoloViews provides a neat little trick that makes this pretty easy: you can declare a gridded Dataset from your dataframe and then flatten it:
df = pd.read_csv('/Users/philippjfr/Downloads/BarLiv.csv', index_col=0)
# Declare a gridded HoloViews dataset and call dframe to flatten it
data = hv.Dataset((list(df.columns), list(df.index), df),
                  ['source', 'target'], 'value').dframe()
# Now create your Chord diagram from the flattened data
chord = hv.Chord(data)
chord.opts(
    node_color='index', edge_color='source', label_index='index',
    cmap='Category10', edge_cmap='Category10', width=500, height=500)
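If you need a standalone HTML file rather than inline notebook output, hv.save is a convenient one-liner. A minimal sketch, equivalent in effect to the output_file/show pair in the question's code (assuming the bokeh backend is active, as above):
# Write the rendered chord diagram to a self-contained HTML file
hv.save(chord, 'chordtest.html')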

Related

Create an aggregate column based on other columns in pandas dataframe

I have a dataframe as below:
import pandas as pd
import numpy as np
import datetime

# Initialise data of lists
data = {'group':   ["A", "A", "B", "B", "B"],
        'A1_val':  [4, 5, 7, 6, 5],
        'A1M_val': [10, 100, 100, 10, 1],
        'AB_val':  [4, 5, 7, 6, 5],
        'ABM_val': [10, 100, 100, 10, 1],
        'AM_VAL':  [4, 5, 7, 6, 5]}

# Create DataFrame
df1 = pd.DataFrame(data)
df1
  group  A1_val  A1M_val  AB_val  ABM_val  AM_VAL
0     A       4       10       4       10       4
1     A       5      100       5      100       5
2     B       7      100       7      100       7
3     B       6       10       6       10       6
4     B       5        1       5        1       5
Step 1: I want to create columns as below:
A1_agg_val = A1_val + A1M_val (strip the M out of the column name; if the remaining names match, sum those columns)
Similarly, AB_agg_val = AB_val + ABM_val
Since there is no matching column for 'AM_VAL', AM_agg_val = AM_VAL
My expected output:
  group  A1_val  A1M_val  AB_val  ABM_val  AM_VAL  A1_AGG_val  AB_AGG_val  A_AGG_val
0     A       4       10       4       10       4          14          14          4
1     A       5      100       5      100       5         105         105          5
2     B       7      100       7      100       7         107         107          7
3     B       6       10       6       10       6          16          16          6
4     B       5        1       5        1       5           6           6          5
You can use groupby on axis=1:
out = (df1.assign(**df1.loc[:, df1.columns.str.lower().str.endswith('_val')]
                       .groupby(lambda x: x[:2], axis=1).sum()
                       .add_suffix('_agg_value')))
print(out)
  group  A1_val  A1M_val  AB_val  ABM_val  AM_VAL  A1_agg_value  AB_agg_value  AM_agg_value
0     A       4       10       4       10       4            14            14             4
1     A       5      100       5      100       5           105           105             5
2     B       7      100       7      100       7           107           107             7
3     B       6       10       6       10       6            16            16             6
4     B       5        1       5        1       5             6             6             5
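Note that groupby(..., axis=1) has since been deprecated (pandas 2.x warns on it). A sketch of an equivalent approach, reusing df1 from the question: transpose so the column labels become the index, group them by their two-character prefix, sum, and transpose back. This is my rewording, not the answerer's code:
# Select the *_val columns and aggregate by two-character prefix without axis=1
vals = df1.loc[:, df1.columns.str.lower().str.endswith('_val')]
agg = vals.T.groupby(lambda c: c[:2]).sum().T.add_suffix('_agg_value')
out = df1.join(agg)
print(out)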

How to count the number of records in each group and add them to the main dataset?

Given that I have a dataset as below:
import pandas as pd
import numpy as np

dt = {
    "facility":   ["Ann Arbor", "Ann Arbor", "Detriot", "Detriot", "Detriot"],
    "patient_ID": [4388, 4388, 9086, 9086, 9086],
    "year":       [2004, 2007, 2007, 2008, 2011],
    "month":      [8, 9, 9, 6, 2],
    "Nr_Small":   [0, 0, 5, 12, 10],
    "Nr_Medium":  [3, 1, 1, 4, 3],
    "Nr_Large":   [2, 0, 0, 0, 0]
}
dt = pd.DataFrame(dt)
dt.head()
I need to add a column which shows the number of records in each group of patients. Here is what I am doing:
dt["NumberOfVisits"] = dt.groupby(['patient_ID']).size()
I also tried a few variations of this, but it adds a column of NaNs to my dataset. My desired output is as below.
Use transform here:
dt["NumberOfVisits"] = dt.groupby(['patient_ID'])['patient_ID'].transform('size')
print(dt)
    facility  patient_ID  year  month  Nr_Small  Nr_Medium  Nr_Large  NumberOfVisits
0  Ann Arbor        4388  2004      8         0          3         2               2
1  Ann Arbor        4388  2007      9         0          1         0               2
2    Detriot        9086  2007      9         5          1         0               3
3    Detriot        9086  2008      6        12          4         0               3
4    Detriot        9086  2011      2        10          3         0               3
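For context on why the original attempt produced NaNs: dt.groupby(['patient_ID']).size() returns a Series indexed by patient_ID (4388, 9086), so assigning it to dt aligns on dt's 0..4 row index and matches nowhere, while transform('size') broadcasts each group's size back onto the original rows. A map-based sketch that should give the same result:
# Count rows per patient once, then look the count up for each row
dt["NumberOfVisits"] = dt["patient_ID"].map(dt["patient_ID"].value_counts())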

Removing Suffix From Dataframe Column Names - Python

I am trying to remove a suffix from all column names in a dataframe; however, I am getting error messages. Any suggestions would be appreciated.
df = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=list('ABCD'))
df.add_suffix('_x')
def strip_right(df.columns, _x):
    if not text.endswith("_x"):
        return text
    # else
    return text[:len(df.columns)-len("_x")]
Error:
def strip_right(tmp, "_x"):
                     ^
SyntaxError: invalid syntax
I've also tried removing the quotations.
def strip_right(df.columns, _x):
    if not text.endswith(_x):
        return text
    # else
    return text[:len(df.columns)-len(_x)]
Error:
def strip_right(df.columns, _x):
                  ^
SyntaxError: invalid syntax
Here is a more concrete example:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=list('ABCD'))
df = df.add_suffix('_x')
print("With Suffix")
print(df.head())

def strip_right(df, suffix='_x'):
    df.columns = df.columns.str.rstrip(suffix)

strip_right(df)
print("\n\nWithout Suffix")
print(df.head())
Output:
With Suffix
   A_x  B_x  C_x  D_x
0    0    7    0    2
1    5    1    8    5
2    6    2    0    1
3    6    6    5    6
4    8    6    5    8

Without Suffix
   A  B  C  D
0  0  7  0  2
1  5  1  8  5
2  6  2  0  1
3  6  6  5  6
4  8  6  5  8
I found a bug in the implementation of the accepted answer. The docs for pandas.Series.str.rstrip() reference str.rstrip(), which states:
"The chars argument is not a suffix; rather, all combinations of its values are stripped."
Instead I had to use pandas.Series.str.replace to remove the actual suffix from my column names. See the modified example below.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=list('ABCD'))
df = df.add_suffix('_x')
df['Ex_'] = np.random.randint(0, 10, size=10)
df1 = pd.DataFrame(df, copy=True)

print("With Suffix")
print(df1.head())

def strip_right(df, suffix='_x'):
    df.columns = df.columns.str.rstrip(suffix)

strip_right(df1)
print("\n\nAfter .rstrip()")
print(df1.head())

def replace_right(df, suffix='_x'):
    df.columns = df.columns.str.replace(suffix + '$', '', regex=True)

print("\n\nWith Suffix")
print(df.head())
replace_right(df)
print("\n\nAfter .replace()")
print(df.head())
Output:
With Suffix
   A_x  B_x  C_x  D_x  Ex_
0    4    9    2    3    4
1    1    6    5    8    6
2    2    5    2    3    6
3    1    4    7    6    4
4    3    9    3    5    8

After .rstrip()
   A  B  C  D  E
0  4  9  2  3  4
1  1  6  5  8  6
2  2  5  2  3  6
3  1  4  7  6  4
4  3  9  3  5  8

After .replace()
   A  B  C  D  Ex_
0  4  9  2  3    4
1  1  6  5  8    6
2  2  5  2  3    6
3  1  4  7  6    4
4  3  9  3  5    8
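If your pandas is recent enough (the .str.removesuffix accessor was added in pandas 1.4, mirroring Python 3.9's str.removesuffix), there is a third option that removes the exact suffix only, sidestepping the rstrip character-set pitfall without a regex. A minimal sketch of a drop-in replacement for strip_right/replace_right above:
def remove_right(df, suffix='_x'):
    # Drops the literal suffix only, so a column like 'Ex_' is left untouched
    df.columns = df.columns.str.removesuffix(suffix)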

Slicing a pandas dataframe

import pandas as pd
x = pd.DataFrame([[1,2,3],[4,5,6]])
x[::2]
What does the above command mean, and how does it work?
It is easier to see with more data: slicing with a step of 2 returns only every second row:
x = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9],[0,1,2]])
print(x)
   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9
3  0  1  2

print(x[::2])
   0  1  2
0  1  2  3
2  7  8  9
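In other words, x[::2] is Python's extended slice syntax start:stop:step applied to the rows: start and stop are omitted and the step is 2, so it keeps the rows at positions 0, 2, 4, and so on. An integer slice in [] selects DataFrame rows positionally, so it should behave like the explicit positional indexer:
# Equivalent explicit form: every second row by position
print(x.iloc[::2])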

How do I calculate the probability of every value in a dataframe column quickly in Python?

I want to calculate the probability of every value in a DataFrame column according to the column's own distribution. For example, my data looks like this:
    data
0      1
1      1
2      2
3      3
4      2
5      2
6      7
7      8
8      3
9      4
10     1
And the output I expect looks like this:
    data       pro
0      1  0.155015
1      1  0.155015
2      2  0.181213
3      3  0.157379
4      2  0.181213
5      2  0.181213
6      7  0.048717
7      8  0.044892
8      3  0.157379
9      4  0.106164
10     1  0.155015
I also referred to another question (How to compute the probability ...) and took the example above from it. My code is as follows:
import pandas as pd
import scipy.stats

samples = [1, 1, 2, 3, 2, 2, 7, 8, 3, 4, 1]
samples = pd.DataFrame(samples, columns=['data'])
print(samples)
kde = scipy.stats.gaussian_kde(samples['data'].tolist())
samples['pro'] = kde.pdf(samples['data'].tolist())
print(samples)
The problem is that when the column is long, this operation becomes slow. Is there a better way to do it in pandas? Thanks in advance.
"Its own distribution" does not have to mean a KDE. You can use value_counts with normalize=True:
samples.assign(pro=samples.data.map(samples.data.value_counts(normalize=True)))
    data       pro
0      1  0.272727
1      1  0.272727
2      2  0.272727
3      3  0.181818
4      2  0.272727
5      2  0.272727
6      7  0.090909
7      8  0.090909
8      3  0.181818
9      4  0.090909
10     1  0.272727
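This is fast because value_counts is computed once over the whole column and map is a vectorized lookup, whereas gaussian_kde.pdf does numerical work for every point. An equivalent groupby sketch, using the question's samples frame:
# Each group's size divided by the total row count is the empirical probability
samples['pro'] = samples.groupby('data')['data'].transform('size') / len(samples)
print(samples)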
