Indexing Pandas Dataframe [duplicate] - python-3.x

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 4 years ago.
I have 2 pandas dataframes with names and scores.
The first dataframe is in the form:
df_score_1
A B C D
A 0 1 2 0
B 1 0 0 2
C 2 0 0 3
D 0 2 3 0
where
df_score_1.index
Index(['A', 'B', 'C', 'D'], dtype='object')
The second dataframe comes from a text file with three columns; it lists only the positive (non-zero) scores:
df_score_2
A B 1
A C 1
A D 2
B C 5
B D 1
The goal is to transform df_score_2 into the form of df_score_1 using pandas commands. The original form comes from a networkx call, nx.to_pandas_dataframe(G).
I've tried multi-indexing, but the index doesn't come out in the form I would like. Is there an option when reading in the text file, or a function to transform the dataframe afterwards?

Are you trying to merge the dataframes, or do you just want them to have the same index? If you need the same index, use this:
l=df1.index.tolist()
df2.set_index(l, inplace=True)

crosstab and reindex are the best solutions I've found so far:
df = pd.crosstab(df[0], df[1], values=df[2], aggfunc='sum')
idx = df.columns.union(df.index)
df = df.reindex(index=idx, columns=idx)
The output is an adjacency matrix, but with NaN in the missing cells rather than mirrored values.
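Putting that together, a minimal sketch (with made-up data in the shape of df_score_2 from the question) that also fills the NaN cells and mirrors the upper triangle to get the symmetric df_score_1 form:

```python
import pandas as pd

# hypothetical reconstruction of df_score_2's three text-file columns
df_score_2 = pd.DataFrame({0: ['A', 'A', 'A', 'B', 'B'],
                           1: ['B', 'C', 'D', 'C', 'D'],
                           2: [1, 1, 2, 5, 1]})

adj = pd.crosstab(df_score_2[0], df_score_2[1],
                  values=df_score_2[2], aggfunc='sum')
idx = adj.columns.union(adj.index)
adj = adj.reindex(index=idx, columns=idx).fillna(0)
adj = adj + adj.T  # mirror the upper triangle so the matrix is symmetric
```

Since every score appears only above the diagonal, adding the transpose fills the lower triangle without double-counting.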

I think you need:
df_score_2.set_index(df_score_1.index, inplace=True)


How to turn a column of a data frame into suffixes for other column names? [duplicate]

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 1 year ago.
Suppose I have a data frame like this:
A B C D
0 1 10 x 5
1 1 20 y 5
2 1 30 z 5
3 2 40 x 6
4 2 50 y 6
5 2 60 z 6
This can be viewed as a table that stores the value of B as a function of A, C, and D. Now I would like to transform the B column into three columns B_x, B_y, B_z, like this:
A B_x B_y B_z D
0 1 10 20 30 5
1 2 40 50 60 6
I.e., B_x stores B(A, D) when C = 'x', B_y stores B(A, D) when C = 'y', etc.
What is the most efficient way to do this?
I have found a solution like this:
frames = []
for c, subframe in df.groupby('C'):
    subframe = subframe.rename(columns={'B': f'B_{c}'})
    subframe = subframe.set_index(['A', 'D'])
    del subframe['C']
    frames.append(subframe)

out = frames[0]
for frame in frames[1:]:
    out = out.join(frame)
out = out.reset_index()
This gives the correct response, but I feel it is highly inefficient. I am also not happy that implementing this solution requires explicitly knowing which columns should not get the prefix from column C. (In this MWE there were only two of them, but there could be tens in real life.)
Is there a better solution? I.e., a method that says, take a column as a suffix column (in this case C) and a set of 'value' columns (in this case only B); turn the value column names into name_prefix and fill them appropriately?
Here's one way to do it:
import pandas as pd

df = pd.DataFrame(data={'A': [1, 1, 1, 2, 2, 2],
                        'B': [10, 20, 30, 40, 50, 60],
                        'C': ['x', 'y', 'z', 'x', 'y', 'z'],
                        'D': [5, 5, 5, 6, 6, 6]})

df2 = df.pivot_table(index=['A', 'D'], columns=['C'], values=['B'])
df2.columns = ['_'.join(col) for col in df2.columns.values]
df2 = df2.reset_index()
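To address the "tens of columns" concern, the same pivot can be written so that the index columns are derived automatically; in this sketch (my generalization, not from the answer above) only the value columns and the suffix column 'C' are named explicitly:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [10, 20, 30, 40, 50, 60],
                   'C': ['x', 'y', 'z', 'x', 'y', 'z'],
                   'D': [5, 5, 5, 6, 6, 6]})

value_cols = ['B']  # the columns whose names get the suffix
# everything that is neither a value column nor the suffix column
# automatically becomes part of the index
index_cols = [c for c in df.columns if c not in value_cols + ['C']]

wide = df.pivot_table(index=index_cols, columns='C', values=value_cols)
wide.columns = [f'{val}_{suffix}' for val, suffix in wide.columns]
wide = wide.reset_index()
```

value_cols can hold any number of columns; the MultiIndex on the pivoted columns always pairs each value name with its C suffix.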

Rename all the columns in python dataframe using replace [duplicate]

This question already has answers here:
How can I strip the whitespace from Pandas DataFrame headers?
(5 answers)
Closed 1 year ago.
I want to use the replace function to remove spaces from the column names. Is there a way to do this without hardcoding the actual names or using df.columns[0], df.columns[1], and so on?
If you wanted to replace all the spaces in the column names you could use this.
Code
import pandas as pd
df = pd.DataFrame({'Column A ':[1,2,2,3], 'Column B ':[4,8,9,12]})
print(df)
print(df.columns)
df.columns = [colname.replace(' ', '') for colname in df.columns]
print(df)
print(df.columns)
Code output
Column A Column B
0 1 4
1 2 8
2 2 9
3 3 12
Index(['Column A ', 'Column B '], dtype='object')
ColumnA ColumnB
0 1 4
1 2 8
2 2 9
3 3 12
Index(['ColumnA', 'ColumnB'], dtype='object')
If that's not what you want, change replace in the list comprehension to whatever method returns the names as you want them, e.g. rstrip to remove whitespace characters only at the end.
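As an alternative to the list comprehension, pandas also exposes vectorized string methods on the column Index itself; a small sketch:

```python
import pandas as pd

df = pd.DataFrame({'Column A ': [1, 2], ' Column B': [3, 4]})

# strip() removes only leading/trailing whitespace from each name...
df.columns = df.columns.str.strip()
# ...and replace() can then rewrite interior spaces, e.g. to underscores
df.columns = df.columns.str.replace(' ', '_')
```

This avoids the comprehension entirely and keeps the operation on the Index, which reads closer to the intent.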

Pandas, DataFrame unique values from few columns [duplicate]

This question already has an answer here:
Get total values_count from a dataframe with Python Pandas
(1 answer)
Closed 4 years ago.
I am trying to count unique values that appear across several columns. My data frame looks like this:
Name Name.1 Name.2 Name.3
x z c y
y p q x
q p a y
The output should look like this:
x 2
z 1
c 1
y 3
q 2
p 2
a 1
I tried groupby and value_counts but couldn't get the correct output. Any ideas? Thanks all!
Seems you want to consider values regardless of their row or column location. In that case you should collapse the dataframe and just use Counter.
from collections import Counter
import numpy as np

arr = np.array(df)
count = Counter(arr.ravel())
Another (pandas-based) approach is to apply value_counts to each column (as a Series) and then take the row-wise sum:
df2 = df.apply(pd.Series.value_counts)
print(df2.sum(axis=1).astype(int))
a 1
c 1
p 2
q 2
x 2
y 3
z 1
dtype: int32
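A third option worth considering (my own sketch, not from the answers above): stack() collapses every column into a single Series, after which one value_counts call does the tally:

```python
import pandas as pd

df = pd.DataFrame({'Name':   ['x', 'y', 'q'],
                   'Name.1': ['z', 'p', 'p'],
                   'Name.2': ['c', 'q', 'a'],
                   'Name.3': ['y', 'x', 'y']})

# stack() turns the frame into one long Series of all cell values,
# regardless of which row or column they came from
counts = df.stack().value_counts()
```

This stays entirely in pandas and avoids the NaN/float handling that the per-column value_counts approach needs.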

How to remove duplicates rows by same values in different order in dataframe by pandas

How do I remove the duplicates in the df? The df has only one column. In this case "60,25" and "25,60" are a pair of duplicated rows, and the output should be the new df. For each pair of duplicated rows, the kept row should be the one in the format "A,B" where A < B, and the removed row the one where A > B. In this case, "25,60" and "80,123" should be kept. Unique rows should stay as they are.
IIUC, using get_dummies with duplicated
df[~df.A.str.get_dummies(sep=',').duplicated()]
Out[956]:
A
0 A,C
1 A,B
4 X,Y,Z
Data input
df
Out[957]:
A
0 A,C
1 A,B
2 C,A
3 B,A
4 X,Y,Z
5 Z,Y,X
Update: the OP changed the question to an entirely different one.
newdf=df.A.str.get_dummies(sep=',')
newdf[~newdf.duplicated()].dot(newdf.columns+',').str[:-1]
Out[976]:
0 25,60
1 123,37
dtype: object
I'd do a combination of things.
Use pandas.Series.str.split to split by commas
Use apply(frozenset) to get a hashable set such that I can use duplicated
Use pandas.Series.duplicated with keep='last'
df[~df.A.str.split(',').apply(frozenset).duplicated(keep='last')]
A
1 123,17
3 80,123
4 25,60
5 25,42
Addressing comments
df.A.apply(
lambda x: tuple(sorted(map(int, x.split(','))))
).drop_duplicates().apply(
lambda x: ','.join(map(str, x))
)
0 25,60
1 17,123
2 80,123
5 25,42
Name: A, dtype: object
Setup
df = pd.DataFrame(dict(
A='60,25 123,17 123,80 80,123 25,60 25,42'.split()
))
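An alternative sketch (mine, not from the answers above) that makes the A < B requirement explicit: sort the numbers inside each cell before dropping duplicates, so the surviving row is always in ascending order:

```python
import pandas as pd

df = pd.DataFrame({'A': '60,25 123,17 123,80 80,123 25,60 25,42'.split()})

# sorting numerically inside each cell makes "60,25" and "25,60" identical,
# and guarantees the kept form has the smaller number first
normalized = df['A'].map(lambda s: ','.join(sorted(s.split(','), key=int)))
result = normalized.drop_duplicates()
```

Note key=int: a plain string sort would put '123' before '80', which is why the numeric key matters here.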

Python pandas: Weird index value

I have posted a similar thread but now have another angle to explore: after doing a covariance analysis between X and Z, grouped by 2 different levels, I get a DF like
index X Z
(1,1,'X') 2.3 0
...
'1' and '1' are the 2 different levels (I could have chosen '1' and '2'; there are 5 and 10 different levels)
Now I would like to extract each 'element' of the index and have something like
index X Z H1 H2 H3
(1,1,'X') 2.3 0 1 1 X
...
I read a few posts on slicing and dicing things, but this is not a normal string, is it?
Cheers
(1,1,'X') isn't a string here; it's a tuple.
So you need to split the tuple into multiple columns. You can achieve this
by using apply(pd.Series), assuming your dataframe is df in this case:
In [10]: df['index'].apply(pd.Series)
Out[10]:
   0  1  2
0  1  1  X
You then need to add the columns back to the original data frame:
df[['H1', 'H2', 'H3']] = df['index'].apply(pd.Series)
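A self-contained sketch (with a made-up frame, assuming the tuples live in an 'index' column); building a DataFrame from the list of tuples is an equivalent, typically faster way to expand them:

```python
import pandas as pd

# hypothetical frame mirroring the question's shape
df = pd.DataFrame({'index': [(1, 1, 'X'), (2, 1, 'Z')],
                   'X': [2.3, 1.1],
                   'Z': [0, 1]})

# build a 3-column frame from the tuples, then assign it back
df[['H1', 'H2', 'H3']] = pd.DataFrame(df['index'].tolist(), index=df.index)
```

Passing index=df.index keeps the new columns aligned with the original rows even if the frame's index isn't the default 0..n-1.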
