Removing rows that have the same information regardless of order in a pandas DataFrame [duplicate]

This question already has answers here:
Remove reverse duplicates from dataframe
(6 answers)
Closed 1 year ago.
I have a data frame with two columns, and I want to remove all rows that duplicate each other regardless of the order of the values:
col1 col2
A B
B A
C D
E F
F E
The output should be
col1 col2
A B
C D
E F
I have tried using the duplicated function, but it did not remove anything because the values are not in the same order.

One way:
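Take the underlying NumPy array and sort the values within each row.
Use the DataFrame constructor to recreate the data frame (now sorted within each row).
Drop the duplicates.
# assumes "import numpy as np" and "import pandas as pd"; axis=1 sorts within each row
df = pd.DataFrame(np.sort(df.values, axis=1), columns=df.columns).drop_duplicates()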
OUTPUT:
col1 col2
0 A B
2 C D
3 E F
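The one-liner above overwrites the original values with their sorted versions. If the original orientation of each surviving row should be kept, a minimal sketch (assuming the same two-column frame as in the question) is to use the sorted copy only to build a duplicate mask:

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"col1": list("ABCEF"), "col2": list("BADFE")})

# Sort the values within each row, keeping the original index for alignment
sorted_rows = pd.DataFrame(np.sort(df.to_numpy(), axis=1), index=df.index)

# Mark rows whose sorted form has already been seen, then filter the original frame
result = df[~sorted_rows.duplicated()]
print(result)
```

The first occurrence of each unordered pair is kept, and its values stay exactly as they appeared in the input.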

Related

How to turn a column of a data frame into suffixes for other column names? [duplicate]

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 1 year ago.
Suppose I have a data frame like this:
A B C D
0 1 10 x 5
1 1 20 y 5
2 1 30 z 5
3 2 40 x 6
4 2 50 y 6
5 2 60 z 6
This can be viewed as a table that stores the value of B as a function of A, C, and D. Now, I would like to transform the B column into three columns B_x, B_y, B_z, like this:
A B_x B_y B_z D
0 1 10 20 30 5
1 2 40 50 60 6
I.e., B_x stores B(A, D) when C = 'x', B_y stores B(A, D) when C = 'y', etc.
What is the most efficient way to do this?
I have found a solution like this:
frames = []
for c, subframe in df.groupby('C'):
    subframe = subframe.rename(columns={'B': f'B_{c}'})
    subframe = subframe.set_index(['A', 'D'])
    del subframe['C']
    frames.append(subframe)
out = frames[0]
for frame in frames[1:]:
    out = out.join(frame)
out = out.reset_index()
This gives the correct result, but I feel that it is highly inefficient. I am also not too happy that this solution requires naming explicitly the columns that should not get the suffix from column C. (In this MWE there are only two of them, but there could be tens in real life.)
Is there a better solution? That is, a method that takes a suffix column (in this case C) and a set of 'value' columns (in this case only B), renames the value columns to name_suffix, and fills them appropriately?
Here's one way to do it:
import pandas as pd
df = pd.DataFrame(data={'A': [1, 1, 1, 2, 2, 2],
                        'B': [10, 20, 30, 40, 50, 60],
                        'C': ['x', 'y', 'z', 'x', 'y', 'z'],
                        'D': [5, 5, 5, 6, 6, 6]})
df2 = df.pivot_table(index=['A', 'D'],
                     columns=['C'],
                     values=['B'])
# flatten the ('B', 'x') style MultiIndex columns into 'B_x'
df2.columns = ['_'.join(col) for col in df2.columns.values]
df2 = df2.reset_index()
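One thing to be aware of: pivot_table aggregates with the mean by default, so the integer B values come back as floats. When each (A, D, C) combination occurs exactly once, as in the sample data, a plain pivot avoids the aggregation. The following is only a sketch under that assumption; passing a list to index requires a reasonably recent pandas (1.1 or later).

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2, 2, 2],
                   "B": [10, 20, 30, 40, 50, 60],
                   "C": ["x", "y", "z", "x", "y", "z"],
                   "D": [5, 5, 5, 6, 6, 6]})

out = (
    df.pivot(index=["A", "D"], columns="C", values="B")  # no aggregation; raises on duplicate keys
      .add_prefix("B_")                                  # x, y, z -> B_x, B_y, B_z
      .rename_axis(columns=None)                         # drop the leftover "C" columns label
      .reset_index()
)
print(out)
```

The resulting columns come out as A, D, B_x, B_y, B_z; reorder them afterwards if D should come last.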

Turn columns into rows, copying and keeping the same index in front of all the previously combined columns

the inverse of
Transpose columns to rows keeping first 3 columns the same
turn:
id col1 col2 col3
1 A B
2 X Y Z
into:
id
1 A
1 B
2 X
2 Y
2 Z
I'm trying unpivot(), but judging from the solution I cited, do I need to use .unstack()?
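The question does not say which library is in use. Assuming the data lives in a pandas DataFrame and the empty cell is stored as a missing value, one possible sketch uses melt; the column name "value" is chosen here only for illustration.

```python
import pandas as pd

# Sample data from the question; the empty cell is assumed to be a missing value
df = pd.DataFrame({"id": [1, 2],
                   "col1": ["A", "X"],
                   "col2": ["B", "Y"],
                   "col3": [None, "Z"]})

long = (
    df.melt(id_vars="id", value_name="value")  # one row per (id, original column)
      .dropna(subset=["value"])                # discard the empty cells
      .sort_values("id", kind="stable")        # group rows by id, keeping column order
      .drop(columns="variable")                # the original column names are not needed
      .reset_index(drop=True)
)
print(long)
```

The stable sort matters: it keeps col1, col2, col3 in their original order within each id.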

Pandas, DataFrame unique values from few columns [duplicate]

This question already has an answer here:
Get total values_count from a dataframe with Python Pandas
(1 answer)
Closed 4 years ago.
I am trying to count the unique values that appear across a few columns. My data frame looks like this:
Name Name.1 Name.2 Name.3
x z c y
y p q x
q p a y
The output should look like this:
x 2
z 1
c 1
y 3
q 2
p 2
a 1
I tried groupby and value_counts but couldn't get the correct output. Any ideas? Thanks, all!
Seems you want to consider values regardless of their row or column location. In that case you should collapse the dataframe and just use Counter.
from collections import Counter
import numpy as np
arr = np.array(df)
count = Counter(arr.reshape(arr.size))  # flatten to 1-D, then count every value
Another (pandas-based) approach is to apply Series.value_counts to each column and then sum the counts across columns (row-wise, axis=1):
df2 = df.apply(pd.Series.value_counts)
print(df2.sum(axis=1).astype(int))
a 1
c 1
p 2
q 2
x 2
y 3
z 1
dtype: int32
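A third option, sketched here with the sample data from the question, is to stack all columns into a single Series and count once. This avoids the intermediate frame full of NaN values that the column-wise apply produces.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"Name":   ["x", "y", "q"],
                   "Name.1": ["z", "p", "p"],
                   "Name.2": ["c", "q", "a"],
                   "Name.3": ["y", "x", "y"]})

# stack() collapses every column into one long Series; value_counts() then counts it
counts = df.stack().value_counts().sort_index()
print(counts)
```

The sort_index call is optional; without it the result is ordered by frequency.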

Indexing Pandas Dataframe [duplicate]

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 4 years ago.
I have 2 pandas dataframes with names and scores.
The first dataframe is in the form:
df_score_1
A B C D
A 0 1 2 0
B 1 0 0 2
C 2 0 0 3
D 0 2 3 0
where
df_score_1.index
Index(['A', 'B', 'C', 'D'],dtype='object')
The second dataframe comes from a text file with three columns; it does not list zeros, only positive (non-zero) scores:
df_score_2
A B 1
A C 1
A D 2
B C 5
B D 1
The goal is to transform df_score_2 into the form df_score_1 using pandas commands. The original form is from a networkx output nx.to_pandas_dataframe(G) line.
I've tried multi-indexing and the index doesn't display the form I would like. Is there an option when reading in a text file or a function to transform the dataframe after?
Are you trying to merge the dataframes, or do you just want them to have the same index? If you need the same index, use this:
l=df1.index.tolist()
df2.set_index(l, inplace=True)
crosstab and reindex are the best solutions I've found so far:
df = pd.crosstab(df[0], df[1], values=df[2], aggfunc='sum')
idx = df.columns.union(df.index)
df = df.reindex(index=idx, columns=idx)
The output is an adjacency matrix, but the missing entries are NaN instead of mirrored values.
Here's a link to a similar question
I think you need,
df_score_2.set_index(df_score_1.index,inplace=True)
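To get a fully mirrored matrix with zeros in place of NaN, one possible sketch is to build the upper triangle with pivot and add its transpose. It assumes each pair appears only once in the file and that there are no self-loops (a self-loop would be counted twice); the column names src, dst and score are made up for illustration.

```python
import pandas as pd

# Edge list from the question; the column names are hypothetical
edges = pd.DataFrame(
    [["A", "B", 1], ["A", "C", 1], ["A", "D", 2], ["B", "C", 5], ["B", "D", 1]],
    columns=["src", "dst", "score"],
)

# Every label that occurs on either side of an edge
labels = sorted(set(edges["src"]) | set(edges["dst"]))

# Square matrix holding one direction of each edge, zeros elsewhere
upper = (
    edges.pivot(index="src", columns="dst", values="score")
         .reindex(index=labels, columns=labels)
         .fillna(0)
)

# Adding the transpose mirrors each score across the diagonal
sym = (upper + upper.T).astype(int).rename_axis(index=None, columns=None)
print(sym)
```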

Transpose Excel Row Data into columns based on Unique Identifier

I have an Excel table in the format below.
Sr. No. Column 1 (X) Column 2(Y) Column 3(Z)
1 X Y Z
2 Y Z
3 Y
4 X Y
5 X
I want to transpose it into the following format in MS Excel.
Sr. No. Value
1 X
1 Y
1 Z
2 Y
2 Z
3 Y
4 X
4 Y
5 X
The actual data contains more than 30 columns, which need to be transposed into 2 columns.
Please guide me.
Select the complete table data and name it SourceData using
Formula>Name Manager
Now use the following formula to get the first column:
=INDEX(SourceData,CEILING(ROWS($A$1:A1)/(COLUMNS(SourceData)-1),1),1)
And for second column:
=INDEX(SourceData,CEILING(ROWS($A$1:A1)/(COLUMNS(SourceData)-1),1),MOD(ROWS($A$1:A1)-1,COLUMNS(SourceData)-1)+2)
Copy the formulas down, then use Paste Special > Values and delete the blanks / zeroes.
You will get the required result.
If you were using other databases, there might be a formal unpivot operator/function available. But in MySQL, this is not a possibility. However, one approach which should work here would be to just take a union of the three columns:
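-- your_table is a placeholder; sr_no, col1, col2, col3 are assumed column names
SELECT sr_no, col1 AS value FROM your_table WHERE col1 IS NOT NULL
UNION ALL
SELECT sr_no, col2 FROM your_table WHERE col2 IS NOT NULL
UNION ALL
SELECT sr_no, col3 FROM your_table WHERE col3 IS NOT NULL
ORDER BY sr_no;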
