Display barplot in column order using pandas


I have a data frame with two columns - col1 and col2, but when I use df.plot.barh, the plot returns results in col2 and col1 order. Is there a way to get the plot to display results in col1 and col2 order?

df = pd.DataFrame(np.random.randint(0,10,(5,2)), columns=['col1','col2'])
df.plot.barh()
yields a horizontal bar chart (figure omitted).
Using bar() instead:
df = pd.DataFrame(np.random.randint(0,10,(5,2)), columns=['col1','col2'])
df.plot.bar()
In both instances, col1 is drawn first within each group: at the bottom of the group for barh() (closest to the x axis) and on the left for bar(). To reverse the order in which the columns are drawn, reverse the order in which they appear in your dataframe. Reversing the column index does this:
df = df[df.columns[::-1]]
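A minimal sketch of the reordering step, assuming a small hand-made frame (the values are illustrative and not from the question). It shows that reversing the column index and listing the columns explicitly give the same result; pandas draws one bar series per column, in column order, so either form controls the drawing order.

```python
import pandas as pd

# Illustrative data; the question used random integers.
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})

# Option 1: reverse whatever column order the frame currently has.
reversed_df = df[df.columns[::-1]]

# Option 2: spell out the desired order explicitly.
explicit_df = df[["col2", "col1"]]

print(list(reversed_df.columns))
print(reversed_df.equals(explicit_df))
```

Calling reversed_df.plot.barh() on the result should then draw col2 first in each group, which places col1 on top when the chart is read from top to bottom.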

Related

Sort values in a dataframe by a column and take second one only if equal

I've created a dataframe using random values using the following code:
values = random(5)
values_1= random(5)
col1= list(values/ values .sum())
col2= list(values_1)
df = pd.DataFrame({'col1':col1, 'col2':col2})
df.sort_values(by=['col2','col1'],ascending=[False,False]).reset_index(inplace=True)
The dataframe created in my case looks like this:
As you can see, the dataframe is not sorted in descending order by 'col2'. What I want to achieve is that it first sorts by 'col2' and if any 2 rows have same values for 'col2', then it should sort by 'col1' as well. Any suggestions? Any help would be appreciated.

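Your solution almost works. The problem is that reset_index(inplace=True) modifies the temporary DataFrame returned by sort_values, so the sorted result is never assigned back to df.
One possible solution is to pass ignore_index=True to sort_values and assign the result, so reset_index is not necessary.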
np.random.seed(2022)
df = pd.DataFrame({'col1':np.random.random(5), 'col2':np.random.random(5)})
df = df.sort_values(by=['col2','col1'],ascending=False, ignore_index=True)
print (df)
col1 col2
0 0.499058 0.897657
1 0.049974 0.896963
2 0.685408 0.721135
3 0.113384 0.647452
4 0.009359 0.486988
Or, if you want to use inplace, pass it only to sort_values, together with ignore_index=True:
df.sort_values(by=['col2','col1'],ascending=False, ignore_index=True,inplace=True)
print (df)
col1 col2
0 0.499058 0.897657
1 0.049974 0.896963
2 0.685408 0.721135
3 0.113384 0.647452
4 0.009359 0.486988

Your logic is correct but you've missed an inplace=True inside sort_values. Due to this, the sorting does not actually take place in your dataframe. Replace it with this:
df.sort_values(by=['col2','col1'],ascending=[False,False],inplace=True)
df.reset_index(inplace=True,drop=True)

You also need to do the sort with inplace=True, not only the reset_index().
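The answers above can be checked with a short sketch. The data here is hand-made (an assumption, chosen so that col2 contains a tie) rather than random, which makes the tie-breaking on col1 visible.

```python
import pandas as pd

df = pd.DataFrame({"col1": [0.1, 0.4, 0.3, 0.2], "col2": [2, 1, 2, 3]})

# The chained call from the question: sort_values returns a new frame,
# reset_index(inplace=True) modifies that temporary frame and returns None.
result = df.sort_values(by=["col2", "col1"], ascending=False).reset_index(inplace=True)
unchanged = df["col2"].tolist()  # df itself was never sorted

# Assigning the sorted frame back; ignore_index=True renumbers the rows.
sorted_df = df.sort_values(by=["col2", "col1"], ascending=False, ignore_index=True)

print(result)
print(unchanged)
print(sorted_df)
```

The two rows with col2 == 2 end up ordered by col1 in descending order, which is the tie-breaking behaviour the question asks for.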

Is there any alternative to merge multiple rows into a single row without using groupBy() & collect_list() in spark?

I am trying to merge multiple rows into a single row after grouping data on a different column.
col1 col2
A 1
A 2
B 1
B 3
to
col1 col2
A 1,2
B 1,3
By using the below code:
df = spark.sql("select col1, col2, col3,...., colN from tablename where col3 = 'ABCD' limit 1000")
df.select('col1','col2').groupby('col1').agg(psf.concat_ws(', ', psf.collect_list(df.col2))).display()
This works fine when there is little data.
But if I increase the number of rows to 1 million, the code fails with the exception:
java.lang.Exception: Results too large
Is there any alternative way to merge multiple rows into a single row in Spark without using the combination of groupby() and collect_list()?

Converting column values to superscript and combining with another column to create a new column

I have a data frame as follows:
df = pd.DataFrame()
df['col1'] = [2,2,3,4,5]
df['col2'] = [2,2,2,2,2]
I want to create a new column that combines col1 and col2, with col2 rendered as a superscript of col1. Below is the expected output for col3 (figure omitted).
How can I achieve this?
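One possible approach, offered as a sketch rather than a confirmed answer: since the expected output is not shown, this assumes col3 should be a string such as "2²", built from Unicode superscript characters. The SUPERSCRIPTS table name is hypothetical.

```python
import pandas as pd

# Translation table from ASCII digits (and minus) to Unicode superscripts.
SUPERSCRIPTS = str.maketrans("0123456789-", "⁰¹²³⁴⁵⁶⁷⁸⁹⁻")

df = pd.DataFrame()
df["col1"] = [2, 2, 3, 4, 5]
df["col2"] = [2, 2, 2, 2, 2]

# Convert both columns to strings, superscript col2, then concatenate.
df["col3"] = df["col1"].astype(str) + df["col2"].astype(str).str.translate(SUPERSCRIPTS)

print(df)
```

The result is a column of strings, so it is suitable for display or labels but not for further arithmetic.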

Printing Columns with a correlation greater than 80%

I have a pandas dataframe with 235607 records and 94 attributes. I am very new to Python. I was able to create a correlation matrix between all of the attributes, but it is a lot to look through individually. I tried writing a for loop to print a list of the columns with a correlation greater than 80%, but I keep getting the error "'DataFrame' object has no attribute 'c1'".
This is the code I used to create the correlation between the attributes, as well as the sample for loop. Thank you in advance for your help:
corr = data.corr() # data is the pandas dataframe
c1 = corr.abs().unstack()
c1.sort_values(ascending = False)
drop = [cols for cols in upper.c1 if any (upper[c1] > 0.80)]
drop

Sort in place if you need to keep using the same variable c1, then grab the pairs of variable names with a list comprehension over the index:
c1.sort_values(ascending=True, inplace=True)
columns_above_80 = [(col1, col2) for col1, col2 in c1.index if c1[col1,col2] > 0.8 and col1 != col2]
Edit: Added col1 != col2 to the list comprehension so you don't pick up each column's correlation with itself.
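The unstacked Series contains every pair twice, as (a, b) and (b, a). A sketch of one way to list each pair once is to keep only the upper triangle of the matrix before filtering. The frame below is hand-made for illustration; `upper` and `strong_pairs` are names assumed here, not taken from the question.

```python
import numpy as np
import pandas as pd

# Small illustrative frame: a and b are almost proportional, c is unrelated.
data = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 1, 4, 2, 3],
})

corr = data.corr().abs()

# Keep only entries strictly above the diagonal; the rest become NaN.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to a Series indexed by (row, column) and keep the strong pairs.
pairs = upper.stack()
strong_pairs = list(pairs[pairs > 0.8].index)

print(strong_pairs)
```

Because NaN compares as False against 0.8, the masked lower triangle and diagonal drop out of the filter regardless of how stack() treats missing values.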

You can build a boolean mask that ignores the diagonal and use it to select the column names:
corr.columns[(corr.mask(np.eye(len(corr), dtype=bool)) > 0.8).any()]
The output is an Index with the names of the columns that have at least one correlation greater than 0.8 with another column.
EDIT: I edited the code above slightly.

Adding n new columns to pandas Data Frame

Given a Data Frame like the following:
df = pd.DataFrame({'term' : ['analys','applic','architectur','assess','item','methodolog','research','rs','studi','suggest','test','tool','viewer','work'],
'newValue' : [0.810419, 0.631963 ,0.687348, 0.810554, 0.725366, 0.742715, 0.799152, 0.599030, 0.652112, 0.683228, 0.711307, 0.625563, 0.604190, 0.724763]})
df = df.set_index('term')
print(df)
newValue
term
analys 0.810419
applic 0.631963
architectur 0.687348
assess 0.810554
item 0.725366
methodolog 0.742715
research 0.799152
rs 0.599030
studi 0.652112
suggest 0.683228
test 0.711307
tool 0.625563
viewer 0.604190
work 0.724763
I want to add n new empty columns "".
Therefore, I have a value stored in variable n which indicates the number of required new columns.
n = 5
Thanks for your help in advance!

According to this answer,
Each non-empty DataFrame has columns, an index and some values.
So your dataframe cannot have a column without a name anyway.
This is the shortest way that I know of to achieve your goal:
n = 5
for i in range(n):
    df[len(df.columns)] = ""
newValue 1 2 3 4 5
term
analys 0.810419
applic 0.631963
architectur 0.687348
assess 0.810554
item 0.725366
methodolog 0.742715
research 0.799152
rs 0.599030
studi 0.652112
suggest 0.683228
test 0.711307
tool 0.625563
viewer 0.604190
work 0.724763

IIUC, you can use:
n = 5
df = pd.concat(
    [df, pd.DataFrame(columns=['col' + str(i) for i in range(n)])],
    axis=1, sort=False,
).fillna('')
print(df)
newValue col0 col1 col2 col3 col4
analys 0.810419
applic 0.631963
architectur 0.687348
assess 0.810554
item 0.725366
methodolog 0.742715
research 0.799152
rs 0.599030
studi 0.652112
suggest 0.683228
test 0.711307
tool 0.625563
viewer 0.604190
work 0.724763
Note: You can remove the fillna() if you want NaN.
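As a further sketch that avoids both the loop and the concat, DataFrame.assign can add all n columns in one call. The col0 … col4 naming follows the second answer; the shortened three-row frame is an assumption made to keep the example small.

```python
import pandas as pd

df = pd.DataFrame({
    "term": ["analys", "applic", "architectur"],
    "newValue": [0.810419, 0.631963, 0.687348],
}).set_index("term")

n = 5
new_cols = [f"col{i}" for i in range(n)]

# assign() returns a new frame with one empty-string column per name.
df = df.assign(**{name: "" for name in new_cols})

print(df)
```

The existing newValue column and the term index are left unchanged; only the n new columns are appended on the right.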
