Printing Columns with a correlation greater than 80% - python-3.x

I have a pandas dataframe with 235607 records and 94 attributes. I am very new to Python. I was able to create a correlation matrix between all of the attributes, but it is a lot to look through individually. I tried writing a for loop to print a list of the columns with a correlation greater than 80%, but I keep getting the error "'DataFrame' object has no attribute 'c1'".
This is the code I used to create the correlation between the attributes, as well as the sample for loop. Thank you in advance for your help:
corr = data.corr() # data is the pandas dataframe
c1 = corr.abs().unstack()
c1.sort_values(ascending = False)
drop = [cols for cols in upper.c1 if any (upper[c1] > 0.80)]
drop

Sort in place if you need to keep using the same variable c1, then grab the column-name pairs with a list comprehension over the index:
c1.sort_values(ascending=True, inplace=True)
columns_above_80 = [(col1, col2) for col1, col2 in c1.index if c1[col1,col2] > 0.8 and col1 != col2]
Edit: Added col1 != col2 in the list comprehension so you don't grab the self-correlations.
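Put together, a minimal sketch of the whole pipeline could look like this (assuming data is the dataframe from the question and keeping the 0.8 threshold; note that each pair will show up in both orders):
corr = data.corr()
c1 = corr.abs().unstack()          # Series indexed by (column, column) pairs
c1.sort_values(ascending=False, inplace=True)
# pairs of distinct columns whose absolute correlation exceeds 0.8
columns_above_80 = [(col1, col2) for col1, col2 in c1.index
                    if c1[col1, col2] > 0.8 and col1 != col2]
print(columns_above_80)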

You can simply use numpy.where like this:
mask = corr.mask(np.eye(len(corr), dtype=bool))  # drop the 1.0 self-correlation diagonal
corr.columns[np.where((mask > 0.8).any())[0]]
The output is the list of column names that have at least one correlation value greater than 0.8 with another column.
EDIT: I hope this will work. I edited the code above a little.

Related

Sort values in a dataframe by a column and take second one only if equal

I've created a dataframe with random values using the following code:
import pandas as pd
from numpy.random import random
values = random(5)
values_1 = random(5)
col1 = list(values / values.sum())
col2 = list(values_1)
df = pd.DataFrame({'col1': col1, 'col2': col2})
df.sort_values(by=['col2','col1'],ascending=[False,False]).reset_index(inplace=True)
The dataframe created in my case looks like this:
As you can see, the dataframe is not sorted in descending order by 'col2'. What I want is for it to sort by 'col2' first and, if any two rows have the same value for 'col2', to sort by 'col1' as well. Any suggestions? Any help would be appreciated.
Your solution almost works, but because reset_index is called with inplace=True on the temporary output of sort_values, that copy is modified in place (returning None) and the result is never assigned back, so df itself does not change.
A possible solution is to add ignore_index=True to sort_values, so reset_index is not necessary:
np.random.seed(2022)
df = pd.DataFrame({'col1':np.random.random(5), 'col2':np.random.random(5)})
df = df.sort_values(by=['col2','col1'],ascending=False, ignore_index=True)
print (df)
col1 col2
0 0.499058 0.897657
1 0.049974 0.896963
2 0.685408 0.721135
3 0.113384 0.647452
4 0.009359 0.486988
Or, if you want to use inplace, add it only to sort_values, together with ignore_index=True:
df.sort_values(by=['col2','col1'],ascending=False, ignore_index=True,inplace=True)
print (df)
col1 col2
0 0.499058 0.897657
1 0.049974 0.896963
2 0.685408 0.721135
3 0.113384 0.647452
4 0.009359 0.486988
Your logic is correct but you've missed an inplace=True inside sort_values. Due to this, the sorting does not actually take place in your dataframe. Replace it with this:
df.sort_values(by=['col2','col1'],ascending=[False,False],inplace=True)
df.reset_index(inplace=True,drop=True)
You also want the sort to be done with inplace=True, not only the reset_index().

Grouping by columns returns all groups regardless of prior filtering of values in a pandas dataframe?

Concept question:
I have a large dataframe (df) with 4M rows and 15 columns (col1 - col14 and Flag).
Please advise why the following is happening:
I am selecting the subset of interest:
temp = df.loc[(df['col1'].eq('a') & df['col13'].eq(113)), ['col1', 'col13', 'Flag']]
I then group by and take the mean of the Flag column, but I get all the col1-col13 groups, with NaN for the Flag "mean" (Flag here is a 0 or 1 integer):
temp.groupby(['col1', 'col13'])['Flag'].mean()
It seems that the pandas groupby ignores my filtering and returns all possible groups. Why is that?
Note, when I do the same for col1 and Flag, everything works perfectly.
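One common cause (an assumption here, since the dtypes aren't shown) is that col1 or col13 is a categorical column: by default pandas groupby builds a group for every category combination, not just the combinations left after filtering, and the empty groups show up with a NaN mean. Passing observed=True restricts the result to combinations that actually occur, for example:
# a minimal sketch: observed=True keeps only the (col1, col13) combinations
# that are actually present in the filtered frame
temp.groupby(['col1', 'col13'], observed=True)['Flag'].mean()
Alternatively, casting the filtered key columns back to a plain object/string dtype has the same effect.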

Get only rows of dataframe where a subset of columns exist in another dataframe

I want to get all rows of a dataframe (df2) where the city column value and postcode column value also exist in another dataframe (df1).
The important point here is that I want to match the combination of both columns, not each column individually.
My approach was this:
import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
#1. Get all (Ort, Postleitzahl) combinations from df1 and broadcast them
df_combinations = np.array(df1.select("Ort", "Postleitzahl").dropDuplicates().collect())
sc.broadcast(df_combinations)
#2. Define the udf that checks whether a (city, postcode) pair is among them
def combination_in_vx(ort, plz):
    for arr_el in df_combinations:
        if str(arr_el[0]) == ort and int(arr_el[1]) == plz:
            return True
    return False
combination_in_vx = udf(combination_in_vx, BooleanType())
#3. Flag each row of df2 and filter on the flag
df_tmp = df2.withColumn("Combination_Exists", combination_in_vx('city', 'postcode'))
df_result = df_tmp.filter(df_tmp.Combination_Exists)
Although this should theoretically work, it takes forever!
Does anybody know of a better solution here? Thank you very much!
You can do a left semi join on the two columns. This keeps only the rows in df2 whose combination of values in the two specified columns also exists in df1:
import pyspark.sql.functions as F
df_result = df2.join(df1, ["Ort", "Postleitzahl"], 'left_semi')
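Note that this form assumes df2 uses the same column names as df1 ("Ort" and "Postleitzahl"). If df2 really keeps the names 'city' and 'postcode' from the question, a hedged sketch of the same semi join with an explicit condition would be:
# hypothetical column names taken from the question: df2 has 'city'/'postcode',
# df1 has 'Ort'/'Postleitzahl'
df_result = df2.join(
    df1,
    (df2["city"] == df1["Ort"]) & (df2["postcode"] == df1["Postleitzahl"]),
    "left_semi",
)
Either way, the semi join avoids collecting df1 to the driver and avoids the Python UDF, which is why it scales much better.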

Pandas dataframe manipulation with conditions

How can I iterate through a pandas DataFrame column and fill null values with input from another column within the same dataframe?
My objective is to fill the NA values in column y with the corresponding values in column z.
It's best to avoid iterating through dataframes when the task can be accomplished with vectorised expressions. Something like this should work, although it may need to be massaged a little for your specific case.
# Set your dataframe
df = ...
# Gets a boolean vector for positions where you have na in column b
nulls_in_b = df["b"].isna()
# Set the places where it's null to the corresponding values from column c
df.loc[nulls_in_b, "b"] = df.loc[nulls_in_b, "c"]

Selecting a column not in cube in Spark

I have a dataframe which has, say, 3 columns: x, y and z.
I want all three columns in the result, but I do not want to cube on column z.
Is there a way I can do it?
P.S. - (I have just given an example with 3 columns, but I have quite a long list of columns, so GROUPING SETS is not an option).
Example -
val df = Seq(("1","x","a"),("1","v","b"),("3","x","c")).toDF("col1","col2","col3")
val list = Seq("col1","col2").map(e=>col(e))
// now I want to select col3 non-cubed (basically I do not want to get the combinations for it)
// This will not select col3 at all, since col3 is not part of the cube, but having col3 in the result is what I want to achieve
display(df.select($"col1",$"col2",$"col3").cube(list:_*).agg(sum("col1")))
Cube is an extension of groupBy in which you get the aggregated result for the various combinations of the columns used to group by.
Here is an example of what you can achieve by cubing only col1 and col2 and carrying col3 through an aggregation:
df.cube($"col1",$"col2").agg(first($"col3").as("col3")).show
Please share your expected result as suggested by Shaido.
