create new dataframe based upon max value in one column and corresponding value in a second column - python-3.x

I have a dataframe created by extracting data from a source (network wireless controller).
Dataframe is created off of a dictionary I build. This is basically what I am doing (a sample to show structure - not the actual dataframe):
df = pd.DataFrame({'AP-1': [30, 32, 34, 31, 33, 35, 36, 38, 37],
                   'AP-2': [30, 32, 34, 80, 33, 35, 36, 38, 37],
                   'AP-3': [30, 32, 81, 31, 33, 101, 36, 38, 37],
                   'AP-4': [30, 32, 34, 95, 33, 35, 103, 38, 121],
                   'AP-5': [30, 32, 34, 31, 33, 144, 36, 38, 37],
                   'AP-6': [30, 32, 34, 31, 33, 35, 36, 110, 37],
                   'AP-7': [30, 87, 34, 31, 111, 35, 36, 38, 122],
                   'AP-8': [30, 32, 99, 31, 33, 35, 36, 38, 37],
                   'AP-9': [30, 32, 34, 31, 33, 99, 88, 38, 37]},
                  index=['1', '2', '3', '4', '5', '6', '7', '8', '9'])
df1 = df.transpose()
This works fine.
A note about the data: columns 1, 2, 3 are related and go together. The same holds for columns 4, 5, 6 and for columns 7, 8, 9. I will explain more shortly.
Columns 1, 4, 7 are client counts. Columns 2, 5, 8 are channel utilization on the 5 GHz spectrum. Columns 3, 6, 9 are channel utilization on the 2.4 GHz spectrum.
I take a reading every 5 minutes, so the sample above represents three readings taken 5 minutes apart.
What I want is two new dataframes, two columns each, constructed as follows:
Examine the 5 GHz columns (here 2, 5, 8). Whichever has the highest value becomes column 1 in the new dataframe. Column 2 is the value of the client count column related to that 5 GHz column. In other words, if column 2 is the highest of columns 2, 5, 8, then the second column of the new dataframe should hold the value from column 1. If column 8 is the highest, I want to pull the value from column 7 instead. The index of the new dataframes should be the same as in the original: the AP name.
I want to do this for every row in the 'main' dataframe. I want two new dataframes, so I will repeat this exact procedure for the 5 GHz columns and for the 2.4 GHz columns (3, 6, 9), in each case also grabbing the client count that corresponds to the highest value for the second column of the new dataframe.
What I have tried:
First I broke the main (transposed) dataframe into three: df_cc has all the client count columns, df_5Ghz has the 5 GHz columns, and df_24Ghz has the 2.4 GHz columns, using this:
# create client count only dataframe
df_cc = df1[df1.columns[::3]]
print(df_cc)
print()
# create 5Ghz channel utilization only dataframe
df_5Ghz = df1[df1.columns[1::3]]
print(df_5Ghz)
print()
# create 2.4Ghz channel utilization only dataframe
df_24Ghz = df1[df1.columns[2::3]]
print(df_24Ghz)
print()
This works.
I thought I could then reference the main dataframe, but I don't know how.
Then I found this:
extract column value based on another column pandas dataframe
The query option looked great, but I don't know the value in advance. I first need to discover the max value of the 2.4 and 5 GHz columns respectively, then grab the corresponding client count value. That is why I first created dataframes containing only the 2.4 and 5 GHz values, thinking I could get the max value of each row and then do a lookup on the main dataframe (or on the client count only dataframe I created), but I do not know how to implement this idea.
Any assistance would be greatly appreciated.

You can get what you want in 3 steps:
# connection between columns
mapping = {'2': '1', '5': '4', '8': '7'}
# 1. column with highest value among 5GHz values (pandas series)
df2 = df1.loc[:, ['2', '5', '8']].idxmax(axis=1)
df2.name = 'highest value'
# 2. column with client count corresponding to the highest value (pandas series)
df3 = df2.apply(lambda x: mapping[x])
df3.name = 'client count'
# 3. build result using 2 lists of columns (pandas dataframe)
df4 = pd.DataFrame(
    {s.name: [
        df1.loc[idx, col]
        for idx, col in zip(s.index, s.values)]
     for s in [df2, df3]},
    index=df1.index)
print(df4)
Output:
      highest value  client count
AP-1             38            36
AP-2             38            36
AP-3             38            36
AP-4             38           103
AP-5             38            36
AP-6            110            36
AP-7            111            31
AP-8             38            36
AP-9             38            88
I suspect, though I am not sure, that this would be easier to solve (and faster to compute) without pandas, using only built-in Python data types such as dictionaries and lists.
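The same three steps can also be sketched in a vectorized form that covers both the 5 GHz and the 2.4 GHz case. This is only a sketch, not the method used above: the helper name peak_utilization and the result names df_5ghz and df_24ghz are made up for illustration, and it assumes the column layout described in the question (client counts in '1', '4', '7', paired in order with the utilization columns).

```python
import numpy as np
import pandas as pd

# Sample data from the question, transposed so that each row is one AP.
df = pd.DataFrame({'AP-1': [30, 32, 34, 31, 33, 35, 36, 38, 37],
                   'AP-2': [30, 32, 34, 80, 33, 35, 36, 38, 37],
                   'AP-3': [30, 32, 81, 31, 33, 101, 36, 38, 37],
                   'AP-4': [30, 32, 34, 95, 33, 35, 103, 38, 121],
                   'AP-5': [30, 32, 34, 31, 33, 144, 36, 38, 37],
                   'AP-6': [30, 32, 34, 31, 33, 35, 36, 110, 37],
                   'AP-7': [30, 87, 34, 31, 111, 35, 36, 38, 122],
                   'AP-8': [30, 32, 99, 31, 33, 35, 36, 38, 37],
                   'AP-9': [30, 32, 34, 31, 33, 99, 88, 38, 37]},
                  index=['1', '2', '3', '4', '5', '6', '7', '8', '9'])
df1 = df.transpose()


def peak_utilization(frame, util_cols, count_cols):
    """Hypothetical helper: per row, the highest utilization and its paired client count.

    util_cols[i] and count_cols[i] are assumed to belong to the same reading.
    """
    util = frame[util_cols].to_numpy()
    counts = frame[count_cols].to_numpy()
    best = util.argmax(axis=1)          # position of the row maximum within util_cols
    rows = np.arange(len(frame))
    return pd.DataFrame(
        {'highest value': util[rows, best], 'client count': counts[rows, best]},
        index=frame.index,
    )


df_5ghz = peak_utilization(df1, ['2', '5', '8'], ['1', '4', '7'])
df_24ghz = peak_utilization(df1, ['3', '6', '9'], ['1', '4', '7'])
print(df_5ghz)
print(df_24ghz)
```

When two readings tie for the maximum, argmax picks the first one, which matches the behaviour of idxmax in the steps above.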

Related

In Excel, how can I sort by the first value when there are multiple values in the cell?

I have an automatically generated spreadsheet in Excel. The values of one column are:
1, 184
10, 18, 90
102, 207
11, 13
2
20, 50
204
3, 120
(all comma-separated values in a single column, as above)
What I need to do is sort by the first number, so that the above would be:
1, 184
2
3, 120
10, 18, 90
11, 13
20, 50
102, 207
204
How can I do this in Excel?
Excel 365 current channel:
=SORTBY(A1:A8, NUMBERVALUE(TEXTBEFORE(A1:A8,",",,,,A1:A8)))
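For readers who want to see the sort key spelled out, here is a minimal Python sketch of the same idea: take the text before the first comma (or the whole cell if there is no comma), convert it to a number, and sort by that. The names values and first_number are illustrative only and are not part of the Excel answer.

```python
# Cell contents from the question, in their original order.
values = ["1, 184", "10, 18, 90", "102, 207", "11, 13",
          "2", "20, 50", "204", "3, 120"]


def first_number(cell):
    """Return the number before the first comma, or the whole cell if there is none."""
    return int(cell.split(",")[0])


sorted_values = sorted(values, key=first_number)
for cell in sorted_values:
    print(cell)
```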

How to format the csv file with df.to_csv for a multiindex dataframe, python3

I have a multi-indexed dataframe,
>>> df
       a1      a2
       b1  b2  b1  b2
c1 d1  11  21  31  41
   d2  12  22  32  42
c2 d1  13  23  33  43
   d2  14  24  34  44
It has 2 levels of header and 2 levels of index. If I directly use the code df.to_csv('test_file.csv'), then the format of the file test_file.csv is
,,a1,a1,a2,a2
,,b1,b2,b1,b2
c1,d1,11,21,31,41
c1,d2,12,22,32,42
c2,d1,13,23,33,43
c2,d2,14,24,34,44
However, I would like to change it to
remove the duplicates in the 1st level of header
remove entire 1st level of index, and make an empty row for each one in the 1st level of index.
The wanted format is:
,a1,,a2,
,b1,b2,b1,b2
c1,,,,,
d1,11,21,31,41
d2,12,22,32,42
c2,,,,,
d1,13,23,33,43
d2,14,24,34,44
Could you please show me how to do it? Thanks!
Please use the code below.
import pandas as pd
df = pd.DataFrame(
    {
        ('a1', 'b1'): [11, 12, 13, 14],
        ('a1', 'b2'): [21, 22, 23, 24],
        ('a2', 'b1'): [31, 32, 33, 34],
        ('a2', 'b2'): [41, 42, 43, 44],
    },
    index=pd.MultiIndex.from_tuples([
        ('c1', 'd1'),
        ('c1', 'd2'),
        ('c2', 'd1'),
        ('c2', 'd2'),
    ]),
)
print(df)
df.to_csv('my_test_file.csv')
Here is a working solution. It uses a helper function to blank out consecutive duplicated labels, and groupby + apply + pandas.concat to move the first index level into an extra empty row:
def remove_consecutive(l):
    '''replaces consecutive repeats of an item in "l" with empty strings'''
    from itertools import groupby, chain
    return tuple(chain(*([k]+['']*(len(list(g))-1) for k,g in groupby(l))))

(df.groupby(level=0)
   # below to shift level=0 as new row
   .apply(lambda g: pd.concat([pd.DataFrame([],
                                            index=[g.name],
                                            columns=g.columns),
                               g.droplevel(0)]))
   .droplevel(0)
   # below to remove the duplicate column names
   .T # unfortunately there is no set_index equivalent for columns, so transpose before/after
   .set_index(pd.MultiIndex.from_arrays(list(map(remove_consecutive, zip(*df.columns)))))
   .T
   # export
   .to_csv('filename.csv')
)
output:
,a1,,a2,
,b1,b2,b1,b2
c1,,,,
d1,11,21,31,41
d2,12,22,32,42
c2,,,,
d1,13,23,33,43
d2,14,24,34,44
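An alternative is to write the rows by hand with the standard csv module, which avoids reshaping the DataFrame. The following is a sketch under assumptions, not the approach used above: the helper names blank_repeats and write_grouped_csv are hypothetical, and it assumes exactly two index levels, as in the question. It writes into an in-memory buffer so the result is easy to inspect.

```python
import csv
import io

import pandas as pd

df = pd.DataFrame(
    {
        ('a1', 'b1'): [11, 12, 13, 14],
        ('a1', 'b2'): [21, 22, 23, 24],
        ('a2', 'b1'): [31, 32, 33, 34],
        ('a2', 'b2'): [41, 42, 43, 44],
    },
    index=pd.MultiIndex.from_tuples([
        ('c1', 'd1'), ('c1', 'd2'), ('c2', 'd1'), ('c2', 'd2'),
    ]),
)


def blank_repeats(labels):
    """Replace a label with '' when it equals the label immediately before it."""
    out, prev = [], object()
    for label in labels:
        out.append('' if label == prev else label)
        prev = label
    return out


def write_grouped_csv(frame, handle):
    """Write header rows, then one label-only row per outer index value and its inner rows."""
    writer = csv.writer(handle, lineterminator='\n')
    for level in range(frame.columns.nlevels):
        writer.writerow([''] + blank_repeats(frame.columns.get_level_values(level)))
    for outer, group in frame.groupby(level=0, sort=False):
        writer.writerow([outer] + [''] * frame.shape[1])
        for (_, inner), row in group.iterrows():
            writer.writerow([inner] + row.tolist())


buffer = io.StringIO()
write_grouped_csv(df, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a file instead, pass a handle opened with open('filename.csv', 'w', newline='') in place of the buffer.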

Converting a List of Pandas Series to a single Pandas DataFrame

I am using statsmodels.api on my data set. I have a list of pandas Series. Each Series holds key-value pairs: the keys are column names and the values are the data. All of the Series in the list have the same keys, so the column names are repeated across the list. I want to save all of the values from the list into a single DataFrame whose column names are the keys of the Series, so that I can export the DataFrame as a CSV. How can I use the keys as the column names of the DataFrame and have the values fill in the rows?
Each series in the list returns something like this:
index 0 of the list: <class 'pandas.core.series.Series'>
height 23
weight 10
size 45
amount 9
index 1 of the list: <class 'pandas.core.series.Series'>
height 11
weight 99
size 25
amount 410
index 2 of the list: <class 'pandas.core.series.Series'>
height 3
weight 0
size 115
amount 92
I would like to be able to read a dataframe such that these values are saved as the following:
DataFrame:
height  weight  size  amount
    23      10    45       9
    11      99    25     410
     3       0   115      92
pd.DataFrame(data=your_list_of_series)
When creating a new DataFrame, pandas will accept a list of series for the data argument. The indices of your series will become the column names of the DataFrame.
Not the most efficient way, but this does the trick:
import pandas as pd
series_list = [
    pd.Series({'height': 23,
               'weight': 10,
               'size': 45,
               'amount': 9}),
    pd.Series({'height': 11,
               'weight': 99,
               'size': 25,
               'amount': 410}),
    pd.Series({'height': 3,
               'weight': 0,
               'size': 115,
               'amount': 92}),
]
pd.DataFrame([series.to_dict() for series in series_list])
Did you try just calling pd.DataFrame() on the list of series? That should just work.
import pandas as pd
series_list = [
    pd.Series({
        'height': 23,
        'weight': 10,
        'size': 45,
        'amount': 9
    }),
    pd.Series({
        'height': 11,
        'weight': 99,
        'size': 25,
        'amount': 410
    }),
    pd.Series({
        'height': 3,
        'weight': 0,
        'size': 115,
        'amount': 92
    })
]
df = pd.DataFrame(series_list)
print(df)
df.to_csv('path/to/save/foo.csv')
Output:
   height  weight  size  amount
0      23      10    45       9
1      11      99    25     410
2       3       0   115      92

How to return the number of values greater than X with multiple criteria

I am looking for a formula that returns the count of values greater than 20 after applying two criteria.
I have a table with 3 fields:
Field A: 18, 18, 19, 19, 21, 21, 44, 55, 55, 56, 61, 61, 75, 76, 86
Field B: 1, 4, 1, 5, 1, 6, 3, 1, 2, 1, 1, 3, 1, 1, 1
Field C: 5, 2, 14, 7, 38, 1, 100, 76, 32, 65, 83, 20, 17, 41, 88
I have two criteria:
Criteria1: 18, 55, 61, 75, 86 (this is an array)
Criteria2: 1
Steps:
Step 1 - Apply Criteria_1 to Field_A
Step 2 - Apply Criteria_2 to Field_B
Step 3 - Return the number of values in Field_C greater than 20
Regards,
Elio Fernandes
=SUM(ISNUMBER(MATCH(A1:A15, {18,55,61,75,86}, 0)) * (B1:B15 = 1) * (C1:C15 > 20))
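Enter it as an array formula with Ctrl+Shift+Enter.
This works because multiplying the conditions coerces TRUE to 1 and FALSE to 0, so each product is 1 only for rows that meet all three conditions.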
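The same multiply-the-conditions logic can be checked outside Excel. Below is a minimal Python sketch using the sample data from the question; the variable names are illustrative only. Python booleans also behave as 1 and 0 under multiplication, so the structure mirrors the formula term by term.

```python
field_a = [18, 18, 19, 19, 21, 21, 44, 55, 55, 56, 61, 61, 75, 76, 86]
field_b = [1, 4, 1, 5, 1, 6, 3, 1, 2, 1, 1, 3, 1, 1, 1]
field_c = [5, 2, 14, 7, 38, 1, 100, 76, 32, 65, 83, 20, 17, 41, 88]

criteria_1 = {18, 55, 61, 75, 86}   # allowed values for Field A
criteria_2 = 1                      # required value for Field B
threshold = 20                      # Field C must be strictly greater than this

# Each product is 1 only when all three conditions hold for that row.
count = sum(
    (a in criteria_1) * (b == criteria_2) * (c > threshold)
    for a, b, c in zip(field_a, field_b, field_c)
)
print(count)
```

For this sample the rows that pass all three conditions are (55, 1, 76), (61, 1, 83) and (86, 1, 88).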

combine several lines in a CSV file into a single line based on a certain condition

I am trying to read a CSV file in this format
COL1, COL2
5, 25
5, 67
5, 89
3, 55
3, 8
3, 109
3, 12
3, 45
3, 663
80, 34
80, 5
and combine COL2 for all entries having the same COL1 in a single line such that the first column indicates the number of columns that follow. So for the sample given above, the output file should look like this:
3, 25, 67, 89
6, 55, 8, 109, 12, 45, 663
2, 34, 5
A solution using awk:
$ awk 'NR>1{a[$1]=a[$1]", "$2;c[$1]++}END{for (k in a) print c[k] a[k]}' file
3, 25, 67, 89
6, 55, 8, 109, 12, 45, 663
2, 34, 5
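Note that awk does not guarantee the iteration order of for (k in a), so the output lines may come out in a different order than in the input. If input order matters, one option is a short Python sketch like the following, which relies on dictionaries preserving insertion order. The function name combine_rows and the inline sample are assumptions made for illustration; like the awk version, it groups by COL1 regardless of whether equal keys are adjacent.

```python
import csv
import io

sample = """COL1, COL2
5, 25
5, 67
5, 89
3, 55
3, 8
3, 109
3, 12
3, 45
3, 663
80, 34
80, 5
"""


def combine_rows(lines):
    """Group COL2 values by COL1 and prefix each group with its size."""
    groups = {}
    reader = csv.reader(lines, skipinitialspace=True)
    next(reader)                      # skip the header row
    for row in reader:
        if not row:                   # ignore blank lines
            continue
        key, value = row
        groups.setdefault(key, []).append(value)
    return [", ".join([str(len(values))] + values) for values in groups.values()]


result = combine_rows(io.StringIO(sample))
for line in result:
    print(line)
```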
