I am trying to read a CSV file in this format
COL1, COL2
5, 25
5, 67
5, 89
3, 55
3, 8
3, 109
3, 12
3, 45
3, 663
80, 34
80, 5
and combine the COL2 values for all entries sharing the same COL1 into a single line, such that the first column indicates the number of values that follow. So for the sample given above, the output file should look like this:
3, 25, 67, 89
6, 55, 8, 109, 12, 45, 663
2, 34, 5
A solution using awk (note that for (k in a) iterates in an unspecified order, so the output lines may not come out in input order):
$ awk 'NR>1{a[$1]=a[$1]", "$2;c[$1]++}END{for (k in a) print c[k] a[k]}' file
3, 25, 67, 89
6, 55, 8, 109, 12, 45, 663
2, 34, 5
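The same grouping can also be sketched in Python (the file contents are inlined here for self-containment; a plain dict preserves first-seen key order in Python 3.7+):

```python
import io

data = """COL1, COL2
5, 25
5, 67
5, 89
3, 55
3, 8
3, 109
3, 12
3, 45
3, 663
80, 34
80, 5
"""

# Group COL2 values by COL1, preserving first-seen key order
groups = {}
f = io.StringIO(data)
next(f)  # skip the header line
for line in f:
    col1, col2 = (s.strip() for s in line.split(","))
    groups.setdefault(col1, []).append(col2)

# Prepend the count of values to each group's line
lines = [", ".join([str(len(v))] + v) for v in groups.values()]
print("\n".join(lines))
```

To read from an actual file, replace `io.StringIO(data)` with `open("file")`.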
I have an automatically generated spreadsheet in Excel. The values of one column are:
1, 184
10, 18, 90
102, 207
11, 13
2
20, 50
204
3, 120
(all comma-separated values in a single column, as shown above)
What I need to do is sort by the first number, so that the above would be:
1, 184
2
3, 120
10, 18, 90
11, 13
20, 50
102, 207
204
How can I do this in Excel?
Excel 365 current channel:
=SORTBY(A1:A8, NUMBERVALUE(TEXTBEFORE(A1:A8,",",,,,A1:A8)))
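Outside Excel, the same sort can be sketched in Python for comparison (assuming each cell is a plain string; a cell with no comma sorts by its whole value, mirroring the formula's fallback argument):

```python
rows = ["1, 184", "10, 18, 90", "102, 207", "11, 13",
        "2", "20, 50", "204", "3, 120"]

# Sort by the integer before the first comma;
# split(",")[0] returns the whole string when there is no comma
rows_sorted = sorted(rows, key=lambda s: int(s.split(",")[0]))
print(rows_sorted)
```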
I have this data frame and given the data for each columns:
index = [1, 2, 3, 4, 5, 6, 7]
a = [1247, 1247, 1247, 1247, 1539, 1539, 1539]
b = ['Group_A', 'Group_A', 'Group_B', 'Group_B', 'Group_B', 'Group_B', 'Group_A']
c = [np.nan, 23, 30, 27, 18, 42, 40]
d = [50, 51, 67, np.nan, 44, 37, 49]
df = pd.DataFrame({'ID': a, 'Group': b, 'Unit_sold_1': c, 'Unit_sold_2':d})
If I want to sum the Unit_sold columns for each ID, I can use this code:
df.groupby(df['ID']).agg({'Unit_sold_1':'sum', 'Unit_sold_2':'sum'})
But how should I code it if I want to group by ID and then by Group? The result should look like this:
ID Group_A_sold_1 Group_B_sold_1 Group_A_sold_2 Group_B_sold_2
0 1247 23 57 101 67
1 1539 40 60 49 81
Do it with pivot_table, then merge the column levels:
s=df.pivot_table(index='ID',columns='Group',values=['Unit_sold_1','Unit_sold_2'],aggfunc='sum')
s.columns=s.columns.map('_'.join)
s.reset_index(inplace=True)
Unit_sold_1_Group_A ... Unit_sold_2_Group_B
ID ...
1247 23.0 ... 67.0
1539 40.0 ... 81.0
[2 rows x 4 columns]
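An equivalent route, sketched here for comparison, groups by both keys and then unstacks the Group level; it produces the same four columns as the pivot_table answer:

```python
import numpy as np
import pandas as pd

# Rebuild the question's frame
df = pd.DataFrame({
    'ID': [1247, 1247, 1247, 1247, 1539, 1539, 1539],
    'Group': ['Group_A', 'Group_A', 'Group_B', 'Group_B',
              'Group_B', 'Group_B', 'Group_A'],
    'Unit_sold_1': [np.nan, 23, 30, 27, 18, 42, 40],
    'Unit_sold_2': [50, 51, 67, np.nan, 44, 37, 49],
})

# Sum per (ID, Group), then pivot Group into the column axis
s = df.groupby(['ID', 'Group'])[['Unit_sold_1', 'Unit_sold_2']].sum().unstack('Group')
# Flatten the resulting MultiIndex columns, e.g. ('Unit_sold_1', 'Group_A') -> 'Unit_sold_1_Group_A'
s.columns = [f'{val}_{grp}' for val, grp in s.columns]
s = s.reset_index()
print(s)
```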
#Creating DataFrame
df=pd.DataFrame({'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]}); df
output:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
aValue = 43.0
df.loc[(df.CCC-aValue).abs().argsort()]
output:
AAA BBB CCC
1 5 20 50
0 4 10 100
2 6 30 -30
3 7 40 -50
The output is confusing. Can you please explain in detail how the line below works?
df.loc[(df.CCC-aValue).abs().argsort()]
With abs flipping negative values and the subtraction shifting values around, it's hard to visualize what's going on. Instead, let's calculate it step by step:
In [97]: x = np.array([100,50,-30,-50])
In [98]: x-43
Out[98]: array([ 57, 7, -73, -93])
In [99]: abs(x-43)
Out[99]: array([57, 7, 73, 93])
In [100]: np.argsort(abs(x-43))
Out[100]: array([1, 0, 2, 3])
In [101]: x[np.argsort(abs(x-43))]
Out[101]: array([ 50, 100, -30, -50])
argsort is the indexing that puts the elements in sorted order. We can see that with:
In [104]: Out[99][Out[100]]
Out[104]: array([ 7, 57, 73, 93])
or
In [105]: np.array([57, 7, 73, 93])[[1, 0, 2, 3]]
Out[105]: array([ 7, 57, 73, 93])
How they work together is determined by the Python syntax; that's straightforward.
(df.CCC-aValue).abs() takes the absolute value of df.CCC-aValue, argsort sorts those values and returns the sorted (positional) indexes, and df.loc shows the rows in that sorted order.
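Putting the steps back together on the DataFrame itself (a small sketch; note that .loc only works here because the default RangeIndex makes labels equal to positions):

```python
import pandas as pd

df = pd.DataFrame({'AAA': [4, 5, 6, 7],
                   'BBB': [10, 20, 30, 40],
                   'CCC': [100, 50, -30, -50]})
aValue = 43.0

# positions of rows, ordered by distance of CCC from aValue
order = (df.CCC - aValue).abs().argsort()
result = df.loc[order]
print(result)
```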
I have a dataframe created by extracting data from a source (network wireless controller).
Dataframe is created off of a dictionary I build. This is basically what I am doing (a sample to show structure - not the actual dataframe):
df = pd.DataFrame({'AP-1': [30, 32, 34, 31, 33, 35, 36, 38, 37],
'AP-2': [30, 32, 34, 80, 33, 35, 36, 38, 37],
'AP-3': [30, 32, 81, 31, 33, 101, 36, 38, 37],
'AP-4': [30, 32, 34, 95, 33, 35, 103, 38, 121],
'AP-5': [30, 32, 34, 31, 33, 144, 36, 38, 37],
'AP-6': [30, 32, 34, 31, 33, 35, 36, 110, 37],
'AP-7': [30, 87, 34, 31, 111, 35, 36, 38, 122],
'AP-8': [30, 32, 99, 31, 33, 35, 36, 38, 37],
'AP-9': [30, 32, 34, 31, 33, 99, 88, 38, 37]}, index=['1', '2', '3', '4', '5', '6', '7', '8', '9'])
df1 = df.transpose()
This works fine.
Note about the data. Columns 1,2,3 are 'related'. They go together. Same for columns 4,5,6 and 7,8,9. I will explain more shortly.
Columns 1, 4, 7 are client count. Columns 2, 5, 8 are channel util on the 5 Ghz spectrum. Columns 3, 6, 9 are channel util on the 2.4 Ghz spectrum.
Basically I take a reading at 5 minute intervals. The above would represent three readings at 5 minute intervals.
What I want is two new dataframes, two columns each, constructed as follows:
Examine the 5 GHz columns (here columns 2, 5, 8). Whichever has the highest value becomes column 1 in the new dataframe. Column 2 would be the value of the client count column related to that 5 GHz column. In other words, if column 2 were the highest of columns 2, 5, 8, then I want the value in column 1 as the second value in the new dataframe; if column 8 were highest, then I want to pull the value in column 7. I want the index in the new dataframes to be the same as the original -- the AP name.
I want to do this for all rows in the 'main' dataframe, producing two new dataframes: I will repeat this exact procedure for the 5 GHz columns and for the 2.4 GHz columns (3, 6, 9), in each case also grabbing the corresponding highest client count value for the second column of the new dataframe.
What I have tried:
First I broke the main dataframe into three: df_cc has all the client count columns, df_5Ghz has the 5 GHz info, and df_24Ghz has the 2.4 GHz info, using this:
# create client count only dataframe
df_cc = df[df.columns[::3]]
print(df_cc)
print()
# create 5Ghz channel utilization only dataframe
df_5Ghz = df[df.columns[1::3]]
print(df_5Ghz)
print()
# create 2.4Ghz channel utilization only dataframe
df_24Ghz = df[df.columns[2::3]]
print(df_24Ghz)
print()
This works.
I thought I could then reference the main dataframe, but I don't know how.
Then I found this:
extract column value based on another column pandas dataframe
The query option looked great, but I don't know the value. I need to first discover the max value of the 2.4 and 5 GHz columns respectively, then grab the corresponding client count value. That is why I first created dataframes containing only the 2.4 and 5 GHz values, thinking I could get the max value of each row and then do a lookup on the main dataframe (or on the client-count-only dataframe I created), but I just do not know how to realize this idea.
Any assistance would be greatly appreciated.
You can get what you want in 3 steps:
# connection between columns
mapping = {'2': '1', '5': '4', '8': '7'}
# 1. column with highest value among 5GHz values (pandas series)
df2 = df1.loc[:, ['2', '5', '8']].idxmax(axis=1)
df2.name = 'highest value'
# 2. column with client count corresponding to the highest value (pandas series)
df3 = df2.apply(lambda x: mapping[x])
df3.name = 'client count'
# 3. build result using 2 lists of columns (pandas dataframe)
df4 = pd.DataFrame(
    {s.name: [
        df1.loc[idx, col]
        for idx, col in zip(s.index, s.values)]
     for s in [df2, df3]},
    index=df1.index)
print(df4)
Output:
highest value client count
AP-1 38 36
AP-2 38 36
AP-3 38 36
AP-4 38 103
AP-5 38 36
AP-6 110 36
AP-7 111 31
AP-8 38 36
AP-9 38 88
Though I'm not sure, I guess it might be easier to solve the issue (and faster to compute) without pandas, using just built-in Python data types: dictionaries and lists.
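A more compact sketch of the same idea keeps everything on df1: idxmax finds the 5 GHz column with the highest value per row, and the mapping picks the matching client-count column (repeat with columns '3', '6', '9' and their mapping for the 2.4 GHz dataframe):

```python
import pandas as pd

df = pd.DataFrame({'AP-1': [30, 32, 34, 31, 33, 35, 36, 38, 37],
                   'AP-2': [30, 32, 34, 80, 33, 35, 36, 38, 37],
                   'AP-3': [30, 32, 81, 31, 33, 101, 36, 38, 37],
                   'AP-4': [30, 32, 34, 95, 33, 35, 103, 38, 121],
                   'AP-5': [30, 32, 34, 31, 33, 144, 36, 38, 37],
                   'AP-6': [30, 32, 34, 31, 33, 35, 36, 110, 37],
                   'AP-7': [30, 87, 34, 31, 111, 35, 36, 38, 122],
                   'AP-8': [30, 32, 99, 31, 33, 35, 36, 38, 37],
                   'AP-9': [30, 32, 34, 31, 33, 99, 88, 38, 37]},
                  index=['1', '2', '3', '4', '5', '6', '7', '8', '9'])
df1 = df.transpose()

util_cols = ['2', '5', '8']               # 5 GHz channel-utilization columns
mapping = {'2': '1', '5': '4', '8': '7'}  # utilization column -> client-count column

best = df1[util_cols].idxmax(axis=1)      # name of the column holding each row's maximum
result = pd.DataFrame({
    'highest value': df1[util_cols].max(axis=1),
    'client count': [df1.at[idx, mapping[col]] for idx, col in best.items()],
}, index=df1.index)
print(result)
```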
I am seeking a formula that returns the count of values greater than 20 after applying two criteria.
I have a table with 3 fields:
Field A: 18, 18, 19, 19, 21, 21, 44, 55, 55, 56, 61, 61, 75, 76, 86
Field B: 1, 4, 1, 5, 1, 6, 3, 1, 2, 1, 1, 3, 1, 1, 1
Field C: 5, 2, 14, 7, 38, 1, 100, 76, 32, 65, 83, 20, 17, 41, 88
I have two criteria:
Criteria1: 18, 55, 61, 75, 86 (this is an array)
Criteria2: 1
Steps:
Step 1 - Apply Criteria_1 to Field_A
Step 2 - Apply Criteria_2 to Field_B
Step 3 - Return number of values greater than 20
Regards,
Elio Fernandes
=SUM(ISNUMBER(MATCH(A1:A15, {18,55,61,75,86}, 0)) * (B1:B15 = 1) * (C1:C15 > 20))
Confirm with Ctrl+Shift+Enter (entered as an array formula).
This uses the property that TRUE counts as 1 and FALSE counts as 0.
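For comparison, the same three-step logic can be checked with a short Python sketch (Criteria1 as set membership, Criteria2 as an equality test, then the threshold):

```python
# Fields from the question
A = [18, 18, 19, 19, 21, 21, 44, 55, 55, 56, 61, 61, 75, 76, 86]
B = [1, 4, 1, 5, 1, 6, 3, 1, 2, 1, 1, 3, 1, 1, 1]
C = [5, 2, 14, 7, 38, 1, 100, 76, 32, 65, 83, 20, 17, 41, 88]

criteria1 = {18, 55, 61, 75, 86}

# Count rows where A is in criteria1, B equals 1, and C exceeds 20
count = sum(1 for a, b, c in zip(A, B, C)
            if a in criteria1 and b == 1 and c > 20)
print(count)
```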