combine several lines in a CSV file into a single line based on a certain condition

I am trying to read a CSV file in this format
COL1, COL2
5, 25
5, 67
5, 89
3, 55
3, 8
3, 109
3, 12
3, 45
3, 663
80, 34
80, 5
and combine COL2 for all entries having the same COL1 in a single line such that the first column indicates the number of columns that follow. So for the sample given above, the output file should look like this:
3, 25, 67, 89
6, 55, 8, 109, 12, 45, 663
2, 34, 5

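A solution using awk (the array o records each key the first time it is seen, because for (k in a) visits keys in an unspecified order):
$ awk 'NR>1{if(!($1 in c))o[++n]=$1;a[$1]=a[$1]", "$2;c[$1]++}END{for(i=1;i<=n;i++)print c[o[i]] a[o[i]]}' file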
3, 25, 67, 89
6, 55, 8, 109, 12, 45, 663
2, 34, 5
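
For comparison, the same grouping can be sketched in Python. This is only an illustration under the assumption that the file has a single header line and that rows sharing a COL1 value should be merged even when they are not adjacent; the function name combine_rows and the inline sample are made up for the example.

```python
import csv
import io


def combine_rows(lines):
    """Group COL2 values by COL1 and prefix each group with its size."""
    reader = csv.reader(lines, skipinitialspace=True)
    next(reader, None)  # skip the header line
    groups = {}  # dicts keep insertion order, so first-seen order is preserved
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        key, value = row[0].strip(), row[1].strip()
        groups.setdefault(key, []).append(value)
    return [", ".join([str(len(values))] + values) for values in groups.values()]


sample = """COL1, COL2
5, 25
5, 67
5, 89
3, 55
3, 8
3, 109
3, 12
3, 45
3, 663
80, 34
80, 5
"""

for line in combine_rows(io.StringIO(sample)):
    print(line)
```

To run it against a real file, an open file object can be passed in place of the io.StringIO wrapper.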

Related

In Excel, how can I sort by the first value when there are multiple values in the cell?

I have an automatically generated spreadsheet in Excel. The values of one column are:
1, 184
10, 18, 90
102, 207
11, 13
2
20, 50
204
3, 120
(all comma-separated values in a single column, as shown above)
What I need to do is sort by the first number, so that the above would be:
1, 184
2
3, 120
10, 18, 90
11, 13
20, 50
102, 207
204
How can I do this in Excel?

Excel 365 current channel:
=SORTBY(A1:A8, NUMBERVALUE(TEXTBEFORE(A1:A8,",",,,,A1:A8)))
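
The idea behind the formula (sort by the number that appears before the first comma) can be sketched outside Excel as well. The Python below is only an illustration of that sort key, assuming every cell starts with an integer; the helper name first_number is made up for the example.

```python
cells = ["1, 184", "10, 18, 90", "102, 207", "11, 13", "2", "20, 50", "204", "3, 120"]


def first_number(cell):
    """Return the integer before the first comma (or the whole cell if there is none)."""
    return int(cell.split(",", 1)[0])


sorted_cells = sorted(cells, key=first_number)
for cell in sorted_cells:
    print(cell)
```

Sorting on the numeric key rather than on the text is what places "2" before "10, 18, 90".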

Group by and aggregate pandas

I have this data frame, built from the following data for each column:
import numpy as np
import pandas as pd
index = [1, 2, 3, 4, 5, 6, 7]
a = [1247, 1247, 1247, 1247, 1539, 1539, 1539]
b = ['Group_A', 'Group_A', 'Group_B', 'Group_B', 'Group_B', 'Group_B', 'Group_A']
c = [np.nan, 23, 30, 27, 18, 42, 40]
d = [50, 51, 67, np.nan, 44, 37, 49]
df = pd.DataFrame({'ID': a, 'Group': b, 'Unit_sold_1': c, 'Unit_sold_2':d})
If I want to sum the units sold for each ID, I can use this code:
df.groupby(df['ID']).agg({'Unit_sold_1':'sum', 'Unit_sold_2':'sum'})
But what code should I use if I want to group by ID and then by Group? The result should look like this:
     ID  Group_A_sold_1  Group_B_sold_1  Group_A_sold_2  Group_B_sold_2
0  1247              23              57             101              67
1  1539              40              60              49              81

Use pivot_table, then join the two column levels into single column names:
s=df.pivot_table(index='ID',columns='Group',values=['Unit_sold_1','Unit_sold_2'],aggfunc='sum')
s.columns=s.columns.map('_'.join)
s.reset_index(inplace=True)
      Unit_sold_1_Group_A  ...  Unit_sold_2_Group_B
ID                         ...
1247                 23.0  ...                 67.0
1539                 40.0  ...                 81.0

[2 rows x 4 columns]

How exactly do 'abs()' and 'argsort()' work together?

#Creating DataFrame
df=pd.DataFrame({'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]}); df
output:
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50
aValue = 43.0
df.loc[(df.CCC-aValue).abs().argsort()]
output:
   AAA  BBB  CCC
1    5   20   50
0    4   10  100
2    6   30  -30
3    7   40  -50
The output is confusing. Can you please explain in detail how the line below works?
df.loc[(df.CCC-aValue).abs().argsort()]

With abs flipping the negative values and the subtraction shifting the values around, it's hard to visualize what's going on. It is easier to calculate it step by step:
In [97]: x = np.array([100,50,-30,-50])
In [98]: x-43
Out[98]: array([ 57, 7, -73, -93])
In [99]: abs(x-43)
Out[99]: array([57, 7, 73, 93])
In [100]: np.argsort(abs(x-43))
Out[100]: array([1, 0, 2, 3])
In [101]: x[np.argsort(abs(x-43))]
Out[101]: array([ 50, 100, -30, -50])
argsort is the indexing that puts the elements in sorted order. We can see that with:
In [104]: Out[99][Out[100]]
Out[104]: array([ 7, 57, 73, 93])
or
In [105]: np.array([57, 7, 73, 93])[[1, 0, 2, 3]]
Out[105]: array([ 7, 57, 73, 93])
How they work together is just Python method chaining, evaluated left to right; that part is straightforward.

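(df.CCC-aValue).abs() takes the absolute value of df.CCC-aValue, i.e. each value's distance from aValue. argsort then returns the integer positions that would sort those distances in ascending order, and df.loc selects the rows in that order, so the rows closest to aValue come first. Note that argsort returns positions rather than labels; df.loc works here only because the default index 0..3 coincides with the positions, and df.iloc would be the safer choice for any other index.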
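
The same steps can be traced without NumPy or pandas. The sketch below is only an illustration of what argsort computes, using plain lists and the values from the CCC column; the variable names are made up for the example.

```python
values = [100, 50, -30, -50]  # the CCC column from the question
target = 43.0

# Step 1: distance of every value from the target (subtract, then abs).
distances = [abs(v - target) for v in values]

# Step 2: "argsort" by hand: the positions that put the distances in ascending order.
order = sorted(range(len(values)), key=lambda i: distances[i])

# Step 3: use those positions to reorder the original values.
closest_first = [values[i] for i in order]

print(distances)      # [57.0, 7.0, 73.0, 93.0]
print(order)          # [1, 0, 2, 3]
print(closest_first)  # [50, 100, -30, -50]
```

Position 1 comes first because 50 is only 7 away from 43, which is why row 1 is the first row of the reordered DataFrame.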

create new dataframe based upon max value in one column and corresponding value in a second column

I have a dataframe created by extracting data from a source (network wireless controller).
Dataframe is created off of a dictionary I build. This is basically what I am doing (a sample to show structure - not the actual dataframe):
df = pd.DataFrame({'AP-1': [30, 32, 34, 31, 33, 35, 36, 38, 37],
'AP-2': [30, 32, 34, 80, 33, 35, 36, 38, 37],
'AP-3': [30, 32, 81, 31, 33, 101, 36, 38, 37],
'AP-4': [30, 32, 34, 95, 33, 35, 103, 38, 121],
'AP-5': [30, 32, 34, 31, 33, 144, 36, 38, 37],
'AP-6': [30, 32, 34, 31, 33, 35, 36, 110, 37],
'AP-7': [30, 87, 34, 31, 111, 35, 36, 38, 122],
'AP-8': [30, 32, 99, 31, 33, 35, 36, 38, 37],
'AP-9': [30, 32, 34, 31, 33, 99, 88, 38, 37]}, index=['1', '2', '3', '4', '5', '6', '7', '8', '9'])
df1 = df.transpose()
This works fine.
Note about the data. Columns 1,2,3 are 'related'. They go together. Same for columns 4,5,6 and 7,8,9. I will explain more shortly.
Columns 1, 4, 7 are client count. Columns 2, 5, 8 are channel utilization on the 5 GHz spectrum. Columns 3, 6, 9 are channel utilization on the 2.4 GHz spectrum.
Basically I take a reading at 5-minute intervals. The above would represent three readings at 5-minute intervals.
What I want is two new dataframes, two columns each, constructed as follows:
Examine the 5 GHz columns (here 2, 5, 8). The highest value among them becomes column 1 in the new dataframe. Column 2 is the value of the client count column related to the 5 GHz column with the highest value. In other words, if column 2 were the highest of columns 2, 5, 8, then I want the value in column 1 as the second column of the new dataframe. If the value in column 8 were the highest, then I want to pull the value in column 7 instead. I want the index of the new dataframes to be the same as in the original: the AP name.
I want to do this for all rows in the 'main' dataframe. I want two new dataframes, so I will repeat this exact procedure for the 5 GHz columns and for the 2.4 GHz columns (3, 6, 9), also grabbing the corresponding client count value for the second column in the new dataframe.
What I have tried:
First I broke the main dataframe into three: df_cc has all the client count columns, df_5Ghz has the 5 GHz columns, and df_24Ghz has the 2.4 GHz columns, using this:
# create client count only dataframe
df_cc = df[df.columns[::3]]
print(df_cc)
print()
# create 5Ghz channel utilization only dataframe
df_5Ghz = df[df.columns[1::3]]
print(df_5Ghz)
print()
# create 2.4Ghz channel utilization only dataframe
df_24Ghz = df[df.columns[2::3]]
print(df_24Ghz)
print()
This works.
I thought I could then reference the main dataframe, but I don't know how.
Then I found this:
extract column value based on another column pandas dataframe
The query option looked great, but I don't know the value in advance. I first need to discover the max value of the 2.4 GHz and 5 GHz columns respectively, then grab the corresponding client count value. That is why I first created dataframes containing only the 2.4 GHz and 5 GHz values, thinking I could get the max value of each row and then do a lookup on the main dataframe (or use the client-count-only dataframe I created), but I just do not know how to realize this idea.
Any assistance would be greatly appreciated.

You can get what you want in 3 steps:
# connection between columns
mapping = {'2': '1', '5': '4', '8': '7'}
# 1. column with highest value among 5GHz values (pandas series)
df2 = df1.loc[:, ['2', '5', '8']].idxmax(axis=1)
df2.name = 'highest value'
# 2. column with client count corresponding to the highest value (pandas series)
df3 = df2.apply(lambda x: mapping[x])
df3.name = 'client count'
# 3. build result using 2 lists of columns (pandas dataframe)
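# s is each of the two series in turn; the loop variable is not named df to avoid shadowing the original dataframe
df4 = pd.DataFrame(
    {s.name: [df1.loc[idx, col] for idx, col in zip(s.index, s.values)]
     for s in [df2, df3]},
    index=df1.index)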
print(df4)
Output:
      highest value  client count
AP-1             38            36
AP-2             38            36
AP-3             38            36
AP-4             38           103
AP-5             38            36
AP-6            110            36
AP-7            111            31
AP-8             38            36
AP-9             38            88
I suspect, though I am not sure, that this would be easier to solve (and faster to compute) without pandas, using only built-in Python data types such as dictionaries and lists.
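
As a rough sketch of that built-in-types idea (not a tested replacement for the pandas answer), the readings can be kept in a dictionary of lists. It assumes every reading is a group of three values in the order client count, 5 GHz utilization, 2.4 GHz utilization; the function name peak_with_clients and its offset parameter are made up for the example.

```python
readings = {
    "AP-1": [30, 32, 34, 31, 33, 35, 36, 38, 37],
    "AP-2": [30, 32, 34, 80, 33, 35, 36, 38, 37],
    "AP-3": [30, 32, 81, 31, 33, 101, 36, 38, 37],
    "AP-4": [30, 32, 34, 95, 33, 35, 103, 38, 121],
    "AP-5": [30, 32, 34, 31, 33, 144, 36, 38, 37],
    "AP-6": [30, 32, 34, 31, 33, 35, 36, 110, 37],
    "AP-7": [30, 87, 34, 31, 111, 35, 36, 38, 122],
    "AP-8": [30, 32, 99, 31, 33, 35, 36, 38, 37],
    "AP-9": [30, 32, 34, 31, 33, 99, 88, 38, 37],
}


def peak_with_clients(values, offset):
    """Return (highest utilization, client count of the same reading).

    offset is the position inside each group of three:
    1 for the 5 GHz value, 2 for the 2.4 GHz value.
    """
    best = max(range(offset, len(values), 3), key=lambda i: values[i])
    return values[best], values[best - offset]


result_5ghz = {ap: peak_with_clients(v, 1) for ap, v in readings.items()}
result_24ghz = {ap: peak_with_clients(v, 2) for ap, v in readings.items()}

for ap, (util, clients) in result_5ghz.items():
    print(ap, util, clients)
```

On ties, max keeps the first highest reading, which matches the behaviour of idxmax in the pandas version.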

How to return the number of values greater than X with multiple criteria

I am looking for a formula that returns the count of values greater than 20 after applying two criteria.
I have a table with 3 fields:
Field A: 18, 18, 19, 19, 21, 21, 44, 55, 55, 56, 61, 61, 75, 76, 86
Field B: 1, 4, 1, 5, 1, 6, 3, 1, 2, 1, 1, 3, 1, 1, 1
Field C: 5, 2, 14, 7, 38, 1, 100, 76, 32, 65, 83, 20, 17, 41, 88
I have two criteria:
Criteria1: 18, 55, 61, 75, 86 (this is an array)
Criteria2: 1
Steps:
Step 1 - Apply Criteria_1 to Field_A
Step 2 - Apply Criteria_2 to Field_B
Step 3 - Return number of values greater than 20
Regards,
Elio Fernandes

=SUM(ISNUMBER(MATCH(A1:A15, {18,55,61,75,86}, 0)) * (B1:B15 = 1) * (C1:C15 > 20))
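Confirm it with Ctrl+Shift+Enter so that it is evaluated as an array formula.
This relies on TRUE being treated as 1 and FALSE as 0 when the three conditions are multiplied, so a row contributes 1 to the sum only when all three conditions hold.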
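
The multiply-the-conditions trick can be checked outside Excel. The Python sketch below is only an illustration using the data from the question; it assumes, as the formula does, that the "greater than 20" test applies to Field C, and the variable names are made up for the example.

```python
field_a = [18, 18, 19, 19, 21, 21, 44, 55, 55, 56, 61, 61, 75, 76, 86]
field_b = [1, 4, 1, 5, 1, 6, 3, 1, 2, 1, 1, 3, 1, 1, 1]
field_c = [5, 2, 14, 7, 38, 1, 100, 76, 32, 65, 83, 20, 17, 41, 88]

criteria_1 = {18, 55, 61, 75, 86}
criteria_2 = 1

# Each comparison is True (1) or False (0); their product is 1 only if all hold.
count = sum(
    (a in criteria_1) * (b == criteria_2) * (c > 20)
    for a, b, c in zip(field_a, field_b, field_c)
)
print(count)  # 3
```

The three matching rows are (55, 1, 76), (61, 1, 83) and (86, 1, 88).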
