Sorting numerically but not alphabetically when numbers are equal - linux

I have a file like this:
A 0.77
C 0.98
B 0.77
Z 0.77
G 0.65
I want to sort the file numerically in descending order. I used this code:
sort -gr -k2,2 file.txt
I obtain this:
C 0.98
Z 0.77
B 0.77
A 0.77
G 0.65
In my real file, several rows share the same number, and those tied rows come out ordered alphabetically. What I want is to sort numerically but not alphabetically when the numbers are equal, so that the tied rows are not in alphabetical order:
C 0.98
B 0.77
Z 0.77
A 0.77
G 0.65
But any random order is fine.

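You can use this sort:
sort -k2,2rn -k1,1R file
One possible output:
C 0.98
B 0.77
Z 0.77
A 0.77
G 0.65
Two sort keys are used:
-k2,2rn: the first sort key is column 2 only; numeric, reverse
-k1,1R: the second sort key is column 1 only; random (GNU sort's -R orders keys by a random hash, so the order of tied rows changes between runs)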

One in GNU awk that preserves the input order of the first field within each group of equal numbers (random in, equally random out):
$ awk ' {
    a[$2]=a[$2] (a[$2]==""?"":FS) $1       # append $1 values to hash, indexed on $2
}
END {
    PROCINFO["sorted_in"]="@ind_num_desc"  # traverse the array in descending numeric index order...
    for(i in a) {                          # ... and use it here
        n=split(a[i],b)
        for(j=1;j<=n;j++)                  # preserve the input order
            print b[j],i                   # output
    }
}' file
C 0.98
A 0.77
B 0.77
Z 0.77
G 0.65
Testing reverse order:
$ tac file | awk '# above awk script'
C 0.98
Z 0.77
B 0.77
A 0.77
G 0.65
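The same order-preserving idea can be sketched in Python, relying on the fact that Python's built-in sort is stable (also with reverse=True), so rows with equal numbers keep their input order. The function name sort_by_number_desc and the inline sample lines are only illustrative, not part of the original answers.

```python
# Sample rows from the question; in practice they would be read from the file.
lines = ["A 0.77", "C 0.98", "B 0.77", "Z 0.77", "G 0.65"]


def sort_by_number_desc(rows):
    """Sort rows by the numeric second field, descending.

    sorted() is stable, so rows with equal numbers stay in input order.
    """
    return sorted(rows, key=lambda row: float(row.split()[1]), reverse=True)


result = sort_by_number_desc(lines)
print("\n".join(result))
```

Feeding the rows in reverse order reverses the tied rows in the output as well, just as in the tac test above.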

Related

Python/how to insert data from one table into another according to the data condition

I have the first table:
                         user_cnt         cost  acquisition_cost
channel
facebook_ads                 2726  2140.904643              0.79
instagram_new_adverts        3347  2161.441691              0.65
yandex_direct                4817  2233.111449              0.46
youtube_channel_reklama      2686  1068.119204              0.40
and the second:
user_id       profit  source           cost_per_user  income
8a28b283ac30  0.91    facebook_ads     ?              0.12
d7cf130a0105  0.63    youtube_channel  ?              0.17
The second table has more than 200k rows, but I have shown only two. I need to copy the acquisition_cost value from the first table into the cost_per_user column of the second table, matching on the channel/source name. For instance, in the first row of the second table, cost_per_user should be 0.79 because the source is facebook_ads.
I will be grateful if someone can help me to solve this task.
First, I tried this function:
def cost(source_type):
    if source_type == 'instagram_new_adverts':
        return 0.65
    elif source_type == 'facebook_ads':
        return 0.79
    elif source_type == 'youtube_channel_reklama':
        return 0.40
    else:
        return 0.46

target_build['cost_per_user'] = target_build['source'].apply(cost)
but I have to find another solution that does not hard-code constants (such as return 0.65).
Another attempt was:
for row in first_table['channel'].unique():
    second_table.loc[second_table['source'] == row, 'cost_per_user'] = first_table['acquisition_cost']
This code works only for the first four rows; for the others it puts a zero value.
The last idea was:
second_table['cost_per_user'] = second_table['cost_per_user'].where(
    second_table['source'].isin(b.index), b['acquisition_cost'])
and again it didn't work.
It looks like you are using pandas. One possible solution is to use the pandas version of an inner join: merge.
Example
Suppose you don't want to modify either your first or second table. You can create a temporary table that includes just channel and acquisition_cost from the first table, with the columns renamed to source and cost_per_user. This can be implemented in various ways. One possible way, presuming channel is the index, is shown below.
temp_df = first_df.acquisition_cost.reset_index().rename(
    columns={'channel': 'source', 'acquisition_cost': 'cost_per_user'},
)
temp_df looks like this
source cost_per_user
0 facebook_ads 0.79
1 instaram_new_adverts 0.65
2 yandex_direct 0.46
3 youtube_channel_reklama 0.40
Say your second table looks like this:
user_id profit source income
0 c3519e80c071 0.773956 yandex_direct 0.227239
1 cc39ba469a08 0.438878 instaram_new_adverts 0.554585
2 a44a621e0222 0.858598 facebook_ads 0.063817
3 9dbf921b0959 0.697368 youtube_channel_reklama 0.827631
4 d45bf8fcab75 0.094177 youtube_channel_reklama 0.631664
5 57dbe1efd8b1 0.975622 yandex_direct 0.758088
6 1e0e3f1e13f7 0.761140 instaram_new_adverts 0.354526
7 27a7a7470ef4 0.786064 youtube_channel_reklama 0.970698
8 360dfd543fb5 0.128114 yandex_direct 0.893121
9 a31f46c26abb 0.450386 instaram_new_adverts 0.778383
You can run the merge call to attach cost_per_user to each source.
new_df = pd.merge(left=second_df, right=temp_df, on='source', how='inner')
new_df would look like this
user_id profit source income cost_per_user
0 c3519e80c071 0.773956 yandex_direct 0.227239 0.46
1 57dbe1efd8b1 0.975622 yandex_direct 0.758088 0.46
2 360dfd543fb5 0.128114 yandex_direct 0.893121 0.46
3 cc39ba469a08 0.438878 instaram_new_adverts 0.554585 0.65
4 1e0e3f1e13f7 0.761140 instaram_new_adverts 0.354526 0.65
5 a31f46c26abb 0.450386 instaram_new_adverts 0.778383 0.65
6 a44a621e0222 0.858598 facebook_ads 0.063817 0.79
7 9dbf921b0959 0.697368 youtube_channel_reklama 0.827631 0.40
8 d45bf8fcab75 0.094177 youtube_channel_reklama 0.631664 0.40
9 27a7a7470ef4 0.786064 youtube_channel_reklama 0.970698 0.40
Notes
Complications arise if some source values in the second table have no matching channel in the first table. Read the merge documentation to decide how to handle that situation (e.g. use an inner join to discard any mismatch, or a left join to keep the unmatched source rows, which then receive no cost_per_user).
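To illustrate the left-join case, here is a minimal sketch built from the two sample rows in the question, where youtube_channel has no counterpart among the channels. The frame names first_df and second_df follow the answer above; the trimmed-down columns and the unmatched variable are assumptions made for the example.

```python
import pandas as pd

# First table, indexed by channel (only the column needed here).
first_df = pd.DataFrame(
    {"acquisition_cost": [0.79, 0.65, 0.46, 0.40]},
    index=pd.Index(
        ["facebook_ads", "instagram_new_adverts",
         "yandex_direct", "youtube_channel_reklama"],
        name="channel",
    ),
)

# Second table: the two sample rows from the question.
second_df = pd.DataFrame(
    {
        "user_id": ["8a28b283ac30", "d7cf130a0105"],
        "source": ["facebook_ads", "youtube_channel"],
    }
)

# Lookup table with the column names expected in the second table.
temp_df = first_df["acquisition_cost"].reset_index().rename(
    columns={"channel": "source", "acquisition_cost": "cost_per_user"}
)

# A left join keeps every row of second_df; unmatched sources get NaN.
new_df = second_df.merge(temp_df, on="source", how="left")

# List the source names that found no channel, so they can be fixed upstream.
unmatched = new_df.loc[new_df["cost_per_user"].isna(), "source"].unique()

print(new_df)
print(unmatched)
```

Checking unmatched before trusting the result is a cheap way to catch naming differences such as youtube_channel versus youtube_channel_reklama.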

Make a new file from the output obtained from for loop

How can I turn data structured as 122 rows by 1 column into 122 rows by 4 columns (for example H_S1 H 0.16 2593.354 as four separate columns) and save it as a new file with a .txt, .csv, or .dat extension? Thanks in advance.
This is the structure of the print(y, y1, y2, y3) output produced inside a for loop in Python 3:
[122 rows x 1 columns]
H_S1 H 0.16 2593.354
H_S2 H 0.32 2676.584
H_S3 H 0.64 5125.25
H_S4 H 1.28 12029.221
H_S5 H 2.56 6764.678
unc_H_S1 H 0.16 8.16627
unc_H_S2 H 0.32 10.754601
unc_H_S3 H 0.64 5.16457
unc_H_S4 H 1.28 10.93159
.
.
.
Desired output:
[122 rows x 4 columns]
H_S1 H 0.16 2593.354
H_S2 H 0.32 2676.584
H_S3 H 0.64 5125.25
H_S4 H 1.28 12029.221
H_S5 H 2.56 6764.678
unc_H_S1 H 0.16 8.16627
unc_H_S2 H 0.32 10.754601
unc_H_S3 H 0.64 5.16457
unc_H_S4 H 1.28 10.93159
.
.
.
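One possible approach, sketched under assumptions: instead of printing inside the loop, write each group of four values as one row with the standard csv module. The rows list below stands in for the (y, y1, y2, y3) values produced by the loop, and the output path is hypothetical; neither comes from the question.

```python
import csv
import tempfile
from pathlib import Path

# Stand-in for the values produced by the loop: one (y, y1, y2, y3) per pass.
rows = [
    ("H_S1", "H", 0.16, 2593.354),
    ("H_S2", "H", 0.32, 2676.584),
    ("unc_H_S1", "H", 0.16, 8.16627),
]

# Hypothetical output location; any .txt, .csv or .dat path works the same way.
out_path = Path(tempfile.gettempdir()) / "output.csv"

with open(out_path, "w", newline="") as fh:
    writer = csv.writer(fh)  # pass delimiter=" " for space-separated output
    for y, y1, y2, y3 in rows:
        writer.writerow([y, y1, y2, y3])  # one line with four columns

print(out_path.read_text())
```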

Creating a new column into a dataframe based on conditions

For the dataframe df:
import pandas as pd

dummy_data1 = {'category': ['White', 'Black', 'Hispanic', 'White'],
               'Pop': ['75', '85', '90', '100'], 'White_ratio': [0.6, 0.4, 0.7, 0.35], 'Black_ratio': [0.3, 0.2, 0.1, 0.45], 'Hispanic_ratio': [0.1, 0.4, 0.2, 0.20]}
df = pd.DataFrame(dummy_data1, columns=['category', 'Pop', 'White_ratio', 'Black_ratio', 'Hispanic_ratio'])
I want to add a new column, 'pop_n', to this data frame by first checking the category and then multiplying the value in 'Pop' by the ratio in the corresponding column. For the first row,
the category is 'White', so it should multiply 75 by 0.60 and put 45 in the pop_n column.
I thought about writing something like:
df['pop_n'] = (df['Pop'].astype(int) * df['White_ratio']).where(df['category'] == 'White')
This works, but only for one category.
I would appreciate any help with this.
Thanks.
Using DataFrame.filter and DataFrame.lookup (note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0, so this answer needs an older pandas):
First we use filter to get the columns with ratio in the name. Then we split each name and keep only the first word before the underscore.
Finally we use lookup to match the category values to these columns.
df['Pop'] = df['Pop'].astype(int)  # 'Pop' holds strings in the sample data
df2 = df.filter(like='ratio').rename(columns=lambda x: x.split('_')[0])
df['pop_n'] = df2.lookup(df.index, df['category']) * df['Pop']
category Pop White_ratio Black_ratio Hispanic_ratio pop_n
0 White 75 0.60 0.30 0.1 45.0
1 Black 85 0.40 0.20 0.4 17.0
2 Hispanic 90 0.70 0.10 0.2 18.0
3 White 100 0.35 0.45 0.2 35.0
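On pandas versions without DataFrame.lookup, the same row-wise pick can be sketched with NumPy integer indexing. This is a swapped-in technique rather than the answer's own method, and the names ratios, col_pos, and picked are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "category": ["White", "Black", "Hispanic", "White"],
        "Pop": ["75", "85", "90", "100"],
        "White_ratio": [0.6, 0.4, 0.7, 0.35],
        "Black_ratio": [0.3, 0.2, 0.1, 0.45],
        "Hispanic_ratio": [0.1, 0.4, 0.2, 0.20],
    }
)

# Keep the ratio columns and strip the "_ratio" suffix from their names.
ratios = df.filter(like="ratio").rename(columns=lambda c: c.split("_")[0])

# Column position of each row's category (-1 would mean "no such column").
col_pos = ratios.columns.get_indexer(df["category"])

# Pick one value per row: row i, column col_pos[i].
picked = ratios.to_numpy()[np.arange(len(df)), col_pos]

df["pop_n"] = picked * df["Pop"].astype(int)
print(df)
```

If a category had no matching ratio column, get_indexer would return -1 and silently pick the last column, so a check such as (col_pos >= 0).all() is worth adding for real data.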
Locate the columns that have underscores in their names:
to_rename = {x: x.split("_")[0] for x in df if "_" in x}
Find the matching factors (a list is used for the mask because map returns an iterator in Python 3, which cannot be used to index a Series):
stack = df.rename(columns=to_rename)\
          .set_index('category').stack()
factors = stack[[cat == col for cat, col in stack.index]]\
          .reset_index(drop=True)
Multiply the original data by the factors:
df['pop_n'] = df['Pop'].astype(int) * factors
# category Pop White_ratio Black_ratio Hispanic_ratio pop_n
#0 White 75 0.60 0.30 0.1 45
#1 Black 85 0.40 0.20 0.4 17
#2 Hispanic 90 0.70 0.10 0.2 18
#3 White 100 0.35 0.45 0.2 35

pandas custom sorting multilevel index

I have the following example dataset, and I'd like to sort the index columns by a custom order that is not contained within the dataframe. So far looking on SO I haven't been able to solve this. Example:
import pandas as pd
data = {'s':[1,1,1,1],
'am':['cap', 'cap', 'sea', 'sea'],
'cat':['i', 'o', 'i', 'o'],
'col1':[.55, .44, .33, .22],
'col2':[.77, .66, .55, .44]}
df = pd.DataFrame(data=data)
df.set_index(['s', 'am', 'cat'], inplace=True)
Out[1]:
col1 col2
s am cat
1 cap i 0.55 0.77
o 0.44 0.66
sea i 0.33 0.55
o 0.22 0.44
What I would like is the following:
Out[2]:
col1 col2
s am cat
1 sea i 0.33 0.55
o 0.22 0.44
cap i 0.55 0.77
o 0.44 0.66
and I might also want to sort by 'cat' with the order ['o', 'i'], as well.
Use sort_values and sort_index
df.sort_values(df.columns.tolist()).sort_index(level=1, ascending=False,
sort_remaining=False)
col1 col2
s am cat
1 sea i 0.33 0.55
o 0.22 0.44
cap i 0.55 0.77
o 0.44 0.66
Convert the index to categorical to get the custom order.
data = {'s':[1,1,1,1],
'am':['cap', 'cap', 'sea', 'sea'],
'cat':['i', 'j', 'k', 'l'],
'col1':[.55, .44, .33, .22],
'col2':[.77, .66, .55, .44]}
df = pd.DataFrame(data=data)
df.set_index(['s', 'am', 'cat'], inplace=True)
# MultiIndex.set_levels(..., inplace=True) was removed in pandas 2.0,
# so make 'cat' an ordered categorical via a regular column instead
df = df.reset_index()
df['cat'] = pd.Categorical(df['cat'],
                           categories=['j', 'i', 'k', 'l'],
                           ordered=True)
df.sort_values('cat').set_index(['s', 'am', 'cat'])
col1 col2
s am cat
1 cap j 0.44 0.66
i 0.55 0.77
sea k 0.33 0.55
l 0.22 0.44
As of Pandas 1.1 there is another option with the key param of sort_values.
SORT_VALS = {"am": ["sea", "cap"]}

def sorter(column):
    if column.name not in SORT_VALS:
        return column
    mapper = {val: order for order, val in enumerate(SORT_VALS[column.name])}
    return column.map(mapper)

new_df = df.sort_values(by=["s", "am", "cat"], key=sorter)
# col1 col2
# s am cat
# 1 sea i 0.33 0.55
# o 0.22 0.44
# cap i 0.55 0.77
# o 0.44 0.66
You can also use pd.Categorical in the sorter and return a categorical Series for the custom-sorted columns, which may have different performance implications depending on your scenario. Note, however, that at the time of writing there is a soon-to-be-fixed bug in pandas that can prevent multi-column sorts with Categorical sorting.
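The key-based approach extends to the second custom order mentioned in the question, 'cat' sorted as ['o', 'i']. The sketch below is self-contained and assumes pandas 1.1 or later. It resets the index before sorting so that every sort key is an ordinary column, which is a choice made for this example rather than a requirement stated in the answer.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "s": [1, 1, 1, 1],
        "am": ["cap", "cap", "sea", "sea"],
        "cat": ["i", "o", "i", "o"],
        "col1": [0.55, 0.44, 0.33, 0.22],
        "col2": [0.77, 0.66, 0.55, 0.44],
    }
).set_index(["s", "am", "cat"])

# Custom order per level; levels not listed here keep their natural order.
SORT_VALS = {"am": ["sea", "cap"], "cat": ["o", "i"]}


def sorter(column):
    """Map values of custom-ordered columns to their rank in SORT_VALS."""
    if column.name not in SORT_VALS:
        return column
    rank = {val: pos for pos, val in enumerate(SORT_VALS[column.name])}
    return column.map(rank)


levels = ["s", "am", "cat"]
new_df = df.reset_index().sort_values(by=levels, key=sorter).set_index(levels)
print(new_df)
```

Values missing from a SORT_VALS list map to NaN and are placed last by default, so listing every expected value keeps the order predictable.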

Sorting pivot table (multi index)

I'm trying to sort a pivot table's values in descending order after putting two "row labels" (Excel term) on the pivot.
sample data:
import numpy as np
import pandas as pd

x = pd.DataFrame({'col1': ['a', 'a', 'b', 'c', 'c', 'a', 'b', 'c', 'a', 'b', 'c'],
                  'col2': [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                  'col3': [1, .67, 0.5, 2, .65, .75, 2.25, 2.5, .5, 2, 2.75]})
print(x)
col1 col2 col3
0 a 1 1.00
1 a 1 0.67
2 b 1 0.50
3 c 1 2.00
4 c 1 0.65
5 a 2 0.75
6 b 2 2.25
7 c 2 2.50
8 a 3 0.50
9 b 3 2.00
10 c 3 2.75
To create the pivot, I'm using the following function:
pt = pd.pivot_table(x, index = ['col1', 'col2'], values = 'col3', aggfunc = np.sum)
print(pt)
col3
col1 col2
a 1 1.67
2 0.75
3 0.50
b 1 0.50
2 2.25
3 2.00
c 1 2.65
2 2.50
3 2.75
In words, this variable pt is first sorted by col1, then by values of col2 within col1 then by col3 within all of those. This is great, but I would like to sort by col3 (the values) while keeping the groups that were broken out in col2 (this column can be any order and shuffled around).
The target output would look something like this (col3 in descending order within each group of col1, with col2 in whatever order results):
col3
col1 col2
a 1 1.67
2 0.75
3 0.50
b 2 2.25
3 2.00
1 0.50
c 3 2.75
1 2.65
2 2.50
I have tried the code below, but this just sorts the entire pivot table values and loses the grouping (I'm looking for sorting within the group).
pt.sort_values(by = 'col3', ascending = False)
For guidance, a similar question was asked (and answered) here, but I was unable to get a successful output with the provided answer:
Pandas: Sort pivot table
The error I get from that answer is ValueError: all keys need to be the same shape
You need reset_index to turn the pivot table's index back into columns, then sort_values by col1 and col3, and finally set_index to restore the MultiIndex:
df = (pt.reset_index()
        .sort_values(['col1', 'col3'], ascending=[True, False])
        .set_index(['col1', 'col2']))
print (df)
col3
col1 col2
a 1 1.67
2 0.75
3 0.50
b 2 2.25
3 2.00
1 0.50
c 3 2.75
1 2.65
2 2.50
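For reference, the whole pipeline from the sample data to the sorted pivot table can be run end to end as follows. It is a sketch using the question's data; aggfunc="sum" is used in place of np.sum, and the name result is only illustrative.

```python
import pandas as pd

x = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "c", "c", "a", "b", "c", "a", "b", "c"],
        "col2": [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
        "col3": [1, 0.67, 0.5, 2, 0.65, 0.75, 2.25, 2.5, 0.5, 2, 2.75],
    }
)

# Sum col3 for every (col1, col2) pair.
pt = pd.pivot_table(x, index=["col1", "col2"], values="col3", aggfunc="sum")

# Sort by group (col1 ascending), then by value (col3 descending) inside it.
result = (
    pt.reset_index()
    .sort_values(["col1", "col3"], ascending=[True, False])
    .set_index(["col1", "col2"])
)
print(result)
```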
