Why am I getting a "too many indexers" error? - python-3.x

cars_df = pd.DataFrame((car.iloc[:[1,3,4,6]].values), columns = ['mpg', 'dip', 'hp', 'wt'])
car_t = car.iloc[:9].values
target_names = [0,1]
car_df['group'] = pd.series(car_t, dtypre='category')
sb.pairplot(cars_df)
I have tried using .iloc(axis=0)[xxxx] and making the slice into a list and a tuple, with no luck. Any thoughts? I am trying to make a scatter plot from a lynda.com video, but the host uses .ix, which is deprecated, so I am using .iloc[].
car is a DataFrame; here are a few lines of the data:
"Car_name","mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
"Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
"Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
"Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
"Hornet 4 Drive",21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
"Hornet Sportabout",18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
"Valiant",18.1,6,225,105,2.76,3.46,20.22,1,0,3,1
"Duster 360",14.3,8,360,245,3.21,3.57,15.84,0,0,3,4
"Merc 240D",24.4,4,146.7,62,3.69,3.19,20,1,0,4,2
"Merc 230",22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
"Merc 280",19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4
"Merc 280C",17.8,6,167.6,123,3.92,3.44,18.9,1,0,4,4
"Merc 450SE",16.4,8,275.8,180,3.07,4.07,17.4,0,0,3,3

I think you want to select multiple columns with iloc, using row and column indexers separated by a comma:
cars_df = car.iloc[:, [1,3,4,6]]
print (cars_df)
mpg disp hp wt
0 21.0 160.0 110 2.620
1 21.0 160.0 110 2.875
2 22.8 108.0 93 2.320
3 21.4 258.0 110 3.215
4 18.7 360.0 175 3.440
5 18.1 225.0 105 3.460
6 14.3 360.0 245 3.570
7 24.4 146.7 62 3.190
8 22.8 140.8 95 3.150
9 19.2 167.6 123 3.440
10 17.8 167.6 123 3.440
11 16.4 275.8 180 4.070
sb.pairplot(cars_df)
Not 100% sure about the rest of the code, but it seems you need:
# also select the 9th column, 'am'
cars_df = car.iloc[:, [1,3,4,6,9]]
# rename the 'am' column to 'group'
cars_df = cars_df.rename(columns={'am':'group'})
# convert it to categorical
cars_df['group'] = pd.Categorical(cars_df['group'])
print (cars_df)
mpg disp hp wt group
0 21.0 160.0 110 2.620 1
1 21.0 160.0 110 2.875 1
2 22.8 108.0 93 2.320 1
3 21.4 258.0 110 3.215 0
4 18.7 360.0 175 3.440 0
5 18.1 225.0 105 3.460 0
6 14.3 360.0 245 3.570 0
7 24.4 146.7 62 3.190 0
8 22.8 140.8 95 3.150 0
9 19.2 167.6 123 3.440 0
10 17.8 167.6 123 3.440 0
11 16.4 275.8 180 4.070 0
# add the hue parameter for different levels of a categorical variable
sb.pairplot(cars_df, hue='group')
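Putting it together, a minimal end-to-end sketch (assuming the sample above is saved as mtcars.csv; the file name is hypothetical):
import pandas as pd
import seaborn as sb

car = pd.read_csv('mtcars.csv')          # hypothetical file name for the sample data
cars_df = car.iloc[:, [1, 3, 4, 6, 9]]   # mpg, disp, hp, wt, am
cars_df = cars_df.rename(columns={'am': 'group'})
cars_df['group'] = pd.Categorical(cars_df['group'])
sb.pairplot(cars_df, hue='group')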

Related

column multiplication based on a mapping

I have the following two dataframes. The first one maps each node to an area number and to the maximum electric load of that node.
bus = pd.DataFrame(data={'Node':[101, 102, 103, 104, 105], 'Area':[1, 1, 2, 2, 3], 'Load':[10, 15, 12, 20, 25]})
which gives us:
Node Area Load
0 101 1 10
1 102 1 15
2 103 2 12
3 104 2 20
4 105 3 25
The second dataframe shows the total electric load of each area over a time period (from hour 0 to 5). The column names are the areas (matching the column Area in the dataframe bus).
load = pd.DataFrame(data={1:[20, 18, 17, 19, 22, 25], 2:[23, 25,24, 27, 30, 32], 3:[10, 14, 19, 25, 22, 20]})
which gives us:
1 2 3
0 20 23 10
1 18 25 14
2 17 24 19
3 19 27 25
4 22 30 22
5 25 32 20
I would like to have a dataframe that shows the electric load of each bus over the 6 hours.
Assumption: each bus's share of the load over time is the same as its share of the maximum load shown in bus; e.g., bus 101 has 10/(10+15) = 0.4 (40%) of the electric load of area 1, so to calculate its hourly load, 10/(10+15) should be multiplied by the column corresponding to area 1 in load.
The desired output should be of the following format:
101 102 103 104 105
0 8 12 8.625 14.375 10
1 7.2 10.8 9.375 15.625 14
2 6.8 10.2 9 15 19
3 7.6 11.4 10.125 16.875 25
4 8.8 13.2 11.25 18.75 22
5 10 15 12 20 20
For column 101, we have 0.4 multiplied by column 1 of load.
Any help is greatly appreciated.
One option is to divide Load by the per-area sum, then pivot, align the index for both load and bus, and multiply on the matching levels:
(bus.assign(Load = bus.Load.div(bus.groupby('Area').Load.transform('sum')))
.pivot(index=None, columns=['Area', 'Node'], values='Load')
.reindex(load.index)
.ffill() # get the data spread into all rows
.bfill()
.mul(load, level=0)
.droplevel(0,1)
.rename_axis(columns=None)
)
101 102 103 104 105
0 8.0 12.0 8.625 14.375 10.0
1 7.2 10.8 9.375 15.625 14.0
2 6.8 10.2 9.000 15.000 19.0
3 7.6 11.4 10.125 16.875 25.0
4 8.8 13.2 11.250 18.750 22.0
5 10.0 15.0 12.000 20.000 20.0
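If the pivot/reindex chain feels heavy, a more explicit sketch (assuming the same bus and load frames as above) computes each node's share of its area and builds the columns directly:
# each node's share of its area's maximum load, e.g. 10 / (10 + 15) = 0.4 for node 101
ratio = bus['Load'] / bus.groupby('Area')['Load'].transform('sum')
result = pd.DataFrame({node: load[area] * r
                       for node, area, r in zip(bus['Node'], bus['Area'], ratio)})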
You can calculate the ratio in bus, transpose load, merge the two, and multiply the ratio by the load; here goes:
bus['area_sum'] = bus.groupby('Area')['Load'].transform('sum')
bus['node_ratio'] = bus['Load'] / bus['area_sum']
full_data = bus.merge(load.T.reset_index(), left_on='Area', right_on='index')
result = pd.DataFrame([full_data['node_ratio'] * full_data[x] for x in range(6)])
result.columns = full_data['Node'].values
result:
   101   102     103     104  105
0    8    12   8.625  14.375   10
1  7.2  10.8   9.375  15.625   14
2  6.8  10.2       9      15   19
3  7.6  11.4  10.125  16.875   25
4  8.8  13.2   11.25   18.75   22
5   10    15      12      20   20

Pandas Computing On Multidimensional Data

I have two data frames storing tracking data of offensive and defensive players during an NFL game. My goal is to calculate the maximum distance between an offensive player and the nearest defender during the course of the play.
As a simple example, I've made up some data where there are only three offensive players and two defensive players. Here is the data:
Defense
GameTime PlayId PlayerId x-coord y-coord
0 1 1 117 20.2 20.0
1 2 1 117 21.0 19.1
2 3 1 117 21.3 18.3
3 4 1 117 22.0 17.5
4 5 1 117 22.5 17.2
5 6 1 117 23.0 16.9
6 7 1 117 23.6 16.7
7 8 2 117 25.1 34.1
8 9 2 117 25.9 34.2
9 10 2 117 24.1 34.5
10 11 2 117 22.7 34.2
11 12 2 117 21.5 34.5
12 13 2 117 21.1 37.3
13 14 3 117 21.2 44.3
14 15 3 117 20.4 44.6
15 16 3 117 21.9 42.7
16 17 3 117 21.1 41.9
17 18 3 117 20.1 41.7
18 19 3 117 20.1 41.3
19 1 1 555 40.1 17.0
20 2 1 555 40.7 18.3
21 3 1 555 41.0 19.6
22 4 1 555 41.5 18.4
23 5 1 555 42.6 18.4
24 6 1 555 43.8 18.0
25 7 1 555 44.2 15.8
26 8 2 555 41.2 37.1
27 9 2 555 42.3 36.5
28 10 2 555 45.6 36.3
29 11 2 555 47.9 35.6
30 12 2 555 47.4 31.3
31 13 2 555 46.8 31.5
32 14 3 555 47.3 40.3
33 15 3 555 47.2 40.6
34 16 3 555 44.5 40.8
35 17 3 555 46.5 41.0
36 18 3 555 47.6 41.4
37 19 3 555 47.6 41.5
Offense
GameTime PlayId PlayerId x-coord y-coord
0 1 1 751 30.2 15.0
1 2 1 751 31.0 15.1
2 3 1 751 31.3 15.3
3 4 1 751 32.0 15.5
4 5 1 751 31.5 15.7
5 6 1 751 33.0 15.9
6 7 1 751 32.6 15.7
7 8 2 751 51.1 30.1
8 9 2 751 51.9 30.2
9 10 2 751 51.1 30.5
10 11 2 751 49.7 30.6
11 12 2 751 49.5 30.9
12 13 2 751 49.1 31.3
13 14 3 751 12.2 40.3
14 15 3 751 12.4 40.5
15 16 3 751 12.9 40.7
16 17 3 751 13.1 40.9
17 18 3 751 13.1 41.1
18 19 3 751 13.1 41.3
19 1 1 419 41.3 15.0
20 2 1 419 41.7 15.3
21 3 1 419 41.8 15.4
22 4 1 419 42.9 15.6
23 5 1 419 42.6 15.6
24 6 1 419 44.8 16.0
25 7 1 419 45.2 15.8
26 8 2 419 62.2 30.1
27 9 2 419 63.3 30.5
28 10 2 419 62.6 31.0
29 11 2 419 63.9 30.6
30 12 2 419 67.4 31.3
31 13 2 419 66.8 31.5
32 14 3 419 30.3 40.3
33 15 3 419 30.2 40.6
34 16 3 419 30.5 40.8
35 17 3 419 30.5 41.0
36 18 3 419 31.6 41.4
37 19 3 419 31.6 41.5
38 1 1 989 10.1 15.0
39 2 1 989 10.2 15.5
40 3 1 989 10.4 15.4
41 4 1 989 10.5 15.8
42 5 1 989 10.6 15.9
43 6 1 989 10.1 15.5
44 7 1 989 10.9 15.3
45 8 2 989 25.8 30.1
46 9 2 989 25.2 30.1
47 10 2 989 21.8 30.2
48 11 2 989 25.8 30.2
49 12 2 989 25.6 30.5
50 13 2 989 25.5 31.0
51 14 3 989 50.3 40.3
52 15 3 989 50.3 40.2
53 16 3 989 50.2 40.4
54 17 3 989 50.1 40.8
55 18 3 989 50.6 41.2
56 19 3 989 51.4 41.6
The data is essentially multidimensional with GameTime, PlayId, and PlayerId as independent variables and x-coord and y-coord as dependent variables. How can I go about calculating the maximum distance from the nearest defender during the course of a play?
My guess is I would have to create columns containing the distance from each defender for each offensive player, but I don't know how to name those columns or how to account for an unknown number of defensive/offensive players (the full data set contains thousands of players).
Here is a possible solution; I think there is a way to make it more efficient:
Assuming you have a dataframe called offense_df and a dataframe called defense_df:
from scipy.spatial import distance
merged_dataframe = pd.merge(offense_df,defense_df,on=['GameTime','PlayId'],suffixes=('_off','_def'))
The merged dataframe gives you the answer to your question; basically it will look like this:
GameTime PlayId PlayerId_off x-coord_off y-coord_off PlayerId_def x-coord_def y-coord_def
0 1 1 751 30.2 15.0 117 20.2 20.0
1 1 1 751 30.2 15.0 555 40.1 17.0
2 1 1 419 41.3 15.0 117 20.2 20.0
3 1 1 419 41.3 15.0 555 40.1 17.0
4 1 1 989 10.1 15.0 117 20.2 20.0
The next two lines create a single coordinate column for the offensive player (coord_off) and the defender (coord_def), each containing an (x, y) tuple; this simplifies the distance computation.
merged_dataframe['coord_off'] = merged_dataframe.apply(lambda x: (x['x-coord_off'], x['y-coord_off']),axis=1)
merged_dataframe['coord_def'] = merged_dataframe.apply(lambda x: (x['x-coord_def'], x['y-coord_def']),axis=1)
We compute the distance to every defender at a given GameTime and PlayId.
merged_dataframe['distance_to_def'] = merged_dataframe.apply(lambda x: distance.euclidean(x['coord_off'],x['coord_def']),axis=1)
For each GameTime, PlayId, and PlayerId we take the distance to the nearest defender.
smallest_dist = merged_dataframe.groupby(['GameTime','PlayId','PlayerId_off'])['distance_to_def'].min()
Finally we take the maximum distance (of these minimum distances) for each PlayerId.
smallest_dist.groupby('PlayerId_off').max()
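As a side note, the tuple columns and row-wise apply calls can be slow on large tracking data; here is a vectorized sketch of the same pipeline (assuming the offense_df and defense_df frames above):
import numpy as np
import pandas as pd

merged = pd.merge(offense_df, defense_df, on=['GameTime', 'PlayId'],
                  suffixes=('_off', '_def'))
# Euclidean distance without tuple columns or apply
merged['distance_to_def'] = np.hypot(merged['x-coord_off'] - merged['x-coord_def'],
                                     merged['y-coord_off'] - merged['y-coord_def'])
# nearest defender per offensive player at each time step, then the maximum over the play
result = (merged.groupby(['GameTime', 'PlayId', 'PlayerId_off'])['distance_to_def']
                .min()
                .groupby('PlayerId_off').max())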

Concatenate two dataframes by column

I have 2 dataframes. The first dataframe contains a range of years, each with a count of 0:
year count
0 1890 0
1 1891 0
2 1892 0
3 1893 0
4 1894 0
5 1895 0
6 1896 0
7 1897 0
8 1898 0
9 1899 0
10 1900 0
11 1901 0
12 1902 0
13 1903 0
14 1904 0
15 1905 0
16 1906 0
17 1907 0
18 1908 0
19 1909 0
20 1910 0
21 1911 0
22 1912 0
23 1913 0
24 1914 0
25 1915 0
26 1916 0
27 1917 0
28 1918 0
29 1919 0
.. ... ...
90 1980 0
91 1981 0
92 1982 0
93 1983 0
94 1984 0
95 1985 0
96 1986 0
97 1987 0
98 1988 0
99 1989 0
100 1990 0
101 1991 0
102 1992 0
103 1993 0
104 1994 0
105 1995 0
106 1996 0
107 1997 0
108 1998 0
109 1999 0
110 2000 0
111 2001 0
112 2002 0
113 2003 0
114 2004 0
115 2005 0
116 2006 0
117 2007 0
118 2008 0
119 2009 0
[120 rows x 2 columns]
The second dataframe has similar columns but covers a smaller set of years, with nonzero counts:
year count
0 1970 1
1 1957 7
2 1947 19
3 1987 12
4 1979 7
5 1940 1
6 1950 19
7 1972 4
8 1954 15
9 1976 15
10 2006 3
11 1963 16
12 1980 6
13 1956 13
14 1967 5
15 1893 1
16 1985 5
17 1964 6
18 1949 11
19 1945 15
20 1948 16
21 1959 16
22 1958 12
23 1929 1
24 1965 12
25 1969 15
26 1946 12
27 1961 1
28 1988 1
29 1918 1
30 1999 3
31 1986 3
32 1981 2
33 1960 2
34 1974 4
35 1953 9
36 1968 11
37 1916 2
38 1955 5
39 1978 1
40 2003 1
41 1982 4
42 1984 3
43 1966 4
44 1983 3
45 1962 3
46 1952 4
47 1992 2
48 1973 4
49 1993 10
50 1975 2
51 1900 1
52 1991 1
53 1907 1
54 1977 4
55 1908 1
56 1998 2
57 1997 3
58 1895 1
I want to create a third dataframe, df3. For each row, if the year in df1 also appears in df2, then df3["count"] = df2["count"]; otherwise df3["count"] = df1["count"].
I tried to use join to do this:
df_new = df2.join(df1, on='year', how='left')
df_new['count'] = df_new['count'].fillna(0)
print(df_new)
But got an error:
ValueError: columns overlap but no suffix specified: Index(['year'], dtype='object')
I found a solution to this error (Pandas join issue: columns overlap but no suffix specified) and ran the code with those changes:
df_new = df2.join(df1, on='year', how='left', lsuffix='_left', rsuffix='_right')
df_new['count'] = df_new['count'].fillna(0)
print(df_new)
But the output is not what I want:
count year
0 NaN 1890
1 NaN 1891
2 NaN 1892
3 NaN 1893
4 NaN 1894
5 NaN 1895
6 NaN 1896
7 NaN 1897
8 NaN 1898
9 NaN 1899
10 NaN 1900
11 NaN 1901
12 NaN 1902
13 NaN 1903
14 NaN 1904
15 NaN 1905
16 NaN 1906
17 NaN 1907
18 NaN 1908
19 NaN 1909
20 NaN 1910
21 NaN 1911
22 NaN 1912
23 NaN 1913
24 NaN 1914
25 NaN 1915
26 NaN 1916
27 NaN 1917
28 NaN 1918
29 NaN 1919
.. ... ...
29 1.0 1918
30 3.0 1999
31 3.0 1986
32 2.0 1981
33 2.0 1960
34 4.0 1974
35 9.0 1953
36 11.0 1968
37 2.0 1916
38 5.0 1955
39 1.0 1978
40 1.0 2003
41 4.0 1982
42 3.0 1984
43 4.0 1966
44 3.0 1983
45 3.0 1962
46 4.0 1952
47 2.0 1992
48 4.0 1973
49 10.0 1993
50 2.0 1975
51 1.0 1900
52 1.0 1991
53 1.0 1907
54 4.0 1977
55 1.0 1908
56 2.0 1998
57 3.0 1997
58 1.0 1895
[179 rows x 2 columns]
Desired output is:
year count
0 1890 0
1 1891 0
2 1892 0
3 1893 1
4 1894 0
5 1895 1
6 1896 0
7 1897 0
8 1898 0
9 1899 0
10 1900 1
11 1901 0
12 1902 0
13 1903 0
14 1904 0
15 1905 0
16 1906 0
17 1907 1
18 1908 1
19 1909 0
20 1910 0
21 1911 0
22 1912 0
23 1913 0
24 1914 0
25 1915 0
26 1916 2
27 1917 0
28 1918 1
29 1919 0
.. ... ...
90 1980 6
91 1981 2
92 1982 4
93 1983 3
94 1984 3
95 1985 5
96 1986 3
97 1987 12
98 1988 1
99 1989 0
100 1990 0
101 1991 1
102 1992 2
103 1993 10
104 1994 0
105 1995 0
106 1996 0
107 1997 3
108 1998 2
109 1999 3
110 2000 0
111 2001 0
112 2002 0
113 2003 1
114 2004 0
115 2005 0
116 2006 3
117 2007 0
118 2008 0
119 2009 0
[120 rows x 2 columns]
The issue is that you should set year as the index. In addition, if you don't want to lose data, you should use an outer join instead of a left join.
This is my code:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "year": np.random.randint(1850, 2000, size=(100,)),
    "qty": np.random.randint(0, 10, size=(100,)),
})
df2 = pd.DataFrame({
    "year": np.random.randint(1850, 2000, size=(100,)),
    "qty": np.random.randint(0, 10, size=(100,)),
})
df = df.set_index("year")
df2 = df2.set_index("year")
df3 = df.join(df2["qty"], how = "outer", lsuffix='_left', rsuffix='_right')
df3 = df3.fillna(0)
At this step you have 2 columns with values from df1 or df2. In your merge rule, I don't get what you want. You said:
if df1["qty"] == df2["qty"] => df3["qty"] = df2["qty"]
if df1["qty"] != df2["qty"] => df3["qty"] = df1["qty"]
That means you always want df1["qty"], because when df1["qty"] == df2["qty"] the two values are the same anyway. Am I right?
Just in case: if you want code you can adjust, you can use apply as follows:
def foo(x1, x2):
    if x1 == x2:
        return x2
    else:
        return x1
df3["count"] = df3.apply(lambda row: foo(row["qty_left"], row["qty_left"]), axis=1)
df3.drop(["qty_left","qty_right"], axis = 1, inplace = True)
I hope it helps,
Nicolas
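For the exact desired output above, a shorter sketch (assuming df1 is the zero-filled 120-row frame and df2 the sparse one from the question) maps df2's counts onto df1's years:
df3 = df1.copy()
# years present in df2 take df2's count; all other years keep df1's zeros
df3['count'] = (df3['year'].map(df2.set_index('year')['count'])
                           .fillna(df3['count'])
                           .astype(int))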

Lookup Pandas Dataframe comparing different size data frames

I have two pandas df that look like this
df1
Amount Price
0 5 50
1 10 53
2 15 55
3 30 50
4 45 61
df2
Used amount
0 4.5
1 1.2
2 6.2
3 4.1
4 25.6
5 31
6 19
7 15
I am trying to insert a new column into df2 that provides the price from df1. df1 and df2 have different sizes; df1 is smaller.
I am expecting something like this
df3
Used amount price
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31 61
6 19 50
7 15 55
I am thinking of solving this with something like this function:
def price_function(key, table):
    used_amount_df2 = (row[0] for row in df1)
    price = filter(lambda x: x < key, used_amount_df1)
Here is my own solution
1st approach:
from itertools import product
import pandas as pd
df2 = df2.reset_index()
DF = pd.DataFrame(list(product(df2['Used amount'], df1.Amount)), columns=['l1', 'l2'])
DF['DIFF'] = DF.l1 - DF.l2
DF = DF.loc[DF.DIFF <= 0]
DF = DF.sort_values(['l1', 'DIFF'], ascending=[True, False]).drop_duplicates(['l1'], keep='first')
df1.merge(DF, left_on='Amount', right_on='l2', how='left')\
   .merge(df2, left_on='l1', right_on='Used amount', how='right')\
   .loc[:, ['index', 'Used amount', 'Price']].set_index('index').sort_index()
Out[185]:
Used amount Price
index
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
2nd approach, using pd.merge_asof (I recommend this); with direction='forward', each Used amount is matched to the first Amount greater than or equal to it:
df2 = df2.rename(columns={'Used amount': 'Amount'}).sort_values('Amount')
df2=df2.reset_index()
pd.merge_asof(df2,df1,on='Amount',allow_exact_matches=True,direction='forward')\
.set_index('index').sort_index()
Out[206]:
Amount Price
index
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
Using pd.IntervalIndex you can:
In [468]: df1.index = pd.IntervalIndex.from_arrays(df1.Amount.shift().fillna(0),df1.Amount)
In [469]: df1
Out[469]:
Amount Price
(0.0, 5.0] 5 50
(5.0, 10.0] 10 53
(10.0, 15.0] 15 55
(15.0, 30.0] 30 50
(30.0, 45.0] 45 61
In [470]: df2['price'] = df2['Used amount'].map(df1.Price)
In [471]: df2
Out[471]:
Used amount price
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
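One caveat with this IntervalIndex lookup (a sketch under the same frames): a Used amount outside the outermost interval, here anything above 45 or exactly 0, falls in no bin and maps to NaN, so a fill may be needed:
df2['price'] = df2['Used amount'].map(df1.Price)
# values outside every interval come back as NaN; clamping to the last price is one option
df2['price'] = df2['price'].fillna(df1['Price'].iloc[-1])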
You can use cut or searchsorted to create bins.
Notice: the index of df1 has to be the default (0, 1, 2, ...).
# create a default index if necessary
df1 = df1.reset_index(drop=True)
# create bins
bins = [0] + df1['Amount'].tolist()
# get index values of df1 from the values of Used amount
a = pd.cut(df2['Used amount'], bins=bins, labels=df1.index)
# assign output
df2['price'] = df1['Price'].values[a]
print (df2)
Used amount price
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
a = df1['Amount'].searchsorted(df2['Used amount'])
df2['price'] = df1['Price'].values[a]
print (df2)
Used amount price
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
You can use pd.DataFrame.reindex with method='bfill':
df1.set_index('Amount').reindex(df2['Used amount'], method='bfill')
Price
Used amount
4.5 50
1.2 50
6.2 53
4.1 50
25.6 50
31.0 61
19.0 50
15.0 55
To add that as a new column we can use join:
df2.join(
df1.set_index('Amount').reindex(df2['Used amount'], method='bfill'),
on='Used amount'
)
Used amount Price
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
Or use assign:
df2.assign(
Price=df1.set_index('Amount').reindex(df2['Used amount'], method='bfill').values)
Used amount Price
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55

Grep rows with a length of 3

Hi, I have a table which looks like this:
chr10 84890986 84891021 2 17.5 2 93 0 61 48 2 48 0 1.16 GA
chr10 84897562 84897613 2 25.5 2 100 0 102 50 49 0 0 1 AC
chr10 84899819 84899844 2 12.5 2 100 0 50 0 0 52 48 1 GT
chr10 84905282 84905318 6 5.8 6 87 6 54 80 19 0 0 0.71 AAAAAC
chr10 84955235 84955267 2 16 2 100 0 64 50 0 0 50 1 AT
chr10 84972254 84972288 2 17 2 93 0 59 2 0 47 50 1.16 GT
chr10 85011399 85011478 3 25.7 3 80 12 63 58 1 40 0 1.06 GAA
chr10 85011461 85011525 3 20.7 3 87 6 74 39 0 60 0 0.97 GAG
chr10 85014721 85014841 5 23.8 5 78 8 66 0 69 0 29 1 TTCCC
chr10 85021530 85021701 5 38.4 5 84 13 53 74 0 24 0 0.85 AAGAG
chr10 85045413 85045440 3 9 3 100 0 54 66 33 0 0 0.92 CAA
chr10 85059334 85059364 5 6 5 92 0 51 20 3 0 76 0.92 ATTTT
chr10 85072010 85072038 2 14 2 100 0 56 50 50 0 0 1 CA
chr10 85072037 85072077 4 10 4 84 10 55 25 22 0 52 1.47 ATCT
chr10 85084308 85084338 6 5 6 91 0 51 83 13 3 0 0.77 CAAAAA
chr10 85096597 85096640 3 14.7 3 95 4 79 69 30 0 0 0.88 AAC
chr10 85151154 85151190 6 6.5 6 87 12 51 0 11 0 88 0.5 TTTCTT
chr10 85168255 85168320 4 16.2 4 100 0 130 50 0 49 0 1 AGGA
chr10 85173155 85173184 2 14.5 2 100 0 58 48 0 0 51 1 TA
chr10 85196836 85196861 2 12.5 2 100 0 50 52 48 0 0 1 AC
chr10 85215511 85215546 2 17.5 2 100 0 70 51 48 0 0 1 AC
chr10 85225048 85225075 2 13.5 2 100 0 54 51 48 0 0 1 AC
chr10 85242322 85242357 2 17.5 2 93 0 61 0 2 48 48 1.16 TG
chr10 85245934 85245981 4 11 4 79 20 51 27 2 0 70 0.99 ATTT
chr10 85249139 85249230 5 18.8 5 88 6 116 0 60 0 39 0.97 TTCCC
chr10 85251100 85251153 5 11 5 97 2 92 0 0 37 62 0.96 GTTTG
chr10 85268725 85268752 4 6.8 4 100 0 54 0 25 0 74 0.83 CTTT
chr10 85268767 85268798 4 7.8 4 100 0 62 0 0 22 77 0.77 TTTG
chr10 85269189 85269239 6 8.8 6 79 16 54 84 2 12 2 0.8 AAAAGA
chr10 85330217 85330253 2 18 2 100 0 72 0 0 50 50 1 TG
chr10 85332256 85332314 4 15 4 82 7 75 70 1 27 0 0.97 AAGA
chr10 85337969 85337996 2 13.5 2 100 0 54 0 0 48 51 1 TG
chr10 85344795 85344957 2 75.5 2 83 12 198 45 4 3 45 1.42 TA
chr10 85349732 85349765 5 6.8 5 93 6 59 84 15 0 0 0.61 AAAAC
chr10 85353082 85353109 5 5.4 5 100 0 54 0 22 18 59 1.38 CTGTT
I want to extract all rows that have 3 and only 3 characters in the last column. My try so far is this:
grep -E "['ACTG']['ACTG']['ACTG']{1,3}$"
But this gives me everything of length 3 and longer. I have tried many different combinations but nothing seems to give me what I want. Any ideas?
If you'd like to try awk, you can do:
awk '$NF~/\<...\>/' file
chr10 85011399 85011478 3 25.7 3 80 12 63 58 1 40 0 1.06 GAA
chr10 85011461 85011525 3 20.7 3 87 6 74 39 0 60 0 0.97 GAG
chr10 85045413 85045440 3 9 3 100 0 54 66 33 0 0 0.92 CAA
chr10 85096597 85096640 3 14.7 3 95 4 79 69 30 0 0 0.88 AAC
It tests whether the last field, $NF, consists of exactly 3 characters (the three dots), bounded by the word boundaries \< and \>.
This regex would also do: awk '$NF~/^...$/'
Or, if you need the exact characters (PS: this needs gawk 4.x, or the --re-interval switch):
awk '$NF~/^[ACTG]{3}$/' file
Using grep
grep -E " [ACTG]{3}$" file
chr10 85011399 85011478 3 25.7 3 80 12 63 58 1 40 0 1.06 GAA
chr10 85011461 85011525 3 20.7 3 87 6 74 39 0 60 0 0.97 GAG
chr10 85045413 85045440 3 9 3 100 0 54 66 33 0 0 0.92 CAA
chr10 85096597 85096640 3 14.7 3 95 4 79 69 30 0 0 0.88 AAC
You need the space to separate the last column, and {3} to match 3 and only 3 characters.
If you want to print the rows which have exactly three chars in the last column, then you could use the grep command below.
grep -E " [ACTG]{3}$"
[ACTG]{3} Matches exactly three characters from the given list.
You have to grep either " [ACTG][ACTG][ACTG]$" or " [ACTG]{3}$".
Currently, you are matching 3 to 5 characters: two from the first two bracket expressions, plus 1 to 3 from the {1,3} quantifier.
Also, the quotes inside the brackets are unnecessary: ['ACTG'] means "match any one character between []", i.e. any of the 5 characters 'ACTG. Just grep " [ACTG]{3}$".
Be sure to use a delimiter for the left part (a space ' ', a tab \t if the file is tab-delimited, or a word boundary \b or \W).
If your lines all end with [ACTG]+, you can even just grep -E "\W.{3}$"
Another way that you could do this would be using awk:
$ awk '$NF ~ /^[ACTG][ACTG][ACTG]$/' file
chr10 85011399 85011478 3 25.7 3 80 12 63 58 1 40 0 1.06 GAA
chr10 85011461 85011525 3 20.7 3 87 6 74 39 0 60 0 0.97 GAG
chr10 85045413 85045440 3 9 3 100 0 54 66 33 0 0 0.92 CAA
chr10 85096597 85096640 3 14.7 3 95 4 79 69 30 0 0 0.88 AAC
This prints all lines whose last field exactly matches 3 of the characters "A", "C", "T" or "G".
Two hours late, but this is one way in awk.
This can be easily edited for different lengths and fields.
awk 'length($NF)==3' file
As I was looking for answers myself, I found that a Perl-compatible regex also does the deal: grep -P '\t...$' (assuming the columns are tab-separated). Much more compact code.
$ cat roi_new.bed | grep -P "\t...$"
chr10 81038152 81038182 3 9.7 3 92 7 51 30 0 0 70 0.88 TTA
chr10 81272294 81272320 3 8.7 3 100 0 52 0 30 69 0 0.89 GGC
chr10 81287690 81287720 3 10 3 100 0 60 66 33 0 0 0.92 CAA
