Pandas Computing On Multidimensional Data - python-3.x

I have two data frames storing tracking data of offensive and defensive players during an NFL game. My goal is to calculate the maximum distance between an offensive player and the nearest defender during the course of the play.
As a simple example, I've made up some data where there are only three offensive players and two defensive players. Here is the data:
Defense
GameTime PlayId PlayerId x-coord y-coord
0 1 1 117 20.2 20.0
1 2 1 117 21.0 19.1
2 3 1 117 21.3 18.3
3 4 1 117 22.0 17.5
4 5 1 117 22.5 17.2
5 6 1 117 23.0 16.9
6 7 1 117 23.6 16.7
7 8 2 117 25.1 34.1
8 9 2 117 25.9 34.2
9 10 2 117 24.1 34.5
10 11 2 117 22.7 34.2
11 12 2 117 21.5 34.5
12 13 2 117 21.1 37.3
13 14 3 117 21.2 44.3
14 15 3 117 20.4 44.6
15 16 3 117 21.9 42.7
16 17 3 117 21.1 41.9
17 18 3 117 20.1 41.7
18 19 3 117 20.1 41.3
19 1 1 555 40.1 17.0
20 2 1 555 40.7 18.3
21 3 1 555 41.0 19.6
22 4 1 555 41.5 18.4
23 5 1 555 42.6 18.4
24 6 1 555 43.8 18.0
25 7 1 555 44.2 15.8
26 8 2 555 41.2 37.1
27 9 2 555 42.3 36.5
28 10 2 555 45.6 36.3
29 11 2 555 47.9 35.6
30 12 2 555 47.4 31.3
31 13 2 555 46.8 31.5
32 14 3 555 47.3 40.3
33 15 3 555 47.2 40.6
34 16 3 555 44.5 40.8
35 17 3 555 46.5 41.0
36 18 3 555 47.6 41.4
37 19 3 555 47.6 41.5
Offense
GameTime PlayId PlayerId x-coord y-coord
0 1 1 751 30.2 15.0
1 2 1 751 31.0 15.1
2 3 1 751 31.3 15.3
3 4 1 751 32.0 15.5
4 5 1 751 31.5 15.7
5 6 1 751 33.0 15.9
6 7 1 751 32.6 15.7
7 8 2 751 51.1 30.1
8 9 2 751 51.9 30.2
9 10 2 751 51.1 30.5
10 11 2 751 49.7 30.6
11 12 2 751 49.5 30.9
12 13 2 751 49.1 31.3
13 14 3 751 12.2 40.3
14 15 3 751 12.4 40.5
15 16 3 751 12.9 40.7
16 17 3 751 13.1 40.9
17 18 3 751 13.1 41.1
18 19 3 751 13.1 41.3
19 1 1 419 41.3 15.0
20 2 1 419 41.7 15.3
21 3 1 419 41.8 15.4
22 4 1 419 42.9 15.6
23 5 1 419 42.6 15.6
24 6 1 419 44.8 16.0
25 7 1 419 45.2 15.8
26 8 2 419 62.2 30.1
27 9 2 419 63.3 30.5
28 10 2 419 62.6 31.0
29 11 2 419 63.9 30.6
30 12 2 419 67.4 31.3
31 13 2 419 66.8 31.5
32 14 3 419 30.3 40.3
33 15 3 419 30.2 40.6
34 16 3 419 30.5 40.8
35 17 3 419 30.5 41.0
36 18 3 419 31.6 41.4
37 19 3 419 31.6 41.5
38 1 1 989 10.1 15.0
39 2 1 989 10.2 15.5
40 3 1 989 10.4 15.4
41 4 1 989 10.5 15.8
42 5 1 989 10.6 15.9
43 6 1 989 10.1 15.5
44 7 1 989 10.9 15.3
45 8 2 989 25.8 30.1
46 9 2 989 25.2 30.1
47 10 2 989 21.8 30.2
48 11 2 989 25.8 30.2
49 12 2 989 25.6 30.5
50 13 2 989 25.5 31.0
51 14 3 989 50.3 40.3
52 15 3 989 50.3 40.2
53 16 3 989 50.2 40.4
54 17 3 989 50.1 40.8
55 18 3 989 50.6 41.2
56 19 3 989 51.4 41.6
The data is essentially multidimensional, with GameTime, PlayId, and PlayerId as independent variables and x-coord and y-coord as dependent variables. How can I go about calculating the maximum distance from the nearest defender during the course of a play?
My guess is that I would have to create columns containing the distance from each defender for each offensive player, but I don't know how to name those columns or how to account for an unknown number of defensive/offensive players (the full data set contains thousands of players).

Here is a possible solution; I think there is a way to make it more efficient:
Assuming you have a dataframe called offense_df and a dataframe called defense_df:
from scipy.spatial import distance
merged_dataframe = pd.merge(offense_df, defense_df, on=['GameTime','PlayId'], suffixes=('_off','_def'))
The merged dataframe gets you to the answer: it pairs every offensive player with every defender at each GameTime and PlayId, producing the following dataframe:
GameTime PlayId PlayerId_off x-coord_off y-coord_off PlayerId_def x-coord_def y-coord_def
0 1 1 751 30.2 15.0 117 20.2 20.0
1 1 1 751 30.2 15.0 555 40.1 17.0
2 1 1 419 41.3 15.0 117 20.2 20.0
3 1 1 419 41.3 15.0 555 40.1 17.0
4 1 1 989 10.1 15.0 117 20.2 20.0
The next two lines create a single coordinate column for each side: coord_off for the offensive player and coord_def for the defender, each containing an (x, y) tuple. This simplifies the distance computation.
merged_dataframe['coord_off'] = merged_dataframe.apply(lambda x: (x['x-coord_off'], x['y-coord_off']),axis=1)
merged_dataframe['coord_def'] = merged_dataframe.apply(lambda x: (x['x-coord_def'], x['y-coord_def']),axis=1)
We compute the distance to every defender at a given GameTime and PlayId.
merged_dataframe['distance_to_def'] = merged_dataframe.apply(lambda x: distance.euclidean(x['coord_off'],x['coord_def']),axis=1)
For each (GameTime, PlayId, PlayerId_off) we take the distance to the nearest defender.
smallest_dist = merged_dataframe.groupby(['GameTime','PlayId','PlayerId_off'])['distance_to_def'].min()
Finally we take the maximum distance (of these minimum distances) for each PlayerId.
smallest_dist.groupby('PlayerId_off').max()
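Since the answer notes there may be a more efficient way, here is a vectorized sketch of the same pipeline that drops the row-wise apply calls and the tuple columns (assuming the same offense_df/defense_df layout as above):
import numpy as np
import pandas as pd

# One row per offense/defense pairing per (GameTime, PlayId), as before.
merged = pd.merge(offense_df, defense_df, on=['GameTime','PlayId'], suffixes=('_off','_def'))

# Vectorized Euclidean distance: no per-row apply needed.
merged['distance_to_def'] = np.hypot(merged['x-coord_off'] - merged['x-coord_def'],
                                     merged['y-coord_off'] - merged['y-coord_def'])

# Nearest defender per offensive player at each time step, then the
# maximum of those minima over the course of the play.
smallest_dist = merged.groupby(['GameTime','PlayId','PlayerId_off'])['distance_to_def'].min()
print(smallest_dist.groupby('PlayerId_off').max())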

Related

Concatenate two dataframes by column

I have 2 dataframes. The first dataframe contains years and a count column filled with 0:
year count
0 1890 0
1 1891 0
2 1892 0
3 1893 0
4 1894 0
5 1895 0
6 1896 0
7 1897 0
8 1898 0
9 1899 0
10 1900 0
11 1901 0
12 1902 0
13 1903 0
14 1904 0
15 1905 0
16 1906 0
17 1907 0
18 1908 0
19 1909 0
20 1910 0
21 1911 0
22 1912 0
23 1913 0
24 1914 0
25 1915 0
26 1916 0
27 1917 0
28 1918 0
29 1919 0
.. ... ...
90 1980 0
91 1981 0
92 1982 0
93 1983 0
94 1984 0
95 1985 0
96 1986 0
97 1987 0
98 1988 0
99 1989 0
100 1990 0
101 1991 0
102 1992 0
103 1993 0
104 1994 0
105 1995 0
106 1996 0
107 1997 0
108 1998 0
109 1999 0
110 2000 0
111 2001 0
112 2002 0
113 2003 0
114 2004 0
115 2005 0
116 2006 0
117 2007 0
118 2008 0
119 2009 0
[120 rows x 2 columns]
The second dataframe has similar columns but covers a smaller number of years, with non-zero counts:
year count
0 1970 1
1 1957 7
2 1947 19
3 1987 12
4 1979 7
5 1940 1
6 1950 19
7 1972 4
8 1954 15
9 1976 15
10 2006 3
11 1963 16
12 1980 6
13 1956 13
14 1967 5
15 1893 1
16 1985 5
17 1964 6
18 1949 11
19 1945 15
20 1948 16
21 1959 16
22 1958 12
23 1929 1
24 1965 12
25 1969 15
26 1946 12
27 1961 1
28 1988 1
29 1918 1
30 1999 3
31 1986 3
32 1981 2
33 1960 2
34 1974 4
35 1953 9
36 1968 11
37 1916 2
38 1955 5
39 1978 1
40 2003 1
41 1982 4
42 1984 3
43 1966 4
44 1983 3
45 1962 3
46 1952 4
47 1992 2
48 1973 4
49 1993 10
50 1975 2
51 1900 1
52 1991 1
53 1907 1
54 1977 4
55 1908 1
56 1998 2
57 1997 3
58 1895 1
I want to create a third dataframe, df3. For each row, if the year appears in both df1 and df2, then df3["count"] = df2["count"]; otherwise df3["count"] = df1["count"].
I tried to use join to do this:
df_new = df2.join(df1, on='year', how='left')
df_new['count'] = df_new['count'].fillna(0)
print(df_new)
But got an error:
ValueError: columns overlap but no suffix specified: Index(['year'], dtype='object')
I found the solution to this error (Pandas join issue: columns overlap but no suffix specified) and ran the code with those changes:
df_new = df2.join(df1, on='year', how='left', lsuffix='_left', rsuffix='_right')
df_new['count'] = df_new['count'].fillna(0)
print(df_new)
But the output is not what I want:
count year
0 NaN 1890
1 NaN 1891
2 NaN 1892
3 NaN 1893
4 NaN 1894
5 NaN 1895
6 NaN 1896
7 NaN 1897
8 NaN 1898
9 NaN 1899
10 NaN 1900
11 NaN 1901
12 NaN 1902
13 NaN 1903
14 NaN 1904
15 NaN 1905
16 NaN 1906
17 NaN 1907
18 NaN 1908
19 NaN 1909
20 NaN 1910
21 NaN 1911
22 NaN 1912
23 NaN 1913
24 NaN 1914
25 NaN 1915
26 NaN 1916
27 NaN 1917
28 NaN 1918
29 NaN 1919
.. ... ...
29 1.0 1918
30 3.0 1999
31 3.0 1986
32 2.0 1981
33 2.0 1960
34 4.0 1974
35 9.0 1953
36 11.0 1968
37 2.0 1916
38 5.0 1955
39 1.0 1978
40 1.0 2003
41 4.0 1982
42 3.0 1984
43 4.0 1966
44 3.0 1983
45 3.0 1962
46 4.0 1952
47 2.0 1992
48 4.0 1973
49 10.0 1993
50 2.0 1975
51 1.0 1900
52 1.0 1991
53 1.0 1907
54 4.0 1977
55 1.0 1908
56 2.0 1998
57 3.0 1997
58 1.0 1895
[179 rows x 2 columns]
Desired output is:
year count
0 1890 0
1 1891 0
2 1892 0
3 1893 1
4 1894 0
5 1895 1
6 1896 0
7 1897 0
8 1898 0
9 1899 0
10 1900 1
11 1901 0
12 1902 0
13 1903 0
14 1904 0
15 1905 0
16 1906 0
17 1907 1
18 1908 1
19 1909 0
20 1910 0
21 1911 0
22 1912 0
23 1913 0
24 1914 0
25 1915 0
26 1916 2
27 1917 0
28 1918 1
29 1919 0
.. ... ...
90 1980 6
91 1981 2
92 1982 4
93 1983 3
94 1984 3
95 1985 5
96 1986 3
97 1987 12
98 1988 1
99 1989 0
100 1990 0
101 1991 1
102 1992 2
103 1993 10
104 1994 0
105 1995 0
106 1996 0
107 1997 3
108 1998 2
109 1999 3
110 2000 0
111 2001 0
112 2002 0
113 2003 1
114 2004 0
115 2005 0
116 2006 3
117 2007 0
118 2008 0
119 2009 0
[120 rows x 2 columns]
The issue is that you should set year as the index. In addition, if you don't want to lose data, you should join with how='outer' instead of 'left'.
This is my code:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "year": np.random.randint(1850, 2000, size=(100,)),
    "qty": np.random.randint(0, 10, size=(100,)),
})
df2 = pd.DataFrame({
    "year": np.random.randint(1850, 2000, size=(100,)),
    "qty": np.random.randint(0, 10, size=(100,)),
})
df = df.set_index("year")
df2 = df2.set_index("year")
df3 = df.join(df2["qty"], how="outer", lsuffix="_left", rsuffix="_right")
df3 = df3.fillna(0)
At this step you have 2 columns with values from df1 or df2. In your merge rule, I don't get what you want. You said:
if df1["qty"] == df2["qty"] => df3["qty"] = df2["qty"]
if df1["qty"] != df2["qty"] => df3["qty"] = df1["qty"]
That means you always want df1["qty"], because when df1["qty"] == df2["qty"] they are the same value anyway. Am I right?
Just in case, if you want code you can adjust, you can use apply as follows:
def foo(x1, x2):
    if x1 == x2:
        return x2
    else:
        return x1

df3["count"] = df3.apply(lambda row: foo(row["qty_left"], row["qty_right"]), axis=1)
df3.drop(["qty_left", "qty_right"], axis=1, inplace=True)
I hope it helps,
Nicolas
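For the rule as originally stated (take df2's count where the year exists in df2, otherwise keep df1's count), a shorter alternative sketch using combine_first, assuming year is unique within each frame:
import pandas as pd

# Index both frames by year so values align by year, not by row position.
s1 = df1.set_index('year')['count']
s2 = df2.set_index('year')['count']

# combine_first prefers df2's value where present and falls back to df1.
df3 = s2.combine_first(s1).astype(int).reset_index()
print(df3)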

Why am I getting a too many indexers error?

cars_df = pd.DataFrame((car.iloc[:[1,3,4,6]].values), columns = ['mpg', 'dip', 'hp', 'wt'])
car_t = car.iloc[:9].values
target_names = [0,1]
car_df['group'] = pd.series(car_t, dtypre='category')
sb.pairplot(cars_df)
I have tried using .iloc(axis=0)[xxxx] and making the slice into a list and a tuple. No dice. Any thoughts? I am trying to make a scatter plot from a lynda.com video, but in the video the host uses .ix, which is deprecated, so I am using .iloc[]
car = a dataframe
a few lines of data
"Car_name","mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
"Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
"Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
"Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
"Hornet 4 Drive",21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
"Hornet Sportabout",18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
"Valiant",18.1,6,225,105,2.76,3.46,20.22,1,0,3,1
"Duster 360",14.3,8,360,245,3.21,3.57,15.84,0,0,3,4
"Merc 240D",24.4,4,146.7,62,3.69,3.19,20,1,0,4,2
"Merc 230",22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
"Merc 280",19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4
"Merc 280C",17.8,6,167.6,123,3.92,3.44,18.9,1,0,4,4
"Merc 450SE",16.4,8,275.8,180,3.07,4.07,17.4,0,0,3,3
I think you want to select multiple columns with iloc. Note the comma: ':' selects all rows and the list selects the columns; without the comma, '.iloc[:[1,3,4,6]]' is parsed as a single slice whose stop value is a list, which pandas cannot interpret:
cars_df = car.iloc[:, [1,3,4,6]]
print (cars_df)
mpg disp hp wt
0 21.0 160.0 110 2.620
1 21.0 160.0 110 2.875
2 22.8 108.0 93 2.320
3 21.4 258.0 110 3.215
4 18.7 360.0 175 3.440
5 18.1 225.0 105 3.460
6 14.3 360.0 245 3.570
7 24.4 146.7 62 3.190
8 22.8 140.8 95 3.150
9 19.2 167.6 123 3.440
10 17.8 167.6 123 3.440
11 16.4 275.8 180 4.070
sb.pairplot(cars_df)
Not 100% sure about the rest of the code, but it seems you need:
#select the 9th column as well
cars_df = car.iloc[:, [1,3,4,6,9]]
#rename the 9th column
cars_df = cars_df.rename(columns={'am':'group'})
#convert it to categorical
cars_df['group'] = pd.Categorical(cars_df['group'])
print (cars_df)
mpg disp hp wt group
0 21.0 160.0 110 2.620 1
1 21.0 160.0 110 2.875 1
2 22.8 108.0 93 2.320 1
3 21.4 258.0 110 3.215 0
4 18.7 360.0 175 3.440 0
5 18.1 225.0 105 3.460 0
6 14.3 360.0 245 3.570 0
7 24.4 146.7 62 3.190 0
8 22.8 140.8 95 3.150 0
9 19.2 167.6 123 3.440 0
10 17.8 167.6 123 3.440 0
11 16.4 275.8 180 4.070 0
#add parameter hue for different levels of a categorical variable
sb.pairplot(cars_df, hue='group')
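Putting the pieces together, a corrected version of the original snippet might look like this (a sketch; the CSV filename is an assumption):
import pandas as pd
import seaborn as sb

car = pd.read_csv('cars.csv')  # the data shown above

# all rows (:), columns 1, 3, 4, 6 -- note the comma inside iloc
cars_df = car.iloc[:, [1, 3, 4, 6]].copy()

# column 9 is 'am'; note it is pd.Series with dtype='category',
# not pd.series(..., dtypre='category') as in the question
cars_df['group'] = pd.Series(car.iloc[:, 9].values, dtype='category')

sb.pairplot(cars_df, hue='group')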

Add new column to data based on first column record

I have the following data:
Cow_ID Age DIM MY MCF MCP MCL Cow_Order
26 1424 0 NA NA 0.0336 0.0505
26 1425 1 NA 0.0404 0.0338 0.0505
26 1426 2 NA 0.0388 0.0337 0.0505
26 1427 3 NA 0.0391 0.0337 0.0505
26 1428 4 35.2 0.0393 0.0337 0.0505
35 1432 8 34.7 0.0396 0.0337 0.0505
35 1433 9 33.6 0.0397 0.0337 0.0505
35 1434 10 32.8 0.0397 0.0337 0.0505
35 1435 11 33.7 0.0388 0.0337 0.0505
47 1439 15 30.8 0.0391 0.0337 0.0505
47 1440 16 31.3 0.0387 0.0337 0.0505
47 1441 17 33.7 0.0392 0.0337 0.0505
47 1442 18 30.2 0.0392 0.0337 0.0505
47 1443 19 34.1 0.0393 0.0337 0.0505
47 1444 20 33.3 0.0339 0.0286 0.0495
What I would like to do is add an order from 1...1000 (in my complete data) into the column named Cow_Order, based on the first column, Cow_ID.
The final data should look like:
Cow_ID Age DIM MY MCF MCP MCL Cow_Order
26 1424 0 NA NA 0.0336 0.0505 1
26 1425 1 NA 0.0404 0.0338 0.0505 1
26 1426 2 NA 0.0388 0.0337 0.0505 1
26 1427 3 NA 0.0391 0.0337 0.0505 1
26 1428 4 35.2 0.0393 0.0337 0.0505 1
35 1432 8 34.7 0.0396 0.0337 0.0505 2
35 1433 9 33.6 0.0397 0.0337 0.0505 2
35 1434 10 32.8 0.0397 0.0337 0.0505 2
35 1435 11 33.7 0.0388 0.0337 0.0505 2
47 1439 15 30.8 0.0391 0.0337 0.0505 3
47 1440 16 31.3 0.0387 0.0337 0.0505 3
47 1441 17 33.7 0.0392 0.0337 0.0505 3
47 1442 18 30.2 0.0392 0.0337 0.0505 3
47 1443 19 34.1 0.0393 0.0337 0.0505 3
47 1444 20 33.3 0.0339 0.0286 0.0495 3
Thanks
Run 'man awk' on any Linux system and you should get the awk man page. GNU's improved version is called gawk; beginners may not see much difference between the two, though advanced users certainly do.
generate | awk '
/Cow_ID/ {print "\t" $0, "Cow_Order"; next;}
{ if ( $1 != Cow_last ) {
Cow_Order++;
Cow_last = $1;
}
print $0, Cow_Order
}'
Look at awk's printf() function if you want to format the output into neat columns; there are many other ways as well.
If the data is in a file named testfile, try this:
count=1
for cow_id in $(awk 'FNR > 1 {print $1}' testfile | sort | uniq); do
    awk -v cid="$cow_id" -v orderid="$count" '$1 == cid {print $0 "\t" orderid}' testfile
    ((count++))
done
The output:
26 1424 0 NA NA 0.0336 0.0505 1
26 1425 1 NA 0.0404 0.0338 0.0505 1
26 1426 2 NA 0.0388 0.0337 0.0505 1
26 1427 3 NA 0.0391 0.0337 0.0505 1
26 1428 4 35.2 0.0393 0.0337 0.0505 1
35 1432 8 34.7 0.0396 0.0337 0.0505 2
35 1433 9 33.6 0.0397 0.0337 0.0505 2
35 1434 10 32.8 0.0397 0.0337 0.0505 2
35 1435 11 33.7 0.0388 0.0337 0.0505 2
47 1439 15 30.8 0.0391 0.0337 0.0505 3
47 1440 16 31.3 0.0387 0.0337 0.0505 3
47 1441 17 33.7 0.0392 0.0337 0.0505 3
47 1442 18 30.2 0.0392 0.0337 0.0505 3
47 1443 19 34.1 0.0393 0.0337 0.0505 3
47 1444 20 33.3 0.0339 0.0286 0.0495 3
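For comparison, the same dense ranking can also be done in pandas (a sketch, assuming the data is whitespace-delimited in testfile with the header row shown):
import pandas as pd

df = pd.read_csv('testfile', sep=r'\s+')

# ngroup() numbers the groups 0..n-1 in order of appearance; +1 gives 1..n.
df['Cow_Order'] = df.groupby('Cow_ID', sort=False).ngroup() + 1
print(df.to_string(index=False))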

Grep rows with a length of 3

Hi, I have a table which looks like this:
chr10 84890986 84891021 2 17.5 2 93 0 61 48 2 48 0 1.16 GA
chr10 84897562 84897613 2 25.5 2 100 0 102 50 49 0 0 1 AC
chr10 84899819 84899844 2 12.5 2 100 0 50 0 0 52 48 1 GT
chr10 84905282 84905318 6 5.8 6 87 6 54 80 19 0 0 0.71 AAAAAC
chr10 84955235 84955267 2 16 2 100 0 64 50 0 0 50 1 AT
chr10 84972254 84972288 2 17 2 93 0 59 2 0 47 50 1.16 GT
chr10 85011399 85011478 3 25.7 3 80 12 63 58 1 40 0 1.06 GAA
chr10 85011461 85011525 3 20.7 3 87 6 74 39 0 60 0 0.97 GAG
chr10 85014721 85014841 5 23.8 5 78 8 66 0 69 0 29 1 TTCCC
chr10 85021530 85021701 5 38.4 5 84 13 53 74 0 24 0 0.85 AAGAG
chr10 85045413 85045440 3 9 3 100 0 54 66 33 0 0 0.92 CAA
chr10 85059334 85059364 5 6 5 92 0 51 20 3 0 76 0.92 ATTTT
chr10 85072010 85072038 2 14 2 100 0 56 50 50 0 0 1 CA
chr10 85072037 85072077 4 10 4 84 10 55 25 22 0 52 1.47 ATCT
chr10 85084308 85084338 6 5 6 91 0 51 83 13 3 0 0.77 CAAAAA
chr10 85096597 85096640 3 14.7 3 95 4 79 69 30 0 0 0.88 AAC
chr10 85151154 85151190 6 6.5 6 87 12 51 0 11 0 88 0.5 TTTCTT
chr10 85168255 85168320 4 16.2 4 100 0 130 50 0 49 0 1 AGGA
chr10 85173155 85173184 2 14.5 2 100 0 58 48 0 0 51 1 TA
chr10 85196836 85196861 2 12.5 2 100 0 50 52 48 0 0 1 AC
chr10 85215511 85215546 2 17.5 2 100 0 70 51 48 0 0 1 AC
chr10 85225048 85225075 2 13.5 2 100 0 54 51 48 0 0 1 AC
chr10 85242322 85242357 2 17.5 2 93 0 61 0 2 48 48 1.16 TG
chr10 85245934 85245981 4 11 4 79 20 51 27 2 0 70 0.99 ATTT
chr10 85249139 85249230 5 18.8 5 88 6 116 0 60 0 39 0.97 TTCCC
chr10 85251100 85251153 5 11 5 97 2 92 0 0 37 62 0.96 GTTTG
chr10 85268725 85268752 4 6.8 4 100 0 54 0 25 0 74 0.83 CTTT
chr10 85268767 85268798 4 7.8 4 100 0 62 0 0 22 77 0.77 TTTG
chr10 85269189 85269239 6 8.8 6 79 16 54 84 2 12 2 0.8 AAAAGA
chr10 85330217 85330253 2 18 2 100 0 72 0 0 50 50 1 TG
chr10 85332256 85332314 4 15 4 82 7 75 70 1 27 0 0.97 AAGA
chr10 85337969 85337996 2 13.5 2 100 0 54 0 0 48 51 1 TG
chr10 85344795 85344957 2 75.5 2 83 12 198 45 4 3 45 1.42 TA
chr10 85349732 85349765 5 6.8 5 93 6 59 84 15 0 0 0.61 AAAAC
chr10 85353082 85353109 5 5.4 5 100 0 54 0 22 18 59 1.38 CTGTT
I want to extract all rows that have 3 and only 3 characters in the last column. My attempt so far is this:
grep -E "['ACTG']['ACTG']['ACTG']{1,3}$"
But this gives me everything of length 3 and longer. I have tried many different combinations but nothing seems to give me what I want. Any ideas?
If you'd like to try awk, you can do:
awk '$NF~/\<...\>/' file
chr10 85011399 85011478 3 25.7 3 80 12 63 58 1 40 0 1.06 GAA
chr10 85011461 85011525 3 20.7 3 87 6 74 39 0 60 0 0.97 GAG
chr10 85045413 85045440 3 9 3 100 0 54 66 33 0 0 0.92 CAA
chr10 85096597 85096640 3 14.7 3 95 4 79 69 30 0 0 0.88 AAC
It tests whether the last field $NF is exactly 3 characters (...).
This regex would also do: awk '$NF~/^...$/'
Or, if you need the exact characters (PS: this needs awk 4.x, or use of the switch --re-interval):
awk '$NF~/^[ACTG]{3}$/' file
Using grep
grep -E " [ACTG]{3}$" file
chr10 85011399 85011478 3 25.7 3 80 12 63 58 1 40 0 1.06 GAA
chr10 85011461 85011525 3 20.7 3 87 6 74 39 0 60 0 0.97 GAG
chr10 85045413 85045440 3 9 3 100 0 54 66 33 0 0 0.92 CAA
chr10 85096597 85096640 3 14.7 3 95 4 79 69 30 0 0 0.88 AAC
You need the space to separate the last column, and {3} to match 3 and only 3 characters.
If you want to print the rows which have exactly three chars in the last column, then you could use the below grep command.
grep -E " [ACTG]{3}$"
[ACTG]{3} Matches exactly three characters from the given list.
You have to grep either " ['ACTG']['ACTG']['ACTG']$" or " ['ACTG']{3}$" (with {1,3} you would match 1 to 3 characters, not exactly 3).
Currently, you are grepping 3 to 5 of those characters.
Also, the quotes are unnecessary: [...] means "match any one character between []", so ['ACTG'] matches any of the 5 characters ', A, C, T, G. Just grep " [ACTG]{3}$".
Be sure to use a delimiter for the left part (a space ' ', a tab \t if it is tab-delimited, or a word boundary \b or \W).
If your lines all end with [ACTG]+, you can even just grep -E "\W.{3}$"
Another way that you could do this would be using awk:
$ awk '$NF ~ /^[ACTG][ACTG][ACTG]$/' file
chr10 85011399 85011478 3 25.7 3 80 12 63 58 1 40 0 1.06 GAA
chr10 85011461 85011525 3 20.7 3 87 6 74 39 0 60 0 0.97 GAG
chr10 85045413 85045440 3 9 3 100 0 54 66 33 0 0 0.92 CAA
chr10 85096597 85096640 3 14.7 3 95 4 79 69 30 0 0 0.88 AAC
This prints all lines whose last field exactly matches 3 of the characters "A", "C", "T" or "G".
Two hours late, but this is one way in awk.
It can easily be edited for different lengths and fields:
awk 'length($NF)==3' file
As I was looking for answers myself, I found that Perl-style regexes work more efficiently:
This does the deal: grep -P '\t...$'. Way more compact code.
$ cat roi_new.bed | grep -P "\t...$"
chr10 81038152 81038182 3 9.7 3 92 7 51 30 0 0 70 0.88 TTA
chr10 81272294 81272320 3 8.7 3 100 0 52 0 30 69 0 0.89 GGC
chr10 81287690 81287720 3 10 3 100 0 60 66 33 0 0 0.92 CAA
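For completeness, the same exact-three filter in Python (a sketch; 'file' is a placeholder filename and the columns are assumed to be whitespace-separated):
import re

with open('file') as fh:
    for line in fh:
        fields = line.split()
        # keep rows whose last column is exactly three of A, C, T, G
        if fields and re.fullmatch(r'[ACTG]{3}', fields[-1]):
            print(line, end='')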

How to separate paragraphs in a textfile into multiple textfiles?

I want to separate this textfile into 3 textfiles, so that each paragraph becomes its own textfile. (My OS is Ubuntu 12.04.)
Input
2008 2 2 1120 31.2 L 34.031 48.515 16.7 INS 5 0.3 4.0LINS 1
GAP=145 0.67 4.1 2.9 6.6 0.2283E-01 -0.1718E+00 0.1289E+02E
ACTION:UPD 08-12-28 13:25 OP:moh STATUS: ID:20080202112031 L I
2008-02-02-1120-39S.IN____006 6
STAT SP IPHASW D HRMM SECON CODA AMPLIT PERI AZIMU VELO AIN AR TRES W DIS CAZ7
SNGE BZ EPg 1120 57.69 91 0.0210 159 318
SNGE BZ AML 1121 24.50 2880.9 0.55 159 318
SHGR BZ EPN5 1121 5.17 52 -0.0510 215 173
GHVR BZ EPn 1121 10.84 52 0.3610 256 78
GHVR BZ ESg 1121 43.50 91 -0.0210 256 78
CHTH BZ EPn 1121 18.26 52 0.1210 317 48
CHTH BZ AML 1122 8.01 494.0 0.68 317 48
DAMV BZ EPn 1121 23.36 52 -0.49 9 362 60
DAMV BZ AML 1122 7.03 382.0 0.48 362 60
2008 211 1403 46.2 L 27.659 55.544 14.1 INS 4 0.1 4.0LINS 1
GAP=171 0.38 1.7 1.2 3.3 -0.8271E-01 -0.3724E-01 0.4284E+00E
2008-02-11-1403-37S.INSN__048 6
ACTION:NEW 08-12-28 13:25 OP:moh STATUS: ID:20080211140346 L I
STAT SP IPHASW D HRMM SECON CODA AMPLIT PERI AZIMU VELO AIN AR TRES W DIS CAZ7
BNDS BZ EPg 14 3 58.14 90 -0.0710 68.3 115
BNDS BN AML 14 4 26.39 8461.0 0.52 68.3 115
GHIR BZ EPn 14 4 26.40 52 0.0310 261 286
GHIR BN ESg 14 4 59.85 90 -0.0110 261 286
GHIR BN AML 14 5 25.22 1122.4 0.56 261 286
GHIR BE AML 14 5 43.83 769.3 0.64 261 286
KRBR BZ EPn 14 4 29.25 52 -0.1110 284 24
KRBR BN ESg 14 5 6.28 90 0.0010 284 24
KRBR BN AML 14 5 18.89 552.4 0.64 284 24
KRBR BE AML 14 5 19.22 574.0 0.60 284 24
ZHSF BZ EPn 14 5 3.24 52 0.25 8 555 66
2008 213 2055 31.5 L 31.713 51.180 14.1 INS 9 0.5 4.2LINS 1
GAP=127 1.21 4.6 6.5 9.6 0.7570E+01 -0.1161E+02 0.9944E+01E
ACTION:UPD 08-12-28 13:25 OP:moh STATUS: ID:20080213205531 L I
2008-02-13-2054-59S.NSN___048 6
STAT SP IPHASW D HRMM SECON CODA AMPLIT PERI AZIMU VELO AIN AR TRES W DIS CAZ7
NASN BZ EPg 2056 3.15 90 -0.6410 195 51
SHGR BZ EPg 2056 8.57 90 -0.3810 229 282
SHGR BN AML 2056 49.27 2371.2 0.77 229 282
SHGR BE AML 2056 51.00 2484.4 0.77 229 282
GHVR BZ EPn 2056 18.39 52 1.0110 307 1
GHVR BE AML 2057 11.42 734.2 0.85 307 1
ASAO BZ EPn 2056 20.35 52 -0.36 9 332 341
ASAO BE ESg 2057 5.23 90 0.27 9 332 341
ASAO BN AML 2057 15.86 723.3 0.64 332 341
GHIR BZ EPn 2056 31.68 52 0.48 9 418 155
GHIR BN AML 2057 51.30 259.1 0.79 418 155
DAMV BZ EPn 2056 33.90 52 -0.27 9 441 9
DAMV BN AML 2057 43.30 237.4 0.65 441 9
THKV BZ EPn 2056 37.71 52 0.33 8 467 357
THKV BE AML 2057 51.62 205.7 0.72 467 357
ZNJK BZ EPn 2056 53.12 52 -0.35 7 596 338
BNDS BZ EPn 2057 3.72 52 -0.06 7 680 133
output1.txt
2008 2 2 1120 31.2 L 34.031 48.515 16.7 INS 5 0.3 4.0LINS 1
GAP=145 0.67 4.1 2.9 6.6 0.2283E-01 -0.1718E+00 0.1289E+02E
ACTION:UPD 08-12-28 13:25 OP:moh STATUS: ID:20080202112031 L I
2008-02-02-1120-39S.IN____006 6
STAT SP IPHASW D HRMM SECON CODA AMPLIT PERI AZIMU VELO AIN AR TRES W DIS CAZ7
SNGE BZ EPg 1120 57.69 91 0.0210 159 318
SNGE BZ AML 1121 24.50 2880.9 0.55 159 318
SHGR BZ EPN5 1121 5.17 52 -0.0510 215 173
GHVR BZ EPn 1121 10.84 52 0.3610 256 78
GHVR BZ ESg 1121 43.50 91 -0.0210 256 78
CHTH BZ EPn 1121 18.26 52 0.1210 317 48
CHTH BZ AML 1122 8.01 494.0 0.68 317 48
DAMV BZ EPn 1121 23.36 52 -0.49 9 362 60
DAMV BZ AML 1122 7.03 382.0 0.48 362 60
output2.txt
2008 211 1403 46.2 L 27.659 55.544 14.1 INS 4 0.1 4.0LINS 1
GAP=171 0.38 1.7 1.2 3.3 -0.8271E-01 -0.3724E-01 0.4284E+00E
2008-02-11-1403-37S.INSN__048 6
ACTION:NEW 08-12-28 13:25 OP:moh STATUS: ID:20080211140346 L I
STAT SP IPHASW D HRMM SECON CODA AMPLIT PERI AZIMU VELO AIN AR TRES W DIS CAZ7
BNDS BZ EPg 14 3 58.14 90 -0.0710 68.3 115
BNDS BN AML 14 4 26.39 8461.0 0.52 68.3 115
GHIR BZ EPn 14 4 26.40 52 0.0310 261 286
GHIR BN ESg 14 4 59.85 90 -0.0110 261 286
GHIR BN AML 14 5 25.22 1122.4 0.56 261 286
GHIR BE AML 14 5 43.83 769.3 0.64 261 286
KRBR BZ EPn 14 4 29.25 52 -0.1110 284 24
KRBR BN ESg 14 5 6.28 90 0.0010 284 24
KRBR BN AML 14 5 18.89 552.4 0.64 284 24
KRBR BE AML 14 5 19.22 574.0 0.60 284 24
ZHSF BZ EPn 14 5 3.24 52 0.25 8 555 66
output3.txt
2008 213 2055 31.5 L 31.713 51.180 14.1 INS 9 0.5 4.2LINS 1
GAP=127 1.21 4.6 6.5 9.6 0.7570E+01 -0.1161E+02 0.9944E+01E
ACTION:UPD 08-12-28 13:25 OP:moh STATUS: ID:20080213205531 L I
2008-02-13-2054-59S.NSN___048 6
STAT SP IPHASW D HRMM SECON CODA AMPLIT PERI AZIMU VELO AIN AR TRES W DIS CAZ7
NASN BZ EPg 2056 3.15 90 -0.6410 195 51
SHGR BZ EPg 2056 8.57 90 -0.3810 229 282
SHGR BN AML 2056 49.27 2371.2 0.77 229 282
SHGR BE AML 2056 51.00 2484.4 0.77 229 282
GHVR BZ EPn 2056 18.39 52 1.0110 307 1
GHVR BE AML 2057 11.42 734.2 0.85 307 1
ASAO BZ EPn 2056 20.35 52 -0.36 9 332 341
ASAO BE ESg 2057 5.23 90 0.27 9 332 341
ASAO BN AML 2057 15.86 723.3 0.64 332 341
GHIR BZ EPn 2056 31.68 52 0.48 9 418 155
GHIR BN AML 2057 51.30 259.1 0.79 418 155
DAMV BZ EPn 2056 33.90 52 -0.27 9 441 9
DAMV BN AML 2057 43.30 237.4 0.65 441 9
THKV BZ EPn 2056 37.71 52 0.33 8 467 357
THKV BE AML 2057 51.62 205.7 0.72 467 357
ZNJK BZ EPn 2056 53.12 52 -0.35 7 596 338
BNDS BZ EPn 2057 3.72 52 -0.06 7 680 133
I'll give you an idea, just one method: iterate over your file, row by row.
Accumulate the rows in a buffer while the row is non-empty; when you hit a blank row, save the buffer to a different file.
buffer=""
id=1
while IFS= read -r row; do
    if [ -z "$row" ] && [ -n "$buffer" ]; then
        # a blank row ends a paragraph: write the buffer to its own file
        printf '%s' "$buffer" > "fileName_${id}.txt"
        buffer=""
        id=$((id+1))
    elif [ -n "$row" ]; then
        buffer="${buffer}${row}"$'\n'
    fi
done < test
# flush the last paragraph if the file does not end with a blank line
[ -n "$buffer" ] && printf '%s' "$buffer" > "fileName_${id}.txt"
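If Python is an option, a compact sketch of the same idea, assuming the paragraphs are separated by blank lines ('input.txt' is a placeholder filename):
# Split a blank-line-separated text file into one file per paragraph.
with open('input.txt') as fh:
    paragraphs = [p for p in fh.read().split('\n\n') if p.strip()]

# Write each paragraph to its own numbered file, as in the question.
for i, para in enumerate(paragraphs, start=1):
    with open(f'output{i}.txt', 'w') as out:
        out.write(para.rstrip('\n') + '\n')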
