Parse .log file to collect data after keyword then create a nested dictionary using predefined column names - python-3.x

I am trying to parse a .log file using Python for the status information about processes of a Linux system. The .log file has many different sections of information; the sections of interest start with "##START-ALLPROCESSES-xxxxxxxx", where x is the epoch date, and end with "##END-ALLPROCESSES-xxxxxxxx". After the start line, each process is listed on its own line with 52 columns; the number of processes may change depending on the info recorded at the time, and there may be multiple such sections recorded at different times.
The idea is to open the .log file, find the sections, and then use the xxxxxxxx epoch as the key of a nested dictionary whose inner keys are the predefined column names filled in with the values from the section, and to do this for every section found in the .log file. The nested dictionary would look something like below:
[date1-XXXXXX:
[ columnName1: process1,
.
.
.
columnName52: info1
],
.
.
.
[ columnName1: process52,
.
.
.
columnName52: info52
]
],
[date2-XXXXXX:
[ columnName1: process1,
.
.
.
columnName52: info1
],
.
.
.
[ columnName1: process52,
.
.
.
columnName52: info52
]
]
The data in the .log file looks as follows; there would be multiple sections like this, each with a different date. Each line starts with the process id and (process name):
##START-ALLPROCESSES-1676652419
1 (systemd) S 0 1 1 0 -1 4210944 2070278 9743969773 2070 2703 8811 11984 7638026 9190549 20 0 1 0 0 160043008 745 18446744073709551615 187650352414720 187650353516788 281474853505456 0 0 0 671173123 4096 1260 1 0 0 17 0 0 0 2706 0 0 187650353585800 187650353845340 187651263758336 281474853506734 281474853506745 281474853506745 281474853507053 0
10 (rcu_bh) I 2 0 0 0 -1 2129984 0 0 0 0 0 0 0 0 20 0 1 0 2 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483647 0 1 0 0 17 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10251 (kworker/1:2) I 2 0 0 0 -1 69238880 0 0 0 0 0 914 0 0 20 0 1 0 617684776 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483647 0 1 0 0 17 1 0 0 0 0 0 0 0 0 0 0 0 0 0
10299 (loop2) S 2 0 0 0 -1 3178560 0 0 0 0 0 24 0 0 0 -20 1 0 10871 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483647 0 1 0 0 17 0 0 0 169 0 0 0 0 0 0 0 0 0 0
10648 (kworker/2:0) I 2 0 0 0 -1 69238880 0 0 0 0 0 567 0 0 20 0 1 0 663634994 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483647 0 1 0 0 17 2 0 0 0 0 0 0 0 0 0 0 0 0 0
1082 (nvme-wq) I 2 0 0 0 -1 69238880 0 0 0 0 0 0 0 0 0 -20 1 0 109 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483647 0 1 0 0 17 3 0 0 0 0 0 0 0 0 0 0 0 0 0
1095 (scsi_eh_0) S 2 0 0 0 -1 2129984 0 0 0 0 0 0 0 0 20 0 1 0 110 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483647 0 1 0 0 17 3 0 0 0 0 0 0 0 0 0 0 0 0 0
1096 (scsi_tmf_0) I 2 0 0 0 -1 69238880 0 0 0 0 0 0 0 0 0 -20 1 0 110 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483647 0 1 0 0 17 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1099 (scsi_eh_1) S 2 0 0 0 -1 2129984 0 0 0 0 0 0 0 0 20 0 1 0 110 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483647 0 1 0 0 17 0 0 0 0 0 0 0 0 0 0 0 0 0 0
11 (migration/0) S 2 0 0 0 -1 69238848 0 0 0 0 0 4961 0 0 -100 0 1 0 2 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483647 0 1 0 0 17 0 99 1 0 0 0 0 0 0 0 0 0 0 0
1100 (scsi_tmf_1) I 2 0 0 0 -1 69238880 0 0 0 0 0 0 0 0 0 -20 1 0 110 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483647 0 1 0 0 17 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##END-ALLPROCESSES-1676652419
I have tried it multiple ways, but I cannot seem to get it right. My last attempt:
columns = ['pid', 'comm', 'state', 'ppid', 'pgrp', 'session', 'tty_nr', 'tpgid', 'flags', 'minflt', 'cminflt', 'majflt', 'cmajflt', 'utime', 'stime',
'cutime', 'cstime', 'priority', 'nice', 'num_threads', 'itrealvalue', 'starttime', 'vsize', 'rss', 'rsslim', 'startcode', 'endcode', 'startstack', 'kstkesp',
'kstkeip', 'signal', 'blocked', 'sigignore', 'sigcatch', 'wchan', 'nswap', 'cnswap', 'exit_signal', 'processor', 'rt_priority', 'policy', 'delayacct_blkio_ticks',
'guest_time', 'cguest_time', 'start_data', 'end_data', 'start_brk', 'arg_start', 'arg_end', 'env_start', 'env_end', 'exit_code' ]
for file in os.listdir(dir):
    if file.endswith('.log'):
        with open(file, 'r') as f:
            data = f.read()
            data = data.split('##START-ALLPROCESSES-')
            data = data[1:]
            for i in range(len(data)):
                data[i] = data[i].split('##END-ALLPROCESSES-')
                data[i] = data[i][0]
                data[i] = re.split('\r', data[i])
                data[i] = data[i][0]
                data[i] = re.split('\n', data[i])
                for j in range(len(data[i])):
                    data[i][j] = re.split('\s+', data[i][j])
                #print(data[i])
                data[i][0] = str(data[i][0])
            data_dict = {}
            for i in range(len(data)):
                data_dict[data[i][0]] = {}
                for j in range(len(columns)):
                    data_dict[data[i][0]][columns[j]] = data[i][j+1]
            print(data_dict)
I converted the epoch date into a str because I was getting unhashable list errors. That made the epoch date show up as a key, but each column key now holds the entire 52-value list for a process instead of a single value, so I am definitely missing something.

To solve this problem, you could follow these steps:
1. Open the .log file and read its contents.
2. Find all the sections of interest, i.e. the blocks that start with "##START-ALLPROCESSES-" and end with "##END-ALLPROCESSES-".
3. For each section found, extract the epoch date and create a dictionary with an empty list for each of the 52 columns.
4. Iterate over the lines within the section and split each line into the 52 columns using whitespace as the separator. Append the values to the corresponding lists in the dictionary created in step 3.
5. Repeat steps 3 and 4 for all the sections found in the .log file.
6. Return the final nested dictionary.
Here is some sample code that implements these steps:
import re

def parse_log_file(log_file_path):
    with open(log_file_path, 'r') as log_file:
        log_contents = log_file.read()
    # each match is the text between a START marker and the following END marker
    sections = re.findall(r'##START-ALLPROCESSES-(.*?)##END-ALLPROCESSES-', log_contents, re.DOTALL)
    nested_dict = {}
    for section in sections:
        lines = section.strip().split('\n')
        # the first line of the captured text is the epoch that followed the START marker
        epoch_date = lines[0].split('-')[-1]
        column_names = ['column{}'.format(i) for i in range(1, 53)]
        section_dict = {column_name: [] for column_name in column_names}
        for line in lines[1:]:
            values = line.strip().split()
            for i, value in enumerate(values):
                section_dict[column_names[i]].append(value)
        nested_dict['date{}-{}'.format(epoch_date, len(section_dict['column1']))] = section_dict
    return nested_dict
You can call this function by passing the path to the .log file as an argument. The function returns the nested dictionary described in the problem statement.
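If you want the exact structure from the question, with the predefined column names and one dictionary per process, a minimal variation of the same idea could look like the sketch below. It assumes the columns list defined in the question, that every process line has exactly 52 whitespace-separated fields, and that the comm field never contains spaces; the function name parse_log_named is just illustrative.
import re

def parse_log_named(log_file_path, columns):
    with open(log_file_path, 'r') as log_file:
        log_contents = log_file.read()
    # capture the epoch and the block of process lines for each section
    pattern = r'##START-ALLPROCESSES-(\d+)\n(.*?)##END-ALLPROCESSES-'
    nested = {}
    for epoch, body in re.findall(pattern, log_contents, re.DOTALL):
        processes = {}
        for line in body.strip().splitlines():
            fields = line.split()
            if len(fields) != len(columns):
                continue  # skip lines that do not have exactly 52 fields
            row = dict(zip(columns, fields))
            processes[row['pid']] = row
        nested['date-' + epoch] = processes
    return nested
With the sample data above, parse_log_named(path, columns) would produce something like {'date-1676652419': {'1': {'pid': '1', 'comm': '(systemd)', 'state': 'S', ...}, ...}}, i.e. one inner dictionary per process keyed by its pid.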

Related

edit sample ID in vcf file using sed command

I have a vcf file in which ID is represented like this:
GTEX-1117F_GTEX-1117F
I want to edit the ID to look like only:
GTEX-1117F
and exclude the other part of the ID for all GTEX values.
Id GTEX-1117F_GTEX-1117F GTEX-111CU_GTEX-111CU GTEX-111FC_GTEX-111FC GTEX-111VG_GTEX-111VG GTEX-111YS_GTEX-111YS GTEX-1122O_GTEX-1122O GTEX-1128S_GTEX-1128S GTEX-113IC_GTEX-113IC
I tried to use this command:
sed -e '$s/\[[[:digit:]]\+\]//g; s/_GTEX[[:digit:]]\+//g' genotype_chr22_filtered_dosage2.txt > genotype_chr22_filtered_dosage3.txt
This is what my file looks like:
Id GTEX-1117F_GTEX-1117F GTEX-111CU_GTEX-111CU GTEX-111FC_GTEX-111FC GTEX-111VG_GTEX-111VG GTEX-111YS_GTEX-111YS GTEX-1122O_GTEX-1122O GTEX-
1128S_GTEX-1128S GTEX-113IC_GTEX-113IC GTEX-113JC_GTEX-113JC GTEX-117XS_GTEX-117XS GTEX-117YW_GTEX-117YW GTEX-117YX_GTEX-117YX GTEX-1192W_GTEX-11
92W GTEX-1192X_GTEX-1192X GTEX-11DXW_GTEX-11DXW GTEX-11DXX_GTEX-11DXX GTEX-11DXY_GTEX-11DXY GTEX-11DXZ_GTEX-11DXZ GTEX-11DYG_GTEX-11DYG GTEX-11
DZ1_GTEX-11DZ1 GTEX-11EI6_GTEX-11EI6 GTEX-11EM3_GTEX-11EM3 GTEX-11EMC_GTEX-11EMC GTEX-11EQ8_GTEX-11EQ8 GTEX-11EQ9_GTEX-11EQ9
GTEX-ZVZP_GTEX-ZVZP GTEX-ZVZQ_GTEX-ZVZQ GTEX-ZWKS_GTEX-ZWKS GTEX-ZXES_GTEX-ZXES GTEX-ZXG5_GTEX-ZXG5 GTEX-ZY6K_GTEX-ZY6K GTEX-ZYFC_GTEX-ZYFC GTEX-ZYFD_GTEX-ZYFD GTEX-ZYFG_GTEX-ZYFG GTEX-ZYT6_GTEX-ZYT6 GTEX-ZYVF_GTEX-ZYVF GTEX-ZYW4_GTEX-ZYW4 GTEX-ZYWO_GTEX-ZYWO
GTEX-ZYY3_GTEX-ZYY3 GTEX-ZZ64_GTEX-ZZ64 GTEX-ZZPT_GTEX-ZZPT GTEX-ZZPU_GTEX-ZZPU
22_20000146_G_A_b37 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0

How to create a confusion matrix from an incomplete dataframe in python

I have a dataframe which looks like this:
I1 I2 V
0 1 1 300
1 1 5 7
2 1 9 3
3 2 2 280
4 2 3 4
5 5 1 5
6 5 5 400
I1 and I2 represent indexes while V represents values.
The indexes with values equal to 0 have been omitted, but I'd like to get a confusion matrix showing all the values, i.e. something like this:
1 2 3 4 5 6 7 8 9
1 300 0 0 0 7 0 0 0 3
2 0 280 4 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0
5 5 0 0 0 400 0 0 0 0
6 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0
How can I do it?
Thanks in advance!
Use set_index with unstack for reshaping; to add the missing rows and columns use reindex, and to clean up the axis names use rename_axis:
r = range(1, 10)
df = (df.set_index(['I1','I2'])['V']
        .unstack(fill_value=0)
        .reindex(index=r, columns=r, fill_value=0)
        .rename_axis(None)
        .rename_axis(None, axis=1))
print (df)
1 2 3 4 5 6 7 8 9
1 300 0 0 0 7 0 0 0 3
2 0 280 4 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0
5 5 0 0 0 400 0 0 0 0
6 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0
Detail:
print (df.set_index(['I1','I2'])['V']
         .unstack(fill_value=0))
I2 1 2 3 5 9
I1
1 300 0 0 7 3
2 0 280 4 0 0
5 5 0 0 400 0
Alternative solution with pivot, if all values are integers:
r = range(1, 10)
df = (df.pivot(index='I1', columns='I2', values='V')
        .fillna(0)
        .astype(int)
        .reindex(index=r, columns=r, fill_value=0)
        .rename_axis(None)
        .rename_axis(None, axis=1))
print (df)
print (df)
1 2 3 4 5 6 7 8 9
1 300 0 0 0 7 0 0 0 3
2 0 280 4 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0
5 5 0 0 0 400 0 0 0 0
6 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0
Option 1: Using numpy, you can do the following:
In [150]: size = df[['I1', 'I2']].values.max()
In [151]: arr = np.zeros((size, size))
In [152]: arr[df.I1-1, df.I2-1] = df.V
In [153]: idx = np.arange(1, size+1)
In [154]: pd.DataFrame(arr, index=idx, columns=idx).astype(int)
Out[154]:
1 2 3 4 5 6 7 8 9
1 300 0 0 0 7 0 0 0 3
2 0 280 4 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0
5 5 0 0 0 400 0 0 0 0
6 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0
Option 2: Using scipy.sparse.csr_matrix
In [178]: from scipy.sparse import csr_matrix
In [179]: size = df[['I1', 'I2']].values.max()
In [180]: idx = np.arange(1, size+1)
In [181]: pd.DataFrame(csr_matrix((df['V'], (df['I1']-1, df['I2']-1)), shape=(size, size)).toarray(), index=idx, columns=idx)
Out[181]:
1 2 3 4 5 6 7 8 9
1 300 0 0 0 7 0 0 0 3
2 0 280 4 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0
5 5 0 0 0 400 0 0 0 0
6 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0

Create dummy variables for interdependent categories in pandas

I'm trying to set up a linear regression model in order to predict traffic counts based on the day and the time of day. Since both are categorical variables, I have to create dummy variables. The get_dummies function makes this very easy when doing it for each variable individually. However, in the case of predicting traffic volumes, the interdependence between the day and the time of day is important. Therefore, I'll need dummies for all days * all time intervals.
I made a small example, to avoid troubling you with big datasets:
import pandas as pd
df = pd.DataFrame({'Day': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
                   'Time': [11,15,9,15,17,10,20],
                   'Count': [100,150,150,150,180,60,50]})
df_dummies = pd.get_dummies(df.Day)
print(df_dummies)
Results in a nice dataframe with dummies:
Fri Mon Sat Sun Thu Tue Wed
0 0 1 0 0 0 0 0
1 0 0 0 0 0 1 0
2 0 0 0 0 0 0 1
3 0 0 0 0 1 0 0
4 1 0 0 0 0 0 0
5 0 0 1 0 0 0 0
6 0 0 0 1 0 0 0
So what I'm after is something like this:
import pandas as pd
df = pd.DataFrame({'Day': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
                   'Time': [11,15,9,15,17,10,20],
                   'Count': [100,150,150,150,180,60,50]})
df_dummies = pd.get_dummies(df.Day * df.Time)
print(df_dummies)
With a result like this:
Fri_9 Fri_15 Mon_9 Mon_15 Sat_9 Sat_15 Sun_9 ...
0 0 1 0 0 0 0 0 ...
1 0 0 0 0 0 1 0 ...
2 0 0 0 0 0 0 1 ...
3 0 0 0 0 1 0 0 ...
4 1 0 0 0 0 0 0 ...
5 0 0 1 0 0 0 0 ...
6 0 0 0 1 0 0 0 ...
7 0 0 0 0 0 0 0 ...
[...]
Is there any way in which this can be done elegantly?
I believe you need to join the columns together, with a cast to strings:
df_dummies = pd.get_dummies(df.Day + '_' + df.Time.astype(str))
#df_dummies = pd.get_dummies(df.Day.str.cat(df.Time.astype(str), sep='_'))
print(df_dummies)
Fri_17 Mon_11 Sat_10 Sun_20 Thu_15 Tue_15 Wed_9
0 0 1 0 0 0 0 0
1 0 0 0 0 0 1 0
2 0 0 0 0 0 0 1
3 0 0 0 0 1 0 0
4 1 0 0 0 0 0 0
5 0 0 1 0 0 0 0
6 0 0 0 1 0 0 0
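As a small follow-up, if you then want to feed these dummies into the linear regression mentioned in the question, a rough sketch continuing from the df created above (assuming scikit-learn is available; the variable names are just illustrative):
from sklearn.linear_model import LinearRegression

# one dummy column per observed Day/Time combination, Count as the target
X = pd.get_dummies(df.Day + '_' + df.Time.astype(str))
y = df.Count
model = LinearRegression().fit(X, y)
print(model.coef_)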
"I'm trying to set up a linear regression model in order to predict ..."
Technically, you can make a dummy of the tuples:
>>> pd.get_dummies(df[['Day', 'Time']].apply(tuple, axis=1))
(Fri, 17) (Mon, 11) (Sat, 10) (Sun, 20) (Thu, 15) (Tue, 15) (Wed, 9)
0 0 1 0 0 0 0 0
1 0 0 0 0 0 1 0
2 0 0 0 0 0 0 1
3 0 0 0 0 1 0 0
4 1 0 0 0 0 0 0
5 0 0 1 0 0 0 0
6 0 0 0 1 0 0 0
...
however, I think this approach is not the best at the ML level. This will probably fragment the data very much, making things hard for your regressor. You might consider using a gradient-boosted decision tree, if you're after the interactions.
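If you do try the tree-based route instead of explicit interaction dummies, a rough sketch continuing from the same df (assuming scikit-learn; the ordinal encoding of Day here is purely illustrative):
from sklearn.ensemble import GradientBoostingRegressor

# trees can pick up Day/Time interactions without manually crossing the categories
X = pd.DataFrame({'day': df.Day.astype('category').cat.codes, 'time': df.Time})
y = df.Count
model = GradientBoostingRegressor().fit(X, y)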

intersecting pandas dataframe with list of terms in python 3

I have a pandas dataframe for a term x document matrix. I need to extract only those values that appear in a separate list of terms (extracted from a different source).
The list TestDict contains terms. E.g., ['аборенков','абсолютно',...]
The following function creates a term-by-document matrix (reused from elsewhere):
def fn_tdm_df(docs, xColNames = None, **kwargs):
    # initialize the vectorizer
    vectorizer = CountVectorizer(**kwargs)
    x1 = vectorizer.fit_transform(docs)
    # create dataFrame
    df = pd.DataFrame(x1.toarray().transpose(), index = vectorizer.get_feature_names())
    if xColNames is not None:
        df.columns = xColNames
    return df
Then later in the code I have:
for filename in os.listdir(DIR):
    d1 = fn_tdm_df(docs = MyCorpus(DIR)['docs'])
The output of d1.head(5):
0 1 2 3 4 5 6 7 8 9 ... 27 28 29 30 31 \
аборенков 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0
аборенкова 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0
абсолютно 0 0 0 0 0 0 0 0 0 1 ... 0 0 0 0 0
авантюру 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0
аварии 0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0
32 33 34 35 36
аборенков 0 0 0 2 0
аборенкова 0 0 0 1 0
абсолютно 0 0 0 0 0
авантюру 0 0 0 0 0
аварии 0 0 0 0 0
I'm doing something like:
with open("TestTD", "w", encoding = "utf-8") as f:
for w in d1.index: # d1 is my dataframe
if w in TestDict:
np.savetxt(f, d1.loc(w),fmt='%d')
But I don't think the .loc(w) is what I want...
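For what it's worth, one way to do that filtering could look like the sketch below (assuming TestDict is a plain Python list of terms and that writing the selected rows with to_csv is acceptable):
# keep only the rows whose term (the index) appears in TestDict,
# then write them out; note .loc with square brackets, not parentheses
d1_filtered = d1.loc[d1.index.isin(TestDict)]
d1_filtered.to_csv("TestTD", sep=" ", header=False, encoding="utf-8")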

Linux string functions

How is this so complex, or am I missing something? I simply want to get the integer position of a substring within an existing string and place it in a variable, the same as PHP's strpos.
The closest I have found is:
echo $haystack | awk '{print index($0,"<tagtosearch>")}';
I tried
myvar=$($haystack | awk '{print index($0,"<tagtosearch>")}');
but it says "command not found".
The application is to automate including custom bash scripts on a given Linux box, without overwriting existing ones. Therefore I decided to insert custom start and end tags to denote the custom section. So I simply expected to get the start and end positions, delete this part of the file, and pull in the latest version.
So this is as far as I have that function:
function install-env(){
    mkdir -p /etc/datadimension/tmp;
    cd /etc/datadimension/tmp;
    cp /etc/bash.bashrc tempbash.bashrc;
    newbash=$(cat tempbash.bashrc);
    echo "$newbash" > newbash.bashrc;
    insertstart=$(echo "$newbash" | awk '{print index($0,"<starttag>")}');
    echo $insertstart;
}
Output
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Do I need to apt install something that can handle strings? This seems like a really basic requirement that is lacking.
With $($haystack), you invoke the value assigned to $haystack as a command.
To pipe the $haystack value to awk, try this:
myvar=$(echo "$haystack" | awk '{print index($0,"<tagtosearch>")}');
Or as #anubhava suggests in comment, using a here string:
myvar=$(awk '{print index($0, "<tagtosearch>")}' <<< "$haystack")
I could not find anything acceptable aside from PHP handling.
If you want to print the position of a search word in a file, then you can use this awk command:
awk -v kw='<starttag>' 'p=index($0, kw){p+=b; exit} {b+=length($0)+1} END{print p}' file
We keep adding each line's length into a running variable b until index returns a value greater than 0. At that point we add the current line's index to b and exit.
This one seems to work like the PHP strpos() function:
strpos() {
    local string=$1
    local findme=$2
    awk -v x=$string -v y=$findme '{print index($0, y)}' <<< $string
}
Some tests
strpos azerty a
1
strpos azerty t
5
strpos azerty er
3
strpos azerty/uio /
7
strpos "azerty uio" " "
7
EDIT:
If you want exactly the php behaviour you can do:
php_strpos() {
    local haystack="$1"
    local needle="$2"
    local offset=${3:-0}
    php -r "echo strpos('$haystack', '$needle', $offset);"
}
