I have a string stored in a dataframe column:
import pandas as pd
df = pd.DataFrame({"ID": 1, "content": "froyay-xcd = (E)-cut-2-froyay-xcd"}, index=[0])
print(df)
idx = df[df['content'].str.contains("froyay-xcd = (E)-cut-2-froyay-xcd")]
print(idx)
I'm trying to find the index of the row that contains a search string, and the following warning occurs:
UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
return func(self, *args, **kwargs)
I'm not sure why an empty dataframe is returned when the search string actually is present in the dataframe columns.
Any suggestions will be highly appreciated. I expect the output to return the row stored in the dataframe.
You can add the regex=False parameter to avoid the value being interpreted as a regex; here ( and ) are special regex characters:
idx = df[df['content'].str.contains("froyay-xcd = (E)-cut-2-froyay-xcd", regex=False)]
print(idx)
ID content
0 1 froyay-xcd = (E)-cut-2-froyay-xcd
Or you can escape regex by:
import re
idx = df[df['content'].str.contains(re.escape("froyay-xcd = (E)-cut-2-froyay-xcd"))]
print(idx)
ID content
0 1 froyay-xcd = (E)-cut-2-froyay-xcd
You can add \ before ( and ) to escape them, and then get the index using .index:
df[df.content.str.contains(r"froyay-xcd = \(E\)-cut-2-froyay-xcd")].index
Int64Index([0], dtype='int64')
If you have more regex special characters, it is better to use regex=False as @jezrael said.
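As an aside, the warning appears because (E) is parsed as a regex capture group; str.extract is what the warning is pointing at if you actually want the group. A quick sketch on the same one-row frame:

print(df['content'].str.extract(r"froyay-xcd = \((E)\)-cut-2-froyay-xcd"))
#    0
# 0  E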
Using Pandas, I'm attempting to 'slice' (Sorry if that's not the correct term) segments of a dataframe out of one DF and into a new one, where every segment is stacked one on top of the other.
Code:
import pandas as pd
df = pd.DataFrame(
    {
        'TYPE': ['System','VERIFY','CMD','SECTION','SECTION','VERIFY','CMD','CMD','VERIFY','CMD','System'],
        'DATE': [100,200,300,400,500,600,700,800,900,1000,1100],
        'OTHER': [10,20,30,40,50,60,70,80,90,100,110],
        'STEP': ['Power On','Start: 2','Start: 1-1','Start: 10-7','End: 10-7','Start: 3-1','Start: 10-8','End: 1-1','End: 3-1','End: 10-8','Power Off']
    })
print(df)
column_headers = df.columns.values.tolist()
col_name_type = 'TYPE'
col_name_other = 'OTHER'
col_name_step = 'STEP'
segments = []
df_blank = pd.DataFrame({'TYPE': ['BLANK ROW']}, columns = column_headers)
types_to_check = ['CMD', 'VERIFY']
type_df = df[df[col_name_type].isin(types_to_check)]
for row in type_df:
    if 'CMD' in row:
        if 'START:' in row[col_name_step].value:
            idx_start = row.iloc[::-1].str.match('VERIFY').first_valid_index() #go backwards and find first VERIFY
            step_match = row[col_name_step].value[6:] #get the unique ID after Start:
            idx_end = df[df[col_name_step].str.endswith(step_match, na=False)].last_valid_index() #find last instance of matching unique id
            segments.append(df.loc[idx_start:idx_end, :])
            segments.append(df_blank)
df_segments = pd.concat(segments)
print(df)
print(df_segments)
Nothing gets populated in my segments array, so the concat function fails.
From my research I'm confident that this can be done using either .loc or .iloc, but I can't seem to get a working implementation.
My DF:
What I am trying to make:
Any help and/or guidance would be welcome.
Edit: To clarify, I'm trying to create a new DF that is comprised of every group of rows, where the start is the "VERIFY" row that comes before a "CMD" row containing "Start:", and the end is the matching "CMD" row containing "End:".
EDIT2: I think the following is something close to what I need, but I'm unsure how to get it to reliably work:
segments = []
df_blank = pd.DataFrame({'TYPE': ['BLANK ROW']}, columns = column_headers)
types_to_check = ['CMD', 'VERIFY']
cmd_check = ['CMD']
verify_check = ['VERIFY']
cmd_df = df[(df[col_name_type].isin(cmd_check))]
cmd_start_df = cmd_df[(cmd_df[col_name_step].str.contains('START:'))]
for cmd_idx in cmd_start_df.index:
    step_name = df.loc[cmd_idx, col_name_step][6:]
    temp_df = df.loc[:cmd_idx,]
    idx_start = temp_df[col_name_type].isin(verify_check).last_valid_index()
    idx_end = cmd_df[cmd_df[col_name_type].str.endswith(step_name, na=False)].last_valid_index()
    segments.append(df.loc[idx_start:idx_end, :])
    segments.append(df_blank)
df_segments = pd.concat(segments)
You can use str.contains:
segmented_df = df.loc[df['STEP'].str.contains('Start|End')]
print(segmented_df)
I created some code to accomplish the 'slicing' I wanted:
for cmd_idx in cmd_start_df.index:
    step_name = df.loc[cmd_idx, col_name_step][6:]
    temp_df = df.loc[:cmd_idx, :]
    temp_list = temp_df[col_name_type].values.tolist()
    if 'VERIFY' in temp_list:
        idx_start = temp_df[temp_df[col_name_type].str.match('VERIFY')].last_valid_index()
    else:
        idx_start = cmd_idx
    idx_end = cmd_df[cmd_df[col_name_step].str.endswith(step_name, na=False)].last_valid_index()
    segments.append(df.loc[idx_start:idx_end, :])
    segments.append(df_blank)
I essentially create a new DF that is a subset of the old DF up to each Start: index, find the last_valid_index that has VERIFY, use those indices to slice the original DF from idx_start to idx_end, and eventually concat all those slices into one DF.
Maybe there's an easier way, but I couldn't find it.
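For reference, here is a self-contained sketch of that approach run against the sample frame from the question; the variable names are mine, and it assumes every CMD "Start:" row has a matching "End:" row later in the frame:

import pandas as pd

df = pd.DataFrame({
    'TYPE': ['System','VERIFY','CMD','SECTION','SECTION','VERIFY','CMD','CMD','VERIFY','CMD','System'],
    'DATE': [100,200,300,400,500,600,700,800,900,1000,1100],
    'OTHER': [10,20,30,40,50,60,70,80,90,100,110],
    'STEP': ['Power On','Start: 2','Start: 1-1','Start: 10-7','End: 10-7','Start: 3-1','Start: 10-8','End: 1-1','End: 3-1','End: 10-8','Power Off']
})

segments = []
df_blank = pd.DataFrame({'TYPE': ['BLANK ROW']}, columns=df.columns)

cmd_df = df[df['TYPE'] == 'CMD']
cmd_start_df = cmd_df[cmd_df['STEP'].str.contains('Start:')]

for cmd_idx in cmd_start_df.index:
    step_id = df.loc[cmd_idx, 'STEP'][6:]                       # unique id after "Start:"
    verify_before = df.loc[:cmd_idx].query("TYPE == 'VERIFY'")  # VERIFY rows above the Start row
    idx_start = verify_before.index[-1] if len(verify_before) else cmd_idx
    idx_end = cmd_df[cmd_df['STEP'].str.endswith(step_id, na=False)].index[-1]
    segments.append(df.loc[idx_start:idx_end])                  # .loc slicing is label-inclusive
    segments.append(df_blank)

df_segments = pd.concat(segments, ignore_index=True)
print(df_segments)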
# here I have to apply a loop that pulls the queries from Excel for the respective reports:
df1 = pd.read_sql(SQLqueryB2, con=con1)
df2 = pd.read_sql(ORCqueryC2, con=con2)
if df1.equals(df2):
    print(Report2 + " : is Pass")
Can we achieve the above by doing something like this (iterating over the dataframe)?
df = pd.read_excel(path)
for col, item in df.iteritems():
Or is the only option left to read the Excel file with the "openpyxl" library and iterate over rows and columns to get the values? I hope I am clear with the question; if there is any doubt, please comment.
You are trying to loop through an excel file, run the 2 queries, see if they match and output the result, correct?
import pandas as pd
from sqlalchemy import create_engine
# add user, pass, database name
con = create_engine(f"mysql+pymysql://{USER}:{PWD}@{HOST}/{DB}")
file = pd.read_excel('excel_file.xlsx')
file['Result'] = '' # placeholder
for i, row in file.iterrows():
    df1 = pd.read_sql(row['SQLQuery'], con)
    df2 = pd.read_sql(row['Oracle Queries'], con)
    file.loc[i, 'Result'] = 'Pass' if df1.equals(df2) else 'Fail'
file.to_excel('results.xlsx', index=False)
This will save a file named results.xlsx that mirrors the original data but adds a column named Result that will be Pass or Fail.
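If, as in the question, the two queries hit different databases (con1/con2), the same loop works with two engines. A sketch assuming the same Excel column names as above; the connection URLs are placeholders you would need to adapt to your drivers and credentials:

import pandas as pd
from sqlalchemy import create_engine

# placeholder connection strings - swap in your own drivers/credentials
sql_server_engine = create_engine("mssql+pyodbc://user:pwd@my_dsn")
oracle_engine = create_engine("oracle+cx_oracle://user:pwd@host:1521/?service_name=my_service")

file = pd.read_excel('excel_file.xlsx')
file['Result'] = ''  # placeholder
for i, row in file.iterrows():
    df1 = pd.read_sql(row['SQLQuery'], sql_server_engine)
    df2 = pd.read_sql(row['Oracle Queries'], oracle_engine)
    file.loc[i, 'Result'] = 'Pass' if df1.equals(df2) else 'Fail'
file.to_excel('results.xlsx', index=False)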
I have a dataframe with a column that contains two different column values and their names packed together, as follows:
How Do I transform it into separate columns?
So far, I tried the following:
Using df[col].apply(pd.Series) - it didn't work, since the data in the column is not in dictionary format.
Separating the columns on the semi-colon (";") sign - but that is not a good idea, since the given dataframe might have any number of columns depending on the response.
EDIT:
Data in plain text format:
d = {'ClusterName': ['Date:20191010;Bucket:All','Date:20191010;Bucket:some','Date:20191010;Bucket:All']}
How about:
df2 = (df["ClusterName"]
.str.replace("Date:", "")
.str.replace("Bucket:", "")
.str.split(";", expand=True))
df2.columns = ["Date", "Bucket"]
EDIT:
Without hardcoding the variable names, here's a quick hack. You can clean it up (and use less silly variable names):
df_temp = df.ClusterName.str.split(";", expand=True)
cols = []
for col in df_temp:
    df_temptemp = df_temp[col].str.split(":", expand=True)
    df_temp[col] = df_temptemp[1]
    cols.append(df_temptemp.iloc[0, 0])
df_temp.columns = cols
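An alternative sketch that avoids tracking the column names by hand: turn each "key:value;key:value" string into a dict and expand it (this assumes every entry follows that key:value pattern):

import pandas as pd

d = {'ClusterName': ['Date:20191010;Bucket:All','Date:20191010;Bucket:some','Date:20191010;Bucket:All']}
df = pd.DataFrame(data=d)

# parse each row into a dict like {'Date': '20191010', 'Bucket': 'All'}, then expand to columns
parsed = df['ClusterName'].apply(lambda s: dict(pair.split(':', 1) for pair in s.split(';')))
df2 = pd.DataFrame(parsed.tolist())
print(df2)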
So .. maybe like this ...
Set up the data frame:
d = {'ClusterName': ['Date:20191010;Bucket:All','Date:20191010;Bucket:some','Date:20191010;Bucket:All']}
df = pd.DataFrame(data=d)
df
Parse over the dataframe, breaking each value apart at the colons and semi-colons:
ls = []
for index, row in df.iterrows():
    splits = row['ClusterName'].split(';')
    print(splits[0].split(':')[1], splits[1].split(':')[1])
    ls.append([splits[0].split(':')[1], splits[1].split(':')[1]])
df = pd.DataFrame(ls, columns =['Date', 'Bucket'])
I have a problem where I need to search the contents of an RDD in another RDD.
This question is different from Efficient string matching in Apache Spark, as I am searching for an exact match and I don't need the overhead of using the ML stack.
I am new to spark and I want to know which of these methods is more efficient or if there is another way.
I have a keyword file like the below sample (in production it might reach up to 200 lines)
Sample keywords file
0.47uF 25V X7R 10% -TDK C2012X7R1E474K125AA
20pF-50V NPO/COG - AVX- 08055A200JAT2A
and I have another file (tab separated) from which I need to find matches (in production I have up to 80 million lines):
C2012X7R1E474K125AA Conn M12 Circular PIN 5 POS Screw ST Cable Mount 5 Terminal 1 Port
First method
I defined a UDF and looped through keywords for each line
keywords = sc.textFile("keys")
part_description = sc.textFile("part_description")
def build_regex(keywords):
    res = '('
    for key in keywords:
        res += '(?<!\\\s)%s(?!\\\s)|' % re.escape(key)
    res = res[0:len(res) - 1] + ')'
    return r'%s' % res

def get_matching_string(line, regex):
    matches = re.findall(regex, line, re.IGNORECASE)
    matches = list(set(matches))
    return list(set(matches)) if matches else None

def find_matching_regex(line):
    result = list()
    for keyGroup in keys:
        matches = get_matching_string(line, keyGroup)
        if matches:
            result.append(str(keyGroup) + '~~' + str(matches) + '~~' + str(len(matches)))
    if len(result) > 0:
        return result

def split_row(list):
    try:
        return Row(list[0], list[1])
    except:
        return None
keys_rdd = keywords.map(lambda keywords: build_regex(keywords.replace(',', ' ').replace('-', ' ').split(' ')))
keys = keys_rdd.collect()
sc.broadcast(keys)
part_description = part_description.map(lambda item: item.split('\t'))
df = part_description.map(lambda list: split_row(list)).filter(lambda x: x).toDF(
["part_number", "description"])
find_regex = udf(lambda line: find_matching_regex(line), ArrayType(StringType()))
df = df.withColumn('matched', find_regex(df['part_number']))
df = df.filter(df.matched.isNotNull())
df.write.save(path=job_id, format='csv', mode='append', sep='\t')
Second method
I thought I could do more parallel processing (instead of looping through keys like above), so I did a Cartesian product between keys and lines, split and exploded the keys, and then compared each key to the part column:
df = part_description.cartesian(keywords)
df = df.map(lambda tuple: (tuple[0].split('\t'), tuple[1])).map(
lambda tuple: (tuple[0][0], tuple[0][1], tuple[1]))
df = df.toDF(['part_number', 'description', 'keywords'])
df = df.withColumn('single_keyword', explode(split(F.col("keywords"), "\s+"))).where('keywords != ""')
df = df.withColumn('matched_part_number', (df['part_number'] == df['single_keyword']))
df = df.filter(df['matched_part_number'] == F.lit(True))
df.write.save(path='part_number_search', format='csv', mode='append', sep='\t')
Are these the correct ways to do this? Is there anything I can do to process these data faster?
These are both valid solutions, and I have used both in different circumstances.
You communicate less data by using your broadcast approach, sending only 200 extra lines to each executor as opposed to replicating each line of your >80m line file 200 times, so it is likely this one will end up being faster for you.
I have used the cartesian approach when the number of records in my lookup is not feasibly broadcast-able (being much, much larger than 200 lines).
In your situation, I would use broadcast.
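For what it's worth, the same broadcast idea can also be expressed with the DataFrame API instead of a UDF. A rough sketch doing an exact match on the part-number token; paths, separators, and column positions here are assumptions based on the question:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# small side: split every keyword line into tokens (one of them is the part number)
keys_df = spark.read.text("keys").select(F.explode(F.split("value", r"\s+")).alias("token"))

# big side: tab-separated part file, keep the first two columns
parts_df = (spark.read.option("sep", "\t").csv("part_description")
            .select(F.col("_c0").alias("part_number"), F.col("_c1").alias("description")))

# broadcast the tiny token table so the 80M-row side is never shuffled
matched = parts_df.join(F.broadcast(keys_df), parts_df.part_number == keys_df.token, "inner")
matched.write.mode("append").option("sep", "\t").csv("part_number_search")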
This question is relatively close to existing answers about extracting strings, but my data has technical twists to it. The df column data looks like this:
Col1:
2909_10_2018
2909_14_2019
32_13_2019
4200_14_2018
4124__2019
The objective is to extract the string between the two "_", except sometimes there is no string.
I tried multiple solutions posted in similar topics to no avail:
try:
    df['Col2'] = re.search('.*abc_[^_]*', df.Col1)
except TypeError:
    df['Col2'] = ''
Produces ""
try:
    df['Col2'] = re.search('_(.*)_', df.Col1)
except TypeError:
    df['Col2'] = ''
Produces ""
df['Col2'] = df.Col1.str.split("_", n = 1, expand = True)
Results in ValueError: Wrong number of items passed 2, placement implies 1.
What is a good pythonic way to extract the text between the "_" while handling the omissions?
Considering the format doesn't change, you can use a lambda function, since you have to do the same operation for each and every row. The code below will create the new column, leaving empty strings where nothing sits between the underscores.
Code:
df['Col2'] = df.Col1.apply(lambda x: x.split('_')[1])
Output:
Col1 Col2
0 2909_10_2018 10
1 2909_14_2019 14
2 32_13_2019 13
3 4200_14_2018 14
4 4124__2019
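A vectorized alternative sketch, assuming the value of interest always sits between the first two underscores: .str.extract captures it and naturally yields an empty string for rows like 4124__2019.

import pandas as pd

df = pd.DataFrame({'Col1': ['2909_10_2018', '2909_14_2019', '32_13_2019', '4200_14_2018', '4124__2019']})

# capture whatever sits between the two underscores; an empty match gives an empty string
df['Col2'] = df['Col1'].str.extract(r'_([^_]*)_', expand=False)
print(df)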