How to optimize searching RDD contents in another RDD - apache-spark

I have a problem where I need to search the contents of an RDD in another RDD.
This question is different from Efficient string matching in Apache Spark, as I am searching for an exact match and I don't need the overhead of using the ML stack.
I am new to Spark and want to know which of these methods is more efficient, or whether there is a better way.
I have a keyword file like the sample below (in production it might reach up to 200 lines).
Sample keywords file
0.47uF 25V X7R 10% -TDK C2012X7R1E474K125AA
20pF-50V NPO/COG - AVX- 08055A200JAT2A
and I have another file (tab separated) in which I need to find matches (in production it has up to 80 million lines):
C2012X7R1E474K125AA Conn M12 Circular PIN 5 POS Screw ST Cable Mount 5 Terminal 1 Port
First method
I defined a UDF and looped through the keywords for each line:
import re

from pyspark.sql import Row
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

keywords = sc.textFile("keys")
part_description = sc.textFile("part_description")

def build_regex(keywords):
    res = '('
    for key in keywords:
        res += '(?<!\\\s)%s(?!\\\s)|' % re.escape(key)
    res = res[0:len(res) - 1] + ')'
    return r'%s' % res

def get_matching_string(line, regex):
    matches = re.findall(regex, line, re.IGNORECASE)
    return list(set(matches)) if matches else None

def find_matching_regex(line):
    result = list()
    for keyGroup in keys:
        matches = get_matching_string(line, keyGroup)
        if matches:
            result.append(str(keyGroup) + '~~' + str(matches) + '~~' + str(len(matches)))
    if len(result) > 0:
        return result

def split_row(list):
    try:
        return Row(list[0], list[1])
    except:
        return None

# Build one regex per keyword line and collect them to the driver
keys_rdd = keywords.map(lambda keywords: build_regex(keywords.replace(',', ' ').replace('-', ' ').split(' ')))
keys = keys_rdd.collect()
sc.broadcast(keys)

part_description = part_description.map(lambda item: item.split('\t'))
df = part_description.map(lambda list: split_row(list)).filter(lambda x: x).toDF(
    ["part_number", "description"])
find_regex = udf(lambda line: find_matching_regex(line), ArrayType(StringType()))
df = df.withColumn('matched', find_regex(df['part_number']))
df = df.filter(df.matched.isNotNull())
df.write.save(path=job_id, format='csv', mode='append', sep='\t')
Second method
I thought I could get more parallelism (instead of looping through the keys as above), so I did a cartesian product between the keys and the lines, split and exploded the keys, then compared each key to the part column:
from pyspark.sql import functions as F
from pyspark.sql.functions import explode, split

df = part_description.cartesian(keywords)
df = df.map(lambda tuple: (tuple[0].split('\t'), tuple[1])).map(
    lambda tuple: (tuple[0][0], tuple[0][1], tuple[1]))
df = df.toDF(['part_number', 'description', 'keywords'])
df = df.withColumn('single_keyword', explode(split(F.col("keywords"), r"\s+"))).where('keywords != ""')
df = df.withColumn('matched_part_number', (df['part_number'] == df['single_keyword']))
df = df.filter(df['matched_part_number'] == F.lit(True))
df.write.save(path='part_number_search', format='csv', mode='append', sep='\t')
Are these the correct ways to do this? Is there anything I can do to process this data faster?

These are both valid solutions, and I have used both in different circumstances.
With the broadcast approach you communicate less data: you send only ~200 extra lines to each executor, rather than replicating each line of your >80M-line file 200 times, so it will likely end up being faster for you.
I have used the cartesian approach when the number of records in my lookup is not feasibly broadcast-able (being much, much larger than 200 lines).
In your situation, I would use broadcast.
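As a side note, the same exact-match comparison can also be expressed as a broadcast join in the DataFrame API, letting Spark distribute the small side for you. This is only a minimal sketch, not the asker's code: it assumes the keyword lines can be tokenized on whitespace after replacing ',' and '-', and the column names (keyword_line, token) are illustrative.

from pyspark.sql import functions as F

# Assumed inputs: `df` has (part_number, description); `keys_df` holds one raw
# keyword line per row in a column named `keyword_line` (names are illustrative).
keys_df = spark.read.text("keys").withColumnRenamed("value", "keyword_line")

# One token per row: replace ',' and '-' with spaces, split on whitespace, explode.
tokens = (keys_df
          .withColumn("token", F.explode(F.split(F.regexp_replace("keyword_line", "[,-]", " "), r"\s+")))
          .filter(F.col("token") != "")
          .select("token")
          .distinct())

# Broadcast the small token table and join on exact part-number equality.
matched = df.join(F.broadcast(tokens), df["part_number"] == tokens["token"], "inner")
matched.write.save(path="part_number_search", format="csv", mode="append", sep="\t")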

Related

How to split the large text file into records based on the row size in Python or PySpark

I have a large text file, about 15 GB in size. The data inside the text file is a single string containing some 20 million records. Each record is 5000 characters long and has 450+ columns.
I want to split each record of the text file onto a new line, and then split each record according to the schema with a delimiter so it can be loaded as a DataFrame.
Here is my sample approach. Sample data:
HiIamRowData1HiIamRowData2HiIamRowData3HiIamRowData4HiIamRowData5HiIamRowData6HiIamRowData7HiIamRowData8
Expected output:
Hi#I#am#Row#Data#1#
Hi#I#am#Row#Data#2#
Hi#I#am#Row#Data#3#
Hi#I#am#Row#Data#4#
Hi#I#am#Row#Data#5#
Hi#I#am#Row#Data#6#
Hi#I#am#Row#Data#7#
Hi#I#am#Row#Data#8#
Code:
import pandas as pd

### Schema
schemaData = [['col1',0,2],['col2',2,1],['col3',3,2],['col4',5,3],['col5',8,4],['col6',12,1]]
df = pd.DataFrame(data=schemaData, columns=['FeildName','offset','size'])
print(df.head(5))

file = 'sampleText.txt'
inputFile = open(file, 'r').read()
recordLen = 13
totFileLen = len(inputFile)
finalStr = ''

### First for loop: split each record based on the record length
for i in range(0, totFileLen, recordLen):
    record = inputFile[i:i+recordLen]
    recStr = ''
    ### Second for loop: apply the schema on top of each record
    for index, row in df.iterrows():
        #print(record[row['offset']:row['offset'] + row['size']])
        recStr = recStr + record[row['offset']:row['offset'] + row['size']] + '#'
    recStr = recStr + '\n'
    finalStr += recStr

print(finalStr)
text_file = open("Output.txt", "w")
text_file.write(finalStr)
text_file.close()
For the above 8-row sample data it takes 56 total iterations (8 rows + 48 row-by-column iterations).
The real data set has 25 million rows and 500 columns, so it would take 25 million + 25 million × 500 iterations.
Constraints:
The entire data in the text file is sequential: all the records are placed next to each other and the whole file is one string. I want to read the text file and write the final data to a new text file.
I don't want to split the file into smaller fixed-size chunks (e.g. 50 MB files) while processing, because if a record gets split between the first 50 MB chunk and the second, every record from the second chunk onwards would be sliced incorrectly, since each record is sliced based on the fixed record length of 5000.
If each chunk could instead be split on record boundaries inside the text file, that would be an acceptable approach (see the record-aligned chunking sketch below).
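For illustration only: a chunked read never splits a record as long as each read is a whole multiple of the record length. The chunk size below is arbitrary, the filename is the sample one from the code above, and the schema-slicing step is elided.

rec_len = 5000            # record length from the question
recs_per_chunk = 10000    # illustrative: how many whole records to read per chunk

with open('sampleText.txt') as inp:
    while True:
        # Reading a multiple of rec_len guarantees no record straddles two chunks.
        chunk = inp.read(rec_len * recs_per_chunk)
        if not chunk:
            break
        for i in range(0, len(chunk), rec_len):
            record = chunk[i:i + rec_len]
            # ... apply the fixed-width schema to `record` here ...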
I have tried the Python approach above. For smaller files it works fine, but for files larger than about 500 MB it takes hours to split each record schema-wise.
I have also tried multithreading and multiprocessing approaches, but didn't see much improvement there.
QUESTION: Is there a better approach to this problem, in either Python or PySpark, that reduces the processing time?
You can effectively process your big file iteratively by:
capturing a sequential chunk of the needed size at a time
passing it to pandas.read_fwf with predefined column widths
and immediately exporting the constructed dataframe to the output CSV file (created if it doesn't exist), appending each line with the specified separator
import pandas as pd
from io import StringIO

rec_len = 13
widths = [2, 1, 2, 3, 4, 1]

with open('sampleText.txt') as inp, open('output.txt', 'w+') as out:
    while (line := inp.read(rec_len).strip()):
        pd.read_fwf(StringIO(line), widths=widths, header=None) \
            .to_csv(out, sep='#', header=False, index=False, mode='a')
The output.txt contents I get:
Hi#I#am#Row#Data#1
Hi#I#am#Row#Data#2
Hi#I#am#Row#Data#3
Hi#I#am#Row#Data#4
Hi#I#am#Row#Data#5
Hi#I#am#Row#Data#6
Hi#I#am#Row#Data#7
Hi#I#am#Row#Data#8
Yes, we can achieve the same result using PySpark UDF with Spark functions. Let me show you how in 5 steps:
Import the necessary packages
import pandas as pd
from pyspark.sql.functions import udf, split, explode
Reading text file using Spark read method
sample_df = spark.read.text("path/to/file.txt")
Convert your custom function to a PySpark UDF (User Defined Function) in order to use it in Spark
def delimit_records(value):
    recordLen = 13
    totFileLen = len(value)
    finalStr = ''
    for i in range(0, totFileLen, recordLen):
        record = value[i:i+recordLen]
        schemaData = [['col1',0,2],['col2',2,1],['col3',3,2],['col4',5,3],['col5',8,4],['col6',12,1]]
        pdf = pd.DataFrame(data=schemaData, columns=['FeildName','offset','size'])
        recStr = ''
        for index, row in pdf.iterrows():
            recStr = recStr + record[row['offset']:row['offset'] + row['size']] + '#'
        recStr = recStr + '\n'
        finalStr += recStr
    return finalStr.rstrip()
Registering your User Defined Function
delimit_records = udf(delimit_records)
Use your custom function on the column you want to modify
df1 = sample_df.withColumn("value", delimit_records("value"))
Split the record on the "\n" delimiter using the PySpark split() function
df2 = df1.withColumn("value", split("value", "\n"))
Use the explode() function to transform a column of arrays or maps into multiple rows
df3 = df2.withColumn("value", explode("value"))
Let's print the output
df3.show()
Output:
+-------------------+
| value|
+-------------------+
|Hi#I#am#Row#Data#1#|
|Hi#I#am#Row#Data#2#|
|Hi#I#am#Row#Data#3#|
|Hi#I#am#Row#Data#4#|
|Hi#I#am#Row#Data#5#|
|Hi#I#am#Row#Data#6#|
|Hi#I#am#Row#Data#7#|
|Hi#I#am#Row#Data#8#|
+-------------------+

Using Pandas to get a contiguous segment of one dataframe and copy it into a new one?

Using Pandas, I'm attempting to 'slice' (Sorry if that's not the correct term) segments of a dataframe out of one DF and into a new one, where every segment is stacked one on top of the other.
Code:
import pandas as pd

df = pd.DataFrame(
    {
        'TYPE': ['System','VERIFY','CMD','SECTION','SECTION','VERIFY','CMD','CMD','VERIFY','CMD','System'],
        'DATE': [100,200,300,400,500,600,700,800,900,1000,1100],
        'OTHER': [10,20,30,40,50,60,70,80,90,100,110],
        'STEP': ['Power On','Start: 2','Start: 1-1','Start: 10-7','End: 10-7','Start: 3-1','Start: 10-8','End: 1-1','End: 3-1','End: 10-8','Power Off']
    })
print(df)

column_headers = df.columns.values.tolist()
col_name_type = 'TYPE'
col_name_other = 'OTHER'
col_name_step = 'STEP'
segments = []
df_blank = pd.DataFrame({'TYPE': ['BLANK ROW']}, columns = column_headers)
types_to_check = ['CMD', 'VERIFY']
type_df = df[df[col_name_type].isin(types_to_check)]

for row in type_df:
    if 'CMD' in row:
        if 'START:' in row[col_name_step].value:
            idx_start = row.iloc[::-1].str.match('VERIFY').first_valid_index()  # go backwards and find first VERIFY
            step_match = row[col_name_step].value[6:]  # get the unique ID after Start:
            idx_end = df[df[col_name_step].str.endswith(step_match, na=False)].last_valid_index()  # find last instance of matching unique id
            segments.append(df.loc[idx_start:idx_end, :])
            segments.append(df_blank)

df_segments = pd.concat(segments)
print(df)
print(df_segments)
Nothing gets populated in my segments list, so the concat function fails.
From my research I'm confident that this can be done using either .loc or .iloc, but I can't seem to get a working implementation.
My DF:
What I am trying to make:
Any help and/or guidance would be welcome.
Edit: To clarify, I'm trying to create a new DF comprised of every group of rows where the start is the "VERIFY" row that comes before a "CMD" row containing "Start:", and the end is the matching "CMD" row containing "End:".
EDIT2: I think the following is something close to what I need, but I'm unsure how to get it to reliably work:
segments = []
df_blank = pd.DataFrame({'TYPE': ['BLANK ROW']}, columns = column_headers)
types_to_check = ['CMD', 'VERIFY']
cmd_check = ['CMD']
verify_check = ['VERIFY']
cmd_df = df[(df[col_name_type].isin(cmd_check))]
cmd_start_df = cmd_df[(cmd_df[col_name_step].str.contains('START:'))]
for cmd_idx in cmd_start_df.index:
    step_name = df.loc[cmd_idx, col_name_step][6:]
    temp_df = df.loc[:cmd_idx,]
    idx_start = temp_df[col_name_type].isin(verify_check).last_valid_index()
    idx_end = cmd_df[cmd_df[col_name_type].str.endswith(step_name, na=False)].last_valid_index()
    segments.append(df.loc[idx_start:idx_end, :])
    segments.append(df_blank)

df_segments = pd.concat(segments)
You can use str.contains:
segmented_df = df.loc[df['STEP'].str.contains('Start|End')]
print(segmented_df)
I created some code to accomplish the 'slicing' I wanted:
slides = []  # list of row slices to concatenate later
for cmd_idx in cmd_start_df.index:
    step_name = df.loc[cmd_idx, col_name_step][6:]
    temp_df = df.loc[:cmd_idx,:]
    temp_list = temp_df[col_name_type].values.tolist()
    if 'VERIFY' in temp_list:
        idx_start = temp_df[temp_df[col_name_type].str.match('VERIFY')].last_valid_index()
    else:
        idx_start = cmd_idx
    idx_end = cmd_df[cmd_df[col_name_step].str.endswith(step_name, na=False)].last_valid_index()
    slides.append(df.loc[idx_start:idx_end, :])
    slides.append(df_blank)
I essentially create a new DF that is a subset of the old DF up to my first Start index, then find the last_valid_index with VERIFY, use that index to slice the DF from idx_start to idx_end, and eventually concat all those slices into one DF.
Maybe there's an easier way, but I couldn't find it.

Caching and Loops in (Py)Spark

I understand that 'for' and 'while' loops are generally to be avoided when using Spark. My question is about optimizing a 'while' loop, though if I'm missing a solution that makes it unnecessary, I am all ears.
I'm not sure I can demonstrate the issue (very long processing times, compounding as the loop goes on) with toy data, but here is some pseudo code:
### I have a function - called 'enumerator' - which involves several joins and window functions.
# I run this function on my base dataset, df0, and return df1
df1 = enumerator(df0, param1 = apple, param2 = banana)

# Check for some condition in df1, then count number of rows in the result
counter = df1 \
    .filter(col('X') == some_condition) \
    .count()

# If there are rows meeting this condition, start a while loop
while counter > 0:
    print('Starting with counter: ', str(counter))
    # Run the enumerator function on df1 again
    df2 = enumerator(df1, param1 = apple, param2 = banana)
    # Check for the condition again, then continue the while loop if necessary
    counter = df2 \
        .filter(col('X') == some_condition) \
        .count()
    df1 = df2

# After the while loop finishes, I take the last resulting dataframe and I will do several more operations and analyses downstream
final_df = df2
An essential aspect of the enumerator function is to 'look back' on a sequence in a window, and so it may take several runs before all the necessary corrections are made.
In my heart, I know this is ugly, but the windowing/ranking/sequential analysis within the function is critical. My understanding is that the underlying Spark query plan gets more and more convoluted as the loop continues. Are there any best practices I should adopt in this situation? Should I be caching at any point, either before the while loop starts or within the loop itself?
You definitely should cache/persist the dataframes, otherwise every iteration in the while loop will start from scratch from df0. Also you may want to unpersist the used dataframes to free up disk/memory space.
Another point to optimize is not to do a count, but use a cheaper operation, such as df.take(1). If that returns nothing then counter == 0.
df1 = enumerator(df0, param1 = apple, param2 = banana)
df1.cache()

# Check for some condition in df1, then count number of rows in the result
counter = len(df1.filter(col('X') == some_condition).take(1))

while counter > 0:
    print('Starting with counter: ', str(counter))
    df2 = enumerator(df1, param1 = apple, param2 = banana)
    df2.cache()
    counter = len(df2.filter(col('X') == some_condition).take(1))
    df1.unpersist()  # unpersist df1 as it will be overwritten
    df1 = df2

final_df = df2

Filter rows from a dataframe

I have got a string stored in a dataframe column
import pandas as pd
df = pd.DataFrame({"ID": 1, "content": "froyay-xcd = (E)-cut-2-froyay-xcd"}, index=[0])
print(df)
idx = df[df['content'].str.contains("froyay-xcd = (E)-cut-2-froyay-xcd")]
print(idx)
I'm trying to find the index of the row that contains a search string, and the following warning occurs:
UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
return func(self, *args, **kwargs)
I'm not sure why an empty dataframe is returned when the search string actually is present in the dataframe columns.
Any suggestions will be highly appreciated. I expect the output to return the row stored in the dataframe.
You can add the regex=False parameter to avoid the pattern being treated as a regex; here ( and ) are special regex characters:
idx = df[df['content'].str.contains("froyay-xcd = (E)-cut-2-froyay-xcd", regex=False)]
print(idx)
ID content
0 1 froyay-xcd = (E)-cut-2-froyay-xcd
Or you can escape the regex special characters with re.escape:
import re
idx = df[df['content'].str.contains(re.escape("froyay-xcd = (E)-cut-2-froyay-xcd"))]
print(idx)
ID content
0 1 froyay-xcd = (E)-cut-2-froyay-xcd
You can add \ before ( and ) to escape them, and then get the index using .index:
df.content.str.contains("froyay-xcd = \(E\)-cut-2-froyay-xcd").index
Int64Index([0], dtype='int64')
If you have more regex special characters, it is better to use regex=False as @jezrael said.

Most efficient way to compare two pandas data frames and update one dataframe based on condition

I have two dataframes, df1 and df2. df2 consists of "tagname" and "value" columns. The dictionary "bucket_dict" holds the data from df2.
bucket_dict = dict(zip(df2.tagname,df2.value))
df1 has millions of rows and 3 columns: "apptag", "comments" and "Type". I want to match between the two dataframes like this: if a dictionary key from bucket_dict is contained in df1["apptag"], then update df1["comments"] to that dictionary key and df1["Type"] to the corresponding bucket_dict value. I used the code below:
for each_tag in bucket_dict:
    df1.loc[df1["apptag"].str.match(each_tag, case=False, na=False), "comments"] = each_tag
    df1.loc[df1["apptag"].str.match(each_tag, case=False, na=False), "Type"] = bucket_dict[each_tag]
Is there a more efficient way to do this, since it is taking a long time?
Bucketing df from which the dictionary has been created:
bucketing_df = pd.DataFrame([["pen", "study"], ["pencil", "study"], ["ersr","study"],["rice","grocery"],["wht","grocery"]], columns=['tagname', 'value'])
The other dataframe:
output_df = pd.DataFrame([["test123-pen", "pen", " "], ["test234-pencil", "pencil", " "], ["test234-rice", "rice", " "]], columns=['apptag', 'comments', 'type'])
Required output:
You can do this by calling an apply on your apptag column along with a loc on your bucketing_df, in this manner -
def find_type(a):
    try:
        return (bucketing_df.loc[[x in a for x in bucketing_df['tagname']]])['value'].values[0]
    except:
        return ""

def find_comments(a):
    try:
        return (bucketing_df.loc[[x in a for x in bucketing_df['tagname']]])['tagname'].values[0]
    except:
        return ""

output_df['type'] = output_df['apptag'].apply(lambda a: find_type(a))
output_df['comments'] = output_df['apptag'].apply(lambda a: find_comments(a))
Here I had to make them separate functions so they could handle cases where no tagname existed in apptag.
It gives you this as the output_df -
apptag comments type
0 test123-pen pen study
1 test234-pencil pencil study
2 test234-rice rice grocery
All this code uses is the existing bucketing_df and output_df you provided at the end of your question.
