Fastest way to replace substrings with a dictionary (on a large dataset)

I have 10M texts (they fit in RAM) and a Python dictionary of the form:
"old substring": "new substring"
The dictionary contains ~15k substrings.
I am looking for the FASTEST way to process every text with the dict (i.e. to find every "old substring" in every text and replace it with "new substring").
The source texts are in a pandas DataFrame.
So far I have tried these approaches:
1) Replace in a loop with reduce and str.replace (~120 rows/sec):
from functools import reduce

replaced = []
for row in df.itertuples():
    replaced.append(reduce(lambda x, y: x.replace(y, mapping[y]), mapping, row[1]))
2) In a loop with a simple replace function ("mapping" is the 15k dict) (~160 rows/sec):
from tqdm import tqdm

def string_replace(text):
    for key in mapping:
        text = text.replace(key, mapping[key])
    return text

replaced = []
for row in tqdm(df.itertuples()):
    replaced.append(string_replace(row[1]))
Also, .iterrows() is about 20% slower than .itertuples().
3) Using apply on the Series (also ~160 rows/sec):
replaced = df['text'].apply(string_replace)
At these speeds it takes hours to process the whole dataset.
Does anyone have experience with this kind of mass substring replacement? Is it possible to speed it up? The solution can be tricky or ugly, but it has to be as fast as possible; it doesn't have to use pandas.
Thanks.
UPDATED:
Toy data to check the idea:
df = pd.DataFrame({"old": ["first text to replace",
                           "second text to replace"]})
mapping = {"first text": "FT",
           "replace": "rep",
           "second": "2nd"}
Expected result:
                       old         replaced
0    first text to replace        FT to rep
1   second text to replace  2nd text to rep

I've come back to this and found a fantastic library called flashtext.
The speedup on 10M records with a 15k vocabulary is about 100x (really, one hundred times faster than the regex or other approaches from my first post)!
It is very easy to use:
import flashtext

df = pd.DataFrame({"old": ["first text to replace",
                           "second text to replace"]})
mapping = {"first text": "FT",
           "replace": "rep",
           "second": "2nd"}

processor = flashtext.KeywordProcessor()
for k, v in mapping.items():
    processor.add_keyword(k, v)

print(list(map(processor.replace_keywords, df["old"])))
Result:
['FT to rep', '2nd text to rep']
It also adapts flexibly to other languages if needed, via the processor.non_word_boundaries attribute.
The trie-based search used here gives an amazing speedup.
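As a rough illustration of that attribute (a hedged sketch: in current flashtext versions non_word_boundaries is a set of characters treated as "inside a word", and extending it for a non-Latin alphabet keeps replacements limited to whole words; exact defaults may differ between versions):
import flashtext

processor = flashtext.KeywordProcessor()
processor.add_keyword("кот", "cat")
# Assumption: non_word_boundaries is a set; adding the Cyrillic alphabet makes
# Cyrillic letters count as word characters, so keywords only match whole words.
processor.non_word_boundaries |= set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя")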

One solution would be to convert the dictionary to a trie and write the code so that you pass through the modified text only once.
Basically, you advance through the text and the trie one character at a time, and as soon as a match is found, you replace it.
Of course, if you need to apply the replacements also to already-replaced text, this is harder.
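A minimal sketch of that single-pass trie idea (my own illustration; the helper names are made up): build a nested-dict trie from the mapping, then scan the text once, always taking the longest key that matches at the current position.
END = object()  # marker meaning "a key ends at this node"

def build_trie(mapping):
    trie = {}
    for key, value in mapping.items():
        node = trie
        for ch in key:
            node = node.setdefault(ch, {})
        node[END] = value
    return trie

def replace_with_trie(text, trie):
    out = []
    i = 0
    while i < len(text):
        node, j, match = trie, i, None
        # walk the trie as far as the text allows, remembering the longest match
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if END in node:
                match = (j, node[END])
        if match:
            out.append(match[1])  # emit the replacement, skip the matched key
            i = match[0]
        else:
            out.append(text[i])   # no key starts here, keep the character
            i += 1
    return "".join(out)

trie = build_trie({"first text": "FT", "replace": "rep", "second": "2nd"})
print(replace_with_trie("second text to replace", trie))  # 2nd text to rep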

I think you are looking for replace with regex on the DataFrame, i.e.
if you have a dictionary, pass it as a parameter.
d = {'old substring': 'new substring', 'anohter': 'another'}
For the entire DataFrame:
df.replace(d, regex=True)
For a Series:
df[column].replace(d, regex=True)
Example:
df = pd.DataFrame({"old": ["first text to replace",
                           "second text to replace"]})
mapping = {"first text": "FT",
           "replace": "rep",
           "second": "2nd"}
df['replaced'] = df['old'].replace(mapping, regex=True)
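One caveat worth noting here (my addition, not part of the original answer): with regex=True the dictionary keys are interpreted as regular expressions, so if the "old" substrings can contain characters such as '.', '(' or '+', escape them first:
import re

safe_mapping = {re.escape(k): v for k, v in mapping.items()}
df['replaced'] = df['old'].replace(safe_mapping, regex=True)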

Related

Text analysis: Match long list of keywords with character strings in a text variable in dataframe

I am trying to match a long list of keywords (all German municipalities) saved as a dataframe (df_2) with another variable from a second dataframe (df_1) that contains character strings (long descriptive text).
As output, I am trying to extract the names of the municipalities from df_2 that are mentioned in df_1, if possible including a count of how many times each name is mentioned in df_1.
So far I have tried grep(), but the pattern built from df_2 exceeds my memory (the pattern contains close to 13,000 values); sapply(), which runs into the same memory problems (using only a part of df_2 already takes up more than 6 GB); str_extract(); and the quanteda package, using df_2 as a "dictionary". But I am getting nowhere.
This is a replicated sample with the part of my code that works (although I only get the whole string as output, not just the name of the municipality, so it isn't very useful yet).
Is there a way to write a function that adds a new variable to df_1 and copies the value of df_2[i] whenever there is a match for df_2[i] in the character string df_1$description?
I think this may be the only option that doesn't overload the memory.
### sample for testing
name <- c("A", "B", "C", "D")
date <- c("1999-03-02", "1999-04-02", "1999-05-02", "1999-06-02")
event <- c("occurrence1", "occurrence2", "occurrence3", "occurrence4")
description <- c("this is a sample text and München that is also a sample text Berlin",
                 "this is a sample text and Detmold that is also a sample text Berlin and Berlin",
                 "this is a sample text and Darmstadt that is also a sample text Magdeburg and Halle",
                 "this is a sample text and München that is also a sample text Berlin")
df_1 <- cbind(name, date, event, description)
df_1 <- as.data.frame(df_1)
locations <- c("München", "Berlin", "Darmstadt", "Magdeburg", "Detmold", "Halle")
df_2 <- as.data.frame(locations)
##sample code for generating output
pattern_sample <- paste(df_2$locations, collapse="|")
result_sample <- grep(pattern_sample, df_1$description, value=TRUE)
result_sample

Is there a way to only list a certain format of text from a list?

I am quite new to Python,
and I want to keep only entries of a certain format from a bigger list. For example,
here is what's in the list:
/ABC/EF213
/ABC/EF
/ABC/12AC4
/ABC/212
However, the only ones I want listed are those matching the format /###/#####, while the rest get discarded.
You could use a generator expression or a for loop to check each element of the list to see if it matches a pattern. One way of doing this would be to check if the item matches a regex pattern.
As an example:
import re

original_list = ["Item I don't want", "/ABC/EF213", "/ABC/EF", "/ABC/12AC4", "/ABC/212", "123/456", "another useless item", "/ABC/EF"]
filtered_list = [item for item in original_list if re.fullmatch(r"/\w+/\w+", item) is not None]
print(filtered_list)
outputs
['/ABC/EF213', '/ABC/EF', '/ABC/12AC4', '/ABC/212', '/ABC/EF']
If you need help building regex patterns, there are many great websites, such as regexr, that can help you.
Every string can be indexed like a list without any conversion. If the only format you want to check is /###/#####, then you can simply write if conditions like these:
for text in your_list:
    if len(text) == 10 and text[0] == "/" and text[4] == "/" (and so on):
        print(text)
Of course this would require a lot of conditions and would take a pretty long time to write. So I would recommend a faster and simpler check. We could do this by, for example, splitting the texts, which would look something like this:
for text in your_list:
    checkstring = text.split("/")
Now you have your text split into parts, and you can simply check the lengths of these new parts with the len() function, as in the sketch below.
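A minimal sketch of that split-based check (my own illustration; your_list is hypothetical, and the "three characters, then one to five characters" rule is an assumption about how strictly the /###/##### format should be read):
your_list = ["/ABC/EF213", "/ABC/EF", "not wanted", "123/456", "/ABC/12AC4"]
kept = []
for text in your_list:
    parts = text.split("/")
    # a valid item splits into ['', <three characters>, <one to five characters>]
    if len(parts) == 3 and parts[0] == "" and len(parts[1]) == 3 and 1 <= len(parts[2]) <= 5:
        kept.append(text)
print(kept)  # ['/ABC/EF213', '/ABC/EF', '/ABC/12AC4']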

Convert everything in a dictionary to lower case, then filter on it?

import pandas as pd
import nltk
import os

directory = os.listdir(r"C:\...")
x = []
num = 0
for i in directory:
    x.append(pd.read_fwf("C:\\..." + i))
    x[num] = x[num].to_string()
So, once I have a dictionary x = [ ] populated by the read_fwf for each file in my directory:
I want to know how to make it so every single character is lowercase. I am having trouble understanding the syntax and how it is applied to a dictionary.
I want to define a filter that I can use to count occurrences of a list of words in this newly defined dictionary, e.g.,
list = [bus, car, train, aeroplane, tram, ...]
Edit: Quick unrelated question:
Is pd.read_fwf the best way to read .txt files? If not, what else could I use?
Any help is very much appreciated. Thanks
Edit 2: Sample data and output that I want:
Sample:
The Horncastle boar's head is an early seventh-century Anglo-Saxon
ornament depicting a boar that probably was once part of the crest of
a helmet. It was discovered in 2002 by a metal detectorist searching
in the town of Horncastle, Lincolnshire. It was reported as found
treasure and acquired for £15,000 by the City and County Museum, where
it is on permanent display.
Required output - changes everything in uppercase to lowercase:
the horncastle boar's head is an early seventh-century anglo-saxon
ornament depicting a boar that probably was once part of the crest of
a helmet. it was discovered in 2002 by a metal detectorist searching
in the town of horncastle, lincolnshire. it was reported as found
treasure and acquired for £15,000 by the city and county museum, where
it is on permanent display.
You shouldn't need to use pandas or dictionaries at all. Just use Python's built-in open() function:
# Open a file in read mode with a context manager
with open(r'C:\path\to\you\file.txt', 'r') as file:
    # Read the file into a string
    text = file.read()

# Use the string's lower() method to make everything lowercase
text = text.lower()
print(text)

# Split the text by whitespace into a list of words
word_list = text.split()

# Get the number of elements in the list (the word count)
word_count = len(word_list)
print(word_count)
If you want, you can do it in the reverse order:
# Open a file in read mode with a context manager
with open(r'C:\path\to\you\file.txt', 'r') as file:
    # Read the file into a string
    text = file.read()

# Split the text by whitespace into a list of words
word_list = text.split()

# Use a list comprehension to create a new list with lower() applied to each word
lowercase_word_list = [word.lower() for word in word_list]
print(lowercase_word_list)
Using a context manager for this is good since it automatically closes the file for you as soon as execution leaves the with block (i.e. the de-indented code after it). Otherwise you would have to open the file with open() and remember to call file.close() yourself.
I think there are some other benefits to using context managers, but someone please correct me if I'm wrong.
I think what you are looking for is dictionary comprehension:
# Python 3
new_dict = {key: val.lower() for key, val in old_dict.items()}
# Python 2
new_dict = {key: val.lower() for key, val in old_dict.iteritems()}
items()/iteritems() gives you an iterable of the (key, value) tuples in the dictionary (e.g. [('somekey', 'SomeValue'), ('somekey2', 'SomeValue2')]).
The comprehension iterates over each of these pairs, creating a new dictionary in the process. In the key: val.lower() section, you can do whatever manipulation you want to create the new dictionary.
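For instance, applied to a small dictionary (a hypothetical old_dict, purely for illustration), the comprehension above lowercases every value while keeping the keys:
old_dict = {'somekey': 'SomeValue', 'somekey2': 'SomeValue2'}
new_dict = {key: val.lower() for key, val in old_dict.items()}
print(new_dict)  # {'somekey': 'somevalue', 'somekey2': 'somevalue2'}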

How can I create a dictionary for a large amount of text and list the most frequent word?

I am new to coding, and I am trying to create a dictionary from a large body of text; I would also like the most frequent word to be shown.
For example, if I had a block of text such as:
text = '''George Gordon Noel Byron was born, with a clubbed right foot, in London on January 22, 1788. He was the son of Catherine Gordon of Gight, an impoverished Scots heiress, and Captain John (“Mad Jack”) Byron, a fortune-hunting widower with a daughter, Augusta. The profligate captain squandered his wife’s inheritance, was absent for the birth of his only son, and eventually decamped for France as an exile from English creditors, where he died in 1791 at 36.'''
I know the steps I would like the code to take. I want words that are the same but capitalised differently to be counted together, so "Hi" and "hi" would together count as Hi = 2.
I am trying to get the code to loop through the text and create a dictionary showing how many times each word appears. My final goal is to then have the code state which word appears most frequently.
I don't know how to approach such a large amount of text, the examples I have seen are for a much smaller amount of words.
I have tried to remove white space and also create a loop but I am stuck and unsure if I am going the right way about coding this problem.
a.replace(" ", "")
print(a.replace)  # this is what I tried to write to remove white spaces
# this gave <built-in method replace of str object at 0x000001A49AD8DAE0>, and I have no idea what this means!
I am unsure of how to create the dictionary.
To count the word frequency, would I do something like:
frequency = {}
for value in my_dict.values():
    if value in frequency:
        frequency[value] = frequency[value] + 1
    else:
        frequency[value] = 1
What I was expecting to get was a dictionary that lists each word shown with a numerical value showing how often it appears in the text.
Then I wanted to have the code show the word that occurs the most.
This may be too simple for your requirements, but you could do this to create a dictionary of each word and its number of repetitions in the text.
text = "..." # text here.
frequency = {}
for word in text.split(" "):
if word not in frequency.keys():
frequency[word] = 1
else:
frequency[word] += 1
print(frequency)
This only splits the text at each ' ' and counts the occurrences of each word.
If you want only the words themselves, you may have to remove the ',' and other characters which you do not wish to have in your dictionary.
To remove characters such as ',', do:
text = text.replace(",", "")
Hope this helps and happy coding.
First, to remove all non-alphabet characters aside from ', we can use regex.
After that, we go through the list of words and use a dictionary:
import re

d = {}
text = text.split(" ")  # turns it into a list
text = [re.findall("[a-zA-Z']", text[i]) for i in range(len(text))]
# each word is split into characters, with non-alphabet/non-apostrophe characters removed
text = ["".join(text[i]) for i in range(len(text))]
# puts each word back together
# there may be a better way for the step above. If so, please tell.
for word in text:
    if word in d.keys():
        d[word] += 1
    else:
        d[word] = 1
d.pop("")
# not sure why, but when testing I got one key ""
You can use regex and Counter from collections:
import re
from collections import Counter

text = "This cat is not a cat, even if it looks like a cat"

# Extract words with regex, ignoring symbols and spaces
words = re.compile(r"\b\w+\b").findall(text.lower())
count = Counter(words)
# Counter({'cat': 3, 'a': 2, 'this': 1, 'is': 1, 'not': 1, 'even': 1, 'if': 1, 'it': 1, 'looks': 1, 'like': 1})

# To get the most frequent word
most_frequent = max(count, key=lambda k: count[k])
# 'cat'
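As a small usage note (my addition, not part of the original answer): Counter also provides most_common(), which returns the items sorted by frequency, so the top word and its count can be read off directly:
print(count.most_common(1))  # [('cat', 3)]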

Generate sensible strings using a pattern

I have a table of strings (about 100,000) in the following format:
pattern , string
e.g. -
*l*ph*nt , elephant
c*mp*t*r , computer
s*v* , save
s*nn] , sunny
]*rr] , worry
To simplify, assume a * denotes a vowel, a consonant stands unchanged and ] denotes either a 'y' or a 'w' (say, for instance, semi-vowels/round-vowels in phonology).
Given a pattern, what is the best way to generate the possible sensible strings? A sensible string is defined as a string in which every consecutive two-letter substring that was not specified in the pattern occurs somewhere in the data-set.
e.g. -
h*ll* --> hallo, hello, holla ...
'hallo' is sensible because 'ha', 'al' and 'lo' can be seen in the data-set, in the words 'have', 'also' and 'low'. The two letters 'll' are not considered because they were specified in the pattern.
What are simple and efficient ways to do this?
Are there any libraries/frameworks for achieving this?
I have no specific language in mind but would prefer to use Java for this program.
This is particularly well suited to Python itertools, set and re operations:
import re
import itertools

VOWELS = 'aeiou'
SEMI_VOWELS = 'wy'
DATASET = '/usr/share/dict/words'
SENSIBLES = set()

def digraphs(word, digraph=r'..'):
    '''
    >>> digraphs('bar')
    set(['ar', 'ba'])
    '''
    base = re.findall(digraph, word)
    base.extend(re.findall(digraph, word[1:]))
    return set(base)

def expand(pattern, wildcard, elements):
    '''
    >>> expand('h?', '?', 'aeiou')
    ['ha', 'he', 'hi', 'ho', 'hu']
    '''
    tokens = re.split(re.escape(wildcard), pattern)
    results = set()
    for perm in itertools.permutations(elements, len(tokens)):
        results.add(''.join([l for p in zip(tokens, perm) for l in p][:-1]))
    return sorted(results)

def enum(pattern):
    not_sensible = digraphs(pattern, r'[^*\]]{2}')
    for p in expand(pattern, '*', VOWELS):
        for q in expand(p, ']', SEMI_VOWELS):
            if (digraphs(q) - not_sensible).issubset(SENSIBLES):
                print(q)

## Init the data-set (may take a while...);
## you may want to pre-compute this
## and adapt it to your data-set.
for word in open(DATASET, 'r').readlines():
    for digraph in digraphs(word.rstrip()):
        SENSIBLES.add(digraph)

enum('*l*ph*nt')
enum('s*nn]')
enum('h*ll*')
As there aren't many possibilities for two-letter substrings, you can go through your dataset and generate a table that contains the count for every two-letter substring; the table will look something like this:
ee 1024 times
su 567 times
...
xy 45 times
xz 0 times
The table will be small, as you only have about 26*26 = 676 values to store.
You have to do this only once for your dataset (or update the table whenever the dataset changes, if it is dynamic) and can then use the table for evaluating possible strings. For your example, add the values for 'ha', 'al' and 'lo' to get a "score" for the string 'hallo'. After that, choose the string(s) with the highest score(s).
Note that the scoring can be improved by checking longer substrings, e.g. three letters, but this will also result in larger tables.
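A minimal sketch of that counting and scoring idea (my own illustration; the helper names and the tiny word list are made up):
from collections import Counter

def digraph_counts(words):
    # count every two-letter substring across the dataset
    counts = Counter()
    for w in words:
        w = w.lower()
        for i in range(len(w) - 1):
            counts[w[i:i+2]] += 1
    return counts

def score(candidate, counts):
    # sum the dataset frequencies of every two-letter substring in the candidate
    return sum(counts[candidate[i:i+2]] for i in range(len(candidate) - 1))

counts = digraph_counts(["have", "also", "low", "hello"])
for cand in ["hallo", "hello", "hxllo"]:
    print(cand, score(cand, counts))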
