Text analysis: Match long list of keywords with character strings in a text variable in dataframe - text

I am trying to match a long list of keywords (all German municipalities) saved as a dataframe (df_2) with another variable from a second dataframe (df_1) that contains character strings (long descriptive text).
As output, I am trying to extract the names of the municipalities from df_2 that are mentioned in df_1, if possible including a count of how many times each name is mentioned in df_1.
So far I have tried grep(), but the pattern built from df_2 exceeds my memory (it contains close to 13,000 values); sapply(), with the same memory problem (using only part of df_2 already takes up more than 6 GB); str_extract(); and the quanteda package with df_2 as a "dictionary". But I am getting nowhere.
Below is a replicated sample with the part of my code that works (although I only get the whole string as output, not just the name of the municipality, so it isn't very useful yet).
Is there a way to write a function that adds a new variable to df_1 and copies the value of df_2[i] into it whenever df_2[i] matches somewhere in the character string df_1$description?
I think this may be the only option that doesn't overload the memory.
### sample for testing
name <- c( "A", "B", "C", "D")
date <- c("1999-03-02","1999-04-02","1999-05-02","1999-06-02" )
event <- c("occurrence1","occurrence2","occurrence3","occurrence4" )
description <- c("this is a sample text and München that is also a sample text Berlin",
"this is a sample text and Detmold that is also a sample text Berlin and Berlin",
"this is a sample text and Darmstadt that is also a sample text Magdeburg and Halle",
"this is a sample text and München that is also a sample text Berlin" )
df_1 <- cbind(name, date, event, description)
df_1 <- as.data.frame(df_1)
locations <- c("München", "Berlin", "Darmstadt", "Magdeburg", "Detmold", "Halle")
df_2 <- as.data.frame(locations)
##sample code for generating output
pattern_sample <- paste(df_2$locations, collapse="|")
result_sample <- grep(pattern_sample, df_1$description, value=TRUE)
result_sample

Related

Is there a Python solution for mapping a pandas data frame with the unique values of a split string column?

I have a data frame (df).
The data frame contains a string column called supported_cpu.
The supported_cpu data is a comma-separated string.
I want to use this data for the ML model.
I had to get the unique values for the supported_cpu column. The output is a list of unique values.
def pars_string(df, col):
    # separate the comma-separated string column into lists using split
    data = df[col].value_counts().reset_index()
    data['index'] = data['index'].str.split(",")
    # create one flat list containing all of the comma-separated items
    df_01 = []
    for i in range(data.shape[0]):
        for j in data['index'][i]:
            df_01.append(j)
    # get unique values from df_01
    list_01 = list(set(df_01))
    # some entries have leading or trailing spaces, which need to be stripped to get truly unique values
    list_02 = [x.strip(' ') for x in list_01]
    # get unique values from list_02
    list_03 = list(set(list_02))
    return list_03
supported_cpu_list = pars_string(df=df,col='supported_cpu')
The output:
I want to map this output to the data frame to encode it for the ML model.
How could I store the output in the data frame? Note: some rows have multiple values (more than one CPU).
Input: a comma-separated string.
Output: I do not know what it should be.
I really recommend that anyone who is starting to use pandas read about vectorization and think in terms of columns (aka Series). This is the way it was built and the way it is supposed to be used.
From what I understand (I may be wrong), you want to get the unique values from the supported_cpu column. You could use the Series string methods to split that column, then flatten the resulting lists using itertools.chain:
from itertools import chain
df['supported_cpu'] = df['supported_cpu'].str.split(pat=',')
# flatten the per-row lists into one set of unique values, dropping empty strings
unique_vals = set(chain(*df['supported_cpu'].tolist()))
unique_vals = {item for item in unique_vals if item}
Multi-valued rows should be parsed into single values for later ML model training. The list can be converted to a dataframe simply with pd.DataFrame(supported_cpu_list).
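To actually store this back in the data frame, one option (a sketch with made-up CPU names, since the real values are not shown) is a multi-hot encoding of the comma-separated values using str.get_dummies, which gives one 0/1 column per unique CPU and handles rows with multiple values:
import pandas as pd

# toy stand-in for the real data frame; the CPU names are made up
df = pd.DataFrame({"supported_cpu": ["i5, i7", "i7", "i3, i5, i7"]})

# split on commas, strip stray spaces, then multi-hot encode one column per CPU
dummies = (df["supported_cpu"]
           .str.split(",")
           .apply(lambda cpus: [c.strip() for c in cpus])
           .str.join("|")
           .str.get_dummies(sep="|"))

df = df.join(dummies)
print(df)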

Identify numbers, in a large data string, that are prefixed to a letter (up to 2 positions), in between other characters

I have a string containing thousands of lines of this data without line breaks (only a few lines shown here, with line breaks, for readability):
5BengaluruUrban4598962MSARICoughBreathlessnessDM23.07.2020atGovernmenthospital
7DakshinaKannada4786665FSARICoughDMHTN23-07-2020atPrivatehospital
Format is
(entry number)(district)(patient number)(age)(gender)(case of)(symptoms)(comorbidity)(date of death)(place of death)
without spaces, or brackets.
Problem: the data I want to collect is the age.
However, I can't seem to find a way to single out the age, since it is clouded by a lot of other numbers in the data. I have tried various iterations of count, limiting it to 1 to 99, separating the data, etc., and failed.
My idea: since the gender is always either 'M' or 'F', and the two digits before the gender are the age, isolating the two digits before the gender seems like an ideal solution.
xxM
xxF
My goal: I would like to collect all the xx numbers, irrespective of gender, and store them in a list. How do I go about this?
import re
input_str = '5BengaluruUrban4598962MSARICoughBreathlessnessDM23.07.2020atGovernmenthospital7DakshinaKannada4786665FSARICoughDMHTN23-07-2020atPrivatehospital'
ages = [found[-3:-1] for found in re.findall('[0-9]+[M,F]', input_str, re.I)]
print(ages)
# ['62', '65']
This works fine with the sample, but if there is a district name starting with 'M' or 'F', the entry number will be collected as well.
A workaround is to match exactly seven digits (if the patient number is always 5 digits and the age is generally 2 digits).
ages = [found[-3:-1] for found in re.findall(r'\d{7}[M,F]', input_str, re.I)]
With the structure you gave, I've built a dict of regular expressions to match the components, then put the parsed tokens back into a dict.
There are ways I can imagine this will not work:
if age < 10 it is only 1 digit, so you will pick up a digit of the patient number
there may be strings that don't match the regular expressions, which will mean odd results
It's the most structured way I can think of to go about it....
import re
data = "5BengaluruUrban4598962MSARICoughBreathlessnessDM23.07.2020atGovernmenthospital7DakshinaKannada4786665FSARICoughDMHTN23-07-2020atPrivatehospital"
md = {
    "entrynum": "([0-9]+)",
    "district": "([A-Z,a-z]+)",
    "patnum_age": "([0-9]+)",
    "sex": "([M,F])",
    "remainder": "(.*)$"
}
data_dict = {list(md.keys())[i]: tk
             for i, tk in
             enumerate([tk for tk in re.split("".join(md.values()), data) if tk != ""])
             }
print(f"Assumed age:{data_dict['patnum_age'][-2:]}\nparsed:{data_dict}\n")
output
Assumed age:62
parsed:{'entrynum': '5', 'district': 'BengaluruUrban', 'patnum_age': '4598962', 'sex': 'M', 'remainder': 'SARICoughBreathlessnessDM23.07.2020atGovernmenthospital7DakshinaKannada4786665FSARICoughDMHTN23-07-2020atPrivatehospital'}
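A possible extension of this structured approach (only a sketch, with the same layout assumptions: each record starts with digits followed by a capitalised district name, and the dates are never immediately followed by a capital letter plus a lowercase letter) is to use named groups with re.finditer so that every record is parsed, not just the first one:
import re

data = "5BengaluruUrban4598962MSARICoughBreathlessnessDM23.07.2020atGovernmenthospital7DakshinaKannada4786665FSARICoughDMHTN23-07-2020atPrivatehospital"

# one record: entry number, district, patient number + age, gender, then everything
# up to (but not including) the start of the next record (digits + capitalised district)
record = re.compile(
    r"(?P<entrynum>\d+)(?P<district>[A-Za-z]+)(?P<patnum_age>\d+)(?P<sex>[MF])"
    r"(?P<rest>.*?)(?=\d+[A-Z][a-z]|$)"
)

ages = [m.group("patnum_age")[-2:] for m in record.finditer(data)]
print(ages)   # ['62', '65']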

How to read integer values from a text file and count the occurrences of each value in pyspark

I would like to read from the text file shown below, loop through each individual digit, and determine which digit occurs the maximum number of times. How can I go about it in pyspark?
Here is the txt file
1.4142135623 7309504880 1688724209 6980785696 7187537694 8073176679 7379907324 7846210703 8850387534 3276415727 3501384623 0912297024 9248360558 5073721264 4121497099 9358314132 2266592750 5592755799 9505011527 8206057147 0109559971 6059702745 3459686201 4728517418 6408891986 0955232923 0484308714 3214508397 6260362799 5251407989 6872533965 4633180882 9640620615 2583523950 5474575028 7759961729 8355752203 3753185701 1354374603 4084988471 6038689997 0699004815 0305440277 9031645424 7823068492 9369186215 8057846311 1596668713 0130156185 6898723723 5288509264 8612494977 1542183342 0428568606 0146824720 7714358548 7415565706 9677653720 2264854470 1585880162 0758474922 6572260020 8558446652 1458398893 9443709265 9180031138 8246468157 0826301005 9485870400 3186480342 1948972782 9064104507 2636881313 7398552561 1732204024 5091227700 2269411275 7362728049 5738108967 5040183698 6836845072 5799364729 0607629969 4138047565 4823728997 1803268024 7442062926 9124859052 1810044598 4215059112 0249441341 7285314781 0580360337 1077309182 8693147101 7111168391 6581726889 4197587165 8215212822 951848847
Whether it is a single line or multiple lines in the text file, first filter out everything that is not a digit, and then build a pair RDD (with the digits from the text file as keys).
Below is the code for it (assuming df was loaded with spark.read.text, so each row has a value field):
strRDD = df.rdd.flatMap(lambda line: list(str(line.value)))\
    .filter(lambda digit: digit.isdigit())\
    .map(lambda digit: (int(digit), 1))\
    .reduceByKey(lambda a, b: a + b).max(lambda x: x[1])
print(strRDD)
Output for your text file is: (8, 113)
// Using RDDs - this will be better for your scenario.
Load the data using textFile, then use flatMap (since you said all numbers are in a single line) to split on the space delimiter. Then convert each number from string to float, as some of your numbers are floats. Now make a pair RDD (the number as key, and one as the value for each number). Using reduceByKey, aggregate the total count of each number, and then sort by value in descending order using sortBy.
data = sc.textFile("your file path")
numcount = data.flatMap(lambda x: x.split()).map(lambda x: (float(x), 1)).reduceByKey(lambda a, b: a + b).sortBy(lambda x: x[1], ascending=False)
for i in numcount.collect(): print (i)
// Using DataFrames:
Since you said all numbers are in a single line, I am reading the file as text to get all values into a single column. Then I create a new column by splitting that column into a list on the space delimiter using the split function, and explode the list to get each number into its own row.
Now drop the old column using the drop function.
Using the groupBy and count functions you can count how many times each number shows up, and sorting on the count in descending order gives you the ranking.
from pyspark.sql.functions import *
numcount = spark.read.text("your file path").withColumn('new', explode(split('value',' '))).drop('value')
numcount.groupBy('new').agg(count('new').alias('counts')).sort(desc('counts')).show()
Starting from a dataframe with the data (one value per row) you can do:
df = df.groupby("digit").count().sort("count", ascending=False)
This code returns the values ordered from most to least repeated. After that you can show these values and get the most repeated one with:
df.show()
df.limit(1).show()
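For completeness, a minimal sketch (assuming a pyspark shell where spark is available; the file path is a placeholder) of how such a one-digit-per-row dataframe could be built from the text file before running the groupby above:
# read the file, keep only digit characters, and put each digit in its own row
rdd = (spark.read.text("your file path").rdd
       .flatMap(lambda row: [ch for ch in row.value if ch.isdigit()])
       .map(lambda d: (d,)))
df = spark.createDataFrame(rdd, ["digit"])
df.groupby("digit").count().sort("count", ascending=False).show()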

How to get confidence of each line using pytesseract

I have successfully set up Tesseract and can convert the images to text...
text = pytesseract.image_to_string(Image.open(image))
However, I need to get the confidence value for every line. I cannot find a way to do this using pytesseract. Anyone know how to do this?
I know this is possible using PyTessBaseAPI, but I cannot use that; I've spent hours attempting to set it up with no luck, so I need a way to do this using pytesseract.
After much searching, I have figured out a way. Instead of image_to_string, one should use image_to_data. However, this will give you statistics for each word, not each line...
text = pytesseract.image_to_data(Image.open(file_image), output_type='data.frame')
So what I did was save it as a dataframe and then use pandas to group by block_num, since each line is grouped into blocks by the OCR. I also removed all rows with no confidence value (-1)...
text = text[text.conf != -1]
lines = text.groupby('block_num')['text'].apply(list)
Using this same logic, you can also calculate the confidence per line by calculating the mean confidence of all words within the same block...
conf = text.groupby(['block_num'])['conf'].mean()
@Srikar Appalaraju is right. Take the following example image:
Now use the following code:
text = pytesseract.image_to_data(gray, output_type='data.frame')
text = text[text.conf != -1]
text.head()
Notice that all five rows have the same block_num, so if we group by that column alone, all 5 words (texts) will be grouped together. That's not what we want: we want to group only the first 3 words that belong to the first line. To do that properly (in a generic manner) for a large enough image, we need to group by all 4 columns page_num, block_num, par_num and line_num simultaneously in order to compute the confidence per line, as shown in the following code snippet:
lines = text.groupby(['page_num', 'block_num', 'par_num', 'line_num'])['text'] \
.apply(lambda x: ' '.join(list(x))).tolist()
confs = text.groupby(['page_num', 'block_num', 'par_num', 'line_num'])['conf'].mean().tolist()
line_conf = []
for i in range(len(lines)):
    if lines[i].strip():
        line_conf.append((lines[i], round(confs[i], 3)))
with the following desired output:
[('Ying Thai Kitchen', 91.667),
('2220 Queen Anne AVE N', 88.2),
('Seattle WA 98109', 90.333),
('« (206) 285-8424 Fax. (206) 285-8427', 83.167),
('‘uw .yingthaikitchen.com', 40.0),
('Welcome to Ying Thai Kitchen Restaurant,', 85.333),
('Order#:17 Table 2', 94.0),
('Date: 7/4/2013 7:28 PM', 86.25),
('Server: Jack (1.4)', 83.0),
('44 Ginger Lover $9.50', 89.0),
('[Pork] [24#]', 43.0),
('Brown Rice $2.00', 95.333),
('Total 2 iten(s) $11.50', 89.5),
('Sales Tax $1.09', 95.667),
('Grand Total $12.59', 95.0),
('Tip Guide', 95.0),
('TEK=$1.89, 18%=62.27, 20%=82.52', 6.667),
('Thank you very much,', 90.75),
('Cone back again', 92.667)]
The current accepted answer is not entirely correct. The correct way to get each line using pytesseract is
text.groupby(['block_num','par_num','line_num'])['text'].apply(list)
We need to do this based on this answer: Does anyone knows the meaning of output of image_to_data, image_to_osd methods of pytesseract?
Column block_num: Block number of the detected text or item
Column par_num: Paragraph number of the detected text or item
Column line_num: Line number of the detected text or item
Column word_num: word number of the detected text or item
All 4 columns above are interconnected. If an item comes from a new line, then word_num starts counting again from 0; it doesn't continue from the last word number of the previous line. The same goes for line_num, par_num and block_num.
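Putting the pieces together, a compact variant (just a sketch; the image filename is a placeholder) that returns each line's text and its mean confidence from a single groupby:
import pytesseract
from PIL import Image

text = pytesseract.image_to_data(Image.open("receipt.png"), output_type='data.frame')
text = text[text.conf != -1]

# group by the full page/block/paragraph/line hierarchy, aggregating text and confidence together
lines = (text.groupby(['page_num', 'block_num', 'par_num', 'line_num'])
             .agg(line=('text', lambda words: ' '.join(str(w) for w in words)),
                  conf=('conf', 'mean'))
             .reset_index())
print(lines[['line', 'conf']])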

Fastest way to replace substrings with dictionary (On large dataset)

I have 10M texts (they fit in RAM) and a Python dictionary of the kind:
"old substring":"new substring"
The size of a dictionary is ~15k substrings.
I am looking for the FASTEST way to apply the dict to each text (i.e. to find every "old substring" in every text and replace it with its "new substring").
The source texts are in pandas dataframe.
So far I have tried these approaches:
1) Replace in a loop with reduce and str replace (~120 rows/sec)
from functools import reduce
replaced = []
for row in df.itertuples():
    replaced.append(reduce(lambda x, y: x.replace(y, mapping[y]), mapping, row[1]))
2) In a loop with a simple replace function ("mapping" is the 15k dict) (~160 rows/sec):
def string_replace(text):
    for key in mapping:
        text = text.replace(key, mapping[key])
    return text
from tqdm import tqdm
replaced = []
for row in tqdm(df.itertuples()):
    replaced.append(string_replace(row[1]))
Also .iterrows() works 20% slower than .itertuples()
3) Using apply on Series (also ~160 rows/sec):
replaced = df['text'].apply(string_replace)
At these speeds it takes hours to process the whole dataset.
Does anyone have experience with this kind of mass substring replacement? Is it possible to speed it up? It can be tricky or ugly, but it has to be as fast as possible, not necessarily using pandas.
Thanks.
UPDATED:
Toy data to check the idea:
df = pd.DataFrame({ "old":
["first text to replace",
"second text to replace"]
})
mapping = {"first text": "FT",
"replace": "rep",
"second": '2nd'}
result expected:
old replaced
0 first text to replace FT to rep
1 second text to replace 2nd text to rep
I've come back to this and found a fantastic library called flashtext.
The speedup on 10M records with a 15k vocabulary is about 100x (really one hundred times faster than regexp or the other approaches from my first post)!
Very easy to use:
df = pd.DataFrame({ "old":
["first text to replace",
"second text to replace"]
})
mapping = {"first text": "FT",
"replace": "rep",
"second": '2nd'}
import flashtext
processor = flashtext.KeywordProcessor()
for k, v in mapping.items():
    processor.add_keyword(k, v)
print(list(map(processor.replace_keywords, df["old"])))
Result:
['FT to rep', '2nd text to rep']
It also adapts flexibly to different languages if needed, using the processor.non_word_boundaries attribute.
The trie-based search used here gives the amazing speedup.
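To illustrate that last point, a small sketch (the umlauts are an assumed example for another language): flashtext treats [A-Za-z0-9_] as word characters by default, and non_word_boundaries is the set you can extend so that additional characters count as part of a word:
import flashtext

processor = flashtext.KeywordProcessor()
# extend the default word-character set so e.g. German umlauts are treated as part of a word
processor.non_word_boundaries |= set("äöüÄÖÜß")
processor.add_keyword("München", "Munich")
print(processor.replace_keywords("Grüße aus München"))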
One solution would be to convert the dictionary to a trie and write the code so that you only pass through the text once; a rough sketch follows below.
Basically, you advance through the text and the trie one character at a time, and as soon as a match is found, you replace it.
Of course, if you also need to apply the replacements to text that has already been replaced, this is harder.
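A minimal sketch of that idea (illustrative only, not the flashtext implementation, though flashtext applies the same principle): build a character trie from the mapping, then scan the text once, replacing the longest keyword found at each position.
def build_trie(mapping):
    trie = {}
    for old, new in mapping.items():
        node = trie
        for ch in old:
            node = node.setdefault(ch, {})
        node['__end__'] = new          # store the replacement at the terminal node
    return trie

def replace_with_trie(text, trie):
    out, i, n = [], 0, len(text)
    while i < n:
        node, j, match = trie, i, None
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            if '__end__' in node:
                match = (j, node['__end__'])   # remember the longest match so far
        if match:
            end, replacement = match
            out.append(replacement)
            i = end                            # jump past the matched substring
        else:
            out.append(text[i])
            i += 1
    return ''.join(out)

mapping = {"first text": "FT", "replace": "rep", "second": "2nd"}
trie = build_trie(mapping)
print(replace_with_trie("first text to replace", trie))   # FT to rep
Unlike flashtext, this character-level sketch does not respect word boundaries, so keywords are also replaced inside longer words.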
I think you are looking for replace with regex on the df, i.e.
if you have a dictionary, then pass it as the parameter.
d = {'old substring':'new substring','anohter':'another'}
For entire dataframe
df.replace(d,regex=True)
For series
df[columns].replace(d,regex=True)
Example
df = pd.DataFrame({ "old":
["first text to replace",
"second text to replace"]
})
mapping = {"first text": "FT",
"replace": "rep",
"second": '2nd'}
df['replaced'] = df['old'].replace(mapping,regex=True)
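A caveat with this approach: with regex=True the dictionary keys are interpreted as regular expressions, so if an "old substring" contains regex metacharacters it should be escaped first, for example (with made-up values):
import re
import pandas as pd

df = pd.DataFrame({"old": ["price (USD) to replace", "second text to replace"]})
mapping = {"price (USD)": "price", "second": "2nd"}

# escape the keys so parentheses, dots, etc. are matched literally
safe_mapping = {re.escape(k): v for k, v in mapping.items()}
df['replaced'] = df['old'].replace(safe_mapping, regex=True)
print(df)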
