TensorFlow Dataset: Accessing row values to preprocess text data

I have used tf.data.experimental.CsvDataset to read CSV data. The CSV has 2 columns of text, one per language, for the transformer model.
train_examples = tf.data.experimental.CsvDataset("./Data/training.csv", [tf.string, tf.string], header=True)
#printing 'train_examples'
<CsvDatasetV2 shapes: ((), ()), types: (tf.string, tf.string)>
I am trying to preprocess data for each column of text data before training the transformer model. How would I pass a function like the below on the 2 columns of the data? What structure is the output from tf.data.experimental.CsvDataset?
def preprocess_sentence(sentence):
    sentence = sentence.lower().strip()
    # creating a space between a word and the punctuation following it
    # eg: "he is a boy." => "he is a boy ."
    sentence = re.sub(r"([?.!,])", r" \1 ", sentence)
    sentence = re.sub(r'[" "]+', " ", sentence)
    # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",")
    sentence = re.sub(r"[^a-zA-Z?.!,]+", " ", sentence)
    sentence = sentence.strip()
    # adding a start and an end token to the sentence
    return sentence
If I apply the above function, the CsvDataset object cannot handle any operations.
AttributeError: 'CsvDatasetV2' object has no attribute 'lower'

What structure is the output from tf.data.experimental.CsvDataset?
CsvDataset returns a tensorflow dataset which is a custom object representing an arbitrarily large dataset.
If I apply the above function, the CsvDataset object cannot handle any operations
That's because datasets are evaluated lazily by default. With good reason: as I mentioned above, they can represent huge, even infinite, datasets. So, by default, mapping operations need to be done using tensor operations.
Usefully, however, there is a TensorFlow operation that allows you to call Python code from TensorFlow, so you could do something like this:
pre_processed_dataset = my_dataset.map(lambda x, y: tf.py_function(preprocess_sentence, [x, y], [tf.string, tf.string]))
(though you should make sure preprocess_sentence actually takes 2 sentences as arguments and returns 2 strings, in common with your dataset, which is a dataset of string pairs. Inside tf.py_function the arguments arrive as eager tensors, so call .numpy().decode() on them before using str methods such as lower()).
Having said that, it would be much better if you could just translate your preprocessing function into tensor operations. Maybe something like this:
def preprocess(sentence1, sentence2):
    def preprocess_sentence(sentence):
        ret = tf.strings.lower(sentence)
        ret = tf.strings.strip(ret)
        # raw strings, so that \1 reaches the regex engine as a backreference
        ret = tf.strings.regex_replace(ret, r"([?.!,])", r" \1 ")
        ret = tf.strings.regex_replace(ret, r'[" "]+', " ")
        ret = tf.strings.regex_replace(ret, r"[^a-zA-Z?.!,]+", " ")
        ret = tf.strings.strip(ret)
        return ret
    return preprocess_sentence(sentence1), preprocess_sentence(sentence2)
then you can map your dataset like this:
my_preprocessed_dataset = my_dataset.map(preprocess)
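One detail worth checking when porting the regular expressions: the backreference \1 only survives if the Python string literal is raw. The following is a minimal sketch in plain Python (no TensorFlow needed); the helper name spaced_punctuation and the sample sentence are made up for illustration.

```python
import re

# In a normal string literal, "\1" is an octal escape for the control
# character U+0001, not a backreference. A raw string keeps the backslash.
plain = " \1 "
raw = r" \1 "

def spaced_punctuation(sentence, replacement):
    """Put `replacement` in place of each punctuation mark (hypothetical helper)."""
    return re.sub(r"([?.!,])", replacement, sentence)

good = spaced_punctuation("he is a boy.", raw)    # backreference is expanded
bad = spaced_punctuation("he is a boy.", plain)   # control character is inserted

print(repr(good))
print(repr(bad))
```

The same consideration should apply to the replacement string passed to tf.strings.regex_replace, since the literal is processed by Python before TensorFlow ever sees it.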

Related

Create a list from space-separated input without using .split()

I have to write an application that asks the user to enter a list of numbers separated by spaces and then prints the sum of the numbers. The user can enter any number of numbers. I am not allowed to use the split function in Python. I was wondering how I can do that. Any help would be appreciated, as I'm kind of stuck on where to start.
Possible solution is to use regular expressions:
# import regular expression library
import re
# let user enter numbers and store user data into 'data' variable
data = input("Enter numbers separated by space: ")
"""
regular expression pattern '\d+' means the following:
'\d' - any number character,
'+' - one or more occurence of the character
're.findall' will find all occurrences of regular expression pattern
and store to list like '['1', '258', '475', '2', '6']'
please note that list items stored as str type
"""
numbers = re.findall(r'\d+', data)
"""
list comprehension '[int(_) for _ in numbers]' converts
list items to int type
'sum()' - summarizes list items
"""
summary = sum([int(_) for _ in numbers])
print(f'Sum: {summary}')
Another solution is the following:
string = input("Enter numbers separated by space: ")
splits = []
pos = -1
last_pos = -1
while ' ' in string[pos + 1:]:
    pos = string.index(' ', pos + 1)
    splits.append(string[last_pos + 1:pos])
    last_pos = pos
splits.append(string[last_pos + 1:])
summary = sum([int(_) for _ in filter(None, splits)])
print(f'Sum: {summary}')
From my point of view, the first option is more concise and better protected from user errors.
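To make the "better protected from user errors" point concrete, here is a small sketch that wraps both approaches in functions and feeds them the same messy input. The function names sum_with_regex and sum_with_index and the sample inputs are made up for illustration.

```python
import re

def sum_with_regex(data):
    """Sum every run of digits found in the input (first approach)."""
    return sum(int(item) for item in re.findall(r'\d+', data))

def sum_with_index(data):
    """Cut the input at each space by hand, then sum the pieces (second approach)."""
    splits = []
    pos = -1
    last_pos = -1
    while ' ' in data[pos + 1:]:
        pos = data.index(' ', pos + 1)
        splits.append(data[last_pos + 1:pos])
        last_pos = pos
    splits.append(data[last_pos + 1:])
    # filter(None, ...) drops the empty strings produced by repeated spaces
    return sum(int(item) for item in filter(None, splits))

clean = "1 258 475 2 6"
messy = "1, 2 x 3"

print(sum_with_regex(clean), sum_with_index(clean))
print(sum_with_regex(messy))
try:
    sum_with_index(messy)
except ValueError as error:
    print("manual version failed:", error)
```

Note that the regular expression version is tolerant rather than strict: it silently ignores anything that is not a digit, so a minus sign or a decimal point would be dropped instead of reported.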

In python 3 how can I return a variable to a function properly?

I am currently in school studying Python and have a question. I am working on a midterm project that has to take an input, assign it to a list, capitalize the first letter if it isn't already capital, and count the number of words in the sentence.
While my code works, I can't help but think I handled the arguments into the functions completely wrong. If you could take a look at it and help me clean it up, that would be excellent.
Please remember: I am new, so explain it like I am 5!
sentence_list = sentList()
sentence = listToString(sentence_list)
sentence = is_cap(sentence)
sentence = fix(sentence)
sentence = count_words(sentence)
def sentList():
    sentence_list = []
    sentence_list.append(input('Please enter a sentence: '))
    return sentence_list
def listToString(sentence_list):
    sentence = ""
    sentence = ''.join(sentence_list)
    return sentence
def is_cap(sentence):
    sentence = sentence.capitalize()
    return sentence
def fix(sentence):
    sentence = sentence + "." if (not sentence.endswith('.')) and (not sentence.endswith('!')) and \
        (not sentence.endswith('?')) else sentence
    return sentence
def count_words(sentence):
    count = len(sentence.split())
    print('The number of words in the string are: '+ str(count))
    print(sentence)
main()
First of all, your code is very good for a beginner. Good job!
To make your functions run, you need to call them after you have defined them, but here you put the calls at the top of the file.
The reason is that Python reads the code from top to bottom, so when it reaches the first call, it has not yet read the definition of that function and raises a NameError.
The code should look like this:
def sentList():
    sentence_list = []
    sentence_list.append(input('Please enter a sentence: '))
    return sentence_list
def listToString(sentence_list):
    sentence = ""
    sentence = ''.join(sentence_list)
    return sentence
def is_cap(sentence):
    sentence = sentence.capitalize()
    return sentence
def fix(sentence):
    sentence = sentence + "." if (not sentence.endswith('.')) and (not sentence.endswith('!')) and (not sentence.endswith('?')) else sentence
    return sentence
def count_words(sentence):
    count = len(sentence.split())
    print('The number of words in the string are: '+ str(count))
    print(sentence)
sentence_list = sentList()
sentence = listToString(sentence_list)
sentence = is_cap(sentence)
sentence = fix(sentence)
sentence = count_words(sentence)
I guess that's it. If you have any other questions, this community will always be here.
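A common way to sidestep the ordering issue is to put the top-level steps in a main() function and call it on the last line. This is only a sketch and not what the original code did; the helper names and the fixed sample sentence are made up so the example runs without input().

```python
def main(raw_sentence):
    # These helpers are defined further down, which is fine:
    # Python only looks them up when main() actually runs.
    sentence = capitalize_first(raw_sentence)
    sentence = ensure_end_punctuation(sentence)
    return sentence, count_words(sentence)

def capitalize_first(sentence):
    """Upper-case only the first character and leave the rest untouched."""
    return sentence[:1].upper() + sentence[1:]

def ensure_end_punctuation(sentence):
    """Append a period unless the sentence already ends with . ! or ?"""
    if sentence.endswith(('.', '!', '?')):
        return sentence
    return sentence + '.'

def count_words(sentence):
    return len(sentence.split())

# The call comes last, after every function has been defined.
result = main("hello there world")
print(result)
```

Each function here returns its value instead of printing it, so the caller decides what to do with the result. That also avoids the situation in the code above where count_words returns nothing and sentence ends up as None.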

Iterating through Huggingface tokenizer with remainder

Transformer models have maximum token limits. If I want to substring my text to fit within that limit, what is the generally accepted way?
Due to the treatment of special characters, it isn't the case that the tokenizer maps its tokens to something amenable to looping. Naively:
subst = " ".join(mytext.split(" ")[0:MAX_LEN])
would let me loop through chunks with something like:
START = 0
substr = []
while START+MAX_LEN < len(mytext.split(" ")):
    substr.append(" ".join(mytext.split(" ")[START:START+MAX_LEN]))
    START = START + MAX_LEN
tokens = tokenizer(text)
However, the number of words in " ".join(mytext.split(" ")[0:MAX_LEN]) is not equal to the number of tokens produced by tokenizer(text).
You can see the difference below:
>>> from transformers import LongformerTokenizer
>>> tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
>>> mytext = "This is a long sentence. " * 2000 # about 10k words
>>> len(mytext.split(" "))
10001
>>> encoded_input = tokenizer(mytext)
Token indices sequence length is longer than the specified maximum sequence length for this model (12003 > 4096). Running this sequence through the model will result in indexing errors
Is there a function argument to tokenizer that handles this, or, if none is available, what is the generally accepted iteration procedure for longer documents?
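For reference, the iteration with a remainder can be done on token ids rather than on words. The sketch below is library-free and only illustrates the chunking step: chunk_token_ids is a made-up helper, and the list of integers stands in for ids that would, under this assumption, come from something like tokenizer(mytext, add_special_tokens=False)["input_ids"].

```python
def chunk_token_ids(token_ids, max_len):
    """Split a flat list of token ids into consecutive chunks of at most max_len.

    The last chunk holds the remainder and may be shorter than max_len.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [token_ids[start:start + max_len]
            for start in range(0, len(token_ids), max_len)]

# Stand-in for real token ids (assumption: a flat list of integers).
fake_ids = list(range(10))
chunks = chunk_token_ids(fake_ids, 4)
print([len(chunk) for chunk in chunks])
```

If each chunk is later wrapped in special tokens, max_len would need to leave room for them, so the model limit minus the number of special tokens is the safer value to pass.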

How can I make my function work for any number?

I am having some issues with some code I wrote for this problem:
“Write a function named calc that will evaluate a simple arithmetic expression. The input to your program will be a string of the form:
operand1 operator operand2
where operand1 and operand2 are non-negative integers and operator is a single-character operator, which is either +, -, or *. You may assume that there is a space between each operand and the operator. You may further assume that the input is a valid mathematical expression, i.e. your program is not responsible for the case where the user enters gibberish.
Your function will return an integer, such that the returned value is equal to the value produced by applying the given operation to the given operands.
Sample execution:
calc("5 + 10") # 15
You may not use the split or eval functions in your solution.
Hint: the hard part here is breaking the input string into its three components. You may use the find and rfind functions to find the position of the first and last space, and then use the slice operator (that is, s[startindex:endindex]) to extract the relevant range of characters. Be careful of off-by-one errors in using the slice operator.
Hint: it’s best to test your code as you work. The first step should be to break the input string into its three components. Write a program that does that, have it print out the operator and the two operands on separate lines, and test it until you are convinced that it works. Then, modifying it to perform the desired mathematical operation should be straightforward. Test your program with several different inputs to make sure it works as you expect.”
Here is my code:
def calc(exp):
    operand1 = int(exp[:1])
    operand2 = int(exp[4:6])
    operator = exp[2:3]
    if(operator == "+"):
        addition = operand1+operand2
        return addition
    if(operator == "-"):
        subtraction = operand1-operand2
        return subtraction
    if(operator == "*"):
        multiplication = operand1*operand2
        return multiplication
print(calc("5 + 10"))
print(calc("4 - 8"))
print(calc("4 * 3"))
My code does not fully meet the criteria of this question. It only works for single digit numbers. How can I make my code work for any number?
Like:
“504 + 507”
“5678 + 76890”
and so on?
Thank you. Any help is appreciated.
As the hint says, get the position of the first and last space of the expression, use it to extract the operand and the operators, and then evaluate accordingly.
def calc(exp):
    # Get the position of the first space with find
    low_idx = exp.find(' ')
    # Get the position of the last space with rfind
    high_idx = exp.rfind(' ')
    # Extract the operands and the operator with slices, converting operands to int
    operand1 = int(exp[0:low_idx])
    operator = exp[low_idx+1:high_idx]
    operand2 = int(exp[high_idx+1:])
    result = 0
    # Evaluate based on operator
    if operator == '+':
        result = operand1 + operand2
    elif operator == '-':
        result = operand1 - operand2
    elif operator == '*':
        result = operand1 * operand2
    return result
print(calc("5 + 10"))
print(calc("4 - 8"))
print(calc("4 * 3"))
print(calc("504 + 507"))
print(calc("5678 + 76890"))
#15
#-4
#12
#1011
#82568
The answer is in the specification:
You may use the find and rfind functions to find the position of the first and last space, and then use the slice operator (that is, s[startindex:endindex]) to extract the relevant range of characters.
find and rfind are methods of string objects.
You could split it into three components using this code: (note: this doesn't use split or eval)
def splitExpression(e):
    numbers = ["1","2","3","4","5","6","7","8","9","0"] # list of all digits
    operations = ["+","-","*","/"] # list of all operations
    output = [] # output components
    currentlyParsing = "number" # the component we're currently parsing
    buildstring = "" # temporary variable
    for c in e:
        if c == " ":
            continue # ignore whitespace
        if currentlyParsing == "number": # we are currently parsing a number
            if c in numbers:
                buildstring += c # this is a digit, continue
            elif c in operations:
                output.append(buildstring) # this component has reached its end
                buildstring = c
                currentlyParsing = "operation" # we are expecting an operation now
            else:
                pass # unknown symbol!
        elif currentlyParsing == "operation": # we are currently parsing an operation
            if c in operations:
                buildstring += c # this is an operation, continue
            elif c in numbers:
                output.append(buildstring) # this component has reached its end
                buildstring = c
                currentlyParsing = "number" # we are expecting a number now
            else:
                pass # unknown symbol!
    if buildstring: # anything left in the buffer?
        output.append(buildstring)
        buildstring = ""
    return output
Usage: splitExpression("281*14") returns ["281","*","14"]
This function also accepts spaces between numbers and operations
You can simply take the string and use the split method for the string object, which will return a list of strings based on some separator.
For example:
stringList = "504 + 507".split(" ")
stringList will now be a list such as ["504", "+", "507"] due to the separator " " which is a whitespace. Then just use stringList[1] with your conditionals to solve the problem. Additionally, you can use int(stringList[0]) and int(stringList[2]) to convert the strings to int objects.
EDIT:
Now I realized that your problem said to use find() instead of split(). Simply use the logic above but instead find(" ") the first whitespace. You will then need to find the second whitespace by slicing past the first whitespace using the two additional arguments available for find().
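The find-based approach described above can be sketched as follows. The function name split_three is made up for illustration; the second call passes a start index as find()'s optional second argument so that the search begins just past the first space.

```python
def split_three(expression):
    """Break 'operand1 operator operand2' into three strings without split()."""
    first = expression.find(" ")
    # Start searching one character after the first space.
    second = expression.find(" ", first + 1)
    operand1 = expression[:first]
    operator = expression[first + 1:second]
    operand2 = expression[second + 1:]
    return operand1, operator, operand2

print(split_three("504 + 507"))
```

This relies on the assumption stated in the assignment that the input is valid, with exactly one space on each side of the operator.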
You need to split the string out instead of hard coding the positions of the indexes.
When coding you want to try to make your code as dynamic as possible, that generally means not hard coding stuff that could be a variable or in this case could be grabbed from the spaces.
Also in the if statements I modified them to elif as it is all one contained statement and thus should be grouped.
def calc(exp):
    vals = exp.split(' ')
    operand1 = int(vals[0])
    operand2 = int(vals[2])
    operator = vals[1]
    if operator == '+':
        return operand1+operand2
    elif operator == '-':
        return operand1-operand2
    else:
        return operand1*operand2

Can I connect the variables In a table?

For example, let's say I have this table:
tbl = {"hi ", "my ", "name ", "is ", "King"}
Can I get this to return:
"hi my name is King"
Without
for k, v in ipairs( tbl ) do
    print(v)
end
Because I am trying to process an unknown quantity of inputs, and compare the result to another string.
You can use table.concat() to get the result string:
local str = table.concat(tbl)
print(str)
It can do more. In particular, table.concat() takes an optional second parameter, which is used as the separator. For instance, to separate the elements with commas:
local str = table.concat(tbl, ',')
The biggest advantage of table.concat() versus direct string concatenation is performance, see PiL §11.6 for detail.

Resources