Text classification using NER, SRL, etc - nlp

I'm working on a task of classifying text complaints and I extracted some features like Named Entities, Events, Time Expressions, Semantic Role Labels, etc. I want to classify the text according to these features. My question is how do I encode this data in order to feed it to a classifier?
Here are some examples of the extracted data:
named_entities: (FedEx, Israel, Paris) , (Zara, London, Chris), ...
time_expressions: ('2021-08-31', '31/08/2021') , ('30 August', '2019'), ...
srl: {"verbs": [{"verb": "write", "description": "I [V: write] [A1: a complaint] [A2: to FedEx] .", "tags": ["O", "O", "O", "O", "O", "O", "B-V", "B-A1", "I-A1", "B-A2", "I-A2", "O"]}], "words": ["I", "to", "FedEx", "."]}, {}, ...
events: {'T1': 'come', 'T2': 'present', 'T3': 'send','T4': 'destination','T5': 'instrument'},
{'T1': 'loader', 'T2': 'bearer', 'T3': 'cargo'} , ...
Previously I used word embeddings to encode the full text, but now that the information is spread across these extracted features I don't know how to proceed.
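One common way to encode symbolic features such as entity types, SRL role labels, or event triggers (this is not from the original thread, just an illustrative sketch using scikit-learn's DictVectorizer; the feature names below are invented) is to count them per document and vectorize the counts:

# Illustrative sketch only: feature names like "ENT=ORG" or "EVENT=send" are made up.
from collections import Counter
from sklearn.feature_extraction import DictVectorizer

# one Counter of feature counts per complaint
docs_features = [
    Counter({"ENT=ORG": 1, "ENT=LOC": 2, "SRL=A1": 1, "EVENT=send": 1}),
    Counter({"ENT=ORG": 1, "ENT=PER": 1, "SRL=A2": 1, "EVENT=come": 2}),
]

vec = DictVectorizer()
X = vec.fit_transform(docs_features)  # sparse matrix usable by any scikit-learn classifier
print(vec.get_feature_names_out())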

Related

What is the right way to use Nested states with pytransitions?

So I've been looking around on the pytransitions GitHub and SO, and it seems that after 0.8 the way you could use macro-states (a super state with substates in it) has changed. I would like to know if it's still possible to create such a machine with pytransitions (the blue square is supposed to be a macro-state that has 2 states in it, one of them, the green one, being another macro):
Or do I have to follow the workflow suggested here : https://github.com/pytransitions/transitions/issues/332 ?
Thanks a lot for any info!
I would like to know if it's still possible to create such a machine with pytransition.
The way HSMs are created and managed has changed in 0.8 but you can of course use (deeply) nested states. For a state to have substates, you need to pass the states (or children) parameter with the state definitions/objects you'd like to nest. Furthermore, you can pass transitions for that particular scope. I am using HierarchicalGraphMachine since this allows me to create a graph right away.
from transitions.extensions.factory import HierarchicalGraphMachine

states = [
    # create a state named 'A'
    {"name": "A",
     # with the following children
     "states":
         # a state named '1' which will be accessible as 'A_1'
         ["1", {
             # and a state '2' with its own children ...
             "name": "2",
             # ... 'a' and 'b'
             "states": ["a", "b"],
             "transitions": [["go", "a", "b"], ["go", "b", "a"]],
             # when '2' is entered, 'a' should be entered automatically
             "initial": "a"
         }],
     # we could also pass [["go", "A_1", "A_2"]] to the machine constructor
     "transitions": [["go", "1", "2"]],
     "initial": "1"
     }]

m = HierarchicalGraphMachine(states=states, initial="A")
m.go()
m.get_graph().draw("foo.png", prog="dot")  # [1]
Output of [1]: the rendered graph of the nested state machine (image not reproduced here).
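As a quick sanity check (not part of the original answer), one can inspect the current state after each trigger; since '2' defines 'a' as its initial substate, entering 'A_2' drops the machine straight into the nested state 'A_2_a':

# Not part of the original answer: checking the behaviour described above.
print(m.state)  # 'A_2_a' - 'go' moved A_1 -> A_2, which auto-entered its initial child 'a'
m.go()          # the 'go' transition defined inside '2' now toggles a <-> b
print(m.state)  # 'A_2_b'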

Grabbing first names and storing in a list within a dictionary

I need to extract the first names of bob and alice from the dictionaries and store them in list7. I tried slicing and got an error that my value exceeds the range, and I currently have the following code, which also gives an error.
directory = [{'firstName': "bob", 'department': "Accounting", 'salary': 50000}, {'firstName': "alice", 'department': "Marketing", 'salary': 100000}]
list7 = []
#My Code
list7 = [ sub['firstName'] for sub in directory ]
Your code actually works:
directory = [
{"firstName": "bob", "department": "Accounting", "salary": 50000},
{"firstName": "alice", "department": "Marketing", "salary": 100000},
]
list7 = [sub["firstName"] for sub in directory]
print(list7)
# ['bob', 'alice']

How can I combine the elements of a list using python 3

I have
l = [["a", "b", "c"], ["1", "2", "3"], ["x", "y", "z"]]
How can I get a list of all the combinations of the elements?
combine(l)
should return
["a1x", "a2x", "a3x", "b1x", "b2x", "b3x", "c1x", "c2x", "c3x",
"a1y", "a2y", "a3y", "b1y", "b2y", "b3y", "c1y", "c2y", "c3y",
"a1z", "a2z", "a3z", "b1z", "b2z", "b3z", "c1z", "c2z", "c3z"]
The kind of combination you want is available in Python as the itertools.product function. You just have to post-process its output to join each tuple back into a string:
import itertools
l = [["a", "b", "c"], ["1", "2", "3"], ["x", "y", "z"]]
combined = ["".join(combination) for combination in itertools.product(*l)]
print(combined)
Results in:
['a1x', 'a1y', 'a1z', 'a2x', 'a2y', 'a2z', 'a3x', 'a3y', 'a3z', 'b1x', 'b1y', 'b1z', 'b2x', 'b2y', 'b2z', 'b3x', 'b3y', 'b3z', 'c1x', 'c1y', 'c1z', 'c2x', 'c2y', 'c2z', 'c3x', 'c3y', 'c3z']

OpenNLP yielding undesired result

I am using OpenNLP to process queries like "doctor working in Los Angeles" and "female living in Hollywood and working in Santa Monica". To a human who understands English, it is obvious that the subjects of these sentences are "doctor" and "female". However, when I use OpenNLP, it tags the sentence as
female_JJ living_NN in_IN hollywood_NN
[ female living ] [ in ] [ hollywood ]
Here's another sentence, "person living in santa monica and working in malibu and playing football", which was processed as
person_NN living_VBG in_IN santa_NN monica_NN and_CC working_VBG in_IN malibu_NN and_CC playing_NN football_NN
[ person ] [ living ] [ in ] [ santa monica ] and [ working ] [ in ] [ malibu and playing football ]
Why does OpenNLP's POS tagger tag them wrongly? These sentences have the simplest grammatical structures. If the most advanced NLP technologies still fail to parse these sentences, does it mean that NLP is far from being practical at the moment?
The accuracy of these NLP projects can never be 100%, because they are based on probabilistic models, so errors like this can occur. Even so, these are among the most accurate results currently available.

CouchDB historical view snapshots

I have a database with documents that are roughly of the form:
{"created_at": some_datetime, "deleted_at": another_datetime, "foo": "bar"}
It is trivial to get a count of non-deleted documents in the DB, assuming that we don't need to handle "deleted_at" in the future. It's also trivial to create a view that reduces to something like the following (using UTC):
[
{"key": ["created", 2012, 7, 30], "value": 39},
{"key": ["deleted", 2012, 7, 31], "value": 12}
{"key": ["created", 2012, 8, 2], "value": 6}
]
...which means that 39 documents were marked as created on 2012-07-30, 12 were marked as deleted on 2012-07-31, and so on. What I want is an efficient mechanism for getting the snapshot of how many documents "existed" on 2012-08-01 (0+39-12 == 27). Ideally, I'd like to be able to query a view or a DB (e.g. something that's been precomputed and saved to disk) with the date as the key or index, and get the count as the value or document. e.g.:
[
{"key": [2012, 7, 30], "value": 39},
{"key": [2012, 7, 31], "value": 27},
{"key": [2012, 8, 1], "value": 27},
{"key": [2012, 8, 2], "value": 33}
]
This can be computed easily enough by iterating through all of the rows in the view, keeping a running counter and summing up each day as I go, but that approach slows down as the data set grows larger, unless I'm smart about caching or storing the results. Is there a smarter way to tackle this?
Just for the sake of comparison (I'm hoping someone has a better solution), here's (more or less) how I'm currently solving it (in untested ruby pseudocode):
require 'date'

def date_snapshots(rows)
  current_date = nil
  current_count = 0
  rows.inject({}) { |hash, reduced_row|
    type, *ymd = reduced_row["key"]
    this_date = Date.new(*ymd)
    if current_date
      # deal with the days where nothing changed
      (current_date.succ ... this_date).each do |date|
        key = date.strftime("%Y-%m-%d")
        hash[key] = current_count
      end
    end
    # update the counter and deal with the current day
    current_date = this_date
    current_count += reduced_row["value"] if type == "created_at"
    current_count -= reduced_row["value"] if type == "deleted_at"
    key = current_date.strftime("%Y-%m-%d")
    hash[key] = current_count
    hash
  }
end
Which can then be used like so:
rows = couch_server.db(foo).design(bar).view(baz).reduce.group_level(3).rows
date_snapshots(rows)["2012-08-01"]
An obvious small improvement would be to add a caching layer, although it isn't quite as trivial to make that caching layer play nicely with incremental updates (e.g. the changes feed).
I found an approach that seems much better than my original one, assuming that you only care about a single date:
def size_at(date = Time.now.to_date)
  ymd = [date.year, date.month, date.day]
  added = view.reduce.
            startkey(["created_at"]).
            endkey(["created_at", *ymd, {}]).rows.first || {}
  deleted = view.reduce.
              startkey(["deleted_at"]).
              endkey(["deleted_at", *ymd, {}]).rows.first || {}
  added.fetch("value", 0) - deleted.fetch("value", 0)
end
Basically, let CouchDB do the reduction for you. I didn't originally realize that you could mix and match reduce with startkey/endkey.
Unfortunately, this approach requires two hits to the DB (although those could be parallelized or pipelined). And it doesn't work as well when you want to get a lot of these sizes at once (e.g. view the whole history, rather than just look at one date).
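For completeness, here is a minimal sketch of those two reduced range queries expressed as raw CouchDB HTTP calls (not from the original answer; the database, design document, and view names are placeholders):

# Illustrative sketch only: database/design/view names below are placeholders.
import json
import requests

VIEW_URL = "http://localhost:5984/foo/_design/bar/_view/baz"

def reduced_count(key_type, ymd):
    params = {
        "reduce": "true",
        "startkey": json.dumps([key_type]),
        # {} sorts after any year/month/day component, ending the range at the given date
        "endkey": json.dumps([key_type, *ymd, {}]),
    }
    rows = requests.get(VIEW_URL, params=params).json().get("rows", [])
    return rows[0]["value"] if rows else 0

def size_at(ymd):
    return reduced_count("created_at", ymd) - reduced_count("deleted_at", ymd)

print(size_at((2012, 8, 1)))  # 27 for the sample data above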
