numerical entity extraction from unstructured texts using python

I want to extract numerical entities such as temperature and duration mentioned in unstructured text, using sequence-labeling models such as CRFs, in Python. I would like to know how to proceed with numerical extraction, as most of the examples available on the internet cover extraction of specific words or strings.
Input: 'For 5 minutes there, I felt like baking in an oven at 350 degrees F'
Output: temperature: 350
duration: 5 minutes
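
For the CRF route specifically, here is a minimal sketch using sklearn-crfsuite, treating the task as BIO sequence labeling; the single training sentence, the feature set, and the label names are illustrative assumptions, not a full solution:

    import sklearn_crfsuite

    # one training sentence, tokenised, with BIO labels for the two entity types
    tokens = ["For", "5", "minutes", "there", ",", "I", "felt", "like",
              "baking", "in", "an", "oven", "at", "350", "degrees", "F"]
    labels = ["O", "B-DUR", "I-DUR", "O", "O", "O", "O", "O",
              "O", "O", "O", "O", "O", "B-TEMP", "I-TEMP", "I-TEMP"]

    def features(sent, i):
        w = sent[i]
        return {
            "lower": w.lower(),
            "is_digit": w.isdigit(),  # lets the model key on numeric tokens
            "prev": sent[i - 1].lower() if i else "<s>",
            "next": sent[i + 1].lower() if i < len(sent) - 1 else "</s>",
        }

    X = [[features(tokens, i) for i in range(len(tokens))]]
    y = [labels]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit(X, y)
    print(crf.predict(X)[0])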

So far my research shows that you can treat numbers as words.
This raises an issue: learning 5 will be fine, but 19684 will be too rare to be learned.
One proposal is to convert numbers into words, e.g. "nineteen thousand six hundred eighty-four", and embed each word. The inconvenience is that a single number now becomes a sequence of (at least) six tokens, one embedding per word.
Depending on your usage, you can also give 0 to 3000 distinct ids, map everything from 3001 to 10000 to a single id (say 3001) in your dictionary, and then add one more id for each further power of ten.
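
A minimal sketch of that bucketing scheme, assuming the cutoffs above (3000 and 10000) and one bucket per additional power of ten; the exact boundaries are illustrative choices:

    import math

    def number_to_id(n: int) -> int:
        """Map a non-negative integer to a vocabulary id (illustrative cutoffs)."""
        if n <= 3000:
            return n                 # small numbers keep distinct ids
        if n <= 10_000:
            return 3001              # one shared id for 3001..10000
        # one extra id per additional power of ten: 10^4..10^5 -> 3002, etc.
        return 3001 + int(math.log10(n)) - 3

    print(number_to_id(5))      # 5
    print(number_to_id(19684))  # 3002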

Related

I am wondering if the statistical analysis I did makes any sense

I am helping with a retrospective study and the data isn't very well organized. Also, I am new to statistics, so I took a stab at analyzing the data myself. We will be getting the help of a statistician later on, but we are not sure when yet.
We are looking at about 100 patients, and each patient was followed up for a variable length of time. Throughout each patient's follow-up, a variable number of observations were made at various timepoints. The observations included a set of lab values, anthropometric data, and demographic data. To conduct the analysis, we split the observations into time bins (e.g., 6 months follow-up, 1 year follow-up, etc.). Then, for each timepoint, we categorized each patient into one of 3 groups based on the outcome of interest. Also, for each timepoint, we selected one observation to represent each patient during that timepoint (since there could be many within the same time bin). For the analysis, we did the following:
1. ANOVA within each timepoint to compare the 3 outcome groups, looking at selected independent variables of interest.
2. For the same variables of interest, a repeated-measures ANOVA to see whether they change over time.
3. Tests for correlations between the variables of interest mentioned above and other independent variables.
4. Test each independent variable in a univariate binomial logistic regression to see if it predicts outcome. There were 3 groups, so we did pairwise regressions (e.g., (outcome 1 + 2) vs (outcome 3), and (outcome 1) vs (outcome 2 + 3)); a rough sketch of this step appears below.
5. A multivariate binomial logistic regression with forward elimination, using only the significant independent variables retained from step 4.
6. If any independent variables of interest are retained in the MV regression, run it again testing for potential interactions with any variables they were correlated with in step 3. We tried to do this by making a new variable that is the product of the two variables and putting it into the regression.
What I'm trying to show with this analysis is that one key independent variable explains the difference in outcomes among the patients. So far the analysis seems to support this, as that variable is one of the few retained at step 6, with a good significance value. Sorry if this is very confusing to read.
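
A hedged sketch of step 4 in Python with statsmodels; the file name, the outcome column, and the predictor list are hypothetical stand-ins for your data:

    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("timepoint_data.csv")       # hypothetical file
    predictors = ["lab_value_a", "bmi", "age"]   # hypothetical predictors

    # pairwise binarisation: (outcome 1 + 2) vs (outcome 3)
    y = (df["outcome"] == 3).astype(int)

    for var in predictors:
        X = sm.add_constant(df[[var]])           # intercept + one predictor
        fit = sm.Logit(y, X).fit(disp=0)         # univariate logistic fit
        print(var, "p =", round(fit.pvalues[var], 4))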

How to best store time series data in Elasticsearch?

I regularly have to conduct chemical experiments which result in a huge set of time series data. For example, 100 lists of measured fluid concentrations, with a timestamp in microseconds assigned to each measurement.
I would like to track and model each experiment and assign to it multiple lists of (measurement, timestamp) pairs. The measurement lists do not have to be of equal length and can vary greatly: one list could have length 100, the next 4000, depending on the experiment. While at the university lab, I also take notes at different timestamps, which I would also like to attach to those timestamps in the DB (tagged timestamps).
Later on, the full analysis text of the experiment should also be stored.
Is Elasticsearch capable of storing such time series data or measurement lists? Because this is mostly numbers rather than text, I'm a bit hesitant.
Even though I have searched the net for a while, I could not find a proper way to set up measurement lists as described.
Any help, ideas and maybe helpful links are highly appreciated!
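
For what it's worth, one possible way to model this with the official Elasticsearch Python client is one document per experiment holding a nested list of measurements; the index name, field names, and the nested mapping are illustrative assumptions, not a prescribed schema:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    es.indices.create(
        index="experiments",
        mappings={
            "properties": {
                "experiment_id": {"type": "keyword"},
                "analysis_text": {"type": "text"},  # full analysis, added later
                "measurements": {
                    "type": "nested",
                    "properties": {
                        "timestamp": {"type": "date_nanos"},  # sub-millisecond precision
                        "concentration": {"type": "double"},
                        "note": {"type": "text"},             # tagged timestamps
                    },
                },
            }
        },
    )

    es.index(
        index="experiments",
        document={
            "experiment_id": "exp-001",
            "measurements": [
                {"timestamp": "2023-01-01T10:00:00.000001Z", "concentration": 0.42},
                {"timestamp": "2023-01-01T10:00:00.000500Z", "concentration": 0.43,
                 "note": "stirring started"},
            ],
        },
    )

Whether to nest all measurements in one document or index one document per data point depends on how long the lists get; very long lists push against document-size limits and may be better stored one document per measurement.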

Small data anomaly detection algo

I have the following 3 cases of a numeric metric on a time series (t, t1, t2, etc. denote different hourly comparisons across periods).
If you look at the 3 graphs, t (the period of interest) clearly has a drop-off in image 1 but not so much in images 2 and 3. Assume this is some numeric metric (raw or derived); I want to create a system/algorithm which specifically catches case 1 but not cases 2 or 3, with t being the point of interest. While visually this makes sense and is very intuitive, I am trying to design a way to do this in Python using the dataframes shown in the picture.
Generally the problem is how do I detect when the time series is behaving very differently from any of the prior weeks.
Edit: When I say different, what I really mean is that my metric trends together across periods t1 to t4; if it breaks away from that envelope, that to me is an anomaly. In chart 1 you can see t trying to split away from the rest of the tn; that is an anomaly for me. In the other cases t stays within the bounds of the other time periods. Hope this helps.
With small data, the best approach is to come up with a good transformation into a simpler representation.
In this case I would try the following:
- Distance to the median along the time axis, then a summary of that distance (its median, mean squared error, etc.).
- Median of the cross-correlation of the signals.
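
A rough sketch of the first idea (distance to the median along the time axis, summarised with MSE); the column names, toy values, and the threshold factor are hypothetical:

    import pandas as pd

    # toy data: hourly values for the period of interest and four prior periods
    df = pd.DataFrame({
        "t":  [10, 11, 9, 2, 3],   # drops away at the end, like chart 1
        "t1": [10, 10, 9, 9, 10],
        "t2": [11, 10, 10, 9, 10],
        "t3": [9, 10, 11, 10, 9],
        "t4": [10, 11, 10, 10, 10],
    })

    prior = df[["t1", "t2", "t3", "t4"]]
    median_prior = prior.median(axis=1)          # median along the time axis

    # summarise the distance of t to that median with MSE
    mse_t = ((df["t"] - median_prior) ** 2).mean()

    # baseline: how far the prior periods stray from their own median
    baseline = ((prior.sub(median_prior, axis=0)) ** 2).mean().max()

    # flag an anomaly when t strays much further; the factor 3 is an
    # illustrative threshold to tune on your data
    print("MSE:", mse_t, "anomaly:", mse_t > 3 * baseline)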

How to measure how distinct a document is based on predefined linguistic categories?

I have 3 categories of words that correspond to different types of psychological drives (need-for-power, need-for-achievement, and need-for-affiliation). Currently, for every document in my sample (n=100,000), I am using a tool to count the number of words in each category, and calculating a proportion score for each category by converting the raw word counts into a percentage based on total words used in the text.
                  n-power   n-achieve   n-affiliation
Document1          0.010      0.025        0.100
Document2          0.045      0.010        0.050
...                  ...        ...          ...
Document100000     0.100      0.020        0.010
For each document, I would like to get a measure of distinctiveness that indicates the degree to which the content of a document on the three psychological categories differs from the average content of all documents (i.e., the prototypical document in my sample). Is there a way to do this?
Essentially what you have is a clustering problem. You have made a representation of each of your documents with 3 numbers; let's call it a vector (essentially you cooked up some embeddings). To do what you want, you can:
1) Calculate an average vector for the whole set. Basically add up all numbers in each column and divide by the number of documents.
2) Pick a metric you like which reflects the alignment of your document vectors with that average. You can just use Euclidean distance
sklearn.metrics.pairwise.euclidean_distances
or cosine
sklearn.metrics.pairwise.cosine_distances
where X is your list of document vectors and Y is a list containing the single average vector. This is a good place to start.
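
A short sketch of steps 1) and 2), with a tiny array standing in for your 100,000 x 3 table of proportion scores:

    import numpy as np
    from sklearn.metrics.pairwise import euclidean_distances, cosine_distances

    X = np.array([
        [0.010, 0.025, 0.100],   # Document1
        [0.045, 0.010, 0.050],   # Document2
        [0.100, 0.020, 0.010],   # Document100000
    ])

    avg = X.mean(axis=0, keepdims=True)   # step 1: the "prototypical" document

    # step 2: one distinctiveness score per document
    print(euclidean_distances(X, avg).ravel())
    print(cosine_distances(X, avg).ravel())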
If I were doing it, I would skip the average-vector approach, since you are in fact dealing with a clustering problem, and use KMeans instead; see the scikit-learn clustering guide for more.
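
If you go the clustering route, a minimal KMeans sketch on the same kind of vectors; the choice of 3 clusters is an illustrative assumption:

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[0.010, 0.025, 0.100],
                  [0.045, 0.010, 0.050],
                  [0.100, 0.020, 0.010]])

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(km.labels_)                    # cluster assignment per document
    print(km.transform(X).min(axis=1))   # distance to the nearest cluster centre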
Hope this helps!

Simulation in Excel using probability

I am trying to create a spreadsheet that can find the probability that each student scored a specific grade on a test.
Each student scores exactly one grade, and each grade is scored by exactly one student.
I have limited information about each student.
There are 5 students (1,2,3,4,5)
and the grades possible are only (100,90,80,70,60)
In the spreadsheet a 1 denotes that the student DIDN'T score that grade.
Does anyone know how to build a simulation from which I can find the probability of which student scored which grade?
Link:
https://docs.google.com/spreadsheets/d/1a8uUIRzUKsY3DolTM1A0ISqMd-42WCUCiDsxmUT5TKI/edit?usp=sharing
Based on your response in comments, each student has an equal likelihood of getting each grade. No simulation is necessary.
If you want to simulate it anyway, don't use Excel*. Create a vector of students, and pair it with a shuffled vector of the grades. Lather, rinse, repeat as many times as you want to see that the student-to-grade matching is uniformly distributed. A quick sketch in Python appears after the footnote below.
* - To get an idea of how bad Excel can be for random variate generation, enable the Analysis ToolPak, go to "Data -> Data Analysis" on the ribbon, and select "Random Number Generation". Fill in the dialog: 10 variables, 2000 random numbers, a "Normal" distribution, mean 0 and standard deviation 1, and a "Random Seed" of 123. You will find that the resulting table contains 3 instances of the value "-9.35764". Values that extreme should occur about once per twenty thousand years if you generate a billion a second. Getting three of them is so extreme that it should happen once per 10^30 times the current estimated age of the universe. Conclude that a) it's your lucky day, or b) Excel sucks at random numbers, and despite being informed about this as far back as 1998 Microsoft hasn't bothered to fix it.
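
The shuffle-and-pair simulation sketched above, in Python rather than Excel; the trial count is an arbitrary choice:

    import random
    from collections import Counter

    students = [1, 2, 3, 4, 5]
    grades = [100, 90, 80, 70, 60]

    counts = Counter()
    trials = 100_000
    for _ in range(trials):
        shuffled = grades[:]
        random.shuffle(shuffled)             # one random permutation per trial
        for s, g in zip(students, shuffled):
            counts[(s, g)] += 1

    # each (student, grade) pair should come up about 1/5 of the time
    for s in students:
        print(s, [round(counts[(s, g)] / trials, 3) for g in grades])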
