To get acquainted with Turi, I'm trying to create a model that can distinguish between strings consisting of characters and strings consisting of numbers.
I have a CSV file with training data. Each line consists of two entries: a string and an indicator of whether that string is a number or a plain string:
String, isNumber
bvmuuflo , 0
71047015 , 1
My Python-Script to generate the model looks like this:
import graphlab as gl
data = gl.SFrame('data.csv')
model = gl.classifier.create(data, target="isNumber", features=["String"])
This works fine, but I have no idea how to use the model to check, for example, whether "qwerty" is a string or a number.
I'm trying to use the model.classify(...) API call. But the two calls
model.classify(gl.SFrame(["qwertzui"]))
and
model.classify(gl.SFrame(["98765432"])
return the same result
Columns:
class int
probability float
Rows: 1
Data:
+-------+----------------+
| class | probability |
+-------+----------------+
| 1 | 0.509227594584 |
+-------+----------------+
[1 rows x 2 columns]
Obviously there is a mistake in my program, but I'm not able to find it.
Any help is welcome!
Since the model only has one feature column for training, it will be able to identify strings it has already seen, but not ones it has not. My guess is that the 0.509 is the percentage of your training data that belongs to one class, so it just responds with that for anything it has not seen before.
This is obviously a toy example, but if you want to get it to work I would use something like a bag of words, only for letters. Make 36 columns with the titles a,b,c...z,0,1...9 and put the count of each character per string in each row. This way the model will treat individual characters as giving a probability to the class, instead of treating the string as a whole. A sketch of this idea follows below.
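Here is a minimal sketch of that idea, assuming GraphLab Create's SArray.apply and SFrame.unpack behave as documented; everything beyond the question's own script (the dict column and the unpacking step) is illustrative:
import graphlab as gl
from collections import Counter

data = gl.SFrame('data.csv')

# per-string character counts as a dict column, e.g. {'b': 1, 'v': 1, 'm': 1, ...}
data['char_counts'] = data['String'].apply(lambda s: dict(Counter(str(s).strip())))

# unpack the dict into one column per observed character (a...z, 0...9)
data = data.unpack('char_counts', column_name_prefix='')

# train on the character-count columns instead of the raw string
features = [c for c in data.column_names() if c not in ('String', 'isNumber')]
model = gl.classifier.create(data, target='isNumber', features=features)
Note that any string you later pass to model.classify would need the same count-and-unpack transformation applied first.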
Related
I have a NLP project where I would like to remove the words that appear only once in the keywords. That is to say, for each row I have a list of keywords and their frequencies.
I would like something like
if the frequency for the word in the whole column ['keywords'] ==1 then replace by "".
I cannot test word by word, so my idea was to create a list with all the words, remove the duplicates, and then for each word in this list compute the total count and delete accordingly. But I have no idea how to do that.
Any ideas? Thanks!
Here's what my data looks like:
sample.head(4)
ID keywords age sex
0 1 fibre:16;quoi:1;dangers:1;combien:1;hightech:1... 62 F
1 2 restaurant:1;marrakech.shtml:1 35 M
2 3 payer:1;faq:1;taxe:1;habitation:1;macron:1;qui... 45 F
3 4 rigaud:3;laurent:3;photo:11;profile:8;photopro... 46 F
To add on to what @jpl mentioned with scikit-learn's CountVectorizer: there exists an option min_df that does exactly what you want, provided you can get your data into the right format. Here's an example:
from sklearn.feature_extraction.text import CountVectorizer
# assuming you want the token to appear in >= 2 documents
vectorizer = CountVectorizer(min_df=2)
documents = ['hello there', 'hello']
X = vectorizer.fit_transform(documents)
This gives you:
# Notice the dimensions of our array – 2 documents by 1 token
>>> X.shape
(2, 1)
# Here is a count of how many times the tokens meeting the inclusion
# criteria are observed in each document (as you see, "hello" is seen once
# in each document)
>>> X.toarray()
array([[1],
[1]])
# this is the entire vocabulary our vectorizer knows – see how "there" is excluded?
>>> vectorizer.vocabulary_
{'hello': 0}
Your representation makes that difficult.
You should build a dataframe where each column is a word; then you can easily use pandas operations like sum to do whatever you want.
However, this will lead to a very sparse dataframe, which is never good.
Many libraries, e.g. scikit-learn's CountVectorizer, allow you to do this efficiently; a pandas sketch of the wide-dataframe idea follows below.
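Here is a minimal sketch of that wide-dataframe idea, assuming the "word:count;word:count" cell format from the question's sample (the frame is named sample there; the parse helper and variable names are illustrative):
import pandas as pd

# parse "word:count;word:count" cells into dicts
def parse(cell):
    return {w: int(c) for w, c in
            (pair.split(':', 1) for pair in cell.split(';') if ':' in pair)}

parsed = sample['keywords'].apply(parse)

# wide frame: one column per word, one row per original row
wide = pd.DataFrame(list(parsed)).fillna(0)

# words whose total frequency over the whole column is 1
totals = wide.sum()
singletons = set(totals[totals == 1].index)

# rebuild the keyword strings without those words
sample['keywords'] = parsed.apply(
    lambda d: ';'.join('%s:%s' % (w, c)
                       for w, c in d.items() if w not in singletons))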
d1 = dataset['End station'].head(20)
for x in d1:
    x = re.compile("[0-9]{5}")
print(d1)
Using dataset['End_Station'] = dataset['End station'].map(lambda x: re.compile("([0-9]{5})").search(x).group())
raises: TypeError: expected string or bytes-like object.
I am new to data analysis, can't think of any other methods
Pandas has its own methods concerning Regex, so the "more pandasonic" way
to write code is just to use them, instead of native re methods.
Consider such example of source data:
End station
0 4055 Johnson Street, Chicago
1 203 Mayflower Avenue, Columbus
To find the street No in the above addresses, run:
df['End station'].str.extract(r'(?P<StreetNo>\d{1,5})')
and you will get:
StreetNo
0 4055
1 203
Note also that the street number may be shorter than 5 digits, but you attempt
to match a sequence of exactly 5 digits.
Another weird point in your code: why do you compile a regex in a loop
and then make no use of it?
Edit
After a more thorough look at your code I have a couple of additional remarks.
When you write:
for x in df:
...
then the loop actually iterates over column names (not rows).
Another weird point in your code is that the variable x, used initially to hold
a column name, is reused to hold a compiled regex.
That is a bad habit. Each variable should hold one clearly
defined object.
And as far as iteration over rows is concerned, you can use e.g.
for idx, row in df.iterrows():
...
But note that iterrows returns pairs composed of:
index of the current row,
the row itself.
Then (in the loop) you will probably refer to individual columns of this row.
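Putting the above together, a self-contained sketch might look like this (the sample frame simply mirrors the two example addresses):
import pandas as pd

df = pd.DataFrame({'End station': ['4055 Johnson Street, Chicago',
                                   '203 Mayflower Avenue, Columbus']})

# extract the leading street number (1 to 5 digits) into a new column;
# expand=False makes str.extract return a Series rather than a DataFrame
df['StreetNo'] = df['End station'].str.extract(r'(\d{1,5})', expand=False)

# row-wise iteration, if you really need it
for idx, row in df.iterrows():
    print(idx, row['StreetNo'])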
I have this table in access called invoices.
It has a field that performs a calculation based on two fields, that is field_price*field_Quantity. Now I added two new columns containing the units of measure for field_weight and field_quantity, named field_priceUnit and field_quantityUnit.
Now, instead of just performing the multiplication, I want it to check whether the units of measure match; if they don't match, it should convert field_quantity into the unit of measure of field_priceUnit.
example:
Row1:
ID:1|Field_Quantity:23|field_quantityUnit:LB|Field_weight:256|field_priceunit:KG|field_price:24| Calculated_Column:
the calculated_column should do the calculation this way.
1. if field_quantityunit=LB and field_priceunit=LB then field_quantity*field_price
else
if field_quantityUnit=LB and field_priceUnit=KG
THEN ((field_quantity/0.453592)*field_price) <<
Please help me.
I have to do this for multiple conditions.
Field_priceUnit may have the values LB, KG, and MT.
Field_quantityUnit may have the values LB, KG, and MT.
If the two units don't match, I want to do the conversion and calculate based on the converted value, as seen in the example.
Thank you
The following formula should get you running if your units are only lb and kg and you only have to check one direction:
IIf(field_quantityunit='LB' AND field_priceunit='LB', field_quantity*field_price, (field_quantity*0.453592)*field_price)
(Multiplying by 0.453592 converts the LB quantity to KG, to match the per-KG price.)
This doesn't scale well though as you may have to convert field_price or you may add other units. This iif formula will grow WAY out of hand quickly.
Instead create a new table called unit_conversion (or whatever you like), where the conversion factor turns each unit into kilograms:
unit | conversion
lb | 0.453592
kg | 1
g | 0.001
mg | 0.000001
Now in your query join:
LEFT OUTER JOIN unit_conversion as qty_conversion
ON field_quantityunit = qty_conversion.unit
LEFT OUTER JOIN unit_conversion as price_conversion
On field_priceUnit = price_conversion.unit
Up in the SELECT portion of the query you can now just do:
(field_quantity * qty_conversion.conversion) * (field_price / price_conversion.conversion)
And you don't have to worry what the units are: the quantity is converted to kilograms, the price to a per-kilogram price, and the two are multiplied.
You could use a pound or really any unit of weight as the base instead. The nice thing is that you only need to add new units to your conversion table to handle them in any SQL you write this way, so it's very scalable. A full query sketch is below.
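In Access SQL, the whole query might look like the following sketch (table and field names are taken from the question and may need adjusting; note that Access wants nested joins parenthesized):
SELECT invoices.ID,
       (invoices.field_quantity * qty_conversion.conversion)
         * (invoices.field_price / price_conversion.conversion) AS Calculated_Column
FROM (invoices
      LEFT JOIN unit_conversion AS qty_conversion
             ON invoices.field_quantityUnit = qty_conversion.unit)
      LEFT JOIN unit_conversion AS price_conversion
             ON invoices.field_priceUnit = price_conversion.unit;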
Say I have two tables. attrsTable:
file | attribute | value
------------------------
A | xdim | 5
A | ydim | 6
B | xdim | 7
B | ydim | 3
B | zdim | 2
C | xdim | 1
C | ydim | 7
sizeTable:
file | size
-----------
A | 17
B | 23
C | 34
I have these tables related via the 'file' field. I want a PowerPivot measure within attrsTable whose calculation uses size. For example, let's say I want (xdim+ydim)/size for each of A, B, C. The calculations would be:
A: (5+6)/17
B: (7+3)/23
C: (1+7)/34
I want the measure to be generic enough so I can use slicers later on to slice by file or attribute. How do I accomplish this?
I tried:
dimPerSize := CALCULATE([value]/SUM(sizeTable[size])) # Calculates 0
dimPerSize := CALCULATE([value]/SUM(RELATED(sizeTable[size]))) # Produces an error
Any idea what I'm doing wrong? I'm probably missing some fundamental concepts here of how to use DAX with relationships.
Hi Redstreet,
taking a step back from your solution and the one proposed by Jacob, I think it might be useful to create another table that would aggregate all the calculations (especially given you probably have more than 2 tables with file-specific attributes).
So I have created one more table that contains (only) unique file names, with one-to-many relationships from that file table to both attrsTable and sizeTable.
It's much simpler to add necessary measures (no need for calculated columns). I have actually tested 2 scenarios:
1) create simple SUM measures for both Attribute Value and File Size. Then divide those two measures and job done :-).
2) use SUMX functions to have a bit more universal solution. Then the final formula for DimPerSize calculation could look like this:
=DIVIDE(
SUMX(DISTINCT(fileTable[file]),[Sum of AttrValue]),
SUMX(DISTINCT(fileTable[file]),[Sum of FileSize]),
BLANK()
)
With [Sum of AttrValue] being:
=SUM(attrsTable[value])
And Sum of FileSize being:
=SUM(sizeTable[size])
This worked perfectly fine, even though SUMX in both cases goes over all instances of a given file name. So for file B it also calculates with zdim (if there is a need to filter that out, use a simple CALCULATE/FILTER combination). In the case of file size I am using SUMX as well, even though it's not really needed, since the table contains only 1 record per file name. If there were 2 instances, you would use SUMX or AVERAGEX depending on the desired outcome.
Hope this helps.
You have the concept of relationships right, but you aren't on the right track with CALCULATE(), either in its structure or in the fact that you can't simply use 'naked' numerical columns: they need to be wrapped in an aggregation of some kind.
Your desired approach is correct in that once you get a simple version of the thing running, you will be able to slice and dice it over any of your related dimensions.
Best practice is probably to build this up using several measures:
[xdim] = CALCULATE(SUM('attrstable'[value]), 'attrstable'[attribute] = "xdim")
[ydim] = CALCULATE(SUM('attrstable'[value]), 'attrstable'[attribute] = "ydim")
[dimPerSize] = ([xdim] + [ydim]) / VALUES('sizeTable'[size])
But depending on exactly how your pivot is set up, this is likely to also throw an error because it will try and use the whole 'size' column in your totals. There are two main strategies for dealing with this:
1) Use an 'iterative' formula such as SUMX() or AVERAGEX() to iterate individually over the 'file' field and then add up or average the results for the total, e.g.
[ItdimPerSize] = AVERAGEX(VALUES('sizeTable'[file]), [dimPerSize])
Depending on the maths you want, you may find that to produce a useful average you need to use SUMX but divide by the number of cases, i.e. COUNTROWS(VALUES('sizeTable'[file])).
2) You might decide that the totals are irrelevant and simply introduce an error-handling element that makes them blank, e.g.
[NtdimPerSize] = IF(HASONEVALUE('sizeTable'[file]), [dimPerSize], BLANK())
NB: all of this assumes that when you create your pivot you are 'dragging in' the file field from the 'sizeTable'.
I have the following data.frame:
employee <- c('John Doe','Peter Gynn','Jolie Hope')
# Note that the salary below is in stringified format.
# In reality there are more such stringified numerical columns.
salary <- as.character(c(21000, 23400, 26800))
df <- data.frame(employee,salary)
The output is:
> str(df)
'data.frame': 3 obs. of 2 variables:
$ employee: Factor w/ 3 levels "John Doe","Jolie Hope",..: 1 3 2
$ salary : Factor w/ 3 levels "21000","23400",..: 1 2 3
What I want to do is convert the values from strings into pure numbers,
straight from the df variable, while preserving the string value for employee.
I tried this, but it doesn't work:
as.numeric(df)
At the end of the day I'd like to perform arithmetic on these numeric values from df, such as df2 <- log2(df), etc.
OK, there are a couple of things going on here:
R has two different datatypes that look like strings: factor and character
You can't modify most R objects in place, you have to change them by assignment
The actual fix for your example is:
df$salary = as.numeric(as.character(df$salary))
If you try to call as.numeric on df$salary without converting it to character first, you'd get a somewhat strange result:
> as.numeric(df$salary)
[1] 1 2 3
When R creates a factor, it turns the unique elements of the vector into levels, and then represents those levels using integers, which is what you see when you try to convert to numeric.
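Since the question mentions there are more such stringified numeric columns, here is a small sketch of the same fix applied to several columns at once (the cols vector is illustrative; list your real column names there):
# columns to convert; add the other stringified numeric columns here
cols <- c("salary")
df[cols] <- lapply(df[cols], function(x) as.numeric(as.character(x)))

# arithmetic now works as expected
log2(df$salary)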