There are 40 rows in my dataset and 3 attribute columns. Each row is a separate text document. I converted the strings to separate terms using the TermDocumentMatrix() function of library(tm). But this function is treating the number of attribute columns as the number of documents. Why is that? Am I making a mistake?
Is there any attribute filter in R that is similar to Weka's StringToWordVector filter? I want the result to be the same as Weka's StringToWordVector filter.
A sample is shown below:
Title, Author, BookSummary
The Da Vinci Code, Dan Brown, Louvre curator and Priory of Sion Grand Master Jacques
This sample shows just 1 row.
I tried this code:
data<-read.csv("C:/Users/admin/Desktop/RTextMining/dataset.csv")
corpus.tmp<-Corpus(VectorSource(data))
View(corpus.tmp)
corpus.tmp<- tm_map(corpus.tmp,removePunctuation)
corpus.tmp<- tm_map(corpus.tmp, stripWhitespace)
corpus.tmp<- tm_map(corpus.tmp, tolower)
corpus.tmp<- tm_map(corpus.tmp, removeWords, stopwords("english"))
library(SnowballC)
corpus.tmp <- tm_map(corpus.tmp, stemDocument)
TDM <- TermDocumentMatrix(corpus.tmp)
I am trying to achieve this result: assign a category to a document based on its title, or part of its title.
Title                             Category
correspondence                    Correspondence
Note Transmission Correspondence  Correspondence
Advisors Evaluation Report        Report
Country Notes                     Correspondence
Annual Portfolio Report           Report
Appointment Letter                Correspondence
The categories are arranged into a table (docCategories) where each row starts with a unique category name, and is followed by a set of labels that match entirely or partially with the document title.
Category        Label      Label2  Label3  Label4
Correspondence  Letter     Memo    Note    Correspondence
Report          Dashboard  Report
The formula should take the document title, check whether it matches any of the labels (with wildcards), and return the unique category found in the first position of the row that contains the matched label.
Appointment Letter -> matches label:letter -> cat:Correspondence
I got it working with this formula, copied down the Category column:
=INDEX(docCategories;MIN(IF(docCategories=A2;ROW(docCategories)))-1;MIN(IF(docCategories=A2;1)))
But it works only if the title exactly matches the entire label (e.g. Correspondence -> matches label:correspondence -> cat:Correspondence).
I would like it to also match on part of the title (e.g. Appointment Letter -> matches label:letter -> cat:Correspondence).
I have tried and failed to change docCategories=<title> into something that matches a substring of the title; even applying SPLITEXT(<title>) fails to give me the expected result.
Who can think of a creative solution for this?
The following solution works for any number of categories and for any number of labels per category. It also reports when no label was found and when labels from more than one category were found. Since the question doesn't carry a specific Excel version tag, I assume Microsoft Office 365 functions can be used.
In cell I2, enter the following formula:
=LET(rng, A2:E3, texts, G2:G9, lkupValues, B2:E3, categories, INDEX(rng,,1),
BYROW(texts, LAMBDA(text,LET(
reduceResult, REDUCE("",categories, LAMBDA(acc,c, LET(
lkup, XLOOKUP(c,categories, lkupValues), searchLabels, FILTER(lkup, lkup<>0),
IF(SUM(N(ISNUMBER(SEARCH(searchLabels,text))))=0, acc,
IF(acc="", c, "MORE THAN ONE CATEGORY FOUND"))
))), IF(reduceResult<>"", reduceResult, "CATEGORY NOT FOUND")
)))
)
and here is the corresponding output:
The last two rows of the Title column were added to test the non-happy paths.
Explanation
We use the LET function to define the names to be used and to avoid repeating the same calculation. If your Excel version has the DROP function, the name lkupValues can be defined as follows: DROP(rng,,1).
The main idea is to iterate over the texts values via BYROW and, for each text, to invoke the SEARCH function on the labels of every category. When the first input argument of SEARCH is an array, it returns an array of the same shape holding the start position of each label found in text, or #VALUE! where a label was not found.
Note: SEARCH is not case-sensitive; if you need case-sensitive matching, replace it with FIND.
We use the REDUCE function to iterate over all categories to find a match. For each category (c) we find the corresponding labels via XLOOKUP. Not all categories have the same number of labels (for example, Report has fewer labels than Correspondence), so we need to remove the empty labels. The name searchLabels filters the result down to the non-empty labels.
For checking if labels were not found we use the following condition:
SUM(N(ISNUMBER(SEARCH(searchLabels,text))))=0
ISNUMBER converts the SEARCH result to TRUE/FALSE values. The N function converts those to the equivalent 1/0 values.
If the condition is TRUE, no label of this category was found, so it returns the accumulator unchanged (acc is initialized to an empty string). If the condition is FALSE, some labels were found, so it returns the category (c) if acc is empty, i.e. no previous category was found. If acc is not empty, a previous category was already found, so it returns MORE THAN ONE CATEGORY FOUND.
Finally, if the result of REDUCE (reduceResult) is an empty string, it means the accumulator was not updated after initialization, so no labels were found for any category and it is indicated with the output: CATEGORY NOT FOUND.
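For readers who find the nested LAMBDAs hard to follow, the same accumulator logic can be sketched in Python. This is only an illustration of the idea, not a replacement for the formula: the dictionary DOC_CATEGORIES and the function categorize are hypothetical names standing in for the docCategories table and the BYROW/REDUCE step, and plain substring matching is used, so the wildcard support of SEARCH is not reproduced.

```python
# Hypothetical stand-in for the docCategories table: category -> labels.
DOC_CATEGORIES = {
    "Correspondence": ["Letter", "Memo", "Note", "Correspondence"],
    "Report": ["Dashboard", "Report"],
}


def categorize(title, categories=DOC_CATEGORIES):
    """Return the single category whose labels occur in the title."""
    result = ""  # plays the role of acc in REDUCE
    lowered = title.lower()  # SEARCH is case-insensitive, so compare in lower case
    for category, labels in categories.items():
        if not any(label.lower() in lowered for label in labels):
            continue  # no label of this category matched: keep the accumulator
        # First match sets the category; any later match flags a conflict.
        result = category if result == "" else "MORE THAN ONE CATEGORY FOUND"
    return result if result != "" else "CATEGORY NOT FOUND"


for title in ["Appointment Letter", "Country Notes", "Memo Report", "Invoice"]:
    print(title, "->", categorize(title))
```

The last two titles are made-up inputs that exercise the two non-happy paths.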
I realize there is another thread with a similar question:
Netsuite: Saved Search Function much like "Text To Columns" in Excel
but the answer only pulled the first Class, none of the sub classes.
What if you want all of the Classes split into columns?
"Class" could have many levels, for example
Main:Sub1:Sub2:Sub3
and could have any number of subclasses.
I'm assuming I could create 6 (or more) different columns with each one resulting in each respective Main and Sub -
Column 1 Column 2 Column 3 etc.
Main Sub1 Sub2 etc.
This seems like a common desire I can't find an answer to.
Thanks!
You can use REGEXP_SUBSTR to extract the part of the string for the corresponding column.
Method 1
Main Class (First Column)
TRIM(REGEXP_SUBSTR({class}, '^[^:]+'))
Subclass 1 (Second Column)
TRIM(REGEXP_SUBSTR({class}, '^[^:]+:([^:]+)',1,1,'i',1))
Subclass 2 (Third Column)
TRIM(REGEXP_SUBSTR({class}, '^[^:]+:[^:]+:([^:]+)',1,1,'i',1))
Extend this pattern using additional groups of [^:]+: immediately after the first '^' of the regex string to add extra columns.
Method 2
Main Class (First Column)
TRIM(REGEXP_SUBSTR({class}, '^([^:]*):*([^:]*):*([^:]*):*([^:]*):*([^:]*):*([^:]*)',1,1,'i',1))
Subclass 1 (Second Column)
TRIM(REGEXP_SUBSTR({class}, '^([^:]*):*([^:]*):*([^:]*):*([^:]*):*([^:]*):*([^:]*)',1,1,'i',2))
Subclass 2 (Third Column)
TRIM(REGEXP_SUBSTR({class}, '^([^:]*):*([^:]*):*([^:]*):*([^:]*):*([^:]*):*([^:]*)',1,1,'i',3))
Extend this to additional columns by simply changing the value of the last parameter, up to the number of capture groups in the regex string - in the examples shown this will work up to 6 columns. For more than 6 columns you would also have to add extra ([^:]*):* groups to the regex string.
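To try out the Method 2 pattern outside NetSuite, here is a small Python sketch that builds the same regex for a configurable number of columns and reads one capture group per column. The function name split_class and the sample value are assumptions made for illustration; note also that Python returns empty strings for missing levels, which may differ from what the saved search displays.

```python
import re


def split_class(value, columns=6):
    """Split a 'Main : Sub1 : Sub2' style value into a fixed number of columns."""
    # One ([^:]*) capture group per column, joined by :* as in Method 2.
    pattern = "^" + ":*".join(["([^:]*)"] * columns)
    match = re.match(pattern, value)  # always matches, since every part is optional
    # strip() plays the role of TRIM around each extracted part.
    return [part.strip() for part in match.groups()]


print(split_class("Main : Sub1 : Sub2"))
```

Changing the columns argument corresponds to adding or removing ([^:]*):* groups in the regex string.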
d1 = dataset['End station'].head(20)
for x in d1:
    x = re.compile("[0-9]{5}")
print(d1)
Using dataset['End_Station'] = dataset['End station'].map(lambda x: re.compile("([0-9]{5})").search(x).group())
raises TypeError: expected string or bytes-like object.
I am new to data analysis and can't think of any other methods.
Pandas has its own methods concerning Regex, so the "more pandasonic" way
to write code is just to use them, instead of native re methods.
Consider this example of source data:
End station
0 4055 Johnson Street, Chicago
1 203 Mayflower Avenue, Columbus
To find the street No in the above addresses, run:
df['End station'].str.extract(r'(?P<StreetNo>\d{1,5})')
and you will get:
StreetNo
0 4055
1 203
Note also that the street No may be shorter than 5 digits, but you attempt
to match a sequence of just 5 digits.
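The effect of the digit count can be checked with the standard re module alone, used here only for illustration on the two sample addresses. The helper street_number is a hypothetical name, and the one-to-five-digit pattern is just one possible relaxation.

```python
import re

addresses = ["4055 Johnson Street, Chicago", "203 Mayflower Avenue, Columbus"]


def street_number(address, pattern):
    """Return the first match of pattern in address, or None if there is none."""
    match = re.search(pattern, address)
    return match.group() if match else None


# Exactly five digits: neither sample address contains such a run.
exact_five = [street_number(a, r"[0-9]{5}") for a in addresses]
# One to five digits: both street numbers are found.
one_to_five = [street_number(a, r"[0-9]{1,5}") for a in addresses]

print(exact_five)   # [None, None]
print(one_to_five)  # ['4055', '203']
```

When search() finds nothing it returns None, so calling .group() on the result without a check fails as well.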
Another weird point in your code: why do you compile a regex in a loop
and then make no use of it?
Edit
After a more thorough look at your code I have a couple of additional remarks.
When you write:
for x in df:
    ...
the loop actually iterates over column names (not rows).
Another weird point in your code is that the x variable, used initially to hold
a column name, is used again to store a compiled regex.
It is a bad habit. Each variable should hold one clearly
defined object.
And as far as iteration over rows is concerned, you can use e.g.
for idx, row in df.iterrows():
    ...
But note that iterrows returns pairs composed of:
index of the current row,
the row itself.
Then (in the loop) you will probably refer to individual columns of this row.
To get familiar with Turi, I'm trying to create a model that is able to distinguish between strings consisting of chars and strings consisting of numbers.
I have a CSV file with training data. Each line consists of two entries: a string and an indicator of whether this string is a number or a plain string.
String, isNumber
bvmuuflo , 0
71047015 , 1
My Python-Script to generate the model looks like this:
import graphlab as gl
data = gl.SFrame('data.csv')
model = gl.classifier.create(data, target="isNumber", features=["String"])
This works fine. But I have no idea how to use the model to check for example if "qwerty" is a String or a Number.
I'm trying to use the model.classify(...) API call. But the two calls
model.classify(gl.SFrame(["qwertzui"]))
and
model.classify(gl.SFrame(["98765432"]))
return the same result:
Columns:
class int
probability float
Rows: 1
Data:
+-------+----------------+
| class | probability |
+-------+----------------+
| 1 | 0.509227594584 |
+-------+----------------+
[1 rows x 2 columns]
Obviously there is a mistake in my program, but I'm not able to find it.
Any help is welcome!
Since the model has only one column for training, it will be able to identify strings it has already seen but not ones it has never seen. My guess is that 0.509 is the fraction of your training rows that fall in the returned class, so it just responds with that for anything it has not seen before.
This is obviously a toy example but if you want to get it to work I would use something like a bag of words, but for letters. Make 36 columns with the titles a,b,c...z,0,1...9 and put the count of each character per string for each row. This way the model will look at individual letters as giving a probability to the class instead of the string as a whole.
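A minimal sketch of that character-count idea, in plain Python with no Turi calls (how the resulting columns are loaded into an SFrame is left out). The names ALPHABET and char_counts are hypothetical, and lower-casing the input is an extra assumption so that upper-case letters land in the same 26 letter columns.

```python
import string
from collections import Counter

# The 36 feature columns: a..z followed by 0..9.
ALPHABET = string.ascii_lowercase + string.digits


def char_counts(text):
    """Count how often each of the 36 characters occurs in text."""
    counts = Counter(text.lower())
    # Characters outside the alphabet (spaces, punctuation) are ignored.
    return {ch: counts.get(ch, 0) for ch in ALPHABET}


row = char_counts("71047015")
print({ch: n for ch, n in row.items() if n})
```

Each training string then becomes one row of 36 numeric features, so an unseen string such as "qwertzui" still shares features with the training data.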
I have the following data.frame:
employee <- c('John Doe','Peter Gynn','Jolie Hope')
# Note that the salary below is in stringified format.
# In reality there are more such stringified numerical columns.
salary <- as.character(c(21000, 23400, 26800))
df <- data.frame(employee,salary)
The output is:
> str(df)
'data.frame': 3 obs. of 2 variables:
$ employee: Factor w/ 3 levels "John Doe","Jolie Hope",..: 1 3 2
$ salary : Factor w/ 3 levels "21000","23400",..: 1 2 3
What I want to do is convert the values from strings into pure numbers,
directly in the df variable, while preserving the string names for employee.
I tried this, but it doesn't work:
as.numeric(df)
At the end of the day I'd like to perform arithmetic on these numeric
values from df. Such as df2 <- log2(df), etc.
Ok, there's a couple of things going on here:
R has two different datatypes that look like strings: factor and character
You can't modify most R objects in place, you have to change them by assignment
The actual fix for your example is:
df$salary = as.numeric(as.character(df$salary))
If you try to call as.numeric on df$salary without converting it to character first, you'd get a somewhat strange result:
> as.numeric(df$salary)
[1] 1 2 3
When R creates a factor, it turns the unique elements of the vector into levels, and then represents those levels using integers, which is what you see when you try to convert to numeric.
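The level/code mechanism can be imitated in a few lines of Python to see where the 1 2 3 comes from. This is only a sketch of the idea, not how R implements factors: as_factor is a hypothetical helper, and it sorts levels by plain string order, whereas R's ordering depends on the locale.

```python
def as_factor(values):
    """Return (levels, codes): the sorted unique values and 1-based integer codes."""
    levels = sorted(set(values))
    codes = [levels.index(v) + 1 for v in values]
    return levels, codes


salary = ["21000", "23400", "26800"]
levels, codes = as_factor(salary)
print(codes)  # [1, 2, 3], the analogue of as.numeric(df$salary)

# Going through the level labels first recovers the real numbers,
# the analogue of as.numeric(as.character(df$salary)).
numbers = [float(levels[c - 1]) for c in codes]
print(numbers)  # [21000.0, 23400.0, 26800.0]
```

Applied to the employee names, the same helper yields the codes 1 3 2 seen in the str(df) output.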