I am converting PDF files to text files using pdfminer and then trying to extract their contents. When I do so, the text file contains some empty lines, and I need to remove them. Can someone tell me how I can achieve this?
Text file after conversion:
XYZ
8th Semester Undergraduate | Computer Science Engineering | UCE RTU, Kota
+91 1234567890 | ABCD#gmail.com | 7/108, Malviya Nagar Jaipur (302017)
SUMMARY
To seek an opportunity to apply my technology expertise along with my creative problem solving skills in an
innovative software company.
EXPERIENCE
Machine Learning Engineering Intern , Forsk Technologies , Jaipur (May,2017 – July,2017)
Learned the foundational concepts of data science and machine learning including python and statistics,
enough time was spent on understanding the concept behind each algorithm and examples and case
studies were done. Built some mid-scaled machine learning models using supervised and unsupervised
learning.
Software Engineering Intern , Proxbotics Creations Technologies , Jaipur (May,2016 – July,2016)
Developed and optimized various projects including ecommerce, booking & reservation, non-profit
organization Websites, using technologies: HTML, CSS, PHP, JavaScript, MySQL etc.
Trainee at TecheduSoft , Kota (May,2015)
The course contains 15+ modules including Android Basics, fragments, screen designing, intents, various
views, signing app, web servers, web services, notifications, etc.
PROJECTS
All projects are available on git: https://github.com/JAIJANYANI
Video Analysis for surveillance
-A command line app which takes all your CCTV feeds as input and filters feeds with abnormal events
which results in 90% less videos to watch, Used image processing and deep learning algorithms,
outputs all time-stamps of interesting events for all feeds.
Food Calorie Estimator
-An android app to estimate calories present in food with still image. Trained own Data-set (Meal-net)
using Transfer learning Built upon Inception V3, Proposed a Deep Convolutional Neural Network (CNN)
with 48 Layers, Developed a REST API to integrate it in Mobile apps, Optimized total computation time
~ 2 Seconds.
CryptoCurrency Market Predictor
- A Flask app to predict the future prices of various Crypto Currencies, implemented various supervised
and deep learning algorithms such as LSTM (RNN), polynomial regression, using scikit-learn, tensorflow,
keras etc.
Spam Filter
-A REST API to Detect Incoming SMS or Email as Spam or Ham which can be trained on your own data
set. Used NLP with Naive Bayes for Sentiment Analysis.
Image Classifier using CNN
-An application which detects objects present in a still image, implemented convolutional neural
network using open source machine learning library which can be run on multiple machines to reduce
training workloads, classifies objects using pre-trained image-net model.
Online Student and Faculty Portal
-A Web Portal to manage attendance of students and faculties, can be integrated to mobile apps. Uses
Php, MySQL, HTML, CSS, JavaScript, etc.
Tax Accounting
-A Decentralized web app built on Ethereum Block-Chain using Truffle and Embark framework, which
can be used to transfer funds between accounts which automatically deducts tax from the account.
TECHNICAL SKILLS
Programming Languages
Web Technologies
Scripting Languages
Database Management System
Operating Systems
Strongest Areas
COURSES
:
:
:
:
:
C, C++
HTML,CSS
Python, PHP, BASH
MySQL, SQLite
Microsoft Windows, Linux, UNIX
:
Machine Learning, Data Science
Applied Machine Learning , Applied Data Science , Exploratory Data Analysis & Data Visualization , Neural
Networks & Deep Learning , Computer networks , Data Structures & Algorithms , Operating Systems , Cloud
Computing , Data Mining , Block chain Essentials , Database Management Systems.
EDUCATION
University College of Engineering , Kota : Btech (Pursuing) in Computer Science Engineering (2018)
St. Edmunds School , Jaipur : Senior Secondary (XII) Education Rajasthan (2012)
St. Edmunds School , Jaipur : Secondary (X) Education Rajasthan (2010)
Expected:
XYZ
8th Semester Undergraduate | Computer Science Engineering | UCE RTU, Kota
+91 1234567890 | ABCD#gmail.com | 7/108, Malviya Nagar Jaipur (302017)
SUMMARY
To seek an opportunity to apply my technology expertise along with my creative problem solving skills in an
innovative software company.
EXPERIENCE
-Machine Learning Engineering Intern , Forsk Technologies , Jaipur (May,2017 – July,2017)
Learned the foundational concepts of data science and machine learning including python and statistics,
enough time was spent on understanding the concept behind each algorithm and examples and case
studies were done. Built some mid-scaled machine learning models using supervised and unsupervised
learning.
-Software Engineering Intern , Proxbotics Creations Technologies , Jaipur (May,2016 – July,2016)
Developed and optimized various projects including ecommerce, booking & reservation, non-profit
organization Websites, using technologies: HTML, CSS, PHP, JavaScript, MySQL etc.
-Trainee at TecheduSoft , Kota (May,2015)
The course contains 15+ modules including Android Basics, fragments, screen designing, intents, various
views, signing app, web servers, web services, notifications, etc.
PROJECTS
All projects are available on git: https://github.com/JAIJANYANI
Video Analysis for surveillance
-A command line app which takes all your CCTV feeds as input and filters feeds with abnormal events
which results in 90% less videos to watch, Used image processing and deep learning algorithms,
outputs all time-stamps of interesting events for all feeds.
- Food Calorie Estimator
-An android app to estimate calories present in food with still image. Trained own Data-set (Meal-net)
using Transfer learning Built upon Inception V3, Proposed a Deep Convolutional Neural Network (CNN)
with 48 Layers, Developed a REST API to integrate it in Mobile apps, Optimized total computation time
~ 2 Seconds.
- CryptoCurrency Market Predictor
- A Flask app to predict the future prices of various Crypto Currencies, implemented various supervised
and deep learning algorithms such as LSTM (RNN), polynomial regression, using scikit-learn, tensorflow,
keras etc.
- Spam Filter
-A REST API to Detect Incoming SMS or Email as Spam or Ham which can be trained on your own data
set. Used NLP with Naive Bayes for Sentiment Analysis.
Image Classifier using CNN
-An application which detects objects present in a still image, implemented convolutional neural
network using open source machine learning library which can be run on multiple machines to reduce
training workloads, classifies objects using pre-trained image-net model.
Online Student and Faculty Portal
A Web Portal to manage attendance of students and faculties, can be integrated to mobile apps. Uses
Php, MySQL, HTML, CSS, JavaScript, etc.
Tax Accounting
-A Decentralized web app built on Ethereum Block-Chain using Truffle and Embark framework, which
can be used to transfer funds between accounts which automatically deducts tax from the account.
TECHNICAL SKILLS
Programming Languages
Web Technologies
Scripting Languages
Database Management System
Operating Systems
Strongest Areas
COURSES
:
:
:
:
:
C, C++
HTML, CSS
Python, PHP, BASH
MySQL, SQLite
Microsoft Windows, Linux, UNIX
:
Machine Learning, Data Science
Applied Machine Learning , Applied Data Science , Exploratory Data Analysis & Data Visualization , Neural
Networks & Deep Learning , Computer networks , Data Structures & Algorithms , Operating Systems , Cloud
Computing , Data Mining , Block chain Essentials , Database Management Systems.
EDUCATION
University College of Engineering , Kota : Btech (Pursuing) in Computer Science Engineering (2018)
St. Edmunds School , Jaipur : Senior Secondary (XII) Education Rajasthan (2012)
St. Edmunds School , Jaipur : Secondary (X) Education Rajasthan (2010)
I'm using the below code for the conversion:
import io

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def extract_text_from_pdf(pdf_path):
    '''
    Helper function to extract the plain text from .pdf files
    '''
    with open(pdf_path, 'rb') as fh:
        # Yield the text of each page, one page at a time
        for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
            resource_manager = PDFResourceManager()
            fake_file_handle = io.StringIO()
            converter = TextConverter(resource_manager, fake_file_handle, codec='utf-8', laparams=LAParams())
            page_interpreter = PDFPageInterpreter(resource_manager, converter)
            page_interpreter.process_page(page)
            text = fake_file_handle.getvalue()
            output_txt = text
            yield text
            #print(output_txt)
            converter.close()
            fake_file_handle.close()
import docx2txt


def extract_text_from_doc(doc_path):
    '''
    Helper function to extract plain text from .doc or .docx files
    '''
    temp = docx2txt.process(doc_path)
    # Drop empty lines and replace tabs with spaces
    text = [line.replace('\t', ' ') for line in temp.split('\n') if line]
    return ' '.join(text)
def extract_text(file_path, extension):
    '''
    Wrapper function to detect the file extension and call text extraction function accordingly
    '''
    text = ''
    if extension == '.pdf':
        for page in extract_text_from_pdf(file_path):
            text += ' ' + page
    elif extension == '.docx' or extension == '.doc':
        text = extract_text_from_doc(file_path)
    print(text)
    return text
I suggest using a regex.
import re
rc = re.compile(r"\n{2,}") # two and more new lines
string_without_extra_new_lines = re.sub(rc,"\n", string_with_extra_blank_lines)
Having this as an example:
XYZ
8th Semester Undergraduate | Computer Science Engineering | UCE RTU, Kota
+91 1234567890 | ABCD#gmail.com | 7/108, Malviya Nagar Jaipur (302017)
SUMMARY
To seek an opportunity to apply my technology expertise along with my creative problem solving skills in an
innovative software company.
EXPERIENCE
This is the output:
XYZ
8th Semester Undergraduate | Computer Science Engineering | UCE RTU, Kota
+91 1234567890 | ABCD#gmail.com | 7/108, Malviya Nagar Jaipur (302017)
SUMMARY
To seek an opportunity to apply my technology expertise along with my creative problem solving skills in an
innovative software company.
EXPERIENCE
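One thing to watch with the regex above: it only collapses runs of bare newlines, so a line that contains only spaces or tabs would survive. A minimal sketch of an alternative that filters line by line is below. The function name remove_blank_lines and the sample string are made up for illustration and are not part of pdfminer.

```python
def remove_blank_lines(text: str) -> str:
    """Return text with empty and whitespace-only lines dropped."""
    # strip() turns a whitespace-only line into '', which is falsy and gets filtered out
    kept = [line for line in text.splitlines() if line.strip()]
    return "\n".join(kept)


# Hypothetical sample standing in for the extracted PDF text
sample = "XYZ\n\n8th Semester Undergraduate\n \t \n\nSUMMARY\n"
cleaned = remove_blank_lines(sample)
print(cleaned)
```

In the question's code this could be applied to the string returned by extract_text before writing it to the text file.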
I am a beginner in machine learning. I want to build a model for predicting
trending products. Can you please tell me what layout and which parameters
I need in my dataset? Let's say I want to predict trending products in a certain category, so I will be collecting data for that category from various e-commerce sites, e.g. eBay, Amazon etc.
Please explain in detail.
You will need a dataset with features like
Number of sales
Ratings
Recommendations
And many more.
This is a classification problem: you need to classify the products as trendy or not trendy. You will also need labels that mark each example in the data as trendy or not trendy.
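As a rough sketch of what such a layout could look like: one row per product, one column per feature, plus a label column. All field names and values below are hypothetical and only illustrate the shape of the data, not a recommended feature set.

```python
from dataclasses import dataclass


@dataclass
class ProductRecord:
    # Hypothetical fields, chosen to mirror the features listed above
    product_id: str
    num_sales: int
    avg_rating: float
    num_recommendations: int
    is_trendy: bool  # the label the classifier should learn to predict


# Made-up example rows
records = [
    ProductRecord("p1", 1200, 4.6, 310, True),
    ProductRecord("p2", 35, 3.1, 2, False),
    ProductRecord("p3", 870, 4.2, 150, True),
]

# Feature matrix X (one row per product) and label vector y (one label per row)
X = [[r.num_sales, r.avg_rating, r.num_recommendations] for r in records]
y = [int(r.is_trendy) for r in records]

print(X)
print(y)
```

Most classification libraries expect this X / y split as input, so collecting the data in this shape keeps the later modelling step simple.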
I am working on a document classification problem for financial reports/documents. Is there a ready-made corpus for this? I found a couple of use cases, but they all built their own corpus.
You will more than likely have to create your own corpus. I had a similar task, and manually creating such a corpus would have been too tedious. As a result I created News Corpus Builder, a Python module that allows you to quickly develop a corpus based on your particular topics of interest.
The module allows you to generate your own corpus and store the text and associated label in sqlite or as flat files.
from news_corpus_builder import NewsCorpusGenerator
# Location to save generated corpus
corpus_dir = '/Users/skillachie/finance_corpus'
# Save results to sqlite or files per article
ex = NewsCorpusGenerator(corpus_dir,'sqlite')
# Retrieve 50 links related to the search term dogs and assign a category of Pet to the retrieved links
links = ex.google_news_search('dogs','Pet',50)
# Generate and save corpus
ex.generate_corpus(links)
More details on my blog
The finance corpus is available for download here. The corpus has the following categories:
Policy (licenses, regulation, SEC, monetary, fed, monetary, fiscal, imf)
International Finance (global finance, IMF, ECB, trouble in Greece, RMB devaluation)
Economy (GDP, Jobs, unemployment, housing, economy)
Raising Capital (ipo, equity)
Real Estate
Mergers & Acquisitions (merger, acquisitions)
Oil (oil, oil prices, natural gas price)
Commodities (commodities, gold, silver)
Fraud (insider trading, ponzi scheme, finance fraud)
Litigation (company litigation, company settlement)
Earning Reports
You can use the Reuters-21578 corpus. http://www.daviddlewis.com/resources/testcollections/reuters21578/
It is a basic corpus for text classification.
I'm looking to train a naive Bayes classifier with some new data sources that haven't been used before. I've already looked at the Pang & Lee corpus of IMDB reviews and the MPQA opinion corpus. I'm looking for new web services that fit the following criteria.
Easily Classified - must have a like/dislike or 5 star rating
Readily available
Pertain to new material (less important than the first two)
Here are some samples I have come up with on my own.
Etsy API
Rotten Tomatoes API
Yelp API
Any other suggestions would be much appreciated =)
Pang & Lee's later work (2008), "Opinion Mining and Sentiment Analysis", has a section on publicly available resources. It has links to those corpora.
Take a look at sentiment140. It has a corpus that you can download and train with. You can easily extend it to new tweets.
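Whichever source you pick, a star rating has to be turned into a like/dislike label before training. Here is a minimal sketch of that step with a tiny hand-rolled multinomial naive Bayes (Laplace smoothing). The reviews are made up, and the mapping (4-5 stars is "like", 1-2 is "dislike", 3-star reviews dropped) is an assumption, not something any of the APIs above prescribe.

```python
import math
from collections import Counter, defaultdict


def label_from_stars(stars: int) -> str:
    # Assumed mapping: 4-5 stars -> "like", otherwise "dislike" (caller drops 3-star reviews)
    return "like" if stars >= 4 else "dislike"


def train(samples):
    """samples: iterable of (text, label). Returns (class counts, per-class word counts, vocabulary)."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def predict(text, model):
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Laplace-smoothed log likelihood of the word given the class
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up (review text, star rating) pairs standing in for data pulled from a review API
reviews = [
    ("great product fast shipping", 5),
    ("love it works great", 4),
    ("terrible quality broke quickly", 1),
    ("awful waste of money", 2),
    ("it is okay", 3),
]
samples = [(text, label_from_stars(stars)) for text, stars in reviews if stars != 3]
model = train(samples)
print(predict("great quality love it", model))
```

Dropping the middle rating is a common way to keep the two classes clearly separated, but whether it suits your data is a judgment call.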
Has anyone out there used RapidMiner for sentiment analysis? Is this the right combination?
If not, how do I get started with a simple sentiment analysis application?
RapidMiner is a very powerful text mining and sentiment analysis tool. I can recommend the RapidMiner training courses offered by Rapid-I. They gave me a really quick start. They also offer a dedicated course on text mining and sentiment analysis:
Sentiment Analysis, Opinion Mining, and Automated Market Research.
Starting in September or October 2009, they will also offer webinars. You should contact them directly if you would like to learn more about their webinars. Several major online market research companies in Europe and the US are using RapidMiner for opinion mining and sentiment analysis of internet discussion groups and web blogs. For more details and references, I would again suggest simply asking their team at contact(at)rapid-i.com or checking their RapidMiner forum at forum.rapid-i.com.
Best regards,
Frank
This series of videos should help:
http://vancouverdata.blogspot.com/2010/11/text-analytics-with-rapidminer-loading.html
When I go to the RapidMiner site, it confuses me.
http://rapidminer.com/solutions/sentiment-analysis/
"It looks like a crowd sourcing to identify the polarity of product reviews and discussions around the web." If you are looking to automate in real time this might not be a good solution.
spotdy.com offers free NLP for developers. It works pretty well.
Most sentiment analysis software tokenizes the text, gives each word a positive or negative factor, and sums those up. Since language is contextual, this ignores the context, which is not the right way to do it.
A better alternative is a deep learning model or an HMM based on sentence structure, which computes the sentiment from how words are composed in a sentence. Check out spotdy.com. It is free.
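To make the problem with word-summing concrete, here is a minimal sketch of that kind of scorer. The lexicon and its weights are invented for illustration; real tools use far larger word lists.

```python
# Hypothetical toy lexicon; the words and weights are made up
LEXICON = {"good": 1, "great": 2, "bad": -1, "terrible": -2}


def lexicon_score(text: str) -> int:
    """Sum per-word polarity scores, ignoring word order and context."""
    return sum(LEXICON.get(word, 0) for word in text.lower().split())


positive = lexicon_score("the movie was good")
negated = lexicon_score("the movie was not good")
print(positive, negated)
```

Both sentences get the same score because "not" carries no weight of its own, which is exactly the kind of context a bag-of-words sum cannot see.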