Remove blank lines from text which is extracted from pdf file - python-3.x
I am converting PDF files to text files using pdfminer and then trying to extract their contents. When I do so, the resulting text file contains some empty lines, which I need to remove. Can someone tell me how to achieve this?
Text file after conversion:
XYZ
8th Semester Undergraduate | Computer Science Engineering | UCE RTU, Kota
+91 1234567890 | ABCD#gmail.com | 7/108, Malviya Nagar Jaipur (302017)
SUMMARY
To seek an opportunity to apply my technology expertise along with my creative problem solving skills in an
innovative software company.
EXPERIENCE
Machine Learning Engineering Intern , Forsk Technologies , Jaipur (May,2017 – July,2017)
Learned the foundational concepts of data science and machine learning including python and statistics,
enough time was spent on understanding the concept behind each algorithm and examples and case
studies were done. Built some mid-scaled machine learning models using supervised and unsupervised
learning.
Software Engineering Intern , Proxbotics Creations Technologies , Jaipur (May,2016 – July,2016)
Developed and optimized various projects including ecommerce, booking & reservation, non-profit
organization Websites, using technologies: HTML, CSS, PHP, JavaScript, MySQL etc.
Trainee at TecheduSoft , Kota (May,2015)
The course contains 15+ modules including Android Basics, fragments, screen designing, intents, various
views, signing app, web servers, web services, notifications, etc.
PROJECTS
All projects are available on git: https://github.com/JAIJANYANI
Video Analysis for surveillance
-A command line app which takes all your CCTV feeds as input and filters feeds with abnormal events
which results in 90% less videos to watch, Used image processing and deep learning algorithms,
outputs all time-stamps of interesting events for all feeds.
Food Calorie Estimator
-An android app to estimate calories present in food with still image. Trained own Data-set (Meal-net)
using Transfer learning Built upon Inception V3, Proposed a Deep Convolutional Neural Network (CNN)
with 48 Layers, Developed a REST API to integrate it in Mobile apps, Optimized total computation time
~ 2 Seconds.
CryptoCurrency Market Predictor
- A Flask app to predict the future prices of various Crypto Currencies, implemented various supervised
and deep learning algorithms such as LSTM (RNN), polynomial regression, using scikit-learn, tensorflow,
keras etc.
Spam Filter
-A REST API to Detect Incoming SMS or Email as Spam or Ham which can be trained on your own data
set. Used NLP with Naive Bayes for Sentiment Analysis.
Image Classifier using CNN
-An application which detects objects present in a still image, implemented convolutional neural
network using open source machine learning library which can be run on multiple machines to reduce
training workloads, classifies objects using pre-trained image-net model.
Online Student and Faculty Portal
-A Web Portal to manage attendance of students and faculties, can be integrated to mobile apps. Uses
Php, MySQL, HTML, CSS, JavaScript, etc.
Tax Accounting
-A Decentralized web app built on Ethereum Block-Chain using Truffle and Embark framework, which
can be used to transfer funds between accounts which automatically deducts tax from the account.
TECHNICAL SKILLS
Programming Languages
Web Technologies
Scripting Languages
Database Management System
Operating Systems
Strongest Areas
COURSES
:
:
:
:
:
C, C++
HTML,CSS
Python, PHP, BASH
MySQL, SQLite
Microsoft Windows, Linux, UNIX
:
Machine Learning, Data Science
Applied Machine Learning , Applied Data Science , Exploratory Data Analysis & Data Visualization , Neural
Networks & Deep Learning , Computer networks , Data Structures & Algorithms , Operating Systems , Cloud
Computing , Data Mining , Block chain Essentials , Database Management Systems.
EDUCATION
University College of Engineering , Kota : Btech (Pursuing) in Computer Science Engineering (2018)
St. Edmunds School , Jaipur : Senior Secondary (XII) Education Rajasthan (2012)
St. Edmunds School , Jaipur : Secondary (X) Education Rajasthan (2010)
Expected:
XYZ
8th Semester Undergraduate | Computer Science Engineering | UCE RTU, Kota
+91 1234567890 | ABCD#gmail.com | 7/108, Malviya Nagar Jaipur (302017)
SUMMARY
To seek an opportunity to apply my technology expertise along with my creative problem solving skills in an
innovative software company.
EXPERIENCE
-Machine Learning Engineering Intern , Forsk Technologies , Jaipur (May,2017 – July,2017)
Learned the foundational concepts of data science and machine learning including python and statistics,
enough time was spent on understanding the concept behind each algorithm and examples and case
studies were done. Built some mid-scaled machine learning models using supervised and unsupervised
learning.
-Software Engineering Intern , Proxbotics Creations Technologies , Jaipur (May,2016 – July,2016)
Developed and optimized various projects including ecommerce, booking & reservation, non-profit
organization Websites, using technologies: HTML, CSS, PHP, JavaScript, MySQL etc.
-Trainee at TecheduSoft , Kota (May,2015)
The course contains 15+ modules including Android Basics, fragments, screen designing, intents, various
views, signing app, web servers, web services, notifications, etc.
PROJECTS
All projects are available on git: https://github.com/JAIJANYANI
Video Analysis for surveillance
-A command line app which takes all your CCTV feeds as input and filters feeds with abnormal events
which results in 90% less videos to watch, Used image processing and deep learning algorithms,
outputs all time-stamps of interesting events for all feeds.
- Food Calorie Estimator
-An android app to estimate calories present in food with still image. Trained own Data-set (Meal-net)
using Transfer learning Built upon Inception V3, Proposed a Deep Convolutional Neural Network (CNN)
with 48 Layers, Developed a REST API to integrate it in Mobile apps, Optimized total computation time
~ 2 Seconds.
- CryptoCurrency Market Predictor
- A Flask app to predict the future prices of various Crypto Currencies, implemented various supervised
and deep learning algorithms such as LSTM (RNN), polynomial regression, using scikit-learn, tensorflow,
keras etc.
- Spam Filter
-A REST API to Detect Incoming SMS or Email as Spam or Ham which can be trained on your own data
set. Used NLP with Naive Bayes for Sentiment Analysis.
Image Classifier using CNN
-An application which detects objects present in a still image, implemented convolutional neural
network using open source machine learning library which can be run on multiple machines to reduce
training workloads, classifies objects using pre-trained image-net model.
Online Student and Faculty Portal
A Web Portal to manage attendance of students and faculties, can be integrated to mobile apps. Uses
Php, MySQL, HTML, CSS, JavaScript, etc.
Tax Accounting
-A Decentralized web app built on Ethereum Block-Chain using Truffle and Embark framework, which
can be used to transfer funds between accounts which automatically deducts tax from the account.
TECHNICAL SKILLS
Programming Languages
Web Technologies
Scripting Languages
Database Management System
Operating Systems
Strongest Areas
COURSES
:
:
:
:
:
C, C++
HTML, CSS
Python, PHP, BASH
MySQL, SQLite
Microsoft Windows, Linux, UNIX
:
Machine Learning, Data Science
Applied Machine Learning , Applied Data Science , Exploratory Data Analysis & Data Visualization , Neural
Networks & Deep Learning , Computer networks , Data Structures & Algorithms , Operating Systems , Cloud
Computing , Data Mining , Block chain Essentials , Database Management Systems.
EDUCATION
University College of Engineering , Kota : Btech (Pursuing) in Computer Science Engineering (2018)
St. Edmunds School , Jaipur : Senior Secondary (XII) Education Rajasthan (2012)
St. Edmunds School , Jaipur : Secondary (X) Education Rajasthan (2010)
I'm using the below code for the conversion:
import io
import docx2txt
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def extract_text_from_pdf(pdf_path):
    '''
    Helper function to extract the plain text from .pdf files
    '''
    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
            resource_manager = PDFResourceManager()
            fake_file_handle = io.StringIO()
            converter = TextConverter(resource_manager, fake_file_handle, codec='utf-8', laparams=LAParams())
            page_interpreter = PDFPageInterpreter(resource_manager, converter)
            page_interpreter.process_page(page)
            text = fake_file_handle.getvalue()
            yield text
            # These run when the generator is resumed for the next page
            converter.close()
            fake_file_handle.close()

def extract_text_from_doc(doc_path):
    '''
    Helper function to extract plain text from .doc or .docx files
    '''
    temp = docx2txt.process(doc_path)
    # Empty lines are already dropped here by the "if line" filter
    text = [line.replace('\t', ' ') for line in temp.split('\n') if line]
    return ' '.join(text)

def extract_text(file_path, extension):
    '''
    Wrapper function to detect the file extension and call text extraction function accordingly
    '''
    text = ''
    if extension == '.pdf':
        for page in extract_text_from_pdf(file_path):
            text += ' ' + page
    elif extension == '.docx' or extension == '.doc':
        text = extract_text_from_doc(file_path)
    print(text)
    return text
I suggest using a regular expression:
import re
rc = re.compile(r"\n{2,}")  # two or more consecutive newlines
string_without_extra_new_lines = re.sub(rc, "\n", string_with_extra_blank_lines)
Having this as an example:
XYZ
8th Semester Undergraduate | Computer Science Engineering | UCE RTU, Kota
+91 1234567890 | ABCD#gmail.com | 7/108, Malviya Nagar Jaipur (302017)
SUMMARY
To seek an opportunity to apply my technology expertise along with my creative problem solving skills in an
innovative software company.
EXPERIENCE
This is the output:
XYZ
8th Semester Undergraduate | Computer Science Engineering | UCE RTU, Kota
+91 1234567890 | ABCD#gmail.com | 7/108, Malviya Nagar Jaipur (302017)
SUMMARY
To seek an opportunity to apply my technology expertise along with my creative problem solving skills in an
innovative software company.
EXPERIENCE
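One caveat with the `\n{2,}` pattern: it only collapses runs of bare newline characters, so a "blank" line that still contains spaces, tabs, or a form feed would survive. Whether the asker's pdfminer output contains such lines is an assumption, but a line-based filter covers both cases. A minimal sketch, where the function name `remove_blank_lines` and the sample string are made up for illustration:

```python
def remove_blank_lines(text):
    """Drop every line that is empty or contains only whitespace."""
    # str.splitlines() splits on \n, \r\n and also on form feeds (\x0c);
    # line.strip() is empty for whitespace-only lines, so they are skipped.
    kept = [line for line in text.splitlines() if line.strip()]
    return "\n".join(kept)


# Hypothetical sample: one empty line, one whitespace-only line, a trailing form feed
sample = "XYZ\n\n8th Semester Undergraduate\n \t\nSUMMARY\n\x0c"
cleaned = remove_blank_lines(sample)
print(cleaned)
```

In the asker's code this could be applied to the string returned by `extract_text` before it is written to the text file.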
Related
dataset to use for question formation from any text
I am trying to create an improved quiz generator that accepts a text as input and forms questions from its sentences. I want to create a machine learning model that splits a sentence into parts so that it can form several different questions from the same sentence. For example, the sentence "Amazon river is the longest river in South America." should yield questions such as: What is the longest river in South America? Is Amazon river the longest river in South America? Where is Amazon river located? If possible, I would also like it to pick up context from multiple sentences and form one question from the combined information. It should perform well on any text, not just a specific topic. How should I build my dataset, or which existing dataset should I use? I don't have much previous knowledge of the topic, so I was thinking of somehow using nltk.pos_tag(), which tags every word in a sentence with its part of speech. I am just not sure how to use it in my model and dataset.
What you're attempting to do is non-trivial and relates to the task of Automatic Question Generation (AQG), which looks at converting structured or unstructured declarative natural language sentences into valid interrogative forms. Various automated linguistic (rules-based) and statistical methods have been employed. I'd recommend reading [1] by Blšták & Rozinajová, particularly Section 2, which summarises some of the datasets and methods available. The survey by Lu & Lu [2] provides a recent overview of the field. The most common approach seems to be to leverage existing QA datasets (e.g. SQuAD, HotpotQA, et cetera; see Table 5 of [2]).
For a more practical, quick way to get started without having to train your own ML/DL model, you could use an existing Transformer-based model from HuggingFace such as iarfmoose/t5-base-question-generator, which takes concatenated answers and context as an input sequence, e.g.:
<answer> answer text here <context> context text here
and generates a full question (interrogative) sentence as an output sequence. According to the author, it is recommended that a large number of sequences be generated and then filtered with iarfmoose/bert-base-cased-qa-evaluator.
References
[1] Blšták, M. and Rozinajová, V., 2022. Automatic question generation based on sentence structure analysis using machine learning approach. Natural Language Engineering, 28(4), pp.487-517.
[2] Lu, C.Y. and Lu, S.E., 2021, October. A Survey of Approaches to Automatic Question Generation: from 2019 to Early 2021. In Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing (ROCLING 2021) (pp. 151-162).
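As a rough illustration of the input format described in this answer, the sketch below builds the `<answer> ... <context> ...` sequence and shows one way it might be passed to the model through the `transformers` library. The helper names `build_qg_input` and `generate_questions`, and the generation settings (beam count, length limits), are assumptions made for this example and are not taken from the model author's documentation. Calling `generate_questions` requires `transformers`, a backend such as PyTorch, and a model download.

```python
MODEL_NAME = "iarfmoose/t5-base-question-generator"


def build_qg_input(answer, context):
    """Concatenate answer and context in the format quoted in the answer."""
    return f"<answer> {answer} <context> {context}"


def generate_questions(answer, context, num_questions=5):
    """Generate candidate questions (hypothetical helper, settings are assumptions)."""
    # Imported lazily so the formatting helper works without transformers installed
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
    inputs = tokenizer(
        build_qg_input(answer, context),
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    # Beam search returning several candidates, to be filtered afterwards
    outputs = model.generate(
        **inputs,
        max_length=64,
        num_beams=num_questions,
        num_return_sequences=num_questions,
    )
    return [tokenizer.decode(seq, skip_special_tokens=True) for seq in outputs]


qg_input = build_qg_input(
    "South America",
    "Amazon river is the longest river in South America.",
)
print(qg_input)
```

The candidates returned by `generate_questions` would then be the sequences to filter with the evaluator model mentioned in the answer.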
How to make or get corpus of financial documents
I am working on a document classification problem for financial reports/documents. Is there a ready-made corpus for this? I found a couple of use cases, but they all built their own corpus.
You will more than likely have to create your own corpus. I had a similar task, and manually creating such a corpus would have been too tedious. As a result I created News Corpus Builder, a Python module that lets you quickly develop a corpus based on your particular topics of interest. The module allows you to generate your own corpus and store the text and associated label in sqlite or as flat files.
from news_corpus_builder import NewsCorpusGenerator
# Location to save generated corpus
corpus_dir = '/Users/skillachie/finance_corpus'
# Save results to sqlite or files per article
ex = NewsCorpusGenerator(corpus_dir,'sqlite')
# Retrieve 50 links related to the search term dogs and assign a category of Pet to the retrieved links
links = ex.google_news_search('dogs','Pet',50)
# Generate and save corpus
ex.generate_corpus(links)
More details are on my blog. The finance corpus is available for download. The corpus has the following categories:
- Policy (licenses, regulation, SEC, monetary, fed, monetary, fiscal, imf)
- International Finance (global finance, IMF, ECB, trouble in Greece, RMB devaluation)
- Economy (GDP, jobs, unemployment, housing, economy)
- Raising Capital (ipo, equity)
- Real Estate
- Mergers & Acquisitions (merger, acquisitions)
- Oil (oil, oil prices, natural gas price)
- Commodities (commodities, gold, silver)
- Fraud (insider trading, ponzi scheme, finance fraud)
- Litigation (company litigation, company settlement)
- Earning Reports
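To make the idea of storing "the text and associated label in sqlite" concrete, here is a minimal sketch that uses only Python's standard `sqlite3` module. It is not News Corpus Builder's actual storage code: the table name `articles`, its columns, the helper names, and the placeholder rows are all assumptions made for illustration.

```python
import sqlite3


def save_corpus(conn, documents):
    """Store (category, body) pairs in a hypothetical `articles` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "id INTEGER PRIMARY KEY, category TEXT NOT NULL, body TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO articles (category, body) VALUES (?, ?)", documents
    )
    conn.commit()


def load_category(conn, category):
    """Return all article bodies stored under one category label."""
    rows = conn.execute(
        "SELECT body FROM articles WHERE category = ? ORDER BY id", (category,)
    )
    return [body for (body,) in rows]


# In-memory database with placeholder documents (invented for this example)
conn = sqlite3.connect(":memory:")
save_corpus(conn, [
    ("Oil", "Placeholder text about oil prices."),
    ("Fraud", "Placeholder text about an insider trading case."),
    ("Oil", "Placeholder text about natural gas prices."),
])
oil_docs = load_category(conn, "Oil")
print(len(oil_docs), "documents labelled Oil")
```

Keeping the label next to the text in one table makes it straightforward to pull out per-category training sets for a document classifier.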
You can use the Reuters-21578 corpus: http://www.daviddlewis.com/resources/testcollections/reuters21578/ It is a basic corpus for text classification.
Gender Detection by audio
I've been searching everywhere for some form of gender detection that works by reading the frequency data of an audio file. I've had no luck finding a program that can do that, or even anything that can output audio data so that I can write a basic program to read and manipulate it to determine the gender of the speaker. Do any of you know where I can find something to help me with this? To reiterate, I basically want a program that, when a person talks into a microphone, reports the gender of the speaker with a fair amount of precision. My full plan is to also have a speech-to-text feature, so the program will write out what the speaker said and give some extremely basic demographics on the speaker. Preferably in a common scripting language that's cross-platform or supported on Linux.
Although this is an old question, in case someone is still interested in gender detection from audio: you can do this by extracting MFCC (Mel-frequency cepstral coefficient) features and modelling them with a GMM (Gaussian mixture model). This tutorial implements that approach and evaluates it on a gender-labelled subset extracted from Google's AudioSet:
https://appliedmachinelearning.wordpress.com/2017/06/14/voice-gender-detection-using-gmms-a-python-primer/
You're going to want to look into formant detection and linear predictive coding. Here's a paper that has some signal flow diagrams that could be ported over to scipy/numpy.
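Neither the MFCC/GMM pipeline nor formant analysis fits in a few lines, so the sketch below swaps in a much cruder technique: estimating the fundamental frequency (pitch) of a short frame by autocorrelation and comparing it with a threshold. This is not what either answer proposes. The 165 Hz threshold is an assumption based on commonly cited typical ranges (roughly 85 to 155 Hz for adult male voices and 165 to 255 Hz for adult female voices), real voices overlap, and the test signals here are synthetic sine waves rather than speech. All function names are made up for this example.

```python
import math


def estimate_pitch_hz(samples, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency of a mono frame by autocorrelation."""
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(samples) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with a copy of itself shifted by `lag` samples
        score = sum(a * b for a, b in zip(samples, samples[lag:]))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None


def guess_gender_from_pitch(pitch_hz, threshold_hz=165.0):
    """Crude heuristic; the threshold is an assumption, not a validated value."""
    return "female" if pitch_hz >= threshold_hz else "male"


def sine_wave(freq_hz, sample_rate=8000, n_samples=2000):
    """Synthetic stand-in for a voiced speech frame."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n_samples)]


low_pitch = estimate_pitch_hz(sine_wave(120.0), 8000)
high_pitch = estimate_pitch_hz(sine_wave(220.0), 8000)
print(low_pitch, guess_gender_from_pitch(low_pitch))
print(high_pitch, guess_gender_from_pitch(high_pitch))
```

For anything beyond a demonstration, the feature-based approaches in the answers (MFCCs with a GMM, or formants via linear predictive coding) are the ones to pursue, since pitch alone separates speakers poorly.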
Sources of classified sentiment data?
I'm looking to train a naive Bayes classifier with some new data sources that haven't been used before. I've already looked at the Pang & Lee corpus of IMDB reviews and the MPQA opinion corpus. I'm looking for new web services that fit the following criteria:
- Easily classified: must have a like/dislike or 5-star rating
- Readily available
- Pertain to new material (less important than the first two)
Here are some samples I have come up with on my own:
- Etsy API
- Rotten Tomatoes API
- Yelp API
Any other suggestions would be much appreciated =)
Pang & Lee's later work, "Opinion Mining and Sentiment Analysis" (2008), has a section on publicly available resources, with links to those corpora.
Take a look at sentiment140. It has a corpus that you can download and train with, and you can easily extend it to new tweets.
rapidminer and sentiment analysis
Has anyone out there used RapidMiner for sentiment analysis? Is this the right combination? If not, how do I get started with a simple sentiment analysis application?
RapidMiner is a very powerful text mining and sentiment analysis tool. I can recommend the RapidMiner training courses offered by Rapid-I. They gave me a really quick start. They also offer a dedicated course on text mining and sentiment analysis: Sentiment Analysis, Opinion Mining, and Automated Market Research. Starting in September or October 2009, they will also offer webinars. You should contact them directly if you would like to learn more about their webinars.
Several major online market research companies in Europe and the US are using RapidMiner for opinion mining and sentiment analysis of internet discussion groups and web blogs. For more details and references I would again suggest simply asking their team at contact(at)rapid-i.com or checking their RapidMiner forum at forum.rapid-i.com.
Best regards, Frank
This series of videos should help: http://vancouverdata.blogspot.com/2010/11/text-analytics-with-rapidminer-loading.html
The RapidMiner site confuses me: http://rapidminer.com/solutions/sentiment-analysis/ "It looks like a crowd sourcing to identify the polarity of product reviews and discussions around the web." If you are looking to automate this in real time, it might not be a good solution. spotdy.com offers free NLP for developers, and it works pretty well. Most sentiment analysis software tokenizes the text, assigns each word a positive or negative factor, and sums those up. Since language is contextual, this ignores the context, which is not the right way to do it. A better alternative is a deep learning model or an HMM based on sentence structure, which computes the sentiment from how the words are composed in a sentence. Check out spotdy.com. It is free.
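The "tokenize, assign a factor per word, and sum" approach described in this answer, and its blindness to context, can be shown in a few lines. This is only a toy sketch: the lexicon, its weights, and the function name `lexicon_score` are invented for this example and do not come from any particular sentiment tool.

```python
import re

# Toy sentiment lexicon; words and weights are invented for this example
LEXICON = {
    "good": 1, "great": 2, "love": 2,
    "bad": -1, "terrible": -2, "hate": -2,
}


def lexicon_score(sentence):
    """Sum the per-word sentiment factors, ignoring word order and context."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return sum(LEXICON.get(token, 0) for token in tokens)


plain = lexicon_score("The food was great")
negated = lexicon_score("The food was not great")
print(plain, negated)  # both print as 2: the negation is invisible to the sum
```

Because "not" carries no weight of its own and the sum ignores word order, both sentences receive the same positive score, which is the limitation the answer attributes to this family of methods.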