Add a space in lambda function - pandas-groupby

I have 2 columns in my data frame - ASIN and keywords . I am trying to groupby ASINs , the groupby is working fine
ASIN keywords
0 B07GFGXMZZ mangalagiri dress materials
1 B07GFGXMZZ pure cotton dress materials for women
2 B07GFGXMZZ suit material party wear for women
3 B076BL4CWB dhakai jamdani
4 B076BL4CWB jamdani
Groupby Code
df.groupby('ASIN').apply(lambda x: x.sum())
Output
but how to add a space for each lambda iteration as of now it is not doing , you can observe the same in output image i have linked
i Tried
df.groupby('ASIN').apply(lambda x:" ".join(x.sum()))
But it didn't work
ASIN
9801321261 98013212619801321261 cane mat with runnercane ...
B008YLNICE B008YLNICEB008YLNICEB008YLNICEB008YLNICEB008YL...
B00P81OJ26 B00P81OJ26B00P81OJ26B00P81OJ26B00P81OJ26B00P81...
B010SZBHEE B010SZBHEEB010SZBHEEB010SZBHEEB010SZBHEEB010SZ...
B01143KAY2 B01143KAY2B01143KAY2B01143KAY2B01143KAY2B01143...
B0157XMD4A B0157XMD4A elephant painted box
B0157XMRJ6 B0157XMRJ6B0157XMRJ6B0157XMRJ6B0157XMRJ6B0157X...

Related

In python, headings not in the same row

I extracted three columns from a larger data frame (recent_grads) as follows...
df = recent_grads.groupby('Major_category')['Men', 'Women'].sum()
However, when I print df, it comes up as follows...
Men Women
Major_category
Agriculture & Natural Resources 40357.0 35263.0
Arts 134390.0 222740.0
Biology & Life Science 184919.0 268943.0
Business 667852.0 634524.0
Communications & Journalism 131921.0 260680.0
Computers & Mathematics 208725.0 90283.0
Education 103526.0 455603.0
Engineering 408307.0 129276.0
Health 75517.0 387713.0
Humanities & Liberal Arts 272846.0 440622.0
Industrial Arts & Consumer Services 103781.0 126011.0
Interdisciplinary 2817.0 9479.0
Law & Public Policy 91129.0 87978.0
Physical Sciences 95390.0 90089.0
Psychology & Social Work 98115.0 382892.0
Social Science 256834.0 273132.0
How do I get Major_category heading in the same row as Men and Women headings? I tried to put the three columns in a new data frame as follows...
df1 = df[['Major_category', 'Men', 'Women']].copy()
This gives me an error (Major_category not in index)
Hi man you should try reset_index https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html:
df = df.groupby('Major_category')['Men', 'Women'].sum()
# Print the output.
md = df.reset_index()
print(md)
Seems like you want to convert the groupby object back to a dataframe try:
df['Major_category'].apply(pd.DataFrame)

Cosine Similarity of rows based on selected pandas columns

I am trying to make an item-item based movie recommender. In the movies dataset I have meta data about movies such as title, genres, directors, actors, producers, writers, year_of_release etc. Currently I am calculating the similarity based on the genre column only using tf-idf vectorizer by splitting the genre column into a list and it is working completely fine. Here is the code I am using:
def vector_cosine(df, index):
tf = TfidfVectorizer(analyzer='word',ngram_range=(1, 2),min_df=0, stop_words='english')
tfidf_matrix = tf.fit_transform(movies['genres'])
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
return cosine_sim[index]
# Function that get movie recommendations based on the cosine similarity score of movie genres
def genre_recommendations(title, idx):
sim_scores = list(enumerate(vector_cosine(movies,idx)))
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
sim_scores = sim_scores[1:6]
movie_indices = [i[0] for i in sim_scores]
recommendations = list(titles.iloc[movie_indices])
if title in recommendations: recommendations.remove(title)
return recommendations
and this is the sample data:
{"movie_id":"217d1207-effc-4605-bdef-899339615fe6","title":"Mamma Mia! Here We Go Again","release_date":"2018","actor_names":["Amanda Seyfried","Andy Garc\u00eda","Cher","Christine Baranski","Colin Firth","Dominic Cooper","Jessica Keenan Wynn","Julie Walters","Lily James","Meryl Streep","Pierce Brosnan","Stellan Skarsg\u00e5rd"],"director_names":["Ol Parker"],"producer_names":["Gary Goetzman","Judy Craymer"],"genres":"['Comedy', 'Music', 'Romance']","rating_aus":"PG","rating_nzl":"PG","rating_usa":null} {"movie_id":"59a0a1cd-5b83-401d-9b30-caa2c7e4f46c","title":"The Spy Who Dumped Me","release_date":"2018","actor_names":["Carolyn Pickles","Fred Melamed","Gillian Anderson","Hasan Minhaj","Ivanna Sakhno","James Fleet","Jane Curtin","Justin Theroux","Kate McKinnon","Mila Kunis","Paul Reiser","Sam Heughan"],"director_names":["Susanna Fogel"],"producer_names":null,"genres":"['Action', 'Comedy']","rating_aus":null,"rating_nzl":"R16","rating_usa":null} {"movie_id":"35e19192-6fb0-4c2b-9591-0d70deb30db6","title":"Beirut","release_date":"2018","actor_names":["Alon Aboutboul","Dean Norris","Douglas Hodge","Jon Hamm","Jonny Coyne","Kate Fleetwood","Larry Pine","Le\u00efla Bekhti","Mark Pellegrino","Rosamund Pike","Shea Whigham","Sonia Okacha"],"director_names":["Brad Anderson"],"producer_names":["Mike Weber","Monica Levinson","Shivani Rawat","Ted Field","Tony Gilroy"],"genres":"['Action', 'Thriller']","rating_aus":"MA15+","rating_nzl":"M","rating_usa":null} {"movie_id":"6e3c57d8-51c9-4b20-afe7-e09d949b2d7f","title":"Mile 22","release_date":"2018","actor_names":["Alexandra Vino","Iko Uwais","John Malkovich","Lauren Cohan","Lauren Mary Kim","Mark Wahlberg","Nikolai Nikolaeff","Poorna Jagannathan","Ronda Jean Rousey","Sala Baker","Sam Medina","Terry Kinney"],"director_names":["Peter Berg"],"producer_names":["Mark Wahlberg","Peter Berg","Stephen Levinson"],"genres":"['Action']","rating_aus":"MA15+","rating_nzl":"R16","rating_usa":null} {"movie_id":"cde85555-c944-44ef-a3b1-a5bf842646ad","title":"The Meg","release_date":"2018","actor_names":["Cliff Curtis","James Gaylyn","Jason Statham","Jessica McNamee","Li Bingbing","Masi Oka","Page Kennedy","Rainn Wilson","Robert Taylor","Ruby Rose","Tawanda Manyimo","Winston Chao"],"director_names":["Jon Turteltaub"],"producer_names":["Belle Avery","Colin Wilson","Lorenzo di Bonaventura"],"genres":"['Action', 'Science Fiction', 'Thriller']","rating_aus":"M","rating_nzl":"M","rating_usa":null} {"movie_id":"d1496722-2713-4b29-a323-18a0e2a0d6f3","title":"Monster's Ball","release_date":"2001","actor_names":["Amber Rules","Billy Bob Thornton","Charles Cowan Jr.","Coronji Calhoun","Gabrielle Witcher","Halle Berry","Heath Ledger","Peter Boyle","Sean Combs","Taylor LaGrange","Taylor Simpson","Yasiin Bey"],"director_names":["Marc Forster"],"producer_names":["Lee Daniels"],"genres":"['Drama', 'Romance']","rating_aus":null,"rating_nzl":"R16","rating_usa":null} {"movie_id":"844b309e-34ef-47bf-b981-eadc1d915886","title":"How to Be a Latin Lover","release_date":"2017","actor_names":["Anne McDaniels","Eugenio Derbez","Kristen Bell","Mckenna Grace","Michael Cera","Michaela Watkins","Raquel Welch","Rob Corddry","Rob Huebel","Rob Lowe","Rob Riggle","Salma Hayek"],"director_names":["Ken Marino"],"producer_names":null,"genres":"['Comedy']","rating_aus":"M","rating_nzl":"R13","rating_usa":null} {"movie_id":"991e3711-9918-41a0-b660-3c53ffa4901c","title":"Good Fortune","release_date":"2016","actor_names":null,"director_names":["Joshua Tickell","Rebecca Harrell Tickell"],"producer_names":null,"genres":"['Documentary']","rating_aus":"PG","rating_nzl":null,"rating_usa":null} {"movie_id":"7bf1935e-7d1a-437c-a71a-b4c70eb4f853","title":"Paper Heart","release_date":"2009","actor_names":["Charlyne Yi","Gill Summers","Jake Johnson","Martin Starr","Michael Cera","Seth Rogen"],"director_names":["Nicholas Jasenovec"],"producer_names":null,"genres":"['Comedy', 'Drama', 'Romance']","rating_aus":"M","rating_nzl":"M","rating_usa":null} {"movie_id":"4cd0e423-acde-4bf7-bf93-f77777a4de6f","title":"Daybreakers","release_date":"2009","actor_names":["Claudia Karvan","Emma Randall","Ethan Hawke","Harriet Minto","Day","Isabel Lucas","Jay Laga'aia","Michael Dorman","Mungo McKay","Sam Neill","Tiffany Lamb","Vince Colosimo","Willem Dafoe"],"director_names":["Michael Spierig","Peter Spierig"],"producer_names":["Bryan Furst","Chris Brown","Sean Furst","Todd Fellman"],"genres":"['Action', 'Fantasy', 'Horror', 'Science Fiction']","rating_aus":"MA15+","rating_nzl":"R16","rating_usa":null} {"movie_id":"c0a84525-46a4-4977-86af-7fc8cb014683","title":"Requiem for a Dream","release_date":"2000","actor_names":["Charlotte Aronofsky","Christopher McDonald","Ellen Burstyn","Janet Sarno","Jared Leto","Jennifer Connelly","Joanne Gordon","Louise Lasser","Marcia Jean Kurtz","Mark Margolis","Marlon Wayans","Suzanne Shepherd"],"director_names":["Darren Aronofsky"],"producer_names":["Eric Watson","Palmer West"],"genres":"['Crime', 'Drama']","rating_aus":null,"rating_nzl":"R18","rating_usa":null} {"movie_id":"fe5367fe-b558-4fbe-872f-ab041ef58213","title":"Grizzly Man","release_date":"2005","actor_names":["David Letterman","Jewel Palovak","Kathleen Parker","Sam Egli","Timothy Treadwell","Warren Queeney","Werner Herzog","Willy Fulton"],"director_names":["Werner Herzog"],"producer_names":["Erik Nelson"],"genres":"['Documentary']","rating_aus":"M","rating_nzl":null,"rating_usa":null} {"movie_id":"6bd66f6b-2834-4013-884f-7eaf257e09fb","title":"The Great Buck Howard","release_date":"2008","actor_names":["Adam Scott","Colin Hanks","Debra Monk","Emily Blunt","Griffin Dunne","John Malkovich","Jonathan Ames","Patrick Fischler","Ricky Jay","Steve Zahn","Tom Hanks","Wallace Langham"],"director_names":["Sean McGinly"],"producer_names":["Gary Goetzman","Tom Hanks"],"genres":"['Comedy', 'Drama']","rating_aus":"G","rating_nzl":"G","rating_usa":null} {"movie_id":"78b22fc2-5069-40da-b4ac-790ec3902a32","title":"Boo! A Madea Halloween","release_date":"2016","actor_names":["Andre Hall","Bella Thorne","Brock O'Hurn","Cassi Davis","Diamond White","Jimmy Tatro","Kian Lawley","Lexy Panterra","Liza Koshy","Patrice Lovely","Tyler Perry","Yousef Erakat"],"director_names":["Tyler Perry"],"producer_names":null,"genres":"['Comedy', 'Drama', 'Horror']","rating_aus":"M","rating_nzl":null,"rating_usa":null} {"movie_id":"bc5c3635-bbfb-4dd7-b0ed-408787fd5f43","title":"Fantastic Beasts and Where to Find Them","release_date":"2016","actor_names":["Alison Sudol","Carmen Ejogo","Colin Farrell","Dan Fogler","Eddie Redmayne","Ezra Miller","Johnny Depp","Jon Voight","Katherine Waterston","Ron Perlman","Samantha Morton","Zo\u00eb Kravitz"],"director_names":["David Yates"],"producer_names":["David Heyman","J.K. Rowling","Lionel Wigram","Steve Kloves"],"genres":"['Adventure', 'Family', 'Fantasy']","rating_aus":"M","rating_nzl":"M","rating_usa":null} {"movie_id":"1ec3a043-a6c4-44a0-9fb1-c948eb07cf85","title":"Silver Linings Playbook","release_date":"2012","actor_names":["Anupam Kher","Bonnie Aarons","Bradley Cooper","Brea Bee","Chris Tucker","Dash Mihok","Jacki Weaver","Jennifer Lawrence","John Ortiz","Julia Stiles","Robert De Niro","Shea Whigham"],"director_names":["David O. Russell"],"producer_names":["Bruce Cohen","Donna Gigliotti","Jonathan Gordon","Mark Kamine"],"genres":"['Comedy', 'Drama', 'Romance']","rating_aus":"M","rating_nzl":"M","rating_usa":null} {"movie_id":"4c0bbde5-7a34-4556-a481-3357ef69b651","title":"The Equalizer","release_date":"2014","actor_names":["Alex Veadov","Bill Pullman","Chlo\u00eb Grace Moretz","David Harbour","David Meunier","Denzel Washington","E. Roger Mitchell","Haley Bennett","Johnny Skourtis","Marton Csokas","Melissa Leo","Vladimir Kulich"],"director_names":["Antoine Fuqua"],"producer_names":["Alex Siskin","Denzel Washington","Jason Blumenthal","Mace Neufeld","Michael Sloan","Richard Wenk","Steve Tisch","Todd Black","Tony Eldridge"],"genres":"['Action', 'Crime', 'Thriller']","rating_aus":"MA15+","rating_nzl":"R18","rating_usa":null} {"movie_id":"809f3131-7445-4ffb-8b77-6de6699c85c4","title":"The Notebook","release_date":"2004","actor_names":["David Thornton","Ed Grady","Gena Rowlands","James Garner","James Marsden","Jennifer Echols","Joan Allen","Kevin Connolly","Rachel McAdams","Ryan Gosling","Sam Shepard","Starletta DuPois"],"director_names":["Nick Cassavetes"],"producer_names":["Lynn Harris","Mark Johnson"],"genres":"['Drama', 'Romance']","rating_aus":"PG","rating_nzl":"PG","rating_usa":null} {"movie_id":"49652b6d-b818-4fc4-9c57-9ec7b5c346cc","title":"The Matrix","release_date":"1999","actor_names":["Anthony Ray Parker","Belinda McClory","Carrie","Anne Moss","Gloria Foster","Hugo Weaving","Joe Pantoliano","Julian Arahanga","Keanu Reeves","Laurence Fishburne","Marcus Chong","Paul Goddard","Robert Taylor"],"director_names":["Lana Wachowski","Lilly Wachowski"],"producer_names":["Joel Silver"],"genres":"['Action', 'Science Fiction']","rating_aus":"M","rating_nzl":"M","rating_usa":null} {"movie_id":"d356a087-4e89-420b-867c-618544969302","title":"The Hunger Games","release_date":"2012","actor_names":["Alexander Ludwig","Donald Sutherland","Elizabeth Banks","Isabelle Fuhrman","Jennifer Lawrence","Josh Hutcherson","Lenny Kravitz","Liam Hemsworth","Stanley Tucci","Toby Jones","Wes Bentley","Woody Harrelson"],"director_names":["Gary Ross"],"producer_names":["Jon Kilik","Nina Jacobson"],"genres":"['Adventure', 'Fantasy', 'Science Fiction']","rating_aus":"M","rating_nzl":"M","rating_usa":null} {"movie_id":"0ef84c8c-ffc5-4896-9e3b-5303acba0ff3","title":"The Wolf of Wall Street","release_date":"2013","actor_names":["Brian Sacca","Henry Zebrowski","Jon Bernthal","Jon Favreau","Jonah Hill","Kenneth Choi","Kyle Chandler","Leonardo DiCaprio","Margot Robbie","Matthew McConaughey","P. J. Byrne","Rob Reiner"],"director_names":["Martin Scorsese"],"producer_names":["Emma Tillinger Koskoff","Joey McFarland","Leonardo DiCaprio","Martin Scorsese","Riza Aziz"],"genres":"['Comedy', 'Crime', 'Drama']","rating_aus":"R18+","rating_nzl":"R18","rating_usa":null}
What I want next is to calculate similarity based on multiple columns. Can you please guide me how can I achieve that.
You can use cosine_similarity from sklearn.metrics.pairwise like:
sim_df = cosine_similarity(df)
Then you can get the top 10 similar movies for a particular movie like:
sim_df[movie_index].argsort()[-10:][::-1]

Python get first and last value from string using dictionary key values

I have gotten a very strange data. I have dictionary with keys and values where I want to use this dictionary to search if these keywords are ONLY starting and/or end of the text not middle of the sentence. I tried to create simple data frame below to show the problem case and python codes that I have tried so far. How do I get it go search for only starting or ending of the sentence? This one searches whole text sub-strings.
Code:
d = {'apple corp':'Company','app':'Application'} #dictionary
l1 = [1, 2, 3,4]
l2 = [
"The word Apple is commonly confused with Apple Corp which is a business",
"Apple Corp is a business they make computers",
"Apple Corp also writes App",
"The Apple Corp also writes App"
]
df = pd.DataFrame({'id':l1,'text':l2})
df['text'] = df['text'].str.lower()
df
Original Dataframe:
id text
1 The word Apple is commonly confused with Apple Corp which is a business
2 Apple Corp is a business they make computers
3 Apple Corp also writes App
4 The Apple Corp also writes App
Code Tried out:
def matcher(k):
x = (i for i in d if i in k)
# i.startswith(k) getting error
return ';'.join(map(d.get, x))
df['text_value'] = df['text'].map(matcher)
df
Error:
TypeError: 'in <string>' requires string as left operand, not bool
when I use this x = (i for i in d if i.startswith(k) in k)
Empty values if i tried this x = (i for i in d if i.startswith(k) == True in k)
TypeError: sequence item 0: expected str instance, NoneType found
when i use this x = (i.startswith(k) for i in d if i in k)
Results from Code above ... Create new field 'text_value':
id text text_value
1 The word Apple is commonly confused with Apple Corp which is a business Company;Application
2 Apple Corp is a business they make computers Company;Application
3 Apple Corp also writes App Company;Application
4 The Apple Corp also writes App Company;Application
Trying to get an FINAL output like this:
id text text_value
1 The word Apple is commonly confused with Apple Corp which is a business NaN
2 Apple Corp is a business they make computers Company
3 Apple Corp also writes App Company;Application
4 The Apple Corp also writes App Application
You need a matcher function which can accept flag and then call that twice to get the results for startswith and endswith.
def matcher(s, flag="start"):
if flag=="start":
for i in d:
if s.startswith(i):
return d[i]
else:
for i in d:
if s.endswith(i):
return d[i]
return None
df['st'] = df['text'].apply(matcher)
df['ed'] = df['text'].apply(matcher, flag="end")
df['text_value'] = df[['st', 'ed']].apply(lambda x: ';'.join(x.dropna()),1)
df = df[['id','text', 'text_value']]
The text_value column looks like:
0
1 Company
2 Company;Application
3 Application
Name: text_value, dtype: object
joined = "|".join(d.keys())
pat = '(?i)^(?:the\\s*)?(' + joined + ')\\b.*?|.*\\b(' + joined + ')$'+'|.*'
get = lambda x: d.get(x.group(1),"") + (';' +d.get(x.group(2),"") if x.group(2) else '')
df.text.str.replace(pat,get)
0
1 Company
2 Company;Application
3 Company;Application
Name: text, dtype: object

unable to get rid of all emojis

I need help removing emojis. I looked at some other stackoverflow questions and this is what I am de but for some reason my code doesn't get rid of all the emojis
d= {'alexveachfashion': 'Fashion Style * Haute Couture * Wearable Tech * VR\n👓👜⌚👠\nSoundCloud is Live #alexveach\n👇New YouTube Episodes ▶️👇', 'andrewvng': 'Family | Fitness | Friends | Gym | Food', 'runvi.official': 'Accurate measurement via SMART insoles & real-time AI coaching. Improve your technique & BOOST your performance with every run.\nSoon on Kickstarter!', 'triing': 'Augmented Jewellery™️ • Montreal. Canada.', 'gedeanekenshima': 'Prof na Etec Albert Einstein, Mestranda em Automação e Controle de Processos, Engenheira de Controle e Automação, Técnica em Automação Industrial.', 'jetyourdaddy': '', 'lavonne_sun': '☄️🔭 ✨\n°●°。Visual Narrative\nA creative heart with a poetic soul.\n————————————\nPARSONS —— Design & Technology', 'taysearch': 'All the World’s Information At Your Fingertips. (Literally) Est. 1991🇺🇸 🎀#PrincessofSearch 🔎Sample 👇🏽 the Search Engine Here 🗽', 'hijewellery': 'Fine 3D printed jewellery for tech lovers #3dprintedjewelry #wearabletech #jewellery', 'yhanchristian': 'Estudante de Engenharia, Maker e viciado em café.', 'femka': 'Fashion Futurist + Fashion Tech Lab Founder #technoirlab + Fashion Designer / Parsons & CSM Grad / Obsessed with #fashiontech #future #cryptocurrency', 'sinhbisen': 'Creator, TRiiNG, augmented jewellery label ⭕️ Transhumanist ⭕️ Corporeal cartographer ⭕️', 'stellawearables': '#StellaWearables ✉️Info#StellaWearables.com Premium Wearable Technology That Monitors Personal Health & Environments ☀️🏝🏜🏔', 'ivoomi_india': 'We are the manufacturers of the most innovative technologies and user-friendly gadgets with a global presence.', 'bgutenschwager': "When it comes to life, it's all about the experience.\nGoogle Mapper 🗺\n360 Photographer 📷\nBrand Rep #QuickTutor", 'storiesofdesign': 'Putting stories at the heart of brands and businesses | Cornwall and London | #storiesofdesign', 'trume.jp': '草創期から国産ウオッチの製造に取り組み、挑戦を続けてきたエプソンが世界に放つ新ブランド「TRUME」(トゥルーム)。目指すのは、最先端技術でアナログウオッチを極めるブランド。', 'themarinesss': "I didn't choose the blog life, the blog life chose me | Aspiring Children's Book Author | www.slayathomemum.com", 'ayowearable': 'The world’s first light-based wearable that helps you sleep better, beat jet lag and have more energy! #goAYO Get yours at:', 'wearyourowntechs': 'Bringing you the latest trends, Current Products and Reviews of Wearable Technology. Discover how they can enhance your Life and Lifestyle', 'roxfordwatches': 'The Roxford | The most stylish and customizable fitness smartwatch. Tracks your steps/calories/dist/sleep. Comes with FOUR bands, and a travel case!', 'playertek': "Track your entire performance - every training session, every match. \nBecause the best players don't hide.", '_kate_hartman_': '', 'hmsmc10': 'Health & Wellness 🍎\nBoston, MA 🏙\nSuffolk MPA ‘17 🎓 \n.\nJust Strong Ambassador 🏋🏻\u200d♀️', 'gadgetxtreme': 'Dedicated to reviewing gadgets, technologies, internet products and breaking tech news. Follow us to see daily vblogs on all the disruptive tech..', 'freedom.journey.leader': '📍MN\n🍃Wife • Homeschooling Mom to 5 🐵 • D Y I lover 🔨 • Small town living in MN. 🌿 \n📧Ashleybp5#gmail.com \n#homeschool #bossmom #builder #momlife', 'arts_food_life': 'Life through my phone.', 'medgizmo': 'Wearable #tech: #health #healthcare #wellness #gadgets #apps. Images/links provided as information resource only; doesn’t mean we endorse referenced', 'sawearables': 'The home of wearable tech in South Africa!\n--> #WearableTech #WearableTechnology #FitnessTech Find your wearable #', 'shop.mercury': 'Changing the way you charge.⚡️\nGet exclusive product discounts, and help us reach our goal below!🔋', 'invisawear': 'PRE-ORDERS NOW AVAILABLE! Get yours 25% OFF here: #girlboss #wearabletech'}
for key in d:
print("---with emojis----")
print(d[key])
print("---emojis removed----")
x=''.join(c for c in d[key] if c <= '\uFFFF')
print(x)
output example
---with emojis----
📍MN
🍃Wife • Homeschooling Mom to 5 🐵 • D Y I lover 🔨 • Small town living in MN. 🌿
📧Ashleybp5#gmail.com
#homeschool #bossmom #builder #momlife
---emojis removed----
MN
Wife • Homeschooling Mom to 5 • D Y I lover • Small town living in MN.
Ashleybp5#gmail.com
#homeschool #bossmom #builder #momlife
---with emojis----
Changing the way you charge.⚡️
Get exclusive product discounts, and help us reach our goal below!🔋
---emojis removed----
Changing the way you charge.⚡️
Get exclusive product discounts, and help us reach our goal below!
There is no technical definition of what an "emoji" is. Various glyphs may be used to render printable characters, symbols, control characters and the like. What seems like an "emoji" to you may be part of normal script to others.
What you probably want to do is to look at the Unicode category of each character and filter out various categories. While this does not solve the "emoji"-definition-problem per se, you get much better control over what you are actually doing without removing, for example, literally all characters of languages spoken by 2/3 of the planet.
Instead of filtering out certain categories, you may filter everything except the lower- and uppercase letters (and numbers). However, be aware that ꙭ is not "the googly eyes emoji" but the CYRILLIC SMALL LETTER DOUBLE MONOCULAR O, which is a normal lowercase letter to millions of people.
For example:
import unicodedata
s = "🍃Wife • Homeschooling Mom to 5 🐵 • D Y I lover 🔨 • Small town living in MN. 🌿"
# Just filter category "symbol"
t = ''.join(c for c in s if unicodedata.category(c) not in ('So', ))
print(t)
...results in
Wife • Homeschooling Mom to 5 • D Y I lover • Small town living in MN.
This may not be emoji-free enough, yet the • is technically a form of punctuation. So filter this as well
# Filter symbols and punctuations. You may want 'Cc' as well,
# to get rid of control characters. Beware that newlines are a
# form of control-character.
t = ''.join(c for c in s if unicodedata.category(c) not in ('So', 'Po'))
print(t)
And you get
Wife Homeschooling Mom to 5 D Y I lover Small town living in MN

How do read a SEC txt-file into a pandas dataframe?

I am trying to use SEC (U.S. Security and Exchange Commision data). The SEC provides useful data in a txtformat. I am using
Financial Statement Data Sets for the second quarter of 2017. You can find the data I use here.
I try to read the txtfiles into a pandas dataframe. I tried it the following ways:
sub = pd.read_fwf('sub.txt')
sub_1 = pd.read_csv('sub.txt')
I get no error with using Pandas' read_fwf function - but the output is utter rubbish. Here is the head of the dataframe:
adsh cik name sic countryba stprba cityba zipba bas1 bas2 baph countryma stprma cityma zipma mas1 mas2 countryinc stprinc ein former changed afs wksi fye form period fy fp filed accepted prevrpt detail instance nciks aciks Unnamed: 1
0 0000002178-17-000038\t2178\tADAMS RESOURCES & ... NaN
1 0000002488-17-000107\t2488\tADVANCED MICRO DEV... NaN
I do get an error when using read_csv: Error tokenizing data. C error: Expected 2 fields in line 7, saw 3
Any ideas on how tor read the data into a pandas dataframe?
It looks like the files are tab separated - that's why you're seeing \t in the results. pandas read_csv defaults to comma separated values, so you have to change the separator. This is controlled by the sep parameter. In addition, you will need to provide the proper encoding (errors are thrown when trying to read the num, pre, and tag files). Generally ISO-8859-1 is a good choice.
#import pandas
import pandas as pd
#read in the .txt file and choose a separator and encoding standard
df = pd.read_csv('sub.txt', sep='\t', encoding='ISO-8859-1')
#output the results
print(df)
adsh cik name \
0 0000002178-17-000038 2178 ADAMS RESOURCES & ENERGY, INC.
1 0000002488-17-000107 2488 ADVANCED MICRO DEVICES INC
2 0000002969-17-000019 2969 AIR PRODUCTS & CHEMICALS INC /DE/
3 0000002969-17-000024 2969 AIR PRODUCTS & CHEMICALS INC /DE/
4 0000003499-17-000010 3499 ALEXANDERS INC
5 0000003545-17-000043 3545 ALICO INC
6 0000003570-17-000073 3570 CHENIERE ENERGY INC

Resources