Why isn't my parsing from JSON working in Python? - python-3.x
    import json

    with open('tweets.json') as json_data:
        data = json.load(json_data)
    print(data['text'])
I want to extract specific data/values, but I keep getting this error:

    print(data['text'])
    TypeError: string indices must be integers
I am a beginner with Python and I am trying to learn by using the Twitter API. This is my JSON:
"{\"created_at\":\"Wed Feb 03 03:02:04 +0000 2016\",\"id\":694717462621884416,\"id_str\":\"694717462621884416\",\"text\":\"Finallyy #taylorcaniff Happy bday bae, I love you soooo much, keep smiling, I'm so proud of everything you've done\\u2661 https:\\/\\/t.co\\/uwjeASxsA3\",\"source\":\"\\u003ca href=\\\"http:\\/\\/twitter.com\\/download\\/android\\\" rel=\\\"nofollow\\\"\\u003eTwitter for Android\\u003c\\/a\\u003e\",\"truncated\":false,\"in_reply_to_status_id\":null,\"in_reply_to_status_id_str\":null,\"in_reply_to_user_id\":null,\"in_reply_to_user_id_str\":null,\"in_reply_to_screen_name\":null,\"user\":{\"id\":1364125758,\"id_str\":\"1364125758\",\"name\":\"C o l l i n e r\",\"screen_name\":\"HoodsPizzaxJCat\",\"location\":\"2\\/5 UJ | The Vamps DM\",\"url\":null,\"description\":\"\\u25a8Issa liked x2 & follow\\u25a8Brent Follow, liked x4 &DM\\u25a8Chris liked x3 and follows\\u25a8Taylor, Kizzy, Jacob, Caspar, King B. & Momma Collins follow\\u25a8Trevor Liked \\u25a8\",\"protected\":false,\"verified\":false,\"followers_count\":12136,\"friends_count\":13282,\"listed_count\":20,\"favourites_count\":29245,\"statuses_count\":46864,\"created_at\":\"Fri Apr 19 10:59:10 +0000 2013\",\"utc_offset\":-10800,\"time_zone\":\"Buenos 
Aires\",\"geo_enabled\":true,\"lang\":\"es\",\"contributors_enabled\":false,\"is_translator\":false,\"profile_background_color\":\"09ED92\",\"profile_background_image_url\":\"http:\\/\\/pbs.twimg.com\\/profile_background_images\\/506799872326893569\\/vdaHWDTj.jpeg\",\"profile_background_image_url_https\":\"https:\\/\\/pbs.twimg.com\\/profile_background_images\\/506799872326893569\\/vdaHWDTj.jpeg\",\"profile_background_tile\":true,\"profile_link_color\":\"4CC74C\",\"profile_sidebar_border_color\":\"FFFFFF\",\"profile_sidebar_fill_color\":\"DDEEF6\",\"profile_text_color\":\"333333\",\"profile_use_background_image\":true,\"profile_image_url\":\"http:\\/\\/pbs.twimg.com\\/profile_images\\/688994368057921536\\/IKy-2UYn_normal.jpg\",\"profile_image_url_https\":\"https:\\/\\/pbs.twimg.com\\/profile_images\\/688994368057921536\\/IKy-2UYn_normal.jpg\",\"profile_banner_url\":\"https:\\/\\/pbs.twimg.com\\/profile_banners\\/1364125758\\/1450566712\",\"default_profile\":false,\"default_profile_image\":false,\"following\":null,\"follow_request_sent\":null,\"notifications\":null},\"geo\":null,\"coordinates\":null,\"place\":null,\"contributors\":null,\"is_quote_status\":false,\"retweet_count\":0,\"favorite_count\":0,\"entities\":{\"hashtags\":[],\"urls\":[],\"user_mentions\":[{\"screen_name\":\"taylorcaniff\",\"name\":\"Taylor 
Caniff\",\"id\":1396698397,\"id_str\":\"1396698397\",\"indices\":[9,22]}],\"symbols\":[],\"media\":[{\"id\":694717457911693312,\"id_str\":\"694717457911693312\",\"indices\":[116,139],\"media_url\":\"http:\\/\\/pbs.twimg.com\\/media\\/CaQh2OIWwAA6G_C.jpg\",\"media_url_https\":\"https:\\/\\/pbs.twimg.com\\/media\\/CaQh2OIWwAA6G_C.jpg\",\"url\":\"https:\\/\\/t.co\\/uwjeASxsA3\",\"display_url\":\"pic.twitter.com\\/uwjeASxsA3\",\"expanded_url\":\"http:\\/\\/twitter.com\\/HoodsPizzaxJCat\\/status\\/694717462621884416\\/photo\\/1\",\"type\":\"photo\",\"sizes\":{\"large\":{\"w\":480,\"h\":800,\"resize\":\"fit\"},\"thumb\":{\"w\":150,\"h\":150,\"resize\":\"crop\"},\"small\":{\"w\":340,\"h\":566,\"resize\":\"fit\"},\"medium\":{\"w\":480,\"h\":800,\"resize\":\"fit\"}}}]},\"extended_entities\":{\"media\":[{\"id\":694717457911693312,\"id_str\":\"694717457911693312\",\"indices\":[116,139],\"media_url\":\"http:\\/\\/pbs.twimg.com\\/media\\/CaQh2OIWwAA6G_C.jpg\",\"media_url_https\":\"https:\\/\\/pbs.twimg.com\\/media\\/CaQh2OIWwAA6G_C.jpg\",\"url\":\"https:\\/\\/t.co\\/uwjeASxsA3\",\"display_url\":\"pic.twitter.com\\/uwjeASxsA3\",\"expanded_url\":\"http:\\/\\/twitter.com\\/HoodsPizzaxJCat\\/status\\/694717462621884416\\/photo\\/1\",\"type\":\"photo\",\"sizes\":{\"large\":{\"w\":480,\"h\":800,\"resize\":\"fit\"},\"thumb\":{\"w\":150,\"h\":150,\"resize\":\"crop\"},\"small\":{\"w\":340,\"h\":566,\"resize\":\"fit\"},\"medium\":{\"w\":480,\"h\":800,\"resize\":\"fit\"}}}]},\"favorited\":false,\"retweeted\":false,\"possibly_sensitive\":false,\"filter_level\":\"low\",\"lang\":\"en\",\"timestamp_ms\":\"1454468524972\"}\r\n"
Your data is a JSON-encoded string, not a JSON object: the file contains one big quoted string (note the outer quotes and the escaped inner quotes), so json.load returns a Python str, and indexing that string with 'text' is what raises the TypeError. You have to decode it a second time:

    >>> json.loads(data)["text"]
    u"Finallyy #taylorcaniff Happy bday bae, I love you soooo much, keep smiling, I'm so proud of everything you've done\u2661 https://t.co/uwjeASxsA3"
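Since dumped tweet files often end up double-encoded like this, one defensive pattern (a minimal sketch; load_tweet is a made-up helper name, not part of any library) is to decode a second time whenever the first parse yields a string:

```python
import json

def load_tweet(raw):
    # A double-encoded payload decodes to a string on the first pass;
    # decode once more in that case to get the actual dict.
    data = json.loads(raw)
    if isinstance(data, str):
        data = json.loads(data)
    return data

# Simulate a double-encoded tweet like the file in the question
raw = json.dumps(json.dumps({"text": "hello", "id": 1}))
print(load_tweet(raw)["text"])  # -> hello
```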
Related
Avoiding a Cartesian product when adding a unique classifier to a list in Python 3
I have 5 .csv files I am importing, and all contain emails:

    Donors = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Donors Q1 2021 R12.csv", usecols=["Email Address"])
    Activists = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Activists Q1 2021 R12.csv", usecols=["Email"])
    Low_Level_Activists = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Low Level Activists Q1 2021 R12.csv", usecols=["Email"])
    Ambassadors = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Ambassadors Q1 2021.csv", usecols=["Email Address"])
    Volunteers = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Volunteers Q1 2021 R12.csv", usecols=["Email Address"])
    Followers = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Followers Q1 2021 R12.csv", usecols=["Email"])

While I am only importing emails (annoyingly, with two different naming conventions because of the systems they originate from), I am adding the import name as a classifier - i.e. Donors, Volunteers, etc.

    Donors['Value'] = "Donors"
    Activists['Value'] = "Activists"
    Low_Level_Activists['Value'] = "Low_Level_Activists"
    Ambassadors['Value'] = "Ambassadors"
    Volunteers['Value'] = "Volunteers"
    Advocates['Value'] = 'Followers'

I then concatenate all the files and handle the naming issue. I am sure there is a more elegant way to do this, but here's what I have:

    S1 = pd.concat([Donors, Activists, Low_Level_Activists, Ambassadors, Volunteers, Advocates], ignore_index=True)
    S1['Handle'] = S1['Email Address'].where(S1['Email Address'].notnull(), S1['Email'])
    S1 = S1.drop(['Email', 'Email Address'], axis=1)
    print(S1['Handle'].count())  # checks full count

The total on that last line is 166,749.

Here is my problem: I need to filter the emails for uniques - easy enough using .nunique() - but I also need to carry the classifier. So if a single email is a Donor but also an Activist, I pull both when I try to merge the unique values with the classifier.

I have been at this for many hours (and to the end of the Internet!) and can't seem to find a workable solution. I've tried dictionary for loops, merges, etc. ad infinitum. The unique email count is 165,923 (figured out via Python &/or Excel :( ). Essentially, I would want to pull the earliest classifier in my list on a match. So if an email is a Donor and an Activist -> call them a Donor. Or if an email is a Volunteer and a Follower -> call them a Volunteer on one email record. Any help would be greatly appreciated.
I'll give it a try with some made-up data:

    import pandas as pd

    fa = pd.DataFrame([['paul#mail.com', 'Donors'], ['max#mail.com', 'Donors']], columns=['Handle', 'Value'])
    fb = pd.DataFrame([['paul#mail.com', 'Activists'], ['annie#mail.com', 'Activists']], columns=['Handle', 'Value'])
    S1 = pd.concat([fa, fb])
    print(S1)

gives

               Handle      Value
    0   paul#mail.com     Donors
    1    max#mail.com     Donors
    0   paul#mail.com  Activists
    1  annie#mail.com  Activists

You can group by Handle and then pick any Value you like, e.g. the first:

    for handle, group in S1.groupby('Handle'):
        print(handle, group.reset_index().loc[0, 'Value'])

gives

    annie#mail.com Activists
    max#mail.com Donors
    paul#mail.com Donors

or collect all roles of a person:

    for handle, group in S1.groupby('Handle'):
        print(handle, group.Value.unique())

gives

    annie#mail.com ['Activists']
    max#mail.com ['Donors']
    paul#mail.com ['Donors' 'Activists']
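Since the asker specifically wants the earliest classifier in a fixed priority list to win on a duplicate email, another approach (a sketch with made-up data; the priority list mirrors the classifiers in the question) is to make Value an ordered categorical, sort by it, and drop duplicate handles:

```python
import pandas as pd

# Hypothetical combined frame, standing in for S1 in the question
S1 = pd.DataFrame({
    'Handle': ['paul#mail.com', 'max#mail.com', 'paul#mail.com', 'annie#mail.com'],
    'Value': ['Activists', 'Donors', 'Donors', 'Followers'],
})

# Priority order from the question: earlier entries win on a duplicate email
priority = ['Donors', 'Activists', 'Low_Level_Activists',
            'Ambassadors', 'Volunteers', 'Followers']
S1['Value'] = pd.Categorical(S1['Value'], categories=priority, ordered=True)

# Sort so the highest-priority classifier comes first,
# then keep exactly one row per email
dedup = (S1.sort_values('Value')
           .drop_duplicates('Handle', keep='first')
           .sort_index())
print(dedup)
```

Note that any Value not listed in priority becomes NaN after the Categorical conversion, so the list must cover every classifier present.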
How can I take the outer product of string vectors in J?
I'm trying to replicate the outer product notation in APL:

    ∘.,⍨ 'x1' 'y1' 'z1' 'x2' 'y2' 'z2' 'x3' 'y3' 'z3'

which yields

    x1x1 x1y1 x1z1 x1x2 x1y2 x1z2 x1x3 x1y3 x1z3
    y1x1 y1y1 y1z1 y1x2 y1y2 y1z2 y1x3 y1y3 y1z3
    z1x1 z1y1 z1z1 z1x2 z1y2 z1z2 z1x3 z1y3 z1z3
    x2x1 x2y1 x2z1 x2x2 x2y2 x2z2 x2x3 x2y3 x2z3
    y2x1 y2y1 y2z1 y2x2 y2y2 y2z2 y2x3 y2y3 y2z3
    z2x1 z2y1 z2z1 z2x2 z2y2 z2z2 z2x3 z2y3 z2z3
    x3x1 x3y1 x3z1 x3x2 x3y2 x3z2 x3x3 x3y3 x3z3
    y3x1 y3y1 y3z1 y3x2 y3y2 y3z2 y3x3 y3y3 y3z3
    z3x1 z3y1 z3z1 z3x2 z3y2 z3z2 z3x3 z3y3 z3z3

But I can't figure out how to do something similar in J. I found this Cartesian product in J post that I thought would be similar enough, but I just can't seem to translate it from an array of numbers to an array of strings. Adapting Dan Bron's answer therein and applying it to a simpler example,

    6 6 $ , > { 2 # < 'abc'

gives

    aaabac
    babbbc
    cacbcc
    aaabac
    babbbc
    cacbcc

which is almost what I want, but I don't know how to generalize it to 2-letter (or longer) strings in a similar fashion. I also don't know how to format the results with spaces between the pairs like the APL output, so it may not be the right path either. Similarly, I tried adapting Michael Berry's answer from that thread to get

    9 36 $ ,,"1/ ~ 9 2 $ 'x1y1z1x2y2z2x3y3z3'

which gives

    x1x1x1y1x1z1x1x2x1y2x1z2x1x3x1y3x1z3
    y1x1y1y1y1z1y1x2y1y2y1z2y1x3y1y3y1z3
    z1x1z1y1z1z1z1x2z1y2z1z2z1x3z1y3z1z3
    x2x1x2y1x2z1x2x2x2y2x2z2x2x3x2y3x2z3
    y2x1y2y1y2z1y2x2y2y2y2z2y2x3y2y3y2z3
    z2x1z2y1z2z1z2x2z2y2z2z2z2x3z2y3z2z3
    x3x1x3y1x3z1x3x2x3y2x3z2x3x3x3y3x3z3
    y3x1y3y1y3z1y3x2y3y2y3z2y3x3y3y3y3z3
    z3x1z3y1z3z1z3x2z3y2z3z2z3x3z3y3z3z3

Again, this is almost what I want, and this one handled the multiple characters, but there are still no spaces between the pairs, and the command is getting farther from the simplicity of the APL version. I can get the same results a bit more cleanly with ravel items:

    ,. ,"1/ ~ 9 2 $ 'x1y1z1x2y2z2x3y3z3'

I've been going through the J primer and exploring the parts that look relevant in the dictionary, but I'm still very new, so I apologize if this is a dumb question. I feel like the rank conjunction should be able to help me here, but I had a hard time following its explanation in the primer. I played with ": to try to format the strings to have trailing spaces, but I couldn't figure that out either. The fact that this was so easy in APL also makes me think I'm doing something very wrong in J to be having this much trouble. After reading more of the primer, I got something that looks like what I want with

    ,. 9 1 $ ' ' ,."2 ,"1/~ [ ;._2 'x1 y1 z1 x2 y2 z2 x3 y3 z3 '

but this is still way more complicated than the APL version, so I'm still hoping there is an actually elegant and concise way to do this.
I think the only thing I can add to what you have already pointed out is that, to keep each string separate as a component, you need to box:

    <@,"1/~ 9 2 $ 'x1y1z1x2y2z2x3y3z3'
    +----+----+----+----+----+----+----+----+----+
    |x1x1|x1y1|x1z1|x1x2|x1y2|x1z2|x1x3|x1y3|x1z3|
    +----+----+----+----+----+----+----+----+----+
    |y1x1|y1y1|y1z1|y1x2|y1y2|y1z2|y1x3|y1y3|y1z3|
    +----+----+----+----+----+----+----+----+----+
    |z1x1|z1y1|z1z1|z1x2|z1y2|z1z2|z1x3|z1y3|z1z3|
    +----+----+----+----+----+----+----+----+----+
    |x2x1|x2y1|x2z1|x2x2|x2y2|x2z2|x2x3|x2y3|x2z3|
    +----+----+----+----+----+----+----+----+----+
    |y2x1|y2y1|y2z1|y2x2|y2y2|y2z2|y2x3|y2y3|y2z3|
    +----+----+----+----+----+----+----+----+----+
    |z2x1|z2y1|z2z1|z2x2|z2y2|z2z2|z2x3|z2y3|z2z3|
    +----+----+----+----+----+----+----+----+----+
    |x3x1|x3y1|x3z1|x3x2|x3y2|x3z2|x3x3|x3y3|x3z3|
    +----+----+----+----+----+----+----+----+----+
    |y3x1|y3y1|y3z1|y3x2|y3y2|y3z2|y3x3|y3y3|y3z3|
    +----+----+----+----+----+----+----+----+----+
    |z3x1|z3y1|z3z1|z3x2|z3y2|z3z2|z3x3|z3y3|z3z3|
    +----+----+----+----+----+----+----+----+----+

If you want to get rid of the boxes and instead insert spaces, then you are not really going to have the character items separately; you will have long strings with the spaces as part of the result. And it is a very good question, because it requires you to understand that character strings in J are vectors. I suppose that technically what you are looking for is this, which results in a 9 9 4 shape, but it won't look the way that you expect:

    ,"1/~ 9 2 $ 'x1y1z1x2y2z2x3y3z3'
    x1x1 x1y1 x1z1 x1x2 x1y2 x1z2 x1x3 x1y3 x1z3
    y1x1 y1y1 y1z1 y1x2 y1y2 y1z2 y1x3 y1y3 y1z3
    z1x1 z1y1 z1z1 z1x2 z1y2 z1z2 z1x3 z1y3 z1z3
    x2x1 x2y1 x2z1 x2x2 x2y2 x2z2 x2x3 x2y3 x2z3
    y2x1 y2y1 y2z1 y2x2 y2y2 y2z2 y2x3 y2y3 y2z3
    z2x1 z2y1 z2z1 z2x2 z2y2 z2z2 z2x3 z2y3 z2z3
    x3x1 x3y1 x3z1 x3x2 x3y2 x3z2 x3x3 x3y3 x3z3
    y3x1 y3y1 y3z1 y3x2 y3y2 y3z2 y3x3 y3y3 y3z3
    z3x1 z3y1 z3z1 z3x2 z3y2 z3z2 z3x3 z3y3 z3z3

    $ ,"1/~ 9 2 $ 'x1y1z1x2y2z2x3y3z3'
    9 9 4

You could also take the boxes and convert them to symbols, which might be closer to what you want, although they do have the backtick indicator as part of their representation:

    s:@<@,"1/~ 9 2 $ 'x1y1z1x2y2z2x3y3z3'
    `x1x1 `x1y1 `x1z1 `x1x2 `x1y2 `x1z2 `x1x3 `x1y3 `x1z3
    `y1x1 `y1y1 `y1z1 `y1x2 `y1y2 `y1z2 `y1x3 `y1y3 `y1z3
    `z1x1 `z1y1 `z1z1 `z1x2 `z1y2 `z1z2 `z1x3 `z1y3 `z1z3
    `x2x1 `x2y1 `x2z1 `x2x2 `x2y2 `x2z2 `x2x3 `x2y3 `x2z3
    `y2x1 `y2y1 `y2z1 `y2x2 `y2y2 `y2z2 `y2x3 `y2y3 `y2z3
    `z2x1 `z2y1 `z2z1 `z2x2 `z2y2 `z2z2 `z2x3 `z2y3 `z2z3
    `x3x1 `x3y1 `x3z1 `x3x2 `x3y2 `x3z2 `x3x3 `x3y3 `x3z3
    `y3x1 `y3y1 `y3z1 `y3x2 `y3y2 `y3z2 `y3x3 `y3y3 `y3z3
    `z3x1 `z3y1 `z3z1 `z3x2 `z3y2 `z3z2 `z3x3 `z3y3 `z3z3
I'd say the closest direct analogue of the APL expression is to keep each string boxed:

    ,&.>/~ 'x1';'y1';'z1';'x2';'y2';'z2';'x3';'y3';'z3'
    ┌────┬────┬────┬────┬────┬────┬────┬────┬────┐
    │x1x1│x1y1│x1z1│x1x2│x1y2│x1z2│x1x3│x1y3│x1z3│
    ├────┼────┼────┼────┼────┼────┼────┼────┼────┤
    │y1x1│y1y1│y1z1│y1x2│y1y2│y1z2│y1x3│y1y3│y1z3│
    ├────┼────┼────┼────┼────┼────┼────┼────┼────┤
    │z1x1│z1y1│z1z1│z1x2│z1y2│z1z2│z1x3│z1y3│z1z3│
    ├────┼────┼────┼────┼────┼────┼────┼────┼────┤
    │x2x1│x2y1│x2z1│x2x2│x2y2│x2z2│x2x3│x2y3│x2z3│
    ├────┼────┼────┼────┼────┼────┼────┼────┼────┤
    │y2x1│y2y1│y2z1│y2x2│y2y2│y2z2│y2x3│y2y3│y2z3│
    ├────┼────┼────┼────┼────┼────┼────┼────┼────┤
    │z2x1│z2y1│z2z1│z2x2│z2y2│z2z2│z2x3│z2y3│z2z3│
    ├────┼────┼────┼────┼────┼────┼────┼────┼────┤
    │x3x1│x3y1│x3z1│x3x2│x3y2│x3z2│x3x3│x3y3│x3z3│
    ├────┼────┼────┼────┼────┼────┼────┼────┼────┤
    │y3x1│y3y1│y3z1│y3x2│y3y2│y3z2│y3x3│y3y3│y3z3│
    ├────┼────┼────┼────┼────┼────┼────┼────┼────┤
    │z3x1│z3y1│z3z1│z3x2│z3y2│z3z2│z3x3│z3y3│z3z3│
    └────┴────┴────┴────┴────┴────┴────┴────┴────┘
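For readers coming from other languages, the same outer product of string concatenation is a two-line nested comprehension in Python (shown for comparison only; it sidesteps J's vector-of-characters issue because Python strings are atomic values):

```python
# Outer product of string concatenation, analogous to APL's ∘.,⍨
strs = ['x1', 'y1', 'z1', 'x2', 'y2', 'z2', 'x3', 'y3', 'z3']
table = [[a + b for b in strs] for a in strs]

# Print like the APL output: rows of space-separated pairs
for row in table:
    print(' '.join(row))
```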
How to search for specific text in a CSV with pandas in Python
Hello, I want to find the account text "#" in the title column and save it to a new CSV. Pandas can do it; I tried to make it work but it didn't. This is my CSV: http://www.sharecsv.com/s/c1ed9790f481a8d452049be439f4e3d8/Newnormal.csv

This is my code:

    import pandas as pd

    data = pd.read_csv("Newnormal.csv")
    data.dropna(inplace=True)
    sub = '#'
    data["Indexes"] = data["title"].str.find(sub)
    print(data)

I want results like this:

    From, to, title
    Xavier5501, KudiiThaufeeq, RT #KudiiThaufeeq: Royal Rape, Royal Harassment, Royal Cocktail Party, Royal Pedo, Royal Bidding, Royal Maalee Bayaan, Royal Slavery..et

Thank you.
Reduce the records to only those that have an "#" in title, then define a new column holding the text between "#" and ":". You are left with some records where this leaves NaN in the "to" column; I've just filtered these out.

    df = pd.read_csv("Newnormal.csv")
    df = df[df["title"].str.contains("#") == True]
    df["to"] = df["title"].str.extract(r".*([#][A-Z,a-z,0-9,_]+[:])")
    df = df[["from", "to", "title"]]
    df[~df["to"].isna()].to_csv("ToNewNormal.csv", index=False)
    df[~df["to"].isna()]

output

                   from                to                                              title
    1        Xavier5501   #KudiiThaufeeq:  RT #KudiiThaufeeq: Royal Rape, Royal Harassmen...
    2    Suzane24979006   #USAID_NISHTHA:  RT #USAID_NISHTHA: Don't step outside your hou...
    3   sandeep_sprabhu   #USAID_NISHTHA:  RT #USAID_NISHTHA: Don't step outside your hou...
    4          oliLince  #Timothy_Hughes:  RT #Timothy_Hughes: How to Get a Salesforce Th...
    7         rismadwip  #danielepermana:  RT #danielepermana: Pak kasus covid per hari s...
    ..              ...               ...                                                ...
    992  Reptoid_Hunter       #sapiofoxy:  RT #sapiofoxy: I literally can't believe we ha...
    994     KPCResearch       #sapiofoxy:  RT #sapiofoxy: I literally can't believe we ha...
    995     GreySparkUK  #VoxSmartGlobal:  RT #VoxSmartGlobal: The #newnormal will see mo...
    997        Gabboa10       #HuShameem:  RT #HuShameem: One of #PGO_MV admin staff test...
    999   wanjirunjendu        #ntvkenya:  RT #ntvkenya: AAK's Mugure Njendu shares insig...
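One note on the extraction pattern: inside a character class, commas are literal characters, so [A-Z,a-z,0-9,_] also matches commas in addition to letters, digits, and underscores. A tighter equivalent is \w. A minimal sketch with Python's re on a made-up title (the same idea carries over to Series.str.extract):

```python
import re

# Made-up title in the same shape as the rows in the question's CSV
title = "RT #KudiiThaufeeq: Royal Rape, Royal Harassment"

# \w is [A-Za-z0-9_] and does not accidentally match ','
match = re.search(r"#(\w+):", title)
print(match.group(1))  # -> KudiiThaufeeq
```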
How to reconstruct original text from spaCy tokens, even in cases with complicated whitespacing and punctuation
' '.join(token_list) does not reconstruct the original text in cases with multiple whitespaces and punctuation in a row. For example:

    from spacy.tokenizer import Tokenizer
    from spacy.lang.en import English

    nlp = English()  # Create a blank Tokenizer with just the English vocab
    tokenizerSpaCy = Tokenizer(nlp.vocab)
    context_text = 'this is a test \n \n \t\t test for \n testing - ./l \t'
    contextSpaCyToksSpaCyObj = tokenizerSpaCy(context_text)
    spaCy_toks = [i.text for i in contextSpaCyToksSpaCyObj]
    reconstruct = ' '.join(spaCy_toks)
    reconstruct == context_text
    # False

Is there an established way of reconstructing the original text from spaCy tokens? The established answer should work with this edge case text (you can directly get the source from clicking the 'improve this question' button):

" UNCLASSIFIED U.S. Department of State Case No. F-2014-20439 Doc No. C05795279 Date: 01/07/2016\n\n\n RELEASE IN PART\n B5, B6\n\n\n\n\nFrom: H <hrod17#clintonemail.com>\nSent: Monday, July 23, 2012 7:26 AM\nTo: 'millscd #state.gov'\nCc: 'DanielJJ#state.gov.; 'hanleymr#state.gov'\nSubject Re: S speech this morning\n\n\n\n Waiting to hear if Monica can come by and pick up at 8 to take to Josh. If I don't hear from her, can you send B5\nsomeone else?\n\n Original Message ----\nFrom: Mills, Cheryl D [MillsCD#state.gov]\nSent: Monday, July 23, 2012 07:23 AM\nTo: H\nCc: Daniel, Joshua J <Daniel1.1#state.gov>\nSubject: FW: S speech this morning\n\nSee below\n\n B5\n\ncdm\n\n Original Message\nFrom: Shah, Rajiv (AID/A) B6\nSent: Monday, July 23, 2012 7:19 AM\nTo: Mills, Cheryl D\nCc: Daniel, Joshua.'\nSubject: S speech this morning\n\nHi cheryl,\n\nI look fwd to attending the speech this morning.\n\nI had one last minute request - I understand that in the final version there is no reference to the child survival call to\naction, but their is a reference to family planning efforts. 
Could you and josh try to make sure there is some specific\nreference to the call to action?\n\nAlso, in terms of acknowledgements it would be good to note torn friedan's leadership as everyone is sensitive to our ghi\ntransition and we want to continue to send the usaid-pepfar-cdc working together public message. I don't know if he is\nthere, but wanted to flag.\n\nLook forward to it.\n\nRaj\n\n\n\n\n UNCLASSIFIED U.S. Department of State Case No. F-2014-20439 Doc No. C05795279 Date: 01/07/2016\n\x0c"
You can very easily accomplish this by changing two lines in your code:

    spaCy_toks = [i.text + i.whitespace_ for i in contextSpaCyToksSpaCyObj]
    reconstruct = ''.join(spaCy_toks)

Basically, each token in spaCy knows whether it is followed by whitespace or not. So you append token.whitespace_ to each token and join on the empty string, instead of joining on a space by default.
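The same idea works with any tokenizer that records each token's trailing whitespace. A minimal stdlib-only sketch of the principle (this is an illustration with Python's re, not spaCy itself, and tokenize_lossless is a made-up name):

```python
import re

def tokenize_lossless(text):
    # Each non-space run carries its trailing whitespace,
    # like spaCy's token.whitespace_; leading whitespace becomes its own chunk.
    return re.findall(r'\S+\s*|\s+', text)

context_text = 'this is a test \n \n \t\t test for \n testing - ./l \t'
toks = tokenize_lossless(context_text)
print(''.join(toks) == context_text)  # -> True
```

Because every character of the input lands in exactly one chunk, joining on the empty string always reproduces the original text.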
How to search tweets from one ID to another ID
I'm trying to get tweets using TwitterSearch in Python 3. Basically, I want to get all tweets between these two IDs:

    748843914254249984 -> 760065085616250880

These two IDs run from Fri Jul 01 11:41:16 +0000 2016 to Mon Aug 01 10:50:12 +0000 2016. So here's the code I made:

crawl.py:

    #!/usr/bin/python3
    # coding: utf-8
    from TwitterSearch import *
    import datetime

    def crawl():
        try:
            tso = TwitterSearchOrder()
            tso.set_keywords(["keyword"])
            tso.set_since_id(748843914254249984)
            tso.set_max_id(760065085616250880)

            ACCESS_TOKEN = xxx
            ACCESS_SECRET = xxx
            CONSUMER_KEY = xxx
            CONSUMER_SECRET = xxx

            ts = TwitterSearch(
                consumer_key=CONSUMER_KEY,
                consumer_secret=CONSUMER_SECRET,
                access_token=ACCESS_TOKEN,
                access_token_secret=ACCESS_SECRET
            )

            for tweet in ts.search_tweets_iterable(tso):
                print(tweet['id_str'], '-', tweet['created_at'])

        except TwitterSearchException as e:
            print(e)

    if __name__ == '__main__':
        crawl()

I'm not very familiar with the Twitter API and searching with it, but this code should do the job. Instead, it gives:

    760058064816988160 - Mon Aug 01 10:22:18 +0000 2016
    [...]
    760065085616250880 - Mon Aug 01 10:50:12 +0000 2016

many, many times - I get the same lines over and over again instead of everything between my two IDs. So I'm not getting any of the July tweets. Any idea why?
TL;DR: remove the tso.set_max_id(760065085616250880) line.

Explanation (as far as I understand it): I found your problem in the TwitterSearch docs:

    "The only parameter with a default value is count with 100. This is
    because it is the maximum of tweets returned by this very Twitter API
    endpoint."

If I check this in your code by creating a search URL, I get:

    tso.create_search_url()
    # ?q=Vuitton&since_id=748843914254249984&count=100&max_id=760065085616250880

which contains count=100, meaning it will only fetch the first page of 100 tweets. In contrast, removing both set_since_id and set_max_id (which also yields count=100) retrieves many more tweets, while this URL stops at 100. set_since_id without set_max_id works; the other way around doesn't. So removing max_id=760065085616250880 from the search URL produced the results you want. If anyone can explain why set_max_id does not work on its own, please edit my answer.
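As a sanity check on the ID window itself: Twitter status IDs are snowflakes, so the creation time can be derived directly from the ID, without calling the API at all. A short sketch (the constant 1288834974657 is Twitter's published snowflake epoch in milliseconds; snowflake_to_datetime is a made-up helper name):

```python
from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657  # Twitter snowflake epoch, ms since Unix epoch

def snowflake_to_datetime(tweet_id):
    # The bits above the low 22 hold milliseconds since the snowflake epoch
    ms = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

print(snowflake_to_datetime(748843914254249984))  # Fri Jul 01 2016
print(snowflake_to_datetime(760065085616250880))  # Mon Aug 01 2016
```

This confirms the two IDs in the question really do span July 2016, so the missing July tweets are a paging problem, not a bad window.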