I'm having a lot of difficulty with the datefinder module in Python.
I'm using gspread to fetch data from Google Sheets.
I've got a method with two parameters: the email and the entry.
From the entry (a string of text) I'm extracting dates, and there can be more than one.
But when I try to write the results to a txt or csv file, it just won't work.
import datefinder

def check(email, entry):
    matches = datefinder.find_dates(entry)
    for match in matches:
        with open("myFile.txt", "a") as file:
            file.write(email)
            file.write("\n")
            file.write(match)
I've tried many combinations, and the problem seems to be with match. I tried converting it to a string, or even just grabbing match.day and match.month separately since those are integers, but I still got the same problem.
I used print() to debug; it stays stuck repeating the first email until it throws gspread.exceptions.APIError: {'code': 429, 'message': "Quota exceeded for quota metric 'Read requests' and limit 'Read requests per minute per user' of service 'sheets.googleapis.com' for consumer ...
BUT, it works fine if I'm just printing:
import datefinder

def check(email, entry):
    matches = datefinder.find_dates(entry)
    for match in matches:
        print(email, match)
If it were a problem with the API, I wouldn't be able to print either, so I don't know what else to try.
And if I only write the email into the txt/csv, it works; it exports fine.
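For reference, a minimal sketch of how the matches could be written out, assuming the goal is simply one email plus its dates per block. datefinder.find_dates() yields datetime.datetime objects, which file.write() cannot take directly, so they are formatted first (the date format here is just an illustration):
import datefinder

def check(email, entry):
    matches = datefinder.find_dates(entry)
    with open("myFile.txt", "a") as file:
        for match in matches:
            file.write(email)
            file.write("\n")
            # match is a datetime.datetime, so turn it into text before writing
            file.write(match.strftime("%Y-%m-%d"))  # or simply str(match)
            file.write("\n")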
I am fetching tweets from the Twitter API and then forwarding them through a TCP connection into a socket that Spark is reading data from. This is my code.
For reference, line will look something like this:
{
    data: {
        text: "some tweet",
        id: some number
    },
    matching_rules: [{tag: "some string", id: same number}, {tag: ...}]
}
def ingest_into_spark(tcp_conn, stream):
    for line in stream.iter_lines():
        if not (line is None):
            try:
                # print(line)
                tweet = json.loads(line)['matching_rules'][0]['tag']
                # tweet = json.loads(line)['data']['text']
                print(tweet, type(tweet), len(tweet))
                tcp_conn.sendall(tweet.encode('utf-8'))
            except Exception as e:
                print("Exception in ingesting data: ", e)
The Spark-side code:
print(f"Connecting to {SPARK_IP}:{SPARK_PORT}...")
input_stream = streaming_context.socketTextStream(SPARK_IP, int(SPARK_PORT))
print(f"Connected to {SPARK_IP}:{SPARK_PORT}")
tags = input_stream.flatMap(lambda tags: tags.strip().split())
mapped_hashtags = tags.map(lambda hashtag: (hashtag, 1))
counts = mapped_hashtags.reduceByKey(lambda a, b: a + b)
counts.pprint()
Spark is not reading the data sent over the stream, no matter what I do. But when I replace the line tweet = json.loads(line)['matching_rules'][0]['tag'] with the line tweet = json.loads(line)['data']['text'], it suddenly works as expected. I have tried printing the content of tweet and its type with both lines, and it is a string in both. The only difference is that one has the actual tweets while the other has only one word.
I have tried many different types of inputs and hard-coding the input as well. But I cannot imagine why reading a different field of the JSON makes my code stop working.
Replacing either the client or the server with netcat shows that the data is being sent over the socket as expected in both cases.
If there is no solution to this, I would also be open to alternative ways of ingesting data into Spark that could be used in this scenario.
As described in the documentation, records (lines) in text streams are delimited by new lines (\n). Unlike print(), sendall() is a byte-oriented function and it does not automatically add a new line. No matter how many tags you send with it, Spark will just keep on reading everything as a single record, waiting for the delimiter to appear. When you send the tweet text instead, it works because some tweets do contain line breaks.
Try the following and see if it makes it work:
tcp_conn.sendall((tweet + '\n').encode('utf-8'))
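The same rule applies if several tags are ever sent in one call; a quick sketch (the tags list here is hypothetical):
# every record must end with the '\n' delimiter that socketTextStream splits on
payload = '\n'.join(tags) + '\n'
tcp_conn.sendall(payload.encode('utf-8'))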
I want to check a YouTube video's views and keep track of them over time. I wrote a script that works great:
import requests
import re
import pandas as pd
from datetime import datetime
import time
def check_views(link):
    todays_date = datetime.now().strftime('%d-%m')
    now_time = datetime.now().strftime('%H:%M')

    # get the site
    r = requests.get(link)
    text = r.text
    tag = re.compile(r'\d+ views')
    views = re.findall(tag, text)[0]

    # get the digit number of views; it's returned in a list, so get that item out
    cleaned_views = re.findall(r'\d+', views)[0]
    print(cleaned_views)

    # append to the df
    df.loc[len(df)] = [todays_date, now_time, int(cleaned_views)]
    # df = df.append([todays_date, now_time, int(cleaned_views)], axis=0)
    df.to_csv('views.csv')
    return df

df = pd.DataFrame(columns=['Date', 'Time', 'Views'])

while True:
    df = check_views('https://www.youtube.com/watch?v=gPHgRp70H8o&t=3s')
    time.sleep(1800)
But now I want to use this function for multiple links. I want a different CSV file for each link. So I made a dictionary:
link_dict = {'link1': 'https://www.youtube.com/watch?v=gPHgRp70H8o&t=3s',
             'link2': 'https://www.youtube.com/watch?v=ZPrAKuOBWzw'}
# this makes it easy for each csv file to be named for the corresponding link
The loop then becomes:
for key, value in link_dict.items():
    df = check_views(value)
That seems to work, passing the dict value (the link) into the function. Inside the function, I just made sure to load the correct CSV file at the beginning:
# existing csv files
df = pd.read_csv(k + '.csv')
But then I'm getting an error when I go to append a new row to the df (“cannot set a row with mismatched columns”). I don't get it, since it works just fine in the code written above. This is the part giving the error:
df.loc[len(df)] = [todays_date, now_time, int(cleaned_views)]
What am I missing here? Using this dictionary method seems super messy (I only have 2 links I want to check, but rather than just duplicating the function I wanted to experiment more). Any tips? Thanks!
Figured it out! The problem was that I was saving the df as a CSV and then trying to read that CSV back later. When I saved it, I didn't use index=False with df.to_csv(), so there was an extra column! When I was just testing with the dictionary, I was reusing the in-memory df, and even though I was saving it to a CSV, the script kept using the df to do the actual adding of rows.
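In other words, either drop the index when writing or consume it when reading back; a small sketch of both options (key here stands for the dictionary key used to build the filename, as in the loop above):
# Option 1: don't write the row index at all
df.to_csv(key + '.csv', index=False)

# Option 2: keep writing it, but treat the first column as the index when reading back
df = pd.read_csv(key + '.csv', index_col=0)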
I am trying to get some data from a website (using the modules named requests & BeautifulSoup) and print it to a text file, but every time I try to do so, it says the following:
TypeError: descriptor 'write' requires a 'file' object but received a 'NavigableString'
I have tried using the csv library to import the data, but since I couldn't add the line-by-line data to the CSV, I decided to add all the output to a text file and then take out the data I require.
file_object = open("name-list.txt", "w")  # opening the file

name = soup.find(class_='table-responsive')  # extracting the data
name_list = name.find_all('td')  # refining the data

for final in name_list:
    all = final.contents[0]  # final result
    file.write(all)  # this is where the error comes
file.close()
When I use print(all) in the for loop, I get the output that I need, which consists of multi-line text including the names, age, gender, etc. of the people from the table on the website, but when I try to print that output to the text file, the error pops up.
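For reference, a minimal sketch of one way this could be written, assuming the goal is one table cell per line; the URL is a placeholder, and the key points are writing through the handle that was actually opened and converting the NavigableString to a plain str first:
import requests
from bs4 import BeautifulSoup

# placeholder URL; the soup setup mirrors the question's scraping code
r = requests.get("https://example.com/some-table-page")
soup = BeautifulSoup(r.text, "html.parser")

with open("name-list.txt", "w") as file_object:
    name = soup.find(class_='table-responsive')  # extracting the data
    for final in name.find_all('td'):            # refining the data
        # NavigableString -> plain str before writing
        file_object.write(str(final.contents[0]) + "\n")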
I want to know if there is a better way than iterating through a CSV when performing a check. Basically, I am using SoapUI (free version) to test a web service based on a search.
What I want to do is look at the response from a particular search request (the step name of the SOAP request is 'Search Request') and look for all instances of the test ID found between the XML tags <TestID>, within both <IFInformation> and <OFInformation> (this will be in a Groovy script step).
def groovyUtils = new com.eviware.soapui.support.GroovyUtils(context)
import groovy.xml.XmlUtil

def response = messageExchange.response.responseContent
def xml = new XmlParser().parseText(response)

def IF = xml.'soap:Body'.IF*.TestId.text()
def OF = xml.'soap:Body'.OF*.TestId.text()
Now what I want to do is, for each instance of the 'DepartureAirportId', check whether the ID is in a CSV file. There are two columns in the CSV file (let's call it Search.csv) and both columns contain many rows. If the flight is found in any row of the first column, add 1 to the count for the variable 'Test1'; else, if it is found in the second column of the CSV, add 1 to the count for 'Test2'. If it is not found in either, add 1 to the count for 'NotFound'.
I don't know whether iterating through a CSV is the best approach, or whether to output all the data from the CSV into an array list and iterate through that, but I want to know how this can be done and what the best way is, for my own learning experience.
I don't know about your algorithm, but the easiest way to iterate through a simple CSV file in Groovy is line by line, splitting each line with the separator:
new File("/1.csv").splitEachLine(","){line->
println " ${ line[0] } ${ line[1] } "
}
http://docs.groovy-lang.org/latest/html/groovy-jdk/java/io/File.html#splitEachLine(java.lang.String,%20groovy.lang.Closure)
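Building on that, a rough sketch of the counting described in the question; the two-column layout of Search.csv and the ids list are assumptions, so adjust the column indices and feed in the TestId values you parsed from the response:
def col1 = [] as Set
def col2 = [] as Set
new File("Search.csv").splitEachLine(",") { line ->
    col1 << line[0].trim()
    if (line.size() > 1) { col2 << line[1].trim() }
}

def test1 = 0
def test2 = 0
def notFound = 0
def ids = []  // populate with the TestId values parsed from the response
ids.each { id ->
    if (col1.contains(id)) {
        test1++
    } else if (col2.contains(id)) {
        test2++
    } else {
        notFound++
    }
}
log.info "Test1=$test1 Test2=$test2 NotFound=$notFound"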
You might want to use CSV Validator.
Format.of(String regex)
It should do the trick - just provide the literal you're looking for as a rule for the first column and check whether it throws an exception or not.
I am new to Python and I am trying to read data from a URL. Basically, I am reading historical stock data, getting the closing price, and saving the closing price into a list. The closing price is available at the 4th index (5th column) of each line. And I want to do all of this within a list comprehension.
Code snippet:
from urllib.request import urlopen

URL = "http://ichart.yahoo.com/table.csv?s=AAPL&a=3&b=1&c=2016&d=9&e=30&f=2016"

def downloadClosingPrice():
    urlHandler = urlopen(URL)
    next(urlHandler)
    return [float(line.split(",")[4]) for line in urlHandler.read().decode("utf8").splitlines() if line]

closingPriceList = downloadClosingPrice()
The above code works just fine. I am able to read and fetch the required data. However, just out of curiosity, can the list comprehension be written in a simpler or easier way?
Thanks...
I did try out various ways, and this is how I could do the same thing using different forms of the list comprehension:
return [float(line.decode("utf8").split(",")[4]) for line in urlHandler if line]
# return [float(line.decode("utf8").split(",")[4]) for line in urlHandler.readlines() if line]
# return [float(line.split(",")[4]) for line in urlHandler.read().decode("utf8").splitlines() if line]
The first one is better because it reads the file line by line, which saves memory. And of course it's simpler and easier to understand.
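Putting that first variant back into the question's function, the whole thing might look like the sketch below (same Yahoo URL as in the question; the endpoint may no longer respond, so treat it as illustrative):
from urllib.request import urlopen

URL = "http://ichart.yahoo.com/table.csv?s=AAPL&a=3&b=1&c=2016&d=9&e=30&f=2016"

def downloadClosingPrice():
    urlHandler = urlopen(URL)
    next(urlHandler)  # skip the CSV header row
    # decode and parse one line at a time, keeping only the closing price column
    return [float(line.decode("utf8").split(",")[4]) for line in urlHandler if line]

closingPriceList = downloadClosingPrice()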