Ingesting data into spark using socketTextStream - apache-spark

I am fetching tweets from the Twitter API and then forwarding them through a TCP connection into a socket that Spark reads from. This is my code.
For reference, line will look something like this:
{
    data: {
        text: "some tweet",
        id: some number
    },
    matching_rules: [{tag: "some string", id: same number}, {tag: ...}]
}
def ingest_into_spark(tcp_conn, stream):
    for line in stream.iter_lines():
        if not (line is None):
            try:
                # print(line)
                tweet = json.loads(line)['matching_rules'][0]['tag']
                # tweet = json.loads(line)['data']['text']
                print(tweet, type(tweet), len(tweet))
                tcp_conn.sendall(tweet.encode('utf-8'))
            except Exception as e:
                print("Exception in ingesting data: ", e)
The Spark-side code:
print(f"Connecting to {SPARK_IP}:{SPARK_PORT}...")
input_stream = streaming_context.socketTextStream(SPARK_IP, int(SPARK_PORT))
print(f"Connected to {SPARK_IP}:{SPARK_PORT}")
tags = input_stream.flatMap(lambda tags: tags.strip().split())
mapped_hashtags = tags.map(lambda hashtag: (hashtag, 1))
counts=mapped_hashtags.reduceByKey(lambda a, b: a+b)
counts.pprint()
Spark is not reading the data sent over the stream, no matter what I do. But when I replace the line tweet = json.loads(line)['matching_rules'][0]['tag'] with the line tweet = json.loads(line)['data']['text'], it suddenly works as expected. I have tried printing the content of tweet and its type in both cases, and it is a string in both. The only difference is that one variant carries the actual tweet text while the other is just a single word.
I have tried many different kinds of input and hard-coding the input as well, but I cannot imagine why reading a different field of the JSON would make my code stop working.
Replacing either the client or the server with netcat shows that the data is being sent over the socket as expected in both cases.
If there is no solution to this, I would also be open to alternative ways of ingesting data into Spark that could be used in this scenario.

As described in the documentation, records (lines) in text streams are delimited by new lines (\n). Unlike print(), sendall() is a byte-oriented function and it does not automatically add a new line. No matter how many tags you send with it, Spark will just keep on reading everything as a single record, waiting for the delimiter to appear. When you send the tweet text instead, it works because some tweets do contain line breaks.
Try the following and see if it makes it work:
tcp_conn.sendall((tweet + '\n').encode('utf-8'))
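Applied to the ingest function from the question, the send loop would then look like this (a sketch reusing the same names as above):
def ingest_into_spark(tcp_conn, stream):
    for line in stream.iter_lines():
        if line is None:
            continue
        try:
            tweet = json.loads(line)['matching_rules'][0]['tag']
            # Append the record delimiter so socketTextStream can split records.
            tcp_conn.sendall((tweet + '\n').encode('utf-8'))
        except Exception as e:
            print("Exception in ingesting data: ", e)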

Related

Python datefinder module export to txt or csv

I'm having a lot of difficulty with the datefinder module in Python.
I'm using gspread to fetch data from Google Sheets.
I've got a method with 2 parameters: the email and entry.
From entry (a string of text) I'm extracting dates, of which there can be more than one.
But when I try to add the results into a txt or csv file, it just won't work.
import datefinder

def check(email, entry):
    matches = datefinder.find_dates(entry)
    for match in matches:
        with open("myFile.txt", "a") as file:
            file.write(email)
            file.write("\n")
            file.write(match)
I've tried many combinations, and it seems to be something wrong with match. I tried parsing it to a string, or even just grabbing match.day and match.month separately since those are integers, but I still got the same problem.
I used print to debug; it stays stuck repeating the first email until it throws gspread.exception.APIError: {'code': 429, 'message': "Quota exceeded for quota metric 'Read requests' and limit 'Read requests per minute per user' of service 'sheets.googleapis.com' for consumer ...
BUT, it works fine if I'm just printing:
import datefinder

def check(email, entry):
    matches = datefinder.find_dates(entry)
    for match in matches:
        print(email, match)
If it was a problem with the API, I wouldn't be able to print, so I don't know what else to try.
And if I'm only inserting the email into txt/csv, it works. It exports fine.
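One thing that may be relevant (an assumption on my part, not confirmed in the question): file.write() only accepts strings, while datefinder.find_dates() yields datetime objects, so writing match directly raises a TypeError. A minimal sketch of the writing loop with the date formatted first:
import datefinder

def check(email, entry):
    matches = datefinder.find_dates(entry)  # yields datetime.datetime objects
    with open("myFile.txt", "a") as file:   # open the file once per call
        for match in matches:
            file.write(email)
            file.write("\n")
            file.write(match.strftime("%Y-%m-%d"))  # write() needs a str, not a datetime
            file.write("\n")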

Printing sql query results

Two separate but related questions to querying a database and sending the results to discord:
Example code:
async def query(ctx):
    try:
        cursor.execute("SELECT record.name, COUNT(place.name) .....")
        for row in cursor:
            await ctx.send(row)
    except Error as e:
        print(e)
Each row prints as its own message and displays in the following format:
('potato', 1)
('orange', 1)
('steak', 1)
I would like to append all the rows into a single message, as the results are numerous enough that this blows the channel up with spam messages. Each row should be on its own line and, part 2, the query result formatting should be removed.
It should look like this (the quotation marks indicate one message, not visible formatting):
"
potato 1
orange 2
steak 1
"
I have tried changing "await ctx.send(row)" to "await ctx.send(row[0], row[1])" to remove the formatting, but I know send doesn't operate the same way as print() and is instead looking for a string.
"await ctx.send(row[0].join(row[1]))" fails with: "TypeError: can only join an iterable"
Further I have no clue how to get the result from cursor append it into a single string with line breaks.
Thanks in advance for any advice.
This is just basic Python:
message = ""
for veg, i in cursor.fetchall(): # Or however you named it
message += f"{veg} {i}\n"
The message var now looks like this:
"potato 1
orange 2
steak 1"
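You can then send it as one message (a sketch, assuming the same ctx from the question; note that Discord caps a single message at 2000 characters, so very large result sets may still need splitting):
await ctx.send(message)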

How do I iterate through two lists in API call Python 3

I have two files containing information that I need to feed into the same script: one containing IDs, one on each line, and the other containing parameters, also on their own individual lines. It should be noted that these lists each contain over 4000 lines. Other API calls have been successful, but this one is a bit harder to figure out.
The way this is intended to work is that the script reads line 1 from the ID file and inserts that ID where %s is in the URL, completing the URL needed for the API call. Then I need the parameter that is on the same line number, matching its respective network ID in the ID file, placed in the %s in the payload section.
I got it to this point, and what is happening now is that when an ID is picked from the ID list, the URL becomes complete and does what it is supposed to. However, when the script starts reading the contents file, it iterates over and over until the parameters for ALL networks have been applied to that one network, which is not supposed to happen; then it moves on to the next network ID and does the same thing.
I posted a sample visual below to give you an idea of what the output is. I know there must be a way to have them read one line at a time, run the script, iterate to the next line in sequence, and do this until both lists are complete.
Python is not my strongest area, so any assistance is greatly appreciated.
The files are .txt files and properly formatted. This data has been tested using Postman and has been successful in our other API calls as well, so we can eliminate a couple of factors.
with open('TZ.txt') as file1, open('TZContents.txt') as file2:
    array1 = file1.readlines()
    file = file2.readlines()
    for line in array1:
        url = 'https://dashboard.meraki.com/api/v0/networks/%s' % line.rstrip("\n")
        for line2 in file:
            payload = '%s' % line2.rstrip("\n")
            headers = {'X-Cisco-Meraki-API-Key': 'API Key', 'Content-Type': "application/json"}
            response = requests.request('PUT', url, headers=headers, data=payload, allow_redirects=True, timeout=10)
            print(response.text)
Output Example Below:
{"id":"1111", "type":"wireless","name":"Network A}
{"id":"1111", "type":"wireless","name":"Network B}
{"id":"1111", "type":"wireless","name":"Network C}
{"errors":["Name has already been taken"]}
{"errors":["Name has already been taken"]}
{"errors":["Name has already been taken"]}
{"errors":["Name has already been taken"]}
{"errors":["Name has already been taken"]}
{"id":"2222", "type":"appliance","name":"Network A}
{"id":"2222", "type":"appliance","name":"Network B}
{"id":"2222", "type":"appliance","name":"Network C}
Should be this:
{"id":"1111", "type":"wireless","name":"Network A}
{"id":"2222", "type":"appliance","name":"Network B}
{"id":"3333", "type":"combined","name":"Network C}
I read your description, and I guess that the two files contain exactly the same number of lines. Is that correct?
In the present code a nested for loop is used, resulting in redundant output.
You could use the same index to locate the matching line in each file.
Modified code might look like this:
with open('TZ.txt') as file1, open('TZContents.txt') as file2:
    ids = file1.readlines()
    params = file2.readlines()
    n_lines = len(ids)
    for line_num in range(n_lines):
        url = 'https://dashboard.meraki.com/api/v0/networks/%s' % ids[line_num].rstrip("\n")
        payload = '%s' % params[line_num].rstrip("\n")
        headers = {'X-Cisco-Meraki-API-Key': 'API Key', 'Content-Type': "application/json"}
        response = requests.request('PUT', url, headers=headers, data=payload, allow_redirects=True, timeout=10)
        print(response.text)
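Alternatively, here is a sketch using zip(), which pairs the lines of the two files positionally and stops at the end of the shorter file (same file names and headers as above assumed):
import requests

headers = {'X-Cisco-Meraki-API-Key': 'API Key', 'Content-Type': 'application/json'}

with open('TZ.txt') as file1, open('TZContents.txt') as file2:
    # zip() yields one (ID, payload) pair per line; no nested loop, so each
    # payload is applied only to its matching network.
    for network_id, payload in zip(file1, file2):
        url = 'https://dashboard.meraki.com/api/v0/networks/%s' % network_id.strip()
        response = requests.request('PUT', url, headers=headers,
                                    data=payload.strip(), allow_redirects=True, timeout=10)
        print(response.text)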

python-snappy streaming data in a loop to a client

I would like to send multiple compressed arrays from a server to a client using python-snappy, but I cannot get it to work after the first array. Here is a snippet showing what is happening:
(sock is just the network socket that these are communicating through)
Server:
for i in range(n):  # number of arrays to send
    val = items[i][1]  # this is the array
    y = (json.dumps(val)).encode('utf-8')
    b = io.BytesIO(y)
    # snappy.stream_compress requires a file-like object as input, as far as I know.
    with b as in_file:
        with sock as out_file:
            snappy.stream_compress(in_file, out_file)
Client:
for i in range(n):  # same n as before
    data = ''
    b = io.BytesIO()
    # snappy.stream_decompress requires a file-like object to write to, as far as I know
    snappy.stream_decompress(sock, b)
    data = b.getvalue().decode('utf-8')
    val = json.loads(data)
val = json.loads(data) works only on the first iteration; afterwards it stops working. When I do a print(data), only the first iteration prints anything. I've verified that the server does flush and send all the data, so I believe it is a problem with how I receive the data.
I could not find a different way to do this. I searched, and the only thing I could find is this post, which has led me to what I currently have.
Any suggestions or comments?
with doesn't do what you think; refer to its documentation. It calls sock.__exit__() after the block has executed, and that's not what you intended.
# what you wrote
with b as in_file:
    with sock as out_file:
        snappy.stream_compress(in_file, out_file)

# what you meant
snappy.stream_compress(b, sock)
By the way:
The line data = '' is redundant because data is reassigned anyway.
Adding to #paul-scharnofske's answer:
Likewise, on the receiving side: stream_decompress doesn't quit until end-of-file, which means it will read until the socket is closed. So if you send multiple separate compressed chunks, it will read all of them before finishing, which is not what you intend. Bottom line: you need to add "framing" around each chunk so that you know on the receiving end where one ends and the next one starts. One way to do that, for each array to be sent:
Create a io.BytesIO object with the json-encoded input as you're doing now
Create a second io.BytesIO object for the compressed output
Call stream_compress with the two BytesIO objects (you can write into a BytesIO in addition to reading from it)
Obtain the len of the output object
Send the length encoded as a 32-bit integer, say, with struct.pack("!I", length)
Send the output object
On the receiving side, reverse the process. For each array:
Read 4 bytes (the length)
Create a BytesIO object. Receive exactly length bytes, writing those bytes to the object
Create a second BytesIO object
Pass the received object as input and the second object as output to stream_decompress
json-decode the resulting output object
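A minimal sketch of that framing, assuming sock is the connected socket from the question and val is one array to send (the send_array, recv_array, and recv_exact helper names are made up here):
import io
import json
import struct
import snappy

def recv_exact(sock, n):
    # recv() may return fewer bytes than requested, so loop until we have n.
    buf = b''
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed before the frame was complete")
        buf += chunk
    return buf

def send_array(sock, val):
    # Compress into an in-memory buffer first so the frame length is known up front.
    raw = io.BytesIO(json.dumps(val).encode('utf-8'))
    compressed = io.BytesIO()
    snappy.stream_compress(raw, compressed)
    payload = compressed.getvalue()
    sock.sendall(struct.pack("!I", len(payload)))  # 4-byte length prefix
    sock.sendall(payload)

def recv_array(sock):
    length, = struct.unpack("!I", recv_exact(sock, 4))
    compressed = io.BytesIO(recv_exact(sock, length))
    decompressed = io.BytesIO()
    snappy.stream_decompress(compressed, decompressed)
    return json.loads(decompressed.getvalue().decode('utf-8'))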

Vertex failure triggered quick job abort - Exception thrown during data extraction

I am running a data lake analytics job, and during extraction I am getting an error.
In my scripts I use the TEXT extractor and also my own extractor. I try to get data from a file containing two columns separated by a space character. When I run my scripts locally everything works fine, but not when I try to run them using my DLA account. I have the problem only when I try to get data from files with many thousands of rows (but only 36 MB of data); for smaller files everything works correctly. I noticed that the exception is thrown when the total number of vertices is larger than the one for the extraction node. I ran into this problem earlier, working with other "big" files (.csv, .tsv) and extractors. Could someone tell me what is happening?
Error message:
Vertex failure triggered quick job abort. Vertex failed: SV1_Extract[0][0] with error: Vertex user code error.
Vertex failed with a fail-fast error
Script code:
@result =
    EXTRACT s_date string,
            s_time string
    FROM @"/Samples/napis.txt"
    //USING USQLApplicationTest.ExtractorsFactory.getExtractor();
    USING Extractors.Text(delimiter:' ');

OUTPUT @result
TO @"/Out/Napis.log"
USING Outputters.Csv();
Code behind:
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class MyExtractor : IExtractor
{
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        using (StreamReader sr = new StreamReader(input.BaseStream))
        {
            string line;
            // Read and display lines from the file until the end of
            // the file is reached.
            while ((line = sr.ReadLine()) != null)
            {
                string[] words = line.Split(' ');
                int i = 0;
                foreach (var c in output.Schema)
                {
                    output.Set<object>(c.Name, words[i]);
                    i++;
                }
                yield return output.AsReadOnly();
            }
        }
    }
}

public static class ExtractorsFactory
{
    public static IExtractor getExtractor()
    {
        return new MyExtractor();
    }
}
Part of sample file:
...
str1 str2
str1 str2
str1 str2
str1 str2
str1 str2
...
In the job resources I found this jobError message:
"Unexpected number of columns in input stream."-"description":"Unexpected number of columns in input record at line 1.\nExpected 2 columns- processed 1 columns out of 1."-"resolution":"Check the input for errors or use \"silent\" switch to ignore over(under)-sized rows in the input.\nConsider that ignoring \"invalid\" rows may influence job results.
But I checked the file again and I don't see an incorrect number of columns. Is it possible that the error is caused by an incorrect file split and distribution? I read that big files can be extracted in parallel.
Sorry for my poor English.
The same question was answered here: https://social.msdn.microsoft.com/Forums/en-US/822af591-f098-4592-b903-d0dbf7aafb2d/vertex-failure-triggered-quick-job-abort-exception-thrown-during-data-extraction?forum=AzureDataLake.
Summary:
We currently have an issue with large files where the row is not aligned with the file extent boundary if you upload the file with the "wrong" tool. If you upload it as row-oriented file through Visual Studio or via the Powershell command, you should get it aligned (if the row delimiter is CR or LF). If you did not use the "right" upload tool, the built-in extractor will show the behavior that you report because it currently assumes that record boundaries are aligned to the extents that we split the file into for parallel processing. We are working on a general fix.
If you see similar error messages with your custom extractor that uses AtomicFileProcessing=true and should be immune to the split, please send me your job link so I can file an incident and have the engineering team review your case.
