I am running a data lake analytics job, and during extraction I am getting an error.
In my scripts I use the built-in Text extractor and also my own extractor. I try to get data from a file containing two columns separated by a space character. When I run my scripts locally everything works fine, but not when I run them using my DLA account. The problem appears only when I try to get data from files with many thousands of rows (but only 36 MB of data); for smaller files everything works correctly. I noticed that the exception is thrown when the total number of vertices is larger than the one for the extraction node. I ran into this problem earlier when working with other "big" files (.csv, .tsv) and extractors. Could someone tell me what is happening?
Error message:
Vertex failure triggered quick job abort. Vertex failed: SV1_Extract[0][0] with error: Vertex user code error.
Vertex failed with a fail-fast error
Script code:
@result =
    EXTRACT s_date string,
            s_time string
    FROM @"/Samples/napis.txt"
    //USING USQLApplicationTest.ExtractorsFactory.getExtractor();
    USING Extractors.Text(delimiter:' ');

OUTPUT @result
TO @"/Out/Napis.log"
USING Outputters.Csv();
Code behind:
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class MyExtractor : IExtractor
{
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        using (StreamReader sr = new StreamReader(input.BaseStream))
        {
            string line;
            // Read and display lines from the file until the end of
            // the file is reached.
            while ((line = sr.ReadLine()) != null)
            {
                string[] words = line.Split(' ');
                int i = 0;
                foreach (var c in output.Schema)
                {
                    output.Set<object>(c.Name, words[i]);
                    i++;
                }
                yield return output.AsReadOnly();
            }
        }
    }
}

public static class ExtractorsFactory
{
    public static IExtractor getExtractor()
    {
        return new MyExtractor();
    }
}
Part of sample file:
...
str1 str2
str1 str2
str1 str2
str1 str2
str1 str2
...
In the job resources I found the jobError message:
"Unexpected number of columns in input stream."-"description":"Unexpected number of columns in input record at line 1.\nExpected 2 columns- processed 1 columns out of 1."-"resolution":"Check the input for errors or use \"silent\" switch to ignore over(under)-sized rows in the input.\nConsider that ignoring \"invalid\" rows may influence job results.
But I checked the file again and I don't see an incorrect number of columns. Is it possible that the error is caused by an incorrect file split and distribution? I have read that big files can be extracted in parallel.
Sorry for my poor English.
The same question was answered here: https://social.msdn.microsoft.com/Forums/en-US/822af591-f098-4592-b903-d0dbf7aafb2d/vertex-failure-triggered-quick-job-abort-exception-thrown-during-data-extraction?forum=AzureDataLake.
Summary:
We currently have an issue with large files where a row is not aligned with the file extent boundary if you upload the file with the "wrong" tool. If you upload it as a row-oriented file through Visual Studio or via the PowerShell command, you should get it aligned (if the row delimiter is CR or LF). If you did not use the "right" upload tool, the built-in extractor will show the behavior that you report, because it currently assumes that record boundaries are aligned to the extents that we split the file into for parallel processing. We are working on a general fix.
If you see similar error messages with your custom extractor that uses AtomicFileProcessing=true and should be immune to the split, please send me your job link so I can file an incident and have the engineering team review your case.
I am fetching tweets from the Twitter API and then forwarding them through a TCP connection into a socket that Spark is reading data from. This is my code:
For reference, a line will look something like this:
{
  data : {
    text: "some tweet",
    id: some number
  }
  matching_rules: [{tag: "some string", id: same number}, {tag:...}]
}
def ingest_into_spark(tcp_conn, stream):
    for line in stream.iter_lines():
        if not (line is None):
            try:
                # print(line)
                tweet = json.loads(line)['matching_rules'][0]['tag']
                # tweet = json.loads(line)['data']['text']
                print(tweet, type(tweet), len(tweet))
                tcp_conn.sendall(tweet.encode('utf-8'))
            except Exception as e:
                print("Exception in ingesting data: ", e)
The Spark-side code:
print(f"Connecting to {SPARK_IP}:{SPARK_PORT}...")
input_stream = streaming_context.socketTextStream(SPARK_IP, int(SPARK_PORT))
print(f"Connected to {SPARK_IP}:{SPARK_PORT}")
tags = input_stream.flatMap(lambda tags: tags.strip().split())
mapped_hashtags = tags.map(lambda hashtag: (hashtag, 1))
counts=mapped_hashtags.reduceByKey(lambda a, b: a+b)
counts.pprint()
Spark is not reading the data sent over the stream no matter what I do. But when I replace the line tweet = json.loads(line)['matching_rules'][0]['tag'] with the line tweet = json.loads(line)['data']['text'] it suddenly works as expected. I have tried printing the content of tweet and its type for both lines, and it is a string in both cases. The only difference is that the first one has the actual tweets while the second only has one word.
I have tried many different types of input, and hard-coding the input as well, but I cannot imagine why reading a different field of the JSON would make my code stop working.
Replacing either the client or the server with netcat shows that the data is being sent over the socket as expected in both cases.
If there is no solution to this, I would also be open to alternative ways of ingesting data into Spark that could be used in this scenario.
As described in the documentation, records (lines) in text streams are delimited by new lines (\n). Unlike print(), sendall() is a byte-oriented function and it does not automatically add a new line. No matter how many tags you send with it, Spark will just keep on reading everything as a single record, waiting for the delimiter to appear. When you send the tweet text instead, it works because some tweets do contain line breaks.
Try the following and see if it makes it work:
tcp_conn.sendall((tweet + '\n').encode('utf-8'))
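For context, a minimal sketch of the ingest function with that change applied could look like the following (same function and variable names as in the question; only the newline handling is new):

import json

def ingest_into_spark(tcp_conn, stream):
    for line in stream.iter_lines():
        if not line:
            continue
        try:
            tweet = json.loads(line)['matching_rules'][0]['tag']
            # Terminate each record with '\n' so socketTextStream can split records
            tcp_conn.sendall((tweet + '\n').encode('utf-8'))
        except Exception as e:
            print("Exception in ingesting data: ", e)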
I want to know if there is a better way than iterating through a CSV when performing a check. Essentially, I am using SoapUI (free version) to test a web service based on a search.
What I want to do is look at the response from a particular search request (the step name of the SOAP request is 'Search Request') and find all instances of test IDs between the <TestID> XML tags, within both <IFInformation> and <OFInformation> (this will be in a Groovy script step).
def groovyUtils = new com.eviware.soapui.support.GroovyUtils(context)
import groovy.xml.XmlUtil

def response = messageExchange.response.responseContent
def xml = new XmlParser().parseText( response )

def IF = xml.'soap:Body'
            .IF*
            .TestId.text()

def OF = xml.'soap:Body'
            .OF*
            .TestId.text()
Now, for each instance of the 'DepartureAirportId', I want to check whether the ID is within a CSV file. There are two columns in the CSV file (let's call it Search.csv) and both columns contain many rows. If the ID is found in any row of the first column, add +1 to a count for the variable 'Test1'; else, if it is found in the second column of the CSV, add +1 to a count for 'Test2'. If it is not found in either, add +1 to a count for 'NotFound'.
I don't know whether iterating through the CSV is the best approach, or whether to read all the data from the CSV into an array list and iterate through that, but I would like to know how this can be done, and the best way to do it, for my own learning experience.
I don't know about your algorithm, but the easiest way to iterate through a simple CSV file in Groovy is line by line, splitting each line on the separator:
new File("/1.csv").splitEachLine(","){line->
println " ${ line[0] } ${ line[1] } "
}
http://docs.groovy-lang.org/latest/html/groovy-jdk/java/io/File.html#splitEachLine(java.lang.String,%20groovy.lang.Closure)
You might want to use CSV Validator.
Format.of(String regex)
It should do the trick: just provide the literal you're looking for as a rule for the first column and check whether it throws an exception or not.
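If it helps to see the counting logic itself spelled out, here is a rough sketch in Python (used purely to illustrate the algorithm; Search.csv and the two-column layout are as described in the question, and the example ID list is hypothetical). It maps directly onto the splitEachLine pattern shown above:

import csv

def count_matches(ids, csv_path='Search.csv'):
    # Build membership sets for the two CSV columns
    col1, col2 = set(), set()
    with open(csv_path, newline='') as f:
        for row in csv.reader(f):
            if len(row) >= 2:
                col1.add(row[0].strip())
                col2.add(row[1].strip())

    counts = {'Test1': 0, 'Test2': 0, 'NotFound': 0}
    for test_id in ids:
        if test_id in col1:
            counts['Test1'] += 1
        elif test_id in col2:
            counts['Test2'] += 1
        else:
            counts['NotFound'] += 1
    return counts

# Hypothetical usage with IDs pulled from the SOAP response
print(count_matches(['ABC123', 'XYZ789']))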
I have an assignment in my computation for geophysicists course; the task is basically to find the largest value in a column of a .txt file (the value is the magnitude of an earthquake, and the file contains all earthquakes from 1990-2000), then take the latitude and longitude of that earthquake (or earthquakes) and plot them on a map.
The task is quite simple to do in Python, but since I am devoting some free time to studying webdev I am thinking of making a simple webapp that would do the complete task.
In other words, I would upload a file to it, and it would automatically place the biggest earthquakes on a Google map.
But since I am kind of a noob in Node.js I am having a hard time starting the project, so I am breaking it into parts, and I need help with the first part of it.
I am thinking of converting the .txt file with the data into a .csv file and subsequently converting that into a .json file. I have absolutely no idea what algorithm I should use to scan the .json file and find the largest value in a given column.
Here is the first row of the original .txt file:
1 5 0 7.0 274 102.000 -3.000
Here it is as a .csv file, using an online converter:
1 5 0 7.0 274 102.000 -3.000
And here it is as a .json file, again using an online converter:
1\t 5\t 0\t7.0\t274\t102.000\t -3.000
Basically I need to scan all the rows and find the largest value of the 5th column.
Any help on how I would start writing this code?
Thanks very much.
TLDR version:
Need to find the largest value of the 5th column in a JSON file with multiple rows.
I had a go at this as a one-liner, code golf style. I'll leave out the usual "don't get Stack Overflow to do your homework for you" shtick. You're only cheating yourself, kids these days, yada yada.
Split, map, reduce.
// read the file as a string ('utf8'); without an encoding readFileSync returns a Buffer
let data = require('fs').readFileSync('renan/homework/geophysicist_data.txt', 'utf8')

let biggest = data.split('\n')
    .map(line => line.split(/[ ]+/)[4])
    .reduce((a, b) => Math.max(a, b))
Having loaded up the data we process it in 3 steps.
.split('\n') By splitting on the newline character we break the text file down into an array, so that each line in the text file is converted into an item in the array.
.map(line => line.split(/[ ]+/)[4]) 'map' takes this array of lines and runs a command on every single line individually. For each line we tell it that one or more spaces (split(/[ ]+/)) is the column separator, and once the line has been broken into columns we take the fifth column (we use [4] instead of [5] because JavaScript starts counting from 0).
.reduce((a, b) => Math.max(a, b)) Now we have an array containing only the fifth column numbers, we can send the array directly to Math.max and let it do the hard work calculating our answer for us. Hooray!
If this data is even a little bit non-uniform it would be very easy to break this, but I'm assuming that, because it's a homework assignment, that is not the case.
Good luck!
If your file contains just lines of numbers with the same structure, I'd not convert it to CSV or JSON.
I'd simply go for parsing the .txt manually. Here is a code snippet showing how you could do this. I used two external modules: lodash (an ultra-popular utility library for data manipulation) and validator (which helps to validate strings):
'use strict';

const fs = require('fs');
const _ = require('lodash');
const os = require('os');
const validator = require('validator');

const parseLine = function (line) {
    if (!line) {
        throw new Error('Line not passed');
    }

    //Splitting a line into tokens
    //Some of the tokens are separated with double spaces
    //So using a regex here
    let tokens = line.split(/\s+/);

    //Data validation
    if (!(
        //I allowed more tokens per line than 7, but not less
        tokens && tokens.length >= 7
        //Also checking that the strings contain numbers
        //So they will be parsed properly
        && validator.isDecimal(tokens[4])
        && validator.isDecimal(tokens[5])
        && validator.isDecimal(tokens[6])
    )) {
        throw new Error('Cannot parse a line');
    }

    //Parsing the values as they come as strings
    return {
        magnitude: parseFloat(tokens[4]),
        latitude: parseFloat(tokens[5]),
        longitude: parseFloat(tokens[6])
    };
};

//I passed the encoding to readFile because otherwise
//data would be a buffer, so we'd have to call .toString('utf8') on it.
fs.readFile('./data.txt', 'utf8', (err, data) => {
    if (err) {
        console.error(err.stack);
        throw err;
    }

    //Splitting into lines with os.EOL
    //so our code works on Linux/Windows/Mac
    let lines = data.split(os.EOL);

    if (!(lines && lines.length)) {
        console.log('No lines found.');
        return;
    }

    //Simple lodash chain to parse all lines
    //and then find the one with max magnitude
    let earthquake = _(lines)
        .map(parseLine)
        .maxBy('magnitude');

    //Simply logging it here
    //You'll probably put it into a callback/promise
    //Or send it as a response from here
    console.log(earthquake);
});
Text corpora are often distributed as large files containing specific documents on each new line. For instance, I have a file with 10 million product reviews, one per line, and each review contains multiple sentences.
When processing such files with Stanford CoreNLP, using the command line, for instance
java -cp "*" -Xmx16g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma -file test.txt
the output, whether in text or xml format, will number all sentences from 1 to n, ignoring the original line numbering that separates the documents.
I would like to keep track of the original file's line numbering (e.g. in xml format, to have an output tree like <original_line id=1>, then <sentence id=1>, then <token id=1>). Or else, to be able to reset the numbering of sentences at the start of each new line in the original file.
I have tried the answer to a similar question about Stanford's POS tagger, without success. Those options do not keep track of the original line numbers.
A quick solution would be to split the original file into multiple files and then process each of them with CoreNLP using the -filelist input option. However, for large files with millions of documents, creating millions of individual files just to preserve the original line/document numbering seems inefficient.
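For illustration, the splitting step itself would be easy enough, something like this rough Python sketch (the file and directory names are made up), but it produces one tiny file per document:

import os

# Split a one-document-per-line corpus into individual files and
# write a file list that could be passed to CoreNLP via -filelist.
os.makedirs('docs', exist_ok=True)
with open('test.txt', encoding='utf-8') as corpus, \
     open('filelist.txt', 'w', encoding='utf-8') as filelist:
    for i, line in enumerate(corpus, start=1):
        path = os.path.join('docs', 'doc_%07d.txt' % i)
        with open(path, 'w', encoding='utf-8') as out:
            out.write(line)
        filelist.write(path + '\n')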
I suppose it would be possible to modify the source code of Stanford CoreNLP, but I am unfamiliar with Java.
Any solution to preserve the original line numbering in the output would be very helpful, whether through the command line or by showing an example Java code that would achieve that.
I've dug through the code base, and I can't find a command line flag that will help you.
I wrote some sample Java code that should do the trick.
I put the code below in DocPerLineProcessor.java inside the stanford-corenlp-full-2015-04-20 directory, and also created a file called sample-doc-per-line.txt with 4 sentences per line.
First make sure to compile:
cd stanford-corenlp-full-2015-04-20
javac -cp "*:." DocPerLineProcessor.java
Here is the command to run:
java -cp "*:." DocPerLineProcessor sample-doc-per-line.txt
The output sample-doc-per-line.txt.xml should be in the desired XML format, but the sentences now carry the line number they came from.
Here is the code:
import java.io.*;
import java.util.*;

import edu.stanford.nlp.io.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.trees.TreeCoreAnnotations.*;
import edu.stanford.nlp.util.*;

public class DocPerLineProcessor {

    public static void main (String[] args) throws IOException {
        // set up properties
        Properties props = new Properties();
        props.setProperty("annotators",
                "tokenize, ssplit, pos, lemma, ner, parse");
        // set up pipeline
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        // read in a product review per line
        Iterable<String> lines = IOUtils.readLines(args[0]);
        Annotation mainAnnotation = new Annotation("");
        // add a blank list to put sentences into
        List<CoreMap> blankSentencesList = new ArrayList<CoreMap>();
        mainAnnotation.set(CoreAnnotations.SentencesAnnotation.class, blankSentencesList);
        // process each product review
        int lineNumber = 1;
        for (String line : lines) {
            Annotation annotation = new Annotation(line);
            pipeline.annotate(annotation);
            for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
                sentence.set(CoreAnnotations.LineNumberAnnotation.class, lineNumber);
                mainAnnotation.get(CoreAnnotations.SentencesAnnotation.class).add(sentence);
            }
            lineNumber += 1;
        }
        PrintWriter xmlOut = new PrintWriter(args[0] + ".xml");
        pipeline.xmlPrint(mainAnnotation, xmlOut);
    }

}
Now when I run this, the sentence tags also have the appropriate line number. So the sentences still have a global id, but you can tell which line they came from. This also sets things up so that a newline always ends a sentence.
Please let me know if you need any clarification or if I made any errors transcribing my code.
The question is already answered, but I had the same problem and came up with a command-line solution that worked for me. The trick was to specify the tokenizerFactory and give it the option tokenizeNLs=true.
It looks like this:
java -mx1g -cp stanford-corenlp-3.6.0.jar:slf4j-api.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier english.conll.4class.distsim.normal.tagger -outputFormat slashTags -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerOptions "tokenizeNLs=true" -textFile untagged_lines.txt > tagged_lines.txt
I have a simple script to import some spectroscopy data from files with some base filename (YYYYMMDD) and a header. My current method pushes the actual spectral intensities into a vector 'rawspectra' and I can access the data with `rawspectra{m,n}.data(q,r)`.
In the script, I specify the base filename by hand and save it as a string 'filebase'.
I would like to prepend 'filebase' to the name of the rawspectra vector, so I can use the script to import files acquired on different dates into the same workspace without overwriting the rawspectra vector (and also allowing for easy understanding of which vectors are attached to which experimental conditions). I can easily do this by manually renaming a vector, but I'd rather make it automatic.
My importation script follows:
%for the importation of multiple sequential files, starting at startfile
%and ending at numfiles. All raw scans are subsequently plotted.
numfiles = input('How many spectra?');
startfile = input('What is the starting file number?');
numberspectra = numfiles - (startfile - 1);
filebase = strcat(num2str(input('what is the base file number?')), '_');

rawspectra = cell(startfile, numberspectra);

for k = startfile:numberspectra
    filename = strcat(filebase, sprintf('%.3d.txt', k));
    %eval(strcat(filebase,'rawspectra')){k} = importdata(filename); - This does not work.
    rawspectra{k} = importdata(filename);
    figure;
    plot(rawspectra{1,k}.data(:,1), rawspectra{1,k}.data(:,2))
end
If any of you can help me out with what should be a seemingly simple task, I would be very appreciative. Basically, I want 'filebase' to go in front of 'rawspectra', incrementing with k inside the loop.
Thanks!
Why not just store the base filename alongside the data:
rawspectra(k) = importdata(filename);
rawspectra(k).filebase = filebase;
That way each element of rawspectra carries its own filebase, and you avoid having to build variable names dynamically.