Spark, grep / search in very long strings / datablock

Spark, grep / search in very long strings / datablock - apache-spark

I have a very long string / datablock where I want to search / grep within.
Example: ...AAABBAAAAVAACCDE...
In this example, I want to search for AVA.
The length of the string is hundredth of GBs
My problem is, when I split the string in block of xxMB (to allow parallel execution) the search will fail on the boundaries.
Example
[Block 1] ...AAABBAAAA
[Block 2] VAACCDE...
In the example above, I will never find the string AVA.
Are the methods or helper function to address this boundary problem?
Thanks for your help

In Spark it's not each to read these custom formats, especially files that are not delimted by newlines, very efficiently out-of-the box.
In essence you need a FileInputStream from your original file (the one with the huge string) and for each chunk you want each record to be read this as a stream
You can, for example, retain a cache of the last n characters from each chuck/record and concat that to the next record, effectively creating an overlap.
eg:
val fileIn = "hugeString.txt"
val fileOut = "sparkFriendlyOutput.txt"
val reader = new FileInputStream(fileIn)
val writer = new BufferedOutputStream(new FileOutputStream(fileOut))
val recordSize = 9
val maxSearchLength = 3
val bytes = Array.fill[Byte](recordSize)(0)
val prefix = Array.fill[Byte](maxSearchLength)(' ')
Stream
.continually((reader.read(bytes),bytes))
.takeWhile(_._1 != -1)
.foreach{
case (_, buffer) => {
writer.write(prefix ++ buffer :+ '\n'.toByte)
Array.copy(buffer.toList.takeRight(maxSearchLength).toArray,0,prefix,0,maxSearchLength)
}}
writer.close()
reader.close()
This turns this string
1234567890123456789012345678901234567890123456789012345...
Into this File:
123456789
789012345678
678901234567
567890123456
...
This does require you to pick a maximum length that you ever want to search for, because that's what the overlap is for.
This file could be read in Spark very easily
On the other hand if you don't have the luxury to be able to store this on disk (or in memory) perhaps you could look into creating a custom spark streaming solution where you either implement a custom streaming source (structured streaming) or custom receiver (Dstream) that reads the file via a similar FileInputStream + buffered prefix solution.
PS. You could do smarter things with the overlap (at least divide by two, so noth the entire possible length is duplicated,
PS I assumed that you don't care about the absolute position. If you do, then I would store the original offset as Long next to each line

Related

How to split a String by bodySize in Groovy Script

Before anything else, I hope that this world situation is not affecting you too much and that you can be as long as possible at home and in good health.
You see, I'm very, very new to Groovy Script and I have a question: How can I separate a String based on its body size?
Assuming that the String has a size of 3,000 characters getting the body like
def body = message.getBody (java.lang.String) as String
and its size like
def bodySize = body.getBytes (). Length
I should be able to separate it into 500-character segments and save each segment in a different variable (which I will later set in a property).
I read some examples but I can't adjust them to what I need.
Thank you very much in advance.

Assuming it's ok to have a List of segment strings, you can simply do:
def segments = body.toList().collate(500)*.join()
This splits the body into a list of characters, collates these into 500 length groups, and then joins each group back to a String.
As a small example
def body = 'abcdefghijklmnopqrstuvwxyz'
def segments = body.toList().collate(5)*.join()
Then segments equals
['abcde', 'fghij', 'klmno', 'pqrst', 'uvwxy', 'z']

Running same function for different arguments in a loop in python

I have large 3D same-sized array of data like density, temperature, pressure, entropy, … . I want to run a same function (like divergence()) for each of these arrays. The easy way is as follows:
div_density = divergence(density)
div_temperature = divergence(temperature)
div_pressure = divergence(pressure)
div_entropy = divergence(entropy)
Considering the fact that I have several arrays (about 100), I'd like to use a loop as follows:
var_list = ['density', 'temperature', 'pressure', 'entropy']
div = np.zeros((len(var_list)))
for counter, variable in enumerate(var_list):
div[Counter] = divergence(STV(variable))
I'm looking for a function like STV() which simply changes "string" to the "variable name". Is there a function like that in python? If yes, what is that function (by using such function, data should not be removed from the variable)?
These 3D arrays are large and because of the RAM limitation cannot be saved in another list like:
main_data=[density, temperature, pressure, entropy]
So I cannot have a loop on main_data.

One workaround is to use exec as follows
var_list = ['density', 'temperature', 'pressure', 'entropy']
div = np.zeros((len(var_list)))
for counter, variable in enumerate(var_list):
s = "div[counter] = divergence("+variable+")"
exec(s)
exec basically executes the string given as the argument in the python interpreter.

how about using a dictionary? that links the variable content to names.
Instead of using variable names density = ... use dict entries data['density'] for the data:
data = {}
# load ur variables like:
data['density'] = ...
divs = {}
for key, val in data.items():
divs[key] = divergence(val)
Since the data you use is large and the operations you try to do are computational expensive I would have a look at some of the libraries that provide methods to handle such data structures. Some of them also use c/c++ bindings for the expensive calculations (such as numpy does). Just to name some: numpy, pandas, xarray, iris (especially for earth data)

Why do multiple print() methods in Spark Streaming affect the values in my list?

I'm trying to receive one JSON line per two seconds, store them in a List which has elements from a costum Class, created by me, and print the resulting List after each execution of the context. So I'm doing something like this:
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
JavaReceiverInputDStream<String> streamData = ssc.socketTextStream(args[0], Integer.parseInt(args[1]),
StorageLevels.MEMORY_AND_DISK_SER);
JavaDStream<LinkedList<StreamValue>> getdatatoplace = streamData.map(new Function<String, LinkedList<StreamValue>>() {
#Override
public LinkedList<StreamValue> call(String s) throws Exception {
//Access specific attributes in the JSON
Gson gson = new Gson();
Type type = new TypeToken<Map<String, String>>() {
}.getType();
Map<String, String> retMap = gson.fromJson(s, type);
String a = retMap.get("exp");
String idValue = retMap.get("id");
//insert values into the stream_Value LinkedList
stream_Value.push(new StreamValue(idValue, a, UUID.randomUUID()));
return stream_Value;
}
});
getdatatoplace.print();
This works very well, and I get the following result:
//at the end of the first batch duration/cycle
getdatatoplace[]={json1}
//at the end of the second batch duration/cycle
getdatatoplace[]={json1,json2}
...
However, if I do multiple prints of getdatatoplace, let's say 3:
getdatatoplace.print();
getdatatoplace.print();
getdatatoplace.print();
then I get this result:
//at the end of the first print
getdatatoplace[]={json1}
//at the end of the second print
getdatatoplace[]={json1,json1}
//at the end of the third print
getdatatoplace[]={json1,json1,json1}
//Context ends with getdatatoplace.size()=3
//New cycle begins, and I get a new value json2
//at the end of the first print
getdatatoplace[]={json1,json1,json1,json2}
...
So what happens is that, for each print that I do, even though I do stream_Value.push before, and the commands I gave in my batch duration haven't ended yet, stream_Value pushes values to my List for every print that I do.
My question is, why does this happen, and how do I make it so that, independentely of the number of print() methods I use, I get just one JSON line stored in my list per Batch Duration/per execution.
I hope I was not confusing, as I am new to Spark and may have confused some of the vocabulary. Thank you so much.
PS: Even if I print another DStream, the same thing happens. Say I do this, each with same 'architecture' of the Stream above:
JavaDStream1.print();
JavaDStream2.print();
At the end of JavaDStream2.print(), the list within JavaDstream1 has one extra value.

Spark Streaming uses the same computation model as Spark. The operations we declare on the data form a Direct Acyclic Graph (DAG) that's evaluated when actions are used to materialize such computations on the data.
In Spark Streaming, output operations, such as print() will schedule the materialization of these operations at every batch interval.
The DAG for this Stream would look something like this:
[TextStream]->[map]->[print]
print will schedule the map operation on the data received by the socketTextStream. When we add more print actions, our DAG looks like:
/->[map]->[print]
[TextStream] ->[map]->[print]
\->[map]->[print]
And here the issue should become visible. The map operation is executed three times. That's expected behavior and normally not an issue, because map is supposed to be a stateless transformation.
The root cause of the problem here is that map contains a mutation operation, as it adds elements to a global collection stream_Value defined outside the scope of the function passed to map.
This not only causes the duplication issues, but will not work in general, when Spark Streaming runs in its usual cluster mode.

How to write Fortran Output as CSV file?

Can any one tell me, how can I write my output of Fortran program in CSV format? So I can open the CSV file in Excel for plotting data.

A slightly simpler version of the write statement could be:
write (1, '(1x, F, 3(",", F))') a(1), a(2), a(3), a(4)
Of course, this only works if your data is numeric or easily repeatable. You can leave the formatting to your spreadsheet program or be more explicit here.

I'd also recommend the csv_file module from FLIBS. Fortran is well equipped to read csv files, but not so much to write them. With the csv_file module, you put
use csv_file
at the beginning of your function/subroutine and then call it with:
call csv_write(unit, value, advance)
where unit = the file unit number, value = the array or scalar value you want to write, and advance = .true. or .false. depending on whether you want to advance to the next line or not.
Sample program:
program write_csv
use csv_file
implicit none
integer :: a(3), b(2)
open(unit=1,file='test.txt',status='unknown')
a = (/1,2,3/)
b = (/4,5/)
call csv_write(1,a,.true.)
call csv_write(1,b,.true.)
end program
output:
1,2,3
4,5
if you instead just want to use the write command, I think you have to do it like this:
write(1,'(I1,A,I1,A,I1)') a(1),',',a(2),',',a(3)
write(1,'(I1,A,I1)') b(1),',',b(2)
which is very convoluted and requires you to know the maximum number of digits your values will have.
I'd strongly suggest using the csv_file module. It's certainly saved me many hours of frustration.

The Intel and gfortran (5.5) compilers recognize:
write(unit,'(*(G0.6,:,","))')array or data structure
which doesn't have excess blanks, and the line can have more than 999 columns.
To remove excess blanks with F95, first write into a character buffer and then use your own CSV_write program to take out the excess blanks, like this:
write(Buf,'(999(G21.6,:,","))')array or data structure
call CSV_write(unit,Buf)
You can also use
write(Buf,*)array or data structure
call CSV_write(unit,Buf)
where your CSV_write program replaces whitespace with "," in Buf. This is problematic in that it doesn't separate character variables unless there are extra blanks (i.e. 'a ','abc ' is OK).

I thought a full simple example without any other library might help. I assume you are working with matrices, since you want to plot from Excel (in any case it should be easy to extend the example).
tl;dr
Print one row at a time in a loop using the format format(1x, *(g0, ", "))
Full story
The purpose of the code below is to write in CSV format (that you can easily import in Excel) a (3x4) matrix.
The important line is the one labeled 101. It sets the format.
program testcsv
IMPLICIT NONE
INTEGER :: i, nrow
REAL, DIMENSION(3,4) :: matrix
! Create a sample matrix
matrix = RESHAPE(source = (/1,2,3,4,5,6,7,8,9,10,11,12/), &
shape = (/ 3, 4 /))
! Store the number of rows
nrow = SIZE(matrix, 1)
! Formatting for CSV
101 format(1x, *(g0, ", "))
! Open connection (i.e. create file where to write)
OPEN(unit = 10, access = "sequential", action = "write", &
status = "replace", file = "data.csv", form = "formatted")
! Loop across rows
do i=1,3
WRITE(10, 101) matrix(i,:)
end do
! Close connection
CLOSE(10)
end program testcsv
We first create the sample matrix. Then store the number of rows in the variable nrow (this is useful when you are not sure of the matrix's dimension beforehand). Skip a second the format statement. What we do next is to open (create or replace) the CSV file, names data.csv. Then we loop over the rows (do statement) of the matrix to write a row at a time (write statement) in the CSV file; rows will be appended one after another.
In more details how the write statement works is: WRITE(U,FMT) WHAT. We write "what" (the i-th row of the matrix: matrix(i,:)), to connection U (the one we created with the open statement), formatting the WHAT according to FMT.
Note that in the example FMT=101, and 101 is the label of our format statement:
format(1x, *(g0, ", "))
what this does is: "1x" insert a white space at the beginning of the row; the "*" is used for unlimited format repetition, which means that the format in the following parentheses is repeated for all the data left in the object we are printing (i.e. all elements in the matrix's row). Thus, each row number is formatted as: 'g0, ", "'.
g is a general format descriptor that handles floats as well as characters, logicals and integers; the trailing 0 basically means: "use the least amount of space needed to contain the object to be formatted" (avoids unnecessary spaces). Then, after the formatted number, we require the comma plus a space: **", ". This produces our comma-separated values for a row of the matrix (you can use other separators instead of "," if you need). We repeat for every row and that's it.
(The spaces in the format are not really needed, thus one could use format(*(g0,","))
Reference: Metcalf, M., Reid, J., & Cohen, M. (2018). Modern Fortran Explained: Incorporating Fortran 2018. Oxford University Press.

Tens seconds work with a search engine finds me the FLIBS library, which includes a module called csv_file which will write strings, scalars and arrays out is CSV format.

How does one retrieve the most frequently occurring keyterms from a Sphinx index?

I have a Sphinx index of text files and I'd like to retrieve a list of the keyterms Sphinx found when indexing the text files, ordered, highest to lowest, by how frequently they occurred in the dataset. How do I do this?
I'd like to retrieve both the real term and the stem if possible.
I'm using the PHP api to make calls to the index.
Below are my Sphinx.conf settings for this index:
source srcDatasheets
{
type = mysql
sql_host = localhost
sql_user = user
sql_pass = pass
sql_db = db
sql_port = 3306
sql_query = \
SELECT id, company_id, title, brief, content_file_path \
FROM datasheets
sql_attr_uint = company_id
sql_file_field = content_file_path
sql_query_info = SELECT * FROM datasheets WHERE id=$id
}
index datasheets
{
source = srcDatasheets
path = /usr/local/sphinx/var/data/datasheetsStemmed
docinfo = extern
charset_type = sbcs
morphology = stem_en
min_stemming_len = 1
}

One cannot retrieve keyword density directly from a live index with Sphinx. The data is not stored in a way that allows this. Here is a response from the Sphinx forums.
What you can do, however, is run the indexer with --buildstops, and --buildfreqs (see the docs). The indexer will output a txt file of the most frequently occurring terms and frequencies based on the settings you have in the .conf file for that index.
This processes the data set to create the list and the text file, and doesn't actually create a new searchable index.
I ran a test on an index of text files (converted pdfs) with min word length and min stemming lengths of 5 characters. 70,000 files processed in about 20 seconds (5 minutes with min character limits set to 1).

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Spark, grep / search in very long strings / datablock - apache-spark

Related

How to split a String by bodySize in Groovy Script

Running same function for different arguments in a loop in python

Why do multiple print() methods in Spark Streaming affect the values in my list?

How to write Fortran Output as CSV file?

How does one retrieve the most frequently occurring keyterms from a Sphinx index?

Categories

Resources