I am new to Spark and reasonably new to Clojure (although I really like what Clojure can do so far). I am currently trying to parse JSON in Clojure using sparkling, and I am having trouble with the basics of transforming data and getting it back in a form I can understand and debug. I use dummy data in the example below, but my actual data is over 400GB.
As an example, I first tried splitting each line of my JSON input (each line is a full record) by commas so that I would have a list of keys and values (for eventual conversion to keyword and value maps). In Scala (for which it is easier to find Spark examples) with dummy data, this works fine:
val data = sc.parallelize(Array ("a:1,b:2","a:3,b:4"))
val keyVals = data.map(line => line.split(","))
keyVals.collect()
This returns Array[Array[String]] = Array(Array(a:1, b:2), Array(a:3, b:4)), which is at least a reasonable starting point for key-value mapping.
However, when I run the following in Clojure with sparkling:
(def jsony-strings (spark/parallelize sc ["a:1,b:2","a:3,b:4"]))
(def jsony-map (->> jsony-strings
                    (spark/map (fn [l] (string/split l #",")))))
(spark/collect jsony-map)
I get the usual concurrency spaghetti from the JVM, the crux of which seems to be:
2018-08-24 18:49:55,796 WARN serialization.Utils:55 - Error deserializing object (clazz: gdelt.core$fn__7437, namespace: gdelt.core)
java.lang.ClassNotFoundException: gdelt.core$fn__7437
This is an error I seem to get pretty much any time I try to do something more complex than counts.
Can someone please point me in the right direction?
I guess I should note that my Big Problem is processing lots and lots of lines of JSON in a bigger-than-memory (400G) dataset. I will be using the JSON keys to filter, sort, calculate, etc., and Spark pipelines looked like a good fit for both rapid parallel processing and convenient handling of those operations. But I am certainly open to considering other alternatives for processing this dataset.
You should use Cheshire for this:
;; parse some json
(parse-string "{\"foo\":\"bar\"}")
;; => {"foo" "bar"}
;; parse some json and get keywords back
(parse-string "{\"foo\":\"bar\"}" true)
;; => {:foo "bar"}
I like to use a shortcut for the 2nd case, since I always want to convert string keys into Clojure keywords:
(is= {:a 1 :b 2} (json->edn "{\"a\":1, \"b\":2}"))
It is just a simple wrapper with (I think) an easier-to-remember name:
(defn json->edn
  "Shortcut to cheshire.core/parse-string"
  [arg]
  (cc/parse-string arg true)) ; true => keywordize-keys

(defn edn->json
  "Shortcut to cheshire.core/generate-string"
  [arg]
  (cc/generate-string arg))
Update: Note that Cheshire can work with lazy streams:
;; parse a stream lazily (keywords option also supported)
(parsed-seq (clojure.java.io/reader "/tmp/foo"))
This is a follow-up question to this post. What I want to achieve is to avoid counting words in headers and inside code blocks having this pattern:
```{r label-name}
all code words not to be counted.
```
Rather than this pattern:
```
{r label-name}
all code words not to be counted.
```
This is because when I use the latter pattern I lose font-lock fontification in the R Markdown buffer in Emacs, so I always use the first one.
Consider this MWE:
MWE (MWE-wordcount.Rmd)
# Results {-}
## Topic 1 {-}
This is just a random text with a citation in markdown (\@ref(fig:pca-scree)).
Below is a code block.
```{r pca-scree, echo = FALSE, fig.align = "left", out.width = "80%", fig.cap = "Scree plot with parallel analysis using simulated data of 100 iterations (red line) suggests retaining only the first 2 components. Observed dimensions with their eigenvalues are shown in green."}
knitr::include_graphics("./plots/PCA_scree_parallel_analysis.png")
```
## Topic 2 {-}
<!-- todo: a comment that needs to be avoided by word count hopefully-->
The result should be 17 words only, not counting words in code blocks, comments, or Markdown markup (like the headers).
I followed the method explained here to get pandoc to count the words using a Lua filter. In short, I did these steps:
from command line:
mkdir -p ~/.local/share/pandoc/filters
Then created a file there named wordcount.lua with this content:
-- counts words in a document
words = 0
wordcount = {
  Str = function(el)
    -- we don't count a word if it's entirely punctuation:
    if el.text:match("%P") then
      words = words + 1
    end
  end,
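  -- inline code: gsub's second return value is the number of substitutions
  -- made, i.e. the number of whitespace-separated tokens in the code span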
  Code = function(el)
    _, n = el.text:gsub("%S+", "")
    words = words + n
  end,
}
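-- Pandoc is called once with the full document; walk_block applies the
-- handlers above to every Str and Code element in the body blocks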
function Pandoc(el)
  -- skip metadata, just count body:
  pandoc.walk_block(pandoc.Div(el.blocks), wordcount)
  print(words .. " words in body")
  os.exit(0)
end
I put the following elisp code in the scratch buffer and evaluated it:
(defun pandoc-count-words ()
  (interactive)
  (shell-command-on-region (point-min) (point-max)
                           "pandoc --lua-filter wordcount.lua"))
From inside the MWE Markdown file (MWE-wordcount.Rmd) I issued M-x pandoc-count-words and I got the count in the minibuffer.
Using the first pattern I get 62 words.
Using the second pattern I get 22 words, more reasonable.
This method successfully avoids counting words inside a comment.
Questions
How do I get the Lua filter to avoid counting words in code blocks written with the first pattern (rather than the second)?
How do I get the Lua filter to avoid counting words in the headers (##)?
I would also appreciate it if the answer explained how the Lua code works.
This is a fun question; it combines quite a few technologies. The most important one here is R Markdown, and we need to look under the hood to understand what's going on.
One of the first steps in R Markdown processing is to parse the document, find all R code blocks (marked by the {r ...} pattern), execute those blocks, and replace them with the evaluation results. The modified input text is then passed to pandoc, which parses it into an abstract syntax tree (AST). That AST can be examined or modified with a filter before pandoc writes the document in the target format.
This is relevant because it is R Markdown, not pandoc, that recognizes input of the form
``` {r ...}
# code
```
as code blocks, while pandoc parses them as inline code that is identical to ` {r ...} # code `, i.e., all newlines in the code are ignored. The reason for this lies in pandoc's attribute parsing and the overloading of ` chars in Markdown syntax.¹
This gives us the answer to your first question: we can't! The two code snippets look exactly the same by the time they reach the filter in pandoc's AST; they cannot be distinguished. However, we get proper code blocks with newlines if we run R Markdown's knitr step to execute the code.
So one solution could be to make the wordcount.lua filter part of the R Markdown processing step, but to run the filter only when the COUNT_WORDS environment variable is set. We can do that by adding this snippet to the top of the filter file:
if not os.getenv 'COUNT_WORDS' then
  return {}
end
See the R Markdown cookbook on how to integrate the filter.
I'm leaving out the second question, because this answer is already quite long and that subquestion is worth a separate post.
¹: pandoc would recognize this as a code block if the r was preceded by a dot, as in
``` {.r}
# code
```
I have a very long string / data block that I want to search / grep within.
Example: ...AAABBAAAAVAACCDE...
In this example, I want to search for AVA.
The length of the string is hundreds of GBs.
My problem is that when I split the string into blocks of xx MB (to allow parallel execution), the search will fail at the boundaries.
Example
[Block 1] ...AAABBAAAA
[Block 2] VAACCDE...
In the example above, I will never find the string AVA.
Are there methods or helper functions to address this boundary problem?
Thanks for your help
In Spark it's not easy to read these custom formats, especially files that are not delimited by newlines, very efficiently out of the box.
In essence you need a FileInputStream over your original file (the one with the huge string), and you want to read it as a stream, one chunk/record at a time.
You can, for example, retain a cache of the last n characters from each chunk/record and prepend that to the next record, effectively creating an overlap.
E.g.:
import java.io.{BufferedOutputStream, FileInputStream, FileOutputStream}

val fileIn = "hugeString.txt"
val fileOut = "sparkFriendlyOutput.txt"

val reader = new FileInputStream(fileIn)
val writer = new BufferedOutputStream(new FileOutputStream(fileOut))

val recordSize = 9
val maxSearchLength = 3
val bytes = Array.fill[Byte](recordSize)(0)
val prefix = Array.fill[Byte](maxSearchLength)(' '.toByte)

Stream
  .continually((reader.read(bytes), bytes))
  .takeWhile(_._1 != -1)
  .foreach {
    case (_, buffer) =>
      writer.write(prefix ++ buffer :+ '\n'.toByte)
      Array.copy(buffer.toList.takeRight(maxSearchLength).toArray, 0, prefix, 0, maxSearchLength)
  }

writer.close()
reader.close()
This turns this string
1234567890123456789012345678901234567890123456789012345...
Into this File:
   123456789
789012345678
678901234567
567890123456
...
This does require you to pick a maximum length that you ever want to search for, because that's what the overlap is for.
This file could be read in Spark very easily.
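For illustration, here is a minimal PySpark sketch of that read-and-search step (the answer's own code above is Scala; the file name and the AVA pattern are simply the examples used in this answer):

```python
from pyspark import SparkContext

sc = SparkContext(appName="boundary-search")

# Every line carries the last maxSearchLength characters of the previous
# record as a prefix, so a match that straddles a chunk boundary still
# appears whole in at least one line.
matches = sc.textFile("sparkFriendlyOutput.txt").filter(lambda line: "AVA" in line)
print(matches.count())
```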
On the other hand, if you don't have the luxury of being able to store this on disk (or in memory), perhaps you could look into creating a custom Spark Streaming solution, where you either implement a custom streaming source (Structured Streaming) or a custom receiver (DStream) that reads the file via a similar FileInputStream + buffered-prefix approach.
PS: You could do smarter things with the overlap (at least divide it by two, so that not the entire maximum search length is duplicated).
PS: I assumed that you don't care about the absolute position. If you do, then I would store the original offset as a Long next to each line.
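As a rough sketch of that idea (in Python rather than the Scala above, and only an illustration reusing the file names and sizes from the example), the chunking loop could emit the chunk's absolute offset in front of each overlapped record:

```python
# Writes one line per chunk: "<offset>\t<overlap prefix + chunk>", so a match
# found in a line can be traced back to a position in the original string
# (the prefix part of each line starts at offset - MAX_SEARCH_LENGTH).
RECORD_SIZE = 9
MAX_SEARCH_LENGTH = 3

offset = 0
prefix = b" " * MAX_SEARCH_LENGTH
with open("hugeString.txt", "rb") as reader, \
     open("sparkFriendlyOutput.txt", "wb") as writer:
    while True:
        chunk = reader.read(RECORD_SIZE)
        if not chunk:
            break
        writer.write(str(offset).encode() + b"\t" + prefix + chunk + b"\n")
        prefix = chunk[-MAX_SEARCH_LENGTH:]
        offset += len(chunk)
```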
I am new to Spark programming and I got stuck while using map.
My data RDD contains:
Array[(String, Int)] = Array((steve,5), (bill,4), (" amzon",6), (flikapr,7))
and while using map again I am getting the below-mentioned error:
data.map((k,v) => (k,v+1))
<console>:32: error: wrong number of parameters; expected = 1
data.map((k,v) => (k,v+1))
I am trying to pass a tuple with a key and a value and want to get back a tuple with 1 added to the value.
Please help; why am I getting this error?
Thanks
You almost got it. rdd.map() operates on each record of the RDD and in your case, that record is a tuple. You can simply access the tuple members using Scala's underscore accessors like this:
val data = sc.parallelize(Array(("steve",5), ("bill",4), ("amzon",6), ("flikapr",7)))
data.map(t => (t._1, t._2 + 1))
(steve,6)
(bill,5)
(amzon,7)
(flikapr,8)
Or better yet, use Scala's powerful pattern matching like this:
data.map({ case (k, v) => (k, v+1) }).foreach(println)
(steve,6)
(bill,5)
(amzon,7)
(flikapr,8)
And here's the best option so far -- key-value tuples are so common in Spark that we usually refer to them as PairRDDs, and they come with plenty of convenience functions. For your use case you only need to operate on the value without changing the key, so you can simply use mapValues():
data.mapValues(_ + 1).foreach(println)
(steve,6)
(bill,5)
(amzon,7)
(flikapr,8)
I have a List of Lists that looks like this (Python3):
myLOL = ["['1466279297', '703.0']", "['1466279287', '702.0']", "['1466279278', '702.0']", "['1466279268', '706.0']", "['1466279258', '713.0']"]
I'm trying to use a list comprehension to convert the first item of each inner list to an int and the second item to a float so that I end up with this:
newLOL = [[1466279297, 703.0], [1466279287, 702.0], [1466279278, 702.0], [1466279268, 706.0], [1466279258, 713.0]]
I'm learning list comprehensions, can somebody please help me with this syntax?
Thank you!
[edit - to explain why I asked this question]
This question is a means to an end - the syntax requested is needed for testing. I'm collecting sensor data on a ZigBee network, and I'm using an Arduino to format the sensor messages in JSON. These messages are published to an MQTT broker (Mosquitto) running on a Raspberry Pi. A Redis server (also running on the Pi) serves as an in-memory message store. I'm writing a service (python-MQTT client) to parse the JSON and send a LoL (a sample of the data you see in my question) to Redis. Finally, I have a dashboard running on Apache on the Pi. The dashboard utilizes Highcharts to plot the sensor data dynamically (via a web socket connection between the MQTT broker and the browser). Upon loading the page, I pull historical chart data from my Redis LoL to "very quickly" populate the charts on my dashboard (before any realtime data is added dynamically). I realize I can probably format the sensor data the way I want in the Redis store, but that is a problem I haven't worked out yet. Right now, I'm trying to get my historical data to plot correctly in Highcharts. With the data properly formatted, I can get this piece working.
Well, you could use ast.literal_eval:
from ast import literal_eval
myLOL = ["['1466279297', '703.0']", "['1466279287', '702.0']", "['1466279278', '702.0']", "['1466279268', '706.0']", "['1466279258', '713.0']"]
items = [[int(literal_eval(i)[0]), float(literal_eval(i)[1])] for i in myLOL]
Try:
import json
newLOL = [[int(a[0]), float(a[1])] for a in (json.loads(s.replace("'", '"')) for s in myLOL)]
Here I'm considering each element of the list as JSON, but since it uses ' instead of " for the strings, I have to replace them first (this only works because you said there will be only numbers).
This may work? I wish I was more clever.
newLOL = []
for listObj in myLOL:
    listObj = listObj.replace('[', '').replace(']', '').replace("'", '').split(',')
    newListObj = [int(listObj[0]), float(listObj[1])]
    newLOL.append(newListObj)
This iterates through your current list and peels each string apart into a list by replacing the unwanted characters and splitting on the comma. Then we take the modified list object and create another new list object whose values are the respective int and float. We then append the prepared newListObj to the newLOL list. This assumes you want an actual set of lists within your list; your previously documented input list actually contains strings, which merely look like lists.
This is a very strange format, and the best solution is likely to change the code which generates it.
That being said, you can use ast.literal_eval to safely evaluate the elements of the list as Python tokens:
>>> import ast
>>> lit = ast.literal_eval
>>> [[lit(str_val) for str_val in lit(str_list)] for str_list in myLOL]
[[1466279297, 703.0], [1466279287, 702.0], [1466279278, 702.0], [1466279268, 706.0], [1466279258, 713.0]]
We need to do it twice - once to turn the string into a list containing two strings, and then once per resulting string to convert it into a number.
Note that this will succeed even if the strings contain other valid tokens. If you want to validate the format too, you'd want to do something like:
>>> def process_str_list(str_list):
... l = ast.literal_eval(str_list)
... if not isinstance(l, list):
... raise TypeError("Expected list")
... str_int, str_float = l
... return [int(str_int), float(str_float)]
...
>>> [process_str_list(str_list) for str_list in myLOL]
[[1466279297, 703.0], [1466279287, 702.0], [1466279278, 702.0], [1466279268, 706.0], [1466279258, 713.0]]
Your input consists of a list of strings, where each string is the string representation of a list. The first task is to convert the strings back into lists:
import ast
lol2 = map(ast.literal_eval, myLOL)  # [['1466279297', '703.0'], ...]
Now, you can simply get int and float values from lol2:
newlol = [[int(a[0]), float(a[1])] for a in lol2]
I am new to natural language processing and I want to use it to write a news aggregator (in Node.js in my case). Rather than just use a prepackaged framework, I want to learn the nuts and bolts, and I am starting with the NLP portion. I found this one tutorial that has been the most helpful so far:
http://www.p-value.info/2012/12/howto-build-news-aggregator-in-100-loc.html
In it, the author gets the RSS feeds and loops through them looking for the elements (or fields) title and description. I know Python and understand the code. But what I don't understand is what NLP is doing here with title and description under the hood (besides scraping and tokenizing, which is apparent... and those tasks don't need NLP).
import feedparser
import nltk
corpus = []
titles=[]
ct = -1
for feed in feeds:
    d = feedparser.parse(feed)
    for e in d['entries']:
        words = nltk.wordpunct_tokenize(nltk.clean_html(e['description']))
        words.extend(nltk.wordpunct_tokenize(e['title']))
        lowerwords = [x.lower() for x in words if len(x) > 1]
        ct += 1
        print ct, "TITLE", e['title']
        corpus.append(lowerwords)
        titles.append(e['title'])
(Reading your question more carefully, maybe this was all already obvious to you, but it doesn't look like anything deeper or more interesting is going on.)
wordpunct_tokenize is set up here (last line) as
wordpunct_tokenize = WordPunctTokenizer().tokenize
WordPunctTokenizer is implemented by this code:
class WordPunctTokenizer(RegexpTokenizer):
    def __init__(self):
        RegexpTokenizer.__init__(self, r'\w+|[^\w\s]+')
The heart of this is just the regular expression r'\w+|[^\w\s]+', which defines what strings are considered to be tokens by this tokenizer. There are two options, separated by the |:
\w+, that is, one or more "word" characters (alphabetic or numeric)
[^\w\s]+, one or more characters that are neither "word" characters nor whitespace; this matches any run of punctuation
Here is a reference for Python regular expressions.
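To make the pattern concrete, here is a quick demonstration with Python's re module (the sample sentence is made up):

```python
import re

# One alternative matches runs of word characters, the other matches runs of
# characters that are neither word characters nor whitespace (punctuation).
pattern = re.compile(r'\w+|[^\w\s]+')
print(pattern.findall("Breaking: stocks rise 3.5%!"))
# ['Breaking', ':', 'stocks', 'rise', '3', '.', '5', '%!']
```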
I have not dug into the RegexpTokenizer, but I assume it is set up such that the tokenize function returns an iterator that searches a string for the first match of the regular expression, then the next, etc.
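A quick way to check that assumption is to call the tokenizer directly; it produces the matches in order (in practice as a plain list):

```python
from nltk.tokenize import WordPunctTokenizer

# Same \w+|[^\w\s]+ pattern as above, applied left to right to the input.
print(WordPunctTokenizer().tokenize("U.S. stocks rise, again!"))
# ['U', '.', 'S', '.', 'stocks', 'rise', ',', 'again', '!']
```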