Best way to pre-process text messages using Hadoop - search

I am using Hadoop to process text messages (SMS), but I am not sure of the best way to pre-process this data so that I can do an efficient search. For example, after preprocessing the data, if someone searches for 'NY' I should be able to display the messages containing the word 'NY'.
Is it advisable to write the pre-processed data to an XML file rather than to a database?
NOTE: I have around 200K text messages in a .csv file.

The way I import preprocessed data into HDFS is to first import the data (the .csv file in your case) into a database and then create a table view that fine-tunes it to your needs. Then I import the data into HDFS using Sqoop. More information on Sqoop can be found here:
http://www.cloudera.com/blog/2009/06/introducing-sqoop/
For doing a Sqoop import from a database, take a look at
http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html#_connecting_to_a_database_server

You probably want to index the text messages, maybe using something like Lucene (a quick indexing sketch appears after the list below).

Go for Solr (especially suited to text mining):
Powerful full-text search
Provides dynamic clustering
Provides database integration as well
Supports .csv, .xml, Word, PDF, and more
Highly scalable
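As a rough illustration of the Solr/Lucene route, here is a minimal sketch using the pysolr client. It assumes a Solr core named 'sms' is already running and that the .csv file has 'id' and 'text' columns; both names are placeholders, not part of the question.

    import csv
    import pysolr

    # Connect to a hypothetical 'sms' core on a local Solr instance.
    solr = pysolr.Solr('http://localhost:8983/solr/sms', timeout=10)

    # Read the preprocessed messages out of the CSV file and index them.
    with open('messages.csv', newline='') as f:
        docs = [{'id': row['id'], 'text': row['text']} for row in csv.DictReader(f)]
    solr.add(docs)
    solr.commit()

    # Full-text search: every message containing the word 'NY'.
    for hit in solr.search('text:NY'):
        print(hit['text'])

Whether you feed the index from HDFS, from a database, or straight from the .csv file is mostly plumbing; the inverted index is what makes the keyword lookup fast.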

Related

How to parse big XML in a Google Cloud Function efficiently?

I have to extract data from XML files several hundred MB in size in a Google Cloud Function, and I was wondering if there are any best practices?
Since I am used to Node.js I was looking at some popular libraries like fast-xml-parser, but it seems cumbersome if you only want specific data from a huge XML. I am also not sure whether there are performance issues when the XML is too big. Overall this does not feel like the best way to parse and extract data from huge XMLs.
Then I was wondering if I could use BigQuery for this task, where I simply convert the XML to JSON and throw it into a dataset, where I can then use a query to retrieve the data I want.
Another solution could be to use Python for the job, since it is good at parsing and extracting data from XML, so even though I have no experience in Python I was wondering if this path could still be the best solution?
If anything above does not make sense or if one solution is preferable to the other or if anyone can share any insights I would highly appreciate it!
I suggest you check this article, in which they discuss how to load XML data into BigQuery using Python Dataflow. I think this approach may work in your situation.
Basically, what they suggest is (a rough sketch follows the list):
Parse the XML into a Python dictionary using the xmltodict package.
Specify a schema for the output table in BigQuery.
Use a Beam pipeline to take an XML file and use it to populate the BigQuery table.
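A minimal sketch of those three steps. The XML layout (a <records> root holding <record> elements with <id> and <text> children), the file name, and the project/dataset/table are all hypothetical; a real Dataflow job would also need pipeline options such as project, runner, and temp location.

    import xmltodict
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def parse_records(xml_string):
        # Step 1: turn the XML document into a Python dict with xmltodict,
        # then yield one BigQuery row per <record> element.
        doc = xmltodict.parse(xml_string)
        for record in doc['records']['record']:
            yield {'id': record['id'], 'text': record['text']}

    def run():
        with beam.Pipeline(options=PipelineOptions()) as p:
            (p
             | 'ReadXml' >> beam.Create([open('data.xml').read()])
             | 'ParseRecords' >> beam.FlatMap(parse_records)
             # Steps 2 and 3: declare the output schema and write the rows.
             | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
                   'my-project:my_dataset.my_table',
                   schema='id:STRING,text:STRING',
                   create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                   write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

    if __name__ == '__main__':
        run()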

Workflow for interpreting linked data in .ttl files with Python RDFLib

I am using turtle files containing biographical information for historical research. Those files are provided by a major library and most of the information in the files is not explicit. While people's professions, for instance, are sometimes stated alongside links to the library's URIs, I only have URIs in the majority of cases. This is why I will need to retrieve the information behind them at some point in my workflow, and I would appreciate some advice.
I want to use Python's RDFLib for parsing the .ttl files. What is your recommended workflow? Should I read the prefixes I am interested in first, then store the results in .txt (?) and then write a script to retrieve the actual information from the web, replacing the URIs?
I have also seen that there are ways to convert RDFs directly to CSV, but although CSV is nice to work with, I would get a lot of unwanted "background noise" by simply converting all the data.
What would you recommend?
RDFLib is all about working with RDF data. If you have RDF data, my suggestion is to do as much RDF-native work as you can and only export to CSV if you want to do something like print tabular results or load them into Pandas DataFrames. Of course there is always more than one way to do things, so you could manipulate the data as CSV, but RDF, by design, carries far more information than a CSV file can, so when you're manipulating RDF data you have more to get hold of.
most of the information in the files is not explicit
Better phrased: most of the information is indicated with objects identified by URIs, not given as literal values.
I want to use Python's RDFLib for parsing the .ttl files. What is your recommended workflow? Should I read the prefixes I am interested in first, then store the results in .txt (?) and then write a script to retrieve the actual information from the web, replacing the URIs?
No! You should store the .ttl files you can get, and then you may indeed retrieve all the other data referred to by URI. But, presumably, that data is also in RDF form, so you should download it into the same graph you loaded the initial .ttl files into. Then you have the full graph, with links and literal values in it, at your disposal to manipulate with SPARQL queries.
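A minimal sketch of that workflow with RDFLib. The file name, the hasProfession predicate, and the example.org URI are hypothetical placeholders standing in for whatever vocabulary the library actually uses.

    from rdflib import Graph

    g = Graph()

    # Load the local turtle file(s) provided by the library.
    g.parse("biographies.ttl", format="turtle")

    # Fetch the RDF behind a URI of interest into the *same* graph;
    # rdflib will try to negotiate an RDF serialization from the server.
    g.parse("https://example.org/authority/profession/teacher")

    # With links and literals now in one graph, SPARQL can resolve both.
    query = """
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?person ?label WHERE {
            ?person <https://example.org/vocab/hasProfession> ?profession .
            ?profession rdfs:label ?label .
        }
    """
    for person, label in g.query(query):
        print(person, label)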

How can I write to HDFS from Spark to make access to that data faster?

Assume that I am not using tools like Hive or HBase (Spark is unable to use Hive indexes for optimization anyway). What is the best way to write data to HDFS so that access to that data is faster?
What I was thinking of is saving many different files, whose names are derived from the keys. Let us say we have a database of people who are identified by their first name and surname. Maybe I could save files named with the first letters of the first name and surname. In this way we would have 26x26 = 676 files. So, for example, if we want to see the record of Alan Walker, we would just need to load the file AW. Would this be a good way, or are there much better ways to do this kind of thing?
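For reference, the file-per-key-prefix idea above might look roughly like this in PySpark, using partitioned Parquet output; the column names and paths are made up.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("prefix-partitioning").getOrCreate()

    people = spark.read.csv("hdfs:///data/people.csv", header=True)

    # Derive a two-letter prefix, e.g. "AW" for Alan Walker.
    people = people.withColumn(
        "prefix",
        F.concat(F.upper(F.substring("firstname", 1, 1)),
                 F.upper(F.substring("surname", 1, 1))))

    # One directory per prefix (.../prefix=AW/...), written as Parquet.
    people.write.partitionBy("prefix").parquet("hdfs:///data/people_by_prefix")

    # On read, filtering on the partition column only touches that directory.
    aw = (spark.read.parquet("hdfs:///data/people_by_prefix")
               .filter(F.col("prefix") == "AW"))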
I believe that an index is what you need. In HDFS, as in databases, indexing has some overhead on insertion but makes queries much faster.
HDFS does not have any sort of built-in index, since it is a distributed file system rather than a database, but the requirement you mention has been implemented by third-party programs.
There are many indexing tools that work with HDFS; you can have a look at Apache Solr, for instance.
Here is a tutorial to keep you going: https://lucene.apache.org/solr/guide/6_6/running-solr-on-hdfs.html

Efficient way to store a JSON string in a Cassandra column?

Cassandra newbie question. I'm collecting some data from a social networking site using REST calls. So I end up with the data coming back in JSON format.
The JSON is only one of the columns in my table. I'm trying to figure out what the "best practice" is for storing the JSON string.
First I thought of using the map type, but the JSON contains a mix of strings, numerical types, etc. It doesn't seem like I can declare wildcard types for the map key/value. The JSON string can be quite large, probably over 10KB in size. I could potentially store it as a string, but it seems like that would be inefficient. I would assume this is a common task, so I'm sure there are some general guidelines for how to do this.
I know Cassandra has native support for JSON, but from what I understand, that's mostly used when the entire JSON map matches 1-1 with the database schema. That's not the case for me. The schema has a bunch of columns and the JSON string is just a sort of "payload". Is it better to store the JSON string as a blob or as text? BTW, the Cassandra version is 2.1.5.
Any hints appreciated. Thanks in advance.
In the Cassandra storage engine there's really not a big difference between a blob and text, since Cassandra essentially stores text as blobs. And yes, the "native" JSON support you speak of is only for when your data model matches your JSON model, and it's only in Cassandra 2.2+.
I would store it as a text type, and you shouldn't have to implement anything to compress your JSON data when sending it (or handle decompressing), since Cassandra's binary protocol supports transport compression. Also make sure your table stores the data compressed with the same compression algorithm (I suggest LZ4, since it's the fastest algorithm implemented) to save on compression work for each read request. Thus if you configure compressed storage and use transport compression, you don't have to implement either yourself.
You didn't say which client driver you're using, but here's the documentation on how to set up transport compression for the DataStax Java driver.
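As a rough sketch of that setup, shown with the DataStax Python driver since the question doesn't name a driver: the keyspace, table, and column names are hypothetical, and the compression option uses the Cassandra 2.1-style sstable_compression sub-option to match the version in the question.

    import json
    from cassandra.cluster import Cluster

    # Transport compression between driver and cluster (requires the lz4 package).
    cluster = Cluster(['127.0.0.1'], compression='lz4')
    session = cluster.connect('my_keyspace')

    # Table-level compression with the same algorithm (Cassandra 2.1 syntax).
    session.execute("""
        CREATE TABLE IF NOT EXISTS posts (
            id uuid PRIMARY KEY,
            source text,
            payload text
        ) WITH compression = {'sstable_compression': 'LZ4Compressor'}
    """)

    # The JSON string is stored as plain text in the payload column.
    doc = json.dumps({"user": "alice", "likes": 42, "tags": ["a", "b"]})
    session.execute(
        "INSERT INTO posts (id, source, payload) VALUES (uuid(), %s, %s)",
        ("twitter", doc))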
It depends on how you want to query your JSON. There are 3 possible strategies:
Store as a string
Store as a compressed blob
Store as a blob
Option 1 has the advantage of being human-readable when you query your data on the command line with cqlsh or when you want to debug live data directly. The drawback is the size of this JSON column (10 KB).
Option 2 has the advantage of keeping the JSON payload small, because text compresses at a pretty decent ratio. The drawbacks are: a. you need to take care of compression/decompression on the client side, and b. it's not directly human-readable.
Option 3 has the drawbacks of option 1 (size) and option 2 (not human-readable). A small round trip for option 2 is sketched below.
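A tiny illustration of option 2's client-side compression round trip, assuming Python and the standard zlib module; the column itself would be declared as blob.

    import json
    import zlib

    payload = json.dumps({"user": "alice", "likes": 42})

    # Client-side: compress before writing to a blob column...
    compressed = zlib.compress(payload.encode('utf-8'))

    # ...and decompress after reading it back.
    restored = json.loads(zlib.decompress(compressed).decode('utf-8'))
    assert restored["likes"] == 42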

How to do ad-hoc searches on log data stored in HBase

I have log file data stored in HBase. What would be the fastest way to do quick keyword searches over a log in HBase?
I read something about creating an inverted index, but I'm not clear on what the index would look like or how to create one.
I also looked at hbasene- https://github.com/akkumar/hbasene
Any pointers on how to go about the searching would be great.
You should look at solutions that integrate Lucene (which is an inverted index plus more interesting features, like stemming) with HBase. hbasene is one option, but more complete solutions are SolBase and Lily.
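To make the inverted-index idea concrete, here is a rough sketch using the happybase Python client. The table names, column families, and whitespace tokenizer are made up, and it assumes both tables already exist; a Lucene-based solution like the ones above would handle tokenization, stemming, and ranking for you.

    import happybase

    connection = happybase.Connection('localhost')
    logs = connection.table('logs')          # existing table: row key -> log line
    index = connection.table('log_index')    # index table: keyword -> log row keys

    # Build the index: one index row per keyword, one column per log row key.
    for row_key, data in logs.scan(columns=[b'd:line']):
        for word in data[b'd:line'].decode('utf-8').lower().split():
            index.put(word.encode('utf-8'), {b'refs:' + row_key: b''})

    # Query: all log rows that contain the keyword 'error'.
    hit = index.row(b'error')
    matching_row_keys = [qualifier[len(b'refs:'):] for qualifier in hit]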
