What is the best way to write Google Cloud Dataflow output to Cassandra?
I can't seem to find many people doing it. After searching for a while, the only thing I found was https://github.com/benjumanji/cassandra-dataflow, which has only 3 commits and is 4 months old.
In general, is it a good idea to write Dataflow's output to Cassandra?
One possible approach would be to implement a custom sink (for batch): https://cloud.google.com/dataflow/model/custom-io#creating-sinks.
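If you go the hand-rolled route instead, a minimal sketch of the bare write path (not the custom Sink API itself) could be a ParDo that writes each element with the DataStax Python driver. Everything below — contact points, keyspace, table, and columns — is a placeholder:

```python
# Sketch only: write each element to Cassandra from a ParDo.
# Assumes apache-beam and cassandra-driver are installed; the contact point,
# keyspace ("analytics"), table ("events"), and columns are made up.
import apache_beam as beam
from cassandra.cluster import Cluster


class WriteToCassandra(beam.DoFn):
    def setup(self):
        # One session per worker.
        self._cluster = Cluster(["10.0.0.1"])
        self._session = self._cluster.connect("analytics")
        self._stmt = self._session.prepare(
            "INSERT INTO events (id, payload) VALUES (?, ?)")

    def process(self, element):
        event_id, payload = element
        self._session.execute(self._stmt, (event_id, payload))

    def teardown(self):
        self._cluster.shutdown()


with beam.Pipeline() as p:
    (p
     | beam.Create([("id-1", "hello"), ("id-2", "world")])
     | beam.ParDo(WriteToCassandra()))
```

A proper custom sink would add the initialization/finalization semantics described in the linked docs; this only shows the write itself.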
I have to extract data from XML files several hundred MB in size in a Google Cloud Function, and I was wondering if there are any best practices?
Since I am used to Node.js, I was looking at some popular libraries like fast-xml-parser, but it seems cumbersome if you only want specific data from a huge XML file. I am also not sure if there are performance issues when the XML gets that big. Overall, this does not feel like the best solution for parsing and extracting data from huge XML files.
Then I was wondering if I could use BigQuery for this task, where I simply convert the XML to JSON, throw it into a dataset, and then use a query to retrieve the data I want.
Another solution could be to use Python for the job, since it is good at parsing and extracting data from XML. Even though I have no experience with Python, I was wondering if this path could still be the best solution?
If anything above does not make sense, or if one solution is preferable to the other, or if anyone can share any insights, I would highly appreciate it!
I suggest you check this article, in which they discuss how to load XML data into BigQuery using Python Dataflow. I think this approach may work in your situation.
Basically, what they suggest is the following (a rough sketch follows the list):
Parse the XML into a Python dictionary using the xmltodict package.
Specify a schema of the output table in BigQuery.
Use a Beam pipeline to take an XML file and use it to populate a BigQuery table.
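Here is a minimal sketch of that pipeline, assuming the Beam Python SDK and xmltodict, with a made-up XML layout and placeholder BigQuery table and schema (the linked article has the real pipeline):

```python
# Sketch only: parse a small XML document with xmltodict and load the
# records into BigQuery with a Beam pipeline. The XML layout, table name,
# and schema below are assumptions for illustration.
import apache_beam as beam
import xmltodict

XML = """
<records>
  <record><id>1</id><name>alpha</name></record>
  <record><id>2</id><name>beta</name></record>
</records>
"""

def to_rows(xml_string):
    doc = xmltodict.parse(xml_string)
    for rec in doc["records"]["record"]:
        yield {"id": int(rec["id"]), "name": rec["name"]}

with beam.Pipeline() as p:
    (p
     | beam.Create([XML])
     | beam.FlatMap(to_rows)
     | beam.io.WriteToBigQuery(
           "my-project:my_dataset.my_table",   # placeholder table
           schema="id:INTEGER,name:STRING",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
```

For very large files you would read the XML from Cloud Storage instead of inlining it, but the parse-then-write shape stays the same.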
I was wondering if LOESS (locally estimated scatterplot smoothing) regression is a built-in function in Spark/PySpark (I'm most interested in the PySpark answer, but both would be interesting).
I did some research and couldn't find one, so I decided to try coding it myself using pandas UDFs. While doing that, when I displayed a scatter plot of the synthetic data I had created to start testing my algorithm, Azure Databricks (on which I'm coding) offered to automatically compute/display the LOESS of my dataset:
So maybe there is indeed a built-in LOESS that I just couldn't find? If not (and Databricks alone is responsible for this), is there any way to access the result of Databricks's LOESS computation, or the function Databricks is using to do it?
Thank you in advance :)
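As far as I know there is no built-in LOESS in Spark MLlib, so the pandas-UDF route described in the question is a common workaround. A rough sketch, assuming statsmodels is installed on the cluster; the column names, grouping key, and frac value are all placeholders:

```python
# Sketch only: per-group LOESS/LOWESS via pandas UDFs, using statsmodels.
# Column names (grp, x, y) and frac=0.3 are assumptions for illustration.
import pandas as pd
from pyspark.sql import SparkSession
from statsmodels.nonparametric.smoothers_lowess import lowess

spark = SparkSession.builder.getOrCreate()

def smooth(pdf: pd.DataFrame) -> pd.DataFrame:
    # lowess returns the points sorted by x: column 0 = x, column 1 = fitted y.
    fitted = lowess(pdf["y"], pdf["x"], frac=0.3, return_sorted=True)
    return pd.DataFrame({"grp": pdf["grp"].iloc[0],
                         "x": fitted[:, 0],
                         "y_smooth": fitted[:, 1]})

# Small synthetic dataset just to exercise the UDF.
df = spark.createDataFrame(
    [("a", float(i), float(i) ** 0.5 + (i % 3)) for i in range(100)],
    ["grp", "x", "y"])

result = df.groupBy("grp").applyInPandas(
    smooth, schema="grp string, x double, y_smooth double")
result.show(5)
```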
I'm searching for a tool for storing documentation about tables, data sources, ETL processes, etc., for my DWH.
I've seen some presentations on YouTube, but I've found that most companies use their own custom system or something like a wiki with plain-text descriptions.
I don't think that is very useful for analysts, managers, and other users who need to find out what data exists and how to use it to calculate the statistics they need.
Can you please suggest what I could use for this case, and what I should read?
While Airflow ships with some support for Apache Atlas, in my opinion one of the best data-lake metadata management tools right now is Lyft's Amundsen. They've also released lyft/amundsendatabuilder, the introduction of which says:
Amundsen Databuilder is a data ingestion library, which is inspired by Apache Gobblin. It could be used in an orchestration framework (e.g. Apache Airflow) to build data from Amundsen. You could use the library either with an ad-hoc Python script (example) or inside an Apache Airflow DAG (example).
In a POC, we are using Cassandra to store (among other things) parsed Apache access logs, and we use it together with Apache Spark + Zeppelin. We have managed to get things working, but we are very uncertain about how to model the data correctly.
Edit: Our queries will span months and years rather than weeks and days. Against production, jobs will probably be executed daily (at least for now), and we will use a smaller dataset during development.
Since this will be used for analytics ONLY, the queries can be pretty much anything but of course we could consider a handful of queries in advance.
E.g.:
latency percentiles
geo distribution
sum of requests
popular REST resources
... etc
Partition key + primary key. This is really difficult... The only thing I can think of is something like ((userid, [webresource]), timestamp); see the sketch below.
At least this would give a fairly even distribution. Otherwise we would have to use a checksum or something, which feels wrong.
Or should I have different tables for different types, like latency, geo, etc.? Or is this a good case for materialized views?
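To make the idea above concrete, here is one rough sketch of that table, created via the DataStax Python driver. The keyspace and the metric columns (latency_ms, geo, status) are assumptions, not a recommendation:

```python
# Sketch only: a table keyed by ((userid, webresource), event_time),
# as proposed above. The keyspace "logs" is assumed to exist already;
# metric columns are placeholders.
from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(["127.0.0.1"])        # assumed contact point
session = cluster.connect("logs")       # assumed keyspace

session.execute("""
    CREATE TABLE IF NOT EXISTS access_log (
        userid       text,
        webresource  text,
        event_time   timestamp,
        latency_ms   int,
        geo          text,
        status       int,
        PRIMARY KEY ((userid, webresource), event_time)
    ) WITH CLUSTERING ORDER BY (event_time DESC)
""")
```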
I have googled for something like this without any luck, so perhaps Cassandra is a poor fit for this, but we would still really like to see how far we can get.
Anyway, any input is highly appreciated!
Regards /Johan
I have log file data stored in HBase. What would be the fastest way to do quick keyword searches over the logs in HBase?
I read something about creating an inverted index, but I'm not clear on what the index would look like or even how to create one.
I also looked at hbasene: https://github.com/akkumar/hbasene
Any pointers on how to go about the searching would be great.
You should look at solutions that integrate Lucene (which is an inverted index plus more interesting features like stemming, etc.) with HBase. hbasene is one option, but more complete solutions are SolBase and Lily.
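For reference on what a hand-rolled inverted index could look like: a second HBase table keyed by keyword, with one cell per log row that contains that keyword, so a keyword search becomes a single row read. A rough sketch using happybase, where the table and column family names are made up and assumed to already exist:

```python
# Sketch only: a minimal hand-rolled inverted index in HBase via happybase.
# The Thrift host, index table ("log_index"), and column family ("cf")
# are assumptions; the index table must already exist.
import happybase

conn = happybase.Connection("localhost")
index = conn.table("log_index")   # row key = keyword

def index_log_line(log_rowkey: bytes, line: str) -> None:
    # One index cell per (keyword, log row) pair.
    for word in set(line.lower().split()):
        index.put(word.encode(), {b"cf:" + log_rowkey: b""})

def search(keyword: str):
    # A keyword lookup is a single-row read on the index table;
    # the qualifiers of that row are the matching log row keys.
    row = index.row(keyword.lower().encode())
    return [qualifier[len(b"cf:"):] for qualifier in row]

index_log_line(b"log#0001", "GET /api/users 200 12ms")
print(search("users"))   # -> [b'log#0001']
```

The Lucene-based options above handle tokenization, stemming, and ranking for you, which is why they are usually the better route beyond simple exact-keyword lookups.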