I have a requirement to store a large amount of data with fast processing and high scalability, so I chose Hadoop for this, but I also need data collaboration, and I know SharePoint is a good candidate for that.
Please let me know how to integrate SharePoint with Hadoop.
I know about SSIS, which is used for SQL Server integration with Hadoop, but I need real-world examples so I can work out the exact solution for this.
Set up the HDFS NFS Gateway and copy the SharePoint files. You could also use a basic script to PUT the files into HDFS (see the sketch below). Either approach requires an edge node that has access to the SharePoint repository and an HDFS client.
HDFS NFS Gateway: https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html
HDFS PUT: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/FileSystemShell.html#put
If you already use HDP and it is installed with Ambari, HDFS NFS Gateway is just another service to add via Ambari.
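For the scripted option, here is a minimal sketch of the kind of PUT script meant above, assuming the SharePoint files have already been exported to a local staging directory on the edge node; the local and HDFS paths are placeholders, not values from your environment:

```python
# Minimal sketch: push files that were exported from SharePoint onto the edge
# node into HDFS using the standard `hdfs dfs -put` shell command.
# The local staging directory and HDFS target directory are placeholders.
import subprocess
from pathlib import Path

LOCAL_EXPORT_DIR = Path("/data/sharepoint_export")  # hypothetical local staging dir
HDFS_TARGET_DIR = "/landing/sharepoint"             # hypothetical HDFS directory


def put_file(local_path: Path, hdfs_dir: str) -> None:
    """Copy a single local file into HDFS, overwriting any existing copy."""
    subprocess.run(["hdfs", "dfs", "-put", "-f", str(local_path), hdfs_dir], check=True)


if __name__ == "__main__":
    # Create the target directory once, then upload every staged file.
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", HDFS_TARGET_DIR], check=True)
    for f in LOCAL_EXPORT_DIR.iterdir():
        if f.is_file():
            put_file(f, HDFS_TARGET_DIR)
```

How you export the files from SharePoint to the staging directory in the first place is up to you (the NFS Gateway route above avoids that step entirely).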
I have been looking for a way to secure Parquet files, column-wise, for Spark access. Ideally, that would work the same way Apache Ranger works for Hive, i.e., a Sysadmin defines the access policies for different groups and columns.
I have been trying Ranger through Hortonworks HDP; however, it seems that the plug-ins for Spark and Parquet are not there yet.
I have also been able to devise a solution using Apache Drill and views, however, it is not acceptable right now mainly because of the still scarce community support for Drill.
Has anyone faced the same requirement and/or have some directions for a solution?
After a great deal of research, I've come to the conclusion that this is not possible.
The way Ranger works with other tools (HDFS, Hive, HBase, etc.) is by using plug-ins that implement hooks provided by those tools. For instance, to create a custom plug-in to secure Hive, one needs to create a HiveAuthorizer through the HiveAuthorizerFactory. But there's no such hook for Parquet, as it is nothing more than a file format.
A possible solution that would allow securing Parquet files at the column level from Ranger would be to create an extension for Ranger's HDFS plug-in. This extension would implement the column-level access rules for Parquet files defined through Ranger. That way, we could seamlessly secure Parquet files the same way we do Hive or HBase, as long as the files are stored in HDFS.
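In the meantime, the stock HDFS plug-in can at least restrict access at the file/directory level. As a rough illustration only, here is a sketch of creating such a path-level policy through Ranger's public REST API from Python; the admin URL, credentials, service name, group, and path are all hypothetical, and the exact JSON layout may differ between Ranger versions:

```python
# Sketch only: define a path-level Ranger policy over a directory of Parquet
# files via Ranger's public REST API.  This restricts who can read the files
# at all; it does NOT provide column-level control.  Host, credentials,
# service name, group and path below are hypothetical placeholders.
import requests

RANGER_ADMIN = "http://ranger.example.com:6080"  # hypothetical Ranger admin URL
AUTH = ("admin", "admin")                        # hypothetical credentials

policy = {
    "service": "cluster_hadoop",                 # hypothetical HDFS service name in Ranger
    "name": "parquet_sales_read",
    "isEnabled": True,
    "resources": {
        "path": {"values": ["/data/sales_parquet"], "isRecursive": True}
    },
    "policyItems": [
        {
            "groups": ["analysts"],              # hypothetical group
            "accesses": [{"type": "read", "isAllowed": True},
                         {"type": "execute", "isAllowed": True}],
        }
    ],
}

resp = requests.post(f"{RANGER_ADMIN}/service/public/v2/api/policy",
                     json=policy, auth=AUTH)
resp.raise_for_status()
print("Created policy id:", resp.json().get("id"))
```

Again, this only limits who can read the Parquet files at all; masking or filtering individual columns would still require the kind of plug-in extension described above.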
I am new to big data and Spark (PySpark).
Recently I set up a Spark cluster and wanted to use the Cassandra File System (CFS) on my Spark cluster to help upload files.
Can anyone tell me how to set it up and briefly introduce how to use CFS? (e.g., how to upload files and from where)
BTW, I don't even know how to use HDFS (I downloaded the pre-built spark-bin-hadoop, but I can't find Hadoop on my system).
Thanks in advance!
CFS only exists in DataStax Enterprise and isn't appropriate for most distributed-file applications. It is primarily focused on being a substitute for HDFS for map/reduce jobs and for small, temporary, but distributed files.
To use it, just use the cfs:// URI and make sure you launch your application with dse spark-submit.
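A minimal PySpark sketch of what that looks like; the input and output paths are hypothetical, and the script has to be launched with dse spark-submit so the CFS filesystem implementation is available:

```python
# Sketch: read from and write to CFS from PySpark via the cfs:// scheme.
# Paths are hypothetical; run the script with `dse spark-submit` so the CFS
# Hadoop filesystem implementation is on the classpath.
from pyspark import SparkContext

sc = SparkContext(appName="cfs-example")

# Read a text file that already lives in CFS.
lines = sc.textFile("cfs:///data/input/events.txt")

# Trivial processing step, then write the result back to CFS.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("cfs:///data/output/word_counts")
```

You would launch it with dse spark-submit cfs_example.py (the script name is arbitrary). To upload files into CFS in the first place, DSE also ships Hadoop-style filesystem commands (e.g., dse hadoop fs -put), though the exact command can depend on the DSE version.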
I am new to Apache Spark. I would like to know whether it is possible to store data using Apache Spark, or is it only a processing tool?
Thanks for spending your time,
Satya
Spark is not a database, so it cannot "store data". It processes data and stores it temporarily in memory, but that's not persistent storage.
In a real-life use case you usually have a database or data repository from which you access the data in Spark.
Spark can access data that's in:
SQL databases (anything that can be connected using a JDBC driver)
Local files
Cloud storage (e.g., Amazon S3)
NoSQL databases
Hadoop Distributed File System (HDFS)
and many more...
Detailed description can be found here: http://spark.apache.org/docs/latest/sql-programming-guide.html#sql
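To make that concrete, here is a small PySpark sketch that reads from a few of those sources and writes a result back out; the JDBC URL, credentials, table name, bucket, and paths are all hypothetical, and the JDBC driver and hadoop-aws connector are assumed to be on the classpath:

```python
# Sketch: Spark reads from external storage, processes in memory, and writes
# results back out -- it never acts as the persistent store itself.
# All connection strings, table names and paths below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-data-sources").getOrCreate()

# A SQL database over JDBC (the JDBC driver jar must be on the classpath).
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://dbhost:5432/shop")
          .option("dbtable", "orders")
          .option("user", "report")
          .option("password", "secret")
          .load())

# A local file and a file in HDFS.
local_csv = spark.read.csv("file:///tmp/customers.csv", header=True)
hdfs_parquet = spark.read.parquet("hdfs:///warehouse/events")

# Cloud storage such as S3 (requires the hadoop-aws / s3a connector).
s3_json = spark.read.json("s3a://my-bucket/logs/2016/")

# Process, then persist the result to external storage, not to Spark itself.
orders.groupBy("customer_id").count() \
      .write.mode("overwrite").parquet("hdfs:///warehouse/order_counts")
```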
Apache Spark is primarily a processing engine. It works with underlying file systems such as HDFS, S3, and other supported file systems, and it can read data from relational databases as well. But primarily it is an in-memory distributed processing tool.
As you can read in Wikipedia, Apache Spark is defined as:
is an open source cluster computing framework
When we talk about computing, we mean a processing tool; in essence it lets you work in a pipeline scheme (somewhat like ETL): you read the dataset, you process the data, and then you store the processed data, or models that describe the data.
If your main objective is to distribute your data, there are good alternatives such as HDFS (the Hadoop Distributed File System), among others.
I'm trying to implement security on my Hadoop data. I'm using Cloudera Hadoop.
Below are the two specific things I'm looking for:
1. Role-based authorization and authentication
2. Encryption of data residing in HDFS
I have looked into Kerberos, but it doesn't provide encryption for data already residing in HDFS.
Are there any other security tools I can go for? Has anyone implemented the above two security features in Cloudera Hadoop?
Please suggest
I think Apache Sentry will be best for you. You can find more information here.
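Sentry covers the role-based authorization part. For the second requirement, encryption of data residing in HDFS, HDFS transparent encryption (encryption zones backed by the Hadoop KMS) is the usual approach in Hadoop itself. Below is a minimal sketch of scripting it, assuming a KMS is already configured; the key name and paths are placeholders, and an encryption zone can only be created on an empty directory, so existing data has to be copied into it:

```python
# Sketch: create an HDFS encryption zone so that data written under it is
# encrypted at rest.  Assumes the Hadoop KMS is configured; the key name and
# paths are placeholders.
import subprocess


def run(cmd):
    """Run a shell command and fail loudly if it does not succeed."""
    subprocess.run(cmd, check=True)


# 1. Create an encryption key in the KMS.
run(["hadoop", "key", "create", "mykey"])

# 2. Create an empty directory and turn it into an encryption zone.
run(["hdfs", "dfs", "-mkdir", "-p", "/secure_zone"])
run(["hdfs", "crypto", "-createZone", "-keyName", "mykey", "-path", "/secure_zone"])

# 3. Copy existing (unencrypted) data into the zone; it is encrypted on write.
run(["hadoop", "distcp", "/existing_data", "/secure_zone/existing_data"])
```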
I am working on a Hadoop project and generating lots of data in my local cluster. Sooner or later I will be using a cloud-based Hadoop solution, because my Hadoop cluster is very small compared to the real workload; however, I haven't decided yet which one I will use, i.e., Windows Azure based, EMR, or something else. I am generating lots of data locally and want to store it in some cloud-based storage, given that I will use this data with Hadoop later, but very soon.
I am looking for suggestions, based on someone's experience, to decide which cloud store to choose. Thanks in advance.
First of all, it is a great question. Let's try to understand "how data is processed in Hadoop":
In Hadoop, all the data is processed on the Hadoop cluster, which means that when you process any data, it is first copied from its source to HDFS, an essential component of Hadoop.
Only after the data is copied to HDFS do you run Map/Reduce jobs on it to get your results.
That means it does not matter what or where your data source is (Amazon S3, Azure Blob, SQL Azure, SQL Server, an on-premise source, etc.); you will have to move/transfer/copy your data from the source to HDFS, within the limits of Hadoop.
Once the data is processed in the Hadoop cluster, the result is stored in the location you have configured in your job. The output destination can be HDFS or an outside location accessible from the Hadoop cluster.
Once the data is copied to HDFS you can keep it on HDFS as long as you want, but you will have to pay the price of keeping the Hadoop cluster running.
In some cases, when you run your Hadoop job at intervals and the data move/copy can be done quickly, it is a good strategy to 1) acquire a Hadoop cluster, 2) copy the data, 3) run the job, and 4) release the cluster (a sketch of steps 2 and 3 follows below).
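As a rough illustration of the copy-data and run-job steps, assuming the input already sits in S3 and the s3a connector is configured on the cluster; the bucket, paths, and the job jar/class are placeholders:

```python
# Sketch of the "copy data, run job" steps: pull input from cloud storage
# into HDFS with distcp, then launch the job against the HDFS copy.
# Bucket, paths, jar and class names are placeholders.
import subprocess


def run(cmd):
    """Run a shell command and fail loudly if it does not succeed."""
    subprocess.run(cmd, check=True)


# Copy data from cloud storage (here S3 via s3a) into HDFS.
run(["hadoop", "distcp", "s3a://my-bucket/raw-data", "hdfs:///data/raw-data"])

# Run the job against the HDFS copy.
run(["hadoop", "jar", "my-job.jar", "com.example.MyJob",
     "hdfs:///data/raw-data", "hdfs:///data/output"])

# Optionally push the results back to cloud storage before releasing the cluster.
run(["hadoop", "distcp", "hdfs:///data/output", "s3a://my-bucket/results"])
```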
So, based on the above details, when you choose a data source in the cloud for your Hadoop cluster you have to consider the following:
If you have a large amount of data to process (which is normal with Hadoop clusters), compare the different data sources by the time it will take to copy/move the data from them to HDFS, because this will be your first step.
You need to choose a data source with the lowest network latency so you can get data in and out as fast as possible.
You also need to consider how you will move a large amount of data from your current location to any cloud store. The best option would be a storage service to which you can ship your data on disk (HDD/tape, etc.), because uploading multiple TB of data will take a great amount of time.
Amazon EMR (already available), Windows Azure (HadoopOnAzure, in CTP), and Google (BigQuery, in preview, based on Google Dremel) provide pre-configured Hadoop clusters in the cloud, so first choose where you want to run your Hadoop job and then consider the cloud storage.
Even if you choose one cloud data storage and later decide to move to another because you want to use a different Hadoop cluster in the cloud, you can certainly transfer the data; however, consider the time and the data-transfer support available to you.
For example, with HadoopOnAzure you can connect various data sources, e.g., Amazon S3, Azure Blob Storage, SQL Server, and SQL Azure, so support for a variety of data sources is best with any cloud Hadoop cluster (see the sketch below).
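For instance, here is a PySpark sketch that reads input from both S3 and Azure Blob Storage on a cloud Hadoop cluster and lands the combined result in HDFS; the bucket, container, account names, and keys are placeholders, the s3a and wasb connectors are assumed to be installed, and in practice the credentials come from cluster configuration rather than application code:

```python
# Sketch: read input from both Amazon S3 (s3a) and Azure Blob Storage (wasb)
# on a cloud Hadoop cluster, then land the combined result in HDFS.
# Bucket, container, account names and keys are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cloud-sources")
         .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
         .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
         .config("spark.hadoop.fs.azure.account.key.myaccount.blob.core.windows.net",
                 "YOUR_AZURE_ACCOUNT_KEY")
         .getOrCreate())

s3_logs = spark.read.text("s3a://my-bucket/raw-logs/")
azure_logs = spark.read.text("wasb://mycontainer@myaccount.blob.core.windows.net/raw-logs/")

# Combine both inputs and persist the result to HDFS for later jobs.
s3_logs.union(azure_logs).write.mode("overwrite").parquet("hdfs:///data/combined-logs")
```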