How to specify the path where saveAsTable saves files to? - apache-spark

I am trying to save a DataFrame to S3 in PySpark on Spark 1.4 using DataFrameWriter:
df = sqlContext.read.format("json").load("s3a://somefile")
df_writer = pyspark.sql.DataFrameWriter(df)
df_writer.partitionBy('col1')\
    .saveAsTable('test_table', format='parquet', mode='overwrite')
The Parquet files went to "/tmp/hive/warehouse/....", which is a local tmp directory on my driver.
I did set hive.metastore.warehouse.dir in hive-site.xml to an "s3a://...." location, but Spark doesn't seem to respect my Hive warehouse setting.

Use the path option:
df_writer.partitionBy('col1')\
    .saveAsTable('test_table', format='parquet', mode='overwrite',
                 path='s3a://bucket/foo')

You can use insertInto(tablename) to overwrite an existing table since 1.4.
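A minimal sketch of that route, assuming test_table already exists in the metastore and the columns of df match its schema:
# Hedged sketch: overwrite the rows of an existing Hive table (Spark 1.4+).
df.write.insertInto('test_table', overwrite=True)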

Related

Get data from subfolders of an unpartitioned hive table into a dataframe in spark

There is an external table in Hive pointing to an S3 location that is not partitioned. The table points to a folder in S3, but the data is in multiple subfolders inside that folder.
This table can be queried, even though it is not partitioned, by setting a few properties in Hive like below:
set hive.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;
set hive.supports.subdirectories=true;
set mapred.input.dir.recursive=true;
However, when the same table is used in Spark to load the data into a DataFrame with a SQL statement like df = sqlContext.sql("select * from table_name"), the action fails, complaining that a subfolder in the external S3 location is not a file.
I tried setting the above Hive properties in Spark using sc.hadoopConfiguration.set("mapred.input.dir.recursive", "true"), but it did not help. It looks like this only helps for sc.textFile-style loading.
This can be achieved by setting the following property in Spark:
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")
Note that the property is set using sqlContext instead of sparkContext.
I tested this in Spark 1.6.2.
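Put together, a minimal sketch, assuming a HiveContext-backed sqlContext and that the external table is named table_name:
# Hedged sketch: enable recursive input listing before querying the table.
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")
df = sqlContext.sql("SELECT * FROM table_name")
df.count()  # forces a full scan, so any listing problem would surface here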

Wrong schema in case of DataFrame obtained from ORC Hive table [duplicate]

I have a directory containing ORC files. I am creating a DataFrame using the below code
var data = sqlContext.sql("SELECT * FROM orc.`/directory/containing/orc/files`");
It returns a data frame with this schema:
[_col0: int, _col1: bigint]
Whereas the expected schema is:
[scan_nbr: int, visit_nbr: bigint]
When I query files in Parquet format, I get the correct schema.
Am I missing any configuration(s)?
Adding more details
This is Hortonworks Distribution HDP 2.4.2 (Spark 1.6.1, Hadoop 2.7.1, Hive 1.2.1)
We haven't changed the default configurations of HDP, but this is definitely not the same as the plain vanilla version of Hadoop.
The data is written by upstream Hive jobs using a simple CTAS (CREATE TABLE sample STORED AS ORC AS SELECT ...).
I tested this on files generated by CTAS with the latest Hive 2.0.0, and it preserves the column names in the ORC files.
The problem is the Hive version, 1.2.1, which has the bug HIVE-4243. This was fixed in Hive 2.0.0.
Setting
sqlContext.setConf('spark.sql.hive.convertMetastoreOrc', 'false')
fixes this.
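End to end, a minimal sketch, assuming the ORC data is also registered as a Hive table (the name tx_orc below is just for illustration):
# Hedged sketch: read through the Hive metastore so column names come from
# the table definition instead of the ORC files themselves.
sqlContext.setConf('spark.sql.hive.convertMetastoreOrc', 'false')
df = sqlContext.table('tx_orc')
df.printSchema()  # should show scan_nbr / visit_nbr rather than _col0 / _col1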
If you have the Parquet version of the data as well, you can just copy the column names over, which is what I did (also, the date column was the partition key for the ORC table, so it had to be moved to the end):
import functools

tx = sqlContext.table("tx_parquet")
df = sqlContext.table("tx_orc")

# Reuse the Parquet table's column names, moving the partition key to the end.
tx_cols = tx.schema.names
tx_cols.remove('started_at_date')
tx_cols.append('started_at_date')

# Fix the ORC DataFrame's column names by renaming them position by position.
oldColumns = df.schema.names
newColumns = tx_cols
df = functools.reduce(
    lambda df, idx: df.withColumnRenamed(oldColumns[idx], newColumns[idx]),
    range(len(oldColumns)),
    df)
Alternatively, we can use:
val df = hiveContext.read.table("tableName")
Then df.schema or df.columns will give the actual column names.
If a version upgrade is not an option, a quick fix could be to rewrite the ORC files using Pig. That seems to work just fine.

Read parquet files in Spark with pattern matching

I'm running Spark 1.3.0 and want to read a number of Parquet files based on pattern matching. The Parquet files are basically the underlying files of a Hive DB, and I want to read only some of the files (across different folders). The folder structure is:
hdfs://myhost:8020/user/hive/warehouse/db/blogs/some/meta/files/
hdfs://myhost:8020/user/hive/warehouse/db/blogs/yymmdd=20160101/01/file1.parq
hdfs://myhost:8020/user/hive/warehouse/db/blogs/yymmdd=20160101/02/file2.parq
hdfs://myhost:8020/user/hive/warehouse/db/blogs/yymmdd=20160103/01/file3.parq
Something like
val v1 = sqlContext.parquetFile("hdfs://myhost:8020/user/hive/warehouse/db/blogs/yymmdd={[0-9]*}")
I want to ignore the meta files and load only the parquet files inside the date folders. Is this possible?
You can use a wildcard in the Parquet path like so (works on 1.5; I didn't test it on 1.3):
val v1 = sqlContext.parquetFile("hdfs://myhost:8020/user/hive/warehouse/db/blogs/yymmdd*")
Another thing you can do, in case that doesn't work, is to create an external table in Hive partitioned by yymmdd and read the Parquet data from that table using:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("SELECT * FROM ...")
You can't use regular expressions.
Also, I think your folder structure is problematic. It should be
hdfs://myhost:8020/user/hive/warehouse/db/blogs/yymmdd=150204/
or
hdfs://myhost:8020/user/hive/warehouse/db/blogs/yymmdd=150204/part=01
and not:
hdfs://myhost:8020/user/hive/warehouse/db/blogs/yymmdd=150204/1
because, the way you use it, I think you will have trouble using the folder names (yymmdd) as partitions, since the files are not directly under them.
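A hedged sketch of that external-table approach in PySpark; the table name blogs, the column some_col, and the DDL details are illustrative assumptions, and it presumes the files sit directly under each yymmdd=... directory, as recommended above:

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)
# Hypothetical DDL: an external Parquet table partitioned by yymmdd.
sqlContext.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS blogs (some_col STRING)
    PARTITIONED BY (yymmdd STRING)
    STORED AS PARQUET
    LOCATION 'hdfs://myhost:8020/user/hive/warehouse/db/blogs'
""")
# Register one date partition explicitly, then filter on the partition column.
sqlContext.sql("""
    ALTER TABLE blogs ADD IF NOT EXISTS PARTITION (yymmdd='20160101')
    LOCATION 'hdfs://myhost:8020/user/hive/warehouse/db/blogs/yymmdd=20160101'
""")
v1 = sqlContext.sql("SELECT * FROM blogs WHERE yymmdd = '20160101'")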

Overwriting Parquet File in Bluemix Object Storage with Apache Spark Notebook

I'm running a Spark Notebook to save a DataFrame as a Parquet File in the Bluemix Object Storage.
I want to overwrite the Parquet file when rerunning the notebook, but actually it's just appending the data.
Below is a sample of the IPython code:
df = sqlContext.sql("SELECT * FROM table")
df.write.parquet("swift://my-container.spark/simdata.parquet", mode="overwrite")
I'm not a Python guy, but SaveMode works for DataFrames like this (in Scala):
df.write.mode(SaveMode.Overwrite).parquet("swift://my-container.spark/simdata.parquet")
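In PySpark, a hedged equivalent uses the string form of the save mode, which is essentially what the question already does:
# Assumes the same df and Swift container as above.
df.write.mode("overwrite").parquet("swift://my-container.spark/simdata.parquet")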
I think the block storage replaces only 'simdata.parquet', while the 'part-0000*' files remain, because they were written under 'simdata.parquet' with the UUID of the app id; so when you try to read, the DataFrame reads all files matching 'simdata.parquet*'.
