Loading Pig output files into a Hive table with some blank cells - Azure

I have successfully loaded a 250,000-record CSV file into HDFS and performed some ETL functions on it, such as removing any characters other than 0-9, a-z and A-Z from the strings so that the data is nice and clean.
I've saved the output of this ETL to HDFS for loading into Hive. In Hive I created the schema for the table and set the appropriate data type for each column:
create external table pigOutputHive (
id string,
Score int,
ViewCount int,
OwnerUserId string,
Body string,
Rank int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
location '/user/admin/PigOutputETL';
When I run a simple query on the data such as:
SELECT * FROM pigoutputhive LIMIT 100000;
The data looks as it should, and when I download it to my local machine and view it in Excel as a CSV it also looks good.
When I try to run the following query on the same table, every field is returned as an integer, even the string columns. See the screenshot below.
Can anyone see where I am going wrong? Of the original 250,000 rows there are some blanks in particular fields, such as OwnerUserId; do I need to tell Pig or Hive how to handle these?
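A hedged sketch of one way this is often handled, assuming two common culprits rather than a confirmed diagnosis: commas inside the Body text (which shift every later field when Hive splits rows on ','), and empty fields that Hive should read as NULL. If the Pig script stores the cleaned relation with an unambiguous delimiter, for example PigStorage('|'), the Hive DDL can declare the same delimiter and map empty strings to NULL:

create external table pigOutputHive (
  id string,
  Score int,
  ViewCount int,
  OwnerUserId string,
  Body string,
  Rank int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/user/admin/PigOutputETL'
TBLPROPERTIES ('serialization.null.format' = '');

The column and path names are taken from the question; the pipe delimiter and the serialization.null.format property are one possible arrangement, not a verified fix for this particular dataset.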

Related

How to fix a SQL query in Databricks if a column name has a bracket in it

I have a file which has data like this, and I have converted that file into a Databricks table.
Select * from myTable
Output:
Product[key] Product[name]
123 Mobile
345 television
456 laptop
I want to query my table for laptop data.
I am using the below query:
Select * from myTable where Product[name]='laptop'
I am getting the below error in Databricks:
AnalysisException: cannot resolve 'Product' given input columns:
[spark_catalog.my_db.myTable.Product[key], spark_catalog.my_db.myTable.Product[name]]
When certain characters appear in the column names of a table in SQL, you get a parse exception. These characters include brackets, dots (.), hyphens (-), etc. So, when such characters appear in column names, we need an escape character so that they are parsed simply as part of the column name.
For SQL in Databricks, this character is the backtick (`). Enclosing your column name in backticks ensures that it is parsed correctly as-is, even when it includes characters like '[]' (as in this case).
Since you converted file data into a Databricks table, you did not get to see the main problem, which is parsing the column name. If you manually create a table with that schema in Databricks, you run into the same parsing problem.
Once you use backticks in the following way, using the column name is no longer a problem:
create table mytable(`Product[key]` integer, `Product[name]` varchar(20))
select * from mytable where `Product[name]`='laptop'
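One optional convenience, sketched here with a hypothetical view name and column aliases: wrap the table in a view that renames the bracketed columns, so later queries don't need backticks at all.

create or replace view myTableClean as
select `Product[key]` as product_key, `Product[name]` as product_name
from myTable;

select * from myTableClean where product_name = 'laptop'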

Keeping Special Characters in Spark Table Column Name

Is there any way to keep special characters for a column in a Spark 3.0 table?
I need to do something like
CREATE TABLE schema.table
AS
SELECT id=abc
FROM tbl1
I was reading that in Hadoop you would put backticks around the column name, but this does not work in Spark.
If there is a way to do this in PySpark, that would work as well.
It turns out the parquet and delta formats do not accept special characters in column names under any circumstance. You must use ROW FORMAT DELIMITED:
spark.sql("""CREATE TABLE schema.test
ROW FORMAT DELIMITED
SELECT 1 AS `brand=one` """)
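As a small follow-up sketch (assuming the table created above), the special-character column keeps needing backticks whenever it is referenced later, for example:

SELECT `brand=one`
FROM schema.test
WHERE `brand=one` = 1

Run through spark.sql() just like the CREATE statement, this should return the single row created above.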

Can I retrieve a limited dataset from a CSV file in Azure Data Lake U-SQL?

Can I filter data loaded from a CSV file using U-SQL's EXTRACT? I know we can limit the data using a WHERE condition in the SELECT after the EXTRACT, but I want to filter it during the EXTRACT itself.
I have a huge CSV file. I don't want to load all of it into the first dataset itself.
e.g.
I have a lot of auto claims in the dataset, and I want to filter them while I EXTRACT, based on a date in the dataset.
The answer is yes; however, only column pruning can be pushed into the extractors. Since there are no semantics assigned to the data before you extract it with the CSV extractor, filters will be applied on the generated rowset. There are many examples out there that show you how to do so. Here is an example from one of the hands-on labs.
Yes, you can filter data loaded from a CSV file. You can do something like this:
@log =
    EXTRACT UserId int,
            StartDate DateTime,
            Location string,
            ....
            ....
            Url string
    FROM "/Samples/Data/Log.csv"
    USING Extractors.Csv();
@result =
    SELECT Location, Url, StartDate
    FROM @log
    WHERE StartDate >= DateTime.Parse("2017/01/6") AND StartDate <= DateTime.Parse("2018/06/08");
OUTPUT @result
TO "/output/cleanlog.csv"
USING Outputters.Csv();

Correct syntax for creating a parquet table with CTAS at a specified location

I'm trying to create a table stored as parquet with spark.sql with a pre-specified external location, but I appear to be missing something, or something is omitted from the documentation.
My reading of the documentation suggests the following should work:
create table if not exists schema.test
using PARQUET
partitioned by (year, month)
location 's3a://path/to/location'
as select '1' test1, true test2, '2017' year, '01' month
But this returns the error:
mismatched input 'location' expecting (line 4, pos 0)
The documentation suggests external is automatically implied by setting location, but in any case adding create external table gives the same error.
I was able to successfully create an empty table with similar syntax:
create external table if not exists schema.test
(test1 string, test2 boolean)
partitioned by (year string, month string)
stored as PARQUET
location 's3a://path/to/location'
My alternative is to save the select results as parquet at /path/to/location directly first, then create a table pointing to it, but this seems roundabout when the original syntax appears valid and designed for this purpose.
What's wrong with my approach?
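A hedged sketch of one variant of that roundabout route, reusing the empty-table DDL that already works above and then populating it with an INSERT ... SELECT; the dynamic-partition setting is an assumption and may not be needed in every environment:

create external table if not exists schema.test
(test1 string, test2 boolean)
partitioned by (year string, month string)
stored as PARQUET
location 's3a://path/to/location';

set hive.exec.dynamic.partition.mode=nonstrict;

insert into table schema.test partition (year, month)
select '1' as test1, true as test2, '2017' as year, '01' as month;

This sidesteps the CTAS parsing problem entirely, at the cost of the two-step flow the question was hoping to avoid.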

Spark to Hive data types

Is there a way to convert an input string field into an ORC table column specified as varchar(xx) in a Spark SQL select query, or do I have to use some workaround? I'm using Spark 1.6.
I found on a Cloudera forum that Spark does not care about length; it saves the value as a string with no size limit.
The data is inserted into Hive OK, but I'm a little bit worried about data quality.
temp_table = sqlContext.table(ext)
df = temp_table.select(temp_table.day.cast('string'))
I would like to see something like this :)))
df = temp_table.select(temp_table.day.cast('varchar(100)'))
Edit:
df.write.partitionBy(part).mode('overwrite').insertInto(int)
The table I'm inserting into is stored as an ORC file (the line above probably should also have .format('orc')).
I found here that if I specify a column as a varchar(xx) type, the input string will be cut off to xx length.
Thx
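For what it's worth, a minimal sketch of the varchar(xx) workaround described just above, with a hypothetical table name and partition column (the real names are whatever the ext/int/part variables hold in the question); the length enforcement comes from the Hive DDL rather than from the Spark cast:

create table my_orc_table (
  `day` varchar(100)
)
partitioned by (part string)
stored as orc;

With a target table declared like this, the existing select(...cast('string')) plus insertInto(...) flow can stay as it is, and per the finding quoted above the stored values are cut off at 100 characters.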
