Split a string based on space in Hive

This is the format of my CSV file:
Chevrolet C10,13.0,8,350.0,145.0,4055,12.0,76,US
Ford F108,13.0,8,302.0,130.0,3870,15.0,76,US
Dodge D100,13.0,8,318.0,150.0,3755,14.0,76,US
Honda Accord CVCC,31.5,4,98.00,68.00,2045,18.5,77,Japan
Buick Opel Isuzu Deluxe,30.0,4,111.0,80.00,2155,14.8,77,US
Renault 5 GTL,36.0,4,79.00,58.00,1825,18.6,77,Europe
Plymouth Arrow GS,25.5,4,122.0,96.00,2300,15.5,77,US
I want to split the first field like this:
Chevrolet C10 should be Chevrolet
Ford F108 should be Ford
Honda Accord CVCC should be Honda, etc., and then I will use the car name for further processing.

Solution in Pig
Code:
read = LOAD 'test.data' USING PigStorage(',') AS (name:chararray, val1:double, val2:int, val3:double, val4:double, val5:int, val6:double, val7:int, country:chararray);
sub_data = FOREACH read GENERATE SUBSTRING(name,0,(INDEXOF(name, ' ',0))) AS (subname:chararray);
DUMP sub_data;
Output:
(Chevrolet)
(Ford)
(Dodge)
(Honda)
(Buick)
(Renault)
(Plymouth)

Alternatively, a regex-based approach; the extra CASE branches handle two-word prefixes such as 'US' or 'Europe' (see the examples below the query):
select
case when MODEL like 'US % %' or MODEL like 'Europe % %'
then regexp_extract(MODEL, '^([^ ]* [^ ]*) ', 1)
when MODEL like '% %'
then regexp_extract(MODEL, '^([^ ]*) ', 1)
else MODEL
end as BRAND
from WHATEVER
Examples:
Chevrolet C10 => Chevrolet
US Honda Accord => US Honda
Zorglub => Zorglub

Use the built-in substring_index UDF:
substring_index(string A, string delim, int count)
It returns the substring of A before count occurrences of the delimiter delim.
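For example, a quick sketch (substring_index has been built into Hive since release 1.3.0):
SELECT substring_index('Chevrolet C10', ' ', 1);
-- Chevrolet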

Create a table with the schema you want:
CREATE TABLE carinfo (carname STRING, val1 DOUBLE, val2 INT, val3 DOUBLE, val4 DOUBLE, val5 INT, val6 DOUBLE, val7 INT, country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
Load the data into the above table:
LOAD DATA LOCAL INPATH '/hivesamples/splitstr.txt' OVERWRITE INTO TABLE carinfo;
Use CTAS to split the carname and keep only the brand name. The new table has the same schema you defined earlier.
CREATE TABLE modified_carinfo
AS
SELECT split(carname, ' ')[0] as carname, val1, val2, val3, val4, val5, val6, val7, country
FROM carinfo;
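A quick check against the sample data should then return the brand names alone (row order in Hive may vary):
SELECT carname FROM modified_carinfo LIMIT 3;
-- Chevrolet
-- Ford
-- Dodge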

Related

Split Complex String in PySpark Dataframe Column

I have a PySpark dataframe column containing multiple addresses. The format is as below:
id addresses
1 [{"city":null,"state":null,"street":"123, ABC St, ABC Square","postalCode":"11111","country":"USA"},{"city":"Dallas","state":"TX","street":"456, DEF Plaza, Test St","postalCode":"99999","country":"USA"}]
I want to transform it as below:
id  city    state  street                   postalCode  country
--  ------  -----  -----------------------  ----------  -------
1   null    null   123, ABC St, ABC Square  11111       USA
1   Dallas  TX     456, DEF Plaza, Test St  99999       USA
Any inputs on how to achieve this using PySpark? The dataset is huge (several TBs), so I want to do this in an efficient way.
I tried splitting the address string on commas, but since there are commas within the addresses as well, the output is not as expected. I guess I need to use a regular expression pattern with the braces, but I am not sure how. Moreover, how do I go about denormalizing the data?
# Sample data
from pyspark.sql.functions import col, from_json

df = spark.createDataFrame([(1,'{"city":"New York","state":"NY","street":"123, ABC St, ABC Square","postalCode":"11111","country":"USA"},{"city":"Dallas","state":"TX","street":"456, DEF Plaza, Test St","postalCode":"99999","country":"USA"}')],
                           ('id', 'addresses'))
df.show(truncate=False)

# Pass the string column through an RDD so spark.read.json can infer the schema
rdd = df.select(col("addresses").alias("jsoncol")).rdd.map(lambda x: x.jsoncol)
newschema = spark.read.json(rdd).schema

# Apply the inferred schema to the string column with from_json,
# then expand the resulting struct into top-level columns
df3 = df.select("*", from_json("addresses", newschema).alias("test_col"))
df3.select('id', 'test_col.*').show()
+---+--------+-------+----------+-----+------------------------+
|id |city |country|postalCode|state|street |
+---+--------+-------+----------+-----+------------------------+
|1 |New York|USA |11111 |NY |123, ABC St, ABC Square|
+---+--------+-------+----------+-----+------------------------+
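Since the question's data is actually a JSON array of address objects, you can instead parse it with an explicit array-of-struct schema and explode it, giving one output row per address as asked. A sketch in Spark SQL, assuming the dataframe is registered as a temp view named addr_table (the view name and the hand-written schema string are assumptions):
SELECT id, a.city, a.state, a.street, a.postalCode, a.country
FROM addr_table
LATERAL VIEW explode(
    from_json(addresses,
        'array<struct<city:string,state:string,street:string,postalCode:string,country:string>>')
) t AS a;
This also skips the extra RDD pass for schema inference, which matters at the multi-terabyte scale mentioned in the question.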

Snowflake unpivoting

I need to transpose a table in which column1 is the name of an entity and column2 through column366 are dates in a year, each holding a dollar amount. The table, the select statement, and the output are all given below.
Question - This syntax requires me to create a comma-separated list of columns - basically 365 dates - and use that list in the IN clause of the select statement.
Like this -
.....unpivot (cash for dates in ("01-01-2020", "01-02-2020", "01-03-2020" ... "12-31-2020")) order by 2
Is there any better way of doing this? With regular expressions, perhaps? I don't want to type 365 dates in mm-dd-yyyy format and get carpal tunnel for my trouble.
Here is the table - the first line is the column header, the second line is a separator, and the remaining lines are sample data (... stands for the intervening date columns):
Name     01-01-2020  01-02-2020  01-03-2020  ...  12-31-2020
------------------------------------------------------------
Entity1       10.00       15.75       20.00  ...      100.00
Entity2       11.00       16.75       20.00  ...       10.00
Entity3      112.00      166.75       29.00  ...      108.00
I can transpose it using the select statement below
select * from Table1
unpivot (cash for dates in ("01-01-2020", "01-02-2020", "01-03-2020")) order by 2
to get an output like the one below -
Name      dates        cash
---------------------------
Entity1   01-01-2020   10.00
Entity2   01-01-2020   11.00
Entity3   01-01-2020  112.00
... and so on
There is a simpler way to do this without UNPIVOT. Snowflake gives you a function to represent an entire row as an OBJECT -- a collection of key-value pairs. With that representation, you can FLATTEN the object and extract both the column name (key == date) and the value inside (value == cash). Here is a query that will do it:
with obj as (
    select OBJECT_CONSTRUCT(*) as o from Table1
)
select o:NAME::varchar as name,
       f.key::date as date,
       f.value::float as cash
from obj,
     lateral flatten(input => obj.o, mode => 'OBJECT') f
where f.key != 'NAME';
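One caveat: f.key::date relies on the session's DATE_INPUT_FORMAT matching the MM-DD-YYYY column names. If it does not, an explicit TO_DATE is safer (same query, only the cast changes):
with obj as (
    select OBJECT_CONSTRUCT(*) as o from Table1
)
select o:NAME::varchar as name,
       to_date(f.key, 'MM-DD-YYYY') as date,
       f.value::float as cash
from obj,
     lateral flatten(input => obj.o, mode => 'OBJECT') f
where f.key != 'NAME';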

How to reorder values in a row alphabetically using T-SQL?

I need to reorder the values in rows of a table by alphabetical order, for example:
Id Values
--------------------------------
1 Banana, Apple, Oranges
2 Oranges, Melon, Cucumber
3 Cucumber, Banana, Apple
The expected output should be:
Id Values
--------------------------------
1 Apple, Banana, Oranges
2 Cucumber, Melon, Oranges
3 Apple, Banana, Cucumber
You can generate the data above using the following code:
CREATE TABLE [Table] (
    [Id] INT NOT NULL,
    [Values] VARCHAR(30) NOT NULL,
    CONSTRAINT [PK_Table_Id] PRIMARY KEY CLUSTERED ([Id])
);
GO

INSERT INTO [Table] ([Id], [Values])
VALUES (1, 'Banana, Apple, Oranges'),
       (2, 'Oranges, Melon, Cucumber'),
       (3, 'Cucumber, Banana, Apple');
If you are using SQL Server 2017 or later, you can combine STRING_SPLIT and STRING_AGG. Note that STRING_SPLIT only accepts a single-character separator, so split on ',' and trim the leftover spaces:
WITH cte AS (
    SELECT Id, LTRIM(value) AS value
    FROM [Table]
    CROSS APPLY STRING_SPLIT([Values], ',')
)
SELECT
    Id,
    STRING_AGG(value, ', ') WITHIN GROUP (ORDER BY value) AS [Values]
FROM cte
GROUP BY Id
ORDER BY Id;
However, I seriously suggest that you stop at the CTE step above, because storing CSV values in your table is a bad idea from the very beginning. Once you have each value per Id on a separate row, your data is already normalized, or at least much closer to it.
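For instance, a sketch of that normalized layout (the FruitItems name and its columns are made up for illustration):
-- One row per (Id, Value) pair instead of a CSV string
CREATE TABLE FruitItems (
    Id    INT         NOT NULL,
    Value VARCHAR(30) NOT NULL
);

INSERT INTO FruitItems (Id, Value)
SELECT Id, LTRIM(value)
FROM [Table]
CROSS APPLY STRING_SPLIT([Values], ',');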

PySpark: Join a table column with one of the two columns from another table

My problem is as follows:
Table 1
ID1 ID2
1 2
3 4
Table 2
C1 VALUE
1 London
4 Texas
Table3
C3 VALUE
2 Paris
3 Arizona
Table 1 has primary and secondary IDs. I need to create a final output which is an aggregation of values from Table2 and Table3, based on the ID mapping from Table1.
That is, if a value in Table2 or Table3 is mapped to either of the IDs, it should be aggregated as one.
i.e my final output should look like:
ID Aggregated
1 [2, London, Paris] // since Paris is mapped to 2, which in turn is mapped to 1
3 [4, Texas, Arizona] // Texas is mapped to 4, which in turn is mapped to 3
Any suggestions on how to achieve this in PySpark?
I am not sure if joining the tables is going to help with this problem.
I was thinking a PairRDD might help me here, but I am not able to come up with a proper solution.
Thanks
Below is a very straightforward approach:
spark.sql(
"""
select 1 as id1,2 as id2
union
select 3 as id1,4 as id2
""").createOrReplaceTempView("table1")
spark.sql(
"""
select 1 as c1, 'london' as city
union
select 4 as c1, 'texas' as city
""").createOrReplaceTempView("table2")
spark.sql(
"""
select 2 as c1, 'paris' as city
union
select 3 as c1, 'arizona' as city
""").createOrReplaceTempView("table3")
spark.table("table1").show()
spark.table("table2").show()
spark.table("table3").show()
# for simplicity, union table2 and table3
spark.sql(""" select * from table2 union all select * from table3 """).createOrReplaceTempView("city_mappings")
spark.table("city_mappings").show()
# now join to the ids:
spark.sql("""
select id1, id2, city from table1
join city_mappings on c1 = id1 or c1 = id2
""").createOrReplaceTempView("id_to_city")
# and finally you can aggregate:
spark.sql("""
select id1, id2, collect_list(city)
from id_to_city
group by id1, id2
""").createOrReplaceTempView("result")
table("result").show()
# result looks like this, you can reshape to better suit your needs :
+---+---+------------------+
|id1|id2|collect_list(city)|
+---+---+------------------+
| 1| 2| [london, paris]|
| 3| 4| [texas, arizona]|
+---+---+------------------+
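If you want the exact shape from the question, with id2 folded into the aggregated list, one option is the query below (a sketch; concat on arrays needs Spark 2.4+, and you would run it through spark.sql like the statements above):
select id1 as id,
       concat(array(cast(id2 as string)), collect_list(city)) as aggregated
from id_to_city
group by id1, id2;
-- 1  [2, london, paris]
-- 3  [4, texas, arizona]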

Excel column split by criteria

I have a column with the following text:
PLEASANT AVENUE
PATTERSON DRIVE
I would like to separate the road type ("Avenue", "Drive", etc.) from the address (or road name, like "Pleasant" or "Patterson").
I need to end up with Col1 as the street name and Col2 as the type, as follows:
col1      | col2
----------|-------
PLEASANT  | AVENUE
PATTERSON | DRIVE
How can I do this?
Select the column with the text, then go to Data => Text to Columns => Delimited => Space.
Adding on to @nicolaesse's answer, you could first replace text to create a better delimiter.
For example, replace " AVENUE" (with its leading space) with ";AVENUE", and then use ";" as the delimiter to split into columns, so the street name keeps no trailing space.
Do the same for "DRIVE" and any other road types.
