I am trying to group my data and store it in HDFS, with a folder for each 'name' and subfolders for each 'YearMonth' under each name folder.
Input:
(Date) (name) (col3) (col4)
2015-02-02 abc y z
2016-01-02 xyz i j
2015-03-02 abc f b
2015-02-06 abc y z
2016-03-02 xyz a q
Expected output in HDFS:
abc folder
  ->201502 subfolder
      2015-02-02 abc y z
      2015-02-06 abc y z
  ->201503 subfolder
      2015-03-02 abc f b
xyz folder
  ->201601 subfolder
      2016-01-02 xyz i j
  ->201603 subfolder
      2016-03-02 xyz a q
I am not sure how to use the MultiStorage option on the name column after grouping the tuples by date.
First create two relations by filtering on name.
You can use SUBSTRING to derive a new field for the YearMonth subfolder:
CONCAT(SUBSTRING($0,0,4),SUBSTRING($0,5,7))
Now use this field as the folder name in MultiStorage.
If you have HCatalog you can follow these steps:
Create partitioned table as below using HCatalog
hcat -e "create external table TABLE_NAME (col3 string, col4 string) partitioned by (date string, name string) location '/user/$USER/multi_partition'"
Load data into that table using the following Pig statements:
A = LOAD 'Input.dat' using PigStorage() as (date:chararray, name:chararray, col3:chararray, col4:chararray);
STORE A INTO 'TABLE_NAME' using org.apache.hive.hcatalog.pig.HCatStorer();
Related
I have a PySpark dataframe column that contains multiple addresses. The format is as below:
id addresses
1 [{"city":null,"state":null,"street":"123, ABC St, ABC Square","postalCode":"11111","country":"USA"},{"city":"Dallas","state":"TX","street":"456, DEF Plaza, Test St","postalCode":"99999","country":"USA"}]
I want to transform it as below:
id  city    state  street                    postalCode  country
1   null    null   123, ABC St, ABC Square   11111       USA
1   Dallas  TX     456, DEF Plaza, Test St   99999       USA
Any inputs on how to achieve this using PySpark? The dataset is huge (several TBs), so I want to do this in an efficient way.
I tried splitting the address string on commas, but since there are commas within the addresses as well, the output is not as expected. I guess I need to use a regular expression pattern with the braces, but I am not sure how. Moreover, how do I go about denormalizing the data?
# Data
from pyspark.sql.functions import col, from_json

df = spark.createDataFrame(
    [(1, '{"city":"New York","state":"NY","street":"123, ABC St, ABC Square","postalCode":"11111","country":"USA"},{"city":"Dallas","state":"TX","street":"456, DEF Plaza, Test St","postalCode":"99999","country":"USA"}')],
    ('id', 'addresses'))
df.show(truncate=False)

# Pass the string column to an RDD to infer the schema
rdd = df.select(col("addresses").alias("jsoncol")).rdd.map(lambda x: x.jsoncol)
newschema = spark.read.json(rdd).schema

# Apply the inferred schema to the string column using from_json
df3 = df.select("*", from_json("addresses", newschema).alias("test_col"))
df3.select('id', 'test_col.*').show()
+---+--------+-------+----------+-----+------------------------+
|id |city |country|postalCode|state|street |
+---+--------+-------+----------+-----+------------------------+
|1 |New York|USA |11111 |NY |123, ABC St, ABC Square|
+---+--------+-------+----------+-----+------------------------+
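Since the addresses column in the question is a JSON array of address objects, one row per address can be produced by parsing with an explicit array schema and exploding. A minimal sketch, assuming Spark 2.2+ (for from_json with an ArrayType schema) and the field names from the sample data:
from pyspark.sql.functions import from_json, explode
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

# Schema assumed from the sample JSON in the question
addr_schema = ArrayType(StructType([
    StructField("city", StringType()),
    StructField("state", StringType()),
    StructField("street", StringType()),
    StructField("postalCode", StringType()),
    StructField("country", StringType()),
]))

df_arr = spark.createDataFrame(
    [(1, '[{"city":null,"state":null,"street":"123, ABC St, ABC Square","postalCode":"11111","country":"USA"},{"city":"Dallas","state":"TX","street":"456, DEF Plaza, Test St","postalCode":"99999","country":"USA"}]')],
    ('id', 'addresses'))

# Parse the JSON array, then explode so each address becomes its own row
df_exploded = (df_arr
               .withColumn("addr", explode(from_json("addresses", addr_schema)))
               .select("id", "addr.*"))
df_exploded.show(truncate=False)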
I am trying to create a CSV file where, if a few columns are the same, I merge the rows with matching values into one row.
eg:
Input :
Party_No install_date Start_date End_date Product_family Version City state
111 24-05-2018 25-05-2019 21-03-2020 storage 1 LA USA
111 24-05-2018 25-05-2019 21-03-2020 storage 1 KA USA
111 24-05-2018 25-05-2019 21-03-2020 storage 2 PA UK
Output
Party_No install_date Start_date End_date Product_family Version City state
111 24-05-2018 25-05-2019 21-03-2020 storage 1,2 LA,KA,PA UK,USA
For example, in my case:
if the values of party_number, item_install_date, Contract_subline_start_date, Contract_subline_end_date, and Instance_family are the same,
I will merge those rows into one row. The other columns, apart from the ones mentioned above, will have comma-separated values.
Input CSV file link
Expected output CSV link
Code I tried:
import pandas as pd
import numpy as np

df = pd.read_csv("Export.csv")
df.fillna(0, inplace=True)
pf = df.groupby(['PARTY_NUMBER','ITEM_INSTALL_DATE','CONTRACT_SUBLINE_START_DATE','CONTRACT_SUBLINE_END_DATE','INSTANCE_PRODUCT_FAMILY']).agg([','.join])
pf.to_csv("result1.csv", index=False)
Add unique (or set, when order is not important):
df.groupby(['...']).agg(lambda x : ','.join(x.unique())) # set(x)
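Putting it together with the column names from the question's groupby call, a minimal sketch (as_index=False keeps the key columns as regular columns in the CSV output):
import pandas as pd

key_cols = ['PARTY_NUMBER', 'ITEM_INSTALL_DATE', 'CONTRACT_SUBLINE_START_DATE',
            'CONTRACT_SUBLINE_END_DATE', 'INSTANCE_PRODUCT_FAMILY']

df = pd.read_csv("Export.csv").fillna(0)
# Join the unique values of every non-key column; astype(str) guards against
# columns that were filled with the numeric 0 above.
result = (df.groupby(key_cols, as_index=False)
            .agg(lambda x: ','.join(x.astype(str).unique())))
result.to_csv("result1.csv", index=False)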
Hi, my CSV file structure is like this:
File-1
id col1 col2
a_1 sd fd
ds emd
sf jhk
File-2
id col1 col2
b_2 sd fd
ds emd
sf jhk
Now I want that, when I load my CSV files into the dataframe, the id column for all the rows of file 1 should have the value a_1 and for all the rows of file 2 the value b_2. So my dataframe should look like this:
id col1 col2
a_1 sd fd
a_1 ds emd
a_1 sf jhk
b_2 sd fd
b_2 ds emd
b_2 sf jhk
I want to do this so I can identify the rows by file id when I am reading multiple CSV files. Please note that I don't want to add the filename as the id; I want the id from the first row of each file to extend to all the corresponding rows of that file in the dataframe.
This works if you are sure the id is always going to be in the first row.
Below is pseudo code.
file1_id = df_file1.filter(col('id').isNotNull()).select('id').collect()[0][0]
and then use the above calculated id for the file as
df_file1 = df_file1.drop('id').withColumn('id', lit(file1_id))
Follow the same for the second dataframe df_file2
then do a union
df_file = df_file1.unionAll(df_file2)
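An end-to-end sketch of that approach, assuming the files are called file1.csv and file2.csv (hypothetical paths) and that the id appears only in the first data row of each file:
from pyspark.sql.functions import col, lit

def with_file_id(path):
    df = spark.read.csv(path, header=True)
    # Grab the single non-null id present in the file
    file_id = df.filter(col('id').isNotNull()).select('id').first()[0]
    # Overwrite the id column so every row carries that value
    return df.withColumn('id', lit(file_id))

df_file = with_file_id('file1.csv').unionAll(with_file_id('file2.csv'))
df_file.show()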
I have two files, customer and sales, like below:
Customer :
cu_id name region city state
1 Rahul ME Vizag AP
2 Raghu SE HYD TS
3 Rohith ME BNLR KA
Sales:
sa_id sales country
2 100000 IND
3 230000 USA
4 240000 UK
Both the files are \t delimited.
I want to join both files based on the cu_id from customer and the sa_id from sales, using PySpark, without using Spark SQL/DataFrames.
Your help is very much appreciated.
You can definitely use the join methods that Spark offers for working with RDDs.
You can do something like:
customerRDD = sc.textFile("customers.tsv").map(lambda row: (row.split('\t')[0], "\t".join(row.split('\t')[1:])))
salesRDD = sc.textFile("sales.tsv").map(lambda row: (row.split('\t')[0], "\t".join(row.split('\t')[1:])))
joinedRDD = customerRDD.join(salesRDD)
And you will get a new RDD that contains only the joined records from both the customer and sales files.
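For reference, each element of joinedRDD is a (key, (customer_rest, sales_rest)) pair, so flattening the result back into tab-separated lines could look like the sketch below (field layout assumed from the sample files above):
# Each element: (cu_id, ("name\tregion\tcity\tstate", "sales\tcountry"))
flattened = joinedRDD.map(lambda kv: '\t'.join([kv[0], kv[1][0], kv[1][1]]))
for line in flattened.take(5):
    print(line)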
I have created 3 temporary tables by loading from case classes. Let's call them
ABC
BCD
EFG
Then I proceeded to create 3 more tables by performing joins:
ABC join with BCD which gives XYZ
XYZ join with EFG which gives LMN
LMN join with ABC which gives PQR
Does Spark allow me in some way to see the lineage of dependent registered temporary tables? How can I extract the information that LMN depends on XYZ and PQR depends on LMN, and use it programmatically to build a lineage tree (without all the plan information)?
e.g.:
|-PQR
  |-LMN
    |-XYZ
      |-ABC
    |-EFG
  |-ABC
Thanks.
There is an open source tool you can use to visualize the lineage: https://github.com/AbsaOSS/spline
It harvests lineage at runtime and captures it to display as a graph.
Let's take this SQL as an example, which creates exactly the tables/views you mentioned.
CREATE TABLE ABC(abc_column1 int, abc_column2 string);
CREATE TABLE BCD(bcd_column1 int, bcd_column2 string);
CREATE TABLE EFG(efg_column1 int, efg_column2 string);
create view XYZ(xyz_column1,xyz_column2)
as select abc.abc_column1, bcd.bcd_column2
from abc
left join bcd on abc.abc_column1 = bcd.bcd_column1;
create view LMN (lmn_column1,lmn_column2)
as select xyz.xyz_column1, efg.efg_column2
from xyz
left join efg on xyz.xyz_column1 = efg.efg_column1;
create view PQR (pqr_column1, pqr_column2)
as select lmn.lmn_column1, abc.abc_column2
from LMN
left join ABC on abc.abc_column1 = lmn.lmn_column1;
This gives the data lineage you are asking about. You may try your own SQL to get the data lineage here:
https://sqlflow.gudusoft.com/#/