My data is in a csv file. The file doesn't have a header row:
United States,Romania,15
United States,Croatia,1
United States,Ireland,344
Egypt,United States,15
If I read it, Spark creates names for the columns automatically.
scala> val data = spark.read.csv("./data/flight-data/csv/2015-summary.csv")
data: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 1 more field]
Is it possible to provide my own names for the columns when reading the file if I don't want to use _c0, _c1? For example, I want Spark to use DEST, ORIG and count as the column names. I don't want to add a header row to the csv to do this.
Yes you can. There is a way: you can use the toDF function of the DataFrame.
val data = spark.read.csv("./data/flight-data/csv/2015-summary.csv").toDF("DEST", "ORIG", "count")
It's better to define a schema (StructType) first, then load the csv data using that schema.
Here is how to define schema:
import org.apache.spark.sql.types._
val schema = StructType(Array(
  StructField("DEST", StringType, true),
  StructField("ORIG", StringType, true),
  StructField("count", IntegerType, true)
))
Load the dataframe:
val df = spark.read.schema(schema).csv("./data/flight-data/csv/2015-summary.csv")
Hopefully this helps.
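In case you're using PySpark rather than the Scala shell, here is a short sketch of the same two approaches (assuming the same file path as above):
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Option 1: rename the auto-generated columns after reading (all columns stay strings).
data = spark.read.csv("./data/flight-data/csv/2015-summary.csv").toDF("DEST", "ORIG", "count")

# Option 2: read with an explicit schema, so "count" comes back as an integer.
schema = StructType([
    StructField("DEST", StringType(), True),
    StructField("ORIG", StringType(), True),
    StructField("count", IntegerType(), True),
])
df = spark.read.schema(schema).csv("./data/flight-data/csv/2015-summary.csv")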
I have a PySpark dataframe column comprised of multiple addresses. The format is as below:
id addresses
1 [{"city":null,"state":null,"street":"123, ABC St, ABC Square","postalCode":"11111","country":"USA"},{"city":"Dallas","state":"TX","street":"456, DEF Plaza, Test St","postalCode":"99999","country":"USA"}]
I want to transform it as below:
id | city   | state | street                  | postalCode | country
1  | null   | null  | 123, ABC St, ABC Square | 11111      | USA
1  | Dallas | TX    | 456, DEF Plaza, Test St | 99999      | USA
Any inputs on how to achieve this using PySpark? The dataset is huge (several TBs), so I want to do this efficiently.
I tried splitting the address string on commas, but since there are commas within the addresses as well, the output is not as expected. I guess I need a regular expression pattern with the braces but I'm not sure how. Moreover, how do I go about denormalizing the data?
#Data
from pyspark.sql.functions import *
df = spark.createDataFrame([(1,'{"city":"New York","state":"NY","street":"123, ABC St, ABC Square","postalCode":"11111","country":"USA"},{"city":"Dallas","state":"TX","street":"456, DEF Plaza, Test St","postalCode":"99999","country":"USA"}')],
                           ('id','addresses'))
df.show(truncate=False)

#Pass the string column to an rdd to infer the schema
rdd = df.select(col("addresses").alias("jsoncol")).rdd.map(lambda x: x.jsoncol)
newschema = spark.read.json(rdd).schema

#Apply the inferred schema to the string column using from_json
df3 = df.select("*", from_json("addresses", newschema).alias("test_col"))
df3.select('id','test_col.*').show()
+---+--------+-------+----------+-----+------------------------+
|id |city |country|postalCode|state|street |
+---+--------+-------+----------+-----+------------------------+
|1 |New York|USA |11111 |NY |123, ABC St, ABC Square|
+---+--------+-------+----------+-----+------------------------+
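Since the real addresses column holds a JSON array (unlike the single-object test string above) and the dataset is several TB, a sketch that declares the array schema up front and explodes one row per address may be worth trying; it avoids the extra spark.read.json pass over the data. Field names are taken from the sample in the question, and df is assumed to be the original dataframe:
from pyspark.sql.functions import col, explode, from_json
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Schema of one address object, per the sample data in the question.
address = StructType([
    StructField("city", StringType(), True),
    StructField("state", StringType(), True),
    StructField("street", StringType(), True),
    StructField("postalCode", StringType(), True),
    StructField("country", StringType(), True),
])

# Parse the JSON array, then explode so each address becomes its own row.
exploded = (df
    .withColumn("addr", explode(from_json(col("addresses"), ArrayType(address))))
    .select("id", "addr.*"))
exploded.show(truncate=False)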
I am dealing with a Spark dataframe df which has two columns, tstamp and c_1. The data type of c_1 is 'string', and I want to add a new column by extracting the string between two markers in that field.
For example: original dataframe df
tstamp              | c_1
2022-06-15 10:00:00 | xxx&cd7=H10S10P10&cd21=GA&cd3=6...
2022-06-15 10:10:01 | xz&cd7=H11S11P11&cd21=CA&cd3=5...
We want to add a new column (in the same or another dataframe) called cd_7 whose value is the string between 'cd7=' and '&cd21', like below:
tstamp              | c_1                                | cd_7
2022-06-15 10:00:00 | xxx&cd7=H10S10P10&cd21=GA&cd3=6... | H10S10P10
2022-06-15 10:10:01 | xz&cd7=H11S11P11&cd21=CA&cd3=5...  | H11S11P11
How could I write this using PySpark? Thanks!
Use regexp_extract with look-arounds anchored on cd7= and &cd21, restricting the match to non-& characters so it stops at the next &:
from pyspark.sql.functions import regexp_extract

df.withColumn('x', regexp_extract('c_1', r'(?<=cd7=)[^&]*(?=&cd21)', 0)).show()
+-------------------+--------------------+---------+
| tstamp| c_1| x|
+-------------------+--------------------+---------+
|2022-06-15 10:00:00|xxx&cd7=H10S10P10...|H10S10P10|
|2022-06-15 10:10:01|xz&cd7=H11S11P11&...|H11S11P11|
+-------------------+--------------------+---------+
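An equivalent sketch using a capturing group instead of look-arounds; group 1 is everything after cd7= up to the next &, written straight into the requested cd_7 column:
from pyspark.sql.functions import regexp_extract

df.withColumn('cd_7', regexp_extract('c_1', r'cd7=([^&]*)', 1)).show()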
An alternative is to convert to a pandas dataframe and do the manipulation there, but that's not ideal if the data is large:
df['cd_7'] = df['c_1'].apply(lambda st: st[st.find("cd7=")+4:st.find("&cd21")])
I'm trying to work on this drinks-by-country data set and find the mean beer servings of each country in each continent, sorted from highest to lowest.
So my result should look something like below:
South America: Venezuela 333, Brazil 245, Paraguay 213
and like that for the other continents (Don't want to mix countries of different continents!)
Creating the grouped data without the sorting is quite easy like below:
import pandas as pd

ddf = pd.read_csv('drinks.csv')
grouped_continent_and_country = ddf.groupby(['continent', 'country'])
print(grouped_continent_and_country['beer_servings'].mean())
But how do I do the sorting?
Thanks a lot.
In this case you can just sort the values by 'continent' and 'beer_servings' without applying .mean(), since each country has a single row in this dataset:
ddf = pd.read_csv('drinks.csv')
#sorting by continent and beer_servings columns
ddf = ddf.sort_values(by=['continent','beer_servings'], ascending=True)
#making the dataframe with only needed columns
ddf = ddf[['continent', 'country', 'beer_servings']].copy()
#exporting to csv
ddf.to_csv("drinks1.csv")
Output fragment:
continent,country,beer_servings
...
Africa,Botswana,173
Africa,Angola,217
Africa,South Africa,225
Africa,Gabon,347
Africa,Namibia,376
Asia,Afghanistan,0
Asia,Bangladesh,0
Asia,North Korea,0
Asia,Iran,0
Asia,Kuwait,0
Asia,Maldives,0
...
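If you do want the grouped mean itself sorted within each continent (what the original groupby was building toward), here is a minimal pandas sketch, assuming the same drinks.csv columns:
import pandas as pd

ddf = pd.read_csv('drinks.csv')

# Mean beer_servings per (continent, country), then sort each continent's
# countries from highest to lowest.
means = (ddf.groupby(['continent', 'country'], as_index=False)['beer_servings']
            .mean()
            .sort_values(['continent', 'beer_servings'], ascending=[True, False]))
print(means)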
I am using spark-sql 2.4.1 with Java 8.
I have a scenario where I have some metadata in dataset1, which is loaded from an HDFS Parquet file.
And I have another dataset2 which is read from a Kafka stream.
For each record of dataset2, I need to check whether its columnX value is present in dataset1.
If it is there in dataset1, then I need to replace the columnX value with the column1 value from dataset1.
Else, I need to generate a new value as max(column1) of dataset1 incremented by one, and store that record in dataset1.
Some sample data you can see here:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1165111237342523/3447405230020171/7035720262824085/latest.html
How can this be done in Spark?
Example:
val df1 = Seq(
  ("20359045","2263"),
  ("8476349","3280"),
  ("60886923","2860"),
  ("204831453","50330"),
  ("6487533","48236"),
  ("583633","46067")
).toDF("company_id_external","company_id")

val df2 = Seq(
  ("60886923","Chengdu Fuma Food Co,.Ltd"),          // company_id_external match found in df1
  ("608815923","Australia Deloraine Dairy Pty Ltd"),
  ("59322769","Consalac B.V."),
  ("583633","Boso oil and fat Co., Ltd. ")           // company_id_external match found in df1
).toDF("company_id_external","companyName")
If a match is found in df1:
Here only two "company_id_external" values of df2 match df1,
i.e. 60886923 & 583633 (the first and last records).
For these records of df2,
("60886923","Chengdu Fuma Food Co,.Ltd") becomes ==> ("2860","Chengdu Fuma Food Co,.Ltd")
("583633","Boso oil and fat Co., Ltd. ") becomes ==> ("46067","Boso oil and fat Co., Ltd. ")
Else, if no match is found in df1:
For the other two records of df2 there is no "company_id_external" match in df1, so I need to generate a company_id and add it to df1,
i.e. for ("608815923","Australia Deloraine Dairy Pty Ltd"),
("59322769","Consalac B.V.")
company_id generation logic:
new company_id = max(company_id) of df1 + 1
From the above the max is 50330, so 50330 + 1 => 50331; add this record to df1, i.e. ("608815923","50331").
Do the same for the other one, i.e. add ("59322769","50332") to df1.
So now:
df1 = Seq(
  ("20359045","2263"),
  ("8476349","3280"),
  ("60886923","2860"),
  ("204831453","50330"),
  ("6487533","48236"),
  ("583633","46067"),
  ("608815923","50331"),
  ("59322769","50332")
).toDF("company_id_external","company_id")
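A rough PySpark sketch of that logic (hypothetical variable names; it assumes df1 and df2 are plain DataFrames like the Scala ones above and ignores the Kafka-streaming side of the question):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Left-join dataset2 onto dataset1 by the external id.
joined = df2.join(df1, on="company_id_external", how="left")
matched = joined.filter(F.col("company_id").isNotNull())
unmatched = joined.filter(F.col("company_id").isNull())

# Generate new ids for unmatched rows: max(company_id) of df1 plus a running row number.
max_id = df1.agg(F.max(F.col("company_id").cast("long"))).first()[0]
w = Window.orderBy("company_id_external")
new_rows = unmatched.withColumn(
    "company_id", (F.lit(max_id) + F.row_number().over(w)).cast("string"))

# df2 with company_id_external replaced by company_id (freshly generated for unmatched rows).
result_df2 = (matched.unionByName(new_rows)
    .select(F.col("company_id").alias("company_id_external"), "companyName"))

# df1 with the newly generated ids appended.
df1_updated = df1.unionByName(new_rows.select("company_id_external", "company_id"))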
I need to convert columns into rows. Please help me with the below requirement in Spark Scala code. The input file is | delimited, and one of the columns has comma-delimited values; based on that comma delimiter I need to convert those values into rows.
my input records:
c11|c12|a,b|c14
c21|c22|a,c,d|c24
expected output :
a,c11,c12,c14
b,c11,c12,c14
a,c21,c22,c24
c,c21,c22,c24
d,c21,c22,c24
Thanks,
Siva
First, read the file as csv with | as the separator.
This gives a dataframe with the base columns you need, except the third one, which is a string. Let's say this column is called _c2 (the default name for the third column). Now you can split that string to get an array; we also drop the original column, as we don't need it anymore.
Lastly, we use explode to turn the array into rows and drop the intermediate array column:
from pyspark.sql.functions import split
from pyspark.sql.functions import explode
df1 = spark.read.csv("pathToFile", sep="|")
df2 = df1.withColumn("splitted", split(df1["_c2"],",")).drop("_c2")
df3 = df2.withColumn("exploded", explode(df2["splitted"])).drop("splitted")
or in Scala:
import org.apache.spark.sql.functions.split
import org.apache.spark.sql.functions.explode

val df1 = spark.read.option("sep", "|").csv("pathToFile")
val df2 = df1.withColumn("splitted", split(df1("_c2"), ",")).drop("_c2")
val df3 = df2.withColumn("exploded", explode(df2("splitted"))).drop("splitted")
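To match the expected output order exactly (the exploded value first), a final select can reorder the columns; for instance on the PySpark side, assuming the default _c0.._c3 names from the headerless read:
df3.select("exploded", "_c0", "_c1", "_c3").show()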