ID Name Gender Country
1 Arun Male India
2 Akshay Male England
3 Chethna Female India
4 Priya Female China
5 Piyush Male India
6 Arun Male England
7 Tisha Female England
8 Chethna Female China
I want to group them by gender (Male/Female) first, then by the associated country.
Query1 : select Gender, count(distinct name) from Table group by Gender
Output:
Gender count(distinct name)
Male 3
Female 3
I copy the result into JSON like this:
result : {male : {count : 3}, female : {count : 3} }
Query2 : select Gender, Country, count(distinct name) from Table group by Gender, Country
Output:
Gender Country count(distinct name)
Male India 2
Male England 2
Female India 1
Female China 2
Female England 1
Adding this result to the above JSON:
result : {Male:{count:3,India:{count:2},England:{count:2}},Female:{count:3,India:{count:1},China:{count:2},England:{count:1}}}
So can I achieve this in a single query?
You can compute the counts by gender and by gender+country in a single query by using GROUPING SETS:
WITH data(id, name, gender, country) AS (
VALUES
(1, 'Arun', 'Male' , 'India'),
(2, 'Akshay', 'Male' , 'England'),
(3, 'Chethna', 'Female', 'India'),
(4, 'Priya', 'Female', 'China'),
(5, 'Piyush', 'Male' , 'India'),
(6, 'Arun', 'Male' , 'England'),
(7, 'Tisha', 'Female', 'England'),
(8, 'Chethna', 'Female', 'China'))
SELECT gender, country, count(distinct name)
FROM data
GROUP BY GROUPING SETS ((gender), (gender, country))
which produces:
gender | country | _col2
--------+---------+-------
Male | England | 2
Female | China | 2
Male | NULL | 3
Female | NULL | 3
Female | India | 1
Male | India | 2
Female | England | 1
(7 rows)
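If you also want the nested JSON shape from your question, you can fold the rows of that single result set into a dictionary in application code. Here is a minimal Python sketch, assuming the rows arrive as (gender, country, count) tuples from your database driver; the variable names are only illustrative:
# Rows as returned by the GROUPING SETS query above; a NULL country marks the
# per-gender subtotal produced by the (gender) grouping set.
rows = [
    ('Male',   'England', 2),
    ('Female', 'China',   2),
    ('Male',   None,      3),
    ('Female', None,      3),
    ('Female', 'India',   1),
    ('Male',   'India',   2),
    ('Female', 'England', 1),
]

result = {}
for gender, country, cnt in rows:
    node = result.setdefault(gender, {})
    if country is None:
        node['count'] = cnt             # subtotal row -> gender-level count
    else:
        node[country] = {'count': cnt}  # detail row -> per-country count

print(result)
# nested dict matching the JSON shape in the question (key order follows row order)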
I have a dataframe as shown below.
df:
id player country_code country
1 messi arg argentina
2 neymar bra brazil
3 tevez arg argentina
4 aguero arg argentina
5 rivaldo bra brazil
6 owen eng england
7 lampard eng england
8 gerrard eng england
9 ronaldo bra brazil
10 marria arg argentina
From the above df, I would like to extract a mapping dictionary that relates the country_code column to the country column.
Expected Output:
d = {'arg':'argentina', 'bra':'brazil', 'eng':'england'}
A dictionary has unique keys, so it is possible to convert a Series with a duplicated index built from the country_code column:
d = df.set_index('country_code')['country'].to_dict()
If a country_code can map to more than one country value, the last value per country_code is kept.
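For completeness, a minimal runnable sketch with a subset of the sample data; the column names follow the question:
import pandas as pd

df = pd.DataFrame({
    'id':           [1, 2, 3, 6],
    'player':       ['messi', 'neymar', 'tevez', 'owen'],
    'country_code': ['arg', 'bra', 'arg', 'eng'],
    'country':      ['argentina', 'brazil', 'argentina', 'england'],
})

# Duplicated index labels are fine here; later rows simply overwrite earlier ones
d = df.set_index('country_code')['country'].to_dict()
print(d)  # {'arg': 'argentina', 'bra': 'brazil', 'eng': 'england'}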
I have two dataframes:
Dataframe_A:
Account_Nbr Customer_ID Gender
1234 A1234 male
5678 ? female
Dataframe_B:
Account_Nbr Customer_ID
1234 A1234
5678 B5678
I want to replace the '?' in Dataframe_A with 'B5678'. Here is my code:
Dataframe_A = Dataframe_A.assign(
    Customer_ID = lambda x:
        [cid if (cid != '?') else
         Dataframe_B.loc[Dataframe_B['Account_Nbr'] == acct, ['Customer_ID']]
         for cid, acct in zip(x.Customer_ID, x.Account_Nbr)])
Dataframe_A
But the output is not what I expect:
Account_Nbr Customer_ID Gender
1234 A1234 male
5678 Customer_ID female
B5678
It looks like it replaces the cell with the whole Series. How can I get output like the following? Thank you.
Account_Nbr Customer_ID Gender
1234 A1234 male
5678 B5678 female
The below code should do the job.
import pandas as pd
df1 = pd.DataFrame([
[1234, 'A1234', 'male'],
[5678, '?', 'female']], columns=['Account_Nbr', 'Customer_ID', 'Gender'])
df2 = pd.DataFrame([
[1234, 'A1234'],
[5678, 'B5678']], columns=['Account_Nbr', 'Customer_ID'])
mask = df1['Account_Nbr'] == df2['Account_Nbr']
df1.loc[mask, 'Customer_ID'] = df2[mask]['Customer_ID']
df1.head()
Output:
Account_Nbr Customer_ID Gender
0 1234 A1234 male
1 5678 B5678 female
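Note that the mask above relies on both frames having the same row order and index. If that is not guaranteed, one alternative (just a sketch, reusing the same df1 and df2 as above) is to build an Account_Nbr-to-Customer_ID lookup from df2 and fill only the '?' cells:
# Build a lookup Series indexed by Account_Nbr and map it onto the '?' rows only
lookup = df2.set_index('Account_Nbr')['Customer_ID']
missing = df1['Customer_ID'] == '?'
df1.loc[missing, 'Customer_ID'] = df1.loc[missing, 'Account_Nbr'].map(lookup)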
I have a requirement to copy data from one Hive source table to another target table. Below is the source table structure with sample data:
source_table
Userid Name Phone1 Phone2 Phone3 Address1 Address2 Address3
123 Jitu 123456 987654 111111 DELHI GURGAON NOIDA
234 Mark 123456 987654 111111 UK USA IND
While copying data from source to target, my requirement is to collapse Phone1, Phone2 and Phone3, along with the corresponding Address1, Address2 and Address3, into a single Phone_no column and a single Address column in the target table. Below is how the data should look in the target table:
Target_table
Userid Name Phone_no Address
123 Jitu 123456 DELHI
123 Jitu 987654 GURGAON
123 Jitu 111111 NOIDA
234 Mark 123456 UK
234 Mark 987654 USA
234 Mark 111111 IND
I know the simplest way to do this would be multiple inserts into the target table, one per Phone/Address column pair from the source table, using either Hive query language or Spark dataframes.
Is there a more efficient method I can use to achieve this?
The original dataframe can be selected several times, once per column index, and the selected dataframes then combined into one with union:
val df = Seq(
(123, "Jitu", "123456", "987654", "111111", "DELHI", "GURGAON", "NOIDA"),
(234, "Mark", "123456", "987654", "111111", "UK", "USA", "IND")
).toDF(
"Userid", "Name", "Phone1", "Phone2", "Phone3", "Address1", "Address2", "Address3"
)
val columnIndexes = Seq(1, 2, 3)
val onlyOneIndexDfs = columnIndexes.map(idx =>
df.select(
$"Userid",
$"Name",
col(s"Phone$idx").alias("Phone_no"),
col(s"Address$idx").alias("Address")))
val result = onlyOneIndexDfs.reduce(_ union _)
Output:
+------+----+--------+-------+
|Userid|Name|Phone_no|Address|
+------+----+--------+-------+
|123 |Jitu|123456 |DELHI |
|123 |Jitu|111111 |NOIDA |
|123 |Jitu|987654 |GURGAON|
|234 |Mark|123456 |UK |
|234 |Mark|987654 |USA |
|234 |Mark|111111 |IND |
+------+----+--------+-------+
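If you prefer the Python side of Spark, here is a sketch of the same unpivot using the built-in stack SQL function, which avoids the explicit union; it assumes a SparkSession named spark is available:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(123, "Jitu", "123456", "987654", "111111", "DELHI", "GURGAON", "NOIDA"),
     (234, "Mark", "123456", "987654", "111111", "UK", "USA", "IND")],
    ["Userid", "Name", "Phone1", "Phone2", "Phone3", "Address1", "Address2", "Address3"])

# stack(3, ...) emits one (Phone_no, Address) row per pair for each input row
result = df.selectExpr(
    "Userid", "Name",
    "stack(3, Phone1, Address1, Phone2, Address2, Phone3, Address3) as (Phone_no, Address)")

result.show()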
Just in case you are interested in a Hive solution as well: a lateral view yields a Cartesian product when multiple arrays are exploded together, so you can achieve the same result with posexplode and a match on position (where x = t), as shown below:
select Userid,Name,phone,address
from source_table
lateral view posexplode(array(Phone1,Phone2,Phone3)) valphone as x,phone
lateral view posexplode(array(Address1,Address2,Address3)) valaddress as t,address
where x=t
;
hive> set hive.cli.print.header=true;
userid name phone address
123 Jitu 123456 DELHI
123 Jitu 987654 GURGAON
123 Jitu 111111 NOIDA
234 Mark 123456 UK
234 Mark 987654 USA
234 Mark 111111 IND
Time taken: 2.759 seconds, Fetched: 6 row(s)
I have a pandas dataframe like this:
df1:
id name gender
1 Alice Male
2 Jenny Female
3 Bob Male
Now I want to add a new column sport which will contain values in the form of a list. Say I want to add Football to the rows where gender is Male, so df1 will look like:
df1:
id name gender sport
1 Alice Male [Football]
2 Jenny Female NA
3 Bob Male [Football]
Now I want to add Badminton to rows where gender is Female and Tennis to rows where gender is Male, so that the final output is:
df1:
id name gender sport
1 Alice Male [Football,Tennis]
2 Jenny Female [Badminton]
3 Bob Male [Football,Tennis]
How can I write a general function in Python that appends new values to this column based on the value of some other column?
The below should work for you. Initialize the column with an empty list and proceed:
import numpy as np

df['sport'] = np.empty((len(df), 0)).tolist()  # start every row with its own empty list

def append_sport(df, filter_df, sport):
    # x.append(sport) returns None, so `or x` hands the mutated list back to apply
    df.loc[filter_df, 'sport'] = df.loc[filter_df, 'sport'].apply(lambda x: x.append(sport) or x)
    return df
filter_df = (df.gender == 'Male')
df = append_sport(df, filter_df, 'Football')
df = append_sport(df, filter_df, 'Cricket')
Output
id name gender sport
0 1 Alice Male [Football, Cricket]
1 2 Jenny Female []
2 3 Bob Male [Football, Cricket]
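The same helper can be reused with any other mask and sport; for example, for the Female rows from the question (just a usage sketch):
df = append_sport(df, df.gender == 'Female', 'Badminton')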
I want to add multiple rows by deriving them from a string column in Stata.
I have a dataset like the following one:
year countryname intensitylevel
1990 India, Pakistan 1
1991 India, Pakistan 1
1992 India, Pakistan 1
1996 India, Pakistan 1
To be more precise, I want to split the countryname variable so that each country gets its own row.
In the end, I want to have a dataset like the one below:
year countryname intensitylevel
1990 India 1
1990 Pakistan 1
1991 India 1
1991 Pakistan 1
This is a simple split and reshape:
clear
input year str15 countryname intensitylevel
1990 "India, Pakistan" 1
1991 "India, Pakistan" 1
1992 "India, Pakistan" 1
1996 "India, Pakistan" 1
end
split countryname, p(,)
drop countryname
reshape long countryname, i(countryname* year)
sort year countryname
list year countryname intensitylevel, abbreviate(15) sepby(year)
+-------------------------------------+
| year countryname intensitylevel |
|-------------------------------------|
1. | 1990 Pakistan 1 |
2. | 1990 India 1 |
|-------------------------------------|
3. | 1991 Pakistan 1 |
4. | 1991 India 1 |
|-------------------------------------|
5. | 1992 Pakistan 1 |
6. | 1992 India 1 |
|-------------------------------------|
7. | 1996 Pakistan 1 |
8. | 1996 India 1 |
+-------------------------------------+