Create new dataframe column where cell value indicates animal type - python-3.x

Suppose that I have the following dataframe, called pet_stores, consisting of the number of cats and dogs per location of a pet store franchise:
   Dog  Cat           City
0    5   11            NYC
1    4    1  San Francisco
How can I transform this dataframe such that instead of having separate Dog and Cat columns I have a single column called animal_type? I want the following result:
  animal_type  Count           City
0         Dog      5            NYC
1         Dog      4  San Francisco
2         Cat     11            NYC
3         Cat      1  San Francisco
Thanks!

Use melt:
>>> df.melt('City', var_name='animal_type', value_name='Count')
            City animal_type  Count
0            NYC         Dog      5
1  San Francisco         Dog      4
2            NYC         Cat     11
3  San Francisco         Cat      1
Without melt, using index manipulation:
>>> df.set_index('City').rename_axis(columns='animal_type') \
.stack().rename('Count').reset_index()
            City animal_type  Count
0            NYC         Dog      5
1            NYC         Cat     11
2  San Francisco         Dog      4
3  San Francisco         Cat      1
Use sort_values to change the order of rows.
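For instance, a minimal runnable sketch (using the question's data) that melts and then sorts so each city's animals are grouped together:

```python
import pandas as pd

df = pd.DataFrame({'Dog': [5, 4], 'Cat': [11, 1],
                   'City': ['NYC', 'San Francisco']})

long_df = df.melt('City', var_name='animal_type', value_name='Count')

# Sort so each city's rows sit together, then rebuild a clean index
result = long_df.sort_values(['City', 'animal_type']).reset_index(drop=True)
print(result)
```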

Related

Collect columns that share the same header prefix

I have a sheet whose header looks like this:
'''
+--------------+------------------+----------------+--------------+---------------+
| usa_alaska   | usa_california   | france_paris   | italy_roma   | france_lyon   |
+--------------+------------------+----------------+--------------+---------------+
'''
df = pd.DataFrame([], columns = 'usa_alaska usa_california france_paris italy_roma france_lyon'.split())
I want to separate the headers by country and region, so that when I select france I get paris and lyon as columns.
Create a MultiIndex from your column names:
Suppose this dataframe:
>>> df
   usa_alaska  usa_california  france_paris  italy_roma  france_lyon
0           1               2             3            4            5
df.columns = df.columns.str.split('_', expand=True)
df = df.sort_index(axis=1)
Output
>>> df
  france       italy    usa
    lyon paris  roma alaska california
0      5     3     4      1          2
>>> df['france']
   lyon  paris
0     5      3
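Putting the steps above together as a self-contained sketch:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4, 5]],
    columns='usa_alaska usa_california france_paris italy_roma france_lyon'.split())

# Split each header on '_' into a (country, region) MultiIndex, then sort
df.columns = df.columns.str.split('_', expand=True)
df = df.sort_index(axis=1)

print(df['france'])  # only the french regions, as plain columns
```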

Extract the mapping dictionary between two columns in pandas

I have a dataframe as shown below.
df:
id  player   country_code  country
1   messi    arg           argentina
2   neymar   bra           brazil
3   tevez    arg           argentina
4   aguero   arg           argentina
5   rivaldo  bra           brazil
6   owen     eng           england
7   lampard  eng           england
8   gerrard  eng           england
9   ronaldo  bra           brazil
10  marria   arg           argentina
From the above df, I would like to extract the dictionary that maps the country_code column to the country column.
Expected Output:
d = {'arg':'argentina', 'bra':'brazil', 'eng':'england'}
Dictionary keys must be unique, so you can convert the Series directly, even though its index (country_code) contains duplicates:
d = df.set_index('country_code')['country'].to_dict()
If a country_code could map to more than one country, the last value encountered for each code wins.
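A quick runnable check of that last-value-wins behavior (the conflicting third row is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'country_code': ['arg', 'bra', 'arg'],
    'country':      ['argentina', 'brazil', 'ARGENTINA-LATER'],
})

d = df.set_index('country_code')['country'].to_dict()
# The later 'arg' row silently overwrites the earlier one
print(d)
```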

How to calculate percentage existence

import pandas as pd

pets = pd.DataFrame.from_dict({
    'Izzy':   ['Smith',  'cat', 1],
    'Lynx':   ['Smith',  'cat', 9],
    'Oreo':   ['Smith',  'dog', 7],
    'Archie': ['Mack',   'dog', 3],
    'Prim':   ['Mack',   'cat', 1],
    'Fern':   ['Somers', 'cat', 12],
}, orient='index')
pets.columns = ['family', 'type', 'age']
        family type  age
Izzy     Smith  cat    1
Lynx     Smith  cat    9
Oreo     Smith  dog    7
Archie    Mack  dog    3
Prim      Mack  cat    1
Fern    Somers  cat   12
I want to calculate the average age of each type of pet across all families, and also the percentage of families that own each type of pet. So I'm starting with this, which is easy.
pets.groupby('type')['age'].mean()

type
cat    5.75
dog    5.00
Name: age, dtype: float64
But I'm not sure how to get the second number: in this case 100% for cats and 67% for dogs. I'm sure I can get there in several steps, but is there an easy way to do this in the same groupby?
Try:
x = pets.groupby("type")["family"].nunique() / pets["family"].nunique() * 100
print(x)
Prints:
type
cat    100.000000
dog     66.666667
Name: family, dtype: float64
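If you do want both numbers from one groupby call, a sketch using named aggregation (pandas >= 0.25; the column names match the question's frame):

```python
import pandas as pd

pets = pd.DataFrame({
    'family': ['Smith', 'Smith', 'Smith', 'Mack', 'Mack', 'Somers'],
    'type':   ['cat', 'cat', 'dog', 'dog', 'cat', 'cat'],
    'age':    [1, 9, 7, 3, 1, 12],
})

n_families = pets['family'].nunique()
summary = pets.groupby('type').agg(
    avg_age=('age', 'mean'),
    pct_families=('family', lambda s: s.nunique() / n_families * 100),
)
print(summary)
```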

Excel merge cells equivalent in SQL

I am trying to generate SQL output similar to Excel's merged cells.
Sample output from a SQL select statement:
Name   Country  State  years  lived
Alice  USA      CA     2      2
Alice  USA      NYC    1      1
Alice  USA      MI     5      5
Bob    USA      CA     1      1
Bob    USA      NYC    8      8
Bob    USA      IL     4      4
I am trying to convert the output into the format below using a SQL query, so I can export it to Excel directly instead of modifying it in Excel separately.
Name   Country  State  years  lived
                CA     2
                NYC    1
Alice  USA      MI     5      8
                CA     1
                NYC    8
Bob    USA      IL     4      13
Any help is appreciated.
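No answer was posted for this one. As a sketch of one possible approach: window functions can blank the repeated Name/Country values and attach the group total to the last row of each group. The demo below runs the query through Python's sqlite3 module (needs SQLite >= 3.25 for window functions); the table and column names are invented to mirror the sample data.

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE residence (name TEXT, country TEXT, state TEXT, years INT)")
con.executemany("INSERT INTO residence VALUES (?, ?, ?, ?)", [
    ('Alice', 'USA', 'CA', 2), ('Alice', 'USA', 'NYC', 1), ('Alice', 'USA', 'MI', 5),
    ('Bob',   'USA', 'CA', 1), ('Bob',   'USA', 'NYC', 8), ('Bob',   'USA', 'IL', 4),
])

rows = con.execute("""
    SELECT CASE WHEN rn = cnt THEN name    ELSE '' END AS name_out,
           CASE WHEN rn = cnt THEN country ELSE '' END AS country_out,
           state,
           years,
           CASE WHEN rn = cnt THEN total   ELSE '' END AS lived
    FROM (
        SELECT name, country, state, years,
               ROW_NUMBER() OVER (PARTITION BY name ORDER BY rowid) AS rn,
               COUNT(*)     OVER (PARTITION BY name)                AS cnt,
               SUM(years)   OVER (PARTITION BY name)                AS total
        FROM residence
    )
    ORDER BY name, rn
""").fetchall()

for row in rows:
    print(row)
```

The `rn = cnt` test marks the last row of each name's group, which is where the merged-looking Name/Country labels and the group total land, matching the desired layout.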

Select two or more consecutive rows based on a criteria using python

I have a data set like this:
user  time              city  cookie  index
A     2019-01-01 11.00  NYC   123456  1
A     2019-01-01 11.12  CA    234567  2
A     2019-01-01 11.18  TX    234567  3
B     2019-01-02 12.19  WA    456789  4
B     2019-01-02 12.21  FL    456789  5
B     2019-01-02 12.31  VT    987654  6
B     2019-01-02 12.50  DC    157890  7
A     2019-01-03 09:12  CA    123456  8
A     2019-01-03 09:27  NYC   345678  9
A     2019-01-03 09:34  TX    123456  10
A     2019-01-04 09:40  CA    234567  11
In this data set I want to compare and select two or more consecutive rows which fit the following criteria:
User should be same
Time difference should be less than 15 mins
Cookie should be different
So if I apply the filter I should get the following data:
user  time              city  cookie  index
A     2019-01-01 11.00  NYC   123456  1
A     2019-01-01 11.12  CA    234567  2
B     2019-01-02 12.21  FL    456789  5
B     2019-01-02 12.31  VT    987654  6
A     2019-01-03 09:12  CA    123456  8
A     2019-01-03 09:27  NYC   345678  9
A     2019-01-03 09:34  TX    123456  10
So, in the above, the first two rows (index 1 and 2) satisfy all the conditions. The next pair (index 2 and 3) have the same cookie, index 3 and 4 have different users, 5 and 6 are selected and displayed, and 6 and 7 have a time difference of more than 15 minutes. Rows 8, 9 and 10 fit the criteria, but 11 doesn't because it is 24 hours apart.
How can I solve this using python dataframe? All help is appreciated.
What I have tried: creating flags with shift():
cookiediff = pd.DataFrame(df.cookie == df.cookie.shift())
cookiediff.columns = ['cookiediffs']
timediff = pd.DataFrame(pd.to_datetime(df.time) - pd.to_datetime(df.time.shift()))
timediff.columns = ['timediff']
mask = df.user != df.user.shift(1)
timediff.timediff[mask] = np.nan
cookiediff['cookiediffs'][mask] = np.nan
This will do the trick:
import numpy as np
import pandas as pd

# The sample data mixes '.' and ':' as the time delimiter - normalize it first
df["time"] = df["time"].str.replace(":", ".", regex=False)
df["time"] = pd.to_datetime(df["time"], format="%Y-%m-%d %H.%M")

fifteen = pd.Timedelta(minutes=15)
cond_ = np.logical_or(
    # row matches its predecessor on all three criteria ...
    df["time"].diff().abs().lt(fifteen)
    & df["user"].eq(df["user"].shift())
    & df["cookie"].ne(df["cookie"].shift()),
    # ... or matches its successor
    df["time"].diff(-1).abs().lt(fifteen)
    & df["user"].eq(df["user"].shift(-1))
    & df["cookie"].ne(df["cookie"].shift(-1)),
)
res = df.loc[cond_]
A few points: you need to make sure the time column is a real datetime so the 15-minute condition can be checked.
Then the final filter (cond_) compares each row against the previous one on all three conditions, OR does the same against the next one; without the second check you would get all the consecutive matching rows except the first of each run.
Outputs:
  user                time city  cookie  index
0    A 2019-01-01 11:00:00  NYC  123456      1
1    A 2019-01-01 11:12:00   CA  234567      2
4    B 2019-01-02 12:21:00   FL  456789      5
5    B 2019-01-02 12:31:00   VT  987654      6
7    A 2019-01-03 09:12:00   CA  123456      8
8    A 2019-01-03 09:27:00  NYC  345678      9
9    A 2019-01-03 09:34:00   TX  123456     10
You could use regular expressions with named groups to isolate the fields. Call re.search() with the pattern on each line and store the result of groupdict() as the current dictionary. Iterating through the dataset with two dictionaries, the current one and the one from the previous line, lets you compare field values line by line.
So, something like:
import re
c_dict = re.search(r'(?P<user>\w+) +(?P<time>\d{4}-\d{2}-\d{2} \d{2}[.:]\d{2}) +'
                   r'(?P<city>\w+) +(?P<cookie>\d{6}) +(?P<index>\d+)', s).groupdict()
for each line of your dataset. For the first line of your dataset, this would create the dictionary {'user': 'A', 'time': '2019-01-01 11.00', 'city': 'NYC', 'cookie': '123456', 'index': '1'}. With the fields isolated, you could easily compare the values of the fields to previous lines if you stored those in another dictionary.
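A minimal runnable sketch of that loop, checking just the cookie criterion against a few lines taken from the question's data:

```python
import re

pattern = re.compile(
    r'(?P<user>\w+) +(?P<time>\d{4}-\d{2}-\d{2} \d{2}[.:]\d{2}) +'
    r'(?P<city>\w+) +(?P<cookie>\d{6}) +(?P<index>\d+)'
)

lines = [
    "A 2019-01-01 11.00 NYC 123456 1",
    "A 2019-01-01 11.12 CA 234567 2",
    "A 2019-01-01 11.18 TX 234567 3",
]

last = None
changed = []  # (previous index, current index) pairs where the cookie changed
for line in lines:
    cur = pattern.search(line).groupdict()
    if last is not None and cur['cookie'] != last['cookie']:
        changed.append((last['index'], cur['index']))
    last = cur
print(changed)
```

The same loop extends to the user and 15-minute criteria by parsing the time field and comparing it across the two dictionaries.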
