How to get a list of column names in featuretools

How can I get a list of column names in featuretools?
With pandas DataFrames I just use:
dataframe.columns
which returns a list of column names.
However, I tried to do the same with an entity set and failed.
Should I convert the entity set to a DataFrame?
Right now I'm converting the variables to strings and using regular expressions to extract the variable names, but I believe there is a better way to do it.
Thank you.

An entity set contains multiple dataframes, one for each entity.
In [1]: import featuretools as ft
In [2]: es = ft.demo.load_retail()
In [3]: es
Out[3]:
Entityset: demo_retail_data
Entities:
order_products [Rows: 401604, Columns: 7]
customers [Rows: 4372, Columns: 2]
products [Rows: 3684, Columns: 3]
orders [Rows: 22190, Columns: 5]
Relationships:
order_products.product_id -> products.product_id
order_products.order_id -> orders.order_id
orders.customer_name -> customers.customer_name
If you want the variables for the "orders" entity, run:
In [4]: es["orders"].variables
Out[4]:
[<Variable: order_id (dtype = index)>,
<Variable: country (dtype = categorical)>,
<Variable: cancelled (dtype = boolean)>,
<Variable: first_order_products_time (dtype: datetime_time_index, format: None)>,
<Variable: customer_name (dtype = id)>]
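If you only need the column names as strings, a minimal sketch (assuming each Variable object exposes an id attribute, as in featuretools 0.x):
[v.id for v in es["orders"].variables]
# ['order_id', 'country', 'cancelled', 'first_order_products_time', 'customer_name']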
If you want to access the underlying dataframe itself, run:
In [5]: es["orders"].df.head(5)
Out[5]:
order_id country cancelled first_order_products_time customer_name
536365 536365 United Kingdom False 2010-12-01 08:26:00 Andrea Brown
536366 536366 United Kingdom False 2010-12-01 08:28:00 Andrea Brown
536367 536367 United Kingdom False 2010-12-01 08:34:00 Krista Maddox
536368 536368 United Kingdom False 2010-12-01 08:34:00 Krista Maddox
536369 536369 United Kingdom False 2010-12-01 08:35:00 Krista Maddox
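Since .df is a plain pandas DataFrame, the usual pandas attribute works on it as well:
es["orders"].df.columns.tolist()
# ['order_id', 'country', 'cancelled', 'first_order_products_time', 'customer_name']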

Related

Create a parent child tree dictionary based on two columns in dataframe

Suppose I have a dataframe containing countries and cities, looking like this:
data = {'Parent':['Netherlands','Belgium','Germany','France'],'Child':['Amsterdam','Brussels','Berlin', '']}
I want to create a tree dictionary depicting which city belongs to which country.
In this example the country France has no cities; I don't want the empty values as child nodes in the dictionary.
Can someone point me in the right direction on what to use to reach this solution?
Kind regards
Using pandas:
df = pd.DataFrame(data)
df.transpose()
0 1 2 3
Parent Netherlands Belgium Germany France
Child Amsterdam Brussels Berlin
Or using zip:
dict(zip(*data.values()))
{'Netherlands': 'Amsterdam', 'Belgium': 'Brussels', 'Germany': 'Berlin', 'France': ''}
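If you want to skip the empty children while building the dictionary, a minimal sketch using the same data dict from the question:
{country: city for country, city in zip(data['Parent'], data['Child']) if city}
# {'Netherlands': 'Amsterdam', 'Belgium': 'Brussels', 'Germany': 'Berlin'}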

sorting multi-column grouped by data frame

I'm trying to work on this data set drinks by country and find out the mean of beer servings of each country in each continent sorted from highest to lowest.
So my result should look something like below:
South America: Venezuela 333, Brazil 245, Paraguay 213
and so on for the other continents (I don't want to mix countries of different continents!)
Creating the grouped data without the sorting is quite easy like below:
ddf = pd.read_csv('drinks.csv')
grouped_continent_and_country = ddf.groupby(['continent', 'country'])
print(grouped_continent_and_country['beer_servings'].mean())
But how do I do the sorting?
Thanks a lot.
In this case you can just sort values by 'continent' and 'beer_servings' without applying .mean():
ddf = pd.read_csv('drinks.csv')
#sorting by continent and beer_servings columns
ddf = ddf.sort_values(by=['continent','beer_servings'], ascending=True)
#making the dataframe with only needed columns
ddf = ddf[['continent', 'country', 'beer_servings']].copy()
#exporting to csv
ddf.to_csv("drinks1.csv")
Output fragment:
continent,country,beer_servings
...
Africa,Botswana,173
Africa,Angola,217
Africa,South Africa,225
Africa,Gabon,347
Africa,Namibia,376
Asia,Afghanistan,0
Asia,Bangladesh,0
Asia,North Korea,0
Asia,Iran,0
Asia,Kuwait,0
Asia,Maldives,0
...
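If you do want the mean beer servings per country, sorted from highest to lowest within each continent, a minimal sketch (assuming the same drinks.csv columns as above):
import pandas as pd

ddf = pd.read_csv('drinks.csv')
means = ddf.groupby(['continent', 'country'], as_index=False)['beer_servings'].mean()
# continents alphabetically, beer_servings descending within each continent
means = means.sort_values(['continent', 'beer_servings'], ascending=[True, False])
print(means)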

How to create and populate new column in 2nd Pandas dataframe based on condition in 1st df?

I have a pandas dataframe which looks like the following.
df1 = {'City' :['London', 'Basel', 'Kairo'], 'Country': ['UK','Switzerland','Egypt']}
City Country
0 London UK
1 Basel Switzerland
2 Kairo Egypt
Now I want to assign a new column called 'Continent' that should be filled based on the values.
So df2 should look like:
City Country Continent
0 London UK Europe
1 Basel Switzerland Europe
2 Kairo Egypt Africa
So df1 indicates the values for the new column in df2. How do I do this in the easiest way?
With a dictionary of country -> continent, you can pass it to the .replace() method, like this:
df['Continent'] = df['Country'].replace(country_dict)
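A complete sketch with a hypothetical country_dict (extend the mapping to cover every country in your data):
import pandas as pd

df = pd.DataFrame({'City': ['London', 'Basel', 'Kairo'],
                   'Country': ['UK', 'Switzerland', 'Egypt']})

# hypothetical mapping; add the rest of your countries here
country_dict = {'UK': 'Europe', 'Switzerland': 'Europe', 'Egypt': 'Africa'}

df['Continent'] = df['Country'].replace(country_dict)
# df['Country'].map(country_dict) also works, but unmapped countries become NaN
# instead of keeping their original value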

Groupby with Cumcount - Not Working As Expected

I am looking to return a dataframe for the last 5 times a Product was sold & I'm running into issues.
Here is my dataframe:
np.random.seed(1111)
df = pd.DataFrame({
'Category':np.random.choice( ['Group A','Group B'], 10000),
'Sub-Category':np.random.choice( ['X','Y','Z'], 10000),
'Sub-Category-2':np.random.choice( ['G','F','I'], 10000),
'Product':np.random.choice( ['Product 1','Product 2','Product 3'], 10000),
'Units_Sold':np.random.randint(1,100, size=(10000)),
'Dollars_Sold':np.random.randint(100,1000, size=10000),
'Customer':np.random.choice(pd.util.testing.rands_array(10,25,dtype='str'),10000),
'Date':np.random.choice( pd.date_range('1/1/2016','12/31/2018',
freq='D'), 10000)})
I thought I could sort the Dataframe by date then use .cumcount() to create a helper column to later filter by. Here's what I tried:
df = df.sort_values('Date',ascending=False)
df['count_product'] = df.groupby(['Date','Product']).cumcount() + 1
df2 = df.loc[df.count_product < 5]
This does not work as intended. Based on the data above, I would have expected Product 1 to have the following dates included in the new dataframe: 2018-12-31, 2018-12-30, 2018-12-29, 2018-12-28, & 2018-12-27. Product 3 would have the dates 2018-12-31, 2018-12-30, 2018-12-29, 2018-12-28, & 2018-12-26.
Any suggestions?
Use drop_duplicates, then groupby with head(5); after that filter, merge back to the original frame:
yourdf = df.drop_duplicates(['Product','Date']).groupby('Product').head(5)[['Product','Date']].merge(df)
You can create a filter from the groupby:
s = df.groupby('Product').apply(lambda x: x.Date.ge(x.Date.drop_duplicates().nlargest(5).iloc[-1])).reset_index(level=0, drop=True)
df2 = df.loc[s]
Just to check:
df2.groupby('Product').Date.agg(['min', 'max'])
min max
Product
Product 1 2018-12-27 2018-12-31
Product 2 2018-12-27 2018-12-31
Product 3 2018-12-26 2018-12-31
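To sanity-check the first approach the same way, a quick sketch assuming yourdf from the first answer; its per-product date range should match the table above:
yourdf.groupby('Product').Date.agg(['min', 'max'])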

Join two files in Pyspark without using sparksql/dataframes

I have two files customer and sales like below
Customer :
cu_id name region city state
1 Rahul ME Vizag AP
2 Raghu SE HYD TS
3 Rohith ME BNLR KA
Sales:
sa_id sales country
2 100000 IND
3 230000 USA
4 240000 UK
Both the files are \t delimited.
I want to join both files based on the cu_id from customer and the sa_id from sales, using pyspark without using sparksql/dataframes.
Your help is very much appreciated.
You can definitely use the join methods that Spark offers for working with RDDs.
You can do something like:
customerRDD = sc.textFile("customers.tsv").map(lambda row: (row.split('\t')[0], "\t".join(row.split('\t')[1:])))
salesRDD = sc.textFile("sales.tsv").map(lambda row: (row.split('\t')[0], "\t".join(row.split('\t')[1:])))
joinedRDD = customerRDD.join(salesRDD)
And you will get a new RDD that contains only the joined records from both the customer and sales files.
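To illustrate the full flow, a minimal sketch (assuming the file names above, an existing SparkContext named sc, and no header rows in the files; filter a header line out first if there is one):
def to_pair(line):
    # use the first tab-delimited field as the join key, keep the rest as the value
    fields = line.split('\t')
    return fields[0], fields[1:]

customerRDD = sc.textFile("customers.tsv").map(to_pair)
salesRDD = sc.textFile("sales.tsv").map(to_pair)

# join() is an inner join on the key, so only ids present in both files survive,
# e.g. ('2', (['Raghu', 'SE', 'HYD', 'TS'], ['100000', 'IND']))
joinedRDD = customerRDD.join(salesRDD)
print(joinedRDD.collect())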
