How to use list comprehension in pandas to create a new series for a plot? - python-3.x

This is by far the most difficult problem I have faced. I am trying to create plots indexed on ratetype. For example, I want to efficiently build a matrix of unique ratetypes against the average customer number for each ratetype. Writing a lambda expression that selects the rows matching each individual ratetype, takes the average customer number for that type, and then builds a series from those two equal-length lists is over my head in pandas.
The number of different ratetypes can be in the hundreds. Reading them into a list programmatically is clearly a better choice than hard-coding each possibility, since the list will only grow in size and variability.
""" a section of the data for example use. Working with column "Ratetype"
column "NumberofCustomers" to work towards getting something like
list1 = unique occurs of ratetypes
list2 = avg number of customers for each ratetype
rt =['fixed','variable',..]
avg_cust_numbers = [45.3,23.1,...]
**basically for each ratetype: get mean of all row data for custno column**
ratetype,numberofcustomers
fixed,1232
variable, 1100
vec, 199
ind, 1211
alg, 123
bfd, 788
csv, 129
ggg, 1100
aaa, 566
acc, 439
"""
df[['ratetype', 'numberofcustomers']]
fixed = df.loc[df['ratetype'] == 'fixed']
avg_fixed_custno = fixed['numberofcustomers'].mean()
rt_counts = df.ratetype.value_counts()
rt_uniques = df.ratetype.unique()
# rt_uniques would be same size vector as avg_cust_nos, has to be anyway
avg_cust_nos = [avg_fixed_custno, avg_variable_custno]
My goal is to create and plot these subplots using matplotlib.pyplot.
data = {'ratetypes': pd.Series(rt_counts, index=rt_uniques),
'Avg_cust_numbers': pd.Series(avg_cust_nos, index=rt_uniques),
}
df = pd.DataFrame(data)
df = df.sort_values(by=['ratetypes'], ascending=False)
fig, axes = plt.subplots(nrows=2, ncols=1)
for i, c in enumerate(df.columns):
    df[c].plot(kind='bar', ax=axes[i], figsize=(12, 10), title=c)
plt.savefig('custno_byrate.png', bbox_inches='tight')
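For what it's worth, the per-ratetype averages the question describes can be computed without any lambda or hard-coding via groupby; a minimal sketch on made-up rows that follow the sample data:

```python
import pandas as pd

# Made-up rows following the sample data in the question
df = pd.DataFrame({
    "ratetype": ["fixed", "variable", "fixed", "vec", "variable"],
    "numberofcustomers": [1232, 1100, 800, 199, 950],
})

# One mean per unique ratetype -- scales to hundreds of ratetypes unchanged
avg_cust = df.groupby("ratetype")["numberofcustomers"].mean()
rt_counts = df["ratetype"].value_counts()

# Both results are indexed by ratetype, so they align automatically
summary = pd.DataFrame({"ratetypes": rt_counts, "Avg_cust_numbers": avg_cust})
summary = summary.sort_values(by="ratetypes", ascending=False)
```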

Related

create MultiIndex columns based on "lookup"

I'd like to take an existing DataFrame with a single level of columns and modify it to use a MultiIndex based on a reference list of tuples and have the proper ordering/alignment. To illustrate by example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10,5), columns = ['nyc','london','canada','chile','earth'])
coltuples = [('cities','nyc'),('countries','canada'),('countries','usa'),('countries','chile'),('planets','earth'),('planets','mars'),('cities','sf'),('cities','london')]
I'd like to create a new DataFrame which has a top level consisting of 'cities', 'countries', and 'planets' with the corresponding original columns underneath. I am not concerned about order but definitely proper alignment.
It can be assumed that 'coltuples' will not be missing any of the columns from 'df', but may have extraneous pairs, and the ordering of the pairs can be random.
I am trying something along the lines of:
coltuplesuse = [x for x in coltuples if x[1] in df.columns]
cols = pd.MultiIndex.from_tuples(coltuplesuse, names=['level1','level2'])
df.reindex(columns=cols)
which seems to be on the right track, but the underlying data in the resulting DataFrame is all NaN.
thanks in advance!
Two things to notice: you want the command set_axis rather than reindex, and sorting by the original column order ensures each label is assigned to the correct column (this is done in the sorted(..., key=...) bit).
use_cols = [tup for tup in coltuples if tup[1] in df.columns]
use_cols = sorted(use_cols, key=lambda x: list(df.columns).index(x[1]))
multi_index = pd.MultiIndex.from_tuples(use_cols, names=['level1', 'level2'])
df = df.set_axis(multi_index, axis=1)
output:
level1 cities countries planets
level2 nyc london canada chile earth
0 0.028033 0.540977 -0.056096 1.675698 -0.328630
1 1.170465 -1.003825 0.882126 0.453294 -1.127752
2 -0.187466 -0.192546 0.269802 -1.225172 -0.548491
3 2.272900 -0.085427 0.029242 -2.258696 1.034485
4 -1.243871 -1.660432 -0.051674 2.098602 -2.098941
5 -0.820820 -0.289754 0.019348 0.176778 0.395959
6 1.346459 -0.260583 0.212008 -1.071501 0.945545
7 0.673351 1.133616 1.117379 -0.531403 1.467604
8 0.332187 -3.541103 -0.222365 1.035739 -0.485742
9 -0.605965 -1.442371 -1.628210 -0.711887 -2.104755
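Collected into one runnable snippet (note the assignment: set_axis returns a new DataFrame rather than modifying in place, which is also why the original reindex attempt showed NaN -- reindex looks up the new tuple labels, which don't exist in the single-level columns):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 5),
                  columns=['nyc', 'london', 'canada', 'chile', 'earth'])
coltuples = [('cities', 'nyc'), ('countries', 'canada'), ('countries', 'usa'),
             ('countries', 'chile'), ('planets', 'earth'), ('planets', 'mars'),
             ('cities', 'sf'), ('cities', 'london')]

# Keep only pairs whose second element is an actual column, in column order
use_cols = sorted((t for t in coltuples if t[1] in df.columns),
                  key=lambda t: list(df.columns).index(t[1]))
multi_index = pd.MultiIndex.from_tuples(use_cols, names=['level1', 'level2'])

# set_axis relabels in place positionally, so the data is untouched
df2 = df.set_axis(multi_index, axis=1)
```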

How can I select specific values from list and plot a seaborn boxplot?

I have a list (length 300) of lists (each length 1000). I want to sort the list of 300 by the median of each list of 1000, and then plot a seaborn boxplot of the top 10 (i.e. the 10 lists with the greatest median).
I am able to plot the entire list of 300 but don't know where to go from there.
I can plot a range of the points, but how do I plot, for example, data[3], data[45], and data[129] all in the same plot?
ax = sns.boxplot(data = data[0:50])
I can also work out which items in the list are in the top 10 by doing this (though I realise this is not the most elegant way!)
array_median = np.median(data, axis=1)
np_sortedarray = np.sort(np.array(array_median))
sort_panda = pd.DataFrame(array_median)
TwoL = sort_panda.reset_index()
TwoL.sort_values(0)
Ultimately I want a boxplot with 10 boxes, showing the list items that have the greatest median values.
Example of data: list of 300 x 1000
[[1.236762285232544,
1.2303414344787598,
1.196462631225586,
...1.1787045001983643,
1.1760116815567017,
1.1614983081817627,
1.1546586751937866],
[1.1349891424179077,
1.1338907480239868,
1.1239897012710571,
1.1173863410949707,
...1.1015456914901733,
1.1005324125289917,
1.1005228757858276],
[1.0945734977722168,
...1.091795563697815]]
I modified your example data a bit just to make it easier.
import seaborn as sns
import pandas as pd
import numpy as np
data = [[1.236762285232544, 1.2303414344787598, 1.196462631225586, 1.1787045001983643, 1.1760116815567017, 1.1614983081817627, 1.1546586751937866],
[1.1349891424179077, 1.1338907480239868, 1.1239897012710571, 1.1173863410949707, 1.1015456914901733, 1.1005324125289917, 1.1005228757858276]]
To sort your data, since it is in list format and not numpy arrays, you can use the sorted function with a key that tells it what to compute for each inner list; the function sorts by that value. Setting reverse=True sorts highest to lowest.
sorted_data = sorted(data, key = lambda x: np.median(x), reverse = True)
To select the top n lists, add [:n] to the end of the previous statement.
To plot in Seaborn, it's easiest to convert your data to a pandas.DataFrame.
df = pd.DataFrame(sorted_data).T
That makes a DataFrame with 10 columns (or 2 in this example). We can rename the columns to make each dataset clearer.
df = df.rename(columns={k: f'Data{k+1}' for k in range(len(sorted_data))}).reset_index()
And to plot 2 (or 10) boxplots in one plot, you can reshape the dataframe to have 2 columns, one for the data and one for the dataset number (ID) (credit here).
df = pd.wide_to_long(df, stubnames = ['Data'], i = 'index', j = 'ID').reset_index()[['ID', 'Data']]
And then you can plot it.
sns.boxplot(x='ID', y = 'Data', data = df)
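The whole pipeline from the answer, condensed into one runnable sketch (the seaborn call is left commented out so the snippet runs headless; the toy data here is shortened further):

```python
import numpy as np
import pandas as pd

data = [[1.23, 1.22, 1.19, 1.17, 1.16],
        [1.13, 1.12, 1.11, 1.10, 1.09]]

# Sort the lists by their medians, highest first (append [:n] for the top n)
sorted_data = sorted(data, key=np.median, reverse=True)

# Wide DataFrame: one column per dataset, renamed Data1, Data2, ...
df = pd.DataFrame(sorted_data).T
df = df.rename(columns={k: f'Data{k+1}' for k in range(len(sorted_data))}).reset_index()

# Reshape to long form: one 'Data' value column plus a dataset 'ID' column
df = pd.wide_to_long(df, stubnames=['Data'], i='index', j='ID').reset_index()[['ID', 'Data']]

# sns.boxplot(x='ID', y='Data', data=df)  # final plotting call
```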
See this answer for fetching the top 10 elements with numpy:
array_median = np.median(data, axis=1)
idx = (-array_median).argsort()[:10]
np.array(data)[idx]
Also, you can get particular elements of data like this:
np.array(data)[[3, 45, 129]]

How to get specific attributes of a df that has been grouped

I'm printing out the frequency of murders in each state in each particular decade. However, I just want to print the state, decade, and its victim count. What I have right now prints out all the columns with the same frequencies. How do I change it so that I just have 3 columns: State, Decade, and Victim Count?
I'm currently using the groupby function to group by the state and decade and setting that equal to a variable called count.
xl = pd.ExcelFile('Wyoming.xlsx')
df = xl.parse('Sheet1')
df['Decade'] = (df['Year'] // 10) * 10
counts = df.groupby(['State', 'Decade']).count()
print(counts)
The outcome is that it prints all the columns in the file with the same frequencies, whereas I just want 3 columns: State, Decade, and Victim Count.
You should reset_index of the groupby object, and then select the columns from the new dataframe.
Something like
xl = pd.ExcelFile('Wyoming.xlsx')
df = xl.parse('Sheet1')
df['Decade'] = (df['Year'] // 10) * 10
counts = df.groupby(['State', 'Decade']).count()
counts = counts.reset_index()[['State', 'Decade', 'Victim Count']]
print(counts)
Select the columns that you want before grouping:
counts = df.loc[:, ['State', 'Decade', 'Victim Count']].groupby(['State', 'Decade']).count()
or
print(counts.reset_index()[['State', 'Decade', 'Victim Count']])
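Equivalently, counting a single column after the groupby keeps the output to exactly the three wanted columns; a sketch on made-up data (the real Excel file and the exact victim-count column name are assumptions based on the question):

```python
import pandas as pd

# Hypothetical rows standing in for the Excel sheet
df = pd.DataFrame({
    "State": ["Wyoming", "Wyoming", "Wyoming", "Wyoming"],
    "Year": [1991, 1995, 2003, 2004],
    "Victim Count": [1, 1, 1, 1],
})
df["Decade"] = (df["Year"] // 10) * 10

# Count one column, then reset_index to turn the group keys back into columns
counts = (df.groupby(["State", "Decade"])["Victim Count"]
            .count()
            .reset_index())
```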

One Hot Encoding a composite field

I want to transform multiple columns with the same categorical values using a OneHotEncoder. I created a composite field and tried to use OneHotEncoder on it as below (items 1-3 come from the same list of items):
import pyspark.sql.functions as F
from pyspark.ml.feature import StringIndexer, OneHotEncoder

def myConcat(*cols):
    return F.concat(*[F.coalesce(c, F.lit("*")) for c in cols])

df = df.withColumn("basket", myConcat("item1", "item2", "item3"))
indexer = StringIndexer(inputCol="basket", outputCol="basketIndex")
indexed = indexer.fit(df).transform(df)
encoder = OneHotEncoder(inputCol="basketIndex", outputCol="basketVec")
encoded = encoder.transform(indexed)
I am getting an out of memory error.
Does this approach work? How do I one hot encode a composite field or multiple columns with categorical values from same list?
If you have an array of categorical values, why not try CountVectorizer:
import pyspark.sql.functions as F
from pyspark.ml.feature import CountVectorizer

# CountVectorizer expects an array column, so build the basket with F.array
df = df.withColumn("basket", F.array("item1", "item2", "item3"))
indexer = CountVectorizer(inputCol="basket", outputCol="basketIndex")
indexed = indexer.fit(df).transform(df)
Note: I can't comment yet (I'm a new user).
What is the cardinality of your "item1", "item2" and "item3" columns?
More specifically, what values do the following prints give?
k1 = df.select("item1").distinct().count()
k2 = df.select("item2").distinct().count()
k3 = df.select("item3").distinct().count()
k = k1 * k2 * k3
print(k1, k2, k3)
One-hot encoding the composite field basically creates a very sparse matrix with the same number of rows as your original dataframe and k additional columns, where in the worst case k is the product of the three numbers printed above.
Therefore, if your three cardinalities are large, you get an out-of-memory error.
The only solutions are to:
(1) increase your memory or
(2) introduce a hierarchy among the categories and use the higher level categories to limit k.
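The blow-up is easy to see on a toy example: the composite field needs one column per distinct combination (up to k1*k2*k3), while encoding each column separately needs only k1 + k2 + k3 columns. A pandas sketch of the counting argument (pandas stands in here only because the arithmetic is library-agnostic; the same applies to the Spark encoders):

```python
import pandas as pd

# Toy data: three categorical columns with 3 distinct values each
df = pd.DataFrame({
    "item1": ["a", "b", "a", "c"],
    "item2": ["x", "x", "y", "z"],
    "item3": ["p", "q", "p", "r"],
})

# Composite field: one dummy column per distinct combination (worst case k1*k2*k3 = 27)
composite = df["item1"] + "_" + df["item2"] + "_" + df["item3"]
wide = pd.get_dummies(composite)

# Encoding each column separately needs only k1 + k2 + k3 = 9 columns
narrow = pd.get_dummies(df, columns=["item1", "item2", "item3"])
```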

Convert huge number of lists to pandas dataframe

# my_fun(x): user-defined function that returns a list
# XYZ: file with LOTS of lines
pandas_frame = pd.DataFrame()  # created an empty data frame
for index in range(len(XYZ)):
    pandas_frame = pandas_frame.append(pd.DataFrame(my_fun(XYZ[index])).transpose(), ignore_index=True)
This code is taking a very long time to run, as in days. How do I speed it up?
I think you need to apply the function to each row, build a new list via a list comprehension, and only then call the DataFrame constructor once:
L = [my_fun(XYZ[i]) for i in range(len(XYZ))]
df = pd.DataFrame(L)
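To make that concrete, a runnable sketch with a stand-in my_fun (the real function and file are not shown in the question):

```python
import pandas as pd

def my_fun(line):
    # Hypothetical stand-in for the user-defined function in the question
    return [len(line), line.upper()]

XYZ = ["alpha", "beta", "gamma"]  # stands in for the file's lines

# Build all rows as plain lists first, then construct the DataFrame once;
# this avoids the quadratic cost of appending row by row
L = [my_fun(line) for line in XYZ]
df = pd.DataFrame(L)
```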
