How to get specific attributes of a df that has been grouped - python-3.x

I'm printing out the frequency of murders in each state in each particular decade. However, I just want to print the state, decade, and it's victim count. What I have right now is that it's printing out all the columns with the same frequencies. How do I change it so that I just have 3 columns, State, Decade, and Victim Count?
I'm currently using the groupby function to group by the state and decade and setting that equal to a variable called count.
xl = pd.ExcelFile('Wyoming.xlsx')
df = xl.parse('Sheet1')
df['Decade'] = (df['Year'] // 10) * 10
counts = df.groupby(['State', 'Decade']).count()
print(counts)
The outcome is printing out all the columns in the file with the same frequencies whereas I just want 3 columns: State Decade Victim Count
Sample Text File

You should reset_index of the groupby object, and then select the columns from the new dataframe.
Something like
xl = pd.ExcelFile('Wyoming.xlsx')
df = xl.parse('Sheet1')
df['Decade'] = (df['Year'] // 10) * 10
counts = df.groupby(['State', 'Decade']).count()
counts = counts.reset_index()[['State', 'Decade','Vistim Count']]
print(counts)

Select the columns that you want:
counts = df.loc[:,['State', 'Decade','Vistim Count']].groupby(['State', 'Decade']).count()
or
print(count.loc[:,['State', 'Decade','Vistim Count']])

Related

Optimize for loop Pandas

There is a data frame with name df which contains repeating rows identified with DICE_SUMMARY_ID.
After I perform some calculations for different columns, I need to write back the results to the original dataframe.
The issue is that df contains over 100k rows and a for loop is very time consuming. Currently, it shows about 3 hours.
How can I reduce the time?
#extract unique ids from dataframe
uniqueIDs = df['DICE_SUMMARY_ID'].unique()
#iterate over the unique ids and calculate
for i in range(len(uniqueIDs)):
#get a slice of the dataframe at i'th unique id
uniqueID_df = df.loc[df['DICE_SUMMARY_ID'] == uniqueIDs[i]]
#calculate sum of all family types
SINGLE_ADULTS = int((uniqueID_df['FAMILY_TYPE_ID'] == 10001).sum())
EXTRA_ADULTS = int((uniqueID_df['FAMILY_TYPE_ID'] == 10003).sum())
NO_OF_ADULTS = int(SINGLE_ADULTS + EXTRA_ADULTS)
NO_OF_DEPENDENTS_U_16 = int((uniqueID_df['FAMILY_TYPE_ID'] == 20001).sum())
NO_OF_DEPENDENTS_16_TO_18 = int((uniqueID_df['FAMILY_TYPE_ID'] == 20002).sum())
#get the array of indexes of each unique uid in the tuple
#each unique uid has 10 - 20 rows in the original df,
#given that there are over 100k records, this becoming very time consuming
indices = np.where(df["DICE_SUMMARY_ID"] == uniqueIDs[i])[0]
for j in indices:
# #insert value in column at index for each repeating index
df['NO_OF_ADULTS'].iloc[j] = NO_OF_ADULTS
df['NO_OF_DEPENDENTS_U_16'].iloc[j] = NO_OF_DEPENDENTS_U_16
df['NO_OF_DEPENDENTS_16_TO_18'].iloc[j] = NO_OF_DEPENDENTS_16_TO_18
faster version, but I am still not satisfied
df['NO_OF_ADULTS'].iloc[indices.min():indices.max()] = NO_OF_ADULTS
df['NO_OF_DEPENDENTS_U_16'].iloc[indices.min():indices.max()] = NO_OF_DEPENDENTS_U_16
df['NO_OF_DEPENDENTS_16_TO_18'].iloc[indices.min():indices.max()] = NO_OF_DEPENDENTS_16_TO_18

Spark window function and taking first and last values per column per partition (aggregation over window)

Imagine I have a huge dataset which I partitionBy('id'). Assume that id is unique to a person, so there could be n number of rows per id and the goal is to reduce it to one.
Basically, aggregating to make id distinct.
w = Window().partitionBy(id).rowsBetween(-sys.maxsize, sys.maxsize)
test1 = {
key: F.first(key, True).over(w).alias(key)
for key in some_dict.keys()
if (some_dict[key] == 'test1')
}
test2 = {
key: F.last(key, True).over(w).alias(k)
for k in some_dict.keys()
if (some_dict[k] == 'test2')
}
Assume that I have some_dict with values either as test1 or test2 and based on the value, I either take the first or last as shown above.
How do I actually call aggregate and reduce this?
cols = {**test1, **test2}
cols = list(cols.value())
df.select(*cols).groupBy('id').agg(*cols) # Doesnt work
The above clearly doesn't work. Any ideas?
Goal here is : I have 5 unique IDs and 25 rows with each ID having 5 rows. I want to reduce it to 5 rows from 25.
Let assume you dataframe name df which contains duplicate use below method
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window
window = Window.partitionBy(df['id']).orderBy(df['id'])
final = df.withColumn("row_id", row_number.over(window)).filter("row_id = 1")
final.show(10,False)
change the order by condition in case there is specific criteria so that particular record will be on top of partition

Convert huge number of lists to pandas dataframe

User defined function=> my_fun(x): returns a list
XYZ = file with LOTS of lines
pandas_frame = pd.DataFrame() # Created empty data frame
for index in range(0,len(XYZ)):
pandas_frame = pandas_frame.append(pd.DataFrame(my_fun(XYZ[i])).transpose(), ignore_index=True)
This code is taking very long time to run like in days. How do I speed up?
I think need apply for each row funcion to new list by list comprehension and then use only once DataFrame constructor:
L = [my_fun(i) for i in range(len(XYZ))]
df = pd.DataFrame(L)

Use data in Spark Dataframe column as condition or input in another column expression

I have an operation that I want to perform within PySpark 2.0 that would be easy to perform as a df.rdd.map, but since I would prefer to stay inside the Dataframe execution engine for performance reasons, I want to find a way to do this using Dataframe operations only.
The operation, in RDD-style, is something like this:
def precision_formatter(row):
formatter = "%.{}f".format(row.precision)
return row + [formatter % row.amount_raw / 10 ** row.precision]
df = df.rdd.map(precision_formatter)
Basically, I have a column that tells me, for each row, what the precision for my string formatting operation should be, and I want to selectively format the 'amount_raw' column as a string depending on that precision.
I don't know of a way to use the contents of one or more columns as input to another Column operation. The closest I can come is suggesting the use of Column.when with an externally-defined set of boolean operations that correspond to the set of possible boolean conditions/cases within the column or columns.
In this specific case, for instance, if you can obtain (or better yet, already have) all possible values of row.precision, then you can iterate over that set and apply a Column.when operation for each value in the set. I believe this set can be obtained with df.select('precision').distinct().collect().
Because the pyspark.sql.functions.when and Column.when operations themselves return a Column object, you can iterate over the items in the set (however it was obtained) and keep 'appending' when operations to each other programmatically until you have exhausted the set:
import pyspark.sql.functions as PSF
def format_amounts_with_precision(df, all_precisions_set):
amt_col = PSF.when(df['precision'] == 0, df['amount_raw'].cast(StringType()))
for precision in all_precisions_set:
if precision != 0: # this is a messy way of having a base case above
fmt_str = '%.{}f'.format(precision)
amt_col = amt_col.when(df['precision'] == precision,
PSF.format_string(fmt_str, df['amount_raw'] / 10 ** precision)
return df.withColumn('amount', amt_col)
You can do it with a python UDF. They can take as many input values (values from columns of a Row) and spit out a single output value. It would look something like this:
from pyspark.sql import types as T, functions as F
from pyspark.sql.function import udf, col
# Create example data frame
schema = T.StructType([
T.StructField('precision', T.IntegerType(), False),
T.StructField('value', T.FloatType(), False)
])
data = [
(1, 0.123456),
(2, 0.123456),
(3, 0.123456)
]
rdd = sc.parallelize(data)
df = sqlContext.createDataFrame(rdd, schema)
# Define UDF and apply it
def format_func(precision, value):
format_str = "{:." + str(precision) + "f}"
return format_str.format(value)
format_udf = F.udf(format_func, T.StringType())
new_df = df.withColumn('formatted', format_udf('precision', 'value'))
new_df.show()
Also, if instead of the column precision value you wanted to use a global one, you could use the lit(..) function when you call it like this:
new_df = df.withColumn('formatted', format_udf(F.lit(2), 'value'))

Improving the speed of cross-referencing rows in the same DataFrame in pandas

I'm trying to apply a complex function to a pandas DataFrame, and I'm wondering if there's a faster way to do it. A simplified version of my data looks like this:
UID,UID2,Time,EventType
1,1,18:00,A
1,1,18:05,B
1,2,19:00,A
1,2,19:03,B
2,6,20:00,A
3,4,14:00,A
What I want to do is for each combination of UID and UID2 check if there is both a row with EventType = A and EventType = B, and then calculate the time difference, and then add it back as a new column. So the new dataset would be:
UID,UID2,Time,EventType,TimeDiff
1,1,18:00,A,5
1,1,18:05,B,5
1,2,19:00,A,3
1,2,19:03,B,3
2,6,20:00,A,nan
3,4,14:00,A,nan
This is the current implementation, where I group the records by UID and UID2, then have only a small subset of rows to search to identify whether both EventTypes exist. I can't figure out a faster one, and profiling in PyCharm hasn't helped uncover where the bottleneck is.
for (uid, uid2), group in df.groupby(["uid", "uid2"]):
# if there is a row for both A and B for a uid, uid2 combo
if len(group[group["EventType"] == "A"]) > 0 and len(group[group["EventType"] == "D"]) > 0:
time_a = group.loc[group["EventType"] == "A", "Time"].iloc[0]
time_b = group.loc[group["EventType"] == "B", "Time"].iloc[0]
timediff = time_b - time_a
timediff_min = timediff.components.minutes
df.loc[(df["uid"] == uid) & (df["uid2"] == uid2), "TimeDiff"] = timediff_min
I need to make sure Time column is a timedelta
df.Time = pd.to_datetime(df.Time)
df.Time = df.Time - pd.to_datetime(df.Time.dt.date)
After that I create a helper dataframe
df1 = df.set_index(['UID', 'UID2', 'EventType']).unstack().Time
df1
Finally, I take the diff and merge to df
df.merge((df1.B - df1.A).rename('TimeDiff').reset_index())

Resources