Sum of different items in a pandas df - python-3.x

I have this data and I need to sum the total ['Tons'] per ['KeyCode'] and to show all the ['Origin'] values that make up that total.
I did a groupby by KeyCode, but I can't figure out how to proceed.
df = data.groupby("KeyCode", group_keys=False).apply(lambda x: x)
df.head()

You can use any aggregation function on a groupby object to get the values you are interested in.
To get the total of something, the function is sum(), whereas to get the unique values we use unique(). To use unique() you must first select the series (a column, in Excel language) whose unique values you want for each value of the groupby variable.
See the code and result below:
import pandas as pd

data = pd.read_excel("total cargo.xlsx")
# get the total of 'Tons' per KeyCode
res = data.groupby("KeyCode")[["Tons"]].sum()
# get all the 'Origin' values for each KeyCode
res['Origins'] = data.groupby("KeyCode")['Origin'].unique()
res.head()
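Alternatively, both aggregations can be done in a single pass with named aggregation. The sketch below uses a small made-up frame in place of the Excel file; the column names ('KeyCode', 'Tons', 'Origin') are taken from the question:
import pandas as pd

# tiny illustrative frame standing in for pd.read_excel("total cargo.xlsx")
data = pd.DataFrame({
    "KeyCode": ["A1", "A1", "B2"],
    "Origin":  ["Chile", "Peru", "Chile"],
    "Tons":    [10.0, 5.0, 7.5],
})

res = data.groupby("KeyCode").agg(
    Total_Tons=("Tons", "sum"),                         # total Tons per KeyCode
    Origins=("Origin", lambda s: s.unique().tolist()),  # all Origins per KeyCode
)
print(res)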

Related

How to turn a column's values into a list

I have a dataset from which I want to extract a column of values into a list. How can I achieve that?
I have tried:
List<Row> rows = dF.select("col1").collectAsList();
But then how do I iterate over the values of col1?
Thanks
You can iterate over a list of rows the same way you can iterate over any list.
For example, assuming you want to access the first column (string type) of all the rows, you can use the following snippet:
import scala.collection.JavaConversions.asScalaBuffer
val rows = df.select("col1").collectAsList();
rows.map(r => r.getAs[String](0))
This is how you can iterate over the list selected from a column, but it is not recommended if the column is too large to fit on the driver node:
df.select("col1").collectAsList()
.stream().forEach(row -> System.out.println(row.getString(0)));
There are other functions for other data types to get values from a Row, such as getString, getLong, getInt, etc.
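If you are working in PySpark instead, a minimal sketch of the same idea (with a made-up df standing in for yours) would be:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("c",)], ["col1"])  # stand-in for the real df

# collect() brings the rows to the driver, so the same size caveat applies
values = [row["col1"] for row in df.select("col1").collect()]
print(values)  # ['a', 'b', 'c']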

pick from first occurrences till last values in array column in pyspark df

I have a problem in which I have to search for the first occurrence of "Employee_ID" in "Mapped_Project_ID", and then pick the values in the array from that first matching occurrence to the last value.
I have a dataframe like below:
Employee_Name|Employee_ID|Mapped_Project_ID
Name1|E101|[E101, E102, E103]
Name2|E102|[E101, E102, E103]
Name3|E103|[E101, E102, E103, E104, E105]
I want an output df like below:
Employee_Name|Employee_ID|Mapped_Project_ID
Name1|E101|[E101, E102, E103]
Name2|E102|[E102, E103]
Name3|E103|[E103, E104, E105]
Not sure how to achieve this.
Can someone provide help or the logic to handle this in Spark without needing any UDFs?
Once you have your dataframe, you can use Spark 2.4's higher-order array functions (see https://docs.databricks.com/_static/notebooks/apache-spark-2.4-functions.html) to filter out any values within the array that are lower than the value in the Employee_ID column, like so:
myDataframe
  .selectExpr(
    "Employee_Name",
    "Employee_ID",
    "filter(Mapped_Project_ID, x -> x >= Employee_ID) as Mapped_Project_ID"
  );
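Since the question is about a PySpark dataframe, here is a minimal PySpark sketch of the same filter expression, using made-up data copied from the example above:
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Name1", "E101", ["E101", "E102", "E103"]),
     ("Name2", "E102", ["E101", "E102", "E103"]),
     ("Name3", "E103", ["E101", "E102", "E103", "E104", "E105"])],
    ["Employee_Name", "Employee_ID", "Mapped_Project_ID"],
)

# Spark 2.4+ higher-order function; no UDF needed
result = df.withColumn(
    "Mapped_Project_ID",
    expr("filter(Mapped_Project_ID, x -> x >= Employee_ID)"),
)
result.show(truncate=False)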

Error while selecting rows from pandas data frame

I have a pandas dataframe df with the column Name. I did:
for name in df['Name'].unique():
    X = df[df['Name'] == name]
    print(X.head())
but then X contains all kinds of different Name values, not the single unique name I want.
What did I do wrong?
Thanks a lot
You probably don't want to overwrite X on every iteration of your loop, which leaves you with only the dataframe for the last value of df['Name'].unique().
Depending on your data and goal, you might want to use groupby as jezrael suggests, or maybe do something like df[~df['Name'].duplicated()], as sketched below.
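A minimal sketch of both options, with made-up data in place of the original df:
import pandas as pd

df = pd.DataFrame({"Name": ["a", "a", "b"], "Value": [1, 2, 3]})  # stand-in data

# one sub-dataframe per unique Name
for name, X in df.groupby("Name"):
    print(name)
    print(X.head())

# or keep only the first row for each Name
first_rows = df[~df["Name"].duplicated()]
print(first_rows)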

How to filter column on values in list in pyspark?

I have a dataframe rawdata on which I have to apply a filter condition on column X with the values CB, CI and CR. So I used the code below:
df = dfRawData.filter(col("X").between("CB","CI","CR"))
But I am getting the following error:
between() takes exactly 3 arguments (4 given)
Please let me know how I can resolve this issue.
The function between is used to check whether a value lies between two values; its inputs are a lower bound and an upper bound. It cannot be used to check whether a column value is in a list. To do that, use isin:
import pyspark.sql.functions as f
df = dfRawData.where(f.col("X").isin(["CB", "CI", "CR"]))
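For contrast, a minimal self-contained sketch (with made-up data standing in for dfRawData) showing isin for list membership next to between, which takes exactly a lower and an upper bound:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()
dfRawData = spark.createDataFrame([("CB",), ("CI",), ("CZ",)], ["X"])  # stand-in data

# isin: membership in an explicit list of values
dfRawData.where(f.col("X").isin("CB", "CI", "CR")).show()

# between: a range check with exactly two arguments, lower and upper bound
dfRawData.where(f.col("X").between("CB", "CR")).show()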

Iterating over rows of dataframe but keep each row as a dataframe

I want to iterate over the rows of a dataframe, but keep each row as a dataframe that has exactly the same format as the parent dataframe, except with only one row. I know about calling DataFrame() and passing in the index and columns, but for some reason this doesn't always give me the same format as the parent dataframe. Calling to_frame() on the series (i.e. the row) does cast it back to a dataframe, but often transposed or in some way different from the parent dataframe's format. Isn't there some easy way to do this and guarantee it will always be the same format for each row?
Here is what I came up with as my best solution so far:
def transact(self, orders):
    # Buy or Sell
    if len(orders) > 1:
        empty_order = orders.iloc[0:0]
        for index, order in orders.iterrows():
            empty_order.loc[index] = order
            # empty_order.append(order)
            self.sub_transact(empty_order)
    else:
        self.sub_transact(orders)
In essence, I empty the dataframe and then insert each series from the for loop back into it. This works correctly, but gives the following warning:
C:\Users\BNielson\Google Drive\My Files\machine-learning\Python-Machine-Learning\ML4T_Ex2_1.py:57: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
empty_order.loc[index] = order
C:\Users\BNielson\Anaconda3\envs\PythonMachineLearning\lib\site-packages\pandas\core\indexing.py:477: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self.obj[item] = s
So it's this line giving the warning:
empty_order.loc[index] = order
This is particularly strange because I am already using .loc, and normally you get this warning when you don't use .loc.
There is a much much easier way to do what I want.
order.to_frame().T
So...
if len(orders) > 1:
    for index, order in orders.iterrows():
        self.sub_transact(order.to_frame().T)
else:
    self.sub_transact(orders)
What this actually does is translate the series (which still contains the necessary column and index information) back into a dataframe. But to_frame() orients it so that the previous row is now the single column and the previous columns are now rows, so you just transpose it back with .T.
Use groupby with a key that is unique per row. groupby does exactly what you are asking for: it iterates over each group, and each group is a dataframe. So if you group by a value that is unique for each and every row, you'll get a single-row dataframe when you iterate over the groups:
import numpy as np

for n, group in df.groupby(np.arange(len(df))):
    # do stuff with the single-row dataframe `group`
    pass
If I can suggest an alternative way, it would be like this:
for index, order in orders.iterrows():
    orders.loc[index:index]
orders.loc[index:index] is exactly a one-row dataframe slice with the same structure, including index and column names.
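Putting these suggestions together, here is a minimal self-contained sketch (with a made-up orders frame and column names) comparing the approaches:
import numpy as np
import pandas as pd

# made-up orders frame standing in for the original
orders = pd.DataFrame({"symbol": ["AAPL", "GOOG"], "shares": [10, -5]},
                      index=["2024-01-02", "2024-01-03"])

for index, order in orders.iterrows():
    one_row_a = order.to_frame().T       # series -> one-row frame, transposed back
    one_row_b = orders.loc[index:index]  # label slice keeps the frame structure
    assert list(one_row_a.columns) == list(orders.columns)
    assert list(one_row_b.columns) == list(orders.columns)

# grouping by a per-row key also yields one-row dataframes
for n, one_row_c in orders.groupby(np.arange(len(orders))):
    assert isinstance(one_row_c, pd.DataFrame) and len(one_row_c) == 1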
