I'm trying to create a new dataframe from an existing dataframe. I tried groupby, but it didn't sum the strings up into a whole number; instead it returned many strings (colleges in this case).
This is the original dataframe
I tried groupby to get the number (a whole number) of colleges, but it returned many colleges (strings) instead.
How do I return the number of colleges as an integer in the new column 'totalPlayer'? Please help.
Hoping I understand your question correctly: you need to count the distinct values in the college column.
Assuming df is the name of your data frame, the code below will help.
df['college'].nunique()
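If the goal is a new dataframe with the count in a 'totalPlayer' column, a minimal sketch could look like the following (the dataframe here is made up, since the original one isn't shown in the question):
import pandas as pd
# Hypothetical stand-in for the original dataframe
df = pd.DataFrame({'player': ['A', 'B', 'C', 'D'],
                   'college': ['X', 'X', 'Y', 'Z']})
total = df['college'].nunique()                   # 3, an integer
new_df = pd.DataFrame({'totalPlayer': [total]})   # one-row dataframe holding the count
print(new_df)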
Helpful links:
Counting unique values in a column in pandas dataframe like in Qlik?
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.nunique.html
I am trying to iterate through this very large DataFrame that I have in python. The thing is, I only want to pull out data from one specific column that contains the names of a bunch of counties.
I have tried to use iteritems(), itertuples(), and iterrows() to no avail.
Any suggestions on how to do this?
My end goal is to have a nested dictionary with each internal dictionary's key being a name from the DataFrame column.
I also tried the method below to select a single column, but that only prints the name of the column, not its contents.
for county in map_datafile[['NAME']]:
    print(county)
If you delete one pair of square brackets, you get a Series that is iterable:
for county in map_datafile['NAME']:
    print(county)
See the difference:
print(type(map_datafile[['NAME']]))
# pandas.core.frame.DataFrame
print(type(map_datafile['NAME']))
# pandas.core.series.Series
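For the stated end goal (a nested dictionary keyed by county name), a short sketch building on that Series could be the following; the inner dictionaries are left empty, since their contents aren't specified in the question:
# Build one empty inner dict per county name; duplicate names collapse to a single key.
nested = {county: {} for county in map_datafile['NAME']}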
I am trying to use PySpark to permute a column in a dataframe, i.e. shuffle all values of a single column across rows.
I am trying to avoid the solution where the column gets split off, assigned an index column, and then joined back to the original dataframe (which also has an added index column), primarily because of my understanding (which could be very wrong) that joins are costly at runtime for a large dataset (millions of rows).
# for some dataframe spark_df
new_df = spark_df.select(colname).sort(colname)
new_df.show() # column values sorted nicely
spark_df.withColumn("ha", new_df[colname]).show()
# column "ha" no longer sorted and has same permutation as spark_df.colname
Thanks for any guidance in helping me understand this, I am a complete beginner with this :)
Edit: Sorry if I was unclear in the question; I just wanted to replace a column with the sorted version of it without doing a join. Thank you for pointing out that dataframes are not mutable, but even doing spark_df.withColumn("ha", spark_df.select(colname).sort(colname)[colname]).show() shows column 'ha' as having the same permutation as 'colname', while sorting the column on its own shows a different permutation. The question is mainly about why the permutation stays the same in the new column 'ha', not about how to replace a column. Thanks again! (Also changed the title to better reflect the question.)
Spark dataframes and RDDs are immutable. Every time you make a transformation, a new one is created. Therefore, when you do new_df = spark_df.select(colname).sort(colname), spark_df remains unchanged. Only new_df is sorted. This is why spark_df.withColumn("ha", new_df[colname]) returns an unsorted dataframe.
Try new_df.withColumn("ha", new_df[colname]) instead.
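A minimal sketch of the immutability point, assuming an active SparkSession named spark and a single column called "colname":
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame([(3,), (1,), (2,)], ["colname"])

new_df = spark_df.select("colname").sort("colname")
spark_df.show()  # still 3, 1, 2 -- the original dataframe is unchanged
new_df.show()    # 1, 2, 3 -- only the new dataframe is sorted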
From the given df below,
request_accepted_short = pd.DataFrame({'requester_id':[1,1,2,3],
'accepter_id':[2,3,3,4],
'accept_date':['2016_06-03','2016_06-08','2016_06-08','2016_06-09']})
I want to find the person with the most friends (requester_id and accepter_id are both person ids), and also show how many friends that person has. Based on the above df, the person is id=3, with 3 friends.
This is rated as a medium SQL problem on LeetCode, and I want to find an efficient, idiomatic pandas way of solving it.
Here's what I tried.
I concatenated requester_id and accepter_id into one column, in order to see which id is most common.
summary = pd.concat([request_accepted_short['requester_id'],request_accepted_short['accepter_id']])
Then, I used pandas .mode() to detect the most common id.
summary.mode()
This process does get me the id with the most friends, but it feels far from the best way to solve this.
My lack of understanding of
1. how .concat() and .mode() work, and
2. how pandas Series and pandas DataFrames work together,
is obvious here.
Any help from a pandas expert will be appreciated.
You can use value_counts() to find the value with the most occurrences. Since value_counts is a Series method, you need to stack the two columns first:
df[['requester_id','accepter_id']].stack().value_counts().reset_index(name = 'count').iloc[0]
index 3
count 3
The same would work using concat, as you were trying:
pd.concat([df['requester_id'],df['accepter_id']]).value_counts()
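To pull out just the id and its friend count from that concat approach, a small follow-up sketch (using idxmax/max on the counts) could be:
counts = pd.concat([df['requester_id'], df['accepter_id']]).value_counts()
print(counts.idxmax(), counts.max())  # id with the most friends, and that friend count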
I have a pandas DataFrame filled with strings. I would like to apply a string operation to all entries, for example capitalize(). I know that for a Series we can use series.str.capitalize(). I also know that I can loop over the columns of the DataFrame and do this for each of them, but I want something more efficient and elegant, without looping. Thanks.
Use stack + unstack.
stack turns a DataFrame with a single-level column index into a Series. You can then apply str.capitalize() and unstack to get back the original shape.
df.stack().str.capitalize().unstack()
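As a quick illustration, assuming a small all-string dataframe named df:
import pandas as pd

df = pd.DataFrame({'a': ['foo', 'bar'], 'b': ['baz', 'qux']})
print(df.stack().str.capitalize().unstack())
#      a    b
# 0  Foo  Baz
# 1  Bar  Qux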
I have a list of pyspark.sql.Row objects as follows:
[Row(artist=1255340), Row(artist=942), Row(artist=378), Row(artist=1180), Row(artist=813)]
From a DataFrame having schema (id, name), I want to filter out rows where id equals some artist in the given list of Rows. What is the correct way to go about it?
To clarify further, I want to do something like: select row from dataframe where row.id is in list_of_row_objects
The main question is how big list_of_row_objects is. If it is small, then the approach in the link provided by @Karthik Ravindra should work.
If it is big, then you can instead use a dataframe_of_row_objects: do an inner join between your dataframe and dataframe_of_row_objects, matching the artist column in dataframe_of_row_objects against the id column in your original dataframe. This basically removes any id not in dataframe_of_row_objects.
Of course, using a join is slower, but it is more flexible. For lists that are not small but are still small enough to fit into memory, you can use the broadcast hint to get better performance.
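A hedged sketch of the join approach described above, assuming df has columns (id, name), rows is the list of Row objects from the question, and an active SparkSession named spark:
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
dataframe_of_row_objects = spark.createDataFrame(rows)  # single column: artist

# Inner join keeps only the rows whose id appears as an artist in the list;
# broadcast() hints that the small dataframe should be shipped to every executor.
filtered = df.join(broadcast(dataframe_of_row_objects),
                   df.id == dataframe_of_row_objects.artist,
                   "inner")
filtered.show()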