Using apply on pandas dataframe with strings without looping over series - string

I have a pandas DataFrame filled with strings. I would like to apply a string operation to all entries, for example capitalize(). I know that for a series we can use series.str.capitlize(). I also know that I can loop over the column of the Dataframe and do this for each of the columns. But I want something more efficient and elegant, without looping. Thanks

use stack + unstack
stack makes a dataframe with a single level column index into a series. You can then perform your str.capitalize() and unstack to get back your original form.
df.stack().str.capitalize().unstack()

Related

Spark DataFrame how to change permutation of one column without join [duplicate]

This question already has answers here:
Updating a dataframe column in spark
(5 answers)
Closed 3 years ago.
I am trying to use Pyspark to permute a column in a dataframe, aka shuffle all values for a single column across rows.
I am trying to avoid the solution where the column gets split and assigned an index column before being joined back to the original dataframe which also has an added index column. Primarily because of my understanding (which could be very wrong) that joins are bad in terms of runtime for a large dataset (millions of rows).
# for some dataframe spark_df
new_df = spark_df.select(colname).sort(colname)
new_df.show() # column values sorted nicely
spark_df.withColumn("ha", new_df[colname]).show()
# column "ha" no longer sorted and has same permutation as spark_df.colname
Thanks for any guidance in helping me understand this, I am a complete beginner with this :)
Edit: Sorry if I was being unclear in the question, I just wanted to replace a column with the sorted version of it without doing join. Thank you for pointing out that dfs are not mutable, but even doing spark_df.withColumn("ha", spark_df.select(colname).sort(colname)[colname]).show() shows column 'ha' as having the same permutation as 'colname' when doing sort on the column itself shows a different permutation. The question is mainly about why the permutation stays the same in the new column 'ha', not about how to replace a column. Thanks again! (Also changed the title to better reflect the question)
Spark dataframes and RDDs are immutable. Every time you make a transformation, a new one is created. Therefore, when you do new_df = spark_df.select(colname).sort(colname), spark_df remains unchanged. Only new_df is sorted. This is why spark_df.withColumn("ha", new_df[colname]) returns an unsorted dataframe.
Try new_df.withColumn("ha", new_df[colname]) instead.

Groupby Strings

I'm trying to create a new dataframe from an existing dataframe. I tried groupby but I didn't seem to sum up the strings as a whole number. Instead it returned many strings(colleges in this case).
This is the original dataframe
I tried groupby to get the number(whole number) of the colleges but it returned a many colleges(string) instead
How do I return the number of colleges as an integer in the new column 'totalPlayer'? Please help.
Hoping, I understand your question correctly.
You need to count the distinct values in college column.
Assuming df is the name of your data frame. Below code will help.
df['college'].nunique()
Helping Links -
Counting unique values in a column in pandas dataframe like in Qlik?
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.nunique.html

Dask Dataframe View Entire Row

I want to see the entire row for a dask dataframe without the fields being cutoff, in pandas the command is pd.set_option('display.max_colwidth', -1), is there an equivalent for dask? I was not able to find anything.
You can import pandas and use pd.set_option() and Dask will respect pandas' settings.
import pandas as pd
# Don't truncate text fields in the display
pd.set_option("display.max_colwidth", -1)
dd.head()
And you should see the long columns. It 'just works.'
Dask does not normally display the data in a dataframe at all, because it represents lazily-evaluated values. You may want to get a specific row by index, using the .loc accessor (same as in Pandas, but only efficient if the index is known to be sorted).
If you meant to get the whole list of columns only, you can get this by the .columns attribute.

Spark: convert DataFrame column into vector

I have a DataFrame df with a column column and I would like to convert column into a vector (e.g. a DenseVector) so that I can use it in vector and matrix products.
Beware: I don't need a column of vectors; I need a vector object.
How to do this?
I found out the vectorAssembler function (link) but this doesn't help me, as it converts some DataFrame columns into a vector columns, which is still a DataFrame column; my desired output should instead be a vector.
About the goal of this question: why am I trying to convert a DF column into a vector? Assume I have a DF with a numerical column and I need to compute a product between a matrix and this column. How can I achieve this? (The same could hold for a DF numerical row.) Any alternative approach is welcome.
How:
DenseVector(df.select("column_name").rdd.map(lambda x: x[0]).collect())
but it doesn't make sense in any practical scenario.
Spark Vectors are not distributed, therefore are applicable only if data fits in memory of one (driver) node. If this is the case you wouldn't use Spark DataFrame for processing.

A more efficient way of getting the nlargest values of a Pyspark Dataframe

I am trying to get the top 5 values of a column of my dataframe.
A sample of the dataframe is given below. In fact the original dataframe has thousands of rows.
Row(item_id=u'2712821', similarity=5.0)
Row(item_id=u'1728166', similarity=6.0)
Row(item_id=u'1054467', similarity=9.0)
Row(item_id=u'2788825', similarity=5.0)
Row(item_id=u'1128169', similarity=1.0)
Row(item_id=u'1053461', similarity=3.0)
The solution I came up with is to sort all of the dataframe and then to take the first 5 values. (the code below does that)
items_of_common_users.sort(items_of_common_users.similarity.desc()).take(5)
I am wondering if there is a faster way of achieving this.
Thanks
You can use RDD.top method with key:
from operator import attrgetter
df.rdd.top(5, attrgetter("similarity"))
There is a significant overhead of DataFrame to RDD conversion but it should be worth it.

Resources