This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
I am trying to join two dataframes, one has Id's and phone numbers, another one has bunch of other columns along with same Id field (however there are some duplicate Id's in this DF). How can I join the phone number column from first dataframe to the second one? I tried doing this but I am getting duplicated key error:
df= df.join(other.set_index('Id'), on='Id', how='outer')
How can I accomplish this? (I want duplicate ID's in second DF to have same phone numbers as non-duplicate ones)
Try the following with pandas.merge method, you can read the documentation here.
df= df.merge(other.set_index('Id'), suffixes = ('_l', '_r'), how='outer')
This will merge two dataframes and add suffixes for exact matching column names.
Related
This question already has answers here:
How to check if a string column in pyspark dataframe is all numeric
(7 answers)
Closed last year.
pyspark 2.3.1
my rows to col1 should only contain integers. I am trying to filter out any row that have even one character. How can I do this in pyspark?
I've tried
df.select('col1').filter(df.col1.rlike(^[a-zA-Z]))
however rows that contain alphabet also contain integers therefore not filtered.
How can I do this?
You can try to select pure digital rows.
df = df.filter('col1 rlike "^[0-9]+$"')
df.show(truncate=False)
This question already has answers here:
Updating a dataframe column in spark
(5 answers)
Closed 3 years ago.
I am trying to use Pyspark to permute a column in a dataframe, aka shuffle all values for a single column across rows.
I am trying to avoid the solution where the column gets split and assigned an index column before being joined back to the original dataframe which also has an added index column. Primarily because of my understanding (which could be very wrong) that joins are bad in terms of runtime for a large dataset (millions of rows).
# for some dataframe spark_df
new_df = spark_df.select(colname).sort(colname)
new_df.show() # column values sorted nicely
spark_df.withColumn("ha", new_df[colname]).show()
# column "ha" no longer sorted and has same permutation as spark_df.colname
Thanks for any guidance in helping me understand this, I am a complete beginner with this :)
Edit: Sorry if I was being unclear in the question, I just wanted to replace a column with the sorted version of it without doing join. Thank you for pointing out that dfs are not mutable, but even doing spark_df.withColumn("ha", spark_df.select(colname).sort(colname)[colname]).show() shows column 'ha' as having the same permutation as 'colname' when doing sort on the column itself shows a different permutation. The question is mainly about why the permutation stays the same in the new column 'ha', not about how to replace a column. Thanks again! (Also changed the title to better reflect the question)
Spark dataframes and RDDs are immutable. Every time you make a transformation, a new one is created. Therefore, when you do new_df = spark_df.select(colname).sort(colname), spark_df remains unchanged. Only new_df is sorted. This is why spark_df.withColumn("ha", new_df[colname]) returns an unsorted dataframe.
Try new_df.withColumn("ha", new_df[colname]) instead.
I'm trying to create a new dataframe from an existing dataframe. I tried groupby but I didn't seem to sum up the strings as a whole number. Instead it returned many strings(colleges in this case).
This is the original dataframe
I tried groupby to get the number(whole number) of the colleges but it returned a many colleges(string) instead
How do I return the number of colleges as an integer in the new column 'totalPlayer'? Please help.
Hoping, I understand your question correctly.
You need to count the distinct values in college column.
Assuming df is the name of your data frame. Below code will help.
df['college'].nunique()
Helping Links -
Counting unique values in a column in pandas dataframe like in Qlik?
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.nunique.html
I have two dataframes, say dfA and dfB.
I want to take their intersection and then count the number of unique user_ids in that intersection.
I've tried the following which is very slow and it crashes a lot:
dfA.join(broadcast(dfB), ['user_id'], how='inner').select('user_id').dropDuplicates().count()
I need to run many such lines, in order to get a plot.
How can I perform such query in an efficient way?
As described in the question, the only relevant part of the dataframe is the column user_id (in your question you describe that you join on user_id and afterwards uses only the user_id field)
The source of the performance problem is joining two big dataframes when you need only the distinct values of one column in each dataframe.
In order to improve the performance I'd do the following:
Create two small DFs which will holds only the user_id column of each dataframe
This will reduce dramatically the size of each dataframe as it will hold only one column (the only relevant column)
dfAuserid = dfA.select("user_id")
dfBuserid = dfB.select("user_id")
Get the distinct (Note: it is equivalent to dropDuplicate() values of each dataframe
This will reduce dramatically the size of each dataframe as each new dataframe will hold only the distinct values of column user_id.
dfAuseridDist = dfA.select("user_id").distinct()
dfBuseridDist = dfB.select("user_id").distinct()
Perform the join on the above two minimalist dataframes in order to get the unique values in the intersection
I think you can either select the necessary columns before and perform the join afterwards. It should also be beneficial to move the dropDuplicates before the join as well, since then you get rid of user_ids that appear multiple times in one of the dataframes.
The resulting query could look like:
dfA.select("user_id").join(broadcast(dfB.select("user_id")), ['user_id'], how='inner')\
.select('user_id').dropDuplicates().count()
OR:
dfA.select("user_id").dropDuplicates(["user_id",]).join(broadcast(dfB.select("user_id")\
.dropDuplicates(["user_id",])), ['user_id'], how='inner').select('user_id').count()
OR the version with distinct should work as well.
dfA.select("user_id").distinct().join(broadcast(dfB.select("user_id").distinct()),\
['user_id'], how='inner').select('user_id').count()
This question already has answers here:
How to perform union on two DataFrames with different amounts of columns in Spark?
(22 answers)
Closed 4 years ago.
I have ‘n’ number of delimited data sets, CSVs may be. But one of them might have a few extra columns. I am trying to read all of them as dataframes and put them in one. How can I merge them as an unionAll and make them a single dataframe ?
P.S: I can do this when I know what is ‘n’. And, it’s a simple unionAll when the column counts are equal.
There is another approach other than the solutions mentioned in first two comments.
Read all CSV files to a single RDD producing RDD[String].
Map to create Rdd[Row] with appropriate length while filling missing values with null or any suitable values.
Create dataFrame schema.
Create DataFrame from RDD[Row] using created Schema.
This may not be a good approach if the CSVs has large number of columns.
Hope this helps