Splitting strings into two different columns pandas - python-3.x

I have a data frame called df with a Location column; each value is a list of comma-separated strings.
I need to split the last two strings of each list into their own columns.
Example Input:
['122 Grenfell Street', 'Adelaide CBD', '5000 Adelaide', 'Australia']
Example Output:
df['Country']: Australia
df['City'] : 5000 Adelaide
I need to do the same for all the rows.
I tried below code
df['Country'] = df['Location'].str.split(',', expand = True)
The above code is not working. I have tried suggestions from other posts, but without success.

Create a list using tolist(), then build a DataFrame from it with pd.DataFrame.
Say sample data is:
df=pd.DataFrame({'text':[['122 Grenfell Street', 'Adelaide CBD', '5000 Adelaide', 'Australia']]})
Extract list elements into columns:
df[['Street','Area','City','Country']] = pd.DataFrame(df.text.tolist(), index= df.index)
                                                text               Street          Area           City    Country
0  [122 Grenfell Street, Adelaide CBD, 5000 Adela...  122 Grenfell Street  Adelaide CBD  5000 Adelaide  Australia
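As a side note (my own assumption, not part of the original answer), if the column already holds actual Python lists rather than strings, the last two elements can also be pulled out directly with the .str accessor, which supports positional indexing on list-like values:
# Minimal sketch, assuming df['text'] contains real Python lists
df['Country'] = df['text'].str[-1]   # last element, e.g. 'Australia'
df['City'] = df['text'].str[-2]      # second-to-last element, e.g. '5000 Adelaide'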

Use Series.str.extract with the following regex pattern:
df[['City', 'Country']] = df['Location'].str.extract(r"'([^,']+?)'\s*,\s*'([^'\]]+)'\s*\]")
Result:
# print(df)
Location City Country
0 [122 Grenfell Street, Adelaide CBD, 5000 Adela... 5000 Adelaide Australia

Related

Is there a simple way to remove duplicate values in certain cells of a dataframe column?

I have a dataframe column with city locations, and some of the cells contain the same value (city) twice within the cell. I was wondering how to get rid of one of the values, e.g. instead of a cell saying Dublin Dublin, it should only say Dublin once.
I have tried df['city'].apply(set) but it doesn't give me what I am looking for.
Any advice much appreciated.
You can split each item on whitespace, drop duplicate words while preserving their order, and then re-join:
df['city'] = df['city'].str.split().apply(lambda x: pd.Series(x).drop_duplicates().tolist()).str.join(' ')
Output:
>>> df
city
0 Los Angeles CA
1 none
2 London
3 Dublin
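As a shorter alternative sketch (my own variation, not part of the original answer), dict.fromkeys keeps only the first occurrence of each word while preserving order:
# Dedupe each cell while keeping word order, so 'Dublin Dublin' becomes 'Dublin';
# non-list cells (e.g. NaN) are left untouched.
df['city'] = df['city'].str.split().apply(
    lambda words: ' '.join(dict.fromkeys(words)) if isinstance(words, list) else words)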

How to do fuzzy matching within the same dataset with multiple columns

I have a student rank dataset in which a few values are missing. I want to do fuzzy matching on the name and rank columns within the same dataset, find the best-matching values, fill the null values in the remaining columns, and add a matched name column, a matched rank column, and a score. I'm a beginner, so it would be great if someone could help me. Thank you.
data:
Name School Marks Location Rank
0 JACK TML 90 AU 3
1 JHON SSP 85 NULL NULL
2 NULL TML NULL AU 3
3 BECK NTC NULL EU 2
4 JHON SSP NULL JP 1
5 SEON NTC 80 RS 5
Expected output:
Name School Marks Location Rank Matched_Name Matched_Rank Score
0 JACK TML 90 AU 3 Jack 3 100
1 JHON SSP 85 JP 1 JHON 1 100
2 BECK NTC NULL EU 2 - - -
3 SEON NTC 80 RS 5 - - -
How do I do this with fuzzy logic?
Here is my code:
import datetime
import pandas as pd
import fuzzymatcher

ds1 = pd.read_csv('dataset.csv')
ds2 = pd.read_csv('dataset.csv')
# Columns to match on from df_left
left_on = ["Name", "Rank"]
# Columns to match on from df_right
right_on = ["Name", "Rank"]
# Now perform the match
# Start the timer
a = datetime.datetime.now()
print('started at :', a)
# It may take several minutes to run on this data set
matched_results = fuzzymatcher.fuzzy_left_join(ds1,
                                               ds2,
                                               left_on,
                                               right_on)
b = datetime.datetime.now()
print('end at :', b)
print("Time taken: ", b - a)
print(matched_results)
try:
    print(matched_results.columns)
    cols = matched_results.columns
except Exception:
    pass
# Write the matched results to a CSV file
matched_results.to_csv('matched_results.csv', index=False)
# Let's see the best matches
try:
    matched_results[cols].sort_values(by=['best_match_score'], ascending=False).head(5)
except Exception:
    pass
fuzzywuzzy is normally used when names are not exact matches, and I can't see that in your case. However, if your names really aren't exact matches, you could do the following (a rough end-to-end sketch follows after these steps):
Create a list of all school names using
df['school_name'].tolist()
Find the null values in your data frame.
Use
process.extractOne(current_name, school_names_list, scorer=fuzz.partial_ratio)
Just remember that you should not use fuzzy matching if you have exact names. In that case you only need to filter the data frame like this:
filtered = df[df['school_name'] == x]
and use it to replace values in the original data frame.
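A rough sketch of those steps, assuming fuzzywuzzy is installed; the column names below mirror the question's data and the helper function is hypothetical, not part of the original answer:
import pandas as pd
from fuzzywuzzy import fuzz, process

# Candidate names to match against (nulls dropped so they are never suggested).
name_list = df['Name'].dropna().tolist()

def best_match(name):
    # Hypothetical helper: returns (matched_name, score) with score in 0-100.
    if pd.isna(name):
        return (None, None)
    return process.extractOne(name, name_list, scorer=fuzz.partial_ratio)

df[['Matched_Name', 'Score']] = df['Name'].apply(lambda n: pd.Series(best_match(n)))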

How to join two dataframes with multiple overlaps in PySpark

Hi, I have a dataset of multiple households where all people within households have been matched between two data sources. The dataframe therefore consists of a 'household' column and two person columns (one for each data source). However, some people (like Jonathan or Peter below) were not able to be matched and so have a blank second person column.
Household  Person_source_A  Person_source_B
1          Oliver           Oliver
1          Jonathan
1          Amy              Amy
2          David            Dave
2          Mary             Mary
3          Lizzie           Elizabeth
3          Peter
As the dataframe is gigantic, my aim is to take a sample of the unmatched individuals, and then output a df that has all people within the households where a sampled unmatched person exists. I.e. say my random sample includes Oliver but not Peter; then I would only keep household 1 in the output.
My issue is that I've filtered to take the sample and am now stuck making further progress. Some combination of join and agg/groupBy will probably work, but I'm struggling. I add a flag to the sampled unmatched names to identify them, which I think is helpful...
My code:
from pyspark.sql.functions import col, lit

# filter to unmatched people
df_unmatched = df.filter(col('per_A').isNotNull() & col('per_B').isNull())
# take random sample of 10%
df_unmatched_sample = df_unmatched.sample(0.1)
# add flag of sampled unmatched persons
df_unmatched_sample = df_unmatched_sample.withColumn('sample_flag', lit('1'))
As it pertains to your intent:
I just want to reduce my dataframe to only show the full households of
households where an unmatched person exists that has been selected by
a random sample out of all unmatched people
Using your existing approach, you could use a join on the Household of the sampled records:
# filter to unmatched people
df_unmatched = df.filter(col('per_A').isNotNull() & col('per_B').isNull())
# take random sample of 10%
df_unmatched_sample = df_unmatched.sample(0.1).select("Household").distinct()
desired_df = df.join(df_unmatched_sample, ["Household"], "inner")
Edit 1
In response to op's comment:
Is there a slightly different way that keeps a flag to identify the
sampled unmatched person (as there are some households with more than
one unmatched person)?
A left join on your existing dataset, after adding the flag column to your sample, may help you achieve this, e.g.:
# filter to unmatched people
df_unmatched = df.filter(col('per_A').isNotNull() & col('per_B').isNull())
# take random sample of 10%
df_unmatched_sample = df_unmatched.sample(0.1).withColumn('sample_flag', lit('1'))
desired_df = (
    df.alias("dfo").join(
        df_unmatched_sample.alias("dfu"),
        [
            col("dfo.Household") == col("dfu.Household"),
            col("dfo.per_A") == col("dfu.per_A"),
            col("dfo.per_B").isNull()
        ],
        "left"
    )
)
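As a follow-up sketch (my own addition, assuming you only want the original columns plus the flag), the null sample_flag values produced by the left join can be normalised afterwards:
from pyspark.sql.functions import coalesce, col, lit

# Keep the original columns and turn the flag into '1' (sampled) / '0' (not sampled).
desired_df = desired_df.select(
    "dfo.*",
    coalesce(col("dfu.sample_flag"), lit("0")).alias("sample_flag")
)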

How to merge two rows having same values into single row in python?

I have a table called 'data' in which the values are like the following:
ID NAME DOB LOCATION
1 bob 08/10/1985 NEW JERSEY
1 bob 15/09/1987 NEW YORK
2 John 08/10/1985 NORTH CAROLINA
2 John 26/11/1990 OKLAHOMA
For example, I want output like:
ID NAME No.of.Days
1 bob difference of two given dates in days
2 John difference of two given dates in days
Please help me write Python code that produces the expected output.
If there are only two dates for a given ID, then the below works:
df.groupby(['ID','NAME'])['DOB'].apply(lambda x: abs(pd.to_datetime(list(x)[0]) - pd.to_datetime(list(x)[1]))).reset_index(name='No.Of.Days')
Output
ID NAME No.Of.Days
0 1 bob 766 days
1 2 John 1934 days
You can also use np.diff (this assumes DOB has already been converted to datetime, since np.diff will not work on raw date strings):
df.groupby(['ID','NAME'])['DOB'].apply(lambda x: np.diff(list(x))[0]).reset_index(name='No.Of.Days')
First, you need to convert the DOB column into datetime format. If you are reading from a .csv, parse the dates while reading (dayfirst=True because the dates are in DD/MM/YYYY format):
df = pd.read_csv('yourfile.csv', parse_dates=['DOB'], dayfirst=True)
Otherwise, convert the existing dataframe column to datetime as follows:
df['DOB'] = pd.to_datetime(df['DOB'], dayfirst=True)
Now you can perform the usual date arithmetic:
df.groupby(['ID','NAME'])['DOB'].apply(lambda x: abs(pd.to_datetime(list(x)[0]) - pd.to_datetime(list(x)[1]))).reset_index(name='No.Of.Days')
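A slightly more general sketch (my own addition, assuming DOB is already datetime) takes max minus min per group, which also works when an ID has more than two dates:
# Difference in days between the latest and earliest date for each ID/NAME pair.
out = (df.groupby(['ID', 'NAME'])['DOB']
         .agg(lambda s: (s.max() - s.min()).days)
         .reset_index(name='No.Of.Days'))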

Make a list from one OR many string values in a pandas dataframe

I have one data frame that has resulted from a spatial join between 2 Geopandas.GeoDataFrame objects.
Because there was more than one item overlapping with the target feature, the rows have been duplicated so that each row has the inherited information from each of the overlapping entities. To simulate this situation, we can run the following lines:
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
cities = geopandas.read_file(geopandas.datasets.get_path('naturalearth_cities'))
cities = cities[['geometry', 'name']]
cities = cities.rename(columns={'name':'City'})
countries_with_city = geopandas.sjoin(world, cities, how="inner", op='intersects')
I am trying to generate a new column in the world geodataframe that contains a list of length 0, 1, or more, holding the "City" attribute of all the cities that overlap each country. For this, I have written the following so far:
for country in world.index:
    subset_countries = countries_with_city.loc[countries_with_city.index == world.loc[country, "name"]]
    a = subset_countries["City"].tolist()
    list_of_names = list(subset_countries["City"])
    world[list_of_names] = list_of_names
When I run this code, however, I get stuck at the line a = subset_countries["City"].tolist(). The error I get is 'str' object has no attribute 'tolist'.
According to what I have tested and investigated, it seems that I am getting this error because the first country [countries_with_city.loc[countries_with_city.index==world.loc[1, "name"]]] has only one city inside of it. Hence, when I slice the dataframe, the fact that there is only one row with index=1 makes the outcome a string instead of a data frame that can then be turned into a list.
Is there a straightforward way to make the code work in every case (when there are 0, 1, or many cities)? The goal is to generate a list of city names that will then be written into the world dataframe.
I am working on python 3
If I understand you correctly, one approach is to build a mapping from country name to a list of city names:
# Build a Series with index=countries, values=cities
country2city = countries_with_city.groupby('name')['City'].agg(lambda x: list(x))
# Use the mapping on the name column of the world DataFrame
world['city_list'] = world['name'].map(country2city)
# Peek at a nontrivial part of the result
world.drop('geometry', axis=1).tail()
pop_est continent name iso_a3 gdp_md_est city_list
172 218519.0 Oceania Vanuatu VUT 988.5 NaN
173 23822783.0 Asia Yemen YEM 55280.0 [Sanaa]
174 49052489.0 Africa South Africa ZAF 491000.0 [Cape Town, Bloemfontein, Johannesburg, Pretoria]
175 11862740.0 Africa Zambia ZMB 17500.0 [Lusaka]
176 12619600.0 Africa Zimbabwe ZWE 9323.0 [Harare]
If you intend to print the city lists right away, you can join the strings in each list to remove the square brackets:
import numpy as np  # needed for the np.nan check below

world['city_str'] = world['city_list'].apply(
    lambda x: ', '.join(c for c in x) if x is not np.nan else None)
# Sanity-check result
world.filter(like='city').tail()
city_list city_str
172 NaN None
173 [Sanaa] Sanaa
174 [Cape Town, Bloemfontein, Johannesburg, Pretoria] Cape Town, Bloemfontein, Johannesburg, Pretoria
175 [Lusaka] Lusaka
176 [Harare] Harare
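One more sketch (my own addition, not part of the original answer): since the question asks for lists of length 0, 1, or more, the NaN entries for countries without any city can be replaced with empty lists:
# Replace NaN with an empty list so every country ends up with a real list.
world['city_list'] = world['city_list'].apply(
    lambda x: x if isinstance(x, list) else [])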
