Assume I have a data frame (ds) like this:
ID Name Age
1 xxc 34
2 sfg 23
3 hdg 18
I want to display both the Name and Age columns. Currently, with this code:
def item(id):
    return ds.loc[ds['ID'] == id]['Name'].tolist()[0]
I am able to get only the Name column value. How do I get the Age column value too? Please note that I want to retain the same code structure, i.e. the return statement. Any solutions, please?
Select both columns with a list and take the first matching row:
list(ds.loc[ds['ID'] == 1, ['Name', 'Age']].iloc[0])
Out[365]: ['xxc', 34]
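To keep the question's single-return-statement function, the same expression can be dropped straight in; a minimal sketch, reusing the ds and item names from the question:
def item(id):
    return list(ds.loc[ds['ID'] == id, ['Name', 'Age']].iloc[0])

item(1)  # ['xxc', 34]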
How can I create a new column (like below) to separate out values if they start with a number? I've attempted variations of isdigit() and slicing the value to look at the first character with [:1], but I haven't been able to get this to work:
df.apply(lambda x: x if x['attr'][:1].isdigit() == False)
Dummy Data:
data = {'Name': ['Bob', 'Kyle', 'Kevin'],
        'attr': ['abc123', '1230', '(ab)']}
df = pd.DataFrame(data)
Desired Output:
data = {'Name': ['Bob', 'Kyle', 'Kevin'],
        'attr': ['abc123', '1230', '(ab)'],
        'num_start': [None, '1230', None],
        'str_start': ['abc123', None, '(ab)']}
df = pd.DataFrame(data)
Use a regex: r'^\d'
pandas.Series.str.startswith would work, except that it doesn't accept a regex; instead, use pandas.Series.str.contains and anchor to the start of the string with ^, then search for a numeric digit with \d.
df.attr.str.contains(r'^\d') gives a Series of True/False values; for rows where it is True, the value goes to the num_start column, and where it is False, the value goes to the str_start column.
From the first df,
condition = df.attr.str.contains(r'^\d')
df['num_start'] = df.attr.where(condition, other=None)
df['str_start'] = df.attr.where(~condition, other=None)
gives
Name attr num_start str_start
0 Bob abc123 None abc123
1 Kyle 1230 1230 None
2 Kevin (ab) None (ab)
Postscript
If this is an x-y problem, where you want to handle the rows differently depending on whether attr starts with a digit or not, consider something like
for starts_with_num, group in df.groupby(condition):
    if starts_with_num:
        pass  # do the thing (attr starts with a digit)
    else:
        pass  # do the other thing (attr doesn't start with a digit)
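As a runnable toy version of that loop (a sketch, reusing the condition computed above), one could simply collect the two groups:
groups = {}
for starts_with_num, group in df.groupby(condition):
    # keys are False/True: whether attr starts with a digit
    groups[starts_with_num] = group['attr'].tolist()
print(groups)  # {False: ['abc123', '(ab)'], True: ['1230']}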
Another possible solution:
import numpy as np

cond = df['attr'].str.replace(r'[^\w\d]', '', regex=True).str.contains(r'^\d')
df['num_start'] = np.where(cond, df['attr'], None)
df['str_start'] = np.where(cond, None, df['attr'])
Output:
Name attr num_start str_start
0 Bob abc123 None abc123
1 Kyle 1230 1230 None
2 Kevin (ab) None (ab)
I have this data
Customer_ID SNN Occupation
0 CUS_0xd40 821-00-0265 Scientist
1 CUS_0xd40 #F%$D#*&8 ?
2 CUS_0xd40 004-07-5839 Scientist
3 CUS_0xd40 133-16-7738 ?
4 CUS_0x2dbc 031-35-0942 Entrepreneur
I want to check if the Customer_ID matches. If it matches, I want to replicate the Occupation from the other, already-filled rows for that customer. I get array([None], dtype=object) when I apply this code. How do I fix it, and how can I solve it using groupby?
def ffff(x):
    for i in train_df.index:
        if train_df['Customer_ID'].iloc[i] == train_df['Customer_ID'].iloc[i+1]:
            if train_df['Occupation'].iloc[i] == '?':
                train_df['Occupation'].iloc[i] == train_df['Occupation'].iloc[i+1]
            else:
                train_df['Occupation'].iloc[i] == train_df['Occupation'].iloc[i]
        else:
            pass
        break

train_df['Occupation'] = train_df['Occupation'].apply(ffff)
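For what it's worth, a groupby-based sketch of what the question describes (assuming '?' is the only placeholder and each Customer_ID has at least one filled-in Occupation):
import numpy as np

# Turn the '?' placeholder into NaN so pandas can fill it, then
# propagate known values within each Customer_ID group.
train_df['Occupation'] = train_df['Occupation'].replace('?', np.nan)
train_df['Occupation'] = (train_df.groupby('Customer_ID')['Occupation']
                                  .transform(lambda s: s.ffill().bfill()))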
I have a student rank dataset in which a few values are missing. I want to do fuzzy matching on the Name and Rank columns within the same dataset, find the best-matching values, update the null values in the rest of the columns, and add a matched-name column, a matched-rank column, and a score. I'm a beginner, so it would be great if someone could help me. Thank you.
data:
Name School Marks Location Rank
0 JACK TML 90 AU 3
1 JHON SSP 85 NULL NULL
2 NULL TML NULL AU 3
3 BECK NTC NULL EU 2
4 JHON SSP NULL JP 1
5 SEON NTC 80 RS 5
Expected Data Output:
data:
Name School Marks Location Rank Matched_Name Matched_Rank Score
0 JACK TML 90 AU 3 Jack 3 100
1 JHON SSP 85 JP 1 JHON 1 100
2 BECK NTC NULL EU 2 - - -
3 SEON NTC 80 RS 5 - - -
How do I do it with fuzzy matching?
Here is my code:
import datetime

import fuzzymatcher
import pandas as pd

ds1 = pd.read_csv('dataset.csv')
ds2 = pd.read_csv('dataset.csv')
# Columns to match on from df_left
left_on = ["Name", "Rank"]
# Columns to match on from df_right
right_on = ["Name", "Rank"]
# Now perform the match
#Start the time
a = datetime.datetime.now()
print('started at :',a)
# It will take several minutes to run on this data set
matched_results = fuzzymatcher.fuzzy_left_join(ds1,
                                               ds2,
                                               left_on,
                                               right_on)
b = datetime.datetime.now()
print('end at :', b)
print("Time taken: ", b-a)
print(matched_results)
try:
    print(matched_results.columns)
    cols = matched_results.columns
except:
    pass

matched_results.to_csv('matched_results.csv', index=False)

# Let's see the best matches
try:
    matched_results[cols].sort_values(by=['best_match_score'], ascending=False).head(5)
except:
    pass
fuzzywuzzy is usually used when names are not exact matches, and I can't see that being the case here. However, if your names aren't exact matches, you may do the following:
Create a list of all school names using
df['school_name'].tolist()
Find null values in your data frame.
Use
process.extractOne(current_name, school_names_list, scorer=fuzz.partial_ratio)
Just remember that you should never use fuzzy matching if you have exact names. In that case you only need to filter the data frame like this:
filtered = df[df['school_name'] == x]
and use it to replace values in the original data frame.
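A minimal sketch of those three steps together, assuming the fuzzywuzzy package is installed and using the illustrative school_name column from above:
from fuzzywuzzy import fuzz, process

# 1. List of all known school names
school_names_list = df['school_name'].dropna().tolist()

# 2./3. For a row with missing data, find the closest known name and a score
def best_match(current_name):
    return process.extractOne(current_name, school_names_list,
                              scorer=fuzz.partial_ratio)

print(best_match('NTC'))  # e.g. ('NTC', 100)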
I have a dataframe like this
PK Name Mobile questions
1 Jack 12345 [{'question':"how are you","Response":"Fine"},{"question":"whats your age","Response":"i am 19"}]
2 kim 102345 [{'question':"how are you","Response":"Not Fine"},{"question":"whats your age","Response":"i am 29"}]
3 jame 420
I want the output df to be like
PK Name Mobile Question 1 Response 1 Question 2 Response 2
1 Jack 12345 How are you Fine Whats your age i am 19
2 Kim 102345 How are you Not Fine Whats your age i am 29
3 jame 420
You can use explode to first create a row per element in each list. Then create a dataframe from this exploded series, keeping the index. Assign a column to get an incremental value per row within each index group, then set_index and unstack to create the right shape. Finally, rename the columns and join back to the original df.
# create a row per element in each list in each row
s = df['questions'].explode()
# create the dataframe and reshape
df_qr = pd.DataFrame(s.tolist(), index=s.index)\
          .assign(cc=lambda x: x.groupby(level=0).cumcount()+1)\
          .set_index('cc', append=True).unstack()
# flatten column names
df_qr.columns = [f'{col[0]} {col[1]}' for col in df_qr.columns]
# join back to df
df_f = df.drop('questions', axis=1).join(df_qr, how='left')
print(df_f)
PK Name Mobile question 1 question 2 Response 1 Response 2
0 1 Jack 12345 how are you whats your age Fine i am 19
1 2 kim 102345 how are you whats your age Not Fine i am 29
Edit: if some rows are empty strings instead of lists, then create s this way (the rest of the pipeline is unchanged, and the how='left' join keeps those rows with NaN in the new columns):
s = df.loc[df['questions'].apply(lambda x: isinstance(x, list)), 'questions'].explode()
I am trying to create a new column in a Pandas dataframe. If the other two date columns in my dataframe share the same month, this new column should have 1 as its value, otherwise 0. Also, I need to check that the ids match another list of ids that I saved previously elsewhere, and mark only those with 1. I have some code, but it is useless since I am dealing with almost a billion rows.
my_list_of_ids = df[df.bool_column == 1].id.values

def my_func(date1, date2):
    for id_ in df.id:
        if id_ in my_list_of_ids:
            if date1.month == date2.month:
                my_var = 1
            else:
                my_var = 0
        else:
            my_var = 0
    return my_var

df["new_column"] = df.progress_apply(lambda x: my_func(x['date1'], x['date2']), axis=1)
Been waiting for 30 minutes and still 0%. Any help is appreciated.
UPDATE (adding an example):
id | date1 | date2 | bool_column | new_column |
id1 2019-02-13 2019-04-11 1 0
id1 2019-03-15 2019-04-11 0 0
id1 2019-04-23 2019-04-11 0 1
id2 2019-08-22 2019-08-11 1 1
id2 ....
id3 2019-09-01 2019-09-30 1 1
.
.
.
What I need to do is save the ids that have 1 in my bool_column. Then I loop through all of the ids in my dataframe and check whether they are in the previously created list. Then I want to compare the month and the year of the date1 and date2 columns and, where they match, set new_column to 1, otherwise 0.
The pandas way to do this is
mask = ((df['date1'].dt.month == df['date2'].dt.month) & (df['id'].isin(my_list_of_ids)))
df['new_column'] = mask.replace({False: 0, True: 1})
Since you have a large dataset, this will take time, but it should be faster than using apply.
The best way to deal with the month match is to use vectorization in pandas and do this:
new_column = (df.date1.dt.month == df.date2.dt.month).astype(int)
That is, avoid using apply() over the DataFrame (which will probably be iterative) and take advantage of the underlying numpy vectorization. The gateway to such functionality is almost always in families of Series functions and properties, like the dt family for dates, str family for strings, and so forth.
Luckily, you have pre-computed the id_list membership in your bool_column, so to add membership as a criterion, just do this:
new_column = ((df.date1.dt.month == df.date2.dt.month) & df.bool_column).astype(int)
Once again, the & of two Series takes advantage of vectorization. You stay inside boolean space till the end, then cast to int with astype(int). Reviewing your code, it occurs to me that the iterative checking of your id_list may be the real performance hit here, even more so than the DataFrame.apply(). Whatever you do, avoid at all costs iterating your id_list at each row, since you already have a vector denoting membership in your bool_column.
By the way, I believe there's a tiny error in your example data: the new_column value for your third row should be 0, since your bool_column value there is 0.