Script for converting intermingled cell data to interaction matrix - excel

I have bibliographic data from Web of Science that I need to convert into an interaction matrix (essentially a cross-tabulation of which authors work together). However, each cell mixes several authors and affiliations:
1: [Hussain, Raja Azadar; Badshah, Amin] Quaid I Azam Univ, Dept Chem, Coordinat Chem Lab, Islamabad, Pakistan; [Tahir, Muhammad Nawaz] Univ Sargodha, Dept Phys, Sargodha, Punjab, Pakistan; [Tamoor-ul- Hassan; Bano, Asghari] Quaid I Azam Univ, Phytoharmone Lab, Dept Plant Sci, Islamabad, Pakistan
2: [Shahida, Shabnam; Khan, Muhammad Haleem] Univ Azad Jammu & Kashmir, Dept Chem, Muzaffarabad, Ajk, Pakistan; [Ali, Akbar] Pakistan Inst Nucl Sci & Technol, Div Chem, Islamabad, Pakistan
And I need it to look like this:
1: Hussain, Raja Azadar, Quaid I Azam Univ, Dept Chem, Coordinat Chem Lab, Islamabad, Pakistan
1: Badshah, Amin, Quaid I Azam Univ, Dept Chem, Coordinat Chem Lab, Islamabad, Pakistan
1: Tahir, Muhammad Nawaz, Univ Sargodha, Dept Phys, Sargodha, Punjab, Pakistan
1: Tamoor-ul- Hassan, Quaid I Azam Univ, Phytoharmone Lab, Dept Plant Sci, Islamabad, Pakistan
1: Bano, Asghari, Quaid I Azam Univ, Phytoharmone Lab, Dept Plant Sci, Islamabad, Pakistan
2: Shahida, Shabnam, Univ Azad Jammu & Kashmir, Dept Chem, Muzaffarabad, Ajk, Pakistan
2: Khan, Muhammad Haleem, Univ Azad Jammu & Kashmir, Dept Chem, Muzaffarabad, Ajk, Pakistan
2: Ali, Akbar, Pakistan Inst Nucl Sci & Technol, Div Chem, Islamabad, Pakistan
Any help would be greatly appreciated!

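Split the cell on `; [` --> arr1
For each element in arr1:
    Split on `]` --> arr2(0) (the authors) and arr2(1) (the affiliation)
    Strip the leading `[` that is still attached to the first element
    Split arr2(0) on `;` --> arr3
    For each element in arr3:
        Combine arr3(x) with arr2(1) and put the result in a cell
Loop until done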
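The steps above can be sketched in Python as a quick way to check the logic before porting it to VBA. This is only a minimal sketch, not the answerer's code: the function name `split_affiliations` and its `record_id` parameter are made up for illustration, and it assumes that every author group is wrapped in square brackets and that no affiliation contains the sequence `; [`.

```python
def split_affiliations(record_id, cell):
    """Return one 'id: author, affiliation' line per author in the cell."""
    rows = []
    # Each group looks like "[author; author] affiliation".
    for group in cell.split("; ["):
        authors, affiliation = group.split("]", 1)
        # Only the first group still carries its opening bracket.
        authors = authors.lstrip("[")
        for author in authors.split(";"):
            rows.append(f"{record_id}: {author.strip()}, {affiliation.strip()}")
    return rows


# Record 2 from the question, used as sample input.
cell = (
    "[Shahida, Shabnam; Khan, Muhammad Haleem] Univ Azad Jammu & Kashmir, "
    "Dept Chem, Muzaffarabad, Ajk, Pakistan; [Ali, Akbar] Pakistan Inst "
    "Nucl Sci & Technol, Div Chem, Islamabad, Pakistan"
)
rows = split_affiliations(2, cell)
for row in rows:
    print(row)
```

Run on record 2, this prints the three lines that the question lists for that record.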

Related

VBA for merging range of cells with reference to a cell

I want to merge a range of cells based on a unique cell value, and I need VBA code to do it.
Sample Data
Name           | Phone No. | Car Company | Car Model | Car Expiry | DL No.
Charlie Andrew | 98765     |             |           |            | CA123
Charlie Andrew | 98765     | Mercedes    | D201      |            |
Charlie Andrew |           |             | D201      | Jun-50     |
Charlie Andrew | 98765     | Volkswagon  |           |            | CA123
Charlie Andrew |           | Volkswagon  | POLO      |            |
Charlie Andrew |           |             | POLO      | MAR-25     |
Charlie Andrew | 98765     |             |           | Jun-50     |
Charlie Andrew | 12345     | BMW         | 520D      |            |
Charlie Andrew |           |             | 520D      | MAY-40     | CA456
Stephen Logan  | 556644    | GM MOTORS   |           |            |
Stephen Logan  |           | GM MOTORS   | 2255H     |            |
Stephen Logan  |           |             | 2255H     | APR-30     |
Stephen Logan  | 556644    |             |           |            | SL987
Desired Result
Name           | Phone No. | Car Company          | Car Model  | Car Expiry     | DL No.
Charlie Andrew | 98765     | Mercedes; Volkswagon | D201; POLO | Jun-50; MAR-25 | CA123
Charlie Andrew | 12345     | BMW                  | 520D       | MAY-40         | CA456
Stephen Logan  | 556644    | GM MOTORS            | 2255H      | APR-30         | SL987
Please note that DL No. should not be merged, as it is a unique value.
I tried various VBA macros but didn't get the desired result.
Thanks in advance.

Remove custom stop words from pandas dataframe not working

I am trying to remove a custom list of stop words, but it's not working.
desc = pd.DataFrame(description, columns =['description'])
print(desc)
Which gives the following results
description
188693 The Kentucky Cannabis Company and Bluegrass He...
181535 Ohio County Sheriff
11443 According to new reports from federal authorit...
213919 KANSAS CITY, Mo. (AP)The Chiefs will be withou...
171509 The crew of Insight, WCNY's weekly public affa...
... ...
2732 The Arkansas Supreme Court on Thursday cleared...
183367 Larry Pegram, co-owner of Pure Ohio Wellness, ...
134291 Joe Biden will spend the next five months pres...
239270 Find out where your Texas representatives stan...
246070 SAN TAN VALLEY — Two men have been charged wit...
[9875 rows x 1 columns]
I found the following code here, but it doesn't seem to work
remove_words = ["marijuana", "cannabis", "hemp", "thc", "cbd"]
pat = '|'.join([r'\b{}\b'.format(w) for w in remove_words])
desc.assign(new_desc=desc.replace(dict(string={pat: ''}), regex=True))
Which produces the following results
description new_desc
188693 The Kentucky Cannabis Company and Bluegrass He... The Kentucky Cannabis Company and Bluegrass He...
181535 Ohio County Sheriff Ohio County Sheriff
11443 According to new reports from federal authorit... According to new reports from federal authorit...
213919 KANSAS CITY, Mo. (AP)The Chiefs will be withou... KANSAS CITY, Mo. (AP)The Chiefs will be withou...
171509 The crew of Insight, WCNY's weekly public affa... The crew of Insight, WCNY's weekly public affa...
... ... ...
2732 The Arkansas Supreme Court on Thursday cleared... The Arkansas Supreme Court on Thursday cleared...
183367 Larry Pegram, co-owner of Pure Ohio Wellness, ... Larry Pegram, co-owner of Pure Ohio Wellness, ...
134291 Joe Biden will spend the next five months pres... Joe Biden will spend the next five months pres...
239270 Find out where your Texas representatives stan... Find out where your Texas representatives stan...
246070 SAN TAN VALLEY — Two men have been charged wit... SAN TAN VALLEY — Two men have been charged wit...
9875 rows × 2 columns
As you can see, the stop words weren't removed. Any help you can provide would be greatly appreciated.
Handle the case and simplify the pattern:
remove_words = ["marijuana", "cannabis", "hemp", "thc", "cbd"]
pat = '|'.join(remove_words)
desc['new_desc'] = desc.description.str.lower().replace(pat,'', regex=True)
description new_desc
0 The Kentucky Cannabis Company and Bluegrass He... the kentucky company and bluegrass he...
1 Ohio County Sheriff ohio county sheriff
2 According to new reports from federal authorit... according to new reports from federal authorit...
3 KANSAS CITY, Mo. (AP)The Chiefs will be mariju... kansas city, mo. (ap)the chiefs will be witho...
4 The crew of Insight, WCNY's weekly public affa... the crew of insight, wcny's weekly public affa...
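The answer above lowercases the whole text and, without `\b`, also removes the stop words when they occur inside longer words. If the original casing should survive, a variant along the following lines may be closer to what the question asks for. It is a sketch on made-up sample rows, not the answerer's code. It assumes pandas is installed and relies on the `case` and `regex` parameters of `Series.str.replace`. The original attempt most likely failed because the nested dict targets a column named `string`, which does not exist, and because the match was case-sensitive.

```python
import re

import pandas as pd

remove_words = ["marijuana", "cannabis", "hemp", "thc", "cbd"]
# Whole-word alternation; re.escape guards against regex metacharacters.
pat = r"\b(?:" + "|".join(map(re.escape, remove_words)) + r")\b"

# Two short sample rows standing in for the real 'description' column.
desc = pd.DataFrame(
    {"description": ["The Kentucky Cannabis Company", "Ohio County Sheriff"]}
)

desc["new_desc"] = (
    desc["description"]
    .str.replace(pat, "", case=False, regex=True)  # drop stop words, any case
    .str.replace(r"\s+", " ", regex=True)          # collapse leftover spaces
    .str.strip()
)
print(desc)
```

The second `str.replace` only tidies the double spaces left behind where a word was removed.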

Calculate the amount of time between rows of data (time), by date and employee name

I have a DataFrame with the columns ['Name', 'Department', 'Date', 'Time', 'Activity'].
For example, it looks like this:
Acosta, Hirto 225 West 28th Street 9/18/2019 07:25:00 Punch In
Acosta, Hirto 225 West 28th Street 9/18/2019 11:57:00 Punch Out
Acosta, Hirto 225 West 28th Street 9/18/2019 12:28:00 Punch In
Adams, Juan 225 West 28th Street 9/16/2019 06:57:00 Punch In
Adams, Juan 225 West 28th Street 9/16/2019 12:00:00 Punch Out
Adams, Juan 225 West 28th Street 9/16/2019 12:28:00 Punch In
Adams, Juan 225 West 28th Street 9/16/2019 15:30:00 Punch Out
Adams, Juan 225 West 28th Street 9/18/2019 07:04:00 Punch In
Adams, Juan 225 West 28th Street 9/18/2019 11:57:00 Punch Out
I need to calculate the time between the punch in and the punch out on the same day for the same employee.
So far I have only managed to clean the data:
self.raw_data['Time'] = pd.to_datetime(self.raw_data['Time'], format='%H:%M').dt.time
sorted_db = self.raw_data.sort_values(['Name', 'Date'])
sorted_db = sorted_db[['Name', 'Department', 'Date', 'Time', 'Activity']]
Any suggestions would be appreciated.
I found the answer to my problem and wanted to share it.
First, I separate the "Punch In" and "Punch Out" times into two columns:
def process_info(self):
    # Filter the data: copy each punch time into an 'in' or a 'pre_out' column
    self.raw_data['in'] = self.raw_data[self.raw_data['Activity'].str.contains('In')]['Time']
    self.raw_data['pre_out'] = self.raw_data[self.raw_data['Activity'].str.contains('Out')]['Time']
Next, I sort the data by date and name:
    sorted_data = self.raw_data.sort_values(['Date', 'Name'])
After that, I use the shift function to move the 'out' time up one row so that it sits alongside the matching 'in':
    sorted_data['out'] = sorted_data.shift(-1)['Time']
Finally, I drop the extra punch-out rows created in the first step by keeping only the rows where 'pre_out' is empty:
    filtered_data = sorted_data[sorted_data['pre_out'].isnull()]
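The approach above stops before the actual subtraction, so here is one way the remaining step might look. It is a sketch on a few rows copied from the question, not the poster's final code. It assumes pandas is installed, that 'Date' and 'Time' are strings in the formats shown, and that a punch in counts only when the next punch of the same employee on the same day is a punch out. The column names `Timestamp`, `next_activity`, `next_time`, and `worked` are made up for illustration.

```python
import pandas as pd

raw = pd.DataFrame(
    {
        "Name": ["Acosta, Hirto"] * 3 + ["Adams, Juan"] * 4,
        "Date": ["9/18/2019"] * 3 + ["9/16/2019"] * 4,
        "Time": ["07:25:00", "11:57:00", "12:28:00",
                 "06:57:00", "12:00:00", "12:28:00", "15:30:00"],
        "Activity": ["Punch In", "Punch Out", "Punch In",
                     "Punch In", "Punch Out", "Punch In", "Punch Out"],
    }
)

# Combine date and time so that rows can be ordered and subtracted.
raw["Timestamp"] = pd.to_datetime(
    raw["Date"] + " " + raw["Time"], format="%m/%d/%Y %H:%M:%S"
)
raw = raw.sort_values(["Name", "Timestamp"])

# Look one row ahead, but never across employees or days.
shifted = raw.groupby(["Name", "Date"])[["Activity", "Timestamp"]].shift(-1)
raw["next_activity"] = shifted["Activity"]
raw["next_time"] = shifted["Timestamp"]

# Keep only punch-ins that are directly followed by a punch-out.
is_pair = (raw["Activity"] == "Punch In") & (raw["next_activity"] == "Punch Out")
pairs = raw[is_pair].copy()
pairs["worked"] = pairs["next_time"] - pairs["Timestamp"]

# Total time per employee per day.
totals = pairs.groupby(["Name", "Date"])["worked"].sum()
print(totals)
```

Shifting within each group keeps an unmatched punch in, such as the last Acosta row, from being paired with another employee's punch out.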

If username in dataframe 1 equals username in dataframe 2, add the next column to dataframe 1

dataframe 1 is
View Name        member             user id
Admin_Case_View  Catherine Kear     ckear
Admin_IT         Atul Dhiwar        adhiwar-sa
Admin_IT         Costin Bulisache   cbulisac
Admin_IT         Deepa Gopal        SA
Admin_IT         Geoff Semonian     SA
Admin_IT         Glenn Castan       SA
Admin_IT         Nikhil Manekar     nmanekar
Admin_Questions  Chaitanya Kondury  kkondury
Admin_Questions  Geetha Maddala     gmaddala
Admin_Questions  Kelly Kim          jungeunk
Admin_Questions  Megan Yeh          megany
dataframe 2 is
Case Owner Alias  Owner Region
cbulisac          Other
aandiapp          India
gmaddala          North America
abarak            Europe
abell             Europe
nmanekar          India
abhghos           India
kkondury          India
abhishuk          India
acai              China
megany            North America
adasari           India
adhiwar-sa        North America
If the username in dataframe 1 equals the username in dataframe 2, the region should be added to dataframe 1.
The output should be:
View Name        member             user id     region
Admin_Case_View  Catherine Kear     ckear
Admin_IT         Atul Dhiwar        adhiwar-sa  North America
Admin_IT         Costin Bulisache   cbulisac    Other
Admin_IT         Deepa Gopal        SA
Admin_IT         Geoff Semonian     SA
Admin_IT         Glenn Castan       SA
Admin_IT         Nikhil Manekar     nmanekar    India
Admin_Questions  Chaitanya Kondury  kkondury    India
Admin_Questions  Geetha Maddala     gmaddala    North America
Admin_Questions  Kelly Kim          jungeunk    Europe
Admin_Questions  Megan Yeh          megany      North America
Try this; you just need a merge:
df3 = (pd.merge(df1, df2, left_on='user id', right_on='Case Owner Alias', how='left')
         .rename(columns={'Owner Region': 'region'})
         .drop(columns='Case Owner Alias')  # keyword form; the positional axis argument no longer works in pandas 2.x
         .fillna(''))
O/P:
    View Name        member             user id     region
0   Admin_Case_View  Catherine Kear     ckear
1   Admin_IT         Atul Dhiwar        adhiwar-sa  North America
2   Admin_IT         Costin Bulisache   cbulisac    Other
3   Admin_IT         Deepa Gopal        SA
4   Admin_IT         Geoff Semonian     SA
5   Admin_IT         Glenn Castan       SA
6   Admin_IT         Nikhil Manekar     nmanekar    India
7   Admin_Questions  Chaitanya Kondury  kkondury    India
8   Admin_Questions  Geetha Maddala     gmaddala    North America
9   Admin_Questions  Kelly Kim          jungeunk
10  Admin_Questions  Megan Yeh          megany      North America
Note: Map is not advisable when you have a large Dataframe.
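For comparison, the `map` approach that the note refers to might look roughly like this. It is a sketch on a three-row sample taken from the tables above, not part of the original answer. It assumes pandas is installed and that each alias appears only once in dataframe 2, since `Series.map` needs a unique lookup index. The name `lookup` is made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "View Name": ["Admin_Case_View", "Admin_IT", "Admin_Questions"],
        "member": ["Catherine Kear", "Costin Bulisache", "Megan Yeh"],
        "user id": ["ckear", "cbulisac", "megany"],
    }
)
df2 = pd.DataFrame(
    {
        "Case Owner Alias": ["cbulisac", "gmaddala", "megany"],
        "Owner Region": ["Other", "North America", "North America"],
    }
)

# Build an alias -> region lookup, then translate each user id through it.
lookup = df2.set_index("Case Owner Alias")["Owner Region"]
df1["region"] = df1["user id"].map(lookup).fillna("")
print(df1)
```

User ids with no match in dataframe 2 end up with an empty region, matching the `fillna('')` in the merge version.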

Selecting the first and the last word from a 3 word long string in PLSQL

For example I have names like these:
John Lucas Smith
Kevin Thomas Bacon
I need to do it with regexp_substr, or replace or something like that.
and what I want to get is:
John Smith
Kevin Bacon
Thank you!
Something like this?
with test (col) as
  (select 'John Lucas Smith' from dual union
   select 'Kevin Thomas Bacon' from dual union
   select 'Little Foot' from dual
  )
select regexp_substr(col, '^\w+') || ' ' ||
       regexp_substr(col, '\w+$') first_and_last
from test;

FIRST_AND_LAST
-------------------------------------
John Smith
Kevin Bacon
Little Foot
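The two anchored patterns can also be tried outside the database. The sketch below uses Python's `re` module purely to illustrate what `^\w+` and `\w+$` pick out. It is not Oracle code, the helper name `first_and_last` is made up, and the exact meaning of `\w` can differ between Oracle and Python for non-ASCII names.

```python
import re


def first_and_last(name):
    """Join the first and the last word of a name with a single space."""
    first = re.search(r"^\w+", name)  # word anchored at the start
    last = re.search(r"\w+$", name)   # word anchored at the end
    if first is None or last is None:
        return name  # nothing matched; leave the input unchanged
    return f"{first.group()} {last.group()}"


for name in ["John Lucas Smith", "Kevin Thomas Bacon", "Little Foot"]:
    print(first_and_last(name))
```

Note that a one-word name matches both patterns, so it comes out doubled; the SQL version concatenates the same two matches and behaves the same way.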
