Is it possible to set a custom delimiter/separator for a specific column when cleaning data with pandas? - python-3.x

My dataset is a .txt file separated by colons (:). One of the columns contains a date AND time; the date parts are separated by slashes (/), which is fine. However, the time parts are separated by colons (:), just like the rest of the data, which throws off my method for cleaning the data.
Example of a couple of lines of the dataset:
USA:Houston, Texas:05/06/2020 12:00:00 AM:car
Japan:Tokyo:05/06/2020 11:05:10 PM:motorcycle
USA:Houston, Texas:12/15/2020 12:00:10 PM:car
Japan:Kyoto:01/04/1999 05:30:00 PM:bicycle
I'd like to clean the dataset before loading it into a dataframe in python using pandas. How do I separate the columns? I can't use
df = pandas.read_csv('example.txt', sep=':', header=None)
because that will separate the time data into different columns. Any ideas?

You can concatenate the columns back:
import pandas as pd

df = pd.read_csv("example.txt", sep=":", header=None)
# columns 2, 3 and 4 hold the pieces of the timestamp that were split on ":"
df[6] = pd.to_datetime(
    df[2].astype(str) + ":" + df[3].astype(str) + ":" + df[4].astype(str)
)
df = df[[0, 1, 6, 5]].rename(
    columns={0: "State", 1: "City", 6: "Time", 5: "Type"}
)
print(df)
Prints:
State City Time Type
0 USA Houston, Texas 2020-05-06 00:00:00 car
1 Japan Tokyo 2020-05-06 23:05:10 motorcycle
2 USA Houston, Texas 2020-12-15 12:00:10 car
3 Japan Kyoto 1999-01-04 17:30:00 bicycle

df = pd.read_csv("example.txt", sep=r"(?:(?<!\d):(?!\d)?)", header=None, engine="python")
The regex splits on a colon that is not preceded by a digit, via the negative lookbehind (?<!\d). The negative lookahead (?!\d) is made optional so that a case like Tokyo:05 (non-digit before, digit after) is still split. The ?: at the start makes the group non-capturing, so the matched colons themselves are not kept as extra columns. Passing engine="python" is explicit here because only the Python engine supports regex separators (pandas would fall back to it anyway, with a warning).
I get:
0 1 2 3
0 USA Houston, Texas 05/06/2020 12:00:00 AM car
1 Japan Tokyo 05/06/2020 11:05:10 PM motorcycle
2 USA Houston, Texas 12/15/2020 12:00:10 PM car
3 Japan Kyoto 01/04/1999 05:30:00 PM bicycle
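As a quick standalone check of that separator on one sample line (a sketch; re.split mirrors how pandas' Python engine splits each line on a regex separator):
import re

line = "USA:Houston, Texas:05/06/2020 12:00:00 AM:car"
print(re.split(r"(?:(?<!\d):(?!\d)?)", line))
# ['USA', 'Houston, Texas', '05/06/2020 12:00:00 AM', 'car']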

Related

Map a Pandas Series containing key/value pairs to new columns with data

I have a dataframe containing a pandas series (column 2) as below:
column 1   column 2                                                                                                   column 3
1123       Requested By = John Doe 1\n Requested On = 12 October 2021\n Comments = This is a generic request          INC29192
1251       NaN                                                                                                        INC18217
1918       Requested By = John Doe 2\n Requested On = 2 September 2021\n Comments = This is another generic request   INC19281
I'm struggling to extract, split and map the column 2 data to a set of new columns with the appropriate data for each record (where possible, that is, where data is available, since I have NaNs).
The desired output is something like this (with the original column 2 split out into new columns):
column 1   column 3   Requested By   Requested On       Comments
1123       INC29192   John Doe 1     12 October 2021    This is a generic request
1251       INC18217   NaN            NaN                NaN
1918       INC19281   John Doe 2     2 September 2021   This is another generic request
I have spent quite some time trying various approaches, from lambda functions to comprehensions to explode methods, but haven't quite found a solution that produces the desired output.
First I would convert the column 2 values to dictionaries, then build a DataFrame from them and join it to your df:
df['column 2'] = df['column 2'].apply(
    lambda x: {y.split(' = ', 1)[0]: y.split(' = ', 1)[1]
               for y in x.split(r'\n ')}
    if not pd.isna(x) else {}
)
df = df.join(pd.DataFrame(df['column 2'].values.tolist())).drop('column 2', axis=1)
print(df)
Output:
column 1 column 3 Requested By Requested On Comments
0 1123 INC29192 John Doe 1 12 October 2021 This is a generic request
1 1251 INC18217 NaN NaN NaN
2 1918 INC19281 John Doe 2 2 September 2021 This is another generic request

Extract the mapping dictionary between two columns in pandas

I have a dataframe as shown below.
df:
id player country_code country
1 messi arg argentina
2 neymar bra brazil
3 tevez arg argentina
4 aguero arg argentina
5 rivaldo bra brazil
6 owen eng england
7 lampard eng england
8 gerrard eng england
9 ronaldo bra brazil
10 marria arg argentina
From the above df, I would like to extract the mapping dictionary that relates the country_code column to the country column.
Expected Output:
d = {'arg':'argentina', 'bra':'brazil', 'eng':'england'}
A dictionary has unique keys, so it is possible to convert the Series (with its duplicated index set from country_code) straight to a dict:
d = df.set_index('country_code')['country'].to_dict()
If it is possible that some country values differ for the same country_code, the last value per code is the one that is kept.
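A minimal, self-contained sketch of that behaviour (using a reduced version of the df above); the drop_duplicates line shows one way to keep the first value per code instead:
import pandas as pd

df = pd.DataFrame({
    "country_code": ["arg", "bra", "arg", "eng"],
    "country": ["argentina", "brazil", "argentina", "england"],
})

# Duplicated country_code keys simply overwrite each other; the last one wins.
d = df.set_index("country_code")["country"].to_dict()
print(d)  # {'arg': 'argentina', 'bra': 'brazil', 'eng': 'england'}

# To keep the first value per code instead, de-duplicate explicitly first.
d_first = df.drop_duplicates("country_code").set_index("country_code")["country"].to_dict()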

How can I create an aggregate/summary pandas dataframe based on overlapping dates derived from a more specific dataframe?

The following dataframe shows trips made by employees of different companies:
source:
import pandas as pd
emp_trips = {'Name'   : ['Bob', 'Joe', 'Sue', 'Jack', 'Henry', 'Frank', 'Lee', 'Jack'],
             'Company': ['ABC', 'ABC', 'ABC', 'HIJ', 'HIJ', 'DEF', 'DEF', 'DEF'],
             'Depart' : ['01/01/2020', '01/01/2020', '01/06/2020', '01/01/2020', '05/01/2020', '01/13/2020', '01/12/2020', '01/14/2020'],
             'Return' : ['01/31/2020', '02/15/2020', '02/20/2020', '03/01/2020', '05/05/2020', '01/15/2020', '01/30/2020', '02/02/2020'],
             'Charges': [10.10, 20.25, 30.32, 40.00, 50.01, 60.32, 70.99, 80.87]
             }
df = pd.DataFrame(emp_trips, columns=['Name', 'Company', 'Depart', 'Return', 'Charges'])

# Convert to date format
df['Return'] = pd.to_datetime(df['Return'])
df['Depart'] = pd.to_datetime(df['Depart'])
output:
Name Company Depart Return Charges
0 Bob ABC 2020-01-01 2020-01-31 10.10
1 Joe ABC 2020-01-01 2020-02-15 20.25
2 Sue ABC 2020-01-06 2020-02-20 30.32
3 Jack HIJ 2020-01-01 2020-03-01 40.00
4 Henry HIJ 2020-05-01 2020-05-05 50.01
5 Frank DEF 2020-01-13 2020-01-15 60.32
6 Lee DEF 2020-01-12 2020-01-30 70.99
7 Jack DEF 2020-01-14 2020-02-02 80.87
How can I create another dataframe based on the following aspects:
The original dataframe is based on employee names/trips.
The generated dataframe will be based on companies grouped by overlapping dates.
The 'Name' column will not be included as it is no longer needed.
The 'Company' column will remain.
The 'Depart' date will be of the earliest date of any overlapping trip dates.
The 'Return' date will be of the latest date of any overlapping trip dates.
Any company trips that do not have overlapping dates will each be their own entry/row.
The 'Charges' for each trip will be totaled for the new company entry.
Here is the desired output of the new dataframe:
Company Depart Return Charges
0 ABC 01/01/2020 02/20/2020 60.67
1 HIJ 01/01/2020 03/01/2020 40.00
2 HIJ 05/01/2020 05/05/2020 50.01
3 DEF 01/12/2020 02/02/2020 212.18
I've looked into the following as possible solutions:
Create a hierarchical index based on the company and date. As I worked through this, I realized that all this really does is create a hierarchical index but that's based on the specific columns. Also, this method won't aggregate the individual rows into summary rows.
df1 = df.set_index(['Company', not exactly sure how to say overlapping dates])
I also tried using timedelta, but it resulted in True/False values in a separate column, and I'm not entirely sure how that would be used to combine rows into a single row based on overlapping dates and company. Also, I don't think groupby('Company') alone works, since a company can have separate, non-overlapping trips that need their own rows.
df['trips_overlap'] = (df.groupby('Company')
                         .apply(lambda x: (x['Return'].shift() - x['Depart']) > timedelta(0))
                         .reset_index(level=0, drop=True))
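No answer is included for this question in the excerpt; as a rough sketch of one possible approach (not from the original thread, and the helper column name trip_group is illustrative): sort each company's trips by departure, start a new group whenever a trip departs after the latest return date seen so far, then aggregate each group. Under those assumptions this should reproduce the four desired rows (in company-sorted order).
import pandas as pd

# Merge overlapping trips per company, then aggregate each merged group.
df = df.sort_values(['Company', 'Depart']).reset_index(drop=True)

# Latest return date seen so far within each company, excluding the current row.
prev_max_return = df.groupby('Company')['Return'].transform(lambda s: s.cummax().shift())

# A trip starts a new group when it departs after everything seen so far has returned.
df['trip_group'] = (df['Depart'] > prev_max_return).cumsum()

summary = (df.groupby(['Company', 'trip_group'], sort=False)
             .agg(Depart=('Depart', 'min'),
                  Return=('Return', 'max'),
                  Charges=('Charges', 'sum'))
             .reset_index()
             .drop(columns='trip_group'))
print(summary)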

Select two or more consecutive rows based on a criteria using python

I have a data set like this:
user time city cookie index
A 2019-01-01 11.00 NYC 123456 1
A 2019-01-01 11.12 CA 234567 2
A 2019-01-01 11.18 TX 234567 3
B 2019-01-02 12.19 WA 456789 4
B 2019-01-02 12.21 FL 456789 5
B 2019-01-02 12.31 VT 987654 6
B 2019-01-02 12.50 DC 157890 7
A 2019-01-03 09:12 CA 123456 8
A 2019-01-03 09:27 NYC 345678 9
A 2019-01-03 09:34 TX 123456 10
A 2019-01-04 09:40 CA 234567 11
In this data set I want to compare and select two or more consecutive rows which fit the following criteria:
User should be same
Time difference should be less than 15 mins
Cookie should be different
So if I apply the filter I should get the following data:
user time city cookie index
A 2019-01-01 11.00 NYC 123456 1
A 2019-01-01 11.12 CA 234567 2
B 2019-01-02 12.21 FL 456789 5
B 2019-01-02 12.31 VT 987654 6
A 2019-01-03 09:12 CA 123456 8
A 2019-01-03 09:27 NYC 345678 9
A 2019-01-03 09:34 TX 123456 10
So, in the above, the first two rows (index 1 and 2) satisfy all the conditions. The next pair (index 2 and 3) have the same cookie, index 3 and 4 have different users, 5 and 6 are selected and displayed, and 6 and 7 have a time difference of more than 15 mins. Rows 8, 9 and 10 fit the criteria, but 11 doesn't, as its date is about 24 hours apart.
How can I solve this using a pandas dataframe? All help is appreciated.
What I have tried:
I tried creating flags using shift():
cookiediff=pd.DataFrame(df.Cookie==df.Cookie.shift())
cookiediff.columns=['Cookiediffs']
timediff=pd.DataFrame(pd.to_datetime(df.time) - pd.to_datetime(df.time.shift()))
timediff.columns=['timediff']
mask = df.user != df.user.shift(1)
timediff.timediff[mask] = np.nan
cookiediff['Cookiediffs'][mask] = np.nan
This will do the trick:
import numpy as np

# you have an inconsistent time delimiter - just correcting it per your sample data
df["time"] = df["time"].str.replace(":", ".")
df["time"] = pd.to_datetime(df["time"], format="%Y-%m-%d %H.%M")

cond_ = np.logical_or(
    df["time"].sub(df["time"].shift()).astype('timedelta64[m]').lt(15)
    & df["user"].eq(df["user"].shift())
    & df["cookie"].ne(df["cookie"].shift()),
    df["time"].sub(df["time"].shift(-1)).astype('timedelta64[m]').lt(15)
    & df["user"].eq(df["user"].shift(-1))
    & df["cookie"].ne(df["cookie"].shift(-1)),
)
res = df.loc[cond_]
A few points: you need to ensure your time column is a datetime so that the 15-minute condition can actually be evaluated.
Then the final filter (cond_) is obtained by comparing each row against the previous one on all three conditions, OR-ed with the same comparison against the next row (otherwise you would only get the later row of each matching pair and miss the first one).
Outputs:
user time city cookie index
0 A 2019-01-01 11:00:00 NYC 123456 1
1 A 2019-01-01 11:12:00 CA 234567 2
4 B 2019-01-02 12:21:00 FL 456789 5
5 B 2019-01-02 12:31:00 VT 987654 6
7 A 2019-01-03 09:12:00 CA 123456 8
8 A 2019-01-03 09:27:00 NYC 345678 9
9 A 2019-01-03 09:34:00 TX 123456 10
You could use regular expressions to isolate the fields, using named groups and the groupdict() function to store each field's value in a dictionary. Then iterate through the dataset line by line, keeping two dictionaries (the current line's and the last line's), perform a re.search() on each line with the pattern to split it into named fields, and compare the values of the two dictionaries.
So, something like:
import re
c_dict = re.search(r'(?P<user>\w) +(?P<time>\d{4}-\d{2}-\d{2} \d{2}\.\d{2}) +(?P<city>\w+) +(?P<cookie>\d{6}) +(?P<index>\d+)', s).groupdict()
for each line of your dataset. For the first line of your dataset, this would create the dictionary {'user': 'A', 'time': '2019-01-01 11.00', 'city': 'NYC', 'cookie': '123456', 'index': '1'}. With the fields isolated, you could easily compare the values of the fields to previous lines if you stored those in another dictionary.
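A fuller sketch of that loop (the file name, variable names and the widened [.:] time separator are illustrative assumptions; the separator is widened because the sample data mixes "." and ":" in the time column):
import re

pattern = re.compile(
    r'(?P<user>\w) +(?P<time>\d{4}-\d{2}-\d{2} \d{2}[.:]\d{2}) +'
    r'(?P<city>\w+) +(?P<cookie>\d{6}) +(?P<index>\d+)'
)

last_dict = None
with open('dataset.txt') as fh:  # hypothetical input file
    for line in fh:
        m = pattern.search(line)
        if not m:
            continue  # skip the header or any malformed line
        c_dict = m.groupdict()
        if last_dict is not None:
            same_user = c_dict['user'] == last_dict['user']
            diff_cookie = c_dict['cookie'] != last_dict['cookie']
            # the 15-minute check would still need datetime parsing of c_dict['time']
            print(same_user, diff_cookie, c_dict)
        last_dict = c_dict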

Read a NASDAQ HTML table to a Dataframe

I get the most recent list of traded companies from NASDAQ with this code; however, I'd like to have the results in a dataframe instead of just a list with all the other info I might not need.
Any ideas how that could be achieved? Thanks
Parsing the latest NASDAQ company list:
from bs4 import BeautifulSoup
import requests

r = requests.get('https://www.nasdaq.com/screening/companies-by-industry.aspx'
                 '?exchange=NASDAQ&sortname=marketcap&sorttype=1&pagesize=4000')
data = r.text
soup = BeautifulSoup(data, "html.parser")
table = soup.find("table", {"id": "CompanylistResults"})

for row in table.findAll("tr"):
    for cell in row("td"):
        print(cell.get_text().strip())
Looks like you are looking for the aptly named read_html, though you need to play around until you get what you want. In your case:
>>> import pandas as pd
>>> df = pd.read_html(table.prettify(), flavor='bs4')[0]
>>> df.columns = [c.strip() for c in df.columns]
See output below.
The first line is what gets the job done, and the second one just strips off all those pesky spaces and newlines in your header. It looks like there is a hidden ADR TSO column which seems useless, so you could drop it if you do not know what it is. It may also make sense to drop every second row, since those are just a continuation of the row above (useless links, as far as I can tell). Each of those steps is a one-liner:
>>> df = df.drop(['ADR TSO'], axis=1)  # Drop useless column
>>> df1 = df[::2]  # To get rid of even rows
>>> df2 = df[~df['Name'].str.contains('Stock Quote')].head()  # By string filtration if we are not sure about the odd/even thing
Output of original head just for show:
>>> df.head()
Name Symbol Market Cap \
0 Amazon.com, Inc. AMZN $802.18B
1 AMZN Stock Quote AMZN Ratings AMZN Stock Report NaN NaN
2 Microsoft Corporation MSFT $789.12B
3 MSFT Stock Quote MSFT Ratings MSFT Stock Report NaN NaN
4 Alphabet Inc. GOOGL $740.3B
ADR TSO Country IPO Year \
0 NaN United States 1997
1 NaN NaN NaN
2 NaN United States 1986
3 NaN NaN NaN
4 NaN United States n/a
Subsector
0 Catalog/Specialty Distribution
1 NaN
2 Computer Software: Prepackaged Software
3 NaN
4 Computer Software: Programming, Data Processing
Output of cleaned df.head():
Name Symbol Market Cap Country IPO Year \
0 Amazon.com, Inc. AMZN $802.18B United States 1997
2 Microsoft Corporation MSFT $789.12B United States 1986
4 Alphabet Inc. GOOGL $740.3B United States n/a
6 Alphabet Inc. GOOG $735.24B United States 2004
8 Apple Inc. AAPL $720.3B United States 1980
Subsector
0 Catalog/Specialty Distribution
2 Computer Software: Prepackaged Software
4 Computer Software: Programming, Data Processing
6 Computer Software: Programming, Data Processing
8 Computer Manufacturing
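As a side note, read_html can also pick the table out of the full response text directly, which makes the BeautifulSoup step optional; this is a sketch assuming the same response object r from the question (the page layout may have changed since):
import pandas as pd

# Parse only the <table> whose id is CompanylistResults straight from the HTML text.
dfs = pd.read_html(r.text, attrs={"id": "CompanylistResults"}, flavor="bs4")
df = dfs[0]
df.columns = [c.strip() for c in df.columns]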
