How to display latitude and longitude as two separate columns from a current single column with no separator? - python-3.x

I have a data frame in Pandas with a column labeled “Location.”
The column is of object dtype, and its values have the following format:
Location
------------------------------
POINT (-73.525969 41.081897)
I’d like to remove the formatting and store the two numbers in two new columns, Latitude and Longitude. How would I accomplish this?
Similar posts I’ve reviewed always have a delimiter between the numbers (such as a comma), but this one doesn’t.
Thank you!
|   | Serial Number | List Year | Date Recorded | Town | Address | Assessed Value | Sale Amount | Sales Ratio | Property Type | Residential Type | Non Use Code | Assessor Remarks | OPM remarks | Location |
|---|---------------|-----------|---------------|------|---------|----------------|-------------|-------------|---------------|------------------|--------------|------------------|-------------|----------|
| 0 | 141466 | 2014 | 2015-08-06 | Stamford | 83 OVERBROOK DRIVE | 503270.0 | 850000.0 | 0.592082 | Residential | Single Family | NaN | NaN | NaN | POINT (-73.525969 41.081897) |
| 1 | 140604 | 2014 | 2015-06-29 | New Haven | 56 HIGHVIEW LANE | 86030.0 | 149900.0 | 0.573916 | Residential | Single Family | NaN | NaN | NaN | POINT (-72.878115 41.30285) |
| 2 | 14340 | 2014 | 2015-07-01 | Ridgefield | 32 OVERLOOK DR | 351880.0 | 570000.0 | 0.617333 | Residential | Single Family | NaN | NaN | NaN | POINT (-73.508273 41.286223) |
| 3 | 140455 | 2014 | 2015-04-30 | New Britain | 171 BRADFORD WALK | 204680.0 | 261000.0 | 0.784215 | Residential | Condo | NaN | NaN | NaN | POINT (-72.775592 41.713335) |
| 4 | 141195 | 2014 | 2015-06-26 | Stamford | 555 GLENBROOK ROAD | 229330.0 | 250000.0 | 0.917320 | Residential | Single Family | NaN | NaN | NaN | POINT (-73.519774 41.07203) |

You can use pandas.Series.str.extract:
df['Location'].str.extract(r'POINT \((?P<longitude>[-\d.]+)\s+(?P<latitude>[-\d.]+)\)').astype(float)
Note that WKT POINT text puts longitude first, then latitude (the first number here, -73.5..., is outside latitude's -90..90 range). You might need to change the \s+ separator depending on your real input.
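For example, a minimal self-contained sketch using the sample data from the question (the final column assignment is just one assumption about how you might want to store the result):
import pandas as pd

df = pd.DataFrame({'Location': ['POINT (-73.525969 41.081897)',
                                'POINT (-72.878115 41.30285)']})

# pull out both numbers, then reorder so Latitude comes first as requested
coords = df['Location'].str.extract(
    r'POINT \((?P<longitude>[-\d.]+)\s+(?P<latitude>[-\d.]+)\)'
).astype(float)
df[['Latitude', 'Longitude']] = coords[['latitude', 'longitude']].to_numpy()
print(df)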

Related

Map Pandas Series Containing key/value pairs to a new columns with data

I have a dataframe containing a pandas series (column 2) as below:
| column 1 | column 2 | column 3 |
|----------|----------|----------|
| 1123 | Requested By = John Doe 1\n Requested On = 12 October 2021\n Comments = This is a generic request | INC29192 |
| 1251 | NaN | INC18217 |
| 1918 | Requested By = John Doe 2\n Requested On = 2 September 2021\n Comments = This is another generic request | INC19281 |
I'm struggling to extract, split, and map the column 2 data to a series of new columns with the appropriate data for each record (where possible, that is, since there are NaNs).
The desired output is something like this (where I've dropped column 2 for legibility):
| column 1 | column 3 | Requested By | Requested On | Comments |
|----------|----------|--------------|--------------|----------|
| 1123 | INC29192 | John Doe 1 | 12 October 2021 | This is a generic request |
| 1251 | INC18217 | NaN | NaN | NaN |
| 1918 | INC19281 | John Doe 2 | 2 September 2021 | This is another generic request |
I have spent quite some time trying various approaches, from lambda functions to comprehensions to explode methods, but haven't quite found a solution that provides the desired output.
First I would convert the column 2 values to dictionaries, then build a DataFrame from them and join it to your df:
import pandas as pd

df['column 2'] = df['column 2'].apply(
    lambda x: {y.split(' = ', 1)[0]: y.split(' = ', 1)[1]
               for y in x.split(r'\n ')}
    if not pd.isna(x) else {}
)
df = df.join(pd.DataFrame(df['column 2'].values.tolist())).drop('column 2', axis=1)
print(df)
Output:
column 1 column 3 Requested By Requested On Comments
0 1123 INC29192 John Doe 1 12 October 2021 This is a generic request
1 1251 INC18217 NaN NaN NaN
2 1918 INC19281 John Doe 2 2 September 2021 This is another generic request
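For reference, a self-contained version of the above that runs as-is; it assumes column 2 holds literal \n character sequences (as shown in the question) rather than real newlines:
import pandas as pd

df = pd.DataFrame({
    'column 1': [1123, 1251, 1918],
    'column 2': [
        r'Requested By = John Doe 1\n Requested On = 12 October 2021\n Comments = This is a generic request',
        None,
        r'Requested By = John Doe 2\n Requested On = 2 September 2021\n Comments = This is another generic request',
    ],
    'column 3': ['INC29192', 'INC18217', 'INC19281'],
})

# parse each "key = value" pair into a dict; NaN rows become empty dicts
parsed = df['column 2'].apply(
    lambda x: dict(p.split(' = ', 1) for p in x.split(r'\n '))
    if pd.notna(x) else {}
)
df = df.drop(columns='column 2').join(pd.DataFrame(parsed.tolist(), index=df.index))
print(df)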

How can I speed up this pandas dataframe for loop computation?

I have the following dataframe of the BTC price for each minute from 2018-01-15 17:01:00 to 2020-10-31 09:59:00. As you can see, this is 1,468,379 rows of data, so my code needs to be optimized, otherwise computations can take a long time.
dfcondensed = df[["Open","Close","Buy", "Sell"]]
dfcondensed
Timestamp Open Close Buy Sell
2018-01-15 17:01:00 14174.00 14185.25 14185.11 NaN
2018-01-15 17:02:00 14185.11 14185.15 NaN NaN
2018-01-15 17:03:00 14185.25 14157.32 NaN NaN
2018-01-15 17:04:00 14177.52 14184.71 NaN NaN
2018-01-15 17:05:00 14185.03 14185.14 NaN NaN
... ... ... ... ...
2020-10-31 09:55:00 13885.00 13908.36 NaN NaN
2020-10-31 09:56:00 13905.38 13915.81 NaN NaN
2020-10-31 09:57:00 13909.02 13936.00 NaN NaN
2020-10-31 09:58:00 13936.00 13920.78 NaN NaN
2020-10-31 09:59:00 13924.56 13907.85 NaN NaN
1468379 rows × 4 columns
The algorithm that I'm trying to run is this:
PnL = []
for i in range(dfcondensed.shape[0]):
    if str(dfcondensed['Buy'].isnull().values[i]) == "False":
        for j in range(dfcondensed.shape[0] - i):
            if str(dfcondensed['Sell'].isnull().values[i + j]) == "False":
                PnL.append(((dfcondensed["Open"].iloc[i + j + 1] - dfcondensed["Open"].iloc[i + 1]) / dfcondensed["Open"].iloc[i + 1]) * 100)
                break
Basically, to make it clear, what I'm trying to do is assess the profit/loss of buying/selling at the points in the Buy/Sell columns. So in the first row the strategy being tested says buy at 14185.11, which was the open price at 2018-01-15 17:02:00. The algorithm should then look for when the strategy tells it to sell and mark that down, then look for the next time it's told to buy and mark that down, then the next sell, and so on. By the end there were over 7,000 different trades. I want to see the profit per trade so I can do some analysis and improve my strategy.
Using the above code to get a PnL list seems to run for a long time and I gave up waiting for it. How can I speed up the algorithm?
I found a way to speed up my loop using list comprehensions and by flattening the nested loop:
buylist = df["Buy"]
selllist = df["Sell"]
buylist = [x for x in buylist if str(x) != 'nan']
selllist = [x for x in selllist if str(x) != 'nan']

profit = []
for i in range(len(selllist)):
    profit.append((selllist[i] - buylist[i]) / buylist[i] * 100)
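If the buy and sell signals strictly alternate (buy, sell, buy, sell, ...), which the pairing above already assumes, the same numbers can be computed without an explicit Python loop. A minimal sketch:
buys = df['Buy'].dropna().to_numpy()
sells = df['Sell'].dropna().to_numpy()
n = min(len(buys), len(sells))  # guard against a trailing unmatched buy signal

# percentage profit per trade, element-wise over all buy/sell pairs at once
profit = (sells[:n] - buys[:n]) / buys[:n] * 100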

Is it possible to set a custom delimiter/separator for a specific column when cleaning data with pandas?

My dataset is a .txt file separated by colons (:). One of the columns contains a date AND time; the date parts are separated by slashes (/), which is fine. However, the time parts are separated by colons (:), just like the rest of the data, which throws off my method for cleaning the data.
Example of a couple of lines of the dataset:
USA:Houston, Texas:05/06/2020 12:00:00 AM:car
Japan:Tokyo:05/06/2020 11:05:10 PM:motorcycle
USA:Houston, Texas:12/15/2020 12:00:10 PM:car
Japan:Kyoto:01/04/1999 05:30:00 PM:bicycle
I'd like to clean the dataset before loading it into a dataframe in python using pandas. How do I separate the columns? I can't use
df = pandas.read_csv('example.txt', sep=':', header=None)
because that will separate the time data into different columns. Any ideas?
You can concatenate the columns back:
import pandas as pd

df = pd.read_csv("example.txt", sep=":", header=None)
df[6] = pd.to_datetime(
    df[2].astype(str) + ":" + df[3].astype(str) + ":" + df[4].astype(str)
)
df = df[[0, 1, 6, 5]].rename(
    columns={0: "State", 1: "City", 6: "Time", 5: "Type"}
)
print(df)
Prints:
State City Time Type
0 USA Houston, Texas 2020-05-06 00:00:00 car
1 Japan Tokyo 2020-05-06 23:05:10 motorcycle
2 USA Houston, Texas 2020-12-15 12:00:10 car
3 Japan Kyoto 1999-01-04 17:30:00 bicycle
pd.read_csv(file, sep=r"(?:(?<!\d):(?!\d)?)", header=None)
The regex matches a colon that is not preceded by a digit (the negative lookbehind (?<!\d)), optionally followed by a non-digit (the negative lookahead (?!\d)). The lookahead is made optional to cover a case like Tokyo:05, where the colon is followed by a digit but should still split. The ?: at the beginning makes the group non-capturing, i.e. "don't keep the :'s you find in the result".
I get:
0 1 2 3
0 USA Houston, Texas 05/06/2020 12:00:00 AM car
1 Japan Tokyo 05/06/2020 11:05:10 PM motorcycle
2 USA Houston, Texas 12/15/2020 12:00:10 PM car
3 Japan Kyoto 01/04/1999 05:30:00 PM bicycle
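For completeness, a self-contained sketch of that call; engine="python" is passed explicitly because the C parser does not support regex separators (pandas would otherwise fall back to the python engine with a ParserWarning):
import io
import pandas as pd

data = '''USA:Houston, Texas:05/06/2020 12:00:00 AM:car
Japan:Tokyo:05/06/2020 11:05:10 PM:motorcycle
USA:Houston, Texas:12/15/2020 12:00:10 PM:car
Japan:Kyoto:01/04/1999 05:30:00 PM:bicycle'''

# split only on colons that are not preceded by a digit
df = pd.read_csv(io.StringIO(data), sep=r"(?:(?<!\d):(?!\d)?)", header=None, engine="python")
print(df)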

How to update rows based on previous row of dataframe python

I have a time series data given below:
date product price amount
11/01/2019 A 10 20
11/02/2019 A 10 20
11/03/2019 A 25 15
11/04/2019 C 40 50
11/05/2019 C 50 60
I have high-dimensional data; I have included a simplified version with just the two value columns, price and amount. I am trying to transform it into relative changes based on the time index, as illustrated below:
date product price amount
11/01/2019 A NaN NaN
11/02/2019 A 0 0
11/03/2019 A 15 -5
11/04/2019 C NaN NaN
11/05/2019 C 10 10
I am trying to get the relative changes for each product based on the time index. If a previous date does not exist for a given product, I add NaN.
Can you please tell me, is there a function to do this?
Group by product and use .diff():
df[["price", "amount"]] = df.groupby("product")[["price", "amount"]].diff()
Output:
date product price amount
0 2019-11-01 A NaN NaN
1 2019-11-02 A 0.0 0.0
2 2019-11-03 A 15.0 -5.0
3 2019-11-04 C NaN NaN
4 2019-11-05 C 10.0 10.0
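For reference, a sketch of an equivalent formulation with shift, in case you also want to keep the previous row's values around:
# previous row's price/amount within each product group
prev = df.groupby("product")[["price", "amount"]].shift()
df[["price", "amount"]] = df[["price", "amount"]] - prev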

How do I select all dates that are most recent in comparison with a different date field?

I'm trying to gather all dates between 06-01-2020 and 06-30-2020 based on the forecast date, which can be 06-08-2020, 06-20-2020, or 06-24-2020. The problem I am running into is that I'm only grabbing the dates associated with the forecast date 06-24-2020. I need the most recent row for every date, so if, say, 06-03-2020 occurs with the forecast date 06-08-2020 and not with 06-20-2020, I still need the row associated with that forecast date. Here's the code I am currently using.
df = df[df['Forecast Date'].isin([max(df['Forecast Date'])])]
It's producing this:
            Date Media Granularity Forecast Date
5668  2020-06-25               NaN    2020-06-24
5669  2020-06-26               NaN    2020-06-24
5670  2020-06-27               NaN    2020-06-24
5671  2020-06-28               NaN    2020-06-24
5672  2020-06-29               NaN    2020-06-24
5673  2020-06-30               NaN    2020-06-24
With a length of 6 (len(df[df['Forecast Date'].isin([max(df['Forecast Date'])])])). It needs to be a length of 30, one row for each unique date. It is only grabbing the rows where the Forecast Date is the overall max, 06-24-2020.
I'm thinking it's something along the lines of df.sort_values(df[['Date', 'Forecast Date']]).drop_duplicates(df['Date'], keep='last') but it's giving me a key error.
It was easy but not what I expected.
df = df.sort_values(by=['Date', 'Forecast Date']).drop_duplicates(subset=['Date'], keep='last')
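A sketch of an equivalent groupby formulation, assuming 'Forecast Date' is (or has been parsed as) a datetime column:
# index label of the most recent forecast for each calendar date
idx = df.groupby('Date')['Forecast Date'].idxmax()
df = df.loc[idx]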
