I am attempting to extract a substring that starts and ends at different characters. I have tried several regex patterns and am coming close to the output I need, but it is not fully correct. What can I do to fix this?
Data CSV:
ID,TEST
abc,1#London4#Harry Potter#5Rowling##
cde,6#Harry Potter1#England#5Rowling
efg,4#Harry Potter#5Rowling##1#USA
ghi,
jkm,4#Harry Potter5#Rowling
xyz,4#Harry Potter1#China#5Rowling
Code:
import pandas as pd
df = pd.read_csv('sample2.csv')
print(df)
My attempt:
df['TEST'].astype(str).str.extract('(1#.*(?=#))')
Output from the above code (it doesn't pick up the trailing '1#USA'):
1#London4#Harry Potter#5Rowling#
1#England
NaN
NaN
NaN
1#China
Output needed:
1#London
1#England
1#USA
NaN
NaN
1#China
You can do this:
>>> df.TEST.str.extract("(1#[a-zA-Z]*)")
0
0 1#London
1 1#England
2 1#USA
3 NaN
4 NaN
5 1#China
You can try:
# capture all characters that are neither `#` nor digits
# following 1#
df['TEST'].str.extract(r'(1#[^#\d]+)', expand=False)
Output:
0 1#London
1 1#England
2 1#USA
3 NaN
4 NaN
5 1#China
Name: TEST, dtype: object
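If you only want the text after the 1# marker, a small variation of the same pattern (my own tweak, not part of the answer above) moves the character class inside the capture group so the prefix is dropped:
df['TEST'].str.extract(r'1#([^#\d]+)', expand=False)
which should give:
0     London
1    England
2        USA
3        NaN
4        NaN
5      China
Name: TEST, dtype: object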
How do I remove NaN values from a dataframe in Python? I already tried dropna(), but that did not work for me. Also, is NaN different from nan? I am using Pandas.
When printing the dataframe, the missing values print as nan rather than NaN.
1 2.11358 0.649067060588935
2 nan 0.6094130485307419
3 2.10066 0.3653980276694516
4 2.10545 nan
The 'nan' entries are strings, so dropna() ignores them. You can replace them with np.nan using replace() and then use dropna().
import numpy as np
df = df.replace('nan', np.nan)
df = df.dropna()
Update:
Original dataframe:
1 2.11358 0.649067060588935
2 nan 0.6094130485307419
3 2.10066 0.3653980276694516
4 2.10545 nan
Applied df.replace('nan', np.nan):
1 2.11358 0.649067060588935
2 NaN 0.6094130485307419
3 2.10066 0.3653980276694516
4 2.10545 NaN
Applied df.dropna():
1 2.11358 0.649067060588935
3 2.10066 0.3653980276694516
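For reference, a minimal self-contained sketch of the whole flow (the column names and values here are made up for illustration; the key assumption is that the missing entries are the literal string 'nan'):
import numpy as np
import pandas as pd

# the 'nan' entries are strings, not real missing values,
# so dropna() alone would not remove these rows
df = pd.DataFrame({
    'x': [2.11358, 'nan', 2.10066, 2.10545],
    'y': [0.649067, 0.609413, 0.365398, 'nan'],
})

df = df.replace('nan', np.nan)  # turn the strings into real NaN
df = df.dropna()                # now the incomplete rows can be dropped
print(df)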
According to the Pandas documentation, pandas.read_csv (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) should support detection of bad lines through error_bad_lines and warn_bad_lines set to True.
So I created a badly formatted CSV called test.csv:
aaa,bbb,ccc
ssdf,sdtf,aesrt,,,,
eart,erate
aert,aert,aert
and ran read_csv:
>>> pd.read_csv('test.csv', error_bad_lines=True)
aaa bbb ccc
ssdf sdtf aesrt NaN NaN NaN NaN
eart erate NaN NaN NaN NaN NaN
As I understand the documentation, this should raise an error, but it doesn't; the bad CSV is loaded instead. Pandas seems to ignore error_bad_lines / warn_bad_lines entirely.
Is my understanding of the documentation wrong, or is it really a bug in Pandas? Does anyone know an elegant workaround to load only a correct CSV?
I'm working with Python 3.6.8, Pandas 0.25.0 and Ubuntu 18.04.
I ran some tests and found that the second line determines the number of columns expected in the rest of the file.
For example, the second line (ssdf,sdtf,aesrt,,,,) has 7 columns, so as long as all the following rows have 7 or fewer columns, no error is raised.
If you modify one row to have more than 7, it will crash. The default value of error_bad_lines is True, so you don't need to specify it explicitly.
Example without an error:
data.csv:
0,jjjjjj
1,er,ate,, # 5 columns
2,bb,b
3,sdtf,aesrt,ll,sdfd # 5 columns, so no errors appear.
4,erate,
5,aert,aert
df1 = pd.read_csv('data.csv')
df1
Result: no error
0 jjjjjj
1 er ate NaN NaN
2 bb b NaN NaN
3 sdtf aesrt ll sdfd
4 erate NaN NaN NaN
5 aert aert NaN NaN
Example with an error:
data.csv:
0,jjjjjj
1,er,ate,, # 5 columns
2,bb,b
3,sdtf,aesrt,ll,sdfd,sdf,sdf,sdf,sdf, # more than 5 columns
4,erate,
5,aert,aert
df1 = pd.read_csv('data.csv')
df1
Result: error!
ParserError: Error tokenizing data. C error: Expected 5 fields in line 4, saw 10
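If you need to load only the rows that match the header's column count, one possible workaround (a sketch assuming simple comma-separated data; quoted fields containing commas would need the csv module instead of a plain count) is to pre-filter the lines before handing them to pandas:
from io import StringIO
import pandas as pd

with open('test.csv') as f:
    lines = f.read().splitlines()

# field count implied by the header row
expected = lines[0].count(',') + 1

# keep only the lines with exactly that many fields
good = [ln for ln in lines if ln.count(',') + 1 == expected]

df = pd.read_csv(StringIO('\n'.join(good)))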
I'm appending data from a list to a pandas DataFrame, but I keep getting NaN in my entries.
Based on what I've read, I think I might have to specify the data type for each column in my code.
dumps = []
features_df = pd.DataFrame()
for i in range(int(len(ids)/50)):
    dumps = sp.audio_features(ids[i*50:50*(i+1)])
    for i in range(len(dumps)):
        print(list(dumps[0].values()))
        features_df = features_df.append(list(dumps[0].values()), ignore_index=True)
Expected result, something like:
[0.833, 0.539, 11, -7.399, 0, 0.178, 0.163, 2.1e-06, 0.101, 0.385, 99.947, 'audio_features', '6MWtB6iiXyIwun0YzU6DFP', 'spotify:track:6MWtB6iiXyIwun0YzU6DFP', 'https://api.spotify.com/v1/tracks/6MWtB6iiXyIwun0YzU6DFP', 'https://api.spotify.com/v1/audio-analysis/6MWtB6iiXyIwun0YzU6DFP', 149520, 4]
for one row.
Actual:
danceability energy ... duration_ms time_signature
0 NaN NaN ... NaN NaN
1 NaN NaN ... NaN NaN
2 NaN NaN ... NaN NaN
3 NaN NaN ... NaN NaN
4 NaN NaN ... NaN NaN
5 NaN NaN ... NaN NaN
All rows come out as NaN.
The append() strategy in a tight loop isn't a great way to do this. Instead, you can construct an empty DataFrame and then use loc to insert rows at specific positions in the DataFrame index.
For example:
import pandas as pd
df = pd.DataFrame(data=[], columns=['n'])
for i in range(100):
    df.loc[i] = i
print(df)
Timing the script (saved here as append_df.py):
$ time python3 append_df.py
n
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
real 0m13.178s
user 0m12.287s
sys 0m0.617s
From the pandas documentation on DataFrame.append (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html):
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
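A minimal sketch of the list-then-concatenate pattern the docs describe (the column name and values are illustrative):
import pandas as pd

rows = []
for i in range(100):
    rows.append({'n': i})   # collect plain dicts in a list

# build the frame once at the end instead of appending row by row
df = pd.DataFrame(rows)
# or, to extend an existing frame:
# df = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)
print(df)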
I'm encountering a little problem when trying to download data from a URL and put it into an array using pandas:
When I simply do the following:
EUR_USD_TICK = pd.read_csv('https://tickdata.fxcorporate.com/EURUSD/2015/1.csv.gz')
I get this error:
CParserError: Error tokenizing data. C error: Expected 2 fields in line 14, saw 3
I assume this is because the data retrieved from the URL is compressed.
So I tried to set the compression to gzip:
EUR_USD_TICK = pd.read_csv('https://tickdata.fxcorporate.com/EURUSD/2015/1.csv.gz',compression='gzip')
When doing that, I get no error message; however, the array is composed only of NaN values:
D Unnamed: 1 Unnamed: 2
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
The length of the array matches the length of the CSV file, so I have no idea why the array is filled with NaN values.
Would anyone have an idea of why I'm getting this issue?
Thanks
PS: The CSV data that I'm trying to download:
DateTime,Bid,Ask
01/04/2015 22:00:00.389,1.19548,1.19557
01/04/2015 22:00:00.406,1.19544,1.19556
01/04/2015 22:00:00.542,1.19539,1.19556
01/04/2015 22:00:00.566,1.19544,1.19556
It seems that the file contains some characters that Python/pandas doesn't like (NUL bytes, '\x00').
So I used the gzip module to read the file manually and then removed those characters. After that, pandas reads the file without issues.
import pandas as pd
import requests
from io import StringIO, BytesIO
import gzip

# download the raw gzipped bytes
r = requests.get('https://tickdata.fxcorporate.com/EURUSD/2015/1.csv.gz')
# decompress in memory
gzip_file = gzip.GzipFile(fileobj=BytesIO(r.content))
# decode, strip the NUL bytes, then parse
df = pd.read_csv(StringIO(gzip_file.read().decode('utf-8').replace('\x00', '')))
Given the following data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': ['One', 'Two', np.nan],
                   'B': [np.nan, np.nan, 'Three']})
df
A B
0 One NaN
1 Two NaN
2 NaN Three
I'd like to create a column ('C') that takes the value of whichever of 'A' or 'B' is not NaN, like this:
A B C
0 One NaN One
1 Two NaN Two
2 NaN Three Three
Thanks in advance!
You can use combine_first:
df['C'] = df.A.combine_first(df.B)
print(df)
A B C
0 One NaN One
1 Two NaN Two
2 NaN Three Three
Or fillna:
df['C'] = df.A.fillna(df.B)
print(df)
A B C
0 One NaN One
1 Two NaN Two
2 NaN Three Three
Or np.where, supplying a fallback value (e.g. 1) for rows where both columns are NaN:
df['C'] = np.where(df.A.notnull(), df.A, np.where(df.B.notnull(), df.B, 1))
print(df)
A B C
0 One NaN One
1 Two NaN Two
2 NaN Three Three