Python - dataframe url parsing issue - python-3.x

I am trying to extract the domain name from the URLs in one column into another column. It works on a single string, but when I apply it to a DataFrame column it doesn't work. How do I apply this to a DataFrame?
Tried:
from urllib.parse import urlparse
import numpy as np  # needed for np.NaN below
import pandas as pd
id1 = [1,2,3]
ls = ['https://google.com/tensoflow','https://math.com/some/website',np.NaN]
df = pd.DataFrame({'id':id1,'url':ls})
df
# urlparse(df['url']) # ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
# df['url'].map(urlparse) # AttributeError: 'float' object has no attribute 'decode'
Working on a single string:
string = 'https://google.com/tensoflow'
parsed_uri = urlparse(string)
result = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
result
Looking for a column like this:
col3
https://google.com/
https://math.com/
nan
Instead, both attempts above raise errors.

You can try something like this.
Here I have used pandas.Series.apply() to solve it.
» Initialization and imports
>>> from urllib.parse import urlparse
>>> import pandas as pd
>>> id1 = [1,2,3]
>>> import numpy as np
>>> ls = ['https://google.com/tensoflow','https://math.com/some/website',np.NaN]
>>> ls
['https://google.com/tensoflow', 'https://math.com/some/website', nan]
>>>
» Create and inspect the DataFrame.
>>> df = pd.DataFrame({'id':id1,'url':ls})
>>> df
id url
0 1 https://google.com/tensoflow
1 2 https://math.com/some/website
2 3 NaN
>>>
>>> df["url"]
0 https://google.com/tensoflow
1 https://math.com/some/website
2 NaN
Name: url, dtype: object
>>>
» Applying a function using pandas.Series.apply(func) on the url column.
>>> df["url"].apply(lambda url: "{uri.scheme}://{uri.netloc}/".format(uri=urlparse(url)) if not pd.isna(url) else np.nan)
0 https://google.com/
1 https://math.com/
2 NaN
Name: url, dtype: object
>>>
>>> df["url"].apply(lambda url: "{uri.scheme}://{uri.netloc}/".format(uri=urlparse(url)) if not pd.isna(url) else str(np.nan))
0 https://google.com/
1 https://math.com/
2 nan
Name: url, dtype: object
>>>
>>>
» Store the above result in a variable (not mandatory, just to simplify).
>>> s = df["url"].apply(lambda url: "{uri.scheme}://{uri.netloc}/".format(uri=urlparse(url)) if not pd.isna(url) else str(np.nan))
>>> s
0 https://google.com/
1 https://math.com/
2 nan
Name: url, dtype: object
>>>
» Finally, build the new DataFrame.
>>> df2 = pd.DataFrame({"col3": s})
>>> df2
col3
0 https://google.com/
1 https://math.com/
2 nan
>>>
» To make sure what s and df2 are, check their types (again, not mandatory).
>>> type(s)
<class 'pandas.core.series.Series'>
>>>
>>>
>>> type(df2)
<class 'pandas.core.frame.DataFrame'>
>>>
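If you'd rather keep everything on the original DataFrame, you can assign the result straight to a new column. A minimal sketch of that variant (the column name "domain" is just an example):
>>> df["domain"] = df["url"].apply(lambda url: "{uri.scheme}://{uri.netloc}/".format(uri=urlparse(url)) if not pd.isna(url) else np.nan)
>>> df
id url domain
0 1 https://google.com/tensoflow https://google.com/
1 2 https://math.com/some/website https://math.com/
2 3 NaN NaN
>>>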
Reference links:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.isnull.html

Related

How to read an Excel table with one column?

I have a table in Excel with one column that I want to read into a list:
At first I tried it like this:
>>> df = pandas.read_excel('emails.xlsx', sheet_name=None)
>>> df
OrderedDict([('Sheet1', Chadisayed#gmx.com
0 wonderct#mail.ru
1 fcl#fcl-bd.com
2 galina#dorax-investments.com
>>> for k, v in df.items():
... print(type(v), v)
...
<class 'pandas.core.frame.DataFrame'> Chadisayed#gmx.com
0 wonderct#mail.ru
1 fcl#fcl-bd.com
2 galina#dorax-investments.com
>>> df = df.items()[0]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'odict_items' object is not subscriptable
I tried it differently:
>>> df = pandas.read_excel('emails.xlsx', index_col=0)
>>> df
Empty DataFrame
Columns: []
Index: [wonderct#mail.ru, fcl#fcl-bd.com, galina#dorax-investments.com]
[419 rows x 0 columns]
>>> foo = []
>>> for i in df.index:
... foo.append(i)
...
>>> foo
['wonderct#mail.ru', 'fcl#fcl-bd.com', 'galina#dorax-investments.com']
It almost worked, but the first element is missing. What else can I do? Is there really no way to read the Excel file simply line by line?
Try this:
df = pd.read_excel('emails.xlsx', header=None)
target_list = list(df[0].values)
The first element went missing because read_excel treats the first row as the header by default; header=None keeps that row as data.
Use:
target_list = pandas.read_excel('emails.xlsx', index_col=None, names=['A'])['A'].tolist()
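Either way, the key is telling pandas that the sheet has no header row. A minimal end-to-end sketch, assuming emails.xlsx holds a single header-less column:
import pandas as pd

# header=None keeps the first row as data instead of using it as column labels
emails = pd.read_excel('emails.xlsx', header=None)[0].tolist()
print(len(emails), emails[:3])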

AttributeError: 'NoneType' object has no attribute 'isnull' [duplicate]

This question already has answers here:
Pandas DataFrame: set_index with inplace=True returns a NoneType, why?
(2 answers)
Understanding inplace=True in pandas
(11 answers)
Closed 4 years ago.
I'm trying to remove the empty rows. But when I try to count the empty lines to see if it worked, I get an error:
AttributeError: 'NoneType' object has no attribute 'isnull'
My script:
import pandas as pd
data = pd.read_csv('data.csv', sep=';')
#print('Table Data\n', data)
data_sum_empty = data.isnull().sum()
#print(data_sum_empty)
data_not_empty = data_sum_empty.dropna(how = 'all', inplace = True)
print(data_not_empty.isnull().sum())
Output:
Traceback (most recent call last):
File ".\data_vis.py", line 12, in
print(data_not_empty.isnull().sum())
AttributeError: 'NoneType' object has no attribute 'isnull'
Some data
flightID DepTime ArrTime ActualElapsedTime AirTime ArrDelay
BBYYEUVY67527 1416.0 1514.0 58.0 39.0 64.0
MUPXAQFN40227 2137.0 37.0 120.0 47.0 52.0
LQLYUIMN79169 730.0 916.0 166.0 143.0 -25.0
KTAMHIFO10843 NaN NaN NaN NaN NaN
BOOXJTEY23623 NaN NaN NaN NaN NaN
Why duplicate???? I did not know the problem was because of the inplace. If I'd known, I would not have asked!
When you call a DataFrame/Series method with inplace=True, the method returns None, so whatever variable you assign its result to is None.
data_sum_empty.dropna(how = 'all', inplace = True)
data_not_empty = data_sum_empty.copy()
print(data_not_empty.isnull().sum())
Or
data_not_empty = data_sum_empty.dropna(how = 'all')
print(data_not_empty.isnull().sum())
Do not reassign the result when you use inplace=True; drop inplace and assign instead:
data_not_empty = data_sum_empty.dropna(how = 'all')
print(data_not_empty.isnull().sum())
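A minimal sketch of the pitfall itself, on a toy Series:
import pandas as pd

s = pd.Series([1.0, None, 3.0])
result = s.dropna(inplace=True)
print(result)  # None -- inplace methods return nothing
print(s)       # the Series itself was modified in place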

Why index name always appears in the parquet file created with pandas?

I am trying to create a parquet file from a pandas DataFrame, and even though I delete the index name, it still appears when I re-read the parquet file. Can anyone help me with this? I want index.name to be set as None.
>>> df = pd.DataFrame({'key': 1}, index=[0])
>>> df
key
0 1
>>> df.to_parquet('test.parquet')
>>> df = pd.read_parquet('test.parquet')
>>> df
key
index
0 1
>>> del df.index.name
>>> df
key
0 1
>>> df.to_parquet('test.parquet')
>>> df = pd.read_parquet('test.parquet')
>>> df
key
index
0 1
Comparing the engines, the index name keeps coming back with fastparquet, but it works as expected using pyarrow:
>>> df = pd.DataFrame({'key': 1}, index=[0])
>>> df.to_parquet('test.parquet', engine='fastparquet')
>>> df = pd.read_parquet('test.parquet')
>>> del df.index.name
>>> df
key
0 1
>>> df.to_parquet('test.parquet', engine='fastparquet')
>>> df = pd.read_parquet('test.parquet')
>>> df
key
index
0 1 ---> INDEX NAME APPEARS EVEN AFTER DELETING USING fastparquet
>>> del df.index.name
>>> df.to_parquet('test.parquet', engine='pyarrow')
>>> df = pd.read_parquet('test.parquet')
>>> df
key
0 1 --> INDEX NAME IS NONE WHEN CONVERSION IS DONE WITH pyarrow
This works with pyarrow, using the following:
df = pd.DataFrame({'key': 1}, index=[0])
df.to_parquet('test.parquet', engine='pyarrow', index=False)
df = pd.read_parquet('test.parquet', engine='pyarrow')
df.head()
As #alexopoulos7 mentioned, the to_parquet documentation states that you can pass the "index" argument. It seems to work, perhaps because I'm explicitly stating engine='pyarrow'.
I have been playing with both libraries, pyarrow and fastparquet, trying to write a parquet file without preserving the index, since I need the data to be read from Redshift as an external table.
For me, what worked with the fastparquet library was:
df.to_parquet(destination_file, engine='fastparquet', compression='gzip', write_index=False)
If you follow the official to_parquet documentation, you will see that it mentions an "index" parameter, but passing an argument that does not exist in the engine you use throws an error. So far I have found that only fastparquet has such an option, and it is named "write_index".
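To summarize the two routes, here is a sketch, assuming pandas >= 0.24 (where to_parquet accepts index=) and both engines installed:
import pandas as pd

df = pd.DataFrame({'key': [1]})

# Route 1: let pandas drop the index, works with either engine
df.to_parquet('test_pyarrow.parquet', engine='pyarrow', index=False)

# Route 2: fastparquet's own keyword, forwarded through **kwargs
df.to_parquet('test_fastparquet.parquet', engine='fastparquet', write_index=False)

print(pd.read_parquet('test_pyarrow.parquet'))  # plain RangeIndex, no stored index name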

Can't seem to use pandas to_csv and read_csv to properly read numpy array

The problem seems to stem from a type issue when I read the csv back in with read_csv and then try to perform operations on the nparray. The following is a minimum working example.
x = np.array([0.83151197,0.00444986])
df = pd.DataFrame({'numpy': [x]})
np.array(df['numpy']).mean()
Out[151]: array([ 0.83151197, 0.00444986])
Which is what I would expect. However, if I write the result to a file and then read the data back into a pandas DataFrame the types are broken.
x = np.array([0.83151197,0.00444986])
df = pd.DataFrame({'numpy': [x]})
df.to_csv('C:/temp/test5.csv')
df5 = pd.read_csv('C:/temp/test5.csv', dtype={'numpy': object})
np.array(df5['numpy']).mean()
TypeError: unsupported operand type(s) for /: 'str' and 'long'
The following is the output of the "df5" object:
df5
Out[186]:
Unnamed: 0 numpy
0 0 [0.83151197 0.00444986]
The following is the file contents:
,numpy
0,[ 0.83151197 0.00444986]
The only way I have figured out how to get this to work is to read the data and manually convert the type, which seems silly and slow.
[float(num) for num in df5['numpy'][0][1:-1].split()]
Is there any way to avoid the above?
pd.DataFrame({'col_name': data}) expects a 1D array-like object as data:
In [63]: pd.DataFrame({'numpy': [0.83151197,0.00444986]})
Out[63]:
numpy
0 0.831512
1 0.004450
In [64]: pd.DataFrame({'numpy': np.array([0.83151197,0.00444986])})
Out[64]:
numpy
0 0.831512
1 0.004450
You've wrapped the numpy array in [], so you passed a list containing one numpy array:
In [65]: pd.DataFrame({'numpy': [np.array([0.83151197,0.00444986])]})
Out[65]:
numpy
0 [0.83151197, 0.00444986]
Replace df = pd.DataFrame({'numpy': [x]}) with df = pd.DataFrame({'numpy': x})
Demo:
In [56]: x = np.array([0.83151197,0.00444986])
...: df = pd.DataFrame({'numpy': x})
# note: x is passed directly, not wrapped as [x]
...: df.to_csv('d:/temp/test5.csv', index=False)
...:
In [57]: df5 = pd.read_csv('d:/temp/test5.csv')
In [58]: df5
Out[58]:
numpy
0 0.831512
1 0.004450
In [59]: df5.dtypes
Out[59]:
numpy float64
dtype: object
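With a flat float64 column, the operation from the question now works after the round trip (the result is the mean of the two stored values):
In [60]: np.array(df5['numpy']).mean()
Out[60]: 0.417980915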

Reading a pickled file, unequal length error in pandas dataframe

I want to read a pickle file in Python 3.5. I am using the following code.
The following is my output; I want to load it as a pandas DataFrame.
When I try to convert it into a pandas DataFrame using df = pd.DataFrame(df), I get the error below.
ValueError: arrays must all be same length
link to data- https://drive.google.com/file/d/1lSFBPLbUCluWfPjzolUZKmD98yelTSXt/view?usp=sharing
I think you need a dict comprehension with concat:
import pickle
import pandas as pd
from pandas.io.json import json_normalize

with open("imdbnames40.pkl", 'rb') as fh:
    d = pickle.load(fh)
df = pd.concat({k: json_normalize(v, 'scores', ['best']) for k, v in d.items()})
print(df.head())
ethnicity score best
'Aina Rapoza 0 Asian 0.89 Asian
1 GreaterAfrican 0.05 Asian
2 GreaterEuropean 0.06 Asian
3 IndianSubContinent 0.11 GreaterEastAsian
4 GreaterEastAsian 0.89 GreaterEastAsian
Then, if you need a column from the first level of the MultiIndex:
df = df.reset_index(level=1, drop=True).rename_axis('names').reset_index()
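For reference, a self-contained sketch of the same pattern on hypothetical data shaped like the pickle (one 'scores' list and one 'best' key per name):
import pandas as pd
from pandas.io.json import json_normalize  # pd.json_normalize in pandas >= 1.0

# hypothetical stand-in for the pickled dict
d = {"'Aina Rapoza": {'scores': [{'ethnicity': 'Asian', 'score': 0.89}],
                      'best': 'Asian'}}
df = pd.concat({k: json_normalize(v, 'scores', ['best']) for k, v in d.items()})
df = df.reset_index(level=1, drop=True).rename_axis('names').reset_index()
print(df)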
