I have a table in Excel with one column that I want to read into a list:
At first I tried it like this:
>>> df = pandas.read_excel('emails.xlsx', sheet_name=None)
>>> df
OrderedDict([('Sheet1', Chadisayed#gmx.com
0 wonderct#mail.ru
1 fcl#fcl-bd.com
2 galina#dorax-investments.com
>>> for k, v in df.items():
... print(type(v), v)
...
<class 'pandas.core.frame.DataFrame'> Chadisayed#gmx.com
0 wonderct#mail.ru
1 fcl#fcl-bd.com
2 galina#dorax-investments.com
>>> df = df.items()[0]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'odict_items' object is not subscriptable
I tried it differently:
>>> df = pandas.read_excel('emails.xlsx', index_col=0)
>>> df
Empty DataFrame
Columns: []
Index: [wonderct#mail.ru, fcl#fcl-bd.com, galina#dorax-investments.com]
[419 rows x 0 columns]
>>> foo = []
>>> for i in df.index:
... foo.append(i)
...
>>> foo
['wonderct#mail.ru', 'fcl#fcl-bd.com', 'galina#dorax-investments.com']
It almost worked, but the first element is missing. What else can I do? Is there really no way to read the Excel file simply line by line?
Try this:
import pandas as pd

df = pd.read_excel('temp.xlsx', header=None)
target_list = list(df[0].values)
Use:
target_list = pandas.read_excel('emails.xlsx', index_col=None, names=['A'])['A'].tolist()
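For what it's worth, the first address goes missing because read_excel treats the first spreadsheet row as the header by default, so that row never makes it into the data; passing header=None (as in the first answer) keeps it. A minimal sketch going straight to a list; the column label 'emails' is just an assumed name for illustration:
import pandas as pd

# header=None: the first row is data, not column names.
emails = pd.read_excel('emails.xlsx', header=None, names=['emails'])['emails'].tolist()
print(len(emails), emails[:3])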
Main question:
When processing data batch by batch, how do you handle a changing schema in pyarrow?
Long story:
As an example, I have the following data:
| col_a | col_b |
|-------|-------|
| 10    | 42    |
| 41    | 21    |
| 'foo' | 11    |
| 'bar' | 99    |
I'm working with Python 3.7 and pandas 1.1.0.
>>> import pandas as pd
>>> df = pd.read_csv('test.csv')
>>> df
col_a col_b
0 10 42
1 41 21
2 foo 11
3 bar 99
>>> df.dtypes
col_a object
col_b int64
dtype: object
>>>
I need to start working with Apache Arrow, using the PyArrow 1.0.1 implementation. In my application, we work batch by batch. This means we only ever see part of the data at a time, and therefore only part of the possible data types.
>>> dfi = pd.read_csv('test.csv', iterator=True, chunksize=2)
>>> dfi
<pandas.io.parsers.TextFileReader object at 0x7fabae915c50>
>>> sub_1 = next(dfi)
>>> sub_1
  col_a  col_b
0    10     42
1    41     21
>>> sub_2 = next(dfi)
>>> sub_2
  col_a  col_b
2   foo     11
3   bar     99
>>> sub_1.dtypes
col_a    int64
col_b    int64
dtype: object
>>> sub_2.dtypes
col_a    object
col_b    int64
dtype: object
>>>
My goal is to persist this whole dataframe in Apache Arrow's Parquet format under the constraint of working batch by batch. That requires filling in the schema correctly. How does one handle dtypes that change across batches?
Here's the full code to reproduce the problem using the above data:
from pyarrow import RecordBatch, RecordBatchFileWriter, RecordBatchFileReader
import pandas as pd
pd.DataFrame([['10', 42], ['41', 21], ['foo', 11], ['bar', 99]], columns=['col_a', 'col_b']).to_csv('test.csv')
dfi = pd.read_csv('test.csv', iterator=True, chunksize=2)
sub_1 = next(dfi)
sub_2 = next(dfi)
# No schema provided here. Pyarrow should infer the schema from the data; the first column is identified as a column of ints.
batch_to_write_1 = RecordBatch.from_pandas(sub_1)
schema = batch_to_write_1.schema
writer = RecordBatchFileWriter('test.parquet', schema)
writer.write(batch_to_write_1)
# We expect to keep the same schema, but that is not true: the schema does not match sub_2's data, so the
# following line raises an exception.
batch_to_write_2 = RecordBatch.from_pandas(sub_2, schema)
# writer.write(batch_to_write_2)  # Never reached: batch_to_write_2 is never defined because the line above raises
We get the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow/table.pxi", line 858, in pyarrow.lib.RecordBatch.from_pandas
File "/mnt/e/miniconda/envs/pandas/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 579, in dataframe_to_arrays
for c, f in zip(columns_to_convert, convert_fields)]
File "/mnt/e/miniconda/envs/pandas/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 579, in <listcomp>
for c, f in zip(columns_to_convert, convert_fields)]
File "/mnt/e/miniconda/envs/pandas/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 559, in convert_column
result = pa.array(col, type=type_, from_pandas=True, safe=safe)
File "pyarrow/array.pxi", line 265, in pyarrow.lib.array
File "pyarrow/array.pxi", line 80, in pyarrow.lib._ndarray_to_array
TypeError: an integer is required (got type str)
This behavior is intended. Some alternatives to try (I believe they should work but I haven't tested all of them):
If you know the final schema up front, construct it by hand in pyarrow instead of relying on the schema inferred from the first record batch (see the sketch after this list).
Go through all the data and compute a final schema. Then reprocess the data with the new schema.
Detect a schema change and recast previous record batches.
Detect the schema change and start a new table (you would then end up with one parquet file per schema, and you would need another process to unify the schemas).
Lastly, if it works for your use case and you are trying to transform CSV data, you might consider using the built-in Arrow CSV parser.
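A minimal sketch of the first alternative, assuming the column names and types from the example above and that everything should land in a single Parquet file. It uses pyarrow.parquet.ParquetWriter rather than RecordBatchFileWriter (the latter writes the Arrow IPC file format, not Parquet), and the astype() cast of each chunk to the declared schema is my addition, not part of the original code:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Recreate the example data; index=False so the CSV holds only the two columns shown above.
pd.DataFrame([['10', 42], ['41', 21], ['foo', 11], ['bar', 99]],
             columns=['col_a', 'col_b']).to_csv('test.csv', index=False)

# The final schema, declared by hand: col_a is a string, col_b an int64.
schema = pa.schema([('col_a', pa.string()), ('col_b', pa.int64())])

writer = pq.ParquetWriter('test.parquet', schema)
for chunk in pd.read_csv('test.csv', iterator=True, chunksize=2):
    # Force the pandas dtypes to match the declared schema, so a chunk whose
    # col_a happened to be inferred as int64 still converts cleanly.
    chunk = chunk.astype({'col_a': str, 'col_b': 'int64'})
    writer.write_table(pa.Table.from_pandas(chunk, schema=schema, preserve_index=False))
writer.close()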
I can't seem to find the answer to my question, so I'm trying my luck here. Would very much appreciate your help.
I've got a Pandas dataframe with values in Col1 and Col2. Instead of the np.nan values in Col2, I'd like to calculate the following: today's Col2 value = previous day's Col2 value multiplied by today's Col1 value.
This should be some form of recursive function. I've tried several answers, including the for loop below, but none seem to work:
df = pd.read_excel('/Users/fhggshgf/Desktop/test.xlsx')
df.index = df.date
df.drop(['date'], axis=1, inplace=True)

for i in range(1, len(df)):
    fill_value = df['Col2'].iloc[i - 1]
    finaldf['Col2'].fillna(fill_value, inplace=True)
You could try something like this.
import pandas as pd
import numpy as np

df = pd.DataFrame({'date': [1, 2, 3, 4, 5, 6],
                   'col_1': [951, 909, 867, 844, 824, 826],
                   'col_2': [179, 170, 164, 159, 153, 149]})

col_2_update_list = []
for i, row in df.iterrows():
    if i != 0:
        today_col_1 = df.at[i, 'col_1']
        prev_day_col_2 = df.at[i - 1, 'col_2']
        new_col_2_val = prev_day_col_2 * today_col_1
        col_2_update_list.append(new_col_2_val)
    else:
        col_2_update_list.append(np.nan)

df['updated_col_2'] = col_2_update_list
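The same previous-day-times-today computation can also be written without an explicit loop using shift(); a minimal sketch with the same column names (like the loop above, it multiplies the original previous col_2, not a recursively updated value):
import pandas as pd

df = pd.DataFrame({'date': [1, 2, 3, 4, 5, 6],
                   'col_1': [951, 909, 867, 844, 824, 826],
                   'col_2': [179, 170, 164, 159, 153, 149]})

# shift(1) is the previous row's col_2; the first row has no predecessor, so it stays NaN.
df['updated_col_2'] = df['col_2'].shift(1) * df['col_1']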
This avoids the use of loops but you need to create 2 new columns:
import pandas as pd
import numpy as np
import sys
df = pd.DataFrame({'date': [1, 2, 3, 4, 5, 6],
                   'col_1': [951, 909, 867, 844, 824, 826],
                   'col_2': [179, np.nan, 164, 159, np.nan, 149]})
print(df)

# col_4: previous known col_2 (forward-filled) times today's col_1
df['col_4'] = df['col_2'].fillna(method='ffill') * df['col_1']
# col_3: original col_2 with NaN replaced by a huge sentinel value
df['col_3'] = df['col_2'].fillna(sys.maxsize)
# Element-wise minimum: existing col_2 values are far smaller than the products, so they
# are kept; where col_2 was NaN the sentinel loses and the product is used instead.
df['col_2'] = df[['col_4', 'col_3']].min(axis=1).astype(int)
df = df.drop(['col_4', 'col_3'], axis=1)
print(df)
I am trying to extract the domain names from the URLs in one column into another column. It works on a plain string, but when I apply it to the DataFrame it doesn't work. How do I apply this to a DataFrame?
Tried:
from urllib.parse import urlparse
import pandas as pd
import numpy as np

id1 = [1, 2, 3]
ls = ['https://google.com/tensoflow', 'https://math.com/some/website', np.NaN]
df = pd.DataFrame({'id': id1, 'url': ls})
df
# urlparse(df['url'])      # ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
# df['url'].map(urlparse)  # AttributeError: 'float' object has no attribute 'decode'
Working on a string:
string = 'https://google.com/tensoflow'
parsed_uri = urlparse(string)
result = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
result
Looking for a column like:
col3
https://google.com/
https://math.com/
nan
Error
You can try something like this.
Here I have used pandas.Series.apply() to solve it.
» Initialization and imports
>>> from urllib.parse import urlparse
>>> import pandas as pd
>>> id1 = [1,2,3]
>>> import numpy as np
>>> ls = ['https://google.com/tensoflow','https://math.com/some/website',np.NaN]
>>> ls
['https://google.com/tensoflow', 'https://math.com/some/website', nan]
>>>
» Create the DataFrame and inspect it.
>>> df = pd.DataFrame({'id':id1,'url':ls})
>>> df
id url
0 1 https://google.com/tensoflow
1 2 https://math.com/some/website
2 3 NaN
>>>
>>> df["url"]
0 https://google.com/tensoflow
1 https://math.com/some/website
2 NaN
Name: url, dtype: object
>>>
» Apply a function using pandas.Series.apply(func) on the url column.
>>> df["url"].apply(lambda url: "{uri.scheme}://{uri.netloc}/".format(uri=urlparse(url)) if not pd.isna(url) else np.nan)
0 https://google.com/
1 https://math.com/
2 NaN
Name: url, dtype: object
>>>
>>> df["url"].apply(lambda url: "{uri.scheme}://{uri.netloc}/".format(uri=urlparse(url)) if not pd.isna(url) else str(np.nan))
0 https://google.com/
1 https://math.com/
2 nan
Name: url, dtype: object
>>>
>>>
» Store the above result in a variable (not mandatory, just to simplify).
>>> s = df["url"].apply(lambda url: "{uri.scheme}://{uri.netloc}/".format(uri=urlparse(url)) if not pd.isna(url) else str(np.nan))
>>> s
0 https://google.com/
1 https://math.com/
2 nan
Name: url, dtype: object
>>>
» Finally
>>> df2 = pd.DataFrame({"col3": s})
>>> df2
col3
0 https://google.com/
1 https://math.com/
2 nan
>>>
» To make sure what s and df2 are, check their types (again, not mandatory).
>>> type(s)
<class 'pandas.core.series.Series'>
>>>
>>>
>>> type(df2)
<class 'pandas.core.frame.DataFrame'>
>>>
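» If the goal is simply a new column on the original DataFrame rather than a separate df2, the same result can be assigned directly; a small sketch continuing the session above:
>>> df['col3'] = df['url'].apply(lambda url: '{uri.scheme}://{uri.netloc}/'.format(uri=urlparse(url)) if not pd.isna(url) else np.nan)
>>> df[['id', 'col3']]
   id                 col3
0   1  https://google.com/
1   2    https://math.com/
2   3                  NaN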
Reference links:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.isnull.html
I am trying to create a Parquet file from a pandas DataFrame, and even though I delete the index name, it still appears when I re-read the Parquet file. Can anyone help me with this? I want index.name to be set to None.
>>> df = pd.DataFrame({'key': 1}, index=[0])
>>> df
key
0 1
>>> df.to_parquet('test.parquet')
>>> df = pd.read_parquet('test.parquet')
>>> df
key
index
0 1
>>> del df.index.name
>>> df
key
0 1
>>> df.to_parquet('test.parquet')
>>> df = pd.read_parquet('test.parquet')
>>> df
key
index
0 1
It works as expected using pyarrow (fastparquet keeps writing the index name, while pyarrow does not):
>>> df = pd.DataFrame({'key': 1}, index=[0])
>>> df.to_parquet('test.parquet', engine='fastparquet')
>>> df = pd.read_parquet('test.parquet')
>>> del df.index.name
>>> df
key
0 1
>>> df.to_parquet('test.parquet', engine='fastparquet')
>>> df = pd.read_parquet('test.parquet')
>>> df
key
index
0 1 ---> INDEX NAME APPEARS EVEN AFTER DELETING USING fastparquet
>>> del df.index.name
>>> df.to_parquet('test.parquet', engine='pyarrow')
>>> df = pd.read_parquet('test.parquet')
>>> df
key
0 1 --> INDEX NAME IS NONE WHEN CONVERSION IS DONE WITH pyarrow
Hey, this works with pyarrow using the following:
df = pd.DataFrame({'key': 1}, index=[0])
df.to_parquet('test.parquet', engine='pyarrow', index=False)
df = pd.read_parquet('test.parquet', engine='pyarrow')
df.head()
As @alexopoulos7 mentioned, the to_parquet documentation states that you can use the "index" argument as a parameter. It seems to work, perhaps because I'm explicitly setting engine='pyarrow'.
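A quick check (a sketch continuing the snippet above) that no index column was written to the file:
df = pd.read_parquet('test.parquet', engine='pyarrow')
print(df.index)        # RangeIndex(start=0, stop=1, step=1)
print(df.index.name)   # None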
I have been playing with both libraries, pyarrow and fastparquet, trying to write a parquet file without preserving the index, since I need the data to be read from Redshift as an external table.
For me, what worked was the fastparquet library:
df.to_parquet(destination_file, engine='fastparquet', compression='gzip', write_index=False)
If you follow the official to_parquet documentation you will see that it mentions the "index" parameter, but this throws an error if the argument does not exist in the engine being used. Currently, I have found that only fastparquet has such an option, and it is named "write_index".
The problem seems to stem from reading the CSV back in with read_csv: I get a type issue when I try to perform operations on the numpy array. The following is a minimal working example.
x = np.array([0.83151197,0.00444986])
df = pd.DataFrame({'numpy': [x]})
np.array(df['numpy']).mean()
Out[151]: array([ 0.83151197, 0.00444986])
Which is what I would expect. However, if I write the result to a file and then read the data back into a pandas DataFrame, the types are broken.
x = np.array([0.83151197,0.00444986])
df = pd.DataFrame({'numpy': [x]})
df.to_csv('C:/temp/test5.csv')
df5 = pd.read_csv('C:/temp/test5.csv', dtype={'numpy': object})
np.array(df5['numpy']).mean()
TypeError: unsupported operand type(s) for /: 'str' and 'long'
The following is the output of the "df5" object:
df5
Out[186]:
Unnamed: 0 numpy
0 0 [0.83151197 0.00444986]
The following is the file contents:
,numpy
0,[ 0.83151197 0.00444986]
The only way I have figured out how to get this to work is to read the data and manually convert the type, which seems silly and slow.
[float(num) for num in df5['numpy'][0][1:-1].split()]
Is there any way to avoid the above?
pd.DataFrame({'col_name': data}) expects a 1D array-like object as data:
In [63]: pd.DataFrame({'numpy': [0.83151197,0.00444986]})
Out[63]:
numpy
0 0.831512
1 0.004450
In [64]: pd.DataFrame({'numpy': np.array([0.83151197,0.00444986])})
Out[64]:
numpy
0 0.831512
1 0.004450
You've wrapped the numpy array in [], so you passed a list of numpy arrays:
In [65]: pd.DataFrame({'numpy': [np.array([0.83151197,0.00444986])]})
Out[65]:
numpy
0 [0.83151197, 0.00444986]
Replace df = pd.DataFrame({'numpy': [x]}) with df = pd.DataFrame({'numpy': x})
Demo:
In [56]: x = np.array([0.83151197,0.00444986])
    ...: df = pd.DataFrame({'numpy': x})   # note: x, not [x]
...: df.to_csv('d:/temp/test5.csv', index=False)
...:
In [57]: df5 = pd.read_csv('d:/temp/test5.csv')
In [58]: df5
Out[58]:
numpy
0 0.831512
1 0.004450
In [59]: df5.dtypes
Out[59]:
numpy float64
dtype: object
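For completeness, with the column stored as plain float64 the operation from the question works directly; a quick check continuing the session above (the rounded value is just the mean of the two numbers in the example):
np.array(df5['numpy']).mean()   # returns a float (about 0.418) instead of raising a TypeError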