I am trying to write a parquet file from a pandas DataFrame, and even though I delete the index name, it still appears when I re-read the parquet file. Can anyone help me with this? I want index.name to be set to None.
>>> df = pd.DataFrame({'key': 1}, index=[0])
>>> df
   key
0    1
>>> df.to_parquet('test.parquet')
>>> df = pd.read_parquet('test.parquet')
>>> df
       key
index
0        1
>>> del df.index.name
>>> df
   key
0    1
>>> df.to_parquet('test.parquet')
>>> df = pd.read_parquet('test.parquet')
>>> df
       key
index
0        1
It works as expected with pyarrow, but not with fastparquet:
>>> df = pd.DataFrame({'key': 1}, index=[0])
>>> df.to_parquet('test.parquet', engine='fastparquet')
>>> df = pd.read_parquet('test.parquet')
>>> del df.index.name
>>> df
   key
0    1
>>> df.to_parquet('test.parquet', engine='fastparquet')
>>> df = pd.read_parquet('test.parquet')
>>> df
       key
index
0        1    ---> INDEX NAME APPEARS EVEN AFTER DELETING IT, WHEN USING fastparquet
>>> del df.index.name
>>> df.to_parquet('test.parquet', engine='pyarrow')
>>> df = pd.read_parquet('test.parquet')
>>> df
   key
0    1    --> INDEX NAME IS NONE WHEN CONVERSION IS DONE WITH pyarrow
This works with pyarrow as follows:
df = pd.DataFrame({'key': 1}, index=[0])
df.to_parquet('test.parquet', engine='pyarrow', index=False)
df = pd.read_parquet('test.parquet', engine='pyarrow')
df.head()
As @alexopoulos7 mentioned, the to_parquet documentation states that you can pass the "index" argument as a parameter. It seems to work, perhaps because I'm explicitly specifying engine='pyarrow'.
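A quick way to confirm the index really was dropped (a small check of my own, assuming a pandas version new enough to have the index argument on to_parquet, and pyarrow installed):
import pandas as pd
df = pd.DataFrame({'key': 1}, index=[0])
df.to_parquet('test.parquet', engine='pyarrow', index=False)
out = pd.read_parquet('test.parquet', engine='pyarrow')
print(out.index.name)  # None
print(out.index)       # RangeIndex(start=0, stop=1, step=1)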
I have been playing with both libraries, pyarrow and fastparquet, trying to write a parquet file without preserving the index, since I need the data to be read from Redshift as an external table.
For me, what worked was the fastparquet library:
df.to_parquet(destination_file, engine='fastparquet', compression='gzip', write_index=False)
If you follow the official to_parquet documentation you will see that it mentions an "index" parameter, but this throws an error if the argument does not exist in the engine being used. Currently, I have found that only fastparquet exposes such an option, and it is named "write_index".
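For what it's worth, on more recent pandas versions (0.24 and later, if I remember correctly) the engine-agnostic index keyword appears to be forwarded to fastparquet as well, so a sketch like the following may work there too, without write_index:
# sketch only, assuming a recent pandas with fastparquet installed
df.to_parquet(destination_file, engine='fastparquet', compression='gzip', index=False)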
I have to iteratively add rows to a pandas DataFrame and find this quite hard to achieve. Also, performance-wise, I'm not sure it is the best approach.
From time to time I get data from a server, and each new dataset from the server becomes a new row in my pandas DataFrame.
import pandas as pd
import datetime
df = pd.DataFrame([], columns=['Timestamp', 'Value'])
# as this df will grow over time, is this a costly copy (df = df.append), or does pandas do some optimization there, or is there a better way to achieve this?
# ignore_index, as I want the index to automatically increment
df = df.append({'Timestamp': datetime.datetime.now()}, ignore_index=True)
print(df)
After one day the DataFrame will be deleted, but during that time a new row of data will probably be added around 100k times.
The goal is still to achieve this in a very efficient way, runtime-wise (memory doesn't matter too much, as enough RAM is present).
I tried this to compare the speed of 'append' with 'loc':
import timeit
code = """
import pandas as pd
df = pd.DataFrame({'A': range(0, 6), 'B': range(0, 6)})
df = df.append({'A': 3, 'B': 4}, ignore_index=True)
"""
code2 = """
import pandas as pd
df = pd.DataFrame({'A': range(0, 6), 'B': range(0, 6)})
df.loc[df.index.max() + 1, :] = [3, 4]
"""
elapsed_time1 = timeit.timeit(code, number=1000) / 1000
elapsed_time2 = timeit.timeit(code2, number=1000) / 1000
print('With "append" :',elapsed_time1)
print('With "loc" :' , elapsed_time2)
On my machine, I obtained these results:
With "append" : 0.001502693824000744
With "loc" : 0.0010836279180002747
Using "loc" seems to be faster.
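A further option, not timed above but often recommended when many rows arrive one at a time, is to collect the rows in a plain Python list and build the DataFrame once at the end. A minimal sketch, reusing the column names from the question (on_new_data is just a placeholder for wherever the server data arrives):
import datetime
import pandas as pd

rows = []  # appending to a plain Python list is cheap

def on_new_data(value):
    # placeholder callback, called whenever the server sends new data
    rows.append({'Timestamp': datetime.datetime.now(), 'Value': value})

# after the collection period, build the DataFrame in one go
df = pd.DataFrame(rows, columns=['Timestamp', 'Value'])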
I have a table in Excel with one column that I want to read into a list.
At first I tried it like this:
>>> df = pandas.read_excel('emails.xlsx', sheet_name=None)
>>> df
OrderedDict([('Sheet1', Chadisayed#gmx.com
0 wonderct#mail.ru
1 fcl#fcl-bd.com
2 galina#dorax-investments.com
>>> for k, v in df.items():
... print(type(v), v)
...
<class 'pandas.core.frame.DataFrame'> Chadisayed#gmx.com
0 wonderct#mail.ru
1 fcl#fcl-bd.com
2 galina#dorax-investments.com
>>> df = df.items()[0]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'odict_items' object is not subscriptable
I tried it differently:
>>> df = pandas.read_excel('emails.xlsx', index_col=0)
>>> df
Empty DataFrame
Columns: []
Index: [wonderct#mail.ru, fcl#fcl-bd.com, galina#dorax-investments.com]
[419 rows x 0 columns]
>>> foo = []
>>> for i in df.index:
... foo.append(i)
...
>>> foo
['wonderct#mail.ru', 'fcl#fcl-bd.com', 'galina#dorax-investments.com']
It almost worked, but the first element is missing. What else can I do? Is there really no way to read the Excel file simply line by line?
Try this:
df = pd.read_excel('temp.xlsx', header=None)
target_list = list(df[0].values)
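The header=None here is what stops the first e-mail address from being consumed as the column name, so it ends up in the list along with the rest.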
Use:
target_list = pandas.read_excel('emails.xlsx', index_col=None, names=['A'])['A'].tolist()
I am using the code snippet below to read a sample file using the Spark context:
>>> textFile = sc.textFile("hdfs:///user/hive/warehouse/sample.txt")
>>> textFile.flatMap(lambda word:word.split(" ")).collect()
Assume this gives output something like the following:
[u'hi', u'there,', u'I', u'am', u'working', u'on', u'something', u'random.']
Now I am using the code snippet below to read the same file as a DataFrame, then converting it to an RDD and applying flatMap as before:
>>> df = spark.read.text("hdfs:///user/hive/warehouse/sample.txt")
>>> df.rdd.flatMap(lambda word:word.split(" ")).collect()
This fails with an AttributeError on split.
I went on to check the datatype of df.rdd and textFile using the code snippet below:
>>> type(df.rdd)
<class 'pyspark.rdd.RDD'>
>>> type(textFile)
<class 'pyspark.rdd.RDD'>
Both are identical.
However, when I check the type of the individual elements of these RDDs using the code snippet below, I observe a difference.
>>> textFile.map(lambda x:type(x)).collect()
[<type 'unicode'>]
>>> df.rdd.map(lambda x:type(x)).collect()
[<class 'pyspark.sql.types.Row'>]
Why is there a discrepancy?
You should convert each Row to a list after you convert the DataFrame to an RDD:
>>> textFile = sc.textFile("hdfs://localhost:8020/test/ali/sample.txt")
>>> textFile.flatMap(lambda word:word.split(" ")).collect()
['hi', 'there,', 'I', 'am', 'working', 'on', 'something', 'random.']
>>>
>>> df = spark.read.text("hdfs://localhost:8020/test/ali/sample.txt")
>>> df.rdd.flatMap(lambda x: list(x)).flatMap(lambda word:word.split(" ")).collect()
['hi', 'there,', 'I', 'am', 'working', 'on', 'something', 'random.']
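As a side note, since spark.read.text produces Rows with a single string column named value, another way to get the same result (just a sketch, not tested against the exact setup above) is to pull that field out of each Row before splitting:
# each element is Row(value=u'...'), so extract the string field first
df.rdd.flatMap(lambda row: row.value.split(" ")).collect()
# equivalently: df.rdd.flatMap(lambda row: row[0].split(" ")).collect()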
The problem seems to stem from reading the CSV back in with read_csv: there is a type issue when I try to perform operations on the NumPy array. The following is a minimal working example.
x = np.array([0.83151197,0.00444986])
df = pd.DataFrame({'numpy': [x]})
np.array(df['numpy']).mean()
Out[151]: array([ 0.83151197, 0.00444986])
Which is what I would expect. However, if I write the result to a file and then read the data back into a pandas DataFrame the types are broken.
x = np.array([0.83151197,0.00444986])
df = pd.DataFrame({'numpy': [x]})
df.to_csv('C:/temp/test5.csv')
df5 = pd.read_csv('C:/temp/test5.csv', dtype={'numpy': object})
np.array(df5['numpy']).mean()
TypeError: unsupported operand type(s) for /: 'str' and 'long'
The following is the output of the df5 object:
df5
Out[186]:
Unnamed: 0 numpy
0 0 [0.83151197 0.00444986]
The file contents are as follows:
,numpy
0,[ 0.83151197 0.00444986]
The only way I have figured out how to get this to work is to read the data and manually convert the type, which seems silly and slow.
[float(num) for num in df5['numpy'][0][1:-1].split()]
Is there any way to avoid the above?
pd.DataFrame({'col_name': data}) expects a 1D array-like object as data:
In [63]: pd.DataFrame({'numpy': [0.83151197,0.00444986]})
Out[63]:
numpy
0 0.831512
1 0.004450
In [64]: pd.DataFrame({'numpy': np.array([0.83151197,0.00444986])})
Out[64]:
numpy
0 0.831512
1 0.004450
You've wrapped the NumPy array in [], so you passed a list of NumPy arrays:
In [65]: pd.DataFrame({'numpy': [np.array([0.83151197,0.00444986])]})
Out[65]:
numpy
0 [0.83151197, 0.00444986]
Replace df = pd.DataFrame({'numpy': [x]}) with df = pd.DataFrame({'numpy': x})
Demo:
In [56]: x = np.array([0.83151197,0.00444986])
...: df = pd.DataFrame({'numpy': x})
# note: x is passed directly, not wrapped in [ ]
...: df.to_csv('d:/temp/test5.csv', index=False)
...:
In [57]: df5 = pd.read_csv('d:/temp/test5.csv')
In [58]: df5
Out[58]:
numpy
0 0.831512
1 0.004450
In [59]: df5.dtypes
Out[59]:
numpy float64
dtype: object
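If, on the other hand, an array per cell is genuinely required, one workaround (my own sketch, not part of the demo above) is to serialize each array explicitly, e.g. as JSON, and convert it back on read:
import json
import numpy as np
import pandas as pd

x = np.array([0.83151197, 0.00444986])
df = pd.DataFrame({'numpy': [json.dumps(x.tolist())]})  # store the array as a JSON string
df.to_csv('test5.csv', index=False)

df5 = pd.read_csv('test5.csv', converters={'numpy': lambda s: np.array(json.loads(s))})
df5['numpy'][0].mean()  # 0.417980915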
I have a large CSV file and don't want to load it fully into memory; I only need to get the column names from it. How can I do this cleanly?
Try this:
pd.read_csv(file_name, nrows=1).columns.tolist()
If you pass nrows=0 to read_csv, it will only load the header row:
In[8]:
import pandas as pd
import io
t="""a,b,c,d
0,1,2,3"""
pd.read_csv(io.StringIO(t), nrows=0)
Out[8]:
Empty DataFrame
Columns: [a, b, c, d]
Index: []
After which accessing attribute .columns will give you the columns:
In[10]:
pd.read_csv(io.StringIO(t), nrows=0).columns
Out[10]: Index(['a', 'b', 'c', 'd'], dtype='object')
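And if a plain Python list is wanted rather than an Index, .tolist() can be chained on (same assumptions as above, with the in-memory t buffer standing in for the real file):
cols = pd.read_csv(io.StringIO(t), nrows=0).columns.tolist()
# ['a', 'b', 'c', 'd']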