How to preserve the datatype 'list' of a data frame while reading from csv or writing to csv - python-3.x

I want to preserve the datatype of a data frame column that holds lists while writing it to a csv file. When I read it back, I need the values to be lists again.
I have tried
pd.read_csv('namesss.csv', dtype={'letters': list})
but it says
dtype <class 'list'> not understood
This is an example:
df = pd.DataFrame({'name': ['jack', 'johnny', 'stokes'],
                   'letters': [['j', 'k'], ['j', 'y'], ['s', 's']]})
print(type(df['letters'][0]))
df
<class 'list'>
     name letters
0    jack  [j, k]
1  johnny  [j, y]
2  stokes  [s, s]
df.to_csv('namesss.csv')
print(type(pd.read_csv('namesss.csv')['letters'][0]))
<class 'str'>

You can use the ast module to turn the strings back into lists:
import ast
df2 = pd.read_csv('namesss.csv')
df2['letters'] = [ast.literal_eval(x) for x in df2['letters']]
In [1]: print(type(df2['letters'][0]))
<class 'list'>
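An equivalent one-pass approach (a sketch, assuming the same namesss.csv written by df.to_csv above) is to let read_csv apply the parser while reading, via its converters argument:
import ast
import pandas as pd

# converters runs ast.literal_eval on each cell of 'letters' during parsing,
# so no second pass over the column is needed
df2 = pd.read_csv('namesss.csv', converters={'letters': ast.literal_eval})
print(type(df2['letters'][0]))  # <class 'list'>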

Related

How to read an Excel table with one column?

I have a table in Excel with one column that I want to read into a list.
At first I tried it like this:
>>> df = pandas.read_excel('emails.xlsx', sheet_name=None)
>>> df
OrderedDict([('Sheet1', Chadisayed#gmx.com
0 wonderct#mail.ru
1 fcl#fcl-bd.com
2 galina#dorax-investments.com
>>> for k, v in df.items():
... print(type(v), v)
...
<class 'pandas.core.frame.DataFrame'> Chadisayed#gmx.com
0 wonderct#mail.ru
1 fcl#fcl-bd.com
2 galina#dorax-investments.com
>>> df = df.items()[0]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'odict_items' object is not subscriptable
I tried it differently:
>>> df = pandas.read_excel('emails.xlsx', index_col=0)
>>> df
Empty DataFrame
Columns: []
Index: [wonderct#mail.ru, fcl#fcl-bd.com, galina#dorax-investments.com]
[419 rows x 0 columns]
>>> foo = []
>>> for i in df.index:
... foo.append(i)
...
>>> foo
['wonderct#mail.ru', 'fcl#fcl-bd.com', 'galina#dorax-investments.com']
It almost worked, but the first element is missing. What else can I do? Is there really no way to read the Excel file simply line by line?
Try this (header=None stops read_excel from treating the first row as column names, which is why your first element went missing):
df = pd.read_excel('temp.xlsx', header=None)
target_list = list(df[0].values)
Use:
target_list = pandas.read_excel('emails.xlsx', index_col=None, names=['A'])['A'].tolist()
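Both answers work for the same reason: with header=None (or explicit names) and without index_col=0, the first row is kept as data instead of being consumed as the header or index. A quick check, assuming the same emails.xlsx from the question:
import pandas as pd

# header=None keeps the first row as data instead of turning it into column labels
emails = pd.read_excel('emails.xlsx', header=None)[0].tolist()
print(len(emails))   # every row, including the one that was previously lost
print(emails[0])     # the address that was missing before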

Spacy - Convert Token type into list

I have a few elements, obtained after performing an operation in spaCy, with the following type.
Input -
li = ['India', 'Australia', 'Brazil']
for i in li:
    print(type(i))
Output:
<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.token.Token'>
I want all the elements in the list to be of str type for iteration.
Expected output -
li = ['India', 'Australia', 'Brazil']
for i in li:
    print(type(i))
Output
<class 'str'>
<class 'str'>
<class 'str'>
Please suggest an optimized way.
A spaCy Token has an attribute called text.
Here's a complete example:
import spacy
nlp = spacy.load('en_core_web_sm')
t = u"India Australia Brazil"
li = nlp(t)
for i in li:
    print(i.text)
or if you want the list of tokens as list of strings:
list_of_strings = [i.text for i in li]
Thanks for the solution and for sharing your knowledge. It works very well to convert a spacy doc/span to a string or list of strings to further use them in string operations.
You can also use this:
for i in li:
    print(str(i))
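For what it's worth, str(token) and token.text should give the same string, so either spelling works for building a plain list of strings (a minimal sketch, assuming the en_core_web_sm model is installed):
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"India Australia Brazil")

as_text = [t.text for t in doc]   # via the .text attribute
as_str = [str(t) for t in doc]    # via str()
print(as_text == as_str)          # True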

AttributeError on split when trying to apply the split method in flatMap after converting a DF to an RDD

I am using the code snippet below to read a sample file using the Spark context:
>>> textFile = sc.textFile("hdfs:///user/hive/warehouse/sample.txt")
>>> textFile.flatMap(lambda word:word.split(" ")).collect()
Assume this gives output something like the following:
[u'hi', u'there,', u'I', u'am', u'working', u'on', u'something', u'random.']
Now I am using the code snippet below to read the same sample file via a DataFrame, then trying to convert it to an RDD and apply flatMap as before:
>>> df = spark.read.text("hdfs:///user/hive/warehouse/sample.txt")
>>> df.rdd.flatMap(lambda word:word.split(" ")).collect()
This fails with an AttributeError on split.
I went on to check the datatype of df.rdd and textFile using the code snippet below:
>>> type(df.rdd)
<class 'pyspark.rdd.RDD'>
>>> type(textFile)
<class 'pyspark.rdd.RDD'>
Both are identical.
Now when I check the type of the individual elements of these RDDs using the code snippet below, I observe a difference.
>>> textFile.map(lambda x:type(x)).collect()
[<type 'unicode'>]
>>> df.rdd.map(lambda x:type(x)).collect()
[<class 'pyspark.sql.types.Row'>]
Why is there a discrepancy?
You should convert each Row to a list after you convert the DataFrame to an RDD:
>>> textFile = sc.textFile("hdfs://localhost:8020/test/ali/sample.txt")
>>> textFile.flatMap(lambda word:word.split(" ")).collect()
['hi', 'there,', 'I', 'am', 'working', 'on', 'something', 'random.']
>>>
>>> df = spark.read.text("hdfs://localhost:8020/test/ali/sample.txt")
>>> df.rdd.flatMap(lambda x: list(x)).flatMap(lambda word:word.split(" ")).collect()
['hi', 'there,', 'I', 'am', 'working', 'on', 'something', 'random.']
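Another option (a sketch, not part of the answer above) is to unpack the Row explicitly. spark.read.text puts each line of the file in a single column named value, so you can pull that string out of the Row before splitting it:
# each element of df.rdd is a Row(value=u'...'); extract the string, then split it
df = spark.read.text("hdfs:///user/hive/warehouse/sample.txt")
words = df.rdd.flatMap(lambda row: row.value.split(" ")).collect()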

Why does the index name always appear in a parquet file created with pandas?

I am trying to create a parquet file from a pandas DataFrame, and even though I delete the index name, it still appears when I re-read the parquet file. Can anyone help me with this? I want index.name to be set to None.
>>> df = pd.DataFrame({'key': 1}, index=[0])
>>> df
key
0 1
>>> df.to_parquet('test.parquet')
>>> df = pd.read_parquet('test.parquet')
>>> df
key
index
0 1
>>> del df.index.name
>>> df
key
0 1
>>> df.to_parquet('test.parquet')
>>> df = pd.read_parquet('test.parquet')
>>> df
key
index
0 1
It works as expected using pyarrow (with fastparquet the index name keeps reappearing):
>>> df = pd.DataFrame({'key': 1}, index=[0])
>>> df.to_parquet('test.parquet', engine='fastparquet')
>>> df = pd.read_parquet('test.parquet')
>>> del df.index.name
>>> df
key
0 1
>>> df.to_parquet('test.parquet', engine='fastparquet')
>>> df = pd.read_parquet('test.parquet')
>>> df
key
index
0 1 ---> INDEX NAME APPEARS EVEN AFTER DELETING USING fastparquet
>>> del df.index.name
>>> df.to_parquet('test.parquet', engine='pyarrow')
>>> df = pd.read_parquet('test.parquet')
>>> df
key
0 1 --> INDEX NAME IS NONE WHEN CONVERSION IS DONE WITH pyarrow
This works with pyarrow with the following:
df = pd.DataFrame({'key': 1}, index=[0])
df.to_parquet('test.parquet', engine='pyarrow', index=False)
df = pd.read_parquet('test.parquet', engine='pyarrow')
df.head()
As #alexopoulos7 mentioned, the to_parquet documentation states that you can use the "index" argument as a parameter. It seems to work, perhaps because I'm explicitly stating engine='pyarrow'.
I have been playing with both libraries, pyarrow and fastparquet, trying to write a parquet file without preserving the index, since I need the data to be read from Redshift as an external table.
For me, what worked was the fastparquet library:
df.to_parquet(destination_file, engine='fastparquet', compression='gzip', write_index=False)
If you follow the official to_parquet documentation you will see that it mentions the "index" parameter, but this throws an error if the argument does not exist in the engine being used. Currently, I have found that only fastparquet has such an option, and it is named "write_index".
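As a side note, del df.index.name can be replaced with the more conventional df.index.name = None. A minimal sketch of the pyarrow round trip without the index, assuming a recent pandas where to_parquet accepts the index argument:
import pandas as pd

df = pd.DataFrame({'key': 1}, index=[0])
df.to_parquet('test.parquet', engine='pyarrow', index=False)  # drop the index on write

df2 = pd.read_parquet('test.parquet', engine='pyarrow')
print(df2.index.name)   # None, so no 'index' label comes back
df2.index.name = None   # also works for clearing a stray name after reading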

Can't seem to use pandas to_csv and read_csv to properly read a numpy array

The problem seems to stem from reading the csv back in with read_csv: I get a type issue when I try to perform operations on the ndarray. The following is a minimal working example.
x = np.array([0.83151197,0.00444986])
df = pd.DataFrame({'numpy': [x]})
np.array(df['numpy']).mean()
Out[151]: array([ 0.83151197, 0.00444986])
Which is what I would expect. However, if I write the result to a file and then read the data back into a pandas DataFrame, the types are broken.
x = np.array([0.83151197,0.00444986])
df = pd.DataFrame({'numpy': [x]})
df.to_csv('C:/temp/test5.csv')
df5 = pd.read_csv('C:/temp/test5.csv', dtype={'numpy': object})
np.array(df5['numpy']).mean()
TypeError: unsupported operand type(s) for /: 'str' and 'long'
The following is the output of the df5 object:
df5
Out[186]:
Unnamed: 0 numpy
0 0 [0.83151197 0.00444986]
These are the file contents:
,numpy
0,[ 0.83151197 0.00444986]
The only way I have figured out how to get this to work is to read the data and manually convert the type, which seems silly and slow.
[float(num) for num in df5['numpy'][0][1:-1].split()]
Is there any way to avoid the above?
pd.DataFrame({'col_name': data}) expects a 1D array-like object as data:
In [63]: pd.DataFrame({'numpy': [0.83151197,0.00444986]})
Out[63]:
numpy
0 0.831512
1 0.004450
In [64]: pd.DataFrame({'numpy': np.array([0.83151197,0.00444986])})
Out[64]:
numpy
0 0.831512
1 0.004450
You've wrapped the numpy array with [], so you passed a list of numpy arrays:
In [65]: pd.DataFrame({'numpy': [np.array([0.83151197,0.00444986])]})
Out[65]:
numpy
0 [0.83151197, 0.00444986]
Replace df = pd.DataFrame({'numpy': [x]}) with df = pd.DataFrame({'numpy': x})
Demo:
In [56]: x = np.array([0.83151197,0.00444986])
...: df = pd.DataFrame({'numpy': x})
# ^ ^
...: df.to_csv('d:/temp/test5.csv', index=False)
...:
In [57]: df5 = pd.read_csv('d:/temp/test5.csv')
In [58]: df5
Out[58]:
numpy
0 0.831512
1 0.004450
In [59]: df5.dtypes
Out[59]:
numpy float64
dtype: object
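If you really do need a whole array inside a single cell and want it to survive the CSV round trip, one possible workaround (a sketch, not part of the answer above) is to serialize each array as JSON on write and parse it back with a converter on read:
import json
import numpy as np
import pandas as pd

x = np.array([0.83151197, 0.00444986])

# store the array as a JSON string in a single cell
df = pd.DataFrame({'numpy': [json.dumps(x.tolist())]})
df.to_csv('test5.csv', index=False)

# parse the JSON back into an ndarray while reading
df5 = pd.read_csv('test5.csv', converters={'numpy': lambda s: np.array(json.loads(s))})
print(df5['numpy'][0].mean())  # ~0.41798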
