Here is my code:
#import modules
import pandas as pd
# read the data file into a DataFrame (a raw string avoids backslash-escape issues)
infile = pd.read_csv(r'..\Infiles\StarWars_Data.txt')
# get user input for selecting a Series name
Series_name = input("Please enter the name of one of the series for closer inspection: ")
#Select the Series
Series_Data = infile[Series_name]
# Chain the value_counts(), describe() and to_csv() methods
Series_Data.value_counts()\
    .describe()\
    .to_csv(r'..\Outfiles\StarWars_Results.txt')
I expect it to perform value_counts() (which returns the counts of the unique values in a Series), describe() (which gives summary statistics on a Series), and to_csv() (which writes the result to a specified CSV file).
For some reason to_csv() is writing the output of describe() but not of value_counts(). How do I write the data from both value_counts() and describe() to the same file?
IIUC: in your chain, describe() runs on the result of value_counts(), so only the describe() output ever reaches to_csv(). Do you want this?
pd.concat([df['words'].value_counts(),
           df['words'].describe()])\
  .to_csv(r'..\Outfiles\StarWars_Results.txt')
MCVE:
s = pd.Series([*'AABBBBCDDEEEEEEE'])
pd.concat([s.value_counts(), s.describe()]).rename_axis('key').to_csv('a.txt')
!type a.txt
Output:
key,0
E,7
B,4
A,2
D,2
C,1
count,16
unique,5
top,E
freq,7
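Applied back to the variables in your question (a sketch, assuming the same paths and column selection as above):
import pandas as pd

infile = pd.read_csv(r'..\Infiles\StarWars_Data.txt')
Series_name = input("Please enter the name of one of the series for closer inspection: ")
Series_Data = infile[Series_name]

# Concatenate both results, then write them out in a single call.
pd.concat([Series_Data.value_counts(), Series_Data.describe()])\
    .rename_axis('key')\
    .to_csv(r'..\Outfiles\StarWars_Results.txt')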
Question: how do I select rows (pseudocode) where columns['Name'] == 'Name_A' ('Name_A' is just an example) and columns['Time'] is in (2021-11-21 00:00:00, 2021-11-22 00:00:00)?
I have stored about 4 billion rows of data in an HDF5 file.
Now I want to select some of that data.
My code looks like this:
import pandas as pd
ss = pd.HDFStore("xh_data_L9.hdf5") #<class 'pandas.io.pytables.HDFStore'>
print(type(ss))
print(ss.keys())
s_1 = ss.select('alldata',start=0,stop=500) # data example
print(s_1)
ss.close()
I found that HDFStore.select has this signature:
HDFStore.select(key, where=None, start=None, stop=None, columns=None, iterator=False, chunksize=None, auto_close=False)
# Neither of these attempts runs successfully:
s_3 = ss.select('alldata', where="Time>2021-11-21 00:00:00 & Time<2021-11-22 00:00:00)")
s_3 = ss.select('alldata', ['Name'] == 'Name_A')
I have googled some methods, but I don't know how to use "where".
I found the cause: it depends on whether data_columns was set when the file was created.
# An HDF5 file created this way does not have data_columns:
ss.append('store', df_temp, index=True)
# An HDF5 file created this way does have data_columns:
store.append("store", df_temp, format="table", data_columns=True)
# Check whether the file includes data_columns:
import pandas as pd
ss = pd.HDFStore("store.hdf5")
print(ss.info())
If the result includes "dc->[Time,Name,Value]", the columns can be queried:
ss.select("store", where="Name='Name_A'")
# Single quotation marks are required around string values.
The following is the official documentation for data_columns:
data_columns : list of columns, or True, default None
    List of columns to create as indexed data columns for on-disk queries, or True to use all columns. By default only the axes of the object are indexed.
See here: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#query-via-data-columns
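Putting it together for the original question, here is a minimal sketch; the df_temp below is hypothetical sample data standing in for the real 4-billion-row frame, and the key 'alldata' matches the question:
import pandas as pd

# Hypothetical sample data with the Name/Time/Value columns from above.
df_temp = pd.DataFrame({
    "Name": ["Name_A", "Name_B"],
    "Time": pd.to_datetime(["2021-11-21 12:00:00", "2021-11-23 08:00:00"]),
    "Value": [1.0, 2.0],
})

# Write with data_columns=True so Name and Time are queryable on disk.
with pd.HDFStore("xh_data_L9.hdf5") as store:
    store.append("alldata", df_temp, format="table", data_columns=True)

# Select by name and time range; quote string and timestamp literals.
with pd.HDFStore("xh_data_L9.hdf5") as ss:
    s_3 = ss.select(
        "alldata",
        where="Name == 'Name_A' & Time >= '2021-11-21 00:00:00' & Time < '2021-11-22 00:00:00'",
    )
    print(s_3)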
I am a bit new to Dask. I have a large CSV file and a large list; the CSV has as many rows as the list has elements. I am trying to create a new column in the Dask dataframe from the list. In pandas this is pretty straightforward, but in Dask I am having a hard time creating the new column. I am avoiding pandas because my data is 15GB+.
Please see my attempts below.
CSV data:
name,text,address
john,some text here,MD
tim,some text here too,WA
Code tried
import dask.dataframe as dd
import numpy as np
ls = ['one','two']
ddf = dd.read_csv('../data/test.csv')
ddf.head()
Try #1:
ddf['new'] = ls # TypeError: Column assignment doesn't support type list
Try #2: What should be passed here for condlist?
ddf['new'] = np.select(choicelist=ls) # TypeError: _select_dispatcher() missing 1 required positional argument: 'condlist'
Looking for this output:
name text address new
0 john some text here MD one
1 tim some text here too WA two
Try creating a dask array and then assigning it, like this:
# Original suggestion:
# ls = dd.from_array(np.array(['one', 'two']))
# ddf['new'] = ls

# As tested by the OP:
import dask.array as da
ls = da.array(['one', 'two'])
ddf['new'] = ls
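A minimal end-to-end sketch, assuming the two-row test.csv from the question; the array's length must match the dataframe's total row count, and its chunks must line up with the dataframe's partitions:
import dask.dataframe as dd
import dask.array as da

ddf = dd.read_csv('../data/test.csv')

# One label per CSV row, as in the desired output.
ls = da.array(['one', 'two'])
ddf['new'] = ls

print(ddf.compute())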
I have an Excel sheet with ISIN and URL as column headers, as shown below:
ISIN     URL
ISIN1    https://mylink3.pdf
ISIN2    https://mylink2.pdf
I need to create a dictionary of the values of this sheet and have used the below code:
import pandas as pd
my_dic = pd.read_excel('PDFDwn.xlsx', index_col=0).to_dict()
print(my_dic)
The output that I receive is as below.
{'URL': {'ISIN1': 'https://mylink3.pdf', 'ISIN2': 'https://mylink2.pdf'}}
whereas the expected output should be as below, without the 'URL' layer:
{'ISIN1': 'https://mylink3.pdf', 'ISIN2': 'https://mylink2.pdf'}
Try this:
print(df.set_index('ISIN')['URL'].to_dict())
Output:
{'ISIN2': 'https://mylink2.pdf', 'ISIN1': 'https://mylink3.pdf'}
Applied to your example (reading without index_col=0, so that 'ISIN' remains a regular column for set_index to use):
my_dic = pd.read_excel('PDFDwn.xlsx').set_index('ISIN')['URL'].to_dict()
Simple solution: your current result simply nests everything under the 'URL' key, so take that entry:
my_dic['URL']
Output:
{'ISIN1': 'https://mylink3.pdf', 'ISIN2': 'https://mylink2.pdf'}
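A compact end-to-end version of that simple variant, assuming the same PDFDwn.xlsx layout as in the question:
import pandas as pd

nested = pd.read_excel('PDFDwn.xlsx', index_col=0).to_dict()
my_dic = nested['URL']  # unwrap the column-level key
print(my_dic)  # {'ISIN1': 'https://mylink3.pdf', 'ISIN2': 'https://mylink2.pdf'}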
Here is my problem:
I have a csv file containing articles data set with columns: ID, CATEGORY, TITLE, BODY.
In python, I read the file to a pandas data frame like this:
import pandas as pd
df = pd.read_csv('my_file.csv')
Now I need to transform this df somehow into a corpus object, let's call it my_corpus. But how exactly can I do it? I assume I need to use something like:
from nltk.corpus.reader import CategorizedCorpusReader
my_corpus = some_nltk_function(df) # <- what is the function?
At the end I can use NLTK methods to analyze the corpus. For example:
import nltk
my_corpus.fileids() # <- I expect values from column ID
my_corpus.categories() # <- I expect values from column CATEGORY
my_corpus.words(categories='cat_A') # <- I expect values from column TITLE and BODY
my_corpus.sents(categories=['cat_A', 'cat_B', 'cat_C']) # <- I expect values from column TITLE and BODY
Please advise.
I guess you need to do two things.
First, you need to convert each row of your dataframe df into a corpus file. The following function should do it for you:
def CreateCorpusFromDataFrame(corpusfolder, df):
    for index, r in df.iterrows():
        id = r['ID']
        title = r['TITLE']
        body = r['BODY']
        category = r['CATEGORY']
        fname = str(category) + '_' + str(id) + '.txt'
        # One file per row, named category_id.txt so the category
        # can be recovered from the file name later.
        with open(corpusfolder + '/' + fname, 'a') as corpusfile:
            corpusfile.write(str(body) + " " + str(title))

CreateCorpusFromDataFrame('yourcorpusfolder', df)
Second, you need to read the files from yourcorpusfolder and then do the NLTK processing you require:
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
my_corpus = CategorizedPlaintextCorpusReader('yourcorpusfolder/',
                                             r'.*', cat_pattern=r'(.*)_.*')
my_corpus.fileids() # <- I expect values from column ID
my_corpus.categories() # <- I expect values from column CATEGORY
my_corpus.words(categories='cat_A') # <- I expect values from column TITLE and BODY
my_corpus.sents(categories=['cat_A', 'cat_B']) # <- I expect values from column TITLE and BODY
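One assumption worth noting: sents() relies on NLTK's default Punkt sentence tokenizer, so you may need a one-time download before calling it:
import nltk
nltk.download('punkt')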
Some helpful references :
https://groups.google.com/forum/#!topic/nltk-users/YFCKjHbpUkY
Need to set categorized corpus reader in NLTK and Python, corpus texts in one file, one text per line
I am a Python beginner. I am trying to write multiple lists into separate columns in a CSV file.
In my CSV file, I would like to have:
2.9732676520000001 0.0015852047556142669 1854.1560636319559
4.0732676520000002 0.61902245706737125 2540.1258143280334
4.4032676520000003 1.0 2745.9167395368572
Following is the code that I wrote.
df = [2.9732676520000001, 4.0732676520000002, 4.4032676520000003]
CS = [1854.1560636319559, 2540.1258143280334, 2745.9167395368572]
int_peak = [0.0015852047556142669, 0.61902245706737125, 1.0]

with open('output/' + file, 'w') as f:
    for dt, int_norm, CSs in zip(df, int_peak, CS):
        f.write('{0:f},{1:f},{2:f}\n'.format(dt, int_norm, CSs))
This isn't running properly; I'm getting the error message "non-empty format string passed to object.__format__". I'm having a hard time catching what is going wrong. Could anyone spot what's wrong with my code?
You are better off using pandas:
import pandas as pd

df = [2.9732676520000001, 4.0732676520000002, 4.4032676520000003]
CS = [1854.1560636319559, 2540.1258143280334, 2745.9167395368572]
int_peak = [0.0015852047556142669, 0.61902245706737125, 1.0]
file_name = "your_file_name.csv"

# pandas can convert a list of lists to a DataFrame;
# each inner list becomes a row, so the result is transposed
# to get the desired column layout.
df = pd.DataFrame([df, int_peak, CS])
df = df.transpose()

# Write the data to "output/" + file_name without the DataFrame's
# index and without a header; fixed-point float formatting matches
# the desired output.
df.to_csv("output/" + file_name, index=False, header=False, float_format='%f')
The output CSV looks like this:
2.973268,0.001585,1854.156064
4.073268,0.619022,2540.125814
4.403268,1.000000,2745.916740
However, to fix your own code, you need to use a file name variable other than file. I changed that in your code as follows:
df = [2.9732676520000001, 4.0732676520000002, 4.4032676520000003]
CS = [1854.1560636319559, 2540.1258143280334, 2745.9167395368572]
int_peak = [0.0015852047556142669, 0.61902245706737125, 1.0]
file_name = "your_file_name.csv"

with open('/tmp/' + file_name, 'w') as f:
    for dt, int_norm, CSs in zip(df, int_peak, CS):
        f.write('{0:f},{1:f},{2:f}\n'.format(dt, int_norm, CSs))
and it works. The output is as follows:
2.973268,0.001585,1854.156064
4.073268,0.619022,2540.125814
4.403268,1.000000,2745.916740
If you need to write only a few selected columns to the CSV, use the columns parameter (here 'Name' and 'ID' stand in for your own column labels):
csv_data = df.to_csv(columns=['Name', 'ID'])
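For instance, with the transposed DataFrame built above, whose columns are labelled 0, 1 and 2, a sketch writing only the first and third columns would be:
# Keep only the first and third columns in the output file.
df.to_csv("output/" + file_name, columns=[0, 2], index=False, header=False)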