Splitting the rows of a DataFrame using pyspark - apache-spark

Code:
import os.path
file_name = os.path.join('databricks-datasets', 'cs190', 'data-001', 'millionsong.txt')
raw_data_df = sqlContext.read.load(file_name, 'text')
sample_points = raw_data_df.take(5)
print(sample_points)
Example output:
[Row(value='1,2,3'), Row(value='4,5,6')]
From this output, I wanted to parse each row of the DataFrame into individual elements using Spark's select and split methods.
For example, split "1,2,3" into ['1','2','3']
Code:
raw_data_df.select((explode(split(raw_data_df.value,"\s+"))))
But the code doesn't seem to work as expected; any suggestions would be helpful.

Try this:
raw_data_df.rdd.map(lambda l: l[0].split(","))
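Since the question asks specifically about select and split, here is a minimal sketch that stays in the DataFrame API, assuming the single text column is named value (the default for Spark's text source). Note that the original pattern "\s+" splits on whitespace, while the data is comma-separated, which is likely why the rows came back unsplit:

from pyspark.sql.functions import split, explode

# Split each comma-separated line into an array column: ['1', '2', '3']
split_df = raw_data_df.select(split(raw_data_df.value, ",").alias("elements"))

# Or flatten every element into its own row
exploded_df = raw_data_df.select(explode(split(raw_data_df.value, ",")).alias("element"))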

Related

Using loops to call multiple pandas dataframe columns

I am new to python3 and trying to do chi-squared tests on columns in a pandas dataframe. My columns come in pairs: observed_count_column_1, expected_count_column_1, observed_count_column_2, expected_count_column_2, and so on. I would like to write a loop that handles all column pairs at once.
I succeed in doing this if I specify the column index integers or column names manually.
This works:
from scipy.stats import chisquare
import pandas as pd
df = pd.read_csv(r'count.csv')
chisquare(df.iloc[:, [0]], df.iloc[:, [1]])
This, trying with a loop, does not:
from scipy.stats import chisquare
import pandas as pd
df = pd.read_csv(r'count.csv')
for n in [0, 2, 4, 6, 8, 10]:
    chisquare(df.iloc[:, [n]], df.iloc[:, [n+1]])
The loop does not seem to run at all: I get no error, but no output either.
I was wondering why this happens and how I can approach it?
Thank you,
Dan
Consider building a data frame of chi-square results from a list of tuples, then assigning column names as indicators for observed and expected frequencies (subsetting even/odd columns by index notation):
# CREATE DATA FRAME FROM LIST OF TUPLES
# THEN ASSIGN COLUMN NAMES
chi_square_df = (pd.DataFrame([chisquare(df.iloc[:, [n]], df.iloc[:, [n+1]])
                               for n in range(0, 11, 2)],
                              columns=['chi_sq_stat', 'p_value'])
                   .assign(obs_freq=df.columns[::2],
                           exp_freq=df.columns[1::2])
                )
The chisquare() function returns two values, so you can try this:
for n in range(0, 11, 2):
    chisq, p = chisquare(df.iloc[:, [n]], df.iloc[:, [n+1]])
    print('Chisq: {}, p-value: {}'.format(chisq, p))
You can find what it returns in the docs: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html
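One detail worth noting (an observation added here, not part of the original answers): df.iloc[:, [n]] selects a one-column DataFrame, so chisquare returns length-1 arrays for the statistic and p-value, while df.iloc[:, n] selects a Series and yields plain scalars:

# 2-D input: statistic and p-value come back as length-1 arrays
chisq, p = chisquare(df.iloc[:, [0]], df.iloc[:, [1]])

# 1-D input: statistic and p-value come back as plain floats
chisq, p = chisquare(df.iloc[:, 0], df.iloc[:, 1])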
Thank you for the suggestions. Using the information from Parfait's comment that loops don't print their results, I managed to find a solution, although not as elegant as their own solution above.
for n in range(0, 11, 2):
    print(chisquare(df.iloc[:, [n]], df.iloc[:, [n+1]]))
This gives the expected results.
Dan

How to read a big CSV file and read each column separately?

I have a big CSV file that I read as a dataframe.
But I cannot figure out how to read it into separate columns.
I tried sep = '\' but it gives me an error.
I read my file with this code:
filename = "D:\\DLR DATA\\status maritime\\nperf-data_2019-01-01_2019-09-10_bC5HIhAfJWqdQckz.csv"
#Dataframes implement the Pandas API
import dask.dataframe as dd
df = dd.read_csv(filename)
df1 = df.head()
When I print the dataframe's head, all the data appears in a single column:
In the variable explorer my dataframe consists of only one column with all the data inside:
I tried setting sep to a space and to a comma, but neither worked.
How can I read the data into its proper separate columns?
I would appreciate any help.
You have a tab-separated file; use \t:
df = dd.read_csv(filename, sep='\t')
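If you are unsure which separator a file uses, a quick generic check (a sketch, not part of the original answer) is to look at the raw first line before handing the file to dask:

# Peek at the first line; a literal '\t' between fields means tab-separated
with open(filename) as f:
    print(repr(f.readline()))

import dask.dataframe as dd
df = dd.read_csv(filename, sep='\t')
print(df.head())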

Why is read_sas converting strings to float?

I am trying to read a .sas7bdat file using pandas, and I am having a hard time because pandas converts string values that look like a number into float.
For example, if I have a telephone number like '348386789' and I read it with the following code:
import pandas as pd
df = pd.read_sas('test.sas7bdat', format='sas7bdat', encoding='utf-8')
The output would be 348386789.0!
I can convert every single column with something like df['number'].astype(int).astype(str), but this would be very inefficient.
The read_csv function has the same problem, but there you can use the dtype argument, which sets the type for the required column (e.g. dtype={'number': str}).
Is there a better way to read values in the desired format and use it in a dataframe?
UPDATE
I even tried sas7bdat.py and pyreadstat, with the same results. You might say that the problem is in the data, but when I use an online tool to read the sas7bdat file, the data looks correct.
Code for the other two libraries:
# pyreadstat module
import pyreadstat
df2, meta = pyreadstat.read_sas7bdat('test.sas7bdat')
# sas7bdat module
from sas7bdat import SAS7BDAT
reader = SAS7BDAT('test.sas7bdat')
df_sas = reader.to_data_frame()
If you want to try (and you have a SAS license), you can create a .sas7bdat file with the following content:
column_1,column_2,column_3
11,20190129,5434
19,20190228,5236
59,20190328,10448
76,20190129,5434
Use sas7bdat.py instead. That typically preserves the dataset formats better.
If a particular column is defined as character in the SAS dataset, then sas7bdat will read it as a string regardless of what the contents look like. As a lazy example, I created this dataset in SAS:
data test;
    id = '1111111'; val = 1; output;
    id = '2222222'; val = 2; output;
run;
And then ran the following Python code on it:
from sas7bdat import SAS7BDAT

reader = SAS7BDAT('test.sas7bdat')
df = reader.to_data_frame()
print(df)

cols = reader.columns
for col in cols:
    print(str(col.name) + " " + str(col.type))
Here is what I see:
id val
0 1111111 1.0
1 2222222 2.0
b'id' string
b'val' number
If you are looking to 'intelligently' convert numbers to strings based on context, then you may need to look elsewhere. Any SAS dataset reader is just going to read based on the format specified within the dataset at best.
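If the columns really are stored as numeric in the SAS file, a workaround (a sketch under that assumption; pandas.read_sas has no dtype argument to do this at read time) is to convert the affected columns back to strings in one pass after reading:

import pandas as pd

df = pd.read_sas('test.sas7bdat', format='sas7bdat', encoding='utf-8')

# Hypothetical list of columns that should be treated as strings
string_cols = ['number']
for col in string_cols:
    # The nullable Int64 dtype keeps missing values intact before the string conversion
    df[col] = df[col].astype('Int64').astype(str)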

Python Pandas dataframe, how to integrate new columns into a new csv

Guys, I need a bit of help with Pandas and would greatly appreciate your input.
My original file looks like this:
I would like to convert it by merging some pairs of columns (generating their averages), returning a new file that looks like this:
Also, if possible, I would like to split the column 'RateDateTime' into two columns: one containing the date, the other containing only the time. How should I do it? I tried the code below, but it doesn't work:
import pandas as pd
dateparse = lambda x: pd.datetime.strptime(x, '%Y/%m/%d %H:%M:%S')
df = pd.read_csv('data.csv', parse_dates=['RateDateTime'], index_col='RateDateTime',date_parser=dateparse)
a=pd.to_numeric(df['RateAsk_open'])
b=pd.to_numeric(df['RateAsk_high'])
c=pd.to_numeric(df['RateAsk_low'])
d=pd.to_numeric(df['RateAsk_close'])
e=pd.to_numeric(df['RateBid_open'])
f=pd.to_numeric(df['RateBid_high'])
g=pd.to_numeric(df['RateBid_low'])
h=pd.to_numeric(df['RateBid_close'])
df['Open'] = (a+e) /2
df['High'] = (b+f) /2
df['Low'] = (c+g) /2
df['Close'] = (d+h) /2
grouped = df.groupby('CurrencyPair')
Open=grouped['Open']
High=grouped['High']
Low=grouped['Low']
Close=grouped['Close']
w=pd.concat([Open, High,Low,Close], axis=1, keys=['Open', 'High','Low','Close'])
w.to_csv('w.csv')
Python returns:
TypeError: cannot concatenate object of type "<class 'pandas.core.groupby.groupby.SeriesGroupBy'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
Can someone help me please? Many thanks!!!
IIUYC, you don't need grouping here. You can simply update the existing dataframe with new columns and specify which columns to save to the CSV file in the to_csv method. Here is an example:
df['Open'] = df[['RateAsk_open', 'RateBid_open']].mean(axis=1)
df['RateDate'] = df['RateDateTime'].dt.date
df['RateTime'] = df['RateDateTime'].dt.time
df.to_csv('w.csv', columns=['CurrencyPair', 'Open', 'RateDate', 'RateTime'])
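The same pattern extends to the other pairs; a sketch, assuming the remaining columns follow the RateAsk_/RateBid_ naming from the question:

# Average each ask/bid pair into Open, High, Low, Close
for name in ['open', 'high', 'low', 'close']:
    df[name.capitalize()] = df[[f'RateAsk_{name}', f'RateBid_{name}']].mean(axis=1)

df.to_csv('w.csv', columns=['CurrencyPair', 'Open', 'High', 'Low', 'Close',
                            'RateDate', 'RateTime'])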

Python 3: appending to an empty DataFrame

Fairly new to coding. I have looked at a couple of other similar questions about appending DataFrames in python but could not solve the problem.
I have the data below (CSV) in an Excel xlsx file:
Venue Name,Cost ,Restriction,Capacity
Cinema,5,over 13,50
Bar,10,over 18,50
Restaurant,15,no restriction,25
Hotel,7,no restriction,100
I am using the code below to try to "filter" out rows which have "no restriction" under the Restriction column. The code seems to work right through to the last line, i.e. both print statements give me what I would expect.
import pandas as pd
import numpy as np

my_file = pd.ExcelFile("venue data.xlsx")
mydata = my_file.parse(0, index_col=None, na_values=["NA"])
my_new_file = pd.DataFrame()
for index in mydata.index:
    if "no restriction" in mydata.Restriction[index]:
        print(mydata.Restriction[index])
        print(mydata.loc[index:index])
        my_new_file.append(mydata.loc[index:index], ignore_index=True)
Don't loop through dataframes. It's almost never necessary.
Use:
df2 = df[df['Restriction'] != 'no restriction']
Or
df2 = df.query("Restriction != 'no restriction'")
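Two side notes, not from the original answer: the loop in the question fails because DataFrame.append returns a new DataFrame instead of modifying my_new_file in place, so its result is discarded on every iteration; and since the loop used the in operator (substring matching), the closer vectorized equivalent is str.contains:

# Keep rows whose Restriction contains the phrase, matching the loop's `in` check
df2 = df[df['Restriction'].str.contains('no restriction')]

# Or drop them, as in the answer above
df2 = df[~df['Restriction'].str.contains('no restriction')]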
