How to separate column values into multiple rows & multiple columns with Python - python-3.x

I have a csv file with columns like this
I need to separate column (B) values into separate columns and multiple rows like this
This is what I tried (the data in the code below is the same as the csv data above) and it did not work:
data = [{"latlong":'{lat: 15.85173248 , lng: 78.6216129},{lat: 15.85161765 , lng: 78.61982138},{lat: 15.85246304 , lng: 78.62031075},{lat: 15.85250474 , lng: 78.62034441},{lat: 15.85221891 , lng: 78.62174507},', "Id": 1},
{"latlong": '{lat: 15.8523723 , lng: 78.62177758},{lat: 15.85236637 , lng: 78.62179098},{lat: 15.85231281 , lng: 78.62238316},{lat: 15.8501259 , lng: 78.62201676},', "Id":2}]
df = pd.DataFrame(data)
df
df.latlong.apply(pd.Series)
This works in this case
data1 = [{'latlong': [15.85173248, 78.6216129, 1]}, {'latlong': [15.85161765, 78.61982138, 1]}, {'latlong': [15.85246304, 78.62031075, 1]},
         {'latlong': [15.85250474, 78.62034441, 1]}, {'latlong': [15.85221891, 78.62174507, 1]}, {'latlong': [15.8523723, 78.62177758, 2]},
         {'latlong': [15.85236637, 78.62179098, 2]}, {'latlong': [15.85231281, 78.62238316, 2]}, {'latlong': [15.8501259, 78.62201676, 2]}]
df1 = pd.DataFrame(data1)
df1
df1 = df1['latlong'].apply(pd.Series)
df1.columns = ['lat', 'long', 'Id']
df1
How can I achieve this with Python ?
I'm new to Python. I tried the following links but could not understand how to apply them to my case:
Splitting dictionary/list inside a Pandas Column into Separate Columns
python split data frame columns into multiple rows

Your data is in a very strange format: the entries of latlong aren't actually valid JSON (there is a trailing comma at the end, and there are no quotes around the field names), so I would use a regular expression (the re module) to split out the columns and a list comprehension to split out the rows:
In [39]: pd.DataFrame(
             [{'Id': r['Id'], 'lat': lat, 'long': long}
              for r in data
              for lat, long in re.findall(r"lat: ([\d.]+).*?lng: ([\d.]+)",
                                          r['latlong'])])
Out[39]:
Id lat long
0 1 15.85173248 78.6216129
1 1 15.85161765 78.61982138
2 1 15.85246304 78.62031075
3 1 15.85250474 78.62034441
4 1 15.85221891 78.62174507
5 2 15.8523723 78.62177758
6 2 15.85236637 78.62179098
7 2 15.85231281 78.62238316
8 2 15.8501259 78.62201676
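If you prefer to stay inside pandas, a roughly equivalent sketch uses Series.str.extractall (assuming the same data list as above; the column names lat and long are my own choice):

import pandas as pd

df = pd.DataFrame(data)

# one row per lat/lng pair, indexed by (original row, match number)
coords = df['latlong'].str.extractall(r'lat: ([\d.]+)\s*,\s*lng: ([\d.]+)')
coords.columns = ['lat', 'long']

# drop the match level, bring the Id back in, and convert to numbers
out = coords.droplevel('match').join(df['Id']).reset_index(drop=True)
out[['lat', 'long']] = out[['lat', 'long']].astype(float)
print(out)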

Related

Glue/Spark: Filter a large dynamic frame with thousands of conditions

I am trying to filter a time-series Glue dynamic frame with millions of rows of data:
id val ts
a 1.3 2022-05-03T14:18:00.000Z
a 9.2 2022-05-03T12:18:00.000Z
c 8.2 2022-05-03T13:48:00.000Z
I have another pandas dataframe with thousands of rows:
id start_ts end_ts
a 2022-05-03T14:00:00.000Z 2022-05-03T14:18:00.000Z
a 2022-05-03T11:38:00.000Z 2022-05-03T12:18:00.000Z
c 2022-05-03T13:15:00.000Z 2022-05-03T13:48:00.000Z
I want to filter all the rows in the time-series dynamic frame that have the same id and whose ts lies between start_ts and end_ts.
My current approach is too slow to solve the problem:
I first iterate over pandas_df and store multiple filtered Glue dynamic frames in an array:
dfs = []
for index, row in pandas_df.iterrows():
    df = Filter.apply(ts_dynamicframe,
                      f=lambda x: (row['start_ts'] <= x['ts'] <= row['end_ts']) and x['id'] == index)
    dfs.append(df)
and then unioning all the dynamicframes together.
df = dfs[0]
dfs.pop(0)
for _df in dfs:
    df = df.union(_df)
The materialization takes too long and never finishes:
print("Count: ", df.count())
What could be more efficient approaches to solving this problem with spark/glue?
Use a range join
Data
from pyspark.sql.functions import to_timestamp

df = spark.createDataFrame([('a', 1.3, '2022-05-03T14:18:00.000Z'),
                            ('a', 9.2, '2021-05-03T12:18:00.000Z'),
                            ('c', 8.2, '2022-05-03T13:48:00.000Z')],
                           ('id', 'val', 'ts'))
df1 = spark.createDataFrame([('a', '2022-05-03T14:00:00.000Z', '2022-05-03T14:18:00.000Z'),
                             ('a', '2022-05-03T11:38:00.000Z', '2022-05-03T12:18:00.000Z'),
                             ('c', '2022-05-03T13:15:00.000Z', '2022-05-03T13:48:00.000Z')],
                            ('id', 'start_ts', 'end_ts'))

# Convert to timestamp if not yet converted
df = df.withColumn('ts', to_timestamp('ts'))
df1 = df1.withColumn('start_ts', to_timestamp('start_ts')).withColumn('end_ts', to_timestamp('end_ts'))
Solution
# Register the DataFrames as SQL temp views
df1.createOrReplaceTempView('df1')
df.createOrReplaceTempView('df')

# Use a range join
spark.sql("SELECT * FROM df, df1 WHERE df.id = df1.id AND df.ts BETWEEN df1.start_ts AND df1.end_ts").show()
Outcome
+---+---+-------------------+---+-------------------+-------------------+
| id|val| ts| id| start_ts| end_ts|
+---+---+-------------------+---+-------------------+-------------------+
| a|1.3|2022-05-03 14:18:00| a|2022-05-03 14:00:00|2022-05-03 14:18:00|
| c|8.2|2022-05-03 13:48:00| c|2022-05-03 13:15:00|2022-05-03 13:48:00|
+---+---+-------------------+---+-------------------+-------------------+
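If you need the result back as a Glue dynamic frame rather than a Spark DataFrame, a minimal sketch of the same range join using toDF/fromDF follows. Here ts_dynamicframe and pandas_df are the objects from the question, and glueContext is assumed to be your existing GlueContext:

from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import to_timestamp

# convert the dynamic frame to a Spark DataFrame and load the pandas ranges
ts_df = ts_dynamicframe.toDF().withColumn('ts', to_timestamp('ts'))
ranges_df = (spark.createDataFrame(pandas_df)
             .withColumn('start_ts', to_timestamp('start_ts'))
             .withColumn('end_ts', to_timestamp('end_ts')))

# one range join instead of thousands of Filter.apply calls
joined = ts_df.join(
    ranges_df,
    (ts_df.id == ranges_df.id) & ts_df.ts.between(ranges_df.start_ts, ranges_df.end_ts),
    'inner'
).drop(ranges_df.id)

# back to a Glue dynamic frame
filtered_dyf = DynamicFrame.fromDF(joined, glueContext, 'filtered')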

Setting number format for floats when writing dataframe to Excel

I have a script producing multiple sheets for processing into a database, but I have strict number formats for certain columns in my dataframes.
I have created a sample dict based on the column headers and the number format required, and a sample df.
import pandas as pd

df_int_headers = ['GrossRevenue', 'Realisation', 'NetRevenue']
df = {'ID': [654398, 456789], 'GrossRevenue': [3.6069109, 7.584326],
      'Realisation': [1.5129510, 3.2659478], 'NetRevenue': [2.0939599, 4.3183782]}
df_formats = {'GrossRevenue': 3, 'Realisation': 6, 'NetRevenue': 4}
df = pd.DataFrame.from_dict(df)

def formatter(header):
    for key, value in df_formats.items():
        for head in header:
            return header.round(value).astype(str).astype(float)

df[df_int_headers] = df[df_int_headers].apply(formatter)
df.to_excel('test.xlsx', index=False)
With the current code, every column is written with 3 d.p. in my Excel sheet, whereas I require a different format for each column.
Look forward to your replies.
For me, passing a dictionary to DataFrame.round works. Note that for your original key-value 'NetRevenue': 4 only 3 decimals are returned; in my opinion there should be a 0 at the end, which is removed because the value is stored as a number:
df = {'ID': 654398, 'GrossRevenue': 3.6069109,
      'Realisation': 1.5129510, 'NetRevenue': 2.0939599}
df = pd.DataFrame(df, index=[0])

df_formats = {'GrossRevenue': 3, 'Realisation': 6, 'NetRevenue': 5}
df_int_headers = list(df_formats.keys())
df[df_int_headers] = df[df_int_headers].round(df_formats)
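If the requirement is a fixed number of decimals in the Excel cells themselves (an Excel number format, not just rounded values), one possible sketch is to set a per-column format through the xlsxwriter engine. This assumes xlsxwriter is installed; the file and sheet names are just examples:

import pandas as pd

df = pd.DataFrame({'ID': [654398, 456789],
                   'GrossRevenue': [3.6069109, 7.584326],
                   'Realisation': [1.5129510, 3.2659478],
                   'NetRevenue': [2.0939599, 4.3183782]})
df_formats = {'GrossRevenue': 3, 'Realisation': 6, 'NetRevenue': 4}

with pd.ExcelWriter('test.xlsx', engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name='Sheet1', index=False)
    workbook = writer.book
    worksheet = writer.sheets['Sheet1']
    for col, decimals in df_formats.items():
        # e.g. '0.000' for 3 decimal places
        fmt = workbook.add_format({'num_format': '0.' + '0' * decimals})
        col_idx = df.columns.get_loc(col)
        worksheet.set_column(col_idx, col_idx, None, fmt)

This keeps the underlying values untouched and only changes how Excel displays them.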

How to make a function with dynamic variables to add rows in pandas?

I'm trying to make a table from a list of data using pandas.
Originally I wanted to make a function where I can pass dynamic variables, so I could continuously add new rows from the data list.
It works up to the point where the row-adding part begins: the column headers are added, but the data is not. It either keeps the value in only the last column or adds nothing.
My scratch was:
for title in titles:
    for x in data:
        table = {
            title: data[x]
        }
df.DataFrame(table, columns=titles, index[0]
columns list:
titles = ['timestamp', 'source', 'tracepoint']
data list:
data = ['first', 'second', 'third',
'first', 'second', 'third',
'first', 'second', 'third']
How can I make something like this?
timestamp, source, tracepoint
first, second, third
first, second, third
first, second, third
If you just want to initialize a pandas DataFrame, you can use the DataFrame constructor.
You can also append a row using a dict.
Pandas provides other useful functions,
such as concatenation between DataFrames and inserting/deleting columns. If you need them, please check the pandas docs.
import pandas as pd
# initialization by dataframe’s constructor
titles = ['timestamp', 'source', 'tracepoint']
data = [['first', 'second', 'third'],
['first', 'second', 'third'],
['first', 'second', 'third']]
df = pd.DataFrame(data, columns=titles)
print('---initialization---')
print(df)
# append row
new_row = {
    'timestamp': '2020/11/01',
    'source': 'xxx',
    'tracepoint': 'yyy'
}
df = df.append(new_row, ignore_index=True)
print('---append result---')
print(df)
output
---initialization---
timestamp source tracepoint
0 first second third
1 first second third
2 first second third
---append result---
timestamp source tracepoint
0 first second third
1 first second third
2 first second third
3 2020/11/01 xxx yyy
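Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; a minimal sketch of the same row append using pd.concat instead:

# build a one-row DataFrame from the dict and concatenate it
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)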

How to create multiple dataframes using multiple functions

I quite often write a function to return different dataframes based on the parameters I enter. Here's an example dataframe:
import numpy as np
import pandas as pd

np.random.seed(1111)
df = pd.DataFrame({
    'Category': np.random.choice(['Group A', 'Group B', 'Group C', 'Group D'], 10000),
    'Sub-Category': np.random.choice(['X', 'Y', 'Z'], 10000),
    'Sub-Category-2': np.random.choice(['G', 'F', 'I'], 10000),
    'Product': np.random.choice(['Product 1', 'Product 2', 'Product 3'], 10000),
    'Units_Sold': np.random.randint(1, 100, size=10000),
    'Dollars_Sold': np.random.randint(100, 1000, size=10000),
    'Customer': np.random.choice(pd.util.testing.rands_array(10, 25, dtype='str'), 10000),
    'Date': np.random.choice(pd.date_range('1/1/2016', '12/31/2018', freq='M'), 10000)})
I then created a function to perform sub-totals for me like this:
def some_fun(DF1, agg_column, myList=[], *args):
    y = pd.concat([
        DF1.assign(**{x: '[Total]' for x in myList[i:]})
           .groupby(myList).agg(sumz=(agg_column, 'sum'))
        for i in range(1, len(myList) + 1)
    ]).sort_index().unstack(0)
    return y
I then write out lists that I'll pass as arguments to the function:
list_one = [pd.Grouper(key='Date',freq='A'),'Category','Product']
list_two = [pd.Grouper(key='Date',freq='A'),'Category','Sub-Category','Sub-Category-2']
list_three = [pd.Grouper(key='Date',freq='A'),'Sub-Category','Product']
I then have to run each list through my function creating new dataframes:
df1 = some_fun(df,'Units_Sold',list_one)
df2 = some_fun(df,'Dollars_Sold',list_two)
df3 = some_fun(df,'Units_Sold',list_three)
I then use a function to write each of these dataframes to an Excel worksheet. This is just an example - I perform this same exercise 10+ times.
My question: is there a better way to perform this task than writing out df1, df2, df3 by hand with the function applied each time? Should I be looking at using a dictionary or some other data type to do this more pythonically with a function?
A dictionary would be my first choice:
variations = [('Units_Sold', list_one), ('Dollars_Sold', list_two),
              ..., ('Title', some_list)]

df_variations = {}
for i, v in enumerate(variations):
    name = v[0]
    data = v[1]
    df_variations[i] = some_fun(df, name, data)
You might further consider setting the keys to unique, helpful titles for the variations, rather than something like 'Units_Sold', which isn't unique in your case.
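For example, a small sketch with hand-picked (hypothetical) titles as keys, reusing list_one and list_two from the question:

variations = {
    'units_by_category_product': ('Units_Sold', list_one),
    'dollars_by_subcategories': ('Dollars_Sold', list_two),
    # ... one entry per report you need
}

df_variations = {name: some_fun(df, agg_col, cols)
                 for name, (agg_col, cols) in variations.items()}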
IIUC,
as Thomas has suggested, we can use a dictionary to parse through your data, but with some minor modifications to your function we can use the dictionary to hold all the required data and then pass it through to your function.
The idea is to pass two types of keys: the lists of columns and the arguments to your pd.Grouper call.
data_dict = {
    "Units_Sold": {"key": "Date", "freq": "A"},
    "Dollars_Sold": {"key": "Date", "freq": "A"},
    "col_list_1": ["Category", "Product"],
    "col_list_2": ["Category", "Sub-Category", "Sub-Category-2"],
    "col_list_3": ["Sub-Category", "Product"],
}
def some_fun(dataframe, agg_col, dictionary, column_list, *args):
    key = dictionary[agg_col]["key"]
    frequency = dictionary[agg_col]["freq"]
    myList = [pd.Grouper(key=key, freq=frequency), *dictionary[column_list]]
    y = (
        pd.concat(
            [
                dataframe.assign(**{x: "[Total]" for x in myList[i:]})
                .groupby(myList)
                .agg(sumz=(agg_col, "sum"))
                for i in range(1, len(myList) + 1)
            ]
        )
        .sort_index()
        .unstack(0)
    )
    return y
Test.
df1 = some_fun(df,'Units_Sold',data_dict,'col_list_3')
print(df1)
                             sumz
Date                   2016-12-31 2017-12-31 2018-12-31
Sub-Category Product
X            Product 1      18308      17839      18776
             Product 2      18067      19309      18077
             Product 3      17943      19121      17675
             [Total]        54318      56269      54528
Y            Product 1      20699      18593      18103
             Product 2      18642      19712      17122
             Product 3      17701      19263      20123
             [Total]        57042      57568      55348
Z            Product 1      19077      17401      19138
             Product 2      17207      21434      18817
             Product 3      18405      17300      17462
             [Total]        54689      56135      55417
[Total]      [Total]       166049     169972     165293
As you want to automate the writing of the 10+ worksheets, we can again do that with a dictionary driving your function:
matches = {'Units_Sold': ['col_list_1', 'col_list_3'],
           'Dollars_Sold': ['col_list_2']}
Then a simple for loop writes all the frames to a single Excel file, one sheet each; change this to match your required behavior.
writer = pd.ExcelWriter('finished_excel_file.xlsx')
for key, value in matches.items():
    for items in value:
        dataframe = some_fun(df, key, data_dict, items)
        dataframe.to_excel(writer, f'{key}_{items}')
writer.save()
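In recent pandas versions ExcelWriter.save() is deprecated in favour of close() or a context manager; a sketch of the same loop in the context-manager form:

with pd.ExcelWriter('finished_excel_file.xlsx') as writer:
    for key, value in matches.items():
        for items in value:
            dataframe = some_fun(df, key, data_dict, items)
            dataframe.to_excel(writer, sheet_name=f'{key}_{items}')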

extracting data from numpy array in python3

I imported my csv file into Python using numpy's text reader and the result looks like this:
>>> print(FH)
array([['Probe_Name', '', 'A2M', ..., 'POS_D', 'POS_E', 'POS_F'],
['Accession', '', 'NM_000014.4', ..., 'ERCC_00092.1',
'ERCC_00035.1', 'ERCC_00034.1'],
['Class_Name', '', 'Endogenous', ..., 'Positive', 'Positive',
'Positive'],
...,
['CF33294_10', '', '6351', ..., '1187', '226', '84'],
['CF33299_11', '', '5239', ..., '932', '138', '64'],
['CF33300_12', '', '37372', ..., '981', '202', '58']], dtype=object)
Every single list is a column and the first item of every column is the header. I want to plot the data in different ways. To do so, I want to make a variable for every single column. For example, for the first column, whose header is Probe_Name, I want the output to look like this:
A2M
.
.
.
POS_D
POS_E
POS_F
This is the case for the rest of the columns too, and then I will plot the variables.
I tried to do that in Python 3 like this:
def items(N_array):
    for item in N_array:
        name = item[0]
        content = item[1:]
        return name, content

print(items(FH))
It does not return what I expect. Do you know how to fix it?
One simple way to do this is with a pandas DataFrame. When you read the csv file into a DataFrame, you essentially get a collection of 'columns' (called Series in pandas).
import pandas as pd
df = pd.read_csv("your filename.csv")
df
Probe_Name Accession
0 A2m MD_9999
1 POS_D NM_0014.4
2 POS_E 99999
Now we can deal with each column, which is named automatically from the header row.
print(df['Probe_Name'])
0 A2m
1 POS_D
2 POS_E
Furthermore, you can do plotting (assuming you have numeric data in here somewhere).
http://pandas.pydata.org/pandas-docs/stable/index.html
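If you want to keep working with the numpy array you already have, a minimal sketch that builds one variable per column (assuming each row of FH starts with the header, then the stray empty string seen in the dump, then the values):

# map header -> list of values for that column
columns = {row[0]: list(row[2:]) for row in FH}

print(columns['Probe_Name'])   # ['A2M', ..., 'POS_D', 'POS_E', 'POS_F']
print(columns['Class_Name'])   # ['Endogenous', ..., 'Positive']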
