Join common text to multiple dataframes in one syntax - python-3.x

Is there a way to join the same text to several dataframes in a single statement? I have 10 different dataframes to which I need to join the same text. For now I have done it separately for each one, e.g.
import pandas as pd
df1 = pd.DataFrame({'Col1': [40, 30, 20], 'COl2': [50, 10, 5]})
name = ['Sam']
lis = []
for i in name:
    lis.append(i)
df = pd.DataFrame({'i': lis}) #Creating a dataframe to append the name
df1 = df1.join(df)
df1.join(df)
df2.join(df) ...... so on
I want to do it in one statement, by making a list of the dataframes and joining the text to all of them, something like:
[df1,df2,df3,df4].join(df)
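A Python list has no join method for dataframes, but a list comprehension gets close to the one-line form asked for. The following is only a sketch: df2 and its values are made up for illustration, and the name frame is built directly instead of through the loop above. Note that join aligns on the index, so the name lands in row 0 only and the remaining rows get NaN.

```python
import pandas as pd

df1 = pd.DataFrame({'Col1': [40, 30, 20], 'COl2': [50, 10, 5]})
# Hypothetical second dataframe, not from the question.
df2 = pd.DataFrame({'Col1': [1, 2, 3], 'COl2': [4, 5, 6]})

# Single-row dataframe holding the text to join.
df = pd.DataFrame({'i': ['Sam']})

# Join the same dataframe to every dataframe in the list.
frames = [df1, df2]
joined = [frame.join(df) for frame in frames]

# Unpack back into the original names if needed.
df1, df2 = joined
```

If the name should appear on every row instead, assigning a scalar (for example frame.assign(i='Sam')) may be closer to the intent.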

Related

appending pytorch object in pandas DataFrame

I want to append the tensor objects c and s to the empty dataframe df_data1_cluster:
df_data1_cluster = pd.DataFrame(columns = ["cluster", "text"])
label, center = detect_clusters(torch.as_tensor(embeddings), 50)
for c, s in zip(label, phrases):
    df_data1_cluster.append(c,s)
This results in an error:
TypeError: cannot concatenate object of type '<class 'torch.Tensor'>'; only Series and DataFrame objs are valid
This is a pandas problem rather than a PyTorch one: the append command requires you to append a DataFrame or Series, not a tensor. There are other ways to insert data into a dataframe, for example:
import pandas as pd
import torch
df = pd.DataFrame({'numbers': [1, 2, 3], 'colors': ['red', 'white', 'blue']})
df['numbers'].loc[-1] = torch.tensor([2, 3, 4]) # adding a row
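Another option, sketched here under assumptions, is to collect the rows in a plain list and build the dataframe once at the end. The labels and phrases below are stand-in values, not the output of detect_clusters; with real torch tensors each label would first need converting to a Python number, for example with c.item().

```python
import pandas as pd

# Stand-in data: in the question these come from detect_clusters(...)
# and would be torch tensors (convert with c.item() before storing).
labels = [0, 1, 0]
phrases = ['first phrase', 'second phrase', 'third phrase']

# Collect one dict per row, then build the dataframe in a single call.
rows = [{'cluster': int(c), 'text': s} for c, s in zip(labels, phrases)]
df_data1_cluster = pd.DataFrame(rows, columns=['cluster', 'text'])
```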

Iterating over columns from two dataframes to estimate correlation and p-value

I am trying to estimate Pearson's correlation coefficient and p-value from the corresponding columns of two dataframes. I managed to write the code below, but it only gives me the result for the last columns. I need some help with this code. I would also like to save the outputs in a new dataframe.
import os
import pandas as pd
import scipy as sp
import scipy.stats as stats
df_1 = pd.DataFrame(pd.read_excel('15_Oct_Yield_A.xlsx'))
df_2= pd.DataFrame(pd.read_excel('Oct_Z_index.xlsx'))
for column in df_1.columns[1:]:
    for column in df_2.columns[1:]:
        x = (df_1[column])
        y = (df_2[column])
correl = stats.pearsonr(x, y)
Your loop setup is wrong in a couple of ways. You use the same variable name in both for-loops, so the inner loop overwrites the outer loop's variable. You also compute correl outside of your inner loop, so only the last pair of columns is ever used.
What you want is a single loop over the columns, assuming that both dataframes have the same column names. If they do not, you will need to take extra steps to find the common column names and then iterate over them.
Something like this should work:
import os
import pandas as pd
import scipy as sp
import scipy.stats as stats
df_1 = pd.DataFrame({'A': ['dog', 'pig', 'cat'],
                     'B': [0.25, 0.50, 0.75],
                     'C': [0.30, 0.40, 0.90]})
df_2 = pd.DataFrame({'A': ['bird', 'monkey', 'rat'],
                     'B': [0.20, 0.60, 0.90],
                     'C': [0.80, 0.50, 0.10]})
results = dict()
for column in df_1.columns[1:]:
    correl = stats.pearsonr(df_1[column], df_2[column])
    results[column] = correl
print(results)
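To cover the two remaining points (column names that may differ, and saving the output in a new dataframe), the loop could be extended roughly as follows. This is a sketch using the same toy data; the names common, rows, results_df, r and p_value are chosen here for illustration.

```python
import pandas as pd
from scipy import stats

df_1 = pd.DataFrame({'A': ['dog', 'pig', 'cat'],
                     'B': [0.25, 0.50, 0.75],
                     'C': [0.30, 0.40, 0.90]})
df_2 = pd.DataFrame({'A': ['bird', 'monkey', 'rat'],
                     'B': [0.20, 0.60, 0.90],
                     'C': [0.80, 0.50, 0.10]})

# Keep only the data columns that exist in both dataframes.
common = [c for c in df_1.columns[1:] if c in df_2.columns]

rows = []
for column in common:
    # pearsonr returns the coefficient and the two-sided p-value.
    r, p_value = stats.pearsonr(df_1[column], df_2[column])
    rows.append({'column': column, 'r': float(r), 'p_value': float(p_value)})

# One row per column pair, indexed by column name.
results_df = pd.DataFrame(rows).set_index('column')
print(results_df)
```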

How to iteratively add rows to an initial empty pandas Dataframe?

I have to iteratively add rows to a pandas DataFrame and find this quite hard to achieve. Also performance-wise I'm not sure if this is the best approach.
So from time to time, I get data from a server and this new dataset from the server will be a new row in my pandas DataFrame.
import pandas as pd
import datetime
df = pd.DataFrame([], columns=['Timestamp', 'Value'])
# as this df will grow over time, is this a costly copy (df = df.append) or does pandas does some optimization there, or is there a better way to achieve this?
# ignore_index, as I want the index to automatically increment
df = df.append({'Timestamp': datetime.datetime.now()}, ignore_index=True)
print(df)
After one day the DataFrame will be deleted, but during this time, probably 100k times a new row with data will be added.
The goal is still to achieve this in a very efficient way, runtime-wise (memory doesn't matter too much as enough RAM is present).
I tried this to compare the speed of 'append' with that of 'loc':
import timeit
code = """
import pandas as pd
df = pd.DataFrame({'A': range(0, 6), 'B' : range(0,6)})
df= df.append({'A' : 3, 'B' : 4}, ignore_index = True)
"""
code2 = """
import pandas as pd
df = pd.DataFrame({'A': range(0, 6), 'B' : range(0,6)})
df.loc[df.index.max()+1, :] = [3, 4]
"""
elapsed_time1 = timeit.timeit(code, number = 1000)/1000
elapsed_time2 = timeit.timeit(code2, number = 1000)/1000
print('With "append" :',elapsed_time1)
print('With "loc" :' , elapsed_time2)
On my machine, I obtained these results:
With "append" : 0.001502693824000744
With "loc" : 0.0010836279180002747
Using "loc" seems to be faster, although both timings also include the import and the construction of the DataFrame on every run.
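A commonly suggested alternative, sketched below under assumptions, is to avoid growing the DataFrame at all: keep the incoming rows in a plain Python list and build the DataFrame only when it is needed. The loop over range(5) is a stand-in for data arriving from the server.

```python
import datetime
import pandas as pd

rows = []
for value in range(5):  # stand-in for values received from the server
    # Appending to a list does not copy the existing rows.
    rows.append({'Timestamp': datetime.datetime.now(), 'Value': value})

# Build the DataFrame once, from all collected rows.
df = pd.DataFrame(rows, columns=['Timestamp', 'Value'])
print(df)
```

If the DataFrame has to be inspected while data is still arriving, it can simply be rebuilt from the list at that point.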

Most efficient way to convert Python multidimensional list to CSV file?

I want to output a multidimensional list to a CSV file.
Currently, I am creating a new DataFrame object and converting that to CSV. I am aware of the csv module, but I can't seem to figure out how to do that without manual input. The populate method allows the user to choose how many rows and columns they want. Basically, the data variable will usually be of form [[x1, y1, z1], [x2, y2, z2], ...]. Any help is appreciated.
from populator import populate
from pandas import DataFrame
data = populate()
df = DataFrame(data)
df.to_csv('output.csv')
A CSV file is nothing but comma-separated values for each column, with one line per row, which you can produce like so:
data = [[1, 2, 4], ['A', 'AB', 2], ['P', 23, 4]]
data_string = '\n'.join([', '.join(map(str, row)) for row in data])
with open('data.csv', 'wb') as f:
    f.write(data_string.encode())
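The csv module mentioned in the question can also do this without any manual input, and it takes care of quoting values that themselves contain commas. A minimal sketch follows; the helper name rows_to_csv is made up here, and an in-memory buffer stands in for the output file.

```python
import csv
import io

data = [[1, 2, 4], ['A', 'AB', 2], ['P', 23, 4]]

def rows_to_csv(rows, handle):
    """Write a list of rows to an open text handle as CSV."""
    writer = csv.writer(handle)
    writer.writerows(rows)

# In-memory buffer used for illustration instead of a real file.
buffer = io.StringIO()
rows_to_csv(data, buffer)
csv_text = buffer.getvalue()
```

To write to disk instead, the same helper can be called with a handle from open('output.csv', 'w', newline='').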

Float format in matplotlib table

Given a pandas dataframe, I am trying to translate it into a table by using this code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = {"Name": ["John", "Leonardo", "Chris", "Linda"],
        "Location": ["New York", "Florence", "Athens", "London"],
        "Age": [41, 33, 53, 22],
        "Km": [1023, 2312, 1852, 1345]}
df = pd.DataFrame(data)
fig, ax = plt.subplots()
ax.axis('off')
ax.set_title("Table", fontsize=16, weight='bold')
table = ax.table(cellText=df.values,
                 bbox=[0, 0, 1.5, 1],
                 cellLoc='center',
                 colLabels=df.columns)
And it works. However, I cannot figure out how to set the format for numbers as {:,.2f}, that is, with commas as thousands separators and two decimals.
Any suggestion?
Insert the following two lines of code after df is created and the rest of your code works as desired.
The Age and Km columns are defined as type int; convert these to float before using your str.format:
df.update(df[['Age', 'Km']].astype(float))
Now use DataFrame.applymap(str.format) on these two columns:
df.update(df[['Age', 'Km']].applymap('{:,.2f}'.format))
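An alternative that leaves df itself numeric is to build a separate, formatted copy just for the table. This is a sketch; the names numeric_cols, formatted and cell_text are chosen here for illustration.

```python
import pandas as pd

data = {"Name": ["John", "Leonardo", "Chris", "Linda"],
        "Location": ["New York", "Florence", "Athens", "London"],
        "Age": [41, 33, 53, 22],
        "Km": [1023, 2312, 1852, 1345]}
df = pd.DataFrame(data)

numeric_cols = ['Age', 'Km']

# Work on a copy so the original dataframe keeps its integer columns.
formatted = df.copy()
for col in numeric_cols:
    # Convert to float, then format each value as a string.
    formatted[col] = df[col].astype(float).map('{:,.2f}'.format)

# List of rows of strings, suitable for the cellText argument.
cell_text = formatted.values.tolist()
```

The resulting cell_text can then be passed to ax.table as cellText in place of df.values.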

Resources