Adding n new columns to a pandas DataFrame - python-3.x

Given a DataFrame like the following:
import pandas as pd

df = pd.DataFrame({'term': ['analys', 'applic', 'architectur', 'assess', 'item', 'methodolog',
                            'research', 'rs', 'studi', 'suggest', 'test', 'tool', 'viewer', 'work'],
                   'newValue': [0.810419, 0.631963, 0.687348, 0.810554, 0.725366, 0.742715, 0.799152,
                                0.599030, 0.652112, 0.683228, 0.711307, 0.625563, 0.604190, 0.724763]})
df = df.set_index('term')
print(df)
             newValue
term
analys       0.810419
applic       0.631963
architectur  0.687348
assess       0.810554
item         0.725366
methodolog   0.742715
research     0.799152
rs           0.599030
studi        0.652112
suggest      0.683228
test         0.711307
tool         0.625563
viewer       0.604190
work         0.724763
I want to add n new empty columns (filled with ""). The number of required new columns is stored in the variable n:
n = 5
Thanks for your help in advance!

According to this answer, every non-empty DataFrame has columns, an index, and values, so your DataFrame cannot have a column without a name anyway.
This is the shortest way I know of to achieve your goal:
n = 5
for i in range(n):
    df[len(df.columns)] = ""
             newValue 1 2 3 4 5
term
analys       0.810419
applic       0.631963
architectur  0.687348
assess       0.810554
item         0.725366
methodolog   0.742715
research     0.799152
rs           0.599030
studi        0.652112
suggest      0.683228
test         0.711307
tool         0.625563
viewer       0.604190
work         0.724763

IIUC, you can use:
n = 5
df = (pd.concat([df, pd.DataFrame(columns=['col' + str(i) for i in range(n)])],
                axis=1, sort=False)
        .fillna(''))
print(df)
             newValue col0 col1 col2 col3 col4
analys       0.810419
applic       0.631963
architectur  0.687348
assess       0.810554
item         0.725366
methodolog   0.742715
research     0.799152
rs           0.599030
studi        0.652112
suggest      0.683228
test         0.711307
tool         0.625563
viewer       0.604190
work         0.724763
Note: You can remove the fillna() if you want NaN.
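If you prefer a single call instead of a loop or concat, here is a hedged alternative sketch; the 'col0' to 'col4' names are only placeholders, use whatever labels you need:
n = 5
new_cols = ['col' + str(i) for i in range(n)]   # placeholder column names
# reindex() adds the missing columns in one call and fills them with ''
# (use fill_value=np.nan, or drop the argument, if you prefer NaN).
df = df.reindex(columns=list(df.columns) + new_cols, fill_value='')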

Related

Is there any alternative to merge multiple rows into a single row without using groupBy() & collect_list() in Spark?

I am trying to merge multiple rows into a single row after grouping data on a different column.
col1 col2
A 1
A 2
B 1
B 3
to
col1 col2
A 1,2
B 1,3
By using the below code:
df = spark.sql("select col1, col2, col3,...., colN from tablename where col3 = 'ABCD' limit 1000")
df.select('col1','col2').groupby('col1').agg(psf.concat_ws(', ', psf.collect_list(df.col2))).display()
This works fine with a small amount of data, but if I increase the number of rows to 1 million, the code fails with the exception:
java.lang.Exception: Results too large
Is there any alternative way in Spark to merge multiple rows into a single row without using the combination of groupBy() & collect_list()?
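One hedged sketch of an alternative, with the column names and output path assumed from the question: do the concatenation at the RDD level with reduceByKey and write the result out instead of calling display(), since the "Results too large" error may come from returning the full result for display rather than from the aggregation itself:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.sql("select col1, col2 from tablename where col3 = 'ABCD'")

# Concatenate the col2 values per col1 key without collect_list().
merged_rdd = (df.select('col1', 'col2').rdd
                .map(lambda row: (row['col1'], str(row['col2'])))
                .reduceByKey(lambda a, b: a + ', ' + b))

merged_df = spark.createDataFrame(merged_rdd, ['col1', 'col2'])
merged_df.write.mode('overwrite').parquet('/tmp/merged_rows')  # write instead of display()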

Printing Columns with a correlation greater than 80%

I have a pandas dataframe with 235607 records and 94 attributes. I am very new to Python. I was able to create a correlation matrix between all of the attributes, but it is a lot to look through individually. I tried writing a for loop to print a list of the columns with a correlation greater than 80%, but I keep getting the error "'DataFrame' object has no attribute 'c1'".
This is the code I used to create the correlation between the attributes, as well as the sample for loop. Thank you in advance for your help:
corr = data.corr() # data is the pandas dataframe
c1 = corr.abs().unstack()
c1.sort_values(ascending = False)
drop = [cols for cols in upper.c1 if any (upper[c1] > 0.80)]
drop
Sort in place if you need to keep using the same variable c1, and then grab the variable-name pairs with a list comprehension over the index:
c1.sort_values(ascending=True, inplace=True)
columns_above_80 = [(col1, col2) for col1, col2 in c1.index if c1[col1,col2] > 0.8 and col1 != col2]
Edit: added col1 != col2 to the list comprehension so you don't pick up the self-correlations.
You can simply use numpy.where like this:
corr.loc[np.where(corr>0.8, 1, 0)==1].columns
The output would be an array with the names of the columns that have values greater than 0.8.
EDIT: I hope this works; I edited the code above a little.
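A self-contained hedged sketch of the same idea, assuming the numeric DataFrame is named data as in the question: keep only the upper triangle of the absolute correlation matrix so the diagonal and duplicate pairs are excluded, then stack and filter.
import numpy as np

corr = data.corr().abs()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)   # True strictly above the diagonal
upper = corr.where(mask)                               # everything else becomes NaN
high_pairs = upper.stack()                             # Series indexed by (column, column); NaNs dropped
high_pairs = high_pairs[high_pairs > 0.8]
print(high_pairs.sort_values(ascending=False))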

Display barplot in column order using pandas

I have a data frame with two columns - col1 and col2, but when I use df.plot.barh, the plot returns results in col2 and col1 order. Is there a way to get the plot to display results in col1 and col2 order?
df = pd.DataFrame(np.random.randint(0,10,(5,2)), columns=['col1','col2'])
df.plot.barh()
will yield a horizontal bar chart with the bars of each group in col2, col1 order (plot image not shown). Using bar() instead:
df = pd.DataFrame(np.random.randint(0,10,(5,2)), columns=['col1','col2'])
df.plot.bar()
In both cases col1 is plotted first: with barh() it is the bar closest to the x axis (the bottom bar of each group), and with bar() it is the leftmost bar of each group. To reverse the order of the columns, you need to reverse the order in which they appear in your dataframe. For just two columns you can use:
df = df[df.columns[::-1]]
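A hedged end-to-end sketch of that reversal (matplotlib import and show() call added for completeness):
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, (5, 2)), columns=['col1', 'col2'])

# Reverse the column order so the two columns are drawn in the opposite order
# in the horizontal bar chart.
df[df.columns[::-1]].plot.barh()
plt.show()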

Executemany with WHERE using list

I'm trying to do something like this:
cur.executemany("UPDATE tableA SET Col1 = 'S' WHERE Col2 = %s AND Col3 = %s", data[:][0], data[:][4])
where "data" is a list of rows. I need to run an UPDATE for each row in my list (data): for each row, Col2 should match element 0 and Col3 should match element 4.
You need to transform the list before passing it to executemany, extracting the elements you need, perhaps like this:
cur.executemany("UPDATE tableA SET Col1 ='S' WHERE Col2 = ? AND Col3= ? ",
[(row[0], row[4]) for row in data])
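A hedged, runnable sketch of how the transformed parameter list lines up with the query; it uses sqlite3 purely for illustration, and the placeholder style (? vs %s) depends on your driver's paramstyle:
import sqlite3

# Each row supplies element 0 for Col2 and element 4 for Col3 (sample data, assumed shape).
data = [
    (10, 'a', 'b', 'c', 'x'),
    (20, 'd', 'e', 'f', 'y'),
]

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute("CREATE TABLE tableA (Col1 TEXT, Col2 INTEGER, Col3 TEXT)")
cur.executemany(
    "UPDATE tableA SET Col1 = 'S' WHERE Col2 = ? AND Col3 = ?",
    [(row[0], row[4]) for row in data],
)
conn.commit()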

Spark-R: How to transform Cassandra map and array columns into new DataFrame

I am using SparkR (spark-2.1.0) with the DataStax Cassandra connector.
I have a DataFrame backed by a table in Cassandra. Some of the columns in the Cassandra table are of type map and set, and I need to perform various filtering/aggregation operations on these "collection" columns.
my_data_frame <- read.df(
  source = "org.apache.spark.sql.cassandra",
  keyspace = "my_keyspace", table = "some_table")
my_data_frame
SparkDataFrame[id:string, col2:map<string,int>, col3:array<string>]
schema(my_data_frame)
StructType
|-name = "id", type = "StringType", nullable = TRUE
|-name = "col2", type = "MapType(StringType,IntegerType,true)", nullable = TRUE
|-name = "col3", type = "ArrayType(StringType,true)", nullable = TRUE
I would like to obtain:
1. A new dataframe containing the unique string KEYS in the col2 map over all rows in my_data_frame.
2. The sum() of the VALUES in the col2 map for each row, placed into a new column in my_data_frame.
3. The set of unique values in the col3 array over all rows in my_data_frame, in a new dataframe.
The map data in cassandra for col2 looks like:
VALUES ({'key1':100, 'key2':20, 'key3':50, ... })
If the original cassandra table looks like:
id col2
1 {'key1':100, 'key2':20}
2 {'key3':40, 'key4':10}
3 {'key1':10, 'key3':30}
I would like to obtain a dataframe containing the unique keys:
col2_keys
key1
key2
key3
key4
The sum of values for each id:
id col2_sum
1 120
2 60
3 40
The max of values for each id:
id col2_max
1 100
2 40
3 30
Additional info:
col2_df <- select(my_data_frame, my_data_frame$col2)
head(col2_df)
col2
1 <environment: 0x7facfb4fc4e8>
2 <environment: 0x7facfb4f3980>
3 <environment: 0x7facfb4eb980>
4 <environment: 0x7facfb4e0068>
row1 <- first(my_data_frame)
row1
col2
1 <environment: 0x7fad00023ca0>
I am new to Spark and R and have probably missed something obvious, but I don't see any obvious functions for transforming maps and arrays in this manner.
I did see some references to using "environment" as a map in R but am not sure how that would work for my requirements.
spark-2.1.0
Cassandra 3.10
spark-cassandra-connector:2.0.0-s_2.11
JDK 1.8.0_101-b13
Thanks so much in advance for any help.
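The question is tagged SparkR, but the underlying DataFrame operations are the same across language bindings. Here is a hedged PySpark sketch of one approach (explode() turns each map entry into a key/value row, after which ordinary grouping applies; SparkR exposes the equivalent explode, groupBy, and agg functions). Keyspace, table, and column names follow the question:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
my_data_frame = (spark.read.format("org.apache.spark.sql.cassandra")
                 .options(keyspace="my_keyspace", table="some_table")
                 .load())

# Explode the map column into (key, value) rows alongside the id.
exploded = my_data_frame.select("id", F.explode("col2"))   # columns: id, key, value

col2_keys = exploded.select(F.col("key").alias("col2_keys")).distinct()    # unique map keys
col2_sum = exploded.groupBy("id").agg(F.sum("value").alias("col2_sum"))    # sum of map values per id
col2_max = exploded.groupBy("id").agg(F.max("value").alias("col2_max"))    # max of map values per id

# Unique values of the col3 array over all rows.
col3_values = my_data_frame.select(F.explode("col3").alias("col3_value")).distinct()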
