How can I count unique values and groupby? - excel

I have been trying to count and group the number of unique values per row. Perhaps it will be easier to explain with a table. Should I transpose first before counting and grouping?
Box1     Box2     Box3     Count Result 1   Count Result 2   Count Result 3
Data A   Data A   Data B   Data A = 2       Data B = 1
Data C   Data D   Data B   Data C = 1       Data D = 1       Data B = 1

In Google Sheets, try:
=ARRAYFORMULA(TRIM(SPLIT(FLATTEN(QUERY(QUERY(
QUERY(SPLIT(FLATTEN(A2:C3&" = ×"&ROW(A2:C3)), "×"),
"select max(Col1) group by Col1 pivot Col2")&
QUERY(SPLIT(FLATTEN(A2:C3&" = ×"&ROW(A2:C3)), "×"),
"select count(Col1) group by Col1 pivot Col2")&"​",
"offset 1", ),,9^9)), "​")))

Related

How to create Excel file from two SQLite tables using condition in pandas?

I have two SQLite tables, Table 1 and Table 2.
Table1 has ID, Name and Code columns. Table2 has ID, Values and Con columns.
I want to create an Excel file with ID, Name, Code and Values columns. ID, Name and Code come from Table1, and Values comes from Table2 as the sum of Table2's Values column, under two conditions: the ID columns must match and the Con column must equal 'Done'.
I would approach this problem in steps.
First, extract the SQL tables into pandas DataFrames. I am no expert on that aspect of the problem, but assume you have two DataFrames like the following:
df1 =
   ID Name Code
0   1    a   1a
1   2    b   2b
2   3    a   3c

and

df2 =
   ID  Values   Con
0   1       5  Done
1   2       9    No
2   1       7  Done
3   2       4    No
4   1       8    No
5   3       1  Done
def sumByIndex(dx, row):
    # return sum value or 0 if ID doesn't exist
    idx = row['ID']
    st = list(dx['ID'])
    if idx in st:
        return dx[dx['ID'] == idx]['Values'].values[0]
    else:
        return 0

def combineFrames(d1, d2):
    # return an updated version of d1 with a "Values" column added
    d3 = d2[d2['Con'] == 'Done'].groupby('ID', as_index=False).sum()
    d1['Values'] = d1.apply(lambda row: sumByIndex(d3, row), axis=1)
    return d1
then print(combineFrames(df1, df2)) yields:
   ID Name Code  Values
0   1    a   1a      12
1   2    b   2b       0
2   3    a   3c       1
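
The answer leaves out the extraction step; here is a minimal sketch, assuming a SQLite file on disk and the table names table1/table2 (both names are assumptions):

import sqlite3
import pandas as pd

# hypothetical database file and table names
con = sqlite3.connect('my_database.db')
df1 = pd.read_sql_query('SELECT * FROM table1', con)
df2 = pd.read_sql_query('SELECT * FROM table2', con)
con.close()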
My program obtains the data from SQLite table 1 and SQLite table 2 in the form of lists (tuples and lists) with the corresponding values of ID, Name, Code and ID, Values, Con, by querying the database like this: 'SELECT * FROM sqlite table 1'.
# sqlite table 1
table1 = [[5674, 'a', '1a'], [3385, 'b', '2b'], [5548, 'a', '3c']]
# sqlite table 2
table2 = [(5674, 5, 'Done'), (3385, 9, 'No'), (5674, 7, 'Done'), (3385, 4, 'No'), (5674, 8, 'No'), (5548, 1, 'Done')]
To begin, I will add up all the Values entries in a dictionary that matches them to the corresponding ID:
map_values = {table2[i][0]: 0 for i in range(len(table2))}
for i in range(len(table2)):
    if table2[i][2] == 'Done':
        map_values[table2[i][0]] += table2[i][1]
Then I define the pandas.DataFrame() instance from sqlite table 1 like this:
df = pd.DataFrame(table1, index=[i for i in range(1, len(table1)+1)], columns=["ID", "Name", "Code"])
The Values sums are stored in that same ID order, so they can later be attached as a new Values column:
df["Values"] = list(map_values.values())
output:

     ID Name Code  Values
1  5674    a   1a      12
2  3385    b   2b       0
3  5548    a   3c       1
Finally, write it to Excel:
df.to_excel(r'./excel_file.xlsx', index=False)
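
For completeness, a merge-based sketch that produces the same result from the first answer's df1/df2 frames (my addition, not part of either answer):

import pandas as pd

# sum Values for rows marked Done, then left-join onto df1
done = df2[df2['Con'] == 'Done'].groupby('ID', as_index=False)['Values'].sum()
out = df1.merge(done, on='ID', how='left').fillna({'Values': 0})
out.to_excel('excel_file.xlsx', index=False)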

rotate/pivot a table from long to wide in pandas and create column of difference between previous columns

I have a table that looks like this:
> data = {'index':[0,1,2,3],'column_names':['foo_1','foo_2','bar_1','bar_2'], 'Totals':[1050,400,450,300]}
and I want to do three things:
1. Pivot each row in the 'column_names' column to an actual column name.
2. Create an additional column whose values are the differences of the values of ('foo_1' and 'foo_2') and of ('bar_1' and 'bar_2').
3. The result also needs to be a dataframe object that looks like this:
data_t = {'foo_1':[1050],'foo_2':[400],'Foo % Diff':[650],'bar_1':[450],'bar_2':[300],'Bar % Diff':[150]}
I would really appreciate an answer with explanations.
You need .groupby to get the difference with diff().abs() in s1.
Then you need .groupby to build the names of the total columns in s2, and concat s1 and s2 together.
From there, append the result of s1 and s2 to the dataframe and use .T to transpose it after setting the index to column_names, so the values in column_names show up as the column headers instead of sitting in the first row (which is what would happen without .set_index()):
data = {'column_names':['foo_1','foo_2','bar_1','bar_2'], 'Totals':[1050,400,450,300]}
df = pd.DataFrame(data)
s = df['column_names'].str.split('_').str[0]
s1 = df.groupby(s)['Totals'].diff().abs()
s2 = df.groupby(s)['column_names'].apply(lambda x: x.str.split('_').str[0] + '__% Diff')
# note: DataFrame.append was removed in pandas 2.0; see the pd.concat sketch further below
df = (df.append(pd.concat([s1, s2], axis=1).dropna())
        .sort_values('column_names').set_index('column_names').T)
# '__' was chosen carefully to preserve sort order; now simply replace it with a space
df.columns = df.columns.str.replace('__', ' ')
df
Out[1]:
column_names  bar_1  bar_2  bar % Diff   foo_1  foo_2  foo % Diff
Totals        450.0  300.0       150.0  1050.0  400.0       650.0
Or, if you are looking for the percentage change with pct_change():
data = {'column_names':['foo_1','foo_2','bar_1','bar_2'], 'Totals':[1050,400,450,300]}
df = pd.DataFrame(data)
s = df['column_names'].str.split('_').str[0]
s1 = df.groupby(s)['Totals'].apply(lambda x: x.pct_change())
s2 = df.groupby(s)['column_names'].apply(lambda x: x.str.split('_').str[0] + '__% Diff')
df = (df.append(pd.concat([s1, s2], axis=1).dropna())
        .sort_values('column_names').set_index('column_names').T)
df.columns = df.columns.str.replace('__', ' ')
df
Out[2]:
column_names  bar_1  bar_2  bar % Diff   foo_1  foo_2  foo % Diff
Totals        450.0  300.0   -0.333333  1050.0  400.0   -0.619048
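
Since DataFrame.append was removed in pandas 2.0, here is a minimal equivalent sketch of the first variant using pd.concat (same data and logic as above; the simplified s2 is my change):

import pandas as pd

data = {'column_names': ['foo_1', 'foo_2', 'bar_1', 'bar_2'],
        'Totals': [1050, 400, 450, 300]}
df = pd.DataFrame(data)

s = df['column_names'].str.split('_').str[0]   # 'foo', 'foo', 'bar', 'bar'
s1 = df.groupby(s)['Totals'].diff().abs()      # per-group absolute difference
s2 = s + '__% Diff'                            # names for the new columns

extra = pd.concat([s2.rename('column_names'), s1], axis=1).dropna()
df = (pd.concat([df, extra])
        .sort_values('column_names').set_index('column_names').T)
df.columns = df.columns.str.replace('__', ' ')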

how to use multiple columns' values to calculate the result in a pandas pivot table

def WA(A, B):
    c = A * B
    return c.sum() / A.sum()

sample = pd.pivot_table(intime, index=['year'], columns='DV', values=['A', 'B'],
                        aggfunc={'A': [len, np.sum], 'B': [WA]}, fill_value=0)
I'm grouping the dataframe by year and want to find the weighted average of column B.
I'm supposed to multiply column A with B, then sum the result and divide it by the sum of A (the function WA() above does that).
I really have no idea how to call the function so that it receives both columns.
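
The snag is that pivot_table's aggfunc receives one column (a Series) at a time, so WA can never see A and B together. A common workaround is to compute the weighted average with groupby().apply() on the whole group and then unstack; a minimal sketch, with made-up data standing in for intime:

import pandas as pd

# made-up stand-in for `intime`
intime = pd.DataFrame({
    'year': [2020, 2020, 2021, 2021],
    'DV':   ['x',  'y',  'x',  'y'],
    'A':    [2.0,  3.0,  4.0,  1.0],
    'B':    [10.0, 20.0, 30.0, 40.0],
})

# weighted average of B with weights A, per (year, DV) group
wa = (intime.groupby(['year', 'DV'])
            .apply(lambda g: (g['A'] * g['B']).sum() / g['A'].sum())
            .unstack('DV'))
print(wa)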

Data is lost while extracting data from xls

I have 12,000 rows of data in an xls file that I want to read, parse, and insert into a database. I use the extrame/xls library to read the xls data,
but some of the data is different from, or missing relative to, the actual data in the Excel file.
This is my readXLSFile method:
func readXLSFile(filename string) ([][]string, error) {
    result := [][]string{}
    log.Println("Get into readXlsFile")
    xlFile, err := xls.Open(filename, "utf-8")
    if err != nil {
        return nil, err
    }
    sheet1 := xlFile.GetSheet(0)
    str := ""
    //log.Println("Max Row ", int(sheet1.MaxRow))
    for i := 0; i <= int(sheet1.MaxRow); i++ {
        row1 := sheet1.Row(i)
        temp := []string{}
        for j := 0; j <= int(row1.LastCol()); j++ {
            temp = append(temp, row1.Col(j))
            //log.Println("Max Col", int(row1.LastCol()), "Of row ", i+1)
            str += fmt.Sprintf("column %d data = %s ", j+1, row1.Col(j))
        }
        log.Printf("row %d data : %s \n", i+1, str)
        str = ""
        result = append(result, temp)
    }
    return result, nil
}
and here is my log that show the different data from my xls file :
2018/03/12 19:24:24 service.inquiry.go:4557: row 1836 data : column 1 data = :61:171218C59000NMSC column 2 data =
2018/03/12 19:24:24 service.inquiry.go:4557: row 1837 data : column 1 data = :86: column 2 data = column 3 data = column 4 data = column 5 data = column 6 data = column 7 data = column 8 data = column 9 data = column 10 data = column 11 data = column 12 data = PLS10299 column 13 data = column 14 data = 22162- column 15 data =
2018/03/12 19:24:24 service.inquiry.go:4557: row 1838 data : column 1 data = :61:171218D300NMSC column 2 data =
2018/03/12 19:24:24 service.inquiry.go:4557: row 1839 data : column 1 data = :86: column 2 data = column 3 data = column 4 data = column 5 data = column 6 data = column 7 data = column 8 data = column 9 data = column 10 data = column 11 data = column 12 data = PLS10299 column 13 data = column 14 data = 22162- column 15 data =
2018/03/12 19:24:24 service.inquiry.go:4557: row 1840 data : column 1 data = :61:171218D700NMSC column 2 data =
2018/03/12 19:24:24 service.inquiry.go:4557: row 1841 data : column 1 data = :86: column 2 data = column 3 data = column 4 data = column 5 data = column 6 data = column 7 data = column 8 data = column 9 data = column 10 data = column 11 data = column 12 data = PLS10299 column 13 data = column 14 data = 22162- column 15 data =
(A screenshot of the actual data from the xls file accompanied the question but is omitted here.)
Does anybody know why this is happening, and how to fix it?

Spark-R: How to transform Cassandra map and array columns into new DataFrame

Using SparkR (spark-2.1.0) with the DataStax Cassandra connector.
I have a dataframe which connects to a table in Cassandra. Some of the columns in the Cassandra table are of type map and set. I need to perform various filtering/aggregation operations on these "collection" columns.
my_data_frame <- read.df(
    source = "org.apache.spark.sql.cassandra",
    keyspace = "my_keyspace", table = "some_table")

my_data_frame
SparkDataFrame[id:string, col2:map<string,int>, col3:array<string>]
schema(my_data_frame)
StructType
|-name = "id", type = "StringType", nullable = TRUE
|-name = "col2", type = "MapType(StringType,IntegerType,true)", nullable = TRUE
|-name = "col3", type = "ArrayType(StringType,true)", nullable = TRUE
I would like to obtain:
1. A new dataframe containing the unique string KEYS in the col2 map over all rows in my_data_frame.
2. The sum() of VALUES in the col2 map for each row, placed into a new column in my_data_frame.
3. The set of unique values in the col3 array over all rows in my_data_frame, in a new dataframe.
The map data in Cassandra for col2 looks like:
VALUES ({'key1':100, 'key2':20, 'key3':50, ... })
If the original cassandra table looks like:
id col2
1 {'key1':100, 'key2':20}
2 {'key3':40, 'key4':10}
3 {'key1':10, 'key3':30}
I would like to obtain a dataframe containing the unique keys:
col2_keys
key1
key2
key3
key4
The sum of values for each id:
id col2_sum
1 120
2 60
3 40
The max of values for each id:
id col2_max
1 100
2 40
3 30
Additional info:
col2_df <- select(my_data_frame, my_data_frame$col2)
head(col2_df)
col2
1 <environment: 0x7facfb4fc4e8>
2 <environment: 0x7facfb4f3980>
3 <environment: 0x7facfb4eb980>
4 <environment: 0x7facfb4e0068>
row1 <- first(my_data_frame)
row1
col2
1 <environment: 0x7fad00023ca0>
I am new to Spark and R and have probably missed something obvious, but I don't see any obvious functions for transforming maps and arrays in this manner.
I did see some references to using an "environment" as a map in R, but I am not sure how that would work for my requirements.
spark-2.1.0
Cassandra 3.10
spark-cassandra-connector:2.0.0-s_2.11
JDK 1.8.0_101-b13
Thanks so much in advance for any help.
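
No answer is attached to this question here; for illustration only, a minimal PySpark sketch of the requested transformations (explode turns a map column into key/value rows, and SparkR exposes the same Spark SQL functions). The toy rows mirror the example table above and are not real data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, sum as sum_, max as max_

spark = SparkSession.builder.getOrCreate()

# toy stand-in for the Cassandra-backed dataframe
df = spark.createDataFrame(
    [(1, {"key1": 100, "key2": 20}, ["a", "b"]),
     (2, {"key3": 40, "key4": 10}, ["b", "c"]),
     (3, {"key1": 10, "key3": 30}, ["a"])],
    ["id", "col2", "col3"])

# explode the map into (id, key, value) rows
kv = df.select("id", explode("col2"))

col2_keys = kv.select("key").distinct()            # unique map keys over all rows
col2_agg = kv.groupBy("id").agg(sum_("value").alias("col2_sum"),
                                max_("value").alias("col2_max"))
col3_vals = df.select(explode("col3").alias("col3_val")).distinct()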
