What does the output of this line in pandas dataframe signify? - python-3.x

I am learning Pandas DataFrame and came across this code:
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))
Now when I use print(list(df.columns.values)) as suggested on this page, the output is:
[0, 1, 2]
I am unable to understand the output. What are the values 0, 1, 2 signifying? Since the height of the DataFrame is 2, I suppose the last value, 2, signifies the height. What about 0 and 1?
I apologize if this question is a duplicate. I couldn't find any relevant explanation. If there is any similar question, please mention the link.
Many thanks.

If the question is what the columns are, check these samples:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))
print (df)
   0  1  2
0  1  2  3
1  4  5  6

# default column names
print(list(df.columns.values))
[0, 1, 2]
print(list(df.index.values))
[0, 1]
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]), columns=list('abc'))
print (df)
   a  b  c
0  1  2  3
1  4  5  6

# custom column names
print(list(df.columns.values))
['a', 'b', 'c']
print(list(df.index.values))
[0, 1]
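If you want to swap the default labels for custom ones after creation, one option (a minimal sketch) is to assign a new list to df.columns directly:
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))
df.columns = list('abc')  # replace the default 0, 1, 2 labels
print(list(df.columns.values))
['a', 'b', 'c']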
You can also check the docs:
The axis labeling information in pandas objects serves many purposes:
Identifies data (i.e. provides metadata) using known indicators, important for analysis, visualization, and interactive console display
Enables automatic and explicit data alignment
Allows intuitive getting and setting of subsets of the data set
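For example, automatic alignment means arithmetic matches values by label rather than by position; a small sketch:
s1 = pd.Series([1, 2, 3], index=list('abc'))
s2 = pd.Series([10, 20, 30], index=list('cba'))
print(s1 + s2)
a    31
b    22
c    13
dtype: int64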

What is a data frame?
df is a data frame. Take a step back and take in what that means. I mean outside what it means from a Pandas perspective. Though there are many nuances to what different people mean by a data frame, generally, it is a table of data with rows and columns.
How do we reference those rows and/or columns?
Consider the example data frame df. I create a 4x4 table with tuples in each cell representing the (row, column) position of that cell. You'll also notice the labels on the rows are ['A', 'B', 'C', 'D'] and the labels on the columns are ['W', 'X', 'Y', 'Z'].
df = pd.DataFrame(
    [[(i, j) for j in range(4)] for i in range(4)],
    list('ABCD'), list('WXYZ')
)
df
        W       X       Y       Z
A  (0, 0)  (0, 1)  (0, 2)  (0, 3)
B  (1, 0)  (1, 1)  (1, 2)  (1, 3)
C  (2, 0)  (2, 1)  (2, 2)  (2, 3)
D  (3, 0)  (3, 1)  (3, 2)  (3, 3)
If we wanted to reference by position, the zeroth row and third column are highlighted here:
df.style.applymap(lambda x: 'background: #aaf' if x == (0, 3) else '')
We could get at that position with iloc (which handles ordinal/positional indexing)
df.iloc[0, 3]
(0, 3)
What makes Pandas special is that it gives us an alternative way to reference both the rows and/or the columns. We could reference by the labels using loc (which handles label indexing)
df.loc['A', 'Z']
(0, 3)
I intentionally labeled the rows and columns with letters so as to not confuse label indexing with positional indexing. In your data frame, you let Pandas give you a default index for both rows and columns and those labels end up just being equivalent to positions when you begin.
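While those default labels are in place, label and position lookups agree; a small sketch (df0 here just rebuilds the frame from the question):
import numpy as np
import pandas as pd

df0 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))
print(df0.loc[0, 2])   # 3 -- row label 0, column label 2
print(df0.iloc[0, 2])  # 3 -- row position 0, column position 2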
What is the difference between label and positional indexing?
Consider this modified version of our data frame. Let's call it df_
df_ = df.sort_index(axis=1, ascending=False)
df_
        Z       Y       X       W
A  (0, 3)  (0, 2)  (0, 1)  (0, 0)
B  (1, 3)  (1, 2)  (1, 1)  (1, 0)
C  (2, 3)  (2, 2)  (2, 1)  (2, 0)
D  (3, 3)  (3, 2)  (3, 1)  (3, 0)
Notice that the columns are in reverse order. And when I call the same positional reference as above but on df_
df_.iloc[0, 3]
(0, 0)
I get a different answer because my columns have shifted around and are out of their original position.
However, if I call the same label reference
df_.loc['A', 'Z']
(0, 3)
I get the same thing. So label indexing allows me to reference regardless of the order of rows or columns.
OK! But what about OP's question?
Pandas stores the data in an attribute called values:
df.values
array([[(0, 0), (0, 1), (0, 2), (0, 3)],
       [(1, 0), (1, 1), (1, 2), (1, 3)],
       [(2, 0), (2, 1), (2, 2), (2, 3)],
       [(3, 0), (3, 1), (3, 2), (3, 3)]], dtype=object)
The column labels are in an attribute called columns:
df.columns
Index(['W', 'X', 'Y', 'Z'], dtype='object')
And the row labels are in an attribute called index:
df.index
Index(['A', 'B', 'C', 'D'], dtype='object')
It so happens that in OP's sample data frame, the columns were [0, 1, 2] and the index was [0, 1]. Those are default labels, so the last 2 is just the label of the third column; it has nothing to do with the height of the frame.
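You can check those defaults directly (in recent pandas versions both axes print as a RangeIndex; df_op here just rebuilds the frame from the question):
import numpy as np
import pandas as pd

df_op = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))
print(df_op.columns)  # RangeIndex(start=0, stop=3, step=1) -> column labels 0, 1, 2
print(df_op.index)    # RangeIndex(start=0, stop=2, step=1) -> row labels 0, 1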

Related

Check if all list values in dataframe column are the same [duplicate]

If the type of a column in a dataframe is int, float or string, we can get its unique values with df[columnName].unique().
But what if this column is a list, e.g. [1, 2, 3]?
How could I get the unique values of this column?
I think you can convert the values to tuples and then unique works nicely:
df = pd.DataFrame({'col':[[1,1,2],[2,1,3,3],[1,1,2],[1,1,2]]})
print (df)
            col
0     [1, 1, 2]
1  [2, 1, 3, 3]
2     [1, 1, 2]
3     [1, 1, 2]
print (df['col'].apply(tuple).unique())
[(1, 1, 2) (2, 1, 3, 3)]
L = [list(x) for x in df['col'].apply(tuple).unique()]
print (L)
[[1, 1, 2], [2, 1, 3, 3]]
You cannot apply unique() on a non-hashable type such as list. You need to convert to a hashable type to do that.
A better solution, with a recent version of pandas, is to use duplicated(), which avoids iterating over the values and converting back to lists:
df[~df.col.apply(tuple).duplicated()]
That returns the unique values, still stored as lists, as rows of the DataFrame.
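For completeness, a quick check of what that boolean mask keeps, using the same df as above:
print(df[~df.col.apply(tuple).duplicated()])
            col
0     [1, 1, 2]
1  [2, 1, 3, 3]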

Iterate through Pandas dataframe rows in a triangular fashion

I have a Pandas dataframe df like this:
      col1    col2
0  value11   List1
1  value12   List2
2  value13   List3
..     ...     ...
i  value1i  List_i
j  value1j  List_j
..     ...     ...
col1 is the key (it does not repeat). col2 holds lists. In the end, I want the set intersection of every pair of rows of col2.
I would like to iterate through this dataframe in a triangular fashion.
Something along the lines of:
for i = 0; i < len(df); i++:
    for j = i + 1; j < len(df); j++:
        Set(List_i).intersect(Set(List_j))
So the 1st iterator goes through the full dataframe, while the 2nd starts one index after the 1st and goes to the end of the dataframe.
How can I do this efficiently and quickly?
Edit:
A naive way of doing this is:
col1_list = list(set(df.col1))
num_col1_entries = len(col1_list)
for idx, value1 in enumerate(col1_list):
    for j in range(idx + 1, num_col1_entries):
        value2 = col1_list[j]
        # pull the single list out of each one-row selection
        list1 = df.loc[df.col1 == value1, 'col2'].iloc[0]
        list2 = df.loc[df.col1 == value2, 'col2'].iloc[0]
        print(set(list1).intersection(set(list2)))
Expected output: n(n-1)/2 prints of set intersections of each pair of rows of col2.
You can use itertools. Let's say this is your dataframe:
      col1   col2
0  value11  List1
1  value12  List2
2  value13  List3
3  value14  List4
4  value15  List5
5  value16  List6
Then get all the combinations (15) and print the intersection of each pair of lists:
from itertools import combinations

for pair in combinations(df.index, 2):
    print(pair)
    list1 = df.iloc[pair[0], 1]
    list2 = df.iloc[pair[1], 1]
    print(set(list1).intersection(set(list2)))
Output (only printing the pair):
(0, 1)
(0, 2)
(0, 3)
(0, 4)
(0, 5)
(1, 2)
(1, 3)
(1, 4)
(1, 5)
(2, 3)
(2, 4)
(2, 5)
(3, 4)
(3, 5)
(4, 5)
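A variant of the same idea pairs the list values directly via Series.items(), avoiding the repeated positional lookups; a sketch assuming col2 holds the lists, as in the question:
from itertools import combinations

# i and j are the row labels, so this prints the same n*(n-1)/2 pairs as above
for (i, list1), (j, list2) in combinations(df['col2'].items(), 2):
    print((i, j), set(list1) & set(list2))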

Averaging n elements along 1st axis of 4D array with numpy

I have a 4D array containing daily time-series of gridded data for different years with shape (year, day, x-coordinate, y-coordinate). The actual shape of my array is (19, 133, 288, 620), so I have 19 years of data with 133 days per year over a 288 x 620 grid. I want to take the weekly average of each grid cell over the period of record. The shape of the weekly averaged array should be (19, 19, 288, 620), or (year, week, x-coordinate, y-coordinate). I would like to use numpy to achieve this.
Here I construct some dummy data to work with and an array of what the solution should be:
import numpy as np
a1 = np.arange(1, 10).reshape(3, 3)
a1days = np.repeat(a1[np.newaxis, ...], 7, axis=0)
b1 = np.arange(10, 19).reshape(3, 3)
b1days = np.repeat(b1[np.newaxis, ...], 7, axis=0)
c1year = np.concatenate((a1days, b1days), axis=0)
a2 = np.arange(19, 28).reshape(3, 3)
a2days = np.repeat(a2[np.newaxis, ...], 7, axis=0)
b2 = np.arange(29, 38).reshape(3, 3)
b2days = np.repeat(b2[np.newaxis, ...], 7, axis=0)
c2year = np.concatenate((a2days, b2days), axis=0)
dummy_data = np.concatenate((c1year, c2year), axis=0).reshape(2, 14, 3, 3)
solution = np.concatenate((a1, b1, a2, b2), axis=0).reshape(2, 2, 3, 3)
The shape of the dummy_data is (2, 14, 3, 3). Per the dummy data, I have two years of data, 14 days per year, over a 3 X 3 grid. I want to return the weekly average of the grid for both years, resulting in a solution with shape (2, 2, 3, 3).
You can reshape and take mean:
week_mean = dummy_data.reshape(2, -1, 7, 3, 3).mean(axis=2)
# in your case: .reshape(year, -1, 7, x_coord, y_coord)

# check:
(dummy_data.reshape(2, 2, 7, 3, 3).mean(axis=2) == solution).all()
# True
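If the day axis were not an exact multiple of 7, the reshape above would raise an error; a small sketch of trimming leftover days first (the real 133-day case divides evenly, so nothing would be dropped there):
import numpy as np

data = np.arange(2 * 15 * 3 * 3, dtype=float).reshape(2, 15, 3, 3)  # 15 days: one leftover
full_weeks = data.shape[1] // 7     # 2 full weeks
trimmed = data[:, :full_weeks * 7]  # drop the leftover day
weekly = trimmed.reshape(data.shape[0], full_weeks, 7, 3, 3).mean(axis=2)
print(weekly.shape)  # (2, 2, 3, 3)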

PySpark columnSimilarities interpretation

I was learning how to use columnSimilarities. Can someone explain to me the matrix that is generated by the algorithm?
Let's say we have this code:
from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)])
# Convert to RowMatrix
mat = RowMatrix(rows)
# Calculate exact and approximate similarities
exact = mat.columnSimilarities()
approx = mat.columnSimilarities(0.05)
# Output
exact.entries.collect()
[MatrixEntry(0, 2, 0.991935352214),
 MatrixEntry(1, 2, 0.998441152599),
 MatrixEntry(0, 1, 0.997463284056)]
How can I know which row is most similar given the matrix? Does (0, 2, 0.991935352214) mean that row 0 and row 2 have a similarity of 0.991935352214? I know that 0 and 2 are i and j, the row and column of the matrix, respectively.
Thank you.
how can I know which row is most similar given the matrix?
It is columnSimilarities, not rowSimilarities, so it is just not the thing you're looking for.
You could apply it to the transposed matrix, but you really don't want to. The algorithms used here are designed for thin, optimally sparse data; they just won't scale to wide data.
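That said, to read the output itself: each MatrixEntry(i, j, value) gives the cosine similarity of columns i and j. A minimal sketch that picks out the most similar pair, reusing exact from the question:
# entries is an RDD of MatrixEntry objects
best = exact.entries.max(key=lambda e: e.value)
print(best.i, best.j, best.value)  # the most similar pair of columns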

How to return the file number from bag of words

I am working with CountVectorizer from sklearn, and I want to know how I can access or extract the file number. This is what I tried.
For example, from the output line (1, 12) 1, I want only the 1, which represents the file number.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
string1 = "these is my first statment in vectorizer"
string2 = "hello every one i like the place here"
string3 = "i am going to school every day day like the student in my school"
email_list = [string1, string2, string3]
bagofword = vectorizer.fit(email_list)
bagofword = vectorizer.transform(email_list)
print(bagofword)
output:
(0, 3) 1
(0, 7) 1
(0, 8) 1
(0, 10) 1
(0, 14) 1
(1, 12) 1
(1, 16) 1
(2, 0) 1
(2, 1) 2
You could iterate over the columns of the sparse array with
features_map = [col.indices.tolist() for col in bagofword.T]
and to get a list of all documents that contain the feature k, simply take element k of this list.
For instance, features_map[2] == [1, 2] means that feature number 2 is present in documents 1 and 2.
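Alternatively, since the file number is just the row index of the sparse matrix, a small sketch using the COO form, reusing bagofword from above:
coo = bagofword.tocoo()
for doc, feat, count in zip(coo.row, coo.col, coo.data):
    print(doc, feat, count)  # doc is the file number for each stored count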
