Step-by-step Python code for data understanding (statistical analysis, pivot tables, data sorting, etc.)
If we want to apply for a data analyst or data scientist role, we need to know at least one of the programming languages used in such roles, for example R, Python, or Scala. For this post, I have chosen Python for data analysis.
If you want to see practical code for loading data from different sources and for the data cleaning steps, please check the links below.
After data cleaning, we reach the actual data analysis stage. In this phase, we can perform different data analysis operations such as statistical analysis and data aggregation with pivot tables.
For descriptive statistics, we can use the describe method of pandas to get a detailed summary in one call. Using the individual aggregate functions, we can also compute each measure separately.
# Creating the dataset
import pandas as pd
Emp = ['Jane', 'Johny', 'Boby', 'Jon', 'Mary', 'Jony', 'Alice', 'Melica']
Salary = [9500, 7800, 7600, 9500, 7700, 7800, 9900, 10000]
# Pair each employee with a salary; list() materializes the zip iterator
SalaryList = list(zip(Emp, Salary))
df = pd.DataFrame(data=SalaryList, columns=['Emp', 'Salary'])
df['Salary'].count() # number of non-null values
df['Salary'].mean() # arithmetic average
df['Salary'].std() # sample standard deviation
df['Salary'].min() # minimum
df['Salary'].max() # maximum
df['Salary'].quantile(.25) # first quartile
df['Salary'].quantile(.5) # second quartile (same as the median)
df['Salary'].quantile(.75) # third quartile
df['Salary'].median() # the middle value when the values are sorted
df['Salary'].mode() # the most common value(s)
df['Salary'].var() # sample variance of the values in a column
df.var(numeric_only=True) # variance of all numeric columns
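The describe method mentioned earlier returns most of these measures in a single call. A minimal sketch, reusing the same sample data (the variable name summary is only for illustration):

```python
import pandas as pd

# Same sample data as in the example above
emp = ['Jane', 'Johny', 'Boby', 'Jon', 'Mary', 'Jony', 'Alice', 'Melica']
salary = [9500, 7800, 7600, 9500, 7700, 7800, 9900, 10000]
df = pd.DataFrame({'Emp': emp, 'Salary': salary})

# describe() returns count, mean, std, min, the three quartiles and max
summary = df['Salary'].describe()
print(summary)

# Individual entries can be read by their label
print(summary['mean'])  # 8725.0
print(summary['50%'])   # 8650.0, the median
```

Calling df.describe() on the whole data frame produces the same summary for every numeric column.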
If you want to know more about descriptive statistics, please have a look at my blog posts on the topic.
Sometimes we need to rearrange the data. To do this, we can use the sorting features of pandas.
# Sorting by Salary, descending
df = df.sort_values(by='Salary', ascending=False)
# Sorting by Salary, then Emp, both ascending
df = df.sort_values(by=['Salary', 'Emp'], ascending=[True, True])
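The two sorts above can also be combined with a different direction per column. The sketch below is an illustration only: it ranks salaries from highest to lowest and breaks ties alphabetically by name. The variable name ranked and the call to reset_index are additions for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Emp': ['Jane', 'Johny', 'Boby', 'Jon', 'Mary', 'Jony', 'Alice', 'Melica'],
    'Salary': [9500, 7800, 7600, 9500, 7700, 7800, 9900, 10000],
})

# Highest salary first; equal salaries are ordered alphabetically by Emp
ranked = df.sort_values(by=['Salary', 'Emp'], ascending=[False, True])

# sort_values keeps the original row labels; renumber them from 0
ranked = ranked.reset_index(drop=True)
print(ranked)
```

Note that sort_values returns a new data frame and leaves the original untouched, which is why the result is assigned to a variable.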
Data Interpretation using Pivot Table
Pivot tables are one of the most widely used tools for summarizing data. Let's draw some meaning from the data using the pandas pivot_table function.
# Create the data frame
import pandas as pd
df = pd.read_csv("salarydata.csv")
Code: Get Averages of All Numeric Columns Categorized by Gender
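A minimal sketch of this step, assuming a small hypothetical data frame in place of salarydata.csv (the name, gender, age, salary and bonus values are made up for illustration). Recent pandas versions may raise an error when asked to average text columns, so the sketch passes only the numeric columns as values.

```python
import pandas as pd

# Hypothetical stand-in for salarydata.csv
df = pd.DataFrame({
    'name':   ['A', 'B', 'C', 'D', 'E', 'F'],
    'gender': ['F', 'M', 'F', 'M', 'F', 'M'],
    'age':    [30, 50, 48, 40, 54, 48],
    'salary': [5000, 6000, 7000, 5500, 9000, 6500],
    'bonus':  [500, 600, 700, 400, 900, 650],
})

# Collect the numeric columns so text columns such as name are left out
numeric_cols = df.select_dtypes(include='number').columns.tolist()

# One row per gender, one column per numeric field, each cell a mean
avg_by_gender = pd.pivot_table(df, index=['gender'], values=numeric_cols,
                               aggfunc='mean')
print(avg_by_gender)
```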
Code: Average Salary by Gender. By default, the aggregate function is the mean
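A minimal sketch of this step, again assuming a small hypothetical data frame in place of salarydata.csv. It also checks that leaving out aggfunc gives the same result as asking for the mean explicitly.

```python
import pandas as pd

# Hypothetical stand-in for salarydata.csv
df = pd.DataFrame({
    'gender': ['F', 'M', 'F', 'M', 'F', 'M'],
    'age':    [30, 50, 48, 40, 54, 48],
    'salary': [5000, 6000, 7000, 5500, 9000, 6500],
    'bonus':  [500, 600, 700, 400, 900, 650],
})

# No aggfunc given, so pivot_table falls back to the mean
default_avg = pd.pivot_table(df, values=['salary'], index=['gender'])

# The same table with the aggregate function spelled out
explicit_avg = pd.pivot_table(df, values=['salary'], index=['gender'],
                              aggfunc='mean')

print(default_avg)
print(default_avg.equals(explicit_avg))  # True
```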
Code: Minimum Salary by Gender
pd.pivot_table(df, values=['salary'], index=['gender'], aggfunc='min')
Code: Maximum Salary by Gender and Age. Here we group by two categorical fields
pd.pivot_table(df, index=['gender', 'age'], aggfunc='max', values=['salary'])
Code: Average Salary and Bonus by Gender
pd.pivot_table(df, index=['gender'], aggfunc='mean', values=['salary', 'bonus'])
Code: Average Salary and Bonus by Gender: Adding a Filter Condition
df2 = df.loc[df['age'] > 45]
pd.pivot_table(df2, index=['gender'], aggfunc='mean', values=['salary', 'bonus'])
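To see the effect of the filter, here is a small worked sketch. The data frame is hypothetical and stands in for salarydata.csv, and the names older and result are only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for salarydata.csv
df = pd.DataFrame({
    'gender': ['F', 'M', 'F', 'M', 'F', 'M'],
    'age':    [30, 50, 48, 40, 54, 48],
    'salary': [5000, 6000, 7000, 5500, 9000, 6500],
    'bonus':  [500, 600, 700, 400, 900, 650],
})

# Keep only the rows where age is above 45 (four of the six rows)
older = df.loc[df['age'] > 45]

# Aggregate the filtered rows: mean salary and bonus per gender
result = pd.pivot_table(older, index=['gender'], aggfunc='mean',
                        values=['salary', 'bonus'])
print(result)
# F: salary 8000.0, bonus 800.0
# M: salary 6250.0, bonus 625.0
```

Because the filter runs before the pivot, the averages are computed only over the employees older than 45, so they differ from the unfiltered averages.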
In this post, we learned how to write Python code for data interpretation. If you have any questions, please post them in the comment section.