COGS 108 - Final Project

Analysis of contributions towards job automation

Overview

With the United States slated to lose over 1.5 million jobs to automation in the next decade alone, the causes of job automation are a hot topic for many. This project looks at several factors we think contribute to the probability of a job being automated, including annual and state-level income, the skills required for the job, job field growth over time, and the importance of the job to a state's total workforce. We have seen that income does correlate with the likelihood of automation to some extent, and that low-paying jobs contribute more to state-wide automation.

Research Question

How is the further integration of job automation into the United States’ workforce related to employment salaries, skills, and employment rates across various job industries in the past 25 years?

Background and Prior Work

Background

At the outset of our project, our group had an interest in how automation has shaped the US economy, and how it will continue to shape the economy we live in. From fast food restaurants implementing kiosks to cashierless grocery stores, we've seen a wide variety of jobs automated and replaced in recent years. We also saw from the Annual Survey of Manufacturers (ASM) that the number of employees in manufacturing declined from 2008 to 2018. While it is fascinating to see these numbers, we felt there should also be an analysis of the harm automation does to the workers it displaces.

Our group wanted to look further into how the integration of job automation has affected various industries in the job market. According to a blog post from 2012, "the more 'blue collar' you are, the more likely you are to be unemployed." That claim is backed by the U.S. Bureau of Labor Statistics in its Current Population Surveys of 2010 and 2018, which indicate that the less education a worker has attained, the higher the unemployment rate. While education level was a factor that we initially considered potentially important to our research, we did not find relevant datasets for it. In recent years (before the coronavirus outbreak), unemployment decreased significantly, yet the trend held: less education still corresponded to a higher unemployment rate.

It is important to consider how job automation may impact other industries, workplaces, and unemployment rates, and after assessing the data we had, we centered in on this factor. The most similar prior work we found (described further under Prior Work below) looks at the likelihood of automation in comparison with occupation and salary, and it draws on data from ONET, a US government resource for storing data and conducting data analysis on the job market in the United States. That work focuses on nine ONET datasets: Negotiation, Social Perceptiveness, Persuasion, Finger Dexterity, Manual Dexterity, Cramped Work Spaces, Originality, Fine Arts, and Assisting and Caring for Others. By incorporating this wide variety of skill measures alongside income data, the aspect of our study considering income and state automation likelihood acts as an extension of the prior work.

Although our initial findings had some face value, many factors were left unknown. That original research raised only more questions: how many jobs have already been affected by automation, and which jobs are they? Particularly, what is the probability that a job would be automated, and what is the likelihood of automation of the job pertaining to income, state of occupancy, and various job skills? We sought to find out.

Prior Work

The most similar prior work we have found looks at the Likelihood of Automation with comparisons between Occupation and Salary. The analyses done in this research use the same automation probability dataset that we work with in our analysis (Dataset 1). The second dataset used in this study focuses on a handful of data categories regarding income levels, which we have additionally adopted into our data collection for this project (Dataset 2). In that manner, the aspect of our study considering income and state automation likelihood acts as an extension of the prior work.


Hypotheses

We hypothesize:

  • Occupations with higher salaries will have a lower probability of being automated.
  • Jobs with high automation probabilities will show a stagnant or decreasing change in employment rates for the eight-year period of time surrounding the point at which the automation likelihood was estimated.
  • High perception and manipulation skills correlate more to a high probability of automation.
  • High creative and social intelligence scores correlate more with a low probability of automation.

Dataset(s)

Dataset 1:

This dataset includes job title, OCC code (used to categorize), probability of automation, and number of employees per state.

Dataset 2:

This dataset includes job title, OCC code, and various wage metrics per occupation (salary vs hourly, average pay, statistics for these fields, North American Industry Classification System code, employee number data, etc).

Dataset 3:

This dataset includes the number of employees per occupation present per state in any given year.

Dataset 4:

These datasets (collected for years between 2011 and 2019) show the number of employees per occupation by age group, and give the total number of employees per occupation.

Dataset 5:

Dataset 5 collects information from nine datasets on the ONET site: Negotiation, Social Perceptiveness, Persuasion, Finger Dexterity, Manual Dexterity, Cramped Work Spaces, Originality, Fine Arts, and Assisting and Caring for Others. They are all sorted by SOC code and contain 969 observations.

Dataset 6:

Dataset 6 provides the Metropolitan and Non-Metropolitan personal income values per state in the United States from 2008 to 2017.

Dataset 7 (A copy of Dataset 1):

This dataset includes job title, OCC code (used to categorize), probability of automation, and number of employees per state. This dataset was copied and named df_occ to resolve a merging problem that we were having across datasets and analyses.



Datasets 1 and 2 provide information by occupation and will be combined by their OCC code, since it is the most standard key they share (note: some datasets format occupation name strings differently, so the occupation title is not as easily used). Since we mostly want to consider this data in terms of probability of automation, data from Dataset 2 will only be kept if there is a corresponding OCC code in Dataset 1. Dataset 3 is used to turn the number of employees per occupation per state in Dataset 1 into fractions so we can look at relative rather than absolute employment numbers.

The nine datasets from Dataset 4 (corresponding to 2011-2019 data) will be combined in order to calculate the change in employment over that period. This data will be combined with Dataset 1 by OCC code in order to investigate the relationship between job field growth and the likelihood of automation for that field.
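
As a minimal sketch of this OCC-code merging strategy (using small hypothetical frames rather than the real datasets; the column names SOC, Probability, and A_MEAN mirror those used later in the notebook):

import pandas as pd

# Hypothetical miniature versions of Datasets 1 and 2, both keyed by OCC/SOC code
df_auto = pd.DataFrame({'SOC': ['11-1011', '13-2051'], 'Probability': [0.015, 0.23]})
df_wage = pd.DataFrame({'SOC': ['11-1011', '13-2051', '99-9999'], 'A_MEAN': [200000, 85000, 40000]})

# An inner merge keeps only the wage rows whose OCC code also appears in the automation data
merged = pd.merge(df_auto, df_wage, on='SOC', how='inner')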

Setup

In [1]:
# Installs plotly for displaying geospatial graphs
!pip install --user plotly
Requirement already satisfied: plotly in /Users/michaelbaluja/opt/anaconda3/lib/python3.7/site-packages (4.7.1)
Requirement already satisfied: retrying>=1.3.3 in /Users/michaelbaluja/opt/anaconda3/lib/python3.7/site-packages (from plotly) (1.3.3)
Requirement already satisfied: six in /Users/michaelbaluja/opt/anaconda3/lib/python3.7/site-packages (from plotly) (1.14.0)
In [2]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import patsy
import statsmodels.api as sm
import plotly.graph_objects as go
from scipy import stats
from copy import copy
from scipy.stats import normaltest
In [3]:
# Used for building geo-spatial maps
states = ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 
          'District of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois','Indiana', 'Iowa', 'Kansas', 
          'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 
          'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 
          'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 
          'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 
          'West Virginia', 'Wisconsin', 'Wyoming']

states_abbv = ['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA', 
               'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 
               'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 
               'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 
               'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY']

Data Cleaning

How ‘clean’ is the data?

Our data are clean in that they were provided by reputable sources in a format without unnecessary variables, null data, etc. For one of the datasets, some wage information was represented by a * or #, with the * meaning not enough data was available for inclusion and the # meaning the wage exceeded $200,000/yr. These observations were dropped from our analyses.
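
A minimal sketch of dropping those sentinel values, assuming a wage column named A_MEAN as in the cleaning code below (the notebook's own cleaning handles the * case; the # case is shown here the same way):

import pandas as pd

# Hypothetical wage column containing the '*' (insufficient data) and '#' (wage above $200,000/yr) sentinels
wages = pd.DataFrame({'A_MEAN': ['55000', '*', '#', '72000']})

# Drop sentinel rows, then convert the remaining strings to numbers
wages = wages[~wages['A_MEAN'].isin(['*', '#'])]
wages['A_MEAN'] = pd.to_numeric(wages['A_MEAN'])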

In regards to the job skillset data, the job skills datasets we collected from O-NET were already in CSV format with easy-to-understand categories and pre-formulated constructs, so it was easy for us to sort through the data without much cleaning. However, the skills datasets and the automation probability dataset used different occupation titles, which meant we had to merge on a shared code: the Standard Occupational Classification (SOC) system, a federal convention used for occupational data analysis. In order to use both datasets, we merged them on their SOC codes.
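
As a small hypothetical sketch of that SOC-based merge (the frames and values here are made up; stripping a '.00' suffix mirrors the skills cleaning later in the notebook):

import pandas as pd

# Hypothetical O-NET style codes carrying a ".00" suffix that the automation data lacks
skills = pd.DataFrame({'Code': ['11-1011.00', '13-2051.00'], 'Importance': [3.5, 4.1]})
automation = pd.DataFrame({'SOC': ['11-1011', '13-2051'], 'Probability': [0.015, 0.23]})

# Strip the suffix so both frames share the same SOC key, then merge on it
skills['SOC'] = skills['Code'].str.replace('.00', '', regex=False)
merged = pd.merge(skills, automation, on='SOC')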

The annual income data was relatively clean. The only exclusion was that the first and last rows needed to be removed, because they accounted for values for the United States as a whole rather than an individual state.

What did you have to do to get the data into a usable format?

Since some datasets were read in from excel datasheets, it was necessary to clean this data by removing the first few rows (title information), renaming the columns for proper identification, and resetting the index since the first n rows were removed. It was also necessary to remove additional columns that were not related to the dataframe, but were added for structural purposes in the excel datasheet.

Some dataframes required transposing/reshaping in order to more easily work with the data.

As noted above, the skills datasets and the automation probability dataset used different occupation titles, so to use both sets of data we merged them on their shared Standard Occupational Classification (SOC) codes.

For the annual income data, each state was split into "Metropolitan" and "Non-Metropolitan" income levels. We wanted to combine those income levels into one value per state, so we used groupby to sum every consecutive pair of rows across the 102 rows.
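
A minimal sketch of that groupby trick, using made-up numbers: integer-dividing the row index by 2 gives each Metropolitan/Non-Metropolitan pair the same group label, so summing per group collapses the pair into one value per state.

import pandas as pd

# Hypothetical income frame: each state contributes a Metropolitan row followed by a Non-Metropolitan row
income_sketch = pd.DataFrame({'2008': [100, 20, 90, 30], '2017': [120, 25, 100, 35]})

# Rows 0 and 1 map to group 0, rows 2 and 3 map to group 1, and so on
combined = income_sketch.groupby(income_sketch.index // 2).sum()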

What pre-processing steps were required for your methods?

For our state analysis, it was necessary to transpose our data in the beginning, since the variables in the original dataset were now to be used as observations in this new dataset. An additional transformation was made to “normalize” the number of employees in each column by dividing them by the total number of workers per state.
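
As a minimal sketch of that normalization step (hypothetical counts and totals; the real notebook divides each state column of Dataset 1 by that state's total employment):

import pandas as pd

# Hypothetical occupation-by-state employee counts and each state's total workforce
counts = pd.DataFrame({'Occupation': ['Cashiers', 'Nurses'],
                       'California': [5000, 12000],
                       'Nevada': [900, 1500]})
state_totals = {'California': 200000, 'Nevada': 30000}

# Divide each state column by that state's total workforce to get workforce fractions
for state in ['California', 'Nevada']:
    counts[state] = counts[state] / state_totals[state]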

We checked the distribution of variables such as probability of automation, wage, and employment percent change.

For the data in Dataset 4, we needed to drop all non-total employee rows and merge the nine yearly datasets into one set.

For the individual occupation wage analysis, it was necessary to drop most columns; we kept only the OCC code, annual mean wage, and occupation title. The resulting dataset was merged with Dataset 1 by OCC/SOC code so we could easily compare the likelihood of automation with the annual mean wage.

To get our data into an optimal format for the job skillset analysis, we had to change the column titles on a few of the skill datasets, as well as filter out any rows that were listed as "Not Available" in a few of the datasets. Primarily, the dataset titled "Fine Arts" had this issue, which makes sense, since we were measuring our analyses based on the relatability of a skill (e.g., fine arts) to a single job, and many jobs use no degree of fine arts skills. The "Assisting and Caring for Others" dataset also had this issue, as neither the job "Models" nor "Financial Analyst" had any relevance to this trait, so their data was removed. Moreover, the job "Mathematical Technician" had data which, across several datasets, corresponded little to the rest of the data, leading us to believe that row was incomplete, inconsistent, or significantly altered. It was removed from those datasets too.
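
A minimal sketch of that filtering, with hypothetical rows (the column names Occupation and Importance match the skills files, but the values here are illustrative only):

import pandas as pd

# Hypothetical skill frame containing a "Not Available" marker and an occupation we chose to exclude
skills = pd.DataFrame({
    'Occupation': ['Models', 'Registered Nurses', 'Mathematical Technicians'],
    'Importance': ['Not Available', '4.12', '1.00'],
})

# Drop rows with no recorded importance, then drop the excluded occupation and convert to numbers
skills = skills[skills['Importance'] != 'Not Available']
skills = skills[~skills['Occupation'].isin(['Mathematical Technicians'])]
skills['Importance'] = pd.to_numeric(skills['Importance'])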

In [4]:
# Function tidy-izes the data
def organize(df, year):
    '''
    - Removes the first 7 rows from the dataframe (corresponds to title and additional non-data cells from
    excel formatting)
    - Adds column titles back in
    - Drops null rows
    - Trims dataframe to only include the occupation and yearly total columns
    '''
    df = df[7:]
    df = df.rename(columns={df.columns[0]: 'Occupation', df.columns[1]: 'Total{}'.format(year)})
    df.dropna(inplace=True)
    df.reset_index(inplace=True)
    df = df[['Occupation', 'Total{}'.format(year)]]
 
    return df
In [5]:
# Creating the pertinent DataFrames
df_prob = pd.read_csv('datasets/raw_state_automation_data.csv', encoding='cp1252')
df_employment = pd.read_excel('datasets/employmentbystate.xls')
df_income = pd.read_excel('datasets/wagedata.xlsx')
income = pd.read_csv('datasets/PARPI_PORT_2008_2017.csv')
df_occ = pd.read_csv('datasets/raw_state_automation_data.csv', encoding='cp1252')


employment2011 = organize(pd.read_excel('datasets/blsdata/cpsaat11b2011.xlsx'), 2011)
employment2012 = organize(pd.read_excel('datasets/blsdata/cpsaat11b2012.xlsx'), 2012)
employment2013 = organize(pd.read_excel('datasets/blsdata/cpsaat11b2013.xlsx'), 2013)
employment2014 = organize(pd.read_excel('datasets/blsdata/cpsaat11b2014.xlsx'), 2014)
employment2015 = organize(pd.read_excel('datasets/blsdata/cpsaat11b2015.xlsx'), 2015)
employment2016 = organize(pd.read_excel('datasets/blsdata/cpsaat11b2016.xlsx'), 2016)
employment2017 = organize(pd.read_excel('datasets/blsdata/cpsaat11b2017.xlsx'), 2017)
employment2018 = organize(pd.read_excel('datasets/blsdata/cpsaat11b2018.xlsx'), 2018)
employment2019 = organize(pd.read_excel('datasets/blsdata/cpsaat11b2019.xlsx'), 2019)

# Merging the different employment datasets in order to analyze percent change
employment = pd.merge(pd.merge(employment2011, employment2012), \
                      pd.merge(pd.merge(employment2013, pd.merge(employment2014, employment2015)), \
                               pd.merge(pd.merge(employment2016, employment2017), \
                                        pd.merge(employment2018, employment2019))))

# Creating copy of main dataframe to avoid issues with individual cleaning and analyses
df_prob_m = copy(df_prob)
df_prob_g = copy(df_prob)
df_prob_h = copy(df_prob)
df_prob_k = copy(df_prob)

Employment by Occupation

In [6]:
# Structure main dataset
df_prob_m.sort_values(by=['SOC'], inplace=True)

# Standardize Occupation column between datasets
employment['Occupation'] = employment['Occupation'].apply(lambda x: x.title())
df_prob_m['Occupation'] = df_prob_m['Occupation'].apply(lambda x: x.title())

# Create trimmed dataset
df_prob_m_trim = df_prob_m[['SOC', 'Occupation', 'Probability']]

# Include employment info
df_prob_m_trim = pd.merge(df_prob_m_trim, employment)

# Add percent change based on 2012 to 2019 data (Note: 2011 is not used as the baseline because some
## occupations had zero employees that year, causing a division by 0 error)
df_prob_m_trim['percent_change'] = (df_prob_m_trim.Total2019 - df_prob_m_trim.Total2012)/df_prob_m_trim.Total2012

Income by Occupation

In [7]:
#Remove unnecessary income data
df_income = df_income[['OCC_CODE', 'OCC_TITLE', 'A_MEAN']]
df_income.rename(columns={'OCC_CODE':'SOC', 'OCC_TITLE':'Occupation'},inplace=True)

# Create combined probability & income dataset
df_probincome_m = pd.merge(df_prob_m, df_income, how='left', left_on='SOC', right_on='SOC')

# Drop rows if no mean wage info
# * is used to represent occupation with insufficient data
df_probincome_m = df_probincome_m[df_probincome_m.A_MEAN != '*']

# Remove any null income values
df_probincome_m.dropna(inplace=True,subset=['A_MEAN'])
df_probincome_m = df_probincome_m.reset_index()

# Rest of values should be numeric, so transform
df_probincome_m.A_MEAN = pd.to_numeric(df_probincome_m.A_MEAN)

# Add log mean income 
df_probincome_m['log_A_MEAN'] = df_probincome_m.A_MEAN.apply(np.log)

Employment & Automation by State

In [8]:
# Clean employment data by removing excel specific titles & rows, dropping null data
## adding proper column names, dropping unnecessary columns, and typecasting
df_employment = df_employment[5:]
df_employment.dropna(inplace=True)
df_employment = df_employment.rename(columns={df_employment.columns[1]:'State', df_employment.columns[2]:'Employment'})
df_employment.reset_index(inplace=True)
df_employment = df_employment[['State', 'Employment']]
df_employment.Employment = df_employment.Employment.astype(int)

# Reshape data to easily apply later transformation
df_employment = df_employment.transpose()
df_employment.columns = df_employment.iloc[0]
df_employment = df_employment.iloc[1:]

# Transform employment data to reflect employment relative to population
df_prob_m_normed = copy(df_prob_m)
for state in states:
    df_prob_m_normed[state] = df_prob_m_normed[state].apply(lambda x: x/df_employment[state])

# Don't need SOC values, so remove
#df_prob_m.drop(columns=['SOC'],inplace=True)

Income by State

In [9]:
income = income[['GeoName', 'LineCode','2008', '2009', '2010', '2011', '2012', '2013', '2014' , '2015', '2016', '2017']]
income = income[income.LineCode == 1.0]
income = income.loc[income.index > 0]
In [10]:
income = income.reset_index()
income = income.drop('index', 1)
In [11]:
N = 2
totalIn = income.groupby(income.index // N).sum()
In [12]:
totalIn['change'] = totalIn['2017'] - totalIn['2008']
totalIn['change'] = totalIn['change'] / totalIn['2008']
In [13]:
# Only want to look at probability, state data so remove other columns
df_occ.drop(columns=['SOC', 'Occupation'], inplace=True)
In [14]:
# Transform employment data to reflect employment relative to population
for state in states:
    df_occ[state] = df_occ[state].apply(lambda x: x/df_employment[state])
In [15]:
state_likelihood2 = []
for state in states:
    likelihood2 = 0
    for index in range(len(df_occ)):
        likelihood2 += df_occ['Probability'][index] * df_occ[state][index]
    state_likelihood2.append(likelihood2)
In [16]:
statesDF = pd.DataFrame(states)
totalIncomeState = totalIn.merge(statesDF, left_index = True, right_index = True)
totalIncomeState = totalIncomeState.rename(columns={0: "State"})
In [17]:
likelihoodAutomation = pd.DataFrame(state_likelihood2)
incomeANDautomation = likelihoodAutomation.merge(totalIncomeState, left_index=True, right_index=True)
incomeANDautomation = incomeANDautomation.rename(columns={0: "Automation"})

Job Skillset

In [18]:
# Setting up automation data 
automation = pd.read_csv('datasets/raw_state_automation_data.csv', encoding='cp1252')
#removing States data 
automation.drop(automation.columns.difference(['SOC','Occupation', 'Probability']), 1, inplace=True) 
automation['Probability'] = automation['Probability'] * 100
In [19]:
#Loading in all the job skills data
originality = pd.read_csv('datasets/Originality.csv')
negotiation = pd.read_csv('datasets/Negotiation.csv')
social_percept = pd.read_csv('datasets/Social_Perceptiveness.csv')
persuasion = pd.read_csv('datasets/Persuasion.csv')
finger_dext = pd.read_csv('datasets/Finger_Dexterity.csv')
manual_dext = pd.read_csv('datasets/Manual_Dexterity.csv')
cramped_work = pd.read_csv('datasets/Cramped_Work_Space_Awkward_Positions.csv')
fine_arts = pd.read_csv('datasets/Fine_Arts.csv')
df_prob_h = pd.read_csv('datasets/Assisting_and_Caring_for_Others.csv')
In [20]:
#Cleaning and merging 
df_prob_h['Assisting and Caring Importance'] = df_prob_h['Importance']
# Removing doubles "importance" column and level column as it is not used
df_prob_h = df_prob_h.drop(['Importance', 'Level'], axis=1)
df_prob_h['Originality Importance'] = originality['Importance']
df_prob_h['Negotiation Importance'] = negotiation['Importance']
df_prob_h['Social Perception Importance'] = social_percept['Importance']
df_prob_h['Persuasion Importance'] = persuasion['Importance']
df_prob_h['Finger Dexterity Importance'] = finger_dext['Importance']
df_prob_h['Manual Dexterity Importance'] = manual_dext['Importance']
df_prob_h['Cramped Work Context'] = cramped_work['Context']
df_prob_h['Fine Arts Importance'] = fine_arts['Importance']
In [21]:
#Merging the Data Sets
df_prob_h['SOC'] = df_prob_h['Code'].astype(str).replace('\.00', '', regex=True)
df_prob_h = pd.merge(left=df_prob_h, right=automation, left_on='SOC', right_on='SOC')
df_prob_h.drop(['Code', 'Occupation_x'], axis=1, inplace=True)
df_prob_h = df_prob_h.rename(columns = {'Occupation_y':'Occupation'})
In [22]:
# Creating 3 separate categories based upon skill type
df_prob_h['Perception_and_Manipulation'] = df_prob_h[['Finger Dexterity Importance','Manual Dexterity Importance', 
                                       'Cramped Work Context']].mean(axis=1)
df_prob_h['Creative_Intelligence'] = df_prob_h[['Originality Importance','Fine Arts Importance']].mean(axis=1)

df_prob_h['Social_Intelligence'] = df_prob_h[['Social Perception Importance', 'Negotiation Importance', 
                                'Persuasion Importance', 'Assisting and Caring Importance']].mean(axis=1)

Data Analysis & Results

Employment by Occupation

EDA

Distributions

Our probability of automation variable is bounded between 0 and 1 with a bimodal distribution. The peaks occur at the two boundaries. While the left mode has a higher peak, the right mode carries more weight.

Our variable representing the percent change in employment between 2012 and 2019 follows a roughly normal, unimodal distribution with a few minor peaks that do not change the shape drastically. The distribution is slightly right-skewed.

Outliers

While there is a relatively smooth distribution outside of the boundary peaks, there is a non-modal peak in the 0.35-0.40 probability bin.

There is a percent change outlier right around 2.75. This corresponds to Financial Analysts, which grew from 84,000 to 345,000 employees.

Relationship between variables

There is essentially no linear relationship (a nearly horizontal trend) between the probability of automation and the change in employment between 2012 and 2019. There is a stronger linear relationship between log annual income and probability of automation.

In [23]:
# Distribution of probability of automation
sns.distplot(df_prob_m.Probability, bins=20)
plt.xlabel('Probability of Automation')
plt.title('Distribution of Automation Probability', loc='left')
plt.show()
In [24]:
# Distribution of percent change in employment
sns.distplot(df_prob_m_trim.percent_change)
plt.xlabel('Change in Employment')
plt.title('Distribution of Employment Change from 2012-2019', loc='left')
plt.show()
In [25]:
# Investigate outlier
df_prob_m_trim[df_prob_m_trim.percent_change >= 2]
Out[25]:
SOC Occupation Probability Total2011 Total2012 Total2013 Total2014 Total2015 Total2016 Total2017 Total2018 Total2019 percent_change
32 13-2051 Financial Analysts 0.23 84 89 105 261 322 307 309 307 345 2.8764
In [26]:
# Determine normality of percent_change
k2, p = stats.normaltest(df_prob_m_trim.percent_change)
print(p)
1.2612948679979974e-24
In [27]:
# Plot the percent change vs Probability of employment
sns.scatterplot(df_prob_m_trim.Probability, df_prob_m_trim.percent_change)
plt.xlabel('Probability of Automation')
plt.ylabel('Change in Employment')
plt.title('Probability of Automation vs Change in Employment', loc='left')
plt.show()

Analysis

What approaches did you use? Why?

We analyze the relationship between percent change in employment and probability of automation using an OLS linear regression model. This method was chosen to test whether any linear relationship exists between the two variables, even though the exploratory plots suggested the relationship would be weak.
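
As a hedged sketch of this regression setup: the explicit cast to numeric is an assumption added here so that patsy treats percent_change as a continuous predictor with a single slope, rather than expanding it into one dummy per unique value; the variable and column names mirror those used in the cell below.

import pandas as pd
import patsy
import statsmodels.api as sm

def fit_probability_on_change(df):
    # Cast the predictor to numeric so the formula fits a single slope (assumption: the column may be object-typed)
    df = df.assign(percent_change=pd.to_numeric(df['percent_change']))
    outcome, predictors = patsy.dmatrices('Probability ~ percent_change', df)
    return sm.OLS(outcome, predictors).fit()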

In [28]:
# OLS Regression for Probability and percent change
# NOTE: Data does not meet requirements necessary for testing linearity, but want to take a look 

outcome, predictors = patsy.dmatrices('Probability ~ percent_change', df_prob_m_trim)
model = sm.OLS(outcome, predictors)
results = model.fit()

print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            Probability   R-squared:                       0.853
Model:                            OLS   Adj. R-squared:                 -0.150
Method:                 Least Squares   F-statistic:                    0.8500
Date:                Wed, 10 Jun 2020   Prob (F-statistic):              0.761
Time:                        23:23:30   Log-Likelihood:                 140.34
No. Observations:                 274   AIC:                             197.3
Df Residuals:                      35   BIC:                             1061.
Df Model:                         238                                         
Covariance Type:            nonrobust                                         
============================================================================================================
                                               coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------------------
Intercept                                    0.5025      0.166      3.034      0.005       0.166       0.839
percent_change[T.-0.7857142857142857]        0.1075      0.438      0.245      0.808      -0.782       0.997
percent_change[T.-0.7636363636363637]        0.4875      0.438      1.113      0.273      -0.402       1.377
percent_change[T.-0.75]                      0.3175      0.438      0.725      0.473      -0.572       1.207
percent_change[T.-0.71875]                   0.3175      0.438      0.725      0.473      -0.572       1.207
percent_change[T.-0.7142857142857143]       -0.4435      0.438     -1.012      0.318      -1.333       0.446
percent_change[T.-0.6666666666666666]        0.1475      0.438      0.337      0.738      -0.742       1.037
percent_change[T.-0.6]                       0.4275      0.331      1.291      0.205      -0.245       1.100
percent_change[T.-0.5882352941176471]       -0.0925      0.438     -0.211      0.834      -0.982       0.797
percent_change[T.-0.5789473684210527]        0.4375      0.438      0.999      0.325      -0.452       1.327
percent_change[T.-0.5702479338842975]        0.3875      0.438      0.884      0.383      -0.502       1.277
percent_change[T.-0.5454545454545454]        0.4675      0.438      1.067      0.293      -0.422       1.357
percent_change[T.-0.5210084033613446]        0.3075      0.438      0.702      0.487      -0.582       1.197
percent_change[T.-0.5]                       0.4475      0.331      1.351      0.185      -0.225       1.120
percent_change[T.-0.4948453608247423]        0.4875      0.438      1.113      0.273      -0.402       1.377
percent_change[T.-0.4827586206896552]       -0.4385      0.438     -1.001      0.324      -1.328       0.451
percent_change[T.-0.4368932038834951]        0.4475      0.438      1.021      0.314      -0.442       1.337
percent_change[T.-0.42857142857142855]      -0.4225      0.438     -0.964      0.342      -1.312       0.467
percent_change[T.-0.4186046511627907]       -0.4845      0.438     -1.106      0.276      -1.374       0.405
percent_change[T.-0.4006849315068493]        0.4675      0.438      1.067      0.293      -0.422       1.357
percent_change[T.-0.4]                       0.4775      0.438      1.090      0.283      -0.412       1.367
percent_change[T.-0.375]                     0.4675      0.438      1.067      0.293      -0.422       1.357
percent_change[T.-0.3684210526315789]       -0.3425      0.438     -0.782      0.440      -1.232       0.547
percent_change[T.-0.36363636363636365]       0.2325      0.331      0.702      0.487      -0.440       0.905
percent_change[T.-0.36]                      0.1475      0.438      0.337      0.738      -0.742       1.037
percent_change[T.-0.35135135135135137]       0.4475      0.438      1.021      0.314      -0.442       1.337
percent_change[T.-0.34375]                  -0.4989      0.438     -1.139      0.263      -1.388       0.391
percent_change[T.-0.3383084577114428]        0.3275      0.438      0.747      0.460      -0.562       1.217
percent_change[T.-0.3333333333333333]       -0.0775      0.331     -0.234      0.816      -0.750       0.595
percent_change[T.-0.3116883116883117]        0.4575      0.438      1.044      0.304      -0.432       1.347
percent_change[T.-0.3103448275862069]       -0.1125      0.438     -0.257      0.799      -1.002       0.777
percent_change[T.-0.302158273381295]         0.4675      0.438      1.067      0.293      -0.422       1.357
percent_change[T.-0.3]                       0.3375      0.438      0.770      0.446      -0.552       1.227
percent_change[T.-0.29411764705882354]       0.2775      0.438      0.633      0.531      -0.612       1.167
percent_change[T.-0.2857142857142857]       -0.4625      0.331     -1.396      0.171      -1.135       0.210
percent_change[T.-0.2756756756756757]        0.2075      0.438      0.474      0.639      -0.682       1.097
percent_change[T.-0.2727272727272727]       -0.4325      0.438     -0.987      0.330      -1.322       0.457
percent_change[T.-0.2692307692307692]       -0.4115      0.438     -0.939      0.354      -1.301       0.478
percent_change[T.-0.26666666666666666]       0.4875      0.438      1.113      0.273      -0.402       1.377
percent_change[T.-0.25892857142857145]      -0.4981      0.438     -1.137      0.263      -1.388       0.391
percent_change[T.-0.258160237388724]         0.4875      0.438      1.113      0.273      -0.402       1.377
percent_change[T.-0.2545454545454545]        0.3875      0.438      0.884      0.383      -0.502       1.277
percent_change[T.-0.25]                     -0.0290      0.331     -0.088      0.931      -0.701       0.643
percent_change[T.-0.23076923076923078]      -0.1261      0.331     -0.381      0.706      -0.798       0.546
percent_change[T.-0.22631578947368422]       0.4775      0.438      1.090      0.283      -0.412       1.367
percent_change[T.-0.2]                       0.2775      0.331      0.838      0.408      -0.395       0.950
percent_change[T.-0.19444444444444445]       0.4175      0.438      0.953      0.347      -0.472       1.307
percent_change[T.-0.18181818181818182]       0.4675      0.438      1.067      0.293      -0.422       1.357
percent_change[T.-0.17391304347826086]       0.1325      0.331      0.400      0.692      -0.540       0.805
percent_change[T.-0.1724137931034483]       -0.4875      0.438     -1.113      0.273      -1.377       0.402
percent_change[T.-0.16666666666666666]      -0.0635      0.331     -0.192      0.849      -0.736       0.609
percent_change[T.-0.15384615384615385]       0.1025      0.331      0.310      0.759      -0.570       0.775
percent_change[T.-0.14754098360655737]      -0.0125      0.438     -0.028      0.977      -0.902       0.877
percent_change[T.-0.13414634146341464]      -0.4265      0.438     -0.973      0.337      -1.316       0.463
percent_change[T.-0.13043478260869565]       0.4475      0.438      1.021      0.314      -0.442       1.337
percent_change[T.-0.11946902654867257]      -0.4725      0.438     -1.078      0.288      -1.362       0.417
percent_change[T.-0.11764705882352941]      -0.1125      0.438     -0.257      0.799      -1.002       0.777
percent_change[T.-0.11612903225806452]      -0.3225      0.438     -0.736      0.467      -1.212       0.567
percent_change[T.-0.1111111111111111]        0.4275      0.438      0.976      0.336      -0.462       1.317
percent_change[T.-0.11016949152542373]       0.1275      0.438      0.291      0.773      -0.762       1.017
percent_change[T.-0.10714285714285714]       0.0175      0.331      0.053      0.958      -0.655       0.690
percent_change[T.-0.0967741935483871]        0.3975      0.438      0.907      0.370      -0.492       1.287
percent_change[T.-0.09345794392523364]       0.4875      0.438      1.113      0.273      -0.402       1.377
percent_change[T.-0.09208523592085235]      -0.4185      0.438     -0.955      0.346      -1.308       0.471
percent_change[T.-0.09090909090909091]      -0.2925      0.438     -0.668      0.509      -1.182       0.597
percent_change[T.-0.07768361581920905]      -0.4885      0.438     -1.115      0.273      -1.378       0.401
percent_change[T.-0.07692307692307693]       0.3875      0.438      0.884      0.383      -0.502       1.277
percent_change[T.-0.07547169811320754]      -0.4475      0.438     -1.021      0.314      -1.337       0.442
percent_change[T.-0.07272727272727272]       0.4375      0.438      0.999      0.325      -0.452       1.327
percent_change[T.-0.07142857142857142]       0.1975      0.438      0.451      0.655      -0.692       1.087
percent_change[T.-0.07063753367255313]       0.4175      0.438      0.953      0.347      -0.472       1.307
percent_change[T.-0.06666666666666667]       0.4775      0.438      1.090      0.283      -0.412       1.367
percent_change[T.-0.0660377358490566]        0.4275      0.438      0.976      0.336      -0.462       1.317
percent_change[T.-0.06382978723404255]       0.3275      0.438      0.747      0.460      -0.562       1.217
percent_change[T.-0.0584958217270195]        0.4075      0.438      0.930      0.359      -0.482       1.297
percent_change[T.-0.05825242718446602]      -0.1925      0.438     -0.439      0.663      -1.082       0.697
percent_change[T.-0.055415617128463476]      0.1475      0.438      0.337      0.738      -0.742       1.037
percent_change[T.-0.05416666666666667]      -0.0225      0.438     -0.051      0.959      -0.912       0.867
percent_change[T.-0.047619047619047616]      0.4075      0.438      0.930      0.359      -0.482       1.297
percent_change[T.-0.045454545454545456]      0.0475      0.438      0.108      0.914      -0.842       0.937
percent_change[T.-0.043478260869565216]     -0.4775      0.438     -1.090      0.283      -1.367       0.412
percent_change[T.-0.04048964218455744]       0.4375      0.438      0.999      0.325      -0.452       1.327
percent_change[T.-0.03389312977099237]       0.4675      0.438      1.067      0.293      -0.422       1.357
percent_change[T.-0.03368421052631579]       0.4575      0.438      1.044      0.304      -0.432       1.347
percent_change[T.-0.02768729641693811]      -0.4335      0.438     -0.989      0.329      -1.323       0.456
percent_change[T.-0.025830258302583026]      0.2675      0.438      0.611      0.545      -0.622       1.157
percent_change[T.-0.023809523809523808]      0.1975      0.438      0.451      0.655      -0.692       1.087
percent_change[T.-0.018867924528301886]      0.0875      0.438      0.200      0.843      -0.802       0.977
percent_change[T.-0.017857142857142856]      0.3375      0.438      0.770      0.446      -0.552       1.227
percent_change[T.-0.011049723756906077]      0.1475      0.438      0.337      0.738      -0.742       1.037
percent_change[T.-0.007936507936507936]     -0.4885      0.438     -1.115      0.273      -1.378       0.401
percent_change[T.-0.006535947712418301]      0.2075      0.438      0.474      0.639      -0.682       1.097
percent_change[T.-0.005639097744360902]     -0.3425      0.438     -0.782      0.440      -1.232       0.547
percent_change[T.-0.0015446400988569664]    -0.2225      0.438     -0.508      0.615      -1.112       0.667
percent_change[T.0.0]                        0.1960      0.206      0.952      0.348      -0.222       0.614
percent_change[T.0.0038314176245210726]      0.4775      0.438      1.090      0.283      -0.412       1.367
percent_change[T.0.009009009009009009]      -0.4365      0.438     -0.996      0.326      -1.326       0.453
percent_change[T.0.011190233977619531]      -0.4315      0.438     -0.985      0.332      -1.321       0.458
percent_change[T.0.012254901960784314]      -0.4944      0.438     -1.128      0.267      -1.384       0.395
percent_change[T.0.012354152367879203]       0.1875      0.438      0.428      0.671      -0.702       1.077
percent_change[T.0.014994232987312572]       0.0875      0.438      0.200      0.843      -0.802       0.977
percent_change[T.0.015384615384615385]       0.3275      0.438      0.747      0.460      -0.562       1.217
percent_change[T.0.016129032258064516]       0.3975      0.438      0.907      0.370      -0.492       1.287
percent_change[T.0.017094017094017096]       0.1075      0.438      0.245      0.808      -0.782       0.997
percent_change[T.0.019417475728155338]       0.4875      0.438      1.113      0.273      -0.402       1.377
percent_change[T.0.024691358024691357]       0.3675      0.438      0.839      0.407      -0.522       1.257
percent_change[T.0.025]                     -0.1625      0.438     -0.371      0.713      -1.052       0.727
percent_change[T.0.02666666666666667]       -0.4705      0.438     -1.074      0.290      -1.360       0.419
percent_change[T.0.03225806451612903]        0.2675      0.438      0.611      0.545      -0.622       1.157
percent_change[T.0.0364963503649635]         0.0075      0.438      0.017      0.986      -0.882       0.897
percent_change[T.0.03954802259887006]       -0.0125      0.438     -0.028      0.977      -0.902       0.877
percent_change[T.0.040605643496214726]       0.1375      0.438      0.314      0.755      -0.752       1.027
percent_change[T.0.040880503144654086]       0.1775      0.438      0.405      0.688      -0.712       1.067
percent_change[T.0.041228779304769606]       0.4575      0.438      1.044      0.304      -0.432       1.347
percent_change[T.0.04128440366972477]       -0.4725      0.438     -1.078      0.288      -1.362       0.417
percent_change[T.0.043478260869565216]       0.1775      0.438      0.405      0.688      -0.712       1.067
percent_change[T.0.04455445544554455]       -0.4865      0.438     -1.110      0.274      -1.376       0.403
percent_change[T.0.046296296296296294]       0.3675      0.438      0.839      0.407      -0.522       1.257
percent_change[T.0.04950495049504951]       -0.4725      0.438     -1.078      0.288      -1.362       0.417
percent_change[T.0.04980842911877394]       -0.4725      0.438     -1.078      0.288      -1.362       0.417
percent_change[T.0.05172413793103448]       -0.0440      0.331     -0.133      0.895      -0.716       0.628
percent_change[T.0.05194805194805195]       -0.4986      0.438     -1.138      0.263      -1.388       0.391
percent_change[T.0.05263157894736842]        0.0275      0.438      0.063      0.950      -0.862       0.917
percent_change[T.0.056179775280898875]      -0.4815      0.438     -1.099      0.279      -1.371       0.408
percent_change[T.0.056418642681929684]       0.2175      0.438      0.496      0.623      -0.672       1.107
percent_change[T.0.058823529411764705]      -0.4875      0.438     -1.113      0.273      -1.377       0.402
percent_change[T.0.06220095693779904]        0.4375      0.438      0.999      0.325      -0.452       1.327
percent_change[T.0.0633147113594041]         0.4275      0.438      0.976      0.336      -0.462       1.317
percent_change[T.0.07407407407407407]        0.4775      0.438      1.090      0.283      -0.412       1.367
percent_change[T.0.07570977917981073]       -0.3325      0.438     -0.759      0.453      -1.222       0.557
percent_change[T.0.07692307692307693]        0.4775      0.438      1.090      0.283      -0.412       1.367
percent_change[T.0.07758620689655173]        0.4475      0.438      1.021      0.314      -0.442       1.337
percent_change[T.0.07796610169491526]       -0.3325      0.438     -0.759      0.453      -1.222       0.557
percent_change[T.0.0783132530120482]         0.3875      0.438      0.884      0.383      -0.502       1.277
percent_change[T.0.08099173553719008]       -0.4675      0.438     -1.067      0.293      -1.357       0.422
percent_change[T.0.08108108108108109]       -0.1325      0.438     -0.302      0.764      -1.022       0.757
percent_change[T.0.08152173913043478]        0.1275      0.438      0.291      0.773      -0.762       1.017
percent_change[T.0.08173076923076923]       -0.4645      0.438     -1.060      0.296      -1.354       0.425
percent_change[T.0.08340573414422242]       -0.4275      0.438     -0.976      0.336      -1.317       0.462
percent_change[T.0.0851063829787234]         0.2175      0.438      0.496      0.623      -0.672       1.107
percent_change[T.0.08888888888888889]        0.2175      0.438      0.496      0.623      -0.672       1.107
percent_change[T.0.09090909090909091]        0.3275      0.438      0.747      0.460      -0.562       1.217
percent_change[T.0.09206349206349207]       -0.1325      0.438     -0.302      0.764      -1.022       0.757
percent_change[T.0.0981012658227848]         0.2275      0.438      0.519      0.607      -0.662       1.117
percent_change[T.0.1]                       -0.4815      0.438     -1.099      0.279      -1.371       0.408
percent_change[T.0.10185185185185185]        0.4175      0.438      0.953      0.347      -0.472       1.307
percent_change[T.0.10227272727272728]       -0.2525      0.438     -0.576      0.568      -1.142       0.637
percent_change[T.0.10344827586206896]       -0.4986      0.438     -1.138      0.263      -1.388       0.391
percent_change[T.0.1111111111111111]        -0.0252      0.331     -0.076      0.940      -0.698       0.647
percent_change[T.0.11247216035634744]        0.0575      0.438      0.131      0.896      -0.832       0.947
percent_change[T.0.11274787535410764]        0.4375      0.438      0.999      0.325      -0.452       1.327
percent_change[T.0.11403508771929824]        0.0675      0.438      0.154      0.878      -0.822       0.957
percent_change[T.0.1232876712328767]        -0.4035      0.438     -0.921      0.363      -1.293       0.486
percent_change[T.0.125]                      0.4375      0.438      0.999      0.325      -0.452       1.327
percent_change[T.0.12605042016806722]       -0.4855      0.438     -1.108      0.275      -1.375       0.404
percent_change[T.0.1262135922330097]         0.2675      0.438      0.611      0.545      -0.622       1.157
percent_change[T.0.12765217391304348]       -0.4935      0.438     -1.126      0.268      -1.383       0.396
percent_change[T.0.1326530612244898]         0.3975      0.438      0.907      0.370      -0.492       1.287
percent_change[T.0.13821138211382114]        0.3175      0.438      0.725      0.473      -0.572       1.207
percent_change[T.0.14285714285714285]       -0.1625      0.438     -0.371      0.713      -1.052       0.727
percent_change[T.0.1457286432160804]         0.3875      0.438      0.884      0.383      -0.502       1.277
percent_change[T.0.14915966386554622]        0.1475      0.438      0.337      0.738      -0.742       1.037
percent_change[T.0.15115207373271888]       -0.4195      0.438     -0.957      0.345      -1.309       0.470
percent_change[T.0.15254237288135594]       -0.4990      0.438     -1.139      0.263      -1.388       0.391
percent_change[T.0.15726495726495726]       -0.4952      0.438     -1.130      0.266      -1.385       0.394
percent_change[T.0.15822784810126583]        0.0375      0.438      0.086      0.932      -0.852       0.927
percent_change[T.0.16666666666666666]       -0.0510      0.331     -0.154      0.879      -0.723       0.621
percent_change[T.0.16870876531573986]       -0.4675      0.438     -1.067      0.293      -1.357       0.422
percent_change[T.0.16923076923076924]       -0.4675      0.438     -1.067      0.293      -1.357       0.422
percent_change[T.0.16956521739130434]        0.4775      0.438      1.090      0.283      -0.412       1.367
percent_change[T.0.1761786600496278]        -0.4025      0.438     -0.919      0.365      -1.292       0.487
percent_change[T.0.1875]                     0.3275      0.438      0.747      0.460      -0.562       1.217
percent_change[T.0.19230769230769232]       -0.4905      0.438     -1.119      0.271      -1.380       0.399
percent_change[T.0.19767441860465115]       -0.4535      0.438     -1.035      0.308      -1.343       0.436
percent_change[T.0.2052689352360044]        -0.4983      0.438     -1.137      0.263      -1.388       0.391
percent_change[T.0.20909090909090908]       -0.4055      0.438     -0.925      0.361      -1.295       0.484
percent_change[T.0.21052631578947367]       -0.2025      0.438     -0.462      0.647      -1.092       0.687
percent_change[T.0.21296296296296297]       -0.3625      0.438     -0.827      0.414      -1.252       0.527
percent_change[T.0.2180028129395218]         0.3675      0.438      0.839      0.407      -0.522       1.257
percent_change[T.0.21875]                   -0.4915      0.438     -1.122      0.270      -1.381       0.398
percent_change[T.0.2235294117647059]        -0.4645      0.438     -1.060      0.296      -1.354       0.425
percent_change[T.0.22535211267605634]       -0.4855      0.438     -1.108      0.275      -1.375       0.404
percent_change[T.0.22897800776196636]       -0.3725      0.438     -0.850      0.401      -1.262       0.517
percent_change[T.0.2328767123287671]        -0.4961      0.438     -1.132      0.265      -1.386       0.393
percent_change[T.0.24242424242424243]        0.4175      0.438      0.953      0.347      -0.472       1.307
percent_change[T.0.24308755760368664]        0.3675      0.438      0.839      0.407      -0.522       1.257
percent_change[T.0.25]                      -0.1525      0.438     -0.348      0.730      -1.042       0.737
percent_change[T.0.2553191489361702]         0.3575      0.438      0.816      0.420      -0.532       1.247
percent_change[T.0.25688073394495414]        0.2975      0.438      0.679      0.502      -0.592       1.187
percent_change[T.0.25862068965517243]       -0.4875      0.438     -1.113      0.273      -1.377       0.402
percent_change[T.0.2692307692307692]        -0.2925      0.438     -0.668      0.509      -1.182       0.597
percent_change[T.0.27075812274368233]        0.4375      0.438      0.999      0.325      -0.452       1.327
percent_change[T.0.27607361963190186]        0.1775      0.438      0.405      0.688      -0.712       1.067
percent_change[T.0.2777777777777778]         0.0229      0.287      0.080      0.937      -0.559       0.605
percent_change[T.0.2833333333333333]        -0.4855      0.438     -1.108      0.275      -1.375       0.404
percent_change[T.0.2937853107344633]        -0.4445      0.438     -1.014      0.317      -1.334       0.445
percent_change[T.0.2962962962962963]        -0.4984      0.438     -1.137      0.263      -1.388       0.391
percent_change[T.0.30303030303030304]       -0.3625      0.438     -0.827      0.414      -1.252       0.527
percent_change[T.0.3047034764826176]         0.0475      0.438      0.108      0.914      -0.842       0.937
percent_change[T.0.3150684931506849]         0.1575      0.438      0.359      0.721      -0.732       1.047
percent_change[T.0.3208092485549133]        -0.3525      0.438     -0.804      0.427      -1.242       0.537
percent_change[T.0.3268156424581006]        -0.4835      0.438     -1.103      0.277      -1.373       0.406
percent_change[T.0.3286573146292585]        -0.4960      0.438     -1.132      0.265      -1.385       0.394
percent_change[T.0.3333333333333333]         0.4775      0.438      1.090      0.283      -0.412       1.367
percent_change[T.0.33962264150943394]        0.4775      0.438      1.090      0.283      -0.412       1.367
percent_change[T.0.34615384615384615]       -0.0725      0.438     -0.165      0.870      -0.962       0.817
percent_change[T.0.36134453781512604]        0.2375      0.438      0.542      0.591      -0.652       1.127
percent_change[T.0.38927738927738925]       -0.2025      0.438     -0.462      0.647      -1.092       0.687
percent_change[T.0.4]                        0.3975      0.438      0.907      0.370      -0.492       1.287
percent_change[T.0.4146341463414634]         0.3175      0.438      0.725      0.473      -0.572       1.207
percent_change[T.0.4318181818181818]        -0.4025      0.438     -0.919      0.365      -1.292       0.487
percent_change[T.0.4330357142857143]        -0.4970      0.438     -1.134      0.264      -1.386       0.393
percent_change[T.0.44075829383886256]       -0.4815      0.438     -1.099      0.279      -1.371       0.408
percent_change[T.0.4444444444444444]         0.4175      0.438      0.953      0.347      -0.472       1.307
percent_change[T.0.4576719576719577]         0.0775      0.438      0.177      0.861      -0.812       0.967
percent_change[T.0.4787310742609949]         0.3775      0.438      0.862      0.395      -0.512       1.267
percent_change[T.0.49206349206349204]       -0.4958      0.438     -1.132      0.266      -1.385       0.394
percent_change[T.0.49748743718592964]       -0.4225      0.438     -0.964      0.342      -1.312       0.467
percent_change[T.0.5]                       -0.2791      0.287     -0.973      0.337      -0.861       0.303
percent_change[T.0.5277777777777778]         0.1075      0.438      0.245      0.808      -0.782       0.997
percent_change[T.0.5371900826446281]        -0.4805      0.438     -1.097      0.280      -1.370       0.409
percent_change[T.0.5761589403973509]        -0.3725      0.438     -0.850      0.401      -1.262       0.517
percent_change[T.0.6]                        0.4575      0.438      1.044      0.304      -0.432       1.347
percent_change[T.0.625]                     -0.1125      0.438     -0.257      0.799      -1.002       0.777
percent_change[T.0.6382978723404256]        -0.4905      0.438     -1.119      0.271      -1.380       0.399
percent_change[T.0.6536312849162011]         0.3175      0.438      0.725      0.473      -0.572       1.207
percent_change[T.0.6666666666666666]         0.2575      0.331      0.777      0.442      -0.415       0.930
percent_change[T.0.684931506849315]          0.1075      0.438      0.245      0.808      -0.782       0.997
percent_change[T.0.8]                        0.1475      0.438      0.337      0.738      -0.742       1.037
percent_change[T.0.8125]                     0.1575      0.438      0.359      0.721      -0.732       1.047
percent_change[T.0.8873239436619719]         0.4375      0.438      0.999      0.325      -0.452       1.327
percent_change[T.1.0]                       -0.1325      0.438     -0.302      0.764      -1.022       0.757
percent_change[T.1.0555555555555556]         0.4075      0.438      0.930      0.359      -0.482       1.297
percent_change[T.1.148936170212766]         -0.2825      0.438     -0.645      0.523      -1.172       0.607
percent_change[T.1.3333333333333333]         0.3675      0.438      0.839      0.407      -0.522       1.257
percent_change[T.1.3511904761904763]         0.3875      0.438      0.884      0.383      -0.502       1.277
percent_change[T.1.5]                       -0.4655      0.438     -1.062      0.295      -1.355       0.424
percent_change[T.1.5714285714285714]         0.3675      0.438      0.839      0.407      -0.522       1.257
percent_change[T.2.8764044943820224]        -0.2725      0.438     -0.622      0.538      -1.162       0.617
==============================================================================
Omnibus:                       60.489   Durbin-Watson:                   1.967
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              441.758
Skew:                          -0.635   Prob(JB):                     1.18e-96
Kurtosis:                       9.089   Cond. No.                         106.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

What were the results?

Our analysis of the relationship between percent change in employment and the probability of automation produced poor results. Due to the bimodal distribution of the probability of automation, our data were not properly distributed for this type of regression. Further attempts to normalize the data were not successful, nor were further regression attempts.

What was your interpretation of these findings?

Although the relationship between employment percent change and automation did not prove conclusive, we interpreted these results to show that the likelihood of automation is a complex outcome with many contributing factors, and as such we could not analyze it by looking at one possible contributor alone. This led to a breadth of comparisons being done in order to better understand the different factors contributing to the likelihood of automation.

Income Analysis

Income Change by State

EDA

Distributions

The variable "percent change in income" consists of a unimodal distribution with the peak occuring between 0.1 and 0.2 percent change in income.

The variable "automation probability" consists of a bimodal distribution with the highest peak occuring close to 0.4 probability, and the second peak occuring at close to 0.3 probability.

Outliers

There was one outlier in the analysis of automation likelihood vs. percent change in income: District of Columbia, Automation: 0.299355, Income Percent Change: 0.364582.

Relationship Between Variables

There is a negative relationship between the likelihood of automation in a state and the percent change in income in a state.

In [29]:
sns.distplot(totalIn.change)
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2530a690>
In [30]:
sns.distplot(incomeANDautomation.Automation)
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a25343510>

Analysis: Part 1

In [31]:
plt.figure(figsize=(8,8))
sns.scatterplot(incomeANDautomation.Automation, incomeANDautomation.change)
ax = sns.regplot(x="Automation", y="change", data=incomeANDautomation)
ax.set(title='Job Automation Likelihood vs. Percent Change in Income per State from 2008 to 2017', xlabel='Job Automation Likelihood', ylabel='Percent Change in Income')
plt.show()

What approaches did you use? Why?

We analyze the relationship between percent change in income and probability of automation using a linear regression model. This approach was used because there was a clear downward trend in the data.

In [32]:
outcome2, predictors2 = patsy.dmatrices('Automation ~ change', incomeANDautomation)
model2 = sm.OLS(outcome2, predictors2)
results2 = model2.fit()

print(results2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             Automation   R-squared:                       0.104
Model:                            OLS   Adj. R-squared:                  0.085
Method:                 Least Squares   F-statistic:                     5.671
Date:                Wed, 10 Jun 2020   Prob (F-statistic):             0.0212
Time:                        23:23:31   Log-Likelihood:                 117.09
No. Observations:                  51   AIC:                            -230.2
Df Residuals:                      49   BIC:                            -226.3
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.4151      0.010     42.311      0.000       0.395       0.435
change        -0.1146      0.048     -2.381      0.021      -0.211      -0.018
==============================================================================
Omnibus:                        2.141   Durbin-Watson:                   1.684
Prob(Omnibus):                  0.343   Jarque-Bera (JB):                1.256
Skew:                          -0.284   Prob(JB):                        0.534
Kurtosis:                       3.518   Cond. No.                         14.3
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

What were the results?

With an $\alpha$ of 0.05 and a p-value of 0.021, the negative relationship between a state's automation likelihood and its percent change in income is statistically significant. However, the adjusted $R^2$ of 0.085 indicates that percent change in income explains only a small share of the variation in automation likelihood.

What were your interpretations of these findings?

Since the estimated coefficient is negative, states with higher automation likelihood tend to show smaller percent changes in income; in other words, as automation likelihood increases, income growth tends to decrease. Because this is an observational correlation, it does not by itself establish that automation causes slower income growth.
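
As a quick sanity check on the sign of this relationship, the correlation can also be computed directly. This is a minimal sketch, assuming the incomeANDautomation dataframe built in the earlier cells and the notebook's existing imports:

from scipy.stats import pearsonr

# Pearson correlation between state automation likelihood and income percent change;
## a negative r is consistent with the negative OLS slope above
r, p = pearsonr(incomeANDautomation.Automation, incomeANDautomation.change)
print('r = {:.3f}, p = {:.3f}'.format(r, p))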

Analysis: Part 2

In [33]:
automation1of3 = incomeANDautomation.loc[incomeANDautomation.Automation <= 0.35]
automation2of3 = incomeANDautomation.loc[(incomeANDautomation.Automation > 0.35) & (incomeANDautomation.Automation <= 0.4)]
automation3of3 = incomeANDautomation.loc[incomeANDautomation.Automation > 0.4]
In [34]:
plt.figure(figsize=(8,8))
automation1of3 = automation1of3.assign(Location='low')
automation2of3 = automation2of3.assign(Location='medium')
automation3of3 = automation3of3.assign(Location='high')
cdf = pd.concat([automation1of3, automation2of3, automation3of3]) 
ax = sns.boxplot(x="Location", y="change", data=cdf)
ax.set(title='Job Automation Likelihood vs. Income Change',xlabel='Job Automation Likelihood', ylabel='Income Change')

plt.show()

What approaches did you use? Why?

We split the data by Job Automation Likelihood into three groups: low probability (p <= 0.35), medium probability (0.35 < p <= 0.4), and high probability (p > 0.4). Grouping the automation values this way makes the analysis easier to interpret and gives us a better sense of which ranges of automation likelihood correspond to larger changes in income.

We then analyzed the relationship between percent change in income and probability of automation within these groups using linear regression models. We chose this approach after noticing in the original scatterplot that the trend is downward in the medium section but slightly upward in the high section, and we wanted to explore whether the medium and high likelihoods of job automation behave differently. The "low" group contains only a single data point, our outlier, so this breakdown also let us analyze everything beyond the outlier.
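
An equivalent, slightly more idiomatic way to form the same three groups would be pandas' pd.cut. This is a sketch, assuming the same incomeANDautomation dataframe and the same thresholds used above:

# Bin automation likelihood into the same low/medium/high groups used above
bins = [0, 0.35, 0.4, 1.0]
labels = ['low', 'medium', 'high']
incomeANDautomation['Location'] = pd.cut(incomeANDautomation.Automation, bins=bins, labels=labels)

pd.cut keeps the group labels attached to the original dataframe, which avoids concatenating separately filtered copies.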

In [35]:
sns.scatterplot(automation2of3.Automation, automation2of3.change)
ax1 = sns.regplot(x="Automation", y="change", data=automation2of3)
ax1.set(xlabel='Automation Likelihood', ylabel='Percent Change in Income')
plt.show()
In [36]:
outcome3, predictors3 = patsy.dmatrices('Automation ~ change', automation2of3)
model3 = sm.OLS(outcome3, predictors3)
results3 = model3.fit()

print(results3.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             Automation   R-squared:                       0.038
Model:                            OLS   Adj. R-squared:                  0.004
Method:                 Least Squares   F-statistic:                     1.112
Date:                Wed, 10 Jun 2020   Prob (F-statistic):              0.301
Time:                        23:23:31   Log-Likelihood:                 87.546
No. Observations:                  30   AIC:                            -171.1
Df Residuals:                      28   BIC:                            -168.3
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.3871      0.007     55.211      0.000       0.373       0.401
change        -0.0353      0.033     -1.055      0.301      -0.104       0.033
==============================================================================
Omnibus:                        2.246   Durbin-Watson:                   1.315
Prob(Omnibus):                  0.325   Jarque-Bera (JB):                1.948
Skew:                          -0.519   Prob(JB):                        0.377
Kurtosis:                       2.305   Cond. No.                         14.1
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [37]:
sns.scatterplot(automation3of3.Automation, automation3of3.change)
ax3 = sns.regplot(x="Automation", y="change", data=automation3of3)
ax3.set(xlabel='Automation Likelihood', ylabel='Percent Change in Income')
plt.show()
In [38]:
outcome4, predictors4 = patsy.dmatrices('Automation ~ change', automation3of3)
model4 = sm.OLS(outcome4, predictors4)
results4 = model4.fit()

print(results4.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             Automation   R-squared:                       0.017
Model:                            OLS   Adj. R-squared:                 -0.038
Method:                 Least Squares   F-statistic:                    0.3098
Date:                Wed, 10 Jun 2020   Prob (F-statistic):              0.585
Time:                        23:23:31   Log-Likelihood:                 60.719
No. Observations:                  20   AIC:                            -117.4
Df Residuals:                      18   BIC:                            -115.4
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.4129      0.009     47.477      0.000       0.395       0.431
change         0.0264      0.047      0.557      0.585      -0.073       0.126
==============================================================================
Omnibus:                        2.816   Durbin-Watson:                   2.134
Prob(Omnibus):                  0.245   Jarque-Bera (JB):                2.031
Skew:                           0.773   Prob(JB):                        0.362
Kurtosis:                       2.779   Cond. No.                         17.9
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

What were the results?

Medium: With a p-value of 0.301, the regression shows no statistically significant relationship between automation likelihood and percent change in income within the medium group, and the adjusted $R^2$ of 0.004 indicates the model explains essentially none of the variation. From the boxplot representation of medium automation, the median income percent change is around 0.2.

High: With a p-value of 0.585, the regression likewise shows no statistically significant relationship between automation likelihood and percent change in income within the high group; the adjusted $R^2$ of -0.038 indicates the model has no explanatory power. From the boxplot representation of high automation, the median income percent change is around 0.17.

What were your interpretations of these findings?

Regression test:

Medium: The regression line's slope is negative, which suggests that within the medium range, higher automation likelihood is associated with smaller income changes. Given the non-significant p-value, however, this trend should be treated as suggestive rather than conclusive.

High: Something interesting happens here: the regression line's slope is positive rather than negative, suggesting that within the high range, higher automation likelihood is associated with larger income changes. While we cannot pinpoint why this happens, and neither slope is statistically significant, the contrast suggests that medium and high levels of job automation may not affect income change in the same way.

Boxplot: States with medium-level automation probability (0.35 < p <= 0.4) have a higher median income percent change than states with high-level automation probability (p > 0.4), indicating that medium-automation states saw somewhat larger income growth over this period than high-automation states.

Wage by Occupation

EDA

Distributions

Our annual mean wage variable has a right-skewed distribution. We reduce the skew by applying a natural log transform to this data.
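
A minimal sketch of that transform, assuming the result is stored in the log_A_MEAN column used in the cells below and that numpy is imported as np, as elsewhere in this notebook:

# Natural-log transform of mean annual wage to reduce right skew
df_probincome_m['log_A_MEAN'] = np.log(df_probincome_m.A_MEAN)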

Outliers

There are at least two outliers in annual mean income, both above $225,000.

In [39]:
# Distribution of annual mean income
sns.distplot(df_probincome_m.A_MEAN)
plt.xlabel('Mean Annual Income')
plt.title('Distribution of Mean Annual Income', loc='left')
plt.show()
In [40]:
# Investigate mean annual income outlier
df_probincome_m[df_probincome_m.A_MEAN > 225000]
Out[40]:
index SOC Occupation_x Probability Alabama Alaska Arizona Arkansas California Colorado ... Utah Vermont Virginia Washington West Virginia Wisconsin Wyoming Occupation_y A_MEAN log_A_MEAN
226 233 29-1022 Oral And Maxillofacial Surgeons 0.0036 0 0 0 0 620 0 ... 0 0 160 0 0 0 0 Oral and Maxillofacial Surgeons 232870 12.358236
227 234 29-1023 Orthodontists 0.0230 0 0 0 30 710 0 ... 0 0 0 0 0 0 0 Orthodontists 228780 12.340516

2 rows × 58 columns

We see that these two extremely high paying outliers are both oral health occupations with a low probability of automation and relatively low spread throughout the country.

In [41]:
# Distribution of log annual mean income
sns.distplot(np.log(df_probincome_m.A_MEAN))
plt.xlabel('Log(Mean Annual Income)')
plt.title('Distribution of log Mean Annual Income', loc='left')
plt.yticks(np.arange(0.0,1.2,0.2))
plt.show()
In [42]:
# Test normality of log wage; normaltest's null hypothesis is that the data are
## normally distributed, so a small p-value means we reject normality
stat_income_mean_log, p_income_mean_log = normaltest(df_probincome_m.log_A_MEAN)
print('log mean income is normally distributed' if p_income_mean_log >= 0.01 else 'log mean income is NOT normally distributed')
log mean income is NOT normally distributed

Although the normality test rejects strict normality of log mean wage at the 0.01 level, the log transform removes most of the skew, and linear regression does not require the predictor itself to be normally distributed, so we proceed with a linear regression model to analyze the relation between income and probability of automation.

Analysis

What approaches did you use? Why?

We analyze the relationship between log annual income and probability of automation using an OLS linear regression model because the scatterplot shows an approximately linear, negative relationship between these two variables.

In [43]:
# Complete regression for relationship between probability of automation and log mean annual income
outcome, predictors = patsy.dmatrices('Probability ~ log_A_MEAN', df_probincome_m)
model = sm.OLS(outcome, predictors)
results = model.fit()

print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            Probability   R-squared:                       0.331
Model:                            OLS   Adj. R-squared:                  0.330
Method:                 Least Squares   F-statistic:                     338.4
Date:                Wed, 10 Jun 2020   Prob (F-statistic):           1.10e-61
Time:                        23:23:32   Log-Likelihood:                -150.52
No. Observations:                 685   AIC:                             305.0
Df Residuals:                     683   BIC:                             314.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.6114      0.276     20.338      0.000       5.070       6.153
log_A_MEAN    -0.4688      0.025    -18.395      0.000      -0.519      -0.419
==============================================================================
Omnibus:                       27.175   Durbin-Watson:                   1.229
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               17.860
Skew:                          -0.267   Prob(JB):                     0.000132
Kurtosis:                       2.417   Cond. No.                         261.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [44]:
# Look at distribution of probability and log annual income
sns.scatterplot(df_probincome_m.Probability, df_probincome_m.log_A_MEAN)
plt.xlabel('Probability of Automation')
plt.ylabel('Log(Mean Annual Income)')
plt.title('Probability of Automation vs Mean Annual Income', loc='left')
plt.show()

What were the results?

The results from our analysis of the relationship between log annual income and the probability of automation give a p-value for the log_A_MEAN coefficient below 0.001 (reported as 0.000), with an adjusted $R^2$ of 0.330.

What were your interpretations of these findings?

We interpret the relationship between log annual income and probability of automation as supporting our initial hypothesis that lower-paying jobs are more likely to suffer from job automation. Although the adjusted $R^2$ from the OLS regression is modest, the very small p-value indicates that the negative relationship is statistically significant. We believe that further analysis with additional variables could improve the adjusted $R^2$. However, it is important to be cautious with these findings, since our probability of automation data are not normally distributed.
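
As an illustration of what such a multivariate analysis might look like, here is a minimal sketch that adds a second predictor to the same OLS setup; df_prob_merged is a hypothetical dataframe assumed to contain the Probability, log_A_MEAN, and Social_Intelligence columns used elsewhere in this notebook.

# Hypothetical multivariate extension (sketch): regress automation probability on
## log mean wage plus a skill-importance measure
outcome_m, predictors_m = patsy.dmatrices('Probability ~ log_A_MEAN + Social_Intelligence', df_prob_merged)
results_m = sm.OLS(outcome_m, predictors_m).fit()
print(results_m.summary())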

Employment & Automation by State

Analysis

What approaches did you use? Why?

To analyze which jobs contribute most to automation in each state, we look at the total number of employees per occupation in each state and take that as a percentage of the total number of working employees in that state. That percentage is then multiplied by the occupation's probability of automation, and the products are summed across all occupations for each state. This gives a relative, weighted likelihood of automation for each state. We also record which occupation contributes most to this score and organize the data geospatially. We chose this approach because it produces a relative score that lets us compare each state's overall probability of automation rather than having the comparison skewed by absolute employment counts, and it highlights which jobs deserve further attention in our analyses.
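
In symbols (our own notation), each state's weighted likelihood of automation is

$$L_s = \sum_{o} \frac{E_{s,o}}{\sum_{o'} E_{s,o'}} \, P_o$$

where $E_{s,o}$ is the number of employees in occupation $o$ in state $s$ and $P_o$ is occupation $o$'s probability of automation.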

In [45]:
# Build composite likelihood of unemployment per state
state_likelihood = []
df_state_data = pd.DataFrame()
for state in states:
    likelihood = 0.0
    max_likelihood = 0.0
    for index in range(len(df_prob_m)):
        new_likelihood = df_prob_m['Probability'][index] * df_prob_m_normed[state][index]
        likelihood += new_likelihood
        if  new_likelihood > max_likelihood:
            max_likelihood = new_likelihood
            df_state_data[state] = (df_prob_m.Occupation[index], df_prob_m[state][index], df_prob_m.SOC[index])
         
    state_likelihood.append(likelihood)
    #print('state: {}\n\t {}'.format(state, likelihood))
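
The same computation can also be expressed in vectorized pandas. This is a sketch under the same assumptions as the loop above, namely that each column of df_prob_m_normed holds an occupation's share of that state's total employment and that its rows align with df_prob_m:

# Weighted likelihood per state: sum over occupations of (employment share * automation probability)
weighted = df_prob_m_normed[states].multiply(df_prob_m['Probability'], axis=0)
state_likelihood_vec = weighted.sum(axis=0)      # one composite likelihood per state
top_contributor_idx = weighted.idxmax(axis=0)    # row index of each state's top-contributing occupation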
In [46]:
# Transform state data dataframe for easier manipulation
df_state_data = df_state_data.transpose()
df_state_data.rename(columns={0:'Occupation', 1:'Number', 2:'SOC'},inplace=True)
In [47]:
# Change datatype for number of employees
df_state_data.Number = df_state_data.Number.astype(str)
In [48]:
# Look at the occupations that most contribute to a state's weighted likelihood of automation, along with the
## occupation's respective wage and number of states in which it is the top contributor
for occupation in df_state_data.Occupation.unique():
    SOC = df_state_data[df_state_data.Occupation == occupation].SOC.unique()[0]
    wage = int(df_income[df_income.SOC == SOC].A_MEAN.values[0])
    n_states = len(df_state_data[df_state_data.Occupation == occupation])
    print('Occupation: {}\n   Wage: {}\n   Num States: {}'.format(occupation, wage, n_states))
Occupation: Cashiers
   Wage: 21680
   Num States: 5
Occupation: Retail Salespersons
   Wage: 27180
   Num States: 40
Occupation: Secretaries And Administrative Assistants; Except Legal; Medical; And Executive
   Wage: 36140
   Num States: 1
Occupation: Combined Food Preparation And Serving Workers; Including Fast Food
   Wage: 20460
   Num States: 3
Occupation: Office Clerks; General
   Wage: 33010
   Num States: 2

From this we can see that the overwhelming majority of jobs likely to be lost to automation belong to retail salespersons. (This analysis weights the number of people affected by the probability of automation.)

In [49]:
# Add text column for hover-info on map
df_state_data['text'] = df_state_data.index + '<br>' + \
'Most affected occupation: ' + df_state_data.Occupation + '<br>' + \
'Num employees of occupation in state: ' + df_state_data.Number

# Plot the weighted likelihood of automation by state
fig = go.Figure(data=go.Choropleth(
    locations=states_abbv, # Spatial coordinates
    z = state_likelihood, # Data to be color-coded
    locationmode = 'USA-states', # set of locations match entries in `locations`
    colorscale = 'blues',
    colorbar_title =  "Automation Probability",
    autocolorscale=False,
    text = df_state_data.text
))

fig.update_layout(
    title_text = 'Likelihood of Job Automation by State',
    geo_scope='usa', # limit map scope to USA
)

fig.show()

What were the results?

The results from our analysis of the likelihood of automation by state show that across all 50 states plus DC (51 total "areas"), only five different occupations appear as an area's highest contributor towards job automation: cashiers, retail salespersons, secretaries and administrative assistants, combined food preparation and serving workers, and general office clerks. For 40 out of the 51 areas we analyze, the highest contributing occupation is retail salespersons. This occupation has an average US wage of around $27,000, which is less than half of the average income for 2016, when these data were collected, and falls below the first quartile for US workers. We also see that the lowest aggregate probability of automation is in DC, at about a 29.9% total weighted risk of automation; the highest contributor in DC is secretaries.

What were your interpretations of these findings?

We interpret the findings of our state-automation data as supporting our hypothesis that lower-paying jobs will be more susceptible to job automation than higher-paying jobs. Since this statistic is based on the number of employees per occupation, it is necessary to make the distinction that it weights the probability of automation by how many jobs are affected: a job with 100 employees and a 0.99 probability of automation contributes the same amount as a job with 1,000 employees and a 0.099 probability of automation.

It is also interesting to consider secretaries in DC. The secretary position is the highest paying of the five highest-contributor occupations. We interpret this under the assumption that secretary is a relatively common job in DC (with each of the many political positions, among others, requiring at least one secretary). This means that secretaries make up a significant share of the jobs held in DC, yet current technological advancements like digital assistants and automated phone screenings may soon be capable of automating much of this work.

Income Change by State

Analysis

What approaches did you use? Why?

To represent the income change values per state, we chose a geospatial map, which helps us assess the states in which income change is most pronounced. It also supports comparison with the analysis above of automation-prone states and how heavily each state may be affected.

In [50]:
# One income percent change value per state, in the same order as states_abbv
## (assumes the rows of totalIncomeState follow that same state order)
state_incomeChange = []
for index in range(len(totalIncomeState)):
    state_incomeChange.append(totalIncomeState['change'][index])
In [51]:
fig = go.Figure(data=go.Choropleth(
    locations=states_abbv,
    z = state_incomeChange,
    locationmode = 'USA-states',
    colorscale = 'Purples',
    colorbar_title =  "Income Change",
))

fig.update_layout(
    title_text = 'Income Percent Change 2008 to 2017',
    geo_scope='usa',
)

fig.show()

What were the results?

What were your interpretations of these findings?

Job Skillsets

EDA

Distributions

The distribution of the "Perception and Manipulation" dataset is unimodal, with a clear and obvious peak; it is not normally distributed and is positively skewed. The "Creative Intelligence" dataset shows a bimodal distribution, with the highest peak around an importance level of 45. The "Social Intelligence" distribution is unimodal and not normally distributed, with a slight positive skew.

Outliers

There are no outliers among any of the job skillset datasets.

Relationships Between Variables

There is a negative relationship between job automation likelihood and job skill importance; the strength of this relationship varies by skill category (see below).
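
As a quick check on these relationships, the pairwise correlations can be computed directly. A minimal sketch, assuming the df_prob_h dataframe and the column names used in the cells below:

# Correlation of automation probability with each skill-importance measure
skill_cols = ['Perception_and_Manipulation', 'Creative_Intelligence', 'Social_Intelligence']
print(df_prob_h[['Probability'] + skill_cols].corr()['Probability'][skill_cols])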

In [52]:
#Distribution of Perception and Manipulation Importance among jobs
sns.distplot(df_prob_h.Perception_and_Manipulation, bins=20)
plt.xlabel('Skill Importance')
plt.title('Importance of Perception and Manipulation Skills', loc='left')
plt.show()
In [53]:
#Distribution of Creative Intelligence Importance among jobs
sns.distplot(df_prob_h.Creative_Intelligence, bins=20)
plt.xlabel('Skill Importance')
plt.title('Importance of Creative Intelligence Skills', loc='left')
plt.show()
In [54]:
#Distribution of Social Intelligence Importance among jobs
sns.distplot(df_prob_h.Social_Intelligence, bins=20)
plt.xlabel('Skill Importance')
plt.title('Importance of Social Intelligence Skills', loc='left')
plt.show()

Analysis

What approaches did you use? Why?

We first run a simple OLS regression of automation probability on the importance of each job skill (Perception and Manipulation, Creative Intelligence, and Social Intelligence) to test for a linear relationship. Because the scatterplots show clusters of points rather than a clean linear trend, we also bin automation probability into three groups, low (0-33%), medium (33-66%), and high (66-100%), so that the importance of each skill can be compared across groups.

For data visualization, we chose segmented box plots, which give readers the clearest answer to the question: what is the likelihood of a job being automated based on the skills it uses? The probability groups sit on the x-axis and the importance of the job skill on the y-axis.

In [55]:
#Regression of Perception and Manipulation 
outcome, predictors = patsy.dmatrices('Probability ~ Perception_and_Manipulation', df_prob_h)
mod = sm.OLS(outcome, predictors)
res = mod.fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            Probability   R-squared:                       0.028
Model:                            OLS   Adj. R-squared:                  0.026
Method:                 Least Squares   F-statistic:                     17.74
Date:                Wed, 10 Jun 2020   Prob (F-statistic):           2.90e-05
Time:                        23:23:34   Log-Likelihood:                -3148.9
No. Observations:                 629   AIC:                             6302.
Df Residuals:                     627   BIC:                             6311.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
===============================================================================================
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept                      67.1192      3.240     20.714      0.000      60.756      73.482
Perception_and_Manipulation    -0.3469      0.082     -4.212      0.000      -0.509      -0.185
==============================================================================
Omnibus:                     8381.111   Durbin-Watson:                   1.649
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               67.499
Skew:                          -0.368   Prob(JB):                     2.20e-15
Kurtosis:                       1.574   Cond. No.                         88.4
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [56]:
#Regression of Creative Intelligence 
outcome, predictors = patsy.dmatrices('Probability ~ Creative_Intelligence', df_prob_h)
mod = sm.OLS(outcome, predictors)
res = mod.fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            Probability   R-squared:                       0.022
Model:                            OLS   Adj. R-squared:                  0.021
Method:                 Least Squares   F-statistic:                     14.38
Date:                Wed, 10 Jun 2020   Prob (F-statistic):           0.000164
Time:                        23:23:34   Log-Likelihood:                -3150.6
No. Observations:                 629   AIC:                             6305.
Df Residuals:                     627   BIC:                             6314.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept                75.0585      5.509     13.624      0.000      64.240      85.877
Creative_Intelligence    -0.4475      0.118     -3.792      0.000      -0.679      -0.216
==============================================================================
Omnibus:                     6793.594   Durbin-Watson:                   1.640
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               68.421
Skew:                          -0.359   Prob(JB):                     1.39e-15
Kurtosis:                       1.553   Cond. No.                         178.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [57]:
#Regression of Social Intelligence 
outcome, predictors = patsy.dmatrices('Probability ~ Social_Intelligence', df_prob_h)
mod = sm.OLS(outcome, predictors)
res = mod.fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            Probability   R-squared:                       0.042
Model:                            OLS   Adj. R-squared:                  0.041
Method:                 Least Squares   F-statistic:                     27.76
Date:                Wed, 10 Jun 2020   Prob (F-statistic):           1.89e-07
Time:                        23:23:34   Log-Likelihood:                -3144.1
No. Observations:                 629   AIC:                             6292.
Df Residuals:                     627   BIC:                             6301.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept              82.1142      5.360     15.318      0.000      71.588      92.641
Social_Intelligence    -0.5890      0.112     -5.269      0.000      -0.808      -0.369
==============================================================================
Omnibus:                    10213.691   Durbin-Watson:                   1.675
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               65.073
Skew:                          -0.383   Prob(JB):                     7.40e-15
Kurtosis:                       1.623   Cond. No.                         180.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [58]:
#Perception_and_Manipulation boxplot
df_prob_h.plot.scatter('Perception_and_Manipulation', 'Probability')
Out[58]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2a0f18d0>
In [59]:
#Creative Intelligence scatter plot
df_prob_h.plot.scatter('Creative_Intelligence', 'Probability')
Out[59]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a29fd0fd0>
In [60]:
#Social Intelligence scatter plot
df_prob_h.plot.scatter('Social_Intelligence', 'Probability')
Out[60]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2a236310>
In [61]:
#Perception and Manipulation Box Plot

merged_df1low = df_prob_h.loc[df_prob_h['Probability'] <= 0.33].assign(Location="Low")
merged_df1medium = df_prob_h.loc[(df_prob_h['Probability'] > 0.33) & (df_prob_h['Probability'] <= 0.66)].assign(Location="Medium")
merged_df1high = df_prob_h.loc[df_prob_h['Probability'] > 0.66].assign(Location="High")

allValues = [merged_df1low, merged_df1medium, merged_df1high]
allMerged = pd.concat(allValues)

plt.figure(figsize=(8,8))
cdf = pd.concat([merged_df1low, merged_df1medium, merged_df1high]) 
ax = sns.boxplot(x="Location", y="Perception_and_Manipulation", data=cdf)
ax.set(title='Job Automation Likelihood vs. Perception and Manipulation',xlabel='Job Automation Likelihood', 
       ylabel='Perception and Manipulation Importance')

plt.show()
merged_df1high.shape
Out[61]:
(601, 16)
In [62]:
#Creative Intelligence Box Plot

merged_df2low = df_prob_h.loc[df_prob_h['Probability'] <= 0.33].assign(Location="Low")
merged_df2medium = df_prob_h.loc[(df_prob_h['Probability'] > 0.33) & 
                                 (df_prob_h['Probability'] <= 0.66)].assign(Location="Medium")
merged_df2high = df_prob_h.loc[df_prob_h['Probability'] > 0.66].assign(Location="High")

allValues = [merged_df2low, merged_df2medium, merged_df2high]
allMerged = pd.concat(allValues)

plt.figure(figsize=(8,8))
cdf = pd.concat([merged_df2low, merged_df2medium, merged_df2high]) 
ax = sns.boxplot(x="Location", y='Creative_Intelligence', data=cdf)
ax.set(title='Job Automation Likelihood vs. Creative Intelligence Importance',xlabel='Job Automation Likelihood', 
       ylabel='Creative Intelligence Importance')

plt.show()
merged_df2high.shape
Out[62]:
(601, 16)
In [63]:
#Social Intelligence Box Plot

merged_df3low = df_prob_h.loc[df_prob_h['Probability'] <= 0.33].assign(Location="Low")
merged_df3medium = df_prob_h.loc[(df_prob_h['Probability'] > 0.33) & 
                                 (df_prob_h['Probability'] <= 0.66)].assign(Location="Medium")
merged_df3high = df_prob_h.loc[df_prob_h['Probability'] > 0.66].assign(Location="High")

allValues = [merged_df3low, merged_df3medium, merged_df3high]
allMerged = pd.concat(allValues)

plt.figure(figsize=(8,8))
cdf = pd.concat([merged_df3low, merged_df3medium, merged_df3high]) 
ax = sns.boxplot(x="Location", y="Social_Intelligence", data=cdf)
ax.set(title='Job Automation Likelihood vs. Social Intelligence Importance',xlabel='Job Automation Likelihood', 
       ylabel='Social Intelligence Importance')

plt.show()
merged_df3high.shape
Out[63]:
(601, 16)

What were the results?

When we run the regression analyses, we find only weak linear relationships between the importance of a job skill and the likelihood that the job is automated: an adjusted $R^2$ of 0.026 for Perception and Manipulation skills, 0.021 for Creative Intelligence skills, and 0.041 for Social Intelligence skills. Although each coefficient is negative and statistically significant, these values indicate that any single skill's importance explains very little of the variation in automation probability. That said, while the relationships are not strongly linear, we do see clusters of data points when examining the scatterplots for each category.

Breaking automation probability into three groups, low (0-33%), medium (33-66%), and high (66-100%), lets us see the relationship between automation likelihood and the importance of a job skill more clearly.

What were your interpretations of these findings?

We interpret these findings as supporting our hypothesis: jobs that require a high level of certain types of skills, such as perception and manipulation, creative intelligence, and social intelligence, tend to have a lower likelihood of automation.

Ethics & Privacy

All of our datasets have been provided by publicly available sources such as data.world, the Bureau of Labor Statistics, or other US government agencies. Since these data are provided to the public, we anticipate no restrictions on using them for the purpose of this project, and no restrictions have been posted for the datasets we are accessing. Much of the data we use is provided by state or federal governments, which are required by law to provide strong protection for data made available to the public. For data sources that are not guaranteed to do this, we will protect privacy by anonymizing any potentially personally identifiable information. However, we do not anticipate this being an issue, as all the data collected are aggregates that do not include any personally identifiable information. Occupations with fewer than 1,000 employees in a given state were removed from our data.

One of our dataset sources, data.world, has contributors of various backgrounds and experience levels. This increases the potential for bias, since there is no way of confirming whether the presentation of the data is slanted toward fulfilling the contributor's needs. The user has the option of viewing the contributor's Kaggle and LinkedIn profiles, but the amount of information provided is controlled by the contributors themselves. By using the provided information to confirm the legitimacy of the contributor's data, we can decrease the chance of potential bias. For our other data source, data.gov, there is a smaller potential for bias given that it is a government source. While we cannot assume that government sources are entirely free of bias, we can assume they are fairer than non-government sources; government sources are supposed to be free of affiliation to political parties, which is why we can assume that the data presented have relatively low levels of bias.

A potential ethical concern we have considered is data misinformation/misinterpretation. If an individual comes across our analysis without the understanding that this data relies partly on probability instead of concrete metrics, and that this analysis was not conducted by professional Data Scientists, it might lead to the unnecessary spread of fear that one might lose their job to automation, or further contribute to data misinterpretation.

Conclusion & Discussion

This project analyzed many factors that we believed were likely to contribute to the probability of a job being automated, including annual and state-level income, the skills required for the job, job field growth over time, and the importance of the job to a state's total workforce. To complete these analyses, we employed a combination of linear regression analyses and geospatial methods.

From our wage by occupation analysis, our results show that there is indeed a negative correlation between the mean annual income of a job and that job's probability of automation. Our employment and automation by state analysis further shows that the single most common top contributor to a state's weighted probability of automation is retail sales, an occupation with a mean annual income below the first quartile. We conclude that these results support our primary hypothesis that jobs with higher salaries have lower automation probabilities and, conversely, that jobs with lower salaries have higher automation probabilities. In our analysis of income change by state, states with medium automation likelihood (0.35 < p <= 0.4) show a negative relationship with income change, in that higher automation likelihood in a medium-risk state is associated with a smaller percent change in income, while high-risk states (p > 0.4) show a positive relationship, in that higher automation probability is associated with a larger percent change in income; neither of these within-group trends was statistically significant. In our skills analyses, jobs with low levels of Perception and Manipulation importance are more at risk of job automation, as are jobs with low levels of Creative Intelligence importance and jobs with low levels of Social Intelligence importance. This indicates that jobs requiring skills under the umbrella of perception, creativity, and empathy are less likely to be at risk of job automation.

A current limitation of this project is our reliance on single-variable linear regressions. Since job automation is a complex phenomenon with many contributing factors, it is difficult to draw conclusions about any single variable, especially when only considering linear relationships. We encountered this issue in our percent change in employment analysis: the data did not fit the model we were using, and further attempts did not resolve the issue. Given the substantial impact that job automation will have on millions of Americans within the next decade, further analysis should examine these factors in combination with each other and compare the results against actual automation rates.

Team Contributions

Michael:

  • Setup/cleaning/eda/analysis/viz:
    • Employment by Occupation
    • Income by Occupation
    • Employment & Automation by State
  • Geospatial data & Employment normalization
  • Datasets 1-4
  • Ethics & Privacy
  • Conclusion + Discussion with another group member
  • Prior Work & Hypothesis
  • Overview