With the United States slated to lose over 1.5 million jobs to automation in the next decade alone, the causes of job automation are a pressing topic. This project examines several factors we expect to contribute to the probability of a job being automated, including annual and state-level income, the skills required for the job, job field growth over time, and the importance of the job to a state's total workforce. We find that income does correlate with the likelihood of automation to some extent, and that low-paying jobs contribute more to state-wide automation.
How is the further integration of job automation into the United States’ workforce related to employment salaries, skills, and employment rates across various job industries in the past 25 years?
At the outset of our project, our group was interested in how automation has shaped the US economy and how it will continue to shape the economy we live in. From fast food restaurants implementing kiosks to cashierless grocery stores, a wide variety of jobs have been automated and replaced in recent years. Indeed, the Annual Survey of Manufacturers (ASM) shows that the number of employees in manufacturing declined from 2008 to 2018. While these numbers are striking on their own, we felt there should be an analysis of the harm automation does to displaced workers.
Our group wanted to look further into how the integration of job automation has affected various industries in the job market. A 2012 blog post claimed that “the more ‘blue collar’ you are, the more likely you are to be unemployed.” That claim is backed by the U.S. Bureau of Labor Statistics’ Current Population Surveys from 2010 and 2018, which indicate that the less education a worker has attained, the higher their unemployment rate. While we initially considered education level potentially important to our research, we did not find relevant datasets for it. In the years leading up to the Coronavirus outbreak, unemployment decreased significantly overall, yet the pattern held: less education corresponded to a higher unemployment rate.
It is important to consider how job automation may impact other industries, workplaces, and unemployment rates, and after assessing the data we had, we centered on this factor. The most similar prior work we found examines the likelihood of automation in relation to occupation and salary. The analyses in that research use the same automation probability dataset that we work with in our analysis (Dataset 1). The second dataset used in that study focuses on a handful of data categories regarding income levels, which we have adopted into our data collection for this project (Dataset 2). That article also led us to ONET, a US government resource for storing data and conducting data analysis on the job market in the United States. This project draws on nine ONET datasets: Negotiation, Social Perceptiveness, Persuasion, Finger Dexterity, Manual Dexterity, Cramped Work Spaces, Originality, Creativity, and Assisting and Caring for Others. With this wider variety of data, the aspect of our study considering income and state automation likelihood acts as an extension of the prior work.
Although our initial findings had some face value, many factors were left unknown. That original research only raised more questions: how many jobs have already been affected by automation, and which jobs are they? In particular, what is the probability that a given job will be automated, and how does that likelihood relate to income, state of occupancy, and various job skills? We sought to find out.
References:
We hypothesize that jobs with higher salaries will have lower probabilities of automation, and that jobs requiring higher levels of skills such as creative and social intelligence will likewise be less likely to be automated.
Dataset 1:
This dataset includes job title, OCC code (used to categorize occupations), probability of automation, and number of employees per state.
Dataset 2:
This dataset includes job title, OCC code, and various wage metrics per occupation (salary vs hourly, average pay, statistics for these fields, North American Industry Classification System code, employee number data, etc).
Dataset 3:
This dataset includes the number of employees per occupation present per state in any given year.
Dataset 4:
These datasets (collected for the years 2011 through 2019) show the number of employees per occupation by age group and give the total number of employees per occupation.
Dataset 5:
Dataset 5 collects information from nine datasets on the ONET site: Negotiation, Social Perceptiveness, Persuasion, Finger Dexterity, Manual Dexterity, Cramped Work Spaces, Originality, Creativity, and Assisting and Caring for Others. They are all sorted by SOC code and contain 969 observations.
Dataset 6:
Dataset 6 provides the Metropolitan and Non-Metropolitan personal income values per state in the United States from 2008 to 2017.
Dataset 7 (A copy of Dataset 1):
This dataset includes job title, OCC code (used to categorize occupations), probability of automation, and number of employees per state. It was copied and named df_occ to resolve a merging problem we were having across datasets and analyses.
Datasets 1 and 2 provide information by occupation and will be combined by their OCC code, since it is the most standard identifier they share (note: some datasets format occupation name strings differently, so occupation titles are not as easily used). Since we mostly want to consider this data in terms of probability of automation, data from Dataset 2 will only be kept if there is a corresponding OCC code in Dataset 1. Dataset 3 is used to turn the number of employees per occupation per state in Dataset 1 into fractions, so we can look at relative rather than absolute employment numbers.
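As a minimal sketch of this merge logic (toy rows chosen purely for illustration; the shared OCC/SOC code is the join key), left-merging onto the automation table keeps only codes present in Dataset 1:
import pandas as pd

# Toy stand-ins for Dataset 1 (automation probabilities) and Dataset 2 (wages)
probabilities = pd.DataFrame({'SOC': ['41-2031', '15-1132'], 'Probability': [0.92, 0.04]})
wages = pd.DataFrame({'SOC': ['41-2031', '15-1132', '99-9999'], 'A_MEAN': [27000, 101000, 55000]})

# Keep wage rows only when their code appears in the automation dataset
merged = pd.merge(probabilities, wages, on='SOC', how='left')
print(merged)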
The nine yearly datasets in Dataset 4 (2011-2019) will be combined in order to calculate the change in employment over that period; percent change is computed from 2012 to 2019 because some occupations report zero employees in 2011. This data will then be combined with Dataset 1 (matching on standardized occupation titles) in order to investigate the relationship between job field growth and likelihood of automation for that job field.
# Installs plotly for displaying geospatial graphs
!pip install --user plotly
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import patsy
import statsmodels.api as sm
import plotly.graph_objects as go
from scipy import stats
from copy import copy
from scipy.stats import normaltest
# Used for building geo-spatial maps
states = ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware',
'District of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois','Indiana', 'Iowa', 'Kansas',
'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi',
'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island',
'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
'West Virginia', 'Wisconsin', 'Wyoming']
states_abbv = ['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA',
'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD',
'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ',
'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC',
'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY']
How ‘clean’ is the data?
Our data are clean in that they were provided by reputable sources in a format without unnecessary variables, null data, etc. In one of the datasets, some wage information was represented by a * or #, with the * meaning not enough data was available for inclusion and the # meaning the wage exceeded $200,000/yr. These observations were dropped from our analyses.
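A minimal sketch of that filtering step, using a toy frame and assuming the wage column is named A_MEAN as in the cells below:
import pandas as pd

wages = pd.DataFrame({'A_MEAN': ['54000', '*', '#', '71000']})
wages = wages[~wages['A_MEAN'].isin(['*', '#'])]   # drop suppressed (*) and top-coded (#) entries
wages['A_MEAN'] = pd.to_numeric(wages['A_MEAN'])   # remaining values can be treated as numbers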
In regards to the job skillset data, the skills datasets we collected from ONET were already in CSV format with easy-to-understand categories and pre-formulated constructs, so it was easy for us to sort through the data without much cleaning. However, the skills datasets and the dataset used for automation probability used different titles for each occupation, which meant we had to merge on a shared code: the Standard Occupational Classification (SOC) system, a federal convention used for occupational data analysis. In order to use both datasets, we merged them by their SOC codes.
The annual income data was relatively clean. The only exclusion is that the first and last rows needed to be removed because they accounted for values for the United States as a whole rather than an individual state.
What did you have to do to get the data into a usable format?
Since some datasets were read in from Excel spreadsheets, it was necessary to clean this data by removing the first few rows (title information), renaming the columns for proper identification, and resetting the index after those rows were removed. It was also necessary to remove additional columns that were unrelated to the dataframe but had been added for structural purposes in the Excel sheet.
Some dataframes required transposing/reshaping in order to more easily work with the data.
As noted above, the skills datasets and the automation probability dataset used different occupation titles, so we merged them on their shared SOC codes.
For the annual income data, each state was split into “Metropolitan” and “Non-Metropolitan” income rows. We wanted to combine those into one value per state, so we used groupby to sum every consecutive pair of rows across the 102 rows.
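A minimal sketch of that pairing step with illustrative values: consecutive Metropolitan and Non-Metropolitan rows share the same value of index // 2, so grouping on that key sums each pair into one state-level row.
import pandas as pd

income_rows = pd.DataFrame({
    'GeoName': ['Alabama (Metropolitan)', 'Alabama (Non-Metropolitan)',
                'Alaska (Metropolitan)', 'Alaska (Non-Metropolitan)'],
    '2017': [100, 40, 60, 20],
})
combined = income_rows[['2017']].groupby(income_rows.index // 2).sum()
print(combined)   # one summed row per state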
What pre-processing steps were required for your methods?
For our state analysis, it was necessary to transpose our data in the beginning, since the variables in the original dataset were now to be used as observations in this new dataset. An additional transformation was made to “normalize” the number of employees in each column by dividing them by the total number of workers per state.
We checked the distribution of variables such as probability of automation, wage, and employment percent change.
For the data in Dataset 4, we needed to drop all non-total employee rows and merge the nine yearly datasets into one set.
For the individual occupation wage analysis, it was necessary to drop most columns; we kept OCC code, annual mean wage, and occupation title. The resulting dataset was merged with Dataset 1 by OCC/SOC code so we could easily compare the likelihood of automation and the annual mean wage.
To get our data into an optimal format for the job skillset analysis, we had to change the column titles on a few of the skill datasets, as well as filter out any rows listed as “Not Available.” Primarily, the dataset titled “Fine Arts” had this issue, which makes sense, since we were measuring our analyses based on the relevance of a skill (e.g., fine arts) to a single job, and many jobs use no degree of fine arts skills. The “Assisting and Caring for Others” dataset also had this issue, as neither the job “Models” nor “Financial Analyst” had any relevance to this trait, so their data was removed. Moreover, the job “Mathematical Technician” had data which, across several datasets, corresponded little to the rest of the data, leading us to believe that row was incomplete, inconsistent, or significantly altered; it was removed from those datasets as well.
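A minimal sketch of the “Not Available” filtering, using a toy frame and assuming the column is named Importance as in the O*NET-style CSVs loaded below:
import pandas as pd

skill = pd.DataFrame({'Occupation': ['Occupation A', 'Occupation B'],
                      'Importance': ['Not Available', '72']})
skill = skill[skill['Importance'] != 'Not Available']   # drop rows without a usable score
skill['Importance'] = pd.to_numeric(skill['Importance'])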
# Function tidy-izes the data
def organize(df, year):
'''
- Removes the first 7 rows from the dataframe (corresponds to title and additional non-data cells from
excel formatting)
- Adds column titles back in
- Drops null columns
- Trims dataframe to only include occupation and title
'''
df = df[7:]
df = df.rename(columns={df.columns[0]: 'Occupation', df.columns[1]: 'Total{}'.format(year)})
df.dropna(inplace=True)
df.reset_index(inplace=True)
df = df[['Occupation', 'Total{}'.format(year)]]
return df
# Creating the pertinent DataFrames
df_prob = pd.read_csv('datasets/raw_state_automation_data.csv', encoding='cp1252')
df_employment = pd.read_excel('datasets/employmentbystate.xls')
df_income = pd.read_excel('datasets/wagedata.xlsx')
income = pd.read_csv('datasets/PARPI_PORT_2008_2017.csv')
df_occ = pd.read_csv('datasets/raw_state_automation_data.csv', encoding='cp1252')
employment2011 = organize(pd.read_excel('datasets/blsdata/cpsaat11b2011.xlsx'), 2011)
employment2012 = organize(pd.read_excel('datasets/blsdata/cpsaat11b2012.xlsx'), 2012)
employment2013 = organize(pd.read_excel('datasets/blsdata/cpsaat11b2013.xlsx'), 2013)
employment2014 = organize(pd.read_excel('datasets/blsdata/cpsaat11b2014.xlsx'), 2014)
employment2015 = organize(pd.read_excel('datasets/blsdata/cpsaat11b2015.xlsx'), 2015)
employment2016 = organize(pd.read_excel('datasets/blsdata/cpsaat11b2016.xlsx'), 2016)
employment2017 = organize(pd.read_excel('datasets/blsdata/cpsaat11b2017.xlsx'), 2017)
employment2018 = organize(pd.read_excel('datasets/blsdata/cpsaat11b2018.xlsx'), 2018)
employment2019 = organize(pd.read_excel('datasets/blsdata/cpsaat11b2019.xlsx'), 2019)
# Merging the different employment datasets in order to analyze percent change
employment = pd.merge(pd.merge(employment2011, employment2012), \
pd.merge(pd.merge(employment2013, pd.merge(employment2014, employment2015)), \
pd.merge(pd.merge(employment2016, employment2017), \
pd.merge(employment2018, employment2019))))
# Creating copy of main dataframe to avoid issues with individual cleaning and analyses
df_prob_m = copy(df_prob)
df_prob_g = copy(df_prob)
df_prob_h = copy(df_prob)
df_prob_k = copy(df_prob)
# Structure main dataset
df_prob_m.sort_values(by=['SOC'], inplace=True)
# Standardize Occupation column between datasets
employment['Occupation'] = employment['Occupation'].apply(lambda x: x.title())
df_prob_m['Occupation'] = df_prob_m['Occupation'].apply(lambda x: x.title())
# Create trimmed dataset
df_prob_m_trim = df_prob_m[['SOC', 'Occupation', 'Probability']]
# Include employment info
df_prob_m_trim = pd.merge(df_prob_m_trim, employment)
# Add percent change based on 2012 to 2019 data (Note that data cannot be analyzed for 2011 as there were some
## states in which occupations did not have employees, causing a division by 0 error)
df_prob_m_trim['percent_change'] = (df_prob_m_trim.Total2019 - df_prob_m_trim.Total2012)/df_prob_m_trim.Total2012
#Remove unnecessary income data
df_income = df_income[['OCC_CODE', 'OCC_TITLE', 'A_MEAN']]
df_income.rename(columns={'OCC_CODE':'SOC', 'OCC_TITLE':'Occupation'},inplace=True)
# Create combined probability & income dataset
df_probincome_m = pd.merge(df_prob_m, df_income, how='left', left_on='SOC', right_on='SOC')
# Drop rows if no mean wage info
# * is used to represent occupation with insufficient data
df_probincome_m = df_probincome_m[df_probincome_m.A_MEAN != '*']
# Remove any null income values
df_probincome_m.dropna(inplace=True,subset=['A_MEAN'])
df_probincome_m = df_probincome_m.reset_index()
# Rest of values should be numeric, so transform
df_probincome_m.A_MEAN = pd.to_numeric(df_probincome_m.A_MEAN)
# Add log mean income
df_probincome_m['log_A_MEAN'] = df_probincome_m.A_MEAN.apply(np.log)
# Clean employment data by removing excel specific titles & rows, dropping null data
## adding proper column names, dropping unnecessary columns, and typecasting
df_employment = df_employment[5:]
df_employment.dropna(inplace=True)
df_employment = df_employment.rename(columns={df_employment.columns[1]:'State', df_employment.columns[2]:'Employment'})
df_employment.reset_index(inplace=True)
df_employment = df_employment[['State', 'Employment']]
df_employment.Employment = df_employment.Employment.astype(int)
# Reshape data to easily apply later transformation
df_employment = df_employment.transpose()
df_employment.columns = df_employment.iloc[0]
df_employment = df_employment.iloc[1:]
# Transform employment data to reflect employment relative to population
df_prob_m_normed = copy(df_prob_m)
for state in states:
    # divide each occupation's employee count by the state's total employment
    df_prob_m_normed[state] = df_prob_m_normed[state] / df_employment[state].values[0]
# Don't need SOC values, so remove
#df_prob_m.drop(columns=['SOC'],inplace=True)
income = income[['GeoName', 'LineCode','2008', '2009', '2010', '2011', '2012', '2013', '2014' , '2015', '2016', '2017']]
income = income[income.LineCode == 1.0]
income = income.loc[income.index > 0]
income = income.reset_index()
income = income.drop('index', 1)
N = 2
totalIn = income.groupby(income.index // N).sum()
totalIn['change'] = totalIn['2017'] - totalIn['2008']
totalIn['change'] = totalIn['change'] / totalIn['2008']
# Only want to look at probability, state data so remove other columns
df_occ.drop(columns=['SOC', 'Occupation'], inplace=True)
# Transform employment data to reflect employment relative to population
for state in states:
    # divide each occupation's employee count by the state's total employment
    df_occ[state] = df_occ[state] / df_employment[state].values[0]
state_likelihood2 = []
for state in states:
likelihood2 = 0
for index in range(len(df_occ)):
likelihood2 += df_occ['Probability'][index] * df_occ[state][index]
state_likelihood2.append(likelihood2)
statesDF = pd.DataFrame(states)
totalIncomeState = totalIn.merge(statesDF, left_index = True, right_index = True)
totalIncomeState = totalIncomeState.rename(columns={0: "State"})
likelihoodAutomation = pd.DataFrame(state_likelihood2)
incomeANDautomation = likelihoodAutomation.merge(totalIncomeState, left_index=True, right_index=True)
incomeANDautomation = incomeANDautomation.rename(columns={0: "Automation"})
# Setting up automation data
automation = pd.read_csv('datasets/raw_state_automation_data.csv', encoding='cp1252')
#removing States data
automation.drop(automation.columns.difference(['SOC','Occupation', 'Probability']), 1, inplace=True)
automation['Probability'] = automation['Probability'] * 100
#Loading in all the job skills data
originality = pd.read_csv('datasets/Originality.csv')
negotiation = pd.read_csv('datasets/Negotiation.csv')
social_percept = pd.read_csv('datasets/Social_Perceptiveness.csv')
persuasion = pd.read_csv('datasets/Persuasion.csv')
finger_dext = pd.read_csv('datasets/Finger_Dexterity.csv')
manual_dext = pd.read_csv('datasets/Manual_Dexterity.csv')
cramped_work = pd.read_csv('datasets/Cramped_Work_Space_Awkward_Positions.csv')
fine_arts = pd.read_csv('datasets/Fine_Arts.csv')
df_prob_h = pd.read_csv('datasets/Assisting_and_Caring_for_Others.csv')
#Cleaning and merging
df_prob_h['Assisting and Caring Importance'] = df_prob_h['Importance']
# Removing doubles "importance" column and level column as it is not used
df_prob_h = df_prob_h.drop(['Importance', 'Level'], axis=1)
df_prob_h['Originality Importance'] = originality['Importance']
df_prob_h['Negotiation Importance'] = negotiation['Importance']
df_prob_h['Social Perception Importance'] = social_percept['Importance']
df_prob_h['Persuasion Importance'] = persuasion['Importance']
df_prob_h['Finger Dexterity Importance'] = finger_dext['Importance']
df_prob_h['Manual Dexterity Importance'] = manual_dext['Importance']
df_prob_h['Cramped Work Context'] = cramped_work['Context']
df_prob_h['Fine Arts Importance'] = fine_arts['Importance']
#Merging the Data Sets
df_prob_h['SOC'] = df_prob_h['Code'].astype(str).replace('\.00', '', regex=True)
df_prob_h = pd.merge(left=df_prob_h, right=automation, left_on='SOC', right_on='SOC')
df_prob_h.drop(['Code', 'Occupation_x'], axis=1, inplace=True)
df_prob_h = df_prob_h.rename(columns = {'Occupation_y':'Occupation'})
# Creating 3 separate categories based upon groups of related skills
df_prob_h['Perception_and_Manipulation'] = df_prob_h[['Finger Dexterity Importance','Manual Dexterity Importance',
'Cramped Work Context']].mean(axis=1)
df_prob_h['Creative_Intelligence'] = df_prob_h[['Originality Importance','Fine Arts Importance']].mean(axis=1)
df_prob_h['Social_Intelligence'] = df_prob_h[['Social Perception Importance', 'Negotiation Importance',
'Persuasion Importance', 'Assisting and Caring Importance']].mean(axis=1)
Distributions
Our probability of automation variable is bound between 0-1 with a bimodal distribution. The peaks occur at the two boundaries. While the left mode has a higher peak, the right mode carries more weight.
Our variable representing the percent change in employment between 2012 and 2019 follows a relatively normal, unimodal distribution with a few small secondary peaks that do not change the shape drastically. The distribution is slightly right-skewed.
Outliers
While there is a relatively smooth distribution outside of the boundary peaks, there is a non-modal peak in the 0.35-0.40 probability bin.
There is a percent change outlier right around 2.75. This corresponds to Financial Analysts, which grew from 84,000 to 345,000 employees.
Relationship between variables
There is essentially no linear relationship (the fitted line is nearly horizontal) between the probability of automation and the change in employment between 2012 and 2019. There is a stronger linear relationship between log annual income and probability of automation.
# Distribution of probability of automation
sns.distplot(df_prob_m.Probability, bins=20)
plt.xlabel('Probability of Automation')
plt.title('Distribution of Automation Probability', loc='left')
plt.show()
# Distribution of percent change in employment
sns.distplot(df_prob_m_trim.percent_change)
plt.xlabel('Change in Employment')
plt.title('Distribution of Employment Change from 2012-2019', loc='left')
plt.show()
# Investigate outlier
df_prob_m_trim[df_prob_m_trim.percent_change >= 2]
# Determine normality of percent_change
k2, p = stats.normaltest(df_prob_m_trim.percent_change)
print(p)
# Plot the percent change vs Probability of employment
sns.scatterplot(df_prob_m_trim.Probability, df_prob_m_trim.percent_change)
plt.xlabel('Probability of Automation')
plt.ylabel('Change in Employment')
plt.title('Probability of Automation vs Change in Employment', loc='left')
plt.show()
What approaches did you use? Why?
We analyze the relationship between percent change in employment and probability of automation using an OLS linear regression model. We chose this method as an exploratory first step, although, as noted above, the relationship appears weak and the data do not fully meet the model's assumptions.
# OLS Regression for Probability and percent change
# NOTE: Data does not meet requirements necessary for testing linearity, but want to take a look
outcome, predictors = patsy.dmatrices('Probability ~ percent_change', df_prob_m_trim)
model = sm.OLS(outcome, predictors)
results = model.fit()
print(results.summary())
What were the results?
Our analysis of the relationship between percent change in employment and the probability of automation produced poor results. Due to the bimodal distribution of the probability of automation, our data were not properly distributed for this type of regression. Further attempts to normalize the data were not successful, nor were further regression attempts.
What were your interpretation of these findings?
Although the relationship between employment percent change and automation did not prove conclusive, we interpreted these results to show that the likelihood of automation is complex, with many contributing factors, and as such we could not analyze it by looking at any single contributor alone. This led to a breadth of comparisons being done in order to better understand the different factors contributing to the likelihood of automation.
Distributions
The variable "percent change in income" consists of a unimodal distribution with the peak occuring between 0.1 and 0.2 percent change in income.
The variable "automation probability" consists of a bimodal distribution with the highest peak occuring close to 0.4 probability, and the second peak occuring at close to 0.3 probability.
Outliers
There was one outlier in the analysis of automation likelihood vs. percent change in income: District of Columbia, Automation: 0.299355, Income Percent Change: 0.364582.
Relationship Between Variables
There is a negative relationship between the likelihood of automation in a state and the percent change in income in a state.
sns.distplot(totalIn.change)
sns.distplot(incomeANDautomation.Automation)
plt.figure(figsize=(8,8))
sns.scatterplot(incomeANDautomation.Automation, incomeANDautomation.change)
ax = sns.regplot(x="Automation", y="change", data=incomeANDautomation)
ax.set(title='Job Automation Likelihood vs. Percent Change in Income per State from 2008 to 2017', xlabel='Job Automation Likelihood', ylabel='Percent Change in Income')
plt.show()
What approaches did you use? Why?
We analyze the relationship between percent change in income and probability of automation using a linear regression model. This approach was used because there was a clear downward trend in the data.
outcome2, predictors2 = patsy.dmatrices('Automation ~ change', incomeANDautomation)
model2 = sm.OLS(outcome2, predictors2)
results2 = model2.fit()
print(results2.summary())
What were the results?
With an $\alpha$ value of 0.05, the p-value of 0.021 indicates a statistically significant relationship between a state's automation likelihood and its percent change in income, although the Adjusted $R^2$ of 0.085 shows that automation likelihood explains only a small portion of the variation.
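The values quoted here can be read directly off the fitted model; a small sketch, assuming the results2 object from the regression cell above is still in scope:
slope_change = results2.params[1]     # coefficient on `change` (second design-matrix column)
p_change = results2.pvalues[1]        # the P>|t| value reported in the summary table
adj_r2 = results2.rsquared_adj        # the Adjusted R^2 reported in the summary
print('slope: {:.3f}, p-value: {:.3f}, adjusted R^2: {:.3f}'.format(slope_change, p_change, adj_r2))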
What were your interpretation of these findings?
Since the slope is negative, these results suggest that as a state's automation likelihood increases, its change in income tends to decrease.
automation1of3 = incomeANDautomation.loc[incomeANDautomation.Automation <= 0.35]
automation2of3 = incomeANDautomation.loc[(incomeANDautomation.Automation > 0.35) & (incomeANDautomation.Automation <= 0.4)]
automation3of3 = incomeANDautomation.loc[incomeANDautomation.Automation > 0.4]
plt.figure(figsize=(8,8))
automation1of3 = automation1of3.assign(Location='low')
automation2of3 = automation2of3.assign(Location='medium')
automation3of3 = automation3of3.assign(Location='high')
cdf = pd.concat([automation1of3, automation2of3, automation3of3])
ax = sns.boxplot(x="Location", y="change", data=cdf)
ax.set(title='Job Automation Likelihood vs. Income Change',xlabel='Job Automation Likelihood', ylabel='Income Change')
plt.show()
What approaches did you use? Why?
We split the Job Automation Likelihood data three ways: low probability (p <= 0.35), medium probability (0.35 < p <= 0.4), and high probability (p > 0.4). This was done to group the automation values for easier analysis and to better understand which ranges of automation likelihood correspond to greater changes in income.
We then analyze the relationship between percent change in income and probability of automation within these groups using linear regression models. This approach was used because the original plot showed a downward trend in the medium section but a slightly upward trend in the high section, and we wanted to explore whether medium and high likelihoods of job automation behave differently. Our "low" section contained only one data point (our outlier), so this breakdown also gave us the chance to analyze everything beyond the outlier.
sns.scatterplot(automation2of3.Automation, automation2of3.change)
ax1 = sns.regplot(x="Automation", y="change", data=automation2of3)
ax1.set(xlabel='Automation Likelihood', ylabel='Percent Change in Income')
plt.show()
outcome3, predictors3 = patsy.dmatrices('Automation ~ change', automation2of3)
model3 = sm.OLS(outcome3, predictors3)
results3 = model3.fit()
print(results3.summary())
sns.scatterplot(automation3of3.Automation, automation3of3.change)
ax3 = sns.regplot(x="Automation", y="change", data=automation3of3)
ax3.set(xlabel='Automation Likelihood', ylabel='Percent Change in Income')
plt.show()
outcome4, predictors4 = patsy.dmatrices('Automation ~ change', automation3of3)
model4 = sm.OLS(outcome4, predictors4)
results4 = model4.fit()
print(results4.summary())
What were the results?
Medium: Our regression for the medium group yields an Adjusted $R^2$ value of -0.038, indicating that automation likelihood explains essentially none of the variation in income percent change within this range, although the estimated slope is negative. In regards to the boxplot representation of medium automation, the median value of income percent change is around 0.2.
High: Our regression for the high group yields an Adjusted $R^2$ value of 0.017, which again indicates only a weak relationship, this time with a positive estimated slope. In regards to the boxplot representation of high automation, the median value of income percent change is around 0.17.
What were your interpretation of these findings?
Regression test:
Medium: Since the regression line's slope is negative, within the medium range an increase in automation likelihood is associated with a decrease in income change.
High: Something interesting happens here -- the regression line's slope is positive rather than negative, suggesting that within the high automation range, an increase in automation likelihood is associated with an increase in income change. While we cannot pinpoint exactly why this happens, we can conclude that the impacts of medium and high levels of job automation likelihood are not the same.
Boxplot: From our boxplot, we learn that states with medium-level automation probability (0.35 < p <= 0.4) have a higher median income change than states with high-level automation probability (p > 0.4).
Distributions
Our annual mean wage variable has a right-skewed distribution. We remove the skew by applying a natural log transformation to this data.
Outliers
There is at least one outlier in annual mean income around $225,000
# Distribution of annual mean income
sns.distplot(df_probincome_m.A_MEAN)
plt.xlabel('Mean Annual Income')
plt.title('Distribution of Mean Annual Income', loc='left')
plt.show()
# Investigate mean annual income outlier
df_probincome_m[df_probincome_m.A_MEAN > 225000]
We see that these two extremely high paying outliers are both oral health occupations with a low probability of automation and relatively low spread throughout the country.
# Distribution of log annual mean income
sns.distplot(np.log(df_probincome_m.A_MEAN))
plt.xlabel('Log(Mean Annual Income)')
plt.title('Distribution of log Mean Annual Income', loc='left')
plt.yticks(np.arange(0.0,1.2,0.2))
plt.show()
# Test normality of log wage
stat_income_mean_log, p_income_mean_log = normaltest(df_probincome_m.log_A_MEAN)
# normaltest's null hypothesis is normality, so a small p-value means we reject normality
print('log mean income is NOT normally distributed') if p_income_mean_log < 0.01 else print('log mean income is consistent with a normal distribution')
Treating the log mean wage data as approximately normally distributed, we can consider a linear regression model to analyze the relation between income and probability of automation.
What approaches did you use? Why?
We analyze the relationship between log annual income and probability of automation by using an OLS Linear Regression model because there is a clear linear relationship between these two variables.
# Complete regression for relationship between probabilty of automation and log mean annual income
outcome, predictors = patsy.dmatrices('Probability ~ log_A_MEAN', df_probincome_m)
model = sm.OLS(outcome, predictors)
results = model.fit()
print(results.summary())
# Look at distribution of probability and log annual income
sns.scatterplot(df_probincome_m.Probability, df_probincome_m.log_A_MEAN)
plt.xlabel('Probability of Automation')
plt.ylabel('Log(Mean Annual Income)')
plt.title('Probability of Automation vs Mean Annual Income', loc='left')
plt.show()
What were the results?
The results from our analysis of the relationship between log annual income and the probability of automation give a $P>|t|$ value of 0.000 for the income coefficient, with an Adjusted $R^2$ value of 0.330.
What were your interpretation of these findings?
We interpret the findings between log annual income and probability of automation as supporting our initial hypothesis that lower-paying jobs are more likely to suffer from job automation. Although the $R^2$ value from our OLS regression is modest, the p-value of 0.000 shows that the relationship is statistically significant. We believe further analysis with additional variables could improve the adjusted $R^2$ value. However, it is important to be cautious with these findings, since our probability of automation data are not normally distributed.
What approaches did you use? Why?
To analyze which jobs contribute most to automation, we look at the total number of employees per occupation in each state and take that as a percentage of the total number of working employees in that state. That percentage is then multiplied by the occupation's probability of automation, and the products are summed across occupations for each state. This gives a relative, weighted likelihood of automation for each state. We further look at which occupation contributes most to this factor and organize this data geospatially. We chose this approach because it yields a relative score that lets us compare each state's overall probability of automation, instead of having our data skewed by absolute employment values. This method also shows which jobs need further attention in our analyses.
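As a self-contained illustration of this weighting (toy numbers, not our data): for each state we divide each occupation's employee count by the state total to get a share, multiply by the automation probability, and sum.
import pandas as pd

toy = pd.DataFrame({
    'Probability': [0.92, 0.04],       # automation probability per occupation
    'California':  [100000, 50000],    # employees in each occupation for one toy state
})
share = toy['California'] / toy['California'].sum()
state_score = (share * toy['Probability']).sum()   # weighted likelihood of automation for the state
print(state_score)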
# Build composite likelihood of unemployment per state
state_likelihood = []
df_state_data = pd.DataFrame()
for state in states:
likelihood = 0.0
max_likelihood = 0.0
for index in range(len(df_prob_m)):
new_likelihood = df_prob_m['Probability'][index] * df_prob_m_normed[state][index]
likelihood += new_likelihood
if new_likelihood > max_likelihood:
max_likelihood = new_likelihood
df_state_data[state] = (df_prob_m.Occupation[index], df_prob_m[state][index], df_prob_m.SOC[index])
state_likelihood.append(likelihood)
#print('state: {}\n\t {}'.format(state, likelihood))
# Transform state data dataframe for easier manipulation
df_state_data = df_state_data.transpose()
df_state_data.rename(columns={0:'Occupation', 1:'Number', 2:'SOC'},inplace=True)
# Change datatype for number of employees
df_state_data.Number = df_state_data.Number.astype(str)
# Look at the occupations that most contribute to a state's weighted likelihood of automation, along with the
## occupation's respective wage and number of states in which it is the top contributor
for occupation in df_state_data.Occupation.unique():
SOC = df_state_data[df_state_data.Occupation == occupation].SOC.unique()[0]
wage = int(df_income[df_income.SOC == SOC].A_MEAN.values[0])
n_states = len(df_state_data[df_state_data.Occupation == occupation])
print('Occupation: {}\n Wage: {}\n Num States: {}'.format(occupation, wage, n_states))
From this we can see that by far the largest share of jobs expected to be lost to automation belongs to retail salespersons. (This analysis weights the number of people affected by the probability of automation.)
# Add text column for hover-info on map
df_state_data['text'] = df_state_data.index + '<br>' + \
'Most affected occupation: ' + df_state_data.Occupation + '<br>' + \
'Num employees of occupation in state: ' + df_state_data.Number
# Plot the weighted likelihood of automation by state
fig = go.Figure(data=go.Choropleth(
locations=states_abbv, # Spatial coordinates
z = state_likelihood, # Data to be color-coded
locationmode = 'USA-states', # set of locations match entries in `locations`
colorscale = 'blues',
colorbar_title = "Automation Probability",
autocolorscale=False,
text = df_state_data.text
))
fig.update_layout(
title_text = 'Likelihood of Job Automation by State',
geo_scope='usa', # limit map scope to USA
)
fig.show()
What were the results?
The results from our analysis of automation likelihood by state show that only five different jobs serve as the top contributor to automation likelihood across all 50 states plus DC (51 “areas” total). These occupations are cashiers, retail workers, administrative assistants, food prep & service workers, and office clerical workers. In 40 of the 51 areas, the highest contributing occupation is retail workers. This occupation has an average US wage of around $27,000, which is less than half the average income in 2016 (when these data were collected) and falls below the first quartile for US workers. We also see that the lowest aggregate probability of automation is in DC, at 29.93% total risk of automation; the highest contributor in DC is secretaries.
What were your interpretation of these findings?
We interpret the findings of our state-automation data as supporting our hypothesis that lower-paying jobs will be more susceptible to job automation than higher-paying jobs. Since this statistic is based on the number of employees per occupation, it is necessary to make the distinction that this metric weights the probability of automation by the number of jobs: a job with 100 employees and a .99 probability of automation would have the same metric as a job with 1000 employees and a .099 probability of automation.
It is also interesting to consider secretaries in DC. The secretary position is the highest paying occupation from the five highest-contributor occupations. We interpret this under the assumption that a secretary is a relatively common job in DC (with each of the many political positions, among others, requiring at least one secretary). This means that secretaries make up a significant amount of the jobs held in DC, but current technological advancements like digital assistants and phone screenings may soon be capable of automating this position.
What approaches did you use? Why?
In order to represent the income change values per state, we chose a geospatial map. A geospatial map will help us better assess the states in which income change is heavily affected. This will also help us in our comparison to the analysis above, of the automation-prone states and its impact on unemployment in the given state.
# Build the income percent change value for each state (ordered to match `states`)
state_incomeChange = []
for state in states:
    # look up this state's row directly instead of overwriting a value in an inner loop
    incomeChange = totalIncomeState.loc[totalIncomeState['State'] == state, 'change'].values[0]
    state_incomeChange.append(incomeChange)
fig = go.Figure(data=go.Choropleth(
locations=states_abbv,
z = state_incomeChange,
locationmode = 'USA-states',
colorscale = 'Purples',
colorbar_title = "Income Change",
))
fig.update_layout(
title_text = 'Income Percent Change 2008 to 2017',
geo_scope='usa',
)
fig.show()
What were the results?
What were your interpretation of these findings?
Distributions
The distribution of the “Perception and Manipulation” data is unimodal, with a clear peak; it is not normally distributed and is positively skewed. The “Creative Intelligence” data show a bimodal distribution, with the highest peak around an importance level of 45. The “Social Intelligence” distribution is unimodal and not normally distributed, with a slight positive skew.
Outliers
There are no outliers among any of the job skillset datasets.
Relationships Between Variables
There is a negative relationship between job automation likelihood and job skill importance; the strength of this relationship varies by skill category (see below).
#Distribution of Perception and Manipulation Importance among jobs
sns.distplot(df_prob_h.Perception_and_Manipulation, bins=20)
plt.xlabel('Skill Importance')
plt.title('Importance of Perception and Manipulation Skills', loc='left')
plt.show()
#Distribution of Creative Intelligence Importance among jobs
sns.distplot(df_prob_h.Creative_Intelligence, bins=20)
plt.xlabel('Skill Importance')
plt.title('Importance of Creative Intelligence Skills', loc='left')
plt.show()
#Distribution of Social Intelligence Importance among jobs
sns.distplot(df_prob_h.Social_Intelligence, bins=20)
plt.xlabel('Skill Importance')
plt.title('Importance of Social Intelligence Skills', loc='left')
plt.show()
What approaches did you use? Why?
We first run OLS regression analyses of automation probability against each of the three skill-importance composites (Perception and Manipulation, Creative Intelligence, and Social Intelligence); the full results are reported below. Because the scatterplots show clusters rather than clean linear trends, we also break automation probability into three groups -- low (0-33%), medium (33-66%), and high (66-100%) -- so we can compare the distribution of skill importance across those groups.
For data visualization, we chose box plots of skill importance within each probability group, which give readers the clearest answer to the question: what is the likelihood of a job being automated based on the skills it uses? The probability groups (low, medium, high) sit on the X axis, and the importance of a job skill on the Y axis.
#Regression of Perception and Manipulation
outcome, predictors = patsy.dmatrices('Probability ~ Perception_and_Manipulation', df_prob_h)
mod = sm.OLS(outcome, predictors)
res = mod.fit()
print(res.summary())
#Regression of Creative Intelligence
outcome, predictors = patsy.dmatrices('Probability ~ Creative_Intelligence', df_prob_h)
mod = sm.OLS(outcome, predictors)
res = mod.fit()
print(res.summary())
#Regression of Social Intelligence
outcome, predictors = patsy.dmatrices('Probability ~ Social_Intelligence', df_prob_h)
mod = sm.OLS(outcome, predictors)
res = mod.fit()
print(res.summary())
#Perception and Manipulation scatter plot
df_prob_h.plot.scatter('Perception_and_Manipulation', 'Probability')
#Creative_Intelligence
df_prob_h.plot.scatter('Creative_Intelligence', 'Probability')
#Social Intelligence scatter plot
df_prob_h.plot.scatter('Social_Intelligence', 'Probability')
#Perception and Manipulation Box Plot
merged_df1low = df_prob_h.loc[df_prob_h['Probability'] <= 0.33].assign(Location="Low")
merged_df1medium = df_prob_h.loc[(df_prob_h['Probability'] > 0.33) & (df_prob_h['Probability'] <= 0.66)].assign(Location="Medium")
merged_df1high = df_prob_h.loc[df_prob_h['Probability'] > 0.66].assign(Location="High")
allValues = [merged_df1low, merged_df1medium, merged_df1high]
allMerged = pd.concat(allValues)
plt.figure(figsize=(8,8))
cdf = pd.concat([merged_df1low, merged_df1medium, merged_df1high])
ax = sns.boxplot(x="Location", y="Perception_and_Manipulation", data=cdf)
ax.set(title='Job Automation Likelihood vs. Perception and Manipulation',xlabel='Job Automation Likelihood',
ylabel='Perception and Manipulation Importance')
plt.show()
merged_df1high.shape
#Creative Intelligence Box Plot
merged_df2low = df_prob_h.loc[df_prob_h['Probability'] <= 0.33].assign(Location="Low")
merged_df2medium = df_prob_h.loc[(df_prob_h['Probability'] > 0.33) &
(df_prob_h['Probability'] <= 0.66)].assign(Location="Medium")
merged_df2high = df_prob_h.loc[df_prob_h['Probability'] > 0.66].assign(Location="High")
allValues = [merged_df2low, merged_df2medium, merged_df2high]
allMerged = pd.concat(allValues)
plt.figure(figsize=(8,8))
cdf = pd.concat([merged_df2low, merged_df2medium, merged_df2high])
ax = sns.boxplot(x="Location", y='Creative_Intelligence', data=cdf)
ax.set(title='Job Automation Likelihood vs. Creative Intelligence Importance',xlabel='Job Automation Likelihood',
ylabel='Creative Intelligence Importance')
plt.show()
merged_df2high.shape
#Social Intelligence Box Plot
merged_df3low = df_prob_h.loc[df_prob_h['Probability'] <= 0.33].assign(Location="Low")
merged_df3medium = df_prob_h.loc[(df_prob_h['Probability'] > 0.33) &
(df_prob_h['Probability'] <= 0.66)].assign(Location="Medium")
merged_df3high = df_prob_h.loc[df_prob_h['Probability'] > 0.66].assign(Location="High")
allValues = [merged_df3low, merged_df3medium, merged_df3high]
allMerged = pd.concat(allValues)
plt.figure(figsize=(8,8))
cdf = pd.concat([merged_df3low, merged_df3medium, merged_df3high])
ax = sns.boxplot(x="Location", y="Social_Intelligence", data=cdf)
ax.set(title='Job Automation Likelihood vs. Social Intelligence Importance',xlabel='Job Automation Likelihood',
ylabel='Social Intelligence Importance')
plt.show()
merged_df3high.shape
What were the results?
When we run a regression analysis, we see there is no linear relationship between the importance of a job skill and the likelihood that the job is automated. We see an adjusted $R^2$ value of 0.026 for Perception and Manipulation skills, 0.021 for Creative Intelligence skills, and 0.041 for Social Intelligence skills. The correlational data likewise show no relationship between these variables. That said, while the relationships are not linear, we do see clusters of data points in the scatterplots for each category.
When breaking automation probability into three separate groups -- low (0-33%), medium (33-66%), and high (66-100%) -- we can more clearly see the relationship between those groups and the importance of a job skill.
What were your interpretations of these findings?
We interpret these findings as supporting our hypothesis: there is a relationship between a job requiring a high level of certain types of skills and a lower likelihood of automation.
All of our datasets have been provided by publicly available sources such as data.world, the Bureau of Labor Statistics, or other US Government agencies. Since these data are provided to the public, we anticipate no restrictions on using them for the purpose of this project, and no restrictions have been posted for the datasets we are accessing. Much of the data we are using is provided by state or federal governments, which are required by law to provide strong protection for the data they make available to the public. For the data sources that do not guarantee this, we will provide privacy by anonymizing any potentially personally identifiable information. However, we do not anticipate this being an issue, as all the data collected are aggregates that do not include personally identifiable information. Occupations with fewer than 1,000 employees in a given state were removed from our data.
One of our dataset sources, data.world, consists of contributors of various backgrounds and experience levels. This increases the potential for bias, since there is no way of confirming whether the presentation of the data is biased towards fulfilling the contributor’s needs. The user has the option of viewing the contributor’s Kaggle profile and their LinkedIn profiles, but the amount of information that is provided to the user is controlled by the contributor themselves. If we utilize the provided information to confirm the legitimacy of the contributor’s data, we can decrease the chance of potential bias. For our other data source, data.gov, there is a smaller potential of bias given the fact that it is a government source. While we cannot assume that government sources are entirely free of bias, we can assume that it is more of a fair source than non-government sources. Government sources are supposed to be free of affiliation to political parties, which is why we can assume that the data presented has relatively low levels of bias.
A potential ethical concern we have considered is data misinformation/misinterpretation. If an individual comes across our analysis without the understanding that this data relies partly on probability instead of concrete metrics, and that this analysis was not conducted by professional Data Scientists, it might lead to the unnecessary spread of fear that one might lose their job to automation, or further contribute to data misinterpretation.
This project analyzed many factors that we believed were likely to contribute to the probability of a job being automated, including annual and state-level income, the skills required for the job, job field growth over time, and the importance of the job to a state's total workforce. To complete these analyses, we employed a combination of linear regression analyses and geospatial methods.
From our wage-by-occupation analysis, our results show that there is indeed a negative correlation between the mean annual income of a job and that job's probability of automation. Additionally, our employment and automation by state analysis shows that the single most common job contributing to a state's weighted probability of automation is retail work, which has a mean annual income below the first quartile. We conclude that these results support our primary hypothesis that jobs with higher salaries have lower automation probabilities, and, conversely, that jobs with lower salaries have higher automation probabilities. In regards to our analysis of gross annual income in a given state, our results indicate that automation likelihood values in the medium range (0.35 < p <= 0.4) have a negative relationship with income change: an increase in automation likelihood in a medium-risk state corresponds to a decrease in income percent change. For high-risk states (p > 0.4), there is a positive relationship with income change: an increase in automation probability corresponds to an increase in income percent change. In regards to our skills analyses, jobs requiring low levels of Perception and Manipulation skills are more at risk of automation; similarly, jobs with low levels of Creative Intelligence importance are more at risk, and jobs with low levels of Social Intelligence importance are also more at risk. This indicates that jobs requiring skills under the umbrella of perception, creativity, and empathy are less likely to be at risk of job automation.
A current limitation of this project involves our use of single linear regressions. Since job automation is such a complex concept with so many different factors contributing to it, it is difficult for us to draw conclusions about any single variable, especially when only considering linear relationships. This is an issue that we encountered in our percent change in employment analysis; the data did not fit the model we were using, and further attempts did not resolve the issue. Due to the outstanding impact that job automation will have on millions of Americans within the next decade, further analysis should be done to analyze these factors in combination with each other, along with comparison to actual automation rates.
Michael: