With the United States slated to lose over 1.5 million jobs to automation in the next decade alone, the causes of job automation can be a hot topic to many. This project looks at many factors we think will contribute to the probability of a job being automated, including annual and state-level income, the skills required for the job, job field growth over time, and the importance of the job to a state's total workforce. We have seen income does correlate with the likelihood of automation to some extent, low paying jobs will contribute more to state-wide automation.

How is the further integration of job automation into the United States’ workforce related to employment salaries, skills, and employment rates across various job industries in the past 25 years?

At the outset of our project, our group had an interest in how automation has shaped the US economy, and how it will continue to shape the economy we live in. From fast food restaurants implementing kiosks to cashierless grocery stores, we’ve seen a wide variety of jobs automated and replaced in recent years. Even so, we were able to see that the number of employees in manufacturing from the years 2008 to 2018 from the Annual Survey of Manufacturers (ASM) have gone down. While it’s fascinating to see these numbers, we felt that there must be an analysis of the detriment that automation has on replaced workers.

Our group wanted to look further into how the integration of job automation has currently affected various industries in the job market. According to a blog post back from 2012, it’s stated that “the more ‘blue collar’ you are, the more likely you are to be unemployed.” That claim is backed by the U.S. Bureau of Labor Statistics in both their Current Population Surveys of 2010 and 2018, which indicate the less education achieved, the higher the unemployment rate. While education level was a factor that we initially considered potentially important to our research, we did not find relevant datasets when doing our research. In recent years (before the Coronavirus outbreak), unemployment has decreased significantly, and the trend continues, with less education resulting in a higher unemployment rate.

It is important to consider how job automation may impact other industries, workplaces, and unemployment rates, and after assessing the data we had, we centered in on this factor. The most similar prior work we have found looks at the Likelihood of Automation with comparisons between Occupation and Salary. The analyses done in this research use the same automation probability dataset that we work with in our analysis (Dataset 1). The second dataset used in this study focuses on a handful of data categories regarding income levels, which we have additionally adopted into our data collection for this project. From this article, we found data which comes from ONET, a US government resource for storing data and conducting data analysis on the job market in the United States. This paper focuses on nine datasets from ONET: Negotiation, Social Perceptiveness, Persuasion, Finger Dexterity, Manual Dexterity, Cramped Work Spaces, Originality, Creativity, and Assisting And Caring For Others. Utilizing this wide variety manner, the aspect of our study considering income and state automation likelihood acts as an extension of the prior work.

Although our initial findings had some face value, many factors were left unknown. That original research raised only more questions: how many jobs have already been affected by automation, and which jobs are they? Particularly, what is the probability that a job would be automated, and what is the likelihood of automation of the job pertaining to income, state of occupancy, and various job skills? We sought to find out.

The most similar prior work we have found looks at the Likelihood of Automation with comparisons between Occupation and Salary. The analyses done in this research use the same automation probability dataset that we work with in our analysis (Dataset 1). The second dataset used in this study focuses on a handful of data categories regarding income levels, which we have additionally adopted into our data collection for this project (Dataset 2). In that manner, the aspect of our study considering income and state automation likelihood acts as an extension of the prior work.

References:

We hypothesize:

- Occupations with higher salaries will have a lower probability of being automated.
- Jobs with high automation probabilities will show a stagnant or decreasing change in employment rates for the eight-year period of time surrounding the point at which the automation likelihood was estimated.
- High perception and manipulation skills correlate more to a high probability of automation.
- High creative and social intelligence scores correlate more with a low probability of automation.

Dataset 1:

- Dataset Name: Occupations by State and Likelihood of Automation
- Link to the dataset: https://data.world/wnedds/occupations-by-state-and-likelihood-of-automation
- Number of observations: 702 (1 per occupation)

This dataset includes job title, OCC code (used to categorize), probability of occupation, and number of employees per state.

Dataset 2:

- Dataset Name: Wage by Occupation
- Link to the dataset: https://data.world/quanticdata/occupation-and-salary-by-state-and-likelihood-of-automation/workspace/file?filename=national_M2016_dl.xlsx
- Number of observations: 1394

This dataset includes job title, OCC code, and various wage metrics per occupation (salary vs hourly, average pay, statistics for these fields, North American Industry Classification System code, employee number data, etc).

Dataset 3:

- Dataset Name: Employment by State
- Link to the dataset: https://www.bea.gov/data/employment/employment-by-state
- Number of observations: 51 (states + DC)

This dataset includes the number of employees per occupation present per state in any given year.

Dataset 4:

- Dataset Name: Employed persons by detailed occupation and age (table 11b)
- Link to the dataset: https://www.bls.gov/cps/tables.htm
- Number of observations: 567

These datasets (collected for years between 2011 and 2019) show the number of employees per occupation by age group, and gives the total number of employees per occupation.

Dataset 5:

- Name: ONET Job Skills/Abilities/Features
- Link: https://www.onetonline.org/find/descriptor/browse/Skills/
- Number of Observations: 969

Dataset 5 is collecting information from 9 various datasets from the ONET site. These datasets are: Negotiation, social perceptiveness, persuasion, finger dexterity, manual dexterity, cramped work spaces, originality, creativity, and assisting and caring for others. They are all sorted by SOC code and contain 969 observations.

Dataset 6:

- Name: PARPI Port 2008 to 2017
- Link: https://www.bea.gov/data/income-saving/personal-income-by-state
- Number of Observations: 210

Dataset 6 provides the Metropolitan and Non-Metropolitan personal income values per state in the United States from 2008 to 2017.

Dataset 7 (A copy of Dataset 1):

- Dataset Name: Occupations by State and Likelihood of Automation
- Link to the dataset: https://data.world/wnedds/occupations-by-state-and-likelihood-of-automation
- Number of observations: 702 (1 per occupation)

This dataset includes job title, OCC code (used to categorize), probability of occupation, and number of employees per state. This dataset was copied and names df_occ to resolve a merging problem that we were having across datasets and analyses.

Datasets 1,2 provide information by occupations will be combined by their OCC code, since it is the most standard metric they share (note: some datasets have different occupation name string formatting, so this is not as easily used). Since we mostly want to consider this data in terms of probability of automation, data from Dataset 2 will only be kept if there is a corresponding OCC code in Dataset 1. Dataset 3 is used to turn the number of employees per occupation per state in Dataset 1 into fractions so we can look at relative numbers of employement as opposed to absolute numbers.

The 7 datasets from Dataset 4 (corresponding to 2013-2019 data) will be combined in order to calculate the change in employment over the 7 year period. This data will be combined by Dataset 1 by OCC code in order to investigate the relationship between job field growth and likelihood of automation for that job field.

In [1]:

```
# Installs plotly for displaying geospatial graphs
!pip install --user plotly
```

In [2]:

```
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import patsy
import statsmodels.api as sm
import plotly.graph_objects as go
from scipy import stats
from copy import copy
from scipy.stats import normaltest
```

In [3]:

```
# Used for building geo-spatial maps
states = ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware',
'District of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois','Indiana', 'Iowa', 'Kansas',
'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi',
'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island',
'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
'West Virginia', 'Wisconsin', 'Wyoming']
states_abbv = ['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA',
'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD',
'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ',
'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC',
'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY']
```

**How ‘clean’ is the data?**

Our data are clean in that they were provided by reputable sources who provided the data in a format without any unnecessary variables, null data, etc. For one of the datasets, some wage information was represented by a * or #, with the * meaning not enough data was available for inclusion, and the # meaning wage exceeded $200,000/yr. These observations were dropped from our analyses

In regards to the job skillset data, the job skills datasets we collected from O-NET were already in a CSV format, which had easy-to-understand categories and pre-formulated constructs, it was easy for us to sort through the data without much cleaning. However, the skills datasets and the dataset used for automation probability used different titles for occupation, which meant we had to merge over the same code: the Standard Occupational Classification (SOC) system, which is a federal convention used for occupational data analysis. In order to use both data sets, we merged them by their similar SOC codes.

The annual incomes data was relatively clean. The only exclusion is that the first row and last rows needed to be removed because they were accounting for values in the United States as a whole, and not representing an individual state.

**What did you have to do to get the data into a usable format?**

Since some datasets were read in from excel datasheets, it was necessary to clean this data by removing the first few rows (title information), renaming the columns for proper identification, and resetting the index since the first n rows were removed. It was also necessary to remove additional columns that were not related to the dataframe, but were added for structural purposes in the excel datasheet.

Some dataframes required transposing/reshaping in order to more easily work with the data.

The skills datasets and the dataset used for automation probability used different titles for occupation, which meant we had to merge over the same code: the Standard Occupational Classification (SOC) system, which is a federal convention used for occupational data analysis. In order to use both data sets, we merged them by their similar SOC codes.

For annual income data, each state was split into “Metropolitan” and “Non-Metropolitan” income levels. We wanted to combine those income levels into one value, so we needed to use groupby to combine every second value together, across 102 values.

**What pre-processing steps were required for your methods?**

For our state analysis, it was necessary to transpose our data in the beginning, since the variables in the original dataset were now to be used as observations in this new dataset. An additional transformation was made to “normalize” the number of employees in each column by dividing them by the total number of workers per state.

We checked the distribution of variables such as probability of automation, wage, and employment percent change.

For the data in Datasets 4, we needed to drop all non-total employee rows and merge the 9 datasets into one set.

For the individual occupation wage analysis, it was necessary to drop most columns. We included occ code, annual mean wage, and occupation. The resulting dataset was merged with Dataset 1 by occ/soc code so we could easily compare the likelihood of automation and the annual mean wage.

To get our data into an optimal format for the job skillset analysis, we had to change the column titles on a few of the skill datasets, as well as filter out any rows that were listed as “Not Available” on a few of the datasets. Primarily, a dataset titled “Fine Arts” had this issue, which makes sense, since we were measuring our analyses based on the relatability of a skill (e.g fine arts) to a single job, and many jobs utilize no degree of fine arts skills. The “Assisting and Caring for Others” dataset also had this issue, as neither the job “Models” nor “Financial Analyst” had any relevance to this trait, so their data was removed. Moreover, the job “Mathematical Technician” had data which, across several datasets, corresponded little to the rest of the data, leading me to believe that row was incomplete, inconsistent, or significantly altered. It was removed from those datasets too.

In [4]:

```
# Function tidy-izes the data
def organize(df, year):
'''
- Removes the first 7 columns from the dataframe (corresponds to title and additional non-data cells from
excel formatting)
- Adds column titles back in
- Drops null columns
- Trims dataframe to only include occupation and title
'''
df = df[7:]
df = df.rename(columns={df.columns[0]: 'Occupation', df.columns[1]: 'Total{}'.format(year)})
df.dropna(inplace=True)
df.reset_index(inplace=True)
df = df[['Occupation', 'Total{}'.format(year)]]
return df
```

In [5]:

```
# Creating the pertinent DataFrames
df_prob = pd.read_csv('datasets/raw_state_automation_data.csv', encoding='cp1252')
df_employment = pd.read_excel('datasets/employmentbystate.xls')
df_income = pd.read_excel('datasets/wagedata.xlsx')
income = pd.read_csv('datasets/PARPI_PORT_2008_2017.csv')
df_occ = pd.read_csv('datasets/raw_state_automation_data.csv', encoding='cp1252')
employment2011 = organize(pd.read_excel('datasets/blsdata/cpsaat11b2011.xlsx'), 2011)
employment2012 = organize(pd.read_excel('datasets/blsdata/cpsaat11b2012.xlsx'), 2012)
employment2013 = organize(pd.read_excel('datasets/blsdata/cpsaat11b2013.xlsx'), 2013)
employment2014 = organize(pd.read_excel('datasets/blsdata/cpsaat11b2014.xlsx'), 2014)
employment2015 = organize(pd.read_excel('datasets/blsdata/cpsaat11b2015.xlsx'), 2015)
employment2016 = organize(pd.read_excel('datasets/blsdata/cpsaat11b2016.xlsx'), 2016)
employment2017 = organize(pd.read_excel('datasets/blsdata/cpsaat11b2017.xlsx'), 2017)
employment2018 = organize(pd.read_excel('datasets/blsdata/cpsaat11b2018.xlsx'), 2018)
employment2019 = organize(pd.read_excel('datasets/blsdata/cpsaat11b2019.xlsx'), 2019)
# Merging the different employment datasets in order to analyze percent change
employment = pd.merge(pd.merge(employment2011, employment2012), \
pd.merge(pd.merge(employment2013, pd.merge(employment2014, employment2015)), \
pd.merge(pd.merge(employment2016, employment2017), \
pd.merge(employment2018, employment2019))))
# Creating copy of main dataframe to avoid issues with individual cleaning and analyses
df_prob_m = copy(df_prob)
df_prob_g = copy(df_prob)
df_prob_h = copy(df_prob)
df_prob_k = copy(df_prob)
```

In [6]:

```
# Structure main dataset
df_prob_m.sort_values(by=['SOC'], inplace=True)
# Standardize Occupation column between datasets
employment['Occupation'] = employment['Occupation'].apply(lambda x: x.title())
df_prob_m['Occupation'] = df_prob_m['Occupation'].apply(lambda x: x.title())
# Create trimmed dataset
df_prob_m_trim = df_prob_m[['SOC', 'Occupation', 'Probability']]
# Include employment info
df_prob_m_trim = pd.merge(df_prob_m_trim, employment)
# Add percent change based on 2012 to 2019 data (Note that data cannot be analyzed for 2011 as there were some
## states in which occupations did not have employees, causing a division by 0 error)
df_prob_m_trim['percent_change'] = (df_prob_m_trim.Total2019 - df_prob_m_trim.Total2012)/df_prob_m_trim.Total2012
```

In [7]:

```
#Remove unnecessary income data
df_income = df_income[['OCC_CODE', 'OCC_TITLE', 'A_MEAN']]
df_income.rename(columns={'OCC_CODE':'SOC', 'OCC_TITLE':'Occupation'},inplace=True)
# Create combined probability & income dataset
df_probincome_m = pd.merge(df_prob_m, df_income, how='left', left_on='SOC', right_on='SOC')
# Drop rows if no mean wage info
# * is used to represent occupation with insufficient data
df_probincome_m = df_probincome_m[df_probincome_m.A_MEAN != '*']
# Remove any null income values
df_probincome_m.dropna(inplace=True,subset=['A_MEAN'])
df_probincome_m = df_probincome_m.reset_index()
# Rest of values should be numeric, so transform
df_probincome_m.A_MEAN = pd.to_numeric(df_probincome_m.A_MEAN)
# Add log mean income
df_probincome_m['log_A_MEAN'] = df_probincome_m.A_MEAN.apply(np.log)
```

In [8]:

```
# Clean employment data by removing excel specific titles & rows, dropping null data
## adding proper column names, dropping unnecessary columns, and typecasting
df_employment = df_employment[5:]
df_employment.dropna(inplace=True)
df_employment = df_employment.rename(columns={df_employment.columns[1]:'State', df_employment.columns[2]:'Employment'})
df_employment.reset_index(inplace=True)
df_employment = df_employment[['State', 'Employment']]
df_employment.Employment = df_employment.Employment.astype(int)
# Reshape data to easily apply later transformation
df_employment = df_employment.transpose()
df_employment.columns = df_employment.iloc[0]
df_employment = df_employment.iloc[1:]
# Transform employment data to reflect employment relative to population
df_prob_m_normed = copy(df_prob_m)
for state in states:
df_prob_m_normed[state] = df_prob_m_normed[state].apply(lambda x: x/df_employment[state])
# Don't need SOC values, so remove
#df_prob_m.drop(columns=['SOC'],inplace=True)
```

In [9]:

```
income = income[['GeoName', 'LineCode','2008', '2009', '2010', '2011', '2012', '2013', '2014' , '2015', '2016', '2017']]
income = income[income.LineCode == 1.0]
income = income.loc[income.index > 0]
```

In [10]:

```
income = income.reset_index()
income = income.drop('index', 1)
```

In [11]:

```
N = 2
totalIn = income.groupby(income.index // N).sum()
```

In [12]:

```
totalIn['change'] = totalIn['2017'] - totalIn['2008']
totalIn['change'] = totalIn['change'] / totalIn['2008']
```

In [13]:

```
# Only want to look at probability, state data so remove other columns
df_occ.drop(columns=['SOC', 'Occupation'], inplace=True)
```

In [14]:

```
# Transform employment data to reflect employment relative to population
for state in states:
df_occ[state] = df_occ[state].apply(lambda x: x/df_employment[state])
```

In [15]:

```
state_likelihood2 = []
for state in states:
likelihood2 = 0
for index in range(len(df_occ)):
likelihood2 += df_occ['Probability'][index] * df_occ[state][index]
state_likelihood2.append(likelihood2)
```

In [16]:

```
statesDF = pd.DataFrame(states)
totalIncomeState = totalIn.merge(statesDF, left_index = True, right_index = True)
totalIncomeState = totalIncomeState.rename(columns={0: "State"})
```

In [17]:

```
likelihoodAutomation = pd.DataFrame(state_likelihood2)
incomeANDautomation = likelihoodAutomation.merge(totalIncomeState, left_index=True, right_index=True)
incomeANDautomation = incomeANDautomation.rename(columns={0: "Automation"})
```

In [18]:

```
# Setting up automation data
automation = pd.read_csv('datasets/raw_state_automation_data.csv', encoding='cp1252')
#removing States data
automation.drop(automation.columns.difference(['SOC','Occupation', 'Probability']), 1, inplace=True)
automation['Probability'] = automation['Probability'] * 100
```

In [19]:

```
#Loading in all the job skills data
originality = pd.read_csv('datasets/Originality.csv')
negotiation = pd.read_csv('datasets/Negotiation.csv')
social_percept = pd.read_csv('datasets/Social_Perceptiveness.csv')
persuasion = pd.read_csv('datasets/Persuasion.csv')
finger_dext = pd.read_csv('datasets/Finger_Dexterity.csv')
manual_dext = pd.read_csv('datasets/Manual_Dexterity.csv')
cramped_work = pd.read_csv('datasets/Cramped_Work_Space_Awkward_Positions.csv')
fine_arts = pd.read_csv('datasets/Fine_Arts.csv')
df_prob_h = pd.read_csv('datasets/Assisting_and_Caring_for_Others.csv')
```

In [20]:

```
#Cleaning and merging
df_prob_h['Assisting and Caring Importance'] = df_prob_h['Importance']
# Removing doubles "importance" column and level column as it is not used
df_prob_h = df_prob_h.drop(['Importance', 'Level'], axis=1)
df_prob_h['Originality Importance'] = originality['Importance']
df_prob_h['Negotiation Importance'] = negotiation['Importance']
df_prob_h['Social Perception Importance'] = social_percept['Importance']
df_prob_h['Persuasion Importance'] = persuasion['Importance']
df_prob_h['Finger Dexterity Importance'] = finger_dext['Importance']
df_prob_h['Manual Dexterity Importance'] = manual_dext['Importance']
df_prob_h['Cramped Work Context'] = cramped_work['Context']
df_prob_h['Fine Arts Importance'] = fine_arts['Importance']
```

In [21]:

```
#Merging the Data Sets
df_prob_h['SOC'] = df_prob_h['Code'].astype(str).replace('\.00', '', regex=True)
df_prob_h = pd.merge(left=df_prob_h, right=automation, left_on='SOC', right_on='SOC')
df_prob_h.drop(['Code', 'Occupation_x'], axis=1, inplace=True)
df_prob_h = df_prob_h.rename(columns = {'Occupation_y':'Occupation'})
```

In [22]:

```
#Creating 3 Seperate Categories Based Upon
df_prob_h['Perception_and_Manipulation'] = df_prob_h[['Finger Dexterity Importance','Manual Dexterity Importance',
'Cramped Work Context']].mean(axis=1)
df_prob_h['Creative_Intelligence'] = df_prob_h[['Originality Importance','Fine Arts Importance']].mean(axis=1)
df_prob_h['Social_Intelligence'] = df_prob_h[['Social Perception Importance', 'Negotiation Importance',
'Persuasion Importance', 'Assisting and Caring Importance']].mean(axis=1)
```

**Distributions**

Our probability of automation variable is bound between 0-1 with a bimodal distribution. The peaks occur at the two boundaries. While the left mode has a higher peak, the right mode carries more weight.

Our variable representing the percent change in employment between 2012 and 2019 follows a relatively normal unimodal distribution with few mini-peaks that do not change the shape drastically. This distribution is slightly right-skewed.

**Outliers**

While there is a relatively smooth distribution outside of the boundary peaks, there is a non-modal peak in the 0.35-0.40 probability bin.

There is a percent change outlier right around 2.75. This corresponds to Financial Analysts, which grew from 84,000 to 345,000 employees.

**Relationship between variables**

There is a very poor (horizontal) linear relation between the probability of automation and change in employment between 2012 and 2019. There is a stronger linear relationship between log annual income and probability of automation.

In [23]:

```
# Distribution of probability of automation
sns.distplot(df_prob_m.Probability, bins=20)
plt.xlabel('Probability of Automation')
plt.title('Distribution of Automation Probability', loc='left')
plt.show()
```

In [24]:

```
# Distribution of percent change in employment
sns.distplot(df_prob_m_trim.percent_change)
plt.xlabel('Change in Employment')
plt.title('Distribution of Employment Change from 2012-2019', loc='left')
plt.show()
```

In [25]:

```
# Investigate outlier
df_prob_m_trim[df_prob_m_trim.percent_change >= 2]
```

Out[25]:

In [26]:

```
# Determine normality of percent_change
k2, p = stats.normaltest(df_prob_m_trim.percent_change)
print(p)
```

In [27]:

```
# Plot the percent change vs Probability of employment
sns.scatterplot(df_prob_m_trim.Probability, df_prob_m_trim.percent_change)
plt.xlabel('Probability of Automation')
plt.ylabel('Change in Employment')
plt.title('Probability of Automation vs Change in Employment', loc='left')
plt.show()
```

**What approaches did you use? Why?**

We analyze the relationship between percent change in employment and probability of automation using an OLS Linear Regression model. This method of analysis was chosen because there is a relatively linear relationship between the two variables.

In [28]:

```
# OLS Regression for Probability and percent change
# NOTE: Data does not meet requirements necessary for testing linearity, but want to take a look
outcome, predictors = patsy.dmatrices('Probability ~ percent_change', df_prob_m_trim)
model = sm.OLS(outcome, predictors)
results = model.fit()
print(results.summary())
```

**What were the results?**

The results from our analysis of the relationship between percent change in employment and the probability of automation shows poor results. Due to the bimodal distribution of the probability of automation, our data was not properly distributed for this type of regression. Further attempts to normalize the data did not prove to be successful, nor did further regression attempts.

**What were your interpretation of these findings?**

Although the findings from the relation between employment percent change and automation did not prove to be conclusive of anything, we initially interpreted these results to show that the likelihood (probability) of automation is a complex factor with many contributing factors, and as such, we could not analyze it solely by looking at one possible contributor. This led to a breadth of comparisons being done in order to better understand the different contributing factors to likelihood of automation.

**Distributions**

The variable "percent change in income" consists of a unimodal distribution with the peak occuring between 0.1 and 0.2 percent change in income.

The variable "automation probability" consists of a bimodal distribution with the highest peak occuring close to 0.4 probability, and the second peak occuring at close to 0.3 probability.

**Outliers**

There was one outlier in the analysis of automation likelihood vs. percent change in income: District of Columbia, Automation: 0.299355, Income Percent Change: 0.364582.

**Relationship Between Variables**

There is a negative relationship between the likelihood of automation in a state and the percent change in income in a state.

In [29]:

```
sns.distplot(totalIn.change)
```

Out[29]:

In [30]:

```
sns.distplot(incomeANDautomation.Automation)
```

Out[30]:

In [31]:

```
plt.figure(figsize=(8,8))
sns.scatterplot(incomeANDautomation.Automation, incomeANDautomation.change)
ax = sns.regplot(x="Automation", y="change", data=incomeANDautomation)
ax.set(title='Job Automation Likelihood vs. Percent Change in Income per State from 2008 to 2017', xlabel='Job Automation Likelihood', ylabel='Percent Change in Income')
plt.show()
```

**What approaches did you use? Why?**

We analyze the relationship between percent change in income and probability of automation using a Linear Regression model. This approach was used because there was a clear downwards shape in the data.

In [32]:

```
outcome2, predictors2 = patsy.dmatrices('Automation ~ change', incomeANDautomation)
model2 = sm.OLS(outcome2, predictors2)
results2 = model2.fit()
print(results2.summary())
```

**What were the results?**

With an $\alpha$ value of 0.05 and p value of 0.021, an Adjusted $R^2$ of 0.085 indicates that there is some significance in the correlation between automation likelihood in a state and its percent change in income levels.

**What were your interpretation of these findings?**

Since the correlation value is negative, we can assume that there is a correlation between an increase in automation level and the subsequent decrease in income change. This means that there is a correlation in where as automation levels increase, that income levels decrease.

In [33]:

```
automation1of3 = incomeANDautomation.loc[incomeANDautomation.Automation <= 0.35]
automation2of3 = incomeANDautomation.loc[(incomeANDautomation.Automation > 0.35) & (incomeANDautomation.Automation <= 0.4)]
automation3of3 = incomeANDautomation.loc[incomeANDautomation.Automation > 0.4]
```

In [34]:

```
plt.figure(figsize=(8,8))
automation1of3 = automation1of3.assign(Location='low')
automation2of3 = automation2of3.assign(Location='medium')
automation3of3 = automation3of3.assign(Location='high')
cdf = pd.concat([automation1of3, automation2of3, automation3of3])
ax = sns.boxplot(x="Location", y="change", data=cdf)
ax.set(title='Job Automation Likelihood vs. Income Change',xlabel='Job Automation Likelihood', ylabel='Income Change')
plt.show()
```

**What approaches did you use? Why?**

We split the data of Job Automation Likelihood evenly three ways between low probability (p <= 0.35), medium probability (0.35 < p <= 0.4), and high probability (p > 0.4). This was done to better group together the values of automation for easier analysis. It was done so that we have a better understanding of which areas of job automation will have a consequental greater result in income change.

We then analyze the relationship between percent change in income and probability of automation in low, medium, and high using a Linear Regression model. This approach was used because we took reference from the original plot and noticed that there was a downward shape in the data in the medium section, but slightly upward in the high section. We wanted to explore whether there is a different relationship between the medium likelihood and high likelihoods of job automation. In our "low" section, we had one data point, which was our outlier, so this form of analysis also gave us the chance to analyze everything beyond the outlier.

In [35]:

```
sns.scatterplot(automation2of3.Automation, automation2of3.change)
ax1 = sns.regplot(x="Automation", y="change", data=automation2of3)
ax1.set(xlabel='Automation Likelihood', ylabel='Percent Change in Income')
plt.show()
```

In [36]:

```
outcome3, predictors3 = patsy.dmatrices('Automation ~ change', automation2of3)
model3 = sm.OLS(outcome3, predictors3)
results3 = model3.fit()
print(results3.summary())
```

In [37]:

```
sns.scatterplot(automation3of3.Automation, automation3of3.change)
ax3 = sns.regplot(x="Automation", y="change", data=automation3of3)
ax3.set(xlabel='Automation Likelihood', ylabel='Percent Change in Income')
plt.show()
```

In [38]:

```
outcome4, predictors4 = patsy.dmatrices('Automation ~ change', automation3of3)
model4 = sm.OLS(outcome4, predictors4)
results4 = model4.fit()
print(results4.summary())
```

**What were the results?**

Medium: From our regression test, we learned that with an alpha value of 0.01, an Adjusted $R^2$ value of -0.038 indicates that there is significance in the correlation between medium automation likelihood in a state and its percent change in income levels. In regards to the boxplot representation of medium automation, we learn that the median value of income percent change is around 0.2.

High: From our regression test, we learned that with an alpha value of 0.01, an Adjusted $R^2$ value of 0.017 indicates that there is significance in the correlation between large automation likelihood in a state and its percent change in income levels. In regards to the boxplot representation of high automation, we learn that the median value of income percent change is around 0.17.

**What were your interpretation of these findings?**

Regression test:

Medium: Since the regression line's slope is negative, we can assume that there is a correlation between an increase in automation level within the medium range and the subsequent decrease in income change. This means that there is a correlation in where as automation at a medium level increase, that income change levels decrease.

High: Something interesting happens here -- the regression line's slope is positive, as opposed to negative. This assumes that there is a correlation between an increase in automation level in the high automation range and also an increase in the levels of income change. While we may not be able to specifically pinpoint why this happens, we can conclude that the impact caused by medium and high levels of job automation are not the same.

Boxplot: From our boxplot, we learn that states with medium-level automation probability values between (0.35 < p <= 0.4) will have a higher median value of income change in comparison to states with high-level automation probability values (p > 0.4). This indicates that states that have medium-level automation probability will have larger income changes than states with high-level automation probability.

**Distributions**

Our annual mean wage variable takes a right-skewed normal distribution. We remove the skew by applying a natural log function to this data.

**Outliers**

There is at least one outlier in annual mean income around $225,000

In [39]:

```
# Distribution of annual mean income
sns.distplot(df_probincome_m.A_MEAN)
plt.xlabel('Mean Annual Income')
plt.title('Distribution of Mean Annual Income', loc='left')
plt.show()
```

In [40]:

```
# Investigate mean annual income outlier
df_probincome_m[df_probincome_m.A_MEAN > 225000]
```

Out[40]:

We see that these two extremely high paying outliers are both oral health occupations with a low probability of automation and relatively low spread throughout the country.

In [41]:

```
# Distribution of log annual mean income
sns.distplot(np.log(df_probincome_m.A_MEAN))
plt.xlabel('Log(Mean Annual Income)')
plt.title('Distribution of log Mean Annual Income', loc='left')
plt.yticks(np.arange(0.0,1.2,0.2))
plt.show()
```

In [42]:

```
# Test normality of log wage
stat_income_mean_log, p_income_mean_log = normaltest(df_probincome_m.log_A_MEAN)
print('log mean income is normally distributed') if p_income_mean_log < 0.01 else print('log mean income is NOT normally distributed')
```

Since the data for log mean wage are normally distributed, we can consider a linear regression model to analyze the relation between income and probability.

**What approaches did you use? Why?**

We analyze the relationship between log annual income and probability of automation by using an OLS Linear Regression model because there is a clear linear relationship between these two variables.

In [43]:

```
# Complete regression for relationship between probabilty of automation and log mean annual income
outcome, predictors = patsy.dmatrices('Probability ~ log_A_MEAN', df_probincome_m)
model = sm.OLS(outcome, predictors)
results = model.fit()
print(results.summary())
```

In [44]:

```
# Look at ditribution of probability and log annual income
sns.scatterplot(df_probincome_m.Probability, df_probincome_m.log_A_MEAN)
plt.xlabel('Probability of Automation')
plt.ylabel('Log(Mean Annual Income)')
plt.title('Probability of Automation vs Mean Annual Income', loc='left')
plt.show()
```

**What were the results?**

The results from our analysis of the relationship between log annual income and the probability of automation provide a $P|t|$ value of 0.000, with an Adjusted $R^2$ value of 0.330.

**What were your interpretation of these findings?**

We interpret the findings between log annual income and probability of income to support our initial hypothesis that lower paying jobs are more likely to suffer from job automation. Although our $R^2$ value is low from the OLS Regression we performed, our 0.000 p value shows that this data is at least conclusive. We believe that further analysis with additional variables can improve our adjusted $R^2$ value. However, it is important to be cautious with these findings, since our probability of automation data are not normally distributed.

**What approaches did you use? Why?**

For analysing which jobs are most likely to most contribute to automation, we look at the total number of employees per occupation in each state, and take that as a percent of the total number of working employees in that state. That percentage is then multiplied by the occupation's respective probability of automation, and each state is summed across each occupation. This gives the relative weighted likelihood of automation across each state. We further look at which occupation most contributes to this factor, and organize this data geospatially. We chose to analyze this data in this way because it gave a relative score that allows us to compare each state’s overall probability of automation, instead of having our data skewed by looking at absolute employment values. This method also gives insight on which jobs need further attention in our analyses.

In [45]:

```
# Build composite likelihood of unemployment per state
state_likelihood = []
df_state_data = pd.DataFrame()
for state in states:
likelihood = 0.0
max_likelihood = 0.0
for index in range(len(df_prob_m)):
new_likelihood = df_prob_m['Probability'][index] * df_prob_m_normed[state][index]
likelihood += new_likelihood
if new_likelihood > max_likelihood:
max_likelihood = new_likelihood
df_state_data[state] = (df_prob_m.Occupation[index], df_prob_m[state][index], df_prob_m.SOC[index])
state_likelihood.append(likelihood)
#print('state: {}\n\t {}'.format(state, likelihood))
```

In [46]:

```
# Transform state data dataframe for easier manipulation
df_state_data = df_state_data.transpose()
df_state_data.rename(columns={0:'Occupation', 1:'Number', 2:'SOC'},inplace=True)
```

In [47]:

```
# Change datatype for number of employees
df_state_data.Number = df_state_data.Number.astype(str)
```

In [48]:

```
# Look at the occupations that most contribute to a state's weighted likelihood of automation, along with the
## occupation's respective wage and number of states in which it is the top contributor
for occupation in df_state_data.Occupation.unique():
SOC = df_state_data[df_state_data.Occupation == occupation].SOC.unique()[0]
wage = int(df_income[df_income.SOC == SOC].A_MEAN.values[0])
n_states = len(df_state_data[df_state_data.Occupation == occupation])
print('Occupation: {}\n Wage: {}\n Num States: {}'.format(occupation, wage, n_states))
```

From this we can see that an the most overwhelming amount of jobs that will be lost due to automation will occur to retail salespeople. (This analysis looks at the amount of people affected multiplied by the probability of automation)

In [49]:

```
# Add text column for hover-info on map
df_state_data['text'] = df_state_data.index + '<br>' + \
'Most affected occupation: ' + df_state_data.Occupation + '<br>' + \
'Num employees of occupation in state: ' + df_state_data.Number
# Plot the weighted likelihood of automation by state
fig = go.Figure(data=go.Choropleth(
locations=states_abbv, # Spatial coordinates
z = state_likelihood, # Data to be color-coded
locationmode = 'USA-states', # set of locations match entries in `locations`
colorscale = 'blues',
colorbar_title = "Automation Probability",
autocolorscale=False,
text = df_state_data.text
))
fig.update_layout(
title_text = 'Likelihood of Job Automation by State',
geo_scope='usa', # limite map scope to USA
)
fig.show()
```