Women in academic STEM fields
In [6]:
Goals of this analysis
There are plenty of studies detailing the discrimination women face in being hired into STEM (Science, Technology, Engineering, Mathematics) positions, and many studies on the gender wage gap, e.g. here, here and here.
I thought I would use the University of Arizona’s own employee salary data to study these trends, by analysing the differences in salaries paid across the university as a function of academic department, job type/level and gender. I focus on departments within STEM fields because I already have domain knowledge of their different sub-fields and academic job hierarchies.
The data set
The University of Arizona’s data set of employee salaries can be found online. I scraped the data from the past year (2014-2015) using BeautifulSoup into a pandas dataframe. Presented below is roughly the chunk of code I used to do that:
import requests
import pandas as pd
from bs4 import BeautifulSoup

### Published 2014-2015 salary spreadsheet
link = "https://docs.google.com/spreadsheets/d/1xUWyf0DlM6eJTKsUSJ6MyQjM6JnmmLSq7x4n5iq43uQ/pubhtml"

### Parse webpage with BeautifulSoup
text = requests.get(link).text
soup = BeautifulSoup(text)

### Find all rows
peopleTags = soup.find_all('tr')

# first row is just for formats -> skip
# second row has column names
header = peopleTags[1].find_all('td')

### Get column names
columns = []
for head in header:
    columns.append(head.get_text())

### Build a list of dictionaries, one per employee
rows = []
for i in xrange(2, len(peopleTags)):
    personData = peopleTags[i].find_all('td')
    if len(personData) < 1:
        print "Found", len(rows), "employees"
        break
    # first grab text out of each tag
    x = []
    for p in personData:
        x.append(p.get_text())
    # create a person dictionary
    person = {}
    # parse name into first and surname; skip rows without a "surname, first" name
    if len(x[0].split(',')) < 2:
        print x
        print x[0]
        continue
    person['first_names'] = x[0].split(',')[1]
    person['surname'] = x[0].split(',')[0]
    # Primary title
    person[columns[1]] = x[1]
    # Annual salary at full FTE (full time equivalency)
    person[columns[2]] = float(x[2].replace(',', ''))
    # State fund ratio
    person[columns[3]] = float(x[3].replace('%', ''))
    # Department
    person[columns[4]] = x[4]
    # FTE (number between 0 and 1)
    person[columns[5]] = float(x[5])
    # Annual Sal Emplid FTE (just FTE*Annual salary at full FTE)
    person[columns[6]] = float(x[6].replace(',', ''))
    rows.append(person)

### Create pandas dataframe
dbData = pd.DataFrame(rows)

### Save output
outname = "salary2014.csv"
dbData.to_csv(outname)
From the comments in the code above you can see the data set contains:
- employee name
- job title
- annual salary (at full time equivalency)
- percentage of funding from State sources (I think? But I won’t use this anyway)
- department employed by
- FTE (1=full time, fraction indicates part-time)
- actual annual salary (annual salary at full time * FTE)
In total there were approximately 12,000 jobs.
Classing names with a binary gender value
To make the data set more interesting I used a library called SexMachine (eye roll) to assign a gender given the employee’s first name. The code I used looked like this:
import numpy as np
import sexmachine.detector as gender

### Set up gender detector
d = gender.Detector()

### Add gender column to data table
sLength = len(data)
data['gender'] = pd.Series(np.random.randn(sLength), index=data.index)
for i in range(sLength):
    nom = data['first_names'][i].split(' ')[0]
    # check that the 'first' of the first names is not just an initial
    # (some first names are entered e.g. like J Edward)
    # if it is, use the 'second' of the first names to designate gender
    if len(nom) < 2:
        print "Person", i+1, "of", sLength,
        print "first name is initial", nom, "taking second name",
        nom = data['first_names'][i].split(' ')[1]
        print nom, "(full name", data['first_names'][i], ")"
    sex = d.get_gender(nom)
    if sex == "mostly_female":
        sex = u"female"
    elif sex == "mostly_male":
        sex = u"male"
    elif sex == "andy":
        sex = "unknown"
    data.loc[i, ['gender']] = sex
Here I simply assigned a name as female if the output was mostly_female (e.g. as might be returned if the supplied name was “Erin”), and similarly for mostly_male. SexMachine mostly failed (sex=unknown) on names of Chinese or Indian origin. This could definitely cause some bias in the results presented below if there is an over-representation of, e.g., males with Chinese or Indian origin names (e.g. in STEM).
To estimate the approximate error rate for the employees who did get classed with a gender, I first estimated the sample size required to provide a 95% confidence level for a margin of error of 1% in the error rate:
\[ n = \left(\frac{z_{\alpha/2}\,\sigma}{E}\right)^2 \]
where \(z_{\alpha/2} = 1.96\) for a 95% confidence level, \(E = 0.01\) is the margin of error, and \(\sigma\) is the standard deviation of the error rate (which itself can take values between 0, no wrong classifications, and 1, all classifications wrong). I assumed \(\sigma = 0.5\), the most conservative value, to make sure the sample size would be over- rather than under-estimated. For a random sample of this size I manually checked the classifications and found that the error rate is about 2%.
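A minimal sketch of that sample size calculation, using the stated assumptions (and ignoring any finite-population correction):

# Conservative sample size for estimating the gender-classification error rate
z = 1.96      # z-score for a 95% confidence level
sigma = 0.5   # most conservative standard deviation of the error rate
E = 0.01      # desired margin of error (1%)
n = (z * sigma / E) ** 2
print "Required sample size:", int(round(n))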
This approach is obviously an over-simplification and problematic as it relies on the assumption that a person with a likely “female” or “male” sounding name would identify themselves with that same binary category. The categories “female” or “male” should be interpreted as the employee’s most likely binary gender choice, rather than a strict categorisation of the employees’ gender.
Data cleaning (90% of the time was spent here!)
Looking at the data, there are 375 unique “Department” names and 3705 unique “Primary Title” values (i.e. job titles). Given the size of this data set (12,000 entries) I needed to aggregate similar departments and similar job titles together to study the variation of salaries across the university.
I added four more columns to the dataframe: “Academic?”, “STEM?”, “Department type”, and “Job group”.
The “Academic?” column is a True or False value depending on whether the department is academic or not (non-academic being things like the University Police Department or Facilities Management).
The “STEM?” column is a True or False value depending on whether the department is in a “Science”, “Technology”, “Engineering” or “Mathematics” field. Therefore all Academic?==False departments will be STEM?==False too.
“Department type” is a broader name for the type of department; for Academic?==True this could be one of: ‘Humanities’, ‘Professional’, ‘Social Sciences’, ‘Natural Sciences’, ‘Formal Sciences’. Departments classed as ‘Social Sciences’ could be either STEM (e.g. Psychology) or non-STEM (e.g. Sociology), and departments classed as ‘Professional’ could be either STEM (e.g. Civil Engineering, Medical Imaging) or non-STEM (e.g. Accounting, Medicine, Law).
Aggregating “Primary Title” into a “Job group” was not as comprehensive, and just focussed on classifying titles into one of the following groups: student, postdoc, engineer, assistant scientist, associate scientist, senior scientist, assistant professor, associate professor, professor. Everything else was classed as “other”.
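The classification itself was essentially a hand-built lookup. A minimal sketch of how columns like these can be added, where the department lists and keyword rules are purely illustrative (not the full mapping actually used):

# Illustrative (incomplete) lookup tables for classifying departments
non_academic = ['University Police Department', 'Facilities Management']
stem_departments = ['Physics', 'Chemistry', 'Mathematics', 'Computer Science']
department_types = {'Physics': 'Natural Sciences',
                    'Psychology': 'Social Sciences',
                    'Civil Engineering': 'Professional'}

data['Academic?'] = ~data['Department'].isin(non_academic)
data['STEM?'] = data['Academic?'] & data['Department'].isin(stem_departments)
data['Department type'] = data['Department'].map(department_types).fillna('other')

# Assign a job group from keyword matches on the "Primary Title" string;
# the more specific titles are checked before the generic ones
def job_group(title):
    title = title.lower()
    for group in ['postdoc', 'assistant professor', 'associate professor',
                  'professor', 'assistant scientist', 'associate scientist',
                  'senior scientist', 'engineer', 'student']:
        if group in title:
            return group
    return 'other'

data['Job group'] = data['Primary Title'].apply(job_group)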
The data!
In [7]:
Data snippet:
Annual Sal Emplid FTE Annual Salary at Full FTE \
0 58067.0 58067.0
1 37844.0 37844.0
2 35000.0 35000.0
3 40000.0 40000.0
4 53000.0 53000.0
Department FTE Primary Title \
0 Dept of Emergency Medicine 1.0 Manager, Residency Program
1 University Police Department 1.0 Police Evidence/Property Tech
2 Radiation Oncology 1.0 Administrative Associate
3 Africana Studies 1.0 Lecturer, Africana Studies
4 Physics 1.0 Assistant Research Scientist, Physics
State Fund Ratio first_names surname gender Academic? STEM? \
0 0.0 ******** ***** female True False
1 100.0 ******** ***** male False False
2 40.0 ******** ***** female True False
3 100.0 ******** ***** unknown True False
4 0.0 ******** ***** female True True
Department type Job group
0 Professional other
1 Infrastructure other
2 Professional other
3 Social Sciences lecturer
4 Natural Sciences assistant scientist
Across the entire university
To get the highest-level picture, I first plot the percentage of employees of each gender. 9% of employees were not assigned a gender, and from here on I will remove them from further analysis. As discussed above, this could cause some bias in the following results.
From other sources I found the percentages of women and men of working age in the United States:
- female 51.7%
- male 48.3%
I can now calculate the significance of the difference in proportions of female and male employees at the University of Arizona, compared to the pool of working-age females and males in the USA.
The standard error of the difference between the estimated proportions of population 1 (\(\hat{p}_1\)) and population 2 (\(\hat{p}_2\)) is:
\[ SE = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \]
where \(\hat{p}_1\) and \(\hat{p}_2\) are the population proportions estimated from the University of Arizona data, and \(n_1\) and \(n_2\) are the respective sample sizes. The p-value for the difference in these populations is calculated from the z-score:
\[ z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{SE} \]
where \(p_1\) and \(p_2\) are the actual population proportions, i.e. those found from the Department of Labor’s statistics on the percentage of women of working age in the USA.
The calculation below shows that the null hypothesis (that there is no difference between the UA’s male/female proportion compared to that of working-age people in the USA) cannot be rejected at a significant confidence level. This means the female to male population of employees at the University of Arizona reasonably reflects that of the working-age population in the USA.
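The code behind that calculation isn’t shown here, but a minimal sketch of the two-proportion z-test looks like this (scipy’s normal distribution supplies the two-sided p-value):

import numpy as np
from scipy import stats

def two_proportion_z_test(p1_hat, n1, p2_hat, n2, p1, p2):
    # standard error of the difference between the two estimated proportions
    se = np.sqrt(p1_hat * (1.0 - p1_hat) / n1 + p2_hat * (1.0 - p2_hat) / n2)
    # z-score of the observed difference against the hypothesised (USA) difference
    z = ((p1_hat - p2_hat) - (p1 - p2)) / se
    # two-sided p-value from the standard normal distribution
    p_value = 2.0 * stats.norm.sf(abs(z))
    return z, p_value

# e.g. with the UA female/male proportions estimated above, the female/male
# head counts, and the USA working-age proportions as the null difference:
# z, p = two_proportion_z_test(p_female_ua, n_female, p_male_ua, n_male, 0.517, 0.483)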
In [8]:
Percentage of each gender (including unknown)
female 47.033757
male 43.908757
unknown 9.057487
Name: gender, dtype: float64
Percentage of each gender (excluding unknown)
female 51.718118
male 48.281882
Name: gender, dtype: float64
The null hypothesis that the UA female/male population proportion difference of 0.03400
is not different from the USA female/male population proportion difference of 0.03436
is rejected at (1-p)=0.515
For earnings I examine the “Annual Salary at Full FTE” because, even if an individual position is part-time, this is its equivalent full-time annual salary. Therefore, no matter the actual hours worked per position, I am comparing equivalent salaries.
In terms of women’s earnings overall across the university, compared to the ratio of women’s to men’s earnings of 81.8% in 2015 found in the Institute for Women’s Policy Research’s report on the Status of Women in the States, women at the University of Arizona are doing a little better with an 85.3% ratio.
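A sketch of how this ratio can be computed (here I take the median salary per gender, as in the per-group comparisons further below; a mean-based ratio would give a slightly different number):

# Ratio of women's to men's earnings across the whole university,
# using the median full-time-equivalent salary per gender
med = (data[data['gender'] != 'unknown']
       .groupby('gender')['Annual Salary at Full FTE'].median())
print "Ratio of women's to men's earnings: %.3f%%" % (100.0 * med['female'] / med['male'])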
In [9]:
Ratio of women's to men's earnings: 85.344%
It’s more instructive to see the distribution of salaries by gender, so for this I use a box plot. The line across the middle of the box indicates the median salary, the box itself spans the central 50% of the data (the interquartile range, from the 25th to the 75th percentile), and the caps at the ends of the whiskers indicate the entire range of the data.
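A rough sketch of how such a plot can be drawn with pandas (whis=[0, 100] makes the whiskers span the full range of the data, matching the description above):

import matplotlib.pyplot as plt

# Distribution of full-time-equivalent salary by gender
data[data['gender'] != 'unknown'].boxplot(
    column='Annual Salary at Full FTE', by='gender', whis=[0, 100])
plt.ylabel('Annual Salary at Full FTE ($)')
plt.show()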
In [10]:
Comparing STEM and non-STEM departments
There are more women than men in the non-STEM departments, and fewer women than men in the STEM departments. Note that at this stage all jobs within a particular department are considered (e.g. potentially administrative, accounting, facilities maintenance positions), not just those that are academic in nature.
In both cases men are paid significantly more: the ratio of women’s to men’s earnings is 70.2% for women in STEM departments (much lower than the 85.3% over the entire university) and 86.0% for women not in STEM (much more in line with the overall university ratio).
Comparing the median salary values, women in STEM are paid less than women not in STEM, whereas men in STEM are paid more than men not in STEM.
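A sketch of roughly the aggregation behind the numbers below (group by the STEM flag and gender, take the median salary, then form the female/male ratio):

# Median full-time-equivalent salary split by STEM flag and gender
med = (data[data['gender'] != 'unknown']
       .groupby(['STEM?', 'gender'])['Annual Salary at Full FTE']
       .median()
       .unstack('gender'))
print "Median salaries:"
print med
print "Ratio of median salary, female/male:"
print med['female'] / med['male']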
In [11]:
Median salaries:
gender female male
STEM?
False 55029.0 64000.0
True 50112.0 71414.0
Ratio of median salary, female/male:
STEM?
False 0.859828
True 0.701711
dtype: float64
Comparing different department types within STEM
Here are the rough definitions of the STEM department types:
- Professional: Biomedical, Engineering, Medical Imaging, Neuroscience, Ophthalmology, Pharmacology
- Social Sciences: Psychology
- Natural Sciences: Astronomy, Atmospheric Sciences, Chemistry, Ecology, Environmental Sciences, Geosciences, Optical Sciences, Physics, Plant Sciences
- Formal Sciences: Applied Mathematics, Computer Sciences, Informatics, Mathematics
The salaries of women in the Formal Sciences and Social Sciences (Psychology) are significantly lower than men’s, with their ratio of median salary female/male being around 60%, whereas for Professional and Natural Sciences it is around 70%. However, there is a low number of employees within the Formal Sciences and Social Sciences groups, with far fewer individual departments included in those types, so it’s difficult to draw any conclusions.
In [12]:
Number of employees:
gender female male
Department type
Formal Sciences 54 91
Natural Sciences 601 937
Professional 169 278
Social Sciences 31 28
Ratio of median salary, female/male:
Department type
Formal Sciences 0.595987
Natural Sciences 0.714286
Professional 0.702947
Social Sciences 0.619141
dtype: float64
Comparing different job types in STEM departments
First I designated a hierarchy of job types, going from most junior to most senior. In reality this ordering is not strict, as the relative seniority of, for example, a senior scientist and a professor will depend on the exact positions being compared.
This hierarchy goes: student, postdoc, engineer, assistant scientist, associate scientist, senior scientist, assistant professor, associate professor, professor.
Also, within these job type categories the distinction between an assistant, associate or senior scientist is often fuzzy.
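One way to keep this ordering when grouping or plotting is to encode the job groups as an ordered pandas categorical. A small sketch (job groups outside this list, e.g. “other”, become NaN and drop out of the comparison):

# Encode the job hierarchy as an ordered categorical so that groupby results
# and plots follow seniority order rather than alphabetical order
job_order = ['student', 'postdoc', 'engineer', 'assistant scientist',
             'associate scientist', 'senior scientist', 'assistant professor',
             'associate professor', 'professor']
data['Job group'] = pd.Categorical(data['Job group'],
                                   categories=job_order, ordered=True)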
In [13]:
Number of employees
gender female male
Job group
assistant professor 62 87
assistant scientist 105 163
associate professor 45 126
associate scientist 15 37
engineer 3 42
postdoc 47 94
professor 48 267
senior scientist 7 20
student 40 42
Final thoughts
I think a major part (though likely not all) of the salary disparity between women and men arises because there are so many more men than women in the most senior academic positions (assistant professor and upwards), and these are generally the highest-paid positions. It might be worth repeating this analysis for the Natural Sciences and Professional department types alone to see if this trend remains.
In light of the above point it is worth pointing out the salary distribution of the “assistant professor” job type. The distribution for the women has a much larger range of higher salary values in its upper quartile compared to the men’s distribution. The origin of this could be the hire date of the female assistant professors. A related hypothesis is that salary raises for existing employees tend to be small, but salary offers for new faculty (made to be competitive with offers from other institutions) tend to track more current, higher salary rates. Then, if fractionally more women than men were hired recently, they may tend to command higher salaries. If hire dates were included in this data, this hypothesis could be tested: how does annual salary track with hire date within the same job category?
To draw more robust conclusions on the origin of the salary disparity between women and men I would need to analyse salary data going back over multiple years, to track how the ratio of female to male employees and the median salaries evolve (if they do). Analysing the data in a time-dependent way would show whether we can expect more equitable employment between men and women in the future, and how we can expect their salaries to grow with respect to one another.
The main problem with doing this study on this dataset is its relatively small size. With at most a few hundred employees per STEM department type, and fewer than a hundred per job type, few strong conclusions can be drawn overall. The main takeaway is that the University of Arizona’s salary and employment figures for women in STEM fit with what has been shown by other (larger and better) studies.