Indeed Web Crawler and data analysis in Python

# -*- coding: utf-8 -*-

# Created on Sat Feb 17 10:32:35 2018

#Company function, original code : https://medium.com/@msalmon00/web-scraping-job-postings-from-indeed-96bd588dcb4b

#Find out the data-related jobs in Toronto, ON from Indeed.com

#Do some simple analysis and visualization

#@modify: Haby

# import packages

import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

Job title and company function code from Medium

https://medium.com/@msalmon00/web-scraping-job-postings-from-indeed-96bd588dcb4b

Demo Code and Function

# Demo URL

# About the URL
# each results page lists 10 jobs from the search, plus several 'Sponsored' jobs that are not search results
url ='https://ca.indeed.com/jobs?q=Data&l=Toronto%2C+ON'

# get the url
page = requests.get(url)

#decode with BeautifulSoup
soup = BeautifulSoup(page.text, 'html.parser')

# job title: the title attribute of the <a data-tn-element="jobTitle"> link inside each <div class="row">
def job_title(soup):
  jobs = []
  for div in soup.find_all(name='div', attrs={'class':'row'}):
    for a in div.find_all(name='a', attrs={'data-tn-element':'jobTitle'}):
        jobs.append(a['title'])
  return(jobs)

# company title : <span class="company"> in div node
def company(soup):
    companies = []
    for div in soup.find_all(name='div', attrs={'class':'row'}):
        company = div.find_all(name='span', attrs={'class':'company'})
        # use the company span when it is present
        if len(company) > 0:
            for b in company:
                companies.append(b.text.strip())
        # otherwise fall back to result-link-source
        else:
            sec_try = div.find_all(name='span', attrs={'class':'result-link-source'})
            for span in sec_try:
                companies.append(span.text.strip())
    return(companies)

# location
def location(soup):
  locations = []
  spans = soup.find_all('span', attrs={'class':'location'})
  for span in spans:
      locations.append(span.text)
  return(locations)

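# Salary : most postings do not list one, so fill in 'Nothing_found'
def salary(soup):
  salaries = []
  for div in soup.find_all(name='div', attrs={'class':'row'}):
    try:
      # attrs must be a dict ({'class': ...}), not a set
      salaries.append(div.find(name = 'span',attrs = {'class': 'no-wrap'}).text.strip())
    except AttributeError:
      # div.find returned None: no salary span in this posting
      salaries.append('Nothing_found')
  return(salaries)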

# Job Summary
def job_summary(soup):
  summaries = []
  spans = soup.findAll('span', attrs={'class': 'summary'})
  for span in spans:
    summaries.append(span.text.strip())
  return(summaries)

# Post Day
def post_day(soup):
  postday = []
  for div in soup.find_all(name = 'div',attrs = {'class': 'result-link-bar'}):
      try :
          postday.append(div.find('span',attrs={'class': 'date'}).text)
      except AttributeError :
          # no date span: treat the posting as new
          postday.append('New Post')
  return(postday)

# Sponsored
def sponsered(soup):
  flags = []
  for div in soup.find_all(name = 'div',attrs = {'class': 'result-link-bar'}):
      try :
          flags.append(div.find('span',attrs={'class': 'sponsoredGray'}).text)
      except AttributeError :
          # no sponsoredGray span: a regular search result
          flags.append('Non-Sponsered')
  return(flags)

# reviews
def review(soup):
  reviews = []
  for div in soup.find_all(name='div', attrs={'class':'row'}):
    try:
      reviews.append(div.find(name = 'span',attrs = {'class': 'slNoUnderline'}).text.strip())
    except AttributeError:
      # no review span: the company has no reviews yet
      reviews.append('0 reviews')
  return(reviews)

# concat all
columns = {'job_title' : job_title(soup), 'company_name' : company(soup),  
           'location' : location(soup), 'salary' : salary(soup),
           'review': review(soup), 'sponsered' :sponsered(soup),
           'post day' : post_day(soup),'summary': job_summary(soup),
           }
sample_df = pd.DataFrame(columns)
print(sample_df)

                           company_name  \
                      LG Electronics   
                        Canada Goose   
                        S&P Data LLC   
 Canadian Dealer Lease Services Inc.   
                             Viacorp   
        Capital Tech Consulting Inc.   
                          Quandl Inc   
                       ServiceSimple   
                         CanadaStays   
                         VICE Canada   
                               Visa   
              Rogers Communications   
                         Scotiabank   
                         DDB Canada   
                     LG Electronics   
                   Quantumfury Fund   

                                            job_title     location  \
  AI / Machine Learning Scientist – Toronto AI Lab  Toronto, ON   
                                        IT Analyst  Toronto, ON   
         Inbound Call Centre Agent $14.00/hr+Bonus  Toronto, ON   
                          Junior Reporting Analyst  Toronto, ON   
                                        Data Entry  Toronto, ON   
                             Data Analyst - Hadoop  Toronto, ON   
                        Junior Data Engineer (ETL)  Toronto, ON   
                 Data Analyst and Marketing Intern  Toronto, ON   
                                      Data Analyst  Toronto, ON   
                                      Data Analyst  Toronto, ON   
        Data Analyst, Visa Consulting & Analytics  Toronto, ON   
                                     Data Analyst  Toronto, ON   
                             Data Analyst/Modeler  Toronto, ON   
           Junior Data Analyst, Digital Analytics  Toronto, ON   
Director of Artificial Intelligence and Machin...  Toronto, ON   
Artificial Intelligence and Machine Learning D...  Toronto, ON   

        post day         review                      salary      sponsered  \
     New Post  1,527 reviews               Nothing_found      Sponsored   
     New Post     21 reviews               Nothing_found      Sponsored   
     New Post    338 reviews                 $14 an hour      Sponsored   
     New Post      0 reviews               Nothing_found      Sponsored   
  11 days ago      0 reviews        $600 - $1,400 a week  Non-Sponsered   
   5 days ago      0 reviews           $70 - $75 an hour  Non-Sponsered   
  11 days ago      0 reviews               Nothing_found  Non-Sponsered   
 30+ days ago      0 reviews               Nothing_found  Non-Sponsered   
   4 days ago      5 reviews               Nothing_found  Non-Sponsered   
   2 days ago      0 reviews               Nothing_found  Non-Sponsered   
  3 days ago    523 reviews               Nothing_found  Non-Sponsered   
  4 days ago  1,460 reviews               Nothing_found  Non-Sponsered   
 13 days ago  2,556 reviews               Nothing_found  Non-Sponsered   
 11 days ago      3 reviews               Nothing_found  Non-Sponsered   
    New Post  1,527 reviews               Nothing_found      Sponsored   
    New Post      0 reviews  $100,000 - $160,000 a year      Sponsored   

                                              summary  
 Spark, Hadoop), large scale data analysis, opt...  
 Gather required data from Systems, Vendors, an...  
 Navigate through computerized system and accur...  
 Extracting data sets and cleansing for reporti...  
 Are you a talented and motivated person on the...  
 Work could include moving data around data lak...  
 We are seeking a Junior Data Engineer (ETL) to...  
 Data Analyst and Marketing Intern*. 3 main res...  
 Communicate and evangelize data insights with ...  
 Background in data manipulation and data proce...  
Extensive experience with SQL for extracting a...  
You will get exposure to all aspects of the bu...  
Understand data modeling, metadata knowledge f...  
Good knowledge of data wrangling, data blendin...  
Speech, vision, audio, NLP, semantic analysis,...  
Data modelling, big data and expertise in one ...  
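
The functions above each return a separate list, so the columns only line up if every job card yields exactly one value per field. A possible alternative, sketched below, parses one card at a time into a dict so that all fields of a posting stay on the same row. The markup in `SAMPLE_HTML` is hypothetical and only mimics the class names used above; `text_or_default` and `parse_cards` are illustrative helpers, not part of the original script.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the class names used by the functions above.
SAMPLE_HTML = """
<div class="row">
  <a data-tn-element="jobTitle" title="Data Analyst">Data Analyst</a>
  <span class="company"> Acme Corp </span>
  <span class="no-wrap"> $70 - $75 an hour </span>
</div>
<div class="row">
  <a data-tn-element="jobTitle" title="Data Entry">Data Entry</a>
  <span class="company"> Example Ltd </span>
</div>
"""

def text_or_default(card, name, css_class, default):
    """Return the stripped text of the first matching tag, or a default."""
    tag = card.find(name, attrs={'class': css_class})
    return tag.text.strip() if tag is not None else default

def parse_cards(soup):
    """Build one dict per job card so every field stays on the same row."""
    records = []
    for card in soup.find_all('div', attrs={'class': 'row'}):
        link = card.find('a', attrs={'data-tn-element': 'jobTitle'})
        records.append({
            'job_title': link['title'] if link is not None else '',
            'company_name': text_or_default(card, 'span', 'company', ''),
            'salary': text_or_default(card, 'span', 'no-wrap', 'Nothing_found'),
        })
    return records

records = parse_cards(BeautifulSoup(SAMPLE_HTML, 'html.parser'))
for record in records:
    print(record)
```

A list of dicts like `records` can be passed straight to `pd.DataFrame`, which avoids the length mismatch that a dict of unequal lists would raise.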

Iterating Pages

# Iterate Code

# iterate over pages : each page lists 10 search results, so the start offset grows by 10
# e.g. page 2 : https://ca.indeed.com/jobs?q=Data&l=Toronto,+ON&start=10

url_o = 'https://ca.indeed.com/jobs?q=Data&l=Toronto,+ON&start='
starts = [i*10 for i in range(0,41)]

# collect one DataFrame per page
frames = []

# loop over 41 pages (start = 0, 10, ..., 400)
for i in starts :
    url =  url_o + str(i)

    # get the url
    page = requests.get(url)

    # parse the page with the html parser so its components can be searched, rather than treating it as one long string
    soup = BeautifulSoup(page.text, 'html.parser')

    # concat all
    temp = pd.DataFrame([job_title(soup),company(soup),location(soup),
                         salary(soup),review(soup),sponsered(soup),
                         post_day(soup),job_summary(soup)])
    frames.append(temp.T)

# DataFrame.append was removed in pandas 2.0, so combine the pages with pd.concat
df = pd.concat(frames,ignore_index = True)

df.columns = ['job_title', 'company_name', 'location', 'salary', 'review',
          'sponsered','post day','summary']
print(df.head())
print('-'*40)
print(len(df))

                                          job_title  \
AI / Machine Learning Scientist – Toronto AI Lab   
                                      IT Analyst   
       Inbound Call Centre Agent $14.00/hr+Bonus   
                        Junior Reporting Analyst   
                                      Data Entry   

                          company_name     location                salary  \
                     LG Electronics  Toronto, ON         Nothing_found   
                       Canada Goose  Toronto, ON         Nothing_found   
                       S&P Data LLC  Toronto, ON           $14 an hour   
Canadian Dealer Lease Services Inc.  Toronto, ON         Nothing_found   
                            Viacorp  Toronto, ON  $600 - $1,400 a week   

          review      sponsered     post day  \
1,527 reviews      Sponsored     New Post   
   21 reviews      Sponsored     New Post   
  338 reviews      Sponsored     New Post   
    0 reviews      Sponsored     New Post   
    0 reviews  Non-Sponsered  11 days ago   

                                             summary  
Spark, Hadoop), large scale data analysis, opt...  
Gather required data from Systems, Vendors, an...  
Navigate through computerized system and accur...  
Extracting data sets and cleansing for reporti...  
Are you a talented and motivated person on the...  
----------------------------------------
622
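
The paged URLs above are built by string concatenation. As a small sketch of the same idea, the standard library's `urllib.parse.urlencode` can assemble and escape the query string. The helper name `page_urls` and its parameters are illustrative and not part of the original script.

```python
from urllib.parse import urlencode

BASE_URL = 'https://ca.indeed.com/jobs'

def page_urls(query, location, n_pages, page_size=10):
    """Return one search URL per results page (hypothetical helper)."""
    urls = []
    for page_index in range(n_pages):
        params = {'q': query, 'l': location, 'start': page_index * page_size}
        # urlencode escapes the comma and the space in 'Toronto, ON'
        urls.append(BASE_URL + '?' + urlencode(params))
    return urls

urls = page_urls('Data', 'Toronto, ON', n_pages=3)
for u in urls:
    print(u)
```

Each URL differs only in its `start` offset, which is the same pattern the loop above relies on.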

Data cleaning

# data cleaning

# Job title
# job title : keep the part before the first '–', ',', '(', '$' or '-'
df['First_title'] = df['job_title'].map(lambda x : x.split('–')[0].split(',')[0].split('(')[0].split('$')[0].split('-')[0].strip())

# top 10 job titles
df['First_title'].value_counts()[:10].plot.barh(rot = 30,
  title = 'Counts for Different Titles',legend = True)
plt.show()

[Figure: Counts for Different Titles, horizontal bar chart of the 10 most common job titles]
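
The chained `split` calls keep whatever comes before the earliest delimiter. A compact way to express the same rule, shown here only as a sketch, is a single regular expression with a character class. The name `first_title` is illustrative; the example titles are taken from the sample output earlier in the post.

```python
import re

# Same delimiters as the chained split: en dash, comma, '(', '$' and hyphen.
DELIMITERS = re.compile(r'[–,($-]')

def first_title(title):
    """Keep the text before the first delimiter (illustrative helper)."""
    return DELIMITERS.split(title, maxsplit=1)[0].strip()

examples = [
    'AI / Machine Learning Scientist – Toronto AI Lab',
    'Inbound Call Centre Agent $14.00/hr+Bonus',
    'Junior Data Engineer (ETL)',
    'Data Analyst - Hadoop',
]
for title in examples:
    print(first_title(title))
```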

# salary

# most salaries are 'Nothing_found', replace them with '0'
df['salary'] = df['salary'].replace('Nothing_found','0')

# convert every salary to a yearly figure
# most listed salaries are ranges, so use the midpoint of the range
# assume 52 weeks / year, 12 months / year and 40 hours / week
per_year = {'week' : 52, 'month' : 12, 'hour' : 40*52}
mean_salary = []
for s in df['salary'] :
    if s == '0' :
        mean_salary.append(0)
        continue
    parts = s.split()
    min_sal = float(parts[0].replace('$','').replace(',',''))
    if parts[1] == '-' :
        # a range such as '$70 - $75 an hour'
        max_sal = float(parts[2].replace('$','').replace(',',''))
        mean_sal = (min_sal+max_sal)/2
    else :
        mean_sal = min_sal
    # the last word is the pay period; anything else is treated as yearly
    mean_salary.append(mean_sal * per_year.get(parts[-1], 1))

len(mean_salary)       

df['mean_sal'] = mean_salary
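
To make the conversion concrete, here is a small standalone sketch that applies the same rules (midpoint of a range, 40 hours per week, 52 weeks per year) to salary strings from the sample output. The function name `yearly_salary` and the regular expression are illustrative choices, not the original implementation.

```python
import re

# Multipliers to a yearly figure: 40 hours per week and 52 weeks per year,
# the same assumptions as the loop above.
PER_YEAR = {'hour': 40 * 52, 'week': 52, 'month': 12, 'year': 1}

def yearly_salary(text):
    """Convert a salary string to a yearly amount; 0 if nothing is listed."""
    # Collect every dollar amount, e.g. ['600', '1,400'] for a weekly range.
    found = re.findall(r'\$([\d,]+(?:\.\d+)?)', text)
    amounts = [float(a.replace(',', '')) for a in found]
    if not amounts:
        return 0
    unit = text.split()[-1]
    # A range such as '$70 - $75 an hour' is reduced to its midpoint.
    return sum(amounts) / len(amounts) * PER_YEAR.get(unit, 1)

for s in ['$14 an hour', '$600 - $1,400 a week',
          '$100,000 - $160,000 a year', 'Nothing_found']:
    print(s, '->', yearly_salary(s))
```

For example, '$14 an hour' becomes 14 * 40 * 52 = 29,120 per year, and '$600 - $1,400 a week' becomes 1,000 * 52 = 52,000 per year.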


# number of reviews

df['review number'] = df['review'].map(lambda x : int(x.split()[0].replace(',','')))

# post days

# 'New Post' becomes 0
# otherwise keep the leading number of days as an int ('30+' becomes 30)

df['post day'] = df['post day'].map(lambda x : 0 if x == 'New Post' else int(x.split()[0].replace('+','')))
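
# The two conversions above can also be written as small named functions, which
# makes the handled cases explicit. This is only a sketch: the function names are
# illustrative, and it assumes the labels are the ones seen in the sample output
# ('New Post', 'N days ago', '30+ days ago'). Any other label would raise ValueError.

```python
def review_count(text):
    """Turn a label such as '1,527 reviews' into the integer 1527."""
    return int(text.split()[0].replace(',', ''))

def days_since_post(text):
    """Turn 'New Post' into 0 and 'N days ago' or 'N+ days ago' into N."""
    if text == 'New Post':
        return 0
    return int(text.split()[0].replace('+', ''))

for label in ['1,527 reviews', '0 reviews']:
    print(label, '->', review_count(label))
for label in ['New Post', '11 days ago', '30+ days ago']:
    print(label, '->', days_since_post(label))
```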

# find keywords in job summary

words = []
for row in df['summary'] :
    for word in row.replace(',',' ').replace('.',' ').split() :
        # str.replace returns a new string, so assign the result back
        word = word.replace('(',' ').replace(')',' ').strip().lower()
        if word :
            words.append(word)

# drop stop words and other uninformative tokens
dropped = ['and','to','the','of','in','with','a','for','will','is','our','an','data','through','required','business',
           'as','from','including','we','are','by','be','for','us','it','.','-','or',
           'on','new']
key_words = []
for word in words :
    if word not in dropped :
        key_words.append(word)
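
The tokenizing and filtering steps above can be combined with `collections.Counter`, which counts as it goes. The sketch below is one possible variant, not the original method: the regular expression, the shortened stop-word set and the two made-up summaries are assumptions for illustration only.

```python
import re
from collections import Counter

# A shortened stop-word set for the demo; the full list is `dropped` above.
STOP_WORDS = {'and', 'to', 'the', 'of', 'in', 'with', 'a', 'for', 'data'}

def keyword_counts(summaries, stop_words=STOP_WORDS):
    """Count lower-cased tokens across summaries, skipping stop words."""
    counts = Counter()
    for text in summaries:
        # Lower-case first, then keep runs of letters, digits, '+' and '#'.
        tokens = re.findall(r'[a-z0-9+#]+', text.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts

# Made-up summaries, only for demonstration.
summaries = [
    'Extensive experience with SQL for extracting data.',
    'Experience in data analysis (Spark, Hadoop) and SQL.',
]
counts = keyword_counts(summaries)
print(counts.most_common(3))
```

Because punctuation and parentheses never match the pattern, tokens such as `hadoop)` do not need a separate clean-up step.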

Visualization

#Visualization

# salary visualization
g = sns.kdeplot(df['mean_sal'],shade = True)
g.set_title('Mean Salary for data-related jobs')
plt.show()

[Figure: Mean Salary for data-related jobs, kernel density plot]

Apparently there are three salary levels for data workers: entry level, with salaries of 20,000 - 40,000; mid level, with salaries of 120,000 - 130,000; and advanced level, with salaries above 175,000.
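
One thing to keep in mind when reading this plot: postings without a salary were stored as 0 in `mean_sal`, so they are part of the density estimate too. A quick check, sketched here on made-up numbers rather than the scraped data, is to separate the listed salaries from the zeros before looking at the distribution. The values in `toy` are purely illustrative.

```python
import pandas as pd

# Made-up yearly salaries; 0 marks a posting with no salary listed.
toy = pd.DataFrame({'mean_sal': [0, 0, 29120.0, 0, 52000.0, 130000.0, 0, 0]})

# Keep only the postings that actually list a salary.
listed = toy.loc[toy['mean_sal'] > 0, 'mean_sal']
share_listed = len(listed) / len(toy)

print(f'{share_listed:.0%} of postings list a salary')
print(listed.describe())
```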

# company name

# plot the top 20 companies that recently posted data-related jobs
df['company_name'].value_counts()[:20].plot.barh(title = 'Top 20 Companies Who Need Data-related Employees')
plt.show()

[Figure: Top 20 Companies Who Need Data-related Employees, horizontal bar chart]

The top two are LG Electronics and Canada Goose. LG plans to expand the scale of its business. I did not find any useful information about Canada Goose, but my guess is that it produces more during the winter.

# reviews visualization
# plot top 10 companies with most reviews
df[['review number','company_name']].groupby('company_name').mean().sort_values(by = ['review number'],
  ascending = False).head(10).plot.barh(title = 'Top 10 companies with most reviews')
plt.show()

[Figure: Top 10 companies with most reviews, horizontal bar chart]

Most of the top 10 companies are so-called "big companies", which can provide a better working environment, benefits and salary. Finance and IT companies have more reviews than other companies.

# list the top 20 important words in data related job posts
pd.Series(key_words).value_counts()[:20].plot.barh(rot = 30,
  title = 'Top 20 Keywords for data-related job',legend = True)
plt.show()

[Figure: Top 20 Keywords for data-related job, horizontal bar chart]

# list the important words ranked 21-40 in data related job posts
pd.Series(key_words).value_counts()[20:40].plot.barh(rot = 30,
  title = 'Top 21-40 Keywords for data-related job',legend = True)
plt.show()

[Figure: keywords ranked 21 to 40 in data-related job posts, horizontal bar chart]

The top 20 keywords tend to be general qualities that apply to all employees, such as analysis, experience, team and learning, while the keywords ranked 21 to 40 tend to be professional skills, such as optimization, databases, management and programming. To me, it seems that employers care more about what kind of person you are and whether you can work in a team than about which skills you have. This was completely new to me.

# sponsored
# just interested in whether sponsored posts are different

# pass the column by name so the call works across seaborn versions
sns.countplot(x = 'sponsered', data = df)
plt.show()

[Figure: count plot of sponsored vs. non-sponsored posts]

# pair plot with post day /mean salary / reviews number hue by sponsered
sns.pairplot(df,hue = 'sponsered')
plt.show()

[Figure: pair plot of post day, mean salary and review number, colored by sponsored status]

All sponsored posts are new posts (post day = 0), which means that if you pay, your post stays at the top or bottom of the first page and you do not need to worry about it being pushed down. Most companies that publish sponsored posts seem to have fewer reviews, which suggests that they are either new companies or small companies that do not need many employees. Although only a few posts list a salary, most of those that do are sponsored posts. Perhaps small companies with few reviews need the salary as a highlight to attract applicants.
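
The comparisons read off the pair plot can also be checked numerically with a grouped summary. The sketch below uses a tiny made-up table with the same column names as `df`, so the numbers are illustrative only and the output column names are assumptions.

```python
import pandas as pd

# Made-up rows with the same column names as df; not the scraped data.
toy = pd.DataFrame({
    'sponsered': ['Sponsored', 'Sponsored', 'Non-Sponsered', 'Non-Sponsered'],
    'post day': [0, 0, 11, 4],
    'review number': [21, 0, 523, 1460],
    'mean_sal': [29120.0, 0.0, 0.0, 0.0],
})

# One row per group: post count, mean age, median reviews, share with a salary.
summary = toy.groupby('sponsered').agg(
    posts=('post day', 'size'),
    mean_post_day=('post day', 'mean'),
    median_reviews=('review number', 'median'),
    share_with_salary=('mean_sal', lambda s: (s > 0).mean()),
)
print(summary)
```

Running the same `groupby` on the real `df` would show whether sponsored posts really have fewer reviews and list salaries more often.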
