Indeed Web Crawler in Python, Version 2

Working Environment and what is new in this version

# -*- coding: utf-8 -*-

Jupyter Notebook with Python 3

Created on Sat Feb 22, 2018

Find out the data-related jobs in Ontario from Indeed.ca

Version 2: collects more information on job requirements and extracts more detailed features from the linked pages

Data visualization using matplotlib, seaborn and Tableau v10.5

Analyze keywords from job responsibilities using the NLTK package with Brown / POS tag categories

@author: Haby

Job Post Part

Demo Code and Functions

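import subprocess
import sys

def install(package):
    # call pip through the current interpreter; pip.main() is not available in pip 10+
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', package])

# install Natural Language Toolkit
install('nltk')

# import nltk only after it has been installed
import nltk

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')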

Requirement already satisfied: nltk in e:\software\anaconda\lib\site-packages
Requirement already satisfied: six in e:\software\anaconda\lib\site-packages (from nltk)
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\adien\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\adien\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\adien\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.





True

# import package

import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import nltk

# Demo Code and Function

# url
url ='https://ca.indeed.com/jobs?q=data&l=Ontario'

# get the url
page = requests.get(url)

#decode with BeautifulSoup
soup = BeautifulSoup(page.text, 'html.parser')

# job title is in the 'title' attribute of the <a data-tn-element="jobTitle"> node inside each result div
def job_title(soup):
  jobs = []
  for div in soup.find_all(name='div', attrs={'class':'row'}):
    for a in div.find_all(name='a', attrs={'data-tn-element':'jobTitle'}):
      jobs.append(a['title'])
  return(jobs)

# This company code is from : https://medium.com/@msalmon00/web-scraping-job-postings-from-indeed-96bd588dcb4b

# company title : <span class="company"> in div node
def company(soup):
    companies = []
    for div in soup.find_all(name='div', attrs={'class':'row'}):
        company = div.find_all(name='span', attrs={'class':'company'})
        # if the company can be selected by the 'company' class
        if len(company) > 0:
            for b in company:
                companies.append(b.text.strip())
        # otherwise fall back to the 'result-link-source' class
        else:
            sec_try = div.find_all(name='span', attrs={'class':'result-link-source'})
            for span in sec_try:
                companies.append(span.text.strip())
    return(companies)

# location
def location(soup):
  locations = []
  spans = soup.find_all('span', attrs={'class':'location'})
  for span in spans:
      locations.append(span.text)
  return(locations)

# Salary : most postings don't list a salary, fill with a placeholder
def salary(soup):
  salaries = []
  for div in soup.find_all(name='div', attrs={'class':'row'}):
    try:
      salaries.append(div.find(name = 'span',attrs = {'class': 'no-wrap'}).text.strip())
    except AttributeError:
      # find() returned None: no salary span in this result
      salaries.append('Nothing_found')
  return(salaries)

# Post Day
def post_day(soup):
  postday = []
  for div in soup.find_all(name = 'div',attrs = {'class': 'result-link-bar'}):
      try :
          postday.append(div.find('span',attrs={'class': 'date'}).text)
      except AttributeError :
          # no date span in this result, treat it as a new post
          postday.append('New Post')
  return(postday)

# reviews
def review(soup):
  review = []
  for div in soup.find_all(name='div', attrs={'class':'row'}):
    try:
      review.append(div.find(name = 'span',attrs = {'class': 'slNoUnderline'}).text.strip())
    except AttributeError:
      # no review span in this result, record zero reviews
      review.append('0 reviews')
  return(review)

# post links : the href we get is a relative address, so we need to prepend 'http://ca.indeed.com'
def post_link(soup):
  link = []
  for div in soup.find_all(name='div', attrs={'class':'row'}):
    for a in div.find_all(name='a', attrs={'data-tn-element':'jobTitle'}):
        link.append('http://ca.indeed.com'+a['href'])
  return(link)

# concat all
columns = {'job_title' : job_title(soup), 'company_name' : company(soup),  
           'location' : location(soup), 'salary' : salary(soup),
           'review': review(soup), 'post day' : post_day(soup),
           'links': post_link(soup)
           }
sample_df = pd.DataFrame(columns)
print(sample_df)

                     company_name  \
                LG Electronics   
                    IFDS Group   
                    Interactyx   
            Sun Life Financial   
                        Loblaw   
                          CHEO   
         Validus Research Inc.   
 University of Western Ontario   
      M.W.N. Technologies Inc.   
                 ServiceSimple   
             Global Pharmatek   
                      BlueDot   
                   Quandl Inc   
                       Sobeys   
                    Accenture   
    Excis Ltd (www.excis.com)   

                                           job_title  \
 AI / Machine Learning Scientist – Toronto AI Lab   
                                      Data Center   
                  Business Intelligence Developer   
                   Data Analyst, Client Analytics   
         Sr. Manager, Data, Reporting & Analytics   
                 Oncology Data Administrator, MDU   
           Research Analyst – Junior Statistician   
                               Research Scientist   
                       Entry-level Data Scientist   
                Data Analyst and Marketing Intern   
                                Data Power Admin   
                                  Data Scientist   
                      Junior Data Engineer (ETL)   
            Data Visualization Analyst - Tableau   
                         Test Data Manager (TDM)   
                            Data Center Engineer   

                                                links         location  \
 http://ca.indeed.com/pagead/clk?mo=r&ad=-6NYlb...      Toronto, ON   
 http://ca.indeed.com/pagead/clk?mo=r&ad=-6NYlb...      Toronto, ON   
 http://ca.indeed.com/pagead/clk?mo=r&ad=-6NYlb...       Ottawa, ON   
 http://ca.indeed.com/pagead/clk?mo=r&ad=-6NYlb...      Toronto, ON   
 http://ca.indeed.com/rc/clk?jk=40544eff6430811...     Brampton, ON   
 http://ca.indeed.com/rc/clk?jk=b34dcb56d215e1e...          Ontario   
 http://ca.indeed.com/company/Validus-Research-...     Waterloo, ON   
 http://ca.indeed.com/rc/clk?jk=5d92f59123260f1...       London, ON   
 http://ca.indeed.com/company/M.W.N.-Technologi...  Mississauga, ON   
 http://ca.indeed.com/company/Service-Simple/jo...      Toronto, ON   
http://ca.indeed.com/rc/clk?jk=1971e7726426442...          Ontario   
http://ca.indeed.com/rc/clk?jk=0d86897c7f75d89...      Toronto, ON   
http://ca.indeed.com/rc/clk?jk=b37b64344c41202...      Toronto, ON   
http://ca.indeed.com/rc/clk?jk=07d233c4fb41d4e...  Mississauga, ON   
http://ca.indeed.com/pagead/clk?mo=r&ad=-6NYlb...      Toronto, ON   
http://ca.indeed.com/pagead/clk?mo=r&ad=-6NYlb...      Toronto, ON   

        post day          review                   salary  
     New Post   1,531 reviews            Nothing_found  
   2 days ago       6 reviews            Nothing_found  
     New Post       0 reviews            Nothing_found  
     New Post     781 reviews            Nothing_found  
    1 day ago   1,646 reviews            Nothing_found  
   2 days ago       7 reviews  $23.92 - $29.36 an hour  
  17 days ago       0 reviews            Nothing_found  
   3 days ago     100 reviews            Nothing_found  
   2 days ago       0 reviews            Nothing_found  
 30+ days ago       0 reviews            Nothing_found  
30+ days ago       0 reviews            Nothing_found  
  5 days ago       0 reviews            Nothing_found  
 18 days ago       0 reviews            Nothing_found  
  4 days ago   1,596 reviews            Nothing_found  
    New Post  12,530 reviews            Nothing_found  
    New Post       0 reviews            Nothing_found  
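
The href values scraped by post_link are relative to the site root, which is why the function prefixes the host by hand. As a minimal sketch, the standard library's urllib.parse.urljoin can do the same job and also leaves already-absolute links untouched. The BASE constant, the helper name absolute_link and the sample hrefs below are illustrative assumptions, not values taken from the crawl.

```python
from urllib.parse import urljoin

# Assumed site root; the original code hard-codes 'http://ca.indeed.com'.
BASE = 'https://ca.indeed.com'

def absolute_link(href, base=BASE):
    """Turn a relative href into an absolute URL (hypothetical helper)."""
    return urljoin(base, href)

# Made-up sample hrefs for illustration only.
print(absolute_link('/rc/clk?jk=abc123'))
print(absolute_link('https://example.com/job/1'))
```

Inside post_link, the append could then use absolute_link(a['href']) instead of string concatenation.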

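# pick up useful information from the linked pages
# info
words = []
def info(link) :
    soup = BeautifulSoup(requests.get(link).text,'html.parser')
    summary = soup.find('span',attrs={'id': 'job_summary'})
    # some pages have no job_summary span; skip them instead of raising
    if summary is not None :
        words.extend(summary.text.split())
    return(words)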

# demo
info('https://ca.indeed.com/viewjob?jk=53d052c5a4791a68&tk=1c77cg3ku41dk895&from=serp&alid=3&advn=6591858045400699')[:10]

# Many of these words carry little meaning; drop them later using a stop-word list

['At',
 'Sun',
 'Life,',
 'we',
 'work',
 'together,',
 'share',
 'common',
 'values',
 'and']

Iterating 50 Pages

# base url; the start offset is appended for each page
base_url ='https://ca.indeed.com/jobs?q=data&l=Ontario&start='

# start offsets: 10 results per page, 50 pages
starts = [i*10 for i in range(0,50)]

# collect one dataframe per page
frames = []

# loop 50 pages
for i in starts :
    # build the page url from the base each time (appending to the same string would accumulate offsets)
    url = base_url + str(i)

    # get the url
    page = requests.get(url)

    # parse the page with the html parser so its components can be read individually, rather than treating it as one long string
    soup = BeautifulSoup(page.text, 'html.parser')

    # concat all
    temp = pd.DataFrame([job_title(soup),company(soup),location(soup),
                         salary(soup),review(soup),post_day(soup),post_link(soup)])
    frames.append(temp.T)

# DataFrame.append was removed in pandas 2.0; concatenate once instead
df = pd.concat(frames,ignore_index = True)

df.columns = ['job_title', 'company_name', 'location', 'salary', 'review',
          'post day','links']
print(df.head())
print('-'*80)
print('Total Data we have now: ',len(df))

                                          job_title        company_name  \
AI / Machine Learning Scientist – Toronto AI Lab      LG Electronics   
                                     Data Center          IFDS Group   
                  Data Analyst, Client Analytics  Sun Life Financial   
                                  Data Scientist         Capital One   
        Sr. Manager, Data, Reporting & Analytics              Loblaw   

       location         salary         review    post day  \
 Toronto, ON  Nothing_found  1,531 reviews    New Post   
 Toronto, ON  Nothing_found      6 reviews  2 days ago   
 Toronto, ON  Nothing_found    781 reviews    New Post   
 Toronto, ON  Nothing_found  5,321 reviews    New Post   
Brampton, ON  Nothing_found  1,646 reviews   1 day ago   

                                               links  
http://ca.indeed.com/pagead/clk?mo=r&ad=-6NYlb...  
http://ca.indeed.com/pagead/clk?mo=r&ad=-6NYlb...  
http://ca.indeed.com/pagead/clk?mo=r&ad=-6NYlb...  
http://ca.indeed.com/pagead/clk?mo=r&ad=-6NYlb...  
http://ca.indeed.com/rc/clk?jk=40544eff6430811...  
--------------------------------------------------------------------------------
Total Data we have now:  815
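
The result pages differ only in the start query parameter, which advances by 10 per page in the loop above. A minimal sketch of building those URLs with urllib.parse.urlencode follows. The function name page_urls and the per_page parameter are assumptions made for illustration.

```python
from urllib.parse import urlencode

SEARCH_URL = 'https://ca.indeed.com/jobs'

def page_urls(query, location, pages, per_page=10):
    """Return one search URL per result page (hypothetical helper)."""
    urls = []
    for n in range(pages):
        # start is the offset of the first result on page n
        params = {'q': query, 'l': location, 'start': n * per_page}
        urls.append(SEARCH_URL + '?' + urlencode(params))
    return urls

urls = page_urls('data', 'Ontario', pages=3)
for u in urls:
    print(u)
```

Building each URL from its parts keeps the offsets independent of one another, and urlencode takes care of escaping queries that contain spaces.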

Data Cleaning

# Job title
# job title : keep the part before the first '–', ',', '(', '$' or '-'
df['First_title'] = df['job_title'].map(lambda x : x.split('–')[0].split(',')[0].split('(')[0].split('$')[0].split('-')[0].strip())

# top 15 job titles
df['First_title'].value_counts()[:15].plot.barh(title = 'Top 15 Most Wanted Job Titles \n Ordered Ascendingly',
                                                colormap = 'summer')
plt.show()

[Figure: bar chart of the top 15 most wanted job titles]
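
Splitting on every '-' also cuts hyphenated words, so 'Entry-level Data Scientist' is reduced to 'Entry'. A possible variant, sketched below with the standard re module, splits on a hyphen or en dash only when it is surrounded by spaces. The helper name first_title and the exact separator set are assumptions; the sample titles come from the first results page shown earlier.

```python
import re

# Separators: ',', '(' or '$' anywhere; '-' or '–' only between spaces.
SEPARATORS = re.compile(r'\s[-–]\s|[,($]')

def first_title(title):
    """Keep the part of a job title before the first separator."""
    return SEPARATORS.split(title, maxsplit=1)[0].strip()

for t in ['AI / Machine Learning Scientist – Toronto AI Lab',
          'Data Visualization Analyst - Tableau',
          'Entry-level Data Scientist',
          'Junior Data Engineer (ETL)']:
    print(first_title(t))
```

With pandas this could be applied as df['job_title'].map(first_title).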

# salary

# most salaries are 'Nothing_found', replace them with '0'
df['salary'] = df['salary'].replace('Nothing_found','0')

# convert every salary to a yearly figure
# most posted salaries are a range, so take the mean of the range
per_year = {'week': 52,       # assume 52 weeks / year
            'month': 12,
            'hour': 40*52}    # assume 40 hours / week

def to_number(token):
    # '$23,000' -> 23000.0
    return float(token.replace('$','').replace(',',''))

mean_salary = []
for s in df['salary'] :
    if s == '0' :
        mean_salary.append(0)
        continue
    parts = s.split()
    mean_sal = to_number(parts[0])
    if parts[1] == '-' :
        # a range such as '$23.92 - $29.36 an hour'
        mean_sal = (mean_sal + to_number(parts[2]))/2
    # the last token is the pay period; anything else is treated as yearly
    mean_salary.append(mean_sal * per_year.get(parts[-1],1))

len(mean_salary)

df['mean_sal'] = mean_salary


# number of reviews

df['review number'] = df['review'].map(lambda x : int(x.split()[0].replace(',','')))
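
To make the salary conversion concrete, here is a small stand-alone sketch of the same idea: take the mean of the posted amounts and scale it by the pay period. It uses a regular expression rather than the positional split above, and the function name yearly_salary is hypothetical. The 40-hour week and 52-week year are the same assumptions stated in the code above.

```python
import re

# Multipliers to a yearly figure; 40 h/week and 52 weeks/year are assumptions.
PER_YEAR = {'hour': 40 * 52, 'week': 52, 'month': 12, 'year': 1}

def yearly_salary(text):
    """Mean of the posted amounts, scaled to one year; 0 if none found."""
    amounts = [float(a.replace(',', ''))
               for a in re.findall(r'\$([\d,]+(?:\.\d+)?)', text)]
    if not amounts:
        return 0
    # the last word of the string is taken as the pay period
    unit = text.split()[-1]
    return sum(amounts) / len(amounts) * PER_YEAR.get(unit, 1)

print(yearly_salary('$23.92 - $29.36 an hour'))
print(yearly_salary('$4,000 a month'))
print(yearly_salary('Nothing_found'))
```

For the hourly range in the sample output, the mean is (23.92 + 29.36) / 2 = 26.64 dollars an hour, which is multiplied by 40 * 52 = 2080 hours.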

Python Visualization

#Visualization

# salary visualization
g = sns.kdeplot(df['mean_sal'],shade = True,color = 'g')
g.set_title('Mean Salary for data-related jobs')
plt.show()

[Figure: density plot of mean salary for data-related jobs]

df['mean_sal'].value_counts()

0        757
2     46
0      4
0      4
0      4
Name: mean_sal, dtype: int64

The salary figures differ from the last version because many postings have been refreshed since then, and more of the current postings do not include salary information.

# company name

# graph the top 20 companies who needed the data-related employees recently
df['company_name'].value_counts()[:20].plot.barh(title = 'Top 20 Companies Who Need Data-related Employees by Feb/25/2018',
                                                colormap = 'summer')
plt.show()

[Figure: bar chart of the top 20 companies posting data-related jobs]

# reviews visualization
# plot top 10 companies with most reviews
df[['review number','company_name']].groupby('company_name').mean().sort_values(by = ['review number'],
  ascending = False).head(10).plot.barh(title = 'Top 10 companies with most reviews \n Who Need Employees Recently',
                                        colormap = 'summer')
plt.show()

[Figure: bar chart of the top 10 companies by number of reviews]

Amazon does not appear this time. My guess is that they are not hiring for these roles right now, so they have no current postings.

# Do more work on the links column: collect details and key features from each linked page
import time
for l in df['links'] :

    # get all words for linked pages by function info
    info(l)

    # stop 0.5s for each loop
    time.sleep(0.5)

print('Total words now: ',len(words))
print('-'*40)
print('Unique words now: ',len(set(words)))

Total words now:  408426
----------------------------------------
Unique words now:  5012

# stop words
stopwords = nltk.corpus.stopwords.words('english')

# drop words exist in stopwords
keywords = []
for word in words :
    if word.lower() not in stopwords :

        # drop the words whose length is less than 3
        if len(word) >= 3 :
            keywords.append(word.lower())

print('Total keywords now: ',len(keywords))   
print('-'*40)
print('Unique Keywords now: ',len(set(keywords)))

Total keywords now:  266685
----------------------------------------
Unique Keywords now:  4237
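
The demo output earlier contains tokens such as 'Life,' and 'together,', so splitting on whitespace leaves punctuation attached and the same word can be counted under several spellings. One possible refinement, sketched here with the standard library only, strips leading and trailing punctuation before filtering. The tiny STOP set stands in for the NLTK list purely to keep the example self-contained, and clean_tokens is a hypothetical name.

```python
import string

# Stand-in for nltk.corpus.stopwords.words('english'); an assumption for the demo.
STOP = {'at', 'we', 'and', 'the', 'of'}

def clean_tokens(tokens, stop=STOP, min_len=3):
    """Lower-case, strip surrounding punctuation, drop stop words and short tokens."""
    kept = []
    for tok in tokens:
        tok = tok.strip(string.punctuation).lower()
        if len(tok) >= min_len and tok not in stop:
            kept.append(tok)
    return kept

sample = ['At', 'Sun', 'Life,', 'we', 'work', 'together,',
          'share', 'common', 'values', 'and']
print(clean_tokens(sample))
```

Passing a set rather than a list as the stop-word collection also makes each membership test faster, which can matter with several hundred thousand tokens.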

# POS-tag the keywords
word_tag = nltk.pos_tag(keywords)

# pick up the words tagged NN (singular nouns), reusing the tags computed above
nouns = [i for i,j in word_tag if j == 'NN']

print('Total Nouns now: ',len(nouns))   
print('-'*40)
print('Unique Nouns now: ',len(set(nouns)))

Total Nouns now:  108818
----------------------------------------
Unique Nouns now:  1912

pd.Series(nouns).value_counts().head(15).plot.barh(title = 'Top 15 Words in job description',colormap = 'summer')
plt.show()

[Figure: bar chart of the top 15 nouns in job descriptions]

Nouns are more likely to be useful skills or features that employers want.

I find that experience is the most important feature employers look for (leaving aside ‘business’, which is too generic to be useful). Most companies do not want to hire new graduates because they take a long time to train. Other words such as ‘support’ or ‘team’ suggest that most jobs involve many staff working together. I think ‘machine’ is part of ‘machine learning’, since ‘learning’ also appears many times in the list. Lastly, ‘client’ suggests that most companies posting on Indeed are customer-oriented organizations rather than academic institutions.
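
The guess that 'machine' mostly appears as part of 'machine learning' can be checked by counting adjacent word pairs instead of single words. The sketch below does this with collections.Counter on a made-up token list. In the notebook the input would be the lower-cased words list, and the function name count_pairs is an assumption.

```python
from collections import Counter

def count_pairs(tokens):
    """Count adjacent (word, next_word) pairs in a token list."""
    return Counter(zip(tokens, tokens[1:]))

# Made-up tokens for illustration, not data from the crawl.
tokens = ['machine', 'learning', 'experience', 'with',
          'machine', 'learning', 'and', 'deep', 'learning']
pairs = count_pairs(tokens)
print(pairs[('machine', 'learning')])
print(pairs.most_common(1))
```

Comparing the count of the ('machine', 'learning') pair with the count of 'machine' alone would show how often the word occurs outside that phrase.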

# pick up the words tagged VB or VBD (base-form and past-tense verbs)
verbs = [i for i,j in word_tag if j in ['VB','VBD']]

print('Total Verbs now: ',len(verbs))   
print('-'*40)
print('Unique Verbs now: ',len(set(verbs)))

Total Verbs now:  9443
----------------------------------------
Unique Verbs now:  235

pd.Series(verbs).value_counts().head(15).plot.barh(title = 'Top 15 Words in job action',colormap = 'summer')
plt.show()

[Figure: bar chart of the top 15 verbs in job descriptions]

Verbs are more likely to be the actions employees need to perform.

The verbs fall into a few groups: ‘develop / improve / grow’ are about increasing the performance or experience of the company or its clients, while ‘qualified / ensure / perform’ are about being able to do something. ‘ensure’ appears far more often than the others.

# list to csv and make visualization via tableau
# (separate file names so the verbs do not overwrite the nouns)
pd.DataFrame(nouns,columns = ['nouns']).to_csv('keyword_nouns.csv',index = False)

# list to csv and make visualization via tableau
pd.DataFrame(verbs,columns = ['verbs']).to_csv('keyword_verbs.csv',index = False)

# dataframe to csv and make visualization via tableau
df.to_csv('indeed.csv',index = False)

Visualization Part with Tableau

[Figure: Tableau visualization]

Toronto has more opportunities than other cities, mainly because it is the largest city in Ontario. Smaller cities still offer some good chances, though.

[Figure: Tableau visualization]

Data Scientist is the most in-demand job title at the moment, and there are several related titles such as Machine Learning Scientist and Research Scientist. Meanwhile, most job titles are ‘advanced-level’ positions that require more than 7 years of working experience, such as Sr. Manager and Data Power Admin. There are also a few intern positions posted recently for summer students.

[Figure: Tableau visualization]

Most companies with many reviews do not post salary information; the companies that do post salaries have fewer reviews, but not zero.

[Figure: Tableau visualization]

Although each job stays listed for more than 30 days, we can still see the trend in posted job titles through ‘post days’. A post days value of 0 means the job is newly posted. Among the new postings, BI analytics, data power admin and data analytics are in higher demand than the others.
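
Treating the posting age as a number requires converting the scraped 'post day' strings first. The write-up does not show how this was done, so the following is only one plausible sketch: 'New Post' maps to 0, 'N days ago' to N, and '30+ days ago' to 30. The function name days_since_post and the handling of unrecognized strings (returning None) are assumptions.

```python
import re

def days_since_post(text):
    """Convert a 'post day' string to a day count; None if not recognized."""
    if text.strip().lower() == 'new post':
        return 0
    # matches '1 day ago', '17 days ago' and '30+ days ago'
    match = re.match(r'\s*(\d+)\+?\s+days?\s+ago', text)
    return int(match.group(1)) if match else None

for s in ['New Post', '1 day ago', '17 days ago', '30+ days ago']:
    print(s, '->', days_since_post(s))
```

Note that '30+ days ago' is a lower bound, so a value of 30 here means "30 or more".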

[Figure: Tableau visualization]

For Data-Related Jobs in Toronto, ON
