I wish all the agency updates were in one place...
Well, guess what we’re going to learn how to build today. Harnessing the infinite power of the internet, we are going to run what’s called a “table scrape” on the OSHA and NIOSH websites and pull some updates from them. Here’s the finished product.
What is table scraping?
Table scraping (or web scraping) is the process of using bots to extract content and data from a website. The process extracts the underlying HTML of a webpage, which can then be repurposed. Sounds complicated, and it is. Lucky for us, libraries exist to handle the heavy lifting; we just need to know how to operate them. The library we will be using is cheerio.
How do we use this?
We are going to set up a server on codesandbox.io, then use cheerio to pull the HTML from the OSHA updates page and the NIOSH updates page. Once we have the information we want, we are going to spit it back out where we can view it in the browser. Let’s take a look at the OSHA site. The URL we will be using on the OSHA side is https://www.osha.gov/whatsnew. If you right-click the “Last 30 Days” header, then click “Inspect” in the dropdown, you’ll see the table content is found within a container with an id of “last-30-days”. This is a unique id, and one we can specify on our server to help us scrape.
If we look more closely at the inspected markup, we can see a “<ul>” tag followed by several “<li>” tags, which ultimately contain the data we want. This is useful information we will use on the server side, where the cheerio library will work its magic.
On the server, we tell the cheerio library that from that OSHA URL, we want to access each “<li>” inside the container with the id “last-30-days”. Cheerio will do its thing and spit back the result. We can do the same thing with the NIOSH site: https://www.cdc.gov/niosh/enews/default.html
We can now go into our browser, and when we navigate to the following URL, the server will run the table scrape logic and give us the info we’re looking for: https://0lo1v.sse.codesandbox.io/
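A minimal sketch of what the server side of that request could look like, assuming Node’s built-in `http` module rather than whatever framework the sandbox actually uses. The names `makeUpdatesHandler`, `getStubUpdates`, and `startServer`, and the port 8080, are all hypothetical. The scraper is stubbed with placeholder data so the block runs without network access.

```javascript
const http = require("http");

// Build a request handler that runs the scrape on every request and
// returns the result as JSON. `getUpdates` is a hypothetical async
// function standing in for the cheerio scraping logic.
function makeUpdatesHandler(getUpdates) {
  return async function handleRequest(req, res) {
    try {
      const updates = await getUpdates();
      res.writeHead(200, { "Content-Type": "application/json" });
      res.end(JSON.stringify(updates, null, 2));
    } catch (err) {
      // Report a failed scrape (site down, markup changed) instead of crashing.
      res.writeHead(502, { "Content-Type": "application/json" });
      res.end(JSON.stringify({ error: String(err.message || err) }));
    }
  };
}

// Stubbed scraper returning placeholder data. A real version would
// fetch the OSHA and NIOSH pages and extract their list items.
async function getStubUpdates() {
  return { osha: ["Example OSHA update"], niosh: ["Example NIOSH update"] };
}

// Start listening only when this function is called explicitly.
function startServer(port = 8080) {
  const server = http.createServer(makeUpdatesHandler(getStubUpdates));
  return server.listen(port, () => {
    console.log(`Listening on port ${port}`);
  });
}
```

Calling `startServer()` and opening `http://localhost:8080/` in a browser would show the JSON. Because the scrape runs inside the handler, each page load returns whatever the agency sites show at that moment.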
What this means
Think of how many regulatory agencies dictate how we operate in the health and safety industry. OSHA, NIOSH, ANSI, NFPA, etc. So many acronyms, so many updates and announcements. Being able to pull these all into a single repository would make our lives so much easier than going to each of their sites to gather update information. I think I just thought of my next project.