Expanding Your Toolkit With Web-Scraping and Content Extraction
Submission ID: 6207
Date: Thursday, 10:15 AM to 11:45 AM
Session: Session D, 10:15 to 11:45 AM
Primary Presenter
Angelina KewalRamani, American Institutes for Research
Additional Authors or Round Table Presenters
Abstract
Due to the COVID-19 pandemic, elementary and secondary schools and postsecondary institutions were in dire need of support from the federal government. The Coronavirus Aid, Relief, and Economic Security (CARES) Act, passed by Congress in March 2020, provided funding to U.S. schools and institutions of higher education. Federal agencies needed to distribute funding rapidly and then collect important indicators from recipient postsecondary institutions on very short timeframes. However, the majority of institutions reported their data in unstructured PDF reports posted on their own websites. Identifying the reports from nearly 5,000 institutions and extracting the relevant data using conventional techniques would have proved time-consuming, so we employed innovative and efficient methods to identify the reports and extract the data.

This presentation describes the process of developing code for a data pipeline that performed web scraping and content extraction for the PDF documents. We supported the U.S. Department of Education in determining how institutions of higher education used the funds authorized to address challenges resulting from the pandemic. Our team used an automated process to access the student and institutional portions of the CARES Act data reported on the websites of approximately 4,900 postsecondary institutions across the country. We used web-scraping techniques to identify and collect the CARES Act reports, and specific search terms to identify relevant sections within the unstructured documents. We then applied text analysis and content extraction methods to identify answers to various text and numeric questions within each report. We conducted multiple rounds of quality control to fine-tune the content extraction algorithms and validate the accuracy of the results. Our web-scraping program was developed in an open-source programming language.
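The report-identification step described above can be sketched as follows. This is a minimal, hypothetical illustration in Python (one common open-source choice; the abstract does not name the language used), assuming search terms such as "cares" and "heerf" and a sample HTML page; it is not the project's actual pipeline code. It scans a page's anchor tags for PDF links whose URLs mention a search term.

```python
from html.parser import HTMLParser

# Assumed keywords for locating CARES Act report links (illustrative only).
SEARCH_TERMS = ("cares", "heerf", "emergency")

class ReportLinkFinder(HTMLParser):
    """Collect hrefs of PDF links whose URL mentions a search term."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        lower = href.lower()
        if lower.endswith(".pdf") and any(t in lower for t in SEARCH_TERMS):
            self.links.append(href)

# Sample institution page (fabricated for illustration).
sample_html = """
<html><body>
  <a href="/docs/CARES-Act-Quarterly-Report.pdf">CARES Act report</a>
  <a href="/docs/catalog.pdf">Course catalog</a>
  <a href="/news/heerf_disclosure.pdf">HEERF disclosure</a>
</body></html>
"""

finder = ReportLinkFinder()
finder.feed(sample_html)
print(finder.links)  # → ['/docs/CARES-Act-Quarterly-Report.pdf', '/news/heerf_disclosure.pdf']
```

In a real crawl, the HTML would come from fetched institution pages rather than a literal string, and matching would likely also consider the link text, not just the URL.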
The findings will help researchers evaluate different methods of finding and extracting data.
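The content extraction step, in which answers to numeric questions are pulled from unstructured report text, can be sketched with a simple pattern search. The phrasing, pattern, and sample text below are assumptions for illustration, not the project's actual extraction rules.

```python
import re

# Hypothetical pattern: a dollar amount following verbs commonly used in
# CARES Act disclosures ("distributed", "disbursed", "awarded").
AMOUNT_PATTERN = re.compile(
    r"(?:distributed|disbursed|awarded)\s+(?:a\s+total\s+of\s+)?"
    r"\$([\d,]+(?:\.\d{2})?)",
    re.IGNORECASE,
)

def extract_amount(text):
    """Return the first reported dollar amount as a float, or None."""
    match = AMOUNT_PATTERN.search(text)
    if not match:
        return None
    return float(match.group(1).replace(",", ""))

# Fabricated sentence in the style of an institutional report.
report_text = (
    "As of June 30, 2020, the institution had distributed a total of "
    "$1,254,300 in emergency financial aid grants to students."
)
print(extract_amount(report_text))  # → 1254300.0
```

The quality-control rounds the abstract describes would then compare such extracted values against manually reviewed samples, tightening patterns like this one where they over- or under-match.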
Category
Methodological Brief > Data Science, Big Data, and Administrative Records
Description