The Web Integrity Project’s monitoring processes reveal an increasing disparity in Spanish-language HIV/AIDS content

WIP Monitoring Analyst Aaron Lemelin reflects on WIP’s monitoring processes in the context of the overhaul of the Centers for Disease Control and Prevention’s HIV/AIDS website.

A screenshot from our web monitoring software showing changes made to the sidebar of the CDC HIV/AIDS website, including the removal of the link for the “HIV Among Incarcerated Populations” page (shown here with the link text “Incarcerated Populations”). This is the change Steven, one of WIP’s analysts, raised at a weekly web monitoring meeting.

In September 2018, the members of the Web Integrity Project (WIP) web monitoring team were huddled around their computers discussing recent changes made to federal agency websites. One team member, Steven, raised a change he’d caught: a link that was removed from the sidebar of the HIV/AIDS website hosted on the website of the Centers for Disease Control and Prevention (CDC). Clicking on the link revealed that a page titled “HIV Among Incarcerated Populations” had been removed. The removed page had described HIV as “a serious health issue for correctional facilities and their incarcerated populations” and presented statistical information on HIV among the incarcerated.

This change was concerning. One of the things WIP always seeks to document is the removal of important information, and the removal of information about the HIV epidemic would certainly qualify as important. But WIP often finds changes that turn out not to be significant or worthy of publication. To determine what is notable and what isn’t, we’ve developed a meticulous process for documenting and vetting the changes we uncover. The changes to HIV information on the CDC site are notable because, while our initial concerns turned out to be baseless — no significant information was removed from the site — our robust investigation methods led us to another concerning discovery about CDC’s HIV information, one we may not have discovered without the methods we’ve developed over the past two years.

WIP’s web monitoring and retrospective analysis processes

As the team discussed the changes Steven had discovered at our weekly web monitoring meeting, we considered reasons the content could have been removed and explored whether the content was available elsewhere. Additionally, the team reflected on some other recent changes we had seen on the CDC domain and wondered how the removal of the incarcerated populations page related to the larger set of changes on the domain.

The web monitoring team’s weekly meeting is a key part of WIP’s web monitoring process, which I have been a part of for the last year. I first became interested in monitoring websites when I joined another organization, the Environmental Data and Governance Initiative (EDGI), where we were monitoring climate change information being altered by the Trump administration. I joined WIP shortly after, where we adopted and tweaked EDGI’s process to monitor healthcare, criminal justice, and immigration domains.

The weekly monitoring process begins with a team of analysts proactively monitoring and reviewing the latest changes to a given domain and classifying each change according to our in-house system. A set of changes is captured by software, which monitors close to 30,000 federal government webpages and provides “snapshots,” or captures of the page, each time a change occurs.

The analysts examine each change, one by one, in meticulous detail, deciding whether the change is substantial enough for further vetting. Each week we see an array of alterations, many of which are routine web maintenance. It’s common to see grammar corrections or annual updates to the statistics on a webpage. These improvements are necessary to keep the content legible and up-to-date. However, we often find more significant changes, like the removal of a link from a menu or sidebar or the removal of an entire page, that we would label as “substantive,” meaning they require a further look to determine if the change is important.

Given the seeming importance of the removal of the incarcerated populations page, we decided to take a look at the CDC HIV/AIDS website in its entirety. There was a good chance the removed page was part of a larger series of changes.

But trying to analyze an entire website is a big endeavor. A quick Google search for “site:cdc.gov/hiv” returns 1,520 pages, and CDC does not offer a complete list of URLs for the website. Because of the size and complexity of the website, we needed an approach that would give us a more in-depth and complete look at how it had changed. We have developed a process for exactly this kind of examination, which we call “retrospective analysis.”

Retrospective analysis is a method we created to see how a website has changed from one point in time to another. In contrast to our weekly, proactive monitoring, retrospective analysis allows us to adjust the date range we want to analyze. The method allows comparison of the site’s main components, including headers, menus, and footers, as well as content on each individual page.

To start, we scope out the website. In the case of the CDC’s HIV/AIDS website, we navigated the site using the sidebar to identify potential URLs to collect for comparison. We prioritized pages that contained resources, program priorities, fact sheets, and other important information. After we scoped the site, we had identified URLs for 129 pages within the CDC HIV/AIDS website, including the incarcerated populations page, to analyze.

Once we had our list of URLs, we needed to collect “before” and “after” snapshots of each page using the Internet Archive’s Wayback Machine (IAWM). The analysts were tasked with gathering the two snapshots, one from before President Trump’s inauguration (i.e. January 19, 2017, or earlier) and the most current version of the page on IAWM. This time period allowed us to compare each page at the end of the Obama administration to the current version of the page under the Trump administration.
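As a rough illustration, the snapshot-selection step could be sketched in Python as follows. This is not the tooling WIP used. The function names are hypothetical, and the query parameters follow our understanding of the Internet Archive’s public CDX server documentation, so they should be checked before use. The sketch builds a query URL listing a page’s captures, then picks the latest capture on or before January 19, 2017, and the most recent capture overall.

```python
from urllib.parse import urlencode

# Assumption: the cutoff is the last day before the inauguration, as described above.
CUTOFF = "20170119"


def cdx_query_url(page_url):
    """Build a CDX query URL that lists successful captures of page_url.

    The endpoint and parameter names are assumptions taken from the public
    CDX server documentation; verify them before relying on this.
    """
    params = {"url": page_url, "output": "json", "filter": "statuscode:200"}
    return "http://web.archive.org/cdx/search/cdx?" + urlencode(params)


def pick_before_and_after(timestamps, cutoff=CUTOFF):
    """Return (latest capture on or before cutoff, most recent capture).

    timestamps are Wayback-style strings of the form YYYYMMDDhhmmss.
    Either element is None when no suitable capture exists.
    """
    ordered = sorted(timestamps)
    before = [t for t in ordered if t[:8] <= cutoff]
    return (before[-1] if before else None, ordered[-1] if ordered else None)


def snapshot_url(timestamp, page_url):
    """Build the Wayback Machine address for one capture of a page."""
    return f"https://web.archive.org/web/{timestamp}/{page_url}"
```

Given the list of capture timestamps for one page, pick_before_and_after returns the pair an analyst would open side by side. A page first captured after the inauguration yields None for the “before” snapshot, which is itself worth noting during review.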

Once the two snapshots were gathered, we then used “diffing” software to reveal the differences in the text between the two snapshots of each page. The software shows only changes to the text, such as revisions in the main body content or in the menu. To make sure non-textual changes are identified, analysts also did a visual comparison to check for changes like link removals or URL changes.  
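To give a sense of what text diffing does, here is a minimal sketch using Python’s standard difflib module. It is an illustration only, not the diffing software our team used, and the function name and sample sidebar text are made up for the example. It compares two versions of a page’s text line by line and reports which lines were removed and which were added.

```python
import difflib


def removed_and_added(before_text, after_text):
    """Compare two text snapshots line by line.

    Returns (removed_lines, added_lines). Lines present in both
    snapshots are ignored.
    """
    diff = list(difflib.ndiff(before_text.splitlines(), after_text.splitlines()))
    # ndiff prefixes removed lines with "- " and added lines with "+ ".
    removed = [line[2:] for line in diff if line.startswith("- ")]
    added = [line[2:] for line in diff if line.startswith("+ ")]
    return removed, added


# Hypothetical sidebar text from a "before" and an "after" snapshot.
before = "Home\nIncarcerated Populations\nWomen"
after = "Home\nWomen"
removed, added = removed_and_added(before, after)
print("Removed:", removed)
print("Added:", added)
```

A comparison like this only sees text. A link whose label stays the same but whose target URL changes would not show up, which is why the visual check remains part of the process.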

Three analysts, including myself, split the 129 URLs. As we compared snapshots of each page, we realized a considerable amount of content had been changed and that the HIV/AIDS website had been overhauled since the 2017 inauguration. Most commonly, the content was altered to update statistics (from data for 2015 or earlier to 2017 data) and to revise sections on the challenges faced and strategies adopted by CDC to combat HIV. Other parts of the overhaul involved changed menus and removed content and pages.

Whenever we find removed content or pages, the next step in our process is to check whether similar content can be found elsewhere on the agency website. We used an internet search engine, in this case, Google, and searched the CDC website for the particular key term (in the case of the removed incarcerated populations page: “incarcerated populations” site:cdc.gov) and also searched for quotes from the removed content or page (e.g. for the removed incarcerated populations page: “HIV is a serious health issue for correctional facilities and their incarcerated populations” site:cdc.gov).
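The two kinds of searches described above follow a simple pattern, sketched here in Python. The helper name is hypothetical; the sketch only assembles the query strings and does not contact any search engine.

```python
def site_search_queries(key_term, quote, domain):
    """Build two site-restricted search queries for a removed page.

    The first looks for the page's key term, the second for an exact
    quote from the removed content. Both are limited to one domain.
    """
    return [
        f'"{key_term}" site:{domain}',
        f'"{quote}" site:{domain}',
    ]


queries = site_search_queries(
    "incarcerated populations",
    "HIV is a serious health issue for correctional facilities "
    "and their incarcerated populations",
    "cdc.gov",
)
for query in queries:
    print(query)
```

Searching for an exact quote is the stricter of the two checks: a match suggests the text was moved rather than deleted, while a key-term match may only point to related content.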

Often the searches lead us to similar existing or new content on the agency’s website. For the removed incarcerated populations page, the search led us to another section of CDC’s website titled, “Correctional Health.” This section of the site features an array of information on HIV/AIDS in correctional settings, including recommendations, guidance, reports, and education materials.

When we saw that much of the content on the removed page was available elsewhere on the site, our initial concerns were allayed. Ultimately, we concluded that, because of the similarities between the removed content and content in the “Correctional Health” section, the removal of the “HIV Among Incarcerated Populations” page was not significant enough to warrant a report.

This isn’t an unusual outcome. In fact, the decision not to pursue a report is the most common outcome of our investigations into website changes when we find that alternative content is available elsewhere or new content has been added. We strive to interrogate changes fully and often find our initial concerns were unfounded. This is actually a good thing. Every time we find our fears unfulfilled, we can be reassured that the American public has not lost access to significant amounts of information.

WIP’s method of retrospective analysis allowed us to better understand the technical details and context of the changes that occurred on the CDC’s HIV/AIDS website. By going through the entirety of the site, we were able to see the routine changes that were being made, including the updates to statistics. We could also see where duplicated content was being streamlined.

Increased disparities between English- and Spanish-language content

The careful analysis process also, however, turned up something else of concern. The retrospective analysis of the CDC HIV/AIDS site showed that during the overhaul, CDC made accessing Spanish versions of pages more difficult and significantly reduced the amount of Spanish-language content on the HIV/AIDS site. We also saw a scattershot approach to the provision of Spanish-language resources, with no obvious structure or principles guiding which resources were available in Spanish and which were not.

First, CDC reduced access to Spanish versions of the content by removing a dropdown feature that had allowed users to easily select the Spanish version of a page from its English version.

Even before the overhaul, only a small portion of the English-language content was replicated in Spanish. The Spanish-language pages were organized into the Spanish version of the CDC HIV/AIDS site titled, “VIH/SIDA.” In the English version of the site, the homepage linked to at least 130 pages from the sidebar; as few as thirty pages were linked in the Spanish version.

During the overhaul, CDC updated many English-language pages to reflect the availability of new statistics for 2017. By contrast, instead of being updated or left unchanged with some 2015 or earlier data points on the page, the Spanish versions were removed in their entirety. After the overhaul, the VIH/SIDA site linked to only fifteen pages, indicating that at least fifteen Spanish-language pages were removed during the overhaul.

For instance, the page titled “El VIH en las mujeres,” which corresponded to the “HIV Among Women” page, was formerly part of the VIH/SIDA site. The page contained data on the number of women with HIV by race and ethnicity, information on how they contracted the virus, the key challenges that the CDC faces in reducing the prevalence of HIV among women, and the actions CDC is taking to combat the problem. While the English version of the page is still available (with updated statistics and new pie charts), the Spanish version was removed during the overhaul.

A screenshot from the Internet Archive’s Wayback Machine of the “El VIH en las mujeres” webpage. The page was removed from the CDC HIV/AIDS website during an overhaul.

It must be acknowledged that agencies face serious resource constraints and the translation of all material into every language is not possible. Perhaps the CDC lacked a staff person to translate the new statistical information and did not have funds in the budget for a professional translation service. However, even granting these constraints, it is unclear why the presence of 2015 or earlier data on the page necessitated the removal of the whole page.

In some instances, Spanish-language pages with old statistics were left on the website. For example, the “Trabajadores Sexuales” page remained (and remains) on the VIH/SIDA site even though it contains statistics relating to 2012 and 2015. Similarly, the “Transmisión Ocupacional” page contains statistics relating to 1999 through 2013, and it remains live on the website. These statistics, while relatively old, remain correct and are unlikely to mislead.

Even before the removal of the Spanish-language pages as part of the post-inauguration overhaul, there was already a large disparity in the amount of content in Spanish and English. For example, the English-language “HIV Among Women” page was one of twenty pages linked from the “HIV by Group” page and dedicated to information about HIV in different populations, like gay and bisexual men and low-income households. The pages were all located in the www.cdc.gov/hiv/group subdirectory. By contrast, in the Spanish-language site the www.cdc.gov/hiv/spanish/group subdirectory contained only nine pages, and did not include content that would likely be useful to Spanish-speaking users, such as information on HIV among low-income households.

Throughout our analysis, we struggled to understand what criteria the CDC used to determine whether to offer content in Spanish and whether to update, retain, or remove existing content. We wondered to what extent they considered how useful the content might be to Spanish speakers when deciding to provide content in Spanish. It may be that the CDC does have guiding principles around Spanish-language content, but they do not appear on the CDC’s website.

Even for pages that have both an English and a Spanish version, there were differences in the content provided in the two languages. The “Prevention” page presents information in the form of frequently asked questions. The English version contains fourteen questions, whereas the corresponding Spanish-language page, titled “Prevención del VIH,” has only eight. Missing from the Spanish page are questions about preventing HIV transmission for those living with HIV and the transmission of HIV via oral or anal sex. Again we wondered what criteria CDC was using to determine which content to provide in Spanish. Why were some topics important enough to be covered in English and Spanish, but others only in English?

These findings encouraged us to think more about limited English proficiency (LEP) access and contributed to our article about language access issues and the variety of approaches taken by individual agencies.

The Department of Health and Human Services, of which the CDC is part, has laid out broad principles for language access to programs and online material. However, the department does not mandate what content is translated among its agencies. Nor has it set out principles or a defined approach to assist agencies in determining what content is particularly important for Spanish-speaking communities and should be offered in Spanish as well as English.

The disparity between the English and Spanish versions of CDC’s HIV/AIDS website, in terms of both the content available and the way existing content was treated during the overhaul, highlights some serious issues around the provision of information intended for LEP populations. When content is missing in Spanish, or is removed after having been available, Spanish-speaking people with limited proficiency in English face a barrier to information on HIV and AIDS. In an ideal world, the CDC would provide and maintain information, including statistics, reports, and recommendations, to the same extent and in the same way it does for English-speaking populations. Even recognizing the real constraints under which agencies operate, the LEP public should be able to expect a principled approach to overhauling content in languages other than English, one that puts front and center the goal of providing and retaining as much information as possible so long as it is not misleading or incorrect.

Conclusion

Much of our success here at WIP has been driven by our research process. The process, from proactive weekly monitoring to in-depth retrospective analysis, allows us to monitor changes to government websites and to differentiate between routine maintenance and important changes. Through this process, we are able to maintain the integrity of our research, ensuring that content we report as removed is, in fact, removed and no longer available elsewhere. Sometimes, as was the case with the incarcerated populations page, our initial worries about removals are unfounded. Yet our process helps us clarify the changes we come across and often leads us to discover other issues. In analyzing the overhaul of CDC’s HIV/AIDS website, we made an unexpected discovery involving increased disparities in access to English and Spanish content, underscoring the value of our process.