Home United States USA — software Using PostgreSQL Aggregate Functions in YugabyteDB to Analyze COVID-19 Data

Using PostgreSQL Aggregate Functions in YugabyteDB to Analyze COVID-19 Data

December 17, 2020

274

Presenting the use case for PostgreSQL Aggregate Functions to analyze data from Carnegie Mellon’s COVIDcast in YugabyteDB.
Join the DZone community and get the full member experience. An article in the Washington Post, published on 23-Oct-2020, argues the case for wearing a mask while the COVID-19 pandemic continues and refers to data from Carnegie Mellon’s COVIDcast, an academic project tracking real-time coronavirus statistics. Look for this: There’s a simple statistical measure of correlation intensity called ‘R-squared,’ which goes from zero (absolutely no relationship between the two variables) to 1 (the variables move perfectly in [linear] tandem). The ‘R-squared’ of CovidCast’s mask and symptom data is 0.73, meaning that you can predict about 73 percent of the variability in state-level COVID-19 symptom prevalence simply by knowing how often people wear their masks. The ‘R-squared’ measure is implemented in YugabyteDB as the YSQL aggregate function regr_r2(). YugabyteDB inherits this, along with about forty other aggregate functions from PostgreSQL. This is just a part of the bigger picture, brought by the decision that Yugabyte engineers made to reuse the PostgreSQL C code for SQL processing ‘as is’ and to wire it up to YugabyteDB’s distributed storage layer. The COVIDcast site includes this note: We are happy for you to use this data in products and publications. Please acknowledge us as a source: Data from Delphi COVIDcast, covidcast.cmu.edu. Not long before I read the Washington Post article, I had just completed writing the section on aggregate functions in the YugabyteDB YSQL documentation. I decided that it would be useful to add a new subsection, ‘Linear regression analysis of COVID data from Carnegie Mellon’s COVIDcast project.’ The page has a link to a zip-file that contains all the code, together with the downloaded COVIDcast data that the use case account relies on. This blog post summarizes my account of that use case and follows its organization. The account has two distinct parts. The first part is covered by the sections Finding and downloading the COVIDcast data and Ingesting, checking, and combining the COVIDcast data. It explains how to ingest the downloaded COVIDcast data into a single ‘covidcast_fb_survey_results’ table upon which various analysis queries can be run. The table has this structure: ‘mask_wearing_pct’, ‘symptoms_pct’, and ‘cmnty_symptoms_pct’ are explained in the section ‘Finding and Downloading the COVIDcast Data’, below. If your interest is limited to how to use the YSQL functions for linear regression analysis and how to understand the values that they return, you can skip the whole of this first part, simply accept the table as the starting point (taking the column names to have self-evident meanings), and start reading at the section Using the YSQL linear regression analysis functions on the COVIDcast data. However, the considerations that this part explains and the SQL techniques that are used to establish that the data that you import into the ‘covidcast_fb_survey_results’ table meet the rules that the COVIDcast site documents are interesting in their own right, independently of how the ingested data eventually will be used. A time-honored principle of proper practice insists that when you ingest data from a third-party provider, especially when you have no formalized relationship with this party, you must run stringent QA tests on every successive ingestion run. Interestingly, by starting with a real goal and by implementing appropriate solutions, I found myself using various techniques that I had recently documented. For example, in an earlier exercise, I had documented the YSQL features that implement the functionality of the array data type.