Illuminating Big Data will leave governments in the dark


Robin Wigglesworth in the Financial Times: “Imagine a world where interminable waits for backward-looking, frequently revised economic data seem as archaically quaint as floppy disks, beepers and a civil internet. This fantasy realm may be closer than you think.

The Bureau of Economic Analysis will soon publish its preliminary estimate for US economic growth in the first three months of the year, finally catching up on its regular schedule after a government shutdown paralysed the agency. But other data are still delayed, and the final official result for US gross domestic product won’t be available until July. Along the way there are likely to be many tweaks.

Collecting timely and accurate data is a Herculean task, especially for an economy as vast and varied as the US’s. But last week’s World Bank-International Monetary Fund annual spring meetings offered some clues on a brighter, more digital future for economic data.

The IMF hosted a series of seminars and discussions exploring how the hot new world of Big Data could be harnessed to produce more timely economic figures — and improve economic forecasts. Jiaxiong Yao, an IMF official in its African department, explained how it could use satellites to measure the intensity of night-time lights, and derive a real-time gauge of economic health.

“If a country gets brighter over time, it is growing. If it is getting darker then it probably needs an IMF programme,” he noted. Further sessions explored how the IMF could use machine learning — a popular field of artificial intelligence — to improve its influential but often faulty economic forecasts; and real-time shipping data to map global trade flows.
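The night-lights gauge boils down to comparing a region’s average radiance across two periods. Below is a minimal Python sketch of that arithmetic, assuming gridded radiance rasters (VIIRS-style) are already loaded as NumPy arrays; the function name and toy data are illustrative assumptions, not the IMF’s actual pipeline:

```python
import numpy as np

def luminosity_growth(radiance_t0, radiance_t1, country_mask):
    """Crude growth proxy: relative change in mean night-time radiance
    over a country's pixels between two periods."""
    mean_t0 = radiance_t0[country_mask].mean()
    mean_t1 = radiance_t1[country_mask].mean()
    return (mean_t1 - mean_t0) / mean_t0

# Toy example: a 100x100 raster that brightens by roughly 5 per cent.
rng = np.random.default_rng(0)
t0 = rng.gamma(shape=2.0, scale=1.0, size=(100, 100))
t1 = t0 * 1.05 + rng.normal(0, 0.01, size=(100, 100))
mask = np.ones((100, 100), dtype=bool)

print(f"Luminosity-implied growth: {luminosity_growth(t0, t1, mask):+.1%}")
```

A country “getting brighter” shows up as a positive number; real applications would also correct for sensor drift, cloud cover and moonlight.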

Sophisticated hedge funds have been mining some of these new “alternative” data sets for some time, but statistical agencies, central banks and multinational organisations such as the IMF and the World Bank are also starting to embrace the potential.

The amount of digital data around the world is already unimaginably vast. As more of our social and economic activity migrates online, the quantity and quality is going to increase exponentially. The potential is mind-boggling. Setting aside the obvious and thorny privacy issues, it is likely to lead to a revolution in the world of economic statistics. …

Yet the biggest issues are not the weaknesses of these new data sets — all statistics have inherent flaws — but their nature and location.

Firstly, this data revolution depends on today’s lax regulatory and personal attitudes towards personal data continuing, and there are signs of a (healthy) backlash brewing.

Secondly, almost all of this alternative data is being generated and stored in the private sector, not by government bodies such as the Bureau of Economic Analysis, Eurostat or the UK’s Office for National Statistics.

Public bodies are generally too poorly funded to buy or clean all this data themselves, meaning hedge funds will benefit from better economic data than the broader public. We might, in fact, need legislation mandating that statistical agencies receive free access to any aggregated private sector data sets that might be useful to their work.

That would ensure that our economic officials and policymakers don’t fly blind in an increasingly illuminated world….(More)”.

Synthetic data: innovation for public good


Blog Post by Catrin Cheung: “What is synthetic data, and how can it be used for public good? ….Synthetic data are artificially generated data that have the look and structure of real data, but do not contain any information on individuals. They do, however, retain the more general characteristics that are used to find patterns in the data.

They are modelled on real data, but designed in a way which safeguards the legal, ethical and confidentiality requirements of the original data. Given their resemblance to the original data, synthetic data are useful in a range of situations, for example when data is sensitive or missing. They are used widely as teaching materials, to test code or mathematical models, or as training data for machine learning models….

There’s currently a wealth of research emerging from the health sector, as the data involved are often sensitive. Public Health England have synthesised cancer data which can be freely accessed online. NHS Scotland are making advances in cutting-edge machine learning methods such as Variational Autoencoders and Generative Adversarial Networks (GANs).

There is growing interest in this area of research, and its influence extends beyond the statistical community. While the Data Science Campus have also used GANs to generate synthetic data in their latest research, their power is not limited to data generation. They can be trained to construct features almost identical to our own across imagery, music, speech and text. In fact, a GAN was used to create the portrait Edmond de Belamy, which sold for $432,500 in 2018!
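The adversarial idea is compact enough to sketch: a generator learns to produce samples that a discriminator cannot tell apart from real data. Below is a toy PyTorch GAN on one-dimensional Gaussian data — an illustration of the technique only, not the Data Science Campus’s code, and the architectures and hyperparameters are arbitrary choices:

```python
import torch
import torch.nn as nn

# Real "data": samples from N(4, 1.25) that the generator must imitate.
def real_batch(n):
    return torch.randn(n, 1) * 1.25 + 4.0

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    # Discriminator step: label real samples 1 and generated samples 0.
    real, fake = real_batch(64), G(torch.randn(64, 8)).detach()
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: try to make the discriminator call fakes real.
    loss_g = bce(D(G(torch.randn(64, 8))), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

sample = G(torch.randn(1000, 8))
print(f"generated mean={sample.mean().item():.2f}, "
      f"std={sample.std().item():.2f}")  # should approach 4 and 1.25
```

The same principle, scaled up, is what yields synthetic images, music and survey records.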

Within the ONS, a pilot to create synthetic versions of securely held Labour Force Survey data has been carried out using a package in R called “synthpop”. This synthetic dataset can be shared with approved researchers to debug code, prior to analysis of data held in the Secure Research Service….
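synthpop is an R package, but its core recipe — synthesise one column at a time, each conditional on the columns generated so far — can be sketched in a few lines. The Python version below substitutes a decision tree for synthpop’s default CART models and runs on invented, integer-coded toy data; it is a conceptual sketch of the method, not the ONS pilot:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def synthesise(df, seed=0):
    """Sequential synthesis: draw the first column from its empirical
    distribution, then model each later column on the already-synthesised
    ones and sample categories from the fitted tree's probabilities."""
    rng = np.random.default_rng(seed)
    cols = list(df.columns)
    syn = pd.DataFrame()
    syn[cols[0]] = rng.choice(df[cols[0]].to_numpy(), size=len(df))
    for i, col in enumerate(cols[1:], start=1):
        tree = DecisionTreeClassifier(min_samples_leaf=5, random_state=seed)
        tree.fit(df[cols[:i]], df[col])
        proba = tree.predict_proba(syn[cols[:i]])
        syn[col] = [rng.choice(tree.classes_, p=p) for p in proba]
    return syn

# Invented toy "survey" standing in for the real microdata.
rng = np.random.default_rng(1)
real = pd.DataFrame({"region": rng.integers(0, 4, 1000),
                     "employed": rng.integers(0, 2, 1000)})
fake = synthesise(real)
print(real["employed"].mean(), fake["employed"].mean())  # similar rates
```

Because no synthetic row corresponds to a real respondent, the output can circulate far more freely than the source data.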

Although much progress has been made in this field, one challenge that persists is guaranteeing the accuracy of synthetic data. We must ensure that the statistical properties of synthetic data match the properties of the original data.
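A first-pass utility check compares the two datasets column by column. The sketch below — an illustrative assumption, not an established ONS procedure — uses a two-sample Kolmogorov-Smirnov test for numeric columns and total variation distance between category frequencies otherwise:

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def utility_report(real, synthetic):
    """Column-by-column comparison of marginal distributions."""
    for col in real.columns:
        if pd.api.types.is_numeric_dtype(real[col]):
            stat, p = ks_2samp(real[col], synthetic[col])
            print(f"{col}: KS statistic = {stat:.3f} (p = {p:.3f})")
        else:
            f_r = real[col].value_counts(normalize=True)
            f_s = synthetic[col].value_counts(normalize=True)
            tvd = 0.5 * f_r.subtract(f_s, fill_value=0).abs().sum()
            print(f"{col}: total variation distance = {tvd:.3f}")

# Toy frames: a small KS statistic signals well-matched marginals.
rng = np.random.default_rng(0)
utility_report(pd.DataFrame({"income": rng.normal(30000, 5000, 500)}),
               pd.DataFrame({"income": rng.normal(30000, 5000, 500)}))
```

Marginal checks are necessary but not sufficient: joint relationships between columns need testing too.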

Additional features, such as the presence of non-numerical data, add to this difficult task. For example, if something is listed as “animal” and can take the possible values “dog”, “cat” or “elephant”, it is difficult to convert this information into a format suitable for precise calculations. Furthermore, given that datasets have different characteristics, there is no straightforward solution that can be applied to all types of data….particular focus was also placed on the use of synthetic data in the field of privacy, following from the challenges and opportunities identified by the National Statistician’s Quality Review of privacy and data confidentiality methods published in December 2018….(More)”.

Predictive Big Data Analytics using the UK Biobank Data


Paper by Ivo D Dinov et al: “The UK Biobank is a rich national health resource that provides enormous opportunities for international researchers to examine, model, and analyze census-like multisource healthcare data. The archive presents several challenges related to aggregation and harmonization of complex data elements, feature heterogeneity and salience, and health analytics. Using 7,614 imaging, clinical, and phenotypic features of 9,914 subjects, we performed deep computed phenotyping using unsupervised clustering and derived two distinct sub-cohorts. Using parametric and nonparametric tests, we determined the top 20 most salient features contributing to the cluster separation. Our approach generated decision rules to predict the presence and progression of depression or other mental illnesses by jointly representing and modeling the significant clinical and demographic variables along with the derived salient neuroimaging features. We reported consistency and reliability measures of the derived computed phenotypes and the top salient imaging biomarkers that contributed to the unsupervised clustering. This clinical decision support system identified and utilized holistically the most critical biomarkers for predicting mental health, e.g., depression. External validation of this technique on different populations may lead to reducing healthcare expenses and improving the processes of diagnosis, forecasting, and tracking of normal and pathological aging….(More)”.
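The pipeline the abstract describes — cluster subjects without labels, then rank the features driving the separation — can be outlined schematically. The Python sketch below runs on simulated data with invented dimensions; it shows the shape of the approach, not the authors’ code or the actual UK Biobank variables:

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))   # stand-in for subjects x features
X[:250, :5] += 1.5               # five features separate two sub-cohorts

# Step 1: unsupervised clustering into two sub-cohorts.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))

# Step 2: rank features by how strongly they separate the clusters.
t_stats = [abs(ttest_ind(X[labels == 0, j], X[labels == 1, j]).statistic)
           for j in range(X.shape[1])]
top = np.argsort(t_stats)[::-1][:20]
print("Most salient features:", top[:5], "...")
```

In the simulation, the five planted features should dominate the ranking, mirroring how the study surfaces its top 20 salient biomarkers.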

Statistics Estonia to coordinate data governance


Article by Miriam van der Sangen at CBS: “In 2018, Statistics Estonia launched a new strategy for the period 2018-2022. This strategy addresses the organisation’s aim to produce statistics more quickly while minimising the response burden on both businesses and citizens. Another element in the strategy is addressing the high expectations in Estonian society regarding the use of data. ‘We aim to transform Statistics Estonia into a national data agency,’ says Director General Mägi. ‘This means our role as a producer of official statistics will be enlarged by data governance responsibilities in the public sector. Taking on such responsibilities requires a clear vision of the whole public data ecosystem and also agreement to establish data stewards in most public sector institutions.’…

the Estonian Parliament passed new legislation that effectively expanded the number of official tasks for Statistics Estonia. Mägi elaborates: ‘Most importantly, we shall be responsible for coordinating data governance. The detailed requirements and conditions of data governance will be specified further in the coming period.’ Under the new Act, Statistics Estonia will also have more possibilities to share data with other parties….

Statistics Estonia is fully committed to producing statistics which are based on big data. Mägi explains: ‘At the moment, we are actively working on two big data projects. One project involves the use of smart electricity meters. In this project, we are looking into ways to visualise business and household electricity consumption information. The second project involves web scraping of prices and enterprise characteristics. This project is still in an initial phase, but we can already see that the use of web scraping can improve the efficiency of our production process. We are aiming to extend the web scraping project by also identifying e-commerce and innovation activities of enterprises.’
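As an illustration of the price-scraping idea — the sites and markup Statistics Estonia actually targets are not public, so the URL and CSS selectors below are hypothetical — a minimal collector might look like this:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical retailer page and selectors, for illustration only.
URL = "https://example.com/products"

def scrape_prices(url):
    """Fetch a product listing and pull out (name, price) pairs."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for product in soup.select("div.product"):
        name = product.select_one("span.name").get_text(strip=True)
        raw = product.select_one("span.price").get_text(strip=True)
        price = float(raw.lstrip("€").replace(",", "."))  # "€12,50" -> 12.50
        items.append((name, price))
    return items

for name, price in scrape_prices(URL):
    print(f"{name}: €{price:.2f}")
```

Run daily, such a collector yields high-frequency price series that can feed consumer price statistics far faster than field collection.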

Yet another ambitious goal for Statistics Estonia lies in the field of data science. ‘Similarly to Statistics Netherlands, we established experimental statistics and data mining activities years ago. Last year, we developed a so-called think-tank service, providing insights from data into all aspects of our lives. Think of birth, education, employment, et cetera. Our key clients are the various ministries, municipalities and the private sector. The main aim in the coming years is to speed up service time thanks to visualisations and data lake solutions.’ …(More)”.

Big Data Applications in Governance and Policy


Introduction to Special Issue of Politics and Governance by Sarah Giest and Reuben Ng: “Recent literature has been trying to grasp the extent to which big data applications affect the governance and policymaking of countries and regions (Boyd & Crawford, 2012; Giest, 2017; Höchtl, Parycek, & Schöllhammer, 2015; Poel, Meyer, & Schroeder, 2018). The discussion includes the comparison to e-government and evidence-based policymaking developments that existed long before the idea of big data entered the policy realm. This discussion, however, remains largely theoretical and overlooks some of the more practical consequences that come with the active use of data-driven applications. In fact, much of the work focuses on the input side of policymaking, looking at which data and technology enter the policy process; very little is dedicated to the output side.

In short, how has big data shaped data governance and policymaking? The contributions to this thematic issue shed light on this question by looking at a range of factors, such as campaigning in the US election (Trish, 2018) or local government data projects (Durrant, Barnett, & Rempel, 2018). The goal is to unpack the mixture of big data applications and existing policy processes in order to understand whether these new tools and applications enhance or hinder policymaking….(More)”.

Platform Surveillance


Editorial by David Murakami Wood and Torin Monahan of Special Issue of Surveillance and Society: “This editorial introduces this special responsive issue on “platform surveillance.” We develop the term platform surveillance to account for the manifold and often insidious ways that digital platforms fundamentally transform social practices and relations, recasting them as surveillant exchanges whose coordination must be technologically mediated and therefore made exploitable as data. In the process, digital platforms become dominant social structures in their own right, subordinating other institutions, conjuring or sedimenting social divisions and inequalities, and setting the terms upon which individuals, organizations, and governments interact.

Emergent forms of platform capitalism portend new governmentalities, as they gradually draw existing institutions into alignment or harmonization with the logics of platform surveillance while also engendering subjectivities (e.g., the gig-economy worker) that support those logics. Because surveillance is essential to the operations of digital platforms, and because it structures the forms of governance and capital that emerge, the field of surveillance studies is uniquely positioned to investigate and theorize these phenomena….(More)”.

Responsible Data Governance of Neuroscience Big Data


Paper by B. Tyr Fothergill et al: “Current discussions of the ethical aspects of big data are shaped by concerns regarding the social consequences of both the widespread adoption of machine learning and the ways in which biases in data can be replicated and perpetuated. We instead focus here on the ethical issues arising from the use of big data in international neuroscience collaborations.

Neuroscience innovation relies upon neuroinformatics, large-scale data collection and analysis enabled by novel and emergent technologies. Each step of this work involves aspects of ethics, ranging from concerns for adherence to informed consent or animal protection principles and issues of data re-use at the stage of data collection, to data protection and privacy during data processing and analysis, and issues of attribution and intellectual property at the data-sharing and publication stages.

Significant dilemmas and challenges with far-reaching implications are also inherent, including reconciling the ethical imperative for openness and validation with data protection compliance, and considering future innovation trajectories or the potential for misuse of research results. Furthermore, these issues are subject to local interpretations within different ethical cultures applying diverse legal systems emphasising different aspects. Neuroscience big data require a concerted approach to research across boundaries, wherein ethical aspects are integrated within a transparent, dialogical data governance process. We address this by developing the concept of ‘responsible data governance’, applying the principles of Responsible Research and Innovation (RRI) to the challenges presented by governance of neuroscience big data in the Human Brain Project (HBP)….(More)”.

Know-how: Big Data, AI and the peculiar dignity of tacit knowledge


Essay by Tim Rogan: “Machine learning – a kind of sub-field of artificial intelligence (AI) – is a means of training algorithms to discern empirical relationships within immense reams of data. Run a purpose-built algorithm by a pile of images of moles that might or might not be cancerous. Then show it images of diagnosed melanoma. Using analytical protocols modelled on the neurons of the human brain, in an iterative process of trial and error, the algorithm figures out how to discriminate between cancers and freckles. It can approximate its answers with a specified and steadily increasing degree of certainty, reaching levels of accuracy that surpass human specialists. Similar processes that refine algorithms to recognise or discover patterns in reams of data are now running right across the global economy: medicine, law, tax collection, marketing and research science are among the domains affected. Welcome to the future, say the economist Erik Brynjolfsson and the computer scientist Tom Mitchell: machine learning is about to transform our lives in something like the way that steam engines and then electricity did in the 19th and 20th centuries. 
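The loop Rogan describes — fit on labelled examples, then score new cases with a stated probability — is easy to see in miniature. The sketch below substitutes simulated feature vectors for mole images and a logistic regression for the deep networks real diagnostic systems use; it is a toy illustration, not a medical tool:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Simulated stand-in for labelled images: each "image" is a flattened
# feature vector, each label 1 (melanoma) or 0 (benign freckle).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))
y = (X[:, :8].sum(axis=1) + rng.normal(0, 1, 1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The model reports a probability: the "specified degree of certainty".
probs = clf.predict_proba(X_test)[:, 1]
print(f"held-out accuracy: {((probs > 0.5) == y_test).mean():.2%}")
```

Everything the classifier ‘knows’ is explicit in the training data and the fitted weights — a point worth holding onto for the essay’s later contrast with tacit knowledge.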

Signs of this impending change can still be hard to see. Productivity statistics, for instance, remain worryingly unaffected. This lag is consistent with earlier episodes of the advent of new ‘general purpose technologies’. In past cases, technological innovation took decades to prove transformative. But ideas often move ahead of social and political change. Some of the ways in which machine learning might upend the status quo are already becoming apparent in political economy debates.

The discipline of political economy was created to make sense of a world set spinning by steam-powered and then electric industrialisation. Its central question became how best to regulate economic activity. Centralised control by government or industry, or market freedoms – which optimised outcomes? By the end of the 20th century, the answer seemed, emphatically, to be market-based order. But the advent of machine learning is reopening the state vs market debate. Which of state, firm or market is the better means of coordinating supply and demand? Old answers to that question are coming under new scrutiny. In an eye-catching paper in 2017, the economists Binbin Wang and Xiaoyan Li at Sichuan University in China argued that big data and machine learning give centralised planning a new lease of life. The notion that market coordination of supply and demand encompassed more information than any single intelligence could handle would soon be proved false by 21st-century AI.

How seriously should we take such speculations? Might machine learning bring us full-circle in the history of economic thought, to where measures of economic centralisation and control – condemned long ago as dangerous utopian schemes – return, boasting new levels of efficiency, to constitute a new orthodoxy?

A great deal turns on the status of tacit knowledge….(More)”.

What you don’t know about your health data will make you sick


Jeanette Beebe at Fast Company: “Every time you shuffle through a line at the pharmacy, every time you try to get comfortable in those awkward doctor’s office chairs, every time you scroll through the web while you’re put on hold with a question about your medical bill, take a second to think about the person ahead of you and behind you.

Chances are, at least one of you is being monitored by a third party like data analytics giant Optum, which is owned by UnitedHealth Group, Inc. Since 1993, it’s captured medical data—lab results, diagnoses, prescriptions, and more—from 150 million Americans. That’s almost half of the U.S. population.

“They’re the ones that are tapping the data. They’re in there. I can’t remove them from my own health insurance contracts. So I’m stuck. It’s just part of the system,” says Joel Winston, an attorney who specializes in privacy and data protection law.

Healthcare providers can legally sell their data to a now-dizzyingly vast spread of companies, who can use it to make decisions, from designing new drugs to pricing your insurance rates to developing highly targeted advertising.

It’s written in the fine print: You don’t own your medical records. Well, except if you live in New Hampshire. It’s the only state that mandates its residents own their medical data. In 21 states, the law explicitly says that healthcare providers own these records, not patients. In the rest of the country, it’s up in the air.

Every time you visit a doctor or a pharmacy, your record grows. The details can be colorful: Using sources like Milliman’s IntelliScript and ExamOne’s ScriptCheck, a fuller picture of you emerges. Your interactions with the healthcare system, your medical payments, your prescription drug purchase history. And the market for the data is surging.

Its buyers and sharers—pharma giants, insurers, credit reporting agencies, and other data-hungry companies or “fourth parties” (like Facebook)—say that these massive health data sets can improve healthcare delivery and fuel advances in so-called “precision medicine.”

Still, this glut of health data has raised alarms among privacy advocates, who say many consumers are in the dark about how much of their health-related info is being gathered and mined….

Gardner predicted that traditional health data systems—electronic health records and electronic medical records—are less than ideal, given the “rigidity of the vendors and the products” and the way our data is owned and secured. Don’t count on them being around much longer, she said, “beyond the next few years.”

The future, Gardner suggested, is a system that runs on blockchain, which she defined for the committee as “basically a secure, visible, irrefutable ledger of transactions and ownership.” Still, a recent analysis of over 150 white papers revealed most healthcare blockchain projects “fall somewhere between half-baked and overly optimistic.”
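Gardner’s “irrefutable ledger” can be seen in miniature as a hash chain: each entry commits to the hash of its predecessor, so silently editing a past record breaks every link after it. The Python sketch below illustrates only that linking property; production health-record systems would add signatures, consensus and HIPAA-grade access control:

```python
import hashlib
import json
import time

GENESIS = "0" * 64

def block(record, prev_hash):
    """Append-only entry whose hash covers the previous entry's hash."""
    body = {"record": record, "prev": prev_hash, "ts": time.time()}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return {**body, "hash": digest}

def verify(chain):
    """Recompute every hash and check the links back to the genesis value."""
    prev = GENESIS
    for b in chain:
        body = {"record": b["record"], "prev": b["prev"], "ts": b["ts"]}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if b["prev"] != prev or b["hash"] != digest:
            return False
        prev = b["hash"]
    return True

chain = [block("flu shot administered", GENESIS)]
chain.append(block("annual checkup, BP 120/80", chain[-1]["hash"]))
print(verify(chain))  # True; flips to False if any past record is edited
```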

As larger companies like IBM sign on, the technology may be edging closer to reality. Last year, Proof Work outlined a HIPAA-compliant system that manages patients’ medical histories over time, from acute care in the hospital to preventative checkups. The goal is to give these records to patients on their phones, and to create a “democratized ecosystem” to solve interoperability between patients, healthcare providers, insurance companies, and researchers. Similar proposals from blockchain-focused startups like Health Bank and Humanity.co would help patients store and share their health information securely—and sell it to researchers, too….(More)”.

Nearly Half of Canadian Consumers Willing to Share Significant Personal Data with Banks and Insurers in Exchange for Lower Pricing, Accenture Study Finds


Press Release: “Nearly half of Canadian consumers would be willing to share significant personal information, such as location data and lifestyle information, with their bank and insurer in exchange for lower pricing on products and services, according to a new report from Accenture (NYSE: ACN).

Consumers willing to share personal data in select scenarios. (CNW Group/Accenture)

Accenture’s global Financial Services Consumer Study, based on a survey of 47,000 consumers in 28 countries, including 2,000 Canadians, found that more than half of consumers would share that data for benefits including more-rapid loan approvals, discounts on gym memberships and personalized offers based on current location.

At the same time, however, Canadian consumers believe that privacy is paramount, with nearly three quarters (72 per cent) saying they are very cautious about the privacy of their personal data. In fact, data security breaches were the second-biggest concern for consumers, behind only increasing costs, when asked what would make them leave their bank or insurer.

“Canadian consumers are willing to share their personal data in instances where it makes their lives easier but remain cautious of exactly how their information is being used,” said Robert Vokes, managing director of financial services at Accenture in Canada. “With this in mind, banks and insurers need to deliver hyper-relevant and highly convenient experiences in order to remain relevant, retain trust and win customer loyalty in a digital economy.”

Consumers globally showed strong support for personalized insurance premiums, with 64 per cent interested in receiving adjusted car insurance premiums based on safe driving and 52 per cent interested in life insurance premiums tied to a healthy lifestyle. Four in five consumers (79 per cent) would provide personal data, including income, location and lifestyle habits, to their insurer if they believe it would help reduce the possibility of injury or loss.

In banking, 81 per cent of consumers would be willing to share income, location and lifestyle habit data for rapid loan approval, and 76 per cent would do so to receive personalized offers based on their location, such as discounts from a retailer. Approximately two-fifths (42 per cent) of Canadian consumers specifically want their bank to provide updates on how much money they have based on spending that month, and 46 per cent want savings tips based on their spending habits.

Appetite for data sharing differs around the world

Appetite for sharing significant personal data with financial firms was highest in China, with 67 per cent of consumers there willing to share more data for personalized services. Half (50 per cent) of consumers in the U.S. said they were willing to share more data for personalized services, and in Europe — where the General Data Protection Regulation took effect in May — consumers were more skeptical. For instance, only 40 per cent of consumers in both the U.K. and Germany said they would be willing to share more data with banks and insurers in return for personalized services…(More)”.