The Role of Big Data Analytics in Predicting Suicide


Chapter by Ronald C. Kessler et al: “…reviews the long history of using electronic medical records and other types of big data to predict suicide. Although a number of the most recent of these studies used machine learning (ML) methods, these studies were all suboptimal both in the features used as predictors and in the analytic approaches used to develop the prediction models. We review these limitations and describe opportunities for making improvements in future applications.

We also review the controversy among clinical experts about using structured suicide risk assessment tools (be they based on ML or older prediction methods) versus in-depth clinical evaluations of needs for treatment planning. Rather than seeing them as competitors, we propose integrating these different approaches to capitalize on their complementary strengths. We also emphasize the distinction between two types of ML analyses: those aimed at predicting which patients are at highest suicide risk, and those aimed at predicting the treatment options that will be best for individual patients. We explain why both are needed to optimize the value of big data ML methods in addressing the suicide problem….(More)”.

See also How Search Engine Data Enhance the Understanding of Determinants of Suicide in India and Inform Prevention: Observational Study.

The Lancet Countdown: Tracking progress on health and climate change using data from the International Energy Agency (IEA)


Victoria Moody at the UK Data Service: “The 2015 Lancet Commission on Health and Climate Change—which assessed responses to climate change with a view to ensuring the highest attainable standards of health for populations worldwide—concluded that “tackling climate change could be the greatest global health opportunity of the 21st century”. The Commission recommended that more accurate national quantification of the health co-benefits and economic impacts of mitigation decisions was essential in promoting a low-carbon transition.

Building on these foundations, the Lancet Countdown: tracking progress on health and climate change was formed as an independent research collaboration…

The partnership comprises 24 academic institutions from every continent, bringing together individuals with a broad range of expertise across disciplines (including climate scientists, ecologists, mathematicians, geographers, engineers, energy, food, and transport experts, economists, social and political scientists, public health professionals, and physicians).

Four of the indicators developed for Working Group 3 (Mitigation actions and health co-benefits) use International Energy Agency (IEA) data made available by the IEA via the UK Data Service for use by researchers, learners and teaching staff in UK higher and further education. Additionally, two of the indicators developed for Working Group 4 (Finance and economics) use IEA data.

Read our impact case study to find out more about the impact and reach of the Lancet Countdown, watch the YouTube film below, read the Lancet Countdown 2018 Report …(More)”

Urban Computing


Book by Yu Zheng: “…Urban computing brings powerful computational techniques to bear on such urban challenges as pollution, energy consumption, and traffic congestion. Using today’s large-scale computing infrastructure and data gathered from sensing technologies, urban computing combines computer science with urban planning, transportation, environmental science, sociology, and other areas of urban studies, tackling specific problems with concrete methodologies in a data-centric computing framework. This authoritative treatment of urban computing offers an overview of the field, fundamental techniques, advanced models, and novel applications.

Each chapter acts as a tutorial that introduces readers to an important aspect of urban computing, with references to relevant research. The book outlines key concepts, sources of data, and typical applications; describes four paradigms of urban sensing in sensor-centric and human-centric categories; introduces data management for spatial and spatio-temporal data, from basic indexing and retrieval algorithms to cloud computing platforms; and covers beginning and advanced topics in mining knowledge from urban big data, beginning with fundamental data mining algorithms and progressing to advanced machine learning techniques. Urban Computing provides students, researchers, and application developers with an essential handbook to an evolving interdisciplinary field….(More)”

Data Was Supposed to Fix the U.S. Education System. Here’s Why It Hasn’t.


Simon Rodberg at Harvard Business School: “For too long, the American education system failed too many kids, including far too many poor kids and kids of color, without enough public notice or accountability. To combat this, leaders of all political persuasions championed the use of testing to measure progress and drive better results. Measurement has become so common that in school districts from coast to coast you can now find calendars marked “Data Days,” when teachers are expected to spend time not on teaching, but on analyzing data like end-of-year and mid-year exams, interim assessments, science and social studies and teacher-created and computer-adaptive tests, surveys, attendance and behavior notes. It’s been this way for more than 30 years, and it’s time to try a different approach.

The big numbers are necessary, but the more they proliferate, the less value they add. Data-based answers lead to further data-based questions, testing, and analysis; and the psychology of leaders and policymakers means that the hunt for data gets in the way of actual learning. The drive for data responded to a real problem in education, but bad thinking about testing and data use has made the data cure worse than the disease….

The leadership decision at stake is how much data to collect. I’ve heard variations on “In God we trust; all others bring data” at any number of conferences and beginning-of-school-year speeches. But the mantra “we believe in data” is actually only shorthand for “we believe our actions should be informed by the best available data.” In education, that mostly means testing. In other fields, the kind of process is different, but the issue is the same. The key question is not, “will the data be useful?” (of course it can be) or, “will the data be interesting?” (Yes, again.) The proper question for leaders to ask is: will the data help us make better-enough decisions to be worth the cost of getting and using it? So far, the answer is “no.”

Nationwide data suggests that the growth of data-driven schooling hasn’t worked even by its own lights. Harvard professor Daniel Koretz says “The best estimate is that test-based accountability may have produced modest gains in elementary-school mathematics but no appreciable gains in either reading or high-school mathematics — even though reading and mathematics have been its primary focus.”

We wanted data to help us get past the problem of too many students learning too little, but it turns out that data is an insufficient, even misleading answer. It’s possible that all we’ve learned from our hyper-focus on data is that better instruction won’t come from more detailed information, but from changing what people do. That’s what data-driven reform is meant for, of course: convincing teachers of the need to change and focusing where they need to change….(More)”.

The Internet of Bodies: A Convenient—and, Yes, Creepy—New Platform for Data Discovery


David Horrigan at ALM: “In the Era of the Internet of Things, we’ve become (at least somewhat) comfortable with our refrigerators knowing more about us than we know about ourselves and our Apple watches transmitting our every movement. The Internet of Things has even made it into the courtroom in cases such as the hot tub saga of Amazon Echo’s Alexa in State v. Bates and an unfortunate wife’s Fitbit in State v. Dabate.

But the Internet of Bodies?…

“The Internet of Bodies refers to the legal and policy implications of using the human body as a technology platform,” said Northeastern University law professor Andrea Matwyshyn, who also works as co-director of Northeastern’s Center for Law, Innovation, and Creativity (CLIC).

“In brief, the Internet of Things (IoT) is moving onto and inside the human body, becoming the Internet of Bodies (IoB),” Matwyshyn added….


The Internet of Bodies is not merely a theoretical discussion of what might happen in the future. It’s happening already.

Former U.S. Vice President Dick Cheney revealed in 2013 that his physicians ordered the wireless capabilities of his heart implant disabled out of concern for potential assassin hackers, and in 2017, the U.S. Food and Drug Administration recalled almost half a million pacemakers over security issues requiring a firmware update.

It’s not just former vice presidents and heart patients becoming part of the Internet of Bodies. Northeastern’s Matwyshyn notes that so-called “smart pills” with sensors can report back health data from your stomach to smartphones, and a self-tuning brain implant is being tested to treat Alzheimer’s and Parkinson’s.

So, what’s not to like?

Better with Bacon?

“We are attaching everything to the Internet whether we need to or not,” Matwyshyn said, calling it the “Better with Bacon” problem, noting that—as bacon has become a popular condiment in restaurants—chefs are putting it on everything from drinks to cupcakes.

“It’s great if you love bacon, but not if you’re a vegetarian or if you just don’t like bacon. It’s not a bonus,” Matwyshyn added.

Matwyshyn’s bacon analogy raises interesting questions: Do we really need to connect everything to the Internet? Do the data privacy and data protection risks outweigh the benefits?

The Northeastern Law professor divides these IoB devices into three generations: 1) “body external” devices, such as Fitbits and Apple watches, 2) “body internal” devices, including Internet-connected pacemakers, cochlear implants, and digital pills, and 3) “body embedded” devices, hardwired technology where the human brain and external devices meld, where a human body has a real time connection to a remote machine with live updates.

Chip Party for Chipped Employees

A Wisconsin company, Three Square Market, made headlines in 2017—including an appearance on The Today Show—when the company microchipped its employees, not unlike what veterinarians do with the family pet. Not surprisingly, the company touted the benefits of implanting microchips under the skin of employees, including being able to wave one’s hand at a door instead of having to carry a badge or use a password….(More)”.

High-performance medicine: the convergence of human and artificial intelligence


Eric Topol in Nature: “The use of artificial intelligence, and the deep-learning subtype in particular, has been enabled by the use of labeled big data, along with markedly enhanced computing power and cloud storage, across all sectors. In medicine, this is beginning to have an impact at three levels: for clinicians, predominantly via rapid, accurate image interpretation; for health systems, by improving workflow and the potential for reducing medical errors; and for patients, by enabling them to process their own data to promote health. The current limitations, including bias, privacy and security, and lack of transparency, along with the future directions of these applications will be discussed in this article. Over time, marked improvements in accuracy, productivity, and workflow will likely be actualized, but whether that will be used to improve the patient–doctor relationship or facilitate its erosion remains to be seen….(More)”.

The Datafication of Employment


Report by Sam Adler-Bell and Michelle Miller at the Century Foundation: “We live in a surveillance society. Our every preference, inquiry, whim, desire, relationship, and fear can be seen, recorded, and monetized by thousands of prying corporate eyes. Researchers and policymakers are only just beginning to map the contours of this new economy—and reckon with its implications for equity, democracy, freedom, power, and autonomy.

For consumers, the digital age presents a devil’s bargain: in exchange for basically unfettered access to our personal data, massive corporations like Amazon, Google, and Facebook give us unprecedented connectivity, convenience, personalization, and innovation. Scholars have exposed the dangers and illusions of this bargain: the corrosion of personal liberty, the accumulation of monopoly power, the threat of digital redlining, predatory ad-targeting, and the reification of class and racial stratification. But less well understood is the way data—its collection, aggregation, and use—is changing the balance of power in the workplace.

This report offers some preliminary research and observations on what we call the “datafication of employment.” Our thesis is that data-mining techniques innovated in the consumer realm have moved into the workplace. Firms who’ve made a fortune selling and speculating on data acquired from consumers in the digital economy are now increasingly doing the same with data generated by workers. Not only does this corporate surveillance enable a pernicious form of rent-seeking—in which companies generate huge profits by packaging and selling worker data in a marketplace hidden from workers’ eyes—but also, it opens the door to an extreme informational asymmetry in the workplace that threatens to give employers nearly total control over every aspect of employment.

The report begins with an explanation of how a regime of ubiquitous consumer surveillance came about, and how it morphed into worker surveillance and the datafication of employment. The report then offers principles for action for policymakers and advocates seeking to respond to the harmful effects of this new surveillance economy. The final section concludes with a look forward at where the surveillance economy is going, and how researchers, labor organizers, and privacy advocates should prepare for this changing landscape….(More)”

Innovations In The Fight Against Corruption In Latin America


Blog Post by Beth Noveck: “…The Inter-American Development Bank (IADB) has published an important, practical and prescriptive report with recommendations for every sector of society from government to individuals on innovative and effective approaches to combatting corruption. While focused on Latin America, the report’s proposals, especially those on the application of new technology in the fight against corruption, are relevant around the world….

The recommendations about the use of new technologies, including big data, blockchain and collective intelligence, are drawn from an effort undertaken last year by the Governance Lab at New York University’s Tandon School of Engineering to crowdsource such solutions and advice on how to implement them from a hundred global experts. (See the Smarter Crowdsourcing against Corruption report here.)…

Big data, when published as open data, namely in a form that can be re-used without legal or technical restriction and in a machine-readable format that computers can analyze, is another tool in the fight against corruption. With machine readable, big and open data, those outside of government can pinpoint and measure irregularities in government contracting, as Instituto Observ is doing in Brazil.

Opening up judicial data, such as information about case processing times, judges’ and prosecutors’ salaries, and information about selection processes (CVs, professional and academic backgrounds, and written and oral exam scores), provides activists and reformers with the tools to fight judicial corruption. The Civil Association for Equality and Justice (ACIJ) (a non-profit advocacy group) in Argentina uses such open justice data in its Concursos Transparentes (Transparent Contests) to fight judicial corruption. Jusbrasil is a private open justice company also using open data to reform the courts in Brazil….(More)”

On the privacy-conscientious use of mobile phone data


Yves-Alexandre de Montjoye et al in Nature: “The breadcrumbs we leave behind when using our mobile phones—who somebody calls, for how long, and from where—contain unprecedented insights about us and our societies. Researchers have compared the recent availability of large-scale behavioral datasets, such as the ones generated by mobile phones, to the invention of the microscope, giving rise to the new field of computational social science.

With mobile phone penetration rates reaching 90% and under-resourced national statistical agencies, the data generated by our phones—traditional Call Detail Records (CDR) but also high-frequency x-Detail Record (xDR)—have the potential to become a primary data source to tackle crucial humanitarian questions in low- and middle-income countries. For instance, they have already been used to monitor population displacement after disasters, to provide real-time traffic information, and to improve our understanding of the dynamics of infectious diseases. These data are also used by governmental and industry practitioners in high-income countries.

While there is little doubt on the potential of mobile phone data for good, these data contain intimate details of our lives: rich information about our whereabouts, social life, preferences, and potentially even finances. A BCG study showed, e.g., that 60% of Americans consider location data and phone number history—both available in mobile phone data—as “private”.

Historically and legally, the balance between the societal value of statistical data (in aggregate) and the protection of privacy of individuals has been achieved through data anonymization. While hundreds of different anonymization algorithms exist, most of them are variations and improvements of the seminal k-anonymity algorithm introduced in 1998. Recent studies have, however, shown that pseudonymization and standard de-identification are not sufficient to prevent users from being re-identified in mobile phone data. Four data points—approximate places and times where an individual was present—have been shown to be enough to uniquely re-identify them 95% of the time in a mobile phone dataset of 1.5 million people. Furthermore, re-identification estimations using unicity—a metric to evaluate the risk of re-identification in large-scale datasets—and attempts at k-anonymizing mobile phone data ruled out de-identification as sufficient to truly anonymize the data. This was echoed in the recent report of the [US] President’s Council of Advisors on Science and Technology on Big Data Privacy which considers de-identification to be useful as an “added safeguard, but [emphasized that] it is not robust against near-term future re-identification methods”.

The limits of the historical de-identification framework to adequately balance risks and benefits in the use of mobile phone data are a major hindrance to their use by researchers, development practitioners, humanitarian workers, and companies. This became particularly clear at the height of the Ebola crisis, when qualified researchers (including some of us) were prevented from accessing relevant mobile phone data on time despite efforts by mobile phone operators, the GSMA, and UN agencies, with privacy being cited as one of the main concerns.

These privacy concerns are, in our opinion, due to the failures of the traditional de-identification model and the lack of a modern and agreed upon framework for the privacy-conscientious use of mobile phone data by third-parties, especially in the context of the EU General Data Protection Regulation (GDPR). Such frameworks have been developed for the anonymous use of other sensitive data such as census, household survey, and tax data. The positive societal impact of making these data accessible and the technical means available to protect people’s identity have been considered and a trade-off, albeit far from perfect, has been agreed on and implemented. This has allowed the data to be used in aggregate for the benefit of society. Such thinking and an agreed upon set of models have been missing so far for mobile phone data. This has left data protection authorities, mobile phone operators, and data users with little guidance on technically sound yet reasonable models for the privacy-conscientious use of mobile phone data. This has often resulted in suboptimal tradeoffs, if any.

In this paper, we propose four models for the privacy-conscientious use of mobile phone data (Fig. 1). All of these models 1) focus on a use of mobile phone data in which only statistical, aggregate information is ultimately needed by a third-party and, while this needs to be confirmed on a per-country basis, 2) are designed to fall under the legal umbrella of “anonymous use of the data”. Examples of cases in which only statistical aggregated information is ultimately needed by the third-party are discussed below. They would include, e.g., disaster management, mobility analysis, or the training of AI algorithms in which only aggregate information on people’s mobility is ultimately needed by agencies, and exclude cases in which individual-level identifiable information is needed such as targeted advertising or loans based on behavioral data.

Figure 1: Matrix of the four models for the privacy-conscientious use of mobile phone data.

First, it is important to insist that none of these models is a silver bullet…(More)”.
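The unicity metric mentioned in the excerpt above can be illustrated with a small sketch. It is not the authors' implementation: the function name `estimate_unicity`, the toy traces, and the choice to represent a data point as a (coarse location, hour) pair are all assumptions made for illustration. The idea is to draw a few points from each user's own trace and check whether any other user's trace also contains all of them.

```python
import random
from typing import Dict, Set, Tuple

# Assumption: a data point is a (coarse location id, hour bucket) pair.
Point = Tuple[str, int]


def estimate_unicity(traces: Dict[str, Set[Point]], p: int, seed: int = 0) -> float:
    """Fraction of users whose trace is the only one containing p points
    drawn at random from that same trace (hypothetical helper)."""
    rng = random.Random(seed)
    # Only users with at least p points can be tested.
    eligible = [user for user, points in traces.items() if len(points) >= p]
    if not eligible:
        return 0.0
    unique = 0
    for user in eligible:
        # Sort first so the draw is reproducible for a given seed.
        sample = set(rng.sample(sorted(traces[user]), p))
        # A trace "matches" when it contains every sampled point.
        matches = [u for u, points in traces.items() if sample <= points]
        if len(matches) == 1:  # only the user themselves
            unique += 1
    return unique / len(eligible)


# Toy data, invented for this sketch.
toy_traces = {
    "a": {("cell1", 8), ("cell2", 9), ("cell3", 18)},
    "b": {("cell1", 8), ("cell2", 9), ("cell4", 18)},
    "c": {("cell5", 7), ("cell6", 12), ("cell7", 20)},
}
```

In this toy data, users "a" and "b" share two of their three points, so a single known point often fails to separate them, while three points always do. On real data the estimate would depend heavily on how coarsely places and times are bucketed.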

Towards matching user mobility traces in large-scale datasets


Paper by Daniel Kondor, Behrooz Hashemian, Yves-Alexandre de Montjoye and Carlo Ratti: “The problem of unicity and reidentifiability of records in large-scale databases has been studied in different contexts and approaches, with focus on preserving privacy or matching records from different data sources. With an increasing number of service providers nowadays routinely collecting location traces of their users on unprecedented scales, there is a pronounced interest in the possibility of matching records and datasets based on spatial trajectories. Extending previous work on reidentifiability of spatial data and trajectory matching, we present the first large-scale analysis of user matchability in real mobility datasets on realistic scales, i.e. among two datasets that consist of several million people’s mobility traces, coming from a mobile network operator and transportation smart card usage. We extract the relevant statistical properties which influence the matching process and analyze their impact on the matchability of users. We show that for individuals with typical activity in the transportation system (those making 3-4 trips per day on average), a matching algorithm based on the co-occurrence of their activities is expected to achieve a 16.8% success only after a one-week long observation of their mobility traces, and over 55% after four weeks. We show that the main determinant of matchability is the expected number of co-occurring records in the two datasets. Finally, we discuss different scenarios in terms of data collection frequency and give estimates of matchability over time. We show that with higher frequency data collection becoming more common, we can expect much higher success rates in even shorter intervals….(More)”.
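The excerpt above does not spell out the matching algorithm, so the following is only a simplified sketch of what matching on co-occurrence could look like. The function name `match_by_cooccurrence`, the toy records, and the rule that two records co-occur when they share exactly the same (zone, time bucket) pair are assumptions; a real analysis would need spatial and temporal tolerances.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Assumption: a record is a (zone id, time bucket) pair in both datasets.
Record = Tuple[str, int]


def match_by_cooccurrence(
    phone: Dict[str, List[Record]],
    card: Dict[str, List[Record]],
    min_count: int = 1,
) -> Dict[str, Optional[str]]:
    """For each phone user, pick the card user sharing the most records.
    Returns None when there is no candidate, a tie, or too few co-occurrences."""
    # Index card records so each phone record can be looked up directly.
    index: Dict[Record, List[str]] = {}
    for card_id, records in card.items():
        for record in set(records):
            index.setdefault(record, []).append(card_id)

    result: Dict[str, Optional[str]] = {}
    for phone_id, records in phone.items():
        counts: Counter = Counter()
        for record in set(records):
            for card_id in index.get(record, []):
                counts[card_id] += 1
        if not counts:
            result[phone_id] = None
            continue
        ranked = counts.most_common(2)
        best_id, best_n = ranked[0]
        tied = len(ranked) > 1 and ranked[1][1] == best_n
        result[phone_id] = best_id if best_n >= min_count and not tied else None
    return result


# Toy data, invented for this sketch.
phone_traces = {
    "p1": [("A", 1), ("B", 2), ("C", 3)],
    "p2": [("A", 1), ("D", 5)],
    "p3": [("Q", 0)],
}
card_traces = {
    "c1": [("A", 1), ("B", 2)],
    "c2": [("A", 1), ("D", 5)],
    "c3": [("Z", 9)],
}
matches = match_by_cooccurrence(phone_traces, card_traces)
```

The sketch reflects the excerpt's point that matchability is driven by the number of co-occurring records: "p1" and "p2" both share one record with the wrong card user, and it is the second shared record that separates the right candidate from the wrong one.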