Brian Hayes in the American Scientist: “Kim studies parallel algorithms, designed for computers with thousands of processors. Chris builds computer simulations of fluids in motion, such as ocean currents. Dana creates software for visualizing geographic data. These three people have much in common. Computing is an essential part of their professional lives; they all spend time writing, testing, and debugging computer programs. They probably rely on many of the same tools, such as software for editing program text. If you were to look over their shoulders as they worked on their code, you might not be able to tell who was who.
Despite the similarities, however, Kim, Chris, and Dana were trained in different disciplines, and they belong to different intellectual traditions and communities. Kim, the parallel algorithms specialist, is a professor in a university department of computer science. Chris, the fluids modeler, also lives in the academic world, but she is a physicist by training; sometimes she describes herself as a computational scientist (which is not the same thing as a computer scientist). Dana has been programming since junior high school but didn’t study computing in college; at the startup company where he works, his title is software developer.
These factional divisions run deeper than mere specializations. Kim, Chris, and Dana belong to different professional societies, go to different conferences, read different publications; their paths seldom cross. They represent different cultures. The resulting Balkanization of computing seems unwise and unhealthy, a recipe for reinventing wheels and making the same mistake three times over. Calls for unification go back at least 45 years, but the estrangement continues. As a student and admirer of all three fields, I find the standoff deeply frustrating.
Certain areas of computation are going through a period of extraordinary vigor and innovation. Machine learning, data analysis, and programming for the web have all made huge strides. Problems that stumped earlier generations, such as image recognition, finally seem to be yielding to new efforts. The successes have drawn more young people into the field; suddenly, everyone is “learning to code.” I am cheered by (and I cheer for) all these events, but I also want to whisper a question: Will the wave of excitement ever reach other corners of the computing universe?…
What’s the difference between computer science, computational science, and software development?…(More)”
Big Data Now
Radar – O’Reilly: “In the four years we’ve been producing Big Data Now, our wrap-up of important developments in the big data field, we’ve seen tools and applications mature, multiply, and coalesce into new categories. This year’s free wrap-up of Radar coverage is organized around eight themes:
- Cognitive augmentation: As data processing and data analytics become more accessible, jobs that can be automated will go away. But to be clear, there are still many tasks where the combination of humans and machines produces superior results.
- Intelligence matters: Artificial intelligence is now playing a bigger and bigger role in everyone’s lives, from sorting our email to rerouting our morning commutes, from detecting fraud in financial markets to predicting dangerous chemical spills. The computing power and algorithmic building blocks to put AI to work have never been more accessible.
- The convergence of cheap sensors, fast networks, and distributed computation: The amount of quantified data available is increasing exponentially — and aside from tools for centrally handling huge volumes of time-series data as it arrives, devices and software are getting smarter about placing their own data accurately in context, extrapolating without needing to ‘check in’ constantly.
- Reproducing, managing, and maintaining data pipelines: The coordination of processes and personnel within organizations to gather, store, analyze, and make use of data.
- The evolving, maturing marketplace of big data components: Open-source components like Spark, Kafka, Cassandra, and ElasticSearch are reducing the need for companies to build in-house proprietary systems. On the other hand, vendors are developing industry-specific suites and applications optimized for the unique needs and data sources in a field.
- The value of applying techniques from design and social science: While data science knows human behavior in the aggregate, design works in the particular, where A/B testing won’t apply — you only get one shot to communicate your proposal to a CEO, for example. Similarly, social science enables extrapolation from sparse data. Both sets of tools enable you to ask the right questions, and scope your problems and solutions realistically.
- The importance of building a data culture: An organization that is comfortable with gathering data, curious about its significance, and willing to act on its results will perform demonstrably better than one that doesn’t. These priorities must be shared throughout the business.
- The perils of big data: From poor analysis (driven by false correlation or lack of domain expertise) to intrusiveness (privacy invasion, price profiling, self-fulfilling predictions), big data has negative potential.
Download our free snapshot of big data in 2014, and follow the story this year on Radar.”
Computer-based personality judgments are more accurate than those made by humans
Paper by Wu Youyou, Michal Kosinski and David Stillwell at PNAS (Proceedings of the National Academy of Sciences): “Judging others’ personalities is an essential skill in successful social living, as personality is a key driver behind people’s interactions, behaviors, and emotions. Although accurate personality judgments stem from social-cognitive skills, developments in machine learning show that computer models can also make valid judgments. This study compares the accuracy of human and computer-based personality judgments, using a sample of 86,220 volunteers who completed a 100-item personality questionnaire. We show that (i) computer predictions based on a generic digital footprint (Facebook Likes) are more accurate (r = 0.56) than those made by the participants’ Facebook friends using a personality questionnaire (r = 0.49); (ii) computer models show higher interjudge agreement; and (iii) computer personality judgments have higher external validity when predicting life outcomes such as substance use, political attitudes, and physical health; for some outcomes, they even outperform the self-rated personality scores. Computers outpacing humans in personality judgment presents significant opportunities and challenges in the areas of psychological assessment, marketing, and privacy…(More)”.
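The accuracy figures in the abstract are correlation coefficients. As a rough illustration, assuming they are Pearson correlations between a judge’s ratings and the participants’ self-reported scores, the sketch below computes r for two judges. The scores and the names `self_scores`, `computer_scores` and `friend_scores` are invented for illustration and are not data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical trait scores on a 1-5 scale (made up, not from the study).
self_scores = [3.2, 4.1, 2.5, 3.8, 4.6, 2.9]
computer_scores = [3.0, 4.3, 2.9, 3.5, 4.4, 3.3]
friend_scores = [3.9, 3.6, 2.8, 3.1, 4.2, 3.7]

r_computer = pearson_r(self_scores, computer_scores)
r_friend = pearson_r(self_scores, friend_scores)
print(f"computer vs self: r = {r_computer:.2f}")
print(f"friend vs self:   r = {r_friend:.2f}")
```

A higher r means the judge’s ratings track the self-reports more closely; the comparison in the paper is between values of this kind, computed on its own sample.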
Businesses dig for treasure in open data
Lindsay Clark in ComputerWeekly: “Open data, a movement which promises access to vast swaths of information held by public bodies, has started getting its hands dirty, or rather its feet.
Before a spade goes in the ground, construction and civil engineering projects face a great unknown: what is down there? In the UK, should someone discover anything of archaeological importance, a project can be halted – sometimes for months – while researchers study the site and remove artefacts….
During an open innovation day hosted by the Science and Technology Facilities Council (STFC), open data services and technology firm Democrata proposed analytics could predict the likelihood of unearthing an archaeological find in any given location. This would help developers understand the likely risks to construction and would assist archaeologists in targeting digs more accurately. The idea was inspired by a presentation from the Archaeological Data Service in the UK at the event in June 2014.
The proposal won support from the STFC which, together with IBM, provided a nine-strong development team and access to the Hartree Centre’s supercomputer – a 131,000 core high-performance facility. For natural language processing of historic documents, the system uses two components of IBM’s Watson – the AI service which famously won the US TV quiz show Jeopardy. The system uses SPSS modelling software, the language R for algorithm development and Hadoop data repositories….
The proof of concept draws together data from the University of York’s archaeological data, the Department of the Environment, English Heritage, Scottish Natural Heritage, Ordnance Survey, Forestry Commission, Office for National Statistics, the Land Registry and others….The system analyses sets of indicators of archaeology, including historic population dispersal trends, specific geology, flora and fauna considerations, as well as proximity to a water source, a trail or road, standing stones and other archaeological sites. Earlier studies created a list of 45 indicators which was whittled down to seven for the proof of concept. The team used logistic regression to assess the relationship between input variables and come up with its prediction….”
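The excerpt says the team used logistic regression to turn indicators into a prediction. The sketch below shows the general technique on a toy data set; it is not Democrata’s model. The two binary indicators (`near_water`, `near_known_site`), the survey records and the training settings are all hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Estimated probability of a find for one site's indicator values."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit logistic regression with plain batch gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = predict_proba(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        # Step against the average gradient of the log loss.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical survey records: (near_water, near_known_site) -> find (1) or not (0).
rows = [(1, 1)] * 4 + [(1, 0)] * 3 + [(0, 1)] * 3 + [(0, 0)] * 4
labels = [1, 1, 1, 0] + [1, 1, 0] + [1, 0, 0] + [1, 0, 0, 0]

weights, bias = fit_logistic(rows, labels)
p_both = predict_proba(weights, bias, (1, 1))
p_neither = predict_proba(weights, bias, (0, 0))
print(f"P(find | both indicators present) = {p_both:.2f}")
print(f"P(find | neither indicator)       = {p_neither:.2f}")
```

Each fitted weight says how strongly one indicator shifts the odds of a find. A real system would use the seven indicators named in the proof of concept and far more records than this toy set.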
Big Data, Machine Learning, and the Social Sciences: Fairness, Accountability, and Transparency
Medium: “…So why, then, does granular, social data make people uncomfortable? Well, ultimately—and at the risk of stating the obvious—it’s because data of this sort brings up issues regarding ethics, privacy, bias, fairness, and inclusion. In turn, these issues make people uncomfortable because, at least as the popular narrative goes, these are new issues that fall outside the expertise of those aggregating and analyzing big data. But the thing is, these issues aren’t actually new. Sure, they may be new to computer scientists and software engineers, but they’re not new to social scientists.
This is why I think the world of big data and those working in it — ranging from the machine learning researchers developing new analysis tools all the way up to the end-users and decision-makers in government and industry — can learn something from computational social science….
So, if technology companies and government organizations — the biggest players in the big data game — are going to take issues like bias, fairness, and inclusion seriously, they need to hire social scientists — the people with the best training in thinking about important societal issues. Moreover, it’s important that this hiring is done not just in a token, “hire one social scientist for every hundred computer scientists” kind of way, but in a serious, “creating interdisciplinary teams” kind of way.
While preparing for my talk, I read an article by Moritz Hardt, entitled “How Big Data is Unfair.” In this article, Moritz notes that even in supposedly large data sets, there is always proportionally less data available about minorities. Moreover, statistical patterns that hold for the majority may be invalid for a given minority group. He gives, as an example, the task of classifying user names as “real” or “fake.” In one culture — comprising the majority of the training data — real names might be short and common, while in another they might be long and unique. As a result, the classic machine learning objective of “good performance on average,” may actually be detrimental to those in the minority group….
As an alternative, I would advocate prioritizing vital social questions over data availability — an approach more common in the social sciences. Moreover, if we’re prioritizing social questions, perhaps we should take this as an opportunity to prioritize those questions explicitly related to minorities and bias, fairness, and inclusion. Of course, putting questions first — especially questions about minorities, for whom there may not be much available data — means that we’ll need to go beyond standard convenience data sets and general-purpose “hammer” methods. Instead we’ll need to think hard about how best to instrument data aggregation and curation mechanisms that, when combined with precise, targeted models and tools, are capable of elucidating fine-grained, hard-to-see patterns….(More).”
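Hardt’s user-name example can be made concrete with a small calculation. The sketch below assumes a hypothetical classifier that labels a name “real” when it is short, and invented counts in which the majority group has short real names and the minority group has long ones. None of the numbers come from the article.

```python
from collections import defaultdict


def predict_real(name_length: int, max_real_length: int = 12) -> bool:
    """Hypothetical rule learned from majority data: short names are real."""
    return name_length <= max_real_length


# Invented records: (group, name_length, is_real, number_of_users).
records = [
    ("majority", 6, True, 80),
    ("majority", 20, False, 10),
    ("minority", 18, True, 8),
    ("minority", 30, False, 2),
]

correct = defaultdict(int)
total = defaultdict(int)
for group, length, is_real, count in records:
    total[group] += count
    if predict_real(length) == is_real:
        correct[group] += count

overall_accuracy = sum(correct.values()) / sum(total.values())
group_accuracy = {group: correct[group] / total[group] for group in total}

print(f"overall: {overall_accuracy:.0%}")
for group, accuracy in group_accuracy.items():
    print(f"{group}: {accuracy:.0%}")
```

With these made-up counts the rule is right 92% of the time overall, yet only 20% of the time for the minority group. That gap is what an average-only metric hides.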
Big video data could change how we do everything — from catching bad guys to tracking shoppers
Sean Varah at VentureBeat: “Everyone takes pictures and video with their devices. Parents record their kids’ soccer games, companies record employee training, police surveillance cameras at busy intersections run 24/7, and drones monitor pipelines in the desert.
With vast amounts of video growing vaster at a rate faster than the day before, and the hottest devices like drones decreasing in price and size until everyone has one (OK, not in their pocket quite yet) it’s time to start talking about mining this mass of valuable video data for useful purposes.
Julian Mann, the cofounder of Skybox Imaging — a company in the business of commercial satellite imagery and the developer advocate for Google Earth outreach — says that the new “Skybox for Good” program will provide “a constantly updated model of change of the entire planet” with the potential to “save lives, protect the environment, promote education, and positively impact humanity.”…
Mining video data through “man + machine” artificial intelligence is new technology in search of unsolved problems. Could this be the next chapter in the ever-evolving technology revolution?
For the past 50 years, satellite imagery has only been available to the U.S. intelligence community and those countries with technology to launch their own. Digital Globe was one of the first companies to make satellite imagery available commercially, and now Skybox and a few others have joined them. Drones are even newer, having been used by the U.S. military since the ‘90s for surveillance over battlefields or, in this age of counter-terrorism, playing the role of aerial detectives finding bad guys in the middle of nowhere. Before drones, the same tasks required thousands of troops on the ground, putting many young men and women in harm’s way. Today, hundreds of trained “eyes” safely located here in the U.S. watch hours of video from a single drone to assess current situations in countries far away….”
Smarter Than Us: The Rise of Machine Intelligence
Can we instruct AIs to steer the future as we desire? What goals should we program into them? It turns out this question is difficult to answer! Philosophers have tried for thousands of years to define an ideal world, but there remains no consensus. The prospect of goal-driven, smarter-than-human AI gives moral philosophy a new urgency. The future could be filled with joy, art, compassion, and beings living worthwhile and wonderful lives—but only if we’re able to precisely define what a “good” world is, and skilled enough to describe it perfectly to a computer program.
AIs, like computers, will do what we say—which is not necessarily what we mean. Such precision requires encoding the entire system of human values for an AI: explaining them to a mind that is alien to us, defining every ambiguous term, clarifying every edge case. Moreover, our values are fragile: in some cases, if we mis-define a single piece of the puzzle—say, consciousness—we end up with roughly 0% of the value we intended to reap, instead of 99% of the value.
Though an understanding of the problem is only beginning to spread, researchers from fields ranging from philosophy to computer science to economics are working together to conceive and test solutions. Are we up to the challenge?
A mathematician by training, Armstrong is a Research Fellow at the Future of Humanity Institute (FHI) at Oxford University. His research focuses on formal decision theory, the risks and possibilities of AI, the long term potential for intelligent life (and the difficulties of predicting this), and anthropic (self-locating) probability. Armstrong wrote Smarter Than Us at the request of the Machine Intelligence Research Institute, a non-profit organization studying the theoretical underpinnings of artificial superintelligence.”
Code of Conduct: Cyber Crowdsourcing for Good
Patrick Meier at iRevolution: “There is currently no unified code of conduct for digital crowdsourcing efforts in the development, humanitarian or human rights space. As such, we propose the following principles (displayed below) as a way to catalyze a conversation on these issues and to improve and/or expand this Code of Conduct as appropriate.
This initial draft was put together by Kate Chapman, Brooke Simons and myself. The link above points to this open, editable Google Doc. So please feel free to contribute your thoughts by inserting comments where appropriate. Thank you.
An organization that launches a digital crowdsourcing project must:
- Provide clear volunteer guidelines on how to participate in the project so that volunteers are able to contribute meaningfully.
- Test their crowdsourcing platform prior to any project or pilot to ensure that the system will not crash due to obvious bugs.
- Disclose the purpose of the project, exactly which entities will be using and/or have access to the resulting data, to what end exactly, over what period of time and what the expected impact of the project is likely to be.
- Disclose whether volunteer contributions to the project will or may be used as training data in subsequent machine learning research.
- ….
An organization that launches a digital crowdsourcing project should:
- Share as much of the resulting data with volunteers as possible without violating data privacy or the principle of Do No Harm.
- Enable volunteers to opt out of having their tasks contribute to subsequent machine learning research. Provide digital volunteers with the option of having their contributions withheld from subsequent machine learning studies.
- …”
When Experts Are a Waste of Money
Vivek Wadhwa at the Wall Street Journal: “Corporations have always relied on industry analysts, management consultants and in-house gurus for advice on strategy and competitiveness. Since these experts understand the products, markets and industry trends, they also get paid the big bucks.
But what experts do is analyze historical trends, extrapolate forward on a linear basis and protect the status quo — their field of expertise. And technologies are not progressing linearly anymore; they are advancing exponentially. Technology is advancing so rapidly that listening to people who just have domain knowledge and vested interests will put a company on the fastest path to failure. Experts are no longer the right people to turn to; they are a waste of money.
Just as the processing power of our computers doubles every 18 months, with prices falling and devices becoming smaller, fields such as medicine, robotics, artificial intelligence and synthetic biology are seeing accelerated change. Competition now comes from the places you least expect it to. The health-care industry, for example, is about to be disrupted by advances in sensors and artificial intelligence; lodging and transportation, by mobile apps; communications, by Wi-Fi and the Internet; and manufacturing, by robotics and 3-D printing.
To see the competition coming and develop strategies for survival, companies now need armies of people, not experts. The best knowledge comes from employees, customers and outside observers who aren’t constrained by their expertise or personal agendas. It is they who can best identify the new opportunities. The collective insight of large numbers of individuals is superior because of the diversity of ideas and breadth of knowledge that they bring. Companies need to learn from people with different skills and backgrounds — not from those confined to a department.
When used properly, crowdsourcing can be the most effective, least expensive way of solving problems.
Crowdsourcing can be as simple as asking employees to submit ideas via email or via online discussion boards, or it can assemble cross-disciplinary groups to exchange ideas and brainstorm. Internet platforms such as Zoho Connect, IdeaScale and GroupTie can facilitate group ideation by providing the ability to pose questions to a large number of people and having them discuss responses with each other.
Many of the ideas proposed by the crowd as well as the discussions will seem outlandish — especially if anonymity is allowed on discussion forums. And companies will surely hear things they won’t like. But this is exactly the input and out-of-the-box thinking that they need in order to survive and thrive in this era of exponential technologies….
Another way of harnessing the power of the crowd is to hold incentive competitions. These can solve problems, foster innovation and even create industries — just as the first XPRIZE did. Sponsored by the Ansari family, it offered a prize of $10 million to any team that could build a spacecraft capable of carrying three people to 100 kilometers above the earth’s surface, twice within two weeks. It was won by Burt Rutan in 2004, who launched a spacecraft called SpaceShipOne. Twenty-six teams, from seven countries, spent more than $100 million in competing. Since then, more than $1.5 billion has been invested in private space flight by companies such as Virgin Galactic, Armadillo Aerospace and Blue Origin, according to the XPRIZE Foundation….
Competitions needn’t be so grand. InnoCentive and HeroX, a spinoff from the XPRIZE Foundation, for example, allow prizes as small as a few thousand dollars for solving problems. A company or an individual can specify a problem and offer prizes for whoever comes up with the best idea to solve it. InnoCentive has already run thousands of public and inter-company competitions. The solutions they have crowdsourced have ranged from the development of biomarkers for Amyotrophic lateral sclerosis disease to dual-purpose solar lights for African villages….”
Training Students to Extract Value from Big Data
The nation’s ability to make use of data depends heavily on the availability of a workforce that is properly trained and ready to tackle high-need areas. Training students to be capable in exploiting big data requires experience with statistical analysis, machine learning, and computational infrastructure that permits the real problems associated with massive data to be revealed and, ultimately, addressed. Analysis of big data requires cross-disciplinary skills, including the ability to make modeling decisions while balancing trade-offs between optimization and approximation, all while being attentive to useful metrics and system robustness. To develop those skills in students, it is important to identify whom to teach, that is, the educational background, experience, and characteristics of a prospective data-science student; what to teach, that is, the technical and practical content that should be taught to the student; and how to teach, that is, the structure and organization of a data-science program.
Training Students to Extract Value from Big Data summarizes a workshop convened in April 2014 by the National Research Council’s Committee on Applied and Theoretical Statistics to explore how best to train students to use big data. The workshop explored the need for training and curricula and coursework that should be included. One impetus for the workshop was the current fragmented view of what is meant by analysis of big data, data analytics, or data science. New graduate programs are introduced regularly, and they have their own notions of what is meant by those terms and, most important, of what students need to know to be proficient in data-intensive work. This report provides a variety of perspectives about those elements and about their integration into courses and curricula…”