What is Data Science?
Data science is the field of applying advanced analytics techniques and scientific principles to extract valuable information from data for business decision-making, strategic planning, and other uses. It’s increasingly critical to businesses: The insights that data science generates help organizations increase operational efficiency, identify new business opportunities and improve marketing and sales programs, among other benefits. Ultimately, they can lead to competitive advantages over business rivals.
Data science incorporates various disciplines, including data engineering, data preparation, data mining, predictive analytics, machine learning, and data visualization, as well as statistics, mathematics, and software programming. It’s primarily done by skilled data scientists, although lower-level data analysts may also be involved. In addition, many organizations now rely partly on citizen data scientists, a group that can include business intelligence (BI) professionals, business analysts, data-savvy business users, data engineers, and other workers who don’t have a formal data science background.
Why is Data Science Important?
Science is based on gathering evidence and interpreting that evidence to draw logical conclusions. This principle has served civilization well enough to enable trans-Atlantic flights, telephony, disease treatments, landing rovers on the surface of Mars, and much more. In the modern world, a proliferation of data is being gathered: data about lifestyle habits, dietary preferences, music choices, purchasing habits, energy consumption, weather systems, migratory patterns, seismic activity, flight times, and much more.
That’s more information about the world around us than we’ve ever had access to, and it’s spread across a wider sample set than ever. Analyzing large data sets can lead to surprising revelations. Sometimes patterns and correlations are found in unexpected places, or in places where they had previously only been theorized. Observing and analyzing the environment is important for humans to learn, grow, and become better informed. A lot of data science is applied to frivolous pursuits, and sometimes to ethically questionable ones, but there is just as much analysis happening around worthwhile, healthy, and helpful causes that open source should be proud to support.
And it turns out that open-source software is vital to the growth and development of data science.
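To make the idea of finding a correlation in data a little more concrete, here is a minimal sketch in Python using only the standard library. It computes the Pearson correlation coefficient, one common measure of how strongly two series move together. The function name and the temperature and sales figures are hypothetical, made up purely for illustration, and real analyses would normally rely on far larger data sets and dedicated libraries.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the mean (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a sequence is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical, made-up observations (not real measurements).
temperature_c = [10, 15, 20, 25, 30]
ice_cream_sales = [20, 35, 48, 66, 80]

r = pearson_correlation(temperature_c, ice_cream_sales)
print(f"correlation between temperature and sales: {r:.3f}")
```

A value near 1 or -1 suggests a strong linear relationship, while a value near 0 suggests little or none. A strong correlation does not by itself show that one quantity causes the other.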
Data science plays an important role in virtually all aspects of business operations and strategies. For example, it provides information about customers that helps companies create stronger marketing campaigns and targeted advertising to increase product sales. It aids in managing financial risks, detecting fraudulent transactions, and preventing equipment breakdowns in manufacturing plants and other industrial settings. It helps block cyber-attacks and other security threats in IT systems.
From an operational standpoint, data science initiatives can optimize the management of supply chains, product inventories, distribution networks, and customer service. On a more fundamental level, they point the way to increased efficiency and reduced costs. Data science also enables companies to create business plans and strategies that are based on informed analysis of customer behavior, market trends, and competition. Without it, businesses may miss opportunities and make flawed decisions.
Data science is also vital in areas beyond regular business operations. In healthcare, its uses include diagnosis of medical conditions, image analysis, treatment planning, and medical research. Academic institutions use data science to monitor student performance and improve their marketing to prospective students. Sports teams analyze player performance and plan game strategies via data science. Government agencies and public policy organizations are also big users.
Basis of Data Science
Data science is an interdisciplinary field focused on extracting knowledge from data sets, which are typically large. The field encompasses preparing data for analysis, analyzing it, and presenting findings to inform high-level decisions in an organization. As such, it incorporates skills from computer science, mathematics, statistics, information visualization, graphic design, complex systems, communication, and business. Statistician Nathan Yau, drawing on Ben Fry, also links data science to human-computer interaction: users should be able to intuitively control and explore data. In 2015, the American Statistical Association identified database management, statistics and machine learning, and distributed and parallel systems as the three emerging foundational professional communities.
Because of the vast amount of data that data science analyzes, the field requires a solid computing infrastructure. The datasets involved in serious data science are often too large to process on a single machine or even a small cluster, so hybrid clouds are used to store and process information and to make correlations among what’s been parsed. This means that a data scientist’s toolbox includes a platform like OpenShift for running processing services, distributed computing software like Apache Hadoop or Apache Spark, a distributed file system like Ceph or Gluster for scalable and highly available storage, and so on. A data scientist’s job is as much about statistics and math as it is about programming and computer engineering.
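The split, process, and combine pattern behind distributed computing software can be sketched on a single machine. The Python example below is only an illustration under stated assumptions: it does not use the Hadoop or Spark APIs, the function names are hypothetical, and a thread pool stands in for the worker nodes of a real cluster. It breaks a small data set into chunks, counts words in each chunk independently (the "map" step), and merges the partial counts (the "reduce" step).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def count_words(chunk):
    """Map step: count the words in one chunk of text lines."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def chunked_word_count(lines, chunk_size=2, workers=2):
    """Count words by processing chunks independently, then merging results."""
    chunks = split_into_chunks(lines, chunk_size)
    # Each chunk is handled by a worker thread; on a real cluster the
    # chunks would instead be sent to separate machines.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    # Reduce step: merge the per-chunk counts into one total.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


# Hypothetical sample data, made up for illustration.
sample_lines = [
    "data science uses data",
    "science relies on evidence",
    "evidence comes from data",
]
print(chunked_word_count(sample_lines).most_common(3))
```

Because each chunk is processed without reference to the others, the same structure scales from threads on one machine to many machines, which is the property that frameworks such as Hadoop and Spark exploit.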