A journalist previously contacted me and asked for my quick thoughts on unstructured data. Here are a few of the points I provided on the opportunities and challenges presented by the increased availability and use of unstructured data. First, definitions:
Structured data is organized and more easily searched using schemas, labels, data types, and other previously defined characteristics.
Unstructured data is data that is not organized in a predetermined manner; it’s not stored in a fixed format. Unstructured data includes audio, video, text files, email, social media posts, images, sensor data, and many other media types.
OPPORTUNITIES
- Exponential growth – A lot of unstructured data exists. International Data Corporation (IDC) estimates that by 2025, 80 percent of generated data will be unstructured, primarily driven by audio, video, and rich media content.
- Improving data quality – There’s undoubtedly business value in leveraging unstructured data, for example, to validate against and enrich existing, structured datasets.
- Expanding our understanding of “context” – Structured data often gives a false impression of meaning. With predefined fields and values, it is easy to assume that the context is clear without carefully considering the other factors that drive context, such as governance, politics, and other limitations that define the populations and domains represented in the data. Working with unstructured data may give teams a unique opportunity to see how it can change the “data story.”
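To make the data quality point above concrete, here is a minimal sketch of validating and enriching a structured record from a free-text note. Everything in it is an assumption made for illustration: the record, the note, the field names, the helper `check_and_enrich`, and the deliberately simple regular expressions. Real pipelines would need far more careful matching and human review.

```python
import re

# Hypothetical structured record and free-text note, invented for illustration.
record = {"customer_id": 17, "email": "pat@example.com", "phone": None}
note = ("Customer called from 555-0142 and asked us to use "
        "pat.lee@example.org going forward.")

# Deliberately simple patterns; they are assumptions, not production-grade parsers.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\b\d{3}-\d{4}\b")


def check_and_enrich(record, note):
    """Return (conflicts, enriched_copy) without modifying the input record."""
    enriched = dict(record)
    conflicts = []

    # Validate: flag any email in the note that disagrees with the stored one.
    for found in EMAIL_RE.findall(note):
        stored = record.get("email")
        if stored and found.lower() != stored.lower():
            conflicts.append(("email", stored, found))

    # Enrich: fill a missing phone number from the note, if one is present.
    phones = PHONE_RE.findall(note)
    if enriched.get("phone") is None and phones:
        enriched["phone"] = phones[0]

    return conflicts, enriched


conflicts, enriched = check_and_enrich(record, note)
print(conflicts)
print(enriched)
```

The sketch reports a conflict rather than overwriting the stored email, since the note alone does not say which value is correct.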
CHALLENGES & DANGERS
- Analysis Without Context: Unstructured data is not collected or stored with AI in mind, and often exists in silos (data islands), so the context of the information collected is not always clear. What does a single tweet mean on the day it was tweeted? What does it mean five years later? 100 years later? What does it mean when you combine it with the tweeter’s financial information? Healthcare data?
- Security Risks: There is increased complexity in securing the growing percentage of sensitive data that exists in less secure unstructured formats. A system may prohibit access to an electronic health record, while allowing access to a physician’s email or voice data; each source may contain sensitive health information.
- Compromised Consent: Merging structured and unstructured data may compromise a person’s consent regarding the use of their data and present other ethical and regulatory challenges. Data is collected with intention and a basis for governance and usage consent. You may have thought that you consented to Twitter using your data to deliver custom content (and ads), but did you consent to a third party accessing or using your data for purposes beyond social communication (for example, to determine the state of your mental health when delivering health, financial, or life insurance services)?
- Misalignment of Intention: There is the potential of losing meaning by transforming unstructured data into a more structured format. For example, in healthcare, converting unstructured data to extract reference codes for billing purposes may eliminate the notes around a patient’s disposition or sentiment. It’s essential to consider and clearly define the intention for collecting the source data. For example, if the data source is an AI system designed to improve patient care and clinical outcomes, then one should carefully transform and use the data so that these sentiments are not lost in translation.
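The loss of meaning described in the last point can be sketched in a few lines. The example below is hypothetical throughout: the note text, the function names `to_billing_row` and `to_care_row`, and the simplified pattern for ICD-10-style codes are all assumptions made for illustration, not a real clinical or billing workflow.

```python
import re

# Hypothetical clinical note, invented for illustration.
note = ("Patient reports feeling anxious about returning to work. "
        "Dx: J45.909, E11.9. Follow up in 2 weeks.")

# Simplified pattern for ICD-10-style codes (an assumption, not a full validator).
CODE_RE = re.compile(r"\b[A-Z]\d{2}(?:\.\d{1,4})?\b")


def to_billing_row(note):
    """Keep only the reference codes, as a billing-focused transform might."""
    return {"codes": CODE_RE.findall(note)}


def to_care_row(note):
    """Keep the codes and the original note, so the narrative context survives."""
    return {"codes": CODE_RE.findall(note), "source_note": note}


billing = to_billing_row(note)
care = to_care_row(note)

print(billing)  # the codes survive, the patient's stated anxiety does not
print(care)     # the narrative is still available alongside the codes
```

Both transforms extract the same codes. The difference is whether the intention behind the transform (billing versus patient care) was considered when deciding what to keep.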
HOMEWORK: CONSIDER THIS …
Using the project brief below, what are the potential opportunities and dangers in the planned use of unstructured data?
Sample Project Brief: The COVID-19 pandemic has significant financial implications for the healthcare business sector. The project goal is to create a tool that enables health organizations to reduce their costs by identifying high-risk/high-cost patients and recommending a cost-effective treatment plan. The project team intends to supplement electronic health information with other unstructured information such as social media posts and loyalty card data.