Structured Data vs. Unstructured Data

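Over the past decade, the definition and understanding of what counts as data has changed dramatically, driven by the growing availability of new tools that can read, store, and analyze unstructured data.

In the past, unstructured data was underused because it was difficult to interpret. New technologies now make unstructured data easier to understand and make it possible to draw valuable insights from this trove of data.

According to IDC, by 2024 the total amount of data created, captured, copied, and consumed worldwide will exceed 149 zettabytes per year, and much of it will be unstructured. Every organization stands to benefit from building the capability to analyze unstructured data, and the first step on that journey is understanding what structured and unstructured data are.

The table below briefly summarizes the differences; more detailed explanations follow.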

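Characteristic	Structured data	Unstructured data
Nature of the data	Quantitative	Qualitative
Data model	Predefined. Once the model is defined and some data has been stored, the model is difficult to change.	No specific schema. The data model is highly flexible.
Data formats	A limited number of data formats are available.	A very wide variety of data formats can be used.
Database	SQL-based relational databases are used.	NoSQL databases without a specific schema are used.
Search	Searching for and finding data within a database or dataset is very easy.	Searching for specific data is very difficult because of its unstructured nature.
Analysis	Analysis is very easy because of the quantitative nature of the data.	Analysis is very difficult, even with conventional software tools.
Storage	Data warehouses are used for structured data.	Data lakes are used to store unstructured data.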
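
The "Search" row of the table can be illustrated with a small Python sketch. The records, review texts, and field names below are invented for illustration. With structured rows, a question maps directly onto a named attribute, while free text has to be scanned with pattern matching that only approximates the intent.

```python
import re

# Structured: every record has the same named attributes (hypothetical sample data).
structured_sales = [
    {"customer_id": 1, "item_code": "A100", "amount": 25.0},
    {"customer_id": 1, "item_code": "B200", "amount": 4.5},
    {"customer_id": 2, "item_code": "B200", "amount": 4.5},
]

# An exact question ("how much did customer 1 spend?") is a filter on a column.
total_for_customer_1 = sum(
    row["amount"] for row in structured_sales if row["customer_id"] == 1
)

# Unstructured: free-form review text with no attributes to filter on.
reviews = [
    "Bought the A100 lamp last week. Love it!",
    "The lamp (a100) stopped working after two days...",
    "Notebook paper is thin but fine for the price.",
]

# The best available option here is a text scan; it finds mentions of the item
# code but says nothing about whether the review is positive or negative.
pattern = re.compile(r"\ba100\b", re.IGNORECASE)
mentions = [text for text in reviews if pattern.search(text)]

print(total_for_customer_1)
print(len(mentions))
```

The keyword scan locates the two reviews that mention the item, but judging their sentiment would need further analysis, which is the qualitative aspect the table refers to.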

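What Is Structured Data?

Structured data has a well-defined schema for the information it holds. As a very simple definition, any data that can be presented in a spreadsheet program such as Google Sheets or Microsoft Excel is structured data.

In a spreadsheet, data is represented as rows and columns. Each column represents a different attribute, and each row holds the values of those attributes for a single instance. Together, the rows and columns form a table that is easy to reference.

Different tables can be connected to each other. That is, two tables are said to be related when they share a common column.

When multiple tables are related to one another in this way, the result is a relational database. For instance, the customer, sales, and inventory data of a department store can be considered structured data stored in a relational database.

Each customer has a customer ID, along with fields for name, contact number, credit card information, address, and so on.
The customer table can be connected to the sales table, whose attributes include the time of purchase, the item codes purchased, the total amount spent, and the customer ID. The two tables are connected through the common attribute of customer ID.
Finally, the sales table can be connected to the inventory table through the common attribute of item code, which links all three tables into one relational database.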
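
The department store example above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a production schema: the table names, column names, and sample rows are assumptions made for the example.

```python
import sqlite3

# In-memory database; all table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE inventory (
        item_code   TEXT PRIMARY KEY,
        description TEXT NOT NULL,
        unit_price  REAL NOT NULL
    );
    CREATE TABLE sales (
        sale_id      INTEGER PRIMARY KEY,
        customer_id  INTEGER NOT NULL REFERENCES customers(customer_id),
        item_code    TEXT NOT NULL REFERENCES inventory(item_code),
        purchased_at TEXT NOT NULL
    );
""")

conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Ada"), (2, "Grace")])
conn.executemany("INSERT INTO inventory VALUES (?, ?, ?)",
                 [("A100", "Desk lamp", 25.0), ("B200", "Notebook", 4.5)])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    (1, 1, "A100", "2024-01-05T10:15:00"),
    (2, 1, "B200", "2024-01-05T10:16:00"),
    (3, 2, "B200", "2024-01-06T14:30:00"),
])


def purchases_by_customer(connection):
    """Join the three tables: sales to customers on customer_id,
    and sales to inventory on item_code."""
    query = """
        SELECT c.name, i.description, i.unit_price
        FROM sales AS s
        JOIN customers AS c ON c.customer_id = s.customer_id
        JOIN inventory AS i ON i.item_code = s.item_code
        ORDER BY s.sale_id
    """
    return connection.execute(query).fetchall()


rows = purchases_by_customer(conn)
for name, description, price in rows:
    print(f"{name} bought {description} for {price:.2f}")
```

The sales table carries both shared columns, so it acts as the link between customers and inventory; neither of those two tables needs to know about the other.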

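Structured data like this is generally stored in a relational database management system (RDBMS). Such databases can be written, read, and manipulated using Structured Query Language (SQL), a language developed by IBM in the 1970s to support its mainframe databases. It was originally known as Structured English Query Language (SEQUEL), a name chosen because its statements read much like English; the name was later shortened to SQL. SQL in its current form was popularized by Relational Software, Inc. (now Oracle).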
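
As a rough sketch of what writing, reading, and manipulating data with SQL looks like, the following Python snippet runs the four basic statement types against an in-memory SQLite database. The customers table, its columns, and the sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, phone TEXT)"
)

# Write: INSERT adds new rows.
conn.execute("INSERT INTO customers VALUES (?, ?, ?)", (1, "Ada", "555-0100"))
conn.execute("INSERT INTO customers VALUES (?, ?, ?)", (2, "Grace", "555-0101"))

# Read: SELECT retrieves rows that match a condition.
before = conn.execute(
    "SELECT name, phone FROM customers WHERE customer_id = ?", (1,)
).fetchone()

# Manipulate: UPDATE changes existing rows, DELETE removes them.
conn.execute("UPDATE customers SET phone = ? WHERE customer_id = ?", ("555-0199", 1))
conn.execute("DELETE FROM customers WHERE customer_id = ?", (2,))

after = conn.execute(
    "SELECT customer_id, name, phone FROM customers ORDER BY customer_id"
).fetchall()

print(before)
print(after)
```

The statements read close to English sentences, which reflects the design goal behind the language's original name.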

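What Is Unstructured Data?

Any data that is not structured data can be classified as unstructured data. It is estimated that by 2025, 80% of the data we encounter will be unstructured data in the form of text, audio, images, or video¹.

In short, unstructured data is modern data. It is often:

Born digital and unpredictable
Always being created and on the move
Blended, multimodal, and interoperable
Geo-distributed for better protection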

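Unstructured data can have associated metadata that does have a structure. For example, a video can have metadata such as its resolution, bit rate, frames per second (FPS), and owner, but the video itself is unstructured. When unstructured data carries some structured metadata, it is sometimes referred to as semi-structured data.

Looking more closely at the example of a YouTube video, some metadata is present, such as the time of upload, date of upload, number of views (partial or full), and number of likes and dislikes. The video title, the description, and the content of the video itself, however, are unstructured. They have a qualitative aspect that cannot be captured purely by numbers.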
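
One way to picture the split between structured metadata and unstructured content is the Python sketch below. The class names, field names, and sample values are hypothetical and chosen only to mirror the video example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoMetadata:
    """Structured part: fixed, typed attributes that are easy to filter on."""
    width: int
    height: int
    bit_rate_kbps: int
    fps: float
    owner: str


@dataclass(frozen=True)
class VideoRecord:
    """Semi-structured record: structured metadata plus an opaque payload."""
    metadata: VideoMetadata
    content: bytes  # the encoded video itself; no schema describes what it shows


def is_hd(record: VideoRecord) -> bool:
    # The metadata answers this question without decoding the video at all.
    return record.metadata.height >= 720


videos = [
    VideoRecord(VideoMetadata(1920, 1080, 8000, 30.0, "alice"), b"\x00\x01\x02"),
    VideoRecord(VideoMetadata(640, 360, 800, 24.0, "bob"), b"\x03\x04\x05"),
]

hd_owners = [video.metadata.owner for video in videos if is_hd(video)]
print(hd_owners)
```

Questions about the metadata are simple attribute comparisons, whereas a question such as "what is this video about?" cannot be answered from these fields and requires analyzing the content bytes.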

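The databases most commonly used for unstructured data are NoSQL databases. NoSQL is usually read as "not only SQL," indicating that these databases can handle a wider range of data than SQL databases can. NoSQL databases do not require a fixed schema or table structure; in the simplest case, they hold a collection of data items grouped together.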
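
The idea of a schema-free collection can be sketched in plain Python. This is an in-memory stand-in written for illustration, not the API of any real NoSQL product; the insert and find helpers and the sample documents are assumptions.

```python
import json

# A "collection" is just a list of documents; no schema is declared anywhere.
collection = []


def insert(document):
    """Store a copy of the document, checking only that it is valid JSON data."""
    collection.append(json.loads(json.dumps(document)))


def find(**criteria):
    """Return documents whose fields equal all given criteria.
    Documents that lack a requested field simply do not match."""
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


# Each document can have a different shape.
insert({"type": "review", "author": "ada", "text": "Great lamp, a bit bright."})
insert({"type": "image", "author": "grace", "width": 800, "height": 600,
        "tags": ["lamp", "desk"]})
insert({"type": "review", "author": "grace", "text": "Arrived late.", "rating": 2})

reviews = find(type="review")
by_grace = find(author="grace")

print(len(reviews))
print([doc["type"] for doc in by_grace])
```

Because nothing enforces a common set of fields, adding a new attribute such as rating needs no migration, but every reader of the collection has to cope with documents where that attribute is missing.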

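Unstructured Data Storage and UFFO

Unstructured data can yield important insights with enormous potential for innovation, but there are also many challenges to overcome first. FlashBlade®, Pure Storage's high-performance unified fast file and object (UFFO) storage solution, delivers the speed of flash storage along with the ability to scale any architecture with agility. Want to learn more? Pure Storage offers a free trial that lets you test FlashBlade with no commitment.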