Several ideas for data warehouse data modeling

Author: Eve Cole  Update Time: 2009-12-31 16:49:38

Last week we had a lengthy discussion about the direction of data warehouse modeling, trying to find a methodology suited to the development of Alibaba's data warehouse platform. The two classic schools of data warehouse modeling are dimensional modeling and entity relationship modeling based on subject domains, represented by Kimball and Inmon respectively. Dimensional modeling is driven by data analysis requirements and advocates a bus architecture built on conformed facts and conformed dimensions; the resulting model is easy for users to understand and to query for analysis. Entity relationship modeling based on subject domains is driven by source system data: it integrates all of the enterprise's data, abstracts and consolidates it at the enterprise level, and models it using 3NF entity relationship theory. This method takes a more abstract approach, attempting to build a relatively stable data model that can describe enterprise-level data relationships. In practice, the two methods are often combined, with each applied at a different data layer of the warehouse.

Most of last week's discussion focused on how to integrate data in entity relationship modeling based on subject domains, and covered the following three ideas:

Attribute aggregation: attributes shared by different entities in the same subject domain are aggregated. For example, entities such as members, companies, and customers all carry address attributes, name and identification attributes, and so on. The idea is to consolidate highly cohesive attribute fields into one table, tag each attribute with a type code, and store the result as a vertical table. Its advantages are: first, the model is stable, because when a peripheral system changes its fields we only need to add new types without changing the table structure; second, it greatly reduces redundant historical data. Its disadvantages are: first, much of the entities' attribute identification information is lost, since we can no longer see from the model which address attributes a member has and can only find out by querying the type codes; second, the number of records in the table grows enormously because the data is stored vertically; third, it is hard to use and efficiency is a serious problem, because we often need several fields of one entity at once, which requires many joins and vertical-to-horizontal pivots; fourth, aggregating attributes is itself difficult, because it is a process of abstraction that demands strong business knowledge and abstraction skills from the modeler; fifth, although it reduces the redundancy of recorded historical data, the process of recording history becomes more complex.
Object-oriented modeling: abstract the common attributes of different entities, then use object-oriented ideas such as inheritance and composition to specialize the entities step by step. Its advantage is that the model is conceptually clear; its disadvantages are that the model is relatively unstable and that later data integration still faces the problem of recombining the data.
Source-attached modeling: the model largely preserves the structure of the source systems, and the work focuses on standardizing the data, making it consistent, and clarifying its business meaning. This is similar to how our data warehouse works today. It is relatively easy and quick to implement, and front-end applications can use the data directly; the disadvantages are that the degree of integration is low and the model is unstable.
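
To make the trade-offs of the first idea concrete, here is a minimal sketch in Python that contrasts a wide, source-attached layout with an attribute-aggregated vertical table. The member records, column names, and the helpers to_tall and to_wide are all hypothetical and invented for illustration; they do not describe Alibaba's actual tables.

```python
from collections import defaultdict

# Hypothetical wide (source-attached) rows: one row per entity,
# one column per attribute. All names and values are made up.
wide_members = [
    {"member_id": 1, "name": "Li", "home_address": "Hangzhou", "office_address": "Shanghai"},
    {"member_id": 2, "name": "Wang", "home_address": "Beijing", "office_address": None},
]


def to_tall(rows, key):
    """Attribute aggregation: emit one (entity id, type code, value) row
    for every non-null attribute of every entity."""
    tall = []
    for row in rows:
        for column, value in row.items():
            if column != key and value is not None:
                tall.append((row[key], column, value))
    return tall


def to_wide(tall, key):
    """Vertical-to-horizontal pivot: rebuild one record per entity
    by grouping the tall rows on the entity id."""
    entities = defaultdict(dict)
    for entity_id, type_code, value in tall:
        entities[entity_id][key] = entity_id
        entities[entity_id][type_code] = value
    return [entities[entity_id] for entity_id in sorted(entities)]


tall_members = to_tall(wide_members, "member_id")

# A new source field only adds a new type code; the three-column
# structure of the vertical table does not change.
tall_members.append((2, "billing_address", "Shenzhen"))

rebuilt = to_wide(tall_members, "member_id")
```

In this sketch the two wide rows become several vertical rows, and the new billing_address attribute needs no structural change, which matches the stability argument. The cost also shows up: every read that needs several fields of one member has to go through the pivot, and the structure alone no longer tells us which address attributes a member can have.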
After all, the model serves data analysis applications. The modeling method should be chosen according to the actual characteristics of the business and of the source systems. Alibaba's source systems change quickly, so data analysis must also adapt and respond quickly. Moreover, we have little need for integration across different systems, and deep data integration often makes the data inconvenient for applications to use. Therefore, I personally think the source-attached method is the better solution for now.

This article comes from the CSDN blog. Please indicate the source when reprinting: http://blog.csdn.net/wsbupt/archive/2009/12/30/5109309.aspx