Saturday, February 27, 2010

The Evolution of Data Management Technology

1.0 INTRODUCTION

Recent developments in information systems technology have resulted in the computerization of many applications across various business areas. Data has become a critical resource in many organizations. It is therefore important for the data resources in every information system to be organized and structured in a logical manner so that they can be shared, accessed easily, processed efficiently, retrieved quickly, and managed effectively. As a result, there have been several efforts to integrate the various data sources scattered across different sites.

2.0 THE EVOLUTION OF DATA MANAGEMENT TECHNOLOGIES

In the early stages of data management we had the traditional technologies: database design, transaction processing, and benchmarking. Then came database systems based on data models such as the hierarchical, network, relational, multidimensional, and object-oriented models. There are also database systems that span multiple sites or processors, such as distributed and heterogeneous database systems, parallel systems, and systems being migrated. Finally, there are the emerging technologies such as data warehousing and mining, collaboration, and the Internet. For today, we will focus only on data warehousing and data mining.

3.0 TRADITIONAL FILE PROCESSING

In the file system approach, each user defines and implements the files needed for a specific application. Even when two users are interested in each other's data, they keep their details in separate files and need different programs to manipulate them. This leads to wasted space and to redundancy (replication of data), which can cause confusion; sharing data among users becomes difficult, and data inconsistency can occur. The files have no inter-relationships among the data stored in them. In short, in traditional file processing every user defines their own constraints and implements the files needed for their own applications.

Moreover, many programmers create files over a long period, and the files end up in different formats and are processed by programs written in different languages. Information is therefore likely to be duplicated; this redundancy, storing the same data multiple times, leads to higher costs and wasted space. It can also result in data inconsistency, because an update may be applied to some of the files but not all of them. In a traditional file system, any change to the structure of a file affects every program that accesses it, so changing a file's structure may require changing all of those programs. The file processing approach eventually became too cumbersome, costly, and inflexible to supply the information needed. As a result, it was replaced by the database management approach.
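As a rough illustration of the inconsistency problem described above, here is a small Python sketch (the file names and fields are hypothetical, not from any particular system) in which two applications each keep the same customer address in their own flat file; an update applied to only one file leaves the data inconsistent:

    import csv

    # Two applications keep their own flat files, each duplicating the customer's address.
    record = ["C001", "Acme Sdn Bhd", "12 Jalan Lama"]
    for filename in ("billing.csv", "shipping.csv"):
        with open(filename, "w", newline="") as f:
            csv.writer(f).writerows([["cust_id", "name", "address"], record])

    # The billing application updates the address, but the shipping file is untouched.
    with open("billing.csv", newline="") as f:
        rows = list(csv.reader(f))
    rows[1][2] = "88 Jalan Baru"
    with open("billing.csv", "w", newline="") as f:
        csv.writer(f).writerows(rows)

    # The same customer now has two different addresses in two files: data inconsistency.
    for filename in ("billing.csv", "shipping.csv"):
        with open(filename, newline="") as f:
            print(filename, list(csv.reader(f))[1])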


4.0 DATABASE MANAGEMENT SYSTEM (DBMS)

What is a database management system? Technically speaking, it is a software system that uses a standard method of cataloguing, retrieving, and running queries on data. The DBMS manages incoming data, organizes it, and provides ways for the data to be modified or extracted by users or other programs. It allows one or more computer users to create and access data in a database. The DBMS manages user requests (and requests from other programs) so that users and programs do not need to know where the data is physically located on storage media or, in a multi-user system, who else may be accessing the data at the same time. In handling these requests, the DBMS ensures data integrity (making sure the data remains accessible and consistently organized as intended) and security (making sure only those with access privileges can reach the data). More broadly, data management systems include database systems, data warehouses, and data mining systems.

The relational database management system (RDBMS) is the most commonly used kind of DBMS, and its standard user and program interface is the Structured Query Language (SQL). A newer kind of DBMS is the object-oriented database management system (ODBMS). On PCs, Microsoft Access is a popular example of a single-user or small-group DBMS, while Microsoft's SQL Server is an example of a DBMS that serves database requests from multiple (client) users. Other popular DBMS products include MySQL, PostgreSQL, FileMaker, Oracle, dBASE, Clipper, and FoxPro.
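To make the idea of a standard SQL interface concrete, here is a minimal sketch using Python's built-in sqlite3 module (SQLite stands in for a full RDBMS here, and the table and data are invented for illustration):

    import sqlite3

    # SQLite is a small embedded RDBMS; the same SQL would work, with minor
    # dialect differences, on MySQL, PostgreSQL, or SQL Server.
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # The DBMS, not the application, decides how the rows are physically stored.
    cur.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
    cur.executemany("INSERT INTO customer (name, city) VALUES (?, ?)",
                    [("Alice Tan", "Kuala Lumpur"), ("Bob Lee", "Penang")])
    conn.commit()

    # Any user or program retrieves the data through the same SQL interface.
    for row in cur.execute("SELECT name, city FROM customer ORDER BY name"):
        print(row)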


5.0 DATA WAREHOUSE


A data warehouse is a relational database that is designed for query and analysis rather than transaction processing. A data warehouse usually contains historical data that is derived from transaction data. It separates analysis workload from transaction workload and enables a business to consolidate data from several sources. In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.

According to Bill Inmon, the author of several well-known data warehousing books, "a data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management's decision-making process".

-Subject-oriented

Data that gives information about a particular subject instead of about a company's ongoing operations.

-Integrated
Data that is gathered into the data warehouse from a variety of sources merged into a coherent whole.


-Time-variant

All data in the data warehouse is identified with a particular time period.


-Non-volatile

Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business.
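As a rough illustration of the time-variant and non-volatile properties, the following Python sketch (the fact table and its columns are hypothetical) loads warehouse rows that are stamped with the period they describe and are only ever appended, never updated or deleted:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Every row carries the period it describes (time-variant); rows are only
    # ever inserted, never updated or deleted (non-volatile).
    cur.execute("""CREATE TABLE sales_fact (
                       period     TEXT,     -- e.g. '2010-01'
                       product    TEXT,
                       units_sold INTEGER,
                       revenue    REAL)""")
    cur.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                    [("2010-01", "Widget", 120, 2400.0),
                     ("2010-02", "Widget", 150, 3000.0)])
    conn.commit()

    # Each monthly refresh appends a new snapshot; the history stays intact.
    cur.execute("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                ("2010-03", "Widget", 90, 1800.0))
    for row in cur.execute("SELECT * FROM sales_fact ORDER BY period"):
        print(row)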


A complete data warehouse system starts with data acquisition, the process of moving company data from the source systems into the warehouse. It is performed with software products known as ETL (Extract/Transform/Load) tools. Data acquisition is an ongoing, scheduled process that is executed to keep the warehouse current to a pre-determined point in time (e.g. the warehouse is refreshed monthly).

Data cleansing is typically performed in conjunction with data acquisition (it can be part of the "T" in "ETL"). It is a complicated process that validates and, if necessary, corrects the data before it is inserted into the warehouse. For example, the company could have three "Customer Name" entries in its various source systems, one entered as "IBM", one as "I.B.M.", and one as "International Business Machines". Obviously, these are all the same customer. Someone in the organization must decide which form is correct, and the data cleansing tool then changes the others to match that rule. This process is also referred to as "data scrubbing" or "data quality assurance". It can be an extremely complex process, especially if some of the warehouse inputs come from older mainframe file systems (commonly referred to as "flat files" or "sequential files").
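A very simplified sketch of that kind of name standardization in Python (the rule table is hypothetical, and real data quality tools are far more sophisticated) might look like this:

    # Hypothetical cleansing rule chosen by the organization: the standard
    # customer name is "IBM"; known variants are mapped onto it.
    NAME_RULES = {
        "I.B.M.": "IBM",
        "International Business Machines": "IBM",
    }

    def scrub_customer_name(raw_name):
        """Return the standardized customer name before loading it into the warehouse."""
        name = raw_name.strip()
        return NAME_RULES.get(name, name)

    source_rows = ["IBM", "I.B.M.", "International Business Machines", "Acme Corp"]
    print([scrub_customer_name(n) for n in source_rows])
    # -> ['IBM', 'IBM', 'IBM', 'Acme Corp']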

Next comes data aggregation, a process often performed during the "T" phase of ETL, if it is performed at all. Data warehouses can be designed to store data at the detail level (each individual transaction), at some aggregate level (summary data), or a combination of both. The advantage of summarized data is that typical queries against the warehouse run faster. The disadvantage is that information which may be needed to answer a query is lost during aggregation. The tradeoff must be weighed carefully, because the decision cannot be undone without rebuilding and repopulating the warehouse. The safest decision is to build the warehouse with a high level of detail, but the cost in storage can be extreme.

Throughout the entire process of identifying, acquiring, and querying the data, metadata management takes place. Metadata is defined as "data about data". It is stored in a metadata repository and provides extremely useful information to all of the tools mentioned previously. It is required to build the warehouse, since the ETL tool needs to know the metadata attributes of the sources and targets in order to "map" the data properly.
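As a rough sketch of the detail-versus-summary tradeoff, the following SQL (run here through Python's sqlite3, with a hypothetical sales_detail table) keeps each individual transaction and derives a per-period aggregate from it; the aggregate answers typical queries faster, but the individual transactions can no longer be recovered from it:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Detail level: one row per individual transaction.
    cur.execute("CREATE TABLE sales_detail (txn_id INTEGER, period TEXT, amount REAL)")
    cur.executemany("INSERT INTO sales_detail VALUES (?, ?, ?)",
                    [(1, "2010-01", 100.0), (2, "2010-01", 250.0), (3, "2010-02", 80.0)])

    # Aggregate level: one summary row per period, computed during the "T" phase.
    cur.execute("""CREATE TABLE sales_summary AS
                   SELECT period, COUNT(*) AS txn_count, SUM(amount) AS total_amount
                   FROM sales_detail
                   GROUP BY period""")

    for row in cur.execute("SELECT * FROM sales_summary ORDER BY period"):
        print(row)   # smaller and faster to query, but the detail is gone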


Some of the key data warehouse systems, and the most widely used database engines for storing and serving data for enterprise business intelligence and performance management, include:
-Teradata
-SAP BW - Business Information Warehouse
-Oracle
-Microsoft SQL Server
-IBM DB2
-SAS


6.0 DATA MINING

Data mining is defined as the extraction of hidden predictive information from large databases. It is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. Data mining tools predict future trends and behaviors, moving beyond the analyses of past events provided by the retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time-consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.

The technique used to perform these feats in data mining is called modeling. Modeling is simply the act of building a model in one situation where you know the answer and then applying it to another situation where you don't. The computer is loaded with lots of information about a variety of situations where the answer is known, and the data mining software runs through that data and distills the characteristics that should go into the model. Once the model is built, it can be used in similar situations where you don't know the answer.
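As a toy sketch of that idea in Python (the customer attributes and the simple nearest-neighbour rule are invented for illustration, not a description of any real data mining product), a model is built from cases where the answer is known and then applied to a case where it is not:

    # Situations where the answer is known: (age, past_purchases) -> responded to mailing?
    known_cases = [
        ((25, 1), "no"),
        ((40, 8), "yes"),
        ((36, 6), "yes"),
        ((22, 0), "no"),
    ]

    def predict(features):
        """A 1-nearest-neighbour 'model': answer with the label of the most similar known case."""
        def distance(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        _, label = min(known_cases, key=lambda case: distance(case[0], features))
        return label

    # A situation where the answer is unknown: a new customer.
    print(predict((38, 7)))   # -> 'yes', so they look like a promising mailing target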

Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought on-line. When implemented on high performance client/server or parallel processing computers, data mining tools can analyze massive databases to deliver answers to questions such as, "Which clients are most likely to respond to my next promotional mailing, and why?"

The National Basketball Association (NBA) is exploring a data mining application that can be used in conjunction with image recordings of basketball games. The Advanced Scout software analyzes the movements of players to help coaches orchestrate plays and strategies. For example, an analysis of the play-by-play sheet of the game played between the New York Knicks and the Cleveland Cavaliers on January 6, 1995 reveals that when Mark Price played the Guard position, John Williams attempted four jump shots and made each one! Advanced Scout not only finds this pattern, but explains that it is interesting because it differs considerably from the average shooting percentage of 49.30% for the Cavaliers during that game.

By using the NBA universal clock, a coach can automatically bring up the video clips showing each of the jump shots attempted by Williams with Price on the floor, without needing to comb through hours of video footage. Those clips show a very successful pick-and-roll play in which Price draws the Knicks' defense and then finds Williams for an open jump shot.
