MyOOPS開放式課程
來自全球頂尖大學的開放式課程,現在由世界各國的數千名義工志工為您翻譯成中文。請免費享用!
課程來源:MIT
     
15.062 2003春季課程:資料探勘(Data Mining, Spring 2003)

翻譯:宋昭慧
編輯:劉夏泱

 
戴了頂採礦帽的海狸(圖片由Geoffrey Wilson惠予提供)
Beaver wearing a mining hat. (Image by Geoffrey Wilson.)

課程重點

本課程的特點,是使用由授課教師Nitin Patel所設計的XLMiner資料探勘工具,同時也提供大量的課堂講稿與作業。

This course features XLMiner tools, designed by the instructor, Nitin Patel. Extensive lecture notes and assignments are also available.

課程描述

由於各項科技的進展,與管理決策相關的資料正以驚人的速度不斷累積。作為網際網路、電子商務、電子銀行、銷售點設備、條碼閱讀機及智慧型機器等創新的副產品,電子資料的獲取已變得廉價而且普及。

這些資料通常被儲存於資料倉儲和資料超市中,專門為管理決策提供支援。資料探勘是一個快速成長的領域,其主要任務在於發展相關技術,以協助管理者對這些資料庫進行智能化的運用。資料探勘在信用評級、詐欺偵測、資料庫行銷、客戶關係管理及股票市場投資等方面,已有許多成功應用的報導。資料探勘領域已從統計學及人工智能領域演變發展而來。

本課程將檢視源自這兩個領域,並從應用的角度被證實對於模式辨識和預測具有價值的方法。我們將概覽這些應用,並透過簡單易用的軟體及案例,提供親自動手操作資料探勘演算法的機會。

Data that has relevance for managerial decisions is accumulating at an incredible rate due to a host of technological advances. Electronic data capture has become inexpensive and ubiquitous as a by-product of innovations such as the internet, e-commerce, electronic banking, point-of-sale devices, bar-code readers, and intelligent machines. Such data is often stored in data warehouses and data marts specifically intended for management decision support. Data mining is a rapidly growing field that is concerned with developing techniques to assist managers to make intelligent use of these repositories. A number of successful applications have been reported in areas such as credit rating, fraud detection, database marketing, customer relationship management, and stock market investments. The field of data mining has evolved from the disciplines of statistics and artificial intelligence.

This course will examine methods that have emerged from both fields and proven to be of value in recognizing patterns and making predictions from an applications perspective. We will survey applications and provide an opportunity for hands-on experimentation with algorithms for data mining using easy-to-use software and cases.

師資

講師:
Nitin Patel 教授

上課時數

教師授課:
每週3節
每節1小時

程度
大學部 / 研究所
回應
告訴我們您對本課程或「開放式課程網頁」的建議。
聲明
麻省理工學院開放式課程認可開放式課程計畫(OOPS)的翻譯計畫;開放式課程計畫(OOPS)乃是運用其獨立團隊、獨立資源、獨立流程進行翻譯計畫之團隊。

所有麻省理工學院開放式課程之材料皆以麻省理工學院開放式課程創作共享授權發佈,所有之翻譯資料皆由開放式課程計畫(OOPS)所提供,並由其負翻譯品質之責任。

此處麻省理工學院開放式課程之資料乃由開放式課程計畫(OOPS)譯為正體中文。麻省理工學院開放式課程在此聲明,不論是否遭遇或發現相關議題,麻省理工學院開放式課程、麻省理工學院教師、麻省理工學院校方並不對翻譯正確度及完整性作保證。上述單位並對翻譯後之資料不作明示或默許對任一特定目的之適合性之保證、非侵權之保證、或永不出錯之保證。麻省理工學院校方、麻省理工學院開放式課程對翻譯上之不正確不負任何責任。由翻譯所引發任何關於此等資料之不正確或其他瑕疵,皆由開放式課程計畫(OOPS)負全責,而非麻省理工學院開放式課程之責。

原文聲明

 

課程摘要

由於各項科技的進展,與管理決策相關的資料正以驚人的速度不斷累積。作為網際網路、電子商務、電子銀行、銷售點設備、條碼閱讀機及智慧型機器等創新的副產品,電子資料的獲取已變得廉價而且普及。

這些資料通常被儲存於資料倉儲和資料超市中,專門為管理決策提供支援。資料探勘是一個快速成長的領域,其主要任務在於發展相關技術,以協助管理者對這些資料庫進行智能化的運用。資料探勘在信用評級、詐欺偵測、資料庫行銷、客戶關係管理及股票市場投資等方面,已有許多成功應用的報導。資料探勘領域已從統計學及人工智能領域演變發展而來。

本課程將檢視源自這兩個領域,並從應用的角度被證實對於模式辨識和預測具有價值的方法。我們將概覽這些應用,並透過簡單易用的軟體及案例,提供親自動手操作資料探勘演算法的機會。



課程目標

理解當前流行的資料探勘技術之效力與侷限,並能夠辨識資料探勘具前景的商業應用。學生將能積極地管理和參與由顧問或資料探勘專家執行的資料探勘計畫。本課程另一項實用的收穫,是學會在Excel中進行強大的資料分析。

課堂講稿

講稿和指定的家庭作業,將可自課程網頁SloanSpace中取得。同學們須自行下載,以進行課前預習和繳交作業。

補充閱讀資料

以下書籍為課程的補充讀物,課堂中偶爾會建議閱讀其中的章節,以補充課程講稿。

Dewey圖書館保留書區館藏:

Hand,Mannila和 Smyth。《資料探勘原理》。MIT出版社,2001。

電子媒體 :

Berry 和 Linoff。《精通資料探勘》,Wiley,2000。http://library.books24x7.com/book/id_827/toc.asp

Delmater和Hancock。《資料探勘詳解》。Digital Press,2001。http://library.books24x7.com/book/id_2643/toc.asp



軟體

我們將使用XLMiner(一種Excel插件)來完成作業。欲下載免費版本,可至 http://www.xlminer.com

免費版本的功能有限;您將需要由Resampling Stats提供的功能更強的版本,以完成作業和案例。下載網址:http://www.resample.com/xlminer/MIT

需要處理大量資料的研究計畫,將可使用SAS Enterprise Miner軟體。軟體的使用說明將於習題課中提供。



成績評量

您的課程成績將取決於案例寫作、家庭作業、團隊研究計畫及期中考試。這些成分所佔比例如下:

案例寫作及家庭作業(占30%);期中考試(占30%);研究計畫(占40%)

課堂參與情況將由教師主觀評量,並在成績處於及格邊緣時,用於決定最終成績。

Course Summary


Data that has relevance for managerial decisions is accumulating at an incredible rate due to a host of technological advances. Electronic data capture has become inexpensive and ubiquitous as a by-product of innovations such as the internet, e-commerce, electronic banking, point-of-sale devices, bar-code readers, and intelligent machines. Such data is often stored in data warehouses and data marts specifically intended for management decision support. Data mining is a rapidly growing field that is concerned with developing techniques to assist managers to make intelligent use of these repositories. A number of successful applications have been reported in areas such as credit rating, fraud detection, database marketing, customer relationship management, and stock market investments. The field of data mining has evolved from the disciplines of statistics and artificial intelligence.

This course will examine methods that have emerged from both fields and proven to be of value in recognizing patterns and making predictions from an applications perspective. We will survey applications and provide an opportunity for hands-on experimentation with algorithms for data mining using easy-to-use software and cases.



Course Objective

To develop an understanding of the strengths and limitations of popular data mining techniques and to be able to identify promising business applications of data mining. Students will be able to actively manage and participate in data mining projects executed by consultants or specialists in data mining. A useful takeaway from the course will be the ability to perform powerful data analysis in Excel.



Lecture Notes

Lecture notes and homework assignments will be available at the class website in SloanSpace. You will be responsible for downloading them to prepare for class as well as to submit homework.



Supplementary Readings

The following books are available as supplementary materials. Occasionally, readings from these books will be suggested to augment the lecture notes.

On reserve in Dewey library:

Hand, Mannila, and Smyth. Principles of Data Mining. MIT Press, 2001.

Available in electronic media:

Berry and Linoff. Mastering Data Mining. Wiley, 2000. http://library.books24x7.com/book/id_827/toc.asp

Delmater and Hancock. Data Mining Explained. Digital Press, 2001. http://library.books24x7.com/book/id_2643/toc.asp



Software

We will be using XLMiner, an Excel add-in, for homework assignments. To download a free version go to http://www.xlminer.com

The free version is limited. For your homework and case assignments you will need a more powerful version that will be provided by Resampling Stats at http://www.resample.com/xlminer/MIT

SAS Enterprise Miner will be available for projects that require handling large amounts of data. Instructions on using the software will be provided in recitations.



Grading

Your course grade will be based on case write-ups, homework, a team project and a mid-term exam. The weights given to these components are:

Case write-ups and Homework (30%); Mid-term Exam (30%); Project (40%)

Class participation will be subjectively evaluated and will be used in borderline cases to determine the final grade.




        課次       課程單元      
 
  導言
Introduction
 
     
  1       資料探勘概述,以K-最近鄰法進行預測及分類。
Data Mining Overview, Prediction and Classification with k-Nearest Neighbors
     
 
  分類
Classification
 
     
  2       分類及貝氏法則,樸素貝氏分析
Classification and Bayes Rule, Naïve Bayes
     
     
  3       分類樹(指派家庭作業1)
Classification Trees (Homework 1 given out)
     
     
  4       區別分析
Discriminant Analysis
     
     
  5       邏輯迴歸案例:手搖紡織機
Logistic Regression Case: Handlooms
     
     
  6       神經網絡
Neural Nets
     
     
  7       案例:直效行銷/「德國客戶信用評等」案例(家庭作業1繳交期限)(指派家庭作業2)
Cases: Direct Marketing/German Credit (Homework 1 due)(Homework 2 given out)
     
 
  預測
Prediction
 
     
  8       評估預測表現
Assessing Prediction Performance
     
     
  9       迴歸模型中子集的選取
Subset Selection in Regression
     
     
  10       迴歸樹,案例:IBM/GM每週的投資報酬率(家庭作業2繳交期限)
Regression Trees, Case: IBM/GM weekly returns (Homework 2 due)
     
 
  群聚
Clustering
 
     
  11       K-均值分群法,階層分群法
k-Means Clustering, Hierarchical Clustering
     
     
  12       案例:零售行銷規劃
Case: Retail Merchandising
     
     
  13       期中考試
Midterm Exam
     
 
  維數簡約
Dimension Reduction
 
     
  14       主成分分析
Principal Components
     
     
  15       Ira Haimowitz博士的客座演講:資料探勘及Pfizer公司的客戶關係管理
Guest Lecture by Dr. Ira Haimowitz: Data Mining and CRM at Pfizer
     
 
  資料庫方法
Data Base Methods
 
     
  16       關聯規則(購物籃分析)
Association Rules (Market Basket Analysis)
     
     
  17       推薦系統:協同過濾
Recommendation Systems: Collaborative Filtering
     
 
  總結
Wrap Up
 
     
  18       John Elder IV博士(Elder Research)的客座演講:資料探勘的實務
Guest Lecture by Dr. John Elder IV, Elder Research: The Practice of Data Mining
     
     
  19       專題報告
Project Presentations
     

 

課次 課程單元 資料來源
1 資料探勘概述
Data Mining Overview
(英文PDF)
(英文DOC)

用K-最近鄰法做預測和分類
Prediction and Classification with k-Nearest Neighbors

例1:騎乘式割草機
Example 1: Riding Mowers
(英文PDF)
(英文DOC)

表11.1,取自頁584:Johnson, Richard和Dean Wichern。《應用多變量統計分析》第5版。Prentice-Hall,2002。ISBN: 0-13-092553-5。
Table 11.1 from page 584 of: Johnson, Richard, and Dean Wichern. Applied Multivariate Statistical Analysis. 5th ed. Prentice-Hall, 2002. ISBN: 0-13-092553-5.
2 分類及貝氏法則,樸素貝氏分析
Classification and Bayes Rule, Naïve Bayes
(英文PDF)
(英文DOC)
 
3 分類樹
Classification Trees
(英文PDF)
(英文DOC)
〈房屋資料庫(Boston)〉是由美國加州大學Irvine分校電腦與資訊科學學院公佈的公開資料:機器學習資料庫(Machine Learning Repository of Databases)。
"Housing Database (Boston)." Publicly available data at University of California, Irvine School of Information and Computer Science, Machine Learning Repository of Databases.
4 區別分析,例2:Fisher的鳶尾花(Iris)資料
Discriminant Analysis Example 2: Fisher's Iris data
(英文PDF)
(英文DOC)
〈鳶尾花資料庫〉是由美國加州大學Irvine分校電腦與資訊科學學院公佈的公開資料:機器學習資料庫(Machine Learning Repository of Databases)。
"Iris Plant Database." Publicly available data at University of California, Irvine School of Information and Computer Science, Machine Learning Repository of Databases.
5 邏輯迴歸案例
Logistic Regression Case
(英文PDF)
(英文DOC)

手搖紡織機
Handlooms
(英文PDF)
(英文PPT)
 
6 神經網絡
Neural Nets
(英文PDF)
(英文DOC)
 
7 作業討論-見指定作業 問題1
Discussion of homework - see Problem 1 in assignments section
 
8 複迴歸複習
Multiple Regression Review
(英文PDF)
(英文PPT)
 
9 資料探勘中的線性複迴歸模式
Multiple Linear Regression in Data Mining
(英文PDF)
(英文DOC)
 
10 迴歸樹,案例:IBM/GM週投資報酬率
Regression Trees, Case: IBM/GM weekly returns

資料探勘技術的比較
Comparison of Data Mining Techniques
(英文PDF)
(英文DOC)


作業討論-見指定作業問題2
Discussion of homework - see Problem 2 in assignments section
 
11 K-均值分群法,階層分群法
k-Means Clustering, Hierarchical Clustering
(英文PDF)
(英文DOC)
 
12 案例:零售行銷規劃
Case: Retail Merchandising
 
13 期中考試
Midterm Exam
 
14 主成分分析
Principal Components
(英文PDF)
(英文DOC)

例1,成年兒子的頭部測量:Rencher, Alvin。《多變數分析方法》第2版。Wiley-Interscience,2002。表3.7,頁79。ISBN: 0-471-46172-5。
Example 1, Head Measurements of Adult Sons: Rencher, Alvin. Methods of Multivariate Analysis. 2nd ed. Wiley-Interscience, 2002. Table 3.7, p. 79. ISBN: 0-471-46172-5.

例2,酒的特性:〈酒類識別資料庫〉是由美國加州大學Irvine分校電腦與資訊科學學院公佈的公開資料:機器學習資料庫(Machine Learning Repository of Databases)。
Example 2, Characteristics of Wine: "Wine Recognition Database." Publicly available data at University of California, Irvine School of Information and Computer Science, Machine Learning Repository of Databases.

15 Ira Haimowitz博士客座演講:資料探勘及Pfizer公司的客戶關係管理
Guest Lecture by Dr. Ira Haimowitz: Data Mining and CRM at Pfizer
 
16 關聯規則(購物籃分析)
Association Rules (Market Basket Analysis)
(英文PDF)
(英文DOC)
Han, Jiawei和Micheline Kamber。《資料探勘:概念和技術》。Morgan Kaufmann Publishers,2001。例6.1(圖6.2)。ISBN: 1-55860-489-8。
Han, Jiawei, and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2001. Example 6.1 (Figure 6.2). ISBN: 1-55860-489-8.
17 推薦系統:協同過濾
Recommendation Systems: Collaborative Filtering
 
18 John Elder IV博士(Elder Research)的客座演講:資料探勘的實務
Guest Lecture by Dr. John Elder IV, Elder Research: The Practice of Data Mining
 

 
本課程有兩個主要的作業:

作業 1 問題集1

問題1: Charles 讀書俱樂部案例

請研讀此案例並回答案例後的所有相關問題

閱讀資料:

Bhandari, Vinni,和Dr. Nitin Patel.〈Charles讀書俱樂部案例〉

Levin, Nissan和Jacob Zahav。〈資料庫行銷的一個案例研究〉。Tel Aviv大學,直效行銷教育基金會(Direct Marketing Educational Foundation, Inc.),1995年3月。

美國出版商協會。產業統計,2002年



佛羅倫斯的藝術史

一本名為《佛羅倫斯的藝術史》的新書正準備出版。CBC從其客戶資料庫中隨機抽選4,000位客戶,寄發測試性的廣告郵件,接著將客戶的回覆資料與他們過去的購買資料整合,並將資料隨機切割為3個部分:訓練組資料(共1,800位客戶),為用來配適回應模型的初始資料;驗證組資料(共1,400位客戶),為保留資料,用於比較不同回應模型的表現;測試組資料(共800位客戶),僅在最終模型選定後,用於評估模型實際運用時可能達到的準確性。樣本資料在獨立的試算表CBC_4000.xls(XLS)中。試算表內(標題列以外)的每一列(或案例),對應一位市場測試客戶;每一行代表一個變數,其名稱列於標題列。變數名稱和描述如下表所示:

表(一): CBC_4000.xls檔案的變量列表

        變量名稱       描述      
     
  Seq#       資料劃分中的序號
     
     
  ID#       整個(未劃分的)市場測試資料集中的標識號
     
     
  Gender       0=男性,1=女性
     
     
  M       消費金額- 購買書籍的總消費金額
     
     
  R       近期性(Recency)- 距離最後一次購買的月數
     
     
  F       次數 – 總購買次數
     
     
  FirstPurch       距離第一次購買的月數
     
     
  ChildBks       兒童類別圖書的購買數量
     
     
  YouthBks       青少年類別圖書的購買數量
     
     
  CookBks       廚藝類別圖書的購買數量
     
     
  DoItYBks       DIY類別圖書的購買數量
     
     
  RefBks       參考類別圖書(地圖集、百科全書、字典等)的購買數量
     
     
  ArtBks       藝術類別圖書的購買數量
     
     
  GeoBks       地理類別圖書的購買數量
     
     
  ItalCook       《義大利烹調祕訣》一書的購買數量
     
     
  ItalAtlas       《義大利歷史版圖》一書的購買數量
     
     
  ItalArt       《義大利藝術》一書的購買數量
     
     
  Florence       =1 代表已購買《佛羅倫斯的藝術史》一書
=0 則表示未購買此書
     
     
  Related purchase       相關書籍的購買數量
     



問題2: 「德國客戶信用評等」案例
(英文PDF)、 (英文DOC)
請研讀此案例並回答案例後的所有相關問題

「德國客戶信用評等」案例資料集(XLS)



作業2 問題集2

問題1:

區別分析的一個常見應用,是將債券分類到不同的債券評等等級。這些評等旨在反映債券的風險,並影響發行債券公司的借款成本。從年報中擷取的各種財務比率,常被用來協助判定公司的債券評等。

Excel電子資料表BondRatingProb1.xls(XLS)中,包含名為訓練資料(Training data)與驗證資料(Validation data)的兩個表格。這些資料來自從COMPUSTAT財務資料檔案中抽樣出的95家公司。這些公司的債券已依據「Moody債券評等」(1980)分類為從AAA(最安全)到C(風險最高)的7個風險等級。資料包含每家公司的10個財務變數,內容如下:

LOPMAR:營運利潤率的對數
LFIXMAR:稅前固定支出回收率的對數
LTDCAP: 長期債務資本化
LGERRAT:長期債務總額對權益總額比率的對數
LLEVER: 槓桿度的對數
LCASHLTD: 現金流量對長期債務的對數
LACIDRAT: 速動比率的對數
LCURRAT: 流動資產對流動負債的對數
LRECTURN: 應收週轉率的對數
LASSLTD: 淨有形資產對長期負債的對數

以上資料中,81筆觀察值置於訓練組資料表;另外14筆觀察值置於驗證組資料表。債券評等被編碼為欄位標題CODERTG中的數值,例如AAA編碼為1,AA編碼為2,依此類推。請使用XLMiner建構區別分析與神經網絡模型,以分類驗證組資料中的債券評等。你將需要使用評分新資料(score new data)的功能。你所能找到的最佳分類器表現如何?另外,請注意類別變數是有序的(例如:AAA的評等優於AA,而AA又優於A)。某些誤分類錯誤是否會比其他錯誤更嚴重?若是如此,你會建議如何衡量?

問題2:

判斷下列問題的正誤,並用一句話來說明

在線性複迴歸模型中,對於一系列獨立變量而言,調整的 R2永遠低於R2值。

在線性複迴歸模型中,最佳的變量子集就是那些具有較少變量,而具有較高Mallow’s Cp值的子集。

一個不含隱藏層的神經網絡,用p個輸入變量x1, x2 … xp來預測一個連續變量y。此網絡以訓練資料集訓練後,發現其在驗證資料集上的誤差平方和為SSN。另外,以x1, x2 … xp為自變量、y為因變量的線性複迴歸模型,也被配適到相同的驗證資料。此迴歸模型的殘差平方和為SSR,而SSR不會比SSN大。

當反向傳播算法被用於訓練神經網絡時,網絡必定會在誤差函數的全域或局部極小值處停止。

被使用於構造人工神經網絡模型的變量數等於神經網絡的所有節點數。

問題3:

Excel電子資料表RegressionProb3.xls(XLS)包含名為訓練組資料與驗證組資料的兩個表格。我們將使用XLMiner,根據訓練組資料建立兩個模型,並使用驗證組資料比較它們作為預測模型的效能。

模型1:根據訓練組資料,建立使用x1到x9所有變數(以及常數項)的複迴歸模型。我們稱此模型的係數向量為β1。

使用XLMiner中的子集選取選項,僅以訓練組資料選擇另一個模型(模型2)。我們稱此模型的係數向量為β2。

將β1複製到B5至K5儲存格,利用驗證組資料計算模型1誤差的平均值與標準差;同樣複製β2,對模型2進行相同計算。

分別從(i)預測的偏誤(bias),(ii)預測的均方誤差,來比較兩個模型。

問題4:

Excel電子資料表NormalsProb4.xls (XLS) 包含了兩個分群(群0和群1)以及兩個變量(x和y),共1000筆觀察值。

將所有資料點標繪成二維散佈圖,並以不同符號標記群1和群0的點(例如:一個用x,另一個用o),如此你便能清楚視覺化各群的分佈情況。

將資料切分為600筆訓練組與400筆分類組(驗證組)。

比較以下不同演算法的模型效能:
邏輯迴歸
區別分析
神經網絡
k-最近鄰分類

請記得邏輯迴歸與區別分析都屬於線性分類器,亦即它們以一個平面將不同類別的點分開。相對地,神經網路和k-最近鄰分類則允許非線性的分類邊界(你是否對於後兩者如何分類資料點,具有幾何上的直覺?)

針對每一種方法,將最佳的分類器標繪成散佈圖。
每一個散佈圖,需顯示以下系列的點︰
群0被正確分類的點
群0被錯誤分類的點
群1被正確分類的點
群1被錯誤分類的點

資料是模擬產生的,每一個類別的(x,y)值皆服從雙變量常態分佈。最小誤分類的貝氏法則之誤差率為18.5%。你針對各類型所找出的最佳分類器,有多接近這個誤差率?請針對為何某些類型的分類器對這組資料有較佳的表現,給予一個直覺性的說明。

There are two major assignments for this course:


Homework 1 Problem Set 1

Problem 1: The Charles Book Club Case

Read the case and answer all the questions at the end of the case.

Readings:

Bhandari, Vinni, and Dr. Nitin Patel. "The Charles Book Club Case."

Levin, Nissan, and Jacob Zahav. "A Case Study in Database Marketing." Tel Aviv University. Direct Marketing Educational Foundation, Inc., March 1995.

Association of American Publishers. Industry Statistics, 2002.



Art History of Florence

A new title, "The Art History of Florence", is ready for release. CBC has sent a test mailing to a random sample of 4,000 customers from its customer base. The customer responses have been collated with past purchase data. The data has been randomly partitioned into three parts: Training Data (1,800 customers), the initial data used to fit response models; Validation Data (1,400 customers), hold-out data used to compare the performance of different response models; and Test Data (800 customers), data used only after a final model has been selected, to estimate the likely accuracy of the model when it is deployed (a sketch of this partitioning step is given after Table 1). The sample data are in a separate spreadsheet, CBC_4000.xls (XLS). Each row (or case) in the spreadsheet (other than the header) corresponds to one market test customer. Each column is a variable, with the header row giving the name of the variable. The variable names and descriptions are given in Table 1, below:



Table 1: List of Variables in CBC_4000.xls

        VARIABLE NAMES       DESCRIPTION      
     
  Seq#       Sequence number in the partition
     
     
  ID#       Identification number in the full (unpartitioned) market test data set
     
     
  Gender       0=Male, 1=Female
     
     
  M       Monetary- Total money spent on books
     
     
  R       Recency- Months since last purchase
     
     
  F       Frequency - Total number of purchases
     
     
  FirstPurch       Months since first purchase
     
     
  ChildBks       Number of purchases from the category: Child books
     
     
  YouthBks       Number of purchases from the category: Youth books
     
     
  CookBks       Number of purchases from the category: Cookbooks
     
     
  DoItYBks       Number of purchases from the category Do It Yourself books
     
     
  RefBks       Number of purchases from the category: Reference books (Atlases, Encyclopedias, Dictionaries)
     
     
  ArtBks       Number of purchases from the category: Art books
     
     
  GeoBks       Number of purchases from the category: Geography books
     
     
  ItalCook       Number of purchases of book title: "Secrets of Italian Cooking."
     
     
  ItalAtlas       Number of purchases of book title: "Historical Atlas of Italy."
     
     
  ItalArt       Number of purchases of book title: "Italian Art."
     
     
  Florence       =1 if "The Art History of Florence" was bought,
=0 if not
     
     
  Related purchase       Number of related books purchased
     

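The 1,800 / 1,400 / 800 random partition described above is normally produced with XLMiner's data-partition utility inside Excel. Purely as an illustration, the same step can be sketched in a few lines of Python (pandas is assumed to be available; the file name CBC_4000.xls comes from the case, and the random seed is arbitrary):

import pandas as pd

# Load the 4,000 test-mailing customers (reading a legacy .xls file requires the xlrd package).
df = pd.read_excel("CBC_4000.xls")

# Shuffle once, then slice into the three partitions used in the case.
shuffled = df.sample(frac=1, random_state=1).reset_index(drop=True)
training   = shuffled.iloc[:1800]      # used to fit response models
validation = shuffled.iloc[1800:3200]  # hold-out data for comparing models
test       = shuffled.iloc[3200:]      # touched only after the final model is chosen

print(len(training), len(validation), len(test))  # 1800 1400 800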


Problem 2: The German Credit Case
(英文PDF)、 (英文DOC)
Read the case and answer all the questions at the end of the case.

German Credit Case Data (XLS)



Homework 2 Problem Set 2

Problem 1:

A common application of Discriminant Analysis is the classification of bonds into various bond rating classes. These ratings are intended to reflect the risk of the bond and influence the cost of borrowing for companies that issue bonds. Various financial ratios culled from annual reports are often used to help determine a company’s bond rating.

The Excel spreadsheet BondRatingProb1.xls (XLS) contains two sheets named Training data and Validation data. These are data from a sample of 95 companies selected from COMPUSTAT financial data tapes. The company bonds have been classified by Moody’s Bond Ratings (1980) into seven classes of risk ranging from AAA, the safest, to C, the most risky. The data include ten financial variables for each company. These are:

LOPMAR: Logarithm of the operating margin,
LFIXMAR: Logarithm of the pretax fixed charge coverage,
LTDCAP: Long-term debt to capitalization,
LGERRAT: Logarithm of total long-term debt to total equity,
LLEVER: Logarithm of the leverage,
LCASHLTD: Logarithm of the cash flow to long-term debt,
LACIDRAT: Logarithm of the acid test ratio,
LCURRAT: Logarithm of the current assets to current liabilities,
LRECTURN: Logarithm of the receivable turnover,
LASSLTD: Logarithm of the net tangible assets to long-term debt.

The data are divided into 81 observations in the Training data sheet and 14 observations in the Validation data sheet. The bond ratings have been coded into numbers in the column with the title CODERTG, with AAA coded as 1, AA as 2, etc. Use XLMiner to develop Discriminant Analysis and Neural Networks models to classify the bonds in the Validation data sheet. You will need to use the score new data option. What is the performance of the best classifier you have been able to find? Notice that there is an order to the class variable (i.e., AAA is better than AA, which is better than A, etc.). Would certain misclassification errors be worse than others? If so, how would you suggest measuring this?
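The assignment itself is meant to be done with XLMiner's Discriminant Analysis and Neural Network dialogs together with the score-new-data option. As an illustration only, the same fit-then-score workflow might look like this in Python with scikit-learn; the sheet names, the CODERTG column, and the ten predictor names are taken from the problem statement, while the network size and other settings are arbitrary assumptions.

import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier

predictors = ["LOPMAR", "LFIXMAR", "LTDCAP", "LGERRAT", "LLEVER",
              "LCASHLTD", "LACIDRAT", "LCURRAT", "LRECTURN", "LASSLTD"]

train = pd.read_excel("BondRatingProb1.xls", sheet_name="Training data")
valid = pd.read_excel("BondRatingProb1.xls", sheet_name="Validation data")

models = [("Discriminant analysis", LinearDiscriminantAnalysis()),
          ("Neural network", MLPClassifier(hidden_layer_sizes=(5,), max_iter=5000, random_state=1))]

for name, model in models:
    model.fit(train[predictors], train["CODERTG"])   # fit on the 81 training bonds
    pred = model.predict(valid[predictors])          # "score new data" on the 14 validation bonds
    accuracy = (pred == valid["CODERTG"]).mean()
    # Because the ratings are ordered (AAA=1 ... C=7), the mean absolute difference between
    # predicted and true codes is one simple way to penalize distant misclassifications more.
    mean_abs_diff = (pred - valid["CODERTG"]).abs().mean()
    print(name, "accuracy:", round(accuracy, 3), "mean |rating error|:", round(mean_abs_diff, 2))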

Problem 2:

Give true/false answers to the following questions, with one sentence to justify each answer.

The adjusted R2 value for a set of independent variables in multiple linear regression is always less than the value of R2.

The most promising subsets of variables to include in a multiple linear regression model are those that have few variables and have a high value for Mallow’s Cp.

An Artificial Neural Network with no hidden layers is used to predict a continuous variable y using p input variables, x1, x2 … xp. The network is trained on a training dataset and it is found that the sum of squared errors on a validation dataset is SSN. A multiple linear regression model with independent variables x1, x2 … xp and dependent variable y is fitted to the same validation data. The sum of squared residuals for the regression model is SSR. SSR cannot be greater than SSN.

The backprop algorithm when used in training an Artificial Neural Network will always terminate at a global or local minimum of the error function.

The number of variables used in training an Artificial Neural Network is equal to the total number of nodes in the network.
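For the first statement above, the quantity in question is the usual adjusted R2. A minimal numeric sketch in Python, with made-up values for R2, n and p, shows how it is computed from R2:

def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors (standard definition)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical numbers purely for illustration: R^2 = 0.80 with 100 observations and 9 predictors.
print(adjusted_r2(r2=0.80, n=100, p=9))   # 0.78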

Problem 3:

The Excel spreadsheet RegressionProb3.xls (XLS) contains two sheets named Training Data and Validation Data. We will use XLMiner to build two models with the training data and then use the validation data to compare their performance as prediction models.

Fit a multiple regression model, Model1, to the training data using all the variables X1 through X9 (and the constant term). Call the coefficient vector for this model ß1.

Use the subset selection options in XLMiner to choose a second model, Model2, using only the training data. Call the coefficient vector for this model ß2.

Use the Validation Data to compute the mean and the standard deviation of errors for Model1 by copying ß1 into cells B5 through K5. Do the same for Model2 by copying ß2.

Compare the models in terms of (i) bias in the predictions, (ii) mean square error of predictions.
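The regression fits and the subset selection above are done through XLMiner's dialogs; the final comparison is simple arithmetic on the validation errors. A hypothetical sketch in Python of that last step (the column names x1 ... x9 and y, and the subset x1, x3, x7, are assumptions for illustration, not taken from the actual spreadsheet):

import pandas as pd
from sklearn.linear_model import LinearRegression

train = pd.read_excel("RegressionProb3.xls", sheet_name="Training Data")
valid = pd.read_excel("RegressionProb3.xls", sheet_name="Validation Data")

full_vars   = [f"x{i}" for i in range(1, 10)]   # Model 1: x1 ... x9 (intercept fitted automatically)
subset_vars = ["x1", "x3", "x7"]                # Model 2: hypothetical chosen subset

for name, cols in [("Model 1", full_vars), ("Model 2", subset_vars)]:
    model = LinearRegression().fit(train[cols], train["y"])
    errors = valid["y"] - model.predict(valid[cols])
    # (i) bias = mean validation error, (ii) mean squared error of the validation predictions
    print(name, "bias:", round(errors.mean(), 4), "MSE:", round((errors ** 2).mean(), 4))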

Problem 4:

The Excel spreadsheet NormalsProb4.xls (XLS) contains 1000 observations with two groups (Group 0 and Group 1) and two variables (x and y).

Plot all the data points in a 2-dimensional scatter plot. Mark Group 1 points and
Group 0 points differently (e.g., one with a 'x' and the other with 'o') so you can
visualize the distribution of the points of each Group.

Partition the data into training and classification sets with 600 and 400
observations respectively.

Compare the performance of:
Logistic Regression
Discriminant Analysis
Neural Nets
K-Nearest Neighbor Classifiers

Remember that logistic regression and discriminant analysis are linear classifiers
- i.e., they separate points of different classes with a plane. In contrast, neural
networks and k-nearest neighbors allow non-linear classifiers (do you have an
intuitive idea of the geometry of how the latter two classify points?).

For each method, plot a scatter plot for the best classifier. On each plot, display
the following series
Group 0 points that are classified correctly,
Group 0 points that are misclassified,
Group 1 points that are classified correctly,
Group 1 points that are misclassified.

The data were simulated. The (x,y) values for each class follow a bivariate normal
distribution. The Bayes Rule for minimum misclassification has an error rate of
18.5%. How close is the best classifier of each type that you have developed to this
error rate? Give an intuitive explanation of why certain types of classifiers seem
to be better for this data.
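The assignment is carried out with XLMiner on NormalsProb4.xls, but the same steps (a 600/400 partition, a linear and a non-linear classifier, and a plot of correct versus misclassified points) can be prototyped as in the sketch below. The simulated data here are hypothetical stand-ins, not the distribution used to build the course spreadsheet, and only two of the four classifiers are shown.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Hypothetical bivariate-normal groups of 500 points each.
X = np.vstack([rng.multivariate_normal([0, 0], [[1, 0.4], [0.4, 1]], 500),
               rng.multivariate_normal([1.2, 1.2], [[1, -0.4], [-0.4, 1]], 500)])
y = np.repeat([0, 1], 500)

# Random 600 / 400 split into training and classification (validation) sets.
order = rng.permutation(1000)
train, valid = order[:600], order[600:]

for model in (LogisticRegression(), KNeighborsClassifier(n_neighbors=5)):
    model.fit(X[train], y[train])
    pred = model.predict(X[valid])
    correct = pred == y[valid]
    print(type(model).__name__, "validation error rate:", round(1 - correct.mean(), 3))

    # Scatter plot: one series per (group, correct/misclassified) combination.
    for g in (0, 1):
        for ok, marker in ((True, "o"), (False, "x")):
            mask = (y[valid] == g) & (correct == ok)
            plt.scatter(X[valid][mask, 0], X[valid][mask, 1], marker=marker,
                        label=f"Group {g} {'correct' if ok else 'misclassified'}")
    plt.legend()
    plt.title(type(model).__name__)
    plt.show()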


這門課程的期中考試題,可由以下取得
The Midterm Exam for this course is available below.


期中考
(英文PDF)、 (英文DOC)
Midterm Exam
(英文PDF)、 (英文DOC)



XLMiner軟體 (Excel 插件)
XLMiner Software (Excel Add-in)

XLMiner 相關教學
XLMiner Package
XLMiner教學,由Romy Shioda撰寫
(英文PDF)、 (英文DOC)
XLMiner Tutorial by Romy Shioda
(英文PDF)、 (英文DOC)

矩陣數學複習,由Adam Mersereau撰寫
(英文PDF)、 (英文DOC)
Matrix Math Review by Adam Mersereau
(英文PDF)、 (英文DOC)


