Translated by: 宋昭慧
Edited by: 劉夏泱
Course Summary
Data that has relevance for managerial decisions is accumulating at an incredible rate due to a host of technological advances. Electronic data capture has become inexpensive and ubiquitous as a by-product of innovations such as the internet, e-commerce, electronic banking, point-of-sale devices, bar-code readers, and intelligent machines. Such data is often stored in data warehouses and data marts specifically intended for management decision support. Data mining is a rapidly growing field that is concerned with developing techniques to assist managers to make intelligent use of these repositories. A number of successful applications have been reported in areas such as credit rating, fraud detection, database marketing, customer relationship management, and stock market investments. The field of data mining has evolved from the disciplines of statistics and artificial intelligence.
This course will examine methods that have emerged from both fields and have proven to be of value in recognizing patterns and making predictions from an applications perspective. We will survey applications and provide an opportunity for hands-on experimentation with algorithms for data mining using easy-to-use software and cases.
The course aims to develop an understanding of the strengths and limitations of popular data mining techniques and the ability to identify promising business applications of data mining. Students will be able to actively manage and participate in data mining projects executed by consultants or specialists in data mining. A useful takeaway from the course will be the ability to perform powerful data analysis in Excel.
Lecture notes and homework assignments will be available on the class website in SloanSpace. You will be responsible for downloading them to prepare for class and for submitting your homework there.
The following books are available as supplementary materials. Occasionally, readings from these books will be suggested to augment the lecture notes.
On reserve in Dewey library:
Hand, Mannila, and Smyth. Principles of Data Mining. MIT Press, 2001.
Available in electronic media:
Berry and Linoff. Mastering Data Mining. Wiley, 2000. http://library.books24x7.com/book/id_827/toc.asp
Delmater and Hancock. Data Mining Explained. Digital Press, 2001. http://library.books24x7.com/book/id_2643/toc.asp
We will be using XLMiner, an Excel add-in, for homework assignments. To download a free version, go to http://www.xlminer.com
The free version is limited. For your homework and case assignments you will need a more powerful version, which will be provided by Resampling Stats at http://www.resample.com/xlminer/MIT
SAS Enterprise Miner will be available for projects that require handling large amounts of data. Instructions on using the software will be provided in recitations.
Your course grade will be based on case write-ups, homework, a team project, and a mid-term exam. The weights given to these components are:
Case write-ups and Homework (30%); Mid-term Exam (30%); Project (40%)
Class participation will be subjectively evaluated and will be used in borderline cases to determine the final grade.
There are two major assignments for this course:
Problem 1: The Charles Book Club Case
Read the case and answer all the questions at the end of the case.
Readings:
Bhandari, Vinni, and Dr. Nitin Patel. “The Charles Book Club Case.”
Levin, Nissan, and Jacob Zahav. “A Case Study in Database Marketing.” Tel Aviv University. Direct Marketing Educational Foundation, Inc., March 1995.
Association of American Publishers. Industry Statistics, 2002.
A new title, "The Art History of Florence", is ready for release. CBC has sent a test mailing to a random sample of 4,000 customers from its customer base. The customer responses have been collated with past purchase data. The data have been randomly partitioned into three parts: Training Data (1,800 customers), the initial data used to fit response models; Validation Data (1,400 customers), hold-out data used to compare the performance of different response models; and Test Data (800 customers), data to be used only after a final model has been selected, to estimate the likely accuracy of the model when it is deployed. The sample data are in a separate spreadsheet, CBC_4000.xls (XLS). Each row (or case) in the spreadsheet (other than the header) corresponds to one market test customer. Each column is a variable, with the header row giving the name of the variable. The variable names and descriptions are given in Table 1, below:
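The three-way random partition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `partition_customers`, the use of row indices in place of real customer records, and the fixed seed are all hypothetical and are not part of the case materials or of XLMiner.

```python
import random

def partition_customers(n_customers=4000, sizes=(1800, 1400, 800), seed=0):
    """Randomly split customer row indices into training, validation, and test sets."""
    if sum(sizes) != n_customers:
        raise ValueError("partition sizes must add up to the number of customers")
    indices = list(range(n_customers))
    # A fixed seed is an assumption made only so the split is reproducible.
    random.Random(seed).shuffle(indices)
    n_train, n_valid, _ = sizes
    train = indices[:n_train]                    # used to fit response models
    valid = indices[n_train:n_train + n_valid]   # used to compare models
    test = indices[n_train + n_valid:]           # touched only after a final model is chosen
    return train, valid, test

train, valid, test = partition_customers()
print(len(train), len(valid), len(test))
```

Keeping the test rows untouched until a final model is chosen is what lets them give an honest estimate of accuracy after deployment.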
Problem 2: The German Credit Case
(English PDF), (English DOC)
Read the case and answer all the questions at the end of the case.
German Credit Case Data (XLS)
Problem 1:
A common application of Discriminant Analysis is the classification of bonds into various bond rating classes. These ratings are intended to reflect the risk of the bond and influence the cost of borrowing for companies that issue bonds. Various financial ratios culled from annual reports are often used to help determine a company’s bond rating.
The Excel spreadsheet BondRatingProb1.xls (XLS) contains two sheets named Training data and Validation data. These are data from a sample of 95 companies selected from COMPUSTAT financial data tapes. The company bonds have been classified by Moody’s Bond Ratings (1980) into seven classes of risk ranging from AAA, the safest, to C, the most risky. The data include ten financial variables for each company. These are:
LOPMAR: Logarithm of the operating margin,
LFIXMAR: Logarithm of the pretax fixed charge coverage,
LTDCAP: Long-term debt to capitalization,
LGERRAT: Logarithm of total long-term debt to total equity,
LLEVER: Logarithm of the leverage,
LCASHLTD: Logarithm of the cash flow to long-term debt,
LACIDRAT: Logarithm of the acid test ratio,
LCURRAT: Logarithm of the current assets to current liabilities,
LRECTURN: Logarithm of the receivable turnover,
LASSLTD: Logarithm of the net tangible assets to long-term debt.
The data are divided into 81 observations in the Training data sheet and 14 observations in the Validation data sheet. The bond ratings have been coded into numbers in the column with the title CODERTG, with AAA coded as 1, AA as 2, etc. Use XLMiner to develop Discriminant Analysis and Neural Networks models to classify the bonds in the Validation data sheet. You will need to use the score new data option. What is the performance of the best classifier you have been able to find? Notice that there is an order in the class variable (i.e., AAA is better than AA, which is better than A, …). Would certain misclassification errors be worse than others? If so, how would you suggest measuring this?
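One possible way to think about the last question, offered as a sketch rather than as the expected answer: because the CODERTG codes are ordered, the distance between the actual and predicted code can be averaged alongside the plain misclassification rate. The function names and the rating vectors below are hypothetical and are not taken from BondRatingProb1.xls.

```python
def misclassification_rate(actual, predicted):
    """Fraction of bonds whose predicted rating code differs from the actual one."""
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)

def mean_rating_distance(actual, predicted):
    """Average number of rating classes between actual and predicted codes."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical CODERTG values (1 = AAA, ..., 7 = C), invented for illustration.
actual      = [1, 2, 3, 4, 5, 6, 7]
near_misses = [2, 2, 3, 4, 5, 6, 6]   # two errors, each off by one class
far_misses  = [7, 2, 3, 4, 5, 6, 1]   # two errors, each off by six classes

for name, predicted in [("near", near_misses), ("far", far_misses)]:
    print(name,
          misclassification_rate(actual, predicted),
          mean_rating_distance(actual, predicted))
```

Both hypothetical classifiers make the same number of errors, so the misclassification rate cannot tell them apart, while the distance-based measure penalizes the one that confuses AAA with C.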
Problem 2:
Give true/false answers to the following questions, with one sentence to justify each answer.
Problem 3:
The Excel spreadsheet RegressionProb3.xls (XLS) contains two sheets named Training Data and Validation Data. We will use XLMiner to build two models with the training data and then use the validation data to compare their performance as prediction models.
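The general idea of fitting on the training data and comparing on the validation data can be sketched as follows. The two models here (a mean-only baseline and a one-variable least-squares line), the helper names, and the numbers are all assumptions chosen for illustration; they are not the models or the data in RegressionProb3.xls.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rmse(actual, predicted):
    """Root mean squared prediction error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical training and validation data.
train_x, train_y = [1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 8.1, 9.8]
valid_x, valid_y = [6, 7], [12.0, 14.1]

# Model A: always predict the training mean. Model B: fitted straight line.
train_mean = sum(train_y) / len(train_y)
intercept, slope = fit_line(train_x, train_y)

rmse_mean = rmse(valid_y, [train_mean for _ in valid_x])
rmse_line = rmse(valid_y, [intercept + slope * x for x in valid_x])
print("validation RMSE, mean-only model:", rmse_mean)
print("validation RMSE, linear model:", rmse_line)
```

Both models see only the training rows when they are fitted, so the validation error is an estimate of how each would predict new observations.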
Problem 4:
The Excel spreadsheet NormalsProb4.xls (XLS) contains 1000 observations with two groups (Group 0 and Group 1) and two variables (x and y).
Group 0 points differently (e.g., one with a 'x' and the other with 'o') so you can
visualize the distribution of the points of each Group.
observations respectively.
Remember that logistic regression and discriminant analysis are linear classifiers,
i.e., they separate points of different classes with a plane. In contrast, neural
networks and k-nearest neighbors allow non-linear classifiers (do you have an
intuitive idea of the geometry of how the latter two classify points?).
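As a rough illustration of the geometric intuition for k-nearest neighbors, the sketch below classifies a point by majority vote among its k closest training points. The function `knn_predict` and the twelve training points are hypothetical; the points are arranged so that no single straight line separates the two groups, which is the kind of pattern a linear classifier cannot capture.

```python
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    by_distance = sorted(
        range(len(train_points)),
        key=lambda i: (train_points[i][0] - query[0]) ** 2
                      + (train_points[i][1] - query[1]) ** 2,
    )
    votes = Counter(train_labels[i] for i in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical (x, y) points: Group 0 sits in two opposite corners,
# Group 1 in the other two, so the groups are not linearly separable.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),   # Group 0
    (1.0, 1.0), (0.9, 1.0), (1.0, 0.9),   # Group 0
    (0.0, 1.0), (0.1, 1.0), (0.0, 0.9),   # Group 1
    (1.0, 0.0), (0.9, 0.0), (1.0, 0.1),   # Group 1
]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

for query in [(0.05, 0.05), (0.95, 0.95), (0.05, 0.95), (0.95, 0.05)]:
    print(query, knn_predict(points, labels, query))
```

Because each prediction depends only on nearby training points, the decision boundary follows the local shape of the data instead of a single plane.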
Each scatter plot should show the following series of points:
distribution. The Bayes Rule for minimum misclassification has an error rate of
18.5%. How close is the best classifier you have developed of each type to this
error rate? Give an intuitive explanation of why certain types of classifiers seem
to be better for this data.
The Midterm Exam for this course is available below.
XLMiner Software (Excel Add-in)
XLMiner Tutorials
XLMiner Package
XLMiner Tutorial by Romy Shioda
(English PDF), (English DOC)
Matrix Math Review by Adam Mersereau
(English PDF), (English DOC)