- Data are the facts and figures collected, analyzed, and summarized for presentation and interpretation.
- All the data collected in a particular study are referred to as the data set for the study.
- Elements are the entities on which data are collected.
- A variable is a characteristic of interest for the elements.
- The set of measurements obtained for a particular element is called an observation.
- A data set with n elements contains n observations.
- The total number of data values in a complete data set is the number of elements multiplied by the number of variables.
- Scales of measurement include:
  · Nominal
  · Ordinal
  · Interval
  · Ratio
- The scale determines the amount of information contained in the data.
- The scale indicates the data summarization and statistical analyses that are most appropriate.
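The terminology above (elements, variables, observations, data values) can be made concrete with a minimal Python sketch. The company names, variable names, and values are invented for illustration only; they do not come from a real data set.

```python
# Hypothetical data set: each key is an element (a company) and each
# inner dict holds the observation (set of measurements) for that element.
data_set = {
    "Company A": {"exchange": "NYSE", "annual_sales": 120, "rating": "high"},
    "Company B": {"exchange": "NASDAQ", "annual_sales": 90, "rating": "medium"},
    "Company C": {"exchange": "NYSE", "annual_sales": 310, "rating": "low"},
}

elements = list(data_set)                        # entities data are collected on
variables = list(next(iter(data_set.values())))  # characteristics of interest
n_observations = len(data_set)                   # one observation per element

# Total data values = number of elements * number of variables
n_data_values = len(elements) * len(variables)

print("elements:", elements)
print("variables:", variables)
print("observations:", n_observations)
print("data values:", n_data_values)
```

With 3 elements and 3 variables, the sketch counts 3 observations and 9 data values.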
Nominal scale
• Data are labels or names used to identify an attribute of the element.
• A nonnumeric label or numeric code may be used.
Ordinal scale
• The data have the properties of nominal data, and the order or rank of the data is meaningful.
• A nonnumeric label or numeric code may be used.
Interval scale
• The data have the properties of ordinal data, and the interval between observations is expressed in terms of a fixed unit of measure.
• Interval data are always numeric.
Ratio scale
• Data have all the properties of interval data, and the ratio of two values is meaningful.
• Ratio data are always numeric.
• The scale includes a zero value, which indicates that nothing exists for the variable at the zero point.
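The difference between the scales shows up in which operations give meaningful answers. The following Python sketch is an illustration under assumed examples (a three-level rating, temperatures, and weights); none of the values come from the notes above.

```python
# Ordinal: labels can be ranked, but arithmetic on them is not meaningful.
rating_order = ["poor", "good", "excellent"]          # hypothetical rating scale
rank = {label: i for i, label in enumerate(rating_order)}
responses = ["good", "poor", "excellent", "good"]
sorted_responses = sorted(responses, key=rank.get)    # ordering is meaningful

# Interval: differences are meaningful, ratios are not.
# Celsius has no true zero, so the ratio changes when the unit changes.
c1, c2 = 10.0, 20.0
f1, f2 = c1 * 9 / 5 + 32, c2 * 9 / 5 + 32             # same temperatures in Fahrenheit
ratio_celsius = c2 / c1                               # 2.0
ratio_fahrenheit = f2 / f1                            # 68 / 50, not 2.0

# Ratio: a true zero exists, so the ratio survives a change of unit.
kg1, kg2 = 10.0, 20.0
lb1, lb2 = kg1 * 2.20462, kg2 * 2.20462               # same weights in pounds
ratio_kg = kg2 / kg1
ratio_lb = lb2 / lb1

print(sorted_responses)
print(ratio_celsius, ratio_fahrenheit)
print(ratio_kg, ratio_lb)
```

Saying that 20 °C is "twice as hot" as 10 °C is therefore not meaningful, whereas 20 kg is twice as heavy as 10 kg in any unit.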
Categorical and Quantitative Data
• Data can be further classified as being categorical or quantitative.
• The statistical analysis that is appropriate depends on whether the data for the variable are categorical or quantitative.
• In general, there are more alternatives for statistical analysis when the data are quantitative.
Categorical Data
• Labels or names are used to identify an attribute of each element.
• Often referred to as qualitative data.
• Use either the nominal or ordinal scale of measurement.
• Can be either numeric or nonnumeric.
• Appropriate statistical analyses are rather limited.
Quantitative Data
• Quantitative data indicate how many or how much.
• Quantitative data are always numeric.
• Ordinary arithmetic operations are meaningful for quantitative data.
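As a rough illustration of why the appropriate analysis depends on the data type, the sketch below summarizes a categorical variable by counting and a quantitative variable by averaging. The variable names, codes, and values are hypothetical.

```python
from collections import Counter
from statistics import mean

# Categorical variable (nominal scale): counting is the natural summary.
exchange = ["NYSE", "NASDAQ", "NYSE", "NYSE"]
exchange_counts = Counter(exchange)

# The same categorical variable stored as numeric codes (assumed coding).
# The codes are still labels: averaging them would not be meaningful.
code_labels = {1: "NYSE", 2: "NASDAQ"}
exchange_codes = [1, 2, 1, 1]
decoded = [code_labels[code] for code in exchange_codes]

# Quantitative variable: ordinary arithmetic such as the mean is meaningful.
annual_sales = [120, 90, 310, 40]
mean_sales = mean(annual_sales)

print(exchange_counts)
print(decoded)
print(mean_sales)
```

Note that being numeric does not make a variable quantitative; the coded exchange column is numeric but remains categorical.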
Cross-Sectional Data
Cross-sectional data are collected at the same or approximately the same point in time.
Time Series Data
Time series data are collected over several time periods.
Graphs of time series data help analysts:
• understand what happened in the past,
• identify any trends over time, and
• project future levels for the time series.
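One simple way to identify a trend and project a future level is to fit a straight line to the series. The notes do not prescribe a method, so the least-squares line below is only one possible choice, and the sales figures are invented.

```python
# Hypothetical sales for time periods 1..4 (invented values).
sales = [100.0, 112.0, 118.0, 130.0]
periods = list(range(1, len(sales) + 1))

# Least-squares fit of sales = intercept + slope * period.
n = len(sales)
t_mean = sum(periods) / n
y_mean = sum(sales) / n
sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(periods, sales))
sxx = sum((t - t_mean) ** 2 for t in periods)
slope = sxy / sxx                    # average change per period (the trend)
intercept = y_mean - slope * t_mean

# Project the level for the next period by extending the fitted line.
next_period = periods[-1] + 1
forecast = intercept + slope * next_period

print("trend per period:", round(slope, 2))
print("projected level for period", next_period, ":", round(forecast, 2))
```

A linear projection assumes the past trend continues unchanged, which should be checked against a graph of the series before relying on it.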
Data Sources - Existing Sources
• Internal company records – almost any department
• Business database services – Dow Jones & Co.
• Government agencies – U.S. Department of Labor
• Industry associations – Travel Industry Association of America
• Special-interest organizations – Graduate Management Admission Council (GMAC)
• Internet – more and more firms make data available online
Data Acquisition Considerations
Time Requirement
• Searching for information can be time consuming.
• Information may no longer be useful by the time it is available.
Cost of Acquisition
• Organizations often charge for information even when it is not their primary business activity.
Data Errors
• Using any data that happen to be available or were acquired with little care can lead to misleading information.
Descriptive Statistics
• Most of the statistical information in newspapers, magazines, company reports, and other publications consists of data that are summarized and presented in a form that is easy to understand.
• Such summaries of data, which may be tabular, graphical, or numerical, are referred to as descriptive statistics.
Statistical Inference
Population: The set of all elements of interest in a particular study.
Sample: A subset of the population.
Statistical inference: The process of using data obtained from a sample to make estimates and test hypotheses about the characteristics of a population.
Census: Collecting data for the entire population.
Sample survey: Collecting data for a sample.
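The relationship between these terms can be sketched in Python. The population below is simulated (an assumption made purely for illustration), the census computes the population mean from every element, and the sample survey estimates that mean from a random subset.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the run is reproducible

# Hypothetical population: 10,000 simulated measurements (not real data).
population = [rng.gauss(50, 10) for _ in range(10_000)]

# Census: collect data for the entire population.
population_mean = mean(population)

# Sample survey: collect data for a sample (a subset of the population).
sample = rng.sample(population, 50)
sample_mean = mean(sample)  # point estimate of the population mean

print("population mean (census):", round(population_mean, 2))
print("sample mean (estimate):  ", round(sample_mean, 2))
```

The sample mean will generally differ somewhat from the population mean; quantifying that sampling error is part of statistical inference.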
Analytics
Analytics is the scientific process of transforming data into insight for making better decisions.
Techniques:
• Descriptive analytics: Describes what has happened in the past.
• Predictive analytics: Uses models constructed from past data to predict the future or to assess the impact of one variable on another.
• Prescriptive analytics: The set of analytical techniques that yield a best course of action.
Big Data and Data Mining
Big data: A large and complex data set.
Three V’s of big data:
- Volume: Amount of available data
- Velocity: Speed at which data are collected and processed
- Variety: Different data types
Data warehousing is the process of capturing, storing, and maintaining the data.
• Organizations obtain large amounts of data on a daily basis by means of magnetic card readers, bar code scanners, point of sale terminals, and touch screen monitors.
• Wal-Mart captures data on 20-30 million transactions per day.
• Visa processes 6,800 payment transactions per second.
Data Mining
• Methods for developing useful decision-making information from large databases.
• Using a combination of procedures from statistics, mathematics, and computer science, analysts “mine the data” to convert it into useful information.
• The most effective data mining systems use automated procedures to discover relationships in the data and predict future outcomes prompted by general and even vague queries by the user.
Data Mining Applications
- The major applications of data mining have been made by companies with a strong consumer focus such as retail, financial, and communication firms.
- Data mining is used to identify related products that customers who have already purchased a specific product are also likely to purchase (and then pop-ups are used to draw attention to those related products).
- Data mining is also used to identify customers who should receive special discount offers based on their past purchasing volumes.
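The "related products" idea can be illustrated with a very small co-occurrence count. This is a simplified sketch, not the method any particular retailer uses; the baskets and the helper `related_products` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories: one set of products per transaction.
baskets = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "jam"},
]

# Count how often each pair of products appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

def related_products(product, top_n=2):
    """Return the products most often bought together with `product`."""
    scores = Counter()
    for (a, b), count in pair_counts.items():
        if a == product:
            scores[b] += count
        elif b == product:
            scores[a] += count
    return [name for name, _ in scores.most_common(top_n)]

print(related_products("bread"))
```

In these invented baskets, butter is the product most often purchased together with bread, so it would be the first candidate to suggest to a bread buyer.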
Data Mining Requirements
• Statistical methods such as multiple regression, logistic regression, and correlation are heavily used.
• Also needed are computer science technologies involving artificial intelligence and machine learning.
• A significant investment in time and money is required as well.
Data Mining Model Reliability
• Finding a statistical model that works well for a particular sample of data does not necessarily mean that it can be reliably applied to other data.
• With the enormous amount of data available, the data set can be partitioned into a training set (for model development) and a test set (for validating the model).
• There is, however, a danger of overfitting the model to the point that misleading associations and conclusions appear to exist.
• Careful interpretation of results and extensive testing are important.
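The partitioning and overfitting points above can be sketched with simulated data. Everything here is an assumption for illustration: the data are generated as roughly y = 2x plus noise, a straight line stands in for a reasonable model, and a "memorizer" that simply returns the nearest training value stands in for an overfit model.

```python
import random

rng = random.Random(0)  # fixed seed so the partition is reproducible

# Hypothetical data: y is roughly 2*x plus random noise.
data = [(x, 2.0 * x + rng.gauss(0, 5)) for x in range(100)]

# Partition into a training set (model development) and a test set (validation).
rng.shuffle(data)
split = len(data) * 7 // 10
train, test = data[:split], data[split:]

def fit_line(points):
    """Least-squares line; returns a prediction function."""
    n = len(points)
    x_mean = sum(x for x, _ in points) / n
    y_mean = sum(y for _, y in points) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    sxx = sum((x - x_mean) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return lambda x: intercept + slope * x

def fit_memorizer(points):
    """Overfit model: returns the y value of the nearest training x."""
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]

def mse(model, points):
    """Mean squared prediction error of `model` on `points`."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)

line, memorizer = fit_line(train), fit_memorizer(train)
for name, model in [("line", line), ("memorizer", memorizer)]:
    print(name, "train MSE:", round(mse(model, train), 2),
          "test MSE:", round(mse(model, test), 2))
```

The memorizer reproduces the training set perfectly, so its training error is zero, yet that says nothing about how it performs on the held-out test set. Comparing training and test error is what exposes a model that has fit the noise rather than the underlying relationship.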
Ethical Guidelines for Statistical Practice
• In a statistical study, unethical behavior can take a variety of forms, including:
  · Improper sampling
  · Inappropriate analysis of the data
  · Development of misleading graphs
  · Use of inappropriate summary statistics
  · Biased interpretation of the statistical results
• One should strive to be fair, thorough, objective, and neutral when collecting, analyzing, and presenting data.
• As a consumer of statistics, one should also be aware of the possibility of unethical behavior by others.
• The American Statistical Association developed the report “Ethical Guidelines for Statistical Practice”.
• It contains 67 guidelines organized into 8 topic areas:
  · Professionalism
  · Responsibilities to Funders, Clients, Employers
  · Responsibilities in Publications and Testimony
  · Responsibilities to Research Subjects
  · Responsibilities to Research Team Colleagues
  · Responsibilities to Other Statisticians/Practitioners
  · Responsibilities Regarding Allegations of Misconduct
  · Responsibilities of Employers Including Organizations, Individuals, Attorneys, or Other Clients