官术网_书友最值得收藏!

Common data issues

We can categorize data difficulties into several groups. The most generally accepted groupings (of data issues) include:

  • Accuracy: There are many varieties of data inaccuracies and the most common examples include poor math, out-of-range values, invalid values, duplication, and more.
  • Completeness: Data sources may be missing values from particular columns, missing entire columns, or even missing complete transactions.
  • Update status: As part of your quality assurance, you need to establish the cadence of data refresh or update, as well as have the ability to determine when the data was last saved or updated. This is also referred to as latency.
  • Relevance: It is identification and elimination of information that you don't need or care about, given your objectives. An example would be removing sales transactions for pickles if you are intending on studying personal grooming products.
  • Consistency: It is common to have to cross-reference or translate information from data sources. For example, recorded responses to a patient survey may require translation to a single consistent indicator to later make processing or visualizing easier.
  • Reliability: This is chiefly concerned with making sure that the method of data gathering leads to consistent results. A common data assurance process involves establishing baselines and ranges, and then routinely verifying that the data results fall within the established expectations. For example, districts that typically have a mix of both registered Democrat and Republican voters would warrant the investigation if the data suddenly was 100 percent single partied.
  • Appropriateness: Data is considered appropriate if it is suitable for the intended purpose; this can be subjective. For example, it is considered a fact that holiday traffic affects purchasing habits (an increase in US Flags Memorial day week does not indicate an average or expected weekly behavior).
  • Accessibility: Data of interest may be watered down in a sea of data you are not interested in, thereby reducing the quality of the interesting data since it would be mostly inaccessible. This is particularly common in big data projects. Additionally, security may play a role in the quality of your data. For example, particular computers might be excluded from captured logging files or certain health-related information may be hidden and not part of shared patient data.
主站蜘蛛池模板: 满城县| 鹤山市| 澄迈县| 墨江| 高州市| 民丰县| 湖州市| SHOW| 开阳县| 大安市| 河东区| 社旗县| 荥经县| 德令哈市| 辽阳县| 龙岩市| 新野县| 五寨县| 五家渠市| 屯昌县| 鄂伦春自治旗| 德兴市| 托克托县| 勐海县| 和平县| 大田县| 广南县| 牙克石市| 湘潭县| 秀山| 铁力市| 阳城县| 方正县| 宝山区| 新密市| 三亚市| 雷波县| 佛学| 芒康县| 越西县| 偃师市|