Data Mining

  • Data Mining
    • The automatic analysis and sorting of big data in a data warehouse.
      • Pattern recognition is used to identify patterns and correlations and to predict trends and relationships.
        • Data is combined from multiple sources.
    • Data mining refers to the process of analysing large data sets (Big Data) with a view to discovering patterns and trends that go beyond simple analysis.
    • It combines artificial intelligence, statistics and database systems to analyse groups of structured and unstructured data sets that prove difficult to analyse using traditional means.
    • The patterns identified are then presented as a summary of the input data and can be used for further analysis.
    • Big data is a term associated with data sets that are so complex that traditional databases (such as an RDBMS) and other processing applications are unable to capture, curate (the process of organising data from a range of data sources), manage and process them within an acceptable time frame.
      • Big data challenges can be defined as the 3 Vs:
        • Volume – the amount of data to be processed
        • Variety – the number of types of data to be analysed
        • Velocity – the speed of data processing
      • Social media is one of the biggest sources of Big Data.
      • Data sources can be categorised into internal and external. Internal data includes sources such as customer details, product details, sales data etc.
      • External sources include data collected from business partners, data suppliers and the internet.
        • In essence, the commonly used data sources are:
          • social media;
          • machine data – data generated from devices such as RFID chip readers and GPS results; and
          • transactional data – data generated from companies such as eBay, Amazon and large stores such as Tesco.
        • The key requirements of big data storage are therefore that it can handle very large amounts of data and keep scaling to keep up with the growth of data sets, and that it can provide the high-speed input/output operations needed to support data analytics as they are carried out.
        • Big data practitioners run what are known as hyperscale computing environments which consist of a vast number of servers with Direct Attached Storage (DAS).
          • Smaller organisations can support the storage of big data through the use of clustered Network Attached Storage (NAS) devices.
      • Object-based storage systems offer an alternative to NAS devices and the problems they can lead to.
        • Each file stored in an object-based storage system is given its own unique identifier and is indexed to support high-speed access to a particular data file or data set.
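The idea of a unique identifier plus an index can be illustrated with a small sketch. This is a toy in-memory model only, not how any real object storage product is implemented; the class name `ObjectStore`, its methods and the use of tags as the index are all assumptions made for illustration.

```python
import uuid


class ObjectStore:
    """Toy in-memory object store (hypothetical, for illustration only).

    Objects live in a flat namespace: there are no folders, only a
    unique identifier per object and an index used to look objects up.
    """

    def __init__(self):
        self._objects = {}  # object ID -> raw bytes
        self._index = {}    # tag -> set of object IDs carrying that tag

    def put(self, data, tags=()):
        """Store data, record it under each tag and return its unique ID."""
        object_id = str(uuid.uuid4())  # assumed ID scheme: a random UUID
        self._objects[object_id] = data
        for tag in tags:
            self._index.setdefault(tag, set()).add(object_id)
        return object_id

    def get(self, object_id):
        """Fetch an object directly by its identifier."""
        return self._objects[object_id]

    def find(self, tag):
        """Return the IDs of all objects indexed under the given tag."""
        return set(self._index.get(tag, set()))
```

Retrieval by identifier is a single dictionary lookup, which is the property the note above describes: access does not depend on walking a directory tree.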
      • Big data processing techniques analyse data sets at terabyte or even petabyte scale.
        • Cluster analysis – where groups of data records are identified;
          • Classification – where the data mining process is used to assign an appropriate structure to new data, in the way that, for example, an email application may classify some emails as spam;
            • Anomaly detection – where unusual records are identified. Such anomalies may merit further investigation as a point of interest to the organisation, or they may be representative of data errors;
              • Association rule mining and sequential pattern mining – where dependencies between data items can be identified, for example the use of data sets by a supermarket to determine which patterns of products are purchased together;
                • Regression – where relationships between data variables are investigated to help understand how a change in an independent variable can impact upon a dependent data variable;
                  • Summarisation – where data is summarised in a visual format.
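Three of the techniques above can be sketched on tiny data sets. The code below is a simplified illustration rather than a production method: association rules are reduced to counting item pairs, anomaly detection uses a basic z-score test, and regression is an ordinary least-squares straight line. The function names (`pair_rules`, `anomalies`, `fit_line`), the thresholds and the sample data are all assumptions chosen for the example.

```python
from collections import Counter
from itertools import combinations
from statistics import mean, pstdev


def pair_rules(baskets, min_support=0.5):
    """Association rules over item pairs.

    Returns {(a, b): (support, confidence of a -> b)} for every pair
    that appears in at least `min_support` of the baskets.
    """
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in set(basket))
    pair_counts = Counter(
        pair
        for basket in baskets
        for pair in combinations(sorted(set(basket)), 2)
    )
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support >= min_support:
            confidence = count / item_counts[a]  # share of a-baskets that also hold b
            rules[(a, b)] = (support, confidence)
    return rules


def anomalies(values, threshold=2.0):
    """Anomaly detection: values more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


def fit_line(xs, ys):
    """Regression: least-squares slope and intercept for y = slope * x + intercept."""
    mean_x, mean_y = mean(xs), mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    baskets = [
        ["bread", "butter", "milk"],
        ["bread", "butter"],
        ["bread", "jam"],
        ["milk", "jam"],
    ]
    print(pair_rules(baskets))                                  # products bought together
    print(anomalies([10, 11, 9, 10, 12, 10, 11, 9, 10, 50]))    # unusual records
    print(fit_line([1, 2, 3, 4], [3, 5, 7, 9]))                 # effect of x on y
```

In the supermarket example, support measures how often two products appear together across all baskets, while confidence measures how often the second product appears given the first. Real big data workloads would run equivalent logic on a distributed platform rather than in a single Python process.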

