CA4: PROJECT AND BLOG

INTRODUCTION

The module B8IS100 Data Management and Analytics has proven exceptionally fulfilling in all aspects covered. Learning about data management, analytics methods and the related tools, and applying this knowledge in a practical sense, has made for a swift learning curve. The journey has been a truly fascinating exercise and has given me an appetite to explore further learning and qualifications in this hugely active and growing field.

INFORMATION SYSTEMS

We have learned that Information Systems (IS) comprise a functioning combination of hardware, software and networks that people and organisations use to collect, create, analyse, secure and distribute data.

IS Hardware will comprise all the physical user devices, data processing systems and media storage devices, along with all physical materials associated with and required for information processing.

IS Software will include both the application software (programs, tools etc.) and the Operational Support System (OSS). The OSS is the part of the system through which system personnel (not end users) conduct all system maintenance, upgrades, GUI modifications and more: essentially anything that lies outside the domain of the system's end users/clients.

IS Networks provide system interconnectivity across the global telecommunications network. These networks comprise the Internet, intranets and extranets. They provide the interconnection that all successful modern-day businesses and enterprises need to maintain, grow and protect their products, services and resources.

The People aspect of an IS includes the end-users who utilise the system and the IS specialists and Business Analysts who design, develop, implement, maintain, upgrade and secure the IS entities.

There are four major types of Information Systems: –

  • MIS – Management Information Systems
  • DSS – Decision Support Systems
  • ESS – Executive Support Systems
  • TPS – Transaction Processing Systems

The following diagram, sourced from Paginas MIS, illustrates and details the four IS types specified above: –

[Diagram: the four major types of Information Systems]

We also looked at the decision-making scenarios presented by each Information System, the organisational layer at which these decisions generally occur, and the type of decision input/output that can be anticipated. We came to understand how structured decision making generally occurs at operations level, semi-structured tactical decision making usually occurs at mid-management level, and long-term strategic decision making occurs at the most senior levels of management within a company, right up to CEO level.

The following diagram, sourced from Process Consultant Blog Spot, summarises typical roles associated with each of the three decision layers specified above: –

[Diagram: decision-making roles at operational, tactical and strategic levels]

IS Security

It is appropriate to include a subsection on IS security, since the volume of data being generated and traversing networks and corporations continues to soar at never-before-anticipated growth rates.

The scale and pace at which organisations are having to implement information systems presents ever-new security challenges. Data managers and their systems analysts must ensure their Management Information Systems have robust protective measures in place to protect their business interests, their clients' and staff's personal information, and the overall investment.

I found that the following article sums up perfectly the challenges and exposures that modern businesses must contend with to secure all aspects and resources of their business from online interference, damage, theft and other cyber-security breaches.

https://heatsoftware.com/security-blog/6358/dealing-with-todays-information-systems-complexity/

Dealing with today’s Information Systems complexity

Information systems complexity is the enemy of security. From mobile to the cloud and practically everything in between, all businesses have information systems complexities which are creating big security issues. This complexity rears its ugly head time and again in businesses both small and large and, given our dependence on information, appears to be on an exponential growth track.

Information systems complexity isn’t just about the quantity of systems on the network. It goes much deeper than that and includes factors like:

  • Multitude of applications, virtual machines, and even cloud service providers that are both known and unknown (i.e. other people in the business doing their own thing without involving IT and security staff)
  • Guidelines, standards, and policies that some (rarely all) people may or may not be held accountable to

There’s another element of complexity – often the biggest – that can create immeasurable security risks in your environment at any given time: your users. The human aspects of computer usage such as thinking and decision-making have a profound impact on IT management and information risks.

It’s not just our own networks that are complex either. The very threats we’re fighting off can be very complicated as well. The techniques used by criminal hackers and advanced malware are beyond the comprehension of many people, including IT professionals. Further complicating matters is the reality that it’s hard to protect against something that hasn’t yet happened.

The above quoted article helps highlight the exceptional focus, urgency and prioritisation that all organisations and their system designers must place on ensuring rigorous, impenetrable security is placed around all access points to Information Systems. With the exponential rise of the Internet of Things over the forthcoming years, and the more than twenty billion connected devices predicted, all aspects of security are being tested from each and every angle by a global plethora of online hackers and dedicated cyber-criminal communities. I found the content of the published article extremely informative regarding the sheer volume of management information systems that criss-cross our lives on a daily basis. It highlights the diligence that system providers and administrators must apply to the build and management of each and every Information System.

BUSINESS INTELLIGENCE

The following concise definition of business intelligence is courtesy of Webopedia: –

http://www.webopedia.com/TERM/B/Business_Intelligence.html

Business intelligence (BI) represents the tools and systems that play a key role in the strategic planning process within a corporation. These BI systems allow a company to gather, store, access and analyze corporate data to aid in decision-making. Generally these systems will illustrate business intelligence in the areas of customer profiling, customer support, market research, market segmentation, product profitability, statistical analysis, and inventory and distribution analysis to name a few.

To supplement the excellent material our class received from our lecturer's uploads to Moodle, my online research led me to the following overview diagram of BI toolsets. I will use this diagram, courtesy of Predictive Analytics Today, to briefly expand on the toolsets that comprise the current BI environment.

[Diagram: Top Business Intelligence Tools List, courtesy of Predictive Analytics Today]

  • Spreadsheets – predominantly MS Office Excel and Apple Numbers
  • Reporting and querying – organisations primarily using their own software to report, query, sort, filter and display data (a short R sketch of this kind of querying follows this list)
  • OLAP – online analytical processing tools that enable users to perform interactive analysis of data from multiple sources, incorporating multidimensional views
  • Digital dashboards – real-time user interfaces that graphically present current operational status and allow historical reports to be extracted quickly
  • Data mining – identifying patterns in large data sets using methods drawn from artificial intelligence, machine learning, statistics and database systems
  • Data warehousing – a centralised storage location for data gathered from multiple sources, serving as the repository for all data on which future predictive analysis will be conducted; robust back-up processes and infrastructure are key requirements of any industry best-practice data warehouse
  • Process mining – analysis of business processes based on the event logs recorded by an information system
  • Business performance management – processes for managing the performance of a business or organisational unit
  • GIS information systems – support and facilitate geographic information system reporting
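
As a small illustration of the reporting and querying toolset referenced above, here is a minimal R sketch that filters, sorts and aggregates a hypothetical sales data frame (the data and column names are purely illustrative, not any specific organisation's system):

# Hypothetical sales data for illustration only
sales <- data.frame(
  Region  = c("Dublin", "Cork", "Galway", "Dublin", "Cork"),
  Product = c("A", "A", "B", "B", "B"),
  Revenue = c(1200, 800, 450, 950, 300)
)

# Report: total revenue per region, sorted from highest to lowest
revenue_by_region <- aggregate(Revenue ~ Region, data = sales, FUN = sum)
revenue_by_region[order(-revenue_by_region$Revenue), ]

# Query: only the transactions above a revenue threshold
subset(sales, Revenue > 500)
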
BIG DATA ANALYTICS

In 2001, Gartner analyst Doug Laney defined the 3Vs of Big Data, namely volume, velocity and variety.

Some years later, following further studies and evaluations, IBM introduced a fourth V to Big Data, Veracity.

Volume – Data is being generated in exponentially increasing volumes each day, from sources including online and mobile-phone transactions, social media platforms and industry computer logs. Big Data strategists and tool developers are continually evaluating how business and mankind can ensure that all the data being generated and collected globally on a daily basis is utilised correctly to benefit organisations and society. Apache Hadoop is the industry-recognised framework used to perform data analytics on the vast swathes of unstructured data being gathered and exchanged across the Internet every second of every day.

Velocity – Velocity refers not only to the speed at which data is being generated and collected but also to the turnaround time in converting the gathered data into useful and meaningful information for industry and society. The advent of smartphones and tablets has placed new demands on Big Data storage systems and on how velocity factors respond to data traffic.

Variety – Variety refers to the multiple formats in which data is now being presented across the web and then channelled into data warehouses. It is primarily unstructured data, e.g. e-mails, photos, online purchase transactions, social media exchanges, tweets and phone call records, to name just a handful.

Here is an IBM comment that helps put Gartner's Big Data 3Vs into context (IBM Big Data Hub): “On Facebook alone we send 10 billion messages per day, click the like button 4.5 billion times and upload 350 million new pictures each and every day. If we take all the data generated in the world between the beginning of time and the year 2000, it is the same amount we now generate every minute! This increasingly makes data sets too large to store and analyze using traditional database technology. With big data technology we can now store and use these data sets with the help of distributed systems, where parts of the data is stored in different locations, connected by networks and brought together by software”.

[Diagram: the 3Vs of Big Data, courtesy of Data Science Central Blog]

Veracity – IBM state on their Big Data Analytics Hub that the average billion-dollar company is losing $130 million a year due to poor data management. Veracity refers to the uncertainty surrounding data, which stems from data inconsistency and incompleteness; this in turn leads to another challenge, keeping big data organised.

The volume, velocity, variety and veracity of the data being generated today go beyond what traditional analytics systems can handle in a timely and efficient manner. This leads to the fifth V that organisations are struggling with: finding the Value within their data.

Value – IBM state on their Big Data Analytics Hub: “Through effective data mining and analytics, the massive amount of data that we collect throughout the normal course of doing business can be put to good use and yield value and business opportunities. By applying data mining and analytics to expose valuable business information embedded in structured, unstructured, and streaming data and data warehouses, this insight can be used to help revamp supply chains, improve program planning, track sales and marketing activities, measure performance across channels, and transform into an on-demand business. A big data strategy gives businesses the capability to better analyze this data with a goal of accelerating profitable growth.”

The above quotes from the IBM Big Data Analytics Hub, relating to the fourth V (Veracity) and the fifth V (Value), reinforce the necessity for all developers and data scientists to work together on ensuring we have the tools and systems that will cope with the Big Data equivalent of a tsunami, one that will continue to gather momentum at an unprecedented rate.

The exposure our course gave to the current tools and processes used for Big Data analytics, and to the collective efforts being harnessed to deal with the challenges Big Data analytics presents, has highlighted the seriousness with which industry is treating the Big Data revolution.

Our course has also highlighted the opportunities that currently exist, and will continue to materialise, for suitably qualified individuals to pursue career openings within Big Data analytics.

[Diagram: IBM's 4 Vs of Big Data]

MASTER DATA MANAGEMENT, DATA GOVERNANCE, DATA QUALITY

Master Data Management refers to the systems, tools, information and processes that comprise and oversee the data resources of an organisation or enterprise.

The following diagram from simbyte.com.au helps us visualise the volume of data processes that can potentially interact. These interactions will include internal and external interfaces.

[Diagram: main Master Data Management processes]

The diagram also highlights the importance that thus attaches to data governance. Best-practice guidelines on data governance reinforce the importance of clearly differentiating the roles and responsibilities that apply to its key roles.

These roles specifically relate to steering at C-level (strategic), Data Owner (tactical) and Data Steward (operational) level.

The following diagram, courtesy of nabler.com/analytics, encompasses the people, processes and platforms involved in the data governance activities and responsibilities that require focus and adherence.

[Diagram: data governance structure]

Data quality, and what exactly constitutes “good” data quality, demands continued diligence and focus, since the sheer volume of information now being electronically gathered and stored exposes organisations to the risk of data quantity winning out over data quality.

Fortunately, data analytics tools, systems and processes are evolving at the same rapid pace as the Big Data revolution itself.

The following have been identified as the dimensions that need to be focused upon and monitored when ranking the qualitative properties of data managed within a Master Data Management system (a short R sketch of such checks follows this list): –
  1. Completeness [for example: compliance and correctness of mandatory vs optional fields in the supplied and stored data]
  2. Timeliness [for example: publishing data when agreed and obliged to do so, up-to-date information from Customer Services for incoming enquiries, accurate cross-checking on credit card accounts]
  3. Consistency [for example: all credit card information across an enterprise is fully synchronised, closing dates of promotions are validated across the system, eligibility for offers is up to date with customer purchasing activity]
  4. Validity [for example: no invalid characters appearing in a data string, or missing fields, rendering it useless]
  5. Integrity [for example: all business rules pertaining to primary and foreign key attributes are correctly adhered to]
  6. Accuracy [for example: adherence to the format in which date of birth must be captured]
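
As referenced above, here is a minimal R sketch of what a few of these quality checks might look like in practice; the customer data frame and its column names are purely hypothetical:

# Hypothetical customer master data for illustration only
customers <- data.frame(
  CustomerID = c(1, 2, 3, 4),
  Name       = c("A. Byrne", "B. Murphy", NA, "D. Walsh"),
  DOB        = c("1980-04-12", "12/04/1980", "1975-09-30", "1990-02-28"),
  stringsAsFactors = FALSE
)

# Completeness: proportion of missing values in a mandatory field
mean(is.na(customers$Name))

# Accuracy of format: which DOB values match the agreed YYYY-MM-DD layout?
grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", customers$DOB)

# Integrity: the primary key column should contain no duplicates
any(duplicated(customers$CustomerID))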

CONCLUDING: PROJECT ASSESSMENTS

The projects undertaken for module B8IS100 Data Management and Analytics have been extremely interesting, informative, stimulating and at times challenging too.

Getting acquainted with Fusion Tables highlighted to us the typical tools that are freely available to anyone who wishes to engage with and learn more about data analytics. We worked with raw data from government publications (e.g. census population figures, district crime rate figures), formatted the data within Excel to ensure that it was structured to pivot table requirements, and then went about manipulating the data using Google Fusion Tables (from Google Drive) to generate a heat map. Creating a heat map of population densities in Ireland with census figures from 2011 was our particular project assignment in this case.

Following our lectures on the R programming language and completing the Try R online course, we downloaded RStudio to our laptops or PCs. Again, we would work with data that was captured in Excel. In this case we saved our data in comma-separated value format (.csv) and used the .csv file in our RStudio workspace. Working with R and RStudio has been a hugely beneficial exercise that illustrated and equipped us with the basics of working with large volumes of data and converting them into charts for interpretation and decision making. Again, a big task to attempt in Excel became a repeatable and manageable exercise in R.

Our journey with projects kicked off in Semester One when we studied SQL, Structured Query Language. We learned how to create our own relational database, load up our data and practically recreate a realistic user test case. Our exercise required us to create a relational database for a video rental store and to run queries on customer accounts, including their rental history and status (a rough sketch of this kind of schema and query, reproduced from R, is shown below).
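
The original project was written directly in SQL; as a hedged illustration of the kind of schema and query involved (the table and column names below are hypothetical, not the actual assignment schema), the same idea can be reproduced from R using the DBI and RSQLite packages:

library(DBI)
library(RSQLite)

# In-memory SQLite database standing in for the video rental store
con <- dbConnect(RSQLite::SQLite(), ":memory:")

dbExecute(con, "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
dbExecute(con, "CREATE TABLE rental (rental_id INTEGER PRIMARY KEY, customer_id INTEGER,
                title TEXT, returned INTEGER,
                FOREIGN KEY (customer_id) REFERENCES customer(customer_id))")

dbExecute(con, "INSERT INTO customer VALUES (1, 'Mary Byrne'), (2, 'John Smith')")
dbExecute(con, "INSERT INTO rental VALUES (1, 1, 'Jaws', 1), (2, 1, 'Alien', 0), (3, 2, 'Rocky', 1)")

# Rental history and status for each customer account
dbGetQuery(con, "SELECT c.name, r.title, r.returned
                 FROM customer c JOIN rental r ON c.customer_id = r.customer_id
                 ORDER BY c.name")

dbDisconnect(con)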

All in all, I am very pleased with the projects we were assigned. The preparation our lecturer provided regarding theory, practical aspects and support was superb. Combining all the learnings across this module and others in the overall course, I have to say that I have surprised myself with how quickly I came to grips with so much of the material, and with the appetite the course has given me to expand on the learning and groundwork done.

Data Management and Big Data Analytics is a compelling subject to delve into. It will no doubt provide countless opportunities for students who study this course at the college to pursue rewarding careers. It will also inspire a good many to take up further studies and become highly qualified specialists within data management and data analytics.

========================================================

 

 

CA3: STATISTICAL ANALYSIS

Q1: Lift Analysis

Please calculate the following lift values for the table correlating Burger & Chips below:

  • LIFT(Burger, Chips)
  • LIFT(Burger, ^Chips)
  • LIFT(^Burger, Chips)
  • LIFT(^Burger, ^Chips)

Please also indicate if each of your answers would suggest independence, positive correlation, or negative correlation.

 

             Chips   ^Chips   Total (Row)
Burgers        600      400          1000
^Burgers       200      200           400
Total (Col)    800      600          1400

1. LIFT ( Burgers, Chips)

s(Burgers ∪ Chips) = 600/1400 = 3/7 = 0.43

s(Burgers) = 1000/1400 = 5/7 = 0.71

s(Chips) = 800/1400 = 4/7 = 0.57

LIFT(Burgers, Chips) = 0.43/(0.71 × 0.57) = 0.43/0.40 = 1.075

LIFT(Burgers, Chips) > 1 meaning that Burgers and Chips are positively correlated

 

2. LIFT (Burgers, ^Chips)

s(Burgers ∪ ^Chips) = 400/1400 = 2/7 = 0.29

s(Burgers) = 1000/1400 = 5/7 = 0.71

s(^Chips) = 600/1400 = 3/7 = 0.43

LIFT(Burgers, ^Chips) = 0.29/(0.71 × 0.43) = 0.29/0.31 = 0.94

LIFT(Burgers, ^Chips) < 1 meaning that Burgers and ^Chips are negatively correlated

 

3. LIFT (^Burgers, Chips)

s(^Burgers ∪ Chips) = 200/1400 = 1/7 = 0.14

s(^Burgers) = 400/1400 = 2/7 = 0.29

s(Chips) = 800/1400 = 4/7 = 0.57

LIFT(^Burgers, Chips) = 0.14/(0.29 × 0.57) = 0.14/0.17 = 0.82

LIFT(^Burgers, Chips) < 1 meaning that ^Burgers and Chips are negatively correlated

 

4. LIFT (^Burgers, ^Chips)

s(^Burgers ∪ ^Chips) = 200/1400 = 1/7 = 0.14

s(^Burgers) = 400/1400 = 2/7 = 0.29

s(^Chips) = 600/1400 = 3/7 = 0.43

LIFT(^Burgers, ^Chips) = 0.14/(0.29 × 0.43) = 0.14/0.12 = 1.17

LIFT(^Burgers, ^Chips) > 1 meaning that ^Burgers and ^Chips are positively correlated
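
For completeness, here is a minimal R sketch of the same lift calculations, working directly from the counts in the contingency table above (the helper function name is just illustrative). Because no intermediate rounding is applied, the values differ slightly from the hand calculations, but the conclusions about being above or below 1 are the same:

# lift(A, B) = s(A and B) / (s(A) * s(B)), computed from raw counts
lift <- function(n_ab, n_a, n_b, n_total) {
  (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))
}

lift(600, 1000, 800, 1400)   # LIFT(Burgers, Chips)   = 1.05   (> 1, positive)
lift(400, 1000, 600, 1400)   # LIFT(Burgers, ^Chips)  ~ 0.93   (< 1, negative)
lift(200,  400, 800, 1400)   # LIFT(^Burgers, Chips)  = 0.875  (< 1, negative)
lift(200,  400, 600, 1400)   # LIFT(^Burgers, ^Chips) ~ 1.17   (> 1, positive)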

 

Q2. Please calculate the following lift values for the table correlating Ketchup & Shampoo below:

  • LIFT(Ketchup, Shampoo)
  • LIFT(Ketchup, ^Shampoo)
  • LIFT(^Ketchup, Shampoo)
  • LIFT(^Ketchup, ^Shampoo)

Please also indicate if each of your answers would suggest independence, positive correlation, or negative correlation.

 

             Shampoo   ^Shampoo   Total (Row)
Ketchup          100        200           300
^Ketchup         200        400           600
Total (Col)      300        600           900

1. LIFT (Ketchup, Shampoo)

s(Ketchup ∪ Shampoo) = 100/900 = 1/9 = 0.11

s(Ketchup) = 300/900 = 1/3 = 0.33

s(Shampoo) = 300/900 = 1/3 = 0.33

LIFT(Ketchup, Shampoo) = 0.11/(0.33 × 0.33) = 0.11/0.11 = 1

LIFT(Ketchup, Shampoo) = 1 meaning that Ketchup and Shampoo are independent

 

2. LIFT (Ketchup, ^Shampoo)

s(Ketchup ∪ ^Shampoo) = 200/900 = 2/9 = 0.22

s(Ketchup) = 300/900 = 1/3 = 0.33

s(^Shampoo) = 600/900 = 2/3 = 0.67

LIFT(Ketchup, ^Shampoo) = 0.22/(0.33 × 0.67) = 0.22/0.22 = 1

LIFT(Ketchup, ^Shampoo) = 1 meaning that Ketchup and ^Shampoo are independent

 

3. LIFT (^Ketchup, Shampoo)

s(^Ketchup ∪ Shampoo) = 200/900 = 2/9 = 0.22

s(^Ketchup) = 600/900 = 2/3 = 0.67

s(Shampoo) = 300/900 = 1/3 = 0.33

LIFT(^Ketchup, Shampoo) = 0.22/(0.67 × 0.33) = 0.22/0.22 = 1

LIFT(^Ketchup, Shampoo) = 1 meaning that ^Ketchup and Shampoo are independent

 

4. LIFT (^Ketchup, ^Shampoo)

s(^Ketchup ∪ ^Shampoo) = 400/900 = 4/9 = 0.44

s(^Ketchup) = 600/900 = 2/3 = 0.67

s(^Shampoo) = 600/900 = 2/3 = 0.67

LIFT(^Ketchup, ^Shampoo) = 0.44/(0.67 × 0.67) = 0.44/0.44 = 1

LIFT(^Ketchup, ^Shampoo) = 1 meaning that ^Ketchup and ^Shampoo are independent

 

Q3. Chi Squared Analysis

Please calculate the following chi Squared values for the table correlating Burger and Chips below (Expected values in brackets).

  • Burgers & Chips
  • Burgers & Not Chips
  • Not Burgers & Chips
  • Not Burgers & Not Chips

For the above options, please also indicate if each of your answers would suggest independence, positive or negative correlation.

 

             Chips        ^Chips       Total (Row)
Burgers      900 (800)    100 (200)           1000
^Burgers     300 (400)    200 (100)            500
Total (Col)      1200          300            1500

 

Chi-squared = ∑ (observed − expected)² / expected

 

χ² = (900−800)²/800 + (100−200)²/200 + (300−400)²/400 + (200−100)²/100

= 100²/800 + (−100)²/200 + (−100)²/400 + 100²/100

= 10000/800 + 10000/200 + 10000/400 + 10000/100 = 12.5 + 50 + 25 + 100 = 187.5

Burgers & Chips are correlated because χ² > 0.

As the expected value is 800 and the observed value is 900, we can say that Burgers & Chips are positively correlated.

As the expected value is 200 and the observed value is 100, we can say that Burgers & ^Chips are negatively correlated.

As the expected value is 400 and the observed value is 300, we can say that ^Burgers & Chips are negatively correlated.

As the expected value is 100 and the observed value is 200, we can say that ^Burgers & ^Chips are positively correlated.
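
As a quick cross-check, the same statistic can be computed in R, either by hand or with the built-in chisq.test (with the continuity correction disabled so that it matches the manual 2×2 calculation above):

# Observed 2x2 contingency table: rows = Burgers/^Burgers, columns = Chips/^Chips
observed <- matrix(c(900, 300, 100, 200), nrow = 2,
                   dimnames = list(c("Burgers", "^Burgers"), c("Chips", "^Chips")))

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

sum((observed - expected)^2 / expected)          # 187.5, as calculated above

chisq.test(observed, correct = FALSE)$statistic  # same value from the built-in test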

 

Q4: Chi Squared Analysis

Please calculate the following chi squared values for the table correlating burger and sausages below (Expected values in brackets).

  • Burgers & Sausages
  • Burgers & Not Sausages
  • Sausages & Not Burgers
  • Not Burgers and Not Sausages

For the above options, please also indicate if each of your answers would suggest independence, positive correlation, or negative correlation.

             Sausages     ^Sausages    Total (Row)
Burgers      800 (800)    200 (200)           1000
^Burgers     400 (400)    100 (100)            500
Total (Col)      1200          300            1500

 

χ² = (800−800)²/800 + (200−200)²/200 + (400−400)²/400 + (100−100)²/100

= 0²/800 + 0²/200 + 0²/400 + 0²/100 = 0

Burgers & Sausages are independent because χ² = 0.

Burgers & Sausages – observed & expected values are the same (800) – independent

Burgers & ^Sausages – observed & expected values are the same (200) – independent

^Burgers & Sausages – observed & expected values are the same (400) – independent

^Burgers & ^Sausages – observed & expected values are the same (100) – independent

 

Q5: Under what conditions would Lift and Chi Squared analysis prove to be a poor algorithm to evaluate correlation/dependency between two events?

In scenarios with an abundance of null transactions, Lift and Chi-Squared analysis would not be considered good methods to use for correlation/dependency evaluation, since both measures are strongly influenced by the number of transactions that contain neither of the two items.

Q6: Please suggest another algorithm that could be used to rectify the flaw in Lift and Chi Squared? 

  • AllConf
  • Cosine
  • Kulczynski
  • Jaccard
  • MaxConf
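
The measures listed above are null-invariant, i.e. they are unaffected by the number of transactions containing neither item. As a brief, hedged illustration using the Burgers and Chips counts from Q1, the Kulczynski and cosine measures can be computed as follows:

# Null-invariant measures computed from raw counts (Burgers/Chips example from Q1)
n_ab <- 600    # transactions containing both Burgers and Chips
n_a  <- 1000   # transactions containing Burgers
n_b  <- 800    # transactions containing Chips

kulczynski <- 0.5 * (n_ab / n_a + n_ab / n_b)   # 0.675
cosine     <- n_ab / sqrt(n_a * n_b)            # approx. 0.671

# Neither value changes if further transactions containing neither
# Burgers nor Chips are added, unlike lift or chi-squared.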

CA2: TRY R

I’ll run you through the steps taken, from data gathering (data creation) to the final step of producing an orderly graphic using R.

Data Creation

For the purpose of the exercise I have created a datasheet (dummy data) which has data on car brands and their respective shares of the global market.

I have saved the data in .csv format for ease of import into R and to avoid the introduction of any unwanted symbols or characters which would cause problems or errors during the R toolset stages.

Screenshot (not very good .png unfortunately, made a few attempts to sharpen it, no joy…)

List of car manufacturers and their respective market share, in fractional notation.

[Screenshot: CarBrandPerCent.csv contents]

The data file has been saved as <CarBrandPerCent>

There are no gaps in the file title, to facilitate the import-into-R step that will be outlined in the next section of the blog.
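
Since the screenshot did not reproduce well, here is a rough R sketch of how a dummy dataset with the same structure (columns Brand and PercentOfGlobal, with market share as fractions) could be created and saved. The brand names and figures below are purely illustrative and are not the actual values used in the exercise:

# Illustrative dummy data only: not the real figures from the exercise
CarBrandPerCent <- data.frame(
  Brand           = c("Toyota", "Volkswagen", "Ford", "Honda", "Other"),
  PercentOfGlobal = c(0.10, 0.09, 0.06, 0.05, 0.70)
)

# Save as .csv with no row names, ready to be read back into RStudio
write.csv(CarBrandPerCent, "CarBrandPerCent.csv", row.names = FALSE)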

Graphics Construction

R programming language supports all forms of statistical computing and allows the formulation and construction of graphics in relation to data gathered.

Once the required cleansing or formatting of the data is ensured, which is not a complicated task, the data can be accurately and reliably presented and analysed in graphical form.

A preliminary step is always to go to the Session tab and, in the drop-down, choose Set Working Directory. This ensures R is pointed at the folder containing our dataset so it can be pulled into the R session for the necessary analytics and graphical outputs.

Once we have the correct working directory pointing to our session, an essential mandatory step is to ensure we have the <ggplot2> package installed and loaded into our R session.
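
Equivalently, both of these preliminary steps can be done from the console; the path below is just a placeholder for wherever the .csv file was saved:

# Point R at the folder containing CarBrandPerCent.csv (placeholder path)
setwd("C:/path/to/my/data")

# Install ggplot2 once, if it is not already present
install.packages("ggplot2")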

We use the command library(ggplot2) as shown below.

Lines beginning with the # character, and the text that follows them, are included to allow explanatory plain-text comments to be shown in the command window. By using the # character, we ensure this plain text will be ignored by RStudio and so will not interfere with the commands being processed in the R console window.

# Load ggplot library

library(ggplot2)

Our next step is to import the <CarBrandPerCent> data into R.

We use the following commands, including the str function, to inspect how the structure of the data is presented in R.

# Load my Car Brand Market Share data, which is located in CarBrandPerCent.csv

CarBrandPerCent = read.csv("CarBrandPerCent.csv")

str(CarBrandPerCent)

Next step is to set up our Bar Plot with the x and y axes assigned.

The bar plot will show Car Brand Per Cent: the x axis will be Brand and the y axis will be percentage of market share.

# Next step is to make a bar plot with Brand on the X axis

# and Market Share Percentage on the y-axis.

ggplot(CarBrandPerCent, aes(x=Brand, y=PercentOfGlobal)) +

geom_bar(stat="identity") +

geom_text(aes(label=PercentOfGlobal))

We will use the next set of commands to apply an ordered factor to the Car Brand list, and we will also have R translate the fractions appearing in the original datasheet into percentages.

# Make Brand an ordered factor

# This will be possible using the reorder command and the transform command.

CarBrandPerCent = transform(CarBrandPerCent, Brand = reorder(Brand, -PercentOfGlobal))

# Look at the structure

str(CarBrandPerCent)

# Make the percentages out of 100 instead of fractions

CarBrandPerCent$PercentOfGlobal = CarBrandPerCent$PercentOfGlobal * 100

We are now in good shape to create our neatly structured, easily analysed chart from what started out as a list of car brands and market shares. The original data was in list format, with no ordering to show which brand was the market share leader, and we had no supporting graphical view to allow quick analysis either.

The above explanation, along with the associated R scripts, illustrates quite vividly the power of R to process and package data. We have used a small dataset, but the same quick and reliable result could be achieved in a similar timeframe with a far larger dataset using the exact same steps I have outlined.

So, after promoting the strengths and usability of R, here is the plot I created in R from the dataset I originally created at the start of the exercise.

# Make the plot

ggplot(CarBrandPerCent, aes(x=Brand, y=PercentOfGlobal)) +

geom_bar(stat="identity", fill="darkblue") +

geom_text(aes(label=PercentOfGlobal), vjust=-0.4) +

ylab("CarBrandPerCent") +

theme(axis.title.x = element_blank(), axis.text.x = element_text(angle = 45, hjust = 1))

Graphic 1: Car Manufacturers and Global Market Share

[Chart: CarBrandPerCent bar plot]

What information can be gleaned from the dataset

  • At a glance, which car manufacturers have a big market share and which manufacturers have a lesser market share
  • Likewise, the percentage share for each manufacturer is clearly indicated in a very neat and presentable manner, allowing quick and immediate conclusions to be drawn from data that originally was just a list of names and numbers in the .csv file we started with

Other ideas/concepts that could be represented via an R graphic, if time permitted

  • Organise the Car Brand Manufacturers on a geographical basis, create another graphic to compare European brands and their combined market share to Japanese brands and likewise, their combined market share
  • Organise and combine the smaller players into a combined “other manufacturers” category, and create another graphic to illustrate the sheer dominance and strength the top 3 to 4 players have in the global market share

 

CA1: FUSION TABLES

Before we get into the details of the specific exercise on Ireland Census Population 2011 and the associated heatmap, let’s focus on the preliminary questions and requirements.

  1. First up, I’ll give an overview of the benefits of data analytics and data visualisation to an organisation of my choice; in this case I select a large supermarket retail company, e.g. Tesco, Dunnes or SuperValu.
  • Every business transaction the supermarket has with suppliers and customers has a digital record that contains multiple facets of information. Modern data gathering and data analytics tools and methods are enabling businesses to respond more accurately and cleverly to their customers' tastes, requests and expectations
  • Customers are responding positively to supermarkets' ability to fulfil their needs and requirements, actively supporting loyalty card schemes that tailor both in-store and online shopping experiences to the finer details of customer preferences. Again, these deep insights into customers' specific preferences come from gathering data about customers and using data analytics toolsets to deliver tailored shopping packages and more finely tuned in-store shopping experiences.
  • Behind the scenes, the adaptation and correlation of multiple layers and volumes of data into performance dashboards and strategic planning overviews is equipping business owners to make more reasoned, logical and informed decisions about how to maintain and grow the business. As with the previous points, data gathering, data analytics and data visualisation all contribute to helping both the business and the customer experience

2. Analysis of the population data

The capability to conduct analysis of the population data has been significantly enhanced via the use of excel and fusion tables. We’ll explain the technicalities around this aspect shortly.

In the meantime, here are some conclusions and associations we can draw from the analysis and visualisation of the 2011 census: –

  • The main population centres are definitely Dublin, Cork and Galway. This supports the long-held view that the major cities attract the workforce, third-level students and immigrants. In Ireland, the scale of the population gap between the few major cities and the rest of the country is quite extreme and has shown no sign of change since the last century, despite the efforts of governments in providing initiatives to secure investment across a wider cross-section of regional towns and provincial locations. The use of data analytics to conduct deep-dive research and produce effective long-term solutions for the needs of the whole population, with additional focus on rural communities, will hopefully prove beneficial in securing improved standards of living for all of the population and not just the citizens of our major conurbations and cities

How I achieved the HeatMap

The subject chosen for this particular heat map exercise was the CSO Census of Ireland 2011 figures. The data was obtained from www.cso.ie and downloaded to the local C: drive.

The next requirement was to cleanse and reorganise the data to a certain degree, to ensure that it represented the 26 counties of Ireland without breakdown into further administrative units, e.g. Dun Laoghaire (part of Dublin), Waterford City and County, Tipperary North and Tipperary South. All such cases had their populations consolidated to give the total population figure on a pure county-by-county basis. This was an essential, mandatory step for the county-by-county representation (a small sketch of this consolidation step is shown below).
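
I performed this consolidation in Excel, but for illustration the same step could equally be done in R along the following lines; the column names and population figures here are hypothetical placeholders rather than the actual CSO values:

# Hypothetical extract of the downloaded census data (illustrative figures only)
census <- data.frame(
  AdminUnit  = c("Dublin City", "Dun Laoghaire-Rathdown", "Tipperary North", "Tipperary South"),
  County     = c("Dublin", "Dublin", "Tipperary", "Tipperary"),
  Population = c(500000, 200000, 70000, 90000)
)

# Consolidate the administrative units into a single total per county
aggregate(Population ~ County, data = census, FUN = sum)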

The next step was to source a KML file containing the specific coordinates of county borders and boundaries. This is required for the later step in which we merge our cleansed census data with the aforementioned KML file in Google Drive. The Google Fusion Tables app is the software that allows us to create our Ireland Population 2011 heat map.

In Google Drive, using the Google Fusion Tables app, we merge the two data tables. Some corrections to the data in the KML file were necessary: situations where counties had their names mixed up in the information fields were easily corrected. The data was corrected online in Google Drive by checking that the county names in the detailed KML data lists aligned with the primary county name listing.

Interestingly enough, the above issue came to light in the course of addressing another problem: some counties in Ireland were defaulting in Google Fusion Maps to locations elsewhere in the world with exactly the same name. Two such cases were Longford, where a town near Heathrow was chosen by Google Maps for County Longford, and County Clare, for which a town in a northern US state was selected as the default. This was remedied by adding a geolocation hint (Ireland) during the Google Fusion Maps merge step.

Here is the end result of merging the two tables and utilising the features of Google Fusion: –

[Heat map: Ireland population by county, Census 2011]

The information that can be gleaned from the heat map has been outlined in the earlier section on analysis of the population data, so it would not make for good reading to repeat the findings regarding population migration and growth in the three primary locations in Ireland.

Other concepts/ideas that could be represented in the heatmap: –

  • Proliferation of population growth on the east coast of the country; we could achieve a deeper breakdown of population rate differences by adding further ranges to the map (more buckets in Google Fusion, hence more colour representations at finer breakdown views)
  • Concentration and development of many major population centres at port cities, which were in previous times the economic trading hubs and primary industrial centres in the country; hence a legacy issue exists with the poor distribution of population across Ireland