Maven UK train rides

The data which is analyzed here is taken from Maven Analytics Data playground Free Data Sets & Dataset Samples | Maven Analytics . The data set is UK Train rides. Thanks for Maven Analytics for publishing the data sample. I have also answered the four questions in the data playground.

The data is downloaded into Microsoft Access. There I have used SQL queries. Then I made a data model in Excel. The Access database is the source for my data model in Excel.

  1. What are the most popular routes?

    The query:

    SELECT COUNT([Departure Station] & [Arrival Destination]), [Departure Station] & [Arrival Destination]

    FROM Railway

    GROUP BY [Departure Station] & [Arrival Destination]

    ORDER BY COUNT([Departure Station] & [Arrival Destination]) DESC;

    The same result was found with Excel.

    A route is a combination of departure station and arrival destination.

    =CALCULATE(count(Railway[Route]),ALLEXCEPT(Railway, Railway[Route]))

    The most popular route is Manchester Piccadilly – Liverpool Lime Street, the second is the route from London Euston to Birmingham New Street.

    2. What are the peak travel times?

      The query:

      SELECT COUNT([Departure Time]), [Departure Time]

      FROM Railway

      GROUP BY [Departure Time]

      ORDER BY COUNT([Departure Time]) DESC;

      In Excel the DAX is nearly the same as with the first question.

      =CALCULATE(count(Railway[Departure Time]),ALLEXCEPT(Railway, Railway[Departure Time]))

      The most popular travelling times in the morning are 6:30 and 8:00. Looks like 7:00 is not popular time at all. People are leaving for school or for work early 6:30 or bit later 8:00.

      In the evening most popular travelling times are 18:45 and 17:45. There is a difference of one hour. 18:00 is not popular time.

      3. How does revenue vary by ticket types and classes?

      The query:

      SELECT SUM(Price), [Ticket Type], [Ticket Class]

      FROM Railway

      GROUP BY [Ticket Type], [Ticket Class]

      ORDER BY SUM(Price) DESC

      With Excel, the Dax is simple. Just calculate the sum just like in Excel.

      =sum(Railway[Price])

      Standard tickets produce most of the revenue. Advance is selling best and anytime worst among the ticket types.

      4. What is the on-time performance? What are the main contributing factors?

      Query:

      SELECT COUNT([Journey Status]), [Journey Status], ROUND(COUNT(*) *100/ (SELECT count(*) FROM Railway), 1)

      FROM Railway

      GROUP BY [Journey Status];

      The same can be calculated in Dax with following sentences:

      =CALCULATE(count(Railway[Journey Status]),ALLEXCEPT(Railway, Railway[Journey Status]))

      The Dax above can be replaced by this:

      =COUNTAX(Railway, [Journey Status])

      It would be useful to have percentages also. To count the total number of values in Journal status field, we want to have following Dax:

      =CALCULATE(COUNTA(Railway[Journey Status]), ALL(Railway[Journey Status]))

      To divide first with second, we have this Dax.

      =divide([4_1_delay], [4_2_delay])

      Query:

      SELECT COUNT([Reason for Delay]), [Reason for Delay], ROUND(COUNT([Reason for Delay]) *100/ (SELECT count([Reason for Delay]) FROM Railway), 1)

      FROM Railway

      GROUP BY [Reason for Delay]

      ORDER BY COUNT([Reason for Delay]) DESC;

      To count how many times each reason for delay is appearing, we use COUNTAX Dax:

      =COUNTAX(Railway, [Reason for Delay])

      To have the total number of reason of delays, we have:

      =CALCULATE(COUNTA(Railway [Reason for Delay]), ALL(Railway [Reason for Delay]))

      To have a division we need to have DIVIDE command:

      =divide([4_5_reason],[4_6_reason])

      The most typical reason for delay is weather. The are two reason codes weather and weather conditions. The two could be summarized, then we would see that about one third of delays are due to weather.   

      I was using two tools to answer the questions in Maven Analytics playground: Access and Excel. Same answers could be found using both the tools. Data sets are useful as you can practice with the tools you like.  

      Correlation

      Correlation means how strongly two statistical variables are dependent on each other. Graphically, we can see the dependence in XY scatter graph. On Y axis we have the dependent variable (variable to be explained) on X the independent variable (explaining variable).

      If the correlation is between 0,8 and 1 or -0,8 and -1, then the correlation is obvious.

      We have simple data independent and dependent variables.

      First, we calculate the averages.

      Then we need to calculate the difference between the observation and average.

      After that, we multiply the differences.

      The next step is to calculate the squares for differences.

      Then, we need to calculate the sums for differences in columns F, G and H.

      Finally, we calculate the correlation by having multiplication of x and y differences, divided by square root of x and y difference multiplications.

      The correlation in our case is roughly 0,959 which is a high value and there is a statistical dependence between independent and dependent variables.

      To make it easier, you can just use either CORREL or PEARSON functions. Those functions return the same value as we calculated manually.

      One quick way to calculate the correlation is to use data analysis. I have data analysis under data menu, last one in the right.

      Select correlation.

      Enter the input and output ranges.

      The result is still the same.

      Do you still want to have a scatter graph ?

      The result is still the same.

      Do you still want to have a scatter graph ?

      Activate the data and select home – analyze data.

      Scroll down to find the scatter graph.

      Select the dots in the graph and press right mouse button. Select add trendline.

      When you scroll downwards the format trendline, you can select display R-squared value on chart.

      Now we have a trend line. We see that the dots are pretty close to the trend line.

      R square is a square for the correlation.

      R square indicates how many percents of changes in independent variable explains the changes in dependent variable. In our case that is 92 %.