1. Introduction

This post is part of a new series in which I take on the challenge of learning a new programming language, Python in this case, by doing. There is plenty of Python tutorial material for every level of expertise, and platforms such as Codewars and Topcoder have been around for quite a while. What I like about these platforms is that the trainee faces many tasks, or challenges, of increasing difficulty, and the same task is often offered in multiple languages (Python, JavaScript, C++ and many others). Learning by doing has gained popularity in recent years because it helps trainees find out how and where they can get stuck. Other members’ solutions become accessible only after you have submitted a solution that passes the automatic tests.

So why should I write a blog post showing how to solve these kinds of tasks, and how could it be of any use to you? In line with the philosophy of the above-mentioned platforms, I encourage you to read each example description, step out of the blog post, try to work it out by yourself (without surfing Stack Overflow too much!) and then come back to the post to go through the proposed solution (I sometimes provide more than one) and the related comments. I have indeed sometimes struggled to fully understand other members’ solutions when they come with no comments or details of the logic behind the code.

I will use the following structure for every example:

  1. state the task to solve.
  2. generate the input data to be fed to the function that we need to develop.
  3. write down a rudimentary but surely valid function, F, which serves as our reference case.
  4. generate the output data using the reference function F.
  5. develop one or more solutions, each with different pros and cons.
  6. assess each function’s correctness by comparing its results with the F results on the dataset.

In this post we develop Python code from scratch to determine how many kilometers a bike rider will cover in a given time span.

2. Description

James is a bike rider who challenges himself to cover a very long distance along a river path every year, as part of a life ritual. Some days he rides more than others, depending on the weather, his energy and the road conditions. He is curious how many km he might cover over the next 10 years. He has spent a year marking down his daily progress.

The monthly data are grouped into four seasonal lists.

The total distance of all the trips over one year is used to estimate the distance he might cover in 10 years, as:

$$ D_{tot} = 10\cdot D_{year} $$

The input data structure is a list of the four seasons, each containing the distance covered in every month of that season.

We need to make sure our solution considers all of the nested levels within the input list.

3. Input data

Here is one example of season data. It reads that James rode 737 km in the first winter month.

import sys
import numpy as np
winter = [737, 1244, 776]
spring = [1175, 883, 1596]
summer = [1646, 945, 1364]
autumn = [1353, 605, 1464]
distance = [winter, spring, summer, autumn]
print('distance data for the first year, over the 12 months, grouped into the four seasons:')
print(distance)
distance data for the first year, over the 12 months, grouped into the four seasons:
[[737, 1244, 776], [1175, 883, 1596], [1646, 945, 1364], [1353, 605, 1464]]
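As a quick sanity check of the formula from the Description section, the sample year can be worked through step by step: sum each season, add the four seasonal totals, and multiply by 10. This is only a sketch on the sample data above; the names seasonTotals, dYear and dTot are illustrative and not part of the original code.

```python
# Sample data from this section (km per month, grouped by season)
winter = [737, 1244, 776]
spring = [1175, 883, 1596]
summer = [1646, 945, 1364]
autumn = [1353, 605, 1464]

# Seasonal totals, then the yearly total D_year
seasonTotals = [sum(season) for season in (winter, spring, summer, autumn)]
dYear = sum(seasonTotals)

# 10-year estimate: D_tot = 10 * D_year
dTot = 10 * dYear

print(seasonTotals)  # [2757, 3654, 3955, 3422]
print(dYear)         # 13788
print(dTot)          # 137880
```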

The next step is to generate a set of random test cases and test the different developed functions on it.

We use the random integer generator from the NumPy library. It returns random integers from low (inclusive) to high (exclusive), stored in an array whose shape can be specified with the additional size argument. We convert the 2D NumPy array to a nested list with the tolist() method.

np.random.randint(600, 1200, size=(4, 3)).tolist()
[[692, 1028, 841], [791, 1165, 901], [1063, 645, 926], [1199, 830, 906]]
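Because the data are random, every run produces different numbers. If repeatable test data are wanted, one option is to seed the generator. The sketch below assumes NumPy 1.17 or later, where default_rng and its integers method are available; the seed value 42 is an arbitrary choice for illustration.

```python
import numpy as np

# A seeded generator makes the "random" test data repeatable across runs.
# The seed 42 is an arbitrary value chosen for this illustration.
rng = np.random.default_rng(42)

# integers() follows the same convention as randint():
# low is inclusive, high is exclusive (by default)
sample = rng.integers(600, 1200, size=(4, 3)).tolist()

# A second generator with the same seed reproduces the same data
again = np.random.default_rng(42).integers(600, 1200, size=(4, 3)).tolist()

print(len(sample), len(sample[0]))  # 4 3
print(sample == again)              # True
print(all(600 <= month < 1200 for season in sample for month in season))  # True
```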

We generate Ncases = 10 test cases and place them into testSet. Since this is an entry-level post, I propose two methods to generate such a dataset:

  1. a more verbose procedure, where a list is initialized and the append method is used within a for-loop.
  2. a more concise process, where the same logic is compressed into a single line of code, referred to as a list comprehension.

In both cases, the for-loop is only used to generate multiple instances, and the value of the current iteration step is never needed. By convention, we name the loop variable with an underscore _ to signal that the value coming from the range generator is intentionally unused. Python still assigns it, since _ is an ordinary variable name.
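One detail worth checking is that the underscore is a naming convention rather than special syntax: when run as a script, the loop still binds each value to the name _. The variable lastValue below is introduced only for this illustration.

```python
# The underscore is a regular variable name: the loop still assigns to it
for _ in range(3):
    pass

lastValue = _
print(lastValue)  # 2, the last value produced by range(3)
```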

3.1 Method 1

testSet = []  # list initialization
for _ in range(10):
    inputExample = np.random.randint(600, 1200, size=(4, 3)).tolist()
    testSet.append(inputExample)

print('Test set length is {}, while each example length is {}'.format(len(testSet), len(inputExample)))
Test set length is 10, while each example length is 4

3.2 Method 2

Let’s implement the same concept in a more concise way.

testSet = [np.random.randint(600, 1200, size=(4, 3)).tolist() for _ in range(10)]

print('Test set length is {}, while each example length is {}'.format(len(testSet), len(testSet[0])))
Test set length is 10, while each example length is 4

A short note concerns the Python format method. In this tutorial, it takes variable values as arguments and substitutes them for the curly brackets within the string passed to the print function. More advanced string formatting is possible. Please refer to the official documentation and this super nice guide.
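To give a flavor of what the format method can do beyond plain substitution, here is a small sketch. The variable names and values are illustrative; the last line uses an f-string, which requires Python 3.6 or later.

```python
distance = 110310
average = distance / 120  # 120 months in 10 years

# Positional field, as used throughout this post
plain = 'Total: {} km'.format(distance)
# Format specifiers: thousands separator and two decimals
grouped = 'Total: {:,} km'.format(distance)
rounded = 'Monthly average: {:.2f} km'.format(average)
# Named fields
named = '{name} rode {km} km'.format(name='James', km=distance)
# f-strings express the same substitution inline
inline = f'Total: {distance} km'

print(plain)    # Total: 110310 km
print(grouped)  # Total: 110,310 km
print(rounded)  # Monthly average: 919.25 km
print(named)    # James rode 110310 km
print(inline)   # Total: 110310 km
```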

4. Reference function

This section defines the procedure to calculate the output in the most basic Pythonic way, so that it is readable and 100% correct.

We take the first example and use Python's powerful unpacking syntax to split an N-long list into N distinct variables. We then concatenate these lists into one (year), sum all the elements of the year list, multiply the yearly distance by 10 (years) and print the result.

Concatenating lists is as simple as using the + operator. Increasing a given variable var by some value val can be performed as:

var = var + val

but can be shortened to

var += val

Similar syntax is available for other operators, such as -, * and /.
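The short sketch below runs through these augmented assignments on a number, and also shows one subtlety for lists: + builds a new list, while += extends the existing list in place. The variable names are illustrative.

```python
total = 10
total += 5    # 15
total -= 3    # 12
total *= 2    # 24
total /= 4    # 6.0 (true division always returns a float)
print(total)  # 6.0

# On lists, + builds a new list, while += extends the existing one in place
year = [1, 2] + [3]   # [1, 2, 3]
alias = year          # second name for the same list object
year += [4]           # the change is visible through alias too
print(alias)          # [1, 2, 3, 4]
```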

example = testSet[0]
winter, spring, summer, autumn = example
year = winter + spring + summer + autumn
distance = 0
for month in year:
    distance += month
totDistance = 10*distance
print('Total distance travelled in the first example is {} km.'.format(totDistance))
Total distance travelled in the first example is 110310 km.

We embed this code into the reference function refFun(). The value that the function outputs must be given in the return statement.

def refFun(example):
    winter, spring, summer, autumn = example
    year = winter + spring + summer + autumn
    distance = 0
    for month in year:
        distance += month
    totDistance = 10*distance
    return totDistance
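As a usage sketch, the reference function can be applied to the hand-written sample year from the Input data section. The example also shows a limitation of the unpacking step: it expects exactly four seasons, and any other length raises a ValueError. The variable sample is a name introduced here for illustration.

```python
def refFun(example):
    winter, spring, summer, autumn = example
    year = winter + spring + summer + autumn
    distance = 0
    for month in year:
        distance += month
    totDistance = 10*distance
    return totDistance

# Sample year from the Input data section
sample = [[737, 1244, 776], [1175, 883, 1596], [1646, 945, 1364], [1353, 605, 1464]]
print(refFun(sample))  # 137880

# Unpacking requires exactly four seasons
try:
    refFun(sample[:3])
except ValueError as err:
    print('Invalid input: {}'.format(err))
```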

5. Output data

We need to repeat the process defined in the above function to get the output for the entire testSet.

testResults = []

for example in testSet:
    testResults.append(refFun(example))
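In the spirit of the two methods shown for the input data, the same loop can also be written as a list comprehension. The sketch below uses two small hand-written cases in place of the random test set, so that the expected results are easy to verify by hand; these data are illustrative only.

```python
def refFun(example):
    winter, spring, summer, autumn = example
    year = winter + spring + summer + autumn
    distance = 0
    for month in year:
        distance += month
    return 10*distance

# Two hand-written cases stand in for the random test set (illustrative data)
testSet = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
    [[100, 100, 100]] * 4,
]

# One result per example, in the same order as testSet
testResults = [refFun(example) for example in testSet]
print(testResults)  # [780, 12000]
```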

We show the results for the dataset.

for example, result in zip(testSet, testResults):
    print('='*60 + '\nYear data')
    print(example)
    print('Total 10-year distance for this example is {} km.'.format(result))
============================================================
Year data
[[687, 604, 1075], [827, 960, 1165], [917, 1199, 666], [993, 1055, 883]]
Total 10-year distance for this example is 110310 km.
============================================================
Year data
[[1035, 921, 1077], [1074, 761, 1057], [638, 1066, 1128], [797, 869, 966]]
Total 10-year distance for this example is 113890 km.
============================================================
Year data
[[624, 793, 1179], [833, 1195, 1105], [826, 709, 622], [669, 1048, 1089]]
Total 10-year distance for this example is 106920 km.
============================================================
Year data
[[1152, 802, 816], [941, 794, 980], [650, 822, 1026], [1146, 986, 1006]]
Total 10-year distance for this example is 111210 km.
============================================================
Year data
[[858, 698, 945], [828, 628, 938], [616, 730, 727], [1160, 665, 742]]
Total 10-year distance for this example is 95350 km.
============================================================
Year data
[[1170, 746, 1192], [1095, 852, 867], [836, 661, 914], [932, 670, 1193]]
Total 10-year distance for this example is 111280 km.
============================================================
Year data
[[833, 989, 722], [1079, 957, 860], [790, 801, 667], [1193, 979, 954]]
Total 10-year distance for this example is 108240 km.
============================================================
Year data
[[768, 787, 1040], [1021, 745, 1172], [703, 693, 1017], [931, 846, 816]]
Total 10-year distance for this example is 105390 km.
============================================================
Year data
[[726, 860, 1182], [1088, 1117, 1177], [822, 805, 633], [940, 1199, 988]]
Total 10-year distance for this example is 115370 km.
============================================================
Year data
[[606, 1136, 739], [1101, 828, 1043], [981, 984, 1050], [743, 906, 1109]]
Total 10-year distance for this example is 112260 km.

6. Function development

Since the input is a nested list, we can use two nested for-loops to process the overall distance.

def fun1(example):
    distance = 0
    for season in example:
        for month in season:
            distance += month
    totDistance = distance*10
    return totDistance
print('Total distance, calculated with method 1, is {} km (ground-truth is {}).'.format(fun1(example), result))
Total distance, calculated with method 1, is 112260 km (ground-truth is 112260).

The solution can be rewritten to be more compact and, in the long term, easier to maintain. We use the built-in sum function, which returns the sum of the numerical elements of a list. The nested list is first flattened into a standard list with a list comprehension, in which the top-down order of the original for-loops becomes a left-to-right order.

year = [month for season in example for month in season]
print('Year data in a single list:\n{}'.format(year))
Year data in a single list:
[606, 1136, 739, 1101, 828, 1043, 981, 984, 1050, 743, 906, 1109]
totDistance = sum(year)*10
print('Total 10-year distance is {} km'.format(totDistance))
Total 10-year distance is 112260 km
def fun2(example):
    year = [month for season in example for month in season]
    totDistance = sum(year)*10
    return totDistance
print('Total distance, calculated with method 2, is {} km (ground-truth is {}).'.format(fun2(example), result))
Total distance, calculated with method 2, is 112260 km (ground-truth is 112260).
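Flattening can also be delegated to the standard library. The sketch below uses itertools.chain.from_iterable as an alternative to the comprehension; the function name fun2Chain is hypothetical and not one of the solutions developed in this post. The input is the last example of the test set shown above.

```python
from itertools import chain

# Last example of the test set shown above
example = [[606, 1136, 739], [1101, 828, 1043], [981, 984, 1050], [743, 906, 1109]]

# chain.from_iterable walks through each season in turn, yielding its months
year = list(chain.from_iterable(example))
print(year)

def fun2Chain(example):
    # sum consumes the chained iterator directly, no intermediate list needed
    return sum(chain.from_iterable(example)) * 10

print(fun2Chain(example))  # 112260
```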

A further step towards a single line of code is to feed the flattened data directly into sum, without saving it to an intermediate variable, which saves one assignment.

Please make sure you understand the difference between the following two lines.

distance = sum([month for season in example for month in season])
distance = sum(month for season in example for month in season)

They return the same value, but the first line builds a complete list before feeding it to sum, while the second passes a generator, which yields the values one at a time without storing them all in memory. IMHO, the second approach is much better. We can either save the total distance to a temporary variable and then return it, or directly return the outcome of the single-line expression.

def fun3(example):
    return sum(month for season in example for month in season)*10
print('Total distance, calculated with method 3, is {} km (ground-truth is {}).'.format(fun3(example), result))
Total distance, calculated with method 3, is 112260 km (ground-truth is 112260).
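To make the list versus generator distinction more tangible, the sketch below builds both from the same example and sums each of them twice. The list can be reused, whereas the generator is exhausted after one pass. The names asList and asGen are illustrative.

```python
example = [[606, 1136, 739], [1101, 828, 1043], [981, 984, 1050], [743, 906, 1109]]

asList = [month for season in example for month in season]  # square brackets: a list
asGen = (month for season in example for month in season)   # parentheses: a generator

print(type(asList).__name__)  # list
print(type(asGen).__name__)   # generator

# A list holds all its values and can be summed again and again
listFirst, listSecond = sum(asList), sum(asList)
# A generator produces each value once, so the second pass finds nothing left
genFirst, genSecond = sum(asGen), sum(asGen)

print(listFirst, listSecond)  # 11226 11226
print(genFirst, genSecond)    # 11226 0
```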

A slightly different way to perform this calculation would be to sum the distance for every season and then sum their contribution to get the annual distance.

def fun4(example):
    return sum(sum(season) for season in example)*10
print('Total distance, calculated with method 4, is {} km (ground-truth is {}).'.format(fun4(example), result))
Total distance, calculated with method 4, is 112260 km (ground-truth is 112260).
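Since NumPy is already imported for the test data, a NumPy-based variant is also possible. The sketch below is an additional, hypothetical solution (funNumpy is not part of the original set) and assumes every season has the same number of months, so that the nested list converts to a regular 2D array. For inputs this small, the conversion to an array probably costs more than it saves.

```python
import numpy as np

def funNumpy(example):
    # np.sum converts the nested list to a 4x3 array and adds every element;
    # int() turns the NumPy integer back into a plain Python int
    return int(np.sum(example)) * 10

example = [[606, 1136, 739], [1101, 828, 1043], [981, 984, 1050], [743, 906, 1109]]
print(funNumpy(example))  # 112260
```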

7. Function assessment

In this last section of the very first task, we compare the performance of every developed function with the benchmark results. The final score is the percentage of successful tests obtained with that function.

As always, I propose two methods to get the score: first with an explicit for-loop and then with a single-line generator expression.

score = 0
for example, result in zip(testSet, testResults):
    if fun1(example) == result:
        score += 1

score /= len(testSet)/100
print('Final score for function 1 is {}%.'.format(score))
Final score for function 1 is 100.0%.
score = sum(1 for example, result in zip(testSet, testResults) if fun1(example) == result)/len(testSet)*100
print('Final score for function 1 is {}%.'.format(score))
Final score for function 1 is 100.0%.

We define a compact scoring function with a lambda expression, which creates a function in a single line.

This scoring function takes a function as input and returns its score over the test set.

scoring = lambda fun: sum(1 for example, result in zip(testSet, testResults) if fun(example) == result)/len(testSet)*100
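The same scoring logic can be written as a regular def function, which is what the PEP 8 style guide recommends when a lambda is bound to a name. The sketch below also passes the test data explicitly rather than relying on global variables. The tiny test set and the functions good and bad are hypothetical, chosen so that the scores can be checked by hand.

```python
def scoring(fun, testSet, testResults):
    """Return the percentage of test cases for which fun matches the expected result."""
    passed = sum(1 for example, result in zip(testSet, testResults) if fun(example) == result)
    return passed / len(testSet) * 100

# Two hand-written cases with known 10-year totals (illustrative data)
testSet = [[[1, 2, 3]] * 4, [[10, 20, 30]] * 4]
testResults = [240, 2400]

def good(example):
    return sum(sum(season) for season in example) * 10

def bad(example):
    # deliberately wrong: forgets the factor of 10
    return sum(sum(season) for season in example)

print(scoring(good, testSet, testResults))  # 100.0
print(scoring(bad, testSet, testResults))   # 0.0
```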

We finally place the 4 developed functions into a list and print the score associated with each of them. The enumerate iterator yields the 0-based index and the corresponding element of the input list.

funs = [fun1, fun2, fun3, fun4]
for kk, fun in enumerate(funs):
    print('Final score for function {} is {}%.'.format(str(kk+1), scoring(fun)))
Final score for function 1 is 100.0%.
Final score for function 2 is 100.0%.
Final score for function 3 is 100.0%.
Final score for function 4 is 100.0%.

We finally compare the performance of each function in terms of computational time. We use one of the built-in magic commands available in the Jupyter notebook, timeit, which estimates the execution time of a Python statement or expression. One percent sign (%timeit) times a one-line statement, while two (%%timeit) time an entire cell.

The first and fourth attempts are about as fast as the reference function: all three process the test set in roughly 13 to 15 µs.

%%timeit
for example in testSet:
    fun1(example)
12.9 µs ± 544 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
for example in testSet:
    fun4(example)
14.9 µs ± 317 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
for example in testSet:
    refFun(example)
13.6 µs ± 489 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
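Outside a Jupyter notebook, a similar measurement can be taken with the timeit module from the standard library. The sketch below is a minimal version under a few assumptions: the test set is a fixed hand-written one, the helper runAll is introduced for illustration, and the number of repetitions (10000) is arbitrary. Timings depend on the machine, so no expected value is shown.

```python
import timeit

def fun4(example):
    return sum(sum(season) for season in example)*10

# Fixed stand-in for the random test set (illustrative data)
testSet = [[[700, 800, 900]] * 4 for _ in range(10)]

def runAll():
    # Same body as the %%timeit cells: one pass over the whole test set
    for example in testSet:
        fun4(example)

loops = 10000
elapsed = timeit.timeit(runAll, number=loops)  # total seconds for all loops
print('{:.2f} µs per loop'.format(elapsed / loops * 1e6))
```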