## 1. Introduction

This post belongs to a new series of posts where I intend to face the challenge of learning a new programming language, such as Python, **by doing**.
In the first two posts (Part1 and Part2), we develop a Python code from scratch to determine the travelled distance of a bike rider in a given time span, for a fixed and increasing annual distance target.
We started to play with some text processing tasks in Part3.

The new challenge aims at identifying the highest wages from a company employee’s salary list in Python. We have split this post into two sections. Part4 focuses on the function definition and the reference output, while this post tackles the implementation of other methods and the comparison in terms of computational performance.

## 2. Function development

For every function we define a second argument `case`

to determine which kind of output is required out of largest (`case==1`

), second largest (`case==2`

) and n-th largest (`case==3`

) cases.
The third attribute is pertaining to the option to remove duplicates in the input array.
The input is first reduced to a list of unique values if `duplicates`

is true and then cleaned to get rid of `None`

elements.

The first function uses the `nlargest`

method from `heapq`

module.
First we need to specify how many of the top largest items we want to retrieve (`nlargest(4, xx)`

gives the first 4 largest items in `x`

), and then we take the last one.

```
import heapq
```

```
def fun1(example, case=1, duplicates=False):
"""heapq"""
if duplicates:
example = list(set(example))
nn = 1 if case==1 else 2 if case==2 else 4
example = clean(example)
return heapq.nlargest(nn, clean(example))[-1]
```

```
print('The 4-th largest unique item in {} is: {}'.format(example, fun1(example, 3, duplicates=True)))
```

```
The 4-th largest unique item in [83, 'None', 107, 'None', 123, 46, 'None', 'None', 127, 75, 108, 51, 129, 'None', 126, 147, 97, 84, 112, 87, 41, 80, 119, 89, 65] is: 126
```

The second function takes the maximum item with Python `max`

method to save it to `maxVal`

.
If we look for n-th largest item where `n>=2`

, we keep repeating the process of removing the current maximum value and search for the new maximum one more time.
This is carried out in a for-loop for `n-1`

times.
The final value of `maxVal`

is then returned.

```
def fun2(example, case=1, duplicates=False):
"""max-remove"""
if duplicates:
example = list(set(example))
nn = 1 if case==1 else 2 if case==2 else 4
example = clean(example)
maxVal = max(example)
for _ in range(nn-1):
example.remove(maxVal)
maxVal = max(example)
return maxVal
```

```
print('The 4-th largest unique item is: {}'.format(fun2(example, 3, duplicates=True)))
```

```
The 4-th largest unique item is: 126
```

The third, fourth and fifth functions use the `sort`

method from Numpy and native Python, respectively and Python `sorted`

method.
The only point is that the Python `sort`

works in-place, so the code gets a bit longer.
The final step is to retrieve the `-n`

element since the list/array is sorted in an ascending fashion.

```
def fun3(example, case=1, duplicates=False):
"""np-sort"""
if duplicates:
example = list(set(example))
nn = 1 if case==1 else 2 if case==2 else 4
return np.sort(clean(example))[-nn]
```

```
print('The 4-th largest unique item is: {}'.format(fun3(example, 3, duplicates=True)))
```

```
The 4-th largest unique item is: 126
```

```
def fun4(example, case=1, duplicates=False):
"""list-sort"""
if duplicates:
example = list(set(example))
nn = 1 if case==1 else 2 if case==2 else 4
example = clean(example)
example.sort()
return example[-nn]
```

```
print('The 4-th largest unique item is: {}'.format(fun4(example, 3, duplicates=True)))
```

```
The 4-th largest unique item is: 126
```

```
def fun5(example, case=1, duplicates=False):
"""sorted"""
if duplicates:
example = list(set(example))
nn = 1 if case==1 else 2 if case==2 else 4
return sorted(clean(example))[-nn]
```

```
print('The 4-th largest unique item is: {}'.format(fun5(example, 3, duplicates=True)))
```

```
The 4-th largest unique item is: 126
```

The last function uses the same logic as the second one but by employing Numpy functionalities.
Since the `delete`

function requires the array index to delete, we use `argmax`

instead of `max`

to access the array maximum location.
If we look for n-th largest item where `n>=2`

, we keep repeating the process of removing the current maximum value and search for the new maximum one more time.
This is carried out in a for-loop for `n-1`

times.
The final `index`

is then used to return the input item at that location.

```
def fun6(example, case=1, duplicates=False):
"""np-argmax-delete"""
if duplicates:
example = list(set(example))
nn = 1 if case==1 else 2 if case==2 else 4
example = clean(example)
index = np.argmax(example)
for _ in range(nn-1):
example = np.delete(example, index)
index = np.argmax(example)
return example[index]
```

```
print('The 4-th largest unique item is: {}'.format(fun6(example, 3, duplicates=True)))
```

```
The 4-th largest unique item is: 126
```

## 3. Function assessment

### 3.1 Process correctness

In this last section of this **sorting** task, we compare the performance of every developed function with the benchmark results, both in terms of process correctness and computational time.
The process correctness is evaluated with a final score, which is the percentage of successful tests obtained with that function.
As always, I do propose two methods to get the score, with and without list comprehension.

The computational time is first assessed with the `%%time`

magic command available in Jupyter and then with a method available in the Python `time`

library.
The second method is useful to store intermediate steps into an array for different input sizes and different developed functions.

Every time the results of the first reference function, `refFun_largest`

, are compared to the ones obtained with the developed functions (in the following snippet, `fun1`

is used).
If they match, the score is increased by one unit.
At the end of the for loop, the score is divided by the length of the test samples and converted to a percentage.
Since the division is embedded into the assignment sign, `/=`

, we need to place `/`

in front of `100`

.

```
score = 0
for example, result in zip(testSet, testResults1):
if fun1(example) == result:
score += 1
score /= len(testSet)/100
print('Final score for function 1 is {}%.'.format(score))
```

```
Final score for function 1 is 100.0%.
```

Here a more compact version of the same process. The developed function returns the same results as the reference function did.

```
score = sum(1 for example, result in zip(testSet, testResults1) if fun1(example) == result)/len(testSet)*100
print('Final score for function 1 is {}%.'.format(score))
```

```
Final score for function 1 is 100.0%.
```

We define the scoring function with four arguments, the developed, the reference functions, the case from 1 to 3 to select the largest, second-largest or the n-th largest item, and the duplicate-removing option.
When `case=1`

, we set the `case`

attribute within the developed function equal to 1 and we select the first reference function, `refFun_largest`

to generate the ground-truth result.
Same procedure is used for second (`case=2`

) and n-th (`case=3`

) largest item search.

This `scoring`

function returns its score over the test set.
We place the three reference functions into the `refFuns`

list, while the six developed functions into the `devFuns`

list.

```
def scoring(fun, refFuns, case=1, duplicates=False):
score = 0
for example in testSet:
if fun(example, case=case, duplicates=duplicates) == refFuns[case-1](example, duplicates=duplicates):
score += 1
score /= len(testSet)/100
return score
```

```
refFuns = [refFun_largest, refFun_secondLargest, refFun_nthLargest]
devFuns = [fun1, fun2, fun3, fun4, fun5, fun6]
```

We test the **largest-item** option.

We write a `report`

, which takes the `scores`

list and check it contains only `100`

marks, using the `set`

method.
If the set does not coincide with the `{100}`

set, we check which function failed and return its index, `kk+1`

, via a generator to the `join`

method, which combines into comma-space-separated string.

We got the `report`

string using the powerful `lambda`

operator to create a function anonymously.

```
scores = [scoring(fun, refFuns, case=2) for fun in devFuns]
report = lambda scores: 'All tests have been successfully passed' if set(scores) == {100} else \
'functions ' + ', '.join(kk+1 for kk, ss in enumerate(scores) if ss<100) + 'failed.'
print(report(scores))
```

```
All tests have been successfully passed
```

We test the **second-largest-item** option, where duplicates are first kept and then deleted.

```
print(report(scoring(fun, refFuns, case=2) for fun in devFuns))
```

```
All tests have been successfully passed
```

```
print(report(scoring(fun, refFuns, case=2, duplicates=True) for fun in devFuns))
```

```
All tests have been successfully passed
```

We test the **n-th-largest-item** option, where duplicates are first kept and then deleted.

```
print(report(scoring(fun, refFuns, case=3) for fun in devFuns))
```

```
All tests have been successfully passed
```

```
print(report(scoring(fun, refFuns, case=3, duplicates=True) for fun in devFuns))
```

```
All tests have been successfully passed
```

### 3.2 Computational time

We finally compare the performance of each function in terms of computational time.
We first use one of the built-in magic commands available in the Jupyter notebook, `timeit`

, which estimates the execution time of a Python statement or expression.
One (`%timeit`

) or two (`%%timeit`

) percent signs are required to evaluate one-line statement or one cell, respectively.
It is very practical when a single cell has to be assessed.

The first developed function, which employs the module `Heapq`

, is almost double faster than the fifth developed function, which uses the Python `sorted`

function.

```
Nwages = 10**6
wages = np.random.randint(30, 150, size=(Nwage,)).tolist()
```

```
%%timeit
result = fun1(wages, case=3)
```

```
301 ms ± 21.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

```
%%timeit
result = fun5(wages, case=3)
```

```
581 ms ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Here we want to assess both reference and developed functions for different input sizes and store these timing values to a 2D array (one dimension for functions and one for different input sizes).
Then we can plot as many lines as the number of functions with `plot`

command in `Matplotlib`

, where the vertical axis is the computational time (the lower the better) and the horizontal axis is the input size.

We use the `perf_counter`

method from `time`

library.
It gives the current time when it called.
We call it and append its output to the `timings`

list before and after every execution of the tested functions.

```
from time import perf_counter
```

```
funs = [refFun_nthLargest] + devFuns
Nwages = [10**kk for kk in range(1, 7)]
timings = []
for Nwage in Nwages:
wages = np.random.randint(30, 150, size=(Nwage,)).tolist()
for fun in funs:
timings.append(perf_counter())
if fun.__name__.startswith('ref'):
result = fun(wages)
else:
result = fun(wages, case=3)
timings.append(perf_counter())
```

We now convert the `timings`

list into a 2D Numpy array, by using the `reshape`

method.
It forces the 2D array to have as many rows as the number of different sizes we wanted to test.
We let Numpy figure it out how many columns are required to fill the original content, with the `-1`

argument.

But what we really need is the actual time spent for every function call.
Since every odd row (Python is 0-index!) gives the time at the end of the function call, while every even row gives the time at the beginning of the call, we index every two columns from **second** to last (`[1::2]`

) to get the final timing and we index every two columns from **first** to last but one (`[::2]`

) to get the initial timing.
The simple difference is stored in `deltaTimings`

, whose shape is (6, 7).

```
timings = np.array(timings).reshape(len(Nwages), -1)
deltaTimings = timings[:, 1::2]-timings[:, ::2]
```

```
print('Shape of deltaTimings is {}, we have {} different input sizes and {} functions'\
.format(deltaTimings.shape, len(Nwages), len(funs)))
```

```
Shape of deltaTimings is (6, 7), we have 6 different input sizes and 7 functions
```

We plot the computational time of these 7 functions for different 6 input sizes that range in an exponential fashion from `10^1`

to `10^6`

.
Since Python treat a function as a class instance, we use the built-in attribute `__doc__`

to get access to the description that we inserted at the first line of the function definition itself.

We can only see 6 lines clearly, since `list-sort`

and `sorted`

do completely overlap each other!
The best option for the **n-th largest** case is to use the `bisect-insort`

method defined in the reference function.
`Heapq`

, `max-remove`

and `np-sort`

are quite similar one to the other.
The same idea of `max-remove`

implemented with Numpy functionalities, `np-argmax-delete`

, is a bit slower.
The two overlapping cases, `list-sort`

and `sorted`

, are instead way too much slower.

```
plt.figure()
plt.plot(Nwages, deltaTimings, lw=2.5, ls='--', alpha=.5)
plt.legend([fun.__doc__ for fun in funs])
```

```
<matplotlib.legend.Legend at 0x20533571f60>
```