1. Introduction

This post belongs to a new series of posts where I intend to face the challenge of learning a new programming language, such as Python, by doing. There are tons of material about Python tutorials for every level of expertise. It has been quite a while now that platforms such as codewars, topcoders and similar have been around. The nice thing that I like about these platforms is that the trainee is faced with many tasks or challenges of increasing difficulty. The same task is often proposed for multiple languages (Python, Javascript, C++ and many others). Learn by doing is gaining popularity in recent years as it helps the trainee to figure out how and where he/she can get stacked. Solutions of other members are accessible only when an admissible solution has been successfully submitted to the automatic testing process.

So why should I write a blog post to show how to solve such kinds of tasks and how could it be useful to any extend to you? In line with the philosophical view of the above-mentioned platforms, I encourage you to read each single example description, step out of the blog post, try to work it out by yourself (without surfing Stack-overflow too much!) and coming back to the post to go through the proposed solution (something I tend to provide more than one) and the related comments. I have indeed struggled sometimes to fully capture other members’ solutions with no comments or details of the logic behind the code.

In the first two posts (Part1 and Part2), we develop a Python code from scratch to determine the travelled distance of a bike rider in a given time span, for a fixed and increasing annual distance target. We started to play with some text processing tasks in Part3.

The new challenge aims at identifying the highest wages from a company employee’s salary list in Python. We actually develop three different functions to return:

  1. the largest wage
  2. the second largest wage
  3. the N-th largest wage of the list.

We split this post into two sections. This post focuses on the function definition and the reference output, while the next will tackle the implementation of other methods and the comparison in terms of computational performance.

2. Description

James has been told to develop a function for a customer that takes a list of employees’ wages and automatically extract the largest salary in the company. A further step is to retrieve the 2-nd largest salary and then in general the n-th largest salary. A second argument, duplicates, to the function is required to handle duplicates:

  1. if it is True, duplicates have to be kept. The second largest amount of input [100, 50, 100] is therefore 100.
  2. if it is False, duplicates have to be drop and unique values only are kept. The second largest amount of input [100, 50, 100] is therefore 50.

The wage recording file, however, contains also some text inputs, since whenever the salary amount of an employee was not available, the string value None has been recorded. The developer needs then to remove string elements from the list before searching for the largest amounts If the request for the n-th largest salary exceeds the size of available salary values, the textual input 'Not available' has to be returned.

Please, note that the salary amounts are stored as k€. For instance, a salary record of [95, 'None', 95, 40, 50, 60, 150, 'None'] should return 150 as the largest amount, 95 as the 2-nd largest amount, 60 as the 3-rd largest amount if duplicates=False, 'Not available' as 7-th largest amount.

3. Input data

Here we generate one example of wage data. First we create a list of 5 integers, ranging from 30k€ to 150k€. We can use two methods to introduce some None values into the wages list:

  1. iterate through the list, generate a random number from 0 to 1 for each value, if it is less than a given threshold (say 0.1), the numeric value is replaced with None, otherwise it is copied to the new wages list.
  2. choose a small integer value that defines the amount of None elements (say N), append this list of N None elements to the initial wages list and shuffle it. A way to choose N is to take the closest integer to the following ratio $ N = (\delta/(1-\delta)\cdot N_0) $, where $N_0$ is the initial number of integers and $\delta$ is the threshold. That would give a proportion of None values to the total final wages input equal to $\delta$.

First, we import some libraries.

import sys, random
import numpy as np
import pandas as pd
import bisect
import matplotlib.pyplot as plt
%matplotlib inline
Nwage = 20
wages = np.random.randint(30, 150, size=(Nwage,)).tolist()
print(wages)
[129, 124, 57, 83, 80, 100, 116, 94, 44, 138, 103, 55, 141, 105, 58, 106, 145, 92, 37, 85]

3.1 Single sample - method 1

Replace some values with None with probability $\delta = 0.2$.

delta = .2
rawWages = ['None' if random.random()<delta else wage for wage in wages]
print(rawWages)
print('Ratio of None to whole list size is {}'.format(sum(1 for wage in rawWages if wage=='None')/len(rawWages)))
[129, 124, 57, 83, 80, 100, 'None', 94, 44, 138, 103, 55, 141, 105, 58, 'None', 145, 'None', 37, 'None']
Ratio of None to whole list size is 0.2

3.2 Single sample - method 2

Append some values with None with probability $\delta = 0.2$ and shuffle. We need to append $\frac{0.2}{0.8}\cdot 20 = 5$ None elements.

delta = .2
Nnone = int(delta/(1-delta)*len(wages))
rawWages = wages + ['None']*Nnone
random.shuffle(rawWages)
print(rawWages)
print('Ratio of None to whole list size is {}'.format(sum(1 for wage in rawWages if wage=='None')/len(rawWages)))
[145, 141, 103, 57, 44, 'None', 94, 92, 116, 129, 85, 55, 100, 105, 'None', 106, 'None', 58, 83, 'None', 80, 124, 'None', 138, 37]
Ratio of None to whole list size is 0.2

To avoid the in-place feature of shuffle, we can also use sample from the same library by giving the length of the initial list as an additional argument. This is useful to implement the list-comprehension version of the test set generation.

3.3 Sample set

We generate Nsample = 10 test cases and place them into testSet. Since it is an entry-level post, I propose two methods to generate such a dataset:

  1. a more verbose procedure, where list is initialized, the append method is used within a for-loop.
  2. a more concise process, where the same logic is compressed into a single line of code, referred to as list comprehension.

In both cases, the for-loop is just used to generate multiple instances, but the current iteration step is useless. We use the underscore _ to say to Python not to allocate that value, which is coming from the range generator in this case, into any variable.

Sample set - method 1

Nsample = 50
testSet = [] # list inizialiation
for _ in range(Nsample):
    wages = np.random.randint(30, 150, size=(Nwage,)).tolist()
    rawWages = wages + ['None']*Nnone
    random.shuffle(rawWages)
    testSet.append(rawWages)

print('Test set length is {}, while each example length is {}'.format(len(testSet), len(testSet[0])))
Test set length is 50, while each example length is 25

Sample set - method 2

The second method offers a more concise way.

testSet = [random.sample(np.random.randint(30, 150, size=(Nwage,)).tolist()+['None']*Nnone, Nwage+Nnone) for _ in range(Nsample)]
print('Test set length is {}, while each example length is {}'.format(len(testSet), len(testSet[0])))
Test set length is 50, while each example length is 25

A short note concerns the Python format method. In this tutorial this method is used to take some variable values as arguments and replace the curly brackets within the string input to the print function, with such values. Further advanced results and string formatting can be achieved. Please refer to the official documentation and this super nice guide.

4. Reference function

This section is meant to define the procedure to calculate the output in the most basic Pythonic way, to be readable and 100% correct.

We take the first example, we keep the wage instance in the list only if it is not the None string. We then can assign the first element of the list to the maxVal variable, which is meant to store the highest wage so far ever encountered. The next step is to iterate from second item on (index is 1) in a for-loop and compare each list value to the current maximum value maxVal with a if statement. Whenever it is true, the maxVal variable is updated. This procedure is proportional to the list size. It means that doubling the list size requires double effort and time. The notation that is commonly used for the time complexity behaviour is the big O. In case of linear time, it becomes O(n).

4.1 Largest wage in the list

example = testSet[0]
example = [wage for wage in example if wage!='None']
print(example)
[56, 80, 98, 134, 53, 68, 148, 72, 49, 84, 52, 48, 110, 80, 129, 125, 138, 71, 135, 122]
maxVal = example[0]
for val in example[1:]:
    if val > maxVal:
        maxVal = val
print('The largest integer in the first sample is {}'.format(maxVal))
The largest integer in the first sample is 148

We embed this code into the reference function largestRef(). The output function variable has to be declared within the return statement. Since the first step is to clean the wage list from the None string, which is common for every reference function, we introduce a dedicated function for that as well. We need to add a second attribute, duplicates, also for this first case to make the function structure more general. However, it does not obviously affect the outcome even if it is set to True.

In general, we need to make sure that the input is not empty to prevent Python returning an error when we index for the first element to initialize maxVal. In that case, we return None.

def clean(example):
    return [wage for wage in example if wage!='None']

def refFun_largest(example, duplicates=False):
    """ref-linear"""
    example = clean(example)
    if not example:
        return None
    maxVal = example[0]
    for val in example[1:]:
        if val > maxVal:
            maxVal = val
    return maxVal

4.2 Second largest wage in the list

The second step is to search for the second largest wage. This task requires to specify how to handle duplicates. If the function argument duplicates is set to True, we use set() to create a set of unique wage values and convert it to list. Then we make sure the input list contains at least two values, otherwise we return None. Finally we implement the code to perform the actual task. To this end, we use two placeholders, mv1 for the current largest item, mv2 for the current 2nd largest item. We take the first two items of the input list, we sort in a reversed fashion and assign those values to mv1 and mv2. We then iterate over the input list from the third element on. Whenever an element val is greater than mv2, we have two options:

  1. Such an element is also greater than mv1, so we have to overwrite mv1 with the element val and mv2 with mv1 (what was previously stored in mv1 becomes the second largest value).
  2. Such an element is less than mv1, so we have to simply overwrite mv2 with the element val.

This is implemented in the two below snippets.

duplicates = True
if duplicates:
    values = list(set(example))

print('Input list with no duplicates: {}'.format(values))
Input list with no duplicates: [129, 98, 68, 134, 71, 72, 135, 138, 110, 80, 49, 48, 148, 53, 84, 52, 56, 122, 125]
mv1, mv2 = sorted(values[:2], reverse=True)
for val in example[2:]:
    if val > mv2:
        if val >= mv1:
            mv1, mv2 = val, mv1
        else:
            mv2 = val
print('The 2nd largest integer in the first sample is {}'.format(mv2))
The 2nd largest integer in the first sample is 138

We embed this code into the reference function secondLargestRef().

def refFun_secondLargest(example, duplicates=False):
    """ref-linear"""
    example = clean(example)
    if duplicates:
        example = list(set(example))
    if len(example) < 2:
        return None
    
    mv1, mv2 = sorted(example[:2], reverse=True)
    for val in example[2:]:
        if val > mv2:
            if val >= mv1:
                mv1, mv2 = val, mv1
            else:
                mv2 = val
    
    return mv2

4.3 N-th largest wage in the list

The final step is to search for the n-th largest wage. This task requires to specify how to handle duplicates and n itself.

Then we make sure the input list contains at least n values, otherwise we return None. Finally we implement the code to perform the actual task. To this end, we use the list nLargestVals as a container of the current n-th largest values encountered so far by the algorithm.

We take the first n items of the input list, we sort in an increasing fashion and assign those values to nLargestVals. We then iterate over the input list from the n+1-th element on. Whenever an element val is greater than the first element of nLargestVals (smallest value between the current n-th largest values), we insert this value into the sorted nLargestVals with insort method from the bisect library, which helps to keep the list sorted, and we delete the first element of the updated list. At the end of the for-loop, we return the first element of the nLargestVals.

print('Input list with no duplicates: {}'.format(values))
Input list with no duplicates: [129, 98, 68, 134, 71, 72, 135, 138, 110, 80, 49, 48, 148, 53, 84, 52, 56, 122, 125]
n = 4
nLargestVals = sorted(values[:n])
for val in values[n:]:
    print(nLargestVals)
    if val > nLargestVals[0]:
        bisect.insort(nLargestVals, val)
        del nLargestVals[0]

print('The first n largest integers are {}'.format(nLargestVals))
print('The n-th largest integer in the first sample is {}'.format(nLargestVals[0]))
[68, 98, 129, 134]
[71, 98, 129, 134]
[72, 98, 129, 134]
[98, 129, 134, 135]
[129, 134, 135, 138]
[129, 134, 135, 138]
[129, 134, 135, 138]
[129, 134, 135, 138]
[129, 134, 135, 138]
[134, 135, 138, 148]
[134, 135, 138, 148]
[134, 135, 138, 148]
[134, 135, 138, 148]
[134, 135, 138, 148]
[134, 135, 138, 148]
The first n largest integers are [134, 135, 138, 148]
The n-th largest integer in the first sample is 134

A short example of how insort works is shown here, where we append 101 to the sorted 5-integer list-

tmp = sorted(np.random.randint(30, 150, size=(5,)).tolist())
print(tmp)
[43, 90, 125, 139, 142]
bisect.insort(tmp, 101)
print(tmp)
[43, 90, 101, 125, 139, 142]

We embed this code into the third reference function nthLargestRef().

def refFun_nthLargest(example, n=4, duplicates=False):
    """bisect-insort"""
    example = clean(example)
    if duplicates:
        example = list(set(example))
    if len(example) < n:
        return None
    
    nLargestVals = sorted(example[:n])
    for val in example[n:]:
        if val > nLargestVals[0]:
            bisect.insort(nLargestVals, val)
            del nLargestVals[0]
    
    return nLargestVals[0]

5. Output data

We need to repeat the process defined in the above function to get the output for the entire testSet. Since we have developed three concepts, namely the largest, the second largest and the n-th largest item in a list, we pass each input sample to the three functions and store the result to three separate lists.

testResults1, testResults2, testResultsN = [], [], []

for example in testSet:
    testResults1.append(refFun_largest(example))
    testResults2.append(refFun_secondLargest(example))
    testResultsN.append(refFun_nthLargest(example))

We show the results for the first 10 dataset inputs.

for kk in range(10):
    example, result1, result2, resultN = testSet[kk], testResults1[kk], testResults2[kk], testResultsN[kk]
    print('='*60)
    print('In the following wage list\n{}\nthe largest, second largest and n={}-th largest wage are\n{}, {} and {}'\
          .format(example, 4, result1, result2, resultN))
============================================================
In the following wage list
[56, 80, 98, 134, 53, 68, 148, 72, 49, 'None', 'None', 84, 52, 48, 110, 80, 129, 125, 138, 71, 'None', 135, 'None', 122, 'None']
the largest, second largest and n=4-th largest wage are
148, 138 and 134
============================================================
In the following wage list
[47, 'None', 138, 87, 78, 98, 105, 144, 36, 124, 129, 31, 'None', 71, 43, 'None', 'None', 50, 33, 90, 114, 'None', 60, 117, 66]
the largest, second largest and n=4-th largest wage are
144, 138 and 124
============================================================
In the following wage list
[107, 32, 104, 56, 54, 95, 'None', 32, 68, 116, 'None', 148, 124, 74, 81, 62, 'None', 132, 'None', 144, 69, 127, 38, 122, 'None']
the largest, second largest and n=4-th largest wage are
148, 144 and 127
============================================================
In the following wage list
['None', 100, 'None', 125, 42, 79, 119, 47, 'None', 47, 124, 148, 46, 'None', 100, 'None', 124, 109, 75, 36, 102, 133, 51, 76, 128]
the largest, second largest and n=4-th largest wage are
148, 133 and 125
============================================================
In the following wage list
[85, 148, 112, 'None', 118, 114, 'None', 104, 120, 66, 89, 107, 132, 46, 50, 59, 'None', 31, 96, 'None', 'None', 49, 47, 94, 116]
the largest, second largest and n=4-th largest wage are
148, 132 and 118
============================================================
In the following wage list
[87, 'None', 103, 148, 96, 101, 52, 'None', 63, 144, 64, 73, 82, 100, 31, 74, 'None', 79, 57, 'None', 44, 86, 68, 'None', 124]
the largest, second largest and n=4-th largest wage are
148, 144 and 103
============================================================
In the following wage list
[83, 'None', 82, 114, 89, 70, 113, 135, 39, 123, 139, 53, 82, 84, 73, 139, 96, 'None', 52, 'None', 67, 131, 'None', 83, 'None']
the largest, second largest and n=4-th largest wage are
139, 139 and 131
============================================================
In the following wage list
[40, 103, 54, 91, 'None', 'None', 111, 147, 73, 100, 109, 76, 'None', 77, 31, 32, 'None', 91, 68, 80, 72, 112, 103, 'None', 59]
the largest, second largest and n=4-th largest wage are
147, 112 and 109
============================================================
In the following wage list
[94, 143, 139, 'None', 100, 30, 'None', 73, 'None', 'None', 'None', 94, 51, 101, 77, 131, 86, 62, 68, 121, 67, 87, 49, 59, 97]
the largest, second largest and n=4-th largest wage are
143, 139 and 121
============================================================
In the following wage list
[84, 87, 140, 'None', 95, 128, 116, 44, 106, 100, 145, 'None', 143, 79, 'None', 99, 133, 91, 57, 114, 'None', 'None', 89, 107, 117]
the largest, second largest and n=4-th largest wage are
145, 143 and 133