1. Introduction

This post belongs to a new series in which I take on the challenge of learning a new programming language, such as Python, by doing. There is plenty of tutorial material on Python for every level of expertise, and platforms such as Codewars, Topcoder and similar have been around for quite a while now. What I like about these platforms is that the trainee faces many tasks or challenges of increasing difficulty, and the same task is often offered in multiple languages (Python, JavaScript, C++ and many others). Learning by doing has been gaining popularity in recent years because it helps trainees find out how and where they get stuck. Other members' solutions become accessible only after an admissible solution has been successfully submitted to the automatic testing process.

So why should I write a blog post showing how to solve this kind of task, and how could it be of any use to you? In line with the philosophy of the above-mentioned platforms, I encourage you to read each example description, step away from the blog post, try to work it out by yourself (without surfing Stack Overflow too much!) and then come back to the post to go through the proposed solution (I tend to provide more than one) and the related comments. I have indeed sometimes struggled to fully understand other members’ solutions when they came with no comments or details of the logic behind the code.

In the previous two posts (Part1 and Part2), we developed Python code from scratch to determine the distance travelled by a bike rider in a given time span, for a fixed and for an increasing annual distance target.

We now move to text processing: how to extract initials from a name string in Python?

2. Description

James is responsible for developing a function for a customer that takes any employee's full name and automatically extracts the initials to append to the signature of every email sent by that employee. To this end, he needs to write a function that takes the two space-separated words that make up the full name and returns the two capital initials separated by a dot.

For instance, the input full name James Bond should be returned as J.B (with no trailing dot).

3. Input data

Here is one example of the input name data.

import sys, random
import numpy as np
import pandas as pd
fullNames = ["James Bond", "Bruce Wayne", "Bruce Banner", "Peter Parker", "Mary Poppins"]
print('It consists of {} names'.format(len(fullNames)))
It consists of 5 names

The next step is to build a random set of test cases and run each of the developed functions on it.

We use the random selector choice from the random library (see the official site and this Stackoverflow answer).

It returns a random element from the non-empty input sequence fullNames. We convert the 2-word name to a list with split() and take the first element for the first name firstName. We repeat the same procedure one more time and take the last element as the last name lastName.

Nsample = 50
firstName = random.choice(fullNames).split()[0]
lastName = random.choice(fullNames).split()[-1]
firstName + " " + lastName
'Bruce Parker'
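
Because random.choice draws different names on every run, the output above will generally differ on your machine. Here is a minimal sketch, assuming you want repeatable test data: it uses a dedicated random.Random instance with a fixed seed. The seed value 42 and the helper name make_name are illustrative choices, not part of the original code.

```python
import random

fullNames = ["James Bond", "Bruce Wayne", "Bruce Banner", "Peter Parker", "Mary Poppins"]

def make_name(rng, names):
    """Combine a random first name with a random last name drawn from `names`."""
    firstName = rng.choice(names).split()[0]
    lastName = rng.choice(names).split()[-1]
    return firstName + " " + lastName

# Two generators built with the same (arbitrary) seed yield the same sequence.
rng_a = random.Random(42)
rng_b = random.Random(42)
sample_a = [make_name(rng_a, fullNames) for _ in range(5)]
sample_b = [make_name(rng_b, fullNames) for _ in range(5)]
print(sample_a == sample_b)  # True
```

Keeping the generator in its own object, rather than calling random.seed globally, leaves the rest of the notebook's random state untouched.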

We generate Nsample = 50 test cases and place them into testSet. Since this is an entry-level post, I propose two methods to generate such a dataset: 1. a more verbose procedure, where a list is initialized and the append method is called within a for-loop; 2. a more concise one, where the same logic is compressed into a single line of code, referred to as a list comprehension.

In both cases, the for-loop is only used to generate multiple instances, and the current iteration index is never needed. By convention, we name the loop variable with an underscore _ to signal that the value coming from range is intentionally unused.

3.1 Method 1

testSet = [] # list initialization
for _ in range(Nsample):
    inputExample = random.choice(fullNames).split()[0] + " " + random.choice(fullNames).split()[-1]
    testSet.append(inputExample)

strLenMin = min(len(name) for name in testSet)
strLenMax = max(len(name) for name in testSet)
print('Test set length is {}, while the minimum/maximum example length is {}/{}, respectively'\
      .format(len(testSet), strLenMin, strLenMax))
Test set length is 50, while the minimum/maximum example length is 9/13, respectively

3.2 Method 2

Let’s implement the same concept in a more concise way.

testSet = [random.choice(fullNames).split()[0] + " " + random.choice(fullNames).split()[-1] for _ in range(Nsample)]

strLenMin = min(len(name) for name in testSet)
strLenMax = max(len(name) for name in testSet)
print('Test set length is {}, while the minimum/maximum example length is {}/{}, respectively'\
      .format(len(testSet), strLenMin, strLenMax))
Test set length is 50, while the minimum/maximum example length is 9/13, respectively

A short note on the Python format method. In this tutorial it takes some variable values as arguments and substitutes them for the curly brackets in the string passed to the print function. More advanced string formatting is also possible. Please refer to the official documentation and this super nice guide.
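
To illustrate the note on format, here is a small sketch of a few common variations. The variable names and numbers are made up for the example, and the f-string line assumes Python 3.6 or later.

```python
count = 50
shortest, longest = 9, 13

# Positional placeholders are filled in the order of the arguments.
msg1 = 'Test set length is {}, min/max length is {}/{}'.format(count, shortest, longest)

# Named placeholders make long templates easier to read.
msg2 = 'min={lo}, max={hi}'.format(lo=shortest, hi=longest)

# A format spec after the colon controls the rendering, e.g. one decimal place.
msg3 = 'Final score is {:.1f}%'.format(100 * 49 / 50)

# f-strings embed the expression directly in the string literal.
msg4 = f'Test set length is {count}'

for msg in (msg1, msg2, msg3, msg4):
    print(msg)
```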

4. Real data names

We get a more comprehensive list of names from here. At this stage, we assume that a surname can be used as a first name as well. We read the CSV file in with the Pandas method read_csv. It returns a dataframe, df, which can be seen as a table with as many rows as there are records in the file and as many columns as there are comma-separated fields in the CSV file itself. We can easily have a look at the first 5 rows of the dataframe with head().

df = pd.read_csv('names/app_c.csv')
df.head()
name rank count prop100k cum_prop100k pctwhite pctblack pctapi pctaian pct2prace pcthispanic
0 SMITH 1 2376206 880.85 880.85 73.35 22.22 0.40 0.85 1.63 1.56
1 JOHNSON 2 1857160 688.44 1569.30 61.55 33.80 0.42 0.91 1.82 1.50
2 WILLIAMS 3 1534042 568.66 2137.96 48.52 46.72 0.37 0.78 2.01 1.60
3 BROWN 4 1380145 511.62 2649.58 60.71 34.54 0.41 0.83 1.86 1.64
4 JONES 5 1362755 505.17 3154.75 57.69 37.73 0.35 0.94 1.85 1.44

We are only interested in the actual names, each formatted with an uppercase first letter and lowercase letters from the second character on. We drop in place (inplace=True) each dataframe row where the name column is empty (see the official site and this Stackoverflow answer).

The column of interest, name, is extracted with either the df['name'] or the df.name syntax. The former tends to be preferred in the community. Either one returns a Series object, the Pandas object for single-column data, which can be converted to a Python list with tolist().

#names = df.name.tolist()
df.dropna(subset=['name'], inplace=True)
names = df['name'].tolist()

Now, the following list comprehension takes each name element and capitalizes it, i.e., it creates a new element with the first letter unchanged (uppercase) and the remaining letters ([1:]) as lowercase.

longNameList = [name[0] + name[1:].lower() for name in names]
name2show = 5
print('The first {} names are: {}'.format(name2show, longNameList[:name2show]))
print('Name list length: {}'.format(len(longNameList)))
The first 5 names are: ['Smith', 'Johnson', 'Williams', 'Brown', 'Jones']
Name list length: 151670
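
The capitalization step above can also be written with the built-in str.capitalize() method. The sketch below uses a three-name sample of the surnames shown earlier, and it highlights the one case where the two approaches differ: the manual slicing leaves the first letter unchanged, while capitalize() forces it to uppercase.

```python
names = ['SMITH', 'JOHNSON', 'WILLIAMS']  # small sample of the uppercase surnames

manual = [name[0] + name[1:].lower() for name in names]
builtin = [name.capitalize() for name in names]
print(manual)             # ['Smith', 'Johnson', 'Williams']
print(manual == builtin)  # True

# The two approaches differ when the first letter is not already uppercase.
lower_manual = 'smith'[0] + 'smith'[1:].lower()
lower_builtin = 'smith'.capitalize()
print(lower_manual, lower_builtin)  # smith Smith
```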

We generate the final dataset by selecting names from the 151k name list.

testSet = [random.choice(longNameList) + " " + random.choice(longNameList) for _ in range(Nsample)]

strLenMin = min(len(name) for name in testSet)
strLenMax = max(len(name) for name in testSet)
print('Test set length is {}, while the minimum/maximum example length is {}/{}, respectively'\
      .format(len(testSet), strLenMin, strLenMax))
Test set length is 50, while the minimum/maximum example length is 10/21, respectively

5. Reference function

This section is meant to define the procedure to calculate the output in the most basic Pythonic way, to be readable and 100% correct.

We take the first example, convert its characters to uppercase with the upper() method, split the new string on the single space character with split(' ') into a list of words, and unpack this 2-item list into the first and last variables, each of which is again a string. The last step is to take the first character of each name string and concatenate the two with the dot character ('.'). Concatenating strings is as simple as using the + operator.

example = testSet[0]
uppercaseName = example.upper()
names = uppercaseName.split(' ')
first, last = names
initials = first[0] + '.' + last[0]
print('The initials of the full name {} are {}!'.format(example, initials))
The initials of the full name Tinajero Rieg are T.R!

We embed this code into the reference function refFun(). The value that the function outputs is specified in the return statement.

def refFun(example):
    uppercaseName = example.upper()
    names = uppercaseName.split(' ')
    first, last = names
    initials = first[0] + '.' + last[0]
    return initials
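
The reference function relies on the input being exactly two words separated by a single space. The sketch below repeats refFun and probes that assumption. The extra inputs (a double space and a three-word name) are illustrative and do not occur in the test set of this post.

```python
def refFun(example):
    """Reference function from the post: initials of a two-word name."""
    uppercaseName = example.upper()
    names = uppercaseName.split(' ')
    first, last = names
    return first[0] + '.' + last[0]

print(refFun('James Bond'))       # J.B

# split(' ') keeps an empty string between two consecutive spaces...
print('James  Bond'.split(' '))   # ['James', '', 'Bond']
# ...whereas split() with no argument collapses any run of whitespace.
print('James  Bond'.split())      # ['James', 'Bond']

# Unpacking into exactly two variables fails for any other number of items.
rejected = []
for name in ['James  Bond', 'Peter Benjamin Parker']:
    try:
        refFun(name)
    except ValueError:
        rejected.append(name)
print(rejected)  # ['James  Bond', 'Peter Benjamin Parker']
```

For the randomly generated two-word names used here this strictness is harmless, and it even acts as a cheap check on the input format.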

6. Output data

We need to repeat the process defined in the above function to get the output for the entire testSet.

testResults = []

for example in testSet:
    testResults.append(refFun(example))

We show the results for the first 10 dataset inputs.

for kk in range(10):
    example, result = testSet[kk], testResults[kk]
    print('='*60)
    print('The initials of the full name {} are {}!'.format(example, result))
============================================================
The initials of the full name Tinajero Rieg are T.R!
============================================================
The initials of the full name Ana Fengel are A.F!
============================================================
The initials of the full name Palenske Bruening are P.B!
============================================================
The initials of the full name Khatun Primrose are K.P!
============================================================
The initials of the full name Mackedanz Goranson are M.G!
============================================================
The initials of the full name Cure Kindoll are C.K!
============================================================
The initials of the full name Wynings Brautigan are W.B!
============================================================
The initials of the full name Hornick Wollam are H.W!
============================================================
The initials of the full name Tutino Schwertner are T.S!
============================================================
The initials of the full name Alvera Shehu are A.S!

7. Function development

Since the input is a string, we can feed the whitespace-separated word list (example.split()) directly to a generator expression. For each word we take the first letter only, convert it to uppercase and pass this generator to the join method, which concatenates the generated letters with '.' in between. This single line of code is more practical for names with more words, whereas the unpacking method requires the developer to define as many new variables as there are words.

initials = '.'.join(wd[0].upper() for wd in example.split())
print('The initials of the full name {} are {}'.format(example, initials))
The initials of the full name Alvera Shehu are A.S
def fun1(example):
    return '.'.join(wd[0].upper() for wd in example.split())

A very similar approach converts the final concatenated string to uppercase, instead of repeating this step for each character within the generator expression.

initials = '.'.join(wd[0] for wd in example.split()).upper()
print('The initials of the full name {} are {}'.format(example, initials))
The initials of the full name Alvera Shehu are A.S
def fun2(example):
    return '.'.join(wd[0] for wd in example.split()).upper()
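
As a quick check of the claim that the join-based functions scale to longer names, the sketch below runs both of them on a few inputs. The three-word, lowercase and irregularly spaced names are illustrative and not part of the test set.

```python
def fun1(example):
    return '.'.join(wd[0].upper() for wd in example.split())

def fun2(example):
    return '.'.join(wd[0] for wd in example.split()).upper()

samples = ['James Bond', 'peter benjamin parker', '  Mary   Poppins ']
results = [(fun1(name), fun2(name)) for name in samples]
for name, (r1, r2) in zip(samples, results):
    print(repr(name), '->', r1, r2)
# 'James Bond' -> J.B J.B
# 'peter benjamin parker' -> P.B.P P.B.P
# '  Mary   Poppins ' -> M.P M.P
```

Because split() with no argument ignores leading, trailing and repeated whitespace, both functions also tolerate sloppy spacing.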

8. Function assessment

In this last section of the very first task, we compare the performance of every developed function with the benchmark results. The final score is the percentage of successful tests obtained with that function.

As always, I propose two methods to get the score: an explicit for-loop and a single-line generator expression.

score = 0
for example, result in zip(testSet, testResults):
    if fun1(example) == result:
        score += 1

score /= len(testSet)/100
print('Final score for function 1 is {}%.'.format(score))
Final score for function 1 is 100.0%.
score = sum(1 for example, result in zip(testSet, testResults) if fun1(example) == result)/len(testSet)*100
print('Final score for function 1 is {}%.'.format(score))
Final score for function 1 is 100.0%.

We define a compact scoring function with a lambda expression, which creates a function object in a single expression.

This scoring function takes a function as input and returns its score over the test set.

scoring = lambda fun: sum(1 for example, result in zip(testSet, testResults) if fun(example) == result)/len(testSet)*100
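
The lambda above reads testSet and testResults from the enclosing scope. As an alternative sketch, the same logic can be written as a regular function that receives the data explicitly. The parameter names inputs and expected, the four-name sample and the deliberately wrong function broken are all hypothetical.

```python
def scoring(fun, inputs, expected):
    """Return the percentage of inputs for which fun(input) equals the expected output."""
    hits = sum(1 for example, result in zip(inputs, expected) if fun(example) == result)
    return hits / len(inputs) * 100

def fun1(example):
    return '.'.join(wd[0].upper() for wd in example.split())

def broken(example):
    # Deliberately wrong: returns only the first character of the name.
    return example[0]

inputs = ['James Bond', 'Bruce Wayne', 'Mary Poppins', 'Peter Parker']
expected = ['J.B', 'B.W', 'M.P', 'P.P']

score_good = scoring(fun1, inputs, expected)
score_bad = scoring(broken, inputs, expected)
print(score_good)  # 100.0
print(score_bad)   # 0.0
```

Passing the data as arguments makes the function reusable with a different test set and easier to test on its own.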

We finally place the two developed functions into a list and print the score associated with each of them. The enumerate iterator yields the 0-based index together with the corresponding element of the input list.

funs = [fun1, fun2]
for kk, fun in enumerate(funs):
    print('Final score for function {} is {}%.'.format(str(kk+1), scoring(fun)))
Final score for function 1 is 100.0%.
Final score for function 2 is 100.0%.

We finally compare the performance of each function in terms of computational time. We use one of the built-in magic commands available in the Jupyter notebook, timeit, which estimates the execution time of a Python statement or expression. One (%timeit) or two (%%timeit) percent signs are required to time a one-line statement or a whole cell, respectively.

The two shorter functions are slower than the reference function. For names with only a couple of words, simple Python unpacking and concatenation do the job nicely and faster. The advantage of the two other functions comes into play for longer names. We can also see that applying the upper() method only once to the final string (fun2) reduces the time to process the 50-name test set by about $4\mu s$.

%%timeit
for example in testSet:
    fun1(example)
73.6 µs ± 1.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
for example in testSet:
    fun2(example)
69.4 µs ± 560 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
for example in testSet:
    refFun(example)
42.1 µs ± 965 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
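
Outside a Jupyter notebook the magic commands are not available, but the standard-library timeit module gives comparable measurements. This is a minimal sketch, assuming a small hand-written test set. The repeat and number values are arbitrary, and the absolute timings depend on the machine, so no figures are quoted here.

```python
import timeit

testSet = ['James Bond', 'Bruce Wayne', 'Bruce Banner', 'Peter Parker', 'Mary Poppins']

def refFun(example):
    first, last = example.upper().split(' ')
    return first[0] + '.' + last[0]

def fun1(example):
    return '.'.join(wd[0].upper() for wd in example.split())

def fun2(example):
    return '.'.join(wd[0] for wd in example.split()).upper()

number = 1000  # loops per run (arbitrary choice)
best_us = {}
for fun in (refFun, fun1, fun2):
    # Each run processes the whole test set `number` times; keep the best of 5 runs.
    runs = timeit.repeat(lambda: [fun(example) for example in testSet],
                         repeat=5, number=number)
    best_us[fun.__name__] = min(runs) / number * 1e6
    print('{}: {:.2f} µs per pass over the test set'.format(fun.__name__, best_us[fun.__name__]))
```

Taking the minimum over several runs is a common way to reduce the influence of background load on the measurement.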