Getting started¶

Overview¶

The central class in datapad is datapad.Sequence. This class provides a fluent, method-chaining API for manipulating any sequence-like object. You can wrap Python lists, iterators, sets, and tuples with this class to get access to all of the fluent-style APIs.

Let’s begin by importing datapad:

>>> import datapad as dp

Creating Sequences¶

Creating a sequence is as simple as instantiating the Sequence class with any iterable data type. In the example below, we wrap a range iterator using the Sequence class:

>>> seq = dp.Sequence(range(10))
>>> seq
<Sequence at 0x102983a5>

By default, Sequences are “lazily” evaluated. This means a sequence will only return data when a result is requested. To evaluate a sequence to get a result, call the collect method:

>>> seq = dp.Sequence(range(10))
>>> seq.collect()
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
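
To see what lazy evaluation means in practice, here is a small sketch that uses only Python built-ins. It is an illustration of the general idea and makes no claim about how datapad implements it; the names double and calls are made up for this example.

```python
# Plain-Python illustration of lazy evaluation (not datapad's implementation).
calls = []

def double(x):
    calls.append(x)  # record every time the function actually runs
    return x * 2

lazy = map(double, range(5))  # builds the pipeline but runs nothing yet
calls_before = list(calls)    # still empty: no element has been processed

result = list(lazy)           # consuming the iterator triggers the work
print(calls_before, result)
```

Nothing is computed until the iterator is consumed, which is the role collect plays in the examples above.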

Examining sequences¶

Slicing¶

Sometimes you might not want to evaluate an entire sequence. For example, you might only want to evaluate the first element of a sequence. You can do so by calling the first method:

>>> seq = dp.Sequence(range(10))
>>> seq.first()
0

Note that each call to first consumes one element, so repeated calls advance the Sequence iterator:

>>> seq = dp.Sequence(range(10))
>>> seq.first()
0
>>> seq.first()
1
>>> seq.first()
2
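
A plain Python iterator behaves the same way. The sketch below uses only built-ins and assumes, based on the output above, that a Sequence reads from a single underlying iterator; it is not datapad code.

```python
# A one-pass iterator hands out each element exactly once.
it = iter(range(10))

first = next(it)   # consumes 0
second = next(it)  # consumes 1
rest = list(it)    # whatever is left after the two calls above

print(first, second, rest)
```

If you need the same elements more than once, collect them into a list first and work from that list.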

If you want to examine more than just the first element, you can call the take method with an integer representing the number of items you want to evaluate from your Sequence:

>>> seq = dp.Sequence(range(10))
>>> seq.take(4).collect()
[0, 1, 2, 3]
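
A rough standard-library analogue of this kind of slicing is itertools.islice. The sketch below is not datapad code; it only shows why taking a few items from a lazy source is cheap, even when the source is infinite.

```python
from itertools import count, islice

naturals = count()                      # infinite iterator: 0, 1, 2, ...
first_four = list(islice(naturals, 4))  # pulls exactly four items
next_value = next(naturals)             # the source resumes where islice stopped

print(first_four, next_value)
```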

Counting¶

You can count the number of elements with the count method:

>>> seq = dp.Sequence(range(10))
>>> seq.count()
10

Or you can count occurrences of each distinct element in your sequence:

>>> seq = dp.Sequence(['a', 'a', 'b', 'b', 'b', 'c'])
>>> seq.count(distinct=True).collect()
[('a', 2), ('b', 3), ('c', 1)]
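
For comparison, the same two counts can be computed with the standard library. This is only a reference point for checking results, not a description of how count is implemented.

```python
from collections import Counter

items = ['a', 'a', 'b', 'b', 'b', 'c']

total = len(items)                         # number of elements
per_value = sorted(Counter(items).items()) # (element, occurrences) pairs

print(total, per_value)
```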

Manipulating sequences¶

In addition to examining the data in a Sequence object, Datapad provides a variety of methods to transform the data in your sequence.

Transforming elements¶

You can use the map() method to apply a function to every element in your sequence:

>>> seq = dp.Sequence(range(10))
>>> seq = seq.map(lambda elem: elem * 2)
>>> seq.collect()
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

By default, most methods of the Sequence class return a new sequence, enabling you to chain multiple map calls together in order to process your data in multiple steps:

>>> seq = dp.Sequence(range(3))
>>> seq.map(lambda elem: elem * 2)\
...    .map(lambda elem: (elem, elem))\
...    .collect()
[(0, 0), (2, 2), (4, 4)]
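
Chaining works because each method hands back an object with the same interface. The class below, MiniSeq, is a hypothetical stand-in written for this guide to show the pattern; it is not datapad.Sequence and omits almost everything the real class does.

```python
class MiniSeq:
    """Hypothetical, minimal fluent wrapper (not datapad.Sequence)."""

    def __init__(self, iterable):
        self._it = iter(iterable)

    def map(self, fn):
        # Return a new wrapper so that further calls can be chained.
        return MiniSeq(map(fn, self._it))

    def filter(self, predicate):
        return MiniSeq(filter(predicate, self._it))

    def collect(self):
        # Evaluate the whole pipeline and return a list.
        return list(self._it)


result = (
    MiniSeq(range(3))
    .map(lambda elem: elem * 2)
    .map(lambda elem: (elem, elem))
    .collect()
)
print(result)
```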

Filtering elements¶

You can filter unwanted items from a sequence using the filter() method. This method takes a single function that returns a boolean. All sequence elements for which this function returns True are kept, and all elements for which it returns False are discarded:

>>> seq = dp.Sequence(range(10))
>>> seq = seq.filter(lambda elem: elem > 6)
>>> seq.collect()
[7, 8, 9]

Sorting elements¶

Sort sequences using the sort() method:

>>> seq = dp.Sequence([2, 1, 5, 3])
>>> seq = seq.sort()
>>> seq.collect()
[1, 2, 3, 5]

Grouping elements¶

Group sequence elements together using the groupby() function. This function will return a sequence of tuples where the first item is the key of the group and the second item is a list of items in the group. Note: the groupby() function expects the sequence to be sorted to work properly:

>>> seq = dp.Sequence(['a', 'b', 'c', 'd', 'a', 'b', 'a', 'd'])
>>> seq.sort().groupby(key=lambda x: x).collect()
[
    ('a', ['a', 'a', 'a']),
    ('b', ['b', 'b']),
    ('c', ['c']),
    ('d', ['d', 'd']),
]
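
The sorting requirement is easiest to understand with the standard library's itertools.groupby, which only groups consecutive equal keys. The sketch below assumes datapad's groupby behaves similarly, which the note above suggests but which is not confirmed here.

```python
from itertools import groupby

items = ['a', 'b', 'c', 'd', 'a', 'b', 'a', 'd']

# Without sorting, equal elements are not adjacent, so every element
# ends up in its own group.
unsorted_groups = [(key, list(group)) for key, group in groupby(items)]

# After sorting, equal elements are adjacent and are grouped together.
sorted_groups = [(key, list(group)) for key, group in groupby(sorted(items))]

print(len(unsorted_groups))
print(sorted_groups)
```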

Discarding duplicates¶

You can find all unique values in a Sequence by calling the distinct() function:

>>> seq = dp.Sequence(['a', 'b', 'c', 'd', 'a', 'b', 'a', 'd'])
>>> seq.distinct().collect()
['a', 'b', 'c', 'd']
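
If you want to cross-check a result like this in plain Python, dict.fromkeys removes duplicates while keeping first-seen order. This is a built-in idiom, not a statement about how distinct() is implemented or which order it guarantees.

```python
items = ['a', 'b', 'c', 'd', 'a', 'b', 'a', 'd']

# Dictionary keys are unique and keep insertion order (Python 3.7+).
unique = list(dict.fromkeys(items))
print(unique)
```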

Joining sequences¶

A common operation when working with messy data is combining multiple sequences based on a matching field. In the example below, we are presented with two sequences that are correlated by an id field. One sequence contains each person’s name and the other contains their age.

To match each element in each sequence to the same id, we can use the join() function:

>>> import datapad.fields as F
>>> import datapad as dp
>>> seq = dp.Sequence([
...     {'id': 1, 'name': 'John'},
...     {'id': 2, 'name': 'Nayeon'},
...     {'id': 3, 'name': 'Reza'}
... ])
>>> other = dp.Sequence([
...     {'id': 1, 'age': 2},
...     {'id': 2, 'age': 3}
... ])
>>> seq.join(other, key=F.get('id')).collect()
[
    ({'id': 1, 'name': 'John'}, {'id': 1, 'age': 2}),
    ({'id': 2, 'name': 'Nayeon'}, {'id': 2, 'age': 3})
]

This function uses the key function F.get('id') (see datapad.fields.get() for more details) to match ids in sequence seq and other. The result is a sequence of 2-tuples (a, b) where a is an element in seq whose id matched element b in other.

Note, any non-matching elements are simply discarded. This operation is commonly known in SQL terminology as an inner join.
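
For readers who want to see the inner-join logic spelled out, here is a plain-Python sketch. It assumes each id appears at most once in the second list, and the variable names are made up for this example; it is not datapad's implementation.

```python
people = [
    {'id': 1, 'name': 'John'},
    {'id': 2, 'name': 'Nayeon'},
    {'id': 3, 'name': 'Reza'},
]
ages = [
    {'id': 1, 'age': 2},
    {'id': 2, 'age': 3},
]

# Index the second list by id so that each lookup is a dictionary access.
ages_by_id = {row['id']: row for row in ages}

# Keep only the people whose id has a match (inner join).
joined = [
    (person, ages_by_id[person['id']])
    for person in people
    if person['id'] in ages_by_id
]
print(joined)
```

The element with id 3 has no partner in the second list, so it does not appear in the result.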

If you’d like a single, combined dictionary for each match instead of a 2-tuple, you can map a merging function over the resulting sequence:

>>> import datapad.fields as F
>>> import datapad as dp
>>> seq = dp.Sequence([
...     {'id': 1, 'name': 'John'},
...     {'id': 2, 'name': 'Nayeon'},
...     {'id': 3, 'name': 'Reza'}
... ])
>>> other = dp.Sequence([
...     {'id': 1, 'age': 2},
...     {'id': 2, 'age': 3}
... ])
>>> seq.join(other, key=F.get('id'))\
...    .map(lambda d: dict(list(d[0].items()) + list(d[1].items())))\
...    .collect()
[
    {'id': 1, 'name': 'John', 'age': 2},
    {'id': 2, 'name': 'Nayeon', 'age': 3}
]
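
The merging lambda above concatenates the items of both dictionaries. On Python 3.5 and later, dictionary unpacking expresses the same merge more compactly. The sketch below compares the two on a single pair; in both forms the right-hand dictionary wins if a key appears in both.

```python
pair = ({'id': 1, 'name': 'John'}, {'id': 1, 'age': 2})

# Form used in the example above.
merged_items = dict(list(pair[0].items()) + list(pair[1].items()))

# Equivalent form using dictionary unpacking.
merged_unpack = {**pair[0], **pair[1]}

print(merged_items == merged_unpack, merged_unpack)
```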

Fields and Structured sequences¶

In nontrivial use cases, Sequences are often made up of dictionaries, lists, or other container data types. Datapad provides a set of functions in the datapad.fields module to work with these nested data types.

Combining this module along with methods like datapad.Sequence.map() gives you a flexible and powerful framework for manipulating data sequences containing dictionaries and lists.

Below you’ll find a few examples of working with sequences containing structured data. To begin, import the fields module:

>>> import datapad as dp
>>> import datapad.fields as F

Concepts¶

Structured sequences are simply Sequences that have dicts or lists as elements. Each element can be thought of as a row in a table.
Fields are individual items within each row. They can be thought of as columns in tabular data.
A field-key is used to look up a specific field-value in a given row or element of a structured sequence.
- When elements are dicts, a field-key refers to the dictionary key and a field-value refers to the corresponding dictionary value.
- When elements are lists, a field-key refers to a specific index in the list and a field-value refers to the item at that list index.
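
In plain Python, both kinds of lookup use the same indexing syntax, which is why one field-key concept covers dicts and lists. The helper field_value below is hypothetical and exists only to illustrate the terminology.

```python
def field_value(row, field_key):
    """Hypothetical helper: look up a field-value by its field-key."""
    return row[field_key]


dict_row = {'a': 1, 'b': 2}  # field-keys are dictionary keys
list_row = ['a', 1, 3]       # field-keys are list indices

print(field_value(dict_row, 'b'))
print(field_value(list_row, 0))
```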

Here’s an example of a list-based structured sequence:

>>> seq = dp.Sequence([
...     ['a', 1, 3],
...     ['b', 2, 3],
...     ['c', 3, 3]
... ])
>>> seq.first()
['a', 1, 3]

Here’s an example of a dict-based structured sequence:

>>> seq = dp.Sequence([
...     {'a': 1, 'b': 2},
...     {'a': 4, 'b': 4},
...     {'a': 5, 'b': 7}
... ])
>>> seq.first()
{'a': 1, 'b': 2}

Selecting fields¶

You can retrieve individual fields within the elements of a structured sequence using the datapad.fields.select() function, which takes a list of keys for dict-based structured sequences:

>>> seq = dp.Sequence([
...     {'a': 1, 'b': 2},
...     {'a': 4, 'b': 4},
...     {'a': 5, 'b': 7}
... ])
>>> seq.map(F.select(['a'])).collect()
[
    {'a': 1},
    {'a': 4},
    {'a': 5}
]

Or indices in the case of list-based structured sequences:

>>> seq = dp.Sequence([
...     ['a', 1, 3],
...     ['b', 2, 3],
...     ['c', 3, 3]
... ])
>>> seq.map(F.select([0, 2])).collect()
[
    ['a', 3],
    ['b', 3],
    ['c', 3]
]
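
To make the idea of a selector concrete, here is a plain-Python sketch of a function that builds one. The name select_fields is hypothetical, and the code illustrates the behavior shown above rather than datapad's actual implementation.

```python
def select_fields(keys):
    """Hypothetical sketch: build a function that keeps only the given fields."""
    def _select(row):
        if isinstance(row, dict):
            return {key: row[key] for key in keys}
        return [row[index] for index in keys]
    return _select


dict_rows = [{'a': 1, 'b': 2}, {'a': 4, 'b': 4}, {'a': 5, 'b': 7}]
list_rows = [['a', 1, 3], ['b', 2, 3], ['c', 3, 3]]

selected_dicts = list(map(select_fields(['a']), dict_rows))
selected_lists = list(map(select_fields([0, 2]), list_rows))
print(selected_dicts)
print(selected_lists)
```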

Transforming fields¶

You can apply functions to individual fields using the datapad.fields.apply() function.

The simplest way to use this function is to pass it a field key or index and a function that will transform the field value:

>>> seq = dp.Sequence([
...     {'a': 1, 'b': 2},
...     {'a': 4, 'b': 4},
...     {'a': 5, 'b': 7}
... ])
>>> seq.map(F.apply('a', lambda x: x*2))\
...    .map(F.apply('b', lambda x: x*3))\
...    .collect()
[
    {'a': 2, 'b': 6},
    {'a': 8, 'b': 12},
    {'a': 10, 'b': 21}
]
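
The pattern here is a function that returns another function, which map then calls on every row. The sketch below shows that pattern in plain Python. The name apply_to_field is hypothetical, and copying the row instead of modifying it in place is an assumption of this sketch, not a documented property of datapad.fields.apply().

```python
def apply_to_field(key, fn):
    """Hypothetical sketch: build a function that transforms one field of a row."""
    def _apply(row):
        new_row = dict(row)         # copy so the input row is left untouched
        new_row[key] = fn(row[key])
        return new_row
    return _apply


rows = [{'a': 1, 'b': 2}, {'a': 4, 'b': 4}, {'a': 5, 'b': 7}]

doubled = map(apply_to_field('a', lambda x: x * 2), rows)
result = list(map(apply_to_field('b', lambda x: x * 3), doubled))
print(result)
```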

Adding fields¶

You can add fields using the datapad.fields.add() function.

The simplest way to use this function is to pass it the field key that you want to add and a function that generates the new field value. The function that you pass in must accept the entire element and return a new value for the field. See below for an example:

>>> seq = dp.Sequence([
...     {'a': 1, 'b': 2},
...     {'a': 4, 'b': 4},
...     {'a': 5, 'b': 7}
... ])
>>> seq.map(F.add('c', lambda row: row['a'] + row['b']))\
...    .collect()
[
    {'a': 1, 'b': 2, 'c': 3},
    {'a': 4, 'b': 4, 'c': 8},
    {'a': 5, 'b': 7, 'c': 12}
]
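
The difference from transforming a field is that the function receives the whole row rather than a single value. A plain-Python sketch of that idea follows. The name add_field is hypothetical, and returning a new dictionary rather than updating the row in place is an assumption of this sketch.

```python
def add_field(key, fn):
    """Hypothetical sketch: build a function that adds a computed field to a row."""
    def _add(row):
        # fn sees the entire row, so it can combine several existing fields.
        return {**row, key: fn(row)}
    return _add


rows = [{'a': 1, 'b': 2}, {'a': 4, 'b': 4}, {'a': 5, 'b': 7}]

result = list(map(add_field('c', lambda row: row['a'] + row['b']), rows))
print(result)
```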