官术网_书友最值得收藏!

Sets

Sets are unordered collections of elements, with the additional restriction that the elements must be unique.  The main use-cases where sets are a good choice are membership tests (testing if an element is present in the collection) and, unsurprisingly, set operations such as union, difference, and intersection.

In Python, sets are implemented using a hash-based algorithm just like dictionaries; therefore, the time complexities for addition, deletion, and test for membership scale as O(1) with the size of the collection.

Sets contain only unique elements. An immediate use case of sets is the removal of duplicates from a collection, which can be accomplished by simply passing the collection through the set constructor, as follows:

    # create a list that contains duplicates
x = list(range(1000)) + list(range(500))
# the set *x_unique* will contain only
# the unique elements in x
x_unique = set(x)

The time complexity for removing duplicates is O(N), as it requires to read the input and put each element in the set.

Sets expose a number of operations like union, intersection, and difference. The union of two sets is a new set containing all the elements of both the sets; the intersection is a new set that contains only the elements in common between the two sets, and the difference is a new set containing the element of the first set that are not contained in the second set. The time complexities for these operations are shown in the following table. Note that since we have two different input sizes, we will use the letter S to indicate the size of the first set (called s), and T to indicate the size of the second set (called t):

An application of set operations are, for example, Boolean queries. Going back to the inverted index example of the previous subsection, we may want to support queries that include multiple terms. For example, we may want to search for all the documents that contain the words cat and table. This kind of a query can be efficiently computed by taking the intersection between the set of documents containing cat and the set of documents containing table.

In order to efficiently support those operations, we can change our indexing code so that each term is associated to a set of documents (rather than a list). After applying this change, calculating more advanced queries is a matter of applying the right set operation. In the following code, we show the inverted index based on sets and the query using set operations:

    # Building an index using sets
index = {}
for i, doc in enumerate(docs):
# We iterate over each term in the document
for word in doc.split():
# We build a set containing the indices
# where the term appears
if word not in index:
index[word] = {i}
else:
index[word].add(i)

# Querying the documents containing both "cat" and "table"
index['cat'].intersection(index['table'])
主站蜘蛛池模板: 晋中市| 哈尔滨市| 兴海县| 松江区| 通许县| 永福县| 墨江| 德格县| 桂阳县| 双城市| 萨迦县| 抚远县| 淅川县| 日喀则市| 岳阳县| 安宁市| 呼图壁县| 崇明县| 正镶白旗| 贞丰县| 宁国市| 湾仔区| 屯留县| 木里| 昭平县| 天门市| 泰安市| 汾西县| 普安县| 巴林右旗| 昆明市| 天津市| 南康市| 沧源| 汉中市| 安达市| 上林县| 金昌市| 隆德县| 平乡县| 玉田县|